The Spot Instance Playbook — Cut Databricks Costs by 60–90%

If you're running Databricks on AWS, your compute bill is probably one of your biggest line items. At on-demand prices, a single high-memory cluster can run thousands of dollars a month. But here's the thing AWS doesn't want you to forget: Spot Instances exist, and they can cut the EC2 portion of that bill by 60–90% with the right architecture.

This playbook covers exactly how to do it — without losing reliability.


Why Spot Instances for Databricks?

AWS Spot Instances let you run on spare EC2 capacity at steep discounts. You no longer bid for that capacity: you pay the current spot price and can optionally set a maximum price. The trade-off? AWS can reclaim the instances with a 2-minute warning when it needs the capacity back. For stateful, single-node workloads that's a dealbreaker, but for Databricks, which runs on Spark's fault-tolerant execution model, it's a golden opportunity.

Databricks already handles node failures gracefully. When a worker disappears, Spark reschedules its tasks on the remaining workers and recomputes the lost partitions from lineage. The key is designing your clusters so that spot interruptions don't take down the entire job.

Realistic Savings Estimates

Instance Type                          | On-Demand (USD/hr) | Spot (USD/hr) | Savings
r5.2xlarge (8 vCPU, 64 GB)             | $0.504             | ~$0.151       | ~70%
r5.4xlarge (16 vCPU, 128 GB)           | $1.008             | ~$0.302       | ~70%
i3.2xlarge (8 vCPU, 61 GB, NVMe)       | $0.624             | ~$0.187       | ~70%
p3.2xlarge (8 vCPU, 61 GB, 1 V100 GPU) | $3.060             | ~$0.918       | ~70%

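The spot figures above assume a flat 70% discount for illustration. Actual savings depend on instance family, region, availability zone, and time of day. Memory- and storage-optimized families (r5, i3) commonly see 60–70% savings, while GPU instances can hit 60–80% during off-peak hours. Two caveats: the discount applies only to the EC2 charge, not to Databricks DBU fees, and the best balance of savings and reliability usually comes from mixing spot with a small, stable share of on-demand.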


Step 1: Configure Your Databricks Cluster for Spot

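Every Databricks cluster on AWS has a spot instance configuration. The critical setting is the spot bid percentage (spot_bid_price_percent): the maximum price you will accept, expressed as a percentage of the on-demand price. You are charged the current spot price, not this maximum. The setting only decides when spot instances become unavailable to you because the spot price has risen above your cap.

  • 100%: The cap equals the on-demand price. Highest availability, and you still pay only the current spot price.
  • 50–80%: Sweet spot for most workloads. Caps what you pay during price spikes while keeping interruption rates manageable.
  • <50%: Tightest cap, but expect frequent interruptions whenever the spot price rises above it. Best for fault-tolerant batch jobs.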

In the Databricks UI, navigate to Compute → Create Cluster → Advanced Options → Spot Instances. Use the API or Terraform for repeatable configurations.

Terraform Example

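resource "databricks_cluster" "etl_jobs" {
  cluster_name  = "production-etl"
  spark_version = "15.3.x-scala2.12"
  node_type_id  = "r5.2xlarge"

  # Scale between 2 and 16 workers based on load.
  autoscale {
    min_workers = 2
    max_workers = 16
  }

  aws_attributes {
    # Try spot first; launch on-demand if spot cannot be acquired.
    availability = "SPOT_WITH_FALLBACK"
    # Maximum spot price, as a percentage of the on-demand price.
    spot_bid_price_percent = 80
    zone_id                = "us-west-2a"
  }
}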

The SPOT_WITH_FALLBACK strategy tells Databricks to launch spot instances first and fall back to on-demand only when spot is unavailable. This is your safest bet for production.


Step 2: Diversify Across Instance Types and AZs

Spot capacity is tracked separately for each instance type in each availability zone, so a shortage in one pool says little about the others. The trick is to give Databricks multiple options. Instead of pinning to a single instance type, plan a diversified fleet:

  • Primary: r5.2xlarge (memory-optimized, good general purpose)
  • Secondary: r5a.2xlarge (AMD variant, often cheaper)
  • Tertiary: i3.2xlarge (NVMe storage, great for shuffle-heavy jobs)
  • Fallback: r5.4xlarge (more memory, can handle the load if needed)

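A classic cluster definition takes a single worker node_type_id, not an ordered list, so this fallback order is something you build around the cluster: define one cluster or instance pool per instance type and have your orchestrator try them in order, or use a fleet instance type where your workspace offers one. Availability zone selection, by contrast, can be delegated to Databricks: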

aws_attributes {
  availability           = "SPOT_WITH_FALLBACK"
  spot_bid_price_percent = 80
  zone_id                = "auto" # let Databricks pick an AZ with available capacity
}

Note: zone_id takes a single availability zone or the special value "auto"; it does not accept a comma-separated list of zones. With "auto", Databricks chooses the zone at launch, and all of the cluster's nodes still run in that one zone. For resilience beyond that, consider using instance pools or job clusters that retry in different AZs on failure.
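
One way to prepare the diversified fleet described above is to create a spot pool per candidate instance type and let your scheduler pick among them. The sketch below is a minimal illustration, not a definitive setup: the local spot_node_types, the pool names, the capacity numbers, and the spot_pool_ids output are all assumptions made for this example, and Terraform itself does not try the pools in order. That retry logic lives in whatever launches your jobs.

```hcl
# Hypothetical: one spot pool per instance type from the list above.
locals {
  spot_node_types = ["r5.2xlarge", "r5a.2xlarge", "i3.2xlarge", "r5.4xlarge"]
}

resource "databricks_instance_pool" "spot_by_type" {
  for_each = toset(local.spot_node_types)

  instance_pool_name                    = "spot-${each.key}"
  node_type_id                          = each.key
  min_idle_instances                    = 0  # nothing kept warm, so no idle EC2 cost
  max_capacity                          = 20 # illustrative cap per pool
  idle_instance_autotermination_minutes = 15

  aws_attributes {
    availability           = "SPOT"
    spot_bid_price_percent = 80
    zone_id                = "us-west-2a"
  }
}

# Pool IDs keyed by instance type, for a scheduler or job definition to try in order.
output "spot_pool_ids" {
  value = { for node_type, pool in databricks_instance_pool.spot_by_type : node_type => pool.id }
}
```

Keeping min_idle_instances at 0 here means the extra pools cost nothing until a job actually falls back to them.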


Step 3: Use Instance Pools for Fast Failover

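Instance pools are Databricks' answer to spot interruptions. They keep a set of idle, ready-to-use instances on hand so that when a spot instance is reclaimed, your cluster can pull a replacement from the pool within seconds instead of waiting minutes for a fresh EC2 launch. This only helps while the pool still has idle instances; once they are used up, the pool has to acquire new ones like any cluster would.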

Setting Up an Instance Pool

resource "databricks_instance_pool" "spot_pool" {
  instance_pool_name = "spot-etl-pool"
  node_type_id       = "r5.2xlarge"
  min_idle_instances = 5
  max_capacity       = 50
  aws_attributes {
    # Pools accept only "SPOT" or "ON_DEMAND"; SPOT_WITH_FALLBACK is a cluster-level option.
    availability           = "SPOT"
    spot_bid_price_percent = 80
    zone_id                = "us-west-2a"
  }
  idle_instance_autotermination_minutes = 15
}

Key settings:

  • min_idle_instances: Keeps 5 instances warm and ready. You pay the EC2 charge for them while they sit idle, but no DBUs.
  • idle_instance_autotermination_minutes: Terminates idle instances above that minimum after 15 minutes to save costs.
  • max_capacity: Hard cap on the pool's total instances (idle plus in use) to prevent runaway scaling.
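
A pool only helps once a cluster draws from it. As a minimal sketch, assuming the spot_pool resource defined above, a cluster can reference the pool through instance_pool_id instead of declaring its own node type and AWS attributes. The resource and cluster names here are made up for the example.

```hcl
resource "databricks_cluster" "etl_from_pool" {
  cluster_name  = "production-etl-pooled" # hypothetical name
  spark_version = "15.3.x-scala2.12"

  # Take nodes from the pool; node type and spot settings are inherited from it,
  # so node_type_id and aws_attributes are not set on the cluster.
  instance_pool_id = databricks_instance_pool.spot_pool.id

  autoscale {
    min_workers = 2
    max_workers = 16
  }
}
```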

Step 4: Optimize Your Workloads for Interruptions

Not all workloads handle spot equally well. Here's how to classify your jobs:

Workload Type                    | Spot Suitability | Strategy
Batch ETL (hourly/daily)         | 🟢 Excellent     | All workers on spot, with fallback. Add retry logic.
Interactive queries (SQL)        | 🟡 Good          | Mix spot workers with on-demand driver.
Delta Live Tables pipelines      | 🟢 Excellent     | DLT auto-retries from checkpoint. Spot workers, on-demand driver.
ML training (single-node)        | 🔴 Avoid         | Use on-demand or reserved instances.
Streaming (Structured Streaming) | 🟡 Good          | On-demand driver, spot workers with checkpointing.
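
For the batch ETL row, "retry logic" can be as simple as task-level retries on the job itself. The following is a sketch under assumptions: the job name, task key, notebook path, worker count, and retry numbers are hypothetical placeholders, not recommendations.

```hcl
resource "databricks_job" "nightly_etl" {
  name = "nightly-etl" # hypothetical job name

  task {
    task_key = "run_etl"

    # Rerun the task if a run fails, for example after losing too many spot workers.
    max_retries               = 3
    min_retry_interval_millis = 60000 # wait one minute between attempts
    retry_on_timeout          = true

    notebook_task {
      notebook_path = "/Jobs/nightly_etl" # hypothetical path
    }

    new_cluster {
      spark_version = "15.3.x-scala2.12"
      node_type_id  = "r5.2xlarge"
      num_workers   = 8

      aws_attributes {
        availability           = "SPOT_WITH_FALLBACK"
        first_on_demand        = 1 # first node (the driver) stays on-demand
        spot_bid_price_percent = 80
      }
    }
  }
}
```

Each retry starts a fresh job cluster, so a run that hit a bad capacity window gets another chance a minute later.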

Critical Rule: Keep the Driver On-Demand

The driver node hosts the SparkContext and holds the job's scheduling state. If the driver gets reclaimed, the entire job fails, and no amount of worker redundancy can save it. Always configure the driver as on-demand, even when workers use spot. This keeps the job alive even during a mass interruption of workers.
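
How you pin the driver depends on how the cluster gets its nodes. On a cluster that launches its own instances, setting first_on_demand = 1 in aws_attributes places the first node, the driver, on an on-demand instance. On a pool-backed cluster, one option is a separate on-demand pool for the driver, as sketched below. The driver_pool resource, its sizing, and the cluster name are assumptions for illustration; spot_pool is the pool from Step 3.

```hcl
# Hypothetical on-demand pool reserved for drivers.
resource "databricks_instance_pool" "driver_pool" {
  instance_pool_name                    = "on-demand-driver-pool"
  node_type_id                          = "r5.2xlarge"
  min_idle_instances                    = 0
  max_capacity                          = 5
  idle_instance_autotermination_minutes = 15

  aws_attributes {
    availability = "ON_DEMAND"
    zone_id      = "us-west-2a"
  }
}

resource "databricks_cluster" "etl_split_pools" {
  cluster_name  = "production-etl-split-pools" # hypothetical name
  spark_version = "15.3.x-scala2.12"

  instance_pool_id        = databricks_instance_pool.spot_pool.id   # workers: spot
  driver_instance_pool_id = databricks_instance_pool.driver_pool.id # driver: on-demand

  autoscale {
    min_workers = 2
    max_workers = 16
  }
}
```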


Step 5: Set Up Monitoring and Alerts

Databricks provides cluster-level metrics on spot interruptions. Monitor these to tune your strategy:

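  • Cluster Events: Filter by "Spot Instance Reclamation" to see interruption frequency.
  • Spark UI (Executors tab): Track executor removals during job runs.
  • EventBridge and CloudWatch: AWS emits an "EC2 Spot Instance Interruption Warning" event for each instance it is about to reclaim. Count these events to monitor interruptions across your account.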

Suggested Alert Thresholds

  • Warning: >5 interruptions per hour per cluster. Consider raising the bid % or diversifying instance types.
  • Critical: >20 interruptions per hour. Switch to SPOT_WITH_FALLBACK (if you were running spot-only) or to on-demand temporarily.
  • Job failure: Any spot-related job failure should trigger an alert and escalation.
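
As a rough sketch of the warning threshold, the Terraform below matches AWS's spot interruption warning events with an EventBridge rule and raises a CloudWatch alarm when the rule fires more than 5 times in an hour. Treat it as a starting point under stated assumptions: it uses the AWS provider rather than the Databricks one, all resource names are made up, it counts interruptions for the whole account and region rather than per cluster, and it relies on the rule's TriggeredRules metric as a proxy for the event count.

```hcl
# Assumes the hashicorp/aws provider is configured for the region the clusters run in.
resource "aws_cloudwatch_event_rule" "spot_interruptions" {
  name        = "databricks-spot-interruptions" # hypothetical name
  description = "Matches EC2 spot interruption warnings in this account and region"

  event_pattern = jsonencode({
    source        = ["aws.ec2"]
    "detail-type" = ["EC2 Spot Instance Interruption Warning"]
  })
}

# Topics for raw events and for alarms. The events topic also needs a policy that
# lets events.amazonaws.com publish to it (omitted here for brevity).
resource "aws_sns_topic" "spot_events" {
  name = "databricks-spot-events"
}

resource "aws_sns_topic" "spot_alerts" {
  name = "databricks-spot-alerts"
}

resource "aws_cloudwatch_event_target" "forward_events" {
  rule = aws_cloudwatch_event_rule.spot_interruptions.name
  arn  = aws_sns_topic.spot_events.arn
}

# Warning level: more than 5 matched interruption warnings in one hour.
resource "aws_cloudwatch_metric_alarm" "spot_warning" {
  alarm_name          = "spot-interruptions-warning"
  namespace           = "AWS/Events"
  metric_name         = "TriggeredRules"
  dimensions          = { RuleName = aws_cloudwatch_event_rule.spot_interruptions.name }
  statistic           = "Sum"
  period              = 3600
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 5
  treat_missing_data  = "notBreaching"
  alarm_actions       = [aws_sns_topic.spot_alerts.arn]
}
```

A second alarm with threshold = 20 would cover the critical level in the same way.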

Step 6: Calculate Your Savings

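Use this formula to estimate monthly EC2 savings. DBU charges are the same in both cases, so they cancel out:

Monthly Savings = (OnDemandRate - SpotRate) × NumWorkers × HoursPerDay × DaysPerMonth

OnDemandRate and SpotRate are the hourly prices per instance.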

Example: 10 × r5.2xlarge workers, 12 hrs/day, 22 days/month
On-demand: 10 × $0.504 × 12 × 22 = $1,330.56
Spot (70% off): 10 × $0.151 × 12 × 22 = $398.64
Monthly Savings: $931.92 (70%)

Subtract instance pool idle costs (~5 × $0.151 × 6 hr idle/day × 22 days = ~$99.66)
Net Monthly Savings: ~$832.26

Common Pitfalls to Avoid

  1. Putting the driver on spot. If the driver dies, the whole job dies. Driver = on-demand, always.
  2. Single instance type. If that type has a capacity crunch, your cluster stalls. Diversify.
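  3. No fallback. SPOT_WITH_FALLBACK is your safety net. Without it, a spot shortage can leave the cluster under-provisioned or unable to start.
  4. Ignoring shuffle-heavy jobs. If your job shuffles large datasets, a reclaimed worker takes its shuffle files with it, and Spark has to rerun the stages that produced them. Use i3 instances with NVMe in that case to keep shuffles fast and the exposure window short.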
  5. Not testing. Run a spot-only cluster for a week in staging first. Measure interruption rates before going to production.
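
For pitfall 4, Spark's decommissioning settings are worth a look: when a worker receives its interruption notice, Spark can try to migrate that worker's shuffle and cached blocks to other nodes before the instance disappears. The sketch below assumes a runtime based on Spark 3.1 or later; the cluster and resource names are hypothetical, and the migration is best effort within the 2-minute window, so measure it in staging rather than relying on it.

```hcl
resource "databricks_cluster" "etl_decommission" {
  cluster_name  = "production-etl-shuffle-heavy" # hypothetical name
  spark_version = "15.3.x-scala2.12"
  node_type_id  = "i3.2xlarge" # NVMe-backed, as suggested for shuffle-heavy jobs

  autoscale {
    min_workers = 2
    max_workers = 16
  }

  aws_attributes {
    availability           = "SPOT_WITH_FALLBACK"
    first_on_demand        = 1
    spot_bid_price_percent = 80
  }

  # Best-effort migration of shuffle and cached blocks off a worker that is being reclaimed.
  spark_conf = {
    "spark.decommission.enabled"                       = "true"
    "spark.storage.decommission.enabled"               = "true"
    "spark.storage.decommission.shuffleBlocks.enabled" = "true"
    "spark.storage.decommission.rddBlocks.enabled"     = "true"
  }
}
```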

Bottom Line

Spot Instances are not "risky" when architected correctly. They're a lever — and for Databricks workloads, they're the single biggest cost-saving lever you have. Start with one batch pipeline, measure the savings, and expand from there.

At DataRazi, we routinely help clients achieve 60–90% compute cost reductions by combining spot instances with intelligent cluster design, instance pools, and workload-aware scheduling.

Need help optimizing your Databricks bill? Get in touch — we'll audit your cluster configurations and build a spot strategy tailored to your workloads.