If you're running Databricks on AWS, your compute bill is probably one of your biggest line items. At on-demand prices, a single high-memory cluster can run thousands of dollars a month. But here's the thing AWS doesn't want you to forget: Spot Instances exist, and they can cut your compute costs by 60–90% with the right architecture.
This playbook covers exactly how to do it — without losing reliability.
Why Spot Instances for Databricks?
AWS Spot Instances let you run on spare EC2 capacity at steep discounts. The trade-off? AWS can reclaim them with a 2-minute warning when it needs the capacity elsewhere. For traditional workloads that's a dealbreaker, but for Databricks, which was built for resilience, it's a golden opportunity.
Databricks already handles node failures gracefully. When a worker disappears, Spark uses lineage to recompute the lost partitions on the remaining nodes. The key is designing your clusters so that spot interruptions don't take down the entire job.
Realistic Savings Estimates
| Instance Type | On-Demand (USD/hr) | Spot (USD/hr) | Savings |
|---|---|---|---|
| r5.2xlarge (8 vCPU, 64 GB) | $0.504 | ~$0.151 | ~70% |
| r5.4xlarge (16 vCPU, 128 GB) | $1.008 | ~$0.302 | ~70% |
| i3.2xlarge (8 vCPU, 61 GB, NVMe) | $0.624 | ~$0.187 | ~70% |
| p3.2xlarge (8 vCPU, 61 GB, 1 V100 GPU with 16 GB) | $3.060 | ~$0.918 | ~70% |
Actual savings depend on instance family, region, availability zone, and time of day. Deep-discount instance families (r5, i3) commonly see 60–70% savings, while GPU instances can hit 60–80% during off-peak hours. Pure spot gives the deepest discount; mixing spot with a stable share of on-demand trades a little of that discount for reliability.
Step 1: Configure Your Databricks Cluster for Spot
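Every Databricks cluster on AWS has a spot instance configuration. The critical setting is the spot bid percentage: the maximum spot price you will accept, expressed as a percentage of the on-demand price. You are billed the current spot price, not the maximum, so this setting mainly controls how much price movement you tolerate before losing instances:
- 100%: Caps spend at the on-demand price. Highest availability; savings track the market spot price.
- 50–80%: Sweet spot for most workloads. Significant savings with manageable interruption rates.
- <50%: Tightest cost cap, but expect frequent interruptions whenever the spot price rises above it. Best for fault-tolerant batch jobs.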
In the Databricks UI, navigate to Compute → Create Cluster → Advanced Options → Spot Instances. Use the API or Terraform for repeatable configurations.
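To make the bid percentage concrete, here is a minimal Python sketch of the arithmetic, assuming the percentage acts as a cap on the hourly spot price. The helper names `max_spot_price` and `is_within_cap` are hypothetical, and the rates are the r5.2xlarge figures from the table above.

```python
def max_spot_price(on_demand_hourly: float, bid_percent: int) -> float:
    """Return the highest hourly spot price accepted for a given bid percentage."""
    if bid_percent <= 0:
        raise ValueError("bid_percent must be positive")
    return on_demand_hourly * bid_percent / 100


def is_within_cap(current_spot_hourly: float, on_demand_hourly: float,
                  bid_percent: int) -> bool:
    """True when the current spot price does not exceed the configured cap."""
    return current_spot_hourly <= max_spot_price(on_demand_hourly, bid_percent)


if __name__ == "__main__":
    on_demand = 0.504  # r5.2xlarge on-demand rate from the table above
    spot_now = 0.151   # approximate spot rate from the same table
    for pct in (100, 80, 40):
        cap = max_spot_price(on_demand, pct)
        ok = is_within_cap(spot_now, on_demand, pct)
        print(f"{pct}% bid: cap ${cap:.4f}/hr, spot at ${spot_now} accepted: {ok}")
```

A lower percentage does not make each hour cheaper on its own; it only narrows the range of spot prices at which instances are requested and kept.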
Terraform Example
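    resource "databricks_cluster" "etl_jobs" {
      cluster_name  = "production-etl"
      spark_version = "15.3.x-scala2.12"
      node_type_id  = "r5.2xlarge"

      autoscale {
        min_workers = 2
        max_workers = 16
      }

      aws_attributes {
        # Try spot first; use on-demand when spot capacity is unavailable
        availability           = "SPOT_WITH_FALLBACK"
        # Maximum spot price as a percentage of the on-demand price
        spot_bid_price_percent = 80
        # Keep the first node (the driver) on on-demand
        first_on_demand        = 1
        zone_id                = "us-west-2a"
      }
    }

The SPOT_WITH_FALLBACK strategy tells Databricks to launch spot instances first and fall back to on-demand only when spot is unavailable. Setting first_on_demand to 1 keeps the driver on an on-demand instance. This is your safest bet for production.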
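If you use the API rather than Terraform, the same settings can be expressed as a request body. The sketch below only builds and prints the JSON; it assumes the Clusters API create endpoint accepts these field names, and `build_cluster_payload` is a hypothetical helper, not part of any SDK.

```python
import json


def build_cluster_payload(name: str, node_type: str, min_workers: int,
                          max_workers: int, bid_percent: int = 80,
                          zone: str = "us-west-2a") -> dict:
    """Build a cluster-create request body mirroring the Terraform example."""
    if min_workers > max_workers:
        raise ValueError("min_workers cannot exceed max_workers")
    return {
        "cluster_name": name,
        "spark_version": "15.3.x-scala2.12",
        "node_type_id": node_type,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "aws_attributes": {
            # Spot first, on-demand only when spot is unavailable
            "availability": "SPOT_WITH_FALLBACK",
            "spot_bid_price_percent": bid_percent,
            "zone_id": zone,
        },
    }


if __name__ == "__main__":
    payload = build_cluster_payload("production-etl", "r5.2xlarge", 2, 16)
    # Print the body you would send; no network call is made here
    print(json.dumps(payload, indent=2))
```

Keeping the payload in a function makes it easy to validate settings in CI before anything reaches the workspace.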
Step 2: Diversify Across Instance Types and AZs
AWS has abundant spot capacity across many instance types and availability zones. The trick is to give Databricks multiple options. Instead of pinning to a single instance type, use a hybrid fleet:
- Primary: r5.2xlarge (memory-optimized, good general purpose)
- Secondary: r5a.2xlarge (AMD variant, often cheaper)
- Tertiary: i3.2xlarge (NVMe storage, great for shuffle-heavy jobs)
- Fallback: r5.4xlarge (more memory, can handle the load if needed)
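In Databricks, a classic cluster definition takes a single node_type_id, so one way to work through this list is to define an instance pool or job cluster per instance type and try them in order until one launches. Depending on your workspace, Databricks may also offer fleet instance types that draw from several instance types automatically; check the current documentation before relying on them.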
    aws_attributes {
      availability           = "SPOT_WITH_FALLBACK"
      spot_bid_price_percent = 80
      # "auto" lets Databricks choose the availability zone
      zone_id                = "auto"
    }

Note: Databricks does not accept a comma-separated list of zones in zone_id. Setting zone_id to "auto" lets Databricks pick the zone for you. For explicit multi-AZ resilience, consider using instance pools or job clusters that retry in different AZs on failure.
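The retry-in-order idea can be sketched as a small loop over (instance type, zone) candidates. This is an illustration under stated assumptions: `CapacityError` and the `launch` callable are hypothetical stand-ins for whatever your orchestration layer uses to create a cluster and report a capacity failure.

```python
from typing import Callable, Iterable, Tuple

Candidate = Tuple[str, str]  # (node_type_id, zone_id)


class CapacityError(Exception):
    """Hypothetical error raised when a launch attempt finds no capacity."""


def launch_with_fallback(candidates: Iterable[Candidate],
                         launch: Callable[[str, str], str]) -> Tuple[Candidate, str]:
    """Try each candidate in priority order and return the first that launches."""
    failures = []
    for node_type, zone in candidates:
        try:
            cluster_id = launch(node_type, zone)
            return (node_type, zone), cluster_id
        except CapacityError as err:
            # Record the failure and move on to the next candidate
            failures.append(f"{node_type}@{zone}: {err}")
    raise CapacityError("all candidates failed: " + "; ".join(failures))


if __name__ == "__main__":
    def fake_launch(node_type: str, zone: str) -> str:
        # Stand-in for a real launch call; pretends r5.2xlarge has no capacity
        if node_type == "r5.2xlarge":
            raise CapacityError("no spot capacity")
        return f"cluster-{node_type}-{zone}"

    order = [
        ("r5.2xlarge", "us-west-2a"),
        ("r5a.2xlarge", "us-west-2a"),
        ("i3.2xlarge", "us-west-2b"),
    ]
    print(launch_with_fallback(order, fake_launch))
```

Only capacity errors trigger a fallback here; any other exception propagates, so configuration mistakes are not silently retried.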
Step 3: Use Instance Pools for Fast Failover
Instance pools are Databricks' answer to spot interruptions. They maintain a warm pool of pre-initialized instances so that when a spot instance is reclaimed, your cluster can swap in a replacement within seconds instead of minutes.
Setting Up an Instance Pool
    resource "databricks_instance_pool" "spot_pool" {
      instance_pool_name = "spot-etl-pool"
      node_type_id       = "r5.2xlarge"
      min_idle_instances = 5
      max_capacity       = 50

      aws_attributes {
        # Instance pools accept only SPOT or ON_DEMAND here
        availability           = "SPOT"
        spot_bid_price_percent = 80
        zone_id                = "us-west-2a"
      }

      idle_instance_autotermination_minutes = 15
    }

Key settings:
- min_idle_instances: Keeps 5 instances warm and ready. Idle pool instances incur AWS charges but no Databricks DBU charges.
- idle_instance_autotermination_minutes: Terminates idle instances beyond the minimum after 15 minutes to save costs.
- max_capacity: Hard cap to prevent runaway scaling.
Step 4: Optimize Your Workloads for Interruptions
Not all workloads handle spot equally well. Here's how to classify your jobs:
| Workload Type | Spot Suitability | Strategy |
|---|---|---|
| Batch ETL (hourly/daily) | 🟢 Excellent | 100% spot with fallback. Add retry logic. |
| Interactive queries (SQL) | 🟡 Good | Mix spot workers with on-demand driver. |
| Delta Live Tables pipelines | 🟢 Excellent | DLT auto-retries from checkpoint. Pure spot. |
| ML training (single-node) | 🔴 Avoid | Use on-demand or reserved instances. |
| Streaming (Structured Streaming) | 🟡 Good | On-demand driver, spot workers with checkpointing. |
Critical Rule: Keep the Driver On-Demand
The driver node coordinates the SparkContext and holds the job state. If the driver gets reclaimed, the entire job fails. Always configure the driver as on-demand, even when workers use spot. This ensures job continuity even during mass interruptions.
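One way to encode the table and the driver rule together is a small mapping from workload type to aws_attributes. This is a sketch under assumptions: the workload labels and the `aws_attributes_for` helper are hypothetical, and it relies on the first_on_demand cluster setting to keep the first node, the driver, on on-demand. Delta Live Tables pipelines are left out because they are configured separately from classic clusters.

```python
SPOT_FRIENDLY = {"batch_etl", "interactive_sql", "streaming"}
ON_DEMAND_ONLY = {"ml_single_node"}


def aws_attributes_for(workload: str, bid_percent: int = 80) -> dict:
    """Return cluster aws_attributes for a workload label (labels are illustrative)."""
    if workload in ON_DEMAND_ONLY:
        # Single-node training has no other workers to absorb an interruption
        return {"availability": "ON_DEMAND"}
    if workload in SPOT_FRIENDLY:
        return {
            "availability": "SPOT_WITH_FALLBACK",
            "spot_bid_price_percent": bid_percent,
            # The first node placed is the driver, so 1 keeps it on-demand
            "first_on_demand": 1,
        }
    raise ValueError(f"unknown workload type: {workload!r}")


if __name__ == "__main__":
    for name in ("batch_etl", "streaming", "ml_single_node"):
        print(name, aws_attributes_for(name))
```

Raising on unknown labels is deliberate: a new workload type should get an explicit spot decision rather than inherit a default.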
Step 5: Set Up Monitoring and Alerts
Databricks provides cluster-level metrics on spot interruptions. Monitor these to tune your strategy:
- Cluster Events — Filter by "Spot Instance Reclamation" to see interruption frequency.
- Spark UI — Executors — Track executor removals during job runs.
- CloudWatch Metrics — Monitor SpotInstanceInterruptions across your account.
Suggested Alert Thresholds
- Warning: >5 interruptions per hour per cluster. Consider raising the bid % or diversifying instance types.
- Critical: >20 interruptions per hour. Switch to SPOT_WITH_FALLBACK or on-demand temporarily.
- Job failure: Any spot-related job failure should trigger an alert and escalation.
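The thresholds above can be applied with a rolling one-hour count. The sketch below assumes you already collect interruption timestamps per cluster from whichever source you monitor; the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable

WARNING_PER_HOUR = 5    # more than this triggers a warning
CRITICAL_PER_HOUR = 20  # more than this triggers a critical alert


def interruptions_last_hour(event_times: Iterable[datetime], now: datetime) -> int:
    """Count interruption events in the hour ending at `now`."""
    window_start = now - timedelta(hours=1)
    return sum(1 for t in event_times if window_start < t <= now)


def alert_level(count: int) -> str:
    """Map an hourly interruption count to ok, warning, or critical."""
    if count > CRITICAL_PER_HOUR:
        return "critical"
    if count > WARNING_PER_HOUR:
        return "warning"
    return "ok"


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    # Illustrative data: one interruption every 10 minutes going back 70 minutes
    events = [now - timedelta(minutes=m) for m in range(0, 80, 10)]
    count = interruptions_last_hour(events, now)
    print(count, alert_level(count))
```

Both comparisons are strict, matching the ">5" and ">20" wording of the thresholds.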
Step 6: Calculate Your Savings
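Use this formula to estimate monthly EC2 savings, where the rates are hourly prices per instance. Databricks DBU charges are the same on spot and on-demand, so they are left out:
Monthly Savings = (OnDemandRate - SpotRate) × NumWorkers × HoursPerDay × DaysPerMonth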
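The formula above can be sketched as a small Python calculator. The function names and the optional idle-pool parameters are assumptions added for illustration; the sample rates are the r5.2xlarge figures from the savings table.

```python
def monthly_cost(hourly_rate: float, instances: int,
                 hours_per_day: float, days_per_month: int) -> float:
    """EC2 cost for a group of identical instances over a month."""
    return hourly_rate * instances * hours_per_day * days_per_month


def monthly_savings(on_demand_rate: float, spot_rate: float, workers: int,
                    hours_per_day: float, days_per_month: int,
                    idle_instances: int = 0, idle_hours_per_day: float = 0.0) -> dict:
    """Gross and net monthly savings, optionally subtracting idle pool cost."""
    on_demand = monthly_cost(on_demand_rate, workers, hours_per_day, days_per_month)
    spot = monthly_cost(spot_rate, workers, hours_per_day, days_per_month)
    # Idle pool instances are billed at the spot rate while they wait
    idle = monthly_cost(spot_rate, idle_instances, idle_hours_per_day, days_per_month)
    gross = on_demand - spot
    return {
        "on_demand": round(on_demand, 2),
        "spot": round(spot, 2),
        "gross_savings": round(gross, 2),
        "idle_cost": round(idle, 2),
        "net_savings": round(gross - idle, 2),
    }


if __name__ == "__main__":
    result = monthly_savings(0.504, 0.151, workers=10, hours_per_day=12,
                             days_per_month=22, idle_instances=5,
                             idle_hours_per_day=6)
    for key, value in result.items():
        print(f"{key}: ${value:,.2f}")
```

Separating gross from net savings keeps the cost of warm pool capacity visible instead of folding it into a single number.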
Example: 10 × r5.2xlarge workers, 12 hrs/day, 22 days/month
On-demand: 10 × $0.504 × 12 × 22 = $1,330.56
Spot (70% off): 10 × $0.151 × 12 × 22 = $398.64
Monthly Savings: $931.92 (70%)
Add instance pool idle costs (~5 × $0.151 × 6hr idle × 22 = ~$99.66)
Net Monthly Savings: ~$832.26
Common Pitfalls to Avoid
- Putting the driver on spot. If the driver dies, the whole job dies. Driver = on-demand, always.
- Single instance type. If that type has a capacity crunch, your cluster stalls. Diversify.
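- No fallback. SPOT_WITH_FALLBACK is your safety net. Without it, jobs stall or fail when spot capacity dries up.
- Ignoring shuffle-heavy jobs. If your job shuffles large datasets, an interruption mid-shuffle wipes out that node's shuffle files and forces the upstream stages to recompute them. Use i3 instances with NVMe in that case so shuffles finish faster and the exposure window shrinks.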
- Not testing. Run a spot-only cluster for a week in staging first. Measure interruption rates before going to production.
Bottom Line
Spot Instances are not "risky" when architected correctly. They're a lever — and for Databricks workloads, they're the single biggest cost-saving lever you have. Start with one batch pipeline, measure the savings, and expand from there.
At DataRazi, we routinely help clients achieve 60–90% compute cost reductions by combining spot instances with intelligent cluster design, instance pools, and workload-aware scheduling.
Need help optimizing your Databricks bill? Get in touch — we'll audit your cluster configurations and build a spot strategy tailored to your workloads.