Delta Lake brings ACID transactions and schema enforcement to your data lake. But to get the best performance out of it, you need to tune a few knobs. Here's what we've found works in production.
Liquid Clustering
Liquid clustering is Databricks' next-generation data layout optimisation, replacing both Hive-style partitioning and the older Z-Ordering approach. You declare one or more clustering keys on the table, and you can change them later without rewriting the data already there:
ALTER TABLE your_table
CLUSTER BY (date, customer_id);
Unlike Z-Ordering, liquid clustering is incremental: new data is clustered on write where possible and by later OPTIMIZE runs, which rewrite only the data that still needs clustering. The table layout evolves without full rewrites.
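For a new table, the clustering keys can be declared at creation time. Here is a minimal sketch, assuming a hypothetical `events` table whose name and columns are placeholders for illustration:

```sql
-- Hypothetical table and column names, for illustration only.
CREATE TABLE events (
  event_date  DATE,
  customer_id BIGINT,
  payload     STRING
)
CLUSTER BY (event_date, customer_id);

-- Incrementally cluster data that has not been clustered yet.
OPTIMIZE events;

-- Change the clustering keys later; this only affects how data is
-- clustered from now on and does not rewrite existing files by itself.
ALTER TABLE events CLUSTER BY (customer_id);
```

Scheduling `OPTIMIZE` after large loads is a reasonable default unless your workspace already runs it for you.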
Automatic Optimization
Enable these settings for hands-off performance:
ALTER TABLE your_table SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
);
Optimized writes produce fewer, larger files during ingestion, and auto compaction merges any remaining small files after a write completes. Together they cut the metadata and file-listing overhead that comes from thousands of tiny Parquet files.
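One way to check whether these settings are having an effect is to compare the file count and total size before and after a few ingestion runs. A sketch, using the same placeholder table name as above:

```sql
-- Returns a single row; numFiles and sizeInBytes are the columns of interest.
-- Average file size is sizeInBytes / numFiles.
DESCRIBE DETAIL your_table;

-- Confirm that the auto-optimize properties are actually set on the table.
SHOW TBLPROPERTIES your_table;
```

The properties apply to new writes only, so files that were already small before you enabled them may still need a manual `OPTIMIZE your_table;`.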
Vacuum and Retention
Delta Lake keeps historical data for time travel, but this comes at a storage cost. Set retention thresholds based on your replay requirements:
ALTER TABLE your_table SET TBLPROPERTIES (
'delta.logRetentionDuration' = 'interval 7 days',
'delta.deletedFileRetentionDuration' = 'interval 7 days'
);
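Then run VACUUM regularly to reclaim storage. VACUUM deletes data files that the current table version no longer references and that are older than the retention threshold, so time travel to versions that depend on those files stops working.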
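A cautious routine is to preview the deletion first and then vacuum with an explicit threshold. A sketch, assuming the 7-day retention configured above (168 hours):

```sql
-- List the files that would be deleted, without deleting anything.
VACUUM your_table DRY RUN;

-- Delete unreferenced files older than 7 days (168 hours).
VACUUM your_table RETAIN 168 HOURS;

-- Inspect the versions still recorded in the transaction log.
DESCRIBE HISTORY your_table;
```

If `RETAIN` is omitted, VACUUM falls back to the table's `delta.deletedFileRetentionDuration` setting.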
Partitioning Strategy
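Choose partition columns based on query patterns. High-cardinality columns like customer_id lead to too many small partitions, so stick to date-based partitioning when you partition at all. Note that liquid clustering cannot be combined with partitioning or Z-Ordering on the same table. Either cluster on your frequent filter columns (the date column included) and skip partitions entirely, or partition by date and Z-Order within partitions on the other filter columns.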
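For tables that do stay partitioned, the secondary layout comes from Z-Ordering within partitions, because a single table uses either partitioning or liquid clustering, not both. A sketch, assuming a hypothetical `orders` table and a placeholder cutoff date:

```sql
-- Hypothetical table; partition on a low-cardinality date column.
CREATE TABLE orders (
  order_date  DATE,
  customer_id BIGINT,
  amount      DECIMAL(18, 2)
)
USING DELTA
PARTITIONED BY (order_date);

-- Co-locate rows by customer_id, restricted to recent partitions.
-- The WHERE clause may only reference partition columns; the date is a placeholder.
OPTIMIZE orders
WHERE order_date >= '2024-01-01'
ZORDER BY (customer_id);
```

Restricting `OPTIMIZE` to recent partitions keeps each run cheap, since older partitions rarely receive new files.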