Findings (7)
Performance
2.1GB disk spill per run indicates severe memory pressure. Shuffle partitions are undersized for the data volume, forcing expensive disk I/O.
→ Increase spark.sql.shuffle.partitions from 200 to 600. Set spark.sql.adaptive.enabled=true for AQE. Consider repartitioning source data.
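The recommended settings can be sketched as follows. This assumes an interactive session with an existing `spark` handle; the table and column names are illustrative, not from the audited workspace.

```python
# Raise shuffle parallelism and enable Adaptive Query Execution (AQE).
spark.conf.set("spark.sql.shuffle.partitions", "600")   # up from the default 200
spark.conf.set("spark.sql.adaptive.enabled", "true")    # AQE re-optimizes plans at runtime
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge tiny post-shuffle partitions

# Optionally repartition the source before wide transformations so each task
# works on a smaller, spill-free slice (table/column names are hypothetical):
# sales = spark.read.table("sales_raw").repartition(600, "sale_date")
```

With AQE enabled, 600 acts as an upper bound: Spark coalesces small shuffle partitions automatically, so overshooting the count is safer than undershooting it.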
Storage
Average 89 Delta versions per table with no VACUUM configured. SalesLakehouse is consuming ~40% more storage than necessary.
→ Schedule VACUUM <table> RETAIN 168 HOURS on all tables. Set the table property delta.deletedFileRetentionDuration = 'interval 7 days' so the retention windows stay aligned.
Cost
ML-Churn-Training allocates 28.6GB memory per executor but analysis shows 18GB peak usage. Over-provisioning wastes 37% of allocated capacity.
→ Reduce executor memory to 20GB. Set spark.executor.memoryOverhead=2g. This will free capacity for other workloads.
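Executor memory is fixed at application launch, so the change belongs in the cluster or session configuration rather than in running code. A sketch using the values from this finding:

```python
from pyspark.sql import SparkSession

# Right-sized session config: 20g heap (peak observed usage was 18g)
# plus 2g off-heap overhead, replacing the 28.6g allocation.
spark = (
    SparkSession.builder
    .appName("ML-Churn-Training")
    .config("spark.executor.memory", "20g")
    .config("spark.executor.memoryOverhead", "2g")
    .getOrCreate()
)
```

On Databricks the same values would typically be set in the cluster's Spark config rather than in the notebook itself.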
Reliability
Stream-ClickEvents has no checkpoint location set, meaning any restart reprocesses all data from the beginning.
→ Set checkpointLocation to a durable ABFS path. Combined with a Delta (or otherwise idempotent) sink, checkpointing provides exactly-once processing.
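A sketch of the fix, assuming a streaming DataFrame named `click_events`; the storage account, container, and target table names are illustrative placeholders.

```python
# Persist stream progress so restarts resume from the last committed offset
# instead of reprocessing from the beginning.
(
    click_events.writeStream
    .format("delta")  # Delta sink + checkpoint => exactly-once semantics
    .option(
        "checkpointLocation",
        "abfss://checkpoints@<storage-account>.dfs.core.windows.net/stream-clickevents",
    )
    .outputMode("append")
    .toTable("clickevents_bronze")  # hypothetical target table
)
```

Each query needs its own checkpoint directory; reusing one across queries corrupts offset tracking.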
Performance
ML-Churn-Training reads the same feature dataset 4 times during training iterations without persisting/caching.
→ Add .cache() or .persist(StorageLevel.MEMORY_AND_DISK) after initial feature engineering step.
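The caching pattern looks like the following sketch. `build_features` and `train_step` are hypothetical stand-ins for the notebook's feature-engineering and training steps.

```python
from pyspark.storagelevel import StorageLevel

# Persist the feature set once so the four training iterations read from
# cache instead of recomputing the full lineage each time.
features = build_features(raw_df).persist(StorageLevel.MEMORY_AND_DISK)
features.count()  # materialize the cache eagerly with a cheap action

for epoch in range(4):
    train_step(features)  # hypothetical iteration over the cached data

features.unpersist()  # release executor memory once training finishes
```

`MEMORY_AND_DISK` is preferable to plain `.cache()` here because partitions that do not fit in memory spill to local disk rather than being recomputed.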
Performance
23 tables have average file sizes under 32MB, causing excessive task overhead during reads.
→ Run OPTIMIZE on affected tables. Set spark.databricks.delta.autoCompact.enabled=true.
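A sketch of the remediation, assuming a `spark` handle on a Databricks cluster; `small_file_tables` is a hypothetical list of the 23 affected tables.

```python
# Prevent new small files: compact on write going forward.
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

# Fix the existing backlog: bin-pack small files into larger ones.
small_file_tables = ["saleslakehouse.orders"]  # illustrative; 23 tables in practice
for t in small_file_tables:
    spark.sql(f"OPTIMIZE {t}")
```

OPTIMIZE rewrites data files but leaves old versions in the Delta log, so it pairs naturally with the VACUUM schedule recommended above.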
Performance
spark.sql.autoBroadcastJoinThreshold is at default 10MB. Several dimension tables (15-45MB) would benefit from broadcast.
→ Increase to 50MB: spark.conf.set('spark.sql.autoBroadcastJoinThreshold', 52428800)
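Beyond raising the global threshold, individual joins can be forced to broadcast with a hint, which works regardless of the threshold. Table and column names below are illustrative.

```python
from pyspark.sql.functions import broadcast

# Global: auto-broadcast any table under 50MB (52428800 bytes).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 52428800)

# Per-join: explicitly ship the dimension table to every executor,
# avoiding a shuffle of the large fact table.
result = facts.join(broadcast(dim_product), "product_id")
```

The hint is the safer option when only a few known dimension tables qualify, since a global 50MB threshold also applies to tables whose size estimates may be stale.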
Recommendations
1. Fix disk spill in ETL-Daily-Sales by increasing shuffle partitions and enabling AQE
2. Enable VACUUM on SalesLakehouse — estimated 40% storage savings
3. Right-size ML notebook executors — 37% over-provisioned currently
4. Add checkpoint location to Stream-ClickEvents immediately
5. Run OPTIMIZE on 23 tables with small file problem
6. Add caching strategy for iterative ML training workloads
7. Increase broadcast join threshold to 50MB for dimension table joins