⚡ Spark Optimization Assessment Report
Client...
March 29, 2026 11:18 AM
gaston@thepowermates.com
Overall Health Score: 61 (Grade: D)
Findings by severity: 1 Critical, 3 High, 2 Medium, 1 Low

Notebooks analyzed: 3
Lakehouses analyzed: 2
Total tables: 67
Total lakehouse size: 191 GB
Average job duration: 44 min
Optimization opportunities: 9

Findings (7)

Excessive disk spill in ETL-Daily-Sales
CRITICAL
Performance

2.1 GB of disk spill per run indicates severe memory pressure: shuffle partitions are undersized for the data volume, forcing expensive disk I/O.

→ Increase spark.sql.shuffle.partitions from 200 to 600. Set spark.sql.adaptive.enabled=true for AQE. Consider repartitioning source data.
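A minimal sketch of the suggested session tuning, assuming an existing SparkSession named `spark`; the table name and partition column are illustrative placeholders, and 600 is the report's suggestion rather than a measured optimum:

```python
# Raise shuffle parallelism so each task handles a smaller, memory-friendly slice.
spark.conf.set("spark.sql.shuffle.partitions", "600")   # up from the default 200
# Let Adaptive Query Execution coalesce or split shuffle partitions at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Optionally repartition the source before wide transformations.
sales = spark.read.table("SalesLakehouse.daily_sales")  # hypothetical table name
sales = sales.repartition(600, "sale_date")             # hypothetical partition key
```

With AQE enabled, the 600 acts as an upper bound on initial shuffle parallelism; AQE will coalesce small partitions on its own.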
Delta table bloat — VACUUM not enabled on SalesLakehouse
HIGH
Storage

Tables average 89 Delta versions each with no VACUUM configured; SalesLakehouse is consuming ~40% more storage than necessary.

→ Enable scheduled VACUUM RETAIN 168 HOURS on all tables. Set delta.deletedFileRetentionDuration = 'interval 7 days'.
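A sketch of what a scheduled retention job could look like, assuming a SparkSession `spark` with Delta support; it iterates every table in the lakehouse rather than a hand-maintained list:

```python
# Apply a 7-day file-retention policy and reclaim stale files
# for every table in SalesLakehouse.
for table in spark.catalog.listTables("SalesLakehouse"):
    full_name = f"SalesLakehouse.{table.name}"
    spark.sql(
        f"ALTER TABLE {full_name} SET TBLPROPERTIES "
        "(delta.deletedFileRetentionDuration = 'interval 7 days')"
    )
    # 168 hours = 7 days; files older than this are physically deleted.
    spark.sql(f"VACUUM {full_name} RETAIN 168 HOURS")
```

Note that VACUUM limits time travel: after it runs, versions older than the retention window can no longer be queried.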
ML notebook using oversized executor config
HIGH
Cost

ML-Churn-Training allocates 28.6GB memory per executor but analysis shows 18GB peak usage. Over-provisioning wastes 37% of allocated capacity.

→ Reduce executor memory to 20GB. Set spark.executor.memoryOverhead=2g. This will free capacity for other workloads.
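A sketch of the right-sized settings applied at session build time, using the values from the analysis above. On Fabric these are often set at the environment or Spark pool level instead, so treat this as one possible placement:

```python
from pyspark.sql import SparkSession

# Right-sized executor config for ML-Churn-Training:
# observed peak was ~18 GB, so 20 GB leaves modest headroom
# while freeing ~8.6 GB per executor for other workloads.
spark = (
    SparkSession.builder
    .appName("ML-Churn-Training")
    .config("spark.executor.memory", "20g")           # down from 28.6g
    .config("spark.executor.memoryOverhead", "2g")    # off-heap / JVM overhead
    .getOrCreate()
)
```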
No checkpoint configured for streaming notebook
HIGH
Reliability

Stream-ClickEvents has no checkpoint location set, meaning any restart reprocesses all data from the beginning.

→ Set checkpointLocation to a reliable ABFS path. Checkpointing combined with an idempotent or transactional sink (such as Delta) provides exactly-once semantics.
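A sketch of the fix for Stream-ClickEvents, assuming a SparkSession `spark`; the source table, sink table, and ABFS path are placeholders to be replaced with the workspace's actual values:

```python
# Checkpointed streaming write: on restart, the query resumes from the
# offsets recorded in the checkpoint instead of reprocessing everything.
query = (
    spark.readStream.table("click_events_raw")   # hypothetical streaming source
    .writeStream
    .format("delta")
    .option(
        "checkpointLocation",
        "abfss://<container>@<account>.dfs.core.windows.net/checkpoints/click-events",
    )
    .outputMode("append")
    .toTable("SalesLakehouse.click_events")      # hypothetical sink table
)
```

The checkpoint path must be stable across restarts and unique per query; pointing two queries at the same checkpoint corrupts both.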
No caching strategy for iterative ML workloads
MEDIUM
Performance

ML-Churn-Training reads the same feature dataset 4 times during training iterations without persisting/caching.

→ Add .cache() or .persist(StorageLevel.MEMORY_AND_DISK) after initial feature engineering step.
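A sketch of persisting the feature dataset once before the training loop, assuming a SparkSession `spark`; `build_features` and `train_iteration` are hypothetical stand-ins for the notebook's own steps:

```python
from pyspark import StorageLevel

# Persist after feature engineering so the 4 training passes
# read from cache instead of recomputing from source each time.
features = build_features(spark.read.table("churn_features_raw"))  # hypothetical
features = features.persist(StorageLevel.MEMORY_AND_DISK)
features.count()  # action that materializes the cache up front

for i in range(4):
    train_iteration(features, i)  # hypothetical; now cache-backed

features.unpersist()  # release memory when training is done
```

MEMORY_AND_DISK is the safer choice here: partitions that don't fit in memory spill to local disk instead of being recomputed.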
Small file problem in SalesLakehouse
MEDIUM
Performance

23 tables have average file sizes under 32MB, causing excessive task overhead during reads.

→ Run OPTIMIZE on affected tables. Set spark.databricks.delta.autoCompact.enabled=true.
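A sketch of the compaction pass, assuming a SparkSession `spark` on a runtime that supports Delta OPTIMIZE; the table list is an illustrative subset, not the actual 23 affected tables:

```python
# Compact future writes automatically.
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# Bin-pack existing small files on the affected tables.
small_file_tables = ["SalesLakehouse.orders", "SalesLakehouse.returns"]  # hypothetical
for name in small_file_tables:
    spark.sql(f"OPTIMIZE {name}")
```

OPTIMIZE rewrites data files but not history, so a follow-up VACUUM (see the finding above) is what actually reclaims the space from the replaced small files.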
Broadcast join threshold too low
LOW
Performance

spark.sql.autoBroadcastJoinThreshold is at default 10MB. Several dimension tables (15-45MB) would benefit from broadcast.

→ Increase to 50MB: spark.conf.set('spark.sql.autoBroadcastJoinThreshold', 52428800)
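The threshold takes a raw byte count, which is where the 52428800 comes from. A small sketch making that conversion explicit, with the Spark call shown against an assumed in-scope session `spark`:

```python
def mb_to_bytes(mb: int) -> int:
    """Convert megabytes to the byte count Spark config expects."""
    return mb * 1024 * 1024

broadcast_threshold = mb_to_bytes(50)  # 52428800

# Applied to an existing SparkSession:
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", broadcast_threshold)
```

With the threshold at 50 MB, the 15-45 MB dimension tables qualify for broadcast hash joins, avoiding a shuffle on the fact-table side.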

Recommendations

1. Fix disk spill in ETL-Daily-Sales by increasing shuffle partitions and enabling AQE
2. Enable VACUUM on SalesLakehouse — estimated 40% storage savings
3. Right-size ML notebook executors — 37% over-provisioned currently
4. Add checkpoint location to Stream-ClickEvents immediately
5. Run OPTIMIZE on 23 tables with small file problem
6. Add caching strategy for iterative ML training workloads
7. Increase broadcast join threshold to 50MB for dimension table joins