⚡ Spark Optimization Assessment Report
Client...
March 29, 2026 11:18 AM
gaston@thepowermates.com
Overall Health Score: 61 (Grade: D)
Findings by severity: 1 Critical, 3 High, 2 Medium, 1 Low

Notebooks analyzed: 3
Lakehouses analyzed: 2
Total tables: 67
Total lakehouse size: 191 GB
Average job duration: 44 min
Optimization opportunities: 9

Findings (7)

Excessive disk spill in ETL-Daily-Sales
CRITICAL
Performance

2.1 GB of disk spill per run indicates severe memory pressure: shuffle partitions are undersized for the data volume, forcing expensive disk I/O.

→ Increase spark.sql.shuffle.partitions from 200 to 600. Set spark.sql.adaptive.enabled=true for AQE. Consider repartitioning source data.
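A minimal sketch of the suggested session tuning, assuming an existing SparkSession named `spark`; the table name and partition column are illustrative placeholders, and 600 is the report's suggestion rather than a measured optimum:

```python
# Raise shuffle parallelism so each task handles a smaller, memory-friendly slice.
spark.conf.set("spark.sql.shuffle.partitions", "600")   # up from the default 200
# Let Adaptive Query Execution coalesce or split shuffle partitions at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Optionally repartition the source before wide transformations.
sales = spark.read.table("SalesLakehouse.daily_sales")  # hypothetical table name
sales = sales.repartition(600, "sale_date")             # hypothetical partition key
```

With AQE enabled, the 600 acts as an upper bound on initial shuffle parallelism; AQE will coalesce small partitions on its own.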
Delta table bloat — VACUUM not enabled on SalesLakehouse
HIGH
Storage

Tables average 89 Delta versions each with no VACUUM configured; SalesLakehouse is consuming ~40% more storage than necessary.

→ Enable scheduled VACUUM RETAIN 168 HOURS on all tables. Set delta.deletedFileRetentionDuration = 'interval 7 days'.
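A sketch of what a scheduled retention job could look like, assuming a SparkSession `spark` with Delta support; it iterates every table in the lakehouse rather than a hand-maintained list:

```python
# Apply a 7-day file-retention policy and reclaim stale files
# for every table in SalesLakehouse.
for table in spark.catalog.listTables("SalesLakehouse"):
    full_name = f"SalesLakehouse.{table.name}"
    spark.sql(
        f"ALTER TABLE {full_name} SET TBLPROPERTIES "
        "(delta.deletedFileRetentionDuration = 'interval 7 days')"
    )
    # 168 hours = 7 days; files older than this are physically deleted.
    spark.sql(f"VACUUM {full_name} RETAIN 168 HOURS")
```

Note that VACUUM limits time travel: after it runs, versions older than the retention window can no longer be queried.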
ML notebook using oversized executor config
HIGH
Cost

ML-Churn-Training allocates 28.6GB memory per executor but analysis shows 18GB peak usage. Over-provisioning wastes 37% of allocated capacity.

→ Reduce executor memory to 20GB. Set spark.executor.memoryOverhead=2g. This will free capacity for other workloads.
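A sketch of the right-sized settings applied at session build time, using the values from the analysis above. On Fabric these are often set at the environment or Spark pool level instead, so treat this as one possible placement:

```python
from pyspark.sql import SparkSession

# Right-sized executor config for ML-Churn-Training:
# observed peak was ~18 GB, so 20 GB leaves modest headroom
# while freeing ~8.6 GB per executor for other workloads.
spark = (
    SparkSession.builder
    .appName("ML-Churn-Training")
    .config("spark.executor.memory", "20g")           # down from 28.6g
    .config("spark.executor.memoryOverhead", "2g")    # off-heap / JVM overhead
    .getOrCreate()
)
```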
No checkpoint configured for streaming notebook
HIGH
Reliability

Stream-ClickEvents has no checkpoint location set, meaning any restart reprocesses all data from the beginning.

→ Set checkpointLocation to a reliable ABFS path. Checkpointing combined with an idempotent or transactional sink (such as Delta) provides exactly-once semantics.
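A sketch of the fix for Stream-ClickEvents, assuming a SparkSession `spark`; the source table, sink table, and ABFS path are placeholders to be replaced with the workspace's actual values:

```python
# Checkpointed streaming write: on restart, the query resumes from the
# offsets recorded in the checkpoint instead of reprocessing everything.
query = (
    spark.readStream.table("click_events_raw")   # hypothetical streaming source
    .writeStream
    .format("delta")
    .option(
        "checkpointLocation",
        "abfss://<container>@<account>.dfs.core.windows.net/checkpoints/click-events",
    )
    .outputMode("append")
    .toTable("SalesLakehouse.click_events")      # hypothetical sink table
)
```

The checkpoint path must be stable across restarts and unique per query; pointing two queries at the same checkpoint corrupts both.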
No caching strategy for iterative ML workloads
MEDIUM
Performance

ML-Churn-Training reads the same feature dataset 4 times during training iterations without persisting/caching.

→ Add .cache() or .persist(StorageLevel.MEMORY_AND_DISK) after initial feature engineering step.
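A sketch of persisting the feature dataset once before the training loop, assuming a SparkSession `spark`; `build_features` and `train_iteration` are hypothetical stand-ins for the notebook's own steps:

```python
from pyspark import StorageLevel

# Persist after feature engineering so the 4 training passes
# read from cache instead of recomputing from source each time.
features = build_features(spark.read.table("churn_features_raw"))  # hypothetical
features = features.persist(StorageLevel.MEMORY_AND_DISK)
features.count()  # action that materializes the cache up front

for i in range(4):
    train_iteration(features, i)  # hypothetical; now cache-backed

features.unpersist()  # release memory when training is done
```

MEMORY_AND_DISK is the safer choice here: partitions that don't fit in memory spill to local disk instead of being recomputed.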
Small file problem in SalesLakehouse
MEDIUM
Performance

23 tables have average file sizes under 32MB, causing excessive task overhead during reads.

→ Run OPTIMIZE on affected tables. Set spark.databricks.delta.autoCompact.enabled=true.
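A sketch of the compaction pass, assuming a SparkSession `spark` on a runtime that supports Delta OPTIMIZE; the table list is an illustrative subset, not the actual 23 affected tables:

```python
# Compact future writes automatically.
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# Bin-pack existing small files on the affected tables.
small_file_tables = ["SalesLakehouse.orders", "SalesLakehouse.returns"]  # hypothetical
for name in small_file_tables:
    spark.sql(f"OPTIMIZE {name}")
```

OPTIMIZE rewrites data files but not history, so a follow-up VACUUM (see the finding above) is what actually reclaims the space from the replaced small files.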
Broadcast join threshold too low
LOW
Performance

spark.sql.autoBroadcastJoinThreshold is at default 10MB. Several dimension tables (15-45MB) would benefit from broadcast.

→ Increase to 50MB: spark.conf.set('spark.sql.autoBroadcastJoinThreshold', 52428800)
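The threshold takes a raw byte count, which is where the 52428800 comes from. A small sketch making that conversion explicit, with the Spark call shown against an assumed in-scope session `spark`:

```python
def mb_to_bytes(mb: int) -> int:
    """Convert megabytes to the byte count Spark config expects."""
    return mb * 1024 * 1024

broadcast_threshold = mb_to_bytes(50)  # 52428800

# Applied to an existing SparkSession:
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", broadcast_threshold)
```

With the threshold at 50 MB, the 15-45 MB dimension tables qualify for broadcast hash joins, avoiding a shuffle on the fact-table side.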

Recommendations

1. Fix disk spill in ETL-Daily-Sales by increasing shuffle partitions and enabling AQE
2. Enable VACUUM on SalesLakehouse — estimated 40% storage savings
3. Right-size ML notebook executors — 37% over-provisioned currently
4. Add checkpoint location to Stream-ClickEvents immediately
5. Run OPTIMIZE on 23 tables with small file problem
6. Add caching strategy for iterative ML training workloads
7. Increase broadcast join threshold to 50MB for dimension table joins