10 anti-pattern detectors scan every notebook in your Fabric workspace. AI finds the issues, a Dual Microsoft MVP delivers the fix plan. One audit, one price, specific fixes.
Most Fabric Spark notebooks ship with default configs, unoptimized code, and neglected Delta tables. The result: slow jobs, high CU burn, and OOM failures nobody can explain.
Fabric ships sensible defaults, but they're one-size-fits-all. Your write-heavy ETL and your read-heavy aggregation need different settings — and you're probably running both on the same config.
.collect() on a 50M-row DataFrame. Python UDFs killing parallelism. Cross joins nobody intended. These patterns hide in plain sight until a job OOMs at 2 AM.
No OPTIMIZE schedule, no VACUUM, wrong partition strategy, missing Z-ORDER on filtered columns. Small files accumulate, reads slow down, and capacity costs creep up.
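For illustration, routine Delta maintenance is only a couple of statements; the table name, Z-ORDER columns, and retention window below are hypothetical placeholders the audit fills in per table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # in a Fabric notebook, `spark` already exists

# Compact small files and co-locate rows by the columns your queries filter on
spark.sql("OPTIMIZE silver.sales ZORDER BY (customer_id, order_date)")

# Remove unreferenced files older than 7 days (168 hours)
spark.sql("VACUUM silver.sales RETAIN 168 HOURS")
```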
Every finding comes with a specific fix, not just a warning. Your overall score is a weighted average across these five areas.
We compare your spark.conf.set() calls against Microsoft's recommended configs for write-heavy, balanced, and read-heavy workloads. Missing settings flagged with exact values to add.
10 anti-pattern detectors scan every code cell: .collect(), blind repartitions, cross joins, schema inference, missing write modes, Python UDFs, and more. Each hit includes the fix.
OPTIMIZE/VACUUM frequency, partition strategy, Z-Ordering on filtered columns, small file consolidation, V-Order usage, and table statistics. We tell you exactly which tables need attention.
Idle Livy sessions, driver/executor memory sizing, Starter Pool vs Workspace Pool selection, and capacity utilization patterns. Stop burning CUs on jobs that should take half the time.
Medallion layer detection, error handling coverage, logging practices, Variable Library usage, lakehouse binding patterns, and environment parameterization.
These are the performance killers hiding in your notebooks. We find every instance and tell you exactly how to fix it.
Pulls entire dataset to driver memory — guaranteed OOM on large tables. Fix: use .take(N) or .limit(N).toPandas() instead.
Cartesian products multiply row counts: a million rows joined to a million rows yields a trillion. We verify intent and suggest bounded alternatives with explicit join keys.
Row-by-row Python execution kills Spark parallelism. We identify replacements with built-in Spark SQL functions or Pandas UDFs.
inferSchema=True triggers a full extra scan pass on every read. We generate explicit StructType schemas from your data.
Hard-coded .repartition(N) counts waste resources as data volumes grow or shrink. We recommend .coalesce(), adaptive query execution, or partition-by-column strategies.
Breaks across environments. We convert to 3-part naming (catalog.schema.table) or Variable Library parameterization.
Converts distributed DataFrame to single-node Pandas — driver memory bottleneck. We flag frames exceeding safe thresholds.
Without an explicit .mode("overwrite") or .mode("append"), Spark defaults to errorIfExists, so reruns fail outright; choosing the wrong mode produces duplicates instead.
Caching a DataFrame used once wastes memory and adds serialization overhead. We identify cache calls that should be removed.
Small dimension tables joined without broadcast hints force expensive shuffle joins. We flag tables under the broadcast threshold.
15-minute call. We get read-only access to your workspace and understand which notebooks run most frequently, which ones fail, and what matters most to your team.
AI reads every notebook via the Fabric REST API. Inspects Spark configs, scans code cells for all 10 anti-patterns, checks lakehouse structure, and analyzes Livy session metadata.
Scored report with prioritized fixes, recommended Spark configs per notebook, and a 2-hour walkthrough call with a Dual Microsoft MVP. We implement the highest-impact fixes together.
Not a generic checklist. A scored, notebook-by-notebook report with code-level fixes you can apply immediately.
JSON + PDF report with overall score (0–100) and per-dimension breakdown. Every finding includes severity, affected notebook, line reference, and specific fix.
Per-notebook configuration file matched to your workload profile (write-heavy, balanced, or read-heavy). Copy-paste ready spark.conf.set() blocks.
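For illustration, a block of this shape; these are standard Spark settings with example values only, not the recommended numbers the audit produces:

```python
# `spark` is the session Fabric provides in every notebook.
# Example values only; the audit report sets these per workload profile.
spark.conf.set("spark.sql.adaptive.enabled", "true")                     # adaptive query execution
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge tiny shuffle partitions
spark.conf.set("spark.sql.shuffle.partitions", "200")                    # tune to data volume
```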
Executive-friendly 1-page summary: top 5 issues, estimated performance impact, and recommended priority order. Share it with your manager or stakeholders.
Live session where we walk through every finding, answer questions, and implement the highest-impact fixes together in your notebooks.
Table-by-table assessment: partition strategy, OPTIMIZE/VACUUM schedule, Z-ORDER recommendations, small file count, and V-Order status.
One follow-up call within 30 days. Check progress, troubleshoot implementation, and fine-tune configs after you've applied the fixes.
Secure checkout via Stripe. You'll receive an intake questionnaire after payment.
Spark Optimization ($3,500) + Semantic Model Audit ($2,500) + Pipeline Health Check ($1,500)
Every Spark Optimization Audit engagement includes three professional deliverables — see a sample below.
Interactive scored report with findings, severity ratings, metrics, and recommendations. Dark-themed, print-ready.
PowerPoint summary for leadership — score, key findings, recommendations, and next steps. Ready to present.
Detailed written report with findings table, remediation steps, and priority recommendations. Shareable with stakeholders.
Sample uses anonymized data for demonstration purposes
No limit. We scan every notebook in the workspace you point us at. For large workspaces (50+ notebooks), we prioritize by execution frequency and job duration so you get the highest-impact findings first.
No. Read-only access is sufficient for the full audit. We use the Fabric REST API to read notebook content and Livy session metadata. We never modify your notebooks or lakehouse tables.
This audit is specifically for Microsoft Fabric Spark (notebooks, Lakehouses, Livy sessions). If you're migrating from Databricks, we can assess your notebooks for Fabric compatibility as part of the engagement.
The audit includes specific code fixes for every finding. During the 2-hour walkthrough, we implement the highest-impact fixes together in your actual notebooks. For larger remediation projects, we offer follow-up engagements.
The Semantic Model Audit ($2,500) focuses on Power BI datasets — DAX measures, relationships, storage modes. This Spark audit focuses on PySpark notebooks — code patterns, Spark configs, Delta Lake. Different layers of the stack, both critical for performance. Bundle them together for $5,000.
Weighted average: Spark Config (25%) + Code Quality (25%) + Delta Health (20%) + Resource Efficiency (15%) + Architecture (15%). Each dimension is scored 0–100 independently, then combined. You get both the overall score and per-dimension breakdowns.
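The combination can be expressed in a few lines; the example dimension scores are made up for illustration:

```python
# Weights from the scoring methodology above; example scores are hypothetical.
weights = {"spark_config": 0.25, "code_quality": 0.25, "delta_health": 0.20,
           "resource_efficiency": 0.15, "architecture": 0.15}
scores = {"spark_config": 80, "code_quality": 60, "delta_health": 90,
          "resource_efficiency": 70, "architecture": 50}

overall = sum(weights[k] * scores[k] for k in weights)  # → 71.0
```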
Get a scored report with specific fixes for every notebook in your workspace. One audit, half a day, $3,500.
CLIENT RESULTS
Manufacturing
8x faster batch processing
Spark notebooks running 6+ hours due to full-scan joins and missing Z-order. Rewrote with predicate pushdown and optimized Delta Lake layout.
6.2hrs → 47min runtime
E-Commerce
60% capacity cost savings
Over-provisioned Spark pools running 24/7 for jobs that needed 4hrs/day. Right-sized compute and added auto-scaling policies.
$4,200 → $1,680/mo
Telecom
Eliminated data quality issues
Schema drift in bronze layer causing silent failures in silver transformations. Added schema enforcement and alerting.
99.7% pipeline reliability
IS THIS RIGHT FOR YOU?
Jobs that take hours when they should take minutes — but you're not sure where the bottleneck is.
Capacity costs are climbing and you suspect unoptimized Spark is the culprit.
You've scaled past experimentation — now performance and reliability matter at scale.
Your data engineers built it, but you want expert validation on patterns and configurations.
Each engagement is standalone — or bundle them for deeper savings.
$12,500
Full tenant health audit
$15,000
Medallion lakehouse architecture
$18,000
AI agent integration via MCP
$2,500
Power BI model review
$3,500
Pipeline reliability review
$8,000
Migration plan to Fabric
$12,000
Eventstreams & KQL setup