A consolidated, scenario‑first reference organized around the three official exam domains. Every topic is paired with the decisions, trade‑offs, and distractors you will actually encounter on the exam.
Eight chapters covering the exam blueprint, deep topic notes for all three domains, comparison tables, the traps the exam likes to set, and a twenty‑item pre‑exam checklist.
The DP‑700 exam validates your ability to implement, manage, and monitor data engineering solutions in Microsoft Fabric — spanning Lakehouse, Warehouse, Eventhouse, pipelines, notebooks, and real‑time intelligence. It sits squarely in the data engineer's world: you are the person who moves, shapes, and keeps data flowing reliably.
| Domain | Weight | Focus |
|---|---|---|
| 01 Implement & Manage a Data Engineering Solution | 30–35% | Workspaces, lifecycle, security |
| 02 Ingest & Transform Data | 30–35% | Batch, streaming, pipelines, Spark |
| 03 Monitor & Optimize an Analytics Solution | 30–35% | Monitor, tune Delta, Spark, SQL |
All three DP‑700 domains are weighted roughly equally — about one‑third each. You cannot skip any area. Expect Ingest & Transform and Monitor & Optimize to trade questions depending on the form you receive. Know pipelines, notebooks, Delta maintenance, and the Monitor Hub cold.
Fabric · OneLake · Lakehouse · Warehouse · Eventhouse
Multiple choice · Drag‑and‑drop · Case studies
SQL · PySpark · KQL · Power Query (M)
A notebook's pinned default lakehouse drives lakehouse.default path resolution. Pin carefully to avoid accidental writes.

| Requirement | Feature | Note |
|---|---|---|
| Isolate tenants by row | Row‑Level Security (RLS) | Security predicate; Warehouse, Lakehouse SQL endpoint, semantic models. |
| Hide sensitive columns | Column / Object‑Level Security | DENY SELECT on specific columns (Warehouse). |
| Mask PII at presentation | Dynamic Data Masking | Not encryption — privileged users still see real data. |
| Restrict folders in OneLake | OneLake file / folder ACLs | Enforced across every engine reading the path. |
| Protect downstream exports | Sensitivity labels (Purview) | Labels flow to PBIX, Excel, downstream reports. |
| Signal trust | Endorsement | Promoted (author) or Certified (admin) — not a security control. |
DP‑700 leans on scenario‑based RLS and OneLake ACLs. Know the difference between DDM (masking at read) and OLS (hard deny on a column). DDM is not a security control on its own.
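A minimal T‑SQL sketch of how these controls differ, assuming a hypothetical dbo.Sales table in a Warehouse with OwnerUser and CustomerEmail columns (all object names illustrative):

```sql
-- Hypothetical Warehouse table dbo.Sales; all names illustrative.
CREATE SCHEMA sec;
GO
-- RLS: an inline predicate function, bound by a security policy,
-- silently filters rows on every read.
CREATE FUNCTION sec.fn_TenantPredicate(@OwnerUser AS varchar(128))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS allowed
       WHERE @OwnerUser = USER_NAME();
GO
CREATE SECURITY POLICY sec.TenantFilter
ADD FILTER PREDICATE sec.fn_TenantPredicate(OwnerUser) ON dbo.Sales
WITH (STATE = ON);
GO
-- DDM: masks at presentation only; principals granted UNMASK still see real values.
ALTER TABLE dbo.Sales
ALTER COLUMN CustomerEmail ADD MASKED WITH (FUNCTION = 'email()');

-- OLS by contrast is a hard deny on the column, not a mask.
DENY SELECT ON dbo.Sales (CustomerEmail) TO [AnalystRole];
```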
| Item | Primary purpose | Owner‑persona |
|---|---|---|
| Lakehouse | Files + Delta tables over OneLake. | Data engineer |
| Warehouse | T‑SQL DW with multi‑table transactions. | Data / analytics engineer |
| Eventhouse | KQL databases for streaming + telemetry. | Data engineer |
| Notebook | PySpark / SparkSQL transformations. | Data engineer |
| Data pipeline | Orchestrate copy, notebook, dataflow, SP. | Data engineer |
| Dataflow Gen2 | Low‑code Power Query ingest + shape. | Analyst / DE |
| Eventstream | Ingest & route streaming events. | Data engineer |
| Mirrored DB | Near‑real‑time replica of external source. | Data engineer |
- Notebooks sync to Git as .ipynb plus Fabric metadata. Use branches for feature work.
- Control branch logic in pipelines with If Condition, Until, and activity dependencies (success / failure / skipped / completed).
- Install session-scoped libraries with %pip install in a notebook session.
- For production pipelines that must be reproducible, pin a custom environment and reference it from every notebook activity. The starter pool is convenient in dev but can shift runtime behavior silently.
| Need | Tool | Why |
|---|---|---|
| Scheduled batch copy, 100+ connectors | Data pipeline — Copy activity | Schema mapping, fault tolerance, staging. |
| Low‑code shape & enrich for analysts | Dataflow Gen2 | Power Query UI; lands results in Lakehouse/Warehouse. |
| Complex logic, reuse, ML | Notebook (PySpark / Spark SQL) | Distributed compute, code‑first. |
| Live read of operational DB | Mirroring | Continuous replication into OneLake as Delta. |
| Virtualize without copy | Shortcut | Point at ADLS, S3, GCS, Dataverse, OneLake. |
| Real‑time streams | Eventstream | Ingest events → Lakehouse / Eventhouse / custom endpoint. |
Watch for "minimal source load" and "near real‑time" in a scenario — they point to Mirroring. "Without copying data" or "avoid duplication" → Shortcut. "Scheduled nightly load with transformations" → Copy activity or Dataflow.
- Watermark pattern: read the stored LastModified in a lookup, copy rows above the watermark, then update the watermark. Rebuild the parameter in pipeline expressions.
- SCD Type 1: MERGE ... WHEN MATCHED UPDATE — overwrite the row.
- SCD Type 2: expire the current row (IsCurrent=0, set EffectiveTo), then insert the new version (a hedged sketch follows the snippets below).
- Deduplicate with ROW_NUMBER() OVER (PARTITION BY nk ORDER BY ts DESC), keep rn = 1.
```sql
-- Deduplicate: keep only the latest row per business key
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER
    (PARTITION BY CustomerId
     ORDER BY UpdatedAt DESC) rn
  FROM stg.Customer
) t WHERE rn = 1;
```

```sql
-- pipeline expression: inject the stored watermark into the source query
@{activity('LookupWM').output.firstRow.LastLoad}

SELECT * FROM src.Orders
WHERE ModifiedDate > @prevWatermark;
```
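And the SCD Type 2 flow above as a hedged two-statement T‑SQL sketch — dim.Customer, stg.Customer, and the tracked Email column are hypothetical:

```sql
-- 1) Expire the current version of rows whose tracked attributes changed
UPDATE d
SET    d.IsCurrent = 0,
       d.EffectiveTo = SYSUTCDATETIME()
FROM   dim.Customer AS d
JOIN   stg.Customer AS s
       ON s.CustomerId = d.CustomerId
WHERE  d.IsCurrent = 1
  AND  s.Email <> d.Email;          -- any tracked attribute changed

-- 2) Insert a new current version for changed or brand-new keys
INSERT INTO dim.Customer
       (CustomerId, Email, IsCurrent, EffectiveFrom, EffectiveTo)
SELECT s.CustomerId, s.Email, 1, SYSUTCDATETIME(), NULL
FROM   stg.Customer AS s
LEFT JOIN dim.Customer AS d
       ON d.CustomerId = s.CustomerId AND d.IsCurrent = 1
WHERE  d.CustomerId IS NULL;        -- no current row left -> needs a new version
```

Step 1 flips IsCurrent on changed rows first, so step 2's anti-join picks up both brand-new keys and the just-expired ones in a single pass.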
- Switch cell language with magics (%%sql, %%pyspark).
- The pinned default lakehouse is mounted under /lakehouse/default/ (Files or Tables).
- Utilities: mssparkutils.fs, mssparkutils.notebook.run, mssparkutils.credentials.
- Read with spark.read.format("delta").load(path) or spark.table("lh.schema.tbl").
- Transform with df.filter(...).groupBy(...).agg(...); chain, don't materialize.
- Write with df.write.format("delta").mode("append").saveAsTable("lh.schema.tbl").
- Upsert with DeltaTable.forName(spark,"t").alias("t").merge(source.alias("s"), "t.k=s.k").whenMatchedUpdateAll().whenNotMatchedInsertAll().execute().
- Broadcast small dimensions with F.broadcast(dim).
- Prefer built-in F.* functions — Catalyst can optimize them; UDFs are black boxes.
```python
from pyspark.sql import functions as F

# Daily revenue: filter paid orders, aggregate, overwrite the gold table
df = spark.table("lh.silver.orders")
daily = (df
    .filter(F.col("status") == "paid")
    .groupBy("order_date")
    .agg(F.sum("amount").alias("rev")))
daily.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("lh.gold.daily_rev")
```

```python
from delta.tables import DeltaTable

# Upsert a staging DataFrame (stg) into the customer dimension
tgt = DeltaTable.forName(spark, "lh.dim.customer")
(tgt.alias("t")
    .merge(stg.alias("s"),
           "t.CustomerKey = s.CustomerKey")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```
Use OPENROWSET over Parquet in OneLake for ad‑hoc reads without ingestion.

When the exam describes large joins, complex logic, reuse via functions → pick notebook. Analyst with a UI → Dataflow Gen2. Transactional, multi‑table operation → T‑SQL in Warehouse.
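A hedged sketch of the OPENROWSET pattern above; the OneLake URL and folder are placeholders for your own workspace and lakehouse:

```sql
-- Ad-hoc read over Parquet in OneLake without ingesting it.
-- The path is a placeholder; substitute your workspace/lakehouse names.
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://onelake.dfs.fabric.microsoft.com/<workspace>/<lakehouse>.Lakehouse/Files/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS r;
```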
Incremental refresh in Dataflow Gen2 relies on the RangeStart / RangeEnd parameters and a detect‑changes column. Dataflow Gen2 also has a Fast Copy toggle that bypasses the M engine for large loads. If a scenario cares about throughput and a Power Query destination is given, Fast Copy is often the correct answer.
Warehouse monitoring lives in the queryinsights schema: exec_requests_history, long_running_queries, frequently_run_queries.

| Operation | What it does | When to run |
|---|---|---|
| V‑Order | Write‑time Parquet ordering for Fabric engines. | On by default; keep on for DirectLake / SQL endpoint reads. |
| OPTIMIZE | Compacts many small files into few larger ones. | After streaming or frequent small writes. |
| Z‑ORDER BY col | Co‑locates rows on chosen columns for data skipping. | Queries frequently filter on that column. |
| VACUUM | Removes obsolete files past retention (default 7d). | Reduce storage; after big rewrites. |
| Partitioning | Physical layout by a low‑cardinality column. | When date/region pruning gives big wins. |
OPTIMIZE ≠ VACUUM. OPTIMIZE compacts small files; VACUUM removes tombstoned files after retention. Never partition on a high‑cardinality column — it creates the small‑file problem.
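The same maintenance operations as Spark SQL statements — a sketch against a hypothetical lh.silver.orders table; the column and retention window are illustrative:

```sql
-- Compact small files, co-locating rows on a hot filter column
OPTIMIZE lh.silver.orders ZORDER BY (CustomerId);

-- Remove tombstoned files older than the 7-day default retention (168 hours)
VACUUM lh.silver.orders RETAIN 168 HOURS;
```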
- Fix skew: salt the key, repartition, or broadcast the small side.
- Use F.broadcast(small_df) when the small side fits comfortably in executor memory.
- Use df.cache() for a DataFrame reused many times in one session. Drop with unpersist().
- Query queryinsights.exec_requests_history to find long / frequent queries (example after the table below).
- Run UPDATE STATISTICS after large loads if estimates look off.
- Avoid SELECT *, filter early, prefer set operations.
- Use a Fail activity to raise a meaningful error.
- Branch on failure to run cleanup or alert; on completion for must‑run steps.
- Call mssparkutils.notebook.exit("ok") to pass a signal back to the calling pipeline.

| Symptom | Remedy | Why |
|---|---|---|
| Many tiny files on a Delta table | OPTIMIZE | Compacts into larger Parquet files. |
| Slow filter on a hot column | Z‑ORDER BY col | Co‑locates rows for skipping. |
| Storage bill growing | VACUUM (with retention) | Removes obsolete files. |
| One Spark stage far slower than siblings | Salt / repartition / broadcast | Classic skew fix. |
| Pipeline fails on flaky source | Activity retries + on‑failure branch | Resilience without manual rerun. |
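The queryinsights lookup from the tuning list above, as a hedged T‑SQL sketch (column names follow the documented queryinsights schema; verify in your Warehouse):

```sql
-- Ten slowest recent statements from Warehouse query insights
SELECT TOP 10
       distributed_statement_id,
       command,
       start_time,
       total_elapsed_time_ms
FROM   queryinsights.exec_requests_history
ORDER  BY total_elapsed_time_ms DESC;
```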
```kql
// Average temperature per device in 5-minute bins over the last hour
Telemetry
| where Timestamp > ago(1h)
| summarize
    avgTemp = avg(Temp)
    by bin(Timestamp, 5m),
       DeviceId
```

```kql
// Flag anomalous per-device event counts on a 1-minute series
Telemetry
| make-series cnt=count() default=0
    on Timestamp step 1m
    by DeviceId
| extend (anomalies, score, base) =
    series_decompose_anomalies(cnt)
```

```kql
// Materialized view: pre-aggregated daily revenue
.create materialized-view
DailyRevenue on table Orders
{
    Orders
    | summarize sum(Amount)
        by bin(CreatedAt, 1d)
}
```

```kql
// Run management commands one at a time.
// Retention: soft-delete rows after 30 days.
.alter-merge table Telemetry policy retention
'{"SoftDeletePeriod":"30.00:00:00"}'

// Caching: keep the most recent 7 days on hot storage.
.alter table Telemetry policy caching
hot = 7d
```
Turn on OneLake availability on a KQL table when downstream consumers (Power BI DirectLake, notebooks) need the same data without double‑ingesting.
| Feature | Lakehouse | Warehouse | Eventhouse (KQL) | Mirrored DB |
|---|---|---|---|---|
| Primary workload | Files + Delta tables | T‑SQL DW | Streaming telemetry / logs | Replica of external DB |
| Language | Spark SQL / PySpark | T‑SQL | KQL | T‑SQL (via SQL endpoint) |
| SQL writes | Read‑only (SQL endpoint) | Full | Append (ingest) | Read‑only |
| Multi‑table txn | No | Yes | No | No |
| Unstructured files | Yes | No | No | No |
| OneLake availability | Native | Native | Opt‑in per table | Native (Delta) |
| Best for | Medallion ELT, ML | Star schema, gold serving | Logs, IoT, telemetry | Live analytics on OLTP source |
| Tool | Best for | Persona |
|---|---|---|
| Copy activity | Scheduled batch, 100+ connectors, heavy transforms downstream. | Data engineer |
| Dataflow Gen2 | Low‑code M shaping with destination. | Analyst / DE |
| Notebook | Code‑first Spark pipelines. | Data engineer |
| Shortcut | Zero‑copy virtualization of external data. | Data engineer |
| Mirroring | Near‑real‑time read‑only replica. | Data engineer |
| Eventstream | Streaming ingest from Event Hub / IoT / Kafka. | Data engineer |
| Language | Strength | Typical item |
|---|---|---|
| Power Query (M) | Low‑code UI shaping. | Dataflow Gen2 |
| PySpark / Spark SQL | Distributed compute, ML, large joins. | Notebook |
| T‑SQL | Set ops, transactions, SPs. | Warehouse |
| KQL | Time‑series, logs, streaming analytics. | Eventhouse / KQL DB |
The exam uses plausible‑sounding options to test depth of understanding. Twelve of the most common traps, with corrections.
Review the night before. If any item feels unfamiliar, revisit that topic in the guide.
- Dynamic RLS with USERPRINCIPALNAME() and a security table.
- Pipeline orchestration with If Condition and notebook activities.

Reference: Microsoft Learn — Study Guide for Exam DP‑700, Implementing Data Engineering Solutions Using Microsoft Fabric.