Introduction: The Lakehouse Paradigm
The data lakehouse combines the best of data warehouses (structure, performance, ACID transactions) with data lakes (flexibility, scalability, raw data storage). This deep dive builds a production-grade lakehouse on Azure using the Medallion architecture (Bronze → Silver → Gold), Unity Catalog for governance, and Power BI for self-service analytics — all tied together with Microsoft Purview for organization-wide data discovery and lineage.
Prerequisites
- Azure subscription with Databricks workspace access
- Azure Synapse Analytics workspace
- Microsoft Purview account
- Power BI Premium or Premium Per User licenses
- Azure Data Lake Storage Gen2
- Familiarity with Spark, SQL, and Delta Lake
Phase 1: Storage Foundation — ADLS Gen2 with Medallion Zones
Infrastructure as Code with Bicep
param location string = resourceGroup().location
param projectName string = 'datalakehouse'
resource storageAccount 'Microsoft.Storage/storageAccounts@2023-01-01' = {
name: '${projectName}adls'
location: location
kind: 'StorageV2'
sku: { name: 'Standard_LRS' }
properties: {
isHnsEnabled: true
minimumTlsVersion: 'TLS1_2'
allowBlobPublicAccess: false
supportsHttpsTrafficOnly: true
networkAcls: {
defaultAction: 'Deny'
virtualNetworkRules: [
{ id: databricksSubnet.id, action: 'Allow' }
{ id: synapseSubnet.id, action: 'Allow' }
]
}
}
}
// Medallion zone containers
var zones = ['bronze', 'silver', 'gold', 'sandbox', 'archive']
resource containers 'Microsoft.Storage/storageAccounts/blobServices/containers@2023-01-01' = [for zone in zones: {
name: '${storageAccount.name}/default/${zone}'
properties: {
publicAccess: 'None'
}
}]
// Lifecycle management for cost optimization
resource lifecyclePolicy 'Microsoft.Storage/storageAccounts/managementPolicies@2023-01-01' = {
parent: storageAccount
name: 'default'
properties: {
policy: {
rules: [
{
name: 'ArchiveBronzeAfter90Days'
type: 'Lifecycle'
definition: {
filters: {
blobTypes: ['blockBlob']
prefixMatch: ['bronze/']
}
actions: {
baseBlob: {
tierToCool: { daysAfterModificationGreaterThan: 30 }
tierToArchive: { daysAfterModificationGreaterThan: 90 }
}
}
}
}
{
name: 'DeleteSandboxAfter14Days'
type: 'Lifecycle'
definition: {
filters: {
blobTypes: ['blockBlob']
prefixMatch: ['sandbox/']
}
actions: {
baseBlob: {
delete: { daysAfterModificationGreaterThan: 14 }
}
}
}
}
]
}
}
}
Phase 2: Bronze Layer — Raw Data Ingestion with Databricks
Auto Loader for Streaming Ingestion
# Databricks notebook: bronze_ingestion.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, input_file_name, lit
from delta.tables import DeltaTable
spark = SparkSession.builder.getOrCreate()
# Configure Auto Loader for streaming CSV ingestion
bronze_stream = (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("cloudFiles.schemaLocation", "/mnt/adls/bronze/_schemas/sales")
.option("cloudFiles.schemaHints", "order_date DATE, amount DOUBLE")
.option("cloudFiles.inferColumnTypes", "true")
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
.load("/mnt/adls/landing/sales/")
.withColumn("_ingested_at", current_timestamp())
.withColumn("_source_file", input_file_name())
.withColumn("_batch_id", lit(spark.conf.get("spark.databricks.workflow.runId", "manual")))
)
# Write to Bronze Delta table with merge schema
(bronze_stream
.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/mnt/adls/bronze/_checkpoints/sales")
.option("mergeSchema", "true")
.trigger(availableNow=True)
.toTable("bronze.raw_sales")
)
# Data quality checks on bronze layer
from pyspark.sql.functions import col, count, when
quality_report = (
spark.table("bronze.raw_sales")
.agg(
count("*").alias("total_rows"),
count(when(col("order_id").isNull(), 1)).alias("null_order_ids"),
count(when(col("amount") < 0, 1)).alias("negative_amounts"),
count(when(col("order_date").isNull(), 1)).alias("null_dates")
)
)
quality_report.display()
Change Data Capture (CDC) from SQL Server
# CDC ingestion from SQL Server using Debezium format
cdc_stream = (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/mnt/adls/bronze/_schemas/customers_cdc")
.load("/mnt/adls/landing/cdc/customers/")
)
# Parse Debezium CDC events
from pyspark.sql.functions import from_json, col, struct
from pyspark.sql.types import StructType, StringType, TimestampType
cdc_parsed = (
cdc_stream
.select(
col("payload.op").alias("operation"),
col("payload.after.*"),
col("payload.ts_ms").cast("timestamp").alias("_cdc_timestamp"),
current_timestamp().alias("_ingested_at")
)
)
# Merge CDC changes into Bronze Delta table
def merge_cdc_batch(batch_df, batch_id):
if batch_df.isEmpty():
return
target = DeltaTable.forName(spark, "bronze.raw_customers")
(target.alias("target")
.merge(
batch_df.alias("source"),
"target.customer_id = source.customer_id"
)
.whenMatchedUpdate(
condition="source.operation IN ('u', 'r')",
set={"*": "source.*"}
)
.whenMatchedDelete(condition="source.operation = 'd'")
.whenNotMatchedInsert(
condition="source.operation IN ('c', 'r')",
values={"*": "source.*"}
)
.execute()
)
(cdc_parsed
.writeStream
.foreachBatch(merge_cdc_batch)
.option("checkpointLocation", "/mnt/adls/bronze/_checkpoints/customers_cdc")
.trigger(availableNow=True)
.start()
)
Phase 3: Silver Layer — Cleansed and Conformed Data
Data Quality Framework with Great Expectations
Architecture Overview: # Silver layer transformation: bronze → silver
Phase 4: Gold Layer — Business-Ready Aggregations
Star Schema for Analytics
-- Gold layer: Create dimensional model
CREATE OR REPLACE TABLE gold.dim_customer AS
SELECT
customer_id,
customer_name,
email,
city,
state,
country,
customer_segment,
first_order_date,
lifetime_value,
total_orders,
CASE
WHEN lifetime_value >= 10000 THEN 'Platinum'
WHEN lifetime_value >= 5000 THEN 'Gold'
WHEN lifetime_value >= 1000 THEN 'Silver'
ELSE 'Bronze'
END AS loyalty_tier,
_valid_from,
_valid_to,
_is_current
FROM silver.clean_customers
WHERE _is_current = true;
-- Fact table: Daily sales aggregation
CREATE OR REPLACE TABLE gold.fact_daily_sales AS
SELECT
date_format(order_date, 'yyyyMMdd') AS date_key,
order_date,
product_id,
customer_id,
store_id,
COUNT(DISTINCT order_id) AS order_count,
SUM(quantity) AS total_quantity,
SUM(amount) AS gross_revenue,
SUM(discount_amount) AS total_discounts,
SUM(amount - discount_amount) AS net_revenue,
SUM(cost_amount) AS total_cost,
SUM(amount - discount_amount - cost_amount) AS gross_margin,
AVG(amount) AS avg_order_value
FROM silver.clean_sales s
GROUP BY order_date, product_id, customer_id, store_id;
-- Date dimension
CREATE OR REPLACE TABLE gold.dim_date AS
SELECT
date_format(date_value, 'yyyyMMdd') AS date_key,
date_value AS full_date,
year(date_value) AS year,
quarter(date_value) AS quarter,
month(date_value) AS month,
day(date_value) AS day,
dayofweek(date_value) AS day_of_week,
weekofyear(date_value) AS week_of_year,
date_format(date_value, 'MMMM') AS month_name,
date_format(date_value, 'EEEE') AS day_name,
CASE WHEN dayofweek(date_value) IN (1, 7) THEN true ELSE false END AS is_weekend
FROM (
SELECT explode(sequence(
to_date('2020-01-01'),
to_date('2030-12-31'),
interval 1 day
)) AS date_value
);
Phase 5: Azure Synapse Serverless SQL for Ad-Hoc Queries
Querying Delta Lake from Synapse
-- Synapse Serverless SQL: Query Gold layer directly
CREATE EXTERNAL DATA SOURCE GoldLake
WITH (
LOCATION = 'https://datalakehouseadls.dfs.core.windows.net/gold',
CREDENTIAL = ManagedIdentityCredential
);
-- Create view over Delta table
CREATE VIEW synapse.vw_sales_summary AS
SELECT
d.year,
d.quarter,
d.month_name,
c.loyalty_tier,
SUM(f.net_revenue) AS total_revenue,
SUM(f.gross_margin) AS total_margin,
COUNT(DISTINCT f.customer_id) AS unique_customers,
SUM(f.order_count) AS total_orders,
SUM(f.net_revenue) / NULLIF(COUNT(DISTINCT f.customer_id), 0) AS revenue_per_customer
FROM OPENROWSET(
BULK 'fact_daily_sales/',
DATA_SOURCE = 'GoldLake',
FORMAT = 'DELTA'
) AS f
JOIN OPENROWSET(
BULK 'dim_date/',
DATA_SOURCE = 'GoldLake',
FORMAT = 'DELTA'
) AS d ON f.date_key = d.date_key
JOIN OPENROWSET(
BULK 'dim_customer/',
DATA_SOURCE = 'GoldLake',
FORMAT = 'DELTA'
) AS c ON f.customer_id = c.customer_id
GROUP BY d.year, d.quarter, d.month_name, c.loyalty_tier;
Phase 6: Power BI DirectQuery on Lakehouse
Semantic Model Configuration
{
"model": {
"name": "Lakehouse Analytics",
"defaultMode": "DirectQuery",
"tables": [
{
"name": "Sales",
"source": {
"type": "AzureSynapseServerlessSQL",
"query": "SELECT * FROM synapse.vw_sales_summary"
},
"measures": [
{
"name": "Revenue YoY Growth",
"expression": "VAR CurrentYear = SUM(Sales[total_revenue]) VAR PreviousYear = CALCULATE(SUM(Sales[total_revenue]), SAMEPERIODLASTYEAR('Date'[full_date])) RETURN DIVIDE(CurrentYear - PreviousYear, PreviousYear, 0)"
},
{
"name": "Customer Retention Rate",
"expression": "VAR ReturningCustomers = CALCULATE(DISTINCTCOUNT(Sales[customer_id]), FILTER(Sales, Sales[total_orders] > 1)) VAR AllCustomers = DISTINCTCOUNT(Sales[customer_id]) RETURN DIVIDE(ReturningCustomers, AllCustomers, 0)"
}
]
}
]
}
}
Phase 7: Microsoft Purview Data Governance
Automated Data Classification and Lineage
# Register lakehouse assets in Purview
from azure.purview.catalog import PurviewCatalogClient
from azure.purview.scanning import PurviewScanningClient
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()
catalog_client = PurviewCatalogClient(
endpoint="https://datalakehouse-purview.purview.azure.com",
credential=credential
)
# Register ADLS Gen2 source
source_definition = {
"kind": "AdlsGen2",
"properties": {
"endpoint": "https://datalakehouseadls.dfs.core.windows.net/",
"collection": {
"referenceName": "DataLakehouse",
"type": "CollectionReference"
}
}
}
catalog_client.collection.create_or_update(
collection_name="DataLakehouse",
collection={"description": "Enterprise Data Lakehouse", "parentCollection": {"referenceName": "root"}}
)
# Create classification rules for PII detection
classification_rule = {
"kind": "Custom",
"properties": {
"description": "Detect employee IDs in format EMP-XXXXX",
"columnPatterns": [
{"kind": "Regex", "pattern": "employee.*id|emp.*id|staff.*id"}
],
"dataPatterns": [
{"kind": "Regex", "pattern": "EMP-\\d{5}"}
],
"classificationName": "CUSTOM_EMPLOYEE_ID",
"ruleType": "Custom"
}
}
Architecture Decision Matrix
| Component | Technology | Justification |
|---|---|---|
| Raw Storage | ADLS Gen2 | Hierarchical namespace, cost-effective, Azure-native |
| Data Engineering | Databricks | Best Spark runtime, Delta Lake native, Unity Catalog |
| Table Format | Delta Lake | ACID, time travel, schema evolution, Z-ORDER |
| Serving Layer | Synapse Serverless | Pay-per-query, no infrastructure, Delta Lake support |
| Visualization | Power BI | DirectQuery on lakehouse, enterprise distribution |
| Governance | Microsoft Purview | Unified catalog, lineage, classification |
| Orchestration | Databricks Workflows | Native Spark, task dependencies, alerts |
| Security | Unity Catalog + Purview | Column-level security, row filters, audit |
Best Practices
- Medallion architecture is not optional: Bronze (raw) → Silver (cleansed) → Gold (business) ensures data quality at each layer
- Use Delta Lake everywhere: ACID transactions, time travel, and schema evolution are essential for reliable data pipelines
- Z-ORDER your Gold tables: Dramatically improves query performance on frequently filtered columns
- Automate data quality checks: Catch issues at Bronze→Silver boundary, not after business reports are generated
- Lineage tracking from day one: Purview lineage is exponentially harder to add retroactively
- Use Synapse Serverless for exploration: Don't spin up dedicated pools for ad-hoc queries
- Lifecycle management for cost control: Archive Bronze data aggressively, delete sandbox data automatically
Validation and Versioning
- Last validated: April 2026
- Validate examples against your tenant, region, and SKU constraints before production rollout.
- Keep module, CLI, and SDK versions pinned in automation pipelines and review quarterly.
Security and Governance Considerations
- Apply least-privilege access using RBAC roles and just-in-time elevation for admin tasks.
- Store secrets in managed secret stores and avoid embedding credentials in scripts or source files.
- Enable audit logging, data protection policies, and periodic access reviews for regulated workloads.
Cost and Performance Notes
- Define budgets and alerts, then monitor usage and cost trends continuously after go-live.
- Baseline performance with synthetic and real-user checks before and after major changes.
- Scale resources with measured thresholds and revisit sizing after usage pattern changes.
Official Microsoft References
- https://learn.microsoft.com/azure/architecture/
- https://learn.microsoft.com/azure/well-architected/
- https://learn.microsoft.com/power-platform/guidance/
Public Examples from Official Sources
- These examples are sourced from official public Microsoft documentation and sample repositories.
- Documentation examples: https://learn.microsoft.com/azure/well-architected/
- Sample repositories: https://github.com/Azure/ArchitectureCenter
- Prefer adapting these examples to your tenant, subscriptions, and governance requirements before production use.
Key Takeaways
- The lakehouse architecture eliminates the need for separate data warehouses and data lakes
- Delta Lake provides the reliability and performance needed for production analytics
- Microsoft Purview unifies data governance across the entire lakehouse
- Power BI DirectQuery on Synapse Serverless provides real-time analytics without data duplication
- The Medallion pattern ensures data quality improves at each layer of processing
Discussion