Home / Deep Dive / Enterprise Data Lakehouse: Azure Synapse + Databricks + Power BI + Purview
Deep Dive

Enterprise Data Lakehouse: Azure Synapse + Databricks + Power BI + Purview

Design and implement a modern data lakehouse architecture combining Azure Synapse Analytics, Databricks for data engineering, Power BI for visualization, and Microsoft Purview for end-to-end data governance and lineage tracking.

What you will learn

Practical execution with concise explanations, real implementation patterns, and production-ready recommendations.

Introduction: The Lakehouse Paradigm

Introduction: The Lakehouse Paradigm

The data lakehouse combines the best of data warehouses (structure, performance, ACID transactions) with data lakes (flexibility, scalability, raw data storage). This deep dive builds a production-grade lakehouse on Azure using the Medallion architecture (Bronze → Silver → Gold), Unity Catalog for governance, and Power BI for self-service analytics — all tied together with Microsoft Purview for organization-wide data discovery and lineage.

Prerequisites

  • Azure subscription with Databricks workspace access
  • Azure Synapse Analytics workspace
  • Microsoft Purview account
  • Power BI Premium or Premium Per User licenses
  • Azure Data Lake Storage Gen2
  • Familiarity with Spark, SQL, and Delta Lake

Phase 1: Storage Foundation — ADLS Gen2 with Medallion Zones

Phase 1: Storage Foundation — ADLS Gen2 with Medallion Zones

Infrastructure as Code with Bicep

param location string = resourceGroup().location
param projectName string = 'datalakehouse'

resource storageAccount 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: '${projectName}adls'
  location: location
  kind: 'StorageV2'
  sku: { name: 'Standard_LRS' }
  properties: {
    isHnsEnabled: true
    minimumTlsVersion: 'TLS1_2'
    allowBlobPublicAccess: false
    supportsHttpsTrafficOnly: true
    networkAcls: {
      defaultAction: 'Deny'
      virtualNetworkRules: [
        { id: databricksSubnet.id, action: 'Allow' }
        { id: synapseSubnet.id, action: 'Allow' }
      ]
    }
  }
}

// Medallion zone containers
var zones = ['bronze', 'silver', 'gold', 'sandbox', 'archive']

resource containers 'Microsoft.Storage/storageAccounts/blobServices/containers@2023-01-01' = [for zone in zones: {
  name: '${storageAccount.name}/default/${zone}'
  properties: {
    publicAccess: 'None'
  }
}]

// Lifecycle management for cost optimization
resource lifecyclePolicy 'Microsoft.Storage/storageAccounts/managementPolicies@2023-01-01' = {
  parent: storageAccount
  name: 'default'
  properties: {
    policy: {
      rules: [
        {
          name: 'ArchiveBronzeAfter90Days'
          type: 'Lifecycle'
          definition: {
            filters: {
              blobTypes: ['blockBlob']
              prefixMatch: ['bronze/']
            }
            actions: {
              baseBlob: {
                tierToCool: { daysAfterModificationGreaterThan: 30 }
                tierToArchive: { daysAfterModificationGreaterThan: 90 }
              }
            }
          }
        }
        {
          name: 'DeleteSandboxAfter14Days'
          type: 'Lifecycle'
          definition: {
            filters: {
              blobTypes: ['blockBlob']
              prefixMatch: ['sandbox/']
            }
            actions: {
              baseBlob: {
                delete: { daysAfterModificationGreaterThan: 14 }
              }
            }
          }
        }
      ]
    }
  }
}

Phase 2: Bronze Layer — Raw Data Ingestion with Databricks

Auto Loader for Streaming Ingestion

# Databricks notebook: bronze_ingestion.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, input_file_name, lit
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Configure Auto Loader for streaming CSV ingestion
bronze_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/adls/bronze/_schemas/sales")
    .option("cloudFiles.schemaHints", "order_date DATE, amount DOUBLE")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("/mnt/adls/landing/sales/")
    .withColumn("_ingested_at", current_timestamp())
    .withColumn("_source_file", input_file_name())
    .withColumn("_batch_id", lit(spark.conf.get("spark.databricks.workflow.runId", "manual")))
)

# Write to Bronze Delta table with merge schema
(bronze_stream
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/adls/bronze/_checkpoints/sales")
    .option("mergeSchema", "true")
    .trigger(availableNow=True)
    .toTable("bronze.raw_sales")
)

# Data quality checks on bronze layer
from pyspark.sql.functions import col, count, when

quality_report = (
    spark.table("bronze.raw_sales")
    .agg(
        count("*").alias("total_rows"),
        count(when(col("order_id").isNull(), 1)).alias("null_order_ids"),
        count(when(col("amount") < 0, 1)).alias("negative_amounts"),
        count(when(col("order_date").isNull(), 1)).alias("null_dates")
    )
)
quality_report.display()

Change Data Capture (CDC) from SQL Server

# CDC ingestion from SQL Server using Debezium format
cdc_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/adls/bronze/_schemas/customers_cdc")
    .load("/mnt/adls/landing/cdc/customers/")
)

# Parse Debezium CDC events
from pyspark.sql.functions import from_json, col, struct
from pyspark.sql.types import StructType, StringType, TimestampType

cdc_parsed = (
    cdc_stream
    .select(
        col("payload.op").alias("operation"),
        col("payload.after.*"),
        col("payload.ts_ms").cast("timestamp").alias("_cdc_timestamp"),
        current_timestamp().alias("_ingested_at")
    )
)

# Merge CDC changes into Bronze Delta table
def merge_cdc_batch(batch_df, batch_id):
    if batch_df.isEmpty():
        return
    
    target = DeltaTable.forName(spark, "bronze.raw_customers")
    
    (target.alias("target")
        .merge(
            batch_df.alias("source"),
            "target.customer_id = source.customer_id"
        )
        .whenMatchedUpdate(
            condition="source.operation IN ('u', 'r')",
            set={"*": "source.*"}
        )
        .whenMatchedDelete(condition="source.operation = 'd'")
        .whenNotMatchedInsert(
            condition="source.operation IN ('c', 'r')",
            values={"*": "source.*"}
        )
        .execute()
    )

(cdc_parsed
    .writeStream
    .foreachBatch(merge_cdc_batch)
    .option("checkpointLocation", "/mnt/adls/bronze/_checkpoints/customers_cdc")
    .trigger(availableNow=True)
    .start()
)

Phase 3: Silver Layer — Cleansed and Conformed Data

Phase 3: Silver Layer — Cleansed and Conformed Data

Data Quality Framework with Great Expectations

Architecture Overview: # Silver layer transformation: bronze → silver

Phase 4: Gold Layer — Business-Ready Aggregations

Star Schema for Analytics

-- Gold layer: Create dimensional model
CREATE OR REPLACE TABLE gold.dim_customer AS
SELECT
    customer_id,
    customer_name,
    email,
    city,
    state,
    country,
    customer_segment,
    first_order_date,
    lifetime_value,
    total_orders,
    CASE
        WHEN lifetime_value >= 10000 THEN 'Platinum'
        WHEN lifetime_value >= 5000 THEN 'Gold'
        WHEN lifetime_value >= 1000 THEN 'Silver'
        ELSE 'Bronze'
    END AS loyalty_tier,
    _valid_from,
    _valid_to,
    _is_current
FROM silver.clean_customers
WHERE _is_current = true;

-- Fact table: Daily sales aggregation
CREATE OR REPLACE TABLE gold.fact_daily_sales AS
SELECT
    date_format(order_date, 'yyyyMMdd') AS date_key,
    order_date,
    product_id,
    customer_id,
    store_id,
    COUNT(DISTINCT order_id) AS order_count,
    SUM(quantity) AS total_quantity,
    SUM(amount) AS gross_revenue,
    SUM(discount_amount) AS total_discounts,
    SUM(amount - discount_amount) AS net_revenue,
    SUM(cost_amount) AS total_cost,
    SUM(amount - discount_amount - cost_amount) AS gross_margin,
    AVG(amount) AS avg_order_value
FROM silver.clean_sales s
GROUP BY order_date, product_id, customer_id, store_id;

-- Date dimension
CREATE OR REPLACE TABLE gold.dim_date AS
SELECT
    date_format(date_value, 'yyyyMMdd') AS date_key,
    date_value AS full_date,
    year(date_value) AS year,
    quarter(date_value) AS quarter,
    month(date_value) AS month,
    day(date_value) AS day,
    dayofweek(date_value) AS day_of_week,
    weekofyear(date_value) AS week_of_year,
    date_format(date_value, 'MMMM') AS month_name,
    date_format(date_value, 'EEEE') AS day_name,
    CASE WHEN dayofweek(date_value) IN (1, 7) THEN true ELSE false END AS is_weekend
FROM (
    SELECT explode(sequence(
        to_date('2020-01-01'), 
        to_date('2030-12-31'), 
        interval 1 day
    )) AS date_value
);

Phase 5: Azure Synapse Serverless SQL for Ad-Hoc Queries

Querying Delta Lake from Synapse

-- Synapse Serverless SQL: Query Gold layer directly
CREATE EXTERNAL DATA SOURCE GoldLake
WITH (
    LOCATION = 'https://datalakehouseadls.dfs.core.windows.net/gold',
    CREDENTIAL = ManagedIdentityCredential
);

-- Create view over Delta table
CREATE VIEW synapse.vw_sales_summary AS
SELECT
    d.year,
    d.quarter,
    d.month_name,
    c.loyalty_tier,
    SUM(f.net_revenue) AS total_revenue,
    SUM(f.gross_margin) AS total_margin,
    COUNT(DISTINCT f.customer_id) AS unique_customers,
    SUM(f.order_count) AS total_orders,
    SUM(f.net_revenue) / NULLIF(COUNT(DISTINCT f.customer_id), 0) AS revenue_per_customer
FROM OPENROWSET(
    BULK 'fact_daily_sales/',
    DATA_SOURCE = 'GoldLake',
    FORMAT = 'DELTA'
) AS f
JOIN OPENROWSET(
    BULK 'dim_date/',
    DATA_SOURCE = 'GoldLake',
    FORMAT = 'DELTA'
) AS d ON f.date_key = d.date_key
JOIN OPENROWSET(
    BULK 'dim_customer/',
    DATA_SOURCE = 'GoldLake',
    FORMAT = 'DELTA'
) AS c ON f.customer_id = c.customer_id
GROUP BY d.year, d.quarter, d.month_name, c.loyalty_tier;

Phase 6: Power BI DirectQuery on Lakehouse

Semantic Model Configuration

{
  "model": {
    "name": "Lakehouse Analytics",
    "defaultMode": "DirectQuery",
    "tables": [
      {
        "name": "Sales",
        "source": {
          "type": "AzureSynapseServerlessSQL",
          "query": "SELECT * FROM synapse.vw_sales_summary"
        },
        "measures": [
          {
            "name": "Revenue YoY Growth",
            "expression": "VAR CurrentYear = SUM(Sales[total_revenue]) VAR PreviousYear = CALCULATE(SUM(Sales[total_revenue]), SAMEPERIODLASTYEAR('Date'[full_date])) RETURN DIVIDE(CurrentYear - PreviousYear, PreviousYear, 0)"
          },
          {
            "name": "Customer Retention Rate",
            "expression": "VAR ReturningCustomers = CALCULATE(DISTINCTCOUNT(Sales[customer_id]), FILTER(Sales, Sales[total_orders] > 1)) VAR AllCustomers = DISTINCTCOUNT(Sales[customer_id]) RETURN DIVIDE(ReturningCustomers, AllCustomers, 0)"
          }
        ]
      }
    ]
  }
}

Phase 7: Microsoft Purview Data Governance

Automated Data Classification and Lineage

# Register lakehouse assets in Purview
from azure.purview.catalog import PurviewCatalogClient
from azure.purview.scanning import PurviewScanningClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
catalog_client = PurviewCatalogClient(
    endpoint="https://datalakehouse-purview.purview.azure.com",
    credential=credential
)

# Register ADLS Gen2 source
source_definition = {
    "kind": "AdlsGen2",
    "properties": {
        "endpoint": "https://datalakehouseadls.dfs.core.windows.net/",
        "collection": {
            "referenceName": "DataLakehouse",
            "type": "CollectionReference"
        }
    }
}

catalog_client.collection.create_or_update(
    collection_name="DataLakehouse",
    collection={"description": "Enterprise Data Lakehouse", "parentCollection": {"referenceName": "root"}}
)

# Create classification rules for PII detection
classification_rule = {
    "kind": "Custom",
    "properties": {
        "description": "Detect employee IDs in format EMP-XXXXX",
        "columnPatterns": [
            {"kind": "Regex", "pattern": "employee.*id|emp.*id|staff.*id"}
        ],
        "dataPatterns": [
            {"kind": "Regex", "pattern": "EMP-\\d{5}"}
        ],
        "classificationName": "CUSTOM_EMPLOYEE_ID",
        "ruleType": "Custom"
    }
}

Architecture Decision Matrix

Component Technology Justification
Raw Storage ADLS Gen2 Hierarchical namespace, cost-effective, Azure-native
Data Engineering Databricks Best Spark runtime, Delta Lake native, Unity Catalog
Table Format Delta Lake ACID, time travel, schema evolution, Z-ORDER
Serving Layer Synapse Serverless Pay-per-query, no infrastructure, Delta Lake support
Visualization Power BI DirectQuery on lakehouse, enterprise distribution
Governance Microsoft Purview Unified catalog, lineage, classification
Orchestration Databricks Workflows Native Spark, task dependencies, alerts
Security Unity Catalog + Purview Column-level security, row filters, audit

Best Practices

  1. Medallion architecture is not optional: Bronze (raw) → Silver (cleansed) → Gold (business) ensures data quality at each layer
  2. Use Delta Lake everywhere: ACID transactions, time travel, and schema evolution are essential for reliable data pipelines
  3. Z-ORDER your Gold tables: Dramatically improves query performance on frequently filtered columns
  4. Automate data quality checks: Catch issues at Bronze→Silver boundary, not after business reports are generated
  5. Lineage tracking from day one: Purview lineage is exponentially harder to add retroactively
  6. Use Synapse Serverless for exploration: Don't spin up dedicated pools for ad-hoc queries
  7. Lifecycle management for cost control: Archive Bronze data aggressively, delete sandbox data automatically

Validation and Versioning

  • Last validated: April 2026
  • Validate examples against your tenant, region, and SKU constraints before production rollout.
  • Keep module, CLI, and SDK versions pinned in automation pipelines and review quarterly.

Security and Governance Considerations

  • Apply least-privilege access using RBAC roles and just-in-time elevation for admin tasks.
  • Store secrets in managed secret stores and avoid embedding credentials in scripts or source files.
  • Enable audit logging, data protection policies, and periodic access reviews for regulated workloads.

Cost and Performance Notes

  • Define budgets and alerts, then monitor usage and cost trends continuously after go-live.
  • Baseline performance with synthetic and real-user checks before and after major changes.
  • Scale resources with measured thresholds and revisit sizing after usage pattern changes.

Official Microsoft References

  • https://learn.microsoft.com/azure/architecture/
  • https://learn.microsoft.com/azure/well-architected/
  • https://learn.microsoft.com/power-platform/guidance/

Public Examples from Official Sources

  • These examples are sourced from official public Microsoft documentation and sample repositories.
  • Documentation examples: https://learn.microsoft.com/azure/well-architected/
  • Sample repositories: https://github.com/Azure/ArchitectureCenter
  • Prefer adapting these examples to your tenant, subscriptions, and governance requirements before production use.

Key Takeaways

  • The lakehouse architecture eliminates the need for separate data warehouses and data lakes
  • Delta Lake provides the reliability and performance needed for production analytics
  • Microsoft Purview unifies data governance across the entire lakehouse
  • Power BI DirectQuery on Synapse Serverless provides real-time analytics without data duplication
  • The Medallion pattern ensures data quality improves at each layer of processing

Further Reading

Discussion