
Medallion Lakehouse Architecture: A Modern Approach to Data Management and Building Lakehouses in the Cloud

Introduction

The Medallion Architecture is a transformative data design pattern that organizes data in a lakehouse into Bronze (raw), Silver (clean), and Gold (business-ready) tiers, ensuring structured, reliable, and progressively enriched data pipelines. By segmenting data into distinct layers, the architecture delivers incremental improvements in data quality and structure: each layer, from raw data in Bronze to refined data in Gold, represents a step up in accuracy and usability. This blog delves into the intricacies of the Medallion Architecture, its integration with Delta Lake, and best practices for implementation on platforms like Azure, offering actionable insights for building efficient, high-performing, and cost-effective data lakehouses in the cloud. The blog focuses primarily on Microsoft Azure, but the fundamentals apply equally to other CSPs.


Purpose

The blog aims to provide a comprehensive guide to implementing lakehouse architectures and to simplify their adoption for modern data management, enabling development and cloud infrastructure teams to design and implement structured, cost-effective, and high-performing data lakehouse systems by aligning on the design topics below.

  • Designing Databricks Data Lakehouses: This blog serves as a guide for designing Databricks lakehouses using the Medallion design pattern, providing development and cloud infrastructure teams with a structured architectural framework to design cost-effective, scalable, and high-performing data systems on cloud platforms such as Azure and AWS.
  • Data Modernization: Drive the adoption and implementation of lakehouse architectures to modernize legacy data platforms. The lakehouse is a (Databricks) data architecture that combines the best of the data warehouse and the data lake in a single system.
  • Optimizing resource utilization: By defining clear storage tiers and performance requirements for each layer, the blog guides teams towards cost-effective resource allocation and data lifecycle management.
  • Industry standards and best practices: By adhering to the principles and best practices outlined in this blog, teams can streamline their data operations, reduce operational overhead, and focus on higher-value activities such as data analysis and insights generation.

What is a Medallion Architecture

A Medallion architecture, coined by Databricks, is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer. The layers are typically referred to as Bronze (raw), Silver (validated), and Gold (enriched), each representing a progressively higher level of data quality.
Databricks recommends this multi-layered approach for building a single source of truth for enterprise data products. The architecture guarantees atomicity, consistency, isolation, and durability as data passes through multiple layers of validation and transformation before being stored in a layout optimized for efficient analytics. The Bronze layer contains unvalidated data, which is ingested incrementally.

Data flows through the medallion architecture in a linear fashion, from Bronze to Silver to Gold. At each layer, the data is processed and transformed to improve its quality and usability. The Bronze layer is the simplest of the three: it is essentially a storage layer for raw data.
Microsoft Azure offers several data storage models, and a high-level understanding of them is essential for implementing polyglot persistence.

A medallion architecture is also called a "multi-hop" architecture, meaning that data can move between layers as needed. This architecture ensures that data is reliable and consistent as it goes through various checks and changes, and that it is stored in a way that makes it easier and faster to analyze. The medallion architecture complements other data organization methods rather than replacing them. Customers can think of it as a framework for data cleaning rather than a data architecture or model; it remains compatible and flexible, so businesses can adopt its benefits alongside existing data models, modernize or customize their data solutions, and preserve expertise while staying adaptable in the ever-changing data landscape. Data design patterns like the Medallion Architecture, and the technologies associated with them, provide a structured framework for transforming legacy data systems into modern, scalable, and high-performing platforms. By establishing distinct layers, the Medallion Architecture also reduces cognitive load for data engineering and cloud infrastructure teams.
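To make the flow concrete, below is a minimal PySpark sketch of the three hops. It assumes a Spark environment with Delta Lake configured (as on Databricks) and uses hypothetical paths and column names (order_id, order_ts, amount, region); it illustrates the pattern rather than a production pipeline.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` is already provided

# Bronze: land the raw data as-is, only adding ingestion metadata.
bronze = (spark.read.json("/landing/orders/")                     # hypothetical raw source
          .withColumn("_ingested_at", F.current_timestamp()))
bronze.write.format("delta").mode("append").save("/lakehouse/bronze/orders")

# Silver: cleanse and validate: deduplicate, enforce basic rules, fix types.
silver = (spark.read.format("delta").load("/lakehouse/bronze/orders")
          .dropDuplicates(["order_id"])
          .filter(F.col("order_id").isNotNull())
          .withColumn("order_date", F.to_date("order_ts")))
silver.write.format("delta").mode("overwrite").save("/lakehouse/silver/orders")

# Gold: aggregate into a business-ready table for reporting and dashboards.
gold = (spark.read.format("delta").load("/lakehouse/silver/orders")
        .groupBy("order_date", "region")
        .agg(F.sum("amount").alias("total_revenue"),
             F.countDistinct("order_id").alias("order_count")))
gold.write.format("delta").mode("overwrite").save("/lakehouse/gold/daily_revenue")
```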

Data Modernization

Data modernization refers to the process of transforming legacy data systems, architectures, infrastructures, tools and workflows to take advantage of modern data management technologies and practices and align with evolving business needs. It involves migrating data from legacy systems to modern platforms, improving data integration, ensuring data quality, and enabling advanced analytics and real-time processing. The goal is to enhance data accessibility (democratization), scalability, and efficiency, thereby supporting better decision-making and driving business growth. Key aspects of data modernization include migrating data to cloud-based platforms, adopting advanced analytics and machine learning techniques, and implementing modern data architectures that support real-time processing and data governance. As customers seek to modernize their data practices, adopting Medallion Architecture provides a flexible framework that not only streamlines data processing but also fosters a culture of data-driven decision-making.

Databricks Data Lakehouse and Lakehouse Architecture

The data lakehouse is a modern data management system that combines the benefits of data lakes and data warehouses: the scalability and flexibility of data lakes with the reliability and performance of data warehouses. It provides a unified platform for managing structured, semi-structured, and unstructured data, enabling organizations to perform both real-time analytics and traditional business intelligence on a single system, and it supports interoperability between data lake formats. A data lakehouse supports ACID transactions and can run fast queries, typically through SQL, directly against object storage in the cloud or on-premises, over both structured and unstructured data. This hybrid approach supports advanced analytics, machine learning, and real-time data processing while reducing complexity and lowering costs. Key components of a data lakehouse include robust data governance, ACID transactions, and support for diverse data formats.


Delta Lake

Delta Lake is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to data lakes, along with scalable metadata handling and unified batch and streaming data processing, which makes it a crucial technology for implementing the Medallion architecture. It enhances data reliability and quality through schema enforcement and data versioning, making it an ideal choice for managing data across the Bronze (raw), Silver (cleansed), and Gold (refined) layers. By leveraging Delta Lake, organizations can ensure that their data is accurate, consistent, and readily available for analytics and reporting. This integration fosters a unified approach to handling large volumes of data efficiently, supporting use cases from ETL processes to real-time analytics. Because it supports both batch and streaming data, Delta Lake is well suited to incremental data loads and change data capture (CDC). By enabling reliable and scalable data pipelines, Delta Lake plays a crucial role in implementing the Medallion Architecture within Azure environments and acts as a single source of truth for all types of workloads. Delta Lake is typically used in the medallion lakehouse architecture, but as an open-source project it can also be applied elsewhere.
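As an illustration of these capabilities, the sketch below (PySpark, assuming Delta Lake is available on the cluster and using a hypothetical table path) shows an ACID write, schema enforcement rejecting an incompatible append, and time travel to an earlier version.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` is preconfigured for Delta

path = "/lakehouse/silver/customers"  # hypothetical Delta table location

# ACID write: the whole batch commits atomically or not at all.
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["customer_id", "name"])
df.write.format("delta").mode("overwrite").save(path)

# Schema enforcement: appending a frame with an unexpected column fails by default
# instead of silently corrupting the table (unless schema evolution is enabled).
bad = spark.createDataFrame([(3, "Carol", "oops")], ["customer_id", "name", "unexpected_col"])
try:
    bad.write.format("delta").mode("append").save(path)
except Exception as err:
    print("Schema enforcement rejected the write:", type(err).__name__)

# Time travel: read an earlier version of the table for audits or reprocessing.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```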


Understanding medallion architecture format & layers

Below are the standard layer definitions in a medallion architecture, although elsewhere in this blog I focus primarily on the Bronze, Silver, and Gold layers. Depending on specific use cases, customers may need more layers: some might add a "raw (staging)" layer for landing data in a specific format before it is transformed into the Bronze layer, or a "platinum" layer for data that has been further refined and enriched for a specific use case. Regardless of the names and number of layers, the medallion architecture is flexible and can be tailored to each organization's particular requirements.

  • Bronze Layer (Raw Data): Contains raw, unprocessed data that serves as the foundation for further data transformations.
    • Purpose: Ingest raw data from various sources with minimal or zero transformation.
    • Performance Tier: Standard. Data is typically accessed infrequently.
    • Storage Tier: Cool, as data access frequency is low and retention is primarily for compliance and audits. Enable versioning only if regulatory requirements demand it. Implement lifecycle policies to transition older data to Archive. Store data in its native format (JSON, CSV, etc.) and apply compression where possible.
    • Lifecycle Management: Implement lifecycle management policies to move data to cooler storage tiers automatically (a policy sketch follows the summary table below).
    • Data Compression: Enable data compression to reduce storage costs.
  • Silver Layer (Cleansed, Validated & Enriched Data): Contains cleansed, validated, and enriched data. Data in this layer undergoes transformations to correct errors and improve quality.
    • Performance Tier: Standard, unless pipelines require high IOPS.
    • Storage Tier: Hot, as this data is accessed for regular processing and analytics (Hot for frequently accessed data and Cool for less frequently accessed data). Partition data based on query patterns (e.g., by date or category) and use the Parquet format for optimized storage and analytics.
    • Cost Optimization: Optimize read performance to reduce compute costs. Avoid keeping intermediate datasets longer than necessary by implementing cleanup jobs.
    • Processing: Leverage Azure Data Factory (ADF) or Databricks with Delta Lake for efficient processing, reducing unnecessary reads.
    • Resource Scheduling: Schedule data transformation jobs during off-peak hours to reduce compute costs.
    • Data Lineage: Avoid duplication of datasets by establishing clear data lineage.
    • Optimize Compute Resources: Use Azure Spot VMs for non-critical workloads to save costs.
  • Gold Layer (Curated/Business-Level Data): Refined data for analysis and reporting.
    • Purpose: Provide refined data for reporting and analytics.
    • Performance Tier: Premium for business-critical workloads requiring high throughput and low latency; otherwise Standard can suffice.
    • Storage Tier: Primarily Hot, as this data is accessed frequently for reports and dashboards.
    • Best Practices: Store data in analytics-ready formats (Parquet, Delta). Leverage snapshots and versioning for point-in-time recovery. Retain only the most relevant datasets in the Hot tier and move older versions to Cool or Archive tiers. Ensure proper indexing to minimize query runtime costs.
    • Reserved Instances: Purchase reserved instances for Synapse Analytics to get discounts, and use Databricks reserved capacity to reduce compute costs for Gold layer workloads if the customer has consistent usage.
    • Compute Time: Schedule ADF pipelines efficiently to avoid idle compute time.
    • Archival: Archive historical data periodically into Cool or Archive tiers.
    • SLAs: Maintain a clear SLA for data freshness and retention policies to avoid over-provisioning.
    • Query Optimization: Optimize queries to reduce compute resource usage.
  • Staging Layer: Temporary storage for intermediate processing.
    • Performance Tier: Standard.
    • Storage Tier: Hot, given frequent and short-term access.
    • Cost Optimization : Implement auto-delete policies to remove temporary data after a set period. Use Hot tier only when processing is active and transition to Cool after completion.
    • Optimize Compute Resources: Use Azure Spot VMs for non-critical workloads to save costs.
    • Use ephemeral storage patterns for highly transient data.
  • Archive Layer: Long-term storage for historical and compliance data.
    • Performance Tier: Standard.
    • Storage Tier: Always Archive.
    • Set up Blob Tiering Policies to move unused data from Cool or Hot tiers to Archive automatically.
    • Store compressed, immutable data to save space and ensure compliance.
    • Leverage Blob-level Tiering for granular control over lifecycle transitions.
    • Lifecycle Management: Use lifecycle management policies to automate data tiering.
  • Sandbox/Experimentation Layer: Stores isolated data for data scientists and analysts to experiment with.
    • Performance Tier: Standard, unless real-time analysis is required.
    • Storage Tier: Primarily Hot, as access patterns are frequent and unpredictable.
    • Cost Optimization: Restrict sandbox size and enforce quotas using Azure Storage policies. Archive unused experiments after X days automatically. Use tagging for cost attribution and ensure unused resources are cleaned up. Use Azure Spot VMs for non-critical testing workloads.
Below is a summary of the typical recommended configuration per layer, drawn from the guidance above. By following this and the guidelines in this blog, you can implement a cost-effective and high-performing medallion architecture in ADLS Gen2 for your customers.

| Layer   | Performance Tier                              | Storage Tier                               |
|---------|-----------------------------------------------|--------------------------------------------|
| Bronze  | Standard                                      | Cool, with lifecycle transition to Archive |
| Silver  | Standard (unless high IOPS is required)       | Hot (Cool for less frequently accessed)    |
| Gold    | Premium for business-critical, else Standard  | Primarily Hot                              |
| Staging | Standard                                      | Hot (transition to Cool after processing)  |
| Archive | Standard                                      | Archive                                    |
| Sandbox | Standard (unless real-time analysis required) | Primarily Hot                              |
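As a concrete example of the lifecycle management recommended for the Bronze and Archive layers, below is a sketch of an Azure Storage lifecycle management policy built as a Python dictionary in the JSON rule format used by ADLS Gen2/Blob Storage. The container prefix, rule name, and day thresholds are illustrative assumptions; the saved policy can then be applied with the Azure CLI or the azure-mgmt-storage SDK.

```python
import json

# Sketch of a lifecycle policy for the Bronze layer; prefix, name, and thresholds are assumptions.
lifecycle_policy = {
    "rules": [
        {
            "enabled": True,
            "name": "bronze-raw-tiering",
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    "prefixMatch": ["bronze/"],  # apply only to the Bronze container/folder
                },
                "actions": {
                    "baseBlob": {
                        # Cool after 30 days without modification, Archive after 180 days,
                        # delete after ~7 years of retention (adjust to your compliance policy).
                        "tierToCool": {"daysAfterModificationGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 180},
                        "delete": {"daysAfterModificationGreaterThan": 2555},
                    }
                },
            },
        }
    ]
}

# Write the policy to disk; it can then be applied with, for example:
#   az storage account management-policy create \
#     --account-name <storage-account> --resource-group <rg> --policy @policy.json
with open("policy.json", "w") as f:
    json.dump(lifecycle_policy, f, indent=2)
```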

Modernizing with Databricks Lakeflow : The Declarative Medallion

While the Medallion Architecture defines the logic of the data layers, Databricks Lakeflow Declarative Pipelines provides the automation engine to build them. Lakeflow Spark Declarative Pipelines helps operationalize the layers of the Medallion Architecture reliably at scale by managing pipeline orchestration, dependencies, and incremental processing. Moving away from manual, imperative coding, Lakeflow allows architects to use a declarative approach: defining what the data should look like at each stage rather than how to move it.

How Lakeflow Powers the Medallion Layers

Bronze (Incremental Ingestion): Utilizing Streaming Tables and Auto Loader, Lakeflow simplifies the ingestion of raw data from cloud storage (ADLS Gen2/S3). It handles schema evolution and ensures data is ingested incrementally and cost-effectively.
Silver (Validated & Enriched): Lakeflow enables data quality enforcement by allowing engineers to define Expectations directly within declarative pipelines (for example, validating that key fields are not null). Records that do not meet defined quality rules can be dropped, quarantined, or flagged, helping maintain the Silver layer as a trusted and validated dataset for downstream consumption.
Gold (Business-Ready): By leveraging Materialized Views, the Gold layer can be kept up to date automatically. Lakeflow manages dependencies between upstream Silver tables and Gold views, ensuring that changes propagate efficiently—often incrementally—without requiring manual orchestration or full pipeline refreshes.
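For illustration, below is a minimal declarative pipeline sketch using the Python dlt interface of Lakeflow Declarative Pipelines (Delta Live Tables). The storage path, column names, and the expectation rule are assumptions; the intent is to show the shape of a Bronze/Silver/Gold definition, not a complete pipeline.

```python
import dlt
from pyspark.sql import functions as F

# Bronze: incremental ingestion from cloud storage with Auto Loader (streaming table).
# `spark` is provided by the pipeline runtime.
@dlt.table(comment="Raw orders ingested incrementally from ADLS Gen2")
def bronze_orders():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("abfss://landing@<storage-account>.dfs.core.windows.net/orders/"))  # hypothetical path

# Silver: validated and enriched; rows failing the expectation are dropped.
@dlt.table(comment="Cleansed, validated orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def silver_orders():
    return (dlt.read_stream("bronze_orders")
            .withColumn("order_date", F.to_date("order_ts")))

# Gold: business-ready aggregate that Lakeflow keeps up to date as upstream tables change.
@dlt.table(comment="Daily revenue by region for reporting")
def gold_daily_revenue():
    return (dlt.read("silver_orders")
            .groupBy("order_date", "region")
            .agg(F.sum("amount").alias("total_revenue")))
```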

Benefits for architects

  • Simplified Orchestration: Lakeflow automatically constructs and manages the internal transformation DAG, eliminating manual notebook chaining and handling compute scaling and retries. External orchestrators are only needed for cross-platform workflows.
  • Built-in Observability: Lakeflow provides automatic lineage and visual DAGs integrated with Unity Catalog, enabling end-to-end data traceability and easier troubleshooting as demoed in this video.
  • Unified Batch and Streaming: Lakeflow uses a single declarative programming model for batch, streaming, and CDC pipelines, significantly reducing code and maintenance overhead.
  • If you want to see how Lakeflow and Databricks Asset Bundles (DABs) work together in a real production environment, including CI/CD, unit testing, and automated deployment, this session on Lakeflow in Production: CI/CD, Testing and Monitoring at Scale is the definitive guide.

Below are some tutorials to get started

Setting up the medallion architecture storage accounts

The YouTube video below demonstrates setting up the medallion architecture storage account on ADLS Gen2.
Details of the three layers of the Medallion architecture can be found in this link. The other videos demonstrate the creation of a similar architecture on AWS.

Setting Up a ADLS Gen-2 medallion Storage account
Creating a Lakehouse on AWS
Back to Basics: Building an Efficient Data Lake
Build a Lake House Architecture on AWS
Building Data Lakes on AWS: Build a simple Data Lake on AWS with AWS Glue, Amazon Athena, and S3
Databricks on AWS

Good practices for setting up cost-effective and high-performance storage in Microsoft Azure.

  • Recommendations for optimizing data costs.
  • Medallion Architecture in Synapse & strategies
  • Workload Analysis & Cost-Benefit Analysis: Assess the specific performance and cost requirements of each layer. Evaluate the trade-off between performance and cost when selecting tiers.
  • Data Tiering and Access Tiers: Organize data into different access tiers (Hot, Cool, Archive) based on access frequency and performance requirements. This allows teams to store data in the most cost-effective tier while maintaining performance for frequently accessed data (see the per-blob tiering sketch after this list).
  • Implement Data Lifecycle Management: Implement policies that automatically move data between tiers (Hot → Cool → Archive) based on age and access patterns, and retire unused datasets using Lifecycle Management rules.
  • Evaluate premium performance tiers within ADLS Gen2: Ensure the use of premium tiers is justified by performance needs, as they come with higher costs than the standard tier. Mission-critical workloads typically need the premium tier, and some business-critical workloads need the KPIs of mission-critical systems. Use the Standard tier for most other workloads to reduce costs and reserve the Premium tier for high-performance workloads that require low latency and high throughput.
  • Leverage Azure Storage Reserved Capacity: By committing to a specific amount of storage over a period (1 or 3 years), customers can benefit from significant cost savings compared to pay-as-you-go rates.
  • Automate data pipelines: Utilize tools like Azure Data Factory to automate the scheduling and orchestration of data pipelines.
  • Data Segmentation and Categorization: Leverage data segmentation and categorization based on type, usage patterns, and importance.
  • Tagging and Governance : Apply Azure Resource Tags to track costs by layer, team, or purpose.
  • Azure Cost Management Tools : Leverage Azure Cost Management and Billing to analyze cost drivers across layers.
  • File Format Optimization : Different file formats (like Avro, Parquet, ORC) can be chosen based on I/O patterns and query requirements to optimize performance and storage efficiency.
  • Use Private Endpoints: Configure private endpoints for secure access to storage without data exfiltration risks.
  • Access Policies: Implement proper access policies and Azure RBAC (Role-Based Access Control) to limit unnecessary access and prevent misuse.
  • Networking : Leverage private endpoints to avoid unnecessary data egress costs.
  • Optimize Redundancy : Use Locally Redundant Storage (LRS) for non-critical data to save costs, and Geo-Redundant Storage (GRS) only for critical data.
  • Monitoring and Alerts : Continuously monitor storage usage and performance metrics to identify optimization opportunities. Use Azure Monitor to track storage usage and set alerts for abnormal activity.
  • Efficient Data Ingestion and Processing : Optimize data pipelines to minimize processing time and resource usage. Use serverless compute options like Azure Databricks or Azure Synapse Analytics.
  • Performance vs. Cost Trade-offs: Balancing performance requirements with cost optimization strategies is crucial in modern data and cloud architectures, and customers need to decide when to prioritize one over the other. Here are some points for consideration:
    • Prioritize Performance: For mission-critical workloads, business-critical applications, or high-frequency transactions, prioritize performance over cost. Use premium storage tiers, performance tiers, or dedicated resources to ensure fast data access and processing.
    • Optimize Costs : For less demanding workloads or batch processing jobs, focus on cost optimization. Consider using cheaper storage tiers or shared resources for non-real-time operations.
    • Cost-aware design : Design data pipelines and architectures with cost-efficiency in mind. Choose appropriate data types, compression methods, and storage formats to minimize costs.
    • Consider Long-Term Impacts : Evaluate long-term cost implications of performance choices. Balance immediate needs with future scalability and cost projections.
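As a small complement to the tiering guidance above, the sketch below uses the azure-storage-blob Python SDK to move a single blob to the Cool tier. The account, container, and blob names are assumptions; in practice, lifecycle management policies handle these transitions automatically, and per-blob calls are mainly useful for ad hoc corrections.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Hypothetical account/container/blob names; authentication uses the ambient Azure identity.
service = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
blob = service.get_blob_client(container="silver", blob="orders/2023/part-0000.parquet")

# Demote an infrequently accessed blob to the Cool access tier to reduce storage cost.
blob.set_standard_blob_tier("Cool")
print(blob.get_blob_properties().blob_tier)
```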

Change Data Capture (CDC)

CDC is a technique used to identify and capture changes made to data in a database (data source) and then deliver those changes to a downstream process or system in real time or near real time. Delta transfers in the Medallion architecture leverage CDC to move only the changes (deltas) rather than the entire dataset. This incremental approach minimizes the amount of data transferred, leading to significant cost savings and improved performance.
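One common way to implement CDC on Delta Lake is the Change Data Feed (CDF). The sketch below (PySpark, with a hypothetical table name and starting version) enables CDF on a Silver table and reads only the row-level changes committed since a given version, so downstream Gold loads stay incremental.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` is preconfigured for Delta

# Enable the Change Data Feed on an existing Delta table (hypothetical name).
spark.sql("""
    ALTER TABLE silver.orders
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read only the changes (inserts, updates, deletes) committed since version 5,
# instead of rescanning the entire table.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 5)
           .table("silver.orders"))

# _change_type, _commit_version, and _commit_timestamp describe each change.
changes.select("order_id", "_change_type", "_commit_version").show()
```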

Appendix

Below are some additional resources and references for further learning: