A real-world data engineering project using Lakehouse + Medallion Architecture
Built an end-to-end data engineering project focused on consolidating data in the FMCG (fast-moving consumer goods) domain. The objective was to simulate a real industry scenario in which a large retail company acquires a smaller one and needs a unified data foundation for analytics.
I built a lakehouse-based ETL (Extract, Transform, and Load) pipeline using Databricks, Python, SQL, Spark, and Amazon S3. The pipeline follows the Medallion architecture and processes raw OLTP (Online Transaction Processing) data into curated Gold layer tables that support business reporting. The project also includes a dimensional data model, automated jobs for incremental and full loads, and a sales insights dashboard built from the final Gold tables.
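The incremental and full loads are automated jobs. A minimal sketch of the incremental path is below, assuming Delta tables on S3; the paths, the `order_id` key, the `updated_at` watermark column, and the literal watermark value are illustrative placeholders, not the project's actual schema.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read only rows that changed since the last successful run.
# The watermark would normally come from job state, not a literal.
updates = (
    spark.read.format("delta")
    .load("s3://lakehouse/bronze/sales")           # illustrative path
    .filter("updated_at > '2024-01-01 00:00:00'")  # illustrative watermark
)

target = DeltaTable.forPath(spark, "s3://lakehouse/silver/sales")

# Upsert: update rows whose key already exists, insert the rest.
(
    target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

A full load skips the MERGE and simply overwrites the target table with the entire source.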
In the FMCG domain, large retail organizations often grow through acquisitions, and the two companies' data ecosystems usually differ in structure, format, and technology. This project simulates that scenario by consolidating data from two retail companies into a unified Lakehouse architecture with a scalable ETL pipeline.
The pipeline extracts, transforms, and loads data into Bronze → Silver → Gold layers following the Medallion Architecture, enabling unified analytics and BI reporting.
Tech Stack
- Databricks – Lakehouse platform for development and job orchestration
- Python – Data processing and orchestration scripts
- SQL – Transformations and Gold layer modeling
- Amazon S3 – Lakehouse storage (Bronze/Silver/Gold)
- Spark (PySpark) – Distributed ETL computation
- Medallion Architecture – Bronze → Silver → Gold layering
- BI Dashboard – Insights and reporting (e.g., Tableau/Power BI)
- Databricks Genie – Natural-language querying over the curated tables
Bronze Layer
- Load raw data from both companies into S3 as CSV files (see the ingestion sketch after this list).
- No transformations; only ingestion and basic quality checks.
- Stores raw sales, inventory, and product master data.
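A minimal sketch of the Bronze ingestion step, assuming one Delta output per company; the company names, bucket layout, and lineage columns are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

for company in ["company_a", "company_b"]:          # illustrative names
    raw = (
        spark.read.option("header", "true")
        .csv(f"s3://raw-landing/{company}/sales/")  # illustrative path
    )

    # Basic quality check: fail fast on an empty extract.
    assert raw.count() > 0, f"empty extract for {company}"

    # Persist the data unmodified, tagged with lineage metadata.
    (
        raw.withColumn("_source_company", F.lit(company))
        .withColumn("_ingested_at", F.current_timestamp())
        .write.format("delta")
        .mode("append")
        .save(f"s3://lakehouse/bronze/{company}/sales")
    )
```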
Silver Layer
- Apply schema normalization to both companies' data.
- Deduplicate, validate, and harmonize columns.
- Conform dimensions (dates, products, stores).
- Output is clean, query-ready structured data (see the sketch after this list).
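A minimal sketch of the Silver harmonization; the source column names, date formats, and mappings are illustrative, since the two companies' real schemas are not shown here.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

a = spark.read.format("delta").load("s3://lakehouse/bronze/company_a/sales")
b = spark.read.format("delta").load("s3://lakehouse/bronze/company_b/sales")

# Map each source schema onto one conformed column set (illustrative).
a = a.select(
    F.col("order_id"),
    F.col("prod_code").alias("product_id"),
    F.to_date("order_dt", "yyyy-MM-dd").alias("order_date"),
    F.col("amount").cast("decimal(12,2)").alias("sales_amount"),
)
b = b.select(
    F.col("transaction_id").alias("order_id"),
    F.col("product_id"),
    F.to_date("txn_date", "dd/MM/yyyy").alias("order_date"),
    F.col("net_value").cast("decimal(12,2)").alias("sales_amount"),
)

# Union the harmonized frames, deduplicate, and validate.
silver = (
    a.unionByName(b)
    .dropDuplicates(["order_id"])
    .filter(F.col("sales_amount") >= 0)
)

silver.write.format("delta").mode("overwrite").save("s3://lakehouse/silver/sales")
```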
Gold Layer
- Build aggregated fact tables:
  - Sales fact
  - Inventory fact
  - Customer analytics
- Create unified models consumed by BI dashboards (see the sketch after this list).
- Power BI/Tableau dashboards are built on top of this layer.
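A minimal sketch of one Gold aggregation, a daily per-product sales fact; the grain and measure names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.read.format("delta").load("s3://lakehouse/silver/sales") \
    .createOrReplaceTempView("silver_sales")

# Aggregate the Silver sales into a daily, per-product fact table.
fact_sales = spark.sql("""
    SELECT
        order_date,
        product_id,
        COUNT(order_id)   AS order_count,
        SUM(sales_amount) AS total_sales
    FROM silver_sales
    GROUP BY order_date, product_id
""")

fact_sales.write.format("delta").mode("overwrite") \
    .save("s3://lakehouse/gold/fact_sales")
```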
Key Features
- End-to-end ETL pipeline from raw ingestion to BI-ready tables
- Lakehouse storage using Amazon S3
- Data standardization across two different companies
- Scalable transformation engine using Apache Spark
- Medallion architecture for clean, maintainable pipelines
- Business insights dashboard powered by Genie + BI tools