A real-world data engineering project using Lakehouse + Medallion Architecture
Built an end-to-end data engineering project focused on consolidating data in the FMCG (fast-moving consumer goods) domain. The objective was to simulate a real industry scenario in which a large retail company acquires a smaller one and needs a unified data foundation for analytics.
I built a lakehouse-based ETL (Extract, Transform, and Load) pipeline using Databricks, Python, SQL, Spark, and Amazon S3. The pipeline follows the Medallion architecture and processes raw OLTP (Online Transaction Processing) data into curated Gold layer tables that support business reporting. The project also includes a dimensional data model, automated jobs for incremental and full loads, and a sales insights dashboard built from the final Gold tables.
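The incremental and full loads are automated jobs. A minimal sketch of the incremental path is below, assuming Delta tables on S3; the paths, the `order_id` key, the `updated_at` watermark column, and the literal watermark value are illustrative placeholders, not the project's actual schema.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read only rows that changed since the last successful run.
# The watermark would normally come from job state, not a literal.
updates = (
    spark.read.format("delta")
    .load("s3://lakehouse/bronze/sales")           # illustrative path
    .filter("updated_at > '2024-01-01 00:00:00'")  # illustrative watermark
)

target = DeltaTable.forPath(spark, "s3://lakehouse/silver/sales")

# Upsert: update rows whose key already exists, insert the rest.
(
    target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

A full load skips the MERGE and simply overwrites the target table with the entire source.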
In the FMCG domain, large retail organizations often grow through acquisitions, and the two companies' data ecosystems usually differ in structure, format, and technology. This project simulates that scenario by consolidating data from two retail companies into a unified Lakehouse architecture with a scalable ETL pipeline.
The pipeline extracts, transforms, and loads data into Bronze → Silver → Gold layers following the Medallion Architecture, enabling unified analytics and BI reporting.
Tech Stack
- Databricks – Lakehouse platform for development and job orchestration
- Python – Data processing and orchestration scripts
- SQL – Transformations and Gold layer modeling
- Amazon S3 – Lakehouse storage (Bronze/Silver/Gold)
- Spark (PySpark) – Distributed ETL computation
- Medallion Architecture – Bronze → Silver → Gold layering
- BI Dashboard – Insights and reporting (e.g., Tableau/Power BI)
- Databricks Genie – Natural-language querying over the curated tables
Bronze Layer
- Load raw data from both companies into S3 as CSV files (see the ingestion sketch after this list).
- No transformations; only ingestion and basic quality checks.
- Stores raw sales, inventory, and product master data.
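A minimal sketch of the Bronze ingestion step, assuming one Delta output per company; the company names, bucket layout, and lineage columns are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

for company in ["company_a", "company_b"]:          # illustrative names
    raw = (
        spark.read.option("header", "true")
        .csv(f"s3://raw-landing/{company}/sales/")  # illustrative path
    )

    # Basic quality check: fail fast on an empty extract.
    assert raw.count() > 0, f"empty extract for {company}"

    # Persist the data unmodified, tagged with lineage metadata.
    (
        raw.withColumn("_source_company", F.lit(company))
        .withColumn("_ingested_at", F.current_timestamp())
        .write.format("delta")
        .mode("append")
        .save(f"s3://lakehouse/bronze/{company}/sales")
    )
```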
Silver Layer
- Apply schema normalization to both companies' data.
- Deduplicate, validate, and harmonize columns.
- Conform dimensions (dates, products, stores).
- Output is clean, query-ready structured data (see the sketch after this list).
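A minimal sketch of the Silver harmonization; the source column names, date formats, and mappings are illustrative, since the two companies' real schemas are not shown here.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

a = spark.read.format("delta").load("s3://lakehouse/bronze/company_a/sales")
b = spark.read.format("delta").load("s3://lakehouse/bronze/company_b/sales")

# Map each source schema onto one conformed column set (illustrative).
a = a.select(
    F.col("order_id"),
    F.col("prod_code").alias("product_id"),
    F.to_date("order_dt", "yyyy-MM-dd").alias("order_date"),
    F.col("amount").cast("decimal(12,2)").alias("sales_amount"),
)
b = b.select(
    F.col("transaction_id").alias("order_id"),
    F.col("product_id"),
    F.to_date("txn_date", "dd/MM/yyyy").alias("order_date"),
    F.col("net_value").cast("decimal(12,2)").alias("sales_amount"),
)

# Union the harmonized frames, deduplicate, and validate.
silver = (
    a.unionByName(b)
    .dropDuplicates(["order_id"])
    .filter(F.col("sales_amount") >= 0)
)

silver.write.format("delta").mode("overwrite").save("s3://lakehouse/silver/sales")
```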
Gold Layer
- Build aggregated fact tables:
  - Sales fact
  - Inventory fact
  - Customer analytics
- Create unified models consumed by BI dashboards (see the sketch after this list).
- Power BI/Tableau dashboards are built on top of this layer.
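A minimal sketch of one Gold aggregation, a daily per-product sales fact; the grain and measure names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.read.format("delta").load("s3://lakehouse/silver/sales") \
    .createOrReplaceTempView("silver_sales")

# Aggregate the Silver sales into a daily, per-product fact table.
fact_sales = spark.sql("""
    SELECT
        order_date,
        product_id,
        COUNT(order_id)   AS order_count,
        SUM(sales_amount) AS total_sales
    FROM silver_sales
    GROUP BY order_date, product_id
""")

fact_sales.write.format("delta").mode("overwrite") \
    .save("s3://lakehouse/gold/fact_sales")
```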
Key Features
- End-to-end ETL pipeline from raw ingestion to BI-ready tables
- Lakehouse storage using Amazon S3
- Data standardization across two different companies
- Scalable transformation engine using Apache Spark
- Medallion architecture for clean, maintainable pipelines
- Business insights dashboard powered by Genie + BI tools