data-engineering

Data Engineering Pipeline

Star Schema Lakehouse

A centralized Star Schema Lakehouse architecture that unifies disparate data sources — semi-structured MongoDB documents, CSV flat files, and nested JSON — into a clean, queryable analytical layer. The pipeline flattens nested structures, engineers derived geographical dimensions, and materializes everything as columnar Parquet files for high-speed aggregation via DuckDB.

Key Features

Star Schema design with fact and dimension tables for analytical workloads

PyMongo extraction and flattening of deeply nested JSON documents

Derived geographical dimensions engineered from raw coordinate data

Columnar Parquet storage for 10–50x compression vs raw JSON

DuckDB for in-process SQL aggregations with zero infrastructure overhead

PySpark integration for distributed processing of larger datasets

Schema Design

The Star Schema centers on fact tables capturing transactional events, surrounded by dimension tables for geography, time, and entity attributes. This denormalized structure optimizes analytical queries by minimizing joins while maintaining referential integrity through surrogate keys.

ETL Pipeline

The extraction layer pulls from MongoDB (via PyMongo) and CSV sources, applying schema inference and type coercion. The transformation layer flattens nested JSON structures, engineers geographical dimensions from coordinate data (reverse geocoding, region classification), and handles temporal features. The load phase writes to Parquet with Snappy compression, partitioned by date for efficient range queries.

Tech Stack

Python

PySpark

MongoDB

DuckDB

Parquet

View Code