Data Engineering Pipeline

Star Schema Lakehouse

A centralized Star Schema Lakehouse architecture that unifies disparate data sources — semi-structured MongoDB documents, CSV flat files, and nested JSON — into a clean, queryable analytical layer. The pipeline flattens nested structures, engineers derived geographical dimensions, and materializes everything as columnar Parquet files for high-speed aggregation via DuckDB.
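As a concrete illustration of the query layer, the sketch below runs an aggregation over the Parquet lake with DuckDB. The paths and names (fact_orders, dim_geo, amount) are placeholders, not the project's actual schema.

```python
# Minimal sketch of the analytical layer: DuckDB aggregating partitioned
# Parquet files in-process. Paths and column names are illustrative.
import duckdb

con = duckdb.connect()  # in-memory database; no server to run

result = con.execute("""
    SELECT g.region,
           COUNT(*)      AS orders,
           SUM(f.amount) AS revenue
    FROM read_parquet('lake/fact_orders/*/*.parquet') AS f
    JOIN read_parquet('lake/dim_geo/*.parquet')       AS g
      ON f.geo_key = g.geo_key
    GROUP BY g.region
    ORDER BY revenue DESC
""").fetch_df()
print(result)
```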

Key Features

Star Schema design with fact and dimension tables for analytical workloads
PyMongo extraction and flattening of deeply nested JSON documents
Geographical dimensions engineered from raw coordinate data
Columnar Parquet storage for 10–50x compression vs raw JSON
DuckDB for in-process SQL aggregations with zero infrastructure overhead
PySpark integration for distributed processing of larger datasets (sketched after this list)
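For datasets that outgrow a single machine, the PySpark path might look roughly like this. The input path, nested field names, and output location are placeholders under assumed raw-event JSON with a location struct.

```python
# Sketch of the distributed path for larger extracts. File locations
# and nested field names are placeholders, not the project's schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lakehouse-etl").getOrCreate()

# Read the raw JSON dump, flatten one nested struct, and write Parquet
# partitioned by event date so the layout matches the single-node path.
raw = spark.read.json("raw/events/*.json")
flat = (
    raw.withColumn("lat", F.col("location.coordinates.lat"))
       .withColumn("lon", F.col("location.coordinates.lon"))
       .withColumn("event_date", F.to_date("timestamp"))
       .drop("location")
)
flat.write.mode("overwrite").partitionBy("event_date").parquet("lake/fact_events")
```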

Schema Design

The Star Schema centers on fact tables capturing transactional events, surrounded by dimension tables for geography, time, and entity attributes. This denormalized structure optimizes analytical queries by minimizing joins while maintaining referential integrity through surrogate keys.
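A condensed sketch of what this design can look like as DuckDB DDL, with dimension tables owning the surrogate keys and the fact table referencing them. Table and column names here are illustrative, not the project's actual schema.

```python
import duckdb

con = duckdb.connect("lakehouse.duckdb")

# Dimension tables carry surrogate keys; names are illustrative.
con.execute("""
    CREATE TABLE dim_geo (
        geo_key  INTEGER PRIMARY KEY,  -- surrogate key
        city     VARCHAR,
        region   VARCHAR,
        country  VARCHAR
    )
""")
con.execute("""
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,  -- e.g. 20240131
        cal_date DATE,
        month    INTEGER,
        year     INTEGER
    )
""")
# The fact table stays narrow: measures plus foreign keys into the dims.
con.execute("""
    CREATE TABLE fact_orders (
        order_id BIGINT,
        geo_key  INTEGER REFERENCES dim_geo (geo_key),
        date_key INTEGER REFERENCES dim_date (date_key),
        amount   DOUBLE
    )
""")
```

Surrogate keys keep the fact table independent of source-system identifiers, so a dimension can be reloaded or corrected without rewriting the facts.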

ETL Pipeline

The extraction layer pulls from MongoDB (via PyMongo) and CSV sources, applying schema inference and type coercion. The transformation layer flattens nested JSON structures, engineers geographical dimensions from coordinate data (reverse geocoding, region classification), and derives temporal features. The load phase writes to Parquet with Snappy compression, partitioned by date for efficient range queries.
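A minimal end-to-end sketch of this flow, assuming a local MongoDB instance and illustrative collection and field names; the region classification is reduced to a one-line stub where real reverse geocoding would run.

```python
# Condensed extract-transform-load sketch. Connection string, collection,
# and field names are placeholders, not the project's actual schema.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pymongo import MongoClient

# Extract: pull raw documents from MongoDB.
client = MongoClient("mongodb://localhost:27017")
docs = list(client["shop"]["orders"].find({}, {"_id": 0}))

# Transform: flatten nested JSON into columns, then derive dimensions.
df = pd.json_normalize(docs, sep="_")          # location.lat -> location_lat
df["order_date"] = pd.to_datetime(df["created_at"]).dt.date
df["region"] = df["location_lat"].apply(
    lambda lat: "north" if lat >= 45 else "south"  # stand-in for geocoding
)

# Load: Parquet partitioned by date for range pruning. Snappy is
# pyarrow's default compression codec.
pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path="lake/fact_orders",
    partition_cols=["order_date"],
)
```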

Tech Stack

Python
PySpark
MongoDB
DuckDB
Parquet