Data Engineering Pipeline
Star Schema Lakehouse
A centralized Star Schema Lakehouse architecture that unifies disparate data sources (semi-structured MongoDB documents, CSV flat files, and nested JSON payloads) into a clean, queryable analytical layer. The pipeline flattens nested structures, engineers derived geographical dimensions, and materializes everything as columnar Parquet files for high-speed aggregation via DuckDB.
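As a rough illustration of the query layer, the sketch below runs an aggregation directly over date-partitioned Parquet with DuckDB. The `lake/fact_sales` path, partition layout, and column names are hypothetical stand-ins, not the project's actual schema.

```python
import duckdb

con = duckdb.connect()  # in-memory analytical engine

# hive_partitioning=true exposes the date=... directory names
# as a queryable column, so the WHERE clause prunes partitions.
df = con.execute("""
    SELECT region, COUNT(*) AS events, SUM(amount) AS total
    FROM read_parquet('lake/fact_sales/date=*/*.parquet', hive_partitioning=true)
    WHERE date BETWEEN '2024-01-01' AND '2024-03-31'
    GROUP BY region
    ORDER BY total DESC
""").fetchdf()
print(df)
```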
Key Features
Schema Design
The Star Schema centers on fact tables capturing transactional events, surrounded by dimension tables for geography, time, and entity attributes. This denormalized structure optimizes analytical queries by minimizing joins while maintaining referential integrity through surrogate keys.
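A minimal sketch of that layout, expressed as DuckDB DDL from Python. The table and column names (`fact_sales`, `dim_geography`, `dim_date`) are illustrative assumptions; the point is the shape: integer surrogate keys on the dimensions, referenced from the fact table.

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE dim_geography (
        geo_key   INTEGER PRIMARY KEY,  -- surrogate key
        country   VARCHAR,
        region    VARCHAR,
        latitude  DOUBLE,
        longitude DOUBLE
    );
    CREATE TABLE dim_date (
        date_key  INTEGER PRIMARY KEY,  -- surrogate key, e.g. 20240131
        full_date DATE,
        year      INTEGER,
        month     INTEGER
    );
    -- Fact table: one row per transactional event, joined to
    -- dimensions only through the surrogate keys.
    CREATE TABLE fact_sales (
        geo_key   INTEGER REFERENCES dim_geography (geo_key),
        date_key  INTEGER REFERENCES dim_date (date_key),
        amount    DOUBLE
    );
""")
```

Keeping the keys as plain integers (rather than natural keys like country names or raw dates) is what lets the fact table stay narrow while queries resolve attributes with a single join per dimension.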
ETL Pipeline
The extraction layer pulls from MongoDB (via PyMongo) and CSV sources, applying schema inference and type coercion. The transformation layer flattens nested JSON structures, engineers geographical dimensions from coordinate data (reverse geocoding and region classification), and derives temporal features from event timestamps. The load phase writes Snappy-compressed Parquet, partitioned by date for efficient range queries. A condensed sketch of the three phases follows.
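In this sketch the connection URI, database/collection names, file paths, and column names are all hypothetical, and the crude hemisphere rule stands in for the real reverse-geocoding step described above.

```python
import pandas as pd
from pymongo import MongoClient

# --- Extract: documents from MongoDB, rows from CSV (assumed sources) ---
client = MongoClient("mongodb://localhost:27017")           # assumed URI
docs = list(client["lake"]["events"].find({}, {"_id": 0}))  # assumed db/collection
csv_df = pd.read_csv("sources/events.csv", parse_dates=["event_date"])  # assumed path

# --- Transform: flatten nested JSON, engineer geo/time dimensions ---
mongo_df = pd.json_normalize(docs, sep="_")  # nested fields -> flat columns
# Assumes both sources share column names once the documents are flattened.
df = pd.concat([mongo_df, csv_df], ignore_index=True)
df["event_date"] = pd.to_datetime(df["event_date"])
df["date"] = df["event_date"].dt.date.astype(str)  # partition column
# Placeholder geo feature; the real pipeline reverse-geocodes coordinates.
df["hemisphere"] = df["latitude"].apply(lambda lat: "north" if lat >= 0 else "south")

# --- Load: columnar Parquet, Snappy-compressed, partitioned by date ---
df.to_parquet("lake/fact_events", engine="pyarrow",
              compression="snappy", partition_cols=["date"])
```

Partitioning on the string `date` column yields one `date=YYYY-MM-DD/` directory per day, which is what makes the range-pruned DuckDB queries shown earlier cheap.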