Star Schema Lakehouse.
A centralised lakehouse pipeline that unifies semi-structured sources — MongoDB documents, CSV flat files, deeply nested JSON — into a clean analytical layer. PyMongo + Pandas flatten and engineer derived dimensions; Parquet stores it columnar (10-50× compression); DuckDB queries it with zero infrastructure overhead.
01 / Schema design.
A Star Schema centres on fact tables (transactional events) surrounded by dimension tables (geography, time, entity attributes). The denormalised structure trades storage for read speed — analytical queries hit at most one join instead of seven.
+-------------------+
| dim_time |
| (year, qtr, mo) |
+---------+---------+
|
+--------------+ +-----+----------+ +---------------+
| dim_geo |----| fact_event |----| dim_entity |
| (region, | | (id, ts, | | (id, name, |
| country, | | amount, | | type, tags) |
| city, h3) | | fks to dims) | +---------------+
+--------------+ +-----+----------+
|
+---------+---------+
| dim_source |
| (CSV / Mongo / |
| JSON ingest) |
+-------------------+Surrogate keys link facts to dimensions. Source provenance is itself a dimension (dim_source), so lineage queries — "which records came from which ingest run?" — are one WHERE clause away.
02 / ETL pipeline.
Three phases, each idempotent and resumable. The transformation layer is the one that earns its keep — nested JSON is where most lakehouses fall over.
- 01Extract
PyMongo pulls from MongoDB with cursor-based batching. Pandas reads CSV with schema inference + type coercion. Both feed into a uniform
RawRecordshape before transformation. - 02Transform
Nested JSON gets flattened — arrays exploded, deeply-nested objects denormalised into columns. Coordinate pairs are reverse-geocoded into derived geographical dimensions (country, region, city, H3 hex). Temporal features (year, quarter, day-of-week) are pre-computed so the analytical layer doesn't recompute them on every query.
- 03Load
Write to Parquet with Snappy compression, partitioned by date for efficient range-scan queries. Each partition is a separate file that DuckDB can read without loading the whole table.
03 / Why these tools.
Snowflake, BigQuery, and Redshift could have done this. They'd also have added subscription cost, vendor lock-in, and network latency. The lakehouse-on-laptop combo earns its place by sitting one rung below the cloud DWH tier.
Columnar storage. Queries that touch 3 columns out of 40 read 7.5% of the file. Snappy compression gets 10-50× ratio vs raw JSON for typical event data. Native partition pruning.
Embedded analytical SQL engine. Reads Parquet natively, runs in-process, no daemon, no port, no auth. SELECT queries on millions of rows execute in milliseconds. Perfect for analytical workloads that don't need concurrency.
When a single machine isn't enough — PySpark reads the same Parquet partitions, distributes the workload, writes back to Parquet. The Spark + DuckDB split lets the same lakehouse serve both scale-up (laptop) and scale-out (cluster) consumers.
Source code.
Pipeline notebooks, schema definitions, and the Parquet/DuckDB query layer.