Case studyData engineering · Lakehouse architecture

Star Schema Lakehouse.

A centralised lakehouse pipeline that unifies semi-structured sources — MongoDB documents, CSV flat files, deeply nested JSON — into a clean analytical layer. PyMongo + Pandas flatten and engineer derived dimensions; Parquet stores it columnar (10-50× compression); DuckDB queries it with zero infrastructure overhead.

10-50×Parquet vs raw JSON
Zero infraDuckDB in-process SQL
Star SchemaFacts + dimensions
SnappyPer-partition compression
/ Stack
PythonPySparkMongoDBPyMongoDuckDBParquetPandas

01 / Schema design.

A Star Schema centres on fact tables (transactional events) surrounded by dimension tables (geography, time, entity attributes). The denormalised structure trades storage for read speed — analytical queries hit at most one join instead of seven.

                +-------------------+
                |   dim_time        |
                |   (year, qtr, mo) |
                +---------+---------+
                          |
+--------------+    +-----+----------+    +---------------+
| dim_geo      |----| fact_event     |----| dim_entity    |
| (region,     |    | (id, ts,       |    | (id, name,    |
|  country,    |    |  amount,       |    |  type, tags)  |
|  city, h3)   |    |  fks to dims)  |    +---------------+
+--------------+    +-----+----------+
                          |
                +---------+---------+
                |   dim_source      |
                |   (CSV / Mongo /  |
                |    JSON ingest)   |
                +-------------------+

Surrogate keys link facts to dimensions. Source provenance is itself a dimension (dim_source), so lineage queries — "which records came from which ingest run?" — are one WHERE clause away.

02 / ETL pipeline.

Three phases, each idempotent and resumable. The transformation layer is the one that earns its keep — nested JSON is where most lakehouses fall over.

  1. 01
    Extract

    PyMongo pulls from MongoDB with cursor-based batching. Pandas reads CSV with schema inference + type coercion. Both feed into a uniform RawRecord shape before transformation.

  2. 02
    Transform

    Nested JSON gets flattened — arrays exploded, deeply-nested objects denormalised into columns. Coordinate pairs are reverse-geocoded into derived geographical dimensions (country, region, city, H3 hex). Temporal features (year, quarter, day-of-week) are pre-computed so the analytical layer doesn't recompute them on every query.

  3. 03
    Load

    Write to Parquet with Snappy compression, partitioned by date for efficient range-scan queries. Each partition is a separate file that DuckDB can read without loading the whole table.

03 / Why these tools.

Snowflake, BigQuery, and Redshift could have done this. They'd also have added subscription cost, vendor lock-in, and network latency. The lakehouse-on-laptop combo earns its place by sitting one rung below the cloud DWH tier.

Parquet

Columnar storage. Queries that touch 3 columns out of 40 read 7.5% of the file. Snappy compression gets 10-50× ratio vs raw JSON for typical event data. Native partition pruning.

DuckDB

Embedded analytical SQL engine. Reads Parquet natively, runs in-process, no daemon, no port, no auth. SELECT queries on millions of rows execute in milliseconds. Perfect for analytical workloads that don't need concurrency.

PySpark (optional)

When a single machine isn't enough — PySpark reads the same Parquet partitions, distributes the workload, writes back to Parquet. The Spark + DuckDB split lets the same lakehouse serve both scale-up (laptop) and scale-out (cluster) consumers.

Source code.

Pipeline notebooks, schema definitions, and the Parquet/DuckDB query layer.