Case study · Geospatial · Predictive analytics

London Housing & Crime.

A spatio-temporal pipeline merging 1M+ records from London's housing market and Met Police crime data to construct an Opportunity Index — a composite score identifying neighborhoods where property values lag the underlying safety trajectory. An Optuna-tuned XGBoost regressor scores R² = 0.92 on held-out data.

1M+ · Records merged

R² 0.92 · Held-out test set

LSOA · Spatial join granularity

Optuna · 200-trial TPE search

/ Stack

Python · Scikit-learn · XGBoost · Optuna · Pandas · GeoPandas

01 / The Opportunity Index.

A composite metric that finds neighborhoods where crime is declining faster than prices are rising. That gap is an investment window — the area is becoming safer but hasn't been repriced yet. The index is the project's point of view; everything else exists to make it accurate.

Crime trajectory

Normalised rolling 12-month change in incident rate, weighted by severity. A neighborhood with falling crime contributes positively to the index.

Price momentum

Rolling 12-month price velocity. A neighborhood with rapidly rising prices contributes negatively — the market has already priced in the improvement.

Accessibility & greenspace

Transport proximity and parks-per-LSOA enter as moderators. Two equally safe neighborhoods with different accessibility profiles get different scores.
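
As a minimal sketch of how the three components above could combine, the snippet below standardises each input, subtracts price momentum from the safety gain, and adds the two moderators. The column names, the additive form, and the 0.25 weights are illustrative assumptions, not the project's actual formula.

```python
import pandas as pd

def zscore(s: pd.Series) -> pd.Series:
    """Standardise a column; a constant column maps to zeros."""
    std = s.std(ddof=0)
    if std == 0:
        return s * 0.0
    return (s - s.mean()) / std

def opportunity_index(df: pd.DataFrame, w_access: float = 0.25,
                      w_green: float = 0.25) -> pd.Series:
    """Hypothetical composite score (assumed form, not the author's).

    crime_change   : 12-month change in severity-weighted incident rate
                     (negative means crime is falling)
    price_velocity : 12-month price change (positive means prices rising)
    accessibility, greenspace : higher is better (assumed orientation)
    """
    safety_gain = -zscore(df["crime_change"])       # falling crime scores up
    price_momentum = zscore(df["price_velocity"])   # rising prices score down
    moderators = (w_access * zscore(df["accessibility"])
                  + w_green * zscore(df["greenspace"]))
    return safety_gain - price_momentum + moderators

# Toy data: A and B share crime and price trends, A has better transport.
toy = pd.DataFrame(
    {
        "crime_change": [-0.20, -0.20, 0.10],
        "price_velocity": [0.02, 0.02, 0.12],
        "accessibility": [1.0, 0.2, 0.5],
        "greenspace": [0.3, 0.3, 0.3],
    },
    index=["A", "B", "C"],
)
scores = opportunity_index(toy)
```

In this toy frame, areas A and B have identical crime and price trajectories, so the accessibility difference alone separates their scores; area C, with rising crime and fast-rising prices, ranks last.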

02 / Data pipeline.

Two source datasets, one spatial join, four engineered feature families. Missing data is handled by spatial interpolation rather than simple imputation: adjacent LSOAs share enough socioeconomic structure that a kNN fill from neighbouring areas outperforms a column-mean fill.
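
The kNN fill can be sketched as follows, assuming each LSOA is represented by its centroid in a projected coordinate system and that a gap is filled with the unweighted mean of its k nearest observed neighbours. The function name, k = 3, and the toy coordinates are illustrative choices rather than the project's code.

```python
import numpy as np

def knn_spatial_fill(coords, values, k=3):
    """Fill NaNs in `values` with the mean of the k nearest observed points.

    coords : (n, 2) array of centroid coordinates (assumed projected, e.g. metres)
    values : length-n array with NaN where the feature is missing
    """
    coords = np.asarray(coords, dtype=float)
    filled = np.asarray(values, dtype=float).copy()
    observed = ~np.isnan(filled)
    obs_coords, obs_values = coords[observed], filled[observed]
    for i in np.flatnonzero(~observed):
        # Euclidean distance from the gap to every observed centroid
        dist = np.linalg.norm(obs_coords - coords[i], axis=1)
        nearest = np.argsort(dist)[:k]
        filled[i] = obs_values[nearest].mean()
    return filled

# Toy layout: three nearby areas, two distant ones, and one gap near the first three.
coords = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (0.5, 0.5)]
values = [100.0, 110.0, 90.0, 300.0, 310.0, np.nan]
filled = knn_spatial_fill(coords, values, k=3)
column_mean = np.nanmean(values)
```

Here the gap takes its value from the three adjacent areas, while a column-mean fill would be pulled towards the two distant, much higher values.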

01
Ingest
Land Registry Price Paid Data (housing transactions, 2010–present) + Met Police monthly crime CSV exports. Both are wide and dirty.
02
Spatial join
Both datasets resolve to LSOA polygons (Lower Layer Super Output Areas: roughly 1,500 residents each, a standard small-area statistical unit in England and Wales). The join is a point-in-polygon operation with GeoPandas.
03
Feature engineering
Rolling crime rates (12-month / 24-month), seasonal decomposition (STL), price velocity (year-on-year %), accessibility (km to nearest Tube / Overground station), greenspace ratio. ~40 features per LSOA-month.
04
Optuna search
200 TPE trials over max_depth, learning_rate, subsample, colsample_bytree, reg_alpha, reg_lambda. Objective: temporal-split R² (train on past, test on recent).
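
Part of step 03 can be sketched with pandas. The snippet below derives the 12-month and 24-month rolling crime rates and the year-on-year price velocity for a panel with one row per LSOA-month. The column names, the per-1,000-residents rate, and the assumption that every LSOA has a complete run of months are illustrative; the STL, accessibility, and greenspace features are left out.

```python
import pandas as pd

def add_rolling_features(panel: pd.DataFrame) -> pd.DataFrame:
    """Add rolling crime rates and year-on-year price velocity per LSOA.

    Expects columns: lsoa, month, incidents, population, median_price
    (hypothetical names). Windows count rows, so months must be gap-free.
    """
    out = panel.sort_values(["lsoa", "month"]).reset_index(drop=True)
    out["crime_rate"] = out["incidents"] / out["population"] * 1000
    for window in (12, 24):
        out[f"crime_rate_{window}m"] = (
            out.groupby("lsoa")["crime_rate"]
            .transform(lambda s, w=window: s.rolling(w, min_periods=w).mean())
        )
    # Price twelve rows (months) earlier within the same LSOA
    prev_year = out.groupby("lsoa")["median_price"].shift(12)
    out["price_velocity_yoy"] = (out["median_price"] / prev_year - 1) * 100
    return out

# Toy panel: one placeholder LSOA, flat crime, a 10% price step after year one.
toy = pd.DataFrame(
    {
        "lsoa": "LSOA_A",
        "month": pd.date_range("2022-01-01", periods=24, freq="MS"),
        "incidents": 10,
        "population": 1000,
        "median_price": [100_000] * 12 + [110_000] * 12,
    }
)
features = add_rolling_features(toy)
```

Setting min_periods equal to the window keeps early months as missing rather than reporting a rate computed from a partial window.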

03 / What R² = 0.92 actually means.

The headline number is meaningful only with the right test methodology. Two design choices keep it honest.

Temporal split, not random

Training data is everything pre-2024; test data is 2024 onward.
Random k-fold would leak future information backwards — the model would memorise the period instead of generalising.
The temporal split simulates the actual deployment scenario: predict tomorrow given everything up to today.
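
A minimal sketch of the temporal split and the metric, assuming a panel with a month column and a 2024-01-01 cutoff. The column names and the toy predictions are placeholders, not project outputs.

```python
import numpy as np
import pandas as pd

def temporal_split(panel: pd.DataFrame, cutoff: str, date_col: str = "month"):
    """Rows strictly before `cutoff` train the model; the rest are held out."""
    is_test = panel[date_col] >= pd.Timestamp(cutoff)
    return panel[~is_test], panel[is_test]

def r_squared(y_true, y_pred) -> float:
    """R² = 1 - SS_res / SS_tot, computed on the held-out rows only."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy panel spanning the cutoff; values are made up for illustration.
toy = pd.DataFrame(
    {
        "month": pd.date_range("2023-10-01", periods=6, freq="MS"),
        "price_index": [1.0, 2.0, 2.5, 3.0, 4.0, 5.0],
    }
)
train, test = temporal_split(toy, cutoff="2024-01-01")
toy_predictions = [3.5, 4.0, 4.5]  # stand-in for model output on the test rows
score = r_squared(test["price_index"], toy_predictions)
```

Because the score is computed only on rows at or after the cutoff, nothing the model saw during fitting contributes to it.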

Spatial integrity

Adjacent LSOAs are not independent: crime and price both show strong spatial autocorrelation.
Test LSOAs are held out as contiguous regions, not individual cells, so the model can't cheat by interpolating from neighbours it's seen in training.
R² = 0.92 under this regime reflects generalisation to unseen periods and areas, not memorisation.
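
One way to hold out contiguous regions is to bin LSOA centroids into coarse grid blocks and assign whole blocks to the test set. The sketch below assumes projected centroid coordinates in metres; the square-grid blocking, the block size, and the 25% test fraction are assumptions, since the source does not say how the regions were drawn.

```python
import numpy as np

def block_ids(x, y, block_size):
    """Label each centroid with the coarse grid block it falls in."""
    bx = np.floor(np.asarray(x, dtype=float) / block_size).astype(int)
    by = np.floor(np.asarray(y, dtype=float) / block_size).astype(int)
    return np.array([f"{i}_{j}" for i, j in zip(bx, by)])

def spatial_block_split(x, y, block_size, test_fraction=0.25, seed=0):
    """Hold out entire blocks so test cells sit together, not scattered."""
    blocks = block_ids(x, y, block_size)
    unique = np.unique(blocks)
    rng = np.random.default_rng(seed)
    n_test = max(1, int(round(test_fraction * len(unique))))
    test_blocks = rng.choice(unique, size=n_test, replace=False)
    is_test = np.isin(blocks, test_blocks)
    return ~is_test, is_test, blocks

# Toy layout: a 4 x 4 grid of centroids 1 km apart, grouped into 2 km blocks.
xs, ys = np.meshgrid(np.arange(4) * 1000 + 500, np.arange(4) * 1000 + 500)
train_mask, test_mask, blocks = spatial_block_split(
    xs.ravel(), ys.ravel(), block_size=2000
)
```

With this layout one of the four blocks is held out whole, so no test cell shares a block with a training cell, although cells on a block edge still border training cells in the next block.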

Code + data.

Full notebooks, the Optuna study, and the LSOA-level Opportunity Index outputs live in the repo.

github.com/sirdath/DS ↗