London Housing & Crime.
A spatio-temporal pipeline merging 1M+ records from London's housing market and Met Police crime data to construct an Opportunity Index — a composite score identifying neighborhoods where property values lag the underlying safety trajectory. An Optuna-tuned XGBoost regressor scores R² = 0.92 on held-out data.
01 / The Opportunity Index.
A composite metric that finds neighborhoods where crime is declining faster than prices are rising. That gap is an investment window — the area is becoming safer but hasn't been repriced yet. The index is the project's point of view; everything else exists to make it accurate.
The safety term is the normalised rolling 12-month change in incident rate, weighted by severity. A neighborhood with falling crime contributes positively to the index.
The price term is the rolling 12-month price velocity. A neighborhood with rapidly rising prices contributes negatively, because the market has already priced in the improvement.
Transport proximity and parks-per-LSOA enter as moderators, so two equally safe neighborhoods with different accessibility profiles get different scores.
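One way to read the three components above as a formula, offered as a minimal sketch rather than the project's actual implementation: standardise each component across neighborhoods, add the safety term, subtract the price term, and let accessibility shift the result. The additive moderator, the default weights, and every name here (`opportunity_index`, `w_crime`, `w_price`, `moderator_strength`) are assumptions made for illustration.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardise a list of numbers to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All areas identical on this component: it carries no signal.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def opportunity_index(crime_change, price_velocity, accessibility,
                      w_crime=1.0, w_price=1.0, moderator_strength=0.25):
    """Score each neighborhood; higher means safer but not yet repriced.

    crime_change:   12-month change in severity-weighted incident rate
                    (negative means crime is falling)
    price_velocity: 12-month price change (positive means prices are rising)
    accessibility:  any "higher is better" amenity score
    """
    z_crime = zscores(crime_change)
    z_price = zscores(price_velocity)
    z_access = zscores(accessibility)
    scores = []
    for zc, zp, za in zip(z_crime, z_price, z_access):
        # Falling crime (negative zc) raises the score; rising prices lower it.
        base = -w_crime * zc - w_price * zp
        # Assumed moderator form: accessibility shifts the score additively.
        scores.append(base + moderator_strength * za)
    return scores


# Three hypothetical neighborhoods.
scores = opportunity_index(
    crime_change=[-0.20, -0.05, 0.10],
    price_velocity=[0.02, 0.10, 0.03],
    accessibility=[0.8, 0.8, 0.3],
)
```

With these made-up inputs the first neighborhood, which has the sharpest fall in crime and the slowest price growth, ranks highest.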
02 / Data pipeline.
Two source datasets, one spatial join, four engineered feature families. Missing data is handled by spatial interpolation rather than simple imputation: adjacent LSOAs share enough socioeconomic structure that a kNN fill from neighbouring areas outperforms a column-mean fill.
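The kNN fill can be sketched as follows. This is an illustration under assumptions, not the pipeline's code: it uses centroid distance as the notion of "nearest", an unweighted mean, and hypothetical area IDs; the function name `knn_spatial_fill` and the default `k=3` are also made up.

```python
import math


def knn_spatial_fill(centroids, values, k=3):
    """Fill missing values with the mean of the k nearest observed areas.

    centroids: {area_id: (x, y)} in a projected CRS (metres), so that
               Euclidean distance is meaningful
    values:    {area_id: float or None}; None marks a missing value
    """
    observed = {a: v for a, v in values.items() if v is not None}
    if not observed:
        raise ValueError("no observed values to interpolate from")
    filled = dict(values)
    for area, value in values.items():
        if value is not None:
            continue
        ax, ay = centroids[area]
        # Rank observed areas by centroid distance to the gap, keep the k closest.
        nearest = sorted(
            observed,
            key=lambda other: math.hypot(centroids[other][0] - ax,
                                         centroids[other][1] - ay),
        )[:k]
        filled[area] = sum(observed[n] for n in nearest) / len(nearest)
    return filled


# Hypothetical areas: E is missing and sits between A, B and C; D is far away.
centroids = {"A": (0, 0), "B": (1, 0), "C": (0, 1), "D": (10, 10), "E": (0.5, 0.5)}
values = {"A": 10.0, "B": 12.0, "C": 14.0, "D": 100.0, "E": None}
filled = knn_spatial_fill(centroids, values, k=3)
```

A column-mean fill would pull E towards the distant outlier D; the neighbour-based fill ignores it.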
- 01 Ingest
Land Registry Price Paid Data (housing transactions, 2010–present) + Met Police monthly crime CSV exports. Both are wide and dirty.
- 02 Spatial join
Both datasets resolve to LSOA polygons (Lower Layer Super Output Areas — ~1,500 residents each, the standard UK statistical unit). The join is a point-in-polygon operation with GeoPandas.
- 03 Feature engineering
Rolling crime rates (12-month / 24-month), seasonal decomposition (STL), price velocity (year-on-year %), accessibility (km to nearest tube / overground), greenspace ratio. ~40 features per LSOA-month.
- 04 Optuna search
200 TPE trials over max_depth, learning_rate, subsample, colsample_bytree, reg_alpha, reg_lambda. Objective: temporal-split R² (train on past, test on recent).
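The search step might look roughly like the sketch below. The six parameter names and the 200 TPE trials come from the description above; the ranges, the log scales, the seed, and the helper names (`suggest_params`, `run_search`, `score_fn`) are assumptions, and `score_fn` stands in for whatever fits XGBoost on the past and returns R² on the recent period.

```python
def suggest_params(trial):
    """Sample one XGBoost configuration from an assumed search space."""
    return {
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 10.0, log=True),
    }


def run_search(score_fn, n_trials=200, seed=0):
    """Maximise score_fn(params) with Optuna's TPE sampler.

    score_fn is a hypothetical callable: it should train on the pre-cutoff
    data with the given params and return R² on the post-cutoff data.
    """
    import optuna  # imported lazily so suggest_params works without Optuna

    study = optuna.create_study(
        direction="maximize",
        sampler=optuna.samplers.TPESampler(seed=seed),
    )
    study.optimize(lambda trial: score_fn(suggest_params(trial)), n_trials=n_trials)
    return study.best_params, study.best_value
```

Keeping the search space in its own function makes it easy to inspect or unit-test without running a full study.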
03 / What R² = 0.92 actually means.
The headline number is meaningful only with the right test methodology. Two design choices keep it honest.
Temporal split, not random
- Training data is everything pre-2024; test data is 2024 onward.
- Random k-fold would leak future information backwards — the model would memorise the period instead of generalising.
- The temporal split simulates the actual deployment scenario: predict tomorrow given everything up to today.
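The temporal split and the score it produces can be sketched in a few lines. The record layout (a dict with a `month` field) and the function names are assumptions for illustration; the 2024 cutoff is the one stated above.

```python
from datetime import date


def temporal_split(rows, cutoff):
    """Split records into (train, test); nothing dated on or after cutoff is trained on."""
    train = [r for r in rows if r["month"] < cutoff]
    test = [r for r in rows if r["month"] >= cutoff]
    return train, test


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical LSOA-month records.
rows = [
    {"month": date(2023, 11, 1), "price": 410_000},
    {"month": date(2023, 12, 1), "price": 415_000},
    {"month": date(2024, 1, 1), "price": 420_000},
    {"month": date(2024, 2, 1), "price": 423_000},
]
train, test = temporal_split(rows, cutoff=date(2024, 1, 1))
```

Any rolling features have to be computed with the same discipline: a 12-month window for a test-period row may look back into the training period, but never forward.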
Spatial integrity
- Adjacent LSOAs are not independent — there's strong spatial autocorrelation in crime + price.
- Test LSOAs are held out as contiguous regions, not individual cells, so the model can't cheat by interpolating from neighbours it's seen in training.
- R² = 0.92 under this regime is genuinely about generalisation, not memorisation.
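A minimal sketch of a region-level hold-out, assuming regions are cells of a square grid laid over LSOA centroids. How the project actually draws its contiguous regions is not stated, so the grid, the 5 km block size, the 20% test fraction, and the function names are all illustrative.

```python
import math
import random


def assign_blocks(centroids, block_size_m=5000.0):
    """Map each area to the grid cell (block) that contains its centroid."""
    return {
        area: (math.floor(x / block_size_m), math.floor(y / block_size_m))
        for area, (x, y) in centroids.items()
    }


def spatial_block_split(centroids, test_fraction=0.2, block_size_m=5000.0, seed=0):
    """Hold out whole blocks, so no test area shares a block with a training area."""
    blocks = assign_blocks(centroids, block_size_m)
    unique_blocks = sorted(set(blocks.values()))
    rng = random.Random(seed)
    rng.shuffle(unique_blocks)
    # Always hold out at least one block.
    n_test = max(1, round(test_fraction * len(unique_blocks)))
    test_blocks = set(unique_blocks[:n_test])
    train = [a for a, b in blocks.items() if b not in test_blocks]
    test = [a for a, b in blocks.items() if b in test_blocks]
    return train, test


# Hypothetical centroids on a 10 km x 10 km area, in projected metres.
centroids = {
    f"area_{i}_{j}": (i * 1000.0 + 500.0, j * 1000.0 + 500.0)
    for i in range(10)
    for j in range(10)
}
train_areas, test_areas = spatial_block_split(centroids, test_fraction=0.25)
```

Areas on a block edge still border training areas, so a stricter variant would also drop a buffer ring of training LSOAs around each held-out block.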
Code + data.
Full notebooks, the Optuna study, and the LSOA-level Opportunity Index outputs live in the repo.