Projects
Spring 2022
Exploring Urban Data with Machine Learning
Columbia GSAPP
Prof. Boyeong Hong

Team:
Kirthi Balakrishnan,
Kit Nga Chou,
Lizzie Lee,
Michelle Chen

An ML model trained on 600+ neighborhoods predicts walkability from street connectivity and transit density — giving planners a score for any location, even without pre-calculated metrics.

01

Approach

Method
Machine Learning
Walkscore API
Street Network Analysis

Walkscore is a great metric — when it exists. For neighborhoods Walkscore.com hasn't pre-computed, planners are stuck. We trained an ML model that predicts a walkability score from street network shape and transit density, so any neighborhood can have one.

02

Data Pipeline

Reverse-engineering Walkscore.com's methodology

Three inputs feed the model: road network screenshots from Google Maps (classified with Keras), bus stop locations from the Overpass API, and intersection node counts pulled from OpenStreetMap. We trained on six cities — Boulder, Ann Arbor, Chicago, Washington DC, New York, and San Francisco — and validated on three more: Madison, Seattle, and Tulsa.
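The two OpenStreetMap-derived inputs can be sketched roughly as follows. This is a minimal illustration, not the team's actual pipeline: the function names are hypothetical, the query uses the standard OSM tag `highway=bus_stop`, and treating an intersection as "a node shared by two or more ways" is an assumption about how the counts were defined.

```python
from collections import Counter


def bus_stop_query(south, west, north, east):
    """Build an Overpass QL query for bus stop nodes inside a bounding box.

    Overpass expects the bounding box as (south, west, north, east).
    """
    bbox = f"{south},{west},{north},{east}"
    return f'[out:json];node["highway"="bus_stop"]({bbox});out;'


def count_intersections(ways):
    """Count nodes that appear in two or more ways (assumed definition).

    `ways` is a list of node-ID lists, one list per street.
    """
    usage = Counter()
    for node_ids in ways:
        # A closed way can list the same node twice; count it once per way.
        usage.update(set(node_ids))
    return sum(1 for n in usage.values() if n >= 2)


# Toy network: two streets crossing at node 3, plus a disconnected street.
ways = [[1, 2, 3, 4], [5, 3, 6], [7, 8]]
n_intersections = count_intersections(ways)

# Illustrative bounding box only; not taken from the project.
query = bus_stop_query(38.79, -77.12, 38.99, -76.91)
```

The query string would be sent to an Overpass endpoint, and the returned nodes counted per neighborhood boundary; that step is omitted here.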

Intersection density heatmaps and node maps for Washington DC, New York City, and San Francisco
DENSITY MAPS
Intersection density heatmaps (top) and raw node extractions (bottom) for three training cities.
Intersection nodes extracted from OpenStreetMap for Washington DC
INTERSECTION NODES
OpenStreetMap nodes extracted and their densities calculated per neighborhood.
Data pipeline workflow diagram
DATA PIPELINE
Overview of the data extraction and processing workflow integrating multiple open datasets.
03

The Tool

Urban Mobility Index interactive tool — walkability scoring across neighborhoods
INTERACTIVE TOOL
Input any US address to get a predicted walkability score with feature importance breakdown.
04

Model Results

We evaluated three clustering algorithms (K-Means, Agglomerative, and Gaussian Mixture); Gaussian Mixture yielded the most realistic urban clusters. A linear regression with bus stop and intersection densities as predictors reached an RMSE of 17.04 and an R-squared of 0.38 against a mean Walk Score of 71.07, meaning these two features explain 38% of the variance in walkability.
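For readers less familiar with the two metrics, here is a small sketch of how RMSE and R-squared are computed from actual and predicted scores. The numbers are made up for illustration and are not the project's data; the function names are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction miss."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Share of variance in `actual` that the predictions account for."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up Walk Scores, for illustration only.
actual = [50.0, 60.0, 70.0, 80.0, 90.0]
predicted = [55.0, 60.0, 65.0, 80.0, 90.0]

error = rmse(actual, predicted)
fit = r_squared(actual, predicted)
```

Read this way, an RMSE of 17.04 means a typical prediction is off by roughly 17 Walk Score points, and an R-squared of 0.38 means 62% of the variation is left unexplained by the two density features.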

K-Means clustering results visualization
K-MEANS
K-Means — clean, fast, but oversimplifies the messier urban cores.
Agglomerative clustering results visualization
AGGLOMERATIVE
Agglomerative — picks up nested structure but produces some lopsided clusters.
Gaussian Mixture clustering results visualization
GAUSSIAN MIXTURE
Gaussian Mixture wins. The boundaries actually match what you see on the ground.
Prepared dataset of 584 NYC neighborhoods with Walk Score, population, area, bus and intersection densities
DATASET
Prepared dataset: 584 neighborhoods with Walk Score, area, population, and bus stop and intersection densities per sq km and per 1,000 residents.
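The two density normalizations in the dataset can be computed along these lines. This is a sketch under assumptions: the function name and the example neighborhood are hypothetical, and the project's actual column names are not shown in the source.

```python
def densities(count, area_sqkm, population):
    """Return (count per sq km, count per 1,000 residents)."""
    per_sqkm = count / area_sqkm
    per_1000 = count * 1000 / population
    return per_sqkm, per_1000


# Hypothetical neighborhood: 120 intersections, 2.5 sq km, 30,000 residents.
per_sqkm, per_1000 = densities(120, 2.5, 30_000)
```

Normalizing by both area and population lets the model distinguish a compact, busy neighborhood from a large, sparsely populated one that has the same raw count.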
05

Limitations & Next Steps

The training cities skew dense and coastal — the model struggles on cities outside that distribution. The R-squared of 0.38 says it: bus stops and intersection density only explain part of what makes a place walkable. To get serious, the next version needs more cities, more input variables (block size, street trees, sidewalk width), and a usable web frontend so a planner can paste an address and get a number back.
