Student perspectives: How can we do data science without all of our data?

A post by Daniel Williams, Compass PhD student.

Imagine that you are employed by Chicago’s city council and are tasked with estimating the mean locations of reported crimes in the city. The data you are given only go up to the city’s borders, even though crime does not suddenly stop at this artificial boundary. As a data scientist, how would you estimate these centres within the city? Your measurements are cut off by a very complex border, so regular methods such as maximum likelihood, applied as if the data were fully observed, would not be appropriate.

Figure 1: Homicides in the city of Chicago in 2008. Left: locations of each homicide. Right: a density estimate of the same crimes, highlighting where the ‘hotspots’ are.

This is an example of a more general problem in statistics known as truncated probability density estimation. How do we estimate the parameters of a statistical model when data are not fully observed and are cut off by some artificial boundary?
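
To get a feel for why truncation matters, here is a minimal one-dimensional sketch. It is a toy illustration only, not the Chicago data and not the method discussed in this post. It assumes a Gaussian with a known true mean and a single hypothetical cut-off (`boundary`), and the function name and parameters are made up for the example. The sample mean, which is the maximum likelihood estimate of a Gaussian mean when the data are fully observed, is computed using only the points that fall inside the boundary.

```python
import random
from statistics import fmean

def naive_mean_under_truncation(true_mean=1.0, boundary=1.5, n=100_000, seed=0):
    """Draw from N(true_mean, 1), discard everything at or beyond `boundary`,
    and return the sample mean of the surviving points and how many survived.

    All names and default values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    samples = [rng.gauss(true_mean, 1.0) for _ in range(n)]
    # Points past the boundary are never observed, just like crimes outside the city.
    observed = [x for x in samples if x < boundary]
    return fmean(observed), len(observed)

naive_mean, n_observed = naive_mean_under_truncation()
print(f"true mean: 1.0, naive estimate: {naive_mean:.3f} from {n_observed} points")
```

The naive estimate is pulled well below the true mean, because every point above the cut-off is missing. In one dimension with a single threshold this bias can be corrected analytically, but a two-dimensional region with an irregular border like a city boundary offers no such simple fix.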
