Student Perspectives: Change in the air: Tackling Bristol’s nitrogen oxide problem

A post by Dom Owens, PhD student on the Compass programme.

“Air pollution kills an estimated seven million people worldwide every year” – World Health Organisation

Many particulates and chemicals are present in the air in urban areas like Bristol, and this poses a serious risk to our respiratory health. It is difficult to model how these concentrations behave over time due to the complex physical, environmental, and economic factors they depend on, but identifying if and when abrupt changes occur is crucial for designing and evaluating public policy measures, as outlined in the local Air Quality Annual Status Report.  Using a novel change point detection procedure to account for dependence in time and space, we provide an interpretable model for nitrogen oxide (NOx) levels in Bristol, telling us when these structural changes occur and describing the dynamics driving them in between.

Model and Change Point Detection

We model the data with a piecewise-stationary vector autoregression (VAR) model:

$\boldsymbol{Y}_{t} = \boldsymbol{\mu}^{(j)} + \sum_{i=1}^{p} \boldsymbol{A}_i^{(j)} \boldsymbol{Y}_{t-i} + \boldsymbol{\varepsilon}_{t}, \quad k_{j-1} < t \leq k_{j}.$

In between change points the time series $\boldsymbol{Y}_{t}$, a $d$-dimensional vector, depends on itself linearly over $p \geq 1$ previous time steps through parameter matrices $\boldsymbol{A}_i^{(j)}, i=1, \dots, p$ with intercepts $\boldsymbol{\mu}^{(j)}$, but at unknown change points $k_j, j = 1, \dots, q$ the parameters switch abruptly (with the convention $k_0 = 0$ and $k_{q+1} = n$). $\{ \boldsymbol{\varepsilon}_{t} \in \mathbb{R}^d : t \geq 1 \}$ are white noise errors, and we have $n$ observations.
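As a toy illustration of the model, a piecewise-stationary VAR(1) can be simulated as below; the dimensions, segment lengths, and coefficient matrices here are arbitrary choices for demonstration, not values from the analysis.

```python
import numpy as np

def simulate_pw_var(mus, As, seg_lens, sigma=0.1, seed=0):
    """Simulate a piecewise-stationary VAR(1): within segment j,
    Y_t = mu^(j) + A^(j) Y_{t-1} + eps_t, and the parameters
    switch abruptly at the change points between segments."""
    rng = np.random.default_rng(seed)
    d = len(mus[0])
    y = np.zeros(d)
    rows = []
    for mu, A, n in zip(mus, As, seg_lens):
        for _ in range(n):
            y = mu + A @ y + sigma * rng.standard_normal(d)
            rows.append(y)
    return np.array(rows)

# Two segments of a 2-dimensional series with an artificial change
# point at t = 200: both the intercept and the cross-dependence
# structure switch there.
mus = [np.zeros(2), np.array([0.5, -0.5])]
As = [0.3 * np.eye(2), np.array([[0.2, 0.6], [0.6, 0.2]])]
series = simulate_pw_var(mus, As, seg_lens=[200, 200])
```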

We phrase change point detection as a test of the hypotheses $H_0: q = 0$ vs. $H_1: q \geq 1$, i.e. the null states there is no change, and the alternative supposes there are possibly multiple changes. To test this, we use moving sum (MOSUM) statistics $T_k(G)$ extracted from the model; these compare the scaled difference between prediction errors for $G$ steps before and after $k$; when $k$ is in a stationary region, $T_k(G)$  will be close to zero, but when $k$ is near or at a change point, $T_k(G)$  will be large. If the maximum of these exceeds a threshold suggested by the distribution under the null, we reject $H_0$ and estimate change point locations with local peaks of $T_k(G)$.
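To make the idea concrete, here is a mean-change simplification of the MOSUM statistic; the statistics in the post compare prediction errors from the fitted VAR, whereas this sketch compares plain sums of a univariate series, and the scale estimate is deliberately crude.

```python
import numpy as np

def mosum_stats(x, G):
    """T_k(G): scaled difference between the sums of the G
    observations after k and the G observations before k.
    Close to zero in stationary stretches; large near a change."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    sigma = np.std(x, ddof=1)          # crude global scale estimate
    cs = np.concatenate([[0.0], np.cumsum(x)])
    T = np.zeros(n)
    for k in range(G, n - G):
        left = cs[k] - cs[k - G]
        right = cs[k + G] - cs[k]
        T[k] = abs(right - left) / (np.sqrt(2 * G) * sigma)
    return T

rng = np.random.default_rng(1)
# Mean shift from 0 to 2 at t = 300:
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(2, 1, 300)])
T = mosum_stats(x, G=100)
k_hat = int(np.argmax(T))   # local peak estimates the change point
```

If `max(T)` exceeds the null-calibrated threshold, we reject $H_0$ and read off change point estimates from the local peaks.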

Air Quality Data

With data from Open Data Bristol over the period from January 2010 to March 2021, we have hourly readings of NOx levels at five locations (with Site IDs) around the city of Bristol, UK: AURN St Paul’s (452); Brislington Depot (203); Parson Street School (215); Wells Road (270); Fishponds Road (463). Taking daily averages, we control for meteorological and seasonal effects such as temperature, wind speed and direction, and the day of the week with linear regression, then analyse the residuals.
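The adjustment step can be sketched as follows; the covariates and their coefficients are synthetic stand-ins for the meteorological and calendar variables named above, not the real Bristol data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
# Synthetic stand-ins for the meteorological covariates:
temp = rng.normal(12.0, 5.0, n)          # daily mean temperature
wind = rng.gamma(2.0, 2.0, n)            # daily mean wind speed
dow = np.eye(7)[np.arange(n) % 7]        # day-of-week dummies
nox = 40.0 - 0.8 * temp - 2.0 * wind + rng.normal(0.0, 3.0, n)

# Linear regression on the covariates (dropping one dummy to avoid
# collinearity with the intercept); the residuals, stripped of these
# seasonal effects, are what the change point procedure analyses.
X = np.column_stack([np.ones(n), temp, wind, dow[:, 1:]])
beta, *_ = np.linalg.lstsq(X, nox, rcond=None)
residuals = nox - X @ beta
```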

We use the bandwidth $G=280$ days to ensure estimation accuracy relative to the number of parameters, which effectively asserts that two changes cannot occur in a shorter time span. The MOSUM procedure rejects the null hypothesis and detects ten changes, pictured above. We might attribute the first few to the Joint Local Transport Plan which began in 2011, while the later changes may be due to policy implemented after a Council-supported motion in November 2016. The image below visualises the estimated parameters around the change point in November 2016; we can see that in segment seven there is only mild cross-dependence, but in segment eight the readings at Wells Road, St Paul’s, and Fishponds Road become strongly dependent on the other series.

Scientists have generally worked under the belief that these concentration series have long-memory behaviour, meaning values long in the past influence today’s values, and that this explains why the autocorrelation function (ACF) decays slowly, as seen above. Perhaps the most interesting conclusion we can draw from this analysis is that structural changes explain this slow decay – the image below displays much shorter-range dependence for one such stationary segment.
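A toy illustration of this point with synthetic data: a single undetected level shift makes the sample ACF of the full series decay slowly, while within a stationary segment the ACF dies out quickly.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation function up to lag nlags."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(nlags + 1)])

rng = np.random.default_rng(5)
n = 1000
ar = np.zeros(n)                   # short-memory AR(1) noise
for t in range(1, n):
    ar[t] = 0.3 * ar[t - 1] + rng.normal()
# Level shift of 3 halfway through, mimicking a structural change:
shifted = ar + np.where(np.arange(n) >= n // 2, 3.0, 0.0)

acf_full = sample_acf(shifted, 50)       # slow decay: spurious long memory
acf_segment = sample_acf(ar[:n // 2], 50)  # fast decay within a segment
```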

Conclusion

After all our work, we have a simple, interpretable model for NOx levels. In reality, physical processes often depend on each other in a non-linear fashion, and according to the geographical distances between where they are measured; accounting for this might provide better predictions, or tell a slightly different story. Moreover, there should be interactions with the other pollutants present in the air. Could we combine these for a large-scale, comprehensive air quality model? Perhaps the most important question to ask, however, is how this analysis can be used to benefit policy makers. If we could combine this with a causal model, we might be able to identify policy outcomes, or to draw comparisons with other regions.

Student perspectives: Wessex Water Industry Focus Lab

A post by Michael Whitehouse, PhD student on the Compass programme.

Introduction

September saw the first of an exciting new series of Compass industry focus labs; with this came the chance to make use of the extensive skill sets acquired throughout the course and an opportunity to provide solutions to pressing issues of modern industry. The partner for the first focus lab, Wessex Water, posed the following question: given time series data on water flow levels in pipes, can we detect if new leaks have occurred? Given the inherent value of clean water available at the point of use and the detriments of leaking this vital resource, the challenge of ensuring an efficient system of delivery is of great importance. Hence, finding an answer to this question has the potential to provide huge economic, political, and environmental benefits for a large base of service users.

Data and Modelling:

The dataset provided by Wessex Water consisted of water flow data spanning around 760 pipes. After this data was cleaned and processed, some useful series, such as minimum nightly and average daily flow (MNF and ADF respectively), were extracted. Preliminary analysis carried out by our collaborators at Wessex Water concluded that certain types of changes in the structure of water flow data provide good indications that a leak has occurred. From this one can postulate that detecting a leak amounts to detecting these structural changes in the data. Using this principle, we began to build a framework for solutions: detect the change, detect a new leak. Change point detection is a well-researched discipline that provides efficient methods for detecting statistically significant changes in the distribution of a time series, and hence a toolbox with which to tackle the problem. Indeed, we at Compass have our very own active member of the change point detection research community in the shape of Dom Owens. The preliminary analysis indicated three types of structural change in water flow series that signal a leak: a change in the mean of the MNF, a change in trend of the MNF, and a change in the variance of the difference between the MNF and ADF. In order to detect these changes with an algorithm, we would need to transform the given series so that the original change in distribution corresponded to a change in the mean of the transformed series. These transforms included calculating generalised additive model (GAM) residuals and analysing their distribution. An example of such a GAM is given by:

$\mathbb{E}[\text{flow}_t] = \beta_0 + \sum_{i=1}^m f_i(x_i),$

where the $x_i$'s are features we want to use to predict the flow, such as the time of day or current season. The principle behind this analysis is that any change in the residual distribution corresponds to a violation of the assumption that the residuals are independently and identically distributed, and hence, in turn, corresponds to a deviation from the original structure we fit our GAM to.
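A minimal numerical sketch of this residual principle, using a harmonic fit of the daily cycle as a stand-in for the GAM smooths (a real GAM would use penalised splines, e.g. via R's mgcv); the flow pattern and leak size are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(24 * 60)                 # hourly readings over 60 days
hour = t % 24
# Smooth daily flow pattern plus noise; from day 30 an (invented)
# leak adds a constant offset the seasonal structure cannot explain.
flow = (10.0 + 3.0 * np.sin(2 * np.pi * hour / 24)
        + rng.normal(0.0, 0.5, t.size))
flow[t >= 24 * 30] += 2.0

# Harmonic regression standing in for the GAM smooth f(hour):
X = np.column_stack([np.ones(t.size),
                     np.sin(2 * np.pi * hour / 24),
                     np.cos(2 * np.pi * hour / 24)])
beta, *_ = np.linalg.lstsq(X, flow, rcond=None)
resid = flow - X @ beta

# If the fitted structure still held, the residuals would look i.i.d.
# around zero; instead their level shifts when the leak starts.
before = resid[t < 24 * 30].mean()
after = resid[t >= 24 * 30].mean()
```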

Figure 1: GAM residual plot. Red lines correspond to detected changes in distribution, green lines indicate a repair took place.

A Change Point Detection Algorithm:

In order to detect changes in real time we would need an online change point detection algorithm; after evaluating the existing literature, we elected to follow the mean change detection procedure described in [Wang and Samworth, 2016]. The user-end procedure is as follows:

1. Calculate mean estimate $\hat{\mu}$ on some data we assume is stationary.
2. Feed a new observation into the algorithm. Calculate test statistics based on new data.
3. Repeat (2) until any test statistics exceed a threshold at which point we conclude a mean change has been detected. Return to (1).
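The three steps above can be sketched as a simple two-sided CUSUM-style detector. This is a simplification, not the cited procedure itself, and the burn-in length, drift, and threshold below are illustrative choices rather than calibrated values.

```python
import numpy as np

class OnlineMeanChangeDetector:
    """Steps 1-3 above: estimate the mean on an assumed-stationary
    burn-in window, then feed observations in one at a time and flag
    a change when a CUSUM-type statistic exceeds a threshold."""

    def __init__(self, burn_in, threshold=10.0, drift=0.5):
        self.burn_in = burn_in
        self.threshold = threshold
        self.drift = drift
        self.buffer = []
        self.mu = None
        self.sigma = None
        self.cusum_pos = 0.0
        self.cusum_neg = 0.0

    def update(self, x):
        # Step 1: estimate mu, sigma on the stationary burn-in data.
        if self.mu is None:
            self.buffer.append(x)
            if len(self.buffer) == self.burn_in:
                self.mu = float(np.mean(self.buffer))
                self.sigma = float(np.std(self.buffer, ddof=1))
            return False
        # Steps 2-3: update one-sided statistics, test the threshold.
        z = (x - self.mu) / self.sigma
        self.cusum_pos = max(0.0, self.cusum_pos + z - self.drift)
        self.cusum_neg = max(0.0, self.cusum_neg - z - self.drift)
        return max(self.cusum_pos, self.cusum_neg) > self.threshold

rng = np.random.default_rng(4)
# Simulated stream with a mean change at t = 200:
stream = np.concatenate([rng.normal(0, 1, 200), rng.normal(1.5, 1, 100)])
det = OnlineMeanChangeDetector(burn_in=100)
alarm_at = next((i for i, x in enumerate(stream) if det.update(x)), None)
```

On a detection, the state would be reset and estimation restarted on post-change data, as in step (3).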

Due to our two-week time constraint we chose to restrict ourselves to finding change points corresponding to a mean change, just one of the three changes we know are indicative of a leak. As per the fundamental principles of decision theory, we would like to tune and evaluate our algorithm by minimising some loss function which depends on some ‘training’ data. That is, we would like to look at some past period of time and make predictions of when leaks happened given the flow data across the same period, then evaluate how accurate these predictions were and adjust or assess the model accordingly. However, to do this we would need to know when and where leaks actually occurred across the time period of the data, something we did not have access to. Without ‘labels’ indicating that a leak has occurred, any predictions from the model were essentially meaningless, so we sought a proxy. The one potentially useful dataset we did have access to was that of leak repairs. It is clear that a leak must have occurred if a repair has occurred, but for various reasons this proxy does not provide an exhaustive account of all leaks. Furthermore, we do not know which repairs correspond to leaks identified by the particular distributional change in flow data we considered. This, in turn, means that all measures of model performance must come with the caveat that they are contingent on incomplete data. If, when conducting research, we find our results are limited, it is our duty as statisticians to report that this is the case – it is not our job to sugar-coat or manipulate our findings, but to report them with the limitations and uncertainty that inextricably come alongside. Results without uncertainty are as meaningless as no results at all. This being said, all indications pointed towards the method being effective in detecting mean change points in water flow data which correspond to leak repairs, giving a positive result to feed back to our friends at Wessex Water.

Final Conclusions:

Communicating statistical concepts and results to an audience of varied areas and levels of expertise is more important now than ever. The continually strengthening relationships between Compass and its industrial partners are providing students with the opportunity to gain experience in doing exactly this. The focus lab concluded with a presentation of our findings to the Wessex Water operations team, during which we reported the procedures and results. The technical results were well supported by the demonstration of an R Shiny dashboard app, which provided an intuitive interface to view the output of the developed algorithm. Of course, there is more work to be done. Expanding the algorithm to account for all three types of distributional change is the obvious next step. Furthermore, fitting a GAM to data for 760 pipes is not very efficient; investigating ways to ‘cluster’ groups of pipes together according to some notion of similarity, in order to reduce the number of GAMs we need to fit, is a natural avenue for future work. This experience enabled students to apply skills in statistical modelling, algorithm development, and software development to a salient problem faced by an industry partner, and marked a successful start to the Compass industry focus lab series.