Student perspectives: Extending multilevel network meta-regression to disconnected networks and single-arm studies

Posted on 20th September 2024 by ad21883

A post by Sam Perren, PhD student on the Compass programme.

Over the past year, my research has been focused on a method called network meta-analysis (NMA), which is widely used in healthcare decision-making to summarise evidence on the relative effectiveness of different treatments. In particular, I have been interested in the challenges presented by disconnected networks of evidence and single-arm studies and aim to extend the multinma package to handle these challenges. Recently, I presented at the International Society for Clinical Biostatistics (ISCB) conference in Thessaloniki, Greece. In this blog post, I will outline the key points from that presentation and discuss the latest developments from my research.

Network meta-analysis

Network Meta-Analysis (NMA) pools summary treatment effects from randomised control trials (RCTs) to estimate relative effects between multiple treatments [1]. NMA summarises all direct and indirect evidence about treatment effects, allowing comparisons to be made between all pairs of treatments [2]. Covariates such as age, biomarker status, or disease severity can be either Effect Modifiers that interact with treatment effects, or Prognostic Factors that predict outcomes without interacting with treatment effects[3]. NMA requires a connected network, either directly or indirectly, through a series of comparisons[4]. Plot 1 demonstrates the assumption in NMA of constancy of relative effects, that is, the AB effect observed in study AB would be exactly the same in study AC, if a B arm had been included. However, this assumption can break down if there are differences in effect modifiers between studies which can lead to bias.[6].

Plot 1: Simple scenario with A versus B and A versus C study: we assume constancy of relative effects when making an indirect comparison between treatments B and C via the common A arm

Population adjustments & IPD network meta-regression

Population adjustment methods aim to relax the assumption of constancy of relative effects using available individual level data (IPD) to adjust for differences between study populations[3]. A network where IPD is available from every study enables the use of IPD network meta-regression and is considered the gold standard. However, having all IPD data in a network is rare; some studies may only provide aggregate data (AgD) in published papers.

Multilevel – Network Meta-Regression

Multilevel Network Meta-Regression (ML-NMR) is a population adjustment method that extends the NMA framework to synthesise mixtures of IPD and AgD. ML-NMR can produce estimates from networks of any size and for any given target population. It does this by first defining an individual-level regression model on the IPD, then it averages (integrates) each aggregate study population to form the aggregate level model using efficient and general numerical integration. [5]

Disconnected networks

Healthcare policymakers are increasingly encountering disconnected networks of evidence, which often include studies without control groups (single-arm studies)[6]. Very strong assumptions are required to make comparisons in a disconnected network; such as adjusting for all prognostic factors and all effect modifiers, which may not always be feasible with the available data. Current methods to handle disconnected networks include unanchored Matching-Adjusted indirect comparisons (MAIC)[7] and simulated treatment comparison (STC)[8]. However, these methods have limitations: they cannot generate estimates for target populations outside the network of evidence that might be relevant to decision makers and they are limited to a two study-scenario. So there remains a need for more flexible and robust methods, such as an extended version of the ML-NMR approach, to better handle disconnected networks of evidence.

Example: Plaque Psoriasis

We use a network of 6 active treatments plus placebo all used to treat moderate-to-severe plaque psoriasis, previously analysed by Philippo et al. [9]. In this network, we have AgD from the following studies: CLEAR, ERASURE, FEATURE, FIXTURE, and JUNCTURE. Additionally, we have IPD from the IXORA-S, UNCOVER-1, UNCOVER-2, and UNCOVER-3 studies. Outcomes of interest include success/failure to achieve at least 75%, 90% or 100% improvement on the Psoriasis Area and Severity Index (PASI) scale at 12 weeks compared to baseline, denoted PASI 75, PASI 90, and PASI 100, respectively. We make adjustments for potential effect modifiers, including duration of psoriasis, previous systemic treatment, body surface area affected, weight, and psoriatic arthritis.

Plot 2: Network of studies comparing treatments for moderate-to severe plaque psoriasis. PBO, placebo; IXE, ixekizumab; SEC, secukinumab; ETN, etanercept; UST, ustekinumab. IXE and SEC were each investigated with 2 different dosing regimens.

This network (Plot 2) of evidence is connected; every pair of treatments is joined by a path of study comparisons. We will now disconnect this network to illustrate different methods for reconnecting using ML-NMR, and then compare the results back to the “true” results from the full evidence network. We removed the CLEAR study and removed the placebo arms from the ERASURE, FEATURE, and JUNCTURE studies, as well as the Secukinumab 150 mg and Secukinumab 300 mg arms from the FIXTURE study in the AgD. $N_1$ (Left hand side) shows studies comparing different doses of Secukinumab, 150mg and 300mg, $N_2$ shows studies comparing all other treatments. We are then faced with the challenge of wanting to make valid comparisons between treatments in these two sub-networks, illustrated in Plot 3.

Plot 3: Disconnected network comparing treatments for moderate-to-severe plaque psoriasis

Reconnected network – internal evidence

One approach is to combine two AgD studies from opposite sides of the network into a single study. The Fixture study is the only AgD study in $N_2$. To determine the appropriate study to combine with in $N_1$, aggregate-level matching is used[10]. This involves selecting the study that minimises the Euclidean distance between the observed sets of covariates. Table 1 shows the Erasure study has the most similar characteristics to Fixture. As a result, these two studies will be combined into a new four-arm study, referred to as FIXTURE/ERASURE, effectively bridging the gap in the network.

Table 1: Aggregate level matching results against FIXTURE study.

Plot 4: Reconnected network using aggregate level matching. Combing Fixture and Erasure into one study

Reconnected network – external evidence

Another method we used to reconnect the network is by incorporating external observational studies, specifically “Chiricozzi” and “Prospect,” which observe the effects of Secukinumab 300mg. We incorporated these single-arm studies into the Fixture study as if they were part of the original trial, thereby effectively bridging the network. As a result, we end up with two separate reconnected networks, each using one of the observational studies.

Plot 5: Reconnected network using external control studies

Producing Population-Average Estimates

We have four networks for comparison: Full connected network, Reconnected using single arm study (Chircozzi), Reconnected using single arm study (Prospect), Reconnected using aggregate-level matching (FIXTURE/ERASURE). For each network, we will run both ML-NMR and standard NMA without regression. These analyses will produce population-adjusted relative treatment effects and probability outcomes for achieving a 75% reduction in the Plaque Area Severity Index (PASI75).

The ML-NMR results in the fully connected network will serve as the gold standard. We will compare the results obtained from the different methods (ML-NMR vs. NMA) and across the various networks (Full vs. reconnected) to evaluate the impact of different approaches on the relative treatment effects and outcome probabilities.

Relative Effects vs Placebo

Plot 6: Probit relative treatment effects vs placebo estimates. Target populations in columns, treatments with their disconnected subnetwork in the rows (right-hand side) and reconnected/original networks in the subrows (left-hand side). Coloured by method (MLNMR or NMA)

Plot 6 shows the probit relative treatment effects versus placebo across three populations: Feature ($N_1$), Uncover-1 ($N_2$), and the external population, Prospect. The results demonstrate that for treatments in $N_2$, the estimates produced by both NMA and ML-NMR are generally close to the gold standard. This similarity between NMA and ML-NMR is largely due to the homogeneity of the populations within the networks and the limited covariates we used to match original analysis. However, NMA results show smaller confidence intervals compared to ML-NMR, which may suggest an overconfidence in the NMA model’s results. ML-NMR accounts for more complexity and variability therefore extrapolates results.

For the Prospect population, the NMA results exhibit slight bias, likely due to differences between the external population and the network populations.

Results for treatments in $N_1$ show varying degrees of accuracy when compared to the gold standard in all populations. Among the reconnected networks, the FIXTURE/ERASURE and Prospect reconnected networks perform relatively well, while the Chiricozzi-based network struggles to match the gold standard results. This is due to Chiricozzi differing the most on covariates compared to all other populations.

In other words, when comparisons are made across the created “bridges” in the reconnected networks, bias can be introduced into our results.

Plot 7: Reconnected network highlighting comparisons made over the “bridge”

The plot above is the reconnected plot using PROSPECT and Chiricozzi external studies and shows us what we mean by comparisons across the “bridge”. All results in plot (1) are relative to a placebo (PBO) which is in $N_2$. If we want to make comparisons to the placebo with treatments from $N_1$ we will need to use these generated direct comparisons or “bridges”.

Absolute probability of PASI75

Plot 8: Probability of absolute outcomes of PASI75. Target populations in columns, treatments with their disconnected subnetwork in the rows (right-hand side) and reconnected/original networks in the subrows (left-hand side). Coloured by method (MLNMR or NMA)

In Plot 8, the FEATURE population results are very close to the gold standard for treatments in $N_1$ but results for treatments in $N_2$ show some bias. Unlike in the probit differences, the reference treatment for FEATURE now become Secukinumab 150mg and 300mg (SEC_150 & SEC_300) so in order to estimate absolute outcomes for $N_2$ treatments, we need to use our “bridges”, thereby incurring bias. This narrative is the same for the other 2 population estimates, where UNCOVER-2 is in $N_2$, estimates for treatments in $N_1$ are bias compared to the gold standard, dependent on network used. For PROSPECT, it’s reference treatment is Secukinumab 150mg ($N_1$), therefore results for $N_2$ treatments vary from the gold standard.

Key Findings

When producing estimates across reconnected networks, there’s a risk that the estimates may be biased or deviate from the true value. In our analysis, reconnecting the networks using ML-NMR showed little improvement over NMA. These results highlight the importance of carefully selecting studies to bridge networks and minimise bias. As disconnected networks become more common, it’s clear that better tools for evidence synthesis are needed to ensure reliable results that can inform clinical decisions and improve outcomes.

Future Work

To improve the performance of ML-NMR over NMA, we will try incorporating more covariates into the regression model. We also plan to conduct a comprehensive simulation study to compare methods under various scenarios and explore additional approaches, such as class effects. Developing methods to assess the strong assumptions required for reconnecting networks will be another priority. Finally, we aim to implement these methods within the multinma package.

References

[1] – Sofia Dias, Anthony E Ades, Nicky J Welton, Jeroen P Jansen, and Alexander J Sutton. Network meta-analysis for decision-making. John Wiley & Sons, 2018.
[2] – Song F, Altman DG, Glenny AM, Deeks JJ. Validity of indirect comparison for estimating efficacy of competing interventions: empirical evidence from published meta-analyses. Bmj. 2003 Mar 1;326(7387):472.
[3] – David M Phillippo, Anthony E Ades, Sofia Dias, Stephen Palmer, Keith R Abrams, and Nicky J Welton. Methods for population-adjusted indirect comparisons in health technology appraisal. Medical decision making, 38(2):200–211, 2018
[4] – Sofia Dias, Alex J Sutton, AE Ades, and Nicky J Welton. Evidence synthesis for decision making 2: a generalized linear modeling framework for pairwise and network meta-analysis of randomized controlled trials. Medical Decision Making, 33(5):607–617, 2013
[5] – David M Phillippo, Sofia Dias, AE Ades, Mark Belger, Alan Brnabic, Alexander Schacht, Daniel Saure, Zbigniew Kadziola, and Nicky J Welton. Multilevel network meta-regression for population- adjusted treatment comparisons. Journal of the Royal Statistical Society. Series A,(Statistics in Society), 183(3):1189, 2020
[6] – John W Stevens, Christine Fletcher, Gerald Downey, and Anthea Sutton. A review of methods for comparing treatments evaluated in studies that form disconnected networks of evidence. Research synthesis methods, 9(2):148–162, 2018
[7] – Signorovitch, James E., et al. “Matching-adjusted indirect comparisons: a new tool for timely comparative effectiveness research.” Value in Health 15.6 (2012): 940-947.
[8] – Caro JJ, Ishak KJ. No head-to-head trial? Simulate the missing arms. Pharmacoeconomics. 2010;28(10):957–67.
[9] – David M Phillippo, Sofia Dias, AE Ades, Mark Belger, Alan Brnabic, Daniel Saure, Yves Schy-mura, and Nicky J Welton. Validating the assumptions of population adjustment: application of multilevel network meta-regression to a network of treatments for plaque psoriasis. Medical Decision Making, 43(1):53–67, 2023
[10] – Leahy, Joy, et al. “Incorporating single‐arm evidence into a network meta‐analysis using aggregate level matching: assessing the impact.” Statistics in medicine 38.14 (2019): 2505-2523.

Compass Away Day 2024

Posted on 21st August 2024 by sam.bowyer

A post by Sam Bowyer and Emma Ceccherini, PhD students on the Compass programme.

The annual Compass Away Day took place this past June at Folly Farm, a fully sustainable and eco-friendly venue that offered Compass students the opportunity to take some time away from their regular research and enjoy a variety of activities in the Somerset countryside. Over the course of three days, the students learnt—among other things—how to craft an effective CV as a machine learning researcher; how each of our research areas overlap in surprising ways; how to improve the execution time of Python programs; and how to throw an axe.

On arrival, the students first took part in a Responsible Innovation talk led by Henry Bourne, in which we explored the connections, biases, and gaps in and between our individual research areas. This took the form of a creative mind-mapping exercise in which the students eagerly relished the opportunity to don party hats (see picture below).

After a hearty lunch with plenty of coffee, the students next heard three short talks by their colleagues. First, Ed Davis treated them to a talk on the benefits of just-in-time compilation, with a live demonstration of how one line of code (‘@jit‘) can speed up your Python scripts by multiple orders of magnitude. Secondly, the students saw another live-coding demonstration given by Kieran Morris as he walked through a project of his which utilised command-line automation and LLM APIs to create a text-based simulation of a well-known sporting event (The Hunger Games, but with Compass PhD students appearing as competitors). Finally, Edward Milsom gave a tongue-in-cheek TED-style talk entitled ‘How to be an AI Bro’ advising all present in the audience to pick a side between Doomers and Zoomers (anti- and pro-AI); to use self-attention mechanisms (the basis of modern LLMs; see https://arxiv.org/abs/1706.03762) whenever possible; and, following that last piece of advice, to follow @edward_milsom on X (formerly known as Twitter).

The final session of the day was led by Emma Ceccherini and Sam Bowyer, in which the two Away Day student organisers presented on the topic of CVs and online presence. The presentation exceeded everyone’s expectations, in no small part thanks to Helen Mawdsley being present to answer questions from the audience.

Feeling refreshed and revitalised after a peaceful night’s sleep, the students awoke to sunshine and birdsong. But soon a competitive feeling came over the group; this next morning was devoted to sports: archery, axe-throwing and segwaying. We won’t dwell on the sports too much except to say that Sam Bowyer’s team won handily, and that one impressed student is reported to have said “[Sam] was a G at archery. Bosh.” (That same anonymous source, taking time out of her quantitative spatial science and networks research—located in the big Compass office somewhere between desks C and E—, is also reported to have said “Ed D[redacted for anonymity], snapping all those single-use poles, not impressive segwaying.” We, the editors, feel it would be improper for us to comment on this matter.)

The final activity on the schedule was a writing retreat, taking place over the afternoon of the second day and the entirety of the third and final day. Being in a peaceful location and setting time aside to work on a single specific task proved useful to the students present.

To summarise, Away Day 2024 went extremely well, having been organised expertly by Crina Radu, to whom the students are all incredibly grateful.

Compass students at ICLR 2024

Posted on 23rd January 2024 by crina.radu

Congratulations to Compass students Edward Milsom and Ben Anson who, along with their supervisor, had their paper accepted for a poster at ICLR 2024.

Convolutional Deep Kernel Machines

Edward Milsom, Ben Anson, Laurence Aitchison

Ed and Ben: In this paper we explore the importance of representation learning in convolutional neural networks, specifically in the context of an infinite-width limit called the Neural Network Gaussian Process (NNGP) that is often used by theorists. Representation learning refers to the ability of models to learn a transformation of the data that is tailored to the task at hand. This is in contrast to algorithms that use a fixed transformation of the data, e.g. a support vector machine with a fixed kernel function like the RBF kernel. Representation learning is thought to be critical to the success of convolutional neural networks in vision tasks, but networks in the NNGP limit do not perform representation learning, instead transforming the data with a fixed kernel function. A recent modification to the NNGP limit, called the Deep Kernel Machine (DKM), allows one to gradually “add representation learning back in” to the NNGP, using a single hyperparameter that controls the amount of flexibility in the kernel. We extend this algorithm to convolutional architectures, which required us to develop a new sparse inducing point approximation scheme. This allowed us to test on the full CIFAR-10 image classification dataset, where we achieved state-of-the-art test accuracy for kernel methods, with 92.7%.

In the plot below, we see how changing the hyperparameter (x-axis) to reduce flexibility too much harms the performance on unseen data.

Student Perspectives: SPREE Methods for Small Area Estimation

Posted on 11th December 2023 by codie.wood

A post by Codie Wood, PhD student on the Compass programme.

This blog post is an introduction to structure preserving estimation (SPREE) methods. These methods form the foundation of my current work with the Office for National Statistics (ONS), where I am undertaking a six-month internship as part of my PhD. During this internship, I am focusing on the use of SPREE to provide small area estimates of population characteristic counts and proportions.

Small area estimation

Small area estimation (SAE) refers to the collection of methods used to produce accurate and precise population characteristic estimates for small population domains. Examples of domains may include low-level geographical areas, or population subgroups. An example of an SAE problem would be estimating the national population breakdown in small geographical areas by ethnic group [2015_Luna].

Demographic surveys with a large enough scale to provide high-quality direct estimates at a fine-grain level are often expensive to conduct, and so smaller sample surveys are often conducted instead.

SAE methods work by drawing information from different data sources and similar population domains in order to obtain accurate and precise model-based estimates where sample counts are too small for high quality direct estimates. We use the term small area to refer to domains where we have little or no data available in our sample survey.

SAE methods are frequently relied upon for population characteristic estimation, particularly as there is an increasing demand for information about local populations in order to ensure correct allocation of resources and services across the nation.

Structure preserving estimation

Structure preserving estimation (SPREE) is one of the tools used within SAE to provide population composition estimates. We use the term composition here to refer to a population break down into a two-way contingency table containing positive count values. Here, we focus on the case where we have a population broken down into geographical areas (e.g. local authority) and some subgroup or category (e.g. ethnic group or age).

Orginal SPREE-type estimators, as proposed in [1980_Purcell], can be used in the case when we have a proxy data source for our target composition, containing information for the same set of areas and categories but that may not entirely accurately represent the variable of interest. This is usually because the data are outdated or have a slightly different variable definition than the target.

We also incorporate benchmark estimates of the row and column totals for our composition of interest, taken from trusted, quality assured data sources and treated as known values. This ensures consistency with higher level known population estimates. SPREE then adjusts the proxy data to the estimates of the row and column totals to obtain the improved estimate of the target composition.

An illustration of the data required to produce SPREE-type estimates.

In an extension of SPREE, known as generalised SPREE (GSPREE) [2004_Zhang], the proxy data can also be supplemented by sample survey data to generate estimates that are less subject to bias and uncertainty than it would be possible to generate from each source individually. The survey data used is assumed to be a valid measure of the target variable (i.e. it has the same definition and is not out of date), but due to small sample sizes may have a degree of uncertainty or bias for some cells.

The GSPREE method establishes a relationship between the proxy data and the survey data, with this relationship being used to adjust the proxy compositions towards the survey data.

IMG_1634 (1)

An illustration of the data required to produce GSPREE estimates.

GSPREE is not the only extension to SPREE-type methods, but those are beyond the scope of this post. Further extensions such as Multivariate SPREE are discussed in detail in [2016_Luna].

Original SPREE methods

First, we describe original SPREE-type estimators. For these estimators, we require only well-established estimates of the margins of our target composition.

We will denote the target composition of interest by $\mathbf{Y} = (Y{aj})$, where $Y{aj}$ is the cell count for small area $a = 1,\dots,A$ and group $j = 1,\dots,J$. We can write $\mathbf Y$ in the form of a saturated log-linear model as the sum of four terms,

$$ \log Y_{aj} = \alpha_0^Y + \alpha_a^Y + \alpha_j^Y + \alpha_{aj}^Y.$$

There are multiple ways to write this parameterisation, and here we use the centered constraints parameterisation given by $$\alpha_0^Y = \frac{1}{AJ}\sum_a\sum_j\log Y_{aj},$$ $$\alpha_a^Y = \frac{1}{J}\sum_j\log Y_{aj} – \alpha_0^Y,$$ $$\alpha_j^Y = \frac{1}{A}\sum_a\log Y_{aj} – \alpha_0^Y,$$ $$\alpha_{aj}^Y = \log Y_{aj} – \alpha_0^Y – \alpha_a^Y – \alpha_j^Y,$$

which satisfy the constraints $\sum_a \alpha_a^Y = \sum_j \alpha_j^Y = \sum_a \alpha_{aj}^Y = \sum_j \alpha_{aj}^Y = 0.$

Using this expression, we can decompose $\mathbf Y$ into two structures:

The association structure, consisting of the set of $AJ$ interaction terms $\alpha_{aj}^Y$ for $a = 1,\dots,A$ and $j = 1,\dots,J$. This determines the relationship between the rows (areas) and columns (groups).
The allocation structure, consisting of the sets of terms $\alpha_0^Y, \alpha_a^Y,$ and $\alpha_j^Y$ for $a = 1,\dots,A$ and $j = 1,\dots,J$. This determines the size of the composition, and differences between the sets of rows (areas) and columns (groups).

Suppose we have a proxy composition $\mathbf X$ of the same dimensions as $\mathbf Y$, and we have the sets of row and column margins of $\mathbf Y$ denoted by $\mathbf Y_{a+} = (Y_{1+}, \dots, Y_{A+})$ and $\mathbf Y_{+j} = (Y_{+1}, \dots, Y_{+J})$, where $+$ substitutes the index being summed over.

We can then use iterative proportional fitting (IPF) to produce an estimate $\widehat{\mathbf Y}$ of $\mathbf Y$ that preserves the association structure observed in the proxy composition $\mathbf X$. The IPF procedure is as follows:

Rescale the rows of $\mathbf X$ as $$ \widehat{Y}_{aj}^{(1)} = X_{aj} \frac{Y_{+j}}{X_{+j}},$$
Rescale the columns of $\widehat{\mathbf Y}^{(1)}$ as $$ \widehat{Y}_{aj}^{(2)} = \widehat{Y}_{aj}^{(1)} \frac{Y_{a+}}{\widehat{Y}_{a+}^{(1)}},$$
Rescale the rows of $\widehat{\mathbf Y}^{(2)}$ as $$ \widehat{Y}_{aj}^{(3)} = \widehat{Y}_{aj}^{(2)} \frac{Y_{+j}}{\widehat{Y}_{+j}^{(2)}}.$$

Steps 2 and 3 are then repeated until convergence occurs, and we have a final composition estimate denoted by $\widehat{\mathbf Y}^S$ which has the same association structure as our proxy composition, i.e. we have $\alpha_{aj}^X = \alpha_{aj}^Y$ for all $a \in \{1,\dots,A\}$ and $j \in \{1,\dots,J\}.$ This is a key assumption of the SPREE implementation, which in practise is often restrictive, motivating a generalisation of the method.

Generalised SPREE methods

If we can no longer assume that the proxy composition and target compositions have the same association structure, we instead use the GSPREE method first introduced in [2004_Zhang], and incorporate survey data into our estimation process.

The GSPREE method relaxes the assumption that $\alpha_{aj}^X = \alpha_{aj}^Y$ for all $a \in \{1,\dots,A\}$ and $j \in \{1,\dots,J\},$ instead imposing the structural assumption $\alpha_{aj}^Y = \beta \alpha_{aj}^X$, i.e. the association structure of the proxy and target compositions are proportional to one another. As such, we note that SPREE is a particular case of GSPREE where $\beta = 1$.

Continuing with our notation from the previous section, we proceed to estimate $\beta$ by modelling the relationship between our target and proxy compositions as a generalised linear structural model (GLSM) given by
$$\tau_{aj}^Y = \lambda_j + \beta \tau_{aj}^X,$$ with $\sum_j \lambda_j = 0$, and where $$ \begin{align} \tau_{aj}^Y &= \log Y_{aj} – \frac{1}{J}\sum_j\log Y_{aj},\\
&= \alpha_{aj}^Y + \alpha_j^Y,
\end{align}$$ and analogously for $\mathbf X$.

It is shown in [2016_Luna] that fitting this model is equivalent to fitting a Poisson generalised linear model to our cell counts, with a $\log$ link function. We use the association structure of our proxy data, as well as categorical variables representing the area and group of the cell, as our covariates. Then we have a model given by $$\log Y_{aj} = \gamma_a + \tilde{\lambda}_j + \tilde{\beta}\alpha_{aj}^X,$$ with $\gamma_a = \alpha_0^Y + \alpha_a^Y$, $\tilde\lambda_j = \alpha_j^Y$ and $\tilde\beta \alpha_{aj}^X = \alpha_{aj}^Y.$

When fitting the model we use survey data $\tilde{\mathbf Y}$ as our response variable, and are then able to obtain a set of unbenchmarked estimates of our target composition. The GSPREE method then benchmarks these to estimates of the row and column totals, following a procedure analagous to that undertaken in the orginal SPREE methodology, to provide a final set of estimates for our target composition.

ONS applications

The ONS has used GSPREE to provide population ethnicity composition estimates in intercensal years, where the detailed population estimates resulting from the census are outdated [2015_Luna]. In this case, the census data is considered the proxy data source. More recent works have also used GSPREE to estimate counts of households and dwellings in each tenure at the subnational level during intercensal years [2023_ONS].

My work with the ONS has focussed on extending the current workflows and systems in place to implement these methods in a reproducible manner, allowing them to be applied to a wider variety of scenarios with differing data availability.

References

[1980_Purcell] Purcell, Noel J., and Leslie Kish. 1980. ‘Postcensal Estimates for Local Areas (Or Domains)’. International Statistical Review / Revue Internationale de Statistique 48 (1): 3–18. https://doi.org/10/b96g3g.

[2004_Zhang] Zhang, Li-Chun, and Raymond L. Chambers. 2004. ‘Small Area Estimates for Cross-Classifications’. Journal of the Royal Statistical Society Series B: Statistical Methodology 66 (2): 479–96. https://doi.org/10/fq2ftt.

[2015_Luna] Luna Hernández, Ángela, Li-Chun Zhang, Alison Whitworth, and Kirsten Piller. 2015. ‘Small Area Estimates of the Population Distribution by Ethnic Group in England: A Proposal Using Structure Preserving Estimators’. Statistics in Transition New Series and Survey Methodology 16 (December). https://doi.org/10/gs49kq.

[2016_Luna] Luna Hernández, Ángela. 2016. ‘Multivariate Structure Preserving Estimation for Population Compositions’. PhD thesis, University of Southampton, School of Social Sciences. https://eprints.soton.ac.uk/404689/.

[2023_ONS] Office for National Statistics (ONS), released 17 May 2023, ONS website, article, Tenure estimates for households and dwellings, England: GSPREE compared with Census 2021 data

Student Perspectives: Machine Learning Models for Probability Distributions

Posted on 20th October 202323rd October 2023 by th17628

A post by Tennessee Hickling, PhD student on the Compass programme.

Introduction
Probabilistic modelling provides a consistent way to deal with uncertainty in data. The central tool in this methodology is the probability distribution, which describes the randomness of observations. To create effective models of reality, we need to be able to specify probability distributions that are flexible enough to capture real phenomena whilst remaining feasible to estimate. In the past decade machine learning (ML) has developed many new and exciting ways to represent and learn potentially complex probability distributions.

ML has provided significant advances in modelling of high dimensional and highly structured data such as images or text. Many of these modern approaches are applied as “generative models”. The goal of such approaches is to sample new synthetic observations from an approximate distribution which closely matches the target distribution. For example, we may use many images of cats to learn an approximate distribution, from which we can sample new images of cats that appear realistic. Usually, a “generative model” indicates the requirement to sample from the model, but not necessarily assign probabilities to observed data. In this case, the model captures uncertainty by imitating the structure and randomness of the data.

Many of these modern methods work by transforming simple randomness (such as a Normal distribution) into the target complex randomness. In my own research, I work on a known limitation of such approaches to replicate a particular aspect of randomness – the tails of probability distributions [1, 11]. In this post, I wanted to take a step back and provide an overview of and connections between two ML methods that can be used to model probability distributions – Normalising Flows (NFs) and Variational Autoencoders (VAEs).

Figure 1: Basic illustration of ML learning of a distribution. We optimise the machine learning model to produce a distribution close to our target. This is often conceptualised in the generative direction, such that our ML model moves samples from the simple distribution to more closely match the target observations.

Some Background
Consider real valued vectors $z \in \mathbb{R}^{d_z}$ and $x \in \mathbb{R}^{d_x}$. In this post I mirror notation used in [2], where $p(x)$ refers to the density and distribution of $x$ and $x \sim p(x)$ indicates samples according to that distribution. The generic set up I am considering is that of density estimation – trying to model the distribution $p(x)$ of some observed data $\{x_i\}_{i=1}^{N}$. I use a semicolon to denote parameters, so $p(x; \beta)$ is a distribution over $x$ with parameters $\beta$. I also make use of different letters to distinguish different distributions, for example using $q(x)$ to denote an approximation to $p(x)$. The notation $\mathbb{E}_p[f(x)]$ refers to the standard expectation of $f(x)$ over the distribution $p$.

The discussed methods introduce some simple source of randomness arising from a known, simple latent distribution $p(z)$. This is also referred to in some literature as the prior, though the usage is not straightforwardly relatable to traditional Bayesian concepts. The goal is then to fit an approximate $q(x|z; \theta)$, that is a conditional distribution, such that $$q(x; \theta) = \int q(x|z; \theta)p(z)dz \approx p(x),$$in words, the marginal density over $x$ implied by the conditional density, is close to our target distribution $p(x)$. In general, we can’t compute $q(x)$, as we can’t solve the above integral for very flexible $q(x | z; \theta)$.

Variational Inference
We commonly make use of the Kullback-Leibler (KL) divergence, which can be interpreted as measuring the difference between two probability distributions. It is a useful practical tool, since we can compute and optimise the quantity in a wide variety of situations. Techniques which optimise a probability distribution using such divergences are known as variational methods. There are other choices of divergence, but KL is the most standard. Important properties of KL are that the quantity $KL(p|| q)$ is non-negative and non-symmetric i.e. $KL(p|| q) \neq KL(q || p)$.

Given this, we can see that a natural objective is to minimise the difference between distributions, as measured by the KL, $$KL(p(x) || q(x; \theta)) = \int p(x) \log \frac{p(x)}{q(x; \theta)} dx.$$Advances in this area have mostly developed new ways to make this optimisation tractable for flexible $q(x | z; \theta)$.

Normalising Flow
A normalising flow is a parameterised transformation of a random variable. The key simplifying assumption is that the forward generation is deterministic. That is, for $d_x = d_z$, that $$
x = T(z; \theta),$$for some transformation function $T$. We additionally require that $T$ is a differentiable bijection. Given these requirements, we can express the approximate density of $x$ exactly as $$q_x(x; \theta) = p_z(T^{-1}(x; \theta))\big|\text{det } J_{T^{-1}}(x; \theta)\big|.$$Here, $\text{det }J_{T^{-1}}$ is the determinant of the Jacobian of the inverse transformation. Research on NFs has developed numerous ways to make the computation of the Jacobian term tractable. The standard approach is to use neural networks to produce $\theta$ (the parameters of the transformation), with numerous ways of configuring the model to capture dependency between dimensions. Additionally, we often stack many layers to provide more flexibility. See [10] and the review [2] for more details on how this is achieved.

As we have access to an analytic approximate density, we can minimise the negative log-likelihood of our model, $$\mathcal{J}(\theta) = -\sum_{i=1}^{N} \log q(x_i; \theta),$$which is the Monte-Carlo approximation of the KL loss (up to an additive constant). This is straightforward to optimise using stochastic gradient descent [9] and automatic differentiation.

Figure 2: Schematic of NF model. The ML model produces the parameters of our transformation, which are identical in the forward and backwards directions. We choose the transformation such that we can express an analytic density function for our approximate distribution.

Variational Autoencoder
In the Variational Autoencoder (VAE) [3] the conditional distribution $q(x| z; \theta)$ is known as the decoder. VAEs consider the marginal in terms of the posterior, that is $$q(x; \theta) = \frac{q(x | z; \theta)p(z)}{q(z | x; \theta)}.$$The posterior $q(z | x; \theta)$ is itself not generally tractable. VAEs proceed by introducing an encoder, which approximates $q(z | x; \theta)$. This is itself simply a conditional distribution $e(z | x; \psi)$. We use this approximation to express the log marginal over $x$ as below.
$$\begin{align}
\log q(x; \theta) &= \mathbb{E}_{e}\bigg[\log q(x; \theta)\frac{e(z | x; \psi)}{e(z | x; \psi)}\bigg] \\
&= \mathbb{E}_{e}\bigg[\log\frac{q(x | z; \theta)p(z)}{q(z | x; \theta)}\frac{e(z | x; \psi)}{e(z | x; \psi)}\bigg] \\
&= \mathbb{E}_{e}\bigg[\log\frac{q(x | z; \theta)p(z)}{e(z | x; \psi)}\bigg] + KL(e(z | x; \psi) || q(z | x; \theta)) \\
&= \mathcal{J}_{\theta,\psi} + KL(e(z | x; \psi) || q(z | x; \theta))
\end{align}$$
The additional approximation gives a more complex expression and does not provide an analytical approximate density. However, as $KL(e(z | x; \psi) || q(z | x; \theta))$ is positive, $\mathcal{J}_{\theta,\psi}$ forms a lower bound on $\log q(x; \theta)$. This expression is commonly referred to as the Evidence Lower Bound (or ELBO). The second term, the divergence between our encoder and the implied posterior will be minimised as we increase $\mathcal{J}_{\theta,\psi}$ in $\psi$. As we increase $\mathcal{J}_{\theta,\psi}$, we will either be increasing $\log q(x; \theta)$ or reducing $KL(e(z | x; \psi) || q(z | x; \theta))$.

The goal is to increase $\log q(x; \theta)$, which we hope occurs as we optimise over both parameters. This ambiguity in optimisation results in well known issues, such as posterior collapse [12] and can result in some counter-intuitive behaviour [4]. Despite this, VAEs remain a powerful and popular approach. An important benefit is that we no longer require $d_x = d_z$, which means we can map high dimensional $x$ to low dimensional $z$ to perform dimensionality reduction.

Figure 3: Schematic of VAE model. We now have a stochastic forward transformation. To optimise this we introduce a decoder model which approximates the posterior implied by the forward transformation. We now have a more flexible transformation, but two models to train and no analytic approximate density.

Surjective Flows
We can now identify a connection between NFs and VAEs. Recent work has reinterpreted NFs within the VAE framework, permitting a broader class of transitions whilst retaining analytic tractability of NFs [5]. Considering our decoding conditional as $$q(x|z; \theta) = \delta(x – T(z; \theta)),$$ we have the posterior exactly as
$$q(z|x; \theta) = \delta(z – T^{-1}(x; \theta)),$$
where $\delta$ is the dirac delta function.

This provides a view of a NF as a special case of a VAE, where we don’t need to approximate the posterior. Considering our VAE approximation, $$
\log q(x; \theta) = \mathbb{E}_e\big[\log p(z) + \log\frac{q(x | z; \theta)}{e(z | x; \psi)}\big] + KL(e(z | x; \psi) || q(z | x; \theta)),
$$and taking $e(z | x; \psi) = q(z | x; \theta)$, then the final KL term is 0 by definition. In that case, we recover our analytic density for NFs (see [5] for details).

Note that accessing an analytic density depends on having $e(z | x; \psi) = q(z | x; \theta)$ and computing $$\mathbb{E}_e\big[\log p(z) + \log\frac{q(x | z; \theta)}{e(z | x; \psi)}\big].$$These requirements are actually weaker than those we apply in the case of standard NFs. Consider a deterministic transformation $T^{-1}(x)$ which is surjective, crucially we can have many $x$ which map to the same $z$, so no longer have an analytic inverse. However we can still choose a $q(x | z)$ which is stochastic inverse of $T^{-1}$. For example, consider an absolute value surjection, $q(z | x) = \delta(z – |x|)$, to invert this transformation we can choose $q(x | z) = \frac{1}{2}\delta(x – z) + \frac{1}{2}\delta(x + z)$, which we can sample from straight-forwardly. This transformation enforces symmetry across the origin in the approximate distribution, a potentially useful inductive bias. In this example, and many others, we can also compute the density exactly. This has led to a number of interesting extensions to NFs, such as “funnel flows” which have $d_z < d_x$ but retain an analytic approximate density [6]. As we retain an analytic approximate density, we can optimise them in the same way as NFs.

Figure 4: Schematic of a surjective transformation. We have a stochastic forward transformation, but the inverse is deterministic. This restricts what transformations we can have, but retains an analytic approximate density.

Conclusion
I have presented an overview of two widely-used methods for modelling probability distributions with machine learning techniques. I’ve also highlighted an interesting connection between these methods, an area of research that has led to the development of interesting new models. It’s worth noting that other important classes of ML models such as Generative Adversarial Networks and Diffusion models can also be interpreted as approximate probability distributions. There are many superficial connections between those methods and the ones discussed here. Exploring these theoretical similarities presents an compelling direction of research to sharpen understanding of such models’ relative advantages. Another promising direction is the synthesis of these methods, where researchers aim to harness the strengths of each approach [7,8]. This not only enhances the existing knowledge base but also offers opportunities for innovative applications in the field.

References
[1] Jaini, Priyank, et al. “Tails of Lipschitz triangular flows.” International Conference on Machine Learning. PMLR, 2020
[2] Papamakarios, G., et al. “Normalizing flows for probabilistic modeling and inference.” Journal of Machine Learning Research, 2021
[3] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv:1312.6114, 2013
[4] Rainforth, Tom, et al. “Tighter variational bounds are not necessarily better.” International Conference on Machine Learning. PMLR, 2018
[5] Nielsen, Didrik, et al. “Survae flows: Surjections to bridge the gap between vaes and flows.” Advances in Neural Information Processing, 2020
[6] Samuel Klein, et al. “Funnels: Exact maximum likelihood with dimensionality reduction.” arXiv:2112.08069, 2021
[7] Kingma, Durk P., et al. “Improved variational inference with inverse autoregressive flow.” Advances in neural information processing systems , 2016
[8] Zhang, Qinsheng, and Yongxin Chen. “Diffusion normalizing flow.” Advances in Neural Information Processing Systems 34, 2021
[9] Ettore Fincato “An Introduction to Stochastic Gradient Methods” https://compass.blogs.bristol.ac.uk/2023/03/14/stochastic-gradient-methods/
[10] Daniel Ward “An introduction to normalising flows” https://compass.blogs.bristol.ac.uk/2022/07/13/2608/
[11] Tennessee Hickling and Dennis Prangle, “Flexible Tails for Normalising Flows, with Application to the Modelling of Financial Return Data”, 8th MIDAS Workshop, ECML PKDD, 2023
[12] Lucas, James, et al. “Don’t blame the elbo! a linear vae perspective on posterior collapse.” Advances in Neural Information Processing Systems 32, 2019

Student perspectives: Compass Annual Conference 2023

Posted on 26th September 202326th September 2023 by dominic.broadbent

A post by Dominic Broadbent, PhD student on the Compass CDT, and Michael Whitehouse, PhD student of the Compass CDT (recently submitted thesis)

Introduction

September saw the second annual Compass Conference, hosted in the Fry Building, the home of the School of Mathematics. The event was particularly special as it is the first time that all five Compass cohorts were brought together, and it was a fantastic opportunity to celebrate the achievements and research of the Compass CDT with external partners. This year the theme was “Communicating Research in Context“, focusing on how research can be better communicated, and the need to highlight the motivation and applications of mathematical research.

Research talks

The day began with four long form talks touching on the topic of communicating research. Starting with Alessio Zakaria’s talk which delved into Hypothesis tests, commenting on their criticial role as the defacto statistical tool across the sciences, and how p-hacking has led to a replication crisis that undermines public confidence in research. The next talk by Sam Stockman and Emerald Dilworth discussed the challenges they faced, and the key takeaways from their shared experience of communicating mathematics with researchers in the geographical sciences. Following this, Ed Davis’s interactive talk “The Universal Language of Visualisations” explored how effective visualisation techniques should differ by the intended audience, with examples from his research and activities outside of academia. The last talk by Dan Milner explored his research on understanding the effect of environmental factors on outcomes of smallholder farmers in Kenya. He took us through the process of collecting data on the ground, to modelling and communicating results to external partners. After each talk there was an opportunity to ask questions, allowing for audience participation and the sparking of interesting discussions. The format mirrored that which is most frequently used in external academic conferences, giving the speakers a chance to practice their technique in front of friendly faces.

Lightning talks

After a short break, we jumped back into the fray with a series of 3-minute fast-paced lightning talks. A huge range of topics were covered, all the way from developing modelling techniques for the electric grid of the future, to predicting the incidence of Cerebral Vasospasm at the Southmead Hospital ICU. With such a short time to present, these talks were a great exercise in distilling research into just the essentials, knowing there is very limited time to garner the audience’s interest and convey an effective message.

Special guest lecture

After lunch, we reconvened to attend the special guest lecture. The talk, entitled Bridging the gap between research and industry, was delivered by Ruth Voisey, CEO of the Smith institute. It outlined Ruth’s journey from writing her PhD thesis ‘Multiple wave scattering by quasiperiodic structures’, to CEO of the Smith Institute – via an internship with the acoustic research team at Dyson. It was particularly refreshing to hear Ruth’s candid account of her ‘non-linear’ rise to CEO, accrediting her success to strong principles of clear research communication and ‘mathematical evangelicalism’.

As PhD students in the bubble of academia, the path to opportunities in the world of industry can often feel clouded – Ruth’s lecture painted a clear picture of how one can transition from university based research to a rewarding career outside of this bubble, applying such research to tangible problems in the real world.

Panel discussion and poster session

The special guest lecture was followed by a discussion on communicating research in context, with panel members Ruth Voisey, David Greenwood, Helen Barugh, Oliver Johnson, plus Compass CDT students Ed Davis and Sam Stockman. The panel discussed the difficulty of communicating the nuances of research conclusions with the public, with a particular focus on handling these nuances when talking to journalists – stressing the importance of communicating the limitations of the research in question.

This was followed by a poster session, one enthusiastic student had the following comment “it was great to see of all the Compass students’ hard work being celebrated and shared with the wider data science community”.

Concluding remarks

To cap off the successful event, Compass students Hannah Sansford and Josh Givens delivered some concluding remarks which were drawn from comments made by students about what key points they’d taken from the day. These focused on the importance of clear communication of research throughout the whole pipeline, from inception in discussion with fellow academics to the dissemination of knowledge to the general population.

The day ended with a walk to Goldney Hall, where students, staff, and attendees enjoyed delicious food, wine, and access to the beautiful Orangery gardens.

Student Perspectives: An Introduction to QGAMs

Posted on 16th November 2022 by ben.griffiths

A post by Ben Griffiths, PhD student on the Compass programme.

My area of research is studying Quantile Generalised Additive Models (QGAMs), with my main application lying in energy demand forecasting. In particular, my focus is on developing faster and more stable fitting methods and model selection techniques. This blog post aims to briefly explain what QGAMs are, how to fit them, and a short illustrative example applying these techniques to data on extreme rainfall in Switzerland. I am supervised by Matteo Fasiolo and my research is sponsored by Électricité de France (EDF).

Quantile Generalised Additive Models

QGAMs are essentially the result of combining quantile regression (QR; performing regression on a specific quantile of the response) with a generalised additive model (GAM; fitting a model assuming additive smooth effects). Here we are in the regression setting, so let $F(y| \boldsymbol{x})$ be the conditional c.d.f. of a response, $y$ , given a $p$ -dimensional vector of covariates, $\boldsymbol{x}$ . In QR we model the $\tau$ th quantile, that is, $\mu_\tau(\boldsymbol{x}) = \inf \{y : F(y|\boldsymbol{x}) \geq \tau\}$ .

Examples of true quantiles of SHASH distribution.

This might be useful in cases where we do not need to model the full distribution of $y| \boldsymbol{x}$ and only need one particular quantile of interest (for example urban planners might only be interested in estimates of extreme rainfall e.g. $\tau = 0.95$ ). It also allows us to make no assumptions about the underlying true distribution, instead we can model the distribution empirically using multiple quantiles.

We can define the $\tau$ th quantile as the minimiser of expected loss

$L(\mu| \boldsymbol{x}) = \mathbb{E} \left\{\rho_\tau (y - \mu)| \boldsymbol{x} \right \} = \int \rho_\tau(y - \mu) d F(y|\boldsymbol{x}),$

w.r.t. $\mu = \mu_\tau(\boldsymbol{x})$ , where

$\rho_\tau (z) = (\tau - 1) z \boldsymbol{1}(z<0) + \tau z \boldsymbol{1}(z \geq 0),$

is known as the pinball loss (Koenker, 2005).

Pinball loss for quantiles 0.5, 0.8, 0.95.

We can approximate the above expression empirically given a sample of size $n$ , which gives the quantile estimator, $\hat{\mu}_\tau(\boldsymbol{x}) = \boldsymbol{x}^\mathsf{T} \hat{\boldsymbol{\beta}}$ where

$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg \min} \frac{1}{n} \sum_{i=1}^n \rho_\tau \left\{y_i - \boldsymbol{x}_i^\mathsf{T} \boldsymbol{\beta}\right\},$

where $\boldsymbol{x}_i$ is the $i$ th vector of covariates, and $\boldsymbol{\beta}$ is vector of regression coefficients.

So far we have described QR, so to turn this into a QGAM we assume $\mu_\tau(\boldsymbol{x})$ has additive structure, that is, we can write the $\tau$ th conditional quantile as

$\mu_\tau(\boldsymbol{x}) = \sum_{j=1}^m f_j(\boldsymbol{x}),$

where the $m$ additive terms are defined in terms of basis functions (e.g. spline bases). A marginal smooth effect could be, for example

$f_j(\boldsymbol{x}) = \sum_{k=1}^{r_j} \beta_{jk} b_{jk}(x_j),$

where $\beta_{jk}$ are unknown coefficients, $b_{jk}(x_j)$ are known spline basis functions and $r_j$ is the basis dimension.

Denote $\boldsymbol{\mathrm{x}}_i$ the vector of basis functions evaluated at $\boldsymbol{x}_i$ , then the $n \times d$ design matrix $\boldsymbol{\mathrm{X}}$ is defined as having $i$ th row $\boldsymbol{\mathrm{x}}_i$ , for $i = 1, \dots, n$ , and $d = r_1+\dots +r_m$ is the total basis dimension over all $f_j$ . Now the quantile estimate is defined as $\mu_\tau(\boldsymbol{x}_i) = \boldsymbol{\mathrm{x}}_i^\mathsf{T} \boldsymbol{\beta}$ . When estimating the regression coefficients, we put a ridge penalty on $\boldsymbol{\beta}_{j}$ to control complexity of $f_j$ , thus we seek to minimise the penalised pinball loss

$V(\boldsymbol{\beta},\boldsymbol{\gamma},\sigma) = \sum_{i=1}^n \frac{1}{\sigma} \rho_\tau \left\{y_i - \mu(\boldsymbol{x}_i)\right\} + \frac{1}{2} \sum_{j=1}^m \gamma_j \boldsymbol{\beta}^\mathsf{T} \boldsymbol{\mathrm{S}}_j \boldsymbol{\beta},$

where $\boldsymbol{\gamma} = (\gamma_1,\dots,\gamma_m)$ is a vector of positive smoothing parameters, $1/\sigma>0$ is the learning rate and the $\boldsymbol{\mathrm{S}}_j$ ‘s are positive semi-definite matrices which penalise the wiggliness of the corresponding effect $f_j$ . Minimising $V$ with respect to $\boldsymbol{\beta}$ given fixed $\sigma$ and $\boldsymbol{\gamma}$ leads to the maximum a posteriori (MAP) estimator $\hat{\boldsymbol{\beta}}$ .

There are a number of methods to tune the smoothing parameters and learning rate. The framework from Fasiolo et al. (2021) consists in:

calibrating $\sigma$ by Integrated Kullback–Leibler minimisation
selecting $\boldsymbol{\gamma}|\sigma$ by Laplace Approximate Marginal Loss minimisation
estimating $\boldsymbol{\beta}|\boldsymbol{\gamma},\sigma$ by minimising penalised Extended Log-F loss (note that this loss is simply a smoothed version of the pinball loss introduced above)

For more detail on what each of these steps means I refer the reader to Fasiolo et al. (2021). Clearly this three-layered nested optimisation can take a long time to converge, especially in cases where we have large datasets which is often the case for energy demand forecasting. So my project approach is to adapt this framework in order to make it less computationally expensive.

Application to Swiss Extreme Rainfall

Here I will briefly discuss one potential application of QGAMs, where we analyse a dataset consisting of observations of the most extreme 12 hourly total rainfall each year for 65 Swiss weather stations between 1981-2015. This data set can be found in the R package gamair and for model fitting I used the package mgcViz.

A basic QGAM for the 50% quantile (i.e. $\tau = 0.5$ ) can be fitted using the following formula

$\mu_i = \beta + \psi(\mathrm{reg}_i) + f_1(\mathrm{nao}_i) + f_2(\mathrm{el}_i) + f_3(\mathrm{Y}_i) + f_4(\mathrm{E}_i,\mathrm{N}_i),$

where $\beta$ is the intercept term, $\psi(\mathrm{reg}_i)$ is a parametric factor for climate region, $f_1, \dots, f_4$ are smooth effects, $\mathrm{nao}_i$ is the Annual North Atlantic Oscillation index, $\mathrm{el}_i$ is the metres above sea level, $\mathrm{Y}_i$ is the year of observation, and $\mathrm{E}_i$ and $\mathrm{N}_i$ are the degrees east and north respectively.

After fitting in mgcViz, we can plot the smooth effects and see how these affect the extreme yearly rainfall in Switzerland.

Fitted smooth effects for North Atlantic Oscillation index, elevation, degrees east and north and year of observation.

From the plots observe the following; as we increase the NAO index we observe a somewhat oscillatory effect on extreme rainfall; when increasing elevation we see a steady increase in extreme rainfall before a sharp drop after an elevation of around 2500 metres; as years increase we see a relatively flat effect on extreme rainfall indicating the extreme rainfall patterns might not change much over time (hopefully the reader won’t regard this as evidence against climate change); and from the spatial plot we see that the south-east of Switzerland appears to be more prone to more heavy extreme rainfall.

We could also look into fitting a 3D spatio-temporal tensor product effect, using the following formula

$\mu_i = \beta + \psi(\mathrm{reg}_i) + f_1(\mathrm{nao}_i) + f_2(\mathrm{el}_i) + t(\mathrm{E}_i,\mathrm{N}_i,\mathrm{Y}_i),$

where $t$ is the tensor product effect between $\mathrm{E}_i$ , $\mathrm{N}_i$ and $\mathrm{Y}_i$ . We can examine the spatial effect on extreme rainfall over time by plotting the smooths.

3D spatio-temporal tensor smooths for years 1985, 1995, 2005 and 2015.

There does not seem to be a significant interaction between the location and year, since we see little change between the plots, except for perhaps a slight decrease in the south-east.

Finally, we can make the most of the QGAM framework by fitting multiple quantiles at once. Here we fit the first formula for quantiles $\tau = 0.1, 0.2, \dots, 0.9$ , and we can examine the fitted smooths for each quantile on the spatial effect.

Spatial smooths for quantiles 0.1, 0.2, …, 0.9.

Interestingly the spatial effect is much stronger in higher quantiles than in the lower ones, where we see a relatively weak effect at the 0.1 quantile, and a very strong effect at the 0.9 quantile ranging between around -30 and +60.

The example discussed here is but one of many potential applications of QGAMs. As mentioned in the introduction, my research area is motivated by energy demand forecasting. My current/future research is focused on adapting the QGAM fitting framework to obtain faster fitting.

References

Fasiolo, M., S. N. Wood, M. Zaffran, R. Nedellec, and Y. Goude (2021). Fast calibrated additive quantile regression. Journal of the American Statistical Association 116 (535), 1402–1412.

Koenker, R. (2005). Quantile Regression. Cambridge University Press.

Student perspectives: Neural Point Processes for Statistical Seismology

Posted on 20th June 202210th August 2022 by sam.stockman

A post by Sam Stockman, PhD student on the Compass programme.

Introduction

Throughout my PhD I aim to bridge a gap between advances made in the machine learning community and the age-old problem of earthquake forecasting. In this cross-disciplinary work with Max Werner from the School of Earth Sciences and Dan Lawson from the School of Mathematics, I hope to create more powerful, efficient and robust models for forecasting, that can make earthquake prone areas safer for their inhabitants.

For years seismologists have sought to model the structure and dynamics of the earth in order to make predictions about earthquakes. They have mapped out the structure of fault lines and conducted experiments in the lab where they submit rock to great amounts of force in order to simulate plate tectonics on a small scale. Yet when trying to forecast earthquakes on a short time scale (that’s hours and days, not tens of years), these models based on the knowledge of the underlying physics are regularly outperformed by models that are statistically motivated. In statistical seismology we seek to make predictions through looking at distributions of the times, locations and magnitudes of earthquakes and use them to forecast the future.