## Compass Special Lecture: Jonty Rougier

Compass is excited to announce that Jonty Rougier (2021 recipient of the Barnett Award) will be delivering a Compass Special Lecture.

Jonty’s experience lies in Computer Experiments, computational statistics and Machine Learning, uncertainty and risk assessment, and decision support. In 2021, he was awarded Barnett Award by the RSS, which is made to those internationally recognised for contributions in the field of environmental statistics, risk and uncertainty quantification. Rougier has also advised several UK Government departments and agencies, including a secondment to the Cabinet Office in 2016/17 to contribute to the UK National Risk Assessment.

13th April 2021
09:30 - 14:00

## Student perspectives: Stein’s paradox in statistics

A post by Alexander Modell, PhD student on the Compass programme.

In 1922, Sir R.A. Fisher laid the foundations of classical statistics with an idea he named maximum likelihood estimation. His principle, for determining the parameters of a statistical model, simply states that given some data, you should choose the parameters which maximise the probability of observing the data that you actually observed. It is based on a solid philosophical foundation which Fisher termed “logic of inductive reasoning” and provides intuitive estimators for all sorts of statistical problems. As a result, it is frequently cited as one of the most influential pieces of applied mathematics of the twentieth century. Eminent statistician Bradley Efron even goes as far as to write

“If Fisher had lived in the era of “apps”, maximum likelihood estimation would have made him a billionaire.” [1]

It came as a major shock to the statistical world when, in 1961, James and Stein exposed a paradox that had the potential to undermine what had become the backbone of classical statistics [2].

It goes something like this.

Suppose I choose a real number on a number line, and instead of telling you that number directly, I add some noise, that is, I perturb it a little and give you that perturbed number instead. We’ll assume that the noise I add is normally distributed and, to keep things simple, has variance one. I ask you to guess what my original number was – what do you do? With no other information to go on, most people would guess the number that I gave them, and they’d be right to do so. By a commonly accepted notion of what “good” is (we’ll describe this a little later on) – this is a “good” guess! Notably, this is also the maximum likelihood estimator.

Suppose now that I have two real numbers and, again, I independently add some noise to each and tell you those numbers. What is your guess of my original numbers? For the same reasons as before, you choose the numbers I gave you. Again, this is a good guess!

Now suppose I have three numbers. As usual, I independently add some noise to each one and tell you those numbers. Of course, you guess the numbers I gave you, right? Well, this no longer the best thing to do! If we imagine those three numbers as a point in three-dimensional space, you can actually do better by arbitrarily choosing some other point, let’s call it B, and shifting your estimate slightly from the point I gave you towards B. This is Stein’s paradox. At first glance, it seems rather absurd. Even after taking some time to take in its simple statement and proof, it’s almost impossible to see how it can possibly be true.

To keep things simple, from now on we will assume that B is the origin, that is that each dimension of B is zero, although keep in mind that all of this holds for any arbitrary point. Stated more formally, the estimator proposed by James and Stein is

$\hat{\mu}_i^{\text{(JS)}} = \left( 1 - \frac{(p-2)}{x_1^2+\cdots + x_p^2} \right) x_i.$

where p is the number of numbers you are estimating, and $x_1,\ldots,x_p$ are the noisy versions that I give you. Since the term in brackets is always less than one, their estimator shrinks each of the dimensions of the observation towards zero. For this reason, it is known as a “shrinkage” estimator. There are two central surprises here. Firstly, it seems totally paradoxical to estimate anything other than the values I gave you. With no other information, why on earth would you estimate anything different? Secondly, their estimate of one value depends on the observations of all of the other values, despite the fact that they’re totally independent!

To most, this is seemingly ludicrous, and to make this point, in his paper on the phenomenon [3], Richard Samworth presents an unusual example. Suppose we want to estimate the proportion of the US electorate who will vote for Barack Obama, the proportion of babies born in China that are girls and the proportion of Britons with light-coloured eyes, then our James-Stein estimate of the proportion of Democratic voters will depend on our hospital and eye colour data!

Before we discuss what we mean when we say that the James-Stein estimator is “better” than the maximum likelihood estimator, let’s take a step back and get some visual intuition as to why this might be true.

Suppose I have a circle on an infinite grid and I task you with guessing the position of the centre of that circle. Instead of just telling you the location of the centre, I draw a point from the circle at random and tell you that position. Let’s call this position A and we’ll call the true, unknown centre of the circle C. What do you propose as your guess? Intuition tells you to guess A, but Stein’s paradox would suggest choosing some other point B, and shifting your guess towards it.

Let’s suppose the circle has radius one and its true centre is at $x=0,y=2$. To keep things simple, suppose I set the point B to be the origin (that is $x=0,y=0$). Now, if I randomly draw a point from this circle, what proportion of the time will my guess actually be better if I shrink it slightly towards the point B. For this illustration, we’ll just consider shrinking towards B a very small amount, so we can rephrase this question as this: What proportion of the circle could get closer to the centre if it were shrunk towards B a little?

The answer to that question is the region marked “2” on the figure. A bit of geometry tells us that that’s about 61% of the circle, a little over half. So estimating a point slighter closer to B than the point we drew has the potential to improve our estimate more times than it hinders it! In fact, this holds true regardless of the position of B.

Now let’s consider this problem in three dimensions. We now have a sphere rather than a circle but everything else remains the same. In three dimensions, the region marked “2” covers just over 79% of the sphere, so about four times out of five, our shrinkage estimator does better than estimating the point A. If you’ve just tried to imagine a four-dimensional sphere, your head probably hurts a little but you can still do this maths. Region “2” now covers 87% of the sphere. In ten dimensions, it covers about 98% of the sphere and in one hundred dimensions it covers 99.99999% of it. That is to say, that only one in ten millions times, our estimate couldn’t be improved by shrinking it towards B.

Hopefully, you’re starting it see how powerful this idea can be, and how much more powerful it becomes as we move into higher dimensions. Despite this, the more you think about it, the more something seems amiss – yet it’s not.

Statisticians typically determine how “good” an estimator is through what is called a loss function. A loss function can be viewed as a penalty that increases the further your estimate is from the underlying truth. Examples are the squared error loss $\ell_2(\hat{\mu};\mu) = \sum_i(\hat{\mu}_i - \mu_i)^2,$ which is by far the most common choice, and the absolute error loss $\ell_1(\hat{\mu};\mu)\ = \sum_i |\hat{\mu}_i - \mu_i|$ where $\hat{\mu}$ is the estimate and $\mu$ is the truth. Since the loss function depends on the data that was observed, it is not suitable for comparing estimators, so instead the risk of the estimator is used, which is defined as the loss we would expect to see if we drew a random dataset from the underlying distribution.

The astonishing result that James and Stein proved, was that, under the squared error loss, their “shrinkage” estimator has a lower risk than the maximum likelihood estimator, regardless of the underlying truth. This came as a huge surprise to the statistics community at the time, which was firmly rooted in Fisher’s principle of maximum likelihood estimation.

It is natural to ask whether this is a quirk of the normal distribution and the squared error loss, but in fact, similar phenomena have been shown for a wide range of distributions and loss functions.

It would be easy to shrug this result off as merely a mathematical curiosity – many contemporaries of Stein did – yet in the era of big data, the ideas underpinning it have proved crucial to modern statistical methodology. Modern datasets regularly contain tens of thousands, if not millions of dimensions and in these settings, classical statistical ideas break down. Many modern machine learning algorithms, such as ridge and Lasso regression, are underpinned by these ideas of shrinkage.

References

[1]: Efron, B., & Hastie, T. (2016). Computer age statistical inference (Vol. 5). Cambridge University Press.

[2]: James, W. and Stein, C.M. (1961) Estimation with Quadratic Loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1, 361-379, University California Press, Berkeley.

[3]: Samworth, R. J., & Cambridge, S. (2012). Stein’s paradox. eureka, 62, 38-41.

## Student Perspectives: LV Datathon – Insurance Claim Prediction

A post by Doug Corbin, PhD student on the Compass programme.

# Introduction

In a recent bout of friendly competition, students from the Compass and Interactive AI CDT’s were divided into eight teams to take part in a two week Datathon, hosted by insurance provider LV. A Datathon is a short, intensive competition, posing a data-driven challenge to the teams. The challenge was to construct the best predictive model for (the size of) insurance claims, using an anonymised, artificial data set generously provided by LV. Each team’s solution was given three important criteria, on which their solutions would be judged:

• Accuracy – How well the solution performs at predicting insurance claims.
• Explainability – The ability to understand and explain how the solution calculates its predictions; It is important to be able to explain to a customers how their quote has been calculated.
• Creativity – The solution’s incorporation of new and unique ideas.

Students were given the opportunity to put their experience in Data Science and Artificial Intelligence to the test on something resembling real life data, forming cross-CDT relationships in the process.

# Data and Modelling

Before training a machine learning model, the data must first be processed into a numerical format. To achieve this, most teams transformed categorical features into a series of 0’s and 1’s (representing the value of the category), using a well known process called one-hot encoding. Others recognised that certain features had a natural order to them, and opted to map them to integers corresponding to their ordered position.

A common consideration amongst all the teams’ analysis was feature importance. Out of the many methods/algorithms which were used to evaluate the relative importance of the features, notable mentions are Decision Trees, LASSO optimisation, Permutation Importance and SHAP Values. The specific details of these methods are beyond the scope of this blog post, but they all share a common goal, to rank features according to how important they are in predicting claim amounts. In many of the solutions, feature importance was used to simplify the models by excluding features with little to no predictive power. For others, it was used as a post analysis step to increase explainability i.e. to show which features where most important for a particular claim prediction. As a part of the feature selection process, all teams considered the ethical implications of the data, with many choosing to exclude certain features to mitigate social bias.

Interestingly, almost all teams incorporated some form of Gradient Boosted Decision Trees (GBDT) into their solution, either for feature selection or regression. This involves constructing multiple decision trees, which are aggregated to give the final prediction. A decision tree can be thought of as a sequence of binary questions about the features (e.g. Is the insured vehicle a sports car? Is the car a write off?), which lead to a (constant) prediction depending on the the answers. In the case of GBDT, decision trees are constructed sequentially, each new tree attempting to capture structure in the data which has been overlooked by its predecessors. The final estimate is a weighted sum of the trees, where the weights are optimised using (the gradient of) a specified loss function e.g. Mean-Squared Error (MSE),

$MSE = \frac{1}{N} \sum_{n = 1}^N (y_n - \hat{y}_n)^2.$

Many of the teams trialled multiple regression models, before ultimately settling on a tree-based model. However, it is well-known that tree-based models are prone to overfitting the training data. Indeed, many of the teams were surprised to see such significant difference between the training/testing Mean Absolute Error (MAE),

$MAE = \frac{1}{N} \sum_{n = 1}^N |y_n - \hat{y}_n|.$

# Results

After two weeks of hard-work, the students came forward to present their solutions to a judging panel formed of LV representatives and CDT directors. The success of each solution was measured via the MAE of their predictions on the testing data set. Anxious to find out the results, the following winners were announced.

### Accuracy Winners

Pre-processing: Categorical features one-hot encoded or mapped to integers where appropriate.

Regression Model: Gradient Boosted Decision Trees.

Testing MAE: 69.77

The winning team (in accuracy) was able to dramatically reduce their testing MAE through their choice of loss function. Loss functions quantify how good/bad a regression model is performing during the training process, and it is used to optimise the linear combination of decision trees. While most teams used the popular loss function, Mean-Squared Error, the winning team instead used Least Absolute Deviationswhich is equivalent to optimising for the MAE while training the model.

### Explainability (Joint) Winners

After much deliberation amongst the judging panel, two teams were awarded joint first place in the explainability category!

Team 1:

Pre-processing: Categorical features one-hot encoded or mapped to integers where appropriate. Features centred and scaled to have mean 0 and standard deviation 1, then selected using Gradient Boosted Decision Trees.

Regression Model: K-Nearest-Neighbours Regression

Testing MAE: 75.25

This team used Gradient Boosted Decision Trees for feature selection, combined with K-Nearest-Neighbours (KNN) Regression to model the claim amounts. KNN regression is simple in nature; given a new claim to be predicted, the K “most similar” claims in the training set are averaged (weighted according to similarity) to produce the final prediction. It is thus explainable in the sense that for every prediction you can see exactly which neighbours contributed, and what similarities they shared. The judging panel noted that, from a consumer’s perspective, they may not be satisfied with their insurance quote being based on just K neighbours.

Team 2:

Pre-processing: All categorical features one-hot encoded.

Regression Model: Gradient Boosted Decision Trees. SHAP values used for post-analysis explainability.

Testing MAE: 80.3.

The judging panel was impressed by this team’s decision to impose monotonicity in the claim predictions with respect to the numerical features. This asserts that, for monotonic features, the claim prediction must move in a constant direction (increasing or decreasing) if the numerical feature is moving in a constant direction. For example, a customer’s policy excess is the amount they will have to pay towards a claim made on their insurance. It stands to reason that increasing the policy excess (while other features remain constant) should not increase their insurance quote. If this constraint is satisfied, we say that the insurance quote is monotonic decreasing with respect to the policy excess. Further, SHAP values were used to explain the importance/effect of each feature on the model.

### Creativity Winners

Pre-processing: Categorical features one-hot encoded or mapped to integers where appropriate. New feature engineered from Vehicle Size and Vehicle Class. Features selected using Permutation Importance.

Regression Model: Gradient Boosted Decision Trees. Presented post-analysis mediation/moderation study of the features.

Testing MAE: 76.313.

The winning team for creativity presented unique and intriguing methods for understanding and manipulating the data. This team noticed that the features, Vehicle Size and Vehicle Class, are intrinsically related e.g. They investigated whether a large vehicle would likely yield a higher claim if it is also of luxury vehicle class. To represent this relationship, they engineered a new feature by taking a multiplicative combination of the two initial features.

As an extension of their solution, they presented an investigation of the causal relationship between the different features. Several hypothesis tests were performed, testing whether the relationship between certain features and claim amounts is moderated or mediated by an alternative feature in the data set.

• Mediating relationships: If a feature is mediated by an alternative feature in the data set, its relationship with the claim amounts can be well explained by the alternative (potentially indicating it can be removed from the model).
• Moderating relationships: If a feature is moderated by an alternative feature in the data set, the strength and/or direction of the relationship with the claim amounts is impacted by the alternative.

# Final Thoughts

All the teams showed a great understanding of the problem and identified promising solutions. The competitive atmosphere of the LV Datathon created a notable buzz amongst the students, who were eager to present and compare their findings. As evidenced by every team’s solution, the methodological conclusion is clear: When it comes to insurance claim prediction, tree-based models are unbeaten!

## Skills for Interdisciplinary Research

To acknowledge the variety of sectors where data science research is relevant, in March 2021, the Compass students are undertaking a series of workshops led by the Bristol Doctoral College to explore Skills for Interdisciplinary Research.  Using the Vitae framework for researcher development, our colleague at BDC will introduce Compass students to the following topics:

Workshop 1: What is a doctorate? A brief history of doctorates in the UK, how they have changed in the past two decades, why CDTs?, what skills are needed now for a doctorate?

Workshop 2: Interdisciplinarity – the foundations. A practical case study on interdisciplinary postgraduate research at Bristol.

Workshop 3: Ways of knowing, part 1 – Positivism and ‘ologies! Deconstructing some of the terminology around knowledge and how we know what we know. Underpinning assumption – to know your own discipline, you need to step outside of it and see it as others do.

Workshop 4: Ways of knowing, part 2 – Social constructionism and qualitative approaches to research. In part 1 of ways of knowing, the ideal ‘science’ approach is objective and the researcher is detached from the subject of study; looking at other approaches where the role of research is integral to the research.

Workshop 5: Becoming a good researcher – research integrity and doctoral students. A look at how dilemmas in research can show us how research integrity is not just a case of right or wrong.

Workshop 6: Getting started with academic publishing. An introduction on the scholarly publishing pressure in contemporary research and it explores what that means in an interdisciplinary context.

## Student Research Topics for 2020/21

This month, the Cohort 2 Compass students have started work on their mini projects and are establishing the direction of their own research within the CDT.

## Supervised by the Institute for Statistical Science:

Anthony Stevenson will be working with Robert Allison on a project entitled Fast Bayesian Inference at Extreme Scale.  This project is in partnership with IBM Research.

Conor Crilly will be working with Oliver Johnson on a project entitled Statistical models for forecasting reliability. This project is in partnership with AWE.

Euan Enticott will be working with Matteo Fasiolo and Nick Whiteley on a project entitled Scalable Additive Models for Forecasting Electricity Demand and Renewable Production.  This project is in partnership with EDF.

Annie Gray will be working with Patrick Rubin-Delanchy and Nick Whiteley on a project entitled Exploratory data analysis of graph embeddings: exploiting manifold structure.

Ed Davis will be working with Dan Lawson and Patrick Rubin-Delanchy on a project entitled Graph embedding: time and space.  This project is in partnership with LV Insurance.

Conor Newton will be working with Henry Reeve and Ayalvadi Ganesh on a project entitled  Decentralised sequential decision making and learning.

## The following projects are supervised in collaboration with the Institute for Statistical Science (IfSS) and our other internal partners at the University of Bristol:

Dan Ward will be working with Matteo Fasiolo (IfSS) and Mark Beaumont from the School of Biological Sciences on a project entitled Agent-based model calibration via regression-based synthetic likelihood. This project is in partnership with Improbable

Jack Simons will be working with Song Liu (IfSS) and Mark Beaumont (Biological Sciences) on a project entitled Novel Approaches to Approximate Bayesian Inference.

Georgie Mansell will be working with Haeran Cho (IfSS) and Andrew Dowsey from the School of Population Health Sciences and Bristol Veterinary School on a project entitled Statistical learning of quantitative data at scale to redefine biomarker discovery.  This project is in partnership with Sciex.

Shannon Williams will be working with Anthony Lee (IfSS) and Jeremy Phillips from the School of Earth Sciences on a project entitled Use and Comparison of Stochastic Simulations and Weather Patterns in probabilistic volcanic ash hazard assessments.

Sam Stockman  will be working with Dan Lawson (IfSS) and Maximillian Werner from the School of Geographical Sciences on a project entitled Machine Learning and Point Processes for Insights into Earthquakes and Volcanoes

## Responsible Innovation in Data Science Research

This February our 2nd year Compass students will attend workshops in responsible innovation.

Run in partnership with the School of Management, the structured module constitutes Responsible Innovation training specifically for research in Data Science.

Taking the EPSRC AREA (Anticipate, Reflect, Engage, Act) framework for Responsible Innovation as it’s starting point, the module will take students through a guided process to develop the skills, knowledge and facilitated experience to incorporate the tenets of the AREA framework in to their PhD practice. Topics covered will include:
· Ethical and societal implications of data science and computational statistics
· Skills for anticipation
· Reflexivity for researchers
· Public perception of data science and engagement of publics
· Regulatory frameworks affecting data science

## New opportunity: a jointly funded studentship with FAI Farms

Compass is very excited to advertise this PhD studentship in collaboration with FAI Farms on a vision-based system for automated poultry welfare assessment through deep learning and Bayesian modelling.