Student Perspectives: Contemporary Ideas in Statistical Philosophy

A post by Alessio Zakaria, PhD student on the Compass programme.

Introduction

Probability theory is a branch of mathematics centred around the abstract manipulation and quantification of uncertainty and variability. It is a foundational component of the theory and practice of statistics, enabling us to turn the complex, variable nature of observable phenomena into meaningful information. Because of this reliance, the debate over the true (or more correct) underlying nature of probability has profound effects on how statisticians do their work. The two opposing sides of the debate are the Frequentists and the Bayesians. Frequentists believe that probability is intrinsically linked to the numeric regularity with which events occur, i.e. their frequency. Bayesians, however, believe that probability is an expression of someone's degree of belief or confidence in a certain claim. In everyday parlance we use both of these concepts interchangeably: I estimate that one in five people have Covid; I was 50% confident that the football was coming home. It should be noted that the latter of the two is not a repeatable event per se. We cannot roll back time to check what the repeatable sequence of events would result in.

(more…)

Student perspectives: How can we do data science without all of our data?

A post by Daniel Williams, Compass PhD student.

Imagine that you are employed by Chicago’s city council, and are tasked with estimating where the mean locations of reported crimes are in the city. The data that you are given only goes up to the city’s borders, even though crime does not suddenly stop beyond this artificial boundary. As a data scientist, how would you estimate these centres within the city? Your measurements are obscured past a very complex border, so regular methods such as maximum likelihood would not be appropriate.

Figure 1: Homicides in the city of Chicago in 2008. Left: locations of each homicide. Right: a density estimate of the same crimes, highlighting where the ‘hotspots’ are.

This is an example of a more general problem in statistics named truncated probability density estimation. How do we estimate the parameters of a statistical model when data are not fully observed, and are cut off by some artificial boundary? (more…)
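To see why standard maximum likelihood struggles here, consider a minimal numerical sketch (not taken from the post, and not the estimator it goes on to develop): simulate Gaussian "crime locations", keep only the points falling inside an artificial boundary, and compare the naive MLE of the mean with the truth.

    # A minimal sketch (illustrative only): naive MLE is biased under truncation.
    import numpy as np

    rng = np.random.default_rng(0)
    true_mean = np.array([1.0, 2.0])
    samples = rng.multivariate_normal(true_mean, np.eye(2), size=100_000)

    # Artificial boundary: we only observe points with x-coordinate below 1.5.
    observed = samples[samples[:, 0] < 1.5]

    naive_mle = observed.mean(axis=0)  # sample mean = Gaussian MLE, ignoring truncation
    print("true mean:        ", true_mean)
    print("naive MLE:        ", naive_mle)            # pulled away from the boundary
    print("fraction observed:", len(observed) / len(samples))

The sample mean is dragged away from the cut-off region, which is exactly the failure mode that truncated density estimation methods are designed to correct.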

Student Perspectives: Embedding probability distributions in RKHSs

A post by Jake Spiteri, Compass PhD student.

Recent advancements in kernel methods have introduced a framework for nonparametric statistics by embedding and manipulating probability distributions in Hilbert spaces. In this blog post we will look at how to embed marginal and conditional distributions, and how to perform probability operations such as the sum, product, and Bayes’ rule with embeddings.

Embedding marginal distributions

Throughout this blog post we will make use of reproducing kernel Hilbert spaces (RKHSs). A reproducing kernel Hilbert space is simply a Hilbert space with some additional structure, and a Hilbert space is just a vector space equipped with an inner product which is complete with respect to the norm that the inner product induces.

We will frequently refer to a random variable X which has domain \mathcal{X} endowed with the \sigma-algebra \mathcal{B}_\mathcal{X}.

Definition. A reproducing kernel Hilbert space \mathcal{H} on \mathcal{X} with associated positive semi-definite kernel function k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R} is a Hilbert space of functions f: \mathcal{X} \rightarrow \mathbb{R}. We have k(x, \cdot) \in \mathcal{H}, \forall x \in \mathcal{X}, and the inner product \langle\cdot,\cdot\rangle_\mathcal{H} satisfies the reproducing property: \langle f, k(x, \cdot) \rangle_\mathcal{H} = f(x), \forall f \in \mathcal{H}, x \in \mathcal{X}. (more…)
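As an illustrative aside (a minimal sketch rather than the framework developed in the full post), the marginal embedding of a distribution P is typically estimated from a sample x_1, \dots, x_n by the empirical mean embedding \hat{\mu}_P = \frac{1}{n}\sum_{i=1}^n k(x_i, \cdot), and distances between embeddings (the maximum mean discrepancy, MMD) can be computed purely from kernel evaluations thanks to the reproducing property.

    # Empirical kernel mean embeddings with a Gaussian kernel, and the (biased,
    # V-statistic) squared MMD between two samples. Illustrative sketch only.
    import numpy as np

    def gaussian_kernel(X, Y, bandwidth=1.0):
        """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) for all pairs."""
        sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
        return np.exp(-sq_dists / (2 * bandwidth**2))

    def mmd_squared(X, Y, bandwidth=1.0):
        """||mu_P_hat - mu_Q_hat||^2_H = <mu_P, mu_P> - 2<mu_P, mu_Q> + <mu_Q, mu_Q>."""
        Kxx = gaussian_kernel(X, X, bandwidth)
        Kyy = gaussian_kernel(Y, Y, bandwidth)
        Kxy = gaussian_kernel(X, Y, bandwidth)
        return Kxx.mean() - 2 * Kxy.mean() + Kyy.mean()

    rng = np.random.default_rng(1)
    X = rng.normal(0.0, 1.0, size=(500, 1))   # sample from P
    Y = rng.normal(0.5, 1.0, size=(500, 1))   # sample from Q
    print(mmd_squared(X, Y))                  # larger when P and Q differ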

Student Perspectives: Change in the air: Tackling Bristol’s nitrogen oxide problem

A post by Dom Owens, PhD student on the Compass programme.



“Air pollution kills an estimated seven million people worldwide every year” – World Health Organisation

Many particulates and chemicals are present in the air in urban areas like Bristol, and this poses a serious risk to our respiratory health. It is difficult to model how these concentrations behave over time due to the complex physical, environmental, and economic factors they depend on, but identifying if and when abrupt changes occur is crucial for designing and evaluating public policy measures, as outlined in the local Air Quality Annual Status Report.  Using a novel change point detection procedure to account for dependence in time and space, we provide an interpretable model for nitrogen oxide (NOx) levels in Bristol, telling us when these structural changes occur and describing the dynamics driving them in between.

Model and Change Point Detection

We model the data with a piecewise-stationary vector autoregression (VAR) model:

\boldsymbol{Y}_{t} = \boldsymbol{\mu}^{(j)} + \sum_{i=1}^{p} \boldsymbol{A}_i^{(j)} \boldsymbol{Y}_{t-i} + \boldsymbol{\varepsilon}_{t}, \qquad k_{j-1} < t \leq k_{j}.

In between change points the time series \boldsymbol{Y}_{t}, a d-dimensional vector, depends on itself linearly over p \geq 1 previous time steps through parameter matrices \boldsymbol{A}_i^{(j)}, i=1, \dots, p, with intercepts \boldsymbol{\mu}^{(j)}, but at unknown change points k_j, j = 1, \dots, q, the parameters switch abruptly. \{ \boldsymbol{\varepsilon}_{t} \in \mathbb{R}^d : t \geq 1 \} are white noise errors, and we have n observations.
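As a toy illustration (this is not the detection procedure described in the post), one can simulate a piecewise-stationary VAR(1) with a single change point and, given the segments, recover the coefficient matrices by ordinary least squares:

    # Simulate a piecewise-stationary VAR(1) with one change point, then estimate
    # the coefficient matrix on each known segment by ordinary least squares.
    import numpy as np

    rng = np.random.default_rng(2)
    d, n, k = 3, 400, 200                        # dimension, length, change point
    A1 = 0.5 * np.eye(d)                         # regime 1 coefficients
    A2 = 0.5 * np.eye(d) + 0.3 * np.eye(d, k=1)  # regime 2 coefficients

    Y = np.zeros((n, d))
    for t in range(1, n):
        A = A1 if t <= k else A2
        Y[t] = A @ Y[t - 1] + rng.normal(scale=0.5, size=d)

    def fit_var1(segment):
        """OLS estimate of A in Y_t = A Y_{t-1} + eps_t (intercept omitted)."""
        X, Z = segment[:-1], segment[1:]
        return np.linalg.lstsq(X, Z, rcond=None)[0].T

    print("segment 1 estimate:\n", fit_var1(Y[:k]).round(2))
    print("segment 2 estimate:\n", fit_var1(Y[k:]).round(2))

The change point detection problem is precisely to recover k (and the number of change points) without being told where the segments are.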

(more…)

Student perspectives: Stein’s paradox in statistics

A post by Alexander Modell, PhD student on the Compass programme.

Stein’s paradox in statistics

In 1922, Sir R.A. Fisher laid the foundations of classical statistics with an idea he named maximum likelihood estimation. His principle for determining the parameters of a statistical model simply states that, given some data, you should choose the parameters which maximise the probability of observing the data that you actually observed. It is based on a solid philosophical foundation which Fisher termed the “logic of inductive reasoning”, and it provides intuitive estimators for all sorts of statistical problems. As a result, it is frequently cited as one of the most influential pieces of applied mathematics of the twentieth century. The eminent statistician Bradley Efron even goes as far as to write

“If Fisher had lived in the era of “apps”, maximum likelihood estimation would have made him a billionaire.” [1]

It came as a major shock to the statistical world when, in 1961, James and Stein exposed a paradox that had the potential to undermine what had become the backbone of classical statistics [2].

It goes something like this. (more…)
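The excerpt ends here, but the flavour of the paradox can be seen with the textbook James-Stein estimator (stated as a standalone illustration, not necessarily in the exact form the full post uses): for X \sim N(\theta, I_d) with d \geq 3, shrinking the naive estimate towards zero gives smaller expected total squared error than the MLE, whatever the true \theta.

    # Textbook James-Stein shrinkage vs the MLE for a d-dimensional normal mean.
    import numpy as np

    rng = np.random.default_rng(3)
    d, reps = 10, 20_000
    theta = rng.normal(size=d)                   # arbitrary true mean vector

    X = theta + rng.normal(size=(reps, d))       # one noisy observation per coordinate
    mle = X                                      # maximum likelihood estimate of theta
    shrink = 1 - (d - 2) / np.sum(X**2, axis=1, keepdims=True)
    james_stein = shrink * X

    mse_mle = np.mean(np.sum((mle - theta) ** 2, axis=1))
    mse_js = np.mean(np.sum((james_stein - theta) ** 2, axis=1))
    print(f"MLE risk: {mse_mle:.3f}  James-Stein risk: {mse_js:.3f}")  # JS is smaller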

Student Perspectives: LV Datathon – Insurance Claim Prediction

A post by Doug Corbin, PhD student on the Compass programme.

In a recent bout of friendly competition, students from the Compass and Interactive AI CDTs were divided into eight teams to take part in a two-week Datathon, hosted by insurance provider LV. A Datathon is a short, intensive competition posing a data-driven challenge to the teams. The challenge was to construct the best predictive model for (the size of) insurance claims, using an anonymised, artificial data set generously provided by LV. Each team's solution was judged on three important criteria:

  • Accuracy – How well the solution performs at predicting insurance claims.
  • Explainability – The ability to understand and explain how the solution calculates its predictions; it is important to be able to explain to a customer how their quote has been calculated.
  • Creativity – The solution’s incorporation of new and unique ideas.

Students were given the opportunity to put their experience in Data Science and Artificial Intelligence to the test on something resembling real life data, forming cross-CDT relationships in the process.

Data and Modelling

Before training a machine learning model, the data must first be processed into a numerical format. To achieve this, most teams transformed categorical features into a series of 0s and 1s (one binary column per category value), using a well-known process called one-hot encoding. Others recognised that certain features had a natural order to them, and opted to map them to integers corresponding to their ordered position; both encodings are sketched below. (more…)
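A minimal sketch of both encodings, using made-up column names since the LV data set itself is not shown here:

    # One-hot encode a nominal feature and map an ordered feature to integers.
    # Column names are hypothetical, for illustration only.
    import pandas as pd

    df = pd.DataFrame({
        "vehicle_type": ["hatchback", "saloon", "SUV", "hatchback"],   # nominal
        "coverage":     ["basic", "standard", "premium", "standard"],  # ordered
        "claim_amount": [250.0, 900.0, 1800.0, 400.0],
    })

    # One-hot encoding: one 0/1 column per category of the nominal feature.
    df = pd.get_dummies(df, columns=["vehicle_type"])

    # Ordinal encoding: preserve the natural order of the coverage levels.
    coverage_order = {"basic": 0, "standard": 1, "premium": 2}
    df["coverage"] = df["coverage"].map(coverage_order)

    print(df)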

Student perspectives: Wessex Water Industry Focus Lab

A post by Michael Whitehouse, PhD student on the Compass programme.

Introduction

September saw the first of an exciting new series of Compass industry focus labs; with this came the chance to make use of the extensive skill sets acquired throughout the course and an opportunity to provide solutions to pressing issues in modern industry. The partner for the first focus lab, Wessex Water, posed the following question: given time series data on water flow levels in pipes, can we detect if new leaks have occurred? Clean water available at the point of use is inherently valuable, and leaking this vital resource is costly, so ensuring an efficient system of delivery is of great importance. Hence, answering this question has the potential to provide huge economic, political, and environmental benefits for a large base of service users.

Data and Modelling

The dataset provided by Wessex Water consisted of water flow data from around 760 pipes. After the data were cleaned and processed, some useful series, such as minimum nightly flow and average daily flow (MNF and ADF respectively), were extracted. Preliminary analysis carried out by our collaborators at Wessex Water concluded that certain types of change in the structure of the water flow data are good indications that a leak has occurred. From this one can postulate that detecting a leak amounts to detecting these structural changes in the data. This principle gave us a framework for building solutions: detect the change, detect a new leak. Change point detection is a well-researched discipline that provides efficient methods for detecting statistically significant changes in the distribution of a time series, and hence a toolbox with which to tackle the problem. Indeed, we at Compass have our very own active member of the change point detection research community in the shape of Dom Owens.

The preliminary analysis indicated that three types of structural change in a water flow series signal a leak: a change in the mean of the MNF, a change in the trend of the MNF, and a change in the variance of the difference between the MNF and ADF. To detect these changes algorithmically, we needed to transform the given series so that the original change in distribution corresponded to a change in the mean of the transformed series. These transforms included calculating generalised additive model (GAM) residuals and analysing their distribution. An example of such a GAM is given by:

\mathbb{E}[\text{flow}_t] = \beta_0 + \sum_{i=1}^m f_i(x_i),

where the x_i's are features we want to use to predict the flow, such as the time of day or current season. The principle behind this analysis is that any change in the residual distribution corresponds to a violation of the assumption that the residuals are independently and identically distributed, and hence, in turn, corresponds to a deviation from the original structure we fit our GAM to. (more…)
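A rough sketch of the idea (illustrative only, and not the focus lab's actual pipeline): stand in for the GAM with a crude additive hour-of-day effect, then scan the squared residuals with a simple CUSUM-type statistic for a change in variance of the kind that signals a leak.

    # Simulated flow: daily seasonal pattern plus noise whose variance jumps at a
    # "leak". Fit per-hour means as a crude additive model, then scan residuals.
    import numpy as np

    rng = np.random.default_rng(4)
    n, true_cp = 2000, 1300
    hour = np.arange(n) % 24
    seasonal = 10 + 3 * np.sin(2 * np.pi * hour / 24)       # daily flow pattern
    noise_sd = np.where(np.arange(n) < true_cp, 0.5, 1.5)   # variance jumps at the leak
    flow = seasonal + rng.normal(scale=noise_sd)

    # "GAM" stand-in: additive effect of hour estimated by the per-hour mean.
    hour_means = np.array([flow[hour == h].mean() for h in range(24)])
    residuals = flow - hour_means[hour]

    # CUSUM on squared residuals: a large |S_k| suggests a variance change near k.
    sq = residuals**2
    k = np.arange(1, n)
    S = np.abs(np.cumsum(sq)[:-1] - k * sq.mean()) / np.sqrt(n)
    print("estimated change point:", int(k[np.argmax(S)]), "(true:", true_cp, ")")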

Student perspectives: Three Days in the life of a Silicon Gorge Start-Up

A post by Mauro Camara Escudero, PhD student on the Compass programme.

Last December the first Compass cohort took part in a three-day entrepreneurship training with SpinUp Science. Keep reading and you might just find out if the Silicon Gorge life is for you!

The Ambitious Project of SpinUp Science

SpinUp Science’s goal is to help PhD students like us develop an entrepreneurial skill-set that will come in handy if we decide to either commercialize a product, launch a start-up, or choose a consulting career.

I can already hear some of you murmur “Sure, this might be helpful for someone doing a much more applied PhD, but my work is theoretical. How is that ever going to help me?”. I get that; I used to believe the same. However, partly thanks to this training, I changed my mind and realised just how valuable these skills are, independently of whether you decide to stay in academia or find a job at an established company.

Anyways, I am getting ahead of myself. Let me first guide you through what the training looked like and then we will come back to this!

Day 1 – Meeting the Client

The day started with a presentation that, on paper, promised to be yet another one of those endless and boring talks that make you reach for the Stop Video button and take a nap. The vague title “Understanding the Opportunity” surely did not help either. Instead, we were thrown right into action! (more…)

Student perspectives: The Elo Rating System – From Chess to Education

A post by Andrea Becsek, PhD student on the Compass programme.

If you have recently binge-watched The Queen's Gambit, chances are you have heard of the Elo Rating System. There are actually many games out there that require some way to rank players or even teams. However, the applications of the Elo Rating System reach further than you might think.

History and Applications

The Elo Rating System [1] was first suggested as a way to rank chess players; however, it can be used in any competitive two-player game that requires a ranking of its players. The system was first adopted by the World Chess Federation in 1970, and there have been various adjustments to it since, resulting in different implementations by each organisation.

For any soccer-lovers out there, the FIFA world rankings are also based on the Elo System, but if you happen to be into a similar sport, worry not, Elo has you covered. And the list of applications goes on and on: Backgammon, Scrabble, Go, Pokemon, and apparently even Tinder used it at some point.

Fun fact: the formulas used by the Elo Rating System make a famous appearance in The Social Network, a movie about the creation of Facebook. Whether this was the actual algorithm used for FaceMash, the site that preceded Facebook, is however unclear.

All this sounds pretty cool, but how does it actually work?

How it works

We want a way to rank players and update their ranking after each game. Let’s start by assuming that we have the ranking for the two players about to play: \theta_i for player i and \theta_j for player j. Then we can compute the probability of player i winning against player j using the logistic function:

P(Y_{ij}=1)=\frac{1}{1+\exp\{-(\theta_i-\theta_j)\}}.

Given what we know about the logistic function, it's easy to notice that the smaller the difference between the players' ratings, the less certain the outcome, as the probability of winning will be close to 0.5.

Once the outcome of the game is known, we can update both players’ abilities

\theta_{i}:=\theta_{i}+K(Y_{ij}-P(Y_{ij}=1))

\theta_{j}:=\theta_{j}+K(P(Y_{ij}=1)-Y_{ij}).

The K factor controls the influence of a single game on a player's previous ranking. For players with high rankings, a smaller K is used because we expect their abilities to be somewhat stable, and hence their ranking shouldn't be too heavily influenced by every game. On the other hand, players with low ability can learn and improve quite quickly, and therefore their rating should be able to fluctuate more, so a larger K is used.

The term in the brackets represents how different the actual outcome is from the expected outcome of the game. If a player is expected to win but doesn’t, their ranking will decrease, and vice versa. The larger the difference, the more their rating will change. For example, if a weaker player is highly unlikely to win, but they do, their ranking will be boosted quite a bit because it was a hard battle for them. On the other hand, if a strong player is really likely to win because they are playing against a weak player, their increase in score will be small as it was an easy win for them.
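In code, the update is a direct transcription of the formulas above (an illustrative snippet rather than anything from the post; the value of K here is arbitrary):

    # Elo update for a single game, using the logistic win probability above.
    import math

    def expected_score(theta_i, theta_j):
        """Probability that player i beats player j."""
        return 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))

    def elo_update(theta_i, theta_j, y_ij, k=0.1):
        """Return updated ratings given the outcome y_ij (1 if i wins, 0 otherwise)."""
        p = expected_score(theta_i, theta_j)
        return theta_i + k * (y_ij - p), theta_j + k * (p - y_ij)

    # Example: an underdog (rating -1.0) beats a favourite (rating 1.0).
    print(elo_update(-1.0, 1.0, y_ij=1))   # the underdog gains, the favourite loses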

Elo Rating and Education

As previously mentioned, the Elo Rating System has been used in a wide range of fields and, as it turns out, that includes education, more specifically adaptive educational systems [2]. Adaptive educational systems are concerned with automatically selecting adequate material for a student depending on their previous performance.

Note that a system can be adaptive at different levels of granularity. Some systems might adapt the homework from week to week, generating it based on the student's current ability and updating that ability once the homework has been completed, whereas other systems are able to update the student's ability after every single question. As you can imagine, the second kind of system requires a fairly fast, online algorithm. And this is where the Elo Rating comes in.

For an adaptive system to work, we need two key components: student abilities and question difficulties. To apply the Elo Rating to this context, we treat a student’s interaction with a question as a game where the student’s ranking represents their ability and the question’s ranking represents its difficulty. We can then predict whether a student of ability \theta_i will answer a question of difficulty d_j correctly using

P(\text{correct}_{ij}=1)=\frac{1}{1+\exp\{-(\theta_i-d_j)\}}.

and the ability and difficulty can be updated using

\theta_{i}:=\theta_{i}+K(\text{correct}_{ij}-P(\text{correct}_{ij}=1))

d_{j}:=d_{j}+K(P(\text{correct}_{ij}=1)-\text{correct}_{ij}).

So even if you only have 10 minutes to create an adaptive educational system, you can easily implement this algorithm: set all abilities and question difficulties to 0, let students answer your questions, and wait for the magic to happen (see the sketch below). If you do have some prior knowledge about the difficulty of the items, you could of course incorporate that into the initial values.
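Here is what that ten-minute version might look like (a toy sketch with made-up students and questions, simulating answers from hypothetical "true" abilities and difficulties):

    # Adaptive-quiz Elo: all estimates start at zero and update after every answer.
    import math, random

    random.seed(0)

    def p_correct(theta, d):
        return 1.0 / (1.0 + math.exp(-(theta - d)))

    true_ability = {"alice": 1.0, "bob": -0.5}             # hypothetical students
    true_difficulty = {"q1": -1.0, "q2": 0.0, "q3": 1.0}   # hypothetical questions
    ability = {s: 0.0 for s in true_ability}               # everything starts at 0
    difficulty = {q: 0.0 for q in true_difficulty}
    K = 0.1

    for _ in range(5000):                                  # stream of answered questions
        s = random.choice(list(true_ability))
        q = random.choice(list(true_difficulty))
        correct = random.random() < p_correct(true_ability[s], true_difficulty[q])
        p = p_correct(ability[s], difficulty[q])
        ability[s] += K * (correct - p)
        difficulty[q] += K * (p - correct)

    print({k: round(v, 2) for k, v in ability.items()})
    print({k: round(v, 2) for k, v in difficulty.items()})

As with chess ratings, the estimates are only identified up to an additive constant: only the differences \theta_i - d_j affect the predicted probabilities.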

One important thing to note is that one should be careful with ranking students based on their abilities as this could result in various ethical issues. The main purpose of obtaining their abilities is to track their progress and match them with questions that are at the right level for them, easy enough to stay motivated, but hard enough to feel challenged.

Conclusion

So is Elo the best option for an adaptive system? It depends. It is fast, enables on the fly updates, it’s easy to implement, and in some contexts, it even has a similar performance to more complex models. However, there are usually many other factors that can be relevant to predicting student performance, such as the time spent on a question or the number of hints they use. This additional data can be incorporated into more complex models, probably resulting in better predictions and offering much more insight. At the end of the day, there is always a trade-off, so depending on the context it’s up to you to decide whether the Elo Rating System is the way to go.

Find out more about Andrea Becsek and her work on her profile page.

  1. Elo, A.E., 1978. The rating of chessplayers, past and present. Arco Pub.
  2. Pelánek, R., 2016. Applications of the Elo rating system in adaptive educational systems. Computers & Education, 98, pp.169-179.