A post by Daniel Williams, Compass PhD student.
Imagine that you are employed by Chicago’s city council, and are tasked with estimating where the mean locations of reported crimes are in the city. The data that you are given only goes up to the city’s borders, even though crime does not suddenly stop beyond this artificial boundary. As a data scientist, how would you estimate these centres within the city? Your measurements are obscured past a very complex border, so regular methods such as maximum likelihood would not be appropriate.
This is an example of a more general problem in statistics named truncated probability density estimation. How do we estimate the parameters of a statistical model when data are not fully observed, and are cut off by some artificial boundary?
Incorporation of this boundary constraint into estimation is not straightforward. There are many factors which must be considered which make regular methods infeasible, we require:
- knowledge of the normalising constant in a statistical model, which is unknown in a truncated setting;
- incorporation of weighting points that are more ‘affected’ by the truncation.
A statistical model (a probability density function involving some parameters ) is defined as:
This is made up of two components: , the unnormalised part of the model, which is known analytically; and , the normalising constant, which is oftentimes unknown or undefinable. The goal of estimation is to find a vector of parameters which characterises our model, and makes the model resemble the data as closely as possible. Usually, we can estimate by trying to minimise the difference between the model, , and the theoretical data density, .
A Recipe for a Solution
Ingredient #1: Score Matching
A novel estimation method called score matching enables estimation even when the normalising constant, , is unknown. Score matching begins by taking the derivative of the logarithm of the probability density function, i.e.,
which has become known as the score function. When taking the derivative, the dependence on is removed. To estimate the parameter vector , we can minimise the difference between and by minimising the difference between the two score functions, and . One such distance measure is the expected squared distance, so that score matching aims to minimise
With this first ingredient, we have eliminated the requirement that we must know the normalising constant.
Ingredient #2: A Weighting Function
Heuristically, we can imagine our weighting function should vary with how close a point is to the boundary. To satisfy score matching assumptions, we require that this weighting function must have the property that for any point on the boundary. A natural candidate would be the Euclidean distance from a point to the boundary, i.e.,
This easily satisfies our criteria. The distance is going to be exactly zero on the boundary itself, and will approach zero the closer the points are to the edges.
Ingredient #3: Any Statistical Model
Since we do not require knowledge of the normalising constant through the use of score matching, we can choose any probability density function that is appropriate for our data. For example, if we are modelling count data, we may choose a Poisson distribution. If we have some sort of centralised location data, such as in the Chicago crime example in the introduction, we may choose a multivariate Normal distribution.
Combine Ingredients and Mix
We aim to minimise the expected distance between the score functions, and we weight this by our function , to give
The only unknown in this equation now is the data density (if we knew the true data density, there would be no point in estimating it with ). However, we can rewrite this equation using integration by parts as
This can be numerically minimised, or if is in the exponential family, analytically minimised to obtain estimates for .
As a sanity check for testing estimation methods, it is often reassuring to perform some experiments on simulated data before moving to real world applications. Since we know the true parameter values, it is possible to calculate how far our estimated values are from their true counterparts, thus giving a way to measure estimation accuracy.
Pictured below in Figure 2 are two experiments where data are simulated from a multivariate normal distribution with mean and known variance . Our aim is to estimate the parameter to be close to . The red crosses in the image are the true means at , and the red dots are the estimates given by truncated score matching.
These figures clearly indicate that in this basic case, this method is giving sensible results. Even by ignoring most of our data, as long as we formulate our problem correctly, we can still get accurate results for estimation.
Now we come back to the motivating problem, and the task is to estimate the centres of police reported crime in the city of Chicago, given the longitudes and latitudes of homicides in 2008. Our specification changes somewhat:
- from the original plots, it seems like there are two centres, so the statistical model we choose is a 2-dimensional mixed multivariate Normal distribution;
- the variance is no longer known, but to keep estimation straightforward, the standard deviation is fixed so that roughly covers the ‘width’ of the city.
Figure 3 below shows applying this method to this application.
Truncated score matching has placed its estimates for the means in similar, but slightly different places than standard MLE. Whereas the maximum likelihood estimates are more central to the truncated dataset, the truncated score matching estimates are placed closer to the outer edges of the city. For the upper-most crime centre, around the city border the data are more ‘tightly packed’ – which is a property we expect of multivariate normal data. This result does not reflect the true likelihood of crime in a certain neighbourhood, and has only been presented to provide a visual explanation of the method. The actual likelihood depends on various characteristics in the city that are not included in our analysis, see here for more details.
Score matching is a powerful tool, and by not requiring any form of normalising constant, enables the use of some more complicated models. Truncated probability density estimation is just one such example of an intricate problem which can be solved with this methodology, but it is one with far reaching and interesting applications. Whilst this blog post has focused on location-based data and estimation, truncated probability density estimation could have a range of applications. For example, consider disease/virus modelling such as the COVID-19 pandemic. The true number of cases is obscured by the number of tests that can be performed, so the density evolution of the pandemic through time could be fully modelled with incorporation of this constraint using this method. Other applications as well as extensions to this method will be fully investigated in my future PhD projects.
Alternative methods which do not require the normalising constant: We could choose a model which has truncation built into its specification, such as the truncated normal distribution. For complicated regions of truncation, such as the city of Chicago, it is highly unlikely that an appropriate model specification is available. We could also perform some rejection algorithm to approximate the normalising constant, such as MCMC or ABC, but these are often very slow, and do not provide an exact solution. Score matching can have an analytical estimate for exponential family models, and even when it does not, minimisation is usually straightforward and fast.
Choice of the weighting function: We did not need to specify before formulating our score matching objective function in ingredient #3. In fact, under some assumptions (the function is 1-Lipschitz continuous) it can be shown that the choice of which gives the supremum of the score matching expectation is in fact the Euclidean distance which was specified earlier, so it was not just a heuristic!
Score matching derivation conditions: This post has hand-waved some conditions of the derivation of the score matching objective function. There are two conditions which must be satisfied in the weighting function so that the integration by parts ‘trick’ can work correctly. Firstly, must be positive for every . Secondly, for any in the boundary, . For more information, see the further reading section below.
If you are interested to learn more about any of these topics, check out any of the papers below:
And the Github page where all the plots and results were generated can be found here.
A post by Doug Corbin, PhD student on the Compass programme.
In a recent bout of friendly competition, students from the Compass and Interactive AI CDT’s were divided into eight teams to take part in a two week Datathon, hosted by insurance provider LV. A Datathon is a short, intensive competition, posing a data-driven challenge to the teams. The challenge was to construct the best predictive model for (the size of) insurance claims, using an anonymised, artificial data set generously provided by LV. Each team’s solution was given three important criteria, on which their solutions would be judged:
- Accuracy – How well the solution performs at predicting insurance claims.
- Explainability – The ability to understand and explain how the solution calculates its predictions; It is important to be able to explain to a customers how their quote has been calculated.
- Creativity – The solution’s incorporation of new and unique ideas.
Students were given the opportunity to put their experience in Data Science and Artificial Intelligence to the test on something resembling real life data, forming cross-CDT relationships in the process.
Data and Modelling
Before training a machine learning model, the data must first be processed into a numerical format. To achieve this, most teams transformed categorical features into a series of 0’s and 1’s (representing the value of the category), using a well known process called one-hot encoding. Others recognised that certain features had a natural order to them, and opted to map them to integers corresponding to their ordered position.
A common consideration amongst all the teams’ analysis was feature importance. Out of the many methods/algorithms which were used to evaluate the relative importance of the features, notable mentions are Decision Trees, LASSO optimisation, Permutation Importance and SHAP Values. The specific details of these methods are beyond the scope of this blog post, but they all share a common goal, to rank features according to how important they are in predicting claim amounts. In many of the solutions, feature importance was used to simplify the models by excluding features with little to no predictive power. For others, it was used as a post analysis step to increase explainability i.e. to show which features where most important for a particular claim prediction. As a part of the feature selection process, all teams considered the ethical implications of the data, with many choosing to exclude certain features to mitigate social bias.
Interestingly, almost all teams incorporated some form of Gradient Boosted Decision Trees (GBDT) into their solution, either for feature selection or regression. This involves constructing multiple decision trees, which are aggregated to give the final prediction. A decision tree can be thought of as a sequence of binary questions about the features (e.g. Is the insured vehicle a sports car? Is the car a write off?), which lead to a (constant) prediction depending on the the answers. In the case of GBDT, decision trees are constructed sequentially, each new tree attempting to capture structure in the data which has been overlooked by its predecessors. The final estimate is a weighted sum of the trees, where the weights are optimised using (the gradient of) a specified loss function e.g. Mean-Squared Error (MSE),
Many of the teams trialled multiple regression models, before ultimately settling on a tree-based model. However, it is well-known that tree-based models are prone to overfitting the training data. Indeed, many of the teams were surprised to see such significant difference between the training/testing Mean Absolute Error (MAE),
After two weeks of hard-work, the students came forward to present their solutions to a judging panel formed of LV representatives and CDT directors. The success of each solution was measured via the MAE of their predictions on the testing data set. Anxious to find out the results, the following winners were announced.
Pre-processing: Categorical features one-hot encoded or mapped to integers where appropriate.
Regression Model: Gradient Boosted Decision Trees.
Testing MAE: 69.77
The winning team (in accuracy) was able to dramatically reduce their testing MAE through their choice of loss function. Loss functions quantify how good/bad a regression model is performing during the training process, and it is used to optimise the linear combination of decision trees. While most teams used the popular loss function, Mean-Squared Error, the winning team instead used Least Absolute Deviations, which is equivalent to optimising for the MAE while training the model.
Explainability (Joint) Winners
After much deliberation amongst the judging panel, two teams were awarded joint first place in the explainability category!
Pre-processing: Categorical features one-hot encoded or mapped to integers where appropriate. Features centred and scaled to have mean 0 and standard deviation 1, then selected using Gradient Boosted Decision Trees.
Regression Model: K-Nearest-Neighbours Regression
Testing MAE: 75.25
This team used Gradient Boosted Decision Trees for feature selection, combined with K-Nearest-Neighbours (KNN) Regression to model the claim amounts. KNN regression is simple in nature; given a new claim to be predicted, the K “most similar” claims in the training set are averaged (weighted according to similarity) to produce the final prediction. It is thus explainable in the sense that for every prediction you can see exactly which neighbours contributed, and what similarities they shared. The judging panel noted that, from a consumer’s perspective, they may not be satisfied with their insurance quote being based on just K neighbours.
Pre-processing: All categorical features one-hot encoded.
Regression Model: Gradient Boosted Decision Trees. SHAP values used for post-analysis explainability.
Testing MAE: 80.3.
The judging panel was impressed by this team’s decision to impose monotonicity in the claim predictions with respect to the numerical features. This asserts that, for monotonic features, the claim prediction must move in a constant direction (increasing or decreasing) if the numerical feature is moving in a constant direction. For example, a customer’s policy excess is the amount they will have to pay towards a claim made on their insurance. It stands to reason that increasing the policy excess (while other features remain constant) should not increase their insurance quote. If this constraint is satisfied, we say that the insurance quote is monotonic decreasing with respect to the policy excess. Further, SHAP values were used to explain the importance/effect of each feature on the model.
Pre-processing: Categorical features one-hot encoded or mapped to integers where appropriate. New feature engineered from Vehicle Size and Vehicle Class. Features selected using Permutation Importance.
Regression Model: Gradient Boosted Decision Trees. Presented post-analysis mediation/moderation study of the features.
Testing MAE: 76.313.
The winning team for creativity presented unique and intriguing methods for understanding and manipulating the data. This team noticed that the features, Vehicle Size and Vehicle Class, are intrinsically related e.g. They investigated whether a large vehicle would likely yield a higher claim if it is also of luxury vehicle class. To represent this relationship, they engineered a new feature by taking a multiplicative combination of the two initial features.
As an extension of their solution, they presented an investigation of the causal relationship between the different features. Several hypothesis tests were performed, testing whether the relationship between certain features and claim amounts is moderated or mediated by an alternative feature in the data set.
- Mediating relationships: If a feature is mediated by an alternative feature in the data set, its relationship with the claim amounts can be well explained by the alternative (potentially indicating it can be removed from the model).
- Moderating relationships: If a feature is moderated by an alternative feature in the data set, the strength and/or direction of the relationship with the claim amounts is impacted by the alternative.
All the teams showed a great understanding of the problem and identified promising solutions. The competitive atmosphere of the LV Datathon created a notable buzz amongst the students, who were eager to present and compare their findings. As evidenced by every team’s solution, the methodological conclusion is clear: When it comes to insurance claim prediction, tree-based models are unbeaten!
This month, the Cohort 2 Compass students have started work on their mini projects and are establishing the direction of their own research within the CDT.
Supervised by the Institute for Statistical Science:
Anthony Stevenson will be working with Robert Allison on a project entitled Fast Bayesian Inference at Extreme Scale. This project is in partnership with IBM Research.
Conor Crilly will be working with Oliver Johnson on a project entitled Statistical models for forecasting reliability. This project is in partnership with AWE.
Euan Enticott will be working with Matteo Fasiolo and Nick Whiteley on a project entitled Scalable Additive Models for Forecasting Electricity Demand and Renewable Production. This project is in partnership with EDF.
Annie Gray will be working with Patrick Rubin-Delanchy and Nick Whiteley on a project entitled Exploratory data analysis of graph embeddings: exploiting manifold structure.
Ed Davis will be working with Dan Lawson and Patrick Rubin-Delanchy on a project entitled Graph embedding: time and space. This project is in partnership with LV Insurance.
Conor Newton will be working with Henry Reeve and Ayalvadi Ganesh on a project entitled Decentralised sequential decision making and learning.
The following projects are supervised in collaboration with the Institute for Statistical Science (IfSS) and our other internal partners at the University of Bristol:
Dan Ward will be working with Matteo Fasiolo (IfSS) and Mark Beaumont from the School of Biological Sciences on a project entitled Agent-based model calibration via regression-based synthetic likelihood. This project is in partnership with Improbable
Georgie Mansell will be working with Haeran Cho (IfSS) and Andrew Dowsey from the School of Population Health Sciences and Bristol Veterinary School on a project entitled Statistical learning of quantitative data at scale to redefine biomarker discovery. This project is in partnership with Sciex.
Shannon Williams will be working with Anthony Lee (IfSS) and Jeremy Phillips from the School of Earth Sciences on a project entitled Use and Comparison of Stochastic Simulations and Weather Patterns in probabilistic volcanic ash hazard assessments.
Sam Stockman will be working with Dan Lawson (IfSS) and Maximillian Werner from the School of Geographical Sciences on a project entitled Machine Learning and Point Processes for Insights into Earthquakes and Volcanoes
This February our 2nd year Compass students will attend workshops in responsible innovation.
Run in partnership with the School of Management, the structured module constitutes Responsible Innovation training specifically for research in Data Science.
Taking the EPSRC AREA (Anticipate, Reflect, Engage, Act) framework for Responsible Innovation as it’s starting point, the module will take students through a guided process to develop the skills, knowledge and facilitated experience to incorporate the tenets of the AREA framework in to their PhD practice. Topics covered will include:
· Ethical and societal implications of data science and computational statistics
· Skills for anticipation
· Reflexivity for researchers
· Public perception of data science and engagement of publics
· Regulatory frameworks affecting data science
A post by Michael Whitehouse, PhD student on the Compass programme.
September saw the first of an exciting new series of Compass industry focus labs; with this came the chance to make use of the extensive skill sets acquired throughout the course and an opportunity to provide solutions to pressing issues of modern industry. The partner for the first focus lab, Wessex Water, posed the following question: given time series data on water flow levels in pipes, can we detect if new leaks have occurred? Given the inherent value of clean water available at the point of use and the detriments of leaking this vital resource, the challenge of ensuring an efficient system of delivery is of great importance. Hence, finding an answer to this question has the potential to provide huge economic, political, and environmental benefits for a large base of service users.
Data and Modelling:
The dataset provided by Wessex Water consisted of water flow data spanning across around 760 pipes. After this data was cleaned and processed some useful series, such as minimum nightly and average daily flow (MNF and ADF resp.), were extracted. Preliminary analysis carried out by our collaborators at Wessex Water concluded that certain types of changes in the structure of water flow data provide good indications that a leak has occurred. From this one can postulate that detecting a leak amounts to detecting these structural changes in this data. Using this principle, we began to build a framework to build solutions: detect the change; detect a new leak. Change point detection is a well-researched discipline that provides us with efficient methods for detecting statistically significant changes in the distribution of a time series and hence a toolbox with which to tackle the problem. Indeed, we at Compass have our very own active member of the change point detection research community in the shape of Dom Owens. The preliminary analysis gave that there are three types of structural change in water flow series that indicate a leak: a change in the mean of the MNF, a change in trend of the MNF, and a change in the variance of the difference between the MNF and ADF. In order to detect these changes with an algorithm we would need to transform the given series so that the original change in distribution corresponded to a change in the mean of the transformed series. These transforms included calculating generalised additive model (GAM) residuals and analysing their distribution. An example of such a GAM is given by:
Where the x i ’s are features we want to use to predict the flow, such as the time of day or current season. The principle behind this analysis is that any change in the residual distribution corresponds to a violation of the assumption that residuals are independently, identically distributed and hence, in turn, corresponds to a deviation from the original structure we fit our GAM to.
Figure 1: GAM residual plot. Red lines correspond to detected changes in distribution, green lines indicate a repair took place.
A Change Point Detection Algorithm:
In order to detect changes in real time we would need an online change point detection algorithm, after evaluating the existing literature we elected to follow the mean change detection procedure described in [Wang and Samworth, 2016]. The user-end procedure is as follows:
- Calculate mean estimate on some data we assume is stationary.
- Feed a new observation into the algorithm. Calculate test statistics based on new data.
- Repeat (2) until any test statistics exceed a threshold at which point we conclude a mean change has been detected. Return to (1).
Due to our 2 week time restraint we chose to restrict ourselves to finding change points corresponding to a mean change, just one of the 3 changes we know are indicative of a leak. As per the fundamental principles of decision theory, we would like to tune and evaluate our algorithm by minimising some loss function which depends on some ‘training’ data. That is, we would like to look at some past period of time and make predictions of when leaks happened given the flow data across the same period, then we evaluate how accurate these predictions were and adjust or asses the model accordingly. However, to do this we would need to know when and where leaks actually occurred across the time period of the data, something we did not have access to. Without ‘labels’ indicating that a leak has occurred, any predictions from the model were essentially meaningless, so we sought to find a proxy. The one potentially useful dataset we did have access to was that of leak repairs. It is clear that a leak must have occurred if a repair has occurred, but for various reasons this proxy does not provide an exhaustive account of all leaks. Furthermore, we do not know which repairs correspond to leaks identified by the particular distributional change in flow data we considered. This, in turn, means that all measures of model performance must come with the caveat that they are contingent on incomplete data. If when conducting research we find out results are limited it is our duty as statisticians to report when this is the case – it is not our job to sugar coat or manipulate our findings, but to report them with the limitations and uncertainty that inextricably come alongside. Results without uncertainty are as meaningless as no results at all. This being said, all indications pointed towards the method being effective in detecting water flow data mean change points which correspond to leak repairs, giving a positive result to feedback to our friends at Wessex Water.
Communicating statistical concepts and results to an audience of varied areas and levels of expertise is important now more than ever. The continually strengthening relationships between Compass and its industrial partners are providing students with the opportunity to gain experience in doing exactly this. The focus lab concluded with a presentation on our findings to the Wessex Water operations team, during which we reported the procedures and results. The technical results were well supported by the demonstration of an R shiny dashboard app, which provided an intuitive interface to view the output of the developed algorithm. Of course, there is more work to be done. Expanding the algorithm to account for all 3 types of distributional change is the obvious next step. Furthermore, fitting a GAM to data for 760 pipes is not very efficient. Additional investigations into finding ways to ‘cluster’ groups of pipes together according to some notion of similarity is a natural avenue for future work in order to reduce the number of GAMS we need to fit.This experience enabled students to apply skills in statistical modelling, algorithm development, and software development to a salient problem faced by an industry partner and marked a successful start to the Compass industry focus lab series.
A post by Mauro Camara Escudero, PhD student on the Compass programme.
Last December the first Compass cohort partook a 3-day entrepreneurship training with SpinUp Science. Keep reading and you might just find out if the Silicon Gorge life is for you!
The Ambitious Project of SpinUp Science
SpinUp Science’s goal is to help PhD students like us develop an entrepreneurial skill-set that will come in handy if we decide to either commercialize a product, launch a start-up, or choose a consulting career.
I can already hear some of you murmur “Sure, this might be helpful for someone doing a much more applied PhD but my work is theoretical. How is that ever going to help me?”. I get that, I used to believe the same. However, partly thanks to this training, I changed my mind and realized just how valuable these skills are independently of whether you decide to stay in Academia or find a job at an established company.
Anyways, I am getting ahead of myself. Let me first guide you through what the training looked like and then we will come back to this!
Day 1 – Meeting the Client
The day started with a presentation that, on paper, promised to be yet another one of those endless and boring talks that make you reach for the Stop Video button and take a nap. The vague title “Understanding the Opportunity” surely did not help either. Instead, we were thrown right into action!
Ric and Crescent, two consultants at SpinUp Science, introduced us to their online platform where we would be spending most of our time in the next few days. Our main task for the first half-hour was to read about the start-up’s goals and then write down a series of questions to ask the founders in order to get a full picture.
Before we knew it, it was time to get ready for the client meeting and split tasks. I volunteered as Client Handler, meaning I was going to coordinate our interaction with the founders. The rest of Compass split into groups focusing on different areas: some were going to ask questions about competitors, others about the start-up product, and so on.
As we waited in the ZOOM call, I kept wondering why on earth I volunteered for the role and my initial excitement was quickly turning into despair. We had never met the founders before, let alone had any experience consulting or working for a start-up.
Once the founders joined us, and after a wobbly start, it became clear that the hard part would not be avoiding awkward silences or struggling to get information. The real challenge was being able fit all of our questions in this one-hour meeting. One thing was clear: clients love to talk about their issues and to digress.
After the meeting, we had a brief chat and wrote down our findings and thoughts on the online platform. I wish I could say we knew what we were doing, but in reality it was a mix of extreme winging and following the advice of Ric and Crescent.
Last on the agenda, was a short presentation where we learned how to go about studying the market-fit for a product, judge its competitors and potential clients, and overall how to evaluate the success of a start-up idea. That was it for the day, but the following morning we would put into practice everything we had learned up to that point.
Day 2 – Putting our Stalking Skills to good use
The start-up that we were consulting for provides data analysis software for power plants and was keen to expand in a new geographical area. Our goal for the day was therefore to:
understand the need for such a product in the energy market
research what options are available for the components of their product
find potential competitors and assess their offering
find potential clients and assess whether they already had a similar solution implemented
study the market in the new geographical area
This was done with a mix of good-old Google searches and cold-calling. It was a very interesting process as in the morning we were even struggling to understand what the start-up was offering, while by late afternoon we had a fairly in-depth knowledge of all the points above and we had gathered enough information to formulate more sensible questions and to assess the feasibility of the start-up’s product. One of the things I found most striking about this supervised hands-on training is that as time went on I could clearly see how I was able to filter out useless information and go to the core of what I was researching.
To aid us in our analyses, we received training on various techniques to assess competitors, clients and the financial prospect of a start-up. In addition, we also learned about why the UK is such a good place to launch a start-up, what kind of funding is available and how to look for investors and angels.
Exhausted by a day of intense researching, we knew the most demanding moments were yet to come.
Day 3 – Reporting to the Client
The final day was all geared towards preparing for our final client meeting. Ric and Crescent taught us how to use their online platform to perform PESTEL and SWOT analyses efficiently based on the insights that we gathered the day before. It was very powerful seeing a detailed report coming to life using inputs from all of our researches.
With the report in hand, several hours of training under our belt, and a clearer picture in our head, we joined the call and each one of us presented a different section of the report, while Andrea was orchestrating the interaction. Overall, the founders seemed quite impressed and admitted that had not heard of many of the competitors we had found. They were pleased by our in-depth research and, I am sure, found it very insightful.
So, was it useful?
I believe that this training gave us a glimpse of how to go about picking up a totally new area of knowledge and quickly becoming an expert on it. The time constraint allowed us to refine the way in which we filter out useless information, to get to the core of what we are trying to learn about. We also worked together as a team towards a single goal and we formulated our opinion on the start-up. Finally, we had two invaluable opportunities to present in a real-world setting and to handle diplomatically the relationship with the client.
In the end, isn’t research all about being able to pickup new knowledge quickly, filter out useless papers, working together with other researchers to develop a method and present such results to an audience?
Find out more about Mauro Camara Escudero and his work on his profile page.
A post by Andrea Becsek, PhD student on the Compass programme.
If you have recently also binge-watched the Queen’s Gambit chances are you have heard of the Elo Rating System. There are actually many games out there that require some way to rank players or even teams. However, the applications of the Elo Rating System reach further than you think.
History and Applications
The Elo Rating System 1 was first suggested as a way to rank chess players, however, it can be used in any competitive two-player game that requires a ranking of its players. The system was first adopted by the World Chess Federation in 1970, and there have been various adjustments to it since, resulting in different implementations by each organisation.
For any soccer-lovers out there, the FIFA world rankings are also based on the Elo System, but if you happen to be into a similar sport, worry not, Elo has you covered. And the list of applications goes on and on: Backgammon, Scrabble, Go, Pokemon, and apparently even Tinder used it at some point.
Fun fact: The formulas used by the Elo Rating make a famous appearance in the Social Network, a movie about the creation of Facebook. Whether this was the actual algorithm used for FaceMash, the first version of Facebook, is however unclear.
All this sounds pretty cool, but how does it actually work?
How it works
We want a way to rank players and update their ranking after each game. Let’s start by assuming that we have the ranking for the two players about to play: for player and for player . Then we can compute the probability of player winning against player using the logistic function:
Given what we know about the logistic function, it’s easy to notice that the smaller the difference between the players, the less certain the outcome as the probability of winning will be close to .
Once the outcome of the game is known, we can update both players’ abilities
The factor controls the influence of a player’s performance on their previous ranking. For players with high rankings, a smaller is used because we expect their abilities to be somewhat stable and hence their ranking shouldn’t be too heavily influenced by every game. On the other hand, players with low ability can learn and improve quite quickly, and therefore their rating should be able to fluctuate more so they have a larger number.
The term in the brackets represents how different the actual outcome is from the expected outcome of the game. If a player is expected to win but doesn’t, their ranking will decrease, and vice versa. The larger the difference, the more their rating will change. For example, if a weaker player is highly unlikely to win, but they do, their ranking will be boosted quite a bit because it was a hard battle for them. On the other hand, if a strong player is really likely to win because they are playing against a weak player, their increase in score will be small as it was an easy win for them.
Elo Rating and Education
As previously mentioned, the Elo Rating System has been used in a wide range of fields and, as it turns out, that includes education, more specifically, adaptive educational systems 2. Adaptive educational systems are concerned with automatically selecting adequate material for a student depending on their previous performance.
Note that a system can be adaptive at different levels of granularity. Some systems might adapt the homework from week to week by generating it based on the student’s current ability and update their ability once the homework has been completed. Whereas other systems are able to update the student ability after every single question. As you can imagine, using the second system requires a fairly fast, online algorithm. And this is where the Elo Rating comes in.
For an adaptive system to work, we need two key components: student abilities and question difficulties. To apply the Elo Rating to this context, we treat a student’s interaction with a question as a game where the student’s ranking represents their ability and the question’s ranking represents its difficulty. We can then predict whether a student of ability will answer a question of difficulty correctly using
and the ability and difficulty can be updated using
So even if you only have minutes to create an adaptive educational system you can easily implement this algorithm. Set all abilities and question difficulties to , let students answer your questions, and wait for the magic to happen. If you do have some prior knowledge about the difficulty of the items you could of course incorporate that into the initial values.
One important thing to note is that one should be careful with ranking students based on their abilities as this could result in various ethical issues. The main purpose of obtaining their abilities is to track their progress and match them with questions that are at the right level for them, easy enough to stay motivated, but hard enough to feel challenged.
So is Elo the best option for an adaptive system? It depends. It is fast, enables on the fly updates, it’s easy to implement, and in some contexts, it even has a similar performance to more complex models. However, there are usually many other factors that can be relevant to predicting student performance, such as the time spent on a question or the number of hints they use. This additional data can be incorporated into more complex models, probably resulting in better predictions and offering much more insight. At the end of the day, there is always a trade-off, so depending on the context it’s up to you to decide whether the Elo Rating System is the way to go.
Find out more about Andrea Becsek and her work on her profile page.
As part of UKRI NPIF Talent Funding 2019/2020, Compass applied for additional hardware and now has an operational GPU node*, specifically designed for the needs of statistical and machine learning analysis and with priority for COMPASS students.
This will enable our students to more rapidly train a wide variety of powerful, state of the art models in machine learning. GPUs have been essential to the rise of deep learning which, in the past decade, has revolutionised machine learning, rendering previously impossible tasks in image and natural language processing surmountable.