## Student Perspectives: LV Datathon – Insurance Claim Prediction

A post by Doug Corbin, PhD student on the Compass programme.

### Introduction

In a recent bout of friendly competition, students from the Compass and Interactive AI CDTs were divided into eight teams to take part in a two-week Datathon hosted by insurance provider LV. A Datathon is a short, intensive competition posing a data-driven challenge to the teams. The challenge was to construct the best predictive model for the size of insurance claims, using an anonymised, artificial data set generously provided by LV. Each team's solution would be judged on three criteria:

• Accuracy – How well the solution performs at predicting insurance claims.
• Explainability – The ability to understand and explain how the solution calculates its predictions; it is important to be able to explain to a customer how their quote has been calculated.
• Creativity – The solution’s incorporation of new and unique ideas.

Students were given the opportunity to put their experience in Data Science and Artificial Intelligence to the test on something resembling real-life data, forming cross-CDT relationships in the process.

### Data and Modelling

Before training a machine learning model, the data must first be processed into a numerical format. To achieve this, most teams transformed categorical features into a series of 0s and 1s (representing the value of the category) using a well-known process called one-hot encoding. Others recognised that certain features had a natural order to them, and opted to map them to integers corresponding to their ordered position.
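Both encodings are a few lines of pandas; the column names and category ordering below are hypothetical stand-ins, not features from the LV data set:

```python
import pandas as pd

# Hypothetical policy data; the actual LV features are not public.
df = pd.DataFrame({
    "vehicle_class": ["luxury", "standard", "sports", "standard"],
    "coverage":      ["basic", "extended", "basic", "premium"],
})

# One-hot encode the nominal feature: one 0/1 column per category.
one_hot = pd.get_dummies(df["vehicle_class"], prefix="vehicle_class", dtype=int)

# Map the ordinal feature to integers reflecting its natural order.
coverage_order = {"basic": 0, "extended": 1, "premium": 2}
ordinal = df["coverage"].map(coverage_order)
```

One-hot encoding suits nominal categories with no ranking; the integer mapping preserves the order information that one-hot encoding would discard.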

A common consideration across all the teams' analyses was feature importance. Of the many methods used to evaluate the relative importance of the features, notable mentions are Decision Trees, LASSO optimisation, Permutation Importance and SHAP values. The specific details of these methods are beyond the scope of this blog post, but they all share a common goal: to rank features according to how important they are in predicting claim amounts. In many of the solutions, feature importance was used to simplify the models by excluding features with little to no predictive power. For others, it was used as a post-analysis step to increase explainability, i.e. to show which features were most important for a particular claim prediction. As part of the feature selection process, all teams considered the ethical implications of the data, with many choosing to exclude certain features to mitigate social bias.
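Of the methods listed, Permutation Importance is perhaps the quickest to sketch: shuffle one feature at a time and measure how much the model's score degrades. A toy illustration with scikit-learn, on synthetic data where only the first feature carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic stand-in for the claims data: only feature 0 drives the target.
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Permute each feature in turn; an important feature causes a large score drop.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```

In this construction the uninformative features end up with near-zero importance, which is exactly the signal the teams used to prune their models.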

Interestingly, almost all teams incorporated some form of Gradient Boosted Decision Trees (GBDT) into their solution, either for feature selection or regression. This involves constructing multiple decision trees, which are aggregated to give the final prediction. A decision tree can be thought of as a sequence of binary questions about the features (e.g. Is the insured vehicle a sports car? Is the car a write-off?), which lead to a (constant) prediction depending on the answers. In the case of GBDT, decision trees are constructed sequentially, each new tree attempting to capture structure in the data which has been overlooked by its predecessors. The final estimate is a weighted sum of the trees, where the weights are optimised using (the gradient of) a specified loss function, e.g. Mean-Squared Error (MSE),

$MSE = \frac{1}{N} \sum_{n = 1}^N (y_n - \hat{y}_n)^2.$

Many of the teams trialled multiple regression models before ultimately settling on a tree-based model. However, it is well known that tree-based models are prone to overfitting the training data. Indeed, many of the teams were surprised to see such a significant difference between the training and testing Mean Absolute Error (MAE),

$MAE = \frac{1}{N} \sum_{n = 1}^N |y_n - \hat{y}_n|.$

### Results

After two weeks of hard work, the students came forward to present their solutions to a judging panel formed of LV representatives and CDT directors. The success of each solution was measured via the MAE of its predictions on the testing data set. With the students anxious to find out the results, the following winners were announced.

### Accuracy Winners

Pre-processing: Categorical features one-hot encoded or mapped to integers where appropriate.

Regression Model: Gradient Boosted Decision Trees.

Testing MAE: 69.77

The winning team (in accuracy) was able to dramatically reduce their testing MAE through their choice of loss function. Loss functions quantify how well or poorly a regression model is performing during the training process, and are used to optimise the linear combination of decision trees. While most teams used the popular Mean-Squared Error loss, the winning team instead used Least Absolute Deviations, which is equivalent to optimising for the MAE while training the model.

### Explainability (Joint) Winners

After much deliberation amongst the judging panel, two teams were awarded joint first place in the explainability category!

Team 1:

Pre-processing: Categorical features one-hot encoded or mapped to integers where appropriate. Features centred and scaled to have mean 0 and standard deviation 1, then selected using Gradient Boosted Decision Trees.

Regression Model: K-Nearest-Neighbours Regression

Testing MAE: 75.25

This team used Gradient Boosted Decision Trees for feature selection, combined with K-Nearest-Neighbours (KNN) Regression to model the claim amounts. KNN regression is simple in nature: given a new claim to be predicted, the K “most similar” claims in the training set are averaged (weighted according to similarity) to produce the final prediction. It is thus explainable in the sense that for every prediction you can see exactly which neighbours contributed, and what similarities they shared. The judging panel noted, however, that a customer may not be satisfied with their insurance quote being based on just K neighbours.
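The prediction step of this pipeline can be sketched with scikit-learn's `KNeighborsRegressor`; the synthetic features and settings below are illustrative, not the team's actual configuration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic stand-in for the GBDT-selected claim features.
X = rng.normal(size=(200, 4))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

# Centre and scale (mean 0, standard deviation 1) so that no single
# feature dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Distance-weighted KNN: closer neighbours contribute more to the average.
knn = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X_scaled, y)

# Explainability: for any query we can list exactly which neighbours
# drove the prediction, and how close they were.
dist, idx = knn.kneighbors(X_scaled[:1])
prediction = knn.predict(X_scaled[:1])
```

The `kneighbors` call is what makes the model auditable: each quote can be traced back to the K historical claims that produced it.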

Team 2:

Pre-processing: All categorical features one-hot encoded.

Regression Model: Gradient Boosted Decision Trees. SHAP values used for post-analysis explainability.

Testing MAE: 80.3

The judging panel was impressed by this team’s decision to impose monotonicity in the claim predictions with respect to the numerical features. This asserts that, for monotonic features, the claim prediction must move in a constant direction (increasing or decreasing) if the numerical feature is moving in a constant direction. For example, a customer’s policy excess is the amount they will have to pay towards a claim made on their insurance. It stands to reason that increasing the policy excess (while other features remain constant) should not increase their insurance quote. If this constraint is satisfied, we say that the insurance quote is monotonically decreasing with respect to the policy excess. Further, SHAP values were used to explain the importance and effect of each feature on the model.

### Creativity Winners

Pre-processing: Categorical features one-hot encoded or mapped to integers where appropriate. New feature engineered from Vehicle Size and Vehicle Class. Features selected using Permutation Importance.

Regression Model: Gradient Boosted Decision Trees. Presented post-analysis mediation/moderation study of the features.

Testing MAE: 76.313

The winning team for creativity presented unique and intriguing methods for understanding and manipulating the data. This team noticed that the features Vehicle Size and Vehicle Class are intrinsically related; for example, they investigated whether a large vehicle would be likely to yield a higher claim if it is also of a luxury vehicle class. To represent this relationship, they engineered a new feature by taking a multiplicative combination of the two initial features.
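A multiplicative interaction of this kind is a one-liner once the two features are ordinally encoded; the encodings below are purely illustrative, not the LV data's actual codes:

```python
import pandas as pd

# Hypothetical ordinal encodings of the two related features.
df = pd.DataFrame({
    "vehicle_size":  [1, 2, 3, 2],   # 1 = small ... 3 = large
    "vehicle_class": [0, 1, 2, 2],   # 0 = economy ... 2 = luxury
})

# Multiplicative interaction: only vehicles that are both large and
# of a high class receive the largest values.
df["size_x_class"] = df["vehicle_size"] * df["vehicle_class"]
```

The product lets a model pick up the "large *and* luxury" effect directly, rather than having to learn it from the two features separately.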

As an extension of their solution, they presented an investigation of the causal relationship between the different features. Several hypothesis tests were performed, testing whether the relationship between certain features and claim amounts is moderated or mediated by an alternative feature in the data set.

• Mediating relationships: If a feature is mediated by an alternative feature in the data set, its relationship with the claim amounts can be well explained by the alternative (potentially indicating it can be removed from the model).
• Moderating relationships: If a feature is moderated by an alternative feature in the data set, the strength and/or direction of the relationship with the claim amounts is impacted by the alternative.

### Final Thoughts

All the teams showed a great understanding of the problem and identified promising solutions. The competitive atmosphere of the LV Datathon created a notable buzz amongst the students, who were eager to present and compare their findings. As evidenced by every team’s solution, the methodological conclusion is clear: When it comes to insurance claim prediction, tree-based models are unbeaten!

## Skills for Interdisciplinary Research

To acknowledge the variety of sectors where data science research is relevant, in March 2021 the Compass students are undertaking a series of workshops led by the Bristol Doctoral College to explore Skills for Interdisciplinary Research. Using the Vitae framework for researcher development, our colleagues at the BDC will introduce Compass students to the following topics:

Workshop 1: What is a doctorate? A brief history of doctorates in the UK, how they have changed in the past two decades, why CDTs exist, and what skills are needed now for a doctorate.

Workshop 2: Interdisciplinarity – the foundations. A practical case study on interdisciplinary postgraduate research at Bristol.

Workshop 3: Ways of knowing, part 1 – Positivism and ‘ologies! Deconstructing some of the terminology around knowledge and how we know what we know. Underpinning assumption – to know your own discipline, you need to step outside of it and see it as others do.

Workshop 4: Ways of knowing, part 2 – Social constructionism and qualitative approaches to research. In part 1, the ideal ‘science’ approach is objective and the researcher is detached from the subject of study; here we look at other approaches, where the role of the researcher is integral to the research.

Workshop 5: Becoming a good researcher – research integrity and doctoral students. A look at how dilemmas in research can show us how research integrity is not just a case of right or wrong.

Workshop 6: Getting started with academic publishing. An introduction to the pressures of scholarly publishing in contemporary research, and what they mean in an interdisciplinary context.

## Responsible Innovation in Data Science Research

This February our 2nd year Compass students will attend workshops in responsible innovation.

Run in partnership with the School of Management, the structured module constitutes Responsible Innovation training specifically for research in Data Science.

Taking the EPSRC AREA (Anticipate, Reflect, Engage, Act) framework for Responsible Innovation as its starting point, the module will take students through a guided process to develop the skills, knowledge and facilitated experience to incorporate the tenets of the AREA framework into their PhD practice. Topics covered will include:

• Ethical and societal implications of data science and computational statistics
• Skills for anticipation
• Reflexivity for researchers
• Public perception of data science and engagement of publics
• Regulatory frameworks affecting data science

## Student perspectives: The Elo Rating System – From Chess to Education

A post by Andrea Becsek, PhD student on the Compass programme.

If you have recently binge-watched The Queen’s Gambit, chances are you have heard of the Elo Rating System. There are actually many games out there that require some way to rank players or even teams. However, the applications of the Elo Rating System reach further than you might think.

### History and Applications

The Elo Rating System [1] was first suggested as a way to rank chess players; however, it can be used in any competitive two-player game that requires a ranking of its players. The system was first adopted by the World Chess Federation in 1970, and there have been various adjustments to it since, resulting in different implementations by each organisation.

For any soccer lovers out there, the FIFA world rankings are also based on the Elo System, and if you happen to be into a different sport, worry not, Elo has you covered. The list of applications goes on and on: Backgammon, Scrabble, Go, Pokemon, and apparently even Tinder used it at some point.

Fun fact: the formulas used by the Elo Rating System make a famous appearance in The Social Network, a movie about the creation of Facebook. Whether this was the actual algorithm used for FaceMash, the precursor to Facebook, is however unclear.

All this sounds pretty cool, but how does it actually work?

### How it works

We want a way to rank players and update their ranking after each game. Let’s start by assuming that we have the ranking for the two players about to play: $\theta_i$ for player $i$ and $\theta_j$ for player $j$. Then we can compute the probability of player $i$ winning against player $j$ using the logistic function:

$P(Y_{ij}=1)=\frac{1}{1+\exp\{-(\theta_i-\theta_j)\}}.$

Given what we know about the logistic function, it’s easy to notice that the smaller the difference between the players, the less certain the outcome as the probability of winning will be close to $0.5$.

Once the outcome of the game is known, we can update both players’ abilities:

$\theta_{i}:=\theta_{i}+K(Y_{ij}-P(Y_{ij}=1))$

$\theta_{j}:=\theta_{j}+K(P(Y_{ij}=1)-Y_{ij}).$

The $K$ factor controls the influence of a player’s performance on their previous ranking. For players with high rankings, a smaller $K$ is used because we expect their abilities to be somewhat stable, and hence their ranking shouldn’t be too heavily influenced by every game. On the other hand, players with low ability can learn and improve quite quickly, so their rating should be able to fluctuate more; they are therefore given a larger $K$.

The term in the brackets represents how different the actual outcome is from the expected outcome of the game. If a player is expected to win but doesn’t, their ranking will decrease, and vice versa. The larger the difference, the more their rating will change. For example, if a weaker player is highly unlikely to win, but they do, their ranking will be boosted quite a bit because it was a hard battle for them. On the other hand, if a strong player is really likely to win because they are playing against a weak player, their increase in score will be small as it was an easy win for them.
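The update rules above translate directly into code. This sketch uses the post's plain logistic formulation; rating organisations typically rescale with a base-10/400 curve, which is deliberately not done here:

```python
import math

def expected_score(theta_i: float, theta_j: float) -> float:
    """Probability that player i beats player j, via the logistic function."""
    return 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))

def elo_update(theta_i: float, theta_j: float, y_ij: int, k: float = 0.1):
    """Update both ratings after a game; y_ij = 1 if player i won, else 0."""
    p = expected_score(theta_i, theta_j)
    # Each player moves by k times the gap between outcome and expectation,
    # in opposite directions, so the total rating is conserved.
    return theta_i + k * (y_ij - p), theta_j + k * (p - y_ij)
```

For two equally rated players the expected score is 0.5, so a win moves the winner up by exactly $K/2$ and the loser down by the same amount.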

### Elo Rating and Education

As previously mentioned, the Elo Rating System has been used in a wide range of fields and, as it turns out, that includes education, more specifically adaptive educational systems [2]. Adaptive educational systems are concerned with automatically selecting adequate material for a student depending on their previous performance.

Note that a system can be adaptive at different levels of granularity. Some systems might adapt the homework from week to week, generating it based on the student’s current ability and updating that ability once the homework has been completed. Other systems are able to update the student’s ability after every single question. As you can imagine, the second kind of system requires a fairly fast, online algorithm. And this is where the Elo Rating comes in.

For an adaptive system to work, we need two key components: student abilities and question difficulties. To apply the Elo Rating to this context, we treat a student’s interaction with a question as a game where the student’s ranking represents their ability and the question’s ranking represents its difficulty. We can then predict whether a student of ability $\theta_i$ will answer a question of difficulty $d_j$ correctly using

$P(\text{correct}_{ij}=1)=\frac{1}{1+\exp\{-(\theta_i-d_j)\}},$

and the ability and difficulty can be updated using

$\theta_{i}:=\theta_{i}+K(\text{correct}_{ij}-P(\text{correct}_{ij}=1))$

$d_{j}:=d_{j}+K(P(\text{correct}_{ij}=1)-\text{correct}_{ij}).$

So even if you only have $10$ minutes to create an adaptive educational system, you can easily implement this algorithm. Set all abilities and question difficulties to $0$, let students answer your questions, and wait for the magic to happen. If you do have some prior knowledge about the difficulty of the items, you could of course incorporate that into the initial values.
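The "ten-minute" system really is about this short. A sketch, with a hypothetical student, questions, and answer log:

```python
import math

def p_correct(theta: float, d: float) -> float:
    """Probability a student of ability theta answers a question of difficulty d."""
    return 1.0 / (1.0 + math.exp(-(theta - d)))

def answer_update(theta, d, correct, k=0.2):
    """Elo-style update of ability and difficulty after one answer (correct = 0 or 1)."""
    p = p_correct(theta, d)
    return theta + k * (correct - p), d + k * (p - correct)

# Start everything at 0, as suggested, and learn from a stream of answers.
abilities = {"alice": 0.0}
difficulties = {"q1": 0.0, "q2": 0.0}

# Hypothetical answer log: (student, question, 1 if correct else 0).
for student, question, correct in [("alice", "q1", 1),
                                   ("alice", "q2", 0),
                                   ("alice", "q1", 1)]:
    theta, d = abilities[student], difficulties[question]
    abilities[student], difficulties[question] = answer_update(theta, d, correct)
```

After these three answers the question Alice kept answering correctly has drifted down in difficulty, while the one she missed has drifted up, exactly the behaviour the system relies on for matching students to questions.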

One important thing to note is that one should be careful with ranking students based on their abilities as this could result in various ethical issues. The main purpose of obtaining their abilities is to track their progress and match them with questions that are at the right level for them, easy enough to stay motivated, but hard enough to feel challenged.

### Conclusion

So is Elo the best option for an adaptive system? It depends. It is fast, enables on-the-fly updates, is easy to implement, and in some contexts even has similar performance to more complex models. However, there are usually many other factors that can be relevant to predicting student performance, such as the time spent on a question or the number of hints used. This additional data can be incorporated into more complex models, probably resulting in better predictions and offering much more insight. At the end of the day, there is always a trade-off, so depending on the context it’s up to you to decide whether the Elo Rating System is the way to go.

Find out more about Andrea Becsek and her work on her profile page.

1. Elo, A.E., 1978. The rating of chessplayers, past and present. Arco Pub.
2. Pelánek, R., 2016. Applications of the Elo rating system in adaptive educational systems. Computers & Education, 98, pp.169-179.

## Improbable sponsors Compass PhD student in new partnership

Improbable, a global technology company which provides innovative products and services to makers of virtual worlds and simulations, is sponsoring a PhD research project entitled Agent-based model calibration using likelihood-free inference.

The University of Bristol is announcing a new industrial sponsor of Compass – the EPSRC Centre for Doctoral Training in Computational Statistics and Data Science. The project’s aim is to devise a general framework for calibrating agent-based models from training data by inferring the model parameters in a statistical framework.

## Sparx joins as Compass’ newest industrial partner

The University of Bristol is today announcing a new supporter of Compass – the EPSRC Centre for Doctoral Training in Computational Statistics and Data Science. South West based learning technology company Sparx has agreed to sponsor a PhD student’s research project, which will investigate new approaches to longitudinal statistical modelling within school-based mathematics education.

Sparx, which is located in Exeter, develops maths learning tools to support teaching and learning in secondary education. As an evidence-led company, Sparx has invested heavily in researching how technology can support the teaching and learning of maths, and has worked closely with local schools. This new investment underlines their ongoing commitment to research.

## Compass students take part in electricity demand forecasting hackathon

Dr Jethro Browell, Research Fellow at the University of Strathclyde, and Dr Matteo Fasiolo, Lecturer at the University of Bristol, ran a regional electricity demand forecasting hackathon for students in the COMPASS Centre for Doctoral Training yesterday.

Visiting Research Fellow Dr Browell gave students an overview of how the Great Britain electricity transmission network has changed during the last decade, with particular focus on the consequences of the increased production from small-scale renewable sources, which appear as “negative demand”.

Dr Fasiolo then introduced a dataset containing electricity demand and weather-related variables, such as wind speed and solar irradiation, from 14 regions covering the whole of Great Britain. He proposed an initial forecasting solution based on a simple Generalized Additive Model (GAM), which he used to forecast the demand in each region.

The hackathon started, with the “Jim” team being the first to propose an improved solution, based on a more sophisticated GAM model, which beat the initial GAM in terms of forecasting accuracy.

The “AGang” team then produced an even more sophisticated GAM, which took them to the top of the ranking. In the meantime, the “D&D” team was struggling to make their random forest work, and submitted a couple of poor forecasts. Toward the end of the event, “AGang” produced a couple of improved GAM solutions, which further strengthened their lead.

While Dr Fasiolo and Dr Browell were wrapping up the event and preparing to award the winners, the “D&D” team caught everyone by surprise by submitting a forecast which beat all others by a clear margin in terms of forecasting accuracy. Their random forest was far better than the GAMs at predicting demand in Scotland, where wind production is an important factor and the dynamics are quite different relative to the other regions.

Congratulations to the top three teams:

1. D&D:  Doug Corbin and Dom Owens
2. AGang: Andrea Becsek, Alex Modell and Alessio Zakaria
3. Jim:  Michael Whitehouse, Daniel Williams and Jake Spiteri

Winning team “D&D” said:  “Given physical measurements, such as wind speeds and precipitation, as well as calendar data, we first performed a minor amount of feature engineering. Given the complex nature of the interactions between the variables, and large amount of data available, we opted to fit random forest models. These performed feature selection for us and provided some robustness from outlying observations.

“However, the models took a long time to fit. Despite parallelising the model fitting across the regions, we only just got our predictions in before the deadline. Thankfully, our model consistently outperformed the other approaches.

“Everyone taking part had a great time learning about the challenges of energy modelling, and we thrived under the pressure of friendly competition.”

Dr Browell added: “Computational statistics and data science is driving innovation in the energy sector and the technologies they enable will play a huge role in the decarbonisation. I was pleased to be able to expose the COMPASS cohort to this application and hope that they will be inspired to apply their expertise to energy and climate problems in the future.”

## Women and non-binary people in mathematics event

On 5 and 6 November the school hosted its annual Women and Non-Binary People in Mathematics event.

This two-day event, held in the Fry Building and aimed at encouraging women and non-binary people to consider continuing their studies to PhD level, welcomed participants from around the country as well as from outside the UK.

The event featured talks from mathematicians working both in universities and industry, giving insight into their current roles and their careers to date. It also offered ample time for discussion with other participants facing the same decisions, and with current PhD students who have recently faced the same questions.

The Tuesday afternoon kicked off with a talk from keynote speaker Delaram Kahrobaei from the University of York, followed by a dinner, where participants were able to enjoy discussion and networking in a relaxed environment. Wednesday morning offered an insight into the life of a PhD student, with talks and Q&A sessions with current postgraduate students. The final portion of the event was open to all students and more broadly explored the nature of PhD research, the work environment, the application process, and career options. Talks were given by industry speaker Katie Russell from OVO Energy and the University of Bristol’s Dr Lynne Walling.

Current PhD student and one of the event organisers, Emma Bailey said ‘My favourite part of the Women and Non-Binary conference is changing assumptions: showing people who maths PhD students are, what we do on a weekly basis, and all of the opportunities out there post-PhD’.

The event successfully illuminated some of the doors a PhD in mathematics can open and hopefully inspired potential PhD students to continue their future in mathematics.