## Student Perspectives: LV Datathon – Insurance Claim Prediction

A post by Doug Corbin, PhD student on the Compass programme.

# Introduction

In a recent bout of friendly competition, students from the Compass and Interactive AI CDT’s were divided into eight teams to take part in a two week Datathon, hosted by insurance provider LV. A Datathon is a short, intensive competition, posing a data-driven challenge to the teams. The challenge was to construct the best predictive model for (the size of) insurance claims, using an anonymised, artificial data set generously provided by LV. Each team’s solution was given three important criteria, on which their solutions would be judged:

• Accuracy – How well the solution performs at predicting insurance claims.
• Explainability – The ability to understand and explain how the solution calculates its predictions; It is important to be able to explain to a customers how their quote has been calculated.
• Creativity – The solution’s incorporation of new and unique ideas.

Students were given the opportunity to put their experience in Data Science and Artificial Intelligence to the test on something resembling real life data, forming cross-CDT relationships in the process.

# Data and Modelling

Before training a machine learning model, the data must first be processed into a numerical format. To achieve this, most teams transformed categorical features into a series of 0’s and 1’s (representing the value of the category), using a well known process called one-hot encoding. Others recognised that certain features had a natural order to them, and opted to map them to integers corresponding to their ordered position.

A common consideration amongst all the teams’ analysis was feature importance. Out of the many methods/algorithms which were used to evaluate the relative importance of the features, notable mentions are Decision Trees, LASSO optimisation, Permutation Importance and SHAP Values. The specific details of these methods are beyond the scope of this blog post, but they all share a common goal, to rank features according to how important they are in predicting claim amounts. In many of the solutions, feature importance was used to simplify the models by excluding features with little to no predictive power. For others, it was used as a post analysis step to increase explainability i.e. to show which features where most important for a particular claim prediction. As a part of the feature selection process, all teams considered the ethical implications of the data, with many choosing to exclude certain features to mitigate social bias.

Interestingly, almost all teams incorporated some form of Gradient Boosted Decision Trees (GBDT) into their solution, either for feature selection or regression. This involves constructing multiple decision trees, which are aggregated to give the final prediction. A decision tree can be thought of as a sequence of binary questions about the features (e.g. Is the insured vehicle a sports car? Is the car a write off?), which lead to a (constant) prediction depending on the the answers. In the case of GBDT, decision trees are constructed sequentially, each new tree attempting to capture structure in the data which has been overlooked by its predecessors. The final estimate is a weighted sum of the trees, where the weights are optimised using (the gradient of) a specified loss function e.g. Mean-Squared Error (MSE),

$MSE = \frac{1}{N} \sum_{n = 1}^N (y_n - \hat{y}_n)^2.$

Many of the teams trialled multiple regression models, before ultimately settling on a tree-based model. However, it is well-known that tree-based models are prone to overfitting the training data. Indeed, many of the teams were surprised to see such significant difference between the training/testing Mean Absolute Error (MAE),

$MAE = \frac{1}{N} \sum_{n = 1}^N |y_n - \hat{y}_n|.$

# Results

After two weeks of hard-work, the students came forward to present their solutions to a judging panel formed of LV representatives and CDT directors. The success of each solution was measured via the MAE of their predictions on the testing data set. Anxious to find out the results, the following winners were announced.

### Accuracy Winners

Pre-processing: Categorical features one-hot encoded or mapped to integers where appropriate.

Regression Model: Gradient Boosted Decision Trees.

Testing MAE: 69.77

The winning team (in accuracy) was able to dramatically reduce their testing MAE through their choice of loss function. Loss functions quantify how good/bad a regression model is performing during the training process, and it is used to optimise the linear combination of decision trees. While most teams used the popular loss function, Mean-Squared Error, the winning team instead used Least Absolute Deviationswhich is equivalent to optimising for the MAE while training the model.

### Explainability (Joint) Winners

After much deliberation amongst the judging panel, two teams were awarded joint first place in the explainability category!

Team 1:

Pre-processing: Categorical features one-hot encoded or mapped to integers where appropriate. Features centred and scaled to have mean 0 and standard deviation 1, then selected using Gradient Boosted Decision Trees.

Regression Model: K-Nearest-Neighbours Regression

Testing MAE: 75.25

This team used Gradient Boosted Decision Trees for feature selection, combined with K-Nearest-Neighbours (KNN) Regression to model the claim amounts. KNN regression is simple in nature; given a new claim to be predicted, the K “most similar” claims in the training set are averaged (weighted according to similarity) to produce the final prediction. It is thus explainable in the sense that for every prediction you can see exactly which neighbours contributed, and what similarities they shared. The judging panel noted that, from a consumer’s perspective, they may not be satisfied with their insurance quote being based on just K neighbours.

Team 2:

Pre-processing: All categorical features one-hot encoded.

Regression Model: Gradient Boosted Decision Trees. SHAP values used for post-analysis explainability.

Testing MAE: 80.3.

The judging panel was impressed by this team’s decision to impose monotonicity in the claim predictions with respect to the numerical features. This asserts that, for monotonic features, the claim prediction must move in a constant direction (increasing or decreasing) if the numerical feature is moving in a constant direction. For example, a customer’s policy excess is the amount they will have to pay towards a claim made on their insurance. It stands to reason that increasing the policy excess (while other features remain constant) should not increase their insurance quote. If this constraint is satisfied, we say that the insurance quote is monotonic decreasing with respect to the policy excess. Further, SHAP values were used to explain the importance/effect of each feature on the model.

### Creativity Winners

Pre-processing: Categorical features one-hot encoded or mapped to integers where appropriate. New feature engineered from Vehicle Size and Vehicle Class. Features selected using Permutation Importance.

Regression Model: Gradient Boosted Decision Trees. Presented post-analysis mediation/moderation study of the features.

Testing MAE: 76.313.

The winning team for creativity presented unique and intriguing methods for understanding and manipulating the data. This team noticed that the features, Vehicle Size and Vehicle Class, are intrinsically related e.g. They investigated whether a large vehicle would likely yield a higher claim if it is also of luxury vehicle class. To represent this relationship, they engineered a new feature by taking a multiplicative combination of the two initial features.

As an extension of their solution, they presented an investigation of the causal relationship between the different features. Several hypothesis tests were performed, testing whether the relationship between certain features and claim amounts is moderated or mediated by an alternative feature in the data set.

• Mediating relationships: If a feature is mediated by an alternative feature in the data set, its relationship with the claim amounts can be well explained by the alternative (potentially indicating it can be removed from the model).
• Moderating relationships: If a feature is moderated by an alternative feature in the data set, the strength and/or direction of the relationship with the claim amounts is impacted by the alternative.

# Final Thoughts

All the teams showed a great understanding of the problem and identified promising solutions. The competitive atmosphere of the LV Datathon created a notable buzz amongst the students, who were eager to present and compare their findings. As evidenced by every team’s solution, the methodological conclusion is clear: When it comes to insurance claim prediction, tree-based models are unbeaten!

## Student perspectives: Wessex Water Industry Focus Lab

A post by Michael Whitehouse, PhD student on the Compass programme.

Introduction

September saw the first of an exciting new series of Compass industry focus labs; with this came the chance to make use of the extensive skill sets acquired throughout the course and an opportunity to provide solutions to pressing issues of modern industry. The partner for the first focus lab, Wessex Water, posed the following question: given time series data on water flow levels in pipes, can we detect if new leaks have occurred? Given the inherent value of clean water available at the point of use and the detriments of leaking this vital resource, the challenge of ensuring an efficient system of delivery is of great importance. Hence, finding an answer to this question has the potential to provide huge economic, political, and environmental benefits for a large base of service users.

Data and Modelling:

The dataset provided by Wessex Water consisted of water flow data spanning across around 760 pipes. After this data was cleaned and processed some useful series, such as minimum nightly and average daily flow (MNF and ADF resp.), were extracted. Preliminary analysis carried out by our collaborators at Wessex Water concluded that certain types of changes in the structure of water flow data provide good indications that a leak has occurred. From this one can postulate that detecting a leak amounts to detecting these structural changes in this data. Using this principle, we began to build a framework to build solutions: detect the change; detect a new leak. Change point detection is a well-researched discipline that provides us with efficient methods for detecting statistically significant changes in the distribution of a time series and hence a toolbox with which to tackle the problem. Indeed, we at Compass have our very own active member of the change point detection research community in the shape of Dom Owens. The preliminary analysis gave that there are three types of structural change in water flow series that indicate a leak: a change in the mean of the MNF, a change in trend of the MNF, and a change in the variance of the difference between the MNF and ADF. In order to detect these changes with an algorithm we would need to transform the given series so that the original change in distribution corresponded to a change in the mean of the transformed series. These transforms included calculating generalised additive model (GAM) residuals and analysing their distribution. An example of such a GAM is given by:

$\mathbb{E}[\text{flow}_t] = \beta_0 \sum_{i=1}^m f_i(x_i).$

Where the x i ’s are features we want to use to predict the flow, such as the time of day or current season. The principle behind this analysis is that any change in the residual distribution corresponds to a violation of the assumption that residuals are independently, identically distributed and hence, in turn, corresponds to a deviation from the original structure we fit our GAM to.

Figure 1: GAM residual plot. Red lines correspond to detected changes in distribution, green lines indicate a repair took place.

A Change Point Detection Algorithm:

In order to detect changes in real time we would need an online change point detection algorithm, after evaluating the existing literature we elected to follow the mean change detection procedure described in [Wang and Samworth, 2016]. The user-end procedure is as follows:

1. Calculate mean estimate $\hat{\mu}$ on some data we assume is stationary.
2. Feed a new observation into the algorithm. Calculate test statistics based on new data.
3. Repeat (2) until any test statistics exceed a threshold at which point we conclude a mean change has been detected. Return to (1).

Due to our 2 week time restraint we chose to restrict ourselves to finding change points corresponding to a mean change, just one of the 3 changes we know are indicative of a leak. As per the fundamental principles of decision theory, we would like to tune and evaluate our algorithm by minimising some loss function which depends on some ‘training’ data. That is, we would like to look at some past period of time and make predictions of when leaks happened given the flow data across the same period, then we evaluate how accurate these predictions were and adjust or asses the model accordingly. However, to do this we would need to know when and where leaks actually occurred across the time period of the data, something we did not have access to. Without ‘labels’ indicating that a leak has occurred, any predictions from the model were essentially meaningless, so we sought to find a proxy. The one potentially useful dataset we did have access to was that of leak repairs. It is clear that a leak must have occurred if a repair has occurred, but for various reasons this proxy does not provide an exhaustive account of all leaks. Furthermore, we do not know which repairs correspond to leaks identified by the particular distributional change in flow data we considered. This, in turn, means that all measures of model performance must come with the caveat that they are contingent on incomplete data. If when conducting research we find out results are limited it is our duty as statisticians to report when this is the case – it is not our job to sugar coat or manipulate our findings, but to report them with the limitations and uncertainty that inextricably come alongside. Results without uncertainty are as meaningless as no results at all. This being said, all indications pointed towards the method being effective in detecting water flow data mean change points which correspond to leak repairs, giving a positive result to feedback to our friends at Wessex Water.

Final Conclusions:

Communicating statistical concepts and results to an audience of varied areas and levels of expertise is important now more than ever. The continually strengthening relationships between Compass and its industrial partners are providing students with the opportunity to gain experience in doing exactly this. The focus lab concluded with a presentation on our findings to the Wessex Water operations team,  during which we reported the procedures and results. The technical results were well supported by the demonstration of an R shiny dashboard app, which provided an intuitive interface to view the output of the developed algorithm. Of course, there is more work to be done. Expanding the algorithm to account for all 3 types of distributional change is the obvious next step. Furthermore, fitting a GAM to data for 760 pipes is not very efficient. Additional investigations into finding ways to ‘cluster’ groups of pipes together according to some notion of similarity is a natural avenue for future work in order to reduce the number of GAMS we need to fit.This experience enabled students to apply skills in statistical modelling, algorithm development, and software development to a salient problem faced by an industry partner and marked a successful start to the Compass industry focus lab series.