Improving Insurance Claims Process

Focus on what matters and let go of what doesn’t!



When you’ve been devastated by a serious accident, your focus is on the things that matter most: family, friends, and other loved ones. Pushing paper with your insurance agent is the last place you want to spend your time or mental energy. The amount of information that insurance companies require can be daunting and time-consuming. But do insurance companies really need all this information and paperwork? Can this process be made more efficient and less of a nuisance?

Now, I have a personal viewpoint on these questions, and you likely have your own. But what does the data suggest? Using a Kaggle dataset provided by Allstate, a US-based insurance company, I am going to get more insight into them.

The data consists of 130 attributes (features) and a loss value (the target) for each observation. The dataset contains 188,318 observations, where each row represents an insurance claim. 116 of the variables are categorical: they are prefixed with “cat” followed by a number between 1 and 116. The remaining 14 attributes, prefixed with “cont”, are continuous.

List of the features’ names in the dataset

Do Insurance Companies Really Need All This Information?

Each claim has 130 attributes, that is, 130 different types of information per claim. It is not surprising that insurance claims take so much time to process.

To investigate these questions, I am going to divide the data into two parts by data type:

  • categorical attributes: cat1 to cat116
  • continuous attributes: cont1 to cont14
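With pandas, this split falls out of the column-name prefixes. The mini-frame below is a hypothetical stand-in that mirrors the Allstate column layout (the real columns live in the Kaggle train.csv):

```python
import pandas as pd

# Hypothetical mini-frame mirroring the Allstate column layout:
# "cat*" columns are categorical, "cont*" columns are continuous.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "cat1": ["A", "B", "A"],
    "cat2": ["B", "B", "A"],
    "cont1": [0.72, 0.33, 0.51],
    "cont2": [0.18, 0.44, 0.95],
    "loss": [2213.18, 1283.60, 3005.09],
})

# Partition the feature columns by their name prefix.
cat_cols = [c for c in df.columns if c.startswith("cat")]
cont_cols = [c for c in df.columns if c.startswith("cont")]
print(cat_cols)   # ['cat1', 'cat2']
print(cont_cols)  # ['cont1', 'cont2']
```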

I first investigate the 14 continuous variables with principal component analysis (PCA).

From the figure above we see that 7 components already explain more than 90% of the variance in the features, so we could reduce the numerical features to half their original number.
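The cumulative-variance curve behind a figure like this can be sketched with scikit-learn. Since the Kaggle file isn’t bundled here, the snippet uses synthetic data built so that 7 underlying factors drive 14 correlated columns, roughly mimicking the structure PCA exploits:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 14 "cont" columns: 7 independent factors,
# each duplicated with a little noise, giving 14 correlated features.
rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 7))
X = np.hstack([base, base + 0.05 * rng.normal(size=(1000, 7))])

# Fit PCA on standardized data and find how many components are
# needed to explain at least 90% of the variance.
pca = PCA().fit(StandardScaler().fit_transform(X))
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cum_var, 0.90)) + 1
print(n_components)
```

On this synthetic data the answer is 7 by construction; on the real Allstate columns the curve has to be read off the fitted `explained_variance_ratio_` values.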

So which of the continuous variables are least important and can be removed?

To answer this question, I used correlation, which indicates the extent to which two or more variables fluctuate in relation to each other. If two variables are highly correlated, keeping both adds little information, since they feed the model similar signals; it is best to retain one and eliminate the other.
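A minimal sketch of this elimination rule, on a toy frame (the column names and the 0.8 threshold are illustrative assumptions, not values from the original analysis): compute the absolute correlation matrix, scan its upper triangle, and drop the later member of any highly correlated pair.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the "cont" columns; cont_b is nearly a
# copy of cont_a, so one of the pair carries little extra information.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
df = pd.DataFrame({
    "cont_a": a,
    "cont_b": a + 0.01 * rng.normal(size=200),
    "cont_c": rng.normal(size=200),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered once, then
# flag the later column of any pair above the (assumed) 0.8 threshold.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
print(to_drop)  # ['cont_b']
```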

Correlation plot between the continuous variables

From the correlation plot above it can be seen that several variables are correlated and can be eliminated:

  • cont6 — high correlation with cont11, so cont6 has been removed
  • cont10 — high correlation with cont1 and cont6, so cont10 has been removed
  • cont12 — high correlation with cont7 and cont13, so cont12 has been removed

More details about the calculation of the correlation can be found in my GitHub here.

So now we have reduced the number of continuous variables to 11.

Which categorical values are least important and can be dropped?

We now focus on the 116 categorical features. I used the chi-square test of independence to identify dependent/correlated categorical variables; below is the list of variables that are dependent on others and do not add additional value.
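The test works on a contingency table of two categorical columns: a small p-value means the columns are dependent, so one of them is a candidate for dropping. A sketch with scipy on toy columns (the names and the 0.05 significance level are assumptions for illustration):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy categorical columns: cat_y is a deterministic relabelling of
# cat_x (dependent), while cat_z varies independently of cat_x.
df = pd.DataFrame({
    "cat_x": ["A", "A", "B", "B"] * 50,
    "cat_y": ["P", "P", "Q", "Q"] * 50,
    "cat_z": ["P", "Q"] * 100,
})

# Chi-square test on the cat_x vs cat_y contingency table.
table = pd.crosstab(df["cat_x"], df["cat_y"])
chi2, p_xy, dof, _ = chi2_contingency(table)
print(p_xy < 0.05)  # True: dependent, one of the pair can be dropped

# Same test against the unrelated column.
table2 = pd.crosstab(df["cat_x"], df["cat_z"])
chi2, p_xz, dof, _ = chi2_contingency(table2)
print(p_xz < 0.05)  # False: independent, keep both
```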

Dropped categorical attributes

In total, 88 categorical attributes were eliminated, along with the 3 continuous attributes dropped earlier. So we are left with 39 attributes: 28 categorical and 11 continuous. This shows that the insurance claims process can be made more efficient. Instead of collecting information on 130 attributes, we just need to focus on the 39 that matter most and ignore the redundant ones. This would make the claims process quicker and less of a nuisance.

Which attributes are most important for insurance claims severity prediction?

Having dropped the attributes that are not important, we are left with 39. The question is: of these 39 attributes, which are the most important?

To find out, I used feature importance, a technique that assigns a relevance score to every feature variable, which we can use to decide which features matter most, and which least, for predicting the target variable.

The model I chose was a random forest, and in the figure above you can see the ranking of the 39 attributes. cat101, a categorical variable, clearly stands out with a relative importance of 100. Another important feature is cont14, a continuous variable with a relative importance of 70. In third place is another continuous variable, cont7, with a relative importance just below 40. Insurance agents can use this ranking as a guide to prioritise information collection.
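The importance scores come straight from the fitted forest’s `feature_importances_`, rescaled so the top feature reads 100 as in the plot. A sketch on synthetic data (feature names and coefficients are made up; the real run uses the 39 remaining claim attributes):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: y depends strongly on the first column, weakly
# on the second, and not at all on the third.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.normal(size=500)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
# Rescale so the top feature scores 100, mirroring the relative-importance plot.
rel = 100 * rf.feature_importances_ / rf.feature_importances_.max()
for name, score in zip(["feat1", "feat2", "feat3"], rel):
    print(f"{name}: {score:.1f}")
```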

Can an algorithm be created that is able to predict claims severity?

After dropping so many features and keeping only 39, the question is: does it work? Can a model trained on the remaining features still predict severity loss?

I successfully trained a model and was able to predict claim severity. To validate my results, I submitted them to the Kaggle competition just to get a score. My score of 3011.62 is quite far from the winner’s 1109.70772, but the winner also had over 100 entries, while this was my first. Most of the top performers used XGBoost, which I plan to test to improve my score.
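The train-and-validate loop can be sketched as follows, again on synthetic data since the Kaggle file isn’t bundled here. The holdout evaluation assumes mean absolute error as the metric; in the real pipeline, X would be the 39 encoded claim attributes and y the loss column:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the reduced claim data, with a skewed,
# loss-like target driven by the first two features.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
y = np.exp(1.0 + 0.5 * X[:, 0] + 0.3 * X[:, 1])

# Hold out 20% of the rows, fit the forest, and score the predictions.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
mae = mean_absolute_error(y_te, rf.predict(X_te))
print(round(mae, 3))
```

A sanity check worth running: the model’s MAE should comfortably beat a constant prediction of the training-set mean.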

More details about the prediction can be found in this article here.


In this article we looked at making the insurance claims process more efficient. We identified the least important features, those carrying redundant information, and deleted them, reducing the feature count from 130 to 39. To validate these steps, we trained a random forest on the 39 remaining variables and were able to make accurate predictions.

We further investigated those 39 features and ranked them by importance in predicting the target variable. The most important feature is a categorical variable with a relative importance of 100. Insurance agents can use this as a guide for information collection, prioritising information in order of importance.

Knowing which features are most and least important helps us focus on what matters most and let go of what doesn’t, eventually improving the insurance underwriting and claims processes.

To see more about this analysis, see the link to my GitHub available here.


If you have any feedback or suggestions on this project, or you just want to have a chat, please reach out to me on LinkedIn.
