Customer Segmentation using Machine Learning
Applying machine learning techniques to segment customers and predict potential new ones
This post is about one of the capstone project choices for the Udacity Data Science Nanodegree: the Customer Segmentation Report for Bertelsmann/Arvato. The project is of personal interest to me as it represents a real-life data science task using both unsupervised and supervised machine learning, and it is also a Kaggle InClass Competition.
Project Overview
In this project, I will analyse demographic data for customers of a mail-order sales company in Germany, comparing it against demographic information for the general population. I will use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, I will apply what has been learned to a third dataset with demographic information for targets of one of the company's marketing campaigns, and use a model to predict which individuals are most likely to convert into customers. The data used in this project were provided by Bertelsmann Arvato Analytics.
Problem Statement
The main goal of this project is to predict individuals who are most likely to become customers for a mail-order sales company in Germany.
As such, I define the following problem statement:
- Can we reduce the number of people targeted and still retain a high proportion of our target demographic, i.e. can we improve the customer acquisition process through targeted marketing?
Project Outline
To achieve our goal, the project is divided into several subtasks:
- Data Analysis and Pre-processing, including data cleaning, categorical value encoding and feature scaling;
- Customer Segmentation Report, first performing dimensionality reduction with PCA and then clustering with K-means;
- Supervised learning models for prediction;
- Kaggle submission and model evaluation.
Metrics
For dimensionality reduction with PCA, the explained variance ratio of each component is used to select the number of components. The aim is to keep the minimum number of dimensions that explain as much variation as possible.
For clustering, K-Means is used and, based on the Within Cluster Sum of Squares (WCSS) for each solution, the Elbow method is used to decide on the number of clusters.
For the Kaggle competition, the predictions are evaluated using the Area Under the Curve (AUC) metric. It gives an idea of the overall performance of the model, where the curve is created by plotting the true positive rate against the false positive rate at different threshold settings. A perfect model would have an AUROC of 1, so the higher the AUC score, the better the performance of the model.
AUC is used as the performance metric as it is appropriate for a highly imbalanced dataset, where the number of positive responses is much smaller than the number of negative responses. The metric is able to evaluate the quality of our model when, despite being a good predictor, we would still expect a fairly small conversion rate from our target demographic.
Analysis
Data Exploration and Pre-processing
There are four data files associated with this project:
AZDIAS: Demographics data for the general population of Germany.
CUSTOMERS: Demographics data for customers of a mail-order company.
MAILOUT_TRAIN & MAILOUT_TEST: Demographics data for individuals who were targets of a marketing campaign.
Demographic information was collected at different levels, including the individual, household, building and community. The demographics data for the general population of Germany contains 891,211 records, each with 366 attributes. The customer data from the mail-order company contains 191,562 records and 369 attributes. To compare the population and customer data, we need to perform some pre-processing steps.
1. Addressing mixed type columns
A warning message pops up while loading the data. This is due to the columns ‘CAMEO_DEUG_2015’ and ‘CAMEO_INTL_2015’, which contain the special characters “X” and “XX”. Those characters were replaced with NaN.
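As an illustration, a minimal pandas sketch of this step (the file name and separator are assumptions based on the project files, not the exact project code):

```python
import numpy as np
import pandas as pd

# Load the columns that trigger the mixed-type warning as strings,
# then replace the 'X'/'XX' placeholders with NaN and cast to numeric.
azdias = pd.read_csv('Udacity_AZDIAS_052018.csv', sep=';',
                     dtype={'CAMEO_DEUG_2015': str, 'CAMEO_INTL_2015': str})

azdias[['CAMEO_DEUG_2015', 'CAMEO_INTL_2015']] = (
    azdias[['CAMEO_DEUG_2015', 'CAMEO_INTL_2015']]
    .replace({'X': np.nan, 'XX': np.nan})
    .astype(float)
)
```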
2. Addressing unknown values
The data contain a lot of unknown values; there were 232 columns with unknown values, as can be seen in the figure.
Furthermore, the unknown values are not represented consistently throughout the dataset.
As can be seen in the figure above, unknown values are represented as (-1, 0), (-1, 9) or simply -1. We fix this to make the data consistent across both datasets.
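For illustration, a sketch of mapping these dataset-specific unknown codes to NaN; the `UNKNOWN_CODES` dictionary would normally be built from the attribute description spreadsheet, and the two entries shown here are only illustrative:

```python
import numpy as np
import pandas as pd

# Illustrative only: in practice this mapping is parsed from the
# attribute description file provided with the data.
UNKNOWN_CODES = {
    'AGER_TYP': [-1, 0],
    'ALTER_HH': [-1, 0],
}

def replace_unknowns(df: pd.DataFrame, unknown_codes: dict) -> pd.DataFrame:
    """Replace dataset-specific 'unknown' codes with NaN, column by column."""
    df = df.copy()
    for col, codes in unknown_codes.items():
        if col in df.columns:
            df[col] = df[col].replace(codes, np.nan)
    return df

azdias = replace_unknowns(azdias, UNKNOWN_CODES)
```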
3. Missing Values
After engineering certain features, we analyse the percentage of missing values in each column. Below is a picture of the top 10 features with the most missing values in both the general population and the customers dataset.
Both datasets seem to have similar amounts of missing data per column. ‘ALTER_KIND4’, ‘ALTER_KIND3’, ‘ALTER_KIND2’ and ‘ALTER_KIND1’ were dropped as they have more than 80% missing values. We also analysed the number of missing values per row; observations with more than 50 missing features were removed.
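A sketch of these two thresholds (80% missing per column, 50 missing features per row):

```python
def drop_sparse(df, col_threshold=0.8, row_threshold=50):
    """Drop columns missing more than `col_threshold` of their values
    and rows missing more than `row_threshold` features."""
    missing_ratio = df.isnull().mean()
    df = df.drop(columns=missing_ratio[missing_ratio > col_threshold].index)
    return df[df.isnull().sum(axis=1) <= row_threshold]

azdias_clean = drop_sparse(azdias)
```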
4. Correlated features
A correlation matrix with a threshold of 75% was used to remove highly correlated features: they essentially feed the model the same information as other variables, so keeping them would not add much to model development.
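A sketch of this step, dropping one feature from every pair whose absolute correlation exceeds 0.75:

```python
import numpy as np

# Keep only the upper triangle of the correlation matrix so each pair is
# considered once, then drop one feature from every highly correlated pair.
corr = azdias_clean.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.75).any()]
azdias_clean = azdias_clean.drop(columns=to_drop)
```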
5. Imputing Missing Values
After all these steps, the data still contain missing values. We replace them with the most frequent observation, that is, the mode.
6. One-Hot Encoding
There are still some categorical variables that need to be converted so that they can be fed to machine learning algorithms. As such, we used one-hot encoding for ‘OST_WEST_KZ’ and ‘CAMEO_DEU_2015’.
7. Standardizing the data
To bring all the features to the same range, a standard scaler is used. This is done to avoid feature dominance when applying dimensionality reduction.
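A minimal sketch covering steps 5 to 7 with pandas and scikit-learn (variable names carry over from the earlier sketches and are assumptions, not the project's exact code):

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# 5. Impute remaining missing values with the most frequent value (mode).
imputer = SimpleImputer(strategy='most_frequent')
azdias_imputed = pd.DataFrame(imputer.fit_transform(azdias_clean),
                              columns=azdias_clean.columns)

# 6. One-hot encode the remaining categorical columns.
azdias_encoded = pd.get_dummies(azdias_imputed,
                                columns=['OST_WEST_KZ', 'CAMEO_DEU_2015'])

# 7. Standardise all features before dimensionality reduction.
scaler = StandardScaler()
azdias_scaled = scaler.fit_transform(azdias_encoded)
```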
Methodology
Dimensionality Reduction
Principal Component Analysis (PCA) was applied to the data to reduce its dimensionality. The aim is to get the smallest number of components that are able to explain the maximum variance in the dataset. The PCA plot shows the explained variance per component.
As seen from the figure above, 180 components are able to explain more than 80% of the variance. After cleaning and pre-processing, the data had 312 features; we reduced that by 132 features and still kept 87% of the explained variance.
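A sketch of how the number of components can be chosen from the cumulative explained variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit a full PCA, inspect the cumulative explained variance, then refit
# with the chosen number of components (180 in the analysis above).
pca_full = PCA().fit(azdias_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.80)) + 1
print(f'{n_components} components explain at least 80% of the variance')

pca = PCA(n_components=180)
azdias_pca = pca.fit_transform(azdias_scaled)
```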
PCA Analysis
We look at the feature weights given by the PCA algorithm to understand what each component is comprised of.
We can analyse each component and find its important features. The figure above shows the 10 most heavily weighted features in component 1: the first 5 are positively weighted and the last 5 are negatively weighted.
It can be concluded that the first component positively weights features related to estimated household net income and class-based classification, and negatively weights features related to the number of 1–2 family houses in the cell and the share of cars per household.
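A sketch of how the component weights can be mapped back to feature names (the column names come from the encoded DataFrame in the earlier sketch):

```python
import pandas as pd

def top_weights(pca, feature_names, component=0, n=5):
    """Return the n most positive and n most negative feature weights
    of a principal component."""
    weights = pd.Series(pca.components_[component],
                        index=feature_names).sort_values()
    return pd.concat([weights.tail(n), weights.head(n)])

# The 10 most heavily weighted features of the first component.
print(top_weights(pca, azdias_encoded.columns, component=0))
```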
Further analysis of clusters can be found in my GitHub Repo.
Customer Segmentation Report
Unsupervised Learning with K-Means
We use the K-Means clustering algorithm to divide the general population and customers into different segments. The basic idea behind the algorithm is to select the number of clusters that minimises the intra-cluster variation, that is, the points in one cluster are as close as possible to each other. We therefore determine the within-cluster distances and use the Elbow method to decide on the number of clusters.
From the figure above, using the Elbow method, we can see a levelling off at 24 clusters, which suggests 24 clusters should be used.
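A sketch of the elbow analysis, using the model's inertia as the WCSS:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-Means for a range of k and record the within-cluster sum of
# squares (inertia); the "elbow" in the curve suggests a suitable k.
ks = range(2, 31)
wcss = []
for k in ks:
    wcss.append(KMeans(n_clusters=k, random_state=42).fit(azdias_pca).inertia_)

plt.plot(ks, wcss, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('WCSS (inertia)')
plt.show()

# Final model with the number of clusters chosen above.
kmeans = KMeans(n_clusters=24, random_state=42).fit(azdias_pca)
```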
We now compare the proportion of data in each cluster for the customer data to the proportion of data in each cluster for the general population.
The table above shows the number of people in each cluster in the AZDIAS and CUSTOMERS datasets.
Proportion of population and customers within each cluster.
We can see that clusters 2 and 3 have an over-representation of customers.
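A sketch of this comparison; it assumes the customer data has gone through the same cleaning, scaling and PCA steps to produce `customers_pca`:

```python
import pandas as pd

# Assign each individual to a cluster and compare the proportions.
population_clusters = kmeans.predict(azdias_pca)
customer_clusters = kmeans.predict(customers_pca)

comparison = pd.DataFrame({
    'population': pd.Series(population_clusters).value_counts(normalize=True),
    'customers': pd.Series(customer_clusters).value_counts(normalize=True),
}).sort_index()

# A ratio above 1 marks clusters where customers are over-represented.
comparison['ratio'] = comparison['customers'] / comparison['population']
print(comparison)
```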
Supervised Learning Model
The aim of building a machine learning model is to predict whether an individual is likely to convert into a customer, and also to better select individuals to target in future marketing campaigns.
Implementation
Because this is a binary classification problem, the following algorithms were proposed (a comparison sketch follows the list):
- LogisticRegression
- DecisionTreeClassifier
- RandomForestClassifier
- GradientBoostingClassifier
- AdaBoostClassifier
- XGBoost Classifier
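A sketch of this comparison using cross-validated ROC AUC; `X_train` and `y_train` stand for the pre-processed MAILOUT_TRAIN features and response labels:

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'DecisionTree': DecisionTreeClassifier(),
    'RandomForest': RandomForestClassifier(),
    'GradientBoosting': GradientBoostingClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': XGBClassifier(eval_metric='logloss'),
}

# Compare the candidates on the same folds with the competition metric.
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    print(f'{name}: {scores.mean():.3f}')
```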
Metric
Considering that this is a highly imbalanced dataset, an appropriate evaluation metric is the Area Under the Receiver Operating Characteristic Curve (AUC-ROC). The closer the AUC-ROC is to 1, the better the performance of the model.
GradientBoostingClassifier is the best model, followed by AdaBoost and the XGBoost classifier. However, even though Gradient Boosting has the superior score, it also has the highest training time.
Hyperparameter Tuning
AdaBoost and XGBClassifier were selected for hyperparameter tuning. A set of hyperparameters for both algorithms was selected, and a grid search was performed to determine the best models.
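A sketch of the tuning step for AdaBoost; the parameter grid shown here is illustrative, not the exact grid used in the project:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid: number of weak learners and learning rate.
param_grid = {
    'n_estimators': [100, 200, 500],
    'learning_rate': [0.01, 0.1, 1.0],
}

grid = GridSearchCV(AdaBoostClassifier(), param_grid,
                    scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```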
Results
The best model scored around 0.77 on the validation set. We can then predict the test labels using this model.
For the AdaBoost model, the feature with the highest importance is D19_Soziales. Unfortunately, there is no description for it in ‘DIAS Information Levels — Attributes 2017.xlsx’. Because of the prefix D19, I assume it is somehow associated with the household information level / social transactions, but I could be wrong.
For the XGBoost model, the most important feature is again D19_Soziales, but it is not as dominant as in AdaBoost. The next feature is KBA05_MOD2, which is related to the share of middle-class cars.
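A sketch of how these importances can be inspected, assuming the tuned model comes from the grid search above and `X_train` is a DataFrame:

```python
import pandas as pd

# Map feature importances back to column names and show the top 10.
best_model = grid.best_estimator_
importances = pd.Series(best_model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```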
Kaggle Competition
The final predictions were made on the test dataset, Udacity_MAILOUT_052018_TEST.csv, and submitted to Kaggle.
Prediction with XGBoost without GridSearch:
A score of 0.79 can be further improved, but it is not bad for a first attempt.
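A sketch of building the submission file; the `preprocess` helper is hypothetical and stands for the same cleaning pipeline applied to the training data, and the column names 'LNR' and 'RESPONSE' are assumed from the competition's submission format:

```python
import pandas as pd

# Load the test set, apply the same pre-processing, and predict probabilities.
mailout_test = pd.read_csv('Udacity_MAILOUT_052018_TEST.csv', sep=';')
X_test = preprocess(mailout_test)  # hypothetical helper: same pipeline as training
probs = best_model.predict_proba(X_test)[:, 1]

# Kaggle expects the individual ID and the predicted response probability.
submission = pd.DataFrame({'LNR': mailout_test['LNR'], 'RESPONSE': probs})
submission.to_csv('submission.csv', index=False)
```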
Justification
With GridSearch hyperparameter tuning, I was able to improve both the XGBoost and AdaBoost models. The best model was AdaBoost, with a Public Score of 0.80365. I believe it can be further improved with better fine-tuning.
Conclusion
The problem statement set up at the beginning of this project was:
- Can we reduce the number of people targeted and still retain a high proportion of our target demographic, i.e. can we improve the customer acquisition process through targeted marketing?
I successfully created a model with an AUC score above 0.80 to predict which individuals are most likely to respond to a mailout campaign. This was the main goal of this project: using this model, with further fine-tuning, a much smaller proportion of the general population can be identified and targeted whilst still maintaining the historic customer conversion rate. This will significantly reduce the cost of future marketing campaigns.
Improvement
To further improve the model, the pre-processing steps are an area that I feel could be revisited. For example:
- One-hot encoding more categorical features;
- Better understanding of the features, which will definitely help in selecting relevant ones;
- Better hyperparameter tuning.
Finally, I would like to thank Udacity and Arvato Analytics for providing a fantastic platform and an exciting opportunity to work on a real-life problem that helped me put my data science skills into practice.
For more information, please visit my GitHub repository.
References
- The Ultimate Guide to Customer Acquisition for 2020 (hubspot.com)
- https://www.kaggle.com/c/udacity-arvato-identify-customers
- https://github.com/OliverFarren/arvato-udacity-customersegmentation
- https://towardsdatascience.com/how-to-learn-from-bigdata-files-on-low-memory-incremental-learning-d377282d38ff