Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Telecom providers suffer from a loss of valuable customers to competitors. This is known as churn. The first step to retain customers is to predict which customers are most likely to churn. Next, predicted churners can be targeted to encourage them to stay. It is therefore crucial to build a churn prediction model that is as accurate as possible. Such models are usually built by applying a supervised learning algorithm to historical data. In this study, a more sophisticated approach is investigated, in which the historical data are first clustered using unsupervised learning and a separate model is then built for each homogeneous group using supervised learning. Customer data, contractual data and online behavior data from a Dutch telecom provider are collected. Homogeneous groups of customers are identified based on the customer and contractual data using t-Distributed Stochastic Neighbor Embedding (t-SNE), Gaussian Mixture Model (GMM) and Latent Class Analysis (LCA). Additionally, a partitioning of the data suggested by domain experts (i.e. segmentation) is considered. The supervised learning models used are Logistic Regression (LR), Random Forest (RF), XGBoost and a heterogeneous ensemble of the aforementioned models. The performance of the various combinations is measured using the Area Under the Curve (AUC). All combinations of techniques are compared to the benchmark approach, which does not use any results from an unsupervised learning technique. The results show that for the flexible models (i.e. RF, XGBoost and the ensemble) the hybrid approach has no added value, as the highest AUC is obtained with the benchmark approach. However, for the less flexible model (i.e. LR), the highest AUC is obtained with the hybrid approach. This suggests that an LR model fitted for each homogeneous group is able to model the complex relations in the data set better than a single LR model fitted to the whole data set.
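As a rough illustration of the evaluation described above, the sketch below computes the AUC for a benchmark model and for a hybrid approach on a tiny made-up data set. It is not the thesis's code or data: the function names (`auc`, `auc_per_segment`), the segment labels and all scores are hypothetical, and pooling scores from different per-segment models into one overall AUC assumes those scores are comparable across segments.

```python
from collections import defaultdict


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    Equals the probability that a randomly chosen churner (label 1) receives
    a higher score than a randomly chosen non-churner (label 0); ties count 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one churner and one non-churner")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_per_segment(segments, labels, scores):
    """Compute the AUC separately within each customer segment."""
    grouped = defaultdict(lambda: ([], []))
    for g, y, s in zip(segments, labels, scores):
        grouped[g][0].append(y)
        grouped[g][1].append(s)
    return {g: auc(ys, ss) for g, (ys, ss) in grouped.items()}


# Hypothetical toy data (assumption): two segments, six customers.
segments = ["A", "A", "A", "B", "B", "B"]
churned = [1, 0, 0, 1, 1, 0]

# Hypothetical scores from one model fitted on all customers (benchmark) ...
benchmark_scores = [0.6, 0.7, 0.2, 0.8, 0.4, 0.5]
# ... and from separate models fitted within each segment (hybrid).
hybrid_scores = [0.9, 0.3, 0.1, 0.7, 0.6, 0.2]

benchmark_auc = auc(churned, benchmark_scores)
hybrid_auc = auc(churned, hybrid_scores)
print(f"benchmark AUC: {benchmark_auc:.3f}, hybrid AUC: {hybrid_auc:.3f}")
print("benchmark AUC per segment:",
      auc_per_segment(segments, churned, benchmark_scores))
```

The pairwise formulation is quadratic in the number of customers, which is fine for a sketch; for real data a rank-based implementation from a statistics library would be the more practical choice.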