Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Telecom providers suffer from a loss of valuable customers to competitors. This is known as churn. The first step to retaining customers is to predict which customers are most likely to churn. Next, predicted churners can be targeted to encourage them to stay. It is therefore crucial to build a churn prediction model that is as accurate as possible. Such models are usually built by applying a supervised learning algorithm to historical data. In this study, a more sophisticated approach is investigated, in which historical data are first clustered using unsupervised learning and then, for each homogeneous group, a model is built with the help of supervised learning. Customer data, contractual data and online behavior data from a Dutch telecom provider are collected. Homogeneous groups of customers are identified based on the customer and contractual data using t-Distributed Stochastic Neighbor Embedding (t-SNE), Gaussian Mixture Models (GMM) and Latent Class Analysis (LCA). Additionally, a partitioning of the data suggested by domain experts (i.e., segmentation) is considered. The supervised learning models used are Logistic Regression (LR), Random Forest (RF), XGBoost and a heterogeneous ensemble of the aforementioned models. The performance of the various combinations is measured with the help of the Area Under the Curve (AUC). All combinations of techniques are compared to a benchmark approach that does not utilize any results from an unsupervised learning technique. The results revealed that for the flexible models (i.e., RF, XGBoost and the ensemble) there is no added value in using a hybrid approach, as the highest AUC is achieved by the benchmark approach. However, for the less flexible model (i.e., LR), the largest AUC is achieved by the hybrid approach.
This suggests that an LR fitted to each homogeneous group is able to model the complex relations in the data set better than a single LR for the whole data set.
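The hybrid cluster-then-classify idea described above can be sketched as follows. This is a minimal illustrative example on synthetic data, using scikit-learn's GaussianMixture for the unsupervised step and one LogisticRegression per group for the supervised step; all data and settings are hypothetical and do not reflect the thesis's actual features or tuning.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer/contractual data.
X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unsupervised step: identify homogeneous customer groups with a GMM.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X_tr)
g_tr, g_te = gmm.predict(X_tr), gmm.predict(X_te)

# Supervised step: fit one LR per group. A global LR serves as a
# fallback in case a group happens to contain only one class.
global_lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
models = {}
for g in np.unique(g_tr):
    Xg, yg = X_tr[g_tr == g], y_tr[g_tr == g]
    models[g] = (LogisticRegression(max_iter=1000).fit(Xg, yg)
                 if len(np.unique(yg)) > 1 else global_lr)

# Score each test customer with its own group's model
# (defaulting to the global model for unseen groups).
p = global_lr.predict_proba(X_te)[:, 1]
for g, m in models.items():
    mask = g_te == g
    if mask.any():
        p[mask] = m.predict_proba(X_te[mask])[:, 1]

print("hybrid AUC:", round(roc_auc_score(y_te, p), 3))
```

The benchmark approach in the abstract corresponds to scoring everyone with `global_lr` alone; comparing the two AUC values mirrors the study's comparison.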
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Random forests is generally known as an excellent classifier that is flexible in the types of data it can be applied to. Despite this characteristic, it is also regarded as a ‘black box’ classifier: its ensembles comprise hundreds of complex tree members. This is a major drawback for certain applications where insight into the variables that account for certain outcomes is essential (e.g., medical diagnosis problems for identifying diseased individuals). There are, however, more recent methods that produce ensembles reduced in size by selecting the most important ensemble members. Some of these methods also yield ensemble members with simple structures to increase interpretation possibilities. Our selection of such methods comprises optimal trees ensemble (OTE), node harvest, and rule ensembles. These methods were assessed through a simulation study and an application to an MRI dataset on Alzheimer’s disease classification, to determine their predictive performance and information recovery, and thereby their suitability for interpretational purposes. Random forests was taken as the benchmark for predictive performance and the baseline for improvement of interpretation. We focussed solely on binary classification. The benchmark random forests generally had good predictive performance and was among the best in variable importance recovery. It was still the superior classifier in high-dimensional settings. OTE often had similar predictive performance and variable importance recovery; it did not, however, have any advantage over random forests regarding suitability for interpretation. Node harvest had reasonable interaction recovery and good variable split point recovery, albeit at the cost of predictive performance and variable importance recovery.
Rule ensembles proved to be a suitable alternative to random forests that produces models suited to interpretation with comparable or better accuracy, but only when the dataset has a clear signal. In noisy or high-dimensional settings, there is still no suitable, more interpretable tree ensemble alternative to random forests among the studied methods. Such settings still benefit from ensembles of numerous, highly complex trees.
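The benchmarking idea above can be sketched in miniature: simulate data with a few informative variables, fit a random forest, and inspect predictive performance (AUC) alongside impurity-based variable importances. This is an illustrative toy setup with arbitrary settings, not the thesis's actual simulation design, and it uses scikit-learn's RandomForestClassifier rather than the compared methods (OTE, node harvest, rule ensembles), which have no standard scikit-learn implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# With shuffle=False, the informative variables are the first four
# columns, so "variable importance recovery" can be checked directly.
X, y = make_classification(n_samples=1500, n_features=20,
                           n_informative=4, n_redundant=0,
                           shuffle=False, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

rf = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])

# A good importance measure should rank (most of) the four truly
# informative variables at the top.
top4 = np.argsort(rf.feature_importances_)[::-1][:4]
print("AUC:", round(auc, 3))
print("top-4 most important features:", sorted(top4))
```

A reduced-ensemble method would be evaluated the same way, trading some of this predictive performance for a smaller, more interpretable set of members.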