Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Aerosols are tiny particles of various kinds and compositions suspended in the atmosphere, some of which have a critical, adverse impact on public health. Hence, modelling the prevalence and distribution of these separate types is vital for giving shape to informed policy on air quality. In this work, methods are described to identify clusters of similar aerosol type mixtures in the Earth’s atmosphere on a global scale, on the basis of microphysical data from the space-borne remote sensing instrument POLDER-3. We report an unsupervised learning approach using the Self-Organizing Map (SOM) and k-means clustering, which allows for clustering without a priori assumptions on the existing aerosol types, their nature or prevalence. Two methods are introduced to stabilize these clustering algorithms over multiple identical runs and so manage their convergence to local optima: the k-means nstart option is extended to the SOM, and a set-up is given for a new method, Expectation-Maximization-centered Mahalanobis clustering (EMcMc). A (repeated) v-fold cross-validation framework is presented to find the optimal number of clusters k in the data by means of cluster validation measures, currently including Prediction Strength and validated variants of the Silhouette Width. Using a separate test set, the method can be used to optimize a generic k, countering overfitting. A novel validation index is developed which extends the Silhouette Width to data sets with many observations (large N): the Gridded Silhouette Width. All described methods are implemented in the statistical software package R and shown to work for simulated examples, originating from scaled Gaussian distributions with varying degrees of overlap. Analysis of the POLDER-3 data indicated that, using only four variables, 8 clusters can be found in a stable and reproducible fashion. The Silhouette indices did not appear to perform well for data as widely dispersed as these. The clusters found were characterized based on their variable distributions and geographical occurrence, which proved to be feasible and meaningful for real-life interpretations. The proposed aerosol types were dust, marine, urban-industrial, smoke and mixtures thereof. Keywords: aerosol typing; unsupervised learning; self-organizing map; k-means clustering; cluster validation measures; cross-validation; gridded silhouette.
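A minimal R sketch of the repeated-start idea described above, assuming the kohonen and cluster packages and a numeric data matrix X; the grid size, number of restarts, and quantization-error criterion are illustrative, not the thesis implementation.

    # Repeated-start SOM followed by k-means on the codebook vectors,
    # keeping the run with the lowest quantization error -- an analogue
    # of the kmeans() nstart option for the SOM.
    library(kohonen)
    library(cluster)

    set.seed(1)
    X <- scale(matrix(rnorm(2000), ncol = 4))  # stand-in for the 4 POLDER-3 variables

    best <- NULL
    for (run in 1:25) {
      fit <- som(X, grid = somgrid(6, 6, "hexagonal"))
      qe  <- mean(fit$distances)               # quantization error of this run
      if (is.null(best) || qe < best$qe) best <- list(fit = fit, qe = qe)
    }

    km  <- kmeans(best$fit$codes[[1]], centers = 8, nstart = 25)
    cl  <- km$cluster[best$fit$unit.classif]   # map observations to the 8 clusters
    sil <- silhouette(cl, dist(X))             # Silhouette Width for validation
    summary(sil)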
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Synchronous neuronal responses across subjects are also known as neural reliability. The level of neural reliability evoked by natural stimuli has been shown to be a predictor of larger audience preferences (Dmochowski et al., 2014). The same authors also proposed the state-of-the-art method for calculating neural reliability in an EEG setting (Dmochowski et al., 2014). However, the method is indirect and rather ad hoc; therefore, several existing alternative methods are considered, as well as a newly proposed algorithm for calculating neural reliability. All the different methods are compared by means of a simulation study. Here, their performance is tested in their ability to recover the actual neural reliability in the data, but also in predicting a population measure. Furthermore, the wavelet transform as a denoising step in the setting of EEG data is investigated. The results of the simulation study show that Dmochowski and colleagues’ (2014) method performs well on undenoised data and when the relationship between the “true” ISC and buying behaviour is strong. However, the adapted neural reliability method of Hasson and colleagues (2004), originally intended for fMRI studies, stands out not only in terms of performance, but also in consistency of performance under different data characteristics, like the strength of the ISC, the signal-to-noise ratio and the strength of the relation between true ISC and buying behaviour. Moreover, this method is also more direct and easier to calculate. The proposed way of denoising by wavelet transform only hurts the performance of the proposed neural reliability methods. It can be concluded that the adapted method of Hasson and colleagues (2004) can be recommended both for determining the ISC and for determining the relation between ISC and a population measure.
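A minimal R sketch of a Hasson-style inter-subject correlation (ISC), assuming one averaged time course per subject stored in a time-by-subject matrix; this illustrates the leave-one-out correlation idea only, not the exact adapted method of the thesis.

    # Leave-one-out ISC: correlate each subject's time course with the
    # average time course of all remaining subjects, then average.
    isc_loo <- function(Y) {                 # Y: time points x subjects
      n <- ncol(Y)
      r <- vapply(seq_len(n), function(j) {
        cor(Y[, j], rowMeans(Y[, -j, drop = FALSE]))
      }, numeric(1))
      mean(r)
    }

    set.seed(1)
    signal <- sin(seq(0, 10, length.out = 500))
    Y <- replicate(12, signal + rnorm(500, sd = 0.8))  # 12 simulated subjects
    isc_loo(Y)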
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Criminal profiling is a rapidly growing field of research, in which statistics are increasingly incorporated alongside the traditional behavioural profiling approach that uses psychological theories to predict the behaviour of an offender. A model was built to predict offender characteristics from crime and victim characteristics for single-victim-single-offender homicides in the Netherlands. Using the Dutch Homicide Monitor, eight different Bayesian network structure learning algorithms were combined into one model; arcs that were present in at least three of the separate structure learning algorithms were represented in the combined model, and their direction was determined by the highest cumulative arc strength. The graphical representation of the model gives insight into the dependence relationships between crime, victim, and offender characteristics, and therefore could be used to confirm existing and develop new hypotheses on criminal psychology. Moreover, with an appropriate threshold resulting in a prediction error of less than 10 percent, the combined Bayesian network might be suitable for actual implementation by the police. This practical implication and the restrictions of the model are discussed, and recommendations for future research are given.
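A minimal R sketch of combining several structure learning algorithms by arc agreement, assuming the bnlearn package; the three algorithms, the example data set, and the 2-of-3 agreement threshold are illustrative (the thesis combines eight algorithms with a 3-of-8 threshold).

    # Learn one network per algorithm, then keep arcs supported by a
    # minimum fraction of the algorithms, with direction by strength.
    library(bnlearn)

    data(learning.test)                       # example discrete data set from bnlearn
    df   <- learning.test
    nets <- list(hc(df), tabu(df), mmhc(df))  # three structure learning algorithms

    strength <- custom.strength(nets, nodes = names(df))
    combined <- averaged.network(strength, threshold = 2/3)
    combined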
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Food-borne disease outbreaks constitute a large, ongoing public health burden worldwide (Hald et al., 2016). Early identification of contaminated food products plays an important role in reducing the health burden of food-borne disease outbreaks (Jacobs et al., 2017). Case-control studies together with logistic regression analysis are primarily used in food-borne outbreak investigations. However, the current methodology is associated with problems including response misclassification, missing values and the neglect of small-sample bias. Jacobs et al. (2017) developed a formal Bayesian variable selection method which deals with the problems of missing covariates and a misclassified response. A re-analysis of the Dutch Salmonella Thompson 2012 outbreak data (Friesema et al., 2014) illustrated that this Bayesian approach allows a relatively easy implementation of these concepts and performs better than standard logistic regression analysis in identifying the responsible food products. The complete Bayesian variable selection model is composed of three different parts, namely misclassification correction, missing value imputation and Bayesian variable selection. In this thesis, we are interested in how these different parts affect the performance of Bayesian variable selection models in scenarios with (i) the same response misclassification rate and missingness rate in an assumed responsible food product covariate as in the original food-borne disease outbreak dataset, (ii) different response misclassification rates, (iii) different missingness rates in an assumed responsible food product, and (iv) combinations of different response misclassification rates and missingness rates. We answer this research question by designing and executing a simulation study. Our results indicate that for the four different versions of Bayesian variable selection models studied in this thesis, an increase in the response misclassification rate, in the missingness rate in the assumed responsible food product covariate, or in both results in a decrease in model performance. Bayesian variable selection, misclassification correction and missing value imputation all contribute positively to model performance. Although missing value imputation is the most computationally expensive of the three components, it also contributes the most to model performance.
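A minimal R sketch of the two data perturbations varied in the simulation study, assuming a binary outcome y and a binary exposure covariate x; the rates and variable names are illustrative, not taken from the thesis.

    # Inject response misclassification (flip labels) and covariate
    # missingness (set entries to NA) at chosen rates.
    perturb <- function(y, x, misclass_rate, missing_rate) {
      flip <- runif(length(y)) < misclass_rate
      y[flip] <- 1 - y[flip]                    # misclassified case/control status
      x[runif(length(x)) < missing_rate] <- NA  # missingness in the food covariate
      list(y = y, x = x)
    }

    set.seed(1)
    x <- rbinom(300, 1, 0.4)                    # assumed responsible food product
    y <- rbinom(300, 1, plogis(-1 + 2 * x))     # true outcome depends on exposure
    d <- perturb(y, x, misclass_rate = 0.10, missing_rate = 0.20)
    mean(d$y != y); mean(is.na(d$x))            # realized perturbation rates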
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Telecom providers suffer from a loss of valuable customers to competitors. This is known as churn. The first step to retaining customers is to predict which customers are most likely to churn. Next, predicted churners can be targeted to encourage them to stay. It is therefore crucial to build a churn prediction model that is as accurate as possible. Such models are usually built by applying a supervised learning algorithm to historical data. In this study, a more sophisticated approach is investigated, where historical data is first clustered using unsupervised learning and then a model is built for each homogeneous group with the help of supervised learning. Customer data, contractual data and online behaviour data from a Dutch telecom provider are collected. Homogeneous groups of customers are identified based on the customer and contractual data using t-Distributed Stochastic Neighbor Embedding (t-SNE), a Gaussian Mixture Model (GMM) and Latent Class Analysis (LCA). Additionally, a partitioning of the data suggested by domain experts (i.e. segmentation) is considered. The supervised learning models used are Logistic Regression (LR), Random Forest (RF), XGBoost and a heterogeneous ensemble of the aforementioned models. The performance of the various combinations is measured with the help of the Area Under the Curve (AUC). All combinations of techniques are compared to a benchmark approach that does not utilize any results from an unsupervised learning technique. The results revealed that for the flexible models (i.e. RF, XGBoost and the ensemble) there is no added value in using a hybrid approach, as the highest AUC is achieved by the benchmark. However, for the less flexible model (i.e. LR), the largest AUC is achieved by the hybrid approach. This suggests that an LR fitted for each homogeneous group is able to model the complex relations in the data set better than a single LR for the whole data set.
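A minimal R sketch of the cluster-then-predict idea for the less flexible model, assuming simulated customer features; k-means stands in for the clustering step (the thesis uses t-SNE, GMM and LCA), and the AUC here is in-sample, where a real comparison would use a held-out set.

    # Hybrid cluster-then-predict: k-means groups, one logistic
    # regression per group, AUC pooled over the predictions.
    auc <- function(y, p) {                   # rank (Wilcoxon) formula for AUC
      r <- rank(p); n1 <- sum(y == 1); n0 <- sum(y == 0)
      (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    }

    set.seed(1)
    X <- data.frame(tenure = rnorm(600), usage = rnorm(600))
    y <- rbinom(600, 1, plogis(X$tenure * ifelse(X$usage > 0, 1.5, -1.5)))

    cl <- kmeans(X, centers = 2, nstart = 25)$cluster
    p  <- numeric(length(y))
    for (g in unique(cl)) {                   # one LR per homogeneous group
      idx    <- cl == g
      fit    <- glm(y[idx] ~ ., data = X[idx, ], family = binomial)
      p[idx] <- fitted(fit)
    }
    auc(y, p)                                                  # hybrid approach
    auc(y, fitted(glm(y ~ ., data = X, family = binomial)))    # single global LR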
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
In this project a new approach to forecasting infectious disease epidemics was tested in a simulation and applied to data from the 2014-2016 Ebola epidemic. GLMs were applied to the (simulated) data, from which the key quantities contact rate and epidemic size could be obtained. With (non-)parametric bootstrapping, the GLM results could be assessed, and the key quantities were obtained and subsequently used to produce forecasts. Forecasting intervals were constructed to show the accuracy of the forecasts in terms of epidemic size and duration. Simulation results suggested that the method underestimated the eventual epidemic size and overestimated the contact rate. However, applying the method to a real-life data set resulted in overestimation of the eventual epidemic size. The estimates of the contact rate for the application to real-life data should be compared to estimates from the literature before substantive meaning can be attached to the results. Both simulation and application results gave variable estimates for the epidemic duration, although a positive relation was seen between epidemic size and epidemic duration. Estimates for the contact rate could be improved. The major issues with prediction were attributable to exact collinearity introduced by the systematic model; the major issues with forecasting were attributable to extreme estimates of the epidemic size. The cause of both issues lies in the GLMs that were fit to the data.
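A minimal R sketch of the GLM-plus-parametric-bootstrap forecasting idea, assuming weekly case counts and a simple quadratic Poisson systematic model; this illustrative model is not the thesis's systematic model.

    # Fit a Poisson GLM to weekly case counts and build bootstrap
    # forecast intervals by refitting on resimulated epidemics.
    set.seed(1)
    week  <- 1:30
    cases <- rpois(30, lambda = exp(1 + 0.25 * week - 0.005 * week^2))
    fit   <- glm(cases ~ week + I(week^2), family = poisson)

    horizon <- data.frame(week = 31:40)
    boot <- replicate(500, {
      cases_b <- rpois(length(week), fitted(fit))        # parametric bootstrap
      fit_b   <- glm(cases_b ~ week + I(week^2), family = poisson)
      predict(fit_b, newdata = horizon, type = "response")
    })
    apply(boot, 1, quantile, probs = c(0.025, 0.975))    # forecasting intervals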
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
The Elo rating system has been used in various sports and games, such as chess, soccer, tennis and even video games, to calculate the relative playing strengths of players or teams. Originally, the Elo system was invented by a Hungarian physics professor, Arpad Elo, to improve the chess rating system. Many rating systems now used in sports are based on the Elo rating system with modifications. The objective of this thesis project is to examine the Elo rating system for soccer tournaments and how it can be applied to the 2017 UEFA Women’s Championship (abbreviated as UEFA Women’s Euro 2017). More specifically, two primary interests lie in this project. The first interest lies in determining the strength of each team by assigning an Elo rating to each competing team after the tournament. In addition, it is interesting to see how home-field advantage helped the Netherlands (the host country) win the championship of UEFA Women’s Euro 2017, by incorporating the home-field advantage in the Elo formula. Secondly, the strengths of the players of all teams are also of interest. In order to estimate the strengths of the players, each player is assigned a rating (not an Elo rating) to represent how strong that player is; we can then compare the players among all teams. In order to assess the reliability of our ideas and methodology, a simulation study follows the theoretical part of our research. In Chapter 1 I first describe the basic concepts of the Elo rating system. Then a short summary of the relevant literature is presented. Finally I discuss the source of the data, the arrangement of the tournament, and the steps taken to go through the algorithm and methodology. In Chapter 2 the basic Elo formula and some modified Elo models are proposed, which allow us later on to determine the most appropriate model for estimating the strengths of every competing country and of the players of all teams. At the end of this chapter, I develop an ordered probit regression model for forecasting match results in UEFA Women’s Euro 2017. Chapter 3 presents a simulation study for estimating the strengths of all participating countries and of the football players of all teams. Chapter 4 presents the main conclusions drawn from the model computations and suggests directions for further research.
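A minimal R sketch of the basic Elo update with a home-field advantage term, using the common logistic expected-score formula; the K-factor of 30 and the 100-point advantage are illustrative values, not the thesis's calibrated settings.

    # Basic Elo update: expected score from the logistic curve, with a
    # home-field advantage added to the home team's effective rating.
    elo_update <- function(r_home, r_away, score_home, K = 30, hfa = 100) {
      e_home <- 1 / (1 + 10^((r_away - (r_home + hfa)) / 400))
      c(home = r_home + K * (score_home - e_home),
        away = r_away + K * ((1 - score_home) - (1 - e_home)))
    }

    # Host (rated 1900) beats a 2000-rated visitor (score 1 = win, 0.5 = draw):
    elo_update(1900, 2000, score_home = 1)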
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
The area under the receiver operating characteristic (ROC) curve (AUC) is a commonly used measure of the discriminative ability of a model. For the time-to-event variable in survival analysis the case and control sets vary over time, so a dynamic definition of the AUC is required. We choose the dynamic AUC defined by the incident true positive rate and the dynamic false positive rate (I/D AUC) proposed by Heagerty and Zheng [6]. However, the difficulty of empirically obtaining the incident true positive rate hampers the estimation of the dynamic AUC; thus, several semi-parametric and non-parametric estimators have been proposed. Heagerty and Zheng [6] proposed a semi-parametric estimation method based on the Cox model. A non-parametric estimate using an intermediate concordance measure with LOWESS smoothing was introduced by van Houwelingen and Putter [14]. Based on the same intermediate concordance measure, Saha-Chaudhuri and Heagerty suggested using locally weighted mean rank smoothing [10]. Recently, Shen et al. proposed a semi-parametric method that adopts fractional polynomials to fit the dynamic AUC [12]. In this thesis, we compare the performance of these methods under different configurations in a series of simulations. The plain Cox method is not recommended when the proportional hazards assumption is not satisfied. The Cox model with time-varying coefficients is relatively stable when the marker has a mediocre effect. For the non-parametric methods, too wide a span/bandwidth may lead to large bias, and too narrow a span/bandwidth may lead to unstable estimates; thus, a trade-off between bias and standard deviation has to be made. For the fractional polynomial method, adding extra fractional polynomial terms does not benefit performance. In addition, many researchers have observed a decreasing trend of the I/D AUC over time in their empirical studies [10][12][6], yet Pepe et al. held the opinion that the I/D AUC may be an increasing function of time [7]. We investigate the trend of the I/D AUC under a Cox model with a binary marker. We observe that under certain Cox models the I/D AUC curve first increases and then decreases; thus the I/D AUC is not necessarily a decreasing function of time.
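For reference, the incident/dynamic quantities used above can be written, in the notation of Heagerty and Zheng [6] (marker M, event time T, threshold c), as

    TPR_t^I(c) = P(M_i > c | T_i = t)             (incident true positive rate)
    FPR_t^D(c) = P(M_j > c | T_j > t)             (dynamic false positive rate)
    AUC(t)     = P(M_i > M_j | T_i = t, T_j > t)

so AUC(t) is the probability that a subject experiencing the event at time t has a higher marker value than a subject still event-free at t.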
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Game trees have been utilized as a formal representation of adversarial planning scenarios such as two-player zero-sum games like chess [1, 2]. When using stochastic leaf values based on Bernoulli trials to model noisy game trees, a challenging task is to solve the Monte Carlo Tree Search (MCTS) problem of identifying a best move under uncertainty. Confidence bound algorithms are investigated as one solution, with a focus on the FindTopWinner algorithm by Teraoka, Hatano, and Takimoto [3], which uses (a) the minimax rule to evaluate the game tree by alternately minimizing and maximizing over the values associated with each move, (b) Hoeffding’s Inequality to estimate sample size requirements by fixing precision and error probability, and (c) an epoch-wise pruning regime to reduce investment in suboptimal nodes. We experimented with this algorithm by equipping it with methods based on (i) Bernstein’s Inequality, to create a tighter confidence bound [4], (ii) the Law of the Iterated Logarithm (LIL), to sample in single-sample steps, allowing for exact pruning and stopping [5, 6], and (iii) a combination of both. An empirically derived Hoeffding-based Iterated-Logarithm confidence bound is proposed in a fully refurbished FindTopWinner algorithm, which achieved much better performance in terms of the number of samples required to find a best move, whereas the Bernstein-based approaches did not fare better than the original by Teraoka et al. [3]. Possible reasons, such as limited, more asymptotic advantages for Bernstein-based algorithms, are discussed, and the recommended parameter space for the empirically derived Hoeffding-based confidence bound is provided.
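As background to step (b) above: for n i.i.d. samples of a bounded quantity with sample mean X-bar and true mean mu, Hoeffding's Inequality gives

    P(|X-bar - mu| >= eps) <= 2 exp(-2 n eps^2),

so fixing a precision eps and an error probability delta yields the sufficient sample size n >= ln(2/delta) / (2 eps^2), which is the kind of fixed-precision sampling budget the epoch-wise regime builds on.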
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Data is often collected in an aggregated fashion, for instance as categories, in intervals, or in predefined areas. In order to estimate the underlying, continuous distribution of an aggregated variable, the penalized composite link mixed model (PCLMM) can be used. The PCLMM only assumes that the underlying distribution is smooth, and so it can be used to estimate any nonparametric regression function. The model is a combination of the generalized linear mixed model, penalized B-splines, and the composite link model. In this thesis, the mathematical framework of these three well-known techniques is described, after which the close connection between them and the PCLMM is used to give a mathematical description of the estimation technique. Using a simulation of a one-dimensional function and an example on Q-fever cases in the Netherlands in 2009, it is shown that the PCLMM can accurately estimate even the smaller details of the underlying distribution if covariate information on the finer scale is available. Decent approximations of the underlying distribution are obtained when covariate data is only available on the aggregated scale.
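In outline (following the penalized composite link model on which the PCLMM builds; the notation here is assumed, not quoted from the thesis), the expected aggregated counts are a known composition of a smooth latent series:

    mu = C gamma,   with   gamma = exp(B beta),

where C is the composition matrix mapping the fine grid to the observed aggregates, B is a B-spline basis on the fine grid, and beta is estimated by maximizing the penalized log-likelihood l(beta) - (lambda/2) ||D_d beta||^2, with D_d a d-th order difference matrix enforcing smoothness. The mixed-model formulation then lets lambda be estimated as a variance-component ratio rather than tuned by hand.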
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Random forests is generally known as an excellent classifier that is flexible in the types of data it is applied to. Despite this characteristic, it is also regarded as a ‘black box’ classifier: its ensembles comprise hundreds of complex tree members. This is a major drawback for certain applications, where insight into the involvement of variables that account for certain outcomes is essential (e.g., medical diagnosis problems for identifying diseased individuals). There are, however, more recent methods that produce ensembles reduced in size by selecting the most important ensemble members. Some of these methods also yield ensemble members with simple structures to increase interpretation possibilities. Our selection of such methods comprises optimal trees ensemble (OTE), node harvest, and rule ensembles. These methods were assessed through a simulation study and an application to an MRI dataset on Alzheimer’s disease classification, to determine predictive performance and information recovery, and thereby their suitability for interpretational purposes. Random forests was taken as the benchmark for predictive performance and the baseline for improvement of interpretation. We focussed solely on binary classification. The benchmark random forests had generally good predictive performance and was among the best in variable importance recovery. It was still the superior classifier in high-dimensional settings. OTE often had similar predictive performance and variable importance recovery. However, it did not have any advantage over random forests regarding suitability for interpretation. Node harvest had reasonable interaction recovery and good variable split point recovery, albeit at the cost of predictive performance and variable importance recovery. Rule ensembles proved to be a suitable alternative to random forests that produces models suitable for interpretation with comparable or better accuracy, but only when the dataset has a clear signal. In noisy or high-dimensional settings, there is still no suitable, more interpretable tree ensemble alternative to random forests amongst the studied methods. Such settings still benefit from ensembles with numerous highly complex trees.
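A minimal R sketch of the random forests benchmark with variable importance recovery, assuming the randomForest package; the simulated signal variables are illustrative, not the MRI data.

    # Random forests benchmark: predictive performance plus permutation
    # variable importance as a (limited) window into the black box.
    library(randomForest)

    set.seed(1)
    X <- data.frame(matrix(rnorm(300 * 10), ncol = 10))
    y <- factor(rbinom(300, 1, plogis(2 * X$X1 - 2 * X$X2)))  # only X1, X2 matter

    fit <- randomForest(X, y, importance = TRUE)
    fit$err.rate[fit$ntree, "OOB"]            # out-of-bag error estimate
    head(importance(fit, type = 1))           # mean decrease in accuracy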
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
The Kalman filter has numerous applications in spatial-temporal prediction. A common application is the guidance, navigation, and control of vehicles, particularly aircraft and spacecraft [1]. In this thesis, we focus on one typical spatial-temporal data type: discrete time and discrete space. We consider a rectangular grid for the space domain. We make a first-order Markov assumption in both time and space to reduce complexity. In addition, several input control features are introduced into the Kalman filter. In other words, the distribution of future states depends only on the current states and input control features in their own area and their neighboring areas. Under this Markov assumption, it is natural for the transition matrix in the Kalman filter to be sparse, with the constrained sparsity structure encoding the spatial correlation among the areas. We derive the equations for inference in this particular spatial system, namely the Kalman filter and Kalman smoother. Using the results for the Kalman filter and Kalman smoother, we further consider the estimation of the parameters of the Kalman filter model through a modified Expectation-Maximization (EM) algorithm that estimates sparse transition matrices. This stands in contrast with the standard EM algorithm, which usually produces a dense estimate of these matrices. To respect the pre-specified spatial sparsity structure, we specify greedy EM updates that work on rows of the transition matrix. We study the properties of our new method in simulations and apply the method to a real data set on aviation safety, where the goal is to predict which areas at Schiphol airport are at risk of having a large density of birds in the near future.
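For reference, the standard Kalman filter recursions with input controls, in the usual linear-Gaussian state-space notation (state x_t, control u_t, observation y_t; A the transition matrix, B the control matrix, H the observation matrix, Q and R the state and observation noise covariances):

    Predict:  x_{t|t-1} = A x_{t-1|t-1} + B u_t
              P_{t|t-1} = A P_{t-1|t-1} A' + Q

    Update:   K_t     = P_{t|t-1} H' (H P_{t|t-1} H' + R)^{-1}
              x_{t|t} = x_{t|t-1} + K_t (y_t - H x_{t|t-1})
              P_{t|t} = (I - K_t H) P_{t|t-1}

In the sparse setting described above, each row of A has nonzero entries only for an area and its grid neighbours, which is the structure the modified EM updates are constrained to respect.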
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
We propose a varying coefficient IRT model, in order to study the effect of a metric variable on the model and population parameters estimated by IRT models. Kernel smoothing was used to capture the variation, and cross-validation to determine the optimal parameters. The model was applied to a variety of simulated data sets in order to test its properties, and to a real-world personality data set. The tests on simulated data showed the ability to recover and visualize the variation of the coefficients and their confidence bands over time with some success. The real-world tests showed some, but limited, variation, depending on the trait studied.
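A minimal R sketch of the kernel-smoothing-with-cross-validated-bandwidth idea, applied to a generic coefficient varying over a metric variable z; this illustrates the smoothing machinery only, not the thesis's IRT estimation itself.

    # Nadaraya-Watson kernel smoother with leave-one-out CV bandwidth,
    # tracing how a coefficient varies over the metric variable z.
    nw <- function(z0, z, y, h) {
      w <- dnorm((z - z0) / h)                # Gaussian kernel weights
      sum(w * y) / sum(w)
    }
    loocv <- function(h, z, y) {
      mean(sapply(seq_along(z), function(i)
        (y[i] - nw(z[i], z[-i], y[-i], h))^2))
    }

    set.seed(1)
    z <- sort(runif(200, 0, 10))
    b <- sin(z / 2)                           # "true" varying coefficient
    y <- b + rnorm(200, sd = 0.3)             # noisy coefficient estimates

    hs   <- seq(0.2, 2, by = 0.1)
    best <- hs[which.min(sapply(hs, loocv, z = z, y = y))]
    curve_hat <- sapply(z, nw, z = z, y = y, h = best)  # smoothed coefficient curve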