Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Multivariate binary data are often collected in scientific fields such as psychology, economics and epidemiology. Worku and de Rooij (2018) proposed a marginal model for the analysis of this type of data in a distance framework: the multivariate logistic distance (MLD) model. Two different models were introduced by Worku and de Rooij: a restricted and an unrestricted MLD model. The interpretation of both models is clear, and a log-odds as well as a biplot representation can be used. In this work we proposed three extensions to the restricted model and showed the implications of the extensions for the interpretation of the corresponding biplot as well as for the log-odds. First, we showed how the model can be extended by allowing a response variable to belong to multiple dimensions. Consequently, the extended model can be used to examine dimensionality structures other than those the original model allows. Second, we allowed for non-linear relationships between the predictor variables and the response variables, thereby making the model more flexible. Finally, the dimensionality structure as well as the final predictor variables need to be selected. We showed how to use the prediction capability of a model as a selection criterion to select between competing models. This is a more versatile method to perform model selection, based on the bias-variance trade-off, than the likelihood-based criterion used in the original model. We fitted 16 variations of the model to an empirical data set to compare performance based on their prediction capability. All variations of the model can be estimated using standard statistical software for univariate models.
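Using prediction capability as a selection criterion typically amounts to comparing models on held-out data. A minimal, generic K-fold cross-validation sketch is shown below; the function names and the toy mean-only predictor are illustrative assumptions, not the thesis's actual MLD fitting code:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k roughly equal held-out folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cv_score(fit, predict, loss, X, y, k=5):
    """Estimate out-of-sample loss: fit on k-1 folds, score the held-out fold."""
    scores = []
    for test in kfold_indices(len(y), k):
        train = np.setdiff1d(np.arange(len(y)), test)
        model = fit(X[train], y[train])
        scores.append(loss(y[test], predict(model, X[test])))
    return float(np.mean(scores))

# toy use: the "model" is just the training mean, with squared-error loss
X = np.arange(20.0).reshape(-1, 1)
y = np.arange(20.0)
score = cv_score(lambda X_, y_: y_.mean(),
                 lambda m, X_: np.full(len(X_), m),
                 lambda yt, yp: ((yt - yp) ** 2).mean(), X, y)
print(score > 0)  # True
```

Competing models (here, competing dimensionality structures or predictor sets) would be compared on `cv_score`, with the lowest estimated prediction loss selected.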
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
The brain's Default Mode Network (DMN) has attracted considerable interest in neuroscience over the past decade. The DMN is active even when a person is resting and their mind is not task oriented [1]. It is mentioned in the literature [2, 3] that disruptions within the DMN often occur in patients with disorders such as Parkinson's disease (PD), Alzheimer's disease (AD) and epilepsy. In this thesis we aim to build a classification model that predicts whether a new subject is an Alzheimer's patient or not. This model is created based on the DMN profiles of 250 subjects. To this purpose, we employ the δ-machine classification approach of Yuan, Heiser and De Rooij [4], which uses the distances between DMN profiles as the predictor matrix in a lasso logistic regression model. It is essential to define a distance measure that best fits the DMN univariate time series data, that is, a measure which can faithfully represent the distances irrespective of possible distortion of the data in time. With this in mind, five distance measures were investigated, all designed for time series and implemented in the up-to-date R packages TSdist and TSclust. The final goal is twofold: on the one hand, building a classification model using the δ-machine approach, based on the DMN activity profiles of 250 subjects, and on the other hand, uncovering which distance measure is the most suitable within the δ-machine approach.
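The core idea of the δ-machine, distances as predictors, can be sketched as follows. This is a minimal illustration with plain Euclidean distance (the thesis investigates time-series-specific measures from TSdist/TSclust instead), and the function name is our own:

```python
import numpy as np

def distance_features(X, exemplars):
    """Build the delta-machine predictor matrix: entry (i, j) is the
    distance from profile i to exemplar j. Rows of X are subjects'
    time-series profiles; exemplars are reference profiles."""
    D = np.empty((X.shape[0], exemplars.shape[0]))
    for j, e in enumerate(exemplars):
        D[:, j] = np.sqrt(((X - e) ** 2).sum(axis=1))
    return D

# toy example: 4 subjects with profiles of length 5, first 3 as exemplars
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))
D = distance_features(X, X[:3])
print(D.shape)  # (4, 3)
```

The matrix `D` would then serve as the design matrix of a lasso logistic regression, which selects a sparse set of exemplars.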
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Reliable forecasting of infectious disease epidemics is critical for decision-making regarding the allocation of public health resources. Until now, efforts have mostly been devoted to understanding disease transmission rather than forecasting. The present thesis took a Bayesian approach to forecasting epidemics, focusing on modelling the time between successive observed infection events using a Gamma generalised linear model (GLM). Specifically, a tailored sampler was introduced to solve common convergence problems associated with the use of the inverse link function. The posterior distribution obtained from the Bayesian Gamma GLM was then used to forecast stochastically. This approach was extended to diseases with different transmission dynamics, as described by traditional compartmental epidemiological models: susceptible-infectious (SI), susceptible-infectious-susceptible (SIS) and susceptible-infectious-recovered (SIR). The calibration of the forecasting technique was evaluated using probability integral transform (PIT) histograms in a large simulation study. Results showed that forecasts of SI- and SIS-type epidemics in an early growth phase underestimated true future values. Across epidemic types, there was evidence of overdispersion in the forecasts. Furthermore, the method was applied to data from the meningococcal serogroup W disease outbreak in the Netherlands between 2012 and 2018. Forecasts suggested the outbreak had reached an equilibrium of approximately 50-55 new observed cases per 6 months. Avenues for future research are provided, with a focus on how the Bayesian approach could be improved and adapted to account for covariates of interest.
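The PIT calibration check can be sketched directly: applying each predictive CDF to its realized observation should yield values that are uniform on [0, 1], so a flat PIT histogram indicates a calibrated forecast. This is a minimal continuous-forecast illustration; for discrete case counts a randomized or non-randomized discrete PIT variant would be needed:

```python
from math import erf, sqrt
import numpy as np

def pit_values(y_obs, cdf):
    """Probability integral transform: if forecasts are calibrated,
    u_i = F_i(y_i) is uniform on [0, 1]."""
    return np.array([cdf(y) for y in y_obs])

# toy check: observations truly drawn from the forecast distribution N(0, 1)
rng = np.random.default_rng(1)
y = rng.normal(loc=0.0, scale=1.0, size=2000)
std_normal_cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
u = pit_values(y, std_normal_cdf)
hist, _ = np.histogram(u, bins=10, range=(0, 1))
print(hist)  # roughly flat, about 200 per bin, for a calibrated forecast
```

A U-shaped histogram would indicate underdispersed forecasts, a hump-shaped one overdispersed forecasts, and a skewed one biased forecasts, which is how the thesis's underestimation and overdispersion findings are read off.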
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
The Inverse Probability of Treatment Weighted (IPTW) estimators can be used to correctly estimate the parameters of marginal structural models (MSMs) for causal effects using observational data and a number of assumptions. In this thesis we focus on the positivity assumption, which holds when there is a positive probability of receiving every level of an exposure variable for every combination of values defined by the observed confounders in the analysis. When the positivity assumption is violated, the resulting IPTW estimators may become very unstable and exhibit high variability. However, the severity with which this impacts the IPTW estimators under different conditions is not widely known or understood. In particular, to our knowledge, no existing study has investigated violations of the positivity assumption for survival analysis, or in a time-dependent context more generally. This is surprising because MSMs are often applied in practice precisely because they adjust for time-dependent confounding. A novelty of this thesis is to investigate the effect of positivity violations on the performance of the IPTW estimator in a survival context in which time-dependent confounding is present. We approach the problem in a simulation setting. One reason why the effects of positivity violations in a survival context have not been systematically studied is that existing algorithms for generating suitable data are intensive and challenging to implement. An added value of this thesis is to cast some light on this process in the hope that it will encourage other researchers to broach the subject in the future. We implement an existing algorithm in R and then extend that algorithm to incorporate violations of the positivity assumption that are propagated through time.
A simulation study was carried out using the extended algorithm. We investigate how the IPTW estimators respond as strict violations of the positivity assumption become increasingly severe. As part of this study we examine the finite sample properties of the estimator and how it behaves for varied lengths of follow-up time. We also consider the case where the positivity assumption is not strictly violated but some exposure levels are rare within certain levels of the confounder. Our results indicate that even relatively benign violations of the positivity assumption can be a problem in the time-dependent context. We also find that, contrary to expectations, positivity violations are worse for studies of shorter duration. More optimistically, near violations of the positivity assumption do not appear to be serious under realistic circumstances.
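The instability the thesis studies can be seen in the weights themselves. Below is a minimal point-treatment sketch (the thesis works with time-dependent weights, which multiply such terms over visits); the function name is our own and the propensity scores are assumed given:

```python
import numpy as np

def iptw_weights(treated, propensity, stabilized=True):
    """Inverse probability of treatment weights.

    treated: 0/1 exposure indicator; propensity: P(A=1 | confounders).
    Small propensities for treated units (near positivity violations)
    blow the weights up, which is the instability studied here."""
    p_received = np.where(treated == 1, propensity, 1 - propensity)
    w = 1.0 / p_received
    if stabilized:
        marginal = treated.mean()
        w *= np.where(treated == 1, marginal, 1 - marginal)
    return w

a = np.array([1, 1, 0, 0])
ps = np.array([0.8, 0.05, 0.5, 0.9])
print(iptw_weights(a, ps, stabilized=False))  # [ 1.25 20.    2.   10.  ]
```

Note how the treated subject with propensity 0.05 receives weight 20 and dominates the weighted sample; stabilization shrinks all weights but does not remove this problem.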
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
National statistical institutes (NSIs) try to construct datasets that are rich in information as efficiently and cost-effectively as possible. This can be achieved by combining available data, such as administrative data or survey data. When datasets do not pertain to the same units, one can sometimes resort to statistical matching to integrate them. Statistical matching is a data fusion technique which can be used when different data sets contain different units, but share a set of common (background) variables. The main goal of statistical matching is to estimate the relationship between the non-common variables in the different datasets. This paper investigates how best to utilize a small overlap of units in a statistical matching situation where the data consist only of categorical variables. A small overlap of units contains joint information on all variables for only a limited number of units. A new statistical matching method, the combined estimator, is developed in this paper, employing an idea from small area estimation. The performance of the combined estimator was compared to several pre-existing statistical matching methods for categorical data under various data conditions. The results show that, even though the combined estimator itself does not perform better than the pre-existing statistical matching method (the EM algorithm), using the combined estimator as the starting point of the EM algorithm helps increase its accuracy under certain data circumstances. The improvement in accuracy was observed in cases where the number of matching variables was large.
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Associations of neurological and psychological conditions with changes in coactivation patterns of brain regions in the ‘resting state’ are of recent interest in neuroscience. To uncover such latent functional connectivity, series of functional Magnetic Resonance Imaging (fMRI) scans are typically reduced by averaging activations within brain atlas regions. The averaged activations are further reduced to pairwise correlations in sliding fixed-width time windows. Unfortunately, such a reduction in dimensions also reduces the scan resolution and complicates interpretation. Changing to a text mining perspective, this thesis interprets the high-dimensional scans as documents with categorical words drawn from a study bag. Consecutive scans measure the activation in V discrete voxels of brain volumes. The activation series in each voxel is segmented into stationary subsequences. Similar correlated segments within voxels and from distinct voxels are then bagged as words. The words capture correlated activation both within and between voxels. Instead of being predefined in an atlas, regions emerge as neighbourhoods of voxels drawing the same word at the original scan resolution. The word counts that document voxels draw from the bag of categorical words define the document state. Document state transition probabilities measure the dynamics in coactivated brain locations at the original fMRI resolution, as a possible marker for a neurological condition. This alternative fMRI activation reduction method avoids the a priori selection of regions, the tuning of fixed time window widths, and the selection of the number of principal components required by the existing method it is contrasted with; the alternative method allows a more direct interpretation of activations.
However, the direct state-switching interpretation of scan document voxels drawing categorical word counts does not sufficiently separate subject groups for reliable classification of neurological conditions.
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
The Restricted Mean Survival Time (RMST) is a statistic for measuring treatment effects that can be used as a replacement for the hazard ratio when the proportional hazards assumption is violated. The idea of the RMST goes back to Irwin (1949) [5]; combined with the formal definition of the survival function, the RMST can be defined as the integral of the survival function up to a time limit τ. Several different methods for estimating the RMST are available. The Kaplan-Meier method and the Cox PH model are the most commonly used methods in survival analysis, and they are also suitable for estimating the RMST: the survival curve is first estimated, and the area under it is then calculated to give an estimate of the RMST. To allow a more general family of survival time distributions, a flexible parametric model was introduced by Royston and Parmar (2002) [4]. This flexible parametric model follows the same route to the RMST as the Kaplan-Meier method and the Cox PH model: a survival function is estimated from the model, and a 15-point Gauss-Kronrod quadrature is then used to calculate the integral of the survival function, which yields an estimate of the RMST. The final option is a pseudo-observation method proposed by Andersen et al. (2004) [3]. This method first builds a pseudo-observation of the RMST for each subject. Then, using the pseudo-observations as outcome variables, a generalized linear model can be built to describe the relationship between the covariates and the RMST. A generalized estimating equation (GEE) method can then be used to estimate the parameters of the generalized linear model [8]. Comparisons between these methods under various simulation scenarios were conducted for this thesis.
The Kaplan-Meier method is simple to calculate and performs well with early time limits and low censoring proportions. It is also faster at estimating the RMST than the Cox model and the flexible parametric model. However, this method cannot be adjusted for covariates, so it is only suitable for estimating the average RMST difference for a population. The unstratified Cox model performed well in datasets that satisfied the proportional hazards assumption. The stratified Cox model also performed well in our simulated non-proportional hazards datasets. The performance of the flexible parametric model was similar to that of the Cox model, but it is more time-consuming in the integral calculation step. The pseudo-observation method offered the shortest computation time among all four methods. However, when estimating the RMST difference for a subject with given age and gender, its performance was worse than either the Cox model or the flexible parametric model.
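The Kaplan-Meier route to the RMST can be sketched directly: estimate the step-function survival curve and accumulate the area under it up to τ. This minimal version is our own illustration and assumes distinct event times and no covariates:

```python
import numpy as np

def km_rmst(times, events, tau):
    """Restricted mean survival time: area under the Kaplan-Meier
    curve up to tau. events: 1 = observed death, 0 = censored."""
    order = np.argsort(times)
    t, d = np.asarray(times)[order], np.asarray(events)[order]
    s, rmst, prev_t = 1.0, 0.0, 0.0
    n_at_risk = len(t)
    for ti, di in zip(t, d):
        if ti > tau:
            break
        rmst += s * (ti - prev_t)   # survival is a step function
        if di == 1:
            s *= 1 - 1 / n_at_risk  # Kaplan-Meier drop at an event
        n_at_risk -= 1              # censored subjects leave the risk set too
        prev_t = ti
    rmst += s * (tau - prev_t)
    return rmst

print(km_rmst([1, 2, 3, 4], [1, 1, 1, 1], tau=4))  # 2.5
```

The covariate-adjusted methods differ only in how the survival curve is obtained (Cox, flexible parametric) or in bypassing the curve entirely (pseudo-observations with GEE).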
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Liver transplantation, i.e. the replacement of a diseased liver with the healthy liver of another person, is the most effective therapeutic strategy for patients with end-stage liver disease. Predicting the survival of patients after liver transplantation is regarded as one of the most challenging areas in medicine. Hence, selecting the best prediction model is of paramount importance. Machine learning, the field of computer science in which algorithms are used to learn from and make predictions on data, has lately received increased attention in the medical field due to its contributions to medical imaging, its ability to diagnose diseases and its great potential for personalized treatment. In survival analysis, implementing machine learning is difficult due to censored data. In this thesis, random survival forests and partial logistic artificial neural networks have been applied. The Cox model has traditionally been used almost exclusively, owing to its easy implementation and straightforward interpretation. The model is, however, restricted by the proportional hazards assumption, whereas the machine learning techniques make no such assumptions. Nowadays, there is a lively discussion in the medical field about machine learning and whether it has greater potential than Cox models when it comes to complex data. Criticism of machine learning concerns unsuitable performance measures and a lack of interpretability, which is important for medical personnel. The potential of machine learning is investigated on a large dataset of 62,294 patients in the USA, with 106 prognostic factors selected from over 600: 52 donor characteristics and 54 patient characteristics. A meticulous comparison is performed between 3 proportional hazards models and the machine learning techniques.
For the artificial neural network, novel extensions to its original specification are provided, using state-of-the-art R software. A variety of measures is employed, not only from the survival field but also from the simple classification setting. Of particular interest in this project is the identification of potential post-operative risk factors. Two survival outcomes are reported: overall survival (time from operation to death) and failure-free survival (time from operation to the earlier of graft failure and death). This thesis shows that machine learning techniques can be a useful tool for both prediction and interpretation. The random survival forest shows in general better predictive performance than the Cox models. Neural networks can reach comparable performance to the Cox models and even perform better on some classification metrics. However, high instability is present due to the lack of a global performance evaluation measure in the survival setting.
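One performance measure shared by all the compared survival models is Harrell's concordance index. A minimal O(n²) sketch, ignoring tied event times, is:

```python
import numpy as np

def c_index(time, event, risk):
    """Harrell's concordance index: among comparable pairs, the fraction
    where the higher-risk subject fails earlier. Risk ties count as half."""
    num, den = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # pair comparable only if subject i has an observed event first
            if event[i] == 1 and time[i] < time[j]:
                den += 1
                if risk[i] > risk[j]:
                    num += 1
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den

t = np.array([2.0, 4.0, 6.0, 8.0])
e = np.array([1, 1, 0, 1])
r = np.array([4.0, 3.0, 2.0, 1.0])  # risk perfectly anti-ordered with time
print(c_index(t, e, r))  # 1.0
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect discrimination; because it only assesses ranking, it is a weak "global" measure, which relates to the instability issue noted above.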
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Currently, the ‘wish to move’ to another house of the Dutch population is measured through the WoON survey, conducted every three years. A more frequent way of measuring is desired to improve policy making in housing. Nowadays, people express their ‘wish to move’ on social media. In this research, it was found that certain features derived from tweet texts distinguish ‘wish to move’ tweets from others. The best logistic regression classifier developed in this research achieves an F1-score of 0.556 in identifying ‘wish to move’ tweets, indicating that it is possible to keep timely track of the ‘wish to move’ proportion of the Dutch population active on Twitter. Further, it was found that actual relocation can be identified by following ‘wish to move’ users. By engineering features through aggregating their subsequent tweets, classifiers were built to automatically determine whether a ‘wish to move’ user relocated in the follow-up period. The best logistic regression classifier can determine whether ‘wish to move’ users relocated in the two subsequent years with an F1-score of 0.701. With it, the proportion of ‘wish to move’ users who actually relocated later can be estimated.
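The reported F1-scores combine precision and recall on the positive (‘wish to move’) class; a minimal sketch of the computation is:

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# toy labels: 2 true positives, 1 false positive, 1 false negative
print(f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # 0.666...
```

Because F1 ignores true negatives, it is well suited to this task, where ‘wish to move’ tweets are a small minority of all tweets.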
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
This study deals with the introduction of a customer lifetime value for business customers, with a focus on lifetime estimations using mobile contracts that are part of larger business contracts of a large Dutch telecom provider. Customer lifetime value is the total profit or loss to a company over the whole period of transactions by a customer. Business customers are defined here as firms, or locations of large firms, that are contracted for one or more business products of the telecom provider. Customer lifetime values are calculated at the level of mobile contracts and aggregated per location afterwards. In order to calculate customer lifetime values, individual lifetime predictions and a definition of the values are needed. The lifetime predictions amount to a survival analysis that models the time from becoming contract-free until one of three possible decisions (contract renewal, product migration or contract termination) is made. Using survival estimates and semi-parametric models, the overall survival is analyzed, as well as the influence of characteristics of the locations and of the companies to which the locations belong. Then, with the R package mstate, competing risks models are applied to model the time to each decision while taking the other possible decisions into account. Additionally, the lifetime estimations that result from the competing risks models are updated, whereby the survival analysis starts several months after becoming contract-free. Results show that approximately 25% of the decisions had been made at the start of the study. The duration of mobile contracts, and ownership of a business internet product or a mobile internet product next to the mobile contract, discriminate most between the occurrence of the decisions.
Furthermore, results of the competing risks models show that the probabilities of making any decision attenuate over time. This is confirmed with a fictional product offer at both the level of the mobile contract and that of the business customer. The customer lifetime value as described here is a useful metric for the telecom provider for making customer selections and, after applying it to other business products, it could be used to discriminate between product offers.
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Discretization is a key issue in urban trajectory pattern mining tasks. By assuming that regions with different functions will probably have different densities of visiting people, we propose to segment the city map, and hence discretize trajectory data, by finding region boundaries based on strong density changes. We solve the map segmentation problem as a model selection problem, using the existing MDL-histogram approach. We also propose a heuristic algorithm so that we can apply the MDL-histogram to 2-dimensional data (longitude and latitude). Finally, we validate our approach and algorithm through simulation studies and on taxi trajectories from New York City.
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Over the past years there has been increased interest in applying machine learning (ML) techniques to medical research. With the growing availability of mixed data (clinical and genomic, for instance), ML methods, which have great potential for modelling complex data, have been increasingly applied. Few publications, however, have seen clinical application, and the trend towards ML has been criticised for a lack of attention to proper validation and to the use of appropriate performance measures for quantifying model performance. Initially, in the context of medical research, machine learning methods were mainly used for diagnosis and detection, but recent years have seen a vast increase in ML modelling for the purpose of cancer prediction and prognosis. The latter trend has given rise to various adaptations of traditional ML approaches to censored survival data. Two such approaches, Biganzoli's survival neural network and Ishwaran's random survival forest, are evaluated in this thesis. They are compared to a statistical model, the widely used Cox proportional hazards model, in an application to a clinical dataset with 7 variables measured on 2,025 osteosarcoma patients: the EURAMOS-1 clinical trial. The purpose of this thesis is twofold: 1) performing an in-depth comparison of the two ML methods and gaining insight into the potential of ML for clinical data with a limited number of predictors; 2) adding to the existing osteosarcoma literature, in which ML methods have a very limited presence. The analyses performed on the EURAMOS data are reinforced by a simulation study, which is novel in the approach it takes to ensure that the simulated data closely mimic the original.
This thesis shows that for the EURAMOS-1 osteosarcoma data the Cox proportional hazards model is suitable, and that both ML approaches have limited added benefit. Appropriate performance measures are identified for assessing neural network and random survival forest performance. For the survival neural network, a modification to an existing measure is proposed to aid in identifying network instability, a known neural network pitfall. For the random survival forest it is shown that, while suitable for distinguishing high- and low-risk patients, it yields unreliable individual survival predictions. An additional, unrelated chapter has been included in this thesis, detailing the application of a dynamic prediction model to the EURAMOS-1 osteosarcoma data.
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
This thesis describes the model class of convexified convolutional neural networks (CCNNs), a type of deep learning model introduced by Zhang, Liang & Wainwright [1]. First, the steps towards the convex relaxation are described, as well as all the steps required to implement the algorithm. To this end, the thesis describes the mathematical structure of the shallow networks, how the function class can be relaxed to the convex case, and the roles of Reproducing Kernel Hilbert Spaces, the Nyström method, and projected gradient descent on the nuclear norm ball. The main contribution of this work is the implementation and its application to a new data set. The problems considered are a simulation study and an implementation on a text classification problem. The results of the CCNN implementation show that CCNNs can be successfully applied to text data through the use of vectorized word representations. Advantages and drawbacks compared to more mainstream approaches are discussed.
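One building block named above, projected gradient descent on the nuclear norm ball, relies on projecting a matrix onto {X : ||X||_* ≤ r}, which reduces to a Euclidean projection of the singular values onto the l1-ball. A minimal numpy sketch (our own function name; the inner step follows the standard sort-based simplex-projection algorithm):

```python
import numpy as np

def project_nuclear_ball(A, radius):
    """Project matrix A onto {X : nuclear norm <= radius}: project the
    singular values onto the l1-ball of that radius and reassemble."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    if s.sum() <= radius:
        return A.copy()
    # sort-based l1-ball projection (singular values are already >= 0)
    u = np.sort(s)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - radius) / np.arange(1, len(u) + 1) > 0)[0][-1]
    theta = (css[rho] - radius) / (rho + 1)
    s_proj = np.maximum(s - theta, 0.0)
    return U @ np.diag(s_proj) @ Vt

A = np.diag([3.0, 1.0])
P = project_nuclear_ball(A, 2.0)
print(np.round(P, 6))  # singular values shrink from (3, 1) to (2, 0)
```

In the CCNN algorithm this projection is applied after each gradient step, so the learned parameter matrix stays low-rank, which is what makes the relaxed problem a convex surrogate for the original network.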
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Functional connectivity (FC) is an important metric for characterizing brain mechanisms. The assessment of resting-state FC is a popular tool for studying brain disease mechanisms. Correlations between functional magnetic resonance imaging (fMRI) blood-oxygenation-level-dependent (BOLD) time courses in different brain regions can measure FC, which has revealed a meaningful organization of spontaneous fluctuations in the brain during rest. Therefore, in most studies, the temporal and spatial dynamics of FC are measured by the correlation coefficients between the fMRI signals of several brain regions. However, recent research has shown that FC is not stationary: FC changes dynamically over time, reflecting additional rich information about brain organization. In 2013, Leonardi et al. proposed a new approach, based on principal component analysis (PCA), to reveal hidden patterns of coherent FC dynamics across multiple subjects. This thesis evaluates this new approach in a simulation study, and a framework to test the approach is also proposed. The simulation study showed advantages and disadvantages of the new approach. The results showed that it can extract the most important dynamic connectivity features underlying fMRI data and can effectively retrieve time-varying connectivity between brain regions during rest. The new approach identified connections with similar fluctuations and gave an efficient linear representation, but it is sensitive only to linear relations between connectivity pairs, and it yielded robust results only under restricted conditions.
Finally, some recommendations are provided for researchers using this method to study dynamic functional brain connectivity at rest.
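The pipeline the evaluated approach builds on, sliding-window correlations followed by PCA across windows (the "eigenconnectivities"), can be sketched as follows. This is a minimal single-subject illustration with our own function name; Leonardi et al. concatenate the window-by-pair matrices across subjects before the PCA:

```python
import numpy as np

def dynamic_connectivity(ts, width, step=1):
    """Sliding-window correlations: ts is (T, R) with R regions.
    Returns one vectorized upper-triangle correlation matrix per window."""
    T, R = ts.shape
    iu = np.triu_indices(R, k=1)          # unique region pairs
    rows = []
    for start in range(0, T - width + 1, step):
        C = np.corrcoef(ts[start:start + width].T)
        rows.append(C[iu])
    return np.array(rows)

rng = np.random.default_rng(2)
ts = rng.normal(size=(100, 5))            # 100 time points, 5 regions
W = dynamic_connectivity(ts, width=30, step=5)
# PCA of the demeaned window-by-pair matrix: rows of `components`
# are the eigenconnectivity patterns, ordered by explained variance
Wc = W - W.mean(axis=0)
_, svals, components = np.linalg.svd(Wc, full_matrices=False)
print(W.shape)  # (15, 10): 15 windows, 10 region pairs
```

The leading components describe connectivity patterns whose expression waxes and wanes over windows, which is the dynamic structure the thesis's simulation study probes.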