Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
The area under the receiver operating characteristic (ROC) curve (AUC) is a commonly used measure of the discriminative ability of a model. For the time-to-event variable in survival analysis, the case and control sets vary over time, so a dynamic definition of the AUC is required. We choose the dynamic AUC defined by the incident true positive rate and the dynamic false positive rate (I/D AUC), proposed by Heagerty and Zheng [6]. However, the difficulty of empirically obtaining the incident true positive rate hampers the estimation of the dynamic AUC, and several semi-parametric and non-parametric estimators have therefore been proposed. Heagerty and Zheng [6] proposed a semi-parametric estimation method based on the Cox model. A non-parametric estimator using an intermediate concordance measure with LOWESS smoothing was introduced by van Houwelingen and Putter [14]. Based on the same intermediate concordance measure, Saha-Chaudhuri and Heagerty suggested locally weighted mean rank smoothing [10]. Recently, Shen et al. proposed a semi-parametric method that adopts fractional polynomials to fit the dynamic AUC [12]. In this thesis, we compare the performance of these methods under different configurations in a series of simulations. The plain Cox method is not recommended when the proportional hazards assumption is not satisfied. The Cox model with time-varying coefficients is relatively stable when the marker has a mediocre effect. For the non-parametric methods, too wide a span/bandwidth may lead to large bias, and too narrow a span/bandwidth may lead to unstable estimates; a trade-off between bias and standard deviation therefore has to be made. For the fractional polynomial method, adding extra fractional polynomial terms does not improve performance. In addition, many researchers have observed a decreasing trend of the I/D AUC over time in their empirical studies [6, 10, 12], yet Pepe et al. held the opinion that the I/D AUC may be an increasing function of time [7]. We investigate the trend of the I/D AUC under a Cox model with a binary marker. We observe, however, that under certain Cox models the I/D AUC curve first increases and then decreases; the I/D AUC is therefore not necessarily a decreasing function of time.
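For reference, the incident/dynamic quantities compared here are the standard definitions of Heagerty and Zheng [6], restated below with M the marker, T the event time, and c a threshold:

```latex
% Incident true positive rate, dynamic false positive rate, and I/D AUC [6]
\mathrm{TPR}^{\mathbb{I}}_t(c) = P(M > c \mid T = t), \qquad
\mathrm{FPR}^{\mathbb{D}}_t(c) = P(M > c \mid T > t),
\qquad
\mathrm{AUC}^{\mathbb{I}/\mathbb{D}}(t) = P(M_i > M_j \mid T_i = t,\, T_j > t).
```

That is, the I/D AUC at time t is the probability that a subject experiencing the event at t has a higher marker value than a subject still event-free at t, which is the quantity all the estimators above target.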
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Game trees have been utilized as a formal representation of adversarial planning scenarios such as two-player zero-sum games like chess [1, 2]. When using stochastic leaf values based on Bernoulli trials to model noisy game trees, a challenging task is to solve the Monte Carlo Tree Search (MCTS) problem of identifying a best move under uncertainty. Confidence bound algorithms are investigated as one solution, with a focus on the FindTopWinner algorithm by Teraoka, Hatano, and Takimoto [3], which uses (a) the minimax rule to evaluate the game tree by alternately minimizing and maximizing over the values associated with each move, (b) Hoeffding's inequality to estimate sample size requirements by fixing precision and error probability, and (c) an epoch-wise pruning regime to reduce investment in suboptimal nodes. We experimented with this algorithm by equipping it with methods based on (i) Bernstein's inequality, to create a tighter confidence bound [4], (ii) the law of the iterated logarithm (LIL), to sample in single-sample steps and allow exact pruning and stopping [5, 6], and (iii) a combination of both. An empirically derived Hoeffding-based iterated-logarithm confidence bound is proposed in a fully refurbished FindTopWinner algorithm, which achieved much better performance in terms of the number of samples required to find a best move, whereas the Bernstein-based approaches did not fare better than the original by Teraoka et al. [3]. Possible reasons, such as the limited and more asymptotic advantages of Bernstein-based algorithms, are discussed, and a recommended parameter space for the empirically derived Hoeffding-based confidence bound is provided.
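As a rough illustration of the confidence-bound machinery, the sketch below shows the Hoeffding radius for Bernoulli means, the sample size obtained by fixing precision and error probability, and the generic pruning test at a max node. It is a minimal sketch, not the exact FindTopWinner bookkeeping, which is epoch-wise and tree-structured; all numeric inputs are illustrative.

```python
import math

def hoeffding_radius(n, delta):
    """Half-width of a two-sided Hoeffding confidence interval for the mean
    of n i.i.d. [0, 1]-valued (e.g., Bernoulli) samples, valid with
    probability at least 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def hoeffding_sample_size(eps, delta):
    """Samples needed so that the Hoeffding radius is at most eps."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def prune_at_max_node(stats, delta):
    """stats: list of (empirical_mean, n_samples) per child move.
    A child is pruned when its upper confidence bound falls below the
    best lower confidence bound among its siblings."""
    lcbs = [m - hoeffding_radius(n, delta) for m, n in stats]
    best_lcb = max(lcbs)
    return [(m + hoeffding_radius(n, delta)) >= best_lcb for m, n in stats]

# Example: precision 0.05 at error probability 0.01 needs 1060 samples.
print(hoeffding_sample_size(eps=0.05, delta=0.01))                      # 1060
# Three moves with noisy Bernoulli payoff estimates; the middle one is prunable.
print(prune_at_max_node([(0.7, 500), (0.5, 500), (0.65, 200)], 0.01))   # [True, False, True]
```

Bernstein-type bounds replace the fixed radius with a variance-dependent one, and LIL-type bounds make the radius valid uniformly over n, which is what enables single-sample steps with exact stopping.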
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Data is often collected in an aggregated fashion, for instance in categories, intervals, or predefined areas. To estimate the underlying, continuous distribution of an aggregated variable, the penalized composite link mixed model (PCLMM) can be used. The PCLMM only assumes that the underlying distribution is smooth, so it can be used to estimate any nonparametric regression function. The model is a combination of the generalized linear mixed model, penalized B-splines, and the composite link model. In this thesis, the mathematical framework of these three well-known techniques is described, after which the close connection between them and the PCLMM is used to give a mathematical description of the estimation technique. Using a simulation of a one-dimensional function and an example on Q-fever cases in the Netherlands in 2009, it is shown that the PCLMM can accurately estimate even the smaller details of the underlying distribution if covariate information is available on the finer scale. Decent approximations of the underlying distribution are obtained when covariate data is only available on the aggregated scale.
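A minimal sketch of the composite link idea follows, without the B-spline and mixed-model layers the thesis adds: latent fine-grid expectations exp(theta) are aggregated through a composition matrix C, and theta is estimated by a Poisson likelihood with a smoothness penalty. The data, grid sizes, and smoothing parameter are purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical example: counts observed only in 5 coarse bands of 5 fine bins each.
y = np.array([12.0, 30.0, 55.0, 41.0, 18.0])      # aggregated counts
K, J = 25, len(y)                                  # fine bins, coarse bins
C = np.kron(np.eye(J), np.ones((1, 5)))            # composition matrix: sums fine bins

D = np.diff(np.eye(K), n=2, axis=0)                # second-order difference penalty
lam = 10.0                                         # smoothing parameter (assumed)

def neg_pen_loglik(theta):
    gamma = np.exp(theta)                          # latent fine-grid expectations
    mu = C @ gamma                                 # implied coarse expectations
    # Poisson negative log-likelihood (up to a constant) plus roughness penalty
    return np.sum(mu - y * np.log(mu)) + 0.5 * lam * np.sum((D @ theta) ** 2)

theta_hat = minimize(neg_pen_loglik, x0=np.zeros(K), method="BFGS").x
print(np.exp(theta_hat))                           # smooth fine-grid estimate
```

The smoothness penalty is what makes the ill-posed disaggregation (25 unknowns, 5 observations) identifiable, which is the core assumption the PCLMM relies on.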
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Random forests is generally known as an excellent classifier that is flexible in the types of data it can be applied to. Despite this, it is also regarded as a 'black box' classifier: its ensembles comprise hundreds of complex trees. This is a major drawback for applications where insight into the variables that account for certain outcomes is essential (e.g., medical diagnosis problems for identifying diseased individuals). More recent methods, however, produce ensembles reduced in size by selecting the most important ensemble members. Some of these methods also yield ensemble members with simple structures to improve interpretability. Our selection of such methods comprises optimal trees ensemble (OTE), node harvest, and rule ensembles. These methods were assessed through a simulation study and an application to an MRI dataset on Alzheimer's disease classification, to determine their predictive performance and information recovery and thereby their suitability for interpretational purposes. Random forests was taken as the benchmark for predictive performance and the baseline for improvement of interpretation. We focussed solely on binary classification. The benchmark random forests generally had good predictive performance and was among the best in variable importance recovery. It was still the superior classifier in high-dimensional settings. OTE often had similar predictive performance and variable importance recovery; it did not, however, have any advantage over random forests regarding suitability for interpretation. Node harvest had reasonable interaction recovery and good variable split point recovery, albeit at the cost of predictive performance and variable importance recovery. Rule ensembles proved to be a suitable alternative to random forests that produces interpretable models with comparable or better accuracy, but only when the dataset has a clear signal. In noisy or high-dimensional settings, there is still no suitable, more interpretable tree ensemble alternative to random forests among the studied methods. Such settings still benefit from ensembles with numerous highly complex trees.
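A schematic sketch of the rule-ensembles idea (RuleFit-style; OTE and node harvest work differently): conjunctive rules are harvested from the paths of shallow forest trees, evaluated as 0/1 features, and an L1-penalized logistic model selects a small, readable subset. The data set and parameter choices are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0).fit(X, y)

def leaf_rules(tree):
    """Collect each root-to-leaf path as a list of (feature, '<=' or '>', threshold)."""
    t, rules = tree.tree_, []
    def walk(node, conds):
        if t.children_left[node] == -1:            # leaf reached
            rules.append(conds)
            return
        f, thr = t.feature[node], t.threshold[node]
        walk(t.children_left[node],  conds + [(f, "<=", thr)])
        walk(t.children_right[node], conds + [(f, ">",  thr)])
    walk(0, [])
    return rules

rules = [r for est in forest.estimators_ for r in leaf_rules(est) if r]

def apply_rule(rule, X):
    """0/1 indicator of a conjunctive rule on each row of X."""
    out = np.ones(len(X))
    for f, op, thr in rule:
        out *= (X[:, f] <= thr) if op == "<=" else (X[:, f] > thr)
    return out

R = np.column_stack([apply_rule(r, X) for r in rules])   # rule feature matrix
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(R, y)
kept = np.flatnonzero(lasso.coef_[0])
print(f"{len(rules)} candidate rules, {len(kept)} selected")
```

Each selected rule reads as a short if-then statement on a few variables, which is what makes this family of methods attractive when interpretation matters.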
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
The Kalman filter has numerous applications in spatial-temporal prediction. A common application is guidance, navigation, and control of vehicles, particularly aircraft and spacecraft [1]. In this thesis, we focus on one typical spatial-temporal data type with discrete time and discrete space. We consider a rectangular grid for the space domain. We make a first-order Markov assumption in both time and space to reduce complexity. In addition, several input control features are introduced into the Kalman filter. In other words, the distribution of future states depends only on the current states and input control features in their own area and their neighboring areas. Under our Markov assumption, it is natural for the transition matrix in the Kalman filter to be sparse for spatial-temporal data; sparse transition matrices with a constrained structure are designed to reflect the spatial correlation among the areas. We derive the equations for inference in this particular spatial system, namely the Kalman filter and Kalman smoother. Using these results, we further consider the determination of the parameters of the Kalman filter model through a modified Expectation-Maximization (EM) algorithm that estimates sparse transition matrices. This stands in contrast to the standard EM algorithm, which usually produces dense estimates of the matrices. To respect the pre-specified spatial sparsity structure, we specify greedy EM updates that work on the rows of the transition matrix. We study the properties of our new method in simulations and apply it to a real data set on aviation safety, where the goal is to predict which areas at Schiphol airport are at risk of having a large density of birds in the near future.
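For concreteness, one predict/update cycle of the Kalman filter with a control input is sketched below in its generic textbook form; the thesis' grid-structured sparse transition matrix and greedy EM updates are not reproduced here, and the toy matrices are illustrative only.

```python
import numpy as np

def kalman_step(x, P, u, y, A, B, H, Q, R):
    """One predict/update cycle of the Kalman filter.
    x, P : posterior state mean and covariance at time t-1
    u, y : control input and observation at time t
    A, B : transition and control matrices (sparse in the thesis' setting)
    H    : observation matrix;  Q, R : process / observation noise covariances
    """
    # Predict: propagate the state through the (possibly sparse) dynamics
    x_pred = A @ x + B @ u
    P_pred = A @ P @ A.T + Q
    # Update: correct the prediction with the new observation
    S = H @ P_pred @ H.T + R                   # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)        # Kalman gain
    x_new = x_pred + K @ (y - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Toy 2-area example with nearest-neighbor coupling in A (illustrative values):
A = np.array([[0.8, 0.1], [0.1, 0.8]])
x, P = np.zeros(2), np.eye(2)
x, P = kalman_step(x, P, u=np.array([1.0, 0.0]), y=np.array([0.5, 0.2]),
                   A=A, B=np.eye(2), H=np.eye(2), Q=0.1 * np.eye(2), R=0.2 * np.eye(2))
```

Under the thesis' first-order spatial Markov assumption, row i of A has nonzero entries only for area i and its grid neighbors, which is what the constrained EM updates preserve.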
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
We propose a varying coefficient IRT model to study the effect of a metric variable on model and population parameters estimated by IRT models. Kernel smoothing was used to capture the variation, and cross-validation to determine optimal smoothing parameters. The model was applied to a variety of simulated data sets to test its properties, and to a real-world personality data set. The tests on simulated data showed that the variation of the coefficients and their confidence bands over time can be recovered and visualized with some success. The real-world tests showed some, but limited, variation, depending on the trait studied.
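The kernel-smoothing step can be illustrated schematically (this is not the thesis' full varying-coefficient IRT model): at each focal value z0 of the metric covariate, respondents are weighted by a Gaussian kernel and an item parameter is re-estimated from the weighted data, tracing out a smooth coefficient curve. All names, data, and the bandwidth are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
z = rng.uniform(0, 1, n)                  # metric covariate (e.g., age)
theta = rng.normal(0, 1, n)               # ability, assumed known for illustration
b = 0.5 + np.sin(2 * np.pi * z)           # true z-varying item difficulty
resp = rng.binomial(1, 1 / (1 + np.exp(-(theta - b))))   # Rasch-type responses

def local_difficulty(z0, h=0.1):
    """Kernel-weighted logistic fit of the response on theta around z0;
    the negated intercept acts as a local difficulty estimate."""
    w = np.exp(-0.5 * ((z - z0) / h) ** 2)                # Gaussian kernel weights
    fit = LogisticRegression(C=1e6).fit(theta.reshape(-1, 1), resp, sample_weight=w)
    return -fit.intercept_[0]

grid = np.linspace(0.05, 0.95, 19)
print(np.round([local_difficulty(z0) for z0 in grid], 2))  # roughly tracks b(z)
```

In this setting the bandwidth h plays the role tuned by cross-validation in the thesis: wider kernels give smoother but more biased coefficient curves.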