Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
The Elo rating system has been used in various sports and games, such as chess, soccer, tennis, and even video games, to calculate the relative playing strengths of players and teams. The Elo system was originally invented by a Hungarian physics professor, Arpad Elo, to improve the chess rating system. Many rating systems now used in sports are based on the Elo rating system with modifications. The objective of this thesis project is to examine the Elo rating system for soccer tournaments and how it can be applied to the 2017 UEFA Women’s Championship (UEFA Women’s Euro 2017 for short). More specifically, this project has two primary interests. The first is to determine the strength of each team by assigning an Elo rating to each competing team after the tournament. In addition, it is interesting to see how home-field advantage helped the Netherlands (the host country) win UEFA Women’s Euro 2017, by incorporating home-field advantage into the Elo formula. Secondly, the strengths of the players of all teams are also of interest. To estimate these, each player is assigned a rating (not an Elo rating) representing that player’s strength, so that players can be compared across all teams. To assess the reliability of our ideas and methodology, a simulation study follows the theoretical part of the research. In Chapter 1 I first describe the basic concepts of the Elo rating system. Then a short summary of the relevant literature is presented. Finally I discuss the source of the data, the arrangement of the tournament, and the steps taken to work through the algorithm and methodology.
In Chapter 2 the basic Elo formula and some modified Elo models are proposed, which allow us to determine the most appropriate model for estimating the strengths of every competing country and of the players of all teams. At the end of this chapter, I develop an ordered probit regression model for forecasting match results in UEFA Women’s Euro 2017. Chapter 3 presents a simulation study for estimating the strengths of all participating countries and of the football players of all teams. Chapter 4 presents the main conclusions drawn from the model computations and suggests directions for further research.
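As background, the basic Elo update the abstract refers to can be sketched as follows. This is a minimal illustration of the standard formula, not the thesis's own model; the `home_adv` term (an assumed additive rating bonus for the home team) is one common way to incorporate home-field advantage, and the `k` factor is a conventional default.

```python
def expected_score(r_a, r_b, home_adv=0.0):
    # Logistic expected score for player/team A against B on the usual
    # 400-point Elo scale; home_adv is an assumed additive bonus for A.
    return 1.0 / (1.0 + 10 ** (-((r_a + home_adv) - r_b) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0, home_adv=0.0):
    # score_a is 1 for a win by A, 0.5 for a draw, 0 for a loss.
    e_a = expected_score(r_a, r_b, home_adv)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new
```

With equal ratings and no home advantage the expected score is 0.5, so a win moves each rating by `k/2` in opposite directions.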
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Game trees have been utilized as a formal representation of adversarial planning scenarios, such as two-player zero-sum games like chess [1, 2]. When stochastic leaf values based on Bernoulli trials are used to model noisy game trees, a challenging task is to solve the Monte Carlo Tree Search (MCTS) problem of identifying a best move under uncertainty. Confidence bound algorithms are investigated as one solution, with a focus on the FindTopWinner algorithm by Teraoka, Hatano, and Takimoto [3], which uses (a) the minimax rule to evaluate the game tree by alternately minimizing and maximizing over the values associated with each move, (b) Hoeffding’s Inequality to estimate sample size requirements by fixing precision and error probability, and (c) an epoch-wise pruning regime to reduce investment in suboptimal nodes. We experimented with this algorithm by equipping it with methods based on (i) Bernstein’s Inequality, to create a tighter confidence bound [4], (ii) the Law of the Iterated Logarithm (LIL), to sample in single-sample steps and allow exact pruning and stopping [5, 6], and (iii) a combination of both. An empirically derived Hoeffding-based Iterated-Logarithm confidence bound is proposed in a fully refurbished FindTopWinner algorithm, which achieved much better performance in terms of the number of samples required to find a best move, whereas the Bernstein-based approaches did not fare better than the original by Teraoka et al. [3]. Possible reasons, such as the limited, more asymptotic advantages of Bernstein-based algorithms, are discussed, and a recommended parameter space for the empirically derived Hoeffding-based confidence bound is provided.
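The Hoeffding-based sample-size step in (b) can be illustrated directly. For i.i.d. samples bounded in [0, 1] (such as Bernoulli leaf values), Hoeffding's inequality gives P(|X̄ − μ| ≥ ε) ≤ 2·exp(−2nε²); fixing precision ε and error probability δ and solving for n yields the sketch below. This is the generic bound only, not the epoch schedule of FindTopWinner itself.

```python
import math

def hoeffding_sample_size(eps, delta):
    # Smallest n such that 2 * exp(-2 * n * eps**2) <= delta,
    # i.e. the mean of n samples in [0, 1] is within eps of the true
    # value with probability at least 1 - delta.
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
```

For example, estimating a Bernoulli win probability to within ε = 0.1 with δ = 0.05 requires 185 samples under this bound.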
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Data are often collected in an aggregated fashion, for instance as categories, in intervals, or in predefined areas. To estimate the underlying continuous distribution of an aggregated variable, the penalized composite link mixed model (PCLMM) can be used. The PCLMM only assumes that the underlying distribution is smooth, so it can be used to estimate any nonparametric regression function. The model is a combination of the generalized linear mixed model, penalized B-splines, and the composite link model. In this thesis, the mathematical framework of these three well-known techniques is described, after which the close connection between them and the PCLMM is used to give a mathematical description of the estimation technique. Using a simulation of a one-dimensional function and an example on Q-fever cases in the Netherlands in 2009, it is shown that the PCLMM can accurately estimate even the smaller details of the underlying distribution if covariate information is available on the finer scale. Decent approximations of the underlying distribution are obtained when covariate data are only available on the aggregated scale.
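The composite link idea at the core of the PCLMM can be shown in a few lines: the latent fine-scale expectations are never observed directly, only their aggregates, which are a known linear composition of them. The matrix and values below are a hypothetical toy example, not data from the thesis.

```python
import numpy as np

# Latent fine-scale expected counts mu (assumed smooth in the PCLMM);
# only the coarse aggregates gamma = C @ mu are observed.
mu = np.array([1.0, 2.0, 4.0, 2.0, 1.0])

# Composition matrix C: each row sums the fine cells of one coarse bin.
C = np.array([[1, 1, 0, 0, 0],   # coarse bin 1 = fine cells 1-2
              [0, 0, 1, 1, 0],   # coarse bin 2 = fine cells 3-4
              [0, 0, 0, 0, 1]],  # coarse bin 3 = fine cell 5
             dtype=float)

gamma = C @ mu  # expected aggregated counts: [3., 6., 1.]
```

Estimation then runs in the opposite direction: given observed coarse counts, the PCLMM recovers a smooth `mu` by modelling it with penalized B-splines inside a generalized linear mixed model.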
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Random forests is generally known as an excellent classifier that is flexible in the types of data it can be applied to. Despite this, it is also regarded as a ‘black box’ classifier: its ensembles comprise hundreds of complex tree members. This is a major drawback for applications where insight into the involvement of the variables that account for certain outcomes is essential (e.g., medical diagnosis problems for identifying diseased individuals). There are, however, more recent methods that produce ensembles reduced in size by selecting the most important ensemble members. Some of these methods also yield ensemble members with simple structures to increase the possibilities for interpretation. Our selection of such methods comprises optimal trees ensemble (OTE), node harvest, and rule ensembles. These methods were assessed through a simulation study and an application to an MRI dataset on Alzheimer’s disease classification, to determine predictive performance and information recovery and thereby estimate their suitability for interpretational purposes. Random forests was taken as the benchmark for predictive performance and the baseline for improvement of interpretation. We focused solely on binary classification. The benchmark random forests had generally good predictive performance and was among the best in variable importance recovery. It was still the superior classifier in high-dimensional settings. OTE often had similar predictive performance and variable importance recovery, but it had no advantage over random forests regarding suitability for interpretation. Node harvest had reasonable interaction recovery and good variable split point recovery, albeit at the cost of predictive performance and variable importance recovery.
Rule ensembles proved to be a suitable alternative to random forests that produces models suitable for interpretation with comparable or better accuracy, but only when the dataset has a clear signal. In noisy or high-dimensional settings, there is still no suitable, more interpretable tree ensemble alternative to random forests among the studied methods. Such settings still benefit from ensembles with numerous highly complex trees.
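The benchmark role of random forests for variable importance recovery can be sketched with scikit-learn on synthetic binary data. This is a generic illustration, not the thesis's simulation design: the data generator, feature counts, and seeds below are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification task: 2 informative features among 10
# (with shuffle=False the informative features are columns 0 and 1).
X, y = make_classification(n_samples=300, n_features=10, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances: "recovery" here means the informative
# features receive the largest importance scores.
importances = rf.feature_importances_
top_feature = int(np.argmax(importances))
```

The simplified methods studied (OTE, node harvest, rule ensembles) are then judged on whether they can match this kind of recovery with far fewer, simpler ensemble members.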
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
The Kalman filter has numerous applications in spatial-temporal prediction. A common application is the guidance, navigation, and control of vehicles, particularly aircraft and spacecraft [1]. In this thesis, we focus on one typical spatial-temporal data type: discrete time and discrete space. We consider a rectangular grid for the space domain. We make a first-order Markov assumption in both time and space to reduce complexity. In addition, several input control features are introduced into the Kalman filter. In other words, the distribution of future states depends only on the current states and the input control features in their own area and in neighboring areas. Under our Markov assumption, it is natural for the transition matrix in the Kalman filter to be sparse for spatial-temporal data; sparse transition matrices with a constrained structure are designed to capture the spatial correlation among the areas. We derive the equations for inference in this particular spatial system, namely the Kalman filter and Kalman smoother. Using these results, we further consider the determination of the parameters of the Kalman filter model through a modified Expectation-Maximization (EM) algorithm that estimates sparse transition matrices. This stands in contrast with the standard EM algorithm, which usually produces a dense estimate of these matrices. To respect the pre-specified spatial sparsity structure, we specify greedy EM updates that work on the rows of the transition matrix.
We study the properties of our new method in simulations and apply the method to a real data set on aviation safety, where the goal is to predict which areas at Schiphol airport are at risk of having a large density of birds in the near future.
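One predict-update step of the linear-Gaussian Kalman filter with a control input, as used above, can be sketched generically. The spatial structure enters only through the (possibly sparse) transition matrix `A`; the matrix names follow common textbook notation and are not taken from the thesis.

```python
import numpy as np

def kalman_step(x, P, z, A, B, u, H, Q, R):
    # Predict: propagate state mean and covariance through the transition
    # matrix A (sparse under the spatial Markov assumption) plus control B @ u.
    x_pred = A @ x + B @ u
    P_pred = A @ P @ A.T + Q
    # Update: fold in the measurement z via the Kalman gain K.
    S = H @ P_pred @ H.T + R            # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S) # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

The Kalman smoother then runs a backward pass over these filtered quantities, and the E-step of the EM algorithm uses both to form the sufficient statistics from which `A` is re-estimated row by row.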
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
We propose a varying coefficient IRT model in order to study the effect of a metric variable on the model and population parameters estimated by IRT models. Kernel smoothing was used to capture the variation, and cross-validation to determine optimal parameters. The model was applied to a variety of simulated data sets in order to test its properties, and to a real-world personality data set. The tests on simulated data showed the ability to recover and visualize the variation of the coefficients and their confidence bands over time with some success. The real-world tests showed some, but limited, variation, depending on the trait studied.
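The kernel smoothing step behind the varying coefficients can be illustrated with a plain Nadaraya-Watson estimator: a parameter is re-estimated locally at each value of the metric covariate by weighting nearby observations. The Gaussian kernel and the function name below are generic choices, not necessarily those used in the thesis.

```python
import numpy as np

def nw_smooth(t0, t, y, h):
    # Nadaraya-Watson estimate of E[y | t = t0] with a Gaussian kernel of
    # bandwidth h: observations close to t0 on the metric covariate t get
    # the largest weights, which is the varying-coefficient idea in miniature.
    w = np.exp(-0.5 * ((t - t0) / h) ** 2)
    return float(np.sum(w * y) / np.sum(w))
```

In practice the bandwidth `h` controls the bias-variance trade-off and would be chosen by cross-validation, as the abstract describes.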