Search results

(1 - 20 of 20)

Kim, D. 2018

Development of a statistical matching method with categorical data

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

National statistical institutes (NSIs) try to construct datasets that are rich in information as efficiently and cost effectively as possible. This can be achieved by combining available data, such...Show moreNational statistical institutes (NSIs) try to construct datasets that are rich in information as efficiently and cost effectively as possible. This can be achieved by combining available data, such as administrative data or survey data. When datasets do not pertain to the same units, one can sometimes resort to statistical matching to integrate them. Statistical matching is a data fusion technique which can be used when different data sets contain different units, but with a set of common (background) variables. The main goal of statistical matching is to estimate the relationship between the non-common variables in the different datasets. This paper investigates how best to utilize a small overlap of units in a statistical matching situation where data only consists of categorical variables. A small overlap of units contains joint information on all variables for only a limited number of units. A new statistical matching method, namely the combined estimator, is developed in this paper employing an idea from small area estimation. The performance of the combined estimator was compared to a couple of pre-existing statistical matching methods for categorical data under various data conditions. The result shows that, even though the combined estimator itself does not perform better than the pre-existing statistical matching method (the EM algorithm), the usage of the combined estimator as the starting point of the EM algorithm helps increasing its accuracy under certain data circumstances. The improvement of accuracy was noticed in cases where the number of matching variables was large.Show less

Hollmann, M. 2018

Weakly-Supervised Speaker Presence Detection on Podcast Episodes

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

Veerman, J.R. 2018

Estimating Error and Prior Variance in a High-Dimensional Ridge Regression Models: With Applications to Bayesian Structural Equation Models for Finding Treatment Targets

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

Elzinga, O. 2018

Text mining of fMRI at full resolution by energy coding and entropy bagging of "resting state" series

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

Association of neurological and psychological conditions with changes in coactivation patterns of brain regions in ’resting state’ is of recent interest in neuroscience. To uncover such latent...Show moreAssociation of neurological and psychological conditions with changes in coactivation patterns of brain regions in ’resting state’ is of recent interest in neuroscience. To uncover such latent functional connectivity, series of functional Magnetic Resonance Imaging (fMRI) scans are typically reduced by averaging activations in brain atlas regions. The averaged activations are further reduced to pairwise correlation in sliding fixed width time windows. Unfortunately such reduction in dimensions also reduces the scan resolution and complicates interpretation. Changing to a text mining perspective, this thesis interprets the high dimensional scans as documents with categorical words drawn from a study bag. Consecutive scans measure the activation in V discrete voxels of brain volumes. Activation series in each voxel are segmented into stationary subsequences. Similar correlated segments within voxels and from distinct voxels are then bagged as words. The words capture correlated activation both within- and between-voxels. Instead of being predefined in an atlas, regions emerge as neighbourhoods of voxels drawing the same word at the original scan resolution. The word counts that document voxels draw from the bag of categorical words defines the document state. Document state transition probabilities measure the dynamics in coactivated brain locations at the original fMRI resolution, as a possible marker for a neurological condition. This alternative fMRI activation reduction method avoids a-priori selection of regions, tuning of fixed time window widths, and selection of the number of principal components of the contrasted existing method; the alternative method allows a more direct interpretation of activations. However, the direct state switching interpretation of scan document voxels drawing categorical word counts, does not sufficiently separate subject groups for reliable classification of neurological conditions.Show less

Zhang, Y. 2018

A comparison of methods for estimating Restricted Mean Survival Time

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

The Restricted Mean Survival Time (RMST) is a statistic that measures treatment effects which can be used as a replacement for the hazard ratio when the proportional hazards assumption is violated....Show moreThe Restricted Mean Survival Time (RMST) is a statistic that measures treatment effects which can be used as a replacement for the hazard ratio when the proportional hazards assumption is violated. The idea of RMST came from Irwin (1949) [5], and when combined with the formal definition of the survival function, RMST can be defined as the integral of survival function up to a time limit τ . Several different methods for estimating the RMST are available. The Kaplan-Meier method and Cox PH model are the most commonly used methods in survival analysis, and they are also suitable for estimating RMST. This is done by first estimating the survival curve and then calculating the area under it to give an estimation of RMST. To allow a more general population of survival time distributions, a flexible parametric model was introduced by Royston and Parmar (2002) [4]. This flexible parametric model method followed the same method of estimating RMST as the Kaplan-Meier and Cox PH model: a survival function is estimated from the model, then a 15-point Gauss-Kronrod quadrature can be used to calculate the integral of the survival function, which allows estimation of RMST. The final option is a pseudo-observation method proposed by Anderson et al. (2004) [3]. This method first builds a pseudo-observation of RMST for each subject. Then, using the pseudo-observations of RMST as outcome variables, a generalized linear model can be built to describe the relationship between the covariates and RMST. A generalized estimating equation (GEE) method can then be used to estimate the parameters of the generalized linear model [8]. Comparisons between these methods under various simulation scenarios were conducted for this thesis. The Kaplan-Meier method is simple to calculate and performs well with early time limits and low censoring proportions. It is also faster to estimate RMST result than Cox model and flexible parametric model. However, this method lacks the ability to be adjusted for more covariates, so it is only suitable when estimating average RMST difference for a population. The unstratified Cox model performed well in datasets that satisfied the proportional hazards assumption. The stratified Cox model also performed well in our simulated non-proportional hazards datasets. The performance of the flexible parametric model method was similar to that of the Cox model, but it is more time-consuming in the integral calculation step. The pseudo-observation methods offered the shortest computation time among all four methods. However, when estimating RMST difference for a subject with given age and gender, the performance of the pseudo-observation method was worse than either the Cox model or flexible parametric model.Show less

Kantidakis, G. 2018

Prediction Models for Liver Transplantation: Comparisons between Cox proportional hazards regression models and machine learning techniques

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

Liver transplantation -i.e. the replacement of a diseased liver with healthy liver of another person, is the most effective therapeutic strategy for patients with end-stage liver disease....Show moreLiver transplantation -i.e. the replacement of a diseased liver with healthy liver of another person, is the most effective therapeutic strategy for patients with end-stage liver disease. Predicting survival of patients after liver transplantation is regarded as one of the most challenging areas in medicine. Hence, selecting the best prediction model is of paramount importance. Machine learning - field of computer science where specific algorithms are used to learn and make predictions on data - has lately received increased attention in the medical field due to contribution in medical imaging, ability to diagnose diseases and its great potential for personalized treatment. In survival analysis, machine learning implementation is difficult due to censored data. In this thesis, survival random forests and partial logistic artificial neural networks have been applied. Cox model has been exclusively used due to its easy implementation and straightforward interpretation. The model is however restricted to the proportionality of hazards assumption whereas the machine learning techniques do not make any assumptions. Nowadays, there is a strong discussion in the medical field about machine learning and if it has greater potential than Cox models when it comes to complex data. Criticism to machine learning is related to unsuitable performance measures and lack of interpretability which is important for the medical personnel. The potential of machine learning is investigated for large data of 62294 patients in USA for 106 prognostic factors selected from over 600; 52 donor’s characteristics and 54 patient’s characteristics. A meticulous comparison is performed between 3 proportional hazards models and machine learning techniques. For the artificial neural network novel extensions are provided to its original specification using state-of-the-art R software. A variety of measures is employed not only from survival field but also from simple classification setting. In this project, it is of particular interest the identification of potential risk factors post-operatively. Two survival outcomes are reported: overall survival (time to death since operation) and failure-free survival (minimum time between graft-failure and death since operation). In this thesis, it is shown that machine learning techniques can be a useful tool for both prediction and interpretation. Random survival forest shows in general better predictive performance than Cox models. Neural networks can reach comparable performance with the Cox models and even perform better in some classification metrics. However, high instability is present due to the lack of a global performance evaluation measure in survival setting.Show less

Gao, H. 2018

Estimating the Actual Relocation of Dutch People Based on `Wish to Move' Messages on Twitter

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

Currently the ‘wish to move’ to another house of the Dutch people is measured through the WoON survey conducted every three years. A more frequent way of measuring is wished for to improve policy...Show moreCurrently the ‘wish to move’ to another house of the Dutch people is measured through the WoON survey conducted every three years. A more frequent way of measuring is wished for to improve policy making in housing. Nowadays, people express their ‘wish to move’ on social media. In this research, it was found that certain features derived from tweet texts distinguish ‘wish to move’ tweets from others. The best logistic regression classifier developed in this research achieves an F1-score of 0.556 in identifying ‘wish to move’ tweets indicating that it is possible to timely keep track of the proportion of the ‘wish to move’ proportion of the Dutch population active on Twitter. Further, it is found that actual relocation can be identified by following ‘wish to move’ users. By engineering features through aggregating their subsequent tweets, classifiers were established to automatically determine if a ‘wish to move’ user relocated in the follow-up period. The best logistic regression classifier can determine if ‘wish to move’ users relocated in the two subsequent years with an F1-score of 0.701. With it, the proportion of ‘wish to move’ users who actually relocated later can be estimated.Show less

Quataert, I. 2018

Customer Lifetime Value calculations for mobile contracts of a telecom provider using a life time prediction method for competing events

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

This study deals with the introduction of a customer lifetime value for business customers with a focus on lifetime estimations using mobile contracts that are part of larger business contracts of...Show moreThis study deals with the introduction of a customer lifetime value for business customers with a focus on lifetime estimations using mobile contracts that are part of larger business contracts of a large Dutch telecom provider. Customer lifetime value is the total profit or loss to a company over the whole period of transactions by a customer. Business customers are defined here as firms or locations of large firms that are contracted for one or more business products of the telecom provider. Customer lifetime values are calculated of the level of mobile contracts and taken together per location afterwards. In order to calculate customer lifetime values, individual lifetime predictions and a definition of the values is needed. The lifetime predictions resemble a survival analysis that models the time from becoming contractfree until one of three possible decisions (contract renewal, product migration or contract termination) is made. Using survival estimates and semi-parametric models the overall survival is analyzed as well as the influence of characteristics of locations and companies to which the locations belong. Then, with the R package mstate competing risks models are applied to model the time to each decision while taking into account the other possible decisions. Additionally, lifetime estimations that result from the competing risks models are updated, whereby the survival analysis starts several months after becoming contract-free. Results show that approximately 25% of the decisions have been made at the start of the study. The duration of mobile contracts and ownership of a business internet product or a mobile internet product next to the mobile contract discriminate most between the occurance of the decisions. Furthermore, results of the competing risks models show that probabilities of making any decision attenuate over time. This is confirmed with a fictional product offer on both the levels of the mobile contract and business customers. The customer lifetime value as described here is a useful metric for the telecom provider to make customer selections and, after applying it to other business products, it could be used to discriminate between product offers.Show less

Gawehns, D. 2018

Preschoolers during recess: dynamic patterns in face-to-face interactions: Are network dynamics more than a sequence of static descriptors?

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

Yang, L. 2018

MDL-based Map Segmentation for Trajectory Mining

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

Discretization is a key issue in urban trajectory pattern mining tasks. By assuming that regions with different functions will probably have different densities of visiting people, we propose to...Show moreDiscretization is a key issue in urban trajectory pattern mining tasks. By assuming that regions with different functions will probably have different densities of visiting people, we propose to segment the city map-and hence discretize trajectory data-by finding region boundaries based on strong density changes. We solve the map segmentation problem as a model selection problem, using the existing MDL-histogram approach. We also propose a heuristic algorithm so that we can apply MDL-histogram on 2-dimensional data (longitude and latitude). Finally, we validate our approach and algorithm by simulation studies and on taxi trajectories from New York CityShow less

Hazewinkel, A.-D. 2018

Prediction models with survival data: a comparison between machine learning and a Cox proportional hazard regression model: A simulation study and an application to osteosarcoma data from a randomised clinical trial

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

Over the past years there has been an increased interest in applying machine learning (ML) techniques to medical research. With the growing availability of mixed data - clinical and genomic for...Show moreOver the past years there has been an increased interest in applying machine learning (ML) techniques to medical research. With the growing availability of mixed data - clinical and genomic for instance - ML methods, which have great potential for modelling complex data, have been increasingly applied. Few publications however have seen clinical applications, and the trend towards ML has been criticised for a lack of attention towards proper validation and towards the use of appropriate performance measures to quantify the model performance. Initially, in the context of medical research, machine learning methods were mainly used for diagnosis and detections, but the last years have seen a vast increase in ML modelling for the purpose of cancer prediction and prognosis. The latter trend has given rise to various adaptations of traditional ML approaches to censored survival data. Two such approaches - Biganzoli's survival neural network and Ishwaran's random survival forest - are evaluated in this thesis. They are compared to a statistical model - the well-used Cox proportional hazards model - in an application to a clinical dataset with 7 variables, measured on 2025 osteosarcoma patients- the EURAMOS-1 clinical trial. The purpose of this thesis is two-fold; 1) performing an in-depth comparison of the two ML methods and gaining insight into the potential of ML for clinical data with a limited number of predictors; 2) adding to existing osteosarcoma literature, in which ML methods have a very limited presence. The analyses performed on the EURAMOS data are reinforced by a simulation study, which is novel in the approach it takes to ensure that the simulated data closely mimics the original. This thesis shows that for the EURAMOS-1 osteosarcoma data the Cox proportional hazard model is suitable, and that both ML approaches have limited added benefit. Appropriate performance measures are identified for assessing neural network and random survival forest performance. For the survival neural network a modification to an existing measure is proposed to aid in identifying network instability - a known neural network pitfall. For the random survival forest it is shown that while suitable for distinguishing high and low risk patients, it results in unreliable individual survival predictions. An additional, unrelated chapter has been included in this thesis, detailing the application of a dynamic prediction model to the EURAMOS-1 osteosarcoma data.Show less

Schaik, M.J.G. van 2018

Applications of the Convexified Convolutional Neural Network: Experiments on simulated and real data

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

This thesis describes the model class of convexified convolutional neural networks (CCNNs), a type of deep learning model introduced by Zhang, Liang & Wainwright [1]. First, steps towards the...Show moreThis thesis describes the model class of convexified convolutional neural networks (CCNNs), a type of deep learning model introduced by Zhang, Liang & Wainwright [1]. First, steps towards the convex relaxation are described, as well as all the steps required to implement the algorithm. To this end, the thesis describes the mathematical structure of the shallow networks, how the function class can be relaxed to the convex case, as well as the role of Reproducing Kernel Hilbert Spaces, the Nystrom method, and projected gradient descent on the nuclear norm ball. The main contribution of this work is the implementation and application to a new data set. The problems considered are a simulation study and an implementation on the classification problem of text data. The results of the CCNN implementation show that it can be successfully applied on text data through the use of vectorized word representations. Advantages and drawbacks compared to more mainstream approaches are discussed.Show less

Vinkenoog, M. 2018

Reconstruction by deconstruction: Diplotype frequency estimation for genotype data in stratified populations

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

Zhang, X. 2018

Evaluating the PCA-based Eigenconnectivity Approach to Extract Dynamic Functional Brain Connectivity Patterns

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

Functional connectivity (FC) is an important metric to characterize brain mechanisms. Assessment of resting-state FC is a popular tool for studying brain disease mechanisms. Correlations between...Show moreFunctional connectivity (FC) is an important metric to characterize brain mechanisms. Assessment of resting-state FC is a popular tool for studying brain disease mechanisms. Correlations between functional magnetic resonance imaging (fMRI) blood-oxygenation-level-dependent (BOLD) time courses in different brain regions can measure FC which has revealed a meaningful organization of spontaneous fluctuations in the brain during rest. Therefore, in most studies, the presence of temporal and spatial dynamics of FC are usually measured by the correlation coefficients between the fMRI signals of several brain regions. However, recent research has shown that FC is not stationarity. That is, FC dynamically changes over time reflecting additional and rich information about brain organization. In 2013, Leonardi et al. proposed a new approach which was based on principal component analysis (PCA) to reveal hidden patterns of coherent FC dynamics across multiple subjects. This thesis evaluates this new approach in a simulation study. Moreover, also a framework to test the new approach is proposed. The simulation study showed advantages and disadvantages of the new approach. The results of the simulation study showed that the new approach can extract the most important dynamic connectivity features underlying fMRI data. It can retrieve timevarying connectivity between dynamic brain regions during rest effectively. The new approach identified connections with similar fluctuations, and gave an efficient linear representation, but only sensitive to linear relations between connectivity pairs, and it yielded robust results in restricted conditions. Finally, some recommendations for researchers using this method to study dynamic brain functional brain connectivity at rest are provided.Show less

Bakker, V. de 2018

A Novel Cross-Validation Framework for Identification of Atmospheric Aerosol Types: Cluster Number Validation Methods for Unsupervised Learning

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

Aerosols are tiny particles of various kinds and compositions suspended in the atmosphere, some of which have a critical, adverse impact on public health. Hence, modelling the prevalence and...Show moreAerosols are tiny particles of various kinds and compositions suspended in the atmosphere, some of which have a critical, adverse impact on public health. Hence, modelling the prevalence and distribution of these separate types is vital for giving shape to informed policy on air quality. In this work, methods are described to identify clusters of similar aerosol type mixtures in the Earth’s atmosphere on a global scale, on the basis of microphysical data from the space-borne remote sensing instrument POLDER-3. We report an unsupervised learning approach using the Self-Organizing Map (SOM) and k-means clustering, which allows for clustering without a priori assumptions on existing aerosol types, nature or prevalence. Two methods are introduced to stabilize these clustering algorithms over multiple equal runs to manage their local optima convergence property: the k-means nstart option is extended to the SOM and a set-up is given for a new method, Expectation-Maximization-centered Mahalanobis clustering (EMcMc). A (repeated) v-fold cross-validation framework is presented to find the optimal number of clusters k in the data by means of cluster validation measures, currently including Prediction Strength and validated variants of the Silhouette Width. Using a separate test set, the method can be used to optimize a generic k, countering overfitting. A novel validation index is developed which extends the Silhouette Width to data sets with many observations (large N): the Gridded Silhouette Width. All described methods are implemented in the statistical software package R and shown to work for simulated examples, originating from scaled Gaussian distributions with varying degrees of overlap. Analysis of the POLDER-3 data indicated that using only four variables, 8 clusters can be found in a stable and reproducable fashion. The Silhouette indices did not appear to perform well for data so widely dispersed as here. The found clusters were characterized based on their variable distributions and geographical occurence, which proved to be feasible and meaningful for real-life interpretations. The proposed aerosol types were dust, marine, urban-industrial, smoke and mixtures thereof. Keywords: aerosol typing; unsupervised learning; self-organizing map; k-means clustering; cluster validation measures; cross-validation; gridded silhouette.Show less

Hille, Y. 2018

Using synchronicity in neural responses to predict population behaviour: A simulation study of different EEG Inter-Subject Correlation methods

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

Synchronous neuronal responses across subjects is also known as neural reliability. The level of neural reliability evoked by natural stimuli is shown to be a predictor to larger audience...Show moreSynchronous neuronal responses across subjects is also known as neural reliability. The level of neural reliability evoked by natural stimuli is shown to be a predictor to larger audience preferences (Dmochowski et al., 2014). The same authors also proposed the state-of-the-art method for calculating neural reliability in an EEG setting (Dmochowski et al., 2014). However, the method is indirect and rather ad hoc, therefore, some existing alternative methods are proposed as well as an own proposed algorithm of calculating neural reliability. All the different methods are compared by means of a simulation study. Here, the performance is tested in their ability to recover the actual neural reliability in the data, but also their performance in predicting a population measure. Furthermore, wavelet transform as a denoising step in the setting of EEG data is investigated. The results of the simulation study show that Dmochowski and colleagues’ (2014) is performing well on undenoised data and when the relationship between the “true” ISC and buying behaviour is strong. However, the adapted neural reliability method of Hasson and colleagues’ (2004) and originally intended for fMRI studies stands out not only in terms of performance, but also in consistency of performance under different data characteristics, like the strength of the ISC, the signal to noise ratio and the strength of the relation between true ISC and buying behavior. Moreover, this method is also more direct and easier to calculate. The proposed way of denoising by wavelet transform only hurts the performance of the proposed neural reliability methods. It can be concluded that the adapted method of Hasson and colleagues’ (2004) can be recommended both for determining the ISC as the relation between ISC and a population measure.Show less

Pas, L.L. 2018

Statistical Criminal Profiling: Predicting Homicide Offender Characteristics using a Bayesian Network

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

Criminal profiling is a rapidly growing field of research, in which statistics get more and more incorporated alongside of the traditional behavioural profiling approach that uses psychological...Show moreCriminal profiling is a rapidly growing field of research, in which statistics get more and more incorporated alongside of the traditional behavioural profiling approach that uses psychological theories to predict the behaviour of an offender. A model was built to predict the offender characteristics from crime and victim characteristics for single-victim-single-offender homicides in the Netherlands. Using the Dutch Homicide Monitor, eight different Bayesian network structure learning algorithms were combined into one model; arcs that were present in at least three separate structure learning algorithms were represented in the combined model and its direction was determined by the highest cumulative arc strength. The graphical representation of the model gives insight into the dependence relationships between crime, victim, and offender characteristics, and therefore could be used to confirm existing and develop new hypotheses on criminal psychology. Moreover, with an appropriate threshold resulting in a prediction error of less than 10 percent, the combined Bayesian network might be suitable for actual implementation by the police. This practical implication and the restrictions of the model are discussed, and recommendations for future research are given.Show less

Liu, Y. 2018

A simulation study to evaluate the performance of Bayesian variable selection in identification of the source of food-borne disease outbreaks

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

Food-borne disease outbreaks constitute a large, ongoing public health burden worldwide (Hald et al., 2016). Early identification of contaminated food products plays an important role in reducing...Show moreFood-borne disease outbreaks constitute a large, ongoing public health burden worldwide (Hald et al., 2016). Early identification of contaminated food products plays an important role in reducing health burdens of food-borne disease outbreaks (Jacobs et al., 2017). Case-control studies together with logistic regression analysis are primarily used in food-borne outbreak investigations. However, the current methodology is associated with problems including response misclassification, missing values and ignoring small sample bias. Jacobs et al. (2017) developed a formal Bayesian variable selection method which deals with the problems of missing covariates and misclassified response. The re-analysis of Dutch Salmonella Thompson 2012 outbreak data (Friesema et al., 2014) has illustrated that this Bayesian approach allows a relatively easy implementation of these concepts and performs better than the standard logistic regression analysis in the identification of responsible food products. The complete Bayesian variable selection model is composed of three different parts, namely, misclassification correction, missing value imputation and Bayesian variable selection. In this thesis, we are interested in how these different parts affect the performance of Bayesian variable selection models in scenarios with (i) the same response misclassification rate and missingness rate in an assumed responsible food product covariate as in the original food-borne disease outbreak dataset, (ii) different response misclassification rates, (iii) different missingness rates in an assumed responsible food product and (iv) the combination of different response misclassification rates and missingness rates. We answer this research question by designing and executing a simulation study. Our results indicate that for the four different versions of Bayesian variable selection models studied in this thesis, the increase in the response misclassification rate or the missingness rate in the assumed responsible food product covariate or the increase in both results in a decrease in model performance. Bayesian variable selection, misclassification correction and missing value imputation all contribute positively to the model performance. Although missing value imputation is most computationally expensive, it contributes the most to the model performance among these three components.Show less

Verkerk, L. 2018

Forecasting Infectious Disease Epidemics

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

In this project a new approach to forecasting infectious disease epidemics was tested in a simulation and applied to data of the 2014 - 2016 Ebola epidemic. GLMs were applied to the (simulated)...Show moreIn this project a new approach to forecasting infectious disease epidemics was tested in a simulation and applied to data of the 2014 - 2016 Ebola epidemic. GLMs were applied to the (simulated) data, from which the key quantities contact rate and epidemic size could be obtained. With (non-)parametric bootstrapping, the GLM results could be assessed, and the key quantities were obtained and subsequently used to produce forecasts. Forecasting intervals were made to show the accuracy of the forecasts in terms of epidemic size and duration. Simulation results suggested that the method underestimated the eventual epidemic size, and overestimated the contact rate. However, applying the method to a real-life data set resulted in overestimation of the eventual epidemic size. The results of the contact rate for the application on real-life data should be compared to estimates from literature, before a significant meaning can be given to the results. Both simulation and application results gave variable estimates for the epidemic duration, although a positive relation was seen between epidemic size and epidemic length. Estimates for the contact rate could be improved. The major issues with prediction were accountable to exact collinearity introducted by the systematic model; the major issues with forecasting were accountable to extreme estimates of the epidemic size. The cause of both issues lies in the GLMs that were fit to the data.Show less

Chen, C.H. 2018

ELO Rating System for UEFA Women's Euro 2017: The Predictive Power of Elo Ratings for the Performance of Teams and Players in the 2017 UEFA Women’s Championship

Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)

open access

The Elo rating system has been used in various sports / games, such as chess, soccer, tennis and even video games, to calculate the relative playing strengths of players / teams. Originally, the...Show moreThe Elo rating system has been used in various sports / games, such as chess, soccer, tennis and even video games, to calculate the relative playing strengths of players / teams. Originally, the Elo system was invented by a Hungarian physics professor, Arpad Elo, to improve chess rating system. Now many rating systems used in sports are based on the Elo rating system with modifications. The objective of this thesis project is to examine the Elo rating system for soccer tournaments and how it can be applied to the 2017 UEFA Women’s Championship (short for UEFA Women’s Euro 2017). More specifically, two primary interests lie in this project. The first interest lies in determining the strength of each team by assigning an Elo rating to the each competing team after tournament. In addition, it is interesting to see how home-field advantage helped the Netherlands (the host country) win the championship of UEFA Women’s Euro 2017 by incorporating the home-field advantage in the Elo formula. Secondly, strengths of the players of all teams are also of interest. In order to estimate the strengths of the players, each player is assigned a rating (Not an Elo rating) to represent how strong every player is. We can then compare the players among all teams. In order to access the reliability of our ideas and methodology, a simulation study will follow after the theoretical part of our research. In Chapter 1 I will first describe the basic concepts of the Elo rating system. Then a short summary of the relevant literature papers will be presented. Finally I will discuss the source of the data, the arrangement of the tournament, and the process that will take to go through the algorithm / methodology. In Chapter 2 the basic Elo formula and some modified Elo models are proposed, which allows us later on to determine the most appropriate model for estimating the strengths of every single competing country and the players of all teams. In the end of this chapter, I develop an ordered probit regression model for forecasting match results in UEFA Women’s Euro 2017. Chapter 3 suggests a simulation study for estimating the strengths of all the participant countries of the tournament and the strengths of football players of all teams. Chapter 4 presents the main conclusions drawn from the model computations and suggests some further research of this thesis project.Show less

Leiden University Student Repository

Refine Results

Availability

Faculty

Thesis type

Programme

Supervisor

Language

Your search

Enabled Filters

Sort

Search results