In recent years, statistics has played an increasing role in professional football. A controversial topic in the emerging field of football data science is the effect of ball possession on match outcomes. We contribute to this discussion by analyzing the effect of possession on match outcomes while controlling for match status and match-up balance. We examine the importance of the position of possession by comparing the kernel density estimates of winning and losing teams. Based on these findings we split the football pitch into distinct zones using Voronoi cells based on the centroids of a k-means clustering. We fit a multiple linear regression model that regresses a match's final goal difference on possession per match status per zone, using a 5x5-fold nested cross-validation. The resulting model splits the football pitch into 11 zones. Our metric holds higher predictive power than the traditional metric. To demonstrate the potential of this work for both analysts and journalists, we analyze a team's performance over a whole season as well as individual match performances using the metric.
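As an illustration of the zoning step described above, a minimal sketch in Python: k-means centroids induce the Voronoi cells, since every location belongs to its nearest centroid. The possession coordinates are made up; only k = 11 and the pitch dimensions follow the abstract.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
events = rng.uniform([0, 0], [105, 68], size=(5000, 2))  # fake possession locations (x, y)

# The k-means centroids induce the Voronoi zones: each location belongs to the
# zone of its nearest centroid, which is exactly what KMeans.predict() returns.
kmeans = KMeans(n_clusters=11, n_init=10, random_state=0).fit(events)
zones = kmeans.predict(events)

# Per-zone possession shares could then enter the regression on goal difference.
shares = np.bincount(zones, minlength=11) / len(zones)
print(shares)
```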
In neuroscience, recent research has seen the rise of complex network analysis methods to study brain data from methods like EEG and fMRI. Among its many applications, complex network analysis has been applied to the study of interpersonal synchrony in brain data (Müller and Lindenberger, 2014; Müller et al., 2013; Sänger et al., 2012). However, the use of complex network analysis in this context is very limited, despite its high potential. To bridge this gap, in this thesis we present a simulation study allowing for a systematic comparison of the ability of various network analysis measures to capture synchrony, across different input parameters, functions to transform the data for network analysis, and different types of networks (topology). The results made clear that many graph-theory measures, mostly from weighted and unweighted networks, were able to pick up changes in synchrony. The results also showed the importance of lengthier EEG recordings, as these substantially improved the performance of several graph-theory measures. Moreover, using Pearson's correlation or circular correlation to transform the data for network analysis appears to be a better choice than using some other transform functions. Although further research must be done in the field of complex network measures for synchrony, the results are promising for the use of graph-theory measures to detect changes in interpersonal synchrony.
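A hedged sketch of the transform-then-measure pipeline this abstract describes: pairwise Pearson correlations between signals become edge weights of a network, on which graph-theory measures are computed. Signal shapes and the use of absolute correlations are illustrative assumptions, not the thesis's exact setup.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(1)
signals = rng.standard_normal((16, 1000))   # 16 "channels", 1000 samples each
corr = np.corrcoef(signals)                 # transform function: Pearson correlation
np.fill_diagonal(corr, 0.0)                 # no self-loops

G = nx.from_numpy_array(np.abs(corr))       # weighted "synchrony" network
clustering = nx.average_clustering(G, weight="weight")
strength = dict(G.degree(weight="weight"))  # weighted degree per node
print(clustering, np.mean(list(strength.values())))
```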
Using individual participant data (IPD) has many advantages over using aggregate data (AD) in clinical meta-analysis. However, access to IPD is often limited, whereas aggregate data are available from most clinical trials. Papadimitropoulou et al. [4] propose a method for studies with continuous outcomes at baseline and follow-up measurement to generate pseudo-IPD from the aggregate data, which can be analyzed as IPD using analysis of covariance (ANCOVA) models and linear mixed models. The pseudo-IPD are generated based on the mean and standard deviation at baseline and follow-up, and the correlation between baseline and follow-up, which are sufficient statistics of the linear mixed model. This thesis exemplified the pseudo-IPD models, standard meta-analysis models, and a Trowman meta-regression model on obstructive sleep apnea data with two treatment groups. We further explored the performance of the models under different conditions in a simulation study. The estimates of the Trowman meta-regression suffered from large variance, and the standard AD models provided biased estimates when baseline imbalance exists. The ANCOVA models for pseudo-IPD and AD offered more accurate and stable results. The pseudo-IPD ANCOVA model is preferred since it can account for baseline differences and the interaction between treatment and baseline, and different residual structures can be used.
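A minimal sketch of the pseudo-IPD idea, under illustrative numbers: generate baseline/follow-up pairs whose sample mean, standard deviation, and correlation match reported aggregates exactly, so the generated data can be analyzed as if they were IPD.

```python
import numpy as np

def pseudo_ipd(n, means, sds, r, seed=0):
    """Generate n (baseline, follow-up) pairs matching the given aggregates exactly."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, 2))
    z = (z - z.mean(0)) / z.std(0, ddof=1)        # empirically standardize
    z[:, 1] -= z[:, 0] * np.corrcoef(z.T)[0, 1]   # orthogonalize column 1 against 0
    z[:, 1] /= z[:, 1].std(ddof=1)
    z[:, 1] = r * z[:, 0] + np.sqrt(1 - r**2) * z[:, 1]  # impose exact correlation r
    return np.asarray(means) + z * np.asarray(sds)

x = pseudo_ipd(50, means=[10.0, 8.0], sds=[2.0, 2.5], r=0.6)
print(x.mean(0), x.std(0, ddof=1), np.corrcoef(x.T)[0, 1])  # reproduces the aggregates
```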
This paper describes the results of a study to test the efficiency of a set of estimation methods used in Statistical Matching (SM) for categorical variables, in a variety of different conditions. SM is a technique that integrates datasets containing different units that come from the same population and share some common variables. The goal of the technique is to estimate the association between the variables that are not common. The tested estimators include some existing methods (i.e., the Direct estimator, the CIA estimator, the Combined estimator and the EM estimator), along with Iterative Proportional Fitting (IPF), which is applied for the first time in the context of SM in this research. The methods are tested in populations with different levels of dependence between the target variables, and also for the effect of using a selective overlap of units (sampled from another, relevant population) to make these estimations. For this reason, synthetic populations are created and used both to directly test the estimators for their predictive accuracy and as populations from which selective overlap sets can be sampled. Furthermore, the accuracy, bias and variance of the cells of the estimated contingency table were assessed. The results suggest that the Direct and EM estimators remain almost unaffected by changes in the populations' characteristics or the selective overlaps, respectively. In contrast, methods based on the CIA estimator appear to have an advantage when the conditional independence assumption is met in the population.
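For readers unfamiliar with IPF, a compact sketch of the algorithm as commonly defined: alternately rescale a seed contingency table so that its row and column margins match given targets. The seed table and margins here are made-up illustrative numbers.

```python
import numpy as np

def ipf(seed, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Iterative Proportional Fitting: fit seed table to target margins."""
    table = seed.astype(float).copy()
    for _ in range(max_iter):
        table *= (row_targets / table.sum(axis=1))[:, None]  # match row margins
        table *= (col_targets / table.sum(axis=0))[None, :]  # match column margins
        if np.allclose(table.sum(axis=1), row_targets, atol=tol):
            break
    return table

seed = np.array([[10, 5], [5, 10]])
print(ipf(seed, row_targets=np.array([60, 40]), col_targets=np.array([30, 70])))
```

The fitted table keeps the seed's association structure (odds ratio) while reproducing the target margins, which is what makes IPF attractive for matching.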
Objective: Randomized controlled trials (RCTs) for rare neurological diseases, such as Guillain-Barré syndrome (GBS), have a disappointing lack of success, possibly due to inefficient statistical analysis. We aimed to evaluate the impact of covariate adjustment for baseline characteristics, ordinal analysis and repeated assessments on statistical power in randomized controlled trials with ordinal scales as outcome measure.

Methods: We re-analysed a previous trial in GBS (the IVIg + placebo vs IVIg + methylprednisolone trial, n = 221) and conducted power simulations to assess the performance of different approaches for the analysis of ordinal scales, such as the GBS Disability Scale, under different conditions. The approaches consist of binary logistic regression and proportional odds logistic regression, with and without covariate adjustment for important prognostic factors (MRC sum score and days from onset of weakness to randomisation). The conditions consist of satisfaction of the proportional odds assumption, the use of weaker prognostic baseline characteristics, and quantitative versus qualitative violation of the proportional odds assumption. We extended these approaches to a longitudinal proportional odds model. Simulations varied in sample size and treatment effect.

Results: Covariate adjustment led to an increased estimated treatment effect and increased standard error in the GBS trial. Proportional odds analysis decreased the standard error in comparison to a binary logistic regression analysis, indicating a more sensitive analysis. The longitudinal proportional odds model resulted in a larger standard error compared to single time point proportional odds analyses. Simulations for the analysis of continuous data with a linear mixed model confirmed that a longitudinal approach does not increase power compared to a single time point analysis in case of a low within-subject variance, as was observed for the GBS trial. In the simulations we focused on the effect of covariate adjustment and ordinal analysis. Simulations indicated that Type I errors were generally around 5%. A small gain in power was achieved by covariate adjustment for two known prognostic factors in GBS, and a larger gain by exploiting ordinality instead of dichotomizing the ordinal scale. The gains translated to a gain in power of up to 7 and 13 percentage points by covariate adjustment and exploiting ordinality, respectively. The gains in power were only slightly smaller under violation of the proportional odds assumption and with smaller prognostic effects of the covariates.

Conclusion: Optimal analysis of ordinal scales should adjust for baseline characteristics (covariate adjustment) and should respect the ordinality of the outcome measure. A longitudinal proportional odds model for analysis of repeated assessments may not have added benefit compared to a single time point proportional odds model. Further research should confirm that the use of a longitudinal proportional odds model is only beneficial when the observed disease course within patients is more variable over time.
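To make the adjusted-versus-unadjusted comparison concrete, a hedged sketch on simulated ordinal data, with statsmodels' OrderedModel standing in for the proportional odds analysis. Variable names, effect sizes, and cut points are illustrative assumptions, not the GBS trial's.

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(2)
n = 400
treat = rng.integers(0, 2, n)
baseline = rng.standard_normal(n)                    # prognostic covariate
latent = 0.5 * treat + 0.8 * baseline + rng.logistic(size=n)
outcome = pd.cut(latent, bins=[-np.inf, -1, 0, 1, np.inf], labels=False)  # 4 levels

unadj = OrderedModel(outcome, pd.DataFrame({"treat": treat}),
                     distr="logit").fit(method="bfgs", disp=False)
adj = OrderedModel(outcome, pd.DataFrame({"treat": treat, "baseline": baseline}),
                   distr="logit").fit(method="bfgs", disp=False)
print(unadj.bse["treat"], adj.bse["treat"])          # compare standard errors
```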
In the past decade, experimental developments in the field of transcriptomics have enabled researchers to measure gene expression at the level of single cells, leading to a great increase in measurement resolution. Clustering the cells according to their gene expression profiles can aid the discovery of novel cell types. Unfortunately, it is often difficult to tell whether the established clusters are homogeneous or if sub-clustering might be possible. Recently, a new clusterability measure has been developed that aims to quantify the heterogeneity of gene expression within clusters. The so-called SIGnal Measurement Angle (SIGMA) is based on a result from random matrix theory which states that the singular values of a random matrix follow a known probability distribution. Singular values that strongly deviate from this distribution are likely to be caused by deterministic sources of variability, such as differences between cell types. However, the heterogeneity may also be caused by unwanted technical sources of variance, such as batch effects, which arise when data from several experiments are combined. Various methods exist for batch effect correction, but it is yet unclear whether they reduce batch effects to the extent that their effect on SIGMA is sufficiently eliminated. In this thesis, we compared the efficacy of three different batch correction methods (fastMNN, Harmony, and Seurat) on simulated data and on two empirical data sets. Their effectiveness was evaluated with several batch mixing metrics as well as by inspecting the singular values and vectors of the batch-corrected expression matrices. In conclusion, both fastMNN and Harmony worked well in most cases, but with unbalanced data sets (i.e., when one or more cell types were absent in one of the batches), it became increasingly difficult to decide whether singular values were batch effect-associated, because in those cases the batch effect also contained biological heterogeneity.
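A minimal sketch of the random-matrix intuition behind SIGMA: the largest singular value of an n x p pure-noise Gaussian matrix concentrates near sqrt(n) + sqrt(p), while structure such as a distinct "cell type" pushes singular values beyond that edge. Dimensions and the injected structure are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
n_cells, n_genes = 500, 200
noise = rng.standard_normal((n_cells, n_genes))

# Largest singular value of i.i.d. Gaussian noise stays near sqrt(n) + sqrt(p).
edge = np.sqrt(n_cells) + np.sqrt(n_genes)

structured = noise.copy()
structured[:250] += 1.0   # shift half the cells: a deterministic "cell type" signal
print(np.linalg.svd(noise, compute_uv=False)[0],       # ~ edge
      edge,
      np.linalg.svd(structured, compute_uv=False)[0])  # well beyond the edge
```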
In the analysis of 2×2×K contingency tables, a common hypothesis is the conditional independence of rows and columns controlling for a third variable. While frequentist tests of this hypothesis (e.g., the Cochran-Mantel-Haenszel test) exist, the goal of this thesis was to evaluate a Bayes factor alternative for the conditional independence test in contingency tables. Framing the test as a Bayesian model comparison using generalized linear models, multiple g-prior variants were evaluated and compared to each other through a simulation study. The simulation results indicate that priors like the hyper-g/n, intrinsic or the robust prior generally show desirable patterns for medium to large effect sizes, but are prone to lead to wrong conclusions for small underlying effect sizes unless the sample size is large. The R code for the simulation study can be found on GitHub: https://github.com/DHeemann/Bayesian-conditional-independence-simulation-
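For orientation, a hedged sketch of the frequentist baseline mentioned above: the Cochran-Mantel-Haenszel test for conditional independence in 2×2×K tables, via statsmodels. The thesis's Bayes factor variants are not reproduced here, and the stratum counts are made up.

```python
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# Three 2x2 strata (K = 3), one table per level of the conditioning variable.
tables = [np.array([[20, 10], [12, 18]]),
          np.array([[15, 15], [9, 21]]),
          np.array([[25, 5], [16, 14]])]

st = StratifiedTable(tables)
result = st.test_null_odds()   # CMH-type test of a common odds ratio of 1
print(result.statistic, result.pvalue)
```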
In the context of factor analysis, the most common estimation method for analysing discrete data is multiple-step Diagonally Weighted Least Squares (DWLS). A novel estimation method is Pairwise Maximum Likelihood (PML). PML maximizes the product of bivariate likelihoods in a single step. PML estimation was found effective for small datasets with few discrete variables. In this study, we investigate how PML performs with large datasets and different types of data (e.g., discrete data, continuous data, and combinations thereof). We conducted two different simulation studies to compare the performance of PML to the DWLS estimation method in terms of accuracy and efficiency. We thereby examined different experimental conditions: model sizes (small, medium, large, and huge), sample sizes (200, 400, and 800), and answer categories (two and four). In addition, we checked the robustness of PML by fitting a model without misspecifications (i.e., a correctly specified model) and with misspecifications (i.e., a misspecified model). ANOVAs were conducted to test whether the differences between PML and DWLS depend on the aforementioned design factors. Regarding the performance of PML and DWLS, our results indicate that the (relative) bias of both the parameter estimates and the standard errors remains very small across the varying experimental conditions for the correctly specified model and slightly increases in conditions with a misspecified model. Overall, our findings demonstrate that PML performed slightly better than DWLS in terms of bias of both the parameter estimates and the standard errors.
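A toy sketch of the pairwise likelihood idea behind PML: under a one-factor probit model with loadings lam and thresholds tau, each pair of binary items contributes a bivariate normal likelihood with implied correlation lam_i * lam_j. All parameter values and data below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def pair_loglik(y_i, y_j, lam_i, lam_j, tau_i, tau_j):
    """Bivariate log-likelihood contribution of one item pair under a 1-factor probit model."""
    rho = lam_i * lam_j                      # implied tetrachoric correlation
    p11 = multivariate_normal(mean=[0, 0],
                              cov=[[1, rho], [rho, 1]]).cdf([-tau_i, -tau_j])
    p1_ = norm.cdf(-tau_i)                   # P(y_i = 1)
    p_1 = norm.cdf(-tau_j)                   # P(y_j = 1)
    probs = {(1, 1): p11, (1, 0): p1_ - p11,
             (0, 1): p_1 - p11, (0, 0): 1 - p1_ - p_1 + p11}
    return sum(np.log(probs[(a, b)]) for a, b in zip(y_i, y_j))

rng = np.random.default_rng(4)
f = rng.standard_normal(300)                 # common factor
y1 = (0.7 * f + np.sqrt(1 - 0.49) * rng.standard_normal(300) > 0.2).astype(int)
y2 = (0.6 * f + np.sqrt(1 - 0.36) * rng.standard_normal(300) > -0.1).astype(int)
print(pair_loglik(y1, y2, 0.7, 0.6, 0.2, -0.1))
```

PML then sums such terms over all item pairs and maximizes over the model parameters in one step.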
The detection of anomalies is a research area that has made great progress in recent years and decades. As more and more applications produce ever larger amounts of data, anomaly detection becomes increasingly important. In the past, most anomaly detection algorithms focused on static data sets, that is, data sets without a time stamp or time element, and did not take the element of time into account if it was provided. In addition, these algorithms rarely have the ability to incorporate additional knowledge into their decision-making process and cannot adapt to changes in the data over time. Building on an algorithm called Evolutionary Isolation Forest, which attempts to solve both of these problems, this paper suggests a variation of this algorithm called Extended Evolutionary Isolation Forest. This algorithm uses more complex splitting criteria to isolate anomalies and uses evolutionary operators to refine the decision process and adapt to feedback from experts. Using benchmark data, it can be shown that the algorithm performs similarly to the Evolutionary Isolation Forest, but without generally outperforming it. In addition, the algorithms are compared on a real-world data set from the energy infrastructure provided by WithTheGrid.
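A hedged sketch of the common baseline these algorithms build on: scikit-learn's standard Isolation Forest (the Evolutionary and Extended Evolutionary variants are the paper's own contributions and are not reproduced). Data are synthetic for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
normal = rng.standard_normal((500, 2))          # bulk of the data
anomalies = rng.uniform(-6, 6, size=(10, 2))    # scattered outliers
X = np.vstack([normal, anomalies])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = forest.score_samples(X)                # lower = easier to isolate = more anomalous
print(np.argsort(scores)[:10])                  # indices of the likeliest anomalies
```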
E-variables are tools for statistical testing that allow results from multiple experiments to be easily combined. Large E-variables denote a large amount of evidence against the null hypothesis. In this thesis we focus on E-variables that are valid under optional stopping, which means that the researcher may choose to stop collecting data at any point in time, even after viewing the statistical result. We look at a non-parametric setting where it is tested whether a distribution is symmetric around 0 or not. These E-variables are allowed to be conditioned on past data. This means that we can learn ideal values of E-variable parameters while collecting data, such that the E-variables perform better and better as more data come in. We examine multiple versions of Efron-De la Peña E-variables and introduce 'hedge' E-variables. We also examine E-variables based on rank-based methods such as the Sequential Rank Test, and we introduce a modified version of the Safe Mann-Whitney U test, which we call the Split-Safe Mann-Whitney U test. We evaluate these E-variables by comparing the amount of data needed to gather enough evidence for a 'significant' result, as well as the rate at which the E-variables grow as more data are collected, when data are generated by a variety of probability distributions. We have found that different E-variables perform better across different generative probability distributions, but overall the 'hedge' Efron-De la Peña E-variable and the Sequential Rank Test appear to be the best E-variables for this setting. All examined E-variables require more data to be collected for a 'significant' result in comparison to the classical Mann-Whitney U test. However, the benefit of E-variables is that they do not require the researcher to set a fixed sample size beforehand, which can outweigh the lower performance.
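A minimal sketch of an e-process for testing symmetry around 0, in the spirit of this setting (not one of the thesis's specific E-variables). Under the null, each sign is +1 or -1 with probability 1/2, so each factor has expectation 1 and the running product remains a valid E-variable under optional stopping; the betting fraction lam is a fixed illustrative choice, though it could be learned from past data as the abstract describes.

```python
import numpy as np

def e_process(x, lam=0.5):
    """Running product of bets on the signs; valid e-process under symmetry around 0."""
    signs = np.sign(x[x != 0])            # ignore exact zeros
    return np.cumprod(1 + lam * signs)    # E[1 + lam * s] = 1 under the null

rng = np.random.default_rng(6)
shifted = rng.standard_normal(200) + 0.4  # asymmetric alternative: evidence should grow
print(e_process(shifted)[-1])             # large value = evidence against symmetry
```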
Patient survival in biomedical studies is often subject to multiple clinical endpoints, all of which compete for the first and possibly only opportunity of occurrence. As a result, the occurrence of competing events may preclude the observation of a specific clinical outcome of interest. To gain further insight into specific outcomes in the presence of competing events, a special type of survival analysis is required, known as competing risks analysis. The presence of treatment effects in competing risk models can be visually examined by constructing cumulative incidence curves. These curves illustrate the probability of first occurrence for each event over a series of time points, and thereby avoid the bias that is introduced by competing events in classic survival curves.

In randomized controlled trials, cumulative incidence curves are unaffected by confounding from patient-specific covariates, as a result of the strict random assignment of patients between treatment cohorts. However, observational studies often introduce an imbalance of covariates between treatment cohorts, as certain groups of patients may be overrepresented within a particular treatment strategy. Covariate imbalance between cohorts results in a biased comparison of cumulative incidence curves, since they reflect the average failure probability within each cohort. This may discourage researchers from using cumulative incidence curves to report findings in the presence of both competing risks and covariate imbalance. Fortunately, well-documented strategies already exist to address covariate imbalance in survival analysis, which have led to covariate-adjusted survival functions. However, these methods have yet to be extended to provide covariate adjustment for the cumulative incidence curves used to report competing risk models.

In this study, we have developed and examined various adjustment methods to produce covariate-adjusted cumulative incidence curves in the presence of covariate imbalance between cohorts. A simulation study was carried out to compare the accuracy and precision of these methods, and the best-performing method was applied to real-world breast cancer survival data. Covariate adjustment in the breast cancer survival data allowed us to shed light on the role of covariate imbalance between patients treated with mastectomy and those treated with breast-conserving therapy for each of the competing outcomes.
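A hedged sketch of the unadjusted building block discussed above: the Aalen-Johansen estimator of a cumulative incidence curve under competing risks, via the lifelines library. The covariate-adjustment methods developed in the thesis are not reproduced; the event-time model below is synthetic.

```python
import numpy as np
from lifelines import AalenJohansenFitter

rng = np.random.default_rng(7)
n = 300
t1 = rng.exponential(10, n)            # latent time to the event of interest
t2 = rng.exponential(15, n)            # latent time to the competing event
time = np.minimum(t1, t2)              # observed time = first event
event = np.where(t1 <= t2, 1, 2)       # 1 = event of interest, 2 = competing event

ajf = AalenJohansenFitter()
ajf.fit(time, event, event_of_interest=1)
print(ajf.cumulative_density_.tail())  # estimated cumulative incidence of event 1
```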
One of the key characteristics describing an infectious disease is the incubation period. Commonly, incubation period estimates are obtained via interval-censored methods. Deng, You, Liu, Qin, and Zhou (2020) and Qin et al. (2020) proposed a new family of methods for estimating the incubation period and applied them to data from the initial SARS-CoV-2 outbreak in Wuhan. These methods are based on the theory of the renewal process and do not require information on the time of infection. Instead, travel information (i.e., day of departure) is needed. These data tend to be more easily obtainable and hence larger datasets can usually be used. However, both Deng and Qin made a number of assumptions that appear questionable. To date, no study has addressed the validity of their proposed renewal methods or their assumptions.

In a novel simulation study, the impact of changing assumptions on the estimated incubation time was investigated. Deng and Qin assumed that the time from infection to leaving Wuhan follows a uniform distribution. This assumption is problematic because of the exponential increase in SARS-CoV-2 cases and the sharp increase in people leaving Wuhan before lockdown measures were implemented. In addition, both assume that up to 20% additional infections occur on the day of travel due to busy environments. However, it is not clear whether the correction for additional infections at travel day is warranted. As part of the thesis, a data generation method was introduced that takes these aspects into account.

In this thesis, it is shown that the assumptions underlying the renewal process method are violated by Qin and Deng and by the proposed data generation method. The simulation study showed that the violated assumptions introduce a bias that is partially compensated by the bias introduced by the inclusion of additional infections on the day of travel. The findings suggest that incubation period estimates based on current renewal process methods should be interpreted with caution. The results of this work provide important insights into the accuracy of current methods for estimating the incubation period. This can help to better understand the dynamics of infectious diseases, which in turn can help to contain the spread of future outbreaks.
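A minimal sketch of the assumption being probed: if infections grow exponentially up to the day of departure, the time from infection to leaving is not uniform but concentrates near departure. The growth rate and window length are illustrative assumptions, not the thesis's calibrated values.

```python
import numpy as np

rng = np.random.default_rng(8)
window, rate, n = 30.0, 0.15, 100_000      # days before departure; daily growth rate

# Uniform assumption (Deng/Qin): infection time uniform over the window.
uniform_gap = rng.uniform(0, window, n)

# Exponential growth: infection density proportional to exp(rate * t), so gaps
# to departure concentrate near zero. Sample t by inverse-transform sampling.
u = rng.uniform(size=n)
t = np.log1p(u * (np.exp(rate * window) - 1)) / rate
growth_gap = window - t

print(uniform_gap.mean(), growth_gap.mean())   # growth-based gaps are much shorter
```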
This thesis investigates the effect of having a family history of exceptional longevity on the risk of contracting age-related diseases. It analyses electronic health records of the offspring (and their partners) of members of long-lived sibships that were collected as part of the Leiden Longevity Study (LLS). Because participants can contract multiple age-related diseases, I work within a recurrent events framework; this is an adaptation of the classical survival analysis framework that allows events to happen repeatedly to an individual. The data have a nested structure: events happen to individuals who are organised within families. As such, a random-effects survival model with two levels of correlation, termed the Nested Frailty model, is applied to the data, with an additional element of event dependence. The thesis consists of three parts. In the first, I derive the likelihood for a Nested Frailty model. Next, two simulation studies explore the possible pitfalls of ignoring the elements of nested frailties and event dependence when these are present in the data, and demonstrate that it is key to include both elements in the model. Finally, the LLS data are analysed. I find that a family history of exceptional longevity is linked with a slower rate of acquisition of age-related diseases.
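A toy sketch of the nested data structure just described: a family-level and an individual-level gamma frailty jointly scale each individual's recurrent-event rate. All rates and frailty variances are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
base_rate, follow_up = 0.1, 50.0

def simulate_family(n_members, fam_var=0.5, ind_var=0.3):
    """Recurrent event times for one family under nested mean-one gamma frailties."""
    z_family = rng.gamma(1 / fam_var, fam_var)           # shared family frailty
    events = []
    for _ in range(n_members):
        z_ind = rng.gamma(1 / ind_var, ind_var)          # individual frailty
        rate = base_rate * z_family * z_ind              # individual event rate
        t, times = 0.0, []
        while (t := t + rng.exponential(1 / rate)) < follow_up:
            times.append(t)                              # recurrent event times
        events.append(times)
    return events

print([len(e) for e in simulate_family(4)])              # event counts per member
```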
Missing data are common in clinical research. How these missing values are handled has a direct impact on the final study results. Existing medical studies commonly use complete case analysis to remove observations with missing values, which has the advantage of being simple and easy, but depletes the information in the original dataset and may result in biased estimates. Multiple imputation (MI) methods are often considered more reliable than complete case analysis, missing indicator methods and single imputation methods. However, recent research comparing a number of MI methods has shown that, particularly where the underlying assumptions are undermined, some MI methods may cause more bias in model estimates than complete case analysis.

To study which methods perform better under which circumstances, this thesis performs a simulation study comparing the results of the above-mentioned techniques under certain types of missingness, such as MCAR, MAR and MNAR. Complications such as missingness that is correlated with survival time are also considered. The various parameter settings for the simulation study are based on a real case study where about 12% of the observations contain missing values for some variables. In addition to basic MICE, two other multiple imputation methods are compared: one with interaction terms between the full variables and the baseline hazard in the imputation model, and the other with a specific substantive model in the iteration. In this thesis, the substantive model is a Cox model. The simulation studies show that the method with interaction terms is not significantly different from MICE and its improvements are of limited applicability. The method with the specific substantive model is more suitable for complex data types and for cases where there are strong correlations between covariates. Besides, basic MICE also performs well in data sets with a high proportion of missing binary covariates, while the missing indicator method produces large bias in many settings, even for full case studies.
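A hedged sketch of chained-equations imputation, with scikit-learn's IterativeImputer standing in for MICE (the survival-specific variants studied in the thesis are not reproduced). The missingness pattern and data are illustrative; the ~12% missingness rate echoes the case study.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates the API)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(10)
n = 500
x1 = rng.standard_normal(n)
x2 = 0.5 * x1 + rng.standard_normal(n)          # correlated covariate
X = np.column_stack([x1, x2])
X[rng.uniform(size=n) < 0.12, 1] = np.nan       # ~12% missing in the second column

imputed = IterativeImputer(random_state=0).fit_transform(X)
print(np.isnan(imputed).any(), imputed[:3])     # no missing values remain
```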
We provide a simulation study complementing the theoretical results of Bos and Schmidt-Hieber (2021) for supervised classification using deep neural networks. Their main risk bound suggests a faster truncated Kullback-Leibler divergence risk convergence rate with smoother conditional class probability functions and when fewer conditional class probabilities are near zero, as well as that the convergence rate is fast when the functions have a high degree of smoothness even if many probabilities are near zero. The proportion of small conditional class probabilities can be measured by the small value bound index α. We calculate α for an illustrative selection of settings with conditional class probability functions that have an arbitrarily high Hölder smoothness index β. We estimate the Kullback-Leibler divergence risk convergence rate in these settings by evaluating networks trained on simulated datasets of various sizes. We find slower convergence rates than suggested by the main risk bound. However, in line with expectations, α has no consistent effect on the convergence rate when combined with arbitrarily high β.
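A minimal sketch of the evaluation idea: estimate a truncated Kullback-Leibler risk of a trained classifier against known conditional class probabilities on simulated data. The network size, truncation level, and the specific truncation convention are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(11)
x = rng.uniform(-1, 1, size=(2000, 1))
p = 1 / (1 + np.exp(-3 * x[:, 0]))            # true smooth P(Y = 1 | x)
y = rng.binomial(1, p)

clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000,
                    random_state=0).fit(x, y)
p_hat = clf.predict_proba(x)[:, 1]

def truncated_kl(p, q, B=10.0):
    """Mean KL between true and estimated class probabilities, log-ratios capped at B."""
    r1 = np.clip(np.log(p / np.clip(q, 1e-12, 1)), None, B)
    r0 = np.clip(np.log((1 - p) / np.clip(1 - q, 1e-12, 1)), None, B)
    return np.mean(p * r1 + (1 - p) * r0)

print(truncated_kl(p, p_hat))
```

Repeating this across sample sizes and regressing log-risk on log-n yields the empirical convergence rates compared against the bound.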
The reliability of statistics is essential for official statistics. With administrative data more often used instead of survey data, non-sampling errors become important factors in the accuracy of statistics. For domain statistics, such as the yearly turnover of enterprises, classification errors occur. This study aims to measure the effect of classification errors on domain statistics, more specifically, the bias and variance due to classification errors. In this study, a new method was developed that applies a Gaussian mixture model, estimated by the EM algorithm, referred to in short as the EM method. Furthermore, another method was introduced that combines the EM method with bootstrapping, referred to as the combined method. Of the two, the EM method only estimates bias, while the combined method is able to estimate both bias and variance. Together with a previously used bootstrap method, the three methods were tested in a simulation study and in a case study. The bias and variance estimates from the three methods were compared with their corresponding true values in different settings. The results showed that the bias estimates from the EM and the combined method were closer to the true values than those from the bootstrap method, and the combined method produced variance estimates closer to the true values than the bootstrap method. The EM and the combined method were equally accurate in estimating the true bias. These results suggest that the EM and the combined method estimated the bias and variance more accurately than the bootstrap method. In practice, the combined method is recommended since both the bias and the variance can be estimated. In a situation with a very large data set, where the variance is usually small and the bias is of most concern, the EM method may be preferred.
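A hedged sketch of the EM ingredient: fitting a two-component Gaussian mixture to a domain's turnover values, where one component could represent misclassified units, and reading off an illustrative bias estimate for the domain mean. The two-component setup, the data, and the bias formula are assumptions for illustration, not the thesis's exact method.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(12)
correct = rng.normal(5.0, 0.5, 900)           # units truly belonging to the domain
misclassified = rng.normal(7.0, 0.7, 100)     # units leaked in from another domain
turnover = np.concatenate([correct, misclassified]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(turnover)  # EM under the hood
weights, means = gmm.weights_, gmm.means_.ravel()

main = np.argmax(weights)                     # assume the larger component is "correct"
bias = turnover.mean() - means[main]          # shift induced by misclassified units
print(weights, means, bias)
```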