Functional Magnetic Resonance Imaging (fMRI) data capturing the BOLD response for various voxels through time for a single subject can reveal idiosyncratic dynamic functional connectivity (dFC) patterns underlying a subject's brain responses. These dFC patterns are known to be related to mental disorders, like schizophrenia (Lynall et al., 2010) and Alzheimer's disease (AD; Gili et al., 2011). Current directions in neuroscience hope to identify possible types and subtypes of mental disorders. In this thesis, we make the assumption that heterogeneity in dFC patterns across subjects may be indicative of such mental disorder (sub)types. To detect these (sub)types based on dFC patterns, we propose the clusterwise INDSCAL model to analyze multi-subject rs-fMRI data, which is a generalization of the K-INDSCAL model of Bocci and Vichi (2011). In this model, subjects with similar dFC patterns are clustered together, whereas subjects with clearly different patterns are allocated to different clusters. As such, clusterwise INDSCAL captures heterogeneity between subjects in dFC patterns and is able to identify unknown disease (sub)types. An Alternating Least Squares (ALS) algorithm to estimate the model parameters is presented, in which the clustering and the model parameters for each cluster are updated alternatingly. This algorithm, along with a model selection heuristic to determine the optimal number of clusters and dFC patterns, is evaluated in an extensive simulation study in which several data characteristics (e.g., signal-to-noise ratio, similarity of clusters) are manipulated. The results show that the CLINDSCAL algorithm can successfully identify the true clustering of the patients and their underlying dFC patterns.
Further, when the spatial overlap in dFC patterns between the clusters increases, the performance of the algorithm in terms of recovering the clustering of the patients decreases. It can be concluded that CLINDSCAL is an interesting tool to discover a natural subject clustering, with subject clusters differing in the dFC patterns underlying their data. Such a clustering may point at the existence of (yet unknown) mental disorder (sub)types.
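The alternating scheme described above — reassign subjects given the cluster models, then refit each cluster's model given the assignment — can be sketched generically. In the hypothetical sketch below the per-cluster INDSCAL fit is replaced by a simple cluster-mean matrix, purely to illustrate the alternation; it is not the actual INDSCAL decomposition or the thesis's algorithm.

```python
import numpy as np

def alternating_fit(S, K, n_iter=20, seed=0):
    """Generic alternating least squares clustering sketch.

    S: array of shape (n_subjects, p, p), one connectivity matrix per
    subject. The per-cluster "model" here is simply the cluster mean
    matrix -- a stand-in for the real per-cluster INDSCAL fit.
    """
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    labels = rng.integers(0, K, size=n)
    for _ in range(n_iter):
        # Model step: refit the per-cluster model (here: the mean matrix);
        # an empty cluster is reseeded with a random subject's matrix.
        centers = np.stack([S[labels == k].mean(axis=0) if np.any(labels == k)
                            else S[rng.integers(n)] for k in range(K)])
        # Clustering step: reassign each subject to the cluster whose
        # model gives the smallest residual sum of squares.
        resid = ((S[:, None] - centers[None]) ** 2).sum(axis=(2, 3))
        labels = resid.argmin(axis=1)
    return labels, centers
```

With well-separated groups the alternation converges in a few iterations; the real algorithm replaces the model step with a least-squares INDSCAL update per cluster.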
The emergence of automated high-throughput phenotyping platforms provides us with the potential for powerful genome-wide association studies (GWAS) using image phenotypes. A major challenge of GWAS on imaging-genetics datasets is to define a meaningful representation of the traits, to make the data amenable to GWAS-based prediction. We propose a different approach of reverse-GWAS mapping, which predicts the genetic markers from the phenotypes rather than the other way around. This method would allow us to use chlorophyll fluorescence images as phenotypes to identify markers that influence photosynthesis efficiency in Arabidopsis thaliana. We implemented several deep learning methods for reverse-GWAS, based on Convolutional Neural Networks (CNNs), including well-known architectures such as ResNet-50, DenseNet-121, and Xception.

The results suggest that these convolutional neural networks are not able to classify SNP markers with high accuracy. The main challenge seems to be the highly overlapping photosynthesis efficiency of alleles a and b, even for the most significant SNP markers detected by the GWAS, which makes it difficult for the models to classify the alleles correctly. Moreover, the individual narrow-sense heritability of the trait is low, indicating that the additive genetic effect is low. Furthermore, the number of accessions used in this study is relatively low compared to the approximately 6,000 registered accessions. Hence, it is possible that accessions with high and low photosynthesis efficiency are not included in the study. For future studies, adding more accessions might improve the accuracy of reverse-GWAS.
Statistical hypothesis testing is central to many scientific fields. Testing many hypotheses simultaneously is called multiple testing. The main concern in multiple testing is to ensure that most of the rejected null hypotheses are indeed false, i.e., that the number of incorrect rejections remains low. A major challenge in multiple testing is to account for the complex dependencies in the data. Permutation-based multiple testing methods are a powerful approach in this regard. These methods make few distributional assumptions; in fact, they often make only one assumption, called joint exchangeability. In this thesis we investigate the robustness of these methods to violations of this assumption. We do this by means of simulations, where we focus on case-control data. We find that, while the theoretical literature always makes the mentioned assumption, it is often not necessary in practice. Thus, this thesis provides further evidence for the validity of these powerful methods in practice.
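To illustrate how such a method exploits joint exchangeability, the sketch below implements a Westfall–Young-style max-T adjustment for a two-group (case-control) comparison: permuting the group labels preserves the joint null distribution across all variables, so dependence is captured automatically. The function and its parameters are illustrative, not the exact procedures studied in the thesis.

```python
import numpy as np

def maxt_adjusted_pvalues(X, y, n_perm=2000, seed=1):
    """Westfall-Young max-T permutation adjustment for case-control data.

    X: (n_samples, m) data matrix; y: binary group labels. Under joint
    exchangeability, permuting y leaves the joint null distribution of
    all m test statistics unchanged.
    """
    rng = np.random.default_rng(seed)

    def tstats(labels):
        a, b = X[labels == 0], X[labels == 1]
        se = np.sqrt(a.var(axis=0, ddof=1) / len(a)
                     + b.var(axis=0, ddof=1) / len(b))
        return np.abs(a.mean(axis=0) - b.mean(axis=0)) / se

    t_obs = tstats(y)
    # Null distribution of the *maximum* statistic over all variables.
    max_null = np.array([tstats(rng.permutation(y)).max()
                         for _ in range(n_perm)])
    # Adjusted p-value: fraction of permutations whose maximum exceeds
    # the observed statistic (controls the FWER under exchangeability).
    return (1 + (max_null[:, None] >= t_obs[None, :]).sum(axis=0)) / (1 + n_perm)
```

Only the single exchangeability assumption enters: no normality or independence across the m variables is required.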
In this study, the combination of multi-state survival analysis and causal inference was used to estimate the probability of an event occurring as a function of treatment timing. This study follows through from the results and recommendations of previous methodological research estimating the average pregnancy probability as a function of intrauterine insemination (IUI) treatment timing, using observational data from a prospective cohort study in the Netherlands. The study applied an illness-death multi-state model with expectation management as the initial state, IUI treatment as the transition state, and pregnancy as the final or absorbing state. To study the performance of causal multi-state survival analysis, multiple datasets were generated using simulations, with a woman's age following a standard normal distribution and treatment timings following an exponential distribution. Five treatment strategies were considered: when a patient receives the treatment without delay; when treatment is delayed by three, six, or nine months from follow-up; and when treatment is delayed indefinitely, i.e., the patient does not receive treatment during the observation period. For each strategy, the pregnancy probability for an individual and the group average were estimated using the causal multi-state model and compared to the calculated true values for an observation period of 1.5 years from the start of follow-up. Variance, bias, and the root mean square error (RMSE) were used as performance measures to assess whether the method can accurately estimate the average pregnancy probabilities by treatment strategy over time. The results from the performance measures indicate that the methodology can provide precise and unbiased estimates.
Future work in this area includes introducing a mechanism for censoring in the data-generating step of the simulation, exploring other probability distributions to generate the transition times, and comparing the results for the multi-state approach with those for other, similar methodologies, such as inverse probability weighting, used to estimate the outcomes of treatment timing from observational data.
Background - Recent studies suggest that adding genomic markers leads to improved risk prediction and reduces over-treatment for women with early-stage breast cancer. A more refined evaluation from a decision-analytic perspective may be informative for patients and policymakers. We aim to examine the clinical utility of two recently proposed genomic markers.

Methods - We reanalyzed aggregated data from the MINDACT and TAILORx trials, including N = 6653 and N = 10253 women, to evaluate the clinical utility of the MammaPrint and OncotypeDX tests. Clinical utility was quantified using Net Benefit, reflecting the relation between 8-year distant metastasis-free interval (DMFI) and the number of women receiving chemotherapy. Net Benefit balances the DMFI and the number of chemotherapy courses by a weight determined by the decision threshold, which indicates the point at which the gain in DMFI is considered sufficiently high to indicate chemotherapy. Key parameters were estimated from the two trials, including distributions for clinical and genomic risks, where the statistical correlation between clinical and genomic risks was retained.

First, a reanalysis was performed for the MINDACT and TAILORx trials separately, comparing the strategies proposed in the trials with decision-analytic modelling. Second, we resimulated from the MINDACT trial population to enable a direct comparison between the MammaPrint and OncotypeDX genomic tests. Here, the effectiveness of chemotherapy was estimated from earlier randomized controlled trials (HR = 0.64), similar to the PREDICT decision support tool. We also assumed a similar clinical risk function.
This approach allowed for estimating individualized risk distributions and expected-benefit distributions differing only in the specific genomic marker used. Third, sensitivity analyses examined how different qualities of the baseline model, the quality of the genomic tests, the effectiveness of chemotherapy, and the decision thresholds affected clinical utility.

Results - Decision-analytic modelling results in a more favorable balance between benefits and harms. For MINDACT, decision-analytic modelling increased the Net Benefit from 0.55 to 0.58, whereas for TAILORx the Net Benefit increased from 0.04 to 0.05. The comparison of the genomic markers shows that OncotypeDX performs similar to or better than MammaPrint. Although the average DMFI was similar (MammaPrint: 92.38 vs OncotypeDX: 92.49), the Net Benefit was higher for OncotypeDX (MammaPrint: 0.31 vs OncotypeDX: 0.50). Moreover, MammaPrint had a prognostic effect of HR = 2.37, whereas the dichotomized OncotypeDX (i.e., dichotomized at the same proportion of high risks as MammaPrint) showed a significantly higher prognostic effect of HR = 3.23. The sensitivity analysis showed that the clinical utility of genomic markers depended on the quality of the baseline model, the effectiveness of chemotherapy, and the decision threshold for the expected benefit.

Conclusions - Decision-analytic modelling confirms that the MammaPrint and OncotypeDX genomic tests both have clinical utility, with OncotypeDX potentially outperforming MammaPrint. Decision-analytic modelling provides detailed information on the expected benefits of treatments, which can assist in shared decision-making about adjuvant chemotherapy. Further validation and direct comparison of MammaPrint and OncotypeDX are needed to optimally compare and evaluate their clinical utility.
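In its standard decision-curve form (Vickers and Elkin's formulation, which the reanalysis above parameterizes in terms of DMFI events and chemotherapy courses), Net Benefit weights true positives against false positives by the odds of the decision threshold. A minimal sketch, with illustrative function and parameter names:

```python
def net_benefit(risk, event, threshold):
    """Standard decision-curve Net Benefit.

    Treat everyone whose predicted risk exceeds `threshold`. True
    positives are traded off against false positives by the odds of the
    threshold, which encodes how many unnecessary treatments one is
    willing to accept per event correctly targeted.
    """
    n = len(risk)
    treated = [r > threshold for r in risk]
    tp = sum(t and e for t, e in zip(treated, event))
    fp = sum(t and not e for t, e in zip(treated, event))
    w = threshold / (1 - threshold)  # harm-to-benefit weight
    return tp / n - w * fp / n
```

A strategy's Net Benefit can then be compared against "treat all" and "treat none" (the latter always scores 0) across a range of thresholds.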
Survival analysis deals with the study of the time until an event of interest occurs. The Cox Proportional Hazards model (Cox model) is commonly used to model the relationship between a survival outcome and a set of cross-sectional covariates, but it cannot handle longitudinal covariates, i.e., covariates that are repeatedly measured over time. Traditional ways to deal with longitudinal covariates include joint modelling, landmarking and the time-dependent Cox model, but to date their applicability has mostly been restricted to problems with a small number of longitudinal covariates.

Recently, the increasing availability of repeated measurements in biomedical studies has motivated the development of statistical methods specifically designed to predict survival from a large (potentially high-dimensional) number of longitudinal covariates. Because such methods are still quite new, little is known about how they perform in practice. The aim of this thesis is to compare the performance of various statistical methods to predict survival on a real dataset where many longitudinal covariates are available as predictors. Four methods were chosen for comparison: two novel methods employing different techniques to harness the longitudinal information, Penalized Regression Calibration (PRC) and the Multivariate Functional Principal Component Cox (MFPCCox) model, and penalized Cox models using landmarking (last observation carried forward) and baseline measurements, respectively. These methods were applied to data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study in the context of dynamic prediction of time to develop dementia.
The ADNI study monitored the development of dementia in a cohort of elderly individuals, and collected an extensive, heterogeneous set of markers over multiple years of follow-up. Predictions were computed using a total of 26 covariates, of which 21 were longitudinal. The predictive performance of the models was evaluated using three performance measures (time-dependent AUC, C-index, and Brier score).

The results showed that the best-performing method depended on the choice of performance measure, landmark time, and prediction time. Landmarking was the best-performing method when looking at the time-dependent AUC and C-index, whereas PRC was the best-performing method in terms of Brier score. Landmarking, PRC, and MFPCCox outperformed the baseline model that ignored the follow-up information, suggesting that the longitudinal information in the ADNI data can be used to improve predictions for dementia. Overall, our results seem to indicate that for the ADNI data a simple approach such as landmarking may be enough to deliver accurate predictions, when compared to more sophisticated approaches (PRC and MFPCCox) that model the trajectories of longitudinal covariates.
Survival analysis studies time-to-event outcomes. One of the main characteristics of survival data is that some survival times are not observed; we call those observations censored. Standard methods to analyze censored data, like the Kaplan-Meier estimator or the Cox proportional hazards model, assume that censoring is independent of the time to event; we call this type of censoring non-informative. In real-life studies, however, this is not always the case, and the censoring may depend on the time to event either directly or through covariates. In that case the censoring is called informative or dependent, and using the standard methods can lead to biased results.

In this thesis we first examined how serious the issue of dependent censoring is, by generating data with dependent censoring using two methods — one with two time-independent covariates and one with two time-independent and one time-dependent covariate — and studying how much bias is introduced if we assume independent censoring in the analysis.

Different approaches have been proposed to correct for dependent censoring; one of them is Inverse Probability of Censoring Weighting (IPCW). In the second part, we perform a simulation study to evaluate the performance of the IPCW method in the presence of dependent censoring, for each of the two methods we examined in the first part.

Results showed that the survival curves estimated by the traditional Kaplan-Meier method have only small bias in most cases. The bias increased as the dependent censoring got stronger.
Overall, the IPCW method performs well and corrects for the presence of dependent censoring, but it is not able to correct the bias fully when the dependency is too strong or when a time-dependent covariate subject to measurement error is introduced.
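A minimal numerical sketch of the IPCW idea (not the thesis's implementation): each observed event is up-weighted by the inverse of the estimated probability of remaining uncensored just before its event time. To handle *dependent* censoring the censoring model would in practice include covariates (e.g., a Cox model for censoring); here a covariate-free Kaplan–Meier estimate of the censoring distribution keeps the sketch short.

```python
import numpy as np

def km(time, event, t_eval):
    """Kaplan-Meier survival probability evaluated at each point of t_eval."""
    order = np.argsort(time)
    time, event = np.asarray(time, float)[order], np.asarray(event, int)[order]
    n = len(time)
    surv, s = [], 1.0
    for i in range(n):
        if event[i]:
            s *= 1 - 1 / (n - i)  # n - i subjects still at risk
        surv.append(s)
    # Right-continuous step function over the sorted observation times.
    return np.array([1.0 if t < time[0]
                     else surv[np.searchsorted(time, t, side='right') - 1]
                     for t in np.atleast_1d(t_eval)])

def ipcw_survival(time, delta, t):
    """IPCW estimate of S(t): weight each observed event by the inverse
    probability of being uncensored just before its event time."""
    time, delta = np.asarray(time, float), np.asarray(delta, int)
    # Censoring survival G, obtained by treating censorings as "events";
    # evaluated at the left limit T- via a small epsilon shift.
    G = km(time, 1 - delta, time - 1e-9)
    contrib = (time <= t) & (delta == 1)
    return 1 - np.sum(contrib / np.clip(G, 1e-12, None)) / len(time)
```

With no censoring the weights are all 1 and the estimator reduces to one minus the empirical distribution function, as expected.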
The technique of whole slide imaging (WSI) boosts the application of deep learning in medical imaging analysis and computational pathology. However, fully supervised learning runs into bottlenecks due to its heavy reliance on manual annotations, which require specific expertise and are expensive to obtain. Self-supervised learning, which is supervised by signals generated from the data itself, is a potential solution. It has been shown to perform as well as supervised learning on ImageNet classification tasks, yet its performance on medical image classification is unexplored. This study evaluates the effectiveness of four self-supervised learning algorithms — SimCLR, MoCo, SwAV and Barlow Twins — for detecting anatomic structures on kidney biopsy WSI. In the pretext task, these self-supervised learning algorithms are trained for 500 epochs with the same backbone architecture, ResNet-50, initialized with weights pre-trained on ImageNet. The evaluation protocol is a semi-supervised linear classifier, implemented using multinomial logistic regression. The results of the classification task show that the features extracted by the four algorithms all achieve good accuracy scores, higher than 85% with only 10% of the labels. Among them, SwAV outperforms the other algorithms both overall and per class. This study shows that self-supervised learning algorithms exhibit potential for more complex tasks related to renal pathology.
A single tree model is very easy to understand and interpret. However, it is often unstable and relatively inaccurate. The aim of this article is to evaluate and improve the performance of single tree algorithms. In total, three single tree algorithms were evaluated: Classification and Regression Trees (CART) applied with the R package 'rpart', evolutionary trees applied with the R package 'evtree', and a new method that combines Bayesian Additive Regression Trees (BART) and born-again trees. We did a benchmark study on six different datasets and found that the evolutionary trees and born-again trees both perform better than CART in terms of accuracy. The relative performance of evolutionary and born-again trees depended on the dataset: evolutionary trees performed better on relatively larger datasets and born-again trees performed better on relatively smaller datasets. However, these single tree methods still showed a large gap in performance compared to BART, especially when applied to large datasets. We conclude that there is still room for improvement of single trees compared to ensemble methods.
Gaussian graphical models (GGMs) are probabilistic models that represent the conditional independence between random variables and present them in a graph. These models are applied in a variety of domains, such as the social sciences, economics and the natural sciences, to visualize the topology of a network. However, traditional GGMs can be improved by conditioning the estimation of the network on another related data source. Such models are called conditional Gaussian graphical models (cGGMs): the graph of the primary data source is estimated conditional on another related data source. Most developments in the field of cGGMs use an ℓ1-norm penalty, which can show shortcomings in certain scenarios. In this thesis, we proposed three simple cGGM estimation methods using ℓ2-norm penalties as an alternative. We conducted a simulation study and a real data analysis to test our proposed estimation methods. Our results demonstrated that several of our proposed estimation methods better reconstruct the network topology when compared to ℓ1-based cGGMs and the GGM.
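To make the ℓ1/ℓ2 contrast concrete: whereas the graphical lasso (ℓ1) zeroes out precision entries directly, the simplest ℓ2-type estimator — a hypothetical stand-in for the thesis's proposed methods, not one of them — inverts a ridge-shrunken covariance matrix and reads edges off the resulting partial correlations by thresholding:

```python
import numpy as np

def ridge_partial_correlations(X, lam=0.1):
    """Simple l2-type GGM sketch: invert a ridge-regularised covariance
    and convert the precision matrix to partial correlations.

    An l2 penalty shrinks rather than sparsifies, so the graph is
    obtained by thresholding the partial correlations afterwards.
    """
    S = np.cov(X, rowvar=False)
    theta = np.linalg.inv(S + lam * np.eye(S.shape[0]))
    # Partial correlation rho_ij = -theta_ij / sqrt(theta_ii * theta_jj).
    d = np.sqrt(np.diag(theta))
    pcor = -theta / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor
```

In a cGGM the same machinery applies, except that the covariance is estimated conditionally on the secondary data source.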
Understanding the causes of phenotypic variation is a crucial element in research areas such as biology and medicine. Knowing the causes of variation allows for higher yields in crops and better detection and treatment of diseases in humans. The amount of phenotypic variation that can be linked to genetic variation is known as heritability. In this thesis project we explore heritability model analysis in the frequentist and Bayesian contexts and compare heritability model analysis using integrated nested Laplace approximation (INLA) with established frequentist and Bayesian methods. When the number of variance components is restricted, INLA can be less computationally intensive and more robust than MCMC methods, and is therefore suitable for genetic heritability model analysis. Through comparison with established methods, in simulation and in fMRI data from the Human Connectome Project, we show that INLA is a computationally efficient and reliable method for heritability model analysis, and that including a spatial correlation structure in INLA can characterise the spatial nature of the imaging data and can potentially offer better detection than methods that do not take this spatial nature into account.
Synchronization between brain signals can be quantified by mathematical approaches. Recent studies have proposed a large variety of synchrony methods to capture the synchrony in brain activity between interacting subjects. However, there is no detailed overview of how each synchrony method performs under different data characteristics. Here we investigate four synchrony methods — corr-entropy, the S-estimator, Global Field Synchrony (GFS) and Omega complexity — under varying data characteristics. These four synchrony methods are applied to time series simulated by two data generation mechanisms: the Roessler system and a linear multivariate autoregressive (MVAR) process. The simulated time series represent the brain activity of subjects, and several data characteristics have been manipulated. The performance of each synchrony method is evaluated by the root mean square error (RMSE) and the correlation coefficient between true and estimated synchrony values. In addition, ANOVA analyses and effect sizes are used to test the influence of the data characteristics on the performance of the synchrony methods. The results show that the S-estimator is always the first or second best-performing method, and that corr-entropy outperforms the other methods when applied to data generated by the Roessler system. The coupling strength and the length of the time series can interact with the synchrony methods and significantly influence the performance of each method. Time series with high synchrony result in good performance of the S-estimator and poor performance of corr-entropy. It turns out that longer time series lead to better performance of every synchrony method.
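Of the four measures, the S-estimator has a particularly compact definition — one minus the normalised entropy of the eigenvalue spectrum of the channel correlation matrix — which can be sketched as follows (a generic implementation of the published formula, not the thesis's code):

```python
import numpy as np

def s_estimator(X):
    """S-estimator of multivariate synchrony.

    X: (n_samples, n_channels) multivariate time series. Returns a value
    in [0, 1]: 0 when all channels are independent (flat eigenvalue
    spectrum), 1 when they are fully synchronized (one dominant
    eigenvalue).
    """
    N = X.shape[1]
    lam = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
    lam = np.clip(lam, 0, None) / N  # normalised spectrum, sums to 1
    # Entropy term, with the convention 0 * log(0) = 0.
    ent = np.sum(lam * np.log(lam, where=lam > 0, out=np.zeros_like(lam)))
    return 1 + ent / np.log(N)
```

Full synchrony concentrates the spectrum on a single eigenvalue (entropy 0, S = 1); independent channels give N equal eigenvalues (maximal entropy, S = 0).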
Understanding the functioning of the brain and how it relates to behavior is one of the primary objectives of neuroscience. The focus of neuroscience has evolved from studying a single brain to studying interactions between multiple brains. In several fields, synchrony in brain responses between individuals has been shown to positively influence psychological processes and lead to better outcomes.

Synchrony is measured from time-series data obtained for each subject's behavior or modality. Comparative studies of synchrony methods have been carried out to gain insight into the similarities and differences between the many measures for evaluating synchrony between subjects using such time-series data. This research only provides a partial picture of the performance of the synchrony methods in terms of capturing synchrony and the conditions in which these methods are optimal. It is still unknown how well the synchrony methods perform when other data characteristics are changed.

The goal of this study is to evaluate the performance of several methods for capturing different types of synchrony between a pair of time series. Two mechanisms are used to generate a pair of time series with a known amount of synchrony between them: (1) two unidirectionally coupled Hénon maps, and (2) a bivariate von Mises distribution. Correlations between the two time series are computed as another definition of true synchrony, to provide a different perspective on true synchrony. In addition, a systematic evaluation of the performance of the synchrony methods on simulated data with various data characteristics is carried out.

For the generated data, coherence and phase synchrony are the two best-performing methods.
Regarding the varied data characteristics, the amount of true synchrony in particular has a large effect on recovery performance. These main effects of the data characteristics are qualified by several two-way and three-way interactions that almost always include the synchrony method and the amount of true synchrony. No synchrony method is perfect under all data characteristics, and none of the synchrony methods in this study is always stable. As a result, using a combination of different synchrony methods to detect synchrony is recommended.
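Phase synchrony, one of the two best performers above, is commonly quantified by the phase-locking value: the magnitude of the average phase-difference vector, with instantaneous phases taken from the analytic (Hilbert) signal. A minimal numpy sketch, not the thesis's implementation:

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the standard FFT construction of the
    Hilbert transform (negative frequencies zeroed, positives doubled)."""
    n = len(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * h)

def phase_locking_value(x, y):
    """PLV: 1 for a constant phase relation, near 0 for drifting phases."""
    dphi = np.angle(analytic_signal(x)) - np.angle(analytic_signal(y))
    return np.abs(np.mean(np.exp(1j * dphi)))
```

Because only the phase difference enters, the PLV is insensitive to amplitude differences between the two signals.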
Product data can be useful to perform environmental impact assessments of product life-cycles. In order to automate such assessments, this research examines methodologies that address the challenges of processing and classifying product data. We consider a large, imbalanced and multilingual dataset with short and noisy product descriptions that have been labeled by human annotators. The product classes are hierarchically ordered and characterized by two levels. To treat the class imbalance, we proposed two data enrichment methods on the training data: oversampling, and a web-scraping method with prior filtering. The web scraper parsed web data using a search engine and used Sentence-BERT with cosine similarity to assess semantically relevant information. In addition, we proposed two classification methods, Support Vector Machines (SVM) and BERT. Both models were evaluated in several experiments considering a flattened and a hierarchical classification approach for the products. We also performed an extensive error analysis on the model results, considering the SVM feature importance and the BERT attention weights.

The results showed that both models achieve similar flattened classification performance using the normal data, i.e., without data enrichment. SVM showed better flattened classification performance after treating the class imbalance with data enrichment. BERT showed poor performance using data enrichment, overfitting the training data. Hierarchical classification improved the classification performance of BERT using oversampling. SVM did not benefit from the hierarchical classification approach and showed better classification performance using flattened classification.
Finally, the error analysis showed that the data contain incorrect or subjective manual labels. The SVM feature importance and BERT attention weight results suggest that non-representative tokens and out-of-vocabulary tokens tend to decrease classification performance.
The purpose of this thesis project is to compare the performance of distributed Gaussian Process (GP) models with their non-distributed counterparts. Distributed methods fit local models on non-overlapping partitions of a sample and make an overall prediction based on these local models. Distributed methods can be run on different machines simultaneously, and they are very useful in reducing the computational cost. It has been shown that distributed GP models are able to achieve results in regression tasks similar to those of the full GP model, but there has been limited discussion of classification tasks. We present a simulation study and a case study to evaluate three distributed Gaussian Process models: the naive average method, the adjusted prior method, and the spatial method. The results showed that in classification tasks these three distributed methods reach results similar to those of the full model, and that the spatial method converged to good metric results most quickly, followed by the adjusted prior method and the naive average method.
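The naive average method can be sketched for regression in a few lines: fit an exact GP on each random partition and average the local predictive means. This is a toy 1-D illustration with an assumed RBF kernel and fixed hyperparameters; real distributed methods (including the adjusted prior and spatial variants) also combine the local predictive variances.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Squared-exponential kernel for 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_mean(Xtr, ytr, Xte, noise=1e-2):
    """Exact GP posterior predictive mean on one partition."""
    K = rbf(Xtr, Xtr) + noise * np.eye(len(Xtr))
    return rbf(Xte, Xtr) @ np.linalg.solve(K, ytr)

def distributed_gp_mean(X, y, Xte, n_parts=4, seed=0):
    """Naive-average distributed GP: fit an exact GP on each random
    partition of the data and average the local predictive means.
    Each O((n/M)^3) local fit can run on a separate machine."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    preds = [gp_mean(X[p], y[p], Xte) for p in np.array_split(idx, n_parts)]
    return np.mean(preds, axis=0)
```

The computational gain comes from replacing one O(n^3) solve with M independent O((n/M)^3) solves, at the cost of some approximation error in the combined prediction.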
Longitudinal data are encountered when repeated measurements are performed on subjects over a period of time. Many models exist to fit longitudinal data, all sharing the feature that explanatory covariates are introduced into the model to explain the observed change over time. A special type of covariate found within the longitudinal framework is the time-varying covariate, such as BMI or blood biomarker levels. These covariates, whose value changes over time, are a blessing in disguise. On the one hand they allow the researcher to better model the change in outcome over time, and thus fit a better model. On the other hand, bias can be introduced when the time-varying covariates depend on the outcome or its previous values.

Time-varying covariates that introduce such bias are called endogenous time-varying covariates: these are covariates whose current value, given their own history, depends on past values of the outcome. In the presence of such endogenous covariates, because of the cross-reliance of the endogenous covariate on the outcome, standard mixed models are no longer valid and one needs to resort to joint modelling of both the outcome and the endogenous covariate.

In this thesis several such joint longitudinal models will be discussed. Our focus will be on Joint Mixed Models and Joint Scaled Models. Both explicitly model the dependence between the outcome and the endogenous covariate, thereby removing the possible bias incurred by the time-varying covariate. We shall show how to fit these models using a novel Bayesian technique called INLA (Integrated Nested Laplace Approximation), which is an elegant technique and a good alternative to the complex and lengthy MCMC estimation procedure.
Although INLA has seen rapid development over recent years, joint longitudinal models have so far received little attention. The goal of this thesis is therefore to implement several joint longitudinal models within the INLA framework and apply them to a simulation study as well as to a synthetic version of a clinical dataset.
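A Joint Mixed Model of the kind discussed above can be sketched as follows; the notation and the shared-random-effects structure are an illustrative assumption, not the exact specification used in the thesis:

```latex
\begin{align*}
  w_i(t) &= \mathbf{x}_{w,i}(t)^{\top}\boldsymbol{\beta}_w
            + \mathbf{z}_i(t)^{\top}\mathbf{b}_i + \varepsilon_{w,i}(t)
            && \text{(endogenous covariate)} \\
  y_i(t) &= \mathbf{x}_{y,i}(t)^{\top}\boldsymbol{\beta}_y
            + \lambda\,\mathbf{z}_i(t)^{\top}\mathbf{b}_i + \varepsilon_{y,i}(t)
            && \text{(outcome)} \\
  \mathbf{b}_i &\sim N(\mathbf{0}, \mathbf{D}), \qquad
  \varepsilon_{w,i}(t) \sim N(0, \sigma_w^2), \qquad
  \varepsilon_{y,i}(t) \sim N(0, \sigma_y^2).
\end{align*}
```

The association parameter $\lambda$ links the two submodels through the shared random effects $\mathbf{b}_i$; setting $\lambda = 0$ recovers two independent mixed models, which is exactly the (biased) approach the joint model is meant to replace.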
Long-term survival rates for most types of cancer have improved significantly in the past decades. However, survivors may encounter various psychological and psychosocial consequences due to their illness or treatment. A consequence often mentioned is depression, which can, even when weakly manifested, negatively affect the quality of life and, likely, even the mortality rate for cancer itself.
In this research, we have analyzed whether survival differs for cancer survivors in different clusters of depressive symptoms. The clusters, referred to as communities in this paper, were found with community detection algorithms from Network Analysis and comprise negative affect, recent negative affect, motivational anhedonia, and consummatory anhedonia.
The results of our study can prove extremely useful in treating cancer survivors who encounter depressive symptoms. While the communities consummatory anhedonia and motivational anhedonia were shown to have large effects on survival, the communities negative affect and recent negative affect showed little influence on survival. We advise that if a cancer survivor encounters diminished pleasure in consuming rewards (consummatory anhedonia) or diminished motivation in pursuing rewards (motivational anhedonia), appropriate treatment and medication are needed.
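The community detection step mentioned above can be illustrated with a simple weighted label-propagation algorithm on a symptom-correlation network. The node names and edge weights below are invented for illustration, and the thesis may well have used a different detection algorithm; the point is only how symptoms group into communities:

```python
import random

# Hypothetical symptom-correlation network; weights are invented for
# illustration only (the thesis estimates its network from patient data).
edges = [
    ("sadness", "worry", 0.7), ("sadness", "irritability", 0.6),
    ("worry", "irritability", 0.5),
    ("low_motivation", "fatigue", 0.8), ("fatigue", "apathy", 0.6),
    ("low_motivation", "apathy", 0.7),
    ("worry", "fatigue", 0.1),  # weak bridge between the two groups
]
adj = {}
for u, v, w in edges:
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w

def label_propagation(adj, n_iter=30, seed=0):
    """Weighted label propagation, one simple community-detection
    algorithm: each node repeatedly adopts the label with the largest
    total edge weight among its neighbours."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}
    nodes = list(adj)
    for _ in range(n_iter):
        rng.shuffle(nodes)
        for v in nodes:
            votes = {}
            for u, w in adj[v].items():
                votes[labels[u]] = votes.get(labels[u], 0.0) + w
            labels[v] = max(votes, key=votes.get)
    groups = {}
    for v, lab in labels.items():
        groups.setdefault(lab, set()).add(v)
    return list(groups.values())

communities = label_propagation(adj)
```

On this toy network the two densely connected symptom triangles separate into two communities, while the weak bridging correlation is not enough to merge them.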
Family Longevity Scores are a potentially useful tool for researchers attempting to identify families in which longevity is clustered, using survival data from multigenerational families. Despite this, a side-by-side evaluation of the way that the unique features of each score affect their performance is lacking. Specifically, a comparison between empirical and model-based scores, and between scores that treat longevity as a binary versus a continuous outcome, is needed in order to identify the score characteristics that most enrich for survival advantage across a wide variety of study scenarios.
The scores of interest in this paper are the Longevity Relatives Count (LRC), the model-based Longevity Relatives Count (mLRC), and two as yet unpublished Beta Regression scores, namely the Beta Agnostic and the Beta Threshold. The analysis is separated into two parts. Part 1 consists of a simulation study with seven scenarios designed to evaluate score performance in the presence of higher or lower family size variation, presence or absence of right-censored observations, and independent or dependent families in the study population. Part 2 involves an application of the scores to a dataset constructed from the Historical Sample of the Netherlands (HSN) and the use of Cox Proportional Hazards models to assess the extent to which increases in Family Longevity Scores are associated with a survival advantage in reference individuals.
This study found that, in scenarios with an absence of right-censoring, all Family Longevity Scores tested were moderately successful as identifiers of longevity clustered within families. The Beta Agnostic score was consistently the most effective score in both parts of the study.
In scenarios where it was present, right-censoring was found to significantly diminish the performance of the unweighted LRC score and to moderately diminish the performance of the mLRC and Beta Regression scores.
This study highlighted the differences in performance displayed by scores with empirical versus model-based structures and binary versus continuous conceptualisations of longevity across a range of study scenarios. Consequently, the situations in which each score may be useful for identifying longevous families were revealed, and areas needing further development were exposed.
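The empirical score family above builds on the idea of counting longevous relatives. The sketch below is a deliberately simplified, hypothetical variant of that idea only; the actual LRC definition involves eligibility rules and weighting that are omitted here:

```python
# Simplified, hypothetical variant of the Longevity Relatives Count idea:
# the share of a person's relatives whose age at death reaches the top
# decile of their birth cohort's survival distribution. The real score's
# weighting and eligibility rules are intentionally left out.
def lrc_like_score(relative_ages, cohort_top_decile_age):
    """Fraction of relatives who reached the cohort's top-10% age."""
    if not relative_ages:
        return 0.0
    longevous = sum(1 for a in relative_ages if a >= cohort_top_decile_age)
    return longevous / len(relative_ages)

# Two of four relatives (95 and 92) reach the hypothetical threshold of 90.
score = lrc_like_score([95, 88, 70, 92], cohort_top_decile_age=90)
```

A family-level score like this can then enter a Cox model as a covariate to test whether higher values are associated with a survival advantage in reference individuals.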
Introduction
Numerical weather prediction (NWP) models are used to forecast weather variables. If an NWP model is initialized at different times but verified at the same time, it creates a time-lagged ensemble. Since the ensemble output from an NWP model is often underdispersed and biased, it needs statistical postprocessing. This thesis studies the best way to postprocess a time-lagged ensemble of wind speed forecasts.
Method
We use the data of the COntinuous Mesoscale Ensemble Prediction System from the Danish Meteorological Institute, where every hour three perturbed members are run and the members of the six latest runs are collected to form an 18-member time-lagged ensemble. We apply the parametric Ensemble Model Output Statistics (EMOS) method, with two underlying distributions (the Truncated Normal (TN) and the Log-Normal (LN) distribution), two parameterization strategies, and six weighting methods. We use different verification methods to compare the forecasts.
Results
Based on the results of the training set, we decide to verify weighting based on age and both parameterization methods (assuming full exchangeability or exchangeability per mini-ensemble). The test set shows that the models perform similarly, but the fully exchangeable model is slightly worse. The models always improve on the raw ensemble based on the Continuous Ranked Probability Score (CRPS) and improve on the raw ensemble for some lead times based on the Brier Skill Score (BSS) for the high wind speed thresholds.
Using the TN distribution results in better calibrated models, and we find no advantage of the LN distribution at predicting high wind speeds.
Conclusion
When dealing with a time-lagged ensemble of wind speed forecasts, we recommend postprocessing by weighting the ensemble members based on the age of the mini-ensemble, and fitting an EMOS model with the TN distribution. This strategy is fast to compute and leads to the best calibrated and most accurate wind speed forecasts.
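The CRPS used for verification can be estimated directly from a finite ensemble via the standard sample form CRPS ≈ E|X − y| − ½ E|X − X′|. The sketch below shows only this generic estimator; the thesis evaluates the closed-form CRPS of the fitted TN/LN predictive distributions, which is not reproduced here:

```python
import numpy as np

def ensemble_crps(members, obs):
    """Sample estimate of the CRPS for a finite ensemble:
    E|X - y| - 0.5 * E|X - X'|, with X, X' drawn from the ensemble.
    Lower values indicate a sharper, better-centred forecast."""
    x = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(x - obs))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

# A well-centred ensemble scores better (lower) than a biased one
# for an observed wind speed of 5 m/s.
good = ensemble_crps([4.8, 5.0, 5.2], obs=5.0)
bad = ensemble_crps([7.8, 8.0, 8.2], obs=5.0)
```

The second term rewards sharpness: without it, the score would reduce to the mean absolute error and ignore ensemble spread entirely.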
Statistical analysis of the data arising from RNA-Sequencing (RNA-Seq) experiments is complicated. We face difficulties in testing, as we cannot rely on asymptotic properties of tests given the small sample sizes. Furthermore, we face a huge multiple testing burden. In this thesis, we will explore the theoretical and empirical properties of permutation tests for RNA-Seq data and compare them to their parametric alternatives. This will be done on both simulated data and a real dataset. In addition to two classical permutation methods, the applicability of a novel permutation test based on sign flipping score contributions will be analysed (Hemerik et al., 2020). We demonstrate that permutation tests are an attractive alternative for simple two-group comparisons, as they have comparable power to parametric tests while having stricter control over the Type I error rate. Furthermore, our parametric simulation indicates that they have superior power to many parametric methods when covariates are included in the model.
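A classical two-group permutation test of the kind compared above can be sketched as follows. This is a minimal mean-difference illustration; the sign-flipping score test of Hemerik et al. (2020) is more involved and is not reproduced here:

```python
import numpy as np

def permutation_test(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in group means:
    repeatedly relabel the pooled observations and compare the
    permuted mean differences with the observed one."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    observed = abs(np.mean(group_a) - np.mean(group_b))
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = abs(np.mean(perm[:n_a]) - np.mean(perm[n_a:]))
        hits += diff >= observed
    # Count the observed statistic itself so the p-value is valid
    # (never exactly zero) even under the null.
    return (hits + 1) / (n_perm + 1)

# Clearly separated groups should yield a small p-value.
rng = np.random.default_rng(1)
p_sep = permutation_test(rng.normal(0, 1, 10), rng.normal(5, 1, 10))
```

Because the null distribution is built from the data itself, the test needs no asymptotic approximation, which is exactly what makes it attractive for the small sample sizes typical of RNA-Seq studies.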