Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
A stricter global sulfur regulation under the International Maritime Organization's MARPOL Annex VI has been in effect since the beginning of 2020, but no monitoring system exists to check whether ships actually comply with the sulfur cap. The thesis devises a systematic approach to a prototype sulfur compliance monitoring system using the state-of-the-art TROPOspheric Monitoring Instrument (TROPOMI), which measures the atmospheric presence of trace gases. Oceanic geographical coordinates are classified by similarity in trace gas concentration level using the k-means clustering method and suitable averaging techniques. The choice of hyperparameters and the final results are statistically formulated and verified. The subsequent longitudinal analysis of temporal trends in trace gas emissions suggests that the sulfur dioxide measurements of TROPOMI are dominated by measurement noise. The thesis concludes that the nitrogen dioxide measurements of TROPOMI can be used to trace maritime anthropogenic activity such as regional shipping routes, which indicates that the approach could be developed further into a global monitoring system for both land and maritime emissions.
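As a rough illustration of the clustering step described in this abstract, the sketch below runs plain k-means (Lloyd's algorithm) on one time-averaged concentration value per ocean grid cell. It is not the thesis's pipeline: the function name `kmeans_1d`, the choice of k = 2 and the `grid_means` values are all made up for illustration.

```python
import random

def kmeans_1d(values, k, iters=100, seed=0):
    """Lloyd's algorithm on scalar values; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # initial centres drawn from the data
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value joins its nearest centre.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each centre becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels

# Hypothetical time-averaged NO2 values, one per grid cell (synthetic numbers).
grid_means = [1.0, 1.1, 0.9, 1.2, 3.0, 3.2, 2.9, 1.0, 3.1, 0.8]
centers, labels = kmeans_1d(grid_means, k=2)

# Cells in the cluster with the higher centre are candidate shipping-lane cells.
lane = max(range(2), key=lambda c: centers[c])
lane_cells = [i for i, lab in enumerate(labels) if lab == lane]
```

Averaging over time before clustering is what makes a single value per cell meaningful here; with noisy daily measurements the clusters would otherwise mostly reflect noise.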
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Ascertainment bias is common in genetic-epidemiological cancer studies, where sampling of high-risk families is outcome-dependent. This results in too many events in comparison to the population and an overrepresentation of young, affected subjects in the sample. The motivating example for this thesis is a family study where the goal is to estimate an unbiased hazard ratio (HR) for the effect of Polygenic Risk Score (PRS), a continuous score based on several Single Nucleotide Polymorphisms (SNPs), on age of breast cancer diagnosis. Weighted Cox model approaches have been proposed in this context; however, their performance has never been evaluated for a continuous covariate. Two different approaches were considered, using time-fixed and time-dependent weights. A simulation study was conducted to assess the performance of the different approaches in scenarios that varied in within-family correlation, family size, sample size and selection criterion. We found that under the null hypothesis, weighted and unweighted models behave similarly. When a covariate effect is assumed, in any scenario where the within-family correlation is low, weighting methods perform better than a naive approach; the same holds for moderate within-family correlation in combination with weak ascertainment. For strong ascertainment and/or strong within-family correlation, coverage of weighting methods is very poor and bias is high. To obtain an unbiased HR for PRS, we used data from high-risk breast cancer families. Inclusion criteria were absence of the high-risk mutations BRCA1 and BRCA2 and breast cancer in at least three female family members, or in two members if at least one had bilateral breast cancer before age 60.
A total of 101 families were selected between 1990 and 2012 by Clinical Genetic Services in four Dutch cities and one Hungarian city, with 323 (55.1%) events. The HR of PRS, adjusted for family history, was 1.29 (95% CI 1.04; 1.60) for the naive model, with a frailty variance of 0.53, which indicates rather strong within-family correlation. For neither of the weighting approaches was the effect of PRS, adjusted for family history in a Cox model, significant (HR 1.09 and 1.09). For the analysis of outcome-dependently sampled survival data, weighting approaches may be used to limit ascertainment bias in some scenarios. Caution is required when this approach is used in scenarios with moderate to strong within-family correlation. No evidence for a significant effect of PRS on age of breast cancer diagnosis was found in this study.
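To make the idea of a weighted Cox model concrete, here is a minimal sketch of a weighted partial log-likelihood for a single continuous covariate, in the Breslow form without special handling of tied event times. The abstract does not state how the time-fixed or time-dependent weights are constructed, so the `weights` argument is a placeholder with one fixed weight per subject, and the data are invented.

```python
import math

def weighted_cox_loglik(beta, times, events, x, weights):
    """Weighted Cox partial log-likelihood for one covariate (Breslow form)."""
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects contribute only through risk sets
        # Weighted risk set: everyone still under observation at time t_i.
        denom = sum(w * math.exp(beta * xj)
                    for tj, xj, w in zip(times, x, weights) if tj >= t_i)
        loglik += weights[i] * (beta * x[i] - math.log(denom))
    return loglik

# Invented toy data: age at diagnosis or censoring, event flag, covariate.
times = [2.0, 3.0, 5.0, 7.0]
events = [1, 0, 1, 1]
prs = [0.4, -0.2, 1.1, 0.0]

# With unit weights this reduces to the ordinary (naive) partial likelihood.
ll_naive_at_zero = weighted_cox_loglik(0.0, times, events, prs, [1.0] * 4)
```

Maximising this function over `beta` gives the log hazard ratio, so the HR is `exp(beta)`. Down-weighting over-sampled affected subjects is how such approaches aim to bring the sample closer to the population.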
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Accurate predictions of survival probabilities can be helpful to determine treatment strategies and shared decision making in medical applications, like cancer prognosis. Traditionally, the Cox proportional hazards (PH) model is used to predict survival. Recently, however, machine learning (ML) has received increased attention. ML methods learn complex relations between explanatory variables and outcomes, without the need to specify these effects beforehand. In contrast, in the Cox PH model, non-linear and interaction effects need to be specified before estimating the model. The flexibility of ML methods is believed to improve predictive accuracy, which drives the application of ML methods to survival data. One of the aims of this thesis was to compare prediction models for survival data based on machine learning methods to the traditional Cox PH model. Predictive ability was assessed using the Brier score, the concordance index (C-index) and calibration plots. Furthermore, software implementation and interpretability were investigated. Two ML methods were considered: partial logistic regression models with artificial neural networks (PLANN) and random survival forest (RSF) models. Predictive performance was studied in a soft tissue sarcoma cohort: a right-censored survival dataset with a small number of explanatory variables. In terms of Integrated Brier Score (IBS) and calibration, the optimally tuned RSF models had predictive performance similar to that of the Cox model. The Cox model had better predictive performance than the RSF models in terms of C-index. One of the neural network (NN) models outperformed Cox in terms of IBS. Also, the NN models were slightly better calibrated than the Cox PH model.
It would be interesting to see whether a Cox model including non-linear effects would outperform the ML methods considered in terms of prediction. Differences between the ML methods and the Cox PH model concern the route towards finding the best predictions. When estimating survival probabilities using ML methods, the focus is mainly on the correct implementation of the ML algorithm: finding suitable tuning parameters, selecting the best set of tuning parameters and running the algorithm, which takes time. On the other hand, when identifying the best-predicting Cox model, time is spent on specifying the model, looking at non-linear effects and evaluating goodness of fit. The initial set of tuning parameters considered for the PLANN approach resulted in non-informative NN models. This showed the importance of thorough knowledge of the characteristics of tuning parameters in the ML methods. The work in this thesis shows how survival prediction can be unreliable if the NN is not properly tuned.
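The concordance index named in this abstract as one of the performance measures can be sketched as follows. This is a simplified version of Harrell's C for right-censored data that ignores pairs with tied times; it is not taken from the thesis, and the data and the `risk` scores are invented.

```python
def concordance_index(times, events, risk):
    """Harrell's C: share of comparable pairs that the risk score orders correctly."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a comparable pair needs an observed event at the earlier time
        for j in range(n):
            if times[j] > times[i]:  # subject j was still event-free after i's event
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # higher predicted risk failed first
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied predictions count as half
    return concordant / comparable

# Invented example: four subjects, the third one censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.3, 0.5]
c_index = concordance_index(times, events, risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering. The C-index measures discrimination only, which is why it can rank models differently from the Brier score or calibration plots.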
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
In this thesis, two regression models for the nonlinear analysis of interaction effects are proposed. The regression models are based on the Optimal Scaling methodology and specifically target the analysis of Factor-by-Curve interactions between a categorical and a continuous variable. The Optimal Scaling methodology was originally developed for analysis of categorical data, but is also applicable to continuous data. It estimates optimal quantifications for the original observed values in an iterative process by maximising the squared multiple correlation coefficient (R²), thereby transforming the original variable. These quantifications are restricted according to a prespecified scaling level, indicating the stringency of the transformation. These scaling levels can restrict the quantifications to be unsmoothed (non)monotone, or to be smooth (non)monotone. Unsmoothed nonmonotone quantifications are not restricted by any relation among the original observed values, whereas the monotone restriction preserves the ordering of the original observed values in the quantifications. The smooth restrictions are similar, but the quantifications are then also smoothed using a spline function. The quantifications can also be restricted to a linear transformation of the original observed values. This (ordinary) Optimal Scaling regression model, however, does not take into account any interaction effects between the variables. The interactions considered in this thesis are Factor-by-Curve interactions, that is, interactions between a categorical variable (factor) and a continuous variable. The models proposed in this thesis will be referred to as the Factor-by-Curve Optimal Scaling regression (FbC-OS-regression) models.
Both models fit a separate curve for the continuous variable in the interaction for each level of the factor. For example, an interaction between a continuous variable and a factor with three levels is fitted with three curves on that continuous variable. The difference between the two proposed models is that they either fit main and interaction effects separately or fit the joint effects in a single term. The models are illustrated with two applications on real data. The advantage of both FbC-OS-regression models, compared to existing methods for modelling Factor-by-Curve interactions, is that the Optimal Scaling methodology allows for monotone restrictions of the effects. This is demonstrated using the applications shown in this thesis, which are fitted using monotone spline restrictions. Results for the fitted FbC-OS-regression models are then compared to fitted linear regression models with interactions. Finally, the two approaches to modelling Factor-by-Curve interactions with OS-regression are compared to each other and to the additive model, which is also suitable for nonlinear analysis of Factor-by-Curve interactions, after which suggestions for further study of the proposed models are given.
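The idea of fitting one monotone curve per factor level can be illustrated with a small sketch. It is not the FbC-OS-regression algorithm: in place of the thesis's monotone spline restrictions it uses ordinary monotone (isotonic) regression through the pool-adjacent-violators algorithm, which is only loosely analogous to an unsmoothed monotone scaling level. The function names and data are invented for illustration.

```python
def pava(y):
    """Pool-adjacent-violators: least-squares nondecreasing fit to the sequence y."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in y:
        blocks.append([v, 1])
        # Merge backwards while neighbouring block means violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)  # every point in a block gets the block mean
    return fit

def factor_by_curve_monotone(x, factor, y):
    """Fit a separate nondecreasing curve of y on x for each factor level."""
    curves = {}
    for level in sorted(set(factor)):
        pts = sorted((xi, yi) for xi, f, yi in zip(x, factor, y) if f == level)
        fitted = pava([yi for _, yi in pts])
        curves[level] = list(zip([xi for xi, _ in pts], fitted))
    return curves

# Invented data: one continuous predictor, a two-level factor, a response.
x = [1, 2, 3, 4, 1, 2, 3, 4]
factor = ["a"] * 4 + ["b"] * 4
y = [1.0, 3.0, 2.0, 4.0, 5.0, 5.0, 4.0, 6.0]
curves = factor_by_curve_monotone(x, factor, y)
```

Each level gets its own curve, so the shape of the effect of `x` may differ between levels while still respecting the ordering of `x` within every level. A linear regression with interaction terms would instead force a straight line per level.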