Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
The Restricted Mean Survival Time (RMST) is a measure of treatment effect that can be used as a replacement for the hazard ratio when the proportional hazards assumption is violated. The idea of the RMST goes back to Irwin (1949) [5]; combined with the formal definition of the survival function, the RMST can be defined as the integral of the survival function up to a time limit τ. Several methods for estimating the RMST are available. The Kaplan-Meier method and the Cox PH model are the most commonly used methods in survival analysis, and they are also suitable for estimating the RMST: the survival curve is estimated first, and the area under it up to τ then gives an estimate of the RMST. To accommodate a more general class of survival time distributions, a flexible parametric model was introduced by Royston and Parmar (2002) [4]. This method follows the same approach as the Kaplan-Meier and Cox PH models: a survival function is estimated from the model, after which a 15-point Gauss-Kronrod quadrature is used to compute its integral, yielding an estimate of the RMST. The final option is the pseudo-observation method proposed by Andersen et al. (2004) [3]. This method first constructs a pseudo-observation of the RMST for each subject. Using these pseudo-observations as the outcome variable, a generalized linear model can then be built to describe the relationship between the covariates and the RMST, and its parameters can be estimated with generalized estimating equations (GEE) [8]. These methods were compared under various simulation scenarios for this thesis.
The Kaplan-Meier method is simple to compute and performs well with early time limits and low censoring proportions. It also estimates the RMST faster than the Cox model and the flexible parametric model. However, it cannot be adjusted for covariates, so it is only suitable for estimating the average RMST difference in a population. The unstratified Cox model performed well in datasets that satisfied the proportional hazards assumption, and the stratified Cox model also performed well in our simulated non-proportional hazards datasets. The performance of the flexible parametric model was similar to that of the Cox model, but its integration step is more time-consuming. The pseudo-observation method offered the shortest computation time among all four methods. However, when estimating the RMST difference for a subject with a given age and gender, it performed worse than either the Cox model or the flexible parametric model.
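As an illustration of the Kaplan-Meier route to the RMST described above (a minimal sketch, not code from the thesis; the helper name km_rmst and the toy data are invented), the survival curve is estimated as a step function and its area up to τ is accumulated step by step:

```python
import numpy as np

def km_rmst(time, event, tau):
    """RMST as the area under the Kaplan-Meier curve up to tau."""
    order = np.argsort(time)
    time, event = np.asarray(time)[order], np.asarray(event)[order]
    surv, rmst, prev_t, prev_s = 1.0, 0.0, 0.0, 1.0
    for t in np.unique(time[event == 1]):       # distinct event times
        if t > tau:
            break
        at_risk = np.sum(time >= t)
        d = np.sum((time == t) & (event == 1))
        rmst += prev_s * (t - prev_t)           # area of the preceding step
        surv *= 1 - d / at_risk                 # Kaplan-Meier update
        prev_t, prev_s = t, surv
    rmst += prev_s * (tau - prev_t)             # final partial step up to tau
    return rmst

# toy data: follow-up times with censoring indicator (1 = event, 0 = censored)
t = [2, 3, 3, 5, 7, 8, 9, 12]
e = [1, 1, 0, 1, 0, 1, 1, 0]
print(km_rmst(t, e, tau=10))  # -> 6.775
```

The same area-under-the-curve step applies to the Cox and flexible parametric methods; only the source of the estimated survival function differs.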
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
This study deals with the introduction of a customer lifetime value for business customers, with a focus on lifetime estimations using mobile contracts that are part of larger business contracts of a large Dutch telecom provider. Customer lifetime value is the total profit or loss to a company over the whole period of transactions by a customer. Business customers are defined here as firms, or locations of large firms, that are contracted for one or more business products of the telecom provider. Customer lifetime values are calculated at the level of mobile contracts and afterwards aggregated per location. Calculating customer lifetime values requires individual lifetime predictions and a definition of the values. The lifetime predictions amount to a survival analysis that models the time from becoming contract-free until one of three possible decisions (contract renewal, product migration, or contract termination) is made. Using survival estimates and semi-parametric models, the overall survival is analyzed, as well as the influence of characteristics of the locations and of the companies to which the locations belong. Then, with the R package mstate, competing risks models are applied to model the time to each decision while taking the other possible decisions into account. Additionally, the lifetime estimations resulting from the competing risks models are updated, whereby the survival analysis starts several months after becoming contract-free. Results show that approximately 25% of the decisions have been made at the start of the study. The duration of mobile contracts and the ownership of a business internet product or a mobile internet product next to the mobile contract discriminate most between the occurrence of the decisions.
Furthermore, results of the competing risks models show that the probabilities of making any decision attenuate over time. This is confirmed with a fictional product offer at both the level of the mobile contract and that of the business customer. The customer lifetime value as described here is a useful metric for the telecom provider to make customer selections and, after applying it to other business products, it could be used to discriminate between product offers.
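The competing risks step can be illustrated with a small Python sketch of the nonparametric cumulative incidence (Aalen-Johansen) estimator, the quantity that packages such as mstate compute; the function name and toy data below are invented for illustration (cause 0 denotes censoring, causes 1-3 the three decisions):

```python
import numpy as np

def cuminc(time, cause, t_eval):
    """Cumulative incidence per cause at t_eval (Aalen-Johansen)."""
    time, cause = np.asarray(time, float), np.asarray(cause)
    causes = sorted(set(cause) - {0})
    surv = 1.0                                  # overall survival, left limit
    cif = {k: 0.0 for k in causes}
    for t in np.unique(time[cause > 0]):        # distinct event times
        if t > t_eval:
            break
        at_risk = np.sum(time >= t)
        for k in causes:                        # hazard of each cause times S(t-)
            d_k = np.sum((time == t) & (cause == k))
            cif[k] += surv * d_k / at_risk
        surv *= 1 - np.sum((time == t) & (cause > 0)) / at_risk
    return cif

# toy data: 1 = renewal, 2 = migration, 3 = termination, 0 = censored
t = [1, 2, 2, 3, 4, 5, 6, 7]
c = [1, 2, 0, 3, 1, 0, 2, 1]
print(cuminc(t, c, t_eval=6))  # cause-specific probabilities at month 6
```

The cause-specific incidences and the remaining survival probability sum to one, which is the sense in which each decision is modeled "while taking the other possible decisions into account".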
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
In this project a new approach to forecasting infectious disease epidemics was tested in a simulation and applied to data from the 2014-2016 Ebola epidemic. GLMs were applied to the (simulated) data, from which the key quantities, the contact rate and the epidemic size, could be obtained. With (non-)parametric bootstrapping, the GLM results could be assessed, and the key quantities were obtained and subsequently used to produce forecasts. Forecasting intervals were constructed to show the accuracy of the forecasts in terms of epidemic size and duration. Simulation results suggested that the method underestimated the eventual epidemic size and overestimated the contact rate. However, applying the method to a real-life data set resulted in overestimation of the eventual epidemic size. The contact rate estimates obtained from the real-life data should be compared with estimates from the literature before the results can be given a meaningful interpretation. Both the simulation and the application gave variable estimates of the epidemic duration, although a positive relation was seen between epidemic size and epidemic length. Estimates of the contact rate could be improved. The major issues with prediction were attributable to exact collinearity introduced by the systematic model; the major issues with forecasting were attributable to extreme estimates of the epidemic size. The cause of both issues lies in the GLMs that were fit to the data.
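The general workflow, fitting a GLM to (simulated) incidence data and using a parametric bootstrap to attach an interval to a forecast, can be sketched as follows. This is not the thesis's model: the log-quadratic trend, the forecast horizon, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def poisson_irls(X, y, n_iter=50):
    """Fit a log-link Poisson GLM by iteratively reweighted least squares."""
    beta = np.linalg.lstsq(X, np.log(y + 1.0), rcond=None)[0]  # warm start
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        z = X @ beta + (y - mu) / mu           # working response
        W = mu                                  # Poisson IRLS weights
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

# weekly case counts simulated from a log-quadratic trend, a simple
# GLM approximation to a single epidemic wave (numbers illustrative)
weeks = np.arange(20)
y = rng.poisson(np.exp(1.0 + 0.8 * weeks - 0.04 * weeks ** 2)).astype(float)

X = np.column_stack([np.ones(20), weeks, weeks ** 2])
beta = poisson_irls(X, y)

# parametric bootstrap: resample counts from the fitted means, refit,
# and forecast the remaining epidemic size over weeks 20-39
fut = np.arange(20, 40)
Xf = np.column_stack([np.ones(20), fut, fut ** 2])
sizes = [np.exp(Xf @ poisson_irls(X, rng.poisson(np.exp(X @ beta)).astype(float))).sum()
         for _ in range(200)]
lo, hi = np.percentile(sizes, [2.5, 97.5])
print(f"remaining size: {np.exp(Xf @ beta).sum():.0f}, 95% interval [{lo:.0f}, {hi:.0f}]")
```

The spread of the bootstrap distribution of forecast sizes illustrates how extreme refitted estimates of the epidemic size can blow up a forecasting interval, one of the issues noted above.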
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
The area under the receiver operating characteristic (ROC) curve (AUC) is a commonly used measure of the discriminative ability of a model. For the time-to-event outcome in survival analysis, the case and control sets vary over time, so a dynamic definition of the AUC is required. We choose the dynamic AUC defined by the incident true positive rate and the dynamic false positive rate (I/D AUC), proposed by Heagerty and Zheng [6]. However, the incident true positive rate is difficult to obtain empirically, which hampers the estimation of the dynamic AUC. Several semi-parametric and non-parametric estimators have therefore been proposed. Heagerty and Zheng [6] proposed a semi-parametric estimation method based on the Cox model. A non-parametric estimator using an intermediate concordance measure with LOWESS smoothing was introduced by van Houwelingen and Putter [14]. Based on the same intermediate concordance measure, Saha-Chaudhuri and Heagerty suggested using locally weighted mean rank smoothing [10]. Recently, Shen et al. proposed a semi-parametric method that adopts fractional polynomials to fit the dynamic AUC [12]. In this thesis, we compare the performance of these methods under different configurations in a series of simulations. The plain Cox method is not recommended when the proportional hazards assumption is not satisfied. The Cox model with time-varying coefficients is relatively stable when the marker has a mediocre effect. For the non-parametric methods, too wide a span/bandwidth may lead to large bias, and too narrow a span/bandwidth may lead to unstable estimates; a trade-off between bias and standard deviation therefore has to be made. For the fractional polynomial method, adding extra fractional polynomial terms does not improve performance.
In addition, many researchers have observed a decreasing trend of the I/D AUC over time in their empirical studies [10][12][6], yet Pepe et al. held the opinion that the I/D AUC may be an increasing function of time [7]. We investigate the trend of the I/D AUC under a Cox model with a binary marker. We observe that under certain Cox models the I/D AUC curve first increases and then decreases; thus the I/D AUC is not necessarily a decreasing function of time.
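The intermediate concordance measure underlying the non-parametric estimators can be sketched in Python (an illustrative fragment, not the thesis code): at each event time, the incident case's marker is compared with the markers of the subjects still at risk beyond that time, and these raw proportions are what LOWESS or local mean-rank smoothing is then applied to.

```python
import numpy as np

def incident_dynamic_auc(time, event, marker):
    """Raw I/D AUC value at each event time: the proportion of dynamic
    controls (subjects with event or censoring time beyond t) whose
    marker lies below the incident case's marker; ties count one half."""
    time, event, marker = map(np.asarray, (time, event, marker))
    out = []
    for i in np.where(event == 1)[0]:
        controls = marker[time > time[i]]       # dynamic control set at t_i
        if len(controls) == 0:
            continue
        p = (np.sum(controls < marker[i])
             + 0.5 * np.sum(controls == marker[i])) / len(controls)
        out.append((time[i], p))
    return out

# toy data: higher marker values lead to earlier events,
# so early event times should yield AUC values near 1
t = [1, 2, 3, 4, 5, 6]
e = [1, 1, 1, 1, 1, 0]
m = [5.0, 4.0, 4.5, 2.0, 1.0, 3.0]
for ti, a in incident_dynamic_auc(t, e, m):
    print(ti, a)
```

Smoothing these pointwise values over time (the span/bandwidth choice discussed above) yields the estimated I/D AUC curve.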
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
Data is often collected in an aggregated fashion, for instance as categories, in intervals, or in predefined areas. In order to estimate the underlying, continuous distribution of an aggregated variable, the penalized composite link mixed model (PCLMM) can be used. The PCLMM only assumes that the underlying distribution is smooth, so it can be used to estimate any nonparametric regression function. The model is a combination of the generalized linear mixed model, penalized B-splines, and the composite link model. In this thesis, the mathematical framework of these three well-known techniques is described, after which the close connection between them and the PCLMM is used to give a mathematical description of the estimation technique. Using a simulation of a one-dimensional function and an example on Q-fever cases in the Netherlands in 2009, it is shown that the PCLMM can accurately estimate even the smaller details of the underlying distribution if covariate information at the finer scale is available. Decent approximations of the underlying distribution are obtained when covariate data is only available at the aggregated scale.
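The core of the composite link machinery can be sketched in Python with the fixed-penalty penalized composite link model, of which the PCLMM is the mixed-model variant in which the penalty weight is estimated from the data rather than fixed. The grid sizes, toy data, and λ below are illustrative assumptions, and the sketch uses a discrete difference penalty in place of the full B-spline basis.

```python
import numpy as np

def pclm(y, C, lam=1.0, n_iter=100):
    """Penalized composite link model: estimate a smooth latent
    distribution gamma on a fine grid from aggregated Poisson counts
    y ~ Poisson(C @ gamma), with a second-order difference penalty,
    fitted by penalized Fisher scoring."""
    m = C.shape[1]
    D = np.diff(np.eye(m), 2, axis=0)           # second-order differences
    P = lam * D.T @ D
    beta = np.log(np.full(m, y.sum() / m))      # flat start, gamma = exp(beta)
    for _ in range(n_iter):
        gamma = np.exp(beta)
        mu = C @ gamma                           # composed (aggregated) means
        X = C * gamma                            # d mu / d beta = C diag(gamma)
        grad = X.T @ (y / mu - 1) - P @ beta
        H = X.T @ ((1.0 / mu)[:, None] * X) + P  # penalized Fisher information
        step = np.linalg.solve(H, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-8:
            break
    return np.exp(beta)

# toy example: a smooth distribution on 20 fine cells observed
# only as 5 aggregated bins of 4 cells each
rng = np.random.default_rng(0)
x = np.arange(20)
true_gamma = 100 * np.exp(-0.5 * ((x - 9.5) / 4) ** 2)
C = np.kron(np.eye(5), np.ones(4))               # composition (aggregation) matrix
y = rng.poisson(C @ true_gamma).astype(float)
g = pclm(y, C, lam=10.0)
print(np.round(g, 1))                            # smooth fine-scale estimate
```

The composition matrix C is all the model needs to know about how the data were aggregated, which is why the same machinery handles categories, intervals, or areas.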