In this project a new approach to forecasting infectious disease epidemics was tested in a simulation study and applied to data from the 2014-2016 Ebola epidemic. GLMs were fitted to the (simulated) data, from which the key quantities, the contact rate and the epidemic size, could be obtained. With (non-)parametric bootstrapping, the GLM results could be assessed, and the key quantities were estimated and subsequently used to produce forecasts. Forecasting intervals were constructed to show the accuracy of the forecasts in terms of epidemic size and duration. The simulation results suggested that the method underestimated the eventual epidemic size and overestimated the contact rate. However, applying the method to a real-life data set resulted in overestimation of the eventual epidemic size. The contact rate estimates from the real-life application should be compared with estimates from the literature before the results can be interpreted meaningfully. Both the simulation and the application gave variable estimates of the epidemic duration, although a positive relation was seen between epidemic size and epidemic length. Estimates of the contact rate could be improved. The major issues with prediction were attributable to exact collinearity introduced by the systematic model; the major issues with forecasting were attributable to extreme estimates of the epidemic size. Both issues originate in the GLMs that were fitted to the data.
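The abstract does not specify the exact systematic model, so the following is only a minimal sketch of the general workflow it describes: fit a GLM to (simulated) epidemic incidence, then use a parametric bootstrap to turn the fitted model into a forecast interval for the eventual epidemic size. The log-quadratic Poisson trend, the observation window, and all variable names are illustrative assumptions, not the thesis's model.

```python
# Hedged sketch (not the thesis's exact model): a Poisson GLM for weekly
# incidence with a log-quadratic time trend, plus a parametric bootstrap
# that refits on simulated counts to obtain a forecast interval for the
# eventual epidemic size. All parameter values are illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# --- simulate an epidemic curve (the true model is unknown in practice) ---
weeks = np.arange(40)
true_mu = np.exp(1.0 + 0.35 * weeks - 0.009 * weeks**2)   # rise-and-fall incidence
cases = rng.poisson(true_mu)

# --- fit the GLM on the observation window (first 20 weeks, say) ---
obs = 20
X = sm.add_constant(np.column_stack([weeks, weeks**2]))
fit = sm.GLM(cases[:obs], X[:obs], family=sm.families.Poisson()).fit()

# --- parametric bootstrap: resample counts from the fitted means, refit,
#     extrapolate, and sum to obtain the eventual epidemic size ---
B, sizes = 500, []
mu_hat = fit.fittedvalues
for _ in range(B):
    boot_cases = rng.poisson(mu_hat)
    boot_fit = sm.GLM(boot_cases, X[:obs], family=sm.families.Poisson()).fit()
    sizes.append(boot_fit.predict(X).sum())      # forecast of the total size

lo, hi = np.percentile(sizes, [2.5, 97.5])
print(f"true size {cases.sum()}, 95% forecast interval ({lo:.0f}, {hi:.0f})")
```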
Medical researchers frequently claim that one model predicts survival better than another, and are frequently challenged to provide rigorous statistical justification for these claims. In general, it is important to quantify how well a model distinguishes between high-risk and low-risk subjects (discrimination), and how well it predicts the probability of having experienced the event of interest prior to a specified time t (predictive accuracy). For ordinary, right-censored survival data, the two most popular measures of discrimination and predictive accuracy are the concordance index, or c-index (Harrell et al. 1986), and the prediction error based on the Brier score (Graf et al. 1999). In the absence of censoring, it is straightforward to define and estimate these measures. Adaptations of these simple estimates for right-censored survival data have been proposed and are now in common use. The novel part of this thesis is the development of methods for estimating the concordance index and the Brier score prediction error in the context of interval-censored survival data. The starting point is interval-censored data of the form (L_i, R_i] for subjects i = 1, ..., n, with L_i < R_i (L_i may be 0 and R_i may be infinity to accommodate right-censored observations), and a given prediction model yielding a single (estimated) baseline hazard h_0(t) and one vector of (estimated) regression coefficients β. From this prediction model, prognostic scores β^T x_i and predicted survival probabilities S(t | x_i) = exp(−H_0(t) exp(β^T x_i)) may be calculated for each subject i. Methods to estimate the concordance index and the Brier score prediction error for exponential and Weibull baseline hazards are proposed and evaluated in a simulation study. An application to real data is also provided.
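To make the ingredients concrete, the sketch below computes the quantities the abstract names for a Weibull baseline: prognostic scores β^T x_i, predicted survival S(t|x_i) = exp(−H_0(t) exp(β^T x_i)), and Harrell's c-index on fully observed event times as an uncensored baseline. The interval-censored estimators developed in the thesis are not reproduced here; the parameter values and names are illustrative assumptions.

```python
# Hedged sketch, not the thesis's estimators: Weibull PH model with baseline
# cumulative hazard H0(t) = (t / scale)**shape, prognostic scores beta'x_i,
# predicted survival S(t|x_i) = exp(-H0(t) * exp(beta'x_i)), and Harrell's
# c-index computed on uncensored event times. Values are illustrative.
import numpy as np

rng = np.random.default_rng(7)
n, shape, scale = 300, 1.5, 10.0
beta = np.array([0.8, -0.5])
X = rng.normal(size=(n, 2))

eta = X @ beta                                   # prognostic scores beta'x_i
u = rng.uniform(size=n)                          # inverse-transform Weibull PH times
T = scale * (-np.log(u) / np.exp(eta)) ** (1.0 / shape)

def surv(t, eta, shape=shape, scale=scale):
    """Predicted survival S(t|x) = exp(-H0(t) * exp(eta))."""
    H0 = (t / scale) ** shape
    return np.exp(-H0 * np.exp(eta))

def harrell_c(time, eta):
    """Concordance on uncensored data: a higher score should mean an earlier event."""
    conc = comp = 0
    for i in range(len(time)):
        for j in range(i + 1, len(time)):
            if time[i] == time[j]:
                continue
            comp += 1
            earlier, later = (i, j) if time[i] < time[j] else (j, i)
            conc += (eta[earlier] > eta[later]) + 0.5 * (eta[earlier] == eta[later])
    return conc / comp

print("S(5|x) for first 3 subjects:", surv(5.0, eta[:3]).round(3))
print("uncensored c-index:", round(harrell_c(T, eta), 3))
```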
The increasing dimensionality of data, for example in genetics, requires additional assumptions on regression models in order to obtain estimates of the regression coefficients, known as regularization. The lasso, one of the available regularization methods, imposes an L1 constraint on the vector of coefficients, leading to the desirable feature of setting some coefficients to zero. This results in sparse and, more importantly, estimable models. However, setting individual coefficients to zero becomes problematic when one wants a group of variables to be in or out of the model together, e.g. a factor coded as several dummy variables. This problem is solved by the group lasso. This paper presents an extension of the algorithm presented by Goeman for optimizing the penalized log-likelihood under the proportional hazards model, which combines the gradient ascent algorithm with the Newton-Raphson algorithm. The methodology is then applied to data from the Carema case-cohort study, the aim of which is to assess the additional predictive value for 10-year risk of coronary heart disease.
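As a rough illustration of the group-lasso idea in the Cox setting, the sketch below applies a groupwise soft-thresholding (proximal) step after a gradient ascent step on the Breslow partial log-likelihood. This is a simplified proximal gradient scheme, not the combined gradient-ascent/Newton-Raphson algorithm of Goeman that the paper extends; the no-ties assumption, the fixed step size, and all names are illustrative.

```python
# Hedged sketch under simplifying assumptions (no tied event times, fixed step):
# proximal gradient ascent for the group-lasso-penalized Cox partial
# log-likelihood, showing the groupwise shrinkage that sets whole groups to zero.
import numpy as np

def cox_loglik_grad(beta, X, time, event):
    """Gradient of the Breslow partial log-likelihood (assumes no tied times)."""
    order = np.argsort(-time)                 # process risk sets from latest to earliest
    X, time, event = X[order], time[order], event[order]
    w = np.exp(X @ beta)
    cum_w = np.cumsum(w)                      # sum of exp(eta) over each risk set
    cum_wx = np.cumsum(w[:, None] * X, axis=0)
    grad = np.zeros_like(beta)
    for i in np.flatnonzero(event):
        grad += X[i] - cum_wx[i] / cum_w[i]
    return grad

def group_prox(beta, groups, step, lam):
    """Groupwise soft-thresholding for the penalty lam * sum_g sqrt(p_g) * ||beta_g||."""
    out = beta.copy()
    for g in set(groups):
        idx = np.flatnonzero(groups == g)
        norm = np.linalg.norm(out[idx])
        shrink = step * lam * np.sqrt(len(idx))
        out[idx] = 0.0 if norm <= shrink else out[idx] * (1 - shrink / norm)
    return out

def fit_group_lasso_cox(X, time, event, groups, lam=0.5, step=1e-3, n_iter=2000):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = group_prox(beta + step * cox_loglik_grad(beta, X, time, event),
                          groups, step, lam)
    return beta

# tiny illustrative run: two groups of two covariates each
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
time = rng.exponential(scale=np.exp(-(0.7 * X[:, 0] - 0.7 * X[:, 1])))
event = rng.uniform(size=200) < 0.8
groups = np.array([0, 0, 1, 1])
print(fit_group_lasso_cox(X, time, event, groups).round(3))
```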