As the gathering of data becomes easier, for instance through computers and the internet, datasets keep growing larger, which makes it more difficult to find an appropriate way to analyze them. In this thesis I analyze the data of de Grote Griepmeting (Influenzanet), a dataset containing over 300,000 measurements. The goal of this thesis is to find out whether weather conditions have an effect on the incidence of influenza. To this end, the data of de Grote Griepmeting are combined with weather variables obtained from the KNMI. The data contain repeated measurements and multicollinearity, the covariates are likely to be nonlinearly related to the response variable, and because influenza is a contagious disease there is most likely dependence between subjects. All of these issues need to be accounted for in the analysis. Several ways to analyze the data are considered: the Cox proportional hazards model, a logistic regression model, generalized estimating equations, and the generalized linear mixed model. Finally, it was decided to use a logistic regression, with a lasso penalty to account for multicollinearity and B-splines for nonlinearity. The B-splines require many extra variables to be created, so the data expand even further. Computations become burdensome, and a trick from case-control studies in the medical field is introduced to reduce computational time.
The health of patients in the ICU is endangered by hospital-acquired infections. For some of these infections it is unclear whether the infection prolongs ICU stay. Simply comparing the lengths of stay of uninfected and infected patients would not give a valid answer: this comparison suffers from immortal time bias, in which we compare the survival of groups that are formed during the survival time itself. The three-state illness-death model does allow us to make claims about the lengthening effect of infection. We distinguish three data types: complete information (infection state monitored daily), panel data (fragmented monitoring of the infection state during the stay), and endpoint-only data (no monitoring of the infection state during the stay). We show that piecewise constant transition rates can be estimated from each of the three data types. All methods are implemented in R, along with a number of functions that can be applied to fitted models. Simulation studies and applications to real data show that estimating piecewise constant models from endpoint-only data often results in very wide confidence intervals for the transition rates. The confidence intervals can be narrowed somewhat with model restrictions, but useful results are difficult to obtain from endpoint-only data. Results are discussed, with suggestions for further research.
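The immortal time bias can be demonstrated with a small simulation of the illness-death model. In the sketch below (invented rates, not from the thesis), infection does not change the exit rate at all, yet the naive comparison still shows infected patients staying much longer, simply because they had to survive uninfected long enough to acquire the infection.

```python
import numpy as np

# Three-state illness-death model with constant transition rates
# (illustrative values only):
lam_01 = 0.05  # admission -> infection
lam_02 = 0.10  # admission -> discharge/death, uninfected
lam_12 = 0.10  # infection -> discharge/death (same as lam_02: no effect)

rng = np.random.default_rng(1)

def simulate_patient(rng):
    # Competing exponential clocks out of the uninfected state.
    t_inf = rng.exponential(1 / lam_01)
    t_out = rng.exponential(1 / lam_02)
    if t_out < t_inf:
        return t_out, False                       # left ICU uninfected
    return t_inf + rng.exponential(1 / lam_12), True  # infected, then left

stays = [simulate_patient(rng) for _ in range(10_000)]
mean_infected = np.mean([s for s, inf in stays if inf])
mean_uninfected = np.mean([s for s, inf in stays if not inf])
```

Even though `lam_12 == lam_02`, `mean_infected` comes out far above `mean_uninfected`; the illness-death model avoids this by estimating the transition rates themselves rather than comparing the self-selected groups.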
We present a quadratic difference penalty on logistic regression as a solution to the high-dimensionality and spatial-correlation problems in the classification of genetic copy number data. The quadratic difference penalty is the squared L2 norm of the first-order difference matrix times the coefficient vector, and thereby shrinks adjacent regression coefficients towards each other. We propose an L2 fused lasso, a logistic lasso with an extra quadratic difference penalty, and a smoothed logistic regression, a logistic regression with only a quadratic difference penalty. We construct algorithms for both penalized regressions. We explain the connection between our smoothed logistic regression and ridge regression. We demonstrate the challenges in fitting a lasso, and adapt the gradient ascent algorithm. The L2 fused lasso and smoothed logistic regression are applied to genetic copy number data to classify the grade of bladder tumors.
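The smoothed logistic regression can be written directly as a penalized likelihood and fitted with a general-purpose optimizer. The sketch below uses invented data and an arbitrary penalty weight; it is only meant to show the shape of the objective, with D the first-order difference matrix so that the penalty is lambda * ||D beta||^2.

```python
import numpy as np
from scipy.optimize import minimize

# Invented data with a smooth true coefficient profile, mimicking the
# spatial correlation of copy-number probes along the genome.
rng = np.random.default_rng(2)
n, p = 200, 30
X = rng.normal(size=(n, p))
beta_true = np.sin(np.linspace(0, np.pi, p))
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(int)

D = np.diff(np.eye(p), axis=0)   # (p-1) x p first-order difference matrix
lam = 5.0                        # penalty weight (arbitrary here)

def neg_penalized_loglik(beta):
    eta = X @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    return -loglik + lam * np.sum((D @ beta) ** 2)

fit = minimize(neg_penalized_loglik, np.zeros(p), method="L-BFGS-B")
beta_hat = fit.x
```

Because the penalty is quadratic, the objective stays smooth and convex, which is what links this estimator to ridge regression; adding an L1 term on top of it gives the L2 fused lasso.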
The data used for this thesis concern bacterial vaginosis (BV) and have some special characteristics: the numerical values are semi-quantitative, the response is categorical (BV negative, intermediate, and BV positive), and the data are high-dimensional. Categorical regression (CATREG) is a method that can be used to analyze such data. To determine how well CATREG predicts future outcomes from these data, it is compared to Random Forests, one of the gold standards in statistical learning. The dataset was randomly divided into a training set and a test set. The training set was used for variable selection and for determining the values of the regularization parameters; the test set was used for estimating the prediction accuracy. Based on the training set, a Random Forests model and a CATREG model were chosen and used for prediction. Random Forests and CATREG both classify 68% of the outcomes correctly, but neither model distinguishes well between intermediate and BV-positive women. When the intermediate and BV-positive women are taken together, the percentage of correctly classified women increases to 95% and 97% for Random Forests and CATREG, respectively. Overall, this analysis showed that CATREG predicts as well as Random Forests and can therefore be considered a worthwhile alternative.
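The train/test protocol described above is standard and can be sketched briefly for the Random Forests arm. The data below are invented three-class placeholders (the real semi-quantitative BV measurements are not reproduced), so the resulting accuracy is not comparable to the figures reported in the thesis.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data: 10 semi-quantitative features, three outcome classes
# (0 = BV negative, 1 = intermediate, 2 = BV positive).
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))
y = rng.integers(0, 3, size=300)

# Random split into training and test set, as in the thesis.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
accuracy = rf.score(X_te, y_te)   # fraction correctly classified on test set
```

Collapsing the intermediate and BV-positive classes, as done in the abstract, amounts to relabeling `y` to two classes before refitting.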
There are two frameworks of models in item response theory (IRT): unipolar and bipolar models. Bipolar models usually use a distance approach, but this approach has not yet been applied to unipolar models. Five examples of unipolar models are the Rasch model, the two-parameter logistic (2PL) model, the rating scale model (RSM), the multidimensional Rasch model (MRM), and the multidimensional 2PL model (M2PLM). We show that unipolar models can also be built using distances. To this end, the Ideal Point Classification (IPC) model is used. The parameters of the IPC model are linearly related to those of the unipolar models: the item midpoint positions and the person positions in the IPC model are shrunken versions of the location parameters and the ability parameters, respectively.
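For reference, the 2PL model named above gives the probability of a correct response as a logistic function of ability relative to item location. A minimal sketch (standard 2PL parameterization, illustrative parameter values):

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL item response function: P(correct | ability theta) for an
    item with discrimination a and difficulty (location) b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# At theta == b the response probability is exactly one half,
# regardless of the discrimination parameter.
p = p_2pl(theta=0.0, a=1.5, b=0.0)
```

The distance-based reformulation discussed in the thesis re-expresses such response probabilities in terms of distances between person and item positions, with the IPC parameters linearly related to `theta` and `b`.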
The increasing dimensionality of data, for example in genetics, requires additional assumptions on regression models in order to obtain estimates of the regression coefficients; this is known as regularization. The lasso, one of the available regularization methods, imposes an L1 constraint on the vector of coefficients, leading to the desirable feature of setting some coefficients to zero. This results in sparse and, more importantly, estimable models. However, setting individual coefficients to zero becomes problematic when one wants a group of variables to be in or out of the model together, e.g. a factor coded as several dummy variables. This problem is solved by the group lasso. This paper presents an extension of the algorithm presented by Goeman for optimizing the penalized log-likelihood under the proportional hazards model; the algorithm combines gradient ascent with the Newton-Raphson algorithm. The methodology is then applied to data obtained from the Carema case-cohort study, whose aim is to assess the additional predictive value for the 10-year risk of coronary heart disease.
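The group lasso penalty itself is simple to state: the sum over groups g of sqrt(p_g) times the Euclidean norm of the group's coefficient sub-vector, which zeroes out whole groups (e.g. all dummies of one factor) at once. A minimal sketch with invented groups and coefficients:

```python
import numpy as np

def group_lasso_penalty(beta, groups, lam):
    """Group lasso penalty: lam * sum_g sqrt(p_g) * ||beta_g||_2.
    `groups` is a list of index lists partitioning the coefficients."""
    total = 0.0
    for idx in groups:
        bg = beta[np.array(idx)]
        total += np.sqrt(len(idx)) * np.linalg.norm(bg)
    return lam * total

# Example: a factor coded as three dummies (all zero -> no contribution)
# and a second group of two active coefficients.
beta = np.array([0.0, 0.0, 0.0, 1.0, -2.0])
groups = [[0, 1, 2], [3, 4]]
pen = group_lasso_penalty(beta, groups, lam=0.5)
```

Unlike the plain L1 norm, this penalty is non-differentiable only where an entire group is zero, which is what lets gradient-ascent-type algorithms such as Goeman's be extended to handle it.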
Freely available toolsets that can handle genome-wide association (GWA) studies on twin-family data and take imputed genotypes into account are growing in number. However, the documentation that comes with them (if available at all) does not facilitate the choice of a particular toolset. We propose a research strategy in which we compare ASSOC, EMMAX, MERLIN, PLINK, and ProbABEL on feasibility and statistical accuracy for GWA studies on simulated traits. The feasibility comparison was based on installation requirements, versatility of data input, the command-line interface, and help information. The comparison on statistical accuracy was based on Type-I error, genomic inflation, power, and the consistency and efficiency of estimated SNP effects. We simulated 100 replicates of binary and quantitative phenotypic traits under heritability conditions of 5, 10, 20, 30, 50, and 80%, based on 3 effect SNPs from 1557 samples from 597 nuclear twin families from the Netherlands Twin Registry. Analyses of Type-I error and genomic inflation were performed on 7757 pruned and unlinked SNPs that represented the null hypothesis. In the current design, PLINK performs best on feasibility and statistical accuracy for the binary trait. For the quantitative trait, ASSOC performs best on Type-I error control, EMMAX on statistical power, and PLINK on genomic inflation. Future research with larger sample sizes and larger numbers of causal SNPs is needed to compare the performance of the toolsets on complex traits.