Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
A job application process often includes a test battery with several skills and personality tests. The performance on these tests is used to predict an overall job performance score and can help decide whether or not to hire someone. Prediction based on a test battery is often done by ordinary least-squares (OLS) models. OLS models aim to explain the relationship between the dependent variable and the tests themselves. However, prediction is also important when you want to select the best candidates. OLS models are not sparse and often have high variance, so they may not be the best choice for prediction. To improve prediction, machine learning methods, such as the least absolute shrinkage and selection operator (LASSO) regression, can be used. The LASSO adds bias to estimates and reduces variance to improve prediction. One disadvantage of LASSO regression is that it is not scale invariant in the predictors. Therefore, predictors are standardized, typically by using the observed-score variance. In psychological tests, scores consist of two parts: the error part and the true-score part. The observed-score variance thus also consists of two parts: error variance and true-score variance. The true-score variance part is the most important part for prediction. However, the error variance part can cloud the effect of the true-score variance and influence whether a test is present in the prediction of the LASSO or not. This study examines two alternatives to standardization by the observed-score variance for the LASSO. The first one standardizes by the true-score variance, to minimize the effect of the error variance in the statistical model for variable selection. 
The second alternative is a transformation by the ordinary least-squares coefficient, based on the nonnegative garrote model, to add explanatory value to the model and overshadow the effect of the error variance. We examine the truthfulness of variable selection, truthfulness of coefficient size, and prediction accuracy through simulation with multiple scenarios of design factors. Design factors include the number of observations, the reliabilities of the tests, the covariance between latent variables and the number of true nonzero regression coefficients. The methods were also compared on an empirical data set of test results for psychological trait tests measuring general mental health, to determine differences and similarities between real-world data and simulation. Results showed that the methods act differently under different circumstances. Both alternatives improved the variable selection and truthfulness of coefficients in most scenarios, while the prediction was approximately the same for all three methods. This thesis gives recommendations for which method is best to use in which scenario, and shows the effects of the design factors on the truthfulness of the three methods in the simulation study. Limitations of this simulation study are given together with recommendations for further research.
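The difference between the two standardizations can be illustrated with a minimal sketch. It assumes the classical test theory identity that true-score variance equals reliability times observed-score variance; the function name `standardize`, the scores and the reliability value are hypothetical and not taken from the thesis.

```python
import math
import statistics


def standardize(scores, reliability=None):
    """Center scores and divide by a standard deviation.

    reliability=None  -> divide by the observed-score SD (the usual choice).
    reliability=r     -> divide by the true-score SD, sqrt(r) * observed SD,
                         which follows from r = true variance / observed variance.
    """
    mean = statistics.fmean(scores)
    observed_sd = statistics.stdev(scores)
    if reliability is None:
        scale = observed_sd
    else:
        if not 0.0 < reliability <= 1.0:
            raise ValueError("reliability must be in (0, 1]")
        scale = math.sqrt(reliability) * observed_sd
    return [(x - mean) / scale for x in scores]


scores = [12.0, 15.0, 9.0, 18.0, 11.0]          # made-up test scores
z_observed = standardize(scores)                # observed-score standardization
z_true = standardize(scores, reliability=0.64)  # hypothetical reliability
```

Relative to observed-score standardization, each column is multiplied by 1/sqrt(reliability), so a less reliable test is scaled up more before the LASSO penalty is applied.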
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Multivariate binary data are often collected in scientific fields such as psychology, economics and epidemiology. Worku and de Rooij (2018) proposed a marginal model for the analysis of this type of data in a distance framework: the multivariate logistic distance (MLD) model. Two different models were introduced by Worku and de Rooij: a restricted and an unrestricted MLD model. The interpretation of both models is clear, and a log-odds as well as a biplot representation can be used. In this work we proposed three extensions to the restricted model and showed the implications of the extensions for the interpretation of the corresponding biplot as well as for the log-odds. First, we showed how the model can be extended by making it possible for a response variable to belong to multiple dimensions. Consequently, the extended model can be used to examine other dimensionality structures compared to the original model. Second, we allowed for non-linear relationships of the predictor variables with the response variables in the model, thereby making the model more flexible. Finally, the dimensionality structure as well as the final predictor variables need to be selected. We showed how to use the prediction capability of a model as a selection criterion to select between competing models. This is a more versatile method to perform model selection, based on the bias-variance trade-off, compared to the likelihood-based criterion used in the original model. We fitted 16 variations of the model to an empirical data set to compare performance based on their prediction capability. All variations of the model can be estimated using standard statistical software for univariate models.
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
The brain's Default Mode Network (DMN) has attracted considerable interest in neuroscience over the past decade. The DMN is active even when a person is at rest and the mind is not task-oriented [1]. The literature [2, 3] reports that disruptions within the DMN often occur in the profiles of patients with disorders such as Parkinson's disease (PD), Alzheimer's disease (AD) and epilepsy. In this thesis we aim to build a classification model that predicts whether a new subject is an Alzheimer's patient or not. This model is created based on the DMN profiles of 250 subjects. For this purpose, we employ the δ-machine classification approach of Yuan, Heiser and De Rooij [4], which uses the distances between DMN profiles as the predictor matrix in a lasso logistic regression model. It is essential to define a distance measure that best fits the DMN univariate time series data, that is, a measure that represents the distances well irrespective of possible distortion of the data in time. With that in mind, five distance measures were investigated, all designed for time series and implemented in the up-to-date R packages TSdist and TSclust. The final goal is twofold: on the one hand, building a classification model with the δ-machine approach, based on the profile of the activity in the DMN of 250 subjects, and on the other hand, uncovering which distance measure is the most suitable when used in the δ-machine approach.
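The idea of using pairwise distances between time series as a predictor matrix can be sketched as follows. This is an illustration under assumptions, not the thesis's implementation: the abstract does not name the five distance measures, so dynamic time warping (DTW) is used here only as one familiar measure that tolerates shifts in time. The sketch is plain Python and does not use TSdist or TSclust, the toy series are made up, and the lasso logistic regression step is left out.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance with absolute difference as local cost."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also matched to an earlier b
                cost[i][j - 1],      # b[j-1] also matched to an earlier a
                cost[i - 1][j - 1],  # both series advance
            )
    return cost[n][m]


def distance_features(series):
    """Row i holds the distances from series i to every series in the set."""
    return [[dtw_distance(s, t) for t in series] for s in series]


profiles = [
    [0.0, 1.0, 2.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],  # same shape as the first, shifted in time
    [2.0, 2.0, 2.0, 2.0, 2.0],
]
X = distance_features(profiles)  # would serve as the predictor matrix
```

In this toy example the shifted series stays close to the first one under DTW, whereas a pointwise comparison would penalize the shift at every time step.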