Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
A stricter global sulfur regulation under the International Maritime Organization's MARPOL Annex VI has been in effect since the beginning of 2020, but there is no monitoring system to verify whether ships actually comply with the sulfur cap. This thesis devises a systematic approach to a prototype sulfur compliance monitoring system using the state-of-the-art TROPOspheric Monitoring Instrument (TROPOMI), which measures the atmospheric presence of trace gases. Oceanic geographical coordinates are classified by similarity in trace gas concentration levels using the k-means clustering method and adequate averaging techniques. The choice of hyperparameters and the final results are statistically formulated and verified. A subsequent longitudinal analysis of the temporal trends in trace gas emissions suggests that the sulfur dioxide measurements of TROPOMI are dominated by measurement noise. The thesis concludes that the nitrogen dioxide measurements of TROPOMI can be well utilized to backtrack maritime anthropogenic activities such as regional shipping routes, which indicates the possibility of further development into a global monitoring system for both land-based and maritime emissions.
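As a rough illustration of the clustering step described above, the sketch below groups simulated ocean grid cells by their location and time-averaged trace-gas column with k-means (scikit-learn). All values are made up; the thesis' actual averaging of TROPOMI retrievals and its hyperparameter selection are not reproduced here.

```python
# Minimal sketch (hypothetical data): cluster ocean grid cells by coordinates
# and mean NO2 column, in the spirit of the abstract's k-means step.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 500
lat = rng.uniform(-30, 60, n)
lon = rng.uniform(-20, 40, n)
no2 = np.where(rng.random(n) < 0.2,            # simulated shipping-lane pixels
               rng.normal(8e-5, 1e-5, n),      # elevated columns
               rng.normal(3e-5, 1e-5, n))      # background columns

feats = np.column_stack([lat, lon, no2])
X = (feats - feats.mean(axis=0)) / feats.std(axis=0)   # put features on one scale

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for k in range(3):
    print(k, int((labels == k).sum()), no2[labels == k].mean())
```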
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Ascertainment bias is common in genetic-epidemiological cancer studies, where sampling of high-risk families is outcome-dependent. This results in too many events in comparison to the population and an overrepresentation of young, affected subjects in the sample. The motivating example for this thesis is a family study whose goal is to estimate an unbiased hazard ratio (HR) for the effect of the Polygenic Risk Score (PRS), a continuous score based on several Single Nucleotide Polymorphisms (SNPs), on age at breast cancer diagnosis. Weighted Cox model approaches have been proposed in this context; however, their performance has never been evaluated for a continuous covariate. Two different approaches were considered, using time-fixed and time-dependent weights. A simulation study was conducted to assess the performance of the different approaches in scenarios with different choices of family correlation, family size, sample size and selection criterion. We found that under the null hypothesis, weighted and unweighted models behave similarly. When a covariate effect is assumed, in any scenario where the within-family correlation is low, weighting methods perform better than a naive approach; the same holds for moderate within-family correlation in combination with weak ascertainment. For strong ascertainment and/or strong within-family correlation, coverage of the weighting methods is very poor and bias is high. To obtain an unbiased HR for PRS, we used data from high-risk breast cancer families. Inclusion criteria were absence of the high-risk mutations BRCA1 and BRCA2 and at least three affected female family members, or two if at least one had bilateral breast cancer before age 60. A total of 101 families were selected between 1990 and 2012 by Clinical Genetic Services in four Dutch cities and one Hungarian city, with 323 (55.1%) events. The HR of PRS, adjusted for family history, was 1.29 (95% CI 1.04; 1.60) for the naive model, with a frailty variance of 0.53, which indicates rather strong within-family correlation. For none of the weighting approaches was the covariate effect of PRS adjusted for family history in a Cox model significant (HR 1.09 and 1.09). For analysis of outcome-dependently sampled survival data, weighting approaches may be used to limit ascertainment bias in some scenarios. A note of caution is required when this approach is used in scenarios with moderate to strong within-family correlation. No evidence for a significant effect of PRS on age at breast cancer diagnosis was found in this study.
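A minimal sketch of a weighted Cox fit with subject-level ascertainment weights, in the spirit of the time-fixed-weights approach described above. Data and weight values are simulated placeholders and the lifelines library is assumed to be available; the thesis' weight construction and frailty modelling are not reproduced.

```python
# Minimal sketch (simulated data): weighted Cox regression of time-to-event on a
# continuous PRS, with placeholder inverse-probability-of-ascertainment weights.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 400
prs = rng.normal(size=n)                          # continuous polygenic risk score
time = rng.exponential(scale=np.exp(-0.3 * prs) * 10)
event = (rng.random(n) < 0.7).astype(int)
ipw = rng.uniform(0.5, 2.0, n)                    # placeholder ascertainment weights

df = pd.DataFrame({"time": time, "event": event, "prs": prs, "w": ipw})
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event",
        weights_col="w", robust=True)             # robust SEs with weights
print(cph.summary[["coef", "exp(coef)", "se(coef)"]])
```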
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Accurate predictions of survival probabilities can help determine treatment strategies and support shared decision making in medical applications such as cancer prognosis. Traditionally, the Cox proportional hazards (PH) model is used to predict survival, yet machine learning (ML) has recently received increased attention. ML methods learn complex relations between explanatory variables and outcomes without the need to specify these effects beforehand. In contrast, in the Cox PH model, non-linear and interaction effects need to be specified before estimating the model. The flexibility of ML methods is believed to improve predictive accuracy, which drives the application of ML methods to survival data. One of the aims of this thesis was to compare prediction models for survival data based on machine learning methods to the traditional Cox PH model. Predictive ability was assessed using the Integrated Brier Score (IBS), the concordance index (C-index) and calibration plots. Furthermore, software implementation and interpretability were investigated. Two ML methods were considered: partial logistic regression models with artificial neural networks (PLANN) and random survival forest (RSF) models. Predictive performance was studied in a soft tissue sarcoma cohort: a right-censored survival dataset with a small number of explanatory variables. In terms of IBS and calibration, the optimally tuned RSF models had predictive performance similar to the Cox model. The Cox model had better predictive performance than the RSF models in terms of C-index. One of the NN models outperformed Cox in terms of IBS, and the NN models were slightly better calibrated than the Cox PH model. It would be interesting to see whether a Cox model including non-linear effects would outperform the ML methods considered in terms of prediction. Differences between the ML methods and the Cox PH model concern the route towards finding the optimal predictions. When estimating survival probabilities using ML methods, the focus is mainly on the correct implementation of the ML algorithm: finding suitable tuning parameters, selecting the best set of tuning parameters and running the algorithm, which takes time. On the other hand, when identifying the best predicting Cox model, time is spent on specifying the model, looking at non-linear effects and evaluating goodness of fit. The initial set of tuning parameters considered for the PLANN approach resulted in non-informative NN models. This showed the importance of thorough knowledge of the characteristics of tuning parameters in ML methods. The work in this thesis shows how survival predictions can be unreliable if the NN is not properly tuned.
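A small sketch comparing a Cox model and a random survival forest by concordance index on simulated right-censored data, assuming scikit-survival is available. It mirrors the kind of comparison described above but not the sarcoma cohort, the tuning procedure, or the IBS and calibration assessment.

```python
# Minimal sketch (simulated right-censored data): Cox PH vs. random survival
# forest, compared on the test-set concordance index.
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_censored

rng = np.random.default_rng(2)
n, p = 300, 5
X = rng.normal(size=(n, p))
risk = 0.8 * X[:, 0] - 0.5 * X[:, 1]
time = rng.exponential(scale=np.exp(-risk))
cens = rng.exponential(scale=1.5, size=n)
event = time <= cens
obs = np.minimum(time, cens)
y = np.array(list(zip(event, obs)), dtype=[("event", bool), ("time", float)])

X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]
for model in (CoxPHSurvivalAnalysis(),
              RandomSurvivalForest(n_estimators=200, random_state=0)):
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)                       # risk scores
    cidx = concordance_index_censored(y_te["event"], y_te["time"], pred)[0]
    print(type(model).__name__, round(cidx, 3))
```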
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
In this thesis, two regression models for the nonlinear analysis of interaction effects are proposed. The regression models are based on the Optimal Scaling methodology and specifically target the analysis of Factor-by-Curve interactions between a categorical and a continuous variable. The Optimal Scaling methodology was originally developed for the analysis of categorical data, but is also applicable to continuous data. It estimates optimal quantifications for the original observed values in an iterative process by maximising the squared multiple regression coefficient (R²), thereby transforming the original variable. These quantifications are restricted according to a prespecified scaling level, indicating the stringency of the transformation. These scaling levels can restrict the quantifications to be unsmoothed (non)monotone or smooth (non)monotone. Unsmoothed nonmonotone quantifications are not restricted to any relation with the original observed values, whereas the monotone restriction preserves the ordering of the original observed values in the quantifications. The smooth restrictions are similar, but the quantifications are then also smoothed using a spline function. The quantifications can also be restricted to a linear transformation of the original observed values. This (ordinary) Optimal Scaling regression model, however, does not take any interaction effects between the variables into account. The type of interactions considered in this thesis are Factor-by-Curve interactions: interactions between a categorical variable (factor) and a continuous variable. The models proposed in this thesis are referred to as Factor-by-Curve Optimal Scaling regression (FbC-OS-regression) models. Both models fit a separate curve for the continuous variable in the interaction for each level of the factor. For example, an interaction between a continuous variable and a factor with three levels is then fitted with three curves on that continuous variable. The difference between the two proposed models is that they either fit main and interaction effects separately or fit the joint effects in a single term. The models are illustrated with two applications on real data. The advantage of both FbC-OS-regression models, compared to existing methods for modelling Factor-by-Curve interactions, is that the Optimal Scaling methodology allows for monotone restrictions of the effects. This is demonstrated using the applications shown in this thesis, which are fitted using monotone spline restrictions. Results for the fitted FbC-OS-regression models are then compared to fitted linear regression models with interactions. Finally, the two approaches of modelling Factor-by-Curve interactions with OS-regression are compared to each other and to the additive model, which is also suitable for nonlinear analysis of Factor-by-Curve interactions, after which suggestions for further study of the proposed models are given.
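The toy sketch below illustrates only the basic Factor-by-Curve idea (a separate monotone curve of the outcome on the continuous variable per factor level), using isotonic regression on simulated data. It is not the Optimal Scaling algorithm or the FbC-OS-regression models proposed in the thesis.

```python
# Illustrative sketch only (simulated data): one monotone curve per factor level,
# which is the basic Factor-by-Curve idea; the thesis instead estimates optimal
# quantifications with (monotone spline) restrictions.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(3)
n = 300
g = rng.integers(0, 3, n)                                  # factor with three levels
x = rng.uniform(0, 10, n)                                  # continuous variable
y = (0.5 + 0.4 * g) * np.sqrt(x) + rng.normal(0, 0.3, n)   # level-specific curves

curves = {}
for level in np.unique(g):
    m = g == level
    curves[level] = IsotonicRegression(out_of_bounds="clip").fit(x[m], y[m])

# Level-specific predictions on a common grid of x values.
grid = np.linspace(0, 10, 5)
for level, iso in curves.items():
    print(level, np.round(iso.predict(grid), 2))
```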
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Currently, platelet transfusion is the main treatment for patients with thrombocytopenia due to haematological malignancy and intensive chemotherapy. When the platelet count is low, a transfusion is given to prevent bleeding. However, the platelet count is not the only determinant of bleeding (Ypma et al., 2019). Other biomarkers, such as the albumin-creatinine ratio measured in urine, might additionally or even better predict bleeding. This thesis project determines the predictive value of these new biomarkers, where we would like to predict the "untreated risk" of bleeding: the risk of bleeding if patients would not receive a transfusion. We used a real dataset that contains 88 patients with 116 thrombocytopenic episodes in which patients' platelet counts are low and they may develop a bleeding. One problem is that the patients who received transfusions cause difficulty in predicting the "untreated risk". Another problem is that transfusions were given partly based on the platelet counts, which makes the effect of transfusion on bleeding confounded by platelet count. We considered two situations. The first was to predict bleeding during the day based on the platelet count measured in the morning (the one-day situation). The two-day situation was to predict bleeding in the next two days, but before the second night, based on the platelet count measured on the morning of the first day. In the first part of this thesis, we structured the relationship between biomarkers, transfusions and bleeding by expressing them in causal diagrams. Using the causal diagrams, we found the reason why conventional models fail to predict the untreated risk in the two-day situation, and we found that the marginal structural model might be a solution. In the second part, we set up a simulation to verify whether the marginal structural model or conventional regression models can handle the confounding in the one-day situation and the time-dependent confounding in the two-day situation. Based on our simulation studies, we concluded that for the one-day situation the regression model including treatment and predictor was well equipped, while for the two-day situation the marginal structural model is recommended to estimate the "untreated" risk. In the third part, we applied the models to the dataset. We found that in the one-day situation the urine albumin/creatinine ratio and the platelet count have potential predictive value for predicting same-day bleeding, while for the two-day situation only the urine albumin/creatinine ratio was significantly associated with the risk of bleeding in all models. Additionally, no clear effect of transfusion was detected in either the one-day or the two-day situation.
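A generic inverse-probability-weighting sketch on simulated data, illustrating the marginal-structural-model idea in the one-day setting: weight each record by the inverse probability of the transfusion actually received given the platelet count, then fit a weighted marginal outcome model for bleeding. Variable names and effect sizes are invented; this is not the thesis' model.

```python
# Minimal IPW / marginal structural model sketch on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 2000
platelet = rng.normal(20, 8, n)                          # confounder
p_tx = 1 / (1 + np.exp(0.25 * (platelet - 15)))          # low counts -> transfusion
tx = rng.binomial(1, p_tx)
logit_bleed = -2.0 - 0.05 * platelet - 0.4 * tx          # invented true effects
bleed = rng.binomial(1, 1 / (1 + np.exp(-logit_bleed)))

# Treatment model: P(transfusion | platelet), used to build stabilized weights.
ps = sm.Logit(tx, sm.add_constant(platelet)).fit(disp=0).predict()
p_marg = tx.mean()
w = np.where(tx == 1, p_marg / ps, (1 - p_marg) / (1 - ps))

# Weighted outcome model: bleeding on transfusion only (marginal effect).
msm = sm.GLM(bleed, sm.add_constant(tx), family=sm.families.Binomial(),
             freq_weights=w).fit()
print(msm.params)    # intercept and marginal log-odds effect of transfusion
```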
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
In this thesis, we explore the feasibility of identifying impact data in humanitarian documents. We approach this as a sentence classification task and create a human-labelled set of over 11,000 sentences extracted from documents related to the IFRC's Disaster Relief Emergency Fund. Using this set, we compare various classification models and feature sets and show that it is possible to classify sentences containing impact data with good performance. Our final model, a linear Support Vector Machine trained on a document-term matrix of word bigrams, achieves a precision of 0.852 and a recall of 0.746 (F1 = 0.796) on a separate validation set of 1,114 sentences. In the second part of our research, we describe techniques that can be applied when fewer human-labelled examples are available. In brief experiments with the simplest of these techniques, we show that it is indeed possible to achieve the aforementioned performance on the validation set with 7,454 fewer labelled examples in the training set (approximately 75% less). Our work can serve as an exploratory first step towards fully automated impact data extraction from text. The work has its limitations. For instance, we found that it is very difficult to define what constitutes impact data when creating a labelled ground truth, which affects the generalisability of our ground-truth dataset. Further work can focus on the definition of impact data. Other ideas for future work are the investigation of newer (e.g. neural network-based) techniques for humanitarian text processing tasks such as this one. A continuation of our work on techniques that solve problems with fewer labelled examples, specifically for text from the humanitarian domain, is also a valuable next step.
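A minimal version of the pipeline described above (a word-bigram document-term matrix feeding a linear SVM) on a few invented sentences, using scikit-learn; the IFRC data, feature-set comparison and tuning are not reproduced.

```python
# Minimal sketch of the abstract's pipeline on tiny made-up sentences.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import precision_score, recall_score

sentences = [
    "500 houses were destroyed by the flood",
    "the meeting discussed the upcoming budget",
    "3,000 people were displaced from the area",
    "the report was published on the website",
]
has_impact = [1, 0, 1, 0]   # invented labels: does the sentence contain impact data?

clf = make_pipeline(CountVectorizer(ngram_range=(2, 2)), LinearSVC())
clf.fit(sentences, has_impact)

pred = clf.predict(sentences)
print(precision_score(has_impact, pred), recall_score(has_impact, pred))
```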
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
A problem with survey datasets is that the data may come from a selective group of the population. This makes it hard to produce unbiased and accurate estimates for the entire population. One way to overcome this problem is to use sample matching. In sample matching, one draws a sample from the population using a well-defined sampling mechanism. Next, units in the survey dataset are matched to units in the drawn sample using some background information. Usually the background information is insufficiently detailed to enable exact matching, in which a unit in the survey dataset is matched to the same unit in the drawn sample. Instead, one usually needs to rely on synthetic matching methods, in which a unit in the survey dataset is matched to a similar unit in the drawn sample. This study developed several sample matching methods for categorical data. A selective panel represents the available complete but biased dataset, which is used to estimate the distribution of the target variable in the population. The results show that exact matching unexpectedly performs best among all matching methods, and that using weighted sampling instead of random sampling did not contribute to increasing the accuracy of matching. Although predictive mean matching performed worse than exact matching, a proper adjustment for transforming categorical variables into numerical values would substantially increase its matching accuracy. All matched datasets were also used to reduce overfitting in machine learning, and the results show that all of them are able to increase prediction precision.
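A toy sketch of the exact-matching step: units of a drawn sample are matched to panel units with identical background categories, and the panel's target variable is then used to estimate its distribution in the population. The data frames below are invented.

```python
# Minimal exact-matching sketch (toy categorical data) using pandas.
import pandas as pd

panel = pd.DataFrame({                      # selective but complete panel
    "age_group": ["18-34", "18-34", "35-54", "55+"],
    "region":    ["north", "south", "north", "south"],
    "target":    ["yes", "no", "yes", "no"],
})
sample = pd.DataFrame({                     # sample drawn from the population
    "age_group": ["18-34", "35-54", "55+", "18-34"],
    "region":    ["south", "north", "south", "north"],
})

matched = sample.merge(panel, on=["age_group", "region"], how="left")
print(matched)
print(matched["target"].value_counts(normalize=True))   # estimated distribution
```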
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
In clinical trials, heterogeneity of treatment effect often exists between patients with different pretreatment characteristics, such as age, gender and weight. In response to this issue, various subgroup identification approaches have been proposed. Two of these methods, Qualitative Interaction Trees (QUINT) and a method adapted from the optimal treatment regimes (OTR) approach proposed by Zhang et al. (2012), are compared in this thesis. These two methods identify three types of subgroups in a situation with two treatments (A and B): one subgroup for which treatment A is better than treatment B, one for which treatment B is better than treatment A, and one for which the difference between the two treatment outcomes is negligible (the "indifference group"). A simulation study was conducted to compare the two methods with regard to their recovery performance (quantified by type I error rates, type II error rates, Cohen's κ agreement with the true subgroups, and the splitting performance of the derived trees) and their predictive performance (quantified by the difference between the true expected treatment outcome and the estimated treatment outcome on sample data and population data). The results of the simulation study suggest that QUINT has an advantage in recovering the subgroups, whereas the method adapted from the OTR approach has an advantage in predicting treatment outcome.
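An illustrative sketch (simulated data) of the regression flavour of the optimal-treatment-regime idea: fit an outcome model with a treatment-by-covariate interaction and assign each patient the treatment with the higher predicted outcome, with a small margin defining an indifference group. This is not the Zhang et al. (2012) estimator nor QUINT itself; it only makes the three-subgroup notion concrete.

```python
# Illustrative regression-based treatment-rule sketch on simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 500
age = rng.uniform(20, 80, n)
trt = rng.binomial(1, 0.5, n)                            # A = 1, B = 0
y = 1.0 + 0.02 * age + trt * (2.0 - 0.05 * age) + rng.normal(0, 1, n)

X = np.column_stack([age, trt, age * trt])
model = LinearRegression().fit(X, y)

# Predicted outcome under each treatment for every patient.
y_a = model.predict(np.column_stack([age, np.ones(n), age]))
y_b = model.predict(np.column_stack([age, np.zeros(n), np.zeros(n)]))
rule = np.where(y_a - y_b > 0.5, "A better",
        np.where(y_b - y_a > 0.5, "B better", "indifferent"))
print({k: int((rule == k).sum()) for k in np.unique(rule)})
```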
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
In forensics it is relevant to identify the presence of one or several body fluids in a crime stain. This may be done using traditional methods; however, those methods consume part of the available material, leaving less residual material for other analyses. Alternatively, one can use messenger RNA evidence: mRNA expression levels vary among body fluids, and body fluids can therefore be identified from them. The currently used method provides the forensic examiner with a categorical statement regarding the presence of a body fluid. However, such a method cannot express any associated uncertainty, whereas a probabilistic method can and is hence preferable. In forensic science it is common to express the level of uncertainty by means of a likelihood ratio, but this ratio may be inaccurate due to a poor choice of statistical model or data scarcity. This thesis first of all carries out experiments with four probabilistic classification methods, namely Multinomial Logistic Regression, Multilayer Perceptron, Extreme Gradient Boosting and a fully connected feed-forward model. In actual casework the crime stain often consists of multiple body fluids, which is why the classifiers are compared using synthetic representations of actual mixture samples. Multi-label approaches that enable the classifiers to express the level of uncertainty about multiple body fluids in a sample are used. The output from the logistic regression model is directly interpreted as a likelihood ratio, whereas for the remaining three classifiers a post-hoc calibration step is included to improve the accuracy of the classifiers. Additional tests are performed to investigate how susceptible the classifiers are to changes in the relative frequency of the body fluids in the data. The main focus is on two target classes, namely saliva and a combination of vaginal mucosa and menstrual secretion, because these are most often requested to be identified in a crime stain and are therefore seen as most relevant. It is concluded that using a separate logistic regression model for each target class in combination with presence/absence data results in both accurate and reliable likelihood ratios. Results also indicate that these models are the least susceptible to a change in the frequency with which body fluids occur in the training dataset. Furthermore, a study is done using an additional dataset with actual mixtures of two body fluids that are not assumed representative of forensically realistic mixtures of the same two components. Results show that the accuracy of the classifiers on the mixture dataset is higher than the accuracy on the synthetic representations. This indicates that those results are overly optimistic, thereby confirming that the mixture cell-type dataset should not be used as a validation set. A user-friendly tool is constructed that implements logistic regression to calculate the likelihood ratio for samples from actual casework. Using mRNA measurements from two cases, both the practical use and the interpretability of the results are shown.
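A minimal sketch of the approach described above: one logistic regression per target class on presence/absence marker data, with the posterior odds divided by the prior odds of the training data to obtain a likelihood ratio. The data are simulated and this is not the thesis' tool.

```python
# Minimal sketch (simulated presence/absence marker data): per-class logistic
# regression whose output is converted to a likelihood ratio (posterior odds
# divided by prior odds).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n, n_markers = 400, 6
is_saliva = rng.binomial(1, 0.5, n)
# Markers are more often "present" when the target fluid is in the sample.
X = rng.binomial(1, np.where(is_saliva[:, None] == 1, 0.8, 0.2),
                 size=(n, n_markers))

clf = LogisticRegression().fit(X, is_saliva)
prior_odds = is_saliva.mean() / (1 - is_saliva.mean())

p = clf.predict_proba(X[:5])[:, 1]
lr = (p / (1 - p)) / prior_odds            # likelihood ratio for the first 5 samples
print(np.round(lr, 2))
```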
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
The task of finding suitable candidates for a job has never been an easy one, and now that recruiters have access to various online job boards and are not necessarily constrained by national borders, it can be argued that shortlisting relevant candidates is more difficult than ever. This is especially true for online recruitment agencies that have huge databases of potential candidates and no effective way to quickly identify which of those candidates have the required experience and skills for the vacancy at hand. Different companies solve this problem in many different ways. In the case of YoungCapital, a Dutch recruitment agency, all candidates can state their preferred profession and location when creating a profile on the company's website, and recruiters can then create a search query based on those stated preferences. It is also possible to get keyword matches with candidates' resumes, which, however, is a manual task where recruiters have to decide on the specific keywords they want to find. Given recent advances in machine learning and natural language processing, it was decided to try a learning-to-rank (LTR) approach, to see whether the candidate search process could be improved by presenting recruiters with a ranked list of candidates for each job, with the most suitable candidates at the top of the list. The LambdaMART model was chosen for this task as the state-of-the-art algorithm, and the baseline ranking model was a simple linear regression. Most of the features were designed using custom word embeddings. The results were evaluated with common rank-based measures: Normalised Discounted Cumulative Gain (NDCG) and Mean Average Precision (MAP). Precision, which ignores the order of results, was reported as well. Overall, we found a significant improvement over the current method according to all three measures. We also demonstrated the impact of different feature sets on the performance of the ranking models.
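A minimal LambdaMART-style sketch using LightGBM's LGBMRanker on random features with made-up graded relevance labels, grouped per job; the embedding-based features and the evaluation from the thesis are out of scope here.

```python
# Minimal learning-to-rank sketch: LambdaMART-style ranker on invented data.
import numpy as np
from lightgbm import LGBMRanker

rng = np.random.default_rng(7)
n_queries, cands_per_query, n_feat = 20, 10, 5
X = rng.normal(size=(n_queries * cands_per_query, n_feat))
y = rng.integers(0, 3, size=n_queries * cands_per_query)   # 0/1/2 relevance labels
group = [cands_per_query] * n_queries                       # candidates per job

ranker = LGBMRanker(objective="lambdarank", n_estimators=50,
                    min_child_samples=5)
ranker.fit(X, y, group=group)

# Rank the candidates of one new "job": higher score = better match.
scores = ranker.predict(rng.normal(size=(cands_per_query, n_feat)))
print(np.argsort(-scores))
```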
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Online Linear Regression is a sequential variant of regression in which the data points arrive one by one. It is normally studied in the game-theoretic framework of Online Convex Optimization, which models the data as being generated by an adversary. In this framework, the standard statistical procedure of Online Ridge Regression is known to be essentially optimal. In statistics, there is an improvement over Ridge Regression when the noise is not constant: Weighted Ridge Regression, which relies on weighting the data by their variances. In this thesis, we employ weighting in Online Ridge Regression and show that an improvement over Online Ridge Regression can be made. We furthermore explore the situation where weighting is disadvantageous, both mathematically and experimentally using simulations. Finally, we apply Online Weighted Ridge Regression to different real-world datasets and find that we can also improve on Online Ridge Regression in practical situations.
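A short sketch of online ridge regression in which each arriving observation is weighted by its inverse noise variance, the basic idea behind the weighted variant discussed above; the exact algorithm and the regret analysis in the thesis may differ.

```python
# Minimal sketch (simulated stream): weighted online ridge regression,
# weights taken as inverse noise variances (assumed known here).
import numpy as np

rng = np.random.default_rng(8)
d, T, lam = 3, 500, 1.0
theta_true = np.array([1.0, -2.0, 0.5])

A = lam * np.eye(d)               # accumulated weighted Gram matrix + ridge term
b = np.zeros(d)                   # accumulated weighted response vector
sq_err = 0.0

for t in range(T):
    x = rng.normal(size=d)
    sigma2 = rng.choice([0.1, 4.0])            # heteroscedastic noise
    y = x @ theta_true + rng.normal(0, np.sqrt(sigma2))

    theta = np.linalg.solve(A, b)              # predict with the current estimate
    sq_err += (y - x @ theta) ** 2

    w = 1.0 / sigma2                           # weight = inverse noise variance
    A += w * np.outer(x, x)                    # then update with the new point
    b += w * y * x

print("final estimate:", np.round(np.linalg.solve(A, b), 3))
print("cumulative squared prediction error:", round(sq_err, 1))
```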
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Recently, a new theory of hypothesis testing was introduced: safe testing. Within the safe testing framework, random variables called S-values are used for hypothesis testing. S-values can be interpreted as both conservative p-values and Bayes factors. Further, they allow for optional continuation: S-values from multiple studies can be multiplied while retaining a type-I error guarantee, and some S-values are even robust under the frequentist interpretation of optional stopping. For this thesis, I developed safe tests for two classical frequentist hypothesis tests: the 2x2 contingency table test and its stratified equivalent, the Cochran-Mantel-Haenszel test. These tests were designed to be GROW (growth-rate optimal in the worst case) for certain subsets of the alternative hypothesis. Two versions of the tests are presented: a version that provides the GROW S-value for a restricted alternative hypothesis based on a minimal absolute difference between group means, and a version that is based on the Kullback-Leibler divergence between the alternative and null hypotheses. For the 'minimal absolute difference' version, an analytically computable 'simple' S-value turned out to exist, which is robust under optional stopping. I show that when using this safe test with optional stopping, the expected sample size needed to achieve a desired power can be lower than with Fisher's exact test. No 'simple' definition could be found for the Kullback-Leibler version: this GROW safe test has to be found through numerical optimization. Nevertheless, the Kullback-Leibler version may still be preferred in some cases: it was shown to achieve higher power for certain data-generating distributions than the simple S-value. Both S-values were implemented in an R package: the safe2x2 package.
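A conceptual sketch of S-values and optional continuation for the simplest possible case, a simple Bernoulli null against a point alternative: the batch likelihood ratios are S-values and may simply be multiplied across studies, and rejecting when the product exceeds 1/alpha controls the type-I error. This is not the thesis' safe 2x2 or Cochran-Mantel-Haenszel test.

```python
# Conceptual sketch only: likelihood-ratio S-values under a simple null,
# multiplied across batches (optional continuation).
import numpy as np

rng = np.random.default_rng(9)
p0, p1 = 0.5, 0.7
s_total = 1.0

for batch in range(3):                          # three "studies"
    x = rng.binomial(1, 0.7, size=30)           # data actually drawn from p = 0.7
    k, n = x.sum(), x.size
    s_batch = (p1 ** k * (1 - p1) ** (n - k)) / (p0 ** k * (1 - p0) ** (n - k))
    s_total *= s_batch                          # optional continuation
    print(f"batch {batch}: S = {s_batch:.2f}, running product = {s_total:.2f}")

print("reject H0 at alpha = 0.05:", s_total >= 20)
```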
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Prediction rule ensembles (PREs) aim to offer a good compromise between prediction accuracy and interpretability by selecting a small set of the most important prediction rules. The accuracy of tree-based methods, such as single decision trees, is known to be negatively affected by measurement error. The PRE algorithm is based on single decision trees, which are turned into an ensemble of multiple rules, and may thus inherit the negative effect of measurement error. However, an extensive investigation of the influence of measurement error on the performance of PREs has not been conducted before. Therefore, we evaluated the impact of measurement error on the performance of PREs through two simulation studies: one for data with continuous predictor variables and the other for data with binary predictor variables. In both, the focus is solely on binary classification. We found that the predictive accuracy of PREs, as measured by AUC values, deteriorated in the presence of measurement error. More specifically, the performance of the PRE method deteriorated with larger amounts of measurement error for both the binary and continuous predictor scenarios. In addition, the performance of PREs in terms of the number of correctly selected rules and type I and type II errors was evaluated. We found that, apart from deteriorating the predictive performance of PREs, measurement error can also deteriorate the interpretability of the fitted ensemble by selecting wrong rules, resulting in unreliable and wrong conclusions. Keywords: RuleFit, prediction rule ensembles, measurement error, classification error, reliability, type I error, type II error
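A small sketch of the kind of simulation set-up described above: generate a rule-based binary outcome, add increasing classical measurement error to the predictors, and track the test AUC. A plain decision tree stands in here for the prediction rule ensemble used in the thesis.

```python
# Minimal measurement-error simulation sketch (decision tree as a stand-in
# for the rule ensemble).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(10)
n, p = 2000, 6
X = rng.normal(size=(n, p))
y = ((X[:, 0] > 0) & (X[:, 1] > 0.5)).astype(int)        # true rule

for noise_sd in (0.0, 0.5, 1.0):                          # amount of error
    X_err = X + rng.normal(0, noise_sd, size=X.shape)     # classical measurement error
    X_tr, X_te, y_tr, y_te = train_test_split(X_err, y, random_state=0)
    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"measurement error sd = {noise_sd}: test AUC = {auc:.3f}")
```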
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Real-world data contain both signal and noise. In this study, we developed a method that uses replications to separate signal from noise. Our proposed method employs the Expectation-Maximization algorithm to estimate both the signal and the noise precision matrices. The estimated precision matrices are used to construct a Gaussian graphical model, which represents the network of variables. In high-dimensional settings, regularization techniques are used to ensure the positive definiteness of the estimated precision matrices. In the simulation study, we varied the graphical structure, the number of edges and the size of the noise to see how the proposed method performs. As the true signal precision matrix is known, the estimates of the proposed method were compared to those of other methods through the Kullback-Leibler (KL) divergence from the true distribution and the prediction accuracy of edge presence or absence. The results show that for the clique models in our study, the unpenalized version of our proposed method performed best in edge detection, while for banded and star models under certain circumstances, the unpenalized estimates of the proposed method came last in edge detection. The distributions based on the penalized estimates of our method were the best or second-best approximations of the true distribution in terms of KL divergence. Our results also show that with increasing numbers of samples and replications, the estimates improve in edge detection and in approximation of the true distribution. In the real-world data analysis, we used three pathways of a lung cancer dataset from the TCGA project. The results show that there is more overlap between estimates from the merged data and the single-platform data than between estimates from the two platforms, in terms of KL divergence and edges in common. We also found that the distribution constructed from the signal-noise estimator of the proposed method is a better approximation of that of new data than the one from the signal estimator.
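An illustrative moment-based sketch of how replications allow signal and noise covariance to be separated: the within-replicate scatter estimates the noise covariance, and the covariance of replicate means estimates signal plus noise divided by the number of replications. The thesis uses an EM algorithm with regularization instead; this only conveys the idea on simulated low-dimensional data.

```python
# Illustrative moment decomposition on simulated replicated data (not the
# thesis' EM algorithm).
import numpy as np

rng = np.random.default_rng(11)
n, r, p = 200, 3, 4                        # subjects, replications, variables
Sigma_signal = np.array([[1.0, 0.6, 0.0, 0.0],
                         [0.6, 1.0, 0.0, 0.0],
                         [0.0, 0.0, 1.0, 0.3],
                         [0.0, 0.0, 0.3, 1.0]])
signal = rng.multivariate_normal(np.zeros(p), Sigma_signal, size=n)
data = signal[:, None, :] + rng.normal(0, 0.7, size=(n, r, p))   # add noise

means = data.mean(axis=1)
S_noise = np.mean([np.cov(data[i].T) for i in range(n)], axis=0)  # within-replicate
S_means = np.cov(means.T)
S_signal = S_means - S_noise / r           # moment estimate of the signal covariance

K_signal = np.linalg.inv(S_signal)         # precision matrix -> graph structure
print(np.round(K_signal, 2))
print("estimated edges:", np.abs(K_signal[np.triu_indices(p, 1)]) > 0.2)
```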
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
A job application process often includes a test battery with several skills and personality tests. The performance on these tests is used to predict an overall job performance score and can help decide whether or not to hire someone. Prediction based on a test battery is often done with ordinary least-squares (OLS) models. OLS models try to correctly explain the relationship between the dependent variable and the tests themselves. However, prediction is also important when you want to select the best candidates. OLS models are not sparse and often have high variance, so they may not be the best models in terms of prediction. To improve prediction, machine learning methods such as least absolute shrinkage and selection operator (LASSO) regression can be used. The LASSO adds bias to estimates and reduces variance to improve prediction. One disadvantage of LASSO regression is that it is not scale invariant in the predictors. Therefore, predictors are standardized, typically using the observed-score variance. In psychological tests, scores consist of two parts: the error part and the true-score part. The observed-score variance thus also consists of two parts: error variance and true-score variance. The true-score variance is the most important part for prediction. However, the error variance can cloud the effect of the true-score variance and influence whether a test is present in the prediction of the LASSO or not. This study examines two alternatives to standardization by the observed-score variance for the LASSO. The first standardizes by the true-score variance, to minimize the effect of the error variance in the statistical model for variable selection. The second alternative is a transformation by the ordinary least-squares coefficient, based on the nonnegative garrote model, to add explanatory value to the model and overshadow the effect of the error variance. We examine the truthfulness of variable selection, the truthfulness of coefficient size, and prediction accuracy through simulation with multiple scenarios of design factors. The design factors include the number of observations, the reliabilities of the tests, the covariance between latent variables and the number of truly nonzero regression coefficients. The methods were also compared on an empirical dataset of test results for psychological trait tests measuring general mental health, to determine differences and resemblances between real-world data and the simulation. Results showed that the methods act differently under different circumstances. Both alternatives improved the variable selection and the truthfulness of coefficients in most scenarios, while prediction was approximately the same for all three methods. This thesis gives recommendations for which method is best to use in which scenario, and shows the effects of the design factors on the truthfulness of the three methods in the simulation study. Limitations of this simulation study are given together with recommendations for further research.
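A minimal sketch of the first alternative described above: standardize each test by its estimated true-score standard deviation, the square root of reliability times observed variance, before fitting the LASSO. Reliabilities are treated as known and the data are simulated; the thesis' simulation design is not reproduced.

```python
# Minimal sketch: LASSO after standardising tests by true-score SD
# (sqrt(reliability * observed variance)) instead of the observed SD.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(12)
n, p = 500, 6
true_scores = rng.normal(size=(n, p))
reliability = np.array([0.9, 0.8, 0.7, 0.9, 0.6, 0.8])
error_sd = np.sqrt((1 - reliability) / reliability)       # yields the stated reliabilities
X = true_scores + rng.normal(0, error_sd, size=(n, p))    # observed test scores
y = true_scores[:, 0] + 0.5 * true_scores[:, 1] + rng.normal(0, 1, n)

true_score_sd = np.sqrt(reliability * X.var(axis=0))
X_std = (X - X.mean(axis=0)) / true_score_sd               # alternative scaling

lasso = Lasso(alpha=0.05).fit(X_std, y)
print(np.round(lasso.coef_, 2))
```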
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Machine learning classifiers are naturally black boxes when it comes to interpretation. In this thesis, Decision Boundary Approximation (DBA), a new algorithm for locally explaining complex binary classifiers, is developed, tested experimentally and discussed. The algorithm explains predictions of individual instances by approximating their most relevant region of the decision boundary with a linear model. We review and discuss limitations of existing methods when applied to classification, with specific focus on LIME due to the similarity with DBA concepts. Experiments with DBA cover both low dimensions and sparse high-dimensional data. In Experiment 1 we show that DBA can provide stable explanations for various decision boundary structures in a 2D simulated case. Experiment 2 demonstrates that DBA outperforms LIME for low dimensionalities, while in Experiment 3 (MNIST data) we show that when data are sparse, DBA explanations can include features that are absent from the explained example, making the explanation more complete compared to LIME. In Experiment 4 we explain a Naive Bayes classifier trained on SMS ham/spam messages and show that the DBA solution is in agreement with the Naive Bayes posterior. Finally, the benefits and drawbacks of DBA are discussed in detail and recommendations for future modifications are given.
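A generic sketch of a local linear surrogate for a black-box binary classifier, essentially the LIME-style baseline the thesis compares against, shown only to make the notion of a local explanation concrete; it is not the DBA algorithm, which approximates the nearest region of the decision boundary itself.

```python
# Generic local-surrogate sketch (not DBA): fit a linear model to black-box
# labels of points sampled around the instance to be explained.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(13)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] ** 2 + X[:, 1] > 1).astype(int)        # nonlinear decision boundary
black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

x0 = np.array([1.0, 0.2])                            # instance to explain
Z = x0 + rng.normal(0, 0.5, size=(500, 2))           # local perturbations
labels = black_box.predict(Z)                        # black-box labels

surrogate = LogisticRegression().fit(Z, labels)
print("local feature weights:", np.round(surrogate.coef_[0], 2))
```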