Statistical matching is a technique that can be applied when one wants to investigate the joint relationship between two variables observed in different datasets, using one or more variables that overlap in both datasets. This joint relationship cannot be estimated without relying on assumptions or additional data. Classically, statistical matching is based on the Conditional Independence Assumption (CIA), which asserts that the non-overlapping variables are independent given the overlapping variable. This assumption is inflexible, untestable and often does not hold. The current project proposes an approach based on the Instrumental Variable Assumption (IVA). An instrumental variable is a variable that, given the value of some mediating variable, has no effect on some outcome variable. In the context of statistical matching this gives rise to three scenarios: the mediating variable overlaps, the outcome variable overlaps, or the instrumental variable overlaps. The IVA approach is more flexible than the CIA approach, because it does not make any assumptions about which variable is the overlapping variable, whereas the CIA always conditions on the overlapping variable. The aims of the current study were twofold: 1) to assess how the IVA approach performs when the assumption is violated to various degrees, and 2) to compare the IVA approach to the CIA approach. To answer these questions, a simulation study was performed. For each scenario, joint probabilities of the non-overlapping variables were estimated under both the IVA and the CIA in populations that violate the IVA to various degrees. Measures of bias, accuracy and precision were estimated and compared. The results indicate that the IVA approach is moderately robust against slight violations of the assumption. When the IVA is not violated, estimates are unbiased and the method outperforms the CIA in all matching scenarios. When the IVA is violated, it is advisable to rely on the CIA, since the results of the current simulation study suggest that the CIA is more robust against violations in general.
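To make the CIA estimator concrete, here is a minimal sketch (not the thesis code; data and variable names are invented) of how joint probabilities of two non-overlapping binary variables Y and Z can be estimated under conditional independence given an overlapping variable X:

```python
import numpy as np
import pandas as pd

# Toy example: file A observes (X, Y), file B observes (X, Z).
rng = np.random.default_rng(1)
file_a = pd.DataFrame({"X": rng.integers(0, 2, 500), "Y": rng.integers(0, 2, 500)})
file_b = pd.DataFrame({"X": rng.integers(0, 2, 500), "Z": rng.integers(0, 2, 500)})

# P(X) pooled over both files, P(Y|X) from file A, P(Z|X) from file B.
p_x = pd.concat([file_a["X"], file_b["X"]]).value_counts(normalize=True)
p_y_given_x = pd.crosstab(file_a["X"], file_a["Y"], normalize="index")
p_z_given_x = pd.crosstab(file_b["X"], file_b["Z"], normalize="index")

# Under the CIA, P(Y=y, Z=z) = sum_x P(Y=y|X=x) * P(Z=z|X=x) * P(X=x).
joint = np.zeros((2, 2))
for x in p_x.index:
    joint += p_x[x] * np.outer(p_y_given_x.loc[x], p_z_given_x.loc[x])
print(joint)  # estimated joint probabilities of the non-overlapping variables
```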
This study examined how regression to the mean (RTM) affects estimates of the effectiveness of Type II diabetes medication, using data from the Netherlands Epidemiology of Obesity (NEO) study. The research showed that RTM can occur both before and after the start of medication. In simulations with no treatment effect, the HbA1c levels before starting medication were unexpectedly lower than the true average. This suggests that the choice of comparison time points can substantially change how effective the medication appears. When the same methods were applied to the real NEO data, there were notable differences compared to the simulations. For example, the estimated average HbA1c level before starting medication in the NEO study was higher than what was found using certain statistical models. This points to possible issues with these methods and illustrates the complexity of the NEO data. The findings suggest that RTM is an important factor to consider in studies of medication effectiveness. Because the NEO data are so complex, future research may need more detailed methods to properly understand RTM and its effect on the results. Expanding the range of scenarios studied, including different criteria for when treatment starts, and examining more patient data could give a fuller picture of RTM and improve how medication effectiveness is evaluated in real-world studies.
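As a classic illustration of the mechanism described above (a hedged sketch with hypothetical HbA1c parameters, not the thesis's simulation design), selecting patients whose measured value crosses a treatment-start threshold produces an apparent improvement at follow-up even when treatment has no effect:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
true_mean, between_sd, within_sd = 6.0, 0.8, 0.4  # hypothetical HbA1c (%) parameters

true_level = rng.normal(true_mean, between_sd, n)      # stable patient-level HbA1c
baseline = true_level + rng.normal(0, within_sd, n)    # noisy measurement at visit 1
follow_up = true_level + rng.normal(0, within_sd, n)   # visit 2, no treatment effect

# Treatment is "started" when the measured baseline exceeds a threshold.
starters = baseline > 7.0
print(baseline[starters].mean())   # inflated by selection on a noisy measurement
print(follow_up[starters].mean())  # regresses toward the true mean, mimicking a treatment effect
```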
Prediction models play a paramount role in fields such as psychology and medicine, where the aim is to maximize predictive performance while ensuring high interpretability and stability. Prediction rule ensembles are a recent statistical learning method that addresses the black-box problem of common machine learning methods. First, an ensemble of trees is fitted; then, by employing sparse regression such as the lasso, only a subset of those trees is retained in the final ensemble, enhancing interpretability. However, the lasso has drawbacks: the penalty parameter that is optimal for variable selection may lead to over-shrinkage of large coefficients. This study investigates whether the accuracy, sparsity, and stability of prediction rule ensembles can be improved by using the adaptive or relaxed lasso, or their combination. In the adaptive lasso, weights are assigned to each coefficient in the penalty term, while in the relaxed lasso the lasso coefficients are debiased towards their unpenalized values. In addition, we compared whether the results differ when model selection is based on the lambda-1se or lambda-min criterion, and between continuous and binary outcomes. For this, the models were evaluated on nine benchmark datasets using repeated 10-fold cross-validation. The results show that all lasso variations improve model sparsity substantially while maintaining high accuracy, but at the cost of stability. The relaxed and adaptive lasso select sparser models than the standard lasso while achieving good stability of variable selection, but at the cost of less stable predictions. The relaxed adaptive lasso yields the sparsest model, but is the most unstable. Regarding the lambda criterion, for continuous outcomes the lambda-min criterion leads to highly unstable results and diminishes the effect of the lasso approach used. For binary outcomes, the lambda-1se criterion improves only accuracy and sparsity, not stability, while for continuous outcomes it improves all performance diagnostics.
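For readers unfamiliar with the adaptive lasso, the following minimal sketch (Python with scikit-learn, on generic features rather than prediction rules; the thesis itself concerns rule ensembles) shows the standard reformulation in which pilot coefficients define per-coefficient penalty weights:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, LassoCV

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

# Pilot estimate (ridge) gives per-coefficient penalty weights w_j = 1/|b_j|^gamma.
gamma = 1.0
pilot = Ridge(alpha=1.0).fit(X, y)
w = 1.0 / (np.abs(pilot.coef_) ** gamma + 1e-8)

# Adaptive lasso = ordinary lasso on rescaled features X_j / w_j,
# followed by rescaling the coefficients back to the original scale.
X_scaled = X / w
lasso = LassoCV(cv=10).fit(X_scaled, y)
coef = lasso.coef_ / w
print((coef != 0).sum(), "of", X.shape[1], "coefficients retained")
```

Note that scikit-learn's LassoCV returns only the error-minimising penalty; the lambda-1se rule compared in the study instead picks the largest penalty whose cross-validated error lies within one standard error of that minimum.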
In scientific research, interpretability and high predictive performance are difficult to combine: while black-box models perform better than interpretable models, only the latter allow for transparency and inference, which are necessary when models are used in decision-making or hypothesis testing. Models such as RuleFit combine the flexibility of a black-box tree ensemble with the interpretability of a sparse LASSO linear regression. Later work substitutes Bayesian regression for the LASSO regression, further improving the model's predictions (Horserule). The work in this thesis was twofold: on the one hand, we applied a different Bayesian prior (the informative Horseshoe prior) to the linear step of the RuleFit model, which can naturally take the structure of RuleFit into account. On the other hand, we used Shapley values to measure the contribution of each predictor in the RuleFit model and combined these values with the Bayesian regression to build inferential tools. The new machinery was tested on both synthetic data and the dataset from the Helius study. The predictive performance of the resulting model was higher than that of the original RuleFit model, but lower than that of Horserule. Compared to Horserule, the proposed model excessively favours trees over linearity, but in doing so it more strongly enforces the choice of simpler trees. Shapley values were also compared to other importance measures from the RuleFit literature and shown to be more accurate in reconstructing the contributions as defined in the synthetic datasets.
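For a linear model over (approximately independent) features, Shapley values have a closed form: the contribution of predictor j for observation x is beta_j (x_j - E[x_j]). The sketch below (hypothetical coefficients and rule activations, not the thesis model) illustrates this for a RuleFit-style linear combination of binary rules:

```python
import numpy as np

# Hypothetical fitted RuleFit-style linear model over binary rule features:
# f(x) = intercept + sum_j beta_j * r_j(x), with r_j(x) in {0, 1}.
rng = np.random.default_rng(3)
R = rng.integers(0, 2, size=(1000, 4)).astype(float)  # rule activations per observation
beta = np.array([1.5, -0.7, 0.0, 2.0])
intercept = 0.3

# For a linear model with independent features, the exact Shapley value of
# feature j for observation x is beta_j * (x_j - mean(x_j)).
phi = beta * (R - R.mean(axis=0))

x0 = 0  # first observation: prediction decomposes as baseline + contributions
print("prediction:           ", intercept + R[x0] @ beta)
print("baseline + phi sum:   ", intercept + R.mean(axis=0) @ beta + phi[x0].sum())
```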
Missing data is a common problem in survey research that leads to several issues, e.g., increased survey costs and biased survey estimates. Different multiple imputation (MI) methods have been developed to handle missing categorical data. One specific subset of these MI methods are the so-called robust methods, which use one or several outcome and response models to improve robustness against model misspecification. One of the robust methods, Multiply Robust Nearest Neighbour Multiple Imputation (MRNNMI), is a donor-based method that uses several outcome and response models. In MRNNMI, predictive scores are obtained from all the models, weighted using prespecified equal weights, and then used to compute the distances between units with missing values and possible donors. In this thesis, I developed and tested a derived method, Multiply Robust Imputation for Categorical Data (MRIC), which uses model quality measures to weight the predictive scores. MRIC applies the same steps as MRNNMI, but the prespecified weights are replaced by weights based on three model quality measures: four types of pseudo-R2, the Hosmer-Lemeshow test statistic, and Akaike weights. The performance of MRIC under the three weighting approaches was compared to the existing robust MI methods that use prespecified weights, and to the well-known MI approach Multivariate Imputation by Chained Equations (MICE), in a simulation study with different sample sizes and response rates. Based on the results, none of the weighting approaches influenced imputation performance on categorical data: MRIC performed similarly to all the existing robust methods under all conditions tested. The results indicate that for small sample sizes combined with low response rates, all the robust methods provide similar but more accurate results than MICE. However, with larger sample sizes, MICE, especially without explicit model specification, outperformed the robust methods in terms of bias and precision. Future research is needed to examine the influence of weighting based on model quality in other existing robust methods and to implement other model quality measures for weighting the predictive scores.
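Of the three quality measures, Akaike weights are the easiest to show compactly. The following is a minimal sketch (invented AICs and predictive scores; not MRIC's exact distance definition) of weighting predictive-score distances by model quality:

```python
import numpy as np

def akaike_weights(aic):
    """Akaike weights: relative support for each model from its AIC value."""
    delta = np.asarray(aic) - np.min(aic)
    w = np.exp(-0.5 * delta)
    return w / w.sum()

# Hypothetical AICs of three outcome/response models.
w = akaike_weights([310.2, 312.9, 325.4])

# Weighted distance between a unit with a missing value and candidate donors,
# based on the predictive scores from each model (columns = models).
score_missing = np.array([0.42, 0.51, 0.38])
scores_donors = np.array([[0.40, 0.55, 0.30],
                          [0.80, 0.20, 0.90]])
dist = np.sqrt(((scores_donors - score_missing) ** 2 * w).sum(axis=1))
print(w, dist)  # nearest donor = smallest weighted distance
```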
The dimensional approach to data sets aims to provide a more accurate interpretation of the relationships between variables and to reveal the underlying patterns among them. In this thesis, a survival data set consisting of childhood adversities (CAs) and correlated adult psychiatric disorders was analyzed using the Logistic Reduced Rank Regression (LRRR) approach. This method allows a dimensional investigation of CAs and disorders while accounting for the correlations between the disorders. The data set was also analyzed using the traditional Logistic Regression (LR) approach, and the results showed that LRRR outperforms LR by providing more information about the data set. Parental mental illness (PMI) and physical illness (PIllness) experienced during childhood were found to strongly influence the development of disorders under both the LR and LRRR approaches. However, since LR assumes the same effect of these adversities for each disorder, it failed to identify which disorders were more affected by them. Under the LRRR approach, the effects of PMI and PIllness were found to have a greater impact on the development of posttraumatic stress disorder (PTS). This dimensional approach, while providing more information about the data set, does have limitations. Biplots, used for dimensional analysis, are easy to interpret in two-dimensional models; for high-dimensional models, however, biplots must be constructed for each pair of dimensions, which makes their simultaneous examination challenging. Furthermore, the missing data must satisfy the assumption of being missing completely at random (MCAR). If the missing data do not meet this assumption, they need to be addressed through advanced techniques such as multiple imputation; failure to handle the missing data appropriately may limit the effectiveness of this method. In future research, cross-validation can be employed to assess the generalization ability of the model and enhance related analyses.
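For reference, a compact statement of the reduced rank structure (notation mine, consistent with the description above): the coefficient matrix linking the p adversities to the J disorders is constrained to low rank, and its factors supply the biplot coordinates.

```latex
\[
  \operatorname{logit}\Pr(y_{ij}=1 \mid \mathbf{x}_i)
  = m_j + \mathbf{x}_i^{\top}\mathbf{A}\,\mathbf{c}_j,
  \qquad \mathbf{B} = \mathbf{A}\mathbf{C}^{\top},
  \quad \operatorname{rank}(\mathbf{B}) = r \ll \min(p, J),
\]
```

where A (p x r) and C (J x r) are the low-rank factors whose rows are plotted in the biplots; the rank constraint is what lets one adversity have a different estimated effect on each disorder while still borrowing strength across correlated disorders.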
Estimating the number of innovative companies in a country can be beneficial for policymakers and for the production of official statistics. Currently, innovation activity is estimated by administering a survey to a stratified sample of companies. However, this method is costly, and small companies are not sampled. Previous research used company website texts with word embeddings to detect innovation activity, allowing small companies to be included. That model was initially highly accurate, but suffered from concept drift because the words used on websites change over time. This paper proposes an alternative method of detecting innovation in website texts, using semantically meaningful sentence embeddings. We hypothesized that website texts stay semantically similar over time even though they may use different words, and that sentence embeddings would therefore provide a classification model with more stability over time. These hypotheses were confirmed, although the external validation of the model is inconclusive. Points of note and suggestions for further research are discussed.
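A minimal sketch of such a pipeline (the pretrained encoder name and the toy corpus are placeholders, not the ones used in the paper), assuming the sentence-transformers library:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Hypothetical mini-corpus: website texts labelled innovative (1) or not (0).
texts = ["We develop novel battery chemistry for grid storage.",
         "Family bakery since 1952, fresh bread daily.",
         "Our lab prototypes AI-driven crop sensors.",
         "Affordable haircuts, walk-ins welcome."]
labels = [1, 0, 1, 0]

# Semantically meaningful sentence embeddings; the model name is an assumption.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(texts)

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(encoder.encode(["We patent new photonic chips."])))
```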
With modern measurement techniques, data that are best described as functional data are becoming increasingly prominent. Statistical techniques for comparing the means of two samples of functional data are well studied, but less is known about tests for comparing covariances. We compare several methods to estimate the covariance, and multiple test statistics, in a power analysis of a real dataset and in a simulation study. Our results show that, in a two-sample permutation test on the phoneme data set from the fda package in R, some of the test statistics appear to have a power of 1.000, while others show more moderate power levels in the 0.500-0.700 range. The simulation study shows that the power of a two-sample permutation test can differ vastly between test statistics, depending on the type of difference between the covariances of the two samples. This means that researchers must choose their test statistic carefully.
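A minimal sketch of such a test on discretized curves (using one possible statistic, the Frobenius norm of the difference between sample covariance matrices; the thesis compares several):

```python
import numpy as np

def cov_perm_test(A, B, n_perm=999, seed=0):
    """Two-sample permutation test for equality of covariance operators.

    A, B: (n_curves, n_gridpoints) arrays of discretized functional data.
    Statistic: Frobenius norm of the difference of sample covariances
    (one of several statistics that could be plugged in here).
    """
    rng = np.random.default_rng(seed)
    pooled = np.vstack([A, B])
    n_a = len(A)
    stat = lambda X, Y: np.linalg.norm(np.cov(X.T) - np.cov(Y.T), "fro")
    observed = stat(A, B)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))  # shuffle group labels
        if stat(pooled[idx[:n_a]], pooled[idx[n_a:]]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 20))
B = 1.5 * rng.normal(size=(30, 20))  # inflated covariance in the second sample
print(cov_perm_test(A, B))
```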
Multiple imputation of latent classes (MILC) is a technique that uses latent class analysis to estimate errors, and multiple imputation to correct errors, in composite data sets. To examine the impact of missingness on measurement error estimation and correction in a composite data set, a simulation study was conducted. Three approaches to handling missing data were examined under different conditions, such as different missing data and measurement error proportions and different missingness mechanisms. The approaches were complete case analysis (CCA), letting MILC handle the missing values (ECMV), and multivariate imputation by chained equations (MICE). These approaches were also applied to the energy consumption data set.
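As a rough sketch of two of the three approaches (toy binary data; scikit-learn's IterativeImputer is a simplified stand-in for MICE that does not model categorical variables explicitly, and MILC itself requires latent class software):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy composite data with values missing in the second indicator.
rng = np.random.default_rng(5)
df = pd.DataFrame(rng.integers(0, 2, size=(200, 3)),
                  columns=["y1", "y2", "y3"]).astype(float)
df.loc[rng.random(200) < 0.3, "y2"] = np.nan

# Complete case analysis (CCA): drop every row with a missing value.
cca = df.dropna()

# Chained-equations imputation, m = 5 completed data sets (MICE-style);
# rounding maps the continuous imputations back to the 0/1 categories.
imputations = [
    pd.DataFrame(IterativeImputer(sample_posterior=True,
                                  random_state=m).fit_transform(df),
                 columns=df.columns).round()
    for m in range(5)
]
print(len(cca), [imp["y2"].mean() for imp in imputations])
```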
The Employment Register (ER) and the Labour Force Survey (LFS) both measure the labour contracts of Dutch citizens; however, the two sources yield different results. One possible explanation is that both sources contain measurement error (ME). Previous research has used hidden Markov models (HMMs) to estimate and correct for ME in linked data from the ER and the LFS. These HMMs had some limitations, however. For example, they used a suboptimal approach to include covariates that were missing for observations for whom one particular contract type was observed in the ER; in this thesis, these covariates are referred to as missing covariates. To overcome the limitations of the HMMs, this thesis compared the performance of three latent variable methods (LVMs), namely latent class (LC) analysis, latent class tree (LCT) analysis, and tree-multiple imputation of latent classes (tree-MILC) analysis, in correcting for ME in the ER and the LFS. For this purpose, two simulation studies were conducted: one without and one with missing covariates. For the second simulation study, a new approach was developed in which missing covariates were included using direct effects and parameter restrictions. Finally, LC and tree-MILC analyses were performed on real data from the ER and the LFS for respondents aged 25 to 55 in the first quarters of 2016, 2017 and 2018, to compare the estimates to the original HMM estimates. In the simulation studies, few differences were found between the methods. The results showed that all model-based estimators were often considerably biased in conditions with two indicators. Although the bias and the variance decreased when one or two missing covariates were added, the largest decreases in bias and variance were observed when a third indicator was added. Furthermore, the analyses of the real data showed that the LC estimates, the tree-MILC estimates, and the original HMM estimates differed from each other; nevertheless, these differences were smaller than the original differences between the ER and the LFS. Future research that aims to correct for ME in the ER and the LFS could use the approach proposed in this study to include missing covariates. In addition, the current findings suggest that it may be beneficial for Statistics Netherlands to find a third indicator measuring the contract types of Dutch citizens, to enhance the accuracy of the estimates. Finally, LVMs could potentially be used for the production of official statistics; however, implementing these methods in practice requires further research on both a methodological and an organisational level.
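For reference, the basic latent class measurement model underlying these methods (notation mine): with K indicators of the latent true contract type X (e.g., ER, LFS, and a possible third source) and local independence given X,

```latex
\[
  \Pr(Y_1 = y_1, \ldots, Y_K = y_K)
  = \sum_{x} \Pr(X = x) \prod_{k=1}^{K} \Pr(Y_k = y_k \mid X = x),
\]
```

where the conditional tables P(Y_k | X) capture the measurement error of each source. This also suggests why a third indicator helps: it adds identifying information about these error rates that two indicators alone provide only weakly.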
The main goal of this thesis is to define mixed milk feeding (MMF) clusters based on feeding behavior in which breast milk and formula feeding are combined on the same day, and to study the relationship between these clusters and health outcomes, specifically the microbiota composition of a baby's gut. Although mixed milk feeding is a common practice, there is no accepted definition of clusters of this feeding behavior. Hence, the first aim of this thesis is to evaluate the cut-off values chosen earlier for defining MMF clusters and to use them to determine the most important variables for describing MMF clusters. ROC curve methodology is used to find the optimal cut-off values. The study creates different confusion matrices based on these variables and determines the best cut-off values considering both clinical implications and accuracy. With this technique, we developed new simplified MMF clusters that can serve as a gold standard for identifying MMF clusters, addressing the lack of consensus in the literature. The second aim of the thesis is to examine the relationship between the identified MMF clusters and microbiota composition using 16S rRNA data. Four different transformations of the microbiota data were applied (log transformation, scaled log transformation, centered log-ratio transformation for relative abundance, and relative abundance log transformation) to create four different principal response curves, with the aim of studying the change in microbiota composition within these MMF clusters over time, at the genus and species level. The second study showed that the clusters fed with more formula are more distant from the reference breastfeeding cluster. In addition, clusters with more formula-fed infants had higher abundances of the Lachnospiraceae genus and the Bifidobacterium breve species, while the abundances of the Escherichia-Shigella genus and the Bifidobacterium dentium species were higher in clusters with more breast-milk-fed infants. These findings have important implications for the field of MMF, providing insights into the impact of different feeding practices on the infant gut microbiota.
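One standard way to pick a cut-off from a ROC curve is Youden's J; the sketch below (invented feeding scores and labels; the thesis additionally weighs clinical implications, which this ignores) shows the mechanics:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(2)
# Hypothetical continuous feeding variable (e.g., proportion of feeds that are
# formula) and a binary reference cluster label.
score = np.concatenate([rng.normal(0.3, 0.15, 100), rng.normal(0.7, 0.15, 100)])
label = np.concatenate([np.zeros(100), np.ones(100)])

fpr, tpr, thresholds = roc_curve(label, score)
j = tpr - fpr                      # Youden's J statistic at each threshold
best = thresholds[np.argmax(j)]    # cut-off maximising sensitivity + specificity - 1
print(round(best, 3))
```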
Developments in Artificial Intelligence have resulted in the emergence of large language models such as ChatGPT. The development of such models has led to an increased risk of fraudulent activities; this research therefore aims to determine the most effective features for distinguishing between human-authored and ChatGPT-generated text within the scientific domain. A text corpus was constructed consisting of human-authored and ChatGPT-generated abstracts based on the titles of scientific papers. Three different XGBoost classifiers were built: the first based on Doc2Vec vector embeddings, the second on text-extracted features, and the third combining both. The results underscore the superiority of models incorporating Doc2Vec vector embeddings, while reading time emerged as the most influential feature for accurately predicting whether a text is human-authored or ChatGPT-generated in both the text-extracted feature model and the combined model. The combined model had the best performance in terms of accuracy. Nevertheless, the model based on Doc2Vec vector embeddings and text-extracted features was still outperformed by the GPTZero model, emphasizing the need for further refinement before it can be applied to assess whether a text is human-authored or ChatGPT-generated.
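A minimal sketch of the Doc2Vec-plus-XGBoost pipeline (placeholder corpus and hyperparameters, assuming the gensim and xgboost libraries; not the thesis configuration):

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from xgboost import XGBClassifier

# Placeholder corpus: abstracts labelled human-authored (0) or ChatGPT-generated (1).
docs = [("the estimator is consistent under mild regularity conditions", 0),
        ("this study explores the intricate interplay of various factors", 1)] * 50
tagged = [TaggedDocument(words=t.split(), tags=[i]) for i, (t, _) in enumerate(docs)]

# Train document embeddings, then infer a vector for each abstract.
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=20)
X = np.array([d2v.infer_vector(t.split()) for t, _ in docs])
y = np.array([lab for _, lab in docs])

clf = XGBClassifier(n_estimators=100).fit(X, y)
new = d2v.infer_vector("delving into the multifaceted landscape".split())
print(clf.predict(np.array([new])))
```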
Record linkage aims to bring together records from two or more files that belong to the same statistical entity. Linkage errors can occur during this process, and ignoring them can lead to biased inference. There is a growing emphasis on accounting for linkage errors in the statistical analysis of categorical data and contingency tables. In this thesis, we developed three new approaches for compensating for linkage errors in contingency tables. The first approach, the regularised estimator, uses ideas from the regularisation of ill-conditioned matrices. The two other approaches use probabilities to compute the expected contingency table given the observed contingency table, and to weight three existing correction methods by their estimated mean squared error. The new approaches were tested, together with two existing estimators, by means of a simulation study. For dependent contingency tables, we propose to use the expected value approach with a prior distribution that uses information about the observed values of the contingency table. For independent contingency tables, we propose to use the existing Q approach. The regularised estimator seems to have considerable potential for both dependent and independent tables, but improvement is still needed.
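To illustrate the basic idea of compensating for linkage errors (a simplified exchangeable-mislinkage model, not one of the thesis's estimators): if a fraction lambda of records is linked correctly and the rest is linked at random, the expected observed table is a known linear transform of the true table and can be inverted:

```python
import numpy as np

def correct_table(observed, lam):
    """Invert E[observed] = lam * true + (1 - lam) * independence_table.

    Simplified model: mislinked pairs follow the product of the margins,
    which are preserved under this kind of mislinkage.
    """
    n = observed.sum()
    margins = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    return (observed - (1 - lam) * margins) / lam

observed = np.array([[60., 40.],
                     [30., 70.]])
print(correct_table(observed, lam=0.9))  # linkage-error-corrected cell counts
```

The correction amplifies the dependence that mislinkage had diluted towards independence, which is also why such estimators become ill-conditioned as lambda shrinks, motivating the regularised estimator above.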
Brain activity in fMRI studies is represented by voxels: units of graphic information defining a small location in the brain. In a typical case, the brain is visualized using around 200,000 voxels. To measure activity, every location or voxel is tested individually with a separate hypothesis test; this leads to a massive multiple testing problem. One way to address this problem is with Bonferroni-like corrections on single voxels; however, Bonferroni is notorious for its conservativeness (Samuel-Cahn, 1996). Instead of correcting every test at the voxel level, one can also test groups (called clusters) of voxels. Hypothesis testing on clusters reduces the multiple testing problem by accepting or rejecting entire clusters, but leads to a new problem known as the 'spatial specificity paradox': inference at the voxel level accurately locates activation at the cost of low power for each test, whereas inference at the cluster level has more power but cannot localize activation any more precisely than "there is at least one active voxel in this cluster". Recently, a solution called All-Resolutions Inference (ARI), based on closed testing, was developed to tackle this problem (Rosenblatt, Finos, Weeda, Solari, & Goeman, 2018). This method offers a way to quantify activation within clusters without losing too much power. This project aims to assess and compare the quality of these new methods using simulation studies and real data applications.
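A toy illustration of the spatial specificity paradox described above (simulated z-scores; this demonstrates the voxel-versus-cluster trade-off, not ARI itself):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_voxels, n_active = 10_000, 50
z = rng.normal(size=n_voxels)
z[:n_active] += 2.5                      # a weakly active cluster of 50 voxels

# Voxel level: Bonferroni across all tests localises activity but has low power.
p = stats.norm.sf(z)
print("voxels surviving Bonferroni:", (p < 0.05 / n_voxels).sum(), "of", n_active)

# Cluster level: one z-test on the cluster mean is powerful, but only licenses
# the claim "at least one voxel in this cluster is active".
z_cluster = z[:n_active].mean() * np.sqrt(n_active)
print("cluster-level p-value:", stats.norm.sf(z_cluster))
```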