Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
A problem for survey datasets is that the data may come from a selective group of the population. This makes it hard to produce unbiased and accurate estimates for the entire population. One way to overcome this problem is to use sample matching. In sample matching, one draws a sample from the population using a well-defined sampling mechanism. Next, units in the survey dataset are matched to units in the drawn sample using some background information. Usually the background information is insufficiently detailed to enable exact matching, where a unit in the survey dataset is matched to the same unit in the drawn sample. Instead, one usually needs to rely on synthetic matching methods, where a unit in the survey dataset is matched to a similar unit in the drawn sample. This study developed several sample matching methods for categorical data. A selective panel represents the available complete but biased dataset, which is used to estimate the distribution of the target variable in the population. The results show that exact matching unexpectedly performs best among all matching methods, and that using weighted sampling instead of random sampling does not increase the accuracy of matching. Although predictive mean matching performed worse than exact matching, a proper adjustment when transforming categorical variables into numerical values would substantially increase its matching accuracy. All matches were used to reduce overfitting in machine learning, and the results show that all of them are able to increase prediction precision.
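The synthetic matching step described above can be sketched as follows. This is only an illustrative sketch, not the thesis's implementation: it assumes each unit is a tuple of categorical background variables, uses the number of mismatching variables as the similarity measure, and lets the closest panel unit donate its target value to each unit of the drawn sample. The names `hamming`, `match_sample`, and `estimate_distribution`, and the toy data, are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of background variables on which two units disagree."""
    return sum(x != y for x, y in zip(a, b))

def match_sample(sample, panel):
    """For each unit in the drawn sample, donate the target value of the
    panel unit with the fewest mismatching background variables.
    Ties go to the first such panel unit (an arbitrary assumption)."""
    donated = []
    for unit in sample:
        _, target = min(panel, key=lambda p: hamming(unit, p[0]))
        donated.append(target)
    return donated

def estimate_distribution(values):
    """Relative frequency of each category among the donated values."""
    counts = Counter(values)
    return {k: c / len(values) for k, c in counts.items()}

# Toy data (hypothetical): panel units carry (background, target),
# sample units carry background variables only.
panel = [(("F", "young"), "yes"), (("M", "old"), "no"), (("F", "old"), "no")]
sample = [("F", "young"), ("M", "old"), ("M", "young"), ("F", "old")]

donated = match_sample(sample, panel)
distribution = estimate_distribution(donated)
```

Because the drawn sample follows a known sampling mechanism, the distribution of the donated values serves as the population estimate; how well it does so depends on how strongly the background variables predict the target.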
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
National statistical institutes (NSIs) try to construct datasets that are rich in information as efficiently and cost-effectively as possible. This can be achieved by combining available data, such as administrative data or survey data. When datasets do not pertain to the same units, one can sometimes resort to statistical matching to integrate them. Statistical matching is a data fusion technique that can be used when different datasets contain different units but share a set of common (background) variables. The main goal of statistical matching is to estimate the relationship between the non-common variables in the different datasets. This paper investigates how best to utilize a small overlap of units in a statistical matching situation where the data consist only of categorical variables. A small overlap of units contains joint information on all variables for only a limited number of units. A new statistical matching method, namely the combined estimator, is developed in this paper, employing an idea from small area estimation. The performance of the combined estimator was compared to a couple of pre-existing statistical matching methods for categorical data under various data conditions. The results show that, even though the combined estimator itself does not perform better than the pre-existing statistical matching method (the EM algorithm), using the combined estimator as the starting point of the EM algorithm helps increase its accuracy under certain data circumstances. The improvement in accuracy was observed in cases where the number of matching variables was large.
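To make the statistical matching setting concrete, the sketch below estimates the joint distribution of two non-common categorical variables Y and Z from two files that share only a background variable X. It uses the standard conditional independence assumption (Y and Z independent given X), which is a swapped-in baseline for illustration and not the combined estimator or the EM algorithm from this thesis. The function name `cia_joint` and the toy files are hypothetical.

```python
from collections import Counter, defaultdict

def cia_joint(file_a, file_b):
    """Estimate P(y, z) from file_a = [(x, y), ...] and file_b = [(x, z), ...],
    assuming Y and Z are independent given X:
        P(y, z) = sum over x of P(x) * P(y | x) * P(z | x)
    """
    # P(x) is estimated from both files pooled together.
    x_counts = Counter(x for x, _ in file_a) + Counter(x for x, _ in file_b)
    n = sum(x_counts.values())

    y_given_x = defaultdict(Counter)
    for x, y in file_a:
        y_given_x[x][y] += 1
    z_given_x = defaultdict(Counter)
    for x, z in file_b:
        z_given_x[x][z] += 1

    joint = defaultdict(float)
    for x, cx in x_counts.items():
        ny = sum(y_given_x[x].values())
        nz = sum(z_given_x[x].values())
        if ny == 0 or nz == 0:
            # x seen in only one file: skipped, so the result may sum to < 1.
            continue
        for y, cy in y_given_x[x].items():
            for z, cz in z_given_x[x].items():
                joint[(y, z)] += (cx / n) * (cy / ny) * (cz / nz)
    return dict(joint)

# Toy files (hypothetical): file A observes (X, Y), file B observes (X, Z).
file_a = [("a", "y1"), ("a", "y2"), ("b", "y1"), ("b", "y1")]
file_b = [("a", "z1"), ("a", "z1"), ("b", "z1"), ("b", "z2")]

joint = cia_joint(file_a, file_b)
```

A small overlap of units, for which X, Y, and Z are all observed, provides direct information on the joint distribution that this assumption cannot supply, which is the situation the abstract addresses.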