Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
A problem for survey datasets is that the data may come from a selective group of the population. This makes it hard to produce unbiased and accurate estimates for the entire population. One way to overcome this problem is to use sample matching. In sample matching, one draws a sample from the population using a well-defined sampling mechanism. Next, units in the survey dataset are matched to units in the drawn sample using some background information. Usually the background information is insufficiently detailed to enable exact matching, where a unit in the survey dataset is matched to the same unit in the drawn sample. Instead, one usually needs to rely on synthetic matching methods, where a unit in the survey dataset is matched to a similar unit in the drawn sample. This study developed several methods for sample matching with categorical data. A selective panel represents the available complete but biased dataset, which is used to estimate the distribution of the target variable in the population. The results show that exact matching unexpectedly performs best among all matching methods, and that using weighted sampling instead of random sampling does not increase the accuracy of matching. Although predictive mean matching was outperformed by exact matching, a suitable transformation of the categorical variables into numerical values would substantially increase the accuracy of matching. All matching methods were also applied to reduce overfitting in machine learning, and the results show that each of them increases prediction precision.
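
The matching step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a simple nearest-profile rule under Hamming distance on categorical background variables, with ties broken at random, and the names `hamming`, `match_sample`, `estimate_distribution`, and the toy data are all hypothetical. Each unit in the drawn sample receives the target value of the most similar panel unit, and the matched values are then used to estimate the distribution of the target variable.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of background variables on which two units differ."""
    return sum(x != y for x, y in zip(a, b))


def match_sample(sample, panel, rng):
    """For each sampled unit, take the target value of the closest panel unit.

    Assumption: closeness is Hamming distance on the categorical background
    variables; ties between equally close panel units are broken at random.
    """
    matched = []
    for unit in sample:
        dists = [hamming(unit, donor["background"]) for donor in panel]
        best = min(dists)
        candidates = [d for d, dist in zip(panel, dists) if dist == best]
        matched.append(rng.choice(candidates)["target"])
    return matched


def estimate_distribution(values):
    """Relative frequency of each category among the matched target values."""
    counts = Counter(values)
    total = len(values)
    return {category: n / total for category, n in counts.items()}


# Hypothetical selective panel: background variables plus the target variable.
panel = [
    {"background": ("f", "young", "urban"), "target": "yes"},
    {"background": ("m", "old", "rural"), "target": "no"},
    {"background": ("m", "young", "urban"), "target": "yes"},
]

# Hypothetical drawn sample: background variables only, target unknown.
sample = [
    ("f", "young", "rural"),
    ("m", "old", "rural"),
    ("f", "old", "rural"),
]

rng = random.Random(0)
matched = match_sample(sample, panel, rng)
distribution = estimate_distribution(matched)
print(matched)
print(distribution)
```

In this toy example every sampled unit has a unique closest panel unit, so the random tie-breaking never comes into play. A distance of zero here only means an identical background profile, which is weaker than the exact matching described above, where the matched record is the same unit.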