Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
National statistical institutes (NSIs) try to construct datasets that are rich in information as efficiently and cost-effectively as possible. This can be achieved by combining available data, such as administrative data or survey data. When datasets do not pertain to the same units, one can sometimes resort to statistical matching to integrate them. Statistical matching is a data fusion technique that can be used when different datasets contain different units but share a set of common (background) variables. The main goal of statistical matching is to estimate the relationship between the non-common variables in the different datasets. This paper investigates how best to utilize a small overlap of units in a statistical matching situation where the data consist only of categorical variables. A small overlap of units contains joint information on all variables for only a limited number of units. A new statistical matching method, the combined estimator, is developed in this paper, employing an idea from small area estimation. The performance of the combined estimator was compared with that of a couple of pre-existing statistical matching methods for categorical data under various data conditions. The results show that, even though the combined estimator itself does not perform better than the pre-existing statistical matching method (the EM algorithm), using the combined estimator as the starting point of the EM algorithm helps increase its accuracy under certain data conditions. The improvement in accuracy was observed in cases where the number of matching variables was large.
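The abstract does not spell out how the combined estimator is built, so the following is only an illustrative sketch, not the thesis's method. It assumes a composite form familiar from small area estimation: a weighted average of a direct estimate of the joint distribution taken from the small overlap and a synthetic estimate obtained from the two separate files under the conditional independence assumption (Y and Z independent given the common variable X). The function names, the fixed `weight`, and all counts are hypothetical.

```python
import numpy as np


def cia_estimate(counts_xy, counts_xz):
    """Joint P(x, y, z) assuming Y and Z are independent given X.

    counts_xy: (X, Y) contingency table from file A (hypothetical data).
    counts_xz: (X, Z) contingency table from file B (hypothetical data).
    Assumes every category of X is observed in both files.
    """
    counts_xy = np.asarray(counts_xy, dtype=float)
    counts_xz = np.asarray(counts_xz, dtype=float)
    # Marginal of the common variable, pooled over both files.
    n_x = counts_xy.sum(axis=1) + counts_xz.sum(axis=1)
    p_x = n_x / n_x.sum()
    # Conditional distributions estimated within each file.
    p_y_given_x = counts_xy / counts_xy.sum(axis=1, keepdims=True)
    p_z_given_x = counts_xz / counts_xz.sum(axis=1, keepdims=True)
    # P(x, y, z) = P(x) * P(y | x) * P(z | x)
    return p_x[:, None, None] * p_y_given_x[:, :, None] * p_z_given_x[:, None, :]


def composite_estimate(counts_xyz_overlap, p_synthetic, weight):
    """Weighted mix of the direct overlap estimate and a synthetic estimate.

    weight is an assumed, user-chosen value in [0, 1]; the thesis may
    determine it differently.
    """
    counts = np.asarray(counts_xyz_overlap, dtype=float)
    p_direct = counts / counts.sum()
    return weight * p_direct + (1.0 - weight) * p_synthetic


# Hypothetical counts: two categories each for X, Y and Z.
counts_xy = [[30, 10], [20, 40]]            # file A: X by Y
counts_xz = [[25, 15], [10, 50]]            # file B: X by Z
overlap = [[[3, 1], [1, 0]],                # small overlap: X by Y by Z
           [[1, 2], [0, 4]]]

p_cia = cia_estimate(counts_xy, counts_xz)
p_comb = composite_estimate(overlap, p_cia, weight=0.3)
```

A table such as `p_comb` could then serve as the initial value for an EM routine fitted to the incomplete data, which is the role the abstract reports as beneficial when the number of matching variables is large.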