Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
In this thesis, we explore the feasibility of identifying impact data in humanitarian documents. We approach this as a sentence classification task and create a human-labelled set of over 11,000 sentences extracted from documents related to the IFRC’s Disaster Relief Emergency Fund. Using this set, we compare various classification models and feature sets and show that it is possible to classify sentences containing impact data with good performance. Our final model, a linear Support Vector Machine trained on a Document-Term Matrix of word bigrams, achieves a precision of 0.852 and a recall of 0.746 (F1 = 0.796) on a separate validation set of 1,114 sentences. In the second part of our research, we describe techniques that can be applied when fewer human-labelled examples are available. In brief experiments with the simplest of these techniques, we show that it is indeed possible to achieve the aforementioned performance on the validation set with 7,454 fewer labelled examples in the training set (approximately 75% fewer). Our work can serve as an exploratory first step towards fully automated impact data extraction from text. The work has its limitations. For instance, we found it very difficult to define what impact data is when creating a labelled ground truth, which affects the generalisability of our ground-truth data set. Further work can focus on the impact data definition. Another idea for future work is the investigation of newer (e.g. neural network-based) techniques for humanitarian text processing tasks such as this one.
A continuation of our work on techniques that can solve problems with fewer labelled examples, specifically for text from the humanitarian domain, is also a valuable next step.
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Currently, the ‘wish to move’ to another house among Dutch people is measured through the WoON survey, conducted every three years. More frequent measurement is desired to improve housing policy making. Nowadays, people express their ‘wish to move’ on social media. In this research, it was found that certain features derived from tweet texts distinguish ‘wish to move’ tweets from others. The best logistic regression classifier developed in this research achieves an F1-score of 0.556 in identifying ‘wish to move’ tweets, indicating that it is possible to track in a timely manner the proportion of the Dutch population active on Twitter that wishes to move. Further, it was found that actual relocation can be identified by following ‘wish to move’ users. By engineering features through aggregating their subsequent tweets, classifiers were built to automatically determine whether a ‘wish to move’ user relocated in the follow-up period. The best logistic regression classifier can determine whether ‘wish to move’ users relocated in the two subsequent years with an F1-score of 0.701. With it, the proportion of ‘wish to move’ users who actually relocated later can be estimated.