Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
The task of finding suitable candidates for a job has never been an easy one, and now that recruiters have access to various online job boards and are not necessarily constrained by national borders, it can be argued that shortlisting relevant candidates is more difficult than ever. This is especially true for online recruitment agencies that have huge databases of potential candidates and no effective way to quickly identify which of those candidates have the required experience and skills for the vacancy at hand. Companies solve this problem in many different ways. In the case of YoungCapital, a Dutch recruitment agency, all candidates can state their preferred profession and location when creating a profile on the company's website, and recruiters can then create a search query based on those stated preferences. It is also possible to match keywords against candidates' resumes, but this is a manual task in which recruiters have to decide on the specific keywords they want to find. Given recent advances in machine learning and natural language processing, it was decided to try a learning-to-rank (LTR) approach to see whether the candidate search process could be improved by presenting recruiters with a ranked list of candidates for each job, with the most suitable candidates at the top. The LambdaMART model was chosen as the state-of-the-art algorithm for this task, and the baseline ranking model was a simple linear regression. Most of the features were designed using custom word embeddings. The results were evaluated with common rank-based measures: Normalised Discounted Cumulative Gain (NDCG) and Mean Average Precision (MAP). Precision, which ignores the order of results, was reported as well.
Overall, we found a significant improvement over the current method on all three measures. We also demonstrated the impact of different feature sets on the performance of ranking models.
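As an illustration of the main evaluation measure mentioned above, NDCG compares the discounted gain of a ranking against that of the ideal ordering of the same items. The sketch below is ours, not code from the thesis, and assumes the common logarithmic-discount variant of DCG (some implementations use a `2^rel - 1` gain instead):

```python
import numpy as np

def dcg(relevances, k=None):
    """Discounted Cumulative Gain of a ranked list of relevance grades."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # positions 1..n -> log2(2..n+1)
    return float(np.sum(rel / discounts))

def ndcg(relevances, k=None):
    """NDCG: DCG normalised by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# A ranking that places the most relevant candidate third scores below 1.0.
score = ndcg([1, 0, 2, 0])
```

A perfectly ordered list gives NDCG = 1, which is what makes the measure comparable across queries with different numbers of relevant candidates.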
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Online Linear Regression is a sequential variant of regression in which the data points arrive one by one. It is normally studied in the game-theoretic framework of Online Convex Optimization, which models the data as being generated by an adversary. In this framework, the standard statistical procedure of Online Ridge Regression is known to be essentially optimal. In statistics, there is an improvement for Ridge Regression when the noise is not constant: Weighted Ridge Regression, which relies on weighting the data by their variances. In this thesis, we employ weighting in Online Ridge Regression to show that an improvement over Online Ridge Regression can be made. We furthermore explore the situation in which weighting is disadvantageous, both mathematically and experimentally using simulations. Finally, we apply Online Weighted Ridge Regression to different real-world datasets and find that we can also improve on Online Ridge Regression in practical situations.
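The weighting idea described above can be sketched as follows: maintain the ridge normal equations incrementally and weight each incoming observation, e.g. by its inverse noise variance. This is a minimal illustrative implementation under the standard regularised least-squares form; the class name and the heteroscedastic simulation are ours, not the thesis's code:

```python
import numpy as np

class OnlineWeightedRidge:
    """Online ridge regression with per-observation weights.
    Maintains A = lam*I + sum_i w_i x_i x_i^T and b = sum_i w_i y_i x_i;
    the current estimate is A^{-1} b. Setting all weights to 1 recovers
    plain Online Ridge Regression."""

    def __init__(self, dim, lam=1.0):
        self.A = lam * np.eye(dim)
        self.b = np.zeros(dim)

    def predict(self, x):
        return float(np.linalg.solve(self.A, self.b) @ x)

    def update(self, x, y, weight=1.0):
        self.A += weight * np.outer(x, x)
        self.b += weight * y * x

# Usage: a heteroscedastic stream, weighting each point by 1/variance.
rng = np.random.default_rng(0)
model = OnlineWeightedRidge(dim=2)
w_true = np.array([2.0, -1.0])
for _ in range(200):
    x = rng.normal(size=2)
    sigma = rng.choice([0.1, 2.0])          # noise level varies per point
    y = w_true @ x + sigma * rng.normal()
    model.update(x, y, weight=1.0 / sigma**2)
```

Down-weighting the noisy observations lets the low-noise points dominate the estimate, which is the mechanism behind the improvement the thesis studies.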
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Machine learning classifiers are naturally black boxes when it comes to interpretation. In this thesis, Decision Boundary Approximation (DBA), a new algorithm for locally explaining complex binary classifiers, is developed, tested experimentally and discussed. The algorithm explains predictions of individual instances by approximating the most relevant region of the decision boundary with a linear model. We review and discuss the limitations of existing methods when applied to classification, with a specific focus on LIME due to its conceptual similarity to DBA. Experiments with DBA cover both low-dimensional and sparse high-dimensional data. In Experiment 1 we show that DBA can provide stable explanations for various decision boundary structures in a 2D simulated case. Experiment 2 demonstrates that DBA outperforms LIME in low dimensions, while in Experiment 3 (MNIST data) we show that when data are sparse, DBA explanations can include features that are absent from the explained example, making the explanation more complete compared to LIME. In Experiment 4 we explain a Naive Bayes classifier trained on SMS ham/spam messages and show that the DBA solution agrees with the Naive Bayes posterior. Finally, the benefits and drawbacks of DBA are discussed in detail and recommendations for future modifications are given.
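The general idea of a local linear explanation, shared by DBA and LIME, can be sketched generically: perturb the instance, label the perturbations with the black-box classifier, and fit a linear surrogate whose coefficients serve as the explanation. Note this is only an illustrative sketch of that shared idea, not the thesis's DBA algorithm (which specifically approximates the nearest region of the decision boundary); all names here are ours:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_linear_explanation(black_box, x, scale=0.5, n=500, seed=0):
    """Fit a linear surrogate to a black-box classifier around instance x.
    Samples n Gaussian perturbations of x, labels them with the black box,
    and returns the surrogate's coefficients as a local explanation."""
    rng = np.random.default_rng(seed)
    samples = x + scale * rng.normal(size=(n, x.size))
    labels = black_box(samples)
    surrogate = LogisticRegression().fit(samples, labels)
    return surrogate.coef_[0]

# Toy black box: the prediction depends only on the first coordinate.
black_box = lambda X: (X[:, 0] > 0).astype(int)
coefs = local_linear_explanation(black_box, np.array([0.2, 1.5]))
# The first feature should dominate the local explanation.
```

A surrogate fitted this way is faithful only near the explained instance, which is exactly the limitation that motivates comparing explanation methods experimentally, as the thesis does.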
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
The Kalman filter has numerous applications in spatial-temporal prediction. A common application is the guidance, navigation, and control of vehicles, particularly aircraft and spacecraft [1]. In this thesis, we focus on one typical spatial-temporal data type: discrete time and discrete space. We consider a rectangular grid for the space domain. To reduce complexity, we make a first-order Markov assumption in both time and space. In addition, several input control features are introduced into the Kalman filter. In other words, the distribution of future states depends only on the current states and the input control features in their own area and in neighboring areas. Under our Markov assumption, it is natural for the transition matrix of the Kalman filter to be sparse: transition matrices with a constrained sparsity structure are designed to capture the spatial correlation among the areas. We derive the equations for inference in this particular spatial system, namely the Kalman filter and the Kalman smoother. Using these results, we further consider the estimation of the model parameters through a modified Expectation-Maximization (EM) algorithm that produces sparse transition matrices. This stands in contrast with the standard EM algorithm, which usually produces a dense estimate of the matrices. To respect the pre-specified spatial sparsity structure, we specify greedy EM updates that operate on the rows of the transition matrix.
We study the properties of our new method in simulations and apply it to a real dataset on aviation safety, where the goal is to predict which areas at Schiphol airport are at risk of having a large density of birds in the near future.
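The predict/update recursion of the standard Kalman filter with input controls, which the abstract builds on, can be sketched as follows. This is a generic dense implementation with our own (hypothetical) names; in the thesis's spatial setting the transition matrix `A` would additionally be constrained to the neighbourhood sparsity structure:

```python
import numpy as np

def kalman_filter_step(mu, Sigma, y, A, C, Q, R, u=None, B=None):
    """One predict/update step of the standard Kalman filter:
      state: x_t = A x_{t-1} + B u_t + w_t,  w_t ~ N(0, Q)
      obs:   y_t = C x_t + v_t,              v_t ~ N(0, R)
    Returns the filtered mean and covariance at time t."""
    # Predict step: propagate the previous belief through the dynamics.
    mu_pred = A @ mu + (B @ u if u is not None else 0.0)
    Sigma_pred = A @ Sigma @ A.T + Q
    # Update step: correct the prediction with the new observation.
    S = C @ Sigma_pred @ C.T + R                 # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred
    return mu_new, Sigma_new

# Usage: with identity dynamics and a precise observation, the filtered
# mean moves almost all the way to the observation.
mu, Sigma = kalman_filter_step(
    mu=np.zeros(2), Sigma=np.eye(2), y=np.array([1.0, -1.0]),
    A=np.eye(2), C=np.eye(2), Q=0.01 * np.eye(2), R=0.01 * np.eye(2))
```

Running this recursion forward over all time steps gives the filtered estimates; the smoother then runs a backward pass over the same quantities, and the EM algorithm alternates between such inference passes and parameter updates.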