In this study the performance of feature-based dissimilarity space (FDS) classification is evaluated by comparing it to conventional classification techniques. In FDS classification a classifier is trained in a dissimilarity space instead of a feature vector space. Since FDS classification can be applied to a wide range of classifiers, a new, model-independent dissimilarity feature selection method is presented and tested. The fundamentals of this newly proposed selection method are given by the compactness hypothesis (Arkadev and Braverman, 1966), and its performance is evaluated in a Monte Carlo simulation experiment and a bootstrap study. The performance of FDS classification itself is estimated with a bootstrap procedure and compared to that of conventional classification techniques. The results indicate that FDS classification is beneficial in combination with a linear classifier and a complex classification task: combining a linear classifier with FDS classification fits a linear decision boundary in the dissimilarity space, and this decision boundary becomes non-linear in the original feature vector space.
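A minimal sketch of the dissimilarity-space idea described above, assuming Euclidean dissimilarities and a random prototype set; the thesis's actual dissimilarity measure and its selection method may differ.

```python
# Feature-based dissimilarity space (FDS) classification, minimal sketch.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import pairwise_distances
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Choose prototypes (here: a random subset of the training set).
rng = np.random.default_rng(0)
prototypes = X_tr[rng.choice(len(X_tr), size=20, replace=False)]

# Map each object to its vector of dissimilarities to the prototypes.
D_tr = pairwise_distances(X_tr, prototypes)  # rows live in the dissimilarity space
D_te = pairwise_distances(X_te, prototypes)

# A linear classifier in the dissimilarity space yields a non-linear
# decision boundary in the original feature space.
clf = LogisticRegression().fit(D_tr, y_tr)
print("test accuracy:", clf.score(D_te, y_te))
```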
Medical researchers frequently make statements that one model predicts survival better than another, and are frequently challenged to provide rigorous statistical justification for these statements. In general, it is important to quantify how well the model is able to distinguish between high-risk and low-risk subjects (discrimination), and how well the model predicts the probability of having experienced the event of interest prior to a specified time t (predictive accuracy). For ordinary (right censored) survival data, the two most popular methods for discrimination and predictive accuracy are the concordance index, or c-index (Harrell et al. 1986), and the prediction error based on the Brier score (Graf et al. 1999). In the absence of censoring, it is straightforward to define and estimate these measures. Adaptations of these simple estimates for right censored survival data have been proposed and are now in common use. The novel part of this thesis is to develop methods for calculating/estimating the concordance index and the Brier score prediction error in the context of interval censored survival data. The starting point is that we have interval censored data of the form (L_i, R_i] for subjects i = 1, ..., n, with L_i < R_i (L_i may be 0 and R_i may be infinity, to accommodate right censored data), and a given prediction model yielding a single (estimated) baseline hazard h_0(t) and one vector of (estimated) regression coefficients β. From this prediction model, prognostic scores β^T x_i and predicted survival probabilities S(t|x_i) = exp(−H_0(t) exp(β^T x_i)), with H_0 the cumulative baseline hazard, may be calculated for each subject i. Methods to estimate the concordance index and the Brier score prediction error for exponential and Weibull baseline hazards are proposed and evaluated in a simulation study. An application to real data is also provided.
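A minimal sketch of the concordance index in the uncensored case mentioned above, assuming a higher prognostic score means higher risk (shorter survival); the thesis's interval-censored estimators are more involved.

```python
# Concordance index (c-index) for uncensored survival data, minimal sketch.
import itertools
import numpy as np

def c_index(times, scores):
    """Fraction of comparable pairs in which the subject with the
    higher prognostic score also has the shorter survival time."""
    concordant, comparable = 0.0, 0
    for i, j in itertools.combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are not comparable here
        comparable += 1
        shorter, longer = (i, j) if times[i] < times[j] else (j, i)
        if scores[shorter] > scores[longer]:
            concordant += 1  # risk ordering matches time ordering
        elif scores[shorter] == scores[longer]:
            concordant += 0.5  # tied scores count half, a common convention
    return concordant / comparable

times = np.array([5.0, 2.1, 8.3, 3.7])
scores = np.array([0.2, 1.5, -0.3, 0.9])  # e.g. beta^T x_i
print(c_index(times, scores))  # 1.0: risk ordering fully matches time ordering
```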
In the world of clustering methodology, there exists a plethora of options. The choice becomes especially important when the number of clusters is not known a priori. Methods to handle missing data also vary widely, and these choices are often made based on the missing data mechanism. In this paper we investigate the intersection of both situations: clustering, where one of the major objectives is cluster discovery, in the presence of missing values. Model-based clustering estimates the structure of clusters (number, size, and distribution) using likelihood approaches. Likelihood methods also allow researchers to gain information from incomplete observations. In the following work, we investigate adaptations of these likelihood estimations to infer cluster information about a given data set. Model-based clustering becomes the focal point because of its objectivity in cluster discovery, and, for continuous data, its multivariate Gaussian density assumptions can be an asset in handling the problem of missing data. An algorithm that utilises marginal multivariate Gaussian densities for assignment probabilities was developed and tested against more conventional ways of model-based clustering for incomplete data: multiple imputation and using complete observations only. Assumptions about the missing data mechanism were taken into consideration during the testing of these methods, and were especially important for the model-based method when parameters had to be updated. All methods were tested using simulated data as well as publicly available real-life data. It was found that for cases with many observations, the complete-case and multiple imputation approaches have advantages over the marginal density method, due to the increased availability of disposable and borrowable information respectively. Dimensionality and cluster separation were also important factors. Multiple imputation was the preferred method when the data structure was more complicated (high dimensions, high cluster overlap), whereas in simpler settings the marginal method worked best. The marginal method also showed significant promise in classifying observations to their clusters, and adaptations that make its parameter estimates more robust are discussed in this paper.
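A minimal sketch of the marginal-density idea: evaluate a cluster's Gaussian density for an incomplete observation using only its observed coordinates. The two-cluster setup and all parameter values are illustrative, not the thesis's implementation.

```python
# Assignment probabilities from marginal multivariate Gaussian densities.
import numpy as np
from scipy.stats import multivariate_normal

def marginal_logpdf(x, mean, cov):
    """Log-density of the observed coordinates of x under N(mean, cov),
    obtained by marginalising out the missing (NaN) coordinates."""
    obs = ~np.isnan(x)
    return multivariate_normal.logpdf(
        x[obs], mean=mean[obs], cov=cov[np.ix_(obs, obs)]
    )

# Two illustrative cluster components with equal prior weights.
means = [np.array([0.0, 0.0, 0.0]), np.array([4.0, 4.0, 4.0])]
covs = [np.eye(3), np.eye(3)]

x = np.array([3.8, np.nan, 4.2])  # second coordinate is missing
logp = np.array([marginal_logpdf(x, m, c) for m, c in zip(means, covs)])
probs = np.exp(logp - logp.max())
probs /= probs.sum()  # posterior assignment probabilities (equal priors)
print(probs)          # heavily favours the second cluster
```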
A multi-group factor model is often used to study moderator effects. In many cases, however, the moderator variable is metric rather than nominal. In this paper, we propose a varying-coefficient factor model. With the application of kernel smoothing, a family of varying-coefficient factor models arises, depending on the smoothing parameter. These factor models were fitted to a set of personality data. Through model selection, the factor model was chosen that best depicts the trends of the model parameters of interest, with accompanying confidence bands, across values of a metric moderator variable.
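A minimal sketch of the kernel-smoothing ingredient: Gaussian kernel weights around a focal value of a metric moderator, which a locally weighted model fit at that value could use. The bandwidth and the age example are illustrative assumptions.

```python
# Kernel weights for locally weighted estimation along a metric moderator.
import numpy as np

def kernel_weights(moderator, focal_value, bandwidth):
    """Normalised Gaussian kernel weights centred at focal_value."""
    u = (moderator - focal_value) / bandwidth
    w = np.exp(-0.5 * u**2)
    return w / w.sum()

moderator = np.linspace(18, 80, 200)  # e.g. age of respondents
w = kernel_weights(moderator, focal_value=40.0, bandwidth=5.0)
# Observations near age 40 dominate the local fit; sweeping focal_value
# traces the varying coefficients, and the bandwidth is the smoothing parameter.
print(w.max(), w.min())
```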
Introduction: Action recognition is an important task for domestic care robots. Current action recognition literature exclusively studies closed-set recognition problems, where performance is evaluated on action classes that were also available in the training set. However, the real-world environment is by definition open, and it is both theoretically and practically infeasible to supply the robot with all necessary information beforehand. This work develops novelty detection methodology applicable to hidden Markov model (HMM) based classifiers, which have shown earlier success in action recognition. By filtering out unknown action instances, our novelty detection module increases system robustness in open environments, and is a first step towards adaptively learning robots.

Methodology: We first develop an ordinary action recognition system based on a new skeleton-derived feature and an HMM back-end classifier. The HMM system is estimated in three ways: clustered, Expectation-Maximization (EM), and Extended Baum-Welch (EBW). The latter is a discriminative training criterion, which could theoretically improve novelty detection accuracy. Since the EBW algorithm has only been implemented in speech recognition software, we wrote its first open-source implementation (in Matlab, publicly available from www.github.com/thomasmoerland/Thesis). Novelty detection is then approached from both a posterior probability and a hypothesis-testing perspective, which we unify as background models. Since novelty detection for action recognition has not been reported before, we investigate a diverse set of background models: sum over competing models, filler models, flat models, anti-models, reweighted anti-models, and combinations of them.

Results: Our HMM classification system reaches around 95% closed-set recognition accuracy on the Microsoft Action 3D dataset, which is near the state of the art. Performance did not differ between the clustered, EM, and EBW estimation methods, although our results do indicate that the latter two might be beneficial on more challenging datasets. The optimal novelty detection module, combining anti-models with flat models, achieved 78% novelty accuracy while maintaining 78% recognition accuracy. Novelty detection results were consistent over various dataset splits. Discriminative training did not alter novelty detection performance.

Conclusion: We are the first to study novelty detection for action recognition. Our results could increase system robustness in an open-set real-world environment, and furthermore serve as a first step towards an adaptively learning robot.
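A minimal sketch of background-model novelty detection: an instance is flagged as novel when its best class log-likelihood does not beat a background score by some margin. The Gaussian class models below are illustrative stand-ins for the trained HMMs, and the threshold is an assumption.

```python
# Novelty detection via a "sum over competing models" background score,
# one of the background-model variants named in the abstract.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

class_models = [norm(0.0, 1.0), norm(5.0, 1.0)]  # stand-ins for trained HMMs

def classify_or_reject(x, threshold=0.3):
    log_likes = np.array([m.logpdf(x) for m in class_models])
    # Background score: average over competing class models (in log space).
    background = logsumexp(log_likes) - np.log(len(class_models))
    ratio = log_likes.max() - background  # log-likelihood ratio vs background
    if ratio > threshold:
        return ("known", int(log_likes.argmax()))
    return ("novel", None)

print(classify_or_reject(5.2))  # close to a known class: accepted
print(classify_or_reject(2.5))  # between classes: flagged as novel
```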
In this thesis we focus on the modeling of large credit losses in a corporate asset portfolio. We compare loss estimates based on the classic Vasicek approach, with its assumption of a normally distributed loss, against a copula approach generating a heavier-tailed loss distribution. We also provide numeric implementations of both the Vasicek and the copula modeling approach, which are widely used in banks' risk management. In addition, we demonstrate how the Vasicek approach can be adopted for estimating a portfolio's concentration risk charge. This last contribution is my own development, inspired by my internship experience at the Royal Bank of Scotland. All presented results are complemented with a review of the corresponding classical works in credit risk modeling.
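A minimal sketch of the Vasicek large-homogeneous-portfolio loss quantile, a standard closed form (used, for example, in the Basel IRB formula); the PD, correlation, and confidence-level values are illustrative.

```python
# Loss quantile in Vasicek's one-factor credit risk model, minimal sketch.
from scipy.stats import norm

def vasicek_loss_quantile(pd, rho, alpha):
    """alpha-quantile of the portfolio loss fraction for default
    probability pd and asset correlation rho."""
    num = norm.ppf(pd) + rho**0.5 * norm.ppf(alpha)
    return norm.cdf(num / (1 - rho) ** 0.5)

# 99.9% loss quantile for PD = 1% and asset correlation 12%.
print(vasicek_loss_quantile(pd=0.01, rho=0.12, alpha=0.999))
```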
This thesis is focused on the K-Competing Queues problem, which seeks an optimal way to share a common resource among multiple user classes. The basic version of this problem, where users have unlimited patience, can be solved to optimality using variations of the well-known cµ rule. The more relevant version of the problem, where user patience is limited, does not yet have an optimal solution, although some authors have been able to identify special instances where it is optimal to prioritise one user class over the others. By employing a simple coupling technique, we extend the set of instances for which a full priority policy is optimal. Along with our search for optimality, we also construct approximate solutions via a number of heuristics. As the numeric simulations show, the best of all known heuristics turns out to be the one that solves the fluid approximation of the problem to optimality.
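A minimal sketch of the cµ rule for the basic (unlimited-patience) version: serve the non-empty class with the largest c_k·µ_k, where c_k is the holding cost and µ_k the service rate. The class names and rates are illustrative.

```python
# The cmu priority rule, minimal sketch.
holding_costs = {"gold": 3.0, "silver": 2.0, "bronze": 1.0}  # c_k
service_rates = {"gold": 0.8, "silver": 1.5, "bronze": 2.0}  # mu_k

def next_class_to_serve(queue_lengths):
    """Return the non-empty class with the highest cmu index, or None."""
    candidates = [k for k, n in queue_lengths.items() if n > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda k: holding_costs[k] * service_rates[k])

print(next_class_to_serve({"gold": 2, "silver": 0, "bronze": 5}))
# gold: 3.0 * 0.8 = 2.4 beats bronze: 1.0 * 2.0 = 2.0 -> serve "gold"
```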
Next-Generation Sequencing (NGS) technologies provide promising new opportunities for the quantitative comparison of genomic expression profiles. Analysis of NGS datasets is made challenging by their high dimensionality and count-based nature. Modelling frameworks are based on negative binomial GLMs and involve multiple testing, and analytical formulas to express power are not available in this setting. During this investigation, functions were created to do simulation-based power calculations for NGS data, based on small pilot datasets. The task sequence is as follows: first, empirical Bayesian estimation methods are applied to a pilot dataset to recover distributional parameters that reflect data structure and signal in a population. These parameters are then employed within a data-generative framework to simulate datasets of increasing sample size. Finally, tests of differential expression on these simulations yield a prediction of the average power and number of rejections associated with each value of sample size. To assess the performance of our proposed power calculation algorithm, we used publicly available, comparatively large datasets, sampled "pilot" subsets from these, and compared predictions based on the pilots to results obtained with the full-sized datasets. Our functions are useful to any researcher confident about the homogeneity of their data. Our results, however, also indicate that stable estimation of the proportion of differential expression, p1, is difficult when sample size is small, which sometimes leads to inaccurate power calculations. The observed variation was such that we suspect it also influences standard differential expression analysis in an undesired manner. We therefore argue that general care should be taken in NGS research, because currently accepted sample sizes may not always be large enough to yield a representative image of differential expression between populations.
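A minimal sketch of the simulate-then-test loop described above: generate negative binomial counts for two groups across genes, test each gene, apply Benjamini-Hochberg correction, and report average power per sample size. All parameter values and the test choice (practical pipelines use NB-GLM tests) are illustrative assumptions.

```python
# Simulation-based power calculation for count data, minimal sketch.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
n_genes, p1, fold_change, dispersion = 500, 0.1, 2.0, 0.2

def nb_params(mean, dispersion):
    # scipy parameterisation: n = 1/dispersion, p = n / (n + mean)
    n = 1.0 / dispersion
    return n, n / (n + mean)

def simulated_power(n_per_group, n_sims=10):
    powers = []
    for _ in range(n_sims):
        is_de = rng.random(n_genes) < p1        # truly differential genes
        mu1 = rng.uniform(20, 200, n_genes)
        mu2 = np.where(is_de, mu1 * fold_change, mu1)
        pvals = np.empty(n_genes)
        for g in range(n_genes):
            a = stats.nbinom.rvs(*nb_params(mu1[g], dispersion),
                                 size=n_per_group, random_state=rng)
            b = stats.nbinom.rvs(*nb_params(mu2[g], dispersion),
                                 size=n_per_group, random_state=rng)
            pvals[g] = stats.mannwhitneyu(a, b).pvalue
        rejected = multipletests(pvals, alpha=0.05, method="fdr_bh")[0]
        powers.append(rejected[is_de].mean())   # fraction of true DE detected
    return np.mean(powers)

for n in (3, 5, 10):
    print(n, simulated_power(n))  # predicted average power per sample size
```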
In this master thesis we study patterns in animal populations that arise due to a density-dependent movement speed v of one of the involved species. This new description of the movement leads to a Cahn-Hilliard equation describing the evolution of the concentration of the animal species in question. Our main interest is a modification of the generally used standard predator-prey reaction-diffusion description of the evolution of two interacting species, where the standard diffusive movement of one of the species is replaced with this fast Cahn-Hilliard-like movement. This leads to a fourth-order slow-fast partial differential equation, which forms the system that is the main object of study in this thesis. We first present an in-depth literature study of the general Cahn-Hilliard system, focusing on the evolution, both short and long term, of solutions starting from a uniform state. Subsequently we analyze the full population model, with the Cahn-Hilliard-like movement, on a one-dimensional spatial domain via a weakly non-linear stability analysis, leading to a (real) Ginzburg-Landau equation as the amplitude equation for variations from steady states of the model. All our findings are applied to a system describing the interaction between mussels and algae. This analytic approach, supplemented by numerical simulations of the one-dimensional model, is then used to explain the occurrence and behaviour of patterns in mussel beds.
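For reference, a minimal statement of the Cahn-Hilliard equation in its common standard form; the thesis's density-dependent-speed derivation may use different notation and coefficients.

```latex
% Standard Cahn-Hilliard equation for a concentration u(x,t):
% f(u) is the derivative of a double-well potential, e.g. f(u) = u^3 - u,
% and kappa > 0 penalises sharp interfaces.
\[
  \frac{\partial u}{\partial t}
  = \Delta \left( f(u) - \kappa\, \Delta u \right),
  \qquad f(u) = u^{3} - u .
\]
```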