A vast number of questions and problems in probability theory require conditional probability for their answers and solutions. From traditional games of dice to modern statistical applications and machine learning, all use conditional probability in some sense to obtain more insight into the problem. Fundamental as conditional probability is, it is not without controversy. Problems and paradoxes like the Borel-Kolmogorov paradox, Monty Hall’s three door problem and the two envelope problem have puzzled mathematicians, statisticians and psychologists for centuries, resulting in much debate and a vast amount of literature. This thesis concerns some of the most well-known paradoxes in conditional probability. In all of them, the paradoxical result arises from wrongly stating the probability space of the problem or from wrongly applying conditional probability, for example by not specifying the accompanying σ-algebra or by conditioning on a collection of sets that does not form a partition. All of these problems can be avoided by always stating the probability space, including the σ-algebra, when applying conditional probability. The two most obvious examples are the Borel-Kolmogorov paradox and Monty Hall’s problem. The Borel-Kolmogorov paradox shows why conditioning on sets of measure zero is only possible with great care, and why it is necessary to state the accompanying σ-algebra alongside the solution. Monty Hall’s three door problem is a prime example of wrongly conditioning on a collection of subsets that cannot form a partition of the sample space. The original problem asks for a single probability; however, correctly applying conditional probability reveals that the probability of the car being behind the other door is dilated between the two values 1/2 and 1.
In both cases the paradoxical results vanish when the whole probability space is considered and conditional probability is applied correctly. The dilation of the conditional probability, as in Monty Hall’s problem, is investigated further in this thesis. Problems like Monty Hall and the boy or girl problem resemble each other in such a fundamental fashion that a generalization exists encompassing them both. Furthermore, safe probability, introduced by Grünwald [Grü18b], can be applied to answer the following question: if one should pin a single probability on an event, for example on the car being behind the other door in Monty Hall’s game, which probability should it be? This generalization can be applied to all problems with a countable outcome space with a fixed probability measure and a finite set of possible observations with sufficiently many possible probability measures, resolving several paradoxes in probability at once.
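As an illustration of the dilation phenomenon this abstract describes (not part of the thesis itself), the sketch below computes the conditional probability that switching wins in the Monty Hall game, as a function of an assumed host-strategy parameter `q`: the probability that the host opens door 3 when the car is behind the contestant's door 1 and the host is free to choose. The function name and parameterization are illustrative assumptions, not from the source.

```python
from fractions import Fraction

def p_switch_wins(q: Fraction) -> Fraction:
    """P(car behind the remaining door | contestant picked door 1,
    host opened door 3), where q is the assumed probability that the
    host opens door 3 when the car is behind door 1 (his free choice)."""
    # Car behind door 2: host is forced to open door 3 (probability 1).
    # Car behind door 1: host opens door 3 with probability q.
    # Car behind door 3: host never opens the door hiding the car.
    num = Fraction(1, 3) * 1                      # switching wins
    den = Fraction(1, 3) * 1 + Fraction(1, 3) * q
    return num / den

# The unconditional probability that switching wins is always 2/3, but the
# conditional probability dilates over [1/2, 1] with the host's strategy:
assert p_switch_wins(Fraction(1, 2)) == Fraction(2, 3)  # indifferent host
assert p_switch_wins(Fraction(1)) == Fraction(1, 2)     # host always opens door 3
assert p_switch_wins(Fraction(0)) == Fraction(1)        # host never opens door 3
```

The single probability 2/3 is only recovered for the indifferent host; any other host strategy moves the conditional probability strictly between 1/2 and 1.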
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Recently, a new theory of hypothesis testing was introduced: safe testing. Within the safe testing framework, random variables called S-values are used for hypothesis testing. S-values can be interpreted as both conservative p-values and Bayes factors. Further, they allow for optional continuation: S-values from multiple studies can be multiplied while retaining a type-I error guarantee, and some S-values are even robust under the frequentist interpretation of optional stopping. For this thesis, I developed safe tests for two classical frequentist hypothesis tests: the 2x2 contingency table test and its stratified equivalent, the Cochran-Mantel-Haenszel test. These tests were designed to be GROW (growth-rate optimal in the worst case) for certain subsets of the alternative hypothesis. Two versions of the tests were presented: a version that provides the GROW S-value for a restricted alternative hypothesis based on a minimal absolute difference between group means, and a version that is based on the Kullback-Leibler divergence between the alternative and null hypothesis. For the ‘minimal absolute difference’ version, an analytically computable ‘simple’ S-value turned out to exist, which is robust under optional stopping. I showed that when using this safe test for optional stopping, the expected sample size needed to achieve a desired power can be lower than when using Fisher’s exact test. No ‘simple’ definition could be found for the Kullback-Leibler version: this GROW safe test has to be found through numerical optimization. Nevertheless, the Kullback-Leibler version could still be preferred in some cases: it was shown to gain higher power for certain data-generating distributions compared to the simple S-value.
Both S-values were implemented in an R package: the safe2x2 package.
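To illustrate the optional-continuation property mentioned above (a generic sketch, not the thesis's 2x2 construction), the code below builds the simplest S-value: a likelihood ratio for a Bernoulli null against a simple alternative. Under the null its expectation is 1, so Markov's inequality gives the type-I error guarantee P(S >= 1/alpha) <= alpha, and this is preserved when independent studies' S-values are multiplied. The parameter values are illustrative assumptions.

```python
from itertools import product

def s_value(xs, p_alt=0.7, p_null=0.5):
    """Likelihood-ratio S-value for Bernoulli data: null p=p_null vs a
    simple alternative p=p_alt. Multiplying S-values across independent
    studies again yields an S-value (optional continuation)."""
    s = 1.0
    for x in xs:
        s *= (p_alt if x else 1 - p_alt) / (p_null if x else 1 - p_null)
    return s

# Exact check that E[S] = 1 under the null (n = 4 fair-coin observations),
# which is what makes the Markov-inequality type-I guarantee work:
n = 4
exp_s = sum(s_value(xs) * 0.5 ** n for xs in product([0, 1], repeat=n))
assert abs(exp_s - 1.0) < 1e-12
```

Because the expectation stays 1 under multiplication, a researcher may decide after seeing one study's S-value whether to run another, without inflating the type-I error.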
In this thesis, we present the theory of test martingales. Whereas in traditional statistics the p-value indicates the level of evidence against the null hypothesis, in martingale testing a betting strategy that allows one to make a (virtual) profit is seen as evidence against the null hypothesis. We extend the concept of test martingales of Shafer et al. (2011) to composite null hypotheses. We refer to these martingales as composite test martingales; they are the main innovation in this thesis. Using composite test martingales, we construct two martingale tests as an alternative to Student’s t-test. These two tests appear to be the first published martingale tests that can differentiate the t-test hypotheses. The main result of this thesis concerns the Jeffreys Bayesian t-test. It was already known, experimentally, to be robust under optional stopping. Optional stopping refers to the practice of looking at observed experimental data in order to decide whether or not to continue testing. Robustness in this context means that the statistical method preserves its significance level even when optional stopping is employed. We prove the Jeffreys Bayesian t-test to be a martingale test. Therefore, it is robust under optional stopping. In general, under a composite null hypothesis, Bayesian tests are not robust under optional stopping.
High dimensional classification problems are becoming increasingly frequent and are notoriously difficult. Classifying Alzheimer patients using MRI data or fMRI data is one such challenge: often no more than 50 subjects are measured, while the number of variables or features observed per patient or object can be as high as 10000. Specialized statistical learners attempt to combat the challenges these high dimensional classification problems present. In this thesis we propose an extension of an ensemble learner called Stacked Generalization that combines the idea of stacking multiple classification techniques with subsetting the feature space. We call it Stacked Domain Learning. We argue that Stacked Domain Learning may improve prediction performance in high dimensional classification problems. Performance increase is mainly expected in situations where the data presents different modalities. We investigate this claim in a simulation study. We apply state-of-the-art (high dimensional) classification techniques as part of the ensemble learner and as comparison for the extension. Differential performance between the learners and the extension when applied to relatively simple data sets, without different modalities, shows that the extension could improve performance both of Stacked Generalization in general and of choosing, through cross-validation, the single best performing statistical learner. Performance improvement is highly dependent on the characteristics of the data and most notable in conditions that are relatively noiseless.
Performance increase is, however, not universal, even in the most favorable conditions; Stacked Domain Learning is therefore best used not as a replacement of existing techniques but rather as an addition to the library of techniques the statistician might consider. The results warrant further study of Stacked Domain Learning to investigate performance improvement in a practical setting, with or without explicit modalities. The results of the simulation study also suggest that further improvements could be made, for instance in the way the ensemble is combined. We also attempt to measure the quality of the prediction performance of the stacking ensemble by attempting to measure the size of the classifier space (or hypothesis space), the enlargement of which is the main argument in favor of the extension. Interpreted favorably, the results indicate a negligible relation, but they are not conclusive.
In the past few years there has been great development in the field of sequential prediction. Starting with the simple, yet often effective, Follow-The-Leader strategy, numerous different strategies have been conceived. The most prevalent algorithm used to realize these strategies is Hedge. This algorithm’s performance crucially depends on a parameter called the learning rate. Based on the work of Cesa-Bianchi, Mansour and Stoltz, a better Hedge algorithm named AdaHedge has been developed by Grünwald, De Rooij, Van Erven and Koolen that has great worst-case performance bounds. It sets the learning rate parameter dynamically without using the doubling trick. This means that it looks at the previous results to make the next prediction. At around the same time, a new, completely different type of algorithm, named NormalHedge, was devised in San Diego. NormalHedge is parameter free. This algorithm, by Freund, Chaudhuri and Hsu, completely skips the learning rate complication. In some simple examples it has been shown that NormalHedge performs similarly to, if not better than, all traditional Hedge strategies. In this paper, AdaHedge and some similar other Hedge algorithms are compared to NormalHedge. First we do this through examples given by Grünwald et al. that we reproduce. Next an extensive and elaborate data sequence is created. This complicated experiment will give new insights into the strengths and weaknesses of the algorithms.
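To make the role of the learning rate concrete (a minimal textbook sketch, not the thesis's AdaHedge or NormalHedge implementation), the code below runs the basic Hedge / exponential-weights update and shows how the cumulative loss depends on the learning rate `eta`. The function name and test sequence are illustrative assumptions.

```python
import math

def hedge(loss_rows, eta):
    """Basic Hedge / exponential weights over K experts.
    Each round the learner suffers the weight-averaged loss, then
    down-weights each expert exponentially in its loss, scaled by eta."""
    k = len(loss_rows[0])
    w = [1.0 / k] * k          # uniform initial weights
    total = 0.0
    for losses in loss_rows:
        total += sum(wi * li for wi, li in zip(w, losses))
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
        z = sum(w)
        w = [wi / z for wi in w]
    return total

# Two experts; expert 0 is always right. A larger eta concentrates weight
# on the good expert faster, giving lower cumulative loss on this sequence:
rows = [[0.0, 1.0]] * 20
assert hedge(rows, eta=1.0) < hedge(rows, eta=0.1)
```

On adversarial sequences a large `eta` can backfire, which is exactly why AdaHedge tunes it dynamically and NormalHedge removes the parameter altogether.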
Bayesian inference is considered one of the best statistical methods available when the model is correctly specified. On the other hand, when this is not the case and model assumptions do not hold, it can lead to suboptimal results. Equipping the likelihood with a learning rate parameter protects against this. In this thesis the performance of various more robust Bayesian approaches, which differ in the way the learning rate parameter is chosen, is compared to standard Bayes in a variety of situations. Results for various classification problems (with simulated data) and Lasso-type regression problems (with real-world data) indicate that the robust forms of Bayes outperform standard Bayes when the model is incorrect, and do not perform much worse when the model is correct. In particular, the robust Bayesian method with the learning rate parameter estimated by k-fold cross-validation achieves good results.
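The "likelihood with a learning rate" construction mentioned above is often called generalized or tempered Bayes: the posterior is proportional to the prior times the likelihood raised to the power eta. The sketch below illustrates this on a discrete two-hypothesis example; the function name and numbers are illustrative assumptions, not from the thesis.

```python
import math

def generalized_posterior(prior, loglik, eta):
    """Generalized (tempered) Bayes: posterior ∝ prior * likelihood**eta.
    eta = 1 recovers standard Bayes; eta < 1 down-weights the data,
    which protects against model misspecification."""
    unnorm = [p * math.exp(eta * ll) for p, ll in zip(prior, loglik)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Two hypotheses, uniform prior, data favoring hypothesis 0:
loglik = [math.log(0.9), math.log(0.1)]
post_std = generalized_posterior([0.5, 0.5], loglik, eta=1.0)
post_rob = generalized_posterior([0.5, 0.5], loglik, eta=0.5)
# Tempering (eta < 1) shrinks the posterior toward the prior:
assert post_std[0] > post_rob[0] > 0.5
```

Choosing eta, for example by k-fold cross-validation as in the thesis, trades off this robustness against efficiency when the model is actually correct.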
In this thesis, three properties of model selection criteria are considered: consistency, minimax-rate optimality and insensitivity to the stopping rule. A result of Yang, demonstrated for linear regression, is proven to hold in model selection with single-parameter exponential families as well: consistency and minimax-rate optimality are mutually exclusive. Susceptibility to optional stopping is shown to be asymmetric in nature. While AIC and null hypothesis significance testing are well known to be sensitive to the stopping rule if the null model is correct, we show that the probability of incorrectly selecting the null model remains bounded away from one. The main result concerns the switch model selection criterion δsw. It was already known to be consistent and minimax-rate optimal in the cumulative sense. In this thesis, we prove that the worst-case instantaneous standardized quadratic risk in a simple parametric problem is of order (log log n)/n, missing the optimal rate only by a factor log log n. By a result of Shafer et al., δsw is not sensitive to optional stopping. Hence, δsw comes close to combining all three desirable properties in one criterion.
Suppose a Decision Maker wants to make a prediction about the value of a random variable. He knows the distribution of the random variable, and he is also told that the outcome is contained in some given subset of the domain of the random variable. The Decision Maker is then asked to give his best guess of the true value of the random variable. The knee-jerk reflex of a probabilist is to use conditioning, if the probability of all outcomes is known. However, this reflex may well be incorrect if the specific outcome of the random variable is contained in more than one of the subsets that may be revealed. This situation has been analysed in the literature in the case of a single (random) selection procedure. When the selection procedure satisfies a condition called Coarsening at Random, standard conditioning does the trick. However, in many cases this condition cannot be satisfied. We analyse the situation in which the selection procedure is unknown. We use a minimax approach of the Decision Maker against Nature, which can choose from a set of selection procedures. The loss of the Decision Maker is modeled by the logarithmic loss. We give a minimax theorem applicable in our setting. This enables us to give a characterisation of the best prediction in all cases. Surprisingly, our results show that for certain cases this characterisation is a kind of reverse Coarsening at Random conditioning.
Many uses of data mining, such as clustering, classification, the construction of decision trees, subgroup discovery and itemset mining, often fail to cope well with real-valued data. In fact, it is common for data mining methods to only work well on nominal data with few distinct values. We build the theory to fill this gap for data from arbitrary uncountable sets and introduce an efficient method to mine data without the usual discretization as a pre-processing step. It is shown that discretization is not needed in order to make use of the MDL principle.