Hidden Models arise when an unobserved process is partially observed through a noisy channel. Such a model is referred to as a Hidden Markov Model if the unobserved process is assumed to be a Markov process. Often the true statistical properties of the unobserved process are unknown; hence, for this general Hidden Model we seek a universal method for decoding the observed noisy sequences. For discrete-valued observed data the problem is solved by the Discrete Universal Denoiser (DUDE). The key assumption in the DUDE is that the combined hidden and observed process is stationary. The case of continuous observed data is dealt with by quantizing the outputs and applying an extended version of the DUDE. We show an improvement for the DUDE and an optimization strategy for the continuous data.
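To make the DUDE concrete, here is a minimal sketch of its two-pass rule for a binary symmetric channel with Hamming loss; the context length k and flip probability delta are illustrative choices, not values from the thesis.

```python
import numpy as np
from collections import defaultdict

def dude_bsc(z, k=2, delta=0.1):
    """Two-pass DUDE sketch for a binary sequence observed through a
    BSC(delta), with Hamming loss. Boundary symbols are left unchanged."""
    z = np.asarray(z)
    n = len(z)
    # Channel matrix (BSC) and Hamming loss matrix over alphabet {0, 1}.
    Pi = np.array([[1 - delta, delta], [delta, 1 - delta]])
    Lam = np.array([[0.0, 1.0], [1.0, 0.0]])
    Pi_inv = np.linalg.inv(Pi)  # for this symmetric channel, inv = inv-transpose
    # Pass 1: count centre symbols for every two-sided context of length k.
    counts = defaultdict(lambda: np.zeros(2))
    for i in range(k, n - k):
        ctx = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        counts[ctx][z[i]] += 1
    # Pass 2: per position, pick the symbol minimising the estimated
    # expected loss  m(ctx)^T Pi^{-1} (lambda_a * pi_{z_i}).
    x_hat = z.copy()
    for i in range(k, n - k):
        ctx = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        m = counts[ctx]
        scores = [m @ Pi_inv @ (Lam[:, a] * Pi[:, z[i]]) for a in (0, 1)]
        x_hat[i] = int(np.argmin(scores))
    return x_hat
```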
In this project a new approach to forecasting infectious disease epidemics was tested in a simulation and applied to data from the 2014–2016 Ebola epidemic. GLMs were applied to the (simulated) data, from which the key quantities contact rate and epidemic size could be obtained. With (non-)parametric bootstrapping, the GLM results could be assessed, and the key quantities were obtained and subsequently used to produce forecasts. Forecasting intervals were constructed to show the accuracy of the forecasts in terms of epidemic size and duration. Simulation results suggested that the method underestimated the eventual epidemic size and overestimated the contact rate. Applying the method to a real-life data set, however, resulted in overestimation of the eventual epidemic size. The contact rate estimates obtained on real-life data should be compared to estimates from the literature before meaningful conclusions can be drawn. Both simulation and application results gave variable estimates for the epidemic duration, although a positive relation was seen between epidemic size and epidemic duration. Estimates for the contact rate could be improved. The major issues with prediction were attributable to exact collinearity introduced by the systematic model; the major issues with forecasting were attributable to extreme estimates of the epidemic size. The cause of both issues lies in the GLMs that were fit to the data.
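As an illustration of the general workflow (GLM fit, parametric bootstrap, forecast intervals), here is a sketch on synthetic weekly counts with an assumed quadratic trend; the data, model formula and horizon are hypothetical, not the project's actual specification.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Hypothetical weekly case counts; a real analysis would use the Ebola data.
weeks = np.arange(30)
cases = rng.poisson(np.exp(1.0 + 0.35 * weeks - 0.012 * weeks**2))

# Fit a Poisson GLM with a quadratic trend in time.
X = sm.add_constant(np.column_stack([weeks, weeks**2]))
fit = sm.GLM(cases, X, family=sm.families.Poisson()).fit()

# Parametric bootstrap: resample counts from the fitted means, refit,
# and forecast 10 weeks ahead to get pointwise forecast intervals.
horizon = np.arange(30, 40)
Xf = sm.add_constant(np.column_stack([horizon, horizon**2]))
boot = []
for _ in range(500):
    y_star = rng.poisson(fit.mu)
    b = sm.GLM(y_star, X, family=sm.families.Poisson()).fit()
    boot.append(b.predict(Xf))
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
```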
When optimizing a multi-objective function, it is interesting to find the points where the best trade-off is reached between the different objectives, i.e. the points where no improvement in one of the objective functions can be made without degrading some of the other objective values. To find these optimal points, we suggest a new algorithm that uses the expected hypervolume improvement of a point. Subsequently this algorithm is considered for the single-objective case in one dimension and compared to the existing algorithm for this single-objective problem, Shubert's Algorithm. Moreover, some problems with the existing proof of the convergence rate of Shubert's Algorithm are resolved. Finally, the bi-objective case is investigated and a method is constructed to calculate the expected hypervolume improvement at a point explicitly.
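For reference, a minimal sketch of Shubert's Algorithm, stated here for minimising an L-Lipschitz function on an interval (the original formulation is for maximisation); the Lipschitz constant and iteration budget are assumed given.

```python
def shubert_minimise(f, a, b, L, n_iter=30):
    """Shubert's algorithm sketch: keep sampled points sorted and always
    evaluate f where the piecewise-linear lower bound is smallest."""
    pts = [(a, f(a)), (b, f(b))]
    for _ in range(n_iter):
        # The lower envelope max_i (f_i - L|x - x_i|) attains its minimum
        # on each interval [x1, x2] at the intersection of the two cones.
        best = None
        for (x1, f1), (x2, f2) in zip(pts, pts[1:]):
            x_new = 0.5 * (x1 + x2) + (f1 - f2) / (2 * L)
            bound = 0.5 * (f1 + f2) - 0.5 * L * (x2 - x1)
            if best is None or bound < best[0]:
                best = (bound, x_new)
        _, x_new = best
        pts.append((x_new, f(x_new)))
        pts.sort(key=lambda p: p[0])
    return min(pts, key=lambda p: p[1])  # best sampled point
```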
Many practical optimisation problems can be formulated as a traffic assignment problem, i.e. optimally routing a multi-commodity flow through a network. To this end, a network is defined that can capture congestion and a notion of optimal flow. The shortest path problem is derived as a sub-problem of the traffic assignment problem, and several algorithms that can solve it are discussed. In addition, several speed-up techniques for the shortest path problem are described that can be applied to static networks. In conclusion, an algorithm is discussed that solves the traffic assignment problem by iteratively solving a shortest path problem.
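As a concrete instance of the shortest-path sub-problem, here is a standard Dijkstra sketch (one of the classical algorithms for non-negative edge weights); the adjacency-list representation is an assumption for illustration.

```python
import heapq

def dijkstra(adj, source):
    """Shortest-path distances from `source` in a graph given as
    adj[u] = [(v, weight), ...] with non-negative weights."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already settled with smaller dist
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

In a traffic assignment loop, the edge weights would be congestion-dependent travel times that are updated between successive shortest-path computations.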
Fusarium crown rot is a damaging disease frequently found in wheat, mostly caused by Fusarium pseudograminearum. Crown rot resistance is a quantitatively inherited complex trait controlled by many loci of small effect. Lines with crown rot resistance have been identified that are interesting potential sources of favorable resistance alleles. In order to make use of this resistance in agriculture, lines need to be developed that combine high crown rot resistance with other agro-economically important traits, like high yield. In this study, such lines have been bred by crossing multiple crown rot resistant donor lines with elite lines. The lines are mostly selected by phenotype, but scoring crown rot severity is difficult. In this study genomic prediction of crown rot resistance was explored to partially or completely replace phenotyping in the selection process. Two different generations in a complex wheat population (early-generation CRI0 lines and later-generation CRI2 lines) were used to build genomic prediction models and to validate them. A genomic best linear unbiased prediction (G-BLUP) model, a linear model based on a set of selected markers and a Gaussian kernel model were trained on the early generation (CRI0 lines) and validated on the later generation (CRI2 lines). Predictive ability based on early-generation information was disappointingly low, suggesting that phenotyping cannot be fully avoided at the later generation. The alternative of partial phenotyping at the late generation yielded more encouraging results: the Gaussian kernel model and the G-BLUP model yielded a predictive ability of about 0.41. While further research is needed, the results so far imply that genomic prediction within a population could be useful to select highly crown rot resistant lines.
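A minimal G-BLUP sketch, assuming markers coded 0/1/2, VanRaden scaling of the genomic relationship matrix, and a known heritability that fixes the shrinkage parameter; the study's actual models include marker selection and a Gaussian kernel variant not shown here.

```python
import numpy as np

def gblup_predict(M_train, y_train, M_test, h2=0.5):
    """Minimal G-BLUP sketch: predict genetic values of unphenotyped
    lines from marker matrices (rows = lines, columns = markers)."""
    p = M_train.mean(axis=0) / 2.0       # allele frequencies
    Z_tr = M_train - 2 * p               # centred training markers
    Z_te = M_test - 2 * p
    c = 2 * np.sum(p * (1 - p))          # VanRaden scaling constant
    G = Z_tr @ Z_tr.T / c                # genomic relationships (train)
    G_cross = Z_te @ Z_tr.T / c          # test-vs-train relationships
    lam = (1 - h2) / h2                  # shrinkage from heritability
    mu = y_train.mean()
    alpha = np.linalg.solve(G + lam * np.eye(len(y_train)), y_train - mu)
    return mu + G_cross @ alpha          # predicted genetic values
```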
The Multi-Armed Bandit (MAB) problem is named after slot machine games. When playing slot machines, a player has to decide which machine to play, in which order to play them and how many times to play each machine. After each choice, the chosen machine offers a random reward from a probability distribution, and the player's goal is to maximize the sum of rewards earned through a sequence of lever pulls. In order to learn the distribution of each machine as quickly as possible while earning as much profit as possible, we consider the popular Thompson Sampling (TS) method, which is based on Bayesian ideas. TS is a heuristic for choosing actions that maximize the expected reward with respect to a randomly drawn belief [9]. In the first half of this thesis, we test the performance of TS and compare a variant called Top-two Thompson Sampling (TTTS) to standard TS, using uniform sampling as a baseline. Computationally, TTTS is a slow algorithm, so we also try to improve its performance and create another algorithm, Top-two Gibbs Thompson Sampling, which combines TTTS with Gibbs sampling and improves the computation speed of TTTS. In the second half of the thesis, we take a step forward in the application of TS and combine it with the Maximin Action Identification (MAI) problem. Maximin is a concept from two-player zero-sum games in game theory. The main idea behind maximin action selection is a constant alternation of minimizing and maximizing the value of moves to account for an adversarial opponent in the game. Existing methods for maximin are related to games such as Checkers, Chess and Go. We try to broaden its application area and apply it to the new algorithms created in the first half. First, the time budget is limited and we use different divisions of it to test the performance; afterwards, we create a new algorithm that picks out the right arm in one step for two layers. The results show no significant difference between the time division method, which assigns even time budgets to the child layer and no budget to the parent layer, and the Maximin Thompson Sampling method.
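A minimal sketch of standard Bernoulli Thompson Sampling with Beta(1,1) priors (the baseline algorithm above); TTTS would additionally resample until a different arm maximises the draw before committing.

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_bernoulli(true_means, horizon=10_000):
    """Bernoulli Thompson Sampling sketch: sample a mean from each arm's
    Beta posterior, play the argmax, update that arm's posterior."""
    k = len(true_means)
    wins = np.ones(k)    # Beta alpha parameters
    losses = np.ones(k)  # Beta beta parameters
    total = 0.0
    for _ in range(horizon):
        theta = rng.beta(wins, losses)       # one posterior draw per arm
        arm = int(np.argmax(theta))
        reward = rng.random() < true_means[arm]
        total += reward
        wins[arm] += reward
        losses[arm] += 1 - reward
    return total, wins, losses
```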
The High School Scheduling Problem is the operational research problem of finding an optimal timetable for students and teachers in a secondary school. It is a minimisation problem, in which we need to satisfy as many wishes as possible while guaranteeing that all hard demands are met. The High School Scheduling Problem has been studied for over half a century, owing to the importance of schools having a good schedule and the difficulty of finding a schedule that adequately satisfies the different constraints. This thesis studies two new approaches to optimising solutions of a problem instance, both based on transformations of the problem's search space. First, locally optimal solutions are improved by changing the penalties associated with soft constraints in specific ways. Secondly, the schedule's penalty function itself is modified, aimed at shifting focus from improving the schedule as a whole to making sure individual schedules become acceptable.
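One way such a penalty transformation could look, sketched with a hypothetical constraint API (each constraint object is assumed to expose a violations(schedule) count); the reweighting step resembles guided local search rather than the thesis's specific scheme.

```python
def penalty(schedule, constraints, weights):
    """Weighted soft-constraint penalty of a timetable: sum over
    constraints of (weight x number of violations). Sketch only."""
    return sum(w * c.violations(schedule) for c, w in zip(constraints, weights))

def reweight(schedule, constraints, weights, factor=1.5):
    """After reaching a local optimum, inflate the weights of constraints
    that are still violated, so the next local search is pushed towards
    satisfying them (in the spirit of guided local search)."""
    return [w * factor if c.violations(schedule) > 0 else w
            for c, w in zip(constraints, weights)]
```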
Declining response rates and budgetary constraints are making life difficult for survey researchers. It is therefore necessary to use more efficient data collection strategies in order to still obtain accurate estimates of the key survey variables. One way to reach this goal is the use of Bayesian Adaptive Survey Designs (BASDs). In a BASD one can monitor the data quality and key survey estimates per (sub)group, and a Bayesian analysis allows the inaccuracies of the estimates to be taken into account. In this paper we investigate the characteristics of the estimates of the quality indicators and key survey variables by answering four research questions in a simulation study based on the Dutch Health Survey. (1) What are the characteristics of the estimates of the key survey variables in a Bayesian analysis? (2) How do the quality indicators behave in a Bayesian setting? (3) How well does the Bayesian approach perform when we add a misspecification in the method effect? (4) What happens to the key survey estimates and the quality indicators of a key survey variable when using a more elaborate model? The results for the first question show that the estimates of the key survey variables depend both on the sample size and on the data collection strategy; while monitoring the key survey estimates, method effects between the strategies can be detected. The results for the second question show that the quality indicators depend not only on the sample size and the data collection strategy, but also on the strength of the informative prior. Using an informative prior in a Bayesian analysis can help when the sample size is small. However, the results for the third question show that we have to be careful when the observed data and the informative prior do not coincide: the size of the method effect for the (sub)groups is then difficult to detect. The results for the final research question show that using a more elaborate model can help reveal patterns in the estimates of the key survey variables. This paper contributes to the BASD framework. The results can be used to further develop the theory and eventually change the practice of only looking at budget and response rates. Using more elaborate data collection strategies and a Bayesian analysis can help to obtain more accurate results.
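To illustrate the role of an informative prior when monitoring a subgroup, a minimal Beta-Binomial sketch for a response propensity; the prior Beta(2, 8) (mean 0.2, standing in for information from an earlier wave) and all numbers are illustrative, not taken from the Dutch Health Survey study.

```python
import numpy as np

def posterior_response_rate(n_responded, n_approached,
                            prior_a=2.0, prior_b=8.0, n_draws=10_000):
    """Beta-Binomial monitoring of a subgroup's response propensity:
    conjugate update of an informative Beta prior, returning the
    posterior mean and a 95% credible interval."""
    rng = np.random.default_rng(42)
    a = prior_a + n_responded
    b = prior_b + (n_approached - n_responded)
    draws = rng.beta(a, b, n_draws)
    return draws.mean(), np.percentile(draws, [2.5, 97.5])

# With few observations the prior dominates; with many it washes out.
print(posterior_response_rate(3, 10))
print(posterior_response_rate(300, 1000))
```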
The Game of Cops and Robbers is played by two players on a graph, one player controlling a set of cops and the other controlling a robber. The goal of the cops is to capture the robber in a finite number of moves, whereas the robber wants to evade the cops indefinitely. The central question we ask is: how many cops are needed on a given graph for them to be able to win the game? This number is called the cop number of the graph. In this thesis, we give an extensive overview of results that give an upper bound on the cop number for graphs of a given type, like unit disk graphs and string graphs, as well as some examples of such graphs having a high cop number. In this overview, we will see that the results all depend heavily on the statement or proof method of one of two theorems published by Aigner and Fromme [1]. In addition to a survey of the existing work, we explore whether the two main theorems can be used to sharpen these results further. As this leads to mostly negative answers, we conjecture that a different approach is necessary in order to make further progress.
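As a small computational complement to the survey, one can at least decide whether a single cop suffices, using the classical characterisation of cop-win graphs as dismantlable graphs (Nowakowski and Winkler); a sketch using networkx, not part of the thesis itself.

```python
import networkx as nx

def is_cop_win(G):
    """A graph is cop-win (cop number 1) iff it is dismantlable:
    repeatedly delete a 'corner', i.e. a vertex whose closed
    neighbourhood is contained in another vertex's closed neighbourhood."""
    H = G.copy()
    while H.number_of_nodes() > 1:
        corner = None
        for u in H.nodes:
            Nu = set(H[u]) | {u}
            if any(Nu <= set(H[v]) | {v} for v in H.nodes if v != u):
                corner = u
                break
        if corner is None:
            return False  # no corner left: not dismantlable, cop number >= 2
        H.remove_node(corner)
    return True

print(is_cop_win(nx.path_graph(5)))   # True: trees are cop-win
print(is_cop_win(nx.cycle_graph(4)))  # False: the 4-cycle needs 2 cops
```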
We consider the problem of learning the parameters of a pairwise undirected graphical model in which the node-conditional distribution of each variable conditioned on all others can be any (possibly distinct) exponential family member. Markov Random Fields (MRFs) represent the conditional independence structure of a collection of $p$ random variables with a graph. Analogous to previous work, we derive the pairwise joint MRF distribution $P_{\theta^*}$ for a collection of exponential family members and specify conditions for well-definedness of $P_{\theta^*}$. The pseudolikelihood (PL) is defined as the sum of all $p$ node-conditional log-likelihoods. We use the penalized PL to find an estimator $\hat{\theta}_n^{\mathrm{pen}}$ of the true parameter $\theta^*$ from high-dimensional data ($n < p$ for $n$ samples), and establish the conditions for consistency ($\hat{\theta}_n^{\mathrm{pen}} \xrightarrow{p} \theta^*$ as $n \to \infty$). These results are illustrated with the GLM family (the Bernoulli, Gaussian, Poisson and Exponential distributions), for which $\hat{\theta}_n^{\mathrm{pen}}$ is consistent if the parameter space is compact and satisfies certain boundary constraints. For the GLM family, we present a parallel algorithm that generates a sequence $\{t^k\}_{k \ge 0}$ converging to $\hat{\theta}_n^{\mathrm{pen}}$ with computational complexity $O(p^3)$ per step. Finally, we demonstrate the theoretical framework with a numerical study. A simulated network that includes all GLM family members shows that the ridge-penalized PL estimator $\hat{\theta}_n^{L_2}$ converges to $\theta^*$ for increasing $n$, as expected. Moreover, we compute $\hat{\theta}_n^{L_2}$ for a real-world network concerning invasive breast carcinomas, using high-throughput genomics data that includes gene expression and mutations. As expected, the majority of the largest parameters in $\hat{\theta}_n^{L_2}$ involve mutations in well-known oncogenes, showing the effectiveness of the developed framework.
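A sketch of the ridge-penalized pseudolikelihood for the Bernoulli (Ising-like) special case only, with a generic optimizer instead of the thesis's parallel $O(p^3)$-per-step scheme; practical only for small $p$, since L-BFGS-B here differentiates the $p^2$ parameters numerically.

```python
import numpy as np
from scipy.optimize import minimize

def neg_penalised_pl(theta_flat, X, lam):
    """Ridge-penalised negative pseudolikelihood for a binary pairwise
    MRF with data X in {0,1}^{n x p}; theta is a symmetric p x p matrix
    whose diagonal holds the node parameters. Sketch only."""
    n, p = X.shape
    Theta = theta_flat.reshape(p, p)
    Theta = 0.5 * (Theta + Theta.T)          # enforce symmetry
    nll = 0.0
    for s in range(p):
        # Node-conditional logistic regression of X[:, s] on the rest:
        # eta_s = theta_ss + sum_{t != s} theta_st x_t.
        eta = Theta[s, s] + X @ Theta[s] - X[:, s] * Theta[s, s]
        nll -= np.sum(X[:, s] * eta - np.logaddexp(0.0, eta))
    return nll / n + lam * np.sum(Theta**2)

def fit_ising_pl(X, lam=0.1):
    p = X.shape[1]
    res = minimize(neg_penalised_pl, np.zeros(p * p), args=(X, lam),
                   method="L-BFGS-B")
    return res.x.reshape(p, p)
```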
The Alexander polynomial is a link invariant and is used in this thesis to analyze the spatial structure of proteins. It can be computed efficiently by the Jacobian-adjugate method, derived from the adjugate of a Jacobian matrix of the link group. The method builds a link diagram from smaller tangle diagrams using the operations disjoint union ($\sqcup$) and stitching ($m_{ij}$ stitches edge $i$ to edge $j$), and computes the Jacobian-adjugates $JA(T)$ of the tangle diagrams $T$ along the way. A Jacobian-adjugate is a pair $(\Delta, A)$ with $\Delta \in \mathbb{Z}[G^{\mathrm{ab}}]$ and $A \in \operatorname{Mat}(\mathbb{Z}[G^{\mathrm{ab}}])$, where $G$ is the group of a link diagram. It satisfies the following rules:
$$(\Delta_A, A) \sqcup (\Delta_B, B) = (\Delta_A \cdot \Delta_B,\ \Delta_B \cdot A \oplus \Delta_A \cdot B),$$
$$m_{ij}((\Delta, A)) = \left(\Delta - A^i_j,\ \left(1 - \Delta^{-1}A^i_j\right)A^{\hat\imath} + \Delta^{-1}A^{\hat\imath}_j \cdot A^i\right).$$
Positive and negative crossings with over-strand $a$ and outgoing and incoming under-strand edges $b$ and $c$ have as Jacobian-adjugates ($\Delta$ is written in the upper left corner)
$$\begin{array}{c|ccc} 1 & a & b & c \\ \hline a & 1 & 0 & 1-b \\ b & 0 & 1 & a \end{array} \qquad \text{and} \qquad \begin{array}{c|ccc} 1 & a & b & c \\ \hline a & 1 & 0 & a^{-1}(b-1) \\ b & 0 & 1 & a^{-1} \end{array}.$$
The Alexander polynomial equals the gcd of the remaining $1 \times n$ matrix when there is just one stitching left. In the single-variable case, at each stage of the computation we only need the columns corresponding to incoming edges, and the final $1 \times 1$ matrix gives the Alexander polynomial. This Alexander polynomial can be used to compute topological invariants of proteins. Using Sage code we can reconstruct proteins from Protein Data Bank files and compute certain (single-variable) Alexander polynomials for them. Firstly, a link is obtained by closing chains along hydrogen bonds. Running the code on a sample of 1200 PDB files suggests that almost no protein contains a knotted structure, so it seems that the methods of studies such as [2] often infer a knotted structure that is absent in the actual protein. Secondly, a link is obtained from the protein by replacing chains and hydrogen bonds by mutually linked loops. The corresponding Alexander polynomial is very large, and some improvements can still be made with regard to the computation time.
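To show the target invariant on a small example, here is the classical Fox-calculus route to the Alexander polynomial (computing it from the abelianised Fox derivatives of a Wirtinger relator); this differs from the thesis's tangle-based Jacobian-adjugate method, and the trefoil relator below is the only input.

```python
import sympy as sp

t = sp.symbols("t")

def fox_derivative(word, gen):
    """Abelianised Fox derivative d(word)/d(gen), with every generator
    mapped to t. `word` is a list of (generator, +-1) pairs."""
    result, prefix = sp.Integer(0), sp.Integer(1)
    for g, e in word:
        if e == 1:
            if g == gen:
                result += prefix       # d(g)/d(g) = 1, times the prefix
            prefix *= t
        else:
            prefix *= t**-1
            if g == gen:
                result -= prefix       # d(g^-1)/d(g) = -g^-1
    return sp.expand(result)

# Trefoil relator x y x y^-1 x^-1 y^-1 from the Wirtinger presentation
# <x, y | x y x = y x y>.
r = [("x", 1), ("y", 1), ("x", 1), ("y", -1), ("x", -1), ("y", -1)]
row = [fox_derivative(r, g) for g in ("x", "y")]  # 1 x 2 Alexander matrix
# Deleting one column leaves a 1 x 1 matrix whose entry, up to units,
# is the Alexander polynomial.
print(sp.factor(row[0]))  # t**2 - t + 1, the trefoil's Alexander polynomial
```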