Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
In forensics it is relevant to identify the presence of one or several body fluids in a crime stain. This may be done using traditional methods; however, those methods consume part of the available material, leaving less residual material for other analyses. Alternatively, one can use messenger RNA (mRNA) evidence: mRNA expression levels may vary among body fluids, which allows the fluids to be identified. The currently used method provides the forensic examiner with a categorical statement regarding the presence of the body fluid. Such a method cannot express any associated uncertainty, whereas a probabilistic method can and is therefore the preferable choice. In forensic science it is common to express the level of uncertainty by means of a likelihood ratio, but this ratio may be inaccurate because of a poor choice of statistical model or data scarcity. This thesis first carries out experiments using four probabilistic classification methods, namely Multinomial Logistic Regression, Multilayer Perceptron, Extreme Gradient Boosting and a Fully Connected Feed Forward model. In actual casework the crime stain often consists of multiple body fluids, which is why the classifiers are compared using synthetic representations of actual mixture samples. Multi-label approaches are used that enable the classifiers to express the level of uncertainty about multiple body fluids in a sample. The output from the logistic regression model is directly interpreted as a likelihood ratio, whereas for the remaining three classifiers a post-hoc calibration step is included to improve the accuracy of the classifiers. Additional tests are performed to investigate how susceptible the classifiers are to changes in the relative frequency of the body fluids in the data.
The main focus is on two target classes, namely saliva and a combination of vaginal mucosa and menstrual secretion, because these are the classes most often requested to be identified in a crime stain and are therefore seen as most relevant. It is concluded that using a separate logistic regression model for each target class in combination with presence/absence data results in both accurate and reliable likelihood ratios. Results also indicate that these models are the least susceptible to a change in the frequency with which body fluids occur in the training dataset. Furthermore, a study is done using an additional dataset with actual mixtures of two body fluids; these mixtures are not assumed to be representative of forensically realistic mixtures of the same two components. Results show that the accuracy of the classifiers on the mixtures dataset is higher than the accuracy on the synthetic representations. This indicates that the results are overly optimistic, thereby confirming that the mixtures' cell type dataset should not be used as a validation set. A user-friendly tool is constructed that implements logistic regression to calculate the likelihood ratio for samples from actual casework. Using mRNA measurements from two cases, both the practical use and the interpretability of the results are shown.
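
To make the link between a classifier's probabilistic output and a likelihood ratio concrete, the following is a minimal sketch in Python. It rests on Bayes' rule in odds form (likelihood ratio = posterior odds / prior odds) and assumes that the prior is the relative frequency of the target class in the training data. The function names `likelihood_ratio` and `log10_lr` are hypothetical and chosen for illustration only; this is not the implementation of the thesis or of its tool.

```python
import math


def likelihood_ratio(posterior: float, prior: float) -> float:
    """Convert a posterior probability into a likelihood ratio.

    Uses Bayes' rule in odds form: LR = posterior odds / prior odds.
    `posterior` is the classifier's probability that the target body
    fluid is present; `prior` is the assumed relative frequency of that
    class in the training data (an assumption of this sketch).
    """
    if not (0.0 < posterior < 1.0 and 0.0 < prior < 1.0):
        raise ValueError("posterior and prior must lie strictly between 0 and 1")
    posterior_odds = posterior / (1.0 - posterior)
    prior_odds = prior / (1.0 - prior)
    return posterior_odds / prior_odds


def log10_lr(posterior: float, prior: float) -> float:
    """Return the base-10 logarithm of the likelihood ratio."""
    return math.log10(likelihood_ratio(posterior, prior))


if __name__ == "__main__":
    # With a balanced prior of 0.5 the prior odds equal 1, so the
    # posterior odds can be read directly as the likelihood ratio.
    print(likelihood_ratio(0.9, 0.5))  # about 9
```

Under this assumption, a model trained on balanced classes yields posterior odds that equal the likelihood ratio, whereas an unbalanced training set requires dividing by the prior odds. This is one way to see why a change in the relative frequency of body fluids in the training data can affect the reported likelihood ratios.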