Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
A stricter global sulfur regulation under the International Maritime Organization's MARPOL Annex VI has been in effect since the beginning of 2020, but there is no monitoring system to verify whether ships actually comply with the sulfur cap. This thesis devises a systematic approach to a prototype sulfur compliance monitoring system using the state-of-the-art TROPOspheric Monitoring Instrument (TROPOMI), which measures the atmospheric presence of trace gases. Oceanic geographical coordinates are classified by similarity in trace-gas concentration levels using the k-means clustering method and adequate averaging techniques. The choice of hyperparameters and the final results are statistically formulated and verified. A subsequent longitudinal analysis of temporal trends in trace-gas emissions suggests that TROPOMI's sulfur dioxide measurements are dominated by measurement noise. The thesis concludes that TROPOMI's nitrogen dioxide measurements can be well utilized to backtrack maritime anthropogenic activities such as regional shipping routes, indicating the potential for further development into a global monitoring system for both land and maritime emissions.
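The grid-cell classification step above is plain k-means. As a rough illustration of the idea only (invented concentration values, not TROPOMI retrievals; the thesis's own feature construction and averaging are not reproduced here):

```python
# Toy sketch: cluster grid cells by mean trace-gas concentration with k-means.
# Data are illustrative, not the thesis's TROPOMI pipeline.

def kmeans_1d(values, k, iters=50):
    """Plain k-means on scalar concentrations; returns labels and centroids."""
    centroids = sorted(values)[:: max(1, len(values) // k)][:k]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: nearest centroid.
        labels = [min(range(k), key=lambda j: (v - centroids[j]) ** 2)
                  for v in values]
        # Update step: mean of each cluster.
        for j in range(k):
            member = [v for v, l in zip(values, labels) if l == j]
            if member:
                centroids[j] = sum(member) / len(member)
    return labels, centroids

# Cells over a shipping lane (high NO2) vs open ocean (low NO2), arbitrary units.
conc = [0.9, 1.1, 1.0, 5.0, 5.2, 4.8]
labels, cents = kmeans_1d(conc, k=2)
print(labels)
```

In the thesis's setting each "value" would be an averaged concentration per oceanic coordinate, and k is the hyperparameter that the statistical verification addresses.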
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Ascertainment bias is common in genetic-epidemiological cancer studies, where sampling of high-risk families is outcome-dependent. This results in too many events in comparison to the population and an overrepresentation of young, affected subjects in the sample. The motivating example for this thesis is a family study whose goal is to estimate an unbiased hazard ratio (HR) for the effect of the Polygenic Risk Score (PRS), a continuous score based on several Single Nucleotide Polymorphisms (SNPs), on age of breast cancer diagnosis. Weighted Cox model approaches have been proposed in this context; however, their performance has never been evaluated for a continuous covariate. Two different approaches were considered, using time-fixed and time-dependent weights. A simulation study was conducted to assess the performance of the different approaches for scenarios with varying family correlation, family size, sample size and selection criterion. We found that under the null hypothesis, (un)weighted models behave similarly. When a covariate effect is assumed, in any scenario where the within-family correlation is low, weighting methods perform better than a naive approach; the same holds for moderate within-family correlation in combination with weak ascertainment. For strong ascertainment and/or strong within-family correlation, coverage of weighting methods is very poor and bias is high. To obtain an unbiased HR for the PRS, we used data from high-risk breast cancer families. Inclusion criteria were the absence of the high-risk mutations BRCA1 and BRCA2, and at least three affected female family members, or two affected members if at least one had bilateral breast cancer before age 60. A total of 101 families were selected between 1990 and 2012 by Clinical Genetic Services in four Dutch cities and one Hungarian city, with 323 (55.1%) events. The HR of the PRS, adjusted for family history, was 1.29 (95% CI 1.04; 1.60) for the naive model, with a frailty variance of 0.53, which indicates rather strong within-family correlation. For none of the weighting approaches was the covariate effect of the PRS, adjusted for family history in a Cox model, significant (HR 1.09 and 1.09). For the analysis of outcome-dependently sampled survival data, weighting approaches may be used to limit ascertainment bias in some scenarios. A note of caution is required when this approach is used in scenarios with (moderate to) strong within-family correlation. No evidence for a significant effect of the PRS on age of breast cancer diagnosis was found in this study.
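The weighted Cox approaches discussed above are based on inverse-probability-of-ascertainment weighting. A minimal sketch of the time-fixed idea, with invented sampling fractions (the thesis's actual weight construction may differ):

```python
# Sketch of time-fixed inverse-probability-of-ascertainment weights, the idea
# behind weighted Cox approaches for outcome-dependent sampling.
# Sampling fractions are made up for illustration.

def ipw_weights(affected, p_sample_affected, p_sample_unaffected):
    """Weight each subject by the inverse of their selection probability,
    so oversampled affected subjects are down-weighted in the weighted
    partial likelihood."""
    return [1.0 / (p_sample_affected if a else p_sample_unaffected)
            for a in affected]

# High-risk family design: affected members far more likely to be sampled.
w = ipw_weights([1, 1, 0, 0], p_sample_affected=0.8, p_sample_unaffected=0.1)
print(w)
```

In practice these weights would be passed to a weighted Cox fitter (e.g. a `weights` argument in survival software) rather than used directly.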
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Accurate predictions of survival probabilities can be helpful for determining treatment strategies and for shared decision making in medical applications, such as cancer prognosis. Traditionally, the Cox proportional hazards (PH) model is used to predict survival, yet recently machine learning (ML) has received increased attention. ML methods learn complex relations between explanatory variables and outcomes without the need to specify these effects beforehand. In contrast, in the Cox PH model, non-linear and interaction effects need to be specified before estimating the model. The flexibility of ML methods is believed to improve predictive accuracy, which drives the application of ML methods to survival data. One of the aims of this thesis was to compare prediction models for survival data based on machine learning methods to the traditional Cox PH model. Predictive ability was assessed using the Brier score, the concordance index (C-index) and calibration plots. Furthermore, software implementation and interpretability were investigated. Two ML methods were considered: partial logistic regression models with artificial neural networks (PLANN) and random survival forest (RSF) models. Predictive performance was studied in a soft tissue sarcoma cohort: a right-censored survival dataset with a small number of explanatory variables. In terms of the Integrated Brier Score (IBS) and calibration, the optimally tuned RSF models had predictive performance similar to the Cox model, while the Cox model had better predictive performance than the RSF models in terms of the C-index. One of the neural network (NN) models outperformed Cox in terms of the IBS, and the NN models were slightly better calibrated than the Cox PH model. It would be interesting to see whether a Cox model including non-linear effects would outperform the ML methods considered in terms of prediction. Differences between the ML methods and the Cox PH model concern the route towards finding the optimal predictions. When estimating survival probabilities using ML methods, the focus is mainly on the correct implementation of the ML algorithm: finding suitable tuning parameters, selecting the best set of tuning parameters and running the algorithm, which takes time. On the other hand, when identifying the best-predicting Cox model, time is spent on specifying the model, looking at non-linear effects and evaluating goodness of fit. The initial set of tuning parameters considered for the PLANN approach resulted in non-informative NN models. This showed the importance of thorough knowledge of the characteristics of tuning parameters in the ML methods. The work in this thesis shows how survival prediction can be unreliable if the NN is not properly tuned.
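Of the performance measures above, the concordance index is the easiest to illustrate. A bare-bones implementation for right-censored data (illustrative only; library versions handle ties and censoring weights more carefully):

```python
# Minimal concordance index (C-index) for right-censored survival data:
# the fraction of comparable pairs in which the higher-risk subject fails first.

def c_index(time, event, risk):
    """A pair (i, j) is comparable if i has an observed event before time[j].
    Concordant pairs have risk[i] > risk[j]; risk ties count half."""
    conc = ties = comp = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue                 # censored subjects can't anchor a pair
        for j in range(n):
            if time[i] < time[j]:
                comp += 1
                if risk[i] > risk[j]:
                    conc += 1
                elif risk[i] == risk[j]:
                    ties += 1
    return (conc + 0.5 * ties) / comp

# Risks perfectly ordered against survival times give a C-index of 1.0.
ci = c_index(time=[2, 4, 6, 8], event=[1, 1, 1, 0], risk=[4.0, 3.0, 2.0, 1.0])
print(ci)
```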
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
In this thesis, two regression models for the nonlinear analysis of interaction effects are proposed. The regression models are based on the Optimal Scaling methodology and specifically target the analysis of Factor-by-Curve interactions between a categorical and a continuous variable. The Optimal Scaling methodology was originally developed for the analysis of categorical data, but is also applicable to continuous data. It estimates optimal quantifications for the original observed values in an iterative process by maximising the squared multiple regression coefficient (R²), thereby transforming the original variable. These quantifications are restricted according to a prespecified scaling level, indicating the stringency of the transformation. These scaling levels can restrict the quantifications to be unsmoothed (non)monotone, or to be smooth (non)monotone. Unsmoothed nonmonotone quantifications are not restricted to any relation between the original observed values, whereas the monotone restriction preserves the ordering of the original observed values in the quantifications. The smooth restrictions are similar, but the quantifications are then also smoothed using a spline function. The quantifications can also be restricted to a linear transformation of the original observed values. This (ordinary) Optimal Scaling regression model, however, does not take into account any interaction effects between the variables. The type of interactions considered in this thesis are Factor-by-Curve interactions: interactions between a categorical variable (factor) and a continuous variable. The models proposed in this thesis will be referred to as the Factor-by-Curve Optimal Scaling regression (FbC-OS-regression) models. Both models fit a separate curve for the continuous variable in the interaction for each level of the factor. For example, an interaction between a continuous variable and a factor with three levels is then fitted with three curves on that continuous variable. The difference between the two proposed models is that they either fit main and interaction effects separately or fit the joint effects in a single term. The models are illustrated with two applications on real data. The advantage of both FbC-OS-regression models, compared to existing methods for modelling Factor-by-Curve interactions, is that the Optimal Scaling methodology allows for monotone restrictions of the effects. This is demonstrated using the applications shown in this thesis, which are fitted using monotone spline restrictions. Results for the fitted FbC-OS-regression models are then compared to fitted linear regression models with interactions. Finally, the two approaches of modelling Factor-by-Curve interactions with OS-regression are compared to each other and to the additive model, which is also suitable for the nonlinear analysis of Factor-by-Curve interactions, after which suggestions for further study of the proposed models are given.
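The monotone scaling levels above restrict the quantifications to preserve the order of the observed values. A standard device for imposing such an order restriction is the pool-adjacent-violators algorithm; the sketch below is a generic illustration of that device, not the thesis's Optimal Scaling code:

```python
# Pool-adjacent-violators algorithm (PAVA): least-squares non-decreasing fit,
# the classic way to enforce a monotone (order-preserving) transformation.

def pava(y):
    """Return the non-decreasing sequence closest to y in least squares."""
    # Each block holds (sum, count); merge adjacent blocks while their
    # means violate the ordering.
    blocks = [[v, 1] for v in y]
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] / blocks[i][1] > blocks[i + 1][0] / blocks[i + 1][1]:
            blocks[i][0] += blocks[i + 1][0]
            blocks[i][1] += blocks[i + 1][1]
            del blocks[i + 1]
            i = max(i - 1, 0)   # a merge can create a new violation upstream
        else:
            i += 1
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)   # each block contributes its mean
    return out

print(pava([1.0, 3.0, 2.0, 4.0]))
```

Here the violating pair (3.0, 2.0) is pooled to its mean 2.5, yielding an ordered sequence while staying as close as possible to the input.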
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
National statistical institutes (NSIs) try to construct datasets that are rich in information as efficiently and cost-effectively as possible. This can be achieved by combining available data, such as administrative data or survey data. When datasets do not pertain to the same units, one can sometimes resort to statistical matching to integrate them. Statistical matching is a data fusion technique which can be used when different datasets contain different units but share a set of common (background) variables. The main goal of statistical matching is to estimate the relationship between the non-common variables in the different datasets. This paper investigates how best to utilize a small overlap of units in a statistical matching situation where the data consist only of categorical variables. A small overlap of units contains joint information on all variables for only a limited number of units. A new statistical matching method, the combined estimator, is developed in this paper, employing an idea from small area estimation. The performance of the combined estimator was compared to several pre-existing statistical matching methods for categorical data under various data conditions. The results show that, even though the combined estimator itself does not outperform the pre-existing statistical matching method (the EM algorithm), using the combined estimator as the starting point of the EM algorithm helps increase its accuracy under certain data circumstances. The improvement in accuracy was noticed in cases where the number of matching variables was large.
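The abstract does not give the combined estimator's exact form, but the small area estimation idea it borrows is typically a composite estimator: a convex combination of a direct estimate from the small overlap and a synthetic estimate from the matched data. A hypothetical sketch of that idea (all numbers and the weighting rule are invented for illustration):

```python
# Hypothetical composite estimator in the small-area-estimation spirit:
# shrink the direct estimate from the small overlap towards a synthetic
# estimate based on the conditional-independence assumption.
# The weighting rule below is illustrative, not the thesis's.

def composite(direct, synthetic, n_overlap, n_total):
    """Convex combination; more overlap -> more trust in the direct part."""
    gamma = n_overlap / n_total
    return gamma * direct + (1 - gamma) * synthetic

# Direct P(Y=1, Z=1) from 50 overlap units vs a synthetic estimate
# derived from the 950 statistically matched units.
est = composite(direct=0.30, synthetic=0.22, n_overlap=50, n_total=1000)
print(round(est, 3))
```

Such a cell-probability estimate could then serve as the starting point for the EM algorithm, which is how the thesis reports the combined estimator being most useful.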
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
The association of neurological and psychological conditions with changes in coactivation patterns of brain regions in the 'resting state' is of recent interest in neuroscience. To uncover such latent functional connectivity, series of functional Magnetic Resonance Imaging (fMRI) scans are typically reduced by averaging activations in brain atlas regions. The averaged activations are further reduced to pairwise correlations in sliding fixed-width time windows. Unfortunately, such a reduction in dimensions also reduces the scan resolution and complicates interpretation. Changing to a text mining perspective, this thesis interprets the high-dimensional scans as documents with categorical words drawn from a study bag. Consecutive scans measure the activation in V discrete voxels of brain volumes. Activation series in each voxel are segmented into stationary subsequences. Similar correlated segments within voxels and from distinct voxels are then bagged as words. The words capture correlated activation both within and between voxels. Instead of being predefined in an atlas, regions emerge as neighbourhoods of voxels drawing the same word at the original scan resolution. The word counts that document voxels draw from the bag of categorical words define the document state. Document state transition probabilities measure the dynamics in coactivated brain locations at the original fMRI resolution, as a possible marker for a neurological condition. This alternative fMRI activation reduction method avoids the a priori selection of regions, the tuning of fixed time window widths, and the selection of the number of principal components required by the contrasted existing method; the alternative method allows a more direct interpretation of activations. However, the direct state-switching interpretation of scan document voxels drawing categorical word counts does not sufficiently separate subject groups for reliable classification of neurological conditions.
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
The Restricted Mean Survival Time (RMST) is a statistic that measures treatment effects and can be used as a replacement for the hazard ratio when the proportional hazards assumption is violated. The idea of the RMST came from Irwin (1949) [5], and, combined with the formal definition of the survival function, the RMST can be defined as the integral of the survival function up to a time limit τ. Several different methods for estimating the RMST are available. The Kaplan-Meier method and the Cox PH model are the most commonly used methods in survival analysis, and they are also suitable for estimating the RMST: the survival curve is first estimated, and the area under it is then calculated to give an estimate of the RMST. To allow a more general population of survival time distributions, a flexible parametric model was introduced by Royston and Parmar (2002) [4]. This flexible parametric model follows the same route to the RMST as the Kaplan-Meier method and the Cox PH model: a survival function is estimated from the model, and a 15-point Gauss-Kronrod quadrature is then used to calculate the integral of the survival function, which yields the RMST estimate. The final option is a pseudo-observation method proposed by Andersen et al. (2004) [3]. This method first builds a pseudo-observation of the RMST for each subject. Then, using the pseudo-observations as outcome variables, a generalized linear model can be built to describe the relationship between the covariates and the RMST. A generalized estimating equation (GEE) method can then be used to estimate the parameters of the generalized linear model [8]. Comparisons between these methods under various simulation scenarios were conducted for this thesis. The Kaplan-Meier method is simple to calculate and performs well with early time limits and low censoring proportions. It is also faster at estimating the RMST than the Cox model and the flexible parametric model. However, this method cannot be adjusted for covariates, so it is only suitable for estimating the average RMST difference for a population. The unstratified Cox model performed well in datasets that satisfied the proportional hazards assumption, and the stratified Cox model also performed well in our simulated non-proportional hazards datasets. The performance of the flexible parametric model was similar to that of the Cox model, but it is more time-consuming in the integral calculation step. The pseudo-observation method offered the shortest computation time among all four methods. However, when estimating the RMST difference for a subject with given age and gender, the performance of the pseudo-observation method was worse than that of either the Cox model or the flexible parametric model.
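The Kaplan-Meier route to the RMST, estimating the survival curve and integrating it up to τ, can be sketched in a few lines (toy event times, no tie handling; not the thesis's simulation code):

```python
# RMST as the area under the Kaplan-Meier step function up to tau.
# Toy data; ties and tie-breaking between events and censorings are ignored.

def km_rmst(times, events, tau):
    """Integrate the Kaplan-Meier survival estimate from 0 to tau."""
    data = sorted(zip(times, events))
    s, rmst, last_t, at_risk = 1.0, 0.0, 0.0, len(data)
    for t, d in data:
        if t > tau:
            break
        rmst += s * (t - last_t)      # area of the step ending at t
        if d:
            s *= 1 - 1 / at_risk      # KM drop at an event time
        at_risk -= 1                  # events and censorings leave the risk set
        last_t = t
    rmst += s * (tau - last_t)        # final partial step up to tau
    return rmst

# With no censoring, RMST at tau=4 is just the mean of min(T, 4).
print(km_rmst(times=[1, 2, 3, 4], events=[1, 1, 1, 1], tau=4))
```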
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Liver transplantation, i.e. the replacement of a diseased liver with a healthy liver from another person, is the most effective therapeutic strategy for patients with end-stage liver disease. Predicting the survival of patients after liver transplantation is regarded as one of the most challenging areas in medicine; hence, selecting the best prediction model is of paramount importance. Machine learning, a field of computer science in which specific algorithms are used to learn from and make predictions on data, has lately received increased attention in the medical field due to its contributions to medical imaging, its ability to diagnose diseases and its great potential for personalized treatment. In survival analysis, implementing machine learning is difficult due to censored data. In this thesis, random survival forests and partial logistic artificial neural networks have been applied. The Cox model has traditionally been used almost exclusively, owing to its easy implementation and straightforward interpretation. The model is, however, restricted by the proportional hazards assumption, whereas the machine learning techniques do not make any such assumptions. Nowadays, there is a lively discussion in the medical field about machine learning and whether it has greater potential than Cox models when it comes to complex data. Criticism of machine learning concerns unsuitable performance measures and a lack of interpretability, which is important for medical personnel. The potential of machine learning is investigated for a large dataset of 62,294 patients in the USA with 106 prognostic factors selected from over 600: 52 donor characteristics and 54 patient characteristics. A meticulous comparison is performed between three proportional hazards models and the machine learning techniques. For the artificial neural network, novel extensions to its original specification are provided using state-of-the-art R software. A variety of measures is employed, not only from the survival field but also from the simple classification setting. Of particular interest in this project is the identification of potential post-operative risk factors. Two survival outcomes are reported: overall survival (time to death since operation) and failure-free survival (minimum of time to graft failure and time to death since operation). This thesis shows that machine learning techniques can be a useful tool for both prediction and interpretation. The random survival forest shows, in general, better predictive performance than the Cox models. Neural networks can reach performance comparable to the Cox models and even perform better on some classification metrics. However, high instability is present due to the lack of a global performance evaluation measure in the survival setting.
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Currently, the ‘wish to move’ to another house of the Dutch population is measured through the WoON survey, conducted every three years. A more frequent way of measuring is desired to improve policy making in housing. Nowadays, people express their ‘wish to move’ on social media. In this research, it was found that certain features derived from tweet texts distinguish ‘wish to move’ tweets from others. The best logistic regression classifier developed in this research achieves an F1-score of 0.556 in identifying ‘wish to move’ tweets, indicating that it is possible to keep timely track of the ‘wish to move’ proportion of the Dutch population active on Twitter. Further, it was found that actual relocation can be identified by following ‘wish to move’ users. By engineering features through aggregating their subsequent tweets, classifiers were established to automatically determine whether a ‘wish to move’ user relocated in the follow-up period. The best logistic regression classifier can determine whether ‘wish to move’ users relocated in the two subsequent years with an F1-score of 0.701. With it, the proportion of ‘wish to move’ users who actually relocated later can be estimated.
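The F1-score reported above balances precision and recall, which matters when the ‘wish to move’ class is rare. A minimal illustration with made-up labels:

```python
# F1-score for a binary classifier: harmonic mean of precision and recall.
# Labels below are invented, not the thesis's tweet data.

def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 2 true positives, 1 false positive, 1 false negative.
f1 = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(round(f1, 3))
```

Unlike accuracy, the F1-score ignores true negatives, so a classifier cannot score well simply by predicting the majority (non-moving) class.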
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
This study deals with the introduction of a customer lifetime value for business customers, with a focus on lifetime estimations using mobile contracts that are part of larger business contracts of a large Dutch telecom provider. Customer lifetime value is the total profit or loss to a company over the whole period of transactions by a customer. Business customers are defined here as firms, or locations of large firms, that are contracted for one or more business products of the telecom provider. Customer lifetime values are calculated at the level of mobile contracts and then aggregated per location. In order to calculate customer lifetime values, individual lifetime predictions and a definition of the values are needed. The lifetime predictions resemble a survival analysis that models the time from becoming contract-free until one of three possible decisions (contract renewal, product migration or contract termination) is made. Using survival estimates and semi-parametric models, the overall survival is analyzed, as well as the influence of characteristics of the locations and of the companies to which the locations belong. Then, with the R package mstate, competing risks models are applied to model the time to each decision while taking the other possible decisions into account. Additionally, the lifetime estimations resulting from the competing risks models are updated, whereby the survival analysis starts several months after becoming contract-free. Results show that approximately 25% of the decisions had been made at the start of the study. The duration of mobile contracts and the ownership of a business internet product or a mobile internet product next to the mobile contract discriminate most between the occurrence of the decisions. Furthermore, results of the competing risks models show that the probabilities of making any decision attenuate over time. This is confirmed with a fictional product offer at both the mobile contract and business customer levels. The customer lifetime value as described here is a useful metric for the telecom provider for making customer selections and, after applying it to other business products, it could be used to discriminate between product offers.
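With competing decisions, the probability of each decision type is a cumulative incidence function rather than one minus a Kaplan-Meier curve; the thesis itself used the R package mstate for this. A toy empirical version without censoring (invented data):

```python
# Empirical cumulative incidence for competing risks, in the simplest case
# with no censoring: the fraction of units experiencing a given decision type
# by time t. Data below are invented, not the telecom provider's.

def cum_incidence(times, causes, cause, t):
    """P(decision of the given cause occurs by time t), empirical version."""
    n = len(times)
    return sum(1 for ti, ci in zip(times, causes)
               if ci == cause and ti <= t) / n

# Months until decision and decision type per contract
# (1 = renewal, 2 = migration, 3 = termination).
times = [2, 3, 3, 5, 8, 9]
causes = [1, 2, 1, 3, 1, 2]
print(cum_incidence(times, causes, cause=1, t=6))
```

With censoring present, the Aalen-Johansen estimator (as implemented in mstate) replaces this naive fraction, but the interpretation of the quantity is the same.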
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Discretization is a key issue in urban trajectory pattern mining tasks. By assuming that regions with different functions will probably have different densities of visiting people, we propose to segment the city map, and hence discretize trajectory data, by finding region boundaries based on strong density changes. We solve the map segmentation problem as a model selection problem, using the existing MDL-histogram approach. We also propose a heuristic algorithm so that we can apply the MDL-histogram method to two-dimensional data (longitude and latitude). Finally, we validate our approach and algorithm in simulation studies and on taxi trajectories from New York City.
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Over the past years there has been an increased interest in applying machine learning (ML) techniques to medical research. With the growing availability of mixed data (clinical and genomic, for instance), ML methods, which have great potential for modelling complex data, have been increasingly applied. Few publications, however, have seen clinical application, and the trend towards ML has been criticised for a lack of attention to proper validation and to the use of appropriate performance measures for quantifying model performance. Initially, in the context of medical research, machine learning methods were mainly used for diagnosis and detection, but recent years have seen a vast increase in ML modelling for the purpose of cancer prediction and prognosis. The latter trend has given rise to various adaptations of traditional ML approaches to censored survival data. Two such approaches, Biganzoli's survival neural network and Ishwaran's random survival forest, are evaluated in this thesis. They are compared to a statistical model, the widely used Cox proportional hazards model, in an application to a clinical dataset with 7 variables measured on 2025 osteosarcoma patients from the EURAMOS-1 clinical trial. The purpose of this thesis is twofold: 1) performing an in-depth comparison of the two ML methods and gaining insight into the potential of ML for clinical data with a limited number of predictors; 2) adding to the existing osteosarcoma literature, in which ML methods have a very limited presence. The analyses performed on the EURAMOS data are reinforced by a simulation study, which is novel in the approach it takes to ensure that the simulated data closely mimic the original. This thesis shows that for the EURAMOS-1 osteosarcoma data the Cox proportional hazards model is suitable, and that both ML approaches have limited added benefit. Appropriate performance measures are identified for assessing neural network and random survival forest performance. For the survival neural network, a modification to an existing measure is proposed to aid in identifying network instability, a known neural network pitfall. For the random survival forest, it is shown that while it is suitable for distinguishing high- and low-risk patients, it produces unreliable individual survival predictions. An additional, unrelated chapter has been included in this thesis, detailing the application of a dynamic prediction model to the EURAMOS-1 osteosarcoma data.
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
This thesis describes the model class of convexified convolutional neural networks (CCNNs), a type of deep learning model introduced by Zhang, Liang & Wainwright [1]. First, the steps towards the convex relaxation are described, as well as all the steps required to implement the algorithm. To this end, the thesis describes the mathematical structure of the shallow networks, how the function class can be relaxed to the convex case, and the role of Reproducing Kernel Hilbert Spaces, the Nyström method, and projected gradient descent on the nuclear norm ball. The main contribution of this work is the implementation and application to a new data set. The problems considered are a simulation study and a text classification task. The results of the CCNN implementation show that it can be successfully applied to text data through the use of vectorized word representations. Advantages and drawbacks compared to more mainstream approaches are discussed.
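The Nyström method mentioned above approximates a kernel matrix from a subset of landmark columns, which is what keeps the CCNN's kernelized formulation tractable. A generic numpy sketch of the approximation, not the authors' implementation:

```python
# Nystrom approximation: rebuild a PSD kernel matrix K from a few of its
# columns, K ~ C W^+ C.T, where C holds the landmark columns and W the
# landmark-by-landmark block. Generic sketch with a toy linear kernel.
import numpy as np

def nystrom(K, landmarks):
    """Low-rank approximation of K from the selected landmark columns."""
    C = K[:, landmarks]                  # n x m block of sampled columns
    W = K[np.ix_(landmarks, landmarks)]  # m x m block among landmarks
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
K = X @ X.T                              # rank-2 linear kernel matrix
K_hat = nystrom(K, landmarks=[0, 1])     # 2 landmarks suffice for rank 2
err = np.abs(K - K_hat).max()
```

Because K here has rank 2 and the two landmark rows of X are (almost surely) linearly independent, the approximation is exact up to floating-point error; for full-rank kernels one trades accuracy for the number of landmarks.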
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Functional connectivity (FC) is an important metric for characterizing brain mechanisms, and the assessment of resting-state FC is a popular tool for studying brain disease mechanisms. Correlations between functional magnetic resonance imaging (fMRI) blood-oxygenation-level-dependent (BOLD) time courses in different brain regions can measure FC, which has revealed a meaningful organization of spontaneous fluctuations in the brain during rest. Therefore, in most studies, the temporal and spatial dynamics of FC are measured by the correlation coefficients between the fMRI signals of several brain regions. However, recent research has shown that FC is not stationary: FC changes dynamically over time, reflecting additional and rich information about brain organization. In 2013, Leonardi et al. proposed a new approach based on principal component analysis (PCA) to reveal hidden patterns of coherent FC dynamics across multiple subjects. This thesis evaluates this new approach in a simulation study; a framework to test the approach is also proposed. The simulation study showed the advantages and disadvantages of the new approach. The results showed that the approach can extract the most important dynamic connectivity features underlying fMRI data, and it can effectively retrieve time-varying connectivity between brain regions during rest. The new approach identified connections with similar fluctuations and gave an efficient linear representation, but it is only sensitive to linear relations between connectivity pairs and yields robust results only under restricted conditions. Finally, some recommendations are provided for researchers using this method to study dynamic functional brain connectivity at rest.
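The Leonardi et al. approach can be summarized as PCA on vectorized sliding-window correlation matrices, yielding connectivity patterns sometimes called "eigenconnectivities". A sketch on random data (purely illustrative; the window construction and preprocessing in the original differ):

```python
# Sketch of PCA on dynamic connectivity: vectorise each sliding window's
# correlation matrix, then take principal components across windows.
# The "BOLD" data here are random noise, purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
n_windows, n_regions = 20, 4
tri = np.triu_indices(n_regions, k=1)        # upper-triangular pairs
conn = np.empty((n_windows, len(tri[0])))    # one connectivity vector/window
for w in range(n_windows):
    ts = rng.normal(size=(50, n_regions))    # fake BOLD window (50 scans)
    conn[w] = np.corrcoef(ts, rowvar=False)[tri]

# PCA via SVD of the centered window-by-connection matrix.
centered = conn - conn.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
eigenconnectivities = Vt                     # rows = connectivity patterns
explained = s**2 / (s**2).sum()              # variance explained per pattern
```

Each row of `Vt` is a pattern over region pairs whose expression over time is given by the corresponding column of `U`, which is the representation the simulation study above evaluates.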
Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Aerosols are tiny particles of various kinds and compositions suspended in the atmosphere, some of which have a critical, adverse impact on public health. Hence, modelling the prevalence and distribution of these separate types is vital for giving shape to informed policy on air quality. In this work, methods are described to identify clusters of similar aerosol type mixtures in the Earth's atmosphere on a global scale, on the basis of microphysical data from the space-borne remote sensing instrument POLDER-3. We report an unsupervised learning approach using the Self-Organizing Map (SOM) and k-means clustering, which allows for clustering without a priori assumptions on existing aerosol types, their nature or prevalence. Two methods are introduced to stabilize these clustering algorithms over multiple equal runs, to manage their convergence to local optima: the k-means nstart option is extended to the SOM, and a set-up is given for a new method, Expectation-Maximization-centered Mahalanobis clustering (EMcMc). A (repeated) v-fold cross-validation framework is presented to find the optimal number of clusters k in the data by means of cluster validation measures, currently including the Prediction Strength and validated variants of the Silhouette Width. Using a separate test set, the method can be used to optimize a generic k, countering overfitting. A novel validation index is developed which extends the Silhouette Width to data sets with many observations (large N): the Gridded Silhouette Width. All described methods are implemented in the statistical software package R and shown to work for simulated examples, originating from scaled Gaussian distributions with varying degrees of overlap. Analysis of the POLDER-3 data indicated that, using only four variables, 8 clusters can be found in a stable and reproducible fashion. The Silhouette indices did not appear to perform well for data as widely dispersed as these. The clusters found were characterized based on their variable distributions and geographical occurrence, which proved to be feasible and meaningful for real-life interpretations. The proposed aerosol types were dust, marine, urban-industrial, smoke, and mixtures thereof. Keywords: aerosol typing; unsupervised learning; self-organizing map; k-means clustering; cluster validation measures; cross-validation; gridded silhouette.
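The Silhouette Width underlying the validation indices above compares each point's mean within-cluster distance to its smallest mean distance to another cluster. A minimal one-dimensional illustration (not the Gridded variant developed in the thesis):

```python
# Minimal silhouette width for 1-D points: mean over points of
# (b - a) / max(a, b), with a = mean distance to the own cluster and
# b = smallest mean distance to another cluster. Illustrative only.

def silhouette(points, labels):
    scores = []
    for i, (p, l) in enumerate(zip(points, labels)):
        own = [abs(p - q) for j, (q, m) in enumerate(zip(points, labels))
               if m == l and j != i]
        a = sum(own) / len(own)
        b = min(sum(abs(p - q) for q, m in zip(points, labels) if m == c)
                / labels.count(c)
                for c in set(labels) if c != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated clusters score close to the maximum of 1.
s = silhouette([0.0, 0.1, 10.0, 10.1], [0, 0, 1, 1])
print(round(s, 3))
```

Values near 1 indicate compact, well-separated clusters; values near 0 indicate overlap, which is the regime where the thesis found the Silhouette indices unreliable for widely dispersed data.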