Master thesis | Statistical Science for the Life and Behavioural Sciences (MSc)
open access
Aerosols are tiny particles of various kinds and compositions suspended in the atmosphere, some of which have a critical, adverse impact on public health. Hence, modelling the prevalence and distribution of these separate types is vital for shaping informed policy on air quality. In this work, methods are described to identify clusters of similar aerosol type mixtures in the Earth’s atmosphere on a global scale, on the basis of microphysical data from the space-borne remote sensing instrument POLDER-3. We report an unsupervised learning approach using the Self-Organizing Map (SOM) and k-means clustering, which allows for clustering without a priori assumptions on existing aerosol types, their nature or their prevalence. Two methods are introduced to stabilize these clustering algorithms over repeated runs with identical settings, and so manage their tendency to converge to local optima: the k-means nstart option is extended to the SOM, and a set-up is given for a new method, Expectation-Maximization-centered Mahalanobis clustering (EMcMc). A (repeated) v-fold cross-validation framework is presented to find the optimal number of clusters k in the data by means of cluster validation measures, currently including Prediction Strength and validated variants of the Silhouette Width. Using a separate test set, the method can be used to optimize a generic k, countering overfitting. A novel validation index is developed which extends the Silhouette Width to data sets with many observations (large N): the Gridded Silhouette Width. All described methods are implemented in the statistical software package R and shown to work for simulated examples, originating from scaled Gaussian distributions with varying degrees of overlap.
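As an illustration of the restart idea behind the k-means nstart option mentioned above, the following is a minimal sketch: the algorithm is run several times from different random starts and the run with the lowest total within-cluster sum of squares is kept. It is not the thesis code, which is written in R. It uses Python with NumPy purely for illustration, and the function names `kmeans_once` and `kmeans_nstart` are hypothetical. It also does not cover the extension to the SOM or the EMcMc method.

```python
import numpy as np


def kmeans_once(X, k, rng, n_iter=100):
    """One run of Lloyd's k-means from a single random initialisation."""
    X = np.asarray(X, dtype=float)
    # Start from k distinct observations chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance of every point to every centre.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and total within-cluster sum of squares.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wss = d2[np.arange(len(X)), labels].sum()
    return labels, centers, wss


def kmeans_nstart(X, k, nstart=10, seed=0):
    """Run k-means nstart times and keep the run with the lowest WSS."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(nstart)]
    return min(runs, key=lambda run: run[2])


# Usage on simulated Gaussian data (assumed example, not the POLDER-3 data).
data_rng = np.random.default_rng(1)
X_demo = np.vstack(
    [data_rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)]
)
labels_demo, centers_demo, wss_demo = kmeans_nstart(X_demo, k=3, nstart=10)
```

Because each run can end in a different local optimum, keeping the best of several runs makes the reported solution less dependent on one particular initialisation. It does not guarantee that the global optimum is found.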
Analysis of the POLDER-3 data indicated that, using only four variables, eight clusters can be found in a stable and reproducible fashion. The Silhouette indices did not appear to perform well for data as widely dispersed as these. The clusters found were characterized based on their variable distributions and geographical occurrence, which proved to be feasible and meaningful for real-life interpretations. The proposed aerosol types were dust, marine, urban-industrial, smoke and mixtures thereof.
Keywords: aerosol typing; unsupervised learning; self-organizing map; k-means clustering; cluster validation measures; cross-validation; gridded silhouette.
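For readers unfamiliar with the Silhouette Width referred to above, the standard definition can be written as follows. This is the usual textbook form and is not taken from the thesis. The validated variants and the Gridded Silhouette Width developed in the thesis are not reproduced here.

```latex
\[
  s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
  \qquad
  \bar{s} = \frac{1}{N} \sum_{i=1}^{N} s(i)
\]
```

Here a(i) is the mean dissimilarity between observation i and the other members of its own cluster, and b(i) is the smallest mean dissimilarity between observation i and the members of any other cluster. Each s(i) lies between -1 and 1, with values near 1 indicating a well-matched observation. By convention, s(i) = 0 for an observation that is alone in its cluster. Evaluating the average over all N observations involves dissimilarities between pairs of observations, which is presumably what makes the plain index costly for large N.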