Rozema, Lude 2022

Subspace Independent Component Analysis (SICA): A comparison of methods for cluster analysis in high dimensionality

Master thesis | Psychology (MSc)

open access

Clustering algorithms are important for data mining, and K-means is one of the most well-known clustering algorithms currently available. When data are high-dimensional, however, applying K-means directly to a data set may fail to uncover clusters because of masking variables, the curse of dimensionality, and difficulties in interpreting the obtained solution. A commonly used work-around is to apply dimension reduction to the data before performing cluster analysis, a practice called Tandem Analysis (TA). A vulnerability of TA is that the applied dimension reduction is not guaranteed to preserve the cluster structure present in the original data, which jeopardises the usefulness of the subsequent cluster analysis. Multiple authors have provided algorithms that reduce the dimensionality of a data set and perform cluster analysis on the reduced data, either sequentially or simultaneously, all aiming to find suitable low-dimensional representations of the data while keeping cluster structures intact. In this thesis, a novel approach to reducing dimensionality and performing cluster analysis on the low-dimensional representation of the data, called SICA, is described, thoroughly tested in two systematically manipulated simulation studies, and applied in three empirical data applications. Results show that SICA is a computationally efficient algorithm that is well able to extract components from the original data that preserve cluster structures, but that its performance depends on characteristics of the data and the model of data generation.
In addition, the correctness and validity of the clusterings obtained through SICA are high, although SICA does not always outperform currently available methods in this regard and depends on the same characteristics of the data and the model of data generation as the other algorithms. Limitations and implications for future research are discussed.
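
As an illustration of the Tandem Analysis idea mentioned in the abstract, the following is a minimal sketch that assumes PCA as the dimension-reduction step and plain Lloyd-style K-means as the clustering step. It is not SICA and not the thesis's implementation; the function names `pca_scores`, `kmeans`, and `tandem_analysis`, and the toy data, are hypothetical and chosen only for this example.

```python
import numpy as np


def pca_scores(X, n_components):
    """Project centered data onto its leading principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; initial centers are k randomly chosen rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster became empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def tandem_analysis(X, n_components, k, seed=0):
    """Step 1: reduce dimensionality with PCA. Step 2: cluster the scores."""
    return kmeans(pca_scores(X, n_components), k, seed=seed)


# Toy data (an assumption for illustration): 10 variables, of which only the
# first separates two groups of 50 observations each.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
X[:50, 0] -= 10
X[50:, 0] += 10
labels = tandem_analysis(X, n_components=1, k=2)
```

In this toy case the separating variable also carries the most variance, so the first principal component keeps the cluster structure. Because PCA selects directions by variance alone, a high-variance variable unrelated to the clusters could dominate the retained components instead, which is the vulnerability of Tandem Analysis that the abstract describes.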

Leiden University Student Repository
