Pushing the limits of traditional unsupervised learning
Date
2018-08-01
Abstract
Unsupervised learning has important applications in extremely large data settings, such as medical, biological, social, and environmental data. In these settings, copious amounts of data are typically collected, with the additional burdens of high dimensionality and unavailable class labels. Improving the performance and usability of unsupervised learning algorithms improves resource management and the delivery of services to users. Although deep learning methods have become popular due to their success in the supervised learning problem of classification and in the unsupervised learning problems of feature extraction and cluster analysis, traditional machine learning methods can still provide state-of-the-art performance. In this thesis, a novel clustering framework is presented that combines common clustering and feature extraction methods with careful parameter selection. This framework achieves state-of-the-art clustering performance, better than that of many deep learning-based methods, on large benchmark and web-based text and image datasets. The pipeline incorporates deep learning-style feature extraction, but without the onerous hyper-parameter tuning procedure. Next, two novel methods are provided for testing the significance and reliability of clusters, in which the null-hypothesis distribution is formed from either (1) a uniform distribution projected onto the principal components of the original data or (2) a randomized, weighted adjacency matrix. Significance testing of clusters is important when the nature or underlying properties of the data are unknown, especially in large data settings or with nonstandard datasets, because a random sample of the population could have properties that are not representative of the whole population and thus yield a clustering result that is not typical of it.
Finally, given the success of traditional matrix factorization methods in the clustering pipeline, a new convolutional neural network architecture that leverages singular value decomposition was developed for text document classification. This new model provided state-of-the-art document classification accuracy.
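As an illustration of the first null-hypothesis construction mentioned in the abstract, the following is a minimal sketch and not the thesis's actual procedure. It assumes that the null data are drawn uniformly over the bounding box of the data in its principal-component basis, that cluster quality is measured by k-means within-cluster sum of squares, and that significance is estimated by a Monte Carlo p-value. The function names (`pca_uniform_null`, `kmeans_inertia`, `cluster_p_value`) and all parameter choices are hypothetical.

```python
import numpy as np

def pca_uniform_null(X, rng):
    """Draw one null dataset: uniform over the bounding box of X
    expressed in its principal-component basis (an assumed construction)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T
    lo, hi = scores.min(axis=0), scores.max(axis=0)
    null_scores = rng.uniform(lo, hi, size=scores.shape)
    # Map the uniform sample back to the original feature space.
    return null_scores @ Vt + mean

def kmeans_inertia(X, k, rng, n_iter=50):
    """Within-cluster sum of squares from a plain Lloyd's k-means run."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

def cluster_p_value(X, k, n_null=19, seed=0):
    """Monte Carlo p-value: how often null data cluster at least as
    tightly as the observed data (lower inertia means tighter clusters)."""
    rng = np.random.default_rng(seed)
    observed = kmeans_inertia(X, k, rng)
    null = [kmeans_inertia(pca_uniform_null(X, rng), k, rng)
            for _ in range(n_null)]
    return (1 + sum(v <= observed for v in null)) / (1 + n_null)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two well-separated synthetic blobs, used only for demonstration.
    X = np.vstack([rng.normal(0, 0.5, (100, 2)),
                   rng.normal(10, 0.5, (100, 2))])
    print(cluster_p_value(X, k=2))
```

A small p-value here suggests that the observed partition is tighter than would be expected from structureless data occupying the same principal-component range. The second construction named in the abstract, based on a randomized weighted adjacency matrix, is not covered by this sketch.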
Keywords
Unsupervised feature learning, Cluster significance testing, Latent semantic analysis, Spectral clustering, Independent component analysis