Proposing an ensemble-based model using data clustering and machine learning algorithms for effective predictions

Date

2019-08-01

Abstract

Prediction is one of the most important tasks in machine learning. Data scientists use various regression methods to find the most appropriate and accurate model for each type of dataset. This study proposes a meta-model to improve prediction accuracy. In common practice, different models are applied to the whole dataset and the one with the highest accuracy is kept, so a single global model is developed for the entire dataset. In the proposed approach, we first cluster the data using both algorithm-based and expert-based methods. Algorithm-based clustering uses algorithms such as K-means, DBSCAN, and agglomerative hierarchical clustering. Expert-based clustering uses expert knowledge to group the data according to important features selected by experts. Then, for each clustering method and each generated cluster, we apply different machine learning models, including linear and polynomial regression, SVR, neural networks, genetic programming, and other techniques, and select the most accurate prediction model per cluster. Each cluster contains fewer samples than the original dataset, and a model trained on fewer samples is prone to lose accuracy. On the other hand, customizing a model for each sub-dataset can yield more effective predictions than fitting one model to the whole dataset. The proposed model can therefore be categorized as ensemble-based, because the prediction results from the collaboration of various models over clusters of sub-datasets. Moreover, the finer granularity of the proposed method allows it to be parallelized more efficiently.
As our main case study, we used real-estate data with more than 21,000 instances and 20 features to improve house price prediction. However, the approach is applicable to other large datasets. To examine its capability, we applied the proposed method to two other datasets: an agricultural dataset with 10 features and more than 7,000 instances, and the Facebook comments volume dataset, which contains roughly 41,000 samples with 54 features. For the first dataset, the new approach reduces the error value from 0.14 to 0.087 with K-means clustering and to 0.086 with grouping based on human knowledge. In our second case study, the water evaporation data showed no considerable overall improvement in accuracy; however, accuracy did improve in some sub-datasets.
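
The per-cluster model selection described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the toy data, the hold-out split, and the use of mean absolute error are all assumptions made for illustration, and two simple pure-Python candidates (a mean baseline and a one-feature least-squares line) stand in for the regressors listed above. The `label_of` argument is a hypothetical hook that can carry either labels produced by a clustering algorithm or an expert-defined grouping rule.

```python
import random
from collections import defaultdict
from statistics import mean

def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    m = mean(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Candidate: ordinary least squares on a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # degenerate cluster: fall back to the mean
        return lambda x: my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

# Stand-ins for the model families named in the abstract (assumption).
CANDIDATES = {"mean": fit_mean, "linear": fit_linear}

def mae(model, rows):
    """Mean absolute error of a model over (x, y) rows."""
    return mean(abs(model(x) - y) for x, y in rows)

def select_per_cluster(rows, label_of, holdout=0.25, seed=0):
    """Group rows by label, then keep the candidate with the lowest
    hold-out error in each group. Returns {label: (name, model)}."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for x, y in rows:
        groups[label_of(x)].append((x, y))
    chosen = {}
    for label, members in groups.items():
        rng.shuffle(members)
        cut = max(1, int(len(members) * holdout))
        valid = members[:cut]
        train = members[cut:] or valid  # tiny clusters reuse their rows
        xs, ys = zip(*train)
        fitted = {name: fit(xs, ys) for name, fit in CANDIDATES.items()}
        best = min(fitted, key=lambda name: mae(fitted[name], valid))
        chosen[label] = (best, fitted[best])
    return chosen

def predict(chosen, label_of, x):
    """Route a sample to the model selected for its cluster."""
    return chosen[label_of(x)][1](x)

# Toy data with two regimes (invented for this sketch).
rows = [(x, 2 * x if x < 50 else 300 - x) for x in range(100)]

# Hypothetical expert rule: split on a threshold of the feature.
expert_rule = lambda x: "low" if x < 50 else "high"
single_group = lambda x: "all"  # one global model for comparison

chosen = select_per_cluster(rows, expert_rule)
global_chosen = select_per_cluster(rows, single_group)

clustered_mae = mean(abs(predict(chosen, expert_rule, x) - y) for x, y in rows)
global_mae = mean(abs(predict(global_chosen, single_group, x) - y) for x, y in rows)
print(f"global MAE: {global_mae:.3f}, per-cluster MAE: {clustered_mae:.3f}")
```

On this toy data the per-cluster models fit each regime separately, so their combined error is lower than that of the single global model. On real data the trade-off noted in the abstract applies: smaller clusters leave fewer samples per model, so an improvement is not guaranteed. Because each cluster is fitted independently, the loop over groups is also the natural place to parallelize.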

Keywords

Data mining, Machine learning, Clustering, Regression, Prediction
