About Some Data Precaution Techniques For K-Means Clustering Algorithm
DOI:
https://doi.org/10.56919/usci.1122.003
Keywords:
standardization, Mean absolute deviation, Decimal scale, min-max, z-score, K-means clustering
Abstract
Clustering is a technique for partitioning objects into groups such that the objects within each group are similar to one another and distinct from those in other groups. One of the most popular clustering techniques is the k-means clustering algorithm. Conventional k-means techniques may not work well for high-dimensional datasets because of the noise, discrepancies, and outliers in the original data. Some form of transformation is therefore required to prepare the data for clustering. Four different data pre-processing methods are applied before the clustering algorithm to make the data clean, noise-free, and consistent. The impact of data pre-processing on the basic k-means clustering algorithm was tested on real-life data using four normalization techniques: z-score, min-max, decimal scaling, and mean absolute deviation. We find that pre-processing before clustering yields good clustering results and significantly reduces the running time compared to the traditional techniques. We also conclude that mean absolute deviation is the best of the four normalization methods, as it captures all clustering points.
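The four normalization techniques named in the abstract can be illustrated on a single numeric feature. The abstract does not give the formulas used in the paper, so the sketch below assumes the common textbook definitions; the function names are hypothetical and chosen only for this illustration.

```python
from statistics import fmean, pstdev


def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|v|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


def mean_abs_deviation(values):
    """Centre on the mean and divide by the mean absolute deviation.

    Assumption: this is the usual textbook variant of z-score that replaces
    the standard deviation with the mean absolute deviation.
    """
    mu = fmean(values)
    mad = fmean([abs(v - mu) for v in values])
    return [(v - mu) / mad for v in values]


# Illustrative feature column (made-up numbers, not data from the paper).
# Note: z_score, min_max and mean_abs_deviation divide by zero on a
# constant column, so such columns need separate handling.
feature = [2.0, 4.0, 6.0, 8.0]
for scale in (z_score, min_max, decimal_scaling, mean_abs_deviation):
    print(scale.__name__, scale(feature))
```

In practice each feature column would be scaled this way before the data are passed to k-means, so that no single feature dominates the Euclidean distances used to assign points to clusters.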
License
Copyright (c) 2022 UMYU Scientifica
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.