About Some Data Precaution Techniques For K-Means Clustering Algorithm

Authors

Zulkifilu, M. and Yasir, A.

DOI:

https://doi.org/10.56919/usci.1122.003

Keywords:

standardization, Mean absolute deviation, Decimal scale, min-max, z-score, K-means clustering

Abstract

Clustering is a technique for partitioning objects into groups such that the objects within a group are similar to one another and distinct from those in other groups. One of the most popular clustering techniques is the k-means clustering algorithm. Conventional k-means may not work well on high-dimensional datasets because of the noise, inconsistencies, and outliers present in the original data. Some form of transformation is therefore required to prepare the data for clustering. Four data pre-processing methods are applied before the clustering algorithm to make the data clean, noise-free, and consistent. The impact of data pre-processing on the basic k-means clustering algorithm was tested on real-life data using four normalization techniques: z-score, min-max, decimal scaling, and mean absolute deviation. We find that pre-processing before clustering yields good clustering results and significantly reduces the running time compared with the traditional technique. We also conclude that mean absolute deviation is the best of the four normalization methods, as it captures all clustering points.
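The four normalization methods named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the formulas, so the common textbook definitions are assumed here. The function names, the choice of population standard deviation for the z-score, the handling of constant columns, and the sample values are all hypothetical and are not taken from the paper.

```python
from statistics import fmean, pstdev


def z_score(values):
    """(x - mean) / standard deviation (population form, an assumption)."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:  # constant column: no spread to scale by
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map the observed range onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer
    such that every scaled absolute value is below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


def mean_abs_deviation_scale(values):
    """(x - mean) / mean absolute deviation about the mean."""
    mu = fmean(values)
    mad = fmean([abs(v - mu) for v in values])
    if mad == 0:  # constant column
        return [0.0 for _ in values]
    return [(v - mu) / mad for v in values]


# Hypothetical single feature column, for illustration only.
sample = [20, 40, 60, 80, 100]
for scale in (z_score, min_max, decimal_scaling, mean_abs_deviation_scale):
    print(scale.__name__, [round(v, 3) for v in scale(sample)])
```

In this sketch each function rescales one feature column at a time. Before running k-means, the same transformation would be applied to every column so that no single feature dominates the distance computation because of its units.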

Published

2022-09-30

How to Cite

Zulkifilu, M., & Yasir, A. (2022). About Some Data Precaution Techniques For K-Means Clustering Algorithm. UMYU Scientifica, 1(1), 12–19. https://doi.org/10.56919/usci.1122.003