Comparative analysis of resampling algorithms in the prediction of stroke diseases

DOI:

https://doi.org/10.56919/usci.2123.011

Keywords:

Stroke, imbalanced data, resampling algorithm, SMOTE, Machine learning classifier

Abstract

Stroke is a major cause of death globally. Early prediction of the disease can save many lives, but most clinical datasets, including the stroke dataset, are imbalanced, which biases predictive algorithms towards the majority class. The objective of this research is to compare different data resampling algorithms on the stroke dataset in order to improve the predictive performance of machine learning models. This paper considers five resampling algorithms: Random Oversampling (ROS), Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic sampling (ADASYN), and two hybrid techniques, SMOTE with Edited Nearest Neighbor (SMOTE-ENN) and SMOTE with Tomek Links (SMOTE-TOMEK). The resampled data were used to train six machine learning classifiers: Logistic Regression (LR), Decision Tree (DT), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF), and XGBoost (XGB). The hybrid technique SMOTE-ENN improved the classifiers the most, followed by SMOTE, while the combination of SMOTE and XGB performed best, with an accuracy of 97.99%, a G-mean of 0.99, and an AUC-ROC of 0.99. Resampling balanced the dataset and enhanced the predictive power of the machine learning algorithms. We therefore recommend resampling the stroke dataset before predicting stroke, rather than modeling on the imbalanced data.
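
The two ideas the abstract relies on, rebalancing the classes and scoring with G-mean instead of accuracy alone, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `random_oversample` and `g_mean` are hypothetical, the toy data are made up, and G-mean is assumed to follow its common definition as the square root of sensitivity times specificity. Only the simplest of the five resamplers (ROS) is shown.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (the idea behind ROS)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy data (made up for illustration): 95 non-stroke rows, 5 stroke rows.
X = [[i] for i in range(100)]
y = [0] * 95 + [1] * 5

# After oversampling, both classes have the same number of rows.
X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)

# A model that always predicts the majority class looks accurate
# but never detects a stroke case, so its G-mean collapses to zero.
always_negative = [0] * len(y)
accuracy = sum(t == p for t, p in zip(y, always_negative)) / len(y)
score = g_mean(y, always_negative)
```

In a real experiment, resampling would normally be applied to the training split only, so that duplicated or synthetic rows do not leak into the evaluation data.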

Published

2023-03-30

How to Cite

Abdullahi, D. S., Aliyu, D. M. S., & Musa Abdullahi, U. (2023). Comparative analysis of resampling algorithms in the prediction of stroke diseases. UMYU Scientifica, 2(1), 88–94. https://doi.org/10.56919/usci.2123.011