Improving Model Performance for Predicting Exfiltration Attacks Through Resampling Strategies

Arif Rahman Hakim; Kalamullah Ramli; Muhammad Salman; Esti Rahmawati Agustina

doi:10.31436/iiumej.v26i1.3547

Authors

Arif Rahman Hakim University of Indonesia https://orcid.org/0009-0000-7621-0301
Kalamullah Ramli University of Indonesia https://orcid.org/0000-0002-0374-4465
Muhammad Salman University of Indonesia https://orcid.org/0009-0004-8282-1108
Esti Rahmawati Agustina University of Indonesia https://orcid.org/0009-0006-4917-8127


Keywords:

machine learning, imbalanced data, SMOTE, exfiltration

Abstract

Addressing class imbalance is critical in cybersecurity applications, particularly in exfiltration detection, where skewed datasets lead to biased predictions and poor generalization for minority classes. This study investigates five Synthetic Minority Oversampling Technique (SMOTE) variants (BorderlineSMOTE, KMeansSMOTE, SMOTEENC, SMOTEENN, and SMOTETomek) to mitigate severe imbalance in our customized tactic-labeled dataset, which is characterized by a dominant majority class and weak class separability. We use seven imbalance metrics to assess each SMOTE variant's impact on class distribution stability and separability. Furthermore, we evaluate model performance across five classifiers: Logistic Regression, Naïve Bayes, Support Vector Machine, Random Forest, and XGBoost. Findings reveal that SMOTEENN consistently improves the performance metrics (accuracy, precision, recall, F1-score, and geometric mean) to an average of 99% across most classifiers, establishing itself as the most adaptable variant for handling imbalance. This study provides a comprehensive framework for selecting resampling strategies to enhance classification efficacy in cybersecurity tasks with imbalanced data.
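As a rough illustration of the ideas named in the abstract, the following minimal sketch shows the basic SMOTE interpolation step and the geometric mean of per-class recall, using only the Python standard library. It is not the authors' implementation: the function names `smote_sketch` and `geometric_mean`, the toy data, and the choice of `k=2` are assumptions made for this example, and the variants studied in the paper add further steps (for instance, SMOTEENN follows oversampling with a nearest-neighbour cleaning pass) that are not shown here.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (basic SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> target
        synthetic.append(tuple(b + gap * (t - b) for b, t in zip(base, target)))
    return synthetic


def geometric_mean(y_true, y_pred):
    """Geometric mean of per-class recall (0.0 if any class is never recalled)."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / support)
    return math.prod(recalls) ** (1 / len(recalls))


# Toy minority class (hypothetical data, not from the study's dataset)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)

# Toy predictions: majority class 0 recalled 4/4, minority class 1 recalled 1/2
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
g_mean = geometric_mean(y_true, y_pred)
```

In this toy case the accuracy is 5/6, yet the geometric mean is only the square root of 1.0 × 0.5 (about 0.71), which shows why a recall-based metric exposes weak minority-class performance that accuracy alone can hide.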
