Class imbalance poses challenges for classifier learning and has drawn increasing attention in data mining and machine learning. Class overlap in real-world data further exacerbates the learning difficulty. In this paper, a novel pseudo oversampling method (POM) is proposed for learning from imbalanced and overlapping data. It is motivated by the observation that overlapping samples from different classes share the same distribution space, so the information carried by majority (negative) overlapping samples can be extracted and used to generate additional positive samples. A fuzzy logic-based membership function is defined to assess negative overlaps using both local and global information. The identified negative overlapping samples are then shifted into the positive sample region by a transformation matrix centered around the positive samples. Across 14 datasets, POM outperforms 15 competing methods in terms of the Gm, F1 and AUC metrics.
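The two steps described in the abstract (scoring negative samples for overlap, then shifting the overlapping ones toward the positive region) can be illustrated with a minimal sketch. The abstract does not give the actual membership function or transformation matrix, so everything concrete below is an assumption made for illustration: the function names `overlap_membership` and `pseudo_oversample`, the k-nearest-neighbour local term, the centroid-distance global term, the equal weighting of the two, the `threshold` parameter, and the scaled identity matrix used as the transformation. It is not the authors' POM implementation.

```python
import numpy as np


def overlap_membership(X_neg, X_pos, k=3):
    """Assumed fuzzy membership in [0, 1]: how strongly each negative overlaps the positives."""
    n_neg = len(X_neg)
    X_all = np.vstack([X_neg, X_pos])
    is_pos = np.concatenate([np.zeros(n_neg, dtype=bool), np.ones(len(X_pos), dtype=bool)])

    # Local information (assumed form): share of positives among the k nearest neighbours.
    dists = np.linalg.norm(X_neg[:, None, :] - X_all[None, :, :], axis=2)
    dists[np.arange(n_neg), np.arange(n_neg)] = np.inf  # a sample is not its own neighbour
    nearest = np.argsort(dists, axis=1)[:, :k]
    local = is_pos[nearest].mean(axis=1)

    # Global information (assumed form): closeness to the positive centroid,
    # scaled by the average spread of the positive class.
    centroid = X_pos.mean(axis=0)
    spread = np.linalg.norm(X_pos - centroid, axis=1).mean() + 1e-12
    global_ = np.exp(-np.linalg.norm(X_neg - centroid, axis=1) / spread)

    # Equal weighting of the two terms is an arbitrary illustrative choice.
    return 0.5 * (local + global_)


def pseudo_oversample(X_neg, X_pos, threshold=0.5, shrink=0.5, k=3):
    """Shift negatives with membership >= threshold toward the positive centroid."""
    mu = overlap_membership(X_neg, X_pos, k=k)
    overlaps = X_neg[mu >= threshold]
    centroid = X_pos.mean(axis=0)
    # Assumed transformation matrix: a uniform contraction around the positive centroid.
    T = shrink * np.eye(X_pos.shape[1])
    return centroid + (overlaps - centroid) @ T.T


# Toy data: four positives, two negatives inside the positive region, six far away.
X_pos = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_neg = np.array([[0.5, 0.9], [0.2, 0.4],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
                  [11.0, 11.0], [10.5, 10.5], [12.0, 12.0]])

synthetic_pos = pseudo_oversample(X_neg, X_pos)
X_pos_augmented = np.vstack([X_pos, synthetic_pos])
print(synthetic_pos.shape, X_pos_augmented.shape)
```

In this sketch only the two negatives that lie among the positives receive a high membership, and each is pulled halfway toward the positive centroid before being added to the positive class. A learned or data-dependent matrix could replace the scaled identity without changing the surrounding structure.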
Published in: Applied and Computational Mathematics (Volume 13, Issue 5)
DOI: 10.11648/j.acm.20241305.15
Page(s): 165-177
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright: © The Author(s), 2024. Published by Science Publishing Group
Keywords: Imbalanced Learning, Class Overlap, Feature Transformation, Oversampling
APA Style
Pan, T., Pedrycz, W., Yang, J., Zhang, D. (2024). Pseudo Oversampling Based on Feature Transformation and Fuzzy Membership Functions for Imbalanced and Overlapping Data. Applied and Computational Mathematics, 13(5), 165-177. https://doi.org/10.11648/j.acm.20241305.15
ACS Style
Pan, T.; Pedrycz, W.; Yang, J.; Zhang, D. Pseudo Oversampling Based on Feature Transformation and Fuzzy Membership Functions for Imbalanced and Overlapping Data. Appl. Comput. Math. 2024, 13(5), 165-177. doi: 10.11648/j.acm.20241305.15
AMA Style
Pan T, Pedrycz W, Yang J, Zhang D. Pseudo Oversampling Based on Feature Transformation and Fuzzy Membership Functions for Imbalanced and Overlapping Data. Appl Comput Math. 2024;13(5):165-177. doi: 10.11648/j.acm.20241305.15
@article{10.11648/j.acm.20241305.15,
  author = {Tingting Pan and Witold Pedrycz and Jie Yang and Dahai Zhang},
  title = {Pseudo Oversampling Based on Feature Transformation and Fuzzy Membership Functions for Imbalanced and Overlapping Data},
  journal = {Applied and Computational Mathematics},
  volume = {13},
  number = {5},
  pages = {165-177},
  doi = {10.11648/j.acm.20241305.15},
  url = {https://doi.org/10.11648/j.acm.20241305.15},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.acm.20241305.15},
  year = {2024}
}
TY  - JOUR
T1  - Pseudo Oversampling Based on Feature Transformation and Fuzzy Membership Functions for Imbalanced and Overlapping Data
AU  - Tingting Pan
AU  - Witold Pedrycz
AU  - Jie Yang
AU  - Dahai Zhang
Y1  - 2024/09/19
PY  - 2024
N1  - https://doi.org/10.11648/j.acm.20241305.15
DO  - 10.11648/j.acm.20241305.15
T2  - Applied and Computational Mathematics
JF  - Applied and Computational Mathematics
JO  - Applied and Computational Mathematics
SP  - 165
EP  - 177
PB  - Science Publishing Group
SN  - 2328-5613
UR  - https://doi.org/10.11648/j.acm.20241305.15
VL  - 13
IS  - 5
ER  -