CLASSIFICATION OF IMBALANCED DATA: A REVIEW

    https://doi.org/10.1142/S0218001409007326
    Cited by: 1173 (Source: Crossref)

    An imbalanced class distribution significantly degrades the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. This paper reviews the classification of imbalanced data with respect to: the application domains; the nature of the problem; the learning difficulties it poses for standard classifier learning algorithms; the learning objectives and evaluation measures; the reported research solutions; and the class imbalance problem in the presence of multiple classes.
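    To make the abstract's point about balanced-class and equal-cost assumptions concrete, the following is a minimal sketch rather than anything taken from the paper. It scores a degenerate classifier that always predicts the majority class on a toy 990-to-10 label set. The data, the cost values, and the choice of the geometric mean of per-class recalls as an alternative measure are all assumptions made for illustration; the abstract does not say which evaluation measures the review covers.

```python
# Toy labels (hypothetical): 990 majority samples (0) and 10 minority samples (1).
y_true = [0] * 990 + [1] * 10

# A degenerate "classifier" that always predicts the majority class.
y_pred = [0] * len(y_true)


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn), treating `positive` as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fn, fp, tn


tp, fn, fp, tn = confusion_counts(y_true, y_pred)

# Overall accuracy weights every sample equally, so the majority class dominates.
accuracy = (tp + tn) / len(y_true)

# Per-class recalls expose what accuracy hides.
minority_recall = tp / (tp + fn) if (tp + fn) else 0.0
majority_recall = tn / (tn + fp) if (tn + fp) else 0.0

# Geometric mean of the two recalls: zero whenever one class is never detected.
g_mean = (minority_recall * majority_recall) ** 0.5

# Unequal misclassification costs (hypothetical values, not from the paper):
# missing a minority sample is assumed to be 50 times worse than a false alarm.
COST_FN = 50.0
COST_FP = 1.0
total_cost = COST_FN * fn + COST_FP * fp

print(f"accuracy        = {accuracy:.3f}")
print(f"minority recall = {minority_recall:.3f}")
print(f"majority recall = {majority_recall:.3f}")
print(f"g-mean          = {g_mean:.3f}")
print(f"total cost      = {total_cost:.1f}")
```

    Under these assumptions the classifier reaches 99% accuracy while detecting none of the minority samples, which is why a measure or cost model that treats the classes separately gives a very different picture.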
