CLASSIFICATION OF IMBALANCED DATA: A REVIEW
Abstract
Imbalanced class distributions significantly degrade the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. This paper reviews the classification of imbalanced data with respect to: the application domains; the nature of the problem; the learning difficulties encountered by standard classifier learning algorithms; the learning objectives and evaluation measures; the reported research solutions; and the class imbalance problem in the presence of multiple classes.