Outlier Detection in High Dimensional Data
Abstract
High-dimensional data poses unique challenges for outlier detection. Most existing algorithms fail to properly address the issues that stem from a large number of features. In particular, outlier detection algorithms perform poorly on small datasets with a large number of features. In this paper, we propose a novel outlier detection algorithm based on principal component analysis and kernel density estimation. The proposed method addresses the challenges of high-dimensional data by projecting the original data onto a lower-dimensional space and using the innate structure of the data to calculate an anomaly score for each data point. Numerical experiments on synthetic and real-life data show that our method performs well on high-dimensional data. In particular, the proposed method outperforms the benchmark methods as measured by F1-score. Our method also achieves better-than-average execution times compared with the benchmark methods.
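The general idea described above (project onto a few principal components, then score each point by its estimated density in the projected space) can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details the abstract does not give. The function names, the number of components, the Gaussian kernel, the leave-one-out density, and the Scott-style bandwidth rule are all assumptions made for illustration.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kde_log_density(Z, bandwidth):
    """Leave-one-out Gaussian KDE log-density at each row of Z."""
    n, d = Z.shape
    sq_dist = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    log_k = -0.5 * sq_dist / bandwidth**2
    np.fill_diagonal(log_k, -np.inf)  # exclude each point's own kernel
    # Log-sum-exp over the other points, for numerical stability.
    m = log_k.max(axis=1, keepdims=True)
    log_sum = m[:, 0] + np.log(np.exp(log_k - m).sum(axis=1))
    log_norm = np.log(n - 1) + 0.5 * d * np.log(2 * np.pi * bandwidth**2)
    return log_sum - log_norm

def anomaly_scores(X, n_components=2, bandwidth=None):
    """Higher score = lower estimated density in the projected space."""
    Z = pca_project(X, n_components)
    if bandwidth is None:
        # Assumption: a Scott-style rule of thumb with an averaged scale.
        n, d = Z.shape
        bandwidth = Z.std(axis=0).mean() * n ** (-1.0 / (d + 4))
    return -kde_log_density(Z, bandwidth)

# Hypothetical data: 60 inliers and 1 shifted point, 50 features each.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(60, 50))
outlier = rng.normal(size=(1, 50)) + 10.0
X = np.vstack([inliers, outlier])

scores = anomaly_scores(X, n_components=2)
top = int(np.argmax(scores))  # index of the most anomalous point
```

Here the sample is small relative to the number of features, which is the setting the abstract highlights. Scoring in the two-dimensional projection keeps the density estimate tractable, at the cost of ignoring any anomalies that are visible only in the discarded components.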