TY - GEN
T1 - Sparse modeling-based sequential ensemble learning for effective outlier detection in high-dimensional numeric data
AU - Pang, Guansong
AU - Cao, Longbing
AU - Chen, Ling
AU - Lian, Defu
AU - Liu, Huan
N1 - Funding Information: This work is partially supported by the ARC Discovery Grant DP140100545. Publisher Copyright: Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Copyright: Copyright 2019 Elsevier B.V., All rights reserved.
PY - 2018
Y1 - 2018
N2 - The large proportion of irrelevant or noisy features in real-life high-dimensional data presents a significant challenge to subspace/feature selection-based high-dimensional outlier detection (a.k.a. outlier scoring) methods. These methods often perform two dependent tasks, relevant feature subset search and outlier scoring, independently, and consequently retain features/subspaces irrelevant to the scoring method and degrade detection performance. This paper introduces a novel sequential ensemble-based framework SEMSE and its instance CINFO to address this issue. SEMSE learns sequential ensembles to mutually refine feature selection and outlier scoring by iterative sparse modeling with outlier scores as the pseudo target feature. CINFO instantiates SEMSE by using three successive recurrent components to build such sequential ensembles. Given outlier scores output by an existing outlier scoring method on a feature subset, CINFO first defines a Cantelli's inequality-based outlier thresholding function to select outlier candidates with a false positive upper bound. It then performs lasso-based sparse regression on the outlier candidate set, treating the outlier scores as the target feature and the original features as predictors, to obtain a feature subset that is tailored for the outlier scoring method. Our experiments show that two different outlier scoring methods enabled by CINFO (i) perform significantly better on 11 real-life high-dimensional data sets, and (ii) have much better resilience to noisy features, compared to their bare versions and three state-of-the-art competitors. The source code of CINFO is available at https://sites.google.com/site/gspangsite/sourcecode.
AB - The large proportion of irrelevant or noisy features in real-life high-dimensional data presents a significant challenge to subspace/feature selection-based high-dimensional outlier detection (a.k.a. outlier scoring) methods. These methods often perform two dependent tasks, relevant feature subset search and outlier scoring, independently, and consequently retain features/subspaces irrelevant to the scoring method and degrade detection performance. This paper introduces a novel sequential ensemble-based framework SEMSE and its instance CINFO to address this issue. SEMSE learns sequential ensembles to mutually refine feature selection and outlier scoring by iterative sparse modeling with outlier scores as the pseudo target feature. CINFO instantiates SEMSE by using three successive recurrent components to build such sequential ensembles. Given outlier scores output by an existing outlier scoring method on a feature subset, CINFO first defines a Cantelli's inequality-based outlier thresholding function to select outlier candidates with a false positive upper bound. It then performs lasso-based sparse regression on the outlier candidate set, treating the outlier scores as the target feature and the original features as predictors, to obtain a feature subset that is tailored for the outlier scoring method. Our experiments show that two different outlier scoring methods enabled by CINFO (i) perform significantly better on 11 real-life high-dimensional data sets, and (ii) have much better resilience to noisy features, compared to their bare versions and three state-of-the-art competitors. The source code of CINFO is available at https://sites.google.com/site/gspangsite/sourcecode.
UR - http://www.scopus.com/inward/record.url?scp=85057546667&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85057546667&partnerID=8YFLogxK
M3 - Conference contribution
T3 - 32nd AAAI Conference on Artificial Intelligence, AAAI 2018
SP - 3892
EP - 3899
BT - 32nd AAAI Conference on Artificial Intelligence, AAAI 2018
PB - AAAI Press
T2 - 32nd AAAI Conference on Artificial Intelligence, AAAI 2018
Y2 - 2 February 2018 through 7 February 2018
ER -