Numerous machine learning methods are suitable for anomaly detection. However, supervised algorithms are more constraining than unsupervised methods as they need to be provided with a labeled dataset. This requirement is particularly expensive when the labeling must be performed by humans. Dealing with a heavily imbalanced class distribution, which is inherent to anomaly detection, can also affect the efficiency of supervised algorithms [12]. This paper focuses on unsupervised machine learning algorithms to isolate outliers from nominal samples. In previous reviews, such methods were shown to be proficient for outlier and novelty detection [23, 33]. Unsupervised algorithms make use of unlabeled data to assign a score to each sample, representing the observation's normality. Binary segmentations can be further made by thresholding the scores.
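As an illustration of this scoring-and-thresholding step, the following sketch (not from the paper; a minimal example assuming NumPy and the convention that higher scores mean more normal observations) turns continuous scores into a binary segmentation:

```python
import numpy as np

def binarize_scores(scores, contamination=0.05):
    """Flag the lowest-scoring fraction of samples as outliers.

    `scores` are normality scores (higher = more normal); `contamination`
    is an assumed fraction of outliers used to place the threshold.
    """
    threshold = np.quantile(scores, contamination)
    return scores < threshold  # True marks a predicted outlier
```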
Outlier detection is a notoriously hard task: detecting anomalies can be difficult when they overlap with nominal clusters, and these clusters should be dense enough to build a reliable model. The problem of contamination, i.e. using an input dataset contaminated by outliers, makes this task even trickier, as anomalies may degrade the final model if the training algorithm lacks robustness. These issues, which are present in many real-world datasets, are not always addressed in works describing new unsupervised methods, as those algorithms may target a different use case. These reasons motivate the need for a thorough benchmark bringing together distinctive techniques on complex datasets.
Our paper extends previous works [7, 33] by using 12 publicly available labeled datasets, most of which are recommended for outlier detection in [7], in addition to 3 novel industrial datasets from the production environments of a major company in the travel industry. The benchmark made in [7] used fewer datasets and solely numerical features, while we benchmark and address ways to handle categorical data. While the previous works perform outlier detection on training data, our study tests for the generalization ability of all methods by predicting outliers on unseen testing data. The selected parametric and nonparametric algorithms come from a variety of approaches including probabilistic algorithms, nearest-neighbor based methods, neural networks, information theoretic and isolation methods. The performance on labeled datasets is compared with the area under the ROC and precision-recall curves.

In order to give a full overview of these methods, we also benchmark the training time, prediction time, memory usage and robustness of each method when increasing the number of samples, features and the background noise. These scalability measurements allow us to compare algorithms not only based on their outlier detection performance but also on their scalability, robustness and suitability for large dimensional problems.
The paper is organized as follows: section 2 introduces the research fields and methods targeted by the benchmark; section 3 details the experimental setup, the public and proprietary datasets used and the method implemented to generate synthetic datasets; section 4 presents the results of our benchmark; and section 5 summarizes our conclusions.
2. Outlier detection methods
The algorithms described in this section belong to a wide range of approaches. These methods build a model representing the nominal classes, i.e. dense clusters of similar data points, during a training phase. Online or batch predictions can thereafter be applied to new data based on the trained model to assign an anomaly score to the new observations. Applying a threshold on the returned scores provides a decision boundary separating nominal samples from outliers.

The evaluation presented in this study contains both parametric and nonparametric machine learning algorithms. While parametric approaches model the underlying data with a fixed number of parameters, the number of parameters of nonparametric methods is potentially infinite and can increase with the complexity of data. If the former are often computationally faster, they require assumptions about the data distribution, e.g. the number of clusters, and may result in a flawed model if based on erroneous assumptions. The latter make fewer assumptions about the data distribution and may thus generalize better, while requiring less knowledge about the data.
2.1. Probabilistic methods
Probabilistic algorithms estimate the probability density function of a dataset X by inferring the model parameters θ. Data points having the smallest likelihood P(X | θ) are identified as outliers. Most probabilistic methods described in this section can be trained incrementally, i.e. presenting new input data to an existing model will cause the model to adapt to the new data while retaining what was learned from the previous data.
The first algorithm used in our benchmark is the Gaussian Mixture Model (GMM), which fits a given number of Gaussian distributions to a dataset. The model is trained using the Expectation-Maximization (EM) algorithm [6], which maximizes a lower bound of the likelihood iteratively. This method has been successfully applied to identify suspicious and possibly cancerous masses in mammograms by novelty detection in [28]. However, assessing the number of components of the mixture by data exploration can be complex and motivates the use of nonparametric alternatives described hereafter.
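A minimal sketch of GMM-based outlier scoring, assuming scikit-learn; the number of components, the toy data and the 5% threshold are illustrative choices, not values from the paper:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))                    # nominal training data
X_test = np.vstack([rng.normal(size=(95, 4)),           # nominal test samples
                    rng.uniform(-8, 8, size=(5, 4))])   # injected outliers

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X_train)

log_likelihood = gmm.score_samples(X_test)              # per-sample log-likelihood
outliers = log_likelihood < np.quantile(log_likelihood, 0.05)
```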
In [3], Blei et al. describe the Dirichlet Process Mixture Model (DPMM), a nonparametric Bayesian algorithm which optimizes the model parameters and tests for convergence by monitoring a non-decreasing lower bound on the log marginal likelihood. The result is a mixture model where each component is a product of exponential-family distributions. Most likelihoods, e.g. Gaussian or Categorical, can be written in exponential-family form; in this formulation, it is possible to obtain suitable conjugate priors to facilitate inference.
The number of components is inferred during the training and requires an upper bound K. Weights π_k are represented by a Dirichlet Process modelled as a truncated stick-breaking process (equation 1). The variable v_k follows a Beta distribution, where α_k and β_k are variational parameters optimized during the training for each component:
\pi_k(v) = v_k \prod_{j=1}^{k-1} (1 - v_j), \qquad q_{\alpha,\beta}(v) = \prod_{k=1}^{K-1} \mathrm{Beta}(v_k \mid \alpha_k, \beta_k) \qquad (1)
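A small numerical illustration of the truncated stick-breaking construction of equation (1), assuming NumPy; the Beta parameters below are arbitrary examples rather than fitted variational parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10                                   # truncation level
alpha = np.ones(K - 1)                   # illustrative Beta parameters
beta = np.full(K - 1, 3.0)

v = rng.beta(alpha, beta)                # v_k ~ Beta(alpha_k, beta_k)
remaining = np.concatenate(([1.0], np.cumprod(1 - v)))  # prod_{j<k} (1 - v_j)
pi = remaining * np.concatenate((v, [1.0]))             # last weight takes the remainder
assert np.isclose(pi.sum(), 1.0)         # the weights form a valid mixture
```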
The training optimizes the parameters of the base distributions, e.g. the parameters of a Gaussian-Wishart for a multivariate Gaussian likelihood. The scoring is then made by averaging the log-likelihood computed from each mixture of likelihoods sampled from the conjugate priors. The algorithm is applied to intrusion detection on the KDD 99 dataset in [8] and outperforms SVM and KNN algorithms. The cluster centroids provided by the model can also be valuable to an end-user as they represent the average nominal data points.
Kernel density estimators (KDE), also called Parzen windows estimators, approximate the density function of a dataset by assigning a kernel function to each data point, the estimated density being the sum of the local contributions of the kernels. A bandwidth parameter acts as a smoothing parameter on the density shape and can be estimated by methods such as Least-Squares Cross-Validation (LSCV). As shown in [28], KDE methods are efficient when applied to novelty detection problems. However, these approaches are sensitive to outliers and struggle in finding a good estimate of the nominal data density in datasets contaminated by outliers. This issue is shown by Kim et al. in [13], where the authors describe a Robust Kernel Density Estimator (RKDE) algorithm which uses M-estimation methods, such as robust loss functions, to overcome the limitations of plain KDE.
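A minimal KDE scoring sketch, assuming scikit-learn; the bandwidth is selected here by cross-validated log-likelihood as a stand-in for the LSCV method mentioned above, and all values are illustrative:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 3))

# Pick the bandwidth by cross-validated log-likelihood (not LSCV).
grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.logspace(-1, 1, 10)}, cv=5)
grid.fit(X_train)

kde = grid.best_estimator_
scores = kde.score_samples(X_train)   # log-density; low values suggest outliers
```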
Probabilistic principal component analysis (PPCA) [29] is a latent variable model which estimates the principal components of the data. It allows for the projection of a d-dimensional observation vector y to a k-dimensional vector of latent variables x, with k the number of components of the model. The relationship y = Wx + μ + ε is trained by expectation-maximization. The authors suggest to use the log-likelihood as a degree of novelty for new data points.

More recently, Least-squares anomaly detection (LSA) [25], developed by Quinn et al., extends a multi-class probabilistic classifier to a one-class problem. The approach is compared against KNN and One-class SVM using the area under the ROC curve.
2.2. Distance-based methods
This class of methods uses solely the distance space to flag outliers. As such, the Mahalanobis distance is suitable for anomaly detection tasks targeting multivariate datasets composed of a single Gaussian-shaped cluster [2]. The model parameters are the mean and inverse covariance matrix of the data, thus similar to a one-component GMM with a full covariance matrix.
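A plain NumPy sketch of Mahalanobis-distance scoring under the single-cluster assumption described above (the pseudo-inverse is an example robustness choice of this sketch, not part of the paper's setup):

```python
import numpy as np

def mahalanobis_scores(X_train, X_test):
    """Return the Mahalanobis distance of each test point to the training mean."""
    mean = X_train.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X_train, rowvar=False))
    diff = X_test - mean
    # Quadratic form diff^T * cov_inv * diff, computed row by row.
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
```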
2.3. Neighbor-based methods
Neighbor-based methods study the neighborhood of each data point to identify outliers. Local outlier factor (LOF), described in [4], is a well-known distance-based approach corresponding to this description. For a given data point x, LOF computes its degree d_k(x) of being an outlier based on the Euclidean distance d between x and its kth closest neighbor x_k, which gives d_k(x) = d(x, x_k). The scoring of x also takes into account, for each of its neighbors n_i, the maximum between d_k(n_i) and d(x, n_i). As shown in [7], LOF outperforms Angle-Based Outlier Detection [16] and One-class SVM [26] when applied on real-world datasets for outlier detection, which makes it a good candidate for this benchmark.
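A short LOF sketch, assuming scikit-learn; `novelty=True` allows scoring unseen test data, matching the train/test protocol used later in this study, and the neighborhood size is an illustrative choice:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))
X_test = rng.normal(size=(100, 4))

lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(X_train)

scores = lof.score_samples(X_test)   # lower scores indicate likelier outliers
labels = lof.predict(X_test)         # -1 for predicted outliers, 1 for inliers
```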
Angle-Based Outlier Detection (ABOD) [16] uses the radius and variance of angles measured at each input vector, instead of distances, to identify outliers. The motivation is here to remain efficient in high-dimensional space and to be less sensitive to the curse of dimensionality. Given an input point x, ABOD samples several pairs of points and computes the corresponding angles at x and their variance. Broad angles imply that x is located inside a major cluster as it is surrounded by many data points, while small angles denote that x is positioned far from most points in the dataset. Similarly, a higher variance will be observed for points inside or at the border of a cluster than for outliers. The authors show that their method outperforms LOF on synthetic datasets containing more than 50 features. According to the authors, the pairs of vectors can be built from the entire dataset, a random subset or the k-nearest neighbors in order to speed up the computation at the cost of lower outlier detection performance.
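The angle-variance idea can be sketched in a few lines of NumPy; this simplified example samples random pairs (which may occasionally repeat points) and ignores the distance weighting used by the full ABOD score:

```python
import numpy as np

def angle_variance(x, X, n_pairs=200, rng=None):
    """Variance of angles at x formed with random pairs of points from X.

    Simplified ABOD-style score: a low variance suggests x is an outlier.
    """
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(len(X), size=(n_pairs, 2))
    a, b = X[idx[:, 0]] - x, X[idx[:, 1]] - x
    cos = np.einsum("ij,ij->i", a, b) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12)
    return np.var(np.arccos(np.clip(cos, -1.0, 1.0)))
```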
The Subspace outlier detection (SOD) [15] algorithm finds for each point p a set of m reference points, e.g. its k-nearest neighbors. The outlier score is then the standard deviation of p from the mean of a given subspace, which is composed of a subset of dimensions. The attributes having a small variance for the set of m points are selected to be part of the subspace.
2.4. Information theoretic methods
The Kullback-Leibler (KL) divergence was used as an information theoretic measure for novelty detection. The method first trains a Gaussian mixture model on a training set, then estimates the information content of new data points by measuring the KL divergence between the estimated density and the density estimated on the training set and the new point. This reduces to an F-test in the case of a single Gaussian.
2.5. Neural networks
In [19], Marsland et al. propose a reconstruction-based nonparametric neural network called the Grow When Required (GWR) network. This method is based on Kohonen networks, also called Self-Organizing Maps (SOM) [14], and fits a graph of adaptive topology lying in the input space to a dataset. While training the network, nodes and edges are added or removed in order to best fit the data, the objective being to end up with nodes positioned in all dense data regions while edges propagate the displacement of neighboring nodes.

Outliers from artificial datasets are detected using fixed-topology SOM in an experimental work [21]. The paper uses two thresholds t1 and t2 to identify data points having their closest node further than t1, or projecting on an outlying node, i.e. a neuron having a median interneuron distance (MID) higher than t2. The MID of each neuron is computed by taking the median of the distance between a neuron and its 8 neighbors in a network following a 2-dimensional grid topology. Severe outliers and dense clouds of outliers are correctly identified with this technique, though some nominal samples can be mistaken as mild outliers.
2.6. Domain-based methods
Additional methods for outlying data identification rely on the construction of a boundary separating the nominal data from the rest of the input space, thus estimating the domain of the nominal class. Any data point falling outside of the delimited boundary is thus flagged as an outlier.

One-class SVM [26], an application of support vector machine (SVM) algorithms to one-class problems, belongs to this class of algorithms. The method computes a separating hyperplane in a high-dimensional space induced by kernels performing dot products between points from the input space in high-dimensional space. The boundary is fitted to the input data by maximizing the margin between the data and the origin in the high-dimensional space. The algorithm prevents overfitting by allowing a percentage ν of data points to fall outside the boundary. This percentage ν acts as a regularization parameter; it is used as a lower bound on the fraction of support vectors delimiting the boundary and as an upper bound on the fraction of margin errors, i.e. training points remaining outside the boundary.
The experiment of the original paper targets mostly novelty detection, i.e. anomaly detection using a model trained on a dataset free of anomalies. Our benchmark uses contaminated datasets to assess the algorithm's robustness, with a regularization parameter significantly higher than the expected proportion of outliers.
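A minimal One-class SVM sketch, assuming scikit-learn; as discussed above, `nu` is set noticeably higher than the expected outlier proportion, though the exact values here are illustrative:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))   # in practice, contaminated training data
X_test = rng.normal(size=(100, 4))

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.2)
ocsvm.fit(X_train)

scores = ocsvm.decision_function(X_test)  # negative values fall outside the boundary
labels = ocsvm.predict(X_test)            # -1 for outliers, 1 for inliers
```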
2.7. Isolation methods
We include an isolation algorithm in this study, which focuses on separating outliers from the rest of the data points. This method differs from the previous methods as it isolates anomalies instead of profiling normal points.

The concept of Isolation Forest was brought by Liu in [17] and uses random forests to compute an isolation score for each data point. The model is built by performing recursive random splits on attribute values, hence generating trees able to isolate any data point from the rest of the data. The score of a point is then the average path length from the root of the tree to the node containing the single point, a short path denoting a point easy to isolate due to attribute values significantly different from nominal values.

The author states that his algorithm provides linear time complexity and demonstrates outlier detection performance significantly better than LOF on real-world datasets.
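A minimal Isolation Forest sketch, assuming scikit-learn; the number of trees and the contamination level are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))
X_test = rng.normal(size=(100, 4))

iforest = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
iforest.fit(X_train)

scores = iforest.score_samples(X_test)  # more negative = easier to isolate
labels = iforest.predict(X_test)        # -1 for predicted outliers, 1 for inliers
```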
3. Experimental evaluation
3.1. Metrics
We evaluate the performance of the outlier detection algorithms by comparing two metrics based on real-world labeled datasets detailed in section 3.2. For this purpose, we use the receiver operating characteristic (ROC) curve (true positive rate against false positive rate) and the precision-recall (PR) curve (equations 2 and 3), where the positive class represents the anomalies and the negative class represents the nominal samples. The comparison is then based on the area under the curve (AUC) of both metrics, respectively the ROC AUC and the average precision (AP).
\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \qquad (2)

\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \qquad (3)
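Both summary metrics can be computed, for instance, with scikit-learn, treating anomalies as the positive class; the toy labels and scores below are purely illustrative:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# y_true: 1 for anomalies (positive class), 0 for nominal samples.
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0])
# anomaly_score: higher means "more anomalous" (e.g. a negated normality score).
anomaly_score = np.array([0.1, 0.3, 0.2, 0.9, 0.4, 0.7, 0.2, 0.1])

roc_auc = roc_auc_score(y_true, anomaly_score)       # area under the ROC curve
ap = average_precision_score(y_true, anomaly_score)  # area under the PR curve (AP)
```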
3.2. Datasets
Our evaluation uses 15 datasets ranging from 723 to 20,000 samples and containing from 6 to 107 features. Of those datasets, 12 are publicly available in the UCI or OpenML repositories, while the 3 remaining datasets are novel proprietary datasets containing production data from the company Amadeus. Table 1 gives an overview of the datasets' characteristics. Our study assesses if the models are able to generalize to future datasets, which is a novel approach in outlier detection works. This requires that algorithms support unseen testing data, and is achieved by performing a Monte Carlo cross-validation of 5 iterations, using 80% of the data for the training phase and 20% for the prediction. Training and testing datasets contain the same proportion of outliers, and ROC AUC and AP are measured based on the predictions made. For 7 of the publicly available datasets, the outlier classes are selected according to the recommendations made in [7], which are based on extensive dataset comparisons. However, the cited experiment discards all categorical data, while we keep those features and perform one-hot encoding to binarize them, keeping all information from the dataset at the cost of a higher dimensionality.
Table 1: UCI, OpenML and proprietary datasets benchmarked. (# categ. dims) is the average number of binarized features obtained after transformation of the categoricals.

Columns: Dataset | Nominal class | Anomaly class | Numeric dims | Categ. dims (# binarized) | Samples | Anomalies

(Per-dataset values omitted: not recoverable from the extracted text.)

Notes: subsets of the original datasets are used, with the same proportion of outliers; anomalies are sampled from the corresponding class, using the average percentage of outliers depicted in [7]; the first feature, corresponding to the protein name, was discarded.
Normalization is further achieved by centering numerical features to the mean and scaling them to unit variance.
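A sketch of this preprocessing and of the 80/20 split, assuming scikit-learn and pandas; the column names and toy data are illustrative, and stratification by outlier label is omitted for brevity:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative dataframe with numerical and categorical features.
df = pd.DataFrame({"num_a": [1.0, 2.5, 3.1, 0.2],
                   "num_b": [10, 20, 15, 40],
                   "cat_a": ["x", "y", "x", "z"]})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["num_a", "num_b"]),               # center and scale
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat_a"]),  # binarize categoricals
])

train, test = train_test_split(df, test_size=0.2, random_state=0)  # 80/20 split
X_train = preprocess.fit_transform(train)   # fit on training data only
X_test = preprocess.transform(test)
```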
The three following datasets contain production data collected by Amadeus, a Global Distribution System (GDS) providing online platforms to connect the travel industry. This company manages almost half of the flight bookings worldwide and is targeted by fraud attempts leading to revenue losses and indemnifications. The datasets do not contain information traceable to any specific individuals.
The PASSENGER NAME RECORDS (PNR) dataset contains booking records, mostly flight and train bookings, containing 5 types of frauds labeled by fraud experts. The features in this dataset describe the most important changes applied to a booking from its creation to its deletion. Features include time-based information, e.g. age of a PNR, percentage of cancelled flight segments or passengers, and means and standard deviations of counters, e.g. number of passenger modifications, frequent traveller cards, special service requests (additional luggage, special seat or meal), or forms of payment.
The TRANSACTIONS dataset is extracted from a Web application used to perform bookings. It focuses on user sessions, which are compared to identify bots and malicious users. Examples of features are the number of distinct IPs and organizational offices used by a user, the session duration, and means and standard deviations applied to the number of bookings and number of actions. The most critical actions are also monitored.
The SHARED-ACCESS dataset was generated by a backend application used to manage shared rights between different entities. It enables an entity to grant specific reading (e.g. booking retrieval, seat map display) or writing (e.g. cruise distribution) rights to another entity. Monitoring the actions made with this application should prevent data leaks and sensitive rights transfers. For each user account, features include the average number of actions performed per session and time unit, the average and standard deviation for some critical actions per session, and the targeted rights modified.
3.3. Datasets for scalability tests
As the choice of an outlier detection algorithm may not only be limited to its accuracy but is often subject to computational constraints, our experiment includes training time, prediction time, memory usage and noise resistance (through precision-recall measurements) of each algorithm on synthetic datasets.

For these scalability tests, we generate datasets of different sizes containing a fixed proportion of background noise. The datasets range from 10 samples to 10 million samples and from 2 to 10,000 numerical features. We also keep the number of features and samples fixed while increasing the proportion of background noise from 0% to 90% to perform robustness measurements. The experiment is repeated 5 times, using the same dataset for training and testing. We allow up to 24 hours for training or prediction steps to be completed and stop the computation after this period of time. Experiments reaching a timeout or requiring more than 256 GB of RAM do not report memory usage nor robustness in section 4.
In order to avoid advantaging some algorithms over others, the datasets are generated using two Student's t distributions. The distributions are respectively centered in (0, 0, ..., 0) and (5, 5, ..., 5), while the covariance matrices are computed using C_ij = ρ^{|i-j|}, where C_ij is entry (i, j) of the matrix. The parameter ρ follows a uniform distribution ρ ~ U(0, 1) and the degrees of freedom parameter follows a Gamma distribution ν ~ Γ(1, 5). We then add outliers uniformly sampled from a hypercube 7 times bigger than the standard deviation of the nominal data.
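A sketch of such a generator, assuming NumPy; the multivariate Student's t draw is obtained here by scaling a Gaussian sample with a chi-square factor, and the constants below are illustrative choices rather than the paper's exact setup:

```python
import numpy as np

def make_synthetic(n_samples=1000, n_features=10, noise=0.1, seed=0):
    """Two Student's t clusters plus uniformly sampled background noise."""
    rng = np.random.default_rng(seed)
    rho, dof = rng.uniform(0, 1), rng.gamma(1, 5) + 2  # +2 keeps variance finite (example choice)
    idx = np.arange(n_features)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])   # C_ij = rho^{|i-j|}

    n_noise = int(noise * n_samples)
    n_clust = n_samples - n_noise
    centers = np.array([np.zeros(n_features), np.full(n_features, 5.0)])

    # Multivariate Student's t: Gaussian draw divided by sqrt(chi2 / dof).
    gauss = rng.multivariate_normal(np.zeros(n_features), cov, size=n_clust)
    scale = np.sqrt(rng.chisquare(dof, size=(n_clust, 1)) / dof)
    X = centers[rng.integers(2, size=n_clust)] + gauss / scale

    # Outliers sampled uniformly from a hypercube 7x the nominal standard deviation.
    half_width = 7 * X.std()
    noise_pts = rng.uniform(-half_width, half_width, size=(n_noise, n_features))
    return np.vstack([X, noise_pts])
```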
3.4. Algorithm implementations and parameters
Most implementations used in this experiment are publicly available. Table 2 details the programming languages and initialization parameters selected. A majority of methods have flexible parameters and perform very well without extensive tuning. The Matlab Engine API for Python and the rpy2 library allow us to call Matlab and R code from Python.
DPGMM is our own implementation of a Dirichlet Process Mixture Model and follows the guidelines given in [3], where we place a Gamma prior on the scaling parameter. Making our own implementation of this algorithm is motivated by its capability of handling a wide range of probability distributions, including categorical distributions. We thus benchmark DPGMM, which uses only Gaussian distributions to model data and thus uses continuous and binarized features as all other algorithms, and DPMM, which uses a mixture of multivariate Gaussian/Categorical distributions, hence requiring fewer data transformations and working on a smaller number of features. This algorithm is the only one capable of using the raw string features from the datasets.

Note that DPGMM and DPMM converge to the same results when applied to non-categorical data, and that our DPGMM performs similarly to the corresponding scikit-learn implementation called Bayesian Gaussian Mixture (BGM).