Numerous machine learning methods are suitable for anomaly detection. However, supervised algorithms are more constraining than unsupervised methods as they need to be provided with a labeled dataset. This requirement is particularly expensive when the labeling must be performed by humans. Dealing with a heavily imbalanced class distribution, which is inherent to anomaly detection, can also affect the efficiency of supervised algorithms [12]. This paper focuses on unsupervised machine learning algorithms to isolate outliers from nominal samples. In previous reviews, such methods were shown to be proficient for outlier and novelty detection [23, 33]. Unsupervised algorithms make use of unlabeled data to assign a score to each sample, representing the observation's normality. Binary segmentations can be further made by thresholding the scores.
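As an illustration of this scoring-and-thresholding step, the following sketch (not from the paper; a minimal example assuming NumPy and the convention that higher scores mean more normal observations) turns continuous scores into a binary segmentation:

```python
import numpy as np

def binarize_scores(scores, contamination=0.05):
    """Flag the lowest-scoring fraction of samples as outliers.

    `scores` are normality scores (higher = more normal); `contamination`
    is an assumed fraction of outliers used to place the threshold.
    """
    threshold = np.quantile(scores, contamination)
    return scores < threshold  # True marks a predicted outlier
```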
Outlier detection is a notoriously hard task: detecting anomalies can be difficult when they overlap with nominal clusters, and these clusters should be dense enough to build a reliable model. The problem of contamination, i.e. using an input dataset contaminated by outliers, makes this task even trickier, as anomalies may degrade the final model if the training algorithm lacks robustness. These issues, which are present in many real-world datasets, are not always addressed in works describing new unsupervised methods, as those algorithms may target a different use case. These reasons motivate the need for a thorough benchmark bringing together distinctive techniques on complex datasets.
Our paper extends previous works [7, 33] by using 12 publicly available labeled datasets, most of which are recommended for outlier detection in [7], in addition to 3 novel industrial datasets from the production environments of a major company in the travel industry. The benchmark made in [7] used fewer datasets and solely numerical features, while we benchmark and address ways to handle categorical data. While the previous works perform outlier detection on training data, our study tests for the generalization ability of all methods by predicting outliers on unseen testing data. The selected parametric and nonparametric algorithms come from a variety of approaches including probabilistic algorithms, nearest-neighbor based methods, neural networks, information theoretic and isolation methods. The performance on labeled datasets is compared with the area under the ROC and precision-recall curves.

In order to give a full overview of these methods, we also benchmark the training time, prediction time, memory usage and robustness of each method when increasing the number of samples, features and the background noise. These scalability measurements allow us to compare algorithms not only based on their outlier detection performance but also on their scalability, robustness and suitability for large dimensional problems.
The paper is organized as follows: section 2 introduces the research fields and methods targeted by the benchmark; section 3 details the experimental setup, the public and proprietary datasets used and the method implemented to generate synthetic datasets; section 4 presents the results of our benchmark; and section 5 summarizes our conclusions.
2. Outlier detection methods
The algorithms described in this section belong to a wide range of approaches. These methods build a model representing the nominal classes, i.e. dense clusters of similar data points, during a training phase. Online or batch predictions can thereafter be applied to new data based on the trained model to assign an anomaly score to the new observations. Applying a threshold on the returned scores provides a decision boundary separating nominal samples from outliers.

The evaluation presented in this study contains both parametric and nonparametric machine learning algorithms. While parametric approaches model the underlying data with a fixed number of parameters, the number of parameters of nonparametric methods is potentially infinite and can increase with the complexity of data. If the former are often computationally faster, they require assumptions about the data distribution, e.g. the number of clusters, and may result in a flawed model if based on erroneous assumptions. The latter make fewer assumptions about the data distribution and may thus generalize better, while requiring less knowledge about the data.
2.1. Probabilistic methods
Probabilistic algorithms estimate the probability density function of a dataset X by inferring the model parameters θ. Data points having the smallest likelihood P(X | θ) are identified as outliers. Most probabilistic methods described in this section can be trained incrementally, i.e. presenting new input data to an existing model will cause the model to adapt to the new data while retaining what was learned from the previous data.
The first algorithm used in our benchmark is the Gaussian Mixture Model (GMM), which fits a given number of Gaussian distributions to a dataset. The model is trained using the Expectation-Maximization (EM) algorithm [6], which maximizes a lower bound of the likelihood iteratively. This method has been successfully applied to identify suspicious and possibly cancerous masses in mammograms by novelty detection in [28]. However, assessing the number of components of the mixture by data exploration can be complex and motivates the use of nonparametric alternatives described hereafter.
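A minimal sketch of GMM-based outlier scoring, assuming scikit-learn; the number of components, the toy data and the 5% threshold are illustrative choices, not values from the paper:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))                    # nominal training data
X_test = np.vstack([rng.normal(size=(95, 4)),           # nominal test samples
                    rng.uniform(-8, 8, size=(5, 4))])   # injected outliers

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X_train)

log_likelihood = gmm.score_samples(X_test)              # per-sample log-likelihood
outliers = log_likelihood < np.quantile(log_likelihood, 0.05)
```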
In [3], Blei et al. describe the Dirichlet Process Mixture Model (DPMM), a nonparametric Bayesian algorithm which optimizes the model parameters and tests for convergence by monitoring a non-decreasing lower bound on the log marginal likelihood. The result is a mixture model where each component is a product of exponential-family distributions. Most likelihoods, e.g. Gaussian or Categorical, can be written in exponential-family form; in this formulation, it is possible to obtain suitable conjugate priors to facilitate inference.
The number of components is inferred during the training and requires an upper bound K. Weights π_k are represented by a Dirichlet Process modelled as a truncated stick-breaking process (equation 1). The variable v_k follows a Beta distribution, where α_k and β_k are variational parameters optimized during the training for each component:
\pi_k(v) = v_k \prod_{j=1}^{k-1} (1 - v_j), \qquad q_{\alpha,\beta}(v) = \prod_{k=1}^{K-1} \mathrm{Beta}(v_k \mid \alpha_k, \beta_k) \qquad (1)
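A small numerical illustration of the truncated stick-breaking construction of equation (1), assuming NumPy; the Beta parameters below are arbitrary examples rather than fitted variational parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10                                   # truncation level
alpha = np.ones(K - 1)                   # illustrative Beta parameters
beta = np.full(K - 1, 3.0)

v = rng.beta(alpha, beta)                # v_k ~ Beta(alpha_k, beta_k)
remaining = np.concatenate(([1.0], np.cumprod(1 - v)))  # prod_{j<k} (1 - v_j)
pi = remaining * np.concatenate((v, [1.0]))             # last weight takes the remainder
assert np.isclose(pi.sum(), 1.0)         # the weights form a valid mixture
```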
The training optimizes the parameters of the base distributions, e.g. the parameters of a Gaussian-Wishart for a multivariate Gaussian likelihood. The scoring is then made by averaging the log-likelihood computed from each mixture of likelihoods sampled from the conjugate priors. The algorithm is applied to intrusion detection on the KDD 99 dataset in [8] and outperforms SVM and KNN algorithms. The cluster centroids provided by the model can also be valuable to an end-user as they represent the average nominal data points.
Kernel density estimators (KDE), also called Parzen windows estimators, approximate the density function of a dataset by assigning a kernel function to each data point, the estimated density being the sum of the local contributions of the kernels. A bandwidth parameter acts as a smoothing parameter on the density shape and can be estimated by methods such as Least-Squares Cross-Validation (LSCV). As shown in [28], KDE methods are efficient when applied to novelty detection problems. However, these approaches are sensitive to outliers and struggle in finding a good estimate of the nominal data density in datasets contaminated by outliers. This issue is shown by Kim et al. in [13], where the authors describe a Robust Kernel Density Estimator (RKDE) algorithm which uses M-estimation methods, such as robust loss functions, to overcome the limitations of plain KDE.
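A minimal KDE scoring sketch, assuming scikit-learn; the bandwidth is selected here by cross-validated log-likelihood as a stand-in for the LSCV method mentioned above, and all values are illustrative:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 3))

# Pick the bandwidth by cross-validated log-likelihood (not LSCV).
grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.logspace(-1, 1, 10)}, cv=5)
grid.fit(X_train)

kde = grid.best_estimator_
scores = kde.score_samples(X_train)   # log-density; low values suggest outliers
```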
Probabilistic principal component analysis (PPCA) [29] is a latent variable model which estimates the principal components of the data. It allows for the projection of a d-dimensional observation vector y to a k-dimensional vector of latent variables x, with k the number of components of the model. The relationship y = Wx + μ + ε is trained by expectation-maximization. The authors suggest to use the log-likelihood as a degree of novelty for new data points.

More recently, Least-squares anomaly detection (LSA) [25], developed by Quinn et al., extends a multi-class probabilistic classifier to a one-class problem. The approach is compared against KNN and One-class SVM using the area under the ROC curve.
2.2. Distance-based methods
This class of methods uses solely the distance space to flag outliers. As such, the Mahalanobis distance is suitable for anomaly detection tasks targeting multivariate datasets composed of a single Gaussian-shaped cluster [2]. The model parameters are the mean and inverse covariance matrix of the data, thus similar to a one-component GMM with a full covariance matrix.
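A plain NumPy sketch of Mahalanobis-distance scoring under the single-cluster assumption described above (the pseudo-inverse is an example robustness choice of this sketch, not part of the paper's setup):

```python
import numpy as np

def mahalanobis_scores(X_train, X_test):
    """Return the Mahalanobis distance of each test point to the training mean."""
    mean = X_train.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X_train, rowvar=False))
    diff = X_test - mean
    # Quadratic form diff^T * cov_inv * diff, computed row by row.
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
```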
2.3. Neighbor-based methods
Neighbor-based methods study the neighborhood of each data point to identify outliers. Local outlier factor (LOF), described in [4], is a well-known distance-based approach corresponding to this description. For a given data point x, LOF computes its degree d_k(x) of being an outlier based on the Euclidean distance d between x and its kth closest neighbor x_k, which gives d_k(x) = d(x, x_k). The scoring of x also takes into account, for each of its neighbors n_i, the maximum between d_k(n_i) and d(x, n_i). As shown in [7], LOF outperforms Angle-Based Outlier Detection [16] and One-class SVM [26] when applied on real-world datasets for outlier detection, which makes it a good candidate for this benchmark.
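A short LOF sketch, assuming scikit-learn; `novelty=True` allows scoring unseen test data, matching the train/test protocol used later in this study, and the neighborhood size is an illustrative choice:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))
X_test = rng.normal(size=(100, 4))

lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(X_train)

scores = lof.score_samples(X_test)   # lower scores indicate likelier outliers
labels = lof.predict(X_test)         # -1 for predicted outliers, 1 for inliers
```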
Angle-Based Outlier Detection (ABOD) [16] uses the radius and variance of angles measured at each input vector, instead of distances, to identify outliers. The motivation is here to remain efficient in high-dimensional space and to be less sensitive to the curse of dimensionality. Given an input point x, ABOD samples several pairs of points and computes the corresponding angles at x and their variance. Broad angles imply that x is located inside a major cluster as it is surrounded by many data points, while small angles denote that x is positioned far from most points in the dataset. Similarly, a higher variance will be observed for points inside or at the border of a cluster than for outliers. The authors show that their method outperforms LOF on synthetic datasets containing more than 50 features. According to the authors, the pairs of vectors can be built from the entire dataset, a random subset or the k-nearest neighbors in order to speed up the computation at the cost of lower outlier detection performance.
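The angle-variance idea can be sketched in a few lines of NumPy; this simplified example samples random pairs (which may occasionally repeat points) and ignores the distance weighting used by the full ABOD score:

```python
import numpy as np

def angle_variance(x, X, n_pairs=200, rng=None):
    """Variance of angles at x formed with random pairs of points from X.

    Simplified ABOD-style score: a low variance suggests x is an outlier.
    """
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(len(X), size=(n_pairs, 2))
    a, b = X[idx[:, 0]] - x, X[idx[:, 1]] - x
    cos = np.einsum("ij,ij->i", a, b) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12)
    return np.var(np.arccos(np.clip(cos, -1.0, 1.0)))
```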
The Subspace outlier detection (SOD) [15] algorithm finds for each point p a set of m reference points, e.g. its k-nearest neighbors. The outlier score is then the standard deviation of p from the mean of a given subspace, which is composed of a subset of dimensions. The attributes having a small variance for the set of m points are selected to be part of the subspace.
2.4. Information theoretic methods
The Kullback-Leibler (KL) divergence was used as an information theoretic measure for novelty detection. The method first trains a Gaussian mixture model on a training set, then estimates the information content of new data points by measuring the KL divergence between the estimated density and the density estimated on the training set and the new point. This reduces to an F-test in the case of a single Gaussian.
2.5. Neural networks
In [19], Marsland et al. propose a reconstruction-based nonparametric neural network called the Grow When Required (GWR) network. This method is based on Kohonen networks, also called Self-Organizing Maps (SOM) [14], and fits a graph of adaptive topology lying in the input space to a dataset. While training the network, nodes and edges are added or removed in order to best fit the data, the objective being to end up with nodes positioned in all dense data regions while edges propagate the displacement of neighboring nodes.

Outliers from artificial datasets are detected using fixed-topology SOM in an experimental work [21]. The paper uses two thresholds t1 and t2 to identify data points having their closest node further than t1, or projecting on an outlying node, i.e. a neuron having a median interneuron distance (MID) higher than t2. The MID of each neuron is computed by taking the median of the distance between a neuron and its 8 neighbors in a network following a 2-dimensional grid topology. Severe outliers and dense clouds of outliers are correctly identified with this technique, though some nominal samples can be mistaken as mild outliers.
2.6. Domain-based methods
Additional methods for outlying data identification rely on the construction of a boundary separating the nominal data from the rest of the input space, thus estimating the domain of the nominal class. Any data point falling outside of the delimited boundary is thus flagged as an outlier.

One-class SVM [26], an application of support vector machine (SVM) algorithms to one-class problems, belongs to this class of algorithms. The method computes a separating hyperplane in a high-dimensional space induced by kernels performing dot products between points from the input space in high-dimensional space. The boundary is fitted to the input data by maximizing the margin between the data and the origin in the high-dimensional space. The algorithm prevents overfitting by allowing a percentage ν of data points to fall outside the boundary. This percentage ν acts as a regularization parameter; it is used as a lower bound on the fraction of support vectors delimiting the boundary and as an upper bound on the fraction of margin errors, i.e. training points remaining outside the boundary.
The experiment of the original paper targets mostly novelty detection, i.e. anomaly detection using a model trained on a dataset free of anomalies. Our benchmark uses contaminated datasets to assess the algorithm's robustness, with a regularization parameter significantly higher than the expected proportion of outliers.
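A minimal One-class SVM sketch, assuming scikit-learn; as discussed above, `nu` is set noticeably higher than the expected outlier proportion, though the exact values here are illustrative:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))   # in practice, contaminated training data
X_test = rng.normal(size=(100, 4))

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.2)
ocsvm.fit(X_train)

scores = ocsvm.decision_function(X_test)  # negative values fall outside the boundary
labels = ocsvm.predict(X_test)            # -1 for outliers, 1 for inliers
```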
2.7. Isolation methods
We include an isolation algorithm in this study, which focuses on separating outliers from the rest of the data points. This method differs from the previous methods as it isolates anomalies instead of profiling normal points.

The concept of Isolation Forest was brought by Liu in [17] and uses random forests to compute an isolation score for each data point. The model is built by performing recursive random splits on attribute values, hence generating trees able to isolate any data point from the rest of the data. The score of a point is then the average path length from the root of the tree to the node containing the single point, a short path denoting a point easy to isolate due to attribute values significantly different from nominal values.

The author states that his algorithm provides linear time complexity and demonstrates outlier detection performance significantly better than LOF on real-world datasets.
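A minimal Isolation Forest sketch, assuming scikit-learn; the number of trees and the contamination level are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))
X_test = rng.normal(size=(100, 4))

iforest = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
iforest.fit(X_train)

scores = iforest.score_samples(X_test)  # more negative = easier to isolate
labels = iforest.predict(X_test)        # -1 for predicted outliers, 1 for inliers
```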
3. Experimental evaluation
3.1. Metrics
We evaluate the performance of the outlier detection algorithms by comparing two metrics based on real-world labeled datasets detailed in section 3.2. For this purpose, we use the receiver operating characteristic (ROC) curve (true positive rate against false positive rate) and the precision-recall (PR) curve (equations 2 and 3), where the positive class represents the anomalies and the negative class represents the nominal samples. The comparison is then based on the area under the curve (AUC) of both metrics, respectively the ROC AUC and the average precision (AP).
\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \qquad (2)

\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \qquad (3)
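Both summary metrics can be computed, for instance, with scikit-learn, treating anomalies as the positive class; the toy labels and scores below are purely illustrative:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# y_true: 1 for anomalies (positive class), 0 for nominal samples.
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0])
# anomaly_score: higher means "more anomalous" (e.g. a negated normality score).
anomaly_score = np.array([0.1, 0.3, 0.2, 0.9, 0.4, 0.7, 0.2, 0.1])

roc_auc = roc_auc_score(y_true, anomaly_score)       # area under the ROC curve
ap = average_precision_score(y_true, anomaly_score)  # area under the PR curve (AP)
```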
3.2. Datasets
Our evaluation uses 15 datasets ranging from 723 to 20,000 samples and containing from 6 to 107 features. Of those datasets, 12 are publicly available in the UCI or OpenML repositories, while the 3 remaining datasets are novel proprietary datasets containing production data from the company Amadeus. Table 1 gives an overview of the datasets' characteristics. Our study assesses if the models are able to generalize to future datasets, which is a novel approach in outlier detection works. This requires that algorithms support unseen testing data, and is achieved by performing a Monte Carlo cross-validation of 5 iterations, using 80% of the data for the training phase and 20% for the prediction. Training and testing datasets contain the same proportion of outliers, and ROC AUC and AP are measured based on the predictions made. For 7 of the publicly available datasets, the outlier classes are selected according to the recommendations made in [7], which are based on extensive dataset comparisons. However, the cited experiment discards all categorical data, while we keep those features and perform one-hot encoding to binarize them, keeping all information from the dataset at the cost of a higher dimensionality.
Table 1: UCI, OpenML and proprietary datasets benchmarked. (# categ. dims) is the average number of binarized features obtained after transformation of the categoricals.

Columns: Dataset | Nominal class | Anomaly class | Numeric dims | Categ. dims (# binarized) | Samples | Anomalies

(Per-dataset values omitted: not recoverable from the extracted text.)

Notes: subsets of the original datasets are used, with the same proportion of outliers; anomalies are sampled from the corresponding class, using the average percentage of outliers depicted in [7]; the first feature, corresponding to the protein name, was discarded.
Normalization is further achieved by centering numerical features to the mean and scaling them to unit variance.
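A sketch of this preprocessing and of the 80/20 split, assuming scikit-learn and pandas; the column names and toy data are illustrative, and stratification by outlier label is omitted for brevity:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative dataframe with numerical and categorical features.
df = pd.DataFrame({"num_a": [1.0, 2.5, 3.1, 0.2],
                   "num_b": [10, 20, 15, 40],
                   "cat_a": ["x", "y", "x", "z"]})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["num_a", "num_b"]),               # center and scale
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat_a"]),  # binarize categoricals
])

train, test = train_test_split(df, test_size=0.2, random_state=0)  # 80/20 split
X_train = preprocess.fit_transform(train)   # fit on training data only
X_test = preprocess.transform(test)
```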
The three following datasets contain production data collected by Amadeus, a Global Distribution System (GDS) providing online platforms to connect the travel industry. This company manages almost half of the flight bookings worldwide and is targeted by fraud attempts leading to revenue losses and indemnifications. The datasets do not contain information traceable to any specific individuals.
The PASSENGER NAME RECORDS (PNR) dataset contains booking records, mostly flight and train bookings, containing 5 types of frauds labeled by fraud experts. The features in this dataset describe the most important changes applied to a booking from its creation to its deletion. Features include time-based information, e.g. age of a PNR, percentage of cancelled flight segments or passengers, and means and standard deviations of counters, e.g. number of passenger modifications, frequent traveller cards, special service requests (additional luggage, special seat or meal), or forms of payment.
The TRANSACTIONS dataset is extracted from a Web application used to perform bookings. It focuses on user sessions, which are compared to identify bots and malicious users. Examples of features are the number of distinct IPs and organizational offices used by a user, the session duration, and means and standard deviations applied to the number of bookings and number of actions. The most critical actions are also monitored.
The SHARED-ACCESS dataset was generated by a backend application used to manage shared rights between different entities. It enables an entity to grant specific reading (e.g. booking retrieval, seat map display) or writing (e.g. cruise distribution) rights to another entity. Monitoring the actions made with this application should prevent data leaks and sensitive rights transfers. For each user account, features include the average number of actions performed per session and time unit, the average and standard deviation for some critical actions per session, and the targeted rights modified.
3.3. Datasets for scalability tests
As the choice of an outlier detection algorithm may not only be limited to its accuracy but is often subject to computational constraints, our experiment includes training time, prediction time, memory usage and noise resistance (through precision-recall measurements) of each algorithm on synthetic datasets.

For these scalability tests, we generate datasets of different sizes containing a fixed proportion of background noise. The datasets range from 10 samples to 10 million samples and from 2 to 10,000 numerical features. We also keep the number of features and samples fixed while increasing the proportion of background noise from 0% to 90% to perform robustness measurements. The experiment is repeated 5 times, using the same dataset for training and testing. We allow up to 24 hours for training or prediction steps to be completed and stop the computation after this period of time. Experiments reaching a timeout or requiring more than 256 GB of RAM do not report memory usage nor robustness in section 4.
In order to avoid advantaging some algorithms over others, the datasets are generated using two Student's t distributions. The distributions are respectively centered in (0, 0, ..., 0) and (5, 5, ..., 5), while the covariance matrices are computed using C_ij = ρ^{|i-j|}, where C_ij is entry (i, j) of the matrix. The parameter ρ follows a uniform distribution ρ ~ U(0, 1) and the degrees of freedom parameter follows a Gamma distribution ν ~ Γ(1, 5). We then add outliers uniformly sampled from a hypercube 7 times bigger than the standard deviation of the nominal data.
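A sketch of such a generator, assuming NumPy; the multivariate Student's t draw is obtained here by scaling a Gaussian sample with a chi-square factor, and the constants below are illustrative choices rather than the paper's exact setup:

```python
import numpy as np

def make_synthetic(n_samples=1000, n_features=10, noise=0.1, seed=0):
    """Two Student's t clusters plus uniformly sampled background noise."""
    rng = np.random.default_rng(seed)
    rho, dof = rng.uniform(0, 1), rng.gamma(1, 5) + 2  # +2 keeps variance finite (example choice)
    idx = np.arange(n_features)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])   # C_ij = rho^{|i-j|}

    n_noise = int(noise * n_samples)
    n_clust = n_samples - n_noise
    centers = np.array([np.zeros(n_features), np.full(n_features, 5.0)])

    # Multivariate Student's t: Gaussian draw divided by sqrt(chi2 / dof).
    gauss = rng.multivariate_normal(np.zeros(n_features), cov, size=n_clust)
    scale = np.sqrt(rng.chisquare(dof, size=(n_clust, 1)) / dof)
    X = centers[rng.integers(2, size=n_clust)] + gauss / scale

    # Outliers sampled uniformly from a hypercube 7x the nominal standard deviation.
    half_width = 7 * X.std()
    noise_pts = rng.uniform(-half_width, half_width, size=(n_noise, n_features))
    return np.vstack([X, noise_pts])
```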
3.4. Algorithm implementations and parameters
Most implementations used in this experiment are publicly available. Table 2 details the programming languages and initialization parameters selected. A majority of methods have flexible parameters and perform very well without extensive tuning. The Matlab Engine API for Python and the rpy2 library allow us to call Matlab and R code from Python.
DPGMM is our own implementation of a Dirichlet Process Mixture Model and follows the guidelines given in [3], where we place a Gamma prior on the scaling parameter. Making our own implementation of this algorithm is motivated by its capability of handling a wide range of probability distributions, including categorical distributions. We thus benchmark DPGMM, which uses only Gaussian distributions to model data and thus uses continuous and binarized features as all other algorithms, and DPMM, which uses a mixture of multivariate Gaussian/Categorical distributions, hence requiring fewer data transformations and working on a smaller number of features. This algorithm is the only one capable of using the raw string features from the datasets.

Note that DPGMM and DPMM converge to the same results when applied to non-categorical data, and that our DPGMM performs similarly to the corresponding scikit-learn implementation called Bayesian Gaussian Mixture (BGM).