Development tools:
File size: 957 KB
Downloads: 0
Uploaded: 2019-07-07
Description: This book covers the algorithms commonly used for anomaly detection. PDF downloadable from http://rd.springer.com/book/10.1007/978-3-319-47578-3
To my wife, my daughter Sayani
and my late parents Dr. Prem Sarup and Mrs. Pushplata Aggarwal
Contents
1 An Introduction to Outlier Analysis
1.1 Introduction
1.2 The Data Model is Everything
1.2.1 Connections with Supervised Models
1.3 The Basic Outlier Detection Models
1.3.1 Feature Selection in Outlier Detection
1.3.2 Extreme-Value Analysis
1.3.3 Probabilistic and Statistical Models
1.3.4 Linear Models
1.3.4.1 Spectral Models
1.3.5 Proximity-Based Models
1.3.6 Information-Theoretic Models
1.3.7 High-Dimensional Outlier Detection
1.4 Outlier Ensembles
1.4.1 Sequential Ensembles
1.4.2 Independent Ensembles
1.5 The Basic Data Types for Analysis
1.5.1 Categorical, Text, and Mixed Attributes
1.5.2 When the Data Values have Dependencies
1.5.2.1 Time-Series Data and Data Streams
1.5.2.2 Discrete Sequences
1.5.2.3 Spatial Data
1.5.2.4 Network and Graph Data
1.6 Supervised Outlier Detection
1.7 Outlier Evaluation Techniques
1.7.1 Interpreting the ROC AUC
1.7.2 Common Mistakes in Benchmarking
1.8 Conclusions and Summary
1.9 Bibliographic Survey
1.10 Exercises
2 Probabilistic Models for Outlier Detection
2.1 Introduction
2.2 Statistical Methods for Extreme-Value Analysis
2.2.1 Probabilistic Tail Inequalities
2.2.1.1 Sum of Bounded Random Variables
2.2.2 Statistical-Tail Confidence Tests
2.2.2.1 t-Value Test
2.2.2.2 Sum of Squares of Deviations
2.2.2.3 Visualizing Extreme Values with Box Plots
2.3 Extreme-Value Analysis in Multivariate Data
2.3.1 Depth-Based Methods
2.3.2 Deviation-Based Methods
2.3.3 Angle-Based Outlier Detection
2.3.4 Distance Distribution-based Techniques: The Mahalanobis Method
2.3.4.1 Strengths of the Mahalanobis Method
2.4 Probabilistic Mixture Modeling for Outlier Analysis
2.4.1 Relationship with Clustering Methods
2.4.2 The Special Case of a Single Mixture Component
2.4.3 Other Ways of Leveraging the EM Model
2.4.4 An Application of EM for Converting Scores to Probabilities
2.5 Limitations of Probabilistic Modeling
2.6 Conclusions and Summary
2.7 Bibliographic Survey
2.8 Exercises
3 Linear Models for Outlier Detection
3.1 Introduction
3.2 Linear Regression Models
3.2.1 Modeling with Dependent Variables
3.2.1.1 Applications of Dependent Variable Modeling
3.2.2 Linear Modeling with Mean-Squared Projection Error
3.3 Principal Component Analysis
3.3.1 Connections with the Mahalanobis Method
3.3.2 Hard PCA versus Soft PCA
3.3.3 Sensitivity to Noise
3.3.4 Normalization Issues
3.3.5 Regularization Issues
3.3.6 Applications to Noise Correction
3.3.7 How Many Eigenvectors?
3.3.8 Extension to Nonlinear Data Distributions
3.3.8.1 Choice of Similarity Matrix
3.3.8.2 Practical Issues
3.3.8.3 Application to Arbitrary Data Types
3.4 One-Class Support Vector Machines
3.4.1 Solving the Dual Optimization Problem
3.4.2 Practical Issues
3.4.3 Connections to Support Vector Data Description and Other Kernel Models
3.5 A Matrix Factorization View of Linear Models
3.5.1 Outlier Detection in Incomplete Data
3.5.1.1 Computing the Outlier Scores
3.6 Neural Networks: From Linear Models to Deep Learning
3.6.1 Generalization to Nonlinear Models
3.6.2 Replicator Neural Networks and Deep Autoencoders
3.6.3 Practical Issues
3.6.4 The Broad Potential of Neural Networks
3.7 Limitations of Linear Modeling
3.8 Conclusions and Summary
3.9 Bibliographic Survey
3.10 Exercises
4 Proximity-Based Outlier Detection
4.1 Introduction
4.2 Clusters and Outliers: The Complementary Relationship
4.2.1 Extensions to Arbitrarily Shaped Clusters
4.2.1.1 Application to Arbitrary Data Types
4.2.2 Advantages and Disadvantages of Clustering Methods
4.3 Distance-Based Outlier Analysis
4.3.1 Scoring Outputs for Distance-Based Methods
4.3.2 Binary Outputs for Distance-Based Methods
4.3.2.1 Cell-Based Pruning
4.3.2.2 Sampling-Based Pruning
4.3.2.3 Index-Based Pruning
4.3.3 Data-Dependent Similarity Measures
4.3.4 ODIN: A Reverse Nearest Neighbor Approach
4.3.5 Intensional Knowledge of Distance-Based Outliers
4.3.6 Discussion of Distance-Based Methods
4.4 Density-Based Outliers
4.4.1 LOF: Local Outlier Factor
4.4.1.1 Handling Duplicate Points and Stability Issues
4.4.2 LOCI: Local Correlation Integral
4.4.2.1 LOCI Plot
4.4.3 Histogram-Based Techniques
4.4.4 Kernel Density Estimation
4.4.4.1 Connection with Harmonic k-Nearest Neighbor Detector
4.4.4.2 Local Variations of Kernel Methods
4.4.5 Ensemble-Based Implementations of Histograms and Kernel Methods
4.5 Limitations of Proximity-Based Detection
4.6 Conclusions and Summary
4.7 Bibliographic Survey
4.8 Exercises
5 High-Dimensional Outlier Detection
5.1 Introduction
5.2 Axis-Parallel Subspaces
5.2.1 Genetic Algorithms for Outlier Detection
5.2.1.1 Defining Abnormal Lower-Dimensional Projections
5.2.1.2 Defining Genetic Operators for Subspace Search
5.2.2 Finding Distance-Based Outlying Subspaces
5.2.3 Feature Bagging: A Subspace Sampling Perspective
5.2.4 Projected Clustering Ensembles
5.2.5 Subspace Histograms in Linear Time
5.2.6 Isolation Forests
5.2.6.1 Further Enhancements for Subspace Selection
5.2.6.2 Early Termination
5.2.6.3 Relationship to Clustering Ensembles and Histograms
5.2.7 Selecting High-Contrast Subspaces
5.2.8 Local Selection of Subspace Projections
5.2.9 Distance-Based Reference Sets
5.3 Generalized Subspaces
5.3.1 Generalized Projected Clustering Approach
5.3.2 Leveraging Instance-Specific Reference Sets
5.3.3 Rotated Subspace Sampling
5.3.4 Nonlinear Subspaces
5.3.5 Regression Modeling Techniques
5.4 Discussion of Subspace Analysis
5.5 Conclusions and Summary
5.6 Bibliographic Survey
5.7 Exercises
6 Outlier Ensembles
6.1 Introduction
6.2 Categorization and Design of Ensemble Methods
6.2.1 Basic Score Normalization and Combination Methods
6.3 Theoretical Foundations of Outlier Ensembles
6.3.1 What is the Expectation Computed Over?
6.3.2 Relationship of Ensemble Analysis to Bias-Variance Trade-Off
6.4 Variance Reduction Methods
6.4.1 Parametric Ensembles
6.4.2 Randomized Detector Averaging
6.4.3 Feature Bagging: An Ensemble-Centric Perspective
6.4.3.1 Connections to Representational Bias
6.4.3.2 Weaknesses of Feature Bagging
6.4.4 Rotated Bagging
6.4.5 Isolation Forests: An Ensemble-Centric View
6.4.6 Data-Centric Variance Reduction with Sampling
6.4.6.1 Bagging
6.4.6.2 Subsampling
6.4.6.3 Variable Subsampling
6.4.6.4 Variable Subsampling with Rotated Bagging (VR)
6.4.7 Other Variance Reduction Methods
6.5 Flying Blind with Bias Reduction
6.5.1 Bias Reduction by Data-Centric Pruning
6.5.2 Bias Reduction by Model-Centric Pruning
6.5.3 Combining Bias and Variance Reduction
6.6 Model Combination for Outlier Ensembles
6.6.1 Combining Scoring Methods with Ranks
6.6.2 Combining Bias and Variance Reduction
6.7 Conclusions and Summary
6.8 Bibliographic Survey
6.9 Exercises
7 Supervised Outlier Detection
7.1 Introduction
7.2 Full Supervision: Rare Class Detection
7.2.1 Cost-Sensitive Learning
7.2.1.1 MetaCost: A Relabeling Approach
7.2.1.2 Weighting Methods
7.2.2 Adaptive Re-sampling
7.2.2.1 Relationship between Weighting and Sampling
7.2.2.2 Synthetic Over-sampling: SMOTE
7.2.3 Boosting Methods
7.3 Semi-Supervision: Positive and Unlabeled Data
7.4 Semi-Supervision: Partially Observed Classes
7.4.1 One-Class Learning with Anomalous Examples
7.4.2 One-Class Learning with Normal Examples
7.4.3 Learning with a Subset of Labeled Classes
7.5 Unsupervised Feature Engineering in Supervised Methods
7.6 Active Learning
7.7 Supervised Models for Unsupervised Outlier Detection
7.7.1 Connections with PCA-Based Methods
7.7.2 Group-wise Predictions for High-Dimensional Data
7.7.3 Applicability to Mixed-Attribute Data Sets
7.7.4 Incorporating Column-wise Knowledge
7.7.5 Other Classification Methods with Synthetic Outliers
7.8 Conclusions and Summary
7.9 Bibliographic Survey
7.10 Exercises
8 Categorical, Text, and Mixed Attribute Data
8.1 Introduction
8.2 Extending Probabilistic Models to Categorical Data
8.2.1 Modeling Mixed Data
8.3 Extending Linear Models to Categorical and Mixed Data
8.3.1 Leveraging Supervised Regression Models
8.4 Extending Proximity Models to Categorical Data
8.4.1 Aggregate Statistical Similarity
8.4.2 Contextual Similarity
8.4.2.1 Connections to Linear Models
8.4.3 Issues with Mixed Data
8.4.4 Density-Based Methods
8.4.5 Clustering Methods
8.5 Outlier Detection in Binary and Transaction Data
8.5.1 Subspace Methods
8.5.2 Novelties in Temporal Transactions
8.6 Outlier Detection in Text Data
8.6.1 Probabilistic Models
8.6.2 Linear Models: Latent Semantic Analysis
8.6.2.1 Probabilistic Latent Semantic Analysis (PLSA)
8.6.3 Proximity-Based Models
8.6.3.1 First Story Detection
8.7 Conclusions and Summary
8.8 Bibliographic Survey
8.9 Exercises
9 Time Series and Streaming Outlier Detection
9.1 Introduction
9.2 Predictive Outlier Detection in Streaming Time-Series
9.2.1 Autoregressive Models
9.2.2 Multiple Time Series Regression Models
9.2.2.1 Direct Generalization of Autoregressive Models
9.2.2.2 Time-Series Selection Methods
9.2.2.3 Principal Component Analysis and Hidden Variable-Based Models
9.2.3 Relationship between Unsupervised Outlier Detection and Prediction
9.2.4 Supervised Point Outlier Detection in Time Series
9.3 Time-Series of Unusual Shapes
9.3.1 Transformation to Other Representations
9.3.1.1 Numeric Multidimensional Transformations
9.3.1.2 Discrete Sequence Transformations
9.3.1.3 Leveraging Trajectory Representations of Time Series
9.3.2 Distance-Based Methods
9.3.2.1 Single Series versus Multiple Series
9.3.3 Probabilistic Models
9.3.4 Linear Models
9.3.4.1 Univariate Series
9.3.4.2 Multivariate Series
9.3.4.3 Incorporating Arbitrary Similarity Functions
9.3.4.4 Leveraging Kernel Methods with Linear Models
9.3.5 Supervised Methods for Finding Unusual Time-Series Shapes
9.4 Multidimensional Streaming Outlier Detection
9.4.1 Individual Data Points as Outliers
9.4.1.1 Proximity-Based Algorithms
9.4.1.2 Probabilistic Algorithms
9.4.1.3 High-Dimensional Scenario
9.4.2 Aggregate Change Points as Outliers
9.4.2.1 Velocity Density Estimation Method
9.4.2.2 Statistically Significant Changes in Aggregate Distributions
9.4.3 Rare and Novel Class Detection in Multidimensional Data Streams
9.4.3.1 Detecting Rare Classes
9.4.3.2 Detecting Novel Classes
9.4.3.3 Detecting Infrequently Recurring Classes
9.5 Conclusions and Summary
9.6 Bibliographic Survey
9.7 Exercises