# 700 Class Note for STAT 59800 with Professor Neville at Purdue

This 19 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at Purdue University taught by a professor in Fall. Since its upload, it has received 18 views.


Date Created: 02/06/15

Data Mining, CS 57300 / STAT 59800-024, Purdue University, April 21, 2009

## Anomaly detection

Source: *Introduction to Data Mining* by Tan, Steinbach, and Kumar

Task: find data points that are considerably different from the remainder of the data (anomalies/outliers).

Variants:

- Find all points with anomaly scores > threshold
- Find the point with the largest anomaly score
- Given a database $D$ with mostly normal points, compute the anomaly score of a point $x$ with respect to $D$

Examples:

- Fraud detection
- Intrusion detection
- Ecosystem disturbances
- System monitoring
- Biosurveillance/public health
- Data preprocessing

## Importance of accurate detection

Ozone depletion history:

- In 1985 three researchers (Farman, Gardiner, and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels.
- Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
- The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded.

[Figure: Antarctic ozone hole average area, in million square kilometers, 1979 to 2001, with the areas of North America and Antarctica shown for scale; threshold of < 220 Dobson units. Source: NASA Goddard Space Flight Center.]

Sources: http://exploringdata.cqu.edu.au/ozone.html, http://www.epa.gov/ozone/science/hole/size.html

## Types of anomalies

- Data from different classes
  - An outlier is an observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism
- Natural variation
  - Extreme or unlikely variations are often interesting
- Data measurement and collection errors
  - Preprocess to remove

## Defining an outlier

The notion of an outlier is highly subjective and domain dependent.

[Figure: three panels showing outliers with respect to a distribution, outliers with respect to a pattern, and time series outliers. Source: Osmar Zaiane, U. Alberta.]

## Point anomalies

An individual data instance is anomalous with respect to the data.

Source: Lazarevic et al., ECML/PKDD '08 Tutorial

## Contextual anomalies

- An individual data instance is anomalous within a context
- Requires a notion of context
- Also referred to as conditional anomalies (Song et al., TKDE '06)

[Figure: monthly temperature over time, with normal and anomalous readings marked.]

Source: Lazarevic et al., ECML/PKDD '08 Tutorial

## Collective anomalies

- A collection of related data instances is anomalous
- Requires a relationship among data instances
  - Sequential data
  - Spatial data
  - Graph data
- The individual instances within a collective anomaly are not anomalous by themselves

[Figure: a time series with an anomalous subsequence.]

Source: Lazarevic et al., ECML/PKDD '08 Tutorial

## Anomaly detection

Challenges:

- How many attributes are used to define an outlier?
- How many outliers are there in the data?
- Class labels are costly, so evaluation can be challenging
- Skewed class distribution: finding needles in a haystack

Working assumption:

- There are considerably more normal observations than abnormal observations in the data

## Approaches

- Supervised
  - Labels available for both normal data and anomalies
  - Similar to classification with imbalanced classes
- Semi-supervised
  - Labels available only for normal data
- Unsupervised
  - No labels assumed
  - Based on the assumption that anomalies are very rare compared to normal data

## Unsupervised point anomaly detection

General method:

- Build a profile of normal behavior based on patterns or summary statistics for the overall population
- Use deviations from normal to detect anomalies

Types of methods:

- Visual and statistical-based
- Distance-based
- Model-based

## Visual methods

- Box plot (1-D)
- Scatter plot (2-D)

[Figure: box plot of petal length per class.]

Limitations:

- Time consuming
- Subjective

## Convex hull method

- Use convex hull methods to detect extreme values
- Extreme points on the convex hull are assumed to be outliers
- What if the outlier is in the middle of the data?

Source: Osmar Zaiane, U. Alberta

## Statistical approaches

- Use a parametric model to describe the data (e.g., a Normal distribution)
- Apply a statistical test that depends on how likely a point is under the data distribution
- Need to specify a confidence limit (e.g., a point $6\sigma$ away from the mean is an outlier)

## Grubbs' test

- Detect outliers in univariate data
- Assume data is drawn from a normal distribution
- Detect one outlier at a time, remove the outlier, and repeat
  - $H_0$: there is no outlier
  - $H_1$: there is at least one outlier
- Grubbs' statistic:

$$G = \frac{\max_i |X_i - \bar{X}|}{s}$$

- Reject $H_0$ if:

$$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/N,\,N-2)}}{N - 2 + t^2_{(\alpha/N,\,N-2)}}}$$

## Multivariate data

- Model data with a multivariate Gaussian
- Calculate the distance of each point to the centroid
- Use the Mahalanobis distance to take into account the covariance of the attributes:

$$d_M(x_i) = \sqrt{(x_i - \mu)^T \Sigma^{-1} (x_i - \mu)}$$

## Example

[Figure 10.3: Mahalanobis distance of points from the center of a two-dimensional set of 2002 points.]

## Mixture model approach

- Assume the dataset $D$ is a mixture of two distributions:
  - $M$: majority distribution
  - $A$: anomalous distribution
- General method:
  - Initially assume that all points belong to $M$
  - Let $L_i(D)$ be the likelihood at iteration $i$
  - For each point $x_i$ in $M$, move it to $A$
  - Compute the difference in likelihood, $\Delta = L_{i+1}(D) - L_i(D)$
  - If $\Delta > c$, then $x_i$ is moved permanently to $A$

## Statistical-based (cont.)

- Mixture distribution: $D = (1 - \lambda) M + \lambda A$
- $M$ is a probability distribution estimated from data
- $A$ is initially assumed to be a uniform distribution
- $\lambda$ is the expected fraction of outliers
- Likelihood at iteration $i$:

$$L_i(D) = \prod_{j=1}^{N} P_D(x_j) = \left( (1-\lambda)^{|M_i|} \prod_{x_j \in M_i} P_{M_i}(x_j) \right) \left( \lambda^{|A_i|} \prod_{x_j \in A_i} P_{A_i}(x_j) \right)$$

## Limitations of statistical approaches

- Most tests are for univariate data
- In many cases, the form of the data distribution may not be known
- If the distribution is misspecified, outliers may look likely (e.g., the mean is an outlier)
- For high-dimensional data, it may be difficult to estimate the likelihood of points in the true distribution

## Distance-based approaches

Three major types of methods:

- Nearest neighbor
- Density-based
- Clustering approach

## Nearest-neighbor

- Compute the distance between every pair of points
- How to define outliers:
  - Points for which there are fewer than $p$ neighboring points within distance $d$
  - Top $p$ points whose distance to the $k$th nearest neighbor is greatest
  - Top $p$ points whose average distance to the $k$ nearest neighbors is greatest

## Example

[Figure 10.4: Outlier score based on the distance to the fifth nearest neighbor.]

[Figure 10.5: Outlier score based on the distance to the first nearest neighbor. Nearby outliers have low outlier scores.]

## Example

[Figure 10.6: Outlier score based on the distance to the fifth nearest neighbor. A small cluster becomes an outlier.]

[Figure 10.7: Outlier score based on the distance to the fifth nearest neighbor. Clusters of differing density.]

## High dimensions

- In high-dimensional space, data is sparse and the notion of proximity becomes meaningless
  - Every point is an almost equally good outlier from the perspective of proximity-based definitions
- Lower-dimensional projections can be used for outlier detection
  - A point is an outlier if in some lower-dimensional projection it is present in a local region of abnormally low density

## Low dimensional projection

- Divide each attribute into $\phi$ intervals based on frequency; each interval contains a fraction $f = 1/\phi$ of the records
- Consider a $k$-dimensional cube created by picking grid ranges from $k$ different dimensions
- If attributes are independent, we expect a region to contain a fraction $f^k$ of the points
- If there are $N$ points, we can measure the sparsity of a cube $D$ as:

$$S(D) = \frac{n(D) - N \cdot f^k}{\sqrt{N \cdot f^k \cdot (1 - f^k)}}$$

- Negative sparsity indicates that the cube contains a smaller number of points than expected

## Density-based

- For each point, compute the density of its local neighborhood:

$$\text{density}(x, k) = \left( \frac{\sum_{y \in N(x,k)} \text{distance}(x, y)}{|N(x,k)|} \right)^{-1}$$

- Local outlier factor (LOF)
  - For an instance $i$, the LOF is the ratio of the average density of its nearest neighbors to the density of instance $i$
  - Outliers are points with the largest LOF value

[Figure: points $p_1$ and $p_2$ next to clusters of differing density.]

In the NN approach, $p_2$ is not considered an outlier, while the LOF approach finds both $p_1$ and $p_2$ as outliers.

## Clustering-based

- Cluster the data into groups of different density
- Choose points in small clusters as candidate outliers
- Compute the distance between candidate points and non-candidate clusters
- If candidate points are far from all other non-candidate points, they are outliers

## Example

[Figure 10.9: Distance of points from closest centroid.]

## Example

[Figure 10.10: Relative distance of points from closest centroid.]

## Spectral methods

- Analysis based on eigendecomposition of data
- Key idea:
  - Find a combination of attributes that captures the bulk of the variability
  - The reduced set of attributes can explain normal data well, but not necessarily the anomalies
- Advantage:
  - Can operate in an unsupervised mode
- Drawback:
  - Based on the assumption that anomalies and normal instances are distinguishable in the reduced space

Source: Lazarevic et al., ECML/PKDD '08 Tutorial

## PCA approach (Ide & Kashima, KDD '04)

- A few top principal components capture variability in normal data
- The smallest principal component should have constant values for normal data
- Outliers have variability in the smallest component
- Network intrusion detection using PCA:
  - For each time $t$, compute the principal component
  - Stack all principal components over time to form a matrix
  - The left singular vector of the matrix captures normal behavior
  - For any $t$, the angle between the principal component and the singular vector gives the degree of anomaly

Source: Lazarevic et al., ECML/PKDD '08 Tutorial

## Evaluation criteria

- False alarm rate (type I error)
- Misdetection rate (type II error)
- Neyman-Pearson criterion
  - Minimize the misdetection rate while the false alarm rate is bounded
  - [Equation illegible in the source.]
- Bayesian criterion
  - Minimize a weighted sum of the false alarm and misdetection rates

## Next class

Topic: sequential anomaly detection
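As a rough illustration of the Grubbs test described in the notes, the sketch below computes the statistic G (largest absolute deviation from the mean, divided by the sample standard deviation) and the rejection threshold. The function names are hypothetical, and the t critical value for level α/N with N − 2 degrees of freedom is assumed to be supplied by the caller (for example from a statistical table), since the Python standard library has no Student t quantile function.

```python
import math
import statistics


def grubbs_statistic(values):
    """Return (G, index): G = max |x - mean| / s and the index of that value."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (N - 1 denominator)
    index = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    return abs(values[index] - mean) / s, index


def grubbs_threshold(n, t_crit):
    """Rejection threshold for G, given N and a caller-supplied t critical value.

    t_crit is assumed to be the critical value of the t distribution at
    level alpha/N with N - 2 degrees of freedom (looked up externally).
    """
    t2 = t_crit ** 2
    return (n - 1) / math.sqrt(n) * math.sqrt(t2 / (n - 2 + t2))


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 100.0]
    g, idx = grubbs_statistic(data)
    print(f"G = {g:.4f}, most extreme value = {data[idx]}")
```

To follow the one-at-a-time procedure from the notes, a caller would compare G against the threshold, remove the flagged value if the null hypothesis is rejected, and repeat on the remaining data.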
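The Mahalanobis distance used for multivariate data in the notes can be sketched for the two-dimensional case without any external libraries. This is a minimal illustration under stated assumptions: the function name is hypothetical, the centroid and covariance are estimated from the same point set using the sample (N − 1) covariance, and the 2×2 covariance matrix is inverted in closed form.

```python
import math


def mahalanobis_2d(x, points):
    """Mahalanobis distance of 2-D point x from the centroid of points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance entries (N - 1 denominator).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mx, x[1] - my
    # (x - mu)^T Sigma^{-1} (x - mu), using the closed-form 2x2 inverse.
    d2 = (dx * dx * syy - 2 * dx * dy * sxy + dy * dy * sxx) / det
    return math.sqrt(max(d2, 0.0))


if __name__ == "__main__":
    pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
    print(mahalanobis_2d((3.0, 1.0), pts))
```

Unlike the Euclidean distance, this scales each direction by the spread of the data, so a point far out along a low-variance direction receives a larger distance.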
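One of the nearest-neighbor definitions in the notes scores each point by its distance to the kth nearest neighbor and reports the top p points. A brute-force sketch of that idea follows; the function names are hypothetical, Euclidean distance is assumed, and the all-pairs computation is quadratic in the number of points, so it is only meant for small examples.

```python
import math


def kth_neighbor_distances(points, k):
    """Outlier score per point: distance to its k-th nearest neighbor."""
    scores = []
    for i, point in enumerate(points):
        dists = sorted(
            math.dist(point, other) for j, other in enumerate(points) if j != i
        )
        scores.append(dists[k - 1])
    return scores


def top_outliers(points, k, num_outliers):
    """Indices of the num_outliers points with the largest k-NN distance."""
    scores = kth_neighbor_distances(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_outliers]


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(top_outliers(pts, k=2, num_outliers=1))
```

As the figures in the notes suggest, the choice of k matters: with a small k, a tight pair of outliers can score low because each is the other's nearest neighbor.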
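The density-based score in the notes can be illustrated with a simplified relative-density sketch: density is taken as the inverse of the average distance to the k nearest neighbors, and the score is the average density of a point's neighbors divided by its own density, so larger values indicate more outlying points. This is an assumption-laden simplification, not the full LOF algorithm (which uses reachability distances); the function names are hypothetical, and the points are assumed to be distinct so that no average distance is zero.

```python
import math


def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i] (brute force)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def local_density(points, i, k):
    """Inverse of the average distance from points[i] to its k nearest neighbors."""
    neighbors = knn_indices(points, i, k)
    avg = sum(math.dist(points[i], points[j]) for j in neighbors) / len(neighbors)
    return 1.0 / avg  # assumes distinct points, so avg > 0


def relative_density_scores(points, k):
    """Score = average neighbor density / own density; larger is more outlying."""
    densities = [local_density(points, i, k) for i in range(len(points))]
    scores = []
    for i in range(len(points)):
        neighbors = knn_indices(points, i, k)
        avg_neighbor_density = sum(densities[j] for j in neighbors) / len(neighbors)
        scores.append(avg_neighbor_density / densities[i])
    return scores


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    for point, score in zip(pts, relative_density_scores(pts, k=2)):
        print(point, round(score, 2))
```

Because the score compares a point with its own neighborhood, a point near a dense cluster can be flagged even when its absolute nearest-neighbor distance is small, which is the contrast the notes draw between the NN and LOF approaches.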
