### Create a StudySoup account

#### Be part of our community, it's free to join!

Already have a StudySoup account? Login here

# 603 Class Note for STAT 59800 with Professor Neville at Purdue

### View Full Document

## 19

## 0

## Popular in Course

## Popular in Department

This 16 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at Purdue University taught by a professor in Fall. Since its upload, it has received 19 views.

## Similar to Course at Purdue

## Reviews for 603 Class Note for STAT 59800 with Professor Neville at Purdue

### What is Karma?

#### Karma is the currency of StudySoup.

#### You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 02/06/15

Data Mining 3857300 STAT 59800 024 Purdue University April 7 2009 Descriptive modeling evaluation Cluster validity 0 For prediction tasks there are a variety of external evaluation metrics 0 Accuracy squared loss area under ROC etc 0 For cluster analysis the external evaluation should evaluate the goodness of the resulting clusters 0 Why do we want external validation 0 To avoid finding patterns in noise 0 To compare clustering algorithms 0 To compare two sets of clusters Random data Random DBSCAN Points I ma 39 39 Kmeans S 39 3quot I Complete Link Evaluation approaches Determine the clustering tendency of the data c Evaluate the clusters using known class labels O Evaluate how well the clusters fit the data 0 Determine which of two different clustering results is better O Determine the quotcorrectquot number of clusters Evaluation measures c Supervised O Measures the extent to which clusters match external class label values Unsupervised O Measures goodness of fit without class labels O Relative 0 Compares two clusterings often uses external or internal index Unsupervised Correlation I11 0 Compute the correlation between the initial similarity matrix and an idea cluster matrix C Entry ij is 1 if i and j are in the same cluster 0 otherwise High correlation indicates that points in the same cluster are close to each other 0 Assumes that the proximity values are 01 0 When is this a good measure Example Corr 09235 Corr 05810 Visual inspection Order the proximity matrix with respect to cluster labels 0 Inspect visually 0 Good clusterings exhibit clear block pattern 10 Example 1 11 Example CohedonandsepamMon Centroid or medoid A Cohesion 8 Separation 13 Cohegon Measures how closely related the objects are within each cluster 0 Within cluster sum of squared errors SSE 0 For each point the error is the distance to the centroid 0 Within cluster pairwise weighting 0 Sum distance between all pairs of points in same cluster 14 Separation Measures how distinct a cluster is from the other clusters 0 Between cluster SSE for cluster C O For each cluster C the error is the distance from the centroid c to the other centroid c The error is multiplied by the cluster size C 0 Between cluster pairwise weighting 0 Sum distance between all pairs of points in different clusters 15 Cohesion and separation 0 The sum of the between cluster SSE and within cluster SSE is equal to the total sum of squared error distance of each point to overall mean O Thus minimizing cohesion is equivalent to maximizing separation 16 Silhouette coefficient Combines both cohesion and separation 0 For an individual point i 0 A average distance of i to points in same cluster 0 B average distance of i to points in other clusters c S B A maxAB O Can calculate average 8 for a cluster or clustering 0 Closer to 1 is better 17 Cophenetic distance 0 For hierarchical clustering techniques O Cophenetic distance between objects 0 Similarity level at which an agglomerative clustering technique puts the objects in the same cluster for the first time 0 Can define a cophenetic distance matrix between all pairs of objects to describe a hierarchical clustering 18 Example 020 015 010 005 03900 3 6 2 5 A 1 Single link clustering Single link dendrogram DO39ii39it D1 D2 D3 D4 D5 D6 DO39ii39it D1 D2 D3 D4 D5 D6 D1 0 024 022 037 034 023 D1 0 022 022 022 022 022 D2 024 0 015 020 014 025 D2 022 0 015 015 014 015 D3 022 015 0 015 028 011 D3 022 015 0 015 015 011 D4 037 020 015 0 029 022 D4 022 015 015 0 015 015 D5 034 014 028 029 0 039 D5 022 014 015 015 0 015 D6 023 025 011 022 039 0 D5 022 015 011 015 015 O Similarity matrix Copnenetic matrix 19 Cophenetic correlation 6 Measure correlation between the original disimilarity matrix and the cophenetic matrix 3102D7 CD Cm 20 Supervised 21 Classlabel evaluation 0 If you have class labels why cluster C Usually small hand labeled clataset for evaluation 0 But large dataset to cluster automatically O May want to assess how close clusterings correspond to classes but still allow for more variation in the clusters 22 Classificationoriented Purity another measure of the extent to which a cluster contains objects of a single class 0 The purity of cluster i is pimaX pi K i N39 purity E W219i i1 0 Entropy the degree to which each cluster consists of objects of a single class 0 For each cluster i compute the probability of classj c 6239 21917109 pij 13971 23 Classificationoriented 0 Normalized mutual information gain Measures the amount of information by which our knowledge about the classes increases when we are told what the clusters are 10 G HC HG 0 9 pm grog103 EC 100qu 100 i 29 pg109 109 NMIC G 24 Classificationoriented Precision The fraction of a cluster that consists of objects of a specified class 0 Recall 0 The extent to which a cluster contains all objects of a specified class C Accuracy O Why is it hard to measure the accuracy of a clustering if you know class labels 25 Similarityoriented c Based on premise that any pair of objects in the same cluster should have the same class and vice versa O Compare the ideal cluster similarity matrix to the ideal class similarity matrix 26 Approaches 0 Correlation between two ideal matrices 0 Measures of binary similarity between two ideal matrices 0 fun pairs of objects having diff class and diff Cluster 0 fm pairs of objects having diff Class and same Cluster 0 g 2 pairs of objects having same class and cliff cluster 0 f1 pairs of objects having same Class and same Cluster foo f11 Jaccard f11 Rand f00f01f10f11 f01f10f11 27 Determining k 0 Approach evaluate over a range of k look for peak clip or knee in evaluation measure 5 10 TB 20 25 Silhouette 28 Clustering tendency Evaluate whether a dataset has clusters without clustering c Most common approach for low dimensional Euclidean data C Use a statistical test for spatial randomness 0 Hopkins statistic sample 20 points from dataset generate 20 random points in same space H 2L1 wi Ur distance from random point to NN in data 21 239 t 2L1 wz Wii distance from sample point to NN in data 29 Assessing significance c How do we know a score is good O How do we know that a difference between two algorithms is significant 0 This is the same problem we had for predictive models O Need a sampling distribution to compare to O Can generate random data in same space and compute empirical sampling distribution 0 Can partition data to get multiple folds for evaluation 30 Bayesian networks Goodness of fit 0 AIC BIC MDL BD O How to assess quality of structure learning 0 Use synthetic data generated from specified Bayes net try to recover structure compare to gold standardquot 31 Next class 0 Reading Chapter 13 PDM O Topic Pattern mining representation 32

### BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.

### You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

## Why people love StudySoup

#### "Knowing I can count on the Elite Notetaker in my class allows me to focus on what the professor is saying instead of just scribbling notes the whole time and falling behind."

#### "I made $350 in just two days after posting my first study guide."

#### "There's no way I would have passed my Organic Chemistry class this semester without the notes and study guides I got from StudySoup."

#### "It's a great way for students to improve their educational experience and it seemed like a product that everybody wants, so all the people participating are winning."

### Refund Policy

#### STUDYSOUP CANCELLATION POLICY

All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email support@studysoup.com

#### STUDYSOUP REFUND POLICY

StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here: support@studysoup.com

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to support@studysoup.com

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.