# 474 Class Note for STAT 59800 with Professor Neville at Purdue

These 19 pages of class notes were uploaded by an elite notetaker on Friday, February 6, 2015. The notes belong to a course at Purdue University taught in Fall. Since their upload, they have received 46 views.

**Data Mining, STAT 59800-024, Purdue University, March 10, 2009**

## Descriptive modeling (cont.)

### Types of clusters

(Figure 8.2: different types of clusters as illustrated by sets of two-dimensional points.)

- **Well-separated clusters:** each point is closer to all of the points in its cluster than to any point in another cluster.
- **Center-based clusters:** each point is closer to the center of its cluster than to the center of any other cluster.
- **Contiguity-based clusters:** each point is closer to at least one point in its cluster than to any point in another cluster.
- **Density-based clusters:** clusters are regions of high density separated by regions of low density.
- **Conceptual clusters:** points in a cluster share some general property that derives from the entire set of points (points in the intersection of the circles belong to both clusters).

### K-means

- **Strengths:**
  - Relatively efficient: $O(tkn)$
  - Finds spherical clusters
- **Weaknesses:**
  - Terminates at a local optimum; sensitive to the initial seeds
  - Applicable only when the mean is defined
  - Need to specify $k$
  - Susceptible to outliers/noise

### Variations

- **Selection of initial centroids:**
  - Run with multiple random selections and pick the result with the best score
  - Use hierarchical clustering to identify likely clusters and pick seeds from distinct groups
  - Select the first seed randomly, then pick successive points that are farthest away
- **Algorithm modifications:**
  - Recompute the centroid after each point is assigned
  - Allow for merging and splitting of clusters
- **When the mean is undefined:**
  - K-medoids: uses one of the data points as the cluster center
  - K-modes: uses a categorical distance measure and a frequency-based update method
- **Selecting $k$**

### Hierarchical methods

- Construct a hierarchy of nested clusters rather than picking $k$ beforehand.
- Approaches:
  - Agglomerative: merge clusters successively
  - Divisive: divide clusters successively
- A dendrogram depicts the sequence of merges or splits; height indicates distance.

### Agglomerative clustering

- For $i = 1$ to $n$: let $C_i = \{x_i\}$
- While $|C| > 1$:
  - Let $C_i$ and $C_j$ be the pair of clusters with minimum $D(C_i, C_j)$
  - Merge them: $C_i = C_i \cup C_j$; remove $C_j$
- Complexity: $O(n^2)$ space for the pairwise distances; $O(n^3)$ time naively
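The agglomerative loop above can be sketched in a few lines of Python. This is a toy illustration, not code from the course: it assumes Euclidean points, uses single-link (minimum pairwise) distance as $D(C_i, C_j)$, and the dataset is made up.

```python
import math

def single_link(ci, cj):
    """D(Ci, Cj) = minimum pairwise distance (one common choice)."""
    return min(math.dist(x, y) for x in ci for y in cj)

def agglomerative(points, target=1):
    """Start with singleton clusters, then repeatedly merge the closest pair."""
    clusters = [[p] for p in points]  # for i = 1..n: Ci = {xi}
    merge_dists = []                  # dendrogram heights, in merge order
    while len(clusters) > target:
        # Find the pair (i, j) with minimum D(Ci, Cj).
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]))
        merge_dists.append(single_link(clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]  # Ci = Ci ∪ Cj
        del clusters[j]                          # remove Cj
    return clusters, merge_dists

# Two obvious groups of 2-D points (made-up data).
pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters, merge_dists = agglomerative(pts, target=2)
```

Stopping at `target` clusters corresponds to cutting the dendrogram at a chosen level; running to `target=1` records the full merge sequence.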
### Distance measures between clusters

- **Single link (nearest neighbor):** $D(C_i, C_j) = \min\{d(x, y) : x \in C_i, y \in C_j\}$; can produce long, thin clusters.
- **Complete link (furthest neighbor):** $D(C_i, C_j) = \max\{d(x, y) : x \in C_i, y \in C_j\}$; sensitive to outliers.
- **Average link:** $D(C_i, C_j) = \operatorname{avg}\{d(x, y) : x \in C_i, y \in C_j\}$; a compromise between the two.

### Divisive clustering

- While $|C| < n$:
  - For each $C_i$ with more than 2 objects:
    - Apply a partition-based clustering method to split $C_i$ into two clusters $C_j$ and $C_k$
    - $C = (C \setminus \{C_i\}) \cup \{C_j, C_k\}$
- Example partition-based methods: k-means, spectral clustering

### Spectral clustering

- Cuts a weighted graph into a number of disjoint groups using a similarity matrix $W$.
- Popularized for image segmentation (Shi and Malik, 2000).

**Algorithm:**

1. Let $W$ be an $N \times N$ matrix with $W_{ij} = S_{ij}$.
2. Let $D$ be an $N \times N$ diagonal matrix with $d_i = \sum_{j \in V} S_{ij}$.
3. Solve the eigensystem $(D - W)x = \lambda D x$ for the eigenvector $x_1$ associated with the second-smallest eigenvalue $\lambda_1$.
4. Sort $x_1$.
5. Consider $m$ evenly spaced entries in $x_1$. For each value $x^{(m)}$:
   1. Bipartition the nodes into $A, B$ such that $A \cap B = \emptyset$, $A \cup B = V$, and $x_a \le x^{(m)}$ for all $a \in A$.
   2. Calculate the normalized-cut objective function
      $$J(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)}, \quad \text{where } \mathrm{cut}(A, B) = \sum_{a \in A,\, b \in B} S_{ab}.$$
6. Partition the graph into the $A, B$ that minimizes $J$.
7. Calculate the stability of the current cut; if the cut is stable, stop recursing.
8. Recursively repartition $A$ and $B$ if necessary.

- **Strengths:** can find non-spherical clusters.
- **Weaknesses:**
  - Expensive to find all eigenvalues of an arbitrary eigensystem; approximate methods reduce the cost.
  - Researchers disagree about which eigenvectors to use and how to derive clusters from the eigenvectors.

### Probabilistic model-based clustering

- Assumes a probabilistic model for each underlying cluster component.
- A mixture model specifies a weighted combination of component distributions (e.g., Gaussian, Poisson, exponential):
  $$f(x) = \sum_{k=1}^{K} w_k f_k(x; \theta_k)$$

### Gaussian mixture models

(Figure: a mixture of three Gaussians, shown as samples and as contours of the probability distribution.)
### Mixture models (cont.)

- How do we learn the model from data? We don't know the $w_k$ or the component parameters $\theta_k$.
- Solution:
  - Interpret the mixing coefficients as prior probabilities:
    $$f(x) = \sum_{k=1}^{K} w_k f_k(x; \theta_k) = \sum_{k=1}^{K} p(k)\, p(x \mid k)$$
  - Use Expectation-Maximization (Dempster, Laird & Rubin, 1977).

### Generative process

- For each data point:
  - Pick a component Gaussian randomly with probability $w_k$.
  - Draw a sample point from that component randomly.
- Repeat.

(Slide: a sample dataset drawn from the mixture.)

### Learning the model

- We want to invert this process: given the dataset, find the parameters
  - the mixing coefficients
  - the component means and covariance matrices
- If we knew which component generated each point, the MLE solution would involve fitting each component distribution to the appropriate cluster points.
- Problem: the cluster memberships are hidden.

(Slide: the same dataset, unlabeled.)

### Posterior probabilities

- We can think of the mixing coefficients as prior probabilities for the components.
- For a given value of $x$, we can evaluate the corresponding posterior probabilities of cluster membership with Bayes' theorem:
  $$\gamma_k(x) \equiv p(k \mid x) = \frac{p(k)\, p(x \mid k)}{p(x)} = \frac{w_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} w_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$$

(Slide: posterior membership probabilities visualized on the sample dataset.)

### MLE for GMM

- The log-likelihood takes the following form:
  $$\log p(\mathcal{D} \mid w, \mu, \Sigma) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} w_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
- Note that the sum over components is inside the log.
- There is no closed-form solution for the MLE.

### EM algorithm

- A popular algorithm for parameter estimation in data with hidden/unobserved values.
- Here the hidden variables are the cluster memberships.
- Basic idea:
  - Initialize the hidden variables and parameters.
  - Predict values for the hidden variables given the current parameters.
  - Estimate the parameters given the current predictions for the hidden variables.
  - Repeat.

### Hidden variables

- If we knew the values of the hidden variables, we could maximize the complete-data log-likelihood:
  $$\log p(x, z \mid \theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[ \log w_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]$$
- This has a trivial closed-form solution when the $z_{nk}$ are known.
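For a given parameter guess, the expected values of the hidden memberships are exactly the posterior probabilities $\gamma_k(x)$ obtained from Bayes' theorem above. A one-dimensional sketch with invented parameters (two equally weighted unit-variance components):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal N(x | mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def responsibilities(x, weights, mus, sigmas):
    """gamma_k(x) = w_k N(x|mu_k,sigma_k) / sum_j w_j N(x|mu_j,sigma_j)."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
    total = sum(joint)  # this denominator is the mixture density f(x)
    return [j / total for j in joint]

# Two made-up components centered at 0 and 4.
gamma = responsibilities(1.0, [0.5, 0.5], [0.0, 4.0], [1.0, 1.0])
# A point midway between the means:
gamma_mid = responsibilities(2.0, [0.5, 0.5], [0.0, 4.0], [1.0, 1.0])
```

A point near one mean gets nearly all of its membership mass from that component, while the midway point is split evenly: this is the soft assignment that distinguishes mixture-model clustering from k-means' hard assignment.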
### EM for GMM

- We don't know the values of the hidden variables, but for a given set of parameters we can compute their expected values.
- Suppose we make a guess $\hat\theta$ for the parameter values.
- **E-step:** use the guessed parameters to evaluate the cluster memberships:
  $$\gamma_k(x_n) \equiv p(k \mid x_n) = \frac{\hat w_k\, \mathcal{N}(x_n \mid \hat\mu_k, \hat\Sigma_k)}{\sum_{j=1}^{K} \hat w_j\, \mathcal{N}(x_n \mid \hat\mu_j, \hat\Sigma_j)}$$
- Now compute the expected complete-data log-likelihood using the predicted cluster memberships:
  $$\mathbb{E}[\log p(x, z \mid \theta)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_k(x_n) \left[ \log w_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]$$
- **M-step:** use the completed likelihood to determine the MLE for $\theta$.

(Slides 27–32: scatter plots showing successive EM iterations on the sample dataset.)

### More on EM

- Often both the E-step and the M-step can be solved in closed form.
- Neither the E-step nor the M-step can decrease the log-likelihood.
- The algorithm is guaranteed to converge to a local maximum of the likelihood.
- Initialization and stopping criteria must be specified.

### Probabilistic clustering

- The model provides a full distributional description for each component.
- We may be able to interpret differences in the distributions.
- Soft clustering, compared to k-means' hard clustering: given the model, each point has a $K$-component vector of membership probabilities.
- Key cost: the assumption of a parametric model.

### How to choose k?

- Choose $k$ to maximize the likelihood?
- As $k$ increases, the value of the maximum likelihood cannot decrease, so more complex models will always improve the likelihood.
- How, then, do we compare models with different complexities?

### Announcements

- Midterm: Mar 10, 8–10pm, LWSN B151.
- Class:
  - Thursday, Mar 12: class is cancelled.
  - Thursday, Mar 26: David Jensen, "Myths of Research in Computer Science."
- Homework: review report deadline extended to Mar 24.

### Score functions

- Goal:
  - Compact clusters: minimize the within-cluster distance $wc(C)$.
  - Separated clusters: maximize the between-cluster distance $bc(C)$.
- $\mathrm{Score}(C, \mathcal{D}) = f(wc(C), bc(C))$
- $$wc(C) = \sum_{k=1}^{K} wc(C_k) = \sum_{k=1}^{K} \sum_{x_i \in C_k} d(x_i, r_k)^2, \quad \text{where } r_k \text{ is the centroid of } C_k$$
- $$bc(C) = \sum_{1 \le j < k \le K} d(r_j, r_k)^2$$
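The within- and between-cluster scores above can be computed directly. A toy sketch using squared Euclidean distance to cluster centroids, with a made-up hard clustering:

```python
import math

def score_terms(clusters):
    """wc: sum of squared distances from each point to its centroid r_k.
    bc: sum of squared distances between centroids, over pairs j < k."""
    centroids = [tuple(sum(coord) / len(c) for coord in zip(*c)) for c in clusters]
    wc = sum(math.dist(p, centroids[k]) ** 2
             for k, c in enumerate(clusters) for p in c)
    bc = sum(math.dist(centroids[j], centroids[k]) ** 2
             for j in range(len(centroids)) for k in range(j + 1, len(centroids)))
    return wc, bc

# Two clusters of 2-D points (illustrative data).
clusters = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
wc, bc = score_terms(clusters)
```

A combined score might then trade these terms off, for example by maximizing $bc(C)/wc(C)$; the slide leaves the combining function $f$ unspecified.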
