
# 480 Class Note for STAT 59800 with Professor Neville at Purdue


These 15 pages of class notes were uploaded by an elite notetaker on Friday, February 6, 2015. The notes belong to a course at Purdue University taught by a professor in Fall. Since upload, they have received 14 views.


Date Created: 02/06/15

Data Mining (STAT 59800-024), Purdue University, March 24, 2009

## Model-based Clustering

### Probabilistic model-based clustering

- Assumes a probabilistic model for each underlying cluster (component).
- A mixture model specifies a weighted combination of component distributions (e.g., Gaussian, Poisson, Exponential):

$$p(x) = \sum_{k=1}^{K} w_k \, f_k(x; \theta_k)$$

### Gaussian mixture models

*(Figure: a mixture of three Gaussians, shown both as sample points and as contours of the probability distribution.)*

### Mixture models (cont.)

- How do we learn the model from data? We know neither the mixing weights $w_k$ nor the component parameters.
- Solution:
  - Interpret the mixing coefficients as prior probabilities.
  - Use Expectation-Maximization (Dempster, Laird & Rubin, 1977).

$$p(x) = \sum_{k=1}^{K} w_k \, f_k(x; \theta_k) = \sum_{k=1}^{K} p(k) \, p(x \mid k)$$

### Generative process

- For each data point:
  - Pick a component $k$ at random with probability $w_k$.
  - Draw a sample point from that component.

$$p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

*(Figure: a sample dataset drawn from the three-component mixture.)*

### Learning the model

- We want to invert this process: given the dataset, find the parameters
  - the mixing coefficients, and
  - the component means and covariance matrices.
- If we knew which component generated each point, the MLE solution would simply fit each component distribution to the points of its cluster.
- Problem: the cluster memberships are hidden.

*(Figure: the same sample points with the component labels removed.)*

### Posterior probabilities

- We can think of the mixing coefficients as prior probabilities for the components.
- For a given value of $x$, we can evaluate the corresponding posterior probabilities of cluster membership with Bayes' theorem:

$$p(k \mid x) = \frac{w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} w_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$$

*(Figure: the sample points colored by their posterior membership probabilities.)*

### MLE for GMM

The log-likelihood takes the following form:

$$\log p(X \mid w, \mu, \Sigma) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} w_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$

- Note: the sum over components is *inside* the log.
- There is no closed-form solution for the MLE.
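The mixture density, Bayes-rule posteriors, and log-likelihood above can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration, not code from the lecture; the two-component weights, means, and covariances are made-up values.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component Gaussian mixture in 2-D (illustrative parameters).
weights = np.array([0.6, 0.4])                       # mixing coefficients w_k (priors)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]

def mixture_density(X):
    """p(x) = sum_k w_k N(x | mu_k, Sigma_k), evaluated at each row of X."""
    return sum(w * multivariate_normal.pdf(X, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

def responsibilities(X):
    """Posterior p(k | x) via Bayes' theorem: w_k N(x|mu_k,Sigma_k) / p(x)."""
    comp = np.column_stack([w * multivariate_normal.pdf(X, mean=m, cov=c)
                            for w, m, c in zip(weights, means, covs)])
    return comp / comp.sum(axis=1, keepdims=True)

X = np.array([[0.1, -0.2], [2.9, 3.1]])              # one point near each component
log_lik = np.log(mixture_density(X)).sum()           # sum over n sits outside the log
post = responsibilities(X)                           # each row sums to 1
```

Note how `log_lik` reflects the slide's point: the sum over components is inside the log, which is what prevents a closed-form MLE.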
### EM algorithm

- A popular algorithm for parameter estimation in data with hidden (unobserved) values.
- Hidden variables: the cluster memberships.
- Basic idea:
  1. Initialize the hidden variables and parameters.
  2. E-step: predict values for the hidden variables given the current parameters.
  3. M-step: estimate the parameters given the current prediction for the hidden variables.
  4. Repeat.

### Hidden variables

If we knew the values of the hidden variables $z_{nk}$, we could maximize the complete-data log-likelihood:

$$\log p(X, Z \mid w, \mu, \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[ \log w_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]$$

- This has a trivial closed-form solution, except that we don't know the values of the hidden variables.
- But for a given set of parameters, we *can* compute the expected values of the hidden variables.

### EM for GMM

Suppose we make a guess $\hat{\theta}$ for the parameter values.

- E-step: use the current parameters to evaluate the cluster memberships (responsibilities):

$$\gamma(z_{nk}) = p(k \mid x_n) = \frac{\hat{w}_k \, \mathcal{N}(x_n \mid \hat{\mu}_k, \hat{\Sigma}_k)}{\sum_{j=1}^{K} \hat{w}_j \, \mathcal{N}(x_n \mid \hat{\mu}_j, \hat{\Sigma}_j)}$$

- Now compute the expected complete-data log-likelihood using the predicted cluster memberships:

$$\sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left[ \log w_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]$$

- M-step: use this completed likelihood to determine the MLE for $\theta$.

*(Figures: snapshots of successive EM iterations on the sample dataset.)*

### More on EM

- Often both the E-step and the M-step can be solved in closed form.
- Neither the E-step nor the M-step can decrease the log-likelihood.
- The algorithm is guaranteed to converge to a local maximum of the likelihood.
- Initialization and stopping criteria must be specified.

### Probabilistic clustering

- The model provides a full distributional description for each component; we may be able to interpret differences between the component distributions.
- Soft clustering, compared to k-means hard clustering: given the model, each point has a $K$-component vector of membership probabilities.
- Key cost: the assumption of a parametric model.

### How to choose K?

- Choose $K$ to maximize the likelihood?
- As $K$ increases, the value of the maximum likelihood cannot decrease, so more complex models will always improve the likelihood.
- How, then, do we compare models of different complexities?

### Model selection

- Goal 1: summarize the data, i.e., describe the data as precisely as possible.
- A general approach, based on data compression and information theory, uses a score function:

$$\text{Score}(M) = -\log p(\theta_M) - \log p(D \mid \theta_M)$$

i.e., the bits needed to describe the model plus the bits needed to describe the exceptions in the data.

- Goal 2: generalize to new data.
  - Goodness of fit is part of the evaluation, but since the data are not the entire population, we want to learn a model that will generalize to other, new data instances.
  - Thus we want to strike a balance between how well the model fits the data and the simplicity of the model.

### Score functions

- $\text{Score}(M) = \text{error}(M) + \text{penalty}(M)$, where the penalty may depend on the number of parameters in the model, $p$, and the number of data points, $n$.
- AIC (Akaike information criterion): $\text{Score}_{\text{AIC}} = -2 \log L + 2p$
- BIC (Bayesian information criterion): $\text{Score}_{\text{BIC}} = -2 \log L + p \log n$
- Other score functions: minimum description length, structural risk minimization.

### Announcements

- Next class: David Jensen, "Myths of Research in Computer Science," 12-1pm, Lawson 3102AB. Pizza will be provided.
- Homework: the review report is due today in class; HW 4 will be assigned on Thursday.
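The E-step/M-step updates and the BIC comparison described above can be sketched as follows. This is a toy implementation on synthetic data, not the lecture's code: the two-cluster dataset, the fixed iteration count, the small ridge added to each covariance, and the reported log-likelihood (taken at the final E-step) are all simplifying assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Synthetic data: two well-separated Gaussian clusters in 2-D (for illustration).
X = np.vstack([rng.normal(0.0, 1.0, size=(150, 2)),
               rng.normal(5.0, 1.0, size=(150, 2))])
N, d = X.shape

def em_gmm(X, K, n_iter=50):
    """Plain EM for a K-component GMM; returns (weights, means, covs, log-likelihood)."""
    N, d = X.shape
    w = np.full(K, 1.0 / K)                        # uniform mixing coefficients
    mu = X[rng.choice(N, K, replace=False)]        # initialize means at random points
    sigma = np.array([np.cov(X.T) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities gamma(z_nk) = w_k N(x_n|mu_k,Sigma_k) / sum_j (...)
        comp = np.column_stack([w[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                                for k in range(K)])
        gamma = comp / comp.sum(axis=1, keepdims=True)
        # M-step: weighted MLEs given the expected memberships.
        Nk = gamma.sum(axis=0)
        w = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    log_lik = np.log(comp.sum(axis=1)).sum()       # likelihood at the last E-step
    return w, mu, sigma, log_lik

def bic(log_lik, K, N, d):
    """Score_BIC = -2 log L + p log n, with p free parameters of a K-component GMM."""
    p = (K - 1) + K * d + K * d * (d + 1) // 2     # weights + means + covariances
    return -2.0 * log_lik + p * np.log(N)

scores = {K: bic(em_gmm(X, K)[3], K, N, d) for K in (1, 2, 3)}
best_K = min(scores, key=scores.get)               # lower BIC score is better
```

The iteration count stands in for a proper stopping criterion; in practice one monitors the log-likelihood, which the slides note can never decrease across EM steps, and stops once it changes by less than a tolerance.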
