# Class Note for CSCI 243 with Professor Bellaachia at GW

These 19 pages of class notes were uploaded by an elite notetaker on Saturday, February 7, 2015. The notes belong to a course at George Washington University taught in the Fall semester. Since upload, they have received 18 views.


A. Bellaachia

## Contents

1. Objectives
2. Clustering
   - 2.1 Definitions
   - 2.2 General Applications
   - 2.3 What is a good clustering?
   - 2.4 Requirements
3. Data Structures
4. Similarity Measures
   - 4.1 Standardize data
   - 4.2 Binary variables
   - 4.3 Nominal variables
   - 4.4 Ordinal variables
   - 4.5 Ratio-scaled variables
   - 4.6 Variables of mixed types
5. Clustering approaches
   - 5.1 Major approaches
   - 5.2 Partitioning approach
6. The k-means clustering method
7. The k-medoids clustering method
8. Hierarchical clustering
   - 8.1 AGNES (Agglomerative Nesting)
   - 8.2 Divisive Analysis (DIANA)
   - 8.3 Analysis of hierarchical clustering
9. Outliers
   - 9.1 Statistical Approach
   - 9.2 Distance-Based Approach

## 1 Objectives

- Techniques to group related data: classify datasets and provide categorical labels (e.g., sports, technology, kids).
- Detection of patterns.
- Models to predict certain future behaviors.

## 2 Clustering

### 2.1 Definitions

- Cluster: a collection of data objects that are
  - similar to one another within the same cluster, and
  - dissimilar to the objects in other clusters.
- Cluster analysis: grouping a set of data objects into clusters.
- Clustering is unsupervised classification: there are no predefined classes.
- Typical applications:
  - as a stand-alone tool to get insight into the data distribution;
  - as a preprocessing step for other algorithms.

### 2.2 General Applications

- Text mining
  - Document categorization
  - Detection of topics
  - Summarization
- Web mining
  - Web log analysis
  - Detection of groups of similar access patterns
- Bioinformatics
  - Gene expression data: detection of cancer genes
- Others: image processing, market analysis, etc.

### 2.3 What is a good clustering?

- A good clustering method produces high-quality clusters with
  - high intra-class similarity, and
  - low inter-class similarity.
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
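The intra-class versus inter-class criterion above can be checked numerically. The following sketch (the points, labels, and helper names are illustrative, not from the notes) compares the average within-cluster and between-cluster Euclidean distances for a hand-labeled toy dataset:

```python
def avg(xs):
    return sum(xs) / len(xs)

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

# Two hand-labeled clusters of 2-D points (illustrative data)
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
cluster_b = [(9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]

# Average distance between members of the same cluster...
intra = avg([dist(p, q) for c in (cluster_a, cluster_b)
             for i, p in enumerate(c) for q in c[i + 1:]])
# ...versus between members of different clusters
inter = avg([dist(p, q) for p in cluster_a for q in cluster_b])
print(intra < inter)  # True: a good clustering keeps intra-class distances small
```

A labeling for which `intra` is much smaller than `inter` is, by this criterion, a good clustering.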
### 2.4 Requirements

- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

## 3 Data Structures

- Data matrix (two modes): an $n \times p$ matrix of $n$ objects described by $p$ variables:

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

- Dissimilarity (or similarity) matrix (one mode): an $n \times n$ table of pairwise distances:

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & 0
\end{bmatrix}
$$

## 4 Similarity Measures

- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric $d(i, j)$.
- There is a separate quality function that measures the "goodness" of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.
- Types of data in clustering analysis:
  - Interval-scaled variables
  - Binary variables
  - Nominal, ordinal, and ratio variables
  - Variables of mixed types

### 4.1 Standardize data

- Calculate the mean absolute deviation:

$$
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)
$$

where

$$
m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)
$$

- z-score: calculate the standardized measurement

$$
z_{if} = \frac{x_{if} - m_f}{s_f}
$$

- Using the mean absolute deviation is more robust than using the standard deviation.
- Computation of data similarity: distances are normally used to measure the similarity or dissimilarity between two data objects. A popular family is the Minkowski distance:

$$
d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}
$$

where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two $p$-dimensional data objects and $q$ is a positive integer.

- If $q = 1$, $d$ is the Manhattan distance:

$$
d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
$$

- If $q = 2$, $d$ is the Euclidean distance:

$$
d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}
$$

- Properties:
  - $d(i, j) \ge 0$
  - $d(i, i) = 0$
  - $d(i, j) = d(j, i)$
  - $d(i, j) \le d(i, k) + d(k, j)$
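The Minkowski family and the mean-absolute-deviation standardization above can be sketched in plain Python (the function names and sample points are illustrative):

```python
def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points (q = 1: Manhattan, q = 2: Euclidean)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def standardize(column):
    """z-scores computed with the mean absolute deviation, as in Section 4.1."""
    n = len(column)
    m = sum(column) / n                       # mean m_f
    s = sum(abs(v - m) for v in column) / n   # mean absolute deviation s_f
    return [(v - m) / s for v in column]

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(x, y, 1))  # Manhattan: 3 + 4 + 0 = 7.0
print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 16 + 0) = 5.0
```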
- One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.

### 4.2 Binary variables

- A contingency table for binary data (object $i$ in rows, object $j$ in columns):

|         | 1       | 0       | sum     |
|---------|---------|---------|---------|
| **1**   | $a$     | $b$     | $a + b$ |
| **0**   | $c$     | $d$     | $c + d$ |
| **sum** | $a + c$ | $b + d$ | $p$     |

- Simple matching coefficient (invariant, if the binary variable is symmetric):

$$
d(i, j) = \frac{b + c}{a + b + c + d}
$$

- Jaccard coefficient (non-invariant, if the binary variable is asymmetric):

$$
d(i, j) = \frac{b + c}{a + b + c}
$$

- Example:

| Name | Gender | Fever | Cough | Test1 | Test2 | Test3 | Test4 |
|------|--------|-------|-------|-------|-------|-------|-------|
| Jack | M      | Y     | N     | P     | N     | N     | N     |
| Mary | F      | Y     | N     | P     | N     | P     | N     |
| Jim  | M      | Y     | P     | N     | N     | N     | N     |

  - Gender is a symmetric attribute; the remaining attributes are asymmetric binary.
  - Let the values Y and P be set to 1 and the value N be set to 0. Using the Jaccard coefficient:

$$
d(\text{Jack}, \text{Mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33
$$

$$
d(\text{Jack}, \text{Jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67
$$

$$
d(\text{Jim}, \text{Mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75
$$

### 4.3 Nominal variables

- A generalization of the binary variable, in that it can take more than two states (e.g., red, yellow, blue, green).
- Method 1: simple matching. Let $m$ be the number of matches and $p$ the total number of variables:

$$
d(i, j) = \frac{p - m}{p}
$$

- Method 2: use a large number of binary variables, creating a new binary variable for each of the $M$ nominal states.

### 4.4 Ordinal variables

- An ordinal variable can be discrete or continuous.
- Order is important (e.g., rank).
- Can be treated like interval-scaled variables:
  - Replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$.
  - Map the range of each variable onto $[0, 1]$ by replacing the $i$-th object in the $f$-th variable by

$$
z_{if} = \frac{r_{if} - 1}{M_f - 1}
$$

  - Compute the dissimilarity using methods for interval-scaled variables.

### 4.5 Ratio-scaled variables

- Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$.
- Methods:
  - Treat them like interval-scaled variables (not a good choice: the scale can be distorted).
  - Apply a logarithmic transformation: $y_{if} = \log(x_{if})$.
  - Treat them as continuous ordinal data: treat their rank as interval-scaled.
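Before moving on to mixed types, the Jack/Mary/Jim values from the binary-variable example above can be verified with a short sketch (the Y/P → 1, N → 0 encoding follows the notes; the function name is illustrative):

```python
def jaccard_dissimilarity(i, j):
    """d(i, j) = (b + c) / (a + b + c) for asymmetric binary 0/1 vectors."""
    a = sum(1 for u, v in zip(i, j) if u == 1 and v == 1)  # both present
    b = sum(1 for u, v in zip(i, j) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(i, j) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

# Y/P -> 1, N -> 0, over the asymmetric attributes Fever..Test4
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))   # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))   # 0.75
```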
### 4.6 Variables of mixed types

- A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.
- One may use a weighted formula to combine their effects:

$$
d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
$$

- The per-variable contribution $d_{ij}^{(f)}$ depends on the type of variable $f$:
  - If $f$ is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, and $d_{ij}^{(f)} = 1$ otherwise.
  - If $f$ is interval-based: use the normalized distance.
  - If $f$ is ordinal or ratio-scaled: compute the ranks $r_{if}$ and treat $z_{if} = \frac{r_{if} - 1}{M_f - 1}$ as interval-scaled.

## 5 Clustering approaches

### 5.1 Major approaches

- Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
- Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
- Density-based: based on connectivity and density functions.
- Grid-based: based on a multiple-level granularity structure.
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to that model.

### 5.2 Partitioning approach

- Partitioning method: construct a partition of a database $D$ of $n$ objects into a set of $k$ clusters.
- Given $k$, find a partition of $k$ clusters that optimizes the chosen partitioning criterion:
  - Global optimum: exhaustively enumerate all partitions.
  - Heuristic methods: the k-means and k-medoids algorithms.
    - k-means (MacQueen, 1967): each cluster is represented by the center of the cluster.
    - k-medoids, or PAM (Partitioning Around Medoids; Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster.

## 6 The k-means clustering method

- Given $k$, the k-means algorithm is implemented in four steps:
  - Step 1: Partition the objects into $k$ nonempty subsets.
  - Step 2: Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster).
  - Step 3: Assign each object to the cluster with the nearest seed point.
  - Step 4: Go back to Step 2; stop when there are no more new assignments.
- Analysis. Strength:
  - Relatively efficient: $O(tkn)$, where $n$ is the number of objects, $k$ the number of clusters, and $t$ the number of iterations. Normally $k, t \ll n$.
  - For comparison: PAM is $O(k(n-k)^2)$ and CLARA is $O(ks^2 + k(n-k))$.
  - Comment: k-means often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
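The four k-means steps above can be sketched in plain Python (a minimal sketch with naive random seeding; all names and the toy points are illustrative, and a production implementation would add better initialization and empty-cluster handling):

```python
import random

def squared_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means over tuples; returns (centroids, labels)."""
    # Step 1/2: pick initial seed points from the data
    centroids = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        labels = [min(range(k), key=lambda c: squared_dist(pt, centroids[c]))
                  for pt in points]
        # Step 2 (repeated): recompute each centroid as the mean point of its cluster
        new_centroids = list(centroids)
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
        if new_centroids == centroids:   # Step 4: stop when nothing moves
            break
        centroids = new_centroids
    return centroids, labels

pts = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
centroids, labels = kmeans(pts, k=2)
print(labels)  # the two tight groups end up in different clusters
```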
- Weaknesses:
  - Applicable only when a mean is defined; what about categorical data?
  - Need to specify $k$, the number of clusters, in advance.
  - Unable to handle noisy data and outliers.
  - Not suitable for discovering clusters with nonconvex shapes.
- Variations of the k-means method differ in:
  - selection of the initial $k$ means;
  - dissimilarity calculations;
  - strategies to calculate cluster means.
- Handling categorical data: k-modes (Huang, 1998):
  - replaces means of clusters with modes;
  - uses new dissimilarity measures to deal with categorical objects;
  - uses a frequency-based method to update the modes of clusters.
- A mixture of categorical and numerical data: the k-prototype method.
- Drawback of the k-means method: it is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
- k-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

## 7 The k-medoids clustering method

- Find representative objects, called medoids, in clusters.
- PAM (Partitioning Around Medoids, 1987):
  - starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering;
  - works effectively for small data sets, but does not scale well to large data sets.
- CLARA (Kaufmann & Rousseeuw, 1990)
- CLARANS (Ng & Han, 1994): randomized sampling
- Focusing + spatial data structure (Ester et al., 1995)
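The PAM swap step above can be sketched naively (this version recomputes the full cost for every candidate swap, so it only illustrates the idea; the names and data are illustrative):

```python
def total_distance(points, medoids, dist):
    """Cost of a clustering: each object contributes its distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist):
    """Naive PAM: swap a medoid with a non-medoid whenever it lowers the total distance."""
    medoids = list(points[:k])          # arbitrary initial medoids
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                if total_distance(points, trial, dist) < total_distance(points, medoids, dist):
                    medoids, improved = trial, True
    return medoids

manhattan = lambda p, q: sum(abs(a - b) for a, b in zip(p, q))
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(sorted(pam(pts, 2, manhattan)))  # [(1, 1), (8, 8)]: one actual object per cluster
```

Note that the returned representatives are always actual data objects, which is the point of the medoid idea.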
## 8 Hierarchical clustering

- Uses a distance matrix as the clustering criterion.
- Does not require the number of clusters $k$ as an input, but needs a termination condition.
- Agglomerative methods (e.g., AGNES) work bottom-up, merging clusters at each step; divisive methods (e.g., DIANA) work top-down, splitting them.

### 8.1 AGNES (Agglomerative Nesting)

- Introduced in Kaufmann and Rousseeuw (1990).
- Implemented in statistical analysis packages (e.g., S-Plus).
- Uses the single-link method and the dissimilarity matrix.
- Merges the nodes that have the least dissimilarity.
- Goes on in a non-descending fashion.
- Eventually all nodes belong to the same cluster.
- A dendrogram shows how the clusters are merged hierarchically:
  - Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
  - A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

### 8.2 Divisive Analysis (DIANA)

- Introduced in Kaufmann and Rousseeuw (1990).
- Implemented in statistical analysis packages (e.g., S-Plus).
- Inverse order of AGNES.
- Eventually each node forms a cluster on its own.

### 8.3 Analysis of hierarchical clustering

- Major weakness of agglomerative clustering methods: they do not scale well; the time complexity is at least $O(n^2)$, where $n$ is the total number of objects.
- Integration of hierarchical with distance-based clustering:
  - BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters.
  - CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction.
  - CHAMELEON (1999): hierarchical clustering using dynamic modeling.

## 9 Outliers

- What are outliers? Objects that are considerably dissimilar from the remainder of the data.
  - Example (sports): Michael Jordan, Wayne Gretzky.
- Problem: find the top $n$ outlier points.
- Applications:
  - Credit card fraud detection
  - Telecom fraud detection
  - Customer segmentation
  - Medical analysis

### 9.1 Statistical Approach

- Assume a model, i.e., an underlying distribution that generates the data set (e.g., a normal distribution).
- Use discordancy tests, which depend on:
  - the data distribution;
  - the distribution parameters (e.g., mean, variance);
  - the number of expected outliers.
- Drawbacks:
  - Most tests are for a single attribute.
  - In many cases, the data distribution may not be known.
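The statistical approach can be illustrated with a simple one-attribute discordancy-style test under a normal assumption (the 2.5-standard-deviation threshold and the sample data are illustrative choices, not from the notes):

```python
def zscore_outliers(values, threshold=2.5):
    """Flag values whose z-score under a normal assumption exceeds the threshold."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) / std > threshold]

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 42.0]
print(zscore_outliers(data))  # [42.0]
```

This also shows the drawback noted above: the test handles a single attribute and relies on an assumed distribution.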
[Figure: data values plotted against a 95% confidence interval; values falling outside the interval are flagged as discordant.]

### 9.2 Distance-Based Approach

- Introduced to counter the main limitations imposed by statistical methods: we need multidimensional analysis without knowing the data distribution.
- Distance-based outlier: a DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lies at a distance greater than D from O.
- Algorithms for mining distance-based outliers:
  - Index-based algorithm:
    - uses an R-tree indexing structure;
    - takes $O(kn^2)$, without counting the cost of building the tree.
  - Nested-loop algorithm:
    - divides the dataset into blocks and looks for outliers block by block;
    - has the same complexity as the index-based algorithm.
  - Cell-based algorithm:
    - divides the data space into cells and looks for outliers cell by cell rather than point by point;
    - takes $O(n^2)$.
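The DB(p, D) definition above can be prototyped with a brute-force scan (a sketch in the spirit of the nested-loop idea; the dataset and parameter values are illustrative):

```python
def db_outliers(data, p, D, dist):
    """Flag O as a DB(p, D)-outlier if at least a fraction p of the objects
    in the dataset lies at a distance greater than D from O."""
    outliers = []
    for o in data:
        far = sum(1 for x in data if dist(o, x) > D)
        if far >= p * len(data):
            outliers.append(o)
    return outliers

manhattan = lambda a, b: sum(abs(u - v) for u, v in zip(a, b))
data = [(0, 0), (0, 1), (1, 0), (1, 1), (50, 50)]
print(db_outliers(data, p=0.8, D=10, dist=manhattan))  # [(50, 50)]
```

No distribution is assumed and the test is multidimensional, which is exactly the motivation given above for moving beyond the statistical approach.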
