Microarray MethAnalysis BINF 636
Popular in Course
verified elite notetaker
verified elite notetaker
verified elite notetaker
verified elite notetaker
verified elite notetaker
verified elite notetaker
Popular in BioInformatics
This 159 page Class Notes was uploaded by Nathanael Schowalter on Monday September 28, 2015. The Class Notes belongs to BINF 636 at George Mason University taught by Staff in Fall. Since its upload, it has received 35 views. For similar materials see /class/215261/binf-636-george-mason-university in BioInformatics at George Mason University.
Reviews for Microarray MethAnalysis
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 09/28/15
Lecture 1 Introduction to Micorarrays and Concepts of Molecular Biology M Saleet J afri Program in Bioinformatics and Computational Biology George Mason University Lecture 1 Overview of Molecular and Cellular Biology Biological References Molecular Biology of the Cell by Bruce Alberts 1994 or newer edition Molecular Cell Biology by Darnell Lodish and Baltimore 1995 or newer edition Part 1 Molecular Biology Review Where do biological sequences come fromquot Cholesterol 239 a 66 23 Om cm 3905 0 Fatty chains Terrestrial Life aquot sum mm Malnmmw i fm WMquot Presumed common progemmr mu mm avgamsms Pmsunm common magma or archaehamena and emawms Cell types Prokaryotes no nuclear membrane repreSented by cyanobacteria bluegreen algae and common bacteria Escherichia coli Eukaryotes unicellular organisms such as yeast and multicellular organisms Archaebacteria no nuclear membrane but similar to eukaryotes in transcription and translation mechanisms discovered in deep sea thermal vents in 1982 Prokaryotic Cell Con roummmume wan 1mm membranei Rmosome Prokaryotic Cell 15 P rokarvotic CEIFI Nucleoid Septum i ca and cell wa l DIN Meso some Inner 39p asmal membrane ll quotV r 7 u i I 39l quot CE Wa Gunter membrane Pernplasmnc space R Inner plasma membrane i Nuc eo Outer membrane d FEM Eukaryote s In eukaryotes transCription is complex Many genes contain alternating exons and introns lntrons are spliced out of mRNA mRNA then leaves the nucleus to be translated by ribosomes GenOmic DNA entire gene including exons and introns The same genomic DNA can produce different proteins by alternative splicing of exons Complementary DNA cDNA spliced sequence containing only exons cDNA can be manufactured by capturing mRNA and performing reverse transcription Eukaryotic Cell Eh Eukarycttic cell Nuclear membrane Plasma membrane I 39 0ng vesmlesi x Pemxi 50 m e Lysosurne Mit chnndri n W x N 1 m I 7 5565 vesicle Endoplasmic reticulum in I Haugh Endaplasmic reticulum Eukaryotic Cell Organelles Cell membrane Nucleus Cytoplasm Endoplasmic Reticulum rough and smooth Golgi Apparatus received newly formed proteins from the ER and modi es them and directs them to nal destination Mitochondria respiratory centers have own circular DNA bacterial or1g1n Eukaryotic Cell Organelles Chromosomes chromatin histones centromeres and arms 2 pairs in eukaryotes Lysosomes contain acid hydrolases nucleases protea39ses glycodidases lipases phosphatases sulfatases phospholipases Peroxisomes use oxygen to remove hydrogen om substrates forming H202 abundant in kidney and liver detoxi cation Cytoskeleton Eukaryotic Membrane A t 7 P 1 I t 39 Ememm Oliigasatcharlde Glympmtmn Glycohpm when pm em Integral protein Cr 3 Chg9 1 35 it 1 a quotJS Kg x V 39 fig A quota I Leaflets w Phospholipid biiayer Hydrophobic care Fatty acyl tails Phos holi ic 39 Hydrophilit polar p p Peripheral pruteins head Integral protein Nucleic Acids Nucleic Acid Structure PUFHNES NHZ Nc Hhk 3 N H Adenin A PYRIM DINES u c Hm w W H Umil u H 39H c H Cvmsine C DNA Double stranded Four bases adenine A guanine G cytosine C and thymine T A and G are purines C and T are pyrimidines A always paired With T complementary C always paired With G complementary gt WatsonCrick base pairs bp DNA may consist of hundreds of millions bp A short sequence ltlOO is called an oligonucleotide I base pairs Bu r phosga ate ban rm quot ngdrlngen harms Nuclemid I Phosphate 39 2Deoxyribose Adenosine 5 monopnosphate AMP Central Dogma DNA gt RNA gt Protein DNA Deoxyribonucleic Acid RNA Ribonucleic Acid Protein Functional and Structural units of cells Flow of Information is unidirectional DNA Replication nfermatio n1 DNA ciupkicates W W nfurmalion W41 Transcription M W RNMWEES RNA 1 QRLTEL 7 y nucleus cytoplasm Translation Protein syn lesis Protein The Central Dogma oi Mulecular Biology Gene Transcription or DNA Transcription RNA molecules synthesized by RNA polymerase RNA polymerase found in ee and bound form RNA polymerase binds very tightly to promoter region on DNA Promoter region contains start site Transcription ends at termination signal site Primary transcript direct coding of RNA from DNA RNA splicing introns removed to make the mRNA mRNA contains the sequence of codons that code for a protein uracil replaces thymine splicing and alternative splicing Transcription of DNA to Messenger RNA WWW J Yrmseriplion Nuclexr RNA Messenger RNA Translation Ribosomes made of protein and ribosomal RNA rRNA Transfer RNA tRNA make connection between speci c codons in mRNA and amino acids As tRNA binds to the next codon in mRNA its amino acid is bound to the last amino acid in the protein chain When a STOP codon is encountered the ribosome releases the mRNA and synthesis ends Gene Translation iRNA links an amino acid to the codon on the mRNA Via the anii coa on rRNA RNA found in ribosomes ribosomes large and small subunit made of protein and rRNA initiator iRNA always carries methionine initiation factors proteins that catayze the start of transcription stop codon Endoplasmic Reticulum Posttranscriptional modi cation Translation Involves ribosomes and RNA Ribosomes made of protein and RN A Messenger RNA mRNA is the sequen ce transcribed from the DNA The mRNA is threaded through the ribosomes Transfer RNA tRNA brings the different amino a cids to the ribosome complex so that the amino acids can be attached to the growing amino acid chain tRNA L Protein amino acid chain L mRNA Amino acid Ribo some Hibosome RNA protain Growing polypeptide Chaim IRMA quot j if I 1ll leaving ul 1 39u C AG 5 r 35 aa7 LENA 4quot arriving LI 339 h Movemem of ribosome Codan Codm 30an Codon Sodom Coder Cde aa1 aaz 333 aad 335 aa aa 32 305 33 392 12 10 nucleotides I l Prnkarvutic mamma Euk ryutic rFlNA Fmtains Subunits Assembled L1 L2 L3 ribosum as 4 L A 17 R 53 4 x J 233 as 7 29m bases 120 bases 1Totall 3H liTDtal 21 I L1 L2 L3 1 165 1506 bases 3 WOW 503 235 585 E 4800 bases 160 bases 120 basesil s1 52 53 E a i 185 HEIDI basesb IiTotaJ 331i U uopoo ul aszq m o u o u 5 m n The Genetic Code Ix y G C ILI ll C G G A G C U U C G G A G G U A G Ribonucleic acid DNA Structure DNA contains Promoters Genes Junk DNA Reading frames An open reading frames ORF a contiguous sequence of DNA starting at a start codon and ending at a STOP codon w um mm w Mam DNA quotWquot mRNA mm W mm Chromosomes A genome a complete set of chromosomes Within a cell Different species have different numbers of chromosomes in their genomes Prokaryotes usually have a single chromosome often a circular DNA molecule Eukaryotic chromosomes appear in pairs diploid each inherited from one parent Homologous chromosomes carry the same gene39s Some genes are same in both parents Some genes appear in different fOrms called alleles Eg human blood has three alleles A B and 0 All genes are present in all cells but a given cell types only expresses a small portion of the genes Chromosomes Chromatin Structure Octameric hislone core 10 nm DNA H1 histone Nucleosome Gene Coding and Replication Double helix Nitrogenous bases ATGC SugarPhosphate backbone Nucleotide sugar base phosphate group Nucleoside sugar base Purines adenine guanine Pyrimidines cysteine thymine AT 2 H bonds GC 3 H bonds Gene Coding and Replication 5 end contains a phosphate group 3 end is free DNA extended om 5 to 3 Gene is a segment of DNA that codes for a speci c protein Exons are coding regions of the DNA Intron s are in between regions fOund in eukaryotes Codons Reading frame Consensus sequences are conserved regions found in a particular type of regulatory region ama ea luquot S DNA replica n A mama bequot An Chroma d 39 x V Mikn u apparatus anhasemn Ananhase can 3 Ax Daugmar rails Y an W Q WW1 9mg my magma Q mmw 1pr 3 mmmm mu lkun nu mm QI I l l C H 7 N7 c 7c new we mo 6 ammu a cavhule a r Emmus C eva as N o scHszwc smumca Protein Folding Primary structure amino acid sequence Secondary structure local structure such as 0t helix and B sheets Tertiary structure 3dimensional structure of 39a protein monOmer Quarternary structure 3dimensional structure of a fully functional protein protein complexes BrimFY mu mam mqummmmmmmmm Primary mcmr AmmuAmdx sequence A new We mm M 54mm mum Imcmm m mun5m Banimmamd Hahn mm a mcmre P ula she mm pram mum at mnmmmummmm Damn mama Wadas 1 of each atom Quatemax 1m 0 together anmemmy mquot smmum g a mum nnsin g m wareman an m m m Mechanism of enzyme activity Products 0 Or Enzyme Enzyme substrate Unsolved problem 3 15 i39 Cell Signaling and Biochemical Pathways Surface receptors Gproteins kinases etc Transcription factors Other biochemical reactions glycolysis citric acid cycle etc Molecular Biology Summary Life and evolution Proteins Nucleic Acids Eukaryotes versus Prokaryotes Ribosome Translation Transcription Central dogma Genetic code DNA structure Chromosome Mitosis Polymerase Chain Reaction In order to make enough RNA for microarray experiments PCR is used to amplify the amount of RNA Most PCR methods typically amplify nulceic fragments of up to 10 kilo base pairs kb although some techniques allow for amplification of fragments up to 40 kb in size Polymerase Chain Reaction A basic PC R set up requires several components and reagents These components include DNA template that cOntains the DNA region target tobe ampli ed TWO which are tothe DNA regions at the ve prime or three prime ends of the DNA region A such as or another DNA polymerase with a temperature optimum at around 700C I dNTPs also very commonly and erroneously called the building blocks om whiCh the DNA polymerases synthesizes a new DNA strand providing a suitable chemical environment for optimum activity and stability of the DNA polymerase or ions generally Mg2 is used but Mn2 can be utilized for PCRmediated DNA mutagenesis as higher Mn2 concentration increaSes the error rate during DNA synthesis Monovalem cation ions From Wikipedia Polymerase Chain Reaction The PCR is commonly carried out in a reaction volume of 10200 pl in small reactiOn tubes 0205 ml volumes in a thermal cycler The thermal cycler heats and cools the reaction tubes to achieve the temperatures required at each step of the reaction see below Many modern thermal cyclers makeuse of the Peltiel39 effect which permits both heating and cooling of the block holding the PCR tubes simply by reversing the electric current Thinwalled reaction tubes permit faVorabletherr39nal conductivity to allow for rapid thermal equilibration Most thermal cyclers have heated lids to prevent condensation at the top of the reaction tube Older thermocyclers lacking a heated lid require a layer of oil on top of the reaction mixture or a ball of wax inside the tube From Wikipedia Polymerase Chain Reaction Polymerase Chain Reaction Polymerase Chain Reaction Polymerase Chain Reaction Polymerase Chain Reaction Lecture 1 Introduction to Microarrays Microarray Data Analysis What is Microarray Data Types of Microarrays cDNA microarray Nylon membrane and plastic arrays by Clontech Oligonucleotide silicon chips by Affymetrix Note Each new version of amicroarray chip is at least slightly different from the previous version This means that the measures are likely to change This has to be taken into account when analyzing data cDNA Microarray The expression level eij of a gene 139 in sample is expressed as a log ratio logrigj of the log of its actual expression level ri in this sample over its expression level g in a j I control When this data is visualized eij is color coded to a mixture of red rij gtgt g and green rij ltlt g and a mixture in between Nylon Membrane and Plastic Arrays by Clontech A raw intensity and a39background value are measured for each gene The analyst is free to choose the raw intensity or can adjust it by subtracting the background intensity Oligonucleotide Silicon Chips by Affymetrix quot f f 7 For What Do We Use Microarray Data For What Do We Use Microarray Data Genes with unusual expression levels in a sample In contrast to standard statistical methods where we ignore outliers here outliers might have particular importance Hence we look for genes whose expression levels are very different from the others Genes whose expression levels vary across sample s We can compare gene expression levels of a particular gene or set of genes in different samples This can be used to look compare normal and diseased tissues or diseased tissue before and after treatment For What Do We Use Microarray Data Samples that have similar expression patterns We might want to compare the expression patters of all genes between two samples We might cluSter the genes into gene with similar expression patterns to help with the comparison This can be used to look compare normal and diseased tissues or diseased tissue before and after treatment Tissues that might be cancerous diseased We can take the gene expression pattern of sample and compare it to library expression patterns that indicate diseased or not diseased tissue Statistical Methods Can Help Experimental Design Since using microarrays is costly and time consuming we Want to design experiments to use the minimal number of micorarrays that will give a statistically significant result Data Preprocessing It is sometimes useful to preprocess the data prior to visualization An example of this is the log ratio mentioned earlier It is often necessary to rescale data from different microarrays so that they can be compared This is due to variation in chip to chip intensity Another type of preprocessing is subtracting the mean and dividing by the variance Statistical Methods Can Help Data Visualization Principle component analysis and multidimensional scaling are two useful techniques for reducing multidimensional data to two and three dimensions This allows us to Visualize it Cluster Analysis By associating genes with similar expression patterns we might be able to draw conClusions about their functional expression Probability Theory We can use statistical modeling and inference to analyze our data Probability theory is the basis for these Statistical Methods Can Help Statistical Inference This is the formulation and statistical testing of a hypothesis and alternative hypothesis Classifiers f0r the Data We can construct Classes from data such a diseased vs nondiseased tissue We can build a model such as a hidden Markov model that fits know data for the different classes This can then be used to classify previously unclassified data Preprocessing Microarray Data Before microarray data can be analyzed or stored a number of procedures or transformations must be appliedto it In order to analyze the data correctly it is important to understand What the transformations might be doing to the data Preprocessing Microarray Data Ratioing the data Logtranforming ratioed data Alternative to ratioing the data Differencing the data Scaling data across chips to account for chiptochip difference Zerocentering a gene on a sample expression pattern Weighting the components of a gene or sample expression pattern differently Handling missing data Variation filtering expression patterns Discretizing expression data Ratioing the data This is the most popular transformation The expression level eij of a gene i in sample j is expressed as a ratio rijgi of its actual expression level rij in this sample over its expression level gi in a control This tells us the level of under or over expression of a gene i in the sample j If the control value gi is very small it can make the ratio very big This can skew results incorrectly Lo gtranforming ratioed data This is also a popular transformation The expression level eij of a gene i in sample j is expressed as alog ratio logrij g of the log of its actual expression level rij in this sample over its expression level gi in a control This will suppress outliers caused when the control value g is very small However it creates a new outlier when rij is very small Alternative to ratioing the data An alternative that eliminated both of the outlier problems above is rzj ry l39gi This gives a value in 01 and can be interpreted at the probability of gene i is higher in sample j than in control Differencing the data Another transformation is to difference the data ie rij g This is not really appropriate in our previous context However this is used by Affymetrix in a different context In their data r1 is the strength of the match of the target 139 to a specific probe j and g is he strength of the match of the target 239 to a contrOl for this probe Scaling data across chips to account for chiptochip difference As mentioned previously different chips might display different intensities When comparing different chips the data might need to be scaled so that they are on the same scale Alternatively they can be normalized so that they are between 01 and compared Zerocentering a gene on a sample expression pattern This in effect the same as subtracting the mean expreSsion pattern Suppose that x is an expression pattern for a particular gene gi whose components are logratios Let Where is the average expression pattern or control Then x indicates Whether the gene gi is induced or repressed relative to control Remember the x s are vectors Subtracting the mean expression pattern and dividing by the standard deviation Can accomplish this x f 6 Remember the x s are vectors Weighting the components of a gene or sample expression pattern differently If we have a matrix of weights Wdiagw1 wn we can weight the expression patterns by In this way we can weight the contributions from different genes differently Handling missing data Sometimes components of an expression pattern X are missing To x this the missing values can be replaced by the mean over the nonmissing values in x Variation ltering expression patterns When We are performing cluster analysis on gene or sample expression patterns patterns with low variance will all seem suf ciently similar to each other and might form a cluster This cluster will probably not re ect any interesting result Discretizing expression data Sometimes we might want to convert gene or Sample expression pattern into discrete values For example if we have logratio we may want to simply look 39at whether something is up or downregulated To do this we can do the following 39l when xi gt O xb xbgi where xb i 1 when xi lt O 0 whenxi 0 In this case l would indicate upregulation 0 would indicate no change and 1 would indicate downregulation Measuring Dissimilarity of Expression Data We might want to compare two or more gene or sample expression patterns This might be used to differentiate between diseased and normal cells or nding out the genetic similarity of tissues To do this we need a distance metric or a dissimilarity measure Example Distance Metric Euclidean Distance This is the most c0mmon distance measure 0 06 y This should not be used if either 1 Not all components of the vectors being compared have equal weight 2 There is missing data Preprocessing the data can often alleviate these problems We can also use the normalized Euclidean distance 206139 yi2 dxyi ZT Example Dissimilarity Measures Maximum Coordinate Difference The following computes the maximum absolute distance along a coordinate d39x9y maXIxl yi Dot Product This is a dissimilarity version of the dot product dxyxoy Visualizing Micorarray Data Principal Components Analysis In principal components analysis ndimension39al data is converted to ddimensional data dltltn such that the components in the new space are uncorrelated and axis or dimensions of the new space are ordered with respect to the amount of variance they explain The first component explains the most about the data The second component is orthogonal to the rst and explains more about the data and so on Principal Components Analysis h quot i j 5 i 39 7 39 i i C m 4nrr 1 12 A j q I l 17 lg 1 u A quotA r r 4 ii W i hf r i PCA and Microarrays Sample Application 1 If we want to compare the sample expression patterns from two groups diseased vs normal experimental vs control If we have n genes the each pattern is a point in ndimensional space Suppose we want to see if the sample expression patterns for these two groups cluster by group We might want to perform PCA analysis and perform cluster analysis at the top three components PCA and Microarrays Sample Application 2 On gene chips such as the one made by Affymetrix the same gene occupies multiple cells In theory the expression level of all cells with the same gene should be perfectly correlated However in practice this is often not the case due to imperfections in the technology or hybridization of39the sequence fragment to other genes in the target PCA and Microarrays PCA allows us to see how good the correlation among these cells is To use PCA we wou1d hybridize k different samples on the same chip For each sample the expression levels of39a gene X in the n cells is an ndimensional vector Hence there are k points in ndimension al space Using PCA if most of the variance is explained by the first principal component the effective dimensionality of39the data is l and there cells are highly correlated PCA Limitations Clustering by PCA effectively yields clusters as if the Euclidean distance metric had been used Hence it is possible that it might miss clusters The reduction of dimensionality uses all coordinates If only a few genes out of 39a thousand differ between two samples Application 1 clustering by PCA might not yield any meaningful results Cluster Analysis of Microarray Data Cluster Analysis of Microarray Data Hierarchical Clustering Assume each data point is in a singleton cluster Find the two clusters that are closest together Combine these to form a new cluster Compute the distance from all clusters to the new cluster using some form of averaging Find the two closest clusters and repeat Cluster Analysis of Microarray Data kMeans Clustering An alternate method of clustering called kmeans clustering partitions the data into k clusters and nds cluster means ui for each Cluster In our case the means will be vectors also Usually the number of clusters k is fixed in advance To choose k something must be know about the data There might be a range of possible k values To decide which is best optimization of a quantity that maximizes cluster tightness ie minimizes distances between points in a cluster Cluster Analysis of Microarray Data Hidden Markov Models and Microarray Data Finding Genes Expressed Unusually Different in a Population The following section addresses the question Is gene g expressed unusually in the sample The rst thing to do is to come up with a formal mathematical de nition for What unusual is Assume that the microarray data is log transformed ratio data If a histogram is constructed of the data it should yield roughly a normal distribution Anything that is out near either tail can be considered to be unusually expressed Note that this can be either a high or low expression level BINF 636 Lecture 9 Clustering How Do They Make and Interpret Those Dendrograms and Heat Maps Differences Between Unsupervised Clustering and Classification Description Clustering for the purpose of this lecture is the exploratory partitioning of a set of data points into subgroups clusters such that members of each subgroup are relatively similar to each other and members of distinct clusters are relatively dissimilar For example one might have gene expression pro les from a set of samples of a particular type of tumor and wish to see if the samples separate out into distinct subgroups In this case one could be looking to uncover evidence of previously unknown subtypes or one might wish to see if the results of clustering the gene expression pro les are consistent with classi cation by histopathology this class we will describe how dendrograrns such as the example to the right are constructed using hierarchical agglomerative clustering where one starts with each of the data points as an individual cluster and in successive steps combines the pair of clusters that are closest to each other into one new cluster This requires specifying a distance I I I I I measure between data points and between clusters Each clustering step reduces by one the number of existing clusters until at the end of the nal step there is one cluster containing all the data points If one has ordered the data points along a line so that at each step the clusters that are joined together are adjacent to each other one draw a corresponding diagram dendrogram where the heights of the vertical lines re ect the distance between the pair of clusters joined at each stage of the procedure If one has eg microarray data from a set of tumor samples one can cluster both the tissue sample gene expression pro les and the expression pro les 7 r of the genes across the tissue samples thus determining a corresponding ordering of the tumor samples and of the genes One can then color code each rectangle representing the expression level of one gene in one tumor sarnp e producing a heat map such as the example to the right We will describe how these procedures are carried out and how the resulting hierarchies of clusters can depend on the speci cations for distances between data points and between clusters The type of changes in appearance that may occur in a dendrogram in response to small changes in the data points will also be illustrated An individual dendrogram is essentially a one dimensional ordering of a data set in contrast with two or three dimensional visualizations that can be obtained by principal component analysis PCA which is the subject of Lecture 10 The difference between exploratory unsupervised clustering and classi cation will be noted along with the importance of proper validation of classi cation methods Genes quotmumType quotTumu ypez 7 Alan E Berger PhD J HBMC Lowe Family Genomics Core Johns Hopkins University School ofMedicine aberger9 jhmiedu 410 5505089 Clustering example modified version of Figure 1 of A A Alizadeh et al Distinct Types of Diffuse Large BCell Lymphoma Identified by Gene Expression Profiling Nature 403 3 Feb 2000 pp 503511 Centroid clustering on log of fold changes of measured expression levels with Pearson correlation similarity for tissue samples columns and cosange similarity for genes rows Fold changes are ratios of mRNA expression level in tissue sample relative to mRNA level in reference pool The values in each row gene were median centered before the clustering heat map plot Heat map color code for coloring of each tissue sample gtlt genej rectangle is at the bottom ofthe figure mm 5mm mm a m nnaas39lansll Acllvale am a mum lll39llrnmlli x u I Pan B cell I Germinal Centre B cell T cell I Activated B cell I Prolilerallon I Lymph mm 72 yr 0 l uzsu USUU lunn um um fold change sauas 2 logzfold change Information from a Clustering Example of Single Linkage Hierarchical Clustering D Hierarchical Points f5 associations belng C of groupings clustered g of the points h point 1s an g a ntuple of numbers here have Ky Pairs in 1Dimensional ordering of the points components for each point e g Inicroanay data Each point has form P X1 X2 xquot Clustering How Do They Make Those Dendrograms and Heat Maps Outline De nition of unsupervised clustering Dendrogram construction by hierarchical agglomerative clustering given speci ed intercluster and interpoint distance measures Uniqueness of the dendrogram if an unambiguous choice of leftright ordering is speci ed for each join of two clusters in the dendrogram construction Dependence of the clustering dendrogram on the definition of intercluster distance Additional examples Heat map construction Brie y noting other methods for clustering and data visualization The difference between exploratory and supervised clustering Some References for Clustering 1 R O Duda P E Hart and D G Stork Pattern Classi cation Second Edition John Wiley amp Sons 2001 Chapter 10 Unsupervised Learning and Clustering 2 R A Johnson and D W Wichern Applied Multivariate Statistical Analysis Fourth Edition Prentice Hall 1998 Chapter 12 Clustering Distance Methods and Ordination 3 J Quackenbush Computational Analysis of Microarray Data Nature Reviews Genetics 2 2001 pp 418427 4 R Simon M D Radmacher K Dobbin and L M McShane Pitfalls in the Use of DNA Microarray Data for Diagnostic and Prognostic Classification Journal of the National Cancer Institute 95 2003 pp 1418 ltexploratory class discovery supervised classification proper use of crossvalidationgt see also M West et al Predicting the Clinical Status of Human Breast Cancer by Using Gene Expression Profiles PNAS 98 2001 pp 1 146211467 5 httpenwikipediaorgwikiDataclustering 6 M B Eisen P T Spellman P 0 Brown and D Botstein Cluster Analysis and Display of Genomewide Expression Patterns PNAS 95 1998 pp 1486314868 7 Software for performing a variety of clustering methods is available in ltwith usual disclaimersgt e g R open source amp open source R programs see in particular the Bioconductor suite of software and in general data analysis software such as MATLAB and lDL statistics packages SAS etc commercial packages for microarray analysis such as Partek GeneSpring GeneSifter JMP Genomics and the MATLAB Bioinformatics Toolbox free software such as Cluster httpranalblgovEisenSoftwarehtm and GenePattern httpwwwbroadmiteducancersofUNaregenepatterm software available at NIH including the Mathematical and Statistical Computing Lab toolbox of scripts that complement and interface with the JlVlP statistics package httpabscitnihgovMSCLtoolbox BRB Array Tools http linusncinih govBRBArrayToolshtml and mAdb httpnciarrayncinih g0V 8 M Zvelebil and J O Baum Understanding Bioinformatics Garland Science NY 2008 Chapter 16 Clustering Methods and Statistics 9 D Stekel Microarray Bioinformatics Cambridge University Press Cambridge 2003 Chapter 8 Analysis of Relationships Between Genes Tissues or Treatments CLUSTERING Without Training Data Unsupervised Clustering Exploratory separation of data into groups of points Points in distinct groups to be more different than points within one group Discovery of classes Information about the data may be used to evaluate the results but is NOT used in doing the clustering With Training Data Supervised Clustering Accurate classi cation of new data points Veri cation of accuracy of the training data Discovery of additional classes Information about the training data is used to do the clusteringclassi cation Components of the Clustering Process Example of Single Linkage Hierarchical Chstering Intercluster Distance Each point or item being clustered could eg be a set of gene expression values from a tumor sample A cluster is a subset of the items being clustered Start with each individual point as a cluster and successively combine the closest pair of clusters into one new cluster End with one cluster Standard Euclidean Distance between two points Y y W Ls K Distance2 between Xy and rs is Xr2 ys2 Distance2 between p1 pn and q1 qn is pl102 pm1amp2 De nitions of InterCluster Distances Single Linkage mm Q Complete Linkage O O O O Q max Average Linkage Centroid Hierarchical Agglomerative Clustering 1 Make choice of intercluster distance and specify the distances dissimilarities between points 2 Start with each point as a singleton cluster 3 At each step join the pair of clusters that have the smallest distance between them Draw vertical line from top of each joined cluster up to height distance between them connect with horizontal line Top of newjoined cluster is midway between them 4 To avoid crossed lines must have ordered the points so that at each step joined clusters are next to each other get unique dendrogram if specify rule for leftright order at each join Example of Single Linkage Hierarchical Clustering Distances dij 1 relevant d16 or Single Linkage 06 are given m o lmer cluswr Distance 3 At each step join the pair of clusters that have the smallest distance between them Draw vertical line from top of each joined cluster up to height distance between them connect with horizontal line Top of new joined cluster is midway between them Consequences of Improper Ordering of the Points Intercluster Distance 8 Imwmqer Dmmm e a m m quotp M grim II 34557 Totally Uninformative 11608 quot6311 Finding an Admissible Ordering by Left Sliding Single Linkage Example IIIII 12525 16 3 2 S 04 starting with the ordering 0 1 2 3 4 5 6 7 rst merge is 16 so form 0 16 2 3 4 5 7 next merge is 163 so form 0 163 2 4 5 7 next merge is 04 so form 04 163 2 5 7 next merge is 047 so form 047 163 2 5 next merge is 25 so form 047 163 25 next merge is 047163 so form 047163 25 last merge is 04716325 so corresponding ordering of the points is 0 4 7 1 6 3 2 5 with this ordering there will be no crisscrossing of lines when draw the dendrogram Exercise try starting with 5 3 7 6 4 1 2 0 Finding an Admissible Ordering by Left Sliding Single Linkage Example starting With the ordering 5 3 7 6 4 1 2 0 rst merge is 16 so form 5 3 7 614 2 0 next merge is 163 so form 5 361 7 4 2 O a next merge is 04 so form 5 361 7 4O 2 next merge is 047 so form 5 361 740 2 next merge is 25 so form 52 361 740 next merge is 047163 so form 52 361740 0 s 2 s s 1 7 4 a last merge is 52361740 so corresponding ordering of the points is 5 2 3 6 1 7 4 O with this ordering there will be no crisscrossing of lines when draw the dendrogram Example of Smgle Linkage Hierarchical Cllmtering Distances d In Class Exercise sta with the admissible 1 relevant a or 52 m ordering 2 5 0 4 7 6 1 3 e are gia en and draw the resulting singlelinkage dendrogram Imercluster Distance 2 504 B 1l 3 IL o 39 S 3 At each step join the pair of clusters that have the smallest distance between them Draw vertical line from top of each joined cluster up to height distance between them connect with horizontal line my of new joined cluster is midway between them Intercluster Distance Intercluster Distance Intercluster Distance Intercluster Distance Intercluster Distance Intercluster Distance Intercluster Distance ers Com at 0 N wAMl Microarrcy Don C Avcvogc Li wkogc Mama kt cp top 600 var gone 7 T 00hr c 0L A AM chroorvcy Dom 1453ch T Average Lwcge lndclu keep rep 600 var genes 77 Essa II II 39 Acute Lymahoo uszc Leummic AL39 BiceH Acute Mvekzk Lemrnm w Acme Lymohcbm c Leummc Acute Myc mc Lcakcmic w T com a m M AM Microurrcy Dom C1519 Average Li mcgc mmcm keep top 500 mr genes T Com c oL H AM rwgrourrcy 0th cm Average Liwcgc Mam kccp rop BUD var gmcs n 3 r7 1 ntzrcmsier mng o lerrmsie M31522 Acu Lymahco as Leumrv k Aw EiccH Acute Lymshcmusir LamaWk ALJ39 aim Acute Mmequot LCJKBWR 3M m Acute Myc awc Lcukurmc m Clustering How Do They Make Those Dendrograms and Heat Maps Outline De nition of unsupervised clustering Dendrogram construction by hierarchical agglomerative clustering given speci ed intercluster and interpoint distance measures Uniqueness of the dendrogram if an unambiguous choice of leftright ordering is speci ed for each join of two clusters in the dendrogram construction Dependence of the clustering dendrogram on the definition of intercluster distance Additional examples Heat map construction Brie y noting other methods for clustering and data visualization The difference between exploratory and supervised clustering Example of Single Linkage Hierarchical Clustering Distances d1 1 relevant aw 12 for ller01115145 Distance cf Eisen et a1 6 Example of a way of specifying leftright orderings Examples oI Poinl Rankings left 0 7 1 4 6 2 3 5 GeneX At each join onwo subcluswrs place the one left 5 2 7 4 0 3 6 1 Genev containing the le mosl poinl number on thelel t 2 d2525 left 1 2 3 0 4 7 5 6 Gene G Could also use average gene expression Eisen aal Dendrograrn is indep of original order given unique lerright orderings 11608 quot6311 Finding an Admissible Ordering by Left Sliding LeftRight by Gene ma i X Single Linkage Example I I I I I 12525 04 starting with the ordering 6 5 7 4 3 2 1 0 Gene X left right point ranking 0 7 l 4 6 2 3 5 rst merge is 16 so form 16 5 7 4 3 2 0 next merge is 163 so form 163 5 7 4 2 0 next merge is 04 so form 163 5 7 04 2 next merge is 047 so form 163 5 047 2 next merge is 25 so form 163 25 047 next merge is 047163 so form 047163 25 last merge is 04716325 so get 0 4 7 1 6 3 2 5 Exercise try starting with 3 5 0 2 6 7 1 4 Intercluster Distance Intercluster Distance Ordering Ordering from Gene X Intercluster Distance Intercluster Distance Ordering from Ordering from Gene Y Clustering How Do They Make Those Dendrograms and Heat Maps Outline De nition of unsupervised clustering Dendrogram construction by hierarchical agglomerative clustering given speci ed intercluster and interpoint distance measures Uniqueness of the dendrogram if an unambiguous choice of leftright ordering is speci ed for each join of two clusters in the dendrogram construction Dependence of the clustering dendrogram on the definition of intercluster distance Additional examples Heat map construction Brie y noting other methods for clustering and data visualization The difference between exploratory and supervised clustering Data for 1Dimensional Clustering Example v1v2 v3v4v5 v6 v7 v8 points on line 1011 12 14 17 21 25 distancesVi1Vi wwmr D skmc Hierarchicc C ustering of Linear Example Hierm39chlcul Clustering of Linear Examp e 250 nlmmm DFiLnnre comp ete anoge sing e Wage Hierorchlcal Cmstering of Linear Example mo nmmmr Distance average mege Warning hierarchical agglomerative clustering programs will always return dendrograms even for random data should use any information available about data to judge usefulness of clustering as well as compactness and separation of clusters use eg PCA to visualize data and clusters Correlation Distances Suppose each point is a vector of logfold changes eg logtreated gene expression level control expression level Two such vectors V W are well positively correlated if when an entry in V is gt O the corresponding entry in W is gt O and when Vj is lt 0 then Wj is lt 0 Two such vectors V W are highly negatively correlated if when an entry in V is gt O the corresponding entry in W is lt O and when Vj is lt 0 then Wj is gt 0 Two common choices of correlation distance are dVW l cosangle between the vectors V and W dVW l cosangle between the vectors V and W V 9 W Clustering How Do They Make Those Dendrograms and Heat Maps Outline De nition of unsupervised clustering Dendrogram construction by hierarchical agglomerative clustering given speci ed intercluster and interpoint distance measures Uniqueness of the dendrogram if an unambiguous choice of leftright ordering is speci ed for each join of two clusters in the dendrogram construction Dependence of the clustering dendrogram on the definition of intercluster distance Additional examples Heat map construction Brie y noting other methods for clustering and data visualization The difference between exploratory and supervised clustering REPORTS Molecular Classification of Cancer Class Discovery and Class Prediction by Gene Expression Monitoring T R Golub1392T D K Slonim1T P Tamayo1 C Huard1 M Gaasenbeek1 J P Mesirov1 H Coller1 M L Loh2 J R Downing3 M A Caligiuri4 C D Bloomfield4 E S Lander1395 Although cancer classification has improved over the past 30 years there has been no general approach for identifying new cancer classes class discovew or for assigning tumors to kno wn classes class prediction Here a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case A class discovery procedure automatically discovered the distinction between quotquot39AMLnA I without 1 1 r previous knowledge of these classes An automatically derived dass predictor e e I 39 I I I I I 39 Tim rnclllfc I Ilcw the feasibility of cancer classification based solely on gene expression moni toring and suggest a general strategy for discovering and redicting cancer classes for other types of cancer independent of previous biological knowledge The challenge of cancer treatment has been to target speci c therapies to pathogenetically distinct tumor types to maximize ef cacy and minimize toxicity Improvements in can cer classi cation have thus been central to a vances in cancer treatment Cancer classi cation has been based primarily on morpho logical appearance of the tumor but this has serious limitations Tumors with similar his topathological appearance can follow signif icantly different clinical courses and show different responses to therapy In a few cases such clinical heterogeneity has been ex plained by 39 39 39 g morphologically similar tumors into subtypes with distinct pathogen eses Key examples include the subdivision of acute leukemias nonHod ms 1 ho mas and c idhoo or subclassified into neuroblastomas rhabdo myosarcoma Ewing s sarcoma and other types 2 For many more tumors important subclasses are likely to exist but have yet to IWhitehead InstituteMassachusetts Institute of Jude Children s Research Hospital Memphis TN 38105 USA 4Comprehensive Cancer Center and Can cer and Leukemia Group B Ohio State University Columbus OH 43210 USA 5Department of Biology Massachusetts Institute of Tedinology Cambridge MA 02142 USA To whom correspondence should be addressed E mail golub genomewimitedu lander genomewi mitedu TThese authors contributed equally to this work www5ciencemagorg SCIENCE VOL 286 be de ned by molecular markers For exam ple prosmte cancers of identical grade can have widely variable clinical courses from indolence over decades to explosive growth causing rapid patient death Cancer classifi cation has been dif cult in part because it has historically relied on speci c biological irr sig e te 39c and unbiased approaches for recognizing tumor subtypes Here we describe such an approach based on global gene expression anal sis e divided cancer classi cation into two challenges class discovery and class predic tion Class discovery refers to defining pre viously unrecognized tumor subtypes Class prediction refers to the assignment of partic u tumor samples to alreadydefmed class es which could reflect current states or future outco es We chose acute leukemias as a test case Classi cation of acute leukemias began with the observation of variability in clinical out come 3 and subtle differences in nuclear morphology 4 Enzymebased histochemi ca analyses were introduced in the 1960s to demonstrate that some leukemias were peri odic acidSchiff positive whereas others were myeloperoxidase positive 5 This pro vided the first basis for classification of acute leukemias into those arising om yrnphoid precursors acute lymphoblastic leukemia ALL or from myeloid precursors acute my eloid leukemia AML This classi cation was further solidi ed by the development in the 1970s of antibodies recognizing either lymphoid or myeloid cell surface molecules 6 Most recently particular subtypes of acute leukemia have been found to be asso ciated with speci c chromosomal transloca tionsifor example the t1221pl3q22 translocation occurs in 25 of patients with ALL whereas the t821q22q22 occurs in 15 of patients with AML 7 Although the distinction between AML and ALL has been well established no single test is currently suf cient to establish the diagnosis Rather current clinical practice his oc emistry imrnunophenotyping and cy togenetic analysis each performed in a sep arate highly specialized laboratory Although usually accurate le 39a classi cation re mains imperfect and errors do occur istinguishing ALL from AML is critical for successful treatment chemotherapy regi mens for ALL generally contain corticoste roids vincristine methotrexate and L asparagi nase whereas most AML regimens rely on a backbone of daunorubicin and cymrabine 8 Although remissions can be achieved using ALL therapy for AML and vice versa cure rates are markedly diminished and unwarrant ed toxicities are encountered We set out to develop a more systematic approach to cancer classi cation based on the simultaneous expression monitoring of thou sands of genes using DNA microarrays 9 It has been suggested 10 that such microar rays could provide a tool for cancer classi cation Microarray studies to date 11 how ever have primarily been descriptive rather than ana ytical and have focused on cell cul ture rather than primary patient material in which genetic noise might obscure an under lying reproducible expression pattern be an with class prediction How could one use an initial collection of samples belonging to lmown classes such as AML and ALL to create a class predictor to classify new unlmown samples We devel oped an analytical method and first tested it on distinctions that are easily made at the morphological level such as distinguishing normal kidney from renal cell carcinoma 12 We then turned to the more challenging problem of distinguishing acute leukemias whose appearance is highly similar Our initial leukemia data set consisted of 38 bone marrow samples 27 ALL 11 AML obtained from acute leukemia patients at the time of diagnosis 13 RNA prepared from T Golub et 0 ALLAML Mieroorroy Date Clusters 64 Intercluster Distance 01 M 4O O 3 41214252391819 61524117202126272934303337353632311 2 5 91022513251715 Single Li koge lnd0t0 keep top 600 vor genes AcuteLym phobms c Leuken o ALL B ceH T CeH Acute Myek d Leuke o AML nanes plotted Tue Oct 08 020808 2002 plothierollcm3npro Oct 8 2002 CbergerbiochemgolubdotooIomltroim ulpoth3n singeps T Golub et ai ALLAML Mieroarray Data Clusters Intercluster Distance 86 38 01518251124 3 23 S 7 2126201711 E 910 2 5 22 4121419163536272831323729343033 Complete Linkage lndata keep top 600 var genes Acute Lymphoblastic Leukemia ALL B cell T Cell Acute Myeioid Leukemia AML nnlines plotted Tue Oct 08 022411 2002 plothierallami3npro Oct 8 2002 Cbergerbiochemgolubdataalamltrainfullpiotth competeps T Golub et ai ALLAML Microarray Data Clusters Intercluster Distance 380 635 O 41214231915182517 3 S 7 212620113 2 5 22 910 816H243227353629343033372831 Average Linkage lndata keep top 600 var genes Acute Lymphoblastic Leukemia ALL B cell T cell Acute Myetoid Leukemia AML nnlines plotted Tue Oct 08 022045 2002 plothierallamt3npro Oct 8 2002 CbergerbiochemgolubdataaIamltraim ulIpiothC m averageps 9 New Golul 01 CL AllAMl lvlicroorrcy Cit Clusmrs Note sometimes the heights at which combine successive subclusters can decrease when using the centroid set distance This never happens when using single complete or average linkage O 20 Centloid Moots keep lop 600 vcr genes Acme Lymanomm Leumn iz mi 2 u 10 12 10 Successive centroid distances are 10 1o 22 20 note sqrt521 2 22825 Aculn Aycloid Lcuxcmzc ML PCA of Golub training data using 600 top variance genes B vs M 30 Genes B vs T 30 Genes 3O Genes T vs M Expression Levels for Genes Classifying Tumor Type 3430332729373536323128 AML 412140 3 25231819 515241172021261716 ALL B 25911022813 ALL T Color scale for z ranges of expression for each gene 0 units 25 20 15 10 05 00 T R Golub et al Acute Leukemia Data Science V286 15 Oct 1999 ALL B is B cell acute lymphablastic leukemia ALL T is T cell acute lymphoblastic leukemia AML is acute myeloid leukemia Tue Mar 11 104130 2003 plotcdata3pro March 11 2003 Cbergerbiochemgolubdataaam3x30genes3psps u rj mum Breaking the c m Color Barrier Bolt3515 umaly us red end green Stai39ISIO depict St JCtUl39 inside cells But that s I erd on ru matherswith Ihl most CDI39I I Comtlonal and magentatinged sllde as mn by 111 mlorblhd mon form ofoolcrbljndnessm in magnum and me 39n alum n1alug39fa39uxamplnman ttull them hlJ aparthm help son 111 my fmrn a pair of colorblind japanes scientistsmm have parsnaded Japan39s baring mutual haloml jcumal to start pnnti39lg imagus thumbrblind cm hturpret Kai Itoa momma HeLroszientist at 1113 Unh39a39sity of39lhkyo and Hasa l39dia Dhabe cu39fthe Ma ond Institute of enetics in Mishin u w thatw ith software it39s easy tocorn39asrt 1h reds into magen tas which contain enough him tohemnhythecolm blind ue fw39m ij f i 39 addressmg chOIce of f mm tg g 0 5m KIng C 39Tn I I colors dlstlngwshable gaggggrgmgmg39kd to those with the La fs i sm i zg i t mm e cis tfgahr l imrtnf most common form quotgmg jmMML low suitsays Im Woman of colorblmdness wutdbenernrmmme changes he notes Thefe39s a good chance that on of the l39e via ers of me nat paper ycu mhmitwill be tolorhlind Hmwsciuncemagmg SCIENCE VOL EBB BENJVEMBERZD JE Proc Natl Acad Sci USA Vol 96 pp 674576750 June 1999 Cell Biology Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays U ALONT N BARKAIT D A NOTTERMAN K Gisni S YBARRAi D MACKi AND A J LEVINE Departments of Molecular Biology and iPhysics Princeton University Princeton NJ 08540 and EDS Biotechnology 225A Gateway Boulevard 0 South San Francisco CA 9408 Contributed by A J Levine April 13 1999 ABSTRACT Oligonucleotide arrays can provide a broad picture of the state of the cell by monitoring the expression level of thousands of genes at the same time It is of interest to develop techniques for extracting useful information from the resulting data sets Here we report the application of a twoway clustering method for analyzing a data set consisting of the expression patterns of different cell types Gene expres sion in 40 tumor and 22 normal colon tissue samples was analyzed with an Affymetrix oligonucleotide array comple mentary to more than 6500 human genes An efficient two way clustering algorithm was applied to both the genes and the tissues revealing broad coherent patterns that suggest a high degree of organization underlying gene expression in these tissues Coregulated families of genes clustered together as demonstrated for the ribosomal proteins Clustering also separated cancerous from noncancerous tissue and cell lines from in viva tissues on the basis of subtle distributed patterns of genes even when expression of individual genes varied only slightly between the tissues Twoway clustering thus may be of use both in classifying genes into functional groups and in classifying tissues based on gene expression Recently introduced experimental techniques based on oligo nucleotide or DNA arrays now allow the expression level of thousands of genes to be monitored in parallel 179 To use the full potential of such experiments it is important to develop the ability to process and extract useful information from large gene expression data sets Elegant methods recently havebeen applied to analyze gene expression data sets that are comprised of a time course of expression levels Examples of such timecourse experiments include following a develop mental process or changes as the cell undergoes a perturbation such as a shift in growth conditions The analysis methods were based on clustering of genes accordin to similarity in their temporal expression 5 6 9711 Such clustering has been demonstrated to identify functionally related families of genes both in yeast and human cell lines 5 6 9 11 Other methods have been proposed for analyzing timecourse gene expression data attempting to model underlying genetic circuits 12 13 Here we report the application of methods for analyzing data sets comprised of snapshots of the expression pattern of different cell types rather than detailed timecourse data The data set used is composed of 40 colon tumor samples and 22 normal colon tissue samples analyzed with an Affymetrix oligonucleotide array 8 complementary to more than 6500 human genes and expressed sequence tags ESTs 14 We focus here on generally applicable analysis methods a more detailed discussion of the cancerspecific biology associated with this study will be presented elsewhere D AN and AL The publication costs of this article were defrayed in part by page charge payment This article must therefore be hereby marked advertisement in accordance with 18 USC 1734 solely to indicate this fact PNAS is available online at wwwpnasorg 6745 unpublished work The correlation in expression levels across different tissue samples is demonstrated to help identify genes that regulate each other or have similar cellular function To detect large groups of related genes and tissues we applied twoway clustering an effective technique for detecting pat terns in data sets see eg refs 15 and 16 The main result is that an efficient clustering algorithm revealed broad coherent patterns of genes whose expression is correlated suggesting a high degree of organization underlying gene expression in these tissues It is demonstrated for the case of ribosomal proteins that clustering can classify genes into coregulated families It is further demonstrated that tissue types eg ancerous and noncancerous samples can be separated on the basis of subtle distributed patterns of genes which individually vary only slightly between the tissues Twoway clustering thus may be of use both in classifying genes into functional groups and in classifying tissues based on their gene expression similarity MATERIALS AND METHODS Tissues and Hybridization to Affymetrix Oligonucleotide Arrays Colon adenocarcinoma specimens snapfrozen in liquid nitrogen within 20 min of removal were collected from patients DAN and AJL unpublished work From some of these patients paired normal colon tissue also was obtained Cell lines used EB and EBl have been described 17 RNA was extracted and hybridized to the array as described 1 8 Treatment of Raw Data from Affymetrix Oligonucleotide Arrays The Affymetrix Hum6000 array contains about 65000 eatures each containing N107 strands of a DNA 25mer oligonucleotide Sequences from about 3200 fulllength As and 3400 ESTs that have some similarity to other eukaryotic genes are represented on a set of four chips In the following we refer to either a fulllength gene or an EST that is represented on the chip as EST Each EST is repre sented on the array by about 20 feature pairs Each feature contains a 25bp sequence which is either a perfect match PM to the EST or a single centralbase mismatch MM The hybridization signal fluctuates between different features that represent different 25mer oligonucleotide segments of the same EST This fluctuation presumably reflects the variation in hybridization kinetics of different sequences as well as the presence of nonspecific hybridization by background RNAs Some of the features display a hybridization signal that is many times stronger than their neighbors 4 of the intensities are gt3 SD away from the mean for their EST These outliers appear with roughly equal incidence in PM or MM features If not filtered out outliers contribute significantly to the reading of the average intensity of the gene Because most features Intercluster Distance U Alom et al Coloa Mieroarray Data C1usters 140 115 90 65 40 51915 7 15172 I n a 242M473 35545554546l 1 54551512 n4 Average Linkage lndata keep top 50 var ge es Colon Tissue Samples Normal Cancer plotted Thu Jun 09 223428 2005 CbergerbiochemaondataaIonaploth nnotrueiaverageink50PCAcassps plothiero1on3nnotrueproiiJuly 11 2003 LeaveOneOut Crossvalidation for Alon et a1 Data Miscla551 Cd points 3 tumor 3 normal are larger size Clustering How Do They Make Those Dendrograms and Heat Maps Outline De nition of unsupervised clustering Dendrogram construction by hierarchical agglomerative clustering given speci ed intercluster and interpoint distance measures Uniqueness of the dendrogram if an unambiguous choice of leftright ordering is speci ed for each join of two clusters in the dendrogram construction Dependence of the clustering dendrogram on the definition of intercluster distance Additional examples Heat map construction Brie y noting other methods for clustering and data visualization The difference between exploratory and supervised clustering kmean52tex KMeans Clustering Choose number of clusters Choose initial cluster centers Assign each point to the cluster Whose center is closest Rede ne each cluster center as center of mass of all the points assigned to that cluster Repeat 3 amp 4 until clusters stabilize memmeeem 7 a a 5 6 5 6 4 mmmmmmwm eeooenom Nunmwmmm 3 2 3 2 3 3 L I Part of the Anderson Fisher Iris Data Set msmeehms mesemmee mmemomeo oooooooo 2 2 2 1 2 4 3 2 HHHHHHHH omwmwmmh 2 1 2 1 2 2 L L HHHHHHHH NNNNNNMN mmwmmmww mmms Setosa Versioo1or mNH full data set is so samples of each iris flower speoies data from R A Johnson and n w chhern eaoh row is the data from one flower oo1mmns 1 2 3 4 are measured properties of each flower Asepa1 1ength sepa1 width peta1 1ength peta1 width Botany definitions e calyx is the outermost group of floral parts usually green Sepals are the individual 1eaves or parts of the ris Do to Principol Component Plot of Anderson Iris Doto I 8 7 7 1 C Q C I 39 o Q l I E 39 O O I I I O 39 A o 6 6 39 A I I CL 1 A 39I o A A I I I I I I U I A A I I C I A I O I A I O I A I I Q OI A I I 7 AAA AI I 5 A A A A 39 AA A l39 I A A A A A i I A I A 4 I I I 2 4 6 8 10 First Principal Component cov evolues 4228 0243 0078 0024 Fri Sep 21 123600 2001 irispc1pro September 21 2001 irispc1covps data from Table 1115 of Johnson and Wichern 1998 K7 eohs C usiering of Ancerson 5 Beta HHiiiii iiii 7 Two K Means H Clusters 5 I 5 Fquot m I39 I I a 2 4 m E a pat Frincipui Cowman The color of the perimeter of each square designates its correct group the color of the inside of each square gives its K Means cluster z m 0145 cm W A 23 imam 2m w 25 m m cm mi chie v 5 vi mmm m Wicnem iega KrVeans Ciuszerinq 0quot Ar cerson irrs Dcic 7 Three K Mea ns Clusters 5 s I new WiFEiDC Compnngn I l I u 39 IHl 39 l39 I l I a a mi Vrvncipui Cumpmm The color of the perimeter of each square designates its correct group the color of the inside of each square gives its K Means cluster um um mm on w n 21 um u m mm a Law 2 mz I u raVzklin am pm Table w 5 u Jnlvgun M Waugh x995 KrMeons Ciustering of Anderson iris Dcto Four 39 K Mea ns Clusters cm erc w Camnonen 139 39 39W quot I A 3 Via 39 57 quot5quot 39 quot a V39 39 3955quot u 1 u n u 39 e a m Wipe Conveneu The color of the perimeter of each square designates its correct group the color of the inside of each square gives its K Means cluster 11qu any eu39wlwu W 2 we ia izmv n and mm mm Clustering How Do They Make Those Dendrograms and Heat Maps Outline De nition of unsupervised clustering Dendrogram construction by hierarchical agglomerative clustering given speci ed intercluster and interpoint distance measures Uniqueness of the dendrogram if an unambiguous choice of leftright ordering is speci ed for each join of two clusters in the dendrogram construction Dependence of the clustering dendrogram on the definition of intercluster distance Additional examples Heat map construction Brie y noting other methods for clustering and data visualization The difference between exploratory and supervised clustering CLUSTERING Without Training Data Unsupervised Clustering Exploratory separation of data into groups of points Points in distinct groups to be more different than points within one group Discovery of classes Information about the data is used to evaluate the results but is NOT used in doing the clustering With Training Data Supervised Clustering Accurate classi cation of new data points Veri cation of accuracy of the training data Discovery of additional classes Information about the data is used in doing the clusteringclassi cation tlike statistic for selecting genes to be used in a classi er Tg E lml mzl V1N1 V2N2 512 m mean of gene g expression levels over training group i Vi variance of gene g expression levels over training group i Ni number of samples in training group i 5 a small constant to prevent division by nearly 0 Genes with large T are more likely to be useful discriminators B vs M 30 Genes B vs T 30 Genes 3O Genes T vs M Expression Levels for Genes Classifying Tumor Type 3430332729373536323128 AML 412140 3 25231819 515241172021261716 ALL B 25911022813 ALL T Color scale for z ranges of expression for each gene 0 units 25 20 15 10 05 00 T R Golub et al Acute Leukemia Data Science V286 15 Oct 1999 ALL B is B cell acute lymphablastic leukemia ALL T is T cell acute lymphoblastic leukemia AML is acute myeloid leukemia Tue Mar 11 104130 2003 plotcdata3pro March 11 2003 Cbergerbiochemgolubdataaam3x30genes3psps Variance Matters Top situation is better as classifier even though the C red and blue group centroids are closer because variances are smaller O O C O g 0 39 g o o o C o 39 o O C C O K Nearest Neighbor Classi er The class assigned to a new data point C is that of its nearest neighbor ltor weighted vote of K nearest neighborsgt O O O o o g o 39 O o O O 0 O Must Do Sound Validation of Any Proposed Classi cation Method use crossvalidation test on independent data see eg R Simon M D Radmacher K Dobbin and L M McShane Pitfalls in the Use of DNA Microarray Data for Diagnostic and Prognostic Classification Journal of the National Cancer Institute 95 2003 pp 1418 lteXploratory class discovery supervised classification proper use of crossvalidationgt Clustering How Do They Make Those Dendrograms and Heat Maps Summary Dendrogram construction by hierarchical agglomerative clustering given speci ed intercluster and interpoint distance measures and a proper ordering of the points Uniqueness of the dendrogram if an unambiguous choice of leftright ordering is speci ed for each join of two clusters in the dendrogram construction Dependence of the clustering dendrogram on the definition of intercluster distance Heat map construction Other methods for clustering and data Visualization kmeans PCA The difference between exploratory and supervised clustering Hierarchical Methods for clustering a set of n points Suppose one has n quotpointsquot one wants to hierarchically cluster quotdraw the hierarchical dendrogram forquot in order to get an exploratory look to see if there appears to be naturally occurring subgroups the points could be expression patterns of n genes across a set of experiments or the points could be the expression patterns of n patient samples In general one then has n vectors VlVjVn each of length say m so each point Vj has m components columnVlj V2j Vij ij I tend to think of vectors as column vectors but that is just personal preference Thus each Vj could be the expression of a given gene across m experiments or the expression level of a given tissue sample measured for m genes If one considers a customary quotsummaryquot of a set of microarray experiments the matrix M of expression levels where the entry in row g column e is the suitably normalized and often transformed by taking log expression level of gene g in experiment or sample e then the vectors Vj are either the rows or the columns of M depending on whether one wishes to cluster genes or samples To carry out the clustering one wants to form successive hierarchical groups clusters combining 2 subsets of points at each step of the process To do this one needs to specify TWO distance measures 1 a distance measure dVW between any two vectors V and W of size m eg the Euclidean distance dVW sqrt Vi Wivz Vm WmA2 or the absolute value distance dVW Vl Wl Vm Wm 2 given a particular choice of distance d in item 1 above one also needs to specify a distance measure between any two sets A and B of vectors common set distances are single linkage the distance between set A and set B of points is the minimum of the distances dab where a ranges over vectors in the set A and b ranges over vectors in the set B complete linkage the distance between set A and set B of points is the maximum of the distances dab where a ranges over vectors in the set A and b ranges over vectors in the set B average linkage the distance between set A and set B of points is the average of the distances dab where a ranges over vectors in the set A and b ranges over vectors in the set B simple example given quotvectorsquot with only one component points on a line with Euclidean distances as indicated ltpicture not to scalegt V1 3 V4 E 7 points on line 10 11 12 14 17 21 25 distancesVi1Vi single linkage produces successive joinings V1 with V2 ltdistance 10gt V3 with V1v2 ltdistance 11gt V4 with V1V2v3 ltdistance 12gt V8 with V1v2v3v4v5v6v7 ltdistance 25gt illustrating the so called quotchainingquot effect often associated with single linkage unless there are well separated groups of points complete linkage produces successive joinings V1 with V2 ltdistance 10gt but now the distance from V1v2 to V3 is 21 not 11 so the next join is V3v4 then V5v6 then V7v8 then V1v2 V3V4 ltdistance 33 between V1v2 and V3v4gt then by narrow margin V5v6v7v8 ltdistance 63 between V5v6 amp V7v8gt vs distance of 64 between V1v2v3v4 and V5V6H then lastly the join of the latter two size 4 subsets so one gets apparent structure where arguably in this case there really isn39t much by nature of using maximum inter set distance complete linkage tends to force formation of a larger number of more compact clusters average linkage produces joinings V1v2 then V3v4 then V5v6 then V7v8 then V1v2 V3V4 as before but here the next join is V1v2v3v4 with V5V6 ltaverage distance with some arithmetic is 41625gt average distance between V5v6 and V7v8 is 42 and so the last join is V1V2v3v4v5v6 with V7V8 so here and arguably in general average linkage gives clustering somewhat between single linkage and complete linkage note the quotunstablequot nature of the clustering results depending on choice of inter set distance measure and specific point locations the dendrograms are drawn by arranging the point numbers equally spaced along the x axis in an ordering compatible with the order in which the subsets are joined At each step of the process initially there are n quotclustersquot the individual points each subset has associated with it an x axis location and a height
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'