This 161-page set of class notes was uploaded by Kari Harber Jr. on Thursday, September 17, 2015. The notes belong to BSC 5936 at Florida State University, taught by Staff in Fall. Since upload, they have received 22 views. For similar materials see /class/205433/bsc-5936-florida-state-university in Biological Sciences at Florida State University.

Date Created: 09/17/15
Steve Thompson
An Introduction to Bioinformatics, Special Topics, BSC 4933/5936
Florida State University, The Department of Biological Science (http://www.bio.fsu.edu)
September 23, 2003

Database Similarity Searching
Steven M. Thompson, Florida State University School of Computational Science and Information Technology (CSIT)

How can you search the databases for similar sequences if pairwise alignments take N² time? Significance and heuristics.

But why even do database searches? We can imagine screening databases for sequences similar to ours using the concepts of dynamic programming and log-odds scoring matrices, and some yet-to-be-described tricks. But what do database searches tell us? What can we gain from them? Why even bother?

Inference through homology is a fundamental principle of biology. When a sequence is found to fall into a preexisting family, we may be able to infer function, mechanism, evolution, perhaps even structure, based on homology with its neighbors. If no significant similarity can be found, the very fact that your sequence is new and different could be very important. Granted, its characterization may prove difficult, but it could be well worth it.

Homology and similarity. Don't confuse homology with similarity; there is a huge difference! Similarity is a statistic that describes how much two (sub)sequences are alike according to some set scoring criteria. It can be normalized to ascertain statistical significance, but it's still just a number. Homology, in contrast and by definition, implies an evolutionary relationship, more than just the fact that we've all evolved from the same old primordial ooze. To demonstrate homology, reconstruct the phylogeny of the organisms or genes of interest. Better yet, show experimental evidence (structural, morphological, genetic, or fossil) that corroborates your assertion. There is no such thing as percent homology; something is either homologous or it is not. Walter Fitch is credited with 'homology is like pregnancy: you can't be 45% pregnant', just like
something can't be 45% homologous. You either are or you are not. Highly significant similarity can argue for homology, but never the other way around.

Independent of all that, what is a good alignment? So, first: significance. When is any alignment worth anything, biologically? An old statistics trick: Monte Carlo simulations.

Z score = (actual score - mean of randomized scores) / (standard deviation of the randomized score distribution)

The Normal (Abby Normal?) distribution. Many Z scores measure the distance from the mean using this simplistic Monte Carlo model, assuming a Gaussian distribution (a.k.a. the Normal distribution; http://mathworld.wolfram.com/NormalDistribution.html), in spite of the fact that sequence space actually follows what is known as the Extreme Value distribution. However, the Monte Carlo method does approximate significance estimates fairly well.

Sequence space (huh, what's that?) actually follows the Extreme Value distribution (http://mathworld.wolfram.com/ExtremeValueDistribution.html). Based on this known and robust statistical methodology, a realistic Expectation function, the E value, can be calculated from database searches. The particulars of how this is done will wait, but the take-home message is the same.

[Figure: a histogram of database-search scores following the Extreme Value distribution; the plot itself did not survive text extraction.]

The Expectation value: the higher the E value is, the more probable that the observed match is due to chance in a search of the same-size database, and the lower its Z score will be (i.e., it is NOT significant). Therefore, the smaller the E value (the closer it is to zero), the more significant it is, and the higher its Z score will be. The E value is the number that really matters.
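As a concrete illustration of the shuffle-and-rescore trick, here is a minimal Python sketch. The position-wise ungapped_score function is a toy stand-in for a real alignment scorer (it is not from any package discussed here); only the Monte Carlo Z-score bookkeeping is the point:

```python
import random
import statistics

def ungapped_score(a, b):
    # Toy position-wise score (+1 match, -1 mismatch); a stand-in
    # for a real dynamic-programming alignment score.
    return sum(1 if x == y else -1 for x, y in zip(a, b))

def monte_carlo_z(query, target, n_shuffles=200, seed=0):
    # Z = (actual score - mean of randomized scores) / stdev of randomized scores.
    rng = random.Random(seed)
    actual = ungapped_score(query, target)
    letters = list(query)
    rand_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)        # shuffling preserves composition and length
        rand_scores.append(ungapped_score("".join(letters), target))
    mu = statistics.mean(rand_scores)
    sigma = statistics.stdev(rand_scores)
    return (actual - mu) / sigma

z = monte_carlo_z("FAMLGFIKYLPGCM", "FAMLGFIKYLPGCM")
```

An exact self-match stands many standard deviations above the shuffled distribution; a chance-level match would hover near zero.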
Rules of thumb for a protein search. The Z score represents the number of standard deviations some particular alignment is from a distribution of random alignments (the Normal distribution). Z scores very roughly correspond to E values, based on the Extreme Value distribution, for a typical protein sequence similarity search through a database with 125,000 protein entries. [The slide's table of Z score to E value rules of thumb did not survive text extraction.]

On to the searches. But N² is way too slow; how can it be done? Database searching programs use the two concepts of dynamic programming and log-odds scoring matrices; however, dynamic programming takes far too long when used against most sequence databases with a normal computer. Remember how big the databases are! Therefore, the programs use tricks to make things happen faster. These tricks fall into two main categories: that of hashing, and that of approximation.

Corned beef hash? Huh? Hashing is the process of breaking your sequence into small 'words' or 'k-tuples' (think all chopped up, just like corned beef hash) of a set size, and creating a 'lookup' table with those words keyed to position numbers. Computers can deal with numbers way faster than they can deal with strings of letters, and this preprocessing step happens very quickly. Then, when any of the word positions match part of an entry in the database, that match, the offset, is saved. In general, hashing reduces the complexity of the search problem from N² for dynamic programming to N, the length of all the sequences in the database.

A simple hash table (this example is from the Krane and Raymer text, p. 50): the sequence FAMLGFIKYLPGCM and a word size of one would produce this query lookup (hash) table:

word: A  C   F    G     I  K  L     M     P   Y
pos:  2  13  1,6  5,12  7  8  4,10  3,14  11  9

Comparing it to the database sequence TGFIKYLPGACT would yield the following offset table (the signs were lost in extraction and have been recomputed):

target word (pos): G(2)  F(3)  I(4)  K(5)  Y(6)  L(7)  P(8)  G(9)  A(10)  C(11)
offsets:           3,10  -2,3  3     3     3     -3,3  3     -4,3  -8     2

Hmmm, and some interpretation. The offset numbers come from the difference between the positions of the words in the query sequence and the position of the occurrence of that word in the target sequence.
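The lookup-table-and-offsets scheme can be sketched directly in Python. The function names here are illustrative, not from any real package; the sequences are the Krane and Raymer example:

```python
from collections import defaultdict

def build_hash(query, k=1):
    # Word -> list of 1-based query positions, as in the table above.
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i + 1)
    return table

def offsets(query, target, k=1):
    # For each word match, record the offset (query position - target position),
    # grouped so that a dominant offset stands out.
    table = build_hash(query, k)
    hits = defaultdict(list)
    for j in range(len(target) - k + 1):
        for i in table.get(target[j:j + k], []):
            hits[i - (j + 1)].append((i, j + 1))
    return hits

hits = offsets("FAMLGFIKYLPGCM", "TGFIKYLPGACT")
best = max(hits, key=lambda d: len(hits[d]))   # most populated offset
```

The offset that accumulates the most word hits (here, 3) identifies the diagonal along which to attempt the ungapped alignment.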
Then: look at all of the offsets equal to three in the previous table. Therefore, offset the alignment by three:

FAMLGFIKYLPGCM
   TGFIKYLPGACT

Quick and easy; computers can compare these sorts of tables very fast. The trick is to know how far to attempt to extend the alignment out.

OK, heuristics. What's that? Approximation techniques are collectively known as heuristics. Webster's defines heuristic as 'serving to guide, discover, or reveal ... but unproved or incapable of proof'. In database similarity searching techniques, the heuristic usually restricts the necessary search space by calculating some sort of a statistic that allows the program to decide whether further scrutiny of a particular match should be pursued. This statistic may miss things, depending on the parameters set; that's what makes it heuristic. Worthwhile results at the end are compiled, and the longest alignment within the program's restrictions is created. The exact implementation varies between the different programs, but the basic idea follows in most all of them.

Two predominant versions exist: BLAST and Fast. Both return local alignments, and neither is a single program but rather a family of programs, with implementations designed to compare a sequence to a database in about every which way imaginable. These include:
1. a DNA sequence against a DNA database (not recommended unless forced to do so because you are dealing with a non-translated region of the genome; DNA is just too darn noisy, only identity and four bases!);
2. a translated (where the translation is done on-the-fly, in all six frames) version of a DNA sequence against a translated (on-the-fly, six-frame) version of the DNA database (not available in the Fast package);
3. a translated (on-the-fly, six-frame) version of a DNA sequence against a protein database;
4. a protein sequence against a translated (on-the-fly, six-frame) version of a DNA database;
5. or a protein sequence against a protein database.
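Since several of the search flavors above depend on on-the-fly six-frame translation, here is a minimal sketch of what that operation does. It assumes the standard genetic code (the compact 64-letter string encodes the codon table in TCAG order) and ignores ambiguity codes:

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {a + b + c: AMINO[16 * i + 4 * j + k]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)
         for k, c in enumerate(BASES)}

def translate(dna):
    # Translate one reading frame, dropping any trailing partial codon.
    return "".join(CODON.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    # Three forward frames, then three frames of the reverse complement.
    rc = dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return [translate(strand[f:]) for strand in (dna, rc) for f in range(3)]

frames = six_frames("ATGGCC")
```

A translated search simply runs all six of these peptide strings against the protein database (or against six-frame translations of a DNA database).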
Many implementations allow for the possibility of frame shifts in translated comparisons, and don't penalize the score for doing so.

The BLAST and Fast programs: some generalities. (The original slide set these side by side in two columns.)

BLAST (Basic Local Alignment Search Tool), developed at NCBI:
1. Normally NOT a good idea to use for DNA against DNA searches (translation not optimized).
2. Prefilters repeat and low-complexity sequence regions.
3. Can find more than one region of gapped similarity.
4. Very fast heuristic and parallel implementation.
5. Restricted to precompiled, specially formatted databases.

FastA and its family of relatives, developed by Bill Pearson at the University of Virginia:
1. Works well for DNA against DNA searches (within limits of possible sensitivity).
2. Can find only one gapped region of similarity.
3. Relatively slow; should often be run in the background.
4. Does not require specially prepared (preformatted) databases.

The algorithms in brief (the slide graphic did not survive extraction; its text summarized the two methods as follows). BLAST: two word hits on the same diagonal above some similarity threshold trigger ungapped extension until the score isn't improved enough above another threshold, yielding an HSP; gapped extensions are then initiated, using dynamic programming, for those HSPs above a third threshold, up to the point where the score starts to drop below a fourth threshold, yielding the alignment. Fast: find all ungapped exact word hits; maximize the scores of the ten best continuous regions (init1); combine non-overlapping init regions on different diagonals (initn); then use dynamic programming 'in a band' for all regions with initn scores better than some threshold, yielding the opt score.

BLAST: the algorithm in more detail. (1) After BLAST has sorted its lookup table, it tries to find all double word hits along the same diagonal within some specified distance, using what NCBI calls a Discrete Finite Automaton (DFA). These word hits of size W do not have to be identical; rather, they have to be better than some threshold value T. To identify these double word hits, the DFA scans through all strings of words (typically W = 3 for peptides) that score at least T (usually 11 for peptides).
Each double word hit that passes this step then triggers a process called ungapped extension in both directions, such that each diagonal is extended as far as it can, until the running score starts to drop below a predefined value X within a certain range. The result of this pass is called a High-scoring Segment Pair, or HSP. Those HSPs that pass this step with a score better than S then begin a gapped extension step utilizing dynamic programming. Those gapped alignments with Expectation values better than the user-specified cutoff are reported. The extreme value distribution of BLAST Expectation values is precomputed against each precompiled database; this is one area that speeds up the algorithm considerably.

The BLAST algorithm, continued. The math can be generalized thus: for any two sequences of length m and n, local best alignments are identified as HSPs. HSPs are stretches of sequence pairs that cannot be further improved by extension or trimming, as described above. For ungapped alignments, the number of expected HSPs with a score of at least S is given by the formula

E = K m n e^(-λS)

This is called an E value for the score S. In a database search, n is the size of the database in residues, so N = mn is the search space size. K and λ are supplied by statistical theory and, as mentioned above, can be calculated by comparison to precomputed simulated distributions. These two parameters define the statistical significance of an E value.

The E value defines the significance of the search. As mentioned above, the smaller an E value is, the more likely it is significant. A value of 0.01 to 0.001 is a good starting point for significance in most typical searches. In other words, in order to assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone.
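The E = Kmn·e^(-λS) formula is easy to exercise numerically. The K and λ values below are merely illustrative constants of a plausible magnitude for a gapped protein search; real values depend on the scoring system and are supplied by the search program:

```python
import math

# Illustrative Karlin-Altschul parameters (assumed values for this sketch;
# in practice they depend on the scoring matrix and gap costs).
K, LAM = 0.041, 0.267

def e_value(score, query_len, db_residues):
    # E = K * m * n * exp(-lambda * S): the number of HSPs with score >= S
    # expected purely by chance in a search space of m * n residues.
    return K * query_len * db_residues * math.exp(-LAM * score)

weak   = e_value(50, 300, 1e8)    # raw score 50, 300-residue query, 1e8-residue db
strong = e_value(100, 300, 1e8)
```

Doubling the raw score shrinks E exponentially, which is why a modest score difference separates "chance" from "significant"; E also grows linearly with database size, so the same alignment becomes less significant in a bigger database.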
The Fast algorithm in more detail. Fast is an older algorithm than BLAST: the original Fast paper came out in 1988, based on David Lipman's work in a 1983 paper; the original BLAST paper was published in 1990. Both algorithms have been upgraded substantially since originally released. Fast was the first widely used, powerful sequence database searching algorithm. Bill Pearson continually refines the programs such that they remain a viable alternative to BLAST, especially if one is restricted to searching DNA against DNA without translation. They are also very helpful in situations where BLAST finds no significant alignments; arguably, Fast may be more sensitive than BLAST in these situations.

Fast is also a hashing-style algorithm; it builds words of a set k-tuple size, by default two for peptides. It then identifies all exact word matches between the sequence and the database members. Note that the word matches must be exact for Fast, but only similar above some threshold for BLAST.

The Fast algorithm, continued. From these exact word matches:
1. Scores are assigned to each continuous, ungapped diagonal by adding all of the exact-match BLOSUM values.
2. The ten highest-scoring diagonals for each query-database pair are then rescored using BLOSUM similarities as well as identities, and ends are trimmed to maximize the score. The best of each of these is called the init1 score.
3. Next, the program looks around to see if nearby, off-diagonal init1 alignments can be combined by incorporating gaps. If so, a new score, initn, is calculated by summing up all the contributing init1 scores, penalizing each gap with a penalty.
4. The program then constructs an optimal local alignment for all initn pairs with scores better than some set threshold, using a variation of dynamic programming 'in a band'. A sixteen-residue band centered on the highest init1 region is used by default with peptides. The score generated from this step is called opt.

The Fast algorithm, still continued:
5. Next, Fast uses a simple linear regression against the natural log of the search-set sequence length to calculate a normalized z-score for the sequence pair.
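Step 1 of the Fast procedure, scoring each ungapped diagonal from exact word hits, can be sketched as follows. The per-residue score function is a toy stand-in for BLOSUM values, and overlapping word hits are scored independently, so this is only a rough illustration of how init1 regions emerge:

```python
from collections import defaultdict

def init1_regions(query, target, k=2, score=lambda a, b: 5 if a == b else -1):
    # Index the query's exact k-tuples, then accumulate a score for every
    # diagonal (query index - target index) on which exact word hits fall.
    words = defaultdict(list)
    for i in range(len(query) - k + 1):
        words[query[i:i + k]].append(i)
    diag = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in words.get(target[j:j + k], []):
            diag[i - j] += sum(score(query[i + t], target[j + t]) for t in range(k))
    # Highest-scoring diagonals approximate Fast's candidate init1 regions.
    return sorted(diag.items(), key=lambda kv: -kv[1])

regions = init1_regions("GFIKYLPG", "TGFIKYLPGACT")
```

Every exact 2-mer of the query recurs one position later in the target, so a single diagonal dominates; Fast would keep the ten best such diagonals, rescore and trim them, and carry the winners forward as init1.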
Note that this is not the same Monte Carlo-style Z score described earlier, and cannot be directly compared to one.
6. Finally, it compares the distribution of these z-scores to the actual extreme-value distribution of the search. Using this distribution, the program estimates the number of sequences that would be expected to have, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the Expectation value.
7. If the user requests pairwise alignments in the output, then the program uses full Smith-Waterman local dynamic programming (not restricted to a band) to produce its final alignments.

Let's see 'em in action. To begin, we'll go to the most widely used (and abused) biocomputing program on earth: NCBI's BLAST. Connect to NCBI's BLAST page with any Web browser: http://www.ncbi.nlm.nih.gov/BLAST. There is a wealth of information there, including a wonderful tutorial and several very good essays, for teaching yourself way more about BLAST than this lecture can ever hope to. For now, I'll demonstrate with a simple example, one of my favorites: the elongation factor 1a protein from Giardia lamblia, named EF1A_Giala in the SwissProt database, but we have to use the accession code O08046 for NCBI's BLAST server to find the sequence. Let's see how it works, and how quickly we get results back.

Let's contrast that with GCG's BLAST version. I'll illustrate with the same molecule, and I'll use GCG's SeqLab GUI to show the difference between the two implementations of the program. And finally, let's see how GCG's FastA version compares to either BLAST implementation. Again, I'll launch the program from SeqLab with the same example, but this time I'll take advantage of Fast's flexible database search syntax, being able to use any valid GCG sequence specification. Here I'll search against a precompiled LookUp list file of all of the so-called 'primitive' eukaryotes in SwissProt.

Finally: why do I keep diss'ing DNA for searches and alignment? All database similarity searching and
sequence alignment, regardless of the algorithm used, is far more sensitive at the amino acid level than at the DNA level. This is because proteins have twenty match criteria versus DNA's four; those four DNA bases can generally only be identical, not similar, to each other; and many DNA base changes (especially third-position changes) do not change the encoded protein. All of these factors drastically increase the noise level of a DNA-against-DNA search, and give protein searches a much greater look-back time, typically doubling it. Therefore, whenever dealing with coding sequence, it is always prudent to search at the protein level.

References and a Comment. The better you understand the chemical, physical, and biological systems involved, the better your chance of success in analyzing them. Certain strategies are inherently more appropriate than others in certain circumstances. Making these types of subjective, discriminatory decisions, and utilizing all of the available options so that you can generate the most practical data for evaluation, are two of the most important take-home messages that I can offer.

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) Basic Local Alignment Search Tool. Journal of Molecular Biology 215: 403-410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Research 25: 3389-3402.
Genetics Computer Group (GCG), Copyright 1982-2002. Program Manual for the Wisconsin Package, Version 10.3. Accelrys Inc., A Pharmacopeia Company, San Diego, California, USA.
Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, New York, USA.
Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences USA 89: 10915-10919.
Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular
Biology 48: 443-453.
Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Comparison. Proceedings of the National Academy of Sciences USA 85: 2444-2448.
Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In: Atlas of Protein Sequence and Structure (M.O. Dayhoff, editor) 5, Suppl. 3: 353-358. National Biomedical Research Foundation, Washington, D.C., USA.
Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2: 482-489.
Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences USA 80: 726-730.


Evolutionary Trees
Fredrik Ronquist
August 29, 2005

1 Evolutionary Trees

Tree is an important concept in Graph Theory, Computer Science, Evolutionary Biology, and many other areas. In evolutionary biology, a tree is a branching diagram used to illustrate the divergence over time of evolutionary lineages (species) or of genetically encoded traits (such as DNA or protein sequences) of those lineages. The former trees are sometimes called species trees and the latter gene trees, with the word 'gene' loosely referring to any genetic component or genetically determined trait of an organism. The word phylogeny is usually used as a synonym of species tree. In organisms that reproduce asexually, gene trees and species trees are essentially the same thing, but organisms that reproduce sexually (including most multicellular organisms) mix genes with different ancestry every time they reproduce. Therefore, sexually reproducing organisms diverge more slowly than their genes, and this introduces several potential sources of mismatch between gene and species trees, as we will see later during the course. Generally speaking, however, gene trees and species trees tend to be largely congruent.

We will during the course cover both Population Genetics and Phylogenetics, in particular the ways in which they use evolutionary trees. Population Genetics focuses on individual or closely related
evolutionary lineages and the genetic structure within them. This information, contained in gene trees, is used to describe the present, infer the recent past, or predict the immediate future of evolutionary lineages (populations). Phylogenetics is the study of species trees and their implications concerning the evolution over longer time scales.

BSC 5936, Fall 2005: Computational Evolutionary Biology

There is an interesting divide between Population Genetics and Phylogenetics, among other things evident in the fact that evolutionary lineages are referred to as populations in the former field and species in the latter. The critical difference appears to be in the life span of the lineages; the term species implies a longer life span than the term population. A lineage that lasted, or is predicted to last, some critical amount of time (say 10,000 to 100,000 years or more) is typically referred to as a species, whereas a lineage with shorter life span is called a population and not a species. There is also some anthropocentrism involved, so we tend to call lineages closely related to us separate species (e.g., Neanderthals), whereas similar amounts of divergence would probably be understood as population-level differentiation in invertebrates.

An evolutionary tree is typically drawn with the root at the bottom (Fig. 1), unlike other types of trees, such as those used in Computer Science, which by convention are drawn with the root at the top. The evolutionary tree (or any other tree, for that matter) consists of two components: the divergence points and the lines connecting them. Biologists typically refer to the divergence points as nodes and the lines connecting them as branches or sometimes internodes (Fig. 1).

[Figure 1: An evolutionary tree.]

The nodes in an evolutionary tree fall into three types: terminal nodes, interior nodes, and the root node (Fig. 1). The terminal nodes either represent current lineages or lineages that have gone extinct (evolutionary dead ends). The interior nodes represent slightly
different things, depending on whether the tree is a species tree or a gene tree. If it is a species tree, then an interior node symbolizes a speciation event, whereby one ancestral ('mother') lineage diverges into two descendant ('daughter') lineages. The two daughter lineages are also referred to as sister lineages. Note that biologists tend to use feminine rather than masculine ('father') or neutral ('children', 'siblings') words to describe these tree components. If the tree is a gene tree in a sexually reproducing organism, then the interior nodes are genetic divergence events that ultimately go back to a DNA or RNA replication event. Some of those events may correspond to, but typically precede, speciation events, as will be explained later during the course.

The root node indicates where the tree is rooted, that is, the position of the most basal divergence event in the tree. If the tree is a species tree, then the root is the place where it connects to the rest of the Tree of Life (the species tree for all life on earth). The root of a species tree is also referred to as the most recent common ancestor of the species in the tree. This term implies that the node represents the last snapshot of the ancestral lineage just before the divergence event. We can similarly think of an interior node in a species tree as symbolizing either a snapshot of the ancestral lineage just before the speciation (an ancestor) or the speciation event itself. The nodes in a gene tree are typically not described as ancestors, but sometimes as ancestral sequences or traits.

The branches in an evolutionary tree have lengths, measured in time units. This means that in a tree with the root at the bottom, we can place a time scale along the tree, and the vertical position of the divergence events will then indicate when they took place (Fig. 1). Similarly, the vertical range of a branch in such a tree corresponds to the life span of the corresponding evolutionary lineage. In a tree
with slanted branches (Fig. 1), the actual length of a branch also depends on its angle, which is typically chosen to give a graphically pleasing presentation and therefore has no direct interpretation. To avoid this problem, evolutionary trees are often drawn with rectangular branches (Fig. 2). In such a tree, each branch is represented by two lines: the length of one reflects the time duration of the corresponding lineage, and the length of the other is chosen so that the branch can be connected with its ancestral lineage.

It is important to understand that evolutionary trees are only models, even if they are usually very good models; hence their use in virtually all disciplines in the life sciences. There are several processes that violate the basic assumptions of the tree model, including recombination (common in sexually reproducing organisms) and horizontal gene transfer (rare, but occurring in most organisms). For now we will ignore these processes, but we will return to them later during the course. An important fact to remember is that most evolutionary lineages are short-lived and go extinct before they become significant enough to be considered populations or species. There is usually an implicit understanding that an evolutionary tree is what remains after a large number, perhaps
labeled history In practice it is seldom possible to directly infer an evolutionary tree of the type described above and we will often use the term evolutionary tree to refer to simpler models In particular it is difficult to estimate the divergence times because of variation in evolutionary rates For this reason we often have to be satis ed with trees where branch lengths are measured in terms of the BSCSQB6 Fall 2005 PBFR Computational Evolutionary Biology amount of evolutionary change instead of in terms of time Such a tree is referred to as a non clock tree or an additive tree If we consider a species tree for a set of extant now living species where all terminals are of the same age then a clock tree has all terminals at the same level Fig 3a whereas a non clock tree potentially has the terminals at different levels Fig 3b In the non clock tree the terminals on short branches are more similar to the most recent common ancestor of the group than the terminals that sit on long branches Biologists sometimes refer to additive trees as phylogmms Figure 3 Comparison between a non clock tree a and a clock tree As we will discover soon many methods of inferring evolutionary trees cannot estimate divergence times nor can they place the root in the tree Such a method will result in an unrooted tree also called an undirected tree because of its lack of a time direction Fig 4a An unrooted tree is always non clock If you are interested in the position of the root which is almost always the case you have to use a separate rooting method to nd it A common method is to include one or more reference lineages also called outgroups in your analysis After the analysis the root is placed between the outgroup and the lineage being studied the ingmup Fig 4b Biologists are sometimes just interested in the branching structure or branching order of an evolu BSCSQB6 Fall 2005 PBFR Computational Evolutionary Biology Figure 4 An unrooted tree a can be rooted between an outgroup and the 
study group the ingroup b to give a rooted tree tionary tree and not in the branch lengths The branching structure is referred to as the topology of a tree and is represented in a cladogmm The same cladogram can be drawn in many different ways Fig 5 Evolutionary biologists typically restrict their attention to trees that are binary or dichotomous meaning that there are only two descendants of each interior node Sometimes trees with three or more descendants from a single node called polytomous trees are also considered In many situations polytomous trees are adequately approximated by sets of binary trees with short interior branches Fig 6 Regardless of its type clock rooted non clock or unrooted it is always possible to draw the same tree in different ways in a two dimensional gure because of the rotational freedom around each node For instance in a rooted tree we can shift the left and right descendants of each node without changing the meaning of the tree Fig 7 ln Graph Theory and Computer Science there is a different set of terms used for trees and you BSCSQB6 Fall 2005 PBFR Computational Evolutionary Biology Figure 5 The same tree topology cladogram can be drawn in many different ways because among other things the branch lengths are arbitrary often see these used in computational evolutionary biology as well They are vertex for node edge for branch leaf for terminal node and root for the root node In this terminology the degree of a vertex is the number of branches connected to it Thus an interior node of a binary tree has degree three a leaf has degree one and the root has degree two An edge weight is the same as the length of a branch It is often practical to refer to subtrees within a larger tree Biologists refer to them as clades A clade can de ned as a node i and all of its descendants or alternatively as the subtree rooted at node Clades can also be called taxa singular taxon or operational taxonomix units OTUs Taxa are used for named groups in a 
biological classification, whereas OTUs are groups that lack formal names. It is not uncommon to see evolutionary trees in which the terminals are taxa above the species level, also called higher taxa. For instance, a cladogram may give the relationships among a set of genera or families, each containing many species. Since individual evolutionary lineages can also be called taxa or OTUs, it is generally true that the terminals in a species tree are taxa or OTUs.

[Figure 6: A tree with a polytomy (a) can be thought of as being approximated by a set of binary trees with very short interior branches (b to d).]

Closely related to the concept of a subtree is the concept of a split or taxon bipartition. Each branch in a tree divides the terminal nodes (taxa) into two disjoint sets, one on each side of the branch. We will be using splits when we describe methods for assessing the robustness in the inference of evolutionary trees.

3 Computer Representation of Trees

Reference: Felsenstein, Chapter 35, pp. 585-590.

Much of this course will be devoted to the description of algorithms operating on trees. For this purpose, we will need a data model for trees. A simple model for rooted binary trees, found in many computer science texts, is given below (Fig. 8; see also Felsenstein Fig. 35.1). We will be referring to this model as the Binary Tree Data Model (BTM). The idea is to use a simple data structure for each node, containing pointers to the two descendant nodes and the ancestral node.

[Figure 7: The same tree can be drawn in many different ways because of the rotational freedom around each node.]

The terminal nodes have their descendant node pointers set to NULL, and the root node has its ancestral node pointer set to NULL. Strictly speaking, the pointer to the ancestral node is not needed, but it is convenient to have it in many cases. Felsenstein also includes a boolean variable indicating whether a node
is terminal (a tip). Again, this can be figured out from the pointers, so this variable is only there for convenience. If we need to store branch lengths, we can simply store them in the node structure of the node above the branch. The branch length variable needs to be added to the node structure in Felsenstein's figure. In an object-oriented computer language, we would typically store the nodes inside a tree class, which would contain a pointer to the root of the tree.

Figure 8: The Binary Tree Data Model (BTM).

A complication arises when we want to represent unrooted trees using this structure, since they have neither a root node nor a natural ancestor-descendant direction to them. One way of solving the problem is to treat the tree as if it were rooted on an internal node, typically the interior node adjacent to the outgroup terminal (Fig 9). We can easily represent all of the nodes that are descendants of the interior root node using the BTM. The interior root node is then connected to two of its subtrees using its left and right pointers, and to the third subtree using its ancestor pointer. All the branch lengths around the interior root node are now stored in other nodes, so the interior root node does not have a branch length. By using this type of structure we have forced a direction onto the unrooted tree. Clearly, truly polytomous trees cannot be represented using the BTM.

A data structure similar to the BTM that accommodates polytomous trees uses pointers from a node to its ancestor, its leftmost descendant, and one of its siblings (Fig 10). This structure can easily accommodate unrooted trees by arbitrarily selecting one of the interior nodes as the 'root' of the tree. We will refer to this data structure as the Polytomous Tree Data Model (PTM).

A somewhat different structure has been pioneered by Felsenstein (Fig 11; see also Felsenstein Fig 35.2) and will be referred to as the Felsenstein Tree Data Model. It uses rings of nodelets to
represent nodes. Each nodelet has a next pointer to the next nodelet in the ring, and an out pointer to the nodelet on the other end of the branch to which it is connected. This structure can accommodate both binary and polytomous trees, as well as rooted and unrooted trees. An advantage of this data structure is that it can be rerooted more easily than the previous two types.

Figure 9: A modification of the BTM to accommodate unrooted trees.

Branch lengths can either be stored in separate data structures pointed to by both of the adjacent nodelets, or they can be contained as member variables of both of the adjacent nodelets, and set functions can then be used to make sure that the two member variables representing the same branch are always in sync. It is convenient to have the nodelet structure or class contain a boolean member indicating whether the nodelet points down in the tree, so that we know where to start and stop when cycling through the nodelets representing a node of the tree.

4 Text and XML Representation of Trees

Reference: Felsenstein Chapter 35, pp 590-591

There is often a need to describe a tree in text format, for instance for storing trees resulting from an analysis. This is typically done using the Newick format described in Felsenstein Chapter 35, pp 590. A shortcoming of the Newick format is that it can only specify rooted trees. An unrooted
other nodes, which have only two. If there is a single outgroup terminal, the tree is typically rooted on the interior node adjacent to that terminal.

Assume we have an unrooted binary tree with terminals labeled A to D and the branch lengths labeled v1 to v5, and that we root it on the interior node adjacent to the tip D (alternatively, we can describe this as rooting the tree on D; Fig 9b). The Newick format description would now be

((A:v1,B:v2):v3,C:v4,D:v5);

as if the tree had a basal trichotomy involving C, D, and the ancestor of A and B. Note how this solution gives us a natural place to put all branch lengths. The basal trichotomy solution only works if we restrict our attention to binary trees, since there is no way we can distinguish a truly rooted tree with a basal trichotomy from an unrooted tree written as if it were a rooted tree with a basal trichotomy.

A Newick format tree is often contained in a tree block of a NEXUS file.

Figure 11: The Felsenstein tree data model.

The general structure of a NEXUS file containing a tree block is:

#NEXUS
begin trees;
   translate
      1 Man,
      2 Chimp,
      3 Gorilla,
      4 Orangutan,
      5 Gibbon;
   tree example = (((1,2),3),4,5);
end;

Note the translate command, which allows you to map some tokens in a Newick format tree description (typically integers) to real taxon (terminal node) names. Note also that the tree is given without branch lengths; this is permitted by the Newick format. In a commonly used extension of the NEXUS tree format, we can also specify whether the tree is rooted or unrooted using the tags [&R] for rooted and [&U] for unrooted. The tree above would then be given as

tree example = [&U] (((1,2),3),4,5);

to indicate that it should be understood as an unrooted tree instead of a rooted tree with a true basal trichotomy. As Felsenstein mentions, it is likely that the Newick format will be replaced, or at least complemented, by an XML standard for describing trees. The XML code
suggested by Felsenstein seems reasonable to us, but since there are a couple of errors in his example, we reiterate it here for clarity (hopefully without errors):

<phylogeny>
  <clade>
    <clade length="0.06">
      <clade length="0.102"><name>A</name></clade>
      <clade length="0.23"><name>B</name></clade>
    </clade>
    <clade length="0.4"><name>C</name></clade>
  </clade>
</phylogeny>

Note the use of the tag 'clade' for subtrees and 'phylogeny' for the entire tree. To preserve more of the NEXUS format, one could perhaps have used the term 'tree' instead of 'phylogeny'. The phylogeny (or tree) tag would presumably be able to hold information on whether the tree is rooted or unrooted, and the phylogeny could also contain a name tag if the tree is named.

5 Some Simple Tree Algorithms

Reference: Felsenstein Chapter 35, pp 587-589

Most algorithms on trees involve repeating the same operation on each node or branch of the tree in turn, either passing from the tips of the tree down to the root, or from the root up to the tips. Biologists often refer to these traversals as the downpass and the uppass sequence, respectively. In computer science they are known as the postorder (downpass) and preorder (uppass) traversals, respectively. There is also an inorder traversal, but it is rarely used for evolutionary trees.

Assume that for each node p, we want to carry out a function f(p). Then the postorder traversal is easily implemented as a recursive function (Algorithm 1). It is called postorder traversal because the function is carried out after passing through the subtree rooted at p. The preorder traversal algorithm is similar (Algorithm 2), but the function is now carried out before passing through the subtree rooted at p.

Algorithm 1 Postorder traversal algorithm
  Traverse left descendant of p
  Traverse right descendant of p
  Carry out f(p) on p

Algorithm 2 Preorder traversal algorithm
  Carry out f(p) on p
  Traverse left descendant of p
  Traverse right descendant of p
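To make Algorithms 1 and 2 concrete, here is a minimal, self-contained Java sketch of the two traversals on a BTM-style node. The class and member names used here (Node, left, right, postorder, preorder) are illustrative only and are not taken from Felsenstein's code:

```java
import java.util.ArrayList;
import java.util.List;

// A stripped-down BTM node: two descendant pointers, null for tips.
class Node {
    Node left, right;
    String name;
    Node(String name) { this.name = name; }
}

public class Traversals {

    // Postorder (Algorithm 1): visit both subtrees, then apply f(p).
    static void postorder(Node p, List<String> out) {
        if (p == null) return;
        postorder(p.left, out);
        postorder(p.right, out);
        out.add(p.name);   // stands in for f(p)
    }

    // Preorder (Algorithm 2): apply f(p) first, then visit the subtrees.
    static void preorder(Node p, List<String> out) {
        if (p == null) return;
        out.add(p.name);   // stands in for f(p)
        preorder(p.left, out);
        preorder(p.right, out);
    }

    public static void main(String[] args) {
        // Rooted binary tree ((A,B),C)
        Node root = new Node("root"), anc = new Node("anc");
        anc.left = new Node("A"); anc.right = new Node("B");
        root.left = anc; root.right = new Node("C");

        List<String> post = new ArrayList<>(), pre = new ArrayList<>();
        postorder(root, post);
        preorder(root, pre);
        System.out.println(String.join(",", post)); // A,B,anc,C,root
        System.out.println(String.join(",", pre));  // root,anc,A,B,C
    }
}
```

The recursion mirrors the pseudocode exactly; only the position of the f(p) call distinguishes the two traversals.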
If you want to minimize the use of recursive algorithms, you can generate an array of node (or branch) pointers in postorder traversal order using a recursive function once. Anytime you need a postorder traversal of the tree after this initial call, you can then simply loop over the elements of the downpass node array in order. Similarly, a preorder traversal is implemented by looping over the elements in the downpass array in reverse order. The following code illustrates how you can create a postorder traversal array using Java, assuming rooted trees represented using the BTM:

public class Tree {

    public Node[] downPass;    // use as pointers
    public Node theRoot;       // use as pointer
    private int index;
    private Node[] theNodes;   // this is where we actually keep the nodes

    // ... code to construct the tree and set theRoot appropriately ...

    public void GetDownPass() {
        index = 0;
        DownPass(theRoot);
    }

    private void DownPass(Node p) {
        if (!p.tip) {
            DownPass(p.leftdesc);
            DownPass(p.rightdesc);
        }
        downPass[index++] = p;
    }
}

The downPass array can then be accessed by using code such as:

for (int i = 0; i < theTree.GetNumNodes(); i++) {
    Node p = theTree.allDownPass[i];
    // Do something with the node p here
}

Note that the downPass array is made public here, which is convenient but exposes the array to users of the tree class. Better encapsulation can be provided, for instance, by using get functions to access the elements of the downPass array. In many algorithms we need to handle only internal nodes, so it makes good sense to keep postorder arrays of both all the nodes and only the interior nodes in the tree (including the root).

To print a tree in Newick format, it is convenient to use a recursive algorithm for printing a subtree rooted at p, and call it initially with the root of the tree. If we use the BTM and the convention described above for storing rooted and unrooted binary trees, the algorithm needs to include an extra branch for the root of the tree (Algorithm 3). Note that the algorithm will not print the semicolon
required after the parenthetical description of the tree, so this element will have to be added.

Algorithm 3 Recursive algorithm for printing a tree (BTM, unrooted or rooted)
  if p is not tip then
    Print "(" and left subtree of p
    Print "," and right subtree of p
    if p is root of unrooted tree then
      Print "," and ancestor subtree of p
    end if
    Print ")"
  end if
  Print name of p (if any)
  Print ":" and branch length of p (if any)

Almost anything you can do in a recursive algorithm can also be done using an ordinary algorithm. For instance, you can print a BTM unrooted or rooted tree using a non-recursive algorithm if you add an integer member (let's call it m_p) to each node p and use that to indicate whether the node is being visited the first, second, or third time (the 'root' of an unrooted tree will actually be visited also a fourth time); see Algorithm 4. The algorithm starts with p being the root node of the tree.

Algorithm 4 Nonrecursive algorithm for printing a tree (BTM, unrooted or rooted)
  for all p do
    m_p <- 0
  end for
  p <- root
  while p != 0 do
    m_p <- m_p + 1
    if p is not tip then
      if m_p = 1 then
        Print "(" ; p <- left descendant of p
      else if m_p = 2 then
        Print "," ; p <- right descendant of p
      else if m_p = 3 and p is root of unrooted tree then
        Print "," ; p <- ancestor of p
      else if m_p >= 3 then
        Print ")"
      end if
    end if
    if m_p >= 3 or p is tip then
      Print name of p (if any)
      Print ":" and branch length of p (if any)
      if m_p > 3 ('root' of unrooted tree) then
        p <- 0
      else
        p <- ancestor of p
      end if
    end if
  end while

To construct a binary rooted or unrooted BTM tree from a Newick format string, you can use a recursive algorithm for reading a subtree rooted at node p (Algorithm 5). It uses a global index variable i, which indicates the position from which we are currently reading the string or array s containing the Newick format tree. You start the algorithm by setting i = 0 and calling it with the root of the tree.

Algorithm 5 Recursive algorithm for reading a tree (BTM, unrooted or rooted)
  if s_i = "(" then
    i <- i + 1
    Create a new node q
    Set left descendant of p to q; set ancestor of q to p
    Read subtree rooted at q
    Check that s_i = "," ; i <- i + 1
    Create a new node q
    Set right descendant of p to q; set ancestor of q to p
    Read subtree rooted at q
    if p is 'root' of unrooted tree then
      Check that s_i = "," ; i <- i + 1
      Create a new node q
      Set ancestor of p to q; set ancestor of q to p
      Read subtree rooted at q
    end if
    Check that s_i = ")" ; i <- i + 1
  end if
  Read name of p (if any)
  Read ":" and branch length of p (if any)
  Set i to index of next character to read

Again, the same task can be accomplished using a non-recursive algorithm (Algorithm 6). It uses a semicolon to mark the end of the string or array containing the tree description; any other way of determining the end of the tree description would work equally well.

Algorithm 6 Nonrecursive algorithm for reading a tree (BTM, unrooted or rooted)
  i <- 0
  while s_i is not ";" do
    if s_i = "(" then
      i <- i + 1
      Create a new node q
      Set left descendant of p to q; set ancestor of q to p
      p <- q
    else if s_i = "," then
      i <- i + 1
      Create a new node q
      if p is left descendant of its ancestor then
        Set right descendant of ancestor of p to q
      else
        Set ancestor of ancestor of p to q
      end if
      Set ancestor of q to ancestor of p
      p <- q
    else if s_i = ")" then
      i <- i + 1
      Set p to ancestor of p
    else
      Read label of p (if any)
      Read ":" and branch length of p (if any)
      Set i to index of next character to read
    end if
  end while

If you are interested in exploring the PTM, here is a snippet of Java code indicating the basic recursive algorithm for a postorder traversal:

public void Traverse(Node p) {
    for (Node q = p.left; q != null; q = q.sib)
        Traverse(q);
    F(p);
}

The same code could be written like this using the FTM, assuming that a tip node has an out pointer pointing to itself:

public void Traverse(Node p) {
    for (Node q = p.next; q != p; q = q.next)
        Traverse(q.out);
    F(p);
}

Both models give quite compact code for traversing a tree, since we do not have to deal with any special cases for unrooted trees. The recursive tree writing algorithm could be written like this for the PTM:

String treeDescription = "";

public void PrintSubtree(Node p) {
    if (!p.tip) {
        treeDescription += "(";
        for (Node q = p.left; q != null; q = q.sib) {
            PrintSubtree(q);
            if (q.sib != null)
                treeDescription += ",";
        }
        treeDescription += ")";
    }
    if (p.tip)
        treeDescription += p.GetLabel();
    if (hasBrlens && p != theRoot) {
        treeDescription += ":";
        treeDescription += p.GetLength();
    }
}

Note the clean code due to the lack of special cases. The same algorithm might look almost identical for the FTM:

String treeDescription = "";

public void PrintSubtree(Node p) {
    if (!p.tip) {
        treeDescription += "(";
        for (Node q = p.next; q != p; q = q.next) {
            PrintSubtree(q.out);
            if (q.next != p)
                treeDescription += ",";
        }
        treeDescription += ")";
    }
    if (p.tip)
        treeDescription += p.GetLabel();
    if (hasBrlens && p != theRoot) {
        treeDescription += ":";
        treeDescription += p.GetLength();
    }
}

As stated previously, the FTM representation has the advantage that it can be rerooted more easily than the BTM and PTM. However, it is also fairly common that we want to pass from a node p down to the root. This is easily done with the BTM or PTM using code such as:

while (p != theRoot) {
    // do something with p
    p = p.anc;
}

The FTM representation, on the other hand, requires slightly more complex code:

while (p != theRoot) {
    // do something with p
    p = p.out;
    // assume we have a boolean called top, marking whether the nodelet is on the top of a branch
    while (!p.top)
        p = p.next;
}

6 Study Questions

1. Define a clock tree, a non-clock tree, a rooted tree, and an unrooted tree.
2. Define terminal node, interior node, root, and branch.
3. Can a clock tree be unrooted? Can a non-clock tree be rooted?
4. How many branches are there in a rooted binary tree with n terminals? How many interior nodes are there (count the root as an interior node)?
5. How many branches are there in an unrooted binary tree with n terminals? How many interior nodes are there?
6. What is the difference between a
cladogram and a phylogram?

7. Of the rooted trees given below (Fig 12), which have identical topology? Which trees have identical topology after the root is removed?

Figure 12: Some rooted trees. Although all of them look different, some of them specify the same topology.

8. Give a recursive algorithm performing the function f(p) in postorder sequence on the nodes p of a tree represented by the PTM.
9. Describe a complete data structure for the FTM using Java code.
10. Show how a postorder array of nodes for the BTM can be used to perform a function f(p) on each node of a tree in a preorder traversal.

Models of morphological evolution

Fredrik Ronquist

September 21, 2005

In this lecture we will be covering models of morphological evolution. These fall into two different classes: those that model morphology in terms of discrete states, and those that treat morphology as quantitative characters. Both are applicable to other types of data; for instance, the discrete-state models can be applied to restriction site data and other data of a presence/absence nature, and the quantitative character models can be used as approximations of allele frequency evolution (see previous lecture).

1 Morphology as discrete states

In many cases morphological variation falls naturally into discrete states. Typical examples include the presence or absence of major features like wings and feathers. Although it is currently standard practice to use parsimony methods like Fitch optimization to infer phylogenies from such discrete-state morphological data, we can also apply discrete-state continuous-time Markov models like the ones we have seen previously for molecular sequence data. A few minor complications arise, but they are all possible to address.

1.1 The binary model

For a morphological character with two states, 0 and 1, we can simply use a Markov model with the (unscaled) instantaneous rate matrix

Q = {q_ij} = [ -1   1 ]
             [  1  -1 ]
This is obviously an analogue of the Jukes-Cantor model of DNA evolution. Lewis (2002) recently referred to this model as the M2 model; it has been in use for morphological and other types of discrete data at least since the 80's. In principle, the M2 model can be easily extended to multi-state characters, but the issue of state ordering comes up. Recall that parsimony methods distinguish between ordered and unordered multi-state characters (Fig 1). In the simplest case, we assume that all changes between states are possible (Fig 1a); the alternative is to order the states in a linear series, such that, in a three-state character, changes between the two end states have to go through the intermediate state (Fig 1b). In a stochastic model, we would simply set the instantaneous rate of the impossible changes to zero and use a uniform rate for the other events. Thus, the stochastic model for a three-state unordered character (we can refer to it as the M3u model) is

Q = {q_ij} = [ -2   1   1 ]
             [  1  -2   1 ]
             [  1   1  -2 ]

while the corresponding ordered model (the M3o model) is

Q = {q_ij} = [ -1   1   0 ]
             [  1  -2   1 ]
             [  0   1  -1 ]

Figure 1: Stochastic models for unordered (a) and ordered (b) morphological characters.

Note the similarity between the M3o model and the codon models we discussed earlier. As with the latter models, the zero entries in the instantaneous rate matrix of the M3o model do not result in transitions between the end states being impossible, but they do force those transitions to go through the intermediate state, thus lowering the transition probability. An interesting but somewhat counterintuitive property of the M3o model is that it has the stationary state frequencies equal for all states; the rate asymmetries only affect the intensity with which the state changes occur. Thus, if you start a set of characters in the intermediate state, you expect to approach equilibrium faster than if you start with the same set of characters in one of the extreme states. The M3o and M3u models can easily be extended to any number of states. However, this raises the
question of how to determine the appropriate state space for a given morphological character. The state space may be obvious for some characters, but for the majority it is difficult to determine the state space in any other way than by simply recording the number of observed states in some set of organisms. If the number of states is based on observations, we may wish to let the size of the state space of each character be a parameter of the model. However, the probability of the state space being larger than the number of observed states appears to be small for most real data sets; it is only slowly evolving characters on small phylogenies that are likely to have unobserved states.

1.2 Ascertainment bias

(JF Chapter 15, pp 234-235)

An issue that arises with both morphological and restriction site data is ascertainment bias, also called coding bias. When calculating the probability of observing some data on a tree, we typically assume that all character patterns can be observed. For instance, with a four-species tree and a four-by-four model of DNA evolution for n characters, we could calculate the probability of observing all the 4^4 = 256 character patterns, from AAAA over AAAC to TTTT, on a given tree. If the probability of pattern i is p_i, we have that sum_i p_i = 1. The likelihood L, or the probability of the data given some parameter values, can be calculated by taking the number of observations c_i of each pattern i into account in the product over sites:

L = prod_i p_i^(c_i)

If certain types of character patterns cannot be observed, then the likelihood of the observed data, calculated naively according to this formula, will be too high, with potentially serious effects on parameter estimates. In restriction site data, for instance, we typically cannot observe sites that are absent from all of the studied taxa. To compensate for this, we simply divide the probability of each character by the probability of the observable patterns. If the probability is p_0 for the all-absence pattern, then the corrected likelihood is

L = prod_i p_i / (1 - p_0)

With morphological data, the difficulty is to include invariable characters. There is simply no straightforward way of sampling invariable characters with an intensity comparable to that with which variable characters are scored. Again, the solution is to calculate the probability of the unobservable patterns, in this case the all-zero and all-one patterns, and correct for the ascertainment bias when calculating the likelihood value. If the probability is p_0 for the all-zero pattern and p_1 for the all-one pattern, then the corrected likelihood is

L = prod_i p_i / (1 - p_0 - p_1)

It is common practice in morphological studies to omit not only invariable characters but also characters that are not parsimony-informative. The latter are characters that have the same length on all trees using the parsimony criterion; typically they are unique features of single terminals in the tree. In principle, this ascertainment bias can also be corrected for, but the correction is more complicated because there are many more character patterns that are left out of the character matrix. The total probability of all character patterns is summarized in Figure 2.

Figure 2: Cumulative probabilities of different types of character patterns.

1.3 Rate asymmetry

We can easily accommodate transition rate asymmetry in the M2 model by assuming that the stationary state frequencies of the two states are different, giving us an analogue of the Felsenstein 81 model:

Q = {q_ij} = [ -π1   π1 ]
             [  π0  -π0 ]

There is only one free parameter in this model, since π0 + π1 = 1. An alternative parameterization of the same model would use the ratio a of the forward (0 to 1) rate to the backward (1 to 0) rate:

Q = {q_ij} = [ -a   a ]
             [  1  -1 ]

We can go from one representation to the other by noting that a = π1/π0, π1 = a/(1 + a), and π0 = 1/(1 + a). There is one implicit assumption of this type of model that can cause problems with morphological data. By introducing a stationary state frequency parameter for
the 0 state and one for the 1 state, we are assuming that state labels are non-arbitrary if there is more than one character. In other words, given that we know the states of one arbitrarily chosen binary character, there must be a way of determining how to apply the state labels to all other characters. For instance, if one state is taken to mean absence and the other state presence of a particular type of trait, then state labels are non-arbitrary, the assumption is satisfied, and we can reasonably infer the stationary state frequency of the two states. This would be true, for instance, for restriction site data, where the two states can be used to denote presence or absence of each restriction site. For many morphological characters, however, state labels are truly arbitrary. For instance, how can we translate the presence or absence of wings to black or yellow body color? The solution to this dilemma is to model variation in stationary state frequencies across sites using some suitable distribution, for instance a symmetric beta distribution (Fig 3). The symmetric beta distribution is like an ordinary beta distribution where the two parameters, referred to as α0 and α1, are assumed to be equal and replaced by a single value α = α0 = α1. The symmetric beta distribution can model a wide variety of scenarios, from extremely asymmetric rates (α << 1), over moderately asymmetric rates (α ≈ 1), to equal rates (α → ∞).

Figure 3: Symmetric beta distribution with some different values of α.

2 Quantitative characters

(JF Chapters 23-25)

Many morphological and other characters are best understood as evolving on a continuous scale. Most of the work on stochastic models for quantitative characters has focused on Brownian motion, more precisely a mathematical model that approximates Brownian motion. In this model, a particle takes a large number of steps on an axis, each step being independent of the others and displacing the particle by a random amount
drawn from a normal distribution with variance s². The model is obtained as the limit condition when the number of steps n goes to infinity and the variance s² goes to zero, while the product n·s² remains constant. If we start a Brownian motion process with value x0, and the process accrues the variance σ² per time unit, then the value at time t is normally distributed with mean x0 and variance σ²t. Calculating the likelihood under the Brownian motion model is relatively straightforward, except for a problem with the degrees of freedom at the root of the tree; Felsenstein explains the procedure in detail in his Chapter 23.

A disadvantage of the Brownian motion model is that it has no limits (Fig 4). A character can easily take on an extreme value, and then it is unlikely to come back to the ancestral value. This may not be a realistic model for the evolution of morphological characters, which tend to have physical bounds as well as some central tendency in their values. An alternative to Brownian motion that may better reflect these circumstances is the Ornstein-Uhlenbeck (OU) process. The OU process is Brownian motion but with a force continually pulling towards a central point (Fig 4). If the returning force is pushing a character towards a central point at the rate of a per unit time, and the process accumulates σ² variance per unit time, then an OU process starting at x0 will find itself in position x(t) after t time units, where x(t) is normally distributed with mean

Exp[x(t)] = x0 · e^(-at)

(measuring positions relative to the central point) and variance

Var[x(t)] = σ²(1 - e^(-2at)) / (2a)

The Ornstein-Uhlenbeck process is difficult to handle in likelihood inference, but it has been used to model rate variation over time in a Bayesian context.

Figure 4: Examples of Brownian motion (a) and the Ornstein-Uhlenbeck process (b).

3 Study questions

1. What is ascertainment bias and how can it be corrected?
2. How can ordered and unordered discrete morphological characters be modeled?
3. What are the stationary state
frequencies of the M3u model and the M3o model?
4. When is it appropriate to assume transition rate asymmetry in a binary character?
5. How can you address transition rate asymmetry when state labels are arbitrary?
6. What is the difference between the Brownian motion model and the Ornstein-Uhlenbeck process?

Sequence Alignment and Dynamic Programming

J R Quine, Department of Mathematics

Sources

Part of this lecture is taken from Calculating the Secrets of Life, Chapter 3, "Seeing conserved signals: using algorithms to detect similarities between biosequences" by Eugene W Myers. You can print low-resolution copies of each page directly from this link. Another reference for this material is Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Richard Durbin.

Eugene Myers

- Eugene Myers was responsible for the computations by Celera Genomics for completing the Human Genome Project.
- Read the first 6 pages of his article in Chapter 3 of Calculating the Secrets of Life. This is a good introduction to dynamic programming.

Mutations

- Mutation in DNA is a natural evolutionary process.
- Less often, a nucleotide is deleted or inserted.
- Some mutations are lethal, some are favorable, some have little impact.

Sequence similarity

We are looking for similarities between nucleotide or amino acid sequences. Some possible implications of sequence similarity are:
- Proteins have a common evolutionary origin
- Proteins have a similar function

Improving the Score Using Gaps Without Penalty

1 ATTACG / ATATCG
2 ATTACG / ATATCG
3 ATTACG / ATATCG

The scores are 4, 5, 5, respectively.

Important concepts

- Scoring or substitution matrix s. The simplest is the unit scoring matrix, as in dot matrix techniques.
- Total score of alignment: sum of s(a_i, b_i)
- Local alignment, global alignment
- Edit graph. This helps to formalize the ideas of dynamic programming.

Unit cost scoring scheme

[Figure: the unit cost scoring matrix]

The edit graph

- (M+1)(N+1) vertices
- A-gap
edges with score s(a_i, -)
- B-gap edges with score s(-, b_j)
- Alignment edges with score s(a_i, b_j)

Rule for Scoring Each Vertex in the Edit Graph

S(i,j) = max{ S(i-1,j-1) + s(a_i, b_j),  S(i-1,j) + s(a_i, -),  S(i,j-1) + s(-, b_j) }

Scores in the Edit Graph

[Figure: scores at the vertices of the edit graph]

The Completed Edit Graph

[Figure: the completed edit graph]

Pseudocode for dynamic programming

S(0,0) <- 0
for j = 1 to N do
    S(0,j) <- S(0,j-1) + s(-, b_j)
for i = 1 to M do
    S(i,0) <- S(i-1,0) + s(a_i, -)
    for j = 1 to N do
        S(i,j) <- max{ S(i-1,j-1) + s(a_i,b_j), S(i-1,j) + s(a_i,-), S(i,j-1) + s(-,b_j) }
The maximum score is S(M,N).

(The classical dynamic programming algorithm.)

Dynamic Programming

As we see from the above (the Myers article), there is an algorithm, called dynamic programming, which gives us the highest scoring alignment between two sequences. The algorithm runs in O(MN) time, where M and N are the lengths of the two sequences.

Dynamic Programming

Dynamic programming is a general computational paradigm of wide applicability. A problem can be solved by dynamic programming if the final answer can be determined by computing a tableau of answers to progressively larger subproblems.

Alignment Algorithms

- Needleman-Wunsch: Global alignment
- Smith-Waterman: Local alignment

Who Invented Dynamic Programming?

(from Introduction to Computational Biology: Maps, Sequences and Genomes by Michael S Waterman)

Needleman and Wunsch (1970) wrote a paper titled "A general method applicable to the search for similarities in the amino acid sequence of two proteins". It was surely unknown to the authors that their method fit into a broad class of algorithms introduced by Richard Bellman under the name dynamic programming.

Gap Penalty

The above unit scoring scheme scored zero for matching with a gap. Usually there is a negative score for matching with a gap:

s(a, -) = s(-, a) = -d

The number d is called the gap penalty. This will decrease the occurrence of gaps in optimal alignments.

Homework Problem

Calculate the dynamic programming matrix and an
optimal alignment for the DNA sequences GAA39I39I39C ancl GA39I39I39A scoring 2 for a match 1 for a mismatch 2 for a gap 2 is the gap JenaM Note Do this twice for the different ways to score a gap at the beginning or end of the sequence g Amino Acid Sequences For proteins we work with strings from a 20 letter alphabet A Ala Alanine R Arg Arginine N Asn Asparagine D Asp Aspartic acid C Cys Cysteine g Amino Acid Sequences Q Gln Glutamine E Glu Glutamic acid G Gly Glycine H His Histidine I Ile Isoleucine L Leu Leucine g Amino Acid Sequences K Lys Lysine M Met Methionine F Phe Phenylalanine P Pro Proline S Ser Serine TThr Threonine g Amino Acid Sequences W Trp Tryptophan Y Tyr Tyrosine V Va Valine B Asx Aspartic acid or Asparagine Z GIX Glutamic acid or Glutamine X Xaa Any amino acid Scoring Matrices for Amino Acids We want a scoring matrix or substitution matrix 5 for amino acids A scoring matrix should reflect amino acid properties A better score should result If amino acids with similar properties are aligned So Sab should be positive if the residues are very similar and negative if very unsimilar g Amino Acid Properties Type of Amino Properties Amino Acid Acids Amino acids with The hydrophobic side chains of these amino acids will not Ala Val aliphatic form hydrogen bonds or ionic bonds with other groups Leu lle hydrophobic side These hydrophobic amino acids tend to be buried in the Met Pro chains centre of proteins away from the surrounding aqueous Phe Trp environment Amino acids with The side chains of these amino acids are uncharged at Ser Tyr uncharged but physiological pH Asp Gln polar side chains Cys Amino acids with These have a carboxylic acid group in their side chain Asp Glu acidic side and are very hydrophilic chains Amino acids with The positive charge on these side chains makes them Lys Arg basic side chains hydrophilic and they are likely to be found at the protein His surface Neutral side The single hydrogen atom side chain has no strong Gly chain 
Scoring Matrix

Should the scoring matrix be decided by scientists with a knowledge of biochemistry, or should it be computed from an analysis of the current database of sequences? Bioinformatics or biochemistry?

Scoring Matrix

A frequently used scoring matrix for amino acid sequences is a BLOSUM matrix, such as the one shown below. The matrix is based on a set of multiply aligned, ungapped segments corresponding to the most highly conserved regions of proteins; these are called blocks.

[Figure: a BLOSUM substitution matrix, with one row and one column per amino acid; the numerical table did not survive text extraction.]

Homework Problem

Amino acids D, E and K are all charged; V, I and L are all hydrophobic. What is the average BLOSUM50 score within the charged group of three? Within the hydrophobic group? Between the two groups?

Three Alignments: Good, Less Good, and Ugly

To illustrate problems in evaluating the significance of an alignment, the figure below shows examples of three pairwise alignments, all to the same region of the human alpha globin protein sequence (SWISS-PROT database identifier HBA_HUMAN). The central line in each alignment indicates identical positions with letters and similar positions with a plus sign. Similar pairs of residues are those which have a positive score in the substitution matrix used to score the alignment.

Three Alignments

[Figure: three pairwise alignments to a region of HBA_HUMAN: (a) human beta globin (HBB_HUMAN), (b) leghaemoglobin from yellow lupin, and (c) a spurious alignment to a nematode glutathione S-transferase homologue; the alignment text did not survive extraction.]
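The group-average calculation the homework above asks for can be sketched in a few lines. Caution: the pair scores below are transcribed from the BLOSUM62 matrix as I recall it, not the BLOSUM50 matrix the homework names, so the numbers should be re-checked against a published table; the averaging code itself is matrix-independent, and here it averages over distinct unordered pairs, excluding identity scores.

```python
# Average substitution scores within and between residue groups.
from itertools import combinations, product

pair_scores = {  # symmetric entries, transcribed from BLOSUM62 (check!)
    ('D', 'E'): 2, ('D', 'K'): -1, ('E', 'K'): 1,
    ('V', 'I'): 3, ('V', 'L'): 1, ('I', 'L'): 2,
    ('D', 'V'): -3, ('D', 'I'): -3, ('D', 'L'): -4,
    ('E', 'V'): -2, ('E', 'I'): -3, ('E', 'L'): -3,
    ('K', 'V'): -2, ('K', 'I'): -3, ('K', 'L'): -2,
}

def s(a, b):
    """Symmetric matrix lookup."""
    return pair_scores.get((a, b), pair_scores.get((b, a)))

def avg_within(group):
    """Mean score over distinct unordered pairs inside one group."""
    pairs = list(combinations(group, 2))
    return sum(s(a, b) for a, b in pairs) / len(pairs)

def avg_between(g1, g2):
    """Mean score over all cross-group pairs."""
    pairs = list(product(g1, g2))
    return sum(s(a, b) for a, b in pairs) / len(pairs)

charged, hydrophobic = "DEK", "VIL"
print(round(avg_within(charged), 2))                 # within D, E, K
print(round(avg_within(hydrophobic), 2))             # within V, I, L
print(round(avg_between(charged, hydrophobic), 2))   # across the groups
```

The qualitative pattern (positive within groups, clearly negative between the charged and hydrophobic groups) is the point of the exercise.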
First Alignment (a): Good

In the first alignment, human alpha globin to human beta globin, there are many positions at which the two corresponding residues are identical; many others are functionally conservative, such as the pair D-E towards the end, representing an alignment of an aspartic acid residue with a glutamic acid residue, both negatively charged amino acids.

Second Alignment (b): Less Good

(b) also shows a biologically meaningful alignment, of human alpha globin to leghaemoglobin from yellow lupin. These two sequences are evolutionarily related, have the same three-dimensional structure, and function in oxygen binding. However, there are many fewer identities than for (a), and in a couple of places gaps have been inserted into the alpha globin sequence to maintain the alignment across regions where the leghaemoglobin has extra residues.

Third Alignment (c): Ugly

(c) shows an alignment with a similar number of identities or conservative changes as (b). However, in this case we are looking at a spurious alignment to a protein (a nematode glutathione S-transferase homologue) that has a completely different structure and function.

Role of Statistics

We would like some probabilistic measure (an alignment score) indicating how closely two sequences are related. We want the score to tell us if there is a significant relationship between the sequences, or if what looks good is just a random occurrence.

A warm-up: the birthday paradox. The probability of n random people all having different birthdays is 365 × 364 × ... × (365 − n + 1) / 365^n. This drops below 1/2 for n = 23.

Log Odds

The score for the maximal-scoring alignment between two amino acid sequences using the BLOSUM substitution matrix is based on a log-odds system.
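The birthday figure quoted above is easy to verify numerically; this small sketch just multiplies out the product term by term.

```python
# Probability that n people all have distinct birthdays:
# (365/365) * (364/365) * ... * ((365 - n + 1)/365).

def p_all_distinct(n, days=365):
    p = 1.0
    for i in range(n):
        p *= (days - i) / days
    return p

for n in (10, 22, 23, 40):
    print(n, round(p_all_distinct(n), 4))
# the probability of all-distinct birthdays drops below 1/2
# between n = 22 and n = 23
```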
The theory does not take account of gaps. Some possibilities for scoring gaps are given by the linear gap penalty and the gap extension penalty.

Log Joke

As the animals left the ark, Noah told them to go forth and multiply. After some while, Noah happened upon two snakes sunning themselves. "Why aren't you multiplying?" Noah asked. The snakes replied, "We can't, we're adders." So Noah and his sons went into the nearby forest and felled some trees. They made a platform of logs onto which they placed the snakes. You see, even adders can multiply on a log table.

Scoring Using Maple

Maple has procedures for dealing with strings, and these can be used to write a procedure for scoring alignments using BLOSUM70.

Steve Thompson, Special Topics BSC5936: An Introduction to Bioinformatics, Florida State University, The Department of Biological Science (www.bio.fsu.edu), Sept 9, 2003

The Dot Matrix Method
Steven M. Thompson, Florida State University School of Computational Science and Information Technology (CSIT)

A general way to see similarities in pairwise comparisons, the dot matrix method gets you started thinking about sequence alignment in general and provides a Gestalt of all possible alignments between two sequences. To begin, I will use a very simple 0-1, match/no-match (identity) scoring function, without any windowing. As you will see later today, more complex scoring functions will normally be used in sequence analysis, especially with amino acid sequences. This example is based on an illustration in Sequence Analysis Primer (Gribskov and Devereux, editors, 1991). The sequences to be compared are written out along the x and y axes of a matrix; put a dot wherever symbols match (identities are highlighted). Since this is a comparison between two of the same sequences (an intrasequence comparison), the most obvious feature is the main identity diagonal. Two short perfect palindromes can also be seen as crosses directly off the main diagonal; they are ANA and SIS. The
biggest asset of dot matrix analysis is that it allows you to visualize the entire comparison at once, not concentrating on any one optimal region but rather giving you the Gestalt of the whole thing. Since your own mind and eyes are still better than computers at discerning complex visual patterns, especially when more than one pattern is being considered, you can see all these less-than-best comparisons as well as the main one, and then you can zoom in on those regions of interest using more detailed procedures. If the previous plot was a double-stranded DNA or RNA sequence self-comparison, the inverted repeat regions would be indicative of potential cruciform structures at that point. Direct internal repeats will appear as parallel diagonals off of the main diagonal. Check out the mutated intersequence comparison below. Here you can easily see the effect of a sequence insertion or deletion. It is impossible to tell whether the evolutionary event that caused the discrepancy between the two sequences was an insertion or a deletion, and hence this phenomenon is called an "indel." A jump or shift in the register of the main diagonal on a dotplot clearly points out the existence of an indel (again with the zero-one match score function). Another phenomenon that is very easy to visualize with dot matrix analysis is duplications, or direct repeats. These are shown in the following example. The duplication here is seen as a distinct column of diagonals; whenever you see either a row or column of diagonals in a dotplot, you are looking at direct repeats. Now consider the more complicated mutation in the following comparison. Again notice the diagonals; however, they have now been displaced off of the center diagonal of the plot and, in fact, in this example show the occurrence of a transposition. Dot matrix analysis is one of the only sensible ways to locate such transpositions in sequences. Inverted repeats still show up as perpendicular lines to the diagonals; they are just now not on the center of the plot.
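The features described above (the identity diagonal, and the diagonal shift at an indel) can be reproduced with a few lines of code. A minimal sketch; the sequences are invented for illustration.

```python
# A bare-bones dot matrix with the 0/1 identity score: one dot wherever
# the two sequences share a letter.

def dot_matrix(seq1, seq2):
    """Return the set of (i, j) positions where seq1[i] == seq2[j]."""
    return {(i, j) for i, c1 in enumerate(seq1)
                   for j, c2 in enumerate(seq2) if c1 == c2}

a = "ACGTTGCA"
b = "ACGGCA"        # same as a, with "TT" deleted: an indel
dots = dot_matrix(a, b)
for i in range(len(a)):   # crude text rendering, rows follow seq1
    print("".join("*" if (i, j) in dots else "." for j in range(len(b))))
```

In the printed plot, the diagonal runs (0,0)-(2,2) and (5,3)-(7,5): the register jump between them marks the indel.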
The deletion of PRIMER is shown by the lack of a corresponding diagonal.

Filtered Windowing

Reconsider the same plot. Notice the extraneous dots that neither indicate runs of identity between the two sequences nor inverted repeats. These merely contribute noise to the plot and are due to the random occurrence of the letters in the sequences (the composition of the sequences themselves). How can we clean up the plots so that this noise does not detract from our interpretations? Consider the implementation of a filtered windowing approach: a dot will only be placed if some stringency is met. What is meant by this is that if, within some defined window size, some defined criterion is met, then and only then will a dot be placed at the middle of that window. Then the window is shifted one position and the entire process is repeated. This very successfully rids the plot of unwanted noise. In this plot a window of size three and a stringency of two is used to considerably improve the signal-to-noise ratio (remember, I am using a 1/0 identity scoring function). The only remaining dots indicate the two runs of identity between the two sequences; however, any indication of the palindrome ANA has been lost. This is because our filtering approach was too stringent to catch such a short element. In general, you need to make your window about the same size as the element you are attempting to locate. In the case of our palindrome, AN and NA are the inverted repeat sequences, and since our window was set to three, we will not be able to see an element only two letters long. Had we set our stringency filter to one in a window of two, then these would be visible. The Wisconsin Package's implementation of dot matrix analysis, the paired programs Compare and DotPlot, uses the window-stringency method by default.

Filtered Dot Plot Techniques

You need to be careful with window-stringency dot matrix methods. Default window sizes and stringencies may not be appropriate for the analysis at hand.
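The window/stringency filtering just described can be sketched directly: slide a window along both sequences and place a single dot at the window midpoint only when enough positions match. Function names and the example sequence are my own.

```python
# Filtered windowing for a dot plot: a dot at the midpoint of each
# window pair in which at least `stringency` positions are identical.

def filtered_dots(seq1, seq2, window=3, stringency=2):
    dots = set()
    half = window // 2
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(c1 == c2 for c1, c2 in
                          zip(seq1[i:i + window], seq2[j:j + window]))
            if matches >= stringency:
                dots.add((i + half, j + half))
    return dots

# Self-comparison: the main identity diagonal survives, while isolated
# single-letter matches fall below the stringency threshold.
seq = "GATTACA"
dots = filtered_dots(seq, seq, window=3, stringency=2)
print(sorted(dots))
```

As the lecture warns, elements shorter than the window (like the two-letter palindrome arms) disappear under this filter.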
The Wisconsin Package default window size and stringency for protein sequences are 30 and 10, respectively, based on BLOSUM scores (soon to be explained in Dr. Quine's lecture). Sometimes this is perfectly reasonable. Take, for instance, the next real-life example: the human calmodulin protein sequence compared to itself.

Human calmodulin x itself. What's your interpretation? Do you know what the EF-hand is?

The calmodulin structure: the four EF-hand (helix-loop-helix) conformations at positions 20, 56, 93, and 129 bind Ca++ ions to affect several biological systems, including mediating control of a large number of Ca++-dependent enzymes, in particular several protein kinases and phosphatases, many of which affect systems ranging from muscle action and cAMP to insulin release.

Calmodulin x alpha actinin: default parameters give some confusion; window=24/stringency=24 gives a clearer picture. Alpha actinin has two EF-hand motifs to calmodulin's four.

Even more can be done with RNA. Consider the following set of examples from the phenylalanine transfer RNA (tRNA-Phe) molecule from Baker's yeast. The sequence and structure of this molecule are known; the illustration will show how simple dot-matrix procedures can quickly lead to functional and structural insights, even without complex folding algorithms. If run with all default settings, including a 0/1 scoring table, the dotplot from a comparison of this sequence with itself is quite uninformative, only showing the main identity diagonal.

Default RNA self-comparison: window of 21 and stringency of 14 with the 0/1 scoring function.

However, if you adjust the window size down to find finer features, some elements of symmetry become apparent. Here I have changed the window size to 7 and the stringency value to 5. (As a general guide, pick a window size about the same size as the element you are trying to find.) Several direct repeats are now obvious that remained obscured in the previous analysis. RNA comparisons of the reverse complement of a sequence to itself can often be
very informative. Here the yeast tRNA sequence is compared to its reverse complement, using the same 5-out-of-7 stringency setting as previously. The stem-loop inverted repeats of the tRNA cloverleaf molecular shape become obvious. They appear as clearly delineated diagonals running perpendicular to an imaginary main diagonal, running oppositely than before. Take, for instance, the middle stem: the region of the molecule centered at approximately base number 38 has a clear propensity to base pair with itself without creating a loop, since it crosses the main diagonal; and then, just after a small unpaired gap, another stem is formed between the region from about base number 24 through 30 with approximately 46 through 40. That same region, zoomed in on, has some small direct repeats, seen by comparing the sequence against itself without reversal. But looking at the same region of the sequence against its reverse complement shows a wealth of potential stem-loop structure in the transfer RNA.

[Figure: the region around bases 22 through 48 of the tRNA; base position 22 through position 33 base pairs with (is quite similar to the reverse complement of) itself from base position 37 through position 48. The sequence text did not survive extraction.]

MFold, Zuker's RNA folding algorithm, uses base pairing energies to find the family of optimal and suboptimal structures; the most stable structure found is shown to possess a stem at positions 27 to 31 with 39 to 43. However, the region around position 38 is represented as a loop. The actual modeled structure, as seen in PDB's 1TRA, shows reality lies somewhere in between.

Conclusions

What about these alike areas? What's the best path through the dot matrix? How long do I extend it? How can I zoom in on it to see exactly what's happening? Where specifically is this alignment, and how can I see the best ones? And what can I learn from these alignments? This brings up the alignment problem. It is easy to see that two sequences are aligned when they have identical symbols at identical positions,
but what happens when symbols are not identical, or the sequences are not the same length? How can we know that the most alike portions of our sequences are aligned, when is an alignment optimal, and does optimal mean biologically correct? But how to do all of this? A brute-force approach just won't work. Even without considering the introduction of gaps, the computation required to compare all possible alignments between two sequences requires time proportional to the product of the lengths of the two sequences. Therefore, if the two sequences are approximately the same length, N, this is an N^2 problem. To include gaps, we would have to repeat the calculation 2N times to examine the possibility of gaps at each possible position within the sequences, now an N^(4N) problem. Waterman illustrated the problem in 1989, stating that to align two sequences 300 symbols long, 10^88 comparisons would be required, about the same number of elementary particles estimated to exist in the universe. Part of a better solution: enter the dynamic programming algorithm and Dr. Jack Quine's lecture.

http://bio.fsu.edu/stevet/workshop.html. Contact me (stevet@bio.fsu.edu) for specific bioinformatics assistance and/or collaboration.

A PRIMER ON BAYESIAN APPROACHES TO HYPOTHESIS TESTING (AN OPTIONAL SUPPLEMENT TO THE BASIC LECTURE)

Overview

The Bayesian approach to testing hypotheses has two critical elements. First, a Bayesian approach stands the classical (by which I mean frequentist, but I can't bring myself to type these "-ist" designations) approach on its head. In the classical approach, our questions revolve around the probability of the data given a specific hypothesis. In the Bayesian approach, our questions revolve around the probability of various hypotheses given the data. You can see that the derivation of Bayesian methods will involve the same manipulations of conditional probabilities that we discussed when we covered Bayesian estimation methods. Technically, our Bayesian questions revolve around the likelihood of various hypotheses given
the data, which is not the same as the probabilities of various hypotheses; we will not delve into this distinction here (this is a heuristic exercise, after all). The point is that the classical approach is based on calculating probabilities of the data in hand under certain conditions, which are encompassed by the null and alternative hypotheses, and the Bayesian approach looks at probabilities of competing conditions (which are the hypotheses being compared) given the data in hand. Second, a Bayesian approach integrates prior probabilities associated with competing conditions into the assessment of which condition is the most likely explanation for the data in hand. Put another way, a Bayesian approach evaluates the likelihoods of competing conditions by evaluating the change in the odds associated with the conditions, a change produced by assessing the data in hand. If the odds change sufficiently when the data are examined, then the scientist may alter his or her opinion about which competing condition is the most likely one. Now, a dedicated Bayesian may take this process even further by weighing the change in the odds by the cost of mistakes in judgment. In other words, the dedicated Bayesian will weigh the new odds by the expected losses associated with an incorrect decision (accept one hypothesis when the other is true) and base the final decision on the combined evidence of the change in the odds and the possible costs associated with mistakes. Of course, one can integrate costs of mistakes into a classical approach, but a classical approach does not admit the concept of prior probabilities or include any criterion like assessing the change in the odds ratio suggested by analysis of the data in hand. Perhaps the best way to appreciate a Bayesian approach and the methods therein is to walk through a relatively simple example. I have not presented all details in the example below, and I am also presuming that you remember the basic elements of likelihood calculations from the lectures; in
other words, this example is not intended as a self-contained tutorial.

Example

Situation: Scientists and policy makers in a fishery management agency are faced with a potential problem. Existing harvest rules for the species Savory gustatorum are based on two assumptions. First, the natural adult mortality rate is about 20% per year. Second, fishing under the rules imposes an annual adult mortality rate of 60%. Models of this fishery suggest that as long as annual adult survival is at least 20%, the population is sustainable and the harvesting is sustainable. But this year's annual survey of tagged adults suggests to some scientists in the agency that fishing mortality rates have increased, perhaps because more fishing licenses have been sold, because fisherpeople are getting better at finding the fish, because some individuals are not observing catch limits (fillets of this fish bring a nice price), or all of the above. The data in hand are that 60 of 80 tagged adults were landed (there is a reward for reporting the landing of a tagged fish). We assume that fish that aren't reported as landed survive the season, i.e., natural mortality is negligible in the period between tagging and landing (this is just to make simple calculations for this pedagogic exercise). This is a higher fraction of tagged fish reported landed than in most previous years. The scientists are concerned because the models of the fishery indicate the emergence of real problems real quickly if less than 20% of the adults survive the season. These scientists think that fishing mortality rates are probably around 70% rather than 60%, which might create adult survival rates as low as 10% if natural mortality rates remain at 20%, and so a change in rules for next season would be in order to prevent a problem from emerging. As often happens, this idea terrifies the policy people, who must develop new rules that would reduce fishing mortality and defend them before hostile audiences. So a rejection of the idea that fishing mortality is 60% is not to be made lightly. Let's take the
data as they are and not worry about whether landings are underreported, whether fishing and natural mortality are compensatory, or any other real-world complication.

A classical analysis

There is a simple classical analysis based on examining numbers of landings and survivors as binomially distributed random variables. If the rate of fishing mortality is 60%, then survival should be 40%. The expected number of survivors (that is, fish not landed) under a binomial model with probability of individual survival being 0.40 will be the product of the number of trials (number of tagged adults = 80) and the probability of individual survival (p = 0.40), in this case np = 32. When the number of trials is large, the number of successes converges on a t-distribution with expected value np and variance np(1 − p). We can then calculate a t-statistic as the observed number minus the expected number, divided by the square root of the variance. This yields a t-statistic value of −2.74 with 79 degrees of freedom. The probability of a result this extreme or more extreme under the null hypothesis is about 0.004 under a one-tailed test (the alternative hypothesis is that mortality rates are larger than 0.60). So with this approach we would reject the null hypothesis in favor of a hypothesis that says that fishing mortality rates are higher than 60%.

A Bayesian approach

The Bayesian approach to the problem begins with specifying the hypotheses to be compared and the prior probabilities associated with each one. The goal of the Bayesian approach is to reach a decision, if possible, about which hypothesis is the more likely one given the data in hand. In our case there are two hypotheses: that adult fishing mortality rates are 60%, which we will call state A, or that those rates are 70%, which we will call state B. The prior probability of state A being true is 0.80; this is based on the fact that in 8 of the last 10 years the results of the annual tagging study would give an estimated fishing mortality rate of about 60%. The results of the last two years are suggesting a higher rate, perhaps 70%.
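The classical calculation above can be reproduced in a few lines. This sketch uses a normal approximation (via `math.erfc`) for the tail probability; the text's figure of about 0.004 comes from the t distribution with 79 degrees of freedom, which is slightly heavier-tailed.

```python
# Classical analysis: survivors as a binomial count, with the test
# statistic (observed - expected) / sqrt(variance).
import math

n_tagged = 80
landed = 60
p_survive = 0.40                  # survival if fishing mortality is 60%

survivors = n_tagged - landed     # 20 fish not reported landed
expected = n_tagged * p_survive   # np = 32
variance = n_tagged * p_survive * (1 - p_survive)   # np(1 - p) = 19.2

t = (survivors - expected) / math.sqrt(variance)
# one-tailed tail probability under a standard normal approximation
p_one_tailed = 0.5 * math.erfc(abs(t) / math.sqrt(2))
print(round(t, 2), round(p_one_tailed, 4))
```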
The scientists doing the Bayesian analysis assign a prior probability of 0.40 to this state B, because they have also noticed fewer juveniles and have found out that more licenses have been sold, and they think that these observations reinforce their sense that fishing mortality rates are higher. In our class notation, then, P(A) = 0.80, P(B) = 0.40, and the prior odds ratio O = 2. The next step is to calculate the posterior odds ratio O'. Recall we calculate this as the product of the prior odds ratio and the likelihood ratio LR(A,B), which is the ratio of the likelihood of the data given state A divided by the likelihood of the data given state B. But before we do this, we need to clarify how we will decide which hypothesis we will accept as more likely given the data, and whether we want to have a criterion for making the important decision about whether to reject state A in favor of state B. We need to face the decision issue because the management agency must take action if the fishing mortality appears to be rising, so the outcome of our analysis is not to be taken lightly. Of course, we could decide that we will change our adherence from state A to state B if the posterior odds ratio is different from the prior odds ratio. But given that there will be consequences to a decision, we might want to specify "how different" in some objective manner. This means that we might say that we'll turn to state B as the better choice if the magnitude of the posterior odds ratio is less than some value C, because smaller values of O' indicate a greater likelihood of state B. One way to proceed is to set C as an odds ratio from a classical testing approach, namely the ratio between the probability of rejecting the original hypothesis when it is really true and the probability of accepting it when it is indeed true, or C = α/(1 − α), where α is the classical type I error probability. With this criterion in mind, we calculate the likelihood of the data (that is, this year's tagging study results: 80 fish
tagged, 60 landed) under the alternative binomial distributions (0.60 probability of landing or 0.70 probability of landing), because it is the ratio of these likelihoods that we want, that is, LR(A,B). Using the formula O' = O × LR(A,B), we find that O' = 0.06. If we take α = 0.05, then C = 0.052, and O' is not quite small enough to change our minds in favor of state B. Of course, the reason for this is that the prior probabilities are very different and play a substantial role in determining the posterior odds ratio. But no one at the agency is really happy with this result, because it depends heavily on the prior probabilities. Some of the staff point out that if we had used diffuse (uninformative) priors, the results would indicate that we should change our minds and accept state B. And even with the priors as stated, the pattern is right at the edge of our admittedly conservative criterion. One approach is to model the time series of the data: that is, define state A as saying that fishing mortality rates have been constant for the last ten years and are the same for this year, and define state B as saying that the first 8 of the last ten years have experienced 60% fishing mortality rates and the most recent 2 years have experienced the higher rate, and this higher rate accounts for this year's data. In this approach we'd calculate the likelihoods of the longer series of data under each model and use that in our calculation of the posterior odds ratio. To do this, the scientists decide to use the data from the tagging studies in the odd-numbered years to estimate the parameters of the binomial distributions. For state A, they use data from years 1, 3, 5, 7, and 9 to estimate a fishing mortality rate of 0.60, which is how the data come out. For state B, they use data from years 1, 3, 5, and 7 to estimate the old fishing mortality rate, which comes out to 0.6, and the data from year 10 to estimate the new rate, which is 0.70. They then take data from years 2, 4, 6, and 8, along with this year's data, to examine each hypothesis; in this fashion they do not use the same data for parameter estimation and hypothesis testing.
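The posterior odds calculation above can be checked directly. A minimal sketch: the binomial coefficient is the same in both likelihoods and cancels in the ratio, so it is omitted.

```python
# Posterior odds = prior odds * likelihood ratio, for this year's
# tagging data (60 of 80 tagged fish landed).
import math

prior_odds = 0.80 / 0.40                     # O = P(A)/P(B) = 2
landed, survived = 60, 20

log_LR = (landed * math.log(0.6 / 0.7)       # state A: p(landed) = 0.6
          + survived * math.log(0.4 / 0.3))  # state B: p(landed) = 0.7
posterior_odds = prior_odds * math.exp(log_LR)

alpha = 0.05
C = alpha / (1 - alpha)                      # decision criterion, ~0.053
print(round(posterior_odds, 3), round(C, 3), posterior_odds < C)
```

The posterior odds come out just above C, which is exactly the "right at the edge" situation the text describes.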
When the staff performs this analysis, the posterior odds ratio is actually slightly higher than 0.06 (the minimal change is not surprising, given that the models differ only in the mortality rate for the last year of the data used in the test). We are in the same situation as before in terms of what to do. But at this point it's important to notice that there is an additional factor that must be taken into account, which is that the model for state A has a single parameter and the model for state B has two parameters. This difference could influence the fit of each model to the data; more parameters will always describe a given body of data better, but the improvement may not be meaningful if those additional parameters represent little more than random variation. We need to evaluate our competing models in this light. The typical method for doing so is to calculate a measure of lack of fit of model to data, a measure that is discounted by the number of parameters, and choose the model that minimizes this measure. Various measures are used for different types of problems; two popular measures are the Akaike Information Criterion (AIC) and the Bayesian Information Factor or Bayesian Information Criterion (BIC). The AIC, which has become very popular in ecology and evolutionary biology, is computed for a model of state A as AIC(A) = −2 log L(A) + 2p(A), where L(A) is the likelihood of the data under model A and p(A) is the number of parameters embedded in model A. The BIC replaces the term 2p(A) with p(A) log N, where N = number of observations in the data. When we are comparing only two models, say A and B, we can calculate the change in AIC between them as ΔAIC = AIC(A) − AIC(B) = −2 log LR(A,B) + 2(p(A) − p(B)). If ΔAIC > 0, we would favor model B over model A, and vice versa, of course. In our fisheries data, ΔAIC = 4.98, so we would favor state B over state A, even though there are more parameters in state B.
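The ΔAIC comparison above can be sketched numerically. An assumption to flag: the text computes the criterion over the longer data series, while this sketch uses only this year's tagging data, which lands within rounding of the reported value; constant terms in the log-likelihoods cancel in the difference.

```python
# Change in AIC between the one-parameter model (state A) and the
# two-parameter model (state B), using this year's data only.
import math

landed, survived = 60, 20

def log_lik(p_landed):
    """Binomial log-likelihood, up to an additive constant."""
    return landed * math.log(p_landed) + survived * math.log(1 - p_landed)

n_params = {"A": 1, "B": 2}     # state B estimates one extra rate
dAIC = (-2 * (log_lik(0.6) - log_lik(0.7))
        + 2 * (n_params["A"] - n_params["B"]))
print(round(dAIC, 2))   # positive, so state B is favored despite the
                        # extra parameter
```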
Yet all of these efforts may seem like frills that merely hide the fundamental dilemma: our Bayesian approach leaves us on the margin of a decision, and the agency must make a decision to rework the rules so as to lower next year's fishing mortality rate, or do nothing and take a risk that the fishery will collapse. A dedicated Bayesian will take one more step toward solving the problem. Let's consider the possible decisions and the costs of each decision:

                       TRUE STATE
    Decision           A          B
    Accept state A     0          u(A|B)
    Accept state B     u(B|A)     0

where u(A|B) is the cost of accepting state A when state B is true, and u(B|A) is the cost of accepting state B when state A is true. The u can be considered loss functions, because they can in principle be not only scalars (i.e., a number representing an economic loss or time drain) but functions of parameters in the models or other complicated expressions. Now remember that, at least in principle, we could use our prior probabilities and likelihoods from the data to calculate posterior probabilities for each state, P'(A) and P'(B). Putting aside how we might do that for the moment, we could use these loss functions plus the posterior probabilities to calculate the expected loss under each decision. Let E[u(A)] be the expected loss if we accept state A and E[u(B)] be the expected loss if we accept state B. These can be calculated as

E[u(A)] = 0 × P'(A) + u(A|B) × P'(B)
E[u(B)] = u(B|A) × P'(A) + 0 × P'(B)

We can formulate a decision rule that incorporates the costs of wrong decisions using these values, which would be: if E[u(A)] < E[u(B)], accept state A; if E[u(A)] > E[u(B)], accept state B. Note that this rule incorporates our analyses of the data through the effects of those analyses on the posterior probabilities and the weights those probabilities give to the loss functions. There is a way to make this decision rule easier to use, especially for situations in which it is difficult to estimate explicit loss functions or in which only relative losses might be available. To begin, observe that by dropping terms that multiply to zero, we can rewrite the expected loss equations as

E[u(A)] = u(A|B) × P'(B)
E[u(B)] = u(B|A) × P'(A)
Population genetics: Coalescence theory I
Peter Beerli, November 6, 2005
BSC5936, Fall 2005: Computational Evolutionary Biology

1 Population models

To understand inferences based on sampling random relationships among a small sample of individuals of a contemporary population, we need to know some basic models of population history. Several such models exist. Fisher (1929, 1930) and Wright (1931) independently developed a simple population model; we call this model the Wright-Fisher population model. Alternatives are the Moran model, developed by Moran in 1958, and a model that extends the Wright-Fisher model, found by Cannings in 1974. We will discuss the three models.

1.1 Wright-Fisher population model

We assume that we have a population of N diploid individuals, so each individual has two gene copies at a specific locus. A population therefore has 2N gene copies. Every generation the individuals reproduce once and die. Each individual releases a very large number of gametes. The next generation is formed by picking random pairs from this pool of gametes. Most simple implementations of this model do not allow for change of population size through time, mutation, selection, immigration, and other complicating population genetic forces. If we assume that we start such a population with two alleles A1 and A2, then we look at the number of one specific allele X, say the number of A1 alleles. The future states of X can be any of 0, 1, 2, ..., 2N. Sampling from the gene pool can be replaced by sampling with replacement from the population at time t. This means that X(t+1) is a binomial random variable with index 2N and parameter X(t)/(2N). We can express the transition probability from X(t) = i to X(t+1) = j as

P(X(t+1) = j | X(t) = i) = C(2N, j) (i/(2N))^j (1 − i/(2N))^(2N−j),   j = 0, 1, 2, ..., 2N.   (1)

Figure 1: Example of a Wright-Fisher population model.

We could now calculate or approximate the average time to fixation of an allele, or the probability of fixation. The Wright-Fisher population model as explained here is prognostic, as we look into the future, and this makes it particularly difficult to derive quantities such as the average time to fixation.
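Equation (1) can be evaluated directly as a sanity check: each row of the Wright-Fisher transition matrix must sum to one, and under pure drift the expected allele count is unchanged from one generation to the next. A minimal sketch (function names are my own):

```python
# Wright-Fisher transition probabilities for a population of 2N gene
# copies, as in equation (1).
from math import comb

def wf_transition(i, j, two_n):
    """P(X(t+1) = j | X(t) = i): binomial with index 2N, parameter i/2N."""
    p = i / two_n
    return comb(two_n, j) * p**j * (1 - p)**(two_n - j)

two_n = 20      # 2N gene copies (N = 10 diploids)
i = 7           # current count of allele A1
row = [wf_transition(i, j, two_n) for j in range(two_n + 1)]
print(round(sum(row), 10))                               # row sums to 1
print(round(sum(j * p for j, p in enumerate(row)), 10))  # mean stays at i
```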
Ewens (2004) gives an example of the time to fixation of an allele with frequency i/(2N). For i = 1, one gets a mean absorption time of t(1) ≈ 2 log(2N). We shall later see that we can derive similar estimates using coalescence theory. Extending from two alleles to k alleles is easy, as it simply uses the multinomial instead of the binomial distribution.

1.2 Moran's population model

Moran's model was derived for haploid populations, but many of its findings can be applied to diploids, because if we assume neutrality, then the alleles in a diploid population of size N behave like a haploid population of size 2N. At time points t1, t2, t3, t4, ..., an adult individual is chosen at random to reproduce; after reproduction, an adult individual is chosen randomly to die. Again, with two alleles A1 and A2, we can calculate transition probabilities. For comparison with the diploid Wright-Fisher model, we still use 2N as the number of chromosomes in the population. If we have i copies of allele A1 at time t, then at time t+1 there will be i − 1 A1 copies if an A2 individual reproduced and an A1 individual was chosen to die: with probability i/(2N) an A1 individual dies, and it gets replaced by an A2 individual with probability (2N − i)/(2N). This results in the probability

p(i, i−1) = (i/(2N)) ((2N − i)/(2N)).

For gaining an A1 (which equals losing an A2) we find

p(i, i+1) = (i/(2N)) ((2N − i)/(2N)),

and for no change

p(i, i) = (i^2 + (2N − i)^2) / (2N)^2.

The above probabilities form a transition probability matrix, and one can calculate the probability of fixation π(i) of allele A1 given that we have i individuals with allele A1 now. Ewens (2004) calculates

π(i) = i/(2N).

Using this, one can calculate the expected fixation time of a single allele A1 as E(t) ≈ 2N(2N − 1). For a comparison with the Wright-Fisher population model we need to find a common measure of time, as the events happening at times t_i in a Moran model are not equivalent to the events that happen at the times t_i in a Wright-Fisher model.
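The Moran transition probabilities above can be checked numerically. Because p(i, i+1) = p(i, i−1), the candidate fixation probability π(i) = i/(2N) satisfies the one-step recursion π(i) = p(i,i+1)π(i+1) + p(i,i−1)π(i−1) + p(i,i)π(i) exactly. A minimal sketch (names are my own):

```python
# One-step Moran transition probabilities for i copies of A1 among
# 2N gene copies, and a check of the fixation probability i/(2N).

def moran_probs(i, two_n):
    p_up = (i / two_n) * ((two_n - i) / two_n)    # A1 reproduces, A2 dies
    p_down = ((two_n - i) / two_n) * (i / two_n)  # A2 reproduces, A1 dies
    p_stay = 1 - p_up - p_down
    return p_up, p_down, p_stay

two_n = 10
pi = lambda k: k / two_n                          # candidate fixation prob.
for i in range(1, two_n):
    up, down, stay = moran_probs(i, two_n)
    lhs = pi(i)
    rhs = up * pi(i + 1) + down * pi(i - 1) + stay * pi(i)
    assert abs(lhs - rhs) < 1e-12
print("pi(i) = i/(2N) satisfies the recursion for every interior i")
```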
Wright-Fisher model. We can express a common time g as the complete turnover of a population. If we set g = 1 for the Wright-Fisher model, then the Moran model needs on average around 2N time events to turn over, so we can express its generation time as g = 2N. We might therefore express the waiting time to fixation in Wright-Fisher units as g(1/(2N)) \approx 2N - 1.

1.3 Cannings' exchangeable population model

The Cannings model can be viewed as an intermediate between the Wright-Fisher model and the Moran model, because depending on the reproduction function it can mimic either of the other two models.

Figure 2: Example of Moran's population model.

In its most basic version it looks like the Wright-Fisher model. Consider a population of genes of size 2N, reproducing at time points t_1, t_2, .... The transition between the old generation and the new generation can be very general (unlike in the Wright-Fisher model), as long as the model guarantees that all alleles at time t have the same distribution of descendants at time t + 1: they need to have the same offspring probability distribution. This distribution has a mean of 1 offspring, with variance \sigma^2. We can construct offspring distributions that fit this general description yet are far from the multinomial distribution needed for the Wright-Fisher model.

2 The coalescent (coalescence theory)

2.1 Historical note

Up to 1982, most development in population genetics was prospective: it developed expectations about the future based on the situation of today. With the easy availability of genetic data, retrospective analyses caught up, at first only in phylogenetics, starting in the sixties. Malécot pioneered looking backwards in time, in 1948, to develop results in population genetics, and Kingman later expressed this looking-backwards-in-time approach as the coalescence of sampled lineages. He was not the only one working on such problems at the time; as with many
great solutions, it was "in the air": see Hudson (1983) and Tajima (1983).

2.2 The coalescent

A sample of n gene copies is taken at the present time, and we are interested in the ancestral relationships of these gene copies. We express time \tau as increasing the further back in real time we go: \tau_1 < \tau_2 means that \tau_2 is further in the past than \tau_1. Kingman (1982) and Ewens (2004) describe this backwards-in-time process with equivalence classes. Two copies are in the same equivalence class at time \tau when they have a common ancestor at that time.

Figure 3: Example of the coalescence process.

At time \tau = 0 each individual gene can be considered to be in its own equivalence class, and we could express this for a sample of n = 8 as

\xi_0 = \{\{a\}, \{b\}, \{c\}, \{d\}, \{e\}, \{f\}, \{g\}, \{h\}\}.

Kingman's n-coalescent describes the moves from \xi_0 to a single equivalence class

\{\{a, b, c, d, e, f, g, h\}\}.

All individuals are in some equivalence relation \xi, and we can find a new equivalence relation \eta by joining two of the equivalence classes in \xi. This joining process is called a coalescence, and a series of such joinings is called the coalescent, or the coalescence process. Figure 4 gives an example of the relationship between a sample and the equivalence classes describing the process. It is assumed that the probability of a coalescence depends on the waiting time \delta\tau:

Prob(process in \eta at time \tau + \delta\tau \mid process in \xi at time \tau) = \delta\tau,

ignoring higher-order terms; and if k is the number of equivalence classes in \xi, then

Prob(process in \xi at time \tau + \delta\tau \mid process in \xi at time \tau) = 1 - \frac{k(k-1)}{2} \delta\tau.

These give the rate of a coalescent event at time \tau and the rate for the waiting time until that event happens. We will see that if we apply the right time scale to \tau, then we end up with the more familiar terms that are common in the applied population-genetics literature. Kingman focused on the Cannings model, and since the Wright-Fisher model is a special case of it, the results carry over easily. The coalescent is
an approximation to these models, because it was developed on a continuous time scale, whereas the Cannings and Wright-Fisher population models have discrete time. Any findings using this coalescent machinery need to be rescaled to the time scale of these discrete-time models. In the coalescent framework, only a single coalescence happens per infinitesimal time period. This forces us to restrict the use of the coalescent to discrete-time models where we can guarantee that no more than one coalescent event occurs per time period. For example, for a Wright-Fisher population we can allow only one coalescent event per generation; this sounds rather restrictive, but as long as the sample n is much smaller than the population size N this situation rarely occurs. Fu calculates that the sample needs to be smaller than the square root of N,

n < \sqrt{N},

when we use the coalescent for a model such as the Wright-Fisher or some version of the Cannings model.

BSC 5936 Fall 2005: Computational Evolutionary Biology

2.3 The coalescent and the Wright-Fisher population model

The coalescent process is in effect a sequence of n - 1 Poisson processes(1), with rates

\frac{k(k-1)}{2}, \qquad k = n, n-1, \ldots, 2,

describing the Poisson process by which two of the equivalence classes merge when there are k equivalence classes. Since these events come from a Poisson distribution, we can calculate the expectation for each interval, which is 1/rate, here 2/(k(k-1)). All mergers of the equivalence classes are independent of each other, so the expectation for the whole coalescence process is

\mathrm{E}(T_{\mathrm{MRCA}}) = \sum_{k=2}^{n} \frac{2}{k(k-1)} = 2 \sum_{k=2}^{n} \frac{1}{k(k-1)} = 2\left(1 - \frac{1}{n}\right).

Comparison of the terms in the sum with the coalescence rate makes clear that the Wright-Fisher population model runs at 2x the standard coalescent units, and we need to multiply by 2N to arrive at the more familiar generation time scale.

Figure 4: Expected times in the Wright-Fisher coalescent process.

(1) From MathWorld: A Poisson process is a process satisfying the following properties: 1. The numbers of changes in
non-overlapping intervals are independent for all intervals. 2. The probability of exactly one change in a sufficiently small interval h = 1/n is P = \nu h = \nu/n, where \nu is the probability of one change and n is the number of trials. 3. The probability of two or more changes in a sufficiently small interval is essentially 0. In the limit of the number of trials becoming large, the resulting distribution is called a Poisson distribution.

The coalescent process results in a tree of a sample of n individuals. We often call this a genealogy, as the individuals are typically from the same species or population. Often the probability of the coalescent process for the Wright-Fisher population model is expressed as

p(G \mid N) = \prod_{k=2}^{n} \exp\left(-u_k \frac{k(k-1)}{4N}\right) \frac{2}{4N},

where u_k is expressed in generations. The expected T_MRCA is the same as above. Often we will not use a time scale in generations but generations x mutation rate, and then we would express the above formula as

p(G \mid \Theta) = \prod_{k=2}^{n} \exp\left(-u_k \frac{k(k-1)}{\Theta}\right) \frac{2}{\Theta},

where \Theta = 4 N_e \mu, with \mu the mutation rate per generation and site (when using sequence data) and N_e the effective population size. Under a strict Wright-Fisher population model N = N_e, but under more biological scenarios one needs to know more about the life history of the species to translate N_e into real numbers.

2.4 The coalescent and the Moran population model

The coalescent is an exact representation of the Moran model, because the problems with multiple coalescent events in one generation do not occur: the Moran model allows only one lineage to change at a given time. Therefore the limitation to small sample sizes that we saw for the Wright-Fisher model is not needed. We use our findings from the discussion of the Moran model earlier, but instead of thinking forward in time we think backward in time. Looking backwards, we see that the Moran process is structured like the coalescence process: we have n individuals that are reduced in their
ancestry to n - 1, n - 2, ..., and eventually to one gene, the most recent common ancestor of the n sampled individuals. Assume that we are at a time where we have a sample of k individuals; these are descendants of k - 1 parents if one of these parents was chosen to reproduce and its offspring is in the ancestry of the sample of n genes. The probability of this event is

\frac{k(k-1)}{(2N)^2}.

Steve Thompson, Introduction to Bioinformatics, BSC 4933/5936
Florida State University, Department of Biology (www.bio.fsu.edu)
October 2, 2003

Genomics
Steven M. Thompson, Florida State University School of Computational Science and Information Technology (CSIT)

What sort of information can be determined from a genomic sequence?

9/25/03 Steve Thompson

Lecture Topics: easy, harder, very hard.
- Easy: restriction digests and associated mapping, e.g. software like the Wisconsin Package's (Genetics Computer Group, GCG) Map, MapSort and MapPlot; fragment assembly and genome mapping, such as packages from the University of Washington's Genome Center (http://www.genome.washington.edu/): Phred/Phrap/Consed (http://www.phrap.org) and SeqMap, and The Institute for Genomic Research's (http://www.tigr.org) Lucy and Assembler programs.
- Harder: gene finding and sequence annotation (this will be the bulk of today's lecture and is a primary focus in current genomics research); forward translation to peptides.
- Very hard: genome-scale comparisons and analyses.

Nucleic Acid Characterization: Recognizing Coding Sequences

Three general solutions to the gene-finding problem: (1) all genes have certain regulatory signals positioned in or about them; (2) all genes, by definition, contain specific code patterns; and (3) many genes have already been sequenced and recognized in other organisms, so we can infer function and location by homology if our new sequence is similar enough to an existing sequence. All of these principles can be used to help locate the position of genes in DNA, and they are often known as searching by signal, searching by content, and homology inference, respectively.
ORFs and URFs: definitions.
- URF (Unidentified Reading Frame): any potential string of amino acids encoded by a stretch of DNA. Any given stretch of DNA has potential URFs in any combination of six potential reading frames: three forward and three backward.
- ORF (Open Reading Frame): by definition, any continuous reading frame that starts with a start codon and stops with a stop codon. Not usually relevant to discussions of genomic eukaryotic DNA, but very relevant when dealing with mRNA/cDNA or prokaryotic DNA.

Signal Searching: locating transcription and translation affecter sites

One strategy: one-dimensional signal recognition.

Start sites:
- Prokaryote promoter, Pribnow box: TTGAwa (15-21) TAtAaT.
- Eukaryote transcription factor site database: TFSites.Dat.
- Shine-Dalgarno site: AGGGAGGG (x6-9) ATG in prokaryotes.
- Kozak eukaryote start consensus: ccAgccAUGg.
- AUG start codon in about 90% of genomes; exceptions in some prokaryotes and organelles.

End sites:
- Nonsense (chain-terminating) stop codons: UAA, UAG, UGA.
- Eukaryote terminator consensus: YGTGTTYY.
- Eukaryote poly-A (adenylation) signal: AAUAAA; but there are exceptions in some ciliated protists and due to eukaryote suppressor tRNAs.

Exon/Intron junctions. Donor site (Exon | Intron): A64 G73 | G100 T100 A62 A68 G84 T63. Acceptor site (Intron | Exon): (Py)74-87 N C65 A100 G100 | N. The splice cut sites occur before a 100% GT consensus at the donor site and after a 100% AG consensus at the acceptor site, but a simple consensus is not informative enough.

Another strategy: two-dimensional weight matrices.

Signal Searching: two-dimensional weight matrices describe the probability at each base position of being either A, C, U or G, in percentages.

The Donor Matrix: CONSENSUS from donor splice site sequences, from Stephen Mount, NAR 10(2): 459-472, figure 1,
page 460.

        Exon       cut|site          Intron
G   20   9  11  74 | 100    0  29  12  84   9  18  20
A   30  40  64   9 |   0    0  61  67   9  16  39  24
U   20   7  13  12 |   0  100   7  11   5  63  22  27
C   30  44  11   6 |   0    0   2   9   2  12  20  28

CONSENSUS sequence to a certainty level of 75 percent: VMWKGTRRGWHH. The matrix begins four bases ahead of the splice site.

Signal Searching: two-dimensional weight matrices (cont.)

The Acceptor Matrix: CONSENSUS of Acceptor.Dat IVS acceptor splice site sequences, from Stephen Mount, NAR 10(2): 459-472, figure 1, page 460.

                      Intron                        cut|site    Exon
G   15  22  10  10  10   5   7   9   7   5   5  24   1    0  100 |  52  24  19
A   15  10  10  15   5  15  11  19  12   3  10  25   4  100    0 |  22  17  20
T   52  44  50  54  50  49  46  45  45  57  58  30  31    0    0 |   6  37  29
C   18  25  30  21  24  30  34  26  35  35  27  21  64    0    0 |  18  22  32

CONSENSUS sequence to a certainty level of 75.0 percent at each position: BBYHYYYHYYYDYAGVBH. The matrix begins fifteen bases upstream of the splice site.
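The degenerate consensus strings quoted for these matrices (VMWKGTRRGWHH and the like) follow mechanically from the percentages: at each column, collect bases in decreasing frequency until the stated certainty level is reached, then write that set of bases as one IUPAC ambiguity symbol. A small Python sketch of that procedure (my own illustration; the three example columns are the invariant G and T of a donor site plus one mixed column, not a full published matrix):

```python
# Deriving a degenerate IUPAC consensus from a weight matrix at a
# given certainty level, as done for the consensus strings above.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("AC"): "M", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def consensus(matrix, level=75):
    """matrix: list of {base: percent} columns.  For each column, pick
    bases in decreasing frequency until the cumulative percentage
    reaches the certainty level, then emit the IUPAC symbol."""
    out = []
    for col in matrix:
        picked, total = set(), 0
        for base, pct in sorted(col.items(), key=lambda kv: -kv[1]):
            picked.add(base)
            total += pct
            if total >= level:
                break
        out.append(IUPAC[frozenset(picked)])
    return "".join(out)

# The invariant G and T of the donor site, plus one A/G-rich column:
demo = consensus([{"A": 0, "C": 0, "G": 100, "T": 0},
                  {"A": 0, "C": 0, "G": 0, "T": 100},
                  {"A": 60, "C": 3, "G": 30, "T": 7}])  # "GTR"
```

Run over all twelve donor columns, the same routine reproduces a 12-letter consensus of the VMWKGTRRGWHH kind; raising or lowering `level` makes the consensus more or less degenerate.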
and 20 Optimized cut off value 79 G 39 5 1 1 1 O 5 11 4O 39 33 33 33 36 36 A 16 4 99 1 91 69 93 57 4O 14 21 21 21 17 20 U 8 79 9 96 8 31 2 31 8 12 8 13 16 19 18 C 37 12 O 3 O O 1 1 11 35 38 33 3O 28 26 CONSENSUS sequence to a certainty level of 61 percent at each position STATAWAWRS S S S S S 92503 Steve Thompson Signal Searching locating transcription and translation affecter sites TwoDimensional Weight Matrices cont The GC box may relate to the binding of transcription factor Sp1 Base freguencies according to Philipp Bucher 1990 J Mol Biol 212563 578 Preferred region motif within 164 to 1 Optimized cut off value 88 G 18 41 56 75 l 99 O 82 81 62 7O 13 19 4O 37 35 18 24 O 1 2O 17 O 29 8 O 7 15 3O 12 23 O O O 18 1 18 9 15 27 42 37 15 11 2 O O O 62 O 1 O 6 61 31 9 o o o o C lit 0 CONSENSUS sequence to a certainty level of 67 percent at each position WRKGGGHGGRGBYK Signal Searching locating transcription and translation affecter sites TwoDimensional Weight Matrices cont The cap signal a structure at the 5 end of eukaryotic mRNA introduced after transcription by linking the 5 end ofa guanine nucleotide to the terminal base ofthe mRNA and methylating at least the additional guanine the structure is 7MeG5ppp5Np Base freguencies according to Philipp Bucher 1990 J Mol Biol 212563 578 Preferred region center between 1 and 5 Optimized cut off value 814 G 23 O O 38 O 15 24 18 A 16 O 95 9 25 22 15 17 U 45 O 5 26 43 24 33 33 C 16 MM 0 27 31 39 28 32 CONSENSUS sequence to a certainty level of 63 percent at each position KCABHYBY 92503 Steve Thompson Signal Searching locating transcription and translation affecter sites TwoDimensional Weight Matrices cont The eukaryotic terminator weight matrix Base frequencies according to McLauchlan et a1 1985 NAR 131347 1368 Found in about 2339s of all eukaryotic gene sequences G 19 1 9 94 14 10 11 19 A 13 9 3 3 4 O 11 13 U 51 9 9 3 79 61 56 47 C 17 1 O O 3 29 21 21 CONSENSUS sequence to a certainty level of 68 percent at each position 
BGTGTBYY Content Approaches Strategies for finding coding regions based on the content of the DNA itself Searching by content utilizes the fact that genes necessarily have many implicit biological constraints imposed on their genetic code This induces certain periodicities and patterns to produce distinctly unique coding sequences noncoding stretches do not exhibit this type of periodic compositional bias These principles can help discriminate structural genes in two ways 1 based on the local nonrandomness ofa stretch and 2 based on the known codon usage ofa particular life form The first the nonrandomness test does not tell us anything about the particular strand or reading frame however it does not require a previously built codon usage table The second approach is based on the fact that different organisms use different frequencies of codons to code for particular amino acids This does require a codon usage table built up from known translations however it also tells us the strand and reading frame forthe gene products as opposed to the former 92503 Steve Thompson Content Approaches cont NonRandomness Techniques GCG s TestCode Relies solely on the base compositional bias of every third position base TESTCODE of TdEmQ FCHumurLIDSJ dd 225 1 t 2481 Hir dow 200 bp February 22 2002 20 0 O 0 Owl O loco IOI G M9 0W 6 l6 000 0 blood 00 IE OQ MO I 9 OOIOOOO 00 0 0 0 9 no lo 0 9 leol TostCode T l a L 1quot FPJ z 560 15m 1300 2600 The plot is divided into three regions top and bottom areas predict coding and noncoding regions respectively to a confidence level of 95 the middle area claims no statistical signi cance Diamonds and vertical bars above the graph denote potential stop and start codons respectively Content Approaches cont Codon Usage Techniques GCG s CodonPreference WEFEREMIE pf ldJmfmeJI33 Old 1 tn 2481 Fm 22 2002 0302 Codm Tabla Wt lwmmhlgmm 25 Raw cm Thwahold Lil maxim 814 Codon Prome Each fonNard reading frame indicates a red codon preference curve and a blue 
third position GC bias curve The horizontal lines within each plot are the average values of each attr bute Start codons are represented as vertical lines rising above each box and stop codons are shown as lines falling below the reading frame boxes Rare codon choices are shown for each frame with hash marks below each reading frame Genomes use synonymous codons unequally sorted phylogenetically 92503 Steve Thompson Internet World Wide Web servers Many servers have been established that can be a huge help with gene finding analyses Most of these servers combine many of the methods previously discussed but they consolidate the information and often combine signal and content methods with homology inference in order to ascertain exon locations Many use powerful neural net or artificial intelligence approaches to assist in this difficult decision process A wonderful bibliography on computational methods for gene recognition has been compiled by Wentian Li httpwwwns jqeneticsorqqenel and the Baylor College of Medicine s Gene Feature Search httpsearchlauncherbcmtmceduseq searchgenesearchhtml is another nice portal to several gene finding tools World Wide Web Internet servers cont Five popular genefinding services are GrailEXP Geneld GenScan NetGene2 and GeneMark The neural net system GrailEXP Gene recognition and analysis internet link EXPanded httpgraillsdornlgovgrailexp is a gene finder an EST alignment utility an exon prediction program a promoter and polyA recognizer a CpG island locater and a repeat masker all combined into one package Geneld mpzllwww1imimessothareqeneidindexhtml is an ab initio Artificial Intelligence system for predicting gene structure optimized in genomic Drosophia or Homo DNA NetGene2 httpzllwwwcbsdtudkservicesNetGene2l another ab initio program predicts splice site likelihood using neural net techniques in human C elegans and A thaliana DNA GenScan httpqenesmiteduGENSCANhtml is perhaps the most trusted serverthese days with vertebrate genomes 
The GeneMark httpopalbiologygatecheduGeneMarkl family of gene prediction programs is based on Hidden Markov Chain modeling techniques originally developed in a prokaryotic context the programs have now been expanded to include eukaryotic modeling as well 92503 10 Steve Thompson Homology Inference Similarity searching can be particularly powerful for inferring gene location by homology This can often be the most informative of any ofthe gene finding techniques especially now that so many sequences have been collected and analyzed Vl sconsin Package programs such as the BLAST and FastA families Compare and DotPlot Gap and BestFit and FrameAlign and FrameSearch can all be a huge help in this process But this too can be misleading and seldom gives exact start and stop positions For example 805 851 C39139 46 AlaAlaAlaArgCysLysAlaAlaGluAlaAlaAlaAspGluProAlaLe 62 852 901 I III MM 63 rye i L Jv v u 79 902 951 1 yquot1 TyEValGluIleLeuGln 95 80 alPrnThrT 952 990 96 HisValIleAspTyrIleLeuAspLeuGlnLeuAlaLeu 108 Summary The combinatorial approach Get all your data in one place GCG s SegLab is a great way to do this due to its advanced annotation capabilities Edit Functions uption windows Rdd Raise Deles quotthisslb 491 users twp un eq u mm mg is Features at mm Editor Displag Graphic Features reeiwesteyt 739 reeiweznd lZZ Subjempec quotSinlanUMaNquot Close m1 l ll hunamid as 92503 11 Steve Thompson The Objective a complete protein Now what Beyond just finding genes Genome scale analyses Unfortunately much traditional sequence analysis software can t do it but there are some very good Web resources available for these types of global view analyses Let s run through a few examples NCBl s Genome pages httpwwwncbinmnihgov present a good starting point in North America ma 39l 1mm 3 lmp WWquotem11m111gesgemgu1ge1um 39wm39skeigleg l11ww1e1111m11 gagususunwmgu eg1 W m skemeg meme amiang it Human Genome W5 NW A gurd m Uriins information resources Genes A 11111 l 1111 l Wen immune Hullnma 
11 IMnrmltInn 1Mm1mg1ure ueu HleS e Unea lnci SaveAllLuci BLASI campe1evgu1seguenge Achallengelacingveseavchevsladavtslheabilitle e u 1 J K 1 u u g p g 11 s 1 u 1g the gengme g11s gene p1ege lagethev and analvzelhe munnudes g1ge1e mgguas gu11enw being gene1e1eg Waugh the Human eengme p1g1eg1 NCEl sWeb s1e serves an an 111eg1e1eg1 ABEDEF vwxvz Dick 11 usleVmRNABenamc ugmms 5mm 1mm bps Mn Magenztlti Acvlagenetu mesa geneme 1mg1me11gn ntvastmctuve 1g1 1 2 a A s a 7 a cunt nus e1 1esgu1ge g1 FlSHemapPed1 biomedical researchers tram amundlhe Wg11g salhat m seguenge1eggegg1gnes neg meg usems data mlhew research e11g11s Mme m MW Dmabasemmw V mum gmemene smug me 11611 mm genetic vanmns wmm 11111 11111 Published 9 1a 11 1213141515 m V 1 W WW mm quotW M WM Wm 11 epcn Check vauv seguenge 1g1 11M 1155 We ggemeu ms and View 1n genomic context 1111mm wees E New smwy uemme 11 1111111 M11x vle g1 vmem51 11111115 mum g1 meme 1 e we memememexes 6E0 Gene Ex vessian Omnibus a m1 gem m1 1 11 em pubhmpasniyMEXWESW NM amwvnsm Dim We 171m 2021 22 x 11 usvype geneummempmguemmmugun111mm m ahanumnun gem 1m 11111111g1gm1ng1111 gmmegewe1e11x1ggphe1xpme11 HElRl gmgmoene mange menumquot 11111111g1gm1ng1111 mm mm 1e11x1ggphe1x gMgggeseMgng umemmeuse summenew m a vat andzehvatish MVwmmk car Sjle ded We s1g11es g1 u e sari epuu11g gene Manny1 unmnlngy Mv masks at when brawsinglhe genome usinglhe new MBPVieWEL 1m meme sum m cansewegsvmrsnvueween g11g1 gquot 01 pay e111 gs1g house tram severaltvpes g1 M 1 1 Wm mauseandhuman apsand Eelavv avelhvee views mlhe gmz 1ggus m p E W V usin dittevem us is g tans Clicklheima e1g seelhe quotM r anusLInk Fggai paint lav genes m apv ewe 1 m g mm D d ues and essgg1e1eg mtmmatmn g 7 1m Mm sum m OMIM Mewenesw 4 4 ma 2 men 1mm lnmhnnrnrmvmssnr e Fmenne 11 inhemed g1sg1ge1s maintained by 1 R H h JHUandcatabmmvs 1 am e u 1311125 u1egs1 1111 Let 11523ngqu Men Re1e1engeseguengesg1 l 11m Camplex g1 News Man mum m we Mm genomic EDMlgS1mRNAS1BVVd vetmahlastama we pvmem 111 
an em m 1s 11m p1g1e1ns W E71 3 m1 ancapvmem that 11e uentl bindsREand 111 111mm SAGElnp Gene expression A m u1gg1s11qs1unl11gn gmeme 1 m1 1Ssuu1sn1c11 sees 1ggs mapped 1g W 13 p311 Hucu m1 1 m w Sequenclng Summary g1 human mhergennmes WM W gengme seguengmg p1gg1ess M musculus R n g1gus 92503 12 Steve Thompson Beyond just finding genes Genome scale analyses cont That can lead to neat places like the Genome Browser at the University of California Santa Cruz httpgenomeucscedu and the Ensembl project at the Sanger Center for Biolnformatics httpwwwensemblorq a me 7 ElE 39 m mm Hume sum may pm sum mp 39 a k Rim Mme s i M w i saw 5n Lwatim 4 My WW m ediiMai39hinlm racks 1hlr95amp9nsnivrNMJUZl67 39mrmiai a m39wmmima MFEJEJEJ a quot L Fenian m mum m immuneWire Mt Mugm NW N am PM Lam lam Lm Elm L Wm Transcrmt 39 w m mm mm i um I win 7 will EMBL Vansan Genscaiis a Human mews a Protexns lLll u lZ llLELl m l im mi l i i i g iuii Elll39DJI u i i IllT inii i i l i ll m Cum u i l mu Elli llEl l i Weomnw amen Ell i mi in i l um Elli i l i 39 ll i EIDIJI H i Ill imi iiim W5 n i iiml m I I imi i i i iii i mi m i I Ell iii i l 7 39 i i mum Amuumdeuils ic nmbsftDosnlunlnmamlnamuhdcursw El We quotms w an m mm mmquot mm skew WWW hldaall Euld lmaia mi manning ml lause matches illllllll i IIlIll llllllllii Haiker i i i i n H l i ll GEM m m m at 19 i 195 i m it ran m m n 199 it 2L in 2a i may mm mmm mum W I PW 9 5 quot I mum Ivanix inma anHmu upsz imam m 1 aquotin Gene ieaw mi Elstm mnluin nzxrs am i Beyond just finding genes Genome scale analyses cont And sites like the the University of Wisconsin s E coi Genome Project httpwwwgenomewiscedu and The Institute for Genomic Research s htt9wwwtigrorg MUMMER package iimmm j in mewmii ii39wmmw imm jw mm name 2 m wounauwmm swam Ecoli Genome Pro e n V inquot WEEl University of WlsconslnvMadl mimm 5m livnmem min 51v5 Mymmmm cuosis mvm mum cum r IllllllIIIII Illlllllill II w mmsumigmmmalsfmwmigm gum Wm mam mmmwmwwmmmmmm mmmmmm mm am mm mm mmmwmw 
9/25/03 Steve Thompson

References

A perplexing variety of techniques exists for the identification and analysis of protein-coding regions in genomic DNA. Knowing which to use when, and how to combine their inferences, will go a long way in your genomic analyses.

- Bucher, P. (1990) Weight Matrix Descriptions of Four Eukaryotic RNA Polymerase II Promoter Elements Derived from 502 Unrelated Promoter Sequences. Journal of Molecular Biology 212, 563-578.
- Bucher, P. (1995) The Eukaryotic Promoter Database EPD. EMBL Nucleotide Sequence Data Library, Release 42, Postfach 10.2209, D-6900 Heidelberg.
- Ghosh, D. (1990) A Relational Database of Transcription Factors. Nucleic Acids Research 18, 1749-1756.
- Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, NY, U.S.A.
- Hawley, D.K. and McClure, W.R. (1983) Compilation and Analysis of Escherichia coli Promoter Sequences. Nucleic Acids Research 11, 2237-2255.
- Kozak, M. (1984) Compilation and Analysis of Sequences Upstream from the Translational Start Site in Eukaryotic mRNAs. Nucleic Acids Research 12, 857-872.
- McLauchlan, J., Gaffney, D., Whitton, J. and Clements, J. (1985) The Consensus Sequence YGTGTTYY Located Downstream from the AATAAA Signal is Required for Efficient Formation of mRNA 3' Termini. Nucleic Acids Research 13, 1347-1368.
- Proudfoot, N.J. and Brownlee, G.G. (1976) 3' Non-coding Region in Eukaryotic Messenger RNA. Nature 263, 211-214.
- Stormo, G.D., Schneider, T.D. and Gold, L.M. (1982) Characterization of Translational Initiation Sites in E. coli. Nucleic Acids Research 10, 2971-2996.
- von Heijne, G. (1987a) Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit. Academic Press, Inc., San Diego, CA.
- von Heijne, G. (1987b) SIGPEP: A Sequence Database for Secretory Signal Peptides. Protein Sequences & Data Analysis 1, 41-42.

RULES FOR MOMENTS
Expectation (first moment, or mean)

Definition: E(X) = \sum_x x \Pr(X = x)

E(cX) = c E(X)
E(X + a) = E(X) + E(a) = E(X) + a
E(X + Y) = E(X) + E(Y)
E(X - Y) = E(X) - E(Y)
E(XY) = E(X) E(Y) iff X and Y are independent
E(c_1 X_1 + c_2 X_2 + ... + c_n X_n) = c_1 E(X_1) + c_2 E(X_2) + ... + c_n E(X_n)

If c_1 = c_2 = ... = c_n = 1/n, then E(c_1 X_1 + c_2 X_2 + ... + c_n X_n) = (1/n) E(\sum_i X_i); and if E(X_i) = \mu, then E(c_1 X_1 + c_2 X_2 + ... + c_n X_n) = (1/n) n \mu = \mu.

Parsimony: Counting Changes
Fredrik Ronquist
August 31, 2005

1 Counting Evolutionary Changes

Reference: Felsenstein, Chapters 1 and 2 (pp. 1-18), Chapter 6 (pp. 67-72), and Chapter 7 (pp. 73-86).

Parsimony methods are perhaps the simplest methods used for inferring evolutionary trees, so we will start with them. Parsimony means "extreme or excessive economy or frugality". In the context of evolutionary inference, the idea is to minimize the amount of evolutionary change. Another way of understanding parsimony methods is that they are cost-minimization procedures, where the cost is a measure of the amount of evolutionary change. In the typical case we are dealing with discrete characters, and the natural measure of evolutionary change is simply the number of changes between states. However, there are many other possibilities as well, as we will discover. Throughout this lecture we will be assuming that the tree is fixed. We will then use parsimony methods to count the number of changes and to infer the ancestral states in the tree. In principle, the step is small from this to comparing and selecting among trees according to the amount of evolutionary change they imply. However, several practical issues arise when we start searching for trees, so we will defer treatment of this topic to the next lecture.

1.1 Fitch Parsimony

To start with, let us assume that we are interested in the evolution of a DNA character, which has four discrete states: A, C, G and T. In the simplest possible model we assume that all changes between all states are allowed and count equally (Fig. 1). According to Felsenstein (2004), this model was initially proposed by Kluge and Farris (1969) and named Wagner parsimony by them,
BSC 5936 Fall 2005 PB-FR Computational Evolutionary
Biology

but it is more widely known as Fitch parsimony, because the first algorithms implementing the model were published by Walter Fitch (1970, 1971). The naming issue is utterly confusing, because there is another parsimony model, which we will examine shortly, that is widely known as Wagner parsimony (although Felsenstein calls it parsimony on an ordinal scale).

Figure 1: Fitch parsimony model for DNA characters. All state changes are possible and count equally.

In simple cases, when there are few state changes in a small tree, it is straightforward to find the number of required changes and the most parsimonious states of the interior nodes in the tree, and hence the position of the changes (Fig. 2a). When there are more changes, it is quite common for there to be ambiguity both concerning the interior node states and the position of the changes, even though we can still find the minimum number of changes relatively easily in many cases (Fig. 2b). For more complicated trees and larger problems, it is necessary to use an algorithm. The basic idea is to pass through the tree from the tips to the root (a downpass, or postorder traversal). At each interior node, we count the minimum number of steps required by the subtree rooted at that node, and we record the ancestral states giving us that number of steps. When we have reached the root node, we know the minimum number of steps for the entire tree. Since it is a recursive algorithm, we only need to specify it for one node p and its two children q and r. Assume that S_i is the set of observed or inferred minimal states at node i. For instance, if p is a terminal and we have observed an A for it, we have S_p = {A}. Similarly, if there were ambiguity about the states of a terminal or interior node p, such that the state could be either A or G, then we have S_p = {A, G}. Now we can calculate the length (or cost) l of a tree using the Fitch downpass algorithm by finding intersections and unions of state sets (Algorithm 1).
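The downpass just described (intersect the children's state sets; on an empty intersection take the union instead and count one step) can be sketched in a few lines of Python; the nested-pair tree encoding is my own choice for illustration, not part of the lecture:

```python
def fitch_downpass(tree):
    """Fitch downpass on a rooted binary tree written as nested pairs,
    with each tip given as a set of observed states (more than one
    element expresses uncertainty about a terminal).  Returns the
    downpass state set of the root and the minimum number of changes."""
    if isinstance(tree, (set, frozenset)):       # terminal node
        return frozenset(tree), 0
    (s_q, l_q), (s_r, l_r) = fitch_downpass(tree[0]), fitch_downpass(tree[1])
    s_p = s_q & s_r                              # S_p <- S_q intersect S_r
    length = l_q + l_r
    if not s_p:                                  # empty intersection:
        s_p = s_q | s_r                          #   take the union instead
        length += 1                              #   and add one step
    return s_p, length

# ((A,A),(G,A)): a single change, on the branch leading to G, suffices.
root_set, steps = fitch_downpass((({"A"}, {"A"}), ({"G"}, {"A"})))
```

Note that, as discussed below, the sets this routine returns are only downpass sets; turning them into most parsimonious reconstructions for the interior nodes still requires the uppass (Algorithm 2).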
Figure 2: In simple cases, the minimum number of changes and the optimal states at the interior nodes of the tree under Fitch parsimony are obvious.

The algorithm is relatively easy to understand. If the descendant state sets S_q and S_r overlap, then we do not have to add any steps at this level, and the state set of node p will include the states present in both descendants (the intersection of S_q and S_r). If the descendant state sets do not overlap, then we have to add one step at this level, and the state set of p will include all states that are present in either one of the descendants (the union of S_q and S_r). States that are absent from both descendants will never be present in the state set of p, because they would add two steps to the length of the subtree rooted at p. An example illustrates the procedure clearly (Fig. 3; Felsenstein Fig. 2.1). An interesting aspect of the algorithm is that it is not guaranteed to give you the optimal states at the interior nodes of the tree. In Felsenstein's example, for instance, the downpass gives you the state set {A, G} for one of the nodes, but the optimal state set, also known as the most parsimonious reconstruction (MPR), for that node is actually {A, C, G}. To get the optimal state sets of all the interior nodes in the tree, you need to go through the Fitch uppass algorithm.

Algorithm 1: Fitch downpass algorithm
S_p <- S_q \cap S_r
if S_p = \emptyset then
    S_p <- S_q \cup S_r
    l <- l + 1
end if

Assume that we have the final state set F_a of node a, which is the immediate ancestor of node p. We also have the downpass sets of p and its two children q and r; they are labeled S_p, S_q and S_r, respectively, as in the description of the downpass algorithm. Now Fitch's uppass (also known as the final-pass) algorithm is based on combining information from these state sets (Algorithm 2). The algorithm is started by noting that, for the root, the final state set is the same as the downpass state set.

Algorithm 2: Fitch uppass algorithm
F_p <- S_p \cap F_a
if F_p \neq F_a
then if Sq ST 7 Q then F1 9 51 o S o E o Sp else FZ7 9 SP U Fa end if end if Fitch s uppass algorithm is much less intuitive than the downpass algorithm but we will nevertheless make an attempt to explain why it works In the rst step7 you test whether the downpass state set of p includes all of the states in the nal set of a If it does7 then it means that each optimal assignment of nal state to a can be combined with the same state at p to give zero changes on the branch between a and p and the minimal number of changes in the subtree rooted at p Thus7 the nal set ofp must be the same as the set at a because any other state would add steps to the tree length including any states that might be present in the downpass set ofp but not in the nal set of a If the nal set of a includes states that are not present in the downpass set of 197 then there must be optimal solutions that have one state at a and a different state at p This means that there is a potential change on the branch between a and p and we need to see if that change can be pushed onto one of the descendant branches of p without increasing the total tree length If so7 we may have to add states from the nal set of a to the downpass set of p The second part of the algorithm BSCSQB6 Fall 2005 PBFR Computational Evolutionary Biology Figure 3 An example of state sets calculated during a Fitch downpass after the rst if statement takes care of this potential addition of states If the downpass set ofp was formed by taking an intersection then we need to add the states that are in the nal set of a and in the downpass set of either one of the two descendants of p because each of these additional state assignments to p will push a change on the branch between p and a to one of the descendant branches of p without increasing total tree length If the downpass set ofp was formed by a union then we simply add the states in the nal set of a to those in the downpass set ofp Felsenstein describes an alternative uppass 
algorithm brie y in Chapter 6 p 68 but that algorithm is erroneous as far as we can tell please let us know if you think we are wrong Felsenstein s algorithm will work on the example given in his Fig 21 Fig 3 but it will fail on other trees For instance you can test it on the following simple tree terminals labeled with the observed states The nal interior state sets are obviously T T and A but Felsenstein s algorithm will give you T T and AT regardless of whether you understand L as the nal state set or the downpass state set for the ancestral node Admittedly it is easy to underestimate the complexity of the Fitch uppass BSCSQB6 Fall 2005 PBFR Computational Evolutionary Biology We have described the Fitch algorithms for rooted binary trees Unrooted trees and polytomous trees require minimal modi cations that should be straightforward we leave those as an exercise Note that the Fitch algorithms accommodate uncertainty about the state of a terminal in a natural way we simply express the uncertainty by assigning the terminal a state set containing all the possible states 12 Wagner Parsimony In some cases it is natural to assume that the states we are considering form a linear series usually known in evolutionary biology particularly systematics as a transformation series Often cited examples include morphological characters with three states one of which is intermediate between the two others Think about an insect antenna with 10 11 or 12 articles a process that can be short intermediate or long pubescence that can be sparse intermediate or dense or coloration that can be white gray or black In such cases it seems reasonable to model evolution as going from one extreme to the other through the intermediate state In other words we allow changes only between the end states and the intermediate state but not directly between the end states themselves Fig 4 Another way of expressing the same thing is that we count a change between the two extreme states as costing twice 
as much as a change between an extreme state and the intermediate state. Characters for which we would like to use this parsimony model are often referred to as ordered or additive characters, in contrast to the unordered or non-additive characters for which we use the Fitch model.

Figure 4: The Wagner parsimony model. Going from any of the extreme states to the intermediate state counts as one change, whereas a transition from one extreme state to the other counts as two changes, since we are forced to go through the intermediate state.

The Wagner parsimony model can have a large number of states, and it can be used to represent quantitative characters, either on an ordinal scale or on an interval scale. In the former case, we would simply group the measurements into n ordered groups that would comprise the states of the Wagner parsimony character. In the latter case, we divide the range between the minimum and maximum value into n equal-length intervals to give us a Wagner parsimony character with n states. In the extreme case, we simply use the raw measurements, up to some fixed precision. Felsenstein describes the Wagner parsimony algorithms using raw measurements, but let us use a slightly modified set description instead. Assume that the state set of a node p is a set of continuous elements S = {e : x ≤ e ≤ y}, where min(S) = x and max(S) = y; we can also call this set an interval. Now define the operation S_i ⊓ S_j as producing the set of continuous elements from max(min(S_i), min(S_j)) to min(max(S_i), max(S_j)). If S_i and S_j overlap, then this operation simply produces their intersection, but if they do not overlap, the result is a minimum spanning interval connecting the two sets. For instance, [2, 4] ⊓ [6, 8] = [4, 6]. As usual, the Wagner downpass algorithm finds the downpass interval for a node p from the downpass intervals of its two daughters q and r, and it adds to a global tree length variable l when we need to connect two non-overlapping state sets (Algorithm 3).

Algorithm 3: Wagner downpass algorithm
    S_p ← S_q ∩ S_r
    if S_p = ∅ then
        S_p ← S_q ⊓ S_r
        l ← l + length(S_q ⊓ S_r)   (the length of the gap separating S_q and S_r)
    end if

The similarities to the Fitch downpass algorithm are obvious. The uppass algorithm is also similar to its Fitch parsimony analog, but it is a little less complex because of the additive nature of the state transition costs. Before describing it, let us define the operation S_i ⊔ S_j as producing the set of continuous elements from min(min(S_i), min(S_j)) to max(max(S_i), max(S_j)). If the two intervals overlap, the result is simply their union, but if they are disjoint, then the operation will produce an interval including all the values from the smallest to the largest. For example, [3, 4] ⊔ [6, 7] = [3, 7]. With additional notation as above, the Wagner uppass algorithm is now easy to formulate (Algorithm 4). The algorithm is slightly simpler than the Fitch uppass but can be justified (proven to be correct) with argumentation similar to that used above for the latter.

Algorithm 4: Wagner uppass algorithm
    F_p ← S_p ⊓ F_a
    if F_p ≠ F_a then
        F_p ← S_p ⊔ (S_q ⊓ F_a ⊓ S_r)
    end if

1.3 Other variants of parsimony

Many other variants of parsimony that count changes slightly differently have been proposed. Some examples include Camin-Sokal parsimony, where changes are assumed to be unidirectional, and Dollo parsimony, where changes in one direction are assumed to occur only once in a tree. We refer to Felsenstein (Chapter 7) for a detailed description of these methods and others. In practice, these alternative parsimony approaches are rarely used.

1.4 Sankoff Optimization

All parsimony methods are special cases of a technique called Sankoff optimization, sometimes referred to as generalized parsimony or step-matrix optimization. Sankoff optimization is based on a cost matrix C = {c_ij}, the elements of which define the cost c_ij of moving from a state i to a state j along any branch in the tree. The cost matrix is used to find the minimum cost of a tree and the set of optimal states at the interior nodes of the tree. Felsenstein
gives an example of a cost matrix and the Sankoff downpass algorithm in his Figure 2.2 and the accompanying discussion (pp 13-16). The uppass algorithm is presented in Chapter 6 (pp 68-69 and Figs 6.1, 6.2). A slightly different presentation of the Sankoff algorithm is given here. Assume that we are interested in a character with k states. For the downpass, assign to each node p a set G_p = {g_1^p, g_2^p, ..., g_k^p} containing the minimum downpass cost g_i^p of assigning state i to node p. We will use another similar set H_p, which will give the minimum cost of assigning each state i to the ancestral end of the branch having p as its descendant. The elements of H_p will be derived from G_p and C, as you might have guessed. Before we start the algorithm, we go through all terminal nodes p and set g_i^p = 0 if the state i has been observed at tip p, and g_i^p = ∞ otherwise. As usual, the downpass algorithm is formulated for a node p and its two descendant nodes q and r (Algorithm 5). At the root node p, we easily obtain the tree length l as l = min_i g_i^p, the minimum cost for any state assignment. A worked example appears in Figure 5. The uppass is relatively straightforward. First, we find the state or states of the root node p associated with the minimum cost in the G_p set. Let F_p be the set of indices of these states. Now we can find the final state set of a node p from its downpass cost set G_p, the cost matrix C and the final state set F_a of its ancestor a easily (Algorithm 6). In some cases, we may be interested not only in the optimal states at each interior node but also in the cost of the suboptimal state assignments. To determine these, we need a slightly more complicated algorithm.

Algorithm 5: Sankoff downpass algorithm
    for all i do
        h_i^q ← min_j (c_ij + g_j^q)
        h_i^r ← min_j (c_ij + g_j^r)
    end for
    for all i do
        g_i^p ← h_i^q + h_i^r
    end for

Algorithm 6: Finding optimal state sets under Sankoff parsimony
    F_p ← ∅
    for all i ∈ F_a do
        m ← c_i1 + g_1^p
        for all j ≠ 1 do
            m ← min(c_ij + g_j^p, m)
        end for
        for all j do
            if c_ij + g_j^p = m then
                F_p ← F_p ∪ {j}
            end if
        end for
    end for

For this Sankoff uppass algorithm, let F_p be a set containing the final cost f_i^p of assigning each state i to node p. We start the algorithm at the root p by setting f_i^p = g_i^p. The final cost set F_p of a node p can now be obtained from the cost matrix C, the final cost set F_a of its ancestor a, and the downpass set H_p of the lower end of the branch connecting p with a (Algorithm 7). A worked example appears in Figure 6.

Algorithm 7: Sankoff uppass algorithm
    for all j do
        f_j^p ← min_i (f_i^a - h_i^p + c_ij) + g_j^p
    end for

1.5 Multiple Ancestral State Reconstructions

As we have already seen, it is common to have ambiguity concerning the optimal state at an interior node, that is, multiple states in the final state set. It does not take many ambiguous internal nodes to result in a large number of equally parsimonious reconstructions of ancestral states for the entire tree. Felsenstein describes several methods for systematically resolving this ambiguity. Perhaps the most popular among these are the accelerated transformation (ACCTRAN) and delayed transformation (DELTRAN) methods. Both methods will select one state for each interior node among the most parsimonious assignments, resulting in a unique reconstruction of ancestral states. The selection is done using rules that differ between ACCTRAN and DELTRAN: in the former, we try to press state changes towards the root of the tree; in the latter, we press the state changes towards the tips. In many cases it is more appropriate to enumerate all possibilities, or to randomly select among them, than to systematically select a particular type of reconstruction. Felsenstein (Chapter 6) briefly describes how this can be accomplished, and we refer to this chapter and the references cited there for more details.

1.6 Time Complexity

If we want to calculate the overall length (cost) of a tree with m taxa, n characters and k states, it is relatively easy to see that the Fitch and Wagner algorithms
are of complexity O(mnk), and the Sankoff algorithm is of complexity O(mnk^2). If there is a small number of states, we can save a considerable amount of time in the Fitch and Wagner algorithms by packing several characters into the smallest unit handled by the processor (typically a 32-bit or 64-bit unit) and using binary operators to handle several characters simultaneously. There are also a number of shortcuts that can help us quickly calculate the length of a tree if we know the length, and in particular the final state sets, for some other tree that is similar to the tree we are interested in. Felsenstein describes several of these techniques briefly, but we find that the discussion is difficult to follow, and it appears to be partly incorrect, so we recommend that you go to the original literature if you are interested in them. We will briefly touch on some of these computational shortcuts when we describe searches for most parsimonious trees later during the course.

1.7 Study Questions

1. What is the difference between Fitch and Wagner parsimony?

2. When Wagner parsimony is used for continuous characters on an interval scale (or raw measurements), what exactly is the amount of change that we are minimizing?

3. Find the Fitch uppass and downpass algorithms for an unrooted binary tree. Tip: formulate a variant of the algorithm for dealing with a basal trichotomy and use it when deriving the states for an interior node selected to be the root of the unrooted tree.

4. Find the Fitch uppass and downpass algorithms for a polytomous tree. Tip: use the same variants of the algorithms from the previous problem.

5. Express Fitch parsimony for a four-state character using a Sankoff cost matrix.

6. Express Wagner parsimony for a four-state character using a Sankoff cost matrix.

7. Is there a natural measure of branch length under Fitch parsimony? How would you calculate that branch length? Would the sum of all the branch lengths be equal to the tree length?

8. Can a Sankoff cost matrix have non-zero diagonal entries? Can it be non-metric, that is, violate the triangle inequality?

Figure 5: A worked example of a Sankoff downpass. We start with a cost matrix (a) and then assign costs to the tips of the tree. For each node in the tree, in a postorder traversal, we then calculate the cost of each state at the bottom end of each descendant branch. We then add these costs to get the costs at the top of the ancestral branch. Finally, at the root node, we find the minimum cost of the tree as the minimum cost of any state assignment (e).

Protein Tertiary Structure: Molecular Mechanics & Dynamics
BSC5936, M Chapman, 2002
4/4/2002 BSC5936 Tertiary structure (c) MSChapman

Tertiary Structure
> Some building blocks: 20 amino acids; secondary structures
> Many different shapes & functions: encoded by different sequences; governed by forces between atoms
(Examples: arginine kinase enzyme; rhinovirus capsid proteins)

Interatomic Forces
> Tertiary structure
> Protein interactions: substrates, drugs, assembly, recognition (rhinovirus)
> Dynamics: drug binding (rhinovirus); arginine kinase ADP and phosphates; arginine kinase substrate-induced conformational change

Theory & Simulation
Understanding of theory →
> Molecular Mechanics: computer models sum the energies of interactions
> Design drugs; predict structures & variants (which are more favorable); minimize → predicted structure
> Predict motions
> Molecular Dynamics: how motions propagate; most likely conformational changes; timescales of motion

Covalent Structure of Proteins (Theory, part 1)

Amino acid nomenclature
> What do we call the pieces? R group (aka side chain), specific for each
amino acid; carbonyl carbon (C')

Polypeptides
H2N-...-OH: a chain of repeating N, Cα and C' units
> Numbering: N-terminus to C-terminus; 1, 2, 3, 4, ..., i-1, i, i+1, i+2

Geometry: bond lengths (covalent)
> Bonds to H: about 1 Å
> All others: about 1.5 Å (e.g. C-O, C-N); don't worry about the differences
> Stretching and compression of bonds is negligible

Geometry: bond angles
> Flexibility small (about ±2.5°)
> Geometry predictable according to valence
> Where do you find examples of sp3, sp2 hybridization? sp2 for 3 bonds, sp3 for 4 bonds
> Atom / valence / hybridization / bond angle: nitrogen, 3, trigonal planar, 120°; carbonyl carbon, 3, trigonal planar, 120°; carbon Cα, 4, sp3

Torsion Angles (Dihedral)
> "Free" rotation about single covalent bonds
> The only conformational variables, and hence the determinant of structure, as bond angles and lengths are constant
> Staggered vs eclipsed conformations

Variation in Torsion Angles
> Determined by van der Waals forces
> Staggered conformers often, e.g. 60°, 180°, 300° (-60°)
> Variation within each conformer: deviations from "ideal" of about ±20°; allows the protein to be "distorted" from the "perfect" structure, to move dynamically

Enumerating Torsion Angles
> Defined by the 4 atoms surrounding the bond
> Which are 1st & 4th? The heaviest (largest) atoms or groups, or the main chain (in proteins)
> Direction: right-hand grip rule

Names of Protein Torsion Angles
> Side chains: χ1, χ2, ... for bonds farther down the side chain
> Backbone: φ, ψ, ω in alphabetical Greek order for the bonds before N, Cα & C'

Torsion Angle Values: Principles & Agenda
> What values are most favored? Those corresponding to staggered angles, not eclipsed (Branden & Tooze 1999, Garland)
> φ, ψ are perhaps the most important
> χ are perhaps easier to understand; we will start with χ

χ1 of Valine: a Simple Example
> 3 staggered rotamers are possible, but not equally probable
> Rotamers c & d have the CH3 between large backbone atoms
> Rotamer b is preferred because the CH3 is next to Hα, all other things being equal

Concept of a Rotamer
> Side chains have several χ (χ1, χ2, χ3, ...)
> Each χ approximates a staggered conformation
> 3 (or 6) discrete options for each χ
> Number of possibilities = product of these options; e.g. His: 3 possibilities for χ1, 6 for χ2, so 3 × 6 combinations
> Not every combination is allowed, due to steric conflicts
> A subset of combinations → allowed rotamers; often 6 to 12 for each type of side chain
> 3 or 4 of these are typically much more favored; e.g. His (χ1, χ2): (180°, 90°), (-60°, 90°), (60°, 90°)

Rotamer Libraries
> List 3 or 4 preferred rotamers for each type of side chain
> Only occasionally (about 10%) do you find other rotamers in a structure, and only when contacts between amino acids exclude the favored rotamers
> Some of the preferred rotamers are more common than others (lower energy); the choice for an individual amino acid depends on interactions with the local environment

Backbone Torsion Angles: Importance
> Describe the overall fold almost completely
> Because bond lengths are fixed, bond angles are fixed, and only torsion angles are variable: only 2 variables per amino acid for the backbone (Garland Publishing Inc)
3 Torsion Angles (φ, ψ, ω), but ω is Fixed
> Consider a peptide bond
> Lone-pair electrons give a partial double C-N bond, which restricts rotation to roughly ±6° about planar
> Usually trans (180°); loss of the lone pair makes N trigonal planar

φ, ψ Describe the Overall Fold
> If φ and ψ were listed for every amino acid, the topology would be defined
> Ramachandran considered which combinations of φ and ψ are favorable for each amino acid: peptides of Ala, Gly; only van der Waals forces; shaded areas where the energy is favorable, as a 2D function of φ and ψ (Garland Publishing Inc)

Ramachandran Plot
> Shaded regions are favorable combinations of φ (phi) and ψ (psi) for an individual amino acid: backbone staggered, large groups well separated
> If the same values are repeated → regular secondary structure (the minima) (Branden & Tooze 1999, Garland)

Peptide Conformation
> Observed φ, ψ values for each amino acid of a protein always fall near the calculated energy minima; well, nearly always (non-glycine vs glycine plots) (Garland Publishing Inc)

Levinthal's Paradox & Dictionaries
How many backbone conformations of a 300-residue protein are possible? Assume: (1) only φ, ψ are important; (2) φ, ψ need only be given to ±15°, i.e. sampled every 30°; (3) consider only the minima of the Ramachandran plot. Still approximately 3^300, about 10^143, conformations.
Levinthal's paradoxes: How is the right conformation found? Why are there only about 5000 protein folds?

Levinthal's Answer: Dictionaries
> Consider peptide segments of, say, 5 residues
> Many combinations of the 5 φ, ψ angles fold the peptide back on itself; these can be excluded
> Only some combinations are energetically favorable (we have seen this argument before, with rotamers)
> Circa 1985 it was realized that almost all 5-residue segments of a new protein structure were similar to segments from a database of about 80 prior structures; such databases are used for model building
> The number of peptide conformations is finite
> Still too large to predict with confidence

An End to the Paradox
> Shortle 2001: NMR evidence
> Native-like fragment configurations in 8 M urea
> Implies there is no such thing as random structure
> Implications for folding: no need to search all possible configurations, only combinations of different fragment structures

Torsion Angles and Dynamics
> Deformation within a minimum (χ ± 15°) is easy: rapid vibrations
> Jumping between minima: possible if it does not cause steric overlap of other atoms; slow, as there is an activation energy barrier (passing through the eclipsed conformation)

Reading & Exercises
> Further reading: Branden C-I & Tooze, Introduction to Protein Structure, Chapter 1
> Optional exercises (I will look at them if you would like me to):
1. If you sample backbone torsion angles every 10°, how many possible conformations are there for a 200 amino acid protein? List your assumptions and show your calculations.
2. What would you expect the favored χ1 values of lysine to be? Explain your answer.

Forces Contributing to Macromolecular Structure & Dynamics
> Covalent: strong; invariant; rotation about torsion angles → conformation and dynamics
> Non-bonded interactions: changed by interactions; important in vivo; actually includes H bonds
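The counting in Levinthal's estimate and in optional exercise 1 can be checked with a few lines of arithmetic. This is my own illustrative sketch (the helper function and its name are mine), under the assumptions stated in the slides and repeated in the comments.

```python
import math

def conformations(n_residues, options_per_angle, angles_per_residue=2):
    """Count backbone conformations, assuming phi and psi are the only
    free angles (omega is fixed) and every combination is allowed."""
    return options_per_angle ** (angles_per_residue * n_residues)

# Levinthal-style estimate: 300 residues, each phi/psi pair restricted
# to roughly 3 Ramachandran minima, i.e. 3**300 conformations.
levinthal = 3 ** 300
print(round(math.log10(levinthal)))   # order of magnitude: 143

# Optional exercise 1: sampling every 10 degrees gives 36 options per
# angle, 2 angles per residue, 200 residues.
exercise = conformations(200, 36)
```

The point of the exercise survives any reasonable choice of assumptions: even with only a handful of options per residue, the count is astronomically larger than anything a protein could search exhaustively.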

