OPEN ACCESS Freely available online
PLoS Computational Biology

Recent Evolutions of Multiple Sequence Alignment Algorithms

Cédric Notredame

An ever-increasing number of biological modeling methods depend on the assembly of an accurate multiple sequence alignment (MSA). These include phylogenetic trees, profiles, and structure prediction. Assembling a suitable MSA is not, however, a trivial task, and none of the existing methods have yet managed to deliver biologically perfect MSAs. Many of the algorithms published these last years have been extensively described [1-3], and this review focuses only on the latest developments, including meta-methods and template-based alignment techniques.

The purpose of an MSA algorithm is to assemble alignments reflecting the biological relationship between several sequences. Computing exact MSAs is computationally almost impossible, and in practice approximate algorithms (heuristics) are used to align sequences by maximizing their similarity. The biological relevance of these MSAs is usually assessed by systematic comparison with established collections of structure-based MSAs (gold standards; for review, see [4]). Since only a few sequences have known structures, the accuracy measured on the references is merely an estimation of how well a package may fare on standard datasets.

Gold standards have had a considerable effect on the evolution of MSA algorithms, refocusing the entire methodological development toward the production of structurally correct alignments. Their use has also coincided with a notable algorithmic harmonization, most MSA packages being now based on the progressive algorithm [5]. This greedy heuristic assembly algorithm involves estimating a guide tree (rooted binary tree) from unaligned sequences and then incorporating the sequences into the MSA with a pairwise alignment algorithm while following the tree topology. The progressive algorithm is often embedded in an iterative loop where the guide tree and the MSA are reestimated until convergence. Most
MSA packages reviewed here [6-18] follow this canvas, albeit more or less extensively adapted for improved performance [1-3].

The scoring schemes used by the pairwise alignment algorithm are arguably the most influential component of the progressive algorithm. They can be divided into two categories: matrix-based and consistency-based. Matrix-based algorithms such as ClustalW [14], MUSCLE [6], and Kalign [19] use a substitution matrix to assess the cost of matching two symbols or two profiled columns. Although profile statistics can be more or less sophisticated, the score for matching two positions depends only on the considered columns or their immediate surroundings. By contrast, consistency-based schemes incorporate a larger share of information into the evaluation. This result is achieved by using a recipe initially developed for T-Coffee [10] and inspired by Dialign overlapping weights [20]. Its principle is to compile a collection of pairwise global and local alignments (the primary library) and to use this collection as a position-specific substitution matrix during a regular progressive alignment. The aim is to deliver a final MSA as consistent as possible with the alignments contained in the library. Many recent packages have built upon this initial framework. For instance, PCMA [15] decreases T-Coffee computational requirements by prealigning closely related sequences. ProbCons [7] uses Bayesian consistency and fills the primary library using the posterior decoding of a pair hidden Markov model. The substitution costs are estimated from this library using Bayesian statistics. MUMMALS [17] combines the ProbCons scoring scheme with the PCMA strategy while including secondary structure predictions in its pair hidden Markov model. The most accurate flavors of MAFFT [8] (i.e., the G-INS-i and L-INS-i modes) use a T-Coffee-like evaluation.

A majority of studies indicate that consistency-based methods are more accurate than their matrix-based counterparts [4], although they typically require
an amount of CPU time N times higher than simpler methods (N being the number of sequences). Most of these methods are available online, either as downloadable packages or as online Web services (Table 1).

Table 1. Summary of the Methods Described in the Review

Method | Score | Templates | PreFab | HOMSTRAD | Server
ClustalW [14] | Matrix | - | [garbled: 680] [12] | - | http://www.ebi.ac.uk/clustalw
Kalign | Matrix | - | 63.00 [18] | - | http://msa.cgb.ki.se
MUSCLE [6] | Matrix | - | 68.00 [6] | [garbled: 450] [9] | http://www.drive5.com/muscle
T-Coffee [10] | Consistency | - | 69.97 [12] | [garbled: 440] [9] | http://www.tcoffee.org
ProbCons [7] | Consistency | - | 70.54 [12] | - | http://probcons.stanford.edu
MAFFT [8] | Consistency | - | 72.20 [12] | - | http://align.genome.jp/mafft
M-Coffee [12] | Consistency | - | [garbled: 729] [12] | - | http://www.tcoffee.org
MUMMALS [16] | Consistency | - | [garbled: 730] [16] | - | http://prodata.swmed.edu/mummals
DbClustal [24] | - | - | - | - | http://bips.u-strasbg.fr/PipeAlign
PRALINE [9] | Matrix | Profiles | - | [garbled: 502] [9] | http://zeus.cs.vu.nl/programs/pralinewww
PROMALS [16] | Consistency | Profiles | 79.00 [16] | - | http://prodata.swmed.edu/promals
SPEM [28] | Matrix | Profiles | 77.00 [28] | - | http://sparks.informatics.iupui.edu/Softwares-Services_files/spem.htm
Expresso [13] | Consistency | Structures | - | [garbled: 79] [13] (a) | http://www.tcoffee.org
T-Lara [29] | Consistency | Structures | - | - | https://www.mi.fu-berlin.de/w/LiSA

Validation values were compiled from several sources and selected for comparability. PreFab validations were made using PreFab version 3. HOMSTRAD validations were made on datasets having less than 30% identity. The source of each value is indicated by the accompanying reference citation.
(a) The Expresso value comes from a slightly more demanding subset of HOMSTRAD (HOM39) made of sequences less than 25% identical.
doi:10.1371/journal.pcbi.0030123.t001

The wealth of available methods and their increasingly similar accuracies make it harder than ever to objectively choose one over the others. Consensus methods such as M-Coffee [12] provide an interesting framework to address this problem. M-Coffee is a consensus meta-method based on T-Coffee. Given a sequence dataset, it fills up the library by using various MSA methods to compute alternative alignments. T-Coffee then uses this library to compute a final MSA consistent with the original alignments. When combining eight of the most accurate and distinct MSA packages, M-Coffee produces a better MSA than ProbCons (the best individual method) 67% of the time [12]. Aside from its ease of extension, M-Coffee's main advantage is its ability to estimate the local consistency between the final alignment and the combined MSAs (CORE index [21]; Figure 1). This useful index has been shown to be well correlated with the MSAs' structural correctness [21,22]. M-Coffee is not, however, the ultimate answer to the MSA problem, and its limited performance on remote homologs suggests that further improvement using only sequence information remains an elusive goal. Progress is nonetheless needed, and at this point the most promising approach is probably to incorporate within the datasets any information likely to improve the alignments, such as structural and homology data. Template-based alignment methods [13] follow this approach.

Figure 1. Typical Output of M-Coffee
This output was obtained on the kinase1_ref5 BaliBase dataset by combining MUSCLE, MAFFT, POA, Dialign-T, T-Coffee, ClustalW, PCMA, and ProbCons with M-Coffee. Correctly aligned residues (as judged from the reference) are uppercase; noncorrect ones are lowercase. The color of each residue indicates the agreement of the individual MSAs with respect to the alignment of that specific residue. Red indicates residues aligned in a similar fashion among all the individual MSAs; blue indicates very low agreement between MSAs. Dark yellow, orange, and red residues can be considered to be reliably aligned.
doi:10.1371/journal.pcbi.0030123.g001

Editor: Fran Lewitter, Whitehead Institute, United States of America
Citation: Notredame C (2007) Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol 3(8): e123. doi:10.1371/journal.pcbi.0030123
Copyright: 2007 Cédric Notredame. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abbreviations: MSA, multiple sequence alignment
Cédric Notredame is with Information Génomique et Structurale, CNRS UPR2589, Institute for Structural Biology and Microbiology, Parc Scientifique de Luminy, Marseille, France. E-mail: cedric.notredame@europe.com
PLoS Computational Biology | www.ploscompbiol.org | August 2007 | Volume 3 | Issue 8 | e123

Structural extension was initially described by Taylor [23]. The principle is fairly straightforward (Figure 2) and involves identifying with BLAST a structural template in the Protein Data Bank for each sequence, aligning the templates using a structure superposition method, and mapping the original sequences onto their templates' alignment. The resulting sequence alignments are compiled in the primary library and used by a consistency-based method to compute the final MSA. Homology extension was originally introduced in the DbClustal package [24] and works along the same lines, using a profile rather than a structure. PSI-BLAST is used to build a profile for each sequence, and these profiles are used as templates to generate better sequence alignments, thanks to the evolutionary information they contain. The only difference between homology and structure extension is the nature of the templates and the associated alignment method. This generic approach can easily be extended to any kind of template. For instance, Expresso [13] uses SAP [25,26] and FUGUE [27] to align structural templates identified by a BLAST against the Protein Data Bank. PROMALS [17], PRALINE [9], and SPEM [28] make a profile-profile alignment with PSI-BLAST profiles used as templates. In PRALINE and PROMALS, the profile can be complemented with a secondary structure prediction in an attempt to improve the alignment accuracy. PROMALS uses ProbCons Bayesian consistency to fill its library with the posterior decoding of a pair hidden Markov model. T-Lara [29] uses RNA secondary structure predictions as templates and fills a T-Coffee library with the Lara pairwise algorithm.

Figure 2. Framework of a Template-Based Method
Structural templates are first identified and mapped onto the sequences. The templates are then aligned with SAP or any valid template alignment method. The sequence-template mapping is then used to produce an alignment of the original sequences. This alignment is integrated into the library that is used to compute the final MSA.
doi:10.1371/journal.pcbi.0030123.g002

With the exception of PRALINE and SPEM, which use a regular progressive algorithm, most template-based methods described here are consistency-based, some of them taking advantage of T-Coffee's modular structure. Their main advantage is increased accuracy. Recent benchmarks on PROMALS (Table 1) show that homology extension results in a ten-point improvement over existing methods. Likewise, structure-based methods such as Expresso produce alignments much closer to the structural references than do any of their sequence-based counterparts. One must, however, be careful not to over-interpret validation values like that given for Expresso in Table 1, since both the reference and the Expresso alignments were computed using the same structural information.

This last point raises the important issue of method validation and benchmarking. A recent study [4] shows that, with the exception of artificial datasets, benchmarks carried out on most reference databases tend to deliver compatible results. It also suggests that the best methods have become indistinguishable, except when considering remote homologs (less than 25% identity). Unfortunately, remote homologs are poorly suited to generating reference alignments, owing to the fact that their superposition often yields alternative sequence alignments that are structurally equivalent [30]. However, one can bypass the reference alignment stage by directly comparing the evaluated alignment to some idealized 3-D
superposition. Such an alignment-independent evaluation has been described and used by several authors [17,31,32].

Another trend not well accounted for by current reference collections is the alignment of very large datasets. While many new methods incorporate special algorithms for aligning several hundred sequences [6,8,18], current reference databases do not allow the evaluation of very large datasets, thus making it unclear how the published accuracies scale with the number of sequences.

While this last issue could probably be satisfyingly solved in the current benchmarking framework, another problem remains that is much harder to address. All the existing validation approaches have in common their reliance on the "one size fits all" assumption that structurally correct alignments are the best possible MSAs for modeling any kind of biological signal (evolution, homology, or function). A report on profile construction [33] has recently challenged this view by showing that structurally correct alignments do not necessarily result in better profiles. Likewise, it may be reasonable to ask whether better alignments always result in better phylogenetic trees and, more systematically, to question and quantify the relationship between the accuracy of MSAs and the biological relevance of any model drawn upon them.

In this review, I have presented some of the latest additions to the MSA computation arsenal. An interesting milestone has been the development of meta-methods able to seamlessly combine the output of several methods. Aside from easing the user's work, the main advantage of these consensus methods is probably the local estimation of reliability they provide (Figure 1). Using this estimation to filter out unreliable regions has already proven useful in homology modeling [34] and could probably be used further.

The main improvement reported here, however, is probably the notion of template-based alignment. Template-based alignment is more than a trivial extension of consistency-based methods. Under this new
model, the purpose of an MSA is not to squeeze a dataset and extract all the information it may contain, but rather to use the dataset as a starting point for exploring and retrieving all the related information contained in public databases. This information is to be used not only for mapping purposes but also for driving the MSA computation. Such a usage of sequence information makes template-based methods a real paradigm shift and a major step toward global biological data integration.

Acknowledgments
The author thanks the two anonymous reviewers for suggesting several missing references.
Author contributions. CN analyzed the data and wrote the paper.
Funding. The author is funded and supported by the Centre National de la Recherche Scientifique (France).
Competing interests. The author has declared that no competing interests exist.

References
1. Edgar RC, Batzoglou S (2006) Multiple sequence alignment. Curr Opin Struct Biol 16: 368-373.
2. Wallace IM, Blackshields G, Higgins DG (2005) Multiple sequence alignments. Curr Opin Struct Biol 15: 261-266.
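
Illustrative sketch. The consistency idea behind the primary library and the local reliability estimate discussed in the review can be illustrated with a small Python example. It counts, for each pair of residues aligned in a final MSA, the fraction of alternative input alignments that align the same pair. This is only a toy under stated assumptions, not the T-Coffee or M-Coffee implementation and not the published CORE index formula: the function names (aligned_pairs, build_library, pair_support), the dictionary representation of an alignment, and the unweighted vote are all hypothetical simplifications introduced here.

```python
from collections import Counter
from itertools import combinations


def aligned_pairs(msa):
    """Return the set of residue pairs that share a column in an MSA.

    `msa` maps a sequence name to its gapped row ("-" marks a gap).
    A pair is ((name_a, i), (name_b, j)), where i and j are residue
    indices (gaps are not counted), with name_a < name_b.
    """
    names = sorted(msa)
    width = len(msa[names[0]])
    if any(len(msa[name]) != width for name in names):
        raise ValueError("all rows must have the same length")

    next_residue = {name: 0 for name in names}
    pairs = set()
    for col in range(width):
        present = []
        for name in names:
            if msa[name][col] != "-":
                present.append((name, next_residue[name]))
                next_residue[name] += 1
        # Every two residues found in the same column form an aligned pair.
        pairs.update(combinations(present, 2))
    return pairs


def build_library(msas):
    """Count how many input alignments support each residue pair.

    Hypothetical stand-in for a primary library: each alternative
    alignment casts one unweighted vote per aligned pair.
    """
    library = Counter()
    for msa in msas:
        library.update(aligned_pairs(msa))
    return library


def pair_support(final_msa, library, n_inputs):
    """Fraction of input alignments agreeing with each pair of the final MSA."""
    return {pair: library[pair] / n_inputs for pair in aligned_pairs(final_msa)}
```

For example, with the two toy alignments {"a": "AC-G", "b": "ACTG"} and {"a": "A-CG", "b": "ACTG"}, scoring the first alignment against a library built from both gives full support (1.0) to the pairs that both alignments agree on and a support of 0.5 to the pair (("a", 1), ("b", 1)), which only one of them contains. Low-support pairs play the role of the low-agreement residues described for Figure 1.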