COMPUTER SCIENCE TOPICS SEMINR
COMPUTER SCIENCE TOPICS SEMINR CSC 8910
This 22 page Class Notes was uploaded by Mallie Crist on Monday September 21, 2015. The Class Notes belongs to CSC 8910 at Georgia State University taught by Staff in Fall. Since its upload, it has received 7 views. For similar materials see /class/209897/csc-8910-georgia-state-university in ComputerScienence at Georgia State University.
Date Created: 09/21/15
Comparative Analysis of Multiple Protein-Sequence Alignment Methods

Marcella A. McClure, Taha K. Vasi, and Walter M. Fitch
Department of Ecology and Evolutionary Biology, University of California, Irvine

We have analyzed a total of 12 different global and local multiple protein-sequence alignment methods. The purpose of this study is to evaluate each method's ability to correctly identify the ordered series of motifs found among all members of a given protein family. Four phylogenetically distributed sets of sequences from the hemoglobin, kinase, aspartic acid protease, and ribonuclease H protein families were used to test the methods. The performance of all 12 methods was affected by (1) the number of sequences in the test sets, (2) the degree of similarity among the sequences, and (3) the number of indels required to produce a multiple alignment. Global methods generally performed better than local methods in the detection of motif patterns.

Introduction

Comparison of primary sequence information is rapidly becoming the major source of data in the elucidation of the molecular mechanisms of replication and evolution of all organisms. There are basically three levels in the analysis of primary sequence information: (1) the search for homologues, (2) the multiple alignment of homologues, and (3) the phylogenetic reconstruction of the evolutionary history of homologues. Many multiple-sequence alignment programs and various scoring schemes have been developed to analyze potential relationships among sequences. Although a review (Myers 1991) and a comparison (Chan et al. 1992) of some methods from a computational perspective are available, there are no studies to date that evaluate these methods from a biologically informed perspective. The purpose of this study is to evaluate the ability of existing software to correctly identify the ordered series of motifs that are conserved throughout a given protein family.

There are two biological approaches to the multiple alignment of protein sequences: one attempts
to align homologous (ancestrally related) features, while the other attempts to align functionally or spatially equivalent features of a protein family. While there is considerable overlap in the alignments produced by methods with these two goals, the intents are distinctly different.

Key words: sequence comparison, multiple alignment, protein family, motifs.

Present address and address for correspondence and reprints: Marcella A. McClure, Department of Biological Sciences, University of Nevada, Las Vegas, Nevada 89154-4004.

Mol. Biol. Evol. 11(4):571-592. 1994. © 1994 by The University of Chicago. All rights reserved. 0737-4038/94/1104-0002$02.00

Multiple alignment methods are often used without knowledge of the assumptions implicit in their operation. We will assess the major academically produced methods available, regardless of their intent, and indicate the assumptions implicit in each of the methods (table 1). Our basic premise is that, regardless of the final goal, a method that cannot find the functional motifs that are highly conserved throughout a given protein family has diminished value for detecting new biologically informative patterns.

The multiple protein-sequence alignment problem may be divided into the following two conceptual steps: (1) the initial inference of an ordered series of motifs defining the limits of a protein family and (2) detection of the ordered series of motifs in other proteins, thereby expanding the family. Many software packages, both academic and commercial, rely on the existence of previously defined protein families to provide the motifs of the family. How are such protein-family patterns initially determined? Among highly conserved sequences (>50% identity), it is very difficult to deduce which residues of a protein are necessary for function or structure on the basis of multiple alignment of protein sequences alone. Laboratory experiments can provide clues as to which residues are critical for function and structure, but few generalizations can be made from such studies. Among
distantly related proteins (<30% identical residues), however, conserved residues often indicate the essentially invariable regions of the protein that are necessary for function or structure. When multiple alignments of such data are derived, however, it soon becomes apparent that the currently available methods are not very satisfactory. Even with the utilization of the most sophisticated software developed to date, refinement of such relationships still relies on the visual pattern-recognition skills of the human operator. The initial inference of the motifs defining a protein family by primary sequence analysis therefore requires the combination of multiple alignment methods and human pattern-recognition skills with corroborating experimental evidence (e.g., site-directed mutagenesis and crystallography).

Table 1. Multiple Alignment Methods

Method       Developer      Algorithm             Data Matrix(a)  Indels  Limits(b)     Assumptions(c)  Features(d)  Type(e)
Global:
AMULT        G. Barton      NW                    Any             C       -             Y, S            R, SE        P
ASSEMBLE     M. Vingron     Dot matrix, NW        Log odds        I/E     -             Y, S            -            P
CLUSTAL V    D. Higgins     WL                    Any             I/E     -             -               I            P, N
DFALIGN      D. F. Feng     NW                    Log odds        C       UP            Y, E, O         -            P
GENALIGN(f)  H. Martinez    CW, NW                UM              I/E     -             -               SE           P, N
MSA          S. Altschul    CL                    PAM250          I/E     ROS           N               B, FA        P
MULTAL       W. Taylor      NW                    UM, PAM250      C       -             S               AP, FA       P
MWT          J. Kececioglu  Maximum weight trace  Any             C       ROS           N               -            P
TULLA        S. Subbiah     NW                    Any             RGW     10 sequences  S               R, SE        P
Local:
MACAW        G. Schuler     SW                    PAM250          -       DOS           Y               SE, FA, MD   P
PIMA         P. Smith       SW                    AACH            I/E     -             Y               MD           P
PRALIGN      M. Waterman    CW                    PAM250          I/E(g)  -             Y               MD, MC       P, N(h)

(a) The matrices are log odds and PAM250 (Dayhoff et al. 1978), UM (unitary matrix; Feng et al. 1985), and AACH (amino acid cluster hierarchy; Smith and Smith 1990).
(b) UP = unpublished parameters; ROS = easily runs out of computer space, thereby limited to six sequences; and DOS = runs only on a DOS system with Windows.
(c) Y or N = yes or no to the question, Has homology been established?; S or E = multiple alignment is of structural or evolutionary intent; and O = input sequences must be in nearest-neighbor order, and a program is provided for this purpose.
(d) R = user-specified no. of iterations for refinement; SE = statistical evaluation is provided; I = interactive mode, so that user may choose intermediate alignments; FA = specified region can be forced to align; B = correction for bias of overrepresentation of sequences; AP = alteration of parameters between iterations; MD = user-specified motif density; and MC = user-specified degree of motif conservation.
(e) P = protein; and N = nucleic acid.
(f) Licensed to IntelliGenetics.
(g) This indel penalty applies to CWs only.
(h) A separate program is available for nucleic acids.

We have tested both global and local multiple alignment methods for their ability to identify the ordered series of motifs that are conserved throughout the hemoglobin, kinase, ribonuclease H (RH), and aspartic acid protease protein families. The study presented here, while not exhaustive, indicates that all the methods analyzed suffer, to varying degrees, from three types of problems: (1) the inability to produce a single multiple alignment from correctly aligned subsets of the input sequences, (2) sensitivity to the number of sequences in the test, and (3) sensitivity to which specific sequences are in the test. The ramifications of these shortcomings for the identification of functional motifs, as well as phylogenetic reconstruction, are discussed.

Methods Used for Comparative Analysis of Alignment Programs

All analyses were conducted on a SPARCstation GS running SunOS 4.1.1. The test sequences were extracted from the nonredundant database composed of PIR version 34.0, SWISS-PROT version 23.0, and GenPept (translated GenBank version 73.0), developed by the National Center for Biotechnology Information, National Library of Medicine (W. Gish, personal communication).

Scoring for Motifs

In general, we define a motif as a conserved contiguous run of 3-9 residues, often involved in the function or structural integrity of a protein, as inferred by multiple alignment analysis or laboratory experiments. In some cases only remnants of a motif can be found, and we call this a semiconserved motif (e.g., see fig. 3, motif II).
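As a rough illustration of the motif definition above, the following Python sketch scans an existing multiple alignment for contiguous runs of fully conserved columns. It is not taken from any of the 12 programs evaluated here. The function name `conserved_runs`, the `min_len` parameter, the use of `-` as the gap character, and the toy sequences are all assumptions made for the example. Only strict identity is checked, so semiconserved motifs and conservative substitutions would need a looser column test.

```python
def conserved_runs(alignment, min_len=3, gap="-"):
    """Return (start, end) column ranges, end exclusive, in which every
    aligned sequence carries the same non-gap residue for at least
    min_len consecutive columns. Hypothetical helper for illustration."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    runs = []
    start = None  # first column of the run currently being extended
    # Iterate one column past the end so a run touching the last column is closed.
    for col in range(width + 1):
        conserved = False
        if col < width:
            residues = {row[col] for row in alignment}
            conserved = len(residues) == 1 and gap not in residues
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_len:
                runs.append((start, col))
            start = None
    return runs


# Toy alignment (invented sequences) sharing the DTG tripeptide.
toy = ["MKDTGAL-", "LRDTGSLP", "-HDTGAVP"]
for begin, end in conserved_runs(toy):
    print(begin, end, toy[0][begin:end])
```

Setting `min_len=1` also reports isolated conserved columns, which would cover the case of a single completely conserved residue being counted as a motif.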
Occasionally a single residue that is completely conserved among all members of a protein family is found between larger motifs. In such cases we consider the single residue as one of the motifs comprising the ordered series of motifs (e.g., see fig. 4, motif II). An ordered series of motifs is defined as a set of conserved or semiconserved motifs that are found in the same arrangement relative to one another in all the sequences of a protein family. The spacing between the motifs can be highly variable, reflecting the regions of a protein that are less restricted by functional or structural constraints. These regions may evolve more rapidly and be more subject to insertion, deletion, and duplication.

There are two features of motifs that must be considered in their evaluation. The first, the motif density, is the percentage of the sequences in which a given motif is present. The second, the motif conservation, is the degree to which a motif is conserved in various members of the family (i.e., are the residues identical, or has conservative replacement occurred; have insertions and deletions [indels] occurred; or can more than one set of residues define a motif?). The motif conservation can be expressed in a variety of ways. In the PRALIGN program, e.g., the user specifies the number of mismatches and indels allowed within the motifs as two separate parameters.

Initially we planned to develop an independent scoring scheme to measure the global goodness of the alignments produced by the global methods. It soon became apparent, however, that some of the methods could not even identify the motifs known to be involved in the function of a given protein family. We decided, therefore, to score for each method's ability to detect each motif in four different data sets. A score for each motif is the percentage of the number of sequences in each data set for which the motif is correctly identified (see figs. 1-4; correct motifs are indicated by blackened bars and roman numerals). Some
methods could find one or more correct motifs in more than one subset of the sequences without being able to align these motifs to one another to produce a single multiple alignment of all the input sequences. In these cases the total percent correct match is a combined score of the aligned subsets (tables 2-5), allowing full credit for motif identification in each subset as if the motifs were each aligned correctly throughout the set. This scheme allows us to compare local and global methods to one another as well as among themselves.

Test Data Sets

We have chosen four protein families as data sets to test the ability of the multiple alignment methods to reconstruct known biologically informative patterns. To date, standard sets of protein sequences have not been established for assessing multiple alignment methods. The hemoglobin family has often been used to illustrate the reconstructive ability of a new multiple alignment method. In light of the extensive hemoglobin-sequence conservation, it is not surprising that many methods succeed in aligning various members of this family reasonably well. A more rigorous test of these methods would be to measure their ability to identify the highly conserved motifs involved in the function of various protein families. Many of these motifs were first inferred from primary protein-sequence multiple alignment analysis and were confirmed by biochemical and crystallographic

Table 2. Scores for Programs Tested Using Globins
Program and No. of Sequences Tested; Motif I (7 residues); Motif II (1 5 residues); Motif III (5 residues); Motif IV (5 residues); Motif V (3 residues); Parameters/Comments(a)
Global Methods
AMULT gt 12 100 100 100 100 100 Singleorder alignment defaults except 10 100 100 100 100 100 indel 8 410 and iteration 1 6 100 100 100 100 100 1 4
ASSEMBLE 12 100 92 100 100 100 Defaults except FILSUM algorithm 10 Did not perform alignment since filter produces empty plots(b) 6 100 100 100 100 100 FIL39LOG I 8 8 12
CLUSTAL V 12 100 92 100 100 100 Defaults
parameters tweaked are 10 100 92 100 100 100 pairwise indel 1 8 and ktuple 6 100 92 100 100 100 1 2 multiple alignment I 612 and E 2 10 DFALIGN 12 100 100 100 100 100 Defaults 10 100 100 100 100 100 6 100 100 100 100 100 GENALIGN 12 92 67 25c 100 100 83 67 17c 92 67 25 Defaults except match weight 2 NW 10 90 60 30 90 9O 50 40 80 60 20 90 60 30 6 83 100c 83 50 33c 67 2 x 33 67 2 x 33 Defaults except mawh we ght 39 1 NW SL9 MULTAL 100 90 100 100 100 Matrix weightd 0 5 cyclese 12 100 90 100 100 100 indel 20 window size 15 50 100 90 100 100 100 cutoff score 900 300 spanf 8 1283 90 80 80 80 80 RGW 2 4 6 median 2 or 4 2 12 83 83 83 67 83 RGW 8 4 12 Local Methods 75 92 75 67 67 Cutoff score 30 20 30 MD 50 70 80 70 60 60 25 50 result list size 100 for 100 67 100 67 67 all subsets several overlapping blocksh 100 100 100 100 100 i 100 100 100 100 100 E 033 ML clusters 100 100 100 100 100 SB clustersi 67 67 33 2 X 17 75 33 25 17 67 33 2 X 17 84 67 17 Window size 20 10 40 word size 50 30 20 60 3 X 20 60 3 X 20 20 50 3 35 MC 1 0 2 indel 67 2 X 33 33 33 0 50 0 MD 30 20 50 NOTE The score for each test is calculated as a percentage of the no of sequences in each data set in which the motif was identi ed Some methods nd the correct matches in gt1 subset of the data without being able to align these subsets to one another In these cases the total percent correct match is a combined score of the subsets values in parentheses Abbreviations are as in table 1 39 Deviations from default parameters are indicated by a dash for a single data set and by a bracket for two data sets or for new parameters used in all tests The explored range of parameter values is indicated in parentheses b ASSEMBLE tends to produce only correct results or nothing Has gaps in motifs d Speci es the mix ratio between the identity matrix and the PAM250 eg a weight of 2 indicates a 08 identity matrix 02 PAM250 mix Speci es the no of attempts the program makes to merge subalignments r Pairwise distance upper limit 
for the comparison of all sequences MULTAL allows the user to change parameters for each cycle Thus the range shown in some of the parameters indicates the change of that parameter for each cycle quot Creates several blocks for each cluster One has to manually with the help of the MACAW editor merge them to get the percentages for each cluster iCreates alignments by using two types of clusters maximal linkage ML clusters Smith and Smith 1990 and sequence branching SB clusters Smith and Smith 1992 9L9 Table 3 Scores for Programs Tested Using Kinases Program and No of Sequences Motif I Motif II Motif III Motif IV Motif V Motif VI Motif VII Motif VIII Tested 6 residues 1 residue 1 residue 9 residues 3 residues 3 residues 8 residues 1 residue ParametersComments Global Methods AMULT 12 100 83 92 100 100 100 100 100 Treebased alignment 10 100 90 90 100 100 100 100 90 Single order alignment iteration 6 100 67 67 100 100 100 100 100 4 14 ASSEMBLE 12 83 58 33 25 83 100 100 100 100 100 67 33 Defaults except FIL SUM algorithm 10 90 30 0 100 100 100 100 70 6 67 o o 100 100 100 100 50 FIL39LOG I 8 8 12 CLUSTAL V 12 100 92 92 50 42 100 100 100 100 100 58 42 Defaults parameters tweaked are 10 100 80 50 30 80 100 100 100 100 90 50 40 pairwise indel 18 and k 6 100 83 67 100 100 100 100 100 67 33 tuple 1 2 multiple alignment I 6 12 and E 2 10 DFALIGN 12 100 100 100 100 100 100 100 100 Begin weighting sequence 3 with value 2 10 100 100 100 100 100 100 100 100 Begin weighting sequence 2 with value 2 6 100 100 100 100 100 100 100 67 Begin weighting sequence 2 with value 2 LLS GENALIGN 12 100a 75 42 33 83 100 100 100 100 2 X 50 92 67 25 Defaults except NW match 10 80 60 20 60 40 20 80 1 100 100 2 X 50 100 2 X 50 90 weight 1 6 67 50 83 50 33 100 2 X 50 100 2 X 50 100 2 X 50 100 2 X 50 83 MULTAL 12 100 75 58 17 83 50 33 100 100 100 58 42 100 100 Cycles 14 window size 15 10 100 80 50 100 100 100 100 100 140 cutoff score 900 200 6 83 33 67 100 100 100 100 100 all others as in table 2b TULLA 
10 90 l 60 80 100 100 90 90 90 RGW 8 10 12 median 8 6 83a 83 50 33 67 100 100 100 100 33 Defaults Local Methods MACAW 12 67 0 75 100 100 83 100 0 Cutoff score 30 20 30 MD 10 70 0 50 100 100 90 90 0 50 20 50 result list 6 100 0 0 100 100 100 100 50 size 100 for all subsets several overlapping blocks PIMA 12 100 92 92 100 100 100 100 100 SB clustersd E 033 02 175 10 100 90 100 90 90 90 9O 50 30 20 SB clustersd 6 1 100 67 100 100 100 100 100 SB clustersd E 05 02 l75 PRALIGN l2 100 84 2 X 42 50 33 17 33 75 42 33 75 42 33 33 33 WindGW 526 20 1040 word 10 90 80 30 2 X 20 20 40 70 40 30 60 2 X 30 30 30 size 3 3 MC l 0 6 67 2 X 33 67 2 X 33 0 0 67 2 X 33 67 2 X 33 67 2 X 33 33 indel 0 MD 30 20 50 NOTE All designations and abbreviations are as in tables 1 and 2 39 See footnote c of table 2 quot See footnotes d g of table 2 See footnote h of table 2 d See footnote i of table 2 578 McClure et al Table 4 Scores for Programs Tested Using Proteases Program and No of Sequences Motif I Motif II Motif 111 Tested 3 residues 5 residues 3 residues ParametersComments Global Methods AMULT 12 92 58 83 Treebased alignment SD ordering 10 90 80 50 30 70 40 30 Singleorder alignment indel 8 4 10 iteration 1 6 67 0 50 1 4 Treebased alignment SD ordering ASSEMBLE 12 10 Did not perform alignment since lter produces empty plotsb 6 CLUSTAL V 12 100 75 50 25 50 2 X 25 Defaults parameters tweaked are pairwise indel 18 10 100 70 40 30 70 30 2 X 20 ktuple 1 2 multiple alignment I 6 12 E 2 10 6 100 0 67 DFALIGN 12 100 100 70 30 100 Begin weighting sequence 3 with value 2 12 123 70 30 133 Begin weighting sequence 2 with value 2 GENALIGN 12 92 67 42 25c 58 25 2 X 17 Defaults except match weight 4 deletion weight 2 NW 10 39 39 39 39 39 90 70 20 50 30 20 80 60 20c Defaults except match weight 2 NW 6 67 33 0 MULTAL 12 83 58 33 25 75 50 25 Cycles 14 cutoff score 900 200 all others as in 10 90 50 40 70 30 2 X 20 90 50 40 table 2d 6 50 O 33 TULLA 10 70 50 30 20 70 40 30 RGW 2 4 6 median 4 212 6 83 33 O RGW 6 8 
10 median 8 2 12
Local Methods
MACAW 12 100 25 67 Cutoff score 20 10 20 MD 25 30 33 10 100 30 70 20 50 result list size 100 for all subsets 6 100 0 33 several overlapping blocks(e)
PIMA 12 100 42 25 17 42 25 17 SB clusters(f) 10 100 60 40 20 70 SB clusters(f) E 033 02 175 6 100 0 33 SB clusters(f)
PRALIGN 12 67 2 X 33 34 2 X 17 67 2 X 25 17 Window size 20 10 40 word size 3 3 5 MC 10 100 40 2 X 30 30 70 30 2 X 20 1 0 2 indel 0 MD 30 20 50 6 100 3 X 33 0 30
NOTE. All designations and abbreviations are as in tables 1 and 2.
(a) SD ordering uses the standard deviation between sequence pairs to form an order.
(b) See footnote b of table 2.
(c) See footnote c of table 2.
(d) See footnotes d-g of table 2.
(e) See footnote h of table 2.
(f) See footnote i of table 2.

analysis. In addition to the hemoglobins, therefore, we have analyzed three such data sets: the kinase family; the aspartic acid protease family, both eukaryotic and viral; and the RH region of both the RNA-directed DNA polymerase (the reverse transcriptase) and the Escherichia coli RH enzyme. From each family we have selected a representative set of sequences with a broad phylogenetic distribution. The percentage range of identical residues among all sequence pairs in the hemoglobin data set is 10%-70%. The percentage range of identical residues among all sequence pairs within each of the enzymatic data sets is 8%-30%, indicating that only those residues involved in function are conserved among these highly divergent sequences. The alignments of figures 1-3 were extracted from larger alignments (50-65 sequences) produced by the program DFALIGN and corrected manually. All sets of test sequences are available through EMBL (identification no. D8161 17).

The hemoglobin data set includes α- and β-globins from mammals and birds, myoglobins from mammals, and hemoglobins from insects, plants, and bacteria. We designated five regions of the alignment to serve as the ordered series of motifs defining the globin family. There is no external measure of the authenticity of this
choice, as there is in the case of enzymatic protein families (see below). The decision was made to provide a test for the globins that is consistent with the tests of the kinase, aspartic acid protease, and RH families. We score for five motifs that are conserved or semiconserved throughout the phylogenetic distribution of the globin family. Motif I is essentially helical region C; motifs II and III, in helical regions E and F, respectively, are within the heme-binding region; and motifs IV and V are in helical regions G and H, respectively (fig. 1; Bashford et al. 1987).

The eukaryotic kinase proteins constitute a large enzymatic family that regulates the most basic of cellular processes. These proteins have been categorized by primary sequence analysis on the basis of the conservation of the ordered series of eight motifs found in their catalytic domains (Hanks and Quinn 1991) (fig. 2). Crystallographic studies of the cyclic adenosine monophosphate-dependent protein kinase confirm that most of the conserved motifs of the kinase protein core do cluster into the regions of the protein involved in nucleotide binding and catalysis (Knighton et al. 1991). The kinase data set includes serine/threonine, tyrosine, and dual-specificity kinases from mammals, birds, fungi, retroviruses, and herpes viruses.

The eukaryotic aspartic acid protease family consists of pepsins, chymosin, and renins. These proteases have two domains. Each domain has an ordered series of three motifs that contribute to the active site of the enzyme. The most prominent motif is three consecutive conserved residues: aspartic acid, threonine, and glycine (single-letter code DTG) (fig. 3). It has been suggested that the aspartic acid proteases evolved through duplication of a single-domain prototype (Tang et al. 1978). The retroid family aspartic acid proteases are about half the size of the cellular proteases. Primary sequence analysis of retroid proteases indicated an ordered series of three motifs, suggesting that they
function as dimers and that they diverged from the eukaryotic aspartic acid proteases prior to the latter group's duplication event (Pearl and Taylor 1987; Doolittle et al. 1989). Crystallographic studies subsequently confirmed the dimeric nature and catalytic site of the retrovirus aspartic acid proteases, as predicted from primary sequence analysis (Miller et al. 1989). The aspartic acid protease data set includes pepsin (only the amino-terminal domain of this double-domain protease) from mammals, birds, and fungi and from representative members of the retroid family, such as retroviruses, caulimoviruses, and retrotransposons (fig. 3; McClure 1992).

The RH domain of the RNA-directed DNA polymerase (reverse transcriptase) of the retroid elements resides in the carboxyl one-third of the protein. Amino acid sequence comparisons of the retroviral proteins correctly predicted the position of the RH activity in the RNA-directed DNA polymerase by identification of motifs conserved with the E. coli RH sequence (Johnson et al. 1986). Subsequent mutational studies confirmed the predicted position (Tanese and Goff 1988). The highly conserved motifs of the retroid family and E. coli RH proteins have been shown to cluster in the catalytic site, as identified in the crystal structures of the E. coli RH protein (Katayanagi et al. 1990) and the HIV-1 RH domain (Davies et al. 1991) (fig. 4). The RH data set includes sequences from E. coli and representative members of the retroid family, including retroviruses, caulimoviruses, hepadnaviruses, retrotransposons, retroposons, and group II plasmids of filamentous ascomycete mitochondria (McClure 1993).

Subsets of 6, 10, and 12 sequences were used to assay the ability of each method to identify the ordered series of motifs defining each protein family. There are two reasons for varying the sequence number: (1) by varying the number of subsets of sequences tested, we could evaluate the effects of both the sensitivity to the number of sequences and to specific sequences in each test, and (2) some methods can
only handle small data sets (table 1). Each six-sequence data set contains the widest distance distribution of sequence relationship. The 10- and 12-sequence data sets were created by addition of sister sequences to the 6-sequence data sets.

Table 5. Scores for Programs Tested Using RH
Program and No. of Sequences Tested; Motif I (3 residues); Motif II (1 residue); Motif III (3 residues); Motif IV (5 residues); Parameters/Comments
Global Methods
AMULT 12 92 75 58 17 67 50 17 59 25 2 X 17 Singleorder alignment defaults except 10 100 70 60 9O 60 30 iteration 4 1 4 6 100 83 50 33 67 80 50 33
ASSEMBLE 12 12 39 39 39 39 39 Did not perform alignment since filter produces empty plots(a) T113 HLLOG and FILSUM algomhms for
CLUSTAL V 12 100 75 75 58 17 75 33 25 17 Defaults parameters tweaked are pairwise 10 100 70 70 80 2 X 30 20 indel 1 8 and ktuple 1 2 multiple 6 100 67 50 50 alignment 1 6 12 and E 2 10
DFALIGN 12 100 100 83 100 Begin weighting sequence 3 with value 3 10 100 60 70 100 Begin weighting sequence 4 with value 3 6 100 100 67 100 Begin weighting sequence 2 with value 2
GENALIGN 12 100 83 l7b 58 67 33 2 X 17b 75 33 25 l7b Defaults except NW match weight l 10 80b 90 70 40 30b 90 30 3 X 20b 6 a 100b 67 67 67
MULTAL 92 75 17 92 58 2 X 17 75 50 25 83 Cycles 14 cutoff score 900200 All others 100 70 30 9O 80 60 20 70 as in table 2c 100 83 67 83 100b 50 40 80 2 X 40 Defaults except RGW 8 10 12 median 8 100 50 67 50
Local Methods
58 42 58 17 Cutoff score 20 10 20 MD 25 30 80 70 70 40 33 2050 result list size 100 for all 83 67 67 67 subsets several overlapping blocksd 83 75 67 33 2 X 17 92 42 33 17 ML clusters E 02 02175 I 55 5 7 100 80 20 80 80 40 2 X 20 90 70 20 e 100 100 67 83 50 33 ML clusters E 033 02 175 75 67 2 X 33 50 33 l7 17 Window size 20 10 40 word size 3 80 80 60 20 40 20 3 5 MC 1 0 2 indel 0 MD 30 83 67 2 X 33 33 50 20 50
NOTE. All designations and abbreviations are as in tables 1 and 2.
(a) See footnote b of table 2.
(b) See footnote c of table 2.
(c) See footnotes d-g of table 2.
(d) See
footnote h of table 2 See footnote i of table 2 582 McClure et al H U MA HA 0R HADK HBH U HBOR HBDK M YH U M YOK I GLOB GPUGNI GP YL GGZLB H UMA HA 0R HADK HBH U HBOR HBDK M YH U M YOR I GLOB GPUGNI GPYL GGZLB H UMA HA 0R HADK HBH U HBOR HBDK M YH U M YOR I GLOB GHKMUITDP HFEVMKGALLGTIKEAIKENWSDEMGQ GP YL GGHB VAAA HYPIVGQELLGAIKEVLGDAATDDILD I A B C I I j VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKT MLTDAEKKEVTALWGKAAGHGEEYGAEALERLFQAFPTTKT VLSAADKTNVKGVFSKIGGHAEEYGAETLERMFIAYPQTKT VHLTPEEKSAVTALWGKV NVDEVGGEALGRLLVVYPWTQR VHLSGGEKSAVTNLWGKV NINELGGEALGRLLVVYPWTQR VHWTAEEKQLITGLWGKV NVADCGAEALARLLIVYPWTQR FASFGNLS GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLK GLSDGEWQLVLKVWGKVEGDLPGHGQEVLIRLFKTHPETLEKFDKFKGLK FPHF DLS FSHF DLS FPHF DLS FESFGDLS FEAFGDLS SPLTADEASLVQSSWKAV SHNEVEILAAVFAAYPDIONKFSQFA GK ALTEKQEALLKQSWEVLKQNIPAHSLRLFALIIEAAPEEKYVFSFLKDSN GVLTDVQVALVKSSFEEFNANIPKNTHRFFTLVLEIAPGAKDLFSFLKGSS MLDQQTINIIKATVPVLKEHGVTITTTFYKNLFAKHPEVRPLF DMG II III D E F V l F H g AQ EH KKVADALTNAV AHVDDM PNALSAL DLHAHKLR H GSAQ H KKVADALSTAA GHFDDM DSALSALSDLHAHKLR H GSAQ RH KKVAAALVEAV NHVDDI AGALSKLSDLHAQKLR TPDAVMGNPK RH KKVLGAFSDGL AHLDNL KGTFATLSELHCDKLH SAGAVMGNPK RH AKVLTSFGDAL KNLDDL KGTFAKLSELHCDKLH SPTAILGNPM EH KKVLTSFGDAV KNLDNI KNTFAQLSELHCDKLH SEDEMKASED KH ATVLTALGGIL KKKGHH EAEIKPLAQSHATKHK TEDEMKASAD KH GTVLTALGNIL KKKGQH EAELKPLAQSHATKHK DLASIKDTGAFATH TRIVSFLSEVIALSGNTSNAAAV NSLVSKLGEDHKARGV EIPE NNPKLEHHRAVIFKTICESA TELRQKGHAVWDNNTLKRLGS HLKNK EVPQ NNPDL H GKVFKLTYEAA IQLEVNGAVASDATLKSLGSHHVSKGV RQE SLEQP EL MTVLAAAQNI ENLPAI LPAVKKIAVKHCQAG I l l G H I r I AAHLPAEFTPAVH SLDKFLASVSTVLTSKYR ARHCPGEFTPSAH AMDKFLSKVATVLTSKYR AIHHPAALTPEVH SLDKFMCAVGAVLTAKYR LAHHFGKEFTPPVQ YQKVVAGVANALAHKYH LARHFSKDFSPEVQquotEQKLVSGVAHALGHKYH LAAHFTKDFTPECQ NQKLVRVVAHALARKYH IPVKYLEFISECIIQ LQSKHPGDFGADAQ MNKALELFRKDMASNYKELGFQG ISIKFLEYISEAIIH LQSKHSADFGADAQ MGKALELFRNDMAAKYKEFGFQG SAA QFGEFRTAL AYLQANVS WGDNVAquotWNKALlDNTFAIVVPRL wTEAYNQLVATIKAEMKE 
wTIAYDELAIIIKKEMKDAA GRAYGVIADVFIQVEADLYAQAVE VEPVNFKLLSHCLL VDPVNFKLLAHCIL VDPVNFKFLGHCFL VDPENFRLLGNVL VDPENFNRLGNVLI VDPENFRLLGDILI ET I c VDA HFPVVKEAILKTIKEVVGDKWSEELNT

FIG. 1. Multiple alignment of representative globin sequences. The five motifs scored for in the comparative analysis are indicated by blackened bars and the numerals I-V. Black/white reversals of columns within the motifs indicate the most conserved residues of the motifs and their conservative substitutions, based on the similarity scheme (F,Y), (M,L,I,V), (A,G), (T,S), (Q,N), (K,R), and (E,D). If the same number of matches occurs for more than one residue in a column, then one set is arbitrarily chosen for black/white reversal. The conserved helices of the globins are indicated by overlined regions and the letters A-H. The set of 12 sequences includes HAHU (human), HAOR (duckbill platypus), and HADK (duck) α-chain hemoglobins and HBHU (human), HBOR (duckbill platypus), and HBDK (duck) β-chain hemoglobins. MYHU (human) and MYOR (duckbill platypus) are myoglobins. The remaining hemoglobin sequences are IGLOB (insect; Chironomus thummi), GPYL (legume; yellow lupine), GPUGNI (nonlegume; swamp oak), and GGZLB (bacteria; Vitreoscilla sp.). The two other test sets of globin sequences are subsets of these sequences: set 10 is set 12 without HAOR and HBOR, and set 6 comprises HAHU, HBHU, MYHU, IGLOB, GPYL, and GGZLB.

The sequences of the four protein families tested display a wide range of motif density, motif conservation, and indels. The globins are highly conserved, with few indels, and the five motifs range in size from three to seven amino acids (fig. 1 and table 2). The kinase family has well-defined indel regions interspersed among eight highly conserved motifs, each of which varies from one to nine amino acid residues in size (fig. 2 and table 3). The aspartic acid protease and RH sequences have the greatest range of motif density, motif conservation, and indels (figs. 3 and 4). The size of the three motifs of the protease is from three to five amino acid residues, and the four motifs of the RH data set vary
from one to five amino acid residues (tables 4 and 5). These latter two tests are more difficult than either the globin or kinase tests.

Description of Alignment Methods Analyzed

Multiple Alignment Strategies

There are two basic software approaches in determining the similarity among proteins. The following global methods construct an alignment throughout the length of the entire sequence: AMULT (Barton and Sternberg 1987a, 1987b), DFALIGN (Feng and Doolittle 1987), MULTAL (Taylor 1987, 1988), MSA (Lipman et al. 1989), TULLA (Subbiah and Harrison 1989), CLUSTAL V (Higgins et al. 1992), and MWT (Kececioglu 1993). A subclass of global methods attempts first to identify an ordered series of motifs and then proceeds to align the intervening regions (e.g., GENALIGN [Martinez 1988] and ASSEMBLE [Vingron and Argos 1991]). Local methods only attempt to identify an ordered series of motifs while ignoring regions between motifs (e.g., PIMA [Smith and Smith 1990, 1992], PRALIGN [Waterman and Jones 1990], and MACAW [Schuler et al. 1991]). Brief descriptions of the basic algorithms, scoring matrices, and penalties for indels used in all the methods analyzed are presented below (table 1).

Global Methods

The diagram in figure 5 summarizes the basic implementation of the various algorithms employed in the nine different global multiple alignment methods analyzed (table 1; Barton and Sternberg 1987a, 1987b; Feng and Doolittle 1987; Taylor 1987, 1988; Martinez 1988; Lipman et al. 1989; Subbiah and Harrison 1989; Vingron and Argos 1991; Higgins et al. 1992; Kececioglu 1993). Table 1 indicates the various algorithms employed by each method. In light of the computational expense of simultaneous comparison of protein sequences, all methods begin by comparing all sequences in a pairwise fashion. Several methods cluster the sequences into subalignments by using a similarity measure (GENALIGN and MULTAL) or a phylogenetic tree (CLUSTAL V, AMULT, and DFALIGN). GENALIGN, MULTAL, and CLUSTAL V subsequently align the clustered subalignments to one another by employing
various consensus methods that reduce each subalignment to a single consensus sequence. Allowing the subalignments to be merged by aligning their consensus sequences to one another produces a progressive multiple alignment. In addition, GENALIGN allows the user to choose either the Needleman-Wunsch (NW) or consensus word (CW) algorithms (for definitions, see the section Basic Algorithms) for the alignment, while CLUSTAL V permits the user to specify individual parameters for both the pairwise and multiple alignment stages. AMULT and DFALIGN produce a progressive multiple alignment directly from the clustering stage. AMULT then produces a final multiple alignment through optimization of the progressive multiple alignment. A novel feature of AMULT provides the option of producing a progressive multiple alignment directly from the pairwise ordering stage, bypassing the phylogenetic clustering stage.

Two methods, MSA and TULLA, produce a progressive multiple alignment and then a final multiple alignment. The MSA method can also produce a final multiple alignment, bypassing the progressive multiple alignment stage, if the user supplies the upper bounds for all sequence pairs that are necessary for the multidimensional dynamic programming on a restricted space. ASSEMBLE and MWT produce a final multiple alignment directly from the pairwise analysis. The MSA and MWT methods differ from the others because they compute an optimal multiple alignment with respect to a well-defined multiple alignment scoring function. The source code for GENALIGN has been licensed to IntelliGenetics and therefore is no longer available. All other developers have made their source code available upon request, as is the standard practice in the scientific community.

The concept of a progressive multiple alignment has been suggested by several developers (Waterman and Perlwitz 1984; Feng and Doolittle 1987; Taylor 1987). This approach begins with alignment of the two most closely related
sequences, as determined by pairwise analysis, and subsequently adds the next closest sequence or sequence group to this initial pair. This process continues in an iterative fashion, adjusting the positioning of indels in all sequences. The major shortcoming of this approach is that a bias may be introduced in the inference of the ordered series of motifs because of an overrepresentation of a subset of sequences. More recently developed methods, such as MSA, use a sequence-weighting scheme to correct for this potential problem (table 1; Altschul et al. 1989).

Local Methods

We have analyzed three local multiple alignment methods (table 1). MACAW (multiple alignment construction workbench) automatically performs multiple alignment of input sequences and also provides a multiple alignment sequence editor (Schuler et al. 1991). This method begins with pairwise analysis of all sequences to identify potential motifs. Only those motifs found in all pairwise alignments are coalesced into blocks that the user can then manipulate with the onscreen editor. The PIMA method begins with a pairwise analysis of all sequences, then constructs a tree on the basis of this order, and derives a pattern at each node by using the progressive alignment approach (Smith and Smith 1990, 1992). This is continued in an iterative fashion until a root consensus pattern is achieved using the amino acid class hierarchy (see Scoring Matrices). PRALIGN is a method based on the CW approach (Waterman 1986; Waterman and Jones 1990). Words are found on the basis of user-specified word length (number of contiguous residues), window length (number of consecutive residues to search within for a word), and motif density and motif conservation parameters (for definitions, see Methods Used for Comparative Analysis of Alignment Programs).

Basic Algorithms

The biologically interesting formulations of the multiple alignment problem are in the class of so-called NP-complete problems (i.e., nondeterministic polynomial time complete problems). This implies that algorithms that can find an optimal multiple alignment for any set of input sequences, called exact algorithms, are unlikely to be efficient. However, exact algorithms that can efficiently find an optimal alignment
for specific sets of sequences exist, and some are known (Carrillo and Lipman 1988; Kececioglu 1993) and are included in this analysis (e.g., MSA and MWT). Algorithms that can efficiently find an alignment that is guaranteed to be close to the optimal alignment, called approximation algorithms, are possible, and some have recently been described (Gusfield 1993; Pevzner 1993). Whether the best alignment produced by these new algorithms includes the ordered series of motifs that define a given protein family remains to be determined. Only the algorithms and approaches implemented in the multiple alignment methods in this study will be briefly described.

The dot matrix approach has been used extensively in sequence analysis. In brief, a two-dimensional array of two sequences is created, and a dot is placed for matches. In the ASSEMBLE method the dot matrix is initially employed as a filter to identify and retain only those motifs that are conserved among a given set of sequences, prior to the use of dynamic programming. States and Boguski (1990) have written an elegant history and detailed description of the various biological applications of the dot matrix method.

Most of the methods compared here employ dynamic programming, which finds an optimal alignment for two sequences on the basis of various scoring schemes. The scoring scheme is usually based on a value for matches and replacements (see below) and on a penalty for indels (see below). The major shortcoming of this approach, when applied to more than two sequences, is that it requires intensive computer time (CPU time proportional to N^K, where K is the number of sequences and N is their average length). In 1970, Needleman and Wunsch wrote the first dynamic programming algorithm for the global comparison of two sequences. In brief, a two-dimensional array of the sequences is employed to find maximal matches while penalizing for indels (Needleman and Wunsch 1970). This method has formed the basis of most of the subsequent
extensions to higher dimensional arrays for multiple sequence alignment. A significant reduction in CPU time for the case of two sequences, with little loss in sensitivity, was achieved by the use of the dot matrix method coupled to the NW algorithm, resulting in the Wilbur-Lipman (WL) algorithm (Wilbur and Lipman 1982). Another improvement to the NW algorithm, when extended to multiple sequences, was achieved by the use of pairwise alignments to restrict the search for optimal paths among multiple sequences, thus creating the Carrillo-Lipman (CL) algorithm (Carrillo and Lipman 1988).

Two of the three local multiple alignment methods analyzed here employ the Smith-Waterman (SW) algorithm (Smith and Waterman 1981). This algorithm was the first useful approach for identifying subsequences within larger sequences, and it allows for indels of arbitrary length within the subsequence. The use of this algorithm in the MACAW alignment editor, however, does not allow the introduction of indels within a subsequence.

One global method, GENALIGN, and one local method, PRALIGN, are based on the CW approach to the multiple alignment problem (Karlin et al. 1983; Waterman 1986). It is assumed that the CWs defining a given protein family are unknown. All subsequences of a specific word size are then searched for within a given window among all the input sequences. Waterman and Jones (1990) have written a detailed description of the CW approach applied to both DNA and protein sequences.

FIG. 2. Multiple alignment of representative eukaryotic kinase protein sequences. The eight motifs scored for in the comparative analysis are indicated by blackened bars and the numerals I-VIII. CAPK (bovine cardiac muscle), MLCK (rat skeletal muscle), PSKH (HeLa cell), CD28 (Saccharomyces cerevisiae), and CMOS and RAF1 (human oncogenic proteins) are the sequences of serine/threonine-specific kinase proteins. WEE1 is a dual-specificity kinase from S. pombe. CSRC (chicken oncogenic protein), VFES (feline sarcoma virus oncogenic protein), PDGMR (mouse PDGF receptor), and EGFR (human EGF receptor) are sequences of tyrosine-specific kinase proteins. HSVK is the herpes simplex virus kinase. The asterisk and residues in parentheses indicate a HSVK duplication that provides a second conserved motif VIII. Numbers in parentheses indicate the number of amino acids in insertion/deletion positions not included in the alignment display. All other designations are as in fig. 1. The two other test sets of kinase sequences are subsets of these sequences: set 10 is set 12 without MLCK and CSRC, and set 6 comprises CAPK, CD28, WEE1, VFES, PDGMR, and EGFR.

Scoring Matrices

Various types of amino acid exchange matrices are available to assist in aligning protein sequences (Fitch and Margoliash 1967; Dayhoff et al. 1978; Feng et al. 1985; Taylor 1986; Rao 1987; Risler et al. 1988). Values for replacing one residue with another are based on physical-chemical similarities, ease of mutating one codon to another, and/or the observed frequency at which replacement occurs in closely related proteins. A widely accepted method for generating exchange matrices is the accepted point mutation (PAM) model (Dayhoff et al. 1978). To alleviate matrix bias, we have evaluated all but two methods with a PAM250 matrix (PAM120 did not produce significantly different results). Although the method of Dayhoff provides scores for replacement between all amino acids, the highest scoring replacements are based on the similarity scheme F,Y; M,L,I,V; A,G; T,S; Q,N; K,R; and E,D.

The amino acid class hierarchy is intrinsic to the PIMA method; therefore this method cannot be evaluated with any other scoring scheme. This hierarchical classification scheme gives a score of three for identical residues, a score of two for some conservative replacements, and a score of one for broad-based similarities (e.g., all charged residues). Although this scheme groups the amino acids into hierarchical classes on the basis of side-chain physicochemical properties, it does not allow for all known conservative replacements (Smith and Smith 1990). The source code for GENALIGN is unavailable; therefore we are unable to change the embedded unitary matrix to the PAM matrix.

FIG. 3. Multiple alignment of representative aspartic acid protease sequences. The three motifs scored for in the comparative analysis are indicated by blackened bars and the numerals I-III. The retroid family aspartic acid protease sequences are from the retroviruses HTLV-I (human T-cell leukemia virus type I), RSV (Rous sarcoma virus), SRV-I (simian retrovirus type I), HIV-I (human immunodeficiency virus type I), and MoMLV (Moloney murine leukemia virus); the caulimovirus CaMV (cauliflower mosaic virus); and the retrotransposons Copia and 17.6 (Drosophila melanogaster) and TY3 (Saccharomyces cerevisiae). PEPH (human), PEPC (chicken), and PEPP (fungus, Penicillium janthinellum) are the amino-terminal half of pepsin sequences. All other designations are as in figs. 1 and 2. The two other test sets of aspartic acid protease sequences are subsets of these sequences: set 10 is set 12 without SRV-I and 17.6, and set 6 comprises PEPH, MoMLV, CaMV, COPIA, 17.6, and TY3.

FIG. 4. Multiple alignment of representative RH sequences. The four motifs scored for in the comparative analysis are indicated by blackened bars and the numerals I-IV. The retroid family RH sequences are from the retroviruses HTLVII (human T-cell leukemia virus type I) and HIV1 (human immunodeficiency virus type II); the hepadnavirus HBV (human hepatitis B virus, ayw strain); the retroposon Ingi (T. brucei); and the group II mitochondrial plasmid Maup (Mauriceville 1c strain of Neurospora crassa). Escherichia coli is the ribonuclease H from E. coli. Other abbreviations are as in fig. 3. All other designations are as in figs. 1 and 2. The two other test sets of RH sequences are subsets of these sequences: set 10 is set 12 without HBV and Maup, and set 6 comprises PEPH, MoMLV, CaMV, COPIA, 17.6, and TY3.

FIG. 5. Schematic
representation of the basic strategies employed by nine different global multiple alignment methods. All methods perform initial pairwise alignments and then progress through various stages before producing a progressive or final multiple alignment. The loop on the TULLA and AMULT methods indicates that an optimization procedure can be performed on the multiple alignment at the indicated stage. All abbreviations are as in the text. The asterisks indicate one of two user-specified strategies for the AMULT program. The plus sign indicates that MSA uses the progressive multiple alignment strategy to provide the upper bounds for all sequence pairs in the multidimensional dynamic programming on a restricted space. The user may specify these upper bounds, thereby overriding the progressive multiple alignment step.

Insertions and Deletions

Alignment of protein sequences often requires the introduction of indels to maximize the similarity between sequences. There are basically two different methods for scoring indels. The most commonly used method assesses a constant, length-independent penalty (C). The second method charges a length-independent penalty for the initiation of the indel (I) and a length-dependent penalty for extending the indel (E). One of the methods analyzed in this study, TULLA, uses an indel score referred to as the relative gap weight (RGW), which assesses a constant indel penalty relative to how many sequences have this indel. The greater the number of sequences containing a common indel, the higher the penalty.

Parameters

The rate-limiting step of this study has been determining the appropriate user-specified parameters of each method for each data set. The number of user-specified parameters varies from method to method (from one to seven). Often the same parameter is called by different names in different programs. We have adopted a uniform parameter listing throughout this study; therefore the indel penalty is the gap penalty, C is the constant length-independent indel penalty, and I+E is the
initial length-independent plus the extension length-dependent indel penalty. In the ASSEMBLE program, I+E is the first and second penalty, and in CLUSTAL V it is the fixed and floating penalty. Word size is called k-tuple in CLUSTAL V and amino acid residue length in GENALIGN. The only parameter common to all methods is the indel penalty. In PRALIGN the I+E penalty is only applied to the word size, thus forming part of the motif conservation. A range of parameter conditions has been explored for each method. Changes that have provided significantly better results, as judged by the motif identification criteria, when substituted for the default parameters are indicated in tables 2-5. The software developers have also been given the opportunity to improve the results of the test of their methods by altering source code or by suggesting alternative parameter-range combinations. Few suggestions were forthcoming that improved the test results, although those changes that resulted in improvement have been incorporated into this analysis.

Results

Although the program MSA correctly aligns the set of six globin sequences, it could not be tested further because of space requirements greater than the 40 megabytes of RAM and 40 megabytes of swap available (Lipman et al. 1989). The preliminary program MWT, which is an implementation of the exact algorithm for the maximum weight trace multiple alignment problem, could not produce results at all with our test sets. We attribute this to the space limitations of our computer (Kececioglu 1993). By using a set of six globins with >50% identity, however, MWT produces the correct alignment (unpublished observation). An implementation of the approximation algorithm for MWT that is space efficient is in progress (J. Kececioglu, personal communication). Future testing will determine whether either MSA or MWT can correctly identify motifs that define a protein family. These two methods will not be considered further.

Our comparative analysis indicates three distinct types of problems in multiple
sequence alignment. The most significant problem encountered is the inability to merge subsets of sequences, in which motifs have been correctly identified, to provide a single multiple alignment (tables 2-5). The global method GENALIGN and the local method PRALIGN exhibit this problem for all data sets to varying degrees, depending both on the number of sequences and on which specific sequences are analyzed (tables 2-5). In the kinase test several other methods (ASSEMBLE, CLUSTAL V, MULTAL, TULLA, and PIMA) exhibit this problem to a minor degree. In this case the problem stems from the inability to recognize single-residue motifs that are common between subsets (table 3 and fig. 2).

Both the protease and RH data sets have some motifs that display low motif conservation (e.g., fig. 3, motif II, and fig. 4, motif IV). Most of the methods exhibit varying degrees of inability to merge correctly aligned subsets of sequences from these more distantly related data sets (tables 4 and 5). It should be noted that an additional weighting parameter was developed for DFALIGN (D.-F. Feng and R. F. Doolittle, personal communication) to specifically correct this type of error. This parameter allows the user to specify an additional weight (a value of 2 or 3 is sufficient) to be added to the score for each identical match, beginning with a user-specified sequence. For example, in the kinase test set, a weight of 2 is added for each identical residue common between sequences, beginning with the third sequence. Use of this parameter is absolutely necessary to achieve the scores of tables 3-5 for the DFALIGN program. Extreme caution should be exercised in the manipulation of this parameter, even by expert users (R. F. Doolittle, personal communication).

The second problem is the degree to which the number of sequences in the test set affects the ability to recognize motifs. Most methods perform better with larger data sets. In some cases, however, even though the accuracy of identifying motifs increases with the number of sequences, the inability to merge
correct subsets of the data set is introduced into the multiple alignment (tables 3-5, comparing sets of 10 vs. 12).

The third problem, sensitivity to specific sequences in the data sets, appears to be a more general problem. One might think that the degree to which a method could identify motifs would not vary significantly as a function of addition or deletion of sister sequences to the data set, but only in the globin test is this problem negligible. Sensitivity to specific sequences is most consistently exhibited by the global methods GENALIGN and AMULT and by the local method PIMA, although all methods suffered to a degree from this problem (tables 2-5).

Discussion

Protein sequences with >50% amino acid residue identity can usually be unambiguously aligned by many of the multiple alignment methods currently available. Among protein sequences with <30% identity, it can be fairly straightforward to find the ordered series of motifs when the motifs are well conserved and when few indels have occurred (table 3 and fig. 2). It is difficult, however, to discern the ordered series of motifs that define a protein family and to obtain an adequate global multiple alignment that can be used in subsequent phylogenetic inference if the motifs are not well conserved and if significant indels have occurred (tables 4 and 5 and figs. 3 and 4).

We have identified three specific problems that are exhibited to various degrees by all the methods tested. The first, the inability to produce a single multiple alignment, could be due to an indel penalty that is too high. This seems unlikely, since we have varied the indel penalties in most methods without alleviating this problem. The extra parameter of the DFALIGN method, which allows the user to increase the weight for matches as the distance between sequences increases, suggests that the inability to produce a single multiple alignment from subsets could be addressed as a matrix problem. Perhaps identical residues common among
distantly related protein sequences should have a higher value, especially if they occur in small contiguous runs. The point in the divergence of a family of protein sequences at which such an increase in the values of identities should take precedence over more standard matrix scores needs to be investigated. Currently, subsets are merged by adjusting the placement of indels and appropriately reducing or increasing the number of indels to produce a single multiple alignment as a final manual refinement.

The second problem, the sensitivity to the number of sequences, and the third problem, which specific sequences are in the test set, are serious problems. The increase from 6 sequences to 10 sequences, by the addition of sister sequences to the test data sets, usually increases the ability of most methods to identify motifs. This increase, however, is accompanied by the introduction of the inability to merge correct subsets. The addition of only two more sister sequences to the 10-sequence set, however, causes a decrease in identification of motifs. This effect is most significant for the protease and RH tests (tables 4 and 5). Why so many of the methods are sensitive to sequence number and specificity is an area that warrants further investigation on the part of the software developers. Such shortcomings should warn biologists that variation in data sampling could lead to erroneous conclusions regarding the ordered series of motifs defining a protein family, as well as the phylogenetic history of the gene, when these methods are used.

It is surprising that the global methods perform better than the local methods in the correct identification of the ordered series of motifs present in the four different data sets analyzed (tables 2-5). In addition, methods (global or local) based on the CW approach perform poorly compared with all other methods. In light of these results, the biologist-user should exercise caution in the use of local methods or CW methods (either local or global) to
infer functional motifs.

It is obvious that a method that can identify an ordered series of motifs, in which individual motifs can vary in both motif density and motif conservation, is just the first stage of obtaining a structurally or evolutionarily meaningful multiple protein-sequence alignment. Once this is achieved, the intervening regions of the ordered series of motifs must be aligned. Such an alignment can then be used for phylogenetic reconstruction, for classification of additional sequences, and for determining significantly different subsequences among the sequences that will provide additional information about functional properties (e.g., substrate specificity).

We are interested in the development of multiple alignment approaches that are designed to reconstruct the evolutionary relationships between proteins. Such approaches must not only take into account sequence identity and conservative substitution based on mutational frequencies and physical and chemical similarities of amino acids but must also be able to describe regions of indels and duplication that can be very useful as phylogenetic markers. Methods that only detect highly conserved motifs, while useful for inferring function, are insufficient for phylogenetic analysis. If all that is detected between proteins are the functionally or structurally constrained residues, and if such regions form the basis of phylogenetic reconstruction, then one runs the risk of inferring an incorrect tree topology because of the increased likelihood of parallel or convergent substitutions (this problem can be mitigated by considering sequence information conserved between more closely related relatives).

The area of computational biology that encompasses both sequence-search and alignment algorithms has created a plethora of methods. In only a few instances have developers attempted to evaluate the multiple alignments produced by their methods by comparing them with experimentally determined structures (Barton and Sternberg 1987a,
1987b; Subbiah and Harrison 1989. The field is now sufficiently developed for adequate testing of methods on real sequence data. It is no longer sufficient that algorithm developers merely propose yet another approach to these problems. It is incumbent upon the software developers to specify the limits of new methods on the basis of an adequate sampling of known protein families. Likewise, it is the obligation of the analytical biologist to provide well-controlled tests and to suggest further directions for the development of new methods for sequence analysis. Perhaps developers could use the test sequences described here to test new approaches versus older ones.

We hope this study not only serves as a guide for multiple protein-sequence methods for biologists but that it also provides an overview of the problem and a language with which to communicate with the mathematicians, statisticians, and computer scientists in the field. This analysis also provides the algorithm developers with a more informed perspective on the nature of the biological pattern recognition in primary sequences.

The ability to infer the ordered series of motifs that define a protein family is not trivial. While the parameter values utilized in the various methods analyzed in this study may serve as a guide for inferring motifs in other protein sequences, they should in no way be considered as the parameters that will always find the motifs. The state-of-the-art strategy for the initial inference of the motifs defining a protein family from primary sequence analysis still requires the combination of multiple alignment methods and human pattern-recognition skills.

Acknowledgments

We would like to thank all the developers who provided their source code and assistance. We are grateful to Mark Boguski, John Kececioglu, George Gutman, and Jacques Perrault for constructive criticisms on the manuscript. Support for M.A.M. and T.K.V. was provided by NIH grant AI 28309. Support for W.M.F. was provided by NSF grant DEB-9096152.

LITERATURE CITED

ALTSCHUL, S. F., R. J. CARROLL, and D. J. LIPMAN. 1989. Weights for data related by a tree. J. Mol. Biol. 207:647-653.
BARTON, G. J., and M. J. E. STERNBERG. 1987a. Evaluation and improvements in the automatic alignment of protein sequences. Protein Eng. 1:89-94.
BARTON, G. J., and M. J. E. STERNBERG. 1987b. A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol. 198:327-337.
BASHFORD, D., C. CHOTHIA, and A. M. LESK. 1987. Determinants of a protein fold: unique features of the globin amino acid sequences. J. Mol. Evol. 196:199-216.
CARRILLO, H., and D. LIPMAN. 1988. The multiple sequence alignment problem in biology. SIAM J. Appl. Math. 48:1073-1082.
CHAN, S. C., A. K. C. WONG, and D. K. Y. CHIU. 1992. A survey of multiple sequence comparison methods. Bull. Math. Biol. 54:563-598.
DAVIES, J. F., Z. HOSTOMSKA, Z. HOSTOMSKY, S. R. JORDAN, and D. A. MATTHEWS. 1991. Crystal structure of the ribonuclease H domain of HIV-1 reverse transcriptase. Science 252:88-95.
DAYHOFF, M. O., R. M. SCHWARTZ, and B. C. ORCUTT. 1978. A model of evolutionary change in proteins. Pp. 345-352 in M. O. DAYHOFF, ed. Atlas of protein sequence and structure. National Biomedical Research Foundation, Washington, D.C.
DOOLITTLE, R. F., D.-F. FENG, M. S. JOHNSON, and M. A. MCCLURE. 1989. Origins and evolutionary relationships of retroviruses. Q. Rev. Biol. 64:1-30.
FENG, D.-F., and R. F. DOOLITTLE. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25:351-360.
FENG, D.-F., M. S. JOHNSON, and R. F. DOOLITTLE. 1985. Aligning amino acid sequences: comparison of commonly used methods. J. Mol. Evol. 21:112-125.
FITCH, W. M., and E. MARGOLIASH. 1967. Construction of phylogenetic trees. Science 155:279-284.
GUSFIELD, D. 1993. Efficient methods for multiple sequence alignment with guaranteed error bounds. Bull. Math. Biol. 55:141-154.
HANKS, S. K., and A. M. QUINN. 1991. Protein kinase catalytic domain sequence database: identification of conserved features of primary structure and classification of family members. Methods Enzymol. 200:39-81.
HIGGINS, D. G., A. J. BLEASBY, and R. FUCHS. 1992. CLUSTAL V: improved software for
multiple sequence alignment. Comput. Appl. Biosci. 8:189-191.
JOHNSON, M. S., M. A. MCCLURE, D.-F. FENG, J. GRAY, and R. F. DOOLITTLE. 1986. Computer analysis of retroviral pol genes: assignment of enzymatic functions. Proc. Natl. Acad. Sci. USA 83:7648-7652.
KARLIN, S., G. GHANDOUR, F. OST, S. TAVARE, and L. J. KORN. 1983. New approaches for computer analysis of nucleic acid sequences. Proc. Natl. Acad. Sci. USA 80:5660-5664.
KATAYANAGI, K., M. MIYAGAWA, M. MATSUSHIMA, M. ISHIKAWA, S. KANAYA, M. IKEHARA, T. MATSUZAKI, and K. MORIKAWA. 1990. Three-dimensional structure of ribonuclease H from E. coli. Nature 347:306-309.
KECECIOGLU, J. 1993. The maximum weight trace problem in multiple sequence alignment. Pp. 106-119 in A. APOSTOLICO, M. CROCHEMORE, Z. GALIL, and U. MANBER, eds. The 4th symposium on combinatorial pattern matching. Springer, Berlin.
KNIGHTON, D. R., J. ZHENG, L. F. TEN EYCK, V. A. ASHFORD, N.-H. XUONG, S. S. TAYLOR, and J. M. SOWADSKI. 1991. Crystal structure of the catalytic subunit of cyclic adenosine monophosphate-dependent protein kinase. Science 254:407-414.
LIPMAN, D. J., S. F. ALTSCHUL, and J. D. KECECIOGLU. 1989. A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA 86:4412-4415.
MCCLURE, M. A. 1992. Sequence analysis of eukaryotic retroid proteins. Math. Comput. Modeling Int. J. 16:121-136.
MCCLURE, M. A. 1993. Evolutionary history of reverse transcriptase. Pp. 425-444 in A. M. SKALKA and S. P. GOFF, eds. Reverse transcriptase. Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y.
MARTINEZ, H. M. 1988. A flexible multiple sequence alignment program. Nucleic Acids Res. 16:1683-1691.
MILLER, M., M. JASKOLSKI, J. K. MOHANA RAO, J. LEIS, and A. WLODAWER. 1989. Crystal structure of a retroviral protease proves relationship to aspartic protease family. Nature 337:576-579.
MYERS, E. W. 1991. An overview of sequence comparison algorithms in molecular biology. Tech. rep. TR 9192. University of Arizona, Tucson.
NEEDLEMAN, S. B., and C. D. WUNSCH. 1970. A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48:443-453.
PEARL, L. H., and W. R. TAYLOR. 1987. A
structural model for the retroviral proteases. Nature 329:351-354.
PEVZNER, P. 1993. Multiple alignment, communication cost, and graph matching. SIAM J. Appl. Math. 52:1763-1779.
RAO, J. K. M. 1987. New scoring matrix for amino acid residue exchanges based on residue characteristic physical parameters. Int. J. Pept. Protein Res. 29:276-281.
RISLER, J. L., M. O. DELORME, H. DELACROIX, and A. HENAUT. 1988. Amino acid substitutions in structurally related proteins: a pattern recognition approach: determination of a new and efficient scoring matrix. J. Mol. Biol. 204:1019-1029.
SCHULER, G. D., S. F. ALTSCHUL, and D. J. LIPMAN. 1991. A workbench for multiple alignment construction and analysis. Proteins Structure Function Genet. 9:180-190.
SMITH, R. F., and T. F. SMITH. 1990. Automatic generation of primary sequence patterns from sets of related protein sequences. Proc. Natl. Acad. Sci. USA 87:118-122.
SMITH, R. F., and T. F. SMITH. 1992. Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for use in comparative protein modeling. Protein Eng. 5:35-41.
SMITH, T. F., and M. S. WATERMAN. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195-197.
STATES, D. J., and M. S. BOGUSKI. 1990. Similarity and homology. Pp. 89-157 in M. GRIBSKOV and J. DEVEREUX, eds. Sequence analysis primer. W. H. Freeman, New York.
SUBBIAH, S., and S. C. HARRISON. 1989. A method for multiple sequence alignment with gaps. J. Mol. Biol. 209:539-548.
TANESE, N., and S. P. GOFF. 1988. Domain structure of the Moloney murine leukemia virus reverse transcriptase: mutational analysis and separate expression of the DNA polymerase and RNAase H activities. Proc. Natl. Acad. Sci. USA 85:1777-1781.
TANG, J., M. N. G. JAMES, I.-N. HSU, J. JENKINS, and T. BLUNDELL. 1978. Structural evidence for gene duplication in the evolution of acid proteases. Nature 271:618-621.
TAYLOR, W. R. 1986. Identification of protein sequence homology by consensus template alignment. J. Mol. Biol. 188:233-258.
TAYLOR, W. R. 1987. Multiple sequence alignment by a pairwise algorithm. Comput. Appl. Biosci. 3:81-87.
TAYLOR, W. R. 1988. A flexible method to align large numbers
of biological sequences. J. Mol. Evol. 28:161-169.
VINGRON, M., and P. ARGOS. 1991. Motif recognition and alignment for many sequences by comparison of dot-matrices. J. Mol. Biol. 218:33-43.
WATERMAN, M. S. 1986. Multiple sequence alignment by consensus. Nucleic Acids Res. 14:9095-9102.
WATERMAN, M. S., and R. JONES. 1990. Consensus methods for DNA and protein sequence alignment. Methods Enzymol. 183:221-237.
WATERMAN, M. S., and M. D. PERLWITZ. 1984. Line geometries for sequence comparison. Bull. Math. Biol. 46:567-577.
WILBUR, W. J., and D. J. LIPMAN. 1982. Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. USA 80:726-730.

STANLEY A. SAWYER, reviewing editor
Received August 16, 1993
Accepted January 5, 1994