Bioinformatics CS 5263
Popular in Course
Popular in ComputerScienence
verified elite notetaker
This 90 page Class Notes was uploaded by Mireya Heidenreich on Thursday October 29, 2015. The Class Notes belongs to CS 5263 at University of Texas at San Antonio taught by Staff in Fall. Since its upload, it has received 7 views. For similar materials see /class/231389/cs-5263-university-of-texas-at-san-antonio in ComputerScienence at University of Texas at San Antonio.
Reviews for Bioinformatics
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 10/29/15
085263 Bioinformatics Guest Lecture Part II Phylogenetics excluding viruses were bas d o ww 1 Up to now we have focused on nding similarities now we start focusing on differences clilorcnlaa dissimilarities leading iSlance 935 mitochondria Identifying sequences has been g 0 far ow w to arrange the sequences according to their ancestry e n rRNA prokaryotes and mitoc sequences for eukaryot Phylogenetic trees are a graphical representation ofthe nce between sequences or species Here we have the tree ofthe 3 groups of living organisms dista Until recently most phylogenies sequences for ial hondr es Terminology Phylogeny The evolutionary relationships among organisms based on a common ancestor Phylogenetics Area of research concerned with finding the genetic relationships between species Greek phylon race and genetic birth Terminology Phylogenetic tree Visual representation of evolutionary distances between species Miocene Pliocene Pleistocene Mmjms m Years 4 110 5 r Selma Present f I I I ii I m I I l I I 391 Black Bear Domes Dog Gray Wotquot Coyote Cane HIJnImg Dog BlackBacked Jackal Bush Dug Maned W13 Hoary Fox CrabFaring Fax Gray Fox BatEamd Fm Racooon Dog Cape 3th aInd Fox Zennec Fm Kit Fox Amer Fm quum a3tll l39lfllm n39l spine ueaiuawiv Limos splum 9aman Introduction to terminology for phylogeny lecture Speciation Gene Duplication Homologous Orthologs Paralogs Allopatric speciation populations are separated by a barrier After some time even if the barrier is removed the two populations can no longer form hybrids now different species a l ll39i39ljl h Luann quoti b l l 39rmga39 H l iiiIgjr Iquot C lquotlii H I It i Jljl q 60 a lt1 0 lt1 lerritrr ulm imn1 quot quot S f39 I 7 v t G E H r llll39 IIIw IlllL 39 Intnur are as a fin hybridu h n hyhrnl39lw X 3quot0 V Inf13911 m1 4 inall i imm H Liml 5 Nut hf With 4 m Minwan N hi39hml 2395 Iml TR lwlvmml H um Alluputrir Reinforcement Rapid ullupmriv worittlilm isrmtple liltg Sp i i lidmi swciuliun l1 illili tl l in allmmtn reinl nruwlmnl Sympatric speciation the population shares the environment mate selection effectively separates gene pools V V i Sexual Selection Gene duplication diagram The three bands are duplicated Duplicated area Before duplication After duplication Considerquot n i th peLie l 139 AL 39 39 39 1 andA2 The once u r 439 439 r 39 and Y quot 39 39 39 AZX and A1Z A2Z in species X and Z respectively Paralogous genes are derived 39om duplication such as A1 and A2 Orthologous genes are derived 39om speciation such as A1x A12 and A2x A22 Gene duplication Specimen Spams X Species 2 Genetic similarity among taXa should be estimated by comparing ortholo ous sequences A phylogeny should be computed to determine which similar sequences are orthologs homologs f A S orthogog paralogs orlhgllogs ARI f A 1quot AL 39 J 7 F gt 7 1mg 0 chlck x m 0115 e x m 0113 BB ehi cklj frogb xchain gent Bchain gene 39L yn ati21f V early globin gram axolotl myoglobin Orthologues Paralogues giant Panda lesser lt moose alpha zeta goshawk vulture alligator beta delta epsilon gamma haemoglobins Figure 71 Above a tree of urthaluguex based on a set of alpha haemoglobin Below a lree of paralagues the alpha beta gamma delta epsilon gem and hem chain of human haemaglobins and human myag obin The arthologues are the alpha haemaglobinx with SWISSPROT ldenti en HBAACCGE HBAJEGMO HBAAILFU HBAAILME HBAVALCAA HBA ALLMI HBAAMBME and HBA39ANAPL chosen because they were alphabetically the rst eight alpha glabin in PFAM Sanrthammen Eddy urbin 1997 httpgenomewuslleduPfam The paralogues are globinx with SWISSPROT identi ch HBATHUMAN HEAziHUMAN HBA HUMAN HBBHUMAN HBDHUMAN HBEHUMAN HBGHUMAN and MYG7HUMAN The WEEK were made by neighbourjoining Section 73 taxing L Felsensteht39x package PHYL v v A quotmun hintl The db tances usedfar neighbourjoining were the PAMbased ML dixtanccs tee p 228 detemtined by tlte pmgram PROTDIST in PHYLIP Definitions Classic phylogenetic analysis uses morphological features Anatomy size number of legs beak shape Modern phylogenetic analysis uses molecular information Genetic material DNA and protein sequences gt Molecular phylogenetic analysis Phylogenetic reconstruction Goal given a set of species genes reconstruct the tree which best explains their evolutionary history Phylogenetic reconstruction Nothing in evolution makes sense except in the light of phylogeny Joe Felsenstein A brief history of molecular phylogeny phylogenetic inference is old for Biology 4 5quot In 0 I mu Fquot I w Iquot u 2 I39 L quot 1 I I I I lig39 39 r x I I I I I I l I I I I I I I I 39 l I 1 XI I I I I I 1quotquot I l 1 t 1 Charles Darwin 4 Origin of Species 1859 Illustration of descent with modification IIAILxLL39s EVOLUTIth or UIh39 PEDIG REE OF MAN I g 3351 4 cm 1 531 9 A Z ApeLMIm W 23 2 233 3 2 I F A 1 5 V 71 E pcilcs lb I 39 I quot 39 Mudb isb quot ngmyncmn Iggy 39 3 I r gamma Skullless Amman eaN39III iris Tnniuu 1 SeaNellie I Atalantao u Mints rEggIIImaul Ian I In L Oxmaml gt Ernst Haeckel Tree of life Crocodile u 3956 Lizard f Eunice I lxuwn Eon Aimunis 3 l nsonu at l quot1 l Ti XI 39 Mammals blammulial VexL e bmma Verwbrata AnIrmIIa IMcLuwn Evermbmtn Invorkhrnle IntoBtu III 1891 Tracing evolutionary history phylogeny pattern and timing of evolutionary branching events evolutionary tree A B C D internal node common ancestor CA common ancestor common ancestor external node operational of A amp B of C amp D taxonomic unit OTU order of branches define the common ancestor relationShiPS tOPOIOQY of A B C D branch length defines the number of changes branching happened in the past common ancestors cannot be observed must infer from data Unrooted versus rooted phylogenies What are phylogenies good for 1 classification Systematics a scientific field devoted to classification of organisms Phenetics a classification scheme based on grouping populations according to similarities Cladistics a classification scheme based on evolutionary relationships phylogenies Monophyletic vs paraphyletic Monophyletic group Monophyletic group including all descendents of a common ancestor Paraphyletic group a set of species that includes a 000000000000000000 0 common ancestor and some but not all of its descendants Copyright 2004 Pearson Pren lllllllll c 6 v Paraphyletic groups l Dicots l l Prokaryotes l Oldest living Glades Bacteria Archaea 1amp1 00 Dicots o I I angiosperm related to 0 Outgroup clades magnolias Eudicots Eukarya l Fish l m 30 Ce g C 6 f a 9 quot lt9 99 5 9r 9 6 6m 2 Amphibians reptiles mammals Copyright 2004 Pearson Prentice Hall inc What are phylogenies good for 2 detecting coevolution Aphidbacteria M u t u a I i C C Aphids Symbiotic bacteria cospeciation M Iudovicianae U erigeronense I U caligatum U rurale 99 I U helianthicoa 87 U jaceicola U obscurum U rapuncuoidis 61 U sonchi U solidaginis 61 84 tree to node in aphiid tree U rudbeckiae 76 same U astronomus 94 Q consistent 54 99 78 I 99 U jaceae 85 39 I 94 Relationship of node In bacterial U aeneum I ALE 53 U ambrosiae contradicts Copyright 2004 Pearson Prentice Hall Inc What are phylogenies good for 3 origin of pathogens Black plague Pathogen Yersinia pests O t Branch lengths are proportional S I n S to the genetic distance between strains East Asian type Central Asian type African type Copyright 2004 Pearson Prentice Hall Inc What are phylogenies good for 4 Tree of life Animal Kingdom Inseam spiders Segmean Snaus clams 386 stars Venebmos warms Squld sand uHars ascimans V x r crustaceans V L n 1 CW atworms Roundworms JaMsm sea anemones jslllss K V A s s quot w gt gamamrsncmng s x mare denvedlmeagas s x Efy y mom ancvam lineages sponga 7 Latin branching Rrooting the tree To root a tree mentally imagine that the tree is made of string Grab the string at the root 0 and Root D tug on it until the ends of the string the taxa fall opposite the root A l Unrooted tree Rooted tree Note that In this rooted tree taxon A Is no more closely related to taxon B than Root it is to C or D OTUs n trees 01 Lil LA l J l KDCO Number of OTUs tips vs number of possible trees rooted n unrooted 2l5 n trees 2 5 2n 3 i3 1 1 l 3 3 15 15 105 105 954 954 10395 10395 135135 135135 2027025 3037025 34459425 true tree true evolutionary history is one of many possibilities difficult to infer true tree when OTUs is large inferred tree obtained using data and reconstruction method not necessarily the same as the true tree a hypothesis Counting Trees Unrooted trees 1 3 15 105 945 10935 135135 2 027 025 z358 x 1036 A B Y Taxa N c A C 3 4 5 B D 6 C 7 A D 8 9 10 B A C D gtirlt so 3 F 2N 5 unrooted trees for N taxa 2N 3 rooted trees for N taxa Reconstruct phylogenetic trees i Algor i i hm 4quot alighed 19 3 mgarail Jmmm i Methods of phylogenetic reconstruction 39 Distance based pairwise evolutionary distances computed for all taxa tree constructed using algorithm based on relationships between distances Maximum parsimony nucleotides or amino acids are considered as character states best phylogeny is chosen as the one that minimizes the number of changes between character states Maximum likelihood statistical method of phylogeny reconstruction explicit model for how data set generated nucleotide or amino acid substitution find topology that maximizes the probability of the data given the model and the parameter values estimated from data 2 Determine the evolutionary distances and build distance matrix For molecular data evolutionary distances can be the observed number of nucleotide differences between the pairs of species Distance matrix simply a table showing the evolutionary distances between all pairs of sequences in the dataset 2 Determine the evolutionary distances and build distance matrix A simple example using DNA sequences D 2 3 4 AGGCCATGAATTAAGAATAA AGCCCATGGATAAAGAGTAA AGGACATGAATTAAGAATAA AAGCCAAGAATTACGAATAA Distance Matrix 3 4 005 015 025 04 02 In this example the evolutionary distance is expressed as the number of nucleotide differences for each sequence pair For example sequences 1 and 2 are 20 nucleotides in length and have four differences corresponding to an0e2volutionary difference of 420 3 Phylogenetic Tree Construction example UPGMA algorithm UPMGA Michener amp Sokal 1957 D Bear Raccoon Weasel Seal 39J Bear 026 034 029 Raccoon 042 044 Weasel 044 Seal 1 Pick smallest entry DU Bear Raccoon 013ltgt013 2 Join the two intersecting species and assign branch lengths Dij2 to each of the nodes 3 Phylogenetic Tree Construction example UPGMA algorithm Dij Bear Raccoon Weasel Seal Bear Raccoon Bear 026 034 029 Raccoon 042 044 013 V13 Weasel 044 Seal 3 Compute new distances to the other species using arithmetic means D D 034o42 0mm W32 W 2 038 D D 029o44 om 532 SR 2 0365 3 Phylogenetic Tree Construction example UPGMA algorithm BR Weasel Seal BR 038 0365 Weasel 044 Seal 1 Pick smallest entry Dij 2 Bear Raccoon Seal 013 01825 01825 Join the two intersecting species and assign branch lengths Dij2 to each of the nodes 3 Phylogenetic Tree Construction example UPGMA algorithm D BR Weasel Seal J BR 038 0365 Weasel 044 Seal Bear Raccoon Seal 013 01825 01825 3 Compute new distances to the other species using arithmetic means DWBRS DWBDWRDWS 034042044 3 04 3 3 Phylogenetic Tree Construction example UPGMA algorithm Di BRS Wease39 Bear Raccoon Seal Weasel BRS 04 Weasel 1 Pick smallest entry Dij 2 Join the two intersecting species and assign branch lengths DijZ to each of the nodes 3 Done UPGMA clustering can be done using protein sequences Catematmn uf a bhybgenytrbrn rnbteeutar ansuns Cytuchrume e ebrnbansuns trbrn Ftteh and Marguhash Smence Wt 155 in Jan 196 The seteetee ebrnbansuns hav been arranged ranebrnty nu pamcmar urder UPGMA CLUSTERJNG H TFLTT DATA r abbheatbn uf UPGMA unwetghtee paw gruup methud usmg anthrnette averages etustenng The rn rs m the eeus shbw dt erences between the ytuchrume e rnbteeutes uf varmus SpEmES furexamp er there 5 unty 1 dtfference m the ammu and s between rnan and munkey butthere are 18 dt erences between rnan and tune mme Man Tuna Chmken Mum Munkey Dug A a c D E F G Tune A Man 3 Tune 0 ChmkenD nth E Munkey F Dug G Cytnchrnme c data frnm WM Fitch s E Mar 39 s The UPGMA method The UPGMA method is applied to the cytochrome c data sample At each cycle of the method the smallest entry is located and the entries intersecting at that cell are quotjoinedquot The height of the branch for this junction is onehalf the value of the smallest entry Thus since the smallest entry at the beginning is 1 between Bman and Fmonkey B and F are joined with branch heights of 05 102 Then the comparison matrix is reduced by combining cells These combinations are indicated with colors in the next slide For example the comparisons of A to B 190 and A to F 180 are consolidated as 185 190180l2 red cells while the comparisons of E to B 360 and E to F 3500 are consolidated as 355 3603502 blue cells The process is repeated on the reduced comparison matrix resulting in a smaller matrix with each cycle When the matrix is completely reduced the calculation is finished APPLICATION OF U39PGMA CLUSTERING METHOD ON SELECTED CY r DATA TO AI PUT ATE RELATIONS warm z JOIN Amara c JOIN nnaroc z 39 39 39 uu match and directs which entries are to be joined The height of the new branch is 39 39 39 39 39 entries same color on the right The final phylogeny calculated from tables It is in perfect accord with the fossil record showing fish ancestral to reptiles reptiles ancestral to mammals birds splitting from reptiles after the reptilemammal split and so forth The lengths of branches indicate time since last common ancestry for example moths and tuna 182 branch length separated long before turtles and chickens 40 branch length UP Ghi39i RE SUL T 3 LI CI 3933 3 I 1 LU II39I Ill C i an E m U3 CI 3 39 1 C D E E 1 i3 3 E What makes such calculations of phylogenies interesting is the fact that the results so often agree with evolutionary trees developed from other methods anatomy fossils or other proteins or genes Indeed molecular comparisons provide ample quotrepeat experimentsquot of the hypothesis of evolution Weakness of UPGMA UPGMA assumes a constant molecular clock ie accumulate mutations at the same rate All leaves in the same level Only constructs rooted trees Correct tree UPGMA Example morphologybased input DES i ance Based Algori i39hm Wn W mans a W Mikey fffgn 2 W 0 mt W hmgrimgr ma a mm M dam mrgu lgwsmwx coelocanth no salamander no frog no tu me yes h um 21 11 yes gecko yes snake yes all1gam 1 yes budgy yes yes yes The precemng matrix can be rep perch 000000000 coelocanth 000011 391 salamandero 1 1O 1 1 frog 011011 tume 110011 human 111111 gecko 110011 7 snake 100011 h all1gatur 110011 budgy 111111 no warm 78 000 000 000 000 000 110 110 101 101 rese nted I yes numerica11yfor convenience as39 Algorithm Stage 1 construct the distance matrix Distance between any species is defined by the of properties where they disagree out of total number of properties as 62 62 69 54 54 62 54 23 f 77 77 as 69 69 77 69 38 Simuamy 100 77 77 54 46 62 46 77 77 54 46 62 46 85 85 76 85 54 69 62 69 69 92 85 54 76 46 as budgy Algorithm Stage 2 cluster close neighbors Iteration 1 identify the two closest taxa from the distance matrix In our case one pair has zero distance salamander amp frog We join them together and update the distance matrix Updated matrix has only 9 species 8 old 1 new bucgy rch ii tuck The dimes pair have are the geakmsmke distance8j Algorithm Stage 2 cluster close neighbors Iteration 2 join the gecko and snake Add the new pair to the forest equally distributing their distance 8 Update the distance matrix One must calculate the similarity of each asyet ungrouped taxa to the two groups already formed above Also calculate the similarity of the two groups to each other 1 geco snake I salamander frog perch 85 69 54 54 23 75 69 coelocanth 85 69 69 38 75 69 85 as 54 85 84 69 69 a 74 69 75 84 64 64 65 E Wm smwaf m Mm sg s m h fy va ues 1W9ZWMM6M6 5 655 a The tie problem We possible xings we wad do with the resu l mg s m armes either come Wr e to Sa amander av Frog 0F conned Hum fro r Fmgq or create m san gqm iy op mali so u am and continuum for each Tie problem v a alligator perch gecko coelocanth snake maquot perch salamander coelucenth 09 salamander alligator fmg turtle turtle QECkO man snake If you breakties systematically according to the order of appearance in the matrix you will get the tree1 if you break ties randomly you make get the tree 2 Distance method 2 Neighborjoining NJ method uses star decomposition identification of neighbors that sequentially minimize the total length of the tree 1 start with star tree no topology 2 separate pair of OTUs from all others S total branch length of tree S12 total branch length of tree 8 7 1 6 2 4 3 3 choose pair of OTUs that minimizes total branch lengths in the tree 4 this pair collapsed as single OTU and distance matrix recalculated 5 next pair of OTUs that gives smallest branch length is chosen 6 iterate until complete The 1star Sum of the Branch Lengths N 1 N SHELFEEDU iltj TKN D D and L as the distance between OTUs and the branch length between nodes Each branch is counted N1 times when all distances are added The paired2 star Tree Size 1 2N 2 Ax LZX D12 8 N 1 1 Zh N aZPy 5 N S12LXYL1XL2X2LL39Y 2 4 i 3 LXY 5 D11 Dzk N 2L1X LZX Z LIY 1 N b N 1 1 D D D 2 1k 2k 2 12 N 2 J 3siltj Neighborjoining example A Cycle 1 Neighbors 2 OTU 01 u 1 2 3 4 5 s 7 2 3667 4 3 31433 3333 4 00 39 3307 s 40 33 4033 4000 39 67 5 4033 4033 4000 3967 7 4017 4017 3933 3950 3383 3333 x 4017 4017 3933 3950 31183 314113 3707 3 Cycle 2 Neighbors 5 6 ow on 12 3 4 5 0 7 3 3150 4 3230 3230 5 3390 3390 337 0 3390 3330 3370 31304 7 3370 3370 3350 3310 3310 8 3370 3370 3350 3310 3310 3190 NeighborJoining Complexity The method performs a search using time On2 and using time On2 to update distance matrix Giving a total time complexity of On3and a space complexity of On2 Neighborjoining method Extremely fast and efficient method widely used Tends to perform fairly well in simulation studies May produce tie trees from data set but this appears to be rare Algorithm is greedy and so can get stuck in local optima Main criticism is that it produces only one tree and does not give any idea of how many other trees are equally well or almost as supported by the data Maximum Parsimony MP What is parsimony A criterion for selecting among alternative patterns based on minimizing the total amount of evolutionary change Ancestral characters are inferred for each site and the total number of changes between nodes for a given topology are determined Best topology is the one that requires the fewest number of residue changes between nodes across all sites Counting substitutions on a tree 0 For an alignment site and a topology ancestral residues are inferred so that the minimum number of residue Changes between nodes is required A C A A C QIG A C Unambiguous 1 Ambiguous 2 Choosing the Sites Tree 4 steps best tree Tree 5 steps Tree III 6 steps OTU123456789lOSite shortest tree 1 T c G A c A G 2 T T G A C A G WIth parSImony 3 T T G A C A G 4 T T T A G A C 3 6 8 1A SC 1T 3T 1T 3G a I e a a W 2A 40 2A 4A 2T 4G 1A 2A 1T 2A 1T 2T 3 e o I e W SC 40 ST 4A 3G 4G 1T 2T 3Gf f4G Lecture 7 Page 5 Advantages and disadvantages of parsimony Advantages based on a logically coherent and biologically plausible model of evolution free from assumptions used in distance estimations better than distance methods when extent of sequence divergence is low 10 rate of substitution is constant number of residues is large very useful for certain types of molecular data eg indels Disadvantages gives incorrect topologies when backward substitutions are present common with nucleotides and when the number of sites is fairly small gives incorrect topologies when rate of substitution varies substantially across lineages long branch attraction long branches and short branches tend to group together on reconstructed tree difficult to treat the results in a statistical framework Maximum Likelihood Statistical probabilistic method for inferring phylogenies substitution model is chosen for sequence data alignment likelihood of observing the sequence data given the substitution model is obtained for each topology evaluated parameter fitting on branch lengths Probability of each tree is product of mutation rates in each branch Likelihoods given by each column multiplied to give the likelihood of the tree topology that gives the highest likelihood is chosen as the best tree Maximum Likelihood Extremely slow method so heuristic methods almost always have to be employed to search for best tree Method very dependent on model of substitution used Method estimates branch lengths not topology so may give wrong topology Assessing significance of Tree need some way to assess the support for the topologies evolutionary relationships of reconstructed phylogenies 1 3 or 24 bootstrapping resample alignment and construct trees from resampled data Bootstrap test Felsenstein 1985 assess the support for individual interior branches resample alignment columns with replacement testing the signal noise ratio in the data homoplasies repeat many times 100 1000 and get consensus tree O T U awn l 2 3 4 5 6 7 original alignment Site 8 9 10 OTU TCAGATCTAG 1 TTAGAACTAG TTCGATCGAG T T C T A A G G A C h LL 3 1 gt samped alignment f 1273457 9170 lCATAGCATT TAAAGCAAT TCTAGCATT TCAACGAAT lt 1te 3 W 3 3 y Interpreting bootstrap results 71 the percentage of trees built from resampled alignments that included the interior branch in question bootstrap values are said to support that interior branch the interior nodes adjacent terminal to the branch Rule of the thumb gt70 considered good evidence for proper placement of branches lt50 uncertain unresolved branching pattern polytomy Comparison of Methods Neighborjoining Maximum parsimony Maximum likelihood Uses only pairwise distances Uses only shared derived characters Uses all data Minimizes distance Minimizes total Maximizes tree likelihood between nearest distance given specific parameter neighbors values Very fast Slow Very slow Easily trapped in local Assumptions fail when Highly dependent on optima evolution is rapid Long branch attraction assumed evolution model Good for generating tentative tree or choosing among multiple trees or working on largescale data sets Best option when tractable lt30 taxa Good for very small data sets and for testing trees built using other methods Which Method to Choose depends upon the sequences that are being compared strong sequence similarity maXImum parSImony clearly recognizable sequence similarity distance methods All others maximum likelihood Best to choose at least two approaches Compare the results if they are similar you can have more confidence Which programs to use Distance method MEGA Maximum Parsimony method PAUP MacCIade Maximum Likelihood method PHYUP PAML ularEvolul n ry no Vxew H taw Eaahwavks 1m dew x 13 WNW o Gamma smegmaahnes E1 unmem a hemmed a wmmgm am a w mawmm a WWW MEGA Mue uar Evoluuanary GenemsAnulysis New Features my and csv Output 7 Update Nutmatmn KachIRo TAMURA mEL DUDLEY Interface Impmvements MASATUSHI NEI SUDHIR KUMAR Acknnwh dgemenls 1 u MEGA 4 Molecular Evolutionary Genetics Analysis mm m mtegnted mm nductmg zutumztm m mmzuequem wgnmem Kumar 5 DudleyJ Nev M a mfemng phylagenetm trees mmmg Wehrhased datahaxex estvmatmg rate w malecular hinlngislrcenlric xn wire hr and mmquot sequences Bria nn in Hininlmma cs 9 Wmdaw 9595 NT 2mm xv and vm Phylogenetics and forensic evidence Science 1998 Oct 3I2325391851353 M m We a Louisiana doctor accused of injecting victim with HIV I Baworgadswdm compaes HIV stram analysu debuts 1n murder mal sequences of Victims HIV amp Doctor s patient HIV amp local control I l 3 gel strains Victim amp patient strains more closely related to each other than controls monophyletic Victims HIV sequences were a subset of the doctor s patient s sequences Doctor guilty of attempted murder Phylogenetics and forensic evidence Balance 1992 May222555u6n 1165771 peer MIMEIM39IEUWIE FWFH Ermi aniur Molecular epidemiology of HIV transmission in a dental practice OE Ogiiesielsld CA Myers G Baudea CI Lun CC Km39her BT Mullins J1 Schnclietman G Bel kehnan RL Etmmmuu The Florida dentist 39 In 1986 dentist tested positive for HIV later AIDS 39 Continued general dentistry for 2 years 39 Patient with no risk factors discovered she had HIV 39 Dentist publicly urges other patients to take test 10 of 1100 patients are HIV Did the dentist pass HIV to his patients Phylogenetics and forensic evidence Forensics Transmission of HIV by Florida dentist Phylogenetic tree of HIV sequences from the DENTIST his Patients amp Local HIVinfected People Yes The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist Local control 2 Local control 3 patient F No L al mr l 9 From Ou et al 1992 and Page amp Holmes Local control 35 1993 redrawn by arcBeth Stewart Local control 3 Patient D 4 No Bayesian and GA software BEAST Bayesian Evolutionary Analysis Sampling Trees bayesian MCMC MrBayes bayesian MCMC and MCMCMC Phycas bayesian for DNA seqs python GARLI Genetic Algorithm for Rapid Likelihood Inference uses a stochastic genetic algorithmlike approach Computational analogue of evolution by natural selection not actually genetic algorithm Software to evaluate trees Readseq is a program that edits sequences into 18 different formats AWTY are we there yet is used to calculate whether MCMC has run long enough Tracer is similarly used to analyze MCMC based program runs FigTree is used to edit trees for publication And so much more Probabilistic Methods The phylogenetic tree represents a generative probabilistic model like HMMs for the observed sequences Background probabilities qa Mutation probabilities Pab t Models for evolutionary mutations Jukes Cantor Kimura 2 parameter model Such models are used to derive the probabilities Jukes Cantor model A model for mutation rates I Mutation occurs at a constant rate Each nucleotide is equally likely to mutate 0c into any other nucleotide with rate a 06 4 agtltx 4 OC Alt Q Kimura 2parameter model Allows a different rate for transitions and transversions m Optimal Tree Search Perform search over possible topologies T1 T2 T3 Parameter space Parametric optimization EM a Local Maxima T4 7397 Computational Problem Such procedures are computationally expensive Computation of optimal parameters per candidate requires nontrivial optimization step Spend nonnegligible computation on a candidate even if it is a low scoring one In practice such learning procedures can only consider small sets of candidate structures Current status of phylogenetic analysis Bayesian approaches widely implemented Maximum likelihood remains goldstandard Novel genetic algorithms also currently implemented but not yet widely tested Large datasets still very computationally expensnve Very few reiterative methods where phylogeny directs alignments and viceversa Still difficult for biologists to evaluate results of different algorithms Useful links IUPAC codes httpwwwbioinformaticsorqsmsiupachtml Molecular Evolution Course website httpwwwmolecularevolutionorq Tree of Life web project httptolweborotree httpbioinfoserverrsbsanueduauproqramsindexphp Introduction to evolution httpevolutionberkelevedu 346 2001 Schattauer GmbH What is Bioinformatics A Proposed De nition and Overview of the Field N M Luscombe D Greenbaum M Gerstein Department of Molecular Biophysics and Biochemistry Yale University New Haven USA Summary Background The recent ood of data from genome sequences and functional genomics has given rise to new field bioinformatics which combines elements of biology and computer science Oly39ecrives Here we propose a de nition for this new field and review some of the research that is being pursued particularly in relation to transcriptional regulatory systems Methods Our definition is as follows Bioinformatics is conceptualizing biology in terms of macromolecules in the sense of physicalchemistry and then applying informatics techniques derived from disciplines such as applied maths computer science and statisr tics to understand and organize the information associated with these molecules on a largerscale Results and Conclusions Analyses in bi oi nformatics predominantly focus on three types of large datasets available in molecular biology macromolecular struc tures genome sequences and the results of function al genomics experiments eg expression data Additional information includes the text of scientific papers and relationship data from metabolic path ways taxonomy trees and proteinrprotein interaction networks Bioinformatics employs a wide range of computational techniques including sequence and structural alignment database design and data mining macromolecular geometry phylogenetic tree construction prediction of protein structure and function gene nding and expression data clustering The emphasis is on approaches integrating a variety of computational methods and heterogeneous data sources Finally bioinformatics is a practical discipline We survey some representative applications such as nding homologues designing drugs and performing largerscale censuses Additional information pertinent to the review is available over the web at httpbioinfombbyaleeduwhatisrit Keywords Bioinformatics Genomics Introduction Transcription Regulation Method Inform Med 2001 40 346 58 Method Inform Med 42001 1 Introduction Biological data are being produced at a phenomenal rate 1 For example as of April 2001 the GenBank repository of nucleic acid sequences contained 11546000 entries 2 and the SWISS PROT database of protein sequences con tained 95320 3 On average these databae ses are doubling in size every 15 months 2 In addition since the publication of the H in uenzae genome 4 complete sequences for nearly 300 organisms have been released ranging from 450 genes to over 100000 Add to this the data from the myriad of related projects that study gene expression determine the protein structue res encoded by the genes and detail how these products interact with one another and we can begin to imagine the enormous quantity and variety of information that is being produced As a result of this surge in data compue ters have become indispensable to biologi cal research Such an approach is ideal because of the ease with which computers can handle large quantities of data and probe the complex dynamics observed in nature Bioinformatics the subject of the current review is often de ned as the appli cation of computational techniques to understand and organise the information associated with biological macromolecules Fig 1 shows that the number of papers Updated version of an invited review paper that appeared in lIaux R Kulikowski C eds 2001 IMIA Yearbook of Medical Informatics 2001 Digital Libraries and Medicine pp 83799 Stuttgart Schattauer related to this new eld has been surging and now comprise almost 2 of the annual total of papers in PubMed This unexpected union between the two subjects is attributed to the fact that life itself is an information technology an organism s physiology is largely deter mined by its genes which at its most basic can be viewed as digital information At the same time there have been major advances in the technologies that supply the initial data Anthony Kervalage of Celera recent ly cited that an experimental laboratory can produce over 100 gigabytes of data a day with ease 5 This incredible pro cessing power has been matched by devele opments in computer technology the most important areas of improvements have been in the CPU disk storage and Internet allowing faster computations better data storage and revolutionalised the methods for accessing and exchanging data 11 Aims of Bioinformatics In general the aims of bioinformatics are threeefold First at its simplest bioinfore matics organises data in a way that allows researchers to access existing information and to submit new entries as they are produced eg the Protein Data Bank for 3D macromolecular structures 6 7While dataecuration is an essential task the in formation stored in these databases is essentially useless until analysed Thus the purpose of bioinformatics extends much further The second aim is to develop tools and resources that aid in the analysis of data For example having sequenced a par ticular protein it is of interest to compare it with previously characterised sequences m Mmmln kym r mm m w mmmmmmmmmmmmmmnummmwmm m wwwwgmmm m mmm mum Mmmmmw ms med mum Jus an39mk m bud 94 and vlwumsuck xYASTA N and m ms M Imammderwka m mm mm 91min quotm amp s mam a m m 0on xwa m mm m 2m mm mm m my mum morpmwn m a mam Z Ihe INIDIWIAHOM associated wllh ha Nhlambs Tahk 1 1m m we Manda in mm m mmm and m muem39 mun m M con de m m m m a m am am Mm e mm A r mum m u hummus and m mus Baum n Mg m it WWW Mmmmg m n swank m Hymnme a m and enmmmm m m mmquot ind as Bum m mmmmw m Auknmkxdmymnmwuent s mnvnmu gums a m m rd km A pmum mmmnnm Nuenm m a en m m Bmmhmvanm a Mnmnn mum tn 7 mm mm mam m tam is an m xenseDfPM Magma and um mm m Mm M mquot dm was m mm m bummer mama mm m undershan and nuns 0 mmquot mm m m nnkuk on mmm In mm 5 3 mm mm w mm 7 m mm mm MM mam um um 348 Luscombe Greenbaum Gerstein Table 1 Sources of data used in bioinformatics the quantity of each type of data that is currently April 2001 available and bioinformatics subject areas that utilize this data 16 million 3 billion bases each largest 20 time point measurements for Gene expression Data source Data size Bioinformatics topics Raw DNA sequence 115 million sequences Separating coding and noncoding regions 125 billion bases Identi cation of introns and exons Gene product prediction Forensic analysis Protein sequence 400000 sequences Sequence comparison algorithms 3 00 amino acids Multiple sequence alignments algorithms each Identi cation of conserved sequence motifs Macromolecular 15000 structures Secondary tertiary structure prediction structure l000 atomic 3D structural alignment algorithms coordinates each Protein geometry measurements Surface and volume shape calculations Intermolecular interactions Molecular simulations force eld calculations molecular movements docking predictions Genomes 300 complete genomes Characterisation of repeats Structural assignments to genes Phylogenetic analysis Genomicscale censuses characterisation of protein content metabolic pathways Linkage analysis relating speci c genes to diseases Correlating expression patterns Mapping expression data to sequence structural and biochemical data 6000 genes in yeast Other data Literature 11 million citations Metabolic pathways Digital libraries for automated bibliographical searches Knowledge databases of data from literature Pathway simulations More recent sources of data have been from functional genomics experiments of which the most common are gene expres sion studies We can now determine expres sion levels of almost every gene in a given cell on a whole genome level however there is currently no central repository for this data and public availability is limited These experiments measure the amount of mRNA that is produced by the cell 14 18 under different environmental conditions different stages of the cell cycle and differ ent cell types in multi cellular organisms Much of the effort has so far focused on the yeast 19 24 and human genomes 25 26 One of the largest dataset for yeast has made approximately 20 time point meas urements for 6000 genes 19 However there is potential for much greater quan IVIethod Inform Med 42001 tities of data when experiments are con ducted for larger organisms and at more time points Further genomic scale data include biochemical information on metabolic pathways regulatory networks protein protein interaction data from two hybrid experiments and systematic knockouts of individual genes to test the viability of an organism What is apparent from this list is the diversity in the size and complexity of dif ferent datasets There are invariably more sequence based data than others because of the relative ease with which they can be produced This is partly related to the great er complexity and information content of individual structures or gene expression experiments compared to individual se quences While more biological informa tion can be derived from a single structure than a protein sequence the lack of depth in the latter is compensated by analysing larger quantities of data 3 ORGANISE the Infor mation on a LARGE SCALE 31 Redundancy and Multiplicity of Data A concept that underpins most research methods in bioinformatics is that much of the data can be grouped together based on biologically meaningful similarities For example sequence segments are often repeated at different positions of genomic DNA 27 Genes can be clustered into those with particular functions eg enzy matic actions or according to the meta bolic pathway to which they belong 28 although here single genes may actually possess several functions 29 Going further distinct proteins frequently have comparable sequences organisms often have multiple copies of a particular gene through duplication and different species have equivalent or similar proteins that were inherited when they diverged from each other in evolution At a structural level we predict there to be a finite number of different tertiary structures estimates range between 1000 and 10000 folds 30 31 and proteins adopt equivalent structures even when they differ greatly in sequence 32 As a result although the number of structures in the PDB has increased exponentially the rate of discov ery of novel folds has actually decreased There are common terms to describe the relationship between pairs of proteins or the genes from which they are derived analogous proteins have related folds but unrelated sequences while homologous proteins are both sequentially and structu rally similar The two categories can some times be difficult to distinguish especially if the relationship between the two proteins is remote 33 34 Among homologues it is useful to distinguish between orthologues proteins in different species that have evolv ed from a common ancestral gene and paralogues proteins that are related by gene duplication within a genome 35 Normally orthologues retain the same function while paralogues evolve distinct but related functions 36 An important concept that arises from these observations is that of a nite parts listquot for different organisms 3739 an inventory of proteins contained within an organism arranged according to different properties such as gene sequence protein fold or function Taking protein folds as an example we mentioned that with a few exceptions the tertiary structures of pro teins adopt one of a limited repertoire of folds As the number of different fold families is considerably smaller than the number of genes categorising the proteins by fold provides a substantial simpli cation of the contents of a genome Similar sime pli cations can be provided by other attrie butes such as protein function As such we expect this notion of a finite parts list to become increasingly common in future genomic analyses Clearly an essential aspect of managing this large volume of data lies in developing methods for assessing similarities between different biomolecules and identifying those that are related There are welledocue mented classi cations for all of the main types of data we described earlier Al though detailed descriptions of these clas si cation systems are beyond the scope of the current review they are of great impor tance as they ease comparisons between genomes and their products Links to the major databases are available from our supplementary website 32 Data Integration The most pro table research in bioinfore matics often results from integrating mule tiple sources of data 40 For instance the 3D coordinates of a protein are more useful if combined with data about the protein s function occurrence in different genomes and interactions with other molecules In this way individual pieces of information are put in context with respect to other data Unfortunately it is not always straightforward to access and cross reference these sources of information be cause of differences in nomenclature and le formats At a basic level this problem is fre quently addressed by providing external links to other databases For example in PDBsum webpages for individual struc tures direct the user towards corresponding entries in the PDB NDB CATH SCOP and SWISSAPROT databases At a more advanced level there have been efforts to integrate access across several data sources One is the Sequence Retrieval System SRS 41 which allows ate le databases to be indexed to each other this allows the user to retrieve link and access entries from nucleic acid protein sequence protein motif protein structure and bibliographic databases Another is the Entrez facility 42 which provides similar gateways to DNA and protein sequences genome mapping data 3D macromolecular structue res and the PubMed bibliographic database 43 A search for a particular gene in either database will allow smooth transitions to the genome it comes from the protein sequence it encodes its structure biblioe graphic reference and equivalent entries for all related genes In our own group we have developed the SPINE 44 and PartsList 39 web resources these databases intee grate many types of experimental data and rganise them using the concept of the nite parts listquot we described above 4 UNDERSTAND and Organise the Information Having examined the data we can discuss the types of analyses that are conducted As shown in Table 1 the broad subject areas in bioinformatics can be separated according to the type of information that is used For raw DNA sequences investigations involve separating coding and nonecoding regions and identi cation of introns exons and promoter regions for annotating genomic DNA 45 46 For protein sequences anae lyses include developing algorithms for sequence comparisons 47 methods for 349 What is Bioinformaticx producing multiple sequence alignments 48 and searching for functional domains from conserved sequence motifs in such alignments Investigations of structural data include prediction of secondary and tertiary protein structures producing methods for 3D structural alignments 49 50 examining protein geometries using distance and angular measurements calcu lations of surface and volume shapes and analysis of protein interactions with other subunits DNA RNA and smaller mole cules These studies have lead to molecular simulation topics in which structural data are used to calculate the energetics in volved in stabilising macromolecular struc tures simulating movements within macro molecules and computing the energies involved in molecular docking The increae sing availability of annotated genomic sequences has resulted in the introduction of computational genomics and proteomics A largeescale analyses of complete genomes and the proteins that they encode Re search includes characterisation of protein content and metabolic pathways between different genomes identi cation of interace ting proteins assignment and prediction of gene products and largeescale analyses of gene expression levels Some of these re search topics will be demonstrated in our example analysis of transcription regula tory systems Other subject areas we have included in Table 1 are development of digital libraries for automated bibliographical searches knowledge bases of biological information from the literature DNA analysis methods in forensics prediction of nucleic acid struc tures metabolic pathway simulations and linkage analysis A linking speci c genes to different disease traits In addition to nding relationships be tween different proteins much of bioine formatics involves the analysis of one type of data to infer and understand the obser vations for another type of data An exam ple is the use of sequence and structural data to predict the secondary and tertiary structures of new protein sequences 51 These methods especially the former are often based on statistical rules derived from structures such as the propensity for certain amino acid sequences to produce Method Inform Med 42001 m mm mm m can g u WWW umeme mm swam Anwd narssa 5m manna W hutmun m m m mum mummmww mm rahnay m m m M m mm a mm mm mm mm 5 37 mmw Mmm m m mm mm mum1 mm m m m mumgmmmmmmw mmmmnwmm m 7 wwwmwntmmmm mmwm WWW Hg 2 army12m manor1mqu am am mm mm mm mam um mm mm m Mm Mmusmm was m m Mm dew and mam m 55 5 mvmemad w m vemnl as m a 5m and uuvhns a pm auva m m mum druv demnv ssTkamxml M memnwnrmannxpenmm we W mvmm mm y mm m m M M m s m undershndmv a we used m m Mm a nuansm mm en 5mm A avene and Magma Mauddi m Swuenm M an em a mum mkmmmmm Wm smemmmmmm mm M m an mm a W Wmaxammmummm late the structure adopted by the protein Geometry calculations can de ne the shape of the protein s surface and molecu7 lar simulations can determine the force elds surrounding the molecule Finally using docking algorithms one could identify or design ligands that may bind the protein paving the way for designing a drug that speci cally alters the protein s function In practise the intermediate steps are still dif cult to achieve accurately and they are best combined with experimental methods to obtain some of the data for example characterising the structure of the protein of interest The aim of the second dimension the breadth in biological analysis is to compare a gene or gene product with others Inie tially simple algorithms can be used to compare the sequences and structures of a pair of related proteins With a larger num7 ber of proteins improved algorithms can be used to produce multiple alignments and extract sequence patterns or structural templates that de ne a family of proteins Using this data it is also possible to con struct phylogenetic trees to trace the evolue tionary path of proteins Finally with even more data the information must be stored in largeescale databases Comparisons become more complex requiring multiple scoring schemes and we are able to con duct genomic scale censuses that provide comprehensive statistical accounts of protein features such as the abundance of particular structures or functions in diffee rent genomes It also allows us to build phylogenetic trees that trace the evolution of whole organisms 5 applying INFORMATICS TECHNIQUES The distinct subject areas we mention require different types of informatics tech niques Brie y for data organisation the rst biological databases were simple at les However with the increasing amount of information relational database methods with Webepage interfaces have become increasingly popular In sequence analysis techniques include string compari son methods such as text search and one dimensional alignment algorithms Motif and pattern identi cation for multiple sequences depend on machine learning clustering and dataemining techniques 3D structural analysis techniques include Eu clidean geometry calculations combined with basic application of physical chemise try graphical representations of surfaces and volumes and structural comparison and 3D matching methods For molecular simulations Newtonian mechanics quan7 tum mechanics molecular mechanics and electrostatic calculations are applied In many of these areas the computational methods must be combined with good statistical analyses in order to provide an objective measure for the signi cance of the results 6 Transcription Regulation a Case Study in Bioinformatics DNAebinding proteins have a central role in all aspects of genetic activity within an organism participating in processes such as transcription packaging rearrangement replication and repair In this section we focus on the studies that have contributed to our understanding of transcription regulation in different organisms Through this example we demonstrate how bio informatics has been used to increase our knowledge of biological systems and also illustrate the practical applications of the different subject areas that were brie y outlined earlier We start by considering structural analyses of how DNAebinding proteins recognise particular base se quences Later we review several genomic studies that have characterised the nature of transcription factors in different orgae nisms and the methods that have been used to identify regulatory binding sites in the upstream regions Finally we provide an overview of gene expression analyses that have been recently conducted and suggest future uses of transcription regulatory anae lyses to rationalise the observations made in gene expression experiments All the results that we describe have been found through computational studies 351 What is Bioinformatics 61 Structural Studies As oprril 2001 there were 379 structures of proteineDNA complexes in the PDB Analyses of these structures have provided valuable insight into the stereochemical principles of binding including how par ticular base sequences are recognized and how the DNA structure is quite often modi ed on binding A structural taxonomy of DNAebinding proteins similar to that presented in SCOP and CATH was rst proposed by Harrison 56 and periodically updated to accome modate new structures as they are solved 57 The classi cation consists of a twoetier system the rst level collects proteins into eight groups that share gross structural features for DNAebinding and the second comprises 54 families of proteins that are structurally homologous to each other Assembly of such a system simpli es the comparison of different binding methods it highlights the diversity of proteineDNA complex geometries found in nature but also underlines the importance of inter actions between aehelices and the DNA major groove the main mode of binding in over half the protein families While the number of structures represented in the PDB does not necessarily re ect the rela7 tive importance of the different proteins in the cell it is clear that helixeturnehelix zincecoordinating and leucine zipper motifs are used repeatedly These provide compact frameworks to present the cohelix on the surfaces of structurally diverse proteins At a gross level it is possible to highlight the differences between transcription factor domains that just bind DNA and those involved in catalysis 58 Although there are exceptions the former typically approach the DNA from a single face and slot into the grooves to interact with base edges The latter commonly envelope the substrate using complex networks of secondary structures and loops Focusing on proteins with aehelices the structures show many variations both in amino acid sequences and detailed geoe metry They have clearly evolved indepen7 dently in accordance with the requirements of the context in which they are found While achieving a close t between the Method Inform Med 42001 352 Luxcombe Greenbaum Gemein cohelix and major groove there is enough exibility to allow both the protein and DNA to adopt distinct conformations However several studies that analysed the binding geometries of aehelices demon strated that most adopt fairly uniform cone formations regardless of protein family They are commonly inserted in the major groove sideways with their lengthwise axis roughly parallel to the slope outlined by the DNA backbone Most start with the Neterminus in the groove and extend out completing two to three turns within contacting distance of the nucleic acid 59 Given the similar binding orientations it is surprising to nd that the interactions between each amino acid position along the aehelices and nucleotides on the DNA vary considerably between different pro tein families However by classifying the amino acids according to the sizes of their side chains we are able to rationalise the different interactions patterns The rules of interactions are based on the simple pref mise that for a given residue position on aehelices in similar conformations small amino acids interact with nucleotides that are close in distance and large amino acids with those that are further 60 61 Equievae lent studies for binding by other structural motifs like Behairpins have also been con ducted 62 When considering these interactions it is important to remember that different regions of the protein surface also provide interfaces with the DNA This brings us to look at the atomic level interactions between individual amino acidebase pairs Such analyses are based on the premise that a signi cant proportion of speci c DNAebinding could be rationalised by a universal code of recognition between amino acids and bases ie whether certain protein residues preferably interact with particular nucleotides regardless of the type of proteineDNA complex 63 Studies have considered hydrogen bonds van der Waals contacts and wateremediated bonds 64766 Results showed that about 23 of all interactions are with the DNA back bone and that their main role is one of sequenceeindependent stabilisation In contrast interactions with bases display some strong preferences including the Method Inform Med 42001 interactions of arginine or lysine with guanine asparagine or glutamine with adenine and threonine with thymine Such preferences were explained through exami7 nation of the stereochemistry of the amino acid side chains and base edges Also highlighted were more complex types of interactions where single amino acids contact more than one baseestep simulta7 neously thus recognising a short DNA sequence These results suggested that universal speci city one that is observed across all proteineDNA complexes indeed exists However many interactions that are normally considered to be nonspeci c such as those with the DNA backbone can also provide speci city depending on the context in which they are made Armed with an understanding of protein structure DNAebinding motifs and side chain stereochemistry a major applica7 tion has been the prediction of binding either by proteins known to contain a partie cular motif or those with structures solved in the uncomplexed form Most common are predictions for cohelixemajor groove interactions A given the amino acid se quence what DNA sequence would it recognise 61 67 In a different approach molecular simulation techniques have been used to dock whole proteins and DNAs on the basis of forcee eld calculations around the two molecules 68 69 The reason that both methods have been met with limited success is because even for apparently simple cases like a helixebinding there are many other factors that must be considered Comparisons between bound and unbound nucleic acid structures show that DNAebending is a common feature of complexes formed with transcription factors 58 70 This and other factors such as electrostatic and cation mediated interactions assist indirect recognition of the nucleotide sequence although they are not well understood yet Therefore it is now clear that detailed rules for speci c DNAebinding will be family speci c but with underlying trends such as the arginineeguanine interactions 62 Genomic Studies Due to the wealth of biochemical data that are available genomic studies in bioine formatics have concentrated on model organisms and the analysis of regulatory systems has been no exception Identification of transcription factors in genomes invari ably depends on similarity search strate7 gies which assume a functional and evolue tionary relationship between homologous proteins In E Coli studies have so far estimated a total of 300 to 500 transcription regulators 71 and PEDANT 72 a data base of automatically assigned gene functe ions shows that typically 273 of pro karyotic and 677 of eukaryotic genomes comprise DNAebinding proteins As assign ments were only complete for 40760 of genomes as of August 2000 these gures most likely underestimate the actual num7 ber Nonetheless they already represent a large quantity of proteins and it is clear that there are more transcription regulators in eukaryotes than other species This is unsurprising considering the organisms have developed a relatively sophisticated transcription mechanism rom the conclusions of the structural studies the best strategy for characterising DNAebinding of the putative transcription factors in each genome is to group them by homology and to analyse the individual families Such classi cations are provided in the secondary sequence databases described earlier and also those that specialise in regulatory proteins such as RegulonDB 73 and TRANSFAC 74 Of even greater use is the provision of structural assignments to the proteins given a transcription factor it is helpful to know the structural motif that it uses for binding therefore providing us with a better understanding of how it recognises the target sequence Structural genomics through bioinformatics assigns structures to the protein products of genomes by demonstrating similarity to proteins of known structure 75 These studies have shown that prokaryotic transcription fac tors most frequently contain helixeturne helix motifs 71 76 and eukaryotic factors contain homeodomain type helixeturne helix zinc nger or leucine zipper motifs From the protein classifications in each genome it is clear that different types of regulatory proteins differ in abundance and families signi cantly differ in size A study by Huynen and van Nimwegen 77 has shown that members of a single family have similar functions but as the requirements of this function vary over time so does the presence of each gene family in the enome Most recently using a combination of sequence and structural data we examined the conservation of amino acid sequences between related DNAebinding proteins and the effect that mutations have on DNA sequence recognition The structural families described above were expanded to include proteins that are related by sequence similarity but whose structures remain unsolved Again members of the same family are homologous and probably derive from a common ancestor Amino acid conservations were calculate ed for the multiple sequence alignments of each family 78 Generally alignment positions that interact with the DNA are better conserved than the rest of the pro tein surface although the detailed patterns of conservation are quite complex Residues that contact the DNA backbone are highly conserved in all protein families providing a set of stabilising interactions that are common to all homologous proteins The conservation of alignment positions that contact bases and recognise the DNA sequence are more complex and could be rationalised by defining a threeeclass model for DNAebinding First protein families that bind nonespecifically usually contain several conserved baseecontacting residues without exception interactions are made in the minor groove where there is little discrimination between base types The contacts are commonly used to stabilise deformations in the nucleic acid structure particularly in widening the DNA minor groove The second class comprise families whose members all target the same nucleotide sequence here baseecontacting positions are absolutely or highly consere ved allowing related proteins to target the same sequence The third and most interesting class comprises families in which binding is also specific but different members bind distinct base sequences Here protein residues undergo frequent mutations and family members can be divided into subfamilies according to the amino acid sequences at baseecontacting positions those in the same subfamily are predicted to bind the same DNA sequence and those of different subfamilies to bind distinct sequences On the whole the subfamilies corresponded well with the proteins functions and mem bers of the same subfamilies were found to regulate similar transcription pathways The combined analysis of sequence and structural data described by this study pro vided an insight into how homologous DNAebinding scaffolds achieve different specificities by altering their amino acid sequences In doing so proteins evolved distinct functions therefore allowing structurally related transcription factors to regulate expression of different genes Therefore the relative abundance of tran7 scription regulatory families in a genome depends not only on the importance of a particular protein function but also in the adaptability of the DNAebinding motifs to recognise distinct nucleotide sequences This in turn appears to be best accommoe dated by simple binding motifs such as the zinc fingers Given the knowledge of the transcription regulators that are contained in each organism and an understanding of how they recognise DNA sequences it is of interest to search for their potential bind ing sites within genome sequences 79 For prokaryotes most analyses have in volved compiling data on experimentally known binding sites for particular proteins and building a consensus sequence that in corporates any variations in nucleotides Additional sites are found by conducting wordematching searches over the entire genome and scoring candidate sites by similarity 8083 Unsurprisingly most of the predicted sites are found in nonecoding regions of the DNA 80 and the results of the studies are often presented in databases such as RegulonDB 73 The consensus search approach is often complemented by comparative genomic studies searching upstream regions of orthologous genes in closely related organisms Through such an 353 What is Bioinformaticx approach it was found that at least 27 of known E 60H DNAeregulatory motifs are conserved in one or more distantly related bacteria 84 The detection of regulatory sites in eukaryotes poses a more difficult problem because consensus sequences tend to be much shorter variable and dispersed over very large distances However initial stud ies in S cerevisiae provided an interesting observation for the GATA protein in nitro gen metabolism regulation While the 5 baseepair GATA consensus sequence is found almost everywhere in the genome a single isolated binding site is insufficient to exert the regulatory function 85 There fore specificity of GATA activity comes from the repetition of the consensus se quence within the upstream regions of con trolled genes in multiple copies An initial study has used this observation to predict new regulatory sites by searching for over represented oligonucleotides in nonecoding regions of yeast and worm genomes 86 87 Having detected the regulatory binding sites there is the problem of defining the genes that are actually regulated commonly termed regulons Generally binding sites are assumed to be located directly upstream of the regulons however there are different problems associated with this assumption depending on the organism For prokarye otes it is complicated by the presence of operons it is difficult to locate the regulate ed gene within an operon since it can lie several genes downstream of the regulatory sequence It is often difficult to predict the organisation of operons 88 especially to define the gene that is found at the head and there is often a lack of longerange con servation in gene order between related organisms 89 The problem in eukaryotes is even more severe regulatory sites often act in both directions binding sites are usually distant from regulons because of large intergenic regions and transcription regulation is usually a result of combined action by multiple transcription factors in a combinatorial manner Despite these problems these studies have succeeded in confirming the transcrip7 tion regulatory pathways of wellecharactere ized systems such as the heat shock response Method Inform Med 42001 354 Luscombe Greenbaum Gerslein system 83 In addition it is feasible to experimentally verify any predictions most notably using gene expression data 63 Gene Expression Studies Many expression studies have so far focused on devising methods to cluster genes by similarities in expression profiles This is in order to determine the proteins that are expressed together under different cellular conditions Brie y the most com mon methods are hierarchical clustering selfeorganising maps and Kemeans cluster ing Hierarchical methods originally de rived from algorithms to construct phylogee netic trees and group genes in a bottom upquot fashion genes with the most similar expression profiles are clustered first and those with more diverse profiles are included iteratively 90792 In contrast the selfeorganising map 93 94 and Kemeans methods 95 96 employ a topedown approach in which the user preedefines the number of clusters for the dataset The clusters are initially assigned randomly and the genes are regrouped iteratively until they are optimally clustered Given these methods it is of interest to relate the expression data to other attrie butes such as structure function and sub cellular localisation of each gene product Mapping these properties provides an insight into the characteristics of proteins that are expressed together and also sug gest some interesting conclusions about the overall biochemistry of the cell In yeast shorter proteins tend to be more highly expressed than longer proteins probably because of the relative ease with which they are produced 97 Looking at the amino acid content highly expressed genes are generally enriched in alanine and glycine and depleted in asparagine these are thought to re ect the requirements of amino acid usage in the organism where synthesis of alanine and glycine are energee tically less expensive than asparagine Turn ing to protein structure expression levels of the TIM barrel and NTP hydrolase folds are highest while those for the leucine zipper zinc finger and transmembrane helixecontaining folds are lowest This Method Inform Med 42001 relates to the functions associated with these folds the former are commonly in volved in metabolic pathways and the latter in signalling or transport processes 98 This is also re ected in the relationship with subcellular localisations of proteins where expression of cytoplasmic proteins is high but nuclear and membrane proteins tend to be low 99 100 More complex relationships have also been assessed Conventional wisdom is that gene products that interact with each other are more likely to have similar expression profiles than ifthey do not 101 102 How ever a recent study showed that this rela tionship is not so simple 103 While ex pression profiles are similar for gene pro ducts that are permanently associated for example in the large ribosomal subunit profiles differ significantly for products that are only associated transiently includ ing those belonging to the same metabolic pathway As described below one of the main driving forces behind expression analysis has been to analyse cancerous cell lines 104 In general it has been shown that dif ferent cell lines eg epithelial and ovarian cells can be distinguished on the basis of their expression profiles and that these profiles are maintained when cells are transferred from an in Viva to an in Vitro environment 105 The basis for their phye siological differences were apparent in the expression of specific genes for example expression levels of gene products neces sary for progression through the cell cycle especially ribosomal genes correlated well with variations in cell proliferation rate Comparative analysis can be extended to tumour cells in which the underlying causes of cancer can be uncovered by pinpointing areas of biological variations compared to normal cells For example in breast cancer genes related to cell prolif eration and the IFNeregulated signal trans duction pathway were found to be upregue lated 25 106 One of the difficulties in cancer treatment has been to target specific therapies to pathogenetically distinct tue mour types in order to maximise efficacy and minimise toxicity Thus improvements in cancer classifications have been central to advances in cancer treatment Although the distinction between different forms of cancer A for example subclasses of acute leukaemia A has been well established it is still not possible to establish a clinical diag nosis on the basis of a single test In a recent study acute myeloid leukaemia and acute lymphoblastic leukaemia were suce cessfully distinguished based on the ex pression profiles of these cells 26 As the approach does not require prior biological knowledge of the diseases it may provide a generic strategy for classifying all types of cancer Clearly an essential aspect of under standing expression data lies in understand ing the basis of transcription regulation However analysis in this area is still limited to preliminary analyses of expression levels in yeast mutants lacking key components of the transcription initiation complex 19 107 7 many PRACTICAL APPLICATIONS Here we describe some of the major uses of bioinformatics 71 Finding Homologues As described earlier one of the driving forces behind bioinformatics is the search for similarities between different biomolee cules Apart from enabling systematic orgae nisation of data identification of protein homologues has some direct practical uses The most obvious is transferring informa tion between related proteins For example given a poorly characterised protein it is possible to search for homologues that are better understood and with caution apply some of the knowledge of the latter to the former Specifically with structural data theoretical models of proteins are usually based on experimentally solved structures of close homologues 108 Similar tech niques are used in fold recognition in which tertiary structure predictions depend on finding structures of remote homologues and checking whether the prediction is energetically viable 109 Where biocheme ical or structural data are lacking studies a mmmmw mmwwwywu mm 21127 Ill am 7 11 Wmmnnmumm nAnAAcruvaAnAvacAAAAvccAcAAav z msuwumxmnimmmu Imam u c l a x u r Mummwmwnsmuzmi r quotm quot7 WWW m J m M W ugkwmm WW mm wason 5w Kr I m ma DIMmm i a I 39rr r away I m z I on gm 1 9 ms 75mm nzvum 39 a A I r rum l g 3 Mummmmmmmwummmmmm mmmmammmwmmmmm Wmmmw mmmwmmmnwWWW mewomhmvwmmmmwdwunmmmm DmmummA mmm WWW Emmaa Vnumnwumm mmmm meumwnwam WWW mum m m m mummy m w mmu WWW Mmme m mm mm mun my mm mm mummde mm mm mm m m wwwmmmw mummam mummw m mm kmemwwmmmm mu m mm mm mm m Mm on man k m and m mks am 3 Am mm mm by mam mmk mmumm manan 10mm m Mammal swansv such 3 swam in and Ken anv M m my be hmsmd m dean dmv Mm m mm m M mpmmmmmaum mm a Spam mm mm mm m E mm W adv mmqu mum Mm Anewnknlwymtksakuemybwd mm warns W WWW mm Ha mjamm uwayk d be m Rum mm m mam 0 new smemd venom and mama m muse Brambler DIRK qu 72 Ra mna Drug DEWquot nxmuenwmmm mm amend w cmka mm Am enemas mmqumm mum Onabwerx k am m ngan mumquot in mm m n m w mam um um 356 Luscombe Greenbaum Gerstein drug design Fig 3 outlines the commonly cited approach taking the MLH1 gene prof duct as an example drug target MLH1 is a human gene encoding a mismatch repair protein mmr situated on the short arm of chromosome 3 110 Through linkage ana lysis and its similarity to mmr genes in mice the gene has been implicated in nonpolye posis colorectal cancer 111 Given the nucleotide sequence the probable amino acid sequence of the encoded protein can be determined using translation software Sequence search techniques can then be used to find homologues in model orgae nisms and based on sequence similarity it is possible to model the structure of the human protein on experimentally character ized structures Finally docking algorithms could design molecules that could bind the model structure leading the way for bio chemical assays to test their biological activity on the actual protein 73 Largescale Censuses Although databases can efficiently store all the information related to genomes struce tures and expression datasets it is useful to condense all this information into under standable trends and facts that users can readily understand Broad generalisations help identify interesting subject areas for further detailed analysis and place new ob servations in a proper context This enables one to see whether they are unusual in any way Through these largeescale censuses one can address a number of evolutionary bio chemical and biophysical questions For example are specific protein folds associate ed with certain phylogenetic groups How common are different folds within partice ular organisms And to what degree are folds shared between related organisms Does this extent of sharing parallel mease ures of relatedness derived from traditional evolutionary trees Initial studies show that the frequency of folds differs greatly between organisms and that the sharing of folds between organisms does in fact follow traditional phylogenetic classifications 37 112 113We can also integrate data on protein functions given that the particular Method Inform Med 42001 protein folds are often related to specific biochemical functions 52 53 these find ings highlight the diversity of metabolic pathways in different organisms 36 89 As we discussed earlier one of the most exciting new sources of genomic information is the expression data Combining expression information with structural and functional classifications of proteins we can ask whether the high occurrence of a protein fold in a genome is indicative of high ex pression levels 97 Further genomic scale data that we can consider in largeescale sure veys include the subcellular localisations of proteins and their interactions with each other 1147116 In conjunction with struce tural data we can then begin to compile a map of all proteineprotein interactions in an organism 8 Conclusions With the current deluge of data compue tational methods have become indispense able to biological investigations Originally developed for the analysis of biological se quences bioinformatics now encompasses a wide range of subject areas including structural biology genomics and gene ex pression studies In this review we provided an introduction and overview of the cur rent state of field In particular we discus sed the types of biological information and databases that are commonly used exae mined some of the studies that are being con ducted A with reference to transcription regulatory systems A and finally looked at several practical applications of the field Two principal approaches underpin all stud ies in bioinformatics First is that of com paring and grouping the data according to biologically meaningful similarities and sec ond that of analysing one type of data to infer and understand the observations for another type of data These approaches are re ected in the main aims of the field which are to understand and organise the information associated with biological molecules on a large scale As a result bioinformatics has not only provided great er depth to biological investigations but added the dimension of breadth as well In this way we are able to examine individual systems in detail and also compare them with those that are related in order to uni cover common principles that apply across many systems and highlight unusual fea tures that are unique to some Acknowledgments We thank Patrick McGarvey for comments on the manuscript References 1 Reichhardt T It s sink or swim as a tidal wave of data approaches Nature 1999 399 6736 720 2 Benson DA et al GenBank Nucleic Acids Res 200028 121578 3 Bairoch A preiler R The SWISSVPROT protein sequence database and its supplement TrEMBL in 2000 Nucleic Acids Res 2000 28 8 4Fleischmann RD et al Wholeygenome ranr dom sequencing and assembly of Haemor hilus in uenzae Rd Science 1995 269 5223 4967512 5Drowning in data The Economist 1999 26 June 1999 6 Bernstein FC et alThe Protein Dam Bank A computerrbased archival file for macromolecr ular structures Eur J Biochem 1977 80 2 319 24 7 Berman HM et al The Protein Data Bank Nucleic Acids Res 2000 28 1 2 235742 8 Pearson WR Li man D Improved tools for biological sequence comparison Proc Natl Acad Sci USA 1988 85 82244478 9 Altschul SF et al Gapped BLAST and PSIV BLAST a new generation of protein database search programs Nucleic Acids Res 1997 25 17 33897402 10Fleischmann RD et al Wholeygenome ranr dom s uencing and assembly of Haemor philus in uenzae Rd Science 1995 269 5223 4967512 11 Lander ES et al Initial sequencing and analyr sis oft e uman genome Nature 2001 409 12 Venter JC et al The sequence of the human genome Science 2001 291 55071304r51 13Tatusova TA KarschrMizrachi I Ostell JA Complete genomes in WWW Entrez data representation and analysis Bioinformatics 1999 15 7782536743 14 Eisen MB Brown PO DNA arrays for analyr sis of gene expression Methods Enzymol 19993031797205 15 Cheung VG et al Making and reading micror arrays Nat Genet 1999 21 1 Suppl15r9 16 Duggan D et al Expression profiling using c NA microarrays Nat Genet 1999 21 1 Suppl10r4 17Lipshutz R et al High density synthetic oligonucleotide arrays Nat Genet 1999 21 1 2074 18 Velculescu VE et al Serial Analysis of Gene Expression Detailed Protocol 1999 19 Holstege FC et al Dissecting the regulatory circuitry ofa eukaryotic genome Cell 1998 95 I 717728 20Roth FP Estep PW Church GM Finding DNA regulatory motifs within unaligned nonr coding sequences clustered by wholergenome mRNA quantitation Nat Biotech 1998 16 10939745 1Jelinsky SA Samson LD Global response of Saccharomyces cerevisiae to an a yatin agent Proc Natl Acad Sci USA 1999 96 4 148691 N N N Cho RJ et al A genomerwide transcriptional analysis of the mitotic cell cycle Mol Cell 19982 1165773 DeRisi JL Iyer VR Brown PO Exploring the mem olic and genetic control of gene expresr sion on a genomic scale Science 1997 278 533868076 4 Winzeler EA et al Functional characterizar tion of the S cerevisiae enome y gene deletion and parallel analysis Science 1999 285 5422000170 25 Perou CM et al Molecular portraits of human breast tumours Nature 2000 406 6797 7752 N u N N a Golub TR et al Molecular classi cation of cancer class discovery and class prediction by gene expression monitoring Science 1999 286 543953177 27 Pedersendagger AG et al A DNA structural atlas for Escherichia coli J Mol Biol 2000 299 39 07730 8 Kanehisa M Goto S KEGG kyoto enqclor edia of genes and genomes Nucleic Acids Res 2000 20 02730 29 Jeffery CJ Moonlighting proteins TIBS 1999 24 11811 0 Chothia C Proteins One thousand families for the molecular biologist Nature 1992 357 6379543r4 Orengo CA Jones DT Thornton JM Protein superfamilies and domain superfolds Nature 1994 372 6507 63174 Lesk AM Chothia C How different amino acid sequences determine similar protein structures the structure and evolutionary dynamics of the globins J Mol Biol 1980 136 225770 N w w u N w w Russell RB et al Recognition of analogous and homologous protein folds analysis of s q e ce and structure conservationJ Mol Biol 1997269 31423739 Russell RB et al Recognition of analogous and homologous protein folds 7 assessment of prediction success and associated alignment accurac using empirical substitution matrir ces Protein Eng 1998 11 12179 35Fitch WM Distinguishing homologous from analogous proteins Syst Zool 1970 19 997110 36 Tatusov RL Koonin EV Lipman DJA enor mic perspective on protein families Science 1997278 5338 63177 La 1 37 Gerstein M Hegyi H Comparing genomes in terms of protein structure surveys of a finite parts list FEMS Microbiol Rev 1998 22 4 2777304 38 Skolnick J Fetrow JS From genes to protein structure and function novel applications of computational approaches in the genomic era Trends Biotech 2000 18 3479 39 Qian J et al PartsList a webrbased system for dynamically ranking protein folds based on disparate attributes including wholergenome expression an interaction information Nucleic Acids Res 200129 821750764 40 Gerstein M Integrative database analysis in structural genomics Nat Struct Biol 2000 7 u p r 41 Etzold T Ulyanov A Argos P SRS informar tion retrieval system for molecular biology data ba s Methods Enzymol 1996266114728 42 Schuler GD et al Entrez molecular biology database and retrieval system Methods Enzymol 1996 266 141762 43Wade K Searching Entrez PubMed and ncover on the internet Aviat Space Environ Med 2000 71 52559 44 Bertone P et al SPINE An integrated tracking database and datamining approach for highrthroughput structural proteomics enabling the determination of the properties 0 readily characterized proteins Nucleic Acids Res In Press 45 Zhang MQ Promoter analysis of corregulated genes in the yeast genome Comput Chem 1999 23 374233750 46Boguski MS Biosequence exegesis Science 1999 286 54394535 47Miller C Gurd J Brass A A RAPID algorithm for sequence database comparisons application to the identification of vector contamination in the EMBL databases Bior informatics 1999 15 22111721 48 Gonnet GH Korostensky C Brenner S Evaluation measures of multiple sequence alignments J Comput Biol 2000 7 172 776 490rengo CA Taylor WR SSAP sequential structure alignment program for protein strucr ture comparison Methods Enzymol 1996 266 5 61773 50 Orengo CA CORA 7 topological fingerprints for protein structural families Protein Sci 19998 40997715 51 Russell RB Sternberg MJ Structure predicr tionHow good are we Curr Biol 1995 5 5 488790 52 Martin AC et al Protein folds and functions Structure 1998 6 7 2 875784 53Hegyi H Gerstein M The relationship be tween protein structure and function a com prehensive survey with ap lication to the yeastgenome J Mol Biol 1999 288 12147764 54Russell RB Sasieni PD Sternberg MJE Supersites within superfolds Binding site similari in the absence of homology J Mol Biol 1998 282 41903718 55Wilson CA Kreychman J Gerstein M Asr sessing annomtion transfer for genomics quantifying the relations between protein 357 What is Biointormatics sequence structure and function through traditional and probabilistic scores J Mol Biol 2000297 12233749 56 Harrison SC A structural taxonomy of DNA7 binding domains Nature 1991 353 6346 715 9 57 Luscombe NM et alAn overview of the strucr tures of proteinrDNA complexes Genome Biology 20001 121737 58Jones S et al ProteinrDNA interactions A structural analysisJ Mol Biol 1999 287 5 877796 59 Suzuki M Gerstein M Binding geometry of alpharhelices that recognize DNA Proteins 199523 4525735 Luscombe NM Thornton JM ProteinrDNA interactions a 3D analysis of alpharhelixr binding in the major groove Manuscript in preparation Suzuki M et al DNA recognition code of transcription factors Protein Eng 1995 8 4 9728 a o a a 2Suzuki M DNA recognition by a Brsheet Protein Eng 19958 12174 3 Seeman NC Rosenberg JM Rich A Sequence specific recognition of double helical nucleic acids by proteins Proc Natl Acad Sci USA 19767380478 Suzuki M A framework for the DNArprotein recognition code of the probe helix in transcription factors the chemical and stereo chemical rules Structure 19942 4 317726 65 Mandeerutfreund Y Schueler O Margalit H a a 1 P of common principles J Mol Biol 1995 253 2 370702 66 Luscombe NM Laskowski RAThornton JM ProteinrDNA interactions a 3D analysis of amino acidrbase interactions Nucleic Acids Res In Press 67 Mandeerutfreund Y Margalit H Quantitar tive parameters for amino acidrbase inter action inplications for prediction of protein DNA binding sites Nucleic Acids Res 1998 2612306712 68 Sternberg MI Gabb HA Jackson RM Predicr tive docking of proteinrprotein and protein DNA complexes Curr Opin Struct Biol 1998 0 2225076 69Aloy P et al Modelling repressor proteins docking to DNA Proteins 1998 33 4 535 49 70 Dickerson RE DNArbinding the prevalence of kinkiness and the virtues of normality Nucleic Acids Res 1998 26 821906726 71Pereerueda E ColladorVides J The reperr toire of DNArbinding transcriptional regular tors in Escherichia coli 712 Nuc eic Acids Res 2000 28 811838747 72 Mewes HW et al MIPS a dambase for genor mes and protein sequencesNucleic Acids Res 2000 20 1 37740 73Salgado H et al RegulonDB version 30 transcriptional regulation and operon orgar nization in Escherichia coli K712 Nucleic Acids Res 2000 28 12657 Method Intonn Med 4 2001
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'