Biological SequnceGenome Anly
Biological SequnceGenome Anly BINF 730
Popular in Course
Popular in BioInformatics
This 105 page Class Notes was uploaded by Nathanael Schowalter on Monday September 28, 2015. The Class Notes belongs to BINF 730 at George Mason University taught by Staff in Fall. Since its upload, it has received 19 views. For similar materials see /class/215257/binf-730-george-mason-university in BioInformatics at George Mason University.
Reviews for Biological SequnceGenome Anly
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 09/28/15
BINF 730 Biological Sequence Analysis Saleet Jafri Program in Bioinformatics and Computational Biology George Mason University Lectu re 1 Overview of Molecular and Cellular Biology Biological References I Molecular Biology of the Cell by Bruce Alberts 1994 or newer edition I Molecular Cell Biology b D rnell Lodish and Baltimore 1995 or newer edition Part I Molecular Biology Review Where do biological sequences come from Life and evolution Proteins Nucleic Acids Central dogma Genetic code DNA structure Mitochondrial DNA Life Evolved from common origin 35 billion years ago All life shares similar biochemistry Proteins active elements Nucleic acids informational elements Molecular Biology the study of structure and function of proteins and nucleic acids Clmluslwul mum Wm mm grmu Terrestrial Life gunman Diplumorlads Lzrnblra Sullolobus nacunzn Mamalobuncrmm Halocoxus Green 5mm Halobactenum banana Methanomrcus an Borrzlm DuranteYen Pvesumeu Wm mama i organisms otellcxun Finsumad common Droycnum at srchxehanena and eukETVmeS Cell types Prokaryotes no nuclear membrane represented by cyanobacteria bluegreen algae and common bacteria Escherichia coli ukaryotes unicellular organisms such as yeast and multicellular organisms Archaebacteria no nuclear membrane but similar to eukaryotes in transcription and translation mechanisms discovered in deep sea thermal vents in 1982 P ro karyotic Cell Coll roner membva quot9 w wall 1mm membrane 77 x Wmmmlmmn q P ro karyotic Cell a Prokaryonc call Nude Fenplasmlc space and cell wall Seplum A Inner lplasmal membrane Eu karyotes ln eukaryotes transcription is complex Many genes contain altemating exons and introns lntrons are spliced out of mRNA mRNA then leaves the nucleus to be translated by ribosomes Genomic DNA entire gene including exons and introns The same genomic DNA can produce different proteins by alternative splicing of exons Complementary DNA cDNA spliced sequence contai only exons cDNA can be manufactured by capturing mRNA and performing reverse transcription n9 an quotr u t Cell wall rquot 39 Outer membrane Nucleold 95 m Fenplzsmlc space J ufer membrane Inner plasma membrane Eu karyotlc Cell Eu karyotic Cell 1 Eukaryotic cell Nuclear membrane 7 r Ema ernbrana lt4 L Golgivesicles ctgwesuctes M Mitochondrion Maw h L Peroxlsorne A a vsosome a agntysasm A I 5 EB Endoplasmlc reticulum Nucleus Milochondrion v 1 m 39 Secretory vesicle lL Rough Endoplasmic retlculum Eu karyotic Cell Organelles Cell membrane E 5 M Endoplasmic Reticulum rough and smooth Golgi Apparatus received newly formed proteins from the ER and modi es them and directs them to final destination respiratory centers have own circular DNA S o a r o 3 a bacterial or Eu karyotic Membrane aner Periphlrnl proxem Ollgo uhlxide Glyazpronin Glymlipid mugmwmin 263 i LulIu Phosphnlipid bllaykv Eu karyotic Cell Organelles Chromosomes chromatin histones centromeres and arms 2 pairs in eukaryotes ys somes contain acid hydrolases nucleases proteases glycodidases lipases phosphatases sulfatases phospholipases Peroxisomes use oxygen to remove hydrogen from substrates forming H202 abundant in kidney and liver detoxification Cytoskeleton Nucleic Acids Two kinds RNA ribonucleic acid DNA deoxyribonucleic acid Null1i and Pnlrln u um mm l aanrommr 1 math are l l 1 l 1 mm quotquot5 PImx nulr id mml xmgmpmmn Hydlcpllilmpoilv quot wluwm M NucleIc ACld Structure PUFllNES PYRIMlDINES 0 Hz 0 la 539end o FD H Jim 0 0 E l mm c WNW H H H H H Uracil U lt H 070 l 0 A l H H H Guanine GI Thymine m 1 NH 040 A quot3 He 5 o C 39i H 0 W H H I 3 cytosine c 3 end OH H gt WatsonCrick base pairs bp DNA Double stranded Four bases adenine A guanine G ytosine C and thymine T A and G are purines C and T are pyrimidines A always paired with T complementary C always paired with G complementary DNA may consist of hundreds of millions bp A short sequence lt100 is called an oligonucleotide RNA Different sugar ribose instead of 2 deoxyribose Uracil U instead of thymine U binds with A RNA does not form a double helix RNA may have a complex threedimensional structure OH those 2Du0xvribose Central Dogma DNA gt RNA gt Protein DNA Deoxyribonucleic Acid RNA Ribonucleic Acid Protein Functional and Structural units of cells Flow of Information is unidirectional DNA WW Repuc adnn lmmm DNA duplacztas M Trannrip nn RNA imlil RNA W cmme Tranxlziinn 1mmquot synme The Cenlral Dogna 0 Molecular Biology Gene Transcription or DNA Transcription RNA molecules synthesized by RNA polymerase RNA polymerase found in free and bound form RNA polymerase binds very tightly to promoterregion on DNA Promoter region contains start site Transcription ends at termination signal site Primary transcript direct coding of RNA from DNA RNA splicing introns removed to make the mRNA mRNA contains the sequence of codons that code for a protein uracil replaces thymine splicing and alternative splicing Transcription of DNA to Messenger RNA nxnnl lm mum Chltmtstmd o n a V um um mm mm mm suninc m mm mm Huangquot nuA Translation Ribosomes made of protein and ribosomal RNA rRNA Transfer RNA tRNA make connection between speci c codons in mRNA and amino aci s As tRNA binds to the next codon in mRNA its amino acid is bound to the last amino acid in the protein chain When a STOP codon is encountered the ribosome releases the mRNA and synthesis ends Gene Translation tRNA links an amino acid to the codon on the mRNA via the anitcodon rRNA RNA found in ribosomes ribosomes large and small subunit made of protein and rRNA initiator tRNA always carries methionine initiation factors proteins that catayze the start of transcription stop codon Endoplasmic Reticulum Posttranscriptional modification Translation Involves ribosomes and RNA 39 osomes made of protein and RNA Messenger RNA mRNA is the sequence transcribed from the DNA The mRNA is threaded through the ribosomes Transfer RNA tRNA brings the different amino acids to the ribosome complex so that the amino acids can be attached to the growing amino acid chain Ribosome Ribosome Growing erNAproteinl polypeptide chain a l RNA RN l leavmg i CAG Ccc quot aa RNA 7 7 mRNA K arriving I 5 3 G c ofribosome Codon Codon Codcn Codon Codon Codon Codon aa aa aaJ 3a aa5 aa6 aa7 Amimacid Ribosomal RNA 323243 4 m5 w llnclaolldes Prokarvo ic Eukaryotic and Prokaryotic Ribosome Structure rRNA Proteins Subunits Axsnmhlad Q Ll L2 L3 ribosomes 235 2900 bases 705 65 1500 bases Total 21 Eukaryotic mammalian 7 u u L3 235 585 225 535 as MEDD bases s 60 76595 H 20 hasasl Tma 50 SI 52 53 Ens ms um bases Mia 33 IRNA l 1 1 A anllcodo A u G u U A C mRNA 3 151 ban in cadan ucpwuiucq m The Genetic Code DNA Structure DNA contains Co unl Promoters Cadanz Genes Junk DNA Reading frames An open reading frames ORF a contiguous sequence of CodanE DNA starting at a start codon and ending at a STOP codon Canon 3 Canon 1 Conan u oncombmooccombomopznm llllllllllllll Sudan 1 Rlbonuclelc and Chromosomes A genome a complete set of chromosomes within a cell quotquot r 39 have 39 u in their genomes Prokaryotes usually have a single chromosome often a circular DNA molecule Eukaryotic chromosomes appear in pairs diploid each inherited from one parent W 3 Homologous chromosomes carry the same genes quot 4 Some genes are same in both parents 3 Some genes appear in different forms called alleles Eg human blood has three alleles A B and o All genes are present in all cells but a given cell types only BiA expresses a small portion of the genes M l l 34 Chromosomes Chromatin Structure 30 nm e V 2 Oclameric histone core 10 nm DNA H1 histone Nucleosome Gene Coding and Replication Double helix Nitrogenous bases ATGC SugarPhosphate backbone Nucleotide sugar base phosphate group Nucleoside sugar base Purines adenine guanine Pyrimldines cysteine thymine AT 2 H bonds GC 3 H bonds Gene Coding and Replication 5 end contains a phosphate group 3 end is free DNA extended from 5 to 3 Gene is a segment of DNA that codes for a speci c protein Exons are coding regions of the DNA lntrons are in between regions found in eukaryotes Codons Reading frame Co sequences are conserved regions found in a particular type of regulatory region Mitosis 1 ea OuickTime M and a Sorenson Video decompressor are needed to see this picture fl A J V em 5 C Proteins Functions Structural proteins Antibodydefense Chains of amino acids Typical size 300 residues w U U0 A mmnmnlmiMLrl inuunlllvu r V n n 0 Do u n a n M 00 l y l l oi l y l HiNi 7 HiN c c n HiNicic Himicic 06 l l l De I l s H H a l c l 1 l D H L H H H H rim C WWW A mm s Hf CH 1 m up 0 0 n we i mum mme i mdnclmms i 7H H l I WW 5 m or I l I l N 24m Hgvraciuiiffs aN c civsfcik H I H H o H H 6 em or l 9mm l nnlmenum w m w r up m l mm m r smuam Mn 7 A 7 m 7 1w Protein Folding Primary structure amino acid sequence Secondary structure local structure such as a helix and p eets Tertiary structure 3dimensional structure of a protein monomer Quarternary structure 3dimensional structure of a fully functional protein protein complexes Lecture 1 Pan 39Mnlecular lnlngy mew nlng cal Relerences mum Mm mmm Terrestrial L e cgquot ypes Prnkalynlic Cell rmmbnmrg mm yum mun m my mum rmmbnm mm mm M mm mm m m Prnkalynllv Cell Eukauyms b mm mm mmm r m m EukalwhE CEquot Eukalynliv Cell e Eukaryotic Cell Organelles Eukaryotic Cell Organelles cell membrane Chromosomes chromatin histones centromeres and arms 2 pairs in eukaryotes Lysosomes contain acid hydrolases nucleases proteases lyco ases lipases phosphatases sulfatases phospholipases Peroxisomes use oxygen to remove hydrogen from substrates or ng H202 abundant in kidney and liver detoxification Cytoskeleton Endoplasmic Reticulum rough and smooth Golgi Apparatus received newly formed proteins from the ER and modi es them and directs them to final destination Mitochondria respiratory centers have own circular DNA bacterial origin Eukaryotic Membrane ucleic Acids Two kinds 7 RNA ribonucleic acid rmW mmquot MTWHWM r DNAdeoxyrIbonucleIc acid a M in i u Wquot aw rm 7 Nynwlllyd vivid p1llu Inl DNA Double stranded Four bases adeni eA guanineG cytosine C and thymineT Nucleic Acid Structure Lit quotLE rHl39dlD ryES 0k II A always paired with T complementary C always paired with G complemen a H gt WatsonCrick base pairs bp Cullquot 1r s o hundreds of millions bp A short sequence lt100 is cal ed an oligonucleotide RNA Central Dogma Different sugar ribose instead of 2 deoxyribose Uracil U instead of thymine U binds with A RNA does not form a double helix RNA may have a complex threedimensional structure DNA Deoxyribonucleic Acid RNA Ribonucleic Acid Protein Functional and Structural units of cells Div ll 2Daunrihcm Promoter region contains start site m Gene Transcription or DNA W R Lita 39n bitten Transcrlptlon DNA Miriam RNA molecules synthesIzed by RNA polymerase Tranruiptinn RNA polymerase found in free and bound form RNAeynmesis RNA polymerase binds very tightly to promoterregion n A l quot I V gt mlrleus Transcription ends at termination signalsite Primary transcript direct coding of RNA from DNA eympmm dea un RNA splicing introns removed to make the mRNA Proldnsymhzsis mRNA contains the sequence of codonsthat code for a protein uracil replaces thymine splicing and alternative splicing The Genlral Dayna of Molecular Biology Transcription of DNA to Messenger RNA Translation gene Ribosomes made of protein and ribosomal RNA rRNA l l Transfer RNA tRNA make connection between specific ids Chromosomxl DNA ML WM codonsin mRNA and amino ac As tRNAbinds to the next codon in mRNA its amino acid is bound to the last amino acid in the protein chain J quotm39il39mquot When a STOP codon is encountered the ribosome releases the mRNA and synthesis ends Nuclexr RNA Messenger RNA I39Mulua Innrm hmlllll gt Gene Tra slation RNA nks an amino acid to the codon on the mRNA via the anilcodon rRNA RNA found in ribosomes ribosomes large and small subunit made of protein and rRNA initiatortRNA always carries methI nine initiation factors proteins that catayze the start of transcri on stop codon Endoplasmic Reticulum Posttranscriptional modi cation Ribosome 3mm mm 7 V 39 u II A Ij39JGr39vAyl Eukaryotic and Prokaryotic Ribosome Structure lr u Tr nslation Involves ribosomes and RN 39 d A I osomes ma e of protein and RNA Messenger RNA mRNA is the sequence transcribed from the DNA The mRNA is threaded through the ribosomes Transfer RNA tRNA brings the different amino acids to the ribosome complex so that the amino acids can be attached to the growing amino acid chain Protein ramilm arid main 39bnsnme Ribosomal RNA 1 u c A anticodon 5 A c LJ codon 1stbase ln codon uapoo u uszq ms DNA Structure DNA contains Promoters Genes Junk DNA Reading frames An open reading frames ORF a contiguous sequence of DNA starting at a start codon and ending at a STOP codon mnogtonnonnnngtom Ribonucleic amd Chromosomes A genome a complete set of chromosomes within a cell Different species have different numbers of chromosomes in their genomes Prokaryotes usually have a single chromosome often a circular DNA molecu e lmm lmi mini mm W my MDM Eukaryotlc chromosomes appear In paIrs deloId each InherI e a nt M Homologous chromosomes carry the same genes Some genes are same in both parents Some genes appear in different forms called alleles Eg hurmn blood has three alleles A B and 0 All genes are present in all cells but a given cell types only expresses a small portion of the genes Chro osomes Chromatin Structure Uttarrw39c l39lizturru L39UIE Iv rm v y l mm H 391 513an N LL39IEL39I LKl I I Gene Curling and Replicatinn Gene Curling and Replicatinn umum ma Mum um z m m Wm H U u m w m y up gm mmm an s mm WW mm lump quot3 n w mm mum mmmwmmm w W m mummwu n Prnleln rum mm w hrr39mmn m w w m m me m w Lecture 5 Phylogenetic Analysis Additional Reference A olecular Evolution A Pl39zylogenetic Approach Roderic D M Page and Edward C Holmes Evolutionary Problems ii The f sil record suggests that modern man diverged from apes about 56 million year ago Modern Homo sapiens s emerged between 10000060000 yea 39 iii Work based 011 mitochondrial DNA by ilson et al suggest the modern man emerged only 200000 ye s 39 with the divergence into different races 50000 yea l mitochondrial DNA circular 2 maternal inheritance 3 10x t ter mutation rate than nuclear DNA Preliminaries T axon taxa plural or operation taxon unit is a entity whose distance from other entities can be measures ie species amino acid sequence language etc Comparisons are made on measurements or assumptions concerning rates of evolutionary change This is complicated by back mutations parallel n39zutations and variations in mutation rate We will only consider substitutions Amino Acid Sequences i For example the amino acid substitution rate per site per year is 53 X 10399 for guinea pig but only 033 X 10399 for other organisms ii The evolutionary time is the average time to produce one substitution per 100 amino acids Th L quot 100 Amino Acid Sequences Example 7 There are 2 differences in a sequence of 100 amino acids when comparing calf and pea histone H4 Since plant 1d animals diverged 1 billion years ago T 05 billion years l A z 103911 100T iii probabilit of substitution 7 several ay to calculate it 11g the PAM matri Nucleotide Sequences Different from amino acid sequences due to redund39 in the genetic code ie several codons can code for a particular amino acid bstitutions in the 3quotquot1 position are synononious the RNA coding for serine 7the corresponding DNA would be AG Since evolution should depend 011 function 39 conferred by the amino acid sequence it has been suggested that the molecular cloc hould be based 011 the substitution rate in the th 1 position of the codon In fact in the fll nopeptid this i as high as the amino acid substitution rz Nucleotide Sequences Nucleotide substitution rate is species dependent The substitution rate for nuclear DNA 1 3 X higher primate and 66 X 10399 for rodei 11d sea urchins Mitochondrial DNA substitution rate is 10 times faster than the nuclear substitution rate Nucleotide Sequences iii In the definition of PAM matrices one assumes a discrete Markov Chain with the PAM matrix being the transition matrix for the Markov Chain Markov Chains page 3843 text Clote and Backofen Assume that we have a process that has discrete observable states x1 x2 When we monitor this over time we get a sequence of the states occupied ql ql where qi any of x1 This sequence is a Markov Chain Note that while there can be an in nite number of states the Markov chain has a countable number of elements Markov Chains page 3843 Clote and Backofen A Markov chain is a stochastic process whose state qt at time t is a random variable determined by Pq0i 2 mi Pqt1j lqti pg Markov Chains page 3843 Clote and Backofen 0 A state of a Markov chain is persistent if the system returns in nitely often to that state 0 A state of a Markov chain is transient otherwise that is if the system returns only nitely many times to that state Markov Chains Another property of a Mark This means that the St a ssumed at time tl depends on the state med on t not on a 39 7139 3 ny other pre 39ous state 39 called the tX Y l 2 that history does not in atter be a d crete time random t re a Markov Ym tak n a alue V S r s not depend on the v alues of 10n probabilities Markov Chains T ion matrix 7 put the pik into a matrix P A sequence of amino acids can be thought of as a Markov chain Stationary Markov process 7 the probabilities depend 011 n that re constant Anothe s 11gt111 nutlon 1 1s sa1d to be stationary Irreducible 7 eve ate Application of Markov processes to evolutionary models i The PAM matrix has 39 bstitution probabilities determined from clo related amino acid sequences it assumes that the sub tutions have occurred through one application of the trails 1011 matrix i e no multiple substitutions and a given site and 2 evolutionary I ance results from repeated application of the same PAM matrix ii A better evolutionary model is needed text p 140144 This requires the use of a continuous Markov s rather than a discrete Markov chain Tl till has the Markov propeity Application of Markov processes to evolutionary models A time homogenous Markov process for the stochastic function Xt consists of a set of states Q 12 n a set of initial state distributions nn1 an and transition probability functions p11t p1nt Pa p1t pt Application of Markov processes to evolutionary models These transition probabilities have to satisfy the Markov property PXts j lxts ilPXt j X0 il Pijti Hence pijquotgt0 and ElijillpLJt l and Pts PmPts Application of Markov processes to evolutionary models If we assume that the functions pm t are ii ght continuous at t0 then as t approaches 0 from the right let l for ij and 0 otherwise This allows us to de ne the identity matrix P0 as the identity matrix ie P0I Application of Markov processes to evolutionary models We can apply this to nucleotide sequences Let Q l234r correspond to ACGT P11t Pm Pt Pn4t Pm PAAt PCAt PGlAt PTAt PAlCtPClCt PGlCt PTlCt PAlGt PClGt PGlGt PTlGt PAlTt PClTt PGlTt PTlTt Application of Markov processes to evolutionary models We need to generate the probability functions pLJt Consider P l1t P 14t P lttgt P n4t V44 1i111A10 PtAt 7 Pt m li111A10 PIPAt 7 PtIAt Pt limAM PAt 7 IAt Pt A Application of Markov processes to evolutionary models 11 7 12 243 214 121 42 323 124 A31 212 111 234 A 11 242 243 14 F lim pm At l At gt Ljt lim pLj At O At Hence k can be interpreted at the rate per site an V Substitution nucleotide i for nucleotide j as follows pun mm 0At mum Xi is thte rate that we substitute out of nucleotide i Application of Markov processes to evolutionary models 11 7 12 A13 214 A A21 12 223 124 A31 A12 11113933 A 11 742 343 14 The substitution rates A determine the matrix Pt through the differential equation P tPt A The solution of this differential equation is em We must 110w find A Let ui be the rate of substitution to nucleotide Then ui Application of Markov processes to evolutionary models ulu3u4 113 114 11 uIu3 114 113 114 111 112 u1 112 114 114 111 112 113 uluu3 ider the matrix P I u where 11111112u3u4 then 1 12 13 Note that P is stochastic n2 13 and stationary 12 13 12 13 Rates of Nucleic Acid Change The Jukes Cantor model assumes that u1uzu3u4a yielding the rate matrix 30 on a on a 3aa a a ac 30L at a on a 3u The P1P2P3P4a Use in lVIaimum Likelihood Calculation J ukes Cantor Model A a 4 a X a a C4 0 Transitions Transversions Purines A Pyrimidines C T I Transitions gt Transversions 0LgtB De nitions taxa 7 entities whose distance from other entities can be measured A directed graph G39 E consists of a set Vquot of nodes or vertices and a set EV quot of directed edges Then ij E means that there is a directed edge from ito A graph is undirected if the edge relation is symmetric that is ij E iff0i E A directed graph is connected if there is a directed path between any two nodes De nitions directed graph is acyclic if it does not contain a 39 39 39 L and ki all belong to E A tree is a undirected connected acyclic graph A rooted Tree has a starting node called a root The parent node is immediater before a node on the path from the root The child node is a node that is follows a node De nitions An ancestor is any node that came before a node on the path from a root A leqfor external node is a node that had no children N0n leaf nodes are called internal nodes The depth of a tree is one less than the maximal number of nodes on a path from the root to a leaf De nitions An ordered tree is a tree where the children of internal nodes are numbered A binary Tree is a tree where each node has at most two children Otherwise it is multi n ca ng Trees Question Draw all binary trees on 1 2 and 3 taxa a A pl u logene c tree on n taxa is a tree with leaves labeled by 1 n How do you tell if two trees are the same If you can convert one tree into another without breaking any branches they are topologically equivalent Phylogenetic Trees Pl39zytr39logenetic trees or evolutionaw trees are binale trees that describe the relations between species Trees consist of nodes or VerTI39CeS and tam or leaves Phylogenetic Trees To understand the data we must understand some of the methods behind phylogenetic trees or evolutionary trees i Clustering methods ii Maximum likelihood methods iii Quartet puzzling Algonthms Types of Data Distances Nucleotide sites UPGMA Neigh 3r Joining urqulo w39 uualsnp 0 Maximum liinimum Pa im ony EVOhmOn lL imum Likelihood 101121113 gilqemud 90mm 3quotmumML From Page and Holmes ilfolecular Evolution A Phylogenetic Approach What do we do with phylogenetic trees measuring evolution change 011 a tree If the leav s of a tree each signify a sequence the sum of the w ig11t of the edges gi s the evolutionary distance between the tw sequences molecular phylogenetics Convert information in sequences into an evolutiona for those sequences Cluster methods vs search methods There are two basic methods for constructing trees Cluster methods use an algorithm set of steps to generate a tree These methods are ery easy to implement and hence can be computationall efflcient They also typically produce a single tree A big d age to this method is that it depends upon the order in which e add sequences to the tree Hence there could be a different tree that explains the data just as well Search methods use some sort of optimality riteria to choose among the set of all po quot le trees The optimality criteria g1ves each tree a ore that i wased 011 the comparison of the tree to data The advantage of s arch methods is that they use an explicit function relating the trees to the data for example a model of how the sequences evolve The disadvantage is that they are computationally very expens1ve NP complete problem How do we compare different tree methods Efficiency 7 How fast is the method power 7How much data does the method require consisten quot 7 Will the tree converge 011 the right answer give enough data robustness 7 Will minor violations of the methods assumptlons result in poor estimates of phylogeny fals bility 7 Will the method tell us when its assumptions are violated How do assign weights for the edges of our trees D tance methods r t convert aligned sequences into a pairvx se distance matrix then input that matrix into a tree building method The major objections to d39 nce methods are tha summarizing 2 et of sequences by d1stance data lo information and branch lengths estimated by some nce methods mioht not be evolutionarily determinable Discrete methods cor sider each nucleotide site of some function of each site direct 39 Distance Methods Two distance methods are neighbor joining and llllllillllllll evolution ilfinimum evolution nds the tree that minimizes the sum of the branch lengths where the lengths are calculated from the pair 39se distance between the sequences Linear programming or least squares methods can be used to do this ighborfoining 39 a clustering method that is computationally and gi es a unique result This can use something llke the four point condition and clusters the closest elements Discrete Methods The two major discrete methods are maximum parsimony and maximum likelihood Both these are search methods i With 1739zaximzm7 parsimony we try to reconstruct the evolution at a particular site v 1th the fewest possible evolutionary changes The advantages of parsimony are that it makes relatively few assumptions about the evolutionary process it has been studied extensively mathematicr 39 and some very powerful software implementations are available The major disadvantage to using parsimony is that under some models of evolution it is inconsistent that is if more data is added the wrong result might occur Discrete Methods ii The ma 7mm likelihood approach looks for the tree that makes the data the most probable evolutionary outcome This approach requires a explicit model of evolution which is both a strength and weakness because the results depend on the model used This method can also be very computationally expensive Types of metrics point condition or additive metric given the and l dij dkl S dik dQ39J di1 d0k For an ultrametric metric the ultrametric or 3 point condition holds That is given the leaves i and k dij S dik d0k Ultrametric trees Clustering metho ttempt to repeated clus r the data by grouping the clos t elements together Th phylogeny and gene expression microan The pair group method PGM is a technique where the pairs are repeatedly amalgamated The unit eighted paired group method with arithmetic mean UPGMA is used to cluster molecular data where sequence alignment d39 tance between sequences has been determined in a dis nce matrix UPGMA Input an 11 X 11 distance matrix D l Initialize a set C to consist of 11 singleton clusters 2 Initialize distcd 011 C by de ning for all i and j r in C diSIGiLU Dij 3 Repeat the following 111 times a determine a pan cd of clusters in C such that dist Cd in minimal define dmin di b define a new cluster e c U d define C C 7 Cd U e c define a node with label e and daughters c and d where the e has distance dmim 2 to its leaves d define for all f in C with f different from e distef distfe distcf distdfs 2 UPGMA Example Ultrametric Topology Distance Table dij S dik d0k from Clote and Backofen Computational ilfolecularBioIogy UPGMA Example Given the distance table 1 We have ve singleton clusters a b c d and e from the set C 1abcde Get the distances from the distance table left a Find the cl est two clusters namely clusters c and d with dmin 2 bf Icd and C 1abef c f is the root for c and d d De ne new distance table Repeat 3 UPGMA Example The old distance table The new distance table UPGMA example Tree formation WPGMA Input an 11 X 11 distance matrix D l Initialize a set C to consist of 11 singleton clus rs 2 Initialize dis d 011 C by de ning for all i and j in C distiu j Dij 3 Repeat the following 111 times a determine a pair cd of clusters in C such that dist Cd in minimal define dmm distcd b define a new cluster e c U d define C C 7 Cd U e c define a node with label e and daughters c and d where the e has distance dm to its leaves d define for all f in C w1th f different from e distef distfe lcldistcf ldldistdf lcl dl Farris Transform Example Additive nonultrametric topology Distance Table dij dk1 s dik dg39J di1 d0k from Clote and Backofen Computational ilfolecularBioIogr Farris Transform Example UPGMA incorrectly Distance Table reconstructed topology from Clote and Backofen Computational ilfolecularBioIogr Farris Transform Sometimes the data will sa 39 1n additive metric and not a ultrametric This will yield a tree ith the incorrect topology if UPGMA or WPGMA is used The F a quot rmed Distance M ethod converts the data for an additive nonultrametric metric so that it tis es the ultrametric Then UPGMA or WPGMA can be used to yield a tree with the correct topology Farris Transform If we have a phylogenetic tree with root r and leaves taxa l 11 and dLi is the distance between two nodes then we have the t1 1sformed distance dij dim de where 1 II a ZdJ 7 i1 You must 39 ssume a root r Thi 39 39 39 39s aithest from all the others Unfortunatel depending 011 the root selected the method might not give the right topology Farris Transform Example a b Additive nonultrametric topology What is the distance to the root Farris Transform Example Original Distance Table with assumed root Transformed Distance Table ON 1 NNNN Farris Transform Example ansformed tree topology Distance Table dij dk1 s dik dg39J di1 d0k Farris Transform Pick 1 as root Original Distance Table with assumed root Transformed Distance Table Determining Protein Structure from Sequence using Computational Approaches M Saleet Jafri Program in Bioinformatics and Computational Biology George Mason University and Medical Biotechnology Center University of Maryland Biotechnology Institute Protein Structure Why do we care Structure Function Relation 7 The shape of a protein molecule directly determines its biological function Proteins with similar function often have similar shape or similar regions or domains Hence if we nd a new protein and know it s shape we can make a good guess about it s biological lnction Protein Databases As of June 2000 12500 protein structures have been deposited into the Protein Data Bank PDB and 86500 protein sequence entries were contained in SwissProt protein sequence database This is a 17 ratio 7 relatively few structures are known The number of sequence will increase much faster than the number of structures due to advances in sequencing Protein Basic Structure A protein is made of a chain of amino acids There are 20 amino acids found in nature Each amino acid is coded in the DNA by one or more codons ie athree base sequence Transcription and Translation Translation and Transcription mum DMAunnd ATE neT GET 66A TAG TAC an CCA CCT A n 7 sense DMAunnd unmiptinn hrtcndnn com I us ecu Imu 66A uAn redundancy mp codaquot lawman anllcn nn um an A ecu mum m RNA 9 Iv sly melllia 39ne gl39yciile alanine glycine a of profein gynth w Fromhttpwwwagenu educhynage2062lectlec1707of771aGIF Finding the Protein Sequence 0 From DNA sequence 0 From protein sequencer 0 From mRNA sequence Measuring Protein Structure 0 Determining protein structure directly is a 7 5 5 4 difficult 2 1 X ray diffraction studies 7 must first be able to crystallize the protein and then calculate is structure by the way it disperses X rays 2345673 1 0 h ru Fromhttpwwwuniwuerzburgdemineralogiecrystalteachinginviahtml Measuring Protein Structure 0 NMR 7 Use nuclear magnetic resonance to predict distances between different functional groups in a protein in solution Calculate possible structures using these distances hmpwww cisn39teduhtbooksnmrinside htm Why not stick to these methods Xray Diffraction 7 7 Only a small number of proteins can be made to form crystals 7 A crystal is not the protein s native environment 7 Very time consuming NMR Distance Measurement 7 7 Not all proteins are found in solution 7 This method generally looks at isolated proteins rather than protein complexes 7 Very time consuming Four Levels of Protein Structure Primary Structure 7 Sequence of amino acids Secondary Structure 7 Local Structure such as helices and 7 sheets Tertiary Structure 7 Arrangement of the secondary structural elements to give 3dimensi0nal structure of a protein Quaternary Structure 7 Arrangement of the subunits to give a protein complex its 3dimensi0nal structure Predicting Protein Structure from the Amino Acid Sequence Goal Predict the 3dimensi0nal structure of a protein from the sequence of amino acids primary structure Sequence similarity methods predict secondary and tertiary structure based on homology to know proteins Secondary structure predictions methods include Chou Fasman GOR neural network and nearest neighbor methods Tertiary structure prediction methods include energy minimization molecular dynamics and stochastic searches of conformational space Sequence similarity methods These methods can be very accurate if there is gt 50 sequence similarity They are rarely accurate if the sequence similarity lt 30 They use similar methods as used for sequence alignment such as the dynamic programming algorithm hidden markov models and clustering algorithms Secondary Structure Prediction Al gorithms These methods are 7075 accurate at predicting secondary structure A few examples are 7 Chou Fasman Algorithm 7 Gamier OsguthorpeRobson GOR method 7 Neural network models 7 Nearestneighbor method ChouFasman Algorithm Analyzed the frequency of the 20 amino acids in helices sheets and turns Ala A Glu E Leu L and Met M are strong predictors of helices Pro P and Gly G break helices When 4 of 5 amino acids have a high probability of being in an helix it predicts a helix When 3 of 5 amino acids have a high probability of being in a strand it predicts a strand 4 amino acids are used to predict turns GamierOsguthorpeRobson Method ChourFasman assumes that eaeh thdmdual ammo and th uehees secondary structure OR assumes the the ammo ands ankmg the eehtaa1 tho and also th uehee the secondary structure Hehee ttuses awmdow of 17 ammo ands 8 on eaeh stde ofthe eehtaa1 ammo aetd Eaeh ammo aetdth the wthdow aets thdepehdehtly on th uehethg structure to save eomputatmha1 tame Certam pathewtse combinations of am 0 ands m the wthdow also contribute to th uehethg structure HydrophobicityHydrophilicity Plow Charge ammo ands ate hydrophxhc t e Asp D Glu E LYS K Arg R hydtophobte t e Ala A Leu L De 1 Val V1gthe F Trp WMetMPm P Lu hehx hydrophobe ammo ands hught hhe up on one state Whl eh suggests that that 51 else is on the thteh or of a pmteth or pmteth complex meBMV vmn mum Mamquot My by mm mm rHelIrAvlrulylmby ccc Neural Network Model ural netwotkts txatnedto Tnts ts wetgnts on the connecttons ofthe connectaons etween the nodes The neural network can be used to predmtthe secondary s ntctune on test pnotetns These methods are over 70 accurate m ewmk Rost and Sander Neural Network Model j a pm 4 rmmnwymet Stuumldetmm My bYDzwd Mum Nearest Neighbor Method Like neural networks this is another machine learning approach to secondary structure prediction A very large list of short sequence fragments is made by sliding a window nl6 along a set of 100 400 training sequences of know structure but with minimal similarity A samesize window is selected from the query sequence and the 50 best matching sequences are found The frequencies of the of the secondary structure of the middle amino acid in each of the matching fragments is used to predict the secondary structure of the middle amino acid in the query window Can be very accurate up to 86 Energy Potential Functions Contains terms for electrostatic interatction van der Wals forces hydrogen bonding bond angle and bond length energies Common software packages have their own implementation Charmm ECEPP Amber Gromos and CVF Structural predictions only as good as the assumptions upon which it is based mainly the energy potential function Bonded Terms Bond Length Ebondlenth 7 bonds kbr 7 r0 2 Bond Angle Ebondangle 7 angle k 7 7 702 Bonded Terms Dihedral Angle Edihedralangle 7 dihedrals K7 1 cos n R NonBonded Terms Lennard Jones potential van der Waals force 7 12 6 Eva 7 ij Aijrij Bijrij repulsive dispersion Electrostatic interactions r Eelec 7 ij lin4 77793 7 permittivity of free space 7r dielectric constant of medium around charges NonBonded Terms Hydrogen Bonding 7 Some atoms 0 N and to a lesser degree S are electronegative ie the attact electrons to ll their valence shells Hydrogen tends to donate electrons to these atoms forming hydrogen bonds This is common in water Salt Bridges 7 A positively charged lysine or arginine residue can form a strong interaction with a negatively charged aspartic acid or glutamic acid residue Energy Minimization Assumes that proteins are found at or near the lowest energy conformation Uses a empirical function that describes the interaction of different parts of the protein with each other energy potential function Searches conformation space to find the global minimum using optimization techniques such as steepest descents and conjugate gradients To avoid the multipleminima problem approaches such as dynamic programming or simulated annealing have been used Molecular Dynamics Fi miai force by Newton s Second Law of Motion ai dVidt acceleration Vi dridt velocity dEdri Fi Work force X distance dEdri midzridt2 put it all together Molecular Dynamics Model System 7 Choose protein model energy potential function ensemble and boundary conditions Initial Conditions 7 Need initial positions of the atoms an initial distribution of the velocities assume no momentum ie 7i mivi 0 and the acceleration which is determined by the potential energy function Boundary Conditions 7 If water molecules are not being explicitly included in the potential function the solvent boundary conditions must be imposed The water molecules must not diffuse away from the protein Also usually a limited number of solvent molecules are included Molecular Dynamics Integration Algorithm 7 Solve the equations of motion with an algorithm that conserves energy and momentum is computationally efficient and allows a large time step Examples 7 Verlet Algorithm 7 Leapfrog Algorithm 7 Velocity Verlet 7 Beeman s algorithm Constraints BINF 730 Lecture 4 Finish Sequence Alignment and Multiple Sequence Alignment Sequence Alignment NeedlemanWunch Algorithm 7 global alignment xed gap penalty WatermanSmithBeyer Algorithm7 local alignment affine gap penalty function Gotoh s algorithm 7 local alignment affine gap penalty function Efficiency of Algorithms NeedlemanWunch Algorithm 7 Onm Onz for n gt m Waterman SmithBeyer Algorithm7 Onmnm On3 for n gt m Gotoh s algorithm 7 Onm Onz for n gt7 m NeedlemanWunch Algorithm Global Alignment n 0 7 712 Dn ZWQIN m Fe b Dwa Wltb1 Vijgt 0 DH min DHVH wqb D where w is the weight of a gap wlta7gt WatermanSmithBeyer Algorithm Local Alignment DUEl 0 Du g Du g0 gkl m gt o 13 rut1 mania Ja gltk where gk 15mg gap pmzzy mctmn and w 15th similarity scorz mctmn Gotoh s Algorithm 7 Consider the gapless sequences a and b Let gk 1 k be an af ne gap penalty function and let wa1bj be a cost function D is the distance matrix P is the matrix with the minimal distances for all alignments with b ending in a gap Q is the matrix with the minimal distances for all alignments with a ending in agap Local Alignment Lhu 0 Eu gm Dw 0 E 1 DH miquot DHH WHJ7 Q 1 iggiaiiggbi Pm D 1 mm go and 39 amp game DHg1 Gotoh 5 Algorithm Uses dynamic programming with three matrices instead of l to store information so they do not have to be recaculate Traceback 7 need to track movement through all three matrices Tandem Repeats A tandem repeat or tandem duplication is a subsequence of characters that repeats itself multiple times in a sequence with the copies adjacent to each other Example ACGA CCGTAGA TACCG ACGAG CCGTAGAA CCGTAGAA CCGTAGAA TACCG Tandem Repeats Tandem repeats are found throughout prokaryote and eukaryote DNA In humans they are known to cause at least 10 inherited neurogical diseases Huntington s disease and myotonic dystrophy They are also implicated in diabetes epilepsy and ovarian an 0 er can hey are used in DNA ngerprinting and can differentiate between bacterial strains including anthrax Tandem Repeats If we assume that 1 There is no removal of copies of the duplicated tandem repeat region 2 Tandem duplication occurs before the other types of operations Benson showed that Gotoh s algorithm can be extended to align sequences with tandem repeats The duplication cost is dly51 Tandem Repeats DuEl 0 Du gU 1 gm DH min Mania Dtpllw Iglslal 1g Slbl 13mm min kw2w Nani 1 M lmy t and D w r m1nltdkDa Wmu 1 kElEHJ t Dtpl1j min K copies ofthe word by J Wraparound Method When aligning a sequence with tandem repeats use the Wrap around method to minimize calculations When implementing the wrap around method look at the section with tandem repeats separately Write the repeated sequence only once in the similarity matrix Align as usual except when reaching the end of the repeated sequence use that value as the rst value in the next row and repeat this procedure Wraparound Method Wraparound Algorithm When developing a dynamic programing implementation for the wraparound algorithm there is a problem with determining the Q matrix In order to de ne QM it is necessary to know QL Hence there must be two passes to correctly detemine Q Wraparound Method Wraparound Method Wraparound Example In the following example a Needleman Wunch algorithm with mismatch cost 1 and match cost 0 is implemented The rst pass yields the top matrix The second pass corrects entries in the rst column as seen in the bottom matrix Wraparound Method Wraparound Method Similarity Methods a Maximize the similarity between two sequences rather than minimizing the distance 5 Use similarity score function sxy to compare characters x and y with sxy gt 0 for x y and sxy lt 0 for x gtlty The score for a gap can be different than the score for a mismatch 8 transition a substitution of a purine for a purine or a pyrmidine for a pyrimidine if the alignment for a random sequence is negative which is guaranteed by the PAM matrices Similarity Methods d tranversion a substitution ofa pyr for a pur or a pur or a pyr e Once again use dynamic programming algorithm f This can be used for local or global alignments Multiple Sequence Alignment Up until now we have aligned two sequence at a ime There are times when we would want to ali n more than two sequences simultaneously 7 ie looking for consensus sequences or looking for homology in a family of proteins Multiple Sequence Alignment The text presents five methods Dynamic Programming Gibbs Sampler MaximumWeight Trace Hidden Markov Models Steiner Sequences Multiple Sequence Alignment Dynamic Programming This is an extension of the pairwise sequence alignment we have discussed Assume we want to align k sequences of length at most k Develop a cost function wa1 ak that will align multiple letters at a time Same as before except the distance matrix is k dimensional and the weight function compares k letters Dynamic Programming for Multiple Sequence Alignment Assume that we are trying to align three sequence ab and c Also assume that we have a cost function wxyz that computes the cost of comparing x y and z in sequences a b and c respectively 0 x y z wx y z l 20f 3 symbolsarethe same 2 x 7 y 7 z Dynamic Programming for Multiple Sequence Alignment Alternatvively we can de ne our distance matrix D y D444 w bJ Ck D a WltbJ Ck Vijkgt 0 D min DHVHJ Waby Multiple Sequence Alignment Dynamic Programming This would take nk time with can be long if for large n an NPcomplete problem 7 this means that is it the length of the sequence or number of sequences get large if is extremely computationally expensive to solve Need more ef cient methods Gibbs Sampler The Gibbs Sampler is based on the Gibbs Distribution pages 4752 Basically perform local alignments on windows subsequences of fixed size The optimal alignment is found after varying the window size and repeating Original Gibbs Sampler looks for motifs in a protein Gibbs Sampler Algorithm Identify random size w segments or windows from m amino acid sequences 2 Choose one of the m proteins at random and temporarily call it P where r1 rn is its amino acid sequence 3 De ne a 20 x m frequency matrix Q where QHi N Jml and NRi is the number of occurrences of residue r at positionj in the ml remaining proteins Gibbs Sampler Algorithm For i l to lPlw compute the probability of that the subsequence of P of size W beginning at position i is generated by Q ie Pi ij1Qnm Choose the starting position of the window in P randomly according to the probability Pi k U39 Gibbs Sampler Algorithm 5 Compute a random number 2 between 0 an 1 and determine the smallest i0 for which zlt qj Set the new starting position for the window in P at position i0 0 If convergence has occurred stop otherwise go to step 2 Complete MaximumWeight Trace Addresses a subclass of multiple sequence alignment problems namely the sumof pairs multiple sequence alignment problem multiple sequence alignment problems that merges pairwise sequence alignments This method uses the creation of alignment graphs Complete Maximum Weight Trace For each pair ofsequences there is a complete alignment graph The graphs consists of nodes which represent the characters in the sequence connected by edges On each edge e there is a positive weight we The cost ofthe alignment is the sum ofthe weights between pairs of sequences Search for the alignment graph that gives the maxi um trace Complete Maximum Weight Trace Example 7 consider two sequences AACG and AGG Complete Maximum Weight Trace Example 7 Here is a possible set of edges called a trace The sum ofthe weights on the edges in a trace is the weight ofthe trace Complete Maximum Weight Trace Example 7 Here is a possible set trace that is not allows since two edges cross Hidden Markov Models View the sequences to be aligned as a training set of observations that differ from an ancestor sequence as the result of a stochastic processes The stochastic model that can best account for the sequences in the training set is determined This method uses maximum likelihood and expectation minimization Hidden Markov Models This alignment is determined by how the sequences would match up with the ancestor sequence We will discuss Hidden Markov Models in Detail later Steiner Sequences This is related to multiple sequence alrgnment In a method of DNA sequencing called singlemolecule DNA sequencing a single stranded DNA molecule is cut a single base pair at a time The freed base ows down a glass tube by an optical sensor that deterrnrnes the base Steiner Sequences This technique has errors especially near the ends of the DNA lfthis method is repeated many times many copies of erroneous DNA are generated This method computes an alignment of these sequences to find the actual sequence Steiner Sequences sequence to be deterrnrned let MiN Partition the M sequences into M3 groups of three sequences Apply dynamic programming With cost function Wxyz given earlier to determine an op 39m alignment 0 e ee sequences U VW in each group From this alignment de ne the consensus sequence S obtained by taking the majority symbol in each column and then removing all occurrences o 1 Given N erroneous copies of an original DNA to L Steiner Sequences 4 This consensus sequence S is a Steiner sequence for UVW meaning that S satis es DSUDSVDSW minDTUDTVDTW The previous step yields M3 Steiner sequences If M3 the stop and output the resulting sequence Otherwise let MM3 and return to ste 2 V39 Genomic Rearrangements The genes are located in the chromosomes Chromosomes can trade material intra or interchromosomally These events are rarer that point mutations Hence they can be used to determine similarity between very distant organisms lntrachromosomal Events Inversion of a continuous segment of genes from 53 to 35 or vice versa Duplication of pieces of the chromosome possible caused by transposons Transposition of a segment of a gene from one place to another place in the same chromosome Interchromosomal Events Reciprocal translocation 7 end segments of W0 chromosomes are exchanges with each other Chromosomal duplication 7 the number of chormosomes is double Fission 7 one chromosome is broken into pieces Fusion 7 two chromosomes are combined into one Synteny Consider the genome as a distinct unordered set of genes Two genes are said to syntenic if they both lie on the same chromosome Chromosomes can then be considered to be syntenic sets Then these sets can be transformed by ssion fusion and reciprocal translocation Synteny Consider the synteny sets of current organisms Is it possible to create a synteny set of an ancestral organism How many chromosomes did an ancestral species have Which genes were on which chromosomes How do the phylogenetic trees compare to those generated by sequence alignment Lecture 9 Database Searching Database Searching for Similar Sequences 0 Database searching for similar sequences is ubiquitous in bioinformatics Databases are large and getting larger 0 Need fast methods Types of Searches Sequence similarity search With query sequence Alignment search With pro le scoring matrix With gap penalties Serch With positionspeci c scoring matrix representing ungapped sequence alignment Iterative alignment search for similar sequences that starts With a query sequence builds a multiple alignmnet and then uses the alignment to augment the search Search query sequence for patterns representative of 5 protein fam1 e From Bioinformatics by Mount DNA vs Protein Searches DNA sequences consism of 4 characters nucleotides Protein sequences consist of 20 characters amino acids Hence it is easier to detect patterns in protein sequences than DNA sequences Better to convert DNA sequences to protein sequences for searches Database Searching Efficacy To evaluate searching methods selectivity and sensitivity need to be considere Selectivity is the ability of the method not to find members known to be of another group ie false positives 0 Sensitivity is the ability of the method to find members of the same protein family as the query sequence Protein Searches 0 Easier to identify protein families by sequence similarity rather than structural similarity same structure does not mean same sequence 0 Use the appropriate gap penalty scorings 0 Evaluate results for statistical significance History Historically dynamic programming was used for database sequence similarity searching Computer memory disk space and CPU speed were limiting factors Speed still a factor due to the larger databases and increase number of searches FASTA and BLAST allow fast searching History The PAM250 matrix was used for a long time It corresponds to a period of time where only 20 of the amino acids have remained unchanged BLOSUM has replace PAM250 in most applications BLAST use the BLOSUM62 matrix FASTA uses the BLOSUM50 matrix Dynamic Programming Use SmithWaterman algorithm or an improvement thereof for local alignment Compares individual characters in the full length sequence Slower but more sensitive than FASTA or BLAST FASTA Fast alignment of pairs of protein or DNA sequences Searches for matching sequence patterns or Words called ktuples corresponding to k consecutive matches in bo Local alignments are build based on these matches Better for DNA searches than BLAST ktuple can e smaller than minimum of7 for BLAST FASTA Algorithm FASTA uses a search for regions of similarity by hashing ln hashing a lookup table showing the positions of each ktuple is made for each sequence The relative positions of the ktuple in each sequence are calculated by subtracting the postions of the first characters Ktuples having the same offset are considered to be aligned FASTA Algorithm The number of comparisons increases linearly with the average sequence length In dynamic programming and dot plow the number of comparisons increases as the cube or square of the length respectively Signi cance of FASTA Searches 0 The average score is plotted against the log of the average sequence length in each length range A line is fit with linear regression and the z score is the number of standard deviations from the fitted line 0 A statistical distribution of alignment scores can be used to determine probabilities Versions of FASTA 0 There are many versions of FASTA FASTA 7 compares like query sequence to library 0 TFASTA 7 compares unlike query sequence to libr FASTFTFASTF 7 short fragments FASTXFASTY 7 compares DNA in all forward reading frames BLAST 0 Basic local alignment search tool Faster than FASTA Searches for words of common length in both query sequence and library 0 Confines search to the words that are the most significant FASTA finds all words 0 Significance of word matches is calculated using BLOSUM62 and the log odds score BLAST Algorithm 0 The query sequence if filtered to remove regions of low complexity not useful for meaningful sequence alignmenm 0 A list of 3 character words in the query sequence is made stepping forward on character at a time 0 The query sequence words are evaluated for an exact match with a word in the database log odds scores using BLOSUM62 BLAST Algorithm 0 The neighborhood word score threshhold cutoff score to retain only the 50 most significant ones 0 An efficient search tree is made using the high scoring words 0 The database is searched to find matches for the 50 significant words These are used as seeds for ungapped alignment with the query sequence BLAST Algorithm 0 The alignmenm are extended as long as the similarity score increases and if overlap they are combined 0 These highscoring segment pairs are matched in the entire database and listed 0 The statistical significance for these are calculated Lecture 7 Hidden Markov Models Additional Reference 0 Biological Sequence Analysis by R Durbin S Eddy A Krogh and G Mitchison Uses of Hidden Markov Models Modeling stochastic processes Sequence alignment Phylogenetic tree construction Microarray data analysis clustering Protein secondary structure prediction RNA secondary structure prediction Ion channel modeling Markov Model A process is Markov if it has no memory that is if the next state it assumes depends only on its present state and not on any previous states The states can be observed and the transition probabilities between states is known Example rolling a die has 6 possible states each with a probability of 16 Markov Model Example 2 Ion Channel Z l vjm Vij ib Vazij JIM WIW From Jafri et al 1998 0 Conversion between states determined by transition probabilities Multiple states to re ect properties of channel Markov Model Example 3 CpG Islands In the human genome the dinucleotide CG occurs called CpG is often methylated The methylated C often mutates into a T resulting in a lower frequency of CpG dinucleotides than would be expected In regions such as the start region of many genes the methylation process seems to be suppressed resulting in a higer frequency of CG dinucleotides These are called CpG Island CpG Islands Hence the CpG Island might be a good indicator of the start of a gene This leads to two questions How can we tell if a region is a CpG Island or not How do we identify CpG Islands in a long sequence of DNA CpG Islands We can create a Markov model to generate the CpG Island regions starting with a Markov model that generates a DNA sequence Markov chain On the arrows are transition probabilities astPltxrtIxnsgt Z 6 ll 11 E6 DNA sequence model The probability of the sequence is PX PXLXL1X1 PXL XL1 X1PXL1XL2X1PX1 PXLI XL1PXL1XL239 PX1 Whyl PX1 Uli1 axil xi 2 6 ll 11 E6 DNA Sequence Model We can model the beginning and end of the sequences by adding a beginning state and end state with PX1S aBS and PEXLatE tn n 69 A Terminology The state sequence is called the path 7 The ith state in the path is called 751 The chain is characterized by parameters called transition probabilities ak1 PTtil TEiZk The transition probability a0k from the begin state to k can be thought of as the probability of starting in state k In addition to haVing different states the chain consists of symbols b There is an emission probability ekb Pxib nik Example Assume three coins are used each with a different bias for heads The first coin is fair with PH05 the second coin has PH075 and the third coin has PH0 1 Also assume that the rst coin is chosen at random If the rst coin is chosen than either the second or third is chosen next with equal probability If the second or third coin is chosen any of the three are chosen next with equal probability initial probability transition probability emission probability Example CpG Islands We can create a hidden Markov model to generate the CpG Island regions The states are for nucleotides found in the CpG Islands The states are nucleotides not found in the CpG Islands There is a complete set of transitions within each set of states 66 G 0 Using the hidden Markov Model we can Example CpG Islands compute the transition probabilities by Cst an Zt39c 39 A C G T A C G T A 0180 0274 0426 0120 A 0300 0205 0285 0210 C 0171 0368 0274 0188 C 0322 0298 0078 0302 G 0161 0339 0375 0125 G 0248 0246 0298 0208 T 0079 0355 0384 0182 T 0177 0239 0292 0292 Example CpG Islands For discrimination the log liklihood ratio is calculated by PXl m 1 l 3X 0 PM 0 gm 2511 pL A c G T A 0740 0419 0580 0803 c 0913 0302 1812 0685 G 0624 0461 0331 0730 T 1169 0573 0393 0679 The occasionally dishonest casino 0 Uses a fair die most of the time but switches to a loaded die 0 The loaded die has probability of 05 of a 6 and 01 for the other outcomes 0 The fair to loaded transition probability is 0 The loaded to fair transition probability is The Occasionally Dishonest Casino 005 01 Fair Loaded The Occasionally Dishonest This is a hidden Markov model because while we can see the rolls of the die we do not know which rolls are with a fair die and which rolls are with a loaded die All we can see are the die rolls that are emitted by the model Hence we do not know the sequence of states The joint probability of an observed sequence X and a state sequence 11 is PXa 7T 30m Uli1 eni Xi ani17r1 Most Probable State Path From Durbin et al Given a sequence of emissions we want to nd the most probable sequence of states that would yield the emitted sequence 7 wgmmg POM Viterbi Algorithm Suppose that the probability vki the Viterbi variable of the most probably path ending in state k with observation i is known for all states k Then these probabilities can be calculated for observation KM as V1044 510944 makakG akl All paths start in the begin state so v00 l Viterbi Algorithm Initialization i0 V001 Vk00 for k gt 0 Recursion ilL V1ie1XimanVkilakl ptriargmanVki lakl Termination PXnmaxkvkLak0 nLargmanVkLao Traceback iL l ni1ptri1ti Viterbi Algorithm Multiplying many probabilities together results in very small numbers that will give under ow This and other algorithms should be done in log space logv1i so that the products become sums and the numbers stay reasonable The Forward Algorithm The number of paths 7 increases exponentially with the length of the sequence The forward algorithm efficiently calculates the probability of a sequence by assuming the most probably path 75 is the only path with significant probability The probability of the observed sequence up to and including Xi with TEiZk is fkiPX1 Xi TEiZk the forward variable The recursion equation is f1i1 61Xi1zk fkiakl The Forward Algorithm Initialization i0 f00l fk00 for kgt0 Recursion i1L f1i e1xi2k fkilak1 Termination PX 2k fkLak0 Backward Algorithm and Posterior State Probabilities While the Viterbi algorithm finds the most probable path through the model we might want to know what the most probable state is for an observation X The probability that an observation Xi came from a state k given the observed sequence is the posterior probability ie PTtikX PTtilX PTtil X PX PWik X fk bk where bklPXi1 XLI TEiZk the backward variable 1 The Backward Algorithm Initialization iL bkLak0 for all k Recursion iLl l b1i 21 ak1e1Xi1 bkil Termination PX 21 a01 61X1 b1l Posterior Probabilities Flair o w o o o m o m o o m Ln 0 o o c Figure 36 The paxleriar probability nf being in Iht stale correxpmxding m the fair die in the casino example The x axis shows the number afrhe roll The shaded areas xhaw when the rat was generated by the loaded die From Durbin et al Posterior Decoding The posterior probability is useful for two different forms of decoding in additiion the Viterbi decoding discussed previously These are used when there are more than one path that has similar probability as the most probable one One approach is to use an alternative path to look at a state assignment at a particular point i Posterior Decoding The second approach can be used when something other than the state sequence might be of interest For example supposed that the probability of switching from fair to loaded is 001 no switch 099 The Viterbi algorithm does not visit the loaded die state but posterior decoding can determine when the loaded die state might be visited Film u V From Durbin etal a mo 2m 390 Am 5w am ma son 9501 Figun 37 17wpmymornmbubrlm airw my hungum m mmx prulm Wm run m m H mm m m lnmlmltlw w mm m Parameter Estimation for HMMs In the example given the initial transition and emission probabilities are known In specifying a model to describe data these are usually unknown Furthermore the structure of the HMIVI is also unknown Training and Testing the HMM The parameters of the model are fit on a training set ie the parameters are chosen so that the training set is the most likely outcome for the model A test set is used to make sure the model is welltrained If so the model can be used on new data Parameter Estimation when the state sequence is known If we know all the paths we can count the number times a particular transition or emission occurs out of the total number of times that it was possible to get a maximum likelihood estimate for the transition and emission probabilities ak1 Akl 21 Akl ekb Ekb Zb Ekb This can be proven to be the NILE Estimation when the paths are unknown In this case there is no closed fOI H l estimate for the transition and emission probabilities Instead an iterative method must be used to estimated the values for the transition and emission probabilities using current values The new values replace the old values and the iteration continues until convergence occurs This procedure is called the BaumWeld Algorithm BaumWelch Algorithm Calculates the transition and emission matric as the expected number of times each transition or emission is used given the training sequence To do this the forward and backward values are used The probability that ak1 is used at position i in sequence X is P751k75111X9 fkiaklelXi1bli1PX BaumWelch Algorithm The expected number of times ak1 is used is determined by summing over all positions and all training sequences Akl Zj UPC 21 k akl 61in1bjli1 Where is the forward variable calculated for sequence j and bi is the backward variable calculated for sequence j The new model parameters are calculated by ak1 Akl 21 Akl BaumWelch Algorithm The expected number of times that the letter b appears in state k is Ekb Zj 1PXj 21 k 511 Where the inner sum is only over positions I for which the symbol emitted is b The new model parameters are calculated by ekb Ekb Zb Ekb BaumWelch Algorithm Initialization Pick arbitrary model parameters Recurrence Set all the A and E variables to their pseudocount values r or to zero For each sequencej ln Calculate fki for sequence j using the forward algorithm Calculate bki for sequence j using the backward algorithm Add the contribution of sequence j to A and E Calculate the new model parameters Calculate the new log likelihood of the model Termination Stop if the change in the log likelihood is less than some prede ned threshold or the maximum number 0 iterations is exceeded Parameter Estimation with HMM With hidden Markov models we can calculate the most likely emission and transition probabilities from the sequence of outcomes 005 005 01 0 l Fair Loaded Fair Loaded HMM Topology Thus far we have studied how to determine the unknown parameters for a model of know topology Use knowledge about the process being described to decide the topology Picking a very general topology and letting the model fit itself by reducing unused connections to low probability is not a good approach since the model gets caught in local maXima Silent States Silent states or null states are states that do not emit symbols In the preVious example the begin and end states were silent These can be added anywhere in the model 20 HMM of E Coli Gene match insert gtgtgt gt delete From Mount HMM for nding the most probable set of genes inE coli gene sequences of unknown gene composition A similar model exists for each of the 61 codons HMMs for Multiple Sequence Alignment 3n states Where n is the average sequence length N matching states along the backbone With and additional insertion or deletion state so that variable length sequences can be accommodated The training set of sequences to be aligned is treated as a collection of observation sequences Once the HMM is trained each sequence from the training set can be scored using the Viterbi algorithm Which gives rise to the path of matching inserted and deleted states HMMs for Multiple Sequence Alignment quot in i ll L mt ml ml mm mm In quotmum Winm Antumquot mm m mi me mil mzmi l l39i 3 I 1 mm MnmmumnunlgtHummu rnmmumi l unJlum Rim MW Sequence length 4 From Clote and Backofen 4 matching states 5 insertion states and 4 delete states m0 m5 and the delete states are silent HMMs for Multiple Sequence Alignment Consider the sequences GGCT ACCGAT and CT A er convergence of BaumWelch algorithm the Viterbi path GGCT m0 ml m2 m3 m4 m5 ACCGAT m0 i0 ml d2 m3 i3 i3 m4 m5 CT m0 ml d2 d3 m4 m5 This yields C C g a T m0 i0 m1d2 m3 i3i3 m4 m5 G G c T m0 m1m2m3 m4 m5 C T m0 m1d2 d3 m4m5 BINF 730 Lecture 2 Sequence Alignment DNA Sequence Alignment Why Recognition sites might be common restriction enzymes start sequences stop sequences other regulatory sequences Homology evolutionary common progenitor Mutations Deletions Insertions Transitional Substitution purinepurine AG pyrpyr TC Translational Substitution purpyr pyrpur Example Start with ACGTACGT after 9540 generations with the following probabilities Deletion 00001 Insertion 0001 Transitional substitution 000008 Translational substitution 000002 Example The two new sequences are ACG TACGT ACACGGTCCTAATAATGGCC ACGTACGT CAG GAAGATCTTAGTTC Example However if we align the two sequences by superposition ACAC GGTCCTAATAATGGCC CAG GAA G AT CTTAGTTC Example or using Gotoh s algorithm with mismatch penalty 3 and gap penalty function gk 22k for length k gap ACACG GTCCTAATAATGGCC CAGGAAGATCT TAGTT C The alignment depends on algorithm used Protein sequence alignment A Homologous proteins i Evolutionary common origin ii Structural similarity iii Functional similarity B Conserved regions i Functional domains ii Evolutionary similarity iii Structural motif Example 32 lmml 2 i i LEGTQTPVSlKLi fPMWfIEARLG KPHIQRLLD l I a l l l um PlSPlETVPVIfLKPGMDGPKU KQW EEKIRMJIWCTEMEK 39PCQSP LW39 PLLPVK1ltPGINDYVQDLREVNKBHI V l 7 Will ill ll lMMEHIl H GPENPYNTPV39FAIKKKUS I ll LVDFRELNKRTODFWEV u lmvul MC 39 i ii i 39 quotquot SEMI T PLi AFZEWw anPEMGISGQLTW39l Ri w 1 1 Mi mm 1 y H l HMMI Inn 2 1 1 J u FSVPLDEDFKhYTAFTTPSJNID TPGTRYQ quotNVTAFQGWKGS Figure 32 Part ol39an alignment umelx reverse lrumcriphbe PDB cod IMML and HIV1 mm lmnscnpmsc I DB code IK I39Hy The Alignments wrle produced mg m mu package mas with rhc BlosumSO subslilmion mam H39H92 The sequence of um um aim of both Chaim are wian in black wnuc all other amino acids are pruned in gray 39l39hicker gray bun mde an exact mulching at amino acids whereas me humor gray bms indicare mm acids mm are main ummdmg in HM Blossumjt imllunly Minx Choosing the best alignment Every alignment has a score Chose alignment with highest score Must choose appropriate scoring function Scoring function based on evolutionary model with insertions deletions and substitutions Use substitution score matrix contains an entry for every amino acid pair Substitution score matrix A D K S s sA SSD SSK SKA SRD SRK S KA SKD SKK Scoring Strategies 0 Ad hoc method a biologist can set up a score matrix that gives a good alignment 0 Use physicalchemical properties 0 Statistical approach Statistical approach Let s and s be two amino acid sequences of length n that we want to compute an alignment score Assume only substitutions occur no insertions or deletions Works for local alignment Odds Ratio and Log Odds Ratio Odds Ratio and Log Odds Ratio The score for aligning s and s is based on the comparison of the hypothesis that the two sequences are generated randomly with the hypothesis that they come from a common ancestor Assume qA is the probability of producing amino acid A in model R based on the relative frequency at which A is found in proteins The probability for the null hypothesis that s and s do not stem from a common ancestor is Odds Ratio and Log Odds Ratio The second hypothesis homologous hypothesis that s and s arise from a common ancestor sequence r of length n is based on the evolutionary model E The probability that the amino acids A and B are aligned and hence have been derived from an ancestor amino acid C is given by p AB is given by How this probability is determined will be explained later Odds Ratio and Log Odds Ratio The odds ratio compares the homologous hypothesis with the null hypothesis To achieve a scoring function that is additive rather that multiplicative the log odds ratio can be used Point Accepted Mutation PAM and Amino Acid Pair Probabilities We mentioned that we must choose an appropriate evolutionary model EpABAB for the homologous hypothesis ie we have to find p AB for each pair of amino acids A and B Since we are using a statistical approach this has to be estimated from data If we know that two sequences 5 and s are homologous we could estimate p AB by finding the value of p AB that would maximize PEPABABISS PAM and Amino Acid Pair Probabilities This can be done by using the maximum likelihood approach section 216 pp 5253 Clote and Backofen Lagrange Multipliers Section 22 Clote and Backofen Appendix Chapter 3 Clote and Backofen Lecture 8 Gene Prediction Saleet Jafri BIN39F 730 Analysis by sequence similarity can only Frequ easily Gene Prediction reliably identify about 30 of the protein coding genes in a genome 5080 ofnew genes identi ed have a partial marginal or unidenti ed homolog ently expressed genes tend to be more identi able by homology than rarely expressed genes Gene Finding Process ofidentifying potential coding regions in an uncharacterized region of the enome Still a subject of active research There are many different gene nding software packages and no one program is capable of nding everything Genes aren t the only thing we re looking for 39Biologically significant sites include 39Splice sites 39Protein binding sites 39DNA 3D structure features etc In a lot of cases we don39t even know what constitutes one ofthese sites so all we can do is look for repeating patterns Sun men m m l I k ccc mamm mamquot m K Pmlv 5 ccwccquc mowm m mum u 3 scrammggxxcm h m wmmm m m mmmm mullMM Eukaryotes vs Prokaryotes Eukaryotic DNA wrapped around histones that might result in repeated patterns for histone binding The promotor regions might be near these sites so that they remain hidden Prokaryotes have no introns Promotor regions and start sites more highly conserved in Prokaryotes Different codon use frequencies Gene nding is speciesspeci c Codon usage patterns Vary by species Functional regions promoters splice sites translation initiation sites termination signals Vary by species Common repeat sequences are species speci c Gene nding programs rely on this information to identify coding regions The genetic code Table of Standard Genetic Code T 39 c A 39 c 11139th a 39Tch s TAT Tyr Y 39TGT Cys c Trc ch TAc TGC TrALun L TcA TAA Tex TGA Tex 39n39G TCG AG T TGG Trp W 39 cTT Len L 39ccT Pm F cAT HIS H 39coT Arg R C cTc ccc cAc ch cTA ccA CAAGIIT Q CGA CAG cGG AGC CTG ccG 39 ATP 1 39AcTTm T AAT Asn N 39AGTSu s A ATc Acc AAc AAA Lys K AGA Arg R AAG AGG 39 61139 Va V 39GcT An A GAT Asp D 39GGT Gly G G GTC Gcc GAC ccc GTA GCA GAAGlu E GGA GTG GcG GAG GGG Codon usage Identifying ORFs Simple rst step in gene nding Translate genomic sequence in six frames Identify stop codons in each frame Regions Without stop codons are called quotopen rea 39ng frames or ORFs Locate and tag all of the likely ORFs in a sequence The longest ORF from a Met codon is a good prediction of a protein encoding sequence SOFTWARE NCBI ORF Finder ORF nder results mnmricm mum Tests of the Predicted ORF Check if the third base in the codons tends to be the same one more often than by chance alone Are the codons used in the ORF the same as those used in other genes need codon usage frequency Compare the amino acid sequence for similarity with other know amino acid sequences Problems with ORF finding A singlecharacter sequencing error can hide a stop codon or insert a false stop codon preventing accurate identification of RFs Short exons can be overlooked Multiple transcripts or ORFs on complementary strand can confuse results Pattem based gene finding ORF nding based on start and stop codon frequency is a patternbased procedure Other pattembased procedures recognize characteristic sequences associated with known features and genes such as ribosome binding sites promoter sites histone binding sites etc Statistically based Contentbased gene finding Contentbased gene nding methods rely on statistical information derived from known sequences to predict unknown genes Some evaluative measures include quotcoding potential based on codon bias periodicity in the sequence sequence homogeneity etc A standard contentbased alignment procedure Select a window of DNA sequence from the unknown The window is usually around 100 base pairs long Evaluate the window39s potential as a gene based on a variety of factors Move the window over by one base Repeat procedure until end of sequence is reached report continuous highscoring regions as putative genes Combining measures Programs rarely use one measure to predict genes Different values are combined using probabilistic methods discriminant analysis neural net methods etc to produce one quotscorequot for the entire window Drawbacks to windowbased evaluation A sequence length of at least 100 bp is required before significant information can be gained from the analysis Results in a 100 bp uncertainty in the start site of predicted coding regions unless an unambiguous pattem can also be found to indicate the start Most are webbased but Submit sequence input sequence length may be limited Select parameters if any Interpret results Most software is first or second generation39 results come in non graphical formats GRAIL Gene finder for human mouse arabidopsis drosophila E coli Based on neural networks Masks human and mouse repetetive elements Incorporates patternbased searches for several types of promoters and simple repeats Accuracy in 7595 range Glimmer Genefinder for bacterial and archaebacterial genomes Uses an quot interpolated Markov model approach a Markov model is a model for computing probabilities in the context of sequential events Predicts genes with around 98 accuracy when compared with published annotations No web server GENSCAN Genefinder for human and vertebrate sequences Probabilistic method based on known genome structure and composition number of exons per gene exon size distributions hexamer composition etc Only protein coding genes predicted Maize and arabidopsisoptimized versions now available Accuracy in 5095 range GeneMark Gene finder for bacterial and archaebacterial sequences Markov modelbased GeneMark and GeneMarkHMIVl available as web servers Accuracy in 9099 range CRITTCA Gene nder for bacterial and archaebacterial genomes Combines sequence homologybased prediction with contentbased statistical dicodon probability analysis Accuracy in 9099 range No web server GenePars er Other software Predicts the most likely combination of exons and Generation intmns using dynamic programming GeneID The intmn an exon positions are aligned subject to Genie the constraint that they altemate A neural network is usedto adjust the weights 39 Genv ew iv to the sequence indicators ofknow exon and Ecopme intmn regions such as codon usage information content 1ength distnbution hexamer frequencies emquot and scoring mahices Gene nding strategy tRNAscan for beginners Locating tRNA genes is less dif cult an other types ofgene identification pol III promoter is simple RNA secondary structure is conserved 39 SOFTWARE LRNAscanSE Choose the appropriate type ofgene nder thatyou39re using gene nders for microbial intronless sequences only to a e eria and arc aea Ifthere is no organismspecific gene nder foryour system at least use one that makes sense ie use an arabidopsis gene nder for other plants Neural Network Topology Input Layer Hidden Layer Output Layer Weight Perceptron Making Neural Networks Take known data and divide into two sets the training set and test set Use the optimize the weights so that the neural net gives the best outputs for the training set Test the neural net with the test set to see if it rks If data is limited you can permute the data so thatyou have multiple training and test sets Caveats with Neural Nets The net only performs as well as the training set In other words it can only find things it is trained to do As more diverse data becomes available the neural net gets better Grail II Neural Net Finds exons in eukaryotic genes that is takes inputs and predicts ifa gene is present Markov Model A process is Markov ifit has no memory that is ifthe next state it assumes depends only on its present state and not on any previous states The states can be observed and the transition probabilities between states is known Example 7 rolling a die has 6 possible states each with a probability of16 Hidden Markov Model Also has the Markov property Some ofthe state or transition probabilities information is missing The process emits sequences ofresults The emission probabilities is the probability of each outcome in a given state The model is trained so that the training set is the most likely outcome for the model
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'