This 15 page Class Notes was uploaded by Kari Harber Jr. on Thursday September 17, 2015. The Class Notes belongs to BSC 3402L at Florida State University taught by Staff in Fall. Since its upload, it has received 86 views. For similar materials see /class/205430/bsc-3402l-florida-state-university in Biological Sciences at Florida State University.

Date Created: 09/17/15
Steve Thompson

BSC3402L Experimental Biology: Comparative Genomics
Florida State University, The Department of Biological Science, www.bio.fsu.edu
Feb 1, 2007

Pairwise Alignment and Similarity Searching
Steven M. Thompson
Florida State University School of Computational Science (SCS)

Given nucleotide or amino acid sequences, what can we learn about biological function?

First, just what is homology and similarity; are they the same?

Don't confuse homology with similarity: there is a huge difference! Similarity is a statistic that describes how much two (sub)sequences are alike according to some set scoring criteria. It can be normalized to ascertain statistical significance, but it's still just a number. Homology, in contrast and by definition, implies an evolutionary relationship, more than just everything evolving from the same primordial ooze. Reconstruct the phylogeny of the organisms or genes of interest to demonstrate homology. Better yet, show experimental evidence (structural, morphological, genetic, and/or fossil) that corroborates your claim. There is no such thing as percent homology; something is either homologous or it is not. Walter Fitch said homology is like pregnancy: you can't be 45% pregnant, just like something can't be 45% homologous. You either are or you are not. Highly significant similarity can argue for homology, and not the inverse.

OK, how can we see if two sequences are similar?

First, to introduce the concept, a graphical method. One way: dot matrices. They provide a "Gestalt" of all possible alignments between two sequences. To begin, use a very simple 0/1 (match/no-match) identity scoring function: put a dot wherever symbols match.

[Figure: dot matrix in which identities and insertion/deletion events (indels) are identified; zero/one match score matrix, no window.]

Noise due to random composition contributes to confusion. To clean up the plot, consider a filtered windowing approach. A dot is placed at the middle of a window if some stringency is met within that defined window size. Then the window is shifted one position and
the entire process is repeated (zero/one match score, window of size three, and a stringency level of two out of three).

Exact alignment: but how can we see the correspondence of individual residues?

We can compare one molecule against another by aligning them. However, a brute force approach just won't work. Even without considering the introduction of gaps, the computation required to compare all possible alignments between two sequences requires time proportional to the product of the lengths of the two sequences. Therefore, if the two sequences are approximately the same length N, this is an N^2 problem. To include gaps, we would have to repeat the calculation 2N times to examine the possibility of gaps at each possible position within the sequences, now an N^(4N) problem. There's no way! We need an algorithm.

But just what the heck is an algorithm?

Merriam-Webster's says: "A rule of procedure for solving a problem [often mathematical] that frequently involves repetition of an operation." So you could write an algorithm for tying your shoe! It's just a set of explicit instructions for doing some routine task.

Enter the Dynamic Programming Algorithm!

Computer scientists figured it out long ago; Needleman and Wunsch applied it to the alignment of the full lengths of two sequences in 1970. An optimal alignment is defined as an arrangement of two sequences, 1 of length i and 2 of length j, such that:
1) you maximize the number of matching symbols between 1 and 2;
2) you minimize the number of indels within 1 and 2; and
3) you minimize the number of mismatched symbols between 1 and 2.

Therefore, the actual solution can be represented by:

\[
S_{ij} = s_{ij} + \max
\begin{cases}
S_{i-1,\,j-1} & \text{or} \\
\max_{2 < x < i} \left( S_{i-x,\,j-1} + w_{x-1} \right) & \text{or} \\
\max_{2 < y < j} \left( S_{i-1,\,j-y} + w_{y-1} \right)
\end{cases}
\]

where S_{ij} is the score for the alignment ending at i in sequence 1 and j in sequence 2, s_{ij} is the score for aligning i with j, w_x is the score for making an x-long gap in sequence 1, and w_y is the score for making a y-long gap in sequence 2, allowing gaps to be any length in either sequence.

An oversimplified example:

[Figure: dynamic programming matrix for the example alignment; the matrix values are not recoverable from the extraction.]
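The recurrence above can be made concrete with a minimal sketch. This is not the lecture's example or scoring scheme: it assumes a simplified variant in which every gap position costs the same fixed penalty (instead of an arbitrary gap-length score w_x), it charges end gaps as well as interior ones, and the function name and default scores are hypothetical choices made only for illustration.

```python
def global_alignment_score(seq1, seq2, match=1, mismatch=0, gap=-1):
    """Score of the best global alignment of seq1 and seq2.

    Simplifying assumption: each gap position costs `gap`, so a gap of
    length x costs x * gap. End gaps are charged too.
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # S[i][j] holds the best score for aligning seq1[:i] with seq2[:j].
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        S[i][0] = i * gap          # seq1 prefix aligned against nothing
    for j in range(1, cols):
        S[0][j] = j * gap          # seq2 prefix aligned against nothing
    for i in range(1, rows):
        for j in range(1, cols):
            s_ij = match if seq1[i - 1] == seq2[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + s_ij,  # align residue i with residue j
                S[i - 1][j] + gap,       # residue i of seq1 opposite a gap
                S[i][j - 1] + gap,       # residue j of seq2 opposite a gap
            )
    return S[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

Filling the table takes time proportional to the product of the two sequence lengths, which is the N^2 cost discussed above; recovering the alignment itself would additionally need a traceback through the table.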
Optimum alignments

There may be more than one best path through the matrix (and optimum doesn't guarantee biologically correct). Starting at the top and working down, then tracing back, one best alignment is found:

    cTATAtAagg
    cg TAtAaT

With our example's scoring scheme this alignment has a final score of 5, the highest bottom-right score in the traceback path graph, and the sum of six matches minus one interior gap. This is the number optimized by the algorithm, not any type of a similarity or identity percentage! Software will report only one optimal solution.

Do you have any ideas about how others can be discovered? Answer: often, if you reverse the solution of the entire dynamic programming process, other solutions can be found!

This was a global solution. Smith-Waterman style local solutions (1981) use negative numbers in the match matrix and pick the best diagonal within the overall graph.

What about proteins: conservative replacements and similarity as opposed to identity?

The nitrogenous bases A, C, T, G are either the same or they're not, but amino acids can be similar, genetically, evolutionarily, and structurally!

[Table: amino acid substitution matrix; the matrix values are not recoverable from the extraction.] Scores for substitutions that rarely occur go as low as -4. The most conserved residue is tryptophan with a score of 11; cysteine is next with a score of 9; both proline and tyrosine get scores of 7 for identity.

We can imagine screening databases for sequences similar to ours using the concepts of dynamic programming and log-odds scoring matrices and some yet to be described algorithmic tricks. But what do database searches tell us; what can we gain from them?

Why even bother? Inference through homology, a fundamental principle of biology!

When a sequence is found to fall into a preexisting family, we may be able to infer function, mechanism, evolution, perhaps even structure, based on homology with its neighbors. If no significant similarity can be found, the very fact that your sequence is new and different could be very important.
Granted, its characterization may prove difficult, but it could be well worth it.

Independent of all that, what is a good alignment?

So first, significance: when is any alignment worth anything biologically? An old statistics trick: Monte Carlo simulations.

\[
Z\ \text{score} = \frac{(\text{actual score}) - (\text{mean of randomized scores})}{\text{standard deviation of randomized score distribution}}
\]

The Normal distribution

Many Z scores measure the distance from the mean using this simplistic Monte Carlo model, assuming a Gaussian distribution, a.k.a. the Normal distribution (http://mathworld.wolfram.com/NormalDistribution.html), in spite of the fact that "sequence-space" actually follows what is known as the "Extreme Value distribution." However, the Monte Carlo method does approximate significance estimates fairly well.

"Sequence-space" (Huh, what's that?) actually follows the "Extreme Value distribution"

Based on this known statistical distribution, and robust statistical methodology, a realistic Expectation function, the E value, can be calculated from database searches. The particulars of how this is done will come in just a moment, but the take-home message is the same.

The Expectation Value!

The higher the E value is, the more probable that the observed match is due to chance in a search of the same size database, and the lower its Z score will be, i.e. it is NOT significant. Therefore, the smaller the E value, i.e. the closer it is to zero, the more significant it is and the higher its Z score will be! The E value is the number that really matters.

Rules of thumb for a protein search

The Z score represents the number of standard deviations some particular alignment is from a distribution of random alignments (often the Normal distribution). They very roughly correspond to the listed E values (based on the Extreme Value distribution) for a typical protein sequence similarity search through a
database with 125,000 protein entries.

On to the searches

How can you search the databases for similar sequences, if pairwise alignments take N^2 time? Significance and heuristics.

Database searching programs use the two concepts of dynamic programming and log-odds scoring matrices; however, dynamic programming takes far too long when used against most sequence databases with a "normal" computer. Remember how big the databases are! Therefore, the programs use tricks to make things happen faster. These tricks fall into two main categories, that of hashing, and that of approximation.

Corn beef hash? Huh?

Hashing is the process of breaking your sequence into small "words" or "k-tuples" (think all chopped up, just like corn beef hash) of a set size and creating a "look-up" table with those words keyed to position numbers. Computers can deal with numbers way faster than they can deal with strings of letters, and this preprocessing step happens very quickly. Then when any of the word positions match part of an entry in the database, that match, the "offset," is saved. In general, hashing reduces the complexity of the search problem from N^2 for dynamic programming to N, the length of all the sequences in the database.

A simple hash table illustration

The sequence FAMLGFIKYLPGCM with a word size of one yields this look-up hash table:

    Word:    A    C    F    G    I    K    L    M    P    Y
    Pos:     2   13    1    5    7    8    4    3   11    9
                       6   12             10   14

Comparing it to the database sequence TGFIKYLPGACT yields this offset table:

    Pos:      1    2    3    4    5    6    7    8    9   10   11   12
    Target:   T    G    F    I    K    Y    L    P    G    A    C    T
    Offset:        3   -2    3    3    3   -3    3   -4   -8    2
                  10    3                   3         3

The offset numbers come from the difference between the positions of the words in the query sequence and the position of the occurrence of that word in the target sequence. Then look at all of the offsets equal to three in the offset table. Therefore, offset the alignment by three:

    FAMLGFIKYLPGCM
       TGFIKYLPGACT

Quick and easy. Computers can compare these sorts of tables very fast. The trick is to know how far to attempt to extend the alignment out.

OK. Heuristics? What's that?
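As an aside on the hashing illustration just above, here is a minimal sketch of the look-up table and the offset tally, using the query and target sequences from the slide. The function names are hypothetical and positions are 1-based to match the tables; real search programs use larger word sizes and far more elaborate bookkeeping.

```python
from collections import Counter, defaultdict


def build_lookup(query, k=1):
    """Map each k-letter word in the query to its 1-based start positions."""
    table = defaultdict(list)
    for pos in range(len(query) - k + 1):
        table[query[pos:pos + k]].append(pos + 1)
    return table


def word_offsets(query, target, k=1):
    """Offsets (query position minus target position) for every word hit."""
    table = build_lookup(query, k)
    found = []
    for tpos in range(len(target) - k + 1):
        word = target[tpos:tpos + k]
        for qpos in table.get(word, []):
            found.append(qpos - (tpos + 1))
    return found


query = "FAMLGFIKYLPGCM"
target = "TGFIKYLPGACT"

# The most frequent offset marks the diagonal carrying the most word hits.
counts = Counter(word_offsets(query, target))
best_offset, hits = counts.most_common(1)[0]
print(best_offset, hits)
```

With these two sequences the most common offset is 3, shared by eight word hits, which is the shift used in the alignment shown above. Each target word is looked up once in the table, which is why the work grows with the length of the database rather than with the product of the sequence lengths.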
Approximation techniques are collectively known as "heuristics." Webster's defines heuristic as "serving to guide, discover, or reveal; . . . but unproved or incapable of proof." In database similarity searching techniques the heuristic usually restricts the necessary search space by calculating some sort of a statistic that allows the program to decide whether further scrutiny of a particular match should be pursued. This statistic may miss things depending on the parameters set; that's what makes it heuristic. "Worthwhile" results at the end are compiled and the longest alignment within the program's restrictions is created. The exact implementation varies between the different programs, but the basic idea follows in most all of them.

Two predominant versions exist: BLAST and FastA

Both return local alignments, and are not a single program, but rather a family of programs with implementations designed to compare a sequence to a database in about every which way imaginable. These include:
1) a DNA sequence against a DNA database (not recommended unless forced to do so because you are dealing with a nontranslated region of the genome; DNA is just too darn noisy, only identity and four bases!);
2) a translated (where the translation is done "on-the-fly" in all six frames) version of a DNA sequence against a translated ("on-the-fly" six-frame) version of the DNA database (not available in the FastA package);
3) a translated ("on-the-fly" six-frame) version of a DNA sequence against a protein database;
4) a protein sequence against a translated ("on-the-fly" six-frame) version of a DNA database;
5) or a protein sequence against a protein database.
Translated comparisons allow penalty-free frame shifts.

The BLAST and FastA programs: some generalities

BLAST: Basic Local Alignment Search Tool, developed at NCBI.
1) Normally NOT a good idea to use for DNA against DNA searches w/o translation (not optimized);
2) Prefilters repeat and "low complexity" sequence regions;
4) Can find more than one region of gapped similarity;
5) Very fast heuristic and parallel implementation;
6) Restricted to precompiled, specially formatted databases.

FastA: and its family of relatives, developed by Bill Pearson at the University of Virginia.
1) Works well for DNA against DNA searches (within limits of possible sensitivity);
2) Can find only one gapped region of similarity;
3) Relatively slow, should often be run in the background;
4) Does not require specially prepared, preformatted databases.

The algorithms in brief

[Figure: schematic of both algorithms. BLAST: two word hits on the same diagonal above some similarity threshold initiate an ungapped extension until the score isn't improved enough above another threshold (the HSP), which then initiates gapped extensions using dynamic programming. FastA: find all ungapped exact word hits; combine non-overlapping regions on different diagonals; use dynamic programming "in a band" for all regions.]

BLAST: the algorithm in more detail

1) After BLAST has sorted its lookup table, it tries to find all double word hits along the same diagonal within some specified distance using what NCBI calls a Deterministic Finite Automaton (DFA). These word hits of size W do not have to be identical; rather, they have to be better than some threshold value T. To identify these double word hits, the DFA scans through all strings of words (typically W = 3 for peptides) that score at least T (usually 11 for peptides).
2) Each double word hit that passes this step then triggers a process called ungapped extension in both directions, such that each diagonal is extended as far as it can, until the running score starts to drop below a predefined value X within a certain range A. The result of this pass is called a High-Scoring segment Pair or HSP.
3) Those HSPs that pass this step with a score better than S then begin a gapped extension step utilizing dynamic programming. Those gapped alignments with Expectation values better than the user-specified cutoff are reported. The extreme value distribution of BLAST Expectation values is precomputed against
each precompiled database; this is one area that speeds up the algorithm considerably.

The BLAST algorithm, continued

The math can be generalized thus: for any two sequences of length m and n, local best alignments are identified as HSPs. HSPs are stretches of sequence pairs that cannot be further improved by extension or trimming, as described above. For ungapped alignments, the number of expected HSPs with a score of at least S is given by the formula:

\[
E = K m n \, e^{-\lambda S}
\]

This is called an E value for the score S. In a database search n is the size of the database in residues, so N = mn is the search space size. K and \lambda are supplied by statistical theory and, as mentioned above, can be calculated by comparison to precomputed, simulated distributions. These two parameters define the statistical significance of an E value.

The E value defines the significance of the search. As mentioned above, the smaller an E value is, the more likely it is significant. A value of 0.01 to 0.001 is a good starting point for significance in most typical searches. In other words, in order to assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone.

The FastA algorithm in more detail

FastA is an older algorithm than BLAST. The original FastA paper came out in 1988, based on David Lipman's work in a 1983 paper; the original BLAST paper was published in 1990. Both algorithms have been upgraded substantially since originally released. FastA was the first widely used, powerful sequence database searching algorithm. Bill Pearson continually refines the programs such that they remain a viable alternative to BLAST, especially if one is restricted to searching DNA against DNA without translation. They are also very helpful in situations where BLAST finds no significant alignments; arguably, FastA may be more sensitive than BLAST in these situations.

FastA is also a hashing style algorithm and builds words of a set k-tuple size (by default two for peptides). It then
identifies all exact word matches between the sequence and the database members. Note that the word matches must be exact for FastA and only similar, above some threshold, for BLAST.

The FastA algorithm, continued

From these exact word matches:
1) Scores are assigned to each continuous, ungapped diagonal by adding all of the exact match BLOSUM values.
2) The ten highest scoring diagonals for each query-database pair are then rescored using BLOSUM similarities as well as identities, and ends are trimmed to maximize the score. The best of each of these is called the Init1 score.
3) Next the program "looks" around to see if nearby off-diagonal Init1 alignments can be combined by incorporating gaps. If so, a new score, Initn, is calculated by summing up all the contributing Init1 scores, penalizing gaps with a penalty for each.
4) The program then constructs an optimal local alignment for all Initn pairs with scores better than some set threshold using a variation of dynamic programming "in a band." A sixteen-residue band centered at the highest Init1 region is used by default with peptides. The score generated from this step is called opt.

The FastA algorithm, still continued

5) Next FastA uses a simple linear regression against the natural log of the search set sequence length to calculate a normalized z-score for the sequence pair. Note that this is not the same Monte Carlo style Z score described earlier, and can not be directly compared to one.
6) Finally, it compares the distribution of these z-scores to the actual extreme-value distribution of the search. Using this distribution, the program estimates the number of sequences that would be expected to have, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the Expectation value.
7) If the user requests pairwise alignments in the output, then the program uses full Smith-Waterman local dynamic programming, not restricted to a band, to produce its final alignments.

Let's see 'em in action! To begin, we'll go
to the most widely used (and abused!) biocomputing program on earth: NCBI's BLAST.

Connect to NCBI's BLAST page with any Web browser. There is a wealth of information there, including a tutorial and several very good essays for teaching yourself way more about BLAST than this lecture can ever hope for. Let's use our SRY_HUMAN SwissProt sequence, but we have to use the accession code Q05066 for NCBI's BLAST server to find the sequence. Check out how it works, what its assets and liabilities are, and how quickly we get our results back.

Contrast that with GCG's BLAST version. I'll illustrate with the same molecule, and I'll use GCG's SeqLab GUI to show the difference between the two implementations of the program.

And finally, let's see how GCG's FastA version compares to either BLAST implementation. Again, I'll launch the program from SeqLab with the same example, but this time I'll take advantage of FastA's flexible database search syntax, being able to use any valid GCG sequence specification. Here I'll search against a precompiled LookUp list file of all of the so-called "primitive eukaryotes" in SwissProt.

What's the deal with DNA versus protein for searches and alignment?

All database similarity searching and sequence alignment, regardless of the algorithm used, is far more sensitive at the amino acid level than at the DNA level. This is because proteins have twenty match criteria versus DNA's four, and those four DNA bases can generally only be identical, not similar, to each other; and many DNA base changes (especially third position changes) do not change the encoded protein. All of these factors drastically increase the "noise" level of a DNA against DNA search, and give protein searches a much greater "look-back" time, at least doubling it. Therefore, whenever dealing with coding sequence, it is always prudent to search at the protein level!

Conclusions

The better you understand the chemical, physical, and biological systems involved, the better your chance of success in analyzing them. Certain strategies are
inherently more appropriate than others in certain circumstances. Making these types of subjective, discriminatory decisions is one of the most important take-home messages I can offer!

Gunnar von Heijne, in his old but quite readable treatise, Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion: "Think about what you're doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you."

References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) Basic Local Alignment Search Tool. Journal of Molecular Biology 215, 403-410.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Research 25, 3389-3402.
Genetics Computer Group (GCG). Program Manual for the Wisconsin Package, Version 10.3, Accelrys Inc., A Pharmacopeia Company, San Diego, California, USA.
Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, New York, USA.
Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences USA 89, 10915-10919.
Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48, 443-453.
Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences USA 85, 2444-2448.
Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure (M.O. Dayhoff, editor) 5, Suppl. 3, 353-358, National Biomedical Research Foundation, Washington D.C., USA.
Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482-489.
Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences USA 80, 726-730.

BSC3402L Experimental Biology: Comparative
Genomics
Florida State University, The Department of Biological Science, www.bio.fsu.edu
Jan 25, 2007

Genomics & Web Resources: what, how, and why
Steven M. Thompson
Florida State University School of Computational Science (SCS)

To begin, some terminology

What is bioinformatics, genomics, proteomics, sequence analysis, computational molecular biology?

My definitions, lots of overlap

Biocomputing and computational biology are synonyms and describe the use of computers and computational techniques to analyze any type of a biological system, from individual molecules to organisms to overall ecology.

Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any type of biological database.

Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, interactions, evolution, and perhaps structure of biological molecules.

Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within the same and/or across different genomes.

Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

One way to think about it: the Reverse Biochemistry Analogy

Biochemists no longer have to begin a research project by isolating and purifying massive amounts of a protein from its native organism in order to characterize a particular gene product. Rather, now scientists can amplify a section of some genome based on its similarity to other genomes, sequence that piece of DNA and, using sequence analysis tools, infer all sorts of functional, evolutionary, and, perhaps, structural insight into that stretch of DNA! The computer and molecular databases are a
necessary, integral part of this entire process.

The exponential growth of molecular sequence databases & CPU power

[Table and plot: growth of GenBank by year (columns: Year, Base Pairs, Sequences); the values are not recoverable from the extraction. Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html]

Doubling time: just over a year!

Database growth (cont.)

The International Human Genome Sequencing Consortium announced the completion of the "Working Draft" of the human genome in June 2000; independently that same month, the private company Celera Genomics announced that it had completed the first "Assembly" of the human genome. Both articles were published mid-February 2001 in the journals Science and Nature.

Genome projects have kept the data coming at alarming rates. As of August 2006, 28 Archaea, 339 Bacteria, and 22 Eukaryote complete genomes, and 102 Eukaryote assemblies were represented, not counting the almost 2,000 virus and viroid genomes available.

Some neat stuff from the human genome papers

Homo sapiens aren't nearly as special as we once thought. Of the 3.2 billion base pairs in our DNA:

Traditional gene number estimates were often in the 100,000 range; turns out we've only got about twice as many as a fruit fly, between 25,000 and 30,000!

The protein coding region of the genome is only about 1% or so; a bunch of the remainder is "jumping," "junk," "selfish DNA," much of which may be involved in regulation and control.

Over 100-200
genes were transferred from an ancestral bacterial genome to an ancestral vertebrate genome! (Later shown to be false by more extensive analyses, and to be due to gene loss, not transfer.)

What are sequence databases?

Sequence databases are an organized way to store exponentially accumulating sequence data. Most have a specific format. An "alphabet soup" of three major organizations maintains most of this data. They largely "mirror" one another and share accession codes, but NOT proper identifier names:

North America: the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH), maintains the GenBank (nucleotide), GenPept (amino acid), and RefSeq (genome, transcriptome, and proteome) databases.

Europe: the European Molecular Biology Laboratory (EMBL), the European Bioinformatics Institute, and the Swiss Institute of Bioinformatics (SIB) all help maintain the EMBL nucleotide sequence database and the UNIPROT (SWISS-PROT + TrEMBL) amino acid sequence database, with PIR/NBRF support also.

Asia: the National Institute of Genetics (NIG) supports the Center for Information Biology's (CIB) DNA Data Bank of Japan (DDBJ).

A little history

Developments that affect software and the end user. The first well recognized sequence database was Dr. Margaret Dayhoff's hardbound Atlas of Protein Sequence and Structure, begun in the mid-sixties. DDBJ began in 1984, GenBank in 1982, and EMBL in 1980. They are all attempts at establishing an organized, reliable, comprehensive, and openly available library of genetic sequences. Sequence databases have long since outgrown a hardbound atlas. They have become huge and have evolved through many changes.

Changes in format over the years are a major source of grief for software designers and program users. Each program needs to be able to recognize particular aspects of the sequence files; whenever they change, it screws everything up. Database format standards are
constantly argued over: relational vs. object-oriented vs. XML vs. ASN.1, etc. Unfortunately, until all biologists and computer scientists worldwide agree on one standard, and all software is (re)written to that standard, neither of which is likely to happen very quickly, if ever, format issues will remain one of the most confusing and troubling aspects of working with primary sequence data.

What are sequence databases like?

Just what are primary sequences? (Central Dogma: DNA -> RNA -> protein.) Primary refers to one dimension: all of the "symbol" information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide. The symbols are the one-letter codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes. Biological carbohydrates, lipids, and structural and functional information are not sequence data. Not even DNA CDS translations in a DNA database are sequence data! However, much of this feature and bibliographic type information is available in the reference documentation sections associated with primary sequences in the databases.

Content & organization

Sequence database installations are commonly a complex ASCII/binary mix, though usually not relational or object-oriented (but proprietary and Web-based ones often are). They'll contain several very long text files, each containing different types of related information, such as all of the sequences themselves, versus all of the title lines, or all of the reference sections. Binary files often help "glue together" all of these other files by providing indexing functions. Software is usually required to successfully interact with these databases, and access is most easily handled through various software packages and interfaces, either on the World Wide Web or otherwise.

More organization stuff

Nucleic acid sequence databases (and TrEMBL) are split into subdivisions based on taxonomy (historical rankings; the Fungi/Archaea warning!). TrEMBL sequences are merged into SWISS-PROT as they receive increased levels of annotation. Both together comprise UNIPROT. GenPept has minimal annotation.

[Diagram: Nucleic Acid DB's: GenBank/EMBL/DDBJ, with taxonomic categories plus HTG's and STS's, and "Tags" (EST's and GSS's). Amino Acid DB's: UNIPROT (SWISS-PROT and TrEMBL) and GenPept.]

Parts and problems

Sequence databases contain several elements associated with each sequence:
Name: LOCUS, ENTRY, ID; all are unique identifiers.
Definition: a.k.a. title, a brief textual sequence description.
Accession Number: a constant data identifier.
Source and taxonomy information.
Complete literature references.
Comments and keywords.
The all-important FEATURE table!
The sequence itself.

Differing formats can be a problem, even with helpful tools such as Don Gilbert's ReadSeq. Therefore, becoming familiar with some of the common formats is a big help. Look for key features of each type of entry, as seen here.

[Figures: annotated example entries; the entries themselves are not recoverable from the extraction. GenBank and GenPept format: look for LOCUS, then FEATURES, then ORIGIN, and then the sequence itself. EMBL and UNIPROT format: look for ID, and then the sequence. GCG single sequence format: look for sequence type, then annotation, then sequence name on the checksum line, then the sequence itself.]
[Figure: GCG multiple sequence format entry.] The other GCG formats hold more than one sequence at a time.

[Figure: a second GCG multiple sequence format entry.] This one is the native format of SeqLab, GCG's graphical user interface (GUI).

Specialized sequence type DBs

Databases that contain special types of sequence information, such as patterns, motifs, and profiles. These include REBASE, PROSITE, BLOCKS, ProDom, and Pfam.
Databases that contain multiple sequence entries aligned, e.g. PopSet, RDP, and ALN.
Databases that contain families of sequences ordered functionally, structurally, or phylogenetically, e.g. iProClass and HOVERGEN.
Databases of species-specific sequences, e.g. the Giardia lamblia Genome Project.
And on and on. See Amos Bairoch's excellent links page: http://us.expasy.org/alinks.html

What about other types of biological databases?

Three-dimensional structure databases: the Protein Data Bank and the Rutgers Nucleic Acid Database. And see Molecules to Go at http://molbio.info.nih.gov/cgi-bin/pdb.

These databases contain all of the 3D atomic coordinate data necessary to define the tertiary shape of a particular biological molecule. The data is usually experimentally derived, either by X-ray crystallography or by NMR; sometimes it's hypothetical. The source of the structure and its resolution is always given. Secondary structure boundaries, sequence data, and reference information are often associated with the coordinate data, but it is the 3D data that really matters, not the annotation. Molecular visualization or modeling software is
required to interact with the data; it has little meaning on its own.

And still other types of Genomics DBs

These can be considered "non-molecular":

Reference databases (also with pointers to sequences), e.g. LocusLink/Gene, an integrated knowledge base; OMIM, Online Mendelian Inheritance in Man; and PubMed/MedLine, with over 11 million citations from more than 4 thousand biomedical scientific journals.
Phylogenetic tree databases.
Metabolic pathway databases, e.g. WIT (What Is There), Japan's GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes), and the human Reactome.
Population studies data: which strains, where, etc.
And then databases that many biocomputing people don't even usually consider, e.g. GIS/GPS/remote sensing data, medical records, census counts, and mortality and birth rates.

Given a genomic sequence, what next?

Easy: restriction digests and associated mapping.
Harder: fragment assembly and genome mapping, such as packages from the University of Washington's Genome Center (http://www.genome.washington.edu/), Phred/Phrap/Consed (http://www.phrap.org/) and SegMap, and The Institute for Genomic Research's (http://www.tigr.org/) Lucy and Assembler programs.
Very hard: gene finding and sequence annotation. This is a primary focus of current genomics research.
Easy: forward translation to peptides.
Hard again: genome scale comparisons and analyses.

Genome Characterization: Recognizing Coding Sequences

Three general solutions to the gene finding problem:

1) all genes have certain regulatory signals positioned in or about them;
2) all genes, by definition, contain specific code patterns;
3) and many genes have already been sequenced and recognized in other organisms, so we can infer function and location by homology if our new sequence is similar enough to an existing sequence.

All of these principles can be used to help locate the position of genes in DNA and
are often known as "searching by signal", "searching by content", and "homology inference", respectively.

URFs and ORFs: definitions

URF, Unidentified Reading Frame: any potential string of amino acids encoded by a stretch of DNA. Any given stretch of DNA has potential URFs on any combination of six potential reading frames, three forward and three backward.

ORF, Open Reading Frame: by definition, any continuous reading frame that starts with a start codon and stops with a stop codon. Not usually relevant to discussions of genomic eukaryotic DNA, but very relevant when dealing with mRNA/cDNA or prokaryotic DNA.

Signal Searching: locating transcription and translation affecter sites with a simple consensus approach

Start sites:
Prokaryote promoter "Pribnow Box": TTGAwa, a 15 to 21 base spacer, then TAtAaT.
Eukaryote transcription factor site databases: TFSites and EPD.
Shine-Dalgarno site in prokaryotes: (AGG,GAG,GGA)x{6,9}ATG.
Kozak eukaryote start consensus: ccAgccAUGg.
AUG start codon in about 90% of genomes; exceptions in some prokaryotes and organelles.

End sites:
"Nonsense" chain-terminating stop codons: UAA, UAG, UGA.
Eukaryote polyA adenylation signal: AAUAAA.
There are exceptions, however, in some ciliated protists and due to eukaryote suppressor tRNAs.

Signal Searching: locating transcription and translation affecter sites with weight matrix approaches

A simple consensus is not informative enough for some sites, so weight matrices are used; here the proportion of each base at each position within the consensus is used in the search. Sites treated this way include the CCAAT site, the TATA site (a.k.a. Hogness box), the GC box, the cap signal, and the eukaryotic terminator (BGTGTBYY).

Exon/intron junctions, with the percentage occurrence of each base written after it:

Donor site: Exon A64 G73 | G100 T100 A62 A68 G84 T63 ... Intron
Acceptor site: Intron ... 6Py74-87 N C65 A100 G100 | N Exon

The splice cut sites occur before a 100% GT consensus at the donor site and after a 100% AG consensus at the acceptor site.

Content Approaches: Strategies
for finding coding regions based on the content of the DNA itself

Searching by content utilizes the fact that genes necessarily have many implicit biological constraints imposed on their genetic code. This induces certain periodicities and patterns to produce distinctly unique coding sequences; noncoding stretches do not exhibit this type of periodic compositional bias. These principles can help discriminate structural genes in two ways: 1) based on the local "non-randomness" of a stretch, and 2) based on the known codon usage of a particular life form.

Content Approaches, cont.: Non-Randomness Techniques

These rely solely on the base compositional bias of every third position base. The approach does not tell us anything about the particular strand or reading frame; however, it does not require a previously built codon usage table. The plot is divided into three regions: top and bottom areas predict coding and noncoding regions, respectively; the middle area claims no statistical significance.

Content Approaches, cont.: Codon Usage Techniques

Genomes use synonymous codons unequally, sorted phylogenetically. This approach requires a codon usage table built up from known translations; however, it also tells us the strand and reading frame for the gene products. Each forward reading frame indicates a red codon preference curve and a blue third position GC bias curve.

Web servers for gene finding

Many servers have been established that can be a huge help with gene finding analyses. Most of these servers combine many of the methods previously discussed, but they consolidate the information and often combine signal and content methods with homology inference in order to ascertain exon locations. Many use powerful neural net or artificial intelligence approaches to assist in this difficult decision process. A wonderful bibliography on computational methods for gene recognition has been compiled by Wentian Li (http://www.nslij-genetics.org/gene/), and the Baylor College of Medicine's Gene Feature Search is another nice portal to several gene finding tools.

Web servers for gene finding, cont.

Five popular gene-finding services are GrailEXP, GeneID, GenScan, NetGene2, and GeneMark.

The neural net system GrailEXP (Gene recognition and analysis internet link, EXPanded; http://grail.lsd.ornl.gov/grailexp/) is a gene finder, an EST alignment utility, an exon prediction program, a promoter and polyA recognizer, a CpG island locater, and a repeat masker, all combined.

GeneID is an ab initio Artificial Intelligence system for predicting gene structure, optimized in genomic Drosophila or Homo DNA.

NetGene2, another ab initio program, predicts splice site likelihood using neural net techniques in human, C. elegans, and A. thaliana DNA.

GenScan (http://genes.mit.edu/GENSCAN.html) is perhaps the most trusted server these days with vertebrate genomes.

The GeneMark (http://opal.biology.gatech.edu/GeneMark/) family of gene prediction programs is based on Hidden Markov Chain modeling techniques; originally developed in a prokaryotic context, the programs have now been expanded to include eukaryotic modeling as well.

Homology Inference

Similarity searching can be particularly powerful for inferring gene location by homology. This can often be the most informative of any of the gene finding techniques, especially now that so many sequences have been collected and analyzed. But this too can be misleading, and it seldom gives exact start and stop positions. For example:

[Figure: alignment of genomic DNA against a protein sequence. In the last legible block, DNA positions 952-990 (CGCGTCATCGACTACATTCTCGACCTGCAGGTAGTCCTG) are aligned with residues 96-108 (HisValIleAspTyrIleLeuAspLeuGlnLeuAlaLeu).]

Beyond just finding genomes and genes: Genome scale analyses

There are some very good Web resources available for these types of
global view analyses. NCBI's Genome pages (http://www.ncbi.nlm.nih.gov/) present a good starting point in North America.

[Figure: NCBI Human Genome page.]

Beyond just finding genomes and genes: Genome scale analyses, cont.

And sites like the University of Wisconsin's E. coli Genome Project (http://www.genome.wisc.edu/) and The Institute for Genomic Research's (http://www.tigr.org/) MUMMER package.

Tying it all together: map browsers

Genetic linkage mapping databases for most large genome projects (H. sapiens, Mus, Drosophila, C. elegans, Saccharomyces, Arabidopsis, E. coli) often tie it all together with links to all the other databases within the context of a genome browser or map viewer. Examples include the Ensembl Project at http://www.ensembl.org/, the UCSC Genome Browser at http://genome.ucsc.edu/, and the Lawrence Livermore National Laboratory ECR Browser at http://www.dcode.org/.

[Figure: Sanger Center for BioInformatics Ensembl project, http://www.ensembl.org/, detailed view of a chromosome region.]

[Figure: University of California Santa Cruz Genome Browser, http://genome.ucsc.edu/, human chromosome tracks including ESTs and conservation.]

[Figure: Lawrence Livermore National Laboratory ECR Browser, http://ecrbrowser.dcode.org/.]

Some of my favorite WWW genomics analyses access sites

Site, URL (Uniform Resource Locator), and content: PIR/NBRF; Protein Data Bank (3D molecular structure database); Molecules to Go (database and software archive); The Human Genome Database; Stanford Genomic Resource; The Institute for Genomic
Research. With tools like NCBI's Entrez, EMBL's SRS, and various genome browsers and map viewers.

Web genomics database access tools: pros and cons

Advantages:
Accesses the very latest updates.
It's fun and very fast.
It can be very powerful and efficient, if you know what you're doing.
In most cases relational links between different databases ease navigation, and in some cases neighboring concepts link similar entries.
Genome-scale analysis is possible.

Disadvantages:
Can be very inefficient, if you don't know what you're doing.
Reformatting is usually essential if the sequence is to be used in any other software.
And it's very easy to get lost and distracted in cyberspace!

Also, problems sometimes arise with the Web, like dropped or slow connections. So what are the alternatives?

Personal computer software solutions: public domain programs are available, but they are a bit complicated to install, configure, and maintain. The user must be pretty computer savvy. So, good commercial software packages are also available, e.g. Sequencher, MacVector, DNAStar, DNAsis, etc., but license hassles, especially the big expense per machine, and Internet and/or CD database access all complicate matters.

Therefore, server-based (non-Web) solutions (we're talking server computers here; OS issues)

Public domain solutions also exist, but now a very cooperative systems manager needs to maintain everything for users. So: the GCG SeqLab Graphical User Interface simplifies matters for administrators and users. One commercial license fee for an entire institution, and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation, anywhere, anytime!

Within the GCG suite, LookUp is an SRS derivative used to find a sequence of interest from local GCG server databases. Advantage: search output is a legitimate GCG list file, appropriate input to other GCG programs; no need to reformat.
Disadvantage: DBs are only as new as the administrator maintains them.

The Genetics Computer Group: The Accelrys Wisconsin Package for Sequence Analysis

GCG began in 1982 in Oliver Smithies' Genetics Dept. lab at the University of Wisconsin, Madison, and then, starting in 1990, it became a private company, which was acquired by the Oxford Molecular Group (U.K.) in 1997, and then by Pharmacopeia Inc. (U.S.A.) in 2000; and then in 2004 Accelrys (San Diego, California) left Pharmacopeia to become an independent entity.

The suite contains around 150 programs designed to work in a "toolbox" fashion. Several simple programs used in succession can lead to very sophisticated results.

Also "internal compatibility," i.e. once you learn to use one program, all programs can be run similarly, and the output from many programs can be used as input for other programs.

Used all over the world at over 950 institutions, so learning it will likely be useful at other research institutions as well.

To answer the always perplexing GCG question, "What sequence(s)?": specifying sequences, GCG style, in order of increasing power and complexity

The sequence is in a local GCG format single sequence file in your UNIX account (see the Reformat and SeqConv programs).

The sequence is in a local GCG database, in which case you point to it by using any of the GCG database logical names. A colon always sets the logical name apart from either an accession number or a proper identifier name or a wildcard expression, and they are case insensitive.

The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces containing the sequence specification.

Logical terms for the Wisconsin Package

[Table: logical terms for the Wisconsin Package databases.]
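The three ways of specifying sequences described above can be told apart mechanically. The sketch below is only an illustration of that idea: the function name `classify_spec` and the example specifications are hypothetical, and the rules assume nothing beyond what the text states (a colon separates a database logical name from an entry, and braces follow a multiple sequence file name). It is not how GCG itself parses its input.

```python
def classify_spec(spec):
    """Guess which kind of GCG-style sequence specification a string is.

    Returns a (kind, container, selection) tuple.
    """
    spec = spec.strip()
    if spec.endswith("}") and "{" in spec:
        # Multiple sequence file: file name followed by braces around the selection.
        filename, _, inner = spec[:-1].partition("{")
        return ("multiple_sequence_file", filename, inner)
    if ":" in spec:
        # Database entry: logical name, colon, then accession, name, or wildcard.
        database, _, entry = spec.partition(":")
        # Logical names and entries are described as case insensitive.
        return ("database", database.lower(), entry.lower())
    # Otherwise assume a single sequence file in the user's account.
    return ("single_file", spec, None)


# Illustrative specifications only.
for example in ("my-special.pep", "SwissProt:EfTu_Ecoli", "Ef1a-Tu.msf{*}"):
    print(example, "->", classify_spec(example))
```

The braces test comes first so that a wildcard containing a colon inside the braces would still be read as a file selection; that ordering is a design choice of this sketch, not a documented GCG rule.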
The List File Format

An example list file follows (the "way" SeqLab works):

    !!SEQUENCE_LIST 1.0
    An example GCG list file of many elongation 1a and Tu factors follows. As with all GCG data files, two periods separate documentation from data. ..
    my-special.pep begin:24 end:134
    SwissProt:EfTu_Ecoli
    Ef1a-Tu.msf{*}
    /usr/accounts/test/another.rsf{ef1a_*}
    @another.list

SeqLab: GCG's X-based GUI

SeqLab is the merger of Steve Smith's Genetic Data Environment and GCG's Wisconsin Package Interface: GDE + WPI = SeqLab.

It requires an X-Windowing environment, either native on UNIX computers (including LINUX, but not installed by default on Mac OS X v10; see Apple's free X11 package or XDarwin), or emulated with X-Server software on personal computers.

[Figure: SeqLab, GCG's Graphical User Interface.]

Conclusions

There's a bewildering assortment of different genomics databases and ways to access and manipulate the information within them. The key is to learn how to use the data and the methods in the most efficient manner; knowing which to use when, and how to combine their inferences, will go a long way toward success.

A comprehensive sequence analysis software suite, such as the GCG Package, expedites the chore, putting a large assortment of tools all under one organizational model with one user interface.

FOR MORE INFO: http://bio.fsu.edu/~stevet/workshop.html

Contact me (stevet@bio.fsu.edu) for specific bioinformatics assistance and/or collaboration.

The combinatorial approach

Get all your data in one place. GCG's SeqLab is a great way to do this due to its advanced annotation capabilities.

[Figure: SeqLab window showing the File, Edit, and Functions menus.]
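Returning to the list file format described above, the rule that "two periods separate documentation from data" is easy to apply in code. The following is a minimal sketch, assuming that everything before the first ".." is documentation and that each later non-blank line is one sequence specification optionally followed by `name:value` attributes such as `begin:24`. The function name `parse_list_file` and the shortened example are hypothetical and are not part of the GCG software.

```python
def parse_list_file(text):
    """Split GCG-style list file text into documentation and entries."""
    documentation, separator, data = text.partition("..")
    if not separator:
        raise ValueError("no '..' separator between documentation and data")
    entries = []
    for line in data.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        spec, attributes = tokens[0], tokens[1:]
        # Attributes such as begin:24 become {"begin": "24"}.
        options = dict(item.split(":", 1) for item in attributes if ":" in item)
        entries.append((spec, options))
    return documentation.strip(), entries


# Shortened, illustrative list file.
EXAMPLE_LIST = """\
!!SEQUENCE_LIST 1.0
An example list file. ..
my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
"""

doc, entries = parse_list_file(EXAMPLE_LIST)
for spec, options in entries:
    print(spec, options)
```

Only the first token on each line is treated as the sequence specification, so a database entry such as `SwissProt:EfTu_Ecoli` is not mistaken for an attribute even though it contains a colon.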

