Computational Biology Tools
These 12 pages of class notes were uploaded by Jacky Emmerich on Monday, September 7, 2015. The notes belong to BME 110 at University of California - Santa Cruz, taught by Dietlind Gerloff in Fall. Since their upload, they have received 63 views.
Date Created: 09/07/15
Linear Sequence Analysis
What can you learn from a single protein sequence?
  Calculate its physical properties
    Molecular weight (MW), isoelectric point (pI), amino acid content, hydropathy (hydrophilic vs. hydrophobic regions)
    These calculations do not take post-translational modifications of the protein into account, so the results are usually not 100% accurate
  Identify sequence motifs and families
    Signal sequences, transmembrane domains, coiled coils, post-translational modification sites, secondary structure (non-homologous)
    Domains, functional motifs (homologous)

3D Structure Analysis
  Visualization
    Domain structure, global fold, active sites, point mutations (SNPs), splice sites
  Evaluate structure quality
  Calculate physical properties
    Surface areas, distances, side-chain conformations, contact maps
  Structural alignment, i.e. similarity to other structures

Prediction
  Physical properties: binding affinity, pKa's, stability, specificity
  3D structure: homology modeling, fold recognition, de novo
  Advanced: protein design, docking of two proteins, active site modeling

Sequence Databases
  SwissProt (ExPASy): highly curated, updated less frequently
  TrEMBL (ExPASy): translated nucleotide sequences; automatic translation, fast but less info
  UniProt (EBI): Universal Protein Resource; combines SwissProt, TrEMBL, and PIR sequences

Sequence Analysis Sites
For protein sequences and tools to analyze them, the two major centers are:
  ExPASy (Expert Protein Analysis System)
    Many tools: http://ca.expasy.org/tools
    Databases: SwissProt, TrEMBL
  NCBI
    Entrez Protein and Domains
  PIR (Protein Information Resource): folded into the UniProt consortium; no longer a major resource site

More Sequence Databases
  Nonredundant: NR (NCBI), UniRef (PIR/EBI)
  Reference: RefSeq (NCBI), reannotated by NCBI
  Domains/Families
    Pfam: protein families (Sanger Center, 4 mirror sites)
    SMART: Simple Modular Architecture Research Tool
    CDD: Conserved protein Domain Database (NCBI); combines the Pfam, SMART, and COGs databases
    InterPro: based on UniProt, at EMBL-EBI
    Many others

Structure Databases
  Experimental: PDB (Protein Data Bank)
  Families: SCOP, CATH, Dali database,
    Homstrad
  Models/Predictions: ModBase, SwissModel

NOTE: All these databases, plus other kinds of databases, are described in the January Database issue of Nucleic Acids Research, which also links to them.

Protein Sequence Analysis Tools
  ExPASy Proteomics Tools
    Calculate physical properties
    Predict sequence motifs (what ExPASy calls "Topology"): localization, TM domains, signal sequences, post-translational modifications
    Search pattern and profile collections
  PredictProtein and MetaPP
    A metaserver providing access to many servers with one submission form

Secondary Structure Prediction
  Three good methods: Psipred, SAM-T02/T04/T06, PHD (PredictProtein)
  Compare a couple of methods
  Use the three-state predictions

SEQUENCE <-> STRUCTURE <-> FUNCTION
  Evolutionary selection operates on function
  Structure is more closely linked to function than sequence is, so structure tends to be more conserved than sequence
  Need to search farther in sequence space to find proteins with related structures and functions

Detecting Remote Similarities
  Remote similarities can more easily be detected by comparing protein sequences
  DNA sequences change faster than protein sequences (wobble position, redundant codons)
  The 4-letter DNA code vs. the 20-letter amino acid code means that matches by chance are more likely in DNA
  The protein code has more information in it

Multiple Sequence Alignment
  Multiple sequence alignment is probably the single most important bioinformatics tool
  Many applications require accurate MSAs:
    PSI-BLAST
    Family and domain classification
    Pattern identification
    Structure prediction (secondary structure, fold recognition)
    Phylogeny
    Full-genome alignments in browsers

Conservation Patterns
  Cys pairs: disulfide bonds
  His, Ser: catalytic sites
  Cys, His: metal binding sites
  Gly, Pro: ends of secondary structure elements, turns
  Lys, Arg, Asp, Glu: ligand binding
  Lys/Arg-Asp/Glu pairs: salt bridges
  Leu: coiled coils, leucine zippers
  Motifs, secondary structure, indels

PSI-BLAST Alignments
  The goal of BLAST is rapid detection by detecting high-scoring local alignments
  It doesn't necessarily find the optimal global or local alignment
  Profiles throw away information for regions that are insertions relative to the query

Methods
  Dynamic Programming: gives the optimal solution but is prohibitively slow
  Progressive
    ClustalW: most commonly used
    T-Coffee (http://igs-server.cnrs-mrs.fr/Tcoffee): a little better but slower
  Iterative: better than progressive methods but slower
  Dialign
  HMMs

Progressive Alignment
  1. Calculate global pairwise alignments for all pairs (Needleman and Wunsch); N(N-1)/2 alignments required
  2. Use pairwise alignment scores to calculate a guide tree describing the distance between all pairs of sequences
  3. Align the sequences progressively
       Start with the two most closely related sequences
       Add in sequences in order of increasing distance
  ClustalW uses this method

ClustalW Example
  Input: 5 sequences detected by BLASTp using human SNAP25 as a query
  Default parameters (output order: input)
  [Screenshot of the five input sequences in FASTA format; sequence text not recoverable]

Input Formats for Clustal programs
  FASTA format
    Download from NCBI, ExPASy, EBI, Pfam
  Sequence names should be:
    Unique
    15 characters or less
    Comprised of only A-Z, a-z, 0-9, and a small set of allowed symbols
    Free of spaces and other disallowed symbols

ClustalW Output
  [Screenshot of the run log; identifiers, lengths, and scores are partly illegible]
  CLUSTAL W (1.82) Multiple Sequence Alignments
  Sequence format is Pearson
  Sequences 1 to 5 are listed with their lengths in aa
  Start of Pairwise alignments: all 10 pairs (1:2 through 4:5) are aligned and scored
  Guide tree file created
  Start of Multiple Alignment: there are 4 groups, each aligned and scored in turn
  Alignment Score 7423
  CLUSTAL-Alignment file created (an .aln file in the server's work directory)

ClustalW Guide Tree
  The guide tree shows the distances between sequences obtained from the initial pairwise alignments
  This is the order in which sequences were added into the MSA
  The guide tree is not a phylogenetic tree; it's just a rough estimate of similarity. However, a true phylogenetic tree can be generated after making an alignment

Progressive Alignment: Greedy algorithm
  Breaks the problem up into smaller problems
  Finds the best solution to each small problem
  Combines the solutions to get an answer to the whole problem
  Not necessarily the global answer
    Doesn't use all information in solving subproblems
    Suboptimal answers for small problems may combine to give a better overall answer
  Gaps: once created, they stay as part of the alignment for the rest of the alignment iterations

ClustalW Alignment
  [Screenshot of the CLUSTAL W (1.82) multiple sequence alignment]

Interleaved Formats
  Most common output formats for MSAs are interleaved: MSF, ASN, BLAST query-anchored formats
  All sequences are stacked up and chopped into blocks of 60 residues
  Easy for humans to read, but difficult to edit
  Tools for converting formats are available on the web

Aligned FASTA (A2M) Format
  [Example with five records (SN29_RAT, SN29_HUMAN, SN25_TORMA, O93578, SN25_DROME); residue strings not recoverable]
  Uppercase and "-" characters are alignment columns
    There must be the same number of aligned characters in all sequences
  Insertions that are not part of the alignment are indicated with lower case and "." characters
    These are not read as alignment columns
  Benefits
    Easily machine readable
    Readable by most programs that read FASTA format, e.g. Jalview

Graphical
  ClustalX or others: Postscript, PDF, HTML
  Looks pretty and is very visually informative
  Completely useless for further computational analysis
  DO NOT SAVE GRAPHICS AS YOUR ONLY OUTPUT

Jalview
  Java alignment editor: http://www.jalview.org
  Available as an online applet or as an application
  Makes nice pictures and allows interactive editing
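
The A2M rule stated above (uppercase and "-" characters are alignment columns, lowercase and "." characters are insertions, and every sequence must have the same number of aligned characters) can be checked mechanically. The following is a minimal sketch, not part of the original notes: the function names and the two toy records are invented for illustration, and the parser only handles simple FASTA-style text.

```python
def parse_fasta(text):
    """Parse FASTA/A2M text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def match_columns(seq):
    """Keep only alignment columns: uppercase letters and '-'."""
    return "".join(c for c in seq if c.isupper() or c == "-")


def check_a2m(records):
    """Return the shared number of alignment columns, or raise ValueError."""
    lengths = [len(match_columns(seq)) for _, seq in records]
    if len(set(lengths)) > 1:
        raise ValueError(f"unequal numbers of aligned characters: {lengths}")
    return lengths[0] if lengths else 0


# Invented toy records (not the SNAP25 family from the lecture example).
example = """>toy1
ACgGT-A
>toy2
AC.GTTA
"""
columns = check_a2m(parse_fasta(example))
```

Here the lowercase "g" in toy1 and the "." in toy2 are skipped, so both records contribute six alignment columns and the check passes.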
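
The three progressive alignment steps listed earlier (score all N(N-1)/2 pairs with a global Needleman and Wunsch alignment, rank sequences by similarity, then add them starting with the closest pair) can be sketched as follows. This is a simplified illustration under stated assumptions, not ClustalW's actual procedure: the scoring values (match +1, mismatch -1, linear gap -2) are arbitrary, the "guide" is just a greedy join order based on raw scores rather than a real guide tree, and no multiple alignment is built.

```python
from itertools import combinations


def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_scores(seqs):
    """Step 1: score every unordered pair, N(N-1)/2 alignments in total."""
    return {
        (i, j): needleman_wunsch_score(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }


def progressive_order(seqs):
    """Steps 2-3, simplified: start with the best-scoring pair, then keep
    adding the remaining sequence most similar to any sequence already chosen."""
    scores = pairwise_scores(seqs)
    order = list(max(scores, key=scores.get))
    remaining = set(range(len(seqs))) - set(order)
    while remaining:
        nxt = max(
            sorted(remaining),
            key=lambda k: max(scores[tuple(sorted((k, m)))] for m in order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Invented toy sequences; indices in the result refer to this list.
toy = ["ACGT", "ACGA", "TTTT", "ACGT"]
join_order = progressive_order(toy)
```

Because each step commits to the locally best choice, this is the greedy behavior the notes describe: an early decision (including any gaps it introduces in a real aligner) is never revisited.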
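
As a small illustration of the physical properties mentioned under Linear Sequence Analysis (amino acid content and hydropathy), the sketch below computes residue fractions and a sliding-window hydropathy profile. The notes do not name a hydropathy scale; the Kyte-Doolittle scale and the default window of 9 residues are assumptions made here for illustration, and the function names are invented. As the notes caution, such values ignore post-translational modifications.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values (assumed scale; positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Return the fraction of each residue type in the sequence."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}


def mean_hydropathy(seq):
    """Average hydropathy over the sequence (standard residues only)."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def hydropathy_profile(seq, window=9):
    """Mean hydropathy of each full-length window along the sequence."""
    return [
        mean_hydropathy(seq[i:i + window])
        for i in range(len(seq) - window + 1)
    ]


# Invented toy peptide: a hydrophobic stretch followed by a charged one.
profile = hydropathy_profile("IIIIKKKK", window=4)
```

Windows with high positive averages point to hydrophobic regions (for example candidate transmembrane segments), while strongly negative windows point to hydrophilic regions.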