Bioinformatics Methods BINF 630
Popular in Course
Popular in BioInformatics
This 128 page Class Notes was uploaded by Nathanael Schowalter on Monday September 28, 2015. The Class Notes belongs to BINF 630 at George Mason University taught by Staff in Fall. Since its upload, it has received 8 views. For similar materials see /class/215260/binf-630-george-mason-university in BioInformatics at George Mason University.
Reviews for Bioinformatics Methods
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 09/28/15
Lecture 13 Systems Biology Saleet J afri What is Systems Biology In traditions science a reductionist approach is typically used with an individual system or subsystem is dissected and studied in detail Systems biology integrates information from different sources to understand how larger more complex systems work Systems Biology and Integration Molecule Gene and Protein Organelle cellular subsystem Cell Organ Organism Environment History of System Biology The Human Genome Project and modern biotechnology have created the ability to gather large amount of information about an organism Due to the inherent complexity of biological systems computational methods and models must be used to understand and integrate the data Systems Biology Methods 0 There are many methods used in systems biology and each has its strengths and weaknesses Much of what is called systems biology relates to modeling genetic networks and biochemical reaction networks however they are not the only methods Systems Biology Methods I will present an example from my own research that integrates biochemical biophysical and microstructural information to eXplain the basic mechanisms that initiate contraction in the heart Modeling the Mechanism of Calcium Sparks in the Heart M Saleet Jafri School of Computational Science George Mason University Coworkers Hena Ramay i x Keith Dilly Jader Cruz Research Goals What factors in uence spark dynamics What is the mechanism of spark termination How can we account for the spatial spread ofthe spark Presentation Outline 1 Introduction 2 Spark Termination 3 Spatial Spread of Sparks 4 Conclusions Membrane Calcium Force Currents Handling Generation Cardiac Ca2induced Ca2 release 40 mV 80 mV diastole What is a Ca2 spark thanks to Andy Ziman for assistance What is a Ca2 spark seconds cell images at 05 sec per image location time from Cheng Lederer amp Cannell 1993 Science 262 740 Why Study Calcium Sparks Calcium sparks are the most elementary observed events in excitation contraction coupling Calcium sparks are thought to regulate vascular tone in vascular smooth muscle Calcium sparks provide a good example where cellular structure and the detailed biophysics of cellular components combine to observed experimental behavior Excitation contraction coupling is defective in certain diseases such as heart failure Heart Cell ZIines lme L Fernanda 52mm unpublishajl Sarcomere Ttubules and SR apposition Mommy mm J Frank 1935 Elements of Ca2 Spark Generation An array of RyRs Junction between Ttubule and the sarcoplasmic reticulum membrane cut section Sequence of EC coupling Action potential Presentation Outline Introduction Spark Termination Spatial Spread of Sparks Conclusions Hen do Ca2t spants tenmtnatet Three hypotheses have been proposed to explain the neehanisnn of Ca sparkterminaien t Deptetten et SR Cab do Cal spantstenmtnate beeaesetheSRnunseetetCaitt Thts ts ntteb but because aThene ts ittt Ea avatta te tenretease atena Eaz tansembassant etat thh tmtbrbetat thht anbbt az qnaths tantastahngtnneephseeenbs tengGaquoteatethn team ntanebtne tbnnnttlngtlblnlnltlnnlttttllltmetltttlt 2Steehastteattttten7 tEaZt snatts tetnntnate bh bteehastte atttttenh he meant hattenntnatten happens when att etthe hhh s enthannente eteee atthesannetnne Thtsean eeeenttthere tsene eta testa tent hhb bett ts enthetnh enthe number at MS tn a destentstarge eg bennnere see anatnes byhem and rthens stantnghth tem thht tnabet heatteettsthe chines ethnh seentntn more 3 bath Ea2t snatts tetnntnate beeame the We tnaetnate h tt net eeett abantaten be the nebt thereanetnepnebtenns that bnnpte tnaetvaten hthhs has NOT been ebsenebtn ptanan hptbbttahenenpennnents Seeenb abaptattenf eennpheateb tnaetvatten tethnhststee shh t setmsteseeenbs Gheheanb tththhlhththaetahthh Recentetpentmentatnesettssuggestedanethenhnpethestste esteem Hypothesis Ca sparks terminate because of the influence of three factors on RyR gating 1 Large number of RyRs FranziniArmstrong et al 1998 2 SR lumenal Ca2 3 Coupled gating of RyRs Coupled gating of RyRs Ca2mm2quot and RyR gating Bul 4 minimum 2 n quot5 me M4 quot5 215 39quotM i I WJAJ WLJWNMMM quotum Gymke mymke 1998 EWIth 75 ZED Skeieiai Muscie RyR s Marx 2i ai 1958 ScienceZEi ma Heart R yR s Gabuuakuva 2 ai mm 30pm J an aauA R VR S Experimental Results Pooled Experimental Results 5mm fractmn of sparks lt40 m5 nntrul FKEEIB 25 M rapamymn in W 240m5 Model Conceptual Outline RyR gatmg O c IZRVR 3Ca1 prome W m dlffusmn bu enng l pt cal blumng 2 SR release ux pA 4 352 spark mage Model Sticky Cluster Spatial organization JDHW Extracellular TT space Jdlllusl n subspace Cytoplasm I Caquot channel DiIPR Ryanodine receptor Model Sticky Cluster RyR Gating men kc lass Consth FmSE kcluse Ca 4 high Caz lium 55 kmquot ConsthFupEquot Km 4 Cass4 luw Caz39lw Km fCalumen Carts CFmE kcuupQNciusewNupen CFupe hNciuszupen Model Equations Model Solution RyR open state calculated using a Monte Carlo Method Fluxes calculated to determine derivatives Differential equations solved using a Euler Method Programmed in Fortran 90 on a HP Unix Workstation Computation time for control 500 runs in 30 minutes Spark visualization determined by solving reaction diffusion system for buffered diffusion and optical blurring using Matlab on a PC Simulated Ca2 release control conditions RyR open probability lUWS SR Ca release ux Peak Caz ss150 0M lpA 1000 CalqSR 500 Cal39lkswooo 0M Simulated Ca2 sparks control conditions Caz spark rmage Caz spark trme course npm 025w 05pm 2 FFn 1 20 ms 20 ms Caz spark spaual promes 050m Simulated Ca2 sparks effect of noise Reco Simula CaP spark image Ca spark time course L quotOise Spontaneous simulated Ca2 sparks Can 100 nM 20 pml 1 second CaPL 150 nM Spark Rate vs Subspace Ca2 and SR Lumenal Ca2 5 sparkmzlspark x a n 2 E a n Simulated Ca2 sparks cluster size 100 RyRs 50 RyRs 20 RyRs 10 RyRs 352 re ease ux 1 M J J mm W Sun L k k K n Cam 1 fraction of sparks Simulated Ca2 sparks cluster size 100 RyRS 50 RyRs 20 RyRS 10 RyRS b I g i iu x 5mm amnlllude 1m em mm ms an u 2 mm m an an m an an waynsm cluser WW5 m clusu an Simulated Ca2 sparks n0 lumenal dependence Control No lumenal dependence L mm Ca reiease ux 1 MI I mums 2 2 Ca spark FFn 1 5 5 Ca spark image MWI Simulated Ca2 sparks SR Load Simulated Ca2 sparks reduced coupling km 1 km 04 Caz release ux 1 pAl 2 Caz spark Willi H 5B ms Ca spark image 1 WI 22 4 Conclusions 3 Spatial Spread of Sparks 2 Spark Termination 1 Introduction Presentati on Outline km relative to control 15 10 300 100 I l i population data mm mx VlIIlIlIIIIIIlIIi w 71 ES fraction of Sparks 0 a Imammm z I I IlllllIllIllllIIA S s mamm IllIlllIIA N kcoop Reduced coupling Involvement of Multiple release sites rim Clmlzrnl y LM Myunhsm Ilsa Channels mini VimK s i Junction m 4 mm Conventionally sparks are thought to originate from a signal re1ease site 39 Parker et a 39 39 39 39 the proximity of the re1ease site Model Layout and Method Model Layout 7 Two Functional Release Units 39 2 LCC 7 Homogeneous NSR and asm Myopl Method 7 Explicit finite difference method Euler Metho run 33531quot 11a an T me 5 f 107 5 N Spatial f 01 7 Monte Carlo rnethodto determrne RyR cha 1 te 7 No ux boundary conditions Calcium Dynamics in Functional Release Unit Subspace cm Vs Time mu gm 52W mm JunctionaISRCa2 Vs Time mm D D515 25 35 as 5D mumi CaFVsTime 5n on Eu 12D ism 2m 2m 25 on i Imaml g an a 2D 1D s 15 25 3 as mumi Undistinguished Calcium Spark Peaks CaF CaF with Optical Simulated Spark blurring Release from one site almost always triggers release from the adjacent site on the other side of the Ttubule consistent with the results of Parker et al 1996 FWHM 20 pm FWHM lt 20 um CaF CaF with Optical blurring Spark from One Release Site Simulated Spark One site is disables to show release from a single site SERCA Pump Distribution Distribution 5 SERCA Distribution 10 05 Zlir Smith etaii 1998 CaF Vs Time or Jgt Heterogeneous Homogeneous CaF quotM N w 10 20 30 40 50 Time ms Periodic distribution predicted by Smith and coworkers Homogeneous and nonhomogeneous SERCA pump distribution made little difference on spark duration SERCA Pump Activity Increasing SERCA Activity pump activity leads to decrease spark 5n can V5 Time duration Eluck2d Blocking SERCA i 32 pump activity leads 5 25 to an increase in a I spark duration in similar to that S 39 39 39 observed by Gomez 5 m 15 in 25 3D 35 In b5 et al 1996 mm Calsequestnn and Sparks Control M50 lperatoxin was addedto cardiac myocytes to LLLLJJJLLJJLL Decreased CSQ An annual Increased CSQ E mm gl 3 a w 5n we 3 P Terentyev et al 2003 increase spontaneous sparks from the same site Decreased calsequestrin expression increases spark frequency Increased calsequestrin expression decreases spark frequency Subspace Spark Restitution Spark amplitude increases as interspark interval increases Cazt Vs Time The lower spark O LL L 100 200 Junctional SR Caz Vs Tim amplitude is a result ofthe partially lled state of the SR 300 400 500 600 Time ms E Since Popen depends g 800 on Ca2SR the 3 400 iperatoxin results can V be explained by a U 0 100 200 300 400 500 600 delay in refilling Time ms Simulated Effects of Calsequestrin a 1m Increased CSQ Simulation 100 Z 0 100 200 300 400 500 600 Time ms 150 Decreased CSQ Simulation 100 50 L L l L L 0 100 200 400 500 600 300 Time ms 27 SR Buffer Data Condition Peak Peak Spark Spark Amplitude duration rate Sim rate Exp Control 128 MM 24 ms 100 100 Increased 147 MM 150 ms 28 27 CSQ expression Decreased 120 MM 18 ms 190 183 CSQ expression Citrate 156 MM 170 ms 27 38 Terentyev et al 2003 Presentation Outline 1 Introduction 2 Spark Termination 3 Spatial Spread of Sparks 4 Conclusions 28 Summary Conclusions Our sticky cluster model of a Ca release unit can simulate Ca sparks that terminate reliably Termination occurs through coupled gating and the in uence of lumenal calcium Reducing coupling between RyRs increases Ca spark duration consistent with experimental effects of FK506 Ca2 spark magnitude is only mildly sensitive to the number of RyR39s in the cluster and the Ca spark duration is even less sensitive to this num e Release from adjacent sights might combine to give spark widths of 2 m as observed experimentally The spontaneous spark rate alteration due to SR buffers is likely due to their effect on re lling of the SR Current Work We have integrated the Ca spark model into a model for whole cell Ca dynamics of the cardiac myocyte to demonstrate that the summation of many sparks from different release sites give rise to the global Ca2 transient Nanrjuncuanal DHPR lam lam m Hype channel Bulk wahsm Calmodulm myo xlaments Cardiac Myocyte Results wllnn I n cnllulk mw 4o Membrane voltage mvp mu 7 7 7 AP mm Obtained um Natalietal 2002 Innm Cardiac Myocyte Results Mumluxmlx m H mm n Mulch Tmmim retarding by Buutlmni etuL 1995 Szmulatzdpatch Clamp zxpznmems under nmmal Cundztzuns 20 000 FR Us Szmulated 3O Cardiac Myocyte Results R Calcium Inmxmu m lunch Simulated patch clamp experiments under normal conditions 20 000 FR Us simulated 02 FF 500 ms SR transient recorded by Wang et al 2004 lnmcnnl wlcmm Inn qu m lunch Simulated patch clamp experiments under normal conditions 20 000 FR Us simulated Cardiac Myocyte Results 003 AFIFD Lumenal transient recorded by Bracket et al 2005 31 Cardiac Myocyte Results Graded re ease Peak Release mMs n 4n n m Membrane pnlenlizl mV Graueu re ease Simulafedpafch clamp experiments under normal conditions 20 000 FR Us simulafed o Fsm mM squot Membrane vouaga mV Release curve recorded by Wier et al 1994 Cardiac Myocyte Results Galn Function Plot sz Memo awn Simulafedpafch clamp experiments under normal conditions 20 000 FR Us simulafed Gain JERlm 0 40 30 gt20 10 O 10 20 30 Membrane Voltage mV Experimentally generated gain platfram Wang et al 2004 32 Lecture 8 RNA Secondary Structure Central Dogma DNA J RNA J Protein RNA In the central dodgma we talk about mRNA coding for protein There are also tRNA and rRNA that are also coded for by the DNA The MicroRNAs miRNA are small 22 nucleotides noncoding RNA gene products that seem to regulate translation The RNA has features in it sequence that gives it a structure based on the folding and pairing of bases within the strand tRNA very highly conserved Secondary and Tertiary Structure WatsonCrick pairing of GC and AU GC pairing give more energetic stability Double stranded regions and loop regions are the secondary structure elements Tertiary structure is the interaction between secondary structure elements RNA Secondary Structure httpWwwbioinfo1pieduNzukermBioS495RNAfoldhtmlimg26gif Methods Sequence and base pairing patterns there is conservation of sequence in RNA structures Energy minimization find the energetically most stable structure based on local sequence no knots History of Methods 0 Energy calculations based on base pairings Tinoco et al 1971 0 Free energy of regions based on the Boltzman function Martinez 1984 All possible structures are sampled using the Monte Carlo method The selection of regions is based on weights determined by the free energy History of Methods Nussinov and Jacobson 1980 computed a scoring matrix to keep track of the maXimum number of pairs They then performed a traceback to get the optimal structure Zuker and Stiegler 1981 used dynamic programing and energy rules to get the energetically most favorable structure History of Methods I Mfold is software developed by Zuker and coWorkers It is Very computationally expensive and can be used on a maxim of about 1000 nucleotides I Genetic algorithm Shapiro and Navetta 1994 take into account both sequence and secondaIy structure Circle Plots Red lines connect complementary around outside of circle Ifnot red lines pseudo knots mp wawbmmfu m eduzukermBmVSABSRNAfuldrhtmlingB gir Dot Plots indicate regions 0 complanmmry bases mp www bimnfu pl edu ukamBiurSADSRNAfuldrhtrrdimgll gr Energy Minimization 39 Calculate energy based on some scoring matrix 39 ynamic programming to calculate the energetic most stable structure 39 Dynamics programming can be altered to also nd suboptimal alignments as with mfold Evaluation of Methods Energy minimization Advantages 7 Predict thermodynamically most stable structures 7 Reliable account for experimental and alignment data Disadvantages 7 Tertiary structures not predicted 7 Computationall expensive Base Covariation 0 Uses aligned sequences of the same class of RNA molecules from different species 0 Conserved regions are likely the base pairing regions 0 Arrange the likely paired regions With complementary regions This can be done using various methods such as msa hidden markov models an ordered tree model etc Hidden Markov Models A machine learning approach that is trained on observed variation of alignament from msa including matches substitutions indels The HMM can then classify a new mRNA Ordered tree model Secondary structure elements are place at the node if a tree with the edges re ecting the relations between the nodes Evaluation of Methods Sequence base covariation methods rely on having a conserved pattern of base pairing Advantages 7 Computationally ef cient 7 Predict both secondary and tertiary structure Disadvantages 7 Need information of sequence covariation 7 ie need to know structure of similar sequences Contextfree Grammers 0 Use of formal language theory to develop a gramar to describe RNA secondary structure 0 Find groups of logically related words separated by intervening words 0 They can accommodate stochasticity ie variations in the sequence RNA Genes RNA has various functions There are software developed to search for RNA genes in the genome tRNAscan searched for tRNA looking at conserved bases and selfcomplementarity to give correct structure RNA Structural Modeling Using RNA structural modeling a new model of phylogeny of the eukaryotes prokaryotes and achaea has been proposed RNAi are designed to inhibit specific genes Ribozymes are selfreplicating RNA RNA universe Lecture 9 Database Searching Database Searching for Similar Sequences 0 Database searching for similar sequences is ubiquitous in bioinformatics Databases are large and getting larger 0 Need fast methods Types of Searches Sequence similarity search with query sequence Alignment search with profile scoring matrix with gap penalties Serch with positionspecific scoring matrix representing ungapped sequence alignment Iterative alignment search for similar sequences that starts with a query sequence builds a multiple alignmnet and then uses the alignment to augment the search Search query sequence for patterns representative of protein families From Bioinformatics by Mount DNA vs Protein Searches DNA sequences consists of 4 characters nucleotides Protein sequences consist of 20 characters amino acids Hence it is easier to detect patterns in protein sequences than DNA sequences Better to convert DNA sequences to protein sequences for searches Database Searching Efficacy To evaluate searchng methods selectivity and sensitivity need to be considered Selectivity is the ability of the method not to find members known to be of another group ie false positives Sensitivity is the ability of the method to find members of the same protein family as the query sequence Protein Searches Easier to identify protein families by sequence similarity rather than structural similarity same structure does not mean same sequence 0 Use the appropriate gap penalty scorings 0 Evaluate results for statistical significance History Historically dynamic programming was used for database sequence similarity searching Computer memory disk space and CPU speed were limiting factors Speed still a factor due to the larger databases and increase number of searches FASTA and BLAST allow fast searching History The PAMZSO matrix was used for a long time It corresponds to a period of time where only 20 of the amino acids have remained unchanged BLOSUM has replace PAMZSO in most applications BLAST use the BLOSUM62 matrix FASTA uses the BLOSUM50 matriX Search Tools Similarity Search Tools SmithWaterman Searching Heuristic Search Tools FASTA BLAST Dynamic Programming 0 Use SmithWaterman algorithm or an improvement thereof for local alignment 0 Compares individual characters in the full length sequence 0 Slower but more sensitive than FASTA or BLAST 0 Finds optimal Alignment FASTA Fast alignment of pairs of protein or DNA sequences Searches for matching sequence patterns or words called ktuples corresponding to k consecutive matches in both sequences Local alignments are build based on these matches Better for DNA searches than BLAST ktuple can be smaller than minimum of 7 for BLAST No guarantee of finding exactly optimal alignment FASTA Algorithm FASTA uses a search for regions of similarity by hashing In hashing a lookup table showing the positions of each ktuple is made for each sequence The relative positions of the ktuple in each sequence are calculated by subtracting the postions of the first characters Ktuples haVing the same offset are considered to be aligned Adjacent regions are joined if possible by inserting gaps The highest scoring regions are aligned by dynamics programming FASTA Algorithm 0 The number of comparisons increases linearly with the average sequence length 0 In dynamic programming and dot plots the number of comparisons increases as the cube or square of the length respectively Significance of FASTA Searches 0 The average score is plotted against the log of the average sequence length in each length range 0 A line is fit with linear regression and the z score is the number of standard deviations from the fitted line 0 A statistical distribution of alignment scores can be used to determine probabilities VmSmnsofFASTA There are many versions of FASTA FASTA compares like query sequence to library TFASTA compares unlike query sequence to library FASTFTFASTF short fragments FASTXFASTY compares DNA in all forward reading frames BLAST Basic local alignment search tool Faster than FASTA Searches for words of common length in both query sequence and library Confines search to the words that are the most significant FASTA finds all words Significance of word matches is calculated using BLOSUM62 and the log odds score BLAST Algorithm The query sequence if filtered to remove regions of low complexity not useful for meaningful sequence alignments A list of 3 character words in the query sequence is made stepping forward on character at a time The query sequence words are evaluated for an exact match with a word in the database log odds scores using BLOSUM62 BLAST Algorithm The neighborhood word score threshhold cutoff score to retain only the 50 most significant ones An ef cient search tree is made using the high scoring words The database is searched to find matches for the 50 significant words These are used as seeds for ungapped alignment with the query sequence BLAST Algorithm 0 The alignments are extended as long as the similarity score increases and if overlap they are combined 0 These highscoring segment pairs are matched in the entire database and listed 0 The statistical significance for these are calculated Database Searching With a Scoring Matrix or Pro le 0 A combination of dynamic programming genetic algorithms or hidden Markov models can be used to extract patterns from a multiple sequence alignment 0 Pattern nding and statistical methods expectation minimization and Gibbs samplng can be used also 0 Example PROFILE HMM Lecture 12 Small amp Large Scale Expression Analysis M Saleet J afri Program in Bioinformatics and Computational Biology George Mason University Microarray Data Analysis Gene chips allow the simultaneous monitoring of the expression level of thousands of genes Many statistical and computational methods are used to analyze this data These include 7 statistical hypothesis tests for differential expression analysis 7 principal component analysis and other methods for Visualizing highdimensional microarray data 7 cluster analysis for grouping together genes or samples with similar expression patterns 7 hidden Markov models neural networks and other classifiers for predictively classifying sample expression patters as one of several types diseased ie cancerous vs normal What is Microarray Data In spite of the ability to allow us to simultaneously monitor the expression of thousands of genes there are some liabilities with micorarray data Each micorarray is very expensive the statistical reproducibility of the data is relatively poor and there are a lot of genes and complex interactions in the genome Microarray data is often arranged in an n x m matrix M with rows for the n genes and columns for the m biological samples in which gene expression has been monitored Hence mg is the expression level of gene 139 in sample j A row e is the gene expression pattern of gene 139 over all the samples A column s is the expression level of all genes in a sample j and is called the sample expression pattern Types of Microarrays cDNA microarray Nylon membrane and plastic arrays by Clontech Oligonucleotide silicon chips by Affymetrix Note Each new version of a microarray chip is at least slightly different from the previous version This means that the measures are likely to change This has to be taken into account when analyzing data cDNA Mi croarray The expression level eij ofa gene 139 in samplej is expressed as a log ratio logrUgi of the log of its actual expression level rij in this sample over its expression level g in a control When this data is visualized eij is color coded to a mixture of red rij gtgt g and green rij ltlt g and a m1xture 1n between Nylon Membrane and Plastic Arrays by Clontech A raw intensity and a background value are measured for each gene The analyst is free to choose the raw intensity or can adjust it by subtracting the background intensity Oligonucleotide Silicon Chips by Affymetrix These arrays produce a variety of numbers derived from 16 20 pairs of perfect match PM and mismatch MM probes There are several statistics related to gene expression that can be derived from this data The most commonly used one is the average dz erence AVD which is derived from the differences of PMMM in the 1620 probe pairs The next most commonly used method is the log absolute value LAV which comes from the ratios PMMM in the probe pairs Note The Affymetrix genechip software has a absentpresent call for each gene on a chip According to Jagota the method is complex and arbitrary so they usually ignore it For What Do We Use Microarray Data Genes with similar expression patterns over all samples 7 We can compare the expression patterns e and e of two genes 139 and 1quot over all samples If we use cluster analysis we can separate the genes into groups of genes with similar expression patterns trees This will allow us to find what unknown genes have altered expression in a particular disease by comparing the pattern to genes know to be affiliated with a disease It can also find genes that fit a certain pattern such as a particular pattern of change with time It can also characterize broad functional classes of new genes from the known classes of genes with similar expression For What Do We Use Microarray Data Genes with unusual expression levels in a sample 7 In contrast to standard statistical methods where we ignore outliers here outliers might have particular importance Hence we look for genes whose expression levels are very different from the others Genes whose expression levels vary across samples 7 We can compare gene expression levels of a particular gene or set of genes in different samples This can be used to look compare normal and diseased tissues or diseased tissue before and after treatment For What Do We Use Microarray Data Samples that have similar expression patterns 7 We might want to compare the expression patters of all genes between two samples We might cluster the genes into gene with similar expression patterns to help with the comparison This can be used to look compare normal and diseased tissues or diseased tissue before and after treatment Tissues that might be cancerous diseased 7 We can take the gene expression pattern of sample and compare it to library expression patterns that indicate diseased or not diseased tissue Statistical Methods Can Help Experimental Design 7 Since using microarrays is costly and time consuming we want to design experiments to use the minimal number of micorarrays that will give a statistically significant result Data Preprocessing 7 It is sometimes useful to preprocess the data prior to visualization An example of this is the log ratio mentioned earlier It is often necessary to rescale data from different microarrays so that they can be compared This is due to variation in chip to chip intensity Another type of preprocessing is subtracting the mean and dividing by the variance Statistical Methods Can Help Data Visualization 7 Principle component analysis and multidimensional scaling are two useful techniques for reducing multidimensional data to two and three dimensions This allows us to visualize it Cluster Analysis 7 By associating genes with similar expression patterns we might be able to draw conclusions about their functional expression Probability Theory 7 We can use statistical modeling and inference to analyze our data Probability theory is the basis for these Statistical Methods Can Help Statistical Inference 7 This is the formulation and statistical testing of a hypothesis and alternative hypothesis Classi ers for the Data 7 We can construct classes from data such a diseased vs nondiseased tissue We can build a model such as a hidden Markov model that ts know data for the different classes This can then be used to classify previously unclassi ed data Preprocessing Microarray Data Before microarray data can be analyzed or stored a number of procedures or transformations must be applied to it In order to analyze the data correctly it is important to understand what the transformations might be doing to the data Preprocessing Microarray Data Ratioing the data Logtranforming ratioed data Alternative to ratioing the data Differencing the data Scaling data across chips to account for chiptochip difference Zerocentering a gene on a sample expression pattern Weighting the components of a gene or sample expression pattern differently Handling missing data Variation filtering expression patterns Discretizing expression data Ratioing the data This is the most popular transformation The expression level eij ofa gene i in samplej is expressed as a ratio rijgi of its actual expression level rij in this sample over its expression level gi in a control This tells us the level of under or over expression of a gene i in the samplej If the control value g is very small it can make the ratio very big This can skew results incorrectly Logtranforming ratioed data This is also a popular transformation The expression level eij ofa gene i in samplej is expressed as a log ratio logrijgi of the log of its actual expression level rij in this sample over its expression level g in a control This will suppress outliers caused when the control value gi is very small However it creates a new outlier when rij is very small Alternative to ratioing the data An alternative that eliminated both of the outlier problems above is This gives a value in 01 and can be interpreted at the probability of gene i is higher in sample j than in control Differencing the data Another transformation is to difference the data ie rij 7 g This is not really appropriate in our previous context However this is used by AffymetriX in a different context In their data rij is the strength of the match of the target 139 to a specific probe j and g is he strength of the match of the target 139 to a control for this probe Scaling data across chips to account for chiptochip difference As mentioned previously different chips might display different intensities When comparing different chips the data might need to be scaled so that they are on the same scale Alternatively they can be normalized so that they are between 01 and compared Zerocentering a gene on a sample expression pattern This in effect the same as subtracting the mean expression pattern Suppose that x is an expression pattern for a particular gene gi whose components are logratios Let where is the average expression pattern or control Then x indicates whether the gene gi is induced 0r repressed relative to control Remember the X s are vectors Subtracting the mean expression pattern and dividing by the stande deviation can accomplish this Remember the X s are vectors Weighting the components of a gene or sample expression pattern differently If we have a matrix of weights Wdiagw1 Wn We can weight the expression patterns by In this way we can weight the contributions from different genes differently Handling missing data Sometimes components of an expression pattern x are missing To x this the missing values can be replaced by the mean over the nonmissing values in x Variation ltering expression patterns When we are performing cluster analysis on gene or sample expression patterns patterns with low variance will all seem suf ciently similar to each other and might form a cluster This cluster will probably not re ect any interesting result Discretizing expression data Sometimes we might want to convert gene or sample expression pattern into discrete values For example if we have logratio we may want to simply look at whether something is up or downregulated To do this we can do the following In this case 1 would indicate upregulation 0 would indicate no change and 71 would indicate downregulation Measuring Dissimilarity of Expression Data We might want to compare two or more gene or sample expression patterns This might be used to differentiate between diseased and normal cells or finding out the genetic similarity of tissues To do this we need a distance metric or a dissimilarity measure Example Distance Metric Euclidean Distance 7 This is the most common distance measure This should not be used if either 1 Not all components of the vectors being compared have equal weight 2 There is missing data Preprocessing the data can often alleviate these problems We can also use the normalized Euclidean distance Example Dissimilarity Measures Maximum Coordinate Difference 7 The following computes the maximum absolute distance along a coordinate Dot Product JIhis is a dissimilarity version of the dot product Visualizing Micorarray Data It is usually easiest to understand data if it can be represented in 2 or 3 dimensions For example a 2D scatter plot of the expression levels of genes 139 and j over a number of samples can show the relationship between these two genes Principal Components Analysis In principal components analysis ndimensional data is converted to ddimensional data dltltn such that the components in the new space are uncorrelated and axis or dimensions of the new space are ordered with respect to the amount of variance they explain The rst component explains the most about the data The second component is orthogonal to the rst and explains more about the data and so on Principal Components Analysis Example 7 Consider height and weight data for a group of individuals This is 2D data but there is a correlation between height and weight We can use this property to reduce the data to 1D PCl explains most of the data and PC2 explains the rest height weight PCA and Microarrays Sample Application 1 7 If we want to compare the sample expression patterns from two groups diseased VS normal experimental VS control If we have n genes the each pattern is a point in ndimensional space Suppose we want to see if the sample expression patterns for these two groups cluster by group We might want to perform PCA analysis and perform cluster analysis at the top three components PCA and Microarrays Sample Application 2 7 On gene chips such as the one made by Affymetrix the same gene occupies multiple cells In theory the expression level of all cells with the same gene should be perfectly correlated However in practice this is often not the case due to imperfections in the technology or hybridization of the sequence fragment to other genes in the target PCA and Microarrays PCA allows us to see how good the correlation among these cells is To use PCA we would hybridize k different samples on the same chip For each sample the expression levels of a gene x in the n cells is an ndimensional vector Hence there are k points in ndimensional space Using PCA if most of the variance is explained by the rst principal component the effective dimensionality of the data is l and there cells are highly correlated PCA Limitations Clustering by PCA effectively yields clusters as if the Euclidean distance metric had been used Hence it is possible that it might miss clusters The reduction of dimensionality uses all coordinates If only a few genes out of a thousand differ between two samples Application 1 clustering by PCA might not yield any meaningful results Cluster Analysis of Microarray Data Recall that microarray data can be thought of as gene expression patterns or sample expression patterns These can be each considered to be vectors The first thing we have to do before applying cluster analysis is to find a distance between the various expression pattern vectors This is done using similarity dissimilarity measures such as Euclidean distance Mahalonobis distance or linear correlation coefficients Once a distance matrix is computed the following clustering algorithms can be used The clusters formed can differ significantly depending upon the distance measure used Cluster Analysis of Microarray Data Hierarchical Clustering 7 Assume each data point is in a singleton cluster Find the two clusters that are closest together Combine these to form a new cluster Compute the distance from all clusters to the new cluster using some form of averaging Find the two closest clusters and repeat Cluster Analysis of Microarray Data k Means Clustering 7 An alternate method of clustering called kmeans clustering partitions the data into k clusters and nds cluster means p1 for each cluster In our case the means will be vectors also Usually the number of clusters k is xed in advance To choose k something must be know about the data There might be a range of possible k values To decide which is best optimization of a quantity that maximizes cluster tightness ie minimizes distances between points in a cluster Cluster Analysis of Microarray Data Selforganizing Maps 7 This is basically an application of neural networks to microarray data Assume that there is a 2 dimensional grid of cells and a map from a given set of expression data vectors in R ie there are n nodes in the input layer and a connection neuron from each of these to each cell Each cell i j gets it own weight from n input neurons The weight vector uij is the mean of the cluster associated with cell i j Each data vector 1 gets mapped to the cell i j that is closest to 1 using Euclidean distanceIn order to train the network the mean vectors uij for the cells i j must be learned Hidden Markov Models and Microarray Data We can use Hidden Markov models for pattern recognition in the study of micorarray data Suppose that we want to consider gene expression data from a tissue sample and want to know if it is control or different from the control diseased experimentally altered responding to drug etc Consider the gene expression data vector as a set of emissions one for each vector coordinate Each emission has a value that is de ned by some probability distribution function This can be continuous or can even discrete To make it discrete the data should be preprocessed to indicate upregulation down regulation or no signi cant change 20 Introduction to Bioinformatics BINT 630 Dr D Andrew Carr Multiple Sequence Alignments Multiple Sequence Alignment mmnmgrqrwmxp mummy mum mu rnwucm uwmu a s r DETLLPVDH msc vn cw ca 39 Pvmeases I Knowledge gain I In science knowledge is in the patternssimilarities The interesting questions however are in the differences Multiple Sequence Alignment I It is believed based on finding similar protein sequences within highly divergent species that the over time the functional components embedded within the sequences are conserved in order to retain the function One of the most important elements of sequences is the phylogenetic information that similarities represent The sequence similarities gives insight into the evolution of families of protein or DNA sequences in Knowledge to be gained Estimation of evolutionary distance Mutation Speciation Note Multiple Sequence Alignment methods are also employed in assembly of DNA sequences from cloning vectors I Multiple Sequence Alignment Evolutionary distance 1 sequences Trv EP TV EP T 4 rev P Te F TV EP TTV EP TV P T sequences D mm 1 including eaps and lndels D mln r e 39 le penalwn Where a lS hE tuial number ureapswnmnme alignments cunsensus SEuuence Multiple Sequence Alignment Methods Global alignment 1 Homologous proteins Structural similarity 9 Functional similarity Common evolutionary origin Local alignment 1 Conserved regions Structural motif Functional domains in Phylogenetic or ancestral similarity Multiple Sequence Alignment Methods Global AlignmentTools u ynamio Programming Based MSA m ClustalW Thompson et al 1997 m u Iterative Methods Simulated Annealing SA A u Genetic Algorithm GA and RAGA Local alignment tools a Gibb s IBBs Lawrence 1993BLOCKS Henikoff and Henikoff1992m a Hidden Markov Model HMMER Eddy 1993 u EM based MEME Bailey and Elkan 1995 llhthllsearchlauncherbemtmcedulmultrallgnmultlrallgnman l2 nttplbayewebwadwurtn urgglbbsglbbshtml Global Multiple Sequence Alignment I MSA Multiple Sequence Alignment u Based on dynamic programming Aligns two sequences Provides a measure of accuracy ofalignments u Scores the alignment level of significance Different applications can handle indes as well as gaps Depends on choice of scoring system a Results change based on the scoring matrix PAM Evolutionary change BLOSUM Family membership a Different types of gap penalties Affine gap in Most methods employ a phylogenetic tree building to concatenate alignments I Sequence Alignment Review Dynamic Programming in Global NeedlemanWunsch a Local WatermanSmithBeyer Dunn 0 J De 2 web Dw Z mam t 04 web V2 gt 0 DH min 044 wabj 041 t Wa r where wt me wezght of a gap Dunn 0 DUJ 31 Du 32 Ego 300 V2 gt 0 SW min DHH wabj Egg0 g 16 when gUr 15MB gap penalty mctzun and wzsthz szmzlanty same mcnun Multiple Sequence Alignment Dynamic Programming An extension of the pair wise sequence alignment in Alignment of k sequences to k sequences kk12 possible sequence comparisons Alignment algorithms operate in a similar manner as before but now the distance matrix is k dimensional and the weight function compares k letters 2D simple matrix 3D Hypercube kD k dimensional hyperspace I MSA Dynamic Programming uAssume that we are trying to align three sequence ab and c Also assume that we have a cost function wxyz that computes the cost of comparing X y and z in sequences a b and c respectively 0 x y Z wx y 2 1 20f3 symbols are the same 2 x at y at Z l MSA Dynamic Programming Then our distance matrix D can be described by D17171k71 Waz b Ck DWAJH W b Ck DirtJeri Waz 9 Ck Vi jk gt 0 DIN min D y yk walbJ DlilJJr Wquot 77 DH r W bJ DWIH W Ck l MSA Dynamic Programming I Computationally expensive in Long sequences a Large number of sequences I Computational cost a For two sequences of lengths n and m I NeedlemanWunsc n Onm Onz for n gt m I WatermanSmithBeyer n Onmnm On3 for n gt m I Gotoh s algorithm El Onm Onz for n gt m n For three sequences of lengths n m and p I Most algorithms in Onmp On3 for n gt m gt p l MSA Dynamic Programming I For k sequences we get Onk where n is the longest sequence a If we compare 10 sequences of length lt 300 I 30010 comparisons in 5904900000000000000000000 I npcomplete task Global Alignment Dynamic Programming I Solutions a Sum of Pairs Carrillo and Lipman 1988 Look at all pairs Create parsimony tree Build alignment along the tree Take the best SP score Problems El Biased toward similar sets of sequences El Orderweighting important TV EP TTV EP TV P T P sequences TTV EP TW EP T V TH p Tk P Etu Global Alignment Dynamic Programming I Progressive Methods B Dynamic Sequence alignment based on hierarchical sequence similarity u Challenge I To choose the correct in nce 39 g n Gap penalties u Goal get the correct series of evolutionary changes u Programs I CLUSTALW u Gaps between conserved regions down weighted u Already occurring gaps down weighted ew aps are costly I TCoffee i Employs locally aligned conserved regionsdomains to help nd alignment I Global MSA Genetic Algorithms I SAGA Notredame and Higgins 1996 a Basic steps 1 Pairwise global alignment 2 Phylogentic tree 3 Weight sequences 4 Iterative refinement of MSA via GA in Check score and if not good enough return to 2 Global MSA Genetic Algorithms GA developed by computers scientists as a machine learning application Typical GA flow a Initialize the study population a Evaluate the population I while not termination condition in Alterthe population at this step Reproduction combination of members Mutation random changes in the population Crossover moving information between members i Reevaluate the population Departmental expert in Dr John Grefenstette l Crossover and Mutation Crossover Mutation Global MSA Other methods and what works best I Simulated Annealing u Pairwise alignment driven Heuristic algorithm to iterate and improve the score MSASA Kim et al 1994 Graph based methods a Directed acyclic graphs 1 Partial order graphs Best Lassman and Sonnhammer 2002 u DIALIGN Best for low sequence identity 1 TCOFFEE Best for high sequence similarity Local MSA Small Conserved regions e DNA sequence 9 regulaturyurcudlng regrerr ln DNA e PrutEln mmunal eernarrrs The small regions are considered motifs or pro les e Rege ar Expressluns PSSM pusmun seeerrre seerrrrg matrlx e HMM alluvvs gaps and lndels Three main methods e Prufl e analysls Uses Cerrservee Regluns erGleeal MSA Uses the eerrsewee regluns tn bulld a ererrle e Black analysls Luuks furregluns ln an Glubal MSA wrm nu subsmutluns e MOTlF lhen MOTOMAT Usesthesereglunsfurallgnments e Statls ucal and Pattern searchlng methuds HMM EM Glbbs Local MSA Sequence patterns KKFAQSTNLKSHILT KQFSHSAQLRAHIST GKFSDSNQLKSHMLV KDISSSESLRTHMFK KRFSHSGSYSSHISS KRFSHSGSFSSHMTS KTLSDRLEYQQHMLK Local MSA Protein Motif Databases l PROSITE httQwwwexgasyorgQrosite a Method I CLUSTAL MSA based I BLOCKS httglblocksthcrcorg a Method MSA based l PFAM httQwwwsangeracukSoftwarePfam a Method MSA and HMM models of protein motifs Local MSA PROSITE PROSITE a Current version contains 1446 documentation entries that describe D 1331 patterns Ei 4rues D 650 profilesmatrices Cytochrome P450 cysteine hemeiron ligand signature at FW SGNH x GD F RKHPT P c LIVMFAP GAD c is the heme iron ligand Regular Expressions Patterns described in a standard way are known as regular expressions X ANY OR RV IorLorV NOT DE notDorE repetitions x23 xx or xxx separator Nterminal Cterminal END Regular Expressions AC XVX4ED Ala or CysanyVal anyanyanyanyany but Glu or Asp LKHVAYVFQALIYWIK AVEMAGVKYLQVQHGS LYTGAIVTNNDGPYMA KEYKCKVEKELTDICN BLOCKS i A highly conserved regions 0 proteins BlacklPEnnJNJA m Dliheme max A muumu m mm p m mum a bane mm m ACE mach36 um PsaAEiusMau m aai 7a strengthzlzzs as m i say mumscmnmusxmmmsxsumx 32 G l ll bG39 R igg lll quotWW ausawtezaannawrzmeaamw mm MIr MuHM inljnmnll nmnux I Pro le Method of Local MSA I Two types of profile method in Weighted I Strictly drawn from Dayhoff PAM matrices based on amino acid frequency in Evolutionary I Rate of evolution is taken into account I Probability of the change between residues o Calculated as a log odds score of the Dayhoff method in The method for computation of the I Local MSA Pro le methods I Standard method of Profile creation in Get a global alignment in Shorter highly conserved regions in Create a Profile matrix of the conserved region I A profile is a scoring matrix a Row for each residue in the Profile in 23 columns I 20 each amino acid I 1 unknown residue 2 I 2 Gap and extension in These scores are based on strict counts Local MSA Pro le methods I BLOCKS Short highly conserved regions Can be 360 amino acids in length I Typically 10 55 a Do not contain gaps u a Are typically calculated by strict counts Do not typically account for or include evolutionary measures I Are good for pattern searching in Cons I Limitation to length I Only as good as the MSA that they are built with I MOTIF always nds a BLOCK even in random sequences I Local MSA Statistical Methods I EM Expectation Maximization u Initial guess is made as to the location of a motif I Expectation Ste The probability of nding the motif at any any point in each of he sequences is calculated based on the distribution of basesamino acids within each column ofthe initial guess i The probabilities are used to rede ne he initial distribution within each motif column Maximization Ste i The new quot L quot quotquotquot quot quot initial guess This is iterated until convergence he size of a motif region can be determined by the EM algorithm based on log likelihood scores after one iteration I GIBBS Sampler Method in Searches for the statistically most probable motifs in The goal is to maximize the ratio of motif probability to background probability Hidden Markov Models Local MSA I Hidden Markov Models 1 Statistically based 1 Produce sound results 1 Provide representations of sequence domains or protein families a Drawback I Require lots of data to train Markov Chains Probability for each state a is based only on a set number of preceding characters I of preceding characters orderofthe Markov Model Probability of a sequence order 2 i Ps Pm P C P CA PCAG PAGC PGCC PCCT i Hidden Markov Models Observed A A T T l frequencies 9 0 0 C10 ivory HCD 2313 gt gt 300 C10 07 03 Probabilistic model true state is unknown Hidden Markov Models States well defined conditions Edges transitions between the states 0 ATGAC I ATTAC Q ACGAC ACTAC Each transition assigned a probability Probability of the sequence single path with the highest probability Viterbz39 path sum of the probabilities over all paths Baum Welch method Common Protein Sequence HMM Matchstate lnsenstate Adapted representation of the model by of Krogh et al 1994 I Hidden Markov Models probabilities logodds log PS 025L How HMMS are built trained pluuauilhie Algorithms for training markov models rd arid Backward Aigorithms Predictive Measures u Viterbi u Bauereich Best practices to improve HMM pe u Aiter prior diStributiOri or amino acid Knuwiedge added siaii puim can emphasize cunvergence Aiter architecture based Orv farmiiai knowiedge rformance s u PSSM C alculano The una DHDLE MSA sm n smaH vegmnsmm ave mum Lnnsevve n Onuemesevegmnsare 1mmmswmpnnan nmnsdevmeundenwvg msummwn by Wynne mnmwas Heated n mannensmesequenuemnemmpareu Man w g n anrMpmmm mmummmmm mMW mummmwmmmummummmmm mmnnmmnnu Ha E PZ logZ PZ Fl quotmm mmmmmmnmmnmmumnnnammv gunmmnmmmu vasmpmaiwulvmmmulshulmmamlawlmliwki NI39XW The Logo Dlsplay of Shanon Entropy Scores for heme BLOCK mm mlrnmunnxnltm11ununus BINF 630 Lecture 4 Introduction to Probability and Statistical alysis Saleet Jafri Introduction to Probability Suppose we ip a coin 10 times Each time we will get either a H heads or T tails In probability jargon there are 10 trials each with an outcome of H or T Assume that the probability of getting a H is p ie PH ProbH p For a fair coin this is 05 PT l p as T is the only other possible outcome and the sum of all the probabilities must be 1 Bionomial Distribution Now suppose that we want to know the probability of getting k heads in n trials We can call this event A PA pklp 39k nknk Example k 2 n3 There are 8 possible outcomes 3 with the desired event PA38 HHH HTH HTT TTH l22l2132 l 38 HHT THH THT TTT Expected Values 0 The expected value is the expected outcome For example E of Heads 3 coin tosses is HHH HTH HTT TTH HHT THH THT TTT EH 2 H PH 018138 238 318 15 11p 312 Expected Values 0 The expected value is the expected outcome For example E of Heads 3 coin tosses is HHH HTH HTT TTH HHT THH THT TTT VarH EH2 EH2 nplp 31212 34 Conditional Probability PB is the probability that event B occurs PA and B is the probability that both and A and B occur PA or B is the probability that either A or B or both occur PAl B is the probability that occurs given that B has already occurred Bayes Rule PAlB PA and BPB comes from PA and B PB PAlB Bayesian Statistics Suppose a gene A has two possible alleles Al and A2 and another gene B has two possible states B1 and B2 PB PBl PB2 1 and PA PA1 PA2 1 Suppose that we know PBl 03 then PB2 l 7 03 07 Suppose that we also know that PAllBl 08 and PAZlBZ 07 Since PAllBl PAZlBl l PAZlBl l 7 08 02 Since PAllBZ PA2lB2 l PAllBZ l 7 07 03 How do we nd the joint probabilities of Al and B1 Bayesian Statistics Al A2 I I B1 024 006 03 How do ll 1nth1s table B2 021 049 07 Use Bayesian Statistics 03945 03955 1390 PAl and B1 7 PB1PA1B1 7 03 x 08 7 024 PA2 and 132 7 PB2PA2B2 7 07 x 07 7 049 PAl and 132 7 PB2PA1B2 7 07 x 03 7 021 PA2 and B1 7 PB1PA2B1 7 03 x 02 7 006 Other Distributions Binominal Distribution Poisson Distribution Exponential Distribution Normal Distribution Uniform Distribution Gumbel Extreme Value Distribution Statistical Hypothesis Testing Statistical hypothesis testing is used to determine Whether a finding is statistically significant A null hypothesis will be formulated and a significance level chosen The test will see if the null hypothesis is rejected in favor of the alternative Distributions and Hypothesis Testing 0 A The extereme value distribution B Normal Distribution If you want 5 of the area under the curve to be to the right of the dark line X has to be larger for the extreme value distribution Gumbel Extreme Value Distribution The distribution of alignment scores between random sequences follows the Gumbel Extreme Value distribution and not the normal distribution While the area under both curves is l the extreme value distribution is skewed For xl 96 we have a 95 con dence interval area between 1 96 and 196 in the normal distribution For x 3 we have 95 of the area to the left in the extreme value distribution Confidence interval 0 If the probability that the null hypothesis occurs is less than 5 we say that we reject the null hypothesis with a 95 con dence interval Local Alignment for Distantly Related Proteins The PAM matrices devised by Dayhoff works for closely related protein sequences 0 For proteins not closely related Dayhoff introduced an additional parameter t which represents the evolutionary time scale 0 Hence is the probability that amino acid A is substituted by B within evolutionary distance t Assume that the amino acid distribution does not change during evolution Then Hence PBlAt can be estimated from the relative frequency of the pair AB in the known alignment of two sequences s and s with distance t and from the relative frequency q A of the amino acid This allows us to generate a substitution matrix M for all pairs of amino acids for a speci c evolutionary distance t using the matrix M is the PAM matrix Point Accepted Mutation Evolutionary Time Two sequences s and s have an evolutionary distance of l PAM unit if s was converted to s by a series of accepted substitutions with an average of 1 accepted substitution per 100 amino acids Note that it is accepted substitution backsubstitutions are ignored in this model In other words a substitution matrix is l PAM unit if the expected number of substitutions in a typical protein is 1 Hence NMH is a n PAM matrix Evolutionary Time For sequences Whose distance is thought to be more than 100 PAM units distance the PAM250 matrix is used 1 PAM unit is considered to be about 107 years but this can vary Finding M It is not likely to find homologous sequences s and s that are exactly 1 PAM distant A scaling factor X was chosen so that the matrix generated by is l PAM Caveats Since it is difficult to extrapolate to distantly related proteins from closely related units other substitution matrices have been proposed such as BLOSUM BLOSUM is better for distantly related proteins What is the E value The E value measures the signi cance of an alignment Suppose you align a test sequence with 10000 sequences one of which matches the test sequence called the related sequence All the sequences will have a alignment score associated with them The E values gives an estimate of the number of unrelated sequences that have a score as high as the related sequence score The lower the E value the better the alignment E005 corresponds to the 95 confidence interval However for truly related sequences E lt 1010 Gumbel Extreme Value Distribution The extreme value distribution is described by PSltX eXp e39X PSgtX l eXp e39X To account for alignment scores it is written as PSgtX l eXp e39MX39u X is the mode highest point and u is the decay or scale These are found through parameter estimation Parameter Estimation Method of Moments The mean and standard deviation is calculated for random alignment scores for a given sequence length range The parameters are calculated as 7 Tco SQRT6 12825 G u u yku 045000 where y is Eulers constant Parameter Estimation Maximum Likelihood Estimation 7 statistical methods that chooses the parameters so that they are the most likely to explain the data Poisson Approximation Method 7 This methods aligns a few pairs of random sequences but nes a large number of possible alignments This is very fast using dynamic programming as as soon as the scoring matrix is built all possible alignments are there Of these a number of alignments above a certain threshold are retained The average number of alignments above a threshold yield a value of s which can be used to derive the parameters using a Poisson distribution Significance of Local Alignment The extreme value distribution can be used to calculate the significance of a local alignment Assume that there are two sequences about 250 amino acids long and they are aligned by Smith Waterman using the PAM 250 substitution matrix and a high gap score to omit gaps Suppose the following alignment was found FWLEV EGNSMTAPTG FWLDVQGDSMTAPAG Signi cance of a Local Alignment A significant alignment between unrelated or random sequences will have a score of S log2nm log2250250 16 bits The score of the above alignment is 73 using the PAM 250 matrix Fig 314 This must be converted into bit units by the conversion factor units 13 bits yielding 733 243 bits This greater than 16 bits by 83 bits so the alignment is very significant Significance of Local Alignment A more complicated method by Altchul and Gish provide estimates of parameters K009 and 9t0229 for the PAM250 matrix Equations 23 yields S XS ln Kmn 0229243 ln009250250 1672 863 809 bits Significance of Local Alignment Equation 24 yields the probability Ps gt809 l eXpe39809 31 X 10394 The probability can also be calculated by equation 26 Ps gt809 e39809 31 X 10394 FASTA and BLAST FASTA and BLAST also calculate significance of the search results alignments FASTA uses alignment scores between unrelated sequences to calculate the parameters of the extreme value distribution BLAST calculates estimates of the statistical parameters based on the scoring matrix and sequence composition Bayes Block Aligner 0 Software was developed by Zhu et al 1998 that slides a sequence along another to find the highest ungapped regions and blocks 0 Instead of using a substitution matrix and gap scoring system a Bayesian statistical approach is used Bayesian Evolutionary Distance How can we tell how closely related two sequence are Apply different PAM matrices starting with PAMl and compare the odds scores When the odds are sufficiently high it is a better esitmate fo the true evolutionary distance Lecture 9 Gene Prediction Saleet J afri BINF 630 Gene Prediction Analysis by sequence similarity can only reliably identify about 30 of the protein coding genes in a genome 5080 of new genes identified have a partial marginal or unidentified homolog Frequently expressed genes tend to be more easily identifiable by homology than rarely expressed genes Gene Finding 0 Process of identifying potential coding regions in an uncharacterized region of the genome 0 Still a subject of active research 0 There are many different gene finding software packages and no one program is capable of findng everything Genes aren39t the only thing we39re looking for oBiologically significant sites include Splice sites oProtem binding sites promotors histones etc oDNA 3D structure features etc In a lot of cases we don39t even know what constitutes one of these sites so all we can do is look for repeating patterns codans Start Icodon Donor sile CGC GIGEGIGAGC Transcripuon Sla 539 UTR memer 39ICTCCCA M Acoeplnr sue PolyA sila Slop Icodun autumnmm GGCCCC39TE Fig 3 The s Inchurc on gene wim sum ohh impnnlm signals shown Eukaryotes vs Prokaryotes Eukaryotic DNA wrapped around histones that might result in repeated patterns periodicity of 10 for histone binding The promotor regions might be near these sites so that they remain hidden Prokaryotes have no introns Promotor regions and start sites more highly conserved in Prokaryotes Different codon use frequencies Gene nding is speciesspeci c Codon usage patterns Vary by species Functional regions promoters splice sites translation initiation si es ermination signals Vary by species Common repeat sequences are species speci c Gene nding programs rely on this information to identify coding regions The genetic code Table of Standard Genetic Code 39 c A 39 c 39 11139th a 39TcTSu s TAT Tyr Y 39TcT Cys c Tc TAc ch TAA Tex TcA Tex TAc ch Trp W 39 cTT Len L 39ccT m P cAT HIS H 39ccT Arg R C cTc cc cAc ccc TTc TTA Len L Trc 88 cTA cc cAA on Q ccA CTG ccc cAc ccc ATHleU ACTTMT AATAsn N AGTSuS A ATc Ac AAc Acc TA AAALys K AcA Arg R ATG Mu AAc Acc cc cTT vn V GCT Ah A cAT Asp D GGT Gly c G GTC ccc cAc ccc cTA ccA cAA on E ccA cTc ccc cAc ccc tenscamsmnm 195 cns39 155511 cndmuj mus mplu mqmcy per mm muuhuD Codon usage mm 17 A 959 um 1A 1 799 mm 12 3 mm 27 2 1519 um 19 3 1959 mm 21 5 mm 5 7 31 9911 9 5 A73 1m 9 9 mm 12 A 59 999 5 2 295 m 9 5 mm 11 A 531 can 13 3 7A1 mm 9 A 999 22 7 1252 can 29 7 11A9 m 15 9 mm 7 1 395 9911 12 3 591 m 19 5 mm A5 9 25A5 999 7 7 A39 9119 29 9 Mm 1A 3 795 ml 11 7 552 my 15 A we 27 5 1527 1199 2A 7 1371 m 25 9 11911 7 9 A33 ADA 13 3 735 29 5 1199 22 3 1235 1199 9 9 A91 1199 39 9lt m 9 9 5A3 999 15 7 927 my 17 7 mm 19 9lt 1997 999 39 n 1559 9119 29 2 9911 5 3 359 9911 12 9lt 712 m 23 5 999 33 9lt 1975 999 9 9 A99 9119 35 3 Fasciom hepan39mgu39nv 25 cns39 5919 cndmuj mus mplu mqmcy per mm muuhuD 77 um 1A A 95 191 29 A 159 Um 11 5 59 1A9 Ucc 9 9lt 59 mm 25 5 151 999 5 A 39 35 9911 12 n 71 1m 1 A 9 9911 2 n 12 1A5 99 5 9 A9 119 1 n 5 999 15 9 199 A3 9 5 9 35 mm 13 n 7 9 13 9 92 7A can 5 7 3A 9119 11 3 57 999 5 5 33 39 9911 11 3 57 m 19 9 119 9911 12 3 73 9A 9 19 5 53 9119 15 2 95 999 A A 25 192 ml 15 1 95 my 2A 7 1A5 Am 11 3 57 95 1199 11 5 59 m 21 1 125 1199 5 9 A9 A5 ADA 11 n 55 m A1 9 2A9 11911 5 5 39 19A 1199 19 5 53 m 29 9 171 1199 5 7 3A 99 u 25 5 151 my 35 7 217 u 33 5 99 99 999 19 5 119 9119 2A 7 1A5 999 15 9 9A A1 9911 15 5 92 m 39 9 239 9911 29 5 175 159 999 11 n 55 9119 31 3 195 999 7 A AA cum Gc As 59 is 19m Gc 52 79 219 1919 Gc 39 53 3m 1m Gc A9 29 593 1191 AA 1712 991 1559 139A 1957 Identifying ORFs Simple rst step in gene nding Translate genomic sequence in siX frames Identify stop codons in each frame Regions without stop codons are called quotopen reading framesquot or ORFs Locate and tag all of the likely ORFs in a sequence The longest ORF from a Met codon is a good prediction of a protein encoding sequence SOFTWARE NCBI ORF Finder ORF Finder input The on Fmdev Open Readmg Fvame Hwy 5 a gvapmcaW ana ysws uu wmch nds an apequot veadmg names ma se ec ame mwmum sue m a usev39s sequence m m a sequence aheady m e da abase Tms uu memmes aH upen veadmg names usmg the standavd m anemam geneue mm s The educed ammu and sequence can be saved m vamus Iuvmats ane hhnwum 39 seavched agams he seeuenee da abase usmg We www mar serve The on 1S 0 may sh m be he m m pvepavmg cump ete ane accmam seeuenee summssmns u WW WW s as Kaged me We Sequm seeuenee summssmn sunwave mm mmquot Enter GI orACCESSION I OrIFind Clear or sequence in FASTA ormat 4 FROM39 f TO39 f eeneue eeees S andam ORF finder results mew madam Maw mu F E nmsmn iy g I K 71 57 see lt t3 35390 7315 927 L w w m 2 mm m m 73 mm mm we mmmmmmu mmnlmunmmml Tests of the Predicted ORF Check if the third base in the codons tends to be the same one more often than by chance alone Are the codons used in the ORF the same as those used in other genes need codon usage frequency Compare the amino acid sequence for similarity with other know amino acid sequences Problems with ORF finding A singlecharacter sequencing error can hide a stop codon or insert a false stop codon preventing accurate identification of ORFs Short exons can be overlooked Multiple transcripts or ORFs on complementary strand can confuse results Patternbased gene finding ORF finding based on start and stop codon frequency is a pattembased procedure 0 Other pattembased procedures recognize characteristic sequences associated with known features and genes such as ribosome binding sites promoter sites histone binding sites etc 0 Statistically based Contentbased gene finding 0 Contentbased gene finding methods rely on statistical information derived from known sequences to predict unknown genes 0 Some evaluative measures include quotcoding potentialquot based on codon bias periodicity in the sequence sequence homogeneity etc A standard contentbased alignment procedure Select a Window of DNA sequence from the unknown The Window is usually around 100 base pairs long Evaluate the Window39s potential as a gene based on a variety of factors Move the Window over by one base Repeat procedure until end of sequence is reached report continuous highscoring regions as putative genes Combining measures Programs rarely use one measure to predict genes Different values are combined using probabilistic methods discriminant analysis neural net methods etc to produce one quotscorequot for the entire Window Drawbacks to Windowbased evaluation 0 A sequence length of at least 100 bp is required before significant information can be gained from the analysis 0 Results in a 100 bp uncertainty in the start site of predicted coding regions unless an unambiguous pattern can also be found to indicate the start Most are webbased but Submit sequence input sequence length may be limited Select parameters if any Interpret results Most software is first or second generation results come in non graphical formats GRAIL Gene finder for human mouse arabidopsis drosophila E coli Based on neural networks Masks human and mouse repetetive elements Incorporates pattembased searches for several types of promoters and simple repeats Accuracy in 75 95 range Glimmer Genefinder for bacterial and archaebacterial genomes Uses an quotinterpolated Markov modelquot approach a Markov model is a model for computing probabilities in the context of sequential events Predicts genes with around 98 accuracy when compared with published annotations No web server GENSCAN Genefinder for human and vertebrate sequences 0 Probabilistic method based on known genome structure and composition number of exons per gene exon size distributions hexamer composition etc 0 Only protein coding genes predicted Maize and arabidopsisoptimized versions now available 0 Accuracy in 5095 range GeneMark Gene finder for bacterial and archaebacterial sequences Markov modelbased GeneMark and GeneMarkHMM available as web servers Accuracy in 9099 range CRITICA Gene nder for bacterial and archaebacterial genomes Combines sequence homologybased prediction with contentbased statistical dicodon probability analysis Accuracy in 9099 range No web server GeneParser Predicts the most likely combination of exons and introns using dynamic programming The intron an exon positions are aligned subject to the constraint that they alternate A neural network is used to adjust the weights given to the sequence indicators of know exon and intron regions such as codon usage information content length distribution hexamer frequencies and scoring matrices Other software 0 Generation 0 GenelD Genie GenView EcoParse 0 etc tRNAscan Locating tRNA genes is less difficult than other types of gene identification 0 pol Ill promoter is simple RNA secondary structure is conserved 0 SOFTWARE tRNAscanSE Gene finding strategy for beginners Choose the appropriate type of gene finder Make sure that you39re using gene finders for microbial intronless sequences only to analyze bacteria and archaea If there is no organismspecific gene finder for your system at least use one that makes sense ie use an arabidopsis gene finder for other plants Neural Network Topology Input Layer Hidden Layer Output Layer Weight Perceptron Making Neural Networks Take known data and divide into two sets the training set and test set Use the optimize the weights so that the neural net gives the best outputs for the training set Test the neural net with the test set to see if it works If data is limited you can permute the data so that you have multiple training and test sets Caveats with Neural Nets The net only performs as well as the training set 0 In other words it can only nd things it is trained to do 0 As more diverse data becomes available the neural net gets better Grail II Neural Net score oi Smers in candidale region flanking region GC composition 0 candidale region GO Composilion 0 score ior splicing acceptox site 0 score for Splicmg arcepmr sile a length oi region 0 Exon score 5 Ouipui 0 Hidden layel O O O 0 Input layer Finds exons in eukaryotic genes that is takes inputs and predicts if a gene is present Markov Model A process is Markov if it has no memory that is if the next state it assumes depends only on its present state and not on any previous states The states can be observed and the transition probabilities between states is known Example rolling a die has 6 possible states each with a probability of 16 Hidden Markov Model Also has the Markov property Some of the state or transition probabilities information is missing The process emits sequences of results The emission probabilities is the probability of each outcome in a given state The model is trained so that the training set is the most likely outcome for the model Training and Testing the HMM The parameters of the model are t on a training set ie the parameters are chosen so that the training set is the most likely outcome for the model A test set is used to make sure the model is welltrained If so the model can be used on new data HMM of E Coli Gene match 0 O Kl HMM for nding the most probable set of genes inE coli gene sequences of unknown gene composition A similar model exists for each of the 61 codons HMM of E C011 Genes Assumes that there is no relationship each codon and codons used later in the sequence 0 This assumption works however analysis of sequential codons in a gene have shown that some pairs are found at greaterlesser frequencies than would occur at random 0 GeneMarkHMM uses sequence information from the previous 5 bases instead of the previous 2 bases Assessing Methods 0 Take a set of know genes and test method for true positives TP false positives FP true negatives TN and false negatives FN 0 Use these to calculate Specificity TPTPFN Sensitivity TNTNFP Correlation coef cient TPTN FPFN SQRTANTPFPAPTNFN 20
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'