BIOINFORMATICS I BIOL 4540
This 229 page Class Notes was uploaded by Arianna Veum on Monday October 19, 2015. The Class Notes belongs to BIOL 4540 at Rensselaer Polytechnic Institute taught by Christopher Bystroff in Fall. Since its upload, it has received 27 views. For similar materials see /class/224829/biol-4540-rensselaer-polytechnic-institute in Biology at Rensselaer Polytechnic Institute.
Date Created: 10/19/15
Bioinformatics 1, lecture 10: Multiple sequence alignment

Building and pruning multiple sequence alignments

Steps in making an MSA
- Database search
- Automatic multiple sequence alignment
- Removing N- and C-terminal extensions
- Removing redundant sequences
- Removing suspect sequences
- Interpretation: structure, function

Using PSI-BLAST to get a set of sequences
- Search for a sequence ("ecoli dihydrofolate"). Choose the first hit, display it in FASTA format, and copy it.
- Go to BLAST, then protein BLAST. Paste the sequence into the BLAST search window.
- Search in "Arthropoda [ORGN]". Hit BLAST. (The E. coli sequence is the outgroup.)
- Set descriptions -> 100, alignment view -> flat query-anchored without identities, Format for PSI-BLAST -> check (0.005). Hit Format.
- Check the sequences. Hit "Run PSI-BLAST iteration 2".
- Get the selected sequences. Display them in GenPept (GenBank) format and send to file.
- Rename the file bugdhfrgp. Send this file to bioinf45 using scp. Import it into SeqLab.

Quick intro to PSI-BLAST
[Figure: protein database -> BLAST -> hits aligned (e-value cutoff) -> sequence profile]

Sequence distance versus similarity
Maximizing similarity and minimizing distance are equivalent if
$d_{ij} = S_{max} - S_{ij}$
where $S_{max}$ is the maximum possible similarity and the minimum distance is $d = 0$.
For each position in the alignment:
- Distance based on identity score: $d = 100 - \%\,\mathrm{identity}$
- Distance based on similarity score: $d = -\log\left[(S_{real} - S_{rand}) / (S_{ident} - S_{rand})\right]$

Bioinformatics 1, lecture 25: Gene finding in eukaryotes
Intron/exon boundaries, splicing, alternative splicing, Gibbs sampling (from lecture)

Finding genes in prokaryotes is easy
Just translate the DNA sequence in all 6 reading frames. The ORFs (regions starting with ATG and ending in an in-frame stop codon) will be at least 300 bases in length, while random reading frames will be dotted with stop codons at a rate of about 3 stop codons every 64 codons.
XXXXXXXXATG-(3N)-TGAXXXXX

Finding genes in eukaryotes is harder
Genes are composed of coding regions (exons) and internal non-coding regions (introns). Genes are transcribed to pre-mRNA. Introns are removed from pre-mRNA by the
spliceosome (a ribozyme). Proteins are translated from the mRNA after splicing. Different tissues may splice pre-mRNA differently.
[Figure: DNA -> pre-mRNA -> mRNA]

pre-mRNA structure
[Figure: pre-mRNA structure compared with prokaryotic mRNA; poly-A tail labeled]

Introns early? Introns late?
- Eubacteria and archaea don't have introns.
- Eukaryotes have introns.
- Did the common ancestor have introns?

A generic gene sequence model for pre-mRNA
[Figure: pre-gene region; AUG; exons and introns (intron frame 0, 1, or 2); 5' splice site (GU); 3' splice site (AG); stop (UAA, UAG, UGA); post-gene region. Schematic: XXXXXXATG-exon-intron-exon-TAAXXX]

Splicing mechanism: the spliceosome
Spliceosome, def.: a ribonucleoprotein complex, containing RNA and small nuclear ribonucleoproteins (snRNPs), that is assembled during the splicing of pre-mRNA (the messenger RNA primary transcript) to excise an intron.
[Figure: intron with GU at the 5' splice site (the donor), A at the branchpoint, and AG at the 3' splice site (the acceptor)]

Splicing mechanism (spliceosome not shown)
[Figure: branchpoint A, GU, and the free 3'-OH]

Splicing mechanism: the lariat
[Figure: the excised intron (GU ... A ... AG) forms a lariat and gets degraded; the spliced mRNA goes to the ribosome; the spliceosome dissociates]

Splicing mechanism
http://www.neuro.wustl.edu/neuromuscular/pathol/diagrams/splicemech.html
Much thanks to T. Wilson, UCSC.

RNA-binding proteins may selectively block splicing
In some tissues, an RNA-binding protein is expressed in response to a stimulus. For example, it binds near the branchpoint or one of the splice points; it blocks, in this case, the
cyclizing step.

GU ... AG
The spliceosome cuts before GU and after AG. This is a constraint.

Frame of intron
Frame 0 (intron starts at a codon boundary):
AGUCUUAUcUUUUCAGUUGGG ccGUAGAACCACUCGUAA
Frame 1 (intron starts one base after a codon boundary):
LAGUCUUAUCUUUUCAUGUGGG CCGUAAGACCACUCGUAA
Frame 2 (intron starts two bases after a codon boundary):
AGU CUUlAUCUUUUCAGGGUGGJi ccGUAGAGCCACUCGUAA
The intron length must be a multiple of 3 if the intron is alternatively spliced.

How to find splice points using the protein sequence database
1. Translate the DNA in all 6 frames.
2. Search the database of protein sequences using the translations.
3. Using the complete protein sequence, align it to the translation and find the regions of near-perfect identity. These will abruptly end at the intron start site.
4. Find the 5' GT or 3' AG signal at the point where the identity matches abruptly end.
5. If your translation has an insertion with nearly perfect matches on either side, you have an alternative splicing.

In-class exercise: find the alternative splicing
- On bioinf45: cp bystr0ffintr0naltfasta
- Open Netscape. Go to NCBI, then BLAST, then select blastx (not tblastx).
- Select the SwissProt database. Paste in the DNA sequence. Submit.
- While waiting, do the exercise on the next page.

In-class exercise: find the alternative splicing (continued)
- Open SeqLab. Import altfasta.
- Translate in the 3 forward reading frames. One of these frames will be correct.
- When blastx is done, look for alignments that have very high identity and one or more large insertions (see next page).
- Find the location of the alternative splice points (donor and acceptor). Can you locate the intron exactly? What is the frame relative to the codon frame?

A sure sign of alternative splicing in blastx output
Score = 160 bits (404), Expect = 8e-37
Identities = 85/116 (73%), Positives = 86/116 (74%)
Frame = +2

Query 76820 RSHENGFMEDLDKTWVRYQECDSRSNAPATLTFENMAGAFSFIHSRVGSPWXXXXXXXXX 76999
             SHENGFMEDLDKTWVRYQECDSRSNAPATLTFENMA
Sbjct 778   KSHENGFMEDLDKTWVRYQECDSRSNAPATLTFENMA 814

Query 77000 XXXXRHTGVFMLVAGGIVAGIFLIFIEIAYKRHKDARRKQMQLAFAAVNVWRKNLQ 77167
                   GVFMLVAGGIVAGIFLIFIEIAYKRHKDARRKQMQLAFAAVNVWRKNLQ
Sbjct 815          GVFMLVAGGIVAGIFLIFIEIAYKRHKDARRKQMQLAFAAVQVWRKNLQ 863

Identical up to the insertion. Identical after the insertion.

Which codons can come at the start/end of an alternative exon?
[Figure: codon patterns at the boundaries of an unspliced intron that starts with GU and ends with AG. Legend: e = a base within the exon; i = a base within the intron; | = intron/exon boundary.]

Which amino acids can come at the start/end of an alternative exon?
[Figure: amino acids possible at each boundary. The unspliced intron starts with GU: W; CRSG; FINHYCD; FLSYCW. The unspliced intron ends with AG: FMNHYWCD; SR; QKEN; VADEG. Same legend as above.]

What frame is the intron in the earlier slide?

Exon-GU...AG-Exon: is that all there is to it?
What information is used to predict intron/exon boundaries?
- Introns always start with GU and end with AG (GT...AG in DNA).
- Introns can start in one of three frames (0, 1, 2) relative to the codon frame.
- Alternatively spliced introns (which may be exons) must have a multiple of 3 nucleotides.
- 3' and 5' intron sequence motifs.
- Branchpoint sequence motif.
- Enhancer/silencer sequence motifs (ESEs, ESSs, ISEs, ISSs).
- Base composition in exons/introns (see CODONPREFERENCE in SeqLab).
- Orthologs conserve intron/exon boundaries.

Sequence composition method for gene finding
Most exons code for protein. Most introns do not. Selective pressure on exons includes:
1. species-specific codon preferences
2. amino acid preferences
3. selection for foldability and function
[Figure: a simple HMM for intron/exon base composition, with base probabilities P(A), P(C), P(G), P(T) in each state]

ESEs, ESSs, ISEs, ISSs
- ESE = exonic splicing enhancers: sequences in the exons that promote splicing.
- ESS = exonic splicing silencers: sequences in the exons that inhibit splicing.
- ISE = intronic splicing enhancers: sequences in the introns that promote splicing.
- ISS = intronic splicing silencers: sequences in the introns that inhibit splicing.

How were ESEs found?
1. A training database was constructed of exonic mRNA (post-spliced) that was (a) constitutively spliced
(not alternatively spliced) and (b) from an internal, non-protein-coding exon.
2. A database of control (non-ESE) sequences was constructed.
3. The relative abundance of all 8-mers was found.
4. 8-mers with high relative abundance were tested by mutating the putative ESE 8-mers and determining the splicing efficiency by gel electrophoresis.
Zhang XH-F and Chasin LA. Computational definition of sequence motifs governing constitutive exon splicing. Genes & Development 18:1241-1250 (2004).

Logos for ESEs and ESSs
[Figure: sequence logos of putative ESSs and putative ESEs]
Some of the motifs found by Zhang & Chasin using relative abundance analysis of 8-mers, after clustering.

Motif finding, revisited
Where is it, and what am I looking for?
Methods to find simultaneously the motif and the locations of the motif in a set of sequences:
- MEME
- Relative abundance
- Gibbs sampling

Gibbs sampling
A stochastic version of the EM algorithm.
1. Choose a length and initial (or random) guesses of the motif locations.
2. Sum the motif profile (with or without pseudocounts/noise) from the current motif positions.
3. Remove one sequence. Calculate probability scores for each possible motif position.
4. Randomly choose a motif position from the probability distribution.
5. Repeat steps 2-4 many times.
The radius of convergence is wider than for EM. Adding noise counts can improve the radius of convergence, but at the cost of speed.

Example (keep the scoring window fixed, length 4)
AGCTAGCTTCTCGTGA
TCTCGAGTGGCGCATG
TATTGCTCTCCGCAGC
Slide the first sequence through the motif window and calculate the score at each aligned position.
[Figure: score versus aligned position, shown as the sequence slides through the window]
Do next sequence and
so on, cycling through the sequences many times.

Convergence is when there are no more changes.
AGCTAGCTTCTCGTGA
TCTCGAGTGGCGCATG
TATTGCTCTCCGCAGC

In-class exercise: Gibbs sampling
Work in groups.
- Use http://www.random.org to get a page of random numbers between 1 and 40.
- Start SeqLab.
- In a UNIX shell on bioinf45: cp bystroffgibbsfasta
- Import this file into the Editor window.
- Randomly align windows of length 4.
- Define the location of your 4-base window, and don't change it.
(continued)

Gibbs sampling, p. 2
1. Start with the first sequence and do each sequence in turn.
2. Align sequence position 1 to the motif window position (p = 1).
3. Calculate the cumulative scores of the number of identities in the length-4 block. Start with score(0) = 0.
   score(p) = score(p-1) + the number of identities between the sequence you are moving and the other three sequences, summed only inside the motif window.
   Increment p (shift the sequence to the right). Keep score on paper.

Gibbs sampling, p. 3
4. Stop when p = L - 4 + 1. Pick a random number: choose the first one on the list that is less than the maximum cumulative score.
5. Find the first position p that has a cumulative score greater than the random number.
6. Set the sequence to position p.
7. Go to the next sequence and repeat steps 2-6.
For example:
p  score
1  4
2  9
3  9
4  11
5  11
6  16
If the random number is 8, choose p = 2.

Gibbs sampler home page: http://bayesweb.wadsworth.org/gibbs/gibbs.html

The nice thing about HMMs: they are modular
HMMs can be connected by their begin and end states to make a super-HMM. Individual modules can be trained separately.

A modular HMM for introns
[Figure: short, variable-length intron model; fixed-length plus variable-length intron model]
Stanke M, Steinkamp R, Waack S, Morgenstern B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 2004 Jul 1;32:W309-12.

Intron model for mammals
Figure labels: branch site, polypyrimidine region, donor
motif (contains GU), acceptor motif (contains AG).
From Blencowe B. Exonic splicing enhancers: mechanism of action, diversity and role in human genetic diseases. TIBS 25:106 (2000).

A gene-finding HMM: AUGUSTUS
[Figure: AUGUSTUS state diagram for the forward strand and the reverse strand: intergenic regions, single exon model, initial exon model, internal exon model, terminal exon model, intron models]
http://augustus.gobics.de

Splicing fact sheet
Exons
- Average 145 nucleotides in length.
- Contain regulatory elements: ESEs (exonic splicing enhancers) and ESSs (exonic splicing silencers).
Introns
- Average more than 10x longer than exons.
- Contain regulatory elements that bind regulatory complexes: ISEs (intronic splicing enhancers) and ISSs (intronic splicing silencers).
Splice sites
- 5' splice site. Sequence: AGguragu (r = purine). U1 snRNP binds to the 5' splice site.
- 3' splice site. Sequence: yyyyyyy nagG (y = pyrimidine).
- Branch site. Sequence: ynyuray (r = purine). U2 snRNP binds to the branch site via RNA:RNA interactions between snRNA and pre-mRNA.

Alternative splicing fact sheet
Alternative splicing
- Definition: joining of different 5' and 3' splice sites.
- 80% of alternative splicing results in changes in the encoded protein.
- Up to 59% of human genes express more than one mRNA by alternative splicing.
Functional effects
- Generates several forms of mRNA from a single gene.
- Allows functionally diverse protein isoforms to be expressed according to different regulatory programs.
Structural effects
- Insert or remove amino acids.
- Shift the reading frame.
- Introduce a termination codon.
Gene expression effects
- Removes or inserts regulatory elements controlling translation, mRNA stability, or localization.
Regulation
Splicing pathways
modulated according to:
- Cell type
- Developmental stage
- Gender
- External stimuli

Bioinformatics 1, lecture 8: Probability, significance, extreme value distribution

Probability
[Figure: three coins: Susan B. Anthony (S), The Queen (Q), Ireland (I)]

Unconditional probabilities
The joint probability of a sequence of 3 flips, given any one unfair coin, is the product:
$P(HHT) = P(H)\,P(H)\,P(T)$

Conditional probabilities
If the coins are "unfair" (not 50:50), then P(H) depends on the coin you choose (S, Q, or I). P(H) is "conditional" on the choice of coin, which may have its own odds:
$P(S,H) = P(S)\,P(H \mid S)$

Conditional probabilities
"P(A|B)" means the probability of A given B, where A is the result or observation and B is the condition. (The condition may be a result/observation of a previous condition.)
P(H|S) is the probability of H (heads) given that the coin is S.
In general, the probability of two things together, A and B, is
$P(A,B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$
Divide by P(B) and you get Bayes' rule:
$P(A \mid B) = P(B \mid A)\,P(A) / P(B)$
To reverse the order of conditional probabilities, multiply by the ratio of the probabilities of the conditions.

Scoring alignments using P
For each aligned position (match) we get P(A|B), which is the substitution probability.
Ignoring all but the first letters, the probability of these two sequences being homologs is
$P(s_{1,1} \mid s_{2,1})$  (substitution of $s_{2,1}$ for $s_{1,1}$).
Ignoring all but the first two letters, it is
$P(s_{1,1} \mid s_{2,1}) \times P(s_{1,2} \mid s_{2,2})$
Counting all aligned positions:
$\prod_i P(s_{1,i} \mid s_{2,i})$
Each position is treated as a different coin: an independent stochastic process.

Log space is more convenient
$\log \prod_i \frac{P(s_{1,i} \mid s_{2,i})}{P(s_{1,i})} = \sum_i S(s_{1,i} \mid s_{2,i})$
where
$S(A \mid B) = \log \frac{P(A \mid B)}{P(A)}$
This is the form of the substitution score: log-likelihood ratios (alias LLRs, log-odds, lods). Usually 2 or 3 times the log base 2 of the probability ratio.

Dayhoff's randomization experiment
Aligned scrambled Protein A versus scrambled Protein B, 100 times (re-scrambling each time).
NOTE: scrambling does not change the amino acid composition.
Results: a normal distribution.
The significance of a score is measured as the probability of getting this score in a random alignment.

Lipman's randomization
experiment
Aligned Protein A to 100 natural sequences, not scrambled.
Results: a wider normal distribution (standard deviation about 3 times larger).
WHY? Because natural sequences are different from random. Even unrelated sequences have similar local patterns and uneven amino acid composition.
Was the significance overestimated using Dayhoff's method?
Lipman got a similar result if he randomized the sequences by words instead of letters.

Complexity = sequence heterogeneity
A low-complexity sequence is homogeneous in its composition. For example:
AAAAAAAAHAAAAAAAAKAAAAAEAA
is a low-complexity sequence. Compared to other sequences, there are relatively few ways to make a 26-residue sequence that has 23 A's, 1 H, 1 K, and 1 E.
What is the effect of low-complexity regions on the score distribution
1. from a Dayhoff-type randomization experiment?
2. from a Lipman-type randomization experiment?
What is the effect on significance?
- Wider score distribution -> lower significance of a given score.
- Narrower score distribution -> higher significance.

Why do local patterns increase the standard deviation?
The two-letter sequence "PG" occurs more often than expected by chance (perhaps because PG occurs in beta turns). If non-homolog sequences are actually made of small words instead of letters, then how will the score distribution be affected? Narrower? Wider?
Whole-word matches have higher scores. Whole-word mismatches have lower scores. The total score of an alignment is the sum of word scores, which are more variable.

Expected? Expectation? Expectation value?

Expectation value for coin tosses
Consider a fair coin, tossed 100 times. The sequence is
HTHTHTTTTHTHHHTHTTHHTHHHTHTH...
What is the expected length of the longest row of H's?
Erdős & Rényi equation:
$E = \log_{1/p}(n)$
where $E$ is the expected length of the longest run of H, $n$ is the number of tosses, and $p$ is P(H).

Heads = match, tails = mismatch
Similarly, we can define an expectation value E(M) for the longest number of matches in a row in an alignment. E(M) is calculated similarly to the heads/tails way using the
Erdős & Rényi equation ($p$ is the odds of a match, $p = 1/4$):
$E(M) = \log_{1/p}(n)$  (expectation given an alignment of length $n$)
But over all possible alignments of two sequences of length $n$, the number is
$\log_{1/p}(n \times n) = 2\log_{1/p}(n)$
If the two sequences are of length $n$ and length $m$, it is
$E(M) = \log_{1/p}(mn)$ (+ constant terms)

Heads/tails = match/mismatch
Theoretically derived equation for the expectation value for M, the longest block of matches (see Mount, p. 134):
$E(M) = \log_{1/p}(mn) + \log_{1/p}(1-p) + 0.577\,\log_{1/p}(e) - 1/2$
This can be expressed as
$E(M) = \log_e(Kmn)/\lambda$
where $K$ is a constant and $\lambda = \log_e(1/p)$.

In-class exercise: expectation value
- Start SeqLab.
- Using DNA from SARS, choose 100 bases at random, twice.
- Paste these sequences into a new sequence line.
- Run BestFit (default DNA parameters).
- Find the longest string of identity matches. Write it down.
- Do it again.
When we have enough numbers, we will plot them on the board.

$P(S \ge x)$
E(M) gives us the expected length of the longest number of matches in a row. But what we really want is the answer to this question: how good is the score x (i.e., how significant)?
So we need to model the whole distribution of chance scores, then ask how likely it is that my score or greater comes from that model.
[Figure: frequency versus score]

Distribution definitions
Mean = average value. Mode = most probable value.
[Figure: distribution with the mean and mode marked]
For a variable whose distribution comes from extreme values, such as random sequence alignment scores, the score must be greater than expected from a normal distribution to achieve the same level of significance.

A normal distribution
Usually, we suppose the likelihood of deviating from the mean by x in the positive direction is the same as the likelihood of deviating by x in the negative direction, and the likelihood of deviating by x decreases as the power of x.
Why? Because multiplying probabilities gives this type of curve. This is called a Normal, or Gaussian, distribution.

Extreme value distribution: a distribution derived from extreme values
Get statistics from lots of optimal scores.
Figure: all possible scores for two sequences give a normal distribution, while using only the best scores produces a skewed
distribution (the best score is the optimal score).

Extreme value distribution
$y = \exp(-x - e^{-x})$
The EVD has this shape, but the mode and decay parameters depend on the data.

The mode and decay
The EVD with mode $u$ and decay $\lambda$:
$y = \exp(-x - e^{-\lambda(x-u)})$
The mode, from the Erdős & Rényi equation:
$u = \log_e(Kmn)/\lambda$
Integrating this equation from $x$ to infinity gives:
$P(S \ge x) = 1 - \exp(-Kmn\,e^{-\lambda x})$
* The book says $u$ is the mode. This is incorrect. The way it is written, $\lambda u = \ln(Kmn)$ is the mode.

The scoring function affects $\lambda$
$\lambda$ is calculated as the value of $x$ that satisfies
$\sum_{i,j} p_i\,p_j\,e^{s_{ij}x} = 1$
Substitution matrix values affect the width of the EVD.

Voodoo mathematics
For values of x greater than 1, we can make an approximation:
$1 - \exp(-e^{-x}) \approx e^{-x}$
That means
$P(S \ge x) = 1 - \exp(-Kmn\,e^{-\lambda x})$
becomes a single exponential equation:
$P(S \ge x) \approx Kmn\,e^{-\lambda x}$
Now we can plot $\log P(S \ge x)$ versus $x$, using a large number of known false alignment scores $x$. The slope is $-\lambda$ and the intercept is $\log(Kmn)$.

Matrix bias in local alignment
In local alignment we take a MAX over zero (0) and three other scores (diagonal, across, down). Matrix bias is added to all match scores, so the average match score and the extremes can be adjusted.
What happens if match scores are...
- all negative? The best alignment is always no alignment.
- all positive? The best alignment is gapless global-local.
- average positive? The best alignment is local (longer). The typical random alignment is local.
- average negative? The best alignment is local (shorter). The typical random alignment is no alignment.
Altschul's principle: on average, the match scores should be greater than zero for a match and less than zero for a mismatch. (Some mismatches may have a score greater than 0.)

What happens with matrix bias?
If we add a constant to each value in the substitution matrix, it favors matches over gaps. As we increase matrix bias:
- Longer alignments are more common in random sets.
- Longer alignments are less significant.
[Figure: alignments with negative matrix bias, no matrix bias, and positive matrix bias]

Fitting an extreme value distribution to a database
1. Align a sequence of length n to the database (length m) that has no
homologs (may use a shuffled sequence). Get one score for each database sequence.
2. Plot the score distribution, $x$ versus $P(S \ge x)$. (How do you linearize this?)
3. Fit the EVD to the score distribution to get the mode $u$ and the decay parameter $\lambda$:
$y = \exp(-x - e^{-\lambda(x-u)})$
$P(S \ge x) = 1 - \exp(-Kmn\,e^{-\lambda x})$

Summary of significance
- The significance of a score is measured by the probability of getting that score by chance.
- The expectation value for the length of a match between two sequences of lengths n and m, given the probability of a match p, has a theoretical solution, $\log_{1/p}(nm)$.
- The score of an alignment is roughly proportional to the number of matches (local alignments only). Therefore, the expectation value of the score follows the same theoretical equation.

Summary
- The extreme value distribution (EVD) is the right equation to use when the distribution is over extreme values (optimal alignment scores). The EVD models the length dependence of the score.
- The parameters of the EVD are not theoretical but empirical, derived from the database by plotting the scores and fitting.
- The significance of a given score x is the probability of getting a higher score S from random alignments. This is approximated by integrating the EVD from x to infinity:
$P(S \ge x) = 1 - \exp(-Kmn\,e^{-\lambda x})$
once $\lambda$ and $K$ have been calculated.

e-values in BLAST
- Every BLAST "hit" has a score, x, derived from the substitution matrix.
- Parameters for the EVD have been previously calculated for m and n, the length of the database and the length of the query.
- Now we can get $P(S \ge x)$, which is our "p-value".
- To get the expected number of times this score will occur over the whole database, we multiply by m. This is the number you see reported in BLAST.

Pop quiz
BLAST gave you 100 hits with e-values ranging from 0 to 10. All hits in the database with e ≤ 10 are listed in this set of 100. How many of the sequences on your list do you expect to have e ≈ 10?

What's an SNP? (Single Nucleotide Polymorphism)
SNP: a change of a single nucleotide at a defined position in the genome, e.g., A to G.
Distinct from
other genetic variations such as indels (even single-base or codon indels) and repeats, but these are often lumped together.
SNPs are the most common genetic variations: they occur once every 100 to 300 bases (older literature often says 1 per 12 kB).
SNPs should accelerate identification of disease genes by allowing researchers to look for associations between a disease and genetic differences (SNPs) in a population. This differs from traditional pedigree analysis, which tracks transmission of a disease through a family. It's easier to get DNA from a random set of individuals than from family members over several generations. Follow-up investigations can use sequence information around the polymorphism.

RFLPs were the first DNA markers studied
Restriction enzymes cut DNA molecules at specific recognition sequences. Sequence specificity means that treatment of a DNA molecule with a restriction enzyme should always produce the same set of fragments.
In genomic DNA molecules, some restriction sites are polymorphic: one allele has the correct sequence for the restriction site and is cut when treated with enzyme, and the second allele has a mutation, so the restriction site is no longer recognized.
The position of an RFLP on a genome map can be worked out by following the inheritance of its alleles, just as when genes are the markers.
There are $10^5$ RFLPs in the human genome, but for each RFLP there are only two alleles (cut or uncut). The value of RFLPs in gene mapping is limited by the possibility that the RFLP being studied shows no variability among the members of any interesting family.

SSLPs
SSLPs are repeat sequences that display length variations, different alleles containing different numbers of repeat units. Unlike RFLPs, SSLPs can be multiallelic, as each SSLP can have multiple length variants.
There are two types of SSLP:
- Minisatellites, also known as variable number of tandem repeats (VNTRs), in which the repeat unit is up to 25 bp in length.
- Microsatellites, or simple tandem repeats (STRs), whose repeats are
shorter, usually dinucleotide or tetranucleotide units.
Microsatellites are better markers than minisatellites for two reasons. First, minisatellites are not spread evenly around the genome but are clustered near telomeric regions; microsatellites are more evenly spaced throughout the genome. Second, PCR typing is much quicker and more accurate with sequences less than 300 bp in length, a size typical for microsatellites.
There are $6.5 \times 10^5$ microsatellites in the human genome.

Why should we care?
Many correlations between SNP occurrence and disease assume that Hardy-Weinberg (HW) equilibrium is a good approximation in the healthy population. This may or may not be correct for a particular group of SNPs.
Example: increased personal mobility and the breakdown of homogeneity of ethnic groups invalidate the assumptions.

SNPs classified by type of nucleotide change
Transitions
- Purine to purine (A to G, G to A).
- Pyrimidine to pyrimidine (C to T, T to C).
- Less likely to alter the amino acid than a transversion; more likely to be retained in coding regions.
- Often more prevalent than transversions.
Transversions
- Purine to pyrimidine, or pyrimidine to purine (A or G to C or T).
Mol Biol, Fall 2003, copyright Susan Smith and Donna Crone

SNPs
- Variation in the identity of the base that appears at a single position.
- Represent mutations that primarily arise from unique events.
- Individuals that share an SNP are likely to share evolutionary history.
- Usually don't alter phenotype, but sometimes affect genetic predisposition to disease or influence individual response to drugs.

Classification of SNPs
Non-coding
- 5' or 3' nontranscribed region of a gene
- 5' or 3' untranslated region of a gene
- intron
- intergenic DNA
Coding
- Synonymous: changes the codon, not the amino acid.
- Replacement: changes the codon AND the amino acid.

Mutations in the coding region
- Synonymous: TGT -> TGC results in Cys -> Cys.
- Nonsynonymous (replacement): TGT -> TGG results in Cys -> Trp;
can be conservative or nonconservative.
- Nonsynonymous (nonsense mutation): TGT -> TGA results in Cys -> stop.
- Nonsynonymous (readthrough mutation): TAA -> TTA results in stop -> Leu.

Non-replacement SNPs
Synonymous and noncoding polymorphisms may affect gene function by altering:
- transcriptional or translational regulation
- splicing
- mRNA stability
An SNP in a promoter can affect gene expression or transcription factor binding.

SNP statistics
View info at http://www.ncbi.nlm.nih.gov/SNP/snp_summary.cgi (SNP statistics). Notice the density in build 118.
Predicted SNP density is 1 for every 12 kb of genomic sequence. How close are we to the predicted density?

SNP discovery
[Figure: SNP discovery pipeline]
Process:
- SNP discovery, by EST analysis (lots of first-pass sequence from cDNA libraries from different samples), or by sequencing of specific genes known to be of interest (cancer, cardiovascular, etc.), or by sequencing near other known SNPs: 5-50 samples.
- SNPs confirmed, allele frequency: 40-200.
- Clinical assessment (association trials): 200-500 samples.
- Clinical trials (medical significance): thousands of samples.
- Diagnostic test: thousands of samples.

Align ESTs to the human genome reference sequence
Remove paralogs from the alignment.
- Paralogs: two genes from the same species that evolved from gene duplication. A paralog has less similarity to the reference sequence.
- Paralogs: frequency of variation is 1 in 50 bp.
- SNPs: frequency of variation is 1 in 1000 bp.

Screen out SNPs that are due to sequencing errors
Check chromatograms from the ESTs and the reference. Determine the probability of a true SNP based on:
- Bases observed at the position.
- How many ESTs aligned to
that position.
- Base quality value at that position in each sequence.

GCATGCAaGCATGCAT
GCATGCACGCATGCAT
GCATGCAaGCATGCAT
GCATGCAaGCATGCAT
GCATGCAaGCATGCAT
Bases: a c a a a
Depth of coverage is 4. Base quality is from PHRED.
The SNP probability calculation also takes into account the expected rate of polymorphism (0.001).
Probability > 0.4: candidate SNP. Further validation is needed to identify it as a true SNP.

NIH's SNP database
http://www.ncbi.nlm.nih.gov/SNP/

Classes of genetic variation
The database was designed to accept several classes of genetic variation:
- SNPs
- microsatellite repeats
- small insertion/deletion polymorphisms
dbSNP uses the term "SNP" in the much looser sense of "minor genetic variation", e.g., there is no requirement or assumption about minimum allele frequencies for polymorphisms in the database.
dbSNP includes disease-causing CLINICAL MUTATIONS as well as NEUTRAL POLYMORPHISMS. It is anticipated that SNP markers with unknown selective effects will be the vast majority of submitted records.

dbSNP accepts submissions from all labs
There is special interest in mutations in genes or where added biological information is known.
- Major database contributors are associated with the National Human Genome Research Institute (NHGRI) program. NHGRI has funded an effort to collect 50000 SNPs in 3 years.
- Unaffiliated labs and private companies can deposit SNP information to make it accessible to the research community.
A common data exchange format for SNP data will be used between central SNP databases.

dbSNP was started in 1998 and grows at 90 SNPs per month
Submissions by large research projects cause uneven growth. Data from smaller contributors are welcome, but the majority of the data will come from a small number of large projects funded by the NHGRI grants.
Growth will be erratic for the next few years, with 5000-10000 SNPs by the end of the first year of the funded SNP grants.
For the current number of SNPs in the
database, see the dbSNP Summary page.

dbSNP is completely separate from the NIH Polymorphism Discovery Resource (NIHDPR). The NIHDPR is encouraged as a resource for the extramural NIH-funded SNP labs, to facilitate evaluation of differing discovery or genotyping methods. None of the frequency information currently in dbSNP comes from the NIHDPR.

dbSNP can be searched both via other NCBI resources or directly.
1. Via other NCBI resources
   a. Gene name/nomenclature association: query results from the LocusLink database show a purple "8" button in SNP records mapped to a gene. Clicking on "8" moves to reference SNP records for any gene in the LocusLink database.
   b. By map location: dbSNP is currently being integrated with GeneMap99 and the integrated physical maps that are being constructed at NCBI. This feature will be ready in late May.
2. Direct searching of dbSNP, six ways
   - Ad Hoc / Main: search by SNP accession, submitter SNP ID, NCBI Assay ID, or genome SNP ID
   - By submitter: search by submitter handle
   - New Batches: search by local batch ID
   - Method: search by the method used by the submitter to identify the SNP
   - Population: search by the type of population studied
   - Publication: search by publication title
   - Chromosome Report: reports of SNPs with STS mapping information, sorted by cRay distance where possible

Bioinformatics 1, lecture 7

You have seen
- Dynamic programming
- Global alignment
- Global-local alignment (no end gaps)
- 3 ways to do it
- Local alignment
- Linear gap penalty
- Affine gap penalty
How many ways are there to do DP?

Recent development: asymmetric substitution matrices
If two different species have different amino acid compositions, then the substitutions between those species are asymmetric, meaning $S_{i \to j} \neq S_{j \to i}$.
[Figure: Clostridium tetani, AT-rich; amino acid composition over A C D E F G H I K L M N P Q R S T V W Y]
For example, if tetanus has more leucine overall than tuberculosis,
then, on average, $S_{X \to L} > S_{L \to X}$ for substitutions between the tuberculosis (tuber) and tetanus (tetan) sequences, where X is any amino acid (A C D E F G H I K L M N P Q R S T V W Y).
Yu YK, Wootton JC, Altschul SF. The compositional adjustment of amino acid substitution matrices. Proc Natl Acad Sci U S A. 2003 Dec 23;100(26):15688-93.
[Figure: Mycobacterium tuberculosis, GC-rich]

Database searching: one sequence -> lots of sequences

Why do a database search?
- Mol Bio: determination of gene function, primer design
- Pathology, epidemiology, ecology: determination of species, strain, lineage, phylogeny
- Biophysics: prediction of RNA or protein structure, effect of mutation

Searching millions of sequences
Given a protein or DNA sequence, we want to find all of the sequences in GenBank (over 17 million sequences) that have a good alignment score.
Each alignment score should be the optimal score, or a close approximation.
How do we do it?

DNA or protein search?
Advantages of searching DNA databases:
- Larger database
- Does not assume a reading frame
- Can find similarity in non-coding regions (introns, promoter regions)
- Can find frameshift mutations
- Can find pseudogenes
Disadvantages:
- Slower
- Not as sensitive
- Ignores selective pressure at the protein level
Advantages of searching protein sequences:
- Faster
- More sensitive
- More biologically relevant
Disadvantages:
- Not applicable to non-coding DNA (promoters, introns, etc.)

Searching using Dynamic Programming (Smith & Waterman)
- DP returns the optimal alignment given the scoring function (usually affine gap, local alignment)
- Relatively slow, but more sensitive and more selective than FASTA and BLAST
- Optimal sensitivity/selectivity

Searching using word matches (W. Pearson, 1988)
- First searches for k-tuples, then links them. Results are similar to a dot plot.
- Finally, diagonals are scored using a substitution matrix and the highest scoring diagonals are joined.
- High scoring alignments are recalculated using DP (local, affine).
- At least 50 times faster than SSEARCH
- Not as sensitive. The final DP step makes it more sensitive but less selective.
- FASTA is a heuristic alignment method, not
optimal.

FASTA: k-tuples
[Figure: dot plot of k-tuple identity matches between two sequences]
- Finding identity matches is very fast.
- If two k-tuples are separated by exactly the same amount in both sequences, draw a diagonal: a gapless alignment.

FASTA
1. Find all gapless alignments.
2. Score them using BLOSUM; keep the best.
3. Connect them using a simple affine gap penalty (gap and gap-extension penalties).
If this alignment is one of the best scores in the database search, go back and realign using DP.

Searching using lookup tables (S. Altschul et al.)
- First make a set of lookup tables for all 3-letter (protein) or 11-letter (DNA) matches.
- Make another lookup table of the locations of all 3-letter words in the database.
- Start with a match, extend to the left and right until the score no longer increases.
- Very fast
- Selective, but not as sensitive as SSEARCH
- Good statistics
- Heuristic

BLAST
[Figure: one query 3-tuple compared with all 8000 possible 3-tuples (... PGT, PGV, PGW, PGY, PAQ, PCQ, PDQ, PEQ, PFQ ...), giving 50 high-scoring 3-tuples]
Each 3-tuple is scored against all 8000 possible 3-tuples using BLOSUM. The top scoring 50 are kept.

BLAST
[Figure: query sequence -> a 3-tuple -> 50 high-scoring 3-tuples ("neighborhood words") -> identity matches in the database sequence ("seeds") -> HSPs -> alignment]
- For every 3-residue window, we get the set of 50 nearest neighbors.
- Use each word to get identity matches ("seeds").
- Then extend the seed alignments as long as the score increases.
- The best extended seeds are called HSPs (high scoring pairs).
- The top scoring HSP is picked first, then the second as long as it falls "northwest" or "southeast" of the first, and so on.

In-class exercise: BLAST search using NCBI
- Open a web browser.
- Go to NCBI BLAST (www.ncbi.nlm.nih.gov/BLAST) and select "Nucleotide-nucleotide BLAST (blastn)".
- Log in to bioinf45.
- Type 'more ~bystroff/evidence.fasta'.
- Copy/paste the DNA sequence into the BLAST sequence window.
- Select 'nr'.
- Select Descriptions=10, Alignments=10.
- Run it. Format it.
- When the results are back, go to the bottom of the page. Hit "Select all" and "Get selected sequences".
(continued)

In-class exercise: BLAST search using NCBI
- In the page
that appears, select Display=GenBank and Send to=File.
- Send this file to your account on bioinf45 using an scp command.
- Go to SeqLab. Go to an empty Editor page.
- Import the GenBank file.

Another class exercise: doing a simple multiple sequence alignment
This exercise is to practice making a multiple sequence alignment using the local databases. Try it using BLAST first. Then try FASTA and SSEARCH database searches. Do you get different results?

In-class exercise: BLAST search using SeqLab
- Start SeqLab.
- Using LookUp, find sequences in PIR that match the keyword "R67". Check the results.
- Choose the PIR sequence. What is the accession number?
- Get the sequence using File > Add sequences from > Databases, using the accession number.
- Run BLAST using this sequence. Be sure to search the protein databases. Set the cutoff to 100.
- Add to Main list. Go to Main list, select it. Go to Editor.
- Choose to Modify the sequences. This cuts off long ends.

In-class exercise: multiple sequence alignment using SeqLab
- Select all sequences.
- Run ClustalW multiple alignment (Extensions > ClustalW); use the defaults.
- When the job is done, save to Main list. Then select it and go to the Editor.
- You should now see a multiple sequence alignment.

Bioinformatics 1, lecture 6
Affine gap penalty. Substitution matrices: PAM, BLOSUM. Matrix bias in local alignment.

Reminder: scoring an alignment
The score of the alignment is the sum of the scores of each column (match, deletion or insertion) in the alignment.
- Match: look up the match score from the substitution matrix
- New gap: use the gap initiation penalty
- Additional gap: use the gap extension penalty
- End gap: optional, may be zero
[Figure: an example alignment with AGLSTFM]

Affine gap exercise
Which alignment scores the highest?
[Figure: candidate alignments]
Given: no end gap penalty, BLOSUM score, affine gap penalty (2, 1).

Worksheet for affine gap local dynamic programming
Fill in the scores and the traceback letters using the BLOSUM62 matrix, with gap opening = 2 and gap extension = 1. Start at 0. End at the maximum. No end gaps (meaning no starting gaps).
[Worksheet grid: sequence LTVKP]

Rules for affine gap penalty DP
- Do not penalize end gaps.
- I can follow D and vice versa, but it is a gap opening.
- For each box, write the score and the traceback letter (M, I or D).

$M_{i,j} = \max\left( M_{i-1,j-1} + \text{match score},\; I_{i-1,j-1} + \text{match score},\; D_{i-1,j-1} + \text{match score} \right)$

Affine gap penalty worksheet
M: match matrix. I: insertion matrix. D: deletion matrix. Each matrix holds the scores for alignments that end in that state.

$I_{i,j} = \max\left( M_{i,j-1} - 2,\; I_{i,j-1} - 1,\; D_{i,j-1} - 2 \right)$

$D_{i,j} = \max\left( M_{i-1,j} - 2,\; I_{i-1,j} - 2,\; D_{i-1,j} - 1 \right)$

BLOSUM matrix for match scores
Two 20x20 substitution matrices are used: BLOSUM & PAM.
[Table: BLOSUM62 substitution matrix, upper triangle, residue order A C D E F G H I K L M N P Q R S T V W Y]
Each number is the score for aligning a single pair of amino acids.
What is the score for this alignment?
ACEPGAA
ASDDGTV
(BLOSUM62)

Read Mount pp. 94-113.

Substitution matrices
- Used to score aligned positions, usually of amino acids.
- Expressed as the log-likelihood ratio of mutation (or log-odds ratio).
- Derived from multiple sequence alignments.
Two commonly used matrices: PAM and BLOSUM.
- PAM = percent accepted mutations (Dayhoff)
- BLOSUM = Blocks substitution matrix (Henikoff)

PAM (M. Dayhoff, 1978)
- Evolutionary time is measured in Percent Accepted Mutations, or PAMs.
- One PAM of evolution means 1% of the residues/bases have changed, averaged over all 20 amino acids.
- To get the relative frequency of each type of mutation, we count the times it was observed in a database of multiple sequence alignments.
- Based on global alignments.
- Assumes a Markov model for evolution.

BLOSUM (Henikoff & Henikoff, 1992)
- Based on a database of ungapped local alignments (BLOCKS).
- Alignments have lower similarity than PAM alignments.
- The BLOSUM number indicates the percent identity level of sequences in the alignment.
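The scoring rules from this lecture (matrix lookup for a matched column, gap initiation for a new gap, gap extension for each additional gap position) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `BLOSUM62_EXCERPT`, `pair_score` and `score_alignment` are hypothetical, the dictionary holds just the few BLOSUM62 entries needed for the example alignment rather than the full matrix, and end gaps are penalized like any other gap (the notes say that penalty is optional).

```python
# Excerpt of BLOSUM62: only the residue pairs used in the example below.
# (Assumption: a real program would load the full 20x20 matrix.)
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("C", "S"): -1, ("D", "E"): 2, ("D", "P"): -1,
    ("G", "G"): 6, ("A", "T"): 0, ("A", "V"): 0,
}


def pair_score(x, y, matrix=BLOSUM62_EXCERPT):
    """Look up a symmetric substitution score for residues x and y."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]  # KeyError if the pair is not in the excerpt


def score_alignment(row1, row2, gap_open=2, gap_extend=1,
                    matrix=BLOSUM62_EXCERPT):
    """Sum column scores of a pairwise alignment written with '-' for gaps.

    A gap column costs gap_open if it starts a new gap and gap_extend if it
    continues a gap in the same row. Switching the gap from one row to the
    other counts as a new opening, as in the worksheet rules.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    gap_row = None  # 1 or 2 while inside a gap run, else None
    for x, y in zip(row1, row2):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be all gaps")
        if x == "-" or y == "-":
            this_row = 1 if x == "-" else 2
            total -= gap_extend if this_row == gap_row else gap_open
            gap_row = this_row
        else:
            total += pair_score(x, y, matrix)
            gap_row = None
    return total


if __name__ == "__main__":
    # The alignment asked about in the notes.
    print(score_alignment("ACEPGAA", "ASDDGTV"))
```

Calling `score_alignment("A--A", "ACCA")` exercises the gap branch: one opening followed by one extension between two matched columns.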
- For example, for BLOSUM62, sequences with approximately 62% identity were counted.
- Some BLOCKS represent functional units, providing validation of the alignment.

[Figure: multiple sequence alignment listing, QUERY plus database sequence identifiers]
A multiple sequence alignment is made using many pairwise sequence alignments.

Columns in a MSA have a common evolutionary history
[Figure: multiple sequence alignment with one column highlighted]
By aligning the sequences, we assert that the aligned residues in each column had a common ancestor.

How do you count the mutations?
Assume any of the sequences could be the ancestral one.
Column: G G W W N G G
If the first sequence (G) was the ancestor, then it mutated to a W twice, to N once, and conserved G three times.
Or we could have picked W. If W was the ancestor, then it mutated to a G four times, to N once, and was conserved once.

Substitution matrices are symmetrical
Since we don't know which sequence came first, we don't know whether G->W or W->G is correct. So we count this as one mutation of each type, G->W and W->G. In the end the 20x20 matrix will have the same number for elements ij and ji. (That's why we only show the upper triangle.)

Summing the substitution counts
We assume the ancestor is one of the observed amino acids, but we don't know which, so we try them all.
[Figure: symmetrical count matrix over G, N, W for one column of a MSA]
Next possible ancestor: we already counted this residue against all others, so we blank it out.
[Figure: the same count matrix updated for each remaining residue in turn; there are no counts for the last sequence]

Summing the substitution counts
[Figure: summed count matrix over G, N, W; TOTAL = 21]
Now we do this for every column in every multiple sequence alignment.

Log odds
Substitutions (and many other things in bioinformatics) are expressed as a "likelihood ratio"
or "odds ratio" of the observed data over the expected value. Likelihood and odds are synonyms for probability. So log odds is the log (usually base 2) of the odds ratio.

Getting log-odds from counts
- $P(G) = 4/7 = 0.57$
- Observed probability of G->G: $q_{GG} = P(G \to G) = 6/21 = 0.29$
- Expected probability of G->G: $e_{GG} = 0.57 \times 0.57 = 0.33$
- Odds ratio: $q_{GG}/e_{GG} = 0.29/0.33$
- Log odds ratio: $\log_2(q_{GG}/e_{GG})$
If the lod is < 0, then the mutation is less likely than expected by chance. If it is > 0, it is more likely.

Different observations, same expectation
G's spread over many columns:
- $P(G) = 0.50$, $e_{GG} = 0.25$
- $q_{GG} = 9/42 = 0.21$
- $\text{lod} = \log_2(0.21/0.25) = -0.2$
G's concentrated:
- $P(G) = 0.50$, $e_{GG} = 0.25$
- $q_{GG} = 21/42 = 0.5$
- $\text{lod} = \log_2(0.50/0.25) = 1$
[Figure: two small alignments with the same number of G's, spread out versus concentrated in a few columns]

Different observations, same expectation
G and W seen together more often than expected:
- $P(G) = 0.50$, $P(W) = 0.14$, $e_{GW} = 0.07$
- $q_{GW} = 7/42 = 0.17$
- $\text{lod} = \log_2(0.17/0.07) = 1.3$
G's and W's not seen together:
- $P(G) = 0.50$, $P(W) = 0.14$, $e_{GW} = 0.07$
- $q_{GW} = 3/42 = 0.07$
- $\text{lod} = \log_2(0.07/0.07) = 0$
[Figure: two small alignments with the same composition, with G and W sharing columns often versus rarely]

In-class exercise
Get the substitution value for P->Q, given a very small database.
[Figure: a small set of aligned columns containing P, Q and S]
Find $P(P)$, $P(Q)$, $e_{PQ}$, $q_{PQ}$, and $\text{lod} = \log_2(q_{PQ}/e_{PQ})$.

Markovian evolution and PAM
A Markov process is one where the likelihood of the next "state" depends only on the current state.
Markovian evolution assumes that base changes (or amino acid changes) occur at a constant rate and depend only on the identity of the current base (or amino acid).
[Figure: one position in a protein followed over millions of years, with per-step probabilities such as 0.9946, 0.0002, 0.0021, 0.9932, 0.0001]

Markovian evolution is an extrapolation
- Start with all G's. Wait 1 million years. Where do they go? Using PAM1, we expect them to mutate to about 0.0002 A, 0.0007 P, 0.9946 G, etc.
- Wait another million years. The new A's mutate according to PAM1 for A's, P's mutate according to PAM1 for P's, etc.
- Wait another million, etc., etc.
What is the final distribution of amino acids at the positions that were once G's?

Matrix multiplication
We start with 100% G, 0 everything else.
After 1 MY, each amino acid has mutated according to the PAM
probabilities.
[Figure: the starting composition vector multiplied by the PAM1 matrix]

Matrix multiplication
After 2 MY, each amino acid has mutated again according to the PAM1 probabilities, etc.
[Figure: repeated multiplication by the PAM1 matrix, up to 250 PAMs]

Differences between PAM and BLOSUM
PAM
- PAM matrices are based on global alignments of closely related proteins.
- PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.
- Other PAM matrices are extrapolated from PAM1 using an assumed Markov chain.
BLOSUM
- BLOSUM matrices are based on local alignments.
- BLOSUM62 is a matrix calculated from comparisons of sequences with approximately 62% identity.
- All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
- BLOSUM62 is the default matrix in BLAST, the database search program. It is tailored for comparisons of moderately distant proteins. Alignment of distant relatives may be more accurate with a different matrix.

Increasing sophistication in match scoring
1. Identity score
2. Genetic code changes: mutations of one base are more likely than 2 or 3 (1966)
3. Matrices based on chemical similarity of amino acids (1985)
4. Matrices based on multiple sequence alignments: PAM (1978), BLOSUM (1992)
5. Dipeptide substitution matrices, i.e. AG -> DG, etc. (1994)
6. Class-specific substitution matrices: D. Jones' transmembrane protein matrix (1994)
7. Structure-specific substitution matrices (2005)

[Figure: PAM250 and BLOSUM62 matrices shown side by side as color maps, with residues grouped by chemical type]
Which substitution matrix (PAM250 or BLOSUM62) favors:
- conservation of polar residues?
- conservation of nonpolar residues?
- polar-to-nonpolar mutations?
- conservation of C, Y, or W?
- polar-to-polar mutations?

Local alignment revisited
- Starts at zero (the score of a non-alignment).
- Ends at the maximum score, anywhere in the matrix.
Advantages:
- Does not care if the aligned region has long "tails".
- Can align pieces of one sequence to pieces of another.
- Multidomain sequences are OK.
[Figure: global versus local alignment paths]

Local alignment revisited
Disadvantages:
- Fails on multidomain alignments if large gaps are present.
- Success depends on an additional parameter: matrix bias.
[Figure: global versus local alignment paths]

Local alignment with matrix bias
Matrix bias is a constant added to the substitution score. It has the same effect as starting the alignment at a number other than zero.

$A_{i,j} = \max\left( A_{i-1,j-1} + \text{match score} + \text{bias},\; A_{i,j-1} - \text{gap},\; A_{i-1,j} - \text{gap},\; 0 \right)$   (linear gap penalty)

Effect of matrix bias
Higher matrix bias favors matches over gaps.
RESULT: more matches, longer local alignments.
How do we know when the alignment is correct? Compare to known structure-based alignments.
[Figure: alignments at different matrix bias values]

Aligning fibronectin
- Fibronectin is a long, multidomain protein involved in adhesion/migration of cells, blood clotting, signaling, and interactions with the extracellular matrix (ECM).
- Interacts with collagen, fibrin, heparin and integrins.
- It is made up of many copies of at least 3 "modules". Small differences within modules cause important biological effects.
How do you align fibronectins?

Multiple local alignments
One way would be to select the maximum score, then the next highest score, and so on, to get all of the possible alignments.
[Figure: module types I, II and III aligned between two fibronectin sequences]
Fragment-based alignment methods find all local alignments: BLAST, FASTA.

You have seen
- Dynamic programming
- Global alignment
- Global-local alignment (no end gaps)
- 3 ways to do it
- Local alignment
- Linear gap penalty
- Affine gap penalty
How many ways are there to do DP?

In-class exercise: local alignment using BestFit
- Start SeqLab. Go to Editor mode with no sequences.
- Download two protein sequences from the PIR database (File > Add sequences > Databases): PIR2:A46444 and PIR2:S58653.
- Select both.
- Run BestFit (Functions > Pairwise comparison > BestFit).
- Go to Options; set gap creation (range 1-20) and gap extension (range 0-10). Run.
- Look at the results. Compare to what you see on the screen.

Bioinformatics 1, lecture 13
Sequence weights. Log-odds profiles. Logos. Sequence motifs.

Polling and modeling
In a democracy we vote for our leaders. Each person gets
exactly one vote, and that vote reflects the opinion of the voter. Everyone votes. There is no sampling error.
A poll is a small sample of the populace. To be a representative sample, the poll must be taken across regions, socio-economic classes, etc. (think "clades"). Otherwise, the poll is not "representative" and will not reflect the results of a true election.

A typical poll of the database
If we submit one sequence (for example, citrate synthase from human) to the GenBank database (using BLAST, for example), take 100 results, and build a cladogram from them, we might get something like this:
[Figure: cladogram with clades labeled E. coli, rabbit, rat, lawyer, primates; most of the sequences fall in the primate clade]
What is our representative going to look like if we use the rule "one sequence, one vote"?

A database search is a poor polling method
Will this poll predict the winner in the election?

Sequence weighting corrects for poor sampling
To build a representative model, we can
1. throw out all redundant sequences and keep representatives of each clade only, or
2. apply a weight to each sequence, reflecting how non-redundant that sequence is.
One measure of non-redundancy is sequence distance, or evolutionary distance.

Crude weights from a cladogram
Simplest weighting scheme: start with weight = 1.0 at the common ancestor of the tree. Split the weight evenly at each node.
Human sequences are 10/18 of the tree but only 0.125 of the weights.
[Figure: the cladogram with weight 1.000 at the root, split evenly at each node, e.g. 0.125 for a whole clade; leaves: rabbit, rat, lawyer, E. coli, primates]

Better weights from a phylogram
[Figure: phylogram of A, B, C with branch lengths]
- $w_A = 0.2 + 0.3/2 = 0.35$
- $w_B = 0.1 + 0.3/2 = 0.25$
- $w_C = 0.5$
The sequence weight is calculated starting from the distance from the taxon to the first ancestor node, adding half of the distance from the first ancestor to the second ancestor, one fourth of the distance from the second to the third ancestor, and so on. Finally, the weights are normalized.
From Mount, p. 154.

Easy: distance-based weights
[Figure: sequences A, B, C with pairwise distances A-B = 0.3 and A-C = 1.0]
1. Sum the weighted distances to get new weights.
2. Normalize the new weights.
3. Repeat 1 and 2 until no change.
Pseudocode:
all w_i initialized to 1
while w_i ≠ w'_i do
  for i from A to C
  do
    w'_i = Σ_j w_j D_ij
  end do
  for i from A to C do
    w_i = w'_i / Σ_j w'_j
  end do
end do
[Figure label: distance B-C = 0.9]

Self-consistent weights
Method of Sander & Schneider, 1994.
Distance-based weights:
1. Sum the weighted distances to get new weights.
2. Normalize the new weights.
3. Repeat 1 and 2 until no change.

Running the pseudocode
First pass (all weights start at 1):
- $w'_A = 0.3 + 1.0 = 1.3$
- $w'_B = 0.3 + 0.9 = 1.2$
- $w'_C = 1.0 + 0.9 = 1.9$
- $w_A = 1.3/(1.3 + 1.2 + 1.9) = 0.30$
- $w_B = 1.2/4.4 = 0.27$
- $w_C = 1.9/4.4 = 0.43$
Second pass:
- $w'_A = 0.3 \times 0.27 + 1.0 \times 0.43 = 0.51$
- $w'_B = 0.3 \times 0.30 + 0.9 \times 0.43 = 0.48$
- $w'_C = 1.0 \times 0.30 + 0.9 \times 0.27 = 0.54$
- $w_{A,B,C} = 0.33,\ 0.31,\ 0.35$
Following passes:
- $w_{A,B,C} = 0.30,\ 0.28,\ 0.42$
- $w_{A,B,C} = 0.31,\ 0.29,\ 0.40$
- $w_{A,B,C} = 0.30,\ 0.28,\ 0.41$
- $w_{A,B,C} = 0.30,\ 0.28,\ 0.41$ (converged)

Amino acid probability profiles
An amino acid profile is defined as a set of probability distributions over the 20 amino acids (A C D E F G H I K L M N P Q R S T V W Y), one PDF for each position in the alignment.
Gap probabilities may or may not be included when talking about a profile.

Amino acids are not equally likely in Nature
K, L and R are the most common.

Log likelihood ratios
$\text{LLR}(a) = \log \dfrac{P(a \mid i)}{P(a)}$
where $P(a \mid i)$ is the probability of a in one column and $P(a)$ is the likelihood of a overall (the whole database).

Pseudocounts (because you never know)
The probability of seeing a in column i of a sequence alignment is never really zero. So we add a small number of "pseudocounts" $\varepsilon$:
$\text{LLR}(a) = \log \dfrac{P(a \mid i) + \varepsilon}{P(a)}$
This LLR does not go to negative infinity as $P(a \mid i) \to 0$. Instead it goes to $\log(\varepsilon / P(a))$.

One way to visualize profiles: color matrix
Color = LLR. Blue: high negative values. Green: zero. Red: high positive values.

Another way: Logos
Height of letter is the LLR.
[Figure: sequence logo of 40 yeast TATA sites; vertical axis in bits]

Example: Logos for DNA alignments
[Figure: sequence logos for alignments of transcription factor footprint sites]

The score is the sum of the log-likelihood ratios of the amino acids in the sequence:
$\text{profile score} = \sum_i \text{LLR}(a_i)$
[Figure: a sequence scored against a profile]
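The distance-based weighting loop and the pseudocount log-likelihood ratio described above can be sketched in Python as follows. This is a minimal sketch, not the course's own code: the function names `self_consistent_weights` and `llr_bits`, the convergence tolerance `tol`, and the iteration cap `max_iter` are all assumptions introduced for illustration. The distance matrix in the usage section is the A, B, C example from the notes.

```python
import math


def self_consistent_weights(dist, tol=1e-6, max_iter=1000):
    """Iterate w'_i = sum_j w_j * D_ij, then normalize, until no change.

    dist is a square, symmetric matrix of pairwise distances (list of lists).
    tol and max_iter are assumed stopping rules, not taken from the notes.
    """
    n = len(dist)
    w = [1.0] * n  # all weights initialized to 1
    for _ in range(max_iter):
        # Step 1: sum the weighted distances to get new (raw) weights.
        raw = [sum(w[j] * dist[i][j] for j in range(n)) for i in range(n)]
        # Step 2: normalize so the weights sum to 1.
        total = sum(raw)
        new_w = [r / total for r in raw]
        # Step 3: stop when the weights no longer change.
        if max(abs(a - b) for a, b in zip(new_w, w)) < tol:
            return new_w
        w = new_w
    return w


def llr_bits(p_column, p_background, pseudocount=0.01):
    """Log-likelihood ratio in bits: log2((P(a|i) + eps) / P(a))."""
    return math.log2((p_column + pseudocount) / p_background)


if __name__ == "__main__":
    # Pairwise distances for sequences A, B, C from the worked example.
    D = [[0.0, 0.3, 1.0],
         [0.3, 0.0, 0.9],
         [1.0, 0.9, 0.0]]
    print(self_consistent_weights(D, max_iter=1))  # first pass only
    print(self_consistent_weights(D))              # iterate to convergence
    # A base never seen in a column still gets a finite score.
    print(llr_bits(0.0, 0.25))
```

With `max_iter=1` the function returns the first-pass weights, which round to 0.30, 0.27, 0.43 as in the worked example. Stopping on a small tolerance rather than exact equality is a design choice for floating-point arithmetic.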
In-class exercise: build a DNA profile using distances
- On bioinf45: cp ~bystroff/16s11mer.rsf
- Start SeqLab. Read in 16s11mer.rsf (part of a multiple sequence alignment of 16S RNA).
- Use Functions > Evolution > Distances to calculate distances. Use uncorrected.
- On paper: assign sequence weights based on the distances, using one iteration of distance-based weights, $w_A = \sum_i D_{iA}$. Then normalize (divide by $\sum_i w_i$).
- Sum the probabilities of each base in the 6th column, i.e. P(A) = sum of $w_i$ over sequences that contain an A.
- Convert to LLRs using equal-probability bases (0.25) and a pseudocount of 0.01: $\text{LLR} = \log\left((P(n) + 0.01)/0.25\right)$.
- Divide by log(2) to convert to "bits".
- Plot the LLR in Logo style. Height of letter is number of bits. Use upside-down letters for negative values.

Psi-BLAST: BLAST with profiles
Psi-BLAST searches the database iteratively.
- Cycle 1: normal BLAST (with gaps).
- Cycle 2: (a) construct a profile from the results of Cycle 1; (b) search the database using the profile.
- Cycle 3: (a) construct a profile from the results of Cycle 2; (b) search the database using the profile.
- And so on (the user sets the number of cycles).
Psi-BLAST is much more sensitive than BLAST. It is also more vulnerable to low-complexity sequence.

Lecture 25
Theoretical models. Biodiversity versus stability.

Computational models
[Figure: classes of computational models: numerical, analytical (math), heuristic, system parameterization, deterministic, theoretical, optimal, conceptual]

A conceptual model for protein folding: the folding funnel
Proteins bump against energy "walls" and get stuck in "traps". Narrow pathways define the "pathways" of folding.

Theoretical models and emergence
- Theoretical models attempt to find "emergent" phenomena from "fundamental" principles.
- A successful theoretical model demonstrates predictive behavior.
- Failure of a theoretical model means the "fundamental" principles are wrong, oversimple, insufficient, overfit, etc.
[Figure: real world, basic assumptions and simulations, linked by comparison]

ECOME: theoretical model for a global food web
[Figure: food web graph; nodes: populations; edges: predator species]
Assumptions
Emergent phenomena:
- Dynamic
stability/instability
- Biodiversity correlates with stability
- Trophic levels

Assumptions:
- Species are measured in generic biomass (mass) units.
- Species evolve by splitting unevenly (95:5).
- Species can be plants or animals.
- Plants get mass from the sun & atmosphere.
- The sun's total input is limited by the amount of land.
- Animals get mass from prey (plants and/or other animals).
- Animals and plants lose mass to predation, respiration and natural death.
- Mass balance is maintained.

Extinction occurs at population peak, due to resource overshoot
[Figure: plant mass that is left compared with the new mass needed to survive]
- Most of the prey species is consumed (endangered).
- The new mass needed to survive is not available: not enough.
- A fraction is unfed (the unfed fraction).
- The unfed fraction starves.

Prey availability is modeled using Holling's functions
Animals can starve while food
still exists.
[Figure: prey population too small; the predator can't find it]
- The animal is 100% unfed.
- It collapses to zero.

Evolutionary escape from extinction
[Figure: prey (plant) -> predator (animal)]
- A new subspecies arises (too small to find, and/or evolves a defense, i.e. poisonous); the subpopulation grows.
- The original population is consumed.
- The animal starves, or a new animal species evolves to take the new prey (i.e. immunity to poison).

Evolution on a food web
- Daughter species loses predators, gains prey.
- New prey must be within range.
httpwww bioinfo rpied ubystroeoomeooal mpeg

Stability
Ecosystem with no evolution: 100% collapse, since animals always finish off their prey species, then starve.

Stability: qualitative results
[Figure: outcome map with regions labeled: plants collapse quickly; plants collapse slowly; more trophic levels, boom and bust; few carnivores; animals collapse slowly; carnivores collapse, herbivores remain; animals collapse]

Biodiversity vs stability
[Figure: extant ecosystems after 10^6 cycles, plotted against number of species (10 to 10000)]
Why?
- 1 species: no simulation.
- 2 species: one eats the other, then dies.
- 3-100 species: not enough alternative food sources. The initial extinction rate exceeds the speciation rate (which is constant).
- 100-1000 species: more highly connected species. Extinction rate is lower. Species number sometimes grows, sometimes shrinks.
- > 1000 species: more highly connected species; the extinction rate is always lower than the speciation rate.

Genetic inheritance is Darwinian
The sources of your genes.

Cultural inheritance is Lamarckian
The sources of your ideas.

Lamarckian evolution
[Figure: use and disuse]

1. Humans evolve new predator-prey relationships without speciating: book learning. The whole population gains a food source at once, not just a small subset.
2. Humans can choose prey from anywhere in the ecosystem: trade.
3. Humans can defend against their own predators: self-protection,
medicine.
[Figure: a human predator; the predator becomes prey]
4. Humans can control their own reproductive rate (can we?).
http://www.bioinfo.rpi.edu/~bystrc/pub/mpeg/culture3c.mpeg