INTRO BIOINFORMATICS BCB 444
These 70 pages of class notes were uploaded by Mrs. Frederic Hansen on Saturday, September 26, 2015. The notes belong to BCB 444 at Iowa State University, taught by Drena Dobbs in Fall.
Date Created: 09/26/15
#6 - Scoring Matrices & Alignment Statistics
BCB 444/544 F07 ISU, Dobbs, 8/31/07

Announcements
- Revised notes for Lecture 5 posted online. Changes mainly re: ordering, symbols, color "coding".
- Mon Sept 3: NO CLASSES AT ISU (Labor Day). Enjoy!!
- Tues Sept 4: Lab 2 Exercise Writeup due by 5 PM (or sooner). Send via email to Pete Zaback.
- HW2 assignment will be posted online.
- Fri Sept 14: HW2 due by 5 PM (or sooner).
- Fri Sept 21: Exam 1.

Chp 3 - Sequence Alignment: Outline
- Methods: Global and Local Alignment
- Alignment Algorithms
  - Dot Matrix Method
  - Dynamic Programming Method (cont.): gap penalties, DP for Global Alignment, DP for Local Alignment
- Scoring Matrices
  - Amino acid scoring matrices: PAM, BLOSUM
  - Comparisons between PAM & BLOSUM
- Statistical Significance of Sequence Alignment

Required Reading
- Mon Aug 27, for Lecture 4: Pairwise Sequence Alignment, Chp 3, pp 31-41
- Wed Aug 29, for Lecture 5: Dynamic Programming
- Thurs Aug 30, Lab 2: Databases, ISU Resources & Pairwise Sequence Alignment
- Fri Aug 31, for Lecture 6: Scoring Matrices & Alignment Statistics, Chp 3, pp 41-49

Adapted from ... Altman, Fernandez-Baca ...

Sequence Homology vs Similarity
- Homologous sequences: sequences that share a common evolutionary ancestry.
- Similar sequences: sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge).

IMPORTANT:
- Sequence homology: an inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity. Homology is qualitative.
- Sequence similarity: the direct result of observation from a sequence alignment.
- Similarity is quantitative; it can be described using percentages.

Goal of Sequence Alignment
Find the best pairing of 2 sequences, such that there is maximum correspondence between residues.
- DNA (4-letter alphabet + gap):
    TTGACAC
    TTTACAC
- Proteins (20-letter alphabet + gap):
    RKVA-GMA
    RKIAVAMA

Statement of Problem
Given:
- 2 sequences
- Scoring system for evaluating match (or mismatch) of two characters
- Penalty function for gaps in sequences
Find: the optimal pairing of sequences that
- Retains the order of characters
- Introduces gaps where needed
- Maximizes total score

Avoiding Random Alignments with a Scoring Function
- Introducing too many gaps generates nonsense alignments:
    s--e-----qu---en--ce
    sometimesquipsentice
- Need to distinguish between alignments that occur due to homology and those that occur by chance.
- Define a scoring function that rewards matches and penalizes mismatches and gaps.
- Scoring function: match reward $\alpha$, mismatch penalty $\beta$, gap penalty $\gamma$, so that
  $$\text{score} = \alpha\,(\#\text{matches}) - \beta\,(\#\text{mismatches}) - \gamma\,(\#\text{gaps})$$

Not All Mismatches are the Same
- Some amino acids are more "exchangeable" than others (their physicochemical properties are similar), e.g., Ser & Thr are more similar than Trp & Ala.
- A substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions.
- Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general).

Substitution Matrix
- s(a,b) corresponds to the score of aligning character a with character b.
- Match scores are often calculated based on the frequency of mutations in very similar sequences (more details later).
[Figure: amino acid substitution matrix (rows and columns A, C, D, E, F, G, H, ...), BLOSUM 62]

Global vs Local Alignment
Global alignment
- Finds best possible alignment across entire length of 2 sequences.
- Aligned sequences are assumed to be generally similar over entire length.
Local alignment
- Finds local regions with highest similarity between 2 sequences.
- Aligns these without regard for rest of sequence.
- Sequences are not assumed to be similar over entire length.

Global vs Local Alignment: Which should be used when?
Example sequences: CTGTCGCTGCACG and TGCCGTG.
[Figure: a global alignment of the two sequences over their entire length, next to a local alignment of TGCCGTG against its best-matching region] Which is better?

Alignment Algorithms
3 major methods for pairwise sequence alignment:
1. Dot matrix analysis
2. Dynamic programming
3. Word or k-tuple methods (later, in Chp 4)

Dot Matrix Method (Dot Plots)
- Place 1 sequence along top row of matrix.
- Place 2nd sequence along left column of matrix.
- Plot a dot each time there is a match between an element of row sequence and an element of column sequence.
- For proteins, usually use more sophisticated scoring schemes than "identical match".
- Diagonal lines indicate areas of match.
- Contiguous diagonal lines reveal alignment; "breaks" = gaps (indels).

Interpretation of Dot Plots
When comparing 2 sequences:
- Diagonal lines of dots indicate regions of similarity between 2 sequences.
- Reverse diagonals (perpendicular to diagonal) indicate inversions.

Dynamic Programming (for pairwise sequence alignment)
Idea: display one sequence above another, with spaces inserted in both to reveal similarity.
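The dot matrix construction described in these notes can be sketched in a few lines. This is a minimal illustration, assuming a plain "identical match" rule; the function name `dot_plot` and the `*`/`.` output characters are made up for this example and do not come from the lecture.

```python
def dot_plot(row_seq, col_seq):
    """Build a dot matrix as a list of strings.

    row_seq runs along the top of the matrix, col_seq down the left side.
    A '*' marks a match between a row-sequence character and a
    column-sequence character; '.' marks a non-match.
    (Identity matching only: an assumption made for this sketch.)
    """
    return [
        "".join("*" if r == c else "." for r in row_seq)
        for c in col_seq
    ]


if __name__ == "__main__":
    top, left = "CTCGCAGC", "CATTCAC"
    print("  " + top)
    for ch, line in zip(left, dot_plot(top, left)):
        print(ch + " " + line)
```

Runs of `*` along a diagonal correspond to the diagonal lines that indicate regions of similarity.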
Example of two sequences displayed one above the other, with spaces inserted:
    C A T - T C A - C
    |   |     | |   |
    C - T C G C A G C

Interpretation of Dot Plots (cont.)
- What do such patterns mean when comparing a sequence with itself (or its reverse complement)?
- e.g., reverse diagonals crossing the main diagonal indicate palindromes.

Global Alignment Scoring
Sequences: CTGTCGCTGCACG and TGCCGTG
- Reward for matches: 10
- Mismatch penalty: 2
- Space/gap penalty: 5
In general, with reward for matches $\alpha$, mismatch penalty $\beta$ and space/gap penalty $\gamma$:
$$\text{Score} = \alpha w - \beta x - \gamma y$$
where w = # matches, x = # mismatches, y = # spaces.
Column scores for the alignment shown on the slide: -5, 10, 10, -2, -5, -2, -5, -5, 10, 10, -5. Total = 11.
We could have done better!

Alignment Algorithms
- Global: Needleman-Wunsch
- Local: Smith-Waterman
- Both NW and SW use dynamic programming.
- Variations: gap penalty functions, scoring matrices.

Dynamic Programming: Key Idea
The score of the best possible alignment that ends at a given pair of positions (i, j) is equal to:
the score of the best alignment ending just previous to those two positions (i.e., ending at i-1, j-1)
PLUS
the score for aligning $x_i$ and $y_j$.

Global Alignment: DP Problem Formulation & Notations
Given two sequences (strings):
- $X = x_1 x_2 \dots x_N$ of length N (e.g., x = AGC, N = 3)
- $Y = y_1 y_2 \dots y_M$ of length M (e.g., y = AAAC, M = 4)
Construct a matrix with (N+1) x (M+1) elements, where
$S(i,j)$ = score of best alignment of $x[1..i] = x_1 x_2 \dots x_i$ with $y[1..j] = y_1 y_2 \dots y_j$.
Example: $S(2,3)$ = score of best alignment of AG ($x_1 x_2$) to AAA ($y_1 y_2 y_3$).

Dynamic Programming: 4 Steps
1. Define score of optimum alignment, using recursion.
2. Initialize and fill in a DP matrix for storing optimal scores of subproblems, by solving smallest subproblems first (bottom-up approach).
3. Calculate score of optimum alignment(s).
4. Trace back through matrix to recover optimum alignment(s) that generated optimal score.
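The scoring rule from the Global Alignment Scoring slide (reward for matches, penalties for mismatches and spaces) can be checked on any gapped alignment with a short sketch like the one below. The function name `score_alignment` and the use of `-` for a space are assumptions made for illustration; the default values 10, -2 and -5 are the ones quoted in the notes.

```python
def score_alignment(top, bottom, match=10, mismatch=-2, space=-5):
    """Score an already-aligned pair of equal-length strings.

    '-' denotes a space. Each column contributes the match reward,
    the mismatch penalty, or the space penalty.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += space       # a character lined up with a space
        elif a == b:
            total += match       # identical characters
        else:
            total += mismatch    # two different characters
    return total


if __name__ == "__main__":
    # 5 matches, 1 mismatch, 3 spaces: 5*10 - 1*2 - 3*5 = 33
    print(score_alignment("CAT-TCA-C", "C-TCGCAGC"))
```

Counting columns by hand gives w = 5, x = 1, y = 3, so the score is 10w - 2x - 5y = 33, which is what the function returns.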
1. Define Score of Optimum Alignment using Recursion
Define:
- $x[1..i]$ = prefix of length i of x
- $y[1..j]$ = prefix of length j of y
- $S(i,j)$ = score of optimum alignment of $x[1..i]$ and $y[1..j]$
Initial conditions:
$$S(i,0) = -i\,\gamma, \qquad S(0,j) = -j\,\gamma$$
Recursive definition, for $1 \le i \le N$, $1 \le j \le M$:
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$

2. Initialize & Fill in DP Matrix for Storing Optimal Scores of Subproblems
- Construct a sequence vs sequence matrix, from $S(0,0) = 0$ in the upper left to $S(N,M)$ in the lower right.
- Initialization: $S(i,0) = -i\,\gamma$ and $S(0,j) = -j\,\gamma$.
- Recursion: as defined in step 1.
- Fill in from (0,0) to (N,M), row by row, calculating the best possible score for each alignment including residues at (i,j).
- Keep track of dependencies of scores (in a pointer matrix).

3. Calculate Score S(N,M) of Optimum Alignment for Global Alignment
What happens in the last step in the alignment of $x[1..i]$ to $y[1..j]$? 1 of 3 cases applies:
- $x_i$ aligns to $y_j$: score $S(i-1,j-1) + \sigma(x_i, y_j)$
- $x_i$ aligns to a gap: score $S(i-1,j) - \gamma$
- $y_j$ aligns to a gap: score $S(i,j-1) - \gamma$

Example (x = CATTCAC, y = CTCGCAGC)
- Case 1: line up $x_i$ with $y_j$
- Case 2: line up $x_i$ with space
- Case 3: line up $y_j$ with space

Fill in the matrix
Initialized matrix: top row 0, -5, -10, -15, -20, -25, -30, -35, -40 (columns λ C T C G C A G C); left column 0, -5, -10, -15, -20, -25, -30, -35 (rows λ C A T T C A C).
Scoring: +10 for match, -2 for mismatch, -5 for space.
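The recursion, initialization and row-by-row fill described in these notes can be sketched as follows. This is a minimal sketch, not the lecture's own code: the function name `nw_matrix` is hypothetical, the gap penalty is passed as a negative number that is added (equivalent to subtracting a positive gamma), and sigma is assumed to be a simple match/mismatch score rather than a substitution matrix.

```python
def nw_matrix(x, y, match=10, mismatch=-2, gap=-5):
    """Fill the global-alignment (Needleman-Wunsch) score matrix.

    S[i][j] holds the score of the best alignment of x[:i] with y[:j].
    Rows follow x, columns follow y.
    """
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]

    # Initialization: aligning a prefix against nothing costs one gap per character.
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap

    # Recursion: each cell takes the best of the three possible last columns.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + sigma,  # x_i aligned to y_j
                S[i - 1][j] + gap,        # x_i aligned to a space
                S[i][j - 1] + gap,        # y_j aligned to a space
            )
    return S


if __name__ == "__main__":
    S = nw_matrix("CATTCAC", "CTCGCAGC")
    for row in S:
        print(" ".join(f"{v:4d}" for v in row))
    print("optimum score:", S[-1][-1])  # 33 for these two sequences
```

The score of the optimum global alignment is the lower-right cell, `S[N][M]`. Recording which of the three candidates produced each maximum (a pointer matrix) is what makes the later traceback possible; that bookkeeping is left out here to keep the sketch short.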
3. Calculate score of optimum alignment
Scoring: +10 for match, -2 for mismatch, -5 for space.

         λ    C    T    C    G    C    A    G    C
    λ    0   -5  -10  -15  -20  -25  -30  -35  -40
    C   -5   10    5    0   -5  -10  -15  -20  -25
    A  -10    5    8    3   -2   -7    0   -5  -10
    T  -15    0   15   10    5    0   -5   -2   -7
    T  -20   -5   10   13    8    3   -2   -7   -4
    C  -25  -10    5   20   15   18   13    8    3
    A  -30  -15    0   15   18   13   28   23   18
    C  -35  -20   -5   10   13   28   23   26   33

4. Trace back through matrix to recover optimum alignment(s) that generated the optimal score
- How? "Repeat" alignment calculations in reverse order, starting from the position with the highest score and following the path, position by position, back through the matrix.
- Result? Optimal alignment(s) of sequences.

Traceback for Global Alignment
- Start in lower right corner & trace back to upper left.
- Each arrow introduces one character at the end of the alignment:
  - A horizontal move puts a gap in the left sequence.
  - A vertical move puts a gap in the top sequence.
  - A diagonal move uses one character from each sequence.
- Can have >1 optimum alignment; this example has 2.

Local Alignment: Motivation
- To "ignore" stretches of non-coding DNA:
  - Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions.
  - Local alignment between two protein-encoding sequences is likely to be between two exons.
- To locate protein domains or motifs:
  - Proteins with similar structures and/or similar functions but from different species (for example) often exhibit local sequence similarities.

Local Alignment: Example
Match: +2; mismatch or space: -1
Sequences: ggtctgag and aaacga
Best local alignment:
    ctga
    c-ga
Score = 5

Local Alignment Algorithm
- $S(i,j)$ = score for optimally aligning a suffix of $X[1..i]$ with a suffix of $Y[1..j]$.
- Initialize top row & leftmost column of matrix with "0".
Recall, for Global Alignment:
- $S(i,j)$ = score for optimally aligning a prefix of X with a prefix of Y.
- Initialize top row & leftmost column with gap penalty.

Traceback for Local Alignment
Scoring: +1 for a match, -1 for a mismatch, -5 for a space.

        λ  C  T  C  G  C  A  G  C
    λ   0  0  0  0  0  0  0  0  0
    C   0  1  0  1  0  1  0  0  1
    A   0  0  0  0  0  0  2  0  0
    T   0  0  1  0  0  0  0  1  0
    T   0  0  1  0  0  0  0  0  0
    C   0  1  0  2  0  1  0  0  1
    A   0  0  0  0  1  0  2  0  0
    C   0  1  0  1  0  2  0  1  1

Some Results re: Alignment Algorithms (for ComS, CprE & Math types!)
- Most pairwise sequence alignment problems can be solved in O(mn) time.
- Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers 88].
- Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau 86].

"Scoring" or "Substitution" Matrices
2 major types for amino acids: PAM & BLOSUM
- PAM = Point Accepted Mutation: relies on an "evolutionary model" based on observed differences in alignments of closely related proteins.
- BLOSUM = BLOck SUbstitution Matrix: based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins.
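The local alignment algorithm in these notes differs from the global one in its initialization (zeros) and in where the best score is read from (anywhere in the matrix). A minimal sketch under stated assumptions: the function name `sw_score` is hypothetical, only the best score is returned (no traceback), and the floor at zero in the recursion is the standard Smith-Waterman rule, which the notes imply through the all-nonnegative example matrix rather than state outright.

```python
def sw_score(x, y, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment (Smith-Waterman) score of x and y.

    Only two rows of the matrix are kept, since just the score is needed.
    """
    best = 0
    prev = [0] * (len(y) + 1)          # top row initialized with 0
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)       # leftmost column initialized with 0
        for j in range(1, len(y) + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            cur[j] = max(
                0,                     # start a fresh local alignment here
                prev[j - 1] + sigma,   # extend diagonally
                prev[j] + gap,         # space in y
                cur[j - 1] + gap,      # space in x
            )
            best = max(best, cur[j])   # best score may occur in any cell
        prev = cur
    return best


if __name__ == "__main__":
    # Example from the notes: match +2, mismatch or space -1
    print(sw_score("ggtctgag", "aaacga"))  # 5
```

With +1/-1/-5 scoring, `sw_score("CATTCAC", "CTCGCAGC", 1, -1, -5)` gives 2, the largest entry of the traceback matrix in the notes.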
PAM Matrix
PAM = Point Accepted Mutation
- Relies on an "evolutionary model" based on observed differences in closely related proteins.
- Model includes a defined rate for each type of sequence change.
- Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed.
- PAM1: for less divergent sequences (shorter time).
- PAM250: for more divergent sequences (longer time).

BLOSUM Matrix
BLOSUM = BLOck SUbstitution Matrix
- Based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins.
- Doesn't rely on a specific evolutionary model.
- Suffix number (n) reflects expected similarity: average % aa identity in the MSA from which the matrix was generated.
- BLOSUM45: for more divergent sequences.
- BLOSUM62: for less divergent sequences.

Statistical Significance of Sequence Alignment

Gap penalty
$$\text{gap penalty} = h + g\,k$$
where k = length of gap, h = gap opening penalty, g = gap extension penalty.

Protein Structure & Function
D Dobbs, ISU - BCB 444/544X, 11/04/05

Announcements
Exam 2: has been graded; will be returned at end of class today.
Grade statistics:
- 444: Average = 81/100
- 544: Average = 100/118

Bioinformatics Seminars
- Nov 4 Fri 12:10 PM, BCB Faculty Seminar in E164 Lago: "How to do sequence alignments on parallel computers", Srinivas Aluru, ECprE & Chair, BCB Program

BCB 544 Projects - Important Dates
- Nov 2 Wed noon: Project proposals due to David/Drena
- Nov 4 Fri PM: Approvals/responses & tentative presentation schedule to students
- Dec 2 Fri noon: Written project reports due
- Dec 5, 7, 8, 9 class/lab: Oral Presentations (20')
- Dec 15 Thurs: Final Exam

Bioinformatics Seminars, next week
- Nov 10 Thurs 3:40 PM, ComS Seminar in 223 Atanasoff: "Computational Epidemiology", Armin R. Mikler, Univ. North Texas

RNA Structure & Function/Prediction; Protein Structure & Function
- Mon: Review - promoter prediction; RNA structure & function
- Wed: RNA structure prediction: 2' & 3' structure prediction; miRNA & target prediction; Lab 10
- Fri: a few more words re: Algorithms; Protein structure & function

Reading Assignment (for Fri/Mon)
- Mount, Bioinformatics, Chp 10: Protein classification & structure prediction, pp 409-491
  http://www.bioinformaticsonline.org/ch/ch10/index.html
- Ck Errata: http://www.bioinformaticsonline.org/help/errata2.html
- Other? That should be plenty!

Review last lecture: RNA Structure Prediction

miRNA and RNAi pathways
[Figure: microRNA pathway (from the microRNA primary transcript) and RNAi pathway; outcomes shown include translational repression and mRNA cleavage/degradation. C Burge 2005]

miRNA Challenges for Computational Biology
- Find the genes encoding microRNAs.
- Predict their regulatory targets.
- Integrate miRNAs into gene regulatory pathways & networks.
Computational Prediction of MicroRNA Genes & Targets (C Burge 2005)

RNA structure prediction strategies
Secondary structure prediction:
1) Energy minimization (thermodynamics)
2) Comparative sequence analysis (co-variation)
3) Combined experimental & computational

Secondary structure prediction strategies
1) Energy minimization (thermodynamics)
- Algorithm: dynamic programming to find high probability pairs (also, some genetic algorithms)
- Software: Mfold (Zuker), Vienna RNA Package (Hofacker), RNAstructure (Mathews), Sfold (Ding & Lawrence)
(R Knight 2005)

Secondary structure prediction strategies
2) Comparative sequence analysis (co-variation)
- Algorithms: mutual information; stochastic context-free grammars
- Software: ConStruct, Alifold, Pfold, FOLDALIGN, Dynalign
(R Knight 2005)

Experimental RNA structure determination?
- X-ray crystallography
- NMR spectroscopy
- Enzymatic/chemical mapping

Secondary structure prediction strategies
3) Combined experimental & computational
- Experiment: map single-stranded vs double-stranded regions in folded RNA.
- How?
  - Enzymes: S1 nuclease, T1 RNase
  - Chemicals: kethoxal, DMS
(R Knight 2005)

1) Energy minimization method
What are the assumptions?
- Native tertiary structure or "fold" of an RNA molecule is (one of) its "lowest" free energy configuration(s).
- Gibbs free energy = ΔG in kcal/mol at 37 °C = equilibrium stability of structure; lower values (negative) are more favorable.
Is this assumption valid? In vivo this may not hold, but we don't really know.

Free energy minimization
What are the rules?
- Basepair A-U stacked on A-U: ΔG = -1.2 kcal/mole
- Basepair A-U stacked on U-A: ΔG = -1.6 kcal/mole
What gives here?
(C Staben 2005)

Energy minimization calculations: base stacking is critical
[Table: stacking ΔG values for neighboring base pairs (AC, CA, GA, UC, UG, GU, CU), Tinoco et al.]
(C Staben 2005)

Nearest neighbor parameters
Most methods for free energy minimization use nearest neighbor parameters (derived from experiment) for predicting stability of an RNA secondary structure (in terms of ΔG at 37 °C), and most available software packages use the same set of parameters: Mathews, Sabina, Zuker & Turner, 1999.

Energy minimization calculations
Total free energy of a specific conformation for a specific RNA molecule = sum of incremental energy terms for:
- helical stacking (sequence dependent)
- loop initiation
- unpaired stacking
(favorable "increments" are < 0)
(Fig 53, Baxevanis & Ouellette 2005)

But how many possible conformations for a single RNA molecule? Huge number:
- Zuker estimates $(1.8)^N$ possible secondary structures for a sequence of N nucleotides; for 100 nts (a small RNA), that is about $3 \times 10^{25}$ structures!
Solution? Not exhaustive enumeration, but dynamic programming:
- $O(N^3)$ in time and $O(N^2)$ in space/storage, iff pseudoknots are excluded
- otherwise $O(N^6)$ time and $O(N^4)$ space

2) Comparative sequence analysis (co-variation)
Two basic approaches:
- Algorithms constrained by initial alignment
  - Much faster, but not as robust as unconstrained
  - Base-pairing probabilities determined by a partition function
- Algorithms not constrained by initial alignment
  - Genetic algorithms often used for finding an alignment & set of structures

RNA secondary structure prediction: performance?
How evaluate?
- Not many experimentally determined structures (currently, 50 ... rRNA structures).
- So, "Gold Standard" (in absence of tertiary structure): compare predicted RNA secondary structure with that determined by comparative sequence analysis.
- Benchmark datasets.
- NOTE: base-pairing predictions from comparative sequence analysis for large & small subunit rRNAs are 97% accurate when compared with high resolution crystal structures! (Gutell, Pace)

RNA secondary structure prediction: performance
1) Energy minimization (via dynamic programming): 73% avg. prediction accuracy, single sequence
2) Comparative sequence analysis: 97% avg. prediction accuracy, multiple sequences (e.g., highly conserved rRNAs); much lower if sequence conservation is lower and/or fewer sequences are available for alignment
3) Combined: recent developments combine thermodynamics & co-variation & experimental constraints. IMPROVED RESULTS

RNA structure prediction strategies
Tertiary structure prediction:
1) Use comparative sequence analysis to predict tertiary contacts (co-variation), e.g., MANIP (Westhof)
2) Use experimental data to constrain model building, e.g., MC-Sym (Major)
3) Homology modeling using sequence alignment & reference tertiary structure (not many of these!)
4) Low resolution molecular mechanics, e.g., yammp (Harvey)
Requires "craft" & significant user input & insight.

New Today: Protein Structure & Function
- Protein structure: primarily determined by sequence.
- Protein function: primarily determined by structure.
- Globular proteins: compact hydrophobic core & hydrophilic surface.
- Membrane proteins: special hydrophobic surfaces.
- Folded proteins are only marginally stable.
- Some proteins do not assume a stable "fold" until they bind to something = intrinsically disordered.
- Predicting protein structure and function can be very hard (& fun!).

4 Basic Levels of Protein Structure

Primary & Secondary Structure
Primary
- Linear sequence of amino acids
- Description of covalent bonds linking aa's
Secondary
- Local spatial arrangement of amino acids
- Description of short-range non-covalent interactions
- Periodic structural patterns: α-helix, β-sheet

"Additional" Structural Levels
- Super-secondary elements
- Motifs
- Domains
- Foldons

Tertiary & Quaternary Structure
Tertiary
- Overall 3-D "fold" of a single polypeptide chain
- Spatial arrangement of 2' structural elements; packing of these into compact "domains"
- Description of long-range non-covalent interactions (plus disulfide bonds)
Quaternary
- In proteins with > 1 polypeptide chain, spatial arrangement of subunits

#33 - Genomics
BCB 444/544 F07 ISU, Dobbs, 11/09/07

Required Reading
- Lecture 31: Phylogenetics - Parsimony and ML, Chp 11, pp 142-169
- Lecture 32: Machine Learning
- Fri Nov 9, Lecture 33: Functional and Comparative Genomics, Chp 17 and Chp 18

Assignments & Announcements
- Fri Nov 9: HW#6 will be posted this weekend.
- HW#6 (Machine Learning): due Fri Nov 16.
- ... some time before Mon Nov 26.

Seminars (BCB List of URLs for Seminars related to Bioinformatics)
- Nov 7 Wed, BBMB Seminar, 4:10 in 1414 MBB: Sharon Roth Dent, MD Anderson Cancer Center, "Role of chromatin and chromatin modifying proteins in regulating gene expression"
- Nov 8 Thurs, BBMB Seminar, 4:10 in 1414 MBB: Jianzhi George Zhang, U Michigan, "Evolution of new functions for proteins"
- Nov 9 Fri, BCB Faculty Seminar, 2:10 in 102 ScI: Amy Andreotti, ISU, "T cell signaling: insights from protein NMR spectroscopy"

Examples of Machine Learning Algorithms
- Naive Bayes (NB): Bayes Theorem
- Neural network (NN) or Artificial Neural Net (ANN): Perceptrons
- Support vector machine (SVM): Kernel functions
- Lab: WEKA - Decision Trees (DT), NB, SVM

An Application: Predicting RNA Binding Sites in Proteins
- Problem: given an amino acid sequence, classify each residue as RNA binding or non-RNA binding.
- Input to the classifier is a string of amino acid identities.
- Output from the classifier is a class label, either binding or not.

Bayes Theorem Applied to RNA Binding Site Prediction
$$P(\text{binding} \mid \text{aa seq}) = \frac{P(\text{binding})\,P(\text{aa seq} \mid \text{binding})}{P(\text{aa seq})}$$

Naive Bayes for Binary Classification
Assign c = 1 if
$$\frac{P(c=1 \mid X=x)}{P(c=0 \mid X=x)} = \frac{P(c=1)\,\prod_i P(X_i = x_i \mid c=1)}{P(c=0)\,\prod_i P(X_i = x_i \mid c=0)} \ge \theta$$
Otherwise assign c = 0.

Predicted vs Actual RNA Binding for Ribosomal protein L15 (PDB ID 1JJ2:K)
[Figure: predicted vs actual RNA-binding residues]
Example of the likelihood terms for a window starting with residues T, S:
$$\frac{p(X_1 = T \mid c=1)\,p(X_2 = S \mid c=1)\cdots}{p(X_1 = T \mid c=0)\,p(X_2 = S \mid c=0)\cdots}$$

Biological Neurons Sum Input Signals & Generate Output Signal
- Dendrites receive inputs.
- Axon sends output.
Image from Christos Stergiou and Dimitrios Siganos, http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html

Artificial Neural Networks (ANNs or NNs)
- Neural networks classify "input vectors" or "examples" into categories (2 or more).
- They are loosely based on biological neurons.
- Some of the most successful methods for predicting secondary structure are based on neural networks.
- Neural networks are trained to recognize amino acid patterns corresponding to known secondary structure elements; these patterns are used to predict secondary structure type for aa sequences in proteins of unknown structure.

Simple Neuron: The Perceptron
- Perceptron is the "simplest ANN": a feed-forward NN and a linear classifier.
- Input X, Weights W, Summation S, Threshold T, Output F:
  F = 1 if S > T; F = 0 if S < T
- Perceptron combines input vector $X_{1..N}$ with the weights, compares the "sum" S with a threshold T, and generates an output (class label, either 1 or 0).
- If weights W and threshold T are not known in advance, the perceptron must be trained.
- Ideally, the perceptron is trained to return the correct answer for all training examples, and to perform well on test examples it has never seen.
Image from Christos Stergiou and Dimitrios Siganos, http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
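The perceptron's forward computation (weighted sum S compared against threshold T) can be sketched directly. This is a minimal illustration: the function name `perceptron_output` and the AND-gate weights are chosen for this example, and returning 0 when S equals T exactly is an assumption, since the notes only define the output for S > T and S < T.

```python
def perceptron_output(x, w, threshold):
    """Return the class label (1 or 0) of a single perceptron.

    x: input vector, w: weight vector of the same length.
    The summation S is the dot product of x and w.
    """
    s = sum(xi * wi for xi, wi in zip(x, w))
    # Assumption: the boundary case s == threshold is labeled 0.
    return 1 if s > threshold else 0


if __name__ == "__main__":
    # Hand-picked weights that make the perceptron act as a logical AND.
    weights, threshold = (1.0, 1.0), 1.5
    for inputs in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(inputs, perceptron_output(inputs, weights, threshold))
```

Here the weights and threshold are fixed by hand; in practice they are unknown and have to be learned from labeled training examples.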
- Training set must contain both classes of data (i.e., examples labeled 1 and 0).

Perceptron "Sums" Inputs by Computing Dot Product S = X·W
- Input is a vector X.
- Weight is another vector W.
- Perceptron summation S computes the dot product $S = X \cdot W$.
- Perceptron output F is a function of S: it is often discrete (1 or 0), in which case the function is a step function.
- For continuous output, a sigmoid function is often used.

Training a Perceptron
Find the weights W that minimize the error function E:
$$E = \sum_{i=1}^{P} \bigl(F(W \cdot X_i) - t(X_i)\bigr)^2$$
- P: number of training examples
- $X_i$: training vectors
- $F(W \cdot X_i)$: output of perceptron
- $t(X_i)$: target value for $X_i$
Use steepest descent:
- compute gradient: $\nabla E = \left(\dfrac{\partial E}{\partial w_1}, \dfrac{\partial E}{\partial w_2}, \dfrac{\partial E}{\partial w_3}, \dots, \dfrac{\partial E}{\partial w_N}\right)$
- update weight vector: $W_{new} = W_{old} - \varepsilon \nabla E$ ($\varepsilon$: learning rate)
- iterate

Artificial Neural Network (ANN)
- Artificial neural network: a set of perceptrons interconnected such that outputs of some units become inputs of other units.
- Many topologies are possible; can have multiple layers.
- Neural networks are trained in the same way perceptrons are trained, by minimizing an error function:
$$E = \sum_{i=1}^{P} \bigl(F(X_i) - t(X_i)\bigr)^2$$

Support Vector Machines (SVMs)
Image from en.wikipedia.org/wiki/Support_vector_machine

SVM Finds Maximum Margin Hyperplane
i.e., the hyperplane that provides maximum separation between two classes of instances in the dataset.

Kernel Function
The original input space can always be mapped to some higher-dimensional feature space where the training data become separable.

Kernel Trick
Map data into a feature space where they are linearly separable.
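The idea of mapping data into a feature space where they become linearly separable can be illustrated with XOR, which no single line separates in two dimensions. The sketch below uses an explicit feature map rather than a kernel function proper (a kernel computes inner products in the feature space without building it), so it shows the motivation, not an SVM. The names `feature_map` and `linear_classify` and the hand-picked weights are assumptions for this example.

```python
def feature_map(x1, x2):
    """Map a 2-D point into 3-D by adding the product x1*x2 as a new feature."""
    return (x1, x2, x1 * x2)


def linear_classify(z, w=(1.0, 1.0, -2.0), threshold=0.5):
    """Linear decision rule in the 3-D feature space (hand-picked hyperplane)."""
    s = sum(zi * wi for zi, wi in zip(z, w))
    return 1 if s > threshold else 0


# XOR labels: not linearly separable in the original 2-D input space.
XOR_DATA = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

if __name__ == "__main__":
    for point, label in XOR_DATA.items():
        z = feature_map(*point)
        print(point, "->", z, "predicted", linear_classify(z), "actual", label)
```

In the mapped space the plane x1 + x2 - 2·x1·x2 = 0.5 puts (0,1) and (1,0) on one side and (0,0) and (1,1) on the other, which is impossible with a straight line in the original space.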
Take Home Messages
- Must consider how to set up the learning problem (supervised or unsupervised, generative or discriminative, classification or regression, etc.).
- Lots of algorithms out there.
- No algorithm performs best on all problems.

Genomics: Outline
1. Genomic sequencing: Mapping and Sequencing (CTGA2005Lecture1.pdf), Eric Green, NHGRI
2. Human genome project: The Human Genome ... Ch 17, Jonathan Pevsner, Kennedy Krieger Institute
3. SNPs: Studying Genetic Variation II: Computational Techniques (CTGA2005Lecture13.pdf), Jim Mullikin, NHGRI
4. Comparative Genomics: Comparative Sequence Analysis (CTGA2005Lecture8.pdf), Elliott Margulies, NHGRI

1. Genomic sequencing
Many thanks to Eric Green, NHGRI, for the following slides extracted from his lecture on Mapping and Sequencing.

Genomic Sequencing (Brief Review)
Eric D. Green, M.D., Ph.D., National Human Genome Research Institute, egreen@nhgri.nih.gov

Comparison of Sequenced Genome Sizes
[Figure: genome sizes of 3,000,000,000 bp, 160,000,000 bp, 100,000,000 bp, 15,000,000 bp and 5,000,000 bp. E Green 2005]

Comparison of Genetic & Physical Maps
[Figure. E Green 2005]

STSs provide common markers for "linking" genetic & physical maps
[Figure: STS1, STS2, STS3 along a DNA segment; each STS is defined by PCR Primer 1 and PCR Primer 2 and spans 60-1000 bp. E Green 2005]

Genomic sequencing requires assembly of sequences obtained from cloned DNA

Strategies for physical mapping are radically changing
[Figure text, partly recoverable: positional cloning; study of genome organization and evolution; exploration of new genomes; general types of physical maps (landmark maps, radiation hybrid maps); construction of new BAC libraries will allow clone-based physical mapping studies of more species' genomes. E Green 2005]

Human Genome Sequencing: Two approaches
Public (government): International Consortium (6 countries, NIH-funded in US)
o "Hierarchical" cloning & BAC-by-BAC sequencing
o Map-based assembly
Private (industry): Celera, Craig Venter
o Whole genome random "shotgun" sequencing
o Computational assembly
o Took advantage of public maps & sequences, too
Guess which human genome Celera sequenced?

[Figure: Prepare Multiple Copies; Randomly Fragment; Subclone Fragments. Green 2001]

Either Strategy: Sequence "Finishing"
Hardest part: Finishing

Sequence Finishing Remains Relatively Expensive

Advances in DNA Sequencing Technology

Sequencing Method 1: Gilbert-Maxam "Chemical Degradation"
[Figure: G Reaction, A Reaction, T Reaction, C Reaction. Adapted from Messing & Llaca, PNAS 1998]

Sequencing Method 2: Sanger "Dideoxy Chain Termination"
[Figure: DNA fragments separated by electrophoresis; optical detection system; laser; fluorescent dyes.]
Another "recent"
improvement: rapid & high-resolution separation of fragments in capillaries instead of gels (E Yeung, Ames Lab, ISU)
[Figure credit: Wilson & Mardis 1997]

Recent technologies: Pyro & 454 Sequencing
454 Life Sciences (from Wikipedia, the free encyclopedia)
[Excerpt largely unrecoverable. Legible fragments mention Project "Jim" (May 2007), Watson, and Baylor College of Medicine.]

1st Eukaryotic Genome Sequence: S. cerevisiae
The yeast genome directory. Nature 387:1-105, 1997

1st Animal Genome Sequence: C. elegans
Genome Sequence of the Nematode C. elegans: A Platform for Investigating Biology. Science 282:2012-2018, 1998

Timetable for Human Genome Sequencing: Faster than expected
[Figure: human genome sequencing timeline.]

1st Draft Human Genome "complete" in 2001

Public Sequencing: International Consortium
o 6 Countries
o 20 Sequencing Centers
o 1000's of Individuals
o 1000 bases per second, 24 hours per day, 7 days per week

After "Complete" Human Genome Sequence: What next?
[Figure: Human Genome Project phases: 1990 to 2000; 1998 to 2003; 2003 to ?. Human genome sequence: Nature 431:931-945, 2004]
Comparative Genomics: now with complete genomic sequences
Using the Experiments of Evolution to Decode the Human Genome
[Figure: Compare. E Green 2005]

ENCODE Project: ENCyclopedia Of DNA Elements
Major role for comparative sequence analysis will be the identification of functionally important noncoding sequences

ENCODE Web Sites
o genome.gov/ENCODE
o genome.ucsc.edu/ENCODE

Eric Green's Genomic Sequencing Challenges: 2005 List
o Defining "Saturation Points" in Terms of Information Gained by Comparative Sequence Analyses
o The "$1000 Genome"
o Medical Sequencing (aka Human Re-Sequencing)

ENCODE Results: June 2007
ARTICLES: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project

#28 Promoter Prediction
BCB 444/544
BCB 444/544 F07 ISU Dobbs #28 - Promoter Prediction 10/29/07

Assignments
Mon Oct 29: HW5 will be posted today
HW5: Hands-on exercises with phylogenetics and tree-building software
Due: Mon Nov 5 (not Fri Nov 1 as previously posted)

544 Extra#2
Due: PART 1 ASAP; PART 2 meeting prior to 5 PM Fri Nov 2
Part 1: Brief outline of Project, email to Drena & Michael; after response/approval, then:
Part 2: More detailed outline of project. Read a few papers and summarize status of problem. Schedule meeting with Drena & Michael to
discuss ideas.

Required Reading
o Mon Oct 29: Lecture 28, Promoter & Regulatory Element Prediction (Chp 9, pp 113-126)
o Lecture 29, Phylogenetics Basics (Chp 10, pp 127-141)
o Lab 9, Gene & Regulatory Element Prediction
o Fri: Lecture 29, Phylogenetic Tree Construction Methods & Programs (Chp 11, pp 142-169)

Projects
Last week of classes will be devoted to Projects
o Written reports due: Mon Dec 3 (no class that day)
o Oral presentations (20-30') will be: Wed-Fri Dec 5, 6, 7
o 1 or 2 teams will present during each class period
See Guidelines for Projects posted online

Seminars
List of URLs for Seminars related to Bioinformatics
o Nov 1 Thurs, BBMB Seminar, 4:10 in 1414 MBB: Todd Yeates, UCLA, TBA (something cool about structure and evolution?)
o Nov 2 Fri, BCB Faculty Seminar, 2:10 in 102 ScI: Bob Jernigan, BBMB, ISU, Control of Protein Motions by Structure

Computational Gene Prediction: Approaches
Ab initio methods
o Search by signal: find DNA sequences involved in gene expression
o Search by content: test statistical properties distinguishing coding from non-coding DNA
Similarity-based methods
o Database search: exploit similarity to proteins, ESTs, cDNAs
o Comparative genomics: exploit aligned genomes. Do other organisms have similar sequence?
Hybrid methods: best

Signals Search
Approach: Build models (PSSMs, profiles, HMMs, ...) and search against DNA. Detected instances provide evidence for genes.
[Figure: gene signals, including transcription start, start codon, donor site.]

Computational Gene Prediction: Algorithms
1. Neural Networks (NNs) (more on these later), e.g., GRAIL
2. Linear discriminant analysis (LDA) (see text), e.g., FGENES, MZEF
3. Markov
Models (MMs) & Hidden Markov Models (HMMs), e.g.:
o GeneSeqer uses MMs
o GENSCAN uses 5th-order HMMs (see text)
o HMMgene uses conditional maximum likelihood (see text)
[Figure, continued: stop codon, 3' UTR.]

Content Search
Observation: Encoding a protein affects statistical properties of DNA sequence:
o Nucleotide/amino acid distribution
o GC content (CpG islands, exon/intron)
o Uneven usage of synonymous codons (codon bias)
o Hexamer frequency: most discriminative of these for identifying coding potential
Method: Evaluate these differences (coding statistics) to differentiate between coding and non-coding regions

Human Codon Usage
[Table shown on slide.]

Predicting Genes based on Codon Usage Differences
Algorithm:
o Process sliding window
o Use codon frequencies to compute probability of coding versus non-coding
o Plot log-likelihood ratio: $\log \frac{P(S \mid \text{coding})}{P(S \mid \text{non-coding})}$
[Figure: Coding Profile of β-globin gene, with exons marked.]

Similarity-Based Methods: Database Search
In different genomes: Translate DNA into all 6 reading frames and search against proteins (TBLASTX, BLASTX, etc.)
Within same genome: Search with EST/cDNA database (EST2genome, BLAT, etc.)
Problems:
o Will not find "new" or RNA genes (non-coding genes)
o Limits of similarity are hard to define
o Small exons might be overlooked

Similarity-Based Methods: Comparative Genomics
Idea: Functional regions are more conserved than non-functional ones; high similarity in alignment indicates gene
[Figure: human vs mouse sequence alignment.]
Advantages:
o May find uncharacterized or RNA genes
Problems:
o Finding suitable evolutionary distance
o Finding limits of high similarity (functional regions)

Human-Mouse Homology
Comparison of 1196 orthologous genes
Sequence identity between genes in human vs mouse:
o Exons: 84.6%
o Protein: 85.4%
o Introns: 35%
o 5' UTRs: 67%
o 3' UTRs: 69%

GeneSeqer (Brendel et al., ISU)
Thanks to Volker Brendel, ISU, for the following Figs & Slides
Slightly modified from Genome Informatics Module (www.bioinformatics.iastate.edu, 2005 course description, module B)
V Brendel, vbrendel@iastate.edu

Spliced Alignment Algorithm
o Perform pairwise alignment with large gaps in one sequence (due to introns): align genomic DNA with cDNA, ESTs, protein sequences
o Score semi-conserved sequences at splice junctions, using Bayesian probability model & 1st-order MM
o Score coding constraints in translated exons, using Bayesian model
[Figure: intron bounded by GT (donor) and AG (acceptor) splice sites. Brendel 2005]

Splice Site Detection / Information Content vs Position
Information content:
$$I_i = 2 + \sum_{B \in \{U, C, A, G\}} f_{iB} \log_2 f_{iB}$$
Extent of window: $I_i \ge \bar{I} + 1.96\,\sigma_{\bar{I}}$
o i: i-th position in sequence
o $\bar{I}$: avg information content over all positions > 20 nt from splice site
o $\sigma_{\bar{I}}$: avg sample standard deviation of $\bar{I}$
[Figure: information content plotted against position relative to the splice site. Brendel 2005]

Markov Model for Spliced Alignment
[Figure. Brendel 2005]

Evaluation of Splice Site Prediction
o TP: positive instance correctly predicted as positive
o FP: negative instance incorrectly predicted as positive
o TN: negative instance correctly predicted as negative
o FN: positive instance incorrectly predicted as negative
Fig
5.11, Baxevanis & Ouellette 2005 (Brendel 2005)

Evaluation of Predictions

                   Actual True   Actual False
Predicted True     TP            FP
Predicted False    FN            TN

o Sensitivity = TP / (TP + FN) (also called Coverage or Recall)
o Specificity = TP / (TP + FP)

Evaluation of Predictions, in English
o Sensitivity is the fraction of all positive instances having a true positive prediction.
o IMPORTANT: Sensitivity alone does not tell us much; perfect sensitivity can be achieved trivially by labeling all cases positive.
o Specificity is the fraction of all predicted positives that are in fact true positives.
o IMPORTANT: in medical jargon, Specificity is sometimes defined differently; what we define here as Specificity is sometimes referred to as "Positive predictive value".

Best Measures for Comparison
ROC curves: Receiver Operating Characteristic (http://en.wikipedia.org/wiki/Roc_curve)
In signal detection theory, a receiver operating characteristic (ROC), or ROC curve, is a plot of sensitivity vs (1 - specificity) for a binary classifier system as its discrimination threshold is varied. (In this plot, specificity is the medical definition, TN / (TN + FP), not the positive predictive value used above.)
The ROC can also be represented equivalently by plotting the fraction of true positives (TPR = true positive rate) vs the fraction of false positives (FPR = false positive rate).

Correlation Coefficient
Matthews correlation coefficient (MCC):
$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
MCC = 1 for a perfect prediction
MCC = 0 for a completely random assignment
MCC = -1 for a "perfectly incorrect" prediction

GeneSeqer Input
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
[Screenshot. Brendel 2005]
STEP 1: Select splice site model
STEP 2: Input genomic DNA sequence
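Returning to the evaluation measures defined in this lecture (sensitivity, specificity in the sense of positive predictive value, and MCC), the following is a minimal sketch of how they can be computed from the four confusion-matrix counts. The function name `evaluate` and the choice to return 0.0 when a denominator is zero are assumptions made for illustration; they are not part of the lecture or of GeneSeqer.

```python
from math import sqrt


def evaluate(tp, fp, tn, fn):
    """Return (sensitivity, specificity, mcc) from confusion-matrix counts.

    Specificity follows the definition used in these notes,
    TP / (TP + FP), i.e. positive predictive value.
    Returning 0.0 for an undefined ratio is an illustrative convention.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Labeling every case positive gives perfect sensitivity,
# but specificity and MCC expose the problem.
print(evaluate(tp=50, fp=50, tn=0, fn=0))
```

Comparing `evaluate(50, 50, 0, 0)` with `evaluate(50, 0, 50, 0)` shows why sensitivity alone is not a useful measure: both have sensitivity 1.0, but only the second has a high specificity and MCC.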
GeneSeqer Output
[Screenshot. Brendel 2005]

GeneSeqer: Gene Evidence Summary
[Screenshot. Brendel 2005]

Gene Prediction: Problems & Status
Common errors?
o False positive intergenic region: 2 annotated genes actually correspond to a single gene
o False negative intergenic region: One annotated gene structure actually contains 2 genes
o False negative gene prediction: Missing gene (no annotation)
o Other: Partially incorrect gene annotation; Missing annotation of alternative transcripts
Current status?
o For ab initio prediction in eukaryotes: HMMs have better overall performance for detecting intron/exon boundaries
  Limitation? Training data: predictions are organism-specific
o Combined ab initio/homology-based predictions: Improved accuracy
  Limitation? Availability of identifiable sequence homologs in databases

Recommended Gene Prediction Software
Ab initio
o GENSCAN: http://genes.mit.edu/GENSCAN.html
o GeneMark.hmm: http://exon.gatech.edu/GeneMark
o others: GRAIL, FGENES, MZEF, HMMgene
Similarity-based
o BLAST, GenomeScan, EST2Genome, Twinscan
Combined
o GeneSeqer, ROSETTA
Consensus: because results depend on organisms & specific task, always use more than one program!
o Two servers that report consensus predictions; one is GeneComber

Other Gene Prediction Resources: GaTech, MIT, Stanford, etc.
Lists of Gene Prediction Software
[Screenshot of a Stanford genefinding class page.]

Other Gene Prediction Resources: at ISU
o ISU Molecular
Evolution Software
o DIVERSE
o ISU BLAST output parser
o ISU Sequence Analysis Server
o ISU Protein Secondary Structure Prediction server

Eukaryotes vs Prokaryotes: Genomes
Eukaryotic genomes
o Are packaged in chromatin & sequestered in a nucleus
o Are larger and have multiple linear chromosomes
o Contain mostly non-protein-coding DNA (98-99%)
Prokaryotic genomes
o DNA is associated with a nucleoid, but no nucleus
o Much smaller, usually a single circular chromosome
o Contain mostly protein-encoding DNA

Eukaryotes vs Prokaryotes: Gene Structure
[Figure: primary RNA transcript and mature mRNA.]

Eukaryotes vs Prokaryotes: Genes
Eukaryotic genes
o Are larger and more complex than in prokaryotes
o Contain introns that are "spliced" out to generate mature mRNAs*
o Often undergo alternative splicing, giving rise to multiple RNAs
o Are transcribed by 3 different RNA polymerases (instead of 1, as in prokaryotes)
* In biology, statements such as this include an implicit "usually" or "often"

Eukaryotes vs Prokaryotes: Levels of Gene Regulation
Primary level of control?
o Prokaryotes: Transcription initiation
o Eukaryotes: Transcription is also very important, but expression is regulated at multiple levels, many of which are post-transcriptional:
  - RNA processing, transport, stability
  - Translation initiation
  - Protein processing, transport, stability
  - Post-translational modification (PTM)
  - Subcellular localization
Recent important discoveries: small regulatory RNAs (miRNA, siRNA) are abundant and play very important roles in controlling gene expression in eukaryotes, often at post-transcriptional levels

Eukaryotes vs Prokaryotes: Regulatory Elements
Prokaryotes
o Promoters & operators (for operons): cis-acting DNA signals
o Activators & repressors: trans-acting proteins (we won't discuss these)
Eukaryotes
o Promoters & enhancers (for single genes): cis-acting
o Transcription factors: trans-acting
Important difference? What the RNA polymerase actually binds

Prokaryotic Promoters
o RNA polymerase complex recognizes promoter sequences located very close to and on the 5' side ("upstream") of the transcription initiation site
o Prokaryotic RNA polymerase complex binds directly to the promoter, by virtue of its sigma subunit; no requirement for "transcription factors" binding first
o Prokaryotic promoter sequences are highly conserved:
  - -10 region
  - -35 region

Eukaryotic Promoters
o Eukaryotic RNA polymerase complexes do NOT bind directly to promoter sequences
o Transcription factors must bind first and serve as "landmarks" recognized by RNA polymerase complexes
o Eukaryotic promoter sequences are less highly conserved, but many promoters (for RNA polymerase II) contain:
  - -30 region: "TATA" box
  - -100 region: "CCAAT" box

Eukaryotic genes are transcribed by 3 different RNA polymerases
Location of promoter regions, TFBSs & TFs differ, too
[Figure: promoter layouts for the three RNA polymerases.]

Eukaryotic Promoters vs Enhancers
Both promoters & enhancers are binding sites for transcription factors (TFs)
o Promoters: essential for initiation of transcription; located "relatively" close to start site (usually < 200 bp upstream, but can be located within the gene rather than upstream!)
o Enhancers: needed for regulated transcription (differential expression in specific cell types, developmental stages, in response to environment, etc.); can be very
far from start site (sometimes > 100 kb)
[Figure: Brown Fig 9.1a]

Prokaryotic Genes & Operons
o Genes with related functions are often clustered within operons (e.g., lac operon)
o Operons: genes with related functions that are transcribed and regulated as a single unit; one promoter controls expression of several proteins
o mRNAs produced from operons are "polycistronic": a single mRNA encodes several proteins; i.e., there are multiple ORFs, each with its own AUG (START) & STOP codons, linked within one mRNA molecule

Promoter of lac operon in E. coli (transcribed by prokaryotic RNA polymerase)
[Figure: lac operon promoter region between lacI and the start of lacZ, with -35 and -10 positions marked; sequence shown includes AGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGGAAT. Brown Fig 9.17]

Eukaryotic genes
o Genes with related functions are occasionally, but not often, clustered; instead, they share common regulatory regions (promoters, enhancers, etc.)
o Chromatin structure must also be "active" for transcription to occur

Eukaryotic Promoters
o DNA sequences required for initiation, usually < 200 bp from start site
o Eukaryotic RNA polymerases bind by recognizing a complex of TFs bound at promoter
o First, TFs must bind short motifs (TFBSs) within promoters; then RNA polymerase can bind and initiate transcription of RNA
[Figure: promoter, exons and introns of a gene, and the pre-mRNA; proximal promoter region marked.]

Eukaryotic genes have large & complex regulatory regions
Cis-acting regulatory elements include: promoters, enhancers, silencers
Trans-acting regulatory factors include: transcription factors (TFs),
chromatin remodeling complexes, small RNAs
[Figure: Brown Fig 9.17]

Eukaryotic promoters & enhancer regions often contain many different TFBS motifs
[Figure: map of TFBS motifs in a promoter region. Fig 9.13, Mount 2004]

Simplified View of Promoters in Eukaryotes
[Figure: enhancer, upstream promoter, core promoter. Fig 5.12, Baxevanis & Ouellette 2005]

Eukaryotic Activators vs Repressors
Regions far from the promoter can act as "enhancers" or "repressors" of transcription by serving as binding sites for activator or repressor proteins (TFs)
o Activator proteins (TFs) bind to enhancers & interact with RNAP to stimulate transcription
o Repressors block the action of activators (repressor prevents binding of activator or RNAP)
[Figure: promoter and gene, with enhancer and repressor sites 100 - 50,000 bp away.]

Eukaryotic Transcription Factors (TFs)
o Transcription factors: proteins that interact with the RNA polymerase complex to activate or repress transcription
o TFs often contain both a trans-activating domain and a DNA-binding domain or motif
o TFs recognize and bind specific short DNA sequence motifs called "transcription factor binding sites" (TFBSs)
o Databases for TFs & TFBSs include JASPAR

Zinc Finger Proteins: Transcription Factors
o Common in eukaryotic proteins
o 1% of mammalian genes encode zinc-finger proteins (ZFPs)
o In C. elegans, there are > 500!
o Can be used as highly specific DNA binding

Promoter Prediction: Algorithms & Software
modules
o Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy; one clinical trial will begin soon!
[Figure: Brown Fig 9.12]

Eukaryotes vs Prokaryotes: Promoter Prediction
Promoter prediction is much easier in prokaryotes. Why?
o Highly conserved
o Simpler gene structures
o More sequenced genomes! (for comparative approaches)
Methods?
o Previously, mostly HMM-based
o Now: similarity-based comparative methods, because so many genomes available
Xiong textbook:
1) "Manual method" = rules of Wang et al (see text)
2) BPROM: uses linear discriminant function

Predicting Promoters in Eukaryotes
Closely related to gene prediction!
o Obtain genomic sequence
o Use sequence-similarity based comparison (BLAST, MSA) to find related genes. But: "regulatory" regions are much less well-conserved than coding regions
o Locate ORFs
o Identify Transcription Start Site (TSS), if possible!
o Use Promoter Prediction Programs

Predicting promoters: Steps & Strategies
Identify TSS, if possible?
o One of biggest problems is determining exact TSS! Not very many full-length cDNAs!
o Good starting point (human & vertebrate genes): use FirstEF, found within UCSC Genome Browser, or submit to FirstEF web server
[Figure: Fig 5.10, Baxevanis & Ouellette 2005]

o Analyze motifs, etc. in DNA
sequence (TRANSFAC, JASPAR)

Automated Promoter Prediction Strategies
1) Pattern-driven algorithms (ab initio)
2) Sequence-driven algorithms (homology-based)
3) Combined "evidence-based"
BEST RESULTS? Combined, sequential

1) Pattern-driven Algorithms
o Success depends on availability of collections of annotated transcription factor binding sites (TFBSs)
o Tend to produce very large numbers of false positives (FPs)
Why?
o Binding sites for specific TFs are often variable
o Binding sites are short (typically 6-10 bp)
o Interactions between TFs (& other proteins) influence both affinity & specificity of TF binding
o One binding site often recognized by multiple TFs
o Biology is complex: gene activation is often specific to organism/cell/stage/environmental condition; promoter and enhancer elements must mediate this

Ways to Reduce FPs in ab initio Prediction
Take sequence context/biology into account
o Eukaryotes: clusters of TFBSs are common
o Prokaryotes: knowledge of σ (sigma) factors helps
o Probability of "real" binding site higher if annotated transcription start site (TSS) is nearby
  But: What about enhancers? (no TSS nearby!) & only a small fraction of TSSs have been experimentally determined
Do the wet lab experiments!
  But: Promoter-bashing can be tedious

2) Sequence-driven Algorithms
Assumption: Common functionality can be deduced from sequence conservation (homology)
o Alignments of co-regulated genes should highlight elements involved in regulation
Careful: How to determine co-regulation?
1. Orthologous genes from different species
2. Genes experimentally shown to be co-regulated (using microarrays)
Comparative promoter prediction:
1. Phylogenetic footprinting
2. Expression profiling

Phylogenetic Footprinting
o Based on increasing availability of whole genome DNA sequences from many different species
o Selection of organisms for comparison is important: not too close, not too far; good = human vs mouse
o To reduce FPs, must extract non-coding sequences and then align them; prediction depends on good alignment
  - use MSA algorithms (e.g., CLUSTAL)
  - more sensitive methods: Gibbs sampling, Expectation Maximization (EM) methods
o Examples of programs: ConSite, rVISTA, PromH(W), Bayes aligner, FootPrinter

Expression Profiling
o Based on increasing availability of whole genome mRNA expression data, esp. microarray data
o High-throughput simultaneous monitoring of expression levels of thousands of genes
Assumptions (sometimes valid, sometimes NOT):
1. Co-expression implies co-regulation
2. Co-regulated genes share common regulatory elements
Drawbacks:
1. Signals are short & weak! Requires Gibbs sampling or EM, e.g., MEME, AlignACE, Melina
2. Prediction depends on determining which genes are co-expressed, usually by clustering, which can be error-prone
Examples of programs:
o INCLUSive: combined microarray analysis & motif detection
o PhyloCon: combined phylogenetic footprinting & expression profiling

Problems with Sequence-driven Algorithms
o Need sets of co-regulated genes
o For comparative (phylogenetic) methods:
  - Must choose appropriate species
  - Different genomes evolve at different rates
  - Classical alignment methods have trouble with translocations or inversions that change the order of functional elements
  - If background conservation of entire region is high, comparison is useless
  - Not enough data (but Prokaryotes >>> Eukaryotes)
o Complexity: many regulatory elements are not conserved across species!

Promoters: 200 bp upstream from TSS
[Figure: alignment of human and mouse promoter sequences.]
[Figure, continued: human vs mouse promoter alignment with C/EBP and TATA box elements marked. Fig 5.14, Baxevanis & Ouellette 2005]
[Figure: database matrix entry, with fields: Accession & ID; Brief description; Weight matrix; Number of sites used to build; Other info. Baxevanis & Ouellette 2005]

Annotated Lists of Promoter Databases & Promoter Prediction Software
[Screenshot.]

Check out Optional Review & Try Associated Tutorial:
Wasserman WW & Sandelin A (2004) Applied bioinformatics for identification of regulatory elements. Nat Rev Genet 5:276-287

#25 Clustering Algorithms & Microarrays & Medicine 10/23/06

BCB 444/544 Introduction to Bioinformatics
Lecture 25: Clustering Algorithms & Microarrays & Medicine

Seminars
Mon Oct 23
o Heike Hofmann, Stat: Visualizing Conditional Distributions. Statistics Seminar, 4:10 PM in 319 Snedecor
o Clark Ford, FSHN: Genetic Analysis of QPM Modifiers to Enhance Maize Protein Nutrition. Genetics Seminar, 12:10 PM in 101A Ind Ed II
Thurs Oct 26
o Peter Flynn, Chem, Utah: A New Biophysical Paradigm for Encapsulation Studies. BBMB Seminar, 4:10 PM in 1414 MBB
o Diane Bassham, GDCB: Signal perception, transduction and autophagy in
development and survival under stress. GDCB Seminar, 4:10 PM in 1414 MBB

BCB 444/544 F06 ISU Dobbs #25 - Clustering Algorithms & Microarrays & Medicine 10/23/06

Required Reading
o Machine Learning: Overview & Algorithms
o Mon Oct 23: Chp 7 Applied Research with Microarrays; GEPAS Clustering Tutorial (online); Chp 7.2 Improving Health Care with DNA Microarrays
o Wed Oct 25: Machine Learning: more Algorithms (Michael)
o Thurs Oct 26: Lab: Study Guide/Review & EXAM 2 Lab Practical Exam (30 pts)
o Fri Oct 27: EXAM 2 In-Class Exam (70 pts)
See updated Schedule (Oct 23) posted online

Assignments & Exams
BCB 444 & 544:
o HW4 due at Noon, Mon Oct 23
o Exam 2: Thurs Oct 26, Lab Practical (30); Fri Oct 27, In-Class Exam (70)
BCB 544 Only:
o Teams & Projects: Any questions?
o 544Extra#2 due Mon Nov 6

Chp 6 (Campbell & Heyer; see Companion Website)
6.1 Introduction to Microarrays
o Math Minute 6.1: Why Should You Log-Transform Microarray Data?
o Math Minute 6.2: How Do You Measure Similarity between Expression Patterns?
o Math Minute 6.3: How Do You Cluster Genes?
6.2 Alternative Uses of DNA Microarrays
o Math Minute 6.4: Is It Useful to Compare the Columns of a Gene Expression Matrix?
o Which Predicted Genes Are Real and Which Ones Aren't?
o Can Microarrays Improve Annotations?
o Could a Microarray Validate Annotation of an Entire Genome?

Chp 7 (Campbell & Heyer; see Companion Website)
7.1 Cancer and Genomic Microarrays
o Are There Better Ways To Diagnose Cancer?
o Math Minute 7.1: What Are Signature Genes & How Do You Use Them?
o Can Breast Cancer Be Categorized with Microarrays?
7.2 Improving Health Care With DNA Microarrays
o Why Is the Tuberculosis Vaccine Less Effective Now?
o Can We Choose the Most E...
Machine Learning Algorithms
GEPAS = Gene Expression Pattern Analysis Suite (v3.1), for microarray analysis
o Primarily clustering & classification algorithms
We won't discuss:
o Experimental design
o Preprocessing data
o Normalization & background issues
We will discuss:
o Clustering algorithms (used in lab)
o Specific applications, interpretation & significance

[Figure: GEPAS screenshots, including the tutorial tool to interactively visualize hierarchical trees]

Table MM6.2 Correlation coefficients for each pair of genes from Table MM6.1
[Table: pairwise correlation coefficients for the genes of Table MM6.1]

GEPAS Tutorial on Clustering
Gene Expression Pattern Analysis Suite v3.1: Clustering Tutorial on DNA array data clustering
1 Introduction: class discovery, or unsupervised problems
[Figure: screenshot of the GEPAS clustering tutorial]

Table MM6.4 Summary of hierarchical clustering for 12 genes in Table MM6.2
[Table: for each iteration, the two most similar objects, their correlation, and the new object formed]

Fig MM6.1 Dendrogram of clustered genes from Table MM6.3 and Fig 6.8
[Figure: dendrogram]

Alternative Uses of DNA Microarrays
o Which Predicted Genes Are Real and Which Ones Aren't?
o Can Microarrays Improve Annotations?
o Could a Microarray Validate Annotation of an Entire Genome?

Fig 6.27 Designing microarrays to determine exon boundaries
To create a comprehensive chromosome 22 DNA microarray, two 60-mers for each annotated exon were spotted on the chip.

Fig 6.28 Chromosome 22 microarray
Experiment: placenta biopsy vs control cells (labeled "pool")
o 69 experimental conditions, in duplicate; the 2 dye colors were reversed for labeling the cDNAs (69 fluor-reversed pairs of conditions)
o (b) & (c): 1 ORF composed of 9 exons is shown in detail & labeled by a dye reversal; each pair of spots represents a single exon
o (b) Expanded portion of image containing probes belonging to one gene, placenta (Cy5) vs pool (Cy3)
o (c) Expanded portion of image containing probes belonging to one gene, pool (Cy5) vs placenta (Cy3)

Fig 6.29 a-e Microarray validation of predicted exons on Chromo 22
o (a) X-axis: predicted exons; Y-axis: experimental conditions
o (b) Expanded region for a known gene. Experiments on the Y axis have been clustered to emphasize how co-regulation of transcription in diverse experimental conditions is used to define exons in a gene
o (c) Expanded region showing a set of co-regulated exons from another known gene. One predicted exon (arrow) that was not incorporated into mRNA for most of the 69 experimental conditions is indicated
o (d) Expanded region indicating that 2 different ESTs are a part of a single gene & transcript
o (e) Expanded region showing a gene containing 6 exons of a previously undetected gene that is transcribed only in testes (arrows denote experiments that included testes mRNA)

Fig 6.30 a-d Defining exons of a newly discovered gene (testes-specific gene)
o (a) Gene defined by 60-mers every 10 bp over a 113,000 bp segment of chromosome 22
o (b) 6 exons were identified over a 10 kb region
o (c) 2 exons at higher magnification. Each 60-mer = separate bar. Comparisons of exon predictions using current software technology vs microarray exon detection
o (d) Sequence from 5' end of exon 3 showing consensus splice site consistent with microarray prediction for 5' boundary of exon 3

Fig 6.31 Rosetta group: Whole genome microarray to validate every predicted exon
o 1,090,408 spots of 60-mers; includes control spots
o Ideograms of 18 DNA chromosomes (scale is incorrect)
o The predicted exons (gray bars on each chromosome) & percentages that were verified with these cell lines (red bars)
o Similar graphs for detection of previously confirmed exons on each chromosome

[Figure: cell-type legend: germinal center B; Nl lymph node/tonsil; activated blood B; resting/activated T; transformed cell lines; FL; resting blood B; CLL]

BCB 444/544 Introduction to Bioinformatics
Lecture 34: Proteomics
BCB 444/544 F06 ISU Dobbs #34 - Proteomics 11/13/06

Mon Nov 13
o Drena Dobbs (GDCB, ISU): DNA Binding by Design. IG Faculty Seminar, 12:10 PM in 101 Ind Ed II
Thurs Nov 16
o Baker Center Seminar, 2:10 PM in Howe Hall Auditorium
o Tom Peterson (GDCB, ISU): seminar on transposition of maize Ac/Ds elements
o Hassane Mchaourab (Center for Structural Biology, Vanderbilt): BBMB Seminar, 4:10 PM in 1414 MBB

Required Reading
Mon Nov 13
o Chp 8 Proteomics, Sections 8.1-8.4 (previously assigned)
Wed Nov 15
o Chp 9 Case Study: Why can't we cure more diseases?
Thurs Nov 16
o Lab: not sure
Fri Nov 17
o Chp 10 Genomic Circuits in Single Genes, Sections 10.x
Assignments & Schedule
BCB 444 & 544:
o HW6 due Fri Nov 17
BCB 544 only:
o 544Extra#2 due Mon Nov 13 (today)
See updated Schedule posted online

Chp 8 Proteomics
o Which Proteins Are Needed in Different Conditions?
o MM 8.1 How Do You Know if You Have Sampled Enough Cells?
8.2 Protein 3D structures
o MM 8.2 Is Sup35 a Central Protein...
o How Much of Each Protein Is Present?
o Is It Possible to Understand Proteome-Wide Interactions?
o What Does a Proteome Produce?
o Nice Idea, But Does ICAT Work?

Facilities: Carver Co-Lab, Guru Rao; new faculty in BBMB
Experiments: plant: Rodermel, Voytas; animal: Greenlee, perhaps others
Analysis: Honavar, Dobbs

One Approach: Knockout Genes
[Figure: diagram of the transposon that was used to knock out genes]

Phenotype macroarray analysis
[Figure]

Where are the proteins in the cell?
Use fluorescent antibody to bind to epitope tag
[Figure]

Biological processes for yeast proteins
[Figure]

Two-dimensional gels
[Figure]

Proteins identified on 2D gels (IEF/SDS-PAGE)
Direct protein microsequencing by Edman degradation
o done at facilities here at ISU
o need 5 picomoles
o often get 10 to 20 amino acids sequenced
Protein mass analysis by MALDI-TOF (matrix-assisted laser desorption/ionization time-of-flight spectroscopy)
o done at facilities here at ISU
o often detect posttranslational modifications

Evaluation of 2D gels (IEF/SDS-PAGE)
Advantages:
o Visualize hundreds to thousands of proteins
o Improved identification of protein spots
Disadvantages:
o Limited number of samples can be processed
o Mostly abundant proteins visualized
o Technically difficult

Mass Spectrometry to identify proteins
[Figure]

Some mass spec data
[Figure]

SWISS-2DPAGE Map Selection
Search in SWISS-2DPAGE for "retinal"
[Figure: SWISS-2DPAGE screenshots]

Affinity chromatography/mass spec
o GST-Bait protein
o Add yeast extract
o Protein complexes bind; most proteins do not bind
o Elute, run SDS-PAGE
o Identify complexes
[Figure: Fig 8.21]

Affinity chromatography/mass spec: False negatives
o Bait must be properly localized and in its native condition
o Affinity tag may interfere with function
o Transient protein interactions may be missed
o Highly specific physiological conditions may be required
o Bias against hydrophobic and small proteins

Affinity chromatography/mass spec: False positives
o Sticky proteins

Evaluation of affinity chromatography/mass spec
Advantages:
o Thousands of protein complexes identified
o Functions can be assigned to proteins
Disadvantages:
o False negative results
o False positive results

Structural Genomics
Goal: get structural information for every protein
Structural information can be an experimentally determined structure or a predicted structure

What can structural genomics do?
o Automate most of protein structure determination
o Provide a large number of protein structure models
o The goal is to cover the protein structure space
o It has
been estimated that about 16,000 well-chosen structures should cover protein structure space

What can't structural genomics do?
o Determine protein structures to very high resolution
o Produce detailed studies of protein function
o Solve structures of proteins complexed with other molecules
o Solve the structures of membrane proteins or other difficult-to-work-with proteins

Structural Genomics Projects
There are many different structural genomics projects and each has a different goal:
o Solve the structure of one member of each protein sequence family
o Solve the structure of all proteins from a particular organism
o Solve the structure of all proteins involved in a particular pathway

Structures Released by 56 Centers
[Figure: bar chart, Structural Genomics structures released by year]

Comparison of Unique Structures with Number of Structures Released By 56 Centers
[Figure: chart]

BCB 444/544 F06 ISU Terribilini #34 - Proteomics 11/13/06

#8 - Finish DP, Scoring Matrices, Stats & BLAST
BCB 444/544 F07 ISU Dobbs #8 - Finish DP, Scoring Matrices, Stats & BLAST 9/7/07

Announcements
o Tues Sept 4: Lab #2 Exercise Writeup due by 5 PM. Send via email to Pete Zaback, petez@iastate.edu (for now, no late penalty; just send ASAP)
o Wed Sept 5: Notes for Lecture 5 posted online; HW#2 posted online & sent via email & handed out in class
o Fri Sept 14: HW#2 due by 5 PM
o Fri Sept 21: Exam 1

Methods
Global and Local Alignment (done)
Alignment Algorithms (done)
o Dot Matrix Method (done)
o Dynamic Programming Method (cont.)
  o Gap penalties
  o DP for Global Alignment
  o DP for Local Alignment
Scoring Matrices
o Amino acid scoring matrices
o PAM
o BLOSUM
o Comparisons between PAM & BLOSUM
Statistical Significance of Sequence Alignment

Required Reading
For Lectures 4-7: Pairwise Sequence Alignment, Dynamic Programming, Global vs Local Alignment, Scoring Matrices, Statistics
o Xiong: Chp 3
o Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909
Wed Sept 5, for Lecture 7 & Lab 3: Database Similarity Searching: BLAST (nope, more DP)
o Chp 4, pp 51-62
Fri Sept 7, for Lecture 8 (will finish on Monday): BLAST variations; BLAST vs FASTA
o Chp 4, pp 51-62

Adapted from Brown and Caragea, 2007, with some slides from Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page

Dynamic Programming: 4 Steps
1. Define score of optimal alignment, using recursion
2. Initialize and fill in a DP matrix for storing optimal scores of subproblems, by solving smallest subproblems first (bottom-up approach)
3. Calculate score of optimal alignment(s)
4. Trace back through matrix to recover optimal alignment(s) that generated optimal score

1. Define Score of Optimal Alignment using Recursion
Define:
o $x_{1..i}$ = prefix of length $i$ of $x$
o $y_{1..j}$ = prefix of length $j$ of $y$
o $S(i,j)$ = score of optimal alignment of $x_{1..i}$ and $y_{1..j}$
Initial conditions:
$$S(i,0) = -i\,\gamma \qquad S(0,j) = -j\,\gamma$$
Recursive definition, for $1 \le i \le N$, $1 \le j \le M$:
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$
Here $\sigma(x_i, y_j)$ is the match/mismatch score for aligning $x_i$ with $y_j$, and $\gamma$ is the penalty for a space.

2. Initialize & Fill in DP Matrix for Storing Optimal Scores of Subproblems
o Construct sequence vs sequence matrix
o Fill in from [0,0] to [N,M] (row by row), calculating best possible score for each alignment ending at residues at [i,j]

How do we calculate S(i,j), i.e., the score for alignment of $x_{1..i}$ to $y_{1..j}$?
1 of 3 cases gives the optimal score for this subproblem:
o Case 1: Line up $x_i$ with $y_j$ ($x_i$ aligns to $y_j$): $S(i-1,j-1) + \sigma(x_i, y_j)$
o Case 2: Line up $x_i$ with space ($x_i$ aligns to a gap): $S(i-1,j) - \gamma$
o Case 3: Line up $y_j$ with space ($y_j$ aligns to a gap): $S(i,j-1) - \gamma$

Specific Example
x: C T C G C A ...
y: C A T T C A ...
Scoring: +10 for match, -2 for mismatch, -5 for space

Fill in the DP matrix
[Figure: empty matrix with x = CTCGCAGC across the top and y = CATTCAC down the side; the first row and column are initialized with 0, -5, -10, ... and the remaining cells are filled with the recursion above]

3. Calculate Score S(N,M) of Optimal Alignment for Global Alignment

         λ    C    T    C    G    C    A    G    C
    λ    0   -5  -10  -15  -20  -25  -30  -35  -40
    C   -5   10    5    0   -5  -10  -15  -20  -25
    A  -10    5    8    3   -2   -7    0   -5  -10
    T  -15    0   15   10    5    0   -5   -2   -7
    T  -20   -5   10   13    8    3   -2   -7   -4
    C  -25  -10    5   20   15   18   13    8    3
    A  -30  -15    0   15   18   13   28   23   18
    C  -35  -20   -5   10   13   28   23   26   33

+10 for match, -2 for mismatch, -5 for space. The optimal global score is the lower right cell, S(N,M) = 33.

4. Trace back through matrix to recover optimal alignment(s) that generated the optimal score
How? "Repeat" alignment calculations in reverse order, starting from the position with the highest score and following the path, position by position, back through the matrix
Result? Optimal alignment(s) of sequences
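The four steps above map onto a short program. What follows is a minimal Python sketch, not code from the lecture: the function name `global_alignment` and its parameter names are hypothetical, the gap cost is the linear penalty γ from the recursion, and the default scores (+10 match, -2 mismatch, γ = 5) are the ones used in the worked example. Ties in the traceback are broken in a fixed order (diagonal first), which is an arbitrary choice, so only one of the possibly several optimal alignments is returned.

```python
def global_alignment(x, y, match=10, mismatch=-2, gamma=5):
    """Global alignment by dynamic programming (linear gap penalty gamma).

    Returns (optimal score, aligned x, aligned y) for ONE optimal alignment.
    """
    n, m = len(x), len(y)

    def sigma(a, b):
        # Match/mismatch score for aligning character a with character b.
        return match if a == b else mismatch

    # Steps 1-2: S[i][j] = optimal score for aligning x[:i] with y[:j].
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * gamma          # x[:i] aligned entirely to spaces
    for j in range(1, m + 1):
        S[0][j] = -j * gamma          # y[:j] aligned entirely to spaces
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + sigma(x[i - 1], y[j - 1]),  # case 1
                S[i - 1][j] - gamma,                          # case 2
                S[i][j - 1] - gamma,                          # case 3
            )

    # Step 3: the optimal global score sits in the last cell.
    score = S[n][m]

    # Step 4: trace back from [N, M] to [0, 0], rebuilding the alignment
    # from its end; each move contributes one alignment column.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sigma(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] - gamma:
            ax.append(x[i - 1]); ay.append("-")   # x_i lined up with a space
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])   # y_j lined up with a space
            j -= 1
    return score, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    score, top, bottom = global_alignment("CTCGCAGC", "CATTCAC")
    print(score)
    print(top)
    print(bottom)
```

Recovering every optimal alignment, rather than one, would mean following all tied moves at each cell instead of taking the first that matches.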
Traceback for Global Alignment
o Start in lower right corner & trace back to upper left
o Each arrow introduces one character at end of alignment:
  o A horizontal move puts a gap in the left sequence
  o A vertical move puts a gap in the top sequence
  o A diagonal move uses one character from each sequence
o May have >1 optimal alignment; this example has 2

Traceback to Recover Alignment
[Figure: the DP matrix for CTCGCAGC (top) vs CATTCAC (left), with traceback arrows added step by step from the lower right cell (33) back to the upper left cell (0); +10 for match, -2 for mismatch, -5 for space]
Where did the red arrows come from?
Great, but what are the alignments?

What are the 2 Global Alignments with Optimal Score = 33?
[Figure: the 2 optimal global alignments, labeled 1 and 2]
Check the traceback. Check the scores: +10 for match, -2 for mismatch, -5 for space

Local Alignment: Motivation
To "ignore" stretches of non-coding DNA:
o Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions
o Local alignment between two protein-encoding sequences is likely to be between two exons
To locate protein domains or motifs:
o Proteins with similar structures and/or similar functions but from different species (for example) often exhibit local sequence similarities
o Local sequence similarities may indicate "functional modules"

Local Alignment: Example
G G T C T G A G
A A A C G A
Match: 2; Mismatch or space: -1
[Figure: best local alignment, Score = 5]

Local Alignment: Algorithm (DP Initialization & Recursion)
1. Initialize top row & leftmost column of matrix with 0
2. Fill in the matrix with the recursion; no negative scores, fill in 0 instead
3. Optimal score is in the highest scoring cell(s)
4. Optimal alignment(s): trace back from each cell containing the optimal score until a cell with "0" is reached (not just from the lower right corner)

Filling in DP Matrix for Local Alignment
No negative scores: fill in 0

         λ  C  T  C  G  C  A  G  C
    λ    0  0  0  0  0  0  0  0  0
    C    0  1  0  1  0  1  0  0  1
    A    0  0  0  0  0  0  2  0  0
    T    0  0  1  0  0  0  0  1  0
    T    0  0  1  0  0  0  0  0  0
    C    0  1  0  2  0  1  0  0  1
    A    0  0  0  0  1  0  2  0  0
    C    0  1  0  1  0  2  0  1  1

+1 for match, -1 for mismatch, -5 for space

Traceback for Local Alignment
[Figure: the same matrix with traceback arrows drawn from each cell that contains the optimal score, 2]

What are the 4 Local Alignments with Optimal Score = 2?
[Figure: the 4 optimal local alignments]
Check the scores: +1 for match, -1 for mismatch, -5 for space

Some Results re: Alignment Algorithms (for ComS, CprE & Math types)
o Most pairwise sequence alignment problems can be solved in O(mn) time
o Space requirement can be reduced to O(m+n), while keeping runtime fixed [Myers88]
o Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86]

Affine Gap Penalty Functions
Affine Gap Penalties = Differential Gap Penalties: used to reflect cost differences between opening a gap and extending an existing gap
Total Gap Penalty is a linear function of gap length:
$$W = \gamma + \delta \times (k - 1)$$
where $\gamma$ = gap opening penalty, $\delta$ = gap extension penalty, $k$ = length of gap
Sometimes a Constant Gap Penalty is used, but it is usually less realistic than the Affine Gap Penalty
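The local alignment algorithm described above differs from the global one in three places: the borders are initialized with 0, negative cell values are replaced by 0, and the traceback starts at every highest-scoring cell and stops at the first 0. A minimal Python sketch of those three changes follows. The function names `local_alignment_matrix` and `local_traceback` are hypothetical, a simple linear space penalty γ is assumed rather than the affine penalty, and the defaults (+1 match, -1 mismatch, γ = 5) are taken from the worked example.

```python
def local_alignment_matrix(x, y, match=1, mismatch=-1, gamma=5):
    """Fill the local-alignment DP matrix.

    Returns (S, best, cells): the matrix, the optimal local score, and the
    list of (i, j) cells that hold it. S[i][j] refers to x[:i] and y[:j].
    """
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]   # borders stay at 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(
                0,                          # no negative scores
                S[i - 1][j - 1] + sigma,
                S[i - 1][j] - gamma,
                S[i][j - 1] - gamma,
            )
            best = max(best, S[i][j])
    cells = [(i, j) for i in range(1, n + 1) for j in range(1, m + 1)
             if best > 0 and S[i][j] == best]
    return S, best, cells


def local_traceback(x, y, S, i, j, match=1, mismatch=-1, gamma=5):
    """Trace back from cell (i, j) until a cell containing 0 is reached."""
    ax, ay = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        sigma = match if x[i - 1] == y[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sigma:
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] - gamma:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    x, y = "CTCGCAGC", "CATTCAC"
    S, best, cells = local_alignment_matrix(x, y)
    print("optimal local score:", best)
    for (i, j) in cells:
        print(local_traceback(x, y, S, i, j))
```

Supporting the affine penalty W = γ + δ × (k - 1) would need extra bookkeeping per cell (whether a gap is already open), which this sketch deliberately leaves out.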
"Scoring" or "Substitution" Matrices
2 major types for amino acids: PAM & BLOSUM
o PAM = Point Accepted Mutation: relies on an "evolutionary model" based on observed differences in alignments of closely related proteins
o BLOSUM = BLOck SUbstitution Matrix: based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins

PAM Matrix
PAM = Point Accepted Mutation
o Relies on an "evolutionary model" based on observed differences in closely related proteins
o Model includes a defined rate for each type of sequence change
o Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
o PAM1: for less divergent sequences (shorter time)
o PAM250: for more divergent sequences (longer time)

BLOSUM Matrix
BLOSUM = BLOck SUbstitution Matrix
o Based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins
o Doesn't rely on a specific evolutionary model
o Suffix number (n) reflects expected similarity: average % aa identity in the MSA from which the matrix was generated
o BLOSUM45: for more divergent sequences
o BLOSUM62: for less divergent sequences

Which is Better, PAM or BLOSUM?
PAM matrices:
o derived from evolutionary model
o often used in reconstructing phylogenetic trees, but not very good for highly divergent sequences
BLOSUM matrices:
o based on direct observations
o more realistic, and outperform PAM matrices in terms of accuracy in local alignment

PAM250 vs BLOSUM62 (see text)
[Figure: BLOSUM62 matrix]
o Usually only half of the matrix is displayed (it is symmetric)
o Here s(a,b) corresponds to the score of aligning character a with character b

Which Type of Matrix Should You Use?
o Several other types of matrices are available: Gonnet & Jones-Taylor-Thornton, very robust in tree construction
o "Best" matrix depends on task: different matrices for different applications
o ADVICE: if unsure, try several different matrices & choose the one that gives the best alignment result

Sequence Alignment Statistics
o Distribution of similarity scores in sequence alignment is not a simple "normal" distribution
o "Gumbel extreme value distribution": a highly skewed normal distribution with a long tail
[Figure: extreme value distribution]

How Assess Statistical Significance of an Alignment?
Compare the score of an alignment with the distribution of scores of alignments for many "randomized" (shuffled) versions of the original sequence
o If the score is in the extreme margin, then it is unlikely to be due to random chance
o P-value = probability that the original alignment is due to random chance (lower P is better)
o P = 10^-5 to 10^-50: sequences have clear homology
o P > 10^-1: no better than random
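The shuffling test described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure any particular program uses: the function names `local_score` and `shuffle_p_value` are hypothetical, the alignment score is a plain local-alignment score with +1/-1/-5 scoring, only the second sequence is shuffled, and the add-one correction in the estimate (so that the P-value is never exactly 0) is a choice made here, not something stated in the lecture.

```python
import random


def local_score(x, y, match=1, mismatch=-1, gamma=5):
    """Best local alignment score of x and y (score only, two rows of memory)."""
    prev = [0] * (len(y) + 1)
    best = 0
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + sigma, prev[j] - gamma, curr[j - 1] - gamma)
            best = max(best, curr[j])
        prev = curr
    return best


def shuffle_p_value(x, y, n_shuffles=200, seed=0, score=local_score):
    """Empirical P-value of the x-vs-y score against shuffled versions of y.

    Returns (observed score, estimated P-value).
    """
    rng = random.Random(seed)           # fixed seed keeps the sketch reproducible
    observed = score(x, y)
    letters = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)            # same composition, randomized order
        if score(x, "".join(letters)) >= observed:
            hits += 1
    # Assumption: add-one correction, so the estimate is never exactly 0.
    return observed, (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    seq = "ACGTTGCAAGCTTAGGCTAACGT"
    print(shuffle_p_value(seq, seq, n_shuffles=100, seed=1))
```

With only a few hundred shuffles the smallest P-value this estimate can report is about 1/n_shuffles, so values such as 10^-50 cannot be reached by brute-force shuffling; they require fitting the score distribution instead.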
Today's Lab: focus on BLAST
Basic Local Alignment Search Tool
STEPS:
1. Create list of every possible "word" (e.g., 3-11 letters) from query sequence
2. Search database to identify sequences that contain matching words
3. Score match of word with sequence, using a substitution matrix
4. Extend match in both directions, while calculating the alignment score
5. Continue extension until score drops below a threshold (due to mismatches)
High Scoring Segment Pair (HSP) = contiguous aligned segment pair (no gaps)

Exhaustive vs Heuristic Methods
Exhaustive: tests every possible solution
o guaranteed to give best answer (identifies optimal solution)
o can be very time/space intensive
o e.g., Dynamic Programming (as in the Smith-Waterman algorithm)
Heuristic: does NOT test every possibility
o no guarantee that answer is best (but often can identify optimal solution)
o sacrifices accuracy (potentially) for speed
o uses "rules of thumb" or "shortcuts"
o e.g., BLAST & FASTA

Lab 3: focus on BLAST (Basic Local Alignment Search Tool)
o Developed by Stephen Altschul at NCBI in 1990
o Word length: typically 3 aa for protein sequence, 11 nt for DNA sequence
o Substitution matrix: default is BLOSUM62; can change under Algorithm Parameters; choose other BLOSUM or PAM matrices
o Stop-Extension Threshold: typically 22 for proteins, 20 for DNA

BLAST: a few details
BLAST Results:
o Original version of BLAST: list of HSPs = Maximum Scoring Pairs
o More recent, improved version of BLAST: allows gaps (Gapped Alignment)
o How? Allows score to drop below threshold (but only temporarily)

BLAST: Statistical Significance
1. E-value: E = m × n × P
o m = total number of residues in database
o n = number of residues in query sequence
o P = probability that an HSP is the result of random chance
o lower E-value: less likely to result from random chance, thus higher significance
2. Bit Score S': normalized score, to account for differences in sequence length & size of database
3. Low Complexity Masking: remove repeats that confound scoring

#32 - SVMs & NNs & Protein 2° Structure Prediction
BCB 444/544 Introduction to Bioinformatics
Lecture 32: NNs & SVMs; Secondary Structure Prediction
BCB 444/544 F06 ISU Terribilini #32 - SVMs & NNs & Protein 2° Structure Prediction 11/6/06

Mon Nov 6
o Sue Lamont (An Sci, ISU): Integrated genomic approaches to enhance host resistance to food-safety pathogens. IG Faculty Seminar, 12:10 PM in 101 Ind Ed II
Thurs Nov 9
o Sean Rice (Biol Sci, Texas Tech): Constructing an exact and universal evolutionary theory. Applied Math/EEOB Seminar, 3:45 in 210 Bessey
Fri Nov 10
o Surya Mallapragada (Chem & Biol Eng, ISU): Micropatterned polymer substrates for peripheral nerve regeneration and control of neural stem cell growth and differentiation. BCB Faculty Seminar, 2:10 in Lago W142
Thurs Nov 16
o Hassane Mchaourab (Center for Structural Biology, Vanderbilt): Structural dynamics of multidrug transporters. Baker Center Seminar, 2:10 PM in Howe Hall Auditorium

Required Reading
Mon Nov 6
o Review: Protein Structure Prediction. Ginalski et al (2005) Nucleic Acids Res 33:1874, doi:10.1093/nar/gki327
Wed Nov 8
o Review: SVMs in Bioinformatics (Briefings in Bioinformatics, 2004)
Thurs Nov 9
o Lab 10: Protein Structure Prediction
Fri Nov 10
o Chp 8.1-8.4 Proteomics (previously assigned)

Macromolecular interactions mediated by the Rev protein in lentiviruses (HIV & EIAV)
o Computationally model structures of lentiviral Rev proteins, using a threading algorithm
o Predict critical residues for RNA-binding and protein interaction, using machine learning algorithms
o Test model and predictions, using genetic/biochemical approaches and biophysical approaches
o Initially focus on EIAV Rev & RRE
Comparison of predicted Rev Structures
[Figure: predicted Rev structures, including EIAV and SIV, and a dimer model]

Predicting the RNA binding domain of EIAV Rev
[Figure: EIAV Rev sequence maps with predicted RNA-binding residues]

Summary: Predictions vs Experiments
Computational & wet lab approaches revealed that:
o EIAV Rev has a bipartite RNA binding domain
o Two Arg-rich RBMs are critical: RRDRW in the central region, and KRRRK at the C-terminus, overlapping the NLS
o Based on computational modeling, the RBMs are in close proximity within the 3D structure of the protein
o Lentiviral Revs & RRE binding sites may be more similar in structure than has been appreciated
Future:
o Identify "predictive rules" for protein-RNA recognition

Secondary Structure Prediction
Given a protein sequence $a_1 a_2 \ldots a_N$, secondary structure prediction aims at defining the state of each amino acid $a_i$ as being either H (helix), E (extended = strand), or O (other). Some methods have 4 states: H, E, T for turns, and O for other.
The quality of secondary structure prediction is measured with a "3-state accuracy" score, or Q3. Q3 is the percent of residues that match "reality" (X-ray structure).

Quality of Secondary Structure Prediction
Determine secondary structure positions in known protein structures using DSSP or STRIDE:
1. Kabsch and Sander. Dictionary of Secondary Structure in Proteins: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577-2637 (1983) (DSSP)
2. Frishman and Argos. Knowledge-based secondary structure assignments. Proteins 23: 566-579 (1995) (STRIDE)

Limitations of Q3
    Actual secondary structure:  hhhhhooooeeeeoooeeeooooohhhhh
    Prediction 1:                ohhhooooeeeeoooooeeeooohhhhhh   Q3 = 22/29 = 76% (useful prediction)
    Prediction 2:                hhhhhoooohhhhooohhhooooohhhhh   Q3 = 22/29 = 76% (terrible prediction)
o Q3 for random prediction is 33%
o Secondary structure assignment in real proteins is uncertain to about 10%; therefore, a "perfect" prediction would have Q3 = 90%

Early methods for Secondary Structure Prediction
o Chou and Fasman: Chou and Fasman. Prediction of protein conformation. Biochemistry 13: 211-245 (1974)
o GOR: Garnier, Osguthorpe and Robson. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J Mol Biol 120: 97-120 (1978)

Chou and Fasman
Start by computing amino acid propensities to belong to a given type of secondary structure:
$$\frac{P(i \mid \text{Helix})}{P(i)} \qquad \frac{P(i \mid \text{Beta})}{P(i)} \qquad \frac{P(i \mid \text{Turn})}{P(i)}$$
Propensities > 1 mean that the residue type $i$ is likely to be found in the corresponding secondary structure type.
[Table: amino acid propensities, grouped into residues that favor α-helix, favor β-strand, and favor turn]

Chou and Fasman
Predicting helices:
o find nucleation site: 4 out of 6 contiguous residues with P(α) > 1
o extension: extend helix in both directions until a set of 4 contiguous residues has an average P(α) < 1 (breaker)
o if average P(α) over whole region is > 1, it is predicted to be helical
Predicting strands:
o find nucleation site: 3 out of 5 contiguous residues with P(β) > 1
o extension: extend strand in both directions until a set of 4 contiguous residues has an average P(β) < 1 (breaker)
o if average P(β) over whole region is > 1, it is predicted to be a strand

Chou and Fasman
Position-specific parameters for turn: each position has distinct amino acid preferences
o Example: at position 2, Pro is highly preferred
[Table: position-specific turn frequencies f(i), f(i+1), f(i+2), f(i+3) for each amino acid]

Chou and Fasman
Predicting turns:
o for each tetrapeptide starting at residue $i$, compute:
  o $P_{Turn}$ (average propensity over all 4 residues)
  o $F = f(i) \times f(i+1) \times f(i+2) \times f(i+3)$
o if $P_{Turn} > P_\alpha$ and $P_{Turn} > P_\beta$ and $P_{Turn} > 1$ and $F > 0.000075$, the tetrapeptide is considered a turn
Chou and Fasman prediction: http://fasta.bioch.virginia.edu/fasta_www/chofas.htm

The GOR method
o Position-dependent propensities for helix, sheet or turn are calculated for each amino acid. For each position j in the sequence, eight residues on either side are considered
o A helix propensity table contains information about the propensity of residues at the 17 positions when the conformation of residue j is helical. The helix propensity tables have 20 x 17 entries
o Build similar tables for strands and turns
o GOR simplification: the predicted state of AAj is calculated as the sum of the position-dependent propensities of all
Vesldues ei unu AA cow an be used at hug iiehs cit nih navuavcuvvem Veislan is cow W wiwwmem wane wiwwmxm We is New improved algorithm future eon v 50R IV a39gorifhm Kioczhowshi Ting Jernigan Aearnier Database of 257 sequences New database of 514 nonredundant sequences proposed by Cuff and Barton 39 N0 quotUMPquot We l39g39mems Add tional statistics of triplets Frequencies of singlets and doublefs R izablde wind owl sizeh offth e window is a 39uste to t e engt o t e sequence Fixed wmdow of Size of 17 resudues Optimization of Parameters If m ov or o I n p quotmeters to hemse the acmmcy of pmd n for psheets 39 A er f Predmfmquot 643 quot w39 39 Multiple sequence alignments PSIBLAST full Jack knife procedure FAsTA cLusTAL in an early version We 11 We 11 hiip gur bb iasiaie edu Advanmges of file 60R me iod WWW Physical nonquotblackboxquot modei gives full insight into the relationship between protein m r sequence and its secondary structure s ws that an alternative to artificial inteiiigence methods is possibie mm in Accuracy of rediction close to the best neurai n t rk pre ictions Some applications where soil method is superior transmembrane proteins no memory effect of NN Fulljackknife method is possible Very fast NN require a lot of time for computer learning hawwmexsl We 13 zwwwmem We 1 BCB 444544 Fall 06 Terribilini 32 SVMs amp NNs amp Protein 239 11606 Structure Prediction Neural networks Accuracy Both Chou and Fasman and 602 have The most successful methods for predicting secondary structure are based on neural networks The overall idea is that neural beequot assessed and tram rmy IS networks can be trained to recognize amino acid patterns in ESf39mafEd 390 be Q339603965 known secondary structure units and to use these pa terns to distinguish between the different types of secondary structur inIialy higher scores were rqxarred bu 12 e I39ilrlellis ser 390 measure Q3 WEI Neural networks classify input vectorsquot or examples Into awed as Ihe res cases included 
categories 2 or more quot yes They are loosely based on biological neurons prareirs used Ia derive Ihe prope mm 2 llxnn 25 m wwmtsu m wswnt tsu Biological Neurons Artificial Neuron quotPercep rron T mp Dc gtmes DEMres receive 1115 Am gins output me Me mm mm mm 5th Wmquot m I at WWMMUWWWWMMW my llxnn 27 m totSumnst est may me that am ioyw the mm mm mmquot m e t WmdSIWni WmntmaAJJAEWM mu mm 2 The perceptron Nules rThe thput is a vecturX she the Weights can be sluved th anulhel ve DYW Xi Wt T X2 2 1 s gt T 0 s lt T rlhe perceplvun cumpules the dull pmducl s x w W rlhe uulpul F is a tuhetteh ets it is utteh set etsetete l e tut u h which case the tuhetteh is the step it het h Fur Eunllnuuus uulpul etteh use a slgmul 1 input Threshold UNI Output 1 m The petceplmn eiassthesthe lnpulveclulX lMEI lWEI caleguries FX 1 7X e n itthe Weights and lhreshuld T ale het knDWn th advance lhe peteeptmh must be Named ideally lhe peteeptmh must be Named in tetuth the meet anSWEY eh all aming examples and temples it has quotEVE seeh pettehh Well eh e The Naming set must eehtath bulh type at data l e thh t ahu uulpul llxnn 2v eeewsmnttsu llxnn 3 m wswnt tsu BCB 444544 Fall 06 Terribilini 32 SVMs amp NNs amp Protein 239 11606 Structure Prediction The perceptron Biological Neural Network rannu a renenan u nnu ne rerunsw nu nnnaeslne runuun r nunaerunannuuue x lannu recurs F w x0 output atlne nercenlrun tgtlt39 target value tar gtltl Use steepest descent eunnule urcurem W1 aw rupdatevielght ve m WW WM 7 EVE ltemte 1 m noren quotper Wnucurnau uranmnr hawwmexsl 2 eammgm um 31 truesumo um 31 Arti cial Neural Network Neural networks and Secondary Structure prediction A complete news quotSWark 55 SEW99 quot0quot5 mgrelm from Chou and Fame and 09 rnlerconneczeu sucn lncl I I lne ouwuls 0 some unrls 1 bemmesrhempu father on predicting the conformation of a residue it is N15 WWW 9 95 5 9 39 ortant to consider a window around it possoer o Helices and strands occur in stretch on is important to 
consider multiple sequences Neural nelnarrs srelrsrneu rus like Perceptmn bv nrnrnrzrnu an error function E EENNX 4X Z mm 34 PHD Input nut run IllsM For each residue consrder ml Sci warm rm 5 Window osrze 13 l xZUZEEI values i hawwmexsl IVsDe 35 zwwwmem mm 39 BCB 444544 Fall 06 Terribilini 32 SVMs amp NNs amp Protein 239 11606 Structure Prediction PHD Network1 PHD Network 2 Sequence Structure Structure Structure 5 W ZZZquotW 75 39 3 quotW t7 i xZDvaiues Svaiues 93751 vaiues 3 vaiues 3 vaiues r Pan PM Pam M p 0 PM Pro Jo 5 mm secondiry RED PHD sgqumzersWWH quotmm m eqzh mm and qr q Wmquot 13 rearing art at art i tamiderad m anew mm m mm mm m m m m neurqi quotmm and g amqu i 3 WWW m at WWW pm ban and new Swuzmrer wuzmre quotmm Far each qr m mm m a Wm M n rearing m Wham mum name and Wm m k m r 78198 m m m in mm my Mandi quotmm Wm my m m Wham W mm at i n eqzh w n 3 mm s m mmquot a 1W y em PHD m mm mi neurqi nmwurk m Mam mm etx a neurqi mm m mm to 0 mt mm and M Myw quotWquot m quot9quot to mama 1mquot 01 my Mammyquot Far each payth 0 maqu mum m in mm mm zare i amqu a m mamquot 7 w 7 7 WNW i an m mm mam wane a w y 0 Performances Secondary Structure Prediction monitored at CASP rAvaiabe servers at of Wm W as YEAR Targets ltQ3gt Group ROS VPHD as 1994 5 53 and rms W sander VNNPREDiCY mgWcmghavm utsteduwammngvedm mm asrz 1995 24 7o rChauandFassman Mtgl ssabiamvwgimaedu asta vwwwchatsshtm cAsP 199i 1 75 Jones quotquot 9 95quotquot9quot5quot9 7 m mam m Wm mm gummy mm as 2000 H E0 Jones mm pmeemssiememw wume wane 1 zwwwmem wane 1 BCB 444544 Fall 06 Terribilini 7 32 SVMS amp NNs amp Protein 239 11606 Structure Prediction SVM finds the maximum margin Support Vector Machines SVMs hyperplane fl L2 0 O C C D G j quot o L3 o r III III f E III i l Image from hffpen MilkpediaorymlaSuppom vecf0rn1ach39ne IWI Image from hffpen Wikl39pedborywikl5upporf vecfoI39n1ac139ne BCB 444544 F06 ISU Terribilini 32 NNS amp SVMS Protein 239 Structure Prediction 11806 43 BCB 444544 F06 ISU Terribilini 
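The perceptron training procedure described earlier in this lecture (dot product, sigmoid output, squared-error function, steepest descent) can be sketched roughly as follows. This is only an illustration, not the lecture's own code: the function names, the learning rate, the epoch count and the toy AND data set are all assumptions made for the example, and the threshold T is folded into the weight vector by prepending a constant input of 1.

```python
import math

def sigmoid(s):
    """Continuous output function F(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + math.exp(-s))

def dot(w, x):
    """Dot product S = X . W."""
    return sum(wi * xi for wi, xi in zip(w, x))

def step_output(w, x, threshold):
    """Discrete perceptron: 1 if the dot product exceeds the threshold T, else 0."""
    return 1 if dot(w, x) > threshold else 0

def train(examples, targets, rate=0.5, epochs=5000):
    """Steepest descent on E = sum_i (F(W . X_i) - t_i)^2 (hypothetical settings)."""
    # Prepend a constant 1.0 so the threshold is learned as an ordinary weight.
    data = [[1.0] + list(x) for x in examples]
    w = [0.0] * len(data[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, t in zip(data, targets):
            f = sigmoid(dot(w, x))
            # Chain rule: d/dw_j (f - t)^2 = 2 (f - t) f (1 - f) x_j
            common = 2.0 * (f - t) * f * (1.0 - f)
            for j, xj in enumerate(x):
                grad[j] += common * xj
        # W_new = W_old - rate * gradient
        w = [wj - rate * gj for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    """Classify into two categories by thresholding the sigmoid output at 0.5."""
    return 1 if sigmoid(dot(w, [1.0] + list(x))) > 0.5 else 0

# Toy training set (an assumption for illustration): the logical AND function.
AND_INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND_TARGETS = [0, 0, 0, 1]

weights = train(AND_INPUTS, AND_TARGETS)
predictions = [predict(weights, x) for x in AND_INPUTS]
print(predictions)
```

The training set contains both "0" and "1" examples, as the notes require. A real secondary structure predictor such as PHD would use windows of encoded residues as inputs and several interconnected units rather than a single perceptron.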
What about this?
[Figure: example data set; image not recoverable]

Kernel function
o Maps inputs to a high dimensional feature space
o Hopefully the two classes will be linearly separable in this high dimensional space

Protein Structure Prediction
One popular model for protein folding assumes a sequence of events:
o Hydrophobic collapse
o Local interactions stabilize secondary structures
o Secondary structures interact to form motifs
o Motifs aggregate to form tertiary structure

Protein Structure Prediction
A physics-based approach:
o Find the conformation of the protein corresponding to a thermodynamic minimum (free energy minimum)
o Cannot minimize internal energy alone! Needs to include solvent
o Simulate folding... a very long process!
Folding times are in the ms to second time range. Folding simulations at best run 1 ns in one day.

The Folding@Home initiative (Vijay Pande, Stanford University)
[Screenshots of the Folding@Home web site; most text is not recoverable. Legible fragments describe a distributed computing project that studies protein folding, misfolding, aggregation and related diseases, with the stated goal "to understand protein folding, protein aggregation, and related diseases".]
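As a rough worked example of the time-scale gap quoted for the physics-based approach (folding takes milliseconds to seconds, while simulations reach about 1 ns per day), the arithmetic below estimates how long a single simulation would need. The helper name and the `machines` parameter are assumptions for illustration; dividing by the number of machines is an idealization, since one trajectory cannot simply be split across computers.

```python
NS_PER_MS = 1_000_000          # nanoseconds in one millisecond
SIM_NS_PER_DAY = 1.0           # from the notes: at best 1 ns of folding per day

def days_needed(folding_time_ns, ns_per_day=SIM_NS_PER_DAY, machines=1):
    """Wall-clock days to cover a folding time, assuming perfect scaling
    over `machines` computers (a deliberately optimistic assumption)."""
    return folding_time_ns / (ns_per_day * machines)

days_for_1ms = days_needed(NS_PER_MS)
years_for_1ms = days_for_1ms / 365.25
days_with_100k_machines = days_needed(NS_PER_MS, machines=100_000)

print(f"1 ms of folding on one machine: {days_for_1ms:,.0f} days "
      f"(about {years_for_1ms:,.0f} years)")
print(f"Same, idealized over 100,000 machines: {days_with_100k_machines:.0f} days")
```

Even the fastest-folding proteins are therefore out of reach for a single computer at this rate, which is the motivation for pooling many machines.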
Oneunilsohcluby donlllng imam noun 3 inarmwol 511301 My tunes to lheml m Shalom 1m 7 in 7 t 7M minim inquot mi anar clmn sehtl m i I 39Y39ie Mega Myer 1 avg naumnland Mllrcnniiuvnux NM F i may t aw Sianctobell 2000 um 1000ch15 ii wne rlvn ndarn requiem sin a M nave hid several successes Von Faimnggnam Each ddrlmnal CF39UgJias Us my mum masu nevi wcan Allanman mum Vi iessi cm as Nunlrvl l m Fi mwniduck r semici p199 New Mllclli oi mm ngMmrmwuam uan Mum warmly mm mm and mum lsnuyormom amummy u u i I i i na r BCEMMEM rue 15o TerribiIim 327 run A SVMS w i39ZWw run no imimiim out 7 MM wvis mm L rmcrure mamaquot ttou Eu Folding Home Results Protein Structure Prediction mm 39 Experiments DECOYS m Generate a large number mm Raleigh et al E SUNVVSMY Brook of possible shapes 395 1 239 W 2 1 le em UIUC DISCRIMINATION E in 39 39 Select the correct nativelike 2 3 mg beta hairpin f Id 1iquot 5 EatonetalNlH 1 E 1393 V In El ha hIHI NIH 30quot e a 39 39 C J PPA K I Gruebele et al UIUC I IEI IEIEI IEIEIEI IEIEIEIEI IEIEIEIEIEI experimental measurement Need good decoy structures Need a good energy function nanoseconds 868444544 rue 15o mibiiim 332 r NNs I SVMS PM j anda wwwmm El ECH444E44 rue Isu rmihiiim 332 r NNsA smg Protein 2 Structure Prediction llBUo 52 ROSETTA at CASP David Baker ROSETTA resu39ts at CASP5 Q Homoo modeIn quot M39 Ab Initio redlctlon V i P t mg mg m VulAVwvm E 07 ms 7 H 7W muiiwnrm l I quotmay rst MMquot H Simultaneous modeling 1 4 r of the large and 2 homo095 g mun wawmim 5 i I glue quot quot r W M I Secundaryslrudure a a im qr tan quota 39 icio human I nugget quot quot WW predlcllmi a W ma 7 m r155 m T161 WIN g gauge i Mquotmw Fragmenlbased iii 39quotle39UC mm approach In generate IE Serverquot decoys E I amp a 39 n 39 m quota m Moslsuccessfu MWquot q T quot731 732 39 xfgm mfm SelectSdecuys miT I T W W 39 1 Memoquot 5 CA 3 3 Furpredlcllun g i for fold recognition ww mwmm a and ab milo predicton 5 i I Rosella predlclluns in CASPS g u 7 a i A Successes failures 
andpmspecl m n m a m quotquotquotquot quot forcnmplele aulnmallnn Bakerel o n L mum all zoos ECEMAEMFUHSU Wigwamquot IVEIo 54 BCB 444544 Fall 06 Terribilini 9 32 SVMs amp NNs amp Protein Structure Prediction 2 ms mm u rwmein m Min mm M 9 uquot Lnlth mmnzvmanmmvi r5 11 A as i g nS Lzummmms K BCB 444544 Fall 06 Terribilini 3 RosETrA resuits at CASPE 4 rt S o nlidu with nw w 41 Imam um um um um um nu um l77 um 11606 Genomics 111405 Bioinformatics Seminars 111605 Nov 17 Thurs 410 55MB Seminar i 1414 M33 C2 and PH Domains Diverse mulafors of membrane signaling events Joe Falke UC Boulder Nov 18 Fri 1210 565 Semimr in E164 Lag Usim P Valus for the Planning and Analgis of Microzrrax Experiments Dan Net eron Stat Genomics Aln5 o DahhslSU Vacs mam Genomes AAHS o DahhslSU Vacs Mwsux Genomes Protein Structure Prediction Reading Assignment for MonFri Genome Analysus Mount Bioinformafics Mon Protein 339 structure prediction I Clip 11 Genome Analysis hilpwww biomfor ma csonlmz orgChichlllmdexhlml pp 495 547 Ck Emta httpwwwbioinformaticsonlineorgheperrata2htm Wed Genome analysis amp genome projects Comparative genomics ENCODE SNPs HapMaps medical genomics Thur Lab Protein structure prediction SNPs Fri Experimental approaches microarrays proteomics metabolomics chemical genomics o DahhslSU Vacs Moswx Genomes Aln5 AAHS o DahhslSU Vacs mam Genomes BCB 544 Additional Readings Review last lecture Required Gene Prediction Burge dKarIin 1997 MB 26878 Protein Structure Prediction WWW focus on Human HapMap Nevm 437 on 27 2005 Threading ameniany 4374233 News a View 437 1241 hilglWWW name camnamre ournalv437n7063full437124mh1ml Optional Amide 4371299 A hagorng mag afrhe human genome The International HapMap tonsor um Aln5 o DahhslSU Vacs mam Genomes AAHS o DahhslSU Vacs Mwsux Genomes D Dobbs ISU BCB 444544X Genom ics 111405 Protein Structure Prediction using Threading Pro39l39elquot Threading WPlcal equot39er gY func l39lm 1 Align target sequence with template structures told 
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE library from the Protein Data Bank PDB 2 Calculate energy score to evaluate goodness of fit What Is quotprobabilityquot between target sequence and template structure that two specific How well does a specific resdue fit 3 Rank models based on energy scores assumption native residues are in syrucmra environmean like structures have lowest energy contact All nment a Target ALKKGFHFDTSE painf g p Sequence y39 Total energy Ep Es Eg Structure Templates Find a sequence structure alignment that minimizes t 2 energy function 111605 D Dobbs lSU r BCB 444544X Genomlcs 7 111605 D Dobbs lSU rBCB 444544X Genomlcs 8 A Rapid Threading Approach for 9 Protein Structure Prediction Tmquot for Fa Threadmg39 l Kai Ming Ho Physics Haibo Cao Yungok Ihm Zhong Gao James Morris Caizhuang Wang Sequence Structure 1D 3D problem Drena Dobbs GDCB Sequence Contact Matrix 1D 2D problem J ae Hyung Lee Michael Terribilini Jeff sander gtequence 1D Profile 1D 1D problem 111605 DDobbs lSUrBCB AAA544x Genomlcs 9 Ihm 2004 Template structure reduced representation Energy func on Assumption At residue level pair wise hydrophobic interaction is dominant E 211 Cu UiJ39 Cu contact matrix UiJ39 Uresidue I residue J 39 MJ matrix U Uij Cy 1 if n S 65 A contact U QiQJ Template structure H C N xN contact matrix CI 0 Otherwise nomeonmcf model U 1 0 A neighbor n sequence Ihm 111605 D Dobbs lSU rBCB 444544X Genomlcs 11 111605 D Dobbs lSU rBCB 444544X Genomlcs 12 D Dobbs ISU BCB 444544X 2 Genom ics 111405 Contact energy pairwise interactions Template Structure Contact Matrix MiyazawaJernigan MJ matrix c M F I L C 1 if r lt6 5 A Statistical potential c o 46 q c 0 Otherwise M 210 parameters M F 054 4120 a nelghbor 1n sequence l 049 001 006 L 057 001 003 008 V 052 018 010 001 004 W Sequence Sequence Vector quotquotTaquot939w39quot939eequot LTW M C2qx am a 3 AVFMR39HND39WND39ANW 3 QAQKQF 93 20 parameters 0 79970 98971 1197 0 6497 61v solubility a 06797 1 Q39 l1ydrophobicity 
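The reduced representation and contact energy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy coordinates and the q values are invented for the example, "neighbor in sequence" is taken to mean |i - j| <= 1, and the sum runs over all ordered pairs (i, j), so each contact is counted twice.

```python
import math

CUTOFF = 6.5  # angstroms, the contact distance given in the notes

def contact_matrix(coords, cutoff=CUTOFF):
    """C[i][j] = 1 if residues i and j are closer than the cutoff and are
    not sequence neighbors (assumed here to mean |i - j| > 1); else 0."""
    n = len(coords)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 2, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                c[i][j] = c[j][i] = 1
    return c

def contact_energy(c, q):
    """E = sum_ij C_ij U_ij with the factorized form U_ij = q_i * q_j."""
    n = len(q)
    return sum(c[i][j] * q[i] * q[j] for i in range(n) for j in range(n))

# Hypothetical 4-residue template: one point per residue, 3.8 angstroms apart
# around a square, plus made-up q values for the threaded sequence.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
toy_q = [1.0, -1.0, 1.0, 0.5]

C = contact_matrix(toy_coords)
energy = contact_energy(C, toy_q)
print(C)
print(energy)
```

Because the template enters only through C and the sequence only through the vector of q values, the same contact matrix can be reused to score many candidate sequences or alignments.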
Contact Energy Scoring Function
Contact energy: $E_c = \sum_{i,j}^{N} Q_i C_{ij} Q_j$
(Cao et al., Polymer 45, 2004; Ihm 2004)

1D profile: first eigenvector (Ihm 2004)
Hydrophobic contacts:
$E = \sum_{i,j} Q_i C_{ij} Q_j = \sum_{k=1}^{N} \lambda_k (Q \cdot V_k)^2$
o C: contact matrix
o $\lambda_k$: k-th eigenvalue of C
o $V_k$: k-th eigenvector of C
o $V_1$: eigenvector with the biggest eigenvalue
[Figure: weights of eigenvectors for real proteins (averaged over a set of proteins), plotted against eigenvalue / eigenvector index; also the fraction of hydrophobic contacts contributed by the k-th eigenvector for the sequence of the template structure]
o The first eigenvector of the contact matrix dominates the overlap between sequence and structure
o Higher-ranking eigenvectors (k > 4) are "sequence blind"

Fast threading alignment algorithm (Ihm 2004)
[Figure not recoverable]

Parameters for alignment
o Gap penalty: insertion/deletion in helices or strands is strongly penalized; small penalties for indels in loops (but gap penalties do not count in the energy calculation)
o Size penalty: if a target residue & the aligned template residue differ in radius by > 0.5, & if the residue is involved in > 2 contacts, the alignment contribution is penalized (but size penalties do not count in the energy calculation)

How to incorporate secondary structure?
Predict the secondary structure of the target sequence (PSIPRED, PROF, JPRED, SAM, GOR V)
o $N_+$ = number of matches between the predicted 2° structure of the target & the 2° structure of the template
o $N_-$ = number of mismatches
o $N_s$ = number of residues selected in the alignment
"Global fitness": $f = 1 + \frac{N_+ - N_-}{N_s}$
$E_{modify} = f \cdot E_{threading}$

Finally, calculate the "relative" score
How much better is this "fit" than random?
o $E_{modify}$: sequence vs. structure (adjusted for 2° structure match)
o $E_{shuffled}$: shuffled sequence vs. structure. Randomize the order of amino acids in the target sequence 50-200 times, calculate the score for each shuffled sequence, and take the average.
$E_{relative} = E_{modify} - E_{shuffled}$

Performance Evaluation in a "Blind Test"
CASP5 Competition (Critical Assessment of Protein Structure Prediction)
o Given: amino acid sequence
o Goal: predict the 3-D structure before experimental results are published

Results (well, actually our BEST results)
o HO = top-ranked CASP5 prediction for this target
o Target 174, PDB ID 1MG7
[Figure: predicted structure vs. actual structure]

Overall Performance in CASP5 Contest
HO = 8th out of 180 (assessment by M. Levitt, Stanford)
FR Fold Recognition (targets manually assessed by Nick Grishin)
Rank   Z-Score   N_good   N_pred   N_good^w   N_pred^w   Group name
1      24.26     9.00     12.00    9          12         Ginalski
2      21.64     7.00     12.00    7          12         Skolnick & Kolinski
3      19.55     5.00     12.50    9          14         Baker
[Remaining rows, including BAKER-ROBETTA and the Ho group, are not recoverable]
N_good^w: the number of good predictions without weighting for multiple models
N_pred^w: the number of total predictions without weighting for multiple models
(M. Levitt 2004)

Protein Structure Prediction Servers & Software
Three basic approaches:
1. Homology modeling (need > 30% sequence identity): PredictProtein META, SWISS-MODEL, Cn3D
2. Threading (< 30% sequence identity). Best? See CASP & EVA
3. Ab initio (if no template is available & you have many CPUs). Best? Rosetta (Baker); see CASP & EVA

Best approach for protein structure prediction?
o Try several servers
o How? Submit to a META server: PredictProtein META, 3D-Jury (BioInfoBank META)
o Also check continuous benchmarking sites: EVA, LiveBench
[Figure: applicability of homology modeling, threading and de novo prediction as a function of sequence identity; Baker & Sali 2000]

Genomics
New today! For excellent overview lectures, see these, posted by NHGRI & Pevsner:
1. Genomic sequencing: Mapping and Sequencing, CTGA2005Lecture1.pdf. Eric Green, NHGRI
2. Human genome project: The Human Genome, 2005 10 19 ch17 (PDF). Jonathan Pevsner, Kennedy Krieger Institute
3. SNPs: Studying Genetic Variation II: Computational Techniques, CTGA2005Lecture13.pdf. Jim Mullikin, NHGRI
4. Comparative Genomics: Comparative Sequence Analysis, CTGA2005Lecture8.pdf. Elliott Margulies, NHGRI