BIOINFORMATICS GEN 440
This 10 page Class Notes was uploaded by Helmer Gutmann on Saturday September 26, 2015. The Class Notes belongs to GEN 440 at Clemson University taught by Staff in Fall. Since its upload, it has received 67 views. For similar materials see /class/214244/gen-440-clemson-university in Genetics (Graduate Group) at Clemson University.
Date Created: 09/26/15
GEN 440/640 Bioinformatics, 123 Long Hall
2/9/2006
Lecture 9: Polymorphism

I. Introduction
1. Polymorphism: the occurrence in a population, or among populations, of several phenotypic forms associated with alleles of one gene or homologs of one chromosome.
2. Genetic polymorphism: the occurrence together in the same population of more than one allele or genetic marker at the same locus, with the least frequent allele or marker occurring more frequently than can be accounted for by mutation alone.
3. Allele: an alternative form of a gene; one of the different forms of a gene or DNA sequence that can exist at a single locus. Alternatively, one of several alternate forms of a gene occupying a particular location on a chromosome.
4. Cause of polymorphism: mutation.
5. Allele frequency (often called gene frequency): a measure of how common an allele is in a population; the proportion of all alleles at one gene locus that are of one specific type in a population.
6. Operational definition: allele frequency > 1%; otherwise called mutations or rare alleles.
7. Further distinction: common polymorphism alleles vs. private polymorphism (less useful for this distinction).
8. Why are we interested? Why do we care? -> Identifying the genes conferring susceptibility or resistance to common human diseases should become increasingly feasible with improved methods for finding DNA sequence variants on a genome-wide scale (Collins et al. 1997).
9. Types of polymorphisms:
   SNP (1/1000)
   STR: short tandem repeats (sequencing project)
   SSLP: simple (single) sequence length polymorphism (human, mouse & rat genome)
   SSR: simple sequence repeat (plant genome)
   microsatellite DNA (evolution)
10. Satellite DNA: this term was used originally to describe a discrete fraction of DNA visible in a CsCl density gradient as a "satellite" to the main DNA band. The term now refers to all simple-sequence DNA having a centromeric location, whether or not it is distinguishable on density gradients.
11. Microsatellites: a very short unit
sequence of DNA (2 to 4 bp) that is repeated multiple times in tandem. Microsatellites are highly polymorphic and make ideal markers for linkage analysis. A polymorphism at a microsatellite locus is also referred to as a Simple Sequence Length Polymorphism (SSLP). Alleles are designated by providing the repeat motif and the copy number for each allele.
12. Example of an SSLP marker:
    Symbol:         D15Ra145
    Alias symbols:  R0067DD4
    Expected size:  213 bp
    Primer pair:    GTCTCCTGGCTTCGTACTGG / CTGTTGACCTCTTTCCAGTGG

    Example of an SSLP polymorphism:
    #    Strain symbol    Allele size (bp)
    1    M                177
    2    AVNOrl           177
    3    BBDP             163
    4    BBDR             163
    5    BCICpbu          173
    6    BDIXHan          179
    7    BDVIICub         163
    8    BNvLXICUb        161
    9    BNSsNHsd         161
    10   BPCub            177
    11   BUFPi1           163
    12   COPOlaHsd        177
13. SNP number: > 27 million SNPs; > 10 million SNPs validated.
    [Table: SNP database statistics (build 125) by organism, including Homo sapiens, Mus musculus, Canis familiaris, Gallus gallus, Pan troglodytes, Oryza sativa, and Anopheles gambiae; the per-column counts are not recoverable.]
14. Example of a Trace Archive (Figure 7.1).
How do you define SNPs? Human reference genome vs. DNAs; 24 initial DNAs for SNPs. Coalescent theory to study linkage disequilibrium. Ascertainment bias (Fig. 7.2).
What is a DIP (deletion/insertion polymorphism)? An observed insertion of one or more nucleotides in one individual relative to another individual. Since the molecular event that gave rise to this observation cannot be determined from the observation alone (i.e., was it an insertion or a deletion?), both events are incorporated into the name of this polymorphism type. Again, who is the wild type?
What is a
MNP (multiple nucleotide polymorphism)? Variations that are multi-base variations of a single common length (all alleles the same length, where length > 1). Sample allele: ACGTTC.
The structure of a submitted SNP. Components: the variant alleles and the flanking nucleotide sequence that specifies a unique position in the genome (Figure 7.6, text p179): (1) complete survey, (2) partial, (3) capture from publication; survey position unknown.
[Figure: flank and assay regions of a submitted SNP. The structure of the flanking sequence in dbSNP is a composite of bases either assayed for variation or included from published sequence.]
Functional analysis -> relationship between a SNP and gene features (colocalization):
   a. same as contig
   b. synonymous substitution
   c. non-synonymous substitution
   d. position in the coding region but with unknown relationship because of lack of annotation
   ... low resolution

II. Other SNP databases

III. Genotyping
Genotyping: the process of determining the allele states of selected polymorphisms in a selected group of individuals.
STR/SSLP -> PCR; RFLP; haplotype -> chromosome.
Potential advantage of SNPs -> high density. Why? -> Genotyping and disease diagnosis -> Haplotype Map project.
Haplotype: a specific set of alleles observed on a chromosome -> nonrandom association, shared pattern. Linkage disequilibrium ...
10 million common SNPs; only 500,000 tag SNPs for 90% of haplotypes.

2/7/2006
Lecture 8: RNA Structure Prediction

I. Introduction
1. RNA can do more than just carry genetic information: enzyme, signaling, development ...
2. Three major categories of RNA: messenger RNA (mRNA), ... rRNA (ribosomal RNA) and ... proteins, while messenger RNA has an information ...
3. ... molecules are built from a family of four ... The backbone is an alternating polymer of ribose ... the O3' and O5'
atoms from consecutive riboses (Figure 1).
4. Three levels of organization of RNA structure: primary (linear sequence of nucleotides); secondary (canonical base pairs, stem loops, hairpins; describes the general three-dimensional form of local regions or the overall shape of biopolymers); and tertiary (three-dimensional arrangement of atoms, including all noncanonical contacts).
5. Standard or canonical Watson-Crick base pairs: AU and GC; noncanonical base pairs (mismatches). Bases: the two purine bases, adenine (A) and guanine (G), and the two pyrimidine bases, cytosine (C) and uracil (U) (or thymine (T)).
   Purine-pyrimidine base pairs: 10
   Homo purine-purine base pairs: 7
   Hetero purine-purine base pairs: 4
   Pyrimidine-pyrimidine base pairs: 7
6. For example, the George Fox (U. Houston) collection: a total of 1761 occurrences of ...
7. RNA secondary structure is generally divided into helices (contiguous base pairs) and various kinds of loops (unpaired nucleotides surrounded by helices).
8. ... evolutionary evidence by sequence alignments. Why is this a good strategy? -> Too many possibilities. This method can reach 97% accuracy. Why?

II. RNA thermodynamics
1. ...
2. Another estimate is 1.8^N for N nucleotides (Zuker and Sankoff 1984).
3. A number of ... (Gruner et al.):
   Deterministic: minimum free energy; kinetic folding (5'-3' folding); partition function
   Stochastic: simulated annealing (algorithm can predict pseudoknots)
Effectively three types of optimization: (1) maximizes the number of base pairs, (2) minimizes the free energy, or (3) optimal given a family of related sequences.
One of the methodologies commonly used for RNA structure prediction is based on calculating free energy estimates for each possible fold, then choosing the fold that yields the lowest free energy. These free energy values are a combination of energy values calculated for each pair of adjacent base pairs, plus loop or bulge energies. The energy values are derived from melting studies of ...
Free energy minimization:
Gibbs free energy (37 °C) -> nearest-neighbor parameters, because the energy depends only on the adjacent pair -> sum of each contribution: helical stacking, loop initiation, unpaired stacking (text Figure 6.2 has a nice description).
Structural elements: double helix, hairpin loop, multibranch loop (helical junction), internal loop, bulge loop.
See httpwwwbioinforpieduNzukermmaenergynodelhtmlSECTION20 for an introduction.
For computation of the free energy of an RNA structure, use the efn server, an RNA free energy web site authored by Michael Zuker (Rensselaer Polytechnic Institute). Copy and paste the following RNA sequence into the sequence query box: ...
For example:
[Table: energy scoring, base pairing (kcal/mol)]
[Table: energy scoring, loop penalties (kcal/mol)]
11. Most prediction utilizes dynamic programming but does not handle pseudoknots.
12. A pseudoknot configuration occurs when segments of sequence are bonded in the same direction or have a three-dimensional contact.
[Figure: pseudoknot vs. unknotted pseudoknot]
14. Of course there is a general RNA database: www.rnabase.org
15. Four software packages in the text: Mfold, Vienna package, RNAstructure, Sfold -> similar results. Why?
16. OligoWalk can ... the affinity of ... DNA oligos to a target RNA -> drug or inhibition design.
17. How to read a dot plot (p152): energy profile or probability (154).

III. Other issues
Phylogenetic ... RNA structure constraints: ConStruct, Alifold, Pfold vs. dynamic programming: FOLDALIGN, Dynalign; ... method is about 73% (p613).

Summary
1. ... at its infancy because not many 3D structures are available; see the Doudna 2000 review in the Reading material folder of our website.
3. ...
4. Predicting tertiary structure ...
5. ...

2/2/2006
Lecture 7: Promoter Analysis

I. More on HMMs
1. For DNA/protein we utilize a left-right architecture.
2. Uses associated with HMMs: once a system can be described as an HMM, three problems can be solved. The first two are pattern recognition problems.
   Evaluation (i.e. likelihood) -> finding the probability of an observed sequence given an HMM. Viterbi
algorithm for Decoding (or posterior decoding) -> finding the sequence of hidden states that most probably generated an observed sequence. The forward and backward algorithms solve the evaluation problem and supply the posterior probabilities.
   Learning -> generating an HMM given a sequence of observations (optimization): Baum-Welch (Expectation Maximization, EM) or gradient descent.
3. Caveat: HMMs do not deal with correlation.

II. Strategy of evaluation of gene prediction methods
1. As seen in the text, different methods produce different results. Strategy: (a) experimental verification, (b) consistency among different methods, (c) systematic statistical comparison of different methods.
2. To evaluate the accuracy of a gene prediction program (compared with the actual gene structure of the sequence), there are two basic measures, Sensitivity and Specificity, which essentially measure prediction errors. Briefly, sensitivity is the proportion of real elements (coding nucleotides, exons, or genes) that have been correctly predicted, while specificity is the proportion of predicted elements that are correct. More specifically, in terms of true and false predictions: if TP is the total number of coding elements correctly predicted, TN the number of correctly predicted noncoding elements, FP the number of noncoding elements predicted coding, and FN the number of coding elements predicted noncoding, then in the gene finding literature Sensitivity is defined as Sn = TP / (TP + FN) and "Specificity" as Sp = TP / (TP + FP) (this should be called precision). Both sensitivity and specificity take values from 0 to 1, with perfect prediction when both measures are equal to 1. Neither Sn nor Sp alone constitutes a good measure of global accuracy, since one can have high sensitivity with little specificity and vice versa.
Note: In statistics, specificity is usually defined as
   specificity = number of true negatives / (number of true negatives + number of false positives)
and sensitivity is defined as
   sensitivity = number of true positives / (number of true positives + number of false negatives)
Sensitivity is not the same as the positive predictive value, defined as the number of true positives divided by the sum of the number of
true positives and the number of false positives, which is as much a statement about the proportion of actual positives in the population being tested as it is about the test. In information retrieval, positive predictive value is called precision and sensitivity is known as recall.
4. The relationship between sensitivity and specificity can be measured by the correlation coefficient, defined as (text p127):

$$CC = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TP+FN) \times (TN+FP) \times (TP+FP) \times (TN+FN)}}$$

CC ranges from -1 to 1, with 1 corresponding to a perfect prediction and -1 to a prediction in which each coding nucleotide is predicted as noncoding and vice versa.
5. Typical studies are Burset & Guigo 1996 and Rogic 2001 -> it seems that various methods improved over the years in terms of their CC values.
6. Shortcoming: the datasets used for these studies are short genomic sequences encoding a single gene (simpler structure). Better way: annotated chromosomal sequences, like human chromosome 22.
7. Three results (text Table 5.1, p128): (1) ab initio is not good at complexity (GENSCAN drops from 0.91 to 0.64); (2) dual-genome comparative approaches are better; (3) automatic annotation is difficult, even utilizing cDNA or RefSeq as a training set.

III. Promoter Analysis
1. Promoter: functional regions immediately upstream or downstream of a transcriptional start site (TSS) that are involved in regulation of the transcription (text Fig. 5.12, p128).
2. Figure 1: Components of transcriptional regulation. Transcription factors (TFs) bind to specific sites (transcription-factor binding sites, TFBS) that are either proximal or distal to a transcription start site. Sets of TFs can operate in functional cis-regulatory modules (CRMs) to achieve specific regulatory properties. Interactions between bound TFs and cofactors stabilize the transcription-initiation machinery to enable gene expression.
3. As the initial step of gene expression, transcription, one of the most widely studied processes in cell and molecular biology, is central to regulatory mechanisms.
Transcription is shaped by the interactions between transcription factors (TFs) that bind cis-regulatory elements in DNA, additional cofactors, and the influence of chromatin structure (FIG. 1). Trans-acting proteins that control the rate of transcription at the level of the individual gene bind crucial cis-regulatory sequences. A full understanding of the interplay between trans-factors and cis-sequences would transform biological research, providing the means to interpret and model the responses of cells to diverse stimuli.
4. Computational approach is critical: experiments will take a long time. Computational challenges:
   1. promoters are very diverse, and even well-known motifs are not always conserved
   2. binding sites are short (5-25 bp) -> specificity is low
   3. combinatorial nature, temporal and spatial expression; e.g. the human genome contains about 1850 unique TFs, each of which binds to a specific TFBS -> the number of pattern permutations is immense
   4. multiple-to-multiple relationship of binding sites and TFs -> specificity
[Figure 1 labels: co-activator complex, transcription initiation complex, CRM, proximal TFBS]
Two major tasks: (1) promoter prediction (genome studies); (2) promoter characterization (microarray & tissue-specific coregulation).
Prediction. Two main ...: pattern-driven vs. sequence-driven. Because most TFs bind to short (5-25 bp), degenerate sequence motifs that occur very frequently in the genome, a position weight matrix (PWM) is often used to quantitatively represent the binding specificity of these factors.
Pattern driven I (signal based): relies on the collection of experimentally annotated binding sites; recognition of relatively conserved signals and conserved spacing among patterns, such as the TATA box and CCAAT box, using weight-matrix-based approaches (TRANSFAC, PROMO), neural networks (TFBS), and genetic algorithms (PROMOTER2.0).
Pattern driven II (content based): distinguishes promoter sequences from non-promoter sequences based on content differences, such as triplet base-pair preferences around the transcription start site (TSS), hexamer
frequencies in consecutive 100-bp upstream regions, etc., using a linear discriminant function (TSSG, TSSW) or quadratic discriminant analysis (CorePromoter).
Problem: while those programs predicted about 13-54% of the promoters correctly, each program also predicted a number of false positive promoters. New methods utilize more information.
Strategy (text p129) -> utilizing additional information:
   - Regions with higher gene/promoter predictions
   - Nearby annotated TSS
   - Clusters of associated sites
   - Experimental data
Sequence-driven methods depend on common functionality conserved throughout evolution -> sequence alignment, as in the comparative genomic approach in gene prediction.
Characterization: search for cis-regulatory elements computationally by screening genomic sequences for the presence of TFBS motifs that have already been identified. Major drawback of using PWM: only a small fraction of the predicted binding sites is functionally significant. A method for reducing the large number of false positives, the phylogenetic footprinting approach, has been used by many researchers and will be discussed below.
Motivation of characterization of promoters: genes in the same cluster are assumed to be coregulated. Different methods can then be used to discover regulatory elements within the genes in the same cluster. Popular programs that can perform this task are Consensus, MEME, Gibbs Sampler, ANN-Spec, AlignACE, PROJECTION, MDScan, and the more recent YMF.
1. Such methodologies have been successfully demonstrated mainly in prokaryotes and lower eukaryotic organisms.
2. Limitation: the correlation between gene cluster and regulatory motifs is imprecise (text p129).
   a. Different regions evolve at different rates; the degree of conservation is different.
   b. Do not handle chromosomal translocation, inversion, and deletion.
   c. Do not handle high background conservation (close species).
   d. Not all coregulated gene promoters share a common motif.
   e. No conservation of regulatory regions -> species uniqueness.
   f. Combinatorial nature of TFs: the same motif can be
found in the promoter regions of genes that are not coregulated.
17. See the review by Wasserman and Sandelin 2004.
18. Graphic tools available (p135).
19. Unsolved problems (text p132-135). Also problems for general gene predictions:
   - Repetitive sequences (masking effect)
   - EST quality
   - Pseudogenes: structure and evolution
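
The accuracy measures from part II of this lecture (Sn, Sp, and CC) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code used in the cited studies: the function name prediction_accuracy and the example counts are hypothetical, and a zero denominator is mapped to 0.0 as a convenience.

```python
import math


def prediction_accuracy(tp, tn, fp, fn):
    """Return (Sn, Sp, CC) using the gene-finding definitions in these notes.

    tp, tn, fp, fn are counts of coding/noncoding elements (hypothetical inputs).
    """
    # Sn = TP / (TP + FN): fraction of real coding elements that were found.
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    # Sp = TP / (TP + FP): gene-finding "specificity", i.e. precision.
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    # CC = (TP*TN - FN*FP) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN))
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Assumption: report 0.0 when the denominator vanishes (CC is undefined there).
    cc = (tp * tn - fn * fp) / denom if denom else 0.0
    return sn, sp, cc


# Made-up counts for a nucleotide-level comparison.
sn, sp, cc = prediction_accuracy(tp=80, tn=900, fp=20, fn=20)
print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.2f}")  # Sn=0.80 Sp=0.80 CC=0.78
```

Reporting CC alongside Sn and Sp reflects the point made above: neither Sn nor Sp alone is a good global measure, since a predictor that calls everything coding gets Sn = 1 with poor Sp.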