BIOINFORMATICS GEN 440
This 8 page Class Notes was uploaded by Helmer Gutmann on Saturday September 26, 2015. The Class Notes belongs to GEN 440 at Clemson University taught by Chin-Fu Chen in Fall. Since its upload, it has received 35 views. For similar materials see /class/214243/gen-440-clemson-university in Genetics (Graduate Group) at Clemson University.
Date Created: 09/26/15
1192006, 123 Long Hall
Lecture 3: Mapping Databases & Information Retrieval
Please check httppeopleclemsoneduNcchenGEN4407640

I. Mapping Databases

From last lecture:
1. Two components of genomic mapping:
   a. Representational: measurement (different maps)
   b. Process of determining where the biological object (gene or disease locus) lies in the genome -> linking a molecular signature with a biological outcome
2. Major bioinformatics challenge: efficient mining and use of genomic data
3. Relationship between mapping and sequence: sometimes DNA sequence tracts can be thought of as ultra-high-resolution maps -> DNA sequence can be considered an annotation of the position

Today:
4. Types of sequence: (1) marker-based tags, (2) gene-based tags, (3) single-gene sequences, (4) prefinished draft sequences, and (5) completed continuous sequence tracts
5. Genomic map elements:
   a. DNA markers: genomic landmarks; STS (sequence-tagged sites)
   b. Polymorphic markers (sequence variations):
      - Restriction-based: RFLP; variable number of tandem repeat units (VNTRs)
      - PCR-based: microsatellites, short tandem repeats (STRs), SSLPs (simple sequence length polymorphisms), SSRs (simple sequence repeats)
      - PCR-based: SNPs (single nucleotide polymorphisms); occur about once every 100 to 300 bases in human, probably 1 in 1,000 bases (homework)
   c. DNA clones: BACs, PACs, and YACs
      - DNA fingerprinting: restriction digestion fragment patterns are compared between clones to identify those that share subsets of fragments
      - Clones whose insert ends have been determined are referred to as sequence-tagged clones (STCs) -> physical mapping
      - Note: PACs are bacteriophage P1-based, have a negative selection against non-recombinants, and have an IPTG-inducible high-copy-number origin of replication for large DNA production
   d. Genomic annotation: biological information
6. Complexity and pitfalls: how to minimize errors
   a. Using several independent maps
   b. Using maps integrated with all available genomic information
   c. Seeking experimental evidence
7. Nomenclature issues
8. Resolution limits for different maps
9. Types of maps:
   a. Cytogenetic maps:
      - Giemsa bands; probes labeled radioactively or fluorescently
      - Interphase FISH & fiber FISH
      - Array CGH (comparative genomic hybridization)
      [Figure: Resolution in fluorescence in situ hybridization (FISH), with a panel describing the FISH procedure; the details are illegible in the source.]
   b. Genetic linkage (GL) maps:
      - Starting point for many disease-gene mapping projects & backbone for physical mapping
      - Rely on naturally occurring recombination & polymorphic markers, using observed genotypes
      - Use statistical models for quantification: LOD score, maximum likelihood
      - Distance in GL maps is not proportional to distance in physical maps
   c. Physical maps:
      - STS content maps: clone alignment
      - Radiation hybrid maps: intermediate resolution between GL and physical maps
      - Sequence-based maps in genome databases
   d. Comparative maps: synteny and prediction from one species to another

II. ENTREZ
1. PubMed & MEDLINE
   - MEDLINE: English publications, biomedical journals
   - PubMed: more than one million additional entries; extends to general science and chemistry journals and to non-English publications
2. The idea of cross-referencing
   Figure 1. ENTREZ integrated information retrieval system ("Entrez, The Life Sciences Search Engine"). Each sphere represents one of the elements that can be accessed through Entrez, and the lines represent how each component database connects to the others. The original version of Entrez had just 3 nodes: nucleotides, proteins, and PubMed abstracts. Entrez has now grown to nearly 20 nodes.
3. LocusLink has been superseded by Entrez Gene.
4. Medical databases: usually non-sequence-based information
   a. Example: OMIM (Online Mendelian Inheritance in Man)
   b. What is the difference between a Mendelian trait or disease and a non-Mendelian (complex) trait?
   c. Complex diseases are any genetic diseases that do not obey the single-gene dominant or single-gene recessive Mendelian law. The term "complex traits" is also used for phenotypes that may not be considered diseases.
   d. Complex diseases are non-Mendelian: they show familial aggregation but no clear segregation. Segregation is the principal difference between single-gene disorders and complex diseases: although the genes of complex diseases segregate, their phenotypes do not.
5. OMIM
   a. Electronic catalog of human genes and genetic disorders
   b. Founded by Victor McKusick; housed at NCBI
   c. Concise textual information from the published literature on human genetic diseases

Lecture 2: DNA Databases & Mapping Databases

I. DNA Nucleotide Databases
1. Data flow of the three major DNA databases
   Figure 1. Data flow for new submissions and updates between the three databases.
2. Importance of accuracy and ease of use for nucleotide sequence databases
   a. Sequence comparison: more useful to translate DNA coding sequence into protein
   b. Avoiding error propagation
   c. Facilitating information retrieval
3. Nucleotide sequence flat files
   a. Most common format: flat file
   b. Sequence record represented as a string of nucleotides with tags and identifiers
   c. FASTA format: ">" denotes the beginning of a new sequence record; it opens the definition line ("defline"), which carries an identifier (accession ID)
   d. Upper- or lower-case letters for DNA sequence, usually 60 characters per line; Courier font is the best
   e. Similarly, a protein sequence can use FASTA format
4. Dissection of a nucleotide sequence flat file
   a. Headers (database-specific):
      - First item: DDBJ/GenBank LOCUS, EMBL ID; has to be unique within the database
      - Second: length of the sequence
      - Third: molecule type (biological nature of the molecule)
      - Fourth: division code (e.g., INV); historical
      - Date: the date when the record was last made public

      Organismal divisions (http://www.ncbi.nlm.nih.gov/HTGS/table1.html):
         BCT: bacterial sequences
         FUN: fungal
         HUM: human
         INV: invertebrate sequences
         MAM: other mammalian sequences
         ORG: organelle sequences
         PHG: bacteriophage sequences
         PLN: plant, fungal, and algal sequences
         PRI: primate sequences
         RNA: structural RNA sequences
         ROD: rodent sequences
         SYN: synthetic sequences
         UNA: unannotated sequences
         VRL: viral sequences
         VRT: other vertebrate sequences

      Functional divisions:
         CON: constructed (or "contigged") records of chromosomes, genomes, and other long DNA sequences
         EST: EST sequences (expressed sequence tags)
         GSS: GSS sequences (genome survey sequences)
         HTC: unfinished high-throughput cDNA sequencing
         HTG: HTGS sequences (high-throughput genomic sequences)
         PAT: patent sequences
         STS: STS sequences (sequence-tagged sites)
         WGS: whole genome shotgun sequence

      EST (expressed sequence tag):
         - Partial DNA sequence (single pass) of a cDNA clone
         - Largest and fastest-growing division of GenBank
         - Derived from some specific RNA source
         - Source field can be searched
   b. Second part of header:
      - Definition line (DE in EMBL): summary of biological content
      - Accession number: cited in publications; two formats, 1+5 (one upper-case letter followed by five digits) and 2+6 (two letters followed by six digits); when a record lists several accession numbers, the first one is the primary one
      - Version: U54469.1 -> ACCESSION.VERSION; the accession is unchanged, but the version is incremented each and every time the sequence changes
      - Source & organism (OS & OC in EMBL)
   c. Feature tables: tabled, direct representation of biological information (feature keys, locations, and additional qualifiers)
      - The source feature is the only feature that must be present in all DDBJ/EMBL/GenBank entries
      - CDS (coding sequence): instructions on how to join two sequences together, or how to make an amino acid sequence from the indicated coordinates and the inferred genetic code
5. Third Party Annotation (TPA)
   a. Primary database entries are owned by the original submitter and the co-authors of the submission
      publications; only the owners have the privilege to update the data content
   b. TPA: re-annotations of existing entries, combinations of novel sequence and existing primary entries, and annotation of trace archive and whole genome shotgun data
6. RefSeq
   a. Many sequences are represented more than once (redundancy)
   b. Curated secondary database for genomic DNA, transcripts, and proteins for selected organisms; reviewed by NCBI staff
   c. Provides one and only one reference sequence for each DNA, RNA, and protein (non-redundant)
   d. RefSeq nomenclature (2+6 format):
      N: experimentally determined
         NC_: complete genome
         NG_: incomplete genomic
         NM_: mRNA
         NR_: non-coding transcripts
         NP_: proteins
         NT_: intermediate genomic contigs
      X: computational prediction; model transcripts and proteins generated through genome annotation
         XM_: model mRNA
         XR_: model RNA
         XP_: model protein
   e. Assembled genomic regions (contigs)
   f. Chromosome records
7. EMBL Genome Reviews
   a. Curated secondary database representing complete genome sequences in DDBJ/EMBL/GenBank
   b. Standardized annotations
   c. Synchronized with UniProt; evidence-tagged
8. Protein sequence databases
   a. Information space: MS technology; assays for protein-protein interaction
   b. Derived from the translation of DNA (nucleotide) sequence databases
   c. Universal vs. specialized protein databases; computational vs. curated (enhanced)
   d. GenPept: basic (NCBI); multiple uncurated records
   e. RefSeq: curated, but limited staff
   f. UniProt: a combination of Swiss-Prot, TrEMBL, and PIR-PSD
      - UniProt Knowledgebase (UniProt): central access point for extensive curated protein information, including function, classification, and cross-references
      - UniProt Non-redundant Reference (UniRef): set of databases that combine closely related sequences into a single record to speed searches
      - UniProt Archive (UniParc): comprehensive repository reflecting the history of all protein sequences; no annotation; used internally
   g. Other protein databases

II. Mapping Databases
1. Two components of genomic mapping:
   a. Representational: measurement (different maps)
   b. Process of determining where the biological object (gene or disease locus) lies in the genome -> linking a molecular signature with a biological outcome
2. Major bioinformatics challenge: efficient mining and use of genomic data
3. Relationship between mapping and sequence: sometimes DNA sequence tracts can be thought of as ultra-high-resolution maps -> DNA sequence can be considered an annotation of the position
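
Worked example (not part of the original lecture): the Lecture 2 notes describe FASTA records (">" opens each definition line), the ACCESSION.VERSION convention, and the accession formats (1+5, 2+6, and the RefSeq prefixes). The Python sketch below shows one way these pieces could fit together. It makes several assumptions: the identifier is taken to be the first whitespace-separated token of the defline (real deflines can be more elaborate), RefSeq accessions are matched as two letters, an underscore, and six or more digits, and the sequences in EXAMPLE are made up for illustration. The function names are hypothetical and do not come from any library.

```python
import re
from typing import Dict, Iterator, List, Optional, Tuple

# Prefix meanings copied from the RefSeq nomenclature list in the notes.
REFSEQ_PREFIXES: Dict[str, str] = {
    "NC": "complete genome",
    "NG": "incomplete genomic",
    "NM": "mRNA",
    "NR": "non-coding transcripts",
    "NP": "proteins",
    "NT": "intermediate genomic contigs",
    "XM": "model mRNA",
    "XR": "model RNA",
    "XP": "model protein",
}


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (defline, sequence) pairs; a line starting with '>' opens a record."""
    defline: Optional[str] = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)  # emit the previous record
            defline, chunks = line[1:].strip(), []
        elif defline is not None:
            chunks.append(line)  # sequence may span several lines
    if defline is not None:
        yield defline, "".join(chunks)  # emit the final record


def split_accession_version(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'ACCESSION.VERSION'; the version is None when it is absent."""
    accession, dot, version = identifier.partition(".")
    return accession, int(version) if dot and version.isdigit() else None


def describe_accession(accession: str) -> str:
    """Classify an accession using only the formats listed in the notes."""
    # Assumption: RefSeq = two letters, underscore, six or more digits.
    if re.fullmatch(r"[A-Z]{2}_\d{6,}", accession):
        return "RefSeq: " + REFSEQ_PREFIXES.get(accession[:2], "unknown prefix")
    if re.fullmatch(r"[A-Z]\d{5}", accession):
        return "primary accession, 1+5 format"
    if re.fullmatch(r"[A-Z]{2}\d{6}", accession):
        return "primary accession, 2+6 format"
    return "unrecognized format"


# Made-up records: the sequences are invented for illustration only.
EXAMPLE = """>U54469.1 made-up sequence for illustration
ACGTACGTAC
GTTAGC
>NM_000001.2 another made-up record
atgcatgc
"""

records = list(parse_fasta(EXAMPLE))
for defline, seq in records:
    identifier = defline.split()[0]  # assumption: ID is the first token
    accession, version = split_accession_version(identifier)
    print(accession, version, describe_accession(accession), len(seq))
```

Keeping the accession and the version apart mirrors the point made in the notes: the accession stays fixed for citation, while the version number changes whenever the sequence does.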