MMG 301 LECTURES 1-10 STUDY GUIDE
This 32 page Study Guide was uploaded by Sydney on Sunday September 18, 2016. The Study Guide belongs to MMG 301 at Michigan State University taught by s. mulrooney in Fall 2015. Since its upload, it has received 3 views.
Date Created: 09/18/16
Lecture 1: Intro to Genomics
Central Dogma:
o The goal is to characterize the relationship between genes and phenotypes. The classical genetics approach can only deal with a few genes at a time.
o DNA is transcribed and translated into many copies, which gives rise to a great deal of variability. Mapping static DNA onto the dynamics of life is hard.
o Protein-coding genes account for only a small fraction of the genome (2-3% of the sequence).
o Protein-coding genes are transcribed into mRNA, after which RIBOSOMES translate the mRNA into polypeptide chains. Each nucleotide triplet codes for an amino acid (AA).
What is Genomics?
o “For the newly developing discipline of mapping/sequencing (including analysis of info), we have adopted the term GENOMICS.”
o “Born from the marriage of molecular and cell biology with classical genetics and is fostered by computer science.”
o “Genome is an irregular hybrid of gene and chromosome… first used in 1920 by Winkler.”
Molecular Biology: “For th
Lecture 02: The Human Genome
Why sequence full genomes?
o Morphological changes come from changes in the DNA (vertical mutations). Mutations don't always give morphological changes, and there are many mechanisms to analyze them.
o Convergent evolution: under the same selective pressure, species evolve the same morphology.
o “The wealth of genomic DNA has allowed for the discovery of new molecular markers for phylogenetic reconstruction…”
o “The dramatic increase in data set sizes has led, in many cases, to increased confidence in the inference of evolutionary relationships.”
o Before, only morphological information was available, which was limiting; with genetic information we can do much more.
o Mutations can be selected for if they provide an advantage, and this drives evolution. They occur at a random rate, so looking at the rate can tell you how much time has passed since two species diverged. HOWEVER, THE MUTATION RATE IS NOT CONSTANT because of environmental factors, etc.
o Evolution of the genome is not just a set of mutations; sometimes you get foreign DNA, drastic rearrangements, etc. The more data you have, the better you can resolve fine detail in genome history, compare many variants of the same species, and get an idea of sources of evidence such as recombination.
o Using this information you can reconstruct the PHYLOGENY OF LIFE.
CARL WOESE'S DISCOVERY OF ARCHAEA:
o Evolutionary information can be found in DNA sequences. He sequenced just a small piece: the gene coding for 16S rRNA, which is transcribed but not translated and is essential to the ribozyme activity of the ribosome (the ribosome translates RNA into protein).
o You can understand the phylogeny just by looking at the mutations (the rate of mutation is very slow, so you can go very far back in time).
o Purify and digest with RNase T1 (cuts at specific sites) so that every organism has a different barcode, then run 2D electrophoresis to differentiate the species by charge/size.
GENOMICS AND THE TREE OF LIFE:
o “Rare features in the genome's contents, such as sequence rearrangements or integrations of mobile genetic elements, offer some powerful alternative markers for addressing such challenging phylogenetic riddles.”
o “The extensive occurrence of lateral gene transfer in proks has raised concerns as to the validity of any gene-based phylogenies. ... Gene history in genomes that have undergone lateral gene transfer more likely resemble evolutionary networks and not trees.”
o “…With about 2 million known species of organisms and another 10,000 being discovered each year… less than 1% of known species have been part of any sort of phylogenetic analysis.”
THE GENETIC BASIS OF HYBRID VIGOUR (Heterosis):
o The new large-scale sequencing and phenotyping experiment on hybrid rice varieties leads to associations with genetic determinants whose mode of action was revealed.
o “The vast majority of genes that they identified with heterosis exhibit incomplete dominance.”
o “Huang and colleagues raise the possibility of accumulating superior alleles to eliminate the need for hybrids (…) The authors note that epistatic effects could complicate such an effort. Perhaps the best combining ability of superior hybrid performance involved the concerted effects of genes across the genome.”
Time for One-Person Trials (Precision medicine requires a different type of clinical trial that focuses on individual, not average, responses to therapy.)
o Every day, millions of people are taking medications that will not help them.
o “Researchers need to probe the myriad of factors (genetic and environmental, among others) that shape a person's response to a particular treatment.”
o “Erbitux (cetuximab) improves the survival of people with colorectal cancer whose tumor cells carry a mutated EGFR gene but not a mutated KRAS gene.”
o “By looking for commonalities across multiple N-of-1 studies… researchers should be able to draw inferences about the effectiveness of an intervention in certain subsets of the population….”
Variety of Life (an effort to sequence thousands of people's genomes reaches the end of the beginning)
o “...the completion of the 1000 Genomes Project, the largest work yet to sequence the genetic information of hundreds of individuals in an attempt to tune into Mother Nature's hum of human variation.”
o Tracking the relationships between genetic variation and human disease helps develop effective treatments.
o To exploit the gathered information, more projects need to link and cross-reference it to clinical information and well-characterized phenotype data sets.
Full Genome Sequencing Paved the Way from Spores to a Suspect
o The key to understanding the investigation is that the anthrax used in the attacks didn't have a single, uniform genetic makeup.
o “Standard sequencing, which would require the DNA from thousands of spores, would have resulted in a ‘consensus sequence’ for the spores, in which such rare mutations were simply drowned out.”
o “They then searched for colonies that looked different from the majority; ... they set out to find the mutations that made these colonies different. They included single-nucleotide polymorphism, a change of a single base pair, and tandem repeats (…)”
Lecture 05:
History:
Discussion about the possibility of sequencing the human genome began in the mid 1980s. Controversial objections* were:
o 1) Big biology was bad biology
o 2) Why sequence the junk (stuff between genes)?
o 3) Impossible to do this (however, automation was changing this)
Work on the first microbial genomes was considered to be a test of feasibility for higher eukaryotic genomes.
o Escherichia coli K12 (bacterium), by hierarchical sequencing (started ~1992, finished 1997)
o Haemophilus influenzae (bacterium), by whole genome shotgun sequencing, 1995
o Saccharomyces cerevisiae (baker's yeast, a eukaryote), 1996
o Drosophila melanogaster (fruit fly) genome sequenced in 2000
o Human work began in about the same time period, by the International Human Genome Sequencing Consortium and by Celera (a private corporation)
• The public (hierarchical) and private (WGS) human genome sequencing competition was bitter.
• Two draft sequences were simultaneously published in 2001 (huge issues of Science and Nature!).
• New versions (builds) are constantly being released to clean up gaps, errors, etc. Genome browsers show different “builds” that have different errors and place genes at different base positions! But the human genome seems to have stabilized starting with build 36.
• Debate continues over approaches, but most sequencing is now shotgun for higher eukaryotes as well as unicellular organisms, with some hierarchical sequencing, particularly for genomes with a large number of transposon sequences (e.g., corn, wheat).
• Only three major public sequencing centers survive in the U.S.: the Broad Institute (M.I.T. and Harvard), Baylor University, and Washington University in St. Louis (there are also a few large commercial entities).
Hierarchical Sequencing:
• Clone large pieces of DNA (BACs, etc.).
• Prepare restriction enzyme maps; computer programs match overlapping clones.
• Produce a minimal tiling path.
• Shotgun sequence and assemble individual BACs.
o Bacterial (and yeast) genetics have been critical to the advancement of the genomics of eukaryotic, as well as prokaryotic, genomes. In particular, large genomes are often cloned as fragments (libraries) in E. coli plasmids.
o With big genomes we want to handle reasonably large pieces so that a whole gene can be examined on a single plasmid, but this isn't easy. These large fragments are often cloned (recombinant DNA) in E. coli plasmid vectors or in yeast. BACs aren't really chromosomes, just large plasmids. A BAC can carry up to 250 kb of any DNA you insert, and it has high stability (when you get the DNA back out of the cell, it hasn't changed or rearranged).
o BAC “contig” physical maps are generated from high-throughput analysis of BACs. A CONTIG is a contiguous segment of the parent genome which is completely covered by a collection of recombinant DNA clones or DNA sequence reads.
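To make the contig definition above concrete, here is a minimal Python sketch. It assumes, purely for illustration, that each clone is already described by hypothetical (start, end) coordinates on the parent genome; in a real project those positions are not known up front and have to be inferred from restriction maps or sequence overlaps. The function name `contigs_from_clones` is made up for this example.

```python
def contigs_from_clones(clones):
    """Merge overlapping (start, end) clone intervals into contigs.

    `clones` is a list of hypothetical (start, end) coordinates on the
    parent genome. A contig is a maximal stretch covered without a gap.
    """
    contigs = []
    for start, end in sorted(clones):
        if contigs and start <= contigs[-1][1]:
            # This clone overlaps (or abuts) the current contig: extend it.
            contigs[-1][1] = max(contigs[-1][1], end)
        else:
            # A gap separates this clone from the previous contig.
            contigs.append([start, end])
    return [tuple(c) for c in contigs]


# Three clones: the first two overlap, the third is isolated.
print(contigs_from_clones([(0, 150), (100, 250), (400, 500)]))
```

Under these assumptions the first two clones collapse into one contig, and the uncovered stretch between 250 and 400 is a gap.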
Original Method Required Genomic DNA Cloning:
• Purified genomic DNA is cut into many pieces with a restriction enzyme.
• Each piece is ligated into a vector, a piece of DNA (usually modified from a plasmid or virus found in nature) that has (1) a replication origin, (2) a cloning site, and (3) a selectable marker such as antibiotic resistance.
• The ligated DNA is transferred into individual bacteria.
• The bacteria are grown as colonies on an agar plate.
• DNA from each colony (clone) can then be purified for sequence analysis.
2nd type of cloning: cDNA Clones
• Individual cDNA clones are selected from libraries and sequenced to determine open reading frames (computer programs, such as ORF Finder, can help to identify these).
• Eukaryotic genomic clones are often not useful for finding ORFs with programs such as ORF Finder because of the presence of introns.
• Random single-read sequencing of a cDNA clone produces an expressed sequence tag (EST).
• The sequencing of large numbers of ESTs can be used to discover genes that are expressed in a given tissue.
• Open reading frames or raw EST sequences are then mapped onto complete, or nearly complete, eukaryotic genomes with the UCSC Genome Browser BLAT tool to determine exon/intron structures for genes. BLAT is designed to “skip over” introns by aligning consecutive sequences in a cDNA with a local alignment strategy; alternatively, you could use Spidey or Splign from NCBI, which do similar things.
• cDNA clones do not contain introns or transcription (trx) regulatory elements.
Minimal Tiling Path - Hierarchical Sequencing Method:
• Once the arrangement of the BAC inserts has been determined, each one is individually sequenced (usually shotgun!).
• [Figure: BAC clones along a minimal tiling path] Blue BACs are ones that would not be used because they carry less unique sequence than the purple clone.
Hierarchical vs. WGS Assembly
Main issue is handling repetitive regions (hierarchical is better) vs.
speed of sequencing (WGS is better).
• Both methods handle relatively small repeats (e.g., SINEs and LINEs) well, except current 454 and other next-gen sequencing if only single reads are used: those reads are only up to a few hundred bp, whereas Sanger sequencing reads are typically 700 bp in length. The paired-end strategy (300-500 bp) is of some help for these short-read technologies (more on this later).
• However, some plant genomes have so many simple repeats (the corn genome is 85% repeats) that any assembly is extremely difficult by the whole genome shotgun method.
• Hierarchical does much better (but is far from perfect) at determining the correct assembly of regions of the genome that are recently duplicated (these are called segmental duplications). However, hierarchical is much more labor intensive and, therefore, more expensive!
• Approximately 5% of the human genome is segmentally duplicated; WGS does very poorly in assembling these regions.
Annotation: Contents of the Genome
• Two million recognizable repeats (perhaps 50% of the genome).
• About 21,000 protein-encoding genes, half of which had never been seen before the genome sequence was completed and have no recognizable sequence identity with known genes.
Evidence for genes not previously known to exist:
o 1. Expressed sequence tags (ESTs) (or alternatively, completely sequenced cDNAs) identified in cDNA libraries. Expression databases record measurements of mRNA levels, usually via ESTs (expressed sequence tags: short terminal sequences of cDNA synthesized from mRNA), describing patterns of gene transcription. Proteomics databases record measurements on proteins, describing patterns of gene translation.
Comparisons of expression patterns give clues to (1) the function and mechanism of action of gene products; (2) how organisms coordinate their control over metabolic processes in different conditions, for instance yeast under aerobic or anaerobic conditions; (3) the variations in mobilization of genes in different tissues, or at different stages of the cell cycle, or of the development of an organism; (4) mechanisms of antibiotic resistance in bacteria and consequent suggestion of targets for drug development; (5) the response to challenge by a parasite; (6) the response to medications of different t
o 2. Computer-predicted genes based upon ORFs bordered by expected consensus splice sequences.
o 3. Conservation of long open reading frames across species.
DNA Sequencing Quality Scores:
• Based upon the probability P that a base is called incorrectly.
• The reported statistic is Q = -10 x log10(P). For example, a quality score of 20 corresponds to an error rate of 10^-2, or 1 miscalled base per 100 bases sequenced.
• Usually, a sequence is considered to be very good if the score is 40 or better (i.e., an error rate of only 1 in 10,000 bases called).
• Only some genomes (e.g., dog and chicken) have associated quality scores that are available in the genome browsers. Ironically, the human genome sequence does not have quality scores available because, being one of the first genomes sequenced, an easy way to implement quality score reporting had not yet been established. Labs were required to provide “certificates” that their sequences were of very high quality before they were accepted for publication.
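The quality score arithmetic above can be checked with a short Python sketch. The function names `phred_from_error` and `error_from_phred` are made up for this illustration; only the formula Q = -10 x log10(P) comes from the notes.

```python
import math


def phred_from_error(p):
    """Quality score Q for a per-base error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def error_from_phred(q):
    """Per-base error probability implied by quality score q."""
    return 10 ** (-q / 10)


for q in (20, 30, 40):
    p = error_from_phred(q)
    print(f"Q{q}: about 1 miscalled base per {round(1 / p)} bases")
```

Each 10-point increase in Q corresponds to a tenfold drop in the error rate, which is why Q40 is a much stricter standard than Q20.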
o Note: for some sequencing projects, raw unassembled traces can be found in the NCBI Trace Archive Database (however, this service might be discontinued at some point due to budget constraints).
• You can see actual traces.
• Quality scores are provided.
• Hyperlinks between mate pairs are provided.
• You can find homologous regions using BLAST or cross-species MegaBLAST.
The Human Genome Sequence is Now of Very High Quality
• Only a handful of gaps remain in the unique portions of the human reference genome (ignoring highly repetitive regions such as those near centromeres), and overall Phred quality scores are reported to be >40 by the sequencing centers.
• For other organisms (e.g., dog and chicken) the assemblies have thousands of gaps, placement errors and collapsed segmental duplications. Unassembled pieces that the assembly programs could not place in the genome are dumped into a separate file location (called chromosome “Un”, for unassembled, in the UCSC Genome Browser).
• Nevertheless, the non-human sequences are still very useful, but researchers must use them with some caution and not assume correct assembly.
Contigs and Supercontigs (more commonly known as scaffolds)
• Mate pairs: if the DNA between them crosses a gap, the orientation and position of two contigs relative to one another are known, thus forming a supercontig.
• Reasons for Gaps in Genomic Sequences
1. Gaps may be due to lack of depth of coverage. Coverage follows a statistical distribution called the Poisson distribution (a discrete distribution for counts of random events, here the number of reads covering a given base).
o a) For Sanger sequences, a rule of thumb is that 5X (5-fold) coverage results in “half” of the genome being assembled into large contigs, and 8X coverage means nearly all of the genome can be assembled into large contigs (“full” coverage; 95% of the genome). The numbers in this rule apply to Sanger sequencing.
o b) Shorter reads from next-gen sequencing require higher coverage.
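The coverage arithmetic and the Poisson statement above can be sketched in Python. This is an idealized model that assumes reads land uniformly at random; the function names are made up for the example. Note that it only estimates the fraction of bases hit by no read at all. The rule of thumb in the notes is about assembling large contigs, which also needs reads to overlap one another, so it is a stricter requirement than simply being covered.

```python
import math


def coverage(num_reads, read_length, genome_size):
    """Average depth of coverage: total bases sequenced / genome size."""
    return num_reads * read_length / genome_size


def fraction_uncovered(depth):
    """Poisson probability that a given base is covered by zero reads."""
    return math.exp(-depth)


# Hypothetical run: 30 million reads of 500 bp on a 3 billion bp genome.
depth = coverage(30_000_000, 500, 3_000_000_000)
print(f"depth = {depth}X")
for c in (1, 5, 8, 30):
    print(f"{c}X: fraction of bases with no read = {fraction_uncovered(c):.2e}")
```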
With next-gen sequencing (covered in Dr. Dufour's portion of the course), ≥30X coverage is often considered full coverage because the reads are relatively short. (Note: If a genome is 3 billion bp, sequencing to a depth of 1X means you have sequenced 3 billion bp, 5X is 5 x 3 billion = 15 billion bp, etc.)
2. Gaps may be due to unclonable regions (E. coli cannot replicate some DNA sequences).
3. DNA may be difficult to sequence (this is particularly true of regions that have a very high G + C percentage).
4. Gaps may be due to a region that cannot be assembled (e.g., a repetitive region).
o Mate pairs allow one to skip over gaps and still determine the relative placement and orientation of contigs (which are continuous sequences). Scaffolds (supercontigs) usually have gaps.
Transposon Repetitive Elements Are Abundant in the Human Genome
o Note the many transposons (particularly those called SINEs and LINEs) that are present in just this one region of the human genome. Details on transposition will be covered in a later lecture.
Formation of an Interchromosomal Segmental Duplication
o The two copies of the segmental duplication can have >99% sequence identity, which can cause genome sequence assembly errors, particularly with whole genome shotgun sequencing.
Collapsed Repeat
o Note: This is primarily a problem in whole genome shotgun sequencing, but it can also be a minor problem in hierarchical sequencing (just less common).
Tandem Duplications are Sometimes Collapsed in Assemblies:
o “The dog APOBEC gene region is collapsed; reduced interval between bordering genes and low quality scores are hints.”
o “The human genome was determined by hierarchical sequencing (the repeats are not collapsed), but the dog by whole genome shotgun sequencing.
In this figure, low quality scores and reduced genome size indicate the location of a collapsed repeat.”
Comparison of Prokaryote and Eukaryote Genome Assembly:
• The genome assembly errors caused by repetitive elements are much less severe in prokaryotes because there are very few repetitive sequences in those genomes. However, there are generally at least a few repetitive elements in most prokaryote genomes, and so there can still be some assembly errors due to repeats.
• More information will be presented later in the course on methods to reduce these and other types of errors in genome sequences.
“Resequencing” The Human Genome
Thousands of human genomes have been re-sequenced using high-throughput next generation sequencing methods. Many more will follow.
• Re-sequencing does not require new (“de novo”) assembly of the genomes; rather, sequences are compared to a reference genome and the relative order of sequenced fragments is assumed to be identical.
o Advantage: Re-sequencing is an easier and less expensive task than determining the de novo genome sequence for a given eukaryotic species.
o Disadvantage: Sequences for repetitive DNA (e.g., SINEs and LINEs) are left out of re-sequenced genomes because it is difficult or impossible to map individual repeats to the correct location using short-read sequences.
• The goal is to identify DNA variations that may give insights into variable phenotypes (e.g., genetic predisposition to diseases).
o Note that re-sequencing a reference genome, as is often done with newer technologies, is technically much easier than de novo sequencing of a genome. Re-sequencing allows one to discard non-aligning reads (likely errors) and allows for the placement of small sequence contigs using the reference genome alignment. Repetitive sequences (non-unique reads) are often ignored during re-sequencing but are important in sequencing a reference genome.
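The repeat problem described above for re-sequencing can be illustrated with a toy Python sketch. It uses exact string matching only, which is a deliberate simplification: real read mappers tolerate mismatches and indels and use indexed references. The reference string and the function name `map_read` are invented for the example.

```python
def map_read(read, reference):
    """Return every 0-based position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Step forward by one so overlapping matches are also reported.
        start = reference.find(read, start + 1)
    return positions


# Toy reference containing a short tandem repeat (ATATATAT).
reference = "TTGACCATATATATGCGGA"

print(map_read("GACC", reference))  # read from a unique region
print(map_read("ATAT", reference))  # read from inside the repeat
```

The unique read has one possible placement, while the read from the repeat has several equally good ones. A mapper cannot tell which is correct, which is why such non-unique reads are often dropped in re-sequencing.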
Longer read technologies should improve handling of repeats, but perhaps at the cost of higher error rates.
Lecture 06: DNA Sequencing Technologies
(The coding regions of genes are open reading frames.)
- Describe two ways to sequence a genome.
• Shotgun sequencing: DNA is broken up randomly into numerous small segments, which are sequenced using the chain termination method to obtain reads. Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing. Computer programs then use the overlapping ends of different reads to assemble them into a continuous sequence. This is called the shotgun method. It does not require any prior knowledge of the genome and so can be carried out in the absence of a genetic or physical map.
“Next Generation Sequencing”
• First generation:
  • Requires cloning and PCR amplification (can introduce biases)
  • Chain termination / capillary electrophoresis (96 reactions per run)
• Second generation:
  • No cloning: DNA libraries are physically separated on a solid surface
  • Cell-free amplification allows sequencing of toxic sequences
  • No electrophoresis: DNA is sequenced as it is synthesized
  • High-resolution imaging of millions of parallel sequencing reactions
  • Shorter sequence reads and higher raw read error rates
  • Much higher throughput
• Third generation:
  • Single molecule sequencing
  • No DNA synthesis
Emulsion PCR
1. The DNA is fragmented and adaptors are ligated to the ends to generate the library.
2. DNA fragments are added to an oil mixture containing millions of beads.
3. DNA is amplified using emulsion PCR, resulting in many copies of the fragment.
4. Beads are deposited in wells for sequencing.
Pyrosequencing
• Pyrophosphate is released when a nucleotide is incorporated and serves as a substrate for the generation of ATP by sulfurylase.
• Luciferase then uses ATP and luciferin to produce light.
• Most errors result from long homopolymers (seven or more identical bases in a row).
Solid phase amplification
• Produces 100-200 million spatially separated clusters, providing free ends to which a universal sequencing primer can be hybridized to initiate the sequencing reaction.
Sequence by Synthesis:
• Add fluorescent nucleotides
• Wash
• Scan and detect specific fluorescence
• Remove the 3' blocking group (reversible termination) and the fluorescent label. Wash and repeat.
o Synthesis dephasing causes errors as the reads get longer.
3' blocked reversible terminators
• The DNA polymerase incorporates just one fluorescently modified nucleotide.
• Unincorporated nucleotides are washed away and a four-color image is acquired.
• A cleavage step (a reducing agent) removes the terminating group, restoring the 3'-OH group, and removes the fluorescent dye.
Dual-indexed paired-end sequencing
• Paired reads are useful for the analysis because they provide very accurate read alignment and thus improve coverage and sequence accuracy.
Sequencing by ligation
• Uses an oligonucleotide probe in which two interrogation bases are associated with a particular dye (e.g., AA, CC, GG, TT are encoded with a blue dye).
o SBL is another cyclic method that differs from CRT in its use of DNA ligase and either one-base-encoded probes or two-base-encoded probes. In its simplest form, a fluorescently labelled probe hybridizes to its complementary sequence adjacent to the primed template. DNA ligase is then added to join the dye-labelled probe to the primer. Non-ligated probes are washed away, followed by fluorescence imaging.
• There are 16 possible dinucleotide combinations; each dye is associated with 4 of them.
• “1,2-probes” indicates that the first and second nucleotides are the interrogation bases.
• The remaining bases consist of either degenerate or universal bases.
• The linkage between the fifth and sixth nucleotides of the probe sequence can be cleaved with silver ions.
o Upon annealing of a universal primer, a library of 1,2-probes is added.
o Four-color imaging.
o The ligated 1,2-probes are chemically cleaved with silver ions to generate a 5'-PO4 group.
o The cycle is repeated 9 times.
o The extended primer is stripped and four more ligation rounds are performed.
Color space decoding
• 16 possible base combinations are represented by only 4 colors.
o With two-base-encoded probes, the fluorescent signal or colour obtained during imaging is associated with four dinucleotide sequences having a 5'- and 3'-base. Colour space is the sequence of overlapping dinucleotides that codes four simultaneous nucleotide sequences. Alignment with a reference genome is the most accurate method for translating colour space into a single nucleotide sequence.
• The color sequence does not identify specific nucleotides.
• All possible sequence combinations need to be decoded.
o Sequencing by ligation has one of the lowest error rates (~0.01) due to 2-base encoding. It is, however, still limited by short read lengths (~35 nt).
Single molecule real-time DNA sequencing
Circular consensus sequencing
o The DNA template is circularized by the use of “bell”-shaped adapters.
o As long as the polymerase is stable, this allows for continuous sequencing of both strands.
o Advantages: No amplification required. Extremely long read lengths (~2.5 kb).
o Disadvantages: High error rates.
Summary of Sequencing Platforms
Nanopore Sequencing
o Another method coming along now. A membrane of some sort has a pore through which the DNA is run. The size of each base as it goes through the pore changes the electric potential across the membrane. DNA can be fed through in linear fashion with a DNA polymerase, or withdrawn. Sometimes proteins are used for the pores, sometimes physical grafts.
o Can get lots of sequence, with very long reads.
o Error rate uncertain.
o Has potential to be cheap.
Lecture 07: Genome Assembly:
- ASSEMBLY VS. MAPPING!! For assembly, no reference is needed (de novo genomes, transcriptomes)
- but it scales poorly and needs a big computer
- repeats get in the way, and it needs higher coverage.
- SHOTGUN SEQUENCING and assembly: randomly fragment and sequence from DNA, then reassemble computationally.
- Assembly and mapping (and variations thereof) are the two basic approaches to dealing with next-gen sequencing data.
A. Mapping: the goal is to assign each read to location(s) within a reference genome (required for resequencing, ChIP-seq and mRNAseq).
B. You want GLOBAL, not local, alignment (you do not want matches within the read, like BLAST would produce).
- 2 BASIC ASSEMBLY APPROACHES:
1) Overlap/layout/consensus: used for long reads, esp. Sanger-based assemblies
2) De Bruijn k-mer graph (used because it is memory efficient)
4 main challenges of DE NOVO SEQUENCING
1) REPEATS (overlaps don't place sequences uniquely when there are repeats present)
2) Low coverage (coverage = # of reads x read length / genome size)
3) Errors: these introduce breaks in the construction of contigs.
4) Variation in coverage: transcriptomes and metagenomes, as well as amplified genomic DNA. This challenges the assembler to distinguish between erroneous connections (e.g., repeats) and real connections.
DE NOVO GENOME ASSEMBLY: determination of a full-genome sequence without using a known reference sequence from an individual of the species to avoid the assembly step.
GENOME ASSEMBLY WITH LONG READS:
[Figure: two graph representations of the same reads; in one the edges are pairwise alignments, in the other the edges are k-mers]
“A straightforward method for assembling reads into longer contiguous sequences [...] uses a graph in which each read is represented by a node and overlap between reads is represented by an arrow (called a ‘directed edge’) joining two reads.” “[...] a ‘Hamiltonian cycle’ in our graph [...]
is a path that travels to every node exactly once and ends at the starting node, meaning that each read will be included once in the assembly.”
o Mapping long reads is different from mapping short reads because
  1) the volume of data is traditionally much less (1 million 454 reads vs. 200 million Illumina reads)
  2) long reads are more likely to have INSERTIONS OR DELETIONS in them
o BLAST is NOT THE RIGHT TOOL for long reads because it requires a query sequence to contain the same 11-mer as a database sequence before it attempts further alignment.
o Relies on DEEP COVERAGE, read trimming and paired ends (is problematic).
- Overlap/layout/consensus:
1. Calculate all overlaps
2. Cluster based on overlap
3. Do a multiple sequence alignment
GENOME ASSEMBLY WITH SHORT READS: SCALABLE ASSEMBLY WITH DE BRUIJN GRAPHS
“Instead of assigning each k-mer contained in some read to a node, we will now assign each such k-mer to an edge.”
“First, form a node for every distinct prefix or suffix of a k-mer, meaning that a given sequence of length k–1 can appear only once as a node of the graph. Then, connect node x to node y with a directed edge if some k-mer has prefix x and suffix y, and label the edge with this k-mer.”
“[...] finding an Eulerian cycle that visits all edges of a graph exactly once is much easier.”
“We took for granted that we can generate all k-mers present in the genome, that all k-mers are error free, that each k-mer appears at most once in the genome and that the genome consists of a single circular chromosome.”
o De Bruijn k-mer graphs are used because they are more memory efficient.
o With shorter reads, you need a different approach: change the structure of the graph and make a de Bruijn graph, where you assign each k-mer to an edge. You can then construct a graph based on the overlaps, in which the reads are edges.
o This gives a way to transform your reads into a graph structure and then use algorithms to solve the graph structure for the sequence.
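The de Bruijn construction quoted above can be sketched in a few lines of Python. This is a minimal illustration under the same idealized assumptions the quote lists (every k-mer present, no sequencing errors), and it walks a linear Eulerian path rather than a cycle on a circular chromosome. The function names are made up for the example, and the path search is a standard Hierholzer-style walk, not necessarily the method used by any particular assembler.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """Each k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer-style walk that uses every edge exactly once."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    # Start at the node with more outgoing than incoming edges, if any;
    # for a cycle every node is balanced and any node will do.
    start = next((n for n in graph if len(graph[n]) > in_deg[n]), None)
    if start is None:
        start = next(iter(graph))
    edges = {n: list(t) for n, t in graph.items()}  # copy we can consume
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(kmers):
    """Spell the sequence along the Eulerian path."""
    path = eulerian_path(de_bruijn(kmers))
    return path[0] + "".join(node[-1] for node in path[1:])


print(assemble(["ATG", "TGC", "GCG", "CGT", "GTA"]))
```

With a repeat-free toy input like this one there is a single Eulerian path, so the sequence is recovered unambiguously. If the 3-mers of ATGGCGTGCA are used instead, the nodes TG and GC are each visited twice, and both ATGGCGTGCA and ATGCGTGGCA use every k-mer exactly once; the graph alone cannot say which is the real genome, which is the repeat problem these notes keep returning to.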
You go through every edge at least once (another problem).
o An Eulerian cycle does this, to try to reconstruct the sequence. When you sequence your genome, every part should be represented by a read. HOWEVER, SOME ARE MISSING!!
o A eukaryotic genome is a very large genome with many repeats.
o De Bruijn graph: each edge denotes a sequence read of a fragment of fixed length; the source node of this edge is the prefix string of the read (omitting the last nucleotide), and the destination node of this edge is the suffix string of the same read (omitting the first nucleotide). One panel is a pool of representative short fragments. Another panel denotes unique sequence prefix or suffix segments of a certain length found in the original, longer segments. This allows the problem to become a fragment assembly problem.
o Effect of read length on the E. coli K12 de Bruijn graph: E. coli K12 has a small circular genome of 4.6 Mb with low complexity and not many sequence repeats. Assuming an error-free sequencing method, reads of 1,000 bases (Sanger) do not guarantee a unique Eulerian cycle to assemble the full genome. The task is even more challenging
o Instead of putting your sequence in as a node, you put it in as an edge. You create nodes based on this edge as 2 nodes that overlap as 2-1.
o For dealing with short reads, the primary method is the de Bruijn method.
Paired-end sequences with long inserts
o De novo means that you take a genome that has never been sequenced before, sequence everything, and try to reconstruct it from the sequences.
o -did an analysis of the human genome with
o You lose much of the complexity because anything with a direct repeat will be collapsed and not represented in the genome. The computer also gets rid of the duplicates.
o If you were to do it nowadays, you could compare it back to the reference genome.
LIMITATIONS OF NEXT-GENERATION GENOME ASSEMBLY:
One of the most commonly used technologies routinely produces read lengths of 75–100 base pairs (bp) from libraries with insert sizes of 200–500 bp. [...] The predominant assembly methods for short reads are based on de Bruijn graph and Eulerian path approaches, which have difficulty in assembling complex regions of the genome. [...] Paired-end libraries with long inserts help ameliorate this bias. De novo sequence assemblies [...] require particular scrutiny and additional validation because of their tendency to enrich for contamination artifacts.
“Any WGS-based de novo sequence assembly algorithm will collapse identical repeats, resulting in reduced or lost genomic complexity [...] we conclude that 99.4% of true pairwise segmental duplications were absent. At the gene level, only 56% of the genes have sufficient representation in the assembly.”
o Basically, de novo assembly with de Bruijn graphs is missing a lot of the DNA!!!
o Why does the shotgun sequencing method require that the number of nucleotides sequenced is several times larger than the size of the genome? Answer: The redundancy is required because the clones for the sequencing project are randomly generated and sequenced; thus, to ensure complete coverage of the genome it is necessary to sequence a large number of nucleotides.
o Shotgun sequencing of complex eukaryotic genomes can result in segments of DNA, possibly including genes or parts of genes, being omitted from the draft sequence. There is also a greater chance that sequence errors will not be recognized.
Repetitive DNA in the Human Genome
o For every chromosome, at least 50% is repeats (mostly LINEs), so there is lots of redundancy.
Assembly error caused by Repeats
o The difficulties in applying the shotgun method to a large molecule that has a significant repetitive DNA content mean that this approach cannot be used on its own to sequence a eukaryotic genome. Instead, a genome map must first be generated.
A genome map provides a guide for the sequencing experiments by showing the positions of genes and other distinctive features. Once a genome map is available, the sequencing phase of the project can proceed in either of two ways (Figure 3.3).
Paired-ends and mate-pairs (Resolving graph complexity example)
o Paired-end and mate-pair sequences provide a physical map of the sequence that is very valuable during the assembly process (Ekblom and Wolf).
o Mate threading resolves “frayed rope” patterns caused by repeats by separating paths based on mate-pair reads.
o Inconsistent paths are rejected based on mate-pair reads and insert size.
Genetic Mapping Before Sequencing:
• Mendel’s law of independent assortment: alleles for separate traits are passed to offspring independently of each other (because of the independent assortment of chromosomes during meiosis).
• Genetic linkage: some traits are inherited together (not independently), indicating a physical connection (two alleles on the same chromosome).
• Crossing over: linked traits are separated with variable frequencies, indicating a genetic distance between alleles (caused by chromosome recombination).
o Genes as markers: To be useful in genetic analysis, a gene must exist in at least two forms, or alleles, each specifying a different phenotype, an example being tall or short stems in the pea plants originally studied by Gregor Mendel. To begin with, the only genes that could be studied were those specifying phenotypes that were distinguishable by visual examination.
o The goal of mapping is to assign each individual read to a location within the genome, so you map each read separately.
o Volume of data, errors in reads, quality scores, repeat elements, multi-copy sequences, SNPs/SNVs, INDELS, AND TRANSCRIPTOME SPLICING make mapping challenging.
PHYSICAL MAPPING - RESTRICTION MAPPING:
• Digest DNA with restriction enzymes.
• Size fragments using gel electrophoresis.
• Use different combinations of restriction enzymes to work out the relative positions.
o These two limitations of genetic mapping mean that for most eukaryotes a genetic map must be checked and supplemented by alternative mapping procedures before large-scale DNA sequencing begins. A plethora of physical mapping techniques has been developed to address this problem, the most important techniques being:
• Restriction mapping, which locates the relative positions on a DNA molecule of the recognition sequences for restriction endonucleases.
• Fluorescent in situ hybridization (FISH), in which marker locations are mapped by hybridizing a probe containing the marker to intact chromosomes.
• Sequence tagged site (STS) mapping, in which the positions of short sequences are mapped by examining collections of genomic DNA fragments by PCR and/or hybridization analysis.
OPTICAL RESTRICTION MAPPING:
FLUORESCENT IN SITU HYBRIDIZATION:
o FISH uses a fluorescently labeled DNA fragment as a probe to bind to an intact chromosome. The binding position can be determined and this information used to create a physical map of the chromosome.
Sequence Tagged Site (STS) Mapping:
o At present the most powerful physical mapping technique, and the one that has been responsible for generation of the most detailed maps of large genomes, is STS mapping. A sequence tagged site, or STS, is simply a short DNA sequence, generally between 100 bp and 500 bp in length, that is easily recognizable and occurs only once in the chromosome or genome being studied. To map a set of STSs, a collection of overlapping DNA fragments from a single chromosome or from the entire genome is needed.
o An STS is a short DNA sequence, between 100 and 500 bp in length, that occurs only once in the chromosome or genome being studied.
o The data from which the map will be derived are obtained by determining which fragments contain which STSs.
How has PCR made the analysis of RFLPs much faster and easier? What was required to map RFLPs prior to the utilization of PCR? Answer: short generation time, large number of offspring, easily scored phenotypes, and such like. It is instructive to consider to what extent genomics has added new criteria to this list: is a complete genome sequence a useful feature of an organism to be used in studies of heredity?
Lecture 08: Functional Genomics/Protein Coding Structure/Transcriptome/Proteome
Functional Genomics:
o Functional genomics integrates information from various molecular methodologies to gain an understanding of how DNA sequence is translated into complex information in a cell (DNA → RNA → Proteins → biological process).
o The aim of functional genomics studies is to understand the complex relationship between genotype and phenotype on a global (genome-wide) scale.
o The promise of functional genomics is to expand and synthesize genomic and proteomic knowledge into an understanding of the dynamic properties of an organism at cellular and/or organismal levels. This would provide a more complete picture of how biological function arises from the information encoded in an organism's genome.
The possibility of understanding how a particular mutation leads to a given phenotype has important implications for human genetic diseases, as answering these questions could point scientists in the direction of a treatment or cure.
Protein coding sequences
o Nucleotide sequences can be translated into amino acid sequences from 6 reading frames (3 on each strand).
o Translation starts from a ‘start codon’, usually AUG (E. coli uses 83% AUG, 14% GUG, 3% UUG).
o Translation stops at a ‘stop codon’ (UAG, UAA, or UGA).
o mRNA molecules of the transcriptome direct protein synthesis of the proteome (tRNA being the adaptor molecule that bridges the mRNA and the synthesized polypeptide).
o The genetic code was determined early on by seeing which amino acids associate with which RNA sequences in a purified ribosome assay.
o The genetic code is universal for the vast majority of genes; however, MITOCHONDRIAL GENOMES, for example, use a nonstandard code (probably corrected by RNA editing before translation).
o Genes that code for proteins comprise open reading frames (ORFs) consisting of a series of codons that specify the amino acid sequence of the protein that the gene codes for (Figure 5.1).
o Simple ORF scans are less effective with DNA of higher eukaryotes: although ORF scans work well for bacterial genomes, they are less effective for locating genes in DNA sequences from higher eukaryotes. This is partly because there is substantially more space between the real genes in a eukaryotic genome (for example, approximately 62% of the human genome is intergenic), increasing the chances of finding spurious ORFs. But the main problem with the human genome and the genomes of higher eukaryotes in general is that their genes are often split by introns (Section 1.2.3), and so do not appear as continuous ORFs in the DNA sequence.
o The key to the success of ORF scanning is the frequency with which termination codons appear in the DNA sequence.
A simple bacterial gene promoter.
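The ORF scanning idea from the notes above (find a start codon, read codon by codon until a stop codon, and repeat in all six reading frames) can be sketched in a few lines of Python. This is a simplified illustration only: it assumes ATG is the only start codon (the notes mention that GUG and UUG are also used in E. coli), it works on DNA letters rather than RNA, and it knows nothing about introns. The names find_orfs and min_codons are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG"}  # assumption: alternative starts (GTG, TTG) are ignored

def reverse_complement(dna):
    """Reverse complement, used to read the three frames of the other strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def orfs_in_frame(dna, frame, min_codons):
    """ORFs (start codon through stop codon) in one reading frame."""
    found = []
    start = None
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon in START_CODONS:
            start = i                      # open a candidate ORF
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:
                found.append(dna[start:i + 3])
            start = None                   # close it and keep scanning
    return found

def find_orfs(dna, min_codons=2):
    """Scan all six reading frames; returns (strand, frame, orf) tuples."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for orf in orfs_in_frame(seq, frame, min_codons):
                results.append((strand, frame, orf))
    return results

if __name__ == "__main__":
    for hit in find_orfs("CCATGAAATTTTAGGG"):
        print(hit)
```

On a real bacterial genome you would set min_codons much higher (something like 100) so that short random ORFs are thrown out. This is why the frequency of stop codons matters for the method.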
• Information about promoter structure and protein coding sequences can be used to identify genes throughout the genome.
• High-throughput DNA-protein interaction mapping helps characterize the regulatory program of the genome.
The Transcriptome:
• Messenger RNAs (mRNAs) are translated into proteins.
• Non-coding RNAs are not translated into proteins:
• Ribosomal RNAs (rRNAs) are the most abundant RNAs in the cell, making up over 80% of the total in actively dividing bacteria.
• Transfer RNAs (tRNAs) are involved in protein synthesis by carrying amino acids to the ribosome to translate the mRNA.
• Small regulatory RNAs (srRNAs) are involved in diverse regulatory functions. microRNA and siRNA (short interfering RNA) REGULATE EXPRESSION OF INDIVIDUAL GENES.
• The transcriptome is the set of all the transcribed RNAs at a particular time in a cell.
• Analyses of transcription profiles can reveal how cells regulate gene expression to respond to changing conditions or go through developmental programs.
o Every cell receives part of its parent’s transcriptome when it is first brought into existence by cell division, and maintains a transcriptome throughout its lifetime. Transcription of individual protein-coding genes does not therefore result in synthesis of the transcriptome but instead maintains the transcriptome by replacing mRNAs that have been degraded, and brings about changes to the composition of the transcriptome via the switching on and off of different sets of genes.
o The transcriptome is never synthesized de novo.
o Rapid turnover of mRNA means the transcriptome composition is never fixed and can be changed quickly.
Protein Structure:
• The protein structure is a key determinant of function.
• The secondary structure of a protein can be predicted from the amino acid sequence.
• There are no simple rules to determine the tertiary and quaternary structure.
• The structure of uncharacterized proteins may be determined by homology to known proteins.
• Determining the function of a protein based on the amino acid sequence remains very challenging.
The Proteome
• mRNAs code for all the proteins found in a cell.
• The entire cell proteome can be deduced from the transcriptome using the genetic code.
• The active proteome determines the metabolic capabilities of a cell.
• High-throughput genetics can help assign functions to proteins and determine the structure of metabolic pathways.
• Post-transcriptional regulation complicates predictions about cell metabolism.
o Biological information encoded by the genome finds its final expression in a protein whose biological properties are determined by its FOLDED STRUCTURE and the chemical groups on its surface.
o By specifying proteins of different types, THE GENOME CAN CONSTRUCT AND MAINTAIN A PROTEOME whose overall biological properties form the basis of life.
o The second product of gene expression is the proteome (the cell's repertoire of proteins), which specifies the nature of the biochemical reactions a cell can carry out.
o These proteins are made by translation of the mRNAs making up the transcriptome.
o The proteome is made of ALL OF THE PROTEINS PRESENT IN A CELL AT A PARTICULAR TIME. Few differences in abundant proteins are seen when the proteomes of different types of mammalian cells are compared, suggesting most of them are HOUSEKEEPING proteins with general biochemical activities in all cells.
Lecture 09: BLAST/alignment/dynamic programming/dotplot
Sequence Alignment:
o One of the most basic tasks of bioinformatics: arranging two or more sequences to identify regions of similarity. Identify:
1. Matches
2. Insertions and deletions
3. Substitutions
o Sequences that are similar may have the same function or may be important for the structural properties of proteins or DNA.
o In sequence alignment, given two or more sequences, we wish to:
• measure their similarity;
• understand how the residues match up;
• observe patterns of conservation and variability; and
• infer evolutionary relationships.
o If we can do this, we will be in a good position to go fishing in databanks for related sequences, and measuring relative degrees of similarity among genes or proteins. A major application of sequence alignment is to the annotation of genes, through identification of homologues, in order to assign structure and function to as many genes as possible.
o Sequence alignment is the identification of residue-residue correspondences. Any assignment of correspondences that preserves the order of the residues within the sequences is an alignment. Alignments may contain gaps.
Sequence Homology:
o Two similar sequences found in the genomes of two different organisms may be related through evolution (homologous).
o Random mutations will make sequences diverge, but the selective pressure to preserve function will impose sequence conservation at critical positions.
o A pair of homologous genes do not usually have identical nucleotide sequences, because the two genes undergo different random changes by mutation, but they have similar sequences because these random changes have operated on the same starting sequence, the common ancestral gene. Homology searching makes use of these sequence similarities. The basis of the analysis is that if a newly sequenced gene turns out to be similar to a previously sequenced gene, then an evolutionary relationship can be inferred and the function of the new gene is likely to be the same, or at least similar, to the function of the known gene.
Dot plot for pairwise sequence similarity
o Simple visual comparison of two sequences in matrix format. Identical nucleotides are marked with a dot.
o Dot plots do not provide an optimal alignment.
o A dotplot shows perspicuously the quality and distribution of the pattern of similarity between two sequences. Each possible alignment of the two sequences corresponds to a path through the dotplot, from upper left to lower right!!
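A quick way to see what a dot plot is: the Python sketch below builds the matrix described above, marking a cell whenever the residue in the row matches the residue in the column. It is a bare-bones illustration with no window or threshold filtering, which real dot plot programs usually add to cut down noise. The function names dot_plot and render are made up for this example, and "*" is used instead of a dot so the marks are easy to see in plain text.

```python
def dot_plot(seq_a, seq_b):
    """One text row per residue of seq_a; '*' where it equals the seq_b residue."""
    return ["".join("*" if a == b else " " for b in seq_b) for a in seq_a]

def render(seq_a, seq_b):
    """Add sequence labels: seq_b across the top, seq_a down the left side."""
    lines = ["  " + seq_b]
    for residue, row in zip(seq_a, dot_plot(seq_a, seq_b)):
        lines.append(residue + " " + row)
    return "\n".join(lines)

if __name__ == "__main__":
    # Similar stretches show up as diagonal runs of marks.
    print(render("GATTACA", "GATCACA"))
```

Runs along a diagonal from upper left to lower right are the stretches where the two sequences agree. A shift from one diagonal to another corresponds to a gap in the alignment.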
o The dot plot is a simple picture that gives an overview of pairwise sequence similarity. Less obvious is its close relationship to alignments.
o Dot plots give quick pictorial statements of the relationship between two sequences. Obvious features of similarity stand out. Figure 5.3 shows a dot plot of a sequence containing internal repetitions. Figure 5.4 shows a dot plot of a palindromic sequence (a sequence that is identical to its reverse).
o A disadvantage of the dot plot is that its ‘reach’ into the realm of distantly related sequences is poor. In analysing sequences, one should always look at a dot plot to be sure of not missing anything obvious, but be prepared to apply more subtle tools.
Optimal pairwise alignment:
o Determine possible alignments, score alignments, and pick the best one.
o Scoring alignments: Percentage identity is too simple. Not all amino acid substitutions are equally likely.
o PAM (Point Accepted Mutation)
• Observed mutations in closely related proteins.
• Uses global alignments.
• Does not work well for aligning distantly related proteins.
o BLOSUM (BLOcks SUbstitution Matrix)
• Observed mutations in distantly related proteins (BLOSUM62 is built from sequences clustered at a 62% identity threshold).
• Only uses blocks of conserved positions to calculate the substitution rates.
Dynamic Programming for optimal alignment:
o In this example, the optimal alignment and optimal path are obvious. In general, a computer program must examine all possibilities. How to do that effectively is a matter of some delicacy. Without explaining the methods in detail, the trick is to decide, for each partial path, what its best extension is. Algorithms for relating locally optimal moves to integrated optimal pathways (that is, for constructing full alignments) depend on a mathematical technique called dynamic programming.
Multiple Sequence Alignment:
o MSA can identify conserved functional domains within a gene family.
o MSAs are used to reconstruct evolutionary trees.
o It is computationally too expensive to perform optimal alignments on many sequences using dynamic programming.
o Heuristic approaches are used and do not guarantee optimal alignments.
Multiple sequence alignments are rich in information about patterns of conservation. They help us to understand the common features of structure and function of a family of sequences, by showing us which residues are crucial (and therefore conserved). They also help us to identify distant homologues with greater confidence than a pairwise sequence alignment could. Multiple sequence alignment reveals the underlying patterns contained in a set of related sequences much more clearly than pairwise sequence alignments.
The patterns inherent in a multiple sequence alignment are not merely inferences from the alignment table (this is leaving it too late) but can actively contribute to creating a high-quality alignment. The idea is for an algorithm to learn the underlying patterns while it is assembling the multiple sequence alignment.
Protein MSA illustrating conserved positions:
Basic Local Alignment Search Tool (BLAST)
o Need a fast method to query very large sequence databases such as those at NCBI.
o Trying to align all the sequences is not feasible.
o BLAST adopts a seed-and-expand alignment approach.
PSI-BLAST is an extension of BLAST for multiple sequence alignment (see Figure 5.8). PSI-BLAST constructs a profile, i.e. a conservation pattern, in an initial multiple alignment of the ‘hits’ from a preliminary BLAST search. Armed with the profile, the method returns to the database and does a more sensitive search, giving higher weight to well-conserved positions; it then realigns what it finds and refines the profile. Several such cycles of refinement of the profile give PSI-BLAST the power both to detect distant relationships and to create high-quality multiple sequence alignments.
It is routine to screen genes from a new genome against databases, to find similarities to other sequences. Databases have grown so large that programs based on exact local alignments are too slow. Approximate methods can detect close relationships well and quickly but are inferior to the exact ones in picking up very distant relationships. In practice, they give satisfactory performance in the many cases in which the probe sequence is fairly similar to one or more sequences in a databank, and they are, therefore, certainly worth trying first.
A typical approximation approach such as BLAST (basic local alignment search tool) takes a small integer k and determines all instances of each ‘word’ of length k (i.e. each set of k consecutive characters, with no gaps) of the probe sequence that occur in any sequence in the database. A candidate sequence is a sequence in the databank containing a large number of matching k-tuples, with equivalent spacing in probe and candidate sequences. For a selected set of candidate sequences, approximate optimal alignment calculations are then carried out, with the time- and space-saving restriction that the paths through the matrix considered are restricted to bands around the diagonals containing many matching k-tuples. It is clearest to show the procedure in terms of a dot plot (see Figure 5.7).
Evaluating BLAST Results:
o Score: pairwise alignment score calculated from the scoring matrix (e.g. BLOSUM62).
o Query cover: percentage of the query sequence that overlaps the database hit.
o E-value: expected number of hits with at least this score returned by chance (false positive hits), given the length of the query sequence and the size of the database.
o Max identity: highest percentage of identical positions between the aligned sequences.
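The word-matching step described above can be sketched as follows. This is a simplified Python illustration of the idea, not the actual BLAST implementation: it only finds exact k-letter word matches between a query and one database sequence and then counts how many seeds fall on each diagonal, whereas real BLAST also scores similar (non-identical) words for proteins and extends the seeds into alignments. All function names here are hypothetical.

```python
from collections import defaultdict

def index_words(sequence, k):
    """Map every word of length k to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

def seed_hits(query, subject, k):
    """(query_pos, subject_pos) pairs where an exact k-letter word matches."""
    index = index_words(subject, k)
    return [
        (q, s)
        for q in range(len(query) - k + 1)
        for s in index.get(query[q:q + k], [])
    ]

def diagonal_counts(hits):
    """Number of seeds on each diagonal (subject_pos - query_pos)."""
    counts = defaultdict(int)
    for q, s in hits:
        counts[s - q] += 1
    return dict(counts)

if __name__ == "__main__":
    hits = seed_hits("ACGTAC", "TTACGTACGG", 4)
    print(hits)
    print(diagonal_counts(hits))
```

A diagonal that collects many seeds is a band where the two sequences probably align, so that is where the more expensive alignment calculation would be concentrated.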
Lecture 10: Genome Annotation
Genome Annotation:
• Challenges:
• large number of new sequences
• diverse organisms with diverse biology
• integrate experimental evidence from various sources
• Solutions:
• create databases to compile and organize computational predictions, experimental evidence, and useful analyses
• build automated computational pipelines to annotate genetic elements
• create collaborative networks of experts to curate databases
• leverage large datasets to identify conserved patterns and rules to inform new hypotheses
GenBank: http://www.ncbi.nlm.nih.gov/genbank/
• GenBank is the National Institutes of Health (NIH) sequence database, an annotated collection of all publicly available DNA sequences.
• GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC), together with the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL).
Specialized Databases:
• Integrated Microbial Genomes at the DOE Joint Genome Institute (https://img.jgi.doe.gov/)
• The Saccharomyces Genome Database (SGD) at Stanford University (http://www.yeastgenome.org)
• Ensembl, a joint project between EMBL-EBI and the Wellcome Trust Sanger Institute (http://useast.ensembl.org)
• BioCyc from SRI International (http://biocyc.org)
• FlyBase, supported by the National Human Genome Research Institute at the NIH (http://flybase.org)
NCBI pipeline for gene annotation:
o NCBI has developed an automated pipeline that:
• uses similarity when sufficient quantities of comparative data are available
• uses statistical predictions in the absence of external evidence
Genome annotation:
o Gene prediction algorithms:
• The first generation analyzed one ORF at a time based on a statistical model of gene structure.
• The second generation analyzed the global properties of the genomic sequence of a given organism (GeneMark and Glimmer).
• The third generation combined execution of multiple gene-calling algorithms with similarity-based methods to balance predictions with evidence.
o Gene prediction can leverage the large number of sequenced genomes to detect conserved structures.
o Assumes that proteins conserved in related genomes should be found in a newly sequenced genome.
GeneMark:
Gene prediction in Eukaryotes:
Minimum Standards for Annotating Complete Genomes:
1. ANNOTATION SHOULD FOLLOW INSDC SUBMISSION GUIDELINES
• Prior to genome submission, a submitted BioProject record with a registered locus_tag prefix is required according to accepted guidelines.
• The genome submission should be valid according to feature table documentation.
2. MINIMAL GENOME ANNOTATION SHOULD HAVE
• At least one copy of rRNAs (5S, 16S, 23S) of appropriate length and corresponding genes with locus_tags
• At least one copy of tRNAs for each amino acid and corresponding genes with locus_tags
• Protein-coding genes with locus_tags (see below) and corresponding CDS
3. VALIDATION CHECKS AND ANNOTATION MEASURES
Statistical measures that are used for annotation quality assessment include:
• Feature counts by feature type
• Protein coding gene count vs genome size ratio
• Percent of short (<30 aa) proteins
• Percent of coding regions with a standard start codon
• Count of protein coding regions with “hypothetical protein” product
4. EXCEPTIONS
Exceptions (unusual annotations, annotations not within expected ranges) should be documented and strong supporting (experimental) evidence should be provided.
Protein number vs. Genome Size:
NCBI Genomes Database:
• Reference genomes: currently 120 prokaryotic genomes with high quality assembly and functional annotation.
• http://www.ncbi.nlm.nih.gov/genome/browse/reference/
• full representation of the genome
• manual curation with references to experimental evidence
• Representative genomes: currently >4,000 prokaryotic genomes representing different branches of the tree of life.
• http://www.ncbi.nlm.nih.gov/genome/browse/representative/
• Unlike eukaryotes, prokaryotes do not have a clear definition of a species
• a comparison of universally conserved ribosomal proteins is