Bioinformatics Models and Algorithms
Bioinformatics Models and Algorithms BME 205
This 5 page Class Notes was uploaded by Jacky Emmerich on Monday September 7, 2015. The Class Notes belongs to BME 205 at University of California - Santa Cruz taught by Kevin Karplus in Fall. Since its upload, it has received 50 views. For similar materials see /class/182232/bme-205-university-of-california-santa-cruz in Biomolecular Engineering at University of California - Santa Cruz.
Date Created: 09/07/15
Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes

W. James Kent, Robert Baertsch, Angie Hinrichs, Webb Miller, and David Haussler

[Affiliations partly illegible in the source: ... Santa Cruz ...; ... University Park, PA ...]

Edited by [...] Waterman, [...], CA, and approved July 11, 2003 (received for review April 9, 2003)

This paper was submitted directly (Track II) to the PNAS office.
www.pnas.org/cgi/doi/10.1073/pnas.1932072100
PNAS | September 30, 2003 | vol. 100 | no. 20

This study examines genomic duplications, deletions, and rearrangements that have happened at scales ranging from a single base to complete chromosomes by comparing the mouse and human genomes. From whole-genome sequence alignments, 344 [...] are further fragmented by smaller-scale evolutionary events. Excluding transposon insertions, on average in each megabase of [...] we observe two inversions, 17 duplications (five tandem or nearly tandem), seven transpositions, and 200 [deletions of 100 bases or more; the count] include[s] [...] duplications or transpositions of length >100 kb. The frequencies of these smaller events are not substantially higher in finished portions [...]. [Many of the smaller transpositions are] processed pseudogenes; we define a "syntenic" subset of alignments that excludes these and other small-scale transpositions. These alignments provide evidence that [...] of the genes in the human/mouse common ancestor have been deleted or partially deleted in the mouse. There also appears to be slightly less nontransposon-induced genome duplication in the mouse than in [...]. [...] alignment techniques that can [...] in a robust fashion and discriminate between orthologous and paralogous alignments.

comparative genomics | cross-species alignments | synteny | chromosomal inversion | breakpoints

Evolution creates new forms and functions from the interplay of reproduction, variation, and selection. There are many [types] of variation. [The] most common and [well studied] is the substitution of one base for another. Small insertions and [deletions are also] quite common. [Rarer events] involve the duplication of part of the genome. [Such a duplication] can be the starting point for the development of a new gene with a new function. [...] in the gene must be maintained. After duplication, one copy is free to lose its original function and possibly assume a new function (1, 2). Deletion and [rearrangement also play] roles in the long-term evolution of genomes.

[...] by comparing the human and mouse genomes to each [other. ...] sequence at the DNA level (3), yet distant enough that a great deal of variation has had the opportunity to accumulate.

Chromosomal rearrangements of ≥1 megabase can be observed by comparing genetic maps between organisms (4) and by [...]. Blocks of synteny between human and mouse were discovered by gene [...] depending on details of definition and method. The length distribution of synteny blocks was found to be consistent with the theory of random breakage introduced by Nadeau and Taylor (8, 9) before significant gene order data became available. In recent [...] rearrangement[s] of ≥100,000 bases were studied by comparing 558,000 highly [...] conserved short sequence alignments (average length 340 bp) within 300-[...]. An estimated 217 blocks of conserved synteny were found, formed from 342 conserved segments, with [...] model (3). Subsequent analysis of these data found 231 conserved synteny blocks of size at least 1 megabase, with a few thousand further microrearrangements within these blocks [...] about one per megabase.

The most [common] variations are single-base transitions, that is, C/T and G/A substitutions (11, 12). Single-base insertions and deletions [...] are rapidly selected out of coding regions. Substitutions and [...].

[...] These insertions and deletions [...] to maximize the [...]. A simple example is [...] derived from a common ancestor [...] their common ancestor. Similarly, an alignment gap [can] be caused either by an insertion in one sequence or a deletion in the other. At the heart of the pairwise alignment process is a scoring function that assigns positive values to matching nucleotides and negative values to mismatches and gaps. Most modern programs use what is called an "affine gap" score, where the first gap character in a gap incurs a substantial "gap open" cost and each subsequent gap character incurs a somewhat lesser "gap extension" cost. Because gaps are frequently more than a single base long, [affine] gap scores generally work fairly well for protein alignments, where gaps are rare and tend to be short, but do not represent [the underlying biology ...], particularly outside of coding regions, [which] tend to require many more [...] gaps [too] large to be found by traditional pairwise [alignment programs] (16). Furthermore, in traditional pairwise [alignment ...] even though these are quite common. In these instances the alignment program will either break the alignment into two or force nonhomologous bases to align. Traditional programs are unable to accommodate inversions, translocations, or [duplications ...]. [...] several million [...] own evolution (17) is one indication of the importance of variation at the middle scales.

In this article we describe automated methods for linking [...] chains and nets that can effectively bridge the gap between [chromosome] painting and sequence alignment (Fig. 1). These methods can accommodate inversions, translocations, duplications, large deletions, and overlapping deletions. We apply these tools [to] BLASTZ-generated alignments (18) to investigate the patterns of variation that have occurred at all scales since the divergence of the mouse and human lineages.

Fig. 1. [Caption largely illegible in the source: mouse/human alignment net display, with a link to the browser on the corresponding region in the other species.]

Methods

The November 2002 freeze of the human genome and the February 2002 freeze of the mouse genome were taken from [...]. [The] genomes were aligned by the [BLASTZ] program as described in ref. 18, except that, in addition [to the transposon] repeats identified by REPEATMASKER (19), simple repeats of period 12 or less found by TANDEM R[EPEAT FINDER] (20) were masked out, and the artifact-prone [mouse sequence that could] not be assem[bled ...].

By us[ing the same scoring matrix] as BLASTZ but a novel piecewise [...], a new program, [AXTCHAIN], formed maximally scoring chained alignments out of the gapless subsections of the input alignments (Fig. 1). A chained alignment, or "chain," between two species consists of an ordered sequence of traditional pairwise nucleotide alignments [...] simultaneous gaps in both species. The order of blocks within the chain must be consistent with the genomic sequence order in both [species]. Thus, a chain cannot have local inversions, translocations, or duplications among the parts of the DNA that it aligns. However, a chain is allowed to [skip] over segments of DNA in either [or] both species. In particular, inte[rvening] DNA in one [species that does] not align with [the] other [because] it [has been locally] inverted, or has been inserted [by translocation] or duplication, is skipped over during construction of the chain. Thus [a chain models DNA derived] from a single genomic segment in the common ancestor without rearrangement. To build chains efficiently, AXTCHAIN uses a variation of the k-dimensional tree (kd-tree)-based algorithm described in ref. 21. To detect cases of overlapping deletions in both species, scoring is [defined] such that the alignment program will typically [...] simultaneous gaps [...].

For many pu[rposes ...] every region of the human genome. Previously [we used the program] AXTBEST for this [...] a loop, at each iteration taking th[e next chain], throwing out the parts of the chain that intersect with bases already covered by previously taken chains, and then marking [...] red-black trees to keep track of which areas of a chromosome [are] covered. If a chain covers bases that are in a gap in [...] net. The [...] covered by more t[han ...] from nonduplicated regions.

The result[ing] net files are further annotated by the program NETSYNTENY, which notes which chains [in] the net are inverted [...] in a larger chain on [the same chromo]some and come from the same region as the larger chain. Thus [...] program by using the -syn flag. Although NETFILTER does not eliminate pseudogenes that arise from tandem duplication, it does eliminate processed pseudogenes, which are much more plenti[ful].

Further details of these algorithms are described in the source code, which, along with Linux executables for LAVTOAXT, AXTCHAIN, CHAINNET, NETSYNTENY, NETCLASS, and NETFILTER, are available at www.soe.ucsc.edu/~kent. The chains and nets [...] at http://genome.ucsc.edu (22) and may also be downloaded in bulk from that site.

Table 1. Comparison of [...] before and after processing with AXTCHAIN and after building the human net

[Column headings illegible in the source.]

                                     Before                       After
Number of alignments/chains          8,550,145                    147,445
Longest human span, bp               53,750 [digits uncertain]    115,044,504
Average human span, bp               505 [digits uncertain]       22,830
Most aligning bases, bp              59,559 [digits uncertain]    27,055,473
Average aligning bases, bp           574                          7,062
Bases aligned in human genome, %     35.9                         34.6

Results

The initial BLASTZ mouse alignments cover 35.9% of the human genome. This is less than the 39.9% reported in ref. 3, due to masking of the tandem repeats of period 12 and less and removal of artifact-prone mouse sequence [...]. The construction of longer chains from the initial BLASTZ alignments resulted in fewer and substantially longer alignments (Fig. 1). Chained alignment length can be measured in two different ways. We define the human span of a chain to be the distance in the human genome from the first to the last human [base in] the chain, including gaps, and we define [the size of a] chain as the number of aligning bases in [it]. Both of these showed substantial increases due to chain construction, with the average human [span increasing to] 22,830 bp and the average size increasing from 574 to 7,062 bp. Because many smaller BLASTZ alignments were often merged into a single longer chain, there was also a substantial reduction in the total number of alignments involved in the human-mouse [comparison, from] 8.5 million to [0.15 million]. [In] some cases small [...] alignments are discarded after chaining as well. This results in a decrease from 35.9% to 34.6% of the bases in the human genome being aligned to mouse. These and other comparative statistics are listed in Table 1.

[...] chains was used to study [...] the divergence of these [...] base-size indels. The affine gap score model conventionally used [in sequence alignment programs] corresponds to a model where indel sizes follow a geometric distribution, in which the number of gaps of size N + 1 would be a constant fraction [of the number of gaps of size] N. [...] sizes observed in the set of chains we constructed, with frequency of occurrence for various indel lengths plotted on a logarithmic [scale. The]re are more [...] than [in a] geometric distribution, and there are sharp spikes in the numbers of gaps observed in the human sequence at ~300 bases, corresponding to ALU insertions (19).

Fig. 2. [Plot residue; caption largely illegible in the source. Panels show gap frequencies against gap size on a logarithmic scale.]

Fig. 3. [Caption partly illegible in the source. Legible fragments: frequencies for gaps up to 1,500 bases long shown as a log histogram; a concentration near the diagonal for inserts of <200 bases, mostly due to small inversions and locally divergent sequence; for larger gaps, a fit to the sum of the log frequencies of the individual one-sided gap sizes, as expected for independent gaps in each individual species.]

The chains also included a substantial number of simultaneous [gaps in both species. Short] simultaneous gaps are quite rare, but the phenomenon becomes increasingly important with increasing gap size. The frequency of large simultaneous gaps is roughly consistent with a model in which [...] (Fig. 3c). [...] species. In this sense these longer simultaneous gaps act as if they arise from [...].

The nested net structure of chains produced by the CHAINNET [program ...] occurred since the divergence of human and mouse, in the form of inversion (Fig. 4), deletion, transposition, [...] tandem duplication, and interspersed duplication. Because of [...].

The net structure is not symmetric between species. [...] the DNA from [...] align [to] any [...]. This one-sided best-in-genome requirement is not as strong as the requirement of "reciprocal best" matching, where only alignments that are best-in-genome for both species are allowed. The latter prohibits the investigation of lineage-specific duplications by cross-species alignment, because it allows at most one copy of the duplicated region to align to the other species, and often not even one copy can be fully aligned. With human and mouse, this results in a substantial reduction of the total amount of genomic DNA covered by cross-[species alignments: ...] by 11%, and in the mouse net by 9%. These numbers are quite close, suggesting similar [...] both genome[s] since the common ancestor. In what follows we will explore the [...] without the reciprocal best requirement. [...] the human genome as a whole are covered by more than one [B]LASTZ mouse alignment (3), even after excluding transposons [...].

[On the] basis of analysis of the human net, the frequency of [...] and on the 48-million-base [...] subset of the sequence that is finished is shown in Table 2. Overall rearrangement patterns were [very] similar [...]. The same held for the finished subset, [...] more local duplications in the finished subset, likely reflecting the collapse of local duplications, a common artifact of the whole-genome shotgun assembly techniques used in the mouse. In general, the most common type of rearrangement is a section of the genome being duplicated and inserted in a different chromosome (nonsyntenic duplication). Most of these appear to [...] were also surprisingly common.

Fig. 4. A 15,000-base inversion containing two transcripts and showing [...] in the November 2002 assembly of the human genome [...] the orthologous mouse region is not yet sequenced.

Fig. 5. Distribution of the span of all 147,445 chains in the human net. The fat tail [is] for 579 long chains of size between 10^5 and 10^8 [...].

The distribution of the spans of the chains from the human net (Fig. 5) shows a roughly bimodal distribution for the chains that [...] 100,000 [bases ("short] chains") and a lo[ng f]lat tail between 100,000 and ~115 million bases ("long chains," average length 983 kb). The long chains combined span 90.9% of the human genome; excluding gaps, [they] cover 32.9% of the bases in the human genome. In contrast, all chains together, including arbitrarily large gaps in individual chains, span 96.3% of the human genome and align to 34.6% of it. Thus the long chains alone, without their long gaps, span 94.4% of the bases spanned by all chains and include 95.1% of all aligned bases.

Of the 579 long chains, 344 appear at the top level of the net, [...] large primary units of synteny with the mouse. The locations of these chains are [...] (see Discussion). The remaining 235 long chains appear at lower levels of the net because they are embedded within the gaps of [...] translocations at great distance on the same chromosome or between chromosomes. Some of these were also detected in earlier studies.

In addition to the 344 long chains at the top level of the net, there are 19,800 short chains at the top level and many more at lower levels. The short chains at the top level often appear in long runs in regions where no significant synteny between the [...] what appear to be hot spots for rearrangements or duplications. In these regions, the best alignment in the other species [...] contain clusters of genes from families that have undergone recent lineage-specific expansions. They include many g[enes] involved in the immune system, olfactory receptors (23), and Krüppel-associated box C2H2-type zinc fingers, which are [...] evolution (24). For a table of such regions, see Table 3, which is published as supporting information on the PNAS web site, www.pnas.org. Because the [...] synteny identified in earlier [...] chains suggest that other processes may be at [work] not only within but also between the synteny blocks defined by the long chains. These intervening short chains cannot be explained as [...]

Table 2. Rearrangement statistics of the mouse genome relative to human

                                   Genome-wide      Finished         Genome    Finished
                                   frequency,       frequency,       median    median
                                   events per Mb    events per Mb    size      size
Inversion                          2.0              1.8              314       762
Inversion, local duplication       0.5              1.0              275       302
Inversion, local part duplication  0.7              0.8              [...]     1,235
Local move                         0.8              1.0              204       246
Local duplication                  1.9              4.0              211       351
Local part duplication             0.9              1.2              343       388
Syntenic move                      0.8              1.6              223       322
Syntenic duplication               1.3              1.2              283       286
Syntenic part duplication          0.7              0.8              474       946
Nonsyntenic move                   5.0              5.2              104       109
Nonsyntenic duplication            11.9             11.6             235       228
Nonsyntenic part duplication       4.6              4.6              282       256
Mouse 1-base gaps                  1,461.8          1,513.4          1         1
Mouse 10-base gaps                 397 [decimal point uncertain]   464 [decimal point uncertain]   10   10
Mouse gaps ≥100                    68.8             80.8             207       201
Double gaps ≥100                   3986 [decimal point uncertain]  4199 [decimal point uncertain]  444  411
H likely deletion ≥100             2300 [decimal point uncertain]  2235 [decimal point uncertain]  685  633

[Table footnote, partly illegible in the source:] [...] move or the rearranged chain align to multiple places in the human genome. "Duplication" means at least [...] of the aligning bases [...]. "Part duplication" means some sequence [...] but <80% is duplicated. The moves and [...]. The "single gap" [rows show] gaps of [...] or more bases in mouse and 0 bases in human. The "double gaps ≥100" row shows gaps of 100 or more in mouse and >0 bases in human. The "H likely deletion ≥100" [...].
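The affine gap score described in the introduction, and the geometric indel-size model that the Results say it corresponds to, can be illustrated with a minimal sketch. The function names and the cost values (400 and 30) are hypothetical and chosen only for illustration; they are not the parameters used by BLASTZ or AXTCHAIN, which the text says uses a different, piecewise gap scheme.

```python
def affine_gap_cost(length, gap_open=400, gap_extend=30):
    """Cost of one gap of `length` characters under an affine model.

    The first gap character incurs `gap_open`; each subsequent character
    incurs the smaller `gap_extend`. Both defaults are hypothetical values.
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * (length - 1)


def geometric_gap_counts(count_at_one, ratio, max_length):
    """Expected gap counts by size when indel sizes are geometric.

    The number of gaps of size N + 1 is a constant fraction (`ratio`)
    of the number of gaps of size N.
    """
    return [count_at_one * ratio ** (n - 1) for n in range(1, max_length + 1)]


if __name__ == "__main__":
    print([affine_gap_cost(n) for n in (1, 2, 10)])
    print(geometric_gap_counts(1000, 0.5, 4))
```

On a logarithmic frequency axis a geometric distribution is a straight line, which is why an excess of long gaps or a spike near ALU length stands out in a plot of observed gap sizes.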
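The chain definition in Methods (an ordered sequence of gapless blocks whose order is consistent in both genomes) and the two length measures defined in Results (human span and size) can be sketched as follows. The `Block` type, the half-open coordinates, and the example numbers are assumptions made for illustration; strand handling is omitted, and this is not the AXTCHAIN data structure.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """One gapless aligned block (hypothetical half-open coordinates)."""
    h_start: int  # human start
    h_end: int    # human end
    m_start: int  # mouse start
    m_end: int    # mouse end


def is_ordered(blocks: List[Block]) -> bool:
    """True if block order is consistent with sequence order in both species."""
    return all(a.h_end <= b.h_start and a.m_end <= b.m_start
               for a, b in zip(blocks, blocks[1:]))


def human_span(blocks: List[Block]) -> int:
    """Distance from the first to the last human base, including gaps."""
    return blocks[-1].h_end - blocks[0].h_start


def chain_size(blocks: List[Block]) -> int:
    """Number of aligning bases (blocks are gapless, so human length suffices)."""
    return sum(b.h_end - b.h_start for b in blocks)


def gap_kinds(blocks: List[Block]) -> List[str]:
    """Classify each gap between consecutive blocks."""
    kinds = []
    for a, b in zip(blocks, blocks[1:]):
        h_gap = b.h_start - a.h_end
        m_gap = b.m_start - a.m_end
        if h_gap > 0 and m_gap > 0:
            kinds.append("simultaneous")
        elif h_gap > 0 or m_gap > 0:
            kinds.append("one-sided")
        else:
            kinds.append("none")
    return kinds


chain = [Block(0, 100, 1000, 1100),
         Block(150, 200, 1100, 1150),
         Block(500, 600, 1400, 1500)]
```

For this made-up chain the human span (600) is much larger than the size (250), which is the distinction the Results draw when comparing spans with aligned bases.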
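The net-building loop described in Methods (take chains one at a time, throw out the parts that intersect bases already covered by previously taken chains, then mark what remains as covered) can be sketched as below. This is a simplified illustration, not the CHAINNET implementation: it assumes chains are taken in order of decreasing score, uses a plain list of intervals where the text mentions red-black trees, and omits the parent/child hierarchy that places a chain inside the gap of an earlier one. All names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open human coordinates


@dataclass
class Chain:
    name: str
    score: int
    blocks: List[Interval]


def subtract(interval: Interval, covered: List[Interval]) -> List[Interval]:
    """Return the parts of `interval` not overlapping any covered interval."""
    start, end = interval
    pieces = []
    for c_start, c_end in sorted(covered):
        if c_end <= start or c_start >= end:
            continue  # no overlap with what is left of the interval
        if c_start > start:
            pieces.append((start, c_start))
        start = max(start, c_end)
        if start >= end:
            break
    if start < end:
        pieces.append((start, end))
    return pieces


def build_net(chains: List[Chain]) -> List[Tuple[str, List[Interval]]]:
    """Greedy coverage: higher-scoring chains claim bases first."""
    covered: List[Interval] = []
    net = []
    for chain in sorted(chains, key=lambda c: c.score, reverse=True):
        kept: List[Interval] = []
        for block in chain.blocks:
            kept.extend(subtract(block, covered))
        if kept:  # chains with nothing left are dropped
            covered.extend(kept)
            net.append((chain.name, kept))
    return net


example = [
    Chain("A", 1000, [(0, 100), (200, 300)]),
    Chain("B", 500, [(50, 150)]),
    Chain("C", 100, [(10, 20)]),
]
net = build_net(example)
```

In this toy input, chain B keeps only the part of its block that A did not claim, and chain C disappears because every base it covers was already taken, which mirrors how each human base ends up covered by a single chain in the net.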