New User Special Price Expires in

Let's log you in.

Sign in with Facebook


Don't have a StudySoup account? Create one here!


Create a StudySoup account

Be part of our community, it's free to join!

Sign up with Facebook


Create your account
By creating an account you agree to StudySoup's terms and conditions and privacy policy

Already have a StudySoup account? Login here


by: Earnest Jast


Earnest Jast
GPA 3.68


Almost Ready


These notes were just uploaded, and will be ready to view shortly.

Purchase these notes here, or revisit this page.

Either way, we'll remind you when they're ready :)

Preview These Notes for FREE

Get a free preview of these Notes, just enter your email below.

Unlock Preview
Unlock Preview

Preview these materials now for free

Why put in your email? Get access to more of this material and other relevant free materials for your school

View Preview

About this Document

Class Notes
25 ?




Popular in Course

Popular in Music

This 3 page Class Notes was uploaded by Earnest Jast on Thursday October 22, 2015. The Class Notes belongs to MUS 88 at University of California Santa Barbara taught by Staff in Fall. Since its upload, it has received 24 views. For similar materials see /class/226843/mus-88-university-of-california-santa-barbara in Music at University of California Santa Barbara.




Report this Material


What is Karma?


Karma is the currency of StudySoup.

You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 10/22/15
VOLUME 88 NUMBER 6 PHYSICAL REVIEW LETTERS 11 FEBRUARY 2002 Sequence Determination from Overlapping Fragments A Simple Model of WholeGenome Shotgun Sequencing Bernard Derrida and Thomas MA Fink Laboratoire de Physique Statistiqae Ecole Normale Sape rieare 75231 Paris Cede 05 France and Institute for Theoretical Physics University of California Santa Barbara California 931064030 Received 3 July 2001 published 28 January 2002 Assembling fragments randomly sampled from along a sequence is the basis of wholegenome shotgun sequencing a technique used to map the DNA of the human and other genomes We calculate the probability that a random sequence can be recovered from a collection of overlapping fragments We provide an exact solution for an in nite alphabet and in the case of constant overlaps For the general problem we apply two assembly strategies and give the probability that the assembly puzzle can be solved in the limit of in nitely many fragments DOI 101l03PhysRevLett88068106 The problem of sequencing the human and other genomes is at the origin of a large number of technical and theoretical challenges Considerable progress in practical sequencing techniques has allowed the recent completion of the human genome 12 Theoretical effort has focused on among other things sequence align ment 3 longrange correlations 4 and distinguishing between coding and noncoding regions 5 Many of these questions can be formulated in terms of random spin chains or systems with quenched disorder and have therefore created a lot of interest among physicists 67 Here we study a simple model related to the problem of reconstructing an unknown sequence by assembling fragments We consider a random uncorrelated sequence S of length L composed of characters drawn from an al phabet of size A This sequence S might be made of letters as in DNA eg ACTTAATG or of bits as in a binary sequence eg 11001000 For convenience we take S to have periodic boundary conditions so that it forms a ring We do not know the sequence itself or its length but only the identity of M short fragments of length 7 lt L sampled at random from along the se quence In this paper we address the following question What is the probability PML A that knowledge of the M fragments enables us to reconstruct the original native sequence S This question is at the heart of wholegenome shotgun sequencing 89 a technique developed by Celera Ge nomics which proved crucial in accelerating the map of the human and other genomes Shotgun sequencing in volves breaking identical copies of a single strand of DNA into pieces and sequencing the fragments which must then be assembled in their correct order To accomplish this it is necessary to collect enough fragments such that some regions are sampled redundantly These multiply sampled regions or overlaps must be long enough to allow recog nition of which pairs of fragments are neighbors along the original DNA molecule Shotgun fragments are typically narrowly distributed about a xed length 8 so we take in 0681061 003190070288606810642000 PACS numbers 8714Gg 8715Aa our model all pieces to be of constant length 7 Because DNA is directed the fragments cannot be ipped Since in our model the characters in S are uncorrelated we do not take into account repeat regions present in real DNA Moreover we do not consider sequencing errors or scaffold structure 8 Clearly if the number of fragments M is too small if Ml lt L we can be sure that some regions of S are not covered by the fragments thereby making it impossible to recover S in its entirety Moreover even if M 7 gt L there remains a nonzero probability that the M randomly selected fragments do not cover the whole sequence But whether or not the sequence is completely covered by the fragments is not the full story Additionally the fragments must be long enough for the information con tained in their overlaps to be suf cient to reassemble them For instance even if all fragments of length 2 from the se quence 001011 are known 000110 011110 they do not give a prescription for a unique sequence the pieces could also be assembled into 001101 Knowing the M fragments one can construct the matrix of their overlaps qa where qa is the maximal length over which the head of fragment 1 coincides with the tail of fragment 3 In this matrix nonzero elements have two possible origins either the same region in S has been covered by two fragments we call this region a native overlap or two fragments in S overlap by chance we call this a nonnative overlap Of course it is not a priori known which are which and our problem may be posed as distinguishing one set from the other In nite alphabetiThe simplest version of the sequence recovery problem is the limiting case of an in nite alphabet A gt 00 In this case all nonzero overlaps qa are native As long as the entire sequence is covered and all nearest neighbor fragments have overlap 21 the sequence S can be recovered Accordingly the recovery probability can be expressed as a covering problem with pieces of length 7 7 1 that is 2002 The American Physical Society 0681061 VOLUME 88 NUMBER 6 PHYSICAL REVIEW LETTERS 11 FEBRUARY 2002 PML DOCML 71 l where CM L 7 is the probability that M pieces of length 7 completely cover with or without overlap a sequence of length L The calculation of CM L 5 not presented here can be done exactly One rst calculates the probability that hav ing chosen M fragments from L possible fragments M0 of them are different Then one calculates the probability that the sum 039 of M0 7 1 random integers uniformly dis tributed between 1 and 7 is such that L 7 7 S 0 L 7 1 We nd M Mo M 7 J M0 CML UMO f MOZ1 JlkMO JlLM1 7 61 M0 xi dz Z Z lt2 2771 zLJ 1 1 7 z Fig 2 below which at least in the continuous limit L gt 00 at xed I is a known result 10 For large 7 and M0 C 1 exp7MOequot since covering is achieved when the end point of each fragment is covered by other fragments here M0 may be replaced by M provided M ltlt L Constant native overlapsiAnother simpli ed version of the recovery problem is the case in which the M fragments are not randomly sampled but equally spaced with a constant overlap b between consecutive fragments Fig 1 In this case M 35 17 To avoid correlations between characters in different overlapping regions we consider only the case b S 3 Since all neighboring pairs of fragments must have precisely overlap b only in the case of two or more identical overlaps out of a total of N Ab realizations can a nonnative sequence be constructed The probability that no two of the M overlaps are identical is the same as the solution to the wellknown birthday problem ll if M pieces are selected with replacement from N objects the probability that no two are identical is QO which in the regime 1 ltlt M ltlt N yields mm 2 exp7M2lt2Ngt lt3 The probability Qk that there are k pairs of identical overlaps with the remainin M 7 2k overlaps distinct Eor is Q06 NN71NN7MIC1AASM71NM7ZI 1 N Nlz this becomes 1 M2 k Q06 7 Q0 4 One could also calculate the probability that some over laps are repeated more than twice ll but they do not contribute signi cantly as long as M ltlt N2 Now we need the probability that M fragments contain ing k identical pairs can be uniquely assembled Since the M 7 2k unique overlaps do not allow rearrangements we can disregard them as shown in Fig l The number of ways that the 2k remaining overlaps can be paired is 223 0681062 B B A A K c c c c V k V FIG 1 Left fragments drawn from a native sequence with constant overlaps Only pairs of identical overlaps labeled A B an allow possible rearrangements 1ght equivalent diagram after xing nonrepeated overlaps Whether or not we are able to uniquely recover S from the remaining 2k pieces depends on the precise order of their 2k overlaps along S As soon as among the k pairs two pairs are entangled eg ABAB one can reorder the fragments in a nonnative way Indeed if in the original sequence we have AplB szA Ap3B Bp4A where the pi indicate the nonoverlapping interiors of the pieces we can reorder them as AplB Bp4A Ap3B szA There fore we need to count the number of ways ck that there is no entanglement between any two pairs this question appears in many contexts in particular in the enumer ation of meanders 12 Clearly cl 1 CZ 2 and with the convention co 1 one can show by recursion that Ck1 i ckiici which is satis ed by the Catalan numbers ck These are the number of permuta tions of the 2k overlaps for which S can be recovered out of a total of So if M overlapping fragments contain k pairs of identical overlaps S can be uniquely recovered with probability P Zkk 1 5 Summing the product of 4 and 5 over k with N Ab yields the probability of recovering S it is P m M M 7 b e A 2 exp7M 2 2A 0 2 k X Z k39k11l39 k70 6 Note that as suggested by the birthday problem 6 changes from 0 to l at bc 7 2 logAM Eor b lt b5 there are many pairs of identical overlaps and recovery is unlikely whereas for b gt bc the number of repetitions is small and recovery is very probable Native versus nonnative overlapsiThe key to deter mining the native sequence is the ability to distinguish be tween native and nonnative overlaps qa Recovery is dif cult when native and nonnative overlaps are compa rable in length but becomes easier as the native overlaps dominate In particular recovery should be straightfor ward whenever the smallest native overlap is larger than the largest nonnative overlap The probability that 2 q is simply the probability that M pieces of length 7 7 1 cover S In terms of the 0681062 VOLUME 88 NUMBER 6 PHYSICAL REVIEW LETTERS 11 FEBRUARY 2002 covering probability 2 this is CML q In the range 1 ltlt M ltlt L q f has typical value L 2 max M lnM0 7 Calculating the exact distribution of q l however is much more dif cult This will become apparent below in our estimation of the probability r that a non native overlap has length n Let tn be the probability that the tail of one random frag ment coincides with the head of another both supposed to be long over a length n Let r be the probability that these two fragments have overlap n ie the tail of one coincides with the head of the other over a maximal length n For an alphabet of size A In Surprisingly an analytic derivation of the r is less straightforward than one might expect This may be appreciated by noting that tid tit but ti M titj tk where tid is the probability that the two fragments have coincidence i and j etc In particular fOI i lt j one always has Zijj1 Zjj1 9E Zitjj1 Within the approximation that all the ti 1316 are factor izable that is ILLW 2 A Jk we nd rapproxiLL n A AMA 1 A3nA 1A2 1 1 1 1 1 8 A4nA 1A2 1A3 1 by the standard formula r In Zn 21 Zn2gtn1gtn 2mm 2 These approximate values are reasonably close but clearly not equal to numerical extrap olations of data obtained by an exhaustive enumeration of small strings Table I Although we could not calculate the rn s exactly it is clear that for large n r 2 In A We can then es timate the typical value q g or at least a bound on it by writing gig S n for M 2 2A 1 this estimate ne glects the correlations between pairs of non native over laps We nd s 2ln M lnA 9 From 7 and 9 the condition q f gt qgg is satis ed if L 2 gt l M 10 6 M in n a lt gt TABLE I The probability rn that two random fragments have overlap n for A 2 and A 4 The rftmap are our best nu merical estimates whereas the rfiPPmX are given by 8 A 2 A 4 n rzxtrap rillpprOX rzxtrap rzpprox 0 0267787 0288788 0687748 0688538 1 0300420 0288788 0230237 0229512 2 0198919 0192525 0061264 0061203 3 0112161 0110014 0015548 0015544 4 0059285 0058674 0003901 0003901 5 0030446 0030284 0000976 0000976 068106 3 which in terms of y M 6 L called redundancy of cov erage in 13 may be written M lt e y for M ltlt L Assembly strategies We now come back to our origi nal question concerning fragments independently sampled from a nite alphabet native sequence a situation mirror ing shotgun sequencing How is shotgun data put together in practice Because the number of ways of ordering the fragments grows as M an exhaustive search over all pos sible solutions quickly becomes impractical Instead the many overlapping fragments are assembled according to one of a variety of proprietary heuristic algorithms 14 The assembly puzzle can be naturally mapped onto a traveling salesman problem TSP 15 To each permuta tion 7739 of the fragments we associate an energy E de ned as M 6 minus the sum of the overlaps of fragments consec utive in 7739 The assembly puzzle is soluble if and only if the ground state energy Emin L and is nondegenerate Inspired by TSP greedy algorithm heuristics we investi gate two simple assembly strategies G1 and G2 below An elaboration of G1 in which fragment cleanliness absence of repeat regions reliability of bases etc is also consid ered is the basis of a real assembly technique 14 In the rst greedy strategy G1 a random fragment is selected and to its right is concatenated that fragment from the M 1 remaining with the greatest overlap To the right of this piece is joined the optimal fragment from the M 2 remaining and so on until one fragment remains This and the string are concatenated at both ends and the resulting sequence is the candidate sequence The second strategy G2 described in 16 is similar to the rst one but allows the formation of disjoint strings From the pool of M fragments that pair with the great est overlap is concatenated and thereafter considered as a single fragment This is then repeated with the pool of M 1 fragments and so on until two fragments remain These are joined at both ends and the result forms the can didate sequence If G1 or G2 are faced with identical overlaps we pro ceed by ipping a coin As we want the probability that S Recovery probability 05 O 00 o o 000 0 00 0 0 10 20 30 FIG 2 Probability of recovering the exact sequence as a func tion of y M L using G1 circles and G2 lines Curves are shown for from left 13 12 11 with L 64 and A 2 The covering probability 2 dots is shown for comparison with 12andL64 068106 3


Buy Material

Are you sure you want to buy this material for

25 Karma

Buy Material

BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.


You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

Why people love StudySoup

Jim McGreen Ohio University

"Knowing I can count on the Elite Notetaker in my class allows me to focus on what the professor is saying instead of just scribbling notes the whole time and falling behind."

Amaris Trozzo George Washington University

"I made $350 in just two days after posting my first study guide."

Bentley McCaw University of Florida

"I was shooting for a perfect 4.0 GPA this semester. Having StudySoup as a study aid was critical to helping me achieve my goal...and I nailed it!"

Parker Thompson 500 Startups

"It's a great way for students to improve their educational experience and it seemed like a product that everybody wants, so all the people participating are winning."

Become an Elite Notetaker and start selling your notes online!

Refund Policy


All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email


StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here:

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.