### Create a StudySoup account

#### Be part of our community, it's free to join!

Already have a StudySoup account? Login here

# Bioinformatics BME 4800

UCONN

GPA 3.92

### View Full Document

## 32

## 0

## Popular in Course

## Popular in Biomedical Engineering

This 6 page Class Notes was uploaded by German Farrell Sr. on Thursday September 17, 2015. The Class Notes belongs to BME 4800 at University of Connecticut taught by Yufeng Wu in Fall. Since its upload, it has received 32 views. For similar materials see /class/205865/bme-4800-university-of-connecticut in Biomedical Engineering at University of Connecticut.

## Popular in Biomedical Engineering

## Reviews for Bioinformatics

### What is Karma?

#### Karma is the currency of StudySoup.

#### You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 09/17/15

BME 4800CSE 3800CSE 5800 Notes on suf x tree and suf x array Yufeng Wu September 10 2009 1 Introduction of suf x tree and suf x array Building a suf x tree is an important result of 1973 which builds suf x tree in On time for a string S with n symbols Suf x tree is leaf labeled by suf x number at leaves Edges are labeled with substrings of S A key property is from root to leaf the substrings along the path spell out the suf x on that leaf Also no two edges out a node have the same starting symbol Note that every substring of S can be found on the tree starting from the root Now try to build suf x yourself for S tartaril and get the rst hand taste of what a suf x tree is Why is suf x tree useful Suppose we want to nd whether pattern P tar is in Startar We rst build a ST Then search for tar in it How to proceed We just match P with the unique path that agree with P The rst thing to note the path is indeed unique that is at each node we will have at most one branch to follow This is because in ST each branch starts with different char If we can not continue matching while still something left in P we conclude it is not in Otherwise ALL leaves under the current ending position has an exact match of P Why The crucial point is that if pattern P starting in TD iff suf x Tjm starts with P Thus all positions of T matching P must have the same T When alphabet size is constant the search for path to go can be done in constant time Thus pattern matching can be done in linear time Linear time pattern matching has been around for over 30 years But the algorithms known till very recently are too complicated even after much efforts in trying to simplify Fortunately something called suf x array comes to rescue Initially proposed by Manber and Myers as an space e icient alternative to suf x trees suf x array has some very interesting results In particular it has very simple linear time algorithm to construct directly and suf x array is also easily converted to suf x tree Therefore nowadays suf x array is used more often than suf x tree in pattern matching Suf x tree however is still useful in term of getting intuition you can think in term of suf x tree and implementation is done by suf x array So what is suf x array introduced by Manber and Myers We start with the same string tartar POS is an array of suf x each cell stores a suf x and listed in lexicographic order regarding to the corresponding suf x Here POS 7 5 2 6 3 4 1 A main advantage of suf x array is the memory reduction But I still think suf x tree is more conceptually clearer when thinking of things First suf x array can be easily built from suf x tree just a depth rst search in a lexicographically well when the alphabet is constant size How to build suf x tree from array We need one additional piece of data LCP array also called depth array The LCP array stores the length of the longest common string for two suf x next to each other in POS For our example here S tartar LCP 0 O 2 O 1 O 3 Here is how you will generate suf x tree from POS and LCP arrays Essentially start from smallest rst suf x and when move to the next create a new branch on the tree from the current tree Where to branch out Well depends on the depth array From the last added suf x si leaf backup till depth of LCPi O1 per edge when backing up Create a new internal node if needed and then branching off an edge Question will this violate the property of suf x tree by creating an edge whose label has the same start symbol as another existing edge Convince yourself it will not Now how about running time The main concern is that insertion of new branches leads to traversal of sometimes multiple edges How to bound such traversal The key is that only the path leading to the rightmost suf x is traversed Once traversed it is no longer part of the new rightmost path and so it will not be traversed again All other operations are only a single action per suf x So entire conversion takes On How can do pattern matching in suf x array The simplest thing is to do binary search This gives Onlog time where m is the pattern size But there are more advanced techniques which gives Onlogm We omit such details See Gus eld s notes on LG arrays which explain how to build the LC array ef ciently We now focus on the problem of building the POS array in On time This problem has puzzled people for ten years and three groups of people solved it a little different methods in 2003 All the three methods use the same high level idea that is rst sort part of string called a sample and then sort the rest of string from the sorted sample The three methods differ by the way of choosing samples and how to merge The main idea is divide and conquer But how are we going to divide One may say do rst half and second half but this does not seem to do much Another is to say odd position or even position Why is this useful Sorting suf x is different from sorting a list of numbers suf x i and i1 is not independent unlike sorting an array of numbers Look at two positions i and j If we know which one is lexicographically bigger then we can decide in constant time which of suf x i1 j1 bigger Why Compare the rst char and if equal then i1j1 pair tells us the result This is roughly the high level idea this approach takes but with an important twist instead of doing odd even we do tri partition 012 mod 3 Let us give an example to see how this works Here we start numbering position from 0 As an example consider T y a b b a d a b b a d o For this string POS12 1 6 4 9 3 8 2 7 5 10 11 0 We list suf x for 1 mod 3 in an array called called B1 and list suf x for 2 mod 3 in an array called called B2 For this example B114710 B2 25811 Step 1 Sort su ix in B1 and B2 For this purpose we will sort this string abb ada bba do l bba dab bad 039 Note each triple encapsulated in becomes a new character This string is longer than the original string but we just replace the triple with the ranks of these triples in the matrix by their ranks This is done by radix sort of the letters present in the samples This turns to abb 1 ada 2 bba 4 do 6 bba 4 dab 5 bad 3 03953 7 So we have the new string by plugging in the ranks as R 12464537 Then we sort this recursively and note R7 is shorter than T by 13 And suppose we get SAR 8 O 1 6 4 2 5 3 7 recall su iX start from 0 Use radix sort is the simple way of converting the combo alphabet back to integer alphabet so that we can use recursionl Also create a rank for every su iX for the sampled positions RankR X 1 4 X 2 6 X 5 3 X 7 8 X 0 0 note use 0 for the last two positions Be careful how do we know this SAR 8 O 1 6 4 2 5 3 7 so rankR 1 2 5 7 4 6 3 8 0 simply put the current rank in the corresponding position and increase the current rank by 1 What does this mean Want to compare words convert to the normal positions because we let 1 mod 3 go rst then followed by 2 mod 3 go neXt It now becomes 1 4 2 6 5 3 7 8 0 0 ll 00 near the end Step 2 Sort positions not sampled ie mod 3 0 Now this becomes easy For a pair for each of these position Si rankSi 1 Then do radiX sort of this pairs and it takes only On time Just line them up like before and sort it In our eXample we have y1 b2 a5 a7 which gives 00 lt a5 lt a7 lt b2 lt y1 Step 3 Merge Standard comparison mergel In more detail we are merging one list contains 0 mod 3 su iX called M0 and the other list contains 1 or 2 mod 3 su iX called M1 and M2 So we have two cases compare M0 with M1 and M0 with M2 If we compare M0 with M1 say s0 and s4 we rst see if SO 7 S4 If so we are done which su iX is larger is already known Otherwise we then look to see if s1 and s5 which one is larger and this is already known in one of the list B1 and B2 If we compare M0 with M2 say so with 35 we compare two characters rst and then the case reduce to the original situation Now we are done Finally time analysis We are doing recursion over 23n so we have Tn T2n3 On This gives On running time 2 Pattern matching using suf x array We now consider the problem of searching for pattern string P in teXt T whose su iX array POS and LCP array are given We rst review what we brie y discussed in class before Since POS array is sorted to nd which su iX s pre X matches P we can perform a binary search as follows We maintain the current search interval L and R which are initialized to 1 and m respectively We compare P with the middle su iX M L R2 by a simple direct comparison of P and the su iX If we see a match we are done Otherwise if P is leXically less than this middle su iX we know P has to appear in the rst interval We then change R to be one less than the middle position and continue The other case is similar In general this is the standard binary search procedure As we all know binary search runs in Ologm iterations Since each iteration we may need to compare the whole On symbols the search takes Onlogm time This is not bad when m is not very large but clearly its performance is worse than that of the su iX tree case We now show some simple tricks to speed up the search procedure First it seems wasteful to compare the whole P and a su iX each time A simple observation can help Suppose during search we keep track the matched length of P and su iX POSL denoted as l and of P and su iX POSR denoted as r Then a moment of thought suggests that P must match the rst mlr minlr symbols That is we can skip the rst mlr symbols in matching Now try to think about it yourself to make sure you understand why this is the case In practice this simple trick alone can bring down the eXpected running time of the search to On logm We now take a closer look at the binary search Note that to really reduce the running time of search it is critical to reduce the number of redundant character comparison Here redundant comparison means a symbol in P is compared multiple time Recall that this is not the case in suf x tree search where we only match each symbol in P once In the above speedup trick we avoided mlr redundant search but the comparison from mlr to maxlr is still redundant To eliminate these redundant comparison we need to nd a way to start searching from maxlr instead This needs the LCP array Recall that the LCP array stores the length of the longest common pre x of two adjacent suf x in POS For the purpose of matching we need an extension of LCP For any two positions i and j we de ne LCEij as the length of longest common pre x of suf x POSi and P080 Note that this means LCPi LCEi 1 i when i 2 2 In the following we will assume the LCE values for needed pairs of positions are known We will address the issue of how to obtain LCE values later Now consider several cases First if lr there is no extra redundancy and so start matching from mlr 1 Make sure you see why there is no redundancy We now assume 1 gt r Note the case 1 lt r is similar That is the left suf x matches more of P than the right suf x We then consider three sub cases H LCEL M gt Z This implies P can not occur between L and M This is because P is lexically larger than the left suf x and the middle suf x matches at the position where P and the left suf x differ which implies P is also lexically larger than the middle suf x So in this case we move L to M No comparison is needed And we keep 1 and r the same to LCEL M lt 1 It is not hard to see that P must be lexically smaller than the middle suf x why7 Then we move R to M change rLCELM Why No comparison is needed 03 LCELMl Then P can start from 11 It is now clear that with the help of LCE values we can get rid of redundant character com parison since the above shows we can start comparing at maxlr instead of minlr If you are still not convinced you should check closely to see how values 1 r change where comparisons occur and so on The only remaining issue is on LCE values There are two concerns rst there may be too many LCE values we need second computing LCE values may be slow not linear time in the length of T Fortunately it turns out here are only Om LCE pairs needed and compute these LCE values takes only Om time So we can just do that in the pre processing stage and these LCE values can be used in later pattern matching To summarize pattern matching in suf x array can be done in Onlogm time after some additional pre processing which takes Om time BME 4800CSE 3800CSE 5800 Two applications of suf X treesuf x array Yufeng Wu September 10 2009 1 Find common substrings of more than two strings This is an extension to the LCS longest common substring problem Say we have K strings whose total length is n We de ne lk be the length of longest substring common to at leat k of strings Here 2 g k K Surprisingly this problem can be solved in On time We will only give a OnK algorithm Similar to the LCS problem we build a concatnated string for Sl S2 S3 as S 5152Sg SK Now we de ne for each internal node v Cv number of distinct string identi ers 1 SQ appear in the leaves of the subtree We also collect string depth of each node 1 Then we simply traverse the suf x tree to nd for each k the deepest node v whose Cv k More detail we initialize an array Vk 0 for each k During the traversal the tree for each node v if Cv k and Vk lt depth of v then Vk depth of v Note Vk means this substring appears in exactly k strings out of K strings This is slightly different from the de nition lk But lk is easily derived from Vk scan V from k being the largest to smallest if Vk lt Vk1 then set Vk Vk1 Then the Vk value is what we need for lk If we assume all Cv values are known this step takes On time Now the question is how to compute Cv We take a simple approach for each ancestral node v store a binary vector by with one bit for each string Here by i 1 means the presence of some suf x of stirng 3 within the subtree under the node and 0 otherwise Then for each internal node v suppose it has two children 121 and 122 we simply generate the b vector by uu For a leaf node v we set by to contain 1 if that leaf suf x is from 5 Then a simple tree traversal and we are done The total running time is OKn time This is because we need to compute the b vector for each node in the suf x tree BTW this problem can be solved to On time But we will not discuss it here 2 Finding tandem repeat using suf x tree A tandem repeat occurs when there is a substring 04 that appear twice and at adjacent positions That is any is a substring of the text Now the problem is give stirng S of n symbols nd tandem 1I suggest you to think of how exactly you will do this a depth rst search Working out these details improves your understanding of suffix tree repeats in S A naive method is try all possible starting point X and y and direct compare from the two starting positions This will take 0n3 Here is the key lemma Lemma 21 There is a tandem repeat starting at position i andj that is l distance from i i i andj are located in the subtree of some internal node of the su x tree and depth of this node is longer than I Now think about it to see why this lemma holds 1 suggest you draw some pictures and think about in two directions to conVince yourself the lemma works Here is a simple algorithm collect depth for each node which takes On time Then for each internal node test all pairs of leaves X and y to see if lz 7 yl depthV This still sounds 0n3 since there are On nodes and each node we may test 0n2 pairs of nodes But a better analysis gives 0n2 time for node V we only test the nodes X in its left subtree and y in its right subtree That is we only test X and y whose lowest common ancestor is V This way each pair of nodes is tested for just once and each test takes 01 time That is the running time is 0n2 BTW there is Onlognk algorithm for this problem where k is the number of tandem repeats in the string Again we will not discuss it here

### BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.

### You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

## Why people love StudySoup

#### "There's no way I would have passed my Organic Chemistry class this semester without the notes and study guides I got from StudySoup."

#### "I made $350 in just two days after posting my first study guide."

#### "Knowing I can count on the Elite Notetaker in my class allows me to focus on what the professor is saying instead of just scribbling notes the whole time and falling behind."

#### "It's a great way for students to improve their educational experience and it seemed like a product that everybody wants, so all the people participating are winning."

### Refund Policy

#### STUDYSOUP CANCELLATION POLICY

All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email support@studysoup.com

#### STUDYSOUP REFUND POLICY

StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here: support@studysoup.com

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to support@studysoup.com

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.