Bioinformatics CS 5263
These 29 pages of class notes were uploaded by Mireya Heidenreich on Thursday, October 29, 2015. The notes belong to CS 5263 at University of Texas at San Antonio, taught by Jianhua Ruan in Fall. Since upload, they have received 9 views. For similar materials see /class/231374/cs-5263-university-of-texas-at-san-antonio in Computer Science at University of Texas at San Antonio.
Date Created: 10/29/15
Pattern Matching in Strings

Maxime Crochemore, Institut Gaspard Monge, Université de Marne-la-Vallée
Christophe Hancart, Laboratoire d'Informatique de Rouen, Université de Rouen

1 Introduction

The present chapter describes a few standard algorithms used for processing texts. They apply, for example, to the manipulation of texts (text editors), to the storage of textual data (text compression), and to data retrieval systems. The algorithms of the chapter are interesting in different respects. First, they are basic components used in the implementations of practical software. Second, they introduce programming methods that serve as paradigms in other fields of computer science (system or software design). Third, they play an important role in theoretical computer science by providing challenging problems.

Although data are stored variously, text remains the main form of exchanging information. This is particularly evident in literature or linguistics, where data are composed of huge corpora and dictionaries. This applies as well to computer science, where a large amount of data is stored in linear files. And this is also the case in molecular biology, where biological molecules can often be approximated as sequences of nucleotides or amino acids. Moreover, the quantity of available data in these fields tends to double every eighteen months. This is the reason why algorithms should be efficient even if the speed of computers increases regularly.

The manipulation of texts involves several problems, among which are pattern matching, approximate pattern matching, comparing strings, and text compression. The first problem is partially treated in the present chapter, in that we consider only one-dimensional objects. Extensions of the methods to higher dimensional objects and solutions to the second problem appear in the chapter headed "Generalized Pattern Matching". The third problem includes the comparison of molecular sequences, and is developed in the corresponding chapter. Finally, an entire chapter is devoted to text compression.
Pattern matching is the problem of locating a collection of objects (the pattern) inside raw text. In this chapter, texts and elements of patterns are strings, which are finite sequences of symbols over a finite alphabet. Methods for searching patterns described by general regular expressions derive from standard parsing techniques (see the chapter on formal grammars and languages). We focus our attention on the case where the pattern represents a finite set of strings. Although the latter case is a specialization of the former case, it can be solved with more efficient algorithms.

Solutions to pattern matching in strings divide into two families. In the first one, the pattern is fixed. This situation occurs for example in text editors for the "search" and "substitute" commands, and in telecommunications for checking tokens. In the second family of solutions, the text is considered as fixed while the pattern is variable. This applies to dictionaries and to full-text data bases, for example.

The efficiency of algorithms is evaluated by their worst-case running times and the amount of memory space they require.

The alphabet, the finite set of symbols, is denoted by $\Sigma$, and the whole set of strings over $\Sigma$ by $\Sigma^*$. The length of a string $u$ is denoted by $|u|$; it is the length of the underlying finite sequence of symbols. The concatenation of two strings $u$ and $v$ is denoted by $uv$. A string $v$ is said to be a factor (or a segment) of a string $u$ if $u$ can be written in the form $u'vu''$ where $u', u'' \in \Sigma^*$; if $i = |u'|$ and $j = |u'v| - 1$, we say that the factor $v$ starts at position $i$ and ends at position $j$ in $u$; the factor $v$ is also denoted by $u[i \,..\, j]$. The symbol at position $i$ in $u$, that is, the $(i+1)$-th symbol of $u$, is denoted by $u[i]$.

2 Matching Fixed Patterns

We consider in this section the two cases where the pattern represents a fixed string or a fixed dictionary (a finite set of strings). Algorithms search for and locate all the occurrences of the pattern in any text.

In the string-matching problem, the first case, it is convenient to consider that the text is
examined through a window. The window delimits a factor of the text and usually has the length of the pattern. It slides along the text from left to right. During the search, it is periodically shifted according to rules that are specific to each algorithm. When the window is at a certain position on the text, the algorithm checks whether the pattern occurs there or not, by comparing some symbols in the window with the corresponding aligned symbols of the pattern; if there is a whole match, the position is reported. During this scan operation, the algorithm acquires from the text information which is often used to determine the length of the next shift of the window. Some part of the gathered information can also be memorized in order to save time during the next scan operation.

In the dictionary-matching problem, the second case, methods are based on the use of automata, or related data structures.

2.1 The Brute Force Algorithm

The simplest implementation of the sliding window mechanism is the brute force algorithm. The strategy consists here in sliding the window uniformly one position to the right after each scan operation. As far as scans are correctly implemented, this obviously leads to a correct algorithm.

We give below the pseudocode of the corresponding procedure. The inputs are a nonempty string $x$, its length $m$ (thus $m \ge 1$), a string $y$, and its length $n$. The variable $p$ in the procedure corresponds to the current left position of the window on the text. It is understood that the string-to-string comparison in line 2 has to be processed symbol per symbol according to a given order.

BRUTE-FORCE-MATCHER(x, m, y, n)
1  for p from 0 up to n - m
2    loop if y[p .. p+m-1] = x
3           then report p

The time complexity of the brute force algorithm is $O(m \times n)$ in the worst case (for instance when $a^{m-1}b$ is searched in $a^n$ for any two symbols $a, b \in \Sigma$ satisfying $a \ne b$, if we assume that the rightmost symbol in the window is compared last). But its behavior is linear in $n$ when searching in random texts.

2.2 The Karp-Rabin Algorithm

Hashing provides a simple method for
avoiding a quadratic number of symbol comparisons in most practical situations. Instead of checking at each position $p$ of the window on the text whether the pattern occurs here or not, it seems to be more efficient to check only if the factor of the text delimited by the window, namely $y[p \,..\, p+m-1]$, "looks like" $x$. In order to check the resemblance between the two strings, a hash function is used. But, to be helpful for the string-matching problem, the hash function should be highly discriminating for strings. According to the running times of the algorithms, the function should also have the following properties:

• to be efficiently computable;
• to provide an easy computation of the value associated with the next factor from the value associated with the current factor.

The last point is met when symbols of alphabet $\Sigma$ are assimilated with integers and when the hash function, say $h$, is defined for each string $u \in \Sigma^*$ by

$$h(u) = \Bigl( \sum_{i=0}^{|u|-1} u[i] \times d^{\,|u|-1-i} \Bigr) \bmod q,$$

where $q$ and $d$ are two constants. Then, for each string $v \in \Sigma^*$ and for each pair of symbols $a', a'' \in \Sigma$, $h(va'')$ is computed from $h(a'v)$ by the formula

$$h(va'') = \bigl( (h(a'v) - a' \times d^{\,|v|}) \times d + a'' \bigr) \bmod q.$$

During the search for pattern $x$, it is enough to compare the value $h(x)$ with the hash value associated with each factor of length $m$ of text $y$. If the two values are equal, that is, in case of collision, it is still necessary to check whether the factor is equal to $x$ or not by symbol comparisons.

The underlying string-matching algorithm, which is denoted as the Karp-Rabin algorithm, is implemented below as the procedure KARP-RABIN-MATCHER. In the procedure, the values $d^{m-1} \bmod q$, $h(x)$, and $h(y[0 \,..\, m-2])$ are first precomputed, and stored respectively in the variables $r$, $s$, and $t$ (lines 1-7). The value of $t$ is then recomputed at each step of the search phase. It is assumed that the value of symbols ranges from $0$ to $c-1$; the quantity $(c-1) \times q$ is added in line 12 to provide correct computations on positive integers.

KARP-RABIN-MATCHER(x, m, y, n)
1  r ← 1
2  s ← x[0] mod q
3  t ← 0
4  for i from 1 up to m - 1
5    loop r ← (r × d) mod q
6         s ← (s × d + x[i]) mod q
7         t ← (t × d + y[i - 1]) mod q
8  for p from 0 up to n - m
9    loop t ← (t × d + y[p + m - 1]) mod q
10        if t = s and y[p .. p+m-1] = x
11          then report p
12        t ← ((c - 1) × q + t - y[p] × r) mod q

Convenient values for $d$ are powers of 2; in this case, all the products by $d$ can be computed as shifts on integers. The value of $q$ is generally a large prime (such that the quantities $(q-1) \times d + c - 1$ and $c \times q - 1$ do not cause overflows), but it can also be the value of the implicit modulus supported by integer operations. An illustration of the behavior of the algorithm is given in Figure 1.

p             0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
y[p]          n  o  ␣  d  e  f  e  n  s  e  ␣  f  o  r  ␣  s  e  n  s  e
h(y[p..p+4])  8  8  6 28  9 18 28 26 22 12 17 24 16  0  1  9

Figure 1: An illustration of the behavior of the Karp-Rabin algorithm when searching for the pattern $x$ = sense in the text $y$ = no␣defense␣for␣sense. Here, symbols are assimilated with their ASCII codes (hence $c = 256$), and the values of $q$ and $d$ are set respectively to 31 and 2 (this is valid for example when the maximal integer is $2^{15} - 1$). The value of $h(x)$ is $(115 \times 16 + 101 \times 8 + 110 \times 4 + 115 \times 2 + 101) \bmod 31 = 9$. Since only $h(y[4 \,..\, 8])$ and $h(y[15 \,..\, 19])$ among the defined values of $h(y[p \,..\, p+4])$ are equal to $h(x)$, two string-to-string comparisons against $x$ are performed.

The worst-case complexity of the above string-matching algorithm is quadratic, as it is for the brute force algorithm, but its expected running time is $O(m + n)$ if parameters $q$ and $d$ are adequate.

2.3 The Knuth-Morris-Pratt Algorithm

This section presents the first discovered linear-time string-matching algorithm. Its design follows a tight analysis of a version of the brute force algorithm in which the string-to-string comparison proceeds from left to right. The brute force algorithm wastes the information gathered during the scan of the text. On the contrary, the Knuth-Morris-Pratt algorithm stores the information with two purposes. First, it is used to improve the length of shifts. Second, there is no backward scan of the text.

Consider a given position $p$ of the window on the text. Assume that a mismatch occurs between symbols $y[p+i]$ and $x[i]$ for some $i$,
$0 \le i < m$ (an illustration is given in Figure 2). Thus, we have $y[p \,..\, p+i-1] = x[0 \,..\, i-1]$ and $y[p+i] \ne x[i]$. With regard to the information given by $x[0 \,..\, i-1]$, interesting shifts are necessarily connected with the borders of $x[0 \,..\, i-1]$. (A border of a string $u$ is a factor of $u$ that is both a prefix and a suffix of $u$.) Among the borders of $x[0 \,..\, i-1]$, the longest proper border followed by a symbol different from $x[i]$ is the best possible candidate, subject to the existence of such a border. (A factor $v$ of a string $u$ is said to be a proper factor of $u$ if $u$ and $v$ are not identical, that is, if $|v| < |u|$.) This introduces the function $\gamma$ defined for each $i \in \{0, 1, \ldots, m-1\}$ by

$$\gamma[i] = \max\{\, k \mid (0 \le k < i,\ x[i-k \,..\, i-1] = x[0 \,..\, k-1],\ x[k] \ne x[i]) \ \text{or}\ (k = -1) \,\}.$$

Then, after a shift of length $i - \gamma[i]$, the symbol comparisons can resume with $y[p+i]$ against $x[\gamma[i]]$ in the case where $\gamma[i] \ge 0$, and with $y[p+i+1]$ against $x[0]$ otherwise. Doing so, we miss no occurrence of $x$ in $y$, and avoid a backtrack on the text. The previous statement is still valid when no mismatch occurs, that is, when $i = m$, if we consider for a moment the string $x\$$ instead of $x$, where $\$$ is a symbol of alphabet $\Sigma$ occurring nowhere in $x$. This amounts to completing the definition of function $\gamma$ by setting

$$\gamma[m] = \max\{\, k \mid 0 \le k < m,\ x[m-k \,..\, m-1] = x[0 \,..\, k-1] \,\}.$$

The Knuth-Morris-Pratt string-matching algorithm is given in pseudocode below as the procedure KNUTH-MORRIS-PRATT-MATCHER. The values of function $\gamma$ are first computed by the function BETTER-PREFIX-FUNCTION given after. The value of the variable $j$ is equal to $p + i$ in the remainder of the code (the search phase of the algorithm strictly speaking); this simplifies the code, and points out the sequential processing of the text. Observe that the preprocessing phase applies a similar method to the pattern itself, as if $y = x[1 \,..\, m-1]$.

(a) Window at position 3, then at position 11 after the shift (the window is shown in brackets):
y   aba[abcababcabbab]aabacbaabc
x      abcababcababa
y   abaabcababc[abbabaabacbaa]bc
x              abcababcababa

(b) Graphic showing the prefixes of x aligned as borders of x[0 .. 9].

(c)
i      0  1  2  3  4  5  6  7  8  9 10 11 12 13
x[i]   a  b  c  a  b  a  b  c  a  b  a  b  a
γ[i]  -1  0  0 -1  0  2  0  0 -1  0  2  0  7  1

Figure 2: An illustration of the shift in the Knuth-Morris-Pratt algorithm when searching for the pattern $x$ = abcababcababa. (a) The window on the text $y$ is at
position 3. A mismatch occurs at position 10 on $x$. The matching symbols are shown darkly shaded, and the current analyzed symbols lightly shaded. Avoiding both a backtrack on the text and an immediate mismatch leads to shifting the window 8 positions to the right. The string-to-string comparison resumes at position 2 on the pattern. (b) The current shift is the consequence of an analysis of the list of the proper borders of $x[0 \,..\, 9]$ and of the symbols which follow them in $x$. The prefixes of $x$ that are borders of $x[0 \,..\, 9]$ = abcababcab are right-aligned along the discontinuous vertical line. String $x[0 \,..\, 4]$ = abcab is a border of $x[0 \,..\, 9]$, but is followed by symbol a, which is identical to $x[10]$. String $x[0 \,..\, 1]$ is the expected border, since it is followed by symbol c. (c) The values of the function $\gamma$ for pattern $x$.

KNUTH-MORRIS-PRATT-MATCHER(x, m, y, n)
1  γ ← BETTER-PREFIX-FUNCTION(x, m)
2  i ← 0
3  for j from 0 up to n - 1
4    loop while i ≥ 0 and y[j] ≠ x[i]
5           loop i ← γ[i]
6         i ← i + 1
7         if i = m
8           then report j + 1 - m
9                i ← γ[m]

BETTER-PREFIX-FUNCTION(x, m)
1  γ[0] ← -1
2  i ← 0
3  for j from 1 up to m - 1
4    loop if x[j] = x[i]
5           then γ[j] ← γ[i]
6           else γ[j] ← i
7                loop i ← γ[i]
8                  while i ≥ 0 and x[j] ≠ x[i]
9         i ← i + 1
10 γ[m] ← i
11 return γ

The algorithm has a worst-case running time in $O(m + n)$, and requires $O(m)$ extra space to store function $\gamma$. The linear running time results from the fact that the number of symbol comparisons performed during the preprocessing phase and the search phase is less than $2m$ and $2n$ respectively. All the previous bounds are independent of the size of the alphabet.

2.4 The Boyer-Moore Algorithm

The Boyer-Moore algorithm is considered as the most efficient string-matching algorithm in usual applications. A simplified version of it, or the entire algorithm, is often implemented in text editors for the "search" and "substitute" commands.

The scan operation proceeds from right to left in the window on the text, instead of left to right as in the Knuth-Morris-Pratt algorithm. In case of a mismatch, the algorithm uses two functions to shift the window. These two shift functions are called the
better-factor shift function and the bad-symbol shift function. In the two next paragraphs, we explain the goal of the two functions and we give procedures to precompute their values.

We first explain the aim of the better-factor shift function. Let $p$ be the current (left) position of the window on the text. Assume that a mismatch occurs between symbols $y[p+i]$ and $x[i]$ for some $i$, $0 \le i < m$ (an illustration is given in Figure 3). Then, we have $y[p+i] \ne x[i]$ and $y[p+i+1 \,..\, p+m-1] = x[i+1 \,..\, m-1]$. The better-factor shift consists in aligning the factor $y[p+i+1 \,..\, p+m-1]$ with its rightmost occurrence $x[k+1 \,..\, m-1-i+k]$ in $x$ preceded by a symbol $x[k]$ different from $x[i]$, to avoid an immediate mismatch. If no such factor exists, the shift consists in aligning the longest suffix of $y[p+i+1 \,..\, p+m-1]$ with a matching prefix of $x$. The better-factor shift function $\beta$ is defined by

$$\beta[i] = \min\{\, i-k \mid (0 \le k < i,\ x[k+1 \,..\, m-1-i+k] = x[i+1 \,..\, m-1],\ x[k] \ne x[i]) \ \text{or}\ (i-m \le k < 0,\ x[0 \,..\, m-1-i+k] = x[i-k \,..\, m-1]) \,\}$$

(a) Window at position 4, then at position 13 after the shift (the window is shown in brackets):
y   abcb[abcabbaaaba]cbaabbabcab
x       babacbababa
y   abcbabcabbaaa[bacbaabbabc]ab
x                babacbababa

(b) Graphic showing the factors and prefixes of x aligned with aba.

(c)
i      0  1  2  3  4  5  6  7  8  9 10
x[i]   b  a  b  a  c  b  a  b  a  b  a
β[i]   7  7  7  7  7  7  2  9  4 11  1

Figure 3: An illustration of the better-factor shift in the Boyer-Moore algorithm when searching for the pattern $x$ = babacbababa. (a) The window on the text is at position 4. The string-to-string comparison, which proceeds from right to left, stops with a mismatch at position 7 on $x$. The window is shifted 9 positions to the right to avoid an immediate mismatch. (b) Indeed, the string $x[8 \,..\, 10]$ = aba is repeated three times in $x$, but is preceded each time by symbol $x[7]$ = b. The expected matching factor in $x$ is then the prefix ba of $x$. The factors of $x$ identical with aba and the prefixes of $x$ ending with a suffix of aba are right-aligned along the rightmost discontinuous vertical line. (c) The values of the shift function $\beta$ for pattern $x$.

(a) Window at position 4, then at position 9 after the shift (the window is shown in brackets):
y   bacd[cbababadaca]baabbcbcabd
x       babacbababa
y   bacdcbaba[badacabaabb]cbcabd
x            babacbababa

Figure 4: An illustration of the bad-symbol shift in the Boyer-Moore algorithm when searching for the pattern $x$ = babacbababa. (a) The window on the text is at
position 4. The string-to-string comparison stops with a mismatch at position 9 on $x$. Considering only this position and the unexpected symbol occurring at this position, namely symbol c, leads to shifting the window 5 positions to the right. Notice that if the unexpected symbol were a or d, the applied shift would have been 1 and 10 respectively. (b) The values of the table $\omega$ for pattern $x$ when alphabet $\Sigma$ is reduced to {a, b, c, d}:

a      a  b  c  d
ω[a]   2  1  6 11

The function $\beta$ is defined in this way for each $i \in \{0, 1, \ldots, m-1\}$. The value $\beta[i]$ is then exactly the length of the shift induced by the better-factor shift. The values of function $\beta$ are computed by the function given below as the function BETTER-FACTOR-FUNCTION. An auxiliary table, namely $f$, is used; it is an analogue of the function $\gamma$ used in the Knuth-Morris-Pratt algorithm, but defined this time for the reverse pattern; it is indexed from $0$ to $m-1$. The running time of the function BETTER-FACTOR-FUNCTION is $O(m)$.

BETTER-FACTOR-FUNCTION(x, m)
1  for j from 0 up to m - 1
2    loop β[j] ← 0
3  i ← m
4  for j from m - 1 down to 0
5    loop f[j] ← i + 1
6         while i < m and x[j] ≠ x[i]
7           loop if β[i] = 0
8                  then β[i] ← i - j
9                i ← f[i] - 1
10        i ← i - 1
11 for j from 0 up to m - 1
12   loop if β[j] = 0
13          then β[j] ← i + 1
14        if j = i
15          then i ← f[i] - 1
16 return β

We now come to the aim of the bad-symbol shift function (Figure 4 shows an illustration). Consider again the text symbol $y[p+i]$ that causes a mismatch. Assume first that this symbol occurs in $x[0 \,..\, m-2]$. Then, let $k$ be the position of the rightmost occurrence of $y[p+i]$ in $x[0 \,..\, m-2]$. The window can be shifted $i - k$ positions to the right if $k < i$, and only one position otherwise, without missing an occurrence of $x$ in $y$. Assume now that symbol $y[p+i]$ does not occur in $x$. Then, no occurrence of $x$ in $y$ can overlap the position $p + i$ on the text, and thus, the window can be shifted $i + 1$ positions to the right. Let $\omega$ be the table indexed on alphabet $\Sigma$, and defined for each symbol $a \in \Sigma$ by

$$\omega[a] = \min\bigl(\{m\} \cup \{\, m-1-j \mid 0 \le j < m-1,\ x[j] = a \,\}\bigr).$$

According to the above discussion, the bad-symbol shift for the unexpected text symbol $a$ aligned with the symbol at position $i$ on the pattern is the value

$$\gamma[a, i] = \max\{\omega[a] + i - m + 1,\ 1\},$$
which defines the bad-symbol shift function $\gamma$ on $\Sigma \times \{0, 1, \ldots, m-1\}$. We give now the code of the function LAST-OCCURRENCE-FUNCTION that computes table $\omega$. Its running time is $O(m + \operatorname{card} \Sigma)$.

LAST-OCCURRENCE-FUNCTION(x, m)
1  for each a ∈ Σ
2    loop ω[a] ← m
3  for j from 0 up to m - 2
4    loop ω[x[j]] ← m - 1 - j
5  return ω

The shift applied in the Boyer-Moore algorithm in case of a mismatch is the maximum between the better-factor shift and the bad-symbol shift. In case of a whole match, the shift applied to the window is $m$ minus the length of the longest proper border of $x$, that is also the value $\beta[0]$ (this value is indeed what is called "the period" of the pattern). The code of the entire algorithm is given below.

BOYER-MOORE-MATCHER(x, m, y, n)
1  β ← BETTER-FACTOR-FUNCTION(x, m)
2  ω ← LAST-OCCURRENCE-FUNCTION(x, m)
3  p ← 0
4  while p ≤ n - m
5    loop i ← m - 1
6         while i ≥ 0 and y[p + i] = x[i]
7           loop i ← i - 1
8         if i ≥ 0
9           then p ← p + max{β[i], ω[y[p + i]] + i - m + 1}
10          else report p
11               p ← p + β[0]

The worst-case running time of the algorithm is quadratic. It is surprising however that, when used to search only for the first occurrence of the pattern, the algorithm runs in linear time. Slight modifications of the strategy yield linear-time algorithms. When searching for $a^{m-1}b$ in $b^n$ with $a, b \in \Sigma$ and $a \ne b$, the algorithm considers only $O(n/m)$ symbols of the text. This bound is the absolute minimum for any string-matching algorithm in the model where only the pattern is preprocessed. Indeed, the algorithm is expected to be extremely fast on large alphabets (relative to the length of the pattern).

2.5 Practical String-Matching Algorithms

The bad-symbol shift function introduced in the Boyer-Moore algorithm is not very efficient for small alphabets, but when the alphabet is large compared with the length of the pattern (as is often the case with the ASCII table and ordinary searches made under a text editor), it becomes very useful. Using only the corresponding table produces some efficient algorithms for practical searches. We describe one of these algorithms below.

Consider a position $p$ of
the window on the text, and assume that the symbols $y[p+m-1]$ and $x[m-1]$ are identical. If $x[m-1]$ does not occur in the prefix $x[0 \,..\, m-2]$ of $x$, the window can be shifted $m$ positions to the right after the string-to-string comparison between $y[p \,..\, p+m-2]$ and $x[0 \,..\, m-2]$ is performed. Otherwise, let $k$ be the position of the rightmost occurrence of $x[m-1]$ in $x[0 \,..\, m-2]$; the window can be shifted $m-1-k$ positions to the right. This shows that $\omega[y[p+m-1]]$ is also a valid shift in the case where $y[p+m-1] = x[m-1]$. The underlying algorithm is the Horspool algorithm.

The pseudocode of the Horspool algorithm is given below. To prevent two references to the rightmost symbol in the window at each scan and shift operation, table $\omega$ is slightly modified: $\omega[x[m-1]]$ contains the sentinel value 0, after its previous value is saved in variable $t$. The value of the variable $j$ is the value of the expression $p + m - 1$ in the discussion above.

HORSPOOL-MATCHER(x, m, y, n)
1  ω ← LAST-OCCURRENCE-FUNCTION(x, m)
2  t ← ω[x[m - 1]]
3  ω[x[m - 1]] ← 0
4  j ← m - 1
5  while j < n
6    loop s ← ω[y[j]]
7         if s ≠ 0
8           then j ← j + s
9           else if y[j-m+1 .. j-1] = x[0 .. m-2]
10                 then report j - m + 1
11               j ← j + t

Just like the brute force algorithm, the Horspool algorithm has a quadratic worst-case time complexity. But its behavior in practice is at least as good as the behavior of the Boyer-Moore algorithm. An example showing the behavior of both algorithms is given in Figure 5.

2.6 The Aho-Corasick Algorithm

The UNIX operating system provides standard text-file facilities. Among them is the series of grep commands that locate patterns in files. We describe in this section the Aho-Corasick algorithm underlying an implementation of the fgrep command of UNIX. It searches files for a finite and fixed set of strings (the dictionary), and can for instance output lines containing at least one of the strings.

If we are interested in searching for all occurrences of all strings of a dictionary, a first solution consists in repeating some string-matching algorithm for each string. Considering a dictionary $X$
containing $k$ strings and a text $y$, the search runs in that case in time $O(m + n \times k)$, where $m$ is the sum of the lengths of the strings in $X$, and $n$ the length of $y$. But this solution is not efficient, since text $y$ has to be read $k$ times. The solution described in this section provides both a sequential read of the text and a total running time which is $O(m + n)$ on a fixed alphabet. The algorithm can be viewed as a direct extension of a weaker version of the Knuth-Morris-Pratt algorithm.

Figure 5: An illustration of the behavior of two fast string-matching algorithms when searching for the pattern $x$ = sense in the text $y$ = no␣defense␣for␣sense. The successive positions of the window on the text are suggested by the alignments of $x$ with the corresponding factors of $y$. The symbols of $x$ considered during each scan operation are shown hachured. (a) Behavior of the Boyer-Moore algorithm. The first and second shifts result from the better-factor shift function, the third and fourth from the bad-symbol shift function, and the fifth from a shift of the length of $x$ minus the length of its longest proper border (the period of $x$). (b) Behavior of the Horspool algorithm. We assume here that the four leftmost symbols in the window are compared with the symbols of $x[0 \,..\, 3]$ from left to right.

The search is done with the help of an automaton that stores the situations encountered during the process. At a given position on the text, the current state is identified with the set of pattern prefixes ending here. The state represents all the factors of the pattern that can possibly lead to occurrences. Among the factors, the longest contains all the information necessary to continue the search. So, the search is realized with an automaton, denoted by $\mathcal{D}(X)$, of which states are in one-to-one correspondence with the prefixes of $X$. Implementing completely the transition function of $\mathcal{D}(X)$ would require a size $O(m \times \operatorname{card} \Sigma)$. Instead of that, the Aho-Corasick algorithm requires only $O(m)$ space. To get this space complexity, a part of the
transition function is made explicit in the data, and the other part is computed with the help of a failure function. For the first part, we assume that for any input $(p, a)$, the function denoted by TARGET returns some state $q$ if the triple $(p, a, q)$ is an edge in the data, and the value NIL otherwise. The second part uses the failure function fail, which is an analogue of the function $\gamma$ used in the Knuth-Morris-Pratt algorithm. But this time, the function is defined on the set of states, and for each state $p$ different from the initial state,

fail[p] = the state identified with the longest proper suffix of the prefix identified with p that is also a prefix of a string of X.

The aim of the failure function is to defer the computation of a transition from the current state, say $p$, to the computation of the transition from the state fail[p] with the same input symbol, say $a$, when no edge from $p$ labeled by symbol $a$ is in the data; the initial state, which is identified with the empty string, is the default state for the statement. We give below the pseudocode of the function NEXT-STATE that computes the transitions in the representation. The initial state is denoted by $i$.

NEXT-STATE(p, a, i)
1  while p ≠ NIL and TARGET(p, a) = NIL
2    loop p ← fail[p]
3  if p ≠ NIL
4    then q ← TARGET(p, a)
5    else q ← i
6  return q

Figure 6: The trie-like automaton of the pattern $X$ = {ace, as, ease}. The initial state is distinguished by a thick ingoing arrow, each terminal state by a thick outgoing arrow. The states are numbered from 0 to 8, according to the order in which they are created by the construction statement described in the present section. State 0 is identified with the empty string, state 1 with a, state 2 with ac, state 3 with ace, and so on. The automaton accepts the language $X$.

The preprocessing phase of the Aho-Corasick algorithm builds the explicit part of $\mathcal{D}(X)$ including function fail. It is itself divided into two phases.

The first phase of the preprocessing phase consists in building a sub-automaton of $\mathcal{D}(X)$. It is the trie of $X$ (the digital
tree in which branches spell the strings of $X$ and edges are labeled by symbols) having as initial state the root of the trie and as terminal states the nodes corresponding to strings of $X$ (an example is given in Figure 6). It differs from $\mathcal{D}(X)$ in two points:

• it contains only the forward edges;
• it accepts only the set $X$.

(An edge $(p, a, q)$ in the automaton is said to be forward if the prefix identified with $q$ is in the form $ua$ where $u$ is the prefix corresponding to $p$.) The function given below as the function TRIE-LIKE-AUTOMATON computes the automaton corresponding to the trie of $X$ by returning its initial state. The terminal mark of each state $r$ is managed through the attribute terminal[r]; the mark is either TRUE or FALSE depending on whether state $r$ is terminal or not. We assume that the function NEW-STATE creates and returns a new state, and that the procedure MAKE-EDGE adds a given new edge to the data.

TRIE-LIKE-AUTOMATON(X)
1  i ← NEW-STATE
2  terminal[i] ← FALSE
3  for string x from first to last string of X
4    loop p ← i
5         for symbol a from first to last symbol of x
6           loop q ← TARGET(p, a)
7                if q = NIL
8                  then q ← NEW-STATE
9                       terminal[q] ← FALSE
10                      MAKE-EDGE(p, a, q)
11               p ← q
12        terminal[p] ← TRUE
13 return i

The second step of the preprocessing phase consists mainly in precomputing the failure function. This is done by a breadth-first traversal of the trie-like automaton. The corresponding pseudocode is given below as the procedure MAKE-FAILURE-FUNCTION.

MAKE-FAILURE-FUNCTION(i)
1  fail[i] ← NIL
2  θ ← EMPTY-QUEUE
3  ENQUEUE(θ, i)
4  while not QUEUE-IS-EMPTY(θ)
5    loop p ← DEQUEUE(θ)
6         for each symbol a such that TARGET(p, a) ≠ NIL
7           loop q ← TARGET(p, a)
8                fail[q] ← NEXT-STATE(fail[p], a, i)
9                if terminal[fail[q]]
10                 then terminal[q] ← TRUE
11               ENQUEUE(θ, q)

During the computation, some states can be made terminal. This occurs when the state is identified with a prefix that ends with a string of $X$ (an illustration is given in Figure 7).

The complete dictionary-matching algorithm, implemented in the pseudocode below as the procedure AHO-CORASICK-MATCHER, starts with the two steps of the preprocessing; the search follows, which simulates automaton $\mathcal{D}(X)$. It is understood that the empty string does not belong to $X$.

AHO-CORASICK-MATCHER(X, y)
1  i ← TRIE-LIKE-AUTOMATON(X)
2  MAKE-FAILURE-FUNCTION(i)
3  p ← i
4  for symbol a from first to last symbol of y
5    loop p ← NEXT-STATE(p, a, i)
6         if terminal[p]
7           then report an occurrence

The total numbers of tests "TARGET(p, a) = NIL" performed by function NEXT-STATE during its calls by procedure MAKE-FAILURE-FUNCTION and during its calls by the search phase of the algorithm are bounded by $2m$ and $2n$ respectively, similarly to the bounds of comparisons in the Knuth-Morris-Pratt algorithm. Using a total order on the alphabet, the running time of function TARGET is both $O(\log k)$ and $O(\log \operatorname{card} \Sigma)$, since the maximum number of edges outgoing a state in the data representing automaton $\mathcal{D}(X)$ is bounded both by $k$ and by $\operatorname{card} \Sigma$. Thus, the entire algorithm runs in time $O(m + n)$ on a fixed alphabet, and in time $O((m + n) \times \log \min\{k, \operatorname{card} \Sigma\})$ in the general case. The algorithm requires $O(m)$ extra space to store the data and to implement the queue used during the breadth-first traversal executed in procedure MAKE-FAILURE-FUNCTION.

Figure 7: The explicit part of the automaton $\mathcal{D}(X)$ of the pattern $X$ = {ace, as, ease}. Compared to the trie-like automaton of $X$ displayed in Figure 6, state 7 has been made terminal; this is because the corresponding prefix, namely eas, ends with the string as that is in $X$. The failure function fail is depicted with discontinuous non-labeled directed edges.

Let us discuss the question of reporting occurrences of pattern $X$ (line 7 of procedure AHO-CORASICK-MATCHER). The simplest way of doing it is to report the ending positions of occurrences. This amounts to outputting the value of the position of the current symbol in the text. A second possibility is to report the whole set of strings in $X$ ending at the current position. To do so, the attribute terminal has to be transformed. First, for a state $r$, terminal[r] is the set of
the strings of $X$ that are suffixes of the string corresponding to $r$. Second, to avoid a quadratic behavior, sets are manipulated by their identifiers only.

3 Indexing Texts

This section deals with the pattern-matching problem applied to fixed texts. Solutions consist in building an index on the text that speeds up further searches. The indexes that we consider here are data structures that contain all the suffixes and therefore all the factors of the text. Two types of structures are presented: suffix trees and suffix automata. They are both compact representations of suffixes in the sense that their sizes are linear in the length of the text, although the sum of lengths of suffixes of a string is quadratic. Moreover, their constructions take linear time on fixed alphabets. On an arbitrary finite alphabet $\Sigma$, assumed to be ordered, a $\log \operatorname{card} \Sigma$ factor has to be added to almost all running times given in the following. This corresponds to the branching operation involved in the respective data structures.

Indexes are powerful tools that have many applications. Here is a nonexhaustive list of them, assuming an index on the text $y$.

• Membership: testing if a string $x$ occurs in $y$.
• Occurrence number: producing the number of occurrences of a string $x$ in $y$.
• List of positions: analogue of the string-matching problem of Section 2.
• Longest repeated factor: locating the longest factor of $y$ occurring at least twice in $y$.
• Longest common factor: finding a longest string that occurs both in a string $x$ and in $y$.

Figure 8: The suffix tree $\mathcal{T}(y)$ of the string $y$ = aabbabb$. The nodes are numbered from 0 to 12, according to the order in which they are created by the construction algorithm described in the present section. Each of the eight external nodes of the trie is marked by the position of the occurrence of the corresponding suffix in $y$. Hence, the branch (0, 5, 9, 4), running from the root to an external node, spells the string bbabb$, which is the suffix of $y$ starting at position 2.

Solutions to some of these problems are first
considered with suffix trees, then with suffix automata.

3.1 Suffix Trees

The suffix tree T(y) of a non-empty string y of length n is a data structure containing all the suffixes of y. In order to simplify the statement, it is assumed that y ends with a special symbol of the alphabet occurring nowhere else in y (this special symbol is denoted by $ in the examples). The suffix tree of y is a trie which satisfies the following properties:
• the branches from the root to the external nodes spell the non-empty suffixes of y, and each external node is marked by the position of the occurrence of the corresponding suffix in y;
• the internal nodes have at least two successors, except if y is a one-length string;
• the edges outgoing an internal node are labeled by factors starting with different symbols;
• any string that labels an edge is represented by the couple of integers corresponding to its position in y and its length.

An example of suffix tree is displayed in Figure 8. The special symbol at the end of y avoids marking nodes, and implies that T(y) has exactly n external nodes. The other properties then imply that the total size of T(y) is O(n), which makes it possible to design a linear-time construction of the data structure. The algorithm described in the following, and implemented by the procedure SUFFIX-TREE given further, has this time complexity.

The construction algorithm works as follows. It inserts the non-empty suffixes y[i..n-1], 0 ≤ i < n, of y in the data structure from the longest to the shortest suffix. In order to explain how this is performed, we introduce the two notations
    h_i = the longest prefix of y[i..n-1] that is a prefix of some strictly longer suffix of y,
and
    t_i = the string w such that y[i..n-1] is identical with h_i w,
defined for each i ∈ {1, ..., n-1}. The strategy to insert the suffixes is precisely based on these definitions. Initially, the data structure contains only the string y. Then, the insertion of the string y[i..n-1], 1 ≤ i < n, proceeds in two steps:
• first, the "head" in the data structure, that is, the node h
corresponding to string h_i, is located, possibly breaking an edge;
• second, a node called the "tail", say t, is created, added as successor of node h, and the edge from h to t is labeled with string t_i.

The second step of the insertion is clearly performed in constant time. Thus, finding the head is critical for the overall performance of the construction algorithm. A brute-force method to find the head consists in spelling the current suffix y[i..n-1] from the root of the trie, giving an O(|h_i|) time complexity for the insertion at step i, and an O(n²) running time to build the suffix tree T(y). Adding "short-circuit" links leads to an overall O(n) time complexity, although there is no guarantee that the insertion at any step i is realized in constant time.

Observe that in any suffix tree, if the string corresponding to a given internal node p in the data structure is in the form au, with a ∈ Σ and u ∈ Σ*, then there exists a unique internal node corresponding to the string u. From this remark are defined the suffix links by
    link[p] = the node q corresponding to the string u when p corresponds to the string au for some symbol a ∈ Σ,
for each internal node p that is different from the root. The links are useful when computing h_i from h_{i-1} because of the property: if h_{i-1} is in the form aw for some symbol a ∈ Σ and some string w ∈ Σ*, then w is a prefix of h_i.

We explain in the three following paragraphs how the suffix links help to find the successive heads efficiently. We consider a step i in the algorithm, assuming that i ≥ 1. We denote by g the node that corresponds to the string h_{i-1}. The aim is both to insert y[i..n-1] and to find the node h corresponding to the string h_i.

We first study the most general case of the insertion of the suffix y[i..n-1]. Particular cases are studied after. We assume in the present case that the predecessor of g in the data structure, say g', is both defined and different from the root. Then h_{i-1} is in the form auv where a ∈ Σ, u, v ∈ Σ*, au corresponds to the node g', and v labels the edge from g' to g.
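The definitions of the heads h_i and the prefix property that motivates the suffix links can be checked on small inputs by brute force. The following Python sketch is only an illustration under stated assumptions: the function names `heads` and `check_head_property` are hypothetical (they do not appear in the chapter), positions are 0-based, and the code runs in cubic time in the worst case, so it is a checking aid and not the linear-time construction.

```python
def heads(y):
    """Return the list h where h[i] is the longest prefix of y[i:] that is
    also a prefix of some strictly longer suffix y[j:], j < i (brute force)."""
    n = len(y)
    result = []
    for i in range(n):
        best = 0
        for j in range(i):  # strictly longer suffixes start before i
            k = 0
            # j < i guarantees j + k stays in range whenever i + k does
            while i + k < n and y[j + k] == y[i + k]:
                k += 1
            best = max(best, k)
        result.append(y[i:i + best])
    return result


def check_head_property(y):
    """Check: if h[i-1] = a w for a symbol a, then w is a prefix of h[i]."""
    h = heads(y)
    return all(h[i].startswith(h[i - 1][1:]) for i in range(1, len(y)))


if __name__ == "__main__":
    print(heads("aabbabb$"))
    print(check_head_property("aabbabb$"))
```

On y = aabbabb$ the head at step 4 is abb and the head at step 5 is bb, which agrees with the situation illustrated for step 5 in Figure 9.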
Since the string uv is a prefix of h_i, it can be fully spelled from the root. Moreover, the spelling operation of uv from the root can be short-circuited by spelling only the string v from the node link[g']. The node q reached at the end of the spelling operation (possibly breaking the last partially taken down edge) is then exactly the node link[g]. It remains to spell the string t_{i-1} from q to completely insert the string y[i..n-1]. The spelling stops on the expected node h (possibly breaking again an edge), which becomes the new head in the data structure. The suffix of t_{i-1} that has not been spelled so far is exactly the string t_i. An example for the whole previous statement is given in Figure 9.

Figure 9: During the construction of the suffix tree T(y) of the string y = aabbabb$, the step 5, that is, the insertion of the suffix bb$. The defined suffix links are depicted with discontinuous non-labeled directed edges. (a) Initially, the head in the data structure is node 7, and its suffix link is not yet defined. The predecessor of node 7, node 2, is different from the root, and the factor of y that is spelled from the root to node 7, namely h_4 = abb, is in the form auv, where a ∈ Σ, u ∈ Σ*, and v is the string of Σ* labeling the edge from node 2 to node 7. Here, a = a, u is the empty string, and v = bb. (b) Then, the string v = bb is spelled from the node linked with node 2, that is, from node 0; the spelling operation stops on the edge from node 5 to node 4; this edge is broken, which creates node 9. Node 9 is linked to node 7. The string t_4 = $ is spelled from node 9; the spelling operation stops on node 9, which becomes the new head in the data structure. (c) Node 10 is created, added as successor of node 9, and the edge from node 9 to node 10 is labeled by the string $, remainder of the last spelling operation.

The second case is when g is a direct successor of the root. The string h_{i-1} is then in the form au where a ∈ Σ and u ∈ Σ*. Similarly to the above case, the string u can be fully spelled from the root. The spelling of u gives a
node q which is then linked with g. Afterwards, the string t_{i-1} is spelled from q.

The last case is when g is the root itself. The string t_{i-1} minus its first symbol has to be spelled from the root. This ends the study of all the possible cases that can arise.

The important aspect of the algorithm is the use of two different implementations for the two spelling operations pointed out above. The first one, given in the pseudocode below as the function FAST-FIND, deals with the situation where we know in advance that a given factor y[j..j+k-1] of y can be fully spelled from a given node p of the trie. It is then sufficient to scan only the first symbols of the labels of the encountered edges, which justifies the name of the function. The second implementation of the spelling operation spells a factor y[j..j+k-1] of y from a given node p too, but, this time, the spelling is performed symbol by symbol. The corresponding function is implemented after as the function SLOW-FIND. Before giving the pseudocode of the functions, we make precise the notations used in the following.
• For any input (y, p, j), the function SUCCESSOR-BY-ONE-SYMBOL returns the node q such that q is a successor of the node p and the first symbol of the label of the edge from p to q is y[j]; if such a node q does not exist, it returns NIL.
• For any input (p, q), the function LABEL returns the two integers that represent the label of the edge from the node p to the node q.
• The function NEW-NODE creates and returns a new node.
• For any input (p, q, j, k, ℓ), the function NEW-BREAKING-NODE creates and returns the node r breaking the edge (p, y[j..j+k-1], q) at the position ℓ in the label y[j..j+k-1]. (This gives the two edges (p, y[j..j+ℓ-1], r) and (r, y[j+ℓ..j+k-1], q).)

Function FAST-FIND returns a couple of nodes such that the second one is the node reached by the spelling, and the first one is its predecessor.

FAST-FIND(y, p, j, k)
 1  p' ← NIL
 2  while k > 0
 3      loop p' ← p
 4           q ← SUCCESSOR-BY-ONE-SYMBOL(y, p, j)
 5           (r, s) ← LABEL(p, q)
 6           if s ≤ k
 7              then p ← q
 8                   j ← j + s
 9                   k ← k - s
10              else p ← NEW-BREAKING-NODE(p, q, r, s, k)
11                   k ← 0
12  return (p', p)

Compared to function FAST-FIND, function SLOW-FIND considers an extra input that is the predecessor of node p (denoted by p'). It considers in addition two extra outputs that are the position and the length of the factor that remains to be spelled.

SLOW-FIND(y, p', p, j, k)
 1  b ← FALSE
 2  loop q ← SUCCESSOR-BY-ONE-SYMBOL(y, p, j)
 3       if q = NIL
 4          then b ← TRUE
 5          else (r, s) ← LABEL(p, q)
 6               ℓ ← 1
 7               while ℓ < s and y[j + ℓ] = y[r + ℓ]
 8                   loop ℓ ← ℓ + 1
 9               j ← j + ℓ
10               k ← k - ℓ
11               p' ← p
12               if ℓ = s
13                  then p ← q
14                  else p ← NEW-BREAKING-NODE(p, q, r, s, ℓ)
15                       b ← TRUE
16  while b = FALSE
17  return (p', p, j, k)

The complete construction algorithm is implemented as the function SUFFIX-TREE given below. The function returns the root of the constructed suffix tree. Memorizing systematically the predecessors h' and q' of the nodes h and q avoids considering doubly-linked tries. The name of the attribute which marks the positions of the external nodes is made explicit.

SUFFIX-TREE(y, n)
 1  p ← NEW-NODE
 2  h' ← NIL
 3  h ← p
 4  r ← -1
 5  s ← n + 1
 6  for i from 0 up to n - 1
 7      loop if h' = NIL
 8              then (h', h, r, s) ← SLOW-FIND(y, NIL, p, r + 1, s - 1)
 9              else (j, k) ← LABEL(h', h)
10                   if h' = p
11                      then (q', q) ← FAST-FIND(y, p, j + 1, k - 1)
12                      else (q', q) ← FAST-FIND(y, link[h'], j, k)
13                   link[h] ← q
14                   (h', h, r, s) ← SLOW-FIND(y, q', q, r, s)
15           t ← NEW-NODE
16           MAKE-EDGE(h, (r, s), t)
17           position[t] ← i
18  return p

The algorithm runs in time O(n) (more precisely O(n × log card Σ) if we take into account the branching in the data structure). Indeed, the instruction at line 4 in function FAST-FIND is performed less than 2n times, and the number of symbol comparisons done at line 7 in function SLOW-FIND is less than n.

Once the suffix tree of y is built, some operations can be performed rapidly. We describe four applications in the following. Let x be a string of length m.

Testing whether x occurs in y or not can be solved in time O(m) by spelling x from the root of the trie symbol by symbol. If the operation succeeds, x occurs in y. Otherwise, we get the longest prefix of x occurring in y.

Producing the number of occurrences of x in y starts identically by
spelling x. Assume that x actually occurs in y. Let p be the node at the extremity of the last taken down edge, or the root itself if x is empty. The expected number, say k, is then exactly the number of external nodes of the subtrie of root p. This number can be computed by traversing the subtrie. Since each internal node of the subtrie has at least two successors, the total size of the subtrie is O(k), and the traversal of the subtrie is performed in time O(k) (independently of Σ). The method can be improved by precomputing in time O(n) (independently of Σ) all the values associated with each internal node; the whole operation is then performed in time O(m), whatever the number of occurrences of x.

The method for reporting the list of positions of x in y proceeds in the same way. The running time needed by the operation is O(m) to locate x in the trie, plus O(k) to report each of the positions associated with the k external nodes.

Finding the longest repeated factor of y amounts to computing the "deepest" internal node of the trie, that is, the internal node corresponding to a longest possible factor in y. This is performed in time O(n).

3.2 Suffix Automata

The suffix automaton S(y) of a string y is the minimal deterministic automaton recognizing Suff(y), that is, the set of suffixes of y. This automaton is minimal among all the deterministic automata recognizing the same language, which implies that it is not necessarily complete. An example is given in Figure 10.

Figure 10: The suffix automaton S(y) of the string y = aabbabb. The states are numbered from 0 to 10, according to the order in which they are created by the construction algorithm described in the present section. The initial state is state 0; terminal states are states 0, 5, 9, and 10. This automaton is the minimal deterministic automaton accepting the language of the suffixes of y.

The main point about suffix automata is that their size is asymptotically linear in the length of the string. More precisely, given a string y of length n,
the number of states of S(y) is equal to n + 1 when n ≤ 2, and is bounded by n + 1 and 2n - 1 otherwise; as to the number of edges, it is equal to n when n ≤ 1, it is 2 or 3 when n = 2, and it is bounded by n and 3n - 4 otherwise.

The construction of the suffix automaton of a string y of length n can be performed in time O(n), or, more precisely, in time O(n × log card Σ) on an arbitrary alphabet Σ. It makes use of a failure function fail defined on the states of S(y). The set of states of S(y) identifies with the quotient sets
    u⁻¹Suff(y) = {v ∈ Σ* : uv ∈ Suff(y)}
for the strings u in the whole set of factors of y. One may observe that two sets in the form u⁻¹Suff(y) are either disjoint or comparable. This allows us to set
    fail[p] = the smallest quotient set strictly containing the quotient set identified with p,
for each state p of the automaton different from the initial state of the automaton.

The function given below as the function SUFFIX-AUTOMATON builds the suffix automaton of y, and returns the initial state, say i, of the automaton. The construction is on-line, which means that at each step of the construction, just after processing a prefix y' of y, the suffix automaton S(y') is built. Denoting by t the state without outgoing edge in the automaton S(y'), terminal states of S(y') are implicitly known by the "suffix path" of t, that is, the list of the states t, fail[t], fail[fail[t]], ..., i. The algorithm uses the function length defined for each state p of S(y') by
    length[p] = the length of the longest string spelled from i to p.

SUFFIX-AUTOMATON(y)
 1  i ← NEW-STATE
 2  terminal[i] ← FALSE
 3  length[i] ← 0
 4  fail[i] ← NIL
 5  t ← i
 6  for symbol a from first to last symbol of y
 7      loop t ← SUFFIX-AUTOMATON-EXTENSION(i, t, a)
 8  p ← t
 9  loop terminal[p] ← TRUE
10       p ← fail[p]
11  while p ≠ NIL
12  return i

The on-line construction is based on the function SUFFIX-AUTOMATON-EXTENSION that is implemented below. The latter function processes the next symbol, say a, of the string y. If y' is the prefix of y preceding a, it transforms the suffix automaton S(y') already built into the suffix
automaton S(y'a).

SUFFIX-AUTOMATON-EXTENSION(i, t, a)
 1  t' ← t
 2  t ← NEW-STATE
 3  terminal[t] ← FALSE
 4  length[t] ← length[t'] + 1
 5  p ← t'
 6  loop MAKE-EDGE(p, a, t)
 7       p ← fail[p]
 8  while p ≠ NIL and TARGET(p, a) = NIL
 9  if p = NIL
10     then fail[t] ← i
11     else q ← TARGET(p, a)
12          if length[q] = length[p] + 1
13             then fail[t] ← q
14             else r ← NEW-STATE
15                  terminal[r] ← FALSE
16                  for each letter b such that TARGET(q, b) ≠ NIL
17                      loop MAKE-EDGE(r, b, TARGET(q, b))
18                  length[r] ← length[p] + 1
19                  fail[r] ← fail[q]
20                  fail[q] ← r
21                  fail[t] ← r
22                  loop CANCEL-EDGE(p, a, TARGET(p, a))
23                       MAKE-EDGE(p, a, r)
24                       p ← fail[p]
25                  while p ≠ NIL and TARGET(p, a) = q
26  return t

We illustrate the behavior of function SUFFIX-AUTOMATON-EXTENSION in Figure 11.

With the suffix automaton S(y) of y, several operations can be solved efficiently. We describe three of them. Let x be a string of length m.

Figure 11: An illustration of the behavior of function SUFFIX-AUTOMATON-EXTENSION. The function transforms the suffix automaton S(y') of a string y' into the suffix automaton S(y'a) for any given symbol a (the terminal states being implicitly known). Let us consider that y' = bbbbaabbb, and let us examine three possible cases according to a, namely a = c, a = b, and a = a. (a) The automaton S(bbbbaabbb). The state denoted by t' is state 10, and the suffix path of t' is the list of the states 10, 3, 2, 1, and 0. During the execution of the first loop of the function, state p runs through a part of the suffix path of t'. At the same time, edges labeled by a are created from p to the newly created state t = 11, unless such an edge already exists, in which case the loop stops. (b) If a = c, the execution stops with an undefined value for p. The edges labeled by c start at terminal states, and the failure of t is the initial state. (c) If a = b, the loop stops on state p = 3, because an edge labeled by b is defined on it. The condition at line 12 of function SUFFIX-AUTOMATON-EXTENSION is satisfied, which means that the edge labeled by a from p is not a short-circuit. In this case, the state ending the previous edge is the failure of
t. (d) Finally, when a = a, the loop stops on state p = 3 for the same reason, but the edge labeled by a from p is a short-circuit. The string bbba is a suffix of the newly considered string bbbbaabbba, but bbbba is not. Since these two strings reach state 5, this state is duplicated into a new state r = 12 that becomes terminal. Suffixes bba and ba are redirected to this new state. The failure of t is 12.

The membership test is solved in time O(m) by spelling x from the initial state of the automaton. If the entire string is spelled, x occurs in y. Otherwise we get the longest prefix of x occurring in y.

Computing the number k of occurrences of x in y (assuming that x is a factor of y) starts similarly. Let p be the state reached after the spelling of x from the initial state. Then k is exactly the number of paths leading from p to terminal states, one for each suffix of y that starts with x. The number k associated with each state p can be precomputed in time O(n) (independently of the alphabet) by a depth-first traversal of the graph underlying the automaton. The query for x is then performed in time O(m), whatever k is.

The base of an algorithm for computing a longest factor common to x and y is implemented in the procedure ENDING-FACTORS-MATCHER given below. This procedure reports, at each position in x, the length of the longest factor of y ending there. It can obviously be used for string matching. It works as the procedure AHO-CORASICK-MATCHER in the use of the failure function. The running time of the search phase of the procedure is O(m).

ENDING-FACTORS-MATCHER(y, x)
 1  i ← SUFFIX-AUTOMATON(y)
 2  ℓ ← 0
 3  p ← i
 4  for symbol a from first to last symbol of x
 5      loop if TARGET(p, a) ≠ NIL
 6              then ℓ ← ℓ + 1
 7                   p ← TARGET(p, a)
 8              else loop p ← fail[p]
 9                   while p ≠ NIL and TARGET(p, a) = NIL
10                   if p = NIL
11                      then ℓ ← 0
12                           p ← i
13                      else ℓ ← length[p] + 1
14                           p ← TARGET(p, a)
15           report ℓ

Retaining a largest value of the variable ℓ in the procedure (instead of reporting all values) solves the longest common factor problem.

4 Research Issues and Summary

String
searching by hashing was introduced by Harrison (1971), and later fully analyzed in Karp and Rabin (1987).

The first linear-time string-matching algorithm is due to Knuth, Morris, and Pratt (Knuth, Morris, and Pratt 1977). It can be proved that, during the search, the delay, that is, the number of times a symbol of the text is compared to symbols of the pattern, is less than log_Φ(m + 1), where Φ is the golden ratio (1 + √5)/2. Simon (1993) gives a similar algorithm but with a delay bounded by the size of the alphabet (of the pattern). Hancart (1993) proves that the delay of Simon's algorithm is less than 1 + log₂ m. This paper also proves that this is optimal among algorithms processing the text with a one-symbol buffer. The bound becomes O(log min{1 + log₂ m, card Σ}) using an ordering on the alphabet Σ, which is not a restriction in practice.

Galil (1981) gives a general criterion to transform string-matching algorithms that work sequentially on the text into real-time algorithms.

The Boyer-Moore algorithm was designed in Boyer and Moore (1977). The version given in this chapter follows Knuth, Morris, and Pratt (1977). This paper contains the first proof of the linearity of the algorithm when restricted to the search of the first occurrence of the pattern. Cole (1994) proves that the maximum number of symbol comparisons is bounded by 3n for non-periodic patterns, and that this bound is tight. Knuth, Morris, and Pratt (1977) considers a variant of the Boyer-Moore algorithm in which all previous matches inside the current window are memorized. Each window configuration becomes the state of what is called the Boyer-Moore automaton. It is still unknown whether the maximum number of states of the automaton is polynomial or not.

Several variants of the Boyer-Moore algorithm avoid the quadratic behavior when searching for all occurrences of the pattern. Among the most efficient in terms of the number of symbol comparisons are the algorithm of
Apostolico and Giancarlo (1986), the Turbo-BM algorithm by Crochemore et alii (1992) (the two previous algorithms are analyzed in Lecroq 1995), and the algorithm of Colussi (Colussi 1994).

The Horspool algorithm is from Horspool (1980). The paper contains practical aspects of string matching that are developed in Hume and Sunday (1991).

The optimal bound on the expected time complexity of string matching is O((log m / m) × n) (see Knuth, Morris, and Pratt 1977 and the paper of Yao 1979).

String matching can be solved by linear-time algorithms requiring only a constant amount of memory in addition to the pattern and the window on the text. This can be proved by different techniques presented in Crochemore and Rytter (1994). The most recent solution is by Gasieniec, Plandowski, and Rytter (1995).

Cole et alii (1995) shows that, in the worst case, any string-matching algorithm working with symbol comparisons makes at least n + (9/(4m)) × (n - m) comparisons during its search phase. Some string-matching algorithms make less than 2n comparisons. The presently-known upper bound on the problem is n + (8/(3(m + 1))) × (n - m), but with a quadratic-time preprocessing phase (see Cole et alii 1995). With a linear-time preprocessing phase, the current upper bounds are (4/3)n - (1/3)m and n + ((4 log m + 2)/m) × (n - m) (see, respectively, Galil and Giancarlo 1992 and Breslauer and Galil 1993). Except in a few cases (patterns of length 3, for example), lower and upper bounds do not meet. So, the problem of the exact complexity of string matching is open.

The Aho-Corasick algorithm is from Aho and Corasick (1975). Commentz-Walter (1979) has designed an extension of the Boyer-Moore algorithm that solves the dictionary-matching problem. It is fully described in Aho (1990).

The suffix-tree construction of Section 3 is from McCreight (1976). An on-line version is by Ukkonen (1992). A previous algorithm by Weiner (1973) relates suffix trees to a data structure close to suffix automata.

The construction of suffix automata, also described as directed acyclic word graphs and often denoted by the acronym DAWG, is from
Blumer et alii (1985) and from Crochemore (1986).

An alternative data structure that efficiently implements indexes is the notion of suffix arrays introduced in Manber and Myers (1993).

5 Defining Terms

Border: A string v is a border of a string u if v is both a prefix and a suffix of u. String v is said to be the border of u if it is the longest proper border of u.
Factor: A string v is a factor of a string u if u = u'vu'' for some strings u' and u''.
Occurrence: A string v occurs in a string u if v is a factor of u.
Pattern: A finite number of strings that are searched for in texts.
Prefix: A string v is a prefix of a string u if u = vu'' for some string u''.
Proper: Qualifies a factor of a string that is not equal to the string itself.
Segment: Equivalent to factor.
Suffix: A string v is a suffix of a string u if u = u'v for some string u'.
Suffix tree: Trie containing all the suffixes of a string.
Suffix automaton: Smallest automaton accepting the suffixes of a string.
Text: A stream of symbols that is searched for occurrences of patterns.
Trie: Digital tree, tree in which edges are labeled by symbols or strings.
Window: Factor of the text that is aligned with the pattern.

6 References

Aho, A.V. 1990. Algorithms for finding patterns in strings. In Handbook of Theoretical Computer Science, ed. J. van Leeuwen, vol. A, chap. 5, p. 255-300. Elsevier, Amsterdam.
Aho, A.V. and Corasick, M.J. 1975. Efficient string matching: an aid to bibliographic search. Comm. ACM 18:333-340.
Baase, S. Computer algorithms: Introduction to design and analysis. Addison-Wesley.
Blumer, A., Blumer, J., Ehrenfeucht, A., Haussler, D., Chen, M.T., and Seiferas, J. 1985. The smallest automaton recognizing the subwords of a text. Theoret. Comput. Sci. 40:31-55.
Boyer, R.S. and Moore, J.S. 1977. A fast string searching algorithm. Comm. ACM 20:762-772.
Breslauer, D. and Galil, Z. 1993. Efficient comparison based string matching. J. Complexity 9:339-365.
Cole, R. 1994. Tight bounds on the complexity of the Boyer-Moore pattern matching algorithm. SIAM J. Comput. 23:1075-1091.
Cole, R., Hariharan, R.,
Zwick, U., and Paterson, M.S. 1995. Tighter lower bounds on the exact complexity of string matching. SIAM J. Comput. 24:30-45.
Colussi, L. 1994. Fastest pattern matching in strings. J. Algorithms 16:163-189.
Cormen, T.H., Leiserson, C.E., and Rivest, R.L. 1990. Introduction to algorithms. MIT Press.
Crochemore, M. 1986. Transducers and repetitions. Theoret. Comput. Sci. 45:63-86.
Crochemore, M. and Rytter, W. 1994. Text Algorithms. Oxford University Press.
Galil, Z. 1981. String matching in real time. J. ACM 28:134-149.
Galil, Z. and Giancarlo, R. 1992. On the exact complexity of string matching: upper bounds. SIAM J. Comput. 21:407-437.
Gonnet, G.H. and Baeza-Yates, R.A. 1991. Handbook of algorithms and data structures. Addison-Wesley.
Hancart, C. 1993. On Simon's string searching algorithm. Inf. Process. Lett. 47:95-99.
Horspool, R.N. 1980. Practical fast searching in strings. Software: Practice and Experience 10:501-506.
Hume, A. and Sunday, D.M. 1991. Fast string searching. Software: Practice and Experience 21:1221-1248.
Karp, R.M. and Rabin, M.O. 1987. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31:249-260.
Knuth, D.E., Morris Jr, J.H., and Pratt, V.R. 1977. Fast pattern matching in strings. SIAM J. Comput. 6:323-350.
Lecroq, T. 1995. Experimental results on string-matching algorithms. Software: Practice and Experience 25:727-765.
McCreight, E.M. 1976. A space-economical suffix tree construction algorithm. J. ACM 23:262-272.

What is dynamic programming?

"Programming" here has no particular connection to computer programming.

Motivating example: finding shortest paths. To find the shortest path from start to goal, we have to go to either A, B, or C. If we are greedy, we should go to the closest one, say B. But this may not be the shortest path eventually. Suppose that we move on to some point x, where x = A, B, or C. We can use the same idea to get from x to the goal, except that we are now closer to our goal than we originally were. But does it always work? Not necessarily. It depends on how the smaller sub-problems can be solved and used.

Fibonacci sequence: 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597
2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811

F_n = F_{n-1} + F_{n-2}, F_0 = F_1 = 1
Analytically, F_n ≈ 1.618^n / 1.382

Pseudo-code for computing the n-th Fibonacci number by recursion:

function fib(n)
    if n = 0 or n = 1
        return 1
    else
        return fib(n-1) + fib(n-2)

If we call fib(5), then fib(5) = fib(4) + fib(3) = fib(3) + fib(2) + fib(2) + fib(1) = ... = 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 8. The number of times fib is called: 15.
If we call fib(10), the number of times fib is called: 177.
In general, to calculate F_n we need 2 × F_n - 1 function calls, which is ≈ 1.447 × 1.618^n, which is exponential. For F_50 this is about 40 billion function calls. Try it on your computer.

Can you do it faster than your computer?

function fib(n)
    F is an array with n+1 elements
    F[0] = 0; F[1] = 1
    for i = 2 to n
        F[i] = F[i-1] + F[i-2]
    end
    return F[n]

F[2] = F[1] + F[0] = 1
F[3] = F[2] + F[1] = 2
F[4] = F[3] + F[2] = 3
...
F[10] = F[9] + F[8] = 55

Time: O(n). Memory: O(n). In fact we only need O(1) memory here, since we don't need to remember all the F's, just F[i-1] and F[i-2].

Example 2: Finding the shortest path (SP) from (0, 0) to (n, n) on an n × n grid.

N = 2: 6 distinct routes, each with 4 steps. To calculate path lengths for all routes by enumeration, we need 6 × 4 = 24 operations.

Let sp(i, j) represent the shortest path from (0, 0) to (i, j), R(i, j) the distance from (i, j) to (i, j+1), and C(i, j) the distance from (i, j) to (i+1, j). We can use the recursive call
    sp(i, j) = min( sp(i, j-1) + R(i, j-1), sp(i-1, j) + C(i-1, j) )
(ignoring boundary conditions).

Recursion tree:
    SP(2,2)
    SP(1,2) SP(2,1)
    SP(0,2) SP(1,1) SP(1,1) SP(2,0)
    SP(0,1) SP(0,1) SP(1,0) SP(1,0) SP(0,1) SP(1,0)

Number of function calls: 18
Number of steps if using DP: 12

N = 3: Enumeration 20 × 6 = 120; Recursion 68; DP 24
In general: Enumeration 2n × (2n)! / (n!)²; Recursion ≈ 2^(2n); DP 2n(n+1)
N = 5: Enumeration 2520; Recursion 1032; DP 60
N = 10
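The two examples above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the course's reference code: the function names `fib_recursive`, `fib_dp`, and `grid_shortest_path` are hypothetical, the recursive version uses the base cases fib(0) = fib(1) = 1 from the recursive pseudo-code, the bottom-up version uses F[0] = 0 and F[1] = 1 from the array pseudo-code, and the grid weights are assumed to be given as a list of rows with R[i][j] the distance from (i, j) to (i, j+1) and C[i][j] the distance from (i, j) to (i+1, j).

```python
def fib_recursive(n, counter):
    """Naive recursion with fib(0) = fib(1) = 1; counter[0] counts the calls."""
    counter[0] += 1
    if n <= 1:
        return 1
    return fib_recursive(n - 1, counter) + fib_recursive(n - 2, counter)


def fib_dp(n):
    """Bottom-up computation with F[0] = 0, F[1] = 1, keeping only the last
    two values, so it uses O(n) time and O(1) memory."""
    prev, curr = 0, 1
    for _ in range(n):
        prev, curr = curr, prev + curr
    return prev


def grid_shortest_path(R, C, n):
    """Shortest distance from (0, 0) to (n, n), moving only from (i, j) to
    (i, j+1) at cost R[i][j] or to (i+1, j) at cost C[i][j]."""
    inf = float("inf")
    sp = [[inf] * (n + 1) for _ in range(n + 1)]
    sp[0][0] = 0
    for i in range(n + 1):
        for j in range(n + 1):
            if j > 0:  # arrive from (i, j-1)
                sp[i][j] = min(sp[i][j], sp[i][j - 1] + R[i][j - 1])
            if i > 0:  # arrive from (i-1, j)
                sp[i][j] = min(sp[i][j], sp[i - 1][j] + C[i - 1][j])
    return sp[n][n]


if __name__ == "__main__":
    calls = [0]
    print(fib_recursive(10, calls), calls[0])
    print(fib_dp(10))
    R = [[1, 3], [1, 1], [4, 1]]   # (n+1) rows of n horizontal distances
    C = [[2, 1, 5], [1, 2, 1]]     # n rows of (n+1) vertical distances
    print(grid_shortest_path(R, C, 2))
```

The table version of the grid problem fills each of the (n+1)² entries once, which is where the saving over the recursion tree comes from: every sp(i, j) is computed a single time instead of once per path that reaches it.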