### Create a StudySoup account

#### Be part of our community, it's free to join!

Already have a StudySoup account? Login here

# Efficient Algorithms and Intractable Problems COMPSCI 170

GPA 3.93

### View Full Document

## 22

## 0

## Popular in Course

## Popular in ComputerScienence

This 65 page Class Notes was uploaded by Mr. Hayley Barton on Thursday October 22, 2015. The Class Notes belongs to COMPSCI 170 at University of California - Berkeley taught by Staff in Fall. Since its upload, it has received 22 views. For similar materials see /class/226667/compsci-170-university-of-california-berkeley in ComputerScienence at University of California - Berkeley.

## Similar to COMPSCI 170 at

## Reviews for Efficient Algorithms and Intractable Problems

### What is Karma?

#### Karma is the currency of StudySoup.

#### You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 10/22/15

300316 1 a Hyperlink Analysis on the Web Monika Henzinger monikagooglecom Outline Random Walks Classic Information Retrieval IR vs Web lR Hyperlink Analysis PageRank HITS Random Walks on the Web Googlequot Random Walks Random Walk discretetime stochastic process over a graph GVE with a transition probability matrix P Random Walk is at one node at any time making nodetransitions at time steps t12 with P being the probability of going to node when at nodei Initial node chosen according to some probability distribution qlt0gt over S Googlequot Random Walks cont q row vector whose ith component is the probability that the chain is in node i at timet qt1 qa p gt qt q0 Pt A stationary distribution is a probability distribution q such that q q P steadystate behavior Example Pu 1degreei if ij in G and 0 otherwise then qi degreei2m Googlequot Random Walks cont Theorem Under certain conditions There exists a unique stationary distribution q with qi gt O for all i Let Nit be the number of times the random walk visits node i in t steps Then the fraction of steps the walk spends at i equals qi ie MM 2 l gtoo t Googlequot Information Retrieval Input Document collection Goal Retrieve documents or text with information content that is relevant to user s information need Two aspects 1 Processing the collection 2 Processing queries searching Googlequot Classic information retrieval Ranking is a function of query term frequency within the docum ent tf and across all documents idf This works because of the following assumptions in classical IR Queries are long and well specified What is the impact of the Falklands war on AngloArgentinean relations Documents eg newspaper articles are coherent well authored and are usually about one topic The vocabulary is small and relatively well understood Googlequot Web information retrieval None of these assumptions hold Queries are short 235 terms in avg Huge variety in documents language quality duplication Huge vocabulary 100s million of terms Deliberate misinformation Ranking is a function of the query terms and of the h yperink structure Googlequot Hyperlink analysis Idea Mine structure of the web graph Each web page is a node Each hyperlink is a directed edge Related work Classic IR work citations links aka Bibliometrics K 63 G 72 S 73 Sociometrics K 53 MMSM 86 Many Web related papers use this approach PPR 96 AMM 97 8 97 CK 97 K 98 BP 98 4 0 W Gogglequot Google s approach Assumption A link from page A to page B is a recommendation of page B by the author of A we say B is successor of A Quaity of a page is related to its indegree Recursion Quality of a page is related to its indegree and to the quality of pages linking to it PageRank BP 98 Googlequot Definition of PageRank Consider the following infinite random walk surf Initially the surfer is at a random page At each step the surfer proceeds to a randomly chosen web page with probability d to a randomly chosen successor of the current page with probability 1d The PageRank of a page p is the fraction of steps the surfer spends at p in the limit Googlequot PageRank cont By previous theorem PageRank stationary probability for this Markov chain ie PageRankp 1 d 2 PageRankq 0utdegreeq qpeE where n is the total number of nodes in the graph Googlequot PageRank cont ey PageRank of P is 1 d 14th the PageRank ofA 13rd the PageRank of B d n 300316quot PageRank Used in Google s ranking function Queryindependent Summarizes the web opinion of the page importance Googlequot Outline Markov Chains and Random Walks Information Retrieval IR vs Web IR Hyperlink Analysis PageRank HITS Random Walks on the Web Googlequot Neighborhood graph Subgraph associated to each query Query Results Back Set Start Set Forward Set An edge for each hyperlink but no edges within the same host 300316 HITS K 98 Goal Given a query find Good sources of content authorities Good sources of links hubs A Googlequot Intuition Authority comes from inedges Being a good hub comes from outedges DEEMED Better authority comes from inedges from good hubs Being a better hub comes from outedges to good authorities quot E Googlequot HITS details gt gt Repeat until HUB and AUTH converge gt gt Normalize HUB and AUTH HUBp 2 AUTHri for all ri with p ri in E AUTHp 2 HUBqi for all qi with q p in E P 339 V l Googlequot PageRank vs HITS Computation Once for all documents and queries offline Computation Requires computation for each query o Queryindependent o Querydependent requires combination with querydependent criteria o Relatively easy to spam Hard to Spam Quality depends on quality of start set Gives hubs as well as authorities Googlequot PageRank vs HITS Lempel Not rank stable 01 changes in graph can change ON2 orderrelations NgZheng Jordan01 Vaue Stabe change in k nodes with PR values p1pk results in p st k lp pISZZpJd 11 Not rankstable vaue stabiity depends on gap 9 between largest and second largest eigenvector change of Og nodes results in p St llp p Q1 300316quot Outline Random Walks Classic Information Retrieval IR vs Web lR Hyperlink Analysis PageRank HITS Random Walks on the Web Googlequot Let s do it Perform PageRank random walk Select uniform random sample from resulting pages gt Qualitybiased sample of the web gt Useful for estimation Web properties Percentage of highquality pages in a domain in a language on a topic Search engine comparison Sum of probabilities of pages in the index index quality 39 Googlequot Sampling pages almost according to PageRank Problems Starting state bias finite walk only approximates PageRank Can tjump to a random page instead jump to a random page on a random host seen so far Sampling pages according to a distribution that behaves similarly to PageRank Googlequot Experiments on the real web Performed two long random walks with d17 starting at wwwyahoocom length HTML pages successfully downloaded unique HTML pages sampled pages Walk 1 Walk2 18 hours 54 hours 1393265 2940794 509279 1 002745 1025 1100 Googlequot Random walk effectiveness Repeatability Index quality results are consistent over the 2 walks Reduction of initial bias Bias for wwwyahoocom is reduced in longer walk Similarity to PageRank Pages or hosts that are highlyreachable are visited often by the random walks The average indegree of pages with indegree lt 1000 is high 53 in walk 1 60 in walk 2 Googlequot Most frequently visited pages Page www microsoftcom wwwmicrosoftcomwindowsiedefault htm wANanetscapeconv wwwmicrosoftcomie www microsoft co mwindowsiedownload wwwmicrosoftcomwindowsiedownloadallhtn wwwadobecomprodindexacro batreadstep htr homenetscapecom mANMLanexchangeconv wNmmyahoeconv Freq Wak2 3172 2064 1991 1982 1915 1696 1634 1581 1574 1527 Freq Walk1 1600 1045 876 1017 943 830 780 695 763 1132 300316quot NQOCDNO IACDOOA Rank Walk1 Most frequently visited hosts Site www m icrosoft com homenetscapecom wwwadobe com wwwamazoncom www netscape com excite netscape com www realcom www lycos com wwwzdnet com wwwinkexchangecom wwwyahoocom Frequency Walk 2 32452 23329 10884 10146 4862 4714 4494 4448 4038 3738 3461 Frequency Walk 1 16917 11084 5539 5182 2307 2372 2777 2645 2562 1940 2595 Rank Walk 1 A A IMCDCDO IGDOLOOMA 300316quot Estimating search engine index quality Choose a sample of pages p7p2p3 pn according to PageRank distribution w Check if the pages are in search engine index 8 BB 98 Exact match Host match Estimate for quality of index 8 is the percentage of sampled pages that are in 8 Le 1 M gzzipj e S j where pj in S 1 if pjis in S and O othenvise Googlequot Results for index quality fall 98 035 7 03 lWalk 1 exact match 025 7 gt I Walk 2 exact 02 match 3 015 7 Walk 1 host 3 01 7 match 5 5 IEIWaIk 2 host 005 7 match 0 7 m be 4x9 Results for index qualitypage fall 98 Index Quality per Page lWalk 1 exact match I Walk 2 exact match ilWalk 1 host match IE Walk 2 host match Sampling pages nearly uniformly Perform PageRank random walk Sample pages from walk st Prp is sampled p is crawled oc 1 PageRankp gt Nearly uniform sample of the web gt Useful for estimation Web properties Percentage of pages in a domain in a language on a topic Search engine comparison Percentage of pages in a search engine index index size Googlequot Sampling pages nearly uniformly Nearly uniform sample Prp is sampled Prp is crawled Prp is sampled p is crawled A page is wellconnected if it can be reached by almost every other page by short paths On12 steps For short paths in a wellconnected graph Prp is crawled z E number of Visits of p m L PageRankp Googlequot Sampling pages nearly uniformly Problems All previous problems Need to approximate PageRank PR PageRank computation of crawled graph VR VisitRatio on crawled graph Dependence especially in short cycles Googlequot Evaluation using synthetic graph Generated graph that mimics connectivity of real web Zipfian distribution of in amp outdegree Performed near uniform sampling using both PR and VR Compared connectivity characteristics of sampled nodes to those of entire graph If sampling were truly uniform characteristics should beiden cal Googlequot Evaluation based on outdegree 30 25 Original Graph PR biased Sample VFl biased Sample 20 o15 Unbiased 10 5 0 5 8 11 14 17 20 Out Degree Googlequot Evaluation based on outdegree Original Graph PFl blased Sample 1 5 VFl biased Sample Unbiased Sample Sample Graph 1 Ratio 05 0 l l l l l 5 8 11 14 17 20 Out Degree Evaluation based on indegree 30 Original Graph 25 PFl blased Sample 20 VR biased Sample Unbiased Sample o 4 6 8 1O 12 14 16 18 ln Degree Evaluation based on indegree 25 Original Graph PR biased 2 VR biased Unbiased Sample Graph 1 395 Ratio 1 05 0 10 ln Degree 12 14 Evaluation based on PageRank 30 25 20 l Original Graph PFl biased Sample 0 15 VFl biased Sample 0 Unbiased Sample 10 06 08 1 12 14 16 18 2 22 24 26 28 PageRank Factor Googlequot Evaluation based on PageRank 35 3 Original Graph PFl biased Sample 2395 VR biased Sample Sample 2 Unbiased Graph Ratio 15 1 05 0 06 08 1 12 14 16 18 2 22 24 26 28 PageRank Factor 39 Googlequot Experiments on the real web Performed 3 random walks in Nov 1999 starting from 10258 seed URLs Small overlap between walks walks disperse well 82 visited by only 1 walk Wak visited URLs unigue URLs 1 2702939 990251 2 2507004 921114 3 5006745 1655799 Experiments on the real web cont Sampled each walk Uniform sampling VR sampling PR sampling Total of 9 samples each containing 10000 URLs 2 Experiments Computed distribution of toplevel domains of URLs in each sample and compared to distribution discovered during an 80m document web crawl Index size comparison on 8 search engines Googlequot Percentage of pages in domains ICrawl lWak1 Uniform lWalk 2 Uniform lWalk 3 Uniform Walk 1 PR Walk 2 PR Walk 3 PR lWalk 1 VR lWalk 2 VR lWalk 3 VR com edu org net jp gov de uk Googlequot Estimating search engine index size Choose a sample of pages p7p2p3 pn according to near uniform distribution Check if the pages are in search engine index 8 BB 98 Exact match Host match Estimate for size of index 8 is the percentage of sampled pages that are in 8 Le 1 w zzzipj e S where pj in S 1 if pjis in S and O othenvise Googlequot Result set for index size fall 99 IWalk 1 PR Sample IWalk1VR Sample IWalk 2 PR Sample IWalk 2 VB Sample IWalk 3 PR Sample lWalk 3 VR Sam le InfUSeek Lycos Googlequot Summary Our random walks oversample well connected pages We compensate by sampling pages visited during random walk such that wellconnected pages are less likely to be sampled Resulting sample is less skewed than random walk but still not uniform Googlequot Other approaches Lawrence and Giles 99 BarYossef etaI OO Rusmevichientong et al 01 Googlequot UC Berkeley 7 CS170 Intro to CS Theory Handout N13 Professor Luca Trevisan October 18 2001 corrected Oct 25 Notes for Lecture 13 1 Edit Distance 11 De nition When you run a spell checker on a teXt and it nds a word not in the dictionary it normally proposes a choice of possible corrections If it nds stell it might suggest tell swell stull still steel steal stall spell smell shell and sell As part of the heuristic used to propose alternatives words that are close to the misspelled word are proposed We will now see a formal de nition of distance between strings and a simple but efficient algorithm to compute such distance The distance between two strings z 1 mn and y yl ym is the minimum number of errors edit operations needed to transform z into y where possible operations are 0 insert a character insertx i a m1z2miaxi1zw o delete a character deletemi mim2xillzi1mw o modify a character modifMQE Z a 1m2 i71ai1 n For example if z aabab and y babb then one 3 steps way to go from x to y is a a b a b X b a a b a b X7 insertX0b b a b a b X77 delete X 2 b a b b y delete X 4 another sequence still in three steps is a a b a b X a b a b X7 delete X1 b a b X77 deleteX 1 b a b b y insert X 3b Can you do better 12 Computing Edit Distance To transform ml zn into yl ym we have three choices 0 put ym at the end z a 1 mnym and then transform z1zn into y1ym1 o delete zn z a 1 znll and then transform ml mnll into y1ym Notes for Lecture 13 2 0 change zn into ym if they are different z a m1mn71ym and then transform 951 39 39 39 n71 thO y1quot39ym71 This suggests a recursive scheme where the sub problems are of the form how many operations do we need to transform 1 m into yl yj Our dynamic programming solution will be to de ne a n 1 x m 1 matrix M that we will ll so that for every 0 g i g n and 0 g j g m Mij is the minimum number of operations to transform 1 mi into M yj The content of our matrix M can be formalized recursively as follows 0 MOj j because the only way to transform the empty string into yl yj is to add the j characters yl yj o Mi O i for similar reasons 0 Forij1 Mlivjl mini Mliilvjl17 Mme 1H1 7 1j 7 1 change i where changei yj 1 if z 7 yj and changei yj 0 otherwise As an example consider again z aabab and y babb Ababb What is then the edit distance between z and y The table has nm entries each one computable in constant time One can construct an auxiliary table Op such that Op speci es what is the rst operation to do in order to optimally transform 1 mi into yl yj The full algorithm that lls the matrices can be speci ed in a few lines algorithm EdDist x y n lengthx m lengthy for i 0 to n Notes for Lecture 13 for i 1 to n for j 1 to m if m yj then change 0 else change 1 Mi 7 1j 1 Op deletemi if Mij711ltMij then Mij711 Opi insertmi yj if 7 1j 7 1 change lt then Mi71j71change if change 0 then Opij none else Opij changeziyj 2 Longest Common Subsequence A subsequence of a string is obtained by taking a string and possibly deleting elements If ml mn is a string and 1 1 lt 2 lt lt ik n is a strictly increasing sequence of indices then mix Mk is a subsequence of m For example art is a subsequence of algorithm In the longest common subsequence problem given strings z and y we want to nd the longest string that is a subsequence of both For example art is the longest common subsequence of algorithm and parachute As usual we need to nd a recursive solution to our problem and see how the problem on strings of a certain length can be reduced to the same problem on smaller strings The length of the lcs of z z1zn and y y1ym is either 0 The length of the lcs of 1 mn1 and y1ym or o The length of the lcs of 1 mn and y1ym1 or 0 1 the length of the lcs of 1 mn1 and y1 ym1 if zn ym The above observation shows that the computation of the length of the lcs of z and y reduces to problems of the form what is the length of the lcs between z1mi and 241 yi7 Our dynamic programming solution uses an n 1 x m 1 matrix M such that for every 0 g i g n and 0 g j g m Mij contains the length of the lcs between z1mi and y1 yj The matrix has the following formal recursive de nition 0 M2 0 0 Mm 0 Mij max 7 1j W471 Mli 17 1l 5qivyj where eqmiyj 1 if z yj eqm yj 0 otherwise Notes for Lecture 13 4 The following is the content of the matrix for the words algorithm and parachute arachute m The matrix can be lled in 0nm time How do you reconstruct the longest common substring given the matrix 3 Chain Matrix Multiplication Suppose that you want to multiply four matrices A x B x C x D of dimensions 40 x 20 20 x 300 300 x 10 and 10 x 100 respectively Multiplying an m x 71 matrix by an n x p matrix takes mnp multiplications a good enough estimate of the running time To multiply these matrices as Ax Bx C XD takes 402030040300104010100 380 000 A more clever way would be to multiply them as A x B x C x D with total cost 20 300 10 20 10 100 40 20 100 160000 An even better order would be Ax B x 0 x D with total cost 20300 104020 1040 10 100 108000 Among the ve possible orders the ve possible binary trees with four leaves this latter method is the best How can we automatically pick the best among all possible orders for multiplying n given matrices Exhaustively examining all binary trees is impractical There are Cn 78753 z 7 such trees is called the Catalan number of Naturally enough dynamic programming is the answer Suppose that the matrices are A1 x A2 x x An with dimensions respectively me x m1m1 x m2mn1 x mn De ne a subproblem remember this is the most crucial and nontrivial step in the design of a dynamic programming algorithm the rest is usually automatic to be to multiply the matrices A x x Aj and let Mij be the optimum number of multiplications for doing so Naturally MG 0 since it takes no effort to multiply a chain consisting just of the i th matrix The recursive equation is M2 j 1min1Mi k Mk 1j mi1mkmj 1Sklt7 This equation de nes the program and its complexityiOn3 fori 1 to 71 do 0 ford1ton71do fori1ton7ddo Notes for Lecture 13 5 j i dMij oo7 bestij nil forkitoj71do if Mij gt MikMk1jmi1mkmj then 3 Mg k Mk 17j mi71mkmj7bestij 3 k As usual7 improvements are possible in this case7 down to On log Run this algorithm in the simple example of four matrices given to verify that the claimed order is the best UC BerkeleyicS 170 E icient Algorithms and lntractable Problems Lecturer Michael Jordan Handout 1 August 28 2005 Notes 1 for CS 170 1 Topics to be covered 1 2 9 2 Data Structures and bigO Notation a quick review of the prerequisites DivideandOonqaer Algorithms work by dividing problem into two smaller parts and combining their solutions They are natural to implement by recursion or a stack eg binary search Fast Fourier Transform in signal processing speech recognition Graph Algorithms depth rst search DFS breadth rst search BFS on graphs with properties and applications eg nding shortest paths as in trip planning Randomization how to use probability theory to come up with fast and elegant algorithms and data structures Data Compression the theory and the algorithms Dynamic Programming works by solving subproblems like in divide and conquer but differently Linear Programming LP solves systems of linear equations and inequalities We will concentrate on using LP to solve other problems eg production scheduling computing capacities of networks NPOompleteness many algorithms we will study are quite e icient costing On or On log n or 0n2 on input size it But some problems are much harder with algo rithms costing at least 92 or some other exponential rather than polynomial func tion of It is believed that we will never be able to solve such problem completely for large n eg the Traveling Salesman Problem So we are forced to approximate them We will learn to recognize when a problem is so hard NP complete and how to devise algorithms that provide approximate if not necessarily optimal solutions On Algorithms An algorithm is a recipe or a well de ned procedure for transforming some input into a desired output Perhaps the most familiar algorithms are those for adding and multiplying integers Here is a multiplication algorithm that is different the algorithm you learned in school write the multiplier and multiplicand side by side say 13 X 15 195 Repeat the following operations divide the rst number by 2 throw out any fractions and multiply the second by 2 until the rst number is 1 This results in two columns of numbers Now cross out all rows in which the rst entry is even and add all entries of the second column that haven t been crossed out The result is the product of the two numbers In this course we will ask a number of basic questions about algorithms Notes number 1 2 0 Does it halt The answer for the algorithm given above is clearly yes provided we are multiplying positive integers The reason is that for any integer greater than 1 when we divide it by 2 and throw out the fractional part we always get a smaller integer which is greater than or equal to 1 o Is it correct To see that the algorithm correctly computes the product of the integers observe that if we write a 0 for each crossed out row and 1 for each row that is not crossed out then reading from bottom to top just gives us the rst number in binary Therefore the algorithm is just doing the standard multiplication in binary and is in fact essentially the algorithm used in computers which represent numbers in binary How much time does it take It turns out that the algorithm is as fast as the standard algorithm How do we implement the division step which seems harder than multiplication in a computer In particular if the input numbers are 71 digits long the algorithm takes 0n2 oper ations Later in the course we will study a faster algorithm for multiplying integers o How much memory does it use When we study cacheaware algorithms we will ask more details about what kinds of memory are used eg cache main memory etc The history of algorithms for simple arithmetic is quite fascinating Although we take them for granted their widespread use is surprisingly recent The key to good algorithms for arithmetic was the positional number system such as the decimal system Roman numerals I II III IV V VI etc were just the wrong data structure for performing arithmetic e iciently The positional number system was rst invented by the Mayan Indians in Central America about 2000 years ago They used a base 20 system and it is unknown whether they had invented algorithms for performing arithmetic since the Spanish conquerors destroyed most of the Mayan books on science and astronomy The decimal system that we use today was invented in India in roughly 600 AD This positional number system together with algorithms for performing arithmetic were trans mitted to Persia around 750 AD when several important lndian works were translated into arabic In particular it was around this time that the Persian mathematician Al Khwarizmi wrote his arabic textbook on the subject The word algorithm comes from Al Khwarizmi s name Al Khwarizmi s work was translated into Latin around 1200 AD and the positional number system was propagated throughout Europe from 1200 to 1600 AD The decimal point was not invented until the 10th century AD by a Syrian mathemati cian al Uqlidisi from Damascus His work was soon forgotten and ve centuries passed before decimal fractions were re invented by the Persian mathematician al Kashi With the invention of computers in this century the eld of algorithms has seen explosive growth There are a number of major successes in this eld not all of which we can discuss in CS 170 o Parsing algorithmsithese form the basis of the eld of programming languages CS 164 Notes number 1 3 Fast Fourier transformithe eld of digital signal processing is built upon this algo rithm CS 170 BE Linear programmingithis algorithm is extensively used in resource scheduling CS 170 IEOR Sorting algorithmsiuntil recently sorting used up the bulk of computer cycles CS 61B CS 170 String matching algorithmsithese are extensively used in computational biology CS 170 Number theoretic algorithmsithese algorithms make it possible to implement cryp tosystems such as the RSA public key cryptosystem CS 276 Numerical algorithms for evaluating models of physical systems replacing the need for expensive or impossible physical experiments eg climate modeling earthquake modeling building the Boeing 777 Ma 128ab and many engineering and science courses Geometric algorithms for computing and answering questions about physical shapes that appear in computer graphics and models of physical systems The algorithms we discuss will be in pseudo code so we will not worry about certain details of implementation So we might say choose the i th element of a list77 without worrying about exactly how the list is represented But the e iciency of an algorithm can depend critically on the right choice of data structure eg 01 for an array versus for a linked list So we will have a few programming assignments where success will depend critically on the correct choice of data structure 3 Computing the nth Fibonacci Number Remember the famous sequence of numbers invented in the 15th century by the Italian mathematician Leonardo Fibonacci The sequence is represented as F0F1F2 where F0 F1 1 and for all n 2 2 Fn is de ned as Fnil Fnig The rst few numbers are 1 1 2 3 5 8 13 21 34 55 89 144 233 F30 is greater than a million The Fibonacci numbers grow exponentially Not quite as fast as 2 but close Fn is about 20694 Suppose we want to compute the number F for some given large integer 71 Our rst algorithm is the one that slavishly implements the de nition function Fn integer integer if nlt1 then return 1 else return Fn1 Fn2 It is obviously correct and always halts The problem is it is too slow Since it is a recursive algorithm we can express its running time on input 71 Tn by a recurrence equation in terms of smaller values of T we shall talk a lot about such recurrences later Notes number 1 4 in this class So what is T01 We shall be interested in the order of growth of T01 ignoring constants If n S 1 then we do constant amount of work to obtain F since we are suppressing constants we can write T01 1 Otherwise if n 2 2 we have T01 T017 1 T0172 1 because in this case the running time of F on input n is just the running time of F on input n 7 17which is T01 7 17plus T01 7 2 plus the time for the addition This is nearly the Fibonacci equation We can change it to be exactly the Fibonacci equation by noting that the new variable T 01 E T01 1 satis es T 01 T 01 7 1 T 01 7 2 T 1 T1 1 2 and T 0 TO 1 2 Comparing this recurrence for T 01 to the Fibonacci recurrence we see T 01 2Fn and so T01 2F 7 1 In other words the running time of our Fibonacci program grows as the Fibonacci numbers7that is to say way too fast Can we do better This is the question we shall always ask of our algorithms The trouble with the naive algorithm is wasteful use of recursion The function F is called with the same argument over an over again exponentially many times try to see how many times F1 is called in the computation of A simple trick for improving this performance is to memoize the recursive algorithm by remembering the results from previous calls Speci cally we can maintain an array A0 n initially all zero except that AO A1 1 This array is updated before the return step of the F function which now becomes An An 7 1 An 7 2 return More importantly it is consulted in the beginning of the F function and if its value is found to be non zero7that is to say de ned7 it is immediately returned But then of course there would be little point in keeping the recursive structure of the algorithm we could just write function Fn integer integer array A0 n of integers initially all 0 A0 A1 1 for i2 to 11 do Ai Ai1 Ai2 return An This algorithm is correct because it is just another way of implementing the de nition of Fibonacci numbers The point is that its running time is now much better We have a single for loop executed n 7 1 times And the body of the loop takes only constant time one addition one assignment Hence the algorithm takes only 001 operations This is usually an important moment in the study of a problem We have come from the naive but prohibitively exponential algorithm to a polynomial one Now let us be more precise in our accounting of the time requirements for all these methods We have made a grave and common error we have been too liberal about what constitutes an elementary step of the algorithm In general in analyzing our algorithms we shall assume that each arithmetic step takes unit time because the numbers involved will be typically small enough7about n the size of the problem7 that we can reasonably expect them to t within a computer s word But in the present case we are doing arithmetic on huge numbers with about n bits7remember a Fibonacci number has about 694 n Notes number 1 5 bitsiand of course we are interested in the case n is truly large When dealing with such huge numbers and if exact computation is required we have to use sophisticated long integer packages Such algorithms take On time to add two n bit numbersihence the complexity of the two methods was not really 0Fn and On but instead OnFn and On2 respectively notice that the latter is still exponentially faster 4 Algorithm Design Paradigms Every algorithmic situation every computational problem is different and there are no hard and fast algorithm design rules that work all the time But the vast majority of algorithms can be categorized with respect to the concept of progress they utilize In an algorithm you want to make sure that after each execution of a loop or after each recursive call some kind of progress has been made In the absence of such assurance you cannot be sure that your algorithm will even terminate For example in depth rst search after each recursive call you know that another node has been visited And in Dijkstra s algorithm you know that after each execution of the main loop the correct distance to another node has been found And so on DivideandOonquer Algorithms These algorithms have the following outline To solve a problem divide it into subproblems Recursively solve the subproblems Finally glue the resulting solutions together to obtain the solution to the original problem Progress here is measured by how much smaller the subproblems are compared to the original problem You have already seen an example of a divide and conquer algorithm in cs61B mergesort The idea behind mergesort is to take a list divide it into two smaller sublists conquer each sublist by sorting it and then combine the two solutions for the subproblems into a single solution These three basic stepsidivide conquer and combineilie behind most divide and conquer algorithms With mergesort we kept dividing the list into halves until there was just one element left In general we may divide the problem into smaller problems in any convenient fashion Also in practice it may not be best to keep dividing until the instances are completely trivial Instead it may be wise to divide until the instances are reasonably small and then apply an algorithm that is faster on small instances For example with mergesort it might be best to divide lists until there are only four elements and then sort these small lists quickly by other means We will consider these issues in some of the applications we consider 5 Maximumminimum Suppose we wish to nd the minimum and maximum items in a list of numbers How many comparisons does it take The obvious loop takes 2n 7 2 m L1M LU for i2 lengthL m minm Li M maxM Li Notes number 1 Can we do it in fewer A natural approach is to try a divide and conquer algorithm Split the list into two sublists of equal size Assume that the initial list size is a power of two Find the maxima and minimum of the sublists Two more comparisons then su ice to nd the maximum and minimum of the list function mM MinMaxL if lengthL lt 3 if L1 lt LlengthL return L1 LlengthL else return LlengthL L1 find min max of first half of L else find min max of last half of L m1 M1 MinMaXL1 1121 m2 M2 MinMaXLn21 n return minm1m2 maxM1M2 Hence if Tn is the number of comparisons then Tn 2Tn2 2 and T2 1 How do we solve such a recursion Suppose that we guess that the solution is of the form Tn an b Then a and b have to satisfy the system of equations 2ab1 anb2ab2 which solves to b 72 and a 32 so that that for n a power of 2 Tn 3712 7 2 or 75 of the obvious algorithm One can in fact show that f3n2l 7 2 comparisons are necessary to nd both the minimum and maximum A similar idea can be applied to a sequential algorithm Suppose for simplicity that the array has an odd number of entries then the algorithm is as follows function mM MinMaxL currmin minL1L2 currmax maxL1L2 for i3ilt lengthLii2 m minLi Li1gt M maxLi Li1gt if m lt currmin currmin m if M gt currmax currmax M return currmin currmax Each iteration of the for loop requires three comparisons and the initialization of the variables requires one comparison Since the loop is repeated 71 7 22 times we have a total of 3712 7 2 comparisons for even 71 UC Berkeley 7 08170 Intro to CS Theory Handout N16 Professor Luca Trevisan October 30 2001 Notes for Lecture 16 1 The LempelZiv algorithm There is a sense in which the Huffman coding was optimal but this is under several assumptions 1 The compression is lossless ie uncompressing the compressed le yields exactly the original le When lossy compression is permitted as for video other algorithms can achieve much greater compression and this is a very active area of research because people want to be able to send video and audio over the Web 2 We know all the frequencies f with which each character appears How do we get this information We could make two passes over the data the rst to compute the fi and the second to encode the le But this can be much more expensive than passing over the data once for large les residing on disk or tape One way to do just one pass over the data is to assume that the fractions of each character in the le are similar to les you ve compressed before For example you could assume all Java programs or English text or PowerPoint les or have about the same fractions of characters appearing A second cleverer way is to estimate the fractions on the y as you process the le One can make Huffman coding adaptive this way OJ We know the set of characters the alphabet appearing in the le This may seem obvious but there is a lot of freedom of choice For example the alphabet could be the characters on a keyboard or they could be the key words and variables names appearing in a program To see what difference this can make suppose we have a le consisting of 71 strings mom and 71 strings bbbb concatenated in some order If we choose the alphabet 1 b then 871 bits are needed to encode the le But if we choose the alphabet aaaa bbbb then only 271 bits are needed Picking the correct alphabet turns out to be crucial in practical compression algorithms Both the UNIX compress and GNU gzip algorithms use a greedy algorithm due to Lempel and Ziv to compute a good alphabet in one pass while compressing Here is how it works If s and t are two bit strings we will use the notation s 025 to mean the bit string gotten by concatenating s and t We let F be the le we want to compress and think of it just as a string of bits that is 0 s and 1 s We will build an alphabet A of common bit strings encountered in F and use it to compress F Given A we will break F into shorter bit strings like FAloOoA2olooA7oOooA5olooAiojo and encode this by 100020loo7oOooSolooiojo Notes for Lecture 16 2 F 1111 101101 set A is full Encoded F 001001 3130 516010 0000 0010 0001 0111 0110 1011 1100 0010 Figure 1 An example of the Lempel Ziv algorithm The indices 0 of are in turn encoded as xed length binary integers and the bits j are just bits Given the xed length say r of the binary integers we decode by taking every group of r 1 bits of a compressed le using the rst r bits to look up a string in A and concatenating the last bit So when storing or sending an encoded le a header containing A is also stored or sent Notice that while Huffman s algorithm encodes blocks of xed size into binary sequences of variable length Lempel Ziv encodes blocks of varying length into blocks of xed size Here is the algorithm for encoding including building A Typically a xed size is available for A and once it lls up the algorithm stops looking for new characters A 0 start with an alphabet containing only an empty string 0 0 points to next place in le 1 to start encoding repeat nd Ak in the current alphabet that matches as many leading bits fiFHlng initially only A0 empty string matches Let b be the number of bits in Ak if A is not full add Ak o be to A flurb is the rst bit unmatched by Ak output k 0 FM i i b 1 until i gt lengthF Note that A is built greedily based on the beginning of the le Thus there are no optimality guarantees for this algorithm It can perform badly if the nature of the le changes substantially after A is lled up however the algorithm makes only one pass through the le there are other possible implementations A may be unbounded and the index k would be encoded with a variablelength code itself In Figure 1 there is an example of the algorithm running where the alphabet A lls up after 6 characters are inserted In this small example no compression is obtained but if A were large and the same long bit strings appeared frequently compression would be substantial The gzip manpage claims that source code and English text is typically compressed 60 70 as possible Notes for Lecture 16 3 To observe an example we took a latex le of 74 892 bytes Running Huffman s algo rithm with bytes used as blocks we could have compressed the le to 36757 bytes plus the space needed to specify the code The Unix program compress produced an encoding of size 34 385 while gzip produced an encoding of size 22 815 2 Lower bounds on data compression 21 Simple Results How much can we compress a le without loss We present some results that give lower bounds for any compression algorithm Let us start from a worst case77 analysis THEOREM 1 Let C 01 7 01 be an encoding algorithm that allows lossless decoding ie let C be an injective function mapping 71 bits into a sequence of bits Then there is a le 1 6 01 such that CUM 2 n In words for any lossless compression algorithm there is always a le that the algorithm is unable to compress PROOF Suppose by contradiction that there is a compression algorithm C such that for all f 6 01 CUM n 71 Then the set Cf f 6 01 has 2 elements because C is injective but it is also a set of strings of length n 7 1 and so it has at most 2711 2l 2 7 2 elements which gives a contradiction D While the previous analysis showed the existence of incompressible les the next theo rem shows that random les are hard to compress thus giving an average case77 analysis THEOREM 2 Let C 01 7 01 be an encoding algorithm that allows lossless decoding ie let C be an injective function mapping 71 bits into a sequence of bits Let f 6 01 be a sequence of 71 randomly and uniformly selected bits Then for every t 1 PrHCfl n7 tl For example there is less than a chance in a million of compression an input le of 71 bits into an output le of length n 7 21 and less than a chance in eight millions that the output will be 3 bytes shorter than the input or less PROOF We can write wow i n 7 t W 3 W21 7 75 Regarding the numerator it is the size of a set that contains only strings of length n 7t or less so it is no more than 27 21 which is at most 27H r1 7 2 lt 27H r1 2 2t 1 D The following result is harder to prove and we will just state it THEOREM 3 Let C 01 7 01 be a pre x free encoding and let 1 be a random le ofn bits Then Eileen 2 n Notes for Lecture 16 4 This means that from the average point of view the optimum pre x free encoding of a random le is just to leave the le as it is In practice however les are not completely random Once we formalize the notion of a not completely random le we can show that some compression is possible but not below a certain limit First we observe that even if not all n bits strings are possible les we still have lower bounds THEOREM 4 Let F Q 0 1 be aset ofpossible les and let C0 1 a 0 1 be an injective function Then 1 There is a le 1 E F such that lCfl 2 log2 2 If we pick a le 1 uniformly at random from F then for every t we have 1 Prll0fl g logz tl g F 3 KC is pre x free then when we pick a le 1 uniformly at random from F we have Ell0fll Z logz lFl PROOF Part 1 and 2 is proved with the same ideas as in Theorem 2 and Theorem 3 Part 3 has a more complicated proof that we omit D 22 Introduction to Entropy Suppose now that we are in the following setting the le contains 71 characters there are 0 different characters possible character i has probability of appearing in the le What can we say about probable and expected length of the output of an encoding algorithm Let us rst do a very rough approximate calculation When we pick a le according to the above distribution very likely there will be about 71 characters equal to 2 Each le with these typical frequencies has a probability about p Hpipi39 of being generated Since les with typical frequencies make up almost all the probability mass there must be about 119 H1pipi39 les of typical frequencies Now we are in a setting which is similar to the one of parts 2 and 3 of Theorem 4 where F is the set of les with typical frequencies We then expect the encoding to be of length at least logz Hipipi39quot n 39 2W legal191 The quantity 21191 logz 1 290 is the expected number of bits that it takes to encode each character and is called the entropy of the distribution over the characters The notion of entropy the discovery of several of its properties a formal version of the calculation above as well as a inef cient optimal compression algorithm and much much more are due to Shannon and appeared in the late 40s in one of the most in uential research papers ever written Notes for Lecture 16 5 23 A Calculation Making the above calculation precise would be long and involve a lot of es Instead we will formalize a slightly different setting Consider the set F of les such that the le contains 71 characters there are 0 different characters possible character i occurs a times in the le We will show that F contains roughly 2 Sip 1 g2 117 les and so a random element of F cannot be compressed to less than n E log2 1pi bits Picking a random element of F is almost but not quite the setting that we described before but it is close enough and interesting in its own Let us call n the number of occurrences of character i in the le We need two results both from Math 55 The rst gives a formula for n F f1f2fc Here is a sketch of the proof of this formula There are n permutations of 71 characters but many are the same because there are only 0 different characters In particular the f1 appearances of character 1 are the same so all f1 orderings of these locations are identical Thus we need to divide a by f1 The same argument leads us to divide by all other fi Now we have an exact formula for F but it is hard to interpret so we replace it by a simpler approximation We need a second result from Math 55 namely Stirling s formula for approximating 71 n x V 27rnn557n This is a good approximation in the sense that the ratio 27rn 395e approaches I quickly as n grows In Math 55 we motivated this formula by the approximation log n 222 logi z log zdz We will use Stirling s formula in the form log2 n z log2 x27r n 5 log2 n 7 n log2 e Stirling s formula is accurate for large arguments so we will be interested in approxi mating log2 for large 71 Furthermore we will actually estimate logfL F which can be interpreted as the average number of bits per character to send a long le Here goes 10g2IFI 10g2nIf1fcl 71 7 71 10g2 71 2191 logz 107 71 log2 V27r n 5log2 n 7 nlogze 7 iaogz m W mm M e N logger i1 Notes for Lecture 16 6 1 C g 7110an Z mogz i1 C 17 c log2 v 27139 510an 7 5 Zlogz i1 10g2 71 i Z logz i1 lt17 cgtlog2 m 71 510g2 n 7 5 2191 logz 71 71 As 71 gets large the three fractions on the last line above all go to zero the rst term looks like 01n and the last two terms look like 010g2n This lets us simplify to get n log F c f i 10g2 71 i 1Og2 107 10g2 71 i 10197 logz C Z ogzmw 2i 10pi10g2 71W 1 pwogznnpm i1 iplt1gtlog2 1299 i1 Normally the quantity Eip log2 1pi is denoted by H How much more space can Huffman coding take to encode a le than Shannon s lower bound Hm A theorem of Gallagher 1978 shows that at worst Huffman will take 71 pmm 086 bits more than Hn where pm is the largest of any pi But it often does much better Furthermore if we take blocks of k characters and encode them using Huffman s algo rithm then for large k and for n tending to in nity the average length of the encoding tends to the entropy

### BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.

### You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

## Why people love StudySoup

#### "There's no way I would have passed my Organic Chemistry class this semester without the notes and study guides I got from StudySoup."

#### "I used the money I made selling my notes & study guides to pay for spring break in Olympia, Washington...which was Sweet!"

#### "Knowing I can count on the Elite Notetaker in my class allows me to focus on what the professor is saying instead of just scribbling notes the whole time and falling behind."

#### "It's a great way for students to improve their educational experience and it seemed like a product that everybody wants, so all the people participating are winning."

### Refund Policy

#### STUDYSOUP CANCELLATION POLICY

All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email support@studysoup.com

#### STUDYSOUP REFUND POLICY

StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here: support@studysoup.com

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to support@studysoup.com

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.