Computer Science II
Computer Science II COP 3503
University of Central Florida
These 11 pages of class notes were uploaded by Luisa Beer on Thursday, October 22, 2015. The notes belong to COP 3503 at the University of Central Florida, taught by Ali Orooji in Fall. Since upload, they have received 33 views.
COP 3503 Computer Science II, Spring 2000
CLASS NOTES (Day 24): Hash Tables

- Hash tables (files) rely on hashing to perform insertion, deletion, and retrieval in constant time.
- A hashing function is a mapping between a key value and a location in the table. The location is typically produced as an offset from the first position in the table.
- The key value is the data field on which retrieval, insertion, and deletion are based. To be retrieved from, inserted into, or deleted from the file, an item's key value must be specified.
- If the hashing function is a 1-1 function from key value to location, then any retrieval can be done in constant time.
- If the hashing function is not 1-1 but many-to-1, then more than one key value can map to the same location in the hash table. This is called a collision.
- Typically a restriction on the possible key values will be expected. Without any restrictions, the set of possible key values is infinite: the domain is infinite, while the range (the table) is normally finite.

Hashing Methods

Method 1: Convert the key value (a string, say) into an integer by summing, for each character, the product of the character (its ASCII or Unicode value) and 128 raised to the power of the character's position. The value 128 is used because the basic ASCII character set (not extended ASCII or Unicode) requires 7 bits per character; since 2^7 = 128, we can encode 128 different characters using the integers 0 through 127.

Example: "CSII" hashes to 'C'*128^3 + 'S'*128^2 + 'I'*128^1 + 'I'*128^0.

This method has a serious potential for overflow. For example, using ASCII codes, the value for the word "junk" is 224,229,227, and a long string will generate a huge number. Also note that this technique is not a 1-1 mapping, so collisions are possible; collision resolution will be discussed later.

To avoid computing a power such as 128^i directly, a technique that applies to general polynomials can be used: a polynomial A3*X^3 + A2*X^2 + A1*X^1 + A0*X^0 can be evaluated as ((A3*X + A2)*X + A1)*X + A0.
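The nested evaluation above can be checked against the direct power-by-power sum. A minimal sketch (the class and method names are mine, not from the notes); both methods reproduce the "junk" value of 224,229,227 quoted above:

```java
// Hypothetical demo class: computes the 128-based string hash both by
// summing explicit powers of 128 and by the nested (Horner-style) form.
public class HashDemo {
    // Direct evaluation: sum of charValue * 128^(position from the right).
    static long directHash(String s) {
        long h = 0;
        int n = s.length();
        for (int i = 0; i < n; i++) {
            h += (long) s.charAt(i) * (long) Math.pow(128, n - 1 - i);
        }
        return h;
    }

    // Nested evaluation: ((c0*128 + c1)*128 + c2)*128 + ... ; no powers needed.
    static long nestedHash(String s) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = h * 128 + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(directHash("junk"));  // 224229227
        System.out.println(nestedHash("junk"));  // 224229227 (same value)
    }
}
```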
This form has three distinct advantages over the earlier method: (1) the large intermediate result that would overflow is deferred until the end of the calculation; (2) only three multiplications and three additions are required to evaluate the polynomial; and (3) the entire calculation proceeds from left to right (exponentiation proceeds from right to left).

A better solution:

    public int hash1(String s, int tableSize) {
        int hashVal = 0;
        for (int i = 0; i < s.length(); i++)
            hashVal = (hashVal * 128 + s.charAt(i)) % tableSize;
        return hashVal;
    } // end hash1

Note that the only improvement in this solution over the first is that modulo arithmetic has been applied. However, modulo operations are very expensive.

An even better solution: since Java allows overflow to occur (you don't get an overflow error at runtime, and overflow behaves consistently across a given platform), we can simply let the overflow happen:

    public int hash2(String s, int tableSize) {
        int hashVal = 0;
        for (int i = 0; i < s.length(); i++)
            hashVal = 37 * hashVal + s.charAt(i);
        hashVal %= tableSize;
        if (hashVal < 0)
            hashVal += tableSize;
        return hashVal;
    } // end hash2

This algorithm is better than hash1 because the modulo operation occurs only once, and an acceptable value between 0 and tableSize - 1 is produced. The value 37 is used instead of 128 because repeated multiplication by a factor of 128 causes the most significant bits to overflow off the end too fast, making the hash value more likely to cause a collision.

In general, hash functions:
- must be easy and fast to compute, so we don't give up the advantage of O(1) lookup time;
- must distribute the key values across the entire range of the table to yield fewer collisions.
Faster hashing typically means a higher collision rate; a lower collision rate typically means slower hashing.

Collision Resolution in Hash Tables

Techniques for resolving collisions include:
1. Ignore the collision. This is viable only if the probability of collision is very low, or if the hash function is already too slow to tolerate the overhead of collision resolution.
2. Create and utilize a collision-resolution protocol. This adds complexity to hashed operations and causes extra implementation work.
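The 37-based hash above can be exercised on a few keys; a quick sketch (the wrapper class, the sample keys, and the table size 10007, a prime of my choosing, are not from the notes):

```java
// Self-contained copy of the 37-based hash from the notes, plus a small driver.
public class Hash2Demo {
    static int hash2(String s, int tableSize) {
        int hashVal = 0;
        for (int i = 0; i < s.length(); i++) {
            hashVal = 37 * hashVal + s.charAt(i);  // int overflow is allowed
        }
        hashVal %= tableSize;
        if (hashVal < 0) {                         // repair a negative remainder
            hashVal += tableSize;
        }
        return hashVal;
    }

    public static void main(String[] args) {
        String[] keys = {"Al", "Bob", "Betty", "Carl",
                         "a long key whose hash overflows an int"};
        for (String k : keys) {
            int h = hash2(k, 10007);
            System.out.println(k + " -> " + h);    // always in [0, 10007)
        }
    }
}
```

Note how the final negative-value check matters: for long keys the repeated multiplication by 37 wraps around and can leave a negative remainder before the fix.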
Collision Resolution Protocols

Collision-resolution protocols range from fairly simple to very complex techniques. Among the simplest are: (1) linear probing, (2) quadratic probing, and (3) chaining. More advanced techniques, such as multiple hash functions and bucketing, can be applied when the table size is relatively large.

Linear Probing

Technique: when a collision occurs, sequentially search through the table from the point of the collision, using wraparound searching (modulo arithmetic), until an empty location is found. Specifically, if the hash function returns a value H and cell H is not empty, then cell H+1 is attempted, followed by H+2, H+3, ..., H+i, using wraparound.

Example: suppose our hash function maps the letter A to location 0, B to 1, ..., Z to 25, and we hash on the first letter of a person's name. With the input sequence insert(Al), insert(Bob), insert(Betty), insert(Carl), linear probing handles collisions as follows:

    location   value
    0          Al
    1          Bob
    2          Betty   (should be in location 1; a collision moved it to 2)
    3          Carl    (should be in location 2; a collision moved it to 3)
    4..25      (empty)

Details: retrievals are handled by hashing the key and comparing it with the data at the location provided by the hash function. If the two values are not equal, the location is incremented and the comparison is made again against the value in the new location. This is repeated until either the key value is found or an empty location is encountered.

Deletion must be lazy. This entails marking the item as deleted (using a delete bit) but leaving it in place in the table, without physically removing it. This ensures that the lookup operation always works: a lazily deleted item does not break a chain of valid items. The deleted item is only physically removed when a new item can be inserted at that location, overwriting the deleted item.
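The linear-probing steps above, including the lazy-deletion detail, can be sketched as a small table class. The fixed 26-slot table, String keys, and first-letter hash are simplifications of mine for illustration:

```java
// Minimal linear-probing hash table with lazy deletion (a sketch, not the
// notes' own implementation). Hash: first letter of the name, A -> 0, etc.
public class LinearProbeTable {
    private final String[] cells;
    private final boolean[] deleted;   // the "delete bit" for lazy deletion

    public LinearProbeTable(int size) {
        cells = new String[size];
        deleted = new boolean[size];
    }

    private int hash(String key) {
        return (key.charAt(0) - 'A') % cells.length;   // A->0, B->1, ...
    }

    // Probes forward with wraparound; a lazily deleted cell may be reused.
    public int insert(String key) {
        int h = hash(key);
        while (cells[h] != null && !deleted[h]) {
            h = (h + 1) % cells.length;
        }
        cells[h] = key;
        deleted[h] = false;
        return h;                                      // location actually used
    }

    // Stops only at a truly empty cell, so deleted cells don't break the chain.
    public boolean find(String key) {
        int h = hash(key);
        while (cells[h] != null) {
            if (!deleted[h] && cells[h].equals(key)) return true;
            h = (h + 1) % cells.length;
        }
        return false;
    }

    public void remove(String key) {
        int h = hash(key);
        while (cells[h] != null) {
            if (!deleted[h] && cells[h].equals(key)) { deleted[h] = true; return; }
            h = (h + 1) % cells.length;
        }
    }

    public static void main(String[] args) {
        LinearProbeTable t = new LinearProbeTable(26);
        System.out.println(t.insert("Al"));    // 0
        System.out.println(t.insert("Bob"));   // 1
        System.out.println(t.insert("Betty")); // collides at 1, lands in 2
        System.out.println(t.insert("Carl"));  // collides at 2, lands in 3
        t.remove("Bob");
        System.out.println(t.find("Betty"));   // true: lazy deletion keeps the chain
    }
}
```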
Analysis

Definition: the load factor of a probing hash table is the fraction of the table that is full. The load factor is represented by the symbol λ and ranges from 0 (empty table) to 1 (full table).

Assuming that the probes are independent, the average number of locations (cells) in the table examined in a single probe is 1/(1 - λ). This comes simply from the fact that the probability that a location is empty is 1 - λ.

The above assumption is bad. In fact, linear probing causes a phenomenon called primary clustering: blocks of occupied cells (locations) build up, and these blocks cause excessive attempts to resolve collisions. Taking this into account, the average number of cells that must be examined for an insertion into the hash table is

    (1/2) * (1 + 1/(1 - λ)^2)

For half-full tables (λ = 0.5) this is an acceptable value of 2.5, but when λ = 0.9 an insertion requires about 50 cell examinations on average. We need a solution that eliminates primary clustering.

[Figure: the long-term effect of primary clustering on file density. Shaded areas indicate regions of the file occupied by records; unshaded areas are unoccupied, containing no information.] Primary clustering tends to divide the file space into discrete clusters, which further increases the probability of collision, and tends only to expand each cluster rather than spread the information across the file space.
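The insertion-cost formula above can be checked numerically; a quick sketch (the class and method names are mine):

```java
// Evaluates the linear-probing insertion cost (1/2)(1 + 1/(1 - lambda)^2)
// from the notes at a few load factors.
public class ProbeCost {
    static double insertionProbes(double load) {
        double empty = 1.0 - load;                 // probability a cell is empty
        return 0.5 * (1.0 + 1.0 / (empty * empty));
    }

    public static void main(String[] args) {
        System.out.println(insertionProbes(0.5));  // 2.5 probes on average
        System.out.println(insertionProbes(0.9));  // about 50.5 probes
    }
}
```

This makes the blow-up vivid: going from a half-full to a 90%-full table multiplies the expected insertion work by roughly twenty.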
Quadratic Probing

Quadratic probing eliminates the problem of primary clustering caused by linear probing. The technique is similar to linear probing, but the location increment is not 1. Specifically, if the hash function produces a hash value (a location, or cell index) H and the search at location H is unsuccessful, then the next location searched is H+1^2, followed by H+2^2, H+3^2, H+4^2, ..., H+i^2, using wraparound as before.

Example: suppose our hashing function is a simple mod operation on the size of the hash table. If the hash table has size 10 and the input sequence is insert(89), insert(18), insert(49), insert(58), insert(9), the hash table is filled as shown below:

    location:  0    1    2    3    4    5    6    7    8    9
    value:     49        58   9                        18   89

    89: H = 9, no collision.
    18: H = 8, no collision.
    49: H = 9, collision; (9 + 1^2) mod 10 = 0, so 49 goes to location 0.
    58: H = 8, collision; (8 + 1^2) mod 10 = 9, collision; (8 + 2^2) mod 10 = 2, so 58 goes to location 2.
    9:  H = 9, collision; (9 + 1^2) mod 10 = 0, collision; (9 + 2^2) mod 10 = 3, so 9 goes to location 3.

Why is quadratic probing better? If the size of the hash table is a prime number and λ ≤ 0.5, then all probes will be to different locations, an item can always be inserted, and no location will be probed twice during an access. However, at λ = 0.5 linear probing is already fairly good, and the removal of primary clustering by quadratic probing saves only 0.5 probes for an average insertion and 0.1 probes for an average successful search. Quadratic probing provides an additional benefit: it is unlikely to produce an excessively long probe sequence, as might occur with linear probing.

However, quadratic probing requires a multiplication (the i^2 term), so an efficient algorithm that avoids this multiplication is desirable. Given the previous value H(i-1), the next value H(i) can be determined without computing i^2. Assuming we still require wraparound, the new value is computed as

    H(i) = (H(i-1) + 2i - 1) mod tablesize

This can be implemented as follows:
1. use an addition to increment i;
2. use a left bit shift (by 1) to compute 2i;
3. use a subtraction to compute 2i - 1;
4. use a second addition to add 2i - 1 to the old value of H;
5. finally, use a modulo operation if wraparound is needed.

(The identity holds because H + i^2 = H + (i-1)^2 + (2i - 1): each step adds the difference of consecutive squares.)

Example: using the example from earlier, consider the steps of insert(58). Initially H(0) = 58 mod 10 = 8, and a collision results. Then i = 1 and H(1) = (H(0) + 2*1 - 1) mod 10 = (8 + 1) mod 10 = 9. This too results in a collision, so another value must be calculated: i = 2 and H(2) = (H(1) + 2*2 - 1) mod 10 = (9 + 3) mod 10 = 2, which is empty, so insertion occurs at position 2 in the hash table.

Using the shift operation, the same example proceeds with the numbers shown in binary. Initially H(0) = 58 mod 10 = 1000, and a collision results. Then i = 0001 and H(1) = (1000 + (0001 << 1) - 0001) mod 10 = (8 + 1) mod 10 = 1001. This too results in a collision, so another value must be calculated: i = 0010 and H(2) = (1001 + (0010 << 1) - 0001) mod 10 = (9 + 3) mod 10 = 0010, which is empty, so insertion occurs at position 2 in the hash table.
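The incremental 2i - 1 trick can be sketched as a small insertion routine; this reproduces the notes' size-10 example (the class and method names are mine):

```java
// Quadratic probing where the i-th probe is reached by adding 2i - 1 to the
// previous location, so i*i is never computed (a sketch for illustration).
public class QuadraticProbeTable {
    private final Integer[] cells;

    public QuadraticProbeTable(int size) {
        cells = new Integer[size];
    }

    public int insert(int key) {
        int h = key % cells.length;
        int i = 0;
        while (cells[h] != null) {
            i++;                                       // next probe number
            h = (h + (i << 1) - 1) % cells.length;     // add 2i - 1, wraparound
        }
        cells[h] = key;
        return h;                                      // location actually used
    }

    public static void main(String[] args) {
        QuadraticProbeTable t = new QuadraticProbeTable(10);
        System.out.println(t.insert(89)); // 9
        System.out.println(t.insert(18)); // 8
        System.out.println(t.insert(49)); // probes 9, then lands in 0
        System.out.println(t.insert(58)); // probes 8, 9, then lands in 2
        System.out.println(t.insert(9));  // probes 9, 0, then lands in 3
    }
}
```

The probe sequence matches H, H+1, H+4, H+9, ... exactly, because the running sum of the odd numbers 1, 3, 5, ... is the sequence of squares.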
Quadratic probing eliminates primary clustering but introduces the problem of secondary clustering: elements that hash to the same location probe the same set of alternative locations. This, however, is not a real concern; simulations have shown that, in general, fewer than 0.5 additional probes are required per search, and only for high load factors. If secondary clustering does present a problem for a given application, there are techniques that eliminate it altogether. One of the more popular is double hashing, in which a second hash function drives the collision resolution.

Chaining

- Maintain an array of linked lists, one at each hash-addressable location.
- The hash function returns the index of a specific list.
- Insertions, deletions, and searches occur in that list.
- If the lists are kept short, the potential performance bottleneck is eliminated.
- λ is calculated by dividing the total number of nodes N by the number of lists maintained M: λ = N/M.
- λ is no longer bounded by 1.0, but should have an average value of about 1.0.
- The expected number of probes for an insertion or an unsuccessful search is λ.
- The expected number of probes for a successful search is 1 + λ/2.

Example: M = 6, N = 15, so λ = N/M = 15/6 = 2.5.

[Figure: a table of 6 hash addresses, each referencing a singly linked list of names (Kristi, ...).]

Each list referenced by the hash table is a singly linked list (see previous notes for implementation details). The singly linked lists shown above do not have a tail node. Would the use of a tail node be beneficial in this data structure? The answer is yes; it could help in two different ways.

Notice that there is no implied order to the elements of a specific list. This is intentional, since insertion into a hash table should be an O(1) operation. If each list were maintained in alphabetical order, insertion would not be an O(1) operation, and we would violate one of the specifications of the hash table data structure.
The same problem also occurs in the implementation shown above, since we have no way, other than traversing the list, of finding the end of the list. A better implementation is therefore the one shown on the next page.

[Figure: hash table in which every hash address, including those with no entries, references a chain ending in a tail node; example names include Kristi, Cris, Cyn, and Calli.]

Notice in this implementation of the hash table that even the hash addresses with no entries maintain an (empty) list chain.

- The first way the tail node improves the implementation is as follows: in typical implementations, the tail node contains a data field, usually set to the largest possible key value that could be hashed. This eliminates null-value comparisons in the code, replacing them with comparisons to, say, MaxInt or something similar. Since each list has a logical end, there are no problems associated with running off the end of a list.
- Also notice how wasteful of space it is to have a separate tail node for every list. In reality, all of these nodes are condensed into a single node to which all lists link. This is shown in the next diagram.

[Figure: the same hash table with all chains sharing one common tail node.]

Notice that this better implementation still does not provide O(1) insert time unless we can identify (have a reference to) the node immediately preceding the tail node in any given list. For example, if we want to insert Alice into the first list, having a tail node only tells us where the end of the list is, not where the node next to the end is.

What do we do to get our required O(1) insert? The answer has been available all along, and none of the improvements we have made to our structure have done anything toward this end. Recall some of the issues we discussed when implementing linked lists: we stated that in a list without header and tail nodes, insertion at either end of the list is a special case, different from inserting in the middle of the list.
So we put header and tail nodes in to prevent those special cases from occurring. In our hash table structure, however, there has been a header node all along: it is embedded in the hash table itself, as the reference to the chain for each hashable location. Therefore, to achieve O(1) insertion time, we simply perform ALL insertions at the head of the list rather than at the tail. A side benefit is that each chain contains its elements in order of arrival (entry order). This again illustrates that you need to be aware of the implementation issues of all the data structures involved in an application.

[Figure: insertion of a newly hashed value at the head of its chain.]

Summary: hash tables can be used to implement insert and find operations in O(1) time on average. Many implementation factors can influence the performance of the hash table, such as the load factor, the hash function itself, file size, and input rates and distributions, as well as many other factors. It is important to pay attention to these details if these operations are to run in O(1) time.
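Chaining with head-of-list insertion can be sketched briefly. Using java.util.LinkedList (its addFirst is O(1)) instead of a hand-rolled singly linked list is a simplification of mine, and the sample names and table size follow the notes' M = 6 example only loosely:

```java
import java.util.LinkedList;

// Separate chaining with O(1) insertion at the head of each chain, as
// described above (a sketch, not the notes' own implementation).
public class ChainedHashTable {
    private final LinkedList<String>[] lists;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int m) {
        lists = new LinkedList[m];
        for (int i = 0; i < m; i++) lists[i] = new LinkedList<>();
    }

    private int hash(String key) {
        return Math.abs(key.hashCode()) % lists.length;
    }

    public void insert(String key) {
        lists[hash(key)].addFirst(key);         // O(1): no traversal to the tail
    }

    public boolean find(String key) {
        return lists[hash(key)].contains(key);  // O(length of one chain)
    }

    // Load factor: lambda = N / M.
    public double loadFactor() {
        int n = 0;
        for (LinkedList<String> l : lists) n += l.size();
        return (double) n / lists.length;
    }

    public static void main(String[] args) {
        ChainedHashTable t = new ChainedHashTable(6);
        String[] names = {"Kristi", "Cris", "Cyn", "Calli"};
        for (String name : names) t.insert(name);
        System.out.println(t.find("Cris"));     // true
        System.out.println(t.loadFactor());     // 4/6, about 0.67
    }
}
```

Because every insertion goes to the front, each chain holds its elements in reverse arrival order when traversed from the head, which is exactly the entry-order property noted above.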