Systems Software COP 3402
University of Central Florida
Popular in Course
Popular in Computer Programming
This 28 page Class Notes was uploaded by Luisa Beer on Thursday October 22, 2015. The Class Notes belongs to COP 3402 at University of Central Florida taught by Staff in Fall. Since its upload, it has received 26 views. For similar materials see /class/227466/cop-3402-university-of-central-florida in Computer Programming at University of Central Florida.
Reviews for Systems Software
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 10/22/15
10f8 httpwwwcosccanterburyac nztadtakaokacosc230p1c Further modified 16 Feb 2001 This works supports only lowercase chracters sign give sign lt given by sign gt given by include ltstdioh struct namerecord int kin char namelO int val int level int adr gt 1 This records properties of a user defined name struct instruction This defines a machine instruction tt int f int 1 int a 1 int iiijkm nkind char lineBl container of one line of user program int norwll no of reserved wor s int txmaxlOO length of indentifier table int nmaxl4 max no of digits in numbers int al10 length of identifier int amax2047 maximum address int levmax3 maximum depth of block nesting int cxmax200 size of code array The following is for symbol type int null int ident2 int number3 int plus4 int minus5 int mult6 int slash7 int oddsym8 int neglO int lssll int leg12 int gegl4 int lparen15 int rparenl6 int commal7 int semicolon18 int periodl9 int becomes20 int beginsymZl int endsym22 int ifsym23 int thensym24 int whilesym25 int dosym26 int callsym27 int constsym28 int varsym29 int procsym30 int constantl int variable2 int null0 int procedure3 The following is the list of machine instructions int litl int opr2 int lod3 int sto4 int cal5 int inc6 int jmp7 int jpc8 char chchldummy last character read 39 last symbol read nt sym char id last identifier read int num last number read int cc character count int ll line length int kkii int cx code allocation index int err char alO int s500 datastore struct namerecord tablelOOO struct instruction codelOOO container of user defined names container of word symbols int ssymlOOO container of special symbols char mnemoniclOO int declbegsys statbegsys facbegsys FILE f char c errori error handling int 1 1192007 1151 AM httpwwwcosccanterburyac nztadtakaokacosc230p1c printfquoterror dnquot1 get one line getline cc0 110 nO while cgetcfllO while character is not quotreturnquot key tt 1 if cEOFchl return 0 n ll linenc n ll linen39 39 for il iltn ip intfquotcquot line39 printfquotnquot return 1 1 getch get one character 1 if ccll 39 chl printfquotPROGRAM INCOMPLETEquot getchar return 0 getline ccccl chlinecc 1 search serch for a reserved word int i j len char s for inorw igtl i swordi lenstrlens for jl jltlen j ifajls break s 1 if jgtlen return i return 0 getsym gets one symbol while ch39 39 getch if Ch gt 39A39 ampamp Ch lt 39Z39 k0 do if kltalk akch QBtCh 1 Whilechgt39039 ampamp Chlt39939 11 Chgt39A39 ampamp Chlt39z39 akl0 idal if isearchl0symwsym 1 else symident else if Chgt39039 ampamp Ch lt 39939 k0 num0 symnumber do numlOnum ch 39039 k getch 1 While Ch gt 39039 ampamp Ch lt 39939 if k gt nmax error30 else if Ch39 39 getCh if Ch3939 symbecomes getch else symnull else if chl return else 119200711151AM 20f8 httpwwwcosccanterburyac nztadtakaokacosc230p1c symssymch getch 1 genxyz This generates one instruction int xyz 1 if cxgtcxmax error99 else 1 codecx codecx codecx 1 cx enterkptxpdxlev This enters a symbol into the table int k int ptx int pdx int lev tx pointer to the table pdx data allocation index char idl int iilen ptx idli lenstrlenid for iiOiiltlenii tableptxnameiiidl idl tableptxkindk if kconstant tableptxvalnum else if kvariable tableptxlevellev tableptxadrpdx pdx 1 else procedure tableptxlevellev mismatchab if a and b mismatch return true char a char b int i int lenl int len2 lenlstrlena len2strlenb l if lenlllen2 return 1 else 1 whileiltlen2l if ail bi return 1 i a return 0 1 int positionidptx This finds the position of id char id int ptx char buflO char idl int iiilen idlid lenstrlenid for i0iltleni table0nameiidl idl iptx do lenstrlentableiname for ii0iiltlenlii bufiitableinameii 1 bufii0 i1 while mismatchbuf id return il 30f8 1192007 1151 AM httpwwwcosccanterburyac nztadtakaokacosc230p1c constdeclarationlevptxpdx int ptx pdx 1 if symident QBtSYm if symeql if symbecomes erro 11 symbecomes rl getsym if symnumber enterconstantptxpdxlev getsym 1 1 vardeclarationlevptxpdx int ptxpdxlev 1 ptxpdxlev getsymm if s ident entervariable else error4 1 factorlev th int lev ptx 1 int i level adr val while symident11symnumber11symlparenH i f symident ipositionidptx if i errorll else 1 kindtableikind leveltableilevel adrtableiadr valtableival if kindconstant genlitOval generates load literal else if kindvariable genlodlevleveladr generates load else errorZlL 1 getsym 1 else ifsymnumber if numgtamax error3l num0 genlit0num getsym 1 else ifsymlparen getsym expressionlevptx if symrparengetsym else error22 1 1 termlevptx int lev ptx int mulop factorlevptx whilesymmult11symslash mulopsym getsym factorlevptx if mulopmult genopr04 multiplication else genopr05 division 1 1 expressionlevptx int lev 1 int addop if symplus 11s minus addopsym getsym termlevptx if addopminus genopr0l1 else termlevptx while symplus11symminus addopsym getsym termlevptx 1192007 1151 AM 40f8 50f8 5 int 1 httpwwwcosccanterburyac nztadtakaokacosc230p1c addition if addopplus genopr02 subtraction else genopr03 1 1 conditionlevptx int lev ptx 1 int relop ifsymoddsym getsym expressionlev ptx genopr06 else expressionlevptx if symleqlampampsymlneqampampsymllssampamp symlleqampampsymlgtrampampsymlgeq error20 else relopsym getsym expressionlevptx switch relop case 9 genopr08 break equal case 10 genopr09 break not equal case ll genopr0lO break lt case 12 genopr0ll break gt case 13 genopr012 break gt case 14 genopr0l3 lt tatementlevptx lev x int i cxlcx2 if symident ipositionidptx ifi errorll else if tableikindlvariable error12 iO 1 getsym if symbecomes getsym else errorl3 expressionlevptx if ilO gensto leVtableilevel tableiadr 1 else if symcallsym procedure call getsym if symlident errorl4 else ipositionid ptx ifi errorll else if tableikindprocedure gencalleVtableilevel tableiadr else errorlS getSYmO 1 1 else if symifsym if statement getsym conditionlevptx if s thensym getsym else errorl6 cxlcx genjpc00 cl memorizes code index of jpc statementlev ptx codecxlacx fixup for jump address 1 else if symbeginsym begin end getsym statementlevptx whilesym semicolon11symwhilesym1 SEm 11 symcallsym11symbeginsym 1 ifsymsemicolon statementlevptx getsym else error10 1192007 1151 AM httpwwwcosccanterburyac nztadtakaokacosc230p1c 1 if symendsym getsym else errorl7 1 else if symwhilesym while statement cxlcx getsym conditionlevptx cx2cx genjpc00 c2 memorizes code index of jpc if sy dosym getsym else error18 statementlevptx genjmp0cxl jump to beginning of condition codecx2acx fixup 1 1 blocklev tx int lev int tx 1 int dx txO ch dx3 tXOtX tabletxadrcx genjmp00 if levgtlevmax error32 o if symconstsym getsym constdeclarationlevamptxampdx whilesymcomma getsym constdeclarationlevamptxampdx 1 ifsymsemicolongetsym else error5 1 if symvarsym getsym vardeclarationlevamptxampdx while symcomma getsym vardeclarationlevamptxampdx ifsymsemicolon getsym else error5 1 whilesymprocsym getsym ifsymident enterprocedureamptxampdxlev getsym 1 else error4 if symsemicolon getsym blocklevl tx ifsymsemicolon getsym 1 else error5 else error5 1 while symconstsym11symvarsym11symprocsym codetabletx0adracx tabletx0adrcx chcx geninc0dx statementlevamptx genopr00 1 int baselb in lb int bl find base l levels down b1b While lgt0 blsbl l 1 return bl 1 interpret int stacksize500 int pbt program base topstackregisters int i instruction register 6 of8 1192007 1151 AM httpwwwcosccanterburyac nztadtakaokacosc230p1c prihtfquotstart PLOhquot tO bl p0 S111O S1210 S131O do icodepf switch i case 1 t stcodepa p break 1it switch codepa case 0 returh tbl pst3 bst2 break1 case 1 stst p break case 2 t stststl p break case 3 t stststl p break case 4 t stststl p break case 5 t stststl p break case 6 if st2 stl else st0 p break case 7 p break case 8 t stststl p break case 9 t ststlstl p break case 10 t ststltstl p break case 11 t ststltstl p break case 12 t ststgtstl p break case 13 t ststgtstl p break 1 break case 3 t stsbasecodep1bcodep case 4 sbasecodep1bcodepast p prihtfquotdhquot st t break case 5 generate new block mark stlbasecodep1b st2b st3pl btl pcodepa break case 6 ttcodepa p break case 7 pcodepa break case 8 if st0 pcodepa else p t break a p break 1 1 while p0 prihtfquot END PLOhquot maihargc argv int argc char argv1 1 wordlquotbegihquot word2quotca11quot word3quotc0hstquot word4quotdoquot word5 ehdquot word6quotifquot word7quotod word8quotprocedurequot word9quotthehquot wordlOquotvarquot wordllquotwhi1equot wsymlbegihsym wsym2callsym wsym3c0hstsym wsym4dosym wsym5ehdsym wsym6ifsym wsym7oddsym wsym8procsym wsym9thehsym wsymlOvarsym wsymllwhi1esym ssym3939p1us ssym3939mihus ssym3939mu1t ssym3939s1ash ssym39391pareh ssym3939rpareh ssym3939eql ssym3939comma ssym3939period ssym3939heq ssym39lt391ss ssym39gt39gtr ssym39391eq ssym3939geq ssym3939semicoloh ffopehargvquotrquot kkal cc0 110 ch39 39 cx0 getsymO block0 O getchar for i Oiltcxli if iltlO prihtfquot quot prihtfquotd quoti switch codeif case 1 prihtfquot1it quot break case 2 prihtfquotopr quot break case 3 prihtfquot1od quot break 7 of8 1192007 1151 AM 80f8 case 4 printfquotsto U case 5 printfquotca1 quot case 6 printfquotinc U case 7 printfquotjmp quot case 8 printfquotjpc W 1 printfquotd quot printfquotd quot printfquotnquot 1 if symlperiod getchar interpret ex39 1 codei1 codeia error9 break break break break break httpwwwcosccanterburyac nztadtakaokacosc230p1c 1192007 1151 AM For the past several weeks we have studied a variety of advanced data structures whose primary purpose was the representation of data contained in main memory that supported an algorithm during its execution With the exception of Btrees in which part of the data was resident in main memory and part was resident in secondary memory all of these data structures were designed for data representation in main memory Hashing is a variant of a more general file organization technique called a direct le A direct file is a variant of an even more general type of file organization known as an indexed le lndexed files typically consist of two main structures an index structure and a main structure Similar to the concept employed in Btrees and Btrees with their index set and sequential set There are many different variations of indexed files however they can be broadly categorized into two categories which are based primarily on the density of entries in the index structure compared to the number of entries in the main file These two primary categories are sparse index les and dense index les A hash file or direct file falls generally under the category of a dense index file although it is a very special variant of the dense index file Hash files themselves are typically categorized in two different manners The first depends on whether the file structure is resident in main memory or on secondary memory The former is called internal hashing while the latter is called external hashing You may have been introduced to internal hashing in CS2 or perhaps in System Software COP 3402 as the commonly referred to hash table or hash file is a common data structure used within a compiler or assembler as a method for implementing a symbol table External hashing is a common database approach for hashing secondary memory primarily diskbased files The primary difference between internal hashing and external hashing is that in external hashing the hashing function is tailored to take advantage of the block based access methods on the disk drive This allows a single hash function value to load into main memory an enormous amount of data in one single disk fetch operation whereas with internal hashing typically either a single value record or very small number of values are returned for a single hash value In this set of notes we ll give a quick review of common hashing techniques and take a quick look at internal hashing and the problems associated with internal hashing We ll examine external hashing and finally we ll look at hashing Hashing 1 techniques that allow for dynamic file expansion something which is not feasible in terms oftime with internal hashing Hash functions are a specific case of a more general technique known as key toaddress transformations KTA transformations There are many different KTA transformation techniques possible Figure 1 illustrates the hierarchy of KTA transformations l I I Folding and ExpOnential PieceWise Digit Remainder XOR Adding Transform Linear Analysis of Transtrm Division Figure 1 Keytoaddress transformation hierarchy Distribution dependent transformations depend on at least approximate knowledge of the key values that will be expected The benefits that can be gained by distribution dependent techniques depend on openaddressing bucket size file density and the appropriateness of the transformation itself For small bucket size and a good distribution algorithm the improvement over randomizing transformations can be significant On the other hand the liabilities of distribution dependent transformations are major since a change in the key distribution can cause these methods to generate many more collisions Hashing 2 than a randomization would generate for the same data A benefit of some distribution dependent KTA transforms is that they can allow for maintaining sequentiality Such sequence maintaining transforms allow the addresses produced to increase with increasing value of the key Serial access is made possible in this case Othenvise a direct file does not generally support serial access In Figure 1 there are two distribution dependent transformation shown digit analysis and sequence maintaining transformations Deterministic transformations take the set of all key values and determine a unique corresponding address Algorithms which produce such transformations become very difficult to construct if the number of key values is large more than a few dozen Adding a new key value requires a new algorithm since the algorithm is dependent on the distribution of the source keys Therefore only static files can be feasibly processed using deterministic procedures Replacing the algorithm with a table of addresses corresponding to key values makes the problem more tractable solvable but in so doing you have essentially created an indexed file structure which is a completely different beast Deterministic algorithms are quite common for extremely static data in which the KTA transformation can be optimized to ensure 01 access time We won t discuss deterministic transformations any further Probabilistic transformations translate the key values into addresses which are within the fileaddress space using an algorithmic process Probabilistic take advantage of the random properties of the digits of a key value Operations such as arithmetic multiplication and addition which tend to produce normally distributed random values are undesirable when hashing A uniform distribution of the addresses is desired since this will evenly spread the key values records across the file space Uniform distribution of the data within the fileaddress space is optimal but difficult to achieve in general We ll see why this is later At any point the KTA transformation may produce for two or more different key values the same corresponding file address This causes a collision which must be handled by some technique such as rehashing chaining buckets etc we ll see these later as well Probabilistic transformations may either preserve the order of the records sequence maintaining transformations or they may be designed to maximize the degree of uniqueness of the resulting address The more common probabilistic transformation take this latter approach which is called a random KTA transformation or more commonly a hashing technique Digit analysis is a known distribution probabilistic hashing technique that attempts to capitalize on the existing distribution of key digits An estimate or a tabulation is made for each of the successive digit positions of the keys using a Hashing 3 sample of the records to be stored For example if the key is social security number then the sample of records that would be examined will probably show a uniform distribution over the loworder three digits A tabulation simply lists the frequency of distribution of zeros ones twos and so on The digit positions that show a reasonably uniform even distribution are candidates for use as digits in the file address A sufficient number of such digit positions must be found to make up the full address othenvise combinations of other digit positions perhaps taken modulo 10 or as appropriate can be tested A sequence maintaining transformation function can be obtained by taking a simplified inverse of the distribution of keys found The addresses are generated to maintain sequentiality with respect to the source key In a piece wise linear transformation the observed distribution is approximated either automatically or manually by simple line segments This approximation is then used to distribute the addresses in a complementary manner The remainder of division modulo operation of the key by a divisor equal to the number of record spaces allocated in the file can be used to obtain the desired address Division is in some sense similar to taking the loworder digits but when the divisor is not a multiple of the base of the number system of the key or the hardware information from the highorder portions of the key will be included and this additional will have a positive effect on the number of addresses generated and thus on the uniformity of the generated addresses Large prime numbers are generally used as divisors since their quotients exhibit a welldistributed behavior even when parts of the keys do not In general divisors that do not contain small primes lt 19 are adequate Empirical data has shown that division tends to preserve better than other methods preexisting uniform distributions especially uniformity due to sequences of loworder digits in assigned identification numbers The remainder does not preserve sequentiality The problem with division is in the capability of the available division operation itself Frequently the key field to be transformed is larger than the largest dividend the divide operation can accepts and some hardware does not have division instructions which provide a remainder although this is rare When this occurs the remainder address can be calculated according to the expression address key x m m The floor operation is necessary to prevent a smart optimizer from generating address O for every key which would lead to an extreme number of collisions n l if n records are to be stored Hashing 4 The exclusiveor technique typically divides the key digit string is segmented into parts which match the required address size Using this operation results in random patterns for random binary inputs The various segments are then exclusivelyor ed together to produce the address Segment sizes need to be chosen carefully so that they have no common divisor relative to word sizes This is among the faster KTA transformations available and is widely used Folding and adding of the key digit string produces a shorter string as the address and is a commonly used hashing technique Alternate segments of the key digit string are bitreversed The primary design criterion for an internal hash file is to achieve as nearly as possible 01 access time to any element in the file based upon access through the hash eld the component of the file element on which the elements are hashed Although the hash field may be any component of an element in the file it is typically the key value component on which the hashing occurs In order to achieve this 01 access criterion we need to first determine how a hash function operates Although the hash field does not need to be a key of the file in most cases it is and it is then typically referred to as the hash key For internal files those which have no component in secondary memory hashing is typically implemented as a hash table through the use of an array of records The typical configuration is shown in Figure 2 address out key value in main file M records Figure 2 Typical Internal Hashing Configuration Hashing 5 A collision occurs in a hash table any time that the hash function maps two or more key values into the same address within the address space There are two basic techniques that can be used to handle collisions the initial technique is a lazy approach also referred to as an optimistic approach and the second technique is a greedy approach also referred to as a pessimistic approach 1 Ignore the collision If the probability of collision is very low or the hash function is already too slow to add the overhead of collision resolution 2 Create and utilize a collision resolution protocol This adds complexity to hashed operations and causes extra implementation work Collision resolution protocols can range from fairly simple to very complex techniques Among the simplest protocols are 1 linear probing 2 quadratic probing 3 chaining More advanced techniques such as multiple hash functions and bucketing can be applied when the table size is relatively large Linear Probifno Technigue When a collision occurs sequentially search through the table from the point of the collision using wraparound searching modulo arithmetic until an empty location is found Specifically if the hash function returns a value H and location cell H is not empty then cell H1 is attempted followed by H2 H3 Hi using wraparound Example Suppose our hash function maps the letter Ato location 0 B to 1 Z to 26 And we are hashing based upon the first letter of a person s name With the input sequence Insert AI Insert Bob Insert Betty Insert Carl we can see how linear probing handles collisions Hashing 6 location value 0 AI should be in location 1 but a collision occurred moving it 1 Bob f to location 2 2 Betty 3 Carl 4 should be in location 2 but a collision occurred moving it to location 3 Details Retrievals are handled by hashing the key and comparing the data at the location provided by the hash function If the two values are not equal the location is incremented and the comparison is made again against the value in this new location This is repeated until either the key value is found or an empty location is encountered Deletion must be lazy This entails marking the item as deleted but leaving it in place in the table using a delete bit without actually physically removing it from the table This ensures that the lookup operation always works ltems which have been lazily deleted are only removed when they won t break a chain valid items or when a new item can be inserted at this location which ovenvrites the deleted item Analysis Definition Load factor The load factor of a probing hash table is the fraction of the table that is full The load factor is represented by the symbol 7 and generally ranges from 0 empty table to 1 full table Assuming that the probes are independent the average number of locations cells in the table that will be examined in a single probe is 11k This comes simply from the fact that the probability that a location is empty is Ht The above assumption is bad In fact linear probing causes a phenomenon called primary clustering These clusters are blocks of occupied cells locations These blocks cause excessive attempts to resolve collisions Hashing 7 Taking this into account the average number of cells that will need to be examined for an insertion into the hash table is 170 2 For halffull tables ie when 7t 3 05 this is an acceptable value of 25 but when 7s 09 the search will require that 50 cells on the average be examined We need a solution that eliminates primary clustering The following picture illustrates sort of the longterm effect primary clustering has on the file density The shaded areas indicate areas of the file that are occupied with records The unshaded areas are unoccupied areas containing no information Primary clustering tends to diVide the file space into discrete clusters which further increases the probability of collision and tends only to expand each cluster rather than spread the information across the file space Quadratic Prdbi n Quadratic probing eliminates the problem of primary clustering caused by linear probing The technique is similar to linear probing but the location increment is not 1 Specifically if the hash function produces a hash value a location or cell index of H and the search at location H is unsuccessful then the next location that is searched is H12 followed by H22 H32 H42 Hi2 using wraparound as before Example Suppose our hashing function is a simple mod operation on the size of the hash table If the hash table is size 10 and the input Hashing8 sequence is Insert89 Insert 18 Insert 49 Insert 58 Insert 9 Then the hash table is filled as shown below 49 H0 H1 10 0 58 H8 collision Hlmod 10 collision H4mod 10 2 9 H9 Hl 10 10 3 18 89 0 l 2 3 4 5 6 7 8 9 The question now becomes ls quadratic probing any better than linear probing If the size of the hash table is a prime number and k s 05 then all probes will be to different locations and an item can always be inserted and further no location will be probed twice during an access However at k 05 linear probing is fairly good and the removal of primary clustering by use of quadratic probing will only save 05 probes for an average insertion and 01 probes for an average successful search Quadratic probing provides an additional benefit in that it will be unlikely to encounter an excessively long probe as might be the case with linear probing However quadratic probing requires a multiplication the i2 term so an efficient algorithm forthis multiplication will be necessary Given the previous value of HM it is possible to determine the next value H without requiring the computation of i2 Assuming that we still require a wraparound technique this new value of H is computed as follows H HM 2i 1 mod tablesize This can be implemented as follows use an addition to incrementi use a left bit shift 1 to compute 2i a subtraction to compute 2i 1 a second addition to increment the old value of 2i 1 finally a modulo operation if wraparound is needed UIbOONA Hashing 9 Example Using the example from earlier consider the steps to insert58 Initially H0 58 mod 10 8 and collision results Then i 1 and H0 8 H1 H0 21 1mod 10 81mod 10 9 This too results in a collision so another value of H must be calculated as follows H2 H1 22 1mod 10 93mod 10 2 which is empty so insertion occurs at position 2 in the hash table Using the shift operation this example proceeds as with numbers shown in binary form Initially H0 58 mod 10 8 and collision results Then i 0001 and H0 1000 H1 1000 0010 0001mod 10 81mod 10 9 This too results in a collision so another value of H must be calculated as follows H2 1001 0100 0001mod 10 93mod 10 2 which is empty so insertion occurs at position 2 in the hash table Quadratic probing eliminates primary clustering but introduces the problem of secondary clustering Elements which hash to the same location will probe the same set of alternative locations This however is not a real concern Simulations have shown that in general less than 05 additional probes are required per search and this only occurs for high load factors If secondary clustering does present a problem for a given application there are techniques which will eliminate it altogether One of the more popular techniques is called double hashing in which a second hash function is used to drive the collision resolution Ch a ini n Maintain an array of linked lists at each hash addressable location The hash function returns an index of a specific list Insertions deletions and searches occur in that list If the lists are kept short then the potential performance bottleneck is eliminated A is calculated by dividing the total number of nodes N by the number of lists which are maintained M A NM A is no longer bounded by 10 but has an average value of 10 The expected number of probes for insertion and an unsuccessful search is The expected number of probes for a successful search is 1 A2 Hashing 10 M6N15kNM15625 Has List Address 0 Ann Art Ali 1 Kris 2 3 Cris Cindi C n Calli Carl y 4 gt 5 h Each list referenced by the hash table is a singlylinked list see previous notes for implementation details The singlylinked lists shown above do not have a tail node Would the use of a tail node be beneficial in this data structure The answer is yes it could help in two different ways Notice that there is no implied order to the elements of a specific list This is done since insertion into a hash table should be an 01 operation If the list is maintained in alphabetical order then insertion will not be an 01 operation and we would violate one of the specifications of the hash table data structure This also happens in the implementation shown above since we have no way other than traversing the list of finding the end of the list Therefore a better implementation is the one shown on the next page Hashing 1 1 Hash Address LlSt TAIL Notice in this implementation of the hash table that even the hash addresses with no entries maintain an empty list chain The first way that the tail node improves the implementation is as follows in typical implementations the tail node will actually contain a data field which is usually set to the largest possible key value that will could be hashed This eliminates null value comparisons in the code replacing them with perhaps comparisons to Maxlnt or something similar Since each list has a logical end there should be no problems associated with running off the end of a list Also notice how wasteful of space it is to have a separate tail node for every list In reality all of these nodes will be condensed to a single node to which all lists will link This is shown in the next diagram Hashing 12 Hash Address 0 Notice that this better implementation still does not provide 01 insert time unless we can identify have a reference to the node immediately preceding the tail node in any given list For example if we want to insert Alice into the first list having a tail node only tells us where the end of the list is not where the node next to the end of the list is What do we do to get our required 01 insert The answer has been available all along and none of the improvements that we have made to our structure have done anything toward this end Recall some of the issues we discussed when dealing with the implementation of linked lists in 082 We stated that in a list without header and tail nodes that insertion at either end of the list was a special case that was different from inserting in the middle of the list So we put header and tail nodes in to prevent the special cases from occurring However in our hash table structure there has been a header node all along It is embedded in the hash table itself as the reference to the chain for each hashable location Therefore to achieve 01 insertion time we simply perform ALL insertions at the head of the list rather than at the tail of the list A potential benefit of this is that the chain will contain the elements in the order of their arrival ie they appear in entry order within each chain This again illustrates that you need to be aware of the various implementation issues for all of the data structures that are involved in any application The final diagram illustrates the insertion of a newly hashed value into our hash table Hashing 13 Hash Address Hash tables can be used to implement insert and nd operations in 01 time on the average There are many implementation factors that can influence the performance of the hash table such as the load factor the hash function itself file size input rates and distributions as well as many other factors It is important to pay attention to these details if you are to perform these operations in 01 time Hashing techniques for secondary storage primarily disk files is called external hashing Basically external hashing is the same as internal hashing however the hash address is optimized to take advantage of the blockoriented nature of external memory and is thus optimized toward the hash bucket size In an external hashing environment a bucket is either a single block or a cluster of contiguous blocks Which is used depends on several factors which include the size of the physical records and how this relates to the blocking factor whether the records are spanned or unspanned whether the records are compressed or not as well as several other factors The hashing function maps a key into a Hashing 14 relative bucket address rather than to assign an absolute block address to the bucket A table typically a hash table maintained in the file header is used to convert the bucket number into the corresponding disk block address This is illustrated in Figure 3 bucket block addr disk 0 V m2 m l H I V Figure 3 Typical External Hashing Configuration The collision problem that we discussed in the context of internal hash files is less severe when buckets are utilized because as many records as will fit into a bucket can hash to the same bucket address without causing problems For this same reason buckets are sometimes used with internal hash structures when the internal file is relatively large However collision must still be handled because there is the possibility that a bucket will fill up and then overflow on the next insertion to that bucket Typically a variation of chaining is employed when a bucket overflows in which a pointer is maintained in each bucket to a linked list of overflow records belonging to that bucket The pointers in the linked list will be record pointers meaning that they include both a block address and a relative record position within the block Hashing 15 Although it is more of a problem with external hashing than internal hashing a nonkey based search in a hashed file is a very costly operation in terms of time There is another file organization technique in which the records of the file appear in no particular order analogous to an unsorted array called a heap le Since there is no order to this type of file on any field within the records of the file sequential searching operations are the only suitable search technique Hashed files on the other hand were designed to provide 01 access time to the file This access time was based upon the hash field again typically the key field If access to the hashed file is to be through any field other than the hash field this includes secondary key fields access deteriorates to that of a sequential search The hashing schemes that we have examined for internal hashing are basically the same as those used for external hashing with the only slight change being the adaptation for bucket addresses relating to physical block addresses in the case of external hashing The hashing techniques that we have seen so far are called static hashing techniques In static hashing a fixed number of file locations size of the address space for internal hashing and the number of allocated buckets for external hashing are allocated to the file structure based upon the initial requirements This is a serious drawback for dynamic files A dynamic file is one whose size total number of bytes required by all records in the file changes perhaps drastically over time Suppose that we allocate a total of M buckets for the address space of a hashed file and let m be the maximum number of records that can fit into a single bucket then at most m x M records will fit into the allocated address space If the number of records ultimately turns out to be substantially fewer than m x M records we will be left with a lot of unused space On the other hand if the number of records increases to substantially more than m x M records numerous collisions will result and retrieval operations will be significantly slowed due to the long lists of overflow records that will require traversing In either case the number of blocks allocated to the file M may need to be changed This will require the development of a new hash function it must handle the larger allocation to redistribute the existing records into the new space allocation This type of reorganization is extremely time consuming for large files There are two primary schemes that have been developed to allow dynamic resizing of hashed files Both schemes are designed for external hashing applications and are not in general applicable to internal hashing The first Hashing 16 type maintains an access structure similar to an indexed file in addition to the main file The most common techniques which fit into this category are called dynamic hashing and extendibe hashing The second type does not maintain the access structure but allows for dynamic resizing The best example of this latter type is called linear hashing These hashing schemes take advantage of the fact that the result of the hashing function is usually a nonnegative integer and therefore can be represented as a binary number The access structure is built on the binary representation of the result of applying the hashing function to the hash field value of a record which is a string of bits This is called the hash value of the record Records are distributed among buckets based on the values of the leading bits in their hash values ln dynamic hashing the number of buckets is not fixed as in regular hashing but expands and contracts as needed The file can start with a single bucket once that bucket is full an insertion will cause the bucket to overflow The overflow will cause the bucket to split into two buckets The records are distributed among the two buckets based on the value of the first bit of their hash values All records whose hash value starts with a 0 bit are stored in one bucket and all those whose hash value starts with a 1 bit are stored in the other bucket The indexing structure is a binary tree in which convention has set the left child pointers for internal nodes correspond to a 0 bit and the right child pointers for the internal nodes correspond to a 1 bit Leaf nodes hold pointers to buckets Figure 4 illustrates the basic structure of a dynamically hashed file data file buckets Figure 4 Structure of a dynamically hashed le Hashing 17 Figure 5 illustrates more of the details of the dynamically hashed file structure In Figure 5 the tree portion of the hashed file the index structure has leaf nodes on two levels indicating that some buckets have already split due to insertion overflow Assume that key values are six bits in length the buckets Figure 5 An example of a dynamically hashed file with a bucket size of four Consider in Figure 5 how the left subtree of the root came to be in the configuration that it is shown Initially for example only a single record would have been in the left subtree and the key value for this record would have contained a MSB of 0 Since the bucket size in the file is four three additional records would have been inserted into the left subtree of the root before the first split occurred Upon splitting this left child node the key values all four of them would have been redistributed into the two nodes based upon the two MSBs Insertions would continue until the 00 bucket became full again at which point the 00 bucket is split into a 000quot and 001quot buckets with the subsequent key value redistribution At the point the file is shown in Figure 5 the 01 bucket has not yet split there is still room in this bucket for three more key values to be inserted and hence this leaf node is one level higher in the tree than are the leaf nodes for the 000quot and 001quot buckets To illustrate what happens to the dynamically hashed file when an insertion causes an overflow consider inserting the new key value 100110 into the file Hashing 18 structure shown in Figure 5 This will require splitting the leftmost bucket in the right subtree of the root the bucket for 100 since this is the bucket in which key value 100110 hashes based upon its three MSBs Splitting this bucket will add a new level to the index structure by replacing the current leaf node associated with the 100 bucket with a new internal node and pointers to two leaf nodes one for key values 1000 and one for key values 1001 Notice that adding a level to the index structure requires that we differentiate keys on one more bit along this path The remainder of the index structure is unchanged Figure 6 illustrates this splitting and redistribution of the key values in the current 100 bucket into the 1000 and 1001 buckets 1i 39 ll Cir the buckets Figure 6 Bucket splitting on over ow in a dynamically hashed file Key value 100110 inserted is inserted into the le structure shown in Figure 5 As illustrated by the insertion example shown using Figures 5 and 6 the dynamically hashed file can easily expand when required by allocating another bucket and redistributing the key values into two buckets one level deeper in the index structure The dynamically hashed file structure can also contract when a deletion empties a bucket causing an underflow condition to occur Using Figure 5 as the starting point assume that both key values 001001 and 001011 are deleted from the file Since these are the only two key values in their bucket the bucket Hashing 19 will empty and the two leaf nodes will contract into their parent node which will become a leaf node and a single bucket the contract bucket s sibling will be the only remaining bucket in this subtree Notice too that no redistribution of key values will be required on a contraction Figure 7 illustrates the changes that will occur to the file shown in Figure 5 when these two key values are deleted the buckets Figure 7 Bucket contraction caused by an under ow on deletion Key values 001001 and 001011 are deleted from the dynamically hashed le shown in Figure 5 which produces this file structure If the hash function distributes the key values uniformly the index structure will be balanced In some systems rather than wait for an underflow condition to develop on deletion contraction of two siblings can occur at any point in time when the total number of key values in the two sibling buckets is less than or equal to the size of a single bucket This make optimal use of bucket space but may unnecessary splitting on insertion Whether to use advance contraction or not depends in part on the access patterns to the file If insertions tend to dominate deletions then advance contraction is not typically a good idea on the other hand if deletions tend to dominate insertions then bucket space utilization can be optimized through advance contraction Hashing 20
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'