Distributed Systems CS 757
Synchronization (Chapter 6)
Slides copyright K. Goseva, 2008.

Outline
- Clock synchronization
- Logical clocks
- Mutual exclusion
- Election algorithms

Clock synchronization
When each machine has its own clock, an event that occurred after another event may nevertheless be assigned an earlier time. Example with the UNIX make program: the compiler runs on one machine and the editor on another; output.o is created at local time 2144 on the compiler's machine, while output.c is modified later at local time 2143 on the editor's machine, so make wrongly concludes that output.o is still up to date. (Figure: two machines whose local clocks disagree while output.o and output.c are created.)
Simple question: is it possible to synchronize all clocks in a distributed system?

Physical clocks
Computation of the mean solar day: a transit of the sun occurs when the sun reaches the highest point of the day. Because the Earth also moves along its orbit, at the transit of the sun n days later the Earth has rotated fewer than 360 degrees. (Figure: the Earth on day 0 and on day n at the transit of the sun, with the direction to a distant galaxy as reference.)

Physical clocks (continued)
- Atomic clock (1948): counts transitions of the cesium-133 atom.
- International Atomic Time (TAI): TAI seconds are of constant length, unlike solar seconds.
- Leap seconds are introduced when necessary to keep in phase with the sun.
- Universal Coordinated Time (UTC) replaced the old standard, Greenwich Mean Time, which is an astronomical time. The OS must have special software to account for leap seconds. (Figure: leap seconds inserted into UTC to keep it in phase with TAI.)

Clock synchronization algorithms
- Each machine has a timer that causes an interrupt H times a second.
- The interrupt handler adds 1 to a software clock that keeps track of the number of interrupts.
- When the UTC time is t, the value of the clock on machine p is C_p(t). Ideally C_p(t) = t for all p and all t, that is, dC/dt = 1.
- The relative error of modern timer chips is about 10^-5. The timer is working within its specification when
  1 - rho <= dC/dt <= 1 + rho,
  where rho is the maximum drift rate.
- If two clocks drift away from UTC in opposite directions, then at a time Delta-t after they were synchronized they may be 2 * rho * Delta-t apart. To ensure that clocks never differ by more than delta, they must be resynchronized at least every delta / (2 * rho) seconds.
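As a small worked example of the resynchronization bound just stated, here is a minimal Python sketch; the helper name and the numbers are illustrative, not from the slides.

```python
# Hypothetical illustration: how often two clocks that drift at most rho
# seconds per second away from UTC must be resynchronized so that they
# never differ by more than delta seconds.
def max_resync_interval(delta: float, rho: float) -> float:
    """Two clocks drifting in opposite directions diverge at rate 2*rho,
    so they must be resynchronized at least every delta / (2*rho) seconds."""
    return delta / (2 * rho)

if __name__ == "__main__":
    rho = 1e-5      # typical maximum drift rate of a cheap timer chip
    delta = 1e-3    # tolerated skew: 1 millisecond
    print(f"resync at least every {max_resync_interval(delta, rho):.1f} s")  # 50.0 s
```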
Global Positioning System (GPS)
A related problem is determining one's position anywhere in the world. GPS is a satellite-based distributed system, launched in 1978. It uses 29 satellites, each circulating in an orbit at about 20,000 km. Each satellite has up to four atomic clocks. A satellite continuously broadcasts its position and timestamps each message with its local time. Three satellites are needed to determine the longitude, latitude, and altitude of a receiver on Earth.

Two factors have to be taken into account:
1. It takes a while before the data on a satellite's position reaches the receiver.
2. The receiver's clock is generally not in sync with the satellite's clock.
The fact that all clocks are not perfectly synchronized adds to the complexity. Assume that the timestamp from the satellite is completely accurate and that the deviation of the receiver's clock from the actual time is Delta_r.

When a message is received from satellite i with timestamp T_i, the delay measured by the receiver is
  Delta_i = (T_now - T_i) + Delta_r.
As signals travel with the speed of light c, the measured distance to the satellite is
  c * Delta_i = c * (T_now - T_i) + c * Delta_r = d_i + c * Delta_r,
while the real distance is
  d_i = sqrt((x_i - x_r)^2 + (y_i - y_r)^2 + (z_i - z_r)^2).
With four satellites we get four equations in the four unknowns: the receiver coordinates x_r, y_r, z_r and its clock deviation Delta_r.

In reality many things are not perfect:
- GPS does not take leap seconds into account; the deviation from UTC can easily be taken care of in software.
- The atomic clocks in the satellites are not always exactly in sync.
- The position of a satellite is not known precisely.
- The receiver's clock has finite accuracy.
- The signal propagation speed is not constant.
- The Earth is not a perfect sphere.
Relatively cheap GPS receivers achieve a precision within 15 meters for the location; professional receivers hooked up to a computer network achieve a time error of a few tens of nanoseconds.

Network Time Protocol (NTP)
A common approach in many protocols is to let the client contact a time server. Client A records the send time T1, server B records the receive time T2 and the reply time T3, and A records the arrival time T4. Assume the propagation delays are roughly symmetric, that is, T2 - T1 is approximately T4 - T3. Then the estimated one-way delay is
  delta = ((T2 - T1) + (T4 - T3)) / 2,
and A can estimate its offset relative to B as
  theta = T3 + delta - T4 = ((T2 - T1) + (T3 - T4)) / 2.

If A's clock is fast, theta < 0, which means A should set its clock backward. However, time is not allowed to run backward, so changes are introduced gradually. Assume the timer is set to generate 100 interrupts per second, i.e., interrupts are nominally 10 msec apart. To slow down, each interrupt adds only 9 msec until the correction is made; to speed up, each interrupt adds 11 msec until the correction is made.

NTP lets servers set up pairwise synchronization. Eight pairs of (theta, delta) values are buffered; the minimum delta is adopted as the best estimate of the delay, and the associated theta as the best estimate of the offset. NTP divides servers into strata: a server with a reference clock is a stratum-1 server. When A contacts B, A adjusts its time only if its stratum level is higher than B's; after the synchronization, A's stratum level becomes one higher than B's.

Multiple external time sources
If extremely accurate synchronization with UTC is required, multiple receivers for WWV, GEOS, and other UTC sources are needed. Because of inaccuracy in the time sources and fluctuations in the signal paths, the best the OS can do is establish a range in which UTC falls.

Use of synchronized clocks
In the last few years, hardware and software for synchronizing clocks on a wide scale (i.e., over the entire Internet) has become available, making it possible to keep millions of clocks synchronized to within a few milliseconds of UTC. New algorithms that utilize synchronized clocks are starting to appear, for example to enforce at-most-once message delivery to a server even in the face of crashes, achieve cache consistency, and handle commitment in atomic transactions.
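A small sketch of the NTP offset and delay computation described above, in plain Python (no actual NTP library is used; the timestamps are invented):

```python
# Minimal sketch of one NTP request/reply exchange between client A and
# time server B, using the four timestamps T1..T4 defined above.
def ntp_estimate(t1: float, t2: float, t3: float, t4: float):
    delay = ((t2 - t1) + (t4 - t3)) / 2    # estimated one-way delay
    offset = ((t2 - t1) + (t3 - t4)) / 2   # theta = T3 + delay - T4
    return offset, delay

# Example: A's clock is ~5 ms behind B, one-way network delay ~10 ms.
offset, delay = ntp_estimate(t1=100.000, t2=100.015, t3=100.016, t4=100.021)
print(offset, delay)   # offset ~ +0.005 s (A should advance), delay ~ 0.010 s
```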
The Berkeley algorithm
Suitable when no machine has a WWV receiver. The time server (the time daemon) is active:
a) The time daemon polls all the other machines for their clock values.
b) The machines answer.
c) The time daemon computes an average time and tells every machine how to adjust its clock.
(Figure: the daemon reading 3:00 polls machines reading 3:00, 3:25, and 2:50; it then tells each machine how to adjust toward the average, 3:05.)

Clock synchronization in wireless networks
As in the Berkeley algorithm, Reference Broadcast Synchronization (RBS) aims at internally synchronizing the clocks instead of synchronizing all nodes with UTC. A sender broadcasts a reference message which allows the receivers to adjust their clocks. (Figure: message preparation and time spent in the NIC are excluded; only the delivery time to the application lies on the critical path.)

When a sender broadcasts reference message m, each node p records the time T_{p,m} at which it receives the message, according to its local clock. Two nodes p and q then exchange their recorded times so they can estimate their relative offset:
  Offset[p,q] = ( sum over k = 1..M of (T_{p,k} - T_{q,k}) ) / M,
where M is the total number of reference messages sent. Node p then knows the value of q's clock relative to its own.

This estimate works only if the clocks do not drift apart. If the clocks of p and q drift apart, the later values become less accurate than the earlier ones, and the simple average above no longer works. An algorithm that uses linear regression instead computes the offset as a function of time:
  Offset[p,q](t) = alpha * t + beta,
where the constants alpha and beta are computed from the pairs (T_{p,k}, T_{q,k}).
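A minimal sketch of the RBS offset estimation just described, assuming numpy is available; the sample timestamps are invented for illustration:

```python
import numpy as np

# Times (seconds) at which nodes p and q received the same M reference broadcasts.
t_p = np.array([10.000, 20.001, 30.003, 40.004])
t_q = np.array([10.002, 20.005, 30.009, 40.012])

# Simple average: works only if the two clocks do not drift apart.
avg_offset = np.mean(t_p - t_q)

# With drift, fit Offset[p,q](t) = alpha * t + beta by linear regression.
alpha, beta = np.polyfit(t_p, t_p - t_q, deg=1)
print(avg_offset, alpha, beta)
```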
Logical clocks
For many purposes it is enough that all machines agree on the same time; consistency of the clocks matters, not how close they are to the real time. These are logical clocks. Furthermore, for many purposes what matters is only that all machines agree on the order in which events occur.

Lamport timestamps
The happens-before relation a -> b means that all processes agree that event a occurs first and then event b occurs.
1. If a and b are events in the same process and a occurs before b, then a -> b is true.
2. If a is the event of a message being sent by one process and b is the event of that message being received by another process, then a -> b is also true. A message cannot be received before it is sent, or even at the same time it is sent, since it takes a finite, nonzero amount of time to arrive.
Happens-before is a transitive relation: if a -> b and b -> c, then a -> c. If two events x and y happen in different processes that do not exchange messages, then neither x -> y nor y -> x is true; x and y are said to be concurrent.
To every event a we assign a time value C(a) on which all processes agree, such that if a -> b then C(a) < C(b). The clock time C must always go forward (increasing), never backward (decreasing); corrections to time can be made by adding a positive value, never by subtracting one.

Example: three processes, each running on a different machine whose clock may run at a different rate. (Figure: messages m1 and m2 arrive with consistent timestamps, but later messages would arrive "before" they were sent, so P2 and P1 adjust their clocks forward according to Lamport's algorithm.)

Assigning time to events in a distributed system is subject to the following conditions:
1. If a happens before b in the same process, then C(a) < C(b).
2. If a and b represent the sending and receiving of a message, respectively, then C(a) < C(b).
3. For all distinct events a and b, C(a) != C(b).
This algorithm provides a total ordering of all events in the system, and is used by many other distributed algorithms that need ordering.

How the ordering is achieved:
1. Each process P_i maintains a local counter C_i. Before executing an event, P_i executes C_i <- C_i + 1.
2. When process P_i sends a message m to P_j, it sets m's timestamp ts(m) <- C_i after having executed the previous step.
3. Upon receipt of a message m, process P_j adjusts its own local counter as C_j <- max(C_j, ts(m)), after which it executes the first step and delivers the message to the application.
(Figure: the clock adjustment and message timestamping are implemented in the middleware layer, between the application and the network.)

Example: totally-ordered multicasting
Consider updating a replicated database. If update 1 is performed before update 2 at one replica but in the opposite order at the other, the database is left in an inconsistent state. Both copies should end up exactly the same, which requires a totally-ordered multicast.

Consider a group of processes that multicast messages to each other. Each message is timestamped with the current logical time of its sender. When a message is multicast, it is conceptually also sent to the sender itself. Messages from the same sender are received in the order they were sent, and no messages are lost. When a process receives a message, it puts the message into a local queue, ordered according to its timestamp, and multicasts an acknowledgement to the other processes.

By Lamport's algorithm for adjusting local clocks, the timestamp of a received message is lower than the timestamp of its acknowledgement. Lamport's algorithm also ensures that no two messages have the same timestamp, so the timestamps reflect a consistent global ordering of events. A process may deliver a queued message to the application only when that message is at the head of the queue and has been acknowledged by every other process.
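The three update rules above are easy to capture in code. The following is a minimal sketch with assumed names (not from the slides); such a clock is the building block of the totally-ordered multicast just described.

```python
class LamportClock:
    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Rule 1: increment the local counter before every event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Rule 2: timestamp an outgoing message."""
        return self.tick()

    def receive(self, ts: int) -> int:
        """Rule 3: adjust to max(local, received), then count the delivery."""
        self.time = max(self.time, ts)
        return self.tick()

p1, p2 = LamportClock(), LamportClock()
m = p1.send()          # p1 sends a message with timestamp 1
print(p2.receive(m))   # p2 delivers it at logical time 2
```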
Vector clocks
Lamport timestamps do not capture causality. Example: an Internet electronic bulletin-board service where users post articles and react to posted articles, and postings within a discussion group are multicast to all group members. Totally-ordered multicasting does not imply that if message B is delivered after message A, then B is a reaction to what was posted in message A; A and B may be completely independent. Causality may be captured by vector clocks.

If VC(a) < VC(b), then event a causally precedes event b. Each process P_i maintains a vector VC_i with the following properties:
- VC_i[i] is the number of events that have occurred so far at P_i; it is incremented at each new event that happens at P_i.
- If VC_i[j] = k, then P_i knows that k events have occurred at P_j.
The vectors are piggybacked on the messages that are sent.

The rules:
1. Before executing an event, P_i executes VC_i[i] <- VC_i[i] + 1.
2. When process P_i sends a message m to P_j, it sets m's vector timestamp ts(m) <- VC_i after having executed the previous step.
3. Upon receipt of a message m, process P_j adjusts its own vector by setting VC_j[k] <- max(VC_j[k], ts(m)[k]) for each k, after which it executes the first step and delivers the message to the application.

Vector clocks and causally-ordered multicasting
Suppose P_j receives message m from P_i with vector timestamp ts(m). The delivery of the message to the application is delayed until the following conditions are met:
1. ts(m)[i] = VC_j[i] + 1: m is the next message P_j expects from P_i.
2. ts(m)[k] <= VC_j[k] for all k != i: P_j has already seen all the messages that P_i had seen when it sent m.
(Figure: P0 multicasts m; P1 receives it and multicasts a causally dependent message m*; m* arrives at P2 sooner than m, so its delivery to P2's application layer is delayed until m has been received and delivered.)
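A minimal sketch (assumed class and method names) of the vector-clock rules and the two delivery conditions above, for N processes numbered 0..N-1:

```python
class VectorClock:
    def __init__(self, n: int, pid: int) -> None:
        self.vc = [0] * n
        self.pid = pid

    def event(self) -> None:
        self.vc[self.pid] += 1          # step 1: count a local event

    def send(self) -> list:
        self.event()                    # step 2: timestamp = copy of own vector
        return list(self.vc)

    def can_deliver(self, ts: list, sender: int) -> bool:
        """Causal delivery: m must be the next message expected from `sender`,
        and we must already have seen everything the sender had seen."""
        return ts[sender] == self.vc[sender] + 1 and all(
            ts[k] <= self.vc[k] for k in range(len(ts)) if k != sender)

    def deliver(self, ts: list) -> None:
        self.vc = [max(a, b) for a, b in zip(self.vc, ts)]   # step 3: merge

p2 = VectorClock(3, pid=2)
print(p2.can_deliver([1, 0, 0], sender=0))  # True: next message expected from P0
print(p2.can_deliver([1, 1, 0], sender=1))  # False: P1 saw a P0 message P2 has not
```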
Mutual exclusion
When a process has to read or update shared data structures, it first enters a critical region to achieve mutual exclusion and ensure that no other process will use the shared data structures at the same time. In single-processor systems, critical regions are protected using semaphores, monitors, and similar constructs. In distributed systems, the most straightforward algorithm is to simulate how it is done on a single processor: a centralized algorithm with a coordinator.

Centralized algorithm
a) Process 1 asks the coordinator for permission to enter a critical region; permission is granted.
b) Process 2 then asks permission to enter the same critical region; the coordinator does not reply (or sends "permission denied") and queues the request.
c) When process 1 exits the critical region it tells the coordinator, which then replies to process 2.
Advantages:
- Easy to implement, and requires only three messages per use of a critical region (request, grant, release).
- It is fair: requests are granted in the order they are received, so no process ever waits forever (no starvation).
- It can be used for more general resource allocation rather than just managing critical regions.
Shortcomings:
- The coordinator is a single point of failure.
- A process that blocks after making a request cannot distinguish a dead coordinator from "permission denied", since in both cases no message comes back.
- In large systems a single coordinator becomes a performance bottleneck.

Decentralized algorithm
A probabilistic, distributed mutual exclusion algorithm based on voting. Each resource is replicated n times, and every replica has its own coordinator that controls access to it. A process can get access only if it gets a majority vote from m > n/2 coordinators. This is less vulnerable to the failure of a single coordinator.
If a coordinator fails or resets, it may forget that it granted access to a resource before the failure and incorrectly grant permission to another process after its recovery. Let p be the probability that a coordinator fails/resets during a time interval Delta-t. The probability that k out of the m coordinators reset in the same interval is
  P[k] = C(m, k) * p^k * (1 - p)^(m - k),
and the probability that a violation occurs is the sum of P[k] for k = 2m - n, ..., m.

Distributed algorithm
A deterministic distributed mutual exclusion algorithm. It requires a total ordering of all events in the system; Lamport's algorithm is one way to provide the timestamps needed for distributed mutual exclusion. When a process wants to enter a critical region, it builds a message containing the name of the critical region, its process number, and the current time, and sends it to all other processes, conceptually including itself. Sending of messages is assumed to be reliable: every message is acknowledged. Instead of individual messages, reliable group communication can be used.
When a process receives a request message from another process, three cases are distinguished:
1. If the receiver is not in the critical region and does not want to enter it, it sends back an OK message to the sender.
2. If the receiver is already in the critical region, it does not reply; instead, it queues the request.
3. If the receiver also wants to enter the critical region, it compares the timestamp of the incoming message with the timestamp contained in the request it has sent to everyone; the lowest one wins.
After sending out a request asking for permission to enter the critical region, a process waits until everyone else has given permission. Once all permissions are in, the process may enter. When it exits the critical region, it sends OK messages to all processes on its queue and deletes them from the queue.
(Figure: processes 0 and 2 want to enter the same critical region at the same moment; process 0 has the lower timestamp, so it wins; when process 0 is done it sends an OK to 2, which may then enter the critical region.)
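A sketch of the three request-handling cases and the release step described above; `node` and its fields (in_region, wants_region, deferred, my_ts, pid, send_ok) are assumed helper structures, not anything defined in the slides:

```python
def on_request(node, req_ts: int, req_pid: int) -> None:
    """Called when `node` receives REQUEST(req_ts, req_pid) for the region."""
    if not node.in_region and not node.wants_region:
        node.send_ok(req_pid)                       # case 1: not interested
    elif node.in_region:
        node.deferred.append(req_pid)               # case 2: queue the request
    else:
        # case 3: both want the region; lowest (timestamp, pid) wins
        if (req_ts, req_pid) < (node.my_ts, node.pid):
            node.send_ok(req_pid)
        else:
            node.deferred.append(req_pid)

def on_exit_region(node) -> None:
    """On leaving the critical region, answer all deferred requests."""
    for pid in node.deferred:
        node.send_ok(pid)
    node.deferred.clear()
```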
Analysis of the distributed algorithm
The number of messages per entry is 2(n - 1) for n processes. Mutual exclusion is guaranteed without deadlock or starvation, and there is no single point of failure. However, there are now n points of failure: the distributed algorithm has an n-times larger failure probability and requires much more traffic than the centralized algorithm.
The algorithm can be patched: the receiver always sends a reply, either granting or denying permission; whenever a request or reply is lost, the sender keeps trying until either a reply comes back or the sender concludes that the destination is dead. Other problems remain: either group communication must be used or each process must maintain the group membership list, and since all processes are involved in all decisions, a single slow process slows down the whole system. Overall, the distributed algorithm for mutual exclusion is slower, more complicated, more expensive, and less robust than the centralized one.

Token ring algorithm
Start from an unordered group of processes on a network and construct a logical ring in software. When the ring is initialized, process 0 is given a token. The token circulates around the ring, and a process may enter a critical region only when it holds the token. When no process wants to enter any critical region, the token just circulates around the ring.
Once a process decides it wants to enter a critical region, at worst it has to wait for every other process to enter and leave one critical region, so starvation cannot occur. Problems: detecting that the token is lost is difficult, as is handling a crashed process. One solution is to require the process that receives the token to acknowledge the receipt; this requires every process to maintain the current ring configuration.

Comparison of mutual exclusion algorithms
  Algorithm     | Messages per entry/exit | Delay before entry (in message times) | Problems
  Centralized   | 3                       | 2                                     | Coordinator crash
  Decentralized | 3mk, k = 1, 2, ...      | 2m                                    | Starvation, low efficiency
  Distributed   | 2(n - 1)                | 2(n - 1)                              | Crash of any process
  Token ring    | 1 to infinity           | 0 to n - 1                            | Lost token, process crash

Election algorithms
Many distributed algorithms require one process to act as coordinator, initiator, or otherwise perform some special role. Election algorithms attempt to locate the process with the highest process number and designate it as coordinator. Every process knows the process number of every other process, but processes do not know which processes are currently up and which are currently down.

The bully algorithm
When any process P notices that the coordinator is no longer responding to requests, it initiates an election:
1. P sends an ELECTION message to all processes with higher numbers.
2. If no one responds, P wins the election and becomes coordinator.
3. If one of the higher-ups answers, it takes over; P's job is done.
The biggest guy in town always wins, hence the name "bully algorithm".
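A compact sketch of the election rules above; the function and the alive-set simulation are illustrative only (a real implementation uses messages and timeouts):

```python
def bully_election(initiator: int, processes: list[int], alive: set[int]) -> int:
    higher = [p for p in processes if p > initiator and p in alive]
    if not higher:
        return initiator        # nobody higher answered: the initiator wins
    # Otherwise a higher process takes over; ultimately the highest alive
    # process wins and announces itself with a COORDINATOR message.
    return bully_election(max(higher), processes, alive)

procs = [0, 1, 2, 3, 4, 5, 6, 7]
print(bully_election(4, procs, alive={0, 1, 2, 3, 4, 5, 6}))   # 7 crashed -> 6 wins
```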
Bully algorithm example:
- The previous coordinator, process 7, has crashed. Process 4 is the first to notice, so it sends an ELECTION message to all processes with higher numbers.
- Processes 5 and 6 both respond with OK; 4's job is over.
- Processes 5 and 6 each hold their own election.
- Process 6 collects state information from disk, tells 5 that it will take over (it knows 7 is dead and that it, 6, is the winner), and announces itself by sending a COORDINATOR message to all running processes.

A ring algorithm
Assume the processes are physically or logically ordered, so that each process knows who its successor is. When any process notices that the coordinator is not functioning, it sends an ELECTION message containing its own process number to its successor; if the successor is down, the sender skips over it until a running process is located. At each step the sender adds its own process number to the list in the message. When the message gets back to the process that started it all, the message type is changed to COORDINATOR and it circulates once more to inform everyone who the coordinator is: the list member with the highest number. (Figure: a process notices that the previous coordinator has crashed and circulates an ELECTION message around the ring; the crashed node gives no response and is skipped.)

Elections in wireless ad hoc networks
Traditional election algorithms are based on assumptions that do not hold in wireless ad hoc networks: that message passing is reliable and that the topology of the network does not change. Electing the best leader in a wireless ad hoc network:
- Any node in the network, called the source, can initiate an election by sending an ELECTION message to its immediate neighbors.
- When a node receives an ELECTION message for the first time, it designates the sender as its parent and sends an ELECTION message to all its immediate neighbors except the parent.
- When a node receives an ELECTION message from a node other than its parent, it only acknowledges the receipt.
- If all of a node's neighbors already have a parent, the node reports back to its parent, including information such as its battery lifetime and other resource capacities.
Example: node a initiates an election by broadcasting an ELECTION message to its neighbors b and j; a is the source for this election. Nodes propagate the message to their immediate neighbors (e.g., g receives the broadcast from b first). Each node reports to its parent the node with the best capacity; the source eventually notices that h is the best leader and broadcasts this information to all other nodes.
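A toy sketch of this tree-based election; the Node class, the capacity values, and the topology below are invented for illustration and only loosely mirror the slide's example:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: str
    capacity: float                       # e.g. remaining battery life
    neighbors: list = field(default_factory=list)
    parent: object = None

def elect(node: Node) -> tuple:
    """Flood ELECTION from `node`; each node reports the best (capacity, id)
    found in its subtree back to its parent."""
    best = (node.capacity, node.id)
    for nb in node.neighbors:
        if nb.parent is None:             # first ELECTION message nb receives
            nb.parent = node              # nb adopts the sender as its parent
            best = max(best, elect(nb))   # nb reports back its best candidate
        # nodes that already have a parent merely acknowledge the message
    return best

# Tiny example topology: a -- b -- h, a -- j; node a is the source.
a, b, h, j = Node("a", 0.5), Node("b", 0.7), Node("h", 0.9), Node("j", 0.2)
a.neighbors, b.neighbors, h.neighbors, j.neighbors = [b, j], [a, h], [b], [a]
a.parent = a                              # mark the source so it is never re-adopted
print(elect(a))                           # (0.9, 'h') -> h becomes the leader
```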
Fault Tolerance (Chapter 8)

Outline
- Introduction to fault tolerance
- Process resilience
- Reliable client-server communication

Basic concepts
The goal of fault tolerance in a distributed system is automatic recovery from partial failures without seriously affecting the overall performance. Dependability includes reliability, availability, safety, maintainability, confidentiality, and integrity.
- Failure: the system does not meet its requirements.
- Error: the part of the system's state that may lead to a failure.
- Fault: the cause of an error.
High dependability can be achieved by preventing faults, removing faults, and tolerating faults; a fault-tolerant system provides its services even in the presence of faults.

Types of faults
- Transient: occurs once and then disappears; if the operation is repeated, the fault goes away.
- Intermittent: appears, vanishes, then reappears.
- Permanent: continues to exist until the faulty component is repaired.

Failure models
  Type of failure               | Description
  Crash failure                 | A server halts, but was working correctly until it halted
  Omission failure              | A server fails to respond to incoming requests
    Receive omission            | A server fails to receive incoming messages
    Send omission               | A server fails to send messages
  Timing failure                | A server's response lies outside the specified time interval
  Response failure              | The server's response is incorrect
    Value failure               | The value of the response is wrong
    State transition failure    | The server deviates from the correct flow of control
  Arbitrary (Byzantine) failure | A server may produce arbitrary responses at arbitrary times

Failure masking by redundancy
- Information redundancy: extra bits are added to allow recovery, e.g., a Hamming code.
- Time redundancy: an action is performed and, if needed, performed again, e.g., an aborted transaction can be redone; especially helpful for transient or intermittent faults.
- Physical redundancy: extra equipment or software is added, e.g., a 747 has four engines but can fly on three.
(Figure: triple modular redundancy, TMR - each component is triplicated and followed by majority voters, so a single faulty component in each stage is masked.)

Process resilience
The key approach to tolerating a faulty process is to organize several identical processes into a group. A process can send a message to a group of servers without having to know how many servers there are or where they are; all members of the group receive the message. Process groups may be dynamic: old groups are destroyed and new groups are created, a process can join or leave a group, and a process can be a member of several groups.

Flat groups versus hierarchical groups
- Flat group: advantage, no single point of failure; disadvantage, more complex decision making.
- Hierarchical group: advantage, the coordinator makes the decisions; disadvantage, the coordinator is a single point of failure.

Group membership
Mechanisms are needed for creating and deleting groups, and for allowing processes to join and leave groups.
- Centralized approach: a group server maintains a complete database of all groups and their exact membership. Straightforward, efficient, and easy to implement, but a single point of failure.
- Distributed approach: use reliable multicast. An outsider sends a message to all group members announcing its wish to join; to leave, a member sends a goodbye message to everyone. If a process crashes, the other members have to discover the crash and must distinguish crashed from merely slow processes.
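Returning briefly to the triple modular redundancy scheme above, the voter itself is just a majority function over the three replica outputs; a minimal sketch:

```python
from collections import Counter

def vote(a, b, c):
    """Return the majority value of three replicated results."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three replicas disagree")
    return value

print(vote(42, 42, 17))   # 42 -> a single faulty replica is masked
```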
Failure masking and replication
Replace a single process with a group of replicated processes.
- Primary-based protocols: a hierarchical group in which the primary coordinates all write operations. The primary is fixed; when the primary crashes, the backups execute an election algorithm to choose a new primary.
- Replicated-write protocols: active replication or quorum-based protocols, organized as a flat group.
How much replication is needed? A system is k-fault-tolerant if it can survive faults in k components and still meet its specification. If processes exhibit fail-stop failures, k + 1 components are enough to provide k-fault tolerance; if processes exhibit Byzantine failures, 2k + 1 components are needed. A further requirement is that all requests arrive at all servers in the same order, also called the atomic multicast problem.

Agreement in faulty systems
Agreement is needed in many cases: electing a coordinator, deciding whether or not to commit a transaction, dividing tasks among workers, synchronization. Reaching agreement is complicated by failures. The goal of distributed agreement algorithms is to have all nonfaulty processes reach consensus on some issue, and to establish that consensus in a finite number of steps.

The case of perfect processes but unreliable communication: the two-army problem. A red army with 5000 troops is in the valley; two blue armies of 3000 troops each are on the surrounding hills. If the two blue armies can coordinate their attacks on the red army, they will be victorious, so they need to reach agreement about attacking. They can communicate only over an unreliable channel, by sending messengers who may be captured by the red army. Even with nonfaulty processes (generals), agreement is not possible in the face of unreliable communication.

The case of perfect communication but faulty processes: the Byzantine generals problem. The red army is still in the valley, but n blue armies (generals) are on the nearby hills. Communication is pairwise, instantaneous, and perfect, but m out of the n generals are traitors (faulty) and actively try to prevent the loyal generals from reaching agreement by providing incorrect and contradictory information. Assume the generals exchange their troop strengths: if general i is loyal, element i of the exchanged vector is its troop strength, otherwise it is undefined. The question is whether the loyal generals can reach agreement.

Example with 3 loyal generals and 1 traitor (general 3):
1. The generals announce their troop strengths, in units of 1 kilosoldier; the traitor may tell each general something different (x, y, z).
2. Each general assembles a vector of the values it heard in step 1.
3. The generals exchange the vectors they assembled; the traitor again sends arbitrary values.
4. Each general examines the i-th value of the newly received vectors; if some value has a majority, that value is put in the result vector, otherwise the entry is marked UNKNOWN. Generals 1, 2, and 4 all come to agreement on (1, 2, UNKNOWN, 4).
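A tiny sketch of the final majority step, using vectors a loyal general might have received in step 3; the values are invented to match the example:

```python
from collections import Counter

def majority_vector(vectors):
    """Element-wise majority over the received vectors; UNKNOWN if no majority."""
    result = []
    for column in zip(*vectors):
        value, count = Counter(column).most_common(1)[0]
        result.append(value if count > len(vectors) / 2 else "UNKNOWN")
    return result

# Vectors received by general 1; general 3 is the traitor reporting x, y, z.
received = [(1, 2, "y", 4), (1, 2, "x", 4), (1, 2, "z", 4)]
print(majority_vector(received))   # [1, 2, 'UNKNOWN', 4]
```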
With only 3 generals and 1 traitor, the algorithm fails to produce an agreement: after the vectors are exchanged, no value has a majority, so the loyal generals cannot fill in the traitor's entry. In general, in a system with m faulty processes, agreement can be achieved only if 2m + 1 correctly functioning processes are present, for a total of 3m + 1 processes; that is, more than two-thirds of the processes must be working properly.

Reliable client-server communication
Both processes and communication channels can exhibit crash, omission, timing, and arbitrary failures. Reliable point-to-point communication is established by using a reliable transport protocol such as TCP. TCP masks omission failures (i.e., lost messages) by using acknowledgements and retransmissions; the only way to mask crash failures of a connection is to attempt to set up a new connection automatically.

For RPCs or RMIs in the presence of failures, five different classes of failure can occur:
1. The client is unable to locate the server.
2. The request message from the client to the server is lost.
3. The server crashes after receiving a request.
4. The reply message from the server to the client is lost.
5. The client crashes after sending a request.

The client cannot locate the server
The server might be down or may have evolved. A typical solution is to raise an exception, but not every language has exceptions or signals, and writing an exception or signal handler destroys transparency.

Lost request messages
The easiest failure to deal with: the OS or client stub starts a timer when sending the request; if the timer expires before a reply or acknowledgement comes back, the message is sent again. If the request was truly lost, the server cannot tell the difference between the retransmission and the original. If the request was not lost, the server should be able to detect that it is dealing with a retransmission, which is not an easy task.

Server crashes
The server may crash after receiving and executing the request but before replying, or after receiving it but before executing it. The client cannot tell the difference: in both cases it simply gets no reply. Three possibilities:
- At-least-once semantics: wait until the server reboots (or rebind to a new server) and try the operation again; this guarantees that the RPC has been carried out at least once, but possibly more.
- At-most-once semantics: give up immediately and report failure; this guarantees that the RPC has been carried out at most one time, but possibly not at all.
- Guarantee nothing: when the server crashes, the client gets no help and no promises about what happened.
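A minimal client-stub sketch of the retry behavior that yields at-least-once semantics, using plain UDP sockets for illustration (no real RPC framework is assumed):

```python
import socket

def call_with_retry(sock: socket.socket, server, request: bytes,
                    timeout: float = 1.0, max_tries: int = 5) -> bytes:
    """Send `request` to `server` and retransmit on timeout (at-least-once)."""
    sock.settimeout(timeout)
    for _ in range(max_tries):
        sock.sendto(request, server)            # (re)send the request
        try:
            reply, _ = sock.recvfrom(4096)      # wait for the reply
            return reply
        except socket.timeout:
            continue                            # lost request or reply: try again
    raise RuntimeError("server presumed dead")  # report failure to the caller

# usage (illustrative): call_with_retry(
#     socket.socket(socket.AF_INET, socket.SOCK_DGRAM), ("server", 9999), b"req")
```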
Ideally we would like exactly-once semantics, which in general is not possible to arrange. Assume the remote operation consists of printing some text. When a client issues a request, it receives an acknowledgement that the request has been delivered to the server, and the server sends a completion message to the client when the text has been printed. Now assume the server crashes and subsequently recovers. The server can follow two strategies:
- M -> P: send the completion message just before it tells the printer to do its work.
- P -> M: send the completion message after the text has been printed.

The client, in turn, can follow four strategies:
- Never reissue a request: the text may never be printed.
- Always reissue a request: the text may be printed twice.
- Reissue a request only if it has not received the acknowledgement that its print request was delivered to the server (betting that the server crashed before the print request could be delivered).
- Reissue a request only if it has received the acknowledgement for the print request.

With M = send the completion message, P = print the text, and C = crash, the possible event orderings at the server (parentheses mark events that never happened because of the crash) and the resulting outcomes (OK = printed once, DUP = printed twice, ZERO = never printed) are:

  Client reissue strategy | Server M -> P: MPC | MC(P) | C(MP) | Server P -> M: PMC | PC(M) | C(PM)
  Always                  | DUP                | OK    | OK    | DUP                | DUP   | OK
  Never                   | OK                 | ZERO  | ZERO  | OK                 | OK    | ZERO
  Only when ACKed         | DUP                | OK    | ZERO  | DUP                | OK    | ZERO
  Only when not ACKed     | OK                 | ZERO  | OK    | OK                 | DUP   | OK

There is no combination of client strategy and server strategy that works correctly under all possible event sequences.

Lost reply messages
Also handled with a timer set by the client operating system, but the client cannot be sure why there is no answer: the request or the reply may have been lost, or the server may simply be slow. Operations that can safely be repeated without damage are said to be idempotent; reading a file is idempotent, transferring money is not. Another method is to have the client assign each request a sequence number, so the server can tell the difference between an original request and a retransmission; this requires the server to maintain administration for each client.
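A small sketch of the sequence-number bookkeeping just mentioned; the class and field names are invented for illustration:

```python
class DedupServer:
    def __init__(self) -> None:
        self.last = {}          # client_id -> (seq, cached_reply)

    def handle(self, client_id, seq, request):
        if client_id in self.last and self.last[client_id][0] == seq:
            return self.last[client_id][1]     # retransmission: replay the reply
        reply = self.execute(request)          # original request: execute it
        self.last[client_id] = (seq, reply)
        return reply

    def execute(self, request):
        return f"done: {request}"              # stand-in for the real operation

s = DedupServer()
print(s.handle("A", 1, "transfer $100"))   # executed
print(s.handle("A", 1, "transfer $100"))   # duplicate: replayed, not re-executed
```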
Client crashes
The client sends a request and crashes before the server replies, leaving an orphan computation. Four solutions:
- Extermination: a log is kept on disk that survives crashes; after a reboot the log is checked and the orphan is explicitly killed off.
- Reincarnation: divide time into sequentially numbered epochs; when a client reboots it broadcasts a message that it starts a new epoch, and all remote computations on behalf of that client are killed.
- Gentle reincarnation: when an epoch broadcast comes in, each machine checks whether it has any remote computations and tries to locate their owner; a computation is killed only if the owner cannot be found.
- Expiration: each RPC is given a standard amount of time T to do its job; if it cannot finish, it must explicitly ask for another quantum. If the client waits a time T after a crash before rebooting, all orphans are sure to be gone.
None of these methods is satisfactory. Killing an orphan may have unforeseen consequences: an orphan may have obtained locks on files or database records that remain locked forever if the orphan is suddenly killed, and killing an orphan may not remove its grand-orphans.

Reliable group communication
Transport layers provide reliable point-to-point communication but rarely offer reliable communication to a collection of processes. Reliable multicasting means that a message sent to a process group should be delivered to each member of that group; in the presence of faulty processes, this means delivery to all nonfaulty group members. Consider first the simpler situation in which processes do not fail, processes do not join or leave the group while communication is going on, and there is no requirement that all group members receive messages in the same order.

Basic reliable-multicasting schemes
Additional assumptions: the underlying communication system offers only unreliable multicasting, all receivers are known, and messages are received in the order they are sent. The sender keeps a history buffer of the messages it has sent; receivers acknowledge received messages and report missed ones, and the sender retransmits as needed. (Figure: one receiver has missed message 24; when message 25 arrives, the other receivers acknowledge 25 while the missing receiver reports "missed 24".)

Scalability in reliable multicasting
This scheme has problems. If the number of receivers is large, the sender may be swamped with acknowledgement messages (feedback implosion). A partial solution is to have receivers not acknowledge receipt and return feedback only when a message is missing; but then the sender is, in principle, forced to keep every message in its history buffer forever (in practice it removes messages after some time). Scalable reliable multicasting solutions fall into two classes: nonhierarchical feedback control and hierarchical feedback control.

Nonhierarchical feedback control: feedback suppression
Only negative acknowledgements (NACKs) are returned as feedback. A receiver that missed a message multicasts its NACK to the rest of the group; the first retransmission request leads to the suppression of the others, and the retransmitted message is multicast to all processes. (Figure: several receivers miss message 24; each schedules a NACK, the first NACK is multicast, the others suppress theirs, and the sender receives only one NACK.)
Feedback suppression scales reasonably well and is used in collaborative Internet applications such as a shared whiteboard. Problems: ensuring that only one retransmission request is returned to the sender requires reasonably accurate scheduling of feedback messages at each receiver, which is not easy, and multicasting feedback interrupts the processes that did receive the message successfully.
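A toy sketch of random-delay NACK scheduling with suppression, in the spirit of the scheme above (no real multicast transport; all names are illustrative):

```python
import random

class Receiver:
    def __init__(self, name: str) -> None:
        self.name = name
        self.pending_nack = None

    def detect_missing(self, seq: int) -> float:
        """Schedule a NACK for message `seq` after a random delay."""
        self.pending_nack = seq
        return random.uniform(0.0, 0.5)

    def on_nack_seen(self, seq: int) -> None:
        """Suppress our own NACK if another receiver already asked."""
        if self.pending_nack == seq:
            self.pending_nack = None

    def fire_timer(self):
        """Multicast the NACK if it has not been suppressed."""
        if self.pending_nack is not None:
            seq, self.pending_nack = self.pending_nack, None
            return ("NACK", seq)
        return None

r1, r2 = Receiver("r1"), Receiver("r2")
d1, d2 = r1.detect_missing(24), r2.detect_missing(24)
first, second = (r1, r2) if d1 < d2 else (r2, r1)
print(first.fire_timer())    # ('NACK', 24) is multicast to the group
second.on_nack_seen(24)
print(second.fire_timer())   # None: the second NACK is suppressed
```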
Hierarchical feedback control
To achieve scalability for very large groups, the group is partitioned into subgroups organized as a tree. Within each subgroup, any reliable multicasting scheme that works for small groups can be used. Each subgroup appoints a local coordinator, responsible for handling the retransmission requests of the receivers in its subgroup; the local coordinator has its own history buffer, and if it has missed a message itself, it asks the coordinator of the parent subgroup to retransmit it. (Figure: subgroups on separate local-area networks connected by a long-haul connection, each with a coordinator and its receivers.)
The main problem with the hierarchical solution is the construction of the tree; it is not easy, for example, to enhance multicast routers in the network layer to act as local coordinators. Building reliable multicast schemes that can scale to a large number of receivers spread across a wide-area network is a difficult problem: no single best solution exists, and much research is still needed.

Atomic multicast
The atomic multicast problem: a message is delivered either to all processes or to none at all, and all messages are delivered in the same order to all processes. Consider a replicated database running on top of a distributed system.
- Case I: the distributed system offers only reliable multicasting. The replicated database is constructed as a group of processes, one for each replica; update operations are multicast to all replicas and subsequently performed locally (an active replication protocol). If a replica crashes and then recovers, it can recover to the same state it had before the crash, but it may have missed several updates.
- Case II: the distributed system supports atomic multicasting. An update operation is either performed at all nonfaulty replicas or by none at all. Operations can be performed by all correctly operating replicas only if they have reached agreement on the group membership, that is, if they agree that the crashed replica no longer belongs to the group. When a crashed replica recovers, it is forced to join the group again, and joining requires that its state be brought up to date with the rest of the group members. Atomic multicasting thus ensures that nonfaulty processes maintain a consistent view of the database and forces reconciliation when a replica recovers and rejoins the group.

Atomic multicast: virtual synchrony
A received message is locally buffered in the communication layer until it can be delivered to the application at the next higher layer; "delivery" therefore means handing the message from the communication layer to the application, not its arrival from the network.
A multicast message m is uniquely associated with a list of processes to which it should be delivered: the group view. Every process on the delivery list has the same view. Suppose m is multicast while its sender has group view G, and while the multicast is taking place another process joins or leaves the group; the view change is announced by multicasting a message vc. The guarantee that is needed: either m is delivered to all processes in G before each of them is delivered vc, or m is not delivered at all.
There is only one case in which delivery of m is allowed to fail: when the group membership change is the result of the sender of m crashing. A virtually synchronous reliable multicast guarantees that a message multicast to group view G is delivered to each nonfaulty process of G; if the sender crashes during the multicast, the message may either be delivered to all remaining processes or ignored by each of them. All multicasts thus take place in epochs separated by group membership changes, similar to using synchronization variables in distributed data stores.
(Figure: P1 joins the group; reliable multicasting proceeds among P1-P4; P3 crashes, so its partial multicast is discarded and the view changes from {P1,P2,P3,P4} to {P1,P2,P4}; later P3 rejoins and the view becomes {P1,P2,P3,P4} again.)
Message ordering
Four different orderings of multicast are distinguished: unordered multicasts, FIFO-ordered multicasts, causally-ordered multicasts, and totally-ordered delivery.

A reliable unordered multicast is a virtually synchronous multicast in which no guarantees are given concerning the order in which received messages are delivered at different processes. Example:
  Process P1 | Process P2  | Process P3
  sends m1   | receives m1 | receives m2
  sends m2   | receives m2 | receives m1

In a reliable FIFO-ordered multicast, the communication layer is forced to deliver incoming messages from the same process in the same order in which they were sent; there is no constraint on the relative delivery of messages from different senders. Example:
  Process P1 | Process P2  | Process P3  | Process P4
  sends m1   | receives m1 | receives m3 | sends m3
  sends m2   | receives m3 | receives m1 | sends m4
             | receives m2 | receives m2 |
             | receives m4 | receives m4 |

A reliable causally-ordered multicast delivers messages so that potential causality between different messages is preserved: if message m1 causally precedes message m2, regardless of whether they were multicast by the same sender, the communication layer at each receiver always delivers m2 after it has received and delivered m1. This can be implemented using vector timestamps.

Totally-ordered delivery means that, regardless of whether message delivery is unordered, FIFO ordered, or causally ordered, it is additionally required that messages are delivered in the same order to all group members. The FIFO example above violates this constraint. Virtually synchronous reliable multicasting offering totally-ordered delivery is called atomic multicasting. The different versions of virtually synchronous reliable multicasting are:

  Multicast               | Basic message ordering  | Total-ordered delivery
  Reliable multicast      | None                    | No
  FIFO multicast          | FIFO-ordered delivery   | No
  Causal multicast        | Causal-ordered delivery | No
  Atomic multicast        | None                    | Yes
  FIFO atomic multicast   | FIFO-ordered delivery   | Yes
  Causal atomic multicast | Causal-ordered delivery | Yes

Implementing virtual synchrony (Isis)
a) Process 4 notices that process 7 has crashed and sends out a view change.
b) Process 6 sends out all its unstable messages, followed by a flush message.
c) Process 6 installs the new view when it has received a flush message from everyone else.

Distributed commit
The atomic multicasting problem is an example of a more general problem, distributed commit, which involves having an operation performed by each member of a process group or by none at all. Distributed commit is often established by means of a coordinator. In a one-phase commit protocol, the coordinator tells all other processes, called participants, whether or not to locally perform the operation in question; the drawback is that if one of the participants cannot perform the operation, there is no way to tell the coordinator.
Two-phase commit (2PC)
Assuming that no failures occur, the two-phase commit protocol consists of two phases, each consisting of two steps.
Voting phase:
1. The coordinator sends a VOTE_REQUEST message to all participants.
2. Each participant returns either a VOTE_COMMIT or a VOTE_ABORT message.
Decision phase:
1. If all participants have voted to commit the transaction, the coordinator commits the transaction and sends a GLOBAL_COMMIT message to all participants; if any participant voted to abort, the coordinator also decides to abort and multicasts a GLOBAL_ABORT message.
2. Each participant that voted for commit waits for the final reaction of the coordinator: on GLOBAL_COMMIT it locally commits the transaction, on GLOBAL_ABORT it locally aborts it.
(Figure: the finite state machines for the coordinator, INIT -> WAIT -> COMMIT/ABORT, and for a participant, INIT -> READY -> COMMIT/ABORT.)

2PC in the presence of failures
Both the coordinator and the participants have states in which they block waiting for incoming messages.
- A participant blocked in state INIT is waiting for VOTE_REQUEST from the coordinator; if it is not received for some time, the participant decides to locally abort the transaction.
- The coordinator blocked in state WAIT is waiting for the votes of the participants; if not all votes have been collected after a certain time, the coordinator votes for an abort as well and sends GLOBAL_ABORT to all participants.
- A participant blocked in state READY is waiting for the global decision sent by the coordinator. The simplest solution is for the participant to block until the coordinator recovers; a better solution is for participant P to contact another participant Q and decide from Q's current state what it should do:

  State of Q | Action by P
  COMMIT     | Make transition to COMMIT
  ABORT      | Make transition to ABORT
  INIT       | Make transition to ABORT
  READY      | Contact another participant

If all participants are in state READY, no decision can be made; consequently, the 2PC protocol blocks until the coordinator recovers.

Actions by the coordinator:

  write START_2PC to local log;
  multicast VOTE_REQUEST to all participants;
  while not all votes have been collected {
      wait for any incoming vote;
      if timeout {
          write GLOBAL_ABORT to local log;
          multicast GLOBAL_ABORT to all participants;
          exit;
      }
      record vote;
  }
  if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
      write GLOBAL_COMMIT to local log;
      multicast GLOBAL_COMMIT to all participants;
  } else {
      write GLOBAL_ABORT to local log;
      multicast GLOBAL_ABORT to all participants;
  }
DECISIONREQUEST to other participants wait until DECISION is received remain blocked write DECISION to local log if DECISION GLOBALCOMMIT write GLOBALCOMMIT to local log else if DECISION GLOBALABORT write GLOBALABORT to local log else write VOTEABORT to local log send VOTE ABORT to coordinator Copyright KGoseva 2008 cs 757 Distributed Systems Slide 62 W o Twophase commlt University actions for handlin decision requests executed by separate thread while true wait until any incoming DECISIONREQUEST is received remain blocked read most recently recorded STATE from the local log if STATE GLOBALCOMMIT send GLOBALCOMMIT to requesting participant else if STATE lNlT or STATE GLOBALABORT send GLOBALABORT to requesting participant else skip participant remains blocked Copyright KGoseva 2008 cs 757 Distributed Systems Slide 63 V West Virginia Twophase commit y It may be possible that a participant will need to block until the coordinator recovers When all participants have received and processed the VOT EiREQ UES T from the coordinator while in the meantime the coordinator crashed participants cannot cooperatively decide on the nal action to take For this reason 2PC is also referred to as a blocking commit protocol Solutions to avoid blocking Use a multicast primitive by which a receiver immediately multicasts a received message to all other processes Threephase commit protocol Copyright KGoseva 2008 cs 757 Distributed Systems Slide 64 Threephase commit 3 PC 3PC protocol avoids blocking processes in the presence of failstop crashes Widely referred to in the literature but not applied often in practice because the conditions under which 2PC blocks rarely occur The states of the coordinator and each participant satisfy the following two conditions There is no single state from which it is possible to make a transition directly to either a COW T or an ABORT state There is no state in which it is not possible to make a nal decision and from which a transition to a C 0W1 T state can be made Copyright KGoseva 2008 cs 757 Distributed Systems Slide 65 West Virginia University Threephase commit Vote reg uest lNIT Voteabort NIT Commit Vote reguest Vote req uest Vote commit WAIT READY Vote abort Votecommit Global abort Pregare commit Globalabort Preparecommit ACK Readycommit ABORT Readycommit Global commit Globalcommit ACK COMMIT COMMIT The nite state machine The nite state machine for the coordinator for a participant Copyright KGoseva 2008 cs 757 Distributed Systems Slide 66 W Threephase commit University P is blocked in the READY state or in the PRECOMMI T state If each of the participant that P can contact is in state READY and they form a majority the transaction should be aborted No crashed processes will recover to a state other than INIT ABORT or PRECOMMI T in ZPC a crashed process may recover to COMMIT state that is remaining operational processes could not reach a decision and have to wait until the crashed process recovers If each of the participants that P can reach are in state PREC OW T and they form majority then it is safe to commit the transaction In this case no crashed processes will recover to a state other than READY PRECOMMIT 0r COMMIT Copyright KGoseva 2008 cs 757 Distributed Systems Slide 67 West Virginia University Recovery Copyright KGoseva 2008 cs 757 Distributed Systems Slide 68 W Recovery y The goal of error recovery is to replace an erroneous state with an errorfree state Two forms of error recovery Backward recovery Record the system s state from time to time checkpoint Restore to 
the last recorded state Forward recovery 0 Bring the system in a correct new state from which it can continue to execute Copyright KGoseva 2008 es 757 Distributed Systems Slide 69 W Examples University Consider the implementation of reliable communication 0 Backward recovery Recover from a lost packet by retransmitting the packet 0 Forward recovery Reconstruct a missing packet from other successfully delivered packets erasure correction In an nk block erasure code a set of k source packets is encoded into a set of n encoded packets such that any set of n encoded packets is enough to reconstruct the original k source packets If not enough packets have yet been delivered the sender will have to continue transmitting packets until a previously lost packet can be constructed Copyright KGoseva 2008 cs 757 Distributed Systems Slide 70 quotW Backward recovery University Backward error recovery is Widely used It is a generally applicable method independent of any specific system or process It can be integrated into the middleware layer as a general purpose service Problems Restoring to a previous state is relatively costly operation in terms of performance Some states can never be rolled back to e g taking 1000 from incorrectly functioning ATM machine 0 Taking checkpoints is often costly operation and may have severe performance penalty Copyright KGoseva 2008 cs 757 Distributed Systems Slide 71 Backward recovery Combine checkpointing With message logging After a checkpoint has been taken a process logs its messages before sending them off senderbased logging OR Receiving process first logs an incoming message before delivering it to the application receiverbased logging Combining checkpounting With message logging makes it possible to restore a state that lies beyond the most recent checkpoint Copyright KGoseva 2008 cs 757 Distributed Systems Slide 72 W Stable storage University To be able to recover to a previous state it is necessary that recovery information survives process crashes and site failures but possibly also various storage media failures Types of storage RAM memory Disk storage Stable storage designed to survive anything except major calamities such as oods and earthquakes Copyright KGoseva 2008 cs 757 Distributed Systems Slide 73 V Stable storage West Virginia University Sector has different value Bad checksum Stable storage implemented Crash after drive 1 Bad spot with a pair of ordinaiy disks is updated Copyright KGoseva 2008 es 757 Distributed Systems Slide 74 quotW o o Checkpomtmg Each process saves its state from time to time to a locally available stable storage To recover after a process or system failure requires to construct a consistent global state from these local states it is best to recover to the most recent distributed snapshot recovery line Recovery line correspond to the most recent distributed snapshot if a process Q has recorded the receipt of the message then there should also be a process R that has recorded the sending of that message Initial state Recovery line K jheckpomt Fquot i i 7 i i i Failure P2 1 I 1 N I I x 7 r Time 4 Message sent InconSIstent out from P2 to P1 Cc quotW Independent checkpomting University If local states jointly do not form a distributed snapshot further rolling back is necessary to find a recovery line The domino effect cascaded rollback Initial state Checkpoint P1 m itM n Time gt lmplementmg 1ndependentcheckpomt1ng requires that dependenc1es are recorded in such a way that processes can jointly roll back to a consistent global state 
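As an illustration only (the data layout and function below are invented for this sketch, not taken from the notes), the following Python fragment shows one way such recorded dependencies can be used to search for a recovery line: every checkpoint carries the messages received so far, tagged with the sender's send count, plus the number of messages the process itself had sent, and the candidate cut is rolled back one checkpoint at a time until no checkpoint records the receipt of a message that had not yet been sent in the cut. Each rollback step is exactly the cascading behavior of the domino effect.

```python
# Illustrative sketch only: hypothetical data layout for dependency-based rollback.
# For every process p and checkpoint index k:
#   recv_deps[p][k]   = list of (q, send_index) pairs, one per message that p had
#                       received from q before taking checkpoint k, where send_index
#                       counts how many messages q had sent up to that message.
#   sent_counts[p][k] = number of messages p itself had sent before checkpoint k.
# Index 0 is the initial state (nothing received or sent), so the cut never drops below 0.

def recovery_line(recv_deps, sent_counts):
    """Return the most recent consistent cut as {process: checkpoint index}."""
    # Start optimistically from everyone's latest checkpoint.
    cut = {p: len(counts) - 1 for p, counts in sent_counts.items()}
    changed = True
    while changed:
        changed = False
        for p in cut:
            k = cut[p]
            for (q, send_index) in recv_deps[p][k]:
                # Orphan message: p's checkpoint records the receipt of a message
                # that q, at its checkpoint in the current cut, had not yet sent.
                if send_index > sent_counts[q][cut[q]]:
                    cut[p] = k - 1          # roll p back one checkpoint (domino step)
                    changed = True
                    break
    return cut

# Tiny example: P1's latest checkpoint records receiving P2's first message,
# but P2's checkpoints record no sends, so P1 must roll back to its initial state.
recv_deps = {"P1": [[], [("P2", 1)]], "P2": [[], []]}
sent_counts = {"P1": [0, 0], "P2": [0, 0]}
print(recovery_line(recv_deps, sent_counts))   # -> {'P1': 0, 'P2': 1}
```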
This is fairly complex and does not justify the use of independent checkpointing Coordinated checkpointing which is much simpler than independent checkpointing is becoming mope popular Copyright KGoseva 2008 cs 757 Distributed Systems Slide 76 V West Virginia Coordinated checkpointing In coordinated checkpointing all processes synchronize to jointly write their state to local stable storage The saved state is automatically globally consistent so the domino effect is avoided Nonblocking checkpoint coordination use distributed snapshot algorithm Twophase blocking protocol A coordinator multicasts a CHECKPOINTiREQ UEST message to all processes A process that receives this message takes a local checkpoint queues any subsequent messages and acknowledges to the coordinator that it has taken a checkpoint When the coordinator has received acknowledgement from all processes it multicasts a CHECKPOINTiDONE message to allow the blocked processes to continue Copyright KGoseva 2008 cs 757 Distributed Systems Slide 77 quotW Message logging University Checkpointing is an expensive operation especially with respect to writing state to stable storage Message logging reduces the number of checkpoints but still enables recovery A checkpointed state is taken as a starting point and all messages that have been sent since are simply retransmitted and handled accordingly Works well under the assumption of piecewise deterministic model Each interval in the piecewise deterministic model starts with a nondeterministic event such as the receipt of a message From that moment on the execution of the process is completely deterministic An interval ends with the last event before a nondeterministic event occurs Consequently if we record all nondeterministic events it becomes possible to completely replay the entire execution of a process in a deterministic way Copyright KGoseva 2008 es 757 Distributed Systems Slide 78 West Virginia University Message logging An orphan process is a process that survives the crash of another process but whose state is inconsistent with the crashed process after recovery Q crashes and recovers F m1 m1 m2 is never replayed so neither will m3 Q i e m2 3 m2 7 j R IIK gt Unlogged message Time gt 1 Logged message Incorrect replay of messages after recovery leading to an orphan process Copyright KGoseva 2008 es 757 Distributed Systems Slide 79 V Messa e 10 in West Virginia g g University Each message has a header that contains all information necessary to retransmit m The header usually includes The sender and receiver a sequence number to recognize a duplicate a delivery number to decide when exactly it should be handed over to the receiving application A message is stable ifit can no longer be lost eg it has been written to stable storage 0 For each message m A set DEPm consists of those processes to which m has been delivered In addition if m is causally dependent on the delivery of m and m has been delivered to a process Q then Q will also be contained in DEPm A set of COPY m consists of those processes that have a copy of m but not yet in their stable storage When a process Q delivers message m it also becomes a member of COPY m If all of these processes crash replaying the transmission of m is not possible Copyright KGoseva 2008 cs 757 Distributed Systems Slide 80 Message logging Suppose that in a distributed system some processes have just crashed and Q is one of the surviving pI39OCGSSCS Process Q is an orphan process if there is a message m such that Q is contained in DEPm while at the 
same time every process in COPY m has crashed To avoid orphan processes we need to enforce that whenever a process becomes a member of DEPm it also becomes a member of COPY m Whenever a process becomes dependent on the delivery of m it will always keep a copy of m Copyright KGoseva 2008 es 757 Distributed Systems Slide 81 w o Message logging University 0 Pessimistic logging protocol As soon as message m is delivered to process P P becomes a member of COPYm P is not allowed to send any messages after the delivery of m without rst having ensured that m has been written to stable storage Optimistic logging protocol Actual work is done after a crash occurs If for some message m each process in COPY m has crashed any orphan process in DEPm is rolled back to a state in which it no longer belongs to DEPm Pessimistic logging is much simpler than optimistic logging Copyright KGoseva 2008 es 757 Distributed Systems Slide 82 West Virginia University Naming Chapter 5 Slide 1 CS 757 Distributed Systems Copyright KGoseva 2008 quotw V Outlme West Virginia University Naming entities names identifiers amp addresses Three different classes of naming systems Flat naming Structured naming Attributebased naming Slide 2 Copyright KGoseva 2008 es 757 Distributed Systems West Virginia University Naming entities Slide 3 CS 757 Distributed Systems Copyright KGoseva 2008 V1 Namlng ent1t1es address Name is a string of bits or characters that is used to refer to an entry Entries hosts printers disks files processes user mailboxes newsgroups Web pages messages etc To operate on an entry we need access point The name of the access point is called address Example a telephone is an access point for a person an telephone number is an address An entry may have more than one access point An entry may change its access points CS 757 Distributed Systems Slide 4 Copyright KGoseva 2008 V Nammg ent1t1es address West Virginia University Can we use the address of an access point as a name Entry may change an access point or an access point may be reassigned to a different entry Entry may offer more than one access point Location independent name that is independent from its address is much easier and more exible to USS Slide 5 CS 757 Distributed Systems Copyright KGoseva 2008 V Namlng ent1t1es 1dent1 ers West Virginia University 0 Identifier is a name that has the following properties An identi er refers to at most one entity Each entity is referred to by at most one identi er An identi er always refers to the same entity ie it is never reused Can we use an address as an identifier In many computer systems addresses and identifiers are represented only in machine readable form eg Ethernet address is a random string of 48 bits 0 Human friendly names character strings eg file name in UNIX string up to 255 characters Slide 6 Copyright KGoseva 2008 cs 757 Distributed Systems Nammg ent1t1es Central question How we resolve names and identifiers to addresses There is close relationship between name resolution and message routing In principle a naming system maintains a nameto address binding Simplest form a table of name address pairs Centralized table does not scale well Slide 7 CS 757 Distributed Systems Copyright KGoseva 2008 West Virginia University Flat Naming Slide 8 CS 757 Distributed Systems Copyright KGoseva 2008 quotw V Flat namlng West Virginia University Simples solutions applicable only to LAN Broadcasting amp multicasting Forwarding pointers Homebased approaches Distributed hash tables Hierarchical approaches CS 757 
Distributed Systems Copyright KGoseva 2008 Slide 9 V Broadcastmg and mult1cast1ng West Virginia University Broadcasting a massage containing the identi er of the entry is broadcast to each machine and each machine is requested to check whether it has that entry Used in the Internet Address Resolution Protocol ARP Inef cient for large networks Network bandwidth is wasted on request messages Many hosts may be interrupted by requests they cannot answer Solution multicasting only the restricted group of hosts receives the request Ethernet networks support datalink level multicasting directly in hardware Slide 10 Copyright KGoseva 2008 cs 757 Distributed Systems Broadcasting and multicasting Multicasting can also be used in pointtopoint networks Internet supports network level multicasting by allowing hosts to join a specific multicast group Multicasting can be used as a general location service for multiple entities to locate the nearest replica Copyright KGoseva 2008 es 757 Distributed Systems Slide 11 quotwquot V Forwardmg pomters West Virginia University When an entity moves from A to B it leaves behind a reference to its new location B Simple locate the entity using a traditional naming service then follow the chain of forwarding pointers Drawbacks The chain may become large Intermediate locations have to maintain their part of the chain Vulnerable to broken links Copyright KGoseva 2008 es 757 Distributed Systems Slide 12 w V Forwardmg Pomters West Virginia University The principle of forwardng pointers using client stub server stub palrs Process P2 Stub cs refers to Client Stub 33 same server stub as stub cs Process P3 Identical client stub process P1 Server stub or T tentstu cs Process P4 Object Local invocation lnterprocess communication dentical server stub Copyright KGoseva 2008 cs 757 Distributed Systems Slide 13 w Forwardmg Po1nters University Redirecting a forwarding pointer by storing a shortcut in a proxy Server stub is no Invocation longer referenced request IS by any client stub sent to object m Client stub sets a shortcut Server stub at object s current process returns the current location a b V K When a server stub is no longer referred to by any client it can be removed so called distributed garbage collection is not a trivial task Slide 14 CS 757 Distributed Systems Copyright KGoseva 2008 HomeBased Approaches Supporting mobile entities in largescale networks can be done using the home location which keeps track of the current location of the entity Fallback mechanism for location services based on forwarding pointers Mobile IP Mobile telephony Drawbacks Increases the communication latency The use of xed home location Solution register home location at the naming service Slide 15 CS 757 Distributed Systems Copyright KGoseva 2008 V HomeBased Approaches The principle of Mobile IP may Host39s homw VI A location Rm 1 I Send p cRet tohost at its home m Client39s ZZN location 5 2 Return address of current location NEW 0 E k p 3 Tunnel packet to L current location J A K 4 jw J l 4 Send successive packets C Copyright KGoseva 2008 to current location J g Host39s present location Slide 16 CS 757 Distributed Systems V Distributed Hash Tables West Virginia University We mentioned Distributed Hash Tables DHT as a way how to resolve an identifier to the address of the associated entry when we discussed Chord now we discuss the details Reminders Chord uses mbit random identi er as a key k to specify an entity m is usually 128 or 160 bits depending on the hash function used An entity with a 
key k falls under the jurisdiction of the node with the smallest identifier id 2 k that is succk Obvious but not the best approach Each node p keeps track of succp1 and predp When p receives request to resolve key k it returns its own address if predp lt k s p or it returns the appropriate address of one of its neighbors Copyright KGoseva 2008 es 757 Distributed Systems Slide 17 V Distributed Hash Tables West Virginia Unive sity Instead of this linear approach for key lookup each Chord node maintains a finger table of at most In entries Let F T p denotes the nger table of node 9 then 1 1 FTpl succp2 These references are actually shortcuts to existing nodes where the shortcut distance from node 9 increases exponentially as the index i in the finger table increases 0 To lookup a key k node 9 will forward the request to node q qFTpjSkSFTpjl Slide 18 CS 757 Distributed Systems Copyright KGoseva 2008 West Virginia University Distributed Hash Tables Fin er table 9 3 Example Resolving k26 from node 1 ltl 1 returns 18FT15326 39 18 returns 20FT1823263 FT183 39 20 returns 21FT2013263 FT202 4 4 21 returns 28 which is responsible Actual node Resolve k 12 for from node 28 The address of the node 28 is returned to node 1 1e the key k26 ls reSOIVGd Resolve k 26 from node 1 It can be shown that the lookup will generally require OlogN steps where N is the number of nodes in the system Note This is a recursive lookup Copyright KGoseva 2008 es 757 Distributed Systems Slide 19 w Distributed Hash Tables v West Virginia University 0 Reminder Joining DHT system such as Chord is simple Node p which wants to join contacts arbitrary node and request a lockup for succp1 and inserts itself into the ring 0 The finger tables have to be kept uptodate when nodes join leave or fail Note that for each node q F Tq1 succq1 Each node q regularly contacts succq1 and checks predsuccq 1 If qpredsuccq 1 then q s info is consistent with its successor s info Otherwise a new node p has entered the system with qltp succq1 so q will adjust F Tq I p and check whether p has recorded q as its predecessor Slide 20 Copyright KGoseva 2008 es 757 Distributed Systems V Distributed Hash Tables West Virginia University 0 In a similar way q updates each entry 1 in its nger table issuing a request to resolve succq2quot 1 In Chord this type of requests that help keeping the finger table uptodate are issued regularly be means of background processes 0 Resolution of failures q checks its predecessor and if it has failed sets predq unknown If q nds predsuccq 1 unknown it will notify succq1 that it suspects to be the predecessor of succq1 With these simple procedures Chord is kept generally consistent perhaps with exception of a few nodes Slide 21 CS 757 Distributed Systems Copyright KGoseva 2008 V Distributed Hash Tables WSE WZLEE Exploring network prox1m1ty 0 Problem with the DHT approach described so far Since the proximity of the nodes is not taken into account requests may be routed erratically across the Internet Solutions Topologybased assignment of node identi ers Assign identi ers in a such a way that two nearby nodes will have identi ers that are close to each other Proximity routing instead of having one successor each node in Chord keep track of r successors Proximity neighbor selection optimize the routing tables such that the nearest node is selected as neighbor not exactly possible in Chord Copyright KGoseva 2008 cs 757 Distributed Systems Slide 22 V West Virginia University 0 A network is divided into domains there is a single top 
Hierarchical Approaches level domain Each domain D has an associated directory node dirD that keeps track of the entities in that domain A lowest level domain leaf domain corresponds to a LAN in computer networks or a cell in a mobile telephone network Each entry located in a domain D is represented by a location record in the directory node dirD for a leaf is a current address for others is a pointer to the directory node of the next lower level Slide 23 Copyright KGoseva 2008 cs 757 Distributed Systems lw Hierarchical Approaches West Virginia University Hierarchical organization of a location service into domains each having an associated directory node The root directory Topeve node dirT domain T Directory node dirS of domain 5 C A subdomain S of toplevel domain T S is contained in T x x A leaf domain contained in 8 Slide 24 Copyright KGoseva 2008 es 757 Distributed Systems W o o H1erareh1eal Approaches U n iversity Storing information of an entity having two addresses in different leaf domains Example replicated entry will have two addresses Field with no data Field for domain I a 39 Location record dom N With pointerto N for E at node M Domain D I Domain D2 Copyright KGoseva 2008 es 757 Distributed Systems Slide 25 w V Hierarchical Approaches Look up West Virginia University Looking up a location in a hierarchically organized location service Node knows about E so request is forwarded to child A Node has no record for E so that request is forwarded to parent M Look39up Domain D request Slide 26 Copyright KGoseva 2008 es 757 Distributed Systems V Hierarchical Approaches West Virginia University 0 Lookup operation in hierarchical location services exploits locality Entity is searched in a gradually increasing ring centered around the requesting client At worst it reached the root node that have a location entry for each entity Update operations exploit locality in a similar fashion Slide 27 Copyright KGoseva 2008 es 757 Distributed Systems V Hierarchical Approaches Insert West Virginia University Node knows Node has no about E so request record for E IS no longer forw ded Node creates record so request is forwarded to parent 11 9 and stores pointer ti 1 Node creates at record and stores address m t we 39 request a b A chain of forwarding pointers in a topdown fashion to the leaf node is created Domain D An insert request is forwarded to the first node that knows about entity E Slide 28 Copyright KGoseva 2008 cs 757 Distributed Systems Hierarchical Approaches Delete Delete is similar to insert explores locality When address for entry E in a leaf domain D needs to be removed directory node dirD is requested to remove that address from its location record for E If that location record becomes empty no other addresses for E in D the location record can be removed In that case the parent node of dirD checks for location record for E o If not empty do nothing 0 If empty remove the location record for E and repeat with the higher level directory node Copyright KGoseva 2008 es 757 Distributed Systems Slide 29 Hierarchical Approaches quotV WSW Scalability Issues 0 Problem with hierarchical location services is that the root node is required to store a location record for each entry and to process requests for each entry 1 KB location record gtlt billion entities l terabyte The root may be required to do many lookups and updates that it will become a bottleneck Solution partition the root node and other highlevel directory nodes into subnodes Subnodes close to each other in cluster this may solve 
the problem of processing capacity but network connections might still be a problem Spread the subnodes uniformly across the network Slide 30 Copyright KGoseva 2008 es 757 Distributed Systems W Hierarchical Approaches V WSELVELEE Scalabilitx Issues Deciding which subnodes should handle which entities is still an open question 4 K R M Subnode of the root responsible Domain wh 7en for handling req s rlli y E d E currently resides U jz x Di Ll Alternative and better choice for a subnode to handle E m cs 3 Current route 1 jf lookup request JUH Alternative route of lookup request Qient requesting the current address of E 4 Slide 31 West Virginia University Structured Naming Slide 32 CS 757 Distributed Systems Copyright KGoseva 2008 V Name S aces West Virginia p University Data stored in M quotkeysquot quothomelsteenkeysquot keys Leaf node 0 b twmrc m ox H O quothomesteenmboxquot Directory node B U Ab solute path name the rst node in the path name is the root of the naming graph Example n0lth0me steen mb0xgt Slide 33 Otherwise relative path name CS 757 Distributed Systems Copyright KGoseva 2008 lw V Name Spaces West Virginia University Global name denotes the same entry no matter where that name is used in a system Local name interpretation depends on where the name is being used eg environmental variable such as HOME in UNIX 0 There are many different ways to organize a name space Tree strictly hierarchical Directed acyclic graph as in the previous slide Copyright KGoseva 2008 cs 757 Distributed Systems Slide 34 quotw Name Spaces The general organization of the UNIX le system implementation on a logical disk of contiguous disk blocks File data blocks iiiiiiiii 2 Disk block Superblock l it Will J Boot block Index nodes CS 757 Distributed Systems Copyright KGoseva 2008 Slide 35 quotwquot V Name resolut1on West Virginia University Name resolution the process of looking up a name Closure mechanism deals with selecting the initial node in a name space from which to start the name resolution In UNIX le system the inode of the root directory is the rst inode in the logical disk representing the file system Resolving a name requires that some mechanism has already been implemented by which the resolution process can start Example 130429304052523 Copyright KGoseva 2008 es 757 Distributed Systems Slide 36 Linking and Mounting West Virg Unive sity Alias another name for the same entry Example environmental variable such as HOME Two different ways to implement an alias Hard links allow multiple absolute path names to refer to the same node 6g keys and homesteen keys Symbolic links represent an entry by a leaf node that stores an absolute path name Slide 37 CS 757 Distributed Systems Copyright KGoseva 2008 V Linking and Mounting The concept of a symbolic link explained in a naming graph Data stored in M Q9 quotkeysquot Data stored in n6 keys quotkeysquot n6 quotfhomesteenkeysquot CS 757 Distributed Systems 3 Leaf node 0 Q1 Q9 twmrc mbox Directory node B O 0 Copyright KGoseva 2008 Slide 38 V Linking and Mounting West Virginia University Name resolution can be used to merge different name spaces in a transient way Mounted file system corresponds to letting a directory node store the identi er of a directory node from different foreign name space Mount point directory node storing the node identi er Mounting point directory node in the foreign name space normally the root of a name space Slide 39 Copyright KGoseva 2008 es 757 Distributed Systems w Linking and Mounting v West Virginia University To mount a 
foreign name space in a distributed system requires at least Name of an access protocol Name of the server Name of the mounting point in the foreign name space Each of this needs to be resolved None of these may be needed in nondistributed systems eg UNIX Slide 40 Copyright KGoseva 2008 es 757 Distributed Systems West Virginia University Linking and Mounting Mounting remote name spaces using Sun s Network File System NF S Name server for foreign name space Machine B keys remote home u Lquotnfsfii scsvu ee 00W 00th i u 1 Name server Machine A i 21 s Reference to foreign name space Network Copyright KGoseva 2008 cs 757 Distributed Systems Slide 41 quotwquot V Implementatlon of the name space West Virginia University Name space is the heart of a naming service that allows users and processes to add remove and look up names LANs on a single name server WANs distributed over multiple name servers Usually organized hierarchically into logical layers Global layer roots nodes and its children characterized by stability Administrational layer directory nodes that are managed within a single organization Managerial layer nodes that may change regularly e g shared les such as libraries or binaries user de ned directories and les Copyright KGoseva 2008 es 757 Distributed Systems Slide 42 V Name Space D1str1but1on West Virginia University Partitioning of DNS name space including Internetaccessible les into three layers each zone is implemented by a separate name server Global Iayer lt fom Agov 21H osr g gym f is Wa39e 30m f ieee 30R Admini gm CS eng jack tilt at quotquotquotquotquot quot strational layer D D D 05 at linda Q p024 robot D Mana gerial quot layer zone indextxt Slide 43 Copyright KGoseva 2008 cs 757 Distributed Systems West Virginia University Name Space Distribution Item Global Administrational Manaerial Geographical scale of network Worldwide Organization Department Total number of nodes Few Many Vast numbers Responsiveness to lookups Seconds Milliseconds Immediate Update propagation Lazy Immediate Immediate Number of replicas Many None or few None ls clientside caching applied Yes Yes Sometimes Copyright KGoseva 2008 CS 757 Distributed Systems Slide 44 Iterative Name Resolution 1 ltnvucsftpgt Root lt 5 2 ltngt ltvucsftpgt Name server lt Client39s 4 ltvugt ltcsftpgt I Ode name 6 ltcsgt ltftpgt V Ode MW gt Name server 84 ltftpgt CS node ftp i ltnvucsftpgt T ltnlvucsftpgt Nodes are A managed by O U 5 the same server Copyright KGoseva 2008 cs 757 Distributed Systems Slide 45 Recursive Name Resolution 1ltnvucsftpgt 8 ltnvucsftpgt name server nlnode Client s name vu node cs node ltnvucsftpgt T Lltnvucsftpgt Slide46 v West Virginia University 2 ltvucsftpgt 3 ltcsftpgt 4 ltftpgt CS 757 Distributed Systems Copyright KGoseva 2008 w Recurs1ve Name Resolut1on University Recursive name resolution puts a higher performance demand on Drawback each name server Therefore name servers in the global layer of a name space support do not support recursive name resolution Advantages next two slides Caching is more effective than in case of iterative name resolution It is often more effective with respect to the communication Slide 47 CS 757 Distributed Systems Copyright KGoseva 2008 West Virginia University Recursive Name Resolution Recursive name resolution of lt11 vu cs ftpgt Name servers cache intermediate results for subsequent lookups Server for Should Passes to Receives Returns to Looks up node resolve child and caches requester cs ltftpgt ltftpgt ltftpgt vu ltcs pgt ltcsgt ltftpgt ltftpgt ltcsgt ltcs 
ftpgt nl ltvucs pgt ltvugt ltcs pgt ltcsgt ltvugt ltcsftpgt ltvucsgt ltvucsftpgt root ltnvucsftpgt ltngt ltvucs pgt ltvugt ltngt ltvucsgt ltnlvugt ltvucs pgt ltnvucsgt ltnvucs pgt Copyright KGoseva 2008 CS 757 Distributed Systems Slide 48 Implementation of Name Resolution The comparison between recursive and iterative name resolution with respect to communication costs Recursive name resolution Name server nlnode R2 CI t quot J l2 Name server 399 ma a vu node 239 7 7 quotquot s k42 Name server R3 Iterative name resolution cs node V Longdistance communication Copyright KGoseva 2008 cs 757 Distributed Systems Slide 49 V Example Domain Name System West Virginia University 0 Internet Domain Name System DNS is one of the largest distributed naming services Works well more than 30 years after its introduction Primarily used for looking up host addresses and mail servers Organized as a tree Label is a caseinsensitive string made up of alphanumeric characters maximum length 63 characters Maximum length of a complete path 255 characters root ltedu wvu cseegt csee wvu edu Label attached to a node s incoming edge is used a name of the node Domain subtree Domain name a path name to the root of a domain Copyright KGoseva 2008 es 757 Distributed Systems Slide 50 quotW DNS implementation DNS name space consists of a global layer and an administrative layer Managerial layer is formed by local le system and it is not part of DNS 0 Each zone is implemented by a name server replicated for availability Updates for a zone are handled only by the primary name server Secondary name servers request the primary server to transfer database content zone transfer DNS database a collection of les File that contains all the resource records for all the nodes in a particular Slide 51 zone CS 757 Distributed Systems Copyright KGoseva 2008 West Virginia University DNS resource records The most important types of resource records forming the contents of nodes in the DNS name space 122 Aszzfiitjlted Description SOA Zone Holds information on the represented zone A Host Contains an IP address of the host this node represents MX Domain Refers to a mail server to handle mail addressed to this node SRV Domain Refers to a server handling a speci c service NS Zone Refers to a name server that implements the represented zone CNAME Node Symbolic link with the primary name ofthe represented node PTR Host jgsnvlisninivearjjrza pging of IP addresses to host names by means of HINFO Host Holds information on the host this node represents TXT Any kind Contains any entityspeci c information considered useful Copyright KGoseva 2008 CS 757 Distributed Systems Slide 52 V Excerpt from DNS database for zone erSItiVZirrsgiitgia CS Vuo Name Record type Record value 1 csvunl SOA star 199912150272003600241920086400 1csvun NS 5 rcsvun csvu1nl NS topcs vu1n csvunl NS solocsrun Domal amp Z0116 csvunl TX39I39 quotVrije Universiteit Mam amp Comp Scquot 1 i csvun MX 1 zephyr csyuni csvunl MX 2tomadocsvunl cavu nl MX 3 far r m nl starcsvanl HINFO Sun Unix starcsvunl MX 1 starcsvunl Name server starcsvu nl MX 10 zephyrcsunl starcsvu nl A 13037246 stancsmml A 19231 123142 zephyrecsvvuenl HINFO Sun Unix server 39 zephercstun MX 1 zephchsvunl zephyrcsvu nl MX 2 tomadocsvunl zephyr csvunl A 19231231156 wwwcswn CNAME solingcsvunl ftpcsvun CNAME soling svvuen Web server amp server 1 soiingcsvun HINFO Sun Unix I soiingcsvanI MX 1 solingcsvunl 39 solingcsvun MX 10 zephyrcsvun solingcsvunl A 130372411 39 Iasercsvun HINFO PC MSDOS Laser punter asercsvun A 1303713032 
vu03ltdascsvu nl PTR o2637130inaddrarpa IHVCTSG mapplng vucs dasrcsvvurnl A 13037250 Copyright KGoseva 2008 Attributebased naming Directory services CS 757 Distributed Systems Slide 54 quotwquot V Attrlbutebased nammg West Virginia University Flat amp structured names generally provide a unique and locationindependent way of referring to entities Structured names also provide a humanfriendly way to name entities o In an attributebased naming an entity has a set of associated attributes each described in terms of attribute value pairs Example In email system messages can be tagged with attributes for sender recipient subject and so on Attributebased naming systems are also known as directory services Copyright KGoseva 2008 cs 757 Distributed Systems Slide 55 V Example X500 directory service West Virginia University Directory service client looks for an entity based on the description of properties instead of a full nmne Directory entry in 081 X500 is comparable to a resource record in DNS Each entry is a collection of attribute value pairs 0 Single value amp multiple value attributes Copyright KGoseva 2008 es 757 Distributed Systems Slide 56 West Virginia University X500 directory service A simple example of a X500 directory entry using X500 naming conventions Attribute Abbr Value Country C NL Locality L Amsterdam Organization L Vrije Universiteit OrganizationalUnit OU Math amp Comp Sc CommonName CN Main server MaiISeners 13037246 192312311923123166 FTPServer 130372111 VWVWServer 130372111 Copyright KGoseva 2008 CS 757 Distributed Systems Slide 57 X500 directory service West Virg Unive 39 Directory Information Base DIB collection of all directory entries Each record is uniquely named as a sequence of naming attributes Relative Distinguished Name RDN naming attributes CNLOVrije UniversiteitOUMath amp Comp Sc is equivalent to the DNS name nvucs Directory Information Tree DIT hierarchy of collection of directory entries CS 757 Distributed Systems Copyright KGoseva 2008 Part of the directory information tree HostName star HostName zephyr Copyright KGoseva 2008 cs 757 Distributed Systems Slide 59 West Virginia University X5 00 directory service Two directory entries having HostiName as RDN Attribute Value Attribute Value Country NL Country NL Locality Amsterdam Locality Amsterdam Organization Vrije Universiteit Organization Vrije Universiteit OrganizationaIUnit Math amp Comp Sc OrganizationaIUnit Math amp Comp Sc CommonName Main server CommonName Main server HostName star HostName zephyr HostAdd ress 1923123142 HostAdd ress 1923123166 Copyright KGoseva 2008 CS 757 Distributed Systems Slide 60 quotWquot X500 d1rectory serv1ce A node in DIT simultaneously represents a directory and an X500 record read read a single record given its path name in DIT Li st list the names of all outgoing edges of a given node in DIT does not return records it returns names Example input name CNLOVrije UniversiteitOUMathampCompScCNMain server Calling read will return the record on Slide 57 Calling l i s t will return names star and zephyr from the entries shown in the previous slide Slide 61 CS 757 Distributed Systems Copyright KGoseva 2008 V X5 00 Implementation West Virgini University Implementation similar to DNS DIT is partitioned and distributed across several servers Directory Service Agents DSA corresponding to zones in DNS Clients are represented by Directory User Agents ie name resolvers X500 supports searching through a DIB A list of all main servers at the Vrije Universiteit in NL answersearch ampCNLOVrij e 
UniversiteitOUCNMain Slide 62 server Searching in a directory service is an expensive operation CS 757 Distributed Systems Copyright KGoseva 2008 W Other examples of directory serv1ces Lightweight Directory Access Protocol LDAP 1993 Simpli ed version of X500 Works over TCPIP not over OSI as the original X500 Universal Description Discovery and Integration UDDI 2000 Interoperable foundational infrastructure for a Web services based software environments Supports description and discovery of Businesses organizations and other Web services providers Web services they make available WSDL interfaces which may be used to access those services Based on a common set of industry standards including HTTP XML VIL Schema and SOAP Copyright KGoseva 2008 es 757 Distributed Systems Slide 63
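As a concrete, purely illustrative example of the attribute-based lookups described above, the sketch below issues an LDAP search from Python using the third-party ldap3 package, roughly mirroring the X.500 search call shown earlier. The server name, base DN, filter, and requested attribute names are invented for illustration and do not come from the notes.

```python
# Hypothetical example: an attribute-based directory search over LDAP.
# Requires the third-party "ldap3" package; host, base DN, filter, and
# attribute names below are invented for illustration.
from ldap3 import Server, Connection, ALL

server = Server("ldap.example.org", get_info=ALL)
conn = Connection(server, auto_bind=True)      # anonymous bind, for simplicity

# Search the subtree for entries whose attributes match the filter, and ask the
# server to return only selected attributes rather than complete records.
conn.search(
    search_base="o=Vrije Universiteit,c=NL",
    search_filter="(&(ou=Math & Comp. Sc.)(cn=Main server))",
    attributes=["cn", "ipHostNumber"],
)

for entry in conn.entries:
    print(entry)    # each matching entry is printed in an LDIF-like form
```

Because the server must match attribute values across (part of) the DIB rather than follow a single path name, such a search is considerably more expensive than a plain name lookup, which is why searching in a directory service is an expensive operation.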