This 92-page set of class notes was uploaded by Sterling Schmeler on Monday, October 26, 2015. The notes belong to CS2510 (Computer Operating Systems) at the University of Pittsburgh, taught by Staff in Fall.
Replicas: Why and Why Not
- Reliability
  - Tolerance to component failures
  - Tolerance to corrupted data
- Performance
  - Benefits scalability
  - Allows concurrent access to resources (data, objects, processors)
  - Reduces the delay for geographically dispersed resources
- Disadvantages
  - The cost of replication is high
  - Maintaining consistency of multiple copies is tough
  - Implementation is harder, e.g. different users need different numbers of replicas and more or less strict consistency models

Object Replication: the Problem
If objects or data are shared, we need to do something about concurrent accesses to guarantee state consistency.
[Figure: client machines invoking a remote object on a server machine through proxies, a skeleton, the server OS, and the network.]

Object Replication: Solutions
(a) A remote object capable of handling concurrent invocations on its own.
(b) A remote object for which an object adapter is required to handle concurrent invocations: the adapter uses a mutual-exclusion mechanism to serialize incoming requests.

Where in the World Is Object Replication Handled?
(a) A distributed system for replication-aware distributed objects: an object-specific replication protocol runs above the middleware. More customizable.
(b) A distributed system responsible for replica management: the middleware itself runs the replication protocol. More general and easier to implement.

Performance, Replication, Scalability
- Main issue: to keep replicas consistent, we generally need to ensure that all conflicting operations are done in the same order everywhere.
- Conflicting operations (example from transactions):
  - Read-write conflict: a read operation and a write operation act concurrently.
  - Write-write conflict: two concurrent write operations.
- Trade-off: guaranteeing a global ordering on conflicting operations may be a costly operation, degrading scalability.
- Solution: weaken the consistency requirements so that, hopefully, global synchronization can be avoided. This solution only lends itself to some applications, not all.

Data-Centric Consistency Models (1)
[Figure: the general organization of a logical data store, physically distributed and replicated across multiple machines; each process operates on a local copy of the distributed data store.]
- Consistency model: a contract between a distributed data store and its processes, in which the data store specifies precisely what the results of read and write operations are in the presence of concurrency. Processes either agree to the contract or don't use the store.

Data-Centric Consistency Models (2)
- Strong consistency models: operations on shared data are synchronized.
  - Strict consistency (related to "absolute" global time)
  - Sequential consistency (what we are used to)
  - Causal consistency (maintains only causal relations)
  - FIFO consistency (maintains only individual ordering)
- Weak consistency models: synchronization occurs only when shared data is locked and unlocked.
  - General weak consistency
  - Release consistency
  - Entry consistency
- Observation: the weaker the consistency model, the easier it is to build a scalable solution.

Strict Consistency
Any read on a shared data item x returns the value stored by the most recent write operation on x.
(Notation: W(x)a = write value a to x; R(x)a = a read of x returning a; NIL = no value yet.)

  P1: W(x)a                P1: W(x)a
  P2:        R(x)a         P2:        R(x)NIL  R(x)a
  A strictly consistent store    A store that is not strictly consistent

- It may be expensive to maintain strict consistency. Does everyone need it? Who does? How can it be made better, easier, or less costly?
- Note: strict consistency is what you get in the normal uniprocessor sequential case, where your program does not interfere with any other program.

Sequential Consistency
The result of any execution is the same as if the operations of all processes were executed in some sequential order, and the operations of each individual process appear in this sequence in the order specified by its program.

  P1: W(x)a                      P1: W(x)a
  P2:    W(x)b                   P2:    W(x)b
  P3:       R(x)b   R(x)a        P3:       R(x)b   R(x)a
  P4:       R(x)b   R(x)a        P4:       R(x)a   R(x)b
  A sequentially consistent store    A store that is not sequentially consistent

- This is for interleaved executions: there is ONE total ordering for all operations.

Linearizability
Sequential consistency where operations are additionally ordered according to a global time. This may be more practical, since it assumes only loosely synchronized clocks (Lamport clocks, NTP). Since the definition of global time is loose, so is the consistency model. Linearizability is therefore weaker than strict consistency but stronger than sequential consistency: a happy medium.

Causal Consistency (1)
- Events a and b are causally related if a causes or influences b. Events that are not causally related are concurrent.
- Causal consistency: writes that are potentially causally related must be seen by all processes in the same order. Concurrent writes may be seen in a different order on different machines.

  P1: W(x)a                       W(x)c
  P2:     R(x)a   W(x)b
  P3:     R(x)a           R(x)c           R(x)b
  P4:     R(x)a                   R(x)b   R(x)c

This sequence is allowed with a causally consistent store, but not with a sequentially or strictly consistent one. Which writes are concurrent? (W(x)b and W(x)c: neither influenced the other.)

Causal Consistency (2)
(a) A violation of a causally consistent store. WHY? P2 read a before writing b, so W(x)a and W(x)b are causally related, yet P3 sees b before a:

  P1: W(x)a
  P2:     R(x)a   W(x)b
  P3:             R(x)b   R(x)a
  P4:             R(x)a   R(x)b

(b) A correct sequence of events in a causally consistent store. WHY? Without the read at P2, W(x)a and W(x)b are concurrent, so different processes may see them in different orders:

  P1: W(x)a
  P2: W(x)b
  P3:     R(x)b   R(x)a
  P4:     R(x)a   R(x)b

FIFO Consistency
Writes done by a single process are seen by all other processes in the order in which they were issued, but writes from different processes may be seen in a different order by different processes.

  P1: W(x)a
  P2:     R(x)a   W(x)b   W(x)c
  P3:         R(x)b   R(x)a   R(x)c
  P4:         R(x)a   R(x)b   R(x)c

A valid sequence of events for FIFO consistency. Is it valid for causal consistency? What about sequential consistency?

FIFO Consistency: Implementation
Implementation is simple:
- Attach a (PID, sequence number) pair to each write.
- Perform writes according to this per-process ordering.

  Process P1:  x = 1;  if (y == 0) kill(P2);
  Process P2:  y = 1;  if (x == 0) kill(P1);

Two concurrent processes: what's the beef? Under sequential consistency at most one process is killed; under FIFO consistency each process may see its own write first, so both may end up killed.

Summary of Consistency Models (not using synchronization operations)

  Consistency     | Description
  ----------------|-------------------------------------------------------------
  Strict          | Absolute time ordering of all shared accesses matters.
  Linearizability | All processes must see all shared accesses in the same
                  | order; accesses are furthermore ordered according to a
                  | (nonunique) global timestamp.
  Sequential      | All processes see all shared accesses in the same order;
                  | accesses are not ordered in time.
  Causal          | All processes see causally related shared accesses in the
                  | same order.
  FIFO            | All processes see writes from each other in the order they
                  | were issued; writes from different processes may not always
                  | be seen in that order.

Weak Consistency (1)
Properties:
- Accesses to synchronization variables associated with a data store are sequentially consistent.
- No operation on a synchronization variable is allowed to be performed until all previous writes have been completed everywhere.
- No read or write operation on data items is allowed to be performed until all previous operations to synchronization variables have been performed.
Implementation: use a synchronization operation (S below).
Basic idea: you don't care that the reads and writes of a series of operations are immediately known to other processes; you just want the effect of the series itself to be known.

Weak Consistency (2)
Observation: weak consistency implies that we need to lock and unlock data, implicitly or not.

  P1: W(x)a   W(x)b   S
  P2:                     R(x)a   R(x)b   S
  P3:                     R(x)b   R(x)a   S
  A valid sequence of events for weak consistency.

  P1: W(x)a   W(x)b   S
  P2:                     S   R(x)a
  An invalid sequence for weak consistency: after synchronizing, P2 must see the most recent value of x (here b).

Release Consistency (1)
Idea: divide access to a synchronization variable into two parts:
- Acquire phase: forces a requester to wait until the shared data can be accessed.
- Release phase: sends the requester's local values to the other servers in the data store.

  P1: Acq(L)   W(x)a   W(x)b   Rel(L)
  P2:                              Acq(L)   R(x)b   Rel(L)
  P3:                              R(x)a
  A valid event sequence for release consistency: P3 did not acquire L, so it is not guaranteed the latest value.

Release Consistency (2)
Rules:
- Before a read or write operation on shared data is performed, all previous acquires done by the process must have completed successfully.
- Before a release is allowed to be performed, all previous reads and writes by the process must have completed.
- Accesses to synchronization variables are FIFO consistent (sequential consistency is not required).

Entry Consistency (1)
Conditions:
- An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
- Before an exclusive-mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
- After an exclusive-mode access to a synchronization variable has been performed, any other process's next nonexclusive-mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

Entry Consistency (2)
- With release consistency, all local updates are propagated to the other copies/servers during the release of shared data.
- With entry consistency, each shared data item is associated with a synchronization variable. When acquiring the synchronization variable, the most recent values of its associated shared data items are fetched.
- Note: where release consistency affects all shared data, entry consistency affects only the shared data associated with a synchronization variable.
- Question: what would be a convenient way of making entry consistency more or less transparent to programmers?

Entry Consistency (3)

  P1: Acq(Lx)   W(x)a   Acq(Ly)   W(y)b   Rel(Lx)   Rel(Ly)
  P2:                                 Acq(Lx)   R(x)a   R(y)NIL
  P3:                                 Acq(Ly)   R(y)b
  A valid event sequence for entry consistency: P2 acquired only Lx, so it may read NIL for y.

Summary of Consistency Models (with synchronization operations)

  Consistency | Description
  ------------|-------------------------------------------------------------
  Weak        | Shared data can be counted on to be consistent only after a
              | synchronization is done.
  Release     | Shared data are made consistent when a critical region is
              | exited.
  Entry       | Shared data pertaining to a critical region are made
              | consistent when a critical region is entered.

Operating Systems
- The interface between the hardware and the rest: editors, compilers, database systems, application programs, your programs, etc. Allows portability; enables easier programming.
- The manager of the different resources (memory, CPU, disk, printer, etc.) in your system. Takes responsibility away from the users; tends to improve metrics (throughput, response time, etc.).
- Two types:
  - Monolithic: all functions are inside a single kernel (the central part of the OS).
  - Microkernel based: nonbasic functions float as servers; there is a small kernel for the basic functionality.

Distributed Systems
A distributed system is:
- a collection of independent computers that appears to its users as a single coherent system;
- also (cynically) a collection of systems in which, when one breaks, nothing works.
[Figure: machines A, B, and C; distributed applications run on a middleware service layered over each machine's local OS and the network.]
A distributed system organized as middleware. What's middleware? Note that the middleware layer extends over multiple machines.

Issues in Distributed Computing
Why distribute? What's bad about centralized?
DISTRIBUTION:
- allows sharing of data, code, devices, messages, etc.;
- is more flexible: more resources can be added (scalable);
- is cheaper: several small machines are cheaper than one powerful one, i.e. the price/performance ratio is smaller than in a centralized system;
- is usually faster (same reasoning as above);
- can be fault tolerant: if one site fails, not all computations fail;
- much, MUCH MORE.
But it is:
- More complex? YES, much more — here is the "much more":
- More time consuming: messages need to go back and forth.
- Slower response time: messages; but computations can be parallelized.
- What about reliability, security, cost, network messages, congestion, load balancing?

Distributed OS Services
- Global Inter-Process Communication (IPC) primitives, transparent to the users; currently, support for client/server computing.
- Global protection schemes, so that a validation performed at one site is honored at another site (e.g. Kerberos).
- Global process management: the usual create/destroy etc., plus migration and load distribution, so that the user need not manually log on to a different machine; the OS takes charge and executes the program requested by the user on a less-loaded, fast-responding machine (compute server, file server, etc.).
- Global process synchronization, supporting different language paradigms, for heterogeneity and openness.
- Compatibility among machines (binary, protocol, etc.).
- Global naming and file system.

Distributed OSs
- Transparency attempts to hide the nature of the system from users. Good, because users usually don't need to know details; but the degree of transparency is important — too much may be too much.
- Performance is usually an issue that needs to be studied for a specific system, architecture, application, set of users, etc.
- Scalability is important in the long run and for general use; some applications, systems, or users do not need scalability.
- Distributed algorithms are also needed, with the following characteristics:
  - State information should be distributed to all nodes (how? overhead?).
  - Decisions are made based on local information (why?).
  - Fault tolerance (what for?).
  - No global/synchronized clocks (why?).

Transparency in a Distributed System

  Transparency | Description
  -------------|-----------------------------------------------------------
  Access       | Hide differences in data representation and how a resource
               | is accessed
  Location     | Hide where a resource is located
  Migration    | Hide that a resource may move to another location
  Relocation   | Hide that a resource may be moved to another location
               | while in use
  Replication  | Hide that a resource is replicated
  Concurrency  | Hide that a resource may be shared by several competitive
               | users
  Failure      | Hide the failure and recovery of a resource
  Parallelism  | Hide how many resources are being used
  Persistence  | Hide whether a (software) resource is in memory or on disk

Degree of Transparency
Observation: aiming at full distribution transparency may be too much:
- Users may be located in different continents; distribution is apparent and not necessarily something you want to hide.
- Completely hiding failures of networks and nodes is theoretically and practically impossible:
  - You cannot distinguish a slow computer from a failing one.
  - You can never be sure that a server actually performed an operation before a crash.
- Full transparency will cost performance; sometimes it is better to expose the distribution of the system:
  - Keeping Web caches exactly up to date with the master copy.
  - Immediately flushing write operations to disk for fault tolerance.

Scalability
Three dimensions:
- Size: number of users and/or processes.
- Geographical: maximum distance between nodes.
- Administrative: number of administrative domains.
Limitations:

  Concept                | Example
  -----------------------|---------------------------------------------
  Centralized services   | A single server for all users
  Centralized data       | A single online telephone book
  Centralized algorithms | Doing routing based on complete information

Techniques for Scaling
- Distribution: partition data and computations across multiple machines:
  - Move computations to clients (e.g. Java applets).
  - Decentralized naming services (e.g. DNS).
  - Decentralized information systems (e.g. WWW).
- Replication: make copies of data available at different machines (e.g. replicated file servers and databases, mirrored websites).
- Caching: allow client processes to access local copies:
  - Web caches (e.g. browser / Web proxy).
  - File caches (e.g. at the server or the client).

Scalability: the Catch
Applying scaling techniques is easy, except for one thing:
- Having multiple copies (cached or replicated) leads to inconsistencies: modifying one copy makes that copy different from the rest.
- Always keeping copies consistent, and in a general way, requires global synchronization on each modification.
- Global synchronization precludes large-scale solutions.
Observation 1: if we can tolerate inconsistencies, we may reduce the need for global synchronization.
Observation 2: tolerating inconsistencies is application dependent.

Scaling Techniques: Example 1
[Figure: the difference between letting (a) a server or (b) a client check forms as they are being filled in (FIRST NAME, LAST NAME, EMAIL); in (b) the client checks the form, and only the completed form is shipped to the server for processing.]

Scaling Techniques: Example 2
[Figure: an example of dividing the DNS name space into zones: generic top-level domains (com, edu, gov, org, net) and country domains (jp, nl, ...), each zone managed by its own servers.]

Hardware Concepts
[Figure: different basic organizations of processors and memories in distributed computer systems: shared memory (multiprocessors) vs. private memory (multicomputers), each either bus-based or switched.]

Multiprocessors (1)
[Figure: a bus-based multiprocessor: CPUs and a shared memory connected by a single bus.]

Multiprocessors (2)
[Figure: (a) a crossbar switch connecting CPUs to memories through crosspoint switches; (b) an omega switching network built from 2x2 switches.]

Homogeneous Multicomputer Systems
[Figure: (a) grid and (b) hypercube interconnection topologies.]

Software Concepts
Distributed systems can be achieved in three ways: DOS (Distributed Operating Systems), NOS (Network Operating Systems), and middleware.

  System     | Description                                 | Main goal
  -----------|---------------------------------------------|---------------------
  DOS        | Tightly-coupled operating system for multi-  | Hide and manage
             | processors and homogeneous multicomputers    | hardware resources
  NOS        | Loosely-coupled operating system for hetero- | Offer local services
             | geneous multicomputers (LAN and WAN)         | to remote clients
  Middleware | Additional layer atop of a NOS implementing  | Provide distribution
             | general-purpose services                     | transparency

Uniprocessor Operating Systems
[Figure: separating applications from operating system code through a microkernel: a user application in user mode reaches the OS interface via a system call; the memory module, process module, and file module run as separate modules over the microkernel, with no direct data exchange between modules.]
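The microkernel organization above — applications reaching OS modules only through message-style requests, with no direct data exchange between modules — can be illustrated with a toy sketch. Everything here (the `Kernel` class, port registration, the message format) is invented for illustration and does not correspond to any real kernel's API:

```python
import queue
import threading

# Toy "microkernel": the only kernel facility is message passing between
# modules. A service (here, a file module) runs as its own server thread and
# owns its state privately; clients interact with it only via request messages.

class Kernel:
    def __init__(self):
        self.ports = {}          # module name -> request queue

    def register(self, name):
        self.ports[name] = queue.Queue()
        return self.ports[name]

    def send(self, name, request):
        """Deliver a request and block until the reply arrives (RPC-style)."""
        reply = queue.Queue(maxsize=1)
        self.ports[name].put((request, reply))
        return reply.get()

def file_module(port):
    files = {}                   # private state: no other module can touch it
    while True:
        request, reply = port.get()
        op = request[0]
        if op == "write":
            _, name, data = request
            files[name] = data
            reply.put("ok")
        elif op == "read":
            _, name = request
            reply.put(files.get(name))
        elif op == "shutdown":
            reply.put("ok")
            break

kernel = Kernel()
port = kernel.register("file")
threading.Thread(target=file_module, args=(port,), daemon=True).start()

# A user application issues "system calls" as messages:
kernel.send("file", ("write", "notes.txt", "consistency models"))
print(kernel.send("file", ("read", "notes.txt")))   # prints "consistency models"
kernel.send("file", ("shutdown",))
```

The point of the sketch is the structural one made on the slide: the file module's data can only be affected by sending it a request through the kernel's message-passing facility, which is exactly what makes it possible to move such a module to another machine later.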
Multiprocessor Operating Systems Mutual exclusion and synchronization Semaphores Testnset and Swap instructions Spin locks Monitors A combination thereof Implementationissues Shared memory Message passing A combination thereof Multicomputer Operating Machine A Machine B Systems 1 Distributed applications Distributed operating system services Machine C Kernel Kernel i Kernel OS on each computer is the same computers Network 08 on each computer knows about the other computers Senices are generally transparently distributed across Multicomputer Operating Systems 2 Possible synchronization Sender i 81 pomt Receiver sender Receiver buf fer i er t s2 t Network No shared memory message passing Harder to do synchronization No centralized decision making OSs exist Amoeba Authors Typically no broadcasting thus need software In practice then only very few truly distributed multicomputer Multicomputer Operating Systems 3 Synchronization point Send buffer Riggifg Block sender until buffer not full Yes Not necessary Block sender until message sent No Not necessary Block sender until message received No Necessary Block sender until message delivered No Necessary Relation between blocking buffering and reliable communications Distributed Shared Memory Systems 1 a Pages of address space distributed among four machines b Situation after CPU 1 references page 10 c Situation if page 10 is read only and replication is used Shared global address space WDMWNNN slglidididish4h5 l WWW WWW WWW HE E 4 Memory CPU 1 CPU 2 CPU 3 CPU 4 a WWW WWWlMWW HE mm H HE CPU1 CPU2 CPU3 CPU4 b WWW WWWlMWW HE E an an CPU1 CPU2 CPU3 CPU4 C l2 Distributed Shared Memory Systems 2 Machine A Page transfer when Machine B B needs to be accessed rMf f i Two independent Page transfer when data Items A needs to be accessed Code using A Code using B False sharing of a page between two independent processes Network Operating System 1 Each computer has its own operating system with networking facilities Computers work 
independently ie they may even have different operating systems Services are tied to individual nodes ftp telnet WWW Highly file oriented basically processors share only files Machine A Machine B Machine C Distributed applications Network 08 services Network 08 services Network 08 services Kernel Kernel Network Kernel Network Operating System 2 Two clients and a server in a network operating system File server Client 1 Client 2 Q DlSkS 0n WhiCh shared file system Request Reply Q IS stored Network Network Operating System 3 Different clients may mount the servers in different places Chen Client 2 Server 1 Server 2 I gam es work private I pacman pacwoman pacchild mail teaching research a Client 1 Client 2 I privatelgam es work pacman mail pacwoman teaching pacchild research pacman mail pacwoman teaching pacchild research b C l4 Middleware Machine A Machine B Machine C l l l l Distributed applications Middleware services Network 08 Network 08 Network 08 services services services Kernel Kernel Kernel Network OS on each computer need not know about the other computers 08 on different computers need not generally be the same Services are generally transparently distributed across computers Middleware Motivation Too many networked applications were hard or difficult to integrate Departments are running different NOSs Integration and interoperability only at level of primitive NOS services Need for federated information systems Combining different databases but providing a single view to applications Setting up enterprisewide Internet services making use of existing information systems Allow transactions across different databases Allow extensibility forfuture services eg mobility teleworking collaborative applications Constraint use the existing operating systems and treat them as the underlying environment they provided the basic functionality anyway 15 Middleware Services Communication Services replace primitive sockets with 7 Remote Procedure Calls or Remote Method 
Invocations 7 Message passing 7 Communicating streams Information Services data management 7 Largescale systemWide naming 7 Advanced directory services 7 Location services 7 Persistent storage 7 Data caching and replication Security Services secure communication and processing 7 Authentication and authorization 7 Encryption Middleware and Openness Give applications control of when Where and how to access data e g code migration and distributed tmnsaction processing Application Same Application programming interface Middleware Middleware Common Network 08 protocol Network 08 In an open middlewarebased distributed system the protocols used by each middleware layer should be the same as well as the interfaces they offer to applications Comparison between Systems Item Distributeu 05 Network Middleware Multiproc Multicomp 05 based 05 Degree of transparency Very High High Low High Same OS on all nodes Yes Yes No No Number of copies of OS 1 N N N Basis for Shared memory Messages Files Model speCI c Global Global Resource management central distributed Per node Per node Scalability No Moderately Yes Varies Openness Closed Closed Open Open A comparison between multiprocessor operating systems multicomputer operating systems network operating systems and middleware based distributed systems Clients and Servers Servers typically manage shared resource more intensive use Clients are thinner less resource intensive provide interface manage smaller digital components eg barcode readers Clients and servers may be in different machines Requestreply model blocking or not Client Request Server 77777777777 W Wait for result Provide service Application Layering Traditional three layers User interface only interface Processing layer no data Data layer only data 0 Typical application database technology 0 Many others Layering User interface Keyword expression Query generator Database queries Userinterface vel HTML page containing list HTML generator Ranking component MWeb page titles 
with metainformation Database with Web pages The general organization of an Internet search engine into three different layers 1 Processing Ranked list j39evel of page titles Data level Alternative clientserver organizations Client machine User interface User interface User interface User interface i Application i nte rface Useri l Application Application l Application Database User interface Application VApplication Database i Database i i Database i i Database i ixixDatabase Server machine a b C d e Example Architecture User interface Wait for result presentation Request Return operation result Wait for data Return data Application server Request data Database server Ti me An example of a server acting as a client Modern Architectures Front end handling incoming Replicated Web servers each requests Containing the same Web pages Requests handled in roundrobin fashion W An example of horizontal distribution of a Web service Alternative C S Architectures Cooperating servers Service is physically distributed across a collection of servers Eg Replicated le systems Network news systems Naming systems DNS X500 Work ow systems Cooperating clients distributed application exists by virtue of client collaboration Teleconferencing where each client owns a multimedia workstation Publishsubscribe pushpull architectures in which role of client and server is blurred 20 Simple Solutions for Locating Entities Broadcasting Simply broadcast the ID requesting the entity to return its current address Can never scale beyond localarea networks think of ARPRARP Requires all processes to listen to incoming location requests Forwarding pointers Each time an entity moves it leaves behind a pointer telling where it has gone to Dereferencing can be made entirely transparent to clients by simple following the chain of pointers Update a client s reference as soon as present location has been found Geographical scalability problems Long chains are not fault tolerant lncreased network latency at 
dereferencing Essential to have separate chain reduction mechanisms HomeBased Approaches 12 Singletiered scheme Let a home keep track of where the entity 1s An entity s home address is registered at a naming service The home registers the foreign address of the entity Clients always contact the home first and then continues with the foreign location 45V a k g Host39s home iy VVi location a y w E 1 Send packet to host at Its home 4 CA 2739 k P d 2 Return address of current location C NR E 0 3 Tunnel packet to j 4 q J U l 4 Send successive packets to current location 0 current location 4 g Host39s present location a Client39s location HomeBased Approaches 22 Twotiered scheme Keep track of visiting entities Check local visitor register rst 0 Fall back to home location if local lookup fails Problems with homebased approaches 0 The home address has to be supported as long as the entity lives 0 The home address is xed which means an unnecessary burden when the entity permanently moves to another location 0 Poor Geographical scalability the entity may be next to the client Question to ponder How can we solve the permanent move problem Hierarchical Location Services HLS The root directory To eve node dirT p 39 domain T Directory node dirS of domain 8 A subdomain S of toplevel domain T S is contained in T A leaf domain contained in S Hierarchical organization of a location service into domains each having an associated directory node HLS Lookup Operation Node knows about E so request Node has no is forwarded to child record for E so that request is forwarded to parent Lookup request 1 1 Domain D Looking up a location in a hierarchically organized location service HLS Record Placement Observation If an entity E moves regularly between leaf domains D1 and D2 it may be more ef cient to store E s contact record at the least common ancestor LCA of dirD1 and dirD2 Lookup operations from either D1 or D2 are on average cheaper Update operations ie changing the current address can 
be dome directly at LCA Note Assuming that E generally stays in domLCA it does make sense to cache a pointer to LCA x 39 Domaln D 1 r x r I r l I r A r t M l r r l r d Cached Winters E moves regularly between to Ode drD the two subdomains HLS Invalidate Cache Domain D K K v I L K i K I E M d Cached pOinterS E moves regularly between to Ode dirD the two subdomains Caching a reference to a directory node of the lowest level domain in which an entity will reside most of the time HLS Scalability Issues Size scalability Again we have a problem of overloading higherlevel nodes Only solution is to partition a node into a number of subnodes and evenly assign entities to subnodes Naive partitioning may introduce a node management problem as a subnode may have to know how its parent and children are partitioned Geographical scalability We have to ensure that lookup operations generally proceed monotonically in the direction of where we ll find an address If entity E generally resides in California we should not let a root subnode located in France store E s contact record Unfortunately subnode placement is not that easy and only a few tentative solutions are known Unreferenced Objects Problem Assumption Objects may exist only if they can be contacted Each object should be named Each object can be located A reference can be resolved to clientobject communication Problem Removing unreferenced objects How do we know when an object is no longer referenced cyclic references Who is responsible for deciding on removing an object Entities forming an unreachable cycle Root set 0 l T I 771 O Reachable entity from the root set Unreachable entity from the root set Reference Counting 12 Principle Each time a client creates removes a reference to an object 0 a reference counter local to 0 is incremented decremented Problem 1 Dealing with lost and duplicated messages An increment is lost so that the object may be prematurely removed A decrement is lost so that the object is never removed 
An ACK is lost so that the incrementdecrement is resent Solution Keep track of duplicate requests 1 Skeleton maintains reference counter PI39OCESS P Object O K Proxy p 4 Proxy p is now counted twice Reference Counting 22 Problem 2 Dealing with duplicated references7 client P1 tells client P2 about object O 7 Client P2 creates a reference to 0 but dereferencing communicating with 0 may take a long time 7 lfthe last reference known to O is remove before P2 talks to O the object is removed prematurely Solution 1 Ensure that P2 talks to O on time 7 Let P1 tell 0 it will pass a reference to P2 7 Let 0 contact P2 immediately 7 A reference cannot be removed before 0 has acked that reference to the holder P1 sends f t P2 P1 deletes its P1 tens O that it Wi P1 deletes its re erence 0 reference to 0 pass a reference to P2 reference to 0 P1 P1 K A O has been 1 removed 1 A o O 1 7 7 P2 K Ime P2 lme P2 informs 0 it P1 sends O acks it knows has a reference reference to P2 about P239s reference a b Advanced Referencing Counting 1 8k I t Partial weight at e e 0 Object 0 Partial process p A skeleton is halved wei ht 128 Total weight of pgroxy Partial weight E a a Proilty a The initial assignment of weights in weighted reference counting b Weight assignment when creating a new reference Advanced Referencing Counting 2 Process P2 P2 gets half of the weight of proxy at P1 Total and partial 39 weight at skeletor remain the same P1 passes reference to P2 0 Weight assignment when copying a reference Advanced Referencing Counting 3 P1 has run out of weight and Object has no creates skeleton s39 i more partial Process P1 weight Ief t Process P2 P2 refers to object Via P1 rpamalweight Creating an indirection when the partial weight of a reference has reached 1 Advanced Referencing 4 Process P2 Copy counter Generation Creating and copying a remote reference in generation reference counting Reference Listing Observation We can avoid many problems if we can tolerate message loss and 
Reference Listing

Observation: we can avoid many problems if we can tolerate message loss and duplication. Reference listing: let an object keep a list of its clients.
- The increment operation is replaced by an (idempotent) insert.
- The decrement operation is replaced by an (idempotent) remove.
There are still some problems to be solved:
- Passing references: client B has to be listed at O before the last reference at O is removed (or keep a chain of references).
- Client crashes: we need to remove outdated registrations, e.g., by combining reference listing with leases.

Leases

Observation: if we cannot be exact in the presence of communication failures, we will have to tolerate some mistakes.
Essential issue: we need to avoid that object references are never reclaimed.
Solution: hand out a lease on each new reference:
- The object promises not to decrement the reference count for a specified time.
- Leases need to be refreshed (by object or client).
Observations:
- Refreshing may fail in the face of message loss.
- Refreshing can tolerate message duplication.
- Leases do not solve problems related to cyclic references.

Tracing in Groups (1)

[Figure: an external process and a process group; initial marking of skeletons with soft and hard marks.]

Tracing in Groups (2)

[Figure: after local propagation of marks in each process.]

Tracing in Groups (3)

[Figure: final marking.]

Mutual Exclusion (ME)

In a single-processor system, ME can be achieved with semaphores, lock variables, monitors, etc. In distributed systems, ME is more complex due to the lack of shared memory, timing (communication delays and clocks), and the ordering of events.
Two basic approaches to ME in distributed systems can be identified:
- Centralized: master and slaves; the master dictates actions.
- Distributed: each site decides on actions based on its own state.
Distributed algorithms can be subdivided into:
- Token-based: a site is allowed in the CS if it holds a token. Tokens are passed from site to site, in some priority order.
- Non-token-based: a site enters the CS when an assertion becomes TRUE. A site communicates with others to get information about the other sites' states, and based on this decides whether the assertion is TRUE or FALSE. This communication is usually based on
timestamps.

ME Requirements

For a solution to be correct, the algorithm must have the following properties:
- Deadlock freedom.
- Starvation freedom.
- Fairness: give everyone a chance; in certain systems it also means executing the requests in the order they are made (e.g., logical FIFO).
Performance metrics:
- Number of messages (sent, received, or both).
- Synchronization delay.
- Response time.
- System throughput.
Conditions under which performance must be tested:
- High- vs. low-load behavior.
- Worst- vs. best- vs. average-case behavior.

Simple Solution: Centralized ME

Centralized approach: a master holds a list of processes requesting the CS and grants requests in some order (random, FIFO, etc.).
[Figure: site 1 has the critical section and will yield to the master; the master will grant the CS to site x from its list of requests.]
Advantages: fair, correct, no starvation, simple to implement.
Disadvantages: single point of failure and bottleneck (bottleneck leads to network congestion and timeouts).

Distributed ME

Problem: if messages are not ordered correctly, the CS can be acquired improperly (remember that without global timestamps, the earliest request should gain the CS). In the example below, P3 thinks that P1 should have the CS (P3 got P1's message first), while both P1 and P2 think that P2 should have it. In a distributed algorithm, the decision must be made independently of the other nodes in the system, AND the decision must be the same at every node.
[Figure: timelines of P1, P2, and P3 with concurrent requests for the CS.]

Distributed ME: Lamport's Algorithm

Lamport's algorithm solves the problem above with logical clocks (Lamport '78: "Clocks, Messages, and the Pursuit of Happiness").
To request the CS: send a request message M to all processes, and enqueue M in the local queue.
Upon receiving a request from Pi: enqueue it and send an ack.
Upon receiving a release from Pi: dequeue its request.
To release the CS: send a release message to all processes and remove the request from the local queue.
To acquire the CS: enter the CS when a message with a larger timestamp has been received from every other process AND the process's own message has the smallest timestamp in its own queue.
Note that to enter the CS, checks must only be made locally.

Distributed ME: Ricart and Agrawala's Algorithm

To request the CS:
- Send a request message M to all processes.
- Enqueue M in the local queue.
Upon receiving M from Pi:
- If it doesn't have/want the CS: send an ack.
- If it has the CS: enqueue the request (Pi, M).
- If it wants the CS: either enqueue or ack.
To release the CS:
- Send an ack message to all processes in the queue.
- Remove those processes from the queue.
To acquire the CS: enter the CS when an ack has been received from every other process.

Ricart and Agrawala's Algorithm (cont.)

The main idea is that the lowest timestamp wins. Also, the acks AND permission messages are combined into acks, which are only sent out after a process has used the CS (if it has the smaller timestamp). Example:
- Node A requests the CS with ts=8; node C with ts=12.
- B acks both A and C.
- C acks A (smaller ts).
- A uses the CS, then sends an ack to C.
[Figure: phases I-III of the example, showing the ack messages.]

Maekawa's Algorithm

Maekawa presents an algorithm in which a subset of nodes can give permission to a node to enter the CS: it is sufficient for a majority of the nodes to give permission. In Maekawa's algorithm, |Gi| = sqrt(n), where Gi is the subset (there are sqrt(n) subsets). By design, there is always a node in the intersection of any two subsets.
To request the CS: send a request message M to all processes in Gi.
Upon receiving a request M from Pi: if no reply has been sent since the last release, send a reply; otherwise, queue it.
To release the CS: send a release to all processes in Gi.
Upon receiving a release from Pi: send a reply to the head of the queue and remove it from the queue; if the queue is empty, update the state (no reply sent).
To acquire the CS: enter the CS when a reply has been received from all other processes in Gi.

Maekawa's Algorithm

For example: Ga = {b, d, e}; Gc = {a, b, f}; Gg = {c, d, f}. Say sites a, c, g make requests simultaneously. If site b responds positively to a, what happens? Site b queues the response to site c until it gets a release from a, but it could send positive responses to site a. Meanwhile, site f responds positively to site g, but then must queue the request for site c. Eventually, site a will send a release to its subset of nodes, and site b will then respond positively to c.

Maekawa's Algorithm: Problem

The biggest problem with Maekawa's algorithm is that it may lead to deadlocks.
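The intersection property that makes Maekawa-style request sets work can be illustrated with a simple grid construction. This is not Maekawa's optimal design (grids give sets of size 2*sqrt(n) - 1 rather than sqrt(n)), but it shows the key safety property: any two request sets overlap, so two sites can never both collect full permission.

```python
import math

def grid_quorum(i, n):
    """Grid-based request set (sketch): arrange the n nodes in a
    sqrt(n) x sqrt(n) grid; node i's set is its whole row plus its
    whole column, so any two sets are guaranteed to intersect."""
    k = int(math.isqrt(n))
    row, col = divmod(i, k)
    return {row * k + c for c in range(k)} | {r * k + col for r in range(k)}

n = 16
for i in range(n):
    for j in range(n):
        # Safety: every pair of request sets overlaps, so the node in
        # the intersection arbitrates between conflicting requests.
        assert grid_quorum(i, n) & grid_quorum(j, n)
```

Node i's row crosses node j's column in exactly one grid cell, and that cell belongs to both sets, which is the design-time guarantee the notes refer to.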
How can deadlock arise? Homework: show a sequence of events that may lead to deadlocks (due next class; simple answer). Note that there is no ordering among the messages that are sent to the subsets of authorizing nodes; there are also communication delays, and together these may cause the deadlocks.
Solution: one of the sites initiates a deadlock-resolution phase. New types of messages are needed:
- FAILED: sent to a site after a REPLY, when a higher-priority request was received.
- INQUIRE: sent to a site to check whether that site was able to acquire all of its locks.
- YIELD: sent in reply to an INQUIRE message, releasing the locks in case the site did not get all of its locks.
- GRANT: sent to a site, granting that site a lock.

Maekawa's Algorithm: Modifications

Upon receiving a request M from Pi: if no reply has been sent since the last release, send a reply. Otherwise, if the new request has a larger timestamp, send a FAILED message; if the new request has a smaller timestamp, send an INQUIRE message to the site of the previous REPLY.
Upon receiving an INQUIRE message: a site ignores it if it already got all grants from its subset. If the site received a FAILED message from any site, it sends a YIELD message. If a site sent a YIELD message but got no GRANT message from another site in the request set, it sends a YIELD message.
Upon receiving a YIELD from a site: a member of a request set assumes that the site has released the locks, and processes the request normally.
How many messages are required for the algorithm to succeed? Homework: show a sequence of events that resolves the deadlocks with the new messages (due next class; simple answer).

Token-Based Distributed ME

In token-based approaches to distributed ME, the general correctness of the algorithm is easier to prove, since a site will only enter the CS if it holds a token. On the other hand, it is harder to prove freedom from starvation, since a site may retain a token forever. Another problem arises when a site fails: the token has to be regenerated, which is similar to leader election.
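Maekawa's modified reply rule above can be sketched as a small decision function; the function and argument names are illustrative. FAILED goes back to the new (lower-priority) requester, while INQUIRE goes to the site that received the previous REPLY.

```python
def on_request(have_replied_since_release, new_ts, prev_ts):
    """Maekawa's modified reply rule (sketch): decide what to send when
    a request with timestamp new_ts arrives, given the timestamp
    prev_ts of the request we last replied to."""
    if not have_replied_since_release:
        return "REPLY"
    if new_ts > prev_ts:
        return "FAILED"      # newer (lower-priority) request loses
    return "INQUIRE"         # older request has priority: ask the site
                             # of the previous REPLY whether it can yield

assert on_request(False, new_ts=5, prev_ts=None) == "REPLY"
assert on_request(True, new_ts=9, prev_ts=4) == "FAILED"
assert on_request(True, new_ts=2, prev_ts=4) == "INQUIRE"
```

Because the lowest timestamp always wins within each request set, the yield/grant exchange eventually breaks any wait cycle.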
One representative algorithm for token-based (tree-based) ME is Raymond's algorithm.

Other Approaches

Rings (physical or logical), over any physical interconnection (e.g., bus, point-to-point nets, etc.). These usually use a token: the process that has the token enters the CS. It is fair, since the CS is granted round-robin; it is correct, since there is a single token. Problems: lost token, duplicated token. We need to be able to detect and regenerate/delete tokens. If a process crashes/fails, what do we do?

Comparison (messages per CS entry/exit; delay; problems):
- Centralized: 3 messages; delay 2; coordinator crash.
- Distributed: 2(n-1) messages; delay 2(n-1); n points of failure.
- Token ring: 0 to n-1 messages; delay 0 to n-1; lost/duplicated token.

Load Balancing: Sender-Initiated Algorithms

Stability: sender-initiated algorithms become unstable at high loads. At low loads a sender will eventually find a receiver, but at high loads the polling for a receiver only adds further load. The transfer policy is threshold-based: a node becomes a sender when a new task makes its CPU queue length exceed a threshold T. The location policy can probe nodes at random, with no overhead to collect state information; a task is transferred only if the probed node's queue length is below the threshold.

Load Balancing: Receiver-Initiated Algorithms

Stability: receiver-initiated algorithms remain stable. At high loads a receiver will eventually find a sender quickly; at low loads some polling is wasted, but since the system is lightly loaded the cost is acceptable. The drawback is that by the time a receiver asks, tasks whose transfer was wanted may already have started executing, so transfers tend to be preemptive, which is more expensive than transferring tasks that have not started.

Load Balancing: Symmetrically Initiated Algorithms

These combine both policies: senders search for receivers at low loads, and receivers search for senders at high loads. To avoid instability, a double-threshold policy can be used: a node above the upper threshold acts as a sender, a node below the lower threshold acts as a receiver, and nodes in between do neither.

Load Balancing: Adaptive Policies

A node broadcasts a message when its state changes; the state information is maintained among the nodes and is used in the transfer and location decisions.

Communication Problems: Flow, Loss, Congestion, Policing

Messages get lost due to several factors, including collisions, lack of
buffer space, lack of computing power, etc. To allow the flow of data at a good pace, the senders of data should be sensitive to the amount of data that is getting through (the throughput of the network). From each sender's perspective, the maximum bandwidth the network provides should be the upper limit on the amount of data to be transmitted. (Does this work for real-time?) However, several senders at once may cause network congestion. The protocol being used must provide some means of detecting lost messages or packets, and of correcting present and future losses. Two types: hop-by-hop and end-to-end flow control and loss detection/correction. Tradeoffs?

Flow Control

Some methods are windows (sliding and jumping) and buckets. Windows count the number of packets in each window and refuse any packets that exceed the number allowed in the window. Buckets have a (usually constant) rate of accepting packets; any packet in excess of that rate is not accepted (dropped).

Flow Control

Flow control can be done with and without feedback from the net. If there is no feedback, the sender has to control the number of packets it will send according to some pre-established threshold. This scheme is often applied to implicit resource reservation, used in many multimedia and real-time applications due to their tight timing requirements (for example, video and audio and their synchronization, telephone delays or echoes, etc.). On the other hand, feedback can come on a hop-by-hop basis or on an end-to-end basis. Feedback can take several forms: for example, the time for a packet to arrive at the destination (final or intermediate) plus the time for an ack to be received by the sender (round-trip delay). Another example is the set of packets that arrive at a destination, or their ordering (for example, a destination got packet n but not yet packet n+k).

Media Access Control

In LANs, the medium is usually shared among all PEs. Therefore, there must be a way of detecting when it is safe to transmit and, after that, whether the transmission was successful: explicit reservation of the
bandwidth and time to send; implicit reservations for each PE in the LAN (e.g., token rings); "hope and pray" (put the packet on the network and wait for an ack); or "look, hope and pray" (attempt to send only if there is no other traffic on the net). Aside from handling collisions, the PE should be able to collect the bits from the network, put them into packets, put these into messages, and give them to the user (or forward them to the next hop/switch). Usually, protocols implemented in software are the bottleneck in terms of performance (data transfer rates). Many new applications are moving the protocols to hardware, with an improvement of a factor of 4-20 times over their software counterparts.

Traffic Police

The Media Access Control (MAC) layer fits between the physical and data-link layers of the ISO OSI model. It accepts packets from either connectionless or connection-oriented channels and examines the media to verify when to send the packet. Combining the flow control exercised in the network layer with the access control makes the network safer with respect to reliability and efficiency. However, all nodes connected to the net should obey the rules of traffic control, and the switches should behave as traffic police. These policing mechanisms are essential to real-time nets, where each packet or message has associated with it a deadline before which it must be delivered. The criticality of the application defines whether the traffic is non-real-time (datagram), soft real-time (multimedia), or hard real-time (remote surgery).

Stream-Oriented Communication

All communication discussed so far is time-independent. Support for continuous media (CM) is time-dependent: audio, video, and sensor-data applications; different types of devices (camera, audio card, etc.); maximum end-to-end delays (synchronous CM), and perhaps maximum jitter as well (isochronous CM). There are sources of data and sinks of data. Are the source and destination (sink) known? Are there multiple sinks, perhaps multiple sources? What is the optimal way to reach all destinations? Each source may have multiple flows, or different sources will have multiple flows to the same
sink.

Data Streams

[Figure: (a) a stream between two sending/receiving processes across a network; (b) a stream directly between two devices (camera to display).]
[Figure: multicasting a stream to several receivers, with an intermediate node (possibly with filters) feeding a lower-bandwidth sink.]

Streams, Flows, and QoS

The Quality of Service (QoS) of an application is the look and feel of how the stream will flow. QoS depends on two components:
- The specification of the flow/stream.
- The implementation that satisfies the flow/stream specification.
A flow specification encompasses:
- The rates at which the media is transmitted.
- How bursty the flow/stream is.
- How much error there can be, given a cost function (hopefully).
The implementation depends on many, many factors; common among all implementations is an enforcement of the specs.

Specifying QoS

QoS specification parameters for a token bucket model. Token bucket: the flow/stream is serviced when there are credits. Other schemes are also good, but have different characteristics: leaky buckets, an explicit real-time scheduler.
Characteristics of the input: maximum data unit size (bytes); token bucket rate (bytes/sec); token bucket size (bytes); maximum transmission rate (bytes/sec).
Service required: loss sensitivity (bytes); loss interval (µsec); burst loss sensitivity (data units); minimum delay noticed (µsec); maximum delay variation (µsec); quality of guarantee.
The principle of a token bucket: the application produces an irregular stream of data units; one token is added to the bucket every ΔT, and the sender process releases a regular stream of data units as tokens become available.

Resource reservation in an RSVP-enabled host: the OS must support the mechanism; policy can be implemented at the user level. [Figure: the RSVP process receives reservation requests from, and sends setup information to, other RSVP hosts; the application data stream passes through the local OS and the data-link layer onto the local network, subject to admission control.]
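The token-bucket principle specified above can be sketched in a few lines; the class name and parameter values are illustrative. Tokens (credits) accumulate at a fixed rate up to the bucket size, and a data unit is serviced only when enough credit is available.

```python
class TokenBucket:
    """Token-bucket policer (sketch): tokens accumulate at `rate` per
    second, capped at `size`; a data unit is serviced only when there
    are enough tokens (credits) for it."""
    def __init__(self, rate, size):
        self.rate, self.size = rate, size     # tokens/sec, max tokens
        self.tokens, self.last = size, 0.0    # start with a full bucket

    def service(self, nbytes, now):
        # Replenish credits for the time elapsed since the last call.
        self.tokens = min(self.size, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True                       # enough credit: send
        return False                          # not enough credit: wait/drop

tb = TokenBucket(rate=100.0, size=200.0)      # 100 bytes/s, 200-byte bucket
assert tb.service(200, now=0.0)               # bucket starts full
assert not tb.service(50, now=0.1)            # only ~10 tokens accrued
assert tb.service(50, now=1.0)                # ~100 tokens by t = 1.0
```

The bucket size bounds the burst that can be sent at once, while the rate bounds the long-term average, matching the "token bucket rate" and "token bucket size" parameters in the flow specification.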
Reservations can be statistical (good on average) or guaranteed (good for important traffic). There are no standards; much research, but no prevalent results.

Vnet: Pitt's Approach

Vnet is a versatile network architecture (IEEE Transactions on Computers, Aug. 2000). It allows specification of a minimum data rate (mandatory traffic), traffic above the minimum (optional traffic), and bursty traffic. It uses real-time scheduling techniques to schedule streams:
- EDF (earliest deadline first): if CPU utilization is less than 100%, accept the new stream.
- RMS (rate-monotonic scheduling): if CPU utilization is less than n(2^(1/n) - 1), accept the new stream.
- CPU utilization is the sum of all process utilizations, i.e., the sum of Ci/Pi, where Ci is the execution time of the process and Pi is the period of the process.
Probabilistic acceptance is also possible, and price-based approaches should also be explored.

Alternative Protocols

Best effort (some call it the "hope and pray" method):
- Do as much as you can, and live with the consequences.
Fair queueing (or processor sharing):
- Define the share of each process.
- The system provides only that much service.
- How is it accomplished?
RSVP (ReSerVation Protocol):
- Soft-state protocol (refresh messages); allows routes to be pinned.
- Routes need to be refreshed at a specific periodicity.
- Overhead is high for large groups or for many short-lived communications.
Multicast trees:
- Shared trees (ST): one tree for all sources and destinations (minimum spanning tree; the Steiner tree problem).
- Source-specific trees (SST): one tree per source.
- Each has advantages and disadvantages.

Synchronization Mechanisms (1)

Synchronization of data and of flows is a big problem, mainly due to the unpredictability of the computing systems we have today:
- Interrupts may disturb media continuity.
- Priority inversion may occur (a high-priority task blocked by a low-priority task).
- Different flows should be synchronized.
The application can do the synchronization. Hardware can be used to help, especially when the sync requirements are strict:
- Separate streams should be played within 30 microseconds of each other.
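The EDF and RMS admission tests that Vnet uses above can be sketched directly from the utilization formula; the task set below is an illustrative example, not from the notes.

```python
def utilization(tasks):
    """Total CPU utilization: the sum of C_i / P_i over all streams,
    where C_i is the execution time and P_i the period."""
    return sum(c / p for c, p in tasks)

def admit_edf(tasks):
    # EDF: schedulable iff total utilization does not exceed 100%.
    return utilization(tasks) <= 1.0

def admit_rms(tasks):
    # RMS: sufficient (Liu & Layland) bound n(2^(1/n) - 1) for n tasks.
    n = len(tasks)
    return utilization(tasks) <= n * (2 ** (1.0 / n) - 1)

streams = [(1, 4), (2, 6), (1, 8)]      # (C_i, P_i) pairs
# U = 0.25 + 0.333... + 0.125 ~= 0.708
assert admit_edf(streams)               # 0.708 <= 1.0
assert admit_rms(streams)               # bound for n = 3 is ~0.780
```

Note the asymmetry: the EDF test is exact, while the RMS bound is only sufficient, so a stream set rejected by the RMS test might still be schedulable.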
[Figure: at the receiver's machine, the application reads the incoming stream with a procedure that reads two audio data units for each video data unit.]

Synchronization Mechanisms (2)

Synchronization supported by high-level interfaces: middleware (or specialized applications or layers) multiplexes all substreams into a single stream and demultiplexes at the receiver; synchronization is handled at the multiplexing/demultiplexing point (example: MPEG). [Figure: at the receiver's machine, the middleware layer demultiplexes the incoming stream, and the application tells the middleware what to do with each substream.]

Replica Placement: Model

We consider objects, and don't worry whether they contain just data, or code, or both. Distinguish different processes: a process is capable of hosting a replica of an object or data.
- Permanent replicas: a process/machine that always holds a replica.
- Server-initiated replicas: a process that can dynamically host a replica on request of another server in the data store.
- Client-initiated replicas: a process that can dynamically host a replica on request of a client (client cache).

Replica Placement

[Figure: the logical organization of the different kinds of copies of a data store into three concentric rings: permanent replicas, server-initiated replicas, and client-initiated replicas.]
Examples: web servers, file servers, multicast trees.

Server-Initiated Replicas

[Figure: server Q, holding file F, counts accesses from clients C1 and C2 as if they came from the server closest to the requesting clients; a server without a copy of F may then receive one.]
Keep track of access counts per file, aggregated by considering the server closest to the requesting clients:
- If the number of accesses drops below threshold D: drop the file.
- If the number of accesses exceeds threshold R: replicate the file.
- If the number of accesses is between D and R: migrate the file.

Client-Initiated Replicas

More like a client cache:
- Keep it on disk?
- Keep it in memory?
- How much space to use?
- How long to keep a copy/replica?
- How to detect that data is stale?
Read-only files work best. Sharing data among client processes may be good. Sharing space is essential.
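One lease-based answer to the "how long to keep a copy?" and "how to detect that data is stale?" questions above can be sketched as follows; the class name and lease duration are illustrative. Registrations (or cached copies) expire unless refreshed, so crashed or departed clients are eventually forgotten.

```python
import time

class LeasedRefList:
    """Reference listing with leases (sketch): each client's registration
    expires unless refreshed, so stale clients are eventually dropped."""
    def __init__(self, lease_seconds=30.0):
        self.lease = lease_seconds
        self.clients = {}                      # client -> expiry time

    def register(self, client, now=None):      # idempotent insert/refresh
        now = time.time() if now is None else now
        self.clients[client] = now + self.lease

    def unregister(self, client):              # idempotent remove
        self.clients.pop(client, None)

    def reclaimable(self, now=None):
        now = time.time() if now is None else now
        # Drop expired registrations, then check for live references.
        self.clients = {c: t for c, t in self.clients.items() if t > now}
        return not self.clients

refs = LeasedRefList(lease_seconds=30)
refs.register("A", now=0.0)
refs.register("A", now=0.0)            # duplicate insert is harmless
assert not refs.reclaimable(now=10.0)  # lease still valid
assert refs.reclaimable(now=31.0)      # lease expired: safe to reclaim
```

Because both insert and remove are idempotent, lost or duplicated refresh messages are tolerated, which is exactly the property the reference-listing discussion earlier relied on.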
Update Propagation (1/3)

- Propagate only a notification/invalidation of an update (often used for caches).
- Transfer the data from one copy to another (distributed databases).
- Propagate the update operation to the other copies (also called active replication).
Observation: no single approach is the best; it depends highly on the available bandwidth and on the read-to-write ratio at the replicas.

Update Propagation (2/3)

- Pushing updates: a server-initiated approach, in which an update is propagated regardless of whether the target asked for it.
- Pulling updates: a client-initiated approach, in which a client requests to be updated.
The two differ in the state kept at the server, the messages sent, and the response time seen at the client.

Update Propagation (3/3)

Observation: we can dynamically switch between pulling and pushing using leases: a contract in which the server promises to push updates to the client until the lease expires.
Issue: make the lease expiration time dependent on the system's behavior (adaptive leases):
- Age-based leases: an object that hasn't changed for a long time will not change in the near future, so provide a long-lasting lease.
- Renewal-frequency-based leases: the more often a client requests a specific object, the longer the expiration time for that client (for that object) will be.
- State-based leases: the more loaded a server is, the shorter the expiration times become.
Question: why are we doing all this?

Epidemic Algorithms

Basic idea: assume there are no write-write conflicts:
- Update operations are initially performed at one or only a few replicas.
- A replica passes its updated state to a limited number of neighbors.
- Update propagation is lazy, i.e., not immediate.
- Eventually, each update should reach every replica.
Anti-entropy: each replica regularly chooses another replica at random and exchanges state differences, leading to identical states at both afterwards.
Gossiping: a replica which has just been updated (i.e., has been "contaminated") tells a number of other replicas about its update (contaminating them as well).

System Model

We consider a collection of servers, each storing a number of objects. Each object O has a primary server at which updates for O are always initiated (avoiding write-write conflicts). An
update of object O at server S is always timestamped; the value of O at S is denoted VAL(O, S), and T(O, S) denotes the timestamp of the value of object O at server S.

Anti-Entropy

Basic issue: when a server S contacts another server S' to exchange state information, three different strategies can be followed:
- Push: S only forwards its updates to S': if T(O, S') < T(O, S), then VAL(O, S') <- VAL(O, S).
- Pull: S only fetches updates from S': if T(O, S') > T(O, S), then VAL(O, S) <- VAL(O, S').
- Push-pull: S and S' exchange their updates by pushing and pulling values.
Observation: if each server periodically and randomly chooses another server for exchanging updates, an update is propagated in O(log n) time units.
Question: why is pushing alone not efficient when many servers have already been updated?

Gossiping

Basic model: a server S having an update to report contacts other servers. If a server is contacted to which the update has already propagated, S stops contacting other servers with probability 1/k. If s is the fraction of "ignorant" servers (i.e., those unaware of the update), it can be shown that with many servers s = e^(-(k+1)(1-s)).
Observation: if we really have to ensure that all servers are eventually updated, gossiping alone is not enough.

Deleting Values

We cannot remove an old value from a server and expect the removal to propagate; instead, the removal will be undone in due time by the epidemic algorithms.
Solution: a removal has to be registered as a special update, by inserting a death certificate.
Next problem: when to remove a death certificate (it is not allowed to stay forever):
- Run a global algorithm to detect whether the removal is known everywhere, and then collect the death certificates (this looks like garbage collection).
- Assume death certificates propagate in finite time, and associate a maximum lifetime with each certificate (can be done at the risk of not reaching all servers).
Note: it is necessary that a removal actually reaches all servers. Question: what's the scalability problem here?
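One push-pull anti-entropy exchange, using the VAL(O, S) and T(O, S) notation above, can be sketched as below; the dictionary representation of a server's store is illustrative. For every object, the replica with the newer timestamp overwrites the older one, so both servers end with identical state.

```python
def anti_entropy(s1, s2):
    """Push-pull anti-entropy exchange (sketch): each server's store maps
    an object name to a (timestamp, value) pair; the newer value wins."""
    for obj in set(s1) | set(s2):
        t1, v1 = s1.get(obj, (0, None))
        t2, v2 = s2.get(obj, (0, None))
        if t1 > t2:
            s2[obj] = (t1, v1)          # push: S1's value is newer
        elif t2 > t1:
            s1[obj] = (t2, v2)          # pull: S2's value is newer

a = {"x": (3, "new"), "y": (1, "old")}
b = {"x": (1, "old"), "y": (2, "new")}
anti_entropy(a, b)
assert a == b == {"x": (3, "new"), "y": (2, "new")}
```

Restricting the loop to the push branch or the pull branch alone gives the other two strategies, which helps answer the question of why pushing alone is inefficient once most servers already hold the update.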
Consistency Protocols

A consistency protocol describes the implementation of a specific consistency model. We will concentrate only on sequential consistency:
- Primary-based protocols.
- Replicated-write protocols.
- Cache-coherence protocols.

Primary-Based Protocols (1/4)

All read and write operations go to a single server. Example: used in traditional client-server systems that do not support replication.
[Figure: clients direct reads and writes to the single server for item x, with a backup server in the data store.]
- W1: write request; W2: forward request to the server for x; W3: acknowledge write completed; W4: acknowledge write completed.
- R1: read request; R2: forward request to the server for x; R3: return response; R4: return response.

Primary-Based Protocols (2/4)

Primary-backup protocol: writes are typically forwarded to the primary server.
[Figure: a client writes through the primary server for item x, which updates the backup servers; another client reads from a backup.]
- W1: write request; W2: forward request to the primary; W3: tell backups to update; W4: acknowledge update; W5: acknowledge write completed.
- R1: read request; R2: response to read.
Example: traditionally applied in distributed databases and file systems that require a high degree of fault tolerance; replicas are often placed on the same LAN.

Primary-Based Protocols (3/4)

Primary-based local-write protocol: migrate the data, do not replicate it.
[Figure: a read or write request is forwarded to the current server for x, item x is moved to the client's server, and the result of the operation is returned from there.]
Example: establishes only a fully distributed, non-replicated data store. Useful when writes are expected to come in series from the same client (e.g., mobile computing without replication).

Primary-Based Protocols (4/4)

Primary-backup protocol with local writes: replicate data only for reading.
[Figure: item x is moved from the old primary to the new primary; the new primary acknowledges the write, tells the backups to update, and the backups acknowledge.]
- W1: write request; W2: move item x to the new primary; W3: acknowledge write completed; W4: tell backups to update; W5: acknowledge update.
- R1: read request; R2: response to read.
Example: distributed shared memory systems, but also mobile computing in disconnected mode (ship all relevant files to the user before disconnecting, and update later on).
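The W1-W5 primary-backup write path above can be sketched as below; the class names are illustrative. The key property is that the primary acknowledges the client only after every backup has applied and acknowledged the update, so any subsequent read at any replica sees the write.

```python
class Backup:
    def __init__(self):
        self.value = None
    def update(self, v):           # W4 is the (implicit) return here
        self.value = v
    def read(self):                # R1/R2: reads are served locally
        return self.value

class Primary:
    """Primary-backup write path (sketch of the W1-W5 message flow)."""
    def __init__(self, backups):
        self.value, self.backups = None, backups

    def write(self, v):            # W1: write request arrives
        self.value = v
        for b in self.backups:     # W3: tell every backup to update
            b.update(v)            # W4: backup acknowledges
        return "ack"               # W5: acknowledge write completed

bs = [Backup(), Backup()]
p = Primary(bs)
p.write(42)
assert all(b.read() == 42 for b in bs)   # reads anywhere see the write
```

This blocking fan-out is what makes the protocol sequentially consistent but also what makes writes slow, motivating the local-write variants described above.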
Replicated-Write Protocols (1/2)

- Active replication: updates are forwarded to multiple replicas, where they are carried out.
- One problem to deal with: replicated invocations. [Figure: a client replicates an invocation request; all replicas see the same invocation, but the invoked object receives the same invocation three times.]

Replicated-Write Protocols (2/2)

Replicated invocations, centralized solution: assign a coordinator on each side (client and server), which ensures that only one invocation (a) and one reply (b) is sent. [Figure: the coordinator of object B and the coordinator of object C mediate the replicated invocation.]
Triple modular redundancy: simple to implement. The request is replicated to all servers (A1, A2, A3); vote on all three results, and the majority (50% + 1) wins.

Quorum-Based Protocols

Quorum-based protocols ensure that each operation is carried out in such a way that a majority vote is established: distinguish a read quorum and a write quorum.

Example: Lazy Replication

Basic model: a number of replica servers jointly implement a causal-consistent data store. Clients normally talk to front ends, which maintain the data needed to ensure causal consistency. [Figure: clients issue reads and writes to a front end; pending requests sit in write and read queues before a local server in the distributed data store serves them.]

Clock Synchronization

Physical clocks, logical clocks, vector clocks.

Physical Clocks

Problem: suppose we have a distributed system with a UTC receiver somewhere in it; we still have to distribute its time to each machine. (UTC is Universal Coordinated Time, based on an atomic element, Cs.)
Basic principle: every machine has a timer that generates an interrupt H times per second. There is a clock in machine p that ticks on each timer interrupt. Denote the value of that clock by Cp(t), where t is UTC time. Ideally, we have that for each machine p, Cp(t) = t, or in other words, dC/dt = 1.

Clock Synchronization Algorithms

In
practice, 1 - p <= dC/dt <= 1 + p for some small constant (maximum drift rate) p.
Goal: never let two clocks in any system differ by more than d time units; hence, synchronize at least every d/(2p) seconds.

Clock Synchronization

Idea 1: every machine asks a time server for the accurate time, at least once every d/(2p) seconds. A good solution, but we need an accurate measure of the round-trip delay, including interrupt handling and the processing of incoming messages.
Idea 2: let the time server scan all machines periodically, calculate an average, and inform each machine how it should adjust its time relative to its present time. Another good solution: you'll probably get every machine in sync.
Fundamental problem: you'll have to take into account that setting the time back is never allowed, so use smooth adjustments. Note: you don't even need to propagate UTC time. Why not?

The Berkeley Algorithm

(a) The time daemon asks all the other machines for their clock values. (b) The machines answer. (c) The time daemon tells everyone how to adjust their clock.

The Happened-Before Relationship

Problem: we first need to introduce a notion of ordering before we can order anything. The happened-before relation -> on the set of events in a distributed system is the smallest relation satisfying:
- If a and b are two events in the same process and a comes before b, then a -> b.
- If a is the sending of a message and b is the receipt of that message, then a -> b.
- If a -> b and b -> c, then a -> c.
Is this a partial or a total ordering of events in a system with concurrently operating processes?

Logical Clocks

Problem: how do we maintain a global view of the system's behavior that is consistent with the happened-before relation? Solution: attach a timestamp C(e) to each event e, satisfying the following properties:
- P1: if a and b are two events in the same process and a -> b, then we demand that C(a) < C(b).
- P2: if a corresponds to sending a message m and b corresponds to receiving that message, then also C(a) < C(b).
Problem: how do we attach a timestamp to an event when there's no global clock? => maintain a consistent set of logical clocks, one per process.
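Such a set of per-process logical clocks can be maintained with Lamport's standard rules (stated formally just below); the sketch and its class name are illustrative.

```python
class LamportClock:
    """Lamport's logical clock (sketch): increment on each local event,
    stamp outgoing messages, and on receipt jump past the message's
    timestamp so properties P1 and P2 both hold."""
    def __init__(self):
        self.c = 0

    def tick(self):                 # rule 1: a local event
        self.c += 1
        return self.c

    def send(self):                 # rule 2: timestamp an outgoing message
        return self.tick()

    def receive(self, tm):          # rule 3: Cj <- max(Cj + 1, Tm + 1)
        self.c = max(self.c + 1, tm + 1)
        return self.c

p, q = LamportClock(), LamportClock()
m = p.send()                        # p's clock: 1
q.tick(); q.tick(); q.tick()        # q's clock: 3
assert q.receive(m) == 4            # max(3 + 1, 1 + 1) = 4
m2 = q.send()                       # q's clock: 5
assert p.receive(m2) == 6           # max(1 + 1, 5 + 1) = 6
```

Note that the clock only ever moves forward, mirroring the physical-clock rule above that setting time back is never allowed.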
Logical Clocks

Each process Pi maintains a local counter Ci and adjusts this counter according to the following rules:
1. For any two successive events that take place within Pi, Ci is incremented by 1.
2. Each time a message m is sent by process Pi, the message receives a timestamp Tm = Ci.
3. Whenever a message m is received by a process Pj, Pj adjusts its local counter: Cj <- max(Cj + 1, Tm + 1).
Property P1 is satisfied by (1); property P2 by (2) and (3).

Extension to Multicasting: Vector Timestamps

Observation: Lamport timestamps do not guarantee that if C(a) < C(b), then a indeed happened before b. (Why?) We need vector timestamps for that. Each process Pi has an array Vi[1..n], where Vi[j] denotes the number of events that process Pi knows have taken place at process Pj. When Pi sends a message m, it adds 1 to Vi[i] and sends Vi along with m as the vector timestamp vt(m). Result: upon arrival, each other process knows Pi's timestamp. Question: what does Vi[k] mean in terms of messages sent and received?

Extension to Multicasting: Vector Timestamps

When a process Pj receives a message m from Pi with vector timestamp vt(m), it (1) updates each Vj[k] to max(Vj[k], vt(m)[k]), and (2) increments Vj[j] by 1. (Is the book correct?) To support causal delivery of messages, assume you increment your own component only when sending a message. Then Pj postpones delivery of m until:
- vt(m)[i] = Vj[i] + 1, and
- vt(m)[k] <= Vj[k] for all k != i.
Example: take Vj = (0, 2, 2) and vt(m) = (1, 3, 0). What information does P3 have, and what will it do when receiving m from P1?

Global State (1)

Basic idea: sometimes you want to collect the current state of a distributed computation, called a distributed snapshot. It consists of all local states and messages in transit. Important: a distributed snapshot should reflect a consistent state.
[Figure: a consistent cut versus an inconsistent cut; in the inconsistent cut, the sender of message m2 cannot be identified.]

Global State

Note: any process P can initiate taking a distributed snapshot. P starts by recording its own local state; P subsequently sends a
marker along each of its outgoing channels. When Q receives a marker through channel C, its action depends on whether it has already recorded its local state:
- Not yet recorded: it records its local state, and sends the marker along each of its outgoing channels.
- Already recorded: the marker on C indicates that the channel's state should be recorded: all messages received on C between the time Q recorded its own state and the arrival of this marker.
Q is finished when it has received a marker along each of its incoming channels.

Global State (2)

[Figure: organization of a process and its channels for a distributed snapshot; incoming and outgoing messages, the process state, a marker, and the local file system that records the state.]

Global State (3)

(b) Process Q receives a marker for the first time and records its local state. (c) Q records all incoming messages. (d) Q receives a marker on its incoming channel and finishes recording the state of that incoming channel.

Election Algorithms

Principle: an algorithm requires that some process act as coordinator; the question is how to select this special process dynamically. Note: in many systems the coordinator is chosen by hand (e.g., file servers); this leads to a centralized solution (a single point of failure).
Question: if a coordinator is chosen dynamically, to what extent can we still speak of a centralized or a distributed solution?
Question: is a fully distributed solution (i.e., one without a coordinator) always more robust than any centralized/coordinated solution?

Election by Bullying

Principle: each process has an associated priority (weight); the process with the highest priority should always be elected as the coordinator. Issue: how do we find the heaviest process?
- Any process can just start an election by sending an election message to all other processes (assuming you don't know the weights of the others).
- If a process Pheavy receives an election message from a lighter process Plight, it sends a takeover message to Plight; Plight is out of the race.
- If a process doesn't get a takeover message back, it wins, and
sends a victory message to all other processes.

The Bully Algorithm (1)
[Figure: the bully election algorithm. (a) The previous coordinator has crashed and process 4 holds an election. (b) Processes 5 and 6 respond with OK, telling 4 to stop. (c) Now 5 and 6 each hold an election.]

The Bully Algorithm (2)

Energy-Efficient Real-Time Heterogeneous Server Clusters
Cosmin Rusu, Alexandre Ferreira, Claudio Scordino, Aaron Watson, Rami Melhem and Daniel Mosse
Department of Computer Science, University of Pittsburgh

Abstract

With increasing costs of energy consumption and cooling, power management in server clusters has become an increasingly important design issue. Current clusters for real-time applications are designed to handle peak loads, where all servers are fully utilized. In practice, peak load conditions rarely happen, and clusters are most of the time underutilized. This creates the opportunity for using slower frequencies, and thus smaller energy consumption, with little or no impact on the Quality of Service (QoS), for example performance and timeliness.

In this work we present a cluster-wide QoS-aware technique that dynamically reconfigures the cluster to reduce energy consumption during periods of reduced load. Moreover, we also investigate the effects of local QoS-aware power management using Dynamic Voltage Scaling (DVS). Since most real-world clusters consist of machines of different kind, in terms of both performance and energy consumption, we focus on heterogeneous clusters.

For validation, we describe and evaluate an implementation of the proposed scheme using the Apache webserver in a small, realistic cluster. Our experimental results show that, using our scheme, it is possible to save up to 45% of the total energy consumed by the servers, maintaining average response times within the specified deadlines and the number of dropped requests within the required amount.

1 Introduction

Until recently, performance had been the main concern in server farms, but energy consumption has also become a main concern
in such systems. (Supported by NSF through the Secure-CITI project ANI-0325353 and the Power-Autonomous Networks project ANI-0121658. Claudio Scordino is a PhD student at the University of Pisa, visiting the University of Pittsburgh.) Due to the importance of customer care service in commercial installations, and the importance of timely responses for embedded server clusters, current clusters are typically designed to handle peak loads. However, peak load conditions rarely happen in practice, and clusters are most of the time underutilized. In fact, their loads often vary significantly depending on the time of the day or other external factors; therefore, the average processor use of such systems may be even less than 50% with respect to their peak capacity [7].

Clusters with high peak power need complex and expensive cooling infrastructures to ensure the proper operation of the servers. With power densities increasing due to increasing performance demands and tighter packing, proper cooling becomes even more challenging: fans driving the cooling system may consume up to 50% of the total system power in some commercial servers [16, 18], and manufacturers are facing the problem of building powerful systems without introducing additional techniques such as liquid cooling.

Electricity cost is a significant fraction of the operation cost of data centers [6]. For example, a Google 10kW rack consumes about 10MWh a month (including cooling), which is at least 10% of the operation cost [5], with this fraction likely to increase in the future.

These issues are even more critical in embedded clusters [28], typically untethered devices, in which peak power has an important impact on the size of the system, while energy consumption determines the device lifetime. Examples include satellite systems or other mobile devices with multiple computing platforms, such as the Mars Rover and robotics platforms.

Power management (PM) mechanisms can be divided into two categories: cluster-wide and local [6]. Cluster-wide
mechanisms involve global decisions, such as turning cluster machines on and off according to the load. Local techniques put unused or underutilized resources in low-power states (for example, self-refresh, standby and off modes for DRAM chips, Dynamic Voltage Scaling (DVS) and low-power states for the CPU, disk shutdown, etc.).

A PM mechanism, local or cluster-wide, is QoS-aware if it reduces the power consumption while guaranteeing a certain amount of Quality of Service (QoS), such as average response times or percentage of deadlines met.

To the best of our knowledge, this is the first power management scheme that is simultaneously (a) cluster-wide (i.e., turning machines on and off), (b) designed for heterogeneity, (c) QoS-aware and power-aware at the local servers (i.e., deadline-aware), (d) measurement-based (contrary to theoretical modeling; relying on measurements is the key to our approach), (e) implementation-oriented, and (f) performing reconfiguration decisions at runtime.

Our scheme is realistic, because most clusters have one or more front-ends, are composed of different kinds of machines, and need both local and cluster-wide QoS-aware PM schemes. While the methodology and the algorithms proposed apply to any kind of cluster, we show their use in a web server context. Our measurements show a reduction of energy consumption equal to 17% using only the local PM, 39% using the On-Off scheme, and 45% using both schemes. With respect to delays, the local PM added 0.5ms, while On-Off added about 4.7ms; in all cases the average delay was quite small with respect to the deadlines.

The remainder of the paper is organized as follows. We first present related work in Section 2. The cluster model is given in Section 3. The cluster-wide PM scheme is explained in Section 4, while the local real-time DVS scheme is presented in Section 5. Both schemes are then evaluated in Section 6. In Section 7 we state our conclusions.

2 Related Work

With energy consumption emerging as a key aspect of cluster computing, much recent research has focused on PM in server farms.
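As a concrete illustration of the cluster-wide idea summarized above (and detailed later in Section 4) — pick the set of active machines whose measured power-vs-load curves minimize total power for the current load — here is a minimal sketch. This is not the paper's implementation: the paper precomputes such decisions offline into lookup tables, and the two machines, their capacities, and their linear power models below are invented for illustration.

```python
# Illustrative sketch (not the paper's code): given per-server
# power-vs-load functions, choose how many servers to keep active
# (in power-efficiency order) to minimize total power for a load.
# The capacities and power models are made-up example numbers.

def best_configuration(load, servers):
    """servers: list of (max_load, power_fn), in power-efficiency order.
    For each n, activate the first n servers, fill them in order,
    and return (n, total_power) for the cheapest feasible choice."""
    best = None
    for n in range(1, len(servers) + 1):
        active = servers[:n]
        if sum(cap for cap, _ in active) < load:
            continue  # n servers cannot carry this load
        remaining, power = load, 0.0
        for cap, power_fn in active:
            share = min(remaining, cap)
            remaining -= share
            power += power_fn(share)  # idle power counts even when share is 0
        if best is None or power < best[1]:
            best = (n, power)
    return best

# Two hypothetical machines: a small efficient one and a big fast one.
slow = (0.5, lambda l: 20 + 30 * l)   # ~20 W idle, capacity 0.5
fast = (1.5, lambda l: 70 + 70 * l)   # ~70 W idle, capacity 1.5

print(best_configuration(0.4, [slow, fast]))  # light load: one server wins
print(best_configuration(1.2, [slow, fast]))  # heavy load: both are needed
```

In the paper, the same comparison is evaluated offline over discrete load steps and stored in a table indexed by load, so the runtime decision is a single lookup rather than this direct search.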
A first characterization of power consumption and workload in real-world webservers was made in [7]. DVS was proposed as the main technique to reduce energy consumption in such systems. DVS and request batching techniques were further evaluated in [10]. Software peak power control techniques were investigated in [11]. However, these studies considered only the power consumption of processor and main memory in single-processor systems.

The problem of cluster configuration (i.e., turning cluster machines on and off) for homogeneous clusters was first addressed in [20]. An offline algorithm determines the number of servers needed for a given load. Cluster reconfiguration is then performed online by a process running on a server, using thresholds to prevent too frequent reconfigurations, even though there is no explicit QoS consideration. The authors have extended their work to heterogeneous clusters in [14]. Models have been added for throughput and power consumption estimation. Reconfiguration decisions are made online, based on the precomputed information and the predicted load. The authors also proposed to add request types to improve load estimation in [15].

Our work differs from the above previous studies in the following ways: we consider QoS directly; individual servers are both power-aware and QoS-aware; we rely on offline measurements instead of using models; reconfiguration decisions (i.e., number of active servers and load distribution) are not expensive and are performed online; and reconfiguration thresholds are based on the time needed to boot/shutdown a server.

One of the first attempts to combine cluster-wide and local PM techniques [9] proposed five different policies combining DVS and cluster configuration. However, the theory behind this work relies on (a) homogeneous clusters, and cannot be easily extended to heterogeneous machines, and (b) the often-incorrect assumption that power is a cubic function of the CPU frequency. Another work proposed to use the cluster load, instead of the average CPU frequency, as the
criterion for turning machines on/off [28]. However, this study assumed homogeneous clusters as well. To the best of our knowledge, this work is the first attempt to combine cluster-wide and local techniques in the context of heterogeneous clusters.

In real-time computing, dynamic voltage and frequency scaling has been explored to reduce energy consumption. DVS schemes typically focus on minimizing CPU energy consumption while meeting a performance requirement [29]. DVS work for aperiodic tasks in single processors includes offline and online algorithms assuming worst-case execution times [29, 24], automatic DVS for Linux with distinction between background and interactive jobs [12], and use of knowledge about the distribution of job lengths for voltage scaling decisions [17, 21]. However, these techniques typically aim at reducing the energy consumed only by the CPU [17, 22, 24, 23], and do not take into account other devices (such as memory, power supplies, or disk) that contribute with an important fraction to the total energy consumed by the system. In our model, instead, servers can put their resources in low-power states, and no assumption is made about their local PM schemes.

Most related to our local scheme is Sharma et al.'s work on adaptive algorithms for DVS for a QoS-enabled web server [26]. Their scheme uses a theoretical utilization bound derived in [3] to guarantee the QoS of web requests. However, they take into account only local PM, assuming that a good load balancing algorithm is used at the front-end. In that sense our works are complementary, since we describe how to achieve such load balancing.

3 Cluster Model

This section introduces the cluster model that we consider (see Figure 1). A front-end machine receives requests from clients and redirects them to a set of processing nodes, henceforth referred to as servers. The front-end is not a processing node, and has three main functions: (a) accepting aperiodic requests from clients, (b) distributing the load to servers, and (c) reconfiguring the cluster (i.e., turning
servers on/off) to reduce the global energy consumption while keeping the overall performance within a prespecified QoS requirement. After receiving a request, the front-end communicates to the client to which server the request must be sent, using HTTP redirection [8]. Then the client sends its request directly to the server.

In our cluster scheme, each request is an aperiodic task (i.e., no assumptions are made about task arrival times) and is assigned a deadline. The specification of the QoS is system-wide and is, in our case, the percentage of deadlines met. The way to achieve the soft real-time properties will be presented in detail in the next sections.

Each server in the heterogeneous cluster performs the same service (i.e., all servers can process all requests). No restriction is imposed regarding any aspect of their computation: process scheduling, CPU performance, memory speed/bandwidth, disk speed/bandwidth, power consumption, network bandwidth, etc. In addition, servers periodically inform the front-end about their current load, to aid the front-end in load distribution and cluster configuration decisions. After a request has been processed by a server, the result is returned directly to the client, without the front-end as intermediary.

Note that a more common cluster design is with the front-end acting as a proxy, i.e., acting as intermediary between clients and servers. In our webserver example, choosing one configuration or the other (i.e., proxy versus no proxy with redirection) is simply a configuration option, and the proposed scheme in this paper works equally well with either type of front-end. In our experiments, for high loads (above 1Gbps) we had to use the no-proxy architecture shown in Figure 1, as a proxy front-end cannot fully utilize the cluster in our experimental setup (our front-end has only one GbE network interface card).

When using redirection instead of proxying, the links and internal references should use the full URL, to guarantee that all the requests are sent to the front-end. This way, redirection
works with either the HTTP/1.0 or the HTTP/1.1 protocols. In HTTP/1.1 the client may keep multiple connections open (to the front-end and the servers it was redirected to), but all the requests will be sent to the front-end first.

The aspects related to cluster configuration, PM, and load distribution performed by the front-end will be presented in detail in the next section. Local PM is performed independently by each server, without front-end control, and will be presented in Section 5.

4 Front-end Power Management

Our proposed front-end follows a very general framework that is applicable to any heterogeneous cluster. To achieve this goal, we cannot impose any restriction on server characteristics. However, for ease of presentation, definitions and examples emphasize web server clusters.

[Figure 1: Cluster architecture — clients connect to the front-end, which redirects requests to the servers.]

4.1 Load Definition and Estimation

The front-end determines the number of active servers to meet the desired level of QoS while minimizing cluster energy consumption. The number of servers is computed (offline or online) as a function of the system load. Thus, defining load correctly is a crucial step. A measure of the load for clusters is the number of requests received per second, measured over some recent interval. Clearly, depending on the kind of service under consideration, other definitions of load may be more appropriate (such as the bandwidth for a file server).

At runtime, the front-end needs to correctly estimate (or observe) the load, in order to make PM decisions and to perform load distribution. The load estimation can be further improved by using feedback from the servers.

As observed in previous work [21, 28, 15], load estimation can be greatly improved by considering request types. The type of a request may be conveniently determined only by the header (e.g., the name of the requested file). Notice that the number of types is a design issue. On one hand, different types may not be necessary if the variability of the time to service a request is low. On the other hand, each request could
be of a different type, leading to an improved estimation, but also to a higher overhead (to measure all different types of requests and update statistics tables).

In the case of a web server, there are two main types of requests with different computational characteristics: static and dynamic pages. Static pages reside in the server's memory and do not require much computation. Dynamic pages, instead, are created on-demand through the use of some external language (e.g., Perl or PHP). For this reason, dynamic pages typically require more computation than static ones.

Consider a generic server, and let A_static and A_dynamic be the average execution times to serve a static and a dynamic page, respectively, at the maximum CPU speed. For example, for one server in our cluster we measured an average execution time A_static = 438µs for static pages and A_dynamic = 2457µs for dynamic pages. On average, the time needed by the CPU to serve N_static static requests and N_dynamic dynamic requests is thus N_static·A_static + N_dynamic·A_dynamic seconds. If the number of requests is observed over a period of monitor_period seconds, then the load of the machine serving the requests is:

Load = (N_static·A_static + N_dynamic·A_dynamic) / monitor_period    (1)

Notice that this definition of load assumes a CPU-bound server. This is normal for most web servers, because much of the data are already in memory [7, 27]. In fact, on all our machines we have noticed that the bottleneck of the system was the CPU. However, for systems with different bottlenecks (e.g., disk I/O or network bandwidth), another definition of load may be more appropriate. In fact, the definition of load should account for the bottleneck resource.

Note that, even though web requests may exhibit a large variation in execution times, using the average values A_static and A_dynamic in Equation 1 results in a very accurate load estimation. This is because web requests are relatively small and numerous.

We define the maximum load of a server as the maximum number of requests that it can handle while meeting 95% of the deadlines. The
front-end never directs more than the maximum load to a server. The cluster load is defined as the sum of the current loads of all active servers. Therefore, the maximum load that the cluster can handle is the sum of the maximum loads of all servers.

At runtime, the cluster load (i.e., both variables N_static and N_dynamic) is observed every monitor_period seconds. The value of monitor_period is a design issue, related to the tradeoff between response time and overhead. In our cluster, values in the order of a few seconds were found suitable.

4.2 Server Information

In order to reduce the global power consumption at runtime, we furnish the front-end with information about the average power consumption of each server for any different value of its load. Servers can reduce their own power consumption in a number of different ways, such as using DVS and low-power states for the CPU, self-refresh modes for memory, stopping disk spinning after some time of idleness, etc. Moreover, each server may use a different OS, or a different scheduling policy (such as a standard round-robin, or a real-time policy to give higher priority to static pages with respect to dynamic ones). No assumption is made at the front-end about local PM schemes.

Once the local PM scheme, the OS, and the scheduling policy have been decided for a server, the power consumption as a function of the load, and the maximum load, can be determined through simple measurements. In our experiments, after choosing the local PM scheme (see Section 5), we measured the average power consumption for a load in 5% increments. Then we interpolated the points to have values in 1% increments. We measured the total power consumed by the whole machines, not only by their CPUs. In our case, recording the average power consumption for a given load over a period of 10 minutes was sufficient to obtain a good average. We measured AC power directly, with a simple power meter with 2% accuracy [25]. Hence, the whole process required at most a few hours for each machine. Clearly, identical machines
need not be measured twice. The curve representing the power consumption of each server of our cluster is shown in Figure 2. The last point on each curve represents the maximum load that meets our QoS specification (i.e., 95% of deadlines met), normalized to the fastest machine in the cluster. The parameters for each machine are reported in Table 2 on Page 9.

[Figure 2: Power consumption (W) vs. load, one curve per server (Green, Silver, Blue, Transmeta).]

The information about power consumption of servers can then be used by the front-end at runtime. Since the front-end controls the load for each server, and the power consumption of each server for a given load is known, the front-end has enough information for PM decisions. Notice that the cooling costs of the room have not been taken into account. However, since these costs are expected to be proportional to the AC power drawn, they are automatically reduced by minimizing cluster power consumption.

We now present all the information (and the corresponding notation) about each server needed at the front-end level: boot_time_i and shutdown_time_i represent the time to boot and to shutdown server i, including the time to start and finish the webserver process of the server; max_load_i is the maximum load of server i that can satisfy the 95% QoS requirement; off_power_i is the power consumed when the server is off (some components, such as the Wake-On-LAN interface used to power up the machine, may not be completely off). Finally, power_vs_load_i is an array recording the measured power consumption of server i for each value of the load, in load_increment percents (we used 1%). The first entry of the array denotes the idle power (i.e., no load).

4.3 On-Off Policy

This section describes the key idea behind our cluster-wide PM scheme. The front-end, besides distributing the load to servers to minimize global power consumption, determines the cluster configuration by turning servers on/off. Below is the algorithm used by the front-end
to decide which servers will be turned on/off. The algorithm turns machines on and off in a specific order, which is based on the power efficiency of servers (i.e., the integral of power consumption versus load). In our case, according to the values of Figure 2, the ordering for our cluster is Transmeta, Blue, Silver, Green. In some situations we may need to change the order at runtime, as explained later.

The front-end turns on servers as the cluster load increases. However, since the boot time is not negligible, we need to turn machines on before they are actually needed. For this reason, the front-end maintains a variable called max_load_increase, which specifies the maximum load variation that the cluster is prepared to sustain during monitor_period. This is essentially the maximum slope of the load characterization for the cluster.

The on-off policy relies on two tables, computed offline. The first table, called mandatory_servers, keeps the load at which to turn servers on, and is used to determine the lowest number of servers needed at a certain load to satisfy the QoS requirement. For example, consider a cluster with three servers, having maximum loads max_load_0 = 0.5, max_load_1 = 1.5, and max_load_2 = 1.0, respectively. Suppose that monitor_period is 5 seconds, max_load_increase is equal to 0.05, and the boot time is 10 seconds for every machine. Ideally, we need only one server when the cluster load is less than 0.5, two servers when the load is between 0.5 and 2, and all servers when the load is higher than 2. However, if we account for the time to boot a new machine, and we suppose that the cluster load is checked periodically every monitor_period seconds, the table becomes mandatory_servers = [0, 0.35, 1.85]. Thus, the first server is always on, whereas the second and third servers are turned on when the cluster load reaches 0.35 and 1.85, respectively. In fact, if we consider the boot time of a new server, we have to account for a potential load increase equal to (boot_time_i / monitor_period) · max_load_increase. Moreover, if we suppose that the load is checked periodically
every monitor_period seconds, we have to introduce an additional interval of time, to account for the error when measuring the current load. In general, server i is turned on when the cluster load reaches:

Sum_{j=0}^{i-1} max_load_j − (boot_time_i / monitor_period + 1) · max_load_increase

The second table, called power_servers, precomputes the number of servers needed to minimize the power consumption for a given load. Unlike the previous table, this table is computed considering the power consumption of servers, and is used to distribute the current load among active servers. For a given value of N, we compute the power consumption of the cluster as follows. We start considering a load equal to zero, and we increase its value in load_increment increments. For any increment of the load, we evaluate which server can handle it in order to minimize the overall energy consumption.

To determine the load at which N servers become more power efficient than N−1, we follow this procedure, considering both cases of N−1 and N machines, respectively. The load at which N servers consume less power than N−1 servers is the value after which the N-th server is turned on. The server to be turned on is the next one, according to the power efficiency order.

The complexity of computing the two tables is O(N), where N is the number of servers, for mandatory_servers, and O(N·M) for power_servers, where M is the number of discrete load values (100/load_increment). In our cluster, the time to compute these two tables was less than 1ms, which was negligible compared to monitor_period, which is in the range of seconds. Thus, this computation can also be performed online. For example, a new ordering of the servers and an online recalculation of the tables become necessary when a server crashes.

A high-level view of the front-end on-off policy is presented in Figure 3. Every monitor_period seconds, the load is estimated according to Equation 1; then the request counters are reset. The number of mandatory servers, N_mandatory, is determined by a lookup in the mandatory_servers table. If N_mandatory is higher than the
current number of active servers, N_current, all needed servers are immediately turned on.

Each server can be in one of the following states: Off, Boot, On, or Shutdown. After receiving the boot command (such as a Wake-On-LAN packet), server i moves from the Off to the Boot state. It stays in this state for boot_time_i seconds (i.e., until it starts the server process); then it informs the front-end that it is available for processing, moving to the On state. When server i is shutdown, it stays in the Shutdown state for shutdown_time_i seconds; after that, the front-end changes its state to Off.

Figure 3 (On-Off policy):
1. Every monitor_period seconds:
   1.1 Compute the load according to Equation 1.
   1.2 Reset the counters: N_static = 0, N_dynamic = 0.
   1.3 Compute the minimum number of servers that can handle the load: N_mandatory = mandatory_servers[Load].
   1.4 If N_mandatory > N_current: turn on the servers; set N_current = N_mandatory; return.
   1.5 Compute the number of servers needed to reduce the energy consumption: N_power = power_servers[Load].
   1.6 If N_power > N_current and Cmd != Boot: set Cmd = Boot; find the next server i to boot; set N_current = N_current + 1; return.
   1.7 If N_power < N_current and Cmd != Shutdown: set Cmd = Shutdown; find the next server i to shutdown; set N_current = N_current − 1; return.
2. If Cmd = Boot for a period of time equal to boot_time_i:
   2.1 Turn on server i. 2.2 Set Cmd = None. 2.3 Return.
3. If Cmd = Shutdown for a period of time equal to boot_time_i + shutdown_time_i:
   3.1 Turn off server i. 3.2 Set Cmd = None. 3.3 Return.

The variable Cmd in Figure 3 can have three different values: None, Boot, or Shutdown. This variable allows us to describe the use of thresholds when turning servers on/off. If no server is in transition (i.e., all servers are in the On or Off states), a server may be turned on or off, as decided after a lookup in the power_servers table. To be conservative, only one server at a time is turned on or off. Server i is turned off if the system is in state Cmd = Shutdown for at least shutdown_time_i + boot_time_i consecutive seconds, which is the rent-to-own threshold (see Step 3, Figure 3). Similarly, server i is
turned on if Cmd = Boot for boot_time_i consecutive seconds (see Step 2, Figure 3). Notice that these thresholds do not apply to the mandatory servers, which are started immediately. The running time of the online part of the algorithm (every monitor_period seconds) is negligible, because it is in the microsecond range: the complexity is O(N), but can be improved to O(1) by increasing the table size from N to M, that is, storing all entries in an array.

For convex and linear power functions, the tables mandatory_servers and power_servers contain the optimal transition points in the discrete space (for continuous space, see [23]). In practice, however, power functions may have concave regions. This means that a server with an abrupt power increase at some load l may not be allocated more than l load, even though the power may become flat above l + δ, making it a good target for load allocation. A simple fix to the problem is to consider the average power consumption over a larger interval, rather than the exact value at each load. This effectively results in smoothing the power functions. In our case, although the measured power functions have concave regions, we have found that no smoothing was necessary.

4.4 Request Distribution Policy

The front-end distributes the incoming requests to a subset of the current servers that are in the On state. load_allocation is a table containing the estimated load allocated to each server, and is computed with the same procedure used to determine the power_servers table, in O(M·N) time. The load allocation is computed every monitor_period seconds, after the on-off decisions.

Another table, called load_accumulated, stores the accumulated load of each server, and is reset after computing load_allocation. The server i with the minimum weight

w_i = load_accumulated_i / load_allocation_i    (2)

gets the next request. Notice that w_i can be higher than 1 when the load is underestimated. The server that receives the request updates its accumulated load, and thus increases its weight, by adding A_static/monitor_period or A_dynamic/
monitor_period, depending on the request type. The complexity of finding the server with minimum weight is O(N) with a straightforward implementation, but can be improved to O(log N) using a tree.

4.5 Implementation Issues

We implemented our PM scheme in the Apache 1.3.33 web server [4]. We created an Apache module, called mod_onoff, which makes on-off decisions. Moreover, we extended an existing module, mod_backhand [2], to support our distribution policy.

mod_backhand is a module responsible for load distribution in Apache clusters. It allows servers to exchange information about their current usage of resources. It also provides a set of candidacy functions, to forward requests from an overloaded server to other, less utilized, servers. Examples of such functions are byLoad, which selects as candidate the least loaded server, and byCost, which considers a cost for each request.

We added a new candidacy function, called byEnergy, to implement our request distribution policy. Notice that only front-end machines use this function. In addition, servers provide some feedback about their current real-time utilization (as explained in Section 5) to front-ends. We used this feedback to prevent the overloading of the servers. In particular, the server with the minimum w_i is selected, provided that it is not overloaded.

The mod_onoff module communicates with mod_backhand through shared memory. On initialization, mod_onoff acquires server information and computes both the mandatory_servers and power_servers tables. mod_onoff executes periodically, every monitor_period seconds. On each invocation, it performs the following tasks: (a) computes the current load, based on the counters N_static and N_dynamic that are incremented in the Apache post-read-request phase; (b) looks up in the table to determine the number of servers needed for the next period; (c) computes the load_allocation table for the active servers (not shown in Figure 3); (d) turns servers on (by sending Wake-On-LAN packets) and off (by invoking special CGI scripts); and finally (e) resets the counters
Nmm39c Ndynamic and load ccumulated In addition it displays at runtime the estimated power and energy consumption of each server based on the paweTmslaad and load ccumulated tables 5 Server Power Management In addition to frontend directed cluster recon gurations ie tuniing onoff machines the servers perform their own local PM to reduce power consumption of unutilized or underutilized resources We present an example of a QoS aware DVS scheme and we discuss an implementation us ing the Apache Webserver 4 51 Local DVS Policy We rely on a local realtime scheme where each request is an aperiodic task and is assigned a deadline Each request type 21 28 15 has a deadline to allow for more accurate load estimation We consider a soft realtime system in which the sched ule is not generated by a realtime scheduler and the com putation time Ci is the average execution time ie Asmm or Adynamic not the worstcase Let Di be the time re maining to the deadline then the realtime utilization of a server is de ned as U Z If the CPU is the bottleneck of the system as in our case the CPU frequency to handle this rate of requests is U fmw where fmw is the highest possible frequency of the CPU Each server periodically computes its utilization U and sets the CPU frequency to the closest value higher than U fmax Note that DVS architectures may have inef cient oper ating frequencies 22 which exist when there are higher frequencies that consume less energy A simple online tool for inef cient frequency elimination has been provided in 19 Removal of inef cient operating frequencies is the rst step in any DVS scheme This was not necessary in our servers because surprisingly all frequencies were ef cient although we had a different experience with other systems we tested 28 52 Implementation Issues We implemented an Apache module called modeufreq responsible for CPU speed settings at the user level On Athlon machines the CPU speed was changed by writing to the sys le system using the CPUfreq 
interface [1]. On the Transmeta machine, the speed was changed by writing to a model-specific register (MSR). Since the register cannot be written from user level, we added two system calls for setting and reading its value [13].

After detecting the available frequencies, our module creates an Apache process that periodically sets the CPU frequency according to the current value of U. We chose as period 10ms, to match any default Linux kernel; the measured overhead for changing voltage/frequency in the Athlon64 machines is approximately 50µs.

To compute U, the module needs to know the type (i.e., static or dynamic) and the arrival time of each request. At every request arrival (the Apache post-read-request phase), the arrival time and the deadline are recorded (with µs accuracy) and stored in a hash table in shared memory. The request type is determined from the name of the requested file. Thus, a single queue traversal is necessary to compute U. In fact, the current value of U depends on all queued requests; therefore, the complexity is O(R), where R is the number of requests queued (the overhead is negligible). Requests are removed from the queue after being served (the Apache logging-request phase).

A problem we encountered during the implementation was that our scheme worked very well except for fast machines serving a large amount of small static pages. In this case, those machines were not increasing their speed, resulting in a large number of dropped requests. A further investigation revealed that the value of U was close to zero. We did not see this phenomenon on slower machines (such as Transmeta), nor when using bigger pages. The problem was that the requests were served too fast, in approximately 150µs. Such short requests were queued, served, and removed from the queue before other requests were added to the queue. Thus, at any time, only a few requests (usually just one) were in the queue, and when mod_cpufreq recomputed the utilization, it resulted in an underestimation of U. In other words, even though the requests
were received and queued at the OS level, Apache was not able to see them, because it is a user-level server and has no information about requests stored at the OS level. We called this phenomenon the short-request overload problem. A simple fix was to compute the utilization also over a recent interval of time, interval_size (we used interval_size = 200 ms):

U_recent = (N_static · A_static + N_dynamic · A_dynamic) / interval_size

We would like to keep the server utilization U_recent below a certain threshold (we used threshold = 80%). The minimum frequency that achieves this is (U_recent/threshold) · f_max. Thus, our module sets the CPU speed to max(U, U_recent/threshold) · f_max.

Note that Sharma et al.'s work with a kernel webserver (kHTTPd) [26], which is aware of small requests at the OS level, has a nice synergy with our approach and could be used in lieu of our scheme. Exploring the composition of our cluster configuration with Sharma's (or other similar) DVS work is left for future work. The problem with including such work in our scheme is exactly the reason why the authors discontinued the development of kHTTPd: the difficulty of maintaining, developing, and debugging a kernel-level server.

6 Evaluation

To evaluate our QoS-aware PM scheme, we used a small cluster composed of one front-end and 4 different servers. Every machine ran Gentoo Linux (2.6 kernel) as the operating system and Apache 1.3.33 servers. The parameters of the machines are shown in Tables 1 and 2.

The cluster has been tested using 2 clients connected to the cluster with GbE interfaces and a Gbps switch; the clients generate up to 3186 requests per second, which corresponds to a total maximum cluster load equal to 2.95 (all loads were normalized to that of the Silver machine). A total cluster load of 0.05, or 5%, corresponds on average to 54 requests/second. Considering request types, however, greatly improves the prediction, as 54 requests/second may correspond to a load ranging from 0.02 (if N_dynamic = 0) to 1.32 (if N_static = 0). We assigned deadlines of 50 ms and 200 ms for requests of static and dynamic pages, respectively. We set max_load_increase = 0.005, and therefore we had
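to pick per-server activation thresholds for the On/Off policy. Before the results, the local frequency-selection rule of Section 5 can be recapped in a short sketch; the function names and example numbers below are our own invention for illustration, not the Apache module's actual code:

```python
# Sketch of the local DVS rule: run at the lowest available frequency
# not below max(U, U_recent / threshold) * f_max.

def realtime_utilization(queued):
    """U = sum(C_i / D_i) over queued requests.
    queued: (avg_exec_time, time_to_deadline) pairs, in seconds."""
    return sum(c / d for c, d in queued)

def recent_utilization(n_static, a_static, n_dynamic, a_dynamic, interval):
    """U_recent over the last interval (interval_size = 200 ms in the text)."""
    return (n_static * a_static + n_dynamic * a_dynamic) / interval

def pick_frequency(u, u_recent, freqs, threshold=0.80):
    """Lowest available frequency >= the target speed (f_max if overloaded)."""
    f_max = max(freqs)
    target = max(u, u_recent / threshold) * f_max
    eligible = [f for f in freqs if f >= target]
    return min(eligible) if eligible else f_max

# Invented example: one queued request with 4 ms average cost and 50 ms to
# its deadline, plus ten 4 ms requests observed over the last 200 ms.
u = realtime_utilization([(0.004, 0.050)])                  # 0.08
u_recent = recent_utilization(10, 0.004, 0, 0.200, 0.200)   # 0.2
print(pick_frequency(u, u_recent, [800, 1800, 2000, 2200]))  # -> 800 (MHz)
```

Returning to the cluster-wide configuration, the thresholds used in the experiments were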
mandatory_servers = 0.000, 0.062, 1.012, 2.012 and power_servers = 0.000, 0.100, 1.050, 2.040.

6.1 DVS policy

As a first experiment, we evaluated the effectiveness of our local DVS scheme. We compared our mod_cpufreq module with the default PM in Linux (i.e., the HALT instruction when idle) and with Sharma's DVS scheme for QoS-aware web servers proposed in [26], which we implemented at user level in our mod_cpufreq module. This scheme adjusts the speed of the processor to the minimum speed that maintains a quantity called synthetic utilization below the theoretical utilization bound U_bound = 58.6% that ensures that all deadlines are met [3].

Table 1: Idle/busy power consumption, in Watts, for each server at each frequency.

Transmeta | Frequency (MHz): 333, 400, 533, 667, 733 | Idle (W): 8, 8, 9, 9, 9.5 | Busy (W): 9, 9.5, 10.5, 11.2, 12.5
Blue | Frequency (MHz): 800, 1800, 2000, 2200 | Idle (W): 73, 75, –, 80.5 | Busy (W): 74.5, 93.5, 105.5, 120.5
Silver | Frequency (MHz): 1000, 1800, 2000, 2200, 2400 | Idle (W): 70, 74.5, 78.5, 83.5, 89.5 | Busy (W): 80.5, 92.5, 103.5, 119.5, 140.5

Figure 4: Comparison of DVS policies (average power consumption, in W, of default Linux, Sharma's scheme, and our scheme, for loads from 10% to 90%).

The measured power consumption of each scheme on the Blue machine is shown as a function of the load in Figure 4. The graph shows that our scheme outperforms the other schemes, especially for mid-range load values. Higher savings are obtained on machines with a more convex power function; the power function of the Blue machine is rather linear (see Figure 2). In fact, for a rate of 300 requests/sec (approximately 28% load), the average processor frequency is 1.25 GHz using our scheme and 1.5 GHz using Sharma's scheme, but the amount of energy saved is only 3%. Importantly, we observed that both schemes maintained the QoS level above 99%, even at the highest load.

Table 2: Parameters of the machines of the cluster.

Machine name | Processor model | RAM size | Cache size | Wake-on-LAN support | Boot time (sec) | Shutdown time (sec) | Off power (W) | Max load
Transmeta | Transmeta Crusoe TM5800 | 256 MB | 512 KB | – | 100 | 60 | 1 | 0.10
Blue | AMD Athlon 64 Mobile 3400 | 1 GB | 1 MB | – | 33 | 11 | 8 | 0.95
Silver | AMD Athlon 64 3400 | 1 GB | 512 KB | – | 33 | 12 | 8 | 1.00
Green | AMD Athlon 64 3000 | 1 GB | 512 KB | – | 33 | 11 | 8 | 0.90
Front-end | AMD Athlon 64 Mobile 3400 | 1 GB | 1 MB | Not applicable

6.2 Overall scheme

To evaluate the overall scheme, we performed many experiments with and without the cluster-wide PM scheme (On/Off scheme), and with and without the local PM scheme (DVS scheme). For each load value, we measured the power consumption of the entire machine (not only the CPU) for each scheme independently (see Figure 5). For fairness, we used the load-balancing policy of Section 4.4 for all the schemes.

The On/Off policy allows a striking reduction of the energy consumption for low values of the load because, obviously, it allows unutilized servers to be turned off. In Figure 5 we can see that when the load is close to 0, the cluster consumption is around 32 W, because each Athlon server consumes 8 W when in the Off state and the Transmeta also consumes 8 W when in the On state. The DVS technique, instead, has its biggest impact whenever a new server is turned on, since not all active servers are then fully utilized. However, its importance decreases as the utilization of the active servers increases. For high values of the load (in our case, at 70% or higher), all servers are on; therefore the On/Off technique does not allow a reduction of energy consumption. In those situations, however, there is still room for the DVS technique, which becomes more important than the On/Off technique.

The energy consumption of all servers without any power management scheme was 1.32 kWh. On average, we measured energy savings of 17% using DVS, 39% using On/Off, and 45% using both schemes.

It is worth noting that the front-end estimation of the total energy consumed when using DVS was extremely accurate: the difference from the actual values was less than 1%. For example, when using the On/Off scheme, the measured value was 0.72 kWh, while the front-end estimated value was 0.725 kWh; the resolution of our power/energy meter
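itself also limits how finely the two numbers can be compared. Conceptually, the front-end's estimate interpolates each server's power-vs-load table and integrates power over time; the sketch below illustrates this. The function names and the linear interpolation are our assumptions, and only the 70 W idle and 140.5 W busy endpoints for Silver come from Table 1.

```python
# Sketch (our own, not the front-end's code): estimate energy by looking up
# power as a function of load and summing power * time over the run.
from bisect import bisect_left

def power_watts(table, load):
    """table: (load, watts) pairs sorted by load; linear interpolation."""
    loads = [l for l, _ in table]
    i = bisect_left(loads, load)
    if i == 0:
        return table[0][1]
    if i == len(table):
        return table[-1][1]
    (l0, p0), (l1, p1) = table[i - 1], table[i]
    return p0 + (p1 - p0) * (load - l0) / (l1 - l0)

def energy_kwh(table, samples):
    """samples: (load, seconds) pairs observed at runtime."""
    joules = sum(power_watts(table, load) * secs for load, secs in samples)
    return joules / 3.6e6  # 1 kWh = 3.6e6 J

# Silver spans 70 W idle to 140.5 W busy; half an hour idle, half at load 0.5:
silver = [(0.0, 70.0), (1.0, 140.5)]
print(round(energy_kwh(silver, [(0.0, 1800), (0.5, 1800)]), 4))  # -> 0.0876
```

For reference, the resolution of the meter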
[25] is 0.01 kWh.

To measure the impact of the cluster-wide and local PM schemes on the loss of QoS, we ran many four-hour experiments with workloads derived from actual webserver traces and generated with the same shape of statistics taken from our cs.pitt.edu domain (see Table 3). The average delay observed at the client side without any PM scheme was 8.29 ms (such a small response time is due to all machines being on at all times and running at maximum frequency). Adding DVS (local PM) had a very small impact on the delay, with the average delay measured at 8.77 ms. However, with the On/Off scheme we measured an average delay equal to 12.29 ms without DVS and 12.83 ms with DVS. In both cases, the average delay was not more than 50% higher than the no-PM delay, and was quite small with respect to the deadlines.

Table 3: Web server statistics: percentage of accesses, approximate size for static pages, and running time for dynamic pages.

Request type | % | Request type | %
4 ms CGI | 0.10 | 6–7 KB html | 2.84
7 ms CGI | 0.71 | 7–8 KB html | 1.58
23 ms CGI | 0.98 | 8–9 KB html | 1.80
40 ms CGI | 0.23 | 9–10 KB html | 1.87
200 ms CGI | 0.06 | 10–20 KB html | 10.74
0–1 KB html | 37.78 | 20–30 KB html | 3.62
1–2 KB html | 8.86 | 30–40 KB html | 1.17
2–3 KB html | 6.56 | 40–50 KB html | 0.67
3–4 KB html | 4.58 | 50–60 KB html | 0.80
4–5 KB html | 4.94 | 60–70 KB html | 1.46
5–6 KB html | 3.38 | above 70 KB html | 5.27

Figure 5: Evaluation of cluster-wide and local techniques (average power consumption, in W, of On/Off and On/Off + DVS as a function of load).

7 Conclusions and Future Work

We have presented a new QoS-aware power management scheme that combines cluster-wide (On/Off) and local (DVS) power management techniques in the context of heterogeneous clusters. We have also described and evaluated an implementation of the proposed scheme, using the Apache webserver, in a small realistic cluster.

Our experimental results show that (a) our load estimation is very accurate; (b) the On/Off policy allows a striking reduction of the power consumption; (c) DVS is very important whenever a new server is turned on or, as shown before, when all servers are on; and (d) as
expected, for high values of the load the On/Off technique does not help to reduce energy consumption, but there is still room for DVS. Using both techniques, we saved up to 45% of the total energy with a limited loss in terms of QoS: in the worst case, the average delay was increased by at most 50%, and was still very small when compared to the deadlines.

As immediate future work, we plan to investigate the use of both suspend-to-disk and suspend-to-RAM techniques to reduce the time to boot and shut down a server. We also plan an integration of our cluster PM schemes with other grid-like or cluster (e.g., Condor) load-balancing schemes.

References

[1] Linux kernel CPUfreq subsystem. www.kernel.org/pub/linux/utils/kernel/cpufreq/cpufreq.html
[2] The Backhand Project. http://www.backhand.org
[3] T. F. Abdelzaher and C. Lu. Schedulability Analysis and Utilization Bounds for Highly Scalable Real-Time Services. In Proceedings of the 7th IEEE Real-Time Technology and Applications Symposium (RTAS'01), Taiwan, June 2001.
[4] Apache HTTP Server Project. http://httpd.apache.org
[5] L. A. Barroso, J. Dean, and U. Hölzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, 23(2):22–28, 2003.
[6] R. Bianchini and R. Rajamony. Power and Energy Management for Server Systems. Computer, 37(11):68–74, 2004.
[7] P. Bohrer, E. N. Elnozahy, T. Keller, M. Kistler, C. Lefurgy, C. McDowell, and R. Rajamony. The case for power management in web servers. Kluwer Academic Publishers.
[8] V. Cardellini, M. Colajanni, and P. S. Yu. Redirection Algorithms for Load Sharing in Distributed Web-server Systems. In 19th IEEE International Conference on Distributed Computing Systems (ICDCS'99), Austin, TX, June 1999.
[9] M. Elnozahy, M. Kistler, and R. Rajamony. Energy-Efficient Server Clusters. In Workshop on Power-Aware Computer Systems (PACS'02), 2002.
[10] M. Elnozahy, M. Kistler, and R. Rajamony. Energy Conservation Policies for Web Servers. In 4th USENIX Symposium on Internet Technologies and Systems, Seattle, Mar. 2003.
[11] W. Felter, K. Rajamani, T. Keller, and C. Rusu. A performance-conserving approach for reducing peak power
consumption in server systems. In International Conference on Supercomputing (ICS'05), pages 293–302, Cambridge, Massachusetts, June 2005.
[12] K. Flautner and T. Mudge. Vertigo: Automatic Performance-Setting for Linux. In 5th Symposium on Operating Systems Design and Implementation, Dec. 2002.
[13] S. Gleason. Power Aware Operating Systems: Task-Specific CPU Throttling. http://www.cs.pitt.edu/PARTS/implementation
[14] T. Heath, B. Diniz, E. V. Carrera, W. Meira Jr., and R. Bianchini. Self-Configuring Heterogeneous Server Clusters. In Workshop on Compilers and Operating Systems for Low Power (COLP'03), September 2003.
[15] T. Heath, B. Diniz, E. V. Carrera, W. Meira Jr., and R. Bianchini. Energy Conservation in Heterogeneous Server Clusters. In 10th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, June 2005.
[16] C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and T. W. Keller. Energy management for commercial servers. IEEE Computer, 36(12):39–48, Dec. 2003.
[17] J. Lorch and A. Smith. Improving Dynamic Voltage Scaling Algorithms with PACE. In ACM SIGMETRICS, June 2001.
[18] J. Moore, R. Sharma, R. Shih, J. Chase, C. Patel, and P. Ranganathan. Going Beyond CPUs: The Potential of Temperature-Aware Data Center Architectures. In 1st Workshop on Temperature-Aware Computer Systems, 2004.
[19] PARTS Power Efficiency Test. http://www.cs.pitt.edu/PARTS/demos/efficient
[20] E. Pinheiro, R. Bianchini, E. V. Carrera, and T. Heath. Load Balancing and Unbalancing for Power and Performance in Cluster-Based Systems. In Workshop on Compilers and Operating Systems for Low Power (COLP'01), Sept. 2001.
[21] C. Rusu, R. Xu, R. Melhem, and D. Mossé. Energy-Efficient Policies for Request-Driven Soft Real-Time Systems. In Euromicro Conference on Real-Time Systems (ECRTS'04), Catania, Italy, July 2004.
[22] S. Saewong and R. Rajkumar. Practical Voltage-Scaling for Fixed-Priority RT Systems. In Proceedings of the 9th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS'03), May 2003.
[23] C. Scordino and E. Bini. Optimal Speed Assignment for Probabilistic Execution Times. In 2nd Workshop on Power-Aware Real-Time Computing (PARC'05), NJ, Sept. 2005.
[24] C.
Scordino and G. Lipari. Using resource reservation techniques for power-aware scheduling. In 4th ACM International Conference on Embedded Software, Pisa, Italy, 2004.
[25] Seasonic Power Angel. http://www.seasonicusa.com/products.php?lineId=8
[26] V. Sharma, A. Thomas, T. Abdelzaher, K. Skadron, and Z. Lu. Power-aware QoS Management in Web Servers. In Proceedings of the 24th IEEE Real-Time Systems Symposium (RTSS'03), Cancun, Mexico, December 2003.
[27] M. Xiong, S. Han, and K.-Y. Lam. A Deferrable Scheduling Algorithm for Real-Time Transactions Maintaining Data Freshness. In IEEE Real-Time Systems Symposium, Miami, Florida, Dec. 2005.
[28] R. Xu, D. Zhu, C. Rusu, R. Melhem, and D. Mossé. Energy-Efficient Policies for Embedded Clusters. In ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES'05), June 2005.
[29] F. Yao, A. Demers, and S. Shenker. A Scheduling Model for Reduced CPU Energy. In IEEE Annual Foundations of Computer Science, pages 374–382, 1995.