FAULT TOL COMPUTING
FAULT TOL COMPUTING ECE 257A
Popular in Course
Popular in ELECTRICAL AND COMPUTER ENGINEERING
This 68 page Class Notes was uploaded by Spencer Ondricka on Thursday October 22, 2015. The Class Notes belongs to ECE 257A at University of California Santa Barbara taught by B. Parhami in Fall. Since its upload, it has received 50 views. For similar materials see /class/227031/ece-257a-university-of-california-santa-barbara in ELECTRICAL AND COMPUTER ENGINEERING at University of California Santa Barbara.
Reviews for FAULT TOL COMPUTING
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 10/22/15
1 Ex 1 13 7 J a J UP U Hardware Design Methods v mix 95 7 35 H rm 9 fo ag 7 M m u u M w a I 97 Q 739 gt quot gs Hi 7 f A Nov 2007 r SelfChecking Modules m Slide1 About This Presentation I i II the im1ei A I quot A Q empw er g i new Umiwers y 1 e i e m The mie emieil em heue rm 1131 s ieesmem ueeeifnmg L 39 Urmem iimemizecd wees emimiiteme quot Edition First Revised Revised NOV 2007 Released Oct 2006 SelfChecking Modules Nov 2007 U 33 Slide 2 Of course I have a grandiose sense of selfimportance Who doesn t 5 WEEK A bElz oN LL EIGHT E EEVED 39h39h ln hlbg TOCINSEGM TiHalH f He s saved Sir Lanna Aim because he s always checking his blood glucose Nov 2007 B SelfChecking Modules m Slide 4 Multilevel Model of Dependable Computing Level Component Logic Information System Service Result I Erroneous Degraded I Jlalfunetionzino r4 H Unimpaired LowLevel Impaired MidLevel Impaired HighLevel Impaired Legend JEntry gtDeviation Remedy fTolerance Slide 5 Nov 2007 S B SelfChecking Modules 1 Main Ideas of SelfChecking Design Function unit deSIgned In a Encoded me m Encoded way that faultserrorsmalfns input mm P output manifest themselves as invalid errorspace outputs self checking which are detectable by an external code checker ob dfefoheCRe F Status Four pOSSIbIIItIeS Both function unit and checker okay Only function unit okay false alarm may be raised but this is safe Implementation ofthe desired functionality with coded values e Only checker okay we have either no Circuit function withthe fault 4 output error or a detectable error Neither function unit nor checker okay use 2 output checker a single check signal stuckatokay goes undetected leading to fault accumulation Nov 2007 S B SelfChecking Modules Input Output Slide 6 Cascading of SelfChecking Modules Given selfchecking modules that Input have been designed separately how does one combine them into a selfchecking system Input unchecked f Input checked don t care Encoded Encoded FMlEQE CWH OUtPUt Function input unit ii unit 2 Can remove this checker if we do not expect both units to fail and Function unit 2 translates any noncodeword Output of multiple checkers may be combined in selfchecking manner Input Into noncode output I I Nov 2007 S B SelfChecking Modules 7 Slide 7 TotallySelf Checking Error Signal Combining Encoded Encoded In gt 39 put 0 o o o o o 0 0 0 1 0 0 Simplified truth table 01 10 G 8 8 1 13 8 8 or 31353615 G 00 B 010 0 0 0 0 1 0 1 0 1 00and11asB AI T 011010 T 39 39 39 01 1 1 1 1 In OUt Circuit to 1 0 0 0 O 0 combine 1 0 0 1 1 0 error 1 0 1 0 0 1 tworail 1 1 0 0 O O checker 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 Show that this circuit is selftesting I I Slide8 Nov 2007 S B SelfChecking Modules 7 Totally SelfChecking Design A module is totally selfchecking if it is selfchecking and selftesting If the dashed red arrow option is used too often faults may go undetected for long periods of time raising the danger of a second fault invalidating the selfchecking design unchecked f4 Input checked don t care A selfchecking circuit is selftesting if any fault from the class covered is revealed at NW mB expliot lly tints out utb at least one codes ace in ut n n r s p y p 390 tests for so that the fault IS guaranteed to be f MHE may lag mg m detectable during normal Circuit operation mpm Wm The selftesting property allows us to focus on a small set of faults thus leading to more economical selfchecking circuit implementations with a large fault set cost would be prohibitive Nov 2007 S B SelfChecking Modules 7 Slide 9 SelfMonitoring Design A module is self monitoring with respect to the fault class F if it is 1 Selfchecking with respect to F or 2 Totally selfchecking wrt the fault class Finit g F chosen such that all faults in F develop in time as a sequence of simpler faults the first of which is in F init Faultfree Example A unit that is totallyselfchecking wrt single faults may be deemed selfmonitoring wrt to multiple faults provided that multiple faults develop one by one and slowly over time The selfmonitoring design approach requires the more stringent totallyselfchecking property to be satisfied for a small manageable set of faults while also protecting the unit against a broader fault class Slide 10 Nov 2007 S B SelfChecking Modules r Totally SelfChecking Checkers Conventional code checker Example 5input oddparity checker saO fault on output Pleasant surprise The self checking version is simpler Slide 11 Nov 2007 S B SelfChecking Modules TSC Checker for m out of 2m Code Divide the 2m bits into two disjoint subsets A and B of m bits each Let v and W be the weight of number of 1s in A and B respectively Implement the two code checker outputs e0 and e1 as follows m e0 Vv2iw2m i i0 v 15 w 15 ieven r A r A N m mbits mbits e1 v V 2 m j SubsetA subset B 1 UOdd v 15 w 15 f H f H Example 3out of 6 code checker m 3 A a b c B f g h e0vOw23vv22w21fghvabvbcvcafvgvh e1v21w22vv23wgOavbvcfgvghvhfvabc Always satisfied Nov 2007 S B SelfChecking Modules Slide 12 Another TSC m outof 2m Code Checker Cellular realization due to J E Smith This design is testable with only 2m inputs all having m consecutive 1s in cyclic order lt m 1 stages gt Slide 13 Nov 2007 S B SelfChecking Modules r Using 2out of 4 Checkers as Building Blocks Building m out of 2m TSC checkers 3 s m s 6 from 2outof4 checkers construction due to Lala Busaba and Zhao Examples 3out of 6 and 4outof8 TSC checkers are depicted below only the structure is shown some design details are missing 123456123456 1234 5678 34781256 llll llll llll llll 2outof 4l l2 outof 4l l2out of 4 l2outof 4l I Slightly different from an ordinary 2outof 4 checker 21 l chemo 3outof6 4o utof8 Nov 2007 S B SelfChecking Modules Slide 14 TSC Checker for k outof n Code 2outof 5 One design strategy is to proceed in 3 stages Convert the k out of n code to a 1out of Z code 1outof10 10 AND gates Convert the latter code to an m out of 2m code 10 bits Check the m out of 2m code using a TSC checker 3outof6 6 OR gates This approach is impractical for many codes TSC checker A procedure due to Marouf and Friedman m Implement 6 functions of the general form emj E V m 39 72 these have different subsets of bits as inputs and constitute a 1out of 6 code Use a TSC 1out of 6 to 2outof4 converter Use a TSC 2outof4 code checker e0 e1e2 e3 e4 e5 The process above works for 2m 2 s k 3 4m 2outof4 OR gates It can be somewhat simplified for k 2m 1 TSC checker Slide 15 Nov 2007 B SelfChecking Modules 7 TSC Checkers for Separable Codes Here is a general strategy for designing totallyself checking checkers for separable codes Generate n k complement of check bits k data bits Tworail checker n k check bits eo Input W O r d TSC code checker gte1 Checker outputs For many codes direct synthesis will produce a faster andor more compact totallyself checking checker Google search for totally self checking checker produces 442 hits SelfChecking Modules Nov 2007 S B Slide 16 TSC Design with Parity Prediction Recall our discussion of parity prediction as an alternative to duplication a i u k u generator 3 Parity 9 5 Parity encoded Parity c gt ALU Emmi encoded predictor 39 V L D k l Ordinary ALU Rm Error Lug2mmumzmuuuuamzmuuuanEu um signal If the parity predictor produces the complement of the output parity and the XOR gate is removed we have a selfchecking design i I n c an ii I u H To ensure the T80 property we must also verify that the parity predictor is testable only with input codewords Slide 17 Nov 2007 S B SelfChecking Modules TSC Design with Residue Encoding Residue checking is applicable directly to addition subtraction and multiplication and with some extra effort to other arithmetic operations XXmOdA y ymodA Add Find mod A s s modA Nov 2007 S B l Not I equal Error SelfChecking Modules To make this scheme TSC Modify the Find mod A box to produce the complement of the residue Use tworail checker instead of comparator Verify the selftesting property if the residue channel is not completely independent of the main computation not needed for addsubtract and multiply Slide 18 SelfChecking Design with FPGAs LUTbased FPGAs can suffer from the following fault types Single s a faults in RAM cells Single s a faults on signal lines Functional faults in a multiplexer within a single CLB Functional faults in a D flipflop within a single CLB Single sa faults in pass transistors connecting CLBs Synthesis algorithm 1 Use scripts in the Berkeley synthesis tool SlS to decompose an SOP expression into an optimal collection of parts with 4 or fewer variables 2 Assign each part to a functional cell that produces a 2rail output 3 Connect the outputs of a pair of intermediate functional cells to the inputs of a checker cell and find the output equations for that cell 4 Cascade the checker cells to form a checker tree Ref PK Lala and AL Burress IEEE Trans Instrumentation and Measurement Vol 52 No 5 pp 13911398 October 2003 Nov 2007 S B SelfChecking Modules Slide 19 Synthesis of TSC Systems from TSC Modules System consists of a set of modules with interconnections modeled by a directed graph Theorem 1 A sufficient condition for a system to be TSC with respect to all singlemodule failures is to add checkers to the system such that if a path leads from a module M to itself a loop then it encounters at least one checker Theorem 2 A sufficient condition for a system to be TSC with respect to all multiple module failures in the module set A M is to have no loop containing two modules in A in its path and at least one checker in any path leading from one module in A to any other module in A Optimal placement of checkers to satisfy these condition Easily solved when checker cost is the same at every interface Slide 20 Nov 2007 S B SelfChecking Modules r Partially SelfChecking Units Some ALU functions such as logical operations cannot be checked using lowredundancy codes Such an ALU can be made partially selfchecking by circumventing the errorchecking process in cases where codes are not applicable 01 10 G to Selfchecking ALU Residumeck 01D 10gg ttom with residue code errorSI nal ALU error Indicator 00 11 B O 1 1 0 In Out Do not check Check X B B The checkdonotcheck indicator B C B is produced by the control unit B D G Normal G C G 01 operation G D G 10 Nov 2007 S B SelfChecking Modules Slide 21 39 V 1 3 r1 C 717 K gikib 79 3 Hz J ill xL U U 23 E J DU v 7 x NSC 7 73 SirJ D U iiiquot1979 U L Hardware 39quot 14439 Nov 2007 r Hardware Implementation Strategies m Slide1 About This Presentation beam the teicairwetie EQIE 2 x 1quot tkeliemt xqvtnimpwtmmgw H 6 mid empwter Pelt Eitng tiling d The Quiet 39quot L F Utmetuti me ized mm Iit qaimibt rttedt D Femem Edition Released Revised Revised First Nov 2006 Nov 2007 Slide 2 Nov 2007 Hardware Implementation Strategies IeS Hardware Implementation Strateg 4 m m r s b g e t a r t S n D t a t n e m b D m e r a W 0 r a H 9 Ema Nov2007 g 2000 Randy GIastgen WWWQMSWVQMWm Copyright 2001 by Randy Giasbergcn Wawgla5blrgencom f LASBERGEN When I m away fun my deSk it goes into 5133 Crashing is an expression of hostility against your modequotblquot the snoring annoys my coworkers network administrator Though you appear to be uncooperative it s actually a desperate cry for help You must clearly explain your problem SUDDENLY FOL MYSELF oOO wwwgeneralcumic5com Nov 2007 B Hardware Implementation Strategies m Slide 4 7 Multilevel Model of Dependable Computing Level Component Logic Information System Service Result I Erroneous Degraded I Jlalfunetionzino r4 H Unimpaired LowLevel Impaired MidLevel Impaired HighLevel Impaired Legend JEntry gtDeviation Remedy fTolerance Slide 5 Nov 2007 S B Hardware Implementation Strategies 1 HardwareBased ToleranceRecovery Methods Data path methods Control unit methods Replication in space costly Coding of control signals delicate and compare Controlflow watchdog Triplicate and vote Sew checking design Pairandspare NMRhybrid Replication in time slow Glue logic methods Recompute and compare Selfchecking design Recompute and vote Alternating logic Recompute after shift Recompute after swap Replicate operand segments Control unit Mixed spacetime replication control 39 Condition Monitoring imperfect coverage S39gna39s I R Slgnals Watchdog timer Activity monitor gt gt Lowredundancy coding Parity prediction 39 pUtS OUtpUtS Residue checking gt D t th gt Selfchecking design a a pa 39 Slide 6 Nov 2007 U C S B Hardware Implementation Strategies Replication of DataPath Elements in Space Switch The following schemes have already been discussed in connection with fault tolerance Comparator Duplicate and compare so Triplicate and vote Nov 2007 l B Hardware Implementation Strategies 7 Slide 7 Main Drawback of Replication in Time Can be slow but in many control applications extra time is available lnterleaving of the primary and duplicate computations saves time Duplicate Duplicate computation computation Schedule with 1 adder Computation flowgraph and schedule with 2 adders Nov 2007 B Hardware Implementation Strategies Slide 8 Recompute and Compare Vote Repeat computation and store the results for comparison or voting Comparison or voting need not be done right away primary result may be used in further computations with the result subsequently validated if appropriate Duplicate computation On a simultaneous Triplicate multithreading computat39on architecture multiple instruction streams may be interspersed 39 39 39 39 Some Cray machines quot take advantage of extensive hardware resources to execute instructions twice Use as operand in further computations while awaiting confirmation of validity Slide 9 Nov 2007 S B Hardware Implementation Strategies 1 Alternating Logic Basic Ideas Transmission of data over unreliable wires or buses Send data store at receiving end Send bitwise complement of data Compare the two versions Detects wires saO or s a1 as well as many transients The dual of a Boolean function fx1 x2 X is another function fdx1 x2 xn such that fdx1 x2 xn f x1x2 xn Fact Obtain the dual of fby exchanging AND and OR operators in its logical expression For example the dual of f ab v c is fd a v bc Advantages of this Inputs approach compared to duplication include Compl a smaller probability inputs Of common errors Slide 10 Alternating Logic SelfDual Functions Afunction fis selfdual if fX1 X2 Xn fdx1 X2 X For example both the Use same Circuit twnce sum a G b ea c and Inputs Output carry ab v bc v ca D07 Error outputs of a fulladder Compl I are selfdual functions 39 pUts With a selfdual function f the functions fand fd in the diagram above can be computed by using the same circuit twice time redundancy Many functions of practical interest are selfdual Examples proofs left as exercise A kbit binary adder with 2k 1 inputs and k 1 outputs is selfdual So are 1 s complement and 2 scomplement versions of such an adder Slide 11 Nov 2007 U C S B Hardware Implementation Strategies r Recomputing with Transformed Operands Alternating logic is a special case of the following general scheme with its encoding and decoding functions being bitwise complementation Inputs 39j Output Encode Decode Error Inputs XNOR If lower path finds complement of the result Recompute after shift When fis binary addition we can use shifts for encoding and decoding Shifting causes the adder circuits to be exercised differently each time Originally proposed for ALUs with bitslice organization Recompute after swap When fis binary addition we can use swaps for encoding and decoding Swap the two operands eg compute b a instead of a b Swap upper and lower halves of the two operands modified adder Slide 12 Nov 2007 U C S B Hardware Implementation Strategies r TimeRedundant Segmented Addition Instead of using a kbit adder twice for error detection or 3 times for error correction one can segment the operands into 2 or 3 parts and similarly segment the adder perform replicated addition on operand segments and use comparisonvoting to detectcorrect error FF 0 Sum computed in two cycles The lower half in cycle 1 and Lower half the upper half in cycle 2 of adder Upper half of adder Error Comparator Various other segmentation schemes have been suggested Example 16bit adder with 4way segmentation and voting Mixed SpaceTime Replication Instead of duplicating the computation with no hardware change slow or duplicating the entire hardware costly we can add some hardware to make the interleaved recomputations more efficient Recomputation with same Consider the effect hardware resources T 5 of including a excluding compare time second adder Duplicate Recomputation with the inclusion l i l i of an extra adder T 3 excluding compare time computation Original computation T 3 Monitoring via Watchdog Timers Monitor or watchdog is a hardware unit that checks on the activities of a function unit Funcf ion Monitoror un39t watchdog Watchdog is usually much simpler and thus more reliable than the unit it monitors Watchdog timer counts down beginning from a preset number It expects to be preset periodically by the unit that it monitors If the count reaches 0 the watchdog timer raises an exception flag Watchdog timer can also help in monitoring unit interactions When one unit sends a request or message it sets a watchdog timer If no response arrives within the allotted time a failure is assumed Watchdog timer obviously does not detect all problems It verifies liveness of the unit it monitors good with failsilent units Often used in conjunction with other tolerancerecovery methods Slide 15 Nov 2007 U C S B Hardware Implementation Strategies Activity Monitor Watchdog unit monitors events occurring in and activities performed by the function unit Funcl39on Activity eg event frequency and relative timing mt mon39tor Observed behavior is compared against expected behavior The type of monitoring is highly applicationdependent Example Monitoring of program or microprogram sequencing Activity monitor receives contents of microprogram counter If new value is not incremented version of old value then it ascertains that the instruction just executed was a branch orjump Example Matching assertionsfirings of control signals or units against expectations for the instructions executed Slide 16 Nov 2007 U C S B Hardware Implementation Strategies r Design with Parity Codes and Parity Prediction Operands and results are parityencoded Parity is not preserved over arithmetic and logic operations Parity prediction is an alternative to duplication Compared to duplication Parity prediction often involves less overhead in time and space The protection offered by parity prediction is not as comprehensive generator 6 l Parity ALU mwmmi encoded U n V a output Parity encoded inputs l d l X x D N Ll L 7 Ordinary ALU a LWJDME Error signal Slide 17 Parity Prediction for an Adder Operand A 1 o 1 1 o o o 1 Parity o APQA BaPQB Operand B 0 0 1 1 1 0 1 1 Parity 1 I AElDB 1 O O O 1 0 1 0 Paritychecked C adder 0 Carries 0 0 1 1 0 0 1 1 Parity 0 i SumS 11101100 Parity1 8 108 pSpA rpB rco c1 rcz ck V V Inputs Must compute second versions of these carries to ensure independence Parity predictor for our adder consists of a duplicate carry network and an XOR tree Slide 18 Nov 2007 U C S B Hardware Implementation Strategies Coding of Control Signals Encode the control signals using a separable code eg Berger code Either check in every cycle or form a signature over multiple cycles In a microprogrammed control unit store the microinstruction address and compare against MicroPC contents to detect sequencing errors Microinstruction register I I I i i tf n 033 llllllllllllllllllll 53 register Control signals to data path Dispatch E E table 1 Microprogram memory or PLA Dispatch table 2 i l iData E l l I control Slide 19 Nov 2007 U C S B Hardware Implementation Strategies r ControlFlow Watchdog Watchdog unit monitors the instructions executed and their addresses for example by snooping on the bus Instruction sequencer Controlflow Watchdog The watchdog unit may have certain info about program behavior Control flow graph valid branches and procedure calls Signatures of branchfree intervals consecutive instructions Valid memory addresses and required access privileges In an applicationspecific system watchdog info is preloaded in it For a GP system compiler can insert special watchdog directives Overheads of controlflow checking V der memory due to the need for tag bits to distinguish word types Additional memory to store signatures and other watchdog info Stolen processorbus cycles by the watchdog unit Nov 2007 U C S B Hardware Implementation Strategies r Slide 20 l 4 1 Fl 1 1175 f gCZF TM r43 gift TR IVEEC if a J l LL U M ULL K257 u ll M MD M Eh Dealing with HighLevel Impairments Nov 2006 Degradation Allowance and Management m Slide1 About This Presentation Im II quotLine im1ai A I quot A Q empw mer g i Ham Uniwers y 1 a i m The thiaikemiail L fn heue n m a iaeemgmm meaeifnnng L 39 Unamiimmizecd wees midfime quot Edition First Released Revised Revised NOV 2006 Nov 2006 s Degradation Allowance and Management Slide 2 Degradation Allowance and Management Nov 2006 I Degradation Allowance and Management 39 139 1 Slide 3 3 6 SeeFGEA I always give 110 to my job 66 2323 gciurniat Eggsicgg tfs 40 on Monday 30 on Tuesday 20 on y 39 Wednesday 15 on Thursday and 5 on Friday mm M by Mark Parlsi www0l f themawmcnm NAOOF SF I 5ET174E P4077014 DEIIC39EER BE39FOREI 1377 Nov 2006 s B Degradation Allowance and Management m Slide 4 I Multilevel Model of Dependable Computing Level Component Logic Information System Service Result Malfunctioning Degraded Ideal Defective CE Ck Unimpaired LowLevel Impaired MidLevel Impaired HighLevel Impaired Legend JEntry gtDeviation Remedy fTolerance Slide 5 Nov 2006 s B Degradation Allowance and Management 39 39 quotH I 39 Requirements for Graceful Degradation Terminology n Graceful degradation adj Gracefully degradingdegradable failsoft Strategies for failure prevention 1 Quick malfunction diagnosis 2 Effective isolation of malfunctioning elements 3 Online repair preferably via hotpluggable modules 4 Avoidance of catastrophic malfunctions Degradation allowance Provide capability for the system to function without the modules which have been diagnosed as malfunctioning Degradation management Control operation in degraded mode and try to return the system to the intact or a less degraded state ASAP Slide 6 Nov 2006 Degradation Allowance and Management 39 7 quot r m in I a 11 in lluh StateSpace Model of a FailSoft System Intact Degraded Failed Catastrophic malfunction to be avoided Reducing the probability of catastrophic malfunctions Reduce the probability of malfunctions going undetected Increase the accuracy of malfunction diagnosis Make repair rates much greater than malfunction rates keep spares Provide sufficient safety factor in computational capacity Nov 2006 S B Degradation Allowance and Management 39 7 quot hm quotaquot I Slide 7 Importance of Coverage in FailSoft Systems A failsoft system can fail either indirectly due to resource exhaustion or directly because of imperfect coverage analogous to leakage Semidirect path L L y y Indirect path to failure resource exhaustion Intact Failed Direct path to failure imperfect coverage Providing more resources safety factor lengthens the indirect path thus slowing indirect failures but does nothing to block the direct path Saturation effect For a given coverage factor addition of resources beyond a certain point would not be costeffective with regard to the resulting reliability gain same effect observed in standby sparing Slide 8 Nov 2006 S B Degradation Allowance and Management 39 7 quot r in I a 217 Performability of a FailSoft System Performance Partially failed Offline repair f Online repair Done by removalreplacement of affected modules in a way that does not disrupt the operation of the remaining system parts Offline repair Involves shutting down the entire system while affected modules are removed and replacements are plugged in Slide 9 Graceful Degradation after a Malfunction Diagnose the malfunction Unit Remove the malfunctioning unit Update system 08 tables Physical isolation Initiate repair if applicable Create new working configuration Exclude processor channel controller lO device eg sensor Avoid bad tracks on disk garbled files noisy communication links Remove parts of the memory via virtual address mapping Bypass a cache or use only half of it more restricted mapping Perform process and data recovery Ad j f j l lg g ggg Recover state information from removed unit needed lib itdl Initialize any new resource brought online lTQQ FE39 l Elmiii 31 Reactivate processes via rollback or restart WFWWQ gnaw Slide 10 Nov 2006 s B Degradation Allowance and Management 39 7 quot r my I Process and Data Recovery Time Process 1 l Process 2 l Process 3 I I Process 4 l Affected Process 5 l processes Process 6 l o I Checkpoint 1 Checkpoint 2 Detected malfunction Roll back process 2 to the last checkpoint 2 Restart process 6 Rollback or restart creates no problem for tasks that do lO at the end Interactive processes must be handled with more care eg bank ATM transaction to withdraw money or transfer funds check balance reduce balance dispense cash or increase balance Slide 11 Nov 2006 S B Degradation Allowance and Management 39 V quot h Optimal Checkpointing for Long Computations T Total computation time without checkpointing q Number of computation segments there will be q 1 checkpoints TClo Time needed to capture a checkpoint snapshot x Malfunction rate O W 2Tq T I Discrete Markov model I I I l Expected length of stay in 1 kTq 1 kTq each state11 kTq W 40 Where unit time is 77quot Start Chkpt 1 Chkpt 2 End Computation time with checkpointing Ttotal T1 Tq q 1TClo T XTZq M39 q 1TClo thotaldq KT2qM2 Top 0 gt q pt TO xTcp Example T 200 hr x 001 hr TClo 18 hr qClot 200001 001025 2 42 713 z 215 hr Nov 2006 s B Degradation Allowance and Management 39 1 quot r m in I a 11 in lluh Elaboration on Optimal Computation Checkpoints T Total computation time without checkpointing Example 200 hr q Number of computation segments there will be q 1 checkpoints TClo Time needed to capture a checkpoint snapshot Range 18 1000 hr 1 Malfunction rate Example 001 hr Rollback Checkpomting Computation time with checkpointing Ttotal T xPq 91739 q 1TClo thotaldq 4T2q4732 Top 0 gt qopt TO ADVTop dZTtotaIdq2 2XT2q XT3 gt O gt Bowllike curve for Ttotal with a minimum Example q Ttotall 4 6 8 10 20 30 40 50 0 400 300 267 250 222 214 211 208 18 400 301 267 251 225 218 215 m 13 401 302 269 253 229 224 m 224 Top 1 403 305 274 259 m 243 250 257 10 430 350 m 340 412 504 601 698 100 700 800 967 1150 2122 3114 4111 5108 1000 3400 5300 7267 9250 19222 29214 39211 49208 Nov 2006 S B Degradation Allowance and Management 39 1 quot hm quotaquot I 4 r mj Slide 13 Optimal Checkpointing in Transaction Processing PClo Checkpointing period TClo Checkpointing time overhead for capturing a database snapshot Trb Expected rollback time upon malfunction detection Relative checkpointing overhead 0 Ex dx 3 O Top Trb Pop lop gtlt I Checkpoint i 1 Checkpoint i Assume that rollback time given malfunction at time X is a bx b is typically small because only updates need to be reprocessed px Expected rollback time due to malfunction in the time interval 0 x pxdx px a bx dx gt dpxdx a bx gt px kxa bx2 Tm map chpa map2 o Tcp Tm op TopPcp Ma chpz is minimized for PCp 2 TopMp Slide 14 Nov 2006 Degradation Allowance and Management 39 i quot r m in I a 11 l rl nivji Examples for Optimal Database Checkpointing o TcpTrbPcp TopPCp Ma bPCpZ is minimized for Pop zapMp TClo Time needed to capture a checkpoint snapshot 16 min 9 Malfunction rate 00005 min MTTF 2000 min m 333 hr b 01 Pf tV2Tcpxb 800 min N 133 hr Suppose that by using faster memory for saving the checkpoint snapshots eg disk rather than tape we reduce TClo to 1 min PtV2Tcp b 200 min z 33 hr Slide 15 Nov 2006 S B Degradation Allowance and Management 39 1quot h quotaquot I Asynchronous Distributed Checkpointing Time e11 e1 12 Process 1 Chkpt11 Chkpt 13 Process 2 Affected process 21 1 Process 3 31 e32 Chkpt32 Process 4 Process 5 Chkpt 32 Process 6 Detected malfunction Recovery line For noninteracting processes asynchronous checkpoints not a problem When one process is rolled back other processes may have to be rolled back also and this has the potential of creating a domino effect Identifying a consistent set of checkpoints recovery line is nontrivial Slide 16 Nov 2006 S B Degradation Allowance and Management 39 V quot h Checkpointing for Data Consider data objects stored on a primary site and k backup sites with appropriate design such a scheme will be k malfunctiontolerant Each access request is sent to the primary site Read request honored immediately by the primary site One way to deal with a write request Alternative Update requests sent to backup sites Primary site does Request is honored after all messages ack ed frequent backups If one or more backup sites malfunction service is not disrupted If the primary site malfunctions a new primary site is elected distributed election algorithms exist that can tolerate malfunctions Analysis by Huang and Jalote Time in Availability Normal state primary OK data available 0922 0987 Recovery state primary Site IS changing O 996 Checkpoint state primary doing backup 1fklt 039997 ldle no site is OK Lin f lk 0997 Nov 2006 s B Degradation Allowance and Management 39 7 quot 7M 393quot I 3 a r mj Slide 17 Maintaining System Consistency Atomic action Either the entire action is completed or none of it is done One key tool is the ability to ensure atomicity despite malfunctions Similar to a computer guaranteeing sequential execution of instructions even though it may perform some steps in parallel or out of order Where atomicity is useful Upon a write operation ensure that all data replicas are updated Electronic funds transfer reduce one balance increase the other one In centralized systems atomicity can be ensured via locking mechanisms Acquire read or write lock for desired data object and operation Perform operation Release lock waiting Nov 2006 B Degradation Allowance and Management 39 7 quot r 25 I Slide 18 TwoPhase Commit Protocol Ensuring atomicity of actions in a distributed environment Coordinator Where the transaction is initiated Begin to all p Yes from all No from some Abort to all Where some part ofthe transaction is performed Participant Begin Yes Begin No WN K f Execute the quot termination protocol Ei il il To avoid participants being stranded in the wait state eg when the coordinator malfunctions a timeout scheme may be implemented Slide 19 ThreePhase Commit Protocol Twophase commit is a blocking protocol even with timeout transitions Coordinator Participant Begin to all Begin Yes Begin No l r wait v V Yes from all Prepare to all No from some Abort to a Prepare Ack Ack from all Commit to all 7 Slide 20 MalfunctionStop Modules Malfunction tolerance would be much easier if modules simply stopped functioning ratherthan engage in arbitrary behavior Unpredictable Byzantine malfunctions are notoriously hard to handle Assuming the availability of a reliable stable storage along with its controlling s process and approximately synchronized clocks a kmalfunctionstop module can be implemented from k 1 units Operation of s process to decide whether the module has stopped R bag of received requests with appropriate timestamps if R k1 all requests identical and from different sources sz op then if request is a write then perform the write operation in stable storage else if request is a read send value to all processes else set variable stop in stable storage to TRUE Slide 21 Nov 2006 Degradation Allowance and Management 39 7 quot r m in I 39 V 1 3 r1 C 717 K gikib 79 3 Hz J ill xL U U 23 E J DU v 7 x NSC 7 73 SirJ D U iiiquot1979 U L Hardware 39quot 14439 Nov 2006 r Hardware Implementation Strategies m Slide1 About This Presentation Im II tithe im1i A I quot A Q empwtmer g i pt Umiwers ty t g i tetrtm The litigattemiaail L fn l fm m e iezeemmm eetnmg L 39 Utmetutimmizecd wees midfime quot Edition First Released Revised Revised Oct 2006 Nov 2006 Hardware Implementation Strategies Slide 2 IeS Hardware Implementation Strateg 4 m m r s b g e t a r t S n D t a t n e m b D m e r a W 0 r a H 9 Ema Nov2006 g 2000 Randy GIastgen WWWQMSWVQMWm Copyright 2001 by Randy Giasbergcn Wawgla5blrgencom f LASBERGEN When I m away fun my deSk it goes into 5133 Crashing is an expression of hostility against your modequotblquot the snoring annoys my coworkers network administrator Though you appear to be uncooperative it s actually a desperate cry for help You must clearly explain your problem SUDDENLY FOL MYSELF oOO wwwgeneralcumic5com Nov 2006 B Hardware Implementation Strategies m Slide 4 7 Multilevel Model of Dependable Computing Level Component Logic Information System Service Result I Erroneous Degraded I Jlalfunetionzino r4 H Unimpaired LowLevel Impaired MidLevel Impaired HighLevel Impaired Legend JEntry gtDeviation Remedy fTolerance Slide 5 Nov 2006 S B Hardware Implementation Strategies 1
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'