Special Topics: Advanced Database Systems (EECS 700)
These 198 pages of class notes were uploaded by Melissa Metz on Monday, September 7, 2015. They cover EECS 700 at the University of Kansas, taught by Prasad Kulkarni in Fall 2015.
Trusted Computing
Bowe Neuenschwander. EECS 700: Virtual Machines, Dr. Prasad Kulkarni.

Components:
- Endorsement key
- Secure input & output
- Memory curtaining / protected execution
- Sealed storage
- Remote attestation
- Direct anonymous attestation

TPM (Trusted Platform Module) contents:
- Unique RSA key
- Pseudorandom number generator
- Cryptographic key generator

Effects on virtual machines:
- Possible to emulate
- Possible to trap and forward TPM calls

Applications of Trusted Computing:
- Protect hard disk data
- Digital rights management
- Identity theft protection
- Prevention of cheating in online games
- Protection from viruses and spyware
- Protection of biometric authentication data
- Verification of grid computing data

Criticisms of Trusted Computing:
- Digital rights management
- Users have little control over their data
- Users unable to override
- Loss of anonymity
- Upgrading & hardware failure
- Shutting out competing products
- Remote censorship

Next-Generation Secure Computing Base (NGSCB):
- Secure storage
- Attestation
- Curtained memory
- Applications: trusted / untrusted
- No user override

References:
1. "Trusted Computing." Wikipedia, 2009. http://en.wikipedia.org/wiki/Trusted_Computing. Retrieved 2009-03-09.
2. "Direct anonymous attestation." Wikipedia, 2008. http://en.wikipedia.org/wiki/Direct_anonymous_attestation. Retrieved 2009-03-09.
3. "Trusted Platform Module." Wikipedia, 2009. http://en.wikipedia.org/wiki/Trusted_Platform_Module. Retrieved 2009-03-09.
4. "Next-Generation Secure Computing Base." Wikipedia, 2009. http://en.wikipedia.org/wiki/Next-Generation_Secure_Computing_Base. Retrieved 2009-03-09.

Optimizing an ANSI C Interpreter with Superoperators
By Todd A. Proebsting, University of Arizona. Presented by Jonathan Lanning, University of Kansas.

Overview:
- Introduction to the paper
- Translator Output
- Superoperator Optimization
- Translator Design
- Interpreter Generation
- Implementation Details
- System Obstacles
- Experimental Results
- Limitations and Extensions
- Related Work
- Discussion
- Acknowledgements and References
- Conclusion

Introduction
- This paper will
describe the design and implementation of hti, a hybrid translator/interpreter system for ANSI C that has been targeted to both the MIPS R3000 [KH92] and the SPARC [Sun91].
- hti introduces superoperators, a novel optimization technique for customizing interpreted code for space and time. Superoperators automatically fold many atomic operations into a more efficient compound operation, in a fashion similar to supercombinators in functional language implementations.

Translator Output
- For example, the tree for 2+3 would be ADDI(CNSTI, CNSTI), where ADDI represents integer addition.
- For this VM, the translation of 2+3 would be similar to the following:

    .byte 36    # CNSTI
    .word 2     # immediate value
    .byte 36    # CNSTI
    .word 3     # immediate value
    .byte 8     # ADDI

- A prologue on the R3000 looks like the following:

    main:
        li  $24, 192          # put activation record size in $24
        li  $8, 96            # put location of evaluation stack in $8
        la  $25, L1           # put location of bytecode in $25
        j   prologue_scalar   # jump to interpreter

Superoperator Optimization
- ADDP(INDIRP(x), CNSTI) is the most common 3-node IR pattern produced by lcc when it compiles itself.
- The lcc IR has 109 operators, leaving 256 - 109 = 147 bytecodes available for superoperators.
- For example, assume the input trees, with weights, are: I(A(Z,Y)) with weight 10 and A(Y,Y) with weight 1.
- The original operators' frequencies of use are: Y 12, Z 10, I 10, A 11.
- The frequencies of the parent-child pairs are: I-A 10, A-Z 10, A-Y 11.
- The resulting trees after folding are: I(B(Z)) with weight 10 and B(Y) with weight 1.
- The new frequencies of parent-child pairs are: I-B 10, B-Z 10, B-Y 1, A-Y 1.

Translator Design
- Bytecode emitter: hti translates lcc's IR into bytecodes and attributes. The optimal translation of an IR tree into bytecodes is automated via tree pattern matching using burg.
- Some sample patterns in the burg specification:

    stk: ADDP(INDIRP(stk), CNSTI) = 51 (1)
    stk: ADDP(stk, stk)           = 9 (1)
    stk: CNSTI                    = 36 (1)
    stk: INDIRP(stk)              = 77 (1)

- Attribute emitter: hti must emit node attributes after the appropriate bytecodes, e.g. for ADDI(MULI(x, CNSTI), CNSTI). A rule such as

    reg: CNSTI (1)  { emitsymbol(p->syms[0]->x.name, 4, 4); }

Interpreter Generation
- The interpreter is implemented in assembly language.
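Although the real interpreter is generated in assembly, the dispatch-plus-superoperator idea above can be sketched as a toy stack-bytecode interpreter in Python. The opcode numbers below echo the slides' examples (CNSTI = 36, ADDI = 8), but the fused opcode and the overall encoding are invented for illustration, not hti's actual bytecode set:

```python
# A toy stack-bytecode interpreter illustrating superoperator folding.
CNSTI, ADDI, CNSTI_CNSTI_ADDI = 36, 8, 147  # 147: a hypothetical fused superoperator

def run(code):
    """Interpret a bytecode list; immediates follow their opcode inline."""
    stack, pc, dispatches = [], 0, 0
    while pc < len(code):
        op = code[pc]; pc += 1; dispatches += 1
        if op == CNSTI:                # push an integer constant
            stack.append(code[pc]); pc += 1
        elif op == ADDI:               # integer add of top two stack slots
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == CNSTI_CNSTI_ADDI:   # superoperator: the whole 2+3 tree at once
            stack.append(code[pc] + code[pc + 1]); pc += 2
        else:
            raise ValueError(f"bad opcode {op}")
    return stack[-1], dispatches

# Plain translation of the tree ADDI(CNSTI 2, CNSTI 3):
plain = [CNSTI, 2, CNSTI, 3, ADDI]
# The same tree after superoperator folding: one compound opcode.
folded = [CNSTI_CNSTI_ADDI, 2, 3]

assert run(plain) == (5, 3)    # three dispatch iterations
assert run(folded) == (5, 1)   # one dispatch: smaller and faster
```

The dispatch count is the point: folding removes per-operator interpretation overhead, which is the mechanism behind the space and time savings the paper reports.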
- The pattern matcher processes superoperator trees to determine how best to translate each into machine code.
- The interpreter generator selects instructions and allocates temporary registers for each superoperator.
- Machine emitter specifications are not limited to single-operator patterns; complex IR tree patterns may better express the relationship between target machine instructions and lcc's IR.

Implementation Details
- A target machine emitter specification
- lcc back-end code to handle data layout, calling conventions, etc.
- A library of interpreter routines for observing calling conventions
- Machine-dependent interpreter generator routines

System Obstacles
- Limitations of system software on both R3000 and SPARC systems.
- Neither machine's assembler supports unaligned initialized data words or halfwords.

Experimental Results
- Comparisons were made for three programs:
  - burg: a 5000-line tree-pattern-matcher generator processing a 136-rule specification
  - hti: the 13,000-line translator and subject of this paper, translating a 1117-line C file
  - loop: an empty for loop that executes 10^6 times
- [Figure 2: R3000 benchmark code sizes] [Figure 3: SPARC benchmark code sizes]
- [Figure 4: R3000 benchmark code speeds] [Figure 5: SPARC benchmark code speeds]
- Each additional superoperator contributes decreasing marginal returns.

Limitations and Extensions
- The C expression x y cannot be expressed as a single tree by lcc.
- hti generates bytecode as data (assembler directives) and function prologues as assembly language instructions.

Related Work
- Supercombinators optimize combinator-based functional language interpreters in a way similar to how superoperators optimize hti.
- Cint: a C interpreter that, like hti, maintained C calling conventions in order to link with native code routines.
- Saber-C: performs approximately 70 run-time error checks.

Conclusion
- For the MIPS R3000 and the SPARC, each machine required fewer than 800 lines of machine-specific code
to be retargeted.

Discussion
- Superoperators make the interpreted code both smaller and faster.
- Heuristics can automatically isolate beneficial superoperators from static or dynamic feedback information.

Acknowledgements and References
- These are on the last page. You should have read them already.

Questions?

Adaptive Optimization in the Jalapeño JVM
By Matthew Arnold, Stephen Fink, David Grove, Michael Hind, and Peter F. Sweeney. Presented by Michael Jantz.

Introduction
- The dynamic nature of the Java programming language presents both a large challenge and a great opportunity for high-performance Java implementations.
- Features such as dynamic class loading and reflection prevent straightforward application of traditional static compilation and interprocedural optimization. As a result, much effort has been invested in developing dynamic compilers for Java.

Dynamic Compilation
- Because dynamic compilation occurs during application execution, dynamic compilers must carefully balance optimization effectiveness with compilation overhead to maximize system performance.
- However, dynamic compilers can also exploit runtime information to perform optimizations beyond the scope of a purely static compilation model.

A Little History
- Early VMs employed just-in-time (JIT) compilers that relied on simple static strategies to choose compilation targets, typically compiling with a fixed set of optimizations (Smalltalk-80, Self-91).
- Later, more sophisticated VMs began dynamically selecting a subset of all executed methods for optimization, focusing optimization effort on program hot spots (Self-93, HotSpot, early Jalapeño).

Online Feedback-Directed Optimizations
- Research has explored more aggressive forms
of dynamic compilation, using runtime information to tailor the executed code to its current environment.
- Most of these systems are not fully automatic, and most of their techniques have not appeared in mainstream JVMs.
- However, this research has demonstrated that online feedback-directed optimizations can yield substantial performance improvements.

Jalapeño's AOS
- Jalapeño is a research JVM developed at IBM, primarily targeted at server applications. Since the time of this writing it has become open source and is now known as the Jikes RVM (Research Virtual Machine); it is widely used in research.
- This presentation gives an overview of the Jalapeño Adaptive Optimization System (AOS), a key component of the Jalapeño VM, as it was at the time of this writing. Later we will cover the controller model employed by this system in greater detail.

Outline
- Background
- AOS Architecture
- Current Controller Model
- Experimental Results
- Conclusion

Background
- Jalapeño itself is written in Java. The techniques described here apply not only to application code but to the JVM itself: we may apply adaptive optimization to the JVM subsystems, i.e. the compilers, the thread scheduler, the garbage collector, and the AOS itself.
- Drawback: to avoid having to run Jalapeño on another JVM, users must build a boot image file. This file is a snapshot of a Jalapeño JVM written into a file; Jalapeño uses this image to bootstrap itself into a running VM.

Compilation in Jalapeño
- Jalapeño employs a compile-only strategy: all methods are compiled to native code before they execute.
- There are two compilers in Jalapeño:
  - The baseline compiler translates bytecodes into native code by simulating Java's operand stack. It does not perform register allocation and performs only slightly better than bytecode interpretation.
  - The optimizing compiler translates bytecodes into an IR upon which it performs a variety of optimizations. It uses an efficient linear-scan register allocator and may be applied at one of three levels, which successively apply
more aggressive and more expensive optimizations.

Compilation Rates

    Compiler      Bytecode bytes/ms
    Baseline         274.14
    Opt Level 0        8.77
    Opt Level 1        3.59
    Opt Level 2        2.07

    Table 1: Average compilation rates on the SPECjvm98 benchmark suite.

Java Threads and Scheduling
- Jalapeño multiplexes Java threads onto JVM virtual processors; it implements a scheduler in user mode to schedule Java threads onto virtual processors. Virtual processors are essentially threads that are schedulable by the operating system.
- The system supports a quasi-preemptive thread scheduler. Each compiler generates yield points: program points where the running thread checks a dedicated bit in a machine control register to determine if it should yield the virtual processor.
- A timer interrupt periodically sets the dedicated bit; when a running thread next reaches a yield point, the check of this bit results in a call to the scheduler.
- Yield points are currently inserted in method prologues and on loop back edges.

Compilation Scenarios
[Figure 1: Compilation scenarios in the Jalapeño JVM — executing code, lazy compilation stubs, dynamic linker, class loader, AOS, baseline/opt compilers, recompilation plans]

- Notice from Figure 1 that a Jalapeño compiler can be invoked in one of three ways:
  - First, when executing code reaches an unresolved reference, causing a new class to be loaded, the class loader invokes a compiler to compile the class initializer.
  - Second, whenever the executing code attempts to invoke a method that has not yet been compiled (methods are initialized to lazy compilation stubs when their class is loaded).
  - Third, the AOS can invoke a compiler when profiling data suggests that recompiling a method with additional optimizations may be beneficial.
- The remainder of this talk focuses on the third scenario.
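The lazy-compilation-stub mechanism behind the second scenario can be sketched in a few lines. This is a hypothetical model, not Jalapeño's implementation: every method slot starts as a stub that compiles the method on its first invocation and then patches the slot so later calls go straight to the "compiled" code. All class and method names here are invented:

```python
# A sketch of lazy compilation stubs (invented names; not Jalapeño's API).
class MethodTable:
    def __init__(self):
        self.slots, self.compile_count = {}, 0

    def declare(self, name, body):
        def stub(*args):
            compiled = self._compile(body)
            self.slots[name] = compiled       # patch: the stub runs at most once
            return compiled(*args)
        self.slots[name] = stub               # class load installs only stubs

    def _compile(self, body):
        self.compile_count += 1
        return lambda *args: body(*args)      # stand-in for generated native code

    def invoke(self, name, *args):
        return self.slots[name](*args)

vm = MethodTable()
vm.declare("add", lambda a, b: a + b)
assert vm.compile_count == 0           # declared, but never compiled
assert vm.invoke("add", 2, 3) == 5     # first call triggers compilation
assert vm.invoke("add", 4, 4) == 8     # later calls use the patched slot
assert vm.compile_count == 1           # compiled exactly once
```

The key property, as in the slides, is that compilation cost is paid only for methods that actually execute, and only on their first invocation.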
System Architecture
[Figure 2: Architecture of the Jalapeño Adaptive Optimization System — compilers, instrumented code, profiling data, organizer and compilation threads, event queue, compilation queue, controller]

Major Subsystems
- Runtime measurements subsystem
- Controller
- Recompilation subsystem
- AOS database
We will take each of these in turn, in more detail.

Runtime Measurement Subsystem
- The runtime measurement subsystem (RMS) gathers information about the executing methods, summarizes the information, and then either (a) passes the summary to the controller via the organizer event queue, or (b) records it in the AOS database.
- Raw profiling data is gathered from several sources: instrumentation in executed application code, hardware performance monitors, and instrumentation in the VM itself.
- The controller directs the data monitoring and creates organizer threads to process the raw data at specific time intervals. Each organizer analyzes the raw data and either packages it into a form suitable for direct use by the controller, adds the information to the organizer event queue for the controller to process, or records the information for later queries by other AOS components.

The Current RMS
- Sampling is performed using existing Jalapeño mechanisms: whenever the scheduler is entered because a timer interrupt has occurred and the executing thread has reached a yield point, instrumentation in the JVM records the current method before it switches to a new thread.
- Timer interrupts occur every 10 ms, resulting in roughly 100 samples/sec.
- Two organizers periodically process these samples:
  - The hot method organizer searches for methods with a percentage of samples above a certain threshold; the threshold varies from 0.25% to 1%.
  - The adaptive inlining organizer is used to guide inlining decisions. It searches for hot call edges, i.e., caller/callee method pairs where
there is a high frequency of samples attributed to the entry of the callee. This threshold is initialized to 1% and is periodically reduced until it reaches 0.2%.
- For both samplers, a decay mechanism is used to weight more recent behavior more heavily.

Controller
- The controller orchestrates and conducts the other components of the AOS; specifically, it coordinates the activities of the RMS and the recompilation subsystem.
- Based on information from the RMS, the controller performs a cost-benefit analysis (described later) to either instruct the RMS to continue or change its profiling strategy, or recompile one or more methods to improve their performance.
- It uses priority queues to communicate with the other subsystems: it extracts measurement events from a queue filled by the RMS (the organizer event queue), and it inserts recompilation decisions into a queue that the compilation threads process (the compilation queue).

[Figure 2, repeated: Architecture of the Jalapeño Adaptive Optimization System]

Recompilation Subsystem
- This subsystem consists of threads that invoke compilers. These threads extract and execute compilation plans that are inserted into the compilation queue by the controller.
- Recompilation occurs in threads separate from the application and may occur in parallel. Note that this differs from the initial lazy compilation, which occurs the first time a method is invoked: lazy compilation occurs in the application thread that attempted to invoke the uncompiled method.
- The output of the compiler is then installed into the JVM. Currently, previous activations of a freshly compiled method continue to use the old compiled code until those activations complete.

AOS Database
- The AOS database provides a repository where the AOS records decisions,
events, and static analysis results. The various adaptive system components may query this database as needed.
- For example, the controller uses the AOS database to record compilation plans and to track the status and history of methods selected for recompilation.

[Figure 2, repeated: Architecture of the Jalapeño Adaptive Optimization System]

Current Controller Model
- The central role of the controller is to determine whether it is profitable to recompile a hot method with additional optimizations and, if so, which optimization level to use. This decision is made using a simple cost-benefit analysis, described next.
- Number the optimization levels available to the controller from 0 to N. For a method m currently compiled at level i, the controller estimates the following quantities:
  - Ti: the expected time the program will spend executing method m if m is not recompiled.
  - Cj: the cost of recompiling method m at optimization level j, for i <= j <= N. (New profiling information may enable additional speedups over the previous version compiled at level i, so j = i is included.)
  - Tj: the expected time the program will spend executing method m in the future if m is recompiled at level j.
- Using these estimated values, the controller minimizes the expected future running time of a compiled version of m, i.e., it chooses the j that minimizes Cj + Tj. If Cj + Tj < Ti, the controller decides to recompile at level j; otherwise it does not.

Estimating Ti
- The factors in this model are unknowable in practice, and estimating them is an ongoing research problem. The current controller uses the following method to estimate the expected execution time of a method m if m is not recompiled (Ti).
- The controller assumes that if a program has run for n seconds, it will continue to execute for exactly n
more seconds. Define Tf to be the future expected running time of the program.
- Using a weighted average of the samples described earlier (recall that a sample's weight starts at one and decays periodically), the model estimates Pm, the percentage of future time that will be spent in each method m. The controller then predicts the future time spent in each method as:

    Ti = Tf * Pm

Estimating Tj
- Assuming we can estimate the relative speedup of code at optimization level k compared to level 0, we can calculate the expected time the program will spend executing method m in the future if m is recompiled at level j (Tj).
- Let Sk be the speedup estimate for code at optimization level k relative to level 0. Then, if method m is at level i, the future expected running time if we recompile at level j is:

    Tj = Ti * (Si / Sj)

Estimating Compilation Cost and Relative Speedup
- This simple analytical model requires two parameters for each optimization level: the cost to compile at that level (Cj) and the expected speedup from recompiling at that level (Sj).
- The authors estimate these values by running a configuration that compiles all invoked methods with the designated compiler, with no profiling or recompilation occurring (i.e., a non-adaptive system), on the SPECjvm98 benchmarks with input size 100. They gathered the compilation rate (bytecode bytes/ms), which is used for Cj, and the speedup Sj. The results are presented in Table 1; the averages in the last line are used as the default parameters of the analytical model.

Table 1: Compilation rate and speedup of the SPECjvm98 benchmarks run with input size 100. This configuration is not adaptive: it compiles all invoked methods, and no profiling or recompilation occurs. Speedup is measured as the best of five runs relative to a baseline-compiled system.

                 Compilation Rate (bytes/ms)     Speedup over Baseline
    Benchmark   Baseline  Opt 0  Opt 1  Opt 2   Baseline  Opt 0  Opt 1  Opt 2
    compress      318.76   9.59   3.16   1.69     1.00     5.42   6.92   7.50
    jess          287.16   9.16   3.56   1.73     1.00     3.10   5.14   5.33
    db            350.00   9.55   3.20   1.62     1.00     2.54   2.69   2.90
    javac         327.00
     10.00   4.42   1.87     1.00     1.21   3.14   3.46
    mpegaudio     479.16  10.12   4.00   2.08     1.00     7.00  10.34  11.69
    mtrt          336.38   9.33   3.43   1.70     1.00     3.87   6.57   6.68
    jack          369.09   9.23   3.98   1.81     1.00     3.43   4.22   4.73
    Geo. Mean     348.45   9.56   3.65   1.78     1.00     3.36   5.07   5.48

Example
- Suppose the weighted samples suggest an application spends 10% of its execution time in method m, and the program has been running for 10 seconds. Then Ti = 10s * 10% = 1 second.
- Also suppose m is composed of 1000 bytecode bytes, is currently baseline compiled, and the controller is checking whether it would be beneficial to recompile using the optimizing compiler at level 0:

    Tj = 1s * (1 / 3.36) ~= 0.30s

- Now we add in the compile time and check whether recompiling this method would be beneficial:

    Cj = 1000 bytes / 9.56 bytes/ms ~= 105 ms
    Cj + Tj = 0.30s + 0.105s = 0.405s < 1s = Ti

- Thus the controller chooses to recompile the method with at least optimization level 0. (The controller checks the other optimization levels as well, to determine which is optimal.)

Evaluation
- The implementation of the controller model presented here rests on many simplifying assumptions; evaluation of this model is a topic of ongoing and future work.
- In this section the authors present several experiments they used to evaluate the effectiveness of the approach employed in Jalapeño at the time of this writing. For the sake of time, only two of these experiments are presented here.

Experimental Methodology
- The experimental evaluations were performed on an IBM F50 Model 7025 with two 333 MHz PowerPC 604e processors running AIX v4.3, with 1 GB of main memory.
- All experiments were performed using Jalapeño's non-generational copying garbage collector.
- The Jalapeño boot image was compiled using the optimizing compiler at level 2; the optimizing compiler and the adaptive optimization system were included in the boot image.
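Before turning to the experiments, the controller's cost-benefit rule and the worked example above can be reproduced with a short sketch. The rate and speedup constants come from the geometric-mean line of Table 1; the function name and structure are mine, not Jalapeño's:

```python
# A sketch of the controller's cost-benefit decision (invented helper names).
RATE = {0: 9.56, 1: 3.65, 2: 1.78}                    # bytecode bytes per ms
SPEEDUP = {"base": 1.00, 0: 3.36, 1: 5.07, 2: 5.48}   # relative to baseline

def choose_level(future_time_s, pct_in_method, size_bytes, cur_level="base"):
    """Return (best opt level or None, expected future cost in seconds)."""
    t_i = future_time_s * pct_in_method           # time in m if not recompiled
    best, best_cost = None, t_i
    for j in (0, 1, 2):
        t_j = t_i * SPEEDUP[cur_level] / SPEEDUP[j]   # Tj = Ti * Si / Sj
        c_j = size_bytes / RATE[j] / 1000.0           # compile cost, ms -> s
        if c_j + t_j < best_cost:                     # recompile only if it pays
            best, best_cost = j, c_j + t_j
    return best, best_cost

# The worked example: 10s run so far, 10% of time in a 1000-byte baseline method.
level, cost = choose_level(10.0, 0.10, 1000)
assert level == 0            # level 0 minimizes Cj + Tj here
assert cost < 1.0            # expected cost beats Ti = 1s, so recompile
```

Running the same check with a tiny `pct_in_method` returns `None`: for a cold method, every compile cost exceeds the expected future time in the method, so the controller leaves it alone.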
Multi-Level Recompilation
- The first experiment is intended to evaluate the effectiveness of the adaptive multi-level recompilation system described earlier by comparing its performance to both JIT and simple adaptive single-level configurations of Jalapeño. To allow the experiments to focus on the recompilation decisions, none of the configurations perform feedback-directed optimizations (i.e., no adaptive inlining).
- For each benchmark, the following configurations were run:
  - Baseline compiler as a JIT
  - Optimizing compiler at each level as a JIT
  - Adaptive single-level configuration using the optimizing compiler at each level
  - Adaptive multi-level system using the optimizing compiler at any of its three levels
- The experiment was run on the SPECjvm98 benchmarks, the Jalapeño optimizing compiler, and the Volano benchmark (a multithreaded server application).
- Program startup and steady-state performance are also distinguished in the results. Startup performance was gathered by running each benchmark with small input sizes, causing each benchmark to run and then quickly terminate; steady-state performance was gathered over longer program executions.
- The results for SPECjvm98 and the optimizing compiler are the minimum elapsed time from five runs of each benchmark. Volano benchmark performance is reported in terms of message throughput.

[Figure 5: Startup performance — speedup over the JIT baseline for JIT Opt Levels 0-2, Adaptive Opt Levels 0-2, and Adaptive Multi-Level, across compress, jess, db, javac, mpegaudio, mtrt, jack, optcompiler, volano, and the geometric mean]

Startup Performance
- Adaptive recompilation clearly delivers better performance than the JIT configurations; compile-time overhead plays a large role in this regime.
- Note that for all benchmarks, increasing the optimization level in a JIT configuration always causes performance to suffer.
- The best single level of adaptive optimization varies among benchmarks between levels 0 and 1. For this reason, overall, the multi-level optimization strategy delivers the best performance.

Steady-State Performance
[Figure 6: Steady-state performance — speedup over the JIT baseline for JIT Opt Levels 0-2, Adaptive Opt Levels 0-2, and Adaptive Multi-Level, across compress, jess, db, javac, mpegaudio, mtrt, jack, optcompiler, volano, and the geometric mean]

- Each adaptive configuration is competitive with its JIT counterpart.
- The multi-level adaptive system delivers the best performance of the adaptive configurations, with overall performance within 2% of the best JIT configuration.

Good Performance in Both Startup and Steady State
- Figures 5 and 6 show that no one fixed strategy suits a workload with programs that execute for different lengths of time. For long-running programs, the highest optimization level delivers the best performance; for short-running programs, the highest optimization levels deliver the worst performance.
- The adaptive multi-level system attains high performance in both the startup and steady-state regimes.

The Controller Model's Predictive Abilities
- The second experiment explores the effects of predicting the future execution time of an application. The system was modified to take the expected execution time of an application as an argument, and this argument was varied to observe an application's runtime behavior compared with the heuristic described earlier. Note that adaptive inlining is enabled in this experiment.

Tables 2 and 3
- Table 2 presents the runtime behavior of javac when the application's predicted execution time varies; Table 3 presents the runtime behavior of jack when the application's predicted execution time varies.

Interpreting the Results
- Results are shown for the javac and jack benchmarks. Interpret the tables as follows:
  - Row 1 shows the runtime behavior when the heuristic is used.
  - Row 2 shows the runtime behavior when the application's exact execution time is known in advance.
  - Row 3 shows the runtime behavior when the application's expected execution time is three orders of
magnitude larger than its actual execution time.
- The "Total Time" column is the application's execution time. This is subdivided in the "Threads" columns to show how much time the application spent in threads that contribute significantly to total time.
- The "Compiled" columns show how many times the optimizing compiler was invoked, at what level of optimization, and for how many methods. The second number in some columns is the number of methods that were recompiled multiple times before obtaining that level of optimization. No method is ever recompiled more than twice.

Observations
- Generally, the exact prediction of execution time is slightly better than the heuristic. The fact that predicting execution time exactly performs only slightly better indicates that the heuristic performs well in practice.
- Grossly mispredicting execution time has significant performance implications: as expected, when execution time is grossly over-predicted, the controller is too aggressive with compilation.

Conclusions
- The adaptive optimization system of the Jalapeño JVM was presented; more specifically, the controller component of this system was described in detail.
- It was shown that this multi-level optimization strategy can deliver robust performance in both startup and steady-state program regimes, competitive with the best alternative in each regime.
- Finally, it was shown that the current heuristic, which assumes the program will run for twice its current duration, is an effective predictor.
- The Jalapeño system provides a flexible infrastructure for future research on online optimization. Future research topics include automatic specialization, profile-directed memory layout optimizations, refinements to the recompilation analytic model, and consideration of larger server codes based on IBM middleware products.
- Expect the Jalapeño AOS to play a key role in efforts to improve Java server performance.

Cell GC: Using the Cell Synergistic Processor as a Garbage
Collection Coprocessor
Chen-Yong Cher and Michael Gschwind, IBM T.J. Watson Research Center. Presented by Arturo Ramos, The University of Kansas. EECS 700: Virtual Machines, Spring 2009.

Introduction
- What is the Cell Broadband Engine (Cell BE) processor? A heterogeneous multiprocessor found in supercomputers, servers, and the PlayStation 3.
- Designed to give a programmer more control over the internal workings of the hardware, and designed for number crunching: parallel processing of large data sets (like graphics).
- But what about programs not designed for parallelism?

Cell Memory Architecture: The Difference Between Cell and Conventional CPUs
- A conventional CPU has multiple levels of cache, which help hide latency when running random-access applications. Great for single-processor systems; bad for multiprocessor systems (NUMA ring a bell? coherence problems).
- Cell's innovation: only the PPE has cache (32KB instruction cache, 32KB data cache, 512KB L2 cache). The 8 SPEs have local stores (256KB each).
- A local store is a large area of low-latency memory used to store both data and instructions. The programmer tells the SPE how to use its local store, unlike cache, which is managed directly by hardware. Effectively, it is both a cache and a data area for the SPE.
- Normally a local store is used to hold the SPE's program and a large data set for it to work on. What if we want to use it for something other than large data sets?

Running Java on the Cell BE
- In previous experiments, the Cell processor was successfully tested for executing Java applications.
- Problem: many system functions run only on the PPE (type resolution, garbage collection). This means the entire CPU stalls when the garbage collection routine is called.
- Solution: offload the garbage collection routine to an SPE so that the PPE can continue to execute the Java application.

The Challenge
- Why is using an SPE for garbage collection such a big deal? Garbage collection is a local store's worst enemy. Local
stores are meant to provide fast access to contiguous data so that an SPE can process the data quickly (sequential access). Garbage collection dereferences pointers across a large memory space in order to find objects that are no longer referenced (random access).

The Garbage Collection Algorithm
- Boehm-Demers-Weiser (BDW) mark, lazy-sweep collector: a form of the mark-and-sweep GC algorithm used in many different languages and platforms. Due to its popularity, it is well tested and tuned.
- Marks reachable objects by placing them on a mark stack; sweeps away unreachable objects lazily, only when the space is needed by a new allocation.

The Testing System
- Cell blade server with 2 Cell CPUs (2 PPEs, 16 SPEs), 3.2 GHz, 1 GB RAM
- Cell BE Linux 2.6.20-CBE
- GNU Compiler for Java (gcj)
- Benchmarks: SPECjvm98, JOlden

The Implementation
- The Cell PPE runs a Java application. One SPE is used to run the BDW GC marking routine only (the "mark phase"). The Cell PPE still handles much of the GC routine that interacts directly with the JVM.
- When the SPE begins to run, the PPE is not allowed to change anything in the heap until the garbage collector has finished ("stop-the-world").
- When the SPE is done with its workload, it resynchronizes with the PPE, passing it a descriptor to its mark stack.
- The simplest way to implement the mark phase: make the SPE transfer every reference to its local store one at a time, every time it needs to check a reference (the baseline).
- We need some method of caching references so that we don't have to load everything into the local store one reference at a time. Two solutions: operand buffers and software caching.

Operand Buffers
- A method of retrieving an entire block of pointers rather than one single pointer. Example: for an array of pointers, load a block of pointers starting at ptr rather than loading ptr[i] alone.
- When the SPU receives a reference to check, it retrieves the entire block rather than just the reference.
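The payoff of operand buffering can be illustrated with a toy transfer-counting model. This is not Cell SDK code: the block size, class names, and the idea of counting "DMA transfers" in Python are all invented to show why block fetches beat one-pointer-at-a-time loads for contiguous reference arrays:

```python
# A toy model of operand buffering (invented names; block size is illustrative).
BLOCK = 16  # pointers fetched per simulated DMA transfer

class LocalStore:
    def __init__(self, heap):
        self.heap = heap        # "main memory": index -> list of child refs
        self.buffer = {}        # block currently resident in the local store
        self.transfers = 0      # simulated DMA transfer count

    def load_one(self, ref):
        """Baseline: one transfer per reference checked."""
        self.transfers += 1
        return self.heap[ref]

    def load_buffered(self, ref):
        """Operand buffer: on a miss, fetch the whole surrounding block."""
        if ref not in self.buffer:
            base = (ref // BLOCK) * BLOCK
            self.transfers += 1
            for r in range(base, min(base + BLOCK, len(self.heap))):
                self.buffer[r] = self.heap[r]
        return self.buffer[ref]

heap = [[] for _ in range(64)]
refs = list(range(32))              # 32 contiguous references to check

baseline = LocalStore(heap)
for r in refs:
    baseline.load_one(r)
buffered = LocalStore(heap)
for r in refs:
    buffered.load_buffered(r)

assert baseline.transfers == 32     # one transfer per reference
assert buffered.transfers == 2      # 32 refs / 16 per block
```

For the scattered, random-access pointer chasing of the mark phase, the blocks hit far less often, which is exactly why the paper pairs operand buffers with a software cache.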
IBM comes with a Software Cache component
- Emulates a hardware cache through software
- Con: a large loss in performance due to being software based
- Pros: the software cache can be configured to any size, and the replacement strategy can be configured for the data structure being used

Results: Garbage Collection Caching
- Performance is enhanced with any form of caching
- 400%-600% faster with a hybrid software-cache and operand-buffer method
[Chart: mark-phase performance for 128KB software cache, software cache + operand buffer, and hybrid configurations]

Results: Garbage Collection on SPE vs. PPE
- Even with caching methods implemented, and up to a 600% speedup, the SPE still cannot perform as well as the PPE
- The SPE is, however, almost as good in most cases
[Chart: normalized mark performance, SPE versus PPE]

Conclusion
- Running garbage collection on coprocessors using a local-memory-based hierarchy can be done
- Allows the main processor to be used for the important user applications while the coprocessor does the mundane garbage collection task
- Since local stores are explicitly managed, programmers can implement their own caching techniques: a 400%-600% performance improvement over no caching at all

Questions?

High-Level Language VM

Outline
- Introduction
- Virtualizing a conventional ISA vs. an HLL VM ISA
- Pascal P-code virtual machine
- OO HLL virtual machines: properties, architecture, terms
- Implementation of HLL virtual machines: class loading, security, GC, JNI

Introduction
- An HLL PVM is similar to a conventional PVM
- The V-ISA is not designed for a real hardware processor
[Diagram: traditional path (HLL -> compiler frontend -> intermediate code -> compiler backend -> object code -> loader -> memory image) vs. HLL VM path (HLL -> compiler -> portable code (virtual ISA) -> VM loader -> VM interpreter/translator)]

Virtualizing a Conventional ISA vs. a High-Level-Language VM ISA
- Drawbacks of virtualizing a conventional ISA: not developed for being virtualized; operating system dependencies; issues with fixed-size address space, page-size memory address formation; maintaining precise exceptions; instruction set features; instruction discovery during indirect jumps; self-modifying and
self-referencing code

Conventional ISA: Not Designed for Being Virtualized
- Conventional ISA: portability is an after-the-fact solution; no built-in ISA support for virtualization
- High-level-language V-ISA: VM-based portability is a primary design goal; generous use of metadata; metadata allows better type-safe code verification, interoperability, and performance

Operating System Dependencies
- Conventional ISA: most difficult to emulate; exact emulation may be impossible on a different OS
- High-level-language V-ISA: find a least-common-denominator set of functions; programs interact with the library API; the library interface is higher level than a conventional OS interface

Memory Architecture
- Conventional ISA: fixed-size address spaces; specific addresses visible to user programs
- High-level-language V-ISA: abstract memory model of indefinite size; memory regions allocated based on need; actual memory addresses are never visible; an out-of-memory error is reported if a process requests more than is available on the platform

Memory Address Formation
- Conventional ISA: unrestricted address computation; difficult to protect the runtime from unauthorized guest program accesses
- High-level-language V-ISA: pointer arithmetic not permitted; memory access only through explicit memory pointers; static/dynamic type checking employed

Precise Exceptions
- Conventional ISA: many instructions trap; precise state needed; global flags enable/disable exceptions
- High-level-language V-ISA: few instructions trap; tests for exceptions are encoded in the program; requirements for precise exceptions are relaxed

Instruction Set Features
- Conventional ISA: more guest ISA registers than host registers is a problem; ISAs with condition codes are difficult to emulate
- High-level-language V-ISA: stack-oriented; condition codes are avoided

Instruction Discovery
- Conventional ISA: indirect jumps to potentially arbitrary locations; variable-length instructions; embedded data; padding
- High-level-language V-ISA: restricted indirect jumps; no mixing of code and data; variable-length instructions permitted
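The instruction-discovery drawback above can be made concrete with a toy sketch (the opcodes and encoding below are hypothetical, invented purely for illustration). With variable-length instructions and unrestricted indirect jumps, the same bytes decode into different instruction streams depending on the entry offset, so a binary translator cannot statically find all instruction boundaries:

```python
# Toy variable-length ISA: the opcode byte determines operand length.
# 0x01 = LOAD (2-byte operand), 0x02 = ADD (no operand),
# 0x03 = JMP (1-byte operand). Any other byte decodes as DATA.
OPERAND_BYTES = {0x01: 2, 0x02: 0, 0x03: 1}
NAMES = {0x01: "LOAD", 0x02: "ADD", 0x03: "JMP"}

def decode(code, start=0):
    """Linearly decode `code` beginning at `start`; return mnemonics."""
    out, i = [], start
    while i < len(code):
        op = code[i]
        n = OPERAND_BYTES.get(op)
        if n is None:
            out.append("DATA")   # byte falls outside known opcodes
            i += 1
        else:
            out.append(NAMES[op])
            i += 1 + n           # skip over the operand bytes
    return out

code = bytes([0x01, 0x00, 0x03, 0x02, 0x01, 0x02, 0x02])
# Decoding from offset 0 and from offset 1 (as the target of an
# indirect jump might require) yields different instruction streams:
print(decode(code, 0))
print(decode(code, 1))
```

A restricted-indirect-jump, no-code/data-mixing V-ISA avoids this by making every valid entry point discoverable ahead of time.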
Self-Modifying / Self-Referencing Code
- Conventional ISA: poses problems for translated code
- High-level-language V-ISA: self-modifying and self-referencing code are not permitted

Pascal P-code
- Popularized the Pascal language; simplified porting of a Pascal compiler
- Introduced several concepts used in HLL VMs: a stack-based instruction set; an implementation-independent memory architecture; undefined stack and heap sizes; standard libraries used to interface with the OS
- The objective was compiler portability, not application portability

Pascal P-code (2)
- Protection via a trusted interpreter
- Advantages: porting is simplified; don't have to develop compilers for all platforms; the VM implementation is smaller and simpler than a compiler; the VM provides a concise definition of semantics
- Disadvantages: achieving OS independence reduces API functionality to a least common denominator; tendency to add platform-specific API extensions

Object-Oriented HLL Virtual Machines
- Used in a networked computing environment
- Important features of HLL VMs:
  - security and protection: protect remote resources, local files, and the VM runtime
  - robustness: the OOP model provides component-based programming, strong type-checking, and garbage collection
  - networking: incremental loading and small code size
  - performance: easy code discovery allows entire-method compilation

Terminology
- Java Virtual Machine Architecture <-> CLI: analogous to an ISA
- Java Virtual Machine Implementation <-> CLR: analogous to a computer implementation
- Java bytecodes <-> Microsoft Intermediate Language (MSIL, CIL, IL): the instruction part of the ISA
- Java Platform <-> .NET framework: ISA + libraries, a higher-level ABI

Modern HLL VM
- The compiler frontend produces binary files in a standard format common to all architectures
- Binary files contain both code and metadata
[Diagram: a machine-independent program file processed by a virtual machine implementation]

Security
- A key aspect of modern network-oriented VMs: the "protection sandbox"
- Must protect remote resources and files, local files, and the runtime
- Java's first-generation
security method is still the default

Protection Sandbox
- Remote resources are protected by the remote system
- Local resources are protected by the security manager
- VM software is protected via static/dynamic checking

Java 1.1 Security: Signing
- Identifies the source of the input program (via a public key)
- Can implement different security policies for programs from different vendors

Java 2 Security: Stack Walking
- Inspect the privileges of all methods on the stack
- Example stack: method 1 (system, full permissions) -> method 2 (untrusted, append-only) -> method 3 (system, full) -> method 4 (untrusted, append-only) -> method 5 (system, full)
- Method 4 attempts to write file B via method 5; the privileges of every method on the stack are inspected; the call fails since method 2 does not have the required privileges
- Check principal permissions

Garbage Collection
- Issues with traditional malloc/free and new/delete: explicit memory allocation places a burden on the programmer; dangling-pointer and double-free errors
- Garbage collection: objects with no references are garbage and must be collected to free up memory for future object allocation
- The OS limits memory use by a process
- Eliminates programmer pointer errors

Network Friendliness
- Support dynamic class loading on demand: load classes only when needed; spread loading over time
- Compact instruction encoding: zero-address, stack-based bytecode to reduce code size
- Class files contain significant metadata, so there is maybe a slight code-size win over RISC fixed-width ISAs

Java ISA
- Formalized in the classfile specification
- Includes instruction definitions (bytecodes)
- Includes data definitions and interrelationships (metadata)

Java Architected State
- Implied registers: program counter, local variable pointer, operand stack pointer, current frame pointer, constant pool base
- Stack: arguments, locals, and operands
- Heap: objects and arrays; implementation-dependent object representation
- Class file content: the constant pool holds immediates and other constant information

Data Items
- Types are defined in the specification; the implementation is free to
choose a representation
- Reference (pointer) and primitive (byte, int, etc.) types
- The range of values that can be held is given, e.g. a byte is between -128 and 127
- Data is located via references (as fields of objects in the heap) or via offsets (using the constant pool pointer or stack pointer)

Data Accessing
[Diagram: accessing data through the constant pool, local variables, operand stack, and heap]

Instruction Set
- Bytecodes: a single-byte opcode, zero or more operands
- Can access operands from the instruction, the current constant pool, the current frame's local variables, and values on the operand stack

Instruction Types
- Pushing constants onto the stack
- Moving local variable contents to and from the stack
- Managing arrays
- Generic stack instructions (dup, swap, pop & nop)
- Arithmetic and logical instructions
- Conversion instructions
- Control transfer and function return
- Manipulating object fields
- Method invocation
- Miscellaneous operations
- Monitors

Stack Tracking
- At any point in the program, the operand stack has the same number of operands, of the same types, and in the same order, regardless of the control path getting there
- Helps with static type checking

Stack Tracking Example
Valid bytecode sequence:
  iload A            // push int A from local memory
  iload B            // push int B from local memory
  ifcmpne 0, else    // branch if B != 0
  iload C            // push int C from local memory
  goto endelse
else:
  iload F            // push F
endelse:
  add                // add from stack, result to stack
  istore D           // pop sum to D

Stack Tracking Example (2)
Invalid bytecode sequence (the stack at skip1 depends on the control-flow path):
  iload B            // push int B from local memory
  ifcmpne 0, skip1   // branch if B != 0
  iload C            // push int C from local memory
skip1:
  iload D            // push D
  iload E            // push E
  ifcmpne 0, skip2   // branch if E != 0
  add                // add stack, result to stack
skip2:
  istore F           // pop to F

Exception Table
- Exceptions are identified by a table in the class file: the address range where checking is in effect, and the target if an exception is thrown; the operand stack is emptied
- If there is no table entry in the current method, pop the stack frame and check the calling method; default handlers at main
- Example entry: From 8, To 12, Target 96, Type ArithmeticException

Binary Class Format
- Magic number and header
- Regions preceded by counts: constant pool, interfaces, field information, methods, attributes

Java Virtual Machine
- An abstract entity that gives meaning to class files
- Has many concrete implementations: hardware, interpreter, JIT compiler
- Persistence: an instance is created when an application starts and terminates when the application finishes

JVM Implementation
- A typical JVM implementation consists of: a class loader subsystem, a memory subsystem, an emulation/execution engine, and a garbage collector
[Diagram: class files flow through the class loader into the memory subsystem; the execution engine works with data & instructions, addresses, and native method libraries]

Class Loader
- Functions: find the binary class; convert class data into an implementation-dependent memory image; verify correctness and consistency of the loaded classes
- Security checks: checks the class magic number; checks that component sizes are as indicated in the class file; checks the number and types of arguments; verifies the integrity of the bytecode program

Protection Sandbox
- Stores must be to a reference or field of the correct type; array stores are range checked
- Load types are determined from the reference/field type; array loads are range checked
- Moves to local storage must be to a location with a correct tracked type; moves to operand storage take their type from the local storage type

Protection Sandbox: Security Manager
- A trusted class containing check methods; attached when the Java program starts; cannot be removed or changed
- The user specifies the checks to be made (files, types of access, etc.)
- Operation: native methods that involve resource accesses (e.g. I/O) first call the check methods

Verification
- Class files are checked when loaded to ensure security and protection
- Internal checks: checks the magic number; checks for truncation or extra bytes (each component specifies a length); makes sure components are well-formed

Verification (2)
- Bytecode checks: check for valid opcodes; perform full path analysis — regardless of the path to an instruction, the contents of the operand stack must have the same number and types of items; checks
the arguments of each bytecode; checks that no local variables are accessed before being assigned; makes sure fields are assigned values of the proper type

Java Native Interface (JNI)
- Allows Java code and native code to interoperate: access legacy code and system calls from Java; access the Java API from native functions
- Each side compiles to its own binary format; different Java and native stacks are maintained
- Arguments can be passed by value; exceptions returned
[Diagram: Java side (bytecode methods) and native side interacting via getfield/putfield]

Garbage Collector
- Provides an implicit heap object space reclamation policy
- Collects objects that have had all their references removed or destroyed
- Invoked at regular intervals or when memory is low
- The root set points to objects in the heap; objects not reachable from the root set are garbage

Types of Collectors
- Reference-count collectors: keep a count of the number of references to each object
- Tracing collectors: trace from the root set of references

Mark and Sweep Collector
- Basic tracing collector: start with the root set of references; trace and mark all reachable objects; sweep through the heap collecting the unmarked objects
- Advantages: does not require moving objects or pointers
- Disadvantages: garbage objects combined into a linked list lead to fragmentation; segregated free lists can be used; consolidation of free space can improve efficiency

Compacting Collector
- Makes free space contiguous
- Multiple passes through the heap; a lot of object movement; many pointer updates

Copying Collector
- Divide the heap into halves; collect when one half is full; copy live objects into the unused half during the sweep phase
- Reduces passes through the heap; wastes half the heap

Simplifying Pointer Updates
- Add a level of indirection: use a handle pool; when an object moves, update the handle pool
- Makes every object access slow

Generational Collectors
- Reduce the number of objects moved during each collection cycle
- Exploit the bimodal distribution of object lifetimes
- Divide the heap into two sub-heaps: a nursery for newly created objects and a tenured area for older objects
- Collect a smaller portion of the heap each time

Generational Collectors (2)
- Stop-the-world collectors are time consuming; long pauses are unsuitable for real-time applications

Concurrent Collectors
- GC runs concurrently with application execution; the partially collected heap may be unstable
- Synchronization is needed between the application (the mutator) and the collector
[Diagram: root set before and after concurrent collection]

JVM Bytecode Emulation
- Interpretation: simple; fast startup, slow steady state
- Just-In-Time (JIT) compilation: compile each method on first invocation; simple optimizations; slow startup, fast steady state
- Hot-spot compilation: compile frequently executed code; can apply more aggressive optimizations; moderate startup, fast steady state

Introduction to Virtual Machines
- Introduction: abstraction and interfaces
- Virtualization
- Computer system architecture
- Process virtual machines
- System virtual machines
EECS 700 Virtual Machines, Spring 2009

Abstraction
- A mechanism to manage complexity in computer systems
- The mechanism: partition the design of a system into levels; allow higher levels to ignore the implementation details of lower levels
- In computer systems, lower levels are implemented in hardware and higher levels in software

Interfaces
- An interface defines the communication boundary between two entities (a hierarchical or linear relationship)
- Software can run on any machine supporting a compatible interface

Interfaces: Advantages
- Allows decoupling of computer design tasks: each component provides an abstraction of itself to the outside world
- Work on different components can progress independently
- Helps manage system complexity

Interface Disadvantages
- Software compiled for one ISA will not run on hardware with a different ISA: PowerPC
binaries on an x86 machine
- Even if the ISAs are the same, the OSes may differ: MS Windows applications vs. Sun Solaris

Interface Disadvantages (2)
- Binaries may not be optimized for the platform they run on: Intel Pentium binaries on an AMD Athlon
- Innovation may be inhibited by a fixed ISA: hard to change instruction sets
- Application software cannot directly exploit microprocessor implementation features: software is supposed to be implementation-neutral

Virtualization
- Removes some constraints imposed by system interfaces and increases flexibility
- Improves availability of application software
- Removes the assumption of a single management regime, improving security and failure isolation
- Provides a different view of a particular computer resource, not necessarily a simpler view

Virtualization (2)
- Virtualization constructs an isomorphism that maps a virtual guest system to a real host

Virtualization (3)
- Virtualization vs. abstraction: virtualization does not necessarily hide details

Virtual Machines
- The concept of virtualization applied to the entire machine
- A virtual machine is implemented by adding a layer of software to a real machine to support the desired virtual machine's architecture
- Virtualization: mapping of virtual resources or state to real resources; use of real machine instructions to carry out actions specified by the virtual machine instructions

Some Benefits of VMs
- Flexibility, portability, isolation, security

Computer System Architecture
- Architecture: the functionality and appearance of a computer system, but not the details of its implementation
- Implementation: the actual embodiment of an architecture

Computer Architecture (2)
- Computer systems are built of
levels of abstraction, with well-defined interfaces between them (hierarchical abstraction)
[Diagram: application software and libraries above the operating system (drivers, scheduler); below that, execution hardware, memory translation, the system interconnect (bus), controllers, I/O devices and networking, and main memory]

The ISA Interface
- The interface between hardware and software; important for the OS developer
[Diagram: application programs, libraries, and the operating system above the ISA boundary; execution hardware and memory below]

The ABI Interface
- Application Binary Interface (ABI): user ISA + system calls
- Important for compiler writers
[Diagram: application programs above the ABI boundary; libraries, operating system, and hardware below]

The API Interface
- Application Programming Interface (API): user ISA + library calls
- Important for application programmers
[Diagram: application programs above the API boundary; libraries, operating system, execution hardware, memory translation, system interconnect (bus), I/O devices and networking, and main memory below]

Major Program Interfaces
- ISA supports all conventional software: system ISA + user ISA
- ABI supports application software only: system calls + user ISA

Process Virtual Machines
- A process virtual machine is capable of supporting an individual process
- Guest and host ISA may be different; coupled at the ABI level via a runtime system
[Diagram: guest application on a runtime, running on the host machine]

Process Virtual Machines (2)
- Constructed at the ABI level
- The runtime manages the guest process and communicates with the host OS
- Guest processes may intermingle with host processes
- As a practical matter, binaries are built for the same OS
- Dynamic optimizers are a special case
- Examples: IA-32 EL, FX!32, Dynamo

System Virtual Machines
- A system virtual
machine is capable of supporting an OS with potentially many user processes
- Coupled at the ISA level, e.g. IBM VM/360, VMware, Transmeta Crusoe
[Diagram: guest applications and OS running on virtual machine monitor (VMM) software, which runs on the host hardware]

PVM: Multiprogramming
- A PVM is provided by a multiprocess OS for each concurrently executing application
- The combination of the OS system call interface and the user-level ISA
- Each process is given the illusion of having the complete machine to itself

PVM: Emulators
- Execute binaries compiled to a different instruction set than that executed by the host's hardware
- Interpretation: low startup overhead, but high steady-state per-instruction emulation overhead

PVM: Dynamic Translators
- Runtime translation of blocks of source instructions to equivalent target instructions: high startup (translation) overhead, fast steady-state execution
- Uses a code cache to store translated blocks of code for reuse
- e.g. Digital's FX!32 system, the Aries system, the Intel IA-32 EL system

PVM: Same-ISA Binary Optimizers
- Same source and target ISAs; the main task is optimization of the source binary (ABI-level optimization)
- May also collect performance profiles; may also enhance security
- e.g. the Dynamo system developed at Hewlett-Packard

PVM: High-Level-Language VM
- An HLL VM is designed for VM execution: minimize hardware-specific and OS-specific features that could compromise portability
[Diagram: traditional path (HLL -> compiler frontend -> intermediate code -> compiler backend -> loader) vs. HLL VM path (HLL -> compiler -> portable code (virtual ISA) -> VM loader -> VM interpreter/translator)]

PVM: High-Level-Language VM (2)
- Binary class files are distributed; the ISA is part of
the class file; there is no real implementation
- OS interaction via an API
- e.g. Java, Microsoft CLI
[Diagram: one Java VM architecture, with VM implementations on a Sparc workstation, an x86 PC, and an Apple Mac]

Classic System Virtual Machine
- The original meaning of the term "virtual machine": all guest and host software use the same ISA
- The VMM runs on bare hardware in the most privileged mode
- The VMM intercepts and implements all the privileged operations for the guest OS

Hosted System Virtual Machine
- Virtualizing software is built on top of an existing host OS
- Advantages: installation is like installing application programs; the host OS provides device driver support
- Drawbacks: less efficient

Whole-System VMs
- Different ISA for guest and host systems: both application and OS code require emulation
- Implemented by placing the VMM and the guest software on top of a conventional host OS running on the hardware
- e.g. Virtual PC

Codesigned Virtual Machines
- VMs designed to enable innovative ISAs and/or hardware implementations for improved performance, power efficiency, etc.
- Similar hardware virtualization is common for microprocessors such as the Pentium IV
- The software VM is part of the hardware design; applications and the OS never directly execute native ISA instructions
- e.g. the Transmeta Crusoe processor

VM Taxonomy
- Process VMs, same ISA: multiprogrammed systems; dynamic binary optimizers (Dynamo)
- Process VMs, different ISA: dynamic translators (IA-32 EL, FX!32); HLL VMs (Java VM, MS CLI)
- System VMs, same ISA: classic systems (IBM VM/370); hosted VMs (VMware)
- System VMs, different ISA: whole-system VMs (Virtual PC for Mac); codesigned VMs (Transmeta Crusoe)

Disco: Running Commodity Operating Systems on Scalable Multiprocessors
Edouard Bugnion, Kinshuk Govil, Scott Devine, Mendel
Rosenblum, Stanford University
Presented by Arturo Ramos, The University of Kansas

Introduction
- Some things to note before starting: this paper was published in 1997; VMware was founded in 1998; this paper was written by the founders of VMware
- At the time of publishing, scalable computers were making their first introductions into the commercial marketplace
- What is a scalable computer system? A computer with multiple processors which share the same memory or resources, or multiple computers tied together which act as one unified computer
- All computers need operating systems, even scalable computers. Options: write a brand-new OS for the scalable computer; port/reprogram an existing OS to run on it; or use a virtual machine monitor to run existing OSes on it
- What kind of OS should we use on scalable computers?

The Scalable Computer Problem
- Writing a new operating system for new hardware with the functionality of existing operating systems is hard, takes time, and is impractical
- Porting existing operating systems to work on new hardware will also take a lot of time:
  - Development costs: existing operating systems are millions of lines of code
  - Resource handling: need to rewrite drivers for changed devices; need to modify memory handling (CC-NUMA)
  - Scalability: need to divide the system into scalable units
  - Fault containment: need fault containment for each unit
  - Distribution: need to build a single OS image across all units
- The result: hardware is released with late, buggy, or even no software to support it
- In order for new technology to succeed, hardware needs reliable software that will let users continue to use their existing library of applications

The Solution: Virtual Machine Monitors
- Rather than creating a new OS or modifying existing OSes, use a VMM
- The VMM can run existing commodity OSes and also specialized OSes, all at the same time, on one scalable computer
- With small changes to existing OSes to make them
VMM-aware, the different virtual machines can share their data with each other
- Using a VMM solves the following issues involved in porting an OS:
  - Development costs: the time needed to develop a VMM for a scalable computer is much less than developing a complete OS
  - Resource management: the VMM will handle all of the resource management
  - Scalability: each virtual machine is one scalable unit (i.e. 1 VM per processor/computer)
  - Fault containment: since each unit contains only one VM, if a unit fails, only one VM crashes
  - Distribution: only the VMM needs to be distributed across all units in the scalable computer

Virtual Machine Monitor Disadvantages
- Overhead: more exception processing; more instructions executed; more memory needed; device I/O virtualization
- Resource management: VMMs make bad resource decisions when the OSes running on a VM are VM-unaware — an OS's idle loop looks the same as calculations if the OS does not tell the VMM that the process is unimportant, and the VMM cannot know when a VM is no longer using a page if the VM does not tell it so
- Communication and sharing: virtualized disks that are in use by one VM cannot be used by another VM until the disk is released by the first VM; even though data is stored on the same hardware, it cannot be shared between VMs if the VMs are VM-unaware

What Kind of VMMs to Use?
- Now that we know the scalable computing problem and have a possible solution, we need to try it out
- One example of a good VMM to use to address the scalable computing problem is Disco
- What is Disco? The interface; the implementation; the simulated results; the actual results

Disco: A Virtual Machine Monitor
- Disco is a VMM designed for the FLASH multiprocessor, a scalable computer cluster
- Each unit has its own CPU, DRAM, and I/O devices; units are connected together by a high-speed interconnect
- Software running on this machine sees one computer with multiple processors and one bank of shared memory
- Disco is just an example VMM which can be used to address the scalable computer problem
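The NUMA page migration and replication that Disco performs (described in the implementation notes) can be sketched as a simple per-page policy. This is a minimal illustration, not Disco's actual heuristics: the `HOT` and `MAX_MOVES` thresholds are hypothetical, and the real system works from hardware cache-miss counters rather than explicit access calls:

```python
HOT = 4          # accesses from a node before the policy acts (assumed)
MAX_MOVES = 2    # migration bound per page, to avoid ping-ponging

class Page:
    def __init__(self, home):
        self.home = home        # node holding the master copy
        self.replicas = set()   # nodes holding read-only copies
        self.moves = 0
        self.counts = {}        # per-node access counters

    def access(self, node, write=False):
        self.counts[node] = self.counts.get(node, 0) + 1
        if self.counts[node] < HOT or node == self.home:
            return
        hot_nodes = [n for n, c in self.counts.items() if c >= HOT]
        if write or len(hot_nodes) == 1:
            # One dominant accessor (or a writer): migrate the page,
            # invalidating any replicas, up to the migration bound.
            if self.moves < MAX_MOVES:
                self.home = node
                self.replicas.clear()
                self.moves += 1
        else:
            # Read-shared by several nodes: replicate instead of moving.
            self.replicas.add(node)

p = Page(home=0)
for _ in range(4):
    p.access(1)                 # node 1 is the sole hot accessor
print(p.home, p.replicas)       # page migrated to node 1, no replicas
```

If a second node then starts reading the same page heavily, the policy replicates rather than migrating again, which is the behavior the paper's transparent-page-replication figure illustrates.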
Disco: The Interface
- Processors: Disco emulates a MIPS R10000 processor, which maps directly to the actual processors contained in the FLASH machine; Disco extends additional CPU features to improve performance for Disco-aware VMs (e.g. reduced trap-emulation overhead)
- Physical memory: Disco emulates a contiguous physical address space; the actual FLASH system has a NUMA space
  - What is NUMA? Non-Uniform Memory Access. Remember, each scalable unit has its own RAM, so the total of all memory available to the machine is spread out non-uniformly across several units
  - By emulating contiguous space, existing OSes can run without modification
- I/O devices: every I/O device is virtualized — hard disk, network interface, interrupt timers, etc.; special virtualization options on virtual disks allow mounting across multiple VMs; implements an additional network device type, other than Ethernet, which allows faster and larger transfers between two VMs

Disco: The Implementation
- A multithreaded shared-memory program
- Very lightweight: 13,000 lines of code, 72KB executable footprint
- Strives to be efficient: careful attention to NUMA placement; cache-aware data structures to lower cache-miss rates; careful interprocessor communication patterns; uses as few locks as possible so that multiple processors can access data at the same time; uses wait-free synchronization for the same reason

Disco: The Implementation (2)
- Virtual CPUs: direct execution of virtual CPU code on the real CPU allows most operations to run at the same speed as hardware, except privileged instructions
- Disco runs in kernel mode, the OS in supervisor mode, and programs in user mode
- Schedules each virtual CPU to be time-shared across the physical processors; tries to run a virtual CPU on a real CPU that is close to the memory it needs to access
- Virtual physical memory: does a VM-to-real memory mapping using the translation lookaside buffer (TLB) of the MIPS processor
- Due to a limitation of the MIPS processor, Disco cannot use the TLB to remap memory that is within
a kernel-mode direct-access memory segment; in this case Disco must force the OS to use only memory that can be mapped
- Since Disco uses the TLB to remap memory, the TLB will miss more often; Disco implements a secondary TLB in software to "increase" the size of the TLB

Disco: The Implementation (3)
- NUMA memory management: uses a dynamic page migration and page replication system to help keep data in local memory
- Pages heavily accessed by one node are moved to that node
- Pages that are read by several nodes are copied to the nodes most heavily accessing them
- Pages have a limit on the number of times they can move, to avoid constant movement overhead
[Fig. 3: major data structures of Disco] [Fig. 2: transparent page replication]

Disco: The Implementation (4)
- Virtual I/O devices: each virtual device uses a custom device driver written for the operating system
- The custom driver passes all command arguments in one trap rather than several traps
- Disk and network devices send Direct Memory Access (DMA) information through the trap to allow Disco to remap memory
- Since all DMA information is sent to Disco, memory and disk resources can be shared between VMs (i.e. two VMs with the exact same virtual disk)
[Fig. 4: memory sharing in Disco — shared and private pages between virtual machines]

Disco: The Implementation (5)
- Virtual network interface: the custom Disco network interface device does not limit the maximum transfer unit (MTU) of packets
- Disco remaps existing pages rather than copying them when a network device receives a packet that already exists in memory
- Saves memory and allows VMs to share memory through network communication
[Fig. 5: example of transparent sharing of pages over NFS]

Disco:
Running Commodity Operating Systems
- The OS virtualized for this test was IRIX, a UNIX-based operating system
- To virtualize an existing operating system using Disco, some changes need to be made: MIPS architecture workarounds; custom device drivers; Hardware Abstraction Layer (HAL) modifications

Disco: Running Commodity Operating Systems (2)
- MIPS architecture workarounds: MIPS always allows direct memory access to the first kernel segment of memory, but Disco needs to remap memory before allowing an OS to access it; a change to the OS must be made so that it never uses this first kernel segment
- Device drivers: custom Disco-aware device drivers written specifically for the guest OS can be made, allowing devices to be optimized for Disco; in the case of IRIX, no custom drivers were written

Disco: Running Commodity Operating Systems (3)
- The Hardware Abstraction Layer (HAL): a layer that comes with most modern OSes; allows an OS to be modified to run on different hardware
- Common changes to the HAL: reduce the number of privileged instructions the OS uses; inform the VMM of resource utilization; inform the VMM of idle time

Disco: Running Specialized Operating Systems
- Allows Disco to support large-scale parallel applications, or applications that do not need a full-function or existing OS
- SPLASHOS is a specialized OS for Disco: it runs all applications in the same address space as the OS, allowing Disco to have full control over page faults
- A very simple OS, but a custom OS that can run parallel to other commodity OSes

Experimental Results
- Experimental setup: the FLASH machine didn't exist yet; it had to be simulated using SimOS, a machine simulator
- The MIPS R10000 simulation model was too slow, so a simpler CPU model was used
- Simulation was too slow to get useful data for large workloads
- Workloads:
  - Software development (pmake): parallel compile of GNU chess; many short-lived processes; OS/VMM intensive
  - Hardware development (flashlight + VCS): concurrent run
of the Flashlite and VCS simulators; long-lived processes; OS-unintensive
  - Scientific computing (raytrace): a single parallel process renders a "car" model; OS-unintensive
  - Commercial database (Sybase): database queries on a preloaded database already in memory; memory intensive
18

Experimental Results
- Execution overheads
[Fig. 6: Overhead of virtualization — normalized execution time, IRIX vs. Disco, for the pmake, engineering, raytrace, and database workloads]
19

Experimental Results
- Memory overheads
[Fig. 7: Data sharing in Disco between virtual machines — memory footprint (IRIX text, IRIX data, buffer cache, Disco) for IRIX and for 1, 2, 4, and 8 VMs, with and without NFS]

Experimental Results
- Scalability
[Fig. 8: Workload scalability under Disco — pmake (IRIX; 1, 2, 4, 8 VMs; 8 VMs + NFS) and RADIX (IRIX, SplashOS), with execution time split into idle, Disco, sync, kernel, user stall, and user]

Experimental Results
- Dynamic page migration and replication
[Fig. 9: Performance benefits of page migration and replication — engineering and raytrace workloads under IRIX, Disco, and UMA]

Real Hardware Results
- Disco was ported to hardware that did exist, in order to confirm the simulated results
- Disco was ported to run on an SGI Origin200
  - Single 180 MHz MIPS R10000, 128 MB RAM
- Porting Disco
  - First, IRIX boots and discovers hardware drivers
  - Disco is jumped into before IRIX begins init
  - Disco takes control of the system using the hardware drivers discovered by IRIX

Real Hardware Results
- Virtualization overheads
  - Slightly different workloads: pmake compiles Disco; engineering simulates the FLASH memory system
  - Results are consistent with the simulation: pmake is 8% slower with Disco; engineering is 7% slower with Disco
[Table III: Execution time breakdown (user, kernel)]

Conclusion
- Virtual machine monitors are a viable solution to the problem of developing system software for scalable shared-memory multiprocessors
- The Disco prototype experiment shows that the overheads
of VMMs are low
- Implementation costs using this technique will be lower than developing system software from scratch
- The growing trend toward multiprocessors promotes the growth of VMMs

Questions?