

by: Cleora Stiedemann (Rice University, GPA 3.72)
Instructor: John Mellor-Crummey

About this Document

Instructor: John Mellor-Crummey
Class Notes, 25 views




These 383 pages of class notes were uploaded by Cleora Stiedemann on Monday, October 19, 2015. The notes belong to COMP 522 at Rice University, taught by John Mellor-Crummey in the Fall. Since upload, they have received 25 views. For similar materials, see /class/224949/comp-522-rice-university in Computer Science at Rice University.


Date Created: 10/19/15
C++ Concurrency Memory Model
John Mellor-Crummey, Department of Computer Science, Rice University (johnmc@cs.rice.edu)
COMP 522, Lecture 12, 9 October 2008

Why a C++ Concurrency Memory Model?
- Technology trends: chips rely on increasing core counts to boost performance; substantial performance gains are available only for parallel codes.
- Current practice: most parallel programs are threads sharing an address space.
- C and C++ are defined as single-threaded languages: no reference to threads; compilers are thread-unaware and generate code as for a single-threaded application; compilers reorder assignments to independent variables, which does not preserve the meaning of multithreaded programs.

Concurrent Programming with Pthreads
- Prohibit data races involving normal variables (data race: >= 2 concurrent accesses to a variable, >= 1 of them a write).
- Provide synchronization primitives, e.g., pthread_mutex_lock.
- Implementations can't reorder ordinary memory operations with respect to synchronization operations (discussed last lecture in the context of weak ordering). Compilers enforce this by treating synchronization as opaque calls: ordinary memory operations can't be moved past them because opaque synchronization calls may have side effects. Ordinary memory operations may be reordered between calls, subject to single-thread constraints, of course.

Problems with Thread Libraries
Informal multithreaded semantics is insufficient.
- The definition of a data race is under-specified. Is this program race-free?

    Initially X = Y = 0
    T1: r1 = X; if (r1 == 1) Y = 1;
    T2: r2 = Y; if (r2 == 1) X = 1;

  Is the outcome r1 == r2 == 1 allowed? How could it occur?
- It is hard to reason about when a compiler might introduce races. Given

    struct s { char a; char b; } x;

  with Thread 1 executing x.a = 1 and Thread 2 executing x.b = 1, Thread 1 is not equivalent to

    struct s tmp = x; tmp.a = 1; x = tmp;

  A compiler might do something like this with bit fields. Why is this not OK?

Avoiding Data Races
- Use atomic primitives that are safe for concurrent use, e.g., the gcc __sync intrinsics. Selected examples:

    type __sync_fetch_and_add(type *ptr, type value, ...)
    type __sync_fetch_and_sub(type *ptr, type value, ...)
      (also _or, _and, _xor, _nand variants)
    type __sync_add_and_fetch(type *ptr, type value, ...)
    type __sync_sub_and_fetch(type *ptr, type value, ...)
      (also _or, _and, _xor, _nand variants)
    bool __sync_bool_compare_and_swap(type *ptr, type oldval, type newval, ...)
    type __sync_val_compare_and_swap(type *ptr, type oldval, type newval, ...)
    __sync_synchronize(...)
      -- full memory barrier
    type __sync_lock_test_and_set(type *ptr, type value, ...)
      -- acquire barrier: memory references after the barrier cannot be moved to, or be
         performed, before it; but previous memory stores may not be globally visible
         yet, and previous memory loads may not yet be satisfied
    void __sync_lock_release(type *ptr, ...)

Issues with the Aforementioned Atomics
- Not portable.
- Interactions with other shared-memory variables are not carefully or clearly specified.

Review of Memory Model Motivation
- A memory model specifies the values that a read of a shared variable may return; it affects programmability, performance, and portability.
- Compiler and hardware must respect the memory model: it constrains compiler transformations, and compiler and hardware must cooperate to preserve intended semantics.

Review of Memory Model Flavors
- Sequential consistency: execution behaves as if there were a single total order, an interleaving of memory operations in which operations within each thread appear in program order. Intuitive, but limits optimizations and performance.
- Relaxed memory models: allow various hardware optimizations; difficult to reason with for high-level languages, but still limit certain compiler optimizations (see the Java memory model).

Data-race-free Models
- Programming practice: programs should be free of data races.
- Data-race-free models guarantee sequential consistency for programs without data races and provide no guarantees in the presence of races. Why? This allows standard hardware and compiler optimizations while providing a simple model for programming according to accepted practice, i.e., without races.

A Concurrency Memory Model for C++
Define the semantics of multithreaded C++ for C++0x. Main issues:
1. Sequentially consistent atomics
2. Trylock and its impact on the definition of data race
3. Semantics for data races
4. Low-level atomics with weaker memory ordering
Preliminaries
- Data operations: load, store.
- Synchronization operations: atomic load, atomic store, atomic read-modify-write, lock, unlock.
- Sequenced-before relation on memory operations: unlike prior work, this relation is only a partial order; it permits undefined argument evaluation order for functions.
- Type 1 data race: two concurrent conflicting accesses from different threads, at least one of which is a data operation.
- Type 2 data race: two data accesses to a location unordered by happens-before.

Sequentially Consistent Execution
- An interleaving of thread actions in a total order.
- Each thread is internally consistent: it corresponds to a correct sequential execution of the thread and respects the ordering implied by its sequenced-before relation.
- The total order is consistent with all sequenced-before orders.
- Each load, lock, and read-modify-write operation reads the value of the last preceding write to the same location; the last operation on a given lock preceding an unlock must be a lock performed by the same thread.

C++ Concurrency Memory Model
- If a program on a given input has a sequentially consistent execution with a type 1 data race, its behavior is undefined.
- Otherwise, the program behaves according to one of its sequentially consistent executions.
- Q: why can a program have more than one sequentially consistent execution?

Optimizations Allowed
Given memory operation M1 sequenced-before M2, hardware may reorder M1 and M2 if the semantics of the thread allow and:
- M1 is a data operation and M2 is a read synchronization operation: may delay a load/store until after an atomic load.
- M1 is a write synchronization operation and M2 is a data operation: may promote a load/store before an atomic store.
- M1 and M2 are both data operations with no synchronization sequenced between them: may reorder a load/store with respect to another load/store.
When locks are used in well-structured ways, reordering is also allowed, as long as intra-thread semantics permit, when:
- M1 is a data operation and M2 is a write to a lock, or
- M1 is an unlock and M2 is a read of or write to a different lock.
Data writes and well-structured lock/unlock operations can be non-atomic.

Reordering Restrictions
- Synchronization operations must appear sequentially consistent with one another.
- Atomic operations execute in sequenced-before order.
- Atomic writes execute atomically.

1. Sequentially Consistent Atomics
- Some had concerns that the requirement that atomics appear sequentially consistent would be too costly. Except on Itanium, existing ISAs don't distinguish synchronization operations, so compilers typically enforce orderings with fence instructions.
- Types of fences: membar (all memory operations before the membar finish before any after it); store-load fence. Some processors ensure orderings implicitly: AMD64 and Intel64 provide load-load and load-store fence semantics after each load, and store-store fence semantics after each store.
- Many fences are ambiguous about hardware write atomicity: whether one thread's write could be visible to one thread before all.
- It was difficult to formalize semantics that were meaningfully weaker than SC for hardware, yet simple enough to program.

Cost of Enforcing Write Atomicity
- Fences ensure reads execute in program order, but writes must also be atomic to ensure sequential consistency. Consider the effect of non-atomic writes:

    Initially X = Y = 0
    T1: X = 1
    T2: r1 = X; fence; r2 = Y
    T3: Y = 1
    T4: r3 = Y; fence; r4 = X

  The outcome r1 == 1, r2 == 0, r3 == 1, r4 == 0 violates write atomicity, despite the ordering inferred from the fences.
- Ownership-based cache protocols avoid this inconsistency: they invalidate other copies before committing a write, ensuring that all readers see the new value and providing serialization for reads and writes to the same location.

Write Atomicity and Multicore
- Previously, reading another thread's write early only had implications for memory and the cache coherence protocol.
- New SMT and multicore architectures have tighter sharing:
  - L1 cache: suppose T1 and T3 share an L1 write-through data cache, and T3's read of X occurs after T1's write to X; it is tempting to return the new value of X even if T1's ownership request has not made its way through the lower levels of the cache hierarchy.
  - Processor core: threads may share load and store queues.
- Some existing systems don't provide an efficient means of ensuring SC for the prior independent-reads-independent-writes example.
Consequences of Relaxing Write Atomicity
- Write-to-read causality must be upheld: a write read by T2, followed by T3's read of the value T2 wrote, establishes a causal connection.

    Initially X = Y = 0
    T1: X = 1
    T2: r1 = X; fence; Y = 1
    T3: r2 = Y; fence; r3 = X

  The outcome r1 == 1, r2 == 1, r3 == 0 violates write atomicity. How could the violation occur? T1 and T2 share an L1 write-through cache, so T2 is allowed to read T1's update early; T3 receives the invalidation for Y before that of X. Sequential consistency requires T3 to return the new value of X. One possible approach to fix this: ensure that writes separated by a fence from a given node are seen in the same order by all nodes; but this is insufficient for other programs.
- Read-to-write causality must also be considered: a read of an old value by T2, followed by T3's write of a new value, establishes a causal connection. How could a write-atomicity violation occur? T1 and T2 share an L1 write-through cache; T2 is allowed to read T1's update to X early, while T3 reads the old value of X before receiving the invalidation. The solution suggested for the write-to-read case is insufficient here, since each node (T1/T2, or T3) performs only one write. This atomicity violation may be acceptable to programmers.

Interplay between Coherence and Causality
- Cache coherence guarantees that writes to the same location are seen in the same order by all threads; there is consensus that this must be respected for meaningful code, and likewise that write-to-read causality must be respected.
- How can an atomicity violation occur? Assume T1 and T2 are on the same node with a shared write-through L1, and T3 and T4 are on separate nodes. T2 reads T1's write of X early and, after a fence, T3 writes a new value of Y, executes a fence, and updates X to 2; T4 reads X == 2, fences, and then, after T1's invalidation and write complete, reads X == 1.

Summary
- Relaxation of SC was difficult to formalize for C++.
- C++ retains SC for atomics and synchronization operations.

Implications for Current Processors
- The new AMD64 and Intel64 memory-ordering specifications provide a clear way to guarantee SC: require each atomic write to be mapped to xchg. Hardware then only has to ensure that xchg writes execute atomically; xchg implicitly provides a store-load fence, removing the need for an explicit fence after an atomic store (see the write-atomicity example above).
- Isn't it cumbersome to convert every atomic write to xchg? It is better to pay the penalty on stores, since loads are more frequent, and a store-load fence replaced by an RMW is about as costly on today's processors.
- Other approaches: an ISA could distinguish atomic stores from regular stores, as on Itanium; or write atomicity could be provided for all writes, e.g., Sun's Total Store Order.

2. Making Trylock Efficient
- pthread_mutex_trylock attempts to acquire a lock without blocking and returns a status code indicating whether the lock was acquired.
- An undesirable use of trylock:

    T1: X = 1; lock(l);
    T2: while (trylock(l) succeeds) unlock(l); assert(X == 1);

  This inverts the semantics of the lock: T2 waits for T1 to acquire the lock instead of waiting for T1 to release it.
- Difficulty: the assert can fail if the compiler or hardware reorders the statements executed by T1. Prohibiting this reordering requires a fence before the lock, even though for well-structured use of locks this reordering is safe.
- The new semantics allow trylock to spuriously fail.

3. Semantics of Data Races
- Undefined. Why? It is too hard to explain the potential behaviors in a language standard; consider

    unsigned i = x;
    if (i < 2) {
      ...
      switch (i) {
        case 0: ...; break;
        case 1: ...; break;
        default: ...;
      }
    }

  where, if the compiler reloads x after the test, a racing update could make the switch see a value outside the tested range.
- Implication: compilers may never induce data races, so they can't introduce speculative loads and stores; yet partial redundancy elimination introduces speculative loads.

4. C++ Model for Low-level Atomics
- An issue: SC atomic stores often require two fences, one before and one after the store. The fence before ensures memory operations sequenced before the store are visible to any thread that sees the store; the fence after ensures the store is not reordered with respect to a later atomic load.
- The memory model thus far does not support non-SC synchronization operations.
- A formal model defines a program execution as a set of thread executions, in which:
  - an atomic load or RMW is considered an acquire, and an atomic store or RMW a release;
  - a mapping from each atomic load/RMW to an atomic store/RMW establishes a synchronizes-with relation;
  - an irreflexive total order of atomic operations is intended to reflect the global order in which they execute;
  - a happens-before relation is defined, consistent with sequencing, the synchronizes-with mapping, and transitivity.
- Allow an atomic load_relaxed that can be reordered and is not considered an acquire for the synchronizes-with relation.
- Allow an RMW to act as acquire, release, neither, or both.
Summary
- A simple SC programming model: write no data races, or identify variables involved in races as atomic; most users can then ignore the details of hardware memory models and compiler optimizations.
- Based on an intuitive notion of simultaneous execution of conflicting accesses.
- Low-level atomics are allowed, but trade simplicity for cross-platform performance.

Implications of the Concurrency Model
- For compiler writers: ordinary variables don't change asynchronously, so standard program analysis and optimizations remain valid (except for atomics), even with threads. However, the implementation must refrain from introducing visible races, e.g., by rewriting adjacent structure fields or by register promotion.
- For hardware implementers: a modest-cost facility is needed to implement SC atomics and write atomicity; write atomicity is not needed for all stores, only for atomics; ideally, SC atomics should be implementable with small overhead for load operations.

References
- Foundations of the C++ concurrency memory model. H. Boehm and S. V. Adve. In Proc. of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation, Tucson, AZ, USA, June 2008.


Simultaneous Multithreading and the Case for Chip Multiprocessing
John Mellor-Crummey, Department of Computer Science, Rice University (johnmc@cs.rice.edu)
COMP 522, Lecture 2, 28 August 2008

Microprocessor Architecture, Mid-90s
- Superscalar (SS) designs were the state of the art: multiple instruction issue; dynamic scheduling (hardware tracks instruction dependencies); speculative execution (looks past predicted branches); non-blocking caches (multiple outstanding memory accesses).
- Circuit density was continuing to double every 18 months, providing the raw material for more logic and enabling higher clock frequencies.
- The apparent path to higher performance: wider instruction issue and support for more speculation.

Flies in the Ointment
Claim: increasing the issue width of SS designs will provide diminishing returns, for two reasons:
- fundamental circuit limitations, and
- a limited amount of instruction-level parallelism.

Superscalar Designs (e.g., R10K, PA-8K)
- Three phases: instruction fetch; issue and retirement; execution.
- Circuit-level performance limiters:
  - Issue and retirement: an instruction issues when all its operands are ready; register renaming avoids artificial dependences (a mapping table with operands x issue-width ports; a reorder buffer whose 1-bit comparators scale as operands x issue width x log2(#registers) x issue-queue length). Wider issue widths require deeper issue queues for sufficient ILP; the paper claims a quadratic increase in the size of issue queues, and the long wires needed to broadcast the tags of issued instructions ultimately limit cycle time.
  - Execution phase: the register file and the bypass logic both grow quadratically. Farkas et al. '96: an 8-issue design gains only 20% over 4-issue; register file complexity limits performance; a 56-instruction queue takes 20% of die area.

Circuit Technology Impact
- Delays increase as issue queues deepen and multiport register files grow; these increasing delays limit the performance returns from wider issue.

Instruction-level Parallelism Concerns
- From "Simultaneous multithreading: maximizing on-chip parallelism" (Tullsen et al., ISCA 1995): an issue-slot diagram of full and empty slots distinguishes horizontal waste (9 slots in the example: unused slots in partially used cycles) from vertical waste (12 slots: entirely idle cycles).
- Contributing factors: instruction dependencies and long-latency operations within a thread. How does each contribute to the two kinds of waste?

Sources of Wasted Issue Slots (and possible cures)
- TLB miss: larger TLB; hardware instruction prefetching; hardware or software data prefetch.
- I-cache miss: larger i-cache; more i-cache associativity; hardware prefetch.
- D-cache miss: larger, more associative caches; prefetching; more dynamic execution.
- Control hazard: speculative execution; aggressive if-conversion.
- Branch misprediction: better prediction logic; lower misprediction penalty.
- Load delays (L1 hits): shorter load latency; better scheduling.
- Instruction delays: better instruction scheduling.
- Memory conflict (multiple accesses to the same location in a cycle): improved instruction scheduling.

How Much IPC is There?
- Approach: study applications and evaluate their characteristics; assess the quantity and character of the parallelism present.
- Are there any pitfalls to this approach? Is there a better approach?
- See Intel's Recognition, Mining, Synthesis research initiative: "Compute-Intensive, Highly Parallel Applications and Uses," Intel Technology Journal 9(2), May 19, 2005, DOI 10.1535/itj.0902, http://www.intel.com/technology/itj/2005/volume09issue02

Simulations of an 8-issue Superscalar
- (Figure from Tullsen et al., ISCA 1995: percent of total issue cycles broken down by cause, including memory conflicts, long integer and FP operations, branch mispredictions, d-cache, i-cache, and d-TLB misses, and processor-busy time, for applications drawn mostly from SPEC92.)
- Summary: the machine is highly underutilized; on average < 1.5 IPC.
- The dominant source of waste differs by application; short FP dependences account for 37%.

Analysis of the 8-issue Simulations
- No dominant cause of wasted cycles, but 61% vertical waste.
- No dominant solution: no single latency-tolerance technique is likely to help dramatically; even if memory latencies were eliminated, utilization would be < 40%.
- Tullsen et al. claim instruction scheduling targets several important segments of the wasted issue bandwidth, "but we expect that our compiler has already achieved most of the available gains in that regard," using the Multiflow trace-scheduling compiler.
- If specific latency-hiding mechanisms are limited, a general latency-hiding solution is needed to increase parallelism.

What Should be Next?
- If not increasing superscalar issue width, then what? Alternatives: a single-chip multiprocessor, or a simultaneous multithreaded processor.
- How should we decide? The best approach depends upon application characteristics.
The Case for a Single-Chip Multiprocessor
Two motivations:
- Technology push: delays increase as issue queues and multiport register files grow, limiting the performance returns from wider issue.
- Application pull: limited IPC is a problem.

Comparing Alternative Designs
- Consider two microarchitectures: a 6-way superscalar architecture, and a 4 x 2-way superscalar multiprocessor architecture.
- (Figure: floorplans of the 6-issue superscalar and the 4 x 2-issue CMP.)
- The 4 x 2 CMP differs in: much simpler issue logic; a 1/4-size branch prediction buffer; much simpler renaming logic; more execution units.

Integer Benchmarks
- eqntott: translates logic equations into truth tables; manually parallelized (a bit-vector comparison routine accounts for 90% of the time).
- compress: compresses and uncompresses files in memory; unmodified on both architectures.
- m88ksim: simulates the Motorola 88000 CPU; manually parallelized into 3 threads using the SUIF compiler; the threads simulate different instructions in different phases, a parallelization analogous to the hardware pipelining it simulates.
- MPsim: simulates a bus-based multiprocessor; parts of the model hierarchy are manually assigned to different threads; 4 threads, one per simulated CPU.
(From "The case for a single-chip multiprocessor," Olukotun et al., ASPLOS-VII, 1996.)

FP and Multiprogramming Benchmarks
- Floating-point applications, all parallelized with SUIF: applu (solves parabolic/elliptic PDEs); apsi (computes temperature, wind velocity, and distribution of pollutants); swim (shallow-water model with a 1k x 1k grid); tomcatv (generates a mesh using a Thompson solver).
- Multiprogramming application: pmake, a parallel make of gnuchess with the C compiler; the same application is simulated on both architectures, and the OS exploits the extra PEs in the MP architecture to run parallel compiles.

IPC Breakdown of Superscalar PEs; Cache Performance, SS vs. CMP
- (Table and figure residue: per-benchmark IPC with stall breakdowns, including d-cache stall, i-cache stall, and pipeline stall, plus cache miss rates in misses per completed instruction (MPCI), for the 2-issue, 6-issue, and 4 x 2-issue configurations.)
- Instruction-cache miss rates are inflated by speculation: misses incurred for speculative instructions that abort still count.
- Non-blocking caches permit multiple outstanding misses per line.

Performance: 4 x 2 CMP vs. 6-issue SS
- Non-parallelizable codes: the wide superscalar architecture performs up to 30% better.
- Codes with fine-grained thread-level parallelism: the wide superscalar architecture is at most 10% better at the same clock frequency, and one would expect the simpler CPUs in a CMP to support higher clock rates that would eliminate this difference.
- Codes with coarse-grained thread-level parallelism and multiprogramming workloads: the CMP performs 50-100% better than the wide superscalar, which is less able to dynamically extract parallelism from a 128-wide instruction window.
Simultaneous Multithreading: Maximizing On-chip Parallelism

Simultaneous Multithreading
- Permit several independent threads to issue to multiple functional units each cycle.
- Objective: increase processor utilization despite long memory latencies and limited available parallelism per thread.
- Design parameters: number of threads; number of functional units; issue width per thread; binding between threads and units (from static to completely dynamic).
- Advantages: combines the multiple instruction issue of superscalar architectures with the latency-hiding ability of multithreaded architectures.

High-level View of Threading Alternatives
- (Figure: issue-slot diagrams across functional units FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL for single-threaded execution, coarse-grained threading, fine-grained threading, and SMT, with each slot marked as thread 0 executing, thread 1 executing, or no thread executing.)

Goals of the SMT Simulations
- Evaluate the alternatives: wide superscalars; traditional multithreaded processors; small-scale multiple-issue multiprocessors.

SMT Simulation Environment
- Emulation-based instruction-level simulation (like Tango).
- What do they model? Execution pipelines; the memory hierarchy (hit rates, bandwidth); TLBs; branch prediction logic.
- Base processor: like the Alpha 21164, with 10 functional units (4 integer, 2 floating point, 3 load/store, 1 branch); latencies derived from the 21164; larger caches than the Alpha are assumed, since SMT strains the cache subsystem.
- Instruction latencies: integer multiply 8/16; conditional move 2; compare 0; all other integer 1; FP divide 17/30; all other FP 4; load with L1 hit (no bank conflicts) 2; L2 hit 8; L3 hit 14; memory 50; predicted branch or jump 1; mispredicted branch or jump 6.
(From "Simultaneous multithreading: maximizing on-chip parallelism," Tullsen et al., ISCA 1995.)

Alternatives Studied
- Baseline: fine-grained multithreading (not SMT; like the Tera MTA): only one thread issues instructions each cycle, but it can use the entire issue width. This admits horizontal waste while attacking vertical waste by hiding memory and functional-unit latencies.
- Full simultaneous issue: 8 threads compete for each issue slot each cycle.
- SM single issue, SM dual issue, SM four issue: limit the number of instructions each thread can issue in a cycle.
- Limited connection: connect each hardware context to one functional unit of each type; each unit can only receive instructions from its attached threads. Key points: the partitioning of functional units is less dynamic than in the other models, but units are still shared among threads for high utilization.

Instruction Scheduling
- Limited support for dynamic execution: an 8-instruction per-thread issue window issues instructions out of order to the functional units; dependence-free instructions are issued in order to the window by each thread.
- The Multiflow trace compiler is used for good static scheduling (trace scheduling).

SMT Performance: 3 Design Points
- (Figure: instructions issued per cycle vs. number of threads for (a) fine-grained multithreading, (b) SMT with single issue per thread, and (c) SMT with full simultaneous issue; segments of each bar show the IPC contribution of each thread, with the highest-priority thread at the bottom.)
- What do the graphs show? Slowdowns arise from contention for other system resources, e.g., cache, TLB, and branch predictor.
- Claim: fine-grained multithreading is beneficial only for 2-5 threads.

SMT: Comparison of All Models
- (Figure: instructions issued per cycle vs. number of threads for full simultaneous issue, SM four issue, SM dual issue, SM limited connection, and fine-grained multithreading; dual issue reaches 94% efficiency.)
- Q: why does the IPC of fine-grained multithreading plateau?

Cache Issues for SMT
- Locality decreases relative to the single-threaded model.
- Waste due to i-cache misses: 1% with 1 thread, 14% with 8 threads.
- Waste due to d-cache misses: 12% with 1 thread, 18% with 8 threads.
- Waste due to d-TLB misses: < 1% with 1 thread, 6% with 8 threads.
- Approaches: increase the shared d-TLB from 64 to 96 entries (< 1% waste at 8 threads), or use private d-TLBs of 24 entries each (2% waste).
Cache Design for SMT
- Private vs. shared L1 caches per thread.
- A shared d-cache optimizes for small numbers of threads over all thread counts; private i-caches help at 7-8 threads.
- (Figure: throughput (IPC) relative to the 64s/64s configuration vs. number of threads; notation is i-cache/d-cache sizes in KB, with "s" for shared and "p" for private.) Which is the best design point for performance?

SMT vs. CMP
- Key difference: the way resources are partitioned and scheduled. A CMP statically partitions resources; SMT allows the partitioning to change every cycle.
- SMT claims: SMT is faster by 24%, comparing a single SMT processor with 10 functional units against a conventional 8-way multiprocessor with 32 functional units.
- Study caveats: the architecture was not built, so the results may be optimistic in two respects: (1) the number of pipeline stages required for instruction issue (the Alpha uses 2 issue stages; if SMT requires more, the misprediction penalty increases), and (2) data-cache access time for a shared cache. This study should serve as an upper bound on SMT performance.

SMT Experience with IBM's POWER5
- Most workloads benefited from SMT, with gains of 11% to 41%.
- Main cause of performance degradation: load/store activity. Lower SMT gains came from repeated invalidation and reloading of L2 lines by the 2 programs and a higher rejection rate of loads and stores; as a result, more loads were satisfied from L3 and memory, because the data footprints wouldn't fit into the L2 cache.
- SMT cost/benefit assessment: 24% of chip area for SMT support yields a 20.5% average SMT gain.
- From LLNL, on using POWER5 SMT on ASCI Purple: use no more than 1 MPI task per CPU; use the 2nd virtual CPU for auxiliary threads or daemons. Performance characteristics vary; sPPM and UMT2K realize a 20-22% performance gain, says IBM.

Discussion Questions
- What questions do these papers leave unanswered?
- Why is SMT not dominant? Does the paper overestimate the benefits of SMT over CMP: inter-thread contention, the impact of latency, hardware support overhead, bad benchmarks?
- Interactions between the two techniques: synergies? interference?
- How should the techniques be combined? What is the right balance point: how much superscalar execution per hardware thread, how many simultaneous threads per core, how many cores, and how should one determine this?

References
- Simultaneous multithreading: maximizing on-chip parallelism. Dean Tullsen, Susan Eggers, and Henry Levy. In 25 Years of ISCA, 1995.
- The case for a single-chip multiprocessor. Kunle Olukotun, Basem Nayfeh, Lance Hammond, Ken Wilson, and Kunyung Chang. ASPLOS-VII, 1996.
- Characterization of simultaneous multithreading (SMT) efficiency in POWER5. H. M. Mathis, A. E. Mericas, J. D. McCalpin, R. J. Eickemeyer, and S. R. Kunkel. IBM Journal of Research and Development 49(4/5), 2005.
- Purple: Fifth Generation ASC Platform. http://www.llnl.gov/asci/platforms/purple


Intel's Vision for Terascale Multicore Microprocessors
John Mellor-Crummey, Department of Computer Science, Rice University (johnmc@cs.rice.edu)
COMP 522, Lecture 6, 11 September 2008

Context
- Dual- and quad-core processors have become mainstream, delivering better performance more efficiently than a single core.
- The trend toward higher core counts is continuing; SMP techniques enable the integration of small PEs on a die. How do we move to higher levels of integration?

Looking into the Future
- Move to tens or hundreds of cores on a die.
- Must also consider the impact of incorporating other components: memory controllers, I/O bridges, graphics engines.
- What infrastructure is needed to tie these together? On-die interconnect; the cache hierarchy; memory; the "uncore" (I/O, system interfaces).

Challenges for Terascale Multicore
- Process variations; higher failure rates; on-die power dissipation; wire delays.
- Off-chip memory bandwidth: must meet constraints for cost and power; limited by the capabilities of high-speed signaling technology.
- Cache hierarchy design and coherence.

Requirements
- Offer flexibility for assigning computing resources to concurrently solve different problems.
- Support high-volume manufacturing: reliability must be enhanced to cope with increasing architectural complexity and decreasing silicon geometries.
complexity, decreasing silicon geometries. Must perform within a constrained power envelope.

Approach: Tiled Architecture
- Divide die into a large number of identical (or nearly so) tiles
- Interconnect tiles with a scalable, energy-efficient interconnect
- Benefits: simplifies layout; enables rapid integration of different blocks

Architectural Vision: Components
- Scalable, high-bandwidth, low-latency, power-efficient interconnect
- Cache hierarchy
- High-bandwidth memory subsystem to feed the PEs
- Both general and special purpose cores (e.g. shaders, FFUs)
- Platform elements: memory & I/O controllers
(Figure legend: C = PCIe-based controller, R = router, Sy = system interface.)

On-die Interconnect Performance Metrics
- Capacity per PE for uniform random traffic: amount of traffic a PE could inject into the network, C = 2*BB/N, where BB = bisection bandwidth, N = number of PEs
- Latency: want low average contention-free latency and manageable growth under load:
    T0(s,d) = H(s,d)*tr + Tc(s,d) + L/b
  where H(s,d) = number of hops between source s and destination d, tr = router traversal latency, Tc(s,d) = channel traversal latency, L = packet length, b = channel bandwidth

On-die Interconnect Desirable Properties
- Partitionability: should enable dynamic partitioning for performance and fault isolation
- Fault tolerance: graceful degradation under faults; must support routing around faults
- Validation and testing: deadlock-free routing (rather than deadlock-recovery-based routing) facilitates testing and validation
- Regularity: layout planning, design; tiles should be physically symmetric; each tile must plan global wiring tracks
- Flexibility: router design must scale; support clustering of multiple low-bandwidth cores

Network Topology Considerations
- Off-chip networks: pinout limitations determine number of links and link widths
- On-chip networks don't have these limitations
- On-chip networks determined by:
  - wiring density (a function of metal layers, directionality)
  - router complexity (time, area)
  - need to embed the topology in 2D space: for topologies > 2D, topological adjacency != physical adjacency

On-chip Network Topologies
- On-chip topologies will be fixed degree, low dimension (1-2) for the foreseeable future
- Candidate topologies: 2D mesh or torus, rings and variants
- Example: a 2D embedding of a 64-node 3D mesh has network diameter 9 but tile span 18

Other Candidate Topologies
(Figure 8: candidate network topologies — mesh variants, torus, fat tree.)
(Figure 9: floorplan for estimating area; lower left quadrant of die shown.)

Network Comparisons
(Figure: (a) relative workload completion time and (b) chip area for the candidate topologies.)

Interconnect Microarchitecture
- Challenge: must simultaneously satisfy latency and bandwidth requirements and power and area requirements; topologies with low latency and high bandwidth cost power
- MIT RAW experience: router power is nearly 2x the power of the wires; roughly as much is spent in the crossbar switch as in the buffers
- Approaches to reduce power:
  - segmented and cut-through crossbars (Wang et al.)
  - design buffers to minimize power for a given throughput (Nicopoulos et al.)
  - traffic on trunks can be aggregated and switched together, avoiding the need for intermediate buffering

Interconnect Needs
- Route operands between clusters of functional units: a virtual superscalar microarchitecture requires fine-grained communication (operands, control)
- Cache-coherent communication; message-based communication
- Meet real-time constraints of multimedia applications
- Ration bandwidth between competing applications

Fault Tolerance is a Must
- Higher probability of failures in the field
- Increased susceptibility to faults: more transistors; smaller process technology leaves less room for error; process variations will be a more significant determinant of performance
- Insufficient burn-in time to weed out infant mortality

Using Sparing for Fault Tolerance
- Provision the chip with extra tiles
- Detect failures; reconfigure the interconnect; activate spare tiles
- Kinds of faults: true faults (tiles, wires), out-of-spec components
(Figure: tiled array with a spare column of nodes.)

Fault-tolerant Routing
- Mark failed and unsafe components at initialization (unsafe = lacks sufficient paths for deadlock-free routing)
- Reconfigure components in place of failed tiles and routers
- Routing algorithm desired properties: simple to implement; deadlock free; handles a variety of faults; adaptive to congestion (e.g. from faults)
(Node states: working, dead, unsafe, forced dead.)

Partitioning for Performance Isolation
- Partition chip components among multiple tasks (e.g. multiple virtual servers on a chip)
- Confine intra-partition communication to the partition's own components
- Why? Guarantee quality of service
- Example: isolation within 2D rectangular partitions

Virtualization of Network Interfaces
- Export an interconnect abstraction to the programmer and runtime: meet the different needs of accelerators, cores, cache blocks, etc.
- Support application-centric requirements. Why? Power and performance
- Examples: media components may be better served by a send/receive interface than a cache-coherent one; the programmer may want to explicitly control provisioning (number of buffers, switching priority, bandwidth)
- A flexible network interface facilitates rapid integration of multiple IP blocks

Cache Hierarchy and Coherence
- Lessons from the past: shared memory is desirable; shared memory machines can scale to hundreds of processors; MP often works better using shared memory
- Tremendous demands on cache hierarchy and coherence: diversity of workloads; concentration of computing resources
- Requirements for a cache coherence protocol: efficiency, scalability
- Design goals: minimize latency to frequently accessed data; exploit flexible organization to adapt to workload demands; minimize requirements for high-performance software

Cache Hierarchy Goal: support a diverse set of workloads
- Multiprogrammed workloads with no communication
- Workloads with a mix of scalar and parallel sections: overall performance limited by Amdahl's law
- Highly-parallel workloads: thread parallelism (e.g. in Web apps or transaction processing) may use shared data; data parallelism typically uses shared data
- Streams: programs structured as kernels; output data from one kernel is fed as input to another kernel
- Hybrid models: thread or data parallelism within a streaming kernel
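The contention-free latency metric from the interconnect slides, T0(s,d) = H(s,d)*tr + Tc(s,d) + L/b, can be made concrete for a 2D mesh. This is only a sketch: the parameter values are hypothetical, and modeling Tc(s,d) as one channel traversal per hop is an assumption, not something stated in the lecture.

```java
// Sketch: contention-free latency on a 2D mesh with dimension-order routing.
// T0(s,d) = H(s,d)*tr + Tc(s,d) + L/b; Tc is modeled as one channel
// traversal per hop (an assumption). All parameter values are hypothetical.
final class MeshLatency {
    // Hop count between (sx,sy) and (dx,dy) on a mesh (Manhattan distance).
    static int hops(int sx, int sy, int dx, int dy) {
        return Math.abs(sx - dx) + Math.abs(sy - dy);
    }
    // tr: router traversal latency (cycles), tc: per-hop channel latency (cycles),
    // len: packet length (bits), bw: channel bandwidth (bits/cycle).
    static double t0(int hops, double tr, double tc, double len, double bw) {
        return hops * tr + hops * tc + len / bw;
    }
    public static void main(String[] args) {
        int h = hops(0, 0, 3, 2);                      // 5 hops
        System.out.println(t0(h, 4.0, 1.0, 512, 128)); // 5*4 + 5*1 + 512/128 = 29.0
    }
}
```

With these made-up numbers the router term dominates, which echoes the slide's point that router (not wire) power and latency are the bottleneck.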
Cache Hierarchy Design Challenges
- Organizations and policies to meet goals for performance, scalability, energy efficiency
- Organization design parameters: number of hierarchy levels; size, associativity, and bandwidth at each level; sharing; single block or multiple banks with non-uniform latencies
- Cache policies: control accessibility, allocation, eviction

Trends in Cache Hierarchy Design
- First one or two levels of cache are private to a core
- Different designs for last-level cache sharing: private to a core (e.g. Montecito dual-core Itanium) or shared among multiple cores (Merom, Niagara)
- Block design: few-core CMPs use a single physical block with UMA latency to all cores; physically distributed caches are attractive — a physical-design response to wire-delay dominated latencies (Burger, Keckler)

Cache Organization Options for Multicore
(Figure: general purpose and throughput cores over three organizations — private L2 with shared L3, shared L2, and a 3-level hierarchy; legend: core, L1 cache, private cache, shared cache, directory, network.)

Cache Design Tradeoffs
- Shared cache. Benefit: increases effective capacity — only a single copy of a shared block resides in cache. Drawback: latency — a block may be in a tile arbitrarily far from the core using it, even if the block is not shared
- Private cache. Benefit: all blocks used by a core are on its local tile. Drawbacks: read-shared blocks are replicated; effective capacity is reduced; off-die traffic may increase

Hybrid Alternatives
- Goals: combine the advantages of private and shared caches; avoid their shortcomings
- How: a physically distributed cache design, some banks closer to each core than others; optimize performance by optimizing placement of blocks:
  - adaptive selective replication (Wisconsin, MICRO, Dec 2006)
  - cooperative caching (Princeton, ISCA, June 2006)
  - victim migration (MIT TR, Oct 2005)
  - victim replication (MIT, ISCA, June 2005)
- Appropriate these ideas for terascale multicore, where the cache is spread across tiles

Distributed Caching Policy Issues
- Initial placement when fetched from memory
- Read-shared block replication
- Block migration in response to accesses
- Eviction: what happens to a block evicted from a cache bank?

Caching Concerns for Terascale Processors
- Consider a terascale microprocessor composed of general purpose cores, network accelerators, graphics coprocessors, a security coprocessor
- What's the problem? Different data footprints; different locality characteristics
- Destructive interference could occur with diverse per-core access patterns and qualitatively different working sets

Cache Coherence Protocol Issues
- Tiles are expected to need high-bandwidth interconnect
- The protocol must enable high utilization of on-chip structures
- Protocol interconnect overhead must be kept to a minimum

Directory-based Cache Coherence
- Widely used in large-scale SMPs
- States: invalid, exclusive, shared
- Identity info: full bit map, coarse bit map, limited explicit list
- Desired feature: ability to pin lines in a particular place
(Figure 7: directory structure to track a cache line.)

Memory Architecture
- Many applications will need major increases in memory bandwidth to feed the tiled array of cores
- Two issues: power-efficient, high-speed off-die I/O (slow but steady progress over the last decade); power-efficient, high-bandwidth DRAM access (requires a new look at DRAM core and I/O design)
- Approaches: more efficient storage (e.g. embedded DRAM is higher density than SRAM); better management of on-die storage (non-inclusive caches to avoid duplication); integration of DRAM inside the processor package; 3D stacked SRAM — high bandwidth and power efficient, but low capacity

Intel Teraflop Processor Prototype
- 80 cores in a 10 x 8 mesh at 4 GHz
- 5-port router per PE; 2 logical lanes; wormhole routing
- Bisection bandwidth 256 GB/s

Teraflop Processor Tile
(Figure: tile with mesochronous interfaces, a 6-read/4-write 32-entry register file, two single-precision FPMACs, the processing engine (PE), and the router.)

Teraflop Processor Crossbar Router
(Figure: 5-port, two-lane crossbar router with a 5-stage pipeline; ports to/from the PE and the N, E, S, W neighbors.)

Frequency, Power, and Efficiency
- Estimated frequency, performance, and power versus Vcc; power efficiency with 80 cores active
(Figure: roughly 1 TFLOP at higher Vcc and 0.67 TFLOP at 2.1 GHz, with total power in watts plotted against supply voltage.)

Nine-Stage Floating-point Pipeline
- IEEE-754 single precision inputs (operands A and B); partial product encoding; Wallace stages 1-4; base-32 conversion; sign inversion; single-cycle accumulate; final adder; LZA; normalize; base-2 conversion

Summary
- High levels of integration and heterogeneous building blocks require a modular, scalable on-chip interconnect; ring, mesh, or similar appear most attractive; a 2D mesh is good at using wiring tracks, but good latency at low power is problematic
- Fault tolerance is important for coping with failure
- Challenges: memory and I/O bandwidth; better use of on-chip cache will help; high integration reduces memory bandwidth pressure and the need for lots of off-chip coherence bandwidth
- Future: 3D stacked memory (concern: limited capacity); higher speed interfaces

References
- Integration Challenges and Tradeoffs for Terascale Architectures. Mani Azimi, Naveen Cherukuri, D. N. Jayasimha, Akhilesh Kumar, Partha Kundu, Seungjoon Park, Ioannis Schoinas, Aniruddha S. Vaidya. Intel Technology Journal, August 2007.
- An 80-Tile 1.28TFLOPS Network-on-Chip in 65 nm CMOS. S. Vangal et al. In Proceedings of ISSCC 2007, IEEE International Solid-State Circuits Conference, Feb 12, 2007.
- Design Tradeoffs for Tiled CMP On-chip Networks. J. Balfour, W. J. Dally. ICS 2006.

Programming Parallel Algorithms with NESL
John Mellor-Crummey, Department of Computer Science, Rice University
johnmc@cs.rice.edu — COMP 522, Lecture 10, 29 September 2008

Context
- Multicore hardware is here; multiple cores accelerate programs with threaded parallelism
- Need models for writing multithreaded programs
- What makes a good multithreaded programming model? Simple to write, expressive, low parallelization overhead, dynamic load balance, supports construction of efficient scalable programs
- Last class: Cilk — a practical parallel programming model
- Today: NESL — a language of more theoretic interest for expressing and reasoning about parallel algorithms

Goals of NESL
- A language-based performance model using work and depth: better than a machine model that uses running time; enables performance analysis without a machine model
- Support for nested data-parallel constructs: the ability to apply a function in parallel to each element of a collection, and to nest such parallel calls

Outline
- PRAM model
- Work and depth: algorithmic complexity measures
- Computational models
- NESL: nested data parallelism, influences, NESL operators
- Programming with NESL: sparse matrix-vector multiply, quicksort, primes
- NESL performance model: analysis using work and depth
- Discussion

Parallel Random Access Machine (PRAM)
- A parallel extension of the serial Random Access Machine (RAM)
- PRAM model: p processors; global memory of unbounded size that is uniformly accessible; unit-time access to any location by any processor; processors share a common clock; processors may execute different instructions in each cycle; all processors may access memory locations simultaneously

PRAM Classification
- PRAMs differ in their handling of simultaneous memory accesses:
  - Exclusive-read exclusive-write (EREW) PRAM
  - Concurrent-read exclusive-write (CREW) PRAM
  - Exclusive-read concurrent-write (ERCW) PRAM
  - Concurrent-read concurrent-write (CRCW) PRAM

CRCW PRAM: flavors of concurrent-write semantics
- Common: write only if all values are identical
- Arbitrary: write the data from a randomly selected processor
- Priority: follow a predetermined priority order
- Sum: write the sum of all data items

Algorithm Performance on a PRAM
- Measured in terms of the number of instruction cycles
- Usually expressed as a function of input size and number of processors

Physical Complexity of PRAMs
- Processors and memories are connected via switches; switches must operate in O(1) time at the level of words
- For p processors and m words, switch complexity is O(mp)
- The PRAM serves as a virtual model: not realizable for large values of p and m; can be mapped onto realistic machines by having each PE simulate multiple PRAM processors
- Why virtual models? They are easier to program

Complexity Measures for Parallel Algorithms
- Work: the total number of operations executed by a computation; also known as T1
- Depth: also known as T-infinity; the longest chain of sequential dependencies in a computation
(Figure: summing n numbers on a binary tree — work is the number of additions, O(n); depth is the height of the tree, O(log n).)

Why Work and Depth?
- They provide abstract measures of algorithmic complexity
- They enable an algorithm to be analyzed without a particular machine in mind
- They suppress details that only serve to obscure: how the data is partitioned among the processors; how recursive parallelism is mapped onto processors; load balancing; synchronization

Computational Models
- Circuit model
- Vector machine models: a sequential random access machine that operates on vectors (scalar memory and processor alongside a vector memory and parallel vector processor); supports elementwise operations (e.g. add, subtract) and aggregate operations (e.g. extract elements using a vector of indices). Example — summing a vector by repeated pairwise addition:

    procedure SUM(V)
      n = length(V)
      for i = 1 to log n begin
        V = vector-add(odd-elts(V), even-elts(V))
      end
      return V[1]

Language-based Models
- Define a work-depth model using language constructs
- Specify the work and depth cost of primitive instructions, and rules for composing costs over sequential and parallel program expressions:
  - when executing in parallel, total work is the sum of the work over all tasks, and total depth is the maximum of the depths of the tasks
  - when executing sequentially, both the total work and the depth of tasks are summed
- Advantage of language-based models: they can describe running time without introducing a machine model
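The composition rules can be checked mechanically on the tree-sum example: count one unit of work per addition, and combine subtree depths with max (the subtrees could run in parallel) rather than plus. A small sketch, with sequential Java standing in for the parallel execution:

```java
// Sketch: work/depth accounting for summing n numbers on a balanced tree.
// Work counts every addition (O(n)); depth is the longest chain of dependent
// additions (O(log n)), so subtree depths combine with max, not plus.
final class TreeSum {
    long work = 0;  // total additions performed
    // Returns {sum, depth} for a[lo..hi); hi - lo assumed a power of two here.
    long[] sum(int[] a, int lo, int hi) {
        if (hi - lo == 1) return new long[]{a[lo], 0};
        int mid = (lo + hi) / 2;
        long[] left = sum(a, lo, mid);   // the two halves are independent and
        long[] right = sum(a, mid, hi);  // could execute in parallel
        work += 1;                       // one addition to combine them
        return new long[]{left[0] + right[0], 1 + Math.max(left[1], right[1])};
    }
    public static void main(String[] args) {
        TreeSum t = new TreeSum();
        long[] r = t.sum(new int[]{1, 2, 3, 4, 5, 6, 7, 8}, 0, 8);
        // sum = 36, work = 7 = n - 1, depth = 3 = log2(n)
        System.out.println(r[0] + " work=" + t.work + " depth=" + r[1]);
    }
}
```

For n = 8 this reports work 7 and depth 3, matching the O(n) work / O(log n) depth claimed for the tree sum.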
NESL Foundation
- Data parallelism: perform an operation in parallel over data
- Nested data parallelism (flat vs. nested): a flat function can be applied over a set of data values; nested parallel functions can be applied over a set of data values
- Example of nested data parallelism: sum the elements of each row of a matrix — for each row in parallel, compute the sum over its elements in parallel
- Most data-parallel languages support only flat data parallelism (e.g. HPF, C*); NESL supports nested data parallelism

NESL Language Influences
- ML: developed in the 70's by Milner and others; a general purpose functional programming language; impure, because it permits side effects; known for its powerful (Hindley-Milner) type system, which can automatically infer the types of most expressions. Factorial in ML:

    fun fac (0 : int) : int = 1
      | fac (n : int) : int = n * fac (n - 1)

- SETL: developed in the 60's by Schwartz; a sequential language with high-level set operations; aggregate data types: unordered sets and sequences (aka tuples), maps; primitive set operations (intersection, union, etc.); quantified boolean expressions, including the universal and existential quantifiers of first-order predicate logic

NESL Operation: Apply-to-Each
- Uses a set-like notation: {a * a : a in [3, -4, -9, 5]} squares each element, returning [9, 16, 81, 25]
- Can be applied over multiple sequences: {a + b : a in [3, -4, -9, 5]; b in [1, 2, 3, 4]} adds the two sequences elementwise, yielding [4, -2, -6, 9]
- Can subselect elements: {a * a : a in [3, -4, -9, 5] | a > 0} squares each element that is > 0, returning [9, 25]; selected elements remain in their relative order
- Can apply any function to the elements of a sequence

Example: Apply-to-Each of a User Function
- Define a function:

    function factorial(n) =
      if (n == 1) then 1
      else n * factorial(n - 1);

- Apply it over the elements of a sequence: {factorial(x) : x in [3, 1, 7]} applies factorial to each element, returning [6, 1, 5040]

Sample NESL Built-in Functions
- Data-parallel operations on sequences:
  - sum: compute the sum of a sequence; {sum(a) : a in [[2,3], [8,3,9], [7]]} returns [5, 20, 7]
  - reverse: reverses a sequence
  - flatten: collapse nested sequences into an unnested sequence; flatten([[2,3], [8,3,9], [7]]) returns [2, 3, 8, 3, 9, 7]
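NESL's apply-to-each has a direct analogue in Java streams. A sketch of the two slide examples — square every element, and square only the positive ones (relative order preserved, as in NESL):

```java
// Sketch: NESL apply-to-each rendered with Java streams.
// {a * a : a in [3,-4,-9,5]}         -> [9, 16, 81, 25]
// {a * a : a in [3,-4,-9,5] | a > 0} -> [9, 25]  (relative order preserved)
import java.util.Arrays;
import java.util.stream.IntStream;

final class ApplyToEach {
    static int[] squares(int[] s) {
        return IntStream.of(s).map(a -> a * a).toArray();
    }
    static int[] squaresOfPositives(int[] s) {
        return IntStream.of(s).filter(a -> a > 0).map(a -> a * a).toArray();
    }
    public static void main(String[] args) {
        int[] s = {3, -4, -9, 5};
        System.out.println(Arrays.toString(squares(s)));            // [9, 16, 81, 25]
        System.out.println(Arrays.toString(squaresOfPositives(s))); // [9, 25]
    }
}
```

Like NESL's comprehension, the stream pipeline names no processors and imposes no order on the elementwise operations — which is exactly what makes apply-to-each trivially parallelizable.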
More NESL Built-in Functions (operation — description; the table is partially garbled in the source):
- dist(a, l) — create a sequence of a's of length l
- [s : e : d] — return the integer sequence from s to e by d
- drop(a, n) — drop the first n elements of sequence a
- interleave(a, b) — interleave the elements of sequences a and b
- flatten(a) — flatten nested sequence a

Modifying Sequences
- Only "write" functions may modify sequences in parallel
- write(s, pairs): parallel modification of a sequence in place; s is a sequence, pairs is a sequence of (position, value) tuples; pair (i, v) writes v into slot i of s
  - example: write([0,0,0,0,0,0,0,0], [(4,2), (2,5), (5,9)]) returns [0,0,5,0,2,9,0,0]
  - if two pairs write to the same location, an arbitrary one succeeds
- ewrite: exclusive parallel modification of a sequence in place; like write, except multiple writes to the same index cause an error
- Analogous to the CW and EW write operations on a PRAM

Parallel Programming in NESL
- Want total work close to that of a good serial algorithm: a work-efficient algorithm's parallel work is within a constant factor of its serial work
- Low depth implies a fast parallel running time

Sparse Matrix-Vector Multiply in NESL
- Example: the 4x4 tridiagonal matrix A with 20 on the diagonal and -10 on the off-diagonals
- Sparse representation of A in NESL — each row is a sequence of (column-index, value) pairs:

    A = [[(0,20), (1,-10)],
         [(0,-10), (1,20), (2,-10)],
         [(1,-10), (2,20), (3,-10)],
         [(2,-10), (3,20)]]

- Dot product of a vector with a sparse row: sum({v * x[i] : (i, v) in row})
- Sparse matrix-vector multiply: {sum({v * x[i] : (i, v) in row}) : row in A}

Sequential Quicksort

    procedure QUICKSORT(S):
      if S contains at most one element then return S
      else begin
        choose an element a randomly from S
        let S1, S2, and S3 be the sequences of elements in S less than,
          equal to, and greater than a, respectively
        return QUICKSORT(S1) followed by S2 followed by QUICKSORT(S3)
      end

- The algorithm runs on average in time O(n log n) (expected case)
- The maximum depth of recursive calls is O(log n) (expected case)
- Where's the potential parallelism? How can we exploit it?

Quicksort in NESL

    function quicksort(S) =
      if (#S <= 1) then S
      else
        let a  = S[rand(#S)];
            S1 = {e in S | e < a};
            S2 = {e in S | e == a};
            S3 = {e in S | e > a};
            R  = {quicksort(v) : v in [S1, S3]};
        in R[0] ++ S2 ++ R[1];
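The NESL quicksort can be transcribed into sequential Java as a sketch. In NESL the three filtering comprehensions and the two recursive calls are independent and execute in parallel — which is where the O(log n) expected depth comes from; here they simply run one after the other.

```java
// Sketch: the NESL quicksort in sequential Java. In NESL the three filters
// and the two recursive calls execute in parallel; here they are serialized.
import java.util.ArrayList;
import java.util.List;

final class Quicksort {
    static List<Integer> sort(List<Integer> s) {
        if (s.size() <= 1) return s;
        int a = s.get(s.size() / 2);  // pivot (the NESL version picks one at random)
        List<Integer> lt = new ArrayList<>(), eq = new ArrayList<>(), gt = new ArrayList<>();
        for (int e : s) {             // {e in S | e < a} etc., serialized here
            if (e < a) lt.add(e);
            else if (e == a) eq.add(e);
            else gt.add(e);
        }
        List<Integer> result = new ArrayList<>(sort(lt));  // independent of sort(gt)
        result.addAll(eq);
        result.addAll(sort(gt));
        return result;
    }
    public static void main(String[] args) {
        System.out.println(sort(List.of(3, 1, 4, 1, 5, 9, 2, 6)));
        // [1, 1, 2, 3, 4, 5, 6, 9]
    }
}
```

Marking `sort(lt)` and `sort(gt)` as parallel tasks (e.g. with a fork/join pool) would recover the parallel structure the NESL comprehension expresses declaratively.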
- Work: O(n log n) expected; depth: O(log n) expected
- Total work is still O(n log n)
- Recursion depth is O(log n); the depth of each operation is constant, so total depth is O(log n) as well

Plus-Scan in NESL

    function scan(a) =
      if (#a == 1) then [0]
      else
        let es = even_elts(a);
            os = odd_elts(a);
            ss = scan({e + o : e in es; o in os})
        in interleave(ss, {s + e : s in ss; e in es});

- Work: O(n); depth: O(log n)
- Takes a sequence and returns a sequence of equal length
- Each element of the returned value is the sum of all previous elements of the original sequence
- Scan on the sequence [3, 5, 3, 1, 6] returns [0, 3, 8, 11, 12]

Computing Primes Serially: Sieve of Eratosthenes

    procedure PRIMES(n):
      let A be an array of length n
      set all but the first element of A to TRUE
      for i from 2 to sqrt(n) begin
        if A[i] is TRUE then set all multiples of i up to n to FALSE
      end

- Serial work is known to be O(n log log n)

Computing Primes in NESL

    function primes(n) =
      if n == 2 then ([] : [int])
      else
        let sqr_primes = primes(ceil(sqrt(float(n))));
            composites = flatten({[2*p : n : p] : p in sqr_primes});
            flags      = write(dist(true, n), {(i, false) : i in composites});
            indices    = {i in [0:n]; fl in flags | fl}
        in drop(indices, 2);

- Example for primes(20): sqr_primes = [2, 3]; composites = [[4,6,8,10,12,14,16,18], [6,9,12,15,18]], flattened to [4,6,8,10,12,14,16,18,6,9,12,15,18]; flags mark those indices false; indices = [0,1,2,3,5,7,11,13,17,19]; result = [2,3,5,7,11,13,17,19]
- Work: O(n log log n); depth: O(log n)

Work and Depth vs. Running Time
- Work and depth bound the running time of an algorithm at the two limits: on one processor, time = work; on unlimited processors, time = depth
- In general, on P processors, W/P <= T <= W/P + D (Brent's bound)

Performance Model
- For an expression e1 + e2 (assuming the cost of + is 1):
    W(e1 + e2) = 1 + W(e1) + W(e2)
    D(e1 + e2) = 1 + max(D(e1), D(e2))
- For an apply-to-each expression {e1(a) : a in e2}:
    W({e1(a) : a in e2}) = W(e2) + sum over a in e2 of (1 + W(e1(a)))
    D({e1(a) : a in e2}) = D(e2) + max over a in e2 of D(e1(a))

Thought Questions
- Is NESL a practical model for parallel programming?
- Can there be data races in NESL programs?

References
- Programming Parallel Algorithms. Guy Blelloch. Communications of the ACM, volume 39, number 3, March 1996.
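The even/odd plus-scan can be transcribed into Java as a sketch; like the slide's version, it assumes the input length is a power of two. Pairwise sums halve the problem size at each level, which is why the work stays O(n) while the depth is O(log n).

```java
// Sketch: NESL's work-efficient plus-scan (exclusive prefix sum) in Java.
// Pairwise sums halve the problem, giving O(n) work and O(log n) depth.
// Assumes a power-of-two input length, as the slide's version does.
final class PlusScan {
    static int[] scan(int[] a) {
        int n = a.length;
        if (n == 1) return new int[]{0};
        int[] pairSums = new int[n / 2];
        for (int i = 0; i < n / 2; i++)          // parallel in NESL
            pairSums[i] = a[2 * i] + a[2 * i + 1];
        int[] s = scan(pairSums);                // recurse on half the length
        int[] out = new int[n];
        for (int i = 0; i < n; i++)              // interleave step, parallel in NESL
            out[i] = (i % 2 == 0) ? s[i / 2] : s[i / 2] + a[i - 1];
        return out;
    }
    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(scan(new int[]{3, 5, 3, 1, 6, 2, 4, 8})));
        // [0, 3, 8, 11, 12, 18, 20, 24]
    }
}
```

Each element of the output is the sum of all earlier inputs, matching the slide's definition of scan.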
Java Memory Model
John Mellor-Crummey, Department of Computer Science, Rice University
johnmc@cs.rice.edu — COMP 522, Lecture 12, 13 October 2008

Why a Memory Model for Java?
- Java supports threads that share memory
- Need a memory model to define program semantics
- Key issues: strong safety and security properties

Impact of the Memory Model
- Impacts language implementation — determines: transformations a Java compiler may apply when producing byte codes; transformations the virtual machine may apply to byte codes when producing native code; optimizations hardware may perform on the native code
- Impacts program development — determines the possible outcomes of a program: what results are legal for programs in the language; what design patterns are legal for communicating between threads (e.g. is double-checked locking valid?)

Example: Double-Checked Locking

    class Foo {
      private Helper helper = null;
      public Helper getHelper() {
        if (helper == null) {
          synchronized (this) {
            if (helper == null)
              helper = new Helper();
          }
        }
        return helper;
      }
    }

- Thread A sees a null helper, acquires the lock, and begins initialization
- The compiler might set helper before A finishes the constructor
- Thread B sees a non-null helper and returns the value

Fixing Double-Checked Locking

    class Foo {
      private Helper helper = null;
      public Helper getHelper() {
        if (helper == null) {
          Helper h;
          synchronized (this) {
            h = helper;
            if (h == null)
              synchronized (this) {
                h = new Helper();
              } // release inner synchronization lock
            helper = h;
          }
        }
        return helper;
      }
    }

- Still broken: "helper = h" may be moved into the inner synchronized block and executed before initialization is complete

Double-checked Locking in C++
- C++ implementation with explicit memory barriers (from "Patterns for Concurrent and Distributed Objects" by Doug Schmidt):

    template <class TYPE, class LOCK>
    TYPE *Singleton<TYPE, LOCK>::instance (void) {
      // First check
      TYPE *tmp = instance_;
      // Insert the CPU-specific memory barrier instruction
      // to synchronize the cache lines on multi-processor.
      asm ("memoryBarrier");
      if (tmp == 0) {
        // constructor acquires lock
        Guard<LOCK> guard (lock_);
        // Double check
        tmp = instance_;
        if (tmp == 0) {
          tmp = new TYPE;
          // Insert the CPU-specific memory barrier instruction
          // to synchronize the cache lines on multi-processor.
          asm ("memoryBarrier");
          instance_ = tmp;
        }
      }
      return tmp; // guard ensures serialization
    }
    private:
      static TYPE *instance_;
      static LOCK lock_;

Today's Paper: Not the First JMM
- The original memory model for Java (JMM 1.0) appeared in the Java language specification (Gosling, Joy, Steele); it was hard to interpret, poorly understood, and had many problems
- One feature: JMM 1.0 required coherence — informally, for each variable in isolation, uses and assigns to that variable must appear as if they acted directly on global memory, in some order that respects the order within each thread; that is, each variable in isolation is sequentially consistent

JMM 1.0 Fatally Flawed (Pugh, CPE '00)
- Imposed constraints that prohibited common compiler optimizations; e.g. coherence leads to "reads that kill":

    // p and q might be aliased
    int i = p.x;
    // concurrent write to p.x by another thread
    int j = q.x;  // could see the new value
    int k = p.x;  // must see the new value too if q == p,
                  // so the earlier value of p.x cannot be reused

- Expensive to implement on existing hardware; most JVMs violated the constraints in the model
- Conforming to the specification would impose significant performance penalties: enforcing "reads that kill" in the compiler yielded programs 20-45% slower (Scales '99)

Goals of the New JMM
- Guarantee safety and security
- Strike a balance between ease of use for programmers and implementation flexibility for system designers

Why Not Sequential Consistency?
- Easiest to understand; the programmer can write concurrent algorithms, but doing so requires great skill — and algorithms that exploit SC are often incorrect, so such programming should be discouraged
- SC restricts reordering pairs of memory ops in a thread even when there are no data or control dependences. Initially x == y == 0:

    Thread 1: 1: r2 = x;  2: y = 1;
    Thread 2: 3: r1 = y;  4: x = 2;

  The result r2 == 2, r1 == 1 violates sequential consistency: under SC, no compiler or processor reorderings should be visible
- SC can be a serious obstacle to performance; e.g. even common subexpression elimination is illegal

Java Memory Model
- A relaxed memory model
- Adopts the data-race-free model: correct programs are free of data races, and such programs are guaranteed sequential consistency
- What about programs with races — must their behaviors be defined? Yes
- Why define behaviors for programs with races? Leaving their semantics unspecified allows violation of safety and security properties
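The data-race-free guarantee can be seen in a minimal runnable sketch: a volatile flag creates the synchronizes-with (and hence happens-before) edge that makes a plain write visible to another thread. Class and field names here are illustrative only.

```java
// Sketch: safe publication via a volatile flag. Because `ready` is volatile,
// the write x = 1 happens-before any read that sees ready == true, so a
// reader that observes the flag must also observe x == 1.
final class Publish {
    static int x = 0;
    static volatile boolean ready = false;
    static int reader() {
        while (!ready) { }   // spin on the volatile flag
        return x;            // guaranteed to see 1
    }
    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> { x = 1; ready = true; });
        writer.start();
        System.out.println(reader());  // prints 1
        writer.join();
    }
}
```

Dropping `volatile` from `ready` would reintroduce a data race: the JMM would then permit the reader to spin forever or to observe x == 0.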
Java Memory Model Sketch
- Build legal executions iteratively: in each iteration, commit a set of memory actions
- Actions may be committed if they occur in a well-behaved execution that also contains the prior committed actions
- A careful definition of "well behaved" ensures undesirable behaviors are prohibited while standard compiler optimizations are allowed
- Focus of the model: ordinary reads and writes, locks, volatile variables; it also covers interactions with the outside world and infinite executions

Java Memory Model Benefits
- Fully specified memory semantics
- Guarantees safety and security
- Flexible enough to permit performance (current and future!) and programmer sanity

Terminology I
- Conflicting accesses: two accesses to the same variable, at least one of which is a write
- Synchronization actions: lock/unlock, reads/writes of volatile variables
- Synchronization order: a total order over all synchronization actions; must be consistent with program order; a read of a volatile must return the value of the last write in synchronization order

Terminology II
- Synchronizes-with order: a partial order relating unlock(l) to all subsequent lock(l), and volatile-write(v) to all subsequent volatile-read(v); "subsequent" means according to the synchronization order
- Happens-before order: the transitive closure of the synchronizes-with order and program order
- Data race: conflicting accesses unordered by happens-before
- Correctly synchronized (data race free): only if all SC executions of the program are free of data races

Happens-Before Model
- Key question: what values can a read see?
- An execution has a synchronization order (a total order); our interest is the synchronizes-with order (a partial order)
- Synchronization actions induce synchronizes-with edges: a volatile write W(v) synchronizes-with all subsequent reads R(v) by any thread (partial because we don't care about R(v) -> R(v) ordering); unlock(l) synchronizes-with all subsequent lock(l) by any thread (many such orders)
- The happens-before partial order is the transitive closure of synchronizes-with edges and program-order edges
- A well-formed execution must obey: happens-before consistency for reads R(v) of ordinary v; synchronization order consistency for reads of volatile v

Happens-Before Example 1
- Using happens-before: introduce synchronizes-with edges using volatiles. Initially x == 0, ready == false; ready is a volatile:

    Thread 1: x = 1; ready = true;
    Thread 2: if (ready) r1 = x;

- If "r1 = x" executes, it will read 1
- Could a compiler reorder the statements in Thread 1? No

Happens-Before Example 2
- After everything executes, could r2 == 2 and r1 == 1? Initially x == y == 0:

    Thread 1: 1: r2 = x;  2: y = 1;
    Thread 2: 3: r1 = y;  4: x = 2;

- Yes (consider the order 4, 1, 2, 3) — the same reordering as before
- Why? There are no volatiles, so no synchronizes-with edges, so no happens-before edges between the threads: the partial order is under-constrained
- Is this a problem? It is a data race, so the program is incorrectly synchronized; we don't have to guarantee sequential consistency. So far, so good

Legal Double-Checked Locking

    class Foo {
      private volatile Helper helper = null;
      public Helper getHelper() {
        if (helper == null) {
          synchronized (this) {
            if (helper == null)
              helper = new Helper();
          }
        }
        return helper;
      }
    }

- volatile ensures proper initialization order

What About Programs with Races?
- We can't leave their behavior undefined: that would be inconsistent with Java's safety and security guarantees

Out-of-thin-air Problem
- Assume an incorrectly synchronized program. After execution, could r1 == r2 == 42? Initially x == y == 0:

    Thread 1: 1: r1 = x;  2: y = r1;
    Thread 2: 3: r2 = y;  4: x = r2;

- What if a compiler speculatively loaded 42 into y and x? Consider the order 1, 2, 3, 4
- Analysis: should we disallow this compiler optimization? Why not let this behavior be undefined?
- Suppose 42 was &ThreadContextSwitcher: unintentional behaviors would violate safety; intentional behaviors are a security risk

Happens-Before & Out-of-thin-air
- With only the happens-before rules, the speculative load of 42 into y and x is allowed
- Happens-before allows behavior we want to prevent: we must guarantee the safety of incorrectly synchronized programs
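The happens-before order defined above — the transitive closure of program order and synchronizes-with — can be computed for a toy execution. A sketch with a hypothetical action numbering:

```java
// Sketch: happens-before as the transitive closure of program-order and
// synchronizes-with edges, for a toy 4-action execution:
//   0: T1 writes x   1: T1 unlocks m   2: T2 locks m   3: T2 reads x
final class HBClosure {
    static boolean[][] happensBefore(boolean[][] edges) {
        int n = edges.length;
        boolean[][] hb = new boolean[n][n];
        for (int i = 0; i < n; i++) hb[i] = edges[i].clone();
        for (int k = 0; k < n; k++)            // boolean Floyd-Warshall closure
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    if (hb[i][k] && hb[k][j]) hb[i][j] = true;
        return hb;
    }
    public static void main(String[] args) {
        boolean[][] e = new boolean[4][4];
        e[0][1] = true;  // program order in T1: write x, then unlock m
        e[2][3] = true;  // program order in T2: lock m, then read x
        e[1][2] = true;  // synchronizes-with: unlock(m) -> subsequent lock(m)
        boolean[][] hb = happensBefore(e);
        System.out.println(hb[0][3]);  // true: the write of x happens-before
                                       // the read, so the accesses do not race
    }
}
```

Deleting the synchronizes-with edge (1 -> 2) leaves the write and the read unordered by happens-before — exactly the definition of a data race.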
HappensBefore amp Correct Sync Af39ter execution could r1 r2 42 Initially x y Thread 2 Thread 1 1 r1 x 3 if r1 0 r2 y 4 2 if r2 03 y 42 x 42 Where is the synchronization no sequentially consistent execution performs such if bodies thus correctly synchronized no dataraces HappensBefore amp Correct Sync Af39ter execution could r1 r2 42 Initially x y 0 Thread 2 Thread 1 r2 y 4 if r2 03 x 42 Where is the synchronization no sequentially consistent execution performs such if bodies thus correctly synchronized no dataraces Same old trick load 42 in y and x HappensBefore amp Correct Sync Af39ter execution could r1 r2 42 Initially x y Thread 1 Thread 2 Where is the synchronization no sequentially consistent execution performs such if bodies thus correctly synchronized no datara Same old trick load 42 in y and x l Happensbefore does not uarantee that correctly synchronized programs ave SC sema 39 Data and Control Dependencies Fundamental problem Initially x y Thread 1 Thread 2 Data and Control Dependences speculatively loading 42 in y and x a circular causality Data and Control Dependencies Fundamental problem Initially x y Thread 1 Thread 2 Data and Control Dependences speculatively loading 42 in y and x a circular causality 3e2lt 16 5lt 41lt 64lt 3 A write causes an illegal write Data and Control Dependencies However we want to allow this circularity Before and after com ilero timizatio Thread 1 Thread 2 Data and Control Dependencies However we want to allow this circularity Before and after com ilero timizatio nitiall Thread 1 Thread 2 4 A 3 112 5 lt 4 1 lt Data and Control Dependencies However we want to allow this circularity Before and after com ilero timizatio Thread 1 Thread 2 4e34 125lt 4 12lt 6 Thus r1r2r32 is notse consistent Now r1r2r32 is seq consistent r2 In if true Data and Control Dependencies How to restrict former outofthinair case speculative write of 42 allow latter case early write of b2 Similarities both optimize across control dependencies branchesl 
both move a write former speculative write of 42 latter early write of b2 Data and Control Dependencies How to restrict former outofthinair case speculative write of 42 allow latter case early write of b2 Similarities both optimize across control dependencies branchesl both move a write former speculative write of 42 latter early write of b2 Differences no y42 with sequential consistency as with all the out of thinair examples b2 occurs with sequential consistency Data and Control Dependencies How to restrict former outofthinair case speculative write of 42 allow latter case early write of b2 Similarities both optimize across control dependencies branchesl both move a write former speculative write of 42 latter early write of b2 Differences no y42 with sequential consistency as with all the out of thinair examples b2 occurs with sequential consistency lnsi ht Use a wellbehaved executio to justify another execution Formal Model Goals Correctly synchronized programs exhibit only sequentially consistent behavior Ensure safety of incorrectly synchronized programs Wellbehaved Executions Build a wellbehaved execution iteratively statements are committed in groups a group is committed if it occurs in a wellbehaved execution that also contains actions committed so far bad causal cycles gt failure to commit Wellbehaved Executions Build a wellbehaved execution iteratively statements are committed in groups a group is committed if it occurs in a wellbehaved execution that also contains actions committed so far bad causal cycles gt failure to commit Observation early execution does not result in a bad causal cycle if the execution is not dependent on a read returning a value from a data race Wellbehaved Executions Build a wellbehaved execution iteratively statements are committed in groups a group is committed if it occurs in a wellbehaved execution that also contains actions committed so far bad causal cycles gt failure to commit Observation early execution does not result in 
a bad causal cycle if the execution is not dependent on a read returning a value from a data race.
Well-behaved executions should require that an uncommitted read must return the value of a write that is ordered before it by happens-before.

Justifying Executions
Use a well-behaved execution to justify further executions in which writes occur early.
Given a well-behaved execution: may commit any of the uncommitted writes that occur in it, with the same address and value; may commit any uncommitted reads that occur, requiring these reads to return a previously committed value (in each execution these values may be different).

Formal Model Example
Initially x = y = 0.
Thread 1: r1 = x; y = 1. Thread 2: r2 = y; x = r2.
r1 = r2 = 1 is a legal behavior.
Figure 7: A standard reordering.
Figure 8: Table of commit sets for Figure 7 (for each action: its final value, the commit step in which it is first committed, and the execution in which it first sees its final value).

Considerations for Implementers
Weak control dependence:
Initially x = y = 0.
Thread 1: do { r1 = x; } while (r1 == 0); y = 42;
Thread 2: do { r2 = y; } while (r2 == 0); x = 42;
Correctly synchronized, so non-termination is the only legal behavior.
Figure 13: Correctly synchronized program.
Desirable reorderings are allowed (proof omitted).
Redundant synchronization can be ignored or removed.
Volatile fields of thread-local objects can be treated as normal fields.
Thread inlining is not legal in the JMM.

Summary
Motivation: preserve language guarantees (safety, security), remain sufficiently easy to use, and permit optimizations used in current (and future) compilers and hardware.
Approach: a formal model for iteratively justifying well-behaved executions.
Result: data-race-free programs are sequentially consistent; the behaviors of incorrectly synchronized programs are specified.

References
The Java Memory Model, J. Manson, W. Pugh and S. V. Adve, in Proceedings of the Symposium on Principles of Programming Languages (POPL), January 2005.
Double-Checked Locking, http://en.wikipedia.org/wiki/Double-checked_locking
The "Double-Checked Locking is Broken" Declaration, David Bacon et al.,
http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html
Double-checked Locking: An optimization pattern for efficiently initializing and accessing thread-safe objects, Douglas Schmidt and Tim Harrison, The 3rd Annual Pattern Languages of Programs (PLoP), Allerton Park, Illinois, Sept. 1996. http://www.cse.wustl.edu/~schmidt/PDF/DC-Locking.pdf

Ct: A Programming Model for Tera-Scale Processors
John Mellor-Crummey
Department of Computer Science, Rice University
johnmc@cs.rice.edu
COMP 522 Lecture 11, 30 September 2008

Context
Last class: NESL, a mostly functional language; nested data parallelism; a language-based performance model.
Today: Ct, embedding NESL ideas into C/C++ for an expressive programming model.

Process and Architecture Trends
CPU designs are increasingly power constrained: power efficiency is the ultimate goal; must improve MIPS/watt over traditional processors.
Hardware and power requirements of exposing parallelism in OOO processors: simultaneous multithreading is good at hiding memory latency but requires more area; vector ISAs in future architectures will use longer SIMD vectors (e.g., 4, 8, 16), which improves power efficiency but is inefficient for algorithms using short or unaligned vectors.
Expect to scale the number of cores about 2x per generation.

Programming Problem
ISVs are concerned about the burden of managing explicit parallelism: threads and vector intrinsics provide maximum flexibility but reduce programmer productivity and hurt scalability as vector lengths and core counts evolve.
GPU programming approach: avoid exposing threads or SIMD width (e.g., in DirectX, CUDA, Cg); process a collection of pixels using streaming data parallelism; each pixel is processed independently of its neighbors; the constrained programming model enables the compiler and runtime to manage parallelism.
Weaknesses: requires significant effort to reformulate code; difficult to use linked lists and compressed data structures; may lead to less efficient algorithms.

Motivation for Ct
Processors are exposing more parallelism to software: multiple cores, wider SIMD
instructions (e.g., SSE).
Two key software challenges to address this trend: mitigate the increased complexity of exposed parallelism; provide scalability to increasing core counts and SIMD ISA widths.

Ct Design Goals
A comprehensive data-parallel programming model.
Aims to leverage features of emerging GPU programming models while fully exposing CPU flexibility.
Support writing code for multiple processor architectures (unlike GPU programming models, which expose architecture constraints).
Easy to use.

Ct Goals
Define a constrained programming model for highly parallel general-purpose cores.
Target a more general architecture than GPUs: coherent cache hierarchies; low-latency thread synchronization across the processor array; narrower SIMD width.
Provide an expressive programming model: exploit the advantages of a less specialized architecture; run applications written to use a GPU programming model; require less special-purpose software (e.g., linked lists OK).
Ease incremental adoption: extend C/C++; support use of legacy threading APIs.

Ct is Deterministic
Advantages of determinism: a predictable high-level programming model; programs behave identically regardless of core counts; data races are impossible.
Easier to use than unconstrained threading models.

Ct Expressiveness: Sparse Matrices
Why sparse matrices? They are important for many classes of problems.
NIST Matrix Market sparse matrix problem domains: Acoustic Scattering, Air Traffic Control, Astrophysics, Biochemistry, Chemical Engineering, Chemical Kinetics, Circuit Physics, Computer System Simulation, Demography, Economics, Finite Element Analysis, Fluid Flow, Linear Programming, Nuclear Reactor Design, Numerical Linear Algebra, Oceanography, Petroleum Engineering, Plasma Physics, Power System Networks, Quantum Physics, Structural Engineering, Surveying.

Sparse Matrix Representations
Two common sparse representations.
Compressed sparse row (CSR): store the nonzeros in a Values array, the column of each nonzero in a ColIdx array, and the starting offset of each row in a RowP array:

for (row = 0; row < nrows; row++)
  for (elt = RowP[row]; elt < RowP[row+1]; elt++)
    /* access element of A[row][ColIdx[elt]] as Values[elt] */

Compressed sparse column (CSC) is similar.
Challenges: the inner loop has a varying and unpredictable trip count, so it is hard to balance the workload across inner loop instances; indirection through the ColIdx (similarly RowIdx for CSC) array creates aliases and potential data dependences.

Patterns in Sparse-matrix Vector Product
Implications for parallelism of the patterns in SMVP.
CSR: x[ColIdx[elt]]: x is permuted by the contents of the column index, similar to a GPU gather operation; result[row]: sum reduction of all product pairs within an inner loop instance.
CSC: x[col]: each element of x will be used ColP[col+1] - ColP[col] times, also somewhat like a GPU gather operation; result[RowIdx[elt]]: combining send / multi-reduction / combining scatter.

Conceptual Pattern for Performing SMVP
Permute the vector.
Perform an elementwise product with the matrix elements.
Perform some flavor of reduction.

Ct Nested Data Parallelism
Exhibit parallelism as operations over collections of data.
Abstract parallelism away from cores, hardware threads, and vectors to reduce architecture dependence.
Provide support for handling irregular problems: dynamic parallel data types; support for reductions and prefix sums; moderate control flow handled well (structured conditionals, loop nests, nested loops, recursive functions).

CPU vs. GPU Issues
Collective operations (e.g., reductions and prefix sum) need inter-thread synchronization and need to exchange partial results.
GPUs lack facilities for doing this efficiently: all coordination is through off-chip memory, which is costly; reductions must repeatedly write/read partial results to/from memory; a reduction of length n uses log n stages of progressive combining (n -> n/2 elements, n/2 -> n/4 elements, ..., 2 -> 1 element): O(n log n) memory bandwidth.
More traditional multicore processors offer better support: coherent caches support core-to-core communication with no demands for off-chip memory bandwidth; a high-performance interconnect provides low-latency coordination.
Sparse matrices: segmented reductions, multi-reduce, and prefix sum are complicated on a GPU:
even more passes through memory.

Ct Features
TVEC: polymorphic, managed parallel vectors.
Functionally pure operators: elementwise, collective, permutation.
Nested vectors.

Ct TVEC
Template-style polymorphic type.
Declaration: TVEC<type> var; primitive types I32, I64, F32, F64, Bool, plus user-defined struct-like base types.
Write-once vectors.
Reside in vector space, segregated from native C/C++ types; data must be explicitly copied in/out of vector space:
Red = copyIn(CRed, height * width, I8);
copyOut((void *) CXes, Xes);
Use only Ct's functionally pure operators on TVECs; TVECs are logically passed by value to guarantee safety.

Ct TVEC Management
Managed vector/native space copying: copyIn, copyIn2D, copyIn3D, copyOut.
Vector creation and manipulation: cat, repeat, replicate, replace, index, copy, newVector, extract, length.

Ct Operators
Functionally pure operators, logically free from side effects.
Each operator logically returns a new TVEC:
scaledRed = Red * 0.5; // scaledRed is a new TVEC
Three flavors of operators:
elementwise operators that require no interactions (implemented with map operations in functional languages);
collective operators, e.g., reductions and scans;
permutation operators that move data from its original position to its final position, e.g., a gather operator that uses an index array to collect values, or pack, which uses a bit vector to select elements from a source vector.

Ct Operators
Typical elementwise operators (vector-vector; scalar-vector variants also exist): add, sub, mul, div, equal, min, max, mod, lsh, rsh, greater, less, geq, leq, neq, ior, and, xor, power, select.
Unary elementwise operators (map): not, log, exp, sqrt, rsqrt, sin, cos, tan, asin, acos, atan, sinh, cosh, tanh, floor, ceiling, round.
Operator overloading eases the programming burden; example: TVEC<F32> A, B, C; A = B + C; // resolves to add
Collective operators:
reduce: addReduce, mulReduce, minReduce, maxReduce, andReduce, iorReduce, xorReduce;
scan (prefix sum): addScan, mulScan, minScan, maxScan, andScan, iorScan, xorScan.

Ct Permutation Operators
Pack/unpack: pack, unpack.
Scatter/gather: scatter, gather.
Shift/rotate: shiftDefaultPermute,
rotateDefaultPermute.
Partition: partition, unpartition.
Miscellaneous: defaultPermute, omegaPermute, butterflyPermute, distribute.

Ct Supports Nested Vectors with TVEC
Nested vectors:
flat vector: [a b c d e];
nested vector: [[a b] [c] [d e]].
TVEC support for managing nested vectors: newNestedVector, applyNesting, copyNesting, setRegular2DNesting, setRegular3DNesting, getNesting, getNestAs.
Elementwise operators on TVECs support nested vectors:
add([[a b] [c d] [e f]], [[g h] [i j] [k l]]) = [[a+g b+h] [c+i d+j] [e+k f+l]].
Collectives on nested vectors respect the nesting boundaries:
addReduce([a b c d e f]) = singleton vector [a+b+c+d+e+f];
addReduce([[a b] [c d e f]]) = two-element vector [a+b, c+d+e+f].

Ct Abstractions Provide Transparency
Several layers of abstraction exist below this high-level API: task granularity; machine-width-independent SIMD intrinsics.
All are exposed for use by the expert programmer.

Coding with Ct

Example: Color Conversion
Difference between streaming style and data-parallel style: streaming style specifies an elementwise kernel; data-parallel style specifies a vector kernel:

TVEC<F32> colorconvert(TVEC<F32> rchannel, TVEC<F32> gchannel,
                       TVEC<F32> bchannel, TVEC<F32> achannel,
                       F32 a0, F32 a1, F32 a2, F32 a3) {
  return rchannel * a0 + gchannel * a1 + bchannel * a2 + achannel * a3;
}

Example: 2D Convolution
Compute each result pixel as the sum reduction of the product of a 3x3 stencil with the input pixels around it:

TVEC<F32> Convolve2D3x3(TVEC<F32> pixels, TVEC<F32> respixels,
                        I32 channels, TVEC<F32> kernel) {
  // directions[m][n] is a constant TVEC of size 2 with values (m-1, n-1)
  respixels  = shiftPermute(pixels, directions[0][0]) * kernel[0][0];
  respixels += shiftPermute(pixels, directions[0][1]) * kernel[0][1];
  respixels += shiftPermute(pixels, directions[0][2]) * kernel[0][2];
  respixels += shiftPermute(pixels, directions[1][0]) * kernel[1][0];
  respixels += pixels * kernel[1][1];
  respixels += shiftPermute(pixels, directions[1][2]) * kernel[1][2];
  respixels += shiftPermute(pixels, directions[2][0]) * kernel[2][0];
  respixels += shiftPermute(pixels, directions[2][1]) * kernel[2][1];
  respixels += shiftPermute(pixels, directions[2][2]) * kernel[2][2];
  return respixels;
}
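The data-parallel formulation above, nine shifted and scaled copies of the image summed elementwise, can be mimicked in plain C++. This is a sketch, not the Ct API: Vec, shift2D, and convolve3x3 are illustrative names I introduce here, and the sequential loops stand in for Ct's parallel elementwise operators and shiftPermute.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;  // stand-in for a TVEC<F32> holding a w x h image

// Shift a w x h image by (dx, dy), filling shifted-in pixels with zero.
// This plays the role Ct's shiftPermute plays in the slide's formulation.
Vec shift2D(const Vec& in, int w, int h, int dx, int dy) {
    Vec out(in.size(), 0.0f);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int sx = x - dx, sy = y - dy;
            if (sx >= 0 && sx < w && sy >= 0 && sy < h)
                out[y * w + x] = in[sy * w + sx];
        }
    return out;
}

// 3x3 convolution as the sum of nine shifted, kernel-scaled copies of the image.
Vec convolve3x3(const Vec& pixels, int w, int h, const float k[3][3]) {
    Vec res(pixels.size(), 0.0f);
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            Vec shifted = shift2D(pixels, w, h, dx, dy);
            for (std::size_t i = 0; i < res.size(); ++i)  // elementwise multiply-add
                res[i] += shifted[i] * k[dy + 1][dx + 1];
        }
    return res;
}
```

One design choice made explicit here is the boundary behavior: shifted-in pixels are zero-filled, so border pixels see a smaller effective stencil; the slide's code does not specify what its shiftPermute does at the edges.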
Example: Quicksort
(figure: a vector of keys is partitioned around a pivot and the partitions are sorted recursively)

Example: Data-parallel Quicksort
(figure: the same algorithm expressed over Ct nested vectors, with all partitions processed in parallel)

Sparse-matrix Vector Product

TVEC<F64> sparseMatrixVectorProductCSC(TVEC<F64> Values, TVEC<I32> RowIdx,
                                       TVEC<I32> ColP, TVEC<F64> v) {
  TVEC<F64> expv = distribute(v, ColP);
  TVEC<F64> product = Values * expv;
  product = applyNesting(product, RowIdx, ctIndex);
  TVEC<F64> result = addReduce(product);
  return result;
}

Compiling Ct for Terascale Chips
Concerns: memory bandwidth and task granularity.
Optimizations target these concerns: make effective use of memory bandwidth; minimize threading overhead.
Approach: use a fine-grain dataflow threading model; have the compiler merge sub-primitives to create coarser tasks.

Observations about Ct
Ct supports a fine-grained, adaptive, dependent tasking model based on implicit task parallelism.
Supports a high-level performance model that can be used to guide algorithmic choice: elementwise operators generally scale linearly for large vectors; the runtime system mitigates threading overhead for short vectors (throttles parallelism); collectives generally scale linearly with core count; synchronization patterns are architecture dependent, but that is OK.

Future of Ct
Emerging features: deterministic task parallelism; determinism is critical for programmer productivity (no data races).
Interested in applying Ct to exotic algorithms: lossy, real-time, and adaptive computation.

References
Ct: A Flexible Parallel Programming Model for Terascale Architectures, Anwar Ghuloum, Eric Sprangle, Jesse Fang, Gansha Wu and Xin Zhou, Intel White Paper, October 25, 2007.

Example: Preconditioned Conjugate Gradient

Piranha and Hydra: Two Sophisticated Chip Multiprocessor Designs
John Mellor-Crummey
Department of Computer Science, Rice University
johnmc@cs.rice.edu
COMP 522 Lecture 4
4 September 2008

Context
Last week: Tullsen et al. advocated simultaneous multithreading (1995); Olukotun et al. advocated a simple chip multiprocessor (1996).
Fast forward to 2000: today's papers present two sophisticated chip multiprocessor designs.

Familiar Motivation
Increasing transistor counts on chip.
Difficulties with wider superscalar designs: the complexity of wider instruction issue per cycle; complexity means more space and time.
Limited instruction-level parallelism (ILP) in applications.

Outline
Piranha: objectives; chip design and features; performance.
Hydra: objectives; chip design and features; performance.
Discussion.

Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing

Piranha: Workload-focused Design Objectives
Maximize performance on commercial workloads (OLTP, DSS): radically different characteristics from technical workloads; large data; suffer from memory stalls; little ILP; abundant thread-level parallelism; no use for floating point and multimedia.
Minimize processor development time and costs: SMT recognized as beneficial but adds to design complexity; most important for workloads w/o explicit thread-level parallelism.
Design it as a component in a scalable system.

Piranha System Architecture
CMP: hierarchically partitioned and replicated design; short wires lead to a fast cycle time.
Processor: 8 single-issue, in-order Alpha cores; private L1 instruction and data caches; 1 MB shared non-inclusive L2 cache w/ 8 banks, each bank 8-way set associative; 8 memory controllers, 128 GB/s aggregate memory bandwidth; 2 coherence protocol engines (home and remote); network router, interconnect b/w 32 GB/s.
I/O node.
Modular, glueless scalability.

Piranha: Novel Ideas
A CMP cache coherence protocol that does not enforce inclusion in the L1 instruction and data caches: performance improves when lower-level caches are isolated from the effects of the coherence checks and invalidations required with inclusive designs.
A coherence protocol that has fewer protocol messages and
lower protocol engine occupancy.
I/O nodes fully integrated into the system architecture: they participate in global shared-memory coherence.

Piranha Processor Block Diagram
(figure: CPU cores, interconnect links, input/output queues, home and remote protocol engines, router control, and a direct Rambus array)

Piranha Memory Hierarchy
L1 caches: per-core separate data and instruction caches; aggregate capacity 1 MB.
L2 cache: logically shared 1 MB cache; 8 physical banks interleaved using lower address bits; each bank is 8-way associative with round-robin (least-recently-loaded) replacement.
Non-inclusive cache hierarchy: L2 acts like a victim cache; data in L1 is NOT automatically present in L2 as well; avoids wasting cache capacity on duplicate data; write data into L2 when replacing lines in L1; L2 decides when writebacks should occur.

Piranha's Support for Cache Coherency
L2 controllers keep a duplicate copy of the L1 tags and state.
L1/L2 state includes ownership: the owner is the L2, an L1 holding the line in exclusive state, or one of the L1s when there are multiple sharers.
L2 controllers enforce on-chip coherence (invalidations).
L2 can respond to an L1 memory request by: supplying the requested data from L2; forwarding the request to an on-chip owner L1; forwarding the request to a protocol engine; obtaining local data from memory through a private memory controller.
One controller and associated RDRAM channel per L2 bank.

Example Piranha-based System
Arbitrary interconnect topologies.
Glueless scaling to systems with 1024 nodes.
I/O as a full-fledged member of the interconnect.
Figure 3: example configuration for a Piranha system with six processing chips (8 CPUs each) and two I/O chips.

Piranha: Access to Off-chip Memory
Protocol engines support access to remote memory: a home engine for exporting local memory; a remote engine for importing remote memory.
Directory information is maintained at node level.
Hot-potato routing reduces buffering requirements.
Invalidation-based directory protocol.
Cruise-missile invalidates bound the number of invalidation messages per request. Properties: each request
visits a predetermined set of nodes; a single acknowledgement goes to the initiator.
Benefits: avoids serial injection of multiple invalidates by the initiator.

Piranha Interconnection Network
Intra-Chip Switch: 27 ports, each with 2 datapaths (send, recv); 8 internal data paths; 32 GB/s capacity; 2 logical lanes (low and high priority).
System interconnect: input queue, output queue, router; supports 2 packet types: short (128 bits, header only, dataless transaction) and long (header plus a 64-byte data section).

Piranha Performance
(figures: estimated performance of a single-chip Piranha, 8 CPUs/chip, versus a 1 GHz out-of-order processor, for OLTP and DSS; Figure 6: (a) speedup and (b) L1 miss breakdown for OLTP)

The Stanford Hydra CMP

Hydra Design Objectives
Make CMP viable for serial applications using thread-level speculation.
Minimize the complexity of high-speed logic.
Minimize the latency of interprocessor communication on chip.

Thread-level Speculation
Convert instructions to sequenced threads: e.g., loop iterations as separate threads; procedure calls as separate threads.
Run threads in parallel.
Use hardware to track inter-thread dependences.
Correct any violations by re-executing instructions with correct data.

Thread-level Speculation (TLS)
Advantages: simplifies parallelization; no need for synchronization points; doesn't need to be as conservative.
Disadvantages: must undo errors, which degrades performance; requires hardware support and overhead.

Hydra Design Overview
Cores: four, MIPS-based.
Cache organization: private 16 KB L1 instruction and data caches; write-through L1; shared 2 MB L2 cache; inclusive cache hierarchy.
Read and write buses.

Hydra Design
(figure: four CPUs with private L1 data caches, an on-chip L2 cache, a main memory interface, an I/O bus interface, DRAM main memory, and I/O devices)

Hydra Bus
Read bus: moves data between processors; the L2 cache also serves as the interface to off-chip memory.
Write bus: processor cores write directly to the L2 cache; writes invalidate L1 lines for coherence.
The design is simple, but is it scalable?

TLS Hardware Requirements
Forward data between parallel threads: data written by one iteration may be needed by a later iteration.
Detect read-after-write violations: an iteration may read data written by an earlier iteration; if the read erroneously executes before the earlier write to the location, that is an ordering violation that must be corrected.
Safely discard speculative state after violations.
Maintain correct write-after-write order: must yield the proper final result.
Avoid write-after-read violations via memory renaming.

Requirements for TLS
(figure: timelines contrasting the original sequential loop with speculative execution: forwarding from a write in one iteration to a read in a later one; a violation when a read executes too early; speculative state discarded after a violation; writes committed as permanent state after successful iterations)
Five requirements for TLS: a sequential program that can be broken into two threads; forwarding and violations caused by the interaction of reads and writes; speculative memory state eliminated following violations; reordering of writes following thread commits; memory renaming among threads.

Hydra's Support for TLS
(figure: centralized bus arbitration mechanisms; each CPU with L1 instruction and data caches carrying speculation bits, plus a per-CPU memory controller; write-through bus; speculation write buffers; main memory interface; I/O bus interface; DRAM main memory; I/O devices)

Hydra Support for Thread-level Speculation
Tag bits for L1 data indicate speculative reads or writes.
Per-thread write buffers hold speculative writes.
Hardware and software handlers control thread sequencing.

Hydra Hardware Support for TLS
Forward data between parallel threads: writes cause L1 cache invalidations in later threads; data from the write buffers replaces data returned from L2 on a byte-by-byte basis.
Detect read-after-write violations: L1 cache bits indicate reads that may cause violations; a write from an earlier thread invalidates the address; the violation is detected and the offending thread restarted.
Safely discard speculative state after violations: only L2 contains permanent state; L1 speculative data can be invalidated; the speculative write buffer can be emptied.
Maintain correct write-after-write order: drain the speculative write buffers into L2 in the correct sequence.
Avoid write-after-read violations via memory renaming: each thread can only read data written by itself or an earlier thread; writes from later threads don't invalidate an earlier thread's data; L1 pre-invalidate bits indicate later writes.

Hydra Performance
(figure: speedup over one Hydra CPU for each application on Hydra's four processors, without and with thread-level speculation; the gray areas show improved performance following tuning with feedback-based code)

Design Rationales
Piranha design rationale: a high-performance server for commercial applications.
Hydra design rationale: applications such as database and Web servers will provide the initial motivation to adopt CMP architectures, at least for servers; uniprocessor applications must also work well on CMP architectures before they can ever replace uniprocessors in most computers.

Thread-Level Parallelism
The designs focus on different styles of thread-level parallelism: Piranha's designers target workloads known to have abundant thread-level parallelism; Hydra's designers at Stanford target thread-level parallelism in applications that are difficult to parallelize by the usual methods.
Questions: do you think that the focus of their designs reflects their industrial vs. academic affiliations? does this influence other aspects of their designs? does the Hydra paper make a successful case for thread-level speculation? are their experimental methods and results convincing?

Design Simplicity
Design objectives for both Piranha and Hydra: short development times; design simplicity.
Questions: how successful are they? how do their designs compare with current CMPs? are there capabilities that have been compromised or omitted?

The Role of Chip Multiprocessors
What features do you think are most important in a CMP?
How useful will CMPs be for everyday computing?
Is targeting them for servers and specialized workloads more practical?

References
L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets and B. Verghese, Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing, Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 282-293, June 2000.
Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mike Chen and Kunle Olukotun, The Stanford Hydra CMP, IEEE Micro, pp.
71-84, March-April 2000.
Mohamed Zahran, Kursad Albayraktaroglu and Manoj Franklin, Non-Inclusion Property in Multi-level Caches Revisited, IJCA, Vol. 14, No. 2, June 2007.
COMP 522 presentation, Anna Youssefi, 2006.

Mid 90's Multiprocessor Cache Design Studies
John Mellor-Crummey
Department of Computer Science, Rice University
johnmc@cs.rice.edu
COMP 522 Lecture 3, 2 September 2008

Today's Papers
Exploring the Design Space for a Shared-Cache Multiprocessor, Basem Nayfeh and Kunle Olukotun, ISCA 1994.
Evaluation of Design Alternatives for a Multiprocessor Microprocessor, Basem Nayfeh, Lance Hammond and Kunle Olukotun, ISCA 1996.

Putting These Papers in Perspective
Last class: explore and evaluate processor core organization alternatives for single-chip processors.
Today: explore and evaluate off-chip and on-chip cache organizations for multiprocessors.

Multichip Multiprocessor Study
Multichip multiprocessors: why? multichip designs can scale to higher performance.
Organize multichip designs on Multi-Chip Modules (MCMs).
Explore performance tradeoffs between cache organization and the number of PEs.
Study the interaction between the number of processors in the system, small to medium cache sizes, and application programs.

Multiprocessor Design Space I
Context: a multiprocessor system with a single shared bus; keep processor caches coherent using a snooping, invalidation-based protocol.
Two cluster design alternatives:
1. A separate cache per processor. Advantage: cache bandwidth scales with PEs, and large caches reduce communication traffic. Disadvantage: the bus becomes a performance bottleneck with more PEs (coherence misses, invalidation traffic).
2. Have processors communicate using a multiported cluster cache; expanding chip capabilities and MCMs make this possible. Advantage: a single copy in the shared cache, rather than one per PE, uses the cache more efficiently and avoids coherence overhead. Disadvantages: the shared cache could become a bottleneck, and conflicts in the shared cache can elevate miss rates.

Multiprocessor Design Space II
Consider 1-8 processors per
cluster.
Private instruction cache per PE.
Shared Cluster Cache (SCC) for data: nonblocking; multiple banks; banks interleaved at the cache-line level; ICN (cache InterConnection Network).
SCC and main memory are connected by a bus.
Invalidation-based snoopy cache coherence protocol.

Study Two Flavors of Workloads
Single parallel application workload: using the multiprocessor to reduce execution time.
Multiprogramming workload: using the multiprocessor to enhance server throughput.

Parallel Application Workloads
SPLASH benchmarks: Barnes-Hut, a hierarchical n-body simulation of galactic evolution; MP3D, a particle-based simulation of hypersonic flow; Cholesky, Cholesky factorization of a sparse matrix.
Parallelization: C + ANL macros for process creation, data sharing, and synchronization.
Simulation methodology: TangoLite simulates access interleaving for multiple PEs by multiplexing lightweight threads during simulation; 4 SCC clusters; consider contention in each SCC at the bank level; SCC coherence via an invalidation protocol on a snoopy bus; a 16-byte cache line limits false sharing; 100-cycle latency to either a remote cache or main memory.

Multiprogramming Workload
Benchmarks: 6 SPEC92 integer benchmarks (sc, spreadsheet; espresso, PLA minimization; eqntott, truth table simplification; xlisp, interpreter; compress; gcc, compiler) and 2 SPEC92 floating-point benchmarks (spice, analog circuit simulation; wave5, Maxwell's equation solver).
Simulation methodology: use Pixie to collect dynamic memory access traces; pipe the access traces into a multiprogram scheduler; the scheduler uses a round-robin discipline, switching every 5M cycles; each processor in the cluster does this; simulate 100M references total (avg. 30M instructions per application).

More Processors and/or Larger Cache: Barnes-Hut parallel workload
(Figure 2: Barnes-Hut performance characteristics; Table 3: speedups per cluster)
Good scaling: superlinear speedup for medium to large SCCs. Why? Increased locality of reference within a cluster.

More Processors and/or Larger Cache: Barnes-Hut parallel workload
N-body simulation based on a hierarchical oct-tree.
Processors in a cluster work on bodies adjacent in the tree: with a modest PE count per cluster, data reuse of the same bodies and prefetching when traversing the same boxes; with too many PEs, destructive interference.
(Table 4: effects of prefetching and destructive interference on read miss rates for Barnes-Hut, by processors per cluster at 256 KB)
The miss rate decreases with modest sharing due to data reuse and a prefetch effect; from 4 PEs per cache to 8 PEs per cache the miss rate increases: destructive interference.

More Processors and/or Larger Cache: MP3D parallel workload
MP3D doesn't scale very well using invalidation-based coherence: lack of locality; no prefetch; high invalidation rate.
An n-cluster system has almost the same invalidation rate as an n-processor system; clustering reduces global invalidates.
Small 4 KB SCC: speedup of 3.8. Large 512 KB SCC: speedup of 7.2.
(Figure 3: MP3D performance characteristics)

More Processors and/or Larger Cache: Cholesky parallel workload
Some prefetching, but not as much as for Barnes-Hut.
8-processor-per-cluster speedup: small 4 KB SCC, 3.0; large 512 KB SCC, 3.5.
Reason: limited concurrency, bad load balance, high synchronization overhead.
(Figure 4: Cholesky performance characteristics)

More Processors and/or Larger Cache: Multiprogramming workload
Context switches increase the cache miss rate.
Sharing in clusters exacerbates the miss rate for a processor: multiple processes access the SCC simultaneously; multiprogramming increases conflicts.
(Figure 5: Multiprogramming performance characteristics)

More Processors and/or Larger Cache: Multiprogramming workload
Speedup relative to one PE per cluster at the same SCC size: normalized to one processor per cluster with the same SCC; more PEs increase conflicts.
Larger SCC: better speedup; a larger SCC yields fewer conflicts.
(Figure 6: Multiprogramming speedup characteristics)

Implementation Cost Estimates
Implementation Assumptions
L1 cache size and the number of PEs on chip affect load latency, processor cycle time, and total chip area.
Evaluate the effect of cluster size, semiconductor process, and processor area: CMOS with 0.4-micron gate lengths, 3 layers of interconnect; total die area is 300 mm2 (18 mm on a side), a 1996 estimate; chip area estimated with the DEC Alpha 21064 (0.68 micron); pipeline IF/Decode/EX/MEM/WB; a load has two cycles of latency; cycle time is expressed in terms of gate delay, FO4 (the delay of an inverter with a fanout of four).

Cluster Implementation and Cost: One Processor per Cluster
Chip area 204 mm2: 64-bit integer unit; 64-bit FP unit; 16 KB instruction cache; single-ported 64 KB data cache (8 x 8 KB SRAM), the largest direct-mapped cache that can be accessed in 30 FO4.
(Figure 8: floorplan of a one-processor-per-cluster chip)

Cluster Implementation and Cost: Two Processors per Cluster
Two processors per chip; chip area 279 mm2, 37% larger than the 1-PE chip.
Two integer units, two FP units, two 16 KB I-caches.
One 32 KB SCC: triple-ported, 8-way interleaved, crossbar ICN.
Load latency is 3 cycles: arbitration takes 17 FO4, so add an extra pipeline stage.
(Figure 9: floorplan of a two-processors-per-cluster chip)

Cluster Implementation and Cost: Four Processors per Cluster
Combine two 2-PE chips: larger ICN, more I/O pins; total chip area 297 mm2, 46% larger than the 1-PE chip; MCM packaging.
Accessing a cache bank on the other chip exceeds the 30 FO4 of the processor cycle: one more stage between MEM and WB; 4-cycle load latency.
(Figure 10: floorplan of a four-processors-per-cluster building block)

Cluster Implementation and Cost: Eight Processors per Cluster
Connect each of the 8 banks to 9 ports; 2 ICNs provide the ports; chip area 306 mm2, 50% larger than the 1-PE chip.
I/O: 1100 signal I/O pads; C4 pads in an array on a separate layer of metal on top of active circuitry.
new a cm W 1 m Figure 11 ing block Floorplan ofaeighlprocessms per clustcrbuild Comparing SharedCache Single Chip Implementation 1PE with 64KB Dcache vs 2PE with 32KB SCC Study how load latency affects execution time 4 2PE clusters with 32 KB SCC are much better t an 41PE clusters with 64KB Dcache Chip area differs by 37 Performance differs by 70 Costperformance improves 24 based on die area smaller is better MCM SharedCache Systems Two MCM cluster 4 PE per cluster with 64 KB SCC 16PE system 8 PE per cluster with 128 KB SCC 32PE system 4 x 4PElcluster 16PE have half execution time compare with 4 x 2PE From 16PE to 32PE also linear speedup chum Conanrum 1pm mica Aymmm an Nearly linear speed up in small to medium size multiprocessors using 2PE building blocks except Cholesky from 16cpu to 32cpu Conclusions of First Paper Clustering via SCC add more processors will not increase invalidation rate for some applications prefetching in SCC reduces the miss rate providing superlinear speedup Multiprogramming workload may suffer from inference conflicts in the SCC Single chip with 2PE SCC yields better performance than 1PE with larger data cache better costperformance than 1PE with larger data cache A chip with 2PE is an effective building block for multiprocessors with less than 32 processors 23 Multiprocessor Microprocessor Study Compare three multiprocessor architectures Sharedprimary cache multiprocessor microprocessor Sharedsecondary cache multiprocessor microprocessor Reference configuration busbased sharedmemory multiprocessor Goal explore benefits and cost of sharedL1 and sharedL2 24 Study Two CPU Models 0 Simple CPU model no consider latency hiding purpose classify the parallel application high communication moderate communication low communication 0 Advanced CPU model dynamic scheduling speculative execution nonblocking memory references purpose explore performance in detail 25 Advanced CPU Model Characteristics dualissue processor dynamic scheduling 5peculative 
execution nonblocking caches 16KB 2w set ay associative lcache and 32 entry instruction window 32 entry reorder buffer 1024 entry branch target buff 4 outstanding L1 misses mug hum Flutill mm mm Am 2 mm 2 Magi 2 w Mul vly 2 Dmdz 2 sr Divldn 11 ann 2 up mm 2 1 a 3 my mum 2 m 2 mm H mm cm mailman MIPS2 instruction set 3 stage pipeline fetch execution and SharedL1 Cache MP Advantages 28 bit Ivslnm has low latency interprocessor communicat39on no need for coherence logic good for finegrain parallelism lower cache miss rate for shared data we in L2 bu 2 ME Level 2 cm 2m Annmath Fume x smd mm mm mul plmssm Disadvantages crossbar and bank arbitration logic overhead higher cache miss rate for conflict references larger die area 4PEs share a common 4way banked writeback L1 cache PC connected by a crossbar L2 cache and memory similar to conventional busbased MP latencyoccupancycoherence 27 SharedL2 Cache MP 1cycle L1 latency 14cycle L2 latency sharedL2 cache interface overheadcrossbar chip bounda L2 64bit data path needs to have 4 cycle occupancy to transfer a 323 cache line Main memory 50cycle latency 6cycle occupancyaccess an m it l ngnxv 2 simad mommy ache muiupmmmr MCM packaging 4 PE L2 interface Each processor has private writethrough L1 cache directory based coherence Main memory identical to the SharedL1 MPM m 28 SharedMemory MP and Latencies Traditional busbased MP mamw swam SIZKZWiy mamw L2 L2 u 2 Each processor has its own L2 bank a Interprocessor communication must access shared memory via bus mmmm mm wrmwmmmm All processors must check cache tag for cacheto cache data sharing Processors must support snoopy cache coherence for L1 and L2 cache Simulation and Workloads 0 Simulation SimOS models CPUsmemory hierarchy and IIO CPU simulators Embra fast Mipsy medium MXS slowest eventdriven memory and cache system MlPS2 instruction set lRlX 53 OS tuned for MP performance 0 Applications hand parallelized applications Eqntott 90 computaton is bit vector comparison MP3D heavy 
communication demand; Ocean: highly optimized parallel application; Volpack: dynamic task stealing, max data sharing, min synchronization
- Compiler-parallelized applications: Ear and FFT, parallelized by Stanford's SUIF compiler
- Multiprogramming workload: modified Andrew benchmark

Performance Analysis: Eqntott (hand-parallelized application)
- The low L1R miss rate in the shared-L1 cache and the high L1R miss rate in the two other architectures indicate the application has a small working set and a high communication-to-computation ratio
- Parallel execution inside the inner vector-comparison loop causes ~1% L1 misses for the shared-L2 and shared-memory MPs
- Figure 4: Eqntott performance (normalized execution time broken into CPU, L1 I-cache stall, L1 D-cache stall, L2 cache contention, L2 cache stall, and memory contention; miss rates L1R, L1I, L2R, L2I)

Performance Analysis: MP3D (hand-parallelized application)
- High L1R miss rate in shared-L1 means much conflict referencing in L1
- Low miss rates in shared-L2 mean all the working sets fit into the 2 MB L2 cache
- High L2I miss rate in shared-memory is due to heavy communication requirements
- The small L1 cache causes a high L1R miss rate, which in turn causes a high L2R miss rate
- Figure 5: MP3D performance

Performance Analysis: Ocean (hand-parallelized application)
- Shared-cache architectures do not offer much advantage over the shared-memory architecture
- Shared-L2 architecture: shared L2 has a higher hit time; L1's write-through policy degrades performance of the shared-L2 architecture
- Shared-L1 architecture: wider bus and write-back L1 cache yield slightly better performance
- Figure 6: Ocean performance

Performance Analysis: Volpack (compiler-parallelized application)
- Low L1R miss rate and negligible L1I miss rate
- Non-negligible L2I miss rate in the shared-memory architecture
- Synchronization time differs across the three architectures
Figure 7: Volpack performance

Performance Analysis: EAR (compiler-parallelized application)
- Fine-grained parallel code with a high ratio of communication to computation
- Negligible L1 miss rate on the shared-L1 MP means all the important data fit in the L1 cache
- High communication rate: high L1 miss rate in the shared-L2 architecture
- For the shared-L1 architecture, almost no memory system stalls
- Figure 8: Ear performance

Performance Analysis: FFT (compiler-parallelized application)
- Low L1R and L1I miss rates bring similar performance for the shared-cache architectures
- Higher L2R and L2I miss rates in the shared-memory architecture cause slightly worse performance
- For relatively large-grain parallel execution with little shared data, the three architectures offer similar performance
- Figure 9: FFT performance

Performance Analysis: Multiprogramming and OS Workload
- No user data shared: code runs as multiple independent processes with separate address spaces
- Shared-L1 cache and shared-memory architectures have similar performance
- Shared-L2 cache architecture performs 6% worse than the shared-memory architecture: higher L1 miss cost, higher contention at L2 cache ports
- Figure 10: Multiprogramming and OS workload performance

Superscalar CPU Results
- Multiprogramming: shared-memory performance is 17% better than shared-L1 and 33% better than shared-L2
- Eqntott: shared-L1 cache performance is 18% better than shared-memory; shared-L2 is 12% better than shared-memory
- Ear: the shared-L2 cache reduces I-cache and D-cache
stall time and achieves top performance
- Figure: performance of the three architectures with dynamic superscalar processors
- Using the slowest CPU model (MXS): dynamic scheduling, speculative execution, non-blocking memory access; 3-cycle L1 hit latency; L1 bank contention for shared-L1

Conclusions
- Using CPU model Mipsy, three classes of applications: high degree of communication (e.g., Eqntott, MP3D, and Ear); moderate degree of communication (e.g., Volpack and FFT); little or no communication (e.g., Ocean and multiprogramming)
- Performance: for high communication, shared-L1 outperforms shared-memory by about 20-70% (MP3D: 16% worse); for moderate communication, shared-L1 outperforms shared-memory by ~10%; for low communication, shared-L1 is better than shared-memory (lower miss rate for the 64 KB shared-L1 cache; shared-L2 cache bank contention)
- Using CPU model MXS, shared-L2 provides the same performance benefits as the shared-L1 architecture

Discussion
- What have we learned from these papers?
- Are the results in these papers complementary or not?
- What are the lessons for 2008?
- Do we care about the same application workloads today?

COMP 522: Multicore Computing — An Introduction
John Mellor-Crummey, Department of Computer Science, Rice University, johnmc@cs.rice.edu
COMP 522 Lecture 1, 25 August 2008

Logistics
- Instructor: John Mellor-Crummey; email johnmc@cs.rice.edu; phone x5179; office DH 3082; office hours by appointment
- Sign in on pad
- Meeting time: scheduled TTh 1:00-2:20 (would another time work better?)
- Permanent meeting place: Martel 103
- Web site: http://www.cs.rice.edu/~johnmc/comp522

The Shift to Multi-Core Microprocessors: Underlying Technology Forces
- Continuation of Moore's law for circuit complexity — Gordon Moore, founder of Intel. 1965: since the integrated circuit was invented, the number of transistors per inch2 in these circuits has roughly doubled every year, and this trend would continue for the foreseeable future. 1975 (revised): circuit complexity doubles every 18 months
- Figure: transistors per die, 1965 and 1975 actual data vs. projections
Image credit: http://download.intel.com/research/silicon/GordonMooreISSCC021003.pdf
- Increasing problems with power consumption and heat as clock speeds increase

Why Multicore Microprocessors?
- Rising transistor count enables more functionality
- Why not make single cores more sophisticated? Let's take a look at a few microprocessors for some intuition

Pentium 4: Super-Pipelined, Super-Scalar
- Stages 1-2: trace cache next instruction pointer; stages 3-4: trace cache fetch; stage 5: drive (wire delay); stages 6-8: allocate and rename; stages 10-12: schedule instructions (memory / fast ALU / slow ALU & general FP / simple FP); stages 13-14: dispatch; stages 15-16: register access; stage 17: execute; stage 18: set flags; stage 19: check branches; stage 20: drive (more wire delay)
- 5 operations issued per clock: 1 load and 1 store unit; 2 simple/fast and 1 complex/slower integer units; 1 FP execute and 1 FP move unit
- Up to 126 instructions in flight: 48 loads, 24 stores

Opteron Pipeline: Super-Pipelined, Super-Scalar
- Integer: 1 Fetch1, 2 Fetch2, 3 Pick, 4 Decode1, 5 Decode2, 6 Pack, 7 Pack/Decode, 8 Dispatch, 9 Schedule, 10 AGU/ALU, 11 DC access, 12 DC response
- Floating point (stages 8-17): dispatch, stack rename, register rename, schedule write, schedule, register read, FX0-FX3
- L2 cache (stages 13-19): L2 tag, address to SAQ, L2 data, route/multiplex/ECC, write DC and forward data
- DRAM (stages 15-27): clock crossing, system request queue, request to DRAM pins, memory access
- Fetch/decode 3 instructions per cycle; 3 integer units; 3 address units; 3 FPU/multimedia units; 2 loads/stores to D-cache per clock

Multicore Microprocessors
- Why not make single cores more sophisticated? Limits to the instruction-level parallelism available in programs, especially for codes with difficult-to-predict
branches 0 A new approach use available chip real estate to support threadlevel parallelism Strategies multiple threads per core Simultaneous multithreading maximizing onchip parallelism Dean Tullsen Susan Eggers and Henry Levy lSCA 1995 multiple cores The case for a singlechip multiprocessor Kunle Olukotun Basem Nayfeh Lance Hammond Ken Wilson and Kunyung Chang ASPLOSVll 1996 IBM Power4 December 2001 POWER4 Shipped in Systems December 2001 Technology 180nm lithography Cu SOI gt POWER4 shipping in 130nm today Dual processor core 8way superscalar gt Out of Order execution gt 2 Load Store units r 2 Fixed Point units p 2 Floating Point units r Logical operations on Condition Register gt Branch Execution unit a gt 200 instructions in night 5 Hardware instruction and data prefetch IBM Power 5 August 2003 POWER5 The Next Step u Technology 130nm lithography Cu SOI u Dual processor core a 8 way superscalar a Simultaneous multithreaded SMT core r Up to 2 virtual processors per real processor gt 24 area growth per core for SMT gt Natural extension to POWER4 design Intel x86 is Forced to Change Course May 17 2004 Intel the world39s largest chip maker publicly acknowledged that it had hit a quotthermal wallquot on its microprocessor line As a result the company is changing its product strategy and disbanding one of its most advanced design groups Intel also said that it would abandon two advanced chip development projects Now Intel is embarked on a course already adopted by some of its major rivals obtaining more computing power by stamping multiple processors on a single chip rather than straining to increase the speed of a single processor Intel39s decision to change course and embrace a quotdual corequot processor structure shows the challenge of overcoming the effects of heat generated by the constant onoff movement of tiny switches in modern computers some analysts and former Intel designers said that Intel was coming to terms with escalating heat problems so severe they 
threatened to cause its chips to fracture at extreme temperatures New York Times May 17 2004 Dualcore Opteron Announced August 31 2004 Advanced Micro Devices plans to demonstrate its version of a new approach to processor design on Tuesday with a chip that is expected to offer faster computing and relatively less power consumption The chip which is called Opteron and has two processing units The shift to multiple processing units or cores embedded in the same chip has recently become a significant technological approach for BM Sun Microsystems and Intel as well as Advanced Micro as computer designers hunt for new ways to increase processing power Advanced Micro said on Monday that it would make the chips available commercially for servers in mid2005 and in its 64bitAthIon chips for desktop computers before the end of next year Intel has not yet set dates for its dual core X86 processors New York Times August 31 2004 12 Sun Niagara Ultrasparc T1 Announced November 14 2005 Sun Microsystems is set to announce the first of a new generation of processors for computer servers on Monday that the company says offers faster performance with far less energy use The chip codenamed Niagara while in development is designed for a specific niche of the server market highvolume Web service operations The UltraSparc T1 following a trend in the semiconductor industry adds new features that conserve energy significantly The UltraSparc T1 has eight processing cores each able to execute four instruction sequences called threads New York Times November 15 2005 13 Leading Edge Intel and AMD Offerings 0 Intel Xeon Harpertown X5482 Nov 2007 64bit dualdie quad core 32 GHz 150 W 2 x 6 MB L2 cache 24way associative front side bus 1600 MHz httpIlprocessorfinderintelcomldetaiIsaspxsSpecSLANZ 0 AMD Barcelona Opteron 83608E April 2008 in volume 64bit quad core 25 GHz 105 W 3 levels of cache private 64KB L1 cache 512KB L2 cache exclusive cache shared 2MB L3 cache system bus speed 1000 MHz 
httplproductsamdcomlenulepteronCPUDetailaspxid446 Image Credits httpiicomcomcnwk1dibto20080425intelquadcorejpg httpimghexusnetv2cpuamdbarcelonajuIydieshotbigjpg 14 IBM Power6 Announced May 2007 o 64 bit dualcore processor Superscalar SMT cores 0 Memory hierarchy 4 MB of private L2 cachecore up to 32 MB offchip L3 cache 2 memory controllers 75 GBIs memory blw Up to 47 GHz clock rate 0 Power double the speed of previous Power5 chips with virtually no increase in power consumption power consumption drops to 35 when idle 790 million transistors Size 341 mm2 Figure credit httpWWW O3ibmcompressusenattachmentZ1562Wss leIdzATTACHFILE2amp leNamep6die1jpg 1 5 httpwww2hursleyibmcomldecimalllBMPowerRoadmapMcCrediepdf What s Next 0 Intel Itanium Tukwila designed for the server market Q42008 2 billion transistors quad core 30MB on chip cache 0 Intel Core i7 0 Leaks from IBM IBM POWER7 2010 8 cores 4GHz Reported by The Register 11 July 2008 httpwwwtheregistercouk20080711ibmpower7ncsa 16 Using Multicore Processors The processor core is only part of the picture 0 Data sharing cache hierarchy designs and implications shared cache vs private cache costs vs benefits 0 Synchronization what is possible what hardware primitives exist how should they be used what are the implications for programming models what alternative hardware primitives should be explored 17 Isn t Multicore just More of the Same No The emergence of multicore processors marks a watershed for software Processor clock frequency increases will no longer compensate for increasing software bloat Application speed won t track processor enhancements unless software is highly concurrent 18 What about the Software With concurrency essential for performance Languages and programming systems must embrace concurrency to survive expressive embarrassingly parallel dataparallel task parallel simple to write efficient Challenge concurrent programming is much more difficult software development is much harder lack of standardized amp 
effective development tools programming models and environments algorithm development is harder complexity of specifying and coordinating concurrent activities concurrent data structures are extremely difficult to design 19 What Makes Concurrent Programming So Hard The problem of shared state Application threads generally need to share state Data race two threads access a shared variable at least one access is a write data race occurs if no ordering guaranteed by synchronization data races can yield unpredictable or corrupted values Race conditions are easy to write and hard to pinpoint Data races must be avoided for correct execution 20 Avoiding Data Races 0 Conventional approach locks each thread must acquire a lock for shared data before using it 0 Problems with locks not robust if lock holder delayed progress stalls relationship between lock and data is implicit preserved only through programmer discipline association between lock and data is a global property convention must be observed by all code accessing the data hard to use coarsegrain locks shackle parallelism finegrain locks admit possibility of deadlock lack of composability when layering software calling into code you don t control is a recipe for deadlock locks must be acquired in a fixed global order to avoid deadlock extensible frameworks often call virtual functions while holding lock lost wakeups on pthread condition vars when condition s mutex not held 21 Java s Synchronized Methods Works as long as methods properly annotated with synchronized declarations data accessed only by methods 0 Problems calling virtual function while holding a lock admits deadlock locking for synchronized methods adds overhead even when no concurrency doesn t guarantee atomicity across multiple method calls example acount1debitamount account2creditamount object locking preserves atomicity of method but not sequence additional explicit synchronization required 22 Alternatives to Locks Lock free programming of concurrent 
data structures requires deep knowledge of processor s memory model difficult and fragile implementations of new data structures are publishable results eg Maged M Michael Michael L Scott Simple Fast and Practical NonBlocking and Blocking Concurrent Queue Algorithms PODC 1996 267275 doublyended queue twolock and nonblocking versions Transactional memory operations that appear to execute indivisibly concurrent operations see state before or after operation promising area of active research exploring software hardware and hybrid implementations 23 Course Outline 24 Objectives Immersion in research related to multicore processors issues shaping the design of multicore processors difficulties of programming multicore systems emerging programming models for multicore emerging technologies to simplify multicore programming synchronization debugging concurrent data structures Hone your ability to analyze research papers paper evaluations class discussions Develop skill preparing and delivering presentations presentation evaluations Explore a topic of interest in greater depth final project or term paper 25 Topics Microprocessors explore the design space threading flavors of multicore designs Memory hierarchy cache organizations and their implications for performance Programming models anguages Cilk NESL Ct libraries thread building blocks Schedang strategies for distributing parallel work among processors Memory models consistency models Java and C memory models Debugging techniques for uncovering and pinpointing data races in parallel code Synchronization theoretical underpinnings a range of approaches hlw and slw approaches Concurrent data structures Transactional memory Performance analysis of multithreaded code 26 Award Winning Papers Simultaneous multithreading maximizing on chip parallelism Dean Tullsen Susan Eggers and Henry Levy ISCA 1995 25 Years of ISCA Retrospectives and Reprints 1998 The Implementation of the Cilk S Multithreaded Language Matteo Fri go Charles 
E Leiserson and Keith H Randall 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation PLDI Montreal Canada June 1998 PLDI most in uential paper award 2008 Algorithms for scalable synchronization on shared memory multiprocessors John Mellor Crummey and Michael L Scott ACM Trans Comput Syst 9 1 Feb 1991 21 65 2006 Edsger Dijkstra Prize Wait free synchronization Maurice Herlihy ACM Trans Program Lang Syst 13 1 Jan 1991 124 149 2003 Edsger Dijkstra Prize Transactional memory architectural support for lock free data structures Maurice Herlihy and J Elliot Moss ISCA 1993 289 300 ISCA most in uential paper award 2008 Memory model papers by Sarita Adve 2008 Maurice Wilkes Award 27 Recommended Prerequisites Understanding of computer systems COMP 320 0 Understanding of machine architecture COMP 425 0 See me if you have concerns 28 Course Format 29 Paper Evaluations 10 Students must evaluate a paper for twothirds of the class sessions A paper evaluation consists of your name the paper name paper summary 5 sentences most important strengths no more than 3 1 sentence each most important weaknesses no more than 3 1 sentence each one problem or issue left open no more than 3 sentences Assessment evaluations will be graded on a scale of 13 disappointing reviews 1 default grade 2 insightful review 3 Email evaluations of papers to the instructor before the start of the class in which the paper is presented Late or incomplete evaluations will receive no credit 30 Presentations 40 Analyze a pair of papers supporting work Prepare a presentation the motivation for the work the key techniques insights andlor results a critical evaluation of the work open issues Lead class discussion Presenters advised but not required to show presentations to the instructor in advance Each presentation will have one class period 75 minutes Provide the instructor with electronic version of the presentation suitable for posting on the class WWW page 31 Class Participation 10 
Research papers not always well written sometimes make misleading claims occasionally contain errors 0 Students are expected to contribute to the discussion of the papers in class subject the papers to critical analysis ask questions offer observations 32 Presentation Evaluations 10 Students must evaluate at least twothirds of the student presentations in class 0 Why if writing reviews of presentations you will pay closer attention if you know that your presentation is being evaluated by your classmates you may try harder to make it engaging A presentation evaluation vision how well did the presenters explain why the area matters style mumbling fail to make eye contact too quickly or slowly exposition were the slides too busy too ugly orjust right question and answer how well did the presenters seem to know the material were they honest about admitting when they didn39t know something comments any additional information that you would like to add 33 Final Project 30 Explore a topic of interest in greater depth 0 Requirement one of the following a term paper must be done individually may focus on the same topic as one of your presentations you should study different papers a final project may be done with a partner written project report group projects will submit a single writeup 34 References John Markoff Advanced Micro Plans to Show New Chip Ahead of Intel Technology New York Times August 31 2004 John Markoff lntel39s Big Shift After Hitting Technical Wall Technology New York Times May 17 2004 Ron Kalla POWER5 IBM s Next Generation POWER Microprocessor Hot Chips 15 August 2003 Herb Sutter and James Larus Software and the Concurrency Revolution ACM Queue Special Issue on Multiprocessors 37 September 2005 Maurice Herlihy The Transactional Manifesto Keynote address at PLDI 2005 35 Caching for Chip Multiprocesscrs John MellorCrummey Department of Computer Science Rice University johnmccsriceedu COMP 522 Lecture 7 18 September 2008 Last Week Tiled chip multiprocessors 
Intel s view of the future large number of tiles speciaized tiles can be incorporated into grid layout on chip Today s Papers 39 Victim Replication Maximizinq Capacitv while Hidinq Wire Delav in Tiled Chip Multiprocessors M Zhang and K Asanovic In Proceedings 32nd International Symposium on Computer Architecture Madison WI June 2005 Coonerative Cachinq for Chin Multiprocessors J Chang and G S Sohi In Proceedings of the 33rd International Symposium on Computer Architecture Boston MA June 2006 DanceHall Shared Cache CMPs Eg Piranha no a 0 l We a a System s Intra C lup Smtch l Packet 5 itch Interconnect Links 39 L1 cache colocated with PE 39 PEs on far side of interconnect from L2 cache a L2 caches equidistant all processors Looking to the Future 39 Trends more processors larger L2 cache size 39 Implications wire delay of tens of clock cycles across chip worst case latency likely unacceptable hit times for future L2 39 Recommendation tiled chip multiprocessors colocate part of shared cache near each core reduce access latency to at least some shared data Tiled Chip Multiprocessors 5mm l A 4 X 4 5MP 3920 mm X 20 mm Advantages 39 Simpler replicated physical design readily scale to larger processor counts 39 Can support product families with different number of tiles 6 Alternatives for Managing Tile L2 in CMPs Treat each slice as a Manage all Slices as a private L2 cache per Single large shared L2 tile L2P cache L23 Implications of L2 Caching Strategy 39 Treat each slice as a private L2 cache per tile use directory approach to keep caches coherent tags duplicated and distributed across chip as in Piranha delivers lowest hit latency works well when your working set fits in your L2 reduces total effective cache capacity each tile has a local copy of each line it touches can t borrow L2 space from other PEs with less full caches 39 Manage all slices as a single large shared L2 cache focus NUCA nonuniform cache architecture designs differ from Piranha s dancehall design 
migration-based NUCA protocols seem problematic; shared L2 increases effective cache capacity for shared data but incurs long hit latencies when L2 data is on a remote tile

Victim Replication
- Combines advantages of the private and shared L2 schemes
- Variant of the shared scheme
- Attempts to keep copies of local L1 victims in the local L2; a retained victim is a replica of a line in the L2 on a remote home tile

Victim Replication in Action
- Dynamically build a small victim cache in the L2
- Processor misses in shared L2: bring the line from memory and place it in the L2 on a home tile determined by a subset of the address bits; also bring it into the L1 of the requester
- Incoming invalidation to a processor: follow the usual L2S protocol; check local L1 and L2
- If an L1 line is evicted on a conflict or capacity miss: attempt to copy the victim line into the local L2
- Primary cache misses must check for a local replica: on miss (no replica), forward the request to the home tile; on replica hit, invalidate the replica in the local L2 and move it to the local L1

Victim Replacement Policy
- Never evict a global shared line in favor of a local replica
- L2VR replaces lines in the following priority order: invalid line; global line with no sharers; existing replica
- If no lines belong to these categories, no replica is made in the local L2 cache; the victim is evicted from the tile as in L2S
- More than one candidate line: pick at random

Advantages of Victim Replication
- Hits to replicated copies reduce the effective latency of shared L2
- Higher effective capacity for shared data than private L2

Victim Replication Evaluation Parameters
- Processor model: in-order; cache line size: 64 B
- L1 I-cache: 16 KB, 16-way; L1 D-cache: 16 KB, 16-way; L1 load-to-use latency: 1 cycle; L1 replacement policy: pseudo-LRU
- L2 cache associativity: 16-way; L2 load-to-use latency per slice: 6 cycles; L2 replacement policy: random
- L1 victim cache: 8 KB, 16-way; load-to-use latency: 1 cycle
- Network configuration: 4x4 mesh; one-hop latency: 3 cycles
External memory latency: 30 cycles

VR Single-Threaded Benchmarks
- SPEC benchmarks with instruction counts in billions; descriptions include eon (ray tracer), bzip2/gzip (compression), perlbmk (cut-down version of Perl v5.005_03), and vortex (an object-oriented database)

Single-Threaded VR Access Latencies
- Figure: average data access latency (cycles) for L2P, L2S, and L2VR across bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, and vpr

Single-Threaded Off-Chip Miss Rate
- Figure: global cache miss rate (%) for L2P, L2S, and L2VR across the same benchmarks

ST On-Chip Coherence Messages per 1K Instructions
- Figure: on-chip coherence messages per 1K instructions across the benchmarks

VR Multithreaded Benchmarks
- Benchmarks with instruction counts in billions: NAS kernels (e.g., BT — block tridiagonal; CG — conjugate gradient), apache (Apache's ab, worker threading model), and checkers (Cilk checkers)

Multi-Threaded Average Access Latency
- Figure: average data access latency for L2P, L2S, and L2VR; IS fits in L1; BT, FT, LU, and SP fit in L2; MG, EP, and checkers do better with L2VR

Multi-Threaded Off-Chip Miss Rates
- Figure: off-chip cache miss rate (%) for L2P, L2S, and L2VR across BT, CG, EP, FT, IS, LU, MG, SP, apache, dbench, and checkers

MT On-Chip Coherence Messages per 1K Instructions
- Figure: messages per 1K instructions for L2S and L2VR across the benchmarks

MT Average L2 Held as Replicas Over Time
- Figure: percentage of L2 holding replicas over time for BT, CG, EP, FT, IS, LU, MG, SP, apache, dbench, and checkers

MT Memory Access Breakdown
- Figure: breakdown of on-chip memory accesses for L2P, L2S, and L2VR across BT, CG, EP, FT, IS, LU, MG, SP, apache,
dbench checkers Victim Replication In a Nutshell Distributed shared L2 caches decrease network traffic vs private caches at the expense of latency Victim replication reduces onchip latency by replicating cache lines within same level of cache near processors actively accessing the line Result dynamically selftuning hybrid between private and shared caches 24 Cooperative Caching 25 Cooperative Caching Locally active data attracted to private cache Globally active data cooperatively identified kept in aggregate shared cache Cooperation strategies cachetocache transfers of clean data replicationaware data replacement evict duplicates to make space for new lines singlets global replacement of inactive data migrate singlet from local L2 block to another L2 block rather than off chip Don t require inclusion in local L2 for data to be in L1 Use cooperation probability to guide strategy coop or not 26 Requirements for Cooperative Caching 39 Private caches must maintain cooperationrelated hints for replacement whether block is singlet reuse status of a spilled block 39 Coherence protocol must support cachetocache transfers of clean blocks spilling into other s caches for capacity sharing 27 Implementation Alternatives 39 Modify existing replacement policy and coherence protocol 39 New directorybased protocol presented in this paper 28 Modifications to Existing Implementations 39 Add 2 bits to each tag in private caches whether block was once spilled but not reused for 1Fwd whether block is a singlet for replication aware replacement can be maintained by snooping activities of other caches receiving notification from directory when duplication removed can be imprecise since only used for victim selection 39 Coherence protocol clean owner can be selected by snooping through arbitration directory protocol can maintain presence vector spilling by either push or pull pull buffer victim locally evicting cache selects amp asks random host to pull the line host issues special 
prefetch to initiate move push send from evicting cache to new host notify directory if any 29 Cooperative Caching Hardware Sketch Private caches centralized directory i mm m sum mm nng mum Figure 3 CCE and Directow Memory Structure 8 core CMP with 4way associative L2 cac es Cooperative Caching Evaluation Parameters of systems simulated Parameters 4widc lOyslages 128 l 64 entries 12K YAGS 644nm RAS 128 bytes Cmnpnn Processor pipeline Windowscheduler Branch prediclors oc size L UD caches L2 cachesbanks ycle Sequential mydm access lsrqcle Poinlrlorpoinl Sycycle pelvlmp 300 Cycles 012 Private Shared 2X2 mesh 2X2 mesh 3 Private 3X3 mesh Shared 4X2mesh Workloads Multiprogralmued 4r0re Benchmarks Name Mix apsi arl cquakc mesa Mix2 ammp mesa swim vortex Mix apsi gzip mch mesa Mix4 ammp gzip vurlcx wupwise Ralcl 4 copies of lwoll39 WS larger lhan 1MB Rach o ies ol39arl WS smaller lhan 1MB Name trs 4 Mullitllreaded 8r01 e Setu OLTP 400 Apache 2500 20000 les 3200 clients 251n51hmkmne JBB 12000 Sun HotSpm 140 15 warehouses perrcole Zeus 2500 Similar 0 Apache excepr being evemrdriven D IBM DB2 EEE 72 25K warehouses 28 users Multithreaded Performance Fmaxe shm1 oc 0Elucm Cicc 7nsc unmima E Q Q 5 32 Narmain Pn nrmance am om Apache JEE Zeus Fi ure 4 Multithreaded Workload Performance 9 indicates Ihe besi performing CC scheme Lccai L2 il Hemme L2 Cloned Figure 6 Multithreaded Workload Bandwidth 39 P5037c P50377CPS937C 50310 Memory Weiss ayeamMn 8 lt39 LTP Apache JBB Figure 5 Multithreaded Workload Average Memor Access Latency from Ieft to righi in each group Privale P Shared S CC 0 0 CC 30 3 CC 70 7 and CC 100 0 Multiprogrammed Performance m icc i Qi W E g may E m a a gm II I 3 in 2 SW M 1 Mix Mixa MixA Ram Rate 0 Miquot Ra39s x Mixz MixCi 4ix4 Ratei Figure 7 Mulliprogrammed Workload Performance Figure 9 Multiprogrammed Workload Bandwidth 0 F c F c P c F 1 00 Average Access Latency g 3 Mix4 Raiei R3162 Figure 8 Multiprogrammed Workload Average Memo Access Latency from left to 
right in each group Private Shared and cooperaiive caching Mm Mix2 Mix 34 Miss Rate and Miss Breakdown Table 5 Mullilhreaded Workload Miss Rale and L1 Miss Breakdown Table 6 Mulliprogrammed Workload Miss Rate and L1 Miss Breakdown Cooperative Caching vs Victim Replication Muliiprcgrammad lPrwaIe a shared u 2 DVR gt E Commemal 39 39 39uquotquot39 vpr 2 g perlbmk E mu 7 F s u wupwise swim mglizl imam equake I39 llill lll lPrwaIe a Shared u cc DVR an apsi applu 39 ammp sss aauenuupad pazuznuuN Performance Sensitivity of Cooperation 120 1 Private Shared I CC i ESE 39 Hi I 39F39llil39 I Hill 120 doore B oore l i H 1 2 H 1 2 100 I I 80gt I 50 i i quot 0LTP Apache JBB Zeus Figure 10 Performance Sensitivity 300 and 600 cycles memory latencies from left to right for each group 4core and 8core CMPs with 512KB 1MB and 2MB per core caches Performance SOC cycle Mem Performance 600 cycle Mem 37 Speedups vs CCE Latency Extra cycles 0 5 10 15 Multithre aded 7 5 47 32 0 Multiprogrammed l l 80 7 O 59 38 Evaluation Summary 39 Architecture 8core CMP 1MB L2 cache per core 39 Cooperative caching results improves MT commercial workloads 511 vs shared cache 438 vs private cache 39 Architecture 4core CMP running multiprogrammed SPEC 2K 11 faster vs shared cache 6 faster vs private cache 39 Assessment of Cooperative Caching Combines strengths of private and shared caches Supports spectrum of capacity sharing points suits requirements of diverse workloads Best performance for range of CMP config amp workloads reduces runtime of simulated workloads 438 5 22 slower than private or shared caches in extreme cases Outperforms victim replication by 9 on average mixture of singlethreaded and multithreaded workloads advantage increases for workloads with large working sets 40 Processors Today 41 Opteron Dual Core Rev F H w h g i n 1 u g 1 1 395 L AMD Barcelona Cache Organization Dedicated L1 39 Locallt kee s most critical data In the L1 cacze p I Core 1 Core 2 Core 3 Core 4 LoweSt Iatencv 39 Cdclu 
he ath dtlu 2 loads per cyc39e I Coulrul Cuiilrul oulrol ulilrol Dedicated L2 I I 39F I o Sized to accommodate the majority of working sets today 0 Dedicated to eliminate conflicts common in shared caches Better for Virtualization Shared L3 NEW Victimcache architecture maximizes efficiency of cache hierarchy o Fills from L3 leave likely shared lines in the L3 Sharingaware replacement policy a Ready for expansion at the right time for customers 43 AMD Barcelona Chip Quad Core Rev H lii uirersa ge IltghipaILEhiteEECOMn VJSAM Dggmilypicjgig 44 Intel Clovertown Quad Core IBM Power 6 39 I 4 giant Ia n n eta325536th a X 39 J a W 151 I BI 335 39139 4 th I p w 39 u 39 V 5quot 1 ii 1lumin 639 skin Iq quotv Figure source httpzlwvmIinuxdevioescum lawsmise bmjower jiagsmljpg 46 ANI s View of the Difference Native QuadCore Bene t Faster Data Sharing Situation Core 1 needs data in Core 3 How Does it Get There Barcelona L3 Core 1 probes Core 3 which copies the data directly back into Core 1 This happens at processor frequency Result Improved QuadCore Pe rforma nce 1 Figure source httpwwwCdrczpicture34716small Memory Controller Core 1 sends a request to the memory controller which probes Core 3 39 Core 3 sends the data back to the 3 memory controller which forwards it to Core 1 Q 39 1 Ti This happens at frontside bus frequency VFK39L QIVV L 9c 7 7 47
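The victim-replication replacement decision summarized above ("Victim Replication In a Nutshell") can be sketched in C. This is a simplified model, not the paper's RTL: the type and function names (`l2_line_t`, `vr_pick_victim`) are invented for illustration, and the priority order shown (invalid line, then unshared global line, then existing replica, else skip replication) is one reasonable reading of how a local L2 slice could keep a replica without displacing actively shared home data.

```c
/* Sketch of a victim-replication fill policy for one L2 set.
 * Names and the exact priority order are illustrative assumptions. */
typedef enum { LINE_INVALID, LINE_GLOBAL, LINE_REPLICA } line_kind_t;

typedef struct {
    line_kind_t kind;
    int sharers;   /* directory sharer count; meaningful for LINE_GLOBAL */
} l2_line_t;

/* On an L1 eviction, return the way to overwrite with a replica,
 * or -1 to skip replication entirely. */
int vr_pick_victim(const l2_line_t *set, int ways) {
    int global_unshared = -1, replica = -1;
    for (int w = 0; w < ways; w++) {
        switch (set[w].kind) {
        case LINE_INVALID:
            return w;                        /* best case: a free slot */
        case LINE_GLOBAL:
            if (set[w].sharers == 0 && global_unshared < 0)
                global_unshared = w;         /* home line nobody caches */
            break;
        case LINE_REPLICA:
            if (replica < 0)
                replica = w;                 /* displace an older replica */
            break;
        }
    }
    if (global_unshared >= 0) return global_unshared;
    if (replica >= 0) return replica;
    return -1;   /* only actively shared home lines left: don't replicate */
}
```

The key property is the "self-tuning" behavior from the slide: when a tile's working set fills its L2 slice with live shared data, replication naturally stops (the function returns -1), and the slice behaves like a pure shared cache again.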
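Likewise, the two cooperative-caching mechanisms described above (replication-aware replacement using the singlet bit, and probabilistic spilling gated by the spilled-but-not-reused bit) can be sketched as follows. All identifiers are invented for this example; the real protocol tracks these hints in tag bits and a directory, but the decision logic reduces to roughly this:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Illustrative model of one private-L2 line with the two hint bits
 * described on the "Modifications to Existing Implementations" slide. */
typedef struct {
    bool valid;
    bool singlet;        /* hint: no other on-chip cache holds this line */
    bool spilled_once;   /* hint: spilled here earlier but not yet reused */
    unsigned lru_age;    /* larger = older */
} cc_line_t;

/* Replication-aware replacement: prefer evicting the oldest replicated
 * (non-singlet) line so unique data stays on chip; fall back to plain LRU. */
int cc_pick_victim(const cc_line_t *set, int ways) {
    int best = -1, best_any = -1;
    for (int w = 0; w < ways; w++) {
        if (!set[w].valid) return w;
        if (best_any < 0 || set[w].lru_age > set[best_any].lru_age)
            best_any = w;
        if (!set[w].singlet &&
            (best < 0 || set[w].lru_age > set[best].lru_age))
            best = w;
    }
    return best >= 0 ? best : best_any;
}

/* Spill a victim singlet to a randomly chosen peer L2 with the configured
 * cooperation probability; a line already spilled once without reuse is
 * evicted rather than recirculated (the 1-Fwd hint). */
bool cc_should_spill(const cc_line_t *victim, int prob_percent) {
    if (!victim->singlet || victim->spilled_once) return false;
    return (rand() % 100) < prob_percent;
}
```

Setting `prob_percent` to 0 makes the design behave like plain private caches, and 100 gives full cooperation; the intermediate CC 30% / CC 70% points in Figure 5 correspond to throttling this probability.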