COMP 522 Lecture 5, 9 September 2008
Hubert Lee, Department of Computer Science, Rice University (johnmc@cs.rice.edu)

Context
- Prior classes: SMT vs. chip multiprocessor; Hydra vs. Piranha
- Today's focus: multithreading, commercial workloads, power efficiency

Niagara Design Goals
- Low power
- Multithreaded performance: improve throughput
- Target commercial server applications
- Binary compatibility with existing Solaris systems

Characteristics of Commercial Server Apps
- Low instruction-level parallelism
- High thread-level parallelism
- Large data sets
- High data sharing

Niagara at a Glance
- Processor core: 8 thread groups, 4 threads per group, a SPARC pipe for each group
- Memory hierarchy: private L1 instruction (16 KB) and data (8 KB) caches, 4-way set associative; shared 3 MB unified L2 cache, 4 banks, 12-way set associative

Niagara Block Diagram
[Figure: eight SPARC pipes connected through a crossbar to the banked shared L2 and the DRAM controllers]

Sparc Pipeline Features
- Unique registers and instruction and store buffers per thread
- Each thread group shares the L1 caches, TLB, and execution units (ALU, multiplier, divider)
- Single issue
- Six stages: fetch, thread select, decode, execute, memory, write back
- Two instructions fetched per cycle

Niagara's Sparc Pipeline
[Figure: the six-stage pipeline (Fetch, Thread Select, Decode, Execute, Memory, Writeback) with per-thread instruction buffers, the register file, ALU/MUL/DIV units, D-cache and DTLB, store buffers, and the crossbar interface; the thread-select logic is driven by instruction type, misses, traps and interrupts, and resource conflicts]

Name That Threading Model
[Figure: execution timelines for competing threading models, with a legend for Thread 0, Thread 1, unused cycles, and switch overhead]

Fine-grain Multithreading
- Select a thread each cycle (a minimal sketch of such a selection policy follows the next figure)
- Deselect threads on high-latency operations
[Figure: single-issue, ILP, and TLP timelines; the legend distinguishes memory latency from compute latency, showing the time saved by overlapping one thread's memory latency with other threads' compute]

Multithreading & Pipelining
[Figure: pipeline diagrams of the fetch, thread-select, decode, execute, memory, and writeback stages; with one thread, hazards leave cycles idle, while with two interleaved threads those cycles are filled with the other thread's instructions]
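To make the thread-select stage concrete, here is a minimal C sketch, under assumptions of mine rather than anything from the slides, of the selection policy just described: round-robin among ready threads, skipping any thread that is stalled on a long-latency operation or has an empty instruction buffer. All names (struct thread_state, pick_thread, stall_cycles) are hypothetical.

    #include <stdbool.h>

    #define THREADS_PER_PIPE 4

    /* Hypothetical per-thread state visible to one SPARC pipe's
     * thread-select stage. */
    struct thread_state {
        bool has_instruction;   /* instruction buffer is non-empty        */
        int  stall_cycles;      /* > 0 while waiting on a long-latency op */
    };

    /* Round-robin selection among ready threads: each cycle, pick the
     * next thread after the one selected last cycle that is neither
     * stalled nor out of instructions.  Returns a thread index, or -1
     * if every thread is deselected and the pipe issues a bubble. */
    int pick_thread(struct thread_state t[THREADS_PER_PIPE], int last_picked)
    {
        for (int i = 1; i <= THREADS_PER_PIPE; i++) {
            int cand = (last_picked + i) % THREADS_PER_PIPE;
            if (t[cand].has_instruction && t[cand].stall_cycles == 0)
                return cand;
        }
        return -1; /* all threads deselected: wasted cycle */
    }

On a load miss, the core would set stall_cycles for that thread; the other three threads in the group keep the single-issue pipe busy in the meantime, which is exactly the latency hiding the figures above illustrate.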
Niagara's Sparc Register Files
- 3 read ports, 2 write ports
- Architectural set vs. working set: the full architectural register set lives in compact SRAM cells, the working set in fast register-file cells, with a transfer port between the two; read/write accesses from the pipe go to the working set

Niagara Memory Subsystem
- 16 KB L1 instruction cache
- 8 KB L1 data cache
- Write-through L1

Niagara Performance
SPECjbb2005 (Java Business Benchmark); only the clearly recoverable cells of the slide's table are shown:

  System               Sun Fire T2000   IBM p520   IBM x346   Dell (model unrecoverable)
  SPECjbb2005 score    60,323           32,820     35,535     24,206

The remaining rows (rack units; system power, where the values 184, 420, 438, and 376 W appear; performance per Watt; and SWaP, i.e. performance per space and watts, higher is better) are only partially recoverable, as is a companion SPECweb2005 comparison. The slide's point is that the Niagara-based T2000 delivers by far the best performance per Watt.

Niagara1 vs. Opteron (SAP SD benchmark)

  Operating system                  Windows Server 2003          Solaris 10
  SAP release                       SAP R/3 Enterprise 4.7       mySAP ERP 2004
  Database                          SQL Server 2000              MaxDB 7.5
  Certification number              2005026                      2005047
  Processor type                    Dual-core Opteron 2.2 GHz    UltraSPARC T1 1.2 GHz
  Processors/cores/threads          2/4/4 (partially garbled)    1/8/32
  Configured memory                 16 GB                        32 GB
  Form factor (rack units)          2U                           2U
  Calculated system power (W)       388                          376
  SAP SD benchmark users            983                          950
  SAP SD benchmark users per Watt   2.5                          2.5

[A source citation follows on the slide, apparently an article about Sun's CoolThreads, a.k.a. Niagara; its title and URL are too garbled to recover.]

Niagara vs. Niagara2
- Processor core: 32 vs. 64 threads (doubled the number of threads per core); doubled the number of execution units per core (2 vs. 1); new pipeline stage, "pick," which picks 2 of 8 threads to issue in a cycle
- Memory hierarchy: doubled the set associativity of the L1 from 4 to 8; doubled the number of L2 cache banks from 4 to 8 (now 1 per core), which boosts performance 18% over just 4 banks; L2 went from 12-way to 16-way set associative
- Floating point: one FPU per core rather than one per chip, which turned Niagara2 into a very respectable chip for scientific computing

Sparse Matrix-Vector Multiply
Basic SpMV implementation: y <- y + A*x, where A is stored in compressed sparse row (CSR) form (a parallel variant of this loop is sketched below):

  for (i = 0; i < m; i++) {
      double y0 = y[i];
      for (k = ptr[i]; k < ptr[i+1]; k++)
          y0 += val[k] * x[ind[k]];
      y[i] = y0;
  }
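This kernel appears in a Niagara lecture because it has exactly the profile Niagara targets: little ILP, plentiful TLP, and lots of memory latency to hide. As an illustration only, here is a minimal sketch, assuming POSIX threads, that parallelizes the CSR loop by giving each thread a contiguous block of rows; spmv_worker, spmv_arg, and the partitioning scheme are hypothetical choices of mine, not from the slides.

    #include <pthread.h>

    /* Hypothetical argument block: each worker handles rows [lo, hi). */
    struct spmv_arg {
        int lo, hi;
        const int *ptr, *ind;
        const double *val, *x;
        double *y;
    };

    static void *spmv_worker(void *p)
    {
        struct spmv_arg *a = p;
        for (int i = a->lo; i < a->hi; i++) {
            double y0 = a->y[i];
            for (int k = a->ptr[i]; k < a->ptr[i + 1]; k++)
                y0 += a->val[k] * a->x[a->ind[k]];
            a->y[i] = y0;
        }
        return NULL;
    }

    /* y <- y + A*x with A in CSR, rows split into nthreads contiguous
     * blocks (assumes nthreads <= 64). */
    void spmv_parallel(int m, const int *ptr, const int *ind,
                       const double *val, const double *x, double *y,
                       int nthreads)
    {
        pthread_t tid[64];
        struct spmv_arg arg[64];
        for (int t = 0; t < nthreads; t++) {
            arg[t] = (struct spmv_arg){ m * t / nthreads,
                                        m * (t + 1) / nthreads,
                                        ptr, ind, val, x, y };
            pthread_create(&tid[t], NULL, spmv_worker, &arg[t]);
        }
        for (int t = 0; t < nthreads; t++)
            pthread_join(tid[t], NULL);
    }

Because each row of y is written by exactly one thread, no synchronization is needed beyond the final joins; the irregular reads of x[ind[k]] are where hardware multithreading earns its keep.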
Niagara2 vs. Commodity Processors
[Figure: sparse matrix-vector multiply performance, Niagara2 vs. commodity processors]

References
- "Niagara: A 32-Way Multithreaded SPARC Processor," Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun, IEEE Micro, pp. 21-29, March/April 2005.
- "Sun's Niagara falls neatly into multithreaded place," Charlie Demerjian, The Inquirer, 02 November 2004.
- "OpenSparc T1 Microarchitecture Specification," Sun Microsystems. http://opensparc-t1.sunsource.net/specs/OpenSPARCT1-MicroArch.pdf

The Implementation of the Cilk-5 Multithreaded Language
Ying Hu, September 23, 2008

Outline
1. What is parallelism: work and span; definition of parallelism
2. Cilk: work-first principle; compilation strategy; work-stealing runtime system; benchmarks
3. Example 1: solving Fibonacci numbers
4. Example 2: solving the N-Queens problem

What is Parallelism?
- Amdahl's Law: the speedup upper bound is 1/(1-p), where p is the fraction that can run in parallel and the rest runs serially.
- Dag model for multithreading: [figure: a computation dag in which, e.g., node 1 precedes node 2, while nodes 4 and 9 can run in parallel]
- How can we describe parallelism precisely?

Measures of Parallelism: Work
- Work: the total amount of time for all the instructions.
- WORK LAW: T_P >= T_1 / P, where T_P is the execution time on P processors and T_1 the execution time on one processor.
- Speedup = T_1 / T_P: linear, perfect linear, or superlinear (e.g., from caching effects).

Measures of Parallelism: Span
- Span: the longest path of dependencies, also called the critical path (span = critical-path length; 9 in the example dag).
- SPAN LAW: T_P >= T_inf, where T_inf is the fastest time the dag can be executed on a computer with an infinite number of processors.

Defining Parallelism
Parallelism is defined as the ratio of work to span, T_1 / T_inf. How to interpret it:
- Parallelism T_1 / T_inf is the average amount of work along each step of the critical path.
- Parallelism T_1 / T_inf is the maximum speedup that can be achieved by any number of processors.
- Perfect linear speedup cannot be achieved on any number of processors greater than T_1 / T_inf. Why? If P > T_1 / T_inf, the span law states T_inf <= T_P, so 1/T_inf >= 1/T_P, so T_1/T_inf >= T_1/T_P; thus P > T_1/T_P, i.e., the speedup on P processors falls short of linear.

An Example
Work = 18, span = 9, average parallelism = 18/9 = 2.

Another Example
Work T_1 = 50 (the value is garbled in the source; 50 is implied by 6.25 x 8), span T_inf = 8, parallelism T_1/T_inf = 6.25.

What is Cilk?
Cilk is a general-purpose programming language designed for multithreaded parallel programming:
- an extension of C for parallel computing
- implements the work-first principle
- uses provably good work-stealing scheduling

Work-First Principle
- T_P ~ T_1/P + O(T_inf): a work term plus a span term.
- Let T_S be the running time of the C elision. Then T_P <= T_1/P + c_inf * T_inf ~ c_1 * T_S / P, where c_1 = T_1/T_S is the work overhead (e.g., frame allocation prior to a spawn) and c_inf is the critical-path overhead.
- Work-first: minimize the work overhead c_1, even at the expense of a larger critical-path overhead c_inf.
[Figure: breakdown of the work overheads (THE protocol, frame allocation, state saving) for running fib on a 466 MHz Alpha 21164, a 200 MHz Pentium Pro, a 167 MHz UltraSPARC I, and a 195 MHz MIPS R10000]

Compilation Strategy
A Cilk program goes through a source-to-source translator plus the ordinary C toolchain: cilk2c translates the Cilk source into C, gcc compiles the result, and ld links it against the Cilk runtime library.

Cilk Runtime System
- Each worker uses a ready deque (double-ended queue) as a stack: it pushes and pops frames at the tail end.
- When workers have no work to do, they steal frames from the head of other workers' deques.
[Figure sequence: the heap stores activation frames, and each worker's deque holds references to frames in the heap; a spawn pushes a new frame onto the tail of the spawning worker's deque, while a thief removes a frame from the head of a victim's deque]

Compilation Strategy: Two Clones
cilk2c generates two clones of each Cilk procedure:
- The fast clone runs when a procedure is spawned (the common case). It is generated with little support for parallelism; it checks to see whether the procedure was stolen, and sync is a no-op.
- The slow clone runs when a procedure is stolen. It supports concurrent execution, and calling sync actually checks the children's status.
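The slides for the two promised examples did not survive extraction. For the first of them, Fibonacci, the sketch below reproduces the canonical fib procedure from the Cilk-5 paper as I recall it, with comments added here; it is also the standard illustration of the fast/slow-clone strategy above, since nearly every one of the exponentially many fib calls runs in the fast clone and is never stolen.

    #include <stdlib.h>
    #include <stdio.h>

    /* Work T_1(n) is exponential in n while span T_inf(n) is only
     * Theta(n), so the parallelism T_1/T_inf is enormous for modest n. */
    cilk int fib(int n)
    {
        if (n < 2)
            return n;
        else {
            int x, y;
            x = spawn fib(n - 1);  /* child may run in parallel ...    */
            y = spawn fib(n - 2);  /* ... with this one                */
            sync;                  /* wait until x and y are available */
            return x + y;
        }
    }

    cilk int main(int argc, char *argv[])
    {
        int n = atoi(argv[1]);
        int result = spawn fib(n);
        sync;
        printf("fib(%d) = %d\n", n, result);
        return 0;
    }

Under the work-stealing scheduler, a worker executes fib depth-first on its own deque much as the C elision would, keeping the work overhead c_1 small; only when another worker steals a parent frame does the slow clone, with its real sync, come into play.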