Computer Architecture II (CSE 30322)
Class notes -- University of Notre Dame, taught by Peter Kogge
Exascale: The Next Great Challenge in Computer Architecture
Peter Kogge, April 21, 2009

So You're Interested in an Exascale Computer?
You should be OVERJOYED if all you will need is:
- JUST a million cores
- ONLY one nuclear power plant
- ANY sort of programming support
- And you are not dominated by interconnect

Why Do We Need Supercomputing?

Basic Performance
- FLOP: Floating point OPeration (an add or a multiply), the basic measure of scientific computing
- "xxx-FLOPS" means xxx flops executed per second:
  - Megaflops: 1 million (10^6) flops per second
  - Gigaflops: 1 billion (10^9) flops per second -- a good laptop
  - Teraflops: 1 thousand billion (10^12) flops per second -- today's best chip
  - Petaflops: 1 million billion (10^15) flops per second -- today's fastest computer
  - Exaflops: 1 billion billion (10^18) flops per second -- the limits of silicon
  - Zettaflops: 1 thousand billion billion (10^21) flops per second
  - Yottaflops: 1 million billion billion (10^24) flops per second
  - 1 googolplex = 10^(10^100)
- Note: roughly 10^19 seconds have elapsed since the start of the universe

What's a Petaflop?
- 1 gigaflop = 10^9 floating point operations (flops) per second; an Apple G4 is about 1 GF peak
- 1 teraflop = 10^12 flops/second, i.e. about 1000 G4s running perfectly at peak
- Teraflop computers:
  - ASCI Red: 1 teraflop peak
  - ASCI Blue: 3 teraflops peak
  - ASCI White: 10 teraflops peak
  - ASCI Q: 30 teraflops
- 1 petaflop = 1000 teraflops

A Sampling of Problems Needing an Exaflops
- Climate modeling
- Controlled plasma fusion simulation
- Microbiology and drug synthesis
- World knowledge graph problems

Some Observations on Climate Modeling
[Figure: observed climate change versus climate-model hindcasts (IPCC Fourth Assessment, 2007; the IPCC shared the 2007 Nobel Peace Prize). Observations of global and Northern Hemisphere temperature and snow cover can only be explained when the model ensembles include a human influence (after Stott et al., 2006). Climate models are the tools used to determine causes.]

Complicating the Climate Models
- Accounting for regional impacts
  - Dynamic vegetation models
  - Improved groundwater models / water cycle
- Ocean models that incorporate eddies in currents
- Tracking ice sheets, sea ice, and sea-level rise
- Chemical / biogeochemical models
- Ocean thermohaline circulation
- Extreme events and impacts (hurricanes, droughts, etc.)

Growth in Computing Needs
[Figure: estimated growth factors for climate computing. Resolution: ~x100 spatial (eddy-resolving ~10 km oceans, regional prediction), ~x10 timestep. New science/complexity: biogeochemistry with 30-100 tracers and their interactions. Fidelity: better cloud processes, dynamic land/vegetation, longer runs, more ensembles, data assimilation. Plus scaling to large processor counts. Data requirements grow by similar factors (~35 TB currently distributed, more for assimilation). Overall computing needs grow by roughly 10^10 to 10^12.]
- The Earth Simulator clocked in at about 35 Tflops in 2002-2004
- The factors above therefore imply 10^23 to 10^25 flops needed -- a yottaflops!
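A quick back-of-the-envelope check of the climate arithmetic above; the Earth Simulator baseline and the 10^10-10^12 growth factors come from the slide, and the rest is just unit bookkeeping.

    # Scale the Earth Simulator's sustained rate by the slide's growth factors.
    earth_simulator_flops = 35e12                 # ~35 Tflop/s, 2002-2004
    for factor in (1e10, 1e12):                   # slide's estimated growth in computing needs
        print(f"x{factor:.0e} -> {earth_simulator_flops * factor:.1e} flop/s")
    # Lands at roughly 3.5e23 to 3.5e25 flop/s, i.e. in yottaflops territory.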
Linpack: Today's Imperfect Measure
- Linpack solves a general dense matrix problem A x = b
  - Complexity for n equations: (2/3)n^3 + 2n^2 + O(n) flops
- 90% of the time is spent in subroutine DAXPY:

         DO 10 I = 1, N
            Y(I) = Y(I) + ALPHA * X(I)
      10 CONTINUE

- Performance is measured at three levels of problem size and optimization:
  - 100 x 100 problem: inner-loop optimization only
  - 1000 x 1000 problem: three-loop optimization (the whole program)
  - And a scalable parallel problem

Performance of Real Systems: A Function of
- Parallelism: the number of concurrent threads (classically, the number of processors)
- TLC (thread-level concurrency): the number of flops per thread per cycle
- Clock: the number of cycles per second

Measured Performance
[Table: LINPACK (100 x 100) results for representative machines -- Intel Pentium 4, Intel Itanium, Alpha, IBM RS/6000, NEC SX, Cray -- with clock rate, peak and measured Mflops.]
- Rpeak: absolutely the most ops per second possible from the computer = clock rate x ops per cycle
- Rmax: the maximum rate observed on the benchmark
- Efficiency = Rmax / Rpeak

What is Supercomputing?

The History of Supercomputers
[Figure: Top500 performance development over time -- the aggregate list, the #1 system, and the #500 system, growing from gigaflops in 1993 toward petaflops; see http://www.top500.org]

Types of Parallelism (from the book's CD)
[Figure: Top500 share by architecture class over time -- SIMD machines, clusters of workstations, clusters of SMPs, massively parallel processors (MPPs), shared-memory multiprocessors (SMPs), and uniprocessors.]

The Historical Top 10
[Figure: Rmax of the top-10 systems and the leading-edge Rmax and Rpeak since 1993 on a log scale, showing a roughly constant compound annual growth rate over 14 years.]
- CAGR = Compound Annual Growth Rate

The Top 10: Efficiency
- Linpack efficiency = Rmax / Rpeak
[Figure: Linpack efficiency for the top-10 systems and the top system, 1993-2008.]

The Number of Cores in These Systems
[Figure: core counts for the top-10 systems and the top system, 1993-2009, on a log scale.]

The Clock Rate of These Systems
[Figure: clock rates (GHz) for the top-10 systems and the top system, 1993-2009.]

The Per-Core ILP in Flops per Cycle
[Figure: flops per cycle per core, computed from both Rmax and Rpeak, for the top-10 systems and the top system, 1993-2009.]
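The last few slides decompose delivered performance into processor count, per-core flops per cycle, and clock, and compare it to Rpeak through the efficiency ratio. A small sketch tying those together; every machine number below is made up for illustration.

    # Peak rate from the three factors named above, plus Linpack efficiency.
    sockets         = 10_000        # hypothetical processor count
    flops_per_cycle = 4             # assumed per-core ILP
    clock_hz        = 2.0e9         # assumed 2 GHz clock

    rpeak = sockets * flops_per_cycle * clock_hz      # 8e13 flop/s = 80 Tflop/s
    rmax  = 6.0e13                                    # assumed measured Linpack rate
    print(f"Rpeak = {rpeak:.1e} flop/s, efficiency = {rmax / rpeak:.0%}")   # 75%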
Top 10: Total Concurrency
- Total concurrency = ops per cycle across the whole machine
[Figure: total concurrency for the top-10 systems and the top system, 1993-2008, on a log scale, with the top-1 trend line.]

Memory Trends
[Figure: memory capacity versus delivered Gflops for leading systems over time.]
- Performance grows with time, but memory per flops is DECLINING

Scaling of Applications

Precise Definitions
- T_A,X(N,P) = time to solution
  - for some algorithm A
  - when executed on a processor system with architecture X
  - where the size of the problem is N
  - and the number of processors available is P
- We term (A,X) "the system", and drop the subscripts from now on

Application Scaling Definitions
- Define T(N,P) = solution time for a problem of size N using P processors
- Strong scaling (T(N,P) versus T(N,1)):
  - For a fixed-size problem, time decreases as P increases
  - Perfect strong scaling if time decreases as 1/P (perfect linear speedup); the piece of the problem solved per processor also drops as 1/P
- Weak scaling (T(1,1) versus T(P,P)):
  - For a fixed solution time, the solvable problem size increases with processor count
  - Perfect weak scaling if the problem size is proportional to P; the piece of the problem solved per processor is constant

Strong Scaling
- Strong scaling: T(N,P1) > T(N,P2) if P1 < P2
  - If the problem size is held fixed, solution time decreases as P increases
- Perfect strong scaling: T(N,P) = T(N,1)/P for all P > 0
  - If the problem is held fixed, the total work done is independent of the number of processors
  - Each processor contributes the same fixed amount of work
  - It is not normally possible to do better than this

Time to Solution Surface
[Figure: the time-to-solution surface over the (N, P) plane. The single-processor curve T(N,1) is the sequential time-complexity curve; axes are logarithmic, with time normalized so that T(1,1) = 1.]

Perfect Strong Scaling: T(N,P) = T(N,1)/P
[Figure: on the same surface, when P = T(N,1) we get T(N,P) = T(N,1)/P = 1 -- enough processors are used to keep the solution time at 1. Lines of constant N fall off as 1/P; the curve in the N x P plane is the reflection of the curve in the N x T plane. Less-than-perfect strong scaling lies in front of this surface.]

Weak Scaling
- Weak scaling: the problem size N that can be solved in constant time increases as P increases
  - For any P2 > P1 there is some N2 > N1 such that T(N2,P2) = T(N1,P1)
- Perfect weak scaling: T(P,P) = T(1,1) for all P > 0
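A minimal sketch of the T(N,P) definitions above, using an idealized, overhead-free machine model; the 10^9 operations/second per-processor rate is an arbitrary assumption.

    def t(n, p, rate=1.0e9):
        """Idealized time to solve a problem of n operations on p processors."""
        return n / (p * rate)

    n0 = 1.0e12
    # Perfect strong scaling: fixed problem, time falls as 1/P.
    print([t(n0, p) for p in (1, 2, 4, 8)])        # [1000.0, 500.0, 250.0, 125.0] seconds
    # Perfect weak scaling: problem grows with P, time stays constant.
    print([t(n0 * p, p) for p in (1, 2, 4, 8)])    # [1000.0, 1000.0, 1000.0, 1000.0] seconds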
Weak Scaling, Graphically
[Figure: the weak-scaling path across the time-to-solution surface -- problem size and processor count grow together while the time to solution stays at T(N,1).]

Speedup
- Speedup_A,X(N,P): how much faster than a single processor
  - for some algorithm A, when run on some architecture X
  - when presented with problem size N and executed with P processors
- Linear speedup: Speedup(P) = K*P, K a constant
  - Thus T(P) = T(1)/(K*P): doubling P cuts the time in half -- the holy grail
  - All P resources are gainfully used 100% of the time
- Perfect linear speedup: Speedup(P) = K*P with K = 1
- Logarithmic speedup: Speedup(P) = P/log2(P)
  - T(P) ~ T(1)*log2(P)/P
  - A law of diminishing returns, typical of prefix problems
- Fixed overhead: Speedup(P) -> a constant as P -> infinity
  - Due to fixed serial overhead (see Amdahl's Law)
- The real world is a complex mix of these

And Believe It or Not...
- Superlinear speedup: S(N,P) > P
  - Doubling P more than halves the time
  - Not as rare as you might think
- Various sources of superlinear speedup:
  - Loop overhead from sequential code is removed
  - Side effects of improved cache effectiveness: more processors => more total bits of cache
  - Reduction of system overhead
  - Randomized algorithms (e.g. alpha-beta search), where parallel searches with cross-thread communication can truncate what would have been a long, fruitless sequential search
  - See Gustafson, "Fixed Memory, Tiered Memory, and Superlinear Speedup"

Limitations to Perfect Speedup
- In the real world, P processors cannot provide a speedup of P
- The primary reason: there is some time when all processors cannot be kept busy
- In particular, serial time: time during a computation when only one processor can be kept busy
- Sources of serial time:
  - Control points
  - Graphs that have insufficient degrees of concurrency throughout
  - Synchronization points

The Effect of Serial Code: Amdahl's Law
- Define:
  - N = number of instructions (operations) that must execute during a program
  - T_single = execution time of an individual instruction
  - F = fraction of a program's instructions that must be done serially
  - P = number of equivalent instructions from the program that can be executed in one instruction time when executing in parallel (usually assumed to be provided by P processing units)
  - T_sequential = execution time on a single processor = N * T_single
  - T(N,P,F) = execution time when the parallelizable fraction (1 - F) may be sped up by a factor of P
- T(N,P,F) = N*F*T_single + (1 - F)*T_sequential / P
- Speedup = T_sequential / T(N,P,F) = 1 / (F + (1 - F)/P) = P / (F*P + 1 - F)
- The asymptotic limit as P -> infinity is 1/F: there is a limit to the speedup possible from parallelism

Overall Speedup: Amdahl's Law Graphically
[Figure: overall speedup versus the speedup factor applied to the parallelizable part (1 to 10,000), for a family of serial fractions from purely serial down to small fractions of a percent; each curve saturates at its 1/F limit.]

Amdahl's Law: Another Viewpoint
- The speedup S(N,P,F) predicted by Amdahl's Law is equivalent to:
  - Having S(N,P,F) parallel units
  - All of which can be gainfully employed against the program
  - 100% of the time, from start to finish
- S(N,P,F) = average parallelism = average throughput
- Multiplying by function-unit pipeline depth yields average concurrency

Types of Speedup, Graphically
[Figure: speedup versus number of "processors" on log-log axes, comparing linear/perfect, logarithmic, and fixed-overhead speedup with Amdahl curves for 10%, 1%, and 0.1% serial code.]
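Amdahl's Law from the slides above as a one-line function; the serial fractions and processor counts below are chosen only to show the 1/F saturation.

    def amdahl_speedup(p, f):
        """Speedup on p processors when a fraction f of the work is strictly serial."""
        return 1.0 / (f + (1.0 - f) / p)

    for f in (0.1, 0.01, 0.001):
        print(f, [round(amdahl_speedup(p, f), 1) for p in (10, 100, 1000, 10**6)])
    # Each row flattens out near its 1/f asymptote: 10, 100, and 1000.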
Summary of Scaling
- Strong scaling: T(N,P1) > T(N,P2) if P1 < P2
  - Increasing the number of processors solves the same problem faster
- Perfect strong scaling: T(N,P) = T(N,1)/P for all P > 0
  - The solution speeds up as fast as the processor count increases
- Weak scaling: for P2 > P1 there is some N2 > N1 such that T(N2,P2) = T(N1,P1)
  - If you increase the number of processors, you can solve a larger problem in the same time
- Perfect weak scaling: T(P,P) = T(1,1) for all P > 0
  - The size of the solvable problem scales with the number of processors

Overlaying Strong and Weak Scaling
[Figure: the strong-scaling (green) and weak-scaling (purple) paths overlaid on the time-to-solution surface over the number of processors.]

Amdahl's Law (recap)
- The fraction of code F that must run serially constrains speedup
- T(N,P,F) = F*T_sequential + (1 - F)*T_sequential / P
- Speedup = T_sequential / T(N,P,F) = 1 / (F + (1 - F)/P) = P / (F*P + 1 - F)
- The asymptotic limit as P -> infinity is 1/F

Efficiency
- Efficiency: what percentage of some resource is effectively used during execution of a program
  - If it's there and you don't use it, it is wasted capability
- Different resources may have different efficiencies over the same program execution
  - e.g. the number of FPUs versus memory bandwidth
- Typical parallel-computing efficiency: what percent of the total operations that are performable by each of a set of parallel function units are actually used in the computation

Efficiency = Operations Performed / Total Possible
[Figure: efficiency versus the speedup factor for the parallelizable part, for serial fractions ranging from perfectly parallelizable to purely serial; efficiency falls as either the serial fraction or the parallelism grows.]

Maximum Useable Parallelism for Various Efficiencies
[Figure: maximum useable parallelism versus the fraction of code that is serial, for desired efficiencies from 20% up to 99%.]

But What if the Problem Size Can Grow? Gustafson's Law
- Amdahl's Law assumes:
  - Some fixed number of operations
  - The fraction of those that must be done serially is independent of the number of operations
- But what if:
  - We can always find bigger problems (more operations, more data, etc.)
  - And the serial overhead is not a constant fraction
- Gustafson's Law: very often, if you are allowed to increase the size of a problem, then with sufficient parallelism you can achieve arbitrary speedup

Gustafson's Law
- Assume:
  - N = size of the problem (typically the size of the data set)
  - T_parallel(N,P) = time for a parallel processor of size P to solve a problem of size N
  - s(N,P) = fraction of that time that the parallel processor spends in serial code
- Then T_serial(N) = T_parallel(N,P) * [ s(N,P) + P*(1 - s(N,P)) ]
- Or Speedup(N,P) = T_serial(N) / T_parallel(N,P) = s(N,P) + P*(1 - s(N,P))
- The closer s(N,P) goes to 0, the closer the speedup approaches P

An Example
- Assume total instructions = a + b*N
  - a, b constants: a is a fixed overhead, b is the work per datum inside the parallelizable loop
  - All b*N instructions can be parallelized
- Parallel time = a + b*N/P
- Serial fraction s(N,P) = a / (a + b*N/P) = 1 / (1 + b*N/(a*P))
- As N -> infinity, s(N,P) -> 0

The Example (a = 100, b = 1)
[Figure: speedup versus problem size N for processor counts P from 1 to 10^8; we can always reach a speedup of ~P by growing the problem.]

Alternative Graph (a = 100, b = 1)
[Figure: speedup versus parallelism P for a range of problem sizes N; for any P we can reach a speedup of ~P with a big enough problem.]
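The a + b*N example above as code, with a = 100 and b = 1 as on the slides; the processor count is fixed at an arbitrary P = 1000 to show the speedup climbing toward P as the problem grows.

    def scaled_speedup(n, p, a=100.0, b=1.0):
        """Gustafson-style speedup for the a + b*N example: s(N,P) = a / (a + b*N/P)."""
        s = a / (a + b * n / p)
        return s + p * (1.0 - s)

    p = 1000
    for n in (1e3, 1e6, 1e9, 1e12):
        print(f"N = {n:.0e}: speedup = {scaled_speedup(n, p):.1f}")
    # Roughly 10.9, 909, 999.9, 1000: the serial overhead washes out as N grows.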
Karp-Flatt Metric
- Determining the serial portion in advance is usually hard
- Alternative, once the program is running:
  - Fix the problem size at N
  - Run the same code at different degrees of parallelism
  - Measure the execution time T(N,P) (wall clock)
- Define the experimentally determined ("effective") serial fraction F(N,P) as the fraction of T(N,1) that would still have to be serial if perfect parallelism applied to the rest of the time
- Equivalently, with observed speedup S = T(N,1)/T(N,P): F(N,P) = (1/S - 1/P) / (1 - 1/P)

Karp-Flatt Metric: An Example (NAMD 2.5 on Blue Gene)
[Figure: the Karp-Flatt metric and the observed versus perfect speedup as functions of processor count (up to ~2500) for NAMD 2.5 on Blue Gene/L; source: "Achieving Strong Scaling on Blue Gene/L: Case Study with NAMD" (kumar, NAMD.ppt).]

Same Data on a Log Scale
[Figure: the same Karp-Flatt metric and speedup data, with the processor-count axis plotted logarithmically.]

Speedup & Memory
- How does memory capacity affect the ability to achieve speedups?
- Conventional rule of thumb: for each flop per second of performance you need a byte of memory
  - Today's supercomputers: 0.15-0.3 bytes/flops
  - Reason: cost
- NVIDIA GPU: 1 GB of memory versus a peak of ~1000 SP Gflops/second = 0.001 bytes/flop

Effect of Strong Scaling on Memory
- Mem(P) = memory capacity per processor when P processors are used
- The problem size is fixed:
  - => the total system (non-duplicated) memory is fixed
  - => in the best case, Mem(P) = Mem(1)/P
- Why might this limit not be reachable?
- If the system allows random addressability:
  - Then partitioning the data set as 1/P still works
  - But non-resident accesses may cause slowdown
  - As might synchronization around memory writes
- If the system limits direct addressability to just local memory:
  - Then local data may have to grow back toward the full problem size, Mem(1)
  - Plus explicit messages may have to be exchanged for intermediate updates (slowdown)

Effect of Weak Scaling on Memory
- Mem(P) = memory capacity per processor when P processors are used
- The problem size grows as P:
  - => in the best case, memory per processor is constant, i.e. Mem(P) = Mem(1)
- Why might this limit not be reachable?
- If the system allows random addressability:
  - Then a constant per-processor data-set size works
  - But non-resident accesses may cause slowdown, especially if the number of such references grows with problem size
  - As might synchronization around memory writes
- If the system limits direct addressability to just local memory:
  - Then local data may have to grow all the way up to P*Mem(1)
  - Plus explicit messages may have to be exchanged for intermediate updates (slowdown)

The Road We Took to a Petaflop: My Generation's Path

Why is Increasing Performance So Hard? Little's Law
- Concurrency = Throughput x Latency
- ILP is getting tougher and tougher to increase: it must be extracted from the program and supported in hardware
- Throughput is much less than peak
- Latency is getting worse, and degrading rapidly
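Little's Law from the slide above, applied to outstanding memory references; both numbers below are illustrative assumptions, not figures from the lecture.

    # Concurrency = Throughput x Latency
    throughput_refs_per_s = 1.0e18     # assume ~1 memory reference per flop at an exaflop/s
    latency_s             = 100e-9     # assume ~100 ns average memory latency

    concurrency = throughput_refs_per_s * latency_s
    print(f"references that must be in flight: {concurrency:.0e}")   # ~1e11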
1990s PITAC Findings for R&D in High-End Computing (HEC)
(PITAC: the President's Information Technology Advisory Committee)
- Then-current high-end computing systems were not well suited to many applications of strategic importance to the nation
- Funding should be focused on innovative architectures, hardware technologies, and software technologies that overcome the limitations of today's systems
- "Substantive technological advances will be needed to achieve petaflops performance levels by 2010."
- "We recommend that funding for the HECC program's petaflops activity be increased to ensure the necessary advances are achieved."
- "The committee recommends expanding research and development funding for new computer architectures, including the design of memory hierarchies that reduce or hide access latencies, and into promising new technologies."

National Petaflops Initiative: the 1990s Viewpoint
- Petaflops are needed now
- Enough chips could be glued together by 2010, but there are open issues of cost, power, and efficiency
- Need for rethinking technology, architecture, and algorithms
- 1994: 1st Petaflops Workshop in Pasadena, CA (leading to "Enabling Technologies for Petaflops Computing")
- Result: HTMT (Hybrid Technology MultiThreaded)
  - Petaflops by 2005-2007
  - Multi-phase project started in 1996
  - Budget cuts forced suspension in mid-2001
- Follow-on: HPCS (High Productivity Computing Systems)
  - Commercial petaflops by 2010
  - Started in 2002
- Notre Dame was part of all of these

The 1994 View of Petaflops
- Technology: lots of choices, but all need work
  - CMOS projected to be ~10 GHz at 100 nm in 2007
  - Optics would give more rack-to-rack bandwidth than wire
  - 3D optical storage cubes: much higher density
  - Superconducting electronics offered the potential of 50 GHz
- Architecture: 3 viable options
  [Figure: Category I -- global shared memory; Category II -- cluster; Category III -- processing-in-memory; each drawn as an arrangement of core and memory blocks.]
- Software: like your crazy Uncle Fred

HTMT: The First Proposed Petaflop Machine
[Figure: the HTMT system concept.]
- 4096 RSFQ 256-GF "SPELL" processors (SUNY Stony Brook, TRW) in a 4-Kelvin cryostat
  - Rapid Single Flux Quantum superconducting devices, up to 100 GHz clock rate
  - Hardware support for very large numbers of in-flight threads
- Data Vortex optical switch (Princeton): ~1 PB/s bisection bandwidth
- PIM "smart" memory (University of Notre Dame, Caltech): 16 TB DRAM, 1 TB SRAM
- 1 PB of storage, 10-100 PB of disk, 100-1000 PB of tape

Technologies within HTMT
[Figure: the HTMT technology suite -- the world's smartest memory (PIM, multiple banks with full-row access), the world's densest storage (3D holographic), the world's fastest network (optical WDM), and superconducting RSFQ processors; roughly 1850 sq ft and ~20 cabinets in total, with 980 nm pump lasers for the optics.]

[Figure: DARPA HPCS program vision (slide largely unreadable in these notes); see the DARPA IPTO HPCS program pages, www.darpa.mil/IPTO/programs/hpcs/]

2004: A Major Study on the Need for Petaflops
- Application drivers included genomics, automobile noise, and biological systems modeling, among others
- It's more than just flops

The Road Ahead to an Exaflop: Your Generation's Path

HPC is at a Crossroads
[Figure: two views of the same inflection point -- single-processor performance versus year flattening after 2000 (the "last classical computer," after Dally and an ISAT study), and Top500 HPL (LINPACK) results for the top-10 and top systems with the historical top-1 trend plus extrapolations for an exa strawman and a heavy-node design out past 2015.]
Exascale
- Exascale = 1000X the capability of petascale
- Exascale does not simply mean exaflops:
  - Exascale at the data-center size => exaflops
  - Exascale at the rack size => petaflops in a rack
- It took us 14 years to get from:
  - The 1st Petaflops workshop in Pasadena in 1994
  - Through NSF architectural studies
  - And HTMT, with novel technologies galore
  - To starting HPCS out of plain old silicon
  - Which will get us to peta "Real Soon Now"
- Today's question: can we ride silicon to exa?

The Exascale Power Issue
- An FPU in conventional logic (2013-2014) costs roughly 10-20 pJ per flop
- At 10-20 pJ/flop, 1 exaflop/s is 10-20 MW -- for the floating-point units alone
- Some take-aways:
  - Byte-wide data transfers cost roughly 3X the FPU power
  - Memory stacking increases power further
[Figure: projected FPU energy per flop versus technology generation.]

The Original Exascale Study Quad Chart
- What is a plausible exasystem?
  - 10 PUPS (10^16 updates/s), 0.5 EB of memory, 500 racks, ~10 MW
  - 100 Gflops/Watt, 1 EB/s bisection bandwidth, high-bandwidth streaming I/O, continuous operation
- Why? Critical apps are changing (defense-critical technologies, DSB):
  - Knowledge discovery, video/image processing
  - Defeating IEDs, persistent surveillance, event forensics
- Why is this REALLY HARD with roadmap silicon by 2015?
  - Massive power problems: total power per chip and per-operation dissipation; minimal future Vdd reduction; increasing leakage
  - Tremendous internal traffic at all levels (intra/inter-chip, intra/inter-rack); e.g. 10 PUPS => up to 10^17 inter-rack transactions per second
  - A programming nightmare: the flattening clock forces massive explicit parallelism; non-traditional memory-intensive and streaming applications with support for persistent data
  - Huge latencies and cache-unfriendly apps => managing massive numbers of concurrent memory operations and heterogeneity
  - Inherent MTBF limitations require extraordinary self-healing; e.g. 0.5 EB requires at least 1/2 billion 8-Gb DRAM chips, before ECC
  - Other concerns, such as cost and design complexity, are also growing in significance

The DARPA Exascale Study
- Objective:
  - Understand the course of mainstream computing technology
  - Determine whether it is sufficient for a 1000X increase in computing capabilities by 2015
  - If not, what are the major challenges?
- Conclusions:
  - 4 major challenges: power, memory, concurrency, resiliency
  - Recommended areas of interdisciplinary research
- Study timeframe: May through December 2007

"Exascale Computing Study: Technology Challenges in Achieving Exascale Systems"
[Report cover: Peter Kogge, editor and study lead, with a multi-institution author team; produced for DARPA IPTO, with the standard notice that the findings do not represent official DARPA approval or disapproval.]

Study Participants
[Table: members and affiliations -- Keren Bergman (Columbia), Shekhar Borkar (Intel), Dan Campbell (GTRI), Bill Carlson (IDA), Bill Dally (Stanford), Monty Denneau (IBM), Paul Franzon (NCSU), Bill Harrod (DARPA), Kerry Hill (AFRL), Jon Hiller (STA), Sherman Karp (STA), Steve Keckler (UT-Austin), Dean Klein (Micron), Peter Kogge (Notre Dame), Robert Lucas (USC/ISI), Mark Richards (Georgia Tech), Al Scarpelli (AFRL), Steve Scott (Cray), Allan Snavely (SDSC), Thomas Sterling (LSU), Stan Williams (HP), Kathy Yelick (UC Berkeley) -- a mix of industry and academia.]
Executive Summary
- Developing exascale systems will be tough
  - In any time frame
  - For all classes of systems
- Four key challenge areas for a 2015-era deployment:
  - Power
  - Concurrency
  - Memory capacity
  - Resiliency
- Focusing on architecture and technology issues alone should help spur deployable embedded and departmental systems
  - Economic drivers are the dominant issues
  - Tight integration is needed between technology, architecture, and applications development
- But exascale data-center-class systems need more national commitment
  - A rationale for exascale capability applications
  - A development plan integrated with embedded and departmental systems, to assure transferable, truly scalable technologies

More Specifically
- What are the technology challenges and problems
  - that need special emphasis from now to 2010,
  - whose solution could make 2010 a tipping point for exascale?
- What were we doing?
  - NOT designing exa-level systems
  - NOT supporting specific exa-level apps
- We were looking for gaps in technology capabilities
  - There was a concurrent DOE study into exa-class applications
- We were tasked to identify the research needed to enable exascale technologies
  - It is DARPA's prerogative to translate that into real programs

Approach
- Articulate what exascale means: attributes, metrics, system classes
- Understand current technology trends
  - Both mature and emerging
  - For device, packaging/cooling, and software technologies
- Understand exascale application characteristics
- Develop roadmaps for mature and emerging technologies
- Extrapolate to the 2015 era: both evolutionary and aggressive, clean-sheet-of-paper designs
- Identify key challenges and suggest research directions

What Is Exascale?
- Computational systems with some key scaling attribute 1000X that of 2010 petascale systems
- Classes of computing systems, compared to 2010 systems:
  - Capability systems: solve 1000X tougher problems
    - Solve a single larger problem in the same time
    - Solve the same problem in shorter (especially real) time
    - Figure of merit: time to solution
  - Capacity systems: 1000X throughput on multiple smaller jobs
    - Solve more jobs of the same size per unit time
    - Figure of merit: sustained performance per unit cost

Attributes
- Functional attributes (the "numerators," related to the ability to solve problems):
  - Computational rate (not just flops)
  - Storage capacity: main, scratch, persistent
  - Bandwidth: local, bisection, to scratch, to I/O
- Physical attributes (the "denominators," related to implementation):
  - Total power consumption
  - Physical size, both floor space and volume
  - Cost
  - Used here as denominators, especially "per watt"

Classes of Exascale Systems
- Exascale data-center-sized systems
  - 1000X the performance of petascale data-center systems in roughly the same footprint
  - A maximum of about 500 racks and 20 MW
- Petascale departmental systems
  - Equivalent to petascale HPC systems, but in a 2-4 rack footprint
- Terascale embedded systems
  - Terascale capability within a budget of 100s of watts

Targets and Attributes
[Table: target systems versus a 2010 peta HPC reference point, across computational rate, memory capacity, bandwidth, volume, and power.]
- Exa capacity system (reference: 2010 peta HPC capacity system)
  - Acceleration by single-job speed: 1000X flops, same memory, 1000X bandwidth, same volume and power
  - Acceleration by replication: 1000X flops, up to 1000X memory, 1000X bandwidth, same volume and power
- Exa capability system (reference: 2010 peta HPC capability system)
  - Current apps at current scale: 1000X flops/ops, same memory, 1000X bandwidth, same volume and power
  - Scaled current apps: up to 1000X flops/ops, up to 1000X memory and bandwidth, same volume and power
  - New apps: up to 1000X flops/ops with more persistent memory accesses, up to 1000X memory and bandwidth, same volume and power
- Departmental petasystem (reference: peta HPC system): same rate, memory, and bandwidth at 1/1000 the volume and power
- Embedded ("personal peta") terasystem (reference: peta HPC system): 1/1000 the rate, memory, and bandwidth at 1/1,000,000 the volume and power
- Notes: "same" is relative to 2010 peta HPC systems; data-center limits are no more than 20 MW of electronics power and 500 racks

Study Results: Technology
Technologies Investigated
- Logic
  - Mainstream silicon; low-voltage alternatives; SOI
  - Hybrid logic (nano + silicon)
  - Others
- Memory
  - Mainstream silicon DRAM and Flash
  - Emerging memories: PCRAM, SONOS, MRAM
  - Nano-based memories
  - Disk storage trends
- Interconnect
  - Wire, including low-voltage signaling
  - Optical, all the way down to on-chip

More Technologies
- Packaging and interconnect
  - 3D chip stacks, silicon carriers
  - Chip-to-chip interconnect: through-vias, capacitive, inductive
  - Cooling
- Resiliency
  - As a function of component-count growth
  - As a function of reduced feature size and voltage
  - As a function of technology, especially in memory
  - Considerations of checkpoint/restart in large systems
- Operating environments
- Programming models, especially for massive parallelism

Applications
- Scalability to exascale size
- Memory footprints, including the need for secondary storage
- Latency and locality sensitivities
- Intrinsic concurrency growth

Conventional Silicon Logic
- Flattening in Vdd and watts/chip => flattening in clock
  - Performance gains must come from parallelism
- Logic transistors will begin to demonstrate more variability
- Static leakage may be under control
[Figure: GOPS/watt versus normalized Vdd for several threshold voltages (Vt0, Vt +/- 50 mV).]

Conventional Silicon Memory
- NAND flash cells are now driving DRAM process technology, and are facing serious scaling problems
- DRAM cell architecture has plateaued at 6F^2
- Overall DRAM density challenge: stuck in the ~1 GB/chip regime by 2014
- NAND has an 8-9X per-bit cost advantage over DRAM, but faces problems below 43 nm; the major exa issue is rewrites
- DRAM: ~30X difference between quiescent and full-read power
- DRAM off-chip rates are increasing: an estimated 4 Gbps/pin by 2013
- DRAM FIT rates (failures per billion hours) have grown from 2 to 10
- DRAM cells are SEU-immune, but latches need to grow to counteract upsets
- Thinning DRAM die for chip stacking causes refresh, handling, and interconnect issues

Representative Component Power (in CMOS, 2015 timeframe)
- FPU: ~10-30 pJ/flop => 1 EF/s costs ~10-30 MW
- On-chip interconnect (per mm): ~0.4 pJ/bit => 1 EB/s/mm costs ~3.6 MW
- Off-chip SERDES: ~2 pJ/bit => 1 EB/s costs ~20 MW
- Stacked chip-to-chip, capacitive: ~2 pJ/bit => 1 EB/s costs ~20 MW
- Stacked chip-to-chip, inductive: ~3 pJ/bit => 1 EB/s costs ~30 MW
- Through-silicon via: < 0.3 pJ/bit => 1 EB/s costs ~3 MW

New Possibilities for Nanoscale Storage, Memory, and Logic
- Fabrication technologies: nanoimprint for dense, regular arrangements; demos today at 17 nm pitches
- Nanoscale nonlinear ("memristive") switches used in crossbars for storage and memory
  - Densities of ~100 Gbits/cm^2 -- roughly 10X above 2014 DRAM
  - May be nonvolatile (power savings)
  - May be layered on top of conventional logic
  - May be stackable for even more density
- Hybrid logic circuits (FPGA-like systems)
  - The same crossbar-like switch circuits allow a personalization layer to be stored above CMOS logic circuits
  - May simplify dynamic reconfiguration and self-healing
- NRAM devices have similar attributes
  - Bistable switch points, stackable in layers above the transistor array
  - The current target is flash, with write energy < 10^-15 J/bit

Nanoimprint Crossbar on Top of Silicon
[Figure: a nanowire crossbar at ~30 nm pitch layered over CMOS.]
- Demonstrated 100 Gbits/cm^2 (with high defect density)
- Stacking multiple layers could reach 1 Tbit/cm^2
- Bit lifetimes > 35 years demonstrated
- A worldwide effort: HP, Infineon, NEC, Samsung, Sharp
- W. Robinett et al., "Computing with a trillion crummy components," CACM v50, 2007

CMOL (CMOS under MOLecular electronics)
[Figure: the CMOL concept -- a nanowire/nanodevice crossbar layered over a CMOS stack; a Kogge-Stone adder has been mapped onto it with reconfiguration around ~50 bad cells.]
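A quick sanity check on the Representative Component Power numbers a few slides back: energy per operation times sustained rate gives power. The 10-bits-per-byte factor below is an assumption (framing/encoding overhead) chosen to line up with the table's round numbers.

    def megawatts(pj_per_op, ops_per_s):
        return pj_per_op * 1e-12 * ops_per_s / 1e6

    # FPUs at 10-30 pJ/flop sustaining 1 exaflop/s:
    print(megawatts(10, 1e18), megawatts(30, 1e18))     # 10.0 .. 30.0 MW, as in the table
    # A 2 pJ/bit off-chip link moving 1 EB/s (~1e19 bits/s with assumed overhead):
    print(megawatts(2, 1e19))                           # ~20 MW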
Multi-Layer Phase Change: Storage-Class Memory (SCM)
[Figure: multi-layer phase-change storage-class memory, with an effective cell size of roughly 4F^2 divided by the number of layers; see the IBM Almaden nanoscale devices and memory pages (www.almaden.ibm.com).]

Packaging / Interconnect
- Need for stacks of more than 2 chips
- Capacitive and (especially) inductive coupling are becoming real options
- Growing density of through-silicon vias, at far less energy per bit transferred than capacitive or inductive coupling
- Microchannel cooling
- Emerging chip-to-chip edge-coupling options, especially perpendicular ("quilt") packaging

Leading 3D Packaging Options
[Table: comparison of the leading candidates.]
- Edge mounting: simple, but I/O-limited (maybe ~256 total); the simplicity is attractive, but new thermal/CPU fixturing concepts are needed
- Multi-layer MCM: large I/O capacity but limited to 2-3 die; unlikely to scale, and needs re-engineering to address exascale
- 3D MINT: 9-32 memory die and thousands of I/O, but complex and cannot support a ~100 W chip right now; has the closest immediate potential

Commodity DRAM Capacity
[Figure: projected DRAM chip capacity (GB), chips per petabyte (millions), and chips per petabyte per rack (thousands), 2005-2020.]

And Not a Lot of Alternatives in Commercial Technologies
[Figure: density relative to DRAM, 2005-2020, for MLC and SLC flash, PCRAM (BJT and NMOS), eDRAM (densest and fastest variants), MRAM, FeRAM, and SRAM.]

Commodity DRAM Bandwidth
- Aggressive strawman assumptions:
  - 3.6 PB of DRAM = 3.6 million 1-GB chips, 16 per processor chip
  - An assumed bandwidth taper of 1 DRAM word transferred per 50 flops, or 0.16 bytes per flop => HIGHLY local applications
  - Aggregate processor-chip bandwidth of 89 GW/s (712 GB/s)
  - Thus a per-DRAM data bandwidth of at least ~45 GB/s per chip
- If we wanted 10 PUPS, and each update involved only 16 bytes transferred, then with the same memory we would need 16 x 10^16 / 3.6 x 10^6 ~ 45 GB/s per chip

Disk Drive Forecast
[Figure: historical and projected disk-drive capacities for consumer, enterprise, and handheld drives, and the resulting drives per exabyte, 1995-2015.]

Disk Drive Power
[Figure: projected power per drive for consumer, enterprise, and handheld drives, 2007-2015.]

Transfer Time for Checkpointing
[Figure: projected per-drive data rates for consumer, enterprise, and handheld drives, 2007-2015, and the resulting time to move a petabyte.]

Some Historical Perspective
[Figure: DRAM chip counts and socket counts for major systems (ASCI Red, ASCI White, ASCI Q, Earth Simulator, Columbia, Red Storm, BlueGene/L, ...) extrapolated to exascale, 1996-2016.]
- Observed failure causes on large systems (approximate shares): hardware 62% of breakdowns / 60% of downtime, software 18% / 19%, network 2% / 1%, environment 1% / 2%, human 1% / 0%, unknown 16% / 18%
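The checkpointing slide above is essentially a division problem: memory capacity over aggregate disk bandwidth. A rough sketch -- the drive count and per-drive rate below are assumptions, not figures from the study.

    memory_bytes       = 3.6e15      # ~3.6 PB of DRAM, as in the aggressive strawman
    drives             = 100_000     # assumed number of drives absorbing the checkpoint
    drive_rate_bytes_s = 150e6       # assumed ~150 MB/s sustained per drive

    seconds = memory_bytes / (drives * drive_rate_bytes_s)
    print(f"full-memory checkpoint: ~{seconds / 60:.0f} minutes")   # ~4 minutes under these assumptions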
System DRAM 5 608256 3041 K Compute lO ASIC 20 66560 1331 K ETH Complex 160 3024 484K Nonredundant power supply 500 384 384K Link ASIC 25 3072 77K Clock Chip 65 1200 8K Total FlTs 5315K CSE 322 Exascale99 2009 The Growing Technology Sensitivities 100000 a E 9 Electric Field 39n 0 Tem erature E 10000 if p g A eA1kT 3 ISE Vulnerabilit gt y A A A c 1000 3 I m A 9 100 g I H I a 5 10 I 2 s a I m z o 3 0 1 7 i 9 i i i i i i i i 1992 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014 CSE 322 Exascale100 2009 Mean Time to Interrupt hours Effect on System Utilization 10000 Per socket failure rate 0001 001 O1 1000 1997 1999 2001 2002 2003 2004 2005 2006 2007 2015 Application Utilization 00 0 O O O O O O O L in is in in 39u be in P 1997 1999 2001 2003 2005 2007 2009 2011 2013 2015 CSE 322 Exascale101 2009 InterconnectPackaging Technologies Approach Wiresmm or sqcm Bandwidthmm Comments Routable signal pairs per mm a 99 2 WireSImm per layer layers bitrate 1 mm per signal pair Laminate 20 wiresmmlayer 24 signal 6 pairsmm 30 Gbps 1 mil linetrace presents Ball Grid layers 180 720 Gbpsmm 14 practical limit Array 2000 max total pin count signal layers 1 mm BGA ball pitch Package HQ 500 pairs 15 prs Silicon 50 wiresmmlayer 2 signal 12 pairmm 30 Gbps 2 signal layers is Carrier layers 360 720 Gbpsmm 12 practical limit Has to be packaged for O signal layers 3DC Stack 1040 wiresmm vertically Total 100 200 pair 10 Limited interconnect around edge Gbps 9 05 2 prs assumes memory performance 3D IC with Through Silicon Vias In excess of 10000 vias per sqmm In excess of 100000 prssqcm Really determined by floorplan issues Chip stack limited to 48 chips depending on thermal and other issues Stacked Packages 1mm on periphery 25 pairs total 10 Gbps 9 250 Gbps Not very applicable to high performance systems Stacked Silicon Carriers l Vertical connections 20 um pitch 9 250000 sqcm 62500 pairs 30 Gbps 9 1900 prs sqcm Limited by thermal and coplanarity issues Stacked Silicon Carriers Vertical connections 100 um pitch 9 10000 sqcm 2500 pairs 30 Gbps 9 75 prs sqcm Early demonstration only Air cooled to lt 117 W total CSE 322 Exascale102 2009 State of the Art 3D Chip Stacking Schemes faceup Technology Pitch Power Through lt 5 pm 1 11 fJbit Silicon Vias Capacitive 36 pm 2 pJbit facetoface 13 E353 Inductive 30 pm 014 pJbit CSE 322 Exascale103 2009 Near Term Optical Parallel Data In Photon Electrical Source Optical Encoder Serializer Modulator Channel I Optical May be reversed Temperamamp Router 4 Decoder Deserializer Receiver Parallel Data Out Electrical If ea Term m per I115 pngjt 15 i39hil39t 1 Slaps 1039 En ctJr 39AFl u n 1023 act factor 5 bit Illquotl AF 39F 5 i39hl39t CSE 322 Exascale104 Table 17 Energy budget 3910 option modular21 TELquot m 43 me a HIRE 005 Haiti 15 39hit 2009 An Onchip Optical Option mg m mm 39mlal PM In vhammcmlelwnm 1am 7 u opmmmw mmmml d pm miumvlm CSE 322 ExzsczleJ E mus Summary State of the Art Interconnect Technology Density Power uijjhitj TEt 39lDIZIlIZIEquot Readineae iwireefmmj Longrange tunchip 251 13 EJ a39Ib39lIDm Denvmstrated mapper Chipro elu p mpper B 2 ply39hit Includes Elemnarrated Peten I CDR tial for sealing tr 1 ply39hit Reurej intercuanitect 13953 2 pr39lzuit reughly the ame fer padret router or nan blauiing circuit switch in 201 1 pr39hit Optical State atquot Art 1 393 Fulfhit FD l39irIcluzi Demanatrateui ujmulti medejn ing CUR Optical Single mcde BIZICI 75 IIquotbit Assumes lithegraphe in 2010 SDI waveEntities PCBembedded waveguide Clara net exist Opt39wal Single inc e BIZICI 15 
Summary: State-of-the-Art Interconnect
[Table (interconnect technology roadmap): approximate density, energy, and readiness for each option -- long-range on-chip copper (demonstrated, a few pJ/bit); chip-to-chip copper links (~2 pJ/bit including clock/data recovery, demonstrated, with potential to scale toward ~1 pJ/bit); routed electrical interconnect (roughly the same ~2 pJ/bit in 2010 and ~1 pJ/bit later, whether a packet router or a non-blocking circuit switch); state-of-the-art multimode optics (~10 pJ/bit demonstrated, including CDR); single-mode SOI-waveguide optics (sub-pJ/bit projected for 2010-2015, still at an early research stage; PCB-embedded waveguides do not yet exist); optical routing adds ~0.1 pJ/bit per switch; optical temperature control (TEC) can be shared across fiber bundles but is undemonstrated.]

Study Results: Strawman Systems

Architectures Considered
- Evolutionary strawmen:
  - A "heavyweight" strawman based on commodity-derived microprocessors
  - A "lightweight" strawman based on custom microprocessors
- An aggressive strawman: a clean-sheet-of-paper CMOS silicon design

A Modern HPC System: Where the Power and Area Go
[Figure: power and board-area distribution for a modern heavyweight HPC node, split among processors, memory, routers, power distribution/conversion, random logic, and white space (exact percentages unreadable in these notes).]

Evolutionary Scaling Assumptions
- Applications will demand the same DRAM/flops ratio as today
- Ignore any changes needed in disk capacity
- Processor die size will remain constant
- Continued reduction in device area => multi-core chips
- Vdd and max power dissipation will flatten as forecast; thus clock rates are limited as before
- On a per-core basis, microarchitecture will improve from 2 flops/cycle to 4 in 2008 and 8 in 2015
- The maximum number of sockets per board will double roughly every 5 years
- The maximum number of boards per rack will increase once, by 33%
- The maximum power per rack will double every 3 years
- Allow growth in system configuration by 50 racks each year

Possible System Power Models
- "Simplistic" -- a highly optimistic model:
  - Max power per die grows as per the ITRS
  - Power for memory grows only linearly with the number of chips (power per memory chip remains constant)
  - Power for routers and common logic remains constant, regardless of the obvious need to increase bandwidth
  - True only if the energy per bit moved/accessed decreases as fast as flops per second increase
- "Fully scaled" -- a pessimistic model:
  - The same as Simplistic, except memory and router power grow with peak flops per chip
  - True if the energy per bit moved/accessed remains constant

With All This, What Do We Have to Look Forward To?
[Figure: projected performance (Gflops, log scale) out to 2020 -- the historical top-10 Rmax, the leading-edge Rmax and Rpeak, and the evolutionary heavyweight strawman under both the fully scaled and the simplistically scaled power models.]

And At What Power Level?
[Figure: projected total system power (MW, log scale) for the evolutionary strawmen, 2005-2020.]
- Remember: we still have to add on cooling, power conditioning, and secondary memory
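To make the evolutionary scaling assumptions above concrete, here is a toy projection of peak performance. Every constant in it (socket count, base clock, core-doubling period) is an assumption for illustration, not a figure from the study.

    def peak_flops(year, base_year=2008, base_cores=4, clock_hz=2.5e9, sockets=50_000):
        cores = base_cores * 2 ** ((year - base_year) / 2)    # assume cores per socket double every ~2 years
        flops_per_cycle = 4 if year < 2015 else 8              # the per-core ILP step assumed on the slide
        return sockets * cores * flops_per_cycle * clock_hz    # clock held flat, per the assumptions

    for year in (2008, 2012, 2016, 2020):
        print(year, f"{peak_flops(year):.1e} flop/s")
    # Roughly 2e15, 8e15, 6.4e16, 2.6e17: with a flat clock, core growth alone
    # still leaves peak short of 1e18 flop/s in this window.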