Thesis AE 500
These 50-page class notes were uploaded by Eldon Emmerich PhD on Monday, October 26, 2015. The notes belong to AE 500 at University of Tennessee - Knoxville, taught by Staff in Fall.
Date Created: 10/26/15
EECS 594, Spring 2008
Lecture 2: Overview of High-Performance Computing
Jack Dongarra
Electrical Engineering and Computer Science Department, University of Tennessee

Percentage of peak
- A rule of thumb that often applies:
  - A contemporary processor, for a spectrum of applications, delivers (i.e., sustains) 10% of peak performance
- There are exceptions to this rule, in both directions
- Why such low efficiency?
- There are two primary reasons behind the disappointing percentage of peak:
  - IPC inefficiency
  - Memory inefficiency

IPC
- Today the theoretical IPC (instructions per cycle) is 4 in most contemporary RISC processors (6 in Itanium)
- Detailed analysis for a spectrum of applications indicates that the average IPC is 1.2 to 1.4
- We are leaving 75% of the possible performance on the table

Why Fast Machines Run Slow
- Latency: waiting for access to memory or other parts of the system
- Overhead: extra work that has to be done to manage program concurrency and parallel resources, rather than the real work you want to perform
- Starvation: not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources
- Contention: delays due to fighting over what task gets to use a shared resource next. Network bandwidth is a major constraint.

[Figure: Latency in a Single System. CPU clock period, memory system access time, and their ratio, plotted by year.]

Memory hierarchy
- Typical latencies for today's technology:

  Hierarchy level      Processor clocks
  Register             1
  L1 cache             2-3
  L2 cache             6-12
  L3 cache             14-40
  Near memory          100-300
  Far memory           300-900
  Remote memory        O(10^3)
  Message-passing      O(10^3)-O(10^4)

Memory Hierarchy
- Most programs have a high degree of locality in their accesses
  - Spatial locality: accessing things nearby previous accesses
  - Temporal locality: reusing an item that was previously accessed
- Memory hierarchy tries to exploit locality

Memory bandwidth
- To provide bandwidth to the processor, the
bus either needs to be faster or wider
- Busses are limited to perhaps 400-800 MHz
- Links are faster
  - Single-ended: 0.5-1 GT/s
  - Differential: 2.5-5.0 (future) GT/s
  - Increased link frequencies increase error rates, requiring coding and redundancy, thus increasing power and die size and not helping bandwidth
- Making things wider requires pin-out (Si real estate) and power
  - Both power and pin-out are serious issues

[Figure: Processor-DRAM Memory Gap. Performance vs. year. Processor ("Moore's Law"): 60%/yr (2x per 1.5 years). DRAM: 9%/yr (2x per 10 years). The processor-memory performance gap grows 50% per year.]

Processor in Memory (PIM)
- PIM merges logic with memory
  - Wide ALUs next to the row buffer
  - Optimized for memory throughput, not ALU utilization
- PIM has the potential of riding Moore's law while
  - greatly increasing effective memory bandwidth,
  - providing many more concurrent execution threads,
  - reducing latency,
  - reducing power, and
  - increasing overall system efficiency
- It may also simplify programming and system design

Internet: 4th Revolution in Telecommunications
- Telephone, Radio, Television
- Growth in Internet outstrips the others
- Exponential growth since 1985
- Traffic doubles every 100 days
[Figure: Growth of Internet Hosts, Sept. 1969 to Sept. 2002; domain names.]

Peer to Peer Computing
- Peer-to-peer is a style of networking in which a group of computers communicate directly with each other
- Wireless communication
- Home computer in the utility room, next to the water heater and furnace
- Web tablets
- BitTorrent
  - http://en.wikipedia.org/wiki/Bittorrent
- Embedded computers in things, all tied together
  - Books, furniture, milk cartons, etc.
- Smart appliances
  - Refrigerator, scale, etc.
[Figure: Internet On Everything.]

SETI@home: Global Distributed Computing
- Running on 500,000 PCs, about 1000 CPU years per day
  - 485,821 CPU years so far
- Sophisticated Data & Signal
Processing Analysis
- Distributes datasets from Arecibo Radio Telescope
- Berkeley Open Infrastructure for Network Computing

SETI@home
- Use thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence
- When their computer is idle or being wasted, this software will download a 300-kilobyte chunk of data for analysis: about 3 Tflop of work for each client in 15 hours
- The results of this analysis are sent back to the SETI team and combined with those of thousands of other participants
- Largest distributed computation project in existence
  - Averaging 40 Tflop/s
- Today a number of companies are trying this for profit

Google
- Google query attributes
  - 150M queries/day (2000/second)
  - 100 countries
  - 8.0B documents in the index
- Data centers
  - 100,000 Linux systems in data centers around the world
  - 15 TFlop/s and 1000 TB total capability
  - 40-80 1U/2U servers per cabinet
  - 100 MB Ethernet switches per cabinet, with gigabit Ethernet uplink
  - Growth from 4,000 systems (June 2000), 18M queries then
- Performance and operation
  - Simple reissue of failed commands to new servers
  - No performance debugging
- Eigenvalue problem: the matrix is the matrix of the Markov chain, Ax = x

Next Generation Web
- To treat CPU cycles and software like commodities
- Enable the coordinated use of geographically distributed resources, in the absence of central control and existing trust relationships
- Computing power is produced much like utilities such as power and water are produced for consumers
- Users will have access to "power" on demand
- This is one of our efforts at UT

Why Parallel Computing
- Desire to solve bigger, more realistic applications problems
- Fundamental limits are being approached
- More cost-effective solution

Principles of Parallel Computing
- Parallelism and Amdahl's Law
- Granularity
- Locality
- Load balance
- Coordination and synchronization
- Performance modeling
All of these things make parallel programming even harder than sequential programming.

Automatic Parallelism in
Modern Machines
- Bit-level parallelism
  - Within floating point operations, etc.
- Instruction-level parallelism (ILP)
  - Multiple instructions execute per clock cycle
- Memory system parallelism
  - Overlap of memory operations with computation
- OS parallelism
  - Multiple jobs run in parallel on commodity SMPs
There are limits to all of these: for very high performance, the user needs to identify, schedule, and coordinate parallel tasks.

Finding Enough Parallelism
- Suppose only part of an application seems parallel
- Amdahl's law
  - Let $f_s$ be the fraction of work done sequentially; $(1 - f_s)$ is the fraction parallelizable
  - $N$ = number of processors
- Even if the parallel part speeds up perfectly, performance may be limited by the sequential part

Amdahl's Law
Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl's Law are given below:

  $t_N = (f_p/N + f_s)\,t_1$    Effect of multiple processors on run time
  $S = 1/(f_s + f_p/N)$         Effect of multiple processors on speedup

Where:
  $f_s$ = serial fraction of code
  $f_p$ = parallel fraction of code $= 1 - f_s$
  $N$ = number of processors

Illustration of Amdahl's Law
It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.
[Figure: speedup vs. number of processors.]

Overhead of Parallelism
- Given enough parallel work, this is the biggest barrier to getting desired speedup
- Parallelism overheads include:
  - cost of starting a thread or process
  - cost of communicating shared data
  - cost of synchronizing
  - extra (redundant) computation
- Each of these can be in the range of milliseconds (= millions of flops) on some systems
- Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work

Locality and Parallelism
[Figure: Conventional Storage Hierarchy: processors, caches, L2 cache, memory.]
- Large memories are slow,
fast memories are small
- Storage hierarchies are large and fast on average
- Parallel processors, collectively, have large, fast cache
  - The slow accesses to "remote" data we call "communication"
- Algorithm should do most work on local data

Load Imbalance
- Load imbalance is the time that some processors in the system are idle due to
  - insufficient parallelism (during that phase)
  - unequal size tasks
- Examples of the latter
  - adapting to "interesting parts of a domain"
  - tree-structured computations
  - fundamentally unstructured problems
- Algorithm needs to balance load

What is Ahead
- Greater instruction-level parallelism
- Bigger caches
- Multiple processors per chip
- Complete systems on a chip (portable systems)
- High-performance LAN, interface, and interconnect

Directions
- Move toward shared memory
  - SMPs and Distributed Shared Memory
  - Shared address space with deep memory hierarchy
- Clustering of shared memory machines for scalability
- Efficiency of message passing and data parallel programming
  - Helped by standards efforts such as MPI and OpenMP

High Performance Computers
- About 20 years ago
  - 1x10^6 floating point ops/sec (Mflop/s)
    - Scalar based
- About 10 years ago
  - 1x10^9 floating point ops/sec (Gflop/s)
    - Vector & shared memory computing, bandwidth aware
    - Block partitioned, latency tolerant
- Today
  - 1x10^12 floating point ops/sec (Tflop/s)
    - Highly parallel, distributed processing, message passing, network based
    - Data decomposition, communication/computation
- About 1 year away
  - 1x10^15 floating point ops/sec (Pflop/s)
    - Many more levels of memory hierarchy, combination/grids & HPC
    - More adaptive, latency and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes

Top 500 Computers
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP (Ax = b, dense problem)
- Updated twice a year:
  - SCxy in the States in November
  - Meeting in Germany in June
[Figure: TPP performance, rate vs. size.]

What is a Supercomputer
- A supercomputer is a hardware and software system that provides close to the maximum
performance that can currently be achieved
- Over the last 14 years the range for the Top500 has increased greater than Moore's Law
- 1993:
  - #1 = 59.7 GFlop/s
  - #500 = 422 MFlop/s
- 2007:
  - #1 = 478 TFlop/s
  - #500 = 5.9 TFlop/s

Why do we need them?
Almost all of the technical areas that are important to the well-being of humanity use supercomputing in fundamental and essential ways: computational fluid dynamics, protein folding, climate modeling, and national security (in particular for cryptanalysis and for simulating nuclear weapons), to name a few.

Architecture/Systems Continuum (tightly coupled to loosely coupled)
- Custom processor with custom interconnect
  - Cray X1
  - NEC SX-8
- Commodity processor with custom interconnect (hybrid)
  - SGI Altix (Intel Itanium 2)
- Commodity processor with commodity interconnect
  - Clusters: Pentium, Itanium, Opteron, Alpha; GigE, Infiniband, Myrinet, Quadrics
  - NEC TX7
  - IBM eServer
  - Dawning

30th Edition: The TOP 10
[Table: manufacturer, computer (with Rmax in TF/s), installation site, country, year, cores. Only the following entries are recoverable.]
  1. IBM; eServer Blue Gene, dual core; DOE Lawrence Livermore Nat Lab; USA; 2007; custom
  2. IBM; quad core; 167; Forschungszentrum Juelich; Germany; 65,536 cores
  5. HP; 102.8; Government Agency; Sweden; 13,728 cores
  6. Cray; dual core; 102.2; Sandia Nat Lab; USA; 26,569 cores
  7. Cray; dual core; 101; Oak Ridge National Lab; USA; 23,016 cores
  9. Cray; dual core; 85.4; Lawrence Berkeley Nat Lab; USA; 19,320 cores

IBM BlueGene/L: #1, 212,992 Cores
- Total of 26 systems, all in the Top 176
- 2.6 MWatts (2600 homes)
- 70,000 ops/s/person
- Packaging hierarchy:
  - Chip (2 processors): 2.8/5.6 GF/s, 4 MB cache
  - Compute card (2 chips, 2x1x1), 4 processors: 5.6/11.2 GF/s, 1 GB DDR
  - Node board (32 chips, 4x4x2), 16 compute cards, 64 processors: 90/180 GF/s, 16 GB DDR
  - Rack (32 node boards, 8x8x16), 2048 processors: 2.9/5.7 TF/s, 0.5 TB DDR
  - System (104 racks, 104x32x32), 212,992 processors: 180/360 TF/s, 32 TB DDR
- The compute node ASICs include all networking and processor functionality
- Full system total of 131,072 processors
- Fastest
Computer: BG/L, 700 MHz, 213K processors, 104 racks
- Each compute ASIC includes two 32-bit superscalar PowerPC 440 embedded cores (note that L1 cache coherence is not maintained between these cores)
- Peak: 596 Tflop/s
- Linpack: 498 Tflop/s, 84% of peak (20.7K sec, about 5.7 hours; n = 2.5M)

DOE NNSA (LLNL, SNL, LANL)
- IBM BG/L: IBM PowerPC; cores: 212,992; peak: 596 TF; memory: 73.7 TB
- Red Storm (Cray): AMD dual core; cores: 27,200; peak: 127.5 TF; memory: 40 TB
- RoadRunner (IBM): AMD dual core; cores: 18,252; peak: 81.1 TF; memory: 27.6 TB
- IBM Purple: Power5; cores: 12,208; peak: 92.8 TF; memory: 48.8 TB
- Thunderbird (Dell): Intel Xeon; cores: 9,024; peak: 53 TF; memory: 6 TB

LANL Roadrunner: A Petascale System in 2008
- About 13,000 Cell HPC chips: 1.33 PetaFlop/s (from Cell)
- About 7,000 dual-core Opterons
- 18 clusters; Cell blades connected with 4 PCIe x8 links
- 2nd-stage InfiniBand 4x DDR interconnect (sets of 12 links to 8 switches)
- 2nd-stage InfiniBand interconnect (8 switches)
- Based on the 100 Gflop/s (DP) Cell chip
- Approval by DOE 12/07
- First CU being built today
- Expect a May Pflop/s run
- Full system to LANL in December 2008

DOE Office of Science (ORNL, LBNL, ANL)
- Jaguar (Cray XT): AMD dual core; cores: 11,706; peak: 119.4 TF; upgrading to 250 TF; memory: 46 TB
- Franklin (Cray XT): AMD dual core; cores: 19,320; peak: 100.4 TF; memory: 39 TB
- BG/P (IBM): PowerPC; cores: 131,07[?]; peak: 111 TF; memory: 65.5 TB
- Bassi (IBM): PowerPC; cores: 976; peak: 7.4 TF; memory: 3.5 TB
- Phoenix (Cray X1): Cray vector; cores: 1,024; peak: 18.3 TF
- Seaborg (IBM): Power; peak: 9.9 TF

ORNL NCCS Roadmap for Leadership Computing
- Vision: maximize scientific productivity and progress on the largest-scale computational problems
  - Providing world-class computational resources and specialized services
  - Providing a stable hardware/software path of increasing scale to maximize productive applications development
- Mission: deploy and operate the computational resources needed to tackle global challenges
  - Future energy
  - Understanding the universe
  - Nanoscale materials
  - Climate change
  - Computational biology
- Work with users to scale applications to take advantage of systems
[Figure: roadmap timeline, FY2007 to FY2011: Cray XT4 and Cray X1E (18 TF); Cray XT4 (250 TF); 1 PF leadership-class system for science; future sustained-PF leadership-class system.]

1000 TF Cray "Baker" system in 2008
System configuration:
- 1 PF peak
- 24,000 quad-core processors
- 50 kW per cabinet
- 7 MW power
[Figure: 1 PF Cray system, 2008. Used by permission, Cray Inc.]

Nanoscience Roadmap
Expected outcomes:
- 5 years
  - Realistic simulation of self-assembly and single-molecule electron transport
  - Finite-temperature properties of nanoparticles and quantum corrals
- 10 years
  - Multiscale modeling of molecular electronic devices
  - Computation-guided search for new materials and nanostructures
[Figure: required computer performance (Tflops) for nanoscience applications.]

Biology Roadmap
Expected outcomes:
- 5 years
  - Metabolic flux modeling for hydrogen and carbon fixation pathways
  - Constrained flexible docking simulations of interacting proteins
- 10 years
  - Multiscale stochastic simulations of microbial metabolic, regulatory, and protein interaction networks
  - Dynamic simulations of complex molecular machines
[Figure: required computer performance (Tflops) for biology applications.]
- Today (2008): dedicated cluster for data analysis and visualization [text lost] the large systems, to avoid having to move data

Fusion Roadmap
Expected outcomes:
- 5 years
  - Full-torus, electromagnetic simulation of turbulent transport with kinetic electrons, for simulation times approaching the transport time-scale
  - Develop understanding of internal reconnection events in extended MHD, with heating and current drive techniques for mitigation
- 10 years
  - Develop quantitative, predictive understanding of disruption events in large tokamaks
  - Begin integrated simulation of
burning plasma devices: multi-physics predictions for ITER
[Figure: required computer performance (Tflops) for fusion applications, including extended MHD and gyrokinetic ion turbulence.]

Climate Roadmap
Expected outcomes:
- 5 years
  - Fully coupled carbon-climate simulation
  - Fully coupled sulfur-atmospheric chemistry simulation
- 10 years
  - Cloud-resolving, 30-km spatial resolution atmosphere climate simulation
  - Fully coupled earth system model
[Figure: required computer performance (Tflops) for climate applications.]

[Figure: Performance Development (Top500), with IBM ASCI White, Fujitsu NWT, and a laptop marked for comparison.]

[Figure: Performance Projection, 100 Mflop/s to 1 Eflop/s, 1993 to 2015. 30th List, November 2007, www.top500.org.]

[Figure: Top500 by Usage: industry, research, academic, government, vendor, classified.]

Chips Used in Each of the 500 Systems
- 72% Intel
- 12% IBM
- 16% AMD
[Figure: processor family vs. systems, 11/2007, http://www.top500.org/]

[Figure: Interconnects / Systems, 1993 to 2007: Gigabit Ethernet (270), Myrinet, Infiniband, Quadrics, Crossbar, SP Switch, Cray Interconnect, Others. Interconnect efficiency analysis for GigE, Infiniband, and Myrinet.]

[Figure: Cores per System, November 2007. TOP500 total cores by list release, and number of systems per core-count range from 33-64 up to 64k-128k.]

[Figure: Top500 Systems, November 2007: Rmax (Tflop/s) by rank.]

[Figure: Computer Classes / Systems: BG/L, SIMD, Vector, SMP, Constellation, Cluster, MPP. November 2007, www.top500.org.]

[Figure: Countries / Performance, November 2007: United States 60%; others include China, Taiwan, Spain, Germany, Sweden, India, France, United Kingdom, Japan.]

Environmental Burden of PC CPUs
- Total power consumption of CPUs in world's PCs:
  - 1992: 160 MWatts (87M CPUs)
  - [year illegible]: 9000 MWatts (500M CPUs)
- That's 4 Hoover Dams!
- Andy's vision: 1 Billion Connected PCs
[Figure: Power Consumption of World's CPUs.]

Power is an Industry-Wide Problem
- "Power could cost more than servers, Google warns"
- Google facilities
  - leveraging hydroelectric power, old aluminum plants
  - more than 500,000 servers worldwide
- "Hiding in Plain Sight, Google Seeks More Power", by John Markoff, June 14, 2006
[Figure: new Google plant in The Dalles, Oregon, from NYT, June 14, 2006.]

And Now We Want Petascale ...
- What is a conventional petascale machine?
  - Many high-speed bullet trains
  - A significant start to a conventional power plant
  - "Hiding in Plain Sight, Google Seeks More Power", The New York Times, June 14, 2006

Top Three Reasons for "Eliminating" Global Climate Warming in the Machine Room
3. HPC contributes to global climate warming
  - "I worry that we, as HPC experts in global climate modeling, are contributing to the very thing that we are trying to avoid: the generation of greenhouse gases."
2. Electrical power costs
  - Japanese Earth Simulator
  - Lawrence Livermore National Laboratory
    - Power & cooling of HPC: 14 million/year
    - Power-up of ASC Purple: "panic" call from local electrical company
1. Reliability & availability impact productivity
  - California: State of Electrical Emergencies (July 24-25, 2006)
    - 50,538 MW: a load not expected to be reached until 2010

Reliability & Availability of HPC Systems
[Table: reliability and availability figures per system; entries illegible. MTBI: mean time between interrupts; MTBF: mean time between failures; MTTR: mean time to restore. Source: Daniel A. Reed, RENCI.]

Top500 Conclusions
- Microprocessor-based supercomputers have brought a major change in accessibility and affordability
- MPPs continue to account for more than half of all installed high-performance computers worldwide

With All the Hype on the PS3, We Became Interested
- The PlayStation 3's CPU is based on a "Cell" processor
- Each Cell contains a PowerPC processor and 8 SPEs (SPE = SPU + DMA engine)
  - 204.8 Gflop/s peak
[Figure: top-level block diagram of the Cell Broadband Engine (CBE).]
- An SPE is a processing unit
- An SPE is a self-contained vector processor which acts independently from the others
  - 4-way SIMD floating point units, capable of a total of 25.6 Gflop/s at 3.2 GHz
- The catch is that this is for 32-bit floating point (Single Precision, SP)
- And 64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!
  - Divide SP peak by 14: a factor of 2 because of DP and 7 because of latency issues
[Figure: Cell architecture.]

Increasing CPU Performance: A Delicate Balancing Act
- We have seen an increasing number of gates on a chip and increasing clock speed
- Heat is becoming an unmanageable problem: Intel processors over 100 Watts
- We will not see the dramatic increases in clock speeds in the future
- However, the number of gates on a chip will continue to increase
- Intel Yonah will double the processing power on a per-watt basis

[Figure: Intel prediction of microprocessor frequency (ca. 2001), frequency in MHz by year. Adapted from a presentation by S. Borkar, Intel.]

[Figure: Intel prediction of microprocessor power consumption (ca. 2001), 1971 to 2000, through the P6 and Pentium processors. Adapted from a presentation by S. Borkar, Intel.]

Moore's Law for Power ($P \propto V^2 f$)
[Figure: chip maximum power in watts/cm^2 vs. process generation and year (1985 to 2001). Pentium II: 35 watts; Pentium 4: 75 watts; Itanium: 130 watts. Annotations: "Surpassed heating plate"; "Not too long to reach nuclear reactor". Source: Fred Pollack, Intel, "New Microprocessor Challenges in the Coming Generations of CMOS Technologies", MICRO32, and Transmeta.]

Change is Coming
- Free lunch for traditional software (1 core): operations per second for serial code; it just runs twice as fast every 18 months with no change to the code
- No free lunch for traditional software (2, 4, 8 cores): without highly concurrent software it won't get any faster
- Additional operations per second are available if code can take advantage of concurrency

What is Multicore
- Single chip
- Multiple distinct processing engines
- Multiple, independent
threads of control, or program counters (MIMD)

Integration is Efficient
- Discrete chips
  - Bandwidth: 2 GBps
  - Latency: 60 ns
- Multicore
  - Bandwidth: more than 20 GBps
  - Latency: less than 3 ns
- Parallelism and interconnect efficiency enable harnessing the "power of n": n cores can yield an n-fold increase in performance

Novel Opportunities in Multicores
- Don't have to contend with uniprocessors
- Not your same old multiprocessor
  - How does going from multiprocessors to multicores impact programs?
  - What changed?
  - Where is the impact?
    - Communication bandwidth
    - Communication latency

Communication Bandwidth
- How much data can be communicated between two cores?
- What changed?
  - Number of wires
  - Clock rate
  - Multiplexing
- Impact on programming model?
  - Massive data exchange is possible
  - Data movement is not the bottleneck: processor affinity not that important
[Figure: 32 Gigabits/sec vs. 300 Terabits/sec.]

Communication Latency
- How long does it take for a round-trip communication?
- What changed?
  - Length of wire
  - Pipeline stages
- Impact on programming model?
  - Ultra-fast synchronization
  - Can run real-time apps on multiple cores
[Figure: about 200 cycles vs. about 4 cycles.]

80 Core
- Intel's 80-core chip
  - 1 Tflop/s
  - [power figure illegible] Watts
  - 1.2 TB/s internal BW
- News excerpt: "Intel Prototype May Herald a New Age of Processing. SAN FRANCISCO, Feb. 11: Intel will demonstrate on Monday an experimental computer chip with 80 separate processing engines, or cores, that company executives say provides a model for commercial chips that will be used widely in standard desktop, laptop and server computers within five years. The new processor, which the company first described as a Teraflop chip at a conference last year, will be detailed in a technical paper to be presented on the opening day of the International Solid States Circuits Conference, beginning here on Monday. While the chip
is not compatible with Intel's current chips, the company said it had already begun design work on a commercial version that would essentially have dozens or even hundreds of Intel-compatible [text lost] single chip."

Multicore FPGA Heterogeneity
- What about the potential of FPGA hybrid core processors?

Major Changes to Software
- Must rethink the design of our software
  - Another disruptive technology, similar to what happened with cluster computing and message passing
  - Rethink and rewrite the applications, algorithms, and software
- Numerical libraries, for example, will change
  - For example, both LAPACK and ScaLAPACK will undergo major changes to accommodate this

Interconnect Options
[Figure: mesh multicore, bus multicore, ring multicore.]

Top500 Conclusions
- Microprocessor-based supercomputers have brought a major change in accessibility and affordability
- MPPs continue to account for more than half of all installed high-performance computers worldwide

Distributed and Parallel Systems

  Distributed systems (heterogeneous)     Massively parallel systems (homogeneous)
  Gather (unused) resources               Bounded set of resources
  Steal cycles                            Apps grow to consume all cycles
  System SW manages resources             Application manages resources
  System SW adds value                    System SW gets in the way
  10% - 20% overhead is OK                5% overhead is maximum
  Resources drive applications            Apps drive purchase of equipment
  Time to completion is not critical      Real-time constraints
  Time-shared                             Space-shared

Virtual Environments
[Slide shows a block of raw floating-point values, e.g. 0.32E-05, 0.00E+00, 0.38E-06, ...]
Do they make any sense?

Algorithms and Moore's Law
- The advance spans 24 doubling times for Moore's Law
- 2^24 = 16 million: the same as the factor from algorithms alone!
Figure legend (relative speedup vs. year): Full MG, Optimal SOR, Moore's Law,
Gauss-Seidel, Banded GE.

Different Architectures
- Parallel computing: single systems with many processors working on the same problem
- Distributed computing: many systems loosely coupled to work on related problems
- Grid computing: many systems tightly coupled by software, perhaps geographically distributed, to work together on single problems or on related problems

Types of Parallel Computers
- The simplest and most useful way to classify modern parallel computers is by their memory model:
  - shared memory
  - distributed memory

Shared vs. Distributed Memory
- Shared memory: single address space. All processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000)
- Distributed memory: each processor has its own local memory. Must do message passing to exchange data between processors. (Ex: CRAY T3E, IBM SP, clusters)

Shared Memory: UMA vs. NUMA
- Uniform memory access (UMA): also known as symmetric multiprocessors. (Sun E10000)
- Non-uniform memory access (NUMA): time for memory access depends on location of data. Local access is faster than non-local access. Easier to scale than SMPs. (SGI Origin)

Distributed Memory: MPPs vs. Clusters
- Processor-memory nodes are connected by some type of interconnect network
  - Massively Parallel Processor (MPP): tightly integrated, single system image
  - Cluster: individual computers connected by software

Processors, Memory, & Networks
- Both shared and distributed memory systems have:
  1. processors: now generally commodity processors
  2. memory: now generally commodity DRAM
  3. network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)

Interconnect-Related Terms
- Latency: How long does it take to start sending a "message"? Measured in microseconds. (Also in processors: how long does it take to output results of some operations, such as floating point add, divide,
etc., which are pipelined?)
- Bandwidth: What data rate can be sustained once the message is started? Measured in Mbytes/sec.

Interconnect-Related Terms
- Topology: the manner in which the nodes are connected
  - Best choice would be a fully connected network (every processor to every other). Unfeasible for cost and scaling reasons.
  - Instead, processors are arranged in some variation of a grid, torus, or hypercube
[Figure: 3-d hypercube, 2-d mesh, 2-d torus.]

Standard Uniprocessor Memory Hierarchy
- Intel Pentium 4, 2 GHz processor
- P7 Prescott 478
  - 8 Kbytes of 4-way assoc. L1 instruction cache with 32-byte lines
  - 8 Kbytes of 4-way assoc. L1 data cache with 32-byte lines
  - 256 Kbytes of 8-way assoc. L2 cache, 32-byte lines
  - 400 MB/s bus speed
  - SSE2 provides peak of 4 Gflop/s

Locality and Parallelism
[Figure: Conventional Storage Hierarchy: processors, caches, L2 cache, L3 cache, memory.]
- Large memories are slow, fast memories are small
- Storage hierarchies are large and fast on average
- Parallel processors, collectively, have large, fast cache
  - The slow accesses to "remote" data we call "communication"
- Algorithm should do most work on local data

Intel Core Duo processor specifications
[Table: processor model, 2 MB L2 cache, clock speed (GHz), 667 MHz front-side bus; individual entries illegible.]

Amdahl's Law
Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl's Law are given below:

  $t_N = (f_p/N + f_s)\,t_1$    Effect of multiple processors on run time
  $S = 1/(f_s + f_p/N)$         Effect of multiple processors on speedup

Where:
  $f_s$ = serial fraction of code
  $f_p$ = parallel fraction of code $= 1 - f_s$
  $N$ = number of processors

Amdahl's Law: Theoretical Maximum Speedup of Parallel Execution
- speedup $= 1/(P/N + S)$
  - $P$ = parallel code fraction, $S$ = serial code fraction, $N$ = processors
- Example: image processing
  - 30 minutes of preparation (serial)
  - One minute to scan a region
  - 30 minutes of cleanup (serial)
[Table: number of cores vs. time and speedup; legible speedup entries: 1.0x, 1.7x.]
- Speedup is restricted by the serial portion, and speedup increases with a greater number of cores

Illustration of Amdahl's Law
It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.
[Figure: speedup vs. number of processors, with curves for $f_p$ = 1.000, 0.999, 0.990, and lower.]

Amdahl's Law vs. Reality
Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications and I/O will result in a further degradation of performance.
[Figure: speedup vs. number of processors, 0 to 250.]

Overhead of Parallelism
- Given enough parallel work, this is the biggest barrier to getting desired speedup
- Parallelism overheads include:
  - cost of starting a thread or process
  - cost of communicating shared data
  - cost of synchronizing
  - extra (redundant) computation
- Each of these can be in the range of milliseconds (= millions of flops) on some systems
- Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work

Load Imbalance
- Load imbalance is the time that some processors in the system are idle due to
  - insufficient parallelism (during that phase)
  - unequal size tasks
- Examples of the latter
  - adapting to "interesting parts of a domain"
  - tree-structured computations
  - fundamentally unstructured problems
- Algorithm needs to balance load
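
Worked sketch: Amdahl's Law in code
The two expressions for Amdahl's Law given in these notes (run time and speedup as functions of the serial fraction and the processor count) can be evaluated directly. The sketch below is only an illustration and is not part of the original lecture: the function names `amdahl_speedup` and `amdahl_run_time` are hypothetical, and the parallel fractions and processor counts in the demo loop are assumed values chosen to resemble the speedup plots. Communication and I/O costs are ignored, so the results are the theoretical upper limit the slides describe.

```python
"""Amdahl's Law: theoretical speedup for a given serial fraction (illustrative sketch)."""


def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Return S = 1 / (fs + fp / N), where fp = 1 - fs."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def amdahl_run_time(serial_fraction: float, processors: int, t1: float) -> float:
    """Return t_N = (fp / N + fs) * t_1, the run time on N processors."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return (parallel_fraction / processors + serial_fraction) * t1


if __name__ == "__main__":
    # Assumed demo values: parallel fractions and processor counts are illustrative.
    for fp in (1.000, 0.999, 0.990):
        fs = 1.0 - fp
        row = ", ".join(
            f"N={n}: {amdahl_speedup(fs, n):6.1f}" for n in (50, 100, 250)
        )
        print(f"fp={fp:.3f}  {row}")
```

For example, with a 1% serial fraction on 100 processors the speedup is 1/0.0199, roughly 50, and no processor count can push it past 1/0.01 = 100. This is the sense in which a small serial fraction degrades parallel performance.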