Comp Ethics/Intellect Property (COSC 594)
This 36-page set of class notes was uploaded by Ms. Taryn Marquardt on Monday, October 26, 2015. The notes belong to COSC 594 at the University of Tennessee - Knoxville, taught by Staff in Fall. Since its upload, it has received 46 views. For similar materials see /class/229870/cosc-594-university-of-tennessee-knoxville in Computer Science at the University of Tennessee - Knoxville.
CS 594, Spring 2006, Lecture 4: Overview of High-Performance Computing
Jack Dongarra, Computer Science Department, University of Tennessee

Top 500 Computers
> Listing of the 500 most powerful computers in the world
> Yardstick: Rmax from LINPACK (Ax = b, dense problem, TPP performance)
> Updated twice a year: at the SC conference in the States in November, and at the meeting in Germany in June

What is a Supercomputer?
> A hardware and software system that provides close to the maximum performance that can currently be achieved
> Over the last 12 years the range covered by the Top500 has increased faster than Moore's Law:
  > 1993: #1 = 59.7 GFlops, #500 = 422 MFlops
  > 2005: #1 = 280 TFlops, #500 = 1.64 TFlops
> Why do we need them? Almost all of the technical areas that are important to the well-being of humanity use supercomputing in fundamental and essential ways: computational fluid dynamics, protein folding, climate modeling, and national security (in particular cryptanalysis and simulating nuclear weapons), to name a few

Architecture/Systems Continuum
> Tightly coupled: custom processor with custom interconnect (Cray X1, NEC SX, IBM Regatta, IBM BlueGene/L)
> Hybrid: commodity processor with custom interconnect (SGI Altix)
> Loosely coupled: commodity processor with commodity interconnect - clusters (Pentium, Itanium, Opteron, Alpha with GigE, Infiniband, Myrinet, Quadrics), NEC TX7, IBM eServer, Dawning
[Chart: Top500 architecture share, 1993-2005 - SIMD, single processor, SMP, MPP, constellations, clusters; clusters now dominate]
> Cluster: commodity processors and commodity interconnect
> Constellation: # of processors per node >= # of nodes in the system

Performance Development
[Chart: #1, #500, and total performance of the Top500 on a log scale, 1993-2005, from 100 Mflops up past 100 Tflops, topped by BlueGene/L]

Top 10 (November 2005):
  1. IBM BlueGene/L (eServer Blue Gene), 280.6 TFlops - DOE/NNSA/LLNL, USA, 131072 procs
  2. IBM BGW (eServer Blue Gene), 91.29 TFlops - IBM Thomas Watson, USA, 40960 procs
  3. IBM ASC Purple (Power5 p575), 63.39 TFlops - DOE/NNSA/LLNL, USA, 10240 procs
  4. SGI Columbia (Itanium 2, Infiniband), 51.87 TFlops - NASA Ames, USA, 10160 procs
  5. Dell (Pentium, Infiniband), 38.27 TFlops - Sandia, USA, 8000 procs
  6. Cray Red Storm (XT3), 36.19 TFlops - Sandia, USA, 10880 procs
  7. NEC Earth Simulator (SX-6), 35.86 TFlops - Earth Simulator Center, Japan, 5120 procs
  8. IBM MareNostrum (PPC 970, Myrinet), 27.91 TFlops - Barcelona Supercomputer Center, Spain, 4800 procs
  9. IBM eServer Blue Gene, 27.45 TFlops - University of Groningen, Netherlands, 12288 procs
 10. Cray Jaguar (XT3), 20.53 TFlops - Oak Ridge National Lab, USA, 5200 procs

IBM BlueGene/L
> 131072 processors; a total of 18 Blue Gene systems, all in the Top100
> 1.6 MWatts (roughly 1600 homes); its performance works out to about 43,000 ops/s for every person on the planet
> Packaging hierarchy:
  > System: 64 racks (64x32x32) - 131072 processors, 180/360 TFlops, 16/32 TB DDR
  > Rack: 32 node boards (8x8x16) - 2048 processors, 2.9/5.7 TFlops, 256/512 GB DDR
  > Node board: 16 compute cards, 32 chips (4x4x2) - 64 processors, 90/180 GFlops, 8/16 GB DDR
  > Compute card: 2 chips (2x1x1) - 4 processors, 5.6/11.2 GFlops, 0.5/1 GB DDR
  > Chip: 2 processors - 2.8/5.6 GFlops, 4 MB cache

Fastest Computer: BlueGene/L, 700 MHz, 131K processors
> The compute-node ASICs include all networking and processor functionality; each compute ASIC includes two 32-bit superscalar PowerPC 440 embedded cores (note that L1 cache coherence is not maintained between these cores)
> Peak 367 TFlops; Linpack 281 TFlops (the run takes about 13K seconds, roughly 3.6 hours)

Performance Projection
[Chart: extrapolating the Top500 trend lines from 1993 out to 2015, on a scale from 100 Mflops to 1 Eflops]

Customer Segments
[Chart: share of Top500 performance by customer segment (government, industry, academia), 2000-2005]

Processor Types
[Chart: processor families in the Top500 over time - SIMD, Sparc, vector, MIPS, Alpha, HP, AMD, IBM Power, Intel]
> Processors used in the 500 systems (November 2005): Intel about 66% in total (IA-32 41%, EM64T 16%, IA-64 9%), AMD x86-64 11%, IBM Power 15%, plus small shares for Cray, NEC, HP PA-RISC, HP Alpha, Sun Sparc, and Hitachi SR8000

Interconnects
[Chart: interconnect families in the Top500 - Gigabit Ethernet 249 systems, Myrinet 101 systems, plus Infiniband, Quadrics, crossbar, SP Switch, Cray interconnect, and others]

Concurrency Levels of the Top500
[Chart, 1993-2005: number of processors per system, binned from 257-512 up to 64k-128k]
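The BlueGene/L peak rating can be cross-checked against the packaging hierarchy above with a little arithmetic. This is a sketch: the 4 flops/cycle/core figure is an assumption about the PPC 440's double FPU (two fused multiply-adds per cycle); the other numbers come from the slides.

```python
# Cross-check BlueGene/L's 367 Tflops peak from the packaging hierarchy.
CLOCK_HZ = 700e6             # 700 MHz clock, from the slide
CHIPS = 64 * 32 * 32         # 64 racks x 32 node boards x 32 chips
CORES_PER_CHIP = 2           # two PowerPC 440 cores per compute ASIC
FLOPS_PER_CYCLE = 4          # assumption: double FPU, two FMAs per cycle

processors = CHIPS * CORES_PER_CHIP
peak_flops = processors * FLOPS_PER_CYCLE * CLOCK_HZ

print(processors)                  # 131072, matching the Top500 entry
print(round(peak_flops / 1e12))    # 367 Tflops, matching the slide
```

The same multiplication reproduces the per-chip figure: 4 flops/cycle x 2 cores x 0.7 GHz = 5.6 Gflops per chip, as in the hierarchy.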
Flops per Gross Domestic Product / KFlops per Capita
[Charts based on the November 2005 Top500: Flops per unit of GDP, and KFlops per capita ("Flops/Pop"), by country]
> New Zealand's high per-capita ranking is due to WETA Digital (Lord of the Rings); it has nothing to do with the 47.2 million sheep in NZ

Fuel Efficiency: GFlops/Watt
[Chart: GFlops per Watt for the Top 20 systems]

Top500 Conclusions
> Microprocessor-based supercomputers have brought a major change in accessibility and affordability
> MPPs continue to account for more than half of all installed high-performance computers worldwide

Distributed Systems vs. Massively Parallel Systems
> Distributed systems (heterogeneous):
  > Gather unused resources; steal cycles
  > System software manages resources and adds value
  > 10-20% overhead is OK
  > Resources drive applications
  > Time to completion is not critical
  > Time shared
> Massively parallel systems (homogeneous):
  > Bounded set of resources
  > Apps grow to consume all cycles
  > Application manages resources
  > System software gets in the way; 5% overhead is the maximum
  > Apps drive the purchase of equipment
  > Real-time constraints
  > Space shared

Virtual Environments
[Slide content not recoverable from the scan]
Do they make any difference?

Performance Improvements for Scientific Computing Problems
[Charts, 1970-1995: overall performance improvements, and the portion derived from computational methods]

Different Architectures
> Parallel computing: single systems with many processors working on the same problem
> Distributed computing: many systems loosely coupled by a scheduler to work on related problems
> Grid computing: many systems tightly coupled by software, perhaps geographically distributed, to work together on single problems or on related problems

Types of Parallel Computers
> The simplest and most useful way to classify modern parallel computers is by their memory model: shared memory or distributed memory

Shared vs. Distributed Memory
> Shared memory: a single address space; all processors have access to a pool of shared memory (e.g., SGI Origin, Sun E10000)
> Distributed memory: each processor has its own local memory; processors must do message passing to exchange data (e.g., CRAY T3E, IBM SP, clusters)

Shared Memory: UMA vs. NUMA
> Uniform memory access (UMA): each processor has uniform access to memory; such systems are also known as symmetric multiprocessors, SMPs (Sun E10000)
> Non-uniform memory access (NUMA): the time for a memory access depends on the location of the data; local access is faster than non-local access; easier to scale than SMPs (SGI Origin)

Distributed Memory: MPPs vs. Clusters
> Processor-memory nodes are connected by some type of interconnect network
> Massively Parallel Processor (MPP): tightly integrated, single system image
> Cluster: individual computers connected by software

Processors, Memory & Networks
> Both shared- and distributed-memory systems have:
  1. processors (now generally commodity RISC processors)
  2. memory (now generally commodity DRAM)
  3. a network/interconnect between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)
> We will now begin to describe these pieces in detail, starting with definitions of terms

Interconnect-Related Terms
> Latency: how long it takes to start sending a "message"; measured in microseconds. (Also applies within processors: how long it takes to get the result of a pipelined operation such as a floating-point add or divide.)
> Bandwidth: the data rate that can be sustained once the message is started; measured in Mbytes/sec
> Topology: the manner in which the nodes are connected. The best choice would be a fully connected network (every processor connected to every other), but that is unfeasible for cost and scaling reasons. Instead, processors are connected in some variation of a mesh, torus, or hypercube, e.g., a 3-d hypercube, a 2-d mesh or grid, or a 2-d torus.

Highly Parallel Supercomputing: Where Are We?
> Performance: sustained performance has dramatically increased during the last year, and on most applications sustained performance per dollar now exceeds that of conventional supercomputers. But conventional systems are still faster on some applications.
> Languages and compilers: standardized, portable, high-level languages and libraries such as HPF, PVM and MPI are available. But the initial HPF releases are not very efficient, message-passing programming is tedious and hard to debug, and programming difficulty remains a major obstacle to usage by mainstream scientists.
> Operating systems: robustness and reliability are improving. But reliability is still not as good as on conventional systems.
> I/O subsystems: new RAID disks, HiPPI interfaces, etc. provide substantially improved I/O performance. But I/O remains a bottleneck on some systems.

The Importance of Standards: Software
> Writing programs for MPPs is hard, but it is a one-off effort if they are written in a standard language
> The past lack of parallel programming standards restricted uptake of the technology to "enthusiasts" and reduced portability, both across current architectures and between future generations
> Now standards exist (PVM, MPI & HPF), which allow users and manufacturers to protect their software investment, and encourage the growth of a "third party" parallel software industry and parallel versions of widely used codes

The Importance of Standards: Hardware
> Processors: commodity RISC processors
> Interconnects: high-bandwidth, low-latency communication protocols; no de facto standard yet (ATM, Fibre Channel, HiPPI, FDDI)
> Growing demand for a total solution: robust hardware plus usable software
> HPC systems should contain all the programming tools, environments, languages, libraries, and application packages found on desktops

The Future of HPC
> The expense of being different is being replaced by the economics of being the same
> HPC needs to lose its "special purpose" tag
> It still has to deliver on the promise of scalable general-purpose computing, but it is dangerous to ignore this technology
> Final success: when MPP technology is embedded in desktop computing
> Yesterday's HPC is today's mainframe is tomorrow's workstation

Achieving TeraFlops
> In 1991: 1 Gflops; reaching 1 Tflops required a 1000-fold increase, from:
  > Architecture: exploiting parallelism
  > Processor, communication, memory: Moore's Law
  > Algorithm improvements: block-partitioned algorithms

Future: PetaFlops (10^15 flops/s)
> A Pflop/s for 1 second is roughly a typical workstation computing for 1 year
> From an algorithmic standpoint: concurrency, data locality, latency & synchronization, floating-point accuracy, dynamic redistribution of workload, new languages and constructs, the role of numerical libraries, and algorithm adaptation to hardware failure

A PetaFlops Computer System
> 1 Pflops sustained computing
> Between 10,000 and 1,000,000 processors
> Between 10 TB and 1 PB of main memory
> Commensurate I/O bandwidth, mass store, etc.
> If built today it would cost $40 B and consume 1 TWatt
> May be feasible and "affordable" by the year 2010

Question
> Suppose we want to compute s = 1.000 + 1.000x10^4 - 1.000x10^4 using four-decimal-digit arithmetic. What's the answer?
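The four-decimal-digit question can be tried out directly with Python's decimal module. This is a minimal sketch assuming left-to-right evaluation, with the context precision set to four significant digits:

```python
from decimal import Decimal, getcontext

getcontext().prec = 4        # four significant decimal digits per operation

# s = 1.000 + 1.000e4 - 1.000e4, evaluated left to right:
s = Decimal("1.000") + Decimal("1.000E+4")   # exact sum 10001 rounds to 1.000E+4
s = s - Decimal("1.000E+4")                  # leaves 0, not the exact answer 1
print(s == 0)    # True

# Grouping the other way keeps the answer exact:
t = Decimal("1.000") + (Decimal("1.000E+4") - Decimal("1.000E+4"))
print(t == 1)    # True
```

The two groupings disagree because rounding makes floating-point addition non-associative, which is the point of the question.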
Defining Floating Point Arithmetic
> Representable numbers: scientific notation +/- d.ddd... x r^exp
  > sign bit
  > radix r (usually 2, sometimes 10 or 16)
  > significand d.ddd... (many base-r digits)
  > exponent exp (some range)
  > others (zero, infinity, NAN; see below)
> Operations:
  > arithmetic: +, -, x, /, ... (how to round a result to fit the format)
  > comparison: <, =, >
  > conversion between different formats (short to long FP numbers, FP to integer)
  > exception handling: what to do for 1/0, 2 x largest_number, etc.
  > binary/decimal conversion: for I/O, when the radix is not 10

IEEE Floating Point Arithmetic Standard 754 (1985): Normalized Numbers
> Normalized numbers: +/- 1.d...d x 2^exp
> macheps = machine epsilon = 2^-(number of significand bits) = relative error in each operation
> OV = overflow threshold = largest number; UN = underflow threshold = smallest normalized number
> Formats:
  Single:          32 bits, 23+1 significand bits, macheps = 2^-24 (~1e-7),  8 exponent bits, range ~10^(+-38)
  Double:          64 bits, 52+1 significand bits, macheps = 2^-53 (~1e-16), 11 exponent bits, range ~10^(+-308)
  Double Extended: >=80 bits, >=64 significand bits, macheps <= 2^-64 (~1e-19), >=15 exponent bits, range ~10^(+-4932)
> Double Extended is 80 bits on all Intel machines
> +/- Zero: sign bit, with significand and exponent all zero (why bother with -0? later)

IEEE 754: Denorms
> Denormalized numbers: +/- 0.d...d x 2^exp_min (sign bit, nonzero significand, minimum exponent)
> They fill in the gap between UN and 0
> Underflow exception: occurs when an exact nonzero result is less than the underflow threshold UN (e.g., UN/3); return a denorm or zero
[Figure: the number line around the underflow threshold, showing normalized and denormalized numbers on either side of zero]

IEEE 754: Infinity
> Infinity: sign bit, zero significand, maximum exponent
> Overflow exception: occurs when an exact finite result is too large to represent accurately (e.g., 2 x OV); return infinity
> Divide-by-zero exception: nonzero/0 returns infinity; 1/+0 = +infinity, 1/-0 = -infinity (the sign of zero is important)
> Also return infinity for 3 + infinity, 2 x infinity, infinity x infinity; the result is exact, not an exception

IEEE 754: NAN (Not A Number)
> NAN: sign bit, nonzero significand, maximum exponent
> Invalid exception: occurs when the exact result is not a well-defined real number:
  > 0/0, sqrt(-1), infinity - infinity, infinity/infinity, 0 x infinity
  > NAN + 3, NAN > 3? Return a NAN in all these cases
> Two kinds of NANs:
  > Quiet: propagates without raising an exception
  > Signaling: generates an exception when touched (good for detecting uninitialized data)

Error Analysis
> Basic error formula: fl(a op b) = (a op b)(1 + d), where
  > op is one of +, -, x, /
  > |d| <= macheps
  > assuming no overflow, underflow, or divide by zero
> Example: adding 4 numbers left to right:
  fl(x1+x2+x3+x4) = {[(x1+x2)(1+d1) + x3](1+d2) + x4}(1+d3)
                  = x1(1+d1)(1+d2)(1+d3) + x2(1+d1)(1+d2)(1+d3) + x3(1+d2)(1+d3) + x4(1+d3)
                  = x1(1+e1) + x2(1+e2) + x3(1+e3) + x4(1+e4), where each |ei| <= 3 macheps
> So we get the exact sum of slightly changed summands xi(1+ei)
> Backward Error Analysis: an algorithm is called numerically stable if it gives the exact result for slightly changed inputs; numerical stability is an algorithm design goal

Backward Error
> The approximate solution is the exact solution to a modified problem
> How large a modification to the original problem is required to give the result actually obtained?
> How much data error in the initial input would be required to explain all the error in the computed result?
> An approximate solution is good if it is the exact solution to a "nearby" problem
[Figure: x mapping to the true f(x) and the computed F(x), illustrating forward error vs. backward error]

Sensitivity and Conditioning
> A problem is insensitive, or well-conditioned, if a relative change in the input causes a commensurate relative change in the solution
> A problem is sensitive, or ill-conditioned, if the relative change in the solution can be much larger than that in the input data
> cond = |relative change in solution| / |relative change in input data|
       = |[f(x') - f(x)] / f(x)| / |(x' - x) / x|
> A problem is sensitive (ill-conditioned) if cond >> 1
> When a function f is evaluated at an approximate input x' = x + h instead of the true input x:
  > absolute error = f(x + h) - f(x) ~= h f'(x)
  > relative error = [f(x + h) - f(x)] / f(x) ~= h f'(x) / f(x)
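The summation error formula above can be watched in action in double precision. A sketch: the input is chosen so that the backward error bound holds while the forward (relative) error is still terrible, illustrating that stability and conditioning are different things.

```python
import sys

# Left-to-right summation: fl(x1+x2+x3+x4) = sum of xi*(1+ei), |ei| <= 3*macheps.
x = [1.0, 1e16, -1e16, 1.0]

s = 0.0
for xi in x:
    s = s + xi               # each + commits a rounding error bounded by macheps

exact = 2.0                  # the true sum
macheps = sys.float_info.epsilon / 2   # unit roundoff 2^-53 for double

# Backward-error bound implied by the formula: |s - exact| <= 3*macheps*sum|xi|
bound = 3 * macheps * sum(abs(xi) for xi in x)
print(s)                         # 1.0 -- half the true value
print(abs(s - exact) <= bound)   # True: tiny backward error, huge relative error
```

The computed 1.0 is the exact sum of slightly perturbed inputs (1e16 + 1 rounds back to 1e16), so the algorithm is stable; the problem itself is ill-conditioned because the summands nearly cancel.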
Sensitivity: 2 Examples (cos(x) near pi/2, and a 2-d system of equations)
> Consider the problem of computing the cosine function for arguments near pi/2
> Let x ~= pi/2 and let h be a small perturbation to x. Then:
  > absolute error = cos(x + h) - cos(x) ~= -h sin(x) ~= -h
  > relative error ~= -h tan(x) ~= infinity
> So a small change in x near pi/2 causes a large relative change in cos(x), regardless of the method used:
  cos(1.57079) = 0.63267949 x 10^-5
  cos(1.57078) = 1.63267949 x 10^-5
> The relative change in the output is a quarter million times greater than the relative change in the input
> The second example is the 2-d linear system a x1 + b x2 = f, c x1 + d x2 = g [worked on the slide]

Example: Polynomial Evaluation Using Horner's Rule
> Horner's rule to evaluate p(x) = sum over k of c_k x^k:
    p = c_n
    for k = n-1 down to 0:  p = x*p + c_k
> Numerically stable: it computes the exact value of a polynomial with slightly changed coefficients
> Apply it to p(x) = (x - 2)^9 = x^9 - 18 x^8 + ... - 512, evaluated around x = 2
[Plots: near x = 2 the expanded form evaluated by Horner's rule oscillates noisily around zero because of cancellation, while the factored form (x - 2)^9 is smooth]
> We can compute error bounds for the evaluation using fl(a op b) = (a op b)(1 + d)

Exception Handling
> What happens when the "exact value" is not a real number, or is too small or too large to represent accurately?
> The 5 exceptions:
  > Overflow: exact result > OV, too large to represent
  > Underflow: exact result nonzero and < UN, too small to represent
  > Divide by zero: nonzero/0
  > Invalid: 0/0, sqrt(-1), ...
  > Inexact: you made a rounding error (very common!)
> Possible responses:
  > Stop with an error message (unfriendly, not the default)
  > Keep computing (the default, but how?)
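The Horner example above is easy to reproduce. A sketch: the expanded coefficients of (x - 2)^9 are generated by repeated polynomial multiplication rather than typed in, so the setup is exact integer arithmetic and only the evaluation near the root suffers cancellation.

```python
# Horner's rule: p = c[n]; for k = n-1 down to 0: p = x*p + c[k]
def horner(coeffs, x):
    """Evaluate sum(coeffs[k] * x**k) with Horner's rule."""
    p = coeffs[-1]
    for c in reversed(coeffs[:-1]):
        p = x * p + c
    return p

# Build the coefficients of (x - 2)**9 by multiplying by (x - 2) nine times.
coeffs = [1]                          # the polynomial "1"
for _ in range(9):
    coeffs = [0] + coeffs             # shift: multiply by x
    for i in range(len(coeffs) - 1):
        coeffs[i] -= 2 * coeffs[i + 1]  # subtract 2 * (old polynomial)

print(coeffs[9], coeffs[8], coeffs[0])   # 1 -18 -512, matching the slide

# Near the root x = 2, cancellation makes the expanded form noisy,
# even though Horner's rule is backward stable:
x = 2.001
print(horner(coeffs, x))      # noisy, nowhere near the true value
print((x - 2) ** 9)           # ~1e-27
```

Away from the root the evaluation is fine (for example, at x = 4 it returns exactly 2^9 = 512); the trouble near x = 2 comes from the conditioning of the expanded representation, not from the algorithm.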
Summary of Values Representable in IEEE FP
> +/- Zero
> Normalized nonzero numbers
> Denormalized numbers
> +/- Infinity
> NANs (signaling and quiet; many systems have only quiet)

Figure 2: Hypotenuse of a right-angled triangle
> Computing z = sqrt(x^2 + y^2) directly can overflow in x^2 even when z itself is representable
> Assuming x and y are non-negative: let a = max(x, y) and b = min(x, y); then z = a sqrt(1 + (b/a)^2) if a > 0, and z = 0 if a = 0

Hazards of Parallel and Heterogeneous Computing
> What new bugs arise in parallel floating-point programs?
> Ex 1: Non-repeatability; makes debugging hard
> Ex 2: Different exception handling; can cause programs to hang
> Ex 3: Different rounding (even on IEEE FP machines); can cause hanging or wrong results with no warning
> See the LAPACK Working Notes at www.netlib.org/lapack/lawns
> IBM RS6K and Java

Types of Parallel Computers
> Again: the simplest and most useful way to classify modern parallel computers is by their memory model, shared memory or distributed memory

Standard Uniprocessor Memory Hierarchy
> Intel Pentium 4 (Prescott, Socket 478):
  > 8 Kbytes of 4-way assoc. L1 instruction cache with 32-byte lines
  > 8 Kbytes of 4-way assoc. L1 data cache with 32-byte lines
  > 256 Kbytes of 8-way assoc. L2 cache with 32-byte lines
  > 400 MB/s bus speed
  > SSE2 provides a peak of 4 Gflops

Shared Memory vs. Local Memory
> We usually think in terms of the hardware; what about a software model?
> How about something that works like cache: logically shared memory

Parallel Programming Models
> Control: how is parallelism created? what orderings exist between operations? how do different threads of control synchronize?
> Naming: what data is private vs. shared? how is logically shared data accessed or communicated?
> Set of operations: what are the basic operations? what operations are considered to be atomic?
> Cost: how do we account for the cost of each of the above?

Trivial Example: s = sum over i of f(A[i])
> Parallel decomposition: each evaluation f(A[i]) and each partial sum is a task
> Assign n/p numbers to each of the p processors
  > each computes independent "private" results and a partial sum
  > one (or all) collects the p partial sums and computes the global sum
Classes of Data (for the trivial example)
> Logically shared: the original n numbers, the global sum
> Logically private: the individual function evaluations (and what about the individual partial sums?)

Programming Model 1: Shared Memory
> The program consists of a collection of threads of control
> Each thread has a set of private variables (e.g., local variables on the stack)
> Collectively the threads have a set of shared variables (e.g., static variables, global heap)
> Threads communicate implicitly by writing and reading shared variables
> Threads coordinate explicitly by synchronization operations on shared variables (writing and reading flags, locks, semaphores)
> Like concurrent programming on a uniprocessor

Machine Model 1
> A shared memory machine: processors all connected to a large shared memory
> "Local" memory is not usually part of the hardware (e.g., the Sun, DEC, and Intel SMPs - symmetric multiprocessors - in Millennium; SGI Origin)
> Cost: it is much cheaper to access data in cache than in main memory
> Machine model 1a, a Shared Address Space machine: replace the caches by local memories in the abstract machine model; this affects the cost model, since repeatedly accessed data should be copied (e.g., Cray T3E)

Shared memory code for computing the sum f(A[0]) + ... + f(A[n-1]) with two threads:

    Thread 1                            Thread 2
    [s = 0 initially]                   [s = 0 initially]
    local_s1 = 0                        local_s2 = 0
    for i = 0, n/2 - 1:                 for i = n/2, n - 1:
        local_s1 = local_s1 + f(A[i])       local_s2 = local_s2 + f(A[i])
    s = s + local_s1                    s = s + local_s2

> What could go wrong?

Pitfall, and a solution via synchronization
> Pitfall in computing the global sum s = local_s1 + local_s2:

    Thread 1 (s = 0 initially)          Thread 2 (s = 0 initially)
    load s from mem to reg              load s from mem to reg
    s = s + local_s1   [in reg]         s = s + local_s2   [in reg]
    store s from reg to mem             store s from reg to mem

> Instructions from different threads can be interleaved arbitrarily. What can the final result s stored in memory be? This is a race condition.
> Possible solution: mutual exclusion with locks

    Thread 1                            Thread 2
    lock                                lock
    load s                              load s
    s = s + local_s1                    s = s + local_s2
    store s                             store s
    unlock                              unlock

> Locks must be atomic (execute completely without interruption)

Programming Model 2: Message Passing
> The program consists of a collection of named processes
  > a thread of control plus a local address space (local variables, static variables, common blocks, heap)
> Processes communicate by explicit data transfers: matching send and receive pairs between source and destination processes
> Coordination is implicit in every communication event
> Logically shared data is partitioned over the local processes
> Like distributed programming; program with standard libraries: MPI, PVM

Machine Model 2
> A distributed memory machine (Cray T3E, IBM SP2, clusters)
> Processors are all connected to their own memory and caches, and cannot directly access another processor's memory
> Each node has a network interface (NI); all communication and synchronization is done through the interconnect

Computing s = x(1) + x(2) on each processor
> First possible solution:

    Processor 1 (xlocal = x(1))         Processor 2 (xlocal = x(2))
    send xlocal, proc2                  receive xremote, proc1
    receive xremote, proc2              send xlocal, proc1
    s = xlocal + xremote                s = xlocal + xremote

> Second possible solution; what could go wrong?

    Processor 1 (xlocal = x(1))         Processor 2 (xlocal = x(2))
    send xlocal, proc2                  send xlocal, proc1
    receive xremote, proc2              receive xremote, proc1
    s = xlocal + xremote                s = xlocal + xremote

> What if send/receive act like the telephone system (a send blocks until the matching receive is ready, so the second solution deadlocks)? What if they act like the post office (sends are buffered and both solutions complete)?

Programming Model 3: Data Parallel
> A single sequential thread of control consisting of parallel operations
> Parallel operations are applied to all (or a defined subset) of a data structure
> Communication is implicit in parallel operators and "shifted" data structures
> Elegant, and easy to understand and reason about, but not all problems fit this model
> Like marching in a regiment. Example: A = array of all data; fA = f(A); s = sum(fA). Think of Matlab.

Machine Model 3a: Vector Computing
> One instruction is executed across all the data in a pipelined fashion
> Otherwise the same properties as the data parallel model above

Machine Model 3b: SIMD
> A SIMD (Single Instruction Multiple Data) machine: a large number of small processors, with a single "control processor" issuing each instruction
  > each processor executes the same instruction
  > some processors may be turned off on any instruction
> The machines are no longer popular (e.g., CM2), but the programming model lives on: the n-fold parallelism is mapped to p processors, mostly in compilers (HPF, High Performance Fortran)

Machine Model 4: Clusters of SMPs
> Since small shared memory machines (SMPs) are the fastest commodity machines, why not build a larger machine by connecting many of them with a network?
> CLUMP = Cluster of SMPs: shared memory within one SMP, message passing outside
> Clusters: ASCI Red (Intel)
> Programming model:
  > Treat the machine as "flat": always use message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy)
  > Expose two layers: shared memory (OpenMP) and message passing (MPI); higher performance, but ugly to program
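The shared-memory race condition from the slides above can be made concrete. A sketch in Python: the bad load/add/store interleaving is simulated deterministically (a real race depends on scheduling and would not reproduce reliably), and the lock fix is then shown with actual threads.

```python
import threading

# --- Deterministic simulation of the bad interleaving from the slide ---
# Both threads load s (= 0) before either stores, so one update is lost.
s = 0
reg1 = s                 # thread 1: load s from mem to reg
reg2 = s                 # thread 2: load s from mem to reg
reg1 = reg1 + 10         # thread 1: s = s + local_s1 (in reg)
reg2 = reg2 + 20         # thread 2: s = s + local_s2 (in reg)
s = reg1                 # thread 1: store s from reg to mem
s = reg2                 # thread 2: store s from reg to mem
print(s)                 # 20 -- thread 1's contribution was lost

# --- Mutual exclusion with a lock makes the update atomic ---
total = 0
lock = threading.Lock()

def add_partial(local_s):
    global total
    with lock:           # lock ... load, add, store ... unlock
        total = total + local_s

threads = [threading.Thread(target=add_partial, args=(v,)) for v in (10, 20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total)             # 30 -- both partial sums accounted for
```

Other interleavings of the simulated steps give 10 or 30, which is exactly the slide's point: without the lock the final value of s depends on scheduling.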