Architecture Of Parallel Computers
Architecture Of Parallel Computers ECE 506
Popular in Course
verified elite notetaker
verified elite notetaker
verified elite notetaker
Leading Social Change
verified elite notetaker
verified elite notetaker
verified elite notetaker
Popular in ELECTRICAL AND COMPUTER ENGINEERING
This 68 page Class Notes was uploaded by Miracle Jaskolski on Thursday October 15, 2015. The Class Notes belongs to ECE 506 at North Carolina State University taught by Edward Gehringer in Fall. Since its upload, it has received 13 views. For similar materials see /class/223892/ece-506-north-carolina-state-university in ELECTRICAL AND COMPUTER ENGINEERING at North Carolina State University.
Reviews for Architecture Of Parallel Computers
Report this Material
What is Karma?
Karma is the currency of StudySoup.
Date Created: 10/15/15
Parallel Programs 21 Why should we care about the structure of programs in an architecture class Knowing about them helps us make design decisions It led to key advances in uniprocessor architecture 0 Caches Instructionset design This is even more important in multiprocessors Why In our discussion of parallel programs we will proceed as follows Introduce motivating problems appication casestudies Describe the steps in creating a parallel program Show what a simple parallel program looks like in the three programming models and consider what primitives a system must support We will study these parallel applications Simulating ocean currents Discretize the problem on a set of regular grids and solve an equation on those grids 0 Common technique common communication patterns 0 Regular structure scientific computing Simulating the evolution of galaxies o No discretization of domain 0 Rather the domain is represented as a large number of bodies interacting with each other an nbody problem Lecture 5 Architecture of Parallel Computers 1 o Irregular structure unpredictable communication screntific computing Rendering scenes by ray tracing o Traverses a 3D scene with unpredictable access patterns and renders it into a 2dimensional image for display 0 lrregular structure computer graphics Data mining o lrregular structure information processing 0 HO intensive parallelizing lO important 0 Not discussed here read in book Simulating ocean currents Goal Simulate the motion of water currents in the ocean Important to climate modeling Motion depends on atmospheric forces friction with ocean floor amp friction with ocean walls Predicting the state ofthe ocean at any instant requires solving complex systems of equations The problem is continuous in both space and time but to solve it we discretize it over both dimensions Every important variable eg pressure velocity currents has a value at each grid point This model uses a set of 2D horizontal crosssections through the ocean basin Lecture 5 Architecture of Parallel Computers UUUUUUUUUU 0000000000 0000000000 0000000000 0000000000 0000000000 0000000000 0000000000 0000000000 0000000000 a Cross sections Equations of motion are solved at all the grid points in one timestep Then the state of the variables is updated based on this solution Then the equations of motion are solved for the next timestep Each timestep consists of several computational phases Values are setup A system of equations is solved All phases sweep through all points of the arrays and manipulate their values The more grid points we use to represent an ocean the finer the spatial resolution and the more accurate our simulation Simulating an ocean that is 7000 km across with 100 x 100 points gt 70 km between points Simulating 5 years with 5000 time steps means updating the state every 812 hrs The need for parallel processing is clear Simulating the evolution of galaxies What happens when galaxies collide How does a random collection of stars become a galaxy This involves simulating the motion of a number of bodies moving under forces exerted by all the bodies Lecture 5 Architecture of Parallel Computers 3 In each timestep Compute gravitational forces exerted on each star by all the others Update the position velocity and other attributes of the star A bruteforce approach to calculating interactions between stars would be 0 However smarter algorithms are able to reduce that to O making it possible to simulate systems of millions of stars They take advantage ofthe fact that the strength of gravitational attraction falls off with distance So the influences of stars far away don t need to be computed with such great accuracy Star on which forces are being computed o o 0 Large group far 0 O enough away to approximate Small group far enough away to Star too close to approximate to center of mass approximate We can approximate a group of faroff stars by a single star at the center of the group The strength of many physical forces falls off with distance so hierarchical methods are becoming increasingly popular Some galaxies are denser in some regions These regions are more expensive to compute with 0 Stars in denser regions interact with more other stars Lecture 5 Architecture of Parallel Computers o Ample concurrency exists across stars within a timestep but it is irregular and constantly changing gt hard to exploit Ray tracing Ray tracing is a common technique for rendering complex scenes into images 0 Scene is represented as a set of objects in 3D space 0 Image is represented as a 2D array of pixels 0 Scene is rendered as seen from a specific viewpoint 0 Rays are shot from that viewpoint through every pixel into the scene 0 Followtheir paths They bounce around as they strike objects They generate new rays ray tree per input ray 0 Result is color and opacity forthat pixel There is parallelism across rays The parallelization process 22 Sometimes a serial algorithm can be easily translated to a parallel algorithm Other times to achieve efficiency a completely different algorithm is required 221 Pieces ofthe job Identify work that can be done in parallel Partition work and perhaps data among processes Manage data access communication and synchronization Note Work includes computation data access and lO Lecture 5 Architecture of Parallel Computers Main goal Speedup plus low programming effort and low resource needs Tasks The first step is to divide the work into tasks 0 A task is an arbitrarily defined portion of the work done by the program o It is the smallest unit of concurrency that the program can exploit 0 Example In the ocean simulation it can be computations on a single grid point a row ofgrid points or any arbitrary subset of the grid 0 Example In the galaxy application it may be a Tasks are chosen to match some natural granularity in the work 0 lfthe grain is small the decomposition is called o If it is large the decomposition is called Partitioning l j l D A O M e O s I39 a c O Q s 13 P 3 O E e i p C n s n P1 0 O O m t 9 T s T Q T e a T r T T T i n a i co t g lt30 0 n CO sequential Tasks Processes Parallel Processors computation Program Processes A process or threadquot is an abstract entity that performs tasks 0 A program is composed of cooperating processes Lecture 5 Architecture of Parallel Computers o Tasks are assigned to processors Example In the ocean simulation an equal number of rows may be assigned to each processor Processes need not correspond 1to1 with processors Four steps in parallelizing a program 0 Decomposition of the computation into tasks 0 Assignment of tasks to processors 0 Orchestration of the necessary data access communication and synchronization among processes 0 Mapping of processes to processors Together decomposition and assignment are called partitioning They break up the computation into tasks to be divided among processes 0 Tasks may become available dynamically o The number of available tasks may vary with time Goal Enough tasks to keep processes busy but not too many The number of tasks available at a time is an upper bound on the achievable Amdahl s law If some portions of the problem don t have much concurrency the speedup on those portions will be low lowering the average speedup ofthe whole program Suppose that a program is composed of a serial phase and a parallel phase 0 The whole program runs for 1 time unit 0 The serial phase runs fortime s and the parallel phase for time 1 5 Lecture 5 Architecture of Parallel Computers 7 Then regardless of how many processors p are used the execution time ofthe program will be at least and the speedup will be no more that This is known as Amdahl s law For example if 25 ofthe program s execution time is serial then regardless of how many processors are used we can achieve a speedup of no more than Lecture 5 Architecture of Parallel Computers Macroarchitecture vs microarchitecture Microarchitecture is concerned with how processors and other components are put together Macroarchitecture is concerned with how processors and other components can be connected to do useful work This is a course in macroarchitecture Why parallel architecture In the early days of computing the best way to increase the speed of a computer was to use faster logic devices However the time is long past when we could rely on this approach to making computers faster As deviceswitching times grow shorter propagation delay becomes significant Logic signals travel at the speed of light approximately 30 cmnsec in a vacuum If two devices are one meter apart the propagation delay is approximately In 1960 switching speed was 10100 nsec Nowadays switching speed is typically measured in picoseconds Then how can we build faster computers The performance of highly integrated singlechip CMOS microprocessors is steadily increasing Lecture 1 Architecture of Parallel Computers 1 In fact these fast processors are nowthe best building blocks for multiprocessors So to get performance better than that provided by the fastest single processor we should figure out how to hook those processors together ratherthan rely on exotic circuit technologies and unconventional machine organizations Application trends Given a serial program it is usually not easy to transform it into an effective parallel program The measure of whether a parallel program is effective is how much better it performs than the serial version This is usually measured by speedup Given a fixed problem the speedup is measured by Speedupp processors E Time l processor Timep processors What kinds of programs require the performance only multiprocessors can deliver A lot of these programs are simulations Weather forecasting over several days Ocean circulation Evolution of galaxies Human genome analysis Superconductor modeling Parallel architectures are now the mainstay of scientific computing chemistry biology physics materials science etc Visualization is an important aspect of scientific computing as well as entertainment 2003 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2003 2 In the commercial realm parallelism is needed for online transaction processing and enterprise Webseners A good example of parallelization is given on pp 8 9 of Culler Singh and Gupta AMBER Assisted Model Building through Energy Refinement was used to simulate the motion of large biological models such as proteins and DNA The code was developed on Cray vector supercomputers and ported to the microprocessorbased lntel Paragon The initial program 894 achieved good speedup on small configurations only Loadbalancing between processors improved the performance considerably 994 Optimizing the communication turned it into a truly scalable application 1294 This example illustrates the interaction between application and architecture The application writer and the architect must understand each other s work Technology trends The most important performance gains derive from a steady reduction in VLSI feature size In addition the die size is also growing This is more important to performance than increases in the clock rate Why Lecture 1 Architecture of Parallel Computers 3 Clock rates have been increasing by about 30yr while the number of transistors has been increasing by about 40 However memory speed has lagged far behind From 1980 to 1995 the capacity of a DRAM chip increased 1000 times but the memory cycle time fell by only a factor of two This has led designers to use multilevel caches Microprocessor design trends The history of computer architecture is usually divided into four generations Vacuum tubes Transistors Integrated circuits VLSI Within the fourth generation there have been several subgenerations based on the kind of parallelism that is exploited The period up to z 1986 is dominated by advancements in bitlevel parallelism However this trend has slowed considerably How did this trend help performance Why did this trend slow 2003 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2003 4 The period from the mid1980s to mid1990s is dominated by advancements in instructionlevel parallelism Pipelines which we will describe in a few minutes made it possible to start an instruction in nearly every cycle even though some instructions took much longer than this to finish Today efforts are focused on tolerating latency Some operations eg memory operations take a long time to complete What can the processor do to keep busy in the meantime The Flynn taxonomy of parallel machines Traditionally parallel computers have been classified according to how many instruction and data streams they can handle simultaneously Single or multiple instruction streams Single or multiple data streams SISD machine An ordinary serial computer Comrel Instruction ALU Data unit stream stream At any given time at most one instruction is being executed and the instruction affects at most one set of operands data Lecture 1 Architecture of Parallel Computers 5 SIMD machine At the right is a diagram of an an array processor Several identical ALUs may process for example a whole array at once However the same instruc tions must be performed on all data items Instruction stream It is also possible for a single processor to perform the same instruction on a large set of data items In this case parallelism is achieved by pipelining one set of operands starts through the pipeline and before the computation is finished on this set of operands another set of operands starts flowing through the pipeline We will describe the organization of a pipeline in a few minutes MISD machine Several instructions operate simultaneously on each operand Instruction Generally unrealistic for l parallel computers stream 2003 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2003 6 MIMD machine Several complete processors Control Instruction ALL Data connected together to form a unit l stream stream l multiprocessor The processors are Cohttrgl 39 Stmdion ALU 2 Data connected together Via r Stream 2 Stream 2 an interconnection O 0 network to provide a 2 2 means of cooperating during the computation Control 39 Stmdion ALU n Data Uhlt n stream n stream n The processors need not be identical Can handle a greatervariety of tasks than an array processor The MOMS taxonomy of parallel machines The SlMDMIMD taxonomy leaves something to be desired since there are many subclasses of MIMD that do not appear in the model and one class MISD that appears in the model but not in real life Gustafson 1990 proposes the following taxonomy Operations Storage Monolithic Distributed This yields the following classifications MOMS Monolithic G operations monolithic storage Applepie computing m MODS Monolithic operations distributed data Lecture 1 Architecture of Parallel Computers 7 Control processor Examples Connection Machine CM1 CM2 MassPar DOMS Distributed operations monolithic storage Examples Sequent Balance and Symmetry BBN Butterfly Cray YMP DODS Distributed operations distributed storage Examples NCUBE Intel iPSC Meiko Computing Surface In addition to these methods of obtaining parallelism among processors there is this important approach to achieving parallelism Within a processor 2003 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2003 lDiraelinirig Parallelism is achieved by starting to execute one instruction before the previous one is finished The simplest kind overlaps the execution of one instruction with the fetch of the next instruction as on a RISC Because two instructions can be processed simultaneously we say that the pipeline has two stages Operations it Execution E1 E2 E3 E4 E5 E6 E7 E8 Instruction fetch F1 IF iF3 iF4 iF5 iF6 iF7 iF8 V 0 1 2 3 4 5 6 7 8 Time cycles Load and store reference memory so they take two cycles A pipeline may have more than two stages Suppose for example that an instruction consists of four phases 1 Instruction fetch 3 Operand fetch 2 Instruction decode 4 Execute In a nonpipelined processor these must be executed sequentially so that a result is only available each four pipeline cycles subcycles In a pipelined processor after a delay to load the pipeline a result is available each pipeline cycle Lecture 1 Architecture of Parallel Computers Pipeline stages Execute Operand fetch Instruction decode Instruction fetch I 1 Time pipeline cycles The type of pipelining described above achieves instructionlevel parallelism execution of multiple instructions in parallel It is also possible to use pipelining to achieve data parallelism A vector processor usually has a long pipeline and allows a large number of the same operations to take place concurrently Same operations different data A single processor may possess multiple pipelines allowing different operations to use different pipelines eg there might be a specialized addition pipeline and another load pipeline For example the CDC 6600 had ten separate functional units with a scoreboard to keep track of which was in use at any time 2003 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2003 10 10 Peripheral functional processors units 24 12 registers peripheral channels Instruction stack lO Subsystem Memory Central processor Branches are a problem for pipelined computers Execution of some instructions may take longer than others lfthere are two or more units capable of performing a given function eg multiplication then two operations ofthat type may be performed at once providing that Lecture 1 Architecture of Parallel Computers 11 Scalable Cache Coherence 81 All of the cachecoherent systems we have talked about until now have had a bus Not only does the bus guarantee serialization of transactions it also serves as convenient broadcast mechanism to assure that each transaction is propagated to all other processors caches Now we want to consider how cache coherence can be provided on a machine with physically distributed memory and no globally snoopable interconnect These machines provide a shared address space They must be able to satisfy a cache miss transparently from local or remote memory This replicates data How can it be kept coherent Scalable rietwo rk m m Switch Scalable distributed memory machines consist of PCM nodes connected by a network The communication assist interprets network transactions and forms the interface between the processor and the network Lecture 20 Architecture of Parallel Computers 1 A coherent system must do these things Provide set of states statetransition diagram and actions Manage coherence protocol 0 Determine when to invoke the coherence protocol a Find source of info about state of this block in other caches b Find out where the other copies are c Communicate with those copies invalidateupdate O is done the same way on all systems The state of the line is maintained in the cache The protocol is invoked if an access fault occurs on the line The different approaches to scalable cache coherence are distinguished by their approach to a b and c Busbased coherence In a busbased coherence scheme all of a b and c are done through broadcast on bus The faulting processor sends out a search Other processors respond to the search probe and take necessary action We could do this in a scalable network too broadcast to all processors and let them respond Why don t we Why not 2002 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2002 Approach 1 Hierarchical snooping Extend the snooping approach to a hierarchy of broadcast media The interconnection network is a not a bus but a tree of buses or rings eg KSRt The processors are the bus or ringbased multiprocessors at the leaves of the network Parents and children are connected by twoway snoopy interfaces Functions a through c are performed by a hierarchical extension of the broadcast and snooping mechanism A processor puts a search request on the bus It is propagated up and down the hierarchy as needed based on snoop results Problems Hierarchical snooping schemes are much less common than directory based schemes We won t consider them further for details see 8102 Approach 2 Directories The basic idea of a directorybased approach is this Every memory block has associated directory information it keeps track of copies of cached blocks and their states On a miss it finds the directory entry looks it up and communicates only with the nodes that have copies if necessary In scalable networks communication with directory and copies is through network transactions Lecture 20 Architecture of Parallel Computers Let us start off by considering a simple directorybased approach due to Tang 1976 It uses A central directory called a present table of which caches contain which mainmemory blocks Entry P i c tells whether the ith block is cached in the cth cache A central modi ed table that tells for each block of main memory whether the block is uptodate or whether some cache holds a more recent copy A local table for each cache that keeps track of whether each line in the cache is also cached somewhere else Until a block is modified it can be cached in any number of places Once it is modified only one cached copy is allowed to exist until it is written back to main memory Here is a diagram of the three tables block k I w I w I w o c U U U U E m g m E m u u r u r E g E g E 12 E g E g E E Cache line containing I I I I 1 O Cache number Block 0 Blocki P i 0 P i 1 Block N l Present table Modi ed table 2002 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2002 A cached block can be in one of three states RO readonly Several cached copies of the block exist but none has been modified since it was loaded from main memory EX exclusive readonly This is the only cached copy of the block but it hasn t been modified since it was loaded RW readwrite The block has been written since it was loaded from main memory This is the only cached copy of the block When a processor writes a block it changes its status to RW and invalidates the other cached copies if its previous status was RO Here is a flowchart of the coherence checks on a read reference Lecture 20 Architecture of Parallel Computers 5 Yes RO co exists Yes EX copy exists L39 r E R0 Fetch block i in local cache 0 Find blockjto replace lf block is RW MM e block39 0 300 py39of blo39ck i in a remote 39 cache r9 No copy exists Data fetch Yes RW copy exists MM e block39 r L39 r e RO Instruction fetch LA 7 lt L39 c e RO Cache block i in cache 0 M e 0 Send word to processor GT possible access to global table WB possible MM update MM main memory MM modified bit of block i in GT MM e block39 r update MM with Modified copy of i in cache r L39 r state of block i in cache r Here is a flowchart of the coherence checks on a write reference 2002 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2002 Store into block 1 in local cache 0 Find block j to replace GT WB If block j is RW MIVI lt block39 0 Yes one RW copy in cac e r Other copr yies of block 1 Blockli Status 39 RW in cache 39Lli CD 739 Write block 139 back to lnvalidate all copies of PI W and invalidate it block 1 in other caches in cache r Invalidate all copies of block 1 in other caches Fetch block 1 with RW access A Record block 1 as RW in GT Store block in cache 0 Pl Pure invalidation MIVI Main memory Operation of a simple directory scheme GT Possible access to global table WB Possible writeback to memory The scheme we have just discussed is called a centralized directory scheme because the directory is kept together in one place Let us now assume that the directory is distributed with each node holding directory information for the blocks it contains This node is called the home node for these blocks Lecture 20 Architecture of Parallel Computers What happens on a readmiss The requesting node sends a request transaction over the network to the home Requestor Read request to director Dilectory node for block Ode 2 c The home node responds with the identity of the 4a owner the node that currently holds a valid Reply copy of the block The requesting node then gets the data from the owner and revises the directory entry accordingly Node with lilty copy On a write Requestor 1 SW directory to dlrector iden es 2 COpIeS Of Reply with the bIOCk 1 sharersidentit and invali dation or Dimtorynode update messages 4b may be Inval ac sent to the copies 2002 Edward F Gehringer GEE CSCECE 506 Lecture Notes Spring 2002 Shaler One major difference from busbased schemes is that we can t assume What information will be held in the directory There will be a dirty bit telling if the block is dirty in some cache Not all state information MESI etc needs to be kept in the cache only enough to determine what actions to take Sometimes the state information in the directory will be out of date Why So sometimes a directory will send a message to the cache that is no longer correct when it is received Let us consider an example system Three stable cache states MSI Singlelevel caches One processor per node In the directory entry for each block a dirty bit and presence bits pi one for each processor b On a readmiss from main memory by processor i lf dirty then read from main memory turn pI on lfdirty then recall line from dirty processor setting cache state to shared update memory turn dirty off turn pi on supply recalled data to processor i b On a writemiss to main memory by processor i lf dirty then Lecture 20 supply line to i along with p vector send invalidations to all caches that have the block turn dirty on turn pi on other p bits off in directory entry Architecture of Parallel Computers lf dirty then recall line from dirty processor d change d s cache state to invalid supply line to i On the replacement of a dirty block by node i the data is written back to memory and the directory is updated to turn off dirty bit and pI On the replacement of a shared block the directory may or may not be updated How does a directory help It keeps track of which nodes have copies of a block eliminating the need for Would directories be valuable if most data were shared by most of the nodes in the system Fortunately the number of valid copies of data on most writes is small Scaling with number of processors 822 Scaling of memory and directory bandwidth provided 0 Centralized directory is bandwidth bottleneck just like centralized memory 0 How to maintain directory information in distributed way Scaling of performance characteristics 0 traffic of network transactions each time protocol is invoked o latency of network transactions in critical path each time 2002 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2002 10 Scaling of directory storage requirements 0 Number of presence bits needed grows as the number of processors Eg 32byte block size and 1024 processors How many bits in block vs of bits in directory How a directory is organized affects all these performance at a target scale as well as coherencemanagement issues Alternatives for organizing directories 823 When a miss occurs how do we find the directory information There are two main alternatives A flat directory scheme Directory information is in a fixed place usually at the home where the memory is located On a miss a transaction is sent to the home node A hierarchical directory scheme Directory information is organized as a tree with the processing nodes at the leaves Each node keeps track of which if any of its immediate children have a copy of the block When a miss occurs the directory information is found by traversing up the hierarchy level until the block is found in the appropriate state The state indicates eg whether copies of the block exist outside the subtree of this directory Directory Schemes Centralized Distributed How to nd source of directory information Flat Hierarchical How to locate copies Memorybased Cachebased Lecture 20 Architecture of Parallel Computers 11 How do flat schemes store information about copies Memorybased schemes store the information about all cached copies at the home node of the block The directory schemes we considered earlier are memory based Cachebased schemes distribute information about copies among the copies themselves 0 The home contains a pointer to one cached copy of the block 0 Each copy contains the identity of the next node that has a copy ofthe block The location of the copies is therefore determined through network transactions Main Memory Horn 6 Node 1 Node 2 9 When do hierarchical schemes outperform flat schemes Why might hierarchical schemes be slower than flat schemes 2002 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2002 12 Organizing a memorybased directory scheme All info about copies is colocated with the block itself at the home They workjust like centralized scheme except for being distributed Scaling of performance characteristics M Traffic on a write is proportional to number of sharers Latency Can issue invalidations in parallel Scaling of storage overhead Assume representation is a full bit vector Optimizations for full bitvector schemes Increase size reduces storage overhead proportionally Use multiprocessor nodes one bit per multiprocessor node not per processor still scales as pm but only a problem for very large machines 256 procs 4 per cluster 1288 line 625 ovhd b Reducing width addressing the p term observation most blocks cached by only few nodes don t have a bit per node but make entry contain a few p 1024 10bit pointers gt can use 100 pointers and still save space sharing patterns indicate a few pointers should suffice five or so need an overflow strategy when there are more sharers later Lecture 20 Architecture of Parallel Computers 13 b Reducing height addressing the mterm obsenation number of memory blocks gtgt number of cache lines most blocks will not be cached at any particular time therefore most directory entries are useless at any given time organize directory as a cache rather than having one entry per memory block Organizing a cachebased directory scheme In a cachebased scheme home only holds a pointer to the rest of the directory information The copies are lined together via a distributed list that weaves through caches Each cache tag has a pointer that points to the next cache with a copy On a read a processor adds itself to the head of the list communication needed On a write it makes itself the head node on the list then propagates a chain of invalidations down the list Each invalidation must be acknowledged On a writeback the node must delete itself from the list and therefore communicate with the nodes before and after it Disadvantages All operations require communicating with at least three nodes node that is being operated on previous node and next node Write latency is proportional to number of sharers Synchronization is needed Advantages Directory overhead is small Linked list records order that accesses were made making it easier to avoid stanation 2002 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2002 14 Work of performing invalidations can be distributed among sharers IEEE Scalable Coherent Interface has formalized protocols for handling cachebased directory schemes Summary Flat Schemes Issue a finding source of directory data go to home based on address Issue b finding out where the copies are memorybased all info is in directory at home cachebased home has pointer to first element of distributed linked list Issue c communicating with those copies memorybased pointtopoint messages perhaps coarser on overflow can be multicast or overlapped cachebased part of pointtopoint linked list traversal to find them serialized Hierarchical Schemes all three issues through sending messages up and down tree no single explict list of sharers only direct communication is between parents and children Lecture 20 Architecture of Parallel Computers 15 Using Relaxed Consistency Models CSampG discuss relaxed consistency models from two standpoints The system speci cation which tells how a consistency model works and what guarantees of ordering it provides The programmer s interface which tells what code needs to be written to invoke operations provided by the system specification The system specification 911 CSampG divide relaxed consistency models into three classes based on the kinds of reorderings they allow Write a read reordering These models only allow a write to bypass complete before an incomplete earlier read Examples Total store ordering processor consistency Write a write reordering These models also allow writes to bypass previous writes Example Partial store ordering All reorderings These models also allow reads and writes to bypass previous reads Example Weak ordering release consistency Write a read reordering Models that allow reads to bypass pending writes are helpful because they help hide the latency of write operations While a writemiss is still in the write buffer and not yet visible to other processors the processor can perform reads that hit in its cache or a single read that misses in its cache These models require minimal changes to programs For example no change is needed to the code for spinning on a flag P1 P2 A 1 while flag 0 f1ag 1 print A Lecture 24 Architecture of Parallel Computers 1 In this and later code fragments we assume all variables are initialized to 0 This is because the model does not permit writes to be reordered In particular the write to will not be allowed to complete out of order with respect to the write to flag What would write gt read reordering models guarantee about this fragment P1 P2 A 1 print B B 1 print A Recall from Lecture 23 that processor consistency does not require that all processors see writes in the same order if those writes come from different processors Under processor consistency PC what value will be printed by the following code P1 P2 P3 A 1 while A0 while B0 B 1 print A However this problem does not occur with total store ordering T80 T80 provides Store atomicity there is a global order defined between all writes A processor s own reads and writes are ordered with respect to itself x 3 print X will print3 2002 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2002 The processor issuing the write may observe it sooner then the other processors Special membar instructions are used to impose global ordering between write gt read Now consider this code fragment P1 P2 1a A 1 2a B 1 1b print B 2b print A With sequential consistency what will be printed How do we know this Might something else happen under PC What about under TSO To guarantee SC semantics when desired we need to use the membar or fence instruction This instruction prevents a following read from completing before previous writes have completed If an architecture doesn t have a membar instruction an atomic read modifywrite operation can be substituted for a read Why do we know this will work Such an instruction is provided on the Sun Sparc V9 Write a write reordering What is the advantage of allowing write gt write reordering Lecture 24 Architecture of Parallel Computers 3 Well the write buffer can merge or retire writes before previous writes in program order complete What would it mean to merge two writes Multiple write misses to different locations can thus become fully overlapped and visible out of program order What happens to our first code fragment under this model P1 P2 1a A 1 2a while flag 0 1b B 1 2b u A 10 flag 1 20 v B In order to make the model work in general we need an instruction that enforces writetowrite ordering when necessary On the Sun Sparc V9 the membar instruction has a writetowrite flavor that can be turned on to provide this Where would we need to insert this instruction in the code fragment above Relaxing all program orders Models in this class don t guarantee that anything becomes visible to other processors in program order This allows multiple read and write requests to be outstanding at the same time and complete out of order This allows read latency to be hidden as well as write latency In earlier models writes couldn t bypass earlier reads so everything in a process had to wait for a pending read to finish 2002 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2002 These models are the only ones that allow optimizing compilers to use many key reorderings and eliminate unnecessary accesses Two models and three implementations fall into this category Weak ordering Release consistency Digital Alpha implementation Sparc V9 Relaxed Memory Ordering RMO IBM PowerPC implementation Exercise What categories do the other models from Lecture 23 fit into Causal consistency PRAM consistency and processor consistency We have already seen the two models in Lecture 23 To get a better appreciation for the differences between the two let s consider the two examples from CSampG Here is a piece of code that links a new task onto the head of a doubly linked list It may be executed in parallel by all processors so some sort of concurrency control is needed lock tast newTask gtnext head if head NULL head gtprev newTask head new I ask unlock tast In this code tast serves as a synchronization variable If accesses to it are kept in order and accesses by a single processor are kept in program order then it will serve to prevent two processes from manipulating the queue simultaneously Lecture 24 Architecture of Parallel Computers 5 This example uses a flag variable to implement a producerconsumer interaction P1 P2 TOP while flag2 0 TOP while flagl 0 A1 xA uB yD VC B3 D B C C D B flag2 0 flagl 0 flagl 1 flag2 1 goto TOP goto TOP Which shared variables are produced by P1 Which shared variables are produced by P2 What is the synchronization variable here Again as long as one access to it completes throughout the system before the next one starts other reads and writes can complete out of order and the program will still work The diagram below similar to the one at the end of Lecture 23 illustrates the difference between weak ordering and release consistency Each block with reads and writes represents a run of non synchronization operations from a single processor With weak ordering before a synchronization operation the processor waits for all previous Also a synchronization operation has to complete before later reads and writes can be issued Release consistency distinguishes between an acquire performed to gain access to a critical section and a release performed to leave a critical section 2002 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2002 6 An acquire can be issued before previous reads in block 1 complete After a release is issued instructions in block 3 can be issued without waiting for it to complete readwrite readwrite readwrite I 1 readwnte acquire read sync readwrite readwrite readwrite readwrite V release write readwrite readwrlte 3 readwrite readwrite Release consistency Weak ordering Modern processors provide instructions that can be used to implement these two models The Alpha architecture supports two fence instructions The memory barrier MB operation is like a synch operation in R0 The write memory barrier WMB operation imposes program order only between writes A read issued after a WMB can bypass complete before a write issued before a WMB Lecture 24 Architecture of Parallel Computers 7 The Sparc V9 relaxed memory order RMO provides an membar fence with four flavor bits Each flavor bit guarantees enforcement between a certain combination of previous amp following memory operations read to read read to write write to read write to write The PowerPC provides only a single sync fence instruction it can only implement WO The programmer s interface When writing code for a system that uses a relaxed memory consistency model the programmer must label synchronization operations This can be done using systemspecified programming primitives such as locks and barriers For example how are lock operations implemented in release consistency A lock operation translates to An unlock operation translates to Arrival at a barrier indicates that previous operations have completed so it is Leaving a barrier indicates that new accesses may begin so it is What about accesses to ordinary variables like the flag variables earlier in this lecture Sometimes the programmerwill just know how they should be labeled But if not here s how to determine it 2002 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2002 8 1 l 00 Decide which operations are con icting Two memory operations from different processes con ict if they access the same memory location and at least one of them is a write Decide which conflicting operations are competing Two conflicting operations from different processes compete if they could appear next to each other in a sequentially consistent total order This means that one could immediately follow another with no intenening memory operations on shared data Label all competing memory operations as synchronization operations of the appropriate type Notice that we can decide where synchronization operations are needed by assuming SC even if we are not using an SC memory model To see how this is done consider our first code fragment again P1 P2 1a A 1 2a while flag 0 1b B 1 2b u 10 flag 1 20 v B Which operations are conflicting Which operations are competing Which operations need to be labeled as synchronization operations How should these operations be labeled in weak ordering How should these operations be labeled in release consistency Lecture 24 Architecture of Parallel Computers Parallelization of an Example Program 23 In this lecture we will consider a parallelization of the kernel of the Ocean application Goas Illustrate parallel programming in a lowlevel parallel language Clike pseudocode with simple extensions for parallelism Expose basic communication and synchronization primitives that must be supported This is the state of most real parallel programming today The serial program The equation solver soles a PDE on a grid using a nitedifferencing method It operates on a regular 2D grid of n2 by n2 elements The border rows amp columns contain boundary elements that do not change The interior nbyn points are updated starting from their initial values Expression for updating each interior point Aij 02 x Ai39j Ai39j 1 Ai 1j Ai39j 1 Ai1j 0000000000 0000000000 0000000000 00 0960000 0000000000 0000000000 0000000000 0000000000 0000000000 Lecture 7 Architecture of Parallel Computers 1 The old value at each point is replaced by the weighted average of itself and its 4 nearestneighbor points Updates are done from left to right top to bottom 0 The update computation for a point sees the new values of points above and to the left and o the old values of points below and to the right This form of update is called the GaussSeidel method During each sweep the solver computes how much each element changed since the last sweep lfthis difference is less than a tolerance parameter the solution has converged If so exit solver if not do another sweep Here is the code for the solver 1 int n size ofmatrix n 2byn 2 e1ements 2 float A diff O 3 main 4 begin 5 read n read input parameter matrix size 6 A malloc a 2 d array of size n 2 by n 2 doubles 7 initial ize A initia1ize the matrix A somehow 8 Solve A call the routine to solve equation 9 end main 10 procedure Solve A s01ve the equation system 11 float A A is an n 2byn 2 array 12 begin 13 int i j done O 14 float diff 0 temp 15 while ldone do outermost loop over sweeps 16 diff 0 initia1ize maximum difference to 0 17 for i lt 1 to n do sweep over nonborder points of grid 18 forjlt ltondo 19 temp Aij save old value ofelement 20 Aij lt 02 Aij Aij l Ai lj 21 Aij1 Ai1j compute average 22 diff absAij temp 23 end for 24 end for 25 if diffnn lt TOL then done l 26 end while 27 end procedure 2003 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2003 Decomposition A simple way to identify concurrency is to look at loop iterations We do dependence analysis of sweeps if there s not enough concurrency then we look further Is there much concurrency in this example Can we perform more than one sweep concurrently Next we examine fundamental dependencies ignoring loop structure Note that Computation proceeds from left to right and top to bottom Thus to compute a point we use 0 the updated values from the point above and the point to the left but 0 the old values of the point itself and its neighbors below and to the right Here is a diagram that illustrates the dependencies The horizontal and vertical Z lines with arrows indicate 2 dependencies gt x The dashed lines along the g3 antidiagonal connect points g with no dependencies that can c be computed in parallel Ofthe om work in each sweep 3 concurrency propor tional to n along antidiagonals Er gtn How could we exploit this parallelism Lecture 7 Architecture of Parallel Computers We can leave loop structure alone and let loops run in parallel inserting synchronization ops to make sure a value is computed before it is used Is this a good idea We can change the loop structure making 0 the outer for loop line 17 iterate over antidiagonals and o the innerfor loop line 18 iterate over elements within an antidiagonal Is this a good idea Note that according to the GaussSeidel algorithm we don t have to update the points from left to right and top to bottom It is just a convenient way to program on a uniprocessor We can compute the points in another order as long as we use updated values frequently enough if we don t the solution will only converge more slowly Redblack ordering Let s divide the points into alternating red and black points 0 o o o o o o o o o oooooooooo Redpoim Blackpoint o o o o 3 o o o o o oo 00 o o o o 5 o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o 2003 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2003 4 To compute a red point we don t need the updated value of any other red point But we need the updated values of 2 black points And similarly for computing black points Thus we can divide each sweep into two phases First we compute all red points Then we compute all black points True we don t use any updated black values in computing red points But we use all updated red values in computing black points Whether this converges more slowly or faster than the original ordering depends on the problem But it does have important advantages for parallelism How many red points can be computed in parallel How many black points can be computed in parallel Redblack ordering is effective but it doesn t produce code that can fit on a TV screen A simpler decomposition Another ordering that is simpler but still works reasonably well is just to ignore dependencies between grid points within a sweep An sweepjust updates points based on their nearest neighbors regardless of whetherthe neighbors have been updated yet Global synchronization is still used between sweeps however Now execution is no longer deterministic the number of sweeps needed and the results may depend on the number of processors used Lecture 7 Architecture of Parallel Computers 5 But for most reasonable assignments of processors the number of sweeps will not vary much Let s look at the code for this 15 while idone do asequentia1 loop 16 diff O 17 forall i e 1 to n do aparallel loop nest 18 forall j e 1 to n do 19 temp Aij 20 Aij e 02 Aij Aij 1 Ai 1j 21 Aij1 Ai1j 22 diff absAij temp 23 end forall 24 end forall 25 if diffnn lt TOL then done 1 26 end while The only difference is that for has been replaced by fora A fora just tells the system that all iterations can be executed in parallel With fora in both loops all n2 iterations of the nested loop can be executed in parallel Whether or not they are assigned and orchestrated in this way is up to the system the program doesn t ordain it We could write the program so that the computation of one row of grid points must be assigned to a single processor How would we do this With each row assigned to a different processor each task has to access about 2n grid points that were computed by other processors meanwhile it computes n grid points itself So the communicationtocomputation ratio is O 2003 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2003 Assignment How can we statically assign rows to processes One option is block assignment Rowi p is assigned to process LipJ p1 p2 000000000 OOOOOOOOOOOO OIOOIOICOOOO 000000000 OOOOOOOOOOOO OIOOIOCCOOOO OIOOOOCOOOOO OOOOOOOOOOOO OIOOOOOOIOOO OOOOOOOOOOOO p3 Another option is cyclic assignment Process i is assigned rows i ip i2p etc We could instead use dynamic assignment where a process gets an index works on the row then gets a new index etc Static assignment of rows to processes reduces concurrency But block assignment reduces communication by assigning adjacent rows to the same processor How many rows now need to be accessed from other processors So the communicationtocomputation ratio is now only 0 Orchestration Once we move on to the orchestration phase the computation model affects our decisions Dataparallel model In the code below we assume that global declarations are used for shared data and that any data declared within a procedure is private Lecture 7 Architecture of Parallel Computers 7 Global data is allocated with gmaoc Differences from sequential program fora loops decomp statement mydiffvariable private to each process reduce statement 1 int n nprocs grid size n2gtltn2 and of processes 2 float A diff O 3 main 4 begin 5 read n read nprocs 7read input grid size and of processes 6 A 6 GMALLOC a 2 d array of size n2 by n2 doubles 7 initialize A initialize the matrix A somehow 8 Solve A call the routine to solve equation 9 end main 10 procedure SolveA solve the equation system 11 float Z A is an n2gtltn2 array 12 begin 13 int i j done 0 14 float mydiff O temp 14a DECOMP ABLOCK nprocs 15 while ldone do outermost loop over sweeps 16 mydiff 0 initialize maximum difference to O 17 forall i lt 1 to n do sweep over nonborder points ofgrid 18 forall j 6 l to n do 19 temp A i j save old value of element 20 Aij e 02 Aij Aij 1 Ai 1j 21 Aijl Ai1j compute average 22 mydiff absAij temp 23 end forall 24 end forall 24a REDUCE mydiff diff ADD 25 if diffnn lt TOL then done 26 endwhile The decomp statement has a twofold purpose lt specifies the assignment of iterations to processes 2003 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2003 8 The first dimension rows is partitioned into nprocs contiguous blocks The second dimension is not partitioned at all Specifying CYCLIC nprocs would have caused a cyclic partitioning of rows among nprocs processes Specifying CYCLIC nprocs would have caused a cyclic partitioning of columns among nprocs processes Specifying BLOCK BLOCK nprocs would have implied a 2D contiguous block partitioning lt specifies the assignment ofgrid data to memories on a distributedmemory machine Follows ownercomputes rule The mydiffvariable allows local sums to be computed The reduce statement tells the system to add together all the mydiff variables into the shared diffvariable SAS model In this model we Processes need i l l l mechan isms to Solve Solve Solve Solve create processes 1 1 3 and manage them Sweep After we create the processes I Test Convergence they interact as l l shown on J j the right Lecture 7 Architecture of Parallel Computers 9 1 int n nprocs matrix dimension and number of processors to be used 2a float A diff A is global shared array representing the grid diffis global shared maximum difference in current 2b LOCKDEC diffilock declaration oflock to enforce mutual exclusion 2c BARDEC barl quotbarrier declaration for global synchronization between sweeps 3 main 4 begin 5 readn readnprocs read input matrix size and number of processes 6 A e GiMALLOC a twoedimensional ray e n2 by n2 doubles 7 initialize A initialize A in an unspeci ed way 8a CREATE nprocs l Solve A 8 Solve A main process becomes a worker too 8b WAIT FORiEND nprocs l wait for all child processes created to terminate 9 end main lO procedure SolveA ll float A is entire n2 by n2 shared array as in the sequential program 12 begin 13 int ij pid done 0 14 float temp mydiff private variables 14a int mymin 1 pid nnprocs a sumethatn is exactly divis ble by 14b int mymax mymin nnprocs nprocs for simplicity here l outer loop over all diagonal 1 Whlijdi fiofgii 0 Mammal diff toO okay for all to do it 16a BARRIERhar1 nprocs ensure all reach here before anyone modi es difl l l7 39n to mymax do quotfor each of my rows 18 for j e l to n do quotfor all nonborder elements in that row 19 temp A i 20 Aij 02 Alij Aijel Alielj 21 Aijl Ailj 22 mydiff abs Alij t temp 23 endfor 25 Egglicggiffilock update global diff if necessary 25b diff m diff Eigig t1ior cs ensure all reach here before checking if done 25a if diff nn lt TOL then done 1 check convergence all get same answer 25f BARRIER harl nprocs endwhi e 27 end procedure What are the main differences between the serial program and this program The first process creates nprocs l worker processes All n processes execute Solve All processes execute the same code the SPMD model But all do not execute the same instructions at the same time Private variables like mymin and mymax are used to control loop bounds All processors need to 2003 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2003 10 0 complete an iteration before any process tests for convergence 0 test for convergence before any process starts the next iteration Notice the use of barrier synchronization to achieve this Locks must be placed around updates to diff so that no two processors update it at once Othenvise inconsistent results could ensue 21 22 rl e diff p1 gets 0 in its r1 r1 e diff p2 also gets 0 rl e rlr2 p1 sets its 11 to 1 r1 e rlr2 p2 sets its 11 to 1 diff e rl p1 sets diff to 1 diff e rl p2 also sets diffto 1 If we allow only one processor at a time to access diff we can avoid this race condition What is one performance problem with using locks Note that at least some processors need to access diff as a nonlocal variable What is one technique that our SAS program uses to diminish this problem of serialization Lecture 7 Architecture of Parallel Computers 11 Messagepassing model The program for the messagepassing model is also similar but again there are several differences There s no shared address space so we can t declare array A to be shared Instead each processor holds the rows ofA that it is working on The subarrays are of size nprocs2 2 x n 2 This allows each subarray to have a copy of the boundary rows from neighboring processors Why is this done These ghost rows must be copied explicitly via send and receive operations Note that send is not synchronous that is it doesn t make the process wait until a corresponding receive has been executed What problem would occur if it did Since the rows are copied and then not updated by the processors they have been copied from the boundary values are more outofdate than they are in the sequential version of the program This may or may not cause more sweeps to be needed for convergence The indexes used to reference variables are local indexes not the real indexes that would be used if array A were a single shared array 2003 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2003 12 1 int pid n b process id matrix dimension and number of processors to be used 2 float myA 3 main 4 begin 5 read n read nprocs read input matrix size and number of processes 8a CREATE nprocsl Solve 8b Solve main process becomes a worker too 8c WAITiFORiEND nprocs l Wait for all child processes created to terminate 9 end main 10 procedure Solve 11 begin 13 int ij pid n39 nnprocs done O 14 float temp tempdiff mydiff 0 privatevariables 6 myA malloc a 2 d array of size nnprocs 2 by n2 my assigned rows of A 7 initial ize myA initialize my rows of A in an unspecified way 15 while ldone do 16 mydiff 0 set local diffto 0 16a if pid 0 then SENDampmyA10 nsizeoffloatpid 1ROW 16b if pid nprocs l then SENDampmyAn O nsizeof float pid1ROW 16c if pid 0 then RECEIVEampmyA00 nsizeoffloat pid 1ROW 16d if pid nprocs l then RECEIVE ampmyA n 1 0 nsizeof float pid1ROW border rows of neighbors have now been copied into myA0 and myAn l 17 for i lt 1 to 11 do for each ofmy nonghost rows 18 for j lt 1 to n do for allnonborder elementsinthatrow 19 temp myAij 20 myAij 02 myAij myAij 1 myAi 1j 21 myAij1 myAi1j 22 mydiff absmyAij temp 23 endfor 24 endfor communicate local diff values and determine if done can be replaced by reduction and broadcast 25a if pid 0 then process 0 holds global total di 25b SENDmydiffsizeoffloat ODIFF 25c RECEIVEdonesizeof int ODONE 25d else pid 0 does this 25e for i lt 1 to nprocs 1 do for each other process 25f RECEIVE tempdiff sizeof float DIFF 25g mydi ff tempdiff accumulate into total 25h endfor 25i if mydiffnn lt TOL then done 1 25j for i lt 1 to nprocs 1 do for each other process 25k SENDdonesizeofint iDONE 25l endfor 25m endif 26 endwhile 27 end procedure Lecture 7 Architecture of Parallel Computers 13 Alternatives for send and receive send and receive operations may be either SendReceive Synchronous As n ronous Blocking asynch Nonblocking asynch synchronous the program can continue past the send receive only when the corresponding receive send has completed and returned an ack to the sender receiver In this case once the operation completes both processes know that the message has been transferred successfully blocking asynchronous the sender can continue when the message has been taken out ofthe source processor s buffer and is therefore in the care of the system This allows the senderto continue and to modify the data in its buffer before the receiver has finished receiving it After the receive is complete the sender receives an ack A receive returns control after the information has been removed from the buffer in put in the receiver s address space nonbocking asynchronous a send completes immediately and a receive returns control after specifying the intent to receive This allows the greatest concurrency but also means that the program isn t guaranteed that the operations will complete successfully So the program needs to check for successful completion later The various forms of send and receive trade off for 2003 Edward F Gehrihger CSCECE 506 Lecture Notes Spring 2003 14 Buffering Problems 725 Certain challenges arise in realizing SAS or message passing programming models Two of these are inputbuffer overflow and fetch deadlock Inputbuffer overflow Suppose a large number of processors send messages to the same module in rapid succession lfthe module has a fixed buffer size it may overflow Something has to be done to prevent this Lecture 18 Make input buffers large and reserve a portion for each source called a credit A source will transmit to a module only if it the source has space free in the module s buffer This requires the receiving module to notify the sender in some way when space is available eg acknowledge or reply to transaction Destination can refuse incoming transactions when input buffer is full This leads to backpressure Explain When backpressure reaches the sources it causes the sources to slow down to the point that the destination module can handle the incoming messages Other messages in the system are also delayed Architecture of Parallel Computers 1 Shuffle Shuffle Shuffle However deadlock does not occur in a reliable network Destination can NACK transactions if buffer is full Then the sender has to try again Either the destination informs the sender over a special acknowledgment path or the source times out as in Ethernet Fetch Deadlock For a network to remain deadlock free nodes must continue accepting messages even when they cannot initiate their own messages The incoming transaction may generate a request which cannot be initiated when the network is full What happens ifthe node s internal buffer fills up Three approaches Provide two logicaly independent requestreply networks These may be physical networks or virtual channels with separate inputoutput queues In either case responses can be initiated 2003 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2003 Limit of outstanding transactions and reserve input buffer space Assume there is a limit of k outstanding requests per processor Resehe space for kP 1 requests k responses per node NA CK when input buffer is full We can assume space is free at the destination of the NACK because it uses the space reserved for the response Eventually some request will succeed and the system will make progress Summary Challenges in realizing program models in the large Oneway transfer of information No global knowledge nor global control Very large number of concurrent transactions Management of input buffer resources Many sources can issue a request and overcommit destination before any see the effect Latency is large enough that you are tempted to take risks optimistic protocols large transfers dynamic allocation Many many more degrees of freedom in design and engineering of these systems Design Space for Communication Architectures 726 A network transaction is a oneway transfer of information that causes an action to occur at the destination The source s communication assist CA formats the transaction and causes it to be routed through the network Lecture 18 Architecture of Parallel Computers 3 The CA at the destination must interpret the transactions and cause the appropriate actions to take place Scalable Output Input processmg processing checks mmunication checks translatlon translation formattl g Node buffering scheduling action Key design issues How much interpretation of the message How much dedicated processing in the CA If processing is not done in the CA it must be done in the node In order of increasing hardware support and specialization the options for processing in the CA are these None Physical bit stream blind physical DMA nCUBE iPSC Usersystem Userlevel port CM5 T Userlevel handler JMachine Monsoon Remote virtual address Processing translation Paragon Meiko CS2 Global physical address Processor memory controller RP3 BBN T3D Cachetocache Cache controller Dash KSR Flash 2003 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2003 Physical DMA Most early messagepassing machines used designs where no interpretation was placed on the data within a network transaction This allows for very simple hardware and very general communication abstraction What is the downside Data V 4 I It DMA channels A I I A t 39gt Cmd Status interrupt o r 39 I The communication assist merely deposits the transaction into storage whereupon it will be interpreted by the processor Memory DMA is controlled by registers generates interrupts Physical addresses are used The sender constructs a system envelope Sending normally requires a trap to the OS 9 autquotquot around user data in kernel area dest addr Receive must receive into system buffer since no interpretation in CA Message arrival normally generates an interrupt so privileged software can inspect the messages and either process it or deliver it to the appropriate user process Lecture 18 Architecture of Parallel Computers One way to reduce overhead is to allow userlevel access to the DMA device This can be done by setting up the user s virtual address space to include the region of HO space containing the devicecontrol registers Why does this reduce overhead What is one problem with it Machines that use this kind of CA tend to support the message passing abstraction directly in the kernel The arrival of a network transaction causes an interrupt The process ID and tag are parsed and action is taken as described in the previous lecture The nCUBE2 used this kind of network interface Input ports Output ports H ooo A Switch Addr Ehl nels IEiE il mm IEiE il A ll 7 Memory bus 2003 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2003 6 Data is forwarded from its source to its destination through intermediate nodes The switch automatically takes a transaction destined for another node and copies it from the input port to the correct output port lfthe transaction is destined for the local node the input DMA is activated and the message is placed in memory The vendor s messagepassing library had a 150us startup cost Using an Active Message implementation von Eicken was able to reduce this to 13 us 16 instrs 18 mem refs 260 cycles to insert a message into the network 15 us 18 instrs 26 mem refs 300 cycles to take a message out of the network Userlevel messages What do you think is the biggest overhead in processing messages in the network we have just described How can we avoid this overhead With userlevel access a system message causes an interrupt so the system can extract it from the network a userlevel message can sit in the input queue until the userlevel process reads it from the network How can this be implemented Remember the userlevel messages need to be written without OS intenention What kind of instructions should be used What does this tell us about where the input and output queues should be Lecture 18 Architecture of Parallel Computers 7 Network transactions are initiated by reading and Virtual address space 1 writing the ports plus sawmill 39 checking the status register p Netinput N t39 th t h M quot P r rocessor 0 ice a eac message Status I now contains a usersystem flag as shown below i Usersystem39 Data H Best i I i Status interrupt In this design communication primitives allow a portion of the process state to be in the network Ifthe program is swapped out the messages destined for its processes need to be swapped out too When it resumes they can be reinserted into the network orthe destination input queues An example of this architecture is the Thinking Machines CM5 In the CM5 it is possible to insert a fiveword message into the network in 15 us 50 cycles and read one out in 16 us 2003 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2003 Some experimental machines have made the network input and output ports into machine registers These include the Manchester Dataflow Machine iWARP Monsoon and the JMachine What is the advantage of this This makes the CA into a functional unit in the processor In iWARP two registers are bound to the head ofthe network input and output ports The processor can access the message word by word as it streams in from the network Or a message can be placed into memory by the DMA controller The processor specifies the tag of a message it wants to access via the port registers Other messages go into memory Dedicated message processing In this organization a dedicated communication or message processor operates directly on the network interface It can do the protocol processing associated with messages Lecture 18 Architecture of Parallel Computers 9 It can also support a global address space where the CP performs remote reads on behalf of the requesting node In this design the interpretation of messages is not performed in hardware The generalpurpose processor performs arbitrary output processing at system level The generalpurpose processor interprets incoming network transactions at system level User processor and message processor share memory Message processor lt gt message processorvia system network transaction Both compute processor and message processor reside on the memory bus Alternatively the message processor can be embedded into the network interface see Fig 720 p 498 The MP provides the compute processor with a very clean abstraction of the network interface All the details of the physical network are hidden inputoutput buffers status registers routing A message can be sent by simply writing it or a pointerto it into shared memory When received data can be deposited directly into the user address space User processor stores cmd msg data into the shared output queue It must still check for output queue full or make it elastic Communication assists make transaction happen They do checking translation scheduling transport interpretation This protocol is divided between two layers as shown below 2003 Edward F Gehringer CSCECE 506 Lecture Notes Spring 2003 10 Mem Eml p User System Each network transactions flows through memory or at least across the memory bus in a cachetocache transfer between the compute processor and the message processor lt crosses the memory bus again between the CP and the network interface An example of this is the Intel Paragon architecture 1992 Each node is a sharedmemory multiprocessor with two or more 50MHz i860XP processors a NI chip 16 32 MB of memory The processors are connected by a 400MBs cache coherent memory bus Two DMA engines can transfer a contiguous block of data between main memory and the NI chip at 400 MBs Small messages of seven words can be transferred between two compute processors in about 10 us in three relatively equal steps compute proc gt MP gt MP gt compute proc Shared physical address space In this style multiple processors share physical memory However access latencies may differ to different parts ofthe memory Lecture 18 Architecture of Parallel Computers 11