Architecture of Parallel Computers (CSC/ECE 506)
Lecture 20 Notes: Scalable Cache Coherence
Edward F. Gehringer, Spring 2002
Scalable Cache Coherence [§8.1]

All of the cache-coherent systems we have talked about until now have had a bus. Not only does the bus guarantee serialization of transactions; it also serves as a convenient broadcast mechanism to assure that each transaction is propagated to all the other processors' caches.

Now we want to consider how cache coherence can be provided on a machine with physically distributed memory and no globally snoopable interconnect. These machines provide a shared address space, so they must be able to satisfy a cache miss transparently from local or remote memory. This replicates data. How can it be kept coherent?

[Figure: a scalable network connecting processor-cache-memory (P-C-M) nodes.]

Scalable distributed-memory machines consist of P-C-M nodes connected by a network. The communication assist interprets network transactions and forms the interface between the processor and the network.

A coherent system must do these things:

- Provide a set of states, a state-transition diagram, and actions.
- Manage the coherence protocol:
  (0) Determine when to invoke the coherence protocol.
  (a) Find the source of information about the state of this block in other caches. This may or may not require communication with the other cached copies.
  (b) Find out where the other copies are.
  (c) Communicate with those copies (invalidate/update).

(0) is done the same way on all systems: the state of the line is maintained in the cache, and the protocol is invoked if an access fault occurs on the line. The different approaches to scalable cache coherence are distinguished by their approach to (a), (b), and (c).

Bus-based coherence. In a bus-based coherence scheme, all of (a), (b), and (c) are done through broadcast on the bus. The faulting processor sends out a "search"; the other processors respond to the search probe and take the necessary action.

We could do this in a scalable network too: broadcast to all processors, and let them respond. Why don't we? Broadcast doesn't scale well: for each transaction, a message must reach every one of the p processors, so the number of messages in the system grows with the number of processors. Bus bandwidth doesn't scale either; on a scalable network, every fault would lead to at least p network transactions.

Approach 1: Hierarchical snooping. Extend the snooping approach to a hierarchy of broadcast media. The interconnection network is not a bus but a tree of buses or rings (e.g., the KSR-1). The processors are in the bus- or ring-based multiprocessors at the leaves of the network. Parents and children are connected by two-way snoopy interfaces. Functions (a) through (c) are performed by a hierarchical extension of the broadcast-and-snooping mechanism: a processor puts a search request on its bus, and it is propagated up and down the hierarchy as needed, based on snoop results.

Problems: There will be a bottleneck at the root, and the latency goes up as the number of levels goes up, since we need a snoop and a lookup at each level. Hierarchical snooping schemes are much less common than directory-based schemes; we won't consider them further (for details, see §8.10.2).

Approach 2: Directories. The basic idea of a directory-based approach is this: every memory block has associated directory information, which keeps track of the copies of cached blocks and their states. On a miss, the protocol finds the directory entry, looks it up, and communicates only with the nodes that have copies (if necessary). In scalable networks, communication with the directory and the copies is through network transactions.
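Before looking at concrete schemes, it can help to view functions (a) through (c) as the interface that any scalable coherence scheme must implement; bus snooping, hierarchical snooping, and directories are then just different implementations of the same three operations. The C sketch below is purely illustrative and is not from the notes: the type and function names and the fixed copy-set size are invented for the example.

```c
/* A block address and a small set of node IDs holding copies
 * (the 64-node limit is an assumption for this sketch).       */
typedef unsigned long block_t;
typedef struct { int node[64]; int count; } copy_set_t;

typedef enum { COH_INVALIDATE, COH_UPDATE } coh_action_t;

/* Interface every scalable coherence scheme must provide.
 * Step (0), deciding when to invoke the protocol, is the same
 * everywhere: the cache invokes these on an access fault.      */
typedef struct {
    /* (a) Find the source of information about the block's state. */
    int  (*find_state_source)(block_t b);
    /* (b) Find out where the other copies are.                    */
    void (*locate_copies)(block_t b, copy_set_t *out);
    /* (c) Communicate with those copies (invalidate or update).   */
    void (*communicate)(block_t b, const copy_set_t *copies,
                        coh_action_t action);
} coherence_scheme_t;
```

A bus-based scheme implements all three operations with a single broadcast; the approaches that follow replace that broadcast with more scalable mechanisms.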
Let us start off by considering a simple directory-based approach due to Tang (1976). It uses:

- A central directory, called a present table, of which caches contain which main-memory blocks. Entry P[k, c] tells whether the kth block is cached in the cth cache.
- A central modified table that tells, for each block of main memory, whether the block is up to date or whether some cache holds a more recent copy.
- A local table for each cache that keeps track of whether each line in the cache is also cached somewhere else.

Until a block is modified, it can be cached in any number of places. Once it is modified, only one cached copy is allowed to exist until it is written back to main memory.

[Figure: the three tables. The present table has one entry P[k, c] per (block, cache) pair for blocks 0 through N-1; the modified table has one bit per block; and each cache has a local table for the lines it contains.]

A cached block can be in one of three states:

- RO (read-only): Several cached copies of the block exist, but none has been modified since it was loaded from main memory.
- EX (exclusive read-only): This is the only cached copy of the block, but it hasn't been modified since it was loaded.
- RW (read-write): The block has been written since it was loaded from main memory. This is the only cached copy of the block.

When a processor writes a block, it changes its status to RW and invalidates the other cached copies if its previous status was RO.

[Figure: flowchart of the coherence checks on a read reference. To fetch block k into local cache c: find a block to replace (write it back if it is RW); check the global table for copies of block k in remote caches (no copy / EX copy / RO copy / RW copy); if a remote cache r holds an RW copy, update main memory with that modified copy and downgrade its state; then cache block k in cache c in the appropriate state (EX or RO) and send the word to the processor. GT = possible access to the global table, WB = possible main-memory update, MM = main memory.]

[Figure: flowchart of the coherence checks on a write reference. To store into block j in local cache c: find a block to replace (write it back if it is RW); if cache c already holds block j in RW state, the store proceeds directly; otherwise, if a remote cache r holds the one RW copy, write it back to main memory and invalidate it in cache r, or invalidate all RO copies of block j in other caches; record block j as RW, fetch it with RW access, and store into it in cache c. GT = possible access to the global table, WB = possible write-back to memory, PI = pure invalidation, MM = main memory.]

Operation of a simple directory scheme

The scheme we have just discussed is called a centralized directory scheme, because the directory is kept together in one place. Let us now assume that the directory is distributed, with each node holding directory information for the blocks it contains. This node is called the home node for these blocks.

What happens on a read miss? The requesting node (1) sends a request transaction over the network to the home node for the block. (2) The home node responds with the identity of the owner, the node that currently holds a valid copy of the block. The requesting node then gets the data from the owner and revises the directory entry accordingly (a revision message goes back to the directory).

[Figure: read-miss transactions among the requestor, the directory (home) node for the block, and the owner node holding a dirty copy, including the revision message back to the directory.]

On a write, the same directory identifies the sharers of the block, and invalidation or update messages may be sent to the copies.

[Figure: write transactions: the requestor contacts the directory node, which identifies the sharers; invalidation or update messages go to each sharer.]
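The read-miss sequence just described can be sketched as a series of network transactions. In the minimal C sketch below, the message names, node numbers, and the send_msg helper are all hypothetical, and the step labels only loosely follow the figure; it simply traces the request / owner-identity / data / revision steps.

```c
#include <stdio.h>

/* Hypothetical message types mirroring the read-miss transactions. */
typedef enum { REQ_READ, REPLY_OWNER, REQ_DATA, REPLY_DATA, DIR_REVISE } msg_t;

static const char *msg_name[] =
    { "REQ_READ", "REPLY_OWNER", "REQ_DATA", "REPLY_DATA", "DIR_REVISE" };

/* Stub standing in for a network transaction from one node to another. */
static void send_msg(int from, int to, msg_t m, unsigned long block)
{
    printf("node %d -> node %d : %-11s (block 0x%lx)\n",
           from, to, msg_name[m], block);
}

/* Read miss at 'requestor': ask the home who owns the block, fetch the
 * data from the owner, then send a revision message to the directory. */
static void read_miss(int requestor, int home, int owner, unsigned long block)
{
    send_msg(requestor, home, REQ_READ, block);     /* 1: request to home    */
    send_msg(home, requestor, REPLY_OWNER, block);  /* 2: identity of owner  */
    send_msg(requestor, owner, REQ_DATA, block);    /* 3: ask owner for data */
    send_msg(owner, requestor, REPLY_DATA, block);  /* 4a: data returned     */
    send_msg(requestor, home, DIR_REVISE, block);   /* 4b: revise directory  */
}

int main(void)
{
    read_miss(/*requestor=*/3, /*home=*/0, /*owner=*/5, /*block=*/0x40);
    return 0;
}
```

Running it prints one line per transaction, which makes it easy to see that only the requestor, the home, and the owner are involved; no broadcast is needed.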
One major difference from bus-based schemes is that we can't assume that a write has completed when it leaves the processor; the processor must wait until it has received all the acknowledgments.

What information will be held in the directory? There will be a dirty bit, telling whether the block is dirty in some cache. Not all of the state information (MESI, etc.) kept in the cache needs to be held in the directory; only enough to determine what actions to take.

Sometimes the state information in the directory will be out of date. Why? Because the block has been written in some cache, but the change has not yet made it to the directory. So sometimes the directory will send a message to a cache that is no longer correct when it is received.

Let us consider an example system:
- Three stable cache states (MSI).
- Single-level caches.
- One processor per node.
- In the directory entry for each block, a dirty bit and presence bits p[i], one for each processor.

On a read miss from main memory by processor i:
- If dirty is off, then read the block from main memory and turn p[i] on.
- If dirty is on, then recall the line from the dirty processor, setting its cache state to shared; update memory; turn dirty off; turn p[i] on; supply the recalled data to processor i.

On a write miss to main memory by processor i:
- If dirty is off, then supply the line to i along with the p-vector; send invalidations to all caches that have the block; turn dirty on; turn p[i] on and the other presence bits off in the directory entry.
- If dirty is on, then recall the line from the dirty processor, changing its cache state to invalid; supply the line to i; turn p[i] on and the other presence bits off in the directory entry.

On the replacement of a dirty block by node i, the data is written back to memory, and the directory is updated to turn off the dirty bit and p[i]. On the replacement of a shared block, the directory may or may not be updated. If we choose not to update it on the replacement of a shared block, the next time the block is written an unnecessary invalidation may be sent to the node that contained it.

How does a directory help? It keeps track of which nodes have copies of a block, eliminating the need for broadcast. Would directories be valuable if most data were shared by most of the nodes in the system? No; then broadcast would probably be more efficient. Fortunately, the number of valid copies of the data on most writes is small. (A sketch of this presence-bit protocol appears after the scaling discussion below.)

Scaling with the number of processors [§8.2.2]

Scaling of memory and directory bandwidth provided:
- A centralized directory is a bandwidth bottleneck, just like centralized memory.
- How can directory information be maintained in a distributed way?

Scaling of performance characteristics:
- Traffic: the number of network transactions each time the protocol is invoked.
- Latency: the number of network transactions in the critical path each time.

Scaling of directory storage requirements:
- The number of presence bits needed grows as the number of processors.

E.g., with a 32-byte block size and 1024 processors, how many data bits are in a line vs. bits in its directory entry? 256 data bits vs. 1024 bits of directory information.

How a directory is organized affects all of these performance characteristics at a target scale, as well as coherence-management issues.
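As a concrete illustration, here is a minimal sketch of the presence-bit directory entry and the read-miss / write-miss handling described above, assuming the example system (MSI states, one processor per node) and a small fixed node count. The helper functions are stubs that merely print the network transactions a real home node would issue; their names and the 16-node limit are assumptions, not part of the notes.

```c
#include <stdint.h>
#include <stdio.h>

#define NODES 16                     /* assumed node count for this sketch  */

typedef struct {
    uint32_t p;                      /* presence bits, one per node         */
    int      dirty;                  /* set if one cache holds a dirty copy */
} dir_entry_t;

/* Stubs standing in for real network transactions. */
static void recall_from(int n, int shared)
{ printf("recall from node %d (%s)\n", n, shared ? "now shared" : "invalidated"); }
static void invalidate(int n)      { printf("invalidate node %d\n", n); }
static void supply_line_to(int n)  { printf("supply line to node %d\n", n); }
static void update_memory(void)    { printf("update main memory\n"); }

/* Read miss from main memory by processor i. */
static void dir_read_miss(dir_entry_t *e, int i)
{
    if (e->dirty) {
        for (int n = 0; n < NODES; n++)          /* recall the dirty copy;  */
            if ((e->p >> n) & 1) recall_from(n, 1); /* it becomes shared    */
        update_memory();
        e->dirty = 0;                            /* turn dirty off          */
    }
    e->p |= 1u << i;                             /* turn p[i] on            */
    supply_line_to(i);
}

/* Write miss to main memory by processor i. */
static void dir_write_miss(dir_entry_t *e, int i)
{
    for (int n = 0; n < NODES; n++) {
        if (!((e->p >> n) & 1) || n == i) continue;
        if (e->dirty) recall_from(n, 0);         /* recall and invalidate   */
        else          invalidate(n);             /* plain invalidation      */
    }
    e->dirty = 1;                                /* writer holds dirty copy */
    e->p = 1u << i;                              /* p[i] on, others off     */
    supply_line_to(i);
}

int main(void)
{
    dir_entry_t e = { 0, 0 };
    dir_read_miss(&e, 2);            /* node 2 reads: sharers = {2}, clean   */
    dir_read_miss(&e, 5);            /* node 5 reads: sharers = {2,5}, clean */
    dir_write_miss(&e, 7);           /* node 7 writes: invalidate 2 and 5    */
    return 0;
}
```

The main function walks through two read misses followed by a write miss, showing how the presence bits let the home invalidate only the two sharers rather than broadcasting to all nodes.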
Alternatives for organizing directories [§8.2.3]

When a miss occurs, how do we find the directory information? There are two main alternatives.

- A flat directory scheme: Directory information is in a fixed place, usually at the home, where the memory is located. On a miss, a transaction is sent to the home node.
- A hierarchical directory scheme: Directory information is organized as a tree, with the processing nodes at the leaves. Each node keeps track of which, if any, of its immediate children have a copy of the block. When a miss occurs, the directory information is found by traversing up the hierarchy, level by level, until the block is found in the appropriate state. The state indicates, e.g., whether copies of the block exist outside the subtree of this directory.

[Figure: taxonomy of directory schemes. Directory schemes are centralized or distributed. Distributed schemes differ in how to find the source of directory information: flat or hierarchical. Flat schemes differ in how to locate copies: memory-based or cache-based.]

How do flat schemes store information about copies?

Memory-based schemes store the information about all cached copies at the home node of the block, e.g., DASH (Stanford), Alewife (MIT), and the SGI Origin. The directory schemes we considered earlier are memory-based.

Cache-based schemes distribute the information about copies among the copies themselves, e.g., IEEE SCI (Scalable Coherent Interface) and Sequent's NUMA-Q. The home contains a pointer to one cached copy of the block; each copy contains the identity of the next node that has a copy of the block. The location of the copies is therefore determined through network transactions.

[Figure: main memory at the home points to one cached copy; each node in turn points to the next node that has a copy (home -> node 0 -> node 1 -> node 2).]

When do hierarchical schemes outperform flat schemes? When the network is large and copies tend to be fairly localized; in this case they limit the distance that transactions must traverse. Why might hierarchical schemes be slower than flat schemes? If copies don't tend to be localized. Also, there are more transactions in a hierarchical scheme, because you need to propagate them up the hierarchy.

Organizing a memory-based directory scheme

All information about copies is collocated with the block itself at the home. These schemes work just like the centralized scheme, except for being distributed.

Scaling of performance characteristics:
- Traffic on a write is proportional to the number of sharers.
- Latency: invalidations can be issued in parallel.

Scaling of storage overhead: assume the representation is a full bit vector. With m memory blocks and p processors, the amount of directory information is proportional to mp.

Optimizations for full bit-vector schemes:
- Increase the cache-line size; this reduces the storage overhead proportionally.
- Use multiprocessor nodes: one bit per multiprocessor node, not per processor. This still scales as mp, but it is only a problem for very large machines (e.g., 256 processors, 4 per cluster, 128-byte lines: 6.25% overhead).
- Reducing width (addressing the p term). Observation: most blocks are cached by only a few nodes, so don't keep a bit per node; make the entry contain a few pointers instead. With 1024 nodes, pointers are 10 bits each, so we could use 100 pointers and still save space; sharing patterns indicate that a few pointers (five or so) should suffice. An overflow strategy is needed for when there are more sharers (discussed later).
- Reducing height (addressing the m term). Observation: the number of memory blocks >> the number of cache lines, so most blocks will not be cached at any particular time, and therefore most directory entries are useless at any given time. Organize the directory as a cache, rather than having one entry per memory block.

Organizing a cache-based directory scheme

In a cache-based scheme, the home only holds a pointer to the rest of the directory information. The copies are linked together via a distributed list that weaves through the caches: each cache tag has a pointer that points to the next cache with a copy. On a read, a processor adds itself to the head of the list (communication is needed). On a write, it makes itself the head node of the list, then propagates a chain of invalidations down the list; each invalidation must be acknowledged. (A sketch of these list operations follows below.)
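Here is a small sketch, invented for illustration (it is not the actual SCI protocol), of the distributed sharing list just described: the home points to the head of the list, each cache tag carries next and previous pointers, a reader links itself in at the head, and a writer walks the list issuing invalidations. The writer is assumed to have already made itself the head; unlinking from the middle of the list, as on a write-back, is deliberately left out here and is discussed next.

```c
#include <stdio.h>

#define NO_NODE (-1)
#define NODES   16                      /* assumed node count for the sketch */

/* Per-node cache tag for one block: links of the distributed sharing list. */
typedef struct {
    int has_copy;
    int next;                           /* next node with a copy, or NO_NODE */
    int prev;                           /* previous node, or NO_NODE (head)  */
} tag_t;

typedef struct {
    int   head;                         /* home's pointer to the first copy  */
    tag_t tag[NODES];
} sharing_list_t;

void list_init(sharing_list_t *l)
{
    l->head = NO_NODE;
    for (int n = 0; n < NODES; n++)
        l->tag[n] = (tag_t){ 0, NO_NODE, NO_NODE };
}

/* On a read, node n links itself in at the head of the list. */
void list_add_reader(sharing_list_t *l, int n)
{
    l->tag[n] = (tag_t){ 1, l->head, NO_NODE };
    if (l->head != NO_NODE)
        l->tag[l->head].prev = n;       /* old head now follows n            */
    l->head = n;                        /* home points at n                  */
}

/* On a write, the writer (assumed to already be the head of the list)
 * propagates a chain of invalidations down the rest of the list; in a
 * real protocol each invalidation would be acknowledged.               */
void list_write(sharing_list_t *l, int writer)
{
    int cur = l->tag[writer].next;
    while (cur != NO_NODE) {
        int next = l->tag[cur].next;
        printf("invalidate copy at node %d\n", cur);
        l->tag[cur] = (tag_t){ 0, NO_NODE, NO_NODE };
        cur = next;
    }
    l->tag[writer].next = NO_NODE;      /* the writer is now the only copy   */
}

int main(void)
{
    sharing_list_t l;
    list_init(&l);
    list_add_reader(&l, 4);             /* list: 4                           */
    list_add_reader(&l, 9);             /* list: 9 -> 4                      */
    list_add_reader(&l, 2);             /* list: 2 -> 9 -> 4                 */
    list_write(&l, 2);                  /* node 2 writes: invalidate 9, 4    */
    return 0;
}
```

The main function builds a three-node list and then performs a write from the head node, printing the chain of invalidations that travels down the list one node at a time.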
On a write-back, the node must delete itself from the list, and therefore communicate with the nodes before and after it.

Disadvantages:
- All operations require communicating with at least three nodes: the node that is being operated on, the previous node, and the next node.
- Write latency is proportional to the number of sharers.
- Synchronization is needed to avoid simultaneous replacement of adjacent nodes on the list.

Advantages:
- Directory overhead is small.
- The linked list records the order in which accesses were made, making it easier to avoid starvation.
- The work of performing invalidations can be distributed among the sharers.

The IEEE Scalable Coherent Interface (SCI) has formalized protocols for handling cache-based directory schemes.

Summary

Flat schemes:
- Issue (a), finding the source of the directory data: go to the home, based on the address.
- Issue (b), finding out where the copies are:
  - Memory-based: all the information is in the directory at the home.
  - Cache-based: the home has a pointer to the first element of a distributed linked list.
- Issue (c), communicating with those copies:
  - Memory-based: point-to-point messages (perhaps coarser on overflow); they can be multicast or overlapped.
  - Cache-based: part of a point-to-point linked-list traversal to find them; serialized.

Hierarchical schemes:
- All three issues are handled by sending messages up and down the tree.
- There is no single explicit list of sharers.
- The only direct communication is between parents and children.