High Performance Computer Architecture (CS 6290)
These class notes were uploaded by Alayna Veum on Monday, November 2, 2015. They cover CS 6290 at Georgia Institute of Technology (Main Campus), taught by Milos Prvulovic in Fall.
CS 6290: Evaluation & Metrics (Georgia Tech)

Performance
- Two common measures:
  - Latency: how long it takes to do X. Also called response time or execution time.
  - Throughput: how often X can be done.
- Example: a car assembly line takes 6 hours to make a car (latency: 6 hours), but a car leaves the line every 5 minutes (throughput: 12 cars per hour).
- Overlap means Throughput > 1/Latency.

Measuring Performance
- Peak MIPS/MFLOPS: often not useful — unachievable in practice, or unsustainable.
- [figure: fraction of peak achieved by Power4, Itanium 2, NEC Earth Simulator, and Cray X1 on scientific codes — Cactus (astrophysics), GTC (plasma physics), materials science, magnetic fusion]
- Benchmarks:
  - Real applications and application suites, e.g. SPEC CPU2000, SPEC CPU2006, TPC-C, TPC-H.
  - Kernels: representative parts of real applications. Easier and quicker to set up and run, but often not really representative of the entire app.
  - Toy programs, synthetic benchmarks, etc.: not very useful for reporting; sometimes used to test/stress specific functions or features.

SPEC CPU integer (benchmark description → SPEC2006 name)
- GNU C compiler → gcc
- Interpreted string processing → perl
- Combinatorial optimization → mcf
- Block-sorting compression → bzip2
- Go game (AI) → go
- Video compression → h264avc
- Games / path finding → astar
- Search gene sequence → hmmer
- Quantum computer simulation → libquantum
- Discrete event simulation library → omnetpp
- Chess game (AI) → sjeng
- XML parsing → xalancbmk
Earlier generations (SPEC2000, SPEC95, SPEC92, SPEC89) used names such as gzip, vortex, crafty, parser, espresso, li, compress, eqntott, sc, ijpeg, m88ksim, twolf, and vpr. The set of representative applications keeps growing with time.

SPEC CPU floating point (benchmark description → SPEC2006 name)
- CFD blast waves → bwaves
- Numerical relativity → cactusADM
- Finite element code → calculix
- Differential equation solver framework → dealII
- Quantum chemistry → gamess
- EM solver, frequency/time domain → GemsFDTD
- Molecular dynamics → gromacs
- Scalable molecular dynamics (NAMD) → namd
- Lattice Boltzmann method (fluid/air flow) → lbm
- Large eddy
simulation, turbulent CFD → LESlie3d
- Lattice quantum chromodynamics → milc
- Image ray tracing → povray
- Sparse linear algebra → soplex
- Speech recognition → sphinx3
- Quantum chemistry, object-oriented → tonto
- Weather research and forecasting → wrf
- Magneto-hydrodynamics (astrophysics) → zeusmp
Earlier generations used names such as swim, wupwise, applu, apsi, mgrid, turb3d, hydro2d, su2cor, wave5, galgel, mesa, art, equake, facerec, ammp, lucas, fma3d, sixtrack, tomcatv, doduc, nasa7, spice, and matrix300.

SPEC2000 Price-Performance
- [figure: SPECint2000base and SPECfp2000base, absolute and per $1,000 of system price, for HP Integrity rx2820-2, Sun Java Workstation W1100z, HP ProLiant ML350 G4, Dell Precision Workstation 380, and HP ProLiant BL25p]

TPC Benchmarks
- Measure transaction-processing throughput.
- Benchmarks for different scenarios:
  - TPC-C: warehouses and sales transactions.
  - TPC-H: ad-hoc decision support.
  - TPC-W: web-based business transactions.
- Difficult to set up and run on a simulator: requires full OS support and a working DBMS, and needs long simulations to get stable results.
- [figure: throughput (TPM) and price-performance (TPM per $1,000) across server systems — high performance is very expensive]

CPU Performance Equation (1)
- CPU time = CPU clock cycles × clock cycle time
- CPU time = Instruction Count × Cycles Per Instruction × Clock cycle time
- Seconds/Program = (Instructions/Program) × (Clock Cycles/Instruction) × (Seconds/Clock Cycle)
- Instructions/Program depends on the ISA and the compiler; Clock Cycles/Instruction on the ISA and the organization; Seconds/Clock Cycle on the hardware technology and the organization.
- AKA "the iron law of performance."
Car Analogy
- Need to drive from Klaus to CRC.
- "Clock speed": 3500 RPM. "CPI": 5250 rotations/km, i.e. 0.19 m per rotation. "Instructions": 800 m.
- CPU time = (Seconds/Program) = (Instructions/Program) × (Clock Cycles/Instruction) × (Seconds/Clock Cycle)
  = 800 m × (1 rotation / 0.19 m) × (1 minute / 3500 rotations) ≈ 1.2 minutes.

CPU Version
- A program takes 33 billion instructions to run; the CPU processes instructions at 2 cycles per instruction; the clock speed is 3 GHz.
- CPU time = 33×10^9 × 2 × (1 / 3×10^9) = 22 seconds.
- Sometimes the clock cycle time is given instead (e.g. cycle = 333 ps); IPC is sometimes used instead of CPI.

CPU Performance Equation (2)
- CPU time = CPU clock cycles × clock cycle time
- CPU time = Σ over instruction classes i of (IC_i × CPI_i) × clock cycle time
  where IC_i is how many instructions of kind i are in the program, and CPI_i is how many cycles it takes to execute an instruction of that kind.

CPU performance with different instruction types
- Instruction type / frequency / CPI: Integer 40%, 1.0; Branch 20%, 4.0; Load 20%, 2.0; Store 10%, 3.0.
- Total instructions: 50B. Clock speed: 2 GHz.
- CPU time = Σ_i (IC_i × CPI_i) × clock cycle time.

Comparing Performance
- "X is n times faster than Y" means: Execution time of Y / Execution time of X = n.
- "Throughput of X is n times that of Y" means: Tasks per unit time on X / Tasks per unit time on Y = n.

If Only It Were That Simple
- "X is n times faster than Y on A" means: Execution time of app A on machine Y / Execution time of app A on machine X = n.
- But what about different applications, or even parts of the same application? X is 10 times faster than Y on A and 15 times on B, but Y is 2 times faster than X on C and 3 times on D, and so on. So does X have better performance than Y?

Summarizing Performance
- Arithmetic mean (average execution time): gives more weight to longer-running programs.
- Weighted arithmetic mean: more important programs can be emphasized — but what do we use as weights? Different weights will make different machines look better.
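The iron-law examples above (the 33-billion-instruction program and the instruction-mix table) can be sketched numerically. This is a minimal illustration using the slide's stated values; note that the scanned instruction-mix frequencies sum to 90%, so the weighted-CPI result is illustrative.

```python
def cpu_time(insts, cpi, clock_hz):
    """Iron law: CPU time = instruction count x CPI x clock cycle time."""
    return insts * cpi / clock_hz

# CPU version from the slide: 33 billion instructions, 2 cycles/inst, 3 GHz.
print(cpu_time(33e9, 2, 3e9))  # -> 22.0 seconds

# Weighted CPI from the slide's instruction mix (frequency, CPI per class).
# As scanned, the frequencies sum to 0.9, so treat the result as approximate.
mix = {"integer": (0.40, 1.0), "branch": (0.20, 4.0),
       "load":    (0.20, 2.0), "store":  (0.10, 3.0)}
avg_cpi = sum(freq * cpi for freq, cpi in mix.values())
print(avg_cpi)                        # weighted cycles per instruction
print(cpu_time(50e9, avg_cpi, 2e9))   # 50B instructions at 2 GHz
```

The same function covers the car analogy if you read "instructions" as meters, "CPI" as rotations per meter, and the clock as rotations per minute.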
Speedup
- Example: Program 1 takes 5 sec on Machine A and 4 sec on Machine B; Program 2 takes 3 sec on Machine A and 6 sec on Machine B.
- What is the speedup of A compared to B on Program 1? On Program 2? What is the average speedup? What is the speedup of A compared to B on the sum (Program 1 + Program 2)?

Normalizing & the Geometric Mean
- The speedup of arithmetic means is not the arithmetic mean of speedups.
- Use the geometric mean of normalized execution times.
- Neat property of the geometric mean: it is consistent whatever the reference machine.
- Do not use the arithmetic mean for normalized execution times.

CPI / IPC
- Often, when making comparisons in computer-architecture studies, the program (or set of programs) is the same for the two CPUs, and the clock speed is the same for the two CPUs.
- So we can directly compare CPIs, and often we use IPCs.

Average CPI vs Average IPC
- Average CPI = (CPI_1 + CPI_2 + … + CPI_n) / n.
- AM of IPC = (IPC_1 + … + IPC_n) / n — this is NOT equal to 1 / AM(CPI).
- Must use the harmonic mean of IPC to remain proportional to runtime.

Harmonic Mean
- HM(x_1, x_2, …, x_n) = n / (1/x_1 + 1/x_2 + … + 1/x_n).
- What in the world is this? The average of inverse relationships.

AM(CPI) vs HM(IPC)
- 1 / AM(CPI) = n / (CPI_1 + CPI_2 + … + CPI_n) = n / (1/IPC_1 + 1/IPC_2 + … + 1/IPC_n) = HM(IPC).

Amdahl's Law (1)
- Speedup = Execution time without enhancement / Execution time with enhancement = Execution time_old / Execution time_new.
- What if the enhancement does not enhance everything?
  Speedup = Execution time without using the enhancement at all / Execution time using the enhancement when possible.
- Execution time_new = Execution time_old × ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
- Overall speedup = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
- Caution: fraction of what?

Amdahl's Law (2)
- Make the common case fast.
- Speedup_enhanced = 20, Fraction_enhanced = 0.1: overall speedup = 1 / (0.9 + 0.1/20) = 1.105.
- Speedup_enhanced = 1.2, Fraction_enhanced = 0.9: overall speedup = 1 / (0.1 + 0.9/1.2) = 1.176.
- Important: principle of locality — approximately 90% of the time is spent in 10% of the code.
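The Amdahl's-law arithmetic above can be checked with a short sketch; the two cases use the slide's numbers.

```python
def amdahl_speedup(f_enhanced, s_enhanced):
    """Overall speedup when a fraction f of the old runtime is sped up by s."""
    return 1.0 / ((1.0 - f_enhanced) + f_enhanced / s_enhanced)

# The slide's two cases:
print(round(amdahl_speedup(0.1, 20), 3))   # big speedup, small fraction -> 1.105
print(round(amdahl_speedup(0.9, 1.2), 3))  # small speedup, big fraction -> 1.176
```

Note how a modest 1.2x speedup on 90% of the time beats a 20x speedup on 10% of it: the untouched fraction dominates.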
Amdahl's Law (3): Diminishing Returns
- [figure: total execution time across three generations; repeatedly speeding up the same portion yields smaller overall speedups — Generation 2 is 1.33x over Generation 1, Generation 3 only 1.2x over Generation 2]

Yet Another Car Analogy
- From GT to Mall of Georgia (35 mi); you've got a turbo for your car, but can only use it on the highway.
- Spaghetti Junction to Mall of GA: 23 mi, average speed 60 mph (120 mph with the turbo).
- GT to Spaghetti Junction: 12 mi, stuck in bad rush-hour traffic, average speed 5 mph.
- The turbo gives a 100% speedup across 66% of the distance, but results in less than a 10% reduction in total trip time — a less-than-1.1x speedup.

Now Consider Price-Performance
- Without the turbo: the car costs $8,000 to manufacture and sells for $12,000 — $4K profit per car. If we sell 10,000 cars, that's $40M in profit.
- With the turbo: the car costs an extra $3,000 and sells for $16,000 — $5K profit per car. But only a few gear-heads buy it: we sell only 400 cars and make $2M in profit.

CS 6290: I/O and Storage (Milos Prvulovic, Georgia Tech)

Storage Systems
- I/O performance: bandwidth is improving, but not as fast as the CPU; latency is improving very slowly.
- Consequently, by Amdahl's law, the fraction of time spent on I/O is increasing.
- Other factors are just as important: reliability, availability, dependability.
- Storage devices are very diverse — magnetic disks, tapes, CDs, DVDs, flash — with different advantages, disadvantages, and uses.

Magnetic Disks
- [figure: disk platters and arm; © 2003 Elsevier Science (USA), all rights reserved]
- Good: cheap per MB, fairly reliable. Used for primary storage and memory swapping.
- Bad: can only read/write an entire sector; cannot be directly addressed like main memory.
- Disk access time = queuing delay (wait until the disk gets to this operation) + seek time (the head moves to the correct track) + rotational latency (the correct sector must
get under the head) + data transfer time + controller time.

Trends for Magnetic Disks
- Capacity doubles in approximately one year.
- Average seek time: 5–12 ms, improving very slowly.
- Average rotational latency: half a full rotation. 5,000 RPM to 10,000 RPM to 15,000 RPM — improves slowly (not easy: reliability, noise).
- Data transfer rate improves at an OK rate: new interfaces, more data per track.

Optical Disks
- Improvement is limited by standards: CD and DVD capacity has been fixed over the years. The technology actually improves, but it takes time for it to make it into new standards.
- Physically small and replaceable; good for backups and carrying around.

Magnetic Tapes
- Very long access latency: must rewind the tape to the correct place for a read/write.
- Used to be very cheap per MB (it's just miles of tape), but disks have caught up anyway.
- Used for backup and secondary storage: large capacity and replaceable.

Using RAM for Storage
- Disks are about 100 times cheaper per MB; DRAM is about 100,000 times faster in latency.
- Solid-state disks (here: a DRAM plus a battery): much faster than disk and more reliable, but expensive — not very good for archives and such.
- Flash memory: much faster than disks but slower than DRAM; very low power consumption; can be sold in small sizes (a few GB, but tiny).

Buses for I/O
- Traditionally two kinds of buses: the CPU-memory bus (fast, short) and the I/O bus (can be slower and longer).
- Now mezzanine buses (PCI): pretty fast and relatively short; can connect fast devices directly, and can connect to longer, slower I/O buses.
- Data transfers over a bus are called transactions.

Buses in a System
- [figure: CPU-memory bus connecting CPU, memory, and I/O bridges out to network and devices; © 2003 Elsevier Science (USA)]

Bus Design Decisions
- Split transactions: traditionally the bus stays occupied between the request and the response on a read. Now: get the bus, send the request, free the bus; when the response is ready, get the bus, send the response, free the bus.
- Bus mastering: which devices can initiate transfers on the bus? The CPU can always be the master, but we can also allow other devices to be masters. With multiple masters, we need arbitration.
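The rotational-latency trend above (average latency is the time for half a rotation) is easy to sketch; the RPM values are the ones the slide mentions.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency = time for half a rotation, in milliseconds."""
    return 0.5 * 60_000.0 / rpm  # 60,000 ms per minute of rotation

for rpm in (5_000, 10_000, 15_000):  # the RPM steps from the slide
    print(rpm, "RPM ->", avg_rotational_latency_ms(rpm), "ms")
# -> 6.0 ms, 3.0 ms, 2.0 ms
```

Tripling the spindle speed only cuts average rotational latency from 6 ms to 2 ms, which is why this component of disk access time improves so slowly.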
CPU-Device Interface
- Devices are typically accessible to the CPU through control and data registers. These registers can either be:
  - Memory-mapped: some physical memory addresses actually map to I/O device registers; read/write through loads and stores. Most RISC processors support only this kind of I/O mapping.
  - In a separate I/O address space: read/write through special IN/OUT instructions. Used in x86, but even in x86 PCs some I/O is memory-mapped.

CPU-Device Interface (2)
- Devices can be very slow: when given some data, a device may take a long time to become ready to receive more. Usually there is a Done bit in a status register.
- Checking the Done bit:
  - Polling: test the Done bit in a loop.
  - Interrupt: interrupt the CPU when the Done bit becomes 1.
- Interrupts are better if I/O events are infrequent or the device is slow; each interrupt has some OS and HW overhead.
- Polling is better for devices that finish quickly. Even then, buffering data in the device lets us use interrupts.
- Interrupt-driven I/O is used today in most systems.

Dependability
- The quality of delivered service that justifies us relying on the system to provide that service.
- The delivered service is the actual behavior; each module also has an ideal, specified behavior.
- Faults, errors, failures:
  - Failure: actual behavior deviates from specified behavior.
  - Error: the defect that results in the failure.
  - Fault: the cause of the error.

Failure Example
- A programming mistake is a fault: an add function that works fine except when we try 5+3, in which case it returns 7 instead of 8.
- It is a latent error until activated; an activated fault becomes an effective error (we call our add and it returns 7 for 5+3).
- A failure is when the error results in a deviation in behavior, e.g. we schedule a meeting for the 7th instead of the 8th.
- An effective error need not result in a failure: if we never use the result of this add, there is no failure.

Reliability and Availability
- The system can be in one of two states: service accomplishment or service interruption.
- Reliability: a measure of continuous
service accomplishment; typically Mean Time To Failure (MTTF).
- Availability: service accomplishment as a fraction of overall time. Also looks at Mean Time To Repair (MTTR), the average duration of a service interruption.
- Availability = MTTF / (MTTF + MTTR).

Faults Classified by Cause
- Hardware faults: hardware devices fail to perform as designed.
- Design faults: faults in software, and some faults in HW — e.g. the Pentium FDIV bug was a design fault.
- Operation faults: operator and user mistakes.
- Environmental faults: fire, power failure, sabotage, etc.

Faults Classified by Duration
- Transient faults: last for a limited time and are not recurring. An alpha particle can flip a bit in memory, but usually does not damage the memory HW.
- Intermittent faults: last for a limited time but are recurring. E.g. an overclocked system works fine for a while but then crashes; we reboot it and it does it again.
- Permanent faults: do not get corrected when time passes. E.g. the processor has a large round hole in it because we wanted to see what's inside.

Improving Reliability
- Fault avoidance: prevent the occurrence of faults by construction.
- Fault tolerance: prevent faults from becoming failures, typically through redundancy.
- Error removal: removing latent errors by verification.
- Error forecasting: estimate the presence, creation, and consequences of errors.

Disk Fault Tolerance with RAID
- Redundant Array of Inexpensive Disks: several smaller disks play the role of one big disk.
- Can improve performance: data is spread among multiple disks, and accesses to different disks go in parallel.
- Can improve reliability: data can be kept with some redundancy.

RAID 0
- Striping, used to improve performance: data is stored on the disks in the array so that consecutive stripes of data are on different disks.
- Makes the disks share the load, improving throughput (all disks can work in parallel) and latency (less queuing delay — a queue for each disk).
- No redundancy: reliability is actually lower than with a single disk, because if any disk in the array fails, we have a problem.
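The availability formula above, Availability = MTTF/(MTTF+MTTR), in a quick sketch. The numeric values are hypothetical, chosen only for illustration.

```python
def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR), as a fraction of total time."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical numbers (not from the slides): failing once every 999 hours
# with 1-hour repairs gives "three nines" of availability.
print(availability(999, 1))  # -> 0.999
```

The formula makes the trade-off explicit: you can raise availability either by failing less often (bigger MTTF) or by repairing faster (smaller MTTR).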
RAID 1
- Disk mirroring: disks are paired up and keep identical data.
- A write must update the copies on both disks; a read can read either of the two copies.
- Improved performance and reliability: more reads per unit time, and if one disk fails its mirror still has the data.
- With more than 2 disks (e.g. 8 disks):
  - Striped mirrors ("RAID 10"): pair disks for mirroring, then stripe across the 4 pairs.
  - Mirrored stripes ("RAID 01"): stripe using 4 disks, then mirror that using the other 4.

RAID 4
- Block-interleaved parity: one disk is a parity disk that keeps parity blocks. The parity block at position X is the parity of all blocks whose position is X on any of the data disks.
- A read accesses only the data disk where the data is; a write must update the data block and its parity block.
- Can recover from an error on any one disk: use the parity and the other data disks to restore the lost data.
- Note that with N disks we have N−1 data disks and only one parity disk, but can still recover when one disk fails.
- But write performance is worse than with one disk: all writes must read and then write the parity disk.

RAID 4 Parity Update
- [figure: on new data — 1. read the old data, 2. read the old parity, 3. write the new data, 4. write the new parity; © 2003 Elsevier Science (USA)]

RAID 5
- Distributed block-interleaved parity: like RAID 4, but the parity blocks are distributed across all disks.
- A read accesses only the data disk where the data is; a write must update the data block and its parity block — but now all disks share the parity-update load.
- [figure: RAID 4 vs RAID 5 block layouts; © 2003 Elsevier Science (USA)]

CS 6290: Memory (Georgia Tech)

Views of Memory
- Real machines have limited amounts of memory: 640 KB … a few GB (this laptop: 2 GB).
- The programmer doesn't want to be bothered: do you think "oh, this computer only has 128 MB, so I'll write my code this way"? What happens if you run on a different machine?
- Programmer's view: example with a 32-bit memory.
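The RAID 4/5 parity scheme described above boils down to bytewise XOR: the parity block is the XOR of the data blocks at the same position, and XORing the parity with the surviving blocks rebuilds a lost one. A minimal sketch with hypothetical 4-byte blocks on three data disks:

```python
from functools import reduce

def parity_block(blocks):
    """Parity = bytewise XOR of same-position blocks across the data disks."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*blocks))

# Hypothetical blocks at the same position on three data disks:
data = [b"\x11\x22\x33\x44", b"\xAA\xBB\xCC\xDD", b"\x01\x02\x03\x04"]
p = parity_block(data)

# If disk 1 fails, XOR the parity with the surviving blocks to rebuild it:
rebuilt = parity_block([data[0], data[2], p])
assert rebuilt == data[1]
```

This is also why RAID 4/5 writes are expensive: updating one data block requires reading the old data and old parity to compute the new parity before both can be written back.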
When programming, you don't care about how much real memory there is: even if you use a lot, memory can always be paged to disk.
- [figure: 4 GB virtual address space — kernel at the top, then stack, heap, data, text]

Programmer's View
- Really, the program's view: each program (process) gets its own 4 GB space.

CPU's View
- At some point the CPU is going to have to load from / store to memory, and all it knows is the real, AKA physical, memory — which unfortunately is often < 4 GB, and is never 4 GB per process.

Pages
- Memory is divided into pages, which are nothing more than fixed-sized, aligned regions of memory.
- Typical size: 4 KB per page (but not always): bytes 0–4095 are page 0, 4096–8191 page 1, 8192–12287 page 2, 12288–16383 page 3, and so on.

Page Table
- Maps virtual addresses to physical locations; the page table implements this virtual-to-physical mapping.
- [figure: virtual pages at 0K, 4K, 8K, 12K mapped through the page table to physical pages]

Need for Translation
- Virtual address 0xFC519088 splits into a virtual page number (0xFC519) and a page offset (0x088).
- The page table maps VPN 0xFC519 to physical page 0x00152, so the physical address in main memory is 0x00152088.

Simple Page Table
- Flat organization: one entry per page.
- Each entry contains the physical page number (PPN), or indicates that the page is on disk or invalid. Also metadata: e.g. permissions, dirtiness, etc.

Multi-Level Page Tables
- The virtual page number is split into a level-1 index and a level-2 index (plus the page offset); walking the levels yields the physical page number.

Choosing a Page Size
- Page size is inversely proportional to page-table overhead.
- A large page size permits more efficient transfer to/from disk — one big transfer beats many small ones (like downloading from the Internet).
- A small page leads to less fragmentation: a big page is likely to have more bytes unused.

CPU Memory Access
- The program deals with virtual addresses, e.g. Load R1 = 0[R2]. On a memory instruction:
  1. Compute the virtual address (0 + R2).
  2. Compute the virtual page number.
  3. Compute the physical address of the VPN's page-table entry.
  4. Load the mapping.
  5. Compute the physical address.
  6. Do the actual load from memory.
- Could be more steps, depending on the page-table organization.
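The translation steps above can be sketched for the slide's example (4 KB pages, VPN 0xFC519 → PPN 0x00152); the dictionary stands in for a real page table.

```python
PAGE_OFFSET_BITS = 12  # 4 KB pages, as on the slide

def translate(vaddr, page_table):
    """Split a virtual address into VPN and offset, look up the PPN, recombine."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    ppn = page_table[vpn]  # raises KeyError for an unmapped page (a "page fault")
    return (ppn << PAGE_OFFSET_BITS) | offset

# The slide's example: VPN 0xFC519 maps to PPN 0x00152.
pt = {0xFC519: 0x00152}
print(hex(translate(0xFC519088, pt)))  # -> 0x152088
```

A multi-level table would replace the single dictionary lookup with one lookup per level, each indexed by a slice of the VPN — which is exactly why unassisted translation costs extra memory accesses.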
Impact on Performance
- Every time you load/store, the CPU must perform two or more memory accesses. Even worse, every fetch requires translation of the PC.
- Observation: once a virtual page is mapped into a physical page, it will likely stay put for quite some time.

Idea: Caching
- Not caching of data, but caching of translations.
- [figure: virtual-to-physical page mappings, with a small structure caching e.g. VPN 8 → PPN 16]

Translation Cache: TLB
- TLB = Translation Lookaside Buffer: virtual address → TLB → physical address → data cache.
- If the TLB hits, there is no need to do a page-table lookup.
- Note: the data cache (and memory) is now accessed by physical addresses.

PA/PT Cache
- The previous slide showed a Physically Addressed, Physically Tagged cache — sometimes called PIPT (physically indexed, physically tagged).
- Con: the TLB lookup and the cache access are serialized, and caches already take > 1 cycle.
- Pro: cache contents stay valid as long as the page table is not modified.

Virtually Addressed Cache
- Virtually indexed and virtually tagged; the TLB is consulted only on a cache miss (for the L2/memory access).
- Pro: latency — no need to check the TLB on a hit.
- Con: the cache must be flushed on a process change.

Virtually Indexed, Physically Tagged (VIPT)
- Index the cache with virtual-address bits while the TLB translates in parallel; compare physical tags to detect a hit.
- Pro: latency — the TLB lookup is parallelized with the cache index. Pro: no need to flush on a process swap.
- Con: limits cache indexing — can only use bits that are not part of the VPN/PPN (i.e. page-offset bits).

TLB Design
- Often fully associative; for latency this means few entries. However, each entry covers a whole page. Example: a 32-entry TLB with 4 KB pages — how big a working set can you have while avoiding TLB misses?
- If there are many misses: increase the TLB size (latency problems), or increase the page size (fragmentation problems).

Process Changes
- With physically tagged caches, we don't need to flush the cache on a context switch — but the TLB is no longer valid.
- Solution: add a process ID to each translation, e.g. PID 0: VPN 8 → PPN 28; PID 1: VPN 8 → PPN 44.
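The PID-tagged TLB idea above can be sketched as a small lookup structure. This is a toy model (a dictionary keyed by (PID, VPN) with crude eviction), not a hardware design; the entry count and mappings are the slide's values.

```python
class TLB:
    """Tiny fully-associative TLB sketch, tagged with (PID, VPN) as on the slide."""
    def __init__(self, entries=32):
        self.entries = entries
        self.map = {}  # (pid, vpn) -> ppn; real TLBs use a proper replacement policy

    def lookup(self, pid, vpn):
        return self.map.get((pid, vpn))  # None = TLB miss -> walk the page table

    def fill(self, pid, vpn, ppn):
        if len(self.map) >= self.entries:   # crude eviction: drop an arbitrary entry
            self.map.pop(next(iter(self.map)))
        self.map[(pid, vpn)] = ppn

tlb = TLB()
tlb.fill(0, 8, 28)  # PID 0: VPN 8 -> PPN 28 (slide values)
tlb.fill(1, 8, 44)  # PID 1: VPN 8 -> PPN 44
assert tlb.lookup(0, 8) == 28 and tlb.lookup(1, 8) == 44
assert tlb.lookup(2, 8) is None  # different process: miss, but no flush needed
```

Because the PID is part of the tag, two processes can map the same VPN to different PPNs and still coexist in the TLB across context switches.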
SRAM vs DRAM
- DRAM = Dynamic RAM.
- SRAM: 6 transistors per bit, built with the normal high-speed CMOS technology.
- DRAM: 1 transistor per bit, built with a special DRAM process optimized for density.

Hardware Structures
- [figure: SRAM cell — cross-coupled inverters with a wordline and a bitline pair (b, b̄); DRAM cell — one transistor plus a capacitor on a single bitline]

Implementing the Capacitor
- [figure: trench-cell cross-section — cell plate, capacitor insulator, storage node, refilling poly, Si substrate, field oxide; DRAM figures taken from Prof. Nikolic's EECS lecture notes, UC Berkeley]

DRAM Chip Organization
- The row address goes through a row decoder to select a row of the memory cell array; the column address then selects data out of the row buffer onto the data bus.

DRAM Chip Organization (2)
- Differences with SRAM:
  - Reads are destructive: the contents are erased after reading.
  - Row buffer: read lots of bits all at once, then parcel them out based on different column addresses — similar to reading a full cache line but accessing only one word at a time.
- Fast Page Mode (FPM) DRAM organizes the DRAM row to contain the bits for a complete "page": the row address is held constant, then fast reads are made from different locations in the same page.

DRAM Read Operation
- [figure: row 0x1FE is read into the sense amps; the column decoder then picks columns 0x000, 0x001, 0x002 — accesses need not be sequential]

Destructive Read
- Wordline enabled, sense amp enabled; after a read of 0 or 1, the cell contains something close to 1/2 of the storage-cell voltage.

Refresh
- So after a read, the contents of the DRAM cell are gone; the values are stored in the row buffer. Write them back into the cells for the next read in the future.

Refresh (2)
- Fairly gradually, the DRAM cell loses its contents even if it's not accessed (gate leakage). This is why it's called dynamic.
- Contrast with SRAM, which is "static" in that, once written, it maintains its value forever — so long as power remains on.
- All DRAM
rows need to be regularly read and re-written.
- If it keeps its value even when power is removed, then it's non-volatile — e.g. flash, HDD, DVDs.

DRAM Read Timing
- Accesses are asynchronous, triggered by the RAS and CAS signals, which can in theory occur at arbitrary times (subject to DRAM timing constraints).
- [figure: row access, column access, data-transfer overlap on the address and DQ lines]

SDRAM Read Timing
- Double-Data-Rate (DDR) DRAM transfers data on both the rising and falling edges of the clock; the command frequency does not change.
- [timing figures taken from "A Performance Comparison of Contemporary DRAM Architectures" by Cuppu, Jacob, Davis, and Mudge]

Rambus (RDRAM)
- Synchronous interface.
- Row-buffer cache: the last 4 rows accessed are cached, giving a higher probability of a low-latency hit; DRDRAM increases this to 8 entries.
- Uses other tricks, since adopted by SDRAM: multiple data words per clock, high frequencies.
- Chips can self-refresh.
- Expensive for PCs; used by the Xbox and PS2.

Example Memory Latency Computation
- FSB frequency 200 MHz, SDRAM, RAS delay 2, CAS delay 2.
- Request stream: A0 A1 B0 C0 D3 A2 D0 C1 A3 C3 C2 D1 B1 D2.
- What's this in CPU cycles (assume 2 GHz)? What's the impact on AMAT?

More Latency
- More wire delay getting to the memory chips; significant wire delay just getting from the CPU to the memory controller.
- [figure: CPU and caches — 128-bit, 100 MHz bus — memory controller — x16 DRAM chips; width/speed varies depending on memory type] — plus the return trip.

Memory Controller
- Like a write-combining buffer: the scheduler may coalesce multiple accesses together, or reorder them to reduce the number of row accesses.
- [figure: read queue, write queue, response queue, scheduler, and banks 0 and 1 inside the memory controller]

Memory Reference Scheduling
- Just like registers, we need to enforce RAW, WAW, and WAR dependencies. There is no memory renaming in the memory controller, so it must enforce all three.
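One way to work the slide's latency exercise: convert DRAM bus cycles into CPU cycles. The RAS/CAS delays, bus frequency, and CPU frequency are the slide's values; treating the data transfer as one additional bus cycle is my assumption for illustration.

```python
CPU_HZ, FSB_HZ = 2e9, 200e6   # slide values: 2 GHz CPU, 200 MHz FSB
RAS_DELAY, CAS_DELAY = 2, 2   # bus cycles for the row / column access

def access_latency_cpu_cycles(row_hit):
    """Bus cycles for one access, converted to CPU cycles.
    Assumes 1 bus cycle per data beat; a row-buffer hit skips the RAS phase."""
    bus_cycles = CAS_DELAY + 1 + (0 if row_hit else RAS_DELAY)
    return bus_cycles * (CPU_HZ / FSB_HZ)

print(access_latency_cpu_cycles(row_hit=False))  # -> 50.0 CPU cycles
print(access_latency_cpu_cycles(row_hit=True))   # -> 30.0 CPU cycles
```

The 10x clock ratio is the point of the exercise: a handful of bus cycles in the DRAM becomes tens of CPU cycles, and row-buffer hits (which a scheduling memory controller tries to create by reordering) save the RAS portion.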
- Like everything else, we still need to maintain the appearance of sequential access: consider multiple read/write requests to the same address.

Example Memory Latency Computation (2)
- FSB frequency 200 MHz, SDRAM, RAS delay 2, CAS delay 2, now with scheduling in the memory controller, for the stream A0 A1 B0 C0 D3 A2 D0 C1 A3 C3 C2 D1 B1 D2.
- Think about the hardware complexity.

So What Do We Do About It?
- Caching reduces average memory-instruction latency by avoiding DRAM altogether. Limitations: capacity (programs keep increasing in size) and compulsory misses.
- Faster DRAM speed? We can clock the FSB faster, but the DRAM chips may not be able to keep up: latency is dominated by wire delay.
- Bandwidth may be improved (DDR vs regular), but latency doesn't change much: instead of 2 cycles for a row access, it may take 3 cycles at a faster bus speed. This doesn't address the latency of the memory access.

CS 6290: Pentiums (Georgia Tech)

Case Study 1: Pentium Pro
- The basis for the Centrinos, Core, and Core 2.
- We'll also look at the P4 after this.

Hardware Overview
- Fetch, decode, and register renaming at the front end; a 20-entry unified reservation station (RS) for issue/allocation; a 40-entry reorder buffer (ROB) for commit; execution units in between.

Speculative Execution & Recovery
1. Normal execution: speculatively fetch and execute instructions.
2. The OOO core detects a misprediction, flushes the front end, and starts refetching.
3. New instructions are fetched, but the OOO core still contains wrong-path uops.
4. Once the OOO core has drained, retire the bad branch and flush the rest of the OOO core.
5. Normal execution resumes: speculatively fetch and execute instructions.

Branch Prediction
- BTB with 2-bit counters, branch targets, and history.
- Use the dynamic predictor where possible; use the static predictor for indirect jumps; stall until decode otherwise.

Micro-op Decomposition (CISC → RISC)
- Simple x86 instructions map to a single uop: e.g. INC, ADD r,r, XOR, MOV r,[r] (a load).
- Moderately complex instructions map to a few uops, e.g.
Store → STA+STD; ADD r,m → LOAD+ADD; ADD m,r → LOAD+ADD+STA+STD.
- More complex instructions make use of the uop ROM (UROM): e.g. PUSHA → STA+STD+ADD, STA+STD+ADD, …

Decoder
- The "4-1-1" limitation: decode up to three instructions per cycle with three decoders, but they are asymmetric.
- Only the first decoder can handle moderately complex instructions — those that can be encoded with up to 4 uops. If more than 4 uops are needed, go to the UROM.
- [figure: relative decoder area versus decode pattern]

Simple Core
- After decode, the machine deals only with uops until commit: rename, RS, ROB. It looks just like a RISC-based OOO core.
- A couple of changes deal with x86: flags and partial register writes.

Execution Ports
- Unified RS, multiple ALUs. Example — two adders: what if multiple ADDs are ready at the same time? We would need to choose 2 of N and make assignments.
- To simplify, each ADD is assigned to an adder during the Alloc stage, and can only attempt to execute on its assigned adder: if my assigned adder is busy, I can't go, even if the other adder is idle.
- This reduces the selection problem to choosing 1 of N — easier logic.

Execution Ports (cont.)
- RS entries feed: Port 0 (ALU), Port 1 (ALU, JEU), Port 2 (load address, AGU), Port 3 (store address, AGU), Port 4 (store data, STD). Loads and stores go through the Memory Ordering Buffer (aka LSQ) to the data cache.
- In theory, up to 5 uops can execute per cycle — assuming they match the ALUs exactly.

RISC → CISC Commit
- The external world doesn't know about uops, so instruction commit must be all or nothing: either commit all uops from an instruction, or none.
- Example: ADD [EBX], ECX becomes LOAD tmp0 ← [EBX]; ADD tmp0 ← tmp0, ECX; STA tmp1 ← EBX; STD tmp2 ← tmp0. If the load has a page fault, or the store has a protection fault, the whole instruction must be squashed together.

Case Study 2: Intel P4
- Primary objectives: clock speed (implies performance — true if CPI does not increase too much) and marketability (GHz sells).
- Clock speed, clock speed, clock speed.

Faster Clock Speed
- Less work per cycle: traditional single-cycle tasks may become multi-cycle; more pipeline bubbles and idle resources.
- More pipeline stages mean more control
logic (each stage needs to be controlled) and more circuits to design — more engineering effort.
- More critical paths: more timing paths are at or close to the clock speed, so there is less benefit from tuning the worst paths.
- Higher power: P = 1/2 C V² f.

Extra Delays Needed
- Basic Pentium III misprediction pipeline (10 stages): Fetch, Fetch, Decode, Decode, Decode, Rename, ROB Read, Rdy/Sch, Dispatch, Exec.
- Basic Pentium 4 misprediction pipeline (20 stages): TC Next IP (1-2), TC Fetch (3-4), Drive (5), Alloc (6), Rename (7-8), Queue (9), Sch (10-12), Disp (13-14), RF (15-16), Ex (17), Flags (18), Br Ck (19), Drive (20).
- The branch-mispredict pipeline has two "Drive" stages — extra delay because the P4 can't get a signal from point A to point B in less than a cycle.
- Side note: the P4 does not have a "20-stage pipeline"; it's much longer.

Make the Common Case Fast
- Fetch: usually a hit. Branches are frequent and often taken; branch mispredictions are not that infrequent — and even if the frequency is low, the cost is high (a pipe flush).
- The P4 uses a Trace Cache: it caches the dynamic instruction stream, in contrast to an I-cache, which caches the static instruction image.

Traditional Fetch
- Fetch from only one I-cache line per cycle: if the fetch PC points to the last instruction in a line, all you get is one instruction.
- Potentially worse for x86, since arbitrary byte-aligned instructions may straddle cache lines.
- Can only fetch instructions up to a taken branch.
- A branch misprediction causes a pipeline flush; the cost in cycles is roughly the number of stages from fetch to branch execute.

Trace Cache
- Multiple I-cache lines' worth of instructions per cycle.
- Can fetch past a taken branch — and even past multiple taken branches.

Traditional I-cache vs Decoded Trace Cache
- A trace builder fills the trace cache, which feeds rename/dispatch/allocate directly; decode does not add to the branch-mispredict pipeline depth.
- The trace cache holds decoded x86 instructions (uops) instead of raw bytes, so on a branch mispredict the decode stage is not exposed in the pipeline depth.

Less Common Case: Slower —
Trace Cache is Big
- Decoded instructions take more room: x86 instructions may take 1-15 bytes raw, while all decoded uops take the same amount of space
- Instruction duplication: instruction X may be redundantly stored in several traces (e.g., ABX, CDX, XYZ, EXY)
- Tradeoffs: a trace miss requires going to the L2, and decoder width is only 1
- Trace hit: 3 uops fetched per cycle; trace miss: 1 uop decoded (and therefore fetched) per cycle

Addition
- Common case: adds and simple ALU instructions; typically an add must occur in a single cycle
- P4 double-pumps its adders for 2 adds/cycle: a 2.0 GHz P4 has 4.0 GHz adders
- Staggered add X = A + B: the low halves A[0:15], B[0:15] produce X[0:15] in cycle 0, the high halves A[16:31], B[16:31] produce X[16:31] in cycle 0.5, and the result is complete by cycle 1

Common Case Fast
- Integer register file and bypass network feed: 2 AGUs (load address, store address), a double-pumped 2x ALU for simple instructions, a slow ALU for complex instructions
- So long as we are executing only simple ALU ops, we can execute two dependent ops per cycle
- 2 ALUs, so the peak is 4 simple ALU ops per cycle
- Can't sustain that, since the TC only delivers 3 uops per cycle; still useful, e.g., right after a D$ miss returns

Less Common Case: Slower
- Requires an extra cycle of bypass when not doing only simple ALU ops
- An operation may need an extra half cycle to finish
- Shifts are relatively slower in the P4 compared to their latencies in the P3; this can reduce performance of code optimized for older machines

Common Case: Cache Hit
- Cache hit/miss complicates the dynamic scheduler: we need to know instruction latency to schedule dependent instructions
- The common case is a cache hit, so to keep the scheduler pipelined, just assume loads always hit

Pipelined Scheduling
- In cycle 3, start scheduling B, assuming A hits in the cache
- At cycle 10, A's result bypasses to B, and B executes
[Figure: cycle-by-cycle Sch/Disp/RF/Ex pipeline diagram omitted]

Less Common Case is Slower
- Example code: A: MOV ECX, EAX; B: XOR EDX, ECX; C: SUB EAX, ECX; D: ADD EBX, EAX; E: NOR EAX, EDX; F: ADD EBX, EAX
[Figure: per-instruction scheduling timing diagram omitted]
Replay
- On a cache miss, dependent instructions have been speculatively mis-scheduled
- Wastes execution slots: other useful work could have executed instead
- Wastes a lot of power
- Adds latency: the miss is not known until cycle 9, so we start rescheduling dependents at cycle 10; the code could have executed faster if the miss were known earlier
[Figure: replay timing diagram omitted]

P4 Philosophy Overview
- Amdahl's Law: make the common case fast
- Trace cache, double-pumped ALUs, cache hits; there are other examples
- Resulted in very high frequency

P4 Pitfall
- Making the less common case too slow
- Performance is determined by both the common case and the uncommon case
- "Common" by what metric? It should be time
- Lesson: beware of "Slhadma" (Amdahl's, backwards): don't screw over the less common case

Lessons: Next-Gen P4 (P5)
- Cancelled in spring 2004
- Complexity of a super-duper-pipelined processor: time to market slipping, performance goals slipping
- Complexity became unmanageable; power and thermals out of control
- "Performance at all costs" no longer true

CS 6290: Manycore & Interconnect
Milos Prvulovic, Fall 2007

Interconnection Networks
- Classification: shared medium or switched
- Shared media (old Ethernet): all nodes attach to one medium
- Switched media (ATM): each node attaches to a switch
[Figure: shared vs. switched media (© 2003 Elsevier Science, USA; all rights reserved)]

Shared Media Networks
- Need arbitration to decide who gets to talk
- Arbitration can be
centralized or distributed
- Centralized: not used much for networks; needs a special arbiter device, or the nodes must elect an arbiter; performance suffers if the arbiter is far away
- Distributed arbitration: check if the medium is already in use (carrier sensing); if not in use, start sending; check whether another node is also sending (collision detection)
- On a collision, wait for a while and retry; "a while" is random, otherwise the collisions repeat forever
- Exponential backoff avoids wasting bandwidth on repeated collisions

Switched Networks
- Need switches, which introduces switching overheads
- But no time is wasted on arbitration and collisions
- Multiple transfers can be in progress (if they use different links, of course)
- Circuit or packet switching?
- Circuit switching: end-to-end connections; reserves links for a connection (e.g., the phone network)
- Packet switching: each packet is routed separately; links are used only when data is transferred (e.g., Internet Protocol)

Routing
- Shared media has trivial routing: broadcast
- In switched media we can have:
- Source-based: the source specifies the route
- Virtual circuits: an end-to-end route is created when the connection is made; switches forward packets along the route
- Destination-based: the source specifies only the destination; switches must route the packet toward it
- Routing can also be classified as deterministic (one route from a source to a destination) or adaptive (different routes can be used)

Routing Methods for Switches
- Store-and-forward: a switch receives the entire packet, then forwards it; if an error occurs when forwarding, the switch can resend
- Wormhole routing: a packet consists of flits (a few bytes each); the first flit contains the header with the destination address
- The switch gets the header and decides where to forward; the other flits are forwarded as they arrive; it looks like the packet is "worming" through the network
- If an error occurs along the way, the sender must resend; no switch has the entire packet to resend it

Cut-Through Routing
- What happens when a link is busy? The header arrives at a switch, but the outgoing link is busy. What do we do
with the other flits of the packet?
- Wormhole routing: stop the tail when the head stops; now each flit along the way blocks a link, and one busy link creates other busy links → traffic jam
- Cut-through routing: if the outgoing link is busy, receive and buffer the incoming flits; the buffered flits stay there until the link becomes free, then start worming out of the switch
- Cut-through needs packet-sized buffer space in each switch; wormhole routing needs to buffer only one flit per switch

Routing: Network Latency
- Switch delay: time from incoming to outgoing link in a switch
- Switches: number of switches along the way
- Transfer time: time to send the packet through a link
- Store-and-forward end-to-end transfer time: Switches × SwitchDelay + TransferTime × (Switches + 1)
- Wormhole or cut-through end-to-end transfer time: Switches × SwitchDelay + TransferTime
- Much better if there are many switches along the way; see the example on page 811

Switch Technology
- What do we want in a switch? Many input and output links (usually the same number of each); low contention inside the switch (best if there is none, so that only external links cause contention); short switching delay
- Crossbar: very low switching delay, no internal contention; but complexity grows as the square of the number of links, so we can not have too many (e.g., up to 64 in and 64 out)
- Omega network: build switches with more ports using small crossbars; lower complexity per link, but longer delay and more contention
[Figure: crossbar vs. Omega network built from small switch boxes]
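The two end-to-end latency formulas above are easy to check with a few lines of arithmetic. This is only a sketch of the formulas as given on the slide; the function names and the example numbers are invented here for illustration:

```python
def store_and_forward_latency(switches, switch_delay, transfer_time):
    # The whole packet is received and then retransmitted at every switch,
    # so the transfer time is paid on every one of the (switches + 1) links.
    return switches * switch_delay + transfer_time * (switches + 1)

def cut_through_latency(switches, switch_delay, transfer_time):
    # The header worms ahead and the rest of the packet pipelines behind it,
    # so only one link transfer time is exposed end to end.
    return switches * switch_delay + transfer_time

# Example: 4 switches, switch delay of 1 time unit, link transfer time of 10.
print(store_and_forward_latency(4, 1, 10))  # 54
print(cut_through_latency(4, 1, 10))        # 14
```

With many switches along the way, the TransferTime × (Switches + 1) term dominates store-and-forward latency, which is exactly why wormhole/cut-through is "much better" in that case.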
Network Topology
[Figure: (a) 2D grid/mesh of 16 nodes, (b) 2D torus of 16 nodes, (c) hypercube, (d) tree of 16 nodes (© 2003 Elsevier Science, USA; all rights reserved)]
- What do we want in a network topology? Many nodes, high bandwidth, low contention, low latency
- Low latency: few switches along any route; for each (src, dst) pair we choose the shortest route; the longest such route over all (src, dst) pairs is the network diameter; we want networks with a small diameter
- Low contention, high aggregate bandwidth: divide the network into two groups, each with half the nodes; the total bandwidth between the groups is the bisection bandwidth (actually, we use the minimum over all such bisections)

On-Chip Networks
- We'll have many cores on-chip, and we need a switched network to provide bandwidth
- The network needs to map well onto the chip surface: e.g., a hypercube is not great; a mesh or grid should work well, and a torus is OK too
- Limited ports per switch: the CPU plus 4 neighbors; all links are short, going only to neighbors
- Many parallel algorithms (matrices, grids, etc.) map well onto grids

Trouble with Shared Caches
- Private caches are OK: each is placed with its own processor
- We want a shared cache too: it fits more data than if broken into private caches (private caches replicate data), and it is dynamically shared (threads that need more space get more space)
- But how do we make a shared cache fast?

Non-Uniform Cache Architecture (NUCA)
- S-NUCA performance: fast access to nearby banks, slow access to far-away banks; the average is better than the worst case
[Figure: S-NUCA bank layout omitted]
- D-NUCA performance: fast access to nearby banks, slow access to far-away banks; the average is much better than the worst case
- But we keep moving
blocks around:
- Lots of power-hungry activity
- Need smart policies for block migration: move blocks less frequently, but keep most of the benefit of being able to move them

CS 6290: Intro & Trends
Milos Prvulovic, Fall 2007

Prerequisites
- I will assume you have detailed knowledge of:
- Pipelining: the classic 5-stage pipeline, pipeline hazards, stalls, etc.
- Caches: tag/index/offset, hit/miss, set-associativity, replacement policies, write-through/write-back, etc.
- Assembler and ISAs: RISC load/store, instruction encoding, caller-saved/callee-saved registers, stack pointer, frame pointer, function call/return code, etc.
- If you don't remember a few of these: read Appendices A, B, and C, and review the CS2200 textbook
- If you don't know what some of these are: take CS2200 before you take this class, or read the CS2200 textbook before next week's lectures

Logistics
- Internet: http://www.cc.gatech.edu/~milos/CS6290F07 (temporary address)
- Email: milos@cc; office hours TBA
- TAs: Minjang Kim (minjang@cc), office hours TBA; Sunjae Park (sunjaepark@gatech), office hours TBA
- Textbook: Computer Architecture: AQA, 4th Edition, by Hennessy and Patterson

What is Architecture?
- Original sense: taking a range of building materials and putting them together in desirable ways, to achieve a building suited to its purpose
- In computer science, it is similar: how parts are put together to achieve some overall goal
- Examples: the architecture of a chip, of the Internet, of an enterprise database system, an email system, a cable TV distribution system
- Adapted from David Clark's "What is Architecture?"

Why Computer Architecture?
- Exploit advances in technology; make things faster, smaller, cheaper, which enables new applications (Shrek, 20 years ago?)
- Make new things possible: accurate one-month weather forecasts, a cure for cancer, life-like virtual reality
- The advancement of computer architecture is vital for the advancement of all other areas of computing

Today
Trends in the computer industry: setting the stage for the what's, why's, and how's to come through the rest of this course.

Moore's Law (1965)
- Transistors per square inch: twice as many after 1.5-2 years
- Related trends: processor performance twice as fast after 18 months; memory capacity twice as much in <2 years

Moore's Not-Exactly-Law
- Not a law of nature, but fairly accurate over 42 years and counting
- "No exponential is forever... but we can delay 'forever'" (Gordon Moore, 2003)
- More about Moore's Law at http://www.intel.com/research/silicon/mooreslaw.htm

How to Use 1B or More Transistors
Computer architecture review:
- 1. Instruction set architecture (ISA): the interface between HW and SW; different ISAs may be more or less effective for different target application areas (review Appendix A)
- 2. Microarchitecture: techniques below the ISA level, transparent to software; e.g., pipelining, caching, branch prediction, superscalar, dynamic scheduling, clock gating (review Appendix C)

Performance Trend
[Figure: performance relative to the VAX 11/780, 1978 onward: i386 (0.016 GHz) through Alpha 21164, Alpha 21264/21264A, Pentium III (1.0 GHz), up to the Intel Xeon (3.6 GHz); growth roughly 52%/year in the middle of that period]
- Doubling the number of people on a project doesn't speed it up by 2x; similarly, 2x transistors does not automatically get you 2x performance
- The sustained growth was possible because of continued advances in computer architecture; much of computer architecture is about how you organize these resources to get more done

Price Trends
[Figure: Intel list price (1000-unit quantity) over time for the Pentium III, Pentium 4, and Pentium M, Feb 2004 - Nov 2005]
- Raw performance and performance per $ both improve

Price Trends: DRAM
[Figure: dollars per DRAM chip by generation, 16Kbit through 16Mbit (© 2003 Elsevier Science, USA; all rights reserved)]
- Similar trends for main memory capacity and
capacity per $: both getting better.

Why Do You Care About Prices?
- Target market and target prices place a limit on the cost of my processor
- Price: what I sell the part for; cost: what it costs me to make it
- Today we deal with price and cost; Wednesday we deal with performance and power
- Design decisions affect both cost and price: e.g., adding more cache may improve performance but increases cost
- Price-performance is often what we're trying to balance

Depends on the Class of Processor
- Price of system (USD): desktop $500-$5K; server $5K-$5M (e.g., high-end network routers); embedded $10-$100K
- Price of CPU (per processor): desktop $50-$500; server $200-$10K; embedded $0.01-$100
- Critical design issues: desktop: price-performance, graphics performance; server: throughput, availability, scalability; embedded: price, power, application-specific performance

Desktop Systems
- Examples: Intel Core 2 Duo, AMD Opteron
- Applications: everything, general purpose: office, Internet, multimedia, video games
- Goals: performance, price-performance, power (power affects cost, noise, size)

Servers
- Examples: IBM Power, Sun Niagara (T1), Intel Xeon
- Applications: infrastructure: file server, email server, business, web, e-commerce, databases
- Goals: throughput (transactions/second); availability (reliability, dependability, fault tolerance); cost is not a major issue

Embedded
- Examples: XScale, ARM, MIPS, x86: many varieties
- Applications: cell phones, mp3 players, game consoles, consumer electronics (refrigerator, microwave), automobiles: many varieties
- Goals: cost; power; sufficient performance (real-time performance); size (CPU size, memory footprint, chip count)

Fabrication Costs
- CPU die size greatly affects the cost of all systems: desktop, server, and embedded
- Current CPUs: 1-2 cm^2; embedded: much smaller
[Wafer photo omitted]

Yield
Manufacturing defects make some dies unusable:
- 13/16 working chips: 81.25% yield; 1/4 working chips: 25.0% yield
- Assuming $250 per wafer: 17 dies/wafer at 26.50% yield → ~4.5 working parts per wafer (≈$56 per working die); 52 dies/wafer at 81.25% yield → 42.25 working parts per wafer (≈$6 per working die)

Cost/Yield Equations
- Cost of die = Cost of wafer / (Dies per wafer × Die yield)
- Dies per wafer = π × (Wafer diameter / 2)^2 / Die area − π × Wafer diameter / sqrt(2 × Die area)
- Die yield = Wafer yield × (1 + (Defects per unit area × Die area) / α)^(−α)
- α is a parameter related to the complexity of manufacturing; typical α ≈ 4.0
- Typical defect density: 0.4 defects per cm^2 in 90nm, improving with time
- Wafer yield accounts for completely bad wafers

Interaction of Price and Performance
- Add a new architectural feature to the chip (for more performance, for less power, etc.) and the die size increases
- Fewer dies per wafer; more defective dies
- Die testing becomes more expensive: must test whether the feature works
- Die package may become more expensive: larger package, maybe more pins; if the feature needs more power, may need a better heat sink

Goal of Processor Design
- Maximize performance, within the constraints of: peak power, average power, thermals, reliability, manufacturing cost, implementation and verification complexity, time-to-market, cost to the manufacturer (Intel), cost to the OEM (Dell), cost to the end customer (you)
- Which really just says: maximize performance per $
- A huge multi-variable optimization problem; not all variables are independent

CS 6290: Branch Prediction

Control Dependencies
- Branches are very frequent: approx. 20% of all instructions
- We can not wait until we know where a branch goes:
- Long pipelines: the branch outcome is known after B cycles; no scheduling past the branch until the outcome is known
- Superscalars (e.g., 4-way): a branch every cycle or so; one cycle of work, then bubbles for B cycles

Surviving Branches: Prediction
- Predict branches, and predict them well
- Fetch, decode, etc. on the predicted path
- Option 1
: no execute until the branch is resolved
- Option 2: execute anyway (speculation)
- Recover from mispredictions: restart fetch from the correct path

Branch Prediction
- Need to know two things: whether the branch is taken or not (direction), and the target address if it is taken (target)
- Direct jumps, function calls: direction known (always taken), target easy to compute
- Conditional branches (typically PC-relative): direction difficult to predict, target easy to compute
- Indirect jumps, function returns: direction known (always taken), target difficult

Branch Prediction: Direction
- Needed for conditional branches; most branches are of this type
- Many, many kinds of predictors for this:
- Static: a fixed rule, or compiler annotation (e.g., BEQL: branch-if-equal, likely)
- Dynamic: hardware prediction, usually history-based; example: predict the direction is the same as the last time this branch was executed

Static Prediction
- Always predict NT: easy to implement; 30-40% accuracy: not so good
- Always predict T: 60-70% accuracy
- BTFNT (backward taken, forward not taken): loops usually have a few iterations, so this is like always predicting that the loop branch is taken; but we don't know the target until decode

One-Bit Branch Predictor
- Branch history table of 2^K entries, 1 bit per entry
- K bits of the branch instruction address index the table; use that entry to predict this branch: 0: predict not taken; 1: predict taken
- When the branch direction is resolved, go back into the table and update the entry: 0 if not taken, 1 if taken

One-Bit Branch Predictor (cont'd)
- Example: 0xDC08: for (i = 0; i < 100000; i++) { 0xDC44: if ((i % 100) == 0) tick(); 0xDC50: ... }

The Bit Is Not Enough!
- Example: a short loop, 8 iterations: taken 7 times, then not taken once
- The not-taken iteration is mispredicted (the branch was taken previously)
- Execute the same loop again: the first iteration is always mispredicted (the previous outcome was not taken), then 6 are predicted correctly, then the last one is mispredicted again
- Each fluke/anomaly in a stable pattern results in
two mispredicts per loop.
[Figure: per-iteration table of previous outcome, prediction, and branch outcome for the 0xDC08 and 0xDC44 branches, omitted]

Two Bits are Better Than One
- FSM for last-outcome prediction: two states, Predict NT and Predict T; transition on each outcome
- FSM for 2-bit counter (2bC) prediction: four states (00, 01, 10, 11); transition toward 11 on a taken outcome, toward 00 on a not-taken outcome; predict from the MSB

Example
- Initial training/warmup on the loop branch:
- 1bC: mispredicts on both edges of the pattern: 2 mispredicts per loop
- 2bC: after warmup, only the loop exit is mispredicted: only 1 mispredict per loop (0xDC08: 99.999% accuracy; 0xDC44: 99.0%)

Still Not Good Enough
[Figure: SPEC89 misprediction rates with a 2-bit predictor, 0-18%: nasa7, matrix300, tomcatv (≈1% or less: we can live with these), then doduc, spice, fpppp, gcc, espresso, eqntott, li (up to ~18%: this is bad) (© 2003 Elsevier Science, USA; all rights reserved)]

Importance of Branches
- 98% vs. 99% accuracy: who cares? Actually, that's a 2% misprediction rate vs. 1%: a halving of the number of mispredictions
- So what? If the misprediction rate is 50% and 1 in 5 instructions is a branch, the number of useful instructions we can fetch per misprediction is 5 × (1 + 1/2 + 1/2^2 + 1/2^3 + ...) = 10
- If we halve the miss rate down to 25%: 5 × (1 + 3/4 + (3/4)^2 + (3/4)^3 + ...) = 20
- Halving the miss rate doubles the number of useful instructions from which we can try to extract ILP

How About the Branch at 0xDC50?
- 1bC and 2bC don't do well on it: 50% at best
- But it's still obviously predictable. Why? It has a repeating pattern (NT)*
- How about other patterns, e.g., TTNTN?
- Use branch correlation: the outcome of a branch is often related to previous outcomes

Idea: Track the History of a Branch
- Index by PC, and keep a separate prediction for each recent-history value: e.g., prev = 0 → prediction T, prev = 1 → prediction N learns the alternating pattern

Deeper History Covers More Patterns
- The PC selects a set of counters, one per possible history: e.g., one counter used if prev = 010, another if prev = 111, ...
- What pattern has this branch predictor entry learned
? Answer: the repeating pattern 0011 0011 0011 ... — with enough history bits, each history value maps to a counter that has learned the next outcome.

Global vs. Local Branch History
- Local behavior: what is the predicted direction of branch A, given the outcomes of previous instances of branch A?
- Global behavior: what is the predicted direction of branch Z, given the outcomes of all previous branches A, B, ..., X, and Y? The number of previous branches tracked is limited by the history length

Why Global Correlations Exist
- Example: related branch conditions: p = findNode(foo); if (p is parent) { do something; ... may contain more branches ... } if (p is a child) { do something else; }
- The outcome of the second branch is always the opposite of the first branch

Other Global Correlations
- Testing same/similar conditions: code might test for NULL before a function call, and the function might test for NULL again
- In some cases it may be faster to recompute a condition rather than save a previous computation in memory and reload it
- Partial correlations: one branch could test cond1, and another branch (cond1 && cond2); if cond1 is false, the second branch can be predicted as false
- Multiple correlations: one branch tests cond1, a second tests cond2, and a third tests a combination of cond1 and cond2, which can always be predicted if the first two branches are known

Tournament Predictors
- No predictor is clearly the best; different branches exhibit different behaviors: some constant, some global, some local
- Idea: let's have a predictor to predict which predictor will predict better!

Tournament Hybrid Predictors
- A table of 2-bit meta-predictor counters, plus two component predictors, Pred0 and Pred1
- If the meta-counter MSB = 0, use Pred0's prediction, else use Pred1's
- Update: increment the meta-counter when Pred0 is right and Pred1 is wrong; decrement in the opposite case

Common Combinations
- Global history + local history
- Easy branches + global history: 2bC and gshare
- Short history + long history
- Many types of behaviors, many combinations
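The one-bit and two-bit counter schemes above are easy to simulate. The sketch below (all names are invented here, not from the slides) replays the earlier 8-iteration loop example, where a 1-bit predictor mispredicts twice per loop while a 2-bit saturating counter settles down to once per loop:

```python
def simulate(outcomes, n_bits):
    """Count mispredictions of a single n-bit saturating-counter predictor."""
    counter, max_val = 0, (1 << n_bits) - 1
    mispredicts = 0
    for taken in outcomes:
        predict_taken = counter >= (max_val + 1) // 2  # predict from the MSB
        if predict_taken != taken:
            mispredicts += 1
        # Saturating update: count up on taken, down on not-taken.
        counter = min(counter + 1, max_val) if taken else max(counter - 1, 0)
    return mispredicts

# A short loop, 8 iterations: taken 7 times, then not taken once; run 100 times.
loop = ([True] * 7 + [False]) * 100
print(simulate(loop, 1))  # 200: 2 mispredicts per loop execution
print(simulate(loop, 2))  # 102: 3 during warmup, then 1 per loop execution
```

A real predictor indexes a table of such counters with K bits of the branch PC (and, for correlating predictors, some history bits); the single counter here models one well-behaved table entry.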
Direction Predictor Accuracy
[Figure: conditional branch misprediction rate vs. total predictor size (32 through 512) for local 2-bit predictors, correlating predictors, and tournament predictors (© 2003 Elsevier Science, USA; all rights reserved)]

Target Address Prediction
- Branch Target Buffer: the IF stage needs to know the fetch address every cycle, so we need the target address one cycle after fetching a branch
- For some branches (e.g., indirect), the target is known only after the EX stage, which is way too late
- Even easily-computed branch targets need to wait until the instruction is decoded and the direction predicted in the ID stage: still at least one cycle too late
- So we have a quick-and-dirty predictor for the target that only needs the address of the branch instruction

Branch Target Buffer (BTB)
- Indexed by instruction address; we don't even know yet whether it is a branch
- If the address matches a BTB entry, it is predicted to be a branch
- The BTB entry tells whether it is taken (direction) and where it goes if taken (target)
- The BTB takes only the instruction address, so while we fetch one instruction in the IF stage, we are already predicting where to fetch the next one from
- Direction prediction can be factored out into a separate table
[Figure: BTB lookup: if the PC of the instruction to fetch matches an entry, the instruction is a branch predicted taken, and the predicted PC is used as the next PC; on no match, the instruction is not predicted to be a (taken) branch, so proceed normally (© 2003 Elsevier Science, USA; all rights reserved)]

Return Address Stack (RAS)
- Function returns are frequent, yet: the address is difficult to compute (have to wait until the EX stage is done to know it), and difficult to predict with a BTB (a function can be called from multiple places)
- But the return address is actually easy to predict: it is the address after the last call instruction that we haven't returned from yet
- Hence, the Return Address Stack

Return Address Stack (RAS), cont'd
- A call pushes the return address onto the RAS
- When a return instruction is decoded, pop the predicted return address off the RAS
- Accurate prediction even with a small RAS
[Figure: misprediction rate (0-50%) vs. number of entries in the return stack, for fpppp, espresso, doduc, li, tomcatv, gcc (© 2003 Elsevier Science, USA; all rights reserved)]

Example 1: Alpha 21264
- Hybrid predictor: a local-history component (1024 10-bit local histories feeding 3-bit counters), a global prediction component (4096 × 2 bits), and a choice predictor (4096 × 2 bits) driven by the global path history
- The meta (choice) predictor combines the local-history and global-history components

Example 2: Pentium M
- Also a hybrid (bimodal + local + global predictors), but uses a tag-based selection mechanism

Pentium M (cont'd)
- The local component also has support for loops: it accurately predicts branches of the form (T^k)N

CS 6290: Instruction-Level Parallelism

Instruction-Level Parallelism (ILP)
- Basic idea: execute several instructions in parallel
- We already do pipelining, but it can only push through at most 1 inst/cycle; we want multiple instructions/cycle
- Yes, it gets a bit complicated: more transistors and logic
- That's how we got from the 486 (pipelined) to the Pentium and beyond

Sequential ISA Semantics
- The ISA defines instruction execution one by one. For I1 (ADD R1, R2, R3): fetch the instruction, read R2 and R3, do the addition, write R1, increment the PC. Now repeat for I2
- Darth Sidious: "Begin landing your troops." Nute Gunray: "Ah, my lord, is that... legal?" Darth Sidious: "I will make it legal."

It's Legal If We Don't Get Caught
- Pipelining already breaks the rules: we fetch I2 before I1 has finished
- Parallelism exists in that we perform different operations (fetch, decode, ...) on several different instructions in parallel
- As mentioned, this has a limit of 1 IPC

Define "not
get caught": the program executes correctly
- OK, what's "correct"? As defined by the ISA: the same processor state (registers, PC, memory) as if you had executed one instruction at a time
- You can squash instructions that don't correspond to the correct execution: e.g., misfetched instructions following a taken branch, or instructions after a page fault

Example: Toll Booth
- Caravanning on a trip: cars D, C, B, A must stay in order to prevent losing anyone
- When we get to the toll, everyone gets in the same lane to stay in order; this works, but it's slow: everyone has to wait for D to get through the toll booth
- Instead: go through two at a time, in parallel, and restore the order after the booth

Illusion of Sequentiality
- So long as everything looks OK to the outside world, you can do whatever you want
- Outside appearance: the architecture (ISA)
- "Whatever you want": the microarchitecture (uArch); basically everything not explicitly defined in the ISA: pipelining, caches, branch prediction, etc.

Back to ILP: But How?
- Simple ILP recipe: read and decode a few instructions each cycle (we can't execute >1 IPC if we're not fetching >1 IPC)
- If instructions are independent, do them at the same time; if not, do them one at a time

Example
- A: ADD R1 = R2 + R3
- B: SUB R4 = R1 - R5
- C: XOR R6 = R7 ^ R8
- D: Store R6 → 0(R4)
- E: MUL R3 = R5 * R9
- F: ADD R7 = R1 + R6
- G: SHL R8 = R7 << R4

Ex: Original Pentium
- Fetch up to 32 bytes; decode up to 2 instructions (Decode1, Decode2); read operands and check dependencies; execute (x2); writeback (x2)
- Example for a Pentium-like machine: the same code A-G, scheduled up to two at a time subject to its dependencies

This is "Superscalar"
- Scalar CPU: executes one instruction at a time (includes pipelined processors)
- Vector CPU: executes one instruction at a time, but on vector data: X[0..7] + Y[0..7] is one instruction, whereas on a scalar processor you would need eight
- Superscalar: can execute
more than one unrelated instruction at a time (e.g., ADD X, Y and MUL W, Z together)

Scheduling
- The central problem in ILP processing: determining when parallelism (independent instructions) exists
- In the Pentium example, the decode stage checks multiple conditions:
- Is there a data dependency? Does one instruction generate a value needed by the other? Do both instructions write to the same register?
- Is there a structural dependency? Most CPUs only have one divider, so two divides cannot execute at the same time

Scheduling (cont'd)
- How many instructions are we looking at? 3-6 is typical today
- A CPU that can ideally do N instructions per cycle is called N-way superscalar, N-issue superscalar, or simply N-way, N-issue, or N-wide
- N is the peak execution bandwidth; this N is also called the issue width

Dependences (Dependencies)
- Data dependencies: RAW (read-after-write: true dependence), WAR (anti-dependence), WAW (output dependence)
- Control dependence: when the following instructions depend on the outcome of a previous branch/jump

Data Dependencies
- Register dependencies: RAW, WAR, WAW, based on register number
- Memory dependencies: based on memory address; this is harder: register names are known at decode, but memory addresses are not known until execute

Hazards
- When two instructions that have one or more dependences between them occur close enough that changing the instruction order would change the outcome of the program
- Not all dependencies lead to hazards

ILP
- Arrange instructions based on dependencies: ILP = (number of instructions) / (longest path)
- Example: I1: R2 = 17; I2: R1 = 49; I3: R3 = 8; I4: R5 = LOAD 0(R3); I5: R4 = R1 op R2; I6: R7 = R4 op R3; I7: R6 = R4 op R5

Dynamic Out-of-Order Scheduling
- Program code: I1: ADD R1 = R2 + R3; I2: SUB R4 = R1 - R5; I3: AND R6 = R1 & R7; I4: OR R8 = R6 | ...; I5: XOR R10 = R2 ^ R11
- Cycle 1: operands ready: I1, I5; start I1, I5
- Cycle 2: operands ready: I2, I3, I4; start I2, I3
- Window size W: how many instructions ahead do we look? Do not
confuse this with the issue width N
- E.g., a 4-issue out-of-order processor can have a 128-entry window: it can look at up to 128 instructions at a time

Ordering
- In the previous example, I5 executed before I2, I3, and I4
- How do we maintain the illusion of sequentiality?
- Toll booth analogy: one-at-a-time keeps everyone in order but is slow; with a 4-issue OOO "toll booth", one car hands the agent a $100 bill (it takes a while to count the change) while others pass, and the original order must be restored afterwards

ILP != IPC
- ILP is an attribute of the program; it also depends on the ISA and compiler (e.g., SIMD, FMAC, etc. can change the instruction count and the shape of the dataflow graph)
- IPC depends on the actual machine implementation; ILP is an upper bound on IPC
- Achievable IPC depends on instruction latencies, cache hit rates, branch prediction rates, structural conflicts, instruction window size, etc., etc., etc.
- The next several lectures will be about how to build a processor to exploit ILP

CS 6290: Dependences and Register Renaming

ILP is Bounded
- For any sequence of instructions, the available parallelism is limited
- Hazards/dependencies are what limit the ILP: data dependencies, control dependencies, memory dependencies

Types of Data Dependencies
- Assume A comes before B in program order
- RAW (read-after-write): A writes to a location, B reads from that location; therefore B has a RAW dependency on A; also called a true dependency

Data Dependencies (cont'd)
- WAR (write-after-read): A reads from a location, B writes to that location; therefore B has a WAR dependency on A; if B executes before A has read its operand, the operand will be lost; also called an anti-dependence
- WAW (write-after-write): A writes to a location, B writes to the same location; if B writes first and then A writes, the location ends up with the wrong value; also called an output dependence

Control Dependencies
- If we have a conditional branch, then until we actually know the
outcome, all later instructions must wait
- That is, all instructions are control dependent on all earlier branches
- This is true for unconditional branches as well (e.g., we can't return from a function until we've loaded the return address)

Memory Dependencies
- Basically similar to regular (register) data dependencies: RAW, WAR, WAW
- However, the exact location is not known:
  A: STORE R1, 0(R2)
  B: LOAD  R5, 24(R8)
  C: STORE R3, 8(R9)
  - A RAW dependence (A -> B) exists if R2+0 == R8+24
  - A WAR dependence (B -> C) exists if R8+24 == R9+8
  - A WAW dependence (A -> C) exists if R2+0 == R9+8

Impact of Ignoring Dependencies
- Read-After-Write:  A: R1 = R2 + R3,  B: R4 = R1 * R4
- Write-After-Read:  A: R1 = R3 / R4,  B: R3 = R2 * R4
- Write-After-Write: A: R1 = R2 + R3,  B: R1 = R3 * R4
- Executing B before A in any of these pairs produces the wrong result

Eliminating WAR Dependencies
- WAR dependencies come from reusing registers:
  A: R1 = R3 / R4, B: R3 = R2 * R4  becomes  A: R1 = R3 / R4, B: R5 = R2 * R4
- With no dependencies, reordering still produces the correct results

Eliminating WAW Dependencies
- WAW dependencies also come from reusing registers:
  A: R1 = R2 + R3, B: R1 = R3 * R4  becomes  A: R5 = R2 + R3, B: R1 = R3 * R4
- The same solution works

So Why Do False Dep's Exist?
- Finite number of registers: at some point you're forced to overwrite somewhere
  - Most RISC ISAs: 32 registers; x86: only 8; x86-64: 16
- Hence WAR and WAW are also called name dependencies (i.e., they come from the names of the registers)
- So why not just add more registers?
  - Thought exercise: what if you had infinite registers?

Reuse Is Inevitable
- Loops and code reuse
  - If you write a value to R1 in a loop body, then R1 will be reused every iteration, which induces many false dep's
  - Loop unrolling can help a little, but you will run out of registers at some point anyway, and it trades off against code bloat
- Function calls result in similar register reuse
  - If printf writes to R1, then every call will result in a reuse of R1
  - Inlining can help a little for short functions, with the same caveats

Obvious Solution: More Registers?
- Adding more registers to the ISA is BAD: changing the ISA can break binary compatibility, and all code must be recompiled.
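The RAW/WAR/WAW definitions above are mechanical enough to sketch in a few lines of Python. This is a toy checker, not anything a real processor runs; the tuple encoding of instructions is invented here purely for illustration.

```python
# Toy dependence checker for the RAW/WAR/WAW definitions above.
# An instruction is encoded as (dest, sources); this encoding is made up
# for illustration only.

def deps(a, b):
    """Dependences that B (later in program order) has on earlier A."""
    a_dest, a_srcs = a
    b_dest, b_srcs = b
    found = set()
    if a_dest in b_srcs:
        found.add("RAW")   # true dependence: B reads what A wrote
    if b_dest in a_srcs:
        found.add("WAR")   # anti-dependence: B overwrites A's source
    if b_dest == a_dest:
        found.add("WAW")   # output dependence: both write the same register
    return found

A = ("R1", ["R3", "R4"])   # A: R1 = R3 / R4
B = ("R3", ["R2", "R4"])   # B: R3 = R2 * R4
print(deps(A, B))          # WAR on R3

# Renaming B's destination (R3 -> R5) removes the false dependence:
B_renamed = ("R5", ["R2", "R4"])
print(deps(A, B_renamed))  # empty set: A and B may now be reordered
```

The second pair mirrors the WAR example from the slides: renaming the destination register is exactly the step that hardware register renaming, discussed next, automates.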
- More ISA registers also do not address register overwriting due to code reuse from loops and function calls
- Not a scalable solution
- (x86-64 does add registers, but it does so in a mostly backwards-compatible fashion)

Better Solution: HW Register Renaming
- Give the processor more registers than specified by the ISA
- Temporarily map the ISA ("logical" or "architected") registers to physical registers, to avoid overwrites
- Components:
  - A mapping mechanism
  - Physical registers (allocated vs. free)
  - An allocation/deallocation mechanism

Register Renaming
- Example program code:
  I1: ADD R1, R2, R3
  I2: SUB R2, R1, R6
  I3: AND R6, R11, R7
  I4: OR  R8, R5, R2
  I5: XOR R2, R4, R11
- I3 cannot execute before I2 (WAR), because I3 will overwrite R6
- I5 cannot go before I2 (WAW), because I2, when it goes, will overwrite R2 with a stale value

Register Renaming
- Solution: let's give I2's result a temporary name/location, e.g., "S"
- But I4 uses that value, so we must also change it to S; in fact, all uses of R2 from I2 up to the next instruction that writes to R2 must now be changed to S
- We remove WAW deps in the same way: change R2 in I5 (and subsequent instructions) to "T", and R6 in I3 to "U"
  I1: ADD R1, R2, R3
  I2: SUB S,  R1, R6
  I3: AND U,  R11, R7
  I4: OR  R8, R5, S
  I5: XOR T,  R4, R11

Register Renaming
- Implementation: we need space for S, T, U, etc.
- How do we know when to rename a register?
  - Simple solution: do renaming for every instruction
  - Change the name of a register each time we decode an instruction that will write to it
  - Remember what name we gave it

Register File Organization
- We need some physical structure to store the register values
  - RAT (Register Alias Table): maps each architected register to its current physical register
  - ARF (Architected Register File): the outside world sees only the ARF
  - PRF (Physical Register File): one PREG per instruction in flight

Putting It All Together
- Renaming draws each new destination register from a free
pool of physical registers (free pool: X9, X11, X7, X2, X13, X4, X8, X12, X3, X5, ...), with the RAT, ARF, and PRF tracking the mappings.

Renaming in action, for a loop body:
  top: R1 = R2 + R3
       R2 = R4 + R1
       R1 = R3 + R6
       R2 = R1 + R2
       R3 = R1 >> 1
       BNEZ R3, top
Each destination write is renamed to the next physical register from the free pool (R1 -> X5, R2 -> X11, and so on), so every iteration writes fresh physical registers while the RAT always maps each architected register to its most recent writer.

Even Physical Registers Are Limited
- We keep using new physical registers; what happens when we run out?
- There must be a way to recycle them
- When can we recycle a physical register? When we have given its value to all instructions that use it as a source operand
  - This is not as easy as it sounds

Instruction Commit (leaving the pipeline)
- The architected register file (ARF) contains the official processor state
- When an instruction leaves the pipeline, it makes its result official by updating the ARF
- The ARF now contains the correct value, so update the RAT
- The instruction's physical register is no longer needed: return it to the physical register free pool

Careful with the RAT Update
- Update the ARF as usual and deallocate the physical register, but don't touch that RAT!
  - Someone else is the most recent writer to R3
- At some point in the future, the newer writer of R3 exits
  - That instruction was the most recent writer, so now update the RAT
  - And deallocate its physical register

CS 6290: Manycore & Interconnect
Milos Prvulovic, Fall 2007

Interconnection Networks
- Classification: shared medium or switched
  - Shared media (e.g., old Ethernet): all nodes attached to one medium
  - Switched media (e.g., ATM): nodes connected through switches
  [Figure: shared vs. switched media. (c) 2003 Elsevier Science (USA), all rights reserved]

Shared Media Networks
- Need arbitration to decide who gets to talk
- Arbitration can be centralized or distributed
- Centralized arbitration is not used much for networks: it needs a special arbiter device, or the nodes must elect an arbiter.
- Performance is poor if the arbiter is far away
- Distributed arbitration:
  - Check if the medium is already in use (carrier sensing)
  - If the medium is not in use now, start sending
  - Check if another node is also sending (collision detection)
  - If there is a collision, wait for a while and retry
    - "For a while" is random, otherwise collisions repeat forever
    - Exponential backoff avoids wasting bandwidth on collisions

Switched Networks
- Need switches, which introduces switching overheads
- No time wasted on arbitration and collisions
- Multiple transfers can be in progress (if they use different links, of course)
- Circuit or packet switching
  - Circuit switching: end-to-end connections; reserves links for a connection (e.g., the phone network)
  - Packet switching: each packet is routed separately; links are used only when data is transferred (e.g., Internet Protocol)

Routing
- Shared media has trivial routing: broadcast
- In switched media we can have:
  - Source-based routing: the source specifies the route
  - Virtual circuits: an end-to-end route is created when the connection is made (set up the route, then switches forward packets along it)
  - Destination-based routing: the source specifies only the destination; switches must route the packet toward it
- Routing can also be classified as:
  - Deterministic: one route from a source to a destination
  - Adaptive: different routes can be used

Routing Methods for Switches
- Store-and-forward
  - A switch receives the entire packet, then forwards it
  - If an error occurs when forwarding, the switch can resend
- Wormhole routing
  - A packet consists of flits (a few bytes each)
  - The first flit contains the header with the destination address
  - The switch gets the header and decides where to forward
  - The other flits are forwarded as they arrive: the packet looks like it is "worming" through the network
  - If an error occurs along the way, the sender must resend (no switch has the entire packet to resend it)

Cut-Through Routing
- What happens when a link is busy? The header arrives at a switch, but the outgoing link is busy
- What do we do with the other flits of the packet?
- Wormhole routing: stop the tail when the head stops; now each flit along the way
blocks a link
  - One busy link creates other busy links -> traffic jam
- Cut-through routing: if the outgoing link is busy, receive and buffer the incoming flits
  - The buffered flits stay there until the link becomes free
  - When the link is free, the flits start worming out of the switch
  - Needs packet-sized buffer space in each switch (wormhole routing: a switch needs to buffer only one flit)

Routing & Network Latency
- Switch delay: time from the incoming to the outgoing link in a switch
- Switches: number of switches along the way
- Transfer time: time to send the packet through a link
- Store-and-forward end-to-end transfer time:
  Switches * SwitchDelay + TransferTime * (Switches + 1)
- Wormhole or cut-through end-to-end transfer time:
  Switches * SwitchDelay + TransferTime
  - Much better if there are many switches along the way
- See the example on page 811

Switch Technology
- What do we want in a switch?
  - Many input and output links (usually the number of input and output links is the same)
  - Low contention inside the switch (best if there is none, so that only external links cause contention)
  - Short switching delay
- Crossbar
  - Very low switching delay, no internal contention
  - Complexity grows as the square of the number of links
  - Cannot have too many links (e.g., up to 64 in and 64 out)
- Omega network
  - Build switches with more ports using small crossbars ("switch boxes")
  - Lower complexity per link, but longer delay and more contention
  [Figure: (a) crossbar, (b) Omega network built from 2x2 switch boxes. (c) 2003 Elsevier Science (USA), all rights reserved]

Network Topology
[Figure: (a) 2D grid or mesh of 16 nodes, (b) 2D
torus of 16 nodes, (c) hypercube of 16 nodes (16 = 2^4, so a 4-cube), (d) tree of 16 nodes. (c) 2003 Elsevier Science (USA), all rights reserved]

Network Topology
- What do we want in a network topology? Many nodes, high bandwidth, low contention, low latency
- Low latency: few switches along any route
  - For each (src, dest) pair we choose the shortest route
  - The longest such route over all (src, dest) pairs is the network diameter
  - We want networks with a small diameter
- Low contention / high aggregate bandwidth
  - Divide the network into two groups, each with half the nodes
  - The total bandwidth between the groups is the bisection bandwidth
  - Actually, we use the minimum over all such bisections

On-Chip Networks
- We'll have many cores on-chip
- Need a switched network to provide bandwidth
- Need to map well onto the chip surface
  - E.g., a hypercube is not great; a mesh or grid should work well (a torus is OK too)
  - Limited ports per switch: CPU & 4 neighbors
  - All links are short, going only to neighbors
- Many parallel algorithms map well onto grids (matrices, grids, etc.)

Trouble with Shared Caches
- Private caches are OK: each is placed with its own processor
- We want a shared cache too
  - It fits more data than if broken into private caches (private caches replicate data)
  - It is dynamically shared: threads that need more space get more space
- But how do we make a shared cache fast?

Non-Uniform Cache Architecture (NUCA)
- S-NUCA performance
  - Fast access to nearby banks, slow access to far-away banks
  - Average latency better than the worst case

D-NUCA Performance
- Fast access to nearby banks, slow access to far-away banks
  - Average latency much better than the worst case
- But we keep moving blocks: lots of power-hungry activity
- Need smart policies for block migration: move blocks less frequently, but still get
most of the benefit of being able to move blocks.
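The S-NUCA vs. D-NUCA trade-off above can be illustrated with a toy latency model. All cycle counts, the bank layout, and the access patterns below are invented for illustration; the point is only that migrating hot blocks toward the requesting core lowers the average hit latency.

```python
# Toy model of S-NUCA vs D-NUCA average hit latency (all numbers invented).
# Bank i sits i hops away from the requesting core;
# hit latency = base cost + per-hop cost * distance.

BASE, PER_HOP = 4, 2                 # hypothetical cycle costs
BANK_DISTANCES = list(range(16))     # 16 banks at increasing distances

def latency(bank):
    return BASE + PER_HOP * BANK_DISTANCES[bank]

def avg_latency(accesses):
    """accesses: list of bank numbers touched by the program."""
    return sum(latency(b) for b in accesses) / len(accesses)

# S-NUCA: block placement is fixed by address, so the accesses of a
# program spread across all 16 banks.
static = [b for b in range(16) for _ in range(10)]

# D-NUCA: migration gradually pulls hot blocks into the nearest banks,
# so most accesses end up hitting the 4 banks closest to the core.
migrated = [b % 4 for b in range(16) for _ in range(10)]

print(avg_latency(static))    # average over all 16 banks
print(avg_latency(migrated))  # average over the 4 nearest banks
```

Migration buys a much lower average latency in this sketch, but, as the slides note, every block move is power-hungry activity, so a real D-NUCA policy migrates only blocks that are demonstrably hot.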