Computer Architecture CS 6810
These 17 pages of class notes were uploaded by Marian Kertzmann DVM on Monday, October 26, 2015. The notes belong to CS 6810 at the University of Utah, taught by Alan Davis in Fall. For similar materials see /class/229976/cs-6810-university-of-utah in Computer Science at the University of Utah.
Lecture 11: ILP Innovations and SMT

Today: out-of-order example, ILP innovations, SMT (Section 3.5 and supplementary notes).

OOO Example

Assumptions: same as HW 4, except there are 36 physical registers and 32 logical registers. Estimate the issue time, completion time, and commit time for the sample code.

  Original code       Renamed code         In    Iss    Comp   Comm
  ADD R1, R2, R3      ADD P33, P2, P3      i     i+1    i+6    i+6
  LD  R2, 8(R1)       LD  P34, 8(P33)      i     i+2    i+8    i+8
  ADD R2, R2, 8       ADD P35, P34, 8      i     i+4    i+9    i+9
  ST  R1, (R3)        ST  P33, (P3)        i     i+2    i+8    i+9
  SUB R1, R1, R5      SUB P36, P33, P5     i+1   i+2    i+7    i+9
  LD  R1, 8(R2)       LD  P1, 8(P35)       i+7   i+8    i+14   i+14
  ADD R1, R1, R2      ADD P2, P1, P35      i+9   i+10   i+15   i+15

After the SUB takes P36, the free list is empty: the last two instructions must wait in rename until commits free physical registers (P1 and P2), which delays their dispatch to cycles i+7 and i+9.

Reducing Stalls in Rename/Regfile

- Larger ROB / register file / issue queue.
- Virtual physical registers: assign virtual register names to instructions, but assign a physical register only when the value is made available.
- Runahead: while a long instruction waits, let the thread run ahead to prefetch; such a thread can deallocate resources more aggressively than a processor supporting precise execution.
- Two-level register files: values kept around in the register file only for precise exceptions can be moved to the 2nd level.

Stalls in Issue Queue

- Two-level issue queues: the 2nd level contains instructions that are less likely to be woken up in the near future.
- Value prediction: tries to circumvent RAW hazards.
- Memory dependence prediction: allows a load to execute even if there are prior stores with unresolved addresses.
- Load hit prediction: instructions are scheduled early, assuming that the load will hit
in cache.

Functional Units

- Clustering allows quick bypass among a small group of functional units.
- FUs can also be associated with a subset of the register file and issue queue.

Thread-Level Parallelism

Motivation:
- A single thread leaves a processor under-utilized for most of the time.
- Doubling processor area barely improves single-thread performance.

Strategies for thread-level parallelism:
- Multiple threads share the same large processor: reduces under-utilization and allows efficient resource allocation (Simultaneous Multi-Threading, SMT).
- Each thread executes on its own mini-processor: simple design, low interference between threads (Chip Multi-Processing, CMP).

How are Resources Shared?

[Figure: each box represents an issue slot for a functional unit; peak throughput is 4 IPC. Cycle-by-cycle slot occupancy by Threads 1-4, plus idle slots, is compared for a superscalar, fine-grained multithreading, and simultaneous multithreading.]

- A superscalar processor has high under-utilization: it cannot find enough work every cycle, especially when there is a cache miss.
- Fine-grained multithreading can only issue instructions from a single thread in a cycle: it cannot find maximum work every cycle, but cache misses can be tolerated.
- Simultaneous multithreading can issue instructions from any thread every cycle: it has the highest probability of finding work for every issue slot.

What Resources are Shared?

- Multiple threads are simultaneously active; in other words, a new thread can start without a context switch.
- For correctness, each thread needs its own PC, its own logical registers, and its own mapping from logical to physical registers.
- For performance, each thread could have its own ROB (so that a stall in one thread does not stall commit in other threads), I-cache, branch predictor, D-cache, etc., for low interference, although note that more sharing brings better utilization of resources.
- Each additional thread costs only a PC, a rename table, and a ROB: cheap.

Pipeline Structure

[Figure: SMT pipeline organizations with private versus shared front-ends feeding a shared execution engine. Open question: what about the RAS and the LSQ?]

Resource Sharing

Example: two threads renamed into one shared physical register file.

  Thread 1:
  R1 <- R1 + R2    renames to    P73 <- P1 + P2
  R3 <- R1 + R4    renames to    P74 <- P73 + P4
  R5 <- R1 + R3    renames to    P75 <- P73 + P74

  Thread 2:
  R2 <- R1 + R2    renames to    P76 <- P33 + P34
  R5 <- R1 + R2    renames to    P77 <- P33 + P76
  R3 <- R5 + R3    renames to    P78 <- P77 + P35

[Figure: both threads pass through instruction fetch and rename into a shared issue queue, register file, and FUs.]

Performance Implications of SMT

- Single-thread performance is likely to go down: caches, branch predictors, registers, etc. are shared. This effect can be mitigated by prioritizing one thread.
- While fetching instructions, thread priority can dramatically influence total throughput. A widely accepted heuristic (ICOUNT): fetch such that each thread has an equal share of processor resources.
- With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4x.
- The Alpha 21464 and the Intel Pentium 4 are examples of SMT.

Pentium 4 Hyper-Threading

- Two threads: the Linux operating system operates as if it is executing on a two-processor system.
- When there is only one available thread, it behaves like a regular single-threaded superscalar processor.
- Statically divided resources: ROB, LSQ, issue queue. A slow thread will not cripple throughput, but this might not scale.
- Dynamically shared: trace cache and decode (fine-grained multithreaded, round-robin), FUs, data cache, branch predictor.

Multi-Programmed Speedup

[Table: best, worst, and average multi-programmed speedup for each SPEC CPU2000 benchmark; values range roughly from 0.9 to 1.6.]

- sixtrack and eon do not degrade their partners (small working sets).
- swim and art degrade their partners (cache contention).
- Best combination: swim & sixtrack; worst combination: swim & art.
- Static partitioning ensures low interference: the worst speedup observed is 0.9.
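The free-list renaming mechanics behind the OOO example at the top of the lecture can be sketched in a few lines of Python. This is a minimal illustration assuming the lecture's parameters (32 logical registers R1..R32, 36 physical registers P1..P36); the function names `rename` and `commit` are invented for this sketch, not taken from any real simulator.

```python
# Logical register Ri starts out mapped to physical register Pi;
# the four spare physical registers form the free list.
rename_table = {f"R{i}": f"P{i}" for i in range(1, 33)}
free_list = ["P33", "P34", "P35", "P36"]

def rename(dest, srcs):
    """Rename one instruction 'dest <- op(srcs)'.

    Sources read the current mapping; the destination takes a fresh
    physical register. The destination's previous mapping is returned
    so it can be freed when this instruction later commits. An empty
    free list means the renamer must stall (modeled as an exception).
    """
    phys_srcs = [rename_table[s] for s in srcs]
    if dest is None:                       # a store has no register dest
        return None, phys_srcs, None
    if not free_list:
        raise RuntimeError("rename stall: no free physical registers")
    prev = rename_table[dest]              # freed at this instr's commit
    rename_table[dest] = free_list.pop(0)
    return rename_table[dest], phys_srcs, prev

def commit(prev):
    """At commit, the destination's previous physical register is freed."""
    if prev is not None:
        free_list.append(prev)

# The sample code from the OOO example renames exactly as on the slides:
print(rename("R1", ["R2", "R3"]))   # ADD -> ('P33', ['P2', 'P3'], 'P1')
print(rename("R2", ["R1"]))         # LD  -> ('P34', ['P33'], 'P2')
print(rename("R2", ["R2"]))         # ADD -> ('P35', ['P34'], 'P34')
print(rename(None, ["R1", "R3"]))   # ST  -> (None, ['P33', 'P3'], None)
print(rename("R1", ["R1", "R5"]))   # SUB -> ('P36', ['P33', 'P5'], 'P33')
# LD R1, 8(R2) must now wait: the free list is empty. Once the first ADD
# commits and frees P1, the load renames to LD P1, 8(P35):
commit("P1")
print(rename("R1", ["R2"]))         # LD  -> ('P1', ['P35'], 'P36')
```

Running the sequence reproduces the renamed code in the timing table, including the stall before the last load: the dispatch delay of the final two instructions corresponds exactly to waiting for `commit` to refill the free list.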
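The ICOUNT heuristic mentioned under "Performance Implications of SMT" can be sketched as a one-line fetch policy: each cycle, fetch from the thread with the fewest instructions in flight in the front end, which pushes every thread toward an equal share of processor resources. The per-thread counters and thread names below are invented for illustration.

```python
def icount_pick(in_flight):
    """Return the id of the thread holding the fewest in-flight instructions."""
    return min(in_flight, key=in_flight.get)

# Thread T2 has stalled (e.g. on a cache miss) and piled up instructions
# in the front end, so ICOUNT steers fetch bandwidth away from it:
in_flight = {"T1": 12, "T2": 30, "T3": 7, "T4": 12}
print(icount_pick(in_flight))   # -> T3
```

The appeal of the policy is visible even in this toy form: a thread blocked on a long-latency miss accumulates queued instructions and automatically stops receiving fetch slots, instead of clogging the shared issue queue.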