Computer Architecture CMPE 202
These 40 pages of class notes were uploaded by Buck Ankunding on Monday, September 7, 2015. The notes belong to CMPE 202 at University of California - Santa Cruz, taught by Staff in Fall. Since upload, they have received 52 views.
Date Created: 09/07/15
Understanding the IA-64 Architecture
Gautam Doshi, Senior Architect, IA-64 Processor Division, Intel Corporation
August 31 to September 2, 1999

Agenda
- Today's Architecture Challenges
- IA-64 Architecture Performance Features
- High-end Application Characteristics
- End User Benefits of IA-64 Features

- Programmer: programs in high level lang
- Compiler: compiles program to machine inst
- Machine: executes these instructions
- High Level Lang: programmer's vocabulary
- Inst Set Arch: compiler's vocabulary
- Architecture determines what the compiler can express
- Architecture determines what the machine must execute
Architecture: the compiler's vocabulary

- Sequential Semantics of the ISA
- Low Instruction Level Parallelism (ILP)
- Unpredictable Branches, Mem dependencies
- Ever Increasing Memory Latency
- Limited Resources (registers, memory addr)
- Procedure call, Loop pipelining Overhead
Fundamental challenges abound
Intel Labs

Today's Architecture Challenges
- Program = sequence of instructions
- Implied order of instruction execution
- Potential dependence from inst to inst
- But:
  - High performance needs parallel execution
  - Parallel execution needs independent insts
  - Independent insts must be rediscovered
Sequentiality inherent in traditional archs

Today's Architecture Challenges
  (dependent)          (independent)
  add r1 = r2, r3      add r1 = r2, r3
  sub r4 = r1, r2      sub r4 = r11, r2
  shl r5 = r4, r8      shl r5 = r14, r8
- Compiler knows the available parallelism but has no vocabulary to express it
- Hardware must rediscover parallelism
Complex hardware needed to re-extract ILP

Today's Architecture Challenges
- Branches: frequent. Code blocks: small
- Limited parallelism within basic blocks
- Wider machines need more parallel insts
- Need to exploit ILP across branches
- But some instructions can fault
- Branches are a barrier to code motion
Limited ILP available within basic blocks

Today's Architecture Challenges
- Branches alter the sequence of insts
- ILP must be extracted across branches
- Branch prediction has its limitations
  - Not perfect: performance penalty when wrong
- Need to
  speculatively execute insts that can fault
  - memory operations (loads)
  - floating-point operations
- Need to defer exceptions on speculative operations
  - more bookkeeping overhead and hardware
Branches make extracting ILP difficult

Today's Architecture Challenges
- Loads are usually at the top of a chain of insts
- ILP extraction requires moving these loads
- Branches abound and are a barrier
- Stores abound and are also a barrier
- Programming paradigm: pointers can point anywhere
- Dynamic disambiguation has its limitations
  - limited in its scope
  - requires additional hardware
  - adds to code size increase if done in software
Memory dependencies further limit ILP

Today's Architecture Challenges
- Memory latency has been increasing over time
- Need to distance loads from their uses
- Branches and stores are barriers
- Cache hierarchy has its limitations
  - Typically small, so limited working set
  - Consumes precious silicon area
  - Helps if there is locality, hinders otherwise
  - Managed asynchronously by hardware
Increasing latency exacerbates ILP need

Today's Architecture Challenges
- Small register space
  - Limits compiler's ability to express parallelism
  - Creates false dependencies (overcome by renaming)
- Shared resources
  - Condition flags, control registers, etc.
  - Forces dependencies on otherwise independent insts
- Floating-point resources
  - Limited performance even in ILP-rich applications
  - Data parallel applications need flexible resources
Limited resources: a fundamental constraint

Today's Architecture Challenges
- Modular programming increasingly used
- Programs tend to be call intensive
- Register space is shared by caller and callee
- Calls/returns require register saves/restores
- Software convention has its limitations
  - Parameter passing limited
  - Extra saves/restores when not needed
Shared resources create more overhead

Today's Architecture Challenges
- Loops are a common source of good ILP
- Unrolling/pipelining exploit this ILP
- Prologue/epilogue cause code expansion
- Unrolling causes more code expansion
- Limits the
  applicability of these techniques
Loop ILP extraction costs code size

Today's Architecture Challenges
- Complex conditionals
  - sequential branch execution increases critical path
- Dynamic resource binding
  - parallel insts need to be reorganized to fit machine capability
- Lack of domain-specific support
  - Multimedia: operations repertoire, efficient datatypes
  - Floating-point: standard compliant, accuracy, speed
And the challenges continue...

- Sequentiality inherent in traditional architectures
- Complex hardware needed to re-extract ILP
- Limited ILP available within basic blocks
- Branches make extracting ILP difficult
- Memory dependencies further limit ILP
- Increasing latency exacerbates ILP need
- Limited resources: a fundamental constraint
- Shared resources create more overhead
- Loop ILP extraction costs code size
- And the challenges continue...
IA-64 overcomes these challenges

Agenda
- Today's Architecture Challenges
- IA-64 Architecture Performance Features
- High-end Application Characteristics
- End User Benefits of IA-64 Features

It's all about Parallelism
- Enabling it
- Enhancing it
- Expressing it
- Exploiting it
  - at the proc/thread level for the programmer
  - at the instruction level for the compiler
Enable, Enhance, Express, Exploit Parallelism

- Explicitly Parallel Instruction Semantics
- Predication and Control/Data Speculation
- Massive Resources (regs, mem)
- Register Stack and its Engine (RSE)
- Memory hierarchy management support
- Software Pipelining Support
Challenges addressed from the ground up

IA-64 Architecture Performance Features
- Program = sequence of parallel inst groups
- Implied order of instruction groups, and no dependence between insts within a group
- So:
  - High performance needs parallel execution
  - Parallel execution needs independent insts
  - Independent instructions explicitly indicated
Parallelism inherent in IA-64 architecture

IA-64 Architecture Performance Features
  (dependent)          (independent)
  add r1 = r2, r3      add r1 = r2, r3
  sub r4 = r1, r2      sub r4 = r11, r2
  shl r5 = r4, r8      shl r5 = r14, r8
- Compiler knows
  the available parallelism and now HAS the vocabulary to express it: STOPS
- Hardware easily exploits the parallelism
Frees up hardware for parallel execution

[Challenge checklist slide; part of the text is illegible in the source. Legible entries:]
- Low Instruction Level Parallelism (ILP)
- Unpredictable Branches, Mem dependencies
- Ever Increasing Memory Latency
- Limited Resources (registers, memory addr)
- Procedure call, Loop pipelining Overhead
IA-64 EPIC ISA: Sequential -> Parallel

IA-64 Architecture Performance Features
[Figure: control flow to data flow; diagram text illegible in the source]
Predication removes/reduces branches

IA-64 Architecture Performance Features
- Unpredictable branches removed
  - Misprediction penalties eliminated
- Basic block size increases
  - Compiler has a larger scope to find ILP
- ILP within the basic block increases
  - Both "then" and "else" executed in parallel
- Wider machines are better utilized
Predication enables and enhances ILP

[Challenge checklist slide repeated; part of the text is illegible in the source. Legible entries:]
- Mem dependencies
- Ever Increasing Memory Latency
- Limited Resources (registers, memory addr)
- Procedure call, Loop pipelining Overhead
IA-64 Predication: ILP, Branches

IA-64 Architecture Performance Features
[Figure: control speculation diagram, largely illegible in the source; legible labels: Instr 1, Instr 2, Barrier, chk.s, Use]
- Load moved above branch by compiler
Branch barrier broken
Memory latency addressed

IA-64 Architecture Performance Features
[Figure labels: Recovery Code, Speculative data]
- Uses can also be speculated
Control speculating uses further increases ILP

[Challenge checklist slide repeated; part of the text is illegible in the source. Legible entries:]
- Mem dependencies
- Limited Resources (registers, memory addr)
- Procedure call, Loop pipelining Overhead
IA-64 Control Speculation: ILP, Latency impact

IA-64 Architecture Performance Features
[Figure: data speculation diagram, largely illegible in the source; legible labels: instr, store, "by compiler", Use]
Store barrier broken
Memory
latency addressed

IA-64 Architecture Performance Features
[Figure labels: Recovery Code, Speculative data]
- Uses can be speculated
Data speculating uses further increases ILP

[Challenge checklist slide repeated; part of the text is illegible in the source. Legible entries:]
- Limited Resources (registers, memory addr)
- Procedure call, Loop pipelining Overhead
IA-64 Data Speculation: ILP, Latency impact

IA-64 Architecture Performance Features
  Register file      Bits    Static   Stacked / Rotating
  Integer            63:0    32       96 stacked, rotating
  Floating-point     81:0    32       96 rotating
  Predicate                  16       48 rotating
  Branch                     (counts not legible in the source)
An abundance of machine resources

IA-64 Architecture Performance Features
- 18 BILLION gigabytes accessible
  - 2^64 = 18,446,744,073,709,551,616
- Both 64-bit and 32-bit pointers supported
- Both little- and big-endian order supported
An abundance of memory resources

[Challenge checklist slide repeated; part of the text is illegible in the source. Legible entries:]
- Procedure call, Loop pipelining Overhead
IA-64 Resources: Aid explicit parallelism

IA-64 Architecture Performance Features
- GR stack reduces need for save/restore across calls
- Procedure stack frame of programmable size (0 to 96 regs)
- [Figure labels: PROC B, 96 Stacked, overlap]
- Mechanism implemented by renaming register addresses
Distinct resources reduce overhead

IA-64 Architecture Performance Features
[Figure: register stack frames, largely illegible in the source; legible label: Inputs]
Frame overlap eases parameter passing

IA-64 Architecture Performance Features
- Automatically saves/restores stack registers without software intervention
- Provides
  the illusion of infinite physical registers by mapping to a stack of physical registers in memory
- Overflow: Alloc needs more registers than available
- Underflow: Return needs to restore frame saved in memory
- RSE may be designed to utilize unused memory bandwidth to perform register spill and fill operations in the background
RSE eliminates stack management overhead

[Challenge checklist slide repeated; part of the text is illegible in the source. Legible entries:]
- Loop pipelining Overhead
IA-64 Reg Stack: Modular program support

IA-64 Architecture Performance Features
[Slide: Software Pipelining Support; body text illegible in the source]
IA-64 Loop support: ILP, Overhead
Details in Compiler Technology for [remainder illegible in the source]

[Challenge checklist slide repeated; no entries legible in the source]
IA-64 Loop support: Perf w/o code overhead
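
The overflow and underflow cases described for the RSE can be illustrated with a small model. This is only a sketch under loud assumptions: the class name `RegisterStackModel`, its methods, and the whole-frame spill granularity are hypothetical and chosen for illustration; the slides only state that 96 stacked registers exist, that an alloc can overflow, and that a return can underflow. Frame overlap between caller and callee is ignored here.

```python
class RegisterStackModel:
    """Toy model of a register stack with a spill/fill engine (hypothetical)."""

    def __init__(self, physical=96):
        self.physical = physical  # stacked physical registers (96 per the slides)
        self.frames = []          # frame sizes on the call stack, oldest first
        self.spilled = 0          # how many of the oldest frames live in memory
        self.spills = 0           # total registers written to memory
        self.fills = 0            # total registers read back from memory

    def _resident(self):
        # Registers currently held in the physical register file.
        return sum(self.frames[self.spilled:])

    def call(self, size):
        """Allocate a frame of `size` registers for a callee."""
        if not 0 <= size <= self.physical:
            raise ValueError("frame size must be between 0 and %d" % self.physical)
        self.frames.append(size)
        # Overflow: the alloc needs more registers than are available,
        # so the oldest resident frames are spilled to memory.
        while self._resident() > self.physical:
            self.spills += self.frames[self.spilled]
            self.spilled += 1

    def ret(self):
        """Return from the current procedure."""
        self.frames.pop()
        # Underflow: the caller's frame was saved in memory and must be filled.
        if self.frames and self.spilled == len(self.frames):
            self.spilled -= 1
            self.fills += self.frames[self.spilled]


model = RegisterStackModel()
for _ in range(3):
    model.call(40)      # third call exceeds 96 registers and forces a spill
model.ret()
model.ret()             # returning to the oldest frame forces a fill
print(model.spills, model.fills)  # 40 40
```

In this model a shallow call chain never touches memory, which matches the slide's point that software sees no stack management overhead; only deep chains pay for spills and fills.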
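
The dependent and independent add/sub/shl sequences from the instruction-group slides can be checked mechanically. The sketch below is a simplified illustration, not the compiler's actual algorithm: the function name `split_into_groups` and the tuple encoding are hypothetical, and only register read-after-write and write-after-write dependences are considered when deciding where a stop would be needed.

```python
def split_into_groups(instructions):
    """Greedily pack instructions into groups with no dependence inside a group.

    Each instruction is a tuple (opcode, dest, src1, src2). A new group
    (that is, a stop) begins whenever an instruction reads or writes a
    register that was written earlier in the current group.
    """
    groups = []
    current = []
    written = set()
    for inst in instructions:
        _, dest, *srcs = inst
        # Read-after-write or write-after-write inside the current group.
        if dest in written or any(src in written for src in srcs):
            groups.append(current)
            current = []
            written = set()
        current.append(inst)
        written.add(dest)
    if current:
        groups.append(current)
    return groups


# The two sequences shown on the slides.
dependent = [
    ("add", "r1", "r2", "r3"),
    ("sub", "r4", "r1", "r2"),
    ("shl", "r5", "r4", "r8"),
]
independent = [
    ("add", "r1", "r2", "r3"),
    ("sub", "r4", "r11", "r2"),
    ("shl", "r5", "r14", "r8"),
]

print(len(split_into_groups(dependent)))    # 3 groups: each inst waits on the previous one
print(len(split_into_groups(independent)))  # 1 group: all three can issue together
```

The dependent chain needs a stop after every instruction, while the independent sequence fits in a single group, which is the information the slides say the compiler can now express directly.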