COSC 6385 Computer Architecture
Instruction Level Parallelism with Software Approaches III
Edgar Gabriel
Fall 2006

Hardware Support for Software ILP
- Conditional/predicated instructions, e.g. (A in R1, S in R2, T in R3)
      if (A == 0) { S = T; }
- Can be mapped to
          BNEZ   R1, L
          ADDU   R2, R3, R0
      L:
- The code sequence can be mapped to a conditional move
          CMOVEZ R2, R3, R1

Conditional/predicated instructions
- Converts a control dependence into a data dependence
- Also called if-conversion
- Often used to implement abs
- Two options for when to nullify an instruction
  - At instruction issue
    - Requires early knowledge of the condition variable -> can lead to a stall
    - Avoids the execution of useless instructions
  - Before the WB stage
    - Avoids stalls
    - Can have a negative impact on the performance of the code

Limitations of predicated instructions
- Predicated instructions that are annulled still take processor resources
- Limited usefulness of predicated instructions when the control flow involves more than a simple alternative sequence
  - E.g. moving an instruction across multiple branches
- Conditional instructions may have a speed penalty compared to unconditional instructions

Compiler speculation with hardware support
- If the compiler has good knowledge about the behavior of branches, it might want to move code sections before the branch to eliminate branch hazards
- Problem: how to preserve exception behavior
  1. Ignore exceptions of speculative instructions
  2. Use special speculative instructions which do not raise exceptions
  3. Set a bit on the register that contains the result of a speculative instruction which caused an exception (poison bit)
  4. Provide a mechanism to indicate speculative instructions; the hardware buffers instruction results until the instruction is known not to be speculative

Exceptions
- Two types of exceptions are important in
  this context:
  - Terminating exceptions: would usually lead to the termination of a program
  - Resumable exceptions: no problem for speculative instructions
- A correct program should not generate terminating exceptions
  - The result of an incorrect program is not well defined
- We have to be sure that a terminating exception is not generated by the speculative execution of instructions

(1) Ignore exceptions of speculative instructions
- Handle non-terminating exceptions of instructions when they occur
- Return an undefined value for instructions which would cause a terminating exception
  - The program is allowed to continue
  - The result of the speculative instruction will not be used in a correct program

(1) Ignore exceptions of speculative instructions (II)
- Example (A at 0(R3), B at 0(R2)):
      if (A == 0) A = B; else A = A + 4;

  Regular instruction sequence     Speculative instruction sequence
      LD    R1, 0(R3)                  LD    R1, 0(R3)
      BNEZ  R1, L1                     LD    R14, 0(R2)
      LD    R1, 0(R2)                  BEQZ  R1, L3
      J     L2                         DADDI R14, R1, 4
  L1: DADDI R1, R1, 4              L3: SD    R14, 0(R3)
  L2: SD    R1, 0(R3)

(2) Speculative instructions not raising exceptions
- For the previous example:
      LD     R1, 0(R3)
      SLD    R14, 0(R2)
      BNEZ   R1, L1
      SPECCK 0(R2)
      J      L2
  L1: DADDI  R14, R1, 4
  L2: SD     R14, 0(R3)
- Requires maintaining the then-block
- Introduces speculation check instructions (SPECCK)
- Maintains precise exception behavior

(3) Using poison bits
- Tracks exceptions as they occur, but postpones raising terminating exceptions until the value is really used
- One bit is added to every register (poison bit)
- One bit is added to every instruction to mark whether it is a speculative instruction
- If a regular instruction uses a register having the poison bit set, the instruction will cause a fault
- If a speculative instruction uses a register having the poison bit set, the instruction will set the poison bit of the destination register
- No speculative stores allowed
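
The poison-bit rules above can be illustrated with a small simulation. This is a minimal sketch and not a model of any real processor: the names `RegisterFile`, `PoisonFault` and `run_speculative` are hypothetical, a missing dictionary key stands in for a terminating exception, and the sketch assumes that a successful write by any instruction clears the poison bit of its destination register. It replays the speculative sequence for `if (A == 0) A = B; else A = A + 4;`.

```python
class PoisonFault(Exception):
    """Raised when a regular (non-speculative) instruction hits a fault."""


class RegisterFile:
    """Registers with one poison bit each (hypothetical model)."""

    def __init__(self, count=32):
        self.value = [0] * count
        self.poison = [False] * count

    def load(self, dst, memory, addr, speculative):
        # A missing address stands in for a terminating exception.
        if addr not in memory:
            if not speculative:
                raise PoisonFault(f"load fault at address {addr}")
            self.value[dst] = 0          # undefined value, never used when correct
            self.poison[dst] = True      # postpone the exception
            return
        self.value[dst] = memory[addr]
        self.poison[dst] = False         # assumption: a good write clears the bit

    def add_imm(self, dst, src, imm, speculative):
        if self.poison[src]:
            if not speculative:
                raise PoisonFault(f"R{src} is poisoned")
            self.poison[dst] = True      # speculative use propagates the bit
            return
        self.value[dst] = self.value[src] + imm
        self.poison[dst] = False

    def store(self, src, memory, addr):
        # Stores are never speculative, so a poisoned source always faults.
        if self.poison[src]:
            raise PoisonFault(f"R{src} is poisoned")
        memory[addr] = self.value[src]


def run_speculative(memory, addr_a, addr_b):
    """if (A == 0) A = B; else A = A + 4;  with the load of B hoisted."""
    regs = RegisterFile()
    regs.load(1, memory, addr_a, speculative=False)   # LD    R1, 0(R3)
    regs.load(14, memory, addr_b, speculative=True)   # sLD   R14, 0(R2)
    if regs.value[1] != 0:                            # BEQZ  R1, L3
        regs.add_imm(14, 1, 4, speculative=False)     # DADDI R14, R1, 4
    regs.store(14, memory, addr_a)                    # L3: SD R14, 0(R3)
    return memory[addr_a]
```

When the address of B is invalid but A is non-zero, the poisoned value in R14 is overwritten before the store and no fault is raised. When A is zero, the store uses the poisoned register and the postponed exception is raised at that point.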

(4) Compiler marks instructions as speculative
- The compiler adds an indicator of how many branches the instruction was speculatively moved across and what the assumed branch action was
  - Typically only one branch -> 1 additional bit required
- All instructions are placed in a reorder buffer and are forced to commit
  - Delays write back
- Difference between this approach and the reorder buffer used with Tomasulo's approach:
  - No dynamic scheduling required here
  - Register renaming is done by the compiler

Hardware vs. software speculation
- It is difficult to disambiguate memory references at compile time
- Hardware-based speculation works better for unpredictable control flow
- Hardware-based speculation does not require compensation code
- The compiler-based approach may see further into the future
- Hardware-based speculation adds a lot of complexity to the hardware

A case study: Intel IA-64 Architecture
- RISC-style, register-register instruction set that relies on the compiler to expose ILP
- VLIW architecture
- Registers:
  - 128 64-bit general purpose registers (65 bits wide, including the NaT bit)
  - 128 82-bit floating point registers
  - 64 1-bit predicate registers
  - 8 64-bit branch registers (hold branch destination addresses for indirect branches)

A case study: Intel IA-64 Architecture (II)
- Instructions without a data dependence are placed into instruction groups
  - The end of an instruction group is marked by a stop
- Three instructions are always packed into a bundle (fixed format)
- A bundle is 128 bits wide
  - 5-bit template field: indicates the execution unit required by each instruction and the presence of a stop
    - Not all possible combinations are allowed
  - 3 41-bit instructions

Execution unit slots

  Execution unit slot   Instruction type   Instruction description   Example
  I-unit                A                  Integer ALU               Add, sub, and
                        I                  Non-ALU integer           Shift, bit tests, moves
  M-unit                A                  Integer ALU               Add, sub, and
                        M                  Memory access             Load, store
  F-unit                F                  Floating point            FP add, sub, etc.
  B-unit                B                  Branches                  Conditional branches, loops, calls
  L+X                   L+X                Extended                  Extended immediates, stops, no-ops

Examples for the 5-bit templates ("|" marks a stop after the preceding slot)

  Template   Slot 0   Slot 1   Slot 2
   0         M        I        I
   1         M        I        I |
   2         M        I |      I
   3         M        I |      I |
   4         M        L        X
   5         M        L        X |
   8         M        M        I
   9         M        M        I |
  10         M |      M        I
  11         M |      M        I |
  12         M        F        I

Example for an unrolled version of a loop (x[i] = x[i] + s)

  Bundle template   Slot 0            Slot 1              Slot 2              Execute cycle
   8: M M I         LD F0, 0(R1)      LD F6, 8(R1)        -                    1
   9: M M I |       LD F10, 16(R1)    LD F14, 24(R1)      -                    2
  14: M M F         LD F18, 32(R1)    LD F22, 40(R1)      ADDD F4, F0, F2      3
  14: M M F         LD F26, 48(R1)    -                   ADDD F8, F6, F2      4
  15: M M F |       -                 -                   ADDD F12, F10, F2    5
  14: M M F         SD F4, 0(R1)      -                   ADDD F16, F14, F2    6
  14: M M F         SD F8, 8(R1)      -                   ADDD F20, F18, F2    7
  15: M M F |       SD F12, 16(R1)    -                   ADDD F24, F22, F2    8
  14: M M F         SD F16, 24(R1)    -                   ADDD F28, F26, F2    9
   9: M M I |       SD F20, 32(R1)    SD F24, 40(R1)      -                   11
   8: M M I         SD F28, 48(R1)    DADDUI R1, R1, 56   BNE R1, R2, Loop    12

Speculation support
- Nearly all instructions can be predicated
  - An instruction is predicated by using a predicate register
  - Lower 6 bits of the instruction field
- Exception handling
  - Uses an equivalent of poison bits called NaT (Not a Thing) for GPR registers
  - Uses a special value for FP registers (NaTVal)

The Itanium processor
- First incarnation of the IA-64 architecture (2001)
- The processor core is capable of issuing six instructions (2 bundles) per clock cycle
- Nine functional units, all fully pipelined
  - 2 I-units
  - 2 M-units
  - 3 B-units
  - 2 F-units

The Itanium processor (II)
- 10-stage pipeline which can be divided into four categories
  - Front end (IPG, Fetch, Rotate): load bundles into the prefetch buffer
  - Instruction delivery (EXP, REN): issue up to 6 instructions to the 9 functional units, implement register renaming
  - Operand delivery (WLD, REG): access register file, perform register bypass, access and update register scoreboard, and check predicate dependencies
  - Execution (EXE, DET, WRB)
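
The bundle arithmetic from the IA-64 slides (a 5-bit template plus three 41-bit instructions gives 128 bits, and the lower 6 bits of each instruction select one of the 64 predicate registers) can be checked with a short sketch. The function names are hypothetical, and the bit placement used here (template in the lowest 5 bits, followed by slot 0, slot 1 and slot 2) is an assumption made for illustration; the slides only give the field widths.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
BUNDLE_BITS = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS  # 5 + 3 * 41 = 128

TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1
SLOT_MASK = (1 << SLOT_BITS) - 1
PREDICATE_MASK = (1 << 6) - 1  # 64 predicate registers need 6 bits


def pack_bundle(template, slots):
    """Pack a template and three 41-bit instruction words into one integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle always holds exactly three instructions")
    bundle = template
    for index, word in enumerate(slots):
        if not 0 <= word <= SLOT_MASK:
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= word << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into its template and three slots."""
    template = bundle & TEMPLATE_MASK
    slots = [
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


def qualifying_predicate(word):
    """Return the predicate register number held in the lower 6 bits."""
    return word & PREDICATE_MASK
```

For example, `unpack_bundle(pack_bundle(8, [a, b, c]))` returns `(8, [a, b, c])` for any three 41-bit words, and a fully populated bundle always fits in 128 bits.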

The Itanium processor (III)

  Instruction                         Latency
  Integer load                        1
  Floating-point load                 9
  Correctly predicted taken branch    0-3
  Mispredicted branch                 9
  Integer ALU operation               0
  FP arithmetic                       4

Performance (SPEC)
  [Figure: per-benchmark performance on integer benchmarks (bzip2, vortex, perlbmk, ...) for Alpha 21264, Pentium 4 and Itanium. (c) 2003 Elsevier Science (USA). All rights reserved.]
  [Figure: per-benchmark performance on floating-point benchmarks (sixtrack, ammp, applu, mgrid, swim, wupwise, ...) for Alpha 21264, Pentium 4 and Itanium. (c) 2003 Elsevier Science (USA). All rights reserved.]

Single processor performance according to SPEC CPU2000 Benchmark Results
  [Figure: bar chart of SPECint2000, SPECint_base2000, SPECfp2000 and SPECfp_base2000 (1 PE each) per processor.]
  Slides based on a talk and courtesy of Matthias Mueller, Center for Information Services and High Performance Computing, Technical University Dresden


