Special Topics ECE 4893
These class notes were uploaded by Cassidy Effertz on Monday, November 2, 2015. They belong to ECE 4893 at Georgia Institute of Technology - Main Campus, taught by Staff in Fall.
Date Created: 11/02/15
Larrabee: A Many-Core x86 Architecture for Visual Computing (from Intel)
Prof. Hsien-Hsin S. Lee
School of Electrical and Computer Engineering, Georgia Tech
Disclaimer: the material does not necessarily represent the official opinions of Intel, Nvidia, or Georgia Tech.

Vision, Ambition, and Design Goals
- Intel: "Software is the New Hardware"
- Intel x86 ISA makes parallel programming easier
  - Better flexibility and programmability
  - Supports subroutine calls and page faulting
- Mostly software rendering pipeline, except texture filtering
- Note the general goal for current-day GPGPU designers, as well as Intel's Larrabee architects:
  - Higher performance per mm²
  - Higher performance per watt

The Larrabee Architecture
[Figure: block diagram showing coherent L2 blocks and fixed-function logic]
- Lots of x86 cores (8 to 64)
- Fully coherent cache hierarchy

[Figure: conventional GPGPU pipeline based on DirectX 10 vs. Larrabee's fully programmable pipeline]

x86 Core
- LRB's in-order core is the original Pentium (P54C, i.e., pre-MMX), with:
  - 64-bit extensions
  - Larger L1 caches, a shared L2
  - 4-way multithreading
  - 16-wide VPU (Vector Processing Unit)
- Rumor has it this is the thoroughly debugged P54C given back by the Pentagon, which got the original RTL from Intel to develop its radiation-hardened version (which I really doubt)
- Compatibility is the keyword

Single Larrabee Core
[Figure: single Larrabee core]

Dual-Issue Core
- Relies on the compiler to pair two instructions for asymmetric pipes (same as P54C)
- Primary instruction pipe (U-pipe): all instructions
- Secondary, more restricted pipe (V-pipe): ld/st, simple ops, cache manipulation instructions, vector st
- 1 GHz, 32 cores to reach 1 TeraFLOPS

Shared L2
- Divided L2: each core has a local L2 subset (256 KB each)
- Enables parallel lookup among cores
- One core can access other cores' subsets directly
- Entire L2 is coherent; no hassle like Cell DMA
- SIGGRAPH paper shows a 4 MB L2, indicating 16 cores

Cache Control Instructions
- Each core can
  - Access its local subset of L2 (256 KB) quickly
  - Access other cores' L2 shares too
- Control for non-temporal streaming data
  - SSE prefetch to L1 or L2 only
  - Mark a streaming cache line for early eviction
- Render targets kept in L2, e.g., FB, ZB, SB, etc.

Ring Network
- Bi-directional ring network
  - All cores, L2, and blocks of FF (fixed-function) logic are attached to it
  - 512 bits wide in each direction
- Simpler than a mesh; easy wire routing
- One clock cycle for each stop (a hop)
  - The number of nodes between two parties determines the latency
  - Worst case: halfway around the ring
- Ring latency is small compared to DRAM access
- With more than 16 cores, multiple hierarchical rings will be needed (think about KSR MPP)

4-Way MT
- Four x86 contexts to support 4 hardware threads
- One thread picked per clock
- MT is especially helpful
  - When the compiler fails to schedule code without stalls
  - Upon L1 misses
- Can hide long vector instruction latency
- Can switch threads on every clock

VPU (1/2)
- 16-wide integer and single-precision FP; 8-wide double-precision FP
- Ternary operands; one source can come from memory
- Free predication on every instruction: 16-bit predicate registers, one "enable" per lane
- Gather/scatter instructions: read/write 16 results to/from 16 different offsets
- 1/3 the area of the LRB core

VPU (2/2)
[Figure: VPU block diagram with mask registers, replicate logic, and 16-wide vector ALU]

Fixed Function Logic (1/3)
- Modern GPGPUs have the following done in HW: texture filtering, display processing, post-shader alpha blending, rasterization, interpolation, etc.
- LRB does all of it in SW except for the Texture Sampler Units
  - Much faster than a software approach (12x to 40x)
  - Texture filtering still most commonly uses 8-bit operations
  - Efficiently selecting unaligned 2x2 quads requires specialized pipelined gather logic
  - Filtering on the VPU requires an impractical amount of RF bandwidth
  - On-the-fly texture decompression is drastically more efficient in dedicated hardware

Fixed Function Logic (2/3)
- Similar to typical GPU texture logic
- 32 KB texture cache per core
- Supports all the usual operations
  - DX10 compressed texture formats
  - Mipmapping
  - Anisotropic filtering

Fixed Function Logic (3/3)
- Cores pass commands to the texture units through the L2 and receive results the same way
- Virtual-to-physical page translation
  - Reports any page misses to the core
  - Retries the texture filter command after the page is in memory
- LRB can still perform texture operations on the cores if the performance is fast enough in software

Simulation Data from SIGGRAPH Paper
[Figure: scaled performance vs. number of Larrabee units (1 GHz cores) for FEAR, Half-Life 2 Ep. 2, and Gears of War. Caption: Scalable performance for 3D games]
[Figure: performance vs. number of Larrabee units (1 GHz cores) for game physics workloads, including fluid, cloth, GJK collision detection, 1K-object sweep and prune broad phase, and rigid body. Caption: Scalable performance for 3D game physics]
Source: SIGGRAPH08

Simulation Data from SIGGRAPH Paper
[Figure: ray tracing performance vs. number of Larrabee units (1 GHz cores), with a Larrabee (1 GHz nominal) series and a second series at 2.6 GHz. Caption: Scalable RT (ray tracing)]
[Figure: performance vs. number of Larrabee units (1 GHz cores) for production fluid, production face, production cloth, marching cubes, sports video analysis, video cast indexing, text indexing, foreground estimation, human body tracking, portfolio management, and 3D FFT. Caption: Non-graphics apps and kernels]
Source: SIGGRAPH08

Simulation Data from SIGGRAPH Paper
[Figure: number of LRB units needed for 60 fps for FEAR, Gears of War, and Half-Life 2 Ep. 2. X axis: the 25 tested frames; Y axis: LRB units needed for 60 fps]
Source: SIGGRAPH08

Profile Breakdown for Title Games
[Figure: percentage of time per pipeline stage (alpha blend, pixel shade, pixel setup, depth test, rasterization, vertex shade, pre-vertex) for FEAR, Gears of War, and Half-Life 2 Episode 2]
- Modern games: 70% pixel setup/shading, 10% depth, 10% rasterization, 10% vertex shading
Source: Tom Forsyth, Intel, SIGGRAPH08

View from Nvidia
http wwwpcpercom images news A20viewooint20from20NVlDlpdf
(I don't know who actually wrote this article.)
- HPC developers said
  - "Easier parallel computing on x86 multicore" has not proven true
  - Applications struggle to scale from 2 to 4 cores
- Why are people not using quad cores with 4-wide SIMD? We'd like to know what has changed in Larrabee
- Questions from Nvidia
  - Will apps written for today's Intel CPUs run unmodified on Larrabee?
  - Will apps written for Larrabee run unmodified on today's Intel multi-core CPUs?
  - The SIMD part of Larrabee is different from Intel's CPUs, so won't that create compatibility problems?
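
Back-of-the-envelope check
Several numbers quoted in the slides (1 TeraFLOPS from 32 cores at 1 GHz, a 4 MB L2 indicating 16 cores, and a worst-case ring trip of halfway around) can be cross-checked with simple arithmetic. The sketch below is not from the slides. It assumes each of the 16 VPU lanes completes one multiply-add per clock, counted as 2 floating-point operations, which the slides do not state; the function names are hypothetical, and the 16-stop ring is just an example size.

```python
# Back-of-the-envelope checks for figures quoted in the slides.
# Assumption (not from the slides): each VPU lane completes one
# multiply-add per clock, counted as 2 floating-point operations.

def peak_gflops(cores, lanes, clock_ghz, flops_per_lane_per_clock=2):
    """Peak single-precision throughput in GFLOPS."""
    return cores * lanes * flops_per_lane_per_clock * clock_ghz

def total_l2_kb(cores, l2_subset_kb=256):
    """Total L2 capacity in KB when every core owns one local subset."""
    return cores * l2_subset_kb

def worst_case_ring_hops(stops):
    """Worst-case hop count on a bi-directional ring: halfway around.
    The slides give one clock cycle per hop."""
    return stops // 2

if __name__ == "__main__":
    # 32 cores x 16 lanes x 2 FLOPs x 1 GHz
    print(peak_gflops(cores=32, lanes=16, clock_ghz=1.0))  # 1024.0 GFLOPS
    # 16 cores x 256 KB = 4 MB
    print(total_l2_kb(cores=16))                           # 4096 KB
    # Example ring with 16 stops (hypothetical size)
    print(worst_case_ring_hops(stops=16))                  # 8 hops
```

Under these assumptions the 32-core, 1 GHz configuration lands at roughly 1 TFLOPS, and sixteen 256 KB subsets add up to the 4 MB L2 mentioned for the SIGGRAPH paper.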