SPECIAL TOPICS CDA 6938
University of Central Florida
Popular in Course
Popular in Computer Design Architecture
This 83 page Class Notes was uploaded by Genoveva Bogisich on Thursday October 22, 2015. The Class Notes belongs to CDA 6938 at University of Central Florida taught by Huiyang Zhou in Fall. Since its upload, it has received 54 views. For similar materials see /class/227525/cda-6938-university-of-central-florida in Computer Design Architecture at University of Central Florida.
Reviews for SPECIAL TOPICS
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 10/22/15
l39nivvreil u Ivnlml Flnrillu Programming for MultiCore CPUs Locking and Transactional Memory Huiyang Zhou Slides 1718 are from Professor Urnakishore Ramachandran GaTech School of Electrical Engineering and Computer Science University of Central Florida l G compute IIO IIO result Needed compute a Sequential process Example use of threads 1 com pute thread thread IIO request IIO complete IIO result Needed b Multithreaded process 6 Example use of threads 2 Q Digitizer H Tracker H Alarm 9 f 39i39 3 6 Programming Support for Threads 39 creation pthreadicreatetop level procedure args 39 termination return from toplevel procedure explicit kill 39 rendezvous creator can wait for children 0 pthreadijoir childitid 39 synchronization mutex condition variables main thread Main thread foo thread threadcreatefoo args threadcreatefoo args a Before thread creation b After thread creation 6 Sample program thread createjoin int fooint n threadij oinchild7tid l G Programming with Threads 0 synchronization Pmducer for coordination of the threads 0 conununication for interthread sharing of data threads can be in different processors how to achieve sharing in SMI consumer 0 software accomplished by keeping all threads in the same address space by the OS 0 hardware accomplished by hardware shared memory and coherent caches Need for Synchronization Eligitizero tracked imageitype trackiimage 1magetype d1g1mage int head 0 int tail 0 100M 1 if bufavail lt MAX 90 trackiima 1f bufavall gt 0 frameibuf head mod MAX grabdigimage head head 1 framebuftail mod MAX bumva bufava 1 digimage analyzetrack7image tail tail 1 bufavail bufavail 1 Problem digitizer tracker bufavail bufavail 1 E E bufaVai bufavai1 Shared data structure f b E o ranie u 99 head tail First valid lled First empty spot In framebuf in frame Synchronization Primitives lock and unlock mutual exclusion among threads busywaiting Vs blocking pthreadimutexitrylock no blocking pthreadimutexilock blocking pthreadimutexiunlock l O Fix number 1 with locks Eligi zero tracker imageitype diampimage imageitype trackiimage int tail 0 int head 0 loop threadimutexilockbuflock loop if bufavajl gt 0 threadimutexilockbu ock grabdiampimage if bufavail lt MAX frameibu tail mod MAX trackiijna e digiimage frameibu head mod MAX tail tail 1 head head 1 bufavail bufavajl 1 bufavajl bufavail 1 analyzetrack7image threadimutexiunlockbuflock threadimutexiunlockbuflock Problem 6 digitizer imageitype digiimage int tail 0 loop grabdigimage digiimage Problem ea 7 tex or or frameibuf tail mod MAX Fix number 2 tracked imageitype trackiimage int head 0 loop head head 1 analyzetrackimage l G digitizer imageitype digiimage intt 39 100p grabdigimage frameibuf tail mod MAX igiimage Problem Fix number 3 analyzetrack7image tracked imageitype trackiimage int head 0 loop trackiimage frameibuf head mod MAX head head 1 condition variables pthreadicondiwait block for a signal pthreadicondisignal signal one waiting thread pthreadicondibroadcast signal all waiting threads l G Wait and signal with cond vars T1 T2 condsignal c condwait c m condwait c m blocked condsignal c resumed a Wait before signal b Wait after signal T1 blocked forever 6 digi zer imageitype digiimage int tail 0 loop grabdig7image threadimutexilodqbu ock Fix number 4 cond var track er imageityPe trackiimage int head 0 loop rea imutex oc if bufavail MA threadicondiwaitgh lfinotiempty if bufavail 0 bu ock thread condiwai bufinotifull thread 11 exiu buflock trackjmage frameibu head mod 1 1 1 1 1 MA thread mutex head head 1 frameibuf tail mo MA digiimage ailtail 1 threadimutexilockwu ock bufavajl bufavaiL hreadimutexiunlocmbuflock threadicondisigna bufinotiempty threadimutexilock bu o ck bufavail bufavail 1 threadicondisignal ufinotifull readimutexiunlock bu o ck analyze trackiimage This solution is correct so long as there is exalgly one producer and one consumer acquireisharediresourceo if resistate BUSY simutex resistate BUS releaseisharediresourceo resistate NOTiBUSY 6 Gotchas in programming with cond vars threadimutexilocmsimutex4 T3 is here threadicondiwait resinotibusy threadimutexiunlockcsimutex threadimutexilockksimutg T1 is here threadicondisignalresinot7busy threadimutexiunlockcsimutex T2 is here 9 State of waiting queues m m a Waiting queues before T1 signals 8 Waiting queues after T1 Signals Ii Gotchas in programming with cond vars acquireisharediresourceo threadimutexilockcsimutex if resistate BUSY threadicondiwait resinotibusy simutex resistate BUSY 4 T2 is here get the lock threadimutexiunlockksimutex T3 is here release the lock reflease hare llesourcw Both T2 and T3 access the shared resource which threadimutexilockcsimutex was su osed to be resistate NOTiBUSY accessed exclusively threadicondisignalresinot7busy threadimutexiunlockcsimutex 4 T1 is here acquireisharediresourceo threadimutexilockcsimutex 4 T3 is here while resistate BUSY threadicondiwait resinotibusy simutex resistate BUSY threadimutexiunlockcsimutex releaseisharediresourceo threadimutexilockcsimutex resistate NOTiBUSY 4 T1 is here threadicondisignalres7not7buys threadimutexiunlockcsimutex 6 Defensive programming retest predicate lt T2 is here Ii Defensive programming retest predicate acquireisharediresourceo threadimutexilockcsimutex while resistate BUSY threadicondiwait resinotibus simutex lt T2 is still here et the loc fail the condition wait again resistate BUSY threadimutexiunlockcsimutex T3 15 here release the lock releaseisharediresourceo threadimutexilockcsimutex resistate NOTiBUSY threadicondisignalres7not7buys threadimutexiunlockcsimutex T1 is here Atomic region Transactional Memory Borrow the transaction idea from database systems All commit or none commit Read set write set con ict detection Rollback when con ict happens Software implementation using locks Hardware implementation with specialize cache designs 6 digitizer imageitype diampimage int tail 0 loop threadimutexilockbuflock if bufavajl gt 0 grabdiampimage frameibu tail mod MAX digiimage tail tail 1 bufavail bufavajl 1 threadimutexi n I IIHK Problem Lockbased code tracker imageitype trackiimage int head 0 loop threadimutexilockbu ock if bufavail lt MAX trackiimage framejm head mod MAX head head 1 bufavajl bufavail 1 analyzetrack7image threadimutexiunlockbuflock 6 Transactional memory programs digi zer tracker imageitype diampimage imageitype trackiimage int tail 0 int head 0 loop atomic loop if bufavail gt 0 atomic grabdiampimage if bufavail lt MAX frameibu tail mod MAX trac iimage di iimage frameibu head mod MAX tail tail 1 head head 1 bufavail bufavail 1 bufavail bufavail 1 ll end of atomic region analyzetrackiimage end of atomic region l O Transactional memory programs digi zer tracker imageitype diampimage imageitype trackiimage int tail 0 int head 0 loop if bufavail gt 0 loop grabdiampimage frameibu tail mod MAX igiimage tail tail 139 if bufavail lt MAX track ima atomic e frameibiu hgad mod MAX head head 1 bufavail bufavail1 atomicf II and of atomic region bufavail bufavail 1 II and of atomic region tr Challenges of TM 10 in atomic regions Nested atomic regions Atomic regions con icting with code in nonatomic regrons Etc Promising research area with signi cant challenges l39lu39vl39rhlly ul Tt39lllml Florida Programming for Cell Processors Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida lit Review of Cell Processor Architecture 0 Heterogeneous Chip Multiprocessors CMP SPE Synerg39sm SFE SPE SPE Processor Elemenn Element lnlerconnea Bus EIE Broadband Engine 4 Flexlo Imerface BEI 4 Channels SPE I SFE SPE SFE UniveIsity of Central Florida 4 xro IrollerMlC lt gt Channels Future High Performance CMPs A shared view of comp arch com m unity Heterogeneous cores on a single chip M large complex OOO cores and N small in order cores NgtgtM r Larg OOO cores for Controlrintensive hardetoe aralleli e Instructionelevel parallelism Memoryelevel parallelism aggressive speculatio Man small inrorder processors for dataelevel parallelism task threadelevel p ar ism Cell is heterogeneous but does not really fit the description PPE is far frompowerful e oug 7 Next generation of Cell may address PPE performance University of C entral Florida l G Flynn s Taxonomy MU ILnscmmnn 100 msu n5 mc nn Paul Data Poul Data Pun MIMD Instmctinn Paul Data Pool Figure from Wlklpedla University of C entral Florida MIMD Architectures 39 Further Division of MIMD Single Program Multiple Data Stream SPMD Exploit DataLevel Parallelism 0 Difference between SIMD no lockstep 0 In GPU SlID in a warpcluster SPMD among multiple warps clusters Multiple Program Multiple Data Stream MPMD Exploit FunctionTask Level Parallelism Eg Masterworker 39 Cell processors Supports MPMD In each SPU SlID execution exploits datalevel parallelism Multiple SPUs can execute different codes University of Central Florida Workload Partition 39 PPE centric vs SPE centric Models quot mx PEEK PPEaceritric SPEecentric pa Mod el Xquot xxx Multistage Parallel Services Pipleline Stages Model Model Model University of Central Florida Multistage Pipeline Model 0 Mainissues Load balance Data transmission Univelsity of Central Flon39da Ii Parallel Stage Model 0 SPMD Similar to CUDA or Brooks model v UniveIsity of Central Florida Service Model MPMD PPE l SPE Apphcalaon Code MPEG Encnmng v UniVErsily of Central Flanda Outline Introduction of programing models Hello world 7 Three versions Runr me support Programming using vectors SIMDiz a on DM A Program Optimization UniVErsily of Central Flanda AMD Smarter Choice AMD HLSL h February 1 2008 AMD HLSL AMD iii ll Smarter Choice HLSL High Level Shading Language shading language developed by Microsoft and part of DirectX AMD HLSL is a derivative of HLSL versions 30 and 40 with extensions for ATI hardware scatter scratch buffers double precision Compiler compiles AMD HLSL to AMD IL Compiler works in both Windows and Linux h February 1 2008 AMD HLSL AMD Smarter Choice Texture2D teXO Texture2D texl uniform float2 offset float4 mainfloat2 tcoord TEXCOORDO COLORO float4 a teX2DteX0 tcoord offset float4 b teX2DteX1 tcoord return foora b High Level Shading Language Loose C like syntax Pre defined graphics centric functions eg texture fetches 5 February 1 2008 AMD HLSL AMD Smarter Choice bool float uint int double struct h February 1 2008 AMD HLSL AMD Smarter Choice Vector types lttypegt1 2 3 4 eg float3 var0quot declares var as a vector of length 3 Uniform modifier uniform lttypegt ltvargt variables are visible outside of the shader ie constants eg uniform float4 var1quot 5 February 1 2008 AMD HLSL AMD Smarter Choice Texture2D teXO Texture2D texl uniform float2 offset float4 mainfloat2 tcoord TEXCOORDO COLORO float4 a teX2DteX0 tcoord offset float4 b teX2DteX1 tcoord return floora b 5 February 1 2008 AMD HLSL AMD Smarter Choice member access structs swizzle operator prepost increment prepost decrement unary minus unary plus Casts multiply addition subtraction ternary operator assignment operator increment and assign decrement and assign multiply and assign h February 1 2008 AMD HLSL AMD Smarter Choice divide modulus divide and assign Oo mod and assign lt compare lt compare gt compare gt compare compare compare 5 February 1 2008 AMD HLSL AMD Smarter Choice Operators supported for Array types index operator Operators supported for Boolean uint and integer types logical negate logical compliment V l h February 1 2008 AMD HLSL AMDu Smarter Choice Operators supported for uint and integer types ltlt left shift gtgt right shift amp bitwise and and assign A bitwise exclusive or and assign bitwise or and assign ltlt bitwise shift and assign gtgt bitwise shift and assign l h February 1 2008 AMD HLSL AMD Smarter Choice scope gt pointer access dereference amp address of h February 1 2008 AMD HLSL AMD Smarter Choice break exit the surrounding loop do while or for or switch continue go to the next iteration of the surrounding loop do while or for or switch do repeats a block of code do block while condtion for as in C forinit condition increment block if as in C ifcondition block switch as in C while whilecond block b February 1 2008 AMD HLSL AMD Smarter Choice dotxy returns the dot product of x and y xy must be vectors expx returns e x float only and approximate floorx returns the largest integer lt x fmodxy re turns the float reminder of xy such that fracx returns the fractional part of x works for floats and doubles See MS HLSL sec for more info h February 1 2008 AMD HLSL AMD Smarter Choice Full Gamut of Data Types UINT INT FLOAT Vector Types 1 2 3 Examples R3ZG3ZB32A32FLOAT 4 32 bit float values R3ZG3ZB32A32INT 4 32 bit integer values R3ZG3ZB32A32TYPELESS for mixed data b February 1 2008 AMD HLSL AMD Smarter Choice DX9 Ster Declaration Texturelt 123gtD ltnamegt Call tex2Dltnamegt ltaddreSSgt DXlO Ster ltnamegtLoadltaddressgt h February 1 2008 AMD HLSL AMD Smarter Choice Texture2D teXO Texture2D texl uniform float2 offset float4 mainfoat2 tcoord TEXCOORDO COLORO float4 a tex2DteX0 tcoord offset float4 b tex2DteX1 tcoord return foora b h February 1 2008 AMD HLSL AMD Smarter Choice Instead of single constants can use constant arrays cbuffer name list of variable declarations lttypegt templtsizegt h February 1 2008 AMD HLSL AMD Smarter Choice AMD s derivative of HLSL Why Expose ATI hardware features Compiler can run in both Windows and Linux Includes everything previously discussed plus h February 1 2008 AMD HLSL AMD Smarter Choice ATI Radeon HD 3000 series exposes first double precision GPU units What s available in AMD HLSL Raw converts for doubles convert uint2 to double b February 1 2008 AMD HLSL Smarter Choice Write data to an arbitrary address 1 Declare a global buffer global float4 ltnamegt 2 Write as array global float4 resut resutltaddrgt ltvagt h February 1 2008 AMD HLSL Bottom Line Programming in shader assembly isn t fun We re used to a high level language AMD HLSL provides Linux and Windows compatibility For more info see the documentation February 1 2008 AMD HLSL AMD Smarter Choice AMD IL h February 8 2008 High Level Programming for GPGPU wa i r li F Ew i AMDIL AMD Smarter Choice Last time we introduced HLSL and the R600 ISA AMD IL is a portable immediate language that sits between high level languages Brook or HLSL and the ISA AMD IL is meant to be generation compatible st future hardware can compile from IL whereas the ISA is asic dependent February 8 2008 High Level Programming for GPGPU i quot7 3quotquot fjr39 Callie 4 Brook Kernel Generated AMD IL 5 February 8 2008 AMD Smarter Choice out float cltgt kernel void sumfloat altgt float bltgt ilps20 dc1cb cb01 dc1resourceid0type2dunnormFmtxFloatFmtyFloatFmtzFloatFmtwFloat dc1inputgenericinterplinear V0Xy dc1resourceid1type2dunnormFmtxFloatFmtyFloatFmtzFloatFmtwFloat dc1inputgenericinterplinear V1Xy samp1eresourcesamp1er0 rx v0xy samp1eresource1samp1er1 r1x v1xy00 mov PZX rxxxx mov r3x r1xxxx call 0 mov r4X r5xxxx dc1outputgeneric 00 mov 00 r4xxxx ret Func 0 add r6x r2xxxx r3xxxx mov r7x r6xxxx mov r5X r7xxxx ret end iiii m q High Level Programming for GPGPU IL code generation DX HLSL Compile to DX asm using fxc Microsoft HLSL compiler Compile DX asm to IL using AMD GPU Shader Analyzer AMD HLSL Compile AMD HLSL to IL using AMD HLSL compiler Brook Compile Brook kernels to IL using brcc Or write it yourself February 8 2008 High Level Programming for GPGPU 3W Camel 4 Writing IL code IL resembles DX assembly IL is also used for DirectX and OpenGL shaders IL code will be optimized by the GPU compiler to ISA Readability may be more important when writing IL February 8 2008 High Level Programming for GPGPU C mll ll 9lt idil Er DX asm vs IL ps40 dclinput linear v0xy ilps20 dcloutput 00xyzw dclinputinterplinear v0xy dclsampler so modedeFault dcloutputgeneric 00 dclsampler sl modedeFault dclresourceid0type2dFmtXFloatFmtyFloatFmt2FloatFmtwFloat dclresourcetexture2d C Float Float Float Float t0 dclresourceid1type2dFmtXFloatFmtyFloatFmt2FloatFmtwFloat dclresourcetexture2d C Float Float Float Float t1 sampleresource0sampler0 r0 v0xyxx dcltemps 2 sampleresource1sampler1 r1 v0xyxx sample r0xyzw v0xyxx t0xyzw 50 add 00 r0 r1 sample r1xyzw v0xyxx t1xyzw sl retdyn add 00xyzw r0xyzw r1xyzw end ret February 8 2008 High Level Programming for GPGPU 3 34 1 iiiWM Instruction syntax ltinstrgtltctrgtltctrvagt ltdstgtltmodgtltwrite maskgt ltsrcgtltmodgtltswizzemaskgt Broken down ltinstrgt ltctrgtltctrvagt instruction with control specifiers ltdstgtltmodgtltwritemaskgt destination register with modifier and write mask ltsrcgtltmodgtltswizzemaskgt source registers with modifier and swizzle mask February 8 2008 High Level Programming for GPGPU EMU Registers Registers are four component vectors v import registers o output registers r general purpose registers There are also other special enumerated registers Registers are typeless Integer instructions can operate on float data User must take care to keep track of register types and convert February 8 2008 High Level Programming for GPGPU Register modifiers Destination modifiers apply extra operations to the destination register after the instruction runs Examples ltdstgtx2 multiplies by 2 ltdstgtd4 divides by 4 Source modifiers apply extra operations to the source register before the instruction runs Examples ltsrcgtabs returns the absolute value ltsrcgtsign returns the sign February 8 2008 High Level Programming for GPGPU Ciwul 2 Write Masks Elementwise write masks on destination registers control how components are written to Syntax regX01y01z01wl01 xyzw are components underscore means don t write can also leave blank 0 or 1 replaces component with O or 1 Mask is position dependent Examples mov rOx r1 move r1x to rOx and leave rest unchanged mov rOx r1 same as previous mov rOyw r1 only write to rOy and rOw and leave rest mov r00000 r1 zero out all components mov r0xyzl r1 write all except w which changes to 1 0K February 8 2008 High Level Programming for GPGPU Er Swizzle mask Controls how the register components are used Syntax regXIYIZIWIOI1XyZW01XyZW01XyZW01 Mask is position independent Blanks mean use default component Examples mov r0 r1yxzw move r1ygtrOx r1xgtrOy r1zgtrOz r1wgtrOw mov r0 r1yx same as previous mov r0 r1xyzO standard move except force rOw to zero February 8 2008 High Level Programming for GPGPU Er Instructions Declaration and Initialization Input memory fetches Conversion General ALU instructions MathTrigSpecial Flow control Bitwise Double Comparison February 8 2008 High Level Programming for GPGPU C zml FNMA 32 Declaration and initialization Inputs and outputs must be declared constant buffers input interpolators resources texturesstreams iteras February 8 2008 High Level Programming for GPGPU Umwi i i E M i HEW Constant buffers In the past constants were passed to the shader individually Now constants are passed together in constant buffers Constant buffers elements are 4 component vectors Declaration dccb cbltgtltsizegt Buffer elements are addressed like C arrays February 8 2008 High Level Programming for GPGPU Constant literals Literals are constant values used in the source code For example int a 5 Previously constants had to be passed in like constant variables Syntax dclliteral ltxbitsgt ltybitsgt ltzbitsgt ltwbitsgt Example dclliteral l1 Ox4OAOOOOO Ox3F800000 Ox3E99999A Ox3EOF5C29 float literal float45 1 03 014 dclliteral l2 0x00000003 OxFFFFFFFF OxFFFFFFFB 0x00000007 integer literal int43 1 5 7 February 8 2008 High Level Programming for GPGPU Er Declaring memory Syntax dclresourceidntypepixtexusageunormfmtxfmt fmtyfmtfmtzfmtfmtwfmt Example dclresourceidOtype2dfmtxfloatfmtyfloatfmtz floatfmtwfloat 2D texture of float4 February 8 2008 High Level Programming for GPGPU CEWM 39W ii iu 33 Fetching from memorytextures Textures can be addressed as normalized O1 or unormalized It depends on how it was declared unorm Syntax sampleresourcensamplermaoffimmiu v w dst srcO aoffimmi apply integer offset to the input address Addressing is determined by the type of the texture 1D vs 2D AMDN Smarter Choice February 8 2008 High Level Programming for GPGPU Scratch buffers You can allocate a temporary buffer that you can address like C arrays called scratch buffers Performance can be slow but you can address using registers Example dclindextempa rray XOsize mov XOrOX r3 mov r2 XOrOy February 8 2008 High Level Programming for GPGPU General ALU instructions See the spec for a complete list of functions There are often different integer double and float instructions eg IADD DADD ADD respectively There are also special functions eg CMOV FLR There are also trig instructions eg SIN COS February 8 2008 High Level Programming for GPG PU CEWM fifm ii w iu Er Doubles Doubles are a special case because there are no native 64 bit component registers Doubles are represented as two 32bit components If srcxy contains OX4008000OOOOOOOOO then srcy OX4OO8OOOO ancl srcX OXOOOOOOOO When using double the value must be in positions srcxy See spec for the double operators February 8 2008 High Level Programming for GPGPU Er Conversion Registers are typeless so registers need to be converted from one type to another Be careful conversions only happen on the transcendental unit Valid conversions are D2F F2D FTOI FTOU ITOF 39 UTOF February 8 2008 High Level Programming for GPGPU Gama Er Flow control instructions Basic branching ifogicanz src0x execute instruction block if scr0XO ifogicaz src0x execute instruction block if src0x ifcreopop srcO srcl execute block if relop is true 0p eq I he I gtI ge gtI ltI Ie lt else endif February 8 2008 High Level Programming for GPG PU Cim39 WIFM M Er Control flow instructions cont Loops wthoop break instruction break breakogicanz srcO breakogicaz srcO breakcreopop srcO srcl endloop February 8 2008 High Level Programming for GPGPU Un temf 03 Na Comparison You can break on a register by first doing a comparison instruction first eq ge It ne February 8 2008 High Level Programming for GPGPU quotw39r i i l w d l Output Kernel outputs are written to special output registers They can only be used for output but are handled like any other register up to the maximum number AMDN Smarter Choice February 8 2008 High Level Programming for GPGPU Global gather scatter buffers Scatter is handled using a global buffer There is only one global buffer at any time so it is not declared It is addressed like scratch buffers using a C array interface Because it uses 128 bit alignment it is always treated as a float4 type so be careful when addressing Example mov grOX r0 mov r1 gr3X February 8 2008 High Level Programming for GPGPU i i f4l ifiz39 Callus 4 Lots more See the spec for more info Learn by example Look at the CAL samples Use AMD s GPU ShaderAnaIyzer to generate IL from HLSL or Brook February 8 2008 High Level Programming for GPGPU l39Iiivvrsil u Zrmm Flnrirlu A Comparison of Various MulticoreManycore Architectures Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida l G Challenges facing microprocessors Moore s law still holds Smaller transistor size More transistors 0 Power wall Voltage does not scale Power density problem 0 Memory wall Higher latency to hide 0 Frequency wall Diminishing returns in performance Worsen the power problem University of C entral Florida Multi ooreMany oore Processors General purpose multicore CPUs Graphics processors AMDATT RV670 RV770 etc NVidia G80 GTXZSO etc Cell Broadband Engine Larrabee later University of C entral Florida l G Power wall Key idea each core runs at lower voltage potentially lower frequency Dynamic power consumption is proportional to FV2 Same idea for mutlicore CPU GPUs Cell BE and Larrabee Homogeneous cores VS heterogeneous cores University of C entral Florida Memory wall 0 Multi core CPUs 7 Large onrchip caches r Onrchip memory Controllers 0 GPUS 7 rquot 1 1 1 1 1 7 Small caches 7 Software managed local shared memory or cache 7 Onrchip memory controllers Memory coalescin Ratio of number of threads over the number of cores in each SIMD engine AMD GPUs or MP Nvidia GPUs Eg 400 cycles memory access latency Warp size 52 Independent ops 2 after amemory access How many warps necessary to hide the latency 0 Cell BE Software managed shared memory or cache 7 DMA enforcedexplicit coalescing r Overlapping DMA and computation 7 Coarse grain multithreading University of C entral Florida Frequency wall 0 Multi core CPUs Balancing number of pipeline stages vs complexity Complicated branch predictor for control hazard 0 000 Renaming Bypass for data hazard Most advanced semiconductor technologies 0 GPUs Simplest pipeline small number of pipeline stages inorder scalar Nvidia GPUs VLIVV AMDATI GPUs 0 CellBE Vector pipeline a relatively high number of pipeline stages vs GPU Rely on software to handle branches University of C entral Florida Programming model Multi core CPUs 7 Shared memory model cache coherent 7 Most exible compatible with legacy code 7r 1 hing 1 The cost of communication among threads is low 7 SSE SIMD instructions for dataelevel parallelism parallelism High cost of communication among threads if not in the same thread b1 k oc 7 Data level parallelism CellBE r Explicit multithreading r VectorSIMD programming Explicit management of communication University of C entral Florida l O Key to Performance Multicore CPUs Instructionlevel parallelism exploited by hardware Threadlevel parallelism by user 0 Workload balance 0 L A amp 1 GPUS Datalevel parallelism by user Local data share or shared memory explicitly managed by user User also needs to overcome the limited HW support for inter thread communication and synchronization CellBE Datalevel parallelism by user Local storage explicitly managed by user Vectorization either by user or compiler User needs to manage DMAs and hide their latency University of C entral Florida Brook Data Types 0 Important for all data representations in Brook Streams Constants Temporary variables 0 Brook Supports Basic Types Short Vector Types UserDefined Types l G Basic Data Types 0 Basic data types oat 32bit oating point double 64bit oating point int 32bit signed integer unsigned intuint 32bit unsigned integer eg Streamlt oatgt f Streamltuintgt i Streamltdoublegt d Literal Constants for above types oat a 02f double d 03 int c 10 Basic Data Types Unsupported types Reserved for future use currently keywords char unsigned char short unsigned short long unsigned long ulong long long Ii Vector Types 0 Short vectors 2 to 4 elements inherited from 3D graphics traditionally used to represent positions x y z W or color r g b a GPU registers are 4component registers 128bit per register names built by appending the count to the type eg int2 oat4 doubles are limited to up to 2 elements Vector Types 7 Inlllallzallon ampOperallons c conslmdmslyle 1n1halxzahon u am u atAO m 2 Of 3 m 4 of anZ b 7m 21s 5 Accessto1ndxmdual eldsxsthmugh structure member syntax x y z w b mt w y b y Arlmmem operahons on operands of vector types are supported 7 equwalent m applyxng the uperamrm 25m cumpunent mdwxdually 7 h d r quxck perfunnance tumng and data cumpactmn 7 than 1 of a of m of 5 0 5 b an y m fluat b nuam k Vector Types 7 Usage Original Kernel Vecturized Kernel ma mm ma mam l muglea Lnt mm mm magma m mm munu smmmm an mum smmmam an mum by mum an mum w mum an mm mm b an mm b m l kmel hem ma summon a0 ma sumlfluatl 30 float no out oat cltgtl mm ho m am col l a a t h c a 4 b 1 Vector Types Swizzle operations Reordering components of a short vector for certain operations Nice shorthand for developers Optimization hint to compiler Brook Use Swizzle to reorder elements in operands eg temp inputyxzx temp Use Swizzle to mask elements in output eg outputxw inputx39 C Cannot use vector instructions without reordering explicitly oat temp4 0 1 2 3 yxzx temp0 inputy39 temp1 inputx39 temp2 inputz39 temp3 inputx39 l O Swizzle Examples oat4 pos oat43f 5f 2f 1f39 oat value1 posx39 value1 is 3f vec0 is 3f 5f vec1 is 5f 2f 3f 3f vec1 is 2f 2f 3f 3f oat2 vec0 posxy39 oat4 vec1 posyzxx vec1x posz39 Swizzle usage 2x2 mamx mumpocauon J J J Swizzle usage Ongmauy A B would bewntten as ua mull mum Agtlt Bgtlt Ay 5x It nu mz w Usmg swxzzles A B would be wnuen as uatQmulQ AW Bgtltygtlty Ayyww Bzwzw Userde nedtypes 0 Basic Types and Short Vectors can be aggregated using user de ned structs 0 Syntax for declaration and data access is same as C structs typedef struct PairRec oat row oat column Pair kernel void foo oat c Pair pltgt out oat bltgt b c proW pcolumn l O Userde nedtypes 0 Streams of structs are internally translated into multiple streams In previous example Pair pltgt will be translated into 2 oat streams Data processed on the hostside during Streamread Onetime Overhead Might be better to use oat2 Stream in this case Type checking brcc performs strong typechecking Operands must have same 0 Variable type 0 Number of components short vectors 0 Constant type literal constants Implicit conversions are NOT allowed 0 Initializations Assignments LHS and RHS should have same type 0 Arithmetic Expressions Typepromotion not done with strong typechecking Relational operations on vector types use only x component l O Standard Streams Standard streams passed using open brackets ltgt Positions of Input and output elements are implicit and predictable eg kernel void sum oat altgt oat bltgt out oat cltgt cab Stream rank Need to be speci ed explicitly during creation allocation Up to 3diJnensional streams supported int dim22 Width height Streamlt oatgt str2D2 dim2 int dim33 Width height depth Streamlt oatgt str3D3 dim3 Rank not speci ed when used in Kernel kernel void oat str2Dltgt oat str3Dltgt out str1Dltgt Stream Dimensions and Ranks should match for well de ned results 12 and prior supports Stream resize 13 runtime will issue a warningerror on mismatch l O Arbitrary Reads Gather Implicit addressing mechanism is restrictive for many interesting applications Desirable to arbitrarily address streams Access is called a gather operation in stream terminology Gather streams Can be arbitrarily addressed Are not subject to alignment like input streams can have arbitrary dimensions relative to output stream Declared using square brackets H as kernel arguments Declaration and indexing 1within the kernel uses C array syntax Stream declaration outside the kernel does not change Need to specify dimensionality like C arrays ie for 1D and for 2D streams Gather streams kernel void simplesatheri int indexov float dataUr ant Eluat resulto result dzcallndexll l Streamo llt indigesii iceunm Streamdluat derail mum Streallelaat resulm Leduntll quot393quot simplecathen in ices data xesult xesuitli deraiindsceslslll Gather streams Dimensionallty needs to be specified In kernel arguments 29 ll ii for 2D and il ll ii rural gather indexed using Crstyle indexing kernel vuxd gatbexidixectil lfluatZ index Elna aim we float is gt lt l i I auxlager indexex 1 quotmg a I 1 l can have arbitrary number of gather operations in a kernel Also referred m as arrays for nbvious reasons index shnuld be uf integer type with c style numbering 0 7 float indexing allowed fur backward compau h y mlght be deprecated in future releases Array dimensions Array Allocation order same when using vector types unsigned int dims4 x y z W Array Reference order same when using scalar types MW 2 y X Use of gather streams is more popular in reallife applications Generalpurpose Better control over memory access patterns l O nstance In C applications explicitly control the current x y index being processed forinty 0 y lt N y forintx 0 x lt M x oat temp Ayx In Brook kernels are executed for each element in output stream ianlicitly over the execution domain Querying the current kernel instance being processed is useful in various cases instance intrinsic returns an int4 vector for current kernel instance For a given stream rank invalid components are set to 0 Similar to pthreadiself with a spatial context Stream outputs C Writing to arrays is very exible out0x y value out1 y x value2 Write to output streams on GPUs has been quite restricting Traditionally restricted to writing pixel values to a single color buffer Output position of the output result was also xed Brook Write to output stream at the implicit position kernel void copy oat altgt out oat bltgt ba Ii Multiple outputs 0 Legal to output multiple streams from a kernel Simply specify multiple out streams e g kernel void multiiout oat altgt out oat bltgt out oat cltgt out oat dltgt b a a a c a a a d a a a 0 Results undefined if dimensions of output streams differ Dimensions of first output stream in argument list is used for domain execution Multiple output streams 0 Current AMD GPUs limit maximum outputs to 8 Brook allows exceeding the GPU limits by using a multipass algorithm 0 Existing kernel is split into multiple kernels 0 Each subkernel is executed in a serial manner eg Following kernel is legal in Brook kernel Void foo oat inplt gt out float b1ltgt out oat b2ltgt out float b3ltgt out oat b4ltgt out float b5ltgt out oat b6ltgt out float b7ltgt out oat b8ltgt out float b9ltgt out oat b10ltgt bl inp b2 inp b3 inp b4 inp b5 inp b6 inp b7 inp b8 inp b9 inp b10 inp l O Arbitrary writes Scatter 0 Legal to output to arbitrary address from a kernel 0 Output stream needs to be declared and indexed as a scatter stream using the C array syntax Similar to gather streams eg Following kernel computes an output steam based on the addresses given in the index stream kernel void scatter oat indexltgt oat altgt out oat4 b bindex a Scatter Currently scatter operations are slower than normal output streams GPU requires stream elements to be 128bit aligned notice the use of oat4 type in previous examp e Brook supports all data types using appropriate conversions Only works on R670 and later Does not work on R600 HD 290 Allows arbitrary number of writes in a kernel kernel void multipleiwrites oat altgt out oat4 int2 ind instancexy bindx a bind y a Order of writes not guaranteed in case of same output address for multiple kernel instances l O Domain of Execution Corresponds to the number of kernel instances created by the runtime Maps directly to number of elements in output stream However if output stream is a scatter stream No necessary correlation between output stream dimensions and domain of execution Eg output domain might be larger than execution domain Kernel Interface allows explicit control over execution domain brcc generated Kernel prototype derives from Kernellnterface class with operator overridden Kernellnten ace class class hmlmmxfam c public mid mmisaamuinu uE ut 39 mm 1112 u n mid dmainsimuintl aka 1 my 512 512 1 1 7 l Ishtar dmamf set offaet i Exew on Immendmainliletaila Domaln a mtter ndax i b i 00 Streams Data layout int Width 8 Height a float altHEight w1dthgt float Ln 111 M 3m 4m sn 60 Each element has 1 oat va ue Mn 1 21 3 4 51 61 CSty e LINEAR 04 11 24 34 44 54 54 74 He39ght5 o arrangement 1n r wvm 39 r rder Data layout of streams of vector types Ant dAmSH 2 a St reanKElaatd am dms n nah Each eternen has 4 oat values CStyle mam m mory arrangement m rowma or rder I Thread Creation Thre d re Created dLII39Ing rastenzation or domam Wavefront Z Quads Quad 2x2 Threads Nanaanve threads 1 a dumam gt Idle 5w Order of thread creation 39 5 enzatlon order no disclosed Based on mummies at 8x8 marks sue 01w valmnt Mammals Thread execution RV770 structure Ultra rreaded Drspaxch Processor pr m r pr m cm Stream Processors E r r r r t i i l I Thread execution thread processing All thread processors within a SIMD engine execute the same instruction for each cycle In a thread processor up to four threads can issue 4 VLIW instructions over four cycles RV 670 and RV 770 16 thread processors execute the same instructions with each thread processor processing four threads gt appear to be a 64wide SIMD engine called a wavefmunt SIMD engines operate independently of each other l O Flow control Branches If a kernel has a branch with two paths wavefront executes first and then second Branch Granularity Number of threads executed during a branch Branch Granularity Wavefront size Loops Total execution time for wavefront Time of longest loop Other SIMDs execute independently in case of branches and loops Constants 0 All nonstream kernel arguments are treated as constants Value remains same for all kernel instances in the domain Only basic and short vector data types are supported All other operations like swizzle are valid kernel void scale oat k1 oat2 k2 oat4 k4 oat altgt out oat bltgt b a k1 k2y k4w l O Cached arrays 7 Used for various constants to be used in cornputatl Cached arrays can be used with brcc and runtime r Kernel hints on array size 7 speci ed using literal constants for array size 0 GPUs have special caches for small readonly arrays constant buffers on kernel void cachedArray oat data1024 out float resultltgt r Runtirne allows directly passing the array pointer to kernel invocation routine float data getConstantData cachedArraydata outStrearn ifout5trearn7gterror cerr ltlt llError in output stream ltlt outstrearn7gterrorLog0 ltlt endl i Cached Arrays Can rasun m srgnmaam performance rmprayamams for specxhc applxcahons uka Kernel lters Tramferfunctmns ms per cached cached arrays aua Array ma sho CPU 35erch 1mum cache sxze apply 4a a elem array wed by a uld ba completely de Max in raah med at camprla uraa kernel vmd 252 an data1024 nut uatresultltgt ERROR Fmblem mm Array yarrabxa declarahun E Reduction 7 Motivation gazilizis ny Common in various 39 applications like financial alg n n nu H Rama m o 39t ms a merical simulations r mnan Element float m n makruarq mama Eg Average the Se 039 values cnmvuted lmm random samples h get one Drlue Reduction Frovlde a d value from 7 Examples ufreductmn aperaudns lnclu emaxlmum medran anthmetlc sum matnx prdduec etc ngle elementurlesser elements atarparallel method for computlng a slngle a sel of records a b a a Output speeuled uslng reduee eyword Kernel use reduee modlfler mslead of kernel keyword Legal lo read and wrlle to reduce parameter durlng eompulallon IE Reduction exam ple mo declare may mm mm iuml um mum float mum mn nm sulto y l new aver mum Alumnti quotmt K m a n um rum gt arure 1 1 e an r M t mm 1 in mm anmflaab mm mu me res sum mum rem rum e mu r u um 1 n Reduction Number of input elements used to produce output elements determined by number of output elemen s 7 l 39 pa esoverthelz I dd YK39I imninlll t WA gt Dimensi n Redm on e g uint in2 100 200 uint out 100 sueamlt oatgt s2 kin Iedmed to sueamlt oatgt t1 ampout Valid to pass sca1ar e1ement as output stieam to nintime 7 It stream factor between dimensions is use gt Partia1 Reduction eg uint out22 100 50 Sueamlt oatgt s2 ddn Iedmed to Streamlt oatgt tlt2 ampout2gt Calling a reduce kernel 0 Apply the kernel to 11 input elements to compute one output elements where nreducelucmr reduceFactor output stream input stream reduce V ld sum tuna do reduce float c0 Awly the we h lmamlnn ml a em lmamlun Wm Radium go Dwsab e Address Wmahzaoun mm x a o b Rainy w 3D u xny mom Kym m 30 mxowxyzm Wammg as Ermrs M 111 a my u gum Me an He p msamg Addras Wmahzaoan mum Lx zzm mom Lszm Wammg as Ermrs an an y m an u me am Hdp Snurce Cede roman Ends Fumiat i agave Addrm Wrmahaoun v Wammg ms r Wammg as Ermrs xxx pss ma Dion mrcz i A A mm mm acmede A A A NA NA NA NA A A A NA NA NA WA A A A NA NA NA NA 0 u 7 2 no 1 no a m We eads SK E I an mmm39 r u u n 251 ma Giabalwnte 4m m r mea sa a u 7 mm mm Glabaik mte am amuMwem sK a u n ma zsu alabahhne z u 5 DEM 33sz Matrix Multiplication Rmxyu 1 Eu momxyzm Rxxyxx w 3D mmommyzm a x RELx 00 gm mm mm m Jun 1 lt mam um um i armed m hm imam i mum mu um M 235 727 y x D oxninimmulirx u Emislzmai
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'