SPECIAL TOPICS CDA 6938
University of Central Florida
This 40-page set of class notes was uploaded by Genoveva Bogisich on Thursday, October 22, 2015. The notes belong to CDA 6938 at the University of Central Florida, taught by Staff in Fall. Since its upload, it has received 23 views. For similar materials see /class/227526/cda-6938-university-of-central-florida in Computer Design Architecture at the University of Central Florida.
AMD: The Smarter Choice

Advanced R7xx Features: Compute Shaders
- A more general approach to GPU compute: removes graphics-centric terminology and ideas.
- Exposes the GPU as an array of parallel processing elements; removes the graphics pipeline from the picture (no PS/GS/VS).
- Disconnects the output domain from the execution domain: read anywhere, write anywhere.
- Global Buffer: linear memory format.
- Gives the kernel writer more control over thread execution and the corresponding optimizations.

Compute Terminology
- Thread: a single invocation of a kernel.
- Group: a set of threads that can share data and run together on a single SIMD. Multiple groups can run on one SIMD if registers allow.
- Wavefront: a group of 64 threads running concurrently on a SIMD (16 SPs x 4 cycles).
- Neighborhood: a group of 4 threads in the same wavefront with consecutive thread IDs (Tid).

Using Compute Mode in IL
- Header: il_cs_2_0 instead of il_ps_2_0.
- Number of threads per group: dcl_num_thread_per_group 64
- New indexing values (no more vPos/vWinCoord):
  - vTid: ID of a thread within a group
  - vaTid: ID of a thread within the domain (absolute)
  - vTgroupid: ID of a group within the domain
- Example (group size 64, so l0.x = 6 and l1.w = 0x3F):

    ishr r0.x, vaTid0.x, l0.x   ; group ID: upper 26 bits of vaTid0
    and  r0.y, vaTid0.x, l1.w   ; Tid within the group: lower 6 bits of vaTid0

Using Compute Mode in CAL
- New entry point: calCtxRunProgramGrid, the routine to launch a kernel in compute mode; exposed as a CAL extension.
- New domain-specification mechanism, CALprogramGrid, which specifies the parameters for a kernel launch:

    struct CALprogramGrid {
        CALfunc     func;       /* CALfunc to execute           */
        CALdomain3D gridBlock;  /* size of a block of data      */
        CALdomain3D gridSize;   /* number of 'blocks' to run    */
        CALuint     flags;      /* misc grid flags              */
    };

- Example setup:

    CALprogramGrid pg;
    pg.func             = func;
    pg.flags            = 0;
    pg.gridBlock.width  = 64;   /* same as the block size declared in the kernel */
    pg.gridBlock.height = 1;
    pg.gridBlock.depth  = 1;
    pg.gridSize.width   = (1024 * 1024 + 63) / pg.gridBlock.width;
    pg.gridSize.height  = 1;
    pg.gridSize.depth   = 1;

    /* Get the function pointer for the CAL extension */
    calExtGetProc((CALextproc*)&calCtxRunProgramGrid,
                  CAL_EXT_COMPUTE_SHADER, "calCtxRunProgramGrid");
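The ishr/and idiom above simply splits a flat thread ID into a group ID and an in-group ID. As a minimal CPU-side sketch of the same arithmetic (plain C++; the helper names are illustrative, assuming the group size of 64 declared with dcl_num_thread_per_group):

```cpp
#include <cassert>
#include <cstdint>

// Group size of 64 => shift by 6 bits (l0.x) and mask the low 6 bits (l1.w).
constexpr uint32_t kGroupShift = 6;     // log2(64)
constexpr uint32_t kGroupMask  = 0x3F;  // 64 - 1

// Equivalent of: ishr r0.x, vaTid0.x, l0.x
uint32_t group_id(uint32_t absolute_tid) { return absolute_tid >> kGroupShift; }

// Equivalent of: and r0.y, vaTid0.x, l1.w
uint32_t local_tid(uint32_t absolute_tid) { return absolute_tid & kGroupMask; }
```

For example, absolute thread 70 falls in group 1 with local ID 6, and thread 63 is the last thread of group 0.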
- Launch the kernel in compute mode:

    calCtxRunProgramGrid(&event, ctx, &pg);

Using Compute Mode: Key Items to Remember
- Output resources must be Global Buffers (only 1 supported).
- Cache characteristics differ from regular kernels because of the different execution order. E.g., for the matrix-multiply (MMM) algorithm implemented with 8 MRTs in CAL:
  - PS, 8 MRT:      393 Gflops
  - PS, MemExport:  393 Gflops
  - CS, MemExport:  222 Gflops
- R7xx supports only linear thread dispatch; true 3D grid blocks will be available only with future hardware.
- For R7xx, gridBlock.width must equal dcl_num_thread_per_group.

[Figure: AMD RV770 (Radeon HD 4800 series) block diagram: geometry/vertex shaders, interpolators, thread dispatch processor, SIMD arrays with per-SIMD shared memory, texture units, and per-SIMD L1 cache.]

Data Sharing

Local Data Share (LDS)
- 16 KB of on-chip memory per SIMD, shared between the threads in a block.
- "Write local, read global" system: shared between all threads in a block; synchronization required.

Shared Registers (SR)
- Globally shared registers: registers that are global to a SIMD.
- Shared between all wavefronts on a SIMD (column sharing on the SIMD).
- Persistent registers; atomic read-modify-write within the same instruction is guaranteed.

Using LDS in IL
- Size of LDS memory to be used in a shader, in dwords: dcl_lds_size_per_thread n (n < 64 and a multiple of 4).
- LDS memory sharing: dcl_lds_sharing_mode mode, where mode is one of:
  - wavefrontRel: relative, i.e. each wavefront has its own private LDS memory
  - wavefrontAbs: absolute, i.e. all wavefronts share the same piece of LDS memory

Using LDS in IL: Reading LDS Memory

    read_lds [_neighborExch] [_sharingMode] dst, src0.xy

- The LDS location is given by src0.xy, where src0.x = Tid and src0.y = offset; dst can be any register.
- Options/flags:
  - sharingMode(rel) or sharingMode(abs): relative or absolute sharing mode.
  - neighborExch: if specified, the LDS output is exchanged with neighboring threads, such that the first thread gets all values from the x channels, the second thread gets all values from the y channels, and so on. This flag is useful for applications such as FFT and matrix transpose.
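The neighborExch behavior described above, read over a 4-thread neighborhood, is a 4x4 transpose: thread i ends up with channel i from each of its four neighbors. A CPU sketch of that data movement (names are illustrative; this only models the exchange, not real LDS hardware):

```cpp
#include <array>
#include <cassert>

// Each of the 4 neighborhood threads holds a 4-channel value (x, y, z, w).
using Vec4 = std::array<float, 4>;

// After the exchange, thread 0 holds the x channels of all four threads,
// thread 1 the y channels, and so on: a 4x4 transpose.
std::array<Vec4, 4> neighbor_exchange(const std::array<Vec4, 4>& in) {
    std::array<Vec4, 4> out{};
    for (int t = 0; t < 4; ++t)        // destination thread = channel index
        for (int n = 0; n < 4; ++n)    // source neighbor
            out[t][n] = in[n][t];
    return out;
}
```

This is why the flag is a natural fit for FFT butterflies and matrix transpose, where exactly this channel-to-thread shuffle is needed.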
Using LDS in IL: Writing LDS Memory

    write_lds [_lOffset(n)] [_sharingMode] dst, src

- src can be any register; the write location is fixed to (Tid, offset).
- dst must be of type IL_REGTYPE_GENERIC_MEM; it is used only to provide the write mask.
- Options/flags:
  - sharingMode(rel) or sharingMode(abs): relative or absolute sharing mode.
  - lOffset(n): if not specified, offset = 0. n must be a multiple of 4 in the range 0-60, and smaller than the declared lds_size_per_thread.

Synchronization: fence
- A synchronization mechanism for threads within a group: no thread in the group may pass that point until all threads reach it.
- Also disallows compiler optimizations around that point.
- The fence instruction has four flags; at least one must be present, and they may appear in any order:
  - lds: for LDS accesses
  - threads: for thread synchronization
  - memory: for non-LDS memory accesses
  - shared: for SR accesses

Q&A and Recap
- RV770 new features: Compute Shaders and data-sharing mechanisms.


Patterns for Parallel Programming
School of Electrical Engineering and Computer Science, University of Central Florida

Textbook: T. Mattson, B. Sanders, and B. Massingill, "Patterns for Parallel Programming", Addison-Wesley, 2005, ISBN 0321228111.

First of all
- Is the problem large enough, and the results significant enough, to justify the effort to solve it faster?
- If so, what are the most computationally intensive parts, and does speeding them up provide sufficient performance gains (i.e., Amdahl's law)?

Overview
- Exposing exploitable concurrency: Finding Concurrency.
- Structuring the algorithm to take advantage of it: Algorithm Structure.
- Structuring the program and shared data: Supporting Structures.
- Mapping to a particular programming environment: Implementation Mechanisms.

Finding Concurrency
- Group Tasks, Order Tasks, Data Sharing; Task Decomposition, Data Decomposition.
- These feed into the lower layers: Algorithm Structure, Supporting Structures, Implementation Mechanisms.
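Amdahl's law, mentioned above, bounds the overall speedup by the serial fraction: speedup = 1 / ((1 - p) + p/s), where p is the fraction of the runtime that can be accelerated and s is the speedup of that part. A quick sketch (function name is illustrative):

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: overall speedup when a fraction p of the runtime is
// accelerated by a factor s and the remaining (1 - p) stays serial.
double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```

For example, with 90% of the work parallelized (p = 0.9) and a 10x kernel speedup, the whole program speeds up only about 5.3x; even as s grows without bound the speedup is capped at 1 / (1 - p) = 10x.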
Decomposition Patterns
- Task decomposition: view the problem as a stream of instructions that can be broken into sequences, called tasks, that can execute in parallel. Key: independent operations.
- Data decomposition: view the problem from the data perspective and focus on how the data can be broken into distinct chunks. Key: data chunks that can be operated upon independently.
- Task and data decomposition imply each other; they are different facets of the same fundamental decomposition.

Example: matrix multiplication
- Task decomposition: treat the computation of each element of the product matrix as a separate task. This performs poorly on its own, so apply the group-tasks pattern.
- Data decomposition: decompose the product matrix into chunks, e.g. one row per chunk, or a small submatrix (block) per chunk.

Dependency-analysis patterns
- Group Tasks: group tasks that have the same dependency constraints, and identify which tasks must execute concurrently.
  - Reduced synchronization overhead: all tasks in the group can use one barrier to wait for a common dependence.
  - All tasks in the group can efficiently share data loaded into a common on-chip shared storage (shared memory).
  - Grouping and merging dependent tasks into one task reduces the need for synchronization.
- Order Tasks: identify ordering constraints among task groups.
  - Control dependency: find the task group that creates it.
  - Data dependency: a temporal order for a producer/consumer relationship.
- Data Sharing: how data is shared among the tasks.
  - Read-only: tasks make their own local copies.
  - Effectively local: the shared data is partitioned into subsets, each of which is accessed (for read or write) by only one task at a time.
  - Read/write: the data is accessed by more than one task, so exclusive-access mechanisms are needed. Example: the use of shared memory among the threads in a thread block.
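The row-per-chunk data decomposition described above can be sketched on the CPU with one thread per band of rows (illustrative C++, not the course's GPU code; the function and band-count parameter are assumptions for the sketch):

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// C = A * B for n x n matrices (row-major), decomposing C into horizontal
// bands of rows; each band is an independent task with read-only access to
// A and B and exclusive write access to its own rows of C.
void matmul_banded(const std::vector<double>& A, const std::vector<double>& B,
                   std::vector<double>& C, int n, int nbands) {
    auto band = [&](int lo, int hi) {            // computes rows [lo, hi)
        for (int i = lo; i < hi; ++i)
            for (int j = 0; j < n; ++j) {
                double acc = 0.0;
                for (int k = 0; k < n; ++k) acc += A[i * n + k] * B[k * n + j];
                C[i * n + j] = acc;
            }
    };
    std::vector<std::thread> pool;
    int rows = (n + nbands - 1) / nbands;        // rows per band (ceiling)
    for (int lo = 0; lo < n; lo += rows)
        pool.emplace_back(band, lo, std::min(n, lo + rows));
    for (auto& t : pool) t.join();
}
```

Because no two bands write the same element of C, the tasks need no synchronization beyond the final join, which is exactly the "effectively local" data-sharing case above.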
Design Evaluation Pattern
- Does the partition fit the target hardware platform? Key questions to ask:
  - How many threads can be supported? How many threads are needed?
  - How are the data structures shared?
  - Is there enough work in each thread between synchronizations to make parallel execution worthwhile?

Algorithm Structure
- The Algorithm Structure layer sits between Finding Concurrency (task/data decomposition, event/condition analysis) and the Supporting Structures and Implementation Mechanisms layers.

Organizing Principle (decision tree)
- Organize by task: linear -> Task Parallelism; recursive -> Divide and Conquer.
- Organize by data: linear -> Geometric Decomposition; recursive -> Recursive Data.
- Organize by data flow: regular -> Pipeline; irregular -> Event-Driven.

Task Parallelism Pattern
- After the problem is decomposed into a collection of tasks that can execute concurrently, how do we exploit this concurrency efficiently?
- Load balancing.

Divide and Conquer Pattern
- If the problem is formulated using the sequential divide-and-conquer strategy, how do we exploit the potential concurrency?

[Figure: divide-and-conquer tree: a problem is split into subproblems, subproblems are solved recursively down to base cases, and subsolutions are merged into the final solution.]

Divide-and-Conquer Pattern: sequential code (cleaned up from the slide):

    func solve(Problem) returns Solution;      /* a solution stage       */
    func baseCase(Problem) returns Boolean;    /* direct-solution test   */
    func baseSolve(Problem) returns Solution;  /* direct solution        */
    func merge(Solution[]) returns Solution;   /* combine subsolutions   */
    func split(Problem) returns Problem[];     /* split into subproblems */

    Solution solveProblem(Problem P) {
        if (baseCase(P))
            return baseSolve(P);
        else {
            Problem  subProblems[N];
            Solution subSolutions[N];
            subProblems = split(P);
            for (int i = 0; i < N; i++)
                subSolutions[i] = solveProblem(subProblems[i]);
            return merge(subSolutions);
        }
    }

Divide-and-Conquer Pattern: parallelization strategy
- The recursive calls on independent subproblems can run concurrently.
[Figure: parallelization-strategy diagram (unreadable in the source).]
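A runnable instance of the skeleton above: a divide-and-conquer array sum where baseCase is a small range, split halves the range, and merge adds the two subsolutions. One recursive call is forked with std::async to show the parallelization strategy (names and the cutoff of 1024 are illustrative choices, not from the slides):

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Divide-and-conquer sum over v[lo, hi): baseSolve = sequential accumulate,
// split = halve the range, merge = add. The left half runs as an async task
// while the right half runs in the current thread.
long long dc_sum(const std::vector<int>& v, size_t lo, size_t hi) {
    if (hi - lo <= 1024)                                    // baseCase / baseSolve
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0LL);
    size_t mid = lo + (hi - lo) / 2;                        // split
    auto left = std::async(std::launch::async, dc_sum, std::cref(v), lo, mid);
    long long right = dc_sum(v, mid, hi);
    return left.get() + right;                              // merge
}
```

The two subproblems touch disjoint ranges, so they need no synchronization other than the final merge.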
Geometric Decomposition Pattern
- How to organize the algorithm after the data has been decomposed into concurrently updatable chunks?
- Decompose so as to minimize data communication and dependencies among tasks.
- Care needs to be taken when updating non-local data, e.g. exchange operations.

Recursive Data Pattern
- Suppose the problem involves an operation on a recursive data structure that appears to require sequential processing. How can the operations on these data structures be made parallel?
- Check whether the divide-and-conquer pattern works; if not, the original algorithm may need to be transformed.
- Example: finding the root of each node in a forest.
[Figure: finding roots in a forest.]

Supporting Structures
- Sits between Finding Concurrency / Algorithm Structure and Implementation Mechanisms.
- Program structures: SPMD, Master/Worker, Loop Parallelism, Fork/Join.
- Data structures: Shared Data, Shared Queue, Distributed Array.

Relationship between Supporting Program Structure patterns and Algorithm Structure patterns
- The slide tabulates how well SPMD, Loop Parallelism, Master/Worker, and Fork/Join support each of: Task Parallelism, Divide and Conquer, Geometric Decomposition, Recursive Data, Pipeline, and Event-Based Coordination. (The suitability ratings did not survive extraction.)

Relationship between Supporting Program Structure patterns and programming environments
- Likewise for SPMD, Loop Parallelism, Master/Slave, and Fork/Join against OpenMP, MPI, Java, Brook, Cell, and CUDA. (Ratings lost in extraction.)

Implementation Mechanisms
- The lowest layer, beneath Finding Concurrency, Algorithm Structure, and Supporting Structures.


Stream Computing Using Brook+
School of Electrical Engineering and Computer Science, University of Central Florida
(Slides courtesy of P. Bhaniramka)

Outline
- Overview of Brook+
- Brook+ software architecture: compiler, runtime
- Brook+ kernel development: performance analysis, performance optimization, debugging

Brook
- Brook is an extension to the C language for stream computing:
  - syntax to represent and manipulate streams
  - syntax for kernel specification
- Originally developed by Stanford University.
- Brook+ is AMD's implementation of the BrookGPU spec on AMD's CAL compute abstraction layer.
- Fixes and optimizations; many changes in the 1.3 release.

Design Goals
- Open standards / open source: an existing initiative with multiple users; cross-platform support.
- Simple and easy to use: attractive for application development and platform adoption, with low-level tools accessible for further enhancements.
- Portability and performance: BrookGPU worked with multiple backends (DirectX 9, OpenGL, CPU), running on both ATI and nVidia GPUs.
- Interoperability with other tools: necessary for production-quality software; fits in the Stream SDK stack; C++ application compatibility (1.3 release).

Programming Model
[Figure: host application loads/initializes input data, copies input data from host to GPU memory, runs kernels, and copies output from GPU memory back to the host.]
[Figure: programming-model example with host code and a GPU kernel (unreadable in the source).]

- Brook+ programs consist of:
  - Host-side code: application data management, stream allocation and data transfer, sequential computation.
  - GPU-side kernels: data-parallel code that reads input streams, performs parallel computations, and writes output streams.

Program structure: GPU kernel
- Written in a separate file, usually .br.
- Compiled using the Brook+ compiler, brcc, which generates GPU-specific binary code and a header file to be included on the host side for kernel invocation.

Program structure: host side
- Written in regular C/C++ and compiled/linked with a regular C/C++ compiler (the default is C++).
- Must include the brcc-generated header.
- Typical flow: initialize streams, invoke kernels, then read results back into a host buffer.

Brook+ Kernels
- Computational routines that operate on streams. One instance is executed for each element in the domain of execution (aka a thread).
  - Implicit: elements in the output stream.
  - Explicit control: the domainSize and domainOffset functions.
- Internally compiled into a GPU-specific binary.
- Invoked by the host app as a regular function call; kernels
accept streams as read-only inputs and write-only outputs, plus additional constants as read-only inputs.

Kernel Syntax
- Kernels look like C routines, with minor differences: the kernel keyword, extensions for the stream computing model, and restrictions imposed by it.

    kernel void name(T arg0, T arg1, ..., T argn_1);

    e.g.:
    kernel void sum(float a<>, float b<>, out float c<>) {
        c = a + b;
    }

Kernel syntax: similarities with C
- Function declarations: kernel void foo(float a, float b<>, out float c<>)
- Built-in types: float, int, unsigned int, double.
- Variable declarations and usage (e.g. int k = 9;), block and scoping rules, arithmetic operators, assignment, expressions, etc.
- Array declaration, data organization, and array data access (e.g. f = b[i][j];).
- Comments (C-style and C++-style), comparison and conditional expressions.
- User-defined functions (no recursive functions yet).
- A subset of C's flow-control constructs: do, while, for, if, break, continue. Not supported: goto, switch, structs.

Kernel syntax: extensions
- kernel keyword to distinguish kernels from CPU routines.
- Stream syntax <> with implicit addressing for reads and writes:

    kernel void foo(float a<>, float b<>, out float c<>) { c = a + b; }

- out keyword for explicit output streams; implicit addressing for read and write operations.
- Strict type checking; implicit conversions are not allowed: float a = 5.0; is illegal, you must explicitly write 5.0f.
- Vector data types and vector math operations (more details later).
- Function-overloaded math routines (more details later).

Kernel syntax: restrictions
- Writing to static or global variables is not allowed inside kernels; no data sharing among multiple kernel invocations (as of now):

    float a;                       /* global variable */
    kernel void bar(float b<>, out float c<>) {
        c = a + b;                 /* illegal: a is not defined in local scope */
    }

- No memory allocation inside kernels; no pointer arithmetic (as of now); recursion is not allowed (as of now).
- Calling non-kernel functions from a kernel is illegal:

    void foo() { ... }             /* non-kernel CPU routine */
    kernel void bar(float a, float b<>, out float c<>) {
        foo();                     /* illegal: calling a CPU routine from the GPU */
    }
- Kernels invoked from the application have a void return type. Non-void kernels are meant as GPU-local subroutines, callable only from other kernels:

    kernel float foo(...) { ... }
    kernel void bar(float a<>, float b<>, out float c<>) {
        c = foo(...);
    }

Kernel syntax: arguments
- Valid arguments: streams, outputs, constants.
- Outputs (write-only; minimum 1 required). Initialize an output first if you want to use it as a temporary variable:
    kernel void foo(out float c<>);
    kernel void bar(out float c[]);
- Inputs (read-only):
    kernel void foo(float a<>, out float b<>);
    kernel void bar(float a[], out float b<>);
- Constants: single variables or fixed-size arrays (read-only):
    kernel void foo(float a, out float c<>);
    kernel void foo(float a[10], out float c<>);

Streams
- A collection of data elements of the same type that can be operated on in parallel; essentially an array of elements of specified dimensions.
- Data organization is C-style, in row-major order; structs can be used for user-defined types.
- Declaration is similar to C arrays but with angle brackets, and associates a name, a type, and a shape:

    type name<n>;      /* 1D stream of 'type' with n elements     */
    type name<n, m>;   /* 2D stream of 'type' with n x m elements */

    e.g.:
    float a<10>;       /* 1D stream, one 32-bit float per element, 10 elements      */
    double b<10, 10>;  /* 2D stream, one 64-bit double per element, 10x10 elements  */

Streams: kernel usage interface
- Angle brackets for implicit addressing: declaration requires name and type (type name<>); access uses only the variable name (a = name;).
- C-array-like syntax for explicit addressing: declaration requires name, type, and rank (type name[][] for a 2D stream, type name[] for a 1D stream); access uses C-array-style references (a = name[i][j];).

Stream host-side interface

    namespace brook {
        template<class T>
        class Stream {                 /* templated C++ class */
        public:
            /* constructor */
            Stream(unsigned short rank, unsigned int* dimensions);
            /* operators for reading and writing Stream contents */
            void read(const void* ptr);
            void write(void* ptr) const;
            /* query errors and further information */
            BRerror error();
            const char* errorLog();
        };
    }

Streams host-side interface: Declaration, Allocation,
and Deallocation
- Requires type, rank, and dimensions:

    unsigned int n = 10000;                       /* 1D double */
    Stream<double> s1(1, &n);
    /* or: */
    Stream<double>* s = new Stream<double>(1, &n);

    unsigned int dims[2] = {1024, 1024};          /* 2D float */
    Stream<float> s2(2, dims);
    /* or: */
    Stream<float>* t = new Stream<float>(2, dims);

    /* similarly for other ranks and types */

Stream host-side interface: data initialization and retrieval
- void Stream::read(const void* ptr): initializes the stream's contents with data from the given pointer (CPU-to-GPU data transfer). ptr must be at least as big as the stream size, given by (size of element) x (number of elements). E.g., for the 1024x1024 float stream above: bytes = sizeof(float) * 1024 * 1024 = 4 MB.
- void Stream::write(void* ptr) const: fills ptr with the data contained in the stream (GPU-to-CPU data transfer). ptr must be at least as big as the stream size.

Streams: data layout

    int width = 8, height = 5;
    float a<height, width>;

[Figure: each element holds one float; elements are arranged in C-style linear (row-major) order, height 5 by width 8.]

Stream error handling
- BRerror Stream::error(): checks whether an error occurred during processing of this stream. Returns an error-code enum for the first error that occurred (BR_NO_ERROR if none; see ErrorCodes.h for the full list). The error flag is cleared after the call.
- const char* Stream::errorLog(): returns a NUL-terminated string of log messages; information from the first error onward is accumulated.

Brook+ Architecture
- Consists of two components:
  - Brook+ compiler (brcc): generates GPU-specific device code, CPU emulation code, and compiler/runtime interface code; based on the "cTool" open-source C parser.
  - Brook+ runtime (brt): responsible for loading and executing compiled kernels, data management, and stream memory handling; supports multiple runtime backends.
- Originally supported runtimes include the OpenGL and CPU backends.
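The row-major layout described above means element (row, col) of a height x width stream lives at flat offset row * width + col. A tiny sketch of that index arithmetic (helper name is illustrative):

```cpp
#include <cassert>

// Flat offset of element (row, col) in a C-style row-major array with the
// given width; this matches how Brook+ lays out 2D stream data.
unsigned flat_index(unsigned row, unsigned col, unsigned width) {
    return row * width + col;
}
```

With the slide's width of 8, element (1, 4) sits at offset 12, and the last element of the 5x8 stream, (4, 7), sits at offset 39.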
- The CAL backend is added and supported by AMD's Brook+.

[Figure: Brook+ development flow: .br source compiled by brcc into host code and device code, linked against the Brook+ runtime.]

Brook+ Runtime
- The Brook+ runtime supports multiple backends and lets the user select a particular backend at run time by setting the BRT_RUNTIME environment variable:
  - BRT_RUNTIME=cpu for CPU emulation mode
  - BRT_RUNTIME=cal for the CAL backend (the default with Brook+)
- The CPU backend is very useful for debugging kernels; the CAL backend provides high performance (more detail later).
- If the specified backend fails to load, Brook+ defaults to the CPU backend.

Brook+ Execution
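The backend-selection rule above (read BRT_RUNTIME, otherwise use the default) can be sketched in ordinary C++. The function name is hypothetical, and real Brook+ additionally falls back to the CPU backend when the chosen backend fails to load:

```cpp
#include <cstdlib>
#include <string>

// Pick a backend name the way the Brook+ runtime consults BRT_RUNTIME:
// use the environment variable if it is set and non-empty, otherwise
// fall back to the "cal" default.
std::string select_backend() {
    const char* env = std::getenv("BRT_RUNTIME");
    return (env && *env) ? std::string(env) : std::string("cal");
}
```

Running an application with `BRT_RUNTIME=cpu ./app` would therefore route kernels through CPU emulation, which is the convenient path for debugging.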