SPECIAL TOPICS CDA 6938
University of Central Florida
This 77 page Class Notes was uploaded by Genoveva Bogisich on Thursday October 22, 2015. The Class Notes belongs to CDA 6938 at University of Central Florida taught by Huiyang Zhou in Fall. Since its upload, it has received 41 views. For similar materials see /class/227525/cda-6938-university-of-central-florida in Computer Design Architecture at University of Central Florida.
Date Created: 10/22/15
Watershed Transform

The watershed algorithm is an image segmentation algorithm that splits an image into areas of interest. It is described in Digital Image Processing by Gonzalez and Woods as follows:

The concept of watershed is based on visualizing an image in three dimensions: two spatial coordinates versus gray levels. In such a topographic interpretation, we consider three types of points: (a) points belonging to a regional minimum; (b) points at which a drop of water, if placed at the location of any of those points, would fall with certainty to a single minimum; and (c) points at which water would be equally likely to fall to more than one such minimum. The points satisfying condition (c) form crest lines on the topographic surface and are termed divide lines or watershed lines.

The goal of the algorithm is to find the watershed lines.

References
1. Image Segmentation and Mathematical Morphology
2. The Watershed Transform: Definitions, Algorithms and P...
3. Matlab implementation: watershed, a sequential algorithm introduced by F. Meyer (source file ...meyer.cpp in the MATLAB installation)
4. Writing C Functions in MATLAB (MEX-Files)
5. Matlab acceleration using CUDA via the MEX file interface

Dithering

According to Wikipedia: Dithering is a technique used in computer graphics to create the illusion of color depth in images with a limited color palette (color quantization). In a dithered image, colors not available in the palette are approximated by a diffusion of colored pixels from within the available palette. The human eye perceives the diffusion as a mixture of the colors within it (see color vision). Dithering is analogous to the halftone technique used in printing. Dithered images, particularly those with relatively few colors, can often be distinguished by a characteristic graininess or speckled appearance.

References
1. Floyd, R. W., and L. Steinberg, "An Adaptive Algorithm for Spatial Gray Scale," International Symposium Digest of Technical Papers, Society for
Information Display, 1975, p. 36.
2. Lim, Jae S., Two-Dimensional Signal and Image Processing, Englewood Cliffs, NJ: Prentice Hall, 1990, pp. 469-476.
3. Matlab implementation: dither
4. Optimal Parallel Error-Diffusion Dithering, Proceedings of the 1999 Electronic Imaging, SPIE, San Jose, CA, January 1999.

Morphological Reconstruction

Morphological reconstruction can be thought of conceptually as repeated dilations of an image, called the marker image, until the contour of the marker image fits under a second image, called the mask image. In morphological reconstruction, the peaks in the marker image "spread out", or dilate.

References
1. Vincent, L., "Morphological Grayscale Reconstruction in Image Analysis: Applications and Efficient Algorithms," IEEE Transactions on Image Processing, Vol. 2, No. 2, April 1993, pp. 176-201.
2. Improving performance of mo...
3. Matlab implementation (morphological reconstruction)

Demosaicing

A demosaicing algorithm is a digital image process used to reconstruct a full color image from the incomplete color samples output from image sensors.

References
1. Matlab description and implementation (MathWorks)

Entering the Golden Age of Heterogeneous Computing
Michael Mantor, Senior GPU Compute Architect / Fellow, AMD Graphics Product Group
michael.mantor@amd.com
AMD: Smarter Choice

- Performance
  - Moore's Law: 2x < 18 months
  - Frequency/Power/Complexity wall
  - Parallel: opportunity for growth
- Power
- Price
- Programming models

GPU is the first successful massively parallel COMMODITY architecture with a programming model that managed to tame 1000s of parallel threads in hardware to perform useful work efficiently.

Quick recap of where we are: Perf, Power, Price
- 4x performance/W and performance/mm2 in a year
[Chart: GigaFLOPS per watt and GigaFLOPS per $ for successive ATI Radeon GPUs; data points include ATI Radeon X1800 XT and ATI Radeon HD 3850]
[Chart, continued: data points include ATI Radeon HD 2900 XT, ATI Radeon X1900 XTX, and ATI Radeon X1950 PRO; time axis runs by quarter from 2005 to 2008]
Source of GigaFLOPS per watt: maximum theoretical performance divided by maximum board power. Source of GigaFLOPS per $: maximum theoretical performance divided by price as reported on www.bu...com as of 9/24/08.

ATI Radeon HD 4850: Designed to Perform in Single Slot
- SP compute power: 1.0 TFLOPS
- Core clock speed: 625 MHz
- Memory type: GDDR3
- Memory bandwidth: 64 GB/sec
- Also listed (values not legible): stream processors, max board power

ATI Radeon HD 4870: First Graphics with GDDR5
- SP compute power: 1.2 TFLOPS
- Core clock speed: 750 MHz
- Memory bandwidth: 115.2 GB/sec
- Also listed (values not legible): DP compute power, memory type, memory capacity

ATI Radeon HD 4870 X2: Incredible Balance of Performance, Power, Price
- SP compute power: 2.4 TFLOPS
- DP compute power: 480 GFLOPS
- Memory bandwidth: 230 GB/sec
- Also listed (values not legible): memory type, memory capacity

AMD FireStream 9250: AMD's Second Generation Stream Computing Product
- Single PCI slot
- Computational power: one TFLOPS single precision float, 200 GFLOPS double precision
- 1 GB GDDR3
- 150 watts, 8 GFLOPS/watt
- Familiar 32 and 64 bit Linux and Windows environment
- Stream software supports multiple GPUs per system
- Brook+ (open source C-level language & compiler)
- GPU ShaderAnalyzer
- AMD Code Analyst
- AMD's Compute Abstraction Layer (CAL)

Crossfire FragBox QuadFire (Falcon Northwest)
- QuadFire: 2x ATI Radeon 4870 X2, 2 GB
- 3200 single precision stream processors
- Or 160 DP units with 160 SP units

Rapid Advances for Compute Today: Meet the ATI Radeon HD 48xx Architecture

Terascale Unified Graphics Engine
- 800 highly optimized stream processing units
- New unified SIMD core layout
- Optimized texture units
- New texture cache design
- New memory architecture
- Optimized render back-ends for
fast anti-aliasing performance
- Enhanced geometry shader & tessellator performance

Simple View of a Unified Data Parallel Throughput Machine
- Highly threaded to hide latency for light ALU loads and I/O constrained apps
  - Hardware has support for 16000 shader program invocations
  - Hardware has registers & resources for each invocation
- I/O bound with low ALU count
  - 16 arrive per clock, 16 leave per clock
  - 40 threads issue fetch per clock
  - Fetch latency: 300 clks latency, 4800 threads
  - 1 dependent fetch: 9600 threads
  - 2 dependent fetches: 14400 threads
[Diagram label: Host Interface]

ATI Radeon HD 4800 Series Architecture
- 10 SIMD cores
  - Each with 80 32-bit stream processing units (800 total)
  - Dual identity: 16 64-bit & 16 32-bit stream processing units
- 40 texture units
- 115+ GB/sec GDDR5 memory interface

               HD 3870       HD 4870       Difference
  Die size     190 mm2       260 mm2       1.4x
  Memory       72 GB/sec     115 GB/sec    1.6x
  AA resolve   32            64            2x
  Z/Stencil    32            64            2x
  Texture      16            40            2.5x
  Shader       320           800           2.5x

SIMD Cores
- Each core:
  - Includes 80 scalar stream processing units in total + 16KB Local Data Share
  - Has its own control logic and runs from a shared set of threads
  - Has 4 dedicated texture units + L1 cache
  - Communicates with other SIMD cores via 16KB global data share
- New design allows texture fetch capability to scale efficiently with shader power, maintaining 4:1 ALU:TEX ratio
[Diagram labels: Local Data Share, Sequencer, Ultra-Threaded Dispatch Processor, Texture Unit, Stream Processing Units]

- 40% increase in performance per mm2
- More aggressive clock gating for improved performance per watt
- Fast double precision processing (240 GigaFLOPS)
- Integer bit shift operations for all units (12.5x improvement)
Source of performance per mm2: maximum theoretical performance divided by surface area of a SIMD. Source of 12.5x improvement: for ATI Radeon HD 3800 Series, 4 SIMDs x 16 shift operations per SIMD = 64 shift operations; for ATI Radeon HD 4800 Series, 10 SIMDs x 80 shift operations = 800 shift operations.
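The thread counts on the "Simple View of a Unified Data Parallel Throughput Machine" slide can be reproduced by multiplying the arrival rate by the time each thread spends waiting. The following is a minimal sketch of that arithmetic, not AMD's own model: the function name threads_in_flight is hypothetical, and it assumes that each dependent fetch adds one more full 300-clock latency period, which is an interpretation of the slide rather than something it states.

```python
def threads_in_flight(arrivals_per_clock, fetch_latency_clocks, dependent_fetches=0):
    """Threads that must be resident to keep issuing while fetches are outstanding."""
    # Assumption: every dependent fetch adds one more full latency period,
    # so a thread stays resident for (1 + dependent_fetches) * latency clocks.
    resident_clocks = fetch_latency_clocks * (1 + dependent_fetches)
    return arrivals_per_clock * resident_clocks


HARDWARE_INVOCATIONS = 16000  # invocation count quoted on the slide

for dependent in range(3):
    needed = threads_in_flight(16, 300, dependent)
    # Print the dependent-fetch count, the threads needed, and whether they fit.
    print(dependent, needed, needed <= HARDWARE_INVOCATIONS)
```

Under that assumption, all three cases on the slide stay below the 16000 invocations the hardware is said to support, which is consistent with the claim that the machine can hide fetch latency for I/O-bound work.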
Texture Units
- Streamlined design: 70% increase in performance/mm2
- More performance:
  - Double the texture cache bandwidth of the ATI Radeon HD 3800 series
  - 2.5x increase in 32-bit filter rate
  - 1.25x increase in 64-bit filter rate
  - Up to 160 fetches per clock
[Diagram: 10 texture units; texture address processors; 16 FP32 texture samplers each (160 total); 4 FP32 texture filter units each (40 total)]
[Chart: peak texture fetch rate, ATI Radeon HD 4870 vs. ATI Radeon HD 3870]
Source of performance per mm2: maximum theoretical performance divided by surface area of the texture units.

Texture Units
- New cache design:
  - L2s aligned with memory channels
  - L1s store unique data per SIMD
  - 2.5x increase in aggregate L1
  - Separate vertex cache
- Increased bandwidth:
  - Up to 480 GB/sec of L1 texture fetch bandwidth
  - Up to 384 GB/sec between L1 & L2
Comparing ATI Radeon HD 4800 series and ATI Radeon HD 3800 series.

- Focus on improving AA performance per mm2
  - Doubled peak rate for depth/stencil ops to 64 per clock
  - Doubled AA peak fill rate for 32-bit & 64-bit color
  - Doubled non-AA peak fill rate for 64-bit color
- Supports both fixed function (MSAA) and programmable (CFAA) modes

  Color                        HD 3800 series   HD 4800 series   Difference
  32-bit, no MSAA              16 pix/clk       16 pix/clk       1x
  32-bit, 2x/4x MSAA           8 pix/clk        16 pix/clk       2x
  32-bit, 8x MSAA              4 pix/clk        8 pix/clk        2x
  64-bit, no MSAA              8 pix/clk        16 pix/clk       2x
  64-bit, 2x/4x MSAA           8 pix/clk        16 pix/clk       2x
  64-bit, 8x MSAA              4 pix/clk        8 pix/clk        2x
  Depth/stencil only           32 pix/clk       64 pix/clk       2x

Comparing ATI Radeon HD 4800 series and ATI Radeon HD 3800 series.

- New distributed design
with hub
- Controllers distributed around periphery of chip, adjacent to primary bandwidth consumers
- Memory tiling & 256-bit interface allows reduced latency, silicon area, and power consumption
- Hub handles relatively low bandwidth traffic: PCI Express, CrossFireX interconnect, UVD2, display controllers, intercommunication
[Diagram labels: L2, 3D Engine, Hub, PCI Express, Display Controllers]

ATI Radeon HD 4870 Computation Highlights
- >100 GB/s memory bandwidth
  - 256b GDDR5 interface
- Targeted for handling thousands of simultaneous lightweight threads
- 800 (160x5) stream processors
  - 640 (160x4) basic units (FMAC, ADD/SUB, CMP, etc.)
  - ~1.2 TFlops theoretical peak
  - 160 enhanced transcendental units (adds COS, LOG, EXP, RSQ, etc.)
  - Support for INT/UINT in all units (ADD/SUB, AND, XOR, NOT, OR, etc.)
- 64-bit double precision FP support
  - 1/5 single precision rate (~240 GFlops theoretical performance)
- 4 SIMDs -> 10 SIMDs
  - ~2.5X peak theoretical performance increase over ATI Radeon 3870
  - ~1.2 TFlops FP32 theoretical peak
  - ~240 GFlops FP64 theoretical peak
- Scratch-pad memories
  - 16KB per SIMD (LDS)
  - 16KB across SIMDs (GDS)
- Synchronization capabilities
- Compute Shader
  - Launch work without rasterization
  - "Linear" scheduling
  - Faster thread launch

Performance
- AMD Core Math Library: SGEMM 300 GFLOPS, DGEMM 137 GFLOPS
- FFT: 305 GFLOPS (Mercury Computer)
[Chart: results for Radeon HD 3870 and Radeon HD 4870, vertical axis 0 to 800]
Based on internal testing at AMD Performance Labs on reference platform. Configuration: Intel Q6600 (2.4 GHz), 2GB DDR3, ASUS P5E3 Deluxe motherboard, Windows Vista Ultimate (32-bit). For ATI Radeon GPUs: CAL 1.01, matrix multiply (CAL optimized implementation, available in SDK), FFT (CAL optimized implementation), ATI Catalyst 8.6 software. Quoted peaks are based on manufacturer claims. All results normalized to ATI Radeon HD 3870 GPU.
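The peak figures quoted on the Computation Highlights slide can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions, not a vendor formula: the helper name peak_gflops is hypothetical, a fused multiply-add (FMAC) is counted as 2 floating-point operations per clock, and the HD 4870 X2 is treated as two HD 4870 GPUs at the same 750 MHz clock, which is what its quoted 2.4 TFLOPS and 480 GFLOPS figures imply.

```python
def peak_gflops(stream_processors, clock_mhz, flops_per_clock=2):
    """Theoretical peak in GFLOPS for a given unit count and core clock."""
    # units * FLOPs per unit per clock * MHz gives MFLOPS; divide by 1000 for GFLOPS.
    return stream_processors * flops_per_clock * clock_mhz / 1000.0


# ATI Radeon HD 4870: 800 stream processors at 750 MHz (figures from the slides).
sp_4870 = peak_gflops(800, 750)   # single precision peak
dp_4870 = sp_4870 / 5             # double precision runs at 1/5 the SP rate

# Assumption: the HD 4870 X2 is two such GPUs on one board.
sp_x2 = 2 * sp_4870
dp_x2 = 2 * dp_4870

print(sp_4870, dp_4870, sp_x2, dp_x2)
```

The results line up with the slide values of about 1.2 TFLOPS and 240 GFLOPS for the HD 4870 and 2.4 TFLOPS and 480 GFLOPS for the HD 4870 X2. These are theoretical peaks; the measured SGEMM and DGEMM numbers on the Performance slide are lower.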
ATI Radeon HD 4800 Series Stream Architecture
- Several enhancements done for stream computing
  - Fast compute dispatch
  - Local/global data shares
  - Increased integer processing abilities
  - Fast mem import/export
- Significant increases in performance on many important stream processing workloads

Agenda
- Stream SDK strategy update
- AMD's ... consumer markets
- DirectX 11.0 Compute Shader introduction
- OpenCL introduction

Stream SDK Momentum
- AMD was the first company to offer a freely downloadable, open set of programming tools for GPGPU programming
- Adoption of Stream SDK, launched in 2006, continues to grow
- Hundreds of topics have been posted and discussed on the AMD Stream Developer Forum, making it the most active developer forum at AMD
  - http://developer.amd.com/devforum
  - http://ati.amd.com/technology/streamcomputing/sdkdwnld.html

Key imperatives for the growth of an application ecosystem
- Industry standard programming models
- Tools, libraries, middleware
- Services
- Research community and early adopters

AMD Stream Processing Strategy
- Applications: game computing, video computing, scientific computing, productivity
- Tools, libraries, middleware: ACML, Brook+, Cobra, RapidMind, Havok, etc.
- Industry standard interfaces: OpenCL, DirectX 11, etc.

Single Programming Environment
Top to bottom support of stream applications on AMD GPUs
- Professional: AMD FireStream 9250
  - 1st GPU to break 1 TFLOPS barrier
  - 8 SP GFLOPS per watt
- Consumer: ATI Radeon HD graphics; Swift APU planned (1H 2009)

Brook+ Computing Language

What is Brook+?
- Brook is an extension to the C language for stream programming, originally developed by Stanford University
- Brook+ is an implementation by AMD of the Brook GPU spec on AMD's Compute Abstraction Layer with some enhancements
- Asynchronous CPU->GPU transfers (GPU->CPU still synchronous)
- Linux, Windows Vista, Microsoft Windows XP, 32 & 64-bit
- Extension mechanism: allows ASIC-specific features to be exposed without "sullying" the core language

Simple Example

    int main(int argc, char** argv)
    {
        int i, j;
        float a<10, 10>;          /* streams */
        float b<10, 10>;
        float c<10, 10>;
        float input_a[10][10];    /* host arrays */
        float input_b[10][10];
        float input_c[10][10];

        for (i = 0; i < 10; i++) {
            for (j = 0; j < 10; j++) {
                input_a[i][j] = (float) i;
                input_b[i][j] = (float) j;
            }
        }

        streamRead(a, input_a);   /* copy host data into streams */
        streamRead(b, input_b);
        sum(a, b, c);             /* run the kernel over every element */
        streamWrite(c, input_c);  /* copy the result back to the host */
        ...
    }

Streams: a collection of data elements of the same type which can be operated on in parallel.

Brook+ kernels

    kernel void sum(float a<>, float b<>, out float c<>)
    {
        c = a + b;
    }

    kernel void sum(float a[], float b[], out float c<>)
    {
        int idx = indexof(c);
        c = a[idx] + b[idx];
    }

    kernel void sum(float a<>, float b<>, out float c[])
    {
        int idx = indexof(c);
        c[idx] = a + b;
    }

[Diagram: a0..a7 + b0..b7 = c0..c7, one kernel invocation per element]

Brook+ Compiler
- Converts Brook+ files into C++ code
- Kernels, written in C, are compiled to AMD's IL code for the GPU or C code for the CPU

Brook+ Compiler
- IL code is executed on the GPU
- The backend is written in CAL
[Diagram labels: Stream Runtime, CAL, GPU Backend]

- Improve and evolve Brook+ programming language and compiler
- Stable, high performance platform for accelerating applications today, paving migration path towards OpenCL/DirectX Compute
- Improved error handling
- Data transfer optimizations
- C++ runtime API
- Access to CAL-level functionality

Compute Abstraction Layer

Compute Abstraction Layer (CAL) goals
- Expose relevant compute aspects of the GPU
  - Command processor
  - Data parallel processor(s)
  - Memory controller
- Hide all other graphics-specific features
- Provide direct & asynchronous communication to device
- Eliminate driver implemented procedural API
  - Push policy decisions back to application
  - Remove constraints imposed by graphics APIs

CAL Highlights
- Memory managed: don't have to
manually maintain offsets, etc.
- Asynchronous DMA: CPU->GPU, GPU->GPU, GPU->CPU
- Multiple GPUs can share the same system memory
- Core CAL API is device agnostic
- Enables multi-device optimizations
  - e.g. multiple GPUs working together concurrently
  - Multiple GPUs show up as multiple CAL devices
- Extensions to CAL provide opportunities for device specific optimization

Commitment to Industry Standards
AMD believes: industry standards and open collaboration are key to driving development of stream applications and ensuring these applications work on the broadest range of hardware.
- Industry standards
  - As hardware and software evolve, key to making it accessible in a unified way
  - Stream processing has evolved to a point where proprietary solutions are not helping to drive broader technology acceptance
- Proprietary solutions
  - Help to drive experimentation of new technology
  - Appropriate when no standards in place

Compute Shaders: An evolving processing model for GPUs
(DirectX 11.0 Siggraph 2008 slides courtesy of Chas Boyd, Architect, Microsoft Windows Desktop and Graphics Technology)

Introducing the Compute Shader
- A new processing model for GPUs
  - Data-parallel programming for mass market client apps
- Integrated with Direct3D
  - For efficient interop with graphics in client scenarios
- Supports more general constructs than before
  - Cross thread data sharing
  - Unordered access I/O operations
- Enables more general data structures
  - Irregular arrays, trees, etc.
- Enables more general algorithms
  - Far beyond shading

Target: Interactive Graphics/Games
- Image/post processing: image reduction, histogram, convolution, FFT
- Effect physics: particles, smoke, water, cloth, etc.
- Advanced renderers: A-Buffer/OIT, Reyes, ray-tracing, radiosity, etc.
- Game play physics, AI, etc.
- Production pipelines
Target: Media Processing
- Video: transcode, super resolution, etc.
- Photo/imaging: consumer applications
- Non-client scenarios: HPC, server workloads, etc.

Component Relationships
[Layer diagram, top to bottom:]
- Media playback or processing, media UI, recognition, etc.
- Accelerator, Brook+, Rapidmind, Ct
- MKL, ACML, cuFFT, D3DX, etc.
- DirectX 11.0 Compute, CUDA, CAL, OpenCL, LRB Native, etc.
- CPU, GPU, Larrabee
- nVidia, Intel, AMD, S3, etc.

Compute Shader Features
- Predictable thread invocation
  - Regular arrays of threads: 1-D, 2-D, 3-D
  - Don't have to "draw a quad" anymore
- Shared registers between threads
  - Reduces register pressure
  - Can eliminate redundant compute and I/O
- Scattered writes
  - Can read/write arbitrary data structures
  - Enables new classes of algorithms
  - Integrates with Direct3D resources

Integrated with Direct3D
- Fully supports all Direct3D resources
- Targets graphics/media data types
- Evolution of DirectX HLSL
- Graphics pipeline updated to emit general data structures via addressable writes
  - Which can then be manipulated by compute shader
  - And then rendered by Direct3D again

Integration with Graphics Pipeline
- Render scene
- Write out scene image
- Use Compute for image post-processing
- Output final image
[Diagram labels: Tessellation, Geometry Shader, Pixel Shader, Data Structure, Scene Image, Final Image]

Memory Objects
- DXGI Resources
  - Used for textures, images, vertices, hulls, etc.
  - Enables out-of-bounds memory checking: returns 0 on reads; writes are no-ops
  - Improves security, reliability of shipped code
- Exposed as HLSL "Resource
Variables"
  - Declared in the language as data objects

Optimized I/O Intrinsics
- Textures & buffers
  - RWTexture2D, RWBuffer
  - Act just like existing types
- Structured I/O
  - RWStructuredBuffer
  - StructuredBuffer (read-only)
  - Template type can be any struct definition
- Fast structured I/O
  - AppendStructuredBuffer, ConsumeStructuredBuffer
  - Work like streams
  - Do not preserve ordering

Atomic Operator Intrinsics
Enable basic operations w/o lock/contention:

    InterlockedAdd(rVar, val);
    InterlockedMin(rVar, val);
    InterlockedMax(rVar, val);
    InterlockedOr(rVar, val);
    InterlockedXOr(rVar, val);
    InterlockedCompareWrite(rVar, val);
    InterlockedCompareExchange(rVar, val);

Reduction Compute Code

    Buffer<uint> Values;
    OutputBuffer<uint> Result;

    ImageAverage()
    {
        groupshared uint Total;    // Total so far
        groupshared uint Count;    // Count added

        float3 vPixel = load(sampler, sv_ThreadID);
        float fLuminance = dot(vPixel, LUM_VECTOR);
        uint value = fLuminance * 65536;

        InterlockedAdd(Count, 1);
        InterlockedAdd(Total, value);

        GroupMemoryBarrier();      // Let all threads in group complete
        ...

Summary
- DirectX 11.0 Compute Shader expected to deliver the performance of 3-D games to new applications
- Integration between computation and rendering
- Scalable parallel processing model
  - Code should scale for several generations

Open Computing Language
(Siggraph 2008 slides courtesy of Aaftab Munshi, Architect, Apple & Khronos OpenCL Workgroup Member)

Update on OpenCL
- Standardize framework and language for
multiple heterogeneous processors
- PC developers are expected to have early version in Q1 2009
- Based on a proposal by Apple

OpenCL: A Brief Preview
(Beyond Programmable Shading: Fundamentals, Siggraph 2008 slides courtesy of Aaftab Munshi)

Design Goals of OpenCL
- Use all computational resources in system
  - GPUs and CPUs as peers
  - Data- and task-parallel compute model
- Efficient parallel programming model
  - Based on C
  - Abstract the specifics of underlying hardware
- Specify accuracy of floating-point computations
  - IEEE 754 compliant rounding behavior
  - Define maximum allowable error of math functions
- Drive future hardware requirements

OpenCL Software Stack
- Platform layer
  - Query and select compute devices in the system
  - Initialize compute device(s)
  - Create compute contexts and work-queues
- Runtime
  - Resource management
  - Execute compute kernels
- Compiler
  - A subset of ISO C99 with appropriate language additions
  - Compile and build compute program executables, online or offline

OpenCL Execution Model
- Compute kernel
  - Basic unit of executable code, similar to a C function
  - Data-parallel or task-parallel
- Compute program
  - Collection of compute kernels and internal functions
  - Analogous to a dynamic library
- Applications queue compute kernel execution instances
  - Queued in-order
  - Executed in-order or out-of-order
  - Events are used to implement appropriate synchronization of execution instances

OpenCL Data-Parallel Execution Model
- Define N-dimensional computation domain
  - Each independent element of execution in the N-D domain is called a work-item
  - The N-D domain defines the total number of work-items that execute in parallel (global work size)
- Work-items can be grouped together (work-group)
  - Work-items in group can communicate with each other
  - Can synchronize execution among work-items in
group to coordinate memory access
- Execute multiple work-groups in parallel
- Mapping of global work size to work-groups: implicit or explicit

OpenCL Task-Parallel Execution Model
- Data-parallel execution model must be implemented by all OpenCL compute devices
- Some compute devices such as CPUs can also execute task-parallel compute kernels
  - Executes as a single work-item
  - A compute kernel written in OpenCL
  - A native C/C++ function

OpenCL Memory Model
- Implements a relaxed consistency, shared memory model
- Multiple distinct address spaces
  - Address spaces can be collapsed depending on the device's memory subsystem
- Address qualifiers
  - __private, __local, __constant and __global
  - Example: __global float4 *p;
[Diagram label: Local Memory]

Language for writing compute kernels
- Derived from ISO C99
- A few restrictions: recursion, function pointers, functions in C99 standard headers
- Preprocessing directives defined by C99 are supported
- Built-in data types
  - Scalar and vector data types, pointers
  - Data-type conversion functions: convert_type<_sat><_roundingmode>
  - Image types: image2d_t, image3d_t and sampler_t

Language for writing compute kernels
- Built-in functions (required)
  - Work-item functions, math.h, read and write image
  - Relational, geometric functions, synchronization functions
- Built-in functions (optional)
  - Double precision, atomics to global and local memory
  - Selection of rounding mode, writes to image3d_t surface

Summary
- A new compute language that works across GPUs and CPUs
  - C99 with extensions
  - Familiar to developers
  - Includes a rich set of built-in functions
  - Makes it easy to develop data- and task-parallel
compute programs
- Defines hardware and numerical precision requirements
- Open standard for heterogeneous parallel computing

Summarizing AMD's Industry Standards Efforts
- Easing cross-platform development with major enhancements for stream software
- Single development environment for open, flexible software development and support for a broad range of GPU solutions
- A clear path to OpenCL and DirectX 11

AMD FireStream Go-to-Market: Commercial Verticals Focus
- ... data from multiple locations into a single environment
- Stream computing drives optimal performance, maximum flexibility for future enhancements
[Chart: video processing comparison, NTP v1.3 vs. single core CPU NTP v1.0]
AMD FireStream 9150 versus dual AMD Opteron 248 processor (using only a single processor for comparison) w/ 2GB SDRAM DDR 400 ECC dual channel and SUSE Linux 10 (custom kernel). Performance information supplied by customers. AMD has not independently verified these results.

Building the Ecosystem: Stream Research Breakthroughs
- ... seconds in Brook+
AMD FireStream 9150 versus Intel Core 2 Duo E6550 2.33 GHz running Windows XP 32-bit.
AMD FireStream 9150 versus quad-core AMD Phenom 9500 processor (2 GHz) w/ 8GB ECC DDR2 running at 667 MHz and AMD 790FX motherboard running Windows XP 64-bit. Performance information supplied by customers. AMD has not independently verified these results.

Building the Ecosystem: Stream Development Tools Successes
- RapidMind, CAPS
- ... on binomial options pricing calculator
- Defense, finance, life sciences
AMD FireStream 9150 versus Quantlib running solely on single core of a dual-processor AMD Opteron 2352 BB processor on Tyan S2915 w/ Windows XP 32-bit (Palomar Workstation from Colfax). Performance information supplied by customers. AMD has not independently verified these results.
"Provide Stream Computing consulting, development and integration services specializing in the following industries: Oil & Gas, Medical Imaging, Financial Services"
Mercury benchmark system details: 2x Opteron quad-core 2.3 GHz processors, 4GB; ATI Radeon 4870 GPU.

Expansion of Stream Strategy into Consumer Applications
http://developer.amd.com/documentation/videos/pages/froblins.aspx#one

Summary & Questions
- Easing cross-platform development with major enhancements for stream software strategy
- Aggressively expanding stream strategy to consumer segment
Contact: AMD Stream Computing SDK
http://ati.amd.com/technology/streamcomputing

Disclaimer and Attribution

DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
(c) 2008 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD Opteron, AMD Phenom, ATI, the ATI logo, Radeon, FireGL, FirePro, FireStream, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows Vista are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.