PARALLEL PROGRAMMING CS 515
These 16 pages of class notes were uploaded by Orrin Rutherford on Tuesday, September 1, 2015. The notes belong to CS 515 at Portland State University, taught by Jingke Li in Fall. Since the upload, they have received 54 views.
Date Created: 09/01/15
Harnessing Stream Processors: massively parallel processing
Jesse Rosenzweig, CTO
jesse@elementaltechnologies.com
April 21, 2009
2009 Elemental Technologies Incorporated, Confidential

Outline
- Company Background
- Story of a Startup
- The Elemental Video Engine
- Elemental Product Line
- CUDA introduction
- Conclusion

Company Background
Our Mission: To create the fastest, highest quality video solutions by harnessing massively parallel off-the-shelf hardware.
- Founded in 2006
- Team led display revolution at Pixelworks
- Headquartered in beautiful Portland, Oregon
- Profitable in first quarter of revenue (Q4 08)
- Raised $7.1M Series A in June 2008
[Investor logos: General Catalyst and one other, not legible]

Story of a Startup
Founded August 2006
- Focus was to build ASIC
- Standalone transcoder/encoder
- Estimated cost: $20M to revenue
- Funding sources limited
Elemental 2.0, April 2007
- NVIDIA G80 had been released
- CUDA had been launched
- Powerful parallel engine available
- Switched to software model

The Video Dilemma
- Explosive growth in all types of video seen
- Processing all this video is challenging
- Specialized chips are expensive and inflexible
- CPU-only servers are relatively slow
[Figure: cost versus performance and quality for RapiHD GPU-accelerated software, CPU-based systems, and DSP/FPGA/ASIC]

The Elemental Solution
Video processing software that harnesses the GPU
- Exploits serial CPU and parallel GPU engines for optimal efficiency
  - Outperforms CPU-only competitors
  - Efficient use of system resources
- High quality video processing
  - No compromise performance
- Core technology built with flexible API
  - Easy to drop into different products

Disruptive Innovation
Elemental's video technology harnesses key GPU trends:
1. GPUs have become immensely powerful
2. GPUs have become
   extremely programmable
3. PCIe bus allows fast CPU-GPU communication
[Figure: trend chart over time; axis labels not recoverable]

Video Engine Pipeline
- Harnesses both the CPU and GPU strengths
- Achieves up to 10x performance of CPU-only
- Efficient use of system resources is key
[Figure: video engine pipeline diagram; labels not recoverable]

Elemental Video Engine
Currently used by a variety of applications:
- Virtualization
- Remote Video Distribution
- United States Intelligence Community
- Professional Video Editing Application
[Figure: pipeline architecture. Legend: Pipeline Input, Pipeline Output, Thread (one or more), Data Flow Path, etiCoreLib Components, ETI Pipeline Stages, ETI Codec plugin]

Elemental's Product Line
All powered by Elemental core technology
- Elemental Video Engine(TM) SDK (Developer)
  - Flexible and extensible
  - Supports a variety of codecs
- Badaboom(TM) Media Converter (Consumer)
  - Video on mobile devices
  - 1 million downloads
- Elemental Accelerator for CS4 (Professional)
  - Premiere Pro plugin
  - Bundled w/ NVIDIA Quadro CX

Badaboom(TM) Media Converter
- Easiest way to format video for any device
- Simple to use
- Fast performance
- Efficient system use
- High image quality

The Accelerator for Adobe Creative Suite 4
- Includes Elemental's H.264 video encoding and soon much more
- Available today only with NVIDIA Quadro CX board
- Advanced Multi-Display Management Tools
- Designed and Optimized for Adobe Creative Suite 4
- Launched Oct 15

CUDA Introduction: What is CUDA?
- Compute Unified Device Architecture
- Parallel processing at a very low level
- Extensions to C
[Figure caption, partly legible: When video is processed with CPU-only software, render times are long and CPU utilization is very high. With the RapiHD software, frames
are still broken into macroblocks, but all macroblocks are processed concurrently on a massively parallel GPU, giving higher throughput than a CPU.]

CUDA Introduction: Hardware
- Arrays of multiprocessors
- Each multiprocessor has sets of processors
- Each processor executes the same instruction on different data
- Each processor has access to shared memory

CUDA Introduction: Memory types
- Global/Device: the GPU's DRAM. Slowest of all memory.
- Constant: cached global memory for constant, read-only data.
- Texture: 2D cache and hardware interpolation for global memory.
- Shared: fast memory (as fast as registers) available to a CUDA block.
- Register: set of general purpose registers available for the thread.

CUDA Introduction: Threads, Warps, Blocks, and Grids
- Thread: set of instructions running on processors
- Warp: set of threads running at the same time
- Block: set of threads that share memory, sequenced on the same multiprocessor
- Grid: set of blocks to be executed over a set of data
[Figure 2-1: Grid of Thread Blocks]

CUDA Introduction: Kernels
- The set of instructions that run on the GPU in parallel
- Write one set of instructions, run on lots of data
- SIMD processors

CUDA Introduction: Typical data flow
1. CPU produces/captures data
2. Copy data to GPU DRAM
3. Kernel loads data from DRAM into shared memory
4. Threads execute in parallel on data in shared memory
5. Once threads are done (__syncthreads), move data back into GPU DRAM
6. Move results back to CPU

CUDA Introduction: Occupancy
- The ratio of the number of active warps per multiprocessor to the maximum number of active warps
- Current NVIDIA GPU capability has a max of 32 active warps
- Higher occupancy is not necessarily faster for any given algorithm, but is a measure of how much work can be done
  per clock

CUDA Introduction: Optimize kernels by
- Minimizing registers -> simple algorithms
- Minimizing shared memory usage -> resourceful memory management
- Maximizing warps per block -> give the device enough work
- Good memory access
  - Coalesced global reads and writes
  - Reduce bank conflicts on shared memory

CUDA Introduction: Example, Matrix Multiply
- Each thread block is responsible for computing one square sub-matrix Csub of C
- Each thread within the block is responsible for computing one element of Csub

CUDA Introduction: CPU side
Slide callouts: move data to GPU; launch kernel (grid of 16x16 thread blocks); move results back.

    // Thread block size
    #define BLOCK_SIZE 16

    // Forward declaration of the device multiplication function
    // (the kernel itself is on the GPU-side slides)
    __global__ void Muld(float*, float*, int, int, float*);

    // Host multiplication function
    // Compute C = A * B
    //   hA is the height of A
    //   wA is the width of A
    //   wB is the width of B
    void Mul(const float* A, const float* B, int hA, int wA, int wB, float* C)
    {
        int size;

        // Load A and B to the device (move data to GPU)
        float* Ad;
        size = hA * wA * sizeof(float);
        cudaMalloc((void**)&Ad, size);
        cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
        float* Bd;
        size = wA * wB * sizeof(float);
        cudaMalloc((void**)&Bd, size);
        cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

        // Allocate C on the device
        float* Cd;
        size = hA * wB * sizeof(float);
        cudaMalloc((void**)&Cd, size);

        // Compute the execution configuration, assuming the
        // matrix dimensions are multiples of BLOCK_SIZE
        dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
        dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

        // Launch the device computation
        Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

        // Read C from the device (move results back)
        cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

        // Free device memory
        cudaFree(Ad);
        cudaFree(Bd);
        cudaFree(Cd);
    }

CUDA Introduction: GPU kernel side, part 1
- Addresses calculated with blockIdx and threadIdx special symbols
[Code listing not recoverable]

CUDA Introduction: GPU side, part 2
- Load shared memory with data
- Do matrix multiply in parallel
- Write result to global memory

CUDA Introduction: Performance for A(48, 80), B(128, 48), C(128, 80)
- GPU: 10 ms (5.4x faster)
- CPU: 54 ms
- 491k multiplies and 491k adds

CUDA Introduction: Performance for A(48, 8000), B(12800, 48), C(12800, 8000)
- GPU: 663 ms (14.2x faster)
- CPU: 9483 ms
- 5 billion multiplies and adds

Compute competition
- CUDA only for NVIDIA, but Mac, Linux, and Windows supported
- OpenCL (Apple) and DX11 (Microsoft) for all GPU and CPU platforms
[Logos: AMD, Intel, NVIDIA]

Many CPU cores vs GPU
- Intel's latest Nehalem processor will have 8 hyperthreaded cores, giving 16 threads of execution
- GPU has overhead of copies
  - Sometimes faster to do things on CPU even if it is slower
  - Must choose best processor for the problem
- Heterogeneous computing
- Clock rates are stagnating
- Parallel programming is the future

Thanks for your time
- Any questions?
- We are always looking for talent
- Everyone gets to work on cool stuff

More information
- CUDA
- OpenCL: www.khronos.org/opencl
- DX11 Compute: DirectX March 2009 release
- www.elementaltechnologies.com
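
The GPU-side steps of the matrix multiply example (addresses from blockIdx and threadIdx, load shared memory, multiply in parallel, write the result to global memory) might look like the sketch below. It is an illustration under stated assumptions, not the code from the slides: the names matMulTiled, matMulOnGpu, and TILE are made up here, matrices are assumed to be stored row-major, and every dimension is assumed to be a multiple of the 16x16 block size.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define TILE 16  // hypothetical name; one thread block is TILE x TILE threads

// Each block computes one TILE x TILE sub-matrix Csub of C,
// and each thread computes one element of Csub.
// A is hA x wA, B is wA x wB, C is hA x wB (row-major, an assumption).
__global__ void matMulTiled(const float* A, const float* B, float* C,
                            int wA, int wB)
{
    // Shared memory tiles, visible to all threads of this block
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    // Addresses derived from the block and thread indices
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    float sum = 0.0f;
    for (int t = 0; t < wA / TILE; ++t) {
        // Each thread loads one element of each tile from global memory
        As[threadIdx.y][threadIdx.x] = A[row * wA + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * wB + col];
        __syncthreads();  // wait until both tiles are fully loaded

        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // wait before the tiles are overwritten
    }

    // Write the result to global memory
    C[row * wB + col] = sum;
}

// Host wrapper (hypothetical): copy inputs to the GPU, launch, copy C back.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t matMulOnGpu(const float* A, const float* B, float* C,
                        int hA, int wA, int wB)
{
    assert(hA % TILE == 0 && wA % TILE == 0 && wB % TILE == 0);

    size_t sizeA = (size_t)hA * wA * sizeof(float);
    size_t sizeB = (size_t)wA * wB * sizeof(float);
    size_t sizeC = (size_t)hA * wB * sizeof(float);
    float *dA = NULL, *dB = NULL, *dC = NULL;

    cudaError_t err = cudaMalloc((void**)&dA, sizeA);
    if (err == cudaSuccess) err = cudaMalloc((void**)&dB, sizeB);
    if (err == cudaSuccess) err = cudaMalloc((void**)&dC, sizeC);
    if (err == cudaSuccess) err = cudaMemcpy(dA, A, sizeA, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, B, sizeB, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        dim3 block(TILE, TILE);
        dim3 grid(wB / TILE, hA / TILE);  // one block per Csub
        matMulTiled<<<grid, block>>>(dA, dB, dC, wA, wB);
        err = cudaGetLastError();
    }
    // This copy also waits for the kernel to finish
    if (err == cudaSuccess) err = cudaMemcpy(C, dC, sizeC, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

A caller would fill host arrays A and B, call matMulOnGpu(A, B, C, hA, wA, wB), and check the returned status. The dimensions in the smaller performance case (48, 80, 128) are all multiples of 16, so they satisfy the wrapper's assumption. Staging tiles in shared memory means each global element is read once per tile rather than once per multiply, which is the point of the "load shared memory with data" step.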