Parallel Processing CS 6643
Popular in Course
Popular in ComputerScienence
verified elite notetaker
This 8 page Class Notes was uploaded by Mireya Heidenreich on Thursday October 29, 2015. The Class Notes belongs to CS 6643 at University of Texas at San Antonio taught by Staff in Fall. Since its upload, it has received 30 views. For similar materials see /class/231386/cs-6643-university-of-texas-at-san-antonio in ComputerScienence at University of Texas at San Antonio.
Reviews for Parallel Processing
Report this Material
What is Karma?
Karma is the currency of StudySoup.
Date Created: 10/29/15
1 Basics of Shared Memory Programming with OpenMP Material taken from sections in Chapter 7 OpenMP Parallelization basics Loop specific OpenMP parallelization General OpenMP parallelization method Shared memory and race conditions 2 What is OpenMP OpenMP is a specification of a set of compiler directives library rou tines and environment variables designed to enable shared memory pro gramming for Fortran and CC programs 0 Originally developed by SGI 0 NOW a standard www0penmporg o Supposed to be easier to program than pthreads also have better Fortran support stop laughing Requires egtltplicit compiler support Intel39s icc Solaris cc SGI cc IBM gtltlc Portland Group pgcc HP cc etc Supposedly fully supported as of gcc 42 Not sure of details but clearly OpenMP is easiest threading platform for simple applications ll 3 Stupendously Basic Basics Almost all parallel programs need to answer the following questions How do I set nthreads to use gt setenv UMPJIU39MJHREADS ompsetJ1umthreadS How many threads are running gt int ompgetmumthreads Which thread am I gt int ompgetthreadnum How do I start parallel threads gt pragma omp parallel parallel for How do Ijoin threads 84 their results back to serial a End of pragma scope reduction clause How can I synchronize threads gt ompsetlocklock gt pragma omp critical ltnamegt 4 Parallelizing Simple Dot Product Intro Will use simple dot product to illustrate how OpenMP is used There are two approaches 1 Simple loop oriented approach 2 More complex general approach Here39s the simple serial code 0 Note there are no loop carried dblddttNdle 0 e 0 quble39 e 39 dependenCIes other than dot d a We can run all iterations at ouble dot 39 i same time dot gtlt0 Y0 a Can use specialized loop in for 1 lt N i dot X YD struction returnd0t 5 Simple OpenMP Loop Parallelization 0 Using parallel for In 8 OpenMP splits loop iterations amongst threads 0 Reduce the partial sums created by all threads down to master thread using reduction clause In 9 0 Can do this because we have loop wt no loop carried dependencies other than dot which we fix with reduction threaded code ldouble dd0tint N double X serial code ldouble dd0tint N double X 2 double Y 2 double Y 3 3 4 double dot 4 double dot 5 int I 5 int I 6 6 7 dot gtlt0 Y0 7 dot gtlt0 Y0 8 for 1 lt N i 8 pragma omp parallel for 9 dot Xi Yi 9 reducti0n dot 10 returnd0t 10 for 1 lt i 11 11 dot gtlti Yl 12 returnd0t 13 o Red vars init to 0 then added with serial val on end of H region 0 For more complex codes may need a more general approach 6 Explicit Parallelization Example 0 Can use parallel region to explicitly thread code and indicate egtlt plicitly which items are private shared or reduced 0 Book is in error no such thing as defaultltprivategt in C ldouble dd0tint N double X double Y 2 double d0t00 x y n nthr mythr pragma omp parallel defautn0ne reducti0n dot sharedNgtltY privateninthrmythrgtlty nthr omplgetlnumlthreadsO mythr 0mpgetthreadnum n N Min iter forall threads gtlt gtlt n mythr trs to this threads yY n mythr chunks if mythr nthril Last thread gets remainder n N 7 nnthr dot gtlt0 y0 for 1 lt n i d0tgtlt ylll returnd0t o WO default none all vars are shared 7 Shared Memory and Race Conditions Global 84 heap shared amongst threads unlike processes 0 Imagine two threads both doing the following loc glob glob s equal to 10 Different outcomes may occur 1 Thread 0 84 1 read at same time have value of 10 both incre ment both write Both threads get loc 10 glob 11 2 Thread 0 reads increments writes and then thread 1 does the same Thread 0 has loc 10 thread 1 has loc 11 and glob 12 for both after both completed will be 11 for a while for thread 0 3 Swap thread 0 for l in above for different outcome i This is known as a Race condition which may occur whenever a thread updates memory that another thread may read or write a A code which is guaranteed to behave correctly during threaded execution is called thread safe A routine is that is thread safe is said to be reentrant 8 Thread Safety There are various ways to achieve thread safety 1 Duplicate any resource that must be written so that each thread has private copy copy to stack or malloc area for each thread 2 Synchronize threads so they egtltecute section in known order 3 Place locks around all updates For many problems will have to insert egtlttra code and use one or a combo of above methods to achieve thread safety in complex routines This includes the Operating System itself 0 Too many lockssyncs can make threaded slower than serial 0 Must link in thread safe system routines mallocrandmathetc Multithreaded networking applications used to be tremendously slower under Linux than FreeBSD due to where and how long the TCPIP stack got locked to guarantee thread safety Many times standard sys names replaced with thread safe ver sions other times must egtltplicitly call thread safereentrant ver sion eg randr rather than rand 1 Basics of Shared Memory Programming with Pthreads Shared mem same as OpenMP race cond etc Topics Ch 73 79 o What are pthreads and pthread basics Starting joining and exiting threads Pthreads example with dot product Pthreads error handling Challenges in using Pthreads standard Mutual exclusion using mutexes Awaiting an event using condition variables Etc 2 What is Pthreads Pthreads POSIX Portable Operating System Interface threads An IEEE operating system interface standard POSIX Section 10031c It consists of a ANSIISO C API with appropriate CPPtypedefs in pthreadh Does the same stuff as OpenMP but at much lower level Preceeded OpenMP standard 1995 threads around longer Some OpenMP implementations actually devolve to pthread calls Lower level 84 quite a bit grungier than OpenMP Requires minimal compiler support need lib header file and thread safe system libs Available on almost all platforms including gcc More info httpwwwllnlgovcomputingtutorialspthreads Sections 73 79 of textbook Pthreads Programming O39Reilly by Nichols Buttlar 84 Farrell 3 Stupendously Basic Basics Almost all parallel programs need to answer the following questions How do I set nthreads to use a How many successful calls to pthreadjreate did you make How many threads are running A See above Which thread am I a Err you passed that in when you called pthreadjreate right a Can get thread handle not rank wt pthreadJ pthreadjelfo How do I start parallel threads gt That would be pthreadjreate How do Ijoin threads 84 their results back to serial a Join them up using pthreadLjoin a Add them up yourself shared memory boy How can I synchronize threads a mutex condition variables later How can a thread finish execution gt Just finish execution return of threaded route is thread ret val gt pthreadLexit void ret 4 Pthread creation finish pthreadcreate Creates thread with attribute attr NULL uses defaults new thread starts exec in startroutine and is passed single void pointer arg very convenient int pthreadcreate pthreadt thread const pthreadattrt attr void startroutine void void arg Nonzero means error returns thread s struct Attributes NULL for defaults Routine to start executing Ptr arg to startroutine pthreadJoin Block until thread terminates int pthreadjoin pthreadt thread void valptr Return value from joined thread pthreadLexit Terminates thread use exit to term in threads void pthreadexit void value Return val given to pthreadjoin caller Pthreads are not required to join or call pthread xit Threads that are neverjoined can avoid overhead wt the detached attribute All pthreads peers can create new threads kill old etc Can spawn threads in 09212 overhead wt tree strategy 5 Parallelizing Simple Dot Product Intro Will use simple dot product to illustrate simplist Pthreads use There are two approaches 1 Masterslave where master spawns all threads adds all results 2 Peer approach where everyone helps with everything not shown Remember the good old days they are gone serial code OpenMP code ldouble dd0tint N double X ldouble dd0tint N double X double Y 2 double Y 3 3 4 double dot 4 double dot 5 int l 5 int l 6 6 7 dot gtlt0 Y0 7 dot gtlt0 Y0 8 for 1 l lt N i 8 pragma omp parallel for 9 dot Xi Yi 9 reducti0n dot 10 returnd0t 10 for 1 lt N H 11 11 dot gtlti Yl 12 returnd0t 13 6 Simple Master slave Dot Product wt Pthreads include ltpthreadhgt double ddotptint m int N double x double Y 1 double dot struct dotstruct double dotout double x double Y int N 1 1 n pthread mythrs struct dogscrucc dss dss mallocCsizeosztruct dotstruct1NT mythrs manocs1zeo1pthreadjwmem n N m Void ddotserialvoid was 1or 11 1 lt m 1 1 1 sh 1117nnilemm 1 1 N dss11x dssiY Y double 1 Y dot pthreadgreatemythrs1e1 NULL ddotser1al dss1 struct dogstruct ds vds x dss11il Y dss11il N dssgtN 11 N gt o N dss0 x x dss0Y Y 1 ddocser1aldss x dssgtX dot dss101dotout Y dssgtY 1 r 11 1 lt NT 1 1 dot x10 1 YO pthreadjo1nmythrs11e1 NULL or N 1 dot dss11dotout dot x11 2 Y11 dssgtdotout dot 1 eedss 1reemychrs 1se dssgtdotout 00 returndot e recurnltNULL 7 Pthreads Error Messages Most pthread routines return error condition 0 0 means no error each routine can return different errors pthreadjreatei EAGAIN no recsources EINVAL bad attr val EPERM insuff permission pthreadLjoin EINVAL thr not joinable ESRCH can39t find thr Use man ltpthreadroutgt to find error names 0 Must include lterrnohgt to test using names 0 Most systems provide strerror in ltstringhgt include ltpthreadhgt include ltpthreadhgt iuclude lterrnohgt nclude lterrnohgt include ltscr1ng if is 11 1err pchreadgreatenu 1 hgt ASA N r pthreadgreatenu 1pr1nt1ltstderr Insufficient resources quotd 01 quotsln 1pr1nt1ltstderr FILE Error line quotd 01 quots quotsn 11 1err EINVAL L 1pr1nt1ltstderr Bad arg to pthreadgreate quotd 01 quotsn L ILE WLINE J st rerror 1err exit 1err e1se 1pr1nt1ltstder Error quotd to pthreadgreate quotd 01 quotsn 1err WLINE WFILE ex1t 1err 8 Pthreads Challenges 0 Must indicate thread use for compilation 84 linking why but no standard flags even always using gcc Complink pthreads pthread mt DjEENTRANT DTHREADSAFE libs 1pthread nothing 0 Standard has nothing to say on some very important issues How to tell system to spawn thread to particular node diff same How to tie process to node No standard way that I know to get of physical procs a Many things that are specified are optional and thus fairly useless 0 At least in old days vendors often violate standard Must use attribute to guarantee default thread state Must use non standard names to get correct behavior Be prepared to use man compiler search thread pthreads 9 Shared Memory Programming Concepts 0 Atomic an operations that is always performed in its entirety and cannot be readmodified in an intermediate stage by another processor Note that many C statement are not atomic a 0 Critical section a series of statements that must be performed atomically and only executed by one thread at a time if item lt min min item min item lt min item min Adding toremoving item from work queue Generally any read update write series to shared mem Mutual Exclusion Locks mutegtlt mutegtlt lock OShardware supported atomic mechanism that allows for two atomic states locked a thread owns the rights and unlocked a new thread can acquire the lock Used to guard a critical section Condition variable OS provided mechanism allowing a thread to yield egtltecution until a given signal is sent Can be accomplished wt mutegtlt wt higher overhead Eg garbage collection thread triggered by memalloc failure 10 Basic Pthreads Mutex Routines To make code thread safe must remove race conditions wt min overhad wo causing deadlock Mutex is simplist method 0 pthreadmutexinit Creates a mutegtlt lock int pthreadmutexinit pthreadmutext mutex ptr to mutex to create const pthreadmutexattrt attr attributes of mutex NULL defaults 0 int pthreadmutexdestroypthreadmutext mutex pthreadimutexilock Block until lock acquired int pthreadmutexlockpthreadmutext mutex o pthreadmutexunlock Release lock which thread will acquire lock depends on scheduling and race conditions int pthreadmutexunlockpthreadmutext mutex o pthreadimutexitrylock Ret 0 if lock acquired otherwise error int pthreadmutextrylockpthreadmutext mutex o USAGE pthreadmutexlockampminlock if locmin lt globmin globmin locmin pthreadmutexunlock ampminlock 11 Boneheaded Untested Work Pool Example struct workq void DoAllWorkO struct workq next C ii struct workq mywork double x Y while1 workheadNULL if workhead pthreadmutexlockampwaock pthreadmutext waock C kh ad khead workheadgtne main wor xt pthreadmutexinit ampwaock NULL pthreadmutexunlockampwaock startworkers DoThisWork mywork GenerateWork FreeWorkNode mywork leanu else pthreadmutexunlock ampwaock void AddWorkstruct workq continue pthr P eadmutexlock ampwaock khead pgtnext wor workhead p 339 pthreadmutexunlockampwaock 339 12 Work Pool With Chunk Scheduling Untested struct work q void DoAllWorkO ruct workq next C N int struct workq mywork double X Y int mynwork0 while1 workhead if mynwork lt THRESH M workhead C int nwork0 i pthreadmutextrylockampwaock0 define CHUNK 16 if workhead my define THRESH 6 pthreadmutext work GeterChunkltmywork qlock PTHREADMUTEXINITIALIZER ampmynwork main pthreadmutexunlockampwaock startworkers 339 GenerateWork 339 C1eanup if mynwork gt 0 DoThisWorkltmyworkgt mywork FreeWorkNode mywork ork 11 forn11astp lastgtnext last lastgtnext n pthreadmutexlockampwaock workhea rk n pthreadmutexunlock ampwaock 13 Mutex Gotchas Ensure mutegtlt protective wt no deadlocks not always easy 0 Instead of pthreadJnutexltrylock should I spawn another thread If lrg data struc eg tree may want heir of locks eg lock subtree Must force ordering of lock acquisition so that all locks are ac quired in same direction preventing deadlock due to two locks awaiting each other39s release 2 procs 2 locks pO locks Ll pl locks L2 pO blocks L2 pl blocks on L1 deadlock deadly embrace NH pO locks Ll pl blocks Ll pO locks L2 unlocks Ll pl gets Ll blocks L2 pO unlocks L2 pl gets L2 0 Putting a mutegtlt around every shared data structure will work but may be too restrictiveexpensive so use another solution Usually OK to have multiple readers but only one writer If we spend a lot of time awaiting lock will waste CPU cycles Mutegtlt often used for critical section but often use more complicated synchronization methods for complex structures 9 ll 14 Why Condition Variables o Mutegtlt alone can be expensive Thread wastes a lot of CPU time sitting blocked on lock Thread wastes a lot of sys res by excessive polling wt trylock Costs can be much higher if multiple threads on one node Blockedpolling thread competes for CPU wt thread holding lock ll Condition variables allow thread to yield egtltecution until given event takes place eg more work in queue o Cond vars have mutegtlt associated with them a Usually action taken after cond var depends on shared data struct eg task queue Want newly awoken threads to take coordinated action mutegtlt ensures coordination example forthcoming Threads waiting on cond vars can be spuriously awoken by OS so checking data associated wt mutegtlt can ensure we don39t want to go back to sleep 15 Basic Pthreads Condition Variable Routines 0 int pthreadcondinit pthreadcondt cond const pthreadcondattrt attr 0 int pthreadconddestroypthreadcondt cond pthreadlcondlwait release mutegtlt mut suspend egtltec until another thread signals on cond or OS wakes us spurious on wakeup each awakened thread blocks until mut reaquired int pthreadcondwait pthreadcondt cond pthreadmutext mut o pthreadlcondlsignal awakens 2 1 thread waiting on cond int pthreadcondsigna1pthreadcondt cond pthreadlcondlbroadcast awakens all threads waiting on cond int pthreadcondbroadcast pthreadcondt cond 0 Typically signalling thread has mut locked when signalling and re leases it directly after 0 Also a timed wait pthreadlcondltimedwaito RTFM 16 Chunk Scheduled Work Pool using a Condition Variable struct workq C void DoAllWorkO workhead int nwork0 define CHUNK 16 define THRESH 4 pthreadmutext w lock PTHREADMUTEXINITIALIZER pthread c ondt wqcond PTHREADCONDINITIALIZER void AddWorkstruct workq p C struct workq ast forn01astp lastgtnext 1a t lastgtnext n pthreadmutexlockampwaock lastgtnext workhead nwork n if nwork 11 pthreadcondbroadcast ampwqcond 339 pthreadmutexunlock ampwao ck if mynwork 0 C pthreadmutexlockampwaock while Workhead pthreadcondwait ampwqcond ampwaock LOCKED 1 else if mynwork lt THRESH M workhead LOCKED pthreadmutextrylockampwaock if LOCKED if workhead myw rk GeterChunkltmyworkampmynwork pthreadmutexmlockampwaock DoThi sWork mywork mywork FreeWorkNode mywork mynwork 17 Building a Barrier May need to have phase Cha ge so barriert threads rurvriH lQ Orv p DroCS 0 mm Called by mam one copy o r t m mem 0 Cost a 2 xt scruct bamer wbzrrlergnltol 1 WW t grab 3 0 OCK p maocszeofstruct bamer 7 Prob not payed m full due pchreaamucexomubpe ock NULL pthrezdcondmtamppgtcond mm to ECK 0 SWC psgtndone o 2 After 7 1 grabrel or lock returnp vom b2rrerstruct barrler a gt r b nd 7 cexuocmbenock NULL Ca We get d 2 t lt NT bonaone lt m pchreaaconawa1cabocona abomck e1se pchreaaconabroaacascabemna pchreamuceunockltbbegt1ockgt 18 Clint s Special newacarasmell39 ExtraaCheap Barrie Barr req 1LSt sync but not last so what f we use a dummy lock so evenone doesn t Sync on begt1ook7 vold barrler struct barrler ab pthrezdmutext DJlUIEXJNITIALIZER ckampbgtock NULL NI dum mm pthrezdmutexo 1f ltbegtnaone lt pchreamucexuockbaum pchreaaumuceumockltbbegt1ockgt bonaone lt m pchreaaconawa1cabocona ampdum pchreaaumucexmmcnbaum else pchreaaconabroaacascabemna chreaaumuceumockltbbegt1ockgt Cost to O o Prob rs COWDBUUOH for smgle lockComm Show later how to reduce 22092p by usmg g lockCorn 19 Oploggp Barrie P9 308 typedef struct barrnode 1 law1 o r even stay 1n pthrezdmutext countlock pthreszmtexJockwad countlock c eaa count39 p hr cona c ukcoUp ElkGoDown bud nt count whlle C1dcount lt 2 mybarr c pmreaaconawamltampbm bkcovp ampbld count pthrezdnmtexunockampbl l countlock typedef struct barrnode e1se odd odes slgiz amp leave pthrezdnmtexockampbl l counmockgt bm count vom mybau1n1cltmy1gbauc b r 1f bl lcount 2 f0 pmreaaconas1gla1abridbkcoub 0 mngbarrJdNT r 10 1 lt NT bcount o whlle b1dcount r pthrezdmutexgnlt pmreaaconawa1cltampbm bkconown ampbl l cou ampb countlock NULL pthrezdnmtexunockampbl l countlock pchreaacona1mc break ampb bkcovp NULL pchreaacona1mc base mn ampb bkconown NULL 1 a 2 vold mngozrrJer mngbzrrJ b Jul 1am 1 base 0 do k base 1am 1 pchreaaconas1gla1ltampbm ukconown 20 Log vs linear Barrier on 32 Nodes L ZLxllmlznm e un aamuam 22ml l39 1 4 A a m l 5 m 39 quot7 mawm Flynn 7 Emma m cl moo sequavkal ana loganmmc bamersas 3 Michel m numeral quotneeds on a 2 mm SGI Ovlgm m 0 Log barrier doesn t have true OQ Shape WHEN 21 A Read Write Lock Implementation Imagine database everyone can read only one write prioritize writes so searches find new stuff May want to typedef struct pthreadcondt rcond wcond p gtnwp n text rwmut while pgtnread I pgtwriting int nread writing nwpend threadcondwaitamppgtwcondpgtrwmut d void rcwrwlockwlockrcwrlockt p C pthreadmutexlock amppgtrwmut e d rcwrwlockt pgtwriting 1 void rcwrwlockinitrcwrwlockt p pthreadmutexmlockamppgtrwmut p nread pgtwriting pgtnwpend o pthreadmutexinit amppgtrwmut pthreadcondinit amppgtrcond void rcwrwlockmlockrcwrlockt p C pthreadcondinit amppgtwcond u pthreadmutexlockamppgtrwm t if p gtvriting pgtwriting 0 void rcwrwlockrlockrcwrlockt p C if pgtnwpend pthreadmutexlockamppgtrwmut pthreadcondsig39nal amppgtwcond while pgtwriting I pgtnwpend else if pgtnread pthreadcondwaitamppgtrcond pgtrwmut pthreadcondbcastampp gtrcond p gtnrea pthreadmutexunlockamppgtrwmut else if pgtnread 0 M pgtnwpend p hreadcondsig39nalamppgtwcond pthreadmutexmlockamppgtrwmut 22 Error in book s Read Write Lock Implementation Book gives following wlock implementation on page 303 void mylibrwlockwlock mylibrwlockt l pthreadmutexlock amplgtreadwritelock whilelgtwriter gt0 II lgtreaders gt 0 lgtpendingwriters p hreadcondwaitamplgtwriterproceed lgtpendingwriters lgtwriter pthreadmutexmlockamplgtreadwritelock 0 Anyone see the error 23 Handling Pthreads Startup in Libraries If you39re creating lib no control over main not always possible to alloc all shared thread specific structs up front eg mutegtltes etc 0 Can handle init via statically init mutegtlt or pthreadonce call void libinitvoid pthreadmutext libinitlock PTHREADMUTEXINITIALIZER int libinit0 void libinit void pthreadonce libinitonc ePTHREAD0NCEINIT void libcall pt void libcall hreadonceamplibinitonce libinit 1f llibinit threadmutexlock amplibinitlock 339 if llibinit libinito t pthreadmutexmlo ck amplibinitlock 24 Info for Further Studies 0 Can set attributes of a thread If thread is detached or joinable default Size 84 location of threads private stack recursion etc Threads scheduling policy and priority complex Can set type attr of mutegtlt using pthreadinutexattnsettype PTHREADMUTBXJORMAL PTHREADMUTEstECURSIVE PTHREADMUTEXJERRORCHECK Cancel egtltec of thread from another thread pthreadjancel Cancelled thread must allow it 84 have locks etc in known state Threads change their canceabiity via pthreadjetcancelstatei PTHREADLANCELJDEFERRED default cancelled only during cond wait join or testcancel PTHREADLANCELASYNCHRUNUUSi Killed anytime PTHREAD ANCELJDISABLE Thumb nose at cancelling thread Thread aware atexit wt pthreadcleanuppushpop Func stack called when cancelled or pthreadexit Handle internal library state info for each thread by using keys Same key points to different mem for each thread Eg use key to have serial rand gen point to local seed