### Create a StudySoup account

#### Be part of our community, it's free to join!

Already have a StudySoup account? Login here

# STPrivacyinElectr.Society CS680

Drexel

GPA 3.88

### View Full Document

## 36

## 0

## Popular in Course

## Popular in ComputerScienence

This 55 page Class Notes was uploaded by Vito Kilback on Wednesday September 23, 2015. The Class Notes belongs to CS680 at Drexel University taught by Staff in Fall. Since its upload, it has received 36 views. For similar materials see /class/212457/cs680-drexel-university in ComputerScienence at Drexel University.

## Reviews for STPrivacyinElectr.Society

### What is Karma?

#### Karma is the currency of StudySoup.

#### You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 09/23/15

SSE and SSE2 Timothy A Chagnon 18 September 2007 All images from nte 64 and IA32 Architectures Software Developer s Manuals Overview SSE Streaming SIMD Single Instruction Multiple Data EXtentions Successor to 64bit MMX integer and l 128 bits l XMM Registers Eight 1 ZBrEit MXtSR Register MMX Registers Eight smart Generalrl urpose EFLADS Register its Registers Eight 3273K lS Address Spam 232 0 Figure 111 Steaming SIMD Extensions 2 Execution Environment AM D s 3DNow extentions SSE Introduced in 1999 with the Pentium III 128bit packed single precision floating point 8 128bit registers xmmOxmm7 hold 4 tloats each Additional 8 on x8664 xmm8 xmm15 SSE2 Introduced with Pentium 4 Added double precision FP and several integer types SSE 3 amp 4 Not yet widespread More ops DSP permutations fixed point dot product Data Types Instructions end in D S I Q etc to designate type Conversion instructions CVTXX2XX 127 9695 54 63 3231 o Figure 104 128 Eit Packed SinglePrecision FloatingPoint Data Type Pretisiun Fluatinquuint 127 64 E3 0 127 D 127 D integers 127 D I l l integers 127 0 Figure 112 Data Types Introduced with the SSEZ Extensions 39 Packed vs Scalar Ops 39 Most operate on 2 args Instruction Basics OP xmmdest xmmm128 src lnplace operation on dest x2 I Xi XD I lv39 l lw I gt0on l xcono v l v2 I X 0P V x2 OP v2 Instruction Varieties Figure 105 Packed SinglePrecision FloatingPoint Operation MOV loads stores Arithmetic amp Logical Shuffle amp Unpack 9 CVT conversion x2 l x l xo OP Va Cache amp Ordering Control Figure 105 Scalar SinglePrecision FloatingPaint Operation Arithmetic amp Logical Instructions Can take a memory reference as 2nol argument ADD All put results into 1St argument SUB CMP can be made to use EFLAGS MUL Full instruction names are OPTYPE DIV ADDPD add packed double SQRT AN D38 and scalar single etC39 MOVAPD xmmO eax MOVAPD xmml eaxl6 MOVAPD xmm2 xmmO ADDPD xmm2 xmml SUBPD xmmO xmml MOVAPD eax xmm2 MOVAP D eaxl6 xmmO Moving Data tofrom Memory MOVAPD move aligned MOVUPD move unaligned regreg regmem memreg Memory should be 16 byte aligned if possible Starts at a 128bit boundary in VirtualPhysical addressing Special types and attributes are used in CC MOVUPD takes much longer than MOVAPD Intel39s optimization manual says it39s actually faster to do MOVSD xmmO mem MOVSD xmml mem8 UNPCKLPD xmmO xmml Shufer amp Unpack BEST I Xt I xo I r i chI w I Iva I iltl BESTI V1orVu I x1 arm I Figure 115 SHUFPD Instruction Packed Shufer Operation BEST I x I Xn I SRO I W I W I BEST I w I X I Figure 115 Hinh Inpark BEST I x1 I Xu I w w X I SRC I BEST I Figure 117 UNPCKLPD Instruction Low Unpack and Interleave Operation SHUFPD xmmO xmm1 pattern UNPCKHPD xmmO xmm1 UNPCKLPD xmmO xmm1 Combine parts of 2 vectors InPlace Values from 15t argument alway go into low 12 of dest Note that UNPCKs are just a shortcut of SHUF But is it faster or slower CC lntrinsics Both Intel39s ICC and GCC have intrinsic functions to access the SSE instructions More or less a 11 mapping of instructions Instead of in place they look like 2 in 1 out functions Does this indicate that penalties are incurred for unnecessary register copiesloadsstores Or are these optimized away It appears that way MOV iS Split up into load store and set m128d mmaddpd le8d a le8d b m128d mmloadpddouble const dp Adds the two DP FP values of a and b uses MOVAPD Loads two DP FP values r0 a0 b0 The address p must be 16byte aligned r1 a1 b1 r0 pO r1 p1 Example include ltxmmintrinhgt int mainvoid int i m128d a b c double XO2 attributealignedl6 12 35 double xl2 attributealignedl6 O7 26 a mmloadpdx0 b mmloadpdxl c mmaddpda b mmstorepdx0 c fori O i lt 2 i printfquot2fnquot x0i return 0 Efficient Use Use 16byte aligned memory Compiler driven automated vectorization is limited Use structofarrays SoA instead of A08 3d vectors ie block across multiple entities and do normal operations To Investigate Denormals and underflow can cause penalties 1500 cycles Are intrinsics efficient Do they use direct memory refs SHUF vs UNPCK Cache utilization ordering instructions X87 FPU and SSE are disjoint and can be mixed MMX registers can be used for shuffle or copy Intel Core architecture has more efficient SIMD than NetBurst Direct memory references Trace cache and reorder buffer effects on loop unrolling Nontemporal stores Prefetch SHUFVS UNPCK Intel Optimization Manual shows sometimes UNPCK is faster Probably because there s no decoding of immediate field gcc4x will convert a call to the shuffle intrinsic to an unpck instruction if it can Table data represents usage of register arguments delays from direct memory references depend heavily on cache state NetBurst Arch Pentium 4 Xeon Table 4 Streaming SIMD Extension 2 Doublepretisinn Floa 39 g 39 39 Instruction Latent Throughput lExecutinn Unic2 lDISpIayFamllyjlsplayMadel oFoarI 0F02H 0F03H lDF02H lDFDZH lADDPD xmm mm 5 4 2 lFPiADD ISHUFPDaxmm mm mm l6 s 2 2 lMMxisHFT lUNPEKHPD Xmm xmm 6 5 2 2 lMMxisHFT lUNPCKLPD xmm xmm 4 4 2 2 IMMXjHFT Core Arch Core Duo Pentium M Table C4a Streaming SIMD Extension 2 Doubleprecision DisplayFamilyiDisplayModel represent variation across different processors Family CF is NetBurst Architecture Family 06 is Intel Core Architecture Model for NetBurst in range 00H to 06H data for 03H applies to 04H and 06H Model for Pentium M are 09H and ODH Core SoloDuo is OEH Gore is OFH References Wikipedia Streaming SIMD Extensions httpenwikipediaorgwikiStreamingSlMDExtensions lntel 64 and lA 32 Architectures Software Developer39s Manuals httpwwwintelcomproductsprocessormanualsindexhtm lntel C Compiler Documentation httpwwwintelcomsoftwareproductscompilersdocsclinmainclsindexhtm Apple Developer Connection SSE Performance Programming httpdeveloperapplecomhardwaredriversvessehtml What is the WHT anyway and Why are there so many ways to compute it Jeremy Johnson 1 2 6 24 112 568 3032 16768 WalshHadamard Transform yWHTNx N2 I 1 w 1 1 WHTNWHT2 W WHT2 WHT21 4 WHT4 WHTJEWHTz 1111 1 1 1 1 11 1 1 L L J 1 1 1 1 1111 WHT Algorithms Factor WHTN into a product of sparse structured matrices 0 Compute yM1 M2 Mtx YtMtx Y2M2Y3 yM1y2 Factoring the WHT Matrix AC BDA BC D A BA1I B A B CA B C 1m 1nlum 111110191190 141 101011 100 WHT4 1111 10100011 1 1 11010 1001 1 WHT2 WHT2OVHT2 LXIQ WHTQ Recursive and Iterative Factorization WHT3WHT2 I4IE WIIT4 WHT2 1002 9 mama 12 WEED WHT2 1002 39WHT2 12 12 12 WHT1 WHY 1412 NEW 12 12 3 W2 WHT2 1N2 WHT2 12H02 Iz WHTz WHT2 I4I WHT IZL WHT2 Algorithm CCUI39SIVC R n IllaIllsIII39I 11111111 I II IAIIII IlliilllsIII39I 11111111 I II I IIIIII llllilllsIII39I 1 WHTE WHT2 14x12 8 WHT2 1902 WHTQ Iterative Algori l 1 1 l l l 1 l l l l l 1 l 1 1 l l l l 1 1 l l l l 1 l l l l l l l I l l l l l l l l l l n 11 11 all ll 11 11 lol5 11 l l l 1 l I l l l 1 l I l WHTz 1412 WHT2 12I4 WHTZ WHTs WHT Algorithms Recursive Iterative General WHTNWHT2 IN2IZ WHrN2 WHTZ 12W a WHsz a lineman Where n m9unt WHT Implementation 439 De nition formula NN 4 N2N N52 xWHTNx xifs3bxbs xbMls Implementationmested loop RN S1 for it 1 RRNi n1 n WHTquot 12W1 n39 f0rj0Rl i1 for k0S1 ss Ni Partition Trees Left Recursive Right Recursive Balanced Iterative a Ordered Partitions There is a 11 mapping from ordered partitions of 11 onto n1 bit binary numbers gtThere are 2 quot ordered partitions of n 16210100010 111111111 12429 Enumerating Partition Trees 00 01 01 Search Space Optimization of the WHT becomes a search over the space of partition treesg for the fastest algorithm It The number of trees T1 2 MTquot nlmn I HI 1I 2 3I 4 5 6l 7 8 Tnl 1 2 5 24 112 568 3672 15768 Size of Search Space Let Tz be the generating function for Tt1 Tz 20 z Tz21 Tz 1 emu11312 where a4ls as 68284 Restricting to binary trees 32 z 1 z 4 822 T11 S Vnm WHT Package P schel amp Johnson ICAS SP 00 Allows easy implementation of any of the possible algorithms Partition tree representation Wnsmalln splitWnWnt i Tools Measure runtime of any algorithm r Measure hardware events Search for good implementation Dynamic WW8 Evolutionary algorithm Histogram n 16 10000 samples DisllhmhnafWHT u Runllm cln Ul aSFARC H i i Dis bulbnalWHT a Runllrnssorl Penilum Ill l l i am 500 E E 53 3 400 5 En lt E 395 3 E E 5 i 002 003 004 005 005 011 Time In Sewn 0 07 0 08 0 09 01 ds oWide range in performance despite equal number of arithmetic operations n2n ops oPentium III consumes more run time more pipeline stages oUltra SPARC II spans a larger range Operation Count Theorem Let WN be a WHT algorithm of size N Then the number of oating point operations flops used by WN is ngN Proof By induction opsWN 2132quotquot quot apsWN t I m 2quot w 2quotnn2quot Instruction Count Model Ian aAltn Z i Lin 2a An 1 i1 Am number of calls to WHT procedure a number of instructions outside loops An Number of calls to base case of size I on number of instructions in base case of size I L number of iterations of outer i1 middle i2 and outer i3 100p B number of instructions in outer i1 middle i2 and outer i3 100p body Small1 H10 3 1 w wrung quotm 051quot WWldg tw align d g1 Dbl 31ml 133mm 1 tam mly ml 1 L lime hm agglyjmilnz39 mvl itiup adx fleas strids S ta EM W1 lztamhtaax Him it array has address to m 161 than I mamam mil inasnmgal sumRH m instill fistmk xxt l 1am gunman I Wst tf j mm m xmwasamal mmmsuwmm hump Mg th usualakmxmxw 7 my 33111 fiatci hlolwlahatNahum 01 tum new rattan atmatmmtm mp1 Iiimp eda Nmm swi xl iatsl m Recurrences t An122n IAN n n1mnt i1 AG 0 n a leaf A n V12 where V number of leavesl Recurrences L1quot 2quot Lntgta nmquot39n i1 L2n2quot39 L2n2quot quotquotquot quota n nn L3quot22quotquotquotL2n12quot nn1mn 1 Lin0 n aleaf Histogram using Instruction Model P3 15GB 1DDD SDI 2 003 004 005 006 007 005 009 01 011 UZeHJE 49DE gems BeiUE 1eU712eDT1xleU71Ee 7 oc1120t134and0c1106 oc27 3118 3218 and 3120 Algorithm Comparison Recursvellteratlve Runtlme Rec ampBalllt Instructlon Count 200E00 180E00 160E00 140E00 o 120E00 1 100E00 o rr1i1 L 800E01 1391 600E01 39r I 4 00E01 bal1i1 2001301 000E00 12 3 4 5 6 7 8 910111213141516171819202122 WHTsuea ro RecamptBest Runtlme SmallIt Runtime 120E01 1 2015 100E01 100E01 800E00 800E00 g 600E00 g600E00 400E00 rilIT 4 00E00 2 00E00 2 00E00 0 00E00 H 0 00E00 1 2 3 4 5 6 7 8 12 3 4 5 6 7 8 910111213141516171819202122 WHTmzea ro WHT size2An Dynamic Programming min Cost 111 n n T T I x where T11 is the optimial tree of size n This depends on the assumption that Cost only depends on the size of a tree and not where it is located true for IC but false for runtime For 1C the optimal tree is iterative with appropriate leaves For runtime DP is a good heuristic used with binary trees Optimal Formulas Pentium ubtmBLnwLI W1 HLMH NLMH wLwn H d RLHSJMH RLHSJEH H JWLHMd H HHJRLHMAHH HHJRLHHAWLWHH WLHBJWLHMJNMH HHJRLHMARLH J HH 2Izlgllzll2536111111 2IzlgllzllI22151511 nu UltraSPARC 1139 2 3 410 539 6 3JII41 4JII41 4JIISI 55 56 444 454 4515 555 556 4454 44t51y5 4t5151115 55I51115 Different Strides Dynamic programming assumption is not true Execution time depends on stride S X ra scn wnoc gt a Tm TLMA HY Aisnnu n 1K Tm 744mm mm 1 LTWAL qLTU39lf Lghl rm 13rTUmIfm1J rM 9mm 1 inn1 qu TM fTW k LrL W i 1 60 Wm SIsz wf ul cw TCA l w b 7 m 2 an N J g kg k3 sz La jYQWMS U Q 1 W O m gl gmld 331801 For QQJL MAHX mEL 08 OPS 2 1mg m1 k New QSSUWQ Hzlc wr TL ggk gm T d 1 16 C2 Ta W J Cam T A 3 qk mq kg 4 f2 639 7V j 91k ltZ L ltfj d 5 11L 3 Cgm ligwtc OWL Ckkwa OMS er A COWdel am A g 5r ARRJA x gilo erTQ w WNW WQ am gm a x 39 k7 M q L M 1 9 vq kw I QM th o C 5amp1 4quot Qtts39l CDL1 CVL 1 CRUSH L cm 1 1 Q LlSl l curl Bk I q o 3 3 x qlk l q Cxl 11 qugqt q qukz A Rug Q11931 4 4 45 A k q LquotL c39 Lgt n aim1km QULIL Lchak39l l F a b CHSL JL x quSn 0x11 gu FCL11gtLVJ CLIgt L qugou I LW L CILXLquot J ngkgnv CIAL417T K V N WWWL nzbh w I I y 615 L14 a u LL quotquotquotquot quot m W R t w v i m n Vl 1 V1 398 VltA 3 1 Qvl L C mg 17 quot JQ J k 3 BR Mao DJ R CrLk Mxx w 7 g9 v A C U l W A A A 1 x M M s K A 393 l lt39O mw3 we a N For I A In 0 x 4 PV m 535W 5 K were FL in J SVL 41 T la quotram H 11 u V my u w WW 7 53949x CZ H194 A g WV C or 0wa u V77 1 rorAv 0 gt4 orsv W v WWW r f P 7 mi my if we 2 f f 93L mo kVIz mr racoV crrKTJ 32 CragC w Vbsvxr figD A 9 3944 I quot M u mu a rm 3as fbi f a A WWHNWMA I 39 K ova oL A SH I m xcm M A PL 7quot Q aux Jr Cs LL kn 339 L 11 F A quot PS 739 5 C W 9 9 TIM Vt ens as Q L F 3 5r 35 PIL 3r Pt i 8A K Mfg C 1 lt q au J F CtHgu Aric113 4 Ellt1 5 C411quot 391 t quotquot QT39Lh w c VMHF39L 0 n 1 L r qu zx Ova JQM 3 C l kiL canEn wwx Quin anlek H CH q kh quot C u E11 3 RunLn 393 C39H LLLL Cu1 A CL 1 L g f CV1 g A qtyk 3quot CV11in Quixu C qg L Jr Q zL H C11 739 C V g 0 xibe Ar ardfku Ar 39c u g L quot gligu 39 Cutx ah R C h quot Qk1m qu yn kSwaK qwm x l Q1 4 C lxxi

### BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.

### You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

## Why people love StudySoup

#### "There's no way I would have passed my Organic Chemistry class this semester without the notes and study guides I got from StudySoup."

#### "Selling my MCAT study guides and notes has been a great source of side revenue while I'm in school. Some months I'm making over $500! Plus, it makes me happy knowing that I'm helping future med students with their MCAT."

#### "There's no way I would have passed my Organic Chemistry class this semester without the notes and study guides I got from StudySoup."

#### "Their 'Elite Notetakers' are making over $1,200/month in sales by creating high quality content that helps their classmates in a time of need."

### Refund Policy

#### STUDYSOUP CANCELLATION POLICY

All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email support@studysoup.com

#### STUDYSOUP REFUND POLICY

StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here: support@studysoup.com

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to support@studysoup.com

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.