Advanced Topics in Computer Graphics
Advanced Topics in Computer Graphics CMPS 290
Popular in Course
Popular in ComputerScienence
This 237 page Class Notes was uploaded by Dr. Elyssa Ratke on Monday September 7, 2015. The Class Notes belongs to CMPS 290 at University of California - Santa Cruz taught by Staff in Fall. Since its upload, it has received 58 views. For similar materials see /class/182267/cmps-290-university-of-california-santa-cruz in ComputerScienence at University of California - Santa Cruz.
Reviews for Advanced Topics in Computer Graphics
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 09/07/15
Designing adaptive systems by maintaining a mixture over a set of experts Manfred K Warmuth Corrie Scalisi Includes some earlier work with Robert Gramacy Scott Brandt and Ismail Ari University of California Santa Cruz Dec 8 2006 NIPS worshop Last update December 13 2006 a Two example problems 9 Measuring the on lineness of the data Q The expert framework 9 Shifting experts Experimental results 6 Wrap up Outline a Two example problems e on linen M A wulj up Two example problems 1 Disk spindown problem 0 When to spin down the disk on your laptop 0 Best time out timeuserusage dependent First 3000 iterations of the Cello2 Dataset i iogeiDuration in Seconds i X i i i i 0 500 1000 1500 2000 2500 3000 Trial M Warmuth UCSC Designing adaptive systems 4 37 1mg prnb rm Non convex loss may Used bv Valan Mewquot W mm mm mme Spmdawn Ea l mm m Wm 2 Caching 0 Whenever small fast memory and larger slower secondary memory 0 Keep objects in faster memory which likely to be needed again soon a Hit if requested object resides in cache a Miss otherwise Caching Policies o Decides which objects to discard to make room for new requests a 7 common policies LRU RAND FIFO LIFO LFU and MFU o 5 fancy recent policies SIZE GDS GD GDSF LFUDA 0 Criteria 3 Recency and frequency of access a Size of objects 0 Cost of fetching object from secondary memory a De facto standard LRU Which Policy to Choose o For which situation 0 Disk access on PC a Web proxy access via browser a File server on local network a Middle of the night during backup 0 Application as well as time dependent 0 Choosing one is suboptimal o All policies claimed to be on lineadaptive Characteristics Va with Time zusuuu exuuuu Zl azuuuu zzsuuu alumna zasuuu H 7 zusuuu exuuuu 2x5uuu azauuu zzsuuu alumna zasuuu n a 7 a Outline mg g EXpz L l lmff i v 1 3 P I p 1339 First trick Permute the dat a Data not on Iine if permuting does not change things 0 Algorithm not adaptive if same performance on permuted data Measuring the on Iineness of the data Permuting trick for disk spindown data First 3000 iterations of the Cello2 Dataset First 3000 durations in the Intel dataset i i i 1 U i I I logeiDuration in Seconds Narration in 090mm 3000 0 500 1000 1500 2000 2500 3000 Trial Triad First 3000 durations in a randomized permutation of the Intel dates3t 1 A E Cl 2 e E E O i e 8 E 0 500 1000 1500 2000 2500 3000 1500 2000 2500 3000 Trial Trial on Iine not on Iine Designing adaptive systems Meas the onrlinene of the data Permuting caching data mm mm mm mm mm mm mm mm mm mm mm mm mm mm highly on Iine data some caching policies already on Iine M Warmuth Good comparators 0 As good as BestFixed chosen in hind sight 0 But BestFixed does not capture on Iineness of data 9 Same performance on original and permuted data 139 BestShiftK for spindown problem Comparator a Partltlon of the timeline Into segmens 0 Best m each segment pm xepxemng pmmun mam wfshj s Measuring the on Iineness of the data BestFixedK w 1 S u a w E a E I 2 E E o m i 2 lt1 t t BestFixed BestShif tK BestFixed 0 L13 ABestShiqu point representing partition I I I I I I I I I i 0 5 10 15 20 25 30 35 40 45 50 of shifts K Dynamic programming 0KN2T H where K of partitions N of discrete idle times T of trials M Warmuth UCSC Designing adaptive systems 17 37 Measuring the onIineness of the data BestShift cu rves BestShiftK on Intel dataset BestShimK on Canoe Data 5 50 experts spaced between 0 and 10 50 experts exponentially spaced betm een 0 and 10 I BestShifKK 1 i i BestShif tK 8 BestShiftK Randomized 7 BestShif tK Randomized Op mal 095 Optimal 5 r 0n 74 r 085 s i d 5 2 7 g 08 i 7 E a 4A 7 m 075 r e 8 i 8 a A 3 07 r E LG 4 065 s i o i4 A a 055 r 7 i2 0 5 i i i i i 3 i l 1 l 39 0 50 100 150 200 250 300 0 50 100 150 200 250 300 K of shifts of timeout value 2558785 trials of Cello2 Spindown Cost 10 sec K of shifts of timeout value 12458 trials of Intel Data Spindown Cost 10 sec M Warmuth UCSC Designing adaptive systems Com parotors for caching o BestFixed a posteriori best of 12 policies on entire request stream 0 BestRefetchingR minimum number of misses with at most R refetches in any sequence of switching policies Measuring the onrllneneix of the data Refetches amp Policy Switches Comparator All sequences of the form GDSF LFUDA SIZE GDSF LRU I U U U U ml 102 103 13904 We plot miss rate v5 refetches m B A 19 5 Paint re rescuing 5e Hence a P 1 3 a 7 6 H 1 Total of refemhes 3amp5sz M ssra es quota Bemexed Bes lReietcmngml AuvmmCacnes sz Reietches as ona Requests Dynamic programming 0RN2 T V Weazm mg the JHr meHeE of We data Goal for on Iine algorithms 0 Beat BestFixed easy 0 Get close to BestShift BestRefetching o In caching reduce IO39s and end user latency 0 Fast algorithms Mea ngthe anilinene s the data Score card for caching algorithms B7 C7 BesnFixed miss refetch lt LRU miss A Toml 10 leg man EestFixed B Toml 105 less than LRU c Tami v0 mme than LRU than BestRefetcinng man BesLFixed Miss Rate 95 7 r r miss refetch lt BesLFixed miss Besmefemmng A Ach Refetches as of Total Requests M Wa rm uth Outline M A wulj up What experts Caching o 12 caching policies Disk spin down a Discretize interval 0spindowncost 5n mum mech imam that an expanentia v waned heiween u and m mew mmm m 2mm u m 5 2m 25 an index a1 expat On line algorithm for learning as well as best experts One weight per expert o Represent confidence of master algorithm in expert 0 Master algorithm predicts with convex combination of experts t Wif 0 Loss update Wit1 N LWV 0 Designed to do well against BestFixed o In some cases logN regret Outline M A wulj up Sliilii As well as best partition 0 Loss Update follows too well 0 Follow it by Share Update a Mix in small in a 5 times past average weight 0 Updates recover after each shift a Faster recovery if expert was used before a In some cases regret 7 of bits to encode best partition Outline Experimental results M A wulj up Experimental results Spindown results BestShif tK and Share Algorithm on Cello2 dataset BestShif tK and Share Algorithm on Intel dataset 25 experts exponentially spaced between 0 and 5 25 experts exponentially spaced between 0 and 5 055 l l l l l l l quotquotquotquot quotquotquotquot quotquotquotquot quotquotquotquot quotquotquotquot quotquotquotquot 47 n4oc025 7 n 3 on 0005 g 0 5 R g C C 8 8 E n1o ooo1 E 35 7 n3oc01 E E i 045 r e i BestFixed fJ fJ g BestShif tK g BestFixed g BestShif tK Randomized Permutation g 3 r BEStShl lK ltzgt 04 Share Agorithm gt1 BestShif tK Randomized Permutation mm Optimal quot39 Share Algorithm iii Optimal l l l l l l l l l l l 0 50 100 150 200 250 300 0 50 100 150 200 250 300 K of shifts of timeout value 2558785 trials of Cello2 Spindown Cost 5 sec K of shifts of timeout value 12458 trials of Intel Data Spindown Cost 5 sec on Iine not on Iine esigning adaptive systems 30 37 Btperlmemal res s Caching we Tracks best policy Miss7rates under FSUP with Master 08 lru 7 7 fifo 7 07 7 life 7 0 6 uize 7 7 3 05 IE 7 m 7 E 04 rand 7 7 E gds 7 E 03 gdsf 7 7 02 1mg 7 7 0 1 1 roll 7 l o w w 205000 210000 215000 220000 225000 230000 235000 Requesxs Over Time K t r r i Mwssvales quota WWk Masler and Comparalor Missrales 85 LRU rmssvale 2 0 Obhgalovy rmssvale EeiseIelcmngR Rank dea Mwssvales quota UMo Masler and Comparalor Missrales 16 6 LRU rmssvale 1 5 Obhgalovy rmssvale EeiseIelcmngR Rank dea SMoLRU g S SMoLHU Masler and Comparalor Missrales 59 8 LRU rmssvale 15 3 Obhgalovy rmssvale EeiseIelcmngR d Rank dea Outline Experime 6 Wrap up Wrap W Pushing the theoretical analysis Disk spindown a Non convex loss but in each trial only two loss values 0 Experts are sorted o Analyze with continuously many experts Caching o Prove bounds for ARCing Kernelization with outer product instances Manfred K Warm uth University of California 7 Santa Cruz UC Berkeley March 18 2009 o Kernel Parading An ifF condition for the kernel paradigm 9 The matrix case Efficiency El 5 39 3 RCquot Outline o Kernel Paradigm 0 Examples lt5 ltx17617 xt76tgt 0 Linear hypothesis WltSgt w o Predicts with w x on instance x j 20 X 05 9 Simply invent new features Embed instances into a feature space RquotHRW FeatureSpwe mi 7 A 0 mi 7 1 47 w 7 5 2 7 a 3 0 7 g H n57 27 H 407 7 01 a 45 o7 7 S 71 4 4 72 u 2 4 h 1121 0 21722 arcsinzlfg x1 X2 Does the expansion always work a Can you always improve things by inventing new features 0 Fitting the data may be o Is this learning Short answer 0 If good fit with non sparse solution maybe 0 If good fit with sparse solution no algorithms based on relative entropy regularization multiplicative updates often better The Kernel Trick If w linear combination of expanded instances then 2 2a gtz gtz Z a gtz W W Km W Kernel function often efficient to compute A seemingly intractible expansion map gt 217zn 1zpzpzqzpzqz W 2n products Kernel magic n KZ7i gtZ gti Z HZkHZk H1Zk3k lg kel kel k1 hV I o2 time 0n time Visualizing the magic source 2121 2222 2323 sink One term per path Good news Many of our favorite algorithms can be kernelized Linear Least Squares Widrow Hoff Support Vector Machines PCA Simplex Algorithm Question 0 What is the class of algorithms that can be kernelized characterization of Kernel Paradigm o Alg makes linear predictions where the parameter vector is a linear combination of instances 0 Alg only relies on dot products individual features never touched Which algorithms are kernelizable The usual answer Representer Theorem KW71 w arg inf 7 Z ossw xgt Solution w linear combination of the gtx Sufficient condition for the fact that parameter vector w is linear combination of instances Outline An ifF condition for the kernel paradigm Condition for tile lernel paradigm Kernel Matrix a Labeled sequence of examples ltSgt ltX17 61 X2762 39 39 39 7 x157 tgt7 where x E R Z 6 R o Instance matrix X contains instances x17 xt as columns 0 Kernel matrix XX contains pairwise dot products between the instances x 0 Here we assume instances x are already expanded Kernel matrix If U is an orthogonal matrix then US denotes the sequence ltUX17y17 sziyzh 7 anytgt Transformation only affects the instances x x uxy ux Lemma Two sequences of examples are orthogonal transformations of each other iff the kernel matrices associated with the sequences are the same Class of algorithms 0 Alg produces weight vector W example sequences to a R 0 Actions of alg depend on dot product Wlt5gt 39 X o W is transformation invariant if Wlt5gt 39 X WltU5gt Uxi for all sequences orthogonal matrices U and instances x characterization of Kernel Paradigm Theorem Let W be the input output mapping of an algorithm from sequences of examples to parameter vectors The following three statements are an equivalent characterizations of the Kernel Paradigm 1 w is transformation invariant 2 a For any WltSgt Xa w ere a is a coefficient vector for the instances b For any lt8 and orthogonal matrix U WltU5gt UWlt5gt 3 For any sequences 57 lt5 with the same kernel matrix ie X X there is a coefficient vector a st Wltsgt Xa and Wlt gt xa Accessing features violates formal def of paradigm Consider alg 0 X1 1 gt 0 Wlt5gt x1 otherWIse Alg satisfies 2a predicts w linear combination of instances but not 2b transformed sequence a transformed matrix 1 transformation invariance 3 coefficients only depend on kernel matrix Use U 7 characterization of Kernel Paradigm a The parameter vector is a linear combination of instances Too general By above example alg may not depend on dot products 0 Algorithm only relies on dot products Individual features never touched More general than characterization of above theorem A more general notion of kernelizable algs If alg is mapping f example sequences x new instance a some range then the following two are equivelent by above lemma 0 For all lt5 x7 5 st lt5 x and have the same augmented kernel matrices flt5gt7X fllt gt7il o For all lt539gtx7 U 500 fltusgtUX an m mmm for me lernel par a ligr i i Bypassing all notions of kernelization 7 o w gtv and the algorithm predicts with gtv gtz kv z o w gtv is updated by updating v In a multiplicative update on w translates into a multiplicative update on v o w not a linear combination of the instances gtz and alg not transformation invariant Only for special kernels Loss of w Clgtv must translate to loss of the components of v For example the components of z label the edges of a directed graph Features are products along the paths of the graph Loss of path is sum of losses of its edges Ii 22 23 Z6 21 Z4 Z5 Z7 si n k 2 217 2224277 2324277 2225277 23725277772627 I l l I Dot product gtv gtz Z HVeHXe path P eeP eeP Hue path P eEP Computable via dynamic programming even when cycles in graph Loss of path P must be sum of losses of edges of path lossPvz Zlossevz eeP Outline 9 The matrix Cage Tlie mal Basic setup of matrix case 0 Instances are outer products xiy where x E Rm y E R 0 Parameter matrix W E Rm Dot product trW xy y Wx Kernelization original instance 2 Xi zi Vi Zi 0 Collaborative filtering BAVE x describes user y describes movie W should be low rank What is different in matrix case Vector case iulf Qw 7 lossw xi kernelizable ifF Qw is increasing function of HWHQ Matrix case 39 E Inf QW 7 i losstrW xy Kernelizable if QW is spectrally invariant Increasing function of trace norm QW tr W W Quantum relative entropy QW trWogW 7 IogW0 W0 7 W I l Setup continued 0 Instances are doubled xy 0 Example sequence 5 ltx1y 17 17 xty t7 61 o Transformed example sequence ltUSV ley1V7 17 Ux27y2V7627 Uxt7 y tV7 it where U and V are m and n dimensional orthogonal matrices a Transformation invariant if trWlt5gt Xy trWltU5V gt ny V L for all U V x y 0 Two instance matrices X E Rm Y E Rn Two kernel matrices X X7 YY E Rt The Characterization of kernelizable matrix algorithms Theorem Let W be the input output mapping of an algorithm from sequences of examples to parameter matrices The following three statements are equivalent 0 W is transformation invariant 9 o For any WltSgt EU cili y for some real coefficients cm a For any lt8 and orthogonal matrices UV of dimensions In and n respectively WltUSVgt UWltSgtV O For any sequences lt5 lt5 with the same kernel matrices ie X X XX and YY YY there are real coefficient cf st Wltsgt 2 cu xiyj and Wlt gt cu my I39J I l The n Representer theorem versus our characterization Sufficient conditions with representer type thms If the algorithm is defined as a minimization problem that regularizes with a function that shrinks with the spectrum and if the weight vector had an orthogonal component then removing this component would not change the predictions on the set of instances but would decrease the regularization term Our proof method has a different flavor If the weight matrix has an orthogonal component then we build an instance with this orthogonal component and show that the prediction on this instance and the prediction on a transformed instance do not match Ii Outline 9 Efficiency Large Gram matrix Need decomposition of large Gram matrix XY 6 Rm in terms of small kernel matrices X X and YY E Rt Lemma SVD of generalized Gram Matrix X D Y th tXt tXn can be computed from the SVD of VXX D VYY When Y X and D I then VX X D VY Y simplifies to the symmetric kernel matrix XX Corollary Eigendecomposition of Gram Matrix XX can be computed from eigendecomposition of kernel matrix X X Known corollary was key to kernelization of PCA and Fisher Linear Discriminant Functions SSMMRWSM Ellici Kernelization of some unusual algorithms Linear regression with outer product instances me xixj 7 w I where W is constraint to be positive definite and to have trace one When regularizing with the quantum relative entropy and choosing the initial parameter to be uniform we get the MEG update TRW explin ZitrW xix 6 xix W W7 where exp denotes the matrix exponential and the scalar ZW normalizes the trace to one Kernelization continued Iterative version W exPl ZtrWk XIX 5i XIX k1 ZWk 7 where q is the iteration number Exponent has the form XDX where D is diagonal but not necessarily positive definite Compute SVD decomposition of the exponent from SVD decomposition of the symmetric matrix VX XDVX X lterate until convergence Matrix versus vector case EG for vectors M EG for matrices of vectors to then vector case kernelizable are dyad rank one PCA vs PLS Maya Hristakeva University of California Santa Cruz May 13 2009 Maya Hnstakeva eta UCSC PCA vs PLS May 13 2009 120 Outime 0 Linear Regression Prinicpal Component Analysis 9 Partial Least Squares Maya Hristakeva etai UCSC PCA vs PLS May 13 2009 2 20 Outime Setup a Data matrix instances as columns Xx1xT RNXT 9 Reference values y yL yTT 6 RTquot 1 0 Goal minimize square loss T 1 T 2 7 1 T 2 mJnEEi xxiweyi meEHX weyii Maya Hristakeva et ai UCSC PCA vs PLS May 13 2009 3 20 Outime Variance and Covariance Expectation of X x1 XT EX x Variance of X T varX covX X 09 7 EXx a EXT i1 Covariance of X x1 X7 and Z 21 27 T covX7 Z 209 EXZi EZT In this presentation we assume that X and y are mean centered EX 0 and EM 0 Maya Hristakeva et ai UCSC PCA vs PLS May 13 2009 4 20 Lmear Regresswon Outline 0 Linear Regression Maya Hnstakeva eta UCSC PCA vs PLS May 13 2009 5 20 Linear Regression Linear Regression Least Squares optimization problem 1 T l 7 7 T 7 2 7 T 7 2 Lw 7 nun 2 7571x w y nun leX w yll Differentiate wrt w VWLw XXTwiy0 XXTw Xy Exact solution w XXT 1Xy Note XXT is not always invertible Maya Hristakeva et al UCSC PCA vs PLS May 13 2009 6 20 Linear Regression Ridge Regression Regularization penalizes large values of 1 A l W rlr lr llTWyll2illWll2 w 2 2 Differentiate wrt w VWLw XXTw 7 y Aw 0 XXT Alw Xy Exact solution w XXT AI 1Xy Note XXT Al is always invertible for A gt 0 Maya Hristakeva et al UCSC PCA vs PLS May 13 2009 7 20 Prmwcpa Component Ana ysws Outline Prinicpal Component Analysis Maya Hnstakeva eta UCSC PCA vs PLS May 13 2009 8 20 Prinicpai Component Anaiysis Compression Loss Minimization Find a rank k projection matrix P for which the compression loss is minimized T mpinEHPx 7m E mPin HPX ixiiz mPin tr 7 PXXT mFEax trPXXT T m x tr varPTx where P is a projection matrix of rank k Maya Hristakeva et ai UCSC PCA vs PLS May 13 2009 9 20 Prinicpai Component Anaiysis Projection Matrix Properties Properties of P 0 P2 P E RNXN o P 21 pipiT PPT for P p1pk E RNXK o piTp 1 Le p has unit length o pIij 0 for i7 j ie p and pj are orthogonal Maya Hristakeva etai ucsc PCA Vs PLS May 13 2009 10 20 Prinicpai Component Anaiysis Variance Maxim ization Find k projection direcEions I5 p1pk for which the variance of the compressed data PTX is maximized T T mfax tr varl5Tx E mfax tr TxI5TxT 1 N T T mfax tr 2 BEEXi 1 quot T mFax trP xixi 1 T i mFaxtrP XX c Maya Hristakeva etai ucsc PCA Vs PLS May 13 2009 1120 Prinicpal Component Analysis PCA Solution Let C XXT covariance matrix of X 7 7 mFax trPC i mFax trPZ ycc mFax 7 tr Pc clTPc l cTPck 3 max E 76 03531 25k I k max E y k largest eigenvalues of C 1i1lti2ltikn J Hence P consists of the eigenvectors corresponding to the k largest eigenvalues of C Maya Hristakeva etal ucsc PCA Vs PLS May 13 2009 12 20 Prinicpal Component Analysis Principal Component Regression Principal Component Regression E PCA Linear Regression 0 Use PCA to find a kirank projection matrix P ISIST mFin HPX 7 Xll2 to Minimize square loss arg min fl lPTXTW Yll2 w 2 a Solution w ISTXXTISrl TXy e Rk X 1 Maya Hristakeva etal ucsc PCA Vs PLS May 13 2009 13 20 Prinicpai Component Anaiysis Summary of PCA Finds a set of k orthogonal direction Directions of maximum variance of XXT Minimizes compression error ie best approximation of X Ignores all information about y while constructing the projection matrix P Maya Hristakeva etai ucsc PCA Vs PLS May 13 2009 14 20 Pama Least Squares Outline 6 Partial Least Squares Maya Hnstakeva eta ucsc PCA Vs PLS May 13 2009 15 20 Partial Least Squares Partial Least Squares PLS o Finds components from X that are also relevant to y o PLS finds projection directions for which the covariance between X and y is maximized T argmaxcovXTpy2 argmaxZxJTpyj2 Pr p F1 T argmaxtrpT 20930 Pr F1 arg magtlttrpTXy2 P arg maxpTXYpTXYT P arg max piTnyTXTp P Maya Hristakeva etal ucsc PCA Vs PLS May 13 2009 16 20 Partiai Least Sq uares Finding the First PLS Direction p1 Finding p1 arg max plTnyTXTpL 5t plTpL 1 P1 Lp1 PlTnyTXTpl plTp1 1 Vpll nyTXTpl 7 Pl 0 nyTXTpl Pl Hence p1 is the largest eigenvector of nyTXT Maya Hristakeva etai ucsc PCA Vs PLS May 13 2009 17 20 Partiai Least Sq Mares Finding the remaining k 1 PLS directions Since nyTXT is a rank 1 matrix an additional orthogonality constraints is used to find the remaining k 7 1 PLS projection directions arg max piTnyTXTp Pi 5t piTp 1 and piTXXTpJ 0 forl j lt 139 Maya Hristakeva eta ucsc PCA Vs PLS May 13 2009 18 20 Partiai Least Sq Mares PLS Regression PLS Regression E 0 Use PLS to find a projection directions p maxcovXTp y2 p 5t piTp 1 and piTXXTpJ 0 a Minimize square loss argmin i TXTW Vii2 0 Solution w PTXXTP 1PTXy for I5 p1 pk Maya Hristakeva et ai UCSC PCA vs PLS PLS Decomposition Linear Regression for1 jlti May 13 2009 19 20 Theory and Agplications of Boosting Rob Schapire Princeton University Example How May I Help You Gorin et al 0 goal automatically categorize type of call requested by phone customer Collect CallingCard PersonToPerson etc 0 yes I d like to place a collect call long distance please Collect 0 operator I need to make a call but I need to bill it to my office ThirdNumber 0 yes I d like to place a call on my master card please CallingCard 0 I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill BillingCredit Example How May I Help You Gorin et al 0 goal automatically categorize type of call requested by phone customer Collect CallingCard PersonToPerson etc 0 yes I d like to place a collect call long distance please Collect 0 operator I need to make a call but I need to bill it to my office ThirdNumber 0 yes I d like to place a call on my master card please CallingCard 0 I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill BillingCredit o observation 0 easy to find rules of thumb that are often correct 0 eg IF card39 occurs in utterance THEN predict CallingCard39 0 hard to find single highly accurate prediction rule The BoostingApproach devise computer program for deriving rough rules of thumb apply procedure to subset of examples obtain rule of thumb apply to 2nd subset of examples obtain 2nd rule of thumb repeat T times Details 0 how to choose examples on each round 0 concentrate on hardest examples those most often misclassified by previous rules of thumb 0 how to combine rules of thumb into single prediction rule 0 take weighted majority vote of rules of thumb Boosting 0 boosting general method of converting rough rules of thumb into highly accurate prediction rule 0 technically o assume given weak learning algorithm that can consistently find classifiers rules of thumb at least slightly better than random say accuracy 2 55 in two class setting 0 given sufficient data a boosting algorithm can provany construct single classifier with very high accuracy say 99 Outline of Tutorial 0 brief background 0 basic algorithm and core theory 0 other ways of understanding boosting 0 experiments applications and extensions Brief Background Strong and Weak Learnability o boosting39s roots are in PAC Valiant learning model 0 get random examples from unknown arbitrary distribution 0 strong PAC learning algorithm 0 for any distribution with high probability given polynomially many examples and polynomial time can find classifier with arbitrarily small generalization error 0 weak PAC learning algorithm 0 same but generalization error only needs to be slightly better than random guessing 7 39y 0 Kearns 84 Valiant 3988 0 does weak learnability imply strong learnability Early Boosting gorith ms 0 Schapire 3989 0 first provable boosting algorithm 0 Freund 3990 o optimal algorithm that boosts by majority O Drucken Schapire 84 Simard 3992 0 first experiments using boosting 0 limited by practical drawbacks AdaBoost o Freund 84 Schapire 3995 0 introduced AdaBoost algorithm 0 strong practical advantages over previous boosting algorithms 0 experiments and applications using AdaBoost Drucker 31 Cortes 96 Abnev Schapire 31 Singer 99 Tieu 31 Viola 00 Jackson 31 Craven 96 Haruno Shirai 31 Ooyama 99 alker Rambow 31 R0 1 Freund 31 Schapire 96 Conen 31 Singer Roc e Schapire Rannn upta 01 Quinlan 96 Dietteri n 00 M rler Furlanello LarcherrKL Sboner 01 Brei 96 Schapire 31 Singer 00 Di F bbrizio Dutton Supt l 02 Maclin Op Coll 00 u am Yasui et al 02 Ban r 31 Kohavi 97 Escudero Marquez 31 Rigau 00 Tim Schapire 31 HakkanirTur 03 Son nllt 31 Bengio 98 lyer Lewis Schapire et al 00 Viola 31 Jones 04 Schapire Singer 31 Singhal 98 Onoda Ratsch 31 Muller 00 Middendorf Kunda Wiggins et al 04 o continuing development of theory and algorithms Brennan 98 99 Duffy 1 Helmbold 99 02 Koirenrnsrn Panchenko 1 Lozano 01 Senaprre Freund Bartlett 1 Lee 98 Freund 1 Mason 99 Coiirns Schapire 1 Singer 02 Grove 1 Schuurmans 98 Ridgeway Madigan 1 Richardson 99 Demiriz Bennett 1 snaweTayior 02 son arLeLL 1 Baxter 98 Kivin armuth 99 Lebanon 1 Lafferty 02 Schapire 1 Singer 9 Fri dman Hastie 1 Tibshirani 00 Wyner 02 Cohen 1 Si ger 99 Rarscn Onoda 1 Muller 00 Rudin Daubechies 1 Schapire 03 Freund 1 Mason arson Warnnurn Mlka et al 00 Jiang 04 Domingo 1 Watanabe 99 Allwein Schapire 1 Singer 00 Lugosi 1 Vayatis 04 ason Baxter Bartlett 1 Frean 99 Friedman 01 Zhang 04 Basic Algorithm and Core Theorv 0 introduction to AdaBoost 0 analysis of training error 0 analysis of test error based on margins theory A Formal Description of Boostingr 0 given training set X17y1xmym o y E 717 1 correct label of instance X E X A Formal Description of Boostingr 0 given training set X17y1xmym o y E 717 1 correct label of instance X E X o fort1T 0 construct distribution Dt on 17 7 m 0 find weak classifier rule of thumb ht X H 717 1 with small error at on Dt 6 PriNDtlhtOQ 3 Yil A Formal Description of Boostingr 0 given training set X17y1xmym o y E 717 1 correct label of instance X E X o fort1T 0 construct distribution Dt on 17 7 m 0 find weak classifier rule of thumb ht X H 717 1 with small error at on Dt 6 PriNDtlhtXi 75 Yil 0 output final classifier H nal Ada Boost with Freund 0 constructing Dr 0 Dli 1m Ada Boost with Freund 0 constructing Dr 0 Dli 1m 0 given Dr and h Dt1i Zt eat if y 7 htXi 0271 eXP 04t yi htXi DtU X 9 0 if yi htXi where Zr normalization constant 76 atln it gt0 2 6t Ada Boost with Freund 0 constructing Dr 0 Dli 1m 0 given Dr and h mm mwwmm Dt1 Zt X em if 0 7 htlxi DtU T exPl OZt Yi htXi where Zr normalization constant 7 St at ln 7 gt O 6t 0 final classifier o H m x sign athtxgt Toy Exampe weak classifiers vertical or horizontal half planes g14x30 11042 Round 2 Round 3 Analy g the training error 0 Theorem 0 write at as 12 7 yr Analy g the training error 0 Theorem 0 write at as 12 7 yr 0 then training err0r H nal H 2m exp 722th Analy g the training error 0 Theorem 0 write at as 12 7 yr 0 then training err0r H nal 0 so ith yt2ygt0 H 2m exp 722 then training err0rH ml e Z YgT o AdaBoost is adaptive 0 does not need to know 39y or T a priori o can exploit yr gt y Proof let fx Zathtx H mmx signfx 0 Step 1 unwrapping recurrence exp Yi Zathtxigt E Ht Z l eXP Yi xi Hzt t 3 Proof cont b 0 Step 2 training err0rH ml H Zr 15 Proof cont b 0 Step 2 training errorH ml H Zr 15 0 Proof training errorH ml 1 ISyei 7g H nalxi m Proof cont b 0 Step 2 training errorH ml H Zr 15 0 Proof training errorH ml 1 1 if Yi 7 H nalxi 0 else 1 1 if yfX O 0 else Proof cont b 0 Step 2 training errorH ml H Zr 15 0 Proof training errorH ml 1 ISyei 7g H nalxi m 7 l 1 if yfx S 0 7 m 0 else 3 Z 7 f m i exp y XI A Proof cont b 0 Step 2 training errorH ml H Zr 15 0 Proof training errorH ml 1 ISyei 7g H nalxi m 1 HIMax 0 gzexpw xa D mlo zt A Proof cont b 0 Step 2 training errorH ml H Zr 15 0 Proof training errorH ml 5 3 gzme if Yi 7 H nalxi else yifX39 0 else W00 2 D mlo zt HZ t Proof cont b 0 Step 3 Zt2 Etliet Proof cont b 0 Step 3 Zt2 Etliet 0 Proof zt ZDtexp04t i htXi 2 Draw 2 04053 iJVhtX ihtX 6t eat 17 at 9 0 2xet17 6t How Will Test Error Behave A First Guess 40 60 80 100 i 20 of rounds I expect 0 training error to continue to drop or reach zero 0 test error to increase when H nal becomes too complex 0 Occam39s razor o overfitting 0 hard to know when to stop training Actual Typica Run boosting C45 on letter dataset 10 100 1000 of rounds m 0 test error does not increase even after 1000 rounds o total size gt 2000000 nodes 0 test error continues to drop even after training error is zero rounds train error test error o Occam39s razor wroneg predicts simpler rule is better A Better Story The Margins Explanation with Freund Bartlett 84 Lee 0 key idea 0 training error only measures whether classifications are right or wrong 0 should also consider confidence of classifications A Better Story The Margins Explanation with Freund Bartlett 84 Lee 0 key idea 0 training error only measures whether classifications are right or wrong 0 should also consider confidence of classifications 0 recall H nal is weighted majority vote of weak classifiers A Better Story The Margins Explanation with Freund Bartlett 84 Lee 0 key idea 0 training error only measures whether classifications are right or wrong 0 should also consider confidence of classifications 0 recall H nal is weighted majority vote of weak classifiers 0 measure confidence by margin strength of the vote fraction voting correctly 7 fraction voting incorrectly high conf high conf inIorrect 10W conf coiect Hfinal Hfinal 1 incorrect 0 correct 1 Empirical Evidence The Margin Distribution 0 margin distribution cumulative distribution of margins of training examples 20 g 10 392 15 39E Em E 05 a E 5 test E 0 min 5 10 100 1000 A 705 of rounds 739 margin rounds 5 100 1000 train error 00 00 00 test error 84 33 31 margins 05 77 00 00 minimum margin 014 052 055 Theoretical Evidence Analvzing Boosting Using Margins 0 Theorem large margins better bound on generalization error independent of number of rounds Theoretical Evidence Analvzing Boosting Using Margins 0 Theorem large margins better bound on generalization error independent of number of rounds o proof idea if all margins are large then can approximate final classifier by a much smaller classifier just as polls can predict not too close election Theoretical Evidence Analvzing Boosting Using Margins 0 Theorem large margins better bound on generalization error independent of number of rounds o proof idea if all margins are large then can approximate final classifier by a much smaller classifier just as polls can predict not too close election 0 Theorem boosting tends to increase margins of training examples given weak learning assumption Theoretical Evidence Analvzing Boosting Using Margins 0 Theorem large margins better bound on generalization error independent of number of rounds o proof idea if all margins are large then can approximate final classifier by a much smaller classifier just as polls can predict not too close election 0 Theorem boosting tends to increase margins of training examples given weak learning assumption 0 proof idea similar to training error proof Theoretical Evidence Analvzing Boosting Using Margins 0 Theorem large margins better bound on generalization error independent of number of rounds o proof idea if all margins are large then can approximate final classifier by a much smaller classifier just as polls can predict not too close election Theorem boosting tends to increase margins of training examples given weak learning assumption 0 proof idea similar to training error proof 0 so although final classifier is getting larger margins are likely to be increasing so final classifier actually getting close to a simpler classifier driving down the test error More Technically 0 with high probability V0 gt 0 A Md generalization error Prmargin 0 l O pr empirical probability 0 bound depends on o m training examples 0 d complexity of weak classifiers v entire distribution of margins of training examples 0 15rmargin 0 gt O exponentially fast in T if error of ht on Dr lt 12 7 t9 Vt 0 so if weak learning assumption holds then all examples will quickly have large margins Other Ways of Understanding AdaBoost 0 game theory 0 loss minimization o estimating conditional probabilities Game Theory 0 game defined by matrix M l Rock Pa per Scissors Rock 12 Paper 0 12 1 Scissors 1 O 12 0 row player chooses row i 0 column player chooses columnj simultaneously 0 row player39s goal minimize loss Mij Game Theory 0 game defined by matrix M l Rock Pa per Scissors Rock 12 Paper 0 12 1 Scissors 1 O 12 row player chooses row 139 column player chooses column j simultaneously row player39s goal minimize loss Mij usually allow randomized play 0 players choose distributions P and Q over rows and columns 0 learner39s expected loss ZPUWUMU PTMQ E MPQ The Minmax Theorem 0 von Neumann39s minmax theorem 0 in words 0V v min maXMPQ maxmin MPQ P Q Q P V value of game M min max means row player has strategy P such that V column strategy Q loss MPQ v max min means this is optimal in sense that column player has strategy 0 such that V row strategy P loss MPQ 2 v The Boosting Game 0 let g17 7gN space of all weak classifiers 0 row player lt gt booster 0 column player lt gt weak learner 0 matrix M 0 row lt gt example Xiy 0 column lt gt weak classifier gj MUM 1 if y gjX39 0 else weak learner g1 g quotquotquot quot gA x1y1 J E g X y M id 0 3 D mem Boostingr and the Minmax Theorem 0 if 0 V distributions over examples 3h with accuracy 2 39y 0 then 0 min maXMPj 2 l 39y P j o by minmax theorem 0 m axmiin Mi7 Q 2 39y gt 0 which means 0 3 weighted majority of classifiers which correctly classifies all examples with positive margin 2y 0 optimal margin lt gt value of game AdaBoost and Game Theory with Freund 0 AdaBoost is special case of general algorithm for solving games through repeated play 0 can show 0 distribution over examples converges to approximate minmax strategy for boosting game 0 weights on weak classifiers converge to approximate maxmin strategy 0 different instantiation of gameplaying algorithm gives on line learning algorithms such as weighted majority algorithm AdaBoost and Exponential Loss 0 many most learning algorithms minimize a loss function 0 eg least squares regression 0 training error proof shows AdaBoost actually minimizes Hzt gzexpieyifixo where fx Zath x t 0 on each round AdaBoost greedily chooses at and ht to minimize loss AdaBoost and Exponential Loss many most learning algorithms minimize a loss function 0 eg least squares regression training error proof shows AdaBoost actually minimizes Hzt gzexpieyifixo where fx Zath x t on each round AdaBoost greedily chooses at and ht to minimize loss exponential loss is an upper bound on 0 1 classification loss AdaBoost provany minimizes exponential loss yfbc Coordinate Descent Breiman 0 g17 7gN space of all weak classifiers 0 want to find A1N to minimize L1 N Zexp YiZAjgjoq i j Coordinate Descent Breiman 0 g17 7gm space of all weak classifiers 0 want to find A1N to minimize L1N Zexp iinAnglXi i j o AdaBoost is actually doing coordinate descent on this optimization problem 0 initially all A O 0 each round choose one coordinate A corresponding to h and update increment by at 0 choose update causing biggest decrease in loss 0 powerful technique for minimizing over huge space of functions Functional Gradient Descent FriedmanMason et al 0 want to minimize LU LfX17 7 fXm ZeXPPyIHXID Functional Gradient Descent FriedmanMason et al 0 want to minimize LU LfX17 7 fXm ZeXPPyIHXID 0 say have current estimate f and want to improve 0 to do gradient descent would like update f a f 7 anLf Functional Gradient Descent FriedmanMason et al 0 want to minimize LU LfX17 7 fXm ZeXPPyIHXID 0 say have current estimate f and want to improve 0 to do gradient descent would like update f a f 7 anLf 0 but update restricted in class of weak classifiers foozht Functional Gradient Descent FriedmanMason et al want to minimize LU LfX17 7 fXm ZeXPFyIHXID say have current estimate f and want to improve to do gradient descent would like update f a f 7 anLf but update restricted in class of weak classifiers foozht so choose ht closest to 7VfLf equivalent to AdaBoost Benefits of Model Fitting View o immediate generalization to other loss functions 0 eg squared error for regression o eg logistic regression by only changing one line of AdaBoost o sensible approach for converting output of boosting into conditional probability estimates Benefits of Model Fitting View 0 immediate generalization to other loss functions 0 eg squared error for regression o eg logistic regression by only changing one line of AdaBoost o sensible approach for converting output of boosting into conditional probability estimates 0 caveat wrong to view AdaBoost as just an algorithm for minimizing exponential loss 0 other algorithms for minimizing same loss will provably give very poor performance 0 thus this loss function cannot explain why AdaBoost Estimatingr Conditional Probabilities Friedman Hastie 84 Tibshirani 0 often want to estimate probability that y 1 given X o AdaBoost minimizes empirical version of EXJ e ywo EX P y 1lx 940 l P y illx goo where Xy random from true distribution Estimatingr Conditional Probabilities Friedman Hastie 84 Tibshirani 0 often want to estimate probability that y 1 given X o AdaBoost minimizes empirical version of le mxl Ex P y 1lX W P Ly aux eml where Xy random from true distribution 0 overall 1 minimized when fx gm H l Ply1lxlm 0 so to convert f output by AdaBoost to probability estimate use same formula Calibration Curve x0 a 08 x est 1 0 train v 12 4 t 37m 06 t x e o 04 02 th 0 g 0 0 02 04 06 08 1 0 order examples by f value output by AdaBoost 0 break into bins of size r o for each bin plot a point 0 X value average estimated probability of examples in bin 0 y value actual fraction of positive examples in bin Other Ways to Think about AdaBoost 0 dynamical systems 0 statistical consistency 0 maximum entropy QperimentsAgpications and Extensions basic experiments multiclass classification confidence rated predictions text categorization spoken dialogue systems incorporating prior knowledge active learning face detection Practical Advantages of Ada Boost fast simple and easy to program no parameters to tune except T flexible can combine with any learning algorithm no prior knowledge needed about weak learner provany effective provided can consistently find rough rules of thumb g shift in mind set goal now is merely to find classifiers barely better than random guessing versatile 0 can use with data that is textual numeric discrete etc 0 has been extended to learning problems well beyond binary classification Caveats 0 performance of AdaBoost depends on data and weak learner 0 consistent with theory AdaBoost can fail if 0 weak classifiers too complex gt overfitting 0 weak classifiers too weak 39yt gt 0 too quickly gt underfitting gt low margins gt overfitting 0 empirically AdaBoost seems especially susceptible to uniform nOIse UCI Experiments with Freund 0 tested AdaBoost on UCI benchmarks 0 used 0 C45 Quinlan39s decision tree algorithm 0 decision stumps very simple rules of thumb that test on single attributes eye color brown height gt 5 feet UCI Results K 0 51015202530 0 51015202530 5 0 o boosting Stumps boosting C45 Multiclass Problems with Freund o sayye Y1k 0 direct approach AdaBoostMl thaY Dt1 Dti 9 0 if y htX39 Zr 9 if y 7 htX39 H mxarggweayx at thtxy Multiclass Problems with Freund o sayye Y1k 0 direct approach AdaBoostMl thaY D Dtm if y mix 10 zt em if y y htx H nal X arg max at yethZXy 0 can prove same bound on error if Vt at g 12 0 in practice not usually a problem for strong weak learners eg C45 0 significant problem for weak weak learners eg decision stumps 0 instead reduce to binary Reducing Multiclass to Binary with Singer 0 say possible labels are a7b7c7d7e 0 each training example replaced by five 17 1 labeled examples X78 7 1 X7b 7 1 X 7 c gt X7C 1 X7d 7 1 X76 7 1 o predict with label receiving most weighted votes W can prove i i k trammg err0rH ml g E H Zr 0 reflects fact that small number of errors in binary predictors can cause overall prediction to be incorrect 0 extends immediately to multi label case more than one correct label per example Using Output Codes with Allwein 84 SingerDietterich 84 Bakiri 0 alternative choose code word for each label W4 3 7 7 b 7 7 c 7 7 d 7 e 7 7 7 0 each training example mapped to one example per column X77r1 7 1 X77r2 7 x 7 c a EMS 7 71 X77r4 7 1 0 to classify new example X o evaluate classifier on X7 17 7 X7 4 0 choose label most consistent with results put Codes icont 0 training error bounds independent of of classes 0 overall prediction robust to large number of errors in binary predictors 0 but binary problems may be harder Ranking Problems with Freund lyer 84 Singer other problems can also be handled by reducing to binary eg want to learn to rank objects say movies from examples can reduce to multiple binary questions of form quotis or is not object A preferred to object Bquot now apply binary AdaBoost Hard Predictions Can Slow Learningr 0 ideally want weak classifier that says hX 7 1 ifx above L 7 quotdon39t knowquot else Hard Predictions Can Slow Learningr e 0 ideally want weak classifier that says hX 7 1 ifx above L 7 quotdon39t knowquot else 0 problem cannot express using hard predictions 0 if must predict i1 below L will introduce many bad predictions 0 need to clean up on later rounds o dramatically increases time to convergence Confidence rated Predictions with Singer 0 useful to allow weak classifiers to assign confidences to predictions 0 formally allow ht X a R signhtx prediction lhtXl confidence Confidence rated Predictions with Singer 0 useful to allow weak classifiers to assign confidences to predictions 0 formally allow ht X a R signhtx prediction lhtXl confidence 0 use identical update Dt1 thltl 39eXp O t Yi htXi and identical rule for combining weak classifiers 0 question how to choose at and ht on each round Confidence rated Predictions cont 0 saw earlier 1 training errorH ml H Zr E 2 exp lt7 athtXgt t i t 0 therefore on each round 1 should choose atht to minimize Zr exp704t yi htXi o in many cases eg decision stumps best confidencerated weak classifier has simple form that can be found efficiently Confidence rated Predictions Help a Lot 1 10 100 1000 Number of rounds round first reached error 10000 conf i no conf speedup 40 268 16938 632 35 598 65292 1092 30 1888 gt80000 Application Boosting for Text Categorization with Singer 0 weak classifiers very simple weak classifiers that test on simple patterns namely sparse n grams 0 find parameter at and rule ht of given form which minimize Zr 0 use efficiently implemented exhaustive search 0 How may I help you data 0 7844 training examples 0 1000 test examples 0 categories AreaCode AttService BillingCredit CallingCard Collect Competitor DialForMe Directory HowToDial PersonToPerson Rate ThirdNumber Time TimeCharge Other Weak Classifiers rnd term AC AS BC CC CO CM DM DI HO PP RA 3N TI TC OT 1 collect l I l I I I I I I I I I I I I I I I 39 I l 39 I I I I I I I I 2 card I I I l I I I I I I I I I I I I I I I I I I I I 3 my home I I l l l I I I I I l I 4 person 7 person I I I I I I I I I I I I I I 5 code 1 I I I I I I I I I I I I I I I I I I I I 6 I I I I I I I I I I I I I I I I I I More Weak Classifiers rnd term AC AS BC CC CO CM DM DI HO PP RA 3N TI TC OT 7 time I I I I I I I I I I I l l 7 7 7 7 7 7 7 7 7 7 7 7 7 I I 7 8 wrong number I I l I I I I I I I I I I I I 9 how I I I I n I l I l I I n 10 call I I 7 7 I 7 I 7 7 7 7 11 seven I I I I I n I 12 trying to 7 I I I I I I I I I 7 13 and I I I 7 7 I 7 I More Weak Classifiers rnd term AC AS BC CC CO CM DM DI HO PP RA 3N TI TC OT 14 third I I I I r l I I I l I I 7 15 to 7 7 7 7 7 7 7 7 7 7 7 7 7 I I 16 for 7 7 7 I I 7 I 17 charges I l I I I l I 18 dial I I I I I I 19 just I n I I I 1 7 Finding Outliers examples with most weight are often outliers mislabeled andor ambiguous I m trying to make a credit card call Collect hello Rate yes I d like to make a long distance collect call please CallingCard calling card please Collect yeah I d like to use my calling card number can I get a collect call CallingCard yes I would like to make a long distant telephone call and have the charges billed to another number CallingCard DialForMe yeah I can not stand it this morning I did oversea call is so bad BillingCredit yeah special offers going on for long distance AttService Rate mister allen please William allen PersonToPerson yes ma am I I m trying to make a long distance call to a non dialable point in san miguel philippines AttService Other Collect Application Human computer Spoken Dialogue with Rahim Di Fabbrizio Dutton Gupta Hollister 84 Riccardi 0 application automatic storefront or help desk for ATampT Labs39 Natural Voices business 0 caller can request demo pricing information technical support sales agent etc o interactive dialogue How It Works Human computer raw Speech tterance text to speech automatic speech recognizer text response fem dialogue manaer natural lang age 39 predicted understanding category 0 NLU39s job classify caller utterances into 24 categories demo sales rep pricing info yes no etc 0 weak classifiers test for presence of word or phrase Need for Prior Human Knowledge with Rochery Rahim 84 Gupta 0 building NLU standard text categorization problem 0 need lots of data but for cheap rapid deployment can39t wait for it 0 bootstrapping problem 0 need labeled data to deploy 0 need to deploy to get labeled data I I J o idea use human t0x for in IIFFirient data 0 modify loss function to balance fit to data against fit to prior model Results AP Titles dataknowedge 4o nowledge only 7 7o 7 data only error rate 100 1000 10000 training examples Results Helpdesk Problem Labels are Expensive o for spoken dialogue task 0 getting examples is cheap 0 getting labels is expensive must be annotated by humans 0 how to reduce number of labels needed Active Learning 0 idea 0 use selective sampling to choose which examples to label 0 focus on least confident examples Lewis amp Gale 0 for boosting use absolute margin lfxl as natural confidence measure Abe 84 Mamitsuka Labeling Scheme 0 start with pool of unlabeled examples 0 choose say 500 examples at random for labeling 0 run boosting on all labeled examples 0 get combined classifier f 0 pick say 250 additional examples from pool for labeling 0 choose examples with minimum lfxl 0 repeat Results How Mav I HeIp You 34 32 g 30 2 28 26 240 5000 10000 15000 20000 25000 30000 35000 40000 1abe1ed examp es first reached label error random i active savings 28 11000 5500 50 26 22000 9500 57 25 40000 13000 68 Results Letter An error rate 0 0 2000 4000 6000 8000 10000 12000 14000 16000 iabeied exampies first reached label error random i active savings 10 3500 1500 57 5 9000 2750 69 4 13000 3500 73 Application Detecting Faces Viola 84 Jones 0 problem find faces in photograph or movie 0 weak classifiers detect lightdark rectangles in image 0 many clever tricks to make extremely fast and accurate Conclusions 0 boosting is a practical tool for classification and other learning problems 0 grounded in rich theory 0 performs well experimentally 0 often but not always resistant to overfitting 0 many applications and extensions 0 many ways to think about boosting 0 none is entirely satisfactory by itself but each useful in its own way 0 considerable room for further theoretical and experimental work An Adaptive Disk Spindown Algorithm Corrie Scalisi University of Cali39fOrnia Santa Cruz April 16 2007 0 Introduction D Introduction 0 Algorithm for deciding when to spin down a disk to save power 0 Based on Helmbold et al MONET 2000 0 Uses a Mixture of Experts and Weighted Majority Approach 0 See Littlestone and Warmuth 1994 Inimmchun 9 Datas ets OnIine Data 0 Temporally correlated 0 Predictable given history 0 Randomly permute an online dataset to break these correlations o Significantvisual difference between original data and permutation a Our algorithm must perform considerably better on the original data than it does on a random permutation Cello2 Dataset First 3000 iterations of the Cello2 Dataset IU 7 15 8 8 5 X XX Xx XX X I XW x X W X c XV g X XX X XXka xxx 2 55 g 0 X 1 2 X xx 2 XXX X 2 XXX X XX X X X X13 XXXXX X X 44 X X X x x X X xgt lt Xx X r X XXXX X XX Hm X XX X g XX X XX XXX XXMXX X XXX X X X X XX X XX XX X X X X X 39 1U X 3g Kw Xw 0 500 1000 1500 2000 2500 3000 A w m 1 c 8 5 X X X X X g X X XX X Xx XXX XXX X XX X X XXXXXXXX XX X XX XX X XXXX X XXXX XXX XX X C X XX XXX xXXX XXXXMXX XXYXXXXXXX X XX X as X XX XXXXXXquot quot 3 XXX X Xx X X X XX XX X g 0 x kX XXXX XXXXXXX X X XXX XXXXXX XX Xx X XX X 9 2 9 w c 2 Adaptive Disk 39 Algorithm Intel Dataset First 3000 Iterations of the Intel Dataset 7 l l l logelDuration in Seconds i l i l 0 500 1000 1500 2000 2500 3000 Trial 0 Already looks like a randomized permutation Corrie Scalisi UCSC An Adaptive Disk Spindown Algorithm April 16 2007 730 Inimmchun D901 e Experts Sna 39nnig AMOMEHH Our algorithms use a set of experts which each specify a timeout value The optimal set of experts for a dataset always includes 0 and actual idle durations within the dataset Always include 0 and spindown cost in set of experts rrtrtt tr 0 Spin down cost Choosing idle durations in dataset as timeouts minimizes the difference between the loss and the optimal Experts 5n mum mummy mam m an expanenha v waned he ween u and m mew mmm m 2mm quot0quot m 5 2m 25 an mm M expth Inimmchun Energy used by timeout idle time it idle time g timeout Losstlmeout tImeout splndown cost It Idle tIme gt tImeout Energy used by optimal idle time it idle time 3 spindown cost spindown cost it idle time gt spindown cost Lossoptimal Energy Used By Time Fixed Timeo may Used bv Valan Mewquot W mm mm mme Spmdawn Ea l mm m Wm Energy used by Varying idietime with FixEdTimEmA Spinduwn Cusl Energy m sea ldieTime If t2 gt t1 then t2 has smaller loss in interval t1 t2 If there is high certainty that the idle period is about to end then the disk should be kept running I l Ene y Used By Timeout Fixed Idle Time Energy Used by Varying Timeout With Fixed Idle Time timeout gt idletimequot Spindown Cost Energy in Sec 0 idletime Timeout o If timeout lt idletime Loss Spindown Cost Timeout o If timeout gt idletime don t spin down Loss idletime Corrie Scalisi UCSC An Adaptive Disk Spindown Algorithm April 16 2007 1530 Inimmchun D mg Mgplt 39hil uii r BestShiftK 0 Dynamic programming 0 Illustrates how adaptively changing the timeout value lowers the amount of energy used a Maintains a table B holding minimum Loss values a Run on a dataset and a random permutation in order to capture whether the data is quotonlinequot Be SMMKSmdeA omhm I trial number k max number of times the timeout value can change 0 Dynamic Programming algorithm 0 Maintain loss for each I k pair Bt k 0 Maintain last expert used Et k LossWithoutShitt 317 1k Losst Et k LossWithShitt anBti mk71 Losst Et 7 m k I77 mimi o tqk21 7 i minLossWithoutShittLossWithShitt tgt0kgt1 0 Running Time 0T2K BestShiftK Improved Algorithm t trial number k max number of times the timeout value can change i the index into a list of timeout values 0 iff0k21 BUM Btet1iLossti itk1tgt0 7 7 minBti1kiLossti minB ihki1Lossti ittgt0kgt1 0 We need to consider TKN table entries 0 Create M a KN sized table for each ttrom O to T BestShiftK 0 To calculate entries in M1 we will refer to entries in MP1 0 On each k n combination we must consider 0 a The entry k n of MH the minimum cost of using a partition that includes expert n on the 1 trial 0 b The minimal entry in the k 71 row of MM the cost of using a partition that includes a different expert on the If 1 trial a The k7 n entry of M is the minimum of a and b MM loss trial H M loss on trial 1 An AdaIZIIIVe Disk Splndown Algorithm BestShift Running Time 0 Note minBUi 1 k7 1j Losst i minBUi 1 k 71j Losst i a If we compute minBUi 17k 7 17 for each Inf the running time is OTKNZ 0 However the minimum holds for all since it is the minimum over all j in the kii row of Mt1 0 Therefore it we maintain the minimum we can use the quantity for each entry in the k th row of M 0 Then the runtime is OTKN BestShift for Cello2 BestShiftK on Cello2 Data 50 experts exponentially spaced betm een 0 and 10 1 l l BestShiftK BestShiftK Randomized 03995 7 quot 39 Optimal 0n 085 i a d g g 08 a E E m 075 r 7 8 d l 07 a E lt1 065 a i 0 055 i a l l l l l 0 50 100 150 200 250 300 K of shifts of timeout value 2558785 trials of Cello2 Spindown Cost 10 sec 0 Takes advantage of temporal similarities in the original data 0 BestShift curve for randomized permutation of data is very flat because of lower temporal correlation Corrie Scalisi UCSC An Adaptive Disk Spindown Algorithm April 16 2007 2230 BestShift for Intel Dataset BestShif tK on Intel dataset 50 experts exponentially spaced between 0 and 10 l l BestShif tK 4 N l l Average cost per idle time b l l l l l 0 50 100 150 200 250 300 K of shifts of timeout value 12458 trials of Intel Data Spindown Cost 10 sec 0 Similar for original and permuted data because of low temporal correlationin both 0 Only 1751 idletimes greater than 10 Approaches optimal quickly by shifting at most costly idletimes r r Corrie Scalisi UCSC An Adaptive Disk Spindown Algorithm April 16 2007 2330 Imam mam Sharing Algorithm 0 Weight vector W W1 W2 W o Experts 1 1112 t 0 Make prediction based on weights of experts n timeout Z WT i1 0 Calculate loss for each expert 1 ener d b t39 LossI gy use y I splndown cost 0 Loss update reduce weights and then normalize W f 21 Wje nLoss 0 Share update redistribute weight VariableShare Update 0 Calculate pool of weight to share n weightToShare 7 Z oldWeightU 1 7 1 7 0540 1 a Distribute weight amongst experts weighti 7 oldWeightU 1 7 0550 WeightToShare o VariableShare update from Herbster and Warmuth 95 o Other Share Updates 0 Fixed Share to Start Vector Update Fixed Share to Past Average etc P9 0 mance of Adaptive Algorithm on Cello2 Energy Usage for Adaptive Disk Spin Down Algorithm mall modifications in or and n parameters All combinations ofn 1234 and or 00100501051 l 14 l l l l l a 1 2 r v 1 d g 2 E 08 21 8 06 m E lt2 O tImal 04 p Best Fixed Timeout 02 Sharing Algorithm with varying n or i l l l l l l l 0 2 4 6 8 10 12 14 16 18 Spindown cost in seconds 2558785 idle durations of Cello2 data 25 exponentially spaced experts between 0 and the spindown cost Corrie Scalisi Adaptive Disk Spindow Algorithm April 18 2007 2730 nce of Adaptive Algorithm on Intel Data mm Usage 1m Adamwe w Spm Dawn Ngamhm man mammm m a and n Palmem An cammnmm m mama a lt um ms nu us 0 mum can we mm Sh W mm W a a Spmdawn mu m ecand a ta 25 expanenha v waned mum he ween u and h Performance of Adaptive Algorithm on Cello2 Compared to Loss Curve BestShif tK and Share Algorithm on Cello2 dataset 25 experts exponentially spaced between 0 and 5 l BestShif tK Randomized Permutation 055 i i g 05 R C 8 E n1oc0001 n 3 or 01 E i 045 e e g quot BestFixed I g BestShif tK S E lt1 04 r quotquot39 Share Algorithm quotquotquot quot Optimal l l l l l l 50 100 150 200 250 300 K of shifts of timeout value 2558785 trials of Cello2 Spindown Cost 5 sec Corrie Scalisi UCSC An Adaptive Disk Spindown Algorithm April 18 2007 2930 Performance of Adaptive Algorithm on Intel data Compared to Loss Curve BestShif tK and Share Algorithm on Intel dataset 25 experts exponentially spaced between 0 and 5 i n4oc025 l 4 n 3 or 0005 F L11 l BestFixed BestShif tK Average cost per idle time in seconds BestShiftK Randomized Permutation quotquot39 Share Algorithm mim Optimal l l l l 0 50 100 150 200 250 300 K of shifts of timeout value 12458 trials of Intel Data Spindown Cost 5 sec Corrie Scalisi UCSC An Adaptive Disk Spindown Algorithm April 18 2007 3030 Polymorphism in 00 Languages What are the alternatives Kim Bruce Williams College e Uc Santa Cruz a Pomona College Outline 0 Historyof Generics inJava 0 Language Design Proposals for Generics 0 Relevant Type Issues 0 An Alternative Proposal 0 Pros amp Cons 0 Java wildcard types History 0 Java 10 announced in 1995 0 Generics considered but omitted from nal language 0 Proposals for adding generics published starting in 1997 or earlier 0 Javaegenerics mailing list set up in 1997 0 Generics included in Java 5 in 2004 Why Generics 0 Java collection classes de ned to hold Object 7 List Stack Queue Set SortedSet 0 No restrictions onwhat can be added 0 Casts required when remove elements 7 Slows execution and obscures code 7 Lose advantages ofstatic type system Example class Stack staclq vold push0bject nev op Object pom 5 spush HellOquot7 spush new Polnt1223 Strlng greetlng Strlngspop7 Lort information about what kind of valuex are in data rtrurture Proposalsfor Generics FBounded Polymorphism 0 Bounded Polymorphism introduced by Cardelli amp Wegner in 1985 0 F Bounded Polymorphism introduced in 1989 byJohn Mitchell ampAbel group at HP Labs 7 Overcome limitations of bounded polymorphism 0 Bounded polymorphism rst used in EilTel Parametric Polymorphism crass stackltrgt stack void pushT newTop T pop 1 stackltstringgt s spush Helloquot7 s s new rzointrz23 now i 11ega1 string greeting spop Safe and more e icient Bounded Polymorphism 0 Can restrict the kinds ofvalues allowed c1ass shirtListltr extends Movab1egt T hd ShlftLlstltTgt rest pub1ic void ShlftAll1nt x int y h novez3 i11ega1 Wout bound restsh1ftAllZ37 Problems With Expressiveness luterface Comparab1e oo1ean 1ess1han Comparab1e other c1ass SortedListltr extends Comparab1egt Listltrgt aList 1 current void insert1 newElt 1f newmt1ess1hancurrent Implementing Comparables ca1endar imp1ements Comparab1e int mo e r boo1ean ca1endarLess1han Ca1endar other return month lt othernonth H 7 pub1ic boo1ean 1ess1hanConparab1e other if other instanceo Ca1endar return ca endarLessThanCalendarother7 e1se raise new hadCa1Conparison Sbouldgmerade mm39c type error not exception Using Fbounded Polymorphism luterface FConparab1eltrgt boo1ean lessT an1 other c1ass SortedListltr extends FComparableltTgtgt List i t T current void insert1 newElt 1f newEltlessTha current Guarantear T 1m lessThan takingpummeter uftype T Implementing Comparables class Calendar rmplements FComparableltCalendargt t month day year boolean lessThanCalenolar other return month lt othermontn H 0 F Bozmded solves problems wit7 expressiveness 0 In Pizza G amdNextGen among otbers Poly and with clauses class SorteoLrstltTgt wrth T boolean lessThanT other LrstltTgt aLrst publrc vorol rnsertT newElt 1f newEltlessThancurrentH else m Equivalent to F boumled polymorphism but uses structural subtypes Virtual Classes Upes Virtual Classes Example class List typedef T as Object vorol rnsertT newElt 1m where l rnter ace Comparable olean lessThanThrs other Virtual Classes 0 Designed as part of Beta 0 Proposed to be added to Java byThorup and later with Torgersen 0 Idea is to allow covariant changes to types in subclasses Virtual Classes Example class ComparableLlst extends List typedef T as Comparable class SortableLrst typedef T as Com arable LrstT as ComparableLrst LrstT aLrst prrvate T current vorol rnsertT newElt 1f newEltlessThancurrent else More Expressiveness class subject typedef OType as Observer y e s Object OType obser ers vold notifyObserversEventType e r int 1 o r lt observerslength r observerslnotlfythlse7 class observe a bject edef EventType as Object vold notifySType subject EventType e Even More Expressiveness class Menusubject extends subject typedef ype as Menuobserver typ def EventTy e MenuEvent String getselectedltem class MenuObserver extends Observer type ef Type as Menusubjec typedef EventType as MenuEvent vold notifySType subject EventType e subjectgetselectedltem Concerns About Virtual Types 0 Beta originator ofVirtual Classes appears to be type safe but not statically typeesafe 0 Emits compiler Warnings Where maybe unsafe and inserts runetime checks 0 Beta has never been given a formal semantics so type system has never been proven correct Static Overloading VS Dynamic Dispatch 0 Dynamic dispatch object receiving message determines which code will be executed s Dm erminedatmnTime 0 Static overloading occurs when an ob39ect supports two or more implementations ofa message name s Dmrmz39ned at mmpilrtz39me Language Design Interlude Overloading vs Dynamic Dispatch class c m beelean eqc other m 1 class so extends c beelean eqc her 2 ct override boelean eqsc other m 3 overload c cc39 sc sc cnew C c39 new Sc scnew SC ceqc ceqc39 ceqsc c39eqc c39eqc39 seeqe sceqc39 c39eqsc sceqsc Subtypes 67 Upe Safety Issues Subtype Polymorphism IfS ltT avalue of type S can be 7 used for a parameter of type T r assigned to avariable of type T 7 Value uftype S can mmquemde 4r 4 value uftype T Value of expression can be from a subtype of its static type In Jm subclass is always a subtype TypeSafety Issues 0 What changes in subclasses are allowed Without breaking type safety 0 Java allows addition of new features elds or methods overriding old methods 0 BeforeJaVa 5 No changes allowed in method signature when override 0 Java 5 allows specialization ofreturn types Subtyping Functions S a T lt S a T iff S lt S andT lt T Contravan39ant for parameter types Covariant for result types Functions If fhas type S a T and s has type S then fs has type T leenixS T lt S39 T If f has type S a T then need f s to have typeT Variables Variables are dilTerent fromvalues Variables can be mpplierx amp receiverx ofvalues X X 1 IfX is avariable oftype T Write type as refT When is refT lt refT To replace XWith type refT by X oftype refT in 7 expression x Need T lt T e assignment x e where eT Need Tlt T Variables Supplier covariant Receiver contravariant ref T lt refT ilT T 2T Exercise Arrays o IfSltT isSltT Java says yes but not safe With few exceptions for F Types aTypes s lt T as FS lt Fcr Why Restrictions on Subclasses 0 Why restrict changing types in subclasses e Even if don t care if subclasses are subtypes 0 Methods are implicitly mutually recursive elass Example void m chisns end elass class subExample extends Example T39 nS39 x void newMechod H end class What if relutz39umbz p u f new rignature ufn In aid ifzemnt tn remain typemfe Need S aT ltSgtT Virtual Types Are Unsafe o Allow specialization of any type in a class when creating a su class 0 Adopting in Java would introduce safety and e cz39emy issues similar to arrays Is there a type safe way of achieving same expressiveness Tbi jpe andsefw zrenca MyType Selnype ThisClass 0 Provide a name for class of object executing method co e First appeared in TrellisOwl 0 Semantics as xed point byHPAbel group in discussion of static type safety in 00 languages 0 Use syntax of LOO an extension ofJava with r ThisClass for classes and r ThisType for interfaces Subclasses ciess DbleNode extends Node rhrsciess prev Dbieuoderhrsciess next rhrsciess prev supernext s rev prev vord setuextrhrsciess newuext supersetuext newNext newNextsetPrevth157 vord setprevrhrsciess newPrev th sp ev newPrev Typ eChecking 0 Can only send binary message to object if know its exact type a C 0 Ifo C and m U then om UCThisClass 0 Ifo Cthen om UCThisClass ThisClass If m Node then mcloneO Node If n Node then ncloneO Node msetNextm legal nsetNextm illegal LOO Example Bruce amp Faster ECOOP 04 class Node Thisclass next NodeThisclass next thisnext next hymnmethod void setNextThisclass newNext thisnext next Thisclass getNext return next Typ eSafety void breakitNode nodei Node node2 node1setNextnode2 breakitaNode bNode OK if aNode amp bNode both nodes breakitdble nd nd mt OK if dbleind doublylinked amp nd node C 39 to t quot W 39WL e break type ryxteml TypeChecking Classes 0 Typecheck modularly 0 Typecheck methods ofclass C under assumptions that hold in all extensions this ThisClass ThisClass extends C Can prove soundness ufstatic type system ECOOP 04 FBounded Not Necessary abstract class Comparable boolean lessrhanrhrsClass other class SortedLlstltT extends Comparablegt Ex tLrstltrgt aLrst T current vold lnsertT newElt 1f newEltlessThancurrent lse Implementing Comparables Calendar extends Comparable lnt month day y ar boolean lessrhan ThrsClass other return month lt othermonth H EMTjerk wing TbiICd 39 extend C calendar Comparing Proposalsfor Generics Comparing Solutions 0 F Bounded and with statements roughly equivalent 0 Compare ThisClass with F Bounded s Both statically type safe a Trade offThisClass amp types with complexity of recursive bounds Disadvantage of Fbounded polymorphism Subclasses and subinterfaces not satisfy same constraints as parent classes and interfaces interface lntcomp extends FComparableltIntCompgt boolean lessThanIntComp other interface Extlntcomp extends Intcomp ExtlntcompNOTmmchomparable lt Extlntcomp gt Disadvantages of FBounded Polymorphism SortedList can be instantiated with IntComp but not EXtIntComp Fibounded polymorphism not consistent with extension in Java Tutorial Lectures on MCMC Sujit Sahu a University of Southampton httpIwwwmathssotonacuklstafflsahul Utrecht August 2000 0 Introduction to MCMC especially for computation in Bayesian Statistics 0 Basic recipes and a sample of some techniques for getting started 0 No background in MCMC assumed 0 Not for experts aIn close association with Gareth Roberts Markov Chain Monte Carlo MCMC I Introduction Outline a Monte Carlo integration 0 Markov chains 0 MCMC Bayesian Inference Data Y realisation 3 Parameters latent variables 9 9192 ap Likelihood Ly0 Prior 7r00 Inference is based on the joint posterior Ly07To0 9393 fLy97ro9d9 0C Ly97io9 7Le Posterior oc Likelihood gtlt Prior Example 1 LetY1 Yn N Ni91 and W009 116 Posterior n 2 WW 0C XP 2ng 6 X 14302 6 2 oc oxp M 23 gtlt Things of interest to Bayesians o Posterior Mean 43i9y o Posterior Variance var9y o Credible interval ay for 9 st P7 ay lt 9 lt 095 Example 2 Data Y1 Yn are a random sample from Nu 02 Noninformative prior is 1 2 WOW 0lt 02 Joint posterior 21 mam oc 0 12 X exp Egg 2102 which is not of standard from
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'