# Machine Learning CMPS 242

UCSC


These 115 pages of class notes were uploaded by Dr. Elyssa Ratke on Monday, September 7, 2015. The notes belong to CMPS 242 at the University of California, Santa Cruz, taught by Staff in Fall. Since upload they have received 64 views. For similar materials see /class/182259/cmps-242-university-of-california-santa-cruz in Computer Science at the University of California, Santa Cruz.


Date Created: 09/07/15

## The Blessing and the Curse of the Multiplicative Updates

Manfred K. Warmuth, University of California, Santa Cruz / Google Mountain View, Oct 3, 2007. Thanks to David Ilstrup and Anindya Sen for helping with the slides.

**Multiplicative updates**

- Definition: weights are multiplied by nonnegative factors.
- Motivated by relative entropies as the regularizer.

**Big questions**

- Why are multiplicative updates so prevalent in nature?
- Are additive updates used?
- Are matrix versions of multiplicative updates used?

Blessing: speed. Curse: loss of variety. We will give three machine learning methods for preventing the curse.

**Machine learning vs. nature**

Typical multiplicative update in ML:

- A set of experts predicts something on a trial-by-trial basis.
- w_{t,i} is the belief in the i-th expert at trial t.
- Update: w_{t+1,i} = w_{t,i} e^{-η loss_{t,i}} / Z_t, where η is the learning rate and Z_t is a normalizer.
- Bayes' update is a special case.

In nature:

- Weights are concentrations.
- Next: an implementation of Bayes' rule in the tube.

**In vitro selection algorithm** (for finding an RNA strand that binds to a protein)

Make a tube of random RNA strands and a sheet coated with the protein; for t = 1 to 20 do:

1. Dip the sheet with the protein into the tube.
2. Pull out and wash off the RNA that stuck.
3. Multiply the washed-off RNA back to the original amount (normalization).

Modifications:

- Start with modifications of a good candidate RNA strand.
- Put a similar protein into the solution; now attaching must be more specific, i.e., selection for a better variant.
- This can't be done with a computer, because there are roughly 10^20 strands of RNA in a liter of soup.

**Basic scheme**

Start with a unit amount of random RNA. Loop:

1. Functional separation into Good RNA and Bad RNA.
2. Amplify the Good RNA back to a unit amount.

**Duplicating DNA with PCR**

- Invented by Kary Mullis, 1985.
- Short primers hybridize at the ends.
- Polymerase runs along the DNA strand and complements the bases.
- Ideally, all DNA strands are multiplied by a factor of 2.

**In vitro selection for proteins**

- DNA strands are chemically similar; RNA strands are more varied and can form folded structures.
- It is conjectured that the first enzymes were made of RNA.
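The selection loop above amounts to the multiplicative update w_i ← w_i x_i / (w·x). A minimal simulation (an illustrative sketch, not code from the slides) shows both the blessing and the curse:

```python
# Simulate the in-vitro selection update w_i <- w_i * x_i / (w . x),
# i.e. Bayes' rule iterated with the same likelihoods x_i.
def select(w, x, rounds):
    """w: strand concentrations; x: 'good' fraction per strand."""
    for _ in range(rounds):
        w = [wi * xi for wi, xi in zip(w, x)]
        z = sum(w)                       # normalization: amplify back to unit amount
        w = [wi / z for wi in w]
    return w

w0 = [1 / 3, 1 / 3, 1 / 3]               # three strands, equal concentrations
x = [0.9, 0.8, 0.5]                       # fraction of each strand that binds
w20 = select(w0, x, 20)
# Blessing: the best strand is amplified exponentially fast.
# Curse: after 20 rounds the other strands are essentially wiped out.
```

Since the ratio of any two weights evolves as (x_i / x_j)^t, the strand with the largest x_i dominates at an exponential rate: loss of variety.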
**From DNA to proteins**

- Letters: DNA (4) → RNA (4) → via the ribosome → proteins (20).
- Proteins are the materials of life: for any material property, a protein can be found.
- How to select for a protein with a certain functionality?

**In vitro selection for proteins**

Initialize a tube with tagged random protein strands (the tag of a protein is its RNA coding). Loop:

1. Functional separation.
2. Retrieve the tags from the good proteins.
3. Reproduce the tagged proteins from the retrieved tags.

Almost doable, but not many interesting/specific functional separations have been found, and high throughput is needed.

**Mathematical description**

- Name the unknown RNA strands 1, 2, …, n, where n ≈ 10^20 strands in 1 liter.
- w_i ≥ 0 is the fraction of RNA i; the contents of the tube are the vector w = (w_1, w_2, …, w_n). When the tube holds a unit amount, w_1 + w_2 + … + w_n = 1.
- x_i ∈ [0, 1] is the fraction of one unit of RNA i that is good; the protein is represented as the vector x = (x_1, x_2, …, x_n).
- The vectors w, x are unknown: blind computation.

**Strong assumption:** the fractions x are independent of the concentration vector w.

- Good RNA in tube w: w_1 x_1 + w_2 x_2 + … + w_n x_n = w·x.
- Bad RNA: w_1(1−x_1) + w_2(1−x_2) + … + w_n(1−x_n) = 1 − w·x.
- Amplification: the good part of RNA i, namely w_i x_i, is multiplied by a factor F; if precise, all good RNA is multiplied by the same factor F.
- Final tube at the end of the loop: F w_i x_i. Since the final tube has a unit amount of RNA, F (w·x) = 1, so F = 1/(w·x).
- Update in each loop: w_i ← w_i x_i / (w·x).

**Implementing Bayes' rule in RNA**

- w_i is the prior P(i); x_i ∈ [0,1] is the probability P(Good | i); w·x is the probability P(Good).
- w_i ← P(i) P(Good | i) / P(Good) is Bayes' rule: the posterior P(i | Good).

**Iterating Bayes' rule** with the same data likelihoods x_i: the concentrations evolve as w_i^{(t)} ∝ w_i (x_i)^t / normalization, where t acts as a speed parameter. The share of the largest x_i is always increasing; the share of the smallest is always decreasing.

**The best are too good**

- Multiplicative update: w_i ∝ w_i · factor_i.
- Blessing: the best get amplified exponentially fast.
- Curse: the best wipe out the others; loss of variety.

**A key problem**

- 3 strands of RNA; want to amplify the mixture while keeping the percentages unchanged, e.g. A: 15%, B: 60%, C: 25%.
- Iterative amplification will favor one (under the assumption that x is independent of the concentrations).
- Fix: make one long strand with the right percentages of A, B, C (A B B B C A B B …), amplify that same strand many times, then chop it into its constituents.
- The long strand functions as a chromosome: free-floating genes in the nucleus would compete, giving loss of variety. Coupling preserves diversity: uncoupled (under PCR), A and B compete; coupled, A and B cooperate.

**Questions**

- What updates to w represent an iteration of an in vitro selection algorithm?
- What updates are possible with blind computation?
- Must maintain nonnegative weights.

**Assumptions so far:** x is independent of w; w·x is measurable; the update is w_i ← w_i x_i / (w·x). A more tempered update: w_i ∝ w_i x_i^η with η < 1.

**Find a set of RNA strands that binds to a number of different proteins**

In trial t the functional separation is based on x_t; so far all x_t were the same. Example with five strands and two proteins:

| strand i            | 1    | 2     | 3    | 4    | 5    |
|---------------------|------|-------|------|------|------|
| x_{1,i}             | 0.9  | 0.8   | 0.96 | 0.2  | 0.04 |
| x_{2,i}             | 0.1  | 0.01  | 0.8  | 0.9  | 0.8  |
| average             | 0.5  | 0.405 | 0.88 | 0.55 | 0.42 |

Tube u = (0.5, 0, 0, 0.5, 0): then u·x_1 = 0.55 and u·x_2 = 0.50, so the two strands cooperatively solve the problem. Related to a disjunction of two variables.

**Key problem 2.** Starting with a 1-liter tube in which all strands appear with frequency 1/m, use PCR and functional separations to arrive at a tube such as (0.5, 0.5, 0, …, 0).

- Problem: overtrain with P_1 and P_2 ⇒ w_1 ≈ 1, w_2 ≈ 0; overtrain with P_4 and P_5 ⇒ w_1 ≈ 0, w_2 ≈ 1.
- Want blind computation: x_t and the initial w are unknown; need some kind of feedback in each trial.

**Fix:** if the good side is reasonably large, then don't update. Conservative update:

- IF w_{t−1}·x_t ≥ θ_t THEN w_t = w_{t−1}
- ELSE w_{t,i} ∝ w_{t−1,i} x_{t,i} (a nonnegative factor), where 0 ≤ θ_t < 1.

Concerns: how exactly can we measure w·x? Some RNA might be inactive.

**Statement of key problem 2.** Given a tube w_0 of RNA strands and proteins P_1, P_2, …, P_m with binding vectors x_1, x_2, …, x_m, suppose there is a tube u, made up of RNA strands that occur in w_0, such that Σ_i u_i = 1 and ∀t: u·x_t ≥ δ. Goal: find a tube w_T, made up of RNA strands that occur in w_0, such that ∀t: w_T·x_t ≥ θ, where θ < δ.
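The conservative update above (select only when the current tube does poorly on the presented protein) can be sketched as follows. The threshold value and the stopping behavior are illustrative assumptions; the binding fractions are taken from the five-strand table in the slides.

```python
# Hedged sketch of the "conservative update": multiplicative selection is
# applied only when w . x_t falls below a threshold, preventing overtraining.
def conservative_step(w, x, theta):
    good = sum(wi * xi for wi, xi in zip(w, x))   # measurable quantity w . x
    if good >= theta:
        return w                                  # tube already good enough: no update
    w = [wi * xi for wi, xi in zip(w, x)]         # selection step
    z = sum(w)
    return [wi / z for wi in w]                   # amplify back to unit amount

# Five strands; two proteins with the binding fractions from the slides' table.
x1 = [0.9, 0.8, 0.96, 0.2, 0.04]
x2 = [0.1, 0.01, 0.8, 0.9, 0.8]
w = [0.2] * 5                                     # uniform initial tube
for _ in range(30):                               # cycle over the two proteins
    w = conservative_step(w, x1, theta=0.6)
    w = conservative_step(w, x2, theta=0.6)
# Updates stop once both w . x1 and w . x2 reach theta, so no protein's
# selection can wipe out the strands needed by the other.
```

Here the update settles on strand 3 (which binds both proteins well) and leaves a stable tube that serves both proteins, rather than oscillating between overtrained tubes.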
**Related machine learning problem**

- Tube ≡ weight vector w in the probability simplex over N strands.
- Proteins ≡ example vectors x_t ∈ {0, 1}^N.
- Target u = (1/k, …, 1/k, 0, …, 0): k nonzero components of value 1/k, such that ∀t: u·x_t ≥ 1/k.
- Find w such that ∀t: w·x_t > 1/(2k).

**Normalized Winnow algorithm** (N. Littlestone)

Do passes over the examples x_t. Whenever w·x_t ≤ 1/(2k):

- w_i ← w_i if x_{t,i} = 0; w_i ← α w_i if x_{t,i} = 1, where α > 1; then renormalize.

Properties:

- On the order of k log N updates are needed when the labels are consistent with a k-literal disjunction.
- Logarithmic in N; optimal to within constant factors.

**Problems with in vitro selection**

- The initial tube contains ~10^20 different strands. After a while: loss of variety; some selfish strand manages to dominate without doing the wanted function.
- Fix: mix in a little bit of the initial soup for enrichment.

**Disk spindown problem**

- Experts: a set of n fixed timeouts x_1, x_2, …, x_n.
- The master algorithm maintains a set of weights w_1, w_2, …, w_n.
- Multiplicative update: w_{t+1,i} ∝ w_{t,i} e^{−η (energy usage of timeout i)}.
- Problem: a burst favors the long timeout; the coefficients of the small timeouts are wiped out.
- Fix: mix in a little bit of the uniform vector: w = (1 − α) w̃ + α (1/n, …, 1/n), with α small and w̃ the multiplicative update. Nature uses mutation for the same purpose.
- Better fix: keep track of the past average weight vector r and set w = (1 − α) w̃ + α r. This facilitates switching back to a previously useful vector: long-term memory.
- How is long-term memory realized in nature? Sex? Junk DNA?

**The story so far**

1. The conservative update prevents overtraining.
2. Lower-bounding the weights is robust against change.
3. Upper-bounding the weights: next.

**New trick: cap the weights** (the "super-predator" algorithm; preserves variety). [Figure: lion; copyright belongs to Andy Warhol.]

- w_{t+1,i} ∝ w_{t,i} e^{−η loss_i}, then w_{t+1} is the capped and rescaled version of this update (capping).
- m-sets are encoded as probability vectors (1/m, …, 1/m, 0, …, 0), called m-corners. The convex hull of the m-corners is the capped probability simplex {w : Σ_i w_i = 1, w_i ≤ 1/m}.
- The combined update converges to the best corner of the capped simplex: the best set of size m.

**Motivations of the updates**: motivate the additive and multiplicative updates, using linear regression as the example problem.
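The normalized Winnow scheme above can be sketched in a few lines. The promotion factor α = 2, the pass count, and the toy disjunction data are illustrative assumptions, not from the slides:

```python
# Hedged sketch of normalized Winnow: promote the weights of active variables
# on positives that score at or below the 1/(2k) threshold, then renormalize.
def winnow(examples, n, k, alpha=2.0, passes=50):
    """examples: (x, y) pairs with x in {0,1}^n and y = 1 iff a k-literal
    disjunction holds. Returns the learned weight vector on the simplex."""
    w = [1.0 / n] * n
    for _ in range(passes):
        for x, y in examples:
            dot = sum(wi * xi for wi, xi in zip(w, x))
            if y == 1 and dot <= 1.0 / (2 * k):
                w = [wi * (alpha if xi else 1.0) for wi, xi in zip(w, x)]
                z = sum(w)                     # renormalize back to the simplex
                w = [wi / z for wi in w]
    return w

n, k = 20, 2                                   # target: disjunction of vars 0 and 1
pos1 = [1] + [0] * (n - 1)                     # x0 = 1  -> positive
pos2 = [0, 1] + [0] * (n - 2)                  # x1 = 1  -> positive
neg = [0, 0, 1, 1, 1, 1] + [0] * (n - 6)       # only irrelevant vars -> negative
w = winnow([(pos1, 1), (pos2, 1), (neg, 0)], n, k)
# After a few passes both positives score above 1/(2k) while the negative,
# touching only irrelevant variables, stays below it.
```

Only a logarithmic number of promotions is needed because each promotion multiplies a relevant weight by α while normalization shrinks the many irrelevant ones; this is the "logarithmic in N" behavior the slide claims.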
**Online linear regression.** For t = 1, 2, …:

1. Get instance x_t ∈ R^n.
2. Predict ŷ_t = w_t · x_t.
3. Get label y_t ∈ R.
4. Incur square loss (ŷ_t − y_t)².
5. Update w_t → w_{t+1}.

**Two main update families (linear regression)**

- Additive: w_{t+1} = w_t − η (w_t·x_t − y_t) x_t, using the gradient of the square loss. Motivated by the squared Euclidean distance; weights can go negative; Gradient Descent (GD).
- Multiplicative: w_{t+1,i} = w_{t,i} e^{−η (w_t·x_t − y_t) x_{t,i}} / Z_t. Motivated by the relative entropy; the updated weight vector stays on the probability simplex; Exponentiated Gradient (EG).

**Additive updates.** Find a w_{t+1} close to the old w_t that has small loss on the last example, i.e., minimize a tradeoff between closeness and loss:

w_{t+1} = argmin_w U(w), where U(w) = ||w − w_t||² + η (w·x_t − y_t)²

(divergence plus loss; η > 0 is the learning rate). Setting ∂U/∂w_i = 2(w_{t+1,i} − w_{t,i}) + 2η (w_{t+1}·x_t − y_t) x_{t,i} = 0 gives

- implicit: w_{t+1} = w_t − η (w_{t+1}·x_t − y_t) x_t
- explicit: w_{t+1} = w_t − η (w_t·x_t − y_t) x_t

**Multiplicative updates.** w_{t+1} = argmin_w U(w), where U(w) = Σ_i w_i ln(w_i / w_{t,i}) + η (w·x_t − y_t)². Define the Lagrangian, with Lagrange coefficient λ enforcing Σ_i w_i = 1:

L(w) = Σ_i w_i ln(w_i / w_{t,i}) + η (w·x_t − y_t)² + λ (Σ_i w_i − 1)

∂L/∂w_i = ln(w_i / w_{t,i}) + 1 + 2η (w·x_t − y_t) x_{t,i} + λ = 0
⇒ w_i = w_{t,i} e^{−2η (w·x_t − y_t) x_{t,i}} e^{−1−λ}

Enforce the normalization constraint by setting e^{−1−λ} = 1/Z_t (the factor 2 can be absorbed into η):

- implicit: w_{t+1,i} = w_{t,i} e^{−2η (w_{t+1}·x_t − y_t) x_{t,i}} / Z_t
- explicit: w_{t+1,i} = w_{t,i} e^{−2η (w_t·x_t − y_t) x_{t,i}} / Z_t

**GD / EG via differential equations**

- GD: ẇ(t) = −η ∇L(w(t)). Discretize: (w_{t+1} − w_t)/h = −η ∇L(w_t); with h = 1, w_{t+1} = w_t − η ∇L(w_t).
- EG: d/dt [log w(t)] = −η ∇L(w(t)). Discretize: (log w_{t+1,i} − log w_{t,i})/h = −η ∇L(w_t)_i; with h = 1, w_{t+1,i} = w_{t,i} e^{−η ∇L(w_t)_i}. The derivation of the normalized update is more involved.

**Summary**

- Multiplicative updates converge quickly but wipe out diversity.
- Changing conditions require reuse of previously learned knowledge/alternatives.
- Diversity is a requirement for success.
- Multiplicative updates occur in nature; diversity is preserved through mutations, linking, super-predators.
- Machine learning preserves diversity through the conservative update, the learning rate, capping, lower-bounding weights, and weighted mixtures of the past.

## Lecture Slides: Machine Learning, Decision Trees

Ethem Alpaydın, *Introduction to Machine Learning*, The MIT Press, 2004. alpaydin@boun.edu.tr, http://www.cmpe.boun.edu.tr/~ethem/i2ml

**Tree uses nodes and leaves; divide and conquer**

- Internal decision nodes:
  - Univariate: uses a single attribute x_i. Numeric x_i: binary split, x_i > w_m. Discrete x_i: n-way split for n possible values.
  - Multivariate: uses all attributes x.
- Leaves:
  - Classification: class labels (or proportions).
  - Regression: numeric; the average r, or a local fit.
- Learning is greedy: find the best split recursively (Breiman et al. 1984; Quinlan 1986, 1993).

**Classification trees (ID3, CART, C4.5)**

High level:

- Want a small tree fitting the data, for good generalization.
- Occam's razor: the simplest hypothesis consistent with the data is best. Distinguish the real world from all possible situations vs. distinguish the real world from all simpler situations.
- For node m, N_m instances reach m, of which N_m^i belong to class C_i:
  p̂(C_i | x, m) ≡ p_m^i = N_m^i / N_m
- Node m is pure if each p_m^i is 0 or 1.
- The measure of impurity is entropy: I_m = −Σ_i p_m^i log₂ p_m^i
- With N data points and d real-valued attributes there are on the order of d·N different "attribute < value" tests and far too many trees to search exhaustively, so use greedy top-down search.

Pseudocode (cleaned up from the slides):

```
GenerateTree(X)
  If NodeEntropy(X) < θ_I                       // eq. 9.3
    Create a leaf labelled by the majority class in X; return
  i ← SplitAttribute(X)
  For each branch of x_i
    Find X_j falling in that branch
    GenerateTree(X_j)

SplitAttribute(X)
  MinEnt ← MAX
  For all attributes i = 1..d
    If x_i is discrete with n values
      Split X into X_1, ..., X_n by x_i
      e ← SplitEntropy(X_1, ..., X_n)           // eq. 9.8
      If e < MinEnt { MinEnt ← e; bestf ← i }
    Else                                        // x_i is numeric
      For all possible split positions
        Split X into X_1, X_2 on x_i
        e ← SplitEntropy(X_1, X_2)
        If e < MinEnt { MinEnt ← e; bestf ← i }
  Return bestf
```

- If node m is pure, generate a leaf and stop; otherwise split and continue recursively.
- Impurity after a split, where N_mj of the N_m instances take branch j and N_mj^i of them belong to C_i:
  p̂(C_i | x, m, j) ≡ p_mj^i = N_mj^i / N_mj,
  I'_m = −Σ_j (N_mj / N_m) Σ_i p_mj^i log₂ p_mj^i
- Find the variable (and, for numeric variables, the split position) that minimizes impurity.

**Regression trees**

- Error at node m, with b_m(x) = 1 if x reaches node m and 0 otherwise:
  E_m = (1/N_m) Σ_t (r^t − g_m)² b_m(x^t), where g_m is the average label at m.
- After splitting, with b_mj(x) = 1 if x reaches node m and takes branch j (0 otherwise):
  E'_m = (1/N_m) Σ_j Σ_t (r^t − g_mj)² b_mj(x^t).

**Pruning trees**

- Remove subtrees for better generalization (decreases variance).
- Prepruning: early stopping. Postpruning: grow the whole tree, then prune subtrees that overfit on the pruning set.
- Prepruning is faster; postpruning is more accurate (it requires a separate pruning set).

**Rule extraction from trees (C4.5 Rules; Quinlan, 1993)**

Example rules from a tree on age, years-in-job, and job-type:

- R1: IF age > 38.5 AND years-in-job > 2.5 THEN y = 0.8
- R2: IF age > 38.5 AND years-in-job ≤ 2.5 THEN y = 0.6
- R3: IF age ≤ 38.5 AND job-type = A THEN y = 0.4
- R4: IF age ≤ 38.5 AND job-type = B THEN y = 0.3
- R5: IF age ≤ 38.5 AND job-type = C THEN y = 0.2

**Rule induction (IREP, Ripper)**

- Rule induction is similar to tree induction, but tree induction is breadth-first while rule induction is depth-first, one rule at a time.
- A rule set contains rules; rules are conjunctions of terms. A rule covers an example if all terms of the rule evaluate to true for the example.
- Sequential covering: generate rules one at a time until all positive examples are covered. IREP (Fürnkranz and Widmer, 1994); Ripper (Cohen, 1995).

Ripper pseudocode (cleaned up from the slides):

```
Ripper(Pos, Neg, k)
  RuleSet ← LearnRuleSet(Pos, Neg)
  For k times
    RuleSet ← OptimizeRuleSet(RuleSet, Pos, Neg)

LearnRuleSet(Pos, Neg)
  RuleSet ← ∅; DL ← DescLen(RuleSet, Pos, Neg)
  Repeat
    Rule ← LearnRule(Pos, Neg)
    Add Rule to RuleSet
    DL' ← DescLen(RuleSet, Pos, Neg)
    If DL' > DL + 64
      PruneRuleSet(RuleSet, Pos, Neg); Return RuleSet
    If DL' < DL then DL ← DL'
    Delete instances covered by Rule from Pos and Neg
  Until Pos = ∅
  Return RuleSet

PruneRuleSet(RuleSet, Pos, Neg)
  For each Rule ∈ RuleSet, in reverse order
    DL ← DescLen(RuleSet, Pos, Neg)
    DL' ← DescLen(RuleSet − Rule, Pos, Neg)
    If DL' < DL, delete Rule from RuleSet
  Return RuleSet

OptimizeRuleSet(RuleSet, Pos, Neg)
  For each Rule ∈ RuleSet
    DL0 ← DescLen(RuleSet, Pos, Neg)
    DL1 ← DescLen(RuleSet − Rule + ReplaceRule(RuleSet, Pos, Neg), Pos, Neg)
    DL2 ← DescLen(RuleSet − Rule + ReviseRule(RuleSet, Rule, Pos, Neg), Pos, Neg)
    If DL1 = min(DL0, DL1, DL2): delete Rule and add ReplaceRule(RuleSet, Pos, Neg)
    Else if DL2 = min(DL0, DL1, DL2): delete Rule and add ReviseRule(RuleSet, Rule, Pos, Neg)
  Return RuleSet
```

**Multivariate trees**

At an internal node, test a linear combination of all attributes: f_m(x): w_m^T x + w_{m0} > 0.

## Lecture Slides: Machine Learning, Dimensionality Reduction

Ethem Alpaydın, The MIT Press, 2004. alpaydin@boun.edu.tr, http://www.cmpe.boun.edu.tr/~ethem/i2ml

**Why reduce dimensionality?**

- Reduces time complexity: less computation.
- Reduces space complexity: fewer parameters.
- Saves the cost of observing the feature.
- Simpler models are more robust on small datasets.
- More interpretable: a simpler explanation.
- Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions.

**Feature selection vs. extraction**

- Feature selection: choose k < d important features, ignoring the remaining d − k; subset-selection algorithms.
- Feature extraction: project the original x_i, i = 1..d, dimensions to new k < d dimensions z_j, j = 1..k. Principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA); also clustering-based approaches.

**Feature ranking** (Guyon & Elisseeff)

- Find features with a high score: correlation with the labels (regression), predictive power of a one-attribute classifier, mutual information between the attribute and the labels.
- Relatively quick and simple.

**Subset selection**

- There are 2^d subsets of d features.
- Forward search: add the best feature at each step. The set of features F is initially ∅; at each iteration find the best new feature j = argmin_j E(F ∪ {x_j}) and add x_j to F if E(F ∪ {x_j}) < E(F).
- This is hill-climbing: O(d²) (more precisely O(d·k)) runs of the algorithm to pick k features.
- Backward search: start with all features and remove one at a time, if possible.
- Floating search: add k, remove l.

**Principal components analysis (PCA; related to SVD)**

- Find a low-dimensional space such that when x is projected there, information loss is minimized.
- The projection of x on the direction of w is z = wᵀx (column vectors).
- Find w such that Var(z) is maximized:
  Var(z) = Var(wᵀx) = E[(wᵀx − wᵀμ)²] = E[wᵀ(x − μ)(x − μ)ᵀw] = wᵀ E[(x − μ)(x − μ)ᵀ] w = wᵀΣw,
  where Σ = Cov(x) = E[(x − μ)(x − μ)ᵀ] (note (wᵀv)² = wᵀv vᵀw).
- Maximize Var(z) subject to ||w₁|| = 1 (Lagrange multiplier): max_{w₁} w₁ᵀΣw₁ − α(w₁ᵀw₁ − 1). Taking the derivative and setting it to 0: Σw₁ = αw₁, that is, w₁ is an eigenvector of Σ, and w₁ᵀΣw₁ = αw₁ᵀw₁ = α. Choose the eigenvector with the largest eigenvalue to maximize Var(z).
- Second principal component: max w₂ᵀΣw₂ subject to ||w₂|| = 1 and w₂ᵀw₁ = 0: max w₂ᵀΣw₂ − α(w₂ᵀw₂ − 1) − β(w₂ᵀw₁ − 0). Setting the derivative to 0, one can show β = 0 (see the text), so again Σw₂ = αw₂: w₂ is also an eigenvector of Σ, and so on.
- From z = Wᵀ(x − μ) one can reconstruct x = Wz + μ, using WWᵀ = I if W is orthonormal. PCA gives the k-dimensional projection that minimizes the sum of squared distances Σ_t ||x^t − x̂^t||².
- What PCA does: z = Wᵀ(x − m), where the columns of W are the eigenvectors of Σ and m is the sample mean. It centers the data at the origin and rotates the axes.

**How to choose k**

- Proportion of variance (PoV) explained: (λ₁ + λ₂ + … + λ_k)/(λ₁ + λ₂ + … + λ_d), where the λ_i are the eigenvalues sorted in descending order.
- Typically stop at PoV > 0.9.
- Scree graph: a plot of PoV vs. k; stop at the "elbow". [Figures: scree graph for Optdigits (eigenvalues/eigenvectors, proportion of variance explained); Optdigits after PCA.]

**Factor analysis** (example: news-article topics)

- Find a small number of factors z which, when combined, generate (cause) x:
  x_i − μ_i = Σ_j v_ij z_j + ε_i,
  where the z_j, j = 1..k, are latent factors with E[z_j] = 0, Var(z_j) = 1, Cov(z_i, z_j) = 0 for i ≠ j; the ε_i are noise sources with E[ε_i] = ψ_i, Cov(ε_i, ε_j) = 0 for i ≠ j, Cov(ε_i, z_j) = 0; and the v_ij are the factor loadings.

**PCA vs. FA**

- PCA: from x to z; z = Wᵀ(x − μ).
- FA: from z to x; x − μ = Vz + ε.
- In FA, the factors z_j are stretched, rotated, and translated to generate x.

**Multidimensional scaling (MDS)**

- Given pairwise distances between N points, d_ij for i, j = 1..N, place the points on a low-dimensional map such that the distances are preserved.
- z = g(x | θ); find θ minimizing the Sammon stress:
  E(θ | X) = Σ_{r,s} (||z^r − z^s|| − ||x^r − x^s||)² / ||x^r − x^s||²

## Lecture Slides: Machine Learning, Learning from Examples

Ethem Alpaydın, The MIT Press, 2004; modified by dph, Fall 2006. alpaydin@boun.edu.tr, http://www.cmpe.boun.edu.tr/~ethem/i2ml

**Assumption: i.i.d. examples**

- The distribution of things and measurements defines the unknown but fixed P(x) on the domain D.
- The target concept C gives the correct labels C(x) as a function of the features x.
- Find an h in H, from examples, that is close to C.
- A loss function l(r, r') measures the error of predictions; usually l(r, r') = 0 if r = r' and l(r, r') = 1 otherwise.
- Want to minimize the probability of error, ∫ P(x)·1(C(x) ≠ h(x)) dx, for the usual loss.

**Tasty coffee example**

- Objects are cups of coffee; measure strength and sugar. Each measurement is a feature (attribute). Other possible features: cream, temperature, roast. Features have numeric precision/accuracy.
- The label r (also written y) is "tasty" or not. An example is an (x, y) pair, x ∈ R², y ∈ {+, −}.

**Binary classification: learning a class from labeled examples**

- Things are represented by a feature vector x and a label r (also called y), often r ∈ {0, 1} or {−1, +1}.
- The domain D is the set of all possible feature vectors.
- A hypothesis (sometimes called a concept) is a partitioning of the domain into + and − regions (or the + region, or a function from the domain X to r).
- A hypothesis class H is a set of hypotheses (sometimes called a concept class).

**More terminology**

- Domain: the set of all possible x vectors.
- Concept: a boolean function on the domain, i.e., a mapping from x's to {+, −} (or {T, F}, or {1, 0}).
- Target: the concept to be learned.
- Hypothesis class (space): the set of hypotheses (concepts) that can be output by a given learning algorithm.

**Coffee example, continued**

- Strength and sugar are measured 0 to 10, so the domain has 121 different instances. How to predict from these examples?
- Version space: all concepts in the hypothesis space consistent with the training set. If the hypothesis space is all concepts, then the version space is evenly split on every unseen instance: generalization is hopeless. Need inductive bias, i.e., a smaller hypothesis space.
- Assume "tasty coffee" is a rectangle in R²: the concepts C for which there exist (c₁, c₂, c₃, c₄) such that C(x) = 1 iff c₁ ≤ strength ≤ c₂ and c₃ ≤ sugar ≤ c₄. Hypothesis class H of rectangles: h(x) = 1 if h classifies x as positive, 0 if negative.
- Error of h on data X: E(h | X) = Σ_t 1(h(x^t) ≠ r^t).

**Key interplay**

- The underlying pattern being learned, the features available, the hypothesis space, and the number of examples available. The trick is finding the right mix: the smaller the hypothesis space, the luckier we have to be to catch the pattern.

**Model selection & generalization**

- Learning is an ill-posed problem: the data alone is not sufficient to find a unique solution.
- The need for inductive bias: assumptions about H.
- Generalization: how well a model performs on new data; this is what we are really interested in.
- Overfitting: H more complex than C (or f). Underfitting: H less complex than C (or f).
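The PCA recipe above (center the data, form the covariance, take the top eigenvector) can be sketched in pure Python for the 2-D case. The data-generating numbers below are illustrative assumptions, not from the slides:

```python
# 2-D PCA from first principles: sample covariance, closed-form eigenvalues
# of the 2x2 symmetric matrix, and the eigenvector of the largest one.
import math
import random

def pca_2d(points):
    """Return (top eigenvalue, unit top eigenvector, total variance)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n            # sample mean (centering)
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    lam1 = tr / 2 + math.sqrt(tr * tr / 4 - det)  # largest eigenvalue of Sigma
    vx, vy = sxy, lam1 - sxx                      # (Sigma - lam1 I) v = 0
    norm = math.hypot(vx, vy)
    return lam1, (vx / norm, vy / norm), tr

rng = random.Random(0)
# Points stretched along the 45-degree direction (1,1)/sqrt(2).
pts = []
for _ in range(2000):
    a, b = rng.gauss(0, 3), rng.gauss(0, 0.5)     # large variance along (1,1)
    pts.append(((a - b) / math.sqrt(2), (a + b) / math.sqrt(2)))
lam1, w1, total = pca_2d(pts)
pov = lam1 / total                                # proportion of variance explained
# The first PC recovers the (1,1) direction and PoV is well above the usual
# 0.9 stopping threshold, so k = 1 would be chosen here.
```

Because the trace of the covariance equals the sum of its eigenvalues, `lam1 / total` is exactly the PoV quantity used in the "how to choose k" rule.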
1 Triple TradeOff I There is a tradeoff between three factors Dietterich 2003 in Complexity of H C Sif Training set size N a Generalization error E on new data D As NT EJ D As C SiDi first EJ and then ET I r Example I Not a rectangle in X1X2 I How to make it a rectangle 3 2 1 1 Example L lmncmmn I Not a recl Npoi I How to m H i can be labeled meways as M H mm ex 1H Consistent I It is a rec1 lm39Lm mam CH r i 7 i d1mens101 t H T 3 VC Dimension Shattering I Npoints can be labeled in 2N ways as I shatlers a set if for each labeling of the set there is an it E consistent with the labeling VC5 size of a largest shattered set An axis aligned rectangle shatters 4 points only VC Dimension I VapnikChervonenkis dimension is a measure of hypothesis space capacity I VCdim of rectangles in in plane is 4 I PAC Probably approximately correct bounds Hypothesis consistent with OVCdim ln1a a examples usually has error s a il Noise I Data not always perfect Attribute noise Label noise Noise can model hypothesis space approximations GD 6 target Domain 1 Noise Data not always perfect Attribute noise Label noise Noise can model hypothesis space approximations 9 9 target Domain Noise Data not always perfect Attribute noise Label noise TargetCircle N01se can model hypothesis space approximations nypothesisrectangles Domain Multiple Classes Ci i1K X x r 1 Engine power Imi l rm reject Fa uv rm Lilmii umm Price Kt lifx EQ 39 Oifx ECJji Train hypotheses hx i 1K hgt1ifx EC Oifx ECjji p Bregman Divergences BrCLCs I For any differentiable convex function F AFvaw F171 Fw I71 w VwFw f w supporting hyperplane through 11 57 Bregman Divergences Simple PropertiesI 1 AF1uw is convex in U 2 w 2 0 If F convex equality holds iff 1N0 w 3 Usually not symmetric AF1U w 7E AFw 171 4 Linearity for a Z 0 AFCLH1U w AF1U w l aAH1lJw 5 Unaffected by linear terms a E R b E R AHJFCLEJFbCHJ w w 58 Bregman Divergences more propertiesl 6 V AFCHJ w 2 Wm Vivu JVwFw f1U fw 7 Apw1w2 Apw2w3 Fw1 Fw2 wl w2fw2 Fw2 Fw3 J2 w3 
AFwl w3 101 102 fw3 fw2 59 A Pythagorean Theorem BrCsAHW I W 10 is projection of w onto convex set W Wrt Bregman divergence AF 10 argmin AF39u w UEW Theorem AFuw Z Apuw l Apww j 60 Squared Euclidean Distance Fw HUME2 fw w A hw WINE2 HwH32 I7J ww HHJ ng2 Unnermalized Relative Entropy Fw Z In 1117 fw lnw AFBw 1nw U7 i 61 Examples2 GLSGL I p norm Algs q is dual to p i i 1 1 ma gHwH 1 2 fw V Hwq N 12 1 2 N AFltwwgt Hqu Hqu w wfw When 19 q 2 this reduces to squared Euclidean distance WidrOW Hoff 62 Burg entropy Examples3 Z lnwi 1 E a ZNW 63 Examples4 div between density matricesl Umegaki Divergence FW trWan W an AFWW trWan anW W LogDet Divergence 1nW fW 7 1 N 7V N AFWW 1n trWW 1 n j 64 General Motivation of Updates I Trade off between two term wt argmin Apwwt m Lt39w w W weight domain label domain Apw wt is regularization term and serves as measure of progress in the analysis When loss L is convex in w v wAFw7wt l 77tLt w 0 iff fw fwt 771i VLtWJ Z 0 W QVLt gt wt 1 f 1fwt UtVLtWJt J 65 Quadratic Loss I ayt w 39 02 39w c EtZyt z 66 Divergence Euclidean Distance Squaredl AFwwt Hw th wt 32 1 act 2 1 05 yt 1 V 67 j u wt 77yt w 39 02 t 32 1 7 wt 2 1 05 w W2 U 77 02 i f Nonlinear Regression I ghltwmgt II o Sigmoid function hz 1 0 For a set of examples 1311 acT yT total loss 231 hw ac yt22 can have exponentially many minima in weight space B11AHW 75 Want loss that is convex in w 76 Bregman Div Lead to Good Loss Functionl M2 M11 310 Am ac h4g1 h VH 77 Use AHw ac h1y as loss of w on acy Called matching loss for h Matching loss is convex in w AHWHKW transfer f H match loss W dHw ac h 1y 1 2 z zg w w y square loss 1n1 ewm yw ac 1nlt1ezgt y1nylt1 ygt1nlt1 ygt logistic loss maX0 yw ac Slgnz izi hinge loss 78 Idea behind the matching lossI If transfer function and loss match then VwAHW w h1y hw iv y Then update has simple form fwt1 fwt 77thwt 39 93 903 This can be exploited in proofs But not absolutely necessary One only needs convexity of Lhw acy in w Ce j 79 Sigmoid in the LimitI 
For transfer function hz signz HCZ M Matching loss is hinge loss HLw as h 1y Inaux07 yw a3 0 yw as y I1 wonvex 1n w but not dlfferentlable 80 Motivation of linear threshold algsI Gradient descent With Perceptron Hinge Loss Expon gradient Normalized With Winnow Hinge Loss Known linear threshold algorithms for il Classi cation case are Kgradient based algorithms With hinge loss 81 wt1 ar Lnin Hw MHZ2 77HLUJ why 1010 wt 77 Sig wt1 39 16 909316 wt 77 Signwt 331i yt act A yt 82 lNormalized Winnowl wt1 argmin 1117 In W l nHLw actglytgt w z 1 wm wt 6 Slgnwmtyt normalization 77 Signwt 39 t yt xti W 111m 6 t normalization 83 h g7 lossf13a 1 M2 Ma 62 Min Figure 1 The matching loss between estimate 1 and measurement a is the area from 1 to a underneath the transfer function If a is less than 13 then the loss is the area is above the transfer function 1 m2 1 m m m um us as nus m um u2 m2 2 2 5 es 4 e2 0 2 a 5 es 4 e u 2 a Legend Leeene Leeene tge so 7 tge sons 7 tge saampo2 Ma we we a a 25 a 5 a j a D2 0 3 0 0 e ms 0 r 3 a a 2 o V e es 4 e2 U 2 a 5 es 4 e2 0 2 a 5 es 4 ah ah ah Legend Leeene WWW lassah 4 WWW asstJ r meegemug lassah g lessens lame Legend 0 e e e e 0 New mg Figure 2 On the top row we plot three choices of the transfer function Ma and their derivatives The rst is the sigmoid function ie Ma 3a 13 g the others tw are shifted versions of the sigmoid In the bottom row we plot the activities a 39 3 The 393 a a D a a p 391 Q loss13 a as a function of the est M i ed in the plot above Note that locally the transfer function is always spec losses are quadratic and the steepness of the bowl is determined by h a Tradeoff between two divergences I wt argmin AFwwt l m AHw acth1yt w EH x parameter matching divergence loss divergence Both divergences are convex in w wt1 f1fwt 77thwt 39 wt 3909316 Generalization of the delta rule 84 Bregman divergences and exponential families o Exponential family of distributions 0 Inherent duality wt1 f1fwt UVLtwt primal param dual param f wt 
Update written in the dual parameter: $f(w_{t+1}) = f(w_t) - \eta\, \nabla L_t(w_t)$.

### Exponential Family of Distributions

- Parametric density functions $P_\theta(x) = e^{\theta \cdot x - G(\theta)}\, P_0(x)$
- $\theta$ and $x$ are vectors in $\mathbb{R}^d$
- Cumulant function $G(\theta)$ assures normalization: $G(\theta) = \ln \int_x e^{\theta \cdot x}\, P_0(x)\, dx$
- $G(\theta)$ is a convex function on the convex set $\Theta \subseteq \mathbb{R}^d$
- $G$ characterizes the members of the family
- $\theta$ is the natural parameter

Expectation parameter:

$$\mu = \int x\, P(x \mid \theta)\, dx = E_\theta[x] = g(\theta), \quad \text{where } g(\theta) = \nabla_\theta G(\theta).$$

Second convex function $F(\mu)$ on the space $g(\Theta)$:

$$F(\mu) = \theta \cdot \mu - G(\theta).$$

$G(\theta)$ and $F(\mu)$ are convex conjugate functions. Let $f(\mu) = \nabla_\mu F(\mu)$; then $f(\mu) = \theta$, i.e. $f = g^{-1}$.

### Primal & Dual Parameters

The natural parameter $\theta$ and the expectation parameter $\mu$ are linked by $g$ (forward) and $f$ (back), with potentials $G(\theta)$ and $F(\mu)$.

- $\theta$ and $\mu$ are dual parameters
- Parameter transformations: $g(\theta) = \mu$ and $f(\mu) = \theta$

### Example: Gaussian (unit variance)

$P(x \mid \theta) \sim e^{-(x-\theta)^2/2}$, i.e. $P(x \mid \theta) = e^{\theta x - \theta^2/2} \cdot \tfrac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$.

- Cumulant function: $G(\theta) = \theta^2/2$
- Parameter transformations: $\mu = g(\theta) = \theta$ and $\theta = f(\mu) = \mu$
- Dual convex function: $F(\mu) = \theta\mu - G(\theta) = \mu^2/2$
- Square loss: $L_t(\theta) = (x_t - \theta)^2/2$

### Example: Bernoulli

The $x_t$ are coin flips in $\{0, 1\}$: $P(x \mid \mu) = \mu^x (1-\mu)^{1-x}$, where $\mu$ is the probability (expectation) of a 1.

- Natural parameter: $\theta = \ln\frac{\mu}{1-\mu}$, so $P(x \mid \theta) = \exp\big(\theta x - \ln(1+e^\theta)\big)$
- Cumulant function: $G(\theta) = \ln(1+e^\theta)$
- Parameter transformations: $\mu = g(\theta) = \frac{e^\theta}{1+e^\theta}$ and $\theta = f(\mu) = \ln\frac{\mu}{1-\mu}$
- Dual function: $F(\mu) = \mu\ln\mu + (1-\mu)\ln(1-\mu)$
- Log loss: $L_t(\theta) = -x_t\theta + \ln(1+e^\theta) = -\big(x_t\ln\mu + (1-x_t)\ln(1-\mu)\big)$

### Example: Poisson

The $x_t$ are natural numbers in $\{0, 1, \ldots\}$: $P(x \mid \mu) = \frac{\mu^x e^{-\mu}}{x!}$, where $\mu$ is the expectation of $x$.

- Natural parameter: $\theta = \ln\mu$, so $P(x \mid \theta) = \frac{1}{x!}\exp(\theta x - e^\theta)$
- Cumulant function: $G(\theta) = e^\theta$
- Parameter transformations: $\mu = g(\theta) = e^\theta$ and $\theta = f(\mu) = \ln\mu$
- Dual function: $F(\mu) = \mu\ln\mu - \mu$
- Loss: $L_t(\theta) = -x_t\theta + e^\theta + \ln(x_t!) = -\big(x_t\ln\mu - \mu - \ln(x_t!)\big)$

### Bregman Divergences as Relative Entropies between Distributions

Let $P(x \mid \tilde\theta)$ and $P(x \mid \theta)$ denote two distributions with cumulant function $G$. Then

$$\Delta\big(P(\cdot \mid \tilde\theta),\, P(\cdot \mid \theta)\big) = \int P(x \mid \tilde\theta)\, \ln\frac{P(x \mid \tilde\theta)}{P(x \mid \theta)}\, dx = G(\theta) - G(\tilde\theta) - (\theta - \tilde\theta)\cdot g(\tilde\theta) = \Delta_G(\theta, \tilde\theta),$$

the Bregman divergence of the cumulant function, where $\Delta_G(\theta, \tilde\theta) := G(\theta) - G(\tilde\theta) - (\theta - \tilde\theta)\cdot g(\tilde\theta)$.

### Area Unchanged When Slide Flipped

$$\Delta_G(\tilde\theta, \theta) = \Delta_F(\mu, \tilde\mu)$$

Each divergence is an area between the curve $g$ and the diagonal; flipping the picture across the diagonal exchanges $g$ with its inverse $f$ (and $\theta$ with $\mu$) but leaves the area unchanged.

### Dual Divergence for Bernoulli

With $G(\theta) = \ln(1+e^\theta)$ and $F(\mu) = \mu\ln\mu + (1-\mu)\ln(1-\mu)$:

$$\Delta_G(\tilde\theta, \theta) = \ln\frac{1+e^{\tilde\theta}}{1+e^{\theta}} - (\tilde\theta - \theta)\,\frac{e^\theta}{1+e^\theta}$$

$$\Delta_F(\mu, \tilde\mu) = \mu\ln\frac{\mu}{\tilde\mu} + (1-\mu)\ln\frac{1-\mu}{1-\tilde\mu}$$

The latter is the binary relative entropy. A sum of binary relative entropies is the parameter divergence for BEG.

### Dual Divergence for Poisson

With $G(\theta) = e^\theta$, $F(\mu) = \mu\ln\mu - \mu$, $g(\theta) = e^\theta$, and $f(\mu) = \ln\mu$:

$$\Delta_G(\tilde\theta, \theta) = e^{\tilde\theta} - e^{\theta} - (\tilde\theta - \theta)\, e^{\theta}$$

$$\Delta_F(\mu, \tilde\mu) = \mu\ln\frac{\mu}{\tilde\mu} - \mu + \tilde\mu$$

The latter is the unnormalized relative entropy. A sum of unnormalized relative entropies is the parameter divergence for UEG, e.g. Winnow.

### A Note on EM (Fall '05)

Here is the algebra for the EM algorithm. It is based on the presentation in Bishop's Neural Networks for Pattern Recognition, starting on page 65. The setup is that we have a parameterized set of distributions $p(x, z \mid \theta)$, where $\theta \in \Theta$ is the parameter vector. We have a set of observations $x_1, \ldots, x_n$ with hidden information $z_1, \ldots, z_n$. Here we don't see the $z_i$, so this exposition will treat them as random values from some finite set $Z$. The goal is to find the parameter $\theta$ maximizing the likelihood of the $x_i$. EM is an iterative technique that finds a local maximum of the likelihood of the observations. Each iteration starts with the current parameters $\theta$ and finds updated parameters $\theta'$.

In the Gaussian example given in class, each observation $x_i$ was the coordinates of a point, each hidden $z_i$ was the index of the Gaussian generating observation $x_i$, and $\theta$ was the means and variances of the spherical Gaussians. ($\theta$ could be just the means, or also represent the mixture coefficients of the various Gaussians, or include whole covariance matrices if we allowed non-spherical Gaussians.)

Assume we have an initial setting of the parameters $\theta$, and we want to find an improved setting $\theta'$. We will use

$$p(x \mid \theta) = \prod_{i=1}^n \sum_{z \in Z} p(x_i, z \mid \theta)$$

for the likelihood of the observed sample (the $x_i$). Note that for each $x_i$ we sum over all possible values of the hiddens, and we use "$p$" instead of "$P$" since our example deals with densities rather than discrete probability distributions. Note also that the $x_i$ are known values, while the $z_i$ are variables from a discrete set. This means that I will write $p(x_i \mid \theta)$ but $P(z_i = z \mid \theta)$; however, $z$ without a subscript will be a known value bound by summations.

First define an error function $E$ equal to the negative log likelihood of the observations:

$$E(\theta) = -\ln p(x \mid \theta) = -\sum_{i=1}^n \ln \sum_{z \in Z} p(x_i, z \mid \theta).$$

Any $\theta'$ minimizing $E(\theta')$ will also maximize the likelihood. Since $E(\theta)$ is a constant, minimizing $E(\theta')$ is equivalent to minimizing

$$E(\theta') - E(\theta) = -\sum_{i=1}^n \ln \frac{p(x_i \mid \theta')}{p(x_i \mid \theta)} = -\sum_{i=1}^n \ln \sum_{z \in Z} P(z_i = z \mid x_i, \theta)\, \frac{P(z_i = z \mid \theta')\, p(x_i \mid z_i = z, \theta')}{p(x_i \mid \theta)\, P(z_i = z \mid x_i, \theta)}.$$

Jensen's inequality shows that if the $\lambda_j$ are non-negative and sum to one, then $\ln \sum_j \lambda_j a_j \ge \sum_j \lambda_j \ln a_j$. We apply this with $\lambda_z = P(z_i = z \mid x_i, \theta)$ and the $a_z$ being the rest of the stuff inside the inner summation.

### Monty Hall Problem ("Let's Make a Deal", 3 doors)

There are three doors: 1, 2, 3. The host puts good stuff behind one door $g$ and junk behind the other two. The contestant picks a door. The host shows junk behind a different door, call it $s$. The contestant can now switch; should they?

What is the experiment, and what are the atomic events? Assume the host picks the good door $g$ uniformly at random, so $p(g{=}1) = p(g{=}2) = p(g{=}3) = 1/3$. If there is one unpicked junk door, the host shows it; if there are two unpicked junk doors, the host picks one of them uniformly at random (50-50) to show.

Let's try it: pick partners; everybody plays contestant twice, switching once and not switching once.

To simplify, assume the contestant initially picks door 1.

Outcome space (contestant initially picks door 1), joint probabilities $P(s, g)$:

| | $g=1$ | $g=2$ | $g=3$ | marginal $P(s)$ |
|---|---|---|---|---|
| $s=1$ | 0 | 0 | 0 | 0 |
| $s=2$ | 1/6 | 0 | 1/3 | 1/2 |
| $s=3$ | 1/6 | 1/3 | 0 | 1/2 |
| marginal $P(g)$ | 1/3 | 1/3 | 1/3 | 1 |

From the joint table:

$$P(g{=}3 \mid s{=}2) = \frac{P(g{=}3 \text{ and } s{=}2)}{P(s{=}2)} = \frac{1/3}{1/2} = \frac{2}{3}, \qquad P(g{=}1 \mid s{=}2) = \frac{P(g{=}1 \text{ and } s{=}2)}{P(s{=}2)} = \frac{1/6}{1/2} = \frac{1}{3}.$$

So switching is twice as good as not switching.

Note: $P(g \mid s{=}2)$ and $P(s \mid g{=}1)$ are conditional distributions on $\{1, 2, 3\}$.

Exercise (for learning): repeat the analysis for the following modification of the host: whenever doors 2 and 3 are both junk, the host always shows door 2 rather than picking randomly.

But note that in practice the joint probabilities are usually not known.
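The switching argument can also be checked with a quick Monte Carlo simulation. This is a sketch, assuming the host behavior described above (good door uniform, 50-50 tie-break when two junk doors are unpicked) and a contestant who always starts at door 1; the function name `play` is ours, not from the notes:

```python
import random

def play(switch, trials=100_000, rng=random.Random(0)):
    """Simulate the Monty Hall game; the contestant initially picks door 1."""
    wins = 0
    for _ in range(trials):
        g = rng.choice([1, 2, 3])                 # good door, uniform at random
        junk = [d for d in (2, 3) if d != g]      # unpicked junk doors
        s = rng.choice(junk)                      # host shows one (50-50 if two)
        if switch:
            # switch to the door that is neither the initial pick nor the shown one
            final = next(d for d in (1, 2, 3) if d not in (1, s))
        else:
            final = 1
        wins += (final == g)
    return wins / trials

print(play(switch=False))  # close to 1/3
print(play(switch=True))   # close to 2/3
```

The empirical frequencies match the conditional probabilities from the joint table: staying wins about a third of the time, switching about two thirds.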
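The primal/dual divergence identities from the exponential-family notes earlier, e.g. $\Delta_G(\tilde\theta, \theta) = \Delta_F(\mu, \tilde\mu)$ for the Bernoulli family, are easy to spot-check numerically. A minimal sketch with arbitrary test points (the variable names are ours):

```python
import math

# Bernoulli family: cumulant G, link g = G', dual potential F, inverse link f = F'
G = lambda t: math.log(1 + math.exp(t))
g = lambda t: 1 / (1 + math.exp(-t))                       # sigmoid
F = lambda m: m * math.log(m) + (1 - m) * math.log(1 - m)
f = lambda m: math.log(m / (1 - m))                        # logit, f = g^{-1}

theta_t, theta = 0.7, -0.4                                 # "tilde" and plain parameters
mu_t, mu = g(theta_t), g(theta)

# the parameter transformations are inverses of each other
assert abs(f(mu) - theta) < 1e-12

# Bregman divergence of G vs. the argument-flipped Bregman divergence of F
dG = G(theta_t) - G(theta) - (theta_t - theta) * g(theta)
dF = F(mu) - F(mu_t) - (mu - mu_t) * f(mu_t)               # = binary relative entropy
assert abs(dG - dF) < 1e-12
print(dG, dF)
```

The same check works for the Poisson family with `G = exp`, `F = lambda m: m*math.log(m) - m`, recovering the unnormalized relative entropy.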
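To make the EM recipe above concrete, here is a minimal sketch of the iteration for the class example, specialized to a 50/50 mixture of two unit-variance 1-D Gaussians where $\theta$ is just the pair of means (the component indices $z_i$ are the hiddens). This is an illustration under those simplifying assumptions, not the general algorithm with learned mixture weights or covariances:

```python
import math

def em_two_gaussians(xs, mu=(-1.0, 1.0), iters=50):
    """EM for a 50/50 mixture of two unit-variance 1-D Gaussians.
    theta = the two means; the hidden z_i picks the generating component."""
    m0, m1 = mu
    for _ in range(iters):
        # E step: responsibilities P(z_i = 1 | x_i, theta) under current means
        r = []
        for x in xs:
            p0 = math.exp(-0.5 * (x - m0) ** 2)
            p1 = math.exp(-0.5 * (x - m1) ** 2)
            r.append(p1 / (p0 + p1))
        # M step: each mean becomes the responsibility-weighted average
        m1 = sum(ri * x for ri, x in zip(r, xs)) / sum(r)
        m0 = sum((1 - ri) * x for ri, x in zip(r, xs)) / sum(1 - ri for ri in r)
    return m0, m1

xs = [-2.1, -1.9, -2.0, 1.9, 2.1, 2.0]   # two clear clusters near -2 and +2
print(em_two_gaussians(xs))              # means settle near the cluster centers
```

Each iteration is one application of the Jensen bound derived above: the E step fixes the $\lambda_z = P(z_i = z \mid x_i, \theta)$, and the M step minimizes the resulting upper bound on $E(\theta')$, so the likelihood never decreases.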
