# Machine Learning CMPSCI 689

UMass



This 230-page set of class notes was uploaded by Roman McCullough on Friday, October 30, 2015. The notes belong to CMPSCI 689 at the University of Massachusetts, taught by Sridhar Mahadevan in Fall. Since its upload, it has received 11 views. For similar materials see /class/232263/cmpsci-689-university-of-massachusetts in Computer Science at the University of Massachusetts.


Date Created: 10/30/15

## CMPSCI 689: Linear Algebra, Least-Squares Regression and the Multivariate Gaussian

Sridhar Mahadevan (mahadeva@cs.umass.edu), University of Massachusetts

### Topics

- Review of linear algebra (*Linear Algebra*, Strang)
- Matrices and vector spaces
- Projections
- Orthonormal projections and least-squares regression
- Eigenvalues and eigenvectors
- Multivariate Gaussian densities
- Discriminant functions

### Applications of Linear Algebra

- Face recognition using eigenfaces (Moghaddam and Pentland): solves the face-recognition problem using linear algebra and multivariate Gaussian density estimation; eigenvectors produce a reduced representation of faces.
- Information retrieval (Google): construct a singular value decomposition of the adjacency matrix underlying the web by doing a random walk, and look for dominant eigenvectors.
- Orthonormal basis expansions: wavelets, Fourier series, etc., are used to encode signals. Early layers in the visual cortex compute reduced representations of high-dimensional visual data (*Theoretical Neuroscience*, Dayan et al., MIT Press).

### Face recognition using eigenvectors

M. Turk and A. Pentland, "Eigenfaces for Recognition," *Journal of Cognitive Neuroscience* 3(1), 1991.

### Feature detection using eigenvectors

B. Moghaddam and A. Pentland, "Probabilistic visual learning for object representation." In S. K. Nayar and T. Poggio, editors, *Early Visual Learning*, pages 99–130. Oxford Univ. Press, 1996.

### Vectors

A vector $x$ is an $n$-tuple, e.g. $(x_1, x_2)$. A text document $d$ can be viewed as a very long vector where each component $d_i$, $1 \le i \le |V|$, represents the number of occurrences of a particular word. An image can be viewed as a vector of pixels.

A matrix is a list of vectors, one for each column, e.g.

$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$$

For example, the rows of a matrix $A$ could be documents and the columns could represent words; the $(i,j)$ entry $a_{ij} = 1$ if word $j$ occurs in document $i$. The transpose of a matrix, $A^T$, is obtained by interchanging the rows and columns of $A$. What is the matrix $A^T A$ if $A$ is the word-document frequency matrix?
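As a small illustration of the word-document question above, here is a sketch with an assumed 3-document, 3-word incidence matrix (the values are invented for illustration):

```python
import numpy as np

# Assumed document-word incidence matrix: rows = documents, columns = words.
A = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]])

# (A^T A)[j, k] counts the documents in which words j and k co-occur;
# the diagonal gives each word's document frequency.
G = A.T @ A

assert G[0, 0] == 2          # word 0 appears in 2 documents
assert G[0, 1] == 1          # words 0 and 1 co-occur in 1 document
assert (G == G.T).all()      # A^T A is always symmetric
```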
### Vector Operations

- Addition and subtraction are done component-wise.
- Multiplication by a scalar: $(cx)^T = (cx_1, \dots, cx_n)$.
- Inner product: $x \cdot y = x^T y = \sum_i x_i y_i$.
- Length of a vector: $\|x\| = \sqrt{x \cdot x}$.
- Outer product: $x y^T$ is an $n \times n$ matrix, e.g. $\begin{pmatrix} x_1 y_1 & x_1 y_2 \\ x_2 y_1 & x_2 y_2 \end{pmatrix}$.

Unit vectors have length 1, e.g. $i = (1, 0)$, $j = (0, 1)$, $u = (\cos\alpha, \sin\alpha)$. The unit vector in the direction of $x$ is $\frac{x}{\|x\|}$. A coordinate rotation by $\alpha$ can be represented by the two unit vectors $u_1 = (\cos\alpha, \sin\alpha)$ and $u_2 = (-\sin\alpha, \cos\alpha)$. Two vectors are orthogonal if $x \cdot y = 0$.

### Cauchy-Schwarz Inequality

The angle between two vectors $x$ and $y$ is given by (show this)

$$\cos\alpha = \frac{x \cdot y}{\|x\| \, \|y\|}$$

Since $|\cos\alpha| \le 1$, this immediately gives us the celebrated inequality (why?)

$$|x \cdot y| \le \|x\| \, \|y\|$$

### Matrix Inverses and Determinants

The inverse of a square invertible matrix $A$ is $A^{-1}$. In practice, inverses can be found by Gauss-Jordan elimination. The determinant of a matrix $A$ is a single scalar that represents the volume of the box whose edges are defined by the rows of the matrix. Useful properties:

- If we form $A'$ by interchanging two rows of $A$, then $|A'| = -|A|$.
- If we multiply a single row of $A$ by $c$, keeping all other rows fixed, then $|A'| = c|A|$.
- $|AB| = |A|\,|B|$.
- For $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$, $|A| = ad - bc$.

The cofactor is $C_{ij} = (-1)^{i+j} |M_{ij}|$, where $M_{ij}$ is formed by eliminating row $i$ and column $j$. The $(i,j)$ element of the inverse is $(A^{-1})_{ij} = \frac{C_{ji}}{|A|}$; in the $2 \times 2$ case,

$$A^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$$

### Vector Spaces

A vector space is a set of vectors closed under addition and multiplication by a scalar, with a $0$ element, additive inverses, a $1$ (identity) element for scalar multiplication, and the standard associativity and commutativity laws. Examples: $\mathbb{R}^n$ is the vector space of $n$-dimensional real vectors; the set of all $2 \times 2$ real matrices is a vector space; etc. A subspace of a vector space is a set of vectors that contains the $0$ element and is closed under addition and scalar multiplication.
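The vector operations and the Cauchy-Schwarz inequality above can be checked numerically; the vectors here are arbitrary examples:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0])
y = np.array([2.0, 1.0, 1.0])

inner = x @ y                       # inner product: sum_i x_i * y_i
length_x = np.sqrt(x @ x)           # ||x|| = sqrt(x . x)
outer = np.outer(x, y)              # outer product: n x n matrix of x_i * y_j

# Angle between x and y: cos(alpha) = (x . y) / (||x|| ||y||)
cos_alpha = inner / (np.linalg.norm(x) * np.linalg.norm(y))

# Cauchy-Schwarz: |x . y| <= ||x|| ||y||
assert abs(inner) <= np.linalg.norm(x) * np.linalg.norm(y)
```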
For example, all vectors on a line passing through $(0,0)$ constitute a subspace of $\mathbb{R}^2$. Is the quadrant made up of all $x$ such that $x_1 > 0, x_2 > 0$ a vector space? How about the set of all $2 \times 2$ symmetric matrices?

### Vector Spaces Associated with a Matrix

The column space $C(A)$ of an $m \times n$ matrix $A$ is the set of all linear combinations of its columns, $Ax = A_1 x_1 + A_2 x_2 + \cdots$, where $A_1$ is the first column of $A$, and so on. The row space $C(A^T)$ is the column space of the transpose $A^T$. The system of equations $Ax = b$ is solvable if and only if $b$ is in the column space of $A$.

The null space $N(A)$ is the set of all vectors $x$ such that $Ax = 0$; the left null space $N(A^T)$ is the set of all vectors $x$ such that $A^T x = 0$. What is the null space of $\begin{pmatrix} 1 & 2 \\ 3 & 6 \end{pmatrix}$? We solve the linear system

$$\begin{pmatrix} 1 & 2 \\ 3 & 6 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

The null space $N(A)$ consists of all scalar multiples of $(-2, 1)^T$.

### Example of Matrix Vector Spaces

Consider the matrix $A = (1\ 2\ 3)$. What is the column space $C(A)$? The row space $C(A^T)$? The null space $N(A)$? The left null space $N(A^T)$? Repeat the above for the matrix $A$.

### Independence

The columns of a matrix $A$ are linearly independent if the only solution to $Ax = 0$ is $x = 0$; e.g., $(1,0)$ and $(0,1)$ are independent. Any set of $n$ vectors in $\mathbb{R}^m$ is dependent if $n > m$. A set of vectors spans a space if their linear combinations fill the space. A basis for a vector space is a set of linearly independent vectors that span the space.

### Dimensions

The dimension of a vector space is the number of vectors in every basis; e.g., the dimension of $\mathbb{R}^n$ is $n$. The dimension of the row space is $r$, the number of independent rows; the dimension of the column space is also $r$, the number of independent columns. If the dimension of the row (column) space equals the number of rows (columns), the matrix $A$ is full rank and has an inverse $A^{-1}$.
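The rank and null-space example above can be verified with NumPy (a sketch; `matrix_rank` computes the rank numerically via the SVD):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 6.0]])

# The second row is 3x the first, so the rank is 1 and the
# null space has dimension n - r = 2 - 1 = 1.
r = np.linalg.matrix_rank(A)

# Every multiple of (-2, 1) is mapped to the zero vector.
v = np.array([-2.0, 1.0])

assert r == 1
assert np.allclose(A @ v, 0.0)
```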
The dimension of the null space is $n - r$, and the dimension of the left null space is $m - r$. Give the dimensions of the vector spaces associated with the matrix $A = (1\ 2\ 3)$; repeat for the matrix $A$.

*(Figure: the four fundamental subspaces of an $m \times n$ matrix $A$ — the row space $C(A^T)$ of dimension $r$ and null space $N(A)$ of dimension $n - r$ inside $\mathbb{R}^n$, and the column space $C(A)$ of dimension $r$ and left null space $N(A^T)$ of dimension $m - r$ inside $\mathbb{R}^m$, with $Ax = b$ mapping $\mathbb{R}^n$ to $\mathbb{R}^m$.)*

### Orthogonal Spaces

Two subspaces $W$ and $V$ are orthogonal if every $w \in W$ is perpendicular to every $v \in V$. A subspace $W$ forms an orthogonal complement to another subspace $V$ if for each vector $v \in V$ there exists a unique perpendicular vector $v^\perp \in W$. The null space $N(A)$ forms an orthogonal complement to the row space $C(A^T)$; the left null space $N(A^T)$ forms an orthogonal complement to the column space $C(A)$.

### Projection of a Vector onto a Line

We want to find the point $p$ on the line through a vector $a$ that is closest to another vector $b$. Since $p$ must be in the direction of $a$, $p = \hat{x} a$. The error $b - p$ must be perpendicular to $a$:

$$a \cdot (b - \hat{x} a) = 0 \quad\Rightarrow\quad \hat{x} = \frac{a \cdot b}{a \cdot a} = \frac{a^T b}{a^T a}$$

The projected vector is $p = a \frac{a^T b}{a^T a}$, and the projection matrix is $\frac{a a^T}{a^T a}$. Exercise: find the projection of a vector $b$ onto $a^T = (1, 2, 2)$, and find the projection matrix of $a$.

### Projection onto a Subspace

Given a set of linearly independent vectors $a_1, \dots, a_n$ in $\mathbb{R}^m$, find the combination $\hat{x}_1 a_1 + \cdots + \hat{x}_n a_n$ that is closest to a given vector $b$. The error must be perpendicular to each $a_i$:

$$a_1^T (b - A\hat{x}) = 0, \;\dots,\; a_n^T (b - A\hat{x}) = 0$$

These $n$ equations can be summarized as $A^T (b - A\hat{x}) = 0$, where $A$ is the $m \times n$ rectangular matrix with columns $a_i$. The projection is $p = A\hat{x} = A (A^T A)^{-1} A^T b$, and the projection matrix is $A (A^T A)^{-1} A^T$.

### Least-Squares Regression

We have just solved the learning problem of finding the best line that fits a set of points while minimizing the squared-error loss function: this is the problem of linear regression. Let the equation of the line in 2 dimensions be $y = Cx + B$. Given a set of points $(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)$, we want $B + Cx_i = y_i$ for every $i$:

$$\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{pmatrix} \begin{pmatrix} B \\ C \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}$$

Since $m > 2$ (often much larger), we have a system of equations that cannot generally be solved exactly: the 2 columns span only a small portion of the $m$-dimensional space.
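The projection formulas above can be sketched in NumPy; the vectors `a`, `b` and the matrix `A` here are assumed examples, not the (garbled) ones from the slides:

```python
import numpy as np

# Projection of b onto the line through a: p = a * (a.b / a.a).
a = np.array([1.0, 2.0, 2.0])
b = np.array([1.0, 1.0, 1.0])
p = a * (a @ b) / (a @ a)

# Equivalently, apply the projection matrix P = a a^T / (a^T a).
P = np.outer(a, a) / (a @ a)
assert np.allclose(P @ b, p)

# Projection onto the column space of A: P = A (A^T A)^-1 A^T,
# and x_hat = (A^T A)^-1 A^T b is the least-squares solution.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
P_A = A @ np.linalg.inv(A.T @ A) @ A.T

# A projection matrix is idempotent: applying it twice changes nothing.
assert np.allclose(P_A @ P_A, P_A)
```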
### Least-Squares Solution

Since we cannot find an exact solution to $Ax = b$, we find the approximation $\hat{x}$ that minimizes the distance $\|Ax - b\|$; it satisfies the normal equations $A^T A \hat{x} = A^T b$. Here

$$A^T A = \begin{pmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_m \end{pmatrix} \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_m \end{pmatrix} = \begin{pmatrix} m & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix}$$

so we solve the system

$$\begin{pmatrix} m & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix} \begin{pmatrix} B \\ C \end{pmatrix} = \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix}$$

### Least-Squares Projection: Vector Space View

*(Figure: $b$ is generally not in the column space $C(A)$; least squares projects $b$ onto $C(A)$, and the error $b - A\hat{x}$ lies in the left null space $N(A^T)$.)*

### Orthogonal Least-Squares Regression

We can make the computation even simpler by converting the $A^T A$ matrix into a diagonal matrix. The trick is to define centered variables $x_i' = x_i - \bar{x}$, where $\bar{x} = \frac{1}{m}\sum_i x_i$. Since $\sum_i (x_i - \bar{x}) = 0$, the off-diagonal entries vanish and the new system becomes

$$\begin{pmatrix} m & 0 \\ 0 & \sum_i (x_i')^2 \end{pmatrix} \begin{pmatrix} B \\ C \end{pmatrix} = \begin{pmatrix} \sum_i y_i \\ \sum_i x_i' y_i \end{pmatrix}$$

which is much easier to solve. This is an example of Gram-Schmidt orthogonalization.

### Orthonormal Vectors and Matrices

A set of vectors $q_i$ is orthonormal if $q_i^T q_j = \delta_{ij}$, which is 1 exactly when $i = j$ and 0 otherwise. So orthonormal vectors are both perpendicular to each other and unit vectors. Show that a matrix $Q$ whose columns are orthonormal vectors satisfies $Q^T Q = I$; if $Q$ is square, show that this implies $Q^T = Q^{-1}$. An example of an orthonormal matrix is

$$\begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}$$

which corresponds to a coordinate rotation by $\alpha$. Show that the permutation matrix

$$\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

is orthonormal.

### Gram-Schmidt Orthogonalization

How do we make a set of vectors into an orthonormal set? Let the first vector be $A = a$. The second vector $B$ must be made perpendicular to the first; we do that by subtracting its projection along $A$:

$$B = b - \frac{A^T b}{A^T A} A$$

The third vector can now be made perpendicular to the first two:

$$C = c - \frac{A^T c}{A^T A} A - \frac{B^T c}{B^T B} B$$

Repeat until done. Finally, normalize the length of every vector: $q_1 = \frac{A}{\|A\|}$, $q_2 = \frac{B}{\|B\|}$, etc. Example: orthogonalize the vectors $a = (1, 1, 0)$, $b = (2, 0, 2)$, $c = (3, 3, 3)$.
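A minimal sketch of the Gram-Schmidt procedure, applied to the example vectors from the slide:

```python
import numpy as np

def gram_schmidt(vectors):
    """Return an orthonormal basis spanning the given independent vectors."""
    basis = []
    for v in vectors:
        w = v.astype(float)
        # Subtract the projection of v onto each finished basis vector.
        for q in basis:
            w = w - (q @ v) * q
        basis.append(w / np.linalg.norm(w))
    return np.array(basis)

Q = gram_schmidt([np.array([1, 1, 0]),
                  np.array([2, 0, 2]),
                  np.array([3, 3, 3])])

# The rows of Q are orthonormal: Q Q^T = I.
assert np.allclose(Q @ Q.T, np.eye(3))
```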
### Eigenvalues and Eigenvectors

Generally, when we multiply a vector $x$ by a matrix $A$, the resulting vector $Ax$ is not in the same direction as $x$. Occasionally, however, it is, and the vector is then called an eigenvector. Formally, an eigenvector $x$ of a matrix $A$ satisfies $Ax = \lambda x$, where the scalar $\lambda$ is called an eigenvalue. Eigenvectors lie in the null space of $A - \lambda I$, because they satisfy $(A - \lambda I)x = 0$. Algorithm: solve the equation $|A - \lambda I| = 0$ to find the eigenvalues; for each resulting eigenvalue $\lambda_i$, solve $(A - \lambda_i I)x = 0$ to find the corresponding eigenvector. Exercise: find the eigenvalues and eigenvectors of a given $2 \times 2$ matrix.

### Properties of Eigenvalues and Eigenvectors

- The product of the $n$ eigenvalues gives the determinant of the matrix; so if the matrix is singular, $0$ must be an eigenvalue.
- The sum of the eigenvalues equals the trace, the sum of the diagonal entries of the matrix.
- If a square $n \times n$ matrix $A$ has $n$ linearly independent eigenvectors, then $S^{-1} A S = \Lambda$, or equivalently $A = S \Lambda S^{-1}$, where the columns of $S$ are the eigenvectors. To prove this, first show that $AS = S\Lambda$; the result then follows directly. Suggest a fast way to compute $A^k$ given this decomposition.

### Eigenvalues of Symmetric Matrices

If a matrix $A$ is symmetric, that is $A^T = A$, then all its eigenvalues are real. This leads to a fundamental spectral theorem: every symmetric matrix $A$ has the factorization $A = Q \Lambda Q^T$, where the eigenvector matrix $Q$ is comprised of an orthonormal set of vectors (an interesting property: if $Q$ is orthonormal then $Q^T = Q^{-1}$). Find the orthonormal set of eigenvectors for the previous matrix. Show that any symmetric matrix can be written as a sum of projection matrices,

$$A = \lambda_1 x_1 x_1^T + \cdots + \lambda_n x_n x_n^T$$

each of which is a projection along an eigenvector direction.

### Positive Definite Matrices

A symmetric matrix $A$ is positive definite if $x^T A x > 0$ for every nonzero vector $x$; if we allow equality, we get the property of positive semidefiniteness.
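The eigenvalue properties above (determinant = product of eigenvalues, trace = their sum, diagonalization, fast powers, positive definiteness) can be checked on a small symmetric example:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

lam, S = np.linalg.eig(A)

# det(A) = product of eigenvalues; trace(A) = sum of eigenvalues.
assert np.isclose(np.linalg.det(A), np.prod(lam))
assert np.isclose(np.trace(A), np.sum(lam))

# Diagonalization: A = S Lambda S^-1.
assert np.allclose(S @ np.diag(lam) @ np.linalg.inv(S), A)

# Fast matrix powers: A^k = S Lambda^k S^-1.
k = 5
assert np.allclose(S @ np.diag(lam ** k) @ np.linalg.inv(S),
                   np.linalg.matrix_power(A, k))

# A symmetric matrix with all-positive eigenvalues is positive definite.
assert np.all(lam > 0)
```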
Symmetric matrices with positive eigenvalues are positive definite: if $Ax = \lambda x$, then $x^T A x = \lambda x^T x > 0$ when $\lambda > 0$. For $A = \begin{pmatrix} a & b \\ b & c \end{pmatrix}$,

$$x^T A x = a x_1^2 + 2b x_1 x_2 + c x_2^2 > 0$$

Note that $x^T A x = 1$ is the equation of an ellipse; what we need to determine is its orientation and parameters. Result: given a positive definite matrix $A = Q \Lambda Q^T$, the graph of $x^T A x = 1$ is an ellipse (or ellipsoid)

$$x^T Q \Lambda Q^T x = y^T \Lambda y = \lambda_1 y_1^2 + \lambda_2 y_2^2 = 1, \qquad y = Q^T x$$

The half-lengths of the major and minor axes are $\frac{1}{\sqrt{\lambda_1}}$ and $\frac{1}{\sqrt{\lambda_2}}$.

### Multivariate Normal Distribution

*(Figure: surface plot of a 2-D normal density.)*

The multivariate normal distribution in $d$ dimensions is given by

$$p(\vec{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2} (\vec{x} - \vec{\mu})^T \Sigma^{-1} (\vec{x} - \vec{\mu})\right)$$

where $\vec{\mu} = E[\vec{x}]$ is a $d$-dimensional mean vector, $\Sigma = E[(\vec{x} - \vec{\mu})(\vec{x} - \vec{\mu})^T]$ is a $d \times d$ covariance matrix, and $|\Sigma|$ is the determinant of $\Sigma$.

### Covariance Matrix

The covariance matrix in 2 dimensions is

$$\Sigma = \begin{pmatrix} E[(x_1 - \mu_1)(x_1 - \mu_1)] & E[(x_1 - \mu_1)(x_2 - \mu_2)] \\ E[(x_2 - \mu_2)(x_1 - \mu_1)] & E[(x_2 - \mu_2)(x_2 - \mu_2)] \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$$

The covariance matrix is symmetric, which means that its eigenvalues are real and its eigenvectors form an orthonormal set (independent, unit length): if $\Sigma u_i = \lambda_i u_i$, then $u_i^T u_j = \delta_{ij}$. Since $\Sigma$ is symmetric, it can be written as $\Sigma = Q \Lambda Q^{-1}$, where $\Lambda$ is the eigenvalue diagonal matrix and $Q$ is the orthonormal eigenvector matrix. It immediately follows that the inverse covariance matrix can also be written in such a form (why?). This implies that $\Sigma^{-1}$ can be expressed as a sum of projection matrices:

$$\Sigma^{-1} = \sum_{i=1}^{d} \frac{u_i u_i^T}{\lambda_i}$$

### Mahalanobis Distance

The exponent in the multivariate normal distribution, $\Delta^2 = (\vec{x} - \vec{\mu})^T \Sigma^{-1} (\vec{x} - \vec{\mu})$, is called the Mahalanobis distance. If the covariance matrix is diagonal, $\Sigma = \sigma^2 I$, it reduces to (scaled) Euclidean distance. Consider a change of coordinates from the original $\vec{x}$ to $y = U(\vec{x} - \vec{\mu})$, where the rows of $U$ are the eigenvectors $u_i$. Then, using the above expression for $\Sigma^{-1}$, it can be shown that

$$\Delta^2 = \sum_{i=1}^{d} \frac{y_i^2}{\lambda_i}$$

Notice this shows that the inverse covariance matrix is positive semidefinite.
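A numerical sketch of the Mahalanobis distance and its eigenvector form, with an assumed 2-D covariance matrix:

```python
import numpy as np

# Mahalanobis distance D^2 = (x - mu)^T Sigma^-1 (x - mu),
# for an assumed 2-D Gaussian.
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
x = np.array([1.0, 2.0])

d2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)

# Equivalent form via the eigendecomposition Sigma = Q Lambda Q^T:
# with y = Q^T (x - mu), D^2 = sum_i y_i^2 / lambda_i.
lam, Q = np.linalg.eigh(Sigma)
y = Q.T @ (x - mu)
assert np.isclose(d2, np.sum(y**2 / lam))
```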
### Transformed Multivariate Gaussian

Note that in the transformed coordinates $y = U(\vec{x} - \vec{\mu})$, the multivariate Gaussian can be written in the following simplified form:

$$p(\vec{x}) = \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\lambda_j}} \exp\left(-\frac{y_j^2}{2\lambda_j}\right)$$

This is a product of $d$ independent normal distributions with mean $0$ and variance $\lambda_j$ (the $j$th eigenvalue). Thus the transformation $y = U(\vec{x} - \vec{\mu})$ decorrelates the original components.

### Singular Value Decomposition

So far we have seen how to decompose symmetric matrices into a product of orthogonal matrices (the eigenvector matrix) and a diagonal matrix (the eigenvalue matrix). What about nonsymmetric matrices? The key insight behind SVD is to find two orthogonal basis representations $U$ and $V$ such that for any matrix $A$ the following decomposition holds:

$$A = U \Sigma V^T$$

where $U$ and $V$ are orthogonal matrices (so $U U^T = I$ and $V V^T = I$), and $\Sigma$ is a diagonal matrix whose entries are called singular values (hence the meaning behind the term SVD).

We want to choose the orthogonal matrices so that not only are the $v_i$ an orthonormal basis of unit vectors, but the $A v_i$ are also orthogonal to each other; we can then construct the $u_i = \frac{A v_i}{\sigma_i}$ to give us the second orthonormal basis. In matrix form,

$$A \begin{pmatrix} v_1 & v_2 \end{pmatrix} = \begin{pmatrix} u_1 & u_2 \end{pmatrix} \begin{pmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{pmatrix}$$

So we get the relationships $AV = U\Sigma$, hence $U^T A V = \Sigma$ and $A = U \Sigma V^T$.

### Finding Orthonormal Basis Representations for SVD

How do we go about finding $U$ and $V$? Here is the trick: we can eliminate $U$ from the equation $AV = U\Sigma$ by premultiplying $A$ by its transpose:

$$A^T A = (U \Sigma V^T)^T (U \Sigma V^T) = V \Sigma^T U^T U \Sigma V^T = V \Sigma^T \Sigma V^T$$

Since $A^T A$ is always symmetric (why?), the above expression is exactly the familiar spectral decomposition $Q \Lambda Q^T$ we have seen before, except that now $V$ is the orthonormal eigenvector matrix of $A^T A$.
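The SVD construction above can be verified directly (the matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

U, s, Vt = np.linalg.svd(A)

# A = U Sigma V^T with U, V orthogonal and Sigma diagonal.
assert np.allclose(U @ np.diag(s) @ Vt, A)
assert np.allclose(U.T @ U, np.eye(2))
assert np.allclose(Vt @ Vt.T, np.eye(2))

# The squared singular values are the eigenvalues of A^T A.
assert np.allclose(np.sort(s**2), np.sort(np.linalg.eigvalsh(A.T @ A)))
```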
In a similar fashion, we can eliminate $V$ from the equation $AV = U\Sigma$ by postmultiplying $A$ by its transpose:

$$A A^T = (U \Sigma V^T)(U \Sigma V^T)^T = U \Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T$$

so the diagonal matrix $\Sigma \Sigma^T$ contains the eigenvalues of $A A^T$, and $U$ is its orthonormal eigenvector matrix.

### Examples of SVD

Find the SVD of the matrices given on the slide.

### SVD Technology: How Google Works

Google uses SVD-style computation to accelerate finding relevant web pages. Here is a brief explanation of the general idea. Define a web site as an *authority* if many sites link to it, and as a *hub* if it links to many sites. We want to compute a ranking $x_1, \dots, x_N$ of authorities and $y_1, \dots, y_M$ of hubs. As a first pass, we can compute the ranking scores directly: $x_i$ is the number of links pointing to $i$, and $y_j$ is the number of links going out of $j$. But not all links should be weighted equally; for example, links from strong authorities or hubs should count more. So we can revise the rankings as follows:

$$x_i = \sum_{j \text{ links to } i} y_j \;\Leftrightarrow\; x^{(1)} = A^T y^{(0)}, \qquad y_j = \sum_{j \text{ links to } i} x_i \;\Leftrightarrow\; y^{(1)} = A x^{(0)}$$

where $A$ is the adjacency matrix with $a_{ij} = 1$ if $i$ links to $j$. Clearly we can keep iterating, computing $x^{(k)} = (A^T A)\, x^{(k-1)}$ and $y^{(k)} = (A A^T)\, y^{(k-1)}$. This is basically an iterative SVD computation: as $k$ increases, the largest eigenvalues dominate. Google is doing this over a matrix $A$ of size roughly $3 \times 10^9$ by $3 \times 10^9$.

## CMPSCI 689: Probability Distributions

Sridhar Mahadevan (mahadeva@cs.umass.edu), University of Massachusetts

### Topics

- Review of probability theory
- Discrete and continuous distributions
- Graphical models

### Sample Spaces

A sample space $S$ is a set, finite or countably infinite; e.g., $\{H, T\}$ is the sample space for tossing a coin, and $\{1, \dots, 6\}$ is the sample space for rolling a die. An event is any subset $S_i \subseteq S$; e.g., $\{2, 4, 6\}$ denotes the event of getting an even roll. A collection of subsets $\mathcal{B}$ is a $\sigma$-algebra if:

- the empty set $\emptyset \in \mathcal{B}$;
- if $A \in \mathcal{B}$, then $A^c \in \mathcal{B}$;
- if $A_1, \dots, A_n, \dots \in \mathcal{B}$, then $\cup_i A_i \in \mathcal{B}$.

### Probability Axioms

Given a sample space $S$ and an associated $\sigma$-algebra $\mathcal{B}$, a probability function on $\mathcal{B}$ satisfies: $P(A) \ge 0$ for every $A \in \mathcal{B}$; $P(S) = 1$ (note that $S$ must be in every $\sigma$-algebra); and if $A_1, \dots, A_n, \dots \in \mathcal{B}$ is a collection of pairwise-disjoint subsets, then $P(\cup_i A_i) = \sum_i P(A_i)$.
### Conditional Probability and Bayes Rule

Given two events $A$ and $B$ in some sample space $S$, the conditional probability is $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$. Example: let $B$ denote the event of getting an even roll and $A$ the event of rolling a 2; then $P(A \mid B) = \frac{1/6}{1/2} = \frac{1}{3}$. Note that $P(A \mid B) P(B) = P(A \cap B) = P(B \cap A) = P(B \mid A) P(A)$. This immediately gives us the celebrated Bayes rule:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

### Marginal and Conditional Independence

Two events are marginally independent if $P(A \cap B) = P(A) P(B)$; e.g., let $A$ = "rain in Amherst" and $B$ = "Bush wins reelection in 2004". Two events are conditionally independent given a third event if $P(A \cap B \mid C) = P(A \mid C) P(B \mid C)$. Let $A$ = "small shoe size", $B$ = "no gray hair", $C$ = "age < 30"; note that in this case $A$ and $B$ are not marginally independent.

### Random Variables

A random variable $X$ is a function from a sample space $S$ into the real numbers $\mathbb{R}$. We denote the value of the variable by $X(s)$ for elements $s \in S$ of the sample space. A random variable $X$ induces a probability function $P_X$ on its range $\mathcal{X}$: $P_X(X = x) = P(\{s_j \in S : X(s_j) = x\})$. Examples of random variables: the number of heads in an experiment where a coin is tossed 100 times; the number of times the word "student" appears in an email document; etc.

### Graphical Models

A graph $G$ is a pair $(V, E)$, where $V$ is a set of vertices (or nodes) and $E$ is a set of edges. If the graph is directed, the edges are a set of ordered pairs of vertices; if undirected, a set of unordered pairs. We want to define probability distributions on graphs, and we will do this informally for the moment.

### Discrete Directed Graphical Model

For the directed cloudy/sprinkler/rain/wet-grass network of the slide, the joint distribution factorizes as

$$P(C, S, R, W) = P(C)\, P(S \mid C)\, P(R \mid C)\, P(W \mid S, R)$$

### Discrete Undirected Graphical Model

The joint distribution over an undirected graph factorizes as

$$P(A, B, C) = \frac{1}{Z}\, \psi(A, B)\, \psi(B, C)$$

where $Z$ is a normalization constant and the $\psi$'s are arbitrary real-valued potential functions of the form $\psi(C) = \exp(-E(C))$.
If the graph is undirected, the probability distribution is called a random field.

### Three Canonical Graphical Models

- Diverging ($B \leftarrow A \rightarrow C$): $P(B, C \mid A) = P(B \mid A)\, P(C \mid A)$.
- Converging ($B \rightarrow A \leftarrow C$): $P(B, C) = P(B)\, P(C)$.
- Serial ($B \rightarrow A \rightarrow C$): $P(C \mid A, B) = P(C \mid A)$.

### Analysis of Canonical Models

Diverging case ($B \perp C \mid A$):

$$P(B, C \mid A) = \frac{P(A, B, C)}{P(A)} = \frac{P(A)\, P(B \mid A)\, P(C \mid A)}{P(A)} = P(B \mid A)\, P(C \mid A)$$

Serial case ($C \perp B \mid A$):

$$P(C \mid A, B) = \frac{P(A, B, C)}{P(A, B)} = \frac{P(B)\, P(A \mid B)\, P(C \mid A)}{P(B)\, P(A \mid B)} = P(C \mid A)$$

### D-separation for Directed Graphs

A set of nodes $X$ is said to be d-separated from another set of nodes $Y$ given a conditioning set $Z$ if and only if the following is true of every undirected path from a node in $X$ to a node in $Y$:

- the path includes a diverging or serial node $V$ with $V \in Z$; or
- the path includes a converging node $V$ such that neither $V$ nor any of its descendants is in $Z$.

The Markov blanket of a node is the set of all its parents, its children, and the parents of its children. A node is d-separated from all other nodes in a graph given its Markov blanket.

### Example of D-separation

What is the Markov blanket of node D? Assume the immediate neighbors of node E are observed: is E d-separated from B? How about A?

### Conditional Independence for Undirected Graphs

In undirected graphs, conditional independence corresponds directly to graph separability: a set of nodes $X$ is conditionally independent of a set of nodes $Y$ given a conditioning set of nodes $Z$ if and only if every path from a node in $X$ to a node in $Y$ goes through a node in $Z$. Is E conditionally independent of B given its neighbors?

### Example Graphical Model: Spark Plugs

The recoverable conditional probability tables are:

| $P(\text{Gauge} \mid \text{Fuel} = \text{yes})$ | |
|---|---|
| Gauge = full | 0.39 |
| Gauge = half | 0.60 |
| Gauge = empty | 0.01 |

| $P(\text{Starts} = \text{yes} \mid \text{Fuel} = \text{yes}, \text{SparkPlugs})$ | |
|---|---|
| SparkPlugs = yes | 0.99 |
| SparkPlugs = no | 0.01 |

with priors $P(\text{Fuel} = \text{yes}) = 0.98$ and $P(\text{SparkPlugs} = \text{yes}) = 0.96$.

### Inference in the Automobile Model

Consider computing the probability $P(\text{Starts} = \text{yes} \mid \text{Fuel} = \text{yes})$.
In general we want to compute queries of the form $P(X \mid E)$, where $E$ represents the evidence and $X$ represents the query; both $X$ and $E$ can be a single node or a conjunction of nodes. To answer such queries we always start with the joint distribution and marginalize it as necessary. To begin with, we can express our query as

$$P(\text{Starts} = \text{yes} \mid \text{Fuel} = \text{yes}) = \frac{P(\text{Starts} = \text{yes}, \text{Fuel} = \text{yes})}{P(\text{Fuel} = \text{yes})}$$

We can compute the numerator by marginalizing over all values of the spark-plug status $SP$ and gauge reading $G$:

$$\sum_{SP} \sum_{G} P(F{=}\text{yes})\, P(SP)\, P(G \mid F{=}\text{yes})\, P(\text{Starts}{=}\text{yes} \mid F{=}\text{yes}, SP)$$

### Inference by Enumeration in the Automobile Model

*(Figure: the full enumeration tree over $SP$ and $G$; plugging in the table values, with $P(SP{=}\text{yes}) = 0.96$ and $P(SP{=}\text{no}) = 0.04$, gives $P(\text{Starts}{=}\text{yes}, \text{Fuel}{=}\text{yes}) \approx 0.9318$.)*

### Eliminating Redundant Steps

Inference by enumeration is an exponential $O(2^n)$ algorithm. If we examine the steps carefully, we notice many redundant or repeated calculations. With some thought we can reorder the computation of the numerator as

$$P(F{=}\text{yes}) \sum_{SP} P(SP)\, P(\text{Starts}{=}\text{yes} \mid F{=}\text{yes}, SP) \sum_{G} P(G \mid F{=}\text{yes})$$

and simplify even further by noting that the sum over $G$ is 1, giving

$$P(F{=}\text{yes}) \sum_{SP} P(SP)\, P(\text{Starts}{=}\text{yes} \mid F{=}\text{yes}, SP)$$

### Smarter Inference in the Automobile Model

This gives dramatic savings in computation: from 9 multiplications and 5 additions down to 3 multiplications and 1 addition, with the same answer, $0.98 \times (0.96 \times 0.99 + 0.04 \times 0.01) \approx 0.9318$.

### Inference in Graphical Models

Many clever insights have been developed to make inference more efficient:

- Variable elimination: use dynamic programming to cache intermediate terms that are summed over repeatedly.
- Junction tree algorithm: exploits a beautiful correspondence between the computation steps and operations on the underlying graph.
- Stochastic sampling: approximate inference by sampling the network.
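The enumeration above, and the reordered "smarter" version, can be sketched directly from the tables (probabilities taken from the slides):

```python
# CPTs from the slides' automobile network.
p_fuel_yes = 0.98
p_sp = {'yes': 0.96, 'no': 0.04}
p_gauge_given_fuel = {'full': 0.39, 'half': 0.60, 'empty': 0.01}
p_starts_given_fuel_sp = {'yes': 0.99, 'no': 0.01}  # P(Starts=yes | F=yes, SP)

# Inference by enumeration: sum over both hidden variables SP and G.
joint = sum(p_fuel_yes * p_sp[sp] * p_gauge_given_fuel[g]
            * p_starts_given_fuel_sp[sp]
            for sp in p_sp for g in p_gauge_given_fuel)

# Smarter: the sum over G contributes 1, so it can be dropped
# (3 multiplications and 1 addition instead of 9 and 5).
smart = p_fuel_yes * sum(p_sp[sp] * p_starts_given_fuel_sp[sp] for sp in p_sp)

assert abs(joint - smart) < 1e-12
assert abs(joint - 0.931784) < 1e-9        # the slides' value, ~0.9318

# The conditional query divides out the evidence probability.
p_starts_given_fuel = joint / p_fuel_yes
```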
Complexity of exact inference:

- Linear time for special types of models, e.g. hidden Markov models and hierarchical HMMs.
- Polynomial time for polytrees (Pearl).
- NP-hard for general DAGs (Cooper); approximate inference remains NP-hard (Dagum and Luby).

### Hidden Markov Models

*(Figure: models with hidden states, actions, and observations — hidden Markov models, abstract Markov models, hierarchical HMMs, Markov decision processes, and partially observable MDPs.)*

### Cumulative Distribution Function

The cumulative distribution function (CDF) of a random variable $X$, denoted $F_X(x)$, is defined as $F_X(x) = P(X \le x)$. Example: suppose we keep tossing a coin until we get a head, and let the random variable $X$ be the number of coin tosses. Show that the CDF is $F_X(x) = 1 - (1 - p)^x$.

We begin by noting that if the first head appears on toss $x$, then $P(X = x \mid p) = (1 - p)^{x-1} p$. The CDF requires us to sum up all the probabilities up to and including $x$:

$$F_X(x) = \sum_{i=1}^{x} (1 - p)^{i-1}\, p = 1 - (1 - p)^x$$

### Probability Mass Functions for Discrete Distributions

For a binary random variable $X$, the probability mass function is $f_X(x) = P(X = x)$, where $x = 0$ or $x = 1$. Let $P(X = 1) = \theta$ and $P(X = 0) = 1 - \theta$. A Bernoulli distribution is defined by the PMF

$$f_X(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}$$

A geometric distribution is defined by the PMF $f_X(x \mid p) = (1 - p)^{x-1} p$ for $x > 0$ (and $0$ for $x \le 0$). A binomial distribution is defined by the PMF

$$f_X(x \mid n, p) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x \ge 0$$

### Multinomial Distribution

A multinomial variable $x$ can be viewed as a vector of $M$ binary variables, $x = (x_1, \dots, x_M)^T$, where exactly one $x_k = 1$ and the remaining variables are 0. We denote the probability that $x_k = 1$ by $\theta_k$. The multinomial PMF of a single draw is then

$$f_X(x \mid \theta) = \prod_{k=1}^{M} \theta_k^{x_k}$$

subject to the constraints $\sum_i \theta_i = 1$ and $\sum_i x_i = 1$.
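A quick simulation of the multinomial model above (the probabilities `theta` are an assumed example):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.3, 0.5])       # must satisfy sum(theta) == 1
n = 10_000

# Each draw is a one-hot vector; rng.multinomial returns the summed counts.
counts = rng.multinomial(n, theta)

assert counts.sum() == n
# Empirical frequencies approach theta for large n.
assert np.allclose(counts / n, theta, atol=0.02)
```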
### Example: Generative Model of Documents

Consider the problem of classifying a given document $d_i$ into a category $c_j$. A document is an ordered collection of word events $w_{d_{i,k}}$, where each $w_t$ is a word in the vocabulary $V$. We assume that the document length is marginally independent of the category, that the word probabilities are conditionally independent given the category, and further that the conditional word probability is marginally independent of its location in the document:

$$P(d_i \mid c_j, \theta) = P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{d_{i,k}} \mid c_j, \theta)$$

We represent each word-category probability as a multinomial distribution $\theta_{w_t \mid c_j}$; for English, this random variable can take on the order of 50,000 values.

### Continuous Distributions in Graphical Models

*(Figures: scatter plots and density surfaces for continuous-valued nodes, e.g. $P(\text{buys} \mid c)$.)*

### Probability Density Function

For a continuous random variable $X$, the probability density function $f_X(x)$ is such that

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt \quad \text{for all } x$$

A PDF or PMF satisfies these properties: $f_X(x) \ge 0$ for all $x$, and $\sum_x f_X(x) = 1$ or $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$. The CDF of the logistic distribution is given by $F_X(x) = \frac{1}{1 + e^{-x}}$; hence its PDF is obtained by differentiation.

*(Figures: CDF and PDF of the logistic distribution.)*

### Expectation and Variance

The expected value, or mean, of a random variable $X$ is $E[X] = \int_{-\infty}^{\infty} x f_X(x)\, dx$ if $X$ is continuous, or $\sum_x x\, P(X = x)$ if $X$ is discrete.
The variance, or average squared deviation from the mean, of a random variable $X$ is $\text{var}(X) = E[(X - E[X])^2]$. Example: let $X \sim U(a, b)$ be a uniform random variable; compute its mean and variance.

### Expectation Properties

Expectation is a linear operation; this property and convexity of functions underlie Jensen's inequality and the EM algorithm. Nested expectation: $E[E[X]] = E[X]$. The expected deviation around the mean is 0: $E[X - E[X]] = 0$. Prove the following properties of variance:

$$\text{var}(X) = E[X^2] - (E[X])^2$$

$$\text{var}(a_1 X_1 + \cdots + a_n X_n) = a_1^2\, \text{var}(X_1) + \cdots + a_n^2\, \text{var}(X_n)$$

(for independent $X_i$).

### Expectation Minimizes Squared-Error Loss

Theorem: expectation minimizes squared-error loss, i.e. $b = E[X]$ minimizes $E[(X - b)^2]$. Expanding,

$$E[(X - b)^2] = E[(X - E[X])^2] + (E[X] - b)^2$$

Clearly we can only control the second term, and it is minimized by choosing $b = E[X]$.

### Moment Generating Function

The moment generating function of a random variable $X$ is

$$M(t) = E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} f_X(x)\, dx$$

Let us examine the derivatives of $M(t)$ at $t = 0$: $M'(t) = \int x e^{tx} f_X(x)\, dx$, so at $t = 0$ we find $M'(0) = E[X]$. Similarly, it is easy to show that $M''(0) = E[X^2]$, giving us the identity $\text{var}(X) = M''(0) - (M'(0))^2$.

### Joint, Marginal, and Conditional Density Functions

Let $(X, Y)$ be a bivariate random variable. The joint density (mass) function is defined by $P((X, Y) \in A) = \iint_A f_{X,Y}(x, y)\, dx\, dy$. The marginal density function is $f_X(x) = \int f_{X,Y}(x, y)\, dy$, and the conditional density function is

$$f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$$

### Covariance

Given two random variables $X$ and $Y$, the covariance between them is defined as $\sigma_{XY} = E[(X - E[X])(Y - E[Y])]$. The covariance can also be written (show this) as $\sigma_{XY} = E[XY] - E[X]E[Y]$. This makes it clear that variance is simply the covariance of a random variable with itself. When the random variable $X$ is a vector, the covariance of $X$ becomes a matrix; the full study of covariance matrices is postponed to the next lecture, since it requires the tools of linear algebra.

### Conditional Expectation

The conditional expected value of a random variable $Y$ given another variable $X$ is defined as $E[Y \mid X = x] = \sum_y y\, P(Y = y \mid X = x)$. Note that $E[Y \mid X]$ is a function of $X$, not a constant (unlike $E[X]$).
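The uniform-distribution example and the variance identity above can be checked by simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 5.0
x = rng.uniform(a, b, size=200_000)     # X ~ U(a, b)

# For U(a, b): E[X] = (a + b)/2 and var(X) = (b - a)^2 / 12.
assert abs(x.mean() - (a + b) / 2) < 0.01
assert abs(x.var() - (b - a) ** 2 / 12) < 0.01

# var(X) = E[X^2] - (E[X])^2
assert abs((x ** 2).mean() - x.mean() ** 2 - x.var()) < 1e-6
```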
Double conditioning: $E[E[Y \mid X]] = E[Y]$, since

$$E[E[Y \mid X]] = \sum_x \Big( \sum_y y\, P(Y{=}y \mid X{=}x) \Big) P(X{=}x) = \sum_y y\, P(Y{=}y) = E[Y]$$

### Fundamental Result on Variance

Lemma: $\text{var}(Y) = E[\text{var}(Y \mid X)] + \text{var}(E[Y \mid X])$. Proof: note that

$$\text{var}(Y) = E[Y^2] - (E[Y])^2 = E[E[Y^2 \mid X]] - (E[E[Y \mid X]])^2$$

Also note that $\text{var}(Y \mid X) = E[Y^2 \mid X] - (E[Y \mid X])^2$, so $E[\text{var}(Y \mid X)] = E[E[Y^2 \mid X]] - E[(E[Y \mid X])^2]$. If we substitute the above into the expression $\text{var}(Y) - E[\text{var}(Y \mid X)]$, we get $E[(E[Y \mid X])^2] - (E[E[Y \mid X]])^2 = \text{var}(E[Y \mid X])$.

### Fundamental Theorem of Regression

Regression means estimating a function $h(X)$ of the input variables $X$ that minimizes the mean-squared-error loss function with respect to the true value $Y$ associated with $X$. Theorem: the conditional mean $E[Y \mid X]$ minimizes the mean-squared-error loss function over all possible functions $h(X)$:

$$E[(Y - h(X))^2] \ge E[(Y - E[Y \mid X])^2]$$

Proof: left as a homework exercise (extend the earlier proof).

### PDF and CDF of the Normal Distribution

*(Figure: plots of the normal density, `dnorm(x)`, and the normal cumulative distribution, `pnorm(x)`.)*

### Normal Distribution

Consider the following integral: $I = \int_{-\infty}^{\infty} e^{-y^2/2}\, dy$. Evaluating it requires a trick: instead of computing $I$, we proceed to compute

$$I^2 = \int_{-\infty}^{\infty} e^{-x^2/2}\, dx \int_{-\infty}^{\infty} e^{-y^2/2}\, dy = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2 + y^2)/2}\, dx\, dy$$

Let us change to polar coordinates, setting $x = r\sin\theta$ and $y = r\cos\theta$:

$$I^2 = \int_0^{2\pi} \int_0^{\infty} e^{-r^2/2}\, r\, dr\, d\theta = 2\pi \Big[ -e^{-r^2/2} \Big]_0^{\infty} = 2\pi$$

We have just derived the PDF of the normal distribution, namely

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy = 1$$

### General Univariate Normal Distribution

Let us introduce the change of variable $y = \frac{x - \mu}{\sigma}$. Then we can easily show that

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x - \mu)^2 / 2\sigma^2}\, dx = 1$$

We can also show that $E[X] = \mu$ and $\text{var}(X) = \sigma^2$.

### Multivariate Normal Distribution

*(Figure: surface plot of a 2-D normal density.)*
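The law of total variance stated above can be checked with a simple two-component mixture (the component parameters are assumed for illustration):

```python
import numpy as np

# X is a fair coin; given X = heads, Y ~ N(0, 1); given X = tails, Y ~ N(3, 4).
rng = np.random.default_rng(1)
n = 500_000
heads = rng.random(n) < 0.5
y = np.where(heads, rng.normal(0.0, 1.0, n), rng.normal(3.0, 2.0, n))

# E[var(Y|X)] = 0.5*1 + 0.5*4 = 2.5
# var(E[Y|X]) = variance over the means {0, 3}, each w.p. 0.5 = 2.25
# var(Y) = E[var(Y|X)] + var(E[Y|X]) = 4.75
assert abs(y.var() - 4.75) < 0.05
```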
random variables Sn where each variable Xi has a finite mean u and variance 02 Define the standardized sum 8 It follows that ESj 0 and varSj 1 Central limit theorem For large n Sf is approximately standard normal More generally a sum of normally distributed variables is normal In the multivariate case projections of normal variables remain normal CMF SCI 689 p5051 Gamma Function 39 We now introduce a key distribution based on the gamma function W yaleydy 0 Using integration by parts we can show that PCK 2 CK 1I cr 1 If a is an positive integer gt 1 then it follows that Her 2 a 1 CMF SCI 689 p5151 Gamma Distribution Let us introduce a change of variable y 905 rltagtfooolt5gt 1e 1 dx 5 5 Thus the general form of the gamma distribution can be written as 00 1 fxcv O Pa axa 1 dx The gamma distribution is important because many others emerge as special cases of it For example setting a 1 and B 2 gives us the wellknown exponential distribution f ac Ace M Also if we set a r2 and B 2 we get the wellknown X20 distribution 1 7o2 1 c2 x I r227 2 x e CMF SCI 689 p5251 Gamma Distribution 39 The mean of the Gamma distribution can be computed as 0 1 a1 EX O Pa axx e dx 39 Notice that the kernel of the above integral is actually a Gammaa 15 pdf We can use this fact to write 00 1 a g EX 2 f0 Pmmax e 3 1 a1 mom a 1 CKWOKW PM g 39 The variance of a gamma distributed random variable is varX a62 CMPSCI 689 p5351 CMPSCI 689 Probability Entropy and Bayesian Inference Sridhar Mahadevan mahadevacs umass edu University of Massachusetts i CMPSCI 689 p14 Why Study Probability and Statistics The true logic of the world lies in the calculus of probabilities James Clerk Maxwell i CMPSCI 689 p24 Why Study Probability and Statistics 0 The true logic of the world lies in the calculus of probabilities James Clerk Maxwell Probabilities are not about numbers they are about the structure of reasoning Glenn Shafer i CMPSCI 689 p24 Why Study Probability and Statistics The true logic of the world lies in 
the calculus of probabilities." (James Clerk Maxwell)
"Probabilities are not about numbers; they are about the structure of reasoning." (Glenn Shafer)
"We are drowning in a sea of information and starving for knowledge." (Rutherford Roger)
"The quiet statisticians have changed our world, not by discovering new facts or technical developments, but by changing the ways that we reason, experiment, and form our opinions." (Ian Hacking)

Topics
- Review of probability theory
- Bayes rule
- Conditional independence
- Random variables
- Information content of random variables
- Discrete and continuous distributions
- Fundamental theorem of regression
- Normal distribution and law of large numbers

Sample Spaces
A sample space S is a set (finite, countable, or uncountable) of outcomes of some random
experiment; e.g., {H, T} is the sample space for tossing a coin, and {1, ..., 6} is the sample space for rolling a die. An event is any subset S_i ⊆ S; e.g., {2, 4, 6} denotes the event of getting an even roll. A collection of subsets B is a σ-algebra if:
- the empty set ∅ ∈ B;
- if A ∈ B, then A^c ∈ B;
- if A_1, ..., A_n, ... ∈ B, then ∪_i A_i ∈ B.
Sample spaces can be uncountable, e.g., R = (−∞, ∞). A σ-algebra over R is generated by the set of all intervals [a, b], [a, b), (a, b], and (a, b), where unions are restricted to be countably infinite, ∪_{i=1}^∞ A_i. The extension to higher dimensions, e.g., the plane R², is straightforward.

Probability Axioms
Given a sample space S and an associated σ-algebra B, we can define a probability function P on B as follows:
- P(A) ≥ 0 for every A ∈ B;
- P(S) = 1 (note that S must be in every σ-algebra);
- if A_1, ..., A_n, ... ∈ B is a collection of pairwise-disjoint subsets, then P(∪_i A_i) = Σ_i P(A_i).

Conditional Probability and Bayes Rule
Given two events A and B in some sample space S, the conditional probability is

P(A | B) = P(A ∩ B) / P(B)

Example: let B denote the event of getting an even roll, and let A denote the event of getting a 2. Then P(A | B) = (1/6) / (1/2) = 1/3. Note that P(A | B) P(B) = P(A ∩ B) = P(B ∩ A) = P(B | A) P(A). This immediately gives us the celebrated Bayes rule:

P(A | B) = P(B | A) P(A) / P(B)

Rule of Total Probability
Let the sample space S be partitioned into a set of mutually exclusive subevents A_i, 1 ≤ i ≤ n. The probability of any event B can then be expressed as

P(B) = P(B ∩ S) = Σ_i P(B ∩ A_i) = Σ_i P(B | A_i) P(A_i)

This rule is extremely useful in practice, since it allows us to compute the denominator in Bayes rule.

Joint Distributions
Let E_1, ..., E_n be n events that are relevant to a hypothesis H.
To compute the joint distribution P(H, E_1, ..., E_n), we can recursively apply the definition of conditional probability:

P(H, E_1, ..., E_n) = P(H | E_1, ..., E_n) P(E_1, ..., E_n)
                    = P(H | E_1, ..., E_n) P(E_n | E_1, ..., E_{n−1}) ··· P(E_1)

Note that it is very expensive to store a joint distribution, which we will address in the next lecture.

Bayes Rule Problem
Three CS grad students (Alice, Bob, Cindy) have been officially announced as finalists for a prestigious departmental award. The chair selects a winner and informs the graduate director, asking him to withhold notifying the winner while he is away at a DARPA meeting. Alice meets the graduate director at the departmental potluck and anxiously tries to extricate information from the director regarding the identity of the winner. The graduate director declines to reveal the identity of the winner. Alice then asks whether Bob or Cindy will not receive the award. After some thought, the director tells her that Bob will not win the award. Alice goes away happy, believing that with this new information she has increased her chance of winning the award. The graduate director, however, reasons that since either Bob or Cindy would not receive the award anyway, the information he has given Alice should not increase her confidence in winning the award. Alice later meets Cindy at the party, revealing what she knows. Cindy is now extremely happy, since she is convinced that she has the best chance to win the award. Who has reasoned correctly?

Bayes Rule Problem: Alice's Reasoning
Let Alice, Bob, and Cindy denote the event of each of them winning the award. Clearly P(Alice) = P(Bob) = P(Cindy) = 1/3. Alice interpreted the graduate director's statement to mean that Bob will not win. Let this event be denoted ¬Bob:

P(Alice | ¬Bob) = P(¬Bob | Alice) P(Alice) / P(¬Bob) = (1 · 1/3) / (2/3) = 1/2

which explains why Alice became more confident of her winning the award. Is she correct in her reasoning?

The Flaw in Alice's Reasoning
The flaw in Alice's reasoning resulted from her using an incorrect generative model of the graduate director. In response to her question, it is not what the graduate director said but what he could have said that needs to be modeled. Alice should have reasoned as follows: if I indeed were the winner of the prize, the graduate director could have equally well said "Cindy will not win the prize." So the correct generative model of the graduate director is:

Winner | Director's statement  | Likelihood
Alice  | "Bob did not win"     | 1/2
Alice  | "Cindy did not win"   | 1/2
Bob    | "Cindy did not win"   | 1
Cindy  | "Bob did not win"     | 1

Bayes Rule Problem: Alice's Analysis Revised
Denote by D_¬Bob the event that the director said Bob is not the winner. As before, P(Alice) = P(Bob) = P(Cindy) = 1/3. Using Bayes rule:

P(Alice | D_¬Bob) = P(D_¬Bob | Alice) P(Alice) / P(D_¬Bob) = (1/2 · 1/3) / P(D_¬Bob)

where, by the rule of total probability,

P(D_¬Bob) = P(D_¬Bob | Alice) P(Alice) + P(D_¬Bob | Bob) P(Bob) + P(D_¬Bob | Cindy) P(Cindy)
          = 1/2 · 1/3 + 0 · 1/3 + 1 · 1/3 = 1/2

and so P(Alice | D_¬Bob) = (1/6) / (1/2) = 1/3. We thus conclude that Alice indeed is no more certain to win.

Bayes Rule Problem: Solution to Cindy's Reasoning
P(Cindy | D_¬Bob) = P(D_¬Bob | Cindy) P(Cindy) / P(D_¬Bob) = (1 · 1/3) / (1/2) = 2/3

So Cindy is correct in getting more excited, since the director's information has now doubled her chances of winning. The graduate director was correct in his reasoning that Alice would not gain any information from his indiscretion, but incorrect in his assumption that graduate students don't talk amongst themselves.

Whimsical Grad Director Variant
The initial conditions of the problem remain the same. When Alice asks the graduate director whether Bob or Cindy will not receive the award, the graduate director decides to give a probabilistic response, being a fan of randomized algorithms. With probability α he informs Alice that Bob is not the winner, and with probability 1 − α he tells her that Cindy is not the winner. Is there a value of α for which Alice should be justifiably more confident of winning the award?

Jeopardy Grad Director Variant
In the popular game show Jeopardy™, contestants are given the answers and asked to formulate the question.
39 What question should Alice have asked the director in order for his answer Bob will not get the award to indeed increase her chances of winning i CMPSCI 689 31645 Marginal and Conditional Independence 39 Two events are marginalyindependent if PA n B PAPB eg let A 2 Rain in Amherst and let B 2 Bush wins reelection in 2004 Two events are conditionaly independent given a third event if PA BC PAC39PBC39 39 Let A Small Shoe size B No Gray hair C Age30 Note that in this case A and B are not marginally independent i CMPSCI 689 p174 Random Variables A random variable X is a function from a sample space S into the real numbers R We denote the value of the variable by Xs for elements E S of the sample space A random variable X induces a probability function PX S gt X where X is the range of X PXX PSj E S Xsj Examples of random variables Number of heads in an experiment where a coin is tossed 100 times number of times the word student appeared in an email document etc CMF SCI 689 31845 Information Content of Random Variables Given a single outcome of a random variable x Li the Shannon information content is defined as 1 hv ai log2 Pz Let the variable as take on values in a set a1 an with associated probabilities p1 7pm The entropy of the set of possible outcomes is defined as 1 HX 210239 10g2 Let X and Y be two variables Define the joint entropy 1 Fwy HX7Y 21301073 10g2 If X and Y are marginally independent what is theirjoint entropy CMF SCI 689 31945 Decision Trees and Information 39 You are given 12 balls each of identical weight except for one that is either lighter or heavier you don t know which You are also given a two pan balance In each pan you may place any number of balls on the left pan and the same number on the right The pan yields three outcomes the balls on the left are lighter the balls on the right are heavier or the weights are equal 39 What is the minimum number of weighings that you need to find the defective ball i CMPSCI 689 p204l Cumulative 
Distribution Function 39 The cumulative distribution function cdf of a random variable X denoted by FXx is defined as FXx PX X g x Example Suppose we keep tossing a coin till we get a head Let the random variable X be the number of coin tosses Show that the CDF Fxv 1 1 pm lamwi CMPSCI 689 p214 Cumulative Distribution Function We begin by noting that if we toss a coin 3 times before getting a head then PYxW pV 39 Next we observe that the CDF requires us to sum up all the probabilities up to and including as 1 pV FXXxgtZplt1 pgti1 m z391 i CMPSCI 689 32245 Probability Mass Function for Binary Distributions 39 For a binary random variable X the probability mass function is fX1c PX x where x 0 orx 1 Consider a binary rv X Let PX 1 6 and PX 0 1 0 A Bernoulli distribution is defined by the following PMF fxvl0 0m 0W 39 A geometric distribution is defined by the following PMF c l x fxltxpgt 1 p 1quot gt0 0 x g 0 39 A binomial distribution is defined by the following PM F fXnP Zpm1pnm 1 x20 0 xlt0 i CMPSCI 689 p234t Multinomial Distribution A multinomial variable x can be viewed as a vector of M binary variables x 1 xMT where only one ask 2 1 and the remaining variables are 0 We denote the probability that X i by 6139 In any given experiment let the indicatorvariable xi denote the outcome of a sample taking on value 2 Then the multinomial pdf of a dataset D of size N is given by fXac0 2 0T1 03 Note the constraints 231 6139 1 and 231 xi 1 CMF SCI 689 p244 Example Generative Model of Documents 39 Consider the problem of classifying a given document di into a category cj A document is an ordered collection of word events wd where wt is a word in the vocabulary V 11111122 lel 1317 39 We assume that the document size is marginally independent of the category but that the word probabilities are conditionally independent given the category Further we assume that the conditional word probability is marginally independent of its location in the document Idil Pdz390j79 Pldz39 H 
Pde0j79 k1 39 We represent each word category probability as a multinomial distribution lvll 61731 339 For English this random variable takes 50 000 values E Q Pdz390j79 Pdz39 91jxl j k 1 1 i CMF SCI 689 32545 139 N Probability Density Function 39 For a continuous random variable X the probability density function f X x is such that FXx 2C fXtdt forall x 39 A PDF or PM F satisfies these properties 39 fXx 2 0 for all 3 Ex fx 1 Giff00 fxvdx 1 39 The GDP of the logistic distribution is given by 1 1e x Hence its PDF is given by F x dX 1 63 962 H CMPSCI 689 32645 CDF 0f Logistic Distribution CDF of Logistic Distribution 09 08 07 06 05 PX lt x 04 03 02 01 i CMPSCI 689 p274t PDF of Logistic Distribution PX x 09 08 07 06 05 04 03 02 01 PDF of Logistic Distribution I exp x22 7 CMPSCI 689 p284 Expectation and Variance 39 The expected value or mean of a random variable X is EltXgt ff ZUfXxdx Xis continuous 236625 PX x Xis discrete 39 The variance or average squared deviation from mean of a random variable X is VarX EX EX2 Example Let X N Ua7 b uniform rv var X i 1 2 CMPSCI 689 12945 Expectation Properties 39 Expectation is a linear operation Ea1X1 l CLan a1EX1 l amEXn 39 Nested expectation EEX EX Expected deviation around mean is 0 EX EX 0 Note that the variance can also be written as show this varX EX2 EX2 Given a set of independent random variables X1 Xn Vara1X1 CLan aVarX1 l aiVoNXn CMPSCI 689 33045 Expectation minimizes squared error loss lT i ieorem Show that expectation minimizes squared error loss EX 92 2 Clearly we only can control the second term above and it is minimized by choosing b EX i CMPSCI 689 p314 Moment Generating Function 39 The moment generating function of a random variable X is Mt EetX foo etxfxdx 39 Let us examine the derivates of Mt at t 0 dMt M t dt 00 we xdx 39 Thus at t 0 we find that OO Mmf wmmmmu OO 39 Similarly it is easy to show that M 0 EX2 39 Giving us the identity VarX M 0 M 02 i CMPSCI 689 33245 Joint marginal and conditional density 
functions 39 Let X Y be a bivariate random variable The joint density mass functions can be defined as follows i fXYx7y x7Y PXY E A Afxydx dy 39 The marginal density function is defined as mm fvyydy fxv Z fv y y 39 The conditional density function is defined as i fXYyPXxYyW fX CMPSCI 689 33345 Covariance 39 Given two random variables X and Y the covariance between them is defined as 039XY EX EXY EY The covariance can also be written as show this 039XY EXY EXEY 39 This makes it clear that variance is simply the covariance of a random variable with itself When the random variable X is a vector the covariance of X becomes a matrix The full study of covariance matrices is postponed to next lecture since it requires the tools of linear algebra i CMPSCI 689 p344l Conditional expectation 39 The conditional expected value of a random variable gY given another variable X is defined as EYIX ZyPW le iv Note that EYX is a function of X not a constant unlike EX 39 Double conditioning EEYX EY E Y zyzpltxxyygt zzypltxxyygt LC i CMPSCI 689 33545 2 Z yPY yX 30 PX x Fundamental Result 0n Variance Lemma varY EvarYX va7 EYX Proof Note that varY EY2 EY2 EEY2X EEYX2 Also note that mam EltY2Xgt ltEltYXgtgt2 EltWltYXgtgt EltEltY2Xgtgt EltltEltlegtgt2gt If we substitute the above into the expression varY EvarYX we get varEYX i CMPSCI 689 33645 Regression Suppose the data set D 1341 atmyn is such that yi are continuous the goal is to approximate the data set by some function h X gt Y for simplicity assume both X and Y are scalar real numbers 0 Define the squarederror loss function as Lyf9304 y f93062 Optimal regression function The conditional mean minimizes the expected squared error loss d CMF SCI 689 p374t Polynomial Regression xdana Fundamental Theorem of Regression 39 Regression means estimating a function hX of the input variables X that minimizes the meansquared error loss function with respect to the true value Y associated with X Theorem The conditional mean EYX minimizes the mean 
squared error loss function over all possible functions hX EY hX2 Z 130 EYX2 Proof Left as a homework exercise extend the earlier proof i CMPSCI 689 p394 PDF and CDF of N ormal Distribution functionx dn ormx x functionx pnormx x Normal density dn ormx dnormx V2 Normal Cumulative pnormx normx 4 2 0 2 4 CMPSCI 689 p404 Normal Distribution 39 Consider the following integral I foo 6 y22dy 39 To evaluate this integral requires the following trick Instead of computing I we proceed to compute I 2 12 foo 6 9622dx foo 77922053 i foo OO m2y22dxdy i CMPSCI 689 p414 Normal Distribution Let us change to polar coordinates setting as mine and y 9 cos 0 l 27 oo 2 2 12 Z e O Sin 6 7 cos 6 27a O O 27 oo 2 fr 2r drdQ 0 0 27139 T227 dr 0 27rlt e 22gt 8 227139 We have just derived the PDF of the normal distribution namely 00 1 e922dy 1 00 27139 i CMPSCI 689 34245 General Univariate Normal Distribution 39 Let us introduce the change of variable 39 Then we can easily show that fv 2 1 63 96 02202dx 1 oo 271 02 39 We can also show that Efx i and that Varfx a2 CMPSCI 689 p434 2D Normal CMPSCI 689 p444 Multivariate Normal Distribution Central Limit Theorem 39 Let us try to motivate the normal Gaussian distribution through a different route 39 Suppose we consider sums of random variables Sn where each variable Xi has a finite mean u and variance a2 Define the standardized sum 5 It follows that ES 0 and varS L 1 39 Central limit theorem For large n 5 is approximately standard normal More generally a sum of normally distributed variables is normal In the multivariate case projections of normal variables remain normal i CMPSCI 689 p454 Review of EM and Mixture Models Intro to Hidden Markov Models Sridhar Mahadevan mahadevacs umass edu University of Massachusetts Sridhar Mahadevan CMPSCI 689 p14 The General EM Algorithm Initialize parameters 60 39 Repeat until convergence 1 Expectation Step At the j 1th step compute the expected loglikelihood function map Eme lXobs ej log PX6 Z Pam mobs 92 log 
PX6 m7n 2 Maximization Step Find a new parameter setting 6H1 such that W argmax9Q66j 3 Setjlt j1gotostep1 Sridhar Mahadevan CMPSCI 689 p24 Mixture Models Formal Derivation We now formally derive the EM algorithm for the general case of mixture models a g a PX0bs6 ZWiPXobs6i z1 where the mixing parameters 7 are unknown except that 21 m 1 The component mixture densities are also unknown but let us assume they are normally distributed with parameters 6 W a Let us now postulate latent binomial indicator variables that specify which mixture each sample came from The complete dataset is now X XObSXmS where Xms ZlTZ2TZT zij Zji is a binary variable that indicates whether sample 9 comes from mixture 2 Sridhar Mahadevan CMPSCI 689 p34 The EStep 39 Now we have to compute the loglikelihood of the complete data n g Z1 combo 10gPX7 H mpmjleb 3 j1z 1 n n g a Zzij logm l ZZzM logPxj6i j1 i1 j1 i1 39 Now the missing data here are the zij Since the complete loglikelihood above is linearin the missing data it follows that the Estep in EM simply corresponds to replacing the zij with its expected value conditioned on the observation xj and old parameter setting 6 k k A N Px39Q Z3 ZX 9 a 39 M Pltm6 gt Sridhar Mahadevan CMPSCI 689 p44 The M step 39 The Mstep correspods to finding a new parameter setting 6k that maximizes the complete data loglikelihood Jensen s inequality assures us this is a lowerbound on the observed data loglikelihood 39 To compute the maximum likelihood over the complete dataset we introduce a Lagrangian multiplier term 231 m 1 so that the gradient maximization now becomes amen a Ak 9 8 7 amltz Zijlogmeltz rl j1z 1 i1 39 where we have dropped the term that did nt depend on 7 a 8lC76X 213 Z 377i 2 7n Sridhar Mahadevan CMPSCI 689 p54 The M step 39 Setting the above to 0 we first solve for A n Ak n ZiA 0ez fjAm0 j17Ti j1 39 We use the identitites 2 m 2f 25 1 in solving for A n g g n 25AZM 021A0gtA n 1i1 a Using this value of A in the original gradient gives the desired 
value for the mixing parameter m We 71 z J J Z Sridhar Mahadevan CMPSCI 689 p64 MStep Component Densities How do we estimate the parameters of the component densities We now turn to the second term in the loglikelihood equation n n 9 C X E 2510ng Z 25 logPxj j1 z1 j1 z1 Expanding the second term only we get oz 7 X a n 9 1 3quot 2 C 2 Ale 2a 2 2 10 6 1 3M 3M z1 Z3 g 2770i2 Solving for W gives us a a n g 55339 M 3M 3 31 z1 Z3 2770i2 7L 2 E Zij Sridhar Mahadevan CMPSCI 689 p74 A 039 EM Algorithm for Mixture of Univariate Gaussians Expectation Step initialize 6 fro g2 ampZ20 2 Z wffwxju zltamp3vv 23 321 f leuf f Maximization Step A A Ah 1 laml 231 amp2k1 231 fr 2 TL A n A 2 291 z 291 g TL A 22571 91 Sridhar Mahadevan CMPSCI 689 p84 EM Algorithm for Mixture of Bivariate GallSSlaIlS A0 A0 A0 Expectation Step Inltlallze 9 71 Mi 2 k k Ak 2k 2 77 PWJ IMZ 7 2i Z 9 k k Ak i1 71Ple zazz Maximization Step 71 Ak n Ak Ak1 Ak1 T Ak1 231 27 ij ilk 231 Zz jxj Hz 902 Mi j1 ij 21 ij 7L j1 Sridhar Mahadevan CMPSCI 689 p94 EM for TwoMixture Bivariate Gaussian Generalize previous expressions from univariate to bivariate case Density1 Density2 Mixture 12 0 10 2 8 4 6 6 4 2 8 10 10 20 10 5 0 X 104 log likelihood mIXIng param 0 15 08 X X X X X X X X 05 10 06 5 1 04 0 I T 15 5 02 XX JLHAK X 2 10 39 0 0 5 10 10 0 10 20 0 5 10 Moliulldl Ivlailauc all CMF SCI 689 p104i Hidden Markov Models HMMs can be viewed as extending mixture models to include state transition dynamics For HMMs the IID assumption no longer holds true at the level of individual samples However we will assume that sequences of observations are IID Later in hierarchical HMMs we will relax the IID assumption at the level of subsequences as well Sridhar Mahadevan CMPSCI 689 p114l Applications of Hidden Markov Models Perception face recognition gesture recognition handwriting speech recognition Robot navigation Biology DNA sequence prediction Language analysis part of speech tagging Smart rooms wearable devices Sridhar Mahadevan 
CMPSCI 689 p124 Hidden Markov Models References Model developed by Baum in the sixties The training method was later found to be an instance of the EM framework the wellknown BaumWelch algorithm HMMs are instances of dynamic Bayesian networks DBNs Dean et al 1989 Murphy 2002 L R Rabiner A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition Proceedings of the IEEE 77 no 2 257 285 February 1989 Fred Jellinek Statistical Methods in Speech Recognition MIT Press 1997 Sridhar Mahadevan CMPSCI 689 p134l HMMs for Speech Recognition Russell and Norvig 2003 Sampled quantized digital signal Analog acoustic signal IV V Ilillillll Illll Output probabilities for the phone HMM Onset Mid End C1 05 C3 02 C4 01 C2 02 C4 07 C6 05 C3 03 C5 01 C7 04 I 10 15 38 I I 22 63 24 I I 10 12 73 Frames with features 39 39 39 39 39 39 I 52 47 82 I I 89 94 HI I Phone HMM for m 03 09 04 Sridhar Mahadevan CMPSCI 689 p144 HMMs for Speech Recognition Russell and Norvig 2003 a Word model with dialect variation Sridhar Mahadevan CMPSCI 689 p154 Graphical Model Representation of HMMs H Hf gt i y0 V1 y2 yT 39 Note how conditioning on a state qt dseparates the observation sequence into three categories current observation yt the past yo yt1 and the future yt17HwyT39 Sridhar Mahadevan CMPSCI 689 p164 HMM De nition 39 A finite set of states Q 39 lQl M 39 q is a multinomial random variable 2 1 for some particular value of z and 0 otherwise 39 An initial distribution 7r on Q where 7 13chj 1 39 An observation model 9 Pytqt where the space of observations is discrete or continuous realvalued 39 We ll solve the discrete multinomial case for simplicity mj is the probability that state 2 produces observation 3 39 Using the previous week s notes you can work out the univariate or multivariate Gaussian observation model case A transition matrix aij Pm1 qi Stationarity assumption transition probabilities are timeinvariant Sridhar Mahadevan CMPSCI 689 p174 Markov Properties of HMMS 39 The 
future is independent of the past given the present PQt1QtQt 17QO PQt1Qt 1330 yqut Py0ythtPyt1 H yqut 39 HMMs are more powerful than any finite memory device Pyt1ytayt 17ayt k 75 Pyt1yt7yt 1739H ayt kJrl Sridhar Mahadevan CMPSCI 689 p184 Basic Problems in HMMS Likelihood Given an observation sequence Y 2 mm yT and a model a A n 7 determine the likelihood PY 93 Filtering Given an observation sequence Yt 2 mm yt and a model A Q 7r determine the beliefstate Pqt Prediction Given an observation sequence Yt yoyl yt and a model A Q 7r determine the posterior over a future state Pq3 Yt s gt t Smoothing Given an observation sequence Yt yoyl yt and a model A Q 7r determine the posterior over a previous state Pq3 Yt g s lt t Most likely explanation Given an observation sequence Y yoyl yT and a model Q A Q 7r find the most likely sequence of states qoql qT given 3 that is compute argmaxqojquPmO q y Learning Find the model parameters such that H PYi g is maximized over multiple independent IID sequences Yi Sridhar Mahadevan CMPSCI 689 p194i Paths PWall 1 1 a 12 20 3O 4 O Wall Wall Opening Wall 0 1 PWall 2 2 O 3 Q4 POpening 3 PWall 4 Sridhar Mahadevan CMPSCI 689 p204 Inference in HMMS The complete data for an HMM is the output sequence y produced along with the hidden state sequence traversed Pq y T l T Pqo H Pqt1qt139 Pltytqt 150 150 M Z39T l M 1 j T M N i j HWY H H aijqtqt1 H 177z jq tyt i1 150 z j1 t0i1j1 T l T 77610 H aqtacItJrl H Pytqt 250 250 Sridhar Mahadevan CMPSCI 689 p214 Maximizing Observed Data Likelihood is Dif cult 39 The inference problem is computing the probability of a state sequence Pqy 135 39 To get the probability of the output 3 we have to sum over all possible state sequences which is intractablel T l T Z 277610 H aCIt7Qt1 H Pytqt qo C11 CIT t0 t0 39 It is obvious that we cannot easily maximize the observed data s likelihood by analytically solving a a argmax l6y argmaxg 10gPy Sridhar Mahadevan CMPSCI 689 p224l Structuring Inference in HMMs 39 We can 
condition on a particular state q_t to decompose the inference:

γ(q_t) = P(q_t | y) = P(q_t, y) / P(y)
       = P(y_0, ..., y_t, q_t) P(y_{t+1}, ..., y_T | q_t) / P(y)
       = α(q_t) β(q_t) / P(y)

where α(q_t) = P(y_0, ..., y_t, q_t) and β(q_t) = P(y_{t+1}, ..., y_T | q_t). Note that P(y) = Σ_{q_t} α(q_t) β(q_t).

Computing the Forward Variable
We can define a recursive procedure for computing the forward variables by exploiting the dynamic-programming trick of caching intermediate solutions:

α(q_{t+1}) = P(y_0, ..., y_{t+1}, q_{t+1})
           = Σ_{q_t} P(y_0, ..., y_t, q_t, q_{t+1}) P(y_{t+1} | q_{t+1})
           = Σ_{q_t} P(y_0, ..., y_t, q_t) P(q_{t+1} | q_t) P(y_{t+1} | q_{t+1})
           = Σ_{q_t} α(q_t) a_{q_t, q_{t+1}} P(y_{t+1} | q_{t+1})

where we used the Markov properties that q_{t+1} is independent of the past observations given q_t, and that y_{t+1} is independent of everything else given q_{t+1}.

The Forward Algorithm
- Initialization: α(q_0) = P(y_0, q_0) = π(q_0) P(y_0 | q_0), where 1 ≤ q_0 ≤ M.
- Induction: α(q_{t+1}) = [Σ_{q_t} α(q_t) a_{q_t, q_{t+1}}] P(y_{t+1} | q_{t+1}), where 0 ≤ t ≤ T − 1 and 1 ≤ q_t, q_{t+1} ≤ M.
- Time complexity: computing the α values at each time step takes O(M²), since for each of the M values of q_{t+1} we must sum over all M values of q_t. Over the time steps t = 1, ..., T, the forward algorithm therefore takes O(M²T).

Computing the Backward Variable
A recursive calculation of the backward variable can be developed as follows:

β(q_t) = P(y_{t+1}, ..., y_T | q_t)
       = Σ_{q_{t+1}} P(y_{t+1}, ..., y_T, q_{t+1} | q_t)
       = Σ_{q_{t+1}} P(y_{t+2}, ..., y_T | q_{t+1}) P(y_{t+1} | q_{t+1}) P(q_{t+1} | q_t)
       = Σ_{q_{t+1}} β(q_{t+1}) P(y_{t+1} | q_{t+1}) a_{q_t, q_{t+1}}

The Backward Algorithm
The β variables can be calculated recursively also, except we proceed backwards:
- Initialization: define β(q_T) = 1.
- Induction: β(q_t) = Σ_{q_{t+1}} β(q_{t+1}) P(y_{t+1} | q_{t+1}) a_{q_t, q_{t+1}}, where 1 ≤ t ≤ T − 1 and 1 ≤ q_t, q_{t+1} ≤ M.
- Time complexity: the same argument as for the forward algorithm gives O(M²) per step, and O(M²T) overall.

Most Likely State
The most likely state at any time t can be readily calculated from the α and β variables:

q_t^{ML} = argmax_{1 ≤ i ≤ M} γ_t(i),  where  γ_t(i) = α_t(i) β_t(i) / Σ_{j=1}^M α_t(j) β_t(j)
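As a concrete illustration, the forward recursion can be sketched in a few lines of plain Python. This is a minimal sketch, not code from the course; the two-state HMM parameters below are made-up toy values.

```python
def forward(pi, A, B, obs):
    """Forward algorithm: alpha[t][i] = P(y_0..y_t, q_t = i).

    pi  : initial distribution over the M states
    A   : M x M transition matrix, A[i][j] = P(q_{t+1} = j | q_t = i)
    B   : M x K observation matrix, B[i][k] = P(y_t = k | q_t = i)
    obs : observation sequence y_0..y_T as symbol indices
    """
    M = len(pi)
    # Initialization: alpha(q_0) = pi(q_0) P(y_0 | q_0)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(M)]]
    # Induction: alpha(q_{t+1}) = [sum_i alpha_t(i) a_{ij}] P(y_{t+1} | q_{t+1} = j)
    for y in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(M)) * B[j][y]
                      for j in range(M)])
    return alpha

# Toy two-state HMM (illustrative values only)
pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.3, 0.7]]
B = [[0.9, 0.1], [0.2, 0.8]]
alpha = forward(pi, A, B, [0, 0, 1])
# Sequence likelihood: P(y) = sum_i alpha[T][i]
likelihood = sum(alpha[-1])
```

Because the recursion sums over paths rather than enumerating them, summing the final α values gives the likelihood P(y) in O(M²T) time instead of the O(M^T) cost of summing over all state sequences explicitly.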
Most Likely State Sequence
If we concatenate the most likely state at each time t into a sequence, it might not be connected. Another approach is to compute the most likely state sequence given the observation sequence y and model θ. Let us define the probability of the most likely path that accounts for all observations up to time t and ends in state i:

δ_t(i) = max_{q_0, q_1, ..., q_{t−1}} P(q_0, ..., q_{t−1}, q_t = i, y_0, ..., y_t | θ)

We can define this probability by induction as

δ_{t+1}(i) = [max_j δ_t(j) a_{ji}] P(y_{t+1} | q_{t+1} = i)

The Viterbi Algorithm
The Viterbi procedure keeps track of both the most likely state at a given time and the probability associated with it, in two arrays:
- Initialize: δ_0(i) = π_i P(y_0 | q_0 = i), ψ_0(i) = 0.
- Recursion: δ_t(i) = [max_{1 ≤ j ≤ M} δ_{t−1}(j) a_{ji}] P(y_t | q_t = i) and ψ_t(i) = argmax_{1 ≤ j ≤ M} δ_{t−1}(j) a_{ji}, for 1 ≤ t ≤ T and 1 ≤ i ≤ M.
- Termination: P* = max_{1 ≤ i ≤ M} δ_T(i) and q_T* = argmax_{1 ≤ i ≤ M} δ_T(i).
- Path computation: q_t* = ψ_{t+1}(q_{t+1}*).

Viterbi for Rainy Days (Russell and Norvig, 2003)
(Figure: the umbrella HMM, with transition model P(Rain_t | Rain_{t−1}) given by 0.7 and 0.3, and evidence variables Umbrella_t.)
(Figure: the Viterbi computation over Rain_1, ..., Rain_5 for the umbrella observation sequence true, true, false, true, true.)

Learning the HMM Parameters
The α and β terms allow us to estimate the observation model from sample sequences. To estimate the transition matrix, we introduce a new variable:

ξ(q_t, q_{t+1}) = P(q_t, q_{t+1} | y)
               = P(y_0, ..., y_t, q_t) P(y_{t+1} | q_{t+1}) P(q_{t+1} | q_t) P(y_{t+2}, ..., y_T | q_{t+1}) / P(y)
               = α(q_t) P(y_{t+1} | q_{t+1}) a_{q_t, q_{t+1}} β(q_{t+1}) / P(y)

Maximum Likelihood for HMMs
Generally, we want to find the parameters that maximize the log-likelihood log P(y | θ), where

log P(y | θ) = log Σ_{q_0, q_1, ..., q_T} π_{q_0} Π_{t=0}^{T−1} a_{q_t, q_{t+1}} Π_{t=0}^{T} P(y_t | q_t)

As before, we note that this involves taking the log of a sum over exponentially many state sequences, and it is pretty difficult to optimize directly.

Complete Log-likelihood for HMMs
To
simplify the maximum loglikelihood computation we postulate a complete dataset y q where M T l T M N 39 M i j t 339 10g 770 H H aijqtqt1 H H 177z jqiyt 1 z 150 j t0z 1 j1 Z7 M T l Zq 10 Z q qi 10mm i1 M 250 z j1 15 Me M N Z Z qiyi 10 my i1j1 i i o Sridhar Mahadevan CMPSCI 689 p364 M Step of EM for HMMS 39 We find the ML parameter estimates using partial derivates H s T M M M ale 6 8 z 39 3 gtqg1logaij EA a 1 azj a 250 z j1 i1 j1 T 1 i j C C 250 a T l 39 Zt W Z T l j M a W explonting that Zaij 1 231 2250 qtqt1 91 aij Sridhar Mahadevan CMPSCI 689 p374 M Step of EM for HMMS 39 Similar calculation for the other parameters give us 39 N 236w H explonting that E 77 1 N T J H 30 quJ H A i 7Tz610 77z j These estimates are intuitive since for example 7 is simply counting the fraction of times observation 9 occurs at time t when the state is z Sridhar Mahadevan CMPSCI 689 p384 Estep for HMMS 39 Recall that the Estep replaces the values of the missing data by their predicted values conditioned on the observed data and the old parameter values Emjly 533 T l 39 76 EZq qi1ly6 150 Eqiy9kyi M e N l l o m Pq lly yi Me N l l o W M e is m 150 T l T l 39 Hk 39 We Eqiqi1ly6 Z Pq qi1ly6 150 150 T l aj O t 7 t 1 Sridhar Mahadevan CMPSCI 689 p394 a l l MStep Reestimation of HMM parameters 39 Putting the expected value estimates in the earlier ML formula gives us 728m 2 NZtToTviyi k ZtTTo y 2k1 Zt0 73 2150 7 AW 2 2 sign xii 013 ij ZtT01 23 20 7 1 78 39 These equations can be readily generalized to the case when the observation model is a univariate or multivariate Gaussian instead of a multinomial Sridhar Mahadevan CMPSCI 689 p404 Extensions of HMMS Embedded HMMs each state is viewed as a sequence of HMMs SemiMarkov HMMs state durations are not exponential but arbitrary Hierarchical HMMs multilevel treestructured models which are a special case of probabilistic contextfree grammars Abstract Markov Models HHMMs with statemediated transitions Sridhar Mahadevan CMPSCI 689 p414 forehead 88 3 
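The Viterbi recursion from the earlier slides can likewise be sketched in plain Python. This is a minimal sketch, not the course's code; the two-state parameters are made-up toy values.

```python
def viterbi(pi, A, B, obs):
    """Most likely state sequence argmax_q P(q, y) for a discrete HMM.

    delta[t][i] caches max_{q_0..q_{t-1}} P(q_0..q_{t-1}, q_t = i, y_0..y_t);
    psi[t][i] records the argmax predecessor, used for backtracking.
    """
    M = len(pi)
    delta = [[pi[i] * B[i][obs[0]] for i in range(M)]]
    psi = [[0] * M]
    for y in obs[1:]:
        prev = delta[-1]
        d_row, p_row = [], []
        for i in range(M):
            # Best predecessor j for state i (max instead of the forward sum)
            j_best = max(range(M), key=lambda j: prev[j] * A[j][i])
            d_row.append(prev[j_best] * A[j_best][i] * B[i][y])
            p_row.append(j_best)
        delta.append(d_row)
        psi.append(p_row)
    # Termination: best final state, then backtrack through psi
    q = [max(range(M), key=lambda i: delta[-1][i])]
    for t in range(len(obs) - 1, 0, -1):
        q.append(psi[t][q[-1]])
    return list(reversed(q)), max(delta[-1])

pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.3, 0.7]]
B = [[0.9, 0.1], [0.2, 0.8]]
path, p_star = viterbi(pi, A, B, [0, 0, 1, 1])
```

The only change from the forward algorithm is replacing the sum over predecessors with a max, plus the ψ backpointers; the cost is the same O(M²T).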
## Embedded HMMs

[Figures: an embedded HMM. This model can be viewed as a mixture of simple HMM models.]

## Hierarchical HMMs

Fine, Singer, and Tishby, "The Hierarchical Hidden Markov Model," Machine Learning, 1998.

[Figure: a hierarchical HMM, including an example observation model for one state over Front/Left/Back/Right wall-and-opening features.]

## Using Hierarchical HMMs in Robot Navigation

The actual model used extends the hierarchical HMM to include temporally extended actions, such as "exit corridor" (a hierarchical POMDP). Observation vectors encode Front, Left, Right, and Back features: wall, opening, door on right, stripe on right wall. See Theocharous, Rohanimanesh, and Mahadevan (ICRA 2001) and Theocharous, Murphy, and Kaelbling (IJCAI 2003) for details.

## Foveal Face Recognition using HMMs

Minut, Mahadevan, Henderson, and Dyer, "Face Recognition using Foveal Vision," in Seong-Whan Lee, Heinrich H. Bülthoff, and Tomaso Poggio, editors, vol. 1811, pp. 424-433, Springer-Verlag, 2000.

## Comparing Sliding Windows with Foveation

A competing approach slides a window of fixed size down the face (Samaria, PhD thesis, Cambridge). [Figure: average recognition rate of the subsampled (sliding-window) versus foveal HMM classifiers as a function of the number of states; the foveal classifier achieves the higher recognition rate.]

## Abstract Hidden Markov Model

Bui, Venkatesh, and West (JAIR; IJCAI 2003).

# SVM Classification and Regression; Reproducing Kernels

Sridhar Mahadevan (mahadeva@cs.umass.edu), University of Massachusetts

## Topics

- SVM review: the linearly separable case
- Soft-margin classifiers: dealing with overfitting
- SVM regression
- Examples of kernels
- Mercer's theorem

## Margin-based Classification

[Figure: a maximum-margin separating hyperplane between two classes.]

## Separating Hyperplane between Convex Hulls

[Figure: the maximum-margin hyperplane separates the convex hulls of the two classes.]
## Optimal Margin Classification

Consider the problem of finding a set of weights $w$ that produces a hyperplane with the maximum geometric margin:

$$\max_{w, b} \gamma \quad \text{such that} \quad y_i (\langle w, x_i \rangle + b) \ge \gamma, \; i = 1, \ldots, m, \quad \|w\| = 1$$

We rescale the weights to eliminate the nonconvex constraint $\|w\| = 1$, and instead look to minimize $\langle w, w \rangle$ while keeping the functional margin at $1$:

$$\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{such that} \quad y_i (\langle w, x_i \rangle + b) \ge 1, \; i = 1, \ldots, m$$

## Lagrange Dual Formulation

The primal optimal margin classification problem can be formulated as

$$\min_w f(w) \quad \text{such that} \quad g_i(w) \le 0, \; i = 1, \ldots, k \quad \text{and} \quad h_i(w) = 0, \; i = 1, \ldots, l$$

The dual problem can be formulated using Lagrange multipliers as

$$\max_{\alpha \ge 0,\, \beta} L_D(\alpha, \beta) \quad \text{where} \quad L_D(\alpha, \beta) = \min_w \left( f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w) \right)$$

Weak Duality Theorem: the dual formulation always produces a solution that is upper bounded by the solution to the primal problem.

Strong Duality Theorem: the solution to the Lagrange dual is exactly the same as the primal solution, assuming that the function $f(w)$ and the constraints $g_i(w)$ are convex and the $h_i(w)$ are affine, meaning $h_i(w) = \langle w, a_i \rangle + b_i$.

## Weak Duality Theorem

Suppose $w$ is a feasible solution to the primal problem, and that $\alpha$ and $\beta$ constitute a feasible solution to the dual problem. Then

$$L_D(\alpha, \beta) = \min_u L(u, \alpha, \beta) \le L(w, \alpha, \beta) = f(w) + \sum_i \alpha_i g_i(w) + \sum_i \beta_i h_i(w) \le f(w)$$

The last inequality holds because $g_i(w) \le 0$ and $h_i(w) = 0$ (since $w$ is feasible for the primal) and $\alpha_i \ge 0$ (since $\alpha$ is feasible for the dual). This implies the following condition:

$$\max_{\alpha \ge 0,\, \beta} L_D(\alpha, \beta) \;\le\; \min_w \left\{ f(w) : g_i(w) \le 0,\; h_i(w) = 0 \right\}$$

## Sparsity of Parameters

Corollary: Let $w^*$ be a weight vector that satisfies the primal constraints, and let $\alpha^*, \beta^*$ be Lagrangian variables that satisfy the dual constraints, with $f(w^*) = L_D(\alpha^*, \beta^*)$, where $\alpha^* \ge 0$, $g_i(w^*) \le 0$, and $h_i(w^*) = 0$. Then $\alpha_i^* g_i(w^*) = 0$ for $i = 1, \ldots, k$.

The proof follows easily by noting that the inequality

$$f(w^*) + \sum_i \alpha_i^* g_i(w^*) + \sum_i \beta_i^* h_i(w^*) \le f(w^*)$$

becomes an equality only when $\alpha_i^* g_i(w^*) = 0$ for $i = 1, \ldots, k$. This is a key result in the SVM context: it implies that the representation will be sparse, since we only need to maintain $\alpha_i$ for the support vectors, that is, the instances that are closest to the separating hyperplane.
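As a sanity check on the dual construction and the complementary slackness result above, here is a one-dimensional worked example (my own, not from the slides): minimize $f(w) = w^2$ subject to $g(w) = 1 - w \le 0$.

```latex
L(w, \alpha) = w^2 + \alpha (1 - w), \qquad
\frac{\partial L}{\partial w} = 2w - \alpha = 0 \;\Rightarrow\; w = \tfrac{\alpha}{2}
\\[6pt]
L_D(\alpha) = \tfrac{\alpha^2}{4} + \alpha\!\left(1 - \tfrac{\alpha}{2}\right)
            = \alpha - \tfrac{\alpha^2}{4}, \qquad
\max_{\alpha \ge 0} L_D(\alpha) \;\text{ at }\; \alpha^* = 2, \quad L_D(2) = 1
```

The primal optimum is $w^* = 1$ with $f(w^*) = 1 = L_D(\alpha^*)$, so there is no duality gap (as strong duality promises for this convex problem), and $\alpha^* g(w^*) = 2 \cdot 0 = 0$, exactly as the corollary requires: the multiplier is nonzero only because the constraint is active.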
## Duality Gap and Saddle Points

[Figure: the saddle-point function $x^2 - y^2$.]

If the primal and dual solutions are not equal, this duality gap can be detected in the course of solving the dual and used as a convergence metric. Define a saddle point as a triple $(w^*, \alpha^*, \beta^*)$, where $w^* \in \Omega$ and $\alpha^* \ge 0$, such that

$$L(w^*, \alpha, \beta) \le L(w^*, \alpha^*, \beta^*) \le L(w, \alpha^*, \beta^*)$$

Theorem: The triple $(w^*, \alpha^*, \beta^*)$ is a saddle point if and only if $w^*$ is a solution to the primal problem, $(\alpha^*, \beta^*)$ is a solution to the dual problem, and there is no duality gap, so that $f(w^*) = L_D(\alpha^*, \beta^*)$.

Strong Duality Theorem: If $f(w)$ is convex, $w \in \Omega$ where $\Omega$ is a convex set, and $g, h$ are affine functions, then the duality gap is 0.

## Karush-Kuhn-Tucker Conditions

Assume that the function $f(w)$ and the constraints $g_i(w)$ are convex, and the $h_i(w)$ are affine, meaning $h_i(w) = \langle w, a_i \rangle + b_i$. Let there be at least one $w$ such that $g_i(w) < 0$ for all $i$. Then the KKT conditions assure us the duality gap is 0. The minimizing values $\alpha^*, \beta^*, w^*$ also satisfy the following conditions:

1. $\frac{\partial}{\partial w_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i = 1, \ldots, n$
2. $\frac{\partial}{\partial \beta_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i = 1, \ldots, l$
3. $\alpha_i^* g_i(w^*) = 0, \quad i = 1, \ldots, k$
4. $g_i(w^*) \le 0, \quad i = 1, \ldots, k$
5. $\alpha_i^* \ge 0, \quad i = 1, \ldots, k$

## Support Vectors for Optimal Margin Classifiers

Returning to the original problem, we can write the constraints as

$$\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{such that} \quad g_i(w) = 1 - y_i(\langle w, x_i \rangle + b) \le 0$$

From the KKT conditions, note that the only instances for which $\alpha_i > 0$ are those whose functional margin is exactly 1, because only then is $g_i(w) = 0$. Since the functional margin of the dataset is the smallest of all the margins, we will have nonzero $\alpha_i$ only for the points closest to the decision boundary. These are called the support vectors.

## Dual Form of the Optimal Margin Classifier

We can write the Lagrangian for our optimal margin classifier as

$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right]$$

To solve the dual form, we first minimize with respect to $w$ and $b$, and then maximize with respect to $\alpha$:

$$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \;\Longrightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i$$

$$\frac{\partial}{\partial b} L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i y_i = 0$$
A mechanical interpretation: think of each instance $x_i$ as exerting a force $\alpha_i y_i$ on the hyperplane. The forces sum to 0.

## Support Vectors for Optimal Margin Classifiers

Using these results, we can simplify the Lagrangian into the following dual form:

$$\max_\alpha \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i, j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{s.t.} \quad \alpha_i \ge 0 \; \text{ and } \; \sum_{i} \alpha_i y_i = 0$$

Given the maximizing $\alpha^*$, we use the equation $w^* = \sum_{i=1}^{m} \alpha_i^* y_i x_i$ to find the maximizing $w^*$. A new instance $x$ is classified using a weighted sum of inner products over only the support vectors:

$$\langle w^*, x \rangle + b^* = \sum_{i=1}^{m} \alpha_i^* y_i \langle x_i, x \rangle + b^* = \sum_{i \in SV} \alpha_i^* y_i \langle x_i, x \rangle + b^*$$

The intercept term $b^*$ can be found from the primal constraints:

$$b^* = -\,\frac{\max_{y_i = -1} \langle w^*, x_i \rangle + \min_{y_i = 1} \langle w^*, x_i \rangle}{2}$$

## Geometric Margin and Lagrange Parameters

Theorem: Consider a linearly separable set of instances $(x_1, y_1), \ldots, (x_m, y_m)$, and suppose $(\alpha^*, b^*)$ is a solution to the dual optimization problem. Then the geometric margin can be expressed as

$$\gamma = \frac{1}{\|w^*\|} = \frac{1}{\sqrt{\sum_{i \in SV} \alpha_i^*}}$$

Proof: Due to the KKT conditions, for every support vector $j \in SV$,

$$y_j f(x_j) = y_j \left( \sum_{i \in SV} \alpha_i^* y_i \langle x_i, x_j \rangle + b^* \right) = 1$$

Therefore,

$$\|w^*\|^2 = \left\langle \sum_i \alpha_i^* y_i x_i,\; \sum_j \alpha_j^* y_j x_j \right\rangle = \sum_{j \in SV} \alpha_j^* y_j \left( f(x_j) - b^* \right) = \sum_{j \in SV} \alpha_j^* y_j f(x_j) - b^* \sum_{j \in SV} \alpha_j^* y_j = \sum_{j \in SV} \alpha_j^*$$

using $y_j f(x_j) = 1$ and $\sum_j \alpha_j^* y_j = 0$.

## Dealing with Nonseparable Data

[Figure: nonseparable data, contrasting a high-variance, low-bias decision boundary with a high-bias, low-variance one.]

## Soft Margin Classifiers

Let us reformulate the concept of margin to allow errors, so positive or negative instances can lie on the wrong side of the margin. The slack variable $\xi_i$ represents the extent to which a margin constraint is violated:

$$y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i \quad \text{where} \quad \xi_i \ge 0, \; i = 1, \ldots, l$$

Similar to ridge regression, define a variable $\Lambda$ which represents the extent to which we want to tolerate errors. A soft-margin classifier solves the following constrained optimization problem:

$$\min_{w, b, \xi} \; \langle w, w \rangle + \Lambda \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \; i = 1, \ldots, l$$

## SVM Regression

[Figure: the $\epsilon$-insensitive loss, which is zero inside a tube of width $\epsilon$ around the target and grows linearly outside it.]
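The geometric margin theorem above can be checked numerically on a made-up two-point problem (my example, not the slides'): $x_1 = +1$ with $y_1 = +1$ and $x_2 = -1$ with $y_2 = -1$. The constraint $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2 = a$, so the dual objective collapses to $W(a) = 2a - 2a^2$:

```python
import numpy as np

# Two points on the line: x1 = +1 (y = +1), x2 = -1 (y = -1).
x = np.array([1.0, -1.0])
y = np.array([1.0, -1.0])

# With alpha_1 = alpha_2 = a, the dual objective is W(a) = 2a - 2a^2;
# maximize it by a simple grid search over a >= 0.
a_grid = np.linspace(0.0, 1.0, 100001)
W = 2 * a_grid - 2 * a_grid ** 2
a_star = a_grid[W.argmax()]            # analytic maximizer is a = 1/2

alpha = np.array([a_star, a_star])
w = np.sum(alpha * y * x)              # w* = sum_i alpha_i y_i x_i
margin = 1 / np.sqrt(alpha.sum())      # theorem: gamma = 1 / sqrt(sum of alphas)
```

Both points are support vectors, $w^* = 1$, and the margin $1/\sqrt{\alpha_1^* + \alpha_2^*} = 1$ agrees with the direct value $1/\|w^*\|$.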
## SVM Regression

We introduce two slack variables $\xi_i$ and $\xi_i^*$, which represent the penalty for exceeding or falling below the target value by more than $\epsilon$. The primal problem can be formulated as

$$\min \; \|w\|^2 + \Lambda \sum_{i=1}^{l} (\xi_i + \xi_i^*)$$

$$\text{subject to} \quad \langle w, x_i \rangle + b - y_i \le \epsilon + \xi_i, \quad y_i - \langle w, x_i \rangle - b \le \epsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0, \; i = 1, \ldots, l$$

## The Kernel Trick

Note that SVM classification produces estimators of the form

$$\langle w, x \rangle + b = \sum_{i=1}^{m} \alpha_i y_i \langle x_i, x \rangle + b$$

where the evaluator is expressed solely in terms of inner products and parameters that are independent of the input-space dimensionality. To map from the original space $X$ into some higher-dimensional space, we only need to replace this by

$$\langle w, \phi(x) \rangle + b = \sum_{i=1}^{m} \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b$$

The key property of kernels we are exploiting is the reproducing property:

$$k(x, y) = \langle \phi(x), \phi(y) \rangle$$

In words, the inner product in the high-dimensional space can be cheaply computed using the kernel function in the original space.

## Convolution Kernels

Haussler (1999) defined convolution kernels as a general way of defining a similarity measure on structured objects, e.g., graphs or documents. Consider an object $x = (x_1, \ldots, x_d)$, where each part $x_i \in X_i$. We can define the part-of relation $R(x_1, \ldots, x_d, x)$, which holds if and only if $x_1, \ldots, x_d$ are indeed the parts of $x$. Of course, there may be more than one way to decompose $x$ into its parts (think of subsequences of strings, or subtrees of a tree). Let $R^{-1}(x) = \{(x_1, \ldots, x_d) : R(x_1, \ldots, x_d, x)\}$. Then the convolution kernel $k(x, y)$ is defined as

$$k(x, y) = \sum_{\bar{x} \in R^{-1}(x)} \; \sum_{\bar{y} \in R^{-1}(y)} \; \prod_{i=1}^{d} k_i(x_i, y_i)$$

where $k_i(x_i, y_i)$ is a kernel on the $i$th component.

## String Kernels

Watkins (1999) defined string kernels, which can be seen as an instance of a convolution kernel. Consider the set of all subsequences of length $n$ of a word; e.g., the length-2 subsequences of "bat" are "ba", "at", and "bt". The length $l(\mathbf{i})$ of a subsequence occurrence is $i_{|u|} - i_1 + 1$ if the subsequence begins at position $i_1$ in a string $s$ and ends at position $i_{|u|}$. Consider the mapping $\phi : \Sigma^* \to \mathbb{R}^{\Sigma^n}$, where $\Sigma$ is an alphabet, $\Sigma^*$ is the set of all strings, and $\Sigma^n$ is the set of all strings of length $n$.
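The kernel trick above can be checked concretely for the homogeneous polynomial kernel $k(x, y) = \langle x, y \rangle^2$, whose explicit feature map in two dimensions is $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$. The points below are an arbitrary made-up example:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for 2-d input: (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

k_direct = np.dot(x, y) ** 2           # kernel evaluated in the original 2-d space
k_feature = np.dot(phi(x), phi(y))     # inner product in the 3-d feature space
# Both compute <phi(x), phi(y)>; the kernel side never forms phi explicitly.
```

For a degree-$d$ kernel on $n$-dimensional inputs the explicit feature space has $O(n^d)$ coordinates, while the kernel evaluation stays $O(n)$; that asymmetry is the entire point of the trick.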
Given any subsequence $u \in \Sigma^n$, define

$$\phi_u(s) = \sum_{\mathbf{i} \,:\, u = s[\mathbf{i}]} \lambda^{l(\mathbf{i})}$$

where $\mathbf{i}$ is the index vector representing the positions at which the subsequence $u$ occurs in string $s$, and $\lambda \in (0, 1)$. The string kernel is defined as

$$K_n(s, t) = \sum_{u \in \Sigma^n} \; \sum_{\mathbf{i} \,:\, u = s[\mathbf{i}]} \; \sum_{\mathbf{j} \,:\, u = t[\mathbf{j}]} \lambda^{l(\mathbf{i}) + l(\mathbf{j})}$$

Clearly $K_n(s, t) = \langle \phi(s), \phi(t) \rangle$, and so this is a valid kernel.

## Examples of Kernels

- Multinomial kernel
- Product kernel: $K(x, y) = \langle x, y \rangle$
- TF-IDF kernel: $K(x, y) = \langle \phi(x), \phi(y) \rangle$, where $\phi_i(x) = \frac{t_i d_i}{\tau}$; here $t_i$ is the term frequency of word $i$ in document $x$, $d_i$ is the inverse document frequency of word $i$, and $\tau$ is a normalizer that ensures $\|\phi(x)\| = 1$
- Gaussian kernel: $K(x, y) = e^{-\|x - y\|^2 / (2\sigma^2)}$
- RBF kernel: $K(x, y) = e^{-\|x - y\|^2}$
- Polynomial kernel: $K(x, y) = (\langle x, y \rangle + c)^d$
- Neural net kernel: $K(x, y) = \tanh(\langle x, y \rangle + c)$

## Fisher Kernels

Jaakkola and Haussler (1998) proposed using a generative model as a kernel in a discriminative, non-probabilistic kernel classifier; in their case, kernel logistic regression. Let $P(X \mid \theta)$ be any generative model, e.g., a hidden Markov model. Consider the Fisher score

$$U_X = \nabla_\theta \log P(X \mid \theta)$$

in other words, the gradient of the log-likelihood of a particular input $X$. Define the information matrix $I = E[U_X U_X^T]$, where the expectation is over $P(X \mid \theta)$. The Fisher kernel is

$$K(x, y) = U_x^T I^{-1} U_y$$

which can be asymptotically approximated as $K(x, y) \approx U_x^T U_y$. On a protein homology problem, they obtained excellent results using the Fisher kernel with kernel logistic regression.

## Reproducing Kernel Hilbert Spaces

Theorem: Every RKHS $\mathcal{H}$ on a domain $T$ yields a positive definite function $k(s, t)$ on $T \times T$, and conversely, every positive definite function $k(s, t)$ yields an RKHS.

Proof: Assume $\mathcal{H}$ is an RKHS. Then every function evaluation $f(t)$, for $f \in \mathcal{H}$, has a representer; that is, $f(t) = \langle \eta_t, f \rangle$, where $\eta_t$ is the representer for function evaluation at $t$. Define $k(s, t) = \langle \eta_s, \eta_t \rangle$. Clearly $k(s, t)$ must be a symmetric function, and it must also be positive definite:

$$\sum_{i, j} a_i a_j k(t_i, t_j) = \sum_{i, j} a_i a_j \langle \eta_{t_i}, \eta_{t_j} \rangle = \left\langle \sum_i a_i \eta_{t_i}, \; \sum_j a_j \eta_{t_j} \right\rangle = \left\| \sum_i a_i \eta_{t_i} \right\|^2 \ge 0$$
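Positive definiteness — the property the RKHS construction above hinges on — can be spot-checked numerically: the Gram matrix of any sample of points under a valid kernel must have no negative eigenvalues. A minimal sketch using the Gaussian kernel from the list above (the data size and bandwidth are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))       # 20 random points in R^3
sigma = 1.5

# Gaussian-kernel Gram matrix: G[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
G = np.exp(-sq_dists / (2 * sigma ** 2))

# A valid kernel's Gram matrix is symmetric positive semidefinite.
eigvals = np.linalg.eigvalsh(G)
min_eig = eigvals.min()            # should be >= 0 up to round-off
```

A check like this cannot prove a candidate kernel valid (that requires Mercer's theorem, below), but a single sample with a negative eigenvalue is enough to prove it invalid.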
## Reproducing Kernel Hilbert Spaces: Proof of the Converse

Given a positive definite function $k(s, t)$ on $T \times T$, we want to show that it defines a unique reproducing kernel Hilbert space $\mathcal{H}_K$. Define the function $k_t = k(t, \cdot)$, so that $k_t(s) = k(t, s)$ for all $s, t \in T$; we show below that this is our representer function. Define the pre-Hilbert space $\mathcal{H}_0$ to be the linear space comprised of all $f = \sum_i a_i k_{t_i}$, where $t_i \in T$; we can show that the completion of $\mathcal{H}_0$ is an RKHS. Consider two elements $f, g \in \mathcal{H}_0$, where

$$f = \sum_{i=1}^{n} a_i k_{t_i}, \qquad g = \sum_{j=1}^{m} \beta_j k_{t_j'}$$

Define the inner product in $\mathcal{H}_0$ as

$$\langle f, g \rangle = \sum_{i, j} a_i \beta_j\, k(t_i, t_j')$$

From this definition we see that

$$\langle f, g \rangle = \sum_i a_i \sum_j \beta_j\, k(t_i, t_j') = \sum_i a_i\, g(t_i), \qquad \text{and similarly} \qquad \langle f, g \rangle = \sum_j \beta_j\, f(t_j')$$

This implies that the inner product gives us the reproducing property of an RKHS, namely that $\langle k_t, f \rangle = f(t)$. It also follows that $\langle k_t, k_{t'} \rangle = k(t, t')$.

To complete the proof, we need to show that our definition of $\langle f, g \rangle$ satisfies all the properties of an inner product. For example, $\langle f, f \rangle \ge 0$ holds because $k(s, t)$ is positive definite:

$$\langle f, f \rangle = \sum_{i, j} a_i a_j\, k(t_i, t_j) \ge 0$$

We also need to show that if $\langle f, f \rangle = 0$ then $f = 0$. To show this, first show that kernels satisfy the Cauchy-Schwartz inequality, $k^2(s, t) \le k(s, s)\, k(t, t)$; then use the reproducing property to show that

$$f^2(t) = \langle f, k_t \rangle^2 \le \langle f, f \rangle\, k(t, t)$$

Finally, we add to the pre-Hilbert space $\mathcal{H}_0$ all limits of Cauchy sequences. Note that if $f_n$ is a Cauchy sequence, then

$$\left( f_n(t) - f_m(t) \right)^2 \le \|f_n - f_m\|^2\, k(t, t)$$

so the Cauchy sequence must converge pointwise to some $f : T \to \mathbb{R}$. We add all such $f$ to $\mathcal{H}_0$ to get our RKHS $\mathcal{H}_K$.

## Mercer's Theorem: Discrete Case

Let us start with the discrete, simple case of Mercer's theorem. Consider a symmetric kernel function $k(x, z)$ on a finite input space
$X = \{x_1, \ldots, x_n\}$. The Gram matrix of $k(x, z)$ is the matrix $G = \left( k(x_i, x_j) \right)_{i, j = 1}^{n}$. Let us restrict our attention to kernels whose Gram matrices are positive semidefinite, i.e., whose eigenvalues are nonnegative. Then we know that

$$G = \lambda_1 v_1 v_1^T + \cdots + \lambda_n v_n v_n^T = \sum_{t=1}^{n} \lambda_t v_t v_t^T$$

Consider the nonlinear mapping $\phi : x_i \mapsto \left( \sqrt{\lambda_t}\, v_t(i) \right)_{t=1}^{n}$. Then we can see that

$$\langle \phi(x_i), \phi(x_j) \rangle = \sum_{t=1}^{n} \lambda_t\, v_t(i)\, v_t(j) = k(x_i, x_j)$$

## Mercer's Theorem: General Case

Let $K(x, y)$ be any symmetric continuous function. Then $K(x, y) = \langle \phi(x), \phi(y) \rangle$ if and only if

$$\int K(x, y)\, g(x)\, g(y)\, dx\, dy \ge 0 \quad \text{for all } g$$

The condition above generalizes the positive semidefinite condition from vectors to the Hilbert space of functions. Mercer's theorem arises in the context of integral equations:

$$f(y) = \int K(x, y)\, g(x)\, dx$$

A function $\phi(x)$ is an eigenfunction of $K(x, y)$ if it solves the corresponding integral operator equation

$$\int K(x, y)\, \phi(x)\, dx = \lambda\, \phi(y)$$

Note that eigenfunctions are a generalization of eigenvectors, since in general Hilbert spaces vectors are functions.

In the general case, we have to ensure that inner products over infinite vectors converge. We do that by defining

$$\langle \phi(x), \phi(y) \rangle = \sum_{j=1}^{\infty} \lambda_j\, \phi_j(x)\, \phi_j(y)$$

If $K(x, y)$ is a kernel that satisfies the conditions of Mercer's theorem, then it can be expanded using the set of eigenfunctions:

$$K(x, y) = \sum_{j=1}^{\infty} \lambda_j\, \phi_j(x)\, \phi_j(y)$$

where the $\phi_j$ are normalized so that $\|\phi_j\|_{L_2} = 1$, and the weights $\lambda_j$ are the positive eigenvalues. The expansion generalizes the property that positive semidefinite matrices can be expanded as a finite sum of projection matrices.
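The discrete case of Mercer's theorem can be verified directly: eigendecomposing a Gram matrix yields a finite-dimensional feature map whose inner products reproduce the kernel exactly. A sketch using the inhomogeneous polynomial kernel on random data (my own illustrative example):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))          # 10 points in R^4

# Gram matrix of the polynomial kernel k(x, z) = (<x, z> + 1)^2.
G = (X @ X.T + 1.0) ** 2

# Eigendecomposition G = sum_t lambda_t v_t v_t^T; for a valid kernel the
# eigenvalues are nonnegative (clip tiny negatives from round-off).
lam, V = np.linalg.eigh(G)
lam = np.clip(lam, 0.0, None)

# Feature map phi(x_i) = (sqrt(lambda_t) * v_t(i))_t: row i of Phi.
Phi = V * np.sqrt(lam)
G_reconstructed = Phi @ Phi.T         # inner products recover the Gram matrix
```

This is exactly the construction in the discrete case above: the feature space has one coordinate per eigenvector, and $\langle \phi(x_i), \phi(x_j) \rangle = k(x_i, x_j)$ for every pair of sample points. The general case replaces eigenvectors by eigenfunctions of the integral operator.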
