Computing For Life Sciences
Computing For Life Sciences CS 59000
This 250-page set of class notes was uploaded by Nick Rowe on Saturday, September 19, 2015. The notes belong to CS 59000 at Purdue University, taught by Yuan Qi in Fall. Since its upload, it has received 51 views. For similar materials, see /class/208074/cs-59000-purdue-university in Computer Science at Purdue University.
CS 59000 Statistical Machine Learning
Lecture 1
Yuan (Alan) Qi (alanqi at cs.purdue.edu)
Fall 2008

Why take machine learning?
- Solve problems in computer vision, natural language processing, systems biology, social network analysis, finance, etc.
- Do research in machine learning.
- Find industrial opportunities.
- Other reasons: enjoy learning elegant theories, do fun practical projects.

Applications

How to select relevant genes?
Task: classify gene expression datasets into different categories, e.g., normal vs. cancer.
Challenge: thousands of genes are measured in the microarray data, but only a small subset of genes are probably correlated with the classification task.

How to parse an organization chart automatically?

How to label each line of a FAQ automatically? For example (tags reconstructed from a garbled FAQ excerpt):
<head> NNTP Poster: NewsHound v1.33
<head> Archive-name: acorn/faq/part2
<head> Frequency: monthly
<question> 2.6) What configuration of serial cable should I use?
<answer> Here follows a diagram of the necessary connections for programs to work properly. They are, as far as I know, agreed upon by commercial comms software developers. Pins 1, 4 and 8 must be connected together inside; this is to avoid the well-known serial port chip bugs.

How to model social networks and predict your friends' movie preferences?
[Figure: a social network of Biblical figures (Jesus, Mary, Moses, Abraham, and many others), used to illustrate social network modeling.]

Logistics
- Time: TR 10:30 am - 11:45 am
- Instructor: Alan Qi, Lawson 1207, alanqi at cs.purdue.edu
- Teaching assistant: Yao Zhu, Lawson B116F, zhu36 at cs.purdue.edu
- Office hours: MW 2:00 pm - 3:15 pm, or by appointment
- Web page: http://www.cs.purdue.edu/~alanqi/Courses/CS59000.htm

Tentative Topics
- Review of probability distributions and basic concepts in information theory
- Linear regression and classification
- Probabilistic graphical models: Bayesian networks, Markov random fields, and conditional random fields
- K-means clustering, mixture models, and Expectation Maximization
- Hidden Markov models, state space models, and forward-backward algorithms
- Sampling methods: importance sampling and Markov chain Monte Carlo
- Deterministic approximate inference: Laplace's method, variational Bayes, and expectation propagation
- Kernel methods
- Selected topics: nonparametric Bayesian Dirichlet process mixture models; combining models, or weak learners (boosting); recent papers from NIPS, ICML, UAI, JMLR, etc.

Prerequisites & Textbooks
Prerequisites: calculus, basic linear algebra, and probability, or permission of instructor.
Textbooks (recommended):
- Pattern Recognition and Machine Learning, Christopher M. Bishop, 2007.
- Information Theory, Inference, and Learning Algorithms, David MacKay, 2003. Available online.

Workload
- Homework: 5 to 8 assignments.
- Midterm in mid-October.
- Review of recent research: students will choose a subtopic of machine learning research, select three recent conference papers on the topic, and write a 2-page report outlining the main ideas of the papers and relating them to the context of the course.

Workload (cont.)
- Final project.
- Topic: anything that clearly pertains to the course material.
- Pre-report: a one-page description of your project, due a month before the project is due.
- Collaboration: you are encouraged to collaborate on the project. We expect a four-page write-up about the project, which should clearly and succinctly describe the project goal, methods, and results.
Each group should submit only one copy of the write-up and include all the names of the group members; a two-person group will have 6 pages, a three-person group will have 8 pages, and so on.

Grading
- Class participation: 10%
- Homework: 30%. Assignments will be accepted up to 5 days late, with a penalty of 10% per day; no assignment will be accepted more than 5 days late.
- Midterm: 25%
- Paper review: 10%
- Final project: 25%

Acknowledgements
Some of the lecture slides are from the textbook website: http://research.microsoft.com/~cmbishop/prml

Example: Handwritten Digit Recognition
[Figure: sample handwritten digit images.]

Polynomial Curve Fitting
y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = Σ_{j=0}^M w_j x^j

Sum-of-Squares Error Function
E(w) = (1/2) Σ_{n=1}^N { y(x_n, w) − t_n }^2

[Figures: fits of 0th-, 1st-, 3rd-, and 9th-order polynomials to the data.]

Overfitting
[Figure: training vs. test error as a function of the polynomial order M.]

Root-Mean-Square (RMS) Error
E_RMS = sqrt( 2 E(w*) / N )

Polynomial Coefficients
[Table (garbled in the transcript): the fitted coefficients w* for M = 0, 1, 3, 9; their magnitudes grow dramatically as M increases.]

Regularization
Penalize large coefficient values:
Ẽ(w) = (1/2) Σ_{n=1}^N { y(x_n, w) − t_n }^2 + (λ/2) ||w||^2

CS 59000 Statistical Machine Learning
Lecture 3
Yuan (Alan) Qi (alanqi at cs.purdue.edu)
Sept. 2, 2008

Review: Bayes' Theorem
p(Y|X) = p(X|Y) p(Y) / p(X),   p(X) = Σ_Y p(X|Y) p(Y)
posterior ∝ likelihood × prior

The Multivariate Gaussian
N(x | μ, Σ) = (1 / (2π)^{D/2}) (1 / |Σ|^{1/2}) exp( −(1/2) (x − μ)^T Σ^{−1} (x − μ) )

Maximum Log-Likelihood
ln p(x | μ, σ^2) = −(1/(2σ^2)) Σ_{n=1}^N (x_n − μ)^2 − (N/2) ln σ^2 − (N/2) ln(2π)
μ_ML = (1/N) Σ_{n=1}^N x_n,   σ^2_ML = (1/N) Σ_{n=1}^N (x_n − μ_ML)^2

Maximum Likelihood for Regression
p(t | x, w, β) = N(t | y(x, w), β^{−1})
ln p(t | x, w, β) = −(β/2) Σ_{n=1}^N { y(x_n, w) − t_n }^2 + (N/2) ln β − (N/2) ln(2π)
Determine w_ML by minimizing the sum-of-squares error Σ_n { y(x_n, w) − t_n }^2; then
1/β_ML = (1/N) Σ_{n=1}^N { y(x_n, w_ML) − t_n }^2

Predictive Distribution by ML
p(t | x, w_ML, β_ML) = N(t | y(x, w_ML), β_ML^{−1})

MAP: A Step towards Bayes
Prior: p(w | α) = N(w | 0, α^{−1} I) ∝ exp( −(α/2) w^T w )
Posterior: p(w | x, t, α, β) ∝ p(t | x, w, β) p(w | α)
Determine w_MAP by minimizing the regularized sum-of-squares error
(β/2) Σ_{n=1}^N { y(x_n, w) − t_n }^2 + (α/2) w^T w

Bayesian Curve Fitting
p(t | x, x, t) = ∫ p(t | x, w) p(w | x, t) dw = N(t | m(x), s^2(x))
This is the Bayesian predictive posterior distribution.

Decision Theory
Inference step: determine either p(t | x) or p(x, t).
Decision step: for a given x, determine the optimal t.

Minimum Misclassification Rate
p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1) = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx

Minimum Expected Loss
E[L] = Σ_k Σ_j ∫_{R_j} L_{kj} p(x, C_k) dx
The regions R_j are chosen to minimize Σ_k L_{kj} p(C_k | x).

The Squared Loss Function
Minimize E[L] = ∫∫ { y(x) − t }^2 p(x, t) dx dt

Entropy
H[x] = −Σ_x p(x) log_2 p(x) for discrete random variables; the differential entropy −∫ p(x) ln p(x) dx for continuous random variables.
An important quantity in coding theory, statistical physics, and machine learning; it measures randomness.

The Kullback-Leibler Divergence
KL(p || q) = −∫ p(x) ln q(x) dx − ( −∫ p(x) ln p(x) dx ) = −∫ p(x) ln( q(x) / p(x) ) dx
KL(p || q) ≈ (1/N) Σ_{n=1}^N { −ln q(x_n | θ) + ln p(x_n) }
KL(p || q) ≥ 0,   KL(p || q) ≠ KL(q || p)

Conditional Entropy & Mutual Information
H[y | x] = −∫∫ p(y, x) ln p(y | x) dy dx
H[x, y] = H[y | x] + H[x]
I[x, y] ≡ KL( p(x, y) || p(x) p(y) )

Parametric Distributions
Basic building blocks: p(x | θ). We need to determine θ given {x_1, ..., x_N}. Representation: a point estimate θ*, or a posterior p(θ)?
Recall curve fitting: p(t | x, x, t) = ∫ p(t | x, w) p(w | x, t) dw.

Binary Variables (1)
Coin flipping: heads = 1, tails = 0; p(x = 1 | μ) = μ.
Bernoulli distribution: Bern(x | μ) = μ^x (1 − μ)^{1−x};  E[x] = μ;  var[x] = μ(1 − μ).

Binary Variables (2)
N coin flips: p(m heads | N, μ).
Binomial distribution: Bin(m | N, μ) = (N choose m) μ^m (1 − μ)^{N−m}
E[m] = Σ_{m=0}^N m Bin(m | N, μ) = Nμ;   var[m] = Σ_{m=0}^N (m − E[m])^2 Bin(m | N, μ) = Nμ(1 − μ)

[Figure: the binomial distribution Bin(m | 10, 0.25).]

ML Parameter Estimation for Bernoulli (1)
Given D = {x_1, ..., x_N} with m heads and N − m tails:
p(D | μ) = Π_{n=1}^N p(x_n | μ) = Π_{n=1}^N μ^{x_n} (1 − μ)^{1−x_n}
ln p(D | μ) = Σ_{n=1}^N ln p(x_n | μ) = Σ_{n=1}^N { x_n ln μ + (1 − x_n) ln(1 − μ) }
μ_ML = (1/N) Σ_{n=1}^N x_n = m/N

ML Parameter Estimation for Bernoulli (2)
Example: D = {1, 1, 1} gives μ_ML = 1, i.e., the prediction that all future tosses will land heads up — overfitting to D.

Beta Distribution
A distribution over μ ∈ [0, 1]:
Beta(μ | a, b) = ( Γ(a + b) / (Γ(a) Γ(b)) ) μ^{a−1} (1 − μ)^{b−1}
E[μ] = a / (a + b);   var[μ] = a b / ( (a + b)^2 (a + b + 1) )

[Figure: Beta densities for several settings of (a, b).]

Bayesian Bernoulli
p(μ | a_0, b_0, D) ∝ p(D | μ) p(μ | a_0, b_0) ∝ [ Π_{n=1}^N μ^{x_n} (1 − μ)^{1−x_n} ] Beta(μ | a_0, b_0)
= Beta(μ | a_N, b_N), with a_N = a_0 + m and b_N = b_0 + (N − m).
The Beta distribution provides the conjugate prior for the Bernoulli distribution:
prior × likelihood function → posterior.

Properties of the Posterior
As the size N of the data set increases, a_N → m and b_N → N − m, so
E[μ | D] = a_N / (a_N + b_N) → m/N = μ_ML, and var[μ | D] → 0.
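The conjugate Beta-Bernoulli update above can be checked numerically. Below is a minimal NumPy sketch (not from the original slides; the function name and the prior values a_0 = b_0 = 2 are illustrative):

```python
import numpy as np

def beta_bernoulli_update(a0, b0, tosses):
    """Conjugate update: Beta(a0, b0) prior + Bernoulli observations
    -> Beta(a0 + m, b0 + N - m) posterior, where m = number of heads."""
    m = int(np.sum(tosses))
    N = len(tosses)
    aN, bN = a0 + m, b0 + (N - m)
    # The posterior mean is also the predictive probability of heads:
    # p(x = 1 | D) = E[mu | D] = aN / (aN + bN)
    return aN, bN, aN / (aN + bN)

# Three heads in a row: maximum likelihood would give mu = 1, but the
# Beta(2, 2) prior keeps the predictive probability away from the extreme.
aN, bN, p_heads = beta_bernoulli_update(2, 2, [1, 1, 1])
```

With a_0 = b_0 = 2 and D = {1, 1, 1}, the posterior is Beta(5, 2) and the predictive probability of heads is 5/7, illustrating how the prior smooths the μ_ML = 1 overfit.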
Prediction under the Posterior
What is the probability that the next coin toss will land heads up?
p(x = 1 | a_0, b_0, D) = ∫_0^1 p(x = 1 | μ) p(μ | a_0, b_0, D) dμ = ∫_0^1 μ p(μ | a_0, b_0, D) dμ
= E[μ | a_0, b_0, D] = (m + a_0) / (N + a_0 + b_0)
This is the predictive posterior distribution.

Multinomial Variables
1-of-K coding scheme: x = (0, 0, 1, 0, 0, 0)^T
p(x | μ) = Π_{k=1}^K μ_k^{x_k}, with μ_k ≥ 0 and Σ_{k=1}^K μ_k = 1
E[x | μ] = Σ_x p(x | μ) x = (μ_1, ..., μ_K)^T = μ

ML Parameter Estimation
Given D = {x_1, ..., x_N}, ensure Σ_k μ_k = 1 using a Lagrange multiplier λ:
Σ_{k=1}^K m_k ln μ_k + λ ( Σ_{k=1}^K μ_k − 1 )  ⇒  μ_k^{ML} = m_k / N

The Multinomial Distribution
Mult(m_1, m_2, ..., m_K | μ, N) = ( N! / (m_1! m_2! ... m_K!) ) Π_{k=1}^K μ_k^{m_k},  with Σ_k m_k = N
E[m_k] = N μ_k;  var[m_k] = N μ_k (1 − μ_k);  cov[m_j, m_k] = −N μ_j μ_k

The Dirichlet Distribution
Dir(μ | α) = ( Γ(α_0) / (Γ(α_1) ... Γ(α_K)) ) Π_{k=1}^K μ_k^{α_k − 1},  α_0 = Σ_{k=1}^K α_k
The conjugate prior for the multinomial distribution.

Bayesian Multinomial
p(μ | D, α) ∝ p(D | μ) p(μ | α) ∝ Π_{k=1}^K μ_k^{α_k + m_k − 1}
p(μ | D, α) = Dir(μ | α + m) = ( Γ(α_0 + N) / (Γ(α_1 + m_1) ... Γ(α_K + m_K)) ) Π_{k=1}^K μ_k^{α_k + m_k − 1}

CS 59000 Statistical Machine Learning
Lecture 23
Yuan (Alan) Qi, Purdue CS
Fall 2008

Outline
- Review of the max-sum algorithm
- K-means clustering; K-medoids
- Mixture of Gaussians
- Vector quantization
- Expectation maximization (EM); alternative view of EM

The Max-Sum Algorithm (1)
Objective: an efficient algorithm for finding (i) the value x^max that maximises p(x), and (ii) the value of p(x^max).
In general, maximizing the marginals is not the same as maximizing the joint: arg max_x p(x) need not agree with the x-component of arg max_{x,y} p(x, y).

The Max-Sum Algorithm (2)
Maximizing over a chain (max-product):
max_x p(x) = (1/Z) max_{x_1} ... max_{x_N} [ ψ_{1,2}(x_1, x_2) ... ψ_{N−1,N}(x_{N−1}, x_N) ]
= (1/Z) max_{x_1} [ max_{x_2} [ ψ_{1,2}(x_1, x_2) ... max_{x_N} ψ_{N−1,N}(x_{N−1}, x_N) ] ]

The Max-Sum Algorithm (4)
Max-product → max-sum: for numerical reasons, use the logarithm, ln max_x p(x) = max_x ln p(x).
Again use the distributive law: max(a + b, a + c) = a + max(b, c).

The Max-Sum Algorithm (5)
Initialization (leaf nodes): μ_{x→f}(x) = 0,  μ_{f→x}(x) = ln f(x).
Recursion:
μ_{f→x}(x) = max_{x_1,...,x_M} [ ln f(x, x_1, ..., x_M) + Σ_{m ∈ ne(f)\x} μ_{x_m→f}(x_m) ]
φ(x) = arg max_{x_1,...,x_M} [ ln f(x, x_1, ..., x_M) + Σ_{m ∈ ne(f)\x} μ_{x_m→f}(x_m) ]
μ_{x→f}(x) = Σ_{l ∈ ne(x)\f} μ_{f_l→x}(x)

The Max-Sum Algorithm (6)
Termination (root node):
p^max = max_x [ Σ_{s ∈ ne(x)} μ_{f_s→x}(x) ],  x^max = arg max_x [ Σ_{s ∈ ne(x)} μ_{f_s→x}(x) ]

K-means Clustering
Goal: given a data set {x_1, ..., x_N} in D-dimensional Euclidean space, partition it into K clusters, where K is given.
1-of-K coding: indicator variables r_nk ∈ {0, 1}, k = 1, ..., K, describe which of the K clusters data point x_n is assigned to.

Cost Function
Sum of squared errors:
J = Σ_{n=1}^N Σ_{k=1}^K r_nk ||x_n − μ_k||^2
The goal is to find values for the r_nk and the μ_k so as to minimize J. This can be done by an iterative procedure; each iteration has two steps, successively optimizing with respect to the r_nk and the μ_k.

Two-Stage Updates
First choose initial values for the μ_k. In the first phase, minimize J with respect to the r_nk, keeping the μ_k fixed; in the second phase, minimize J with respect to the μ_k, keeping the r_nk fixed.

Optimizing Cluster Assignments
Because J is a linear function of the r_nk, this optimization has a closed-form solution, and terms involving different n are independent, so we can optimize for each n separately: choose r_nk = 1 for whichever value of k gives the minimum of ||x_n − μ_k||^2. Thus
r_nk = 1 if k = arg min_j ||x_n − μ_j||^2, and 0 otherwise.
Interpretation: assign x_n to the cluster whose mean is closest.

Optimizing Cluster Centers
Hold the r_nk fixed. J is a quadratic function of μ_k; minimize it by setting the derivative with respect to μ_k to zero:
2 Σ_{n=1}^N r_nk (x_n − μ_k) = 0  ⇒  μ_k = Σ_n r_nk x_n / Σ_n r_nk
The denominator equals the number of points assigned to cluster k.
Interpretation: set μ_k to the mean of all data points x_n assigned to cluster k.

[Figure: example iterations of K-means clustering.]

Stochastic Online Clustering
Instead of batch-processing the entire data set, apply the Robbins-Monro procedure to finding the roots of the regression function given by the derivative of J with respect to μ_k:
μ_k^new = μ_k^old + η_n ( x_n − μ_k^old )
where η_n is a learning-rate parameter, made to decrease monotonically as more samples are observed.

K-medoids Algorithm
Euclidean distance has limitations: it is inappropriate for categorical labels, and cluster means are non-robust to outliers. Use a more general dissimilarity measure V(x, x') and distortion measure
J̃ = Σ_{n=1}^N Σ_{k=1}^K r_nk V(x_n, μ_k)
which gives the K-medoids algorithm.

K-Means and Vector Quantization
[Figure: an original image and its reconstructions for small numbers of codebook vectors K.]
The codebook vectors μ_k can be used for lossy data compression.
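The two-stage K-means updates described above can be sketched in a few lines of NumPy (an illustrative implementation with made-up data, not course-provided code):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Coordinate descent on J = sum_n sum_k r_nk ||x_n - mu_k||^2."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]   # initial centers
    for _ in range(n_iter):
        # Phase 1: assign each point to its nearest center (the r_nk)
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = d.argmin(axis=1)
        # Phase 2: move each center to the mean of its assigned points
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                # J stopped decreasing
            break
        mu = new_mu
    return mu, r

# Two well-separated synthetic blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
mu, r = kmeans(X, 2)
```

Because each phase can only decrease J, the loop terminates; on well-separated blobs the centers land on the blob means regardless of which points are drawn for initialization.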
Limitations of K-means
Every data point is assigned uniquely to one and only one cluster, yet a point may be equidistant from two cluster centers. A probabilistic approach gives a soft assignment of data points, reflecting the level of uncertainty.

Mixture of Gaussians
p(x) = Σ_{k=1}^K π_k N(x | μ_k, Σ_k)
Introduce latent variables z with a 1-of-K representation; the marginal distribution is
p(x) = Σ_z p(z) p(x | z) = Σ_{k=1}^K π_k N(x | μ_k, Σ_k)

Conditional Probability
γ(z_k) ≡ p(z_k = 1 | x) = π_k N(x | μ_k, Σ_k) / Σ_{j=1}^K π_j N(x | μ_j, Σ_j)
is the responsibility that component k takes for explaining the observation x.

Maximum Likelihood
Maximize the log-likelihood function
ln p(X | π, μ, Σ) = Σ_{n=1}^N ln { Σ_{k=1}^K π_k N(x_n | μ_k, Σ_k) }
[Figure: graphical representation of a Gaussian mixture model for a set of N i.i.d. data points x_n with corresponding latent variables z_n, n = 1, ..., N.]

Severe Overfitting by Maximum Likelihood
When a cluster has only one data point, its variance goes to 0 and the likelihood diverges — an easy but degenerate "solution".

Identifiability
For any given maximum likelihood solution, a K-component mixture has a total of K! equivalent solutions, corresponding to the K! ways of assigning K sets of parameters to K components. Why? If we permute the labels of the components, we obtain an equivalent model. This is a difficulty for parameter interpretation, but it is irrelevant for density estimation.

Maximum Likelihood Conditions (1)
Setting the derivative of ln p(X | π, μ, Σ) with respect to μ_k to zero:
0 = −Σ_{n=1}^N γ(z_nk) Σ_k^{−1} (x_n − μ_k),  γ(z_nk) = π_k N(x_n | μ_k, Σ_k) / Σ_j π_j N(x_n | μ_j, Σ_j)
⇒ μ_k = (1/N_k) Σ_{n=1}^N γ(z_nk) x_n,  with N_k = Σ_{n=1}^N γ(z_nk)

Maximum Likelihood Conditions (2)
Setting the derivative with respect to Σ_k to zero:
Σ_k = (1/N_k) Σ_{n=1}^N γ(z_nk) (x_n − μ_k)(x_n − μ_k)^T

Maximum Likelihood Conditions (3)
Lagrange function: ln p(X | π, μ, Σ) + λ ( Σ_{k=1}^K π_k − 1 ).
Setting its derivative to zero and using the normalization constraint, we obtain π_k = N_k / N.

Expectation Maximization for Mixtures of Gaussians
Although the previous conditions do not provide a closed-form solution, we can use them to construct iterative updates:
E step: compute the responsibilities γ(z_nk).
M step: compute the new means μ_k, covariances Σ_k, and mixing coefficients π_k.
Loop over the E and M steps until the log-likelihood ln p(X | μ, Σ, π) stops increasing.

[Figure: EM on the Old Faithful data set.]

General EM Algorithm
Given a joint distribution p(X, Z | θ) over observed variables X and latent variables Z, governed by parameters θ, the goal is to maximize the likelihood function with respect to θ.
1. Choose an initial setting for the parameters θ^old.
2. E step: evaluate p(Z | X, θ^old).
3. M step: evaluate θ^new = arg max_θ Q(θ, θ^old), where Q(θ, θ^old) = Σ_Z p(Z | X, θ^old) ln p(X, Z | θ).
4. Check for convergence of either the log-likelihood or the parameter values. If the convergence criterion is not satisfied, let θ^old ← θ^new and return to step 2.

EM as a Lower-Bounding Method
Goal: maximize p(X | θ) = Σ_Z p(X, Z | θ). Define
L(q, θ) = Σ_Z q(Z) ln{ p(X, Z | θ) / q(Z) },  KL(q || p) = −Σ_Z q(Z) ln{ p(Z | X, θ) / q(Z) }
Then ln p(X | θ) = L(q, θ) + KL(q || p).

Lower Bound
L(q, θ) is a functional of the distribution q(Z). Since KL(q || p) ≥ 0 and ln p(X | θ) = L(q, θ) + KL(q || p), L(q, θ) is a lower bound on the log-likelihood function ln p(X | θ).
[Figure: decomposition of ln p(X | θ) into L(q, θ) and KL(q || p).]

Lower-Bound Perspective of EM
Expectation step: maximize the functional lower bound L(q, θ) over the distribution q(Z).
Maximization step: maximize the lower bound over the parameters θ.
[Figure: illustration of the EM updates, θ^old → θ^new.]

CS 59000 Statistical Machine Learning
Lecture 8
Yuan (Alan) Qi, Purdue CS
Sept. 2008

Outline
- Review of regularized regression / Bayesian regression
- Equivalent kernel and model comparison
- Linear classification: discriminant functions, probabilistic generative models, probabilistic discriminative models

Regularized Least Squares
Adding a regularization term to the error controls overfitting:
E(w) = E_D(w) + λ E_W(w)
where λ is the regularization coefficient that controls the relative importance of the data-dependent error E_D(w) and the regularization term E_W(w). A simple form of regularizer is E_W(w) = (1/2) w^T w, so the total error function becomes
(1/2) Σ_{n=1}^N { t_n − w^T φ(x_n) }^2 + (λ/2) w^T w
This is called weight decay because in sequential learning, weight values decay towards zero unless supported by data.
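Minimizing this quadratic objective gives the standard regularized least-squares solution w* = (λI + Φ^T Φ)^{−1} Φ^T t. A small NumPy sketch (illustrative; the sin-curve data and λ values are made up):

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Closed-form minimizer of the weight-decay objective:
    w* = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Fit a 9th-order polynomial to noisy sin data, with and without decay
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, 10)
Phi = np.vander(x, 10, increasing=True)    # basis 1, x, ..., x^9
w_ml = ridge_fit(Phi, t, lam=1e-12)        # essentially unregularized
w_reg = ridge_fit(Phi, t, lam=1e-3)        # with weight decay
```

The nearly unregularized fit interpolates the noisy points and its coefficients blow up, while even a small λ keeps the coefficient magnitudes tame — the weight-decay effect described above.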
More Regularizers
A more general regularized error:
(1/2) Σ_{n=1}^N { t_n − w^T φ(x_n) }^2 + (λ/2) Σ_{j=1}^M |w_j|^q
where q = 2 corresponds to the quadratic regularizer and q = 1 is known as the lasso. Regularization allows complex models to be trained on small data sets without severe overfitting.
[Figure: contours of the regularization term Σ_j |w_j|^q for several values of q.]

Visualization of Regularized Regression
[Figure: blue contours of the unregularized error function and the constraint region, for the quadratic (q = 2) and lasso (q = 1) regularizers.]
With q = 1 and λ sufficiently large, some of the coefficients w_j are driven exactly to zero, giving sparse models in which the corresponding basis functions play no role.

Bayesian Linear Regression
Place a prior probability distribution over the model parameters w; assume the precision β is known. Since the likelihood function p(t | w) with Gaussian noise has an exponential form, the conjugate prior is Gaussian:
p(w) = N(w | m_0, S_0)
with mean m_0 and covariance S_0.

Posterior Distribution of the Parameters
Given by the product of the likelihood function and the prior: p(w | D) = p(D | w) p(w) / p(D). Due to the choice of a conjugate Gaussian prior, the posterior is also Gaussian and can be written directly in the form
p(w | t) = N(w | m_N, S_N)
where m_N = S_N ( S_0^{−1} m_0 + β Φ^T t ) and S_N^{−1} = S_0^{−1} + β Φ^T Φ.

Predictive Posterior Distribution
We are usually not interested in the value of w itself, but in predicting t for new values of x. We evaluate the predictive distribution
p(t | x, t, α, β) = N( t | m_N^T φ(x), σ_N^2(x) )
where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x): the noise in the data, plus the uncertainty associated with the parameters w.

Examples of the Predictive Distribution
[Figure: the mean of the Gaussian predictive distribution with one-standard-deviation bands, for N = 1, 2, 4, 25 data points.]
The predictive uncertainty depends on x and is smallest in the neighborhood of the data points; the level of uncertainty decreases as more data points are observed.

Equivalent Kernel
Given m_N = β S_N Φ^T t, the predictive mean is
y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T t = Σ_{n=1}^N k(x, x_n) t_n
where k(x, x') = β φ(x)^T S_N φ(x').
Covariance between two predictions:
cov[ y(x), y(x') ] = cov[ φ(x)^T w, w^T φ(x') ] = φ(x)^T S_N φ(x') = β^{−1} k(x, x')
The predictive means at nearby points will be highly correlated, whereas for more distant pairs of points the correlation will be smaller.

Bayesian Model Comparison
Suppose we want to compare models M_i. Given a training set D, we compute
p(M_i | D) ∝ p(M_i) p(D | M_i)
where p(D | M_i) is the model evidence, also known as the marginal likelihood. The Bayes factor is p(D | M_i) / p(D | M_j).

Likelihood, Parameter Posterior & Evidence
Likelihood and evidence: p(D | M_i) = ∫ p(D | w, M_i) p(w | M_i) dw
Parameter posterior distribution and evidence: p(w | D, M_i) = p(D | w, M_i) p(w | M_i) / p(D | M_i)

Crude Evidence Approximation
Assume the posterior distribution is centered around its mode w_MAP:
p(D) = ∫ p(D | w) p(w) dw ≈ p(D | w_MAP) ( Δw_posterior / Δw_prior )
Given M parameters:
ln p(D) ≈ ln p(D | w_MAP) + M ln( Δw_posterior / Δw_prior )
The evidence penalizes over-complex models; maximizing the evidence leads to a natural trade-off between data fitting and model complexity.

Evidence Approximation & Empirical Bayes
Approximate the evidence by maximizing the marginal likelihood:
p(t | t) = ∫∫∫ p(t | w, β) p(w | t, α, β) p(α, β | t) dw dα dβ ≈ ∫ p(t | w, β̂) p(w | t, α̂, β̂) dw
where the hyperparameters α̂, β̂ maximize the evidence p(t | α, β). This is known as empirical Bayes, or type-2 maximum likelihood.

Model Evidence and Cross-Validation
[Figure: fitting polynomial regression models — root-mean-square error and model evidence as functions of the order M.]

Linear Classification
Goal: given an input vector x, assign it to one of K disjoint, discrete classes based on training data. The input space is divided by decision boundaries, or decision surfaces. In linear classification, the decision boundaries are linear functions of the input vector x.

Three Approaches
- Discriminant functions: directly assign an input vector to a specific class.
- Probabilistic generative models: model the class-conditional densities p(x | C_k) and use Bayes' rule, p(C_k | x) = p(x | C_k) p(C_k) / p(x).
- Probabilistic discriminative models: model the posteriors p(C_k | x) directly.

[Figure: the distance from x to the decision surface.]

Fisher's Linear Discriminant
Find a projection onto a line such that samples from different classes are well separated.
[Figure: an example in 2D — a bad projection line mixes the classes up, while a good projection line separates them well.]
Linear projection: let the line direction be given by a unit vector v.
The scalar v^T x_i is the distance of the projection of x_i from the origin; thus v^T x_i is the projection of x_i into a one-dimensional subspace.

A Naive Choice of Separation Measure
Let μ̃_1 and μ̃_2 be the means of the projections of classes 1 and 2, and let μ_1 and μ_2 be the means of classes 1 and 2. |μ̃_1 − μ̃_2| seems like a good measure:
μ̃_1 = (1/n_1) Σ_{x_i ∈ C_1} v^T x_i = v^T ( (1/n_1) Σ_{x_i ∈ C_1} x_i ) = v^T μ_1, and similarly μ̃_2 = v^T μ_2.

Problem with the Naive Separation Criterion
How good is |μ̃_1 − μ̃_2| as a measure of separation? The larger |μ̃_1 − μ̃_2|, the better the expected separation — but this ignores the spread of the data.
[Figure: the vertical axis is a better direction than the horizontal axis for class separability, even though the projected means are farther apart along the horizontal axis.]
The missing factor in this consideration is the variances of the projected classes.

Scatter of the Data in Each Class
Define the scatter of a sample as s^2 = Σ_i (y_i − μ̃)^2; scatter is just the sample variance multiplied by n. It measures the same thing as variance — the spread of the data around the mean — just on a different scale.
[Figure: two data sets with larger and smaller scatter.]

Fisher's Solution: Normalization by Scatter
Normalize |μ̃_1 − μ̃_2| by the scatter. Let y_i = v^T x_i, i.e., the y_i are the projected samples. The scatter for the projected samples of class 1 is s̃_1^2 = Σ_{y_i ∈ Class 1} (y_i − μ̃_1)^2, and for class 2 it is s̃_2^2 = Σ_{y_i ∈ Class 2} (y_i − μ̃_2)^2.

Next Class
Perceptron; probabilistic generative models; probabilistic discriminative models.

CS 59000 Statistical Machine Learning
Lecture 25
Yuan (Alan) Qi, Purdue CS
Nov. 25, 2008

Outline
- Review of hidden Markov models: forward-backward algorithm
- EM for learning HMM parameters
- Viterbi algorithm
- Kalman filtering and smoothing
- Rejection sampling; importance sampling
- Metropolis-Hastings algorithm; Gibbs sampling

Hidden Markov Models
Many applications: speech recognition, natural language processing, handwriting recognition, biological sequence analysis.

From Mixture Models to HMMs
By turning a mixture model into a dynamic model, we obtain the HMM. Model the dependence between two consecutive latent variables by a transition probability:
p(z_n | z_{n−1}, A) = Π_{k=1}^K Π_{j=1}^K A_{jk}^{z_{n−1,j} z_{nk}}
Prior on the initial latent variable:
p(z_1 | π) = Π_{k=1}^K π_k^{z_1k}
Emission probabilities:
p(x_n | z_n, φ) = Π_{k=1}^K p(x_n | φ_k)^{z_nk}
Joint distribution:
p(X, Z | θ) = p(z_1 | π) [ Π_{n=2}^N p(z_n | z_{n−1}, A) ] Π_{n=1}^N p(x_n | z_n, φ)

Samples from an HMM
[Figure: (a) contours of constant probability density for the emission distributions corresponding to each of the three states of the latent variable; (b) a sample of 50 points drawn from the hidden Markov model, with lines connecting the successive observations.]

Inference: Forward-Backward Algorithm
Goal: compute the marginals for the latent variables. The forward-backward algorithm performs exact inference as a special case of the sum-product algorithm on the HMM. In the factor graph representation, group the emission density and transition probability into one factor at a time:
h(z_1) = p(z_1) p(x_1 | z_1),  f_n(z_{n−1}, z_n) = p(z_n | z_{n−1}) p(x_n | z_n)

Forward-Backward as Message Passing (1)
Forward messages:
α(z_n) ≡ μ_{f_n → z_n}(z_n) = Σ_{z_{n−1}} f_n(z_{n−1}, z_n) α(z_{n−1}) = p(x_n | z_n) Σ_{z_{n−1}} p(z_n | z_{n−1}) α(z_{n−1})

Forward-Backward as Message Passing (2)
Backward messages:
β(z_n) ≡ μ_{f_{n+1} → z_n}(z_n) = Σ_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)
The products of messages give the marginals:
γ(z_n) ∝ α(z_n) β(z_n),  p(X) = Σ_{z_n} α(z_n) β(z_n)
and similarly the pairwise marginals
ξ(z_{n−1}, z_n) = α(z_{n−1}) p(x_n | z_n) p(z_n | z_{n−1}) β(z_n) / p(X)

Rescaling to Avoid Underflow
When a sequence is long, the forward messages become too small to be represented within the dynamic range of the computer. We redefine the forward message α(z_n) = p(x_1, ..., x_n, z_n) as
α̂(z_n) = α(z_n) / p(x_1, ..., x_n) = p(z_n | x_1, ..., x_n)
and similarly redefine the backward message β(z_n) = p(x_{n+1}, ..., x_N | z_n) as
β̂(z_n) = β(z_n) / p(x_{n+1}, ..., x_N | x_1, ..., x_n)
Then, with c_n = p(x_n | x_1, ..., x_{n−1}), we can compute
γ(z_n) = α̂(z_n) β̂(z_n),  ξ(z_{n−1}, z_n) = c_n^{−1} α̂(z_{n−1}) p(x_n | z_n) p(z_n | z_{n−1}) β̂(z_n)
See the detailed derivation in the textbook.

Viterbi Algorithm
Finding the most probable sequence of states: a special case of the max-sum algorithm on the HMM. Note that this is different from finding the most probable individual states.
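The rescaled forward-backward recursions above can be written directly in NumPy. Below is a toy sketch with an illustrative 2-state, binary-emission HMM (all parameter values are made up, not from the notes):

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Rescaled forward-backward: alpha_hat[n] = p(z_n | x_1..x_n),
    c[n] = p(x_n | x_1..x_{n-1}), gamma = alpha_hat * beta_hat."""
    N, K = len(obs), len(pi)
    alpha = np.zeros((N, K)); beta = np.ones((N, K)); c = np.zeros(N)
    a = pi * B[:, obs[0]]                      # alpha(z_1) before scaling
    c[0] = a.sum(); alpha[0] = a / c[0]
    for n in range(1, N):                      # forward pass
        a = (alpha[n - 1] @ A) * B[:, obs[n]]
        c[n] = a.sum(); alpha[n] = a / c[n]
    for n in range(N - 2, -1, -1):             # backward pass
        beta[n] = (A @ (B[:, obs[n + 1]] * beta[n + 1])) / c[n + 1]
    gamma = alpha * beta                       # posterior state marginals
    loglik = np.log(c).sum()                   # ln p(X) = sum_n ln c_n
    return gamma, loglik

# Toy 2-state HMM with binary emissions; B[k, x] = p(x | z = k)
pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
gamma, loglik = forward_backward(pi, A, B, [0, 0, 1, 0])
```

Each row of gamma sums to one, since γ(z_n) is the posterior p(z_n | X), and the product of the scaling factors c_n recovers the likelihood p(X) without underflow.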
Maximum Likelihood Estimation for HMM
Goal: maximize p(X | θ) = Σ_Z p(X, Z | θ) — going from EM for mixtures of Gaussians to EM for HMMs.

EM for HMM
E step: compute γ(z_n) = p(z_n | X, θ^old) and ξ(z_{n−1}, z_n) = p(z_{n−1}, z_n | X, θ^old) with the forward-backward (sum-product) algorithm.
M step: re-estimate the parameters:
π_k = γ(z_1k) / Σ_j γ(z_1j),  A_jk = Σ_{n=2}^N ξ(z_{n−1,j}, z_nk) / Σ_l Σ_{n=2}^N ξ(z_{n−1,j}, z_nl)
and, for Gaussian emissions,
μ_k = Σ_n γ(z_nk) x_n / Σ_n γ(z_nk),  Σ_k = Σ_n γ(z_nk) (x_n − μ_k)(x_n − μ_k)^T / Σ_n γ(z_nk)

Linear Dynamical Systems
p(z_n | z_{n−1}) = N(z_n | A z_{n−1}, Γ),  p(x_n | z_n) = N(x_n | C z_n, Σ),  p(z_1) = N(z_1 | μ_0, V_0)
Equivalently,
z_n = A z_{n−1} + w_n,  x_n = C z_n + v_n,  z_1 = μ_0 + u
where w ~ N(w | 0, Γ), v ~ N(v | 0, Σ), and u ~ N(u | 0, V_0).

Kalman Filtering and Smoothing
Inference on linear Gaussian systems. Kalman filtering sequentially updates the scaled forward message
α̂(z_n) = p(z_n | x_1, ..., x_n) = α(z_n) / p(x_1, ..., x_n)
Kalman smoothing sequentially updates the state beliefs based on the scaled forward and backward messages:
γ(z_n) = α̂(z_n) β̂(z_n)

Sampling Methods
Goal: compute E[f] = ∫ f(z) p(z) dz. The challenge is that we cannot compute this integral analytically. Sampling methods draw random samples z^(l) from p(z) such that
f̂ = (1/L) Σ_{l=1}^L f(z^(l))

Rejection Sampling
Goal: draw samples from a complicated distribution p(z) = p̃(z)/Z_p, where p̃(z) can readily be evaluated but Z_p is unknown. In the rejection sampling method, samples are drawn from a simple distribution q(z) and rejected if they fall in the grey area between the unnormalized distribution p̃(z) and the scaled distribution k q(z). The resulting samples are distributed according to p(z), the normalized version of p̃(z).
[Figure: the gamma distribution as the target (green curve), with a scaled Cauchy proposal distribution (red curve); samples from the gamma distribution are obtained by sampling from the Cauchy and then applying the rejection sampling criterion.]

Limitations of Rejection Sampling
When the proposal distribution is very different from the unnormalized target distribution, many samples are wasted; this makes the method difficult to use for high-dimensional problems.

Importance Sampling (1)
E[f] = ∫ f(z) p(z) dz = ∫ f(z) ( p(z) / q(z) ) q(z) dz ≈ (1/L) Σ_{l=1}^L ( p(z^(l)) / q(z^(l)) ) f(z^(l))
with importance weights r_l = p(z^(l)) / q(z^(l)).
Discussion: what would be an ideal proposal distribution q(z)?

Importance Sampling (2)
When p(z) = p̃(z)/Z_p with Z_p unknown (and likewise q(z) = q̃(z)/Z_q), we have
E[f] = ∫ f(z) p(z) dz = (Z_q/Z_p) ∫ f(z) ( p̃(z) / q̃(z) ) q(z) dz ≈ (Z_q/Z_p) (1/L) Σ_l r̃_l f(z^(l))
where r̃_l = p̃(z^(l)) / q̃(z^(l)).

Importance Sampling (3)
The ratio of normalizers can be estimated with the same samples:
Z_p/Z_q = (1/Z_q) ∫ p̃(z) dz = ∫ ( p̃(z) / q̃(z) ) q(z) dz ≈ (1/L) Σ_l r̃_l
so that
E[f] ≈ Σ_{l=1}^L w_l f(z^(l)),  with w_l = r̃_l / Σ_m r̃_m

Sampling and EM
The M step in EM maximizes Q(θ, θ^old) = ∫ p(Z | X, θ^old) ln p(Z, X | θ) dZ. What if we cannot even evaluate this integral? One idea is to use a sampling method:
Q(θ, θ^old) ≈ (1/L) Σ_{l=1}^L ln p(Z^(l), X | θ)
known as the Monte Carlo EM algorithm.

Imputation-Posterior (IP) Algorithm
A modification of EM for Bayesian estimation.
I step: we wish to sample from p(Z | X), but we cannot do this directly. We therefore note the relation
p(Z | X) = ∫ p(Z | θ, X) p(θ | X) dθ
and hence, for l = 1, ..., L, we first draw a sample θ^(l) from the current estimate of p(θ | X), and then use this to draw a sample Z^(l) from p(Z | θ^(l), X).
P step: given the relation
p(θ | X) = ∫ p(θ | Z, X) p(Z | X) dZ
we use the samples {Z^(l)} obtained from the I step to compute a revised estimate of the posterior distribution over θ:
p(θ | X) ≈ (1/L) Σ_{l=1}^L p(θ | Z^(l), X)
By assumption, it will be feasible to sample from this approximation in the I step.

Markov Chain Monte Carlo
Goal: use Markov chains to draw samples from a given distribution. Idea: set up a Markov chain that converges to the target distribution and draw samples from the chain.

Invariant Distribution and Detailed Balance
First-order Markov chain: p(z^(m+1) | z^(1), ..., z^(m)) = p(z^(m+1) | z^(m)), with transition probabilities T(z^(m), z^(m+1)) ≡ p(z^(m+1) | z^(m)).
Invariant distribution: a distribution is said to be invariant, or stationary, with respect to a Markov chain if each step in the chain leaves that distribution invariant.
Detailed balance, a sufficient condition for invariance:
p*(z) T(z, z') = p*(z') T(z', z)

Ergodicity and the Equilibrium Distribution
For m → ∞, the distribution p(z^(m)) converges to the required invariant distribution — i.e., the target distribution — irrespective of the choice of initial distribution, which may be different from the target distribution. This property is called ergodicity, and the invariant distribution is then called the equilibrium distribution.
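Detailed balance and invariance can be verified numerically for a small discrete chain. Below is an illustrative 3-state example (the target p, the uniform proposal, and the Metropolis-style acceptance rule are chosen for illustration, not taken from the notes):

```python
import numpy as np

# Target distribution over 3 states, and a symmetric (uniform) proposal
p = np.array([0.2, 0.3, 0.5])
q = np.full((3, 3), 1.0 / 3.0)

# Build the transition matrix T with acceptance prob min(1, p_j / p_i);
# rejected moves leave the chain in its current state.
T = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            T[i, j] = q[i, j] * min(1.0, p[j] / p[i])
    T[i, i] = 1.0 - T[i].sum()

# Detailed balance: p_i T_ij == p_j T_ji, i.e. the matrix p_i T_ij is symmetric
db = np.allclose(p[:, None] * T, (p[:, None] * T).T)
# Invariance follows: p T == p
inv = np.allclose(p @ T, p)
```

Here p_i T_ij = (1/3) min(p_i, p_j), which is symmetric in i and j, so detailed balance holds by construction, and summing the balance condition over i gives invariance, p T = p.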
Under weak restrictions on the invariant and transition distributions, a homogeneous Markov chain will be ergodic.

Metropolis-Hastings Algorithm
An iterative algorithm: at step τ, with state z^(τ), we draw a sample z* from a proposal distribution q_k(z | z^(τ)) and accept it with probability
A_k(z*, z^(τ)) = min( 1, [ p̃(z*) q_k(z^(τ) | z*) ] / [ p̃(z^(τ)) q_k(z* | z^(τ)) ] )
such that the detailed balance requirement is satisfied.

Choice of Proposal Distributions
- Local: Gaussian distributions centered on the current state. If the proposal is symmetric, the Metropolis-Hastings algorithm reduces to the Metropolis algorithm. If the variance is too small, the acceptance rate is high but the chain explores via slow random walks; if the variance is too large, the acceptance rate is low.
- Global: find a Laplace approximation to the posterior and sample from it; or fit a mixture of Gaussians and then sample from it.

Example
[Figure: a simple illustration using the Metropolis algorithm to sample from a Gaussian distribution whose one-standard-deviation contour is shown by an ellipse. The proposal distribution is an isotropic Gaussian with standard deviation 0.2. Accepted steps are shown as green lines and rejected steps in red; of 150 candidate samples generated, 43 are rejected.]

Gibbs Sampling
1. Initialize z_i, i = 1, ..., M.
2. For τ = 1, ..., T:
   sample z_1^(τ+1) ~ p(z_1 | z_2^(τ), ..., z_M^(τ));
   sample z_2^(τ+1) ~ p(z_2 | z_1^(τ+1), z_3^(τ), ..., z_M^(τ));
   ...
   sample z_j^(τ+1) ~ p(z_j | z_1^(τ+1), ..., z_{j−1}^(τ+1), z_{j+1}^(τ), ..., z_M^(τ));
   ...
   sample z_M^(τ+1) ~ p(z_M | z_1^(τ+1), ..., z_{M−1}^(τ+1)).

Gibbs Sampler as a Special Case of Metropolis-Hastings
Given the proposal distribution q_k(z* | z) = p(z_k* | z_{\k}), the acceptance rate is
A(z*, z) = [ p(z*) q_k(z | z*) ] / [ p(z) q_k(z* | z) ] = [ p(z_k* | z_{\k}) p(z_{\k}) p(z_k | z_{\k}) ] / [ p(z_k | z_{\k}) p(z_{\k}) p(z_k* | z_{\k}) ] = 1
so every sample from the proposal distribution in a Gibbs sampler is accepted.

CS 59000 Statistical Machine Learning
Lecture 13
Yuan (Alan) Qi, Purdue CS
Fall 2008

Outline
- Review of the kernel trick, kernel ridge regression, and kernel principal component analysis
- Gaussian processes (GPs)
- From linear regression to GPs
- GPs for regression

Kernel Trick
1. Reformulate an algorithm such that the input vector x enters only in the form of inner products x^T x'.
2. Replace the input x by its feature mapping: x → φ(x).
3. Replace the inner product by a kernel function: k(x, x') = φ(x)^T φ(x').
Examples: kernel PCA, kernel Fisher discriminant, support vector machines.

Dual Representation for Ridge Regression
J(w) = (1/2) Σ_{n=1}^N ( w^T φ(x_n) − t_n )^2 + (λ/2) w^T w
Setting the gradient to zero gives w = Σ_n a_n φ(x_n) = Φ^T a, with dual variables a_n = −(1/λ)( w^T φ(x_n) − t_n ). Substituting,
J(a) = (1/2) a^T Φ Φ^T Φ Φ^T a − a^T Φ Φ^T t + (1/2) t^T t + (λ/2) a^T Φ Φ^T a

Kernel Ridge Regression
Using the kernel trick, K = Φ Φ^T with K_nm = φ(x_n)^T φ(x_m) = k(x_n, x_m):
J(a) = (1/2) a^T K K a − a^T K t + (1/2) t^T t + (λ/2) a^T K a
Minimizing over the dual variables gives a = (K + λ I_N)^{−1} t, and
y(x) = w^T φ(x) = a^T Φ φ(x) = k(x)^T (K + λ I_N)^{−1} t

Generating Kernel Matrices
New valid (positive semidefinite) kernels can be constructed from old ones, e.g.:
k(x, x') = c k_1(x, x')
k(x, x') = f(x) k_1(x, x') f(x')
k(x, x') = exp( k_1(x, x') )
k(x, x') = k_1(x, x') + k_2(x, x')
k(x, x') = k_1(x, x') k_2(x, x')
k(x, x') = x^T A x'  (A positive semidefinite)
k(x, x') = k_a(x_a, x_a') + k_b(x_b, x_b')
k(x, x') = k_a(x_a, x_a') k_b(x_b, x_b')
Consider the Gaussian kernel k(x, x') = exp( −||x − x'||^2 / 2σ^2 ): expanding
||x − x'||^2 = x^T x + x'^T x' − 2 x^T x'
gives k(x, x') = exp( −x^T x / 2σ^2 ) exp( x^T x' / σ^2 ) exp( −x'^T x' / 2σ^2 ), which is valid by the rules above.

Principal Component Analysis (PCA)
Assume Σ_n x_n = 0. We have S u_i = λ_i u_i, with S = (1/N) Σ_{n=1}^N x_n x_n^T, where u_i is a normalized eigenvector, u_i^T u_i = 1.

Feature Mapping
C = (1/N) Σ_{n=1}^N φ(x_n) φ(x_n)^T; the eigenproblem in feature space is C v_i = λ_i v_i.

Dual Variables
Suppose λ_i > 0; then v_i = Σ_{n=1}^N a_in φ(x_n).

Eigenproblem in Feature Space (1)
(1/N) Σ_n φ(x_n) φ(x_n)^T Σ_m a_im φ(x_m) = λ_i Σ_n a_in φ(x_n)
Multiplying through by φ(x_l)^T:
(1/N) Σ_n k(x_l, x_n) Σ_m a_im k(x_n, x_m) = λ_i Σ_n a_in k(x_l, x_n)
i.e., K^2 a_i = λ_i N K a_i, whose non-trivial solutions satisfy K a_i = λ_i N a_i.

Eigenproblem in Feature Space (2)
Normalization condition:
1 = v_i^T v_i = Σ_n Σ_m a_in a_im φ(x_n)^T φ(x_m) = a_i^T K a_i = λ_i N a_i^T a_i
Projection coefficients:
y_i(x) = φ(x)^T v_i = Σ_{n=1}^N a_in φ(x)^T φ(x_n) = Σ_{n=1}^N a_in k(x, x_n)

General Case with Nonzero Mean
Kernel matrix:
K̃ = K − 1_N K − K 1_N + 1_N K 1_N
where 1_N denotes the N × N matrix in which every element takes the value 1/N.

Gaussian Processes
How kernels arise naturally in a Bayesian setting: instead of assigning a prior to the parameters w, we assign a prior to the function values y. This is an infinite-dimensional object in theory, but finite in practice — a finite number of training and test points.

Linear Regression Revisited
y(x) = w^T φ(x); letting y_n = y(x_n) = w^T φ(x_n), we have y = Φ w.
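The dual solution a = (K + λ I_N)^{−1} t derived above can be sketched with a Gaussian kernel (illustrative NumPy code; the sin-curve data and hyperparameter values are made up):

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    # K[n, m] = exp(-(x_n - x_m)^2 / (2 sigma^2)) for 1-D inputs
    return np.exp(-(X1[:, None] - X2[None, :]) ** 2 / (2.0 * sigma ** 2))

def kernel_ridge_fit(X, t, lam, sigma=1.0):
    # Dual variables: a = (K + lam * I_N)^{-1} t
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), t)

def kernel_ridge_predict(Xstar, X, a, sigma=1.0):
    # y(x*) = k(x*)^T a
    return rbf_kernel(Xstar, X, sigma) @ a

X = np.linspace(0.0, 2.0 * np.pi, 50)
t = np.sin(X)                              # noise-free toy targets
a = kernel_ridge_fit(X, t, lam=1e-4)
y_fit = kernel_ridge_predict(X, X, a)      # fitted values at training points
y_mid = kernel_ridge_predict(np.array([np.pi / 2]), X, a)
```

With a small λ the model nearly interpolates the training targets, and predictions at new inputs come purely from kernel evaluations against the training points — no explicit feature map φ(x) is ever formed.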
From Prior on Parameter to Prior on Function
p(w) = N(w | 0, alpha^{-1} I), y(x) = w^T phi(x).
The prior on the function values:
E[y] = 0
cov[y] = E[y y^T] = Phi E[w w^T] Phi^T = (1/alpha) Phi Phi^T = K,
where K_nm = k(x_n, x_m) = (1/alpha) phi(x_n)^T phi(x_m).

Stochastic Process
A stochastic process y(x) is specified by giving the joint distribution for any finite set of values y(x_1), ..., y(x_N) in a consistent manner. Loosely speaking, this means that a marginalized joint distribution is the same as the joint distribution defined in the subspace.

Gaussian Processes
The joint distribution of any N variables y_1, ..., y_N is a multivariate Gaussian distribution. Without any prior knowledge we often set the mean to 0; the GP is then specified by the covariance E[y_n y_m] = K_nm = k(x_n, x_m).

Impact of Kernel Function
The covariance matrix comes from the kernel function, e.g.
k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))   or   k(x, x') = exp(-theta |x - x'|).
[Figure: GP samples under the two kernels.]
Applications: economics and finance.

Gaussian Process for Regression
Likelihood: p(t_n | y_n) = N(t_n | y_n, beta^{-1}), so p(t | y) = N(t | y, beta^{-1} I_N).
Prior: p(y) = N(y | 0, K).
Marginal distribution: p(t) = integral p(t | y) p(y) dy = N(t | 0, C), where C = K + beta^{-1} I_N.

Samples of GP Prior over Functions
[Figure: samples drawn from GP priors with the kernel
k(x_n, x_m) = theta_0 exp(-(theta_1 / 2) ||x_n - x_m||^2) + theta_2 + theta_3 x_n^T x_m
for various settings of (theta_0, theta_1, theta_2, theta_3).]

Samples of Data Points t_n
[Figure.]

Predictive Distribution
p(t_{N+1}) = N(t_{N+1} | 0, C_{N+1}),  C_{N+1} = [[C_N, k], [k^T, c]]
p(t_{N+1} | t) is a Gaussian distribution with mean and variance
m(x_{N+1}) = k^T C_N^{-1} t
sigma^2(x_{N+1}) = c - k^T C_N^{-1} k

Predictive Mean
m(x_{N+1}) = sum_n a_n k(x_n, x_{N+1}), where a_n is the nth component of C_N^{-1} t.
We see the same form as in kernel ridge regression and kernel PCA.

GP Regression
[Figure: GP predictive mean and variance on toy data.]
Discussion: the difference between GP regression and Bayesian regression with Gaussian basis functions.

CS 59000 Statistical Machine Learning, Lecture 12
Yuan (Alan) Qi, Purdue CS, Fall 2008

Outline
Review of probit regression; Laplace approximation; BIC; Bayesian logistic regression; kernel methods; kernel ridge regression; kernel principal component analysis.

Probit Regression
The probit function:
p(t = 1 | a) = Phi(a) = integral_{-inf}^{a} N(theta | 0, 1) dtheta,  with a = w^T phi.
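The probit function above has a closed form in terms of the error function, Phi(a) = (1 + erf(a / sqrt(2))) / 2. A small sketch (the side-by-side logistic function is added only for comparison; it is not part of the slide):

```python
import math

def probit(a):
    # Phi(a) = integral_{-inf}^{a} N(theta | 0, 1) dtheta = (1 + erf(a / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def logistic(a):
    # sigma(a) = 1 / (1 + exp(-a)), for comparison
    return 1.0 / (1.0 + math.exp(-a))

# probit(0.0) == 0.5, exactly, matching the logistic sigmoid at a = 0
p0 = probit(0.0)
```

The probit's tails decay like exp(-a^2 / 2), faster than the logistic's exp(-|a|), which is one reason probit models are more sensitive to points mislabeled far from the decision boundary.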
Labeling Noise Model
p(t = 1 | x) = epsilon + (1 - 2 epsilon) sigma(x),
where sigma(x) is the activation function with input vector x and epsilon is the labeling-error rate. This model is robust to outliers and labeling errors.

Generalized Linear Models
Exponential-family conditional distribution:
p(t | eta, s) = (1/s) h(t/s) g(eta) exp(eta t / s)
with conditional mean y = E[t | eta] = -s (d/deta) ln g(eta).
Generalized linear model: y = f(w^T phi), with activation function f and link function f^{-1}.

Canonical Link Function
If we choose the canonical link function, the gradient of the error function takes the simple form
grad E(w) = (1/s) sum_n (y_n - t_n) phi_n.
For the Gaussian, s = beta^{-1}; for the logistic model, s = 1.

Examples
Distribution         Link function                 Mean function
Normal               identity: X b = mu            mu = X b
Exponential / Gamma  inverse                       mu = (X b)^{-1}
Inverse Gaussian     inverse squared               mu = (X b)^{-1/2}
Poisson              log: X b = ln mu              mu = exp(X b)
Binomial             logit: X b = ln(mu/(1 - mu))  mu = 1 / (1 + exp(-X b))
Multinomial          multinomial logit             mu_k = exp(X b_k) / sum_j exp(X b_j)

Laplace Approximation for Posterior
Gaussian approximation around the mode z_0:
ln f(z) ~ ln f(z_0) - (1/2)(z - z_0)^T A (z - z_0),  A = -grad grad ln f(z) at z = z_0
f(z) ~ f(z_0) exp(-(1/2)(z - z_0)^T A (z - z_0))
q(z) = N(z | z_0, A^{-1})

Evidence Approximation
Z = integral f(z) dz ~ f(z_0) (2 pi)^{M/2} |A|^{-1/2}
p(D) = integral p(D | theta) p(theta) dtheta
ln p(D) ~ ln p(D | theta_MAP) + ln p(theta_MAP) + (M/2) ln 2 pi - (1/2) ln |A|   (Occam factor)
A = -grad grad ln p(D | theta_MAP) p(theta_MAP)

Bayesian Information Criterion
An approximation of the Laplace approximation:
ln p(D) ~ ln p(D | theta_MAP) - (1/2) M ln N
A more accurate evidence approximation is often needed.

Bayesian Logistic Regression
ln p(w | t) = -(1/2)(w - m_0)^T S_0^{-1} (w - m_0) + sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)] + const
q(w) = N(w | w_MAP, S_N)
S_N^{-1} = S_0^{-1} + sum_n y_n (1 - y_n) phi_n phi_n^T

Kernel Methods
Predictions are linear combinations of a kernel function evaluated at the training data points.
Kernel function <-> feature-space mapping: k(x, x') = phi(x)^T phi(x').
Linear kernel: k(x, x') = x^T x'. Stationary kernels: k(x, x') = k(x - x').

Fast Evaluation of Inner Products of Feature Mappings by Kernel Functions
k(x, z) = (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2
        = (x_1^2, sqrt(2) x_1 x_2, x_2^2) (z_1^2, sqrt(2) z_1 z_2, z_2^2)^T = phi(x)^T phi(z)
The explicit inner product needs computing six feature values and 3 x 3 = 9 multiplications; the kernel function needs 2 multiplications and a squaring.

Kernel Trick
1. Reformulate the algorithm so that the input vector x enters
only in the form of inner products x^T x'.
2. Replace each input x by its feature mapping phi(x).
3. Replace the inner product by a kernel function: k(x, x') = phi(x)^T phi(x').
Examples: kernel PCA, kernel Fisher discriminant, support vector machines.

Dual Representation for Ridge Regression
J(w) = (1/2) sum_n (w^T phi(x_n) - t_n)^2 + (lambda/2) w^T w
w = sum_n a_n phi(x_n) = Phi^T a  (dual variables a)
J(a) = (1/2) a^T Phi Phi^T Phi Phi^T a - a^T Phi Phi^T t + (1/2) t^T t + (lambda/2) a^T Phi Phi^T a

Kernel Ridge Regression
With K_nm = phi(x_n)^T phi(x_m) = k(x_n, x_m):
J(a) = (1/2) a^T K K a - a^T K t + (1/2) t^T t + (lambda/2) a^T K a
Minimizing over the dual variables gives a = (K + lambda I_N)^{-1} t, and
y(x) = w^T phi(x) = a^T Phi phi(x) = k(x)^T (K + lambda I_N)^{-1} t.

Generating Kernel Matrices
The kernel matrix must be positive semidefinite. Valid constructions from kernels k1, k2:
k = c k1;  k = f(x) k1 f(x');  k = exp(k1);  k = k1 + k2;  k = k1 k2;
k = k3(phi(x), phi(x'));  k = x^T A x';  k = k_a(x_a, x'_a) + k_b(x_b, x'_b);  k = k_a k_b.
For the Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), expanding
||x - x'||^2 = x^T x + x'^T x' - 2 x^T x' gives
k(x, x') = exp(-x^T x / (2 sigma^2)) exp(x^T x' / sigma^2) exp(-x'^T x' / (2 sigma^2)).

Combining Generative and Discriminative Models by Kernels
Since each modeling approach has distinct advantages, how can we combine them? Use generative models to construct kernels, then use these kernels in discriminative approaches.

Measuring Probability Similarity by Kernels
Simple inner product: k(x, x') = p(x) p(x')
For a mixture distribution: k(x, x') = sum_i p(x | i) p(x' | i) p(i)
For infinite mixture models: k(x, x') = integral p(x | z) p(x' | z) p(z) dz
For models with latent variables (e.g., hidden Markov models): k(X, X') = sum_Z p(X | Z) p(X' | Z) p(Z)

Fisher Kernels
Fisher score: g(theta, x) = grad_theta ln p(x | theta)
Fisher information matrix: F = E_x[g(theta, x) g(theta, x)^T]
Fisher kernel: k(x, x') = g(theta, x)^T F^{-1} g(theta, x')
Sample average: F ~ (1/N) sum_n g(theta, x_n) g(theta, x_n)^T

Principal Component Analysis (PCA)
Assume sum_n x_n = 0. We have S u_i = lambda_i u_i, with S = (1/N) sum_n x_n x_n^T and u_i a normalized eigenvector (u_i^T u_j = delta_ij).

Feature Mapping
C = (1/N) sum_n phi(x_n) phi(x_n)^T
Eigenproblem in feature space: C v_i = lambda_i v_i.

Dual Variables
(1/N) sum_n phi(x_n) phi(x_n)^T v_i = lambda_i v_i
Suppose lambda_i > 0; then v_i = sum_n a_in phi(x_n).

Eigenproblem in Feature Space (1)
(1/N) sum_n phi(x_n) phi(x_n)^T sum_m a_im phi(x_m) = lambda_i sum_n a_in phi(x_n)
Multiplying both sides by phi(x_l)^T:
(1/N) sum_n k(x_l, x_n) sum_m a_im k(x_n, x_m) = lambda_i sum_n a_in k(x_l, x_n)
In matrix form, K^2 a_i = lambda_i N K a_i, hence K a_i = lambda_i N a_i.
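The eigenproblem K a_i = lambda_i N a_i above, together with the centered kernel matrix and the normalization a_i^T K a_i = 1 from the derivation, can be sketched numerically. The synthetic data and the Gaussian-kernel width are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def gauss_kernel(X1, X2, sigma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_pca(X, n_components=2, sigma=1.0):
    N = len(X)
    K = gauss_kernel(X, X, sigma)
    one_N = np.full((N, N), 1.0 / N)
    # centered kernel matrix: K~ = K - 1_N K - K 1_N + 1_N K 1_N
    Kc = K - one_N @ K - K @ one_N + one_N @ K @ one_N
    mu, A = np.linalg.eigh(Kc)            # eigh returns ascending eigenvalues mu = lambda N
    mu, A = mu[::-1], A[:, ::-1]          # reorder to descending
    # rescale so that a_i^T Kc a_i = 1 (eigh gives a_i^T a_i = 1)
    A = A[:, :n_components] / np.sqrt(mu[:n_components])
    # projection coefficients y_i(x_n) = sum_m a_im k~(x_n, x_m)
    return Kc @ A

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 3))
Y = kernel_pca(X, n_components=2)
```

Because Kc is double-centered, the projections have zero mean, and distinct components are uncorrelated.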
Eigenproblem in Feature Space (2)
Normalization condition:
1 = v_i^T v_i = sum_n sum_m a_in a_im phi(x_n)^T phi(x_m) = a_i^T K a_i = lambda_i N a_i^T a_i
Projection coefficient:
y_i(x) = phi(x)^T v_i = sum_n a_in phi(x)^T phi(x_n) = sum_n a_in k(x, x_n)

General Case for Nonzero Mean
Centered features: phi~(x_n) = phi(x_n) - (1/N) sum_l phi(x_l).
Kernel matrix:
K~ = K - 1_N K - K 1_N + 1_N K 1_N,
where 1_N denotes the N x N matrix in which every element takes the value 1/N.

Kernel PCA on Synthetic Data
[Figure: contour plots of projection coefficients in feature space, Gaussian kernel k(x, x') = exp(-||x - x'||^2 / 0.1).]

Limitations of Kernel PCA
Discussion.

CS 59000 Statistical Machine Learning, Lecture 9
Yuan (Alan) Qi, Purdue CS, Sept 23, 2008

Outline
Review of model comparison and Fisher's linear discriminant; linear classification; discriminant functions; probabilistic generative models; probabilistic discriminative models.

Likelihood, Parameter Posterior, and Evidence
Likelihood and evidence:
p(D | M_i) = integral p(D | w, M_i) p(w | M_i) dw
Parameter posterior distribution and evidence:
p(w | D, M_i) = p(D | w, M_i) p(w | M_i) / p(D | M_i)
The evidence penalizes overcomplex models. Given M parameters:
ln p(D) ~ ln p(D | w_MAP) + M ln(sigma_posterior / sigma_prior)
Maximizing evidence leads to a natural trade-off between data fitting and model complexity.

Evidence Approximation and Empirical Bayes
Approximate the evidence by maximizing the marginal likelihood:
p(t | T) = integral integral integral p(t | w, beta) p(w | T, alpha, beta) p(alpha, beta | T) dw dalpha dbeta
         ~ integral p(t | w, beta*) p(w | T, alpha*, beta*) dw,
where the hyperparameters alpha*, beta* maximize the evidence p(alpha, beta | T). This is known as empirical Bayes or type-2 maximum likelihood.

Model Evidence and Cross-Validation
Fitting polynomial regression models. [Figures: root-mean-square error and model evidence as functions of model order.]

Classification Approaches
Discriminant functions: directly assign an input vector to a specific class.
Probabilistic generative models: model the data-generation process p(x | C_k) and use Bayes' rule, p(C_k | x) proportional to p(x | C_k) p(C_k).
Probabilistic discriminative models: model the class posteriors p(C_k | x) directly.
[Figure: distance from x to the decision surface.]

Fisher's Linear Discriminant
Find a projection onto a line such that samples from different classes are well separated.
Example in 2D: on a bad projection line the classes are mixed up; on a good line the classes are well separated.
A Naive Choice of Separation Measure
Let mu~_1 and mu~_2 be the means of the projections of classes 1 and 2, and let mu_1 and mu_2 be the means of classes 1 and 2. Then |mu~_1 - mu~_2| seems like a good measure:
mu~_1 = (1/n_1) sum_{x_i in C_1} v^T x_i = v^T mu_1, and similarly mu~_2 = v^T mu_2.

Problem with the Naive Separation Criterion
How good is |mu~_1 - mu~_2| as a measure of separation? The larger |mu~_1 - mu~_2|, the better the expected separation, but the measure ignores the spread within each class: in the example, the vertical axis is a better line than the horizontal axis to project to for class separability, even though the projected means are farther apart on the horizontal axis.

Scatter of Data in Each Class
Define the scatter of class i as s_i^2 = sum (z - mu_i)^2 over the samples z of class i. Scatter is just the sample variance multiplied by n: it measures the same thing as variance, the spread of data around the mean, just on a different scale.

Solution: Normalization by Scatter
Fisher's solution: normalize |mu~_1 - mu~_2| by scatter. Let y_i = v^T x_i, i.e., the y_i are the projected samples. The scatter for projected samples of class 1 is s~_1^2 = sum_{y_i in class 1} (y_i - mu~_1)^2, and for class 2, s~_2^2 = sum_{y_i in class 2} (y_i - mu~_2)^2.

Fisher Linear Discriminant
Thus the Fisher linear discriminant projects onto the line in the direction v which maximizes
J(v) = (mu~_1 - mu~_2)^2 / (s~_1^2 + s~_2^2).
The numerator wants the projected means far from each other; the denominator wants the scatter within each class as small as possible, i.e., the samples of each class cluster around their projected mean.

Cost Function
If we find a v which makes J(v) large, we are guaranteed that the classes are well separated: the projected means are far apart, and small s~_1 and s~_2 imply that the projected samples of each class are clustered around their projected means.

Within-Class and Between-Class Scatter Matrices
Define the separate class scatter matrices S_1 and S_2 for classes 1
and 2. These measure the scatter of the original samples x before projection:
S_1 = sum_{x_i in C_1} (x_i - mu_1)(x_i - mu_1)^T
S_2 = sum_{x_i in C_2} (x_i - mu_2)(x_i - mu_2)^T
Now define the within-class scatter matrix S_W = S_1 + S_2 and the between-class scatter matrix S_B = (mu_1 - mu_2)(mu_1 - mu_2)^T.

Generalized Eigenvalue Problem
J(v) = (mu~_1 - mu~_2)^2 / (s~_1^2 + s~_2^2) = v^T S_B v / v^T S_W v
Maximize J(v) by taking the derivative w.r.t. v and setting it to 0, giving
S_B v = lambda S_W v.

Fisher's Linear Discriminant
If S_W has full rank, the inverse exists and we can convert this to a standard eigenvalue problem: S_W^{-1} S_B v = lambda v. But S_B x, for any vector x, points in the same direction as mu_1 - mu_2:
S_B x = (mu_1 - mu_2)(mu_1 - mu_2)^T x = alpha (mu_1 - mu_2).
Thus we can solve the eigenvalue problem immediately: v = S_W^{-1}(mu_1 - mu_2).

Example
[Figure: the projection that maximizes projected-mean separation vs. the FLD projection.]

Perceptron
Real-valued inputs x, weights w; generalized linear model y = f(w^T phi) with
f(a) = +1 if a >= 0, -1 otherwise.
Minimize
E_P(w) = - sum_{n in M} w^T phi_n t_n,
where M denotes the set of all misclassified patterns.
Stochastic gradient descent: w^(tau+1) = w^(tau) + eta phi_n t_n.
[Figures: perceptron learning steps on toy data.]

Probabilistic Generative Models
p(C_1 | x) = p(x | C_1) p(C_1) / (p(x | C_1) p(C_1) + p(x | C_2) p(C_2)) = 1 / (1 + exp(-a)) = sigma(a),
where a = ln [p(x | C_1) p(C_1) / (p(x | C_2) p(C_2))].

Gaussian Class-Conditional Densities
Conditional densities of the data:
p(x | C_k) = (1 / ((2 pi)^{D/2} |Sigma|^{1/2})) exp(-(1/2)(x - mu_k)^T Sigma^{-1} (x - mu_k))
The posterior distribution for the label/class:
p(C_1 | x) = sigma(w^T x + w_0)
w = Sigma^{-1}(mu_1 - mu_2)
w_0 = -(1/2) mu_1^T Sigma^{-1} mu_1 + (1/2) mu_2^T Sigma^{-1} mu_2 + ln(p(C_1) / p(C_2))

Maximum Likelihood Estimation
Let p(C_1) = pi, so that p(C_2) = 1 - pi; t_n = 1 denotes class C_1 and t_n = 0 denotes class C_2. Maximizing the likelihood function gives
pi = N_1 / N
mu_1 = (1/N_1) sum_n t_n x_n,   mu_2 = (1/N_2) sum_n (1 - t_n) x_n
Sigma = (N_1 / N) S_1 + (N_2 / N) S_2, with
S_1 = (1/N_1) sum_{n in C_1} (x_n - mu_1)(x_n - mu_1)^T,   S_2 = (1/N_2) sum_{n in C_2} (x_n - mu_2)(x_n - mu_2)^T

Discrete Features: Naive Bayes Classification
p(x | C_k) = prod_{i=1}^{D} mu_ki^{x_i} (1 - mu_ki)^{1 - x_i}
a_k(x) = sum_{i=1}^{D} [x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki)] + ln p(C_k)

Probabilistic Discriminative Models
Instead of modeling p(x | C_k), model p(C_k | x) directly.

Logistic Regression
p(C_1 | phi) = y(phi) = sigma(w^T phi),   p(C_2 | phi) = 1 - p(C_1 | phi)
Likelihood function:
p(t | w) = prod_n y_n^{t_n} (1 - y_n)^{1 - t_n},   y_n = p(C_1 | phi_n)

Maximum Likelihood Estimation
E(w) = -ln p(t | w) = - sum_n [t_n ln y_n
+ (1 - t_n) ln(1 - y_n)]
grad E(w) = sum_n (y_n - t_n) phi_n
Note that dsigma/da = sigma (1 - sigma).

Newton-Raphson Optimization
Let H denote the Hessian matrix:
w_new = w_old - H^{-1} grad E(w)
For linear regression:
grad E(w) = sum_n (w^T phi_n - t_n) phi_n = Phi^T Phi w - Phi^T t
H = grad grad E(w) = Phi^T Phi
w_new = w_old - (Phi^T Phi)^{-1} (Phi^T Phi w_old - Phi^T t) = (Phi^T Phi)^{-1} Phi^T t
It converges in one iteration for linear regression.

CS 59000 Statistical Machine Learning, Lecture 15
Yuan (Alan) Qi, Purdue CS, Fall 2008

Outline
Review of Gaussian processes (GPs); from linear regression to GPs; GPs for regression; learning hyperparameters; automatic relevance determination; GPs for classification.

Gaussian Processes
How kernels arise naturally in a Bayesian setting: instead of assigning a prior on the parameters w, we assign a prior on the function values y. Infinite space in theory; finite space in practice (a finite number of training and test points).

Linear Regression Revisited
y(x) = w^T phi(x). Let y_n = y(x_n); then y = Phi w.

From Prior on Parameter to Prior on Function
p(w) = N(w | 0, alpha^{-1} I), y(x) = w^T phi(x).
The prior on the function values: E[y] = 0, cov[y] = (1/alpha) Phi Phi^T = K.

Stochastic Process
A stochastic process y(x) is specified by giving the joint distribution for any finite set of values y(x_1), ..., y(x_N) in a consistent manner: a marginalized joint distribution is the same as the joint distribution defined in the subspace.

Gaussian Processes
The joint distribution of any N variables y_1, ..., y_N is a multivariate Gaussian. With zero mean, the GP is specified by the covariance E[y_n y_m] = K_nm = k(x_n, x_m).

Impact of Kernel Function
The covariance matrix comes from the kernel function, e.g. k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) or k(x, x') = exp(-theta |x - x'|). Applications: economics and finance.

Gaussian Process for Regression
Likelihood: p(t_n | y_n) = N(t_n | y_n, beta^{-1}), p(t | y) = N(t | y, beta^{-1} I_N).
Prior: p(y) = N(y | 0, K).
Marginal distribution: p(t) = integral p(t | y) p(y) dy = N(t | 0, C), with C = K + beta^{-1} I_N.

Samples of Data Points t_n
[Figure.]

Predictive Distribution
p(t_{N+1}) = N(t_{N+1} | 0, C_{N+1}),   C_{N+1} = [[C_N, k], [k^T, c]]
p(t_{N+1} | t) is a Gaussian distribution with mean and variance
m(x_{N+1}) = k^T C_N^{-1} t,   sigma^2(x_{N+1}) = c - k^T C_N^{-1} k

Predictive Mean
m(x_{N+1}) = sum_n a_n k(x_n, x_{N+1}), where a_n is the nth component of C_N^{-1} t. We see the same form as in kernel ridge regression and kernel PCA.
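The predictive equations above, m(x*) = k^T C_N^{-1} t and sigma^2(x*) = c - k^T C_N^{-1} k, can be sketched directly. The kernel hyperparameters and toy data below are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def kernel(X1, X2, theta0=1.0, theta1=16.0):
    # k(x, x') = theta0 * exp(-(theta1 / 2) ||x - x'||^2)  (assumed hyperparameters)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return theta0 * np.exp(-0.5 * theta1 * d2)

def gp_predict(X, t, X_new, beta=25.0):
    C = kernel(X, X) + np.eye(len(X)) / beta        # C_N = K + beta^{-1} I_N
    k = kernel(X, X_new)                            # N x M cross-covariances
    mean = k.T @ np.linalg.solve(C, t)              # k^T C_N^{-1} t
    v = np.linalg.solve(C, k)
    c = np.diag(kernel(X_new, X_new)) + 1.0 / beta  # marginal variances of new targets
    var = c - np.sum(k * v, axis=0)                 # c - k^T C_N^{-1} k
    return mean, var

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(20, 1))
t = np.sin(2 * np.pi * X[:, 0]) + 0.2 * rng.standard_normal(20)
X_new = np.linspace(0.0, 1.0, 50)[:, None]
mean, var = gp_predict(X, t, X_new)
```

Unlike kernel ridge regression, the GP also returns a predictive variance, which grows away from the training inputs.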
GP Regression
[Figure: GP predictive distribution on toy data.]
Discussion: the difference between GP regression and Bayesian regression with Gaussian basis functions.

Computational Complexity
GP prediction for a new data point:
m(x_{N+1}) = k^T C_N^{-1} t,   sigma^2(x_{N+1}) = c - k^T C_N^{-1} k
GP: O(N^3), where N is the number of data points. Basis-function model: O(M^3), where M is the dimension of the feature expansion. When N is large, the GP is computationally expensive. Sparsification: make predictions based on only a few data points, essentially making N small.

Learning Hyperparameters
ln p(t | theta) = -(1/2) ln |C_N| - (1/2) t^T C_N^{-1} t - (N/2) ln 2 pi
(d/dtheta_i) ln p(t | theta) = -(1/2) Tr(C_N^{-1} dC_N/dtheta_i) + (1/2) t^T C_N^{-1} (dC_N/dtheta_i) C_N^{-1} t
These are empirical Bayes methods.

Automatic Relevance Determination
Consider a two-dimensional problem with
C(x, x') = theta_0 exp(-(1/2) sum_i eta_i (x_i - x'_i)^2).
Maximizing the marginal likelihood will make certain eta_i small, reducing the relevance of dimension i to the prediction.
Example: t = sin(2 pi x_1), with x_2 a noisy copy of x_1 and x_3 pure noise; ARD drives the eta_i for the less relevant dimensions toward zero. [Figure: eta_i values over the optimization iterations.]
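The log marginal likelihood and its gradient above can be verified numerically. This sketch differentiates with respect to a single length-scale parameter theta_1 of an assumed kernel k(x, x') = exp(-(theta_1 / 2)(x - x')^2) and compares the trace-identity gradient against a finite-difference estimate; the data and parameter values are illustrative.

```python
import numpy as np

def log_marginal(X, t, theta1, beta=25.0):
    # ln p(t | theta) = -1/2 ln|C_N| - 1/2 t^T C_N^{-1} t - N/2 ln(2 pi)
    N = len(t)
    d2 = (X[:, None, 0] - X[None, :, 0]) ** 2
    C = np.exp(-0.5 * theta1 * d2) + np.eye(N) / beta
    L = np.linalg.cholesky(C)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * logdet - 0.5 * t @ np.linalg.solve(C, t) - 0.5 * N * np.log(2 * np.pi)

def dlog_marginal(X, t, theta1, beta=25.0):
    # -1/2 Tr(C_N^{-1} dC/dtheta) + 1/2 t^T C_N^{-1} (dC/dtheta) C_N^{-1} t
    N = len(t)
    d2 = (X[:, None, 0] - X[None, :, 0]) ** 2
    K = np.exp(-0.5 * theta1 * d2)
    C = K + np.eye(N) / beta
    dC = -0.5 * d2 * K                    # derivative of the kernel part w.r.t. theta1
    Cinv = np.linalg.inv(C)
    alpha = Cinv @ t
    return -0.5 * np.trace(Cinv @ dC) + 0.5 * alpha @ dC @ alpha

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 1.0, size=(15, 1))
t = np.sin(2 * np.pi * X[:, 0]) + 0.2 * rng.standard_normal(15)
g = dlog_marginal(X, t, 4.0)
eps = 1e-5
g_fd = (log_marginal(X, t, 4.0 + eps) - log_marginal(X, t, 4.0 - eps)) / (2 * eps)
```

In an ARD setting the same gradient, taken per input dimension, is what drives irrelevant dimensions' length-scale weights toward zero.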