BIG (BusinessIndustryGovernment) Statistics
BIG (BusinessIndustryGovernment) Statistics STAT 597C
Popular in Course
Popular in Statistics
This 0 page Class Notes was uploaded by Hilbert Denesik on Sunday November 1, 2015. The Class Notes belongs to STAT 597C at Pennsylvania State University taught by Staff in Fall. Since its upload, it has received 19 views. For similar materials see /class/233137/stat-597c-pennsylvania-state-university in Statistics at Pennsylvania State University.
Reviews for BIG (BusinessIndustryGovernment) Statistics
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 11/01/15
Perceptron Learnng Algorithm Perceptron Learning Algorithm Jia Li Department of Statistics The Pennsylvania State University Emaii Jiaii stat psu edu http www stat psu eduNnah J15 L1 httpwwwscacp Perceplron Learnng Algorithm Separating Hyperplanes gt Construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible gt Good separation is defined in a certain form mathematically gt Even when the training data can be perfectly separated by hyperplanes LDA or other linear methods developed under a statistical framework may not achieve perfect separation J15 L1 httpwwwstatpsueduleall Perceptron Learning Algorithm E emsuu m Scuuulcl Leannng Hasnp Tmnlnram A mm 2am mapquot 4 Figure 413 A tag example with tum classes separable by u hypanalane The mange hm is the least squams solution which misclassi es one a the twining paints Also shown are tum blue separating hypemlanes fuund by the perception learning algorithm mt di enmt van da39m starts Jia Li httpwaw stat ps duNjiali Perceplron Learnng Algorithm Review of Vector Algebra J15 L gt A hyperplane or affine set L is defined by the linear equation LX fx o Tx0 gt For any two points X1 and X2 lying in L BTX1 7X2 0 and hence 3 B B is the vector normal to the surface of L gt For any point X0 in L BTXO 730 gt The signed distance of any point X to L is given by l WTlX XO mlBTX l BO 1 H W H XL gt Hence fx is proportional to the signed distance from X to the hyperplane defined by fx 0 http www stat psu eduijh Perceptron Learning Algorithm E emsuu m Scuuulcl mung Hasnp Tmnlnram A mm 2051 mam 4 Figure 414 The linear algebra of n hyparylamz a ins m Jia Li httpwww stat su eduNjiali Perceptron Learnng Algorithm Rosenblatt s Perceptron Learning gt Goal find a separating hyperplane by minimizing the distance of miscassified points to the decision boundary gt Code the two classes by y171 gt fy 1 is miscassified BTX 30 lt 0 If y 71 is miscassified BTX 30 gt 0 gt Since the signed distance from X to the decision boundary is T g o the distance from a miscassified X to the decision boundary is gt Denote the set of miscassified points by M gt The goal is to minimize 057 50 i Z YIWTXI 50 ieM J15 L1 httpWWWstatpsueduleall Perceplron Learnng Algorithm Stochastic Gradient Descent gt To minimize 03730 compute the gradient assuming M is fixed 805750 7 7 83 i i yIXI 7 805750 7 7 830 7 g y gt Stochastic gradient descent is used to minimize the piecewise linear criterion gt Adjustment on B 30 is done after each misclassified point is visited J15 L1 httpwwwstatpsueduleall Perceplron Learnng Algorithm gt The update is 5 gt lt 5 gt lt YiXi gt lt 50 H 50 390 0 Here p is the learning rate which in this case can be taken to be 1 without loss of generality Note if BTX Bo O is the decision boundary ABTX A30 O is also the boundary J15 L1 httpwwwstatpsueduleall Perceplron Lear Issues gt If the classes are linearly separable the algorithm converges to a separating hyperplane in a finite number of steps gt A number of problems with the algorithm gt When the data are separable there are many solutions and which one is found depends on the starting values The number of steps can be very large The smaller the gap the longer it takes to find it When the data are not separable the algorithm will not converge and cycles develop The cycles can be long and therefore hard to detect V V J15 L1 httpWWWstatpsueduleall Perceplron Learnng Algorithm Optimal Separating Hyperplanes gt Suppose the two classes can be linearly separated gt The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point from either class gt There is a unique solution gt Tend to have better classification performance on test data gt The optimization problem max C o 1 subject to B HyBTX l 30 2 C39 17 N gt Every point is at least C away from the decision boundary pTx 50 0 J15 L1 httpWWWstatpsueduleall Perceptron Learning Algorithm element at Stunsncl Leann mme Tlhllnram A Few 2am Chapter 4 Figure 415 The same date he m Flgme 413 The shaded ngan tellneetee the meelmhm ma39l gin separat ing the Eula classes There um three suypart paints tn dxcatsd whleh he on the boundary uf the margin and the aytzmal eepamtmg hyperplane blue lme Insth the slab Included in the yuw is the boundary faund using laglst ll rEgTesszan red line whzch ll my close tn the upiimul sepamting hypaplclne eee Sectiun 1233 Jia Li httpwwwstatps duNjiali Perceplron Learnng Algorithm gt For any solution of the optimization problem any positively scaled multiple is a solution as well We can set 3 ll1C The optimization problem is equivalent to 1 2 gjgoig ll ll subject to yBTX l 30 2 17 i 17 N gt This is a convex optimization problem J15 L1 httpwwwstatpsueduleall Perceplron Lear gt The Lagrange sum is o 2 I N Lp min B H2 7 ZaibiWTXi 50 139 1 gt Setting the derivatives to zero we obtain N 5 Z aiYiXi 7 i1 N 0 Zaiyi i1 J15 L1 httpwwwstatpsueduleall Perceplron Lea m Algor 1m gt Substitute into Lp we obtain the Wolfe dual N N N 1 LD 3i i E I Z aiakYikaiTXk 1 1 k1 subject to a 2 0 This is a simpler convex optimization problem J15 L1 httpwwwstatpsuedu Perceplron Learnng Algorithm gt The Karush Kuhn Tucker conditions require 3ilYiBTXi 30 1 0 Vi gt If a gt 0 then q Txi 50 1 that is X is on the boundary of the slab gt lfy TX 50 gt1 that is X is not on the boundary of the slab a gt The points X on the boundary of the slab are called support points gt The solution vector 3 is a linear combination of the support points 5 Z aiyixi iagt0 J15 L1 httpwwwstatpsueduleall Mixture Models Mixture Models Jia Li Department of Statistics The Pennsylvania State University Emaii Manama psu edu mp www stat psu eatWan Mixture Models Clustering by Mixture Models gt General background on clustering gt Example method k means gt Mixture model based clustering gt Model estimation J15 L1 httpwwwstatpsueduleall Mlxnne Modek Clustering 5 39 V A basic tool in data 39 a i miningpattern A quot 4quotg recognition 3 quot 5331 h gt Divide a set of data 2 i i quot i 391 h39 into groups v i gt Samples in one 1 i cluster are Close and i u clusters are far 1 gt A i 2 i 7 i 1 71 n 1 2 a A 5 J12 L Mixture Models gt Motivations gt Discover classes of data in an unsupervised way unsupervised learning gt Efficient representation of data fast retrieval data complexity reduction gt Various engineering purposes tightly linked with pattern recognition J15 L1 httpwwwstatpsueduleall Mixture Models Approaches to Clustering gt Represent samples by feature vectors gt Define a distance measure to assess the closeness between data gt Closeness can be measured in many ways gt Define distance based on various norms gt For gene expression levels in a set of microarray data closeness between genes may be measured by the Euclidean distance between the gene profile vectors or by correlation J15 L1 hccpwwwscatpsueduijh Mixture Models gt Approaches gt Define an objective function to assess the quality of clustering and optimize the objective function purely computational gt Clustering can be performed based merely on pairwise distances How each sample is represented does not come into the picture gt Statistical model based clustering J15 L1 httpwwwstatpsueduleall Mixture Models K means gt Assume there are M clusters with centroids Z 217227 7ZM gt Each training sample is assigned to one of the clusters Denote the assignment function by Then 77139 j means the ith training sample is assigned to thejth cluster gt Goal minimize the total mean squared error between the training samples and their representative cluster centroids that is the trace of the pooled within cluster covariance matrix N i 2 arg r233 H X no ll J15 L1 httpwwwstatpsueduleall Mixture Models gt Denote the objective function by N LZ7nZHXi 71039 H2 il gt Intuition training samples are tightly clustered around the centroids Hence the centroids serve as a compact representation for the training data J15 L1 httpwwwstatpsueduleall Mixture Models Necessary Conditions gt If Z is fixed the optimal assignment function should follow the nearest neighbor rule that is 770 arg mlnje12M Xi Z gt If is fixed the cluster centroid 2 should be the average of all the samples assigned to thejth cluster 2 ij Xi NJ Zj NJ is the number of samples assigned to cluster j J15 L1 httpwwwstatpsueduleall Mixture Models The Algorithm gt Based on the necessary conditions the k means algorithm alternates the two steps gt For a fixed set of centroids optimize by assigning each sample to its closest centroid using Euclidean distance gt Update the centroids by computing the average of all the samples assigned to it gt The algorithm converges since after each iteration the objective function decreases non increasing gt Usually converges fast gt Stopping criterion the ratio between the decrease and the objective function is below a threshold J15 L1 httpwwwstatpsueduleall Mixture Models Mixture Model based Clustering gt Each cluster is mathematically represented by a parametric distribution Examples Gaussian continuous Poisson discrete gt The entire data set is modeled by a mixture of these distributions gt An individual distribution used to model a specific cluster is often referred to as a component distribution J15 L1 hccpwwwscatpsueduijh Mixture Models gt Suppose there are K components clusters Each component is a Gaussian distribution parameterized by pk Xk Denote the data by X X ERd The density of component k is fkX gtX i Mk7 Xk 1 exp X 7 MktX1X 7 Mid 9 in 2 39 gt The prior probability weight of component k is ak The mixture density is K K fx Z amx Z ak gtx i W 2k k1 k1 J15 L1 httpwwwstatpsueduleall Mixture Models Advantages gt A mixture model with high likelihood tends to have the following traits gt Component distributions have high peaks data in one cluster are tight gt The mixture model covers the data well dominant patterns in the data are captured by component distributions J15 L1 httpwwwstatpsueduleall Mlxnue Models Adva ntages b Well studied statistical inference techniques available gt Flexibility in choosing the component distributions b Obtain a density estimation for each cluster b A soft classification is available 0 Density function nhwo clusters Mixture Models EM Algorithm gt The parameters are estimated by the maximum likelihood ML criterion using the EM algorithm gt A P Dempster N M Laird and D B Rubin Maximum likelihood from incomplete data via the EM algorithmH Journal Royal Statistics Society vol 39 no 1 pp 1 21 1977 gt The EM algorithm provides an iterative computation of maximum likelihood estimation when the observed data are incomplete J15 L1 hccpwwwscatpsueduijh Mixture Models gt lncompleteness can be conceptual gt We need to estimate the distribution ofX in sample space X but we can only observe X indirectly through Y in sample space 3 In many cases there is a mapping xgt yx from X to 3 and X is only known to lie in a subset of X denoted by Xy which is determined by the equation y yx The distribution ofX is parameterized by a family of distributions fx l 6 with parameters 6 E Q on X The distribution of y gy l 6 is V V amokM axlmdx J15 L1 httpwwwstatpsueduleall Mixture Models gt The EM algorithm aims at finding a 9 that maximizes gy l 0 given an observed y gt Introduce the function QW l t9 EUOE fX l 0 ly707 that is the expected value of log fx l 0 according to the conditional distribution ofx given y and parameter 0 The expectation is assumed to exist for all pairs 070 In particular it is assumed that fX l 0 gt O for 9 E Q gt EM Iteration gt Estep Compute Q6 l 6 gt Mstep Choose 69 to be a value of 6 E 0 that maximizes Q9 9p J15 L1 httpwwwstatpsueduleall Mixture Models EM J15 L for the Mixture of Normals gt Observed data incomplete X17X27m7xn where n is the sample size Denote all the samples collectively by x gt Complete data X17y1X2y27 Xn7yn where y is the cluster component identity of sample X gt The collection of parameters 0 includes ak pk Xk k 127K gt The likelihood function is n K Lxl0 2 log ak gtXluk Xkgt il kl gt Lxl0 is the objective function of the EM algorithm maximize Numerical difficulty comes from the sum inside the log http www stat psu eduijh Mixture Models gt The Q function is QWW E IogHayfo ALWY x70 il E lodeM og gtx y72 70 1 2 E Ioga y Iog gtx MW 29 Xiv 0 i1 The last equality comes from the fact the samples are independent J15 L1 httpwwwstatpsueduleall Mixture Models gt Note that when X is given only y is random in the complete data Xy Also y only takes a finite number of values ie cluster identities 1 to K The distribution of Y given X X is the posterior probability of Y given X gt Denote the posterior probabilities of Y k k 17 K given X by phk By the Bayes formula the posterior probabilities are K Pik 0lt ak gtXi l Wm 07 Z Pik 1 J15 L1 httpwwwstatpsueduleall Mixture Models gt Then each summand in Q0 i0 is E 390ga y log gtxz WW X9 ixn 0 K K 2 pm log at 2 pm log Xi i 27 XL k1 k1 gt Note that we cannot see the direct effect of 9 in the above equation but pk are computed using 0 Le the current parameters 0 includes the updated parameters J15 L1 httpwwwstatpsueduleall Mixture Models gt We then have n K Q0 i0 Z Z Pik log 32 i1 kl K 2 pm log gtXi i 27 XL 391 kl gt Note that the prior probabilities a and the parameters of the Gaussian components In L can be optimized separately J15 L1 httpwwwstatpsueduleall Mixture Models gt The 3239s subject to Zf a 1 Basic optimization theories show that a are optimized by 2L1 Pik in gt The optimization of pk and ER is simply a maximum likelihood estimation of the parameters using samples X with weights phk Basic optimization techniques also lead to a U 211 PikXi k 2L1 Pik X2 271 PikXi MLXXI Milt L1 Pik gt After every iteration the likelihood function L is guaranteed to increase may not strictly gt We have derived the EM algorithm for a mixture of Gaussianrs J15 L1 httpwwwstatpsueduleall Mixture Models EM Algorithm for the Mixture of Gaussians Parameters estimated at the pth iteration are marked by a superscript 1 Initialize parameters 2 Estep Compute the posterior probabilities for all i 1 n k 1 M K prk 35qu 9721125 I K 1 EH swabs l MEX 2139 3 Mstep apl EL Pik Mpl EL PikXi k 397 7 k 21 Pik n 1 1 Elkp 211 PikXi Mlkpl Xf Ml Y 1 Pik 4 Repeat step 2 and 3 until converge J13 L1 hccpwwwstatpsueduijh Mixture Models gt Comment for mixtures of other distributions the EM algorithm is very similar The E step involves computing the posterior probabilities Only the particular distribution 1 needs to be changed The M step always involves parameter optimization Formulas differ according to distributions J15 L1 hccpwwwscatpsueduijh Mixture Models Computation Issues gt If a different Xk is allowed for each component the likelihood function is not bounded Global optimum is meaningless Don39t overdo it gt How to initialize Example gt Apply k means first gt Initialize uk and Xk using all the samples classified to cluster k gt Initialize ak by the proportion of data assigned to cluster k by kmeans gt In practice we may want to reduce model complexity by putting constraints on the parameters For instance assume equal priors identical covariance matrices for all the components J15 L1 hccpwwwscatpsueduijh Mixture Models Examples J15 L gt The heart disease data set is taken from the UCI machine learning database repository gt There are 297 cases samples in the data set of which 137 have heart diseases Each sample contains 13 quantitative variables including cholesterol max heart rate etc gt We remove the mean of each variable and normalize it to yield unit variance gt data are projected onto the plane spanned by the two most dominant principal component directions gt A two component Gaussian mixture is fit http www stat psu eduijh Mlxnne Models 73 72 7i U 1 2 3 4 A 75 72 7i U i 2 3 A The heart disease data set and the estimated cluster densities Left The scatter plot of the data Right The contour plot of the pdf estimated using a singlelayer mixture of two normals The thick lines are the boundaries between the two clusters based on the estimated pde of individual clusters 112 L httpwwwetat psueduN31al Mixture Models Classification Likelihood gt The likelihood n K l Xl 9 2 log ak gtXile7 EU i1 k1 maximized by the EM algorithm is sometimes called mixture likelihood gt Maximization can also be applied to the classification likelihood Denote the collection of cluster identities of all the samples by y y177yn lxlt97 y 2 log 3M Xill bym Ky l1 J15 L1 httpwwwstatpsueduleall Mixture Models gt The cluster identities y i 17 n are treated as parameters together with 9 and are part of the estimation gt To maximize EM algorithm can be modified to yield an ascending algorithm This modified version is called Classification EM CEM J15 L1 httpwwwstatpsueduleall Mixture Models Classification EM A classification step is inserted between the E step and the M step gt Initialize parameters gt E step Compute the posterior probabilities for all i1n k 17 K p k amx l pm 2f1a quot xz l m 25 gt Classification 1 ylpl arg mkax pm Or equivalently let pw 1 if k arg maxlt phk and 0 otherwise J15 L1 httpwwwstatpsueduleall Mixture Models gt M step n1 271 5 7 ELI Iylp1 k ak 7 f 7 f Mp1 7 2L1 akxi 7 271 CGW1 kXi k 7 2 M 2 IMP k W 2 mm 7 M P1x WW k 2L1 ak 2 IMP kxx 7 159 7 WW 21Iy quot1 k gt Repeat the above three steps until converge J15 L1 httpwwwstatpsueduleall Mixture Models Comment gt CEM tends to underestimate the variances It usually converges much faster than EM For the purpose of clustering it is generally believed that it performs similarly as EM gt lfwe assume equal priors ak and also the covariance matrices Xk are identical and are a scalar matrix CEM is exactly k means Exercise J15 L1 hccpwwwscatpsueduijh
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'