About this Document
This 34-page set of class notes was uploaded by Adaline Pollich on Friday, October 23, 2015. The notes belong to EE 639 at the University of Kentucky, taught by Ching Cheung in Fall 2004. Since the upload, the notes have received 16 views. For similar materials, see /class/228317/ee-639-university-of-kentucky in Electrical Engineering at the University of Kentucky.




Multimedia Information Systems
Samson Cheung, EE 639, Fall 2004
Lecture 20: Dimension Reduction / Feature Extraction
(Based on Dr. Gutiérrez-Osuna's lecture notes.)

Why Dimension Reduction?
- Curse of dimensionality.
- Applications: pattern recognition, similarity search, visualization, compression.
- Different goals:
  - Better clustering/classification results.
  - Preserve distance relationships.
  - Produce as few bits as possible with little loss of quality.

Outline
- Curse of dimensionality
- Feature selection vs. feature extraction
- Signal representation vs. classification
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Independent Component Analysis (ICA)

Curse of Dimensionality (1)
- The "curse of dimensionality":
  - A term coined by Bellman in 1961.
  - Refers to the problems associated with multivariate data analysis as the dimensionality increases.
  - We will illustrate these problems with a simple example.
- Consider a 3-class pattern recognition problem. A simple approach would be to:
  - Divide the feature space into uniform bins.
  - Compute the ratio of examples of each class in each bin.
  - For a new example, find its bin and choose the predominant class in that bin.
- In our toy problem, we decide to start with one single feature and divide the real line into 3 segments.
- After doing this, we notice that there is too much overlap among the classes, so we decide to incorporate a second feature to try to improve separability.

Curse of Dimensionality (2)
- We decide to preserve the granularity of each axis, which raises the number of bins from 3 (in 1D) to $3^2 = 9$ (in 2D).
- At this point we need to make a decision: do we maintain the density of examples per bin, or do we keep the number of examples we had for the one-dimensional case?
  - Choosing to maintain the density increases the number of examples from 9 (in 1D) to 27 (in 2D).
  - Choosing to maintain the number of examples results in a 2D scatter plot that is very sparse.
  [Figure: 2D scatter plots for constant density vs. constant number of examples.]
- Moving to three features makes the problem worse:
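The bin-counting arithmetic above is easy to check in a few lines. A minimal sketch, assuming 3 bins per axis and a density of 3 examples per bin, as in the toy problem:

```python
# Bin-counting sketch for the curse-of-dimensionality toy example.
# Assumptions (from the slides): 3 bins per axis, 3 examples per bin.
bins_per_axis = 3
examples_per_bin = 3

for d in (1, 2, 3):                         # number of features
    n_bins = bins_per_axis ** d             # uniform bins in d dimensions
    n_examples = n_bins * examples_per_bin  # examples needed to keep density fixed
    print(f"{d}D: {n_bins} bins, {n_examples} examples")
```

Both quantities grow exponentially with the number of features, which is exactly the blow-up quantified next.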
- The number of bins grows to $3^3 = 27$. For the same density of examples, the number of needed examples becomes 81. For the same number of examples, the 3D scatter plot is almost empty.

Curse of Dimensionality (3)
- Obviously, our approach of dividing the sample space into equally spaced bins was quite inefficient. There are other approaches that are much less susceptible to the curse of dimensionality, but the problem still exists.
- How do we beat the curse of dimensionality?
  - By incorporating prior knowledge.
  - By providing increasing smoothness of the target function.
  - By reducing the dimensionality.
- In practice, the curse of dimensionality means that, for a given sample size, there is a maximum number of features above which the performance of our classifier will degrade rather than improve.
- In most cases, the additional information that is lost by discarding some features is more than compensated by a more accurate mapping in the lower-dimensional space.
  [Figure: classifier performance vs. dimensionality, peaking at an intermediate number of features.]

Dimensionality Reduction (1)
- Two approaches are available to perform dimensionality reduction:
  - Feature extraction: creating a subset of new features by combinations of the existing features.
  - Feature selection: choosing a subset of all the features (the more informative ones).
  $[x_1, x_2, \ldots, x_N]^T \xrightarrow{\text{feature selection}} [x_{i_1}, x_{i_2}, \ldots, x_{i_M}]^T$
  $[x_1, x_2, \ldots, x_N]^T \xrightarrow{\text{feature extraction}} y = f\big([x_1, x_2, \ldots, x_N]^T\big)$
- The problem of feature extraction can be stated as follows:
  - Given a feature space $x_i \in \mathbb{R}^N$, find a mapping $y = f(x): \mathbb{R}^N \to \mathbb{R}^M$ with $M < N$, such that the transformed feature vector $y_i \in \mathbb{R}^M$ preserves (most of) the information or structure in $\mathbb{R}^N$.
  - An optimal mapping $y = f(x)$ will be one that results in no increase in the minimum probability of error, i.e., a Bayes decision rule applied to the initial space $\mathbb{R}^N$ and to the reduced space $\mathbb{R}^M$ yields the same classification rate.

Dimensionality Reduction (2)
- In general, the optimal mapping $y = f(x)$ will be a nonlinear function. However, there is no systematic way to generate nonlinear transforms; the selection of a particular subset of
transforms is problem-dependent.
- For this reason, feature extraction is commonly limited to linear transforms $y = Wx$, i.e., $y$ is a linear projection of $x$. (NOTE: when the mapping is a nonlinear function, the reduced space is called a manifold.)
  [Figure: linear feature extraction $y = Wx$.]

Signal Representation vs. Classification
- The selection of the feature extraction mapping $y = f(x)$ is guided by an objective function that we seek to maximize (or minimize).
- Depending on the criterion used by the objective function, feature extraction techniques are grouped into two categories:
  - Signal representation: the goal of the feature extraction mapping is to represent the samples accurately in a lower-dimensional space.
  - Classification: the goal of the feature extraction mapping is to enhance the class-discriminatory information in the lower-dimensional space.
- Within the realm of linear feature extraction, two techniques are commonly used:
  - Principal Components Analysis (PCA) uses a signal representation criterion.
  - Linear Discriminant Analysis (LDA) uses a signal classification criterion.
  [Figure: a two-class scatter plot (Feature 1 vs. Feature 2) contrasting the two criteria.]

PCA (1)
- The objective of PCA is to perform dimensionality reduction while preserving as much of the randomness (variance) in the high-dimensional space as possible.
- Let $x$ be an $N$-dimensional random vector, represented as a linear combination of orthonormal basis vectors $[\varphi_1, \varphi_2, \ldots, \varphi_N]$:
  $x = \sum_{i=1}^{N} y_i \varphi_i$, where $\varphi_i^T \varphi_j = \delta_{ij}$
- Suppose we choose to represent $x$ with only $M$ ($M < N$) of the basis vectors. We can do this by replacing the components $[y_{M+1}, \ldots, y_N]^T$ with some pre-selected constants $b_i$:
  $\hat{x}(M) = \sum_{i=1}^{M} y_i \varphi_i + \sum_{i=M+1}^{N} b_i \varphi_i$
- The representation error is then
  $\Delta x(M) = x - \hat{x}(M) = \sum_{i=1}^{N} y_i \varphi_i - \Big( \sum_{i=1}^{M} y_i \varphi_i + \sum_{i=M+1}^{N} b_i \varphi_i \Big) = \sum_{i=M+1}^{N} (y_i - b_i)\, \varphi_i$
- We can measure this representation error by the mean-squared magnitude of $\Delta x$. Our goal is to find the basis vectors $\varphi_i$ and constants $b_i$ that minimize this mean-square error:
  $\bar{\varepsilon}^2(M) = E\big[\,|\Delta x(M)|^2\,\big] = E\Big[ \sum_{i=M+1}^{N} \sum_{j=M+1}^{N} (y_i - b_i)(y_j - b_j)\, \varphi_i^T \varphi_j \Big] = \sum_{i=M+1}^{N} E\big[(y_i - b_i)^2\big]$
- As we have done earlier in the course, the optimal values of $b_i$ can be found by
computing the partial derivative of the objective function and equating it to zero:
  $\frac{\partial}{\partial b_i} E\big[(y_i - b_i)^2\big] = -2\big(E[y_i] - b_i\big) = 0 \;\Rightarrow\; b_i = E[y_i]$
- Therefore, we will replace the discarded dimensions $y_i$ by their expected value, an intuitive solution.
- Using $y_i = \varphi_i^T x$, the mean-square error can then be written as
  $\bar{\varepsilon}^2(M) = \sum_{i=M+1}^{N} E\big[(y_i - E[y_i])^2\big] = \sum_{i=M+1}^{N} \varphi_i^T\, E\big[(x - E[x])(x - E[x])^T\big]\, \varphi_i = \sum_{i=M+1}^{N} \varphi_i^T \Sigma_x \varphi_i$
  where $\Sigma_x$ is the covariance matrix of $x$.
- We seek the solution that minimizes this expression subject to the orthonormality constraint, which we incorporate into the expression using a set of Lagrange multipliers $\lambda_i$:
  $\bar{\varepsilon}^2(M) = \sum_{i=M+1}^{N} \varphi_i^T \Sigma_x \varphi_i + \sum_{i=M+1}^{N} \lambda_i \big(1 - \varphi_i^T \varphi_i\big)$
- Computing the partial derivative with respect to the basis vectors:
  $\frac{\partial \bar{\varepsilon}^2(M)}{\partial \varphi_i} = 2\big(\Sigma_x \varphi_i - \lambda_i \varphi_i\big) = 0 \;\Rightarrow\; \Sigma_x \varphi_i = \lambda_i \varphi_i$
  (NOTE: $\frac{d}{dx}\, x^T A x = (A + A^T)x = 2Ax$ if $A$ is symmetric.)
- So $\varphi_i$ and $\lambda_i$ are the eigenvectors and eigenvalues of the covariance matrix $\Sigma_x$.
- We can then express the sum-square error as
  $\bar{\varepsilon}^2(M) = \sum_{i=M+1}^{N} \varphi_i^T \Sigma_x \varphi_i = \sum_{i=M+1}^{N} \varphi_i^T \lambda_i \varphi_i = \sum_{i=M+1}^{N} \lambda_i$
- In order to minimize this measure, the $\lambda_i$ will have to be the smallest eigenvalues. Therefore, to represent $x$ with minimum sum-square error, we will choose the eigenvectors $\varphi_i$ corresponding to the largest eigenvalues.

PCA Dimensionality Reduction
- The optimal approximation of a random vector $x \in \mathbb{R}^N$ by a linear combination of $M$ ($M < N$) independent vectors is obtained by projecting the random vector $x$ onto the eigenvectors $\varphi_i$ corresponding to the largest eigenvalues $\lambda_i$ of the covariance matrix $\Sigma_x$ (optimality is defined as the minimum of the sum-square magnitude of the approximation error).

PCA (4)
- Since PCA uses the eigenvectors of the covariance matrix $\Sigma_x$, it is able to find the independent axes of the data under the unimodal Gaussian assumption. For non-Gaussian or multimodal Gaussian data, PCA simply de-correlates the axes.
- The main limitation of PCA is that it does not consider class separability, since it does not take into account the class label of the feature vector. PCA simply performs a coordinate rotation that aligns the
transformed axes with the directions of maximum variance. There is no guarantee that the directions of maximum variance will contain good features for discrimination.

Historical Remarks
- Principal Components Analysis is the oldest technique in multivariate analysis. PCA is also known as the Karhunen-Loève transform (communication theory).
- PCA was first introduced by Pearson in 1901, and it experienced several modifications until it was generalized by Loève in 1963.

PCA Example
- Compute the principal components for the following two-dimensional dataset:
  $X = \{(x_1, x_2)\} = \{(1,2),\ (3,3),\ (3,5),\ (5,4),\ (5,6),\ (6,5),\ (8,7),\ (9,8)\}$
- Let's first plot the data to get an idea of which solution we should expect.

SOLUTION (by hand)
- The biased covariance estimate of the data is
  $\Sigma_x = \begin{bmatrix} 6.25 & 4.25 \\ 4.25 & 3.5 \end{bmatrix}$
- The eigenvalues are the zeros of the characteristic equation
  $|\Sigma_x - \lambda I| = \begin{vmatrix} 6.25 - \lambda & 4.25 \\ 4.25 & 3.5 - \lambda \end{vmatrix} = 0 \;\Rightarrow\; \lambda_1 = 9.34,\ \lambda_2 = 0.41$
- The eigenvectors are the solutions of the systems $\Sigma_x v_i = \lambda_i v_i$:
  $v_1 = \begin{bmatrix} 0.81 \\ 0.59 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} -0.59 \\ 0.81 \end{bmatrix}$
- HINT: To solve each system manually, first assume that one of the variables is equal to one, then find the other one, and finally normalize the vector to make it unit length.

Linear Discriminant Analysis (LDA)
- The objective of LDA is to perform dimensionality reduction while preserving as much of the class-discriminatory information as possible.
- Assume we have a set of $D$-dimensional samples $\{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$, $N_1$ of which belong to class $\omega_1$ and $N_2$ to class $\omega_2$. We seek to obtain a scalar $y$ by projecting the samples $x$ onto a line:
  $y = w^T x$
- Of all the possible lines, we would like to select the one that maximizes the separability of the scalars.
  [Figure: two candidate 1D projections of a two-class 2D dataset.]

LDA, 2 Classes (1)
- In order to find a good projection vector, we need to define a measure of separation between the projections.
- The mean vector of each class in $x$-space and $y$-space is
  $\mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x \qquad \text{and} \qquad \tilde{\mu}_i = \frac{1}{N_i} \sum_{y \in \omega_i} y = \frac{1}{N_i} \sum_{x \in \omega_i} w^T x = w^T \mu_i$
- We could then choose the distance between the
projected means as our objective function:
  $J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T (\mu_1 - \mu_2)|$
- However, the distance between the projected means is not a very good measure, since it does not take into account the standard deviation within the classes.
  [Figure: the $x_1$ axis has a larger distance between means, but the $x_2$ axis yields better class separability.]

LDA, 2 Classes (2)
- The solution proposed by Fisher is to maximize a function that represents the difference between the means, normalized by a measure of the within-class scatter.
- For each class we define the scatter, an equivalent of the variance, as
  $\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2$
  where the quantity $\tilde{s}_1^2 + \tilde{s}_2^2$ is called the within-class scatter of the projected examples.
- The Fisher linear discriminant is defined as the linear function $w^T x$ that maximizes the criterion function
  $J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$
- Therefore, we will be looking for a projection where examples from the same class are projected very close to each other and, at the same time, the projected means are as far apart as possible.

LDA, 2 Classes (3)
- In order to find the optimum projection $w^*$, we need to express $J(w)$ as an explicit function of $w$.
- We define a measure of the scatter in the multivariate feature space $x$, the scatter matrices:
  $S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T, \qquad S_W = S_1 + S_2$
  where $S_W$ is called the within-class scatter matrix.
- The scatter of the projection $y$ can then be expressed as a function of the scatter matrices in feature space $x$:
  $\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2 = \sum_{x \in \omega_i} (w^T x - w^T \mu_i)^2 = \sum_{x \in \omega_i} w^T (x - \mu_i)(x - \mu_i)^T w = w^T S_i w$
  $\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$
- Similarly, the difference between the projected means can be expressed in terms of the means in the original feature space:
  $(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T \mu_1 - w^T \mu_2)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = w^T S_B w$
  The matrix $S_B$ is called the between-class scatter. Note that, since $S_B$ is the outer product of two vectors, its rank is at most 1.
- We can finally express the Fisher criterion in terms of $S_W$ and $S_B$ as
  $J(w) = \frac{w^T S_B w}{w^T S_W w}$

LDA, 2 Classes (4)
- To find the maximum of $J(w)$, we differentiate and equate to zero:
  $\frac{d}{dw} \left[ \frac{w^T S_B w}{w^T S_W w} \right] = 0 \;\Rightarrow\; (w^T S_W w)\, 2 S_B w - (w^T S_B w)\, 2 S_W w = 0$
- Dividing by $w^T S_W w$:
  $S_B w - \left( \frac{w^T S_B w}{w^T S_W w} \right) S_W w = 0 \;\Rightarrow\; S_B w - J(w)\, S_W w = 0 \;\Rightarrow\; S_W^{-1} S_B w - J(w)\, w = 0$
- Solving the generalized
eigenvalue problem $S_W^{-1} S_B w = J w$ yields
  $w^* = \arg\max_w \frac{w^T S_B w}{w^T S_W w} = S_W^{-1} (\mu_1 - \mu_2)$
- This is known as Fisher's Linear Discriminant (1936), although it is not a discriminant but rather a specific choice of direction for the projection of the data down to one dimension.

LDA Example
- Compute the Linear Discriminant projection for the following two-dimensional dataset:
  $X_1 = \{(4,1),\ (2,4),\ (2,3),\ (3,6),\ (4,4)\}$
  $X_2 = \{(9,10),\ (6,8),\ (9,5),\ (8,7),\ (10,8)\}$

SOLUTION (by hand)
- The class statistics are
  $S_1 = \begin{bmatrix} 0.80 & -0.40 \\ -0.40 & 2.64 \end{bmatrix}, \qquad S_2 = \begin{bmatrix} 1.84 & -0.04 \\ -0.04 & 2.64 \end{bmatrix}$
  $\mu_1 = [3.00,\ 3.60]^T, \qquad \mu_2 = [8.40,\ 7.60]^T$
- The within- and between-class scatter matrices are
  $S_W = S_1 + S_2 = \begin{bmatrix} 2.64 & -0.44 \\ -0.44 & 5.28 \end{bmatrix}, \qquad S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \begin{bmatrix} 29.16 & 21.60 \\ 21.60 & 16.00 \end{bmatrix}$
- The LDA projection is then obtained as the solution of the generalized eigenvalue problem
  $S_W^{-1} S_B = \begin{bmatrix} 11.89 & 8.81 \\ 5.08 & 3.76 \end{bmatrix}, \qquad S_W^{-1} S_B\, v = \lambda v \;\Rightarrow\; |S_W^{-1} S_B - \lambda I| = 0 \;\Rightarrow\; \lambda = 15.65,\ \ v = \begin{bmatrix} 0.91 \\ 0.39 \end{bmatrix}$
- Or directly by $w^* = S_W^{-1} (\mu_1 - \mu_2) = [-0.91,\ -0.39]^T$.

LDA, C Classes (1)
- Fisher's LDA generalizes very gracefully to $C$-class problems. Instead of one projection $y$, we will now seek $(C-1)$ projections $[y_1, y_2, \ldots, y_{C-1}]$ by means of $(C-1)$ projection vectors $w_i$, which can be arranged by columns into a projection matrix $W = [w_1 | w_2 | \cdots | w_{C-1}]$:
  $y_i = w_i^T x \;\Rightarrow\; y = W^T x$
- Derivation:
  - The generalization of the within-class scatter is
    $S_W = \sum_{i=1}^{C} S_i$, where $S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$ and $\mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x$
  - The generalization of the between-class scatter is
    $S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T$, where $\mu = \frac{1}{N} \sum_{\forall x} x = \frac{1}{N} \sum_{i=1}^{C} N_i \mu_i$
  - $S_T = S_B + S_W$ is called the total scatter matrix.

LDA, C Classes (2)
- Similarly, we define the mean vectors and scatter matrices for the projected samples as
  $\tilde{\mu}_i = \frac{1}{N_i} \sum_{y \in \omega_i} y, \qquad \tilde{\mu} = \frac{1}{N} \sum_{\forall y} y$
  $\tilde{S}_W = \sum_{i=1}^{C} \sum_{y \in \omega_i} (y - \tilde{\mu}_i)(y - \tilde{\mu}_i)^T, \qquad \tilde{S}_B = \sum_{i=1}^{C} N_i (\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T$
- From our derivation for the two-class problem, we can write
  $\tilde{S}_W = W^T S_W W, \qquad \tilde{S}_B = W^T S_B W$
- Recall that we are looking for a projection that maximizes the ratio of between-class to within-class scatter. Since the projection is no longer a scalar (it has $C-1$ dimensions), we use the determinant of the scatter matrices to obtain a scalar objective function:
  $J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}$
- And we will seek the projection matrix $W^*$ that maximizes this ratio. It can be shown that the optimal projection matrix $W^*$ is the one whose columns are the eigenvectors
corresponding to the largest eigenvalues of the following generalized eigenvalue problem:
  $W^* = [w_1^* \,|\, w_2^* \,|\, \cdots \,|\, w_{C-1}^*] = \arg\max_W \frac{|W^T S_B W|}{|W^T S_W W|} \;\Rightarrow\; (S_B - \lambda_i S_W)\, w_i^* = 0$
- NOTES:
  - $S_B$ is the sum of $C$ matrices of rank one or less, and the mean vectors are constrained by $\mu = \frac{1}{N} \sum_{i=1}^{C} N_i \mu_i$. Therefore, $S_B$ will be of rank $(C-1)$ or less, which means that only $(C-1)$ of the eigenvalues $\lambda_i$ will be non-zero.
  - The projections with maximum class-separability information are the eigenvectors corresponding to the largest eigenvalues of $S_W^{-1} S_B$.
  - LDA can also be derived as the Maximum Likelihood method for the case of normal class-conditional densities with equal covariance matrices.

LDA vs. PCA
- These figures show the performance of PCA and LDA on an odor recognition problem. Five types of coffee beans were presented to an array of chemical gas sensors. For each coffee type, 45 sniffs were performed, and the response of the gas sensor array was processed in order to obtain a 60-dimensional feature vector.
- Results: from the 3D scatter plots it is clear that LDA outperforms PCA in terms of class discrimination. This is one example where the discriminatory information is not aligned with the direction of maximum variance.
  [Figure: 3D scatter plots of the first three PCA and LDA projections of the coffee data.]

Limitations of LDA
- LDA produces at most $C-1$ feature projections. If the classification error estimates establish that more features are needed, some other method must be employed to provide those additional features.
- LDA is a parametric method, since it assumes unimodal Gaussian likelihoods. If the distributions are significantly non-Gaussian, the LDA projections will not be able to preserve any complex structure in the data, which may be needed for classification.
- LDA will fail when the discriminatory information is not in the mean but rather in the variance of the data.

PCA vs. ICA
- PCA: find the subspace with the largest variance.
- ICA: find the subspace that is independent from the rest.
- PCA ≈ ICA for Gaussian data, but they differ significantly when the data is not Gaussian.

Uncorrelated vs. Independent
- Are $x$ and $y$ uncorrelated? Are $x$ and $y$ independent?

Simple Cocktail Party Problem
- Unknown mixing
matrix $A$: the observations are $x = As$, with $n$ sources and $m$ observations ($m \ge n$).

Motivation: Two Independent Sources, a Mixture at Two Microphones
  $x_1 = a_{11} s_1 + a_{12} s_2$
  $x_2 = a_{21} s_1 + a_{22} s_2$
- The mixing coefficients $a_{ij}$ depend on the distances of the microphones from the speakers.
- Motivation: get the independent signals out of the mixture.

Definition and Applications
- Problem statement: given inputs $x_1, \ldots, x_n$, find a transformation $W$,
  $y = Wx$
  such that $y_1, \ldots, y_m$ are independent of each other.
- Applications:
  - Cocktail party problem / blind source separation.
  - Neurological signal separation from electroencephalograms (EEG).
  - Separation of noise from signals in mobile communications.
  - Climate studies: separating El Niño effects from volcano effects.
  - Multimedia compression and watermarking.

ICA Estimation Principles
- Principle 1: Nonlinear decorrelation. Find the matrix $W$ so that, for any $i \ne j$, the components $y_i$ and $y_j$ are uncorrelated AND the transformed components $g(y_i)$ and $h(y_j)$ are uncorrelated, where $g$ and $h$ are some suitable nonlinear functions.
- Principle 2: Maximum non-Gaussianity. Find the local maxima of non-Gaussianity of a linear combination $y = \sum_i b_i x_i$ under the constraint that the variance of $y$ is constant. Each local maximum gives one independent component.

Maximal Non-Gaussianity
- By the Central Limit Theorem, $y = \sum_i b_i x_i$ would normally be more Gaussian than the individual components. This would not be the case if $y$ is one of the recovered independent sources $s_i$: it will be less Gaussian than any of the $x_i$'s.

Measures of Non-Gaussianity
- We need a quantitative measure of non-Gaussianity for ICA estimation:
  - Kurtosis: $\text{kurt}(y) = E[y^4] - 3\big(E[y^2]\big)^2$; zero for a Gaussian, but sensitive to outliers.
  - Entropy: $H(y) = -\int f(y) \log f(y)\, dy$; largest for a Gaussian.
  - Negentropy: $J(y) = H(y_{\text{gauss}}) - H(y)$; zero for a Gaussian, but difficult to estimate.
  - Approximations:
    $J(y) \approx \frac{1}{12} E[y^3]^2 + \frac{1}{48} \text{kurt}(y)^2$
    $J(y) \approx \big(E[G(y)] - E[G(\nu)]\big)^2$, where $\nu$ is a standard Gaussian random variable and, e.g., $G(y) = \frac{1}{a} \log \cosh(a y)$ or $G(y) = -\exp(-y^2/2)$.
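The hand-worked PCA example above can be checked numerically. A minimal sketch using NumPy (not part of the original notes); numerical eigenvectors are only defined up to sign, so they may differ from the hand solution by a factor of -1:

```python
import numpy as np

# The 2D dataset from the PCA worked example in the notes.
X = np.array([[1, 2], [3, 3], [3, 5], [5, 4], [5, 6],
              [6, 5], [8, 7], [9, 8]], dtype=float)

# Biased covariance estimate (divide by N, as on the slide).
cov = np.cov(X.T, bias=True)        # -> [[6.25, 4.25], [4.25, 3.5]]

# Eigendecomposition of the symmetric covariance matrix;
# eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)
print(eigvals)                      # smallest ~0.41, largest ~9.34
print(eigvecs[:, ::-1])             # columns reordered largest-first
```

The principal component is the eigenvector paired with the largest eigenvalue, matching $v_1 \approx [0.81, 0.59]^T$ up to sign.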

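The two-class LDA worked example can likewise be verified by computing Fisher's direction $w^* = S_W^{-1}(\mu_1 - \mu_2)$ directly. A sketch assuming NumPy; the scatter matrices are normalized by $N_i$ to match the slide's class statistics, and the result matches the hand solution up to sign:

```python
import numpy as np

# The two classes from the LDA worked example in the notes.
X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)   # [3.0, 3.6], [8.4, 7.6]

# Per-class scatter, normalized by N_i as in the slide's S1, S2.
S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)
S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)
Sw = S1 + S2                         # -> [[2.64, -0.44], [-0.44, 5.28]]

# Fisher's direction: solve Sw w = (mu1 - mu2), then normalize.
w = np.linalg.solve(Sw, mu1 - mu2)
w /= np.linalg.norm(w)               # ~[-0.92, -0.39]
print(w)
```

Solving the linear system avoids explicitly inverting $S_W$, which is the standard numerically preferable choice.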

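Finally, the kurtosis measure and the Central Limit Theorem argument behind Principle 2 can be illustrated with a small simulation. The uniform sources and the mixing matrix below are illustrative assumptions, not values from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def kurt(y):
    """kurt(y) = E[y^4] - 3 (E[y^2])^2, as defined above; ~0 for Gaussian y."""
    return np.mean(y**4) - 3 * np.mean(y**2)**2

# Two independent unit-variance uniform sources (kurtosis ~ -1.2 each).
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 100_000))

# An assumed (unknown-to-the-algorithm) mixing matrix A, so that x = A s.
A = np.array([[0.8, 0.6], [0.5, 0.9]])
x = A @ s

# CLT effect: the mixture is closer to Gaussian (kurtosis nearer zero)
# than the source, which is what maximum non-Gaussianity exploits.
print(kurt(s[0]), kurt(x[0] / x[0].std()))
```

Searching for the direction that maximizes $|\text{kurt}|$ of the projection would undo the mixing; that search is what practical ICA algorithms carry out.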