Outline for EECS 833 at KU
NONPARAMETRIC CLASSIFICATION TECHNIQUES

EECS 833, 3 March 2006

Geoff Bohling
Assistant Scientist
Kansas Geological Survey
geoff@kgs.ku.edu
864-2093

Overheads and other resources available at http://people.ku.edu/gbohling/EECS833

Modeling Categorical Variables (Classification)

In classification applications we are trying to develop a model for predicting a categorical response variable, G, from one or more predictor variables, X. That is, if we know that an observation arises from one of K different mutually exclusive classes or groups, G_k, then we are trying to estimate the probability of occurrence of each group at each point in the predictor space:

    P_k(x) = Prob(G = G_k | X = x)

We could then assign each estimation point to the class with the highest probability at that point, segmenting the predictor space into regions assigned to the different classes.

The fact that the probabilities are continuous variables immediately suggests the possibility of modeling them using parametric or nonparametric regression techniques (brief overview at http://people.ku.edu/gbohling/cpe940/NonParReg.pdf). However, we will have to apply transforms to ensure that the estimated P_k(x) values represent a legitimate set of probabilities for mutually exclusive events, meaning

    0 <= P_k(x) <= 1,  k = 1, ..., K    and    sum_{k=1}^{K} P_k(x) = 1

To use this approach, the group memberships in the training dataset are coded as a set of indicator variables, one for each group, with y_ik = 1 for the group to which data point i belongs and 0 for all other groups: the set of probabilities corresponding to our certain knowledge of the group membership for each training data point.

Another approach to the classification problem is to model the probability density function for each group, f_k(x), and then plug the density estimates, along with the prior probabilities q_k, into Bayes' theorem to get

    P_k(x) = f_k(x) q_k / sum_{l=1}^{K} f_l(x) q_l

In this case we automatically get a legitimate set of probabilities, but our models for the probability density functions should obey the constraints

    f_k(x) >= 0    and    integral of f_k(x) dx = 1

We can employ a variety of nonparametric density estimation techniques to accomplish this task.
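As a concrete illustration of the density-based route, here is a minimal sketch of plugging class densities and priors into Bayes' theorem. The two-class Gaussian densities, their parameters, and the priors below are invented for illustration; they are not from the well data.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Univariate normal density, standing in for a class-conditional f_k(x)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior_probs(x, densities, priors):
    """Bayes' theorem: P_k(x) = f_k(x) q_k / sum_l f_l(x) q_l."""
    joint = [f(x) * q for f, q in zip(densities, priors)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-class example: densities f_1, f_2 and equal priors q_1, q_2
densities = [lambda x: gaussian_pdf(x, 0.0, 1.0),
             lambda x: gaussian_pdf(x, 3.0, 1.0)]
priors = [0.5, 0.5]

p = posterior_probs(1.5, densities, priors)
# The result is automatically a legitimate probability set:
# nonnegative values summing to 1
print(p, sum(p))
```

Because the denominator renormalizes the products f_k(x) q_k, no extra transform is needed to satisfy the probability constraints.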
Example Data

We will continue using the north-central Kansas Cretaceous dataset introduced in Wednesday's lecture. The data from the Jones and Kenyon wells are available in an Excel workbook at http://people.ku.edu/gbohling/EECS833/JonesKenyon.xls. Note that there is no core available for some segments of the Jones well; these segments have a facies number of 0, for "unknown".

I've generated most of the results and figures in this lecture using S-Plus and/or R, the commercial and open-source implementations of the S statistical language. R (essentially GNU S) provides a very nice statistical data analysis environment for free, with numerous add-in packages contributed by the user community. The R project web page is http://www.r-project.org.

Nearest-Neighbor Classification

In nearest-neighbor classification we assign the group or class at some point x in the predictor-variable space based on a majority vote of the class memberships at nearby training data points with known group memberships. Alternatively, one can compute a vector of group-membership probabilities by dividing the count for each group by the total number of neighboring points. We would want to scale the predictor variables to comparable ranges, say by standardizing each variable to zero mean and unit standard deviation, before computing distances to neighboring points.

Applying this approach to the two-facies Rhomaa-Umaa example and predicting over a grid of Rhomaa-Umaa values using 15 nearest neighbors, we get the following map of probability of membership in the marine facies:

[Figure: probability of marine facies over the Rhomaa-Umaa grid, 15 nearest neighbors]
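The voting and standardization steps described above can be sketched as follows. This is an illustrative toy, not the code used for the figures (which were generated in S-Plus/R); the six training points and their facies labels are made up.

```python
import math
from collections import Counter

def standardize(data):
    """Scale each variable to zero mean and unit standard deviation."""
    d = len(data[0])
    means = [sum(row[j] for row in data) / len(data) for j in range(d)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / len(data))
           for j in range(d)]
    return [[(row[j] - means[j]) / sds[j] for j in range(d)] for row in data]

def knn_probs(x, train_x, train_g, k):
    """Group-membership probabilities: count of each group among the k
    nearest training points, divided by k."""
    order = sorted(range(len(train_x)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(x, train_x[i])))
    votes = Counter(train_g[i] for i in order[:k])
    return {g: votes[g] / k for g in votes}

# Made-up two-class toy data standing in for the Rhomaa-Umaa example
train_x = [[2.65, 8.0], [2.68, 9.0], [2.70, 8.5],
           [2.85, 12.0], [2.88, 13.0], [2.90, 12.5]]
train_g = ["marine", "marine", "marine", "paralic", "paralic", "paralic"]

sx = standardize(train_x)
# probabilities at a standardized point inside the marine cluster
print(knn_probs(sx[0], sx, train_g, 3))
```

Standardizing first matters because Rhomaa and Umaa live on very different numeric scales; without it, the larger-scale variable would dominate the distance calculation.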
Applying nearest-neighbor classification with 20 neighbors to the entire Jones well dataset, using all six logs and predicting all six facies, we get the following table of predicted (columns) versus actual (rows) facies:

    Core          Predicted Facies
    Facies        Marine  Paralic  Floodplain  Channel  Splay  Paleosol
    Marine           112        2           8        0      0         6
    Paralic            2      110          19       19      3         3
    Floodplain         5        7         206        7      4         3
    Channel            0        1           5      254      2         0
    Splay              0        0          21        2     15         0
    Paleosol           2        5          25        0      0        34

Overall, 82% of the intervals are classified correctly. The predicted facies versus depth look a little noisy:

[Figure: predicted facies and core facies versus depth in the Jones well]

The predicted facies versus depth in the Kenyon well are very ratty:

[Figure: predicted facies versus depth in the Kenyon well]

Note that the lines here are there just to help guide the eye through the chatter of data points; they do not represent known facies, because we don't have core from this well.

Although nearest-neighbor classification is conceptually simple and can work quite well in some cases, it is too dependent on the particularities of the training dataset and also becomes highly unreliable in higher dimensions (larger numbers of predictor variables) due to the curse of dimensionality.

The Curse of Dimensionality

We are trying to represent the data distribution for each group in the space of predictor variables, which has a dimension, d, equal to the number of predictor variables. Imagine that all the variables have been standardized to a common scale and that we are considering a hypercube with a side length of c in that common scale. The volume of this hypercube is given by V = c^d, so the volume of the hypercube increases as a power of the dimension. This means that it becomes harder and harder for the training data to fill the volume as the dimensionality increases, and that the probability that an estimation point will fall in empty space, far from any training data point, increases as a power of the number of predictor variables. So even a seemingly huge training dataset could be inadequate for modeling in high dimensions.
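The scaling argument above can be checked with a couple of lines of arithmetic; the side length c = 2 is an arbitrary illustration.

```python
# Volume of a hypercube of side c in d dimensions: V = c**d.
# With c = 2 (in standardized units), the volume grows as a power of d:
for d in (1, 2, 5, 10):
    print(d, 2.0 ** d)

# Equivalently, if n points give adequate density in one dimension,
# n**d are needed for comparable density in d dimensions:
n = 100
print(n ** 10)  # 10**20 points for a 10-dimensional problem
```

The same 100 points that comfortably cover a one-dimensional interval are hopelessly sparse in ten dimensions.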
Another way of looking at the curse of dimensionality is that if n training data points seem adequate for developing a one-dimensional model, with a single predictor variable, then n^d data points are required to get a comparable training-data density for a d-dimensional problem. So if 100 training data points were adequate for a 1-dimensional estimation problem, then 100^10 = 1 x 10^20 data points would be required to give a comparable density for a 10-dimensional problem. For any realistic training dataset, almost all of the predictor-variable space would be far from the nearest data point, and nearest-neighbor averaging would give questionable results.

A number of methods can be used to bridge the gap between a rigid global model, like the normal densities used in discriminant analysis, and an overly flexible nearest-neighbor classification. Most of these methods use a set of basis or kernel functions to interpolate between training data points in a controlled fashion. These methods usually involve a large number of fitting parameters, but they are referred to as nonparametric because the resulting models are not restricted to simple parametric forms. The primary danger of nonparametric techniques is their ability to overfit the training data at the expense of their ability to generalize to other datasets. All these techniques offer some means of controlling the tradeoff between localization (representation of detail in the training data) and generalization (smoothing) by selection of one or more tuning parameters. The specification of the tuning parameters is usually external to the actual fitting process, but optimal tuning-parameter values are often estimated via cross-validation.

Kernel Density Estimation

Kernel density estimation uses kernel basis functions to produce smooth estimates of the group-specific density functions, f_k(x). The kernel functions serve to smear out the location of each training data point, turning a delta-spike function at its exact location into a Gaussian or similar curve centered at that location.
The kernel density estimate at each estimation location is basically a sum of the surrounding kernels, rescaled to represent a density estimate.

We can compute univariate kernel density estimates using the ksdensity function in Matlab. To compute the density for the Umaa values for the marine facies (in the first column of URMar), evaluated at 101 regularly spaced points, Ug = 4:0.1:14, using the default Gaussian smoothing window with a smoothing window width of 0.5, we could type

>> UDenMar = ksdensity(URMar(:,1), Ug, 'width', 0.5);

and for the paralic facies

>> UDenPar = ksdensity(URPar(:,1), Ug, 'width', 0.5);

Then we could use

>> plot(Ug, UDenMar, 'b-')
>> hold on
>> plot(Ug, UDenPar, 'r-')
>> % plot marine data point locations
>> for i = 1:size(URMar,1) plot([URMar(i,1) URMar(i,1)], [0.02 0.03], 'b-'); end
>> % plot paralic data point locations
>> for i = 1:size(URPar,1) plot([URPar(i,1) URPar(i,1)], [0.01 0.02], 'r-'); end

and edit the plot a bit to generate

[Figure: kernel density estimates of Umaa (barns/cc) for the marine and paralic facies, with data point locations ticked along the axis]

For multivariate problems we would generally have to estimate fully multivariate density functions, using high-dimensional kernel functions. However, if we make the simplifying assumption of independence among the variables, then the multivariate density function for each group is given by the product of the density functions of the individual variables for that group:

    f_k(x) = f_k(x_1) f_k(x_2) ... f_k(x_d)

This simplification means that we only need to develop one-dimensional density-function estimates. Plugging these density estimates into Bayes' theorem leads to a method called naive Bayes.

Applying the naive Bayes approach to the Rhomaa-Umaa example, we get the following density estimates for each group, multiplying the one-dimensional kernel density estimates along each axis to get the combined density estimates:

[Figure: 3-D surface plots of the combined Rhomaa-Umaa density estimates for the marine and paralic facies]

Note that the Rhomaa axis in these 3-D plots points in the opposite direction from the Rhomaa axis in the 2-D plots, so the marine cluster plots towards the back rather than towards the front.
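The naive Bayes recipe above can be sketched with hand-rolled one-dimensional Gaussian kernel density estimates. This stands in for the Matlab/ksdensity workflow; the two-variable data points, the bandwidth, and the equal priors are invented for illustration.

```python
import math

def kde_1d(data, width):
    """One-dimensional Gaussian kernel density estimate with window width `width`:
    a rescaled sum of Gaussian kernels centered on the data points."""
    def density(x):
        return sum(math.exp(-0.5 * ((x - xi) / width) ** 2) for xi in data) \
               / (len(data) * width * math.sqrt(2 * math.pi))
    return density

def naive_bayes_prob(x, class_data, priors, width):
    """P_k(x) via Bayes' theorem, with each class density built as a product
    of one-dimensional KDEs (the independence assumption)."""
    joint = []
    for data, q in zip(class_data, priors):
        dens = 1.0
        for j in range(len(x)):
            dens *= kde_1d([row[j] for row in data], width)(x[j])
        joint.append(dens * q)
    total = sum(joint)
    return [j / total for j in joint]

# Invented two-class, two-variable data standing in for Rhomaa-Umaa
marine = [[2.66, 8.2], [2.68, 8.8], [2.70, 9.1]]
paralic = [[2.86, 12.1], [2.88, 12.8], [2.90, 13.2]]

p = naive_bayes_prob([2.67, 8.5], [marine, paralic], [0.5, 0.5], 0.5)
print(p)  # point near the marine cluster: marine probability close to 1
```

In practice one would choose a separate bandwidth per variable (and per group), since a single width only makes sense after the variables are on comparable scales.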
As contour plots, the two densities look like:

[Figure: contour plots of the marine and paralic density estimates in the Rhomaa-Umaa plane]

Combining the density estimates for the two groups in Bayes' theorem gives the following probability for the marine facies:

[Figure: contour map of the estimated probability of the marine facies in the Rhomaa-Umaa plane]

Applying the naive Bayes procedure to the full Jones dataset, we get the following table of results, with a 72% correct classification rate overall:

    Core          Predicted Facies
    Facies        Marine  Paralic  Floodplain  Channel  Splay  Paleosol
    Marine           116        0           6        0      0         6
    Paralic            4      113          18        8      3        10
    Floodplain        20       13         156        0     16        27
    Channel            0       84           1      174      3         0
    Splay              0        1           9        0     28         0
    Paleosol           4        5          10        0      0        47

The predicted facies versus depth look like:

[Figure: naive Bayes predicted facies and core facies versus depth in the Jones well]

Neural Networks

For classification problems we can use neural networks to directly model the probabilities of membership in each category. At the output end, we use a softmax transfer function,

    P_k(x) = exp(T_k) / sum_{l=1}^{K} exp(T_l)

to transform the unbounded output values, T_k, into probabilities, P_k. This transform ensures that we get nonnegative probabilities that sum to 1.

To train the network, we adjust the weights so that the estimated probabilities match the group indicator values, y_ik, as closely as possible for the N training data points. Although it is possible to use a least-squares objective function in this case, it is more common to use the following cross-entropy objective function:

    R(alpha, beta) = - sum_{i=1}^{N} sum_{k=1}^{K} y_ik log P_k(x_i; alpha, beta)

where alpha and beta represent the network weights to be adjusted. The first sum is over the training data points and the second is over the groups for each data point. Because the group indicators y_ik are either 0 or 1, this objective function is really just the sum of the negative logarithms of the predicted probabilities associated with the actual class for each training data point.
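The softmax transform and the cross-entropy objective described above can be sketched as standalone helper functions (these are illustrations of the formulas, not the internals of R's nnet):

```python
import math

def softmax(t):
    """Map unbounded outputs T_k to probabilities: exp(T_k) / sum_l exp(T_l)."""
    m = max(t)  # subtract the max before exponentiating, for numerical stability
    e = [math.exp(v - m) for v in t]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(probs, indicators):
    """R = -sum_i sum_k y_ik log P_k(x_i): because each y_ik is 0 or 1, this is
    the summed negative log probability of the actual class of each point."""
    return -sum(y * math.log(p)
                for p_row, y_row in zip(probs, indicators)
                for p, y in zip(p_row, y_row))

p = softmax([2.0, 1.0, -1.0])
print(p, sum(p))  # nonnegative probabilities summing to 1
# two training points: the observed class is the second, then the first
print(cross_entropy([[0.1, 0.9], [0.8, 0.2]], [[0, 1], [1, 0]]))
```

Subtracting the maximum inside softmax leaves the result unchanged (the factor cancels in the ratio) but avoids overflow for large T_k.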
Because -log(P_k) goes to zero as P_k approaches 1, and goes to infinity as P_k approaches 0, minimizing the cross-entropy objective function tends to drive the predicted probabilities for the observed classes towards 1.

Here is a schematic representation of the classification network for our example, leaving the bias nodes out of the picture:

[Figure: network schematic showing input layer, hidden layer, and output layer]

More specifically, I have used the neural network implemented by the nnet function in R. This implements a single-hidden-layer network with logsig transfer functions for the hidden-layer nodes and, for categorical prediction, softmax transfer functions for the output layer. The damping or decay parameter, lambda, multiplies the sum of the squared network weights, so that the modified objective function is

    R(alpha, beta) = - sum_{i=1}^{N} sum_{k=1}^{K} y_ik log P_k(x_i; alpha, beta) + lambda * (sum alpha^2 + sum beta^2)

The alphas are the input-to-hidden-layer weights and the betas are the hidden-layer-to-output weights. Increasing the decay parameter lambda forces the weights to be smaller than they would be otherwise, leading to a smoother representation of the boundaries between classes. Without damping, the weights can become arbitrarily large in magnitude, allowing the network to develop very sharp sigmoid boundaries to reproduce fine details of the training data, a form of overtraining.

Both sets of weights include a bias node (not shown in the diagram). For d input variables (d+1 including the bias), M hidden-layer nodes (M+1 including the bias node), and K output classes, the number of weights in this network is M(d+1) + K(M+1).

We can use any of a number of optimization algorithms to adjust the weights. The code for R's nnet function uses BFGS (Broyden-Fletcher-Goldfarb-Shanno) optimization. In my work so far, including the Hugoton-Panoma field modeling, I have primarily tweaked the number of hidden-layer nodes and the decay parameter, leaving other parameters at their default values (including the maximum number of iterations, 100).
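For instance, the weight-count formula can be checked for the full Jones example, with six predictor logs, six facies classes, and a 20-node hidden layer; this little calculation is mine, not from the notes.

```python
def n_weights(d, M, K):
    """Weights in a single-hidden-layer network with bias nodes:
    M*(d+1) input-to-hidden plus K*(M+1) hidden-to-output."""
    return M * (d + 1) + K * (M + 1)

# six predictor logs, 20 hidden-layer nodes, six facies classes
print(n_weights(6, 20, 6))  # 20*7 + 6*21 = 266
```

With 266 adjustable weights against a few hundred half-foot training intervals, the need for the decay penalty to restrain overfitting is clear.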
Using a single hidden-layer node for the Rhomaa-Umaa example, the sigmoid basis simply forms a step, or boundary, between the paralic and marine data points. I've included those points on the plot in their indicator form: 0's for paralic and 1's for marine.

[Figure: probability of marine facies from a single-hidden-node network, with training points plotted as 0's and 1's]

The figure illustrates that the neural net's sigmoid basis functions are ideal for classification problems, which are all about drawing boundaries between classes.

Fitting five basis functions with a decay constant of 0.01 yields:

[Figure: probability of marine facies from a five-hidden-node network with decay 0.01]

Fitting a network with 20 hidden-layer nodes to the entire Jones dataset, using a decay constant of 0.1 (just guessing at reasonable values for the tuning parameters), yields the following classification table, with 93% of the facies predicted correctly:

    Core          Predicted Facies
    Facies        Marine  Paralic  Floodplain  Channel  Splay  Paleosol
    Marine           124        1           1        0      0         2
    Paralic            0      140           5        9      0         2
    Floodplain         0        9         216        2      2         3
    Channel            0        4           2      256      0         0
    Splay              0        2           8        0     28         0
    Paleosol           0        2           7        0      0        57

And the predictions versus depth look like:

[Figure: neural network predicted facies and core facies versus depth in the Jones well]

The predictions in the Kenyon well are still quite messy:

[Figure: neural network predicted facies versus depth in the Kenyon well]

We can also look at the facies probabilities versus depth in each well, first the Jones, then the Kenyon:

[Figure: facies probabilities versus depth in the Jones and Kenyon wells]

For the facies classification problem, we have certain expectations regarding typical facies thicknesses and regarding which facies occur more frequently above other facies. We can encode these expectations into a transition probability matrix that describes the probability of observing each facies at a certain depth, given that a particular facies occurs at the next depth below.
The transition probability matrix, developed from the facies observed at half-foot intervals (with a little modification), is:

    Facies                     Facies Above
    Below         Marine  Paralic  Floodplain  Channel  Splay  Paleosol
    Marine          0.99     0.00        0.00     0.00   0.00      0.01
    Paralic         0.01     0.98        0.01     0.01   0.00      0.00
    Floodplain      0.00     0.01        0.96     0.01   0.00      0.01
    Channel         0.00     0.00        0.01     0.97   0.01      0.01
    Splay           0.00     0.00        0.03     0.00   0.97      0.00
    Paleosol        0.00     0.03        0.03     0.00   0.00      0.94

We can use Bayes' theorem again to combine the probabilities predicted from the logs at depth m, P_k^(m), with facies probabilities derived from the probabilities at the underlying interval, m-1, weighted by the upward transition probabilities, t_jk, to derive modified probabilities at each depth, combining the information from the logs with our expectations regarding spatial adjacency:

    p_k^(m) = P_k^(m) sum_j t_jk p_j^(m-1) / sum_l [ P_l^(m) sum_j t_jl p_j^(m-1) ]

Applying this procedure in both wells cleans up the predicted probabilities considerably:

[Figure: modified facies probabilities versus depth in the Jones and Kenyon wells]

[Figure: modified predicted facies and core facies versus depth in the Jones well, and modified predicted facies versus depth in the Kenyon well]
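One step of this upward sweep can be sketched as follows. This is my own illustration of the update formula: the two-facies transition matrix, the probabilities at the interval below, and the log-derived probabilities are all invented, with trans[j][k] standing for t_jk, the probability of facies k directly above facies j.

```python
def update_upward(log_probs, trans, p_below):
    """One step of the Bayes update: combine the log-derived probabilities
    P_k^(m) at depth m with the probabilities p_j^(m-1) at the interval below,
    weighted by upward transition probabilities t_jk, then renormalize."""
    K = len(log_probs)
    prior = [sum(trans[j][k] * p_below[j] for j in range(K)) for k in range(K)]
    joint = [log_probs[k] * prior[k] for k in range(K)]
    total = sum(joint)
    return [j / total for j in joint]

# Invented two-facies example with strong persistence in the transition matrix
trans = [[0.9, 0.1],
         [0.1, 0.9]]
p_below = [1.0, 0.0]    # facies 1 known at the interval below
log_probs = [0.5, 0.5]  # logs are uninformative at this depth
print(update_upward(log_probs, trans, p_below))  # persistence favors facies 1
```

Sweeping this update from the bottom of the well to the top propagates the adjacency expectations through the whole profile, which is what smooths the noisy interval-by-interval predictions.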