Neural Networks & Fuzzy Systems (EECS 833)
These 26 pages of class notes were uploaded by Melissa Metz on Monday, September 7, 2015. The notes belong to EECS 833 at Kansas, taught by Staff in Fall.
NONPARAMETRIC CLASSIFICATION TECHNIQUES
EECS 833, 3 March 2006
Geoff Bohling, Assistant Scientist, Kansas Geological Survey
geoff@kgs.ku.edu, 864-2093

Overheads and other resources available at http://people.ku.edu/~gbohling/EECS833

Modeling Categorical Variables (Classification)

In classification applications, we are trying to develop a model for predicting a categorical response variable, G, from one or more predictor variables, X. That is, if we know that an observation arises from one of K different mutually exclusive classes or groups, G_k, then we are trying to estimate the probability of occurrence of each group at each point in the predictor space:

    P_k(x) = Prob(G_k | X = x)

We could then assign each estimation point to the class with the highest probability at that point, segmenting the predictor space into regions assigned to the different classes.

The fact that the probabilities are continuous variables immediately suggests the possibility of modeling them using parametric or nonparametric regression techniques (brief overview at http://people.ku.edu/~gbohling/cpe940/NonParReg.pdf). However, we will have to apply transforms to ensure that the estimated P_k(x) values represent a legitimate set of probabilities for mutually exclusive events, meaning

    0 <= P_k(x) <= 1,  k = 1, ..., K        and        sum_{k=1}^{K} P_k(x) = 1

To use this approach, the group memberships in the training dataset are coded as a set of indicator variables, one for each group, with y_ik = 1 for the group to which data point i belongs and y_ik = 0 for all other groups; this is the set of probabilities corresponding to our certain knowledge of the group membership for each training data point.

Another approach to the classification problem is to model the probability density function for each group, f_k(x), and then plug the density estimates, along with those pesky prior probabilities q_k, into Bayes' theorem to get

    P_k(x) = f_k(x) q_k / sum_{j=1}^{K} f_j(x) q_j

In this case we automatically get a legitimate set of probabilities, but our models for the probability density functions should obey the constraints

    f_k(x) >= 0        and        integral of f_k(x) dx = 1

We can employ a variety of nonparametric density estimation techniques to accomplish this task.
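As a quick illustration of the density-based route, the sketch below (Python, used here purely for illustration; the Gaussian densities, priors, and evaluation point are all made up) plugs group-specific densities and priors into Bayes' theorem and returns a legitimate probability vector:

```python
import math

def gaussian_pdf(x, mu, sigma):
    # Univariate normal density; stands in for any estimated f_k(x)
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def class_probabilities(x, densities, priors):
    # Bayes' theorem: P_k(x) = f_k(x) q_k / sum_j f_j(x) q_j
    joint = [f(x) * q for f, q in zip(densities, priors)]
    total = sum(joint)
    return [j / total for j in joint]

# Two hypothetical groups with made-up densities and prior probabilities
densities = [lambda x: gaussian_pdf(x, 10.0, 1.0),
             lambda x: gaussian_pdf(x, 13.0, 1.5)]
priors = [0.6, 0.4]

probs = class_probabilities(11.0, densities, priors)
print(probs)
print(sum(probs))
```

However the individual densities are estimated, dividing by the sum of the density-prior products is what guarantees nonnegative probabilities summing to 1.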
References

T. Hastie, R. Tibshirani, and J. Friedman, 2001, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, New York, 533 pp.

R. Duda, P. Hart, and D. Stork, 2001, Pattern Classification, Second Edition, John Wiley & Sons, Inc., New York, 654 pp.

Example Data

We will continue using the north-central Kansas Cretaceous dataset introduced in Wednesday's lecture. The data from the Jones and Kenyon wells are available in an Excel workbook at http://people.ku.edu/~gbohling/EECS833/JonesKenyon.xls. Note that there is no core available for some segments of the Jones well data; these segments have a facies number of 0, for "unknown".

I've generated most of the results and figures in this lecture using S-Plus and/or R, the commercial and open-source implementations of the S statistical language. R (essentially GNU S) provides a very nice statistical data analysis environment for free, with numerous add-in packages contributed by the user community. The R project web page is http://www.r-project.org.

Nearest-Neighbor Classification

In nearest-neighbor classification, we assign the group or class at some point x in the predictor variable space based on a majority vote of the class memberships at nearby training data points with known group memberships. Alternatively, one can compute a vector of group membership probabilities by dividing the count for each group by the total number of neighboring points. We would want to scale the predictor variables to comparable ranges, say by standardizing each variable to zero mean and unit standard deviation, before computing distances to neighboring points.

Applying this approach to the two-facies Rhomaa-Umaa example, and predicting over a grid of Rhomaa-Umaa values using 15 nearest neighbors, we get the following map of the probability of membership in the marine facies:

[Figure: probability of marine-facies membership over the Rhomaa-Umaa grid, with the paralic region labeled]

Applying nearest-neighbor classification with 20 neighbors to the entire Jones well dataset, using all six logs and predicting all six facies, we get the table of predicted (columns) versus actual (rows) facies shown below.
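The voting scheme itself takes only a few lines. In this sketch (Python for illustration; the six training points, their labels, and the query point are invented), the predictors are standardized before computing Euclidean distances, as suggested above:

```python
import math
from collections import Counter

def standardize(data):
    # Scale each column to zero mean and unit standard deviation
    d, n = len(data[0]), len(data)
    means = [sum(row[j] for row in data) / n for j in range(d)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / n)
           for j in range(d)]
    scale = lambda row: [(row[j] - means[j]) / sds[j] for j in range(d)]
    return [scale(row) for row in data], scale

def knn_probabilities(x, train_x, train_g, k):
    # Sort training points by squared distance to x, then convert the
    # votes among the k nearest into group-membership probabilities
    order = sorted(range(len(train_x)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(x, train_x[i])))
    votes = Counter(train_g[i] for i in order[:k])
    return {g: votes.get(g, 0) / k for g in set(train_g)}

# Invented two-predictor training data with two facies labels
train = [[10.0, 2.68], [10.5, 2.70], [11.0, 2.69],
         [13.0, 2.75], [13.5, 2.76], [14.0, 2.74]]
labels = ['marine', 'marine', 'marine', 'paralic', 'paralic', 'paralic']
scaled, scale = standardize(train)

pred = knn_probabilities(scale([10.6, 2.69]), scaled, labels, 3)
print(pred)
```

The query point sits among the three marine points, so all three votes go to the marine class.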
[Table: predicted (columns) versus actual (rows) facies counts for the Jones well]

Overall, 82% of the intervals are classified correctly. The predicted facies versus depth look a little noisy:

[Figure: predicted and core facies versus depth in the Jones well; facies are Marine, Paralic, Floodplain, Channel, Splay, and Paleosol]

The predicted facies versus depth in the Kenyon well are very ratty:

[Figure: predicted facies versus depth in the Kenyon well]

Note that the lines here are there just to help guide the eye through the chatter of data points; they do not represent known facies, because we don't have core from this well.

Although nearest-neighbor classification is simple and can work quite well in some cases, it is too dependent on the particularities of the training dataset and also becomes highly unreliable in higher dimensions (larger numbers of predictor variables) due to the curse of dimensionality.

The Curse of Dimensionality

We are trying to represent the data distribution for each group in the space of predictor variables, which has a dimension d equal to the number of predictor variables. Imagine that all the variables have been standardized to a common scale and that we are considering a hypercube with a side length of c in that common scale. The volume of this hypercube is given by

    V = c^d

So the volume of the hypercube increases as a power of the dimension. This means that it becomes harder and harder for the training data to fill the volume as the dimensionality increases, and that the probability that an estimation point will fall in empty space, far from any training data point, increases as a power of the number of predictor variables. So even a seemingly huge training dataset could be inadequate for modeling in high dimensions. Another way of looking at the curse of dimensionality is that if n training data points seem adequate for developing a one-dimensional model with a single predictor variable, then n^d data points are required to get a comparable training data density for a d-dimensional problem.
So if 100 training data points were adequate for a 1-dimensional estimation problem, then 100^10 = 1x10^20 data points would be required to give a comparable density for a 10-dimensional problem. For any realistic training dataset, almost all of the predictor variable space would be far from the nearest data point, and nearest-neighbor averaging would give questionable results.

A number of methods can be used to bridge the gap between a rigid global model, like the normal densities used in discriminant analysis, and an overly flexible nearest-neighbor classification. Most of these methods use a set of basis or kernel functions to interpolate between training data points in a controlled fashion. These methods usually involve a large number of fitting parameters, but they are referred to as nonparametric because the resulting models are not restricted to simple parametric forms. The primary danger of nonparametric techniques is their ability to overfit the training data at the expense of their ability to generalize to other data sets. All these techniques offer some means of controlling the tradeoff between localization (representation of detail in the training data) and generalization (smoothing) through the selection of one or more tuning parameters. The specification of the tuning parameters is usually external to the actual fitting process, but optimal tuning parameter values are often estimated via cross-validation.

Kernel Density Estimation

Kernel density estimation uses kernel basis functions to produce smooth estimates of the group-specific density functions f_k(x). The kernel functions serve to smear out the location of each training data point, turning a delta (spike) function at its exact location into a Gaussian or similar curve centered at that location. The kernel density estimate at each estimation location is basically a sum of the surrounding kernels, rescaled to represent a density estimate.
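This kernel sum can be sketched directly (Python for illustration; the data values and grid mimic the Umaa example below but are invented):

```python
import math

def gaussian_kernel_density(data, grid, width):
    # Sum a Gaussian bump (standard deviation = width) centered at each
    # data point, then divide by n so the estimate integrates to 1
    n = len(data)
    c = 1.0 / (width * math.sqrt(2.0 * math.pi))
    return [c * sum(math.exp(-0.5 * ((x - xi) / width) ** 2) for xi in data) / n
            for x in grid]

# Invented data values and an evaluation grid of 101 points from 4 to 14
data = [8.2, 8.5, 8.9, 9.1, 12.3, 12.8]
grid = [4.0 + 0.1 * i for i in range(101)]
dens = gaussian_kernel_density(data, grid, 0.5)

# A Riemann sum over the grid should come out close to 1,
# since the estimate is a rescaled sum of unit-area kernels
mass = sum(d * 0.1 for d in dens)
print(round(mass, 2))
```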
We can compute univariate kernel density estimates using the ksdensity function in Matlab. To compute the density of the Umaa values for the marine facies (in the first column of URMar), evaluated at 101 regularly spaced points, Ug = 4:0.1:14, using the default Gaussian smoothing window with a smoothing window width of 0.5, we could type

>> UDenMar = ksdensity(URMar(:,1), Ug, 'width', 0.5);

and for the paralic facies

>> UDenPar = ksdensity(URPar(:,1), Ug, 'width', 0.5);

Then we could use

>> plot(Ug, UDenMar, 'b-')
>> hold on
>> plot(Ug, UDenPar, 'r-')
>> % plot marine data point locations
>> for i = 1:size(URMar,1), plot([URMar(i,1) URMar(i,1)], [0.02 0.03], 'b-'), end
>> % plot paralic data point locations
>> for i = 1:size(URPar,1), plot([URPar(i,1) URPar(i,1)], [0.01 0.02], 'r-'), end

and edit the plot a bit to generate

[Figure: kernel density estimates of Umaa (barns/cc) for the marine and paralic facies, with data point locations marked as ticks]

For multivariate problems, we would generally have to estimate fully multivariate density functions using high-dimensional kernel functions. However, if we make the simplifying assumption of independence, then the density function for each group is given by the product of the density functions of the individual variables for that group:

    f_k(x) = f_k1(x_1) * f_k2(x_2) * ... * f_kd(x_d)

This simplification means that we only need to develop one-dimensional density function estimates. Plugging these density estimates into Bayes' theorem leads to a method called naive Bayes.

Applying the naive Bayes approach to the Rhomaa-Umaa example, we get the following density estimates for each group, multiplying the one-dimensional kernel density estimates along each axis to get the combined density estimates:

[Figure: 3D surface plots of the combined Rhomaa-Umaa density estimates for the marine and paralic facies]

Note that the Rhomaa axis in these 3D plots is pointed in the opposite direction from the Rhomaa axis in the 2D plots, so the marine cluster is plotting towards the back rather than towards the front. As contour plots, the two densities look like:

[Figure: contour plots of the marine and paralic density estimates]
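Under the independence assumption, the whole naive Bayes procedure (one-dimensional kernel estimates per variable, multiplied together and fed through Bayes' theorem) fits in a short sketch. The two-variable training data, priors, and query point below are invented (Python for illustration):

```python
import math

def kernel_density_1d(data, x, width=0.5):
    # One-dimensional Gaussian kernel density estimate at x
    c = 1.0 / (width * math.sqrt(2.0 * math.pi) * len(data))
    return c * sum(math.exp(-0.5 * ((x - xi) / width) ** 2) for xi in data)

def naive_bayes_probs(x, groups, priors, width=0.5):
    # f_k(x) = product over variables of the 1-D densities, then Bayes' theorem
    joint = []
    for cols, q in zip(groups, priors):
        fk = 1.0
        for j, xj in enumerate(x):
            fk *= kernel_density_1d(cols[j], xj, width)
        joint.append(fk * q)
    total = sum(joint)
    return [j / total for j in joint]

# Invented two-variable training data, stored per group as
# [variable-1 values, variable-2 values]; priors are illustrative
marine  = [[8.2, 8.5, 8.9], [2.68, 2.70, 2.69]]
paralic = [[12.3, 12.8, 13.1], [2.75, 2.76, 2.74]]

probs = naive_bayes_probs([8.6, 2.69], [marine, paralic], [0.5, 0.5])
print(probs)
```

For this query point, variable 1 sits squarely among the marine values, so the marine probability dominates.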
Combining the density estimates for the two groups in Bayes' theorem gives the following probability surface for the marine facies:

[Figure: probability of the marine facies over the Rhomaa-Umaa grid]

Or, as a contour plot oriented the log analyst's way:

[Figure: contour plot of the marine facies probability, with the paralic region labeled]

Applying the naive Bayes procedure to the full Jones dataset, we get the following table of results, with a 72% correct classification rate overall:

[Table: predicted versus actual facies counts for the Jones well, naive Bayes]

The predicted facies versus depth look like:

[Figure: predicted and core facies versus depth in the Jones well]

Neural Networks

For classification problems, we can use neural networks to directly model probabilities of membership in each category. At the output end, we use a softmax transfer function,

    P_k(x) = exp(T_k) / sum_{l=1}^{K} exp(T_l)

to transform the unbounded output values T_k into probabilities P_k. This transform ensures that we get nonnegative probabilities that sum to 1. To train the network, we adjust the weights so that the estimated probabilities match the group indicator values y_ik as closely as possible for the N training data points. Although it is possible to use a least-squares objective function in this case, it is more common to use the following cross-entropy objective function:

    R(alpha, beta) = - sum_{i=1}^{N} sum_{k=1}^{K} y_ik log P_k(x_i; alpha, beta)

where alpha and beta represent the network weights to be adjusted. The first sum is over the training data points and the second is over the groups for each data point. Because the group indicators y_ik are either 0 or 1, this objective function is really just the sum of the negative logarithms of the predicted probabilities associated with the actual class for each training data point. Because -log(P_k) goes to zero as P_k approaches 1 and goes to infinity as P_k approaches 0, minimizing the cross-entropy objective function tends to drive the predicted probabilities for the observed classes towards 1.

Here is a schematic representation of the classification network for our example, leaving the bias nodes out of the picture:

[Figure: network schematic with input layer, hidden layer, and output layer]
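Both pieces of this output machinery can be sketched in a few lines (Python for illustration; the output values T and the indicator rows are invented):

```python
import math

def softmax(t):
    # P_k = exp(T_k) / sum_l exp(T_l); shifting by the max is a standard
    # trick for numerical stability and does not change the result
    m = max(t)
    e = [math.exp(v - m) for v in t]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(indicators, probs):
    # R = -sum_i sum_k y_ik log P_k(x_i): since each y row is an indicator,
    # this is the sum of -log of the probability given to the actual class
    return -sum(y * math.log(p)
                for yrow, prow in zip(indicators, probs)
                for y, p in zip(yrow, prow))

# Invented unbounded outputs T for two training points, three classes
outputs = [[2.0, 0.5, -1.0], [0.1, 1.8, 0.0]]
probs = [softmax(t) for t in outputs]
y = [[1, 0, 0], [0, 1, 0]]      # true classes: class 1, then class 2

loss = cross_entropy(y, probs)
print([round(sum(p), 6) for p in probs])   # each row sums to 1
print(round(loss, 4))
```

Driving each observed-class probability toward 1 sends its -log term, and hence the loss, toward 0.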
More specifically, I have used the neural network implemented by the nnet function in R. This implements a single-hidden-layer network with logsig transfer functions for the hidden-layer nodes and, for categorical prediction, softmax transfer functions for the output layer. The damping or decay parameter, lambda, multiplies the sum of the squared network weights, so that the modified objective function is

    R(alpha, beta) = - sum_{i=1}^{N} sum_{k=1}^{K} y_ik log P_k(x_i; alpha, beta) + lambda (sum alpha^2 + sum beta^2)

The alphas are the input-to-hidden-layer weights and the betas are the hidden-layer-to-output weights. Increasing the decay parameter lambda forces the weights to be smaller than they would be otherwise, leading to a smoother representation of the boundaries between classes. Without damping, the weights can become arbitrarily large in magnitude, allowing the network to develop very sharp sigmoid boundaries to reproduce fine details of the training data, a form of overtraining. Both sets of weights include a bias node (weight not shown in the diagram). For d input variables (d+1 including the bias), M hidden-layer nodes (M+1 including the bias node), and K output classes, the number of weights in this network is

    M(d+1) + K(M+1)

We can use any of a number of optimization algorithms to adjust the weights. The code for R's nnet function uses BFGS (Broyden-Fletcher-Goldfarb-Shanno) optimization. In my work so far, including the Hugoton-Panoma field modeling, I have primarily tweaked the number of hidden-layer nodes and the decay parameter, leaving other parameters at their default values, including the maximum number of iterations (100).

Using a single hidden-layer node for the Rhomaa-Umaa example, the sigmoid basis simply forms a step, or boundary, between the paralic and marine data points. I've included those points on the plot in their indicator form: 0's for paralic and 1's for marine.

[Figure: single-sigmoid surface fit to the indicator-coded Rhomaa-Umaa data]

The figure illustrates that the neural net's sigmoid basis functions are ideal for classification problems, which are all about drawing boundaries between classes.
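A single-hidden-layer forward pass of this form, along with the weight count M(d+1) + K(M+1), can be sketched as follows (Python for illustration; all weights here are made up, not fitted):

```python
import math

def logsig(v):
    return 1.0 / (1.0 + math.exp(-v))

def softmax(t):
    m = max(t)
    e = [math.exp(x - m) for x in t]
    s = sum(e)
    return [x / s for x in e]

def forward(x, alpha, beta):
    # One hidden layer: logsig hidden units, softmax outputs.
    # alpha: M rows of d+1 weights (last entry is the bias);
    # beta:  K rows of M+1 weights (last entry is the bias)
    hidden = [logsig(sum(w * v for w, v in zip(row[:-1], x)) + row[-1])
              for row in alpha]
    outputs = [sum(w * h for w, h in zip(row[:-1], hidden)) + row[-1]
               for row in beta]
    return softmax(outputs)

# Made-up weights for d = 2 inputs, M = 3 hidden nodes, K = 2 classes
alpha = [[0.5, -0.2, 0.1], [1.0, 0.3, -0.4], [-0.7, 0.8, 0.0]]
beta = [[1.2, -0.5, 0.3, 0.1], [-1.2, 0.5, -0.3, -0.1]]
d, M, K = 2, 3, 2

print(M * (d + 1) + K * (M + 1))    # weight count: 3*3 + 2*4 = 17
probs = forward([0.4, -0.6], alpha, beta)
print(probs)
```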
Fitting five basis functions with a decay constant of 0.01 yields:

[Figure: marine facies probability from the five-sigmoid fit, with contours at 0.25, 0.50, and 0.75, and the paralic region labeled]

Fitting a network with 20 hidden-layer nodes to the entire Jones dataset, using a decay constant of 0.1 (just guessing at reasonable values for the tuning parameters), yields the following classification table, with 93% of the facies predicted correctly:

[Table: predicted versus actual facies counts for the Jones well, neural network]

And the predictions versus depth look like:

[Figure: predicted and core facies versus depth in the Jones well]

The predictions in the Kenyon are still quite messy:

[Figure: predicted facies versus depth in the Kenyon well]

We can also look at the facies probabilities versus depth in each well, first the Jones, then the Kenyon:

[Figure: facies membership probabilities versus depth in the Jones and Kenyon wells]

For the facies classification problem, we have certain expectations regarding typical facies thicknesses and regarding which facies occur more frequently above other facies. We can encode these expectations into a transition probability matrix that describes the probability of observing each facies at a certain depth, given that a particular facies occurs at the next depth below. The transition probability matrix developed from the facies observed at half-foot intervals (with a little modification) is:

                      Facies Above
    Facies Below    Marine  Paralic  Floodplain  Channel  Splay  Paleosol
    Marine           0.99    0.00      0.00       0.00     0.00    0.01
    Paralic          0.01    0.98      0.01       0.01     0.00    0.00
    Floodplain       0.00    0.01      0.96       0.01     0.00    0.01
    Channel          0.00    0.00      0.01       0.97     0.01    0.01
    Splay            0.00    0.00      0.03       0.00     0.97    0.00
    Paleosol         0.00    0.03      0.03       0.00     0.00    0.94

We can use Bayes' theorem again to combine the probabilities predicted from the logs at depth z_i with facies probabilities derived from the probabilities at the underlying interval z_{i+1}, weighted by the upward transition probabilities t_jk, to derive modified probabilities at each depth, combining information from the logs with our expectations regarding spatial adjacency:
    p'_k(z_i) = p_k(z_i) sum_j t_jk p_j(z_{i+1}) / sum_l [ p_l(z_i) sum_j t_jl p_j(z_{i+1}) ]

Applying this procedure in both wells cleans up the predicted probabilities considerably:

[Figure: modified facies probabilities versus depth in the Jones and Kenyon wells]

[Figure: predicted and core facies versus depth in the Jones well, after the adjacency correction]

[Figure: predicted facies versus depth in the Kenyon well, after the adjacency correction]
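The adjacency correction can be sketched as a bottom-up pass over the intervals (Python for illustration; the two-facies transition matrix and log-derived probabilities are invented, and both the index convention t_up[j][k], the probability of facies k occurring immediately above facies j, and the assumption that the modified probabilities propagate upward are mine):

```python
def adjacency_update(p_logs, t_up):
    # Bottom-up pass: at each depth i, reweight the log-derived
    # probabilities p_logs[i][k] by sum_j t_up[j][k] * p[i+1][j]
    # (the interval below), then renormalize to sum to 1
    n, K = len(p_logs), len(p_logs[0])
    p = [row[:] for row in p_logs]          # deepest interval stays as-is
    for i in range(n - 2, -1, -1):
        weights = [sum(t_up[j][k] * p[i + 1][j] for j in range(K))
                   for k in range(K)]
        raw = [p_logs[i][k] * weights[k] for k in range(K)]
        total = sum(raw)
        p[i] = [r / total for r in raw]
    return p

# Toy example: two facies, an "up" transition matrix strongly favoring
# persistence, and log predictions listed from shallow to deep
t_up = [[0.95, 0.05], [0.05, 0.95]]
p_logs = [[0.55, 0.45], [0.40, 0.60], [0.10, 0.90]]

cleaned = adjacency_update(p_logs, t_up)
for row in cleaned:
    print([round(v, 3) for v in row])
```

The weak facies-1 spike at the top is pulled toward facies 2, mimicking how the correction suppresses thin, isolated misclassifications.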