# CPSC 636 Class Notes

Texas A&M University
These 87-page class notes were uploaded by Mozell Lind on Wednesday, October 21, 2015. The notes belong to CPSC 636 at Texas A&M University, taught by Y. Choe in Fall. Since their upload, they have received 15 views. For similar materials see /class/226081/cpsc-636-texas-a-m-university in Computer Science at Texas A&M University.


## Slide 02: Learning (Haykin Chapter 2)

CPSC 636-600, Instructor: Yoonsuck Choe, Spring 2008

### Introduction

- Learning is a property of primary significance in neural networks: the network learns from its environment and improves its performance through learning.
- Learning proceeds by iterative adjustment of synaptic weights.
- Learning is hard to define. One definition, by Mendel and McClaren: "Learning is a process by which the free parameters of a neural network are adapted through a process of stimulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place."

### Learning: Overview

Sequence of events in neural network learning:

1. The network is stimulated by the environment.
2. The network undergoes changes in its free parameters as a result of this stimulation.
3. The network responds in a new way to the environment because of the changes that have occurred in its internal structure.

A prescribed set of well-defined rules for the solution of the learning problem is called a *learning algorithm*. The manner in which a network relates to its environment dictates the *learning paradigm*, which refers to a model of the environment operated on by the network.

Organization of this chapter:

1. Five basic learning rules: error-correction, Hebbian, memory-based, competitive, and Boltzmann learning.
2. Learning paradigms: the credit-assignment problem; supervised and unsupervised learning.
3. Learning tasks; memory and adaptation.
4. Probabilistic and statistical aspects of learning.

### Error-Correction Learning

Given the input vector $x(n)$, the output $y_k(n)$, and the desired response (target output) $d_k(n)$, the error signal

$$e_k(n) = d_k(n) - y_k(n)$$

actuates a control mechanism that gradually adjusts the synaptic weights to minimize the cost function (or index of performance)

$$E(n) = \frac{1}{2}\, e_k^2(n).$$

When the synaptic weights reach a steady state, learning is stopped.

### Memory-Based Learning

All (or most) past experiences are explicitly stored as input–target pairs $\{(x_i, d_i)\}_{i=1}^{N}$. With two classes $C_1$ and $C_2$: given a new input $x_{\text{test}}$, determine its class based on the local neighborhood of $x_{\text{test}}$. The ingredients are:

- the criterion
used for determining the neighborhood, and
- the learning rule applied to the neighborhood of the input within the set of training examples.

### Error-Correction Learning: The Delta Rule

The delta rule (Widrow–Hoff rule), with learning rate $\eta$:

$$\Delta w_{kj}(n) = \eta\, e_k(n)\, x_j(n)$$

With that, we can update the weights:

$$w_{kj}(n+1) = w_{kj}(n) + \Delta w_{kj}(n)$$

There is a sound theoretical reason for doing this, which we will discuss later.

### Memory-Based Learning: Nearest Neighbor

- Let $\mathcal{X} = \{x_1, x_2, \ldots, x_N\}$ be the set of instances observed so far.
- The nearest neighbor $x_N' \in \mathcal{X}$ of $x_{\text{test}}$ satisfies $\min_i d(x_i, x_{\text{test}}) = d(x_N', x_{\text{test}})$, where $d(\cdot,\cdot)$ is the Euclidean distance.
- $x_{\text{test}}$ is classified as belonging to the same class as $x_N'$.
- Cover and Hart (1967): the error is bounded above by at most twice the optimal Bayes probability of error, given that (1) the classified examples are independently and identically distributed, and (2) the sample size $N$ is infinitely large.

### Memory-Based Learning: k-Nearest Neighbor

- Identify the $k$ classified patterns that lie nearest to the test vector $x_{\text{test}}$, for some integer $k$.
- Assign $x_{\text{test}}$ to the class that is most frequently represented by the $k$ neighbors (majority vote).
- In effect this is like averaging, so it can deal with outliers.

### Hebbian Learning

- Donald Hebb's postulate of learning appeared in his book *The Organization of Behavior* (1949): "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
- Hebbian synapse: if two neurons on either side of a synapse are activated simultaneously, the synapse is strengthened; if they are activated asynchronously, the synapse is weakened or eliminated. (This last part was not mentioned by Hebb.)

### Hebbian Synapses

- Time-dependent mechanism
- Local mechanism
- Interactive mechanism
- Correlative/conjunctive mechanism

There is strong evidence for Hebbian plasticity in the hippocampus brain region.
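As a concrete sketch of the delta rule above, the following trains a single linear neuron with the update $\Delta w = \eta\, e\, x$. The target mapping, learning rate, and iteration count are invented for the illustration.

```python
import numpy as np

# Delta (Widrow-Hoff) rule for a single linear neuron:
#   e(n) = d(n) - y(n),  w(n+1) = w(n) + eta * e(n) * x(n)
# The "environment" below is a made-up, noise-free linear mapping.

rng = np.random.default_rng(0)
w = np.zeros(2)                  # synaptic weights, start at zero
eta = 0.05                       # learning rate (arbitrary choice)
true_w = np.array([2.0, 3.0])    # hypothetical target mapping

for _ in range(2000):
    x = rng.uniform(-1.0, 1.0, size=2)   # input vector
    d = true_w @ x                       # desired response
    y = w @ x                            # actual output
    e = d - y                            # error signal
    w += eta * e * x                     # iterative weight adjustment

print(w)  # converges toward true_w
```

Because the mapping is exactly realizable and noise-free, the error signal vanishes in the steady state and learning can be stopped, as described above.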
### Classification of Synaptic Plasticity

A Hebbian synapse is time-dependent, highly local, and heavily interactive.

| Type | Positively correlated activity | Negatively correlated activity |
|------|-------------------------------|--------------------------------|
| Hebbian | Strengthen | Weaken |
| Anti-Hebbian | Weaken | Strengthen |
| Non-Hebbian | × | × |

### Mathematical Models of Synaptic Plasticity

- General form: $\Delta w_{kj}(n) = F\big(y_k(n),\, x_j(n)\big)$.
- Hebb's hypothesis, with learning rate $\eta$: $\Delta w_{kj} = \eta\, y_k x_j$.
- Covariance hypothesis: $\Delta w_{kj} = \eta\,(x_j - \bar{x})(y_k - \bar{y})$, with a balance point at postsynaptic activity $y = \bar{y}$ and a maximum depression point.

### Covariance Rule (Sejnowski, 1977)

$$\Delta w_{kj} = \eta\,(x_j - \bar{x})(y_k - \bar{y})$$

- Convergence to a nontrivial state.
- Predicts both potentiation and depression.
- Observations:
  - The weight is enhanced when both pre- and postsynaptic activities are above average.
  - The weight is depressed when presynaptic activity is above average and postsynaptic activity below average, or when presynaptic activity is below average and postsynaptic activity above average.

### Competitive Learning

- Output neurons compete with each other for a chance to become active.
- Highly suited to discovering statistically salient features that may aid in classification.
- Three basic elements:
  - The same type of neurons with different weight sets, so that they respond differently to a given set of inputs.
  - A limit imposed on the "strength" of each neuron.
  - A competition mechanism that chooses one winner: the winner-takes-all neuron.

### Inputs and Weights Seen as Vectors in High-Dimensional Space

Inputs and weights can be seen as vectors $x = (x_1, x_2, \ldots, x_n)$ and $w_k = (w_{k1}, w_{k2}, \ldots, w_{kn})$ in the same coordinate system. Note that the weight vector belongs to a particular output neuron $k$, hence the index.

### Competitive Learning: Example

- Single-layer network with feedforward excitatory connections and lateral inhibitory connections.
- Winner selection:
  $$y_k = \begin{cases} 1 & \text{if } v_k > v_j \text{ for all } j \neq k \\ 0 & \text{otherwise} \end{cases}$$
- Limit: $\sum_j w_{kj} = 1$ for all $k$.
- Adaptation:
  $$\Delta w_{kj} = \begin{cases} \eta\,(x_j - w_{kj}) & \text{if } k \text{ is the winner} \\ 0 & \text{otherwise} \end{cases}$$

The synaptic weight vector $w_k = (w_{k1}, w_{k2}, \ldots, w_{kn})$ is moved toward the input vector.
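The plain Hebbian and covariance updates above can be compared on synthetic activity traces. The activity statistics, learning rate, and trace length below are invented for the illustration.

```python
import numpy as np

# Compare Hebb's hypothesis (dw = eta*y*x) with the covariance rule
# (dw = eta*(x - x_bar)*(y - y_bar)) on correlated pre/post activity.

rng = np.random.default_rng(1)
eta, steps = 0.01, 5000
pre = rng.normal(1.0, 0.5, steps)          # presynaptic activity x_j
post = pre + rng.normal(0.0, 0.1, steps)   # positively correlated y_k

w_hebb = 0.0                               # plain Hebb: grows with raw products
for x, y in zip(pre, post):
    w_hebb += eta * y * x

x_bar, y_bar = pre.mean(), post.mean()
w_cov = 0.0                                # covariance rule: grows with co-fluctuation
for x, y in zip(pre, post):
    w_cov += eta * (x - x_bar) * (y - y_bar)

print(w_hebb, w_cov)  # both positive for positively correlated activity
```

Note the qualitative difference: plain Hebb strengthens the weight even from the positive mean activity alone, while the covariance rule responds only to deviations from the average, which is what lets it predict depression as well as potentiation.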
### Boltzmann Learning

- A stochastic learning algorithm rooted in statistical mechanics.
- Recurrent network of binary neurons: on = $+1$, off = $-1$.
- Energy function:
  $$E = -\frac{1}{2} \sum_j \sum_{k,\, k \neq j} w_{kj}\, x_k x_j$$
- Activation: choose a random neuron $k$ and flip its state with a probability (at temperature $T$)
  $$P(x_k \to -x_k) = \frac{1}{1 + \exp(-\Delta E_k / T)},$$
  where $\Delta E_k$ is the change in $E$ due to the flip.

### Competitive Learning: Clustering

- Adaptation: $\Delta w_{kj} = \eta\,(x_j - w_{kj})$ if $k$ is the winner, 0 otherwise.
- Interpreting this in vector form, the weight vectors converge toward local input clusters (clustering).

### Boltzmann Machine

- Two types of neurons: visible neurons, which can be affected by the environment, and hidden neurons, which are isolated.
- Two modes of operation:
  - Clamped: visible neuron states are fixed by environmental input and held constant.
  - Free-running: all neurons are allowed to update their activity freely.

### Boltzmann Machine: Learning and Operation

- Learning:
  - $\rho_{kj}^{+}$: correlation of the activities of neurons $k$ and $j$ in the clamped condition.
  - $\rho_{kj}^{-}$: correlation of the activities in the free-running condition.
  - Weight update: $\Delta w_{kj} = \eta\,(\rho_{kj}^{+} - \rho_{kj}^{-})$, for $j \neq k$.
- Train the weights $w_{kj}$ with various clamping input patterns.
- After training is completed, present a new clamping input pattern that is a partial version of one of the known vectors.
- Let the network run clamped on the new input (a subset of the visible neurons); eventually it will complete the pattern (pattern completion).

### Credit-Assignment Problem

- How do we assign credit or blame for the overall outcome to the individual decisions made by the learning machine?
- In many cases the outcome depends on a sequence of actions:
  - Assignment of credit for outcomes to actions is the *temporal* credit-assignment problem: when does a particular action deserve credit?
  - Assignment of credit for actions to internal decisions is the *structural* credit-assignment problem: assign credit to the internal structures of actions generated by the system.
- The credit-assignment problem routinely arises in neural network learning: which neuron, which connection should be credited or blamed?
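The winner-takes-all adaptation above, which moves the winning weight vector toward the input, can be sketched as a small clustering demo. The cluster centers, noise level, and learning rate are invented for the illustration.

```python
import numpy as np

# Winner-take-all competitive learning: only the winning neuron k
# (the one with the closest weight vector) updates, dw_k = eta*(x - w_k),
# so the weight vectors converge toward the local input clusters.

rng = np.random.default_rng(2)
centers = np.array([[0.0, 1.0], [1.0, 0.0]])   # two hypothetical input clusters
W = rng.uniform(0.0, 1.0, size=(2, 2))         # one weight vector per output neuron
eta = 0.1

for _ in range(3000):
    x = centers[rng.integers(2)] + rng.normal(0.0, 0.05, 2)  # noisy cluster sample
    k = np.argmin(np.linalg.norm(W - x, axis=1))             # winner selection
    W[k] += eta * (x - W[k])                                 # move winner toward x

print(W)  # each row ends up near one cluster center
```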
### Learning Paradigms

How neural networks relate to their environment:

- the credit-assignment problem
- learning with a teacher
- learning without a teacher

### Learning with a Teacher

- Also known as supervised learning.
- The teacher has knowledge, represented as input–output examples; the environment itself is unknown to the network.
- The network tries to emulate the teacher gradually.
- Error-correction learning is one way to achieve this (error-surface gradient, steepest descent, etc.).

### Learning without a Teacher

Two classes:

- Reinforcement learning (RL) / neurodynamic programming
- Unsupervised learning / self-organization

### Learning without a Teacher: Reinforcement Learning

- Actor–critic architecture: the critic converts the primary reinforcement signal from the environment into a higher-quality heuristic reinforcement signal (Barto, Sutton).
- The goal is to optimize the cumulative cost of actions.
- In many cases learning proceeds under delayed reinforcement. Delayed RL is difficult because (1) the teacher does not provide a desired action at each step, and (2) the learner must solve the temporal credit-assignment problem.
- Related to dynamic programming, in the context of optimal control theory (Bellman).

### Learning without a Teacher: Unsupervised Learning

- Learn based on a task-independent measure of the quality of the representation.
- Build internal representations for encoding features of the input space.
- A competitive learning rule is needed, such as winner-takes-all.

### Learning Tasks

- Pattern association
- Pattern recognition
- Function approximation
- Control
- Filtering / beamforming

Memory and adaptation: learning an input–output mapping through continued interaction with the environment.

### Pattern Association and Pattern Classification

- An associative memory is a brain-like distributed memory that learns by association, supporting storage and retrieval (recall).
- Pattern classification maps an input pattern to one of a prescribed number of classes (categories). There are two general types: feature extraction (from observation space
to feature space, a dimensionality reduction) followed by classification (from feature space to decision space), or a single step directly from observation space to decision space.
- Pattern association maps a key pattern $x_k$ to a memorized pattern $y_k$: $x_k \to y_k$, $k = 1, 2, \ldots, q$.
  - Autoassociation ($x_k = y_k$): given a partial or corrupted version of a stored pattern, retrieve the original.
  - Heteroassociation ($x_k \neq y_k$): learn arbitrary pattern pairs and retrieve them.
- Relevant issues: storage capacity vs. accuracy.

### Function Approximation

- Nonlinear input–output mapping: $d = f(x)$ for an unknown $f$.
- Given a set of labeled examples $T = \{(x_i, d_i)\}_{i=1}^{N}$, estimate $F(\cdot)$ such that $\|F(x) - f(x)\| < \epsilon$ for all $x$.
- System identification: learn the function of an unknown system, $d = f(x)$.
- Inverse system modeling: learn the inverse function, $x = f^{-1}(d)$.

### Control

- Control of a plant: a process, or critical part of a system, that is to be maintained in a controlled condition.
- A feedback controller adjusts the plant input $u$ so that the plant output $y$ tracks the reference signal $d$. Learning takes the form of free-parameter adjustment in the controller.

### Filtering, Smoothing, and Prediction

Extract information about a quantity of interest from a set of noisy data:

- Filtering: estimate the quantity at time $n$ based on measurements up to time $n$.
- Smoothing: estimate the quantity at time $n$ based on measurements up to time $n + \alpha$, with $\alpha > 0$.
- Prediction: estimate the quantity at time $n + \alpha$, with $\alpha > 0$, based on measurements up to time $n$.

### Blind Source Separation and Nonlinear Prediction

- Blind source separation: recover the source vector $u(n)$ from the distorted signal $x(n) = A\,u(n)$ when the mixing matrix $A$ is unknown.
- Nonlinear prediction: given $x(n - T), x(n - 2T), \ldots$, estimate $\hat{x}(n)$.

### Linear Algebra Tip: Partitioned (Block) Matrices

A 4×5 matrix can be considered as a block matrix of submatrices, e.g.

$$X = \begin{pmatrix} A & B \\ C & D \end{pmatrix}.$$

When multiplying matrices (or a matrix and a vector),
partitioning them conformably and multiplying the corresponding partitions can be very convenient.

Consider the 4×5 matrix above, $X = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$. If you have a 5×4 matrix $Y$ partitioned similarly into $Y = \begin{pmatrix} E & F \\ G & H \end{pmatrix}$, then you can calculate the product as another block matrix:

$$XY = \begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} E & F \\ G & H \end{pmatrix} = \begin{pmatrix} AE + BG & AF + BH \\ CE + DG & CF + DH \end{pmatrix}$$

(Example from a linear algebra lecture at algebra.math.ust.hk; the exact URL is garbled in the source.)

### Memory

- Memory: relatively enduring neural alterations induced by an organism's interaction with the environment.
- Memory needs to be accessible by the nervous system in order to influence behavior.
- Activity patterns need to be stored through a learning process.
- Types of memory: short-term and long-term memory.

### Associative Memory

- $q$ pattern pairs $(x_k, y_k)$, $k = 1, 2, \ldots, q$.
- Input key vector: $x_k = (x_{k1}, x_{k2}, \ldots, x_{km})^T$.
- Output memorized vector: $y_k = (y_{k1}, y_{k2}, \ldots, y_{km})^T$.
- The weights can be represented as a weight matrix:
  $$y_k = W(k)\,x_k, \quad k = 1, 2, \ldots, q, \qquad \text{i.e.,} \qquad y_{ki} = \sum_{j=1}^{m} w_{ij}(k)\, x_{kj}, \quad i = 1, 2, \ldots, m.$$

### Associative Memory (cont'd)

Written out, $y_k = W(k)\,x_k$ is

$$\begin{pmatrix} y_{k1} \\ y_{k2} \\ \vdots \\ y_{km} \end{pmatrix} = \begin{pmatrix} w_{11}(k) & w_{12}(k) & \cdots & w_{1m}(k) \\ w_{21}(k) & w_{22}(k) & \cdots & w_{2m}(k) \\ \vdots & & \ddots & \vdots \\ w_{m1}(k) & w_{m2}(k) & \cdots & w_{mm}(k) \end{pmatrix} \begin{pmatrix} x_{k1} \\ x_{k2} \\ \vdots \\ x_{km} \end{pmatrix}$$

### Associative Memory (cont'd)

- With a single $W(k)$ we can represent only the one mapping $x_k \to y_k$. For all pairs $(x_k, y_k)$, $k = 1, 2, \ldots, q$, we need $q$ such weight matrices.
- One strategy is to combine them all into a single memory matrix $M$ by simple summation:
  $$M = \sum_{k=1}^{q} W(k)$$
- Will such a simple strategy work? That is, can $y_k = M x_k$ hold for all $k = 1, 2, \ldots, q$?

### Associative Memory Example: Storing Multiple Mappings

With a fixed set of key vectors $x_k$, an $m \times m$ matrix can store $m$ arbitrary output vectors $y_k$:

- Let $x_k = (0, \ldots, 0, 1, 0, \ldots, 0)^T$, where only the $k$-th element is 1 and all the rest are 0.
- Construct a memory matrix $M$ whose columns are the arbitrary output vectors: $M = (y_1, y_2, \ldots, y_m)$.
- Then $y_k = M x_k$ for all $k = 1, 2, \ldots, m$.
- But we want the $x_k$ to be arbitrary too.
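The block-multiplication identity above is easy to check numerically. The matrix sizes and random entries below are arbitrary.

```python
import numpy as np

# Verify that multiplying conformably partitioned matrices block-by-block
# agrees with the ordinary product: X = [[A,B],[C,D]], Y = [[E,F],[G,H]].

rng = np.random.default_rng(3)
A = rng.normal(size=(2, 3)); B = rng.normal(size=(2, 2))
C = rng.normal(size=(2, 3)); D = rng.normal(size=(2, 2))
E = rng.normal(size=(3, 3)); F = rng.normal(size=(3, 1))
G = rng.normal(size=(2, 3)); H = rng.normal(size=(2, 1))

X = np.block([[A, B], [C, D]])   # 4x5
Y = np.block([[E, F], [G, H]])   # 5x4

Z = np.block([[A @ E + B @ G, A @ F + B @ H],
              [C @ E + D @ G, C @ F + D @ H]])

print(np.allclose(X @ Y, Z))  # True
```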
### Correlation Matrix Memory

- With $q$ pairs $(x_k, y_k)$, we can construct a candidate memory matrix that stores all $q$ mappings as
  $$\hat{M} = \sum_{k=1}^{q} y_k x_k^T = y_1 x_1^T + y_2 x_2^T + \cdots + y_q x_q^T,$$
  where $y_k x_k^T$ is the outer product of the two vectors, which results in a matrix.
- A more convenient notation is
  $$\hat{M} = Y X^T, \qquad X = (x_1, x_2, \ldots, x_q), \quad Y = (y_1, y_2, \ldots, y_q).$$
  This can be verified easily using partitioned matrices.

### Correlation Matrix Memory: Recall

- Will $\hat{M} x_k$ give $y_k$? For convenience, write
  $$\hat{M} = \sum_{k=1}^{q} y_k x_k^T = \sum_{k=1}^{q} W(k).$$
- First consider $W(k) = y_k x_k^T$ only. Check whether $W(k)\,x_k = y_k$:
  $$W(k)\,x_k = (y_k x_k^T)\,x_k = y_k\,(x_k^T x_k) = c\,y_k,$$
  where $c = x_k^T x_k$ is a scalar value: the length of the vector $x_k$, squared. If all $x_k$ are normalized to have length 1, then $W(k)\,x_k = y_k$ will hold.

### Correlation Matrix Memory: Recall (cont'd)

- Now back to $\hat{M}$: under what condition will $\hat{M} x_j$ give $y_j$ for all $j$? Let us begin by assuming $x_j^T x_j = 1$ (key vectors are normalized).
- We can decompose $\hat{M} x_j$ as follows:
  $$\hat{M} x_j = \left( \sum_{k=1}^{q} y_k x_k^T \right) x_j = y_j x_j^T x_j + \sum_{k=1,\, k \neq j}^{q} y_k x_k^T x_j$$
- We know $y_j x_j^T x_j = y_j$, so it becomes
  $$\hat{M} x_j = y_j + \underbrace{\sum_{k=1,\, k \neq j}^{q} y_k x_k^T x_j}_{\text{noise term}}$$
- If all keys are orthogonal (perpendicular to each other), then for an arbitrary $k \neq j$, $x_k^T x_j = \cos(\theta_{kj}) \times 1 \times 1 = 0$, so the noise term becomes 0 and hence $\hat{M} x_j = y_j$. The one-hot example on the previous page is one such extreme case.

### Correlation Matrix Memory: Recall (cont'd)

- We can also ask how many items can be stored in $\hat{M}$, i.e., its capacity.
- The capacity is closely related to the rank of the matrix: the number of linearly independent column vectors (or row vectors) in the matrix. Linear independence means that a linear combination of the vectors can be zero only when the coefficients are all zero:
  $$c_1 x_1 + c_2 x_2 + \cdots + c_n x_n = 0 \quad \text{only when } c_i = 0 \text{ for all } i = 1, 2, \ldots, n.$$
- The above, and the examples on the previous pages, are best understood by running simple calculations in Octave or Matlab. See the src directory for example scripts.
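The recall analysis above can also be demonstrated in a few lines: with orthonormal keys the noise term vanishes and recall is exact. The dimensions and stored vectors below are arbitrary.

```python
import numpy as np

# Correlation matrix memory: M_hat = sum_k y_k x_k^T = Y X^T.
# With orthonormal key vectors x_k, recall M_hat x_j = y_j is exact,
# since X^T X = I makes the noise term zero.

rng = np.random.default_rng(4)
m, q = 5, 3
X, _ = np.linalg.qr(rng.normal(size=(m, q)))   # q orthonormal key columns
Y = rng.normal(size=(m, q))                    # arbitrary memorized vectors

M_hat = Y @ X.T                                # outer-product storage

recalled = M_hat @ X                           # recall all q keys at once
print(np.allclose(recalled, Y))  # True
```

With non-orthogonal keys, the same script would show the crosstalk (noise term) corrupting the recalled vectors.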
### Statistical Nature of Learning

- The deviation between a target function $f(x)$ and the neural network's realization of that function, $F(x, w)$, can be expressed in statistical terms (note that $F$ is parameterized by the weight vector $w$).
- Consider random input vectors $X \in \{x_i\}_{i=1}^{N}$ and random output scalar values $D \in \{d_i\}_{i=1}^{N}$, and suppose we have a training set $T = \{(x_i, d_i)\}$.
- The problem is that the target values $D$ in the training set may only be approximate: $D \approx f(X)$, i.e., $D \neq f(X)$.
- So we end up with a regressive model
  $$D = f(X) + \epsilon,$$
  where $f(\cdot)$ is deterministic and $\epsilon$ is a random *expectational error* representing our ignorance.

### Adaptation

- When the environment is stationary (its statistical characteristics do not change over time), supervised learning can be used to obtain a relatively stable set of parameters.
- If the environment is nonstationary, the parameters need to be adapted over time on an ongoing basis: continuous learning, or learning-on-the-fly.
- If the signal is locally stationary (pseudo-stationary), the parameters can be repeatedly retrained on a small window of samples, assuming those are stationary: continual training with time-ordered samples.

### Statistical Nature of Learning (cont'd)

- The error term $\epsilon$ is typically assumed to have zero mean: $E[\epsilon\,|\,x] = 0$, where $E[\cdot]$ is the expected value of a random variable.
- In this light, $f(x)$ can be expressed in statistical terms: $f(x) = E[D\,|\,x]$, since from $D = f(X) + \epsilon$ we get $E[D\,|\,x] = E[f(x) + \epsilon\,|\,x] = f(x) + E[\epsilon\,|\,x] = f(x)$.
- A property that can be derived from the above is that the expectational error term is independent of the regressive function: $E[\epsilon f(X)] = 0$. This will become useful in the following.

### Statistical Nature of Learning (cont'd)

- Neural network realization of the regressive model: $Y = F(X, w)$. We want to map the knowledge in the training data $T$ into the weights $w$.
- We can now define the cost function
  $$\mathcal{E}(w) = \frac{1}{2} \sum_{i=1}^{N} \big( d_i - F(x_i, w) \big)^2,$$
  which can be written equivalently as an average over the training set:
  $$\mathcal{E}(w) = \frac{1}{2}\, E_T\!\left[ \big( d_i - F(x_i, T) \big)^2 \right].$$

### Statistical Nature of Learning: Bias–Variance Dilemma

The cost term $E_T[(f(x) - F(x,T))^2]$ can be rewritten, knowing that $f(x) = E[D\,|\,x]$:

$$E_T\!\left[ \big( E[D|x] - F(x,T) \big)^2 \right] = \underbrace{\big( E_T[F(x,T)] - E[D|x] \big)^2}_{\text{bias}^2} + \underbrace{E_T\!\left[ \big( F(x,T) - E_T[F(x,T)] \big)^2 \right]}_{\text{variance}}$$

The step is obtained by adding and subtracting $E_T[F(x,T)]$ inside the square; the cross term vanishes using $E_T[E[D|x]^2] = E[D|x]^2$, $E_T\big[E_T[F(x,T)]^2\big] = E_T[F(x,T)]^2$, and $E_T\big[E[D|x]\,F(x,T)\big] = E[D|x]\,E_T[F(x,T)]$. Note that $E[c] = c$ and
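The decomposition above can be estimated by simulation: fix a test point, draw many training sets, fit a deliberately simple model, and measure the mean and spread of its predictions. The target function, noise level, model class, and test point are all invented for the illustration.

```python
import numpy as np

# Monte-Carlo estimate of bias^2 and variance at a fixed test point x0:
#   bias^2   = (E_T[F(x0,T)] - f(x0))^2
#   variance = E_T[(F(x0,T) - E_T[F(x0,T)])^2]
# A straight-line fit to a sinusoid is underparameterized, so it
# exhibits a clearly nonzero bias.

rng = np.random.default_rng(8)
f = lambda x: np.sin(2 * np.pi * x)          # "true" regressive function, f(x) = E[D|x]
sigma, n, x0, runs = 0.3, 20, 0.25, 2000

preds = np.empty(runs)
for r in range(runs):
    x = rng.uniform(0.0, 1.0, n)             # one training set T
    d = f(x) + rng.normal(0.0, sigma, n)     # targets D = f(X) + eps
    coef = np.polyfit(x, d, deg=1)           # simple (biased) linear model
    preds[r] = np.polyval(coef, x0)          # F(x0, T)

bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
print(bias2, variance)  # sizable bias^2, comparatively small variance
```

Raising the polynomial degree would lower the bias and raise the variance, illustrating the trade-off between approximation error and estimation error.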
$E[cX] = c\,E[X]$, for a constant $c$ and random variable $X$.

### Statistical Nature of Learning (cont'd)

With $d = f(x) + \epsilon$, we have $d - F(x,T) = \epsilon + \big(f(x) - F(x,T)\big)$. So $\mathcal{E}(w) = \frac{1}{2} E_T[(d - F(x,T))^2]$ becomes

$$\frac{1}{2} E_T\!\left[ \big( \epsilon + (f(x) - F(x,T)) \big)^2 \right] = \underbrace{\frac{1}{2} E_T[\epsilon^2]}_{\text{intrinsic error}} + \underbrace{E_T\!\left[ \epsilon\,\big( f(x) - F(x,T) \big) \right]}_{\text{reduces to } 0} + \underbrace{\frac{1}{2} E_T\!\left[ \big( f(x) - F(x,T) \big)^2 \right]}_{\text{the term we are interested in}}$$

### Bias–Variance Dilemma (cont'd)

- The bias indicates how much $F(x, T)$ differs (on average) from the true function $f(x)$: the approximation error.
- The variance indicates the variance of $F(x, T)$ over the entire training set $T$: the estimation error.
- Typically, achieving a smaller bias leads to higher variance, and a smaller variance leads to higher bias.

### Statistical Learning Theory

- Statistical learning theory addresses the fundamental issue of how to control the generalization ability of a neural network in mathematical terms.
- Certain quantities, such as the sample size and the Vapnik–Chervonenkis (VC) dimension, are closely related to the bounds on generalization error.
- The probably approximately correct (PAC) learning model is another framework for studying such bounds. Here the confidence $\delta$ ("probably") and the tolerable error level $\epsilon$ ("approximately correct") are the important quantities. Given these and other measures such as the VC dimension, we can calculate the sample complexity: how many samples are needed to achieve that level of correctness ($\epsilon$) with that much confidence ($\delta$).

### Appendix: VC Dimension

- The concept of shattering
- The VC dimension

### Shattering a Set of Instances

- Definition: a *dichotomy* of a set $S$ is a partition of $S$ into two disjoint subsets.
- Definition: a set of instances $S$ is *shattered* by a function class $\mathcal{F}$ if and only if for every dichotomy of $S$ there exists some function in $\mathcal{F}$ consistent with this dichotomy.

### Three Instances Shattered

(Figure: an instance space $X$ with three instances; each closed contour indicates one dichotomy.) What kind of classifier function can shatter these instances?
### The Vapnik–Chervonenkis Dimension

- Definition: the Vapnik–Chervonenkis dimension $VC(\mathcal{F})$ of a function class $\mathcal{F}$ defined over sample space $X$ is the size of the largest finite subset of $X$ shattered by $\mathcal{F}$.
- If arbitrarily large finite sets of $X$ can be shattered by $\mathcal{F}$, then $VC(\mathcal{F}) \equiv \infty$. Note that $|\mathcal{F}|$ can be infinite while $VC(\mathcal{F})$ is finite.

### VC Dimension of Linear Decision Surfaces

- When $\mathcal{F}$ is the set of lines and $S$ a set of points in the plane, $VC = 3$.
- Some three-point sets (e.g., collinear points) cannot be shattered, but that is fine: it suffices that at least one subset of size 3 can be shattered.
- No set of size 4 can be shattered, for any arrangement of the points: think of an XOR-like situation.

### Uses of the VC Dimension

- The training error decreases monotonically as the VC dimension is increased.
- The confidence interval increases monotonically as the VC dimension is increased.
- The sample complexity in the PAC framework increases as the VC dimension increases.

## Slide 04: Multi-Layer Perceptrons (Haykin Chapter 4)

CPSC 636-600, Instructor: Yoonsuck Choe, Spring 2008 (some figures after Mitchell, 1997, McGraw-Hill)

### Introduction

- Networks typically consist of input, hidden, and output layers; such networks are commonly referred to as multilayer perceptrons (MLPs).
- The popular learning algorithm is the error backpropagation algorithm ("backpropagation," or "backprop" for short), a generalization of the LMS rule:
  - Forward pass: activate the network, layer by layer.
  - Backward pass: the error signal backpropagates from output to hidden and from hidden to input, based on which the weights are updated.

### Multilayer Perceptrons: Characteristics

- Each model neuron has a nonlinear activation function, typically the logistic function $y_j = \frac{1}{1 + \exp(-v_j)}$.
- The network contains one or more hidden layers (layers that are neither input nor output layers).
- The network exhibits a high degree of connectivity.

### Multilayer Networks

- Differentiable threshold unit: the sigmoid $\sigma(v) = \frac{1}{1 + \exp(-v)}$, with the interesting property
  $$\frac{d\sigma(v)}{dv} = \sigma(v)\,\big(1 - \sigma(v)\big).$$
- Output: $y = \sigma(x^T w)$.
- Other functions: $\tanh(v) = \frac{2}{1 + \exp(-2v)} - 1$.

### Multilayer Networks and Backpropagation

Example topologies: (a) one output unit; (b) two hidden units and one output unit. Another example: a network for XOR.

### Error Gradient for a Single Sigmoid Unit
For $n$ input–output pairs $(x_k, d_k)$:

$$
\begin{aligned}
\frac{\partial E}{\partial w_i}
&= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_k (d_k - y_k)^2
 = \frac{1}{2} \sum_k \frac{\partial}{\partial w_i} (d_k - y_k)^2 \\
&= \frac{1}{2} \sum_k 2\,(d_k - y_k)\, \frac{\partial}{\partial w_i} (d_k - y_k)
 = \sum_k (d_k - y_k) \left( -\frac{\partial y_k}{\partial w_i} \right) \\
&= -\sum_k (d_k - y_k)\, \frac{\partial y_k}{\partial v_k} \frac{\partial v_k}{\partial w_i} \qquad \text{(chain rule)}
\end{aligned}
$$

### Error Gradient for a Sigmoid Unit (cont'd)

From the previous page:
$$\frac{\partial E}{\partial w_i} = -\sum_k (d_k - y_k)\, \frac{\partial y_k}{\partial v_k} \frac{\partial v_k}{\partial w_i}$$

But we know
$$\frac{\partial y_k}{\partial v_k} = \frac{\partial \sigma(v_k)}{\partial v_k} = y_k (1 - y_k)
\qquad \text{and} \qquad
\frac{\partial v_k}{\partial w_i} = \frac{\partial (w^T x_k)}{\partial w_i} = x_{k,i}.$$

So:
$$\frac{\partial E}{\partial w_i} = -\sum_k (d_k - y_k)\, y_k (1 - y_k)\, x_{k,i}$$

### Backpropagation Algorithm

Initialize all weights to small random numbers. Until satisfied, do:

- For each training example, do:
  1. Input the training example to the network and compute the network outputs.
  2. For each output unit $j$: $\delta_j \leftarrow y_j (1 - y_j)(d_j - y_j)$.
  3. For each hidden unit $h$: $\delta_h \leftarrow y_h (1 - y_h) \sum_{j \in \text{outputs}} w_{jh}\, \delta_j$.
  4. Update each network weight: $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = \eta\, \delta_j\, x_i$. (Note: $w_{ji}$ is the weight from $i$ to $j$, i.e., $w_{j \leftarrow i}$.)

### The δ Term

- For an output unit: $\delta_j = y_j (1 - y_j)(d_j - y_j)$.
- For a hidden unit: $\delta_h = y_h (1 - y_h) \sum_{j \in \text{outputs}} w_{jh}\, \delta_j$ (backpropagated error).
- In sum, $\delta$ is the derivative times the error; the formula differs for output and hidden units. The derivation is presented next.

### Derivation of Δw

- We want to update each weight by $\Delta w_{ji} = -\eta\, \frac{\partial E}{\partial w_{ji}}$, where the error is defined as
  $$E(\vec{w}) = \frac{1}{2} \sum_{j \in \text{outputs}} (d_j - y_j)^2.$$
- Given $v_j = \sum_i w_{ji} x_i$:
  $$\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial v_j} \frac{\partial v_j}{\partial w_{ji}}.$$

### Derivation of Δw: Output Unit Weights

$$\frac{\partial E}{\partial v_j} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial v_j}$$

First calculate $\frac{\partial E}{\partial y_j}$:
$$\frac{\partial E}{\partial y_j} = \frac{\partial}{\partial y_j} \frac{1}{2} \sum_{k \in \text{outputs}} (d_k - y_k)^2 = \frac{\partial}{\partial y_j} \frac{1}{2} (d_j - y_j)^2 = -(d_j - y_j).$$

Next calculate $\frac{\partial y_j}{\partial v_j}$: since $y_j = \sigma(v_j)$,
$$\frac{\partial y_j}{\partial v_j} = y_j (1 - y_j).$$

Putting everything together:
$$\frac{\partial E}{\partial v_j} = -(d_j - y_j)\, y_j (1 - y_j),$$
and since $\frac{\partial v_j}{\partial w_{ji}} = x_i$,
$$\Delta w_{ji} = -\eta\, \frac{\partial E}{\partial w_{ji}} = \eta\, (d_j - y_j)\, y_j (1 - y_j)\, x_i = \eta\, \delta_j\, x_i.$$
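The algorithm above can be coded directly for a one-hidden-layer network of sigmoid units. The XOR task, network size, learning rate, epoch count, and the explicit bias terms (treated as weights from a constant input) are choices made for this demo, not taken from the notes.

```python
import numpy as np

# Minimal online backprop for a 2-4-1 sigmoid network on XOR:
#   output delta: d_o = y(1-y)(d-y)
#   hidden delta: d_h = h(1-h) * w_out * d_o
#   update:       w += eta * delta * input

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(5)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)   # XOR inputs
D = np.array([0.0, 1.0, 1.0, 0.0])                      # XOR targets
W1 = rng.normal(0, 0.5, (4, 2)); b1 = np.zeros(4)       # hidden layer
W2 = rng.normal(0, 0.5, 4);      b2 = 0.0               # output unit
eta = 0.5

def loss():
    H = sigmoid(X @ W1.T + b1)
    return 0.5 * np.sum((D - sigmoid(H @ W2 + b2)) ** 2)

loss0 = loss()
for _ in range(5000):
    for x, d in zip(X, D):                    # sequential (online) mode
        h = sigmoid(W1 @ x + b1)              # forward pass
        y = sigmoid(W2 @ h + b2)
        delta_o = y * (1 - y) * (d - y)       # output delta
        delta_h = h * (1 - h) * W2 * delta_o  # backpropagated hidden deltas
        W2 += eta * delta_o * h; b2 += eta * delta_o
        W1 += eta * np.outer(delta_h, x); b1 += eta * delta_h

loss_after = loss()
print(loss0, loss_after)  # training error decreases
```

Backprop only guarantees descent toward a local minimum, so the reliable claim here is that the training error drops, not that XOR is always solved from every initialization.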
### Derivation of Δw: Hidden Unit Weights

Start with
$$\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial v_j} \frac{\partial v_j}{\partial w_{ji}}.$$

The error depends on $v_j$ only through the units downstream of $j$:
$$
\frac{\partial E}{\partial v_j}
= \sum_{k \in \text{Downstream}(j)} \frac{\partial E}{\partial v_k} \frac{\partial v_k}{\partial v_j}
= \sum_{k \in \text{Downstream}(j)} -\delta_k\, \frac{\partial v_k}{\partial v_j}
= \sum_{k \in \text{Downstream}(j)} -\delta_k\, \frac{\partial v_k}{\partial y_j} \frac{\partial y_j}{\partial v_j}
= \sum_{k \in \text{Downstream}(j)} -\delta_k\, w_{kj}\, y_j (1 - y_j)
$$

### Derivation of Δw: Hidden Unit Weights (cont'd)

Finally, given
$$\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial v_j} \frac{\partial v_j}{\partial w_{ji}},
\qquad
\frac{\partial E}{\partial v_j} = -\,y_j (1 - y_j) \sum_{k \in \text{Downstream}(j)} \delta_k w_{kj},
\qquad
\frac{\partial v_j}{\partial w_{ji}} = x_{ji},$$

we get
$$\Delta w_{ji} = -\eta\, \frac{\partial E}{\partial w_{ji}} = \eta\, \underbrace{y_j (1 - y_j) \sum_{k \in \text{Downstream}(j)} \delta_k w_{kj}}_{\delta_j}\; x_{ji}.$$

### Summary

(Figure 4.5 in Haykin: signal-flow graph of a part of the adjoint system pertaining to the backpropagation of error signals.)

weight correction = (learning-rate parameter $\eta$) × (local gradient $\delta_j$) × (input signal $y_i$)

### Extension to Different Network Topologies

- Arbitrary number of layers: for neurons in layer $m$,
  $$\delta_r = y_r (1 - y_r) \sum_{s \in \text{layer } m+1} w_{sr}\, \delta_s.$$
- Arbitrary acyclic graph:
  $$\delta_r = y_r (1 - y_r) \sum_{s \in \text{Downstream}(r)} w_{sr}\, \delta_s.$$

### Backpropagation Properties

- Gradient descent over the entire network-weight vector.
- Easily generalized to arbitrary directed acyclic graphs.
- Finds a local, not necessarily global, error minimum. In practice it often works well; one can run it multiple times with different initial weights.
- Minimizes error over the training examples; will it generalize well to subsequent examples?
- Training can take thousands of iterations, so it is slow; using the network after training is very fast.

### Learning Rate and Momentum

- Trade-offs regarding the learning rate:
  - A smaller learning rate gives a smoother trajectory but slower convergence.
  - A larger learning rate gives fast convergence but can become unstable.
- Momentum can help overcome these issues:
  $$\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n) + \alpha\, \Delta w_{ji}(n-1)$$
- The update rule can be rewritten as
  $$\Delta w_{ji}(n) = \eta \sum_{t=0}^{n} \alpha^{n-t}\, \delta_j(t)\, y_i(t).$$

### Momentum (cont'd)

$$\Delta w_{ji}(n) = -\eta \sum_{t=0}^{n} \alpha^{n-t}\, \frac{\partial E(t)}{\partial w_{ji}(t)}$$

- The weight update is the sum of an exponentially weighted time series.
- Behavior:
  - When successive gradients take the same sign, the weight update is accelerated (speed-up downhill).
  - When successive gradients have different signs, the weight update is damped (oscillations are stabilized).

### Sequential (Online) vs. Batch Training

- Sequential mode: the update rule is applied after each input–target presentation.
  - The order of presentation should be randomized.
  - Benefits: less storage; a stochastic search through weight space helps avoid local minima.
  - Disadvantage: it is hard to establish theoretical convergence conditions.
- Batch mode: the update rule is applied after all input–target pairs are seen.
  - Benefits: an accurate estimate of the gradient; convergence to a local minimum is guaranteed under simpler conditions.

### Representational Power of Feedforward Networks

- Boolean functions: every boolean function is representable with two layers, though the hidden-unit count can grow exponentially in the worst case (one hidden unit per input example, then OR them together).
- Continuous functions: every bounded continuous function can be approximated with arbitrarily small error (with linear output units).
- Arbitrary functions: representable with three layers (with linear output units).

### Learning Hidden Layer Representations

An identity (autoassociation) task over 8-bit one-hot patterns:

| Input | Output |
|-------|--------|
| 10000000 | 10000000 |
| 01000000 | 01000000 |
| 00100000 | 00100000 |
| 00010000 | 00010000 |
| 00001000 | 00001000 |
| 00000100 | 00000100 |
| 00000010 | 00000010 |
| 00000001 | 00000001 |

### Learned Hidden Layer Representations

| Input | Hidden values | Output |
|-------|---------------|--------|
| 10000000 | .89 .04 .08 | 10000000 |
| 01000000 | .01 .11 .88 | 01000000 |
| 00100000 | .01 .97 .27 | 00100000 |
| 00010000 | .99 .97 .71 | 00010000 |
| 00001000 | .03 .05 .02 | 00001000 |
| 00000100 | .22 .99 .99 | 00000100 |
| 00000010 | .80 .01 .98 | 00000010 |
| 00000001 | .60 .94 .01 | 00000001 |

- The learned encoding is similar to the standard 3-bit binary code.
- Automatic discovery of useful hidden layer representations is a key feature of ANNs.
- Note: the hidden layer representation is compressed.

### Recurrent Networks

- Sequence recognition.
- Storing tree structure (next slide).
- Can be trained with plain backpropagation.
- Generalization may not be perfect.

### Overfitting

(Figure: training-set and validation-set error vs. number of weight updates, for two example tasks.)
- Error curves for two different robot perception tasks, showing training-set and validation-set error.
- Early stopping ensures good performance on unobserved samples, but one must be careful.
- Weight decay, use of validation sets, use of k-fold cross-validation, etc., help overcome the problem.

### Recurrent Networks (cont'd)

- Autoassociation (input = output).
- Represent a stack using the hidden layer representation.
- Accuracy depends on numerical precision.

### Some Applications: NETtalk

- NETtalk (Sejnowski and Rosenberg, 1987): learn to pronounce English text; the output units represent a phoneme code.
- A demo exists, and the data are available in the UCI ML repository.

### More Applications: Data Compression

Construct an autoassociative memory (input = output) with a small hidden layer:

- Encode using the input-to-hidden weights.
- Send or store the hidden layer activation.
- Decode the received or stored hidden layer activation with the hidden-to-output weights.

### NETtalk Data

Columns: word, pronunciation, stress/syllable; about 20,000 words in total. The first few entries:

| Word | Pronunciation | Stress/Syllable |
|------|---------------|-----------------|
| aardvark | aardvark | 1<<<>2<<0 |
| aback | Xbk | 0>1<<0 |
| abacus | bxkxs | 1<0>0<0 |
| abaft | Xbft | 0>1<<0 |
| abalone | bxloni | 2<0>1>0 0 |
| abandon | Xbndxn | 0>1<>0<0 |
| abase | Xbe3 | 70>1<<0 |
| abash | XbS | 0>1<<0 |
| abate | Xbet | 70>1<<0 |
| abatis | bxti | 1<0>2<2 0 |

### Backpropagation Exercise

- Tarball: `backprop-1.6.tar.gz` (the full download URL is garbled in the source).
- Untar and read the README file: `gzip -dc backprop-1.6.tar.gz | tar xvf -`
- Run `make` to build on the departmental unix machines.
- Run `bp conf/xor.conf`, etc.

### Backpropagation Example Results

- Error plotted over 10,000 epochs (an epoch is one full cycle of training through all training input patterns).
- OR was the easiest to learn, AND the next, and XOR the most difficult.
- The network had 2 input, 2 hidden, and 1 output unit; the learning rate was 0.001.

### Backpropagation: Things to Try

- How does increasing the number of hidden layer units affect (1) the time and (2) the number of epochs of training?
- How does increasing or decreasing the learning rate affect the rate of convergence?
- How does changing the slope of the sigmoid affect the rate of convergence?
- Try different problem domains (handwriting recognition, etc.).
### Backpropagation Example Results (cont'd)

(Figure: backprop error curves for AND, OR, and XOR over 10,000 epochs; the network outputs for inputs 00, 01, 10, and 11 form each row.)

### MLP as a General Function Approximator

- An MLP can be seen as performing a nonlinear input–output mapping.
- Universal approximation theorem: Let $\varphi(\cdot)$ be a nonconstant, bounded, monotone-increasing continuous function. Let $I_{m_0}$ denote the $m_0$-dimensional unit hypercube $[0,1]^{m_0}$, and let $C(I_{m_0})$ denote the space of continuous functions on $I_{m_0}$. Then, given any function $f \in C(I_{m_0})$ and $\epsilon > 0$, there exist an integer $m_1$ and sets of real constants $\alpha_i$, $b_i$, and $w_{ij}$, where $i = 1, \ldots, m_1$ and $j = 1, \ldots, m_0$, such that we may define
  $$F(x_1, \ldots, x_{m_0}) = \sum_{i=1}^{m_1} \alpha_i\, \varphi\!\left( \sum_{j=1}^{m_0} w_{ij}\, x_j + b_i \right)$$
  as an approximate realization of the function $f(\cdot)$; that is,
  $$\big| F(x_1, \ldots, x_{m_0}) - f(x_1, \ldots, x_{m_0}) \big| < \epsilon$$
  for all $x_1, \ldots, x_{m_0}$ that lie in the input space.

### MLP as a General Function Approximator (cont'd)

- The universal approximation theorem is an existence theorem; it merely generalizes approximations by finite Fourier series.
- It is directly applicable to neural networks (MLPs), and it implies that one hidden layer is sufficient. The theorem does not say that a single hidden layer is optimal in terms of learning time, generalization, etc.

### Generalization and Training Set Size

- Generalization is influenced by three factors: the size of the training set (and how representative it is), the architecture of the network, and the physical complexity of the problem.
- Sample complexity and the VC dimension are related. In practice, $N = O(W/\epsilon)$, where $W$ is the total number of free parameters and $\epsilon$ is the error tolerance.

### Generalization

- A network is said to generalize well when the input–output mapping computed by the network is correct, or nearly so, for test data never used during training.
- This view is apt when we take the curve-fitting view of learning.
- Issues: overfitting (overtraining) due to memorization. Smoothness of the mapping is desired, and this is related to criteria like Occam's razor.
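As a numerical illustration of the universal approximation theorem above, one can fix random hidden-layer parameters and solve only for the output coefficients $\alpha_i$ by least squares. The target function, unit count, and weight scales are invented, and least squares stands in for learning, since the theorem is about representation, not training.

```python
import numpy as np

# One hidden layer of tanh units in the theorem's form
#   F(x) = sum_i alpha_i * tanh(w_i * x + b_i)
# approximating a 1-D continuous function on [0, 1].

rng = np.random.default_rng(7)
m1 = 200                                   # number of hidden units
w = rng.normal(0, 10, m1)                  # fixed random hidden weights
b = rng.normal(0, 10, m1)                  # fixed random hidden biases

x = np.linspace(0.0, 1.0, 400)
f = np.sin(2 * np.pi * x)                  # target continuous function

Phi = np.tanh(np.outer(x, w) + b)          # hidden-layer outputs, 400 x m1
alpha, *_ = np.linalg.lstsq(Phi, f, rcond=None)   # fit output coefficients

err = np.max(np.abs(Phi @ alpha - f))      # sup-norm approximation error
print(err)  # small
```

Shrinking the error bound further is a matter of adding hidden units, which is exactly the trade-off the theorem leaves open.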
o 1-D: 4 inputs; 2-D: 16 inputs; 3-D: 64 inputs.
o As the dimensionality of the input grows, exponentially more inputs are needed to maintain the same density in unit space.
o In other words, the sampling density of N inputs in m-dimensional space is proportional to N^{1/m}.
o One way to overcome this is to use prior knowledge about the function.

Cross-Validation

[Figure: trials 1-4 of partitioning the data into training and validation subsets; validation error vs. number of epochs, with the early-stopping point marked.]

o Use of a validation set: not used during training, used for measuring generalizability.
o Model selection.
o Early stopping.
o Holdout method, multiple cross-validation, leave-one-out method, etc.

Heuristic for Accelerating Convergence

Learning-rate adaptation:
o Separate learning rate for each tunable weight.
o Each learning rate is allowed to adjust after each iteration.
o If the derivative of the cost function has the same sign for several iterations, increase the learning rate.
o If the derivative of the cost function alternates sign over several iterations, decrease the learning rate.

Virtues and Limitations of Backprop

o Connectionism: biological metaphor, local computation, graceful degradation, parallelism. Some limitations exist regarding the biological plausibility of backprop.
o Feature detection: hidden neurons perform feature detection.
o Function approximation: a form of nested sigmoids.
o Computational complexity: computation is polynomial in the number of adjustable parameters, thus it can be said to be efficient.
o Sensitivity analysis: the sensitivity S^F_ω = (∂F/F) / (∂ω/ω) can be estimated efficiently.
o Robustness: small disturbances can only cause small estimation errors.
o Convergence: stochastic approximation, and it can be slow.
o Local minima and scaling issues.

Summary

o Backprop for MLP is local and efficient in calculating the partial derivatives.
o Backprop can handle nonlinear mappings.

Slide04 (supplemental): Haykin Chapter 4, Multi-Layer Perceptrons
CPSC 636-600, Instructor: Yoonsuck Choe, Spring 2008

Heuristic for Backprop (cont'd)

[Figure: input preprocessing steps: mean removal, decorrelation, and covariance equalization.]
o Activation function: usually, learning is faster with antisymmetric activation functions. A popular example is φ(v) = a tanh(bv); note φ'(v) = ab(1 − tanh²(bv)).
o Target values: target values should be within the range of the sigmoid activation function.
o Normalizing the inputs: preprocessing to make certain statistical properties hold over the entire training set is important.

Heuristics for Making Backprop Perform Better

o Sequential vs. batch update: for large and highly redundant data sets, sequential update works well.
o Maximization of information content: every training sample should be chosen to maximize information content (we want to search more of the weight space).
  - Use examples that result in the largest error.
  - Use examples that are radically different from those previously used.
  - Randomize (shuffle) the input presentation order.
  - Use an emphasizing scheme: present more difficult inputs. Issues: the input distribution is distorted, and outliers or mislabeled inputs can cause serious problems.

Heuristic for Backprop (cont'd)

o Initialization: small initial weight values are usually good, since they prevent saturation of activity, but it is not good either to have weights that are too small. One suggestion is to initialize the weights to random values from a uniform distribution with zero mean and standard deviation σ_w = m^{−1/2}, where m is the number of connections for that neuron.
o Learning from hints: include prior information about the function to be learned (e.g., invariance properties, symmetries, etc.).
o Learning rates: the hidden layer should have a higher learning rate than the output layer (the output layer tends to have larger local gradients); neurons with many inputs should have smaller learning rates.

Weight Initialization

o Given an input distribution (zero mean, unit variance), what is the distribution of the induced local field (the weighted sum that gets plugged into the sigmoid)?
o We want this value v to be within the ramp region of the sigmoid.
o For this, what should the weight distribution look like?
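A quick Monte Carlo check of the σ_w = m^{−1/2} heuristic above (a sketch; the fan-in m = 400 and the trial count are arbitrary): with zero-mean, unit-variance, uncorrelated inputs and weights drawn with standard deviation m^{−1/2}, the induced local field v = w·y has variance close to 1, i.e., it tends to fall in the ramp region of the sigmoid.

```python
import numpy as np

rng = np.random.default_rng(1)

m = 400                                   # fan-in of one neuron
trials = 20000
y = rng.normal(0.0, 1.0, (trials, m))     # zero-mean, unit-variance inputs
w = rng.normal(0.0, m ** -0.5, m)         # weights with sigma_w = m**-0.5
v = y @ w                                 # induced local fields, one per trial

print(v.mean())  # close to 0
print(v.var())   # close to 1, matching sigma_v^2 = m * sigma_w^2
```

Scaling σ_w any larger (say, σ_w = 1) would give σ_v² = m, pushing most local fields deep into the saturated tails of the sigmoid.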
Supervised Learning as an Optimization Problem

o Supervised training of an MLP can be viewed as a problem of numerical optimization of the cost E_av(w), averaged over all input samples.
o Taking the Taylor series expansion:

  E_av(w(n) + Δw(n)) = E_av(w(n)) + g^T(n) Δw(n) + (1/2) Δw^T(n) H(n) Δw(n) + (higher-order terms),

  where g(n) = ∂E_av/∂w, evaluated at w = w(n), is the gradient, and H(n) = ∂²E_av/∂w², evaluated at w = w(n), is the Hessian.
o First-order method (gradient descent, linear approximation): Δw(n) = −η g(n).
o Second-order method (Newton's method, quadratic approximation): Δw(n) = −H^{−1}(n) g(n).

Weight Initialization (cont'd)

o Inputs: μ_y = E[y_i] = 0 and σ_y² = E[(y_i − μ_y)²] = E[y_i²] = 1. Also assume E[y_k y_i] = 1 if k = i and 0 if k ≠ i (inputs are uncorrelated).
o Weights: μ_w = E[w_ji] = 0 and σ_w² = E[(w_ji − μ_w)²] = E[w_ji²].
o Induced local field v_j = Σ_{i=1}^m w_ji y_i:

  μ_v = E[v_j] = Σ_{i=1}^m E[w_ji] E[y_i] = 0

  σ_v² = E[(v_j − μ_v)²] = E[v_j²] = E[ Σ_i Σ_k w_ji w_jk y_i y_k ] = Σ_i Σ_k E[w_ji w_jk] E[y_i y_k] = Σ_{i=1}^m E[w_ji²] = m σ_w².

o We want σ_v = m^{1/2} σ_w to fit within the ramp of the sigmoid; setting σ_v = 1 gives σ_w = m^{−1/2}.

Conjugate Gradient Method: Introduction

o Newton's method, using the inverse of the Hessian H^{−1}, has limitations:
  - Computation of the Hessian can be expensive.
  - The Hessian needs to be nonsingular for the inverse to exist, which is rarely the case.
  - When the cost function is nonquadratic, convergence is not guaranteed.
o The conjugate gradient method overcomes the above problems while accelerating convergence: it is a second-order method.

Quadratic Function Minimization

o Consider minimizing the quadratic function (A symmetric positive definite)

  f(x) = (1/2) x^T A x − b^T x + c,

  which is minimized for x* = A^{−1} b. So the minimization problem turns into solving a system of linear equations A x* = b.
o Conjugate vectors: given a matrix A, nonzero vectors s(0), s(1), ..., s(W − 1) are A-conjugate if s^T(n) A s(j) = 0 for all n ≠ j. When A = I the meaning is clear: the vectors are orthogonal.

Conjugate Vectors' Properties: Linear Independence

o The conjugate vectors are linearly independent, so they can form a basis to span the vector space.
o Proof: assume they are linearly dependent, so that

  s(0) = Σ_{j=1}^{W−1} α_j s(j).

  Multiplying both sides by s^T(0) A, we get

  s^T(0) A s(0) = Σ_{j=1}^{W−1} α_j s^T(0) A s(j) = 0,

  since the s(j) are conjugate vectors. However, A is positive definite and the vectors are nonzero, so s^T(0) A s(0) cannot be 0: a contradiction. Thus the
conjugate vectors are linearly independent.

Conjugate Vectors' Properties

o The square root of the matrix A is defined by A = A^{1/2} A^{1/2} (existence and uniqueness are guaranteed for positive semidefinite matrices).
o Since A is symmetric positive definite, A^T = A = (A^{1/2} A^{1/2})^T = (A^{1/2})^T (A^{1/2})^T, and A^{1/2} can be taken symmetric, so (A^{1/2})^T = A^{1/2}.
o Given any conjugate vectors x_i and x_j:

  x_i^T A x_j = x_i^T A^{1/2} A^{1/2} x_j = (A^{1/2} x_i)^T (A^{1/2} x_j) = 0.

o So transforming any nonzero vector x_i into v_i = A^{1/2} x_i results in vectors v_i that are mutually orthogonal.

Conjugate Direction Method

o For a quadratic error function f(x):

  x(n + 1) = x(n) + η(n) s(n), n = 0, 1, ..., W − 1,

  where x(0) is an arbitrary initial vector and η(n) is defined by

  f(x(n) + η(n) s(n)) = min_η f(x(n) + η s(n)).

  The step where η(n) is picked is called a line search.
o So, in sum, both the direction and the distance of each step are determined.

Conjugate Direction Method (cont'd): Observations

o Since the conjugate vectors are linearly independent, they form a basis that spans the vector space of x.
o The line search results in the following formula for η:

  η(n) = − s^T(n) A e(n) / (s^T(n) A s(n)),

  where e(n) = x(n) − x* is the error vector. x* is unknown, but note that A e(n) = A x(n) − A x* = A x(n) − b, which is computable (since A x* − b = 0).
o Starting from an arbitrary x(0), the method reaches the minimum in at most W iterations.

Conjugate Direction Method (cont'd)

o For each iteration n, x(n + 1) minimizes f(x) over a linear vector space D_n that passes through an arbitrary point x(0) and is spanned by the A-conjugate vectors:

  x(n + 1) = arg min_{x ∈ D_n} f(x), where D_n = { x : x = x(0) + Σ_{j=0}^{n} η(j) s(j) }.

o The whole scheme depends on the existence of the conjugate vectors s(n). Solution: the conjugate gradient method.

Conjugate Gradient Method

o Determine successive conjugate vectors in a sequential manner, at successive steps.
o Residual: r(n) = b − A x(n) (the negative gradient of f at x(n)).
o Recursive step: find the next direction s(n) as

  s(n) = r(n) + β(n) s(n − 1), n = 1, 2, ..., W − 1,

  where the scaling factor β(n) is determined as

  β(n) = − s^T(n − 1) A r(n) / (s^T(n − 1) A s(n − 1)).

  This results in conjugate vectors s(n).
o Problem: we need to know A, with respect to which the vectors are conjugate.

Conjugate Gradient Method: Alternative β(n) Forms

We can evaluate the scaling
factor β(n) without explicit knowledge of the matrix A, based only on the residuals.

o Polak-Ribière formula:

  β(n) = r^T(n) (r(n) − r(n − 1)) / (r^T(n − 1) r(n − 1)).

o Fletcher-Reeves formula:

  β(n) = r^T(n) r(n) / (r^T(n − 1) r(n − 1)).

o The Polak-Ribière form is superior for nonquadratic cost functions, and is guaranteed to converge if β = max{β_PR, 0}.

Conjugate Gradient Method: Estimating η(n)

o A line-search procedure estimates the proper η(n):
  - Bracketing phase: find a nontrivial interval that is known to contain the minimum.
  - Sectioning phase: divide the bracket into sections, leading to successively narrower brackets.
o These two steps can be achieved by inverse parabolic approximation.
o Repeated application of the two phases results in a good estimate of the point of minimum.

Conjugate Gradient Method vs. Gradient Descent

o Gradient descent can oscillate badly.
o The conjugate gradient method can bring you directly to the minimum in one step (plus the initial step) if the cost function is quadratic.

[Figure: trajectories on a 2-D quadratic cost: gradient descent zig-zags; conjugate gradient goes directly to the minimum.]

Application of CGM to Neural Network Learning

o Approximate the cost function E_av(w) by a quadratic function (ignore higher-order terms).
o Formulate the computation of the coefficients β(n) and η(n) so as to only require gradient information (avoid calculating the Hessian).

  Quadratic function f(x)   <->  cost function E_av(w)
  Parameter vector x(n)     <->  weight vector w(n)
  Gradient vector ∂f/∂x     <->  gradient vector g = ∂E_av/∂w
  Matrix A                  <->  Hessian matrix H

Slide05: Haykin Chapter 5, Radial-Basis Function Networks
CPSC 636-600, Instructor: Yoonsuck Choe, Spring 2008

Radial-Basis Function Networks

Three layers:
o Input.
o Hidden: nonlinear transformation from input space to hidden space.
o Output: linear activation.

Principal motivation: Cover's theorem. A pattern-classification problem cast in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

Learning in MLP

o Supervised learning in multilayer perceptrons: a recursive technique of stochastic approximation (e.g., backprop).
o Design of the network as a curve-fitting (approximation) problem (e.g., RBF).
o Curve fitting: finding a surface
in a multidimensional space that provides a best fit to the training data, with "best fit" measured in a certain statistical sense. RBF is an example: the hidden neurons form an arbitrary basis for the input patterns when they are expanded into the hidden space. These basis functions are called radial-basis functions.

Cover's Theorem

o Cover's theorem on the separability of patterns: a complex pattern-classification problem cast in a high-dimensional space nonlinearly is more likely to be linearly separable than in a low-dimensional space.
o Basic idea: nonlinearly map points in the input space to a hidden space that has a higher dimension than the input space. Once the proper mapping is done, simple, quick algorithms can be used to find the separating hyperplane.

φ-Separability of Patterns

o N input patterns X = {x_1, x_2, ..., x_N} in m0-dimensional space. The inputs belong to either of two sets X1 and X2: they form a dichotomy.
o The dichotomy is separable with respect to a family of surfaces if a surface exists in the family that separates the points in class X1 from those in X2.
o For each x ∈ X, define an m1-vector of hidden functions φ_i(x), i = 1, 2, ..., m1:

  φ(x) = [φ_1(x), φ_2(x), ..., φ_{m1}(x)]^T,

  which maps inputs in m0-D space to the hidden space of m1-D. The φ_i are called the hidden functions, and the space spanned by these functions is called the hidden space or feature space.
o A dichotomy is φ-separable if there exists an m1-D vector w such that

  w^T φ(x) > 0, x ∈ X1
  w^T φ(x) < 0, x ∈ X2,

  with separating hyperplane w^T φ(x) = 0.

Cover's Theorem: Interpretation

o Separability depends on (1) the particular dichotomy and (2) the distribution of patterns in the input space.
o The derived P(N, m1) states that the probability of being separable is equivalent to the cumulative binomial distribution corresponding to the probability that N − 1 flips of a fair coin will result in m1 − 1 or fewer heads.
o In sum, Cover's theorem has two basic ingredients:
  - Nonlinear mapping to the hidden space with φ_i(x), i = 1, 2, ..., m1.
  - High dimensionality of the hidden space compared to the input space (m1 > m0).
o Corollary: a maximum of 2·m1 patterns can be linearly separated by a hidden
space of m1-D.

Cover's Theorem, Revisited

o Given a set X of N inputs picked from the input space independently, suppose all the possible dichotomies of X are equiprobable. Let P(N, m1) denote the probability that a particular dichotomy picked at random is separable, where the family of surfaces has m1 degrees of freedom.
o In this case,

  P(N, m1) = (1/2)^{N−1} Σ_{m=0}^{m1−1} C(N−1, m),

  where the binomial coefficient C(l, m) = l! / (m! (l − m)!).

Example: XOR, Again

o With Gaussian hidden functions, the inputs become linearly separable in the hidden space:

  φ_1(x) = exp(−||x − t_1||²), t_1 = [1, 1]^T
  φ_2(x) = exp(−||x − t_2||²), t_2 = [0, 0]^T

Specification of the hidden functions for the XOR problem:

  Input pattern x | First hidden function φ_1(x) | Second hidden function φ_2(x)
  (1, 1)          | 1                            | 0.1353
  (0, 1)          | 0.3678                       | 0.3678
  (0, 0)          | 0.1353                       | 1
  (1, 0)          | 0.3678                       | 0.3678

2D Gaussian

[Figure: a 2-D Gaussian bump φ(x) = exp(−||x − t_i||²) centered at t_i.]

o Basically, t_i determines the center of the 2-D Gaussian φ(x) = exp(−||x − t_i||²), and points x that are equidistant from the center have the same value.

RBF as an Interpolation Problem

o m0-D input to 1-D output mapping s: R^{m0} → R^1.
o The map s can be thought of as a hypersurface Γ ⊂ R^{m0+1}.
o Training: fit the hypersurface Γ to the training data points. Generalization: interpolate between data points, along the reconstructed surface Γ.
o Given {x_i ∈ R^{m0} | i = 1, 2, ..., N} and N labels {d_i ∈ R^1 | i = 1, 2, ..., N}, find F: R^{m0} → R^1 such that F(x_i) = d_i for all i.

RBF Learning: Overview

The RBF learning problem boils down to two tasks:
1. How to determine the parameters associated with the radial-basis functions in the hidden layer φ_i(x) (e.g., the centers of the Gaussians).
2. How to train the hidden-to-output weights. (This part is relatively easy.)

RBF and Interpolation

o Interpolation is formulated as

  F(x) = Σ_{i=1}^N w_i φ(||x − x_i||),

  where {φ(||x − x_i||) | i = 1, 2, ..., N} is a set of N arbitrary nonlinear functions known as radial-basis functions.
o The known data points x_i ∈ R^{m0}, i = 1, 2, ..., N are treated as the centers of the RBFs. Note that in this case all input data need to be memorized, as in instance-based learning, but
this is not a necessary requirement.

RBF and Interpolation (cont'd)

  F(x) = Σ_{i=1}^N w_i φ(||x − x_i||)

o So we have N inputs, N hidden units, and one output unit. Expressing everything (all N input-output pairs) in matrix form:

  [ φ_11 φ_12 ... φ_1N ] [ w_1 ]   [ d_1 ]
  [ φ_21 φ_22 ... φ_2N ] [ w_2 ] = [ d_2 ]
  [ ...             ... ] [ ... ]   [ ... ]
  [ φ_N1 φ_N2 ... φ_NN ] [ w_N ]   [ d_N ]

  where φ_ji = φ(||x_j − x_i||) (j indexes the input, i the center). We can abbreviate the above as Φw = d.

RBF and Interpolation (cont'd)

o When the number of hidden units m1 is less than the number of inputs N, we can find the w that minimizes

  Σ_{i=1}^N (F(x_i) − d_i)²,

  where F(x) = Σ_{k=1}^{m1} w_k φ_k(x). The solution involves the pseudo-inverse of Φ:

  w = (Φ^T Φ)^{−1} Φ^T d.

  Note that Φ is an N × m1 rectangular matrix in this case. How to determine the centers of the φ functions then becomes an issue.

RBF and Interpolation (cont'd)

o From Φw = d, we can find an explicit solution w = Φ^{−1} d, assuming Φ is nonsingular. (In general the number of hidden units is much less than the number of inputs, so we don't always have Φ as a square matrix; we will see how to handle this later.)
o Nonsingularity of Φ is guaranteed by Micchelli's theorem: let {x_i}_{i=1}^N be a set of distinct points in R^{m0}; then the N-by-N interpolation matrix Φ, whose ji-th element is φ_ji = φ(||x_j − x_i||), is nonsingular.

Typical RBFs

For some c > 0, σ > 0, and r ∈ R:
o Multiquadrics (nonlocal): φ(r) = (r² + c²)^{1/2}
o Inverse multiquadrics (local): φ(r) = 1 / (r² + c²)^{1/2}
o Gaussian functions (local): φ(r) = exp( −r² / (2σ²) )

Other Perspectives on RBF Learning

o RBF learning is formularized as

  F(x) = Σ_{k=1}^{m1} w_k φ_k(x).

o This kind of expression was given without much rationale other than intuitive appeal.
o However, there is one way to derive the above formalism, based on an interesting theoretical point of view, which we will see next.

Supervised Learning as an Ill-Posed Hypersurface Reconstruction Problem

o The exact interpolation approach has limitations: poor generalization. Data points being more numerous than the degrees of freedom of the underlying process can lead to overfitting.
o How to overcome this issue? Approach the problem from the perspective that learning is a hypersurface reconstruction problem, given a sparse set of data points.
o Contrast between the direct problem (in many cases well-posed) and the inverse problem (in many cases ill-posed).
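The separability probability P(N, m1) from Cover's theorem stated earlier can be evaluated directly. This sketch transcribes the formula and checks the corollary numerically: at N = 2·m1 random patterns the probability is exactly 1/2, and with N ≤ m1 the patterns are always separable.

```python
from math import comb

# Cover's separability probability:
# P(N, m1) = (1/2)**(N-1) * sum_{m=0}^{m1-1} C(N-1, m).
def p_separable(N: int, m1: int) -> float:
    return 0.5 ** (N - 1) * sum(comb(N - 1, m) for m in range(m1))

print(p_separable(20, 10))  # 0.5 (N = 2*m1: the break-even point)
print(p_separable(10, 10))  # 1.0 (N <= m1: always separable)
```

Plotting P(N, m1) against N/m1 shows the well-known sharp transition around N = 2·m1, which is why 2·m1 is called the capacity of the surface family.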
pointofview which we will see next Contrast between direct problem in many cases wellposed vs inverse problem in many cases illposed 18 WellPosed Problems in Reconstructing Functional IllPosed Problems and Solutions Mapping 0 Direct causal mapping are generally wellposed eg 3D object Given an unknown mapping from domain X to range 3 we want to reconstruct to 2D projection the mapping f This mapping is wellposed if all the following conditions are satis ed 0 On the other hand inverse problems are illposed eg reconstructing 3D structure from 2D projections 0 Existence For all x E Xthere exist an output y E y suchthat y fx 0 For illposed problems solutions are not unique can in many Uniqueness Fora x7 1 g X x t irrx t cases they can be infinite We need priorknowedge or some kind of preference to narrow down the range of solutions this is o ContinuityThe mapping Is continuous Forany e gt 0 there exists 6 66 suchthat called regmar39zat39om39 p700 t lt 6 a pyfx7 t lt E Where M 39Sthe Treating supervised learning as an illposed problem and using certain distance measure prior knowledge we can derive the RBF formalism If any of these conditions are violated the problem is called an illposed problem Regularization Theory Overview Tikonov s Regularization Theory 0 Task Given input xi 6 Rmo and target dz 6 R1find Fx 0 The main idea behind regularization is to stabilizethe solution by o MInImIze the sum of two terms means of prior information 1 Standard error term 0 This is done by including a functional a function that maps from 1 N N 2 1 2 afunction to ascalar in the cost function so that the functional 55a 5 EM 7 yi E Zltdi T Fxi 21 11 can also be minimized Only a small number of candidate 2 Regularizationterm solutions Will minimize this functional 1 2 0 These functional terms are called the regularization term ECU 5 HDFH 7 0 Typically the functionals measure the smoothness of the where D is a linear differential operator and the norm of the function function space 0 
Putting these together, we want to minimize (with regularization parameter λ):

  E(F) = E_s(F) + λ E_c(F) = (1/2) Σ_{i=1}^N (d_i − F(x_i))² + (λ/2) ||DF||².

Error Term vs. Regularization Term

[Figure: three fits to the same data. Left: bad fit (extremely smooth). Middle: good fit (smooth). Right: overfit (jagged).]

Try this demo: http://lcn.epfl.ch/tutorial/english/rbf/html/

Solution that Minimizes E(F)

o Problem: minimize E(F) = E_s(F) + λ E_c(F) above.
o Solution: the function F_λ(x) that satisfies the Euler-Lagrange equation below minimizes E(F):

  D~D F_λ(x) − (1/λ) Σ_{i=1}^N (d_i − F(x_i)) δ(x − x_i) = 0,

  where D~ is the adjoint operator of D and δ(·) is the Dirac delta function.

Solution that Minimizes E(F) (cont'd)

o The solution to the Euler-Lagrange equation can be formulated in terms of the Green's function G that satisfies

  D~D G(x, x_i) = δ(x − x_i).

  (Note: the form of G depends on the particular choice of D.)
o Finally, the desired function F_λ(x) that minimizes E(F) is

  F_λ(x) = (1/λ) Σ_{i=1}^N (d_i − F(x_i)) G(x, x_i).

Solution that Minimizes E(F) (cont'd)

We can use matrix notation:

  F_λ = [F_λ(x_1), F_λ(x_2), ..., F_λ(x_N)]^T,
  d = [d_1, d_2, ..., d_N]^T,
  G = the N × N matrix with elements G(x_i, x_j),
  w = [w_1, w_2, ..., w_N]^T.

Then we can rewrite the formulas in the previous page as

  w = (1/λ)(d − F_λ) and F_λ = Gw.

Solution that Minimizes E(F) (cont'd)

o Letting

  w_i = (1/λ)(d_i − F(x_i)), i = 1, 2, ..., N,

  we can recast F_λ(x) as

  F_λ(x) = Σ_{i=1}^N w_i G(x, x_i).

  Plugging in input x_j, we get F_λ(x_j) = Σ_{i=1}^N w_i G(x_j, x_i). Note the similarity to the RBF expansion F(x) = Σ_{i=1}^N w_i φ(||x − x_i||).

Solution that Minimizes E(F) (cont'd)

o Combining w = (1/λ)(d − F_λ) and F_λ = Gw, we can eliminate F_λ to get

  (G + λI) w = d.

o From this we can get the weights

  w = (G + λI)^{−1} d,

  if (G + λI) is invertible; it needs to be positive definite, which can be ensured by a sufficiently large λ. Note that G(x_i, x_j) = G(x_j, x_i), thus G^T = G.

Solution that Minimizes E(F) (cont'd)

Steps to follow:
1. Determine D.
2. Find the Green's function matrix G associated with D.
3. Obtain the weights by w = (G + λI)^{−1} d.
For certain Ds, the corresponding Green's functions are Gaussian, inverse multiquadrics, etc. Thus, the whole approach is similar to RBF (more on this later).

Translationally and Rotationally Invariant D

o One example of a Green's function that corresponds to a translationally
and rotationally invariant D is the multivariate Gaussian function

  G(x, x_i) = exp( −(1/(2σ_i²)) ||x − x_i||² ).

o With this, the regularized solution becomes

  F_λ(x) = Σ_{i=1}^N w_i exp( −(1/(2σ_i²)) ||x − x_i||² ).

  This function is known to be a universal approximator.

Regularization Theory and RBF

o In conclusion, the regularization problem's solution is given by the expansion

  F_λ(x) = Σ_{i=1}^N w_i G(x, x_i),

  where G is the Green's function for the self-adjoint operator D~D.
o When the stabilizer D has certain properties, the resulting Green's functions G become radial-basis functions:
  - D translationally invariant: G(x, x_i) = G(x − x_i).
  - D translationally and rotationally invariant: G(x, x_i) = G(||x − x_i||), which is an RBF.
o The regularized solution F_λ(x) = Σ_{i=1}^N w_i exp(−(1/(2σ_i²)) ||x − x_i||²) can be represented as a network: hidden unit i computes G(x, x_i) (one hidden unit per input pattern), and the output unit is a linear weighted sum of the hidden-unit activations.
o Problem: we need N hidden units for N input patterns, which can be excessive for large input sets.

Desirable Properties of Regularization Networks

o The regularization network is a universal approximator that can approximate any multivariate continuous function arbitrarily well, given a sufficiently large number of hidden units.
o The approximation shows the best-approximation property (the best coefficients will be found).
o The solution is optimal: it will minimize the cost functional E(F).
o Regularization networks can be computationally demanding when N is huge. Generalized RBF networks overcome this problem.

Generalized Radial-Basis Function Networks

  F*(x) = Σ_{i=1}^{m1} w_i G(x, t_i),

where m1 < N and G(x, t_i) = G(||x − t_i||), so that

  F*(x) = Σ_{i=1}^{m1} w_i G(||x − t_i||).

Weighted Norm

o When the individual input lines in the input vector x are from different classes, a weighted norm can be used:

  ||x||_C² = (Cx)^T (Cx) = x^T C^T C x.

o The approximation function can then be rewritten as

  F*(x) = Σ_{i=1}^{m1} w_i G(||x − t_i||_C).

Generalized Radial-Basis Function Networks (cont'd)

o To find the weights w_i, we minimize

  E(F*) = Σ_{i=1}^N ( d_i − Σ_{j=1}^{m1} w_j G(||x_i − t_j||_C) )².

o Minimization of E(F*) yields

  (G^T G + λ G_0) w = G^T d, where

  G_0 = [ G(t_1, t_1)   G(t_1, t_2)   ...  G(t_1, t_m1)
          G(t_2, t_1)   G(t_2, t_2)   ...  G(t_2, t_m1)
          ...
          G(t_m1, t_1)  G(t_m1, t_2)  ...  G(t_m1, t_m1) ]

o For a Gaussian RBF, the weighted norm can be interpreted as

  G(||x − t_i||_C) = exp( −(x − t_i)^T C^T C (x − t_i) ) = exp( −(1/2)(x − t_i)^T Σ^{−1} (x − t_i) ),

  where t_i represents the mean vector and Σ the covariance matrix of a multivariate Gaussian function.
o As λ approaches 0, we get (G^T G) w = G^T d, so

  w = (G^T G)^{−1} G^T d = G^+ d.

Regularization Networks vs. Generalized RBF

o The hidden layer in a generalized RBF network is much smaller: m1 < N.
o In generalized RBF, (1) the weights, (2) the RBF centers t_i, and (3) the norm-weighting matrix are all unknown parameters to be determined.
o In regularization networks, the RBF centers are known (the same as all the inputs), and only the weights need to be determined.

Estimating the Parameters

o Weights w_i: already discussed (more next).
o Regularization parameter λ:
  - Minimize the averaged squared error.
  - Use generalized cross-validation.
o RBF centers:
  - Randomly select fixed centers.
  - Self-organized selection.
  - Supervised selection.

Estimating the Regularization Parameter λ

o Minimize the average squared error: for a fixed λ, over all N inputs, calculate the squared error between the true function value and the estimated RBF network output using that λ. Find the optimal λ that minimizes this error. Problem: this requires knowledge of the true function values.
o Generalized cross-validation: use leave-one-out cross-validation. With a fixed λ, for all N inputs, find the difference between the target value from the training set and the predicted value from the leave-one-out-trained network. This approach depends only on the training set.

RBF Learning

Basic idea: learn in two different time scales.
o Nonlinear, slow learning of the RBF parameters (centers, variances).
o Linear, fast learning of the hidden-to-output weights.

RBF Learning (1/3): Random Centers

o Use m1 hidden units:

  G(||x − t_i||²) = exp( −(m1 / d_max²) ||x − t_i||² ),

  where the centers t_i, i = 1, 2, ..., m1 are picked at random from the available inputs x_j, j = 1, 2, ..., N.
o Note that the standard deviation (width) of the RBF is fixed to

  σ = d_max / sqrt(2 m1),

  where d_max is the maximum distance between the chosen
centers t_i. This gives a width that is not too peaked nor too flat.
o The linear weights are learned using the pseudo-inverse:

  w = G^+ d = (G^T G)^{−1} G^T d,

  where the matrix G has elements g_ji = G(||x_j − t_i||²).

Finding G^+ with Singular Value Decomposition

If, for a real N × M matrix G, there exist orthogonal matrices U = [u_1, u_2, ..., u_N] and V = [v_1, v_2, ..., v_M] such that

  U^T G V = diag(σ_1, σ_2, ..., σ_K), K = min(M, N),

then U is called the left singular matrix, V the right singular matrix, and σ_1, σ_2, ..., σ_K the singular values of the matrix G. Once these are known, we can obtain G^+ as

  G^+ = V Σ^+ U^T, where Σ^+ = diag(1/σ_1, 1/σ_2, ..., 1/σ_K).

There are efficient algorithms for singular value decomposition that can be used for this.

Finding G^+ with Singular Value Decomposition (cont'd)

Using the properties U^T = U^{−1}, V^T = V^{−1}, and Σ Σ^+ = I, we can verify that G G^+ = I:

  U^T G V = Σ  =>  U U^T G V V^T = U Σ V^T  =>  G = U Σ V^T
  G G^+ = U Σ V^T V Σ^+ U^T = U Σ Σ^+ U^T = U U^T = I.

RBF Learning (2/3): Self-Organized Centers

o The random-center approach is only effective with large input sets. To overcome this, we can take a hybrid approach: (1) self-organized learning of the centers, and (2) supervised learning of the linear weights.
o Clustering for RBF center learning (similar to self-organizing maps):
  1. Initialization: randomly choose distinct t_k(0)s.
  2. Sampling: draw a random input vector x ∈ X.
  3. Similarity matching: find the best-matching center vector t_{k(x)}: k(x) = arg min_k ||x(n) − t_k(n)||.
  4. Updating: update the center vectors:

     t_k(n + 1) = t_k(n) + η (x(n) − t_k(n)) if k = k(x); t_k(n + 1) = t_k(n) otherwise.

  5. Continuation: increment n and repeat from step 2.
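Strategy (1/3) above, random centers plus a pseudo-inverse, is compact enough to sketch end to end. This is an illustration, not a prescribed recipe: the 1-D target function, m1 = 20, and the data size are arbitrary; the shared width is tied to d_max exactly as in the formula, and the output weights come from the least-squares (pseudo-inverse) solution.

```python
import numpy as np

rng = np.random.default_rng(6)

N, m1 = 200, 20
x = rng.uniform(0.0, 1.0, N)
d = np.sin(2 * np.pi * x)                           # targets to fit

t = rng.choice(x, size=m1, replace=False)           # m1 random centers
d_max = np.max(np.abs(t[:, None] - t[None, :]))     # max inter-center distance

# G_ji = exp(-(m1 / d_max^2) * ||x_j - t_i||^2), the fixed-width Gaussian.
G = np.exp(-(m1 / d_max ** 2) * (x[:, None] - t[None, :]) ** 2)

w, *_ = np.linalg.lstsq(G, d, rcond=None)           # w = G^+ d
rmse = np.sqrt(np.mean((G @ w - d) ** 2))
print(rmse)  # small: 20 fixed Gaussians fit the smooth target well
```

Only the linear output weights are learned here; this is the "fast" time scale of RBF learning, with the nonlinear parameters (centers, width) frozen by the random-selection heuristic.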
The hidden neurons in RBF calculate the Euclidean norm of the input vector and the center while in MLP the inner product of the input vector and the weight vector is calculated MLPs construct global approximations to nonlinear input output mapping RBF uses exponentially decaying localized nonlinearities eg Gaussians to construct local approximations to nonlinear input output mappings 47 RBF Learning 33 Supervised Selection of Centers cont d 0 Linear weights Zeamamm 7 more 31 Position of centers 680 N 1 2mm 3 ejltngtc lej M UQHCi 2 0 xj 7mm BMW j1 0 Spread of centers 650 N W wim 5jnG lej WHICH QMWX Summary RBF network is unusual due to its two different unit types RBF hidden layer and linear output layer 0 RBF is derived in a principled manner starting from Tikohonov s regularization theory unlike MLP o In RBF the smoothing term becomes important with different D giving rise to different Green s function G o Generalized RBF lifts the requirement of N hidden units for N input patterns greatly reducing the computational complexity 0 Proper estimation of the regularization parameter A is needed Slide09 Introduction Selforganizing maps SOM is based on competitive learning where output neurons compete with each other to be activated Haykin Chapter 9 SelfOrganizing Kohonen 1982 Maps l o The output neuron that activates is called the winnertakesall CPSC 636600 quote r quot39 Instructor Yoonsuck Choe 0 Lateral inhibition is one way to implement competition for map Spring 2008 formation von der Malsburg 1973 o In SOM neurons are placed on a lattice on which a meaningful coordinate system for different features is created feature map The lattice thus forms a topographic map where the spatial location on the lattice is indicative of the input features SOM and the Cortical Maps Two Models 0 The development of SOM as a neural model is motivated by the topographical nature of cortical maps 0 Visual tactile and acoustic inputs are mapped in atopographical manner Visual retinotopy 
position in visual field orientation spatial Willshawrvon der Malsburg Kohonen frequency direction ocular dominance etc T n t t T k Willshawvon der Malsburg model input neurons arranged in ac Ie somao o OSI ion on s in W p 2D lattice output in 2D lattice Lateral excitationinhibition Mexican hat gives rise to soft competition Normalized Hebbian learning Biological motivation Acoustic tonotopy frequency 0 Kohonen model input of any dimension output neurons in 1D 2D or 3D lattice Relaxed winnertakesall neighborhood Competetive learning rule Coinputational motivation SOM Overview SOM is based on three principles 0 Competition each neuron calculates a discriminant function The neuron with the highest value is declared the winner 0 Cooperation Neurons nearby the winner on the lattice get a chance to adapt o Adaptation The winner and its neighbors increase their discriminant function value relative to the current input Subsequent presentation of the current input should result in enhanced function value Redundancy in the input is needed Structure No Yes Redundancy No Yes lnfoltCapacity No Yes Consider each pixel as one random variable 0 Unsupervised learning such as SOM require redundancy in the data Redundancy etc o The following are intimately related Redundancy Structure or organization Information content relative to channel capacity Redundancy etc cont d mm mm daz at 02 as no as as 07 as a azabacaxttztbtct Left Right Structure No Yes Redundancy No Yes lnfoltCapacity No Yes Consider each axis as one random variable SelfOrganizing Map SOM 2D SOM Layer Input x Kohonen 1982 0 1D 2D or 3D layout of units 0 One weight vector for each unit 0 Unsupervised learning no target output Input Space 0 Weight vectors can be plotted in the input space 0 Weight vectors move not according to their proximity to the input in the input space but according to their proximity in the lattice 11 SOM Algorithm 1 Randomly initialize weight vectors wi 2 Randomly sample input vector X 3 Find Best 
Matching Unit (BMU):

  i(x) = arg min_j ||x − w_j||.

4. Update the weight vectors:

  w_j := w_j + η h_{j,i(x)} (x − w_j),

  where η is the learning rate and h_{j,i(x)} is the neighborhood function centered on the BMU.
5. Repeat steps 2-4.

Is This Hebbian Learning? Sort of.

o SOM learning can be viewed as Hebbian learning with a forgetting term to check unbounded growth.
o Original Hebb's rule: Δw_j = η y_j x, where w_j is the weight vector, η the learning rate, y_j the output response, and x the input vector.
o Hebb's rule plus a forgetting term g(y_j) w_j:

  Δw_j = η y_j x − g(y_j) w_j = η y_j x − η y_j w_j = η h_{j,i(x)} (x − w_j),

  assuming g(y_j) = η y_j and y_j = h_{j,i(x)}.

Typical Neighborhood Functions

o Gaussian: h_{j,i} = exp( −||r_j − r_i||² / (2σ²) ).
o Flat: h_{j,i} = 1 if ||r_j − r_i|| ≤ σ, and 0 otherwise.
o σ is called the neighborhood radius; r_j is the location of unit j on the lattice.

Two Phases of Adaptation

o Self-organization (ordering) phase: high learning rate, large neighborhood radius (entire map).
o Convergence phase: low learning rate, small neighborhood radius (one or zero).

Training Tips

o Start with a large neighborhood radius, and gradually decrease it to a small value:

  σ(n) = σ_0 exp(−n / τ_1).

o Start with a high learning rate, and gradually decrease it to a small value:

  η(n) = η_0 exp(−n / τ_2).

Performance Measures

o Quantization error: the average distance between each data vector and its BMU:

  ε_Q = (1/N) Σ_{j=1}^N ||x_j − w_{i(x_j)}||.

o Topographic error: the proportion of all data vectors for which the first and second BMUs are not adjacent units:

  ε_T = (1/N) Σ_{j=1}^N u(x_j), where u(x_j) = 1 if the 1st and 2nd BMUs are not adjacent, and u(x_j) = 0 otherwise.

SOM Summary

Essential ingredients of SOM:
o Hebbian learning rule with a forgetting term.
o Input generated according to a certain probability distribution on a continuous input space.
o Topology of the network, formed on the discrete lattice.
o Time-varying neighborhood function around the winner.
o Time-varying learning rate.

Example: 2D Input, 2D Output

o Train with uniformly random 2-D inputs; each input is a point in the Cartesian plane.
o Nodes: weight vectors (x and y coordinates). Edges connect immediate neighbors on the map.

SOM Summary (cont'd)

Properties of SOM:
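The algorithm above in a minimal 1-D form (a sketch: the exponential schedules for η and σ and all constants are illustrative choices, not prescribed values). After the ordering and convergence phases, the weights end up topologically ordered along the lattice and spread over the input range, illustrating the approximation and density-matching properties for a uniform input.

```python
import numpy as np

rng = np.random.default_rng(7)

n_units, steps = 20, 4000
W = rng.uniform(0.0, 1.0, n_units)            # 1-D weights on a 1-D lattice
lattice = np.arange(n_units)

for n in range(steps):
    x = rng.uniform(0.0, 1.0)                 # sample the input distribution
    bmu = np.argmin(np.abs(W - x))            # step 3: best-matching unit
    eta = 0.5 * np.exp(-n / 1000)             # decaying learning rate
    sigma = (n_units / 2) * np.exp(-n / 1000) # decaying neighborhood radius
    h = np.exp(-((lattice - bmu) ** 2) / (2 * sigma ** 2))
    W += eta * h * (x - W)                    # step 4: neighborhood update

ordered = np.all(np.diff(W) > 0) or np.all(np.diff(W) < 0)
print(ordered)        # expected: True (weights ordered along the lattice)
print(W.min(), W.max())  # spread over roughly the whole input range
```

Starting with a neighborhood radius covering half the map is what lets the map "unfold" during the ordering phase; shrinking σ below one lattice unit then fine-tunes each weight almost independently.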
o Approximation of the input space: the collection of weight vectors provides a good approximation of the input space.
o Topological ordering: spatial location on the lattice corresponds to a certain feature of the input patterns. Nearby neurons on the lattice represent similar input features.
o Density matching: more neurons are recruited to represent dense areas in the input space.
o Feature selection: select the best features to approximate the underlying distribution.

Different 2D Input Distributions

[Figure: several 2-D input distributions of different shapes.]
o What would the resulting SOM map look like?
o Why would it look like that?

High-Dimensional Inputs

o SOM can be trained with inputs of arbitrary dimension.
o Dimensionality reduction: N-D to 2-D.
o Extracts topological features (input space); used for visualization of data (SOM output space).

Applications

o Data clustering and visualization.
o Optimization problems: traveling salesman problem.
o Semantic maps: natural language processing.
o Preprocessing for signal and image processing:
  1. Handwritten character recognition.
  2. Phonetic map for speech recognition.

Exercise

1. What happens when h_{j,i(x)} and η are reduced quickly vs. slowly?
2. How would the map organize if different input distributions are given?
3. For a fixed number of input vectors from real-world data, a different visualization scheme is required. How would you use the number of input vectors that best match each unit to visualize the property of the map?

SOM Example: Handwritten Digit Recognition

o Preprocessing for feedforward networks (supervised learning).
o Better representation for training; better generalization.

SOM Demo

Jochen Frohlich's "Neural Networks with JAVA" page: http://fbim.fh-regensburg.de/~saj39122/jfroehl/diplom/e-index.html (check out the Sample Applet link).

SOM Demo: Space Filling in 2D

Using Frohlich's SOM applet:
o 1-D SOM map, 1 × n, where n is the number of nodes; 2-D input space.
o Initial neighborhood radius of 100; stop when the radius < 0.001.
o Try 1000 nodes and 1000 input points.

SOM Demo: Traveling Salesman Problem

Using Frohlich's SOM applet:
- 1D SOM map: 1 × n, where n is the number of nodes.
- 2D input space.
- Initial neighborhood radius of 8.
- Stop when radius < 0.001.
- Try 50 nodes, 20 input points.
Click on "Parameters" to bring up the config panel. After the parameters are set, click on "Reset" in the main applet and then "Start learning".

SOM Demo: Space Filling in 3D. Using Fröhlich's SOM applet:
- 2D SOM map: n × n, where n is the number of nodes.
- 3D input space.
- Initial neighborhood radius of 10.
- Stop when radius < 0.001.
- Try 30 × 30 nodes and 500 input points. Limit the y range to 15. Also try 50 × 50 nodes, 1000 input points, and an initial radius of 16.

Vector Quantization:
- Vector quantization exploits the structure in the input distribution for the purpose of data compression.
- The input space is partitioned into a number of distinct regions, and for each region a reconstruction vector is defined. A new input is then represented by the reconstruction vector of the region it falls into.
- Since only the index of the reconstruction vector needs to be stored or transmitted, significant savings are possible in terms of storage space and bandwidth.
- The collection of reconstruction vectors is called the code book.

Learning Vector Quantization:
- Train with SOM in unsupervised mode, then tune the weight vectors in a supervised mode.
- If the class of the input vector x_i and the class of the best-matching weight vector w_c match:
  w_c(n+1) = w_c(n) + α_n [x_i − w_c(n)].
- If the classes do not match:
  w_c(n+1) = w_c(n) − α_n [x_i − w_c(n)].

Vector Quantization (cont'd):
- A vector quantizer that minimizes encoding distortion is called a Voronoi or nearest-neighbor quantizer.
- SOM provides an approximate method for calculating the Voronoi tessellation.

Other Topics:
- Different ways of visualization using SOM: contextual map (or semantic map).
- SOM viewed as an abstract neuroscientific model of the cortex, or as a vector quantizer.
- Difficulty of analysis (convergence, etc.).
- Use in modeling cortical map formation.

Slide03 Haykin Chapter 3
SingleLayer Perceptrons CPSC 636600 Instructor Yoonsuck Choe Spring 2008 Multiple Faces of a Single Neuron What a single neuron does can be viewed from different perspectives 0 Adaptive filter as in signal processing 0 Classifier as in perceptron The two aspects will be reviewed in the above order Historical Overview McCulloch and Pitts 1943 neural networks as computing machines Hebb 1949 postulated the first rule for selforganizing learning Flosenblatt 1958 perceptron as a first model of supervised learning Widrow and Hoff 1960 adaptive filters using leastmeansquare LMS algorithm delta rule Part I Adaptive Filter Adaptive Filtering Problem 0 Consider an unknown dynamical system that takes m inputs and generates one output 0 Behavior of the system described as its inputoutput pair T xi7 dii 17 27 n7 where 9011 0021 mmiT isthe inputand the desired response or target signal 0 Input vector can be either a spatial snapshot or a temporal sequence uniformly spaced in time There are two important processes in adaptive filtering Filtering process generation of output based on the input Mi XTiWi Adapative process automatic adjustment of weights to reduce error em do 7 ya Steepest Descent 0 We want the iterative update algorithm to have the following property 8wn 1 lt o Definethe gradient vector V8w as g o The iterative weight update rule then becomes wn 1 wn 7gn where 7 is a small learningrate parameter So we can say AWW W 1 WW ngn Unconstrained Optimization Techniques 0 How can we adjust to gradually minimize 51 Note that 51 7 7 xTiwi Since and are fixed only the change in can change 51 o In other words we want to minimize the cost function E w with respect to the weight vector w Find the optimal solution w o The necessary condition for optimality is V w 07 where the gradient operator is defined as i 8 8 8 T i 811117811127m8wm With this we get V50 as as as T 811117 81112 7 8wm 6 Steepest Descent cont d We now check if 8wn 1 lt Using firstorder Taylor expansionJr of near 
E(w(n+1)) ≈ E(w(n)) + gᵀ(n) Δw(n), and with Δw(n) = −η g(n), we get

E(w(n+1)) ≈ E(w(n)) − η gᵀ(n) g(n) = E(w(n)) − η ||g(n)||².

Since η ||g(n)||² is positive, for small η it is indeed the case that E(w(n+1)) < E(w(n)).
(First-order Taylor series: f(x) ≈ f(a) + f′(a)(x − a).)

Steepest Descent: Example
[Figure: steepest-descent trajectories under different learning rates.]
- Convergence to the optimal w is very slow.
- Small η: overdamped, smooth trajectory.
- Large η: underdamped, jagged trajectory.
- η too large: the algorithm becomes unstable.

Newton's Method:
- Newton's method is an extension of steepest descent, where the second-order term in the Taylor series expansion is also used.
- It is generally faster and shows less erratic meandering compared to the steepest descent method.
- There are certain conditions to be met, though, such as the Hessian matrix ∇²E(w) being positive definite: xᵀHx > 0 for arbitrary x ≠ 0.

Steepest Descent: Another Example
[Figure: gradient vector field of f(x, y) = x² + y², with level curves.]
For f(x, y) = x² + y²,
∇f(x, y) = (∂f/∂x, ∂f/∂y)ᵀ = (2x, 2y)ᵀ.
Note that (1) the gradient vectors point upward, away from the origin, and (2) the lengths of the vectors are shorter near the origin. If you follow −∇f(x, y), you will end up at the origin. We can also see that the gradient vectors are perpendicular to the level curves. (The vector lengths in the figure were scaled down by a factor of 10 to avoid clutter.)

Gauss-Newton Method:
- Applicable for cost functions expressed as a sum of error squares:
  E(w) = (1/2) Σ_{i=1}^{n} e_i²(w),
  where e_i(w) is the error in the i-th trial with the weight w.
- Recalling the Taylor series f(x) ≈ f(a) + f′(a)(x − a), we can express e_i(w) evaluated near w_k as
  e_i(w) ≈ e_i(w_k) + [∂e_i/∂w]ᵀ_{w=w_k} (w − w_k).
- In matrix notation we get
  e(w) ≈ e(w_k) + J_e(w_k)(w − w_k).
  (We will use a slightly different notation than the textbook, for clarity.)

Gauss-Newton Method (cont'd): Quick Example, Jacobian Matrix
- J_e(w) is the Jacobian matrix, where each row is the gradient of one error term, e.g. row 1 is (∂e_1/∂x, ∂e_1/∂y).
- Given e(x, y) = (x² + y², cos(x) + sin(y))ᵀ, the Jacobian of e(x, y) becomes
  J_e(x, y) = [ ∂e_1/∂x  ∂e_1/∂y ; ∂e_2/∂x  ∂e_2/∂y ] = [ 2x  2y ; −sin(x)  cos(y) ].
- For (x, y) = (0.5π, π) we get J_e(0.5π, π) = [ π  2π ; −sin(0.5π)  cos(π) ] = [ π  2π ; −1  −1 ].
- We can then evaluate J_e(w_k) by plugging in the actual
values of w_k into the Jacobian matrix above.

Gauss-Newton Method (cont'd):
- Again starting with e(w) ≈ e(w_k) + J_e(w_k)(w − w_k), what we want is to set w so that the error approaches 0. That is, we want to minimize the norm of e(w):
  (1/2)||e(w)||² ≈ (1/2)||e(w_k)||² + eᵀ(w_k) J_e(w_k)(w − w_k) + (1/2)(w − w_k)ᵀ Jᵀ_e(w_k) J_e(w_k)(w − w_k).
- Differentiating the above w.r.t. w and setting the result to 0, we get
  Jᵀ_e(w_k) e(w_k) + Jᵀ_e(w_k) J_e(w_k)(w − w_k) = 0,
  from which we get
  w = w_k − (Jᵀ_e(w_k) J_e(w_k))⁻¹ Jᵀ_e(w_k) e(w_k).
- Jᵀ_e(w_k) J_e(w_k) needs to be nonsingular (its inverse is needed).

Linear Least-Square Filter:
- Given m inputs and 1 output, y(i) = xᵀ(i)w, where φ(v) = v (i.e., it is linear), and a set of training samples {(x(i), d(i))}_{i=1}^{n}, we can define the error vector for an arbitrary weight w as
  e(w) = d − [x(1), x(2), ..., x(n)]ᵀ w,
  where d = [d(1), d(2), ..., d(n)]ᵀ. Setting X = [x(1), x(2), ..., x(n)]ᵀ, we get e(w) = d − Xw.
- Differentiating the above w.r.t. w, we get ∇e(w) = −Xᵀ. So the Jacobian becomes J_e(w) = (∇e(w))ᵀ = −X.
- Plugging this into the Gauss-Newton equation, we finally get
  w = w_k + (XᵀX)⁻¹Xᵀ(d − Xw_k) = w_k + (XᵀX)⁻¹Xᵀd − (XᵀX)⁻¹XᵀXw_k = (XᵀX)⁻¹Xᵀd.
  This is w = (XᵀX)⁻¹Xᵀd.

Linear Least-Square Filter (cont'd). Points worth noting:
- X does not need to be a square matrix.
- We get w = (XᵀX)⁻¹Xᵀd off the bat partly because the output is linear; otherwise the formula would be more complex.
- The Jacobian of the error function only depends on the input, and is invariant w.r.t. the weight w.
- The factor (XᵀX)⁻¹Xᵀ — let's call it X⁺ — is like an inverse (the pseudoinverse). Multiply X⁺ on both sides of d = Xw, and we get w = X⁺d (since X⁺X ≈ I).

Least-Mean-Square Algorithm:
- The cost function is based on instantaneous values: E(w) = (1/2) e²(w).
- Differentiating the above w.r.t. w, we get
  ∂E(w)/∂w = e(w) ∂e(w)/∂w.
- Plugging in e(w) = d − xᵀw:
  ∂e(w)/∂w = −x, and hence ∂E(w)/∂w = −x e(w).
- Using this in the steepest descent rule, we get the LMS algorithm:
  ŵ(n+1) = ŵ(n) + η x(n) e(n).
- Note: this weight update is done with only one (x(i), d(i)) pair.

Linear Least-Square Filter: Example. See src/pseudoinv.m:

  X = ceil(rand(4,2) * 10);
  wtrue = rand(2,1) * 10;
  d = X * wtrue;
  w = inv(X' * X) * X' * d;

  X =
     10   7
      3   7
      3   6
      5   4
  wtrue =
     0.56644
     4.99120
  d =
     40.603
     36.638
     31.647
     22.797
  w =
     0.56644
     4.99120

Least-Mean-Square Algorithm: Evaluation
- The LMS algorithm behaves like a
lowpass filter 0 LMS algorithm is simple modelindependent and thus robust o LMS does not follow the direction of steepest descent Instead it follows it stochastically stochastic gradient descent 0 Slow convergence is an issue 0 LMS is sensitive to the input correlation matrix s condition number ratio between largest vs smallest eigenvalue of the correl matrix 0 LMS can be shown to converge if the learning rate has the following property 2 0 lt r lt 7 Amax where Amax is the largest eigenvalue of the correl matrix 0 Improving Convergence in LMS SearchThenConverge in LMS quot71 Sullde LMS algal1mm o The main problem arises because of the fixed 7 n tlugscal l 0 One solution Use atimevarying learning rate 7171 071 as in stochastic optimization theory 0 A better alternative use a hybrid method called W u searchthenconverge Mn 77 0 1 rtT 00ny When n lt 739 performance is similar to standard LMS When ymun 35 Learningrrate annealinq schedules n gt 739 it behaves like stochastic optimization 710 710 n 7 vs n M n M UralT 21 22 The Perceptron Model 1 Inputs Outpui Harri y limiter Part II Perceptron o Perceptron uses a nonlinear neuron model McCullochPitts model 177 1 1f 1 gt 0 WZwiib7 y v 141 0 1f 1 S 0 0 Goal classify input vectors into two classes 24 1 t05 gt W y Russel amp Norvig 1 t 05 o Perceptrons can represent basic boolean functions function 0 Thus a network of perceptron units can compute any Boolean What about XOR or EQUIV Geometric Interpretation H output 1 t Slope im W t W1 w 0739 Output0 fs 0 Rearranging W0 X 0 W1 X 1 7 t gt 07 then output is 17 we get if W1 gt 0 11gt W0 I t X 0 7 W1 W1 7 where points above the line the output is 1 and 0 forthose below the line Compare with 7W0 t y X 70 W7 W1 Boolean Logic Gates with Perceptron Units 1 t15 What Perceptrons Can Represent 11 Output 1 1 t Slope 7W0 we t W1 10 0 IO Output0fs Perceptrons can only represent linearly separable functions 0 Output of the perceptron W0 X 0 W1 X 1 7 t gt 07thenoutputis1 W0 X 0 W1 X 1 7 t g 
07 thenoutputisO The Role of the Bias 0 Without the bias t 0 learning is limited to adjustment of the slope of the separating line passing through the origin 0 Three example lines with different weights are shown Limitation of Perceptrons 11 Output 1 1 t Slope 7W0 t W1 Output0fs 0 Only functions where the 0 points and 1 points are clearly linearly separable can be represented by perceptrons o The geometric interpretation is generalizable to functions of n arguments ie perceptron with n inputs plus one threshold or bias unit Linear Separability 1 O Llnearly separable Not Linearly separable Not Llnearly separable o For functions that take integer or real values as arguments and output either 0 or 1 0 Left linearly separable ie can draw a straight line between the classes 0 Right not linearly separable ie perceptrons cannot represent such a function Generalizing to nDimensions Z x hllp malhwcrld Wolfram comPlane himl 39 n albicw lyizk co 96040720 0 Equation of a plane 7 a 0 0 o In short am by CZ d 0 where a b c can serve as the weight and d an as the bias 0 For nD input space the decision boundary becomes a n 1D hyperplane 1D laetss than the input space Linear Separability cont d Il o Perceptrons cannot represent XOR o Minsky and Papert 1969 XOR in Detail 10 1 XOR 4 t 1 0mm 31 W0 1 0 0 0 W0 ope V 0 O g39 W1 2 0 1 1 l 3 1 0 1 4 1 1 0 Output0fs W0 X 0 W1 X 1 t gt 0 then output is 1 1 tgo gt 120 2 Wl tgt0 gt W1gtt 3 Wo tgt0 gt W0gtt 4 W0W1 t 0 gt W0W1 t 2t lt W0 W1 lt t from 23 and 4 butt 2 0 from 1 a contradiction Perceptron Learning Rule 0 Given a linearly separable set of inputs that can belong to class C1 or C2 0 The goal of perceptron learning is to have T w x gt 0 toralllnput In class C1 T w x g 0 torall Input In class C2 0 It all inputs are correctly classified with the current weights T wn x gt 07foralllnputln class C1and wnTx g 07 torall input in class C2 then wn 1 wn no change 0 Otherwise adjust the weights Perceptrons A Different Perspective wa gt b then 
output is 1 wa HwHHxHCOSO gt b thenoutputis1 b cos 9 gt inH then output Is1 So if d cos 9 in the figure above is greaterthan HTI H then output 1 Adjusting w changes the tilt of the decision boundary and adjusting the bias b and Hw H moves the decision boundary closer or away from the origin 34 Perceptron Learning Rule cont d For misclassified inputs 7n is the learning rate 0 wn 1 wn nnxn ithX gt 0 and X 6 C2 0 wn 1 wn nnxn ithx 3 0 and X 6 C1 Or simply Xn 1 wn rnenxn where 6n dn yn the error Learning in Perceptron Another Look 0 When a positive example C1 is misclassified wn 1 wn 0 When a negative example C2 is misclassified wn 1 wn 0 Note the tilt in the weight vector and observe how it would change the decision boundary Perceptron Convergence Theorem cont d 0 Using CauchySchwartz inequality llWollQllWW 1 2 2 wgwn 1 2 0 From the above and wgwn 1 gt na llwoll2llwn 1ll2 2 n2a2 So finally we get 2 2 llwltn 1ll2 2 4 llwoll2 First main result Perceptron Convergence Theorem Given a set of linearly separable inputs Without loss of generality assume 77 1 w0 O o Assumethe first n examples 6 C1 are all misclassified 0 Then using wn 1 wn xnwe get wn1x1x2xn 1 0 Since the input set is linearly separable there is at least on solution wO such that ngn gt 0 for all inputs in C1 Define oz minxnecl ngn gt 0 Multiply both sides in eq 1 with wo we get T T T T w0 wn 1 w0 x1 l w0 x2w0 2 From the two steps above we get wgwn l 1 gt n04 3 38 Perceptron Convergence Theorem cont d 0 Taking the Euclidean norm ofwk 1 wk llwk 1ll2 7 llwkgtll2 2WTkxk llxlvll2 0 Since all n inputs in C1 are misclassified wTkxk g 0 for 7 1 2 n llWk 1ll2 7 llWl ll2 7 llxl ll2 2WTkxk S 07 llWk 1ll2 S llWl ll2 llxl ll2 llWk 1ll2 7 llWl ll2 S llxl ll2 o Summing up the inequalities forall k 1 2 n and w0 O we get llWk 1ll2 S E llgtltltl ll2 S 7167 5 k1 where 6 maxxk 6 C1 40 Perceptron Convergence Theorem cont d FixedIncrement Convergence Theorem d eq 5 From eq 4 an 04 2 Let the subsets of training vectors C1 and C2 be 
linearly separable Let gllwm1gtll Sn llwo H the inputs presented to perceptron originate from these two subsets Here a is a constant depending on the xed input set and the xed The perceptron converges after some no iterations in the sense that solution wO so wO is also a constant and 6 is also a constant since it depends only on the fixed input set wno wno 1 wno 2 In this case if n grows to a large value the above inequality will becomes is a soution vector for no lt nmax invalid n is a positive integer Thus n cannot grow beyond a certain nmax where x nmaxB llwo ll2 2 i llwo ll nmax 7 727 oz and when n nmax all inputs will be correctly classified 41 42 Summary TABLE 32 Summary oftl39ie Perceptron Convergence Algorithm irm 15m and Poiririrr39leiv xiv 7 lil 0 mle vector 7ilt lilIilH mml Adaptive filter using the LMS algorithm and perceptrons are twin i iiiwtiw mm closely related the learning rule is almost identical rm it iiquot ll zlll mmiiai o LMS and perceptrons are different however since one uses Pumiwmamiw linear activation and the other hard limiters Oil Sci MD 7 n Thcn perform lhc lolluming rompulauons im L Mm muWWquotbymemumm o LMS is used in continuous learning while perceptrons are trained Viiluedinpulwcturxln mirea punsrl ii I I 3 iriiiiiuimiiiimiermiuuiu 39nlllpLIlL llulwlunl rupulixcul h er for only a mute number Of Steps mm 3quot 39wm m o Singleneuron or singlelayer has severe limits How can multiple whch sgr i i llk ignum lunrlml39L 4 41quot743917770quotDfW lgll ll39l mr Llpddlelhs Meiglnvccmrofrhc pucqintu wrii l wru 7 ldil 7 iiii iixrn wiim ii xiriihciungs i raw iniiii Mimi luciusaquot 5 Cuiiiiiiimiimi Increment limc dun n by We miii g ii i l i mi v layers help XOR with Multilayer Perceptrons Noie the bias units are noi shown in ihe network on the righi but they are needed 0 Only three perceptron units are needed to implement XOR 0 However you need two layers to achieve this Slide11 Haykin Chapter 8 Principal Components Analysis CPSC 636600 
Instructor Yoonsuck Choe Spring 2008 Principal Component Analysis Variance Probe o X mdimensional random vector vector random variable following a certain probability distribution 0 Assume EX 0 0 Projection of a unit vector q qu12 1 onto X A XTq qTX 0 We know EA EqTX qTEX 0 o The variance can also be calculated 02 ElAQl ElqTXXTQl qT Elxle q Hf l correl matrix 7 TR 7 q q 3 Motivation How can we project the given data so that the variance in the projected points is maximized Principal Component Analysis Variance Probe cont d This is sort of a variance probe 11q qTRq Using different unit vectors q for the projection of the input data points will result in smaller or larger variance in the projected points With this we can ask which vector direction does the variance probe 11q has exterma value The solution to the question is obtained by finding unit vectors satisfying the following condition Rq q where A is a scaling factor This is basically an eigenvalue problem PCA With an m X m correlation matrix R we can get m eigenvectors and m eigenvalues qu qujj 12m We can sort the eigenvectorseigenvalues according to the eigenvalues so that A1 gt A2 gt gt Am and arrange the eigenvectors in a columnwise matrix Q q17CI27 wiqml Then we can write RQ QA where A diaggt1 A2 Am Q is orthogonal so that QQT I That is Q71 QT 5 PCA Usage Project input X to the principal directions a QTX We can also recover the input from the projected point a X QTTla Qa Note that we don t need all m principal directions depending on how much variance is captured in the first few eigenvalues We can do dimensionality reduction PCA Summary 0 The eigenvectors of the correlation matrix R of zeromean random input vector X define the principal directions qj along with the variance of the projected inputs have extremal values 0 The associated eigenvaluess define the extremal values of the variance probe PCA Dimensionality Reduction 0 Encoding We can use the first 1 eigenvectors to encode X T T 017 a2 my all 0117 
0127 m 011 X 0 Note that we only need to calculatel projections a1 a2 al wherel S m o Decoding Once 01 02 ale is obtained we want to reconstruct the full 1 2 avg xij T A X Q3 R 0117 0127 mymllaliab mall X Or alternatively f Qa1a2 al 00 0 F W m l zeros PCA Total Variance PCA Example o The total variance of th em components of the data vector is 395 m m 2 2 2 l 0 8 j1 j1 o The truncated version with the first 1 components have variance E 2 7 E Uj 7 J inprandn80029O5randn100026ones10002l o The larger the variance in the truncated version ie the smaller Q 070285 7071134 071134 070285 the variance in the remaining components the more accurate the dimensionality reduction 7 014425 000000 7 000000 002161 9 10 PCA s Relation to Neural Networks HebbianBased HebbianBased Maximum Eigenfilter Maximum Eigenfilter Expanding the weight update rule using Taylor series we get 0 How does all the above relate to neural networks 7 2 o A remarkable result by Oja 1982 shows that a single linear wlltn1gt 7 wlltngtnyltngt yltngtwlltn om 7 neuron with Hebbian synapse can evolve into a filter for the first with OM ihciudihg the second and higher order effects of 7 principal component of the input distribution which we can ignore for small 7 0 Activation m Based on that we get 9 Z wiivz39 21 711101 1 WW rign 96101 ynWin 0 Learning rule wi n n 70 n 7 7 2 7 lt H M gt lt gt w n 72 WM 71 Wt w n 12 wi39n 1 m 21iwzln nyn vz39nl2 Hebbian term Stabilization term l 9 5 039 Matrix Formulation of the Algorithm Activation Learning Wm 1 wltngt nynlxngt yltnwnl Combining the above WW 1 WW WXnXTnWn WTnXnXTnWnWnl represents a nonlinear stochastic difference equation which is hard to analyze Conditions for Stability 77n is a decreasing sequence of positive real numbers such that 7771 00 1 npn lt ooforp gt 1 n1 nngt0asngtoo Sequence of parameter vectors is bounded ivvth probability 1 The update function hw7 x is continuously differentiable wrt w and x and it derivatives are bounded in time The limit 71w limn 
oo Ehw7 exists foreach wwhere X is a random vector There is a locally asymptotically stable solution to the ODE d A w t h w t dt lt gt lt lt gtgt Let ql denote the solution to the ODE above with a basin of attraction Bq The parametervector wn enters the compact subset A of Bq infinitely often with prob 1 15 Asymptotic Stability Theorem 0 To ease the analysis we rewrite the learning rule as Wm 1 wltngt nltngthwn7 x01 0 The goal is to associate a deterministic ordinary differential equation ODE with the stochastic equation 0 Under certain reasonable conditions on r h and w we get the asymptotic stability theorem stating that lim wn q1 17an infinitely often with probability 1 Stability Analysis of Maximum Eigenfilter Set it up to satisfy the conditions of the asymptotic stability theorem 0 Set the learning rate to be 77n 171 0 Seth7 to hwi x xnyn 7 WWW xnxTnWn 7 WTnxnxTnwnlwn 0 Taking expectaion overall x it timnsoo EXnXTnwn e wTltngtxltngtxTltngtwltngtgtwltngti Rwltxgt e wTooRwoo wltxgt o Substituting it into the ODE EWlt gt Ewt Rwt 7 WTtRwtwt Stability Analysis of Maximum Eigenfilter 0 Expanding wt with the eigenvectors of R wt Zekmqk k1 and using basic definitions RCIk Akqi quQk Ak k1 l1 k1 Stability Analysis of Maximum Eigenfilter cont d o Recalling the original expansion wlttgt Zeklttgtqk k1 we can conclude that wt gt ql ast gt 00 where q1 is the normalized eigenvector associated with the largest eigenvalue A1 of the correlation matrix R Other conditions of stability can also be shown to hold see the textbook k1 Stability Analysis of Maximum Eigenfilter cont d o Factoring out 221 qk we get W mat Aloha wt 0 We can analyze the above in two cases Case I k 7E 1 0kt In this case 01 gt 0 as t gt oo Case ll k 1 In this case 01t gt l1 as t gt 00 Summary of HebbianBased Maximum Eigenfilter Hebbianbased linear neuron converges with probability 1 to a fixed point which is characterized as follows 0 Variance of output approaches the largest eigenvalue of the correlation 
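The Hebbian maximum eigenfilter under analysis here is straightforward to simulate directly. Below is a minimal NumPy sketch (illustrative, not from the slides; the synthetic data, learning rate, and epoch count are assumptions):

```python
import numpy as np

def oja_eigenfilter(X, eta=0.01, epochs=200, seed=0):
    """Single linear neuron trained with Oja's rule,
        w <- w + eta * y * (x - y * w),   y = w . x,
    i.e. a Hebbian term eta*y*x minus the forgetting term eta*y^2*w."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)          # start from a random unit vector
    for _ in range(epochs):
        for x in X:
            y = w @ x               # neuron output
            w += eta * y * (x - y * w)
    return w
```

With zero-mean input whose variance is dominated by one direction, w ends up aligned with (plus or minus) the principal eigenvector of the correlation matrix, with ||w|| near 1, consistent with the convergence properties stated in this analysis.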
matrix R lim 0271 lim EY2n A1 Ham 0 Synaptic weight vector approaches the associated eigenvector lim wn q1 77400 with lim 1 77400 Generalized Hebbian Algorithm for full PCA o Sanger 1989 showed how to construct a feedfoward networkto learn all the eigenvectors of R 0 Activation yjn Zw ninj 12 l 11 0 Learning 139 Awmn n yjltngtxiltngt yjltngt Z wkiltngtykltngt k1 239 12m j 12 Slide08 Haykin Chapter 6 SupportVector Machines CPSC 636600 Instructor Yoonsuck Choe Spring 2008 Note Part of this lecture drew material from Ricardo GutierrezOsuna s Pattern Analysis lectures Optimal Hyperplane For linearly separable patterns Xi di 11 with di 6 17 1 o The separating hyperplane is WTX b O wab 2 0 fordi 1 T 7 w xblt0 fordZ 7 1 0 Let wo be the optimal hyperplane and b0 the optimal bias Introduction 0 Support vector machine is a linear machine with some very nice properties 0 The basic idea of SVM is to construct a separating hyperplane where the margin of separation between positive and negative examples are maximized o Principled derivation structural risk minimization error rate is bounded by 1 training errorrate and 2 VCdimension of the model SVM makes 1 become zero and minimizes 2 Distance to the Optimal Hyperplane T a calculated as 0 From w x ibo the distance from the origin to the hyperplane is 7170 d cosxiwo m o Distance to the Optimal Hyperplane cont d o The distance from an arbitrary point to the hyperplane can be calculated as When the point is in the positive area T T x w b x w b r cosxwod 70 iiWoii iiWoii iiWoii When the point is in the negative area T r dHacH cosxwo iiWoii iiWoii 7 iiWoii Optimal Hyperplane and Support Vectors cont d o The optimal hyperplane is supposed to maximize the margin of separation p 0 With that requirement we can write the conditions that wo and b0 must meet wzxbo 2 1 fordi 1 wzxbo S 1 fordi 1 Note 2 1 and S 1 and support vectors are those X where equality holds ie wzxs b0 1 or 1 0 Since sz bo W0 1llell rd 1 1le rd 1 7 x WD 120 xTwO 120 
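The origin-to-hyperplane distance just stated generalizes to a signed distance for any point x. A tiny sketch (illustrative; the example plane and test points are made up):

```python
import numpy as np

def hyperplane_distance(w, b, x):
    """Signed distance r from point x to the hyperplane w.x + b = 0:
    positive on the side w points toward, negative on the other side.
    At x = 0 this reduces to b/||w||, matching the origin-distance formula
    up to sign."""
    return (w @ x + b) / np.linalg.norm(w)
```

For example, with w = (0, 1) and b = -1 (the line y = 1), the point (5, 3) is at signed distance +2 and the origin at -1.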
Optimal Hyperplane and Support Vectors Support eotors 0 Support vectors input points closest to the separating hyperplane o Margin of separation p distance between the separating hyperplane and the closest input point Optimal Hyperplane and Support Vectors cont d 19 6 Support eotors o Margin of separation between two classes is 2 p 7 7 Wall 0 Thus maximizing the margin of separation between two classes is equivalent to minimizing the Euclidean norm of the weight WO 8 Primal Problem Constrained Optimization For thetraining set T Xi di 1 find w and b such that 0 they minimize a certain value 1p while satisfying a constraint all examples are correctly classified Constraint diwai b 2 1 for i 1 2 N Cost function ltIgtw WTW This problem can be solved using the method of Lagrange multipliers see next two slides Lagrange Multipliers cont d Must find 70 y 04 that minimizes Fvyoz 00722 y722 oz7c2 y27 1 Setthe partial derivatives to 0 and solve the system of equations 8F 7 2x722ax0 870 8F 2y22ay0 8y 8F 002 y2 7 1 0 870 Solve for 70 and y in the 1st and 2nd and plug in those to the 3rd equation WWW xiy71oz7 so 1a 1a 7 from whichwe geta 2amp7 1 Thus 70 y 1 7 11 Mathematical Aside Lagrange Multipliers Turn a constrained optimization problem into an unconstrained optimization problem by absorbing the constraints into the cost function weighted by the Lagrange multipliers Example Find closest point on the circle 932 y2 1 to the point 2 3 adapted from Ballard An Introduction to Natural Computation 1997 pp 119 120 0 Minimize Fmy an 22 y 32 subject to the constraint 2 y2 1 O o Absorb the constraint into the cost function after multiplying the Lagrange multipliera meya 9622y 3204962y2 1 Primal Problem Constrained Optimization cont d Putting the constrained optimization problem into the Lagrangian form we get utilizing the KunhTucker theorem N Jwba wTW Zai diwai b 1 11 0 From 6JVvb a O N w Z aidixi i1 0 From 6Jvglb o O N Zaidi 0 i1 12 Primal Problem Constrained Optimization cont d 0 
Note that when the optimal solution is reached the following condition must hold KarushKuhnTucker complementary condition 04 d WTX b 1 0 for all i 1 2 N 0 Thus nonzero 04 s can be attained only when d wTX b 1 0 Le when the 04 is associated with a support vector Xs o Other conditions include 04 Z 0 Dual Problem Given thetraining sample X d 971 find the Lagrange multipliers 04 gv1 that maximize the objective function 1 N N N T QltOLgt ZZO4 O4jd dei Xj 1 j1 41 subject to the constraints N Zi1 O4 d 0 04 Z 0forallz39 127N o The problem is stated entirely in terms of the training data X d and the dot products XZTXJ39 play a key role Primal Problem Constrained Optimization cont d o Plugging in w 04 d x and 04 d 0 back into Jw7 b7 04 we get the dual problem Jw7 b7 04 WTW 7 04 d wa b 7 1 WTW 7 04 d wTX N N 17 241 aidz 241 04139 noting wTw 04 d wTX and from 04 d 0 7 41 aidiWTxi 04139 N N T N 7 241 231 aio jdidjx4 xi 241 04139 QM SoJltwbagt Qa a4 2 o o This results in the dual problem next slide 14 Solution to the Optimization Problem Once all the optimal Lagrange mulitpliers 0407 are found wo and b0 can be found as follows N W0 E ao4d4x4 41 and from WZX b0 d when X is a support vector b0 ds wfxltsgt Note calculation of final estimated function does not need any explicit calculation of wo since they can be calculated from the dot product between the input vectors N wa E 0407 d X1TX 41 16 Margin of Separation in SVM and VC Dimension Statistical learning theory shows that it is desirable to reduce both the error empirical risk and the VC dimension of the classifier o Vapnik 1995 1998 showed Let D bethe diameter of the smallest ball containing all input vectors Xi The set of optimal hyperplanes defined by wzx b0 0 has a V0 dimension h bounded from above as D2 hgmin rT mo 1 0 where is the ceiling pthe margin of separation equal to 2w0 and mg the dimensionality of the input space The implication is that the VC dimension can be controlled independetly of mg by choosing an 
appropriate large p 17 SoftMargin Classification cont d We want to find a separating hyperplane that minimizes N M 2 e 1 21 where 5 0 if 5 g 0 and 1 otherwise Solving the above is NPcomplete so we instead solve an approximation N W5 E 5239 21 0 Furthermore the weight vector can be factored in 1 T N 1095 5W W 025139 21 W Controls VC dim Controls error with a control parameter 0 19 SoftMargin Classification 1 Lnside maxgm cuuectly classified 0 Some problems can violate the condition diWTXi b 2 1 0 We can introduce a new set of variables 5i V1 diWTXi b 2 1 Ei where Ei is called the slack variable 18 SoftMargin Classification Solution 0 Following a similar route involving Lagrange multipliers and a more restrictive condition of 0 3 04 S C we get the solution Ns W0 E aoidiXi 11 b0 dill Ei szi Nonlinear SVM Input space x Feature space Nonlinear mapping of an input vector to a highdimensional feature space exploit Cover s theorem Construction of an optimal hyperplane for separating the features identified in the above step InnerProduct Kernel cont d The inner product cpTx Xi is between two vectors in the feature space The calculation of this inner product can be simpified by use of a innerproduct kernel K X xi mi KX7Xi ltPTXltpXi Z WAXMW j0 where m1 is the dimension of the feature space Note Kx xi KltXi X So the optimal hyperplane becomes N Zaidif x xi 0 i1 23 InnerProduct Kernel Input X is mapped to ltpx With the weight w including the bias b the decision surface in the feature space becomes assume 00X 1 wTltpx 0 Using the steps in linear SVM we get N w Zaidic xi 11 Combining the above two we get the decision surface N ZaidicpTxicpx 0 i1 22 InnerProduct Kernel cont d Mercer s theorem states that KX xi that follow certain conditions continuous symmetric positive semidefinite can be expressed in terms of an innerproduct in a nonlinearly mapped feature space Kernel function KX xi allows us to calculate the inner product cpTxltpxi in the mapped feature space without any 
explicit calculation of the mapping function Examples of Kernel Functions Kernel Example 0 Linear Kxxi xTxi 0 Expanding K09 Xi 1 XTXz392 o Polynomial KltX7Xi XTXZ 1 withx 001 x2T7xi 00117 mng 4 7 1 2 o RBFKXX1 iexp WHX Xlll Kltx7xw 1m m 1 2m1m2mlxi2 o Twolayer perceptron Kx xi tanh oxTxi 61 x x 2 270170 222220 for some 50 and 51 17 7 127 7 17 2 17 90317 i1i27 1702227 0021 ivwlT XT Xii Where 4Px 17 90 1311327 9037 0017 002 25 26 Nonlinear SVM Solution Nonlinear SVM Summary 0 The solution is basically the same as the linear case where Project input to highdimensional space to turn the problem into a XTXi is replaced with Kx xi and an additinoal constraint linearly separable problem that a S C 398 added Issues with a projection to higher dimensional feature space 0 Statistical problem Danger of invoking curse of dimensionality and higher chance of overfitting Use large margins to reduce VC dimension 0 Computational problem computational overhead for calculating the mapping cp Solve by using the kernel trick S39idel 0 Neural Networks with Temporal Behavior Haykin Chapter Neurodynamics 0 Inclusion of feedback gives temporal characteristics to neural networks recurrent networks CPSC 636600 0 Two ways to add feedback Instructor Yoonsuck Choe Local feedback Spring 2008 Global feed back 0 Recurrent networks can become unstable or stable Main interest is in recurrent network s stability neurodynamics Stability is a property of the Whole system coordination between parts is necessary Stability in Nonlinear Dynamical System Preliminaries Dynamical Systems 0 A dynamical system is a system whose state varies with time o Lyapunov stablity more on this later Statespace model values of state variables change over time 0 Study of neurodynamics Example m1t m2t JUN t are state variables that hold Determm39St39c neurOdynamICS eXpressed as nonhnear different values under independent variable t This describes a d39 erem39al equatlons39 system of order N and Xt is called the state vector 
The StoehaStie neurodynamics expressed in terms 0f StoehaStie dynamics of the system is expressed using ordinary differential nonlinear differential equations Recurrent networks equations perturbed by noise d jt Fjltjltt7j 1 2 N dt or more conveniently X dt Autonomous vs Nonautonomous Dynamical Systems 0 Autonomous does not explicitly depend on time o Nonautonomous explicitly depends on time F as a Vector Field Since can be seen as velocity Fx can be seen as a velocity vector field or a vector field In avector field each point in space X is associated with one unique vector direction and magnitude In a scalar field one point has one scalar value 0 Red curves show the state phase portrait represented by trajectories from different initial points 0 The blue arrows in the background shows the vector field Source httpIwwwmalhkuedubeerSodebicpilabplcthtml 7 State Space x1 It is convenient to view the statespace equation 31 Fx as describing the motion of a point in Ndimensional space Euclidean or nonEuclidean Note If is continuous The points traversed over time is called the trajectory or the orbit The tangent vector shows the instantaneous velocity at the initial condition 6 Conditions for the Solution of the State Space Equation A unique solution to the state space equation exists only under certain conditions which resticts the form of For a solution to exist it is sufficient for Fx to be continuous in all of its arguments For uniqueness it must meet the Lipschitz condition Lipschitz condition Let x and u be a pair of vectors in an open set M in a normal vector space A vectorfunction Fx that satisfies iiFx 7 Fuii S Kiix 7 uii for some constant K the above is said to be Lipschitz and K is called the Lipschitz constant for If 8Fi8mj are finite everywhere Fx meet the Lipschitz diti con on 8 Stability of Equilibrium States 0 5 E M is said to be an equilibrium state or singular point of the system if d E FXO Howthe system behaves near these equilibrium states is of great 
Near these points, we can approximate the dynamics by linearizing F(x) using a Taylor expansion around x_bar. Writing x(t) = x_bar + Dx(t),

    F(x) ~= F(x_bar) + A Dx(t),

where A is the Jacobian of F evaluated at x = x_bar. Since F(x_bar) = 0, the local dynamics are

    d(Dx(t))/dt ~= A Dx(t).

### Eigenvalues / Eigenvectors

- For a square matrix A, if a vector x and a scalar value lambda exist such that (A - lambda I) x = 0, then x is called an eigenvector of A and lambda an eigenvalue. Note that the above is simply A x = lambda x.
- An intuitive meaning: x is a direction in which applying the linear transformation A only changes the magnitude of x (by lambda), but not the angle.
- There can be as many as n eigenvector/eigenvalue pairs for an n x n matrix.

### Stability in the Linearized System

- In the linearized system, the properties of the Jacobian matrix A determine the behavior near equilibrium points. This is because d(Dx)/dt ~= A Dx(t).
- If A is nonsingular, A^{-1} exists, and this can be used to describe the local behavior near the equilibrium x_bar.
- The positive/negative and real/imaginary character of the eigenvalues of the Jacobian determines the behavior:
  - Stable node: real, negative eigenvalues.
  - Stable focus: complex eigenvalues with negative real parts.
  - Unstable node: real, positive eigenvalues.
  - Unstable focus: complex eigenvalues with positive real parts.
  - Saddle point: real eigenvalues of opposite signs.
  - Center: purely imaginary eigenvalues.

### Definitions of Stability

- Uniformly stable: for an arbitrary epsilon > 0, there exists a positive delta such that ||x(0) - x_bar|| < delta implies ||x(t) - x_bar|| < epsilon for all t.
- Convergent: there exists a positive delta such that ||x(0) - x_bar|| < delta implies x(t) -> x_bar as t -> infinity.
- Asymptotically stable: both stable and convergent.
- Globally asymptotically stable: stable, and all trajectories of the system converge to x_bar as time t approaches infinity.

### Lyapunov's Theorems

- Theorem 1: The equilibrium state x_bar is stable if, in a small neighborhood of x_bar, there exists a positive definite function V(x) > 0 such that its derivative with respect to time is negative semidefinite in that region.
- Theorem 2: The equilibrium state x_bar is asymptotically stable if, in a small neighborhood of x_bar, there exists a positive definite function V(x) such that its derivative with respect to time is negative definite in that region.
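The eigenvalue-based classification of equilibria above can be sketched for a 2x2 Jacobian (a hypothetical helper, not from the slides), using the closed-form eigenvalues lambda = (tr A +/- sqrt((tr A)^2 - 4 det A)) / 2:

```python
import cmath

def eigenvalues_2x2(A):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    tr = A[0][0] + A[1][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    disc = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + disc) / 2.0, (tr - disc) / 2.0

def classify(A, eps=1e-12):
    """Classify the equilibrium of d(Dx)/dt = A Dx by the eigenvalues of A."""
    l1, l2 = eigenvalues_2x2(A)
    if abs(l1.imag) > eps:          # complex-conjugate pair
        if abs(l1.real) <= eps:
            return "center"
        return "stable focus" if l1.real < 0 else "unstable focus"
    if l1.real < -eps and l2.real < -eps:
        return "stable node"
    if l1.real > eps and l2.real > eps:
        return "unstable node"
    return "saddle point"

print(classify([[-1.0, 0.0], [0.0, -2.0]]))  # stable node
print(classify([[0.0, 1.0], [-1.0, 0.0]]))   # center (purely imaginary)
print(classify([[1.0, 0.0], [0.0, -1.0]]))   # saddle point
```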
- A scalar function V(x) that satisfies these conditions is called a Lyapunov function for the equilibrium state x_bar.

### Attractors

- Dissipative systems are characterized by attracting sets or manifolds of dimensionality lower than that of the embedding space. These are called attractors.
- Regions of initial conditions of nonzero state-space volume converge to these attractors as time t increases.

### Types of Attractors

- Point attractors (an equilibrium point, e.g., at an energy minimum).
- Limit-cycle attractors.
- Strange (chaotic) attractors (not shown).

### Neurodynamical Models

We will focus on models whose state variables are continuous-valued and whose dynamics are expressed in differential equations or difference equations. Properties:

- Large number of degrees of freedom.
- Nonlinearity.
- Dissipative (as opposed to conservative), i.e., an open system.
- Noise.

### Hopfield Model

- N units with full connections among every node (no self-feedback).
- Given M input patterns, each having the same dimensionality as the network, the patterns can be memorized in attractors of the network.
- Starting with an initial pattern, the dynamics will converge toward the attractor of the basin of attraction where the initial pattern was placed.

### Manipulation of Attractors as a Recurrent-Network Paradigm

- We can identify attractors with computational objects (associative memories, input-output mappers, etc.).
- In order to do so, we must exercise control over the locations of the attractors in the state space of the system.
- A learning algorithm will manipulate the equations governing the dynamical behavior so that desired locations of attractors are set.
- One good way to do this is to use the energy-minimization paradigm (e.g., by Hopfield).

### Discrete Hopfield Model

- Based on McCulloch-Pitts model neurons with +1 or -1 outputs.
- The energy function is defined as

      E = -(1/2) sum_{i=1}^{N} sum_{j=1, j != i}^{N} w_ji x_i x_j.

- The network dynamics will evolve in the direction that minimizes E.
- Implements a content-addressable memory.

### Content-Addressable Memory

- Map a set of patterns to be memorized, {xi_mu}, onto fixed points {x_mu} in the dynamical system realized by the recurrent network.
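A tiny sketch of the Hopfield energy function above (arbitrary symmetric weights with zero diagonal; the values are hypothetical), showing that a single asynchronous sign update never increases E:

```python
def energy(W, x):
    """E = -(1/2) sum_i sum_{j != i} w_ji x_i x_j."""
    N = len(x)
    return -0.5 * sum(W[j][i] * x[i] * x[j]
                      for j in range(N) for i in range(N) if i != j)

def update_one(W, x, j):
    """Set x_j <- sgn(sum_i w_ji x_i); returns the new state."""
    v = sum(W[j][i] * x[i] for i in range(len(x)))
    y = list(x)
    if v != 0:
        y[j] = 1 if v > 0 else -1
    return y

# Arbitrary symmetric weight matrix with zero diagonal (hypothetical values).
W = [[0.0, 0.5, -0.3],
     [0.5, 0.0, 0.2],
     [-0.3, 0.2, 0.0]]
x = [1, -1, 1]
for j in range(3):
    y = update_one(W, x, j)
    print(energy(W, y) <= energy(W, x) + 1e-12)  # energy never increases
    x = y
```

For symmetric weights with zero diagonal, flipping x_j to the sign of its local field can only lower (or keep) E, which is why the dynamics settle into fixed points.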
- Encoding: mapping from the patterns {xi_mu} to the fixed points {x_mu}.
- Decoding: the reverse mapping, from the state space {x_mu} back to {xi_mu}.

### Hopfield Model: Storage

- The learning is similar to Hebbian learning:

      w_ji = (1/N) sum_{mu=1}^{M} xi_{mu,j} xi_{mu,i},   with w_jj = 0 for all j.

- Learning is one-shot.
- In matrix form, the above becomes

      W = (1/N) sum_{mu=1}^{M} xi_mu xi_mu^T - (M/N) I.

- The resulting weight matrix W is symmetric: W = W^T.

### Hopfield Model: Activation (Retrieval)

- Initialize the network with a probe pattern: x_j(0) = xi_{probe,j}.
- Update the output of each neuron (picking neurons at random, i.e., asynchronously) as

      x_j(n+1) = sgn( sum_{i=1}^{N} w_ji x_i(n) )

  until x reaches a fixed point.

### Spurious States

- The weight matrix W is symmetric, thus the eigenvalues of W are all real.
- For a large number of patterns M, the matrix is degenerate, i.e., several eigenvectors can have the same eigenvalue.
- These eigenvectors form a subspace, and when the associated eigenvalue is 0, it is called a null space.
- This is due to M being smaller than the number of neurons N.
- Hopfield network as a content-addressable memory: the discrete Hopfield network acts as a vector projector, projecting the probe vector onto the subspace spanned by the training patterns.
- The underlying dynamics drive the network to converge to one of the corners of the unit hypercube.
- Spurious states are those corners of the hypercube that do not belong to the training pattern set.

### Storage Capacity of the Hopfield Network

- Given a probe equal to the stored pattern xi_nu, the activation of the j-th neuron can be decomposed into a signal term and a noise term:

      v_j = sum_{i=1}^{N} w_ji xi_{nu,i}
          = (1/N) sum_{i=1}^{N} sum_{mu=1}^{M} xi_{mu,j} xi_{mu,i} xi_{nu,i}
          = xi_{nu,j} + (1/N) sum_{mu=1, mu != nu}^{M} xi_{mu,j} sum_{i=1}^{N} xi_{mu,i} xi_{nu,i},

  where the first term is the signal and the second is the noise.
- The signal-to-noise ratio is defined as

      rho = (variance of signal) / (variance of noise) = 1 / ((M-1)/N) ~= N/M.

- The reciprocal of rho, called the load parameter, is designated alpha. According to Amit and others, this value needs to be less than a critical value alpha_c = 0.14.

### Cohen-Grossberg Theorem

Cohen and Grossberg (1983) showed how to assess the stability of a certain class of neural networks.
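The one-shot Hebbian storage rule and the asynchronous sign-update retrieval described above can be sketched as follows (a toy implementation with +/-1 patterns, not the course's code):

```python
import random

def train_hopfield(patterns):
    """One-shot Hebbian storage: w_ji = (1/N) sum_mu xi_mu[j] xi_mu[i], w_jj = 0."""
    N = len(patterns[0])
    W = [[0.0] * N for _ in range(N)]
    for xi in patterns:
        for j in range(N):
            for i in range(N):
                if i != j:
                    W[j][i] += xi[j] * xi[i] / N
    return W

def recall(W, probe, sweeps=10, rng=random.Random(0)):
    """Asynchronous retrieval: x_j <- sgn(sum_i w_ji x_i), neurons in random order."""
    x = list(probe)
    N = len(x)
    for _ in range(sweeps):
        for j in rng.sample(range(N), N):
            v = sum(W[j][i] * x[i] for i in range(N))
            if v != 0:
                x[j] = 1 if v > 0 else -1
    return x

pattern = [1, 1, 1, 1, -1, -1, -1, -1]
W = train_hopfield([pattern])
noisy = list(pattern)
noisy[0] = -noisy[0]  # corrupt one bit of the probe
print(recall(W, noisy) == pattern)  # the stored pattern is recovered
```

With a single stored pattern and one corrupted bit, every neuron's local field points toward the stored pattern, so retrieval converges to it regardless of update order.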
A neural network whose dynamics take the form

    du_j/dt = a_j(u_j) [ b_j(u_j) - sum_{i=1}^{N} c_ji phi_i(u_i) ],   j = 1, 2, ..., N,

admits a Lyapunov function defined as

    E = (1/2) sum_{j=1}^{N} sum_{i=1}^{N} c_ji phi_i(u_i) phi_j(u_j) - sum_{j=1}^{N} integral_0^{u_j} b_j(lambda) phi_j'(lambda) d lambda.

### Cohen-Grossberg Theorem (cont'd)

For the definition in the previous slide to be valid, the following conditions need to be met:

- The synaptic weights are symmetric: c_ji = c_ij.
- The function a_j(u_j) satisfies the condition of nonnegativity: a_j(u_j) >= 0.
- The nonlinear activation function phi_j(u_j) needs to follow the monotonicity condition: phi_j'(u_j) = d phi_j(u_j) / d u_j >= 0.

With the above,

    dE/dt <= 0,

ensuring global stability of the system. The Hopfield model can be seen as a special case of the Cohen-Grossberg theorem.

### Storage Capacity of the Hopfield Network (cont'd)

- Given alpha_c = 0.14, the storage capacity becomes

      C = alpha_c N = 0.14 N

  when some error is allowed in the final patterns.
- For almost error-free performance, the storage capacity becomes

      C = N / (2 log_e N).

- Thus, the storage capacity of the Hopfield network scales less than linearly with the size N of the network. This is a major limitation of the Hopfield model.

### Demo

- Noisy input
- Partial input
- Capacity overload

## Slide09 Supplemental: Semantic Map (Haykin Chapter 9, Self-Organizing Maps)

CPSC 636-600, Instructor: Yoonsuck Choe, Spring 2008

### Semantic Map

- (Table 9.3: Animal Names and Their Attributes. Sixteen animals, including dove, hen, duck, goose, owl, hawk, eagle, fox, dog, wolf, cat, tiger, lion, horse, zebra, and cow, described by binary attributes such as size, 2 legs vs. 4 legs, has hair, hooves, mane, feathers, hunts, runs, flies, and swims. The individual table entries are garbled in the original.)
- The SOM can be trained with high-dimensional vectors representing multi-attributed entities such as animals.
- The resulting map can show interesting relationships among the animals.
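A toy sketch of the BMU and probing operations used in the semantic-map slides (hypothetical weight vectors and inputs, not the course's data): map each input to its best matching unit, and label every unit with the input it matches best:

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bmu(weights, x):
    """Best matching unit: index of the weight vector closest to input x."""
    return min(range(len(weights)), key=lambda k: sq_dist(weights[k], x))

def probe(weights, inputs):
    """For each map unit, the index of the input it matches best (every unit
    gets a label, even units that are no input's BMU)."""
    return [min(range(len(inputs)), key=lambda n: sq_dist(weights[k], inputs[n]))
            for k in range(len(weights))]

# A "trained" 1-D map of 4 units over 2-D attribute vectors (hypothetical).
weights = [[0.9, 0.1], [0.6, 0.3], [0.3, 0.7], [0.1, 0.9]]
inputs = [[1.0, 0.0], [0.0, 1.0]]  # two "animals"

print([bmu(weights, x) for x in inputs])  # each input's BMU
print(probe(weights, inputs))             # a label for every unit
```

The probe assigns a label to all four units even though only two are BMUs, mirroring how the probed semantic map partitions the whole grid into coherent regions.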
### Semantic Map: BMUs and Probed Map

- (Figures: two map grids with units labeled by animal names, e.g., dog, fox, cat, wolf, eagle, owl, hawk, tiger, lion, dove, hen, horse, zebra, goose, duck, cow. Left: feature map containing labeled neurons, with each neuron labeled by its best matching input. Right: semantic map obtained through simulated electrode penetration mapping; the map is divided into three regions representing birds, peaceful species, and hunters.)
- The sixteen inputs are mapped to their best matching units (BMUs) on the map.
- Try guessing what their neighboring units' weight vectors would look like.
- Probed map: find out, for each SOM unit, which animal it responds to maximally.
- For this kind of probe, every unit will have an animal assigned, even when it is not the BMU for any of these animals.
- Note the partitioning of the map into coherent categories.

## 636 Midterm Review

- When and where: 3/4/2008 (Tuesday), 12:45pm-2:00pm, in HRBB 126. If everyone arrives early, we can begin early.
- Material: Haykin chapters 1 through 5, with focus on the material discussed in the lecture only.
- Things to bring:
  - 1 sheet of notes on US letter paper; you may fill both sides, with no limit on font size.
  - Scientific calculator (may not be necessary).
  - Your ID (TAMU ID or Texas driver's license).

### Chapter 1: Introduction

- Benefits of neural networks
- Neural networks and brain organization
- Model neurons
- Neural network definition
- Similarity measures

### Chapter 2: Learning

- Learning in general
- Five learning rules: error correction, Hebbian, memory-based, competitive, and Boltzmann
- Learning paradigms: credit assignment problem, supervised and unsupervised learning
- Memory: associative memory, correlation matrix memory, autoassociation
- Statistical nature of learning: bias-variance dilemma, VC dimension, PAC learning

### Chapter 3: Perceptrons and Linear Adaptive Filters

- Steepest descent
- Gauss-Newton method and linear least-squares fit (pseudoinverse)
- Perceptron: geometric interpretations (two ways)
- Perceptron: the role of the bias
- Linear separability and limitations of the perceptron
- Perceptron learning rule
- Perceptron convergence theorem (geometric insight)

### Chapter 4: Multilayer Perceptrons

- Hidden-unit representation (combination of hidden units)
- Error gradient and backpropagation
- Momentum
- Batch vs. sequential mode
- Representational power
- Training recurrent networks with backprop
- Function approximation with backprop
- Generalization and overfitting
- Curse of dimensionality
- Virtues and limitations of backprop

### Chapter 4: Multilayer Perceptrons (supplemental)

- Heuristics for better backprop
- Weight normalization
- Backprop as optimization
- Conjugate direction method
- Conjugate gradient method

### Chapter 5: Radial-Basis Function Networks

- Cover's theorem (separability)
- Exact interpolation
- Regularization theory
- Generalized RBF
- Two phases in RBF learning
- Learning RBF centers
