Artificial Intelligence CECS 451
Popular in Course
Popular in Computer Science and Engineering
This 80 page Class Notes was uploaded by Zackary Cronin on Monday October 5, 2015. The Class Notes belongs to CECS 451 at California State University - Long Beach taught by Staff in Fall. Since its upload, it has received 17 views. For similar materials see /class/218761/cecs-451-california-state-university-long-beach in Computer Science and Engineering at California State University - Long Beach.
Reviews for Artificial Intelligence
Report this Material
What is Karma?
Karma is the currency of StudySoup.
Date Created: 10/05/15
Lecture 8 Decision Trees Decision Trees 0 used for classifying data into discrete categories 0 an internal node of the tree represents a question about the feature vector whose answer dictates which child node is next considered ie which question is next asked 0 each leaf node represents a potential classi cation of the feature vector 0 alternatively from a logical perspective7 decision trees allow categories to be repre sented as logical disjunctions of conjunctions of attributes 1mm we at mam 39 N xquot X f 3 m 1 mm an an xquot rr z quotN k 18013111 l Teammpmemlhh 39 l Cmiimlmm d39 P g 1 w l 3 lquot yuf um z r m t xx x is f f39 u m yam 31m by m y Characteristics of problems suitable for decision tree learning 0 instances are represented by feature vectors whose feature components are preferably discrete 0 instances are discretely classi ed 0 a disjunctive description represents a reasonable means of classi cation 0 errors may exist7 such as missing attribute values or erroneous classi cation of some training vectors Entropy as a Measure of Information Content Entropy of a random variable Let X be a random variable that takes on values from the set x17z27 zn with respective probabilities 101102 710 where sumZpl 1 Then the entropy of X7 HX7 represents the average amount of information contained in X and is de ned by HX Zpilogzpi i1 Justi cation for the de nition of HX Claude Shannon 19122001 Pioneer in applying Boolean logic to electronic circuit design studying the complexity of Boolean circuits signal processing determined lower bounds on the amount of samples needed to achieve a desired estimation accuracy game theory inventor of minimax algorithm information theory rst to give a precise de nition for the concept of information coding theory channel coding theorem7 optimal encoding schemes based on entropy Entropy Example 1 calculate the amount of information contained in a weather forecast if the possibilities include normalfaz39mf09h0tw ndy if there respective probabilities are 97 0501027 and 02 Entropy Example 2 recall the 12 balls puzzle Which of the six possible initial balancing con gurations provides the most information to the user Entropy of a collection of classi ed feature vectors given a set of traning feature vectors S suppose that the vectors are classi ed in 0 different ways7 and that pl represents the proportion of vectors in S that belong to the 239 th class Then the classi cation entropy of S is de ned as Entropy Example 3 vectors Weight medium heavy medium light medium light heavy medium heavy medium medium medium medium color orange green green red orange red green red yellow yellow red green orange EntropMS 7 EM logzpi i1 calculate the classi cation entropy for the following set of feature texture smooth smooth smooth bumpy bumpy bumpy rough smooth smooth smooth smooth smooth rough classi cation orange melon apple berry orange berry melon apple melon orange apple apple orange Quinlan7s ID3 algorithm at each phase of the construction of the decision tree and for each branch of the decision tree under construction the attribute which is considered next is that attribute which 1 has yet to be considered along that branch 2 will provde the largest information gain Information gain let A be an attribute considered here as a discrete set of possible values Then the information gain relative to A and a set of feature vectors S is de ned as Sa S EntropySa IGA S EntropyS 7 Z aeA where 5 represents the set of feature vectors 1 E S such that 11A a Decision Tree Example Using the table of feature vectors from Entropy Example 3 and the concept of information gain construct a decision tree for classifying fruit as either apple orange melon or berry Inductive Bias in Decision Tree Learning The hypothesis space of decision trees is complete in the sense that for every set S of feature vectors there exists a decision tree T5 which correctly classi es all of the training vectors 0 Bias is given towards those trees which are shorter and for which place towards the root attributes with high information gain 0 Preference bias a bias towards hypotheses based upon properties they may possess Restriction bias a bias towards hypotheses which restricts them to have a particular type of form Occam7s Razor when choosing amoung competing theories to explain a set of data7 choose the most simplest one that is consistent with the data Longer hypothesis also run the risk of over ttmg the data Hypothesis h E H is said to over t the data if there exists an alternative hypothesis h E H for which h has smaller error than h over the training data7 but h outperforms h over the entire distribution of test cases Rule postpruning steps H develop the decision tree without any concern towards over tting D convert tree into an equivalent set of rules 9 prune each rule by removing any preconditions whose removal improves accuracy 4 sort rules by their estimated accuracy and use them in this sequence Split Information let A be an attribute7 considered here as a discrete set of possible values Then the split information relative to A and a set of feature vectors S is de ned as Sal SIA S i Z W ISA Sm logl Sl where Sa represents the set of feature vectors 1 E S such that 11A a For attributes that take on many values7 using the ratio instead of IGS7 A can help avoid favoring rnany valued attributes which do not perform well in classifying the data Lecture 10 Pattern Recognition II Recall from Lecture 9 some of the goals of the science of pattern recognition 0 classify objects and data into different categories 0 nd patterns in data for the purpose of knowledge discovery and prediction 0 nd patterns in sets of experience feedback data for the purpose of learning and evolving o Associative memory the ability to recall past informationexperience based on exposure to patterns similar to those which partially comprise the stored informa tionexperience In the previous lecture we examined more closely the problem of data classi cation7 and found that neural networks represent a general purpose tool for solving classi cation problems We now examine two other goals of pattern recognition Associative Memory Hop eld neural networks unlike feedforward neural networks Hop cld networks do not have restrictions imposed in terms of which neurons a neuron may or may not connect with Moreover each neuron plays the role of a ip op digital gate in that it persists in either an on77 or off77 state and changes from o 7 respectively on to on77 respectively off whenever its input strength exceeds respectively falls below its threshold value More properties include o the connection weights are symmetric ie if wij is the weight strength between neuron 239 and neuron j then it equals wji the connection weights take on real values that may either be positive or negative the changing of state for each neuron may happen synchronously or asynchronously the cngcry function of the network is de ned as E 7 Z LUHSZ39SJ39 295 Ki 139 where s and 9139 respectively represent the state and threshold of the 239 th neuron Hop eld network Example 1 for the following synchronous Hop eld network show how the network evolves to a nal state There are four neurons 1234 and the set of connections is 17272737374 The weights are LU12 751023 751034 7875 Assume that the initial state is given by 1011 Theorem given a Hop eld neural network with n neurons and an initial state 50 E 01quot7 the network will evolve to a nal state sf 6 01 where sf is a local minimum for the networks energy function In other words changing any one bit one neuron state of sf will produce an increase in total energy Proof Boltzman machine similar to a Hop eld network7 but state changes are now nondeter ministic7 and the k th neuron changes from a high state to a low state with probability 1 PIFWv where T is the current annealing temperature Simulated annealing simulating the process of allowing molecules to lose energy slowly which allows for an alignment that has less total energy Hop eld network Example 2 design a Hop eld network that recalls either 11111 or 000007 depending on which bit is in the majority with respect to the input pattern For example7 an input pattern of 10111 should induce a memory recall network evolution of 11111 Hop eld network Example 3 design a Hop eld network that recalls either X or 07 where either Character is written on a 4 gtlt 4 grid of pixels Finding Frequent Patterns in Data Finding frequent patterns in data represents a relatively new eld in the area of machine learning and arti cial intelligence More popular names given to this endeavor include data mining and knowledge discovery One way in which this area differs from pattern classi ca tion is that the machine quite often does not know a priori what type of patterns it is looking for On the other hand a classi cation system such as a character recognition system is programmed a priori to nd speci c types of patterns such as alphanumeric characters Let X1X2 Xk be a collection of sets A pattern 15 is a k dimensional vector whose i th component either belongs to X or is equal to the star symbol 77 which is assumed to not be a member of any of the sets Moreover feature vector i 6 X1 gtlt X2 gtlt gtlt Xk is said to contain pattern p iff for every i 92 7 17 5 17 Given a collection S of feature vectors from feature space X X1 gtlt X2 gtlt gtlt Xk pattern 15 is called w frequent with respect to S iff at least a fraction omega of the feature vectors in S contain 15 Frequent pattern Example 1 let X1 represent the set of states in our union X2 and X3 equal the set of letter grades A F Finally X4 is the set 01 For each student 5 in this class assign the following feature vector 29 X1XgtltX4 where is is the state where the student attended high school is is the grade the student received in hisher highest high school history class is is the grade the student received in hisher highest high school math class and 253 is 1 iff and only if the student took a programming course during high school Then it is likely that 17 California A 1 is a 33 frequent pattern for this set of vectors The Partial Derivative of a Set of Feature Vectors Given two feature vectors 2 and iJOf same dimension7 the similarity between i and if denoted 527113 is a vector whose i th component equals gtk if i 7 y and is equal to 21 otherwise Let S be a set of feature vectors and i an element of S Then the partial derivative of S with respect to f denoted by is de ned as the set of vectors 6S a a E 7 Sam 6 8 Partial derivative Example given the set S Wim7 tree told7 four7 five7 nine of feature vectors each of the four features is a letter of the alphabet7 compute where f tree Theorem let S be a set of feature vectors and be an w frequent pattern of S Then there exists 2 E S such that 15 is an w frequent pattern of Conversely7 for any vector i E S the set of w frequent patterns of are w frequent patterns of S Proof Theorem equality of mixed partials given two feature vectors 2 and 17 in S 625 7 625 62653sz erase Moreover7 the frequent patterns of this second order partial derivative are those frequent patterns contained in both i and 17 Proof PD Algorithm for nding frequent patterns in a set S of feature vectors Inputs 0 a set S of feature vectors7 including their assigned weights 0 a number an such that 0 S u S 1 o the set P of w frequent patterns that have been found thus far 0 m7 the minimum length of an acceptable pattern 0 vector 17 from which the set S has been derived from Effect instantiate P so that for each i 6 S7 if 2 contains an w frequent pattern7 then at least one of the largest w frequent patterns contained in f is contained in P FindFrequent PatternsS7 wPm7J mark 17 as used77 eliminate component 239 from each vector if there does not exist an element a E Xi for which i a for at least a fraction w of vectors in S sort the vectors using the lexicographical ordering merge all identical vectors in doing so sum their weights and assign it as the weight of the new representative delete any vector from the list whose number of de ned components is less than m o for any vector 2 whose weight is at least w lSl7 add fto the set P of frequent patterns and mark 2 as used for each vector i that is unused if there exists pattern 15 E P that is contained in f gtk let l be the number of de ned components of the largest such pattern gtk FindFrequentPatternswPl Li 7 else gtk FindFrequentPatternswPmi 0 end of algorithm Frequent pattern Example use the PD algorithrn for nding a set of 5 frequent patterns for the following set of binary feature vectors S 1011 0111 1001 1111 1011 0011 1001 1000 1011 0110 Lecture 3 Problem Solving With Prolog A Few More Aspects of Prolog The Cut Symbol i when placed in the body of a prolog statement7 it prevents back tracking at the point of where it is placed Cut Symbol Example For the following program7 show how the cut symbol can be placed to eliminate unnecessary backtracking for the goal list 77 f1YYlt2 Program fxo 7 Xlt3 fX2 7 3ltXXlt6 fx4 7 6ltX Using cut and fail to handle exceptions the simple prolog statement fail is always unsatis able and placing the cut operator before it will ensure that goals that are net meant to be satis ed will not be Cut and Fail Example Tam loves all ice cream avors except rocky road Express this in a prolog program so that the goal 77 lovestam7 X will be answered by yes77 when X is any ice cream except for rocky roadl7 and n077 when X is rocky road De ning negation notP 7 Pfailtrue Lecture 3 Input and Output Goals see filename causes input to switch from current input stream to using filename as input stream tell filename causes output to switch from current output stream to using filename as output stream readX reads the next term T on the input stream and instantiates X to T write X writes the term X on the current output stream tab N and n1 make tabs and new lines respectively on the output stream Problem Solving With Prolog Blocks Example A robot has the task of stacking cubic blocks in some predetermined order based on their color Initially a surface has one or more stacks of blocks7 and the robot is to stack and unstack one block at a time these blocks to make the desired stack Write a prolog program that will input a desired color sequence of blocks7 and output a list of actions stacking or unstacking speci c blocks needed to obtain the desired stack Problem Solving With Prolog Set Me Example The game of Set Me is played with a deck of eighty one cards each having the following four characteristics 0 Symbol diamonds ovals or squiggles 0 Count 1 2 or 3 symbols 0 Color red green or purple o Shading outlined lled or striped The cards are shuf ed and a tableau hand of twelve cards is laid out Players then attempt to be the rst to identify sets77 which exist in the tableau Sets are removed as they are identi ed and new cards are dealt in their place Play continues in this manner until all cards have been used The winner is the player with the most sets A set is a collection of three cards in which each characteristic is either the same on all three cards or different on all three cards Write a prolog program that will nd all sets for a given input of one tableau Problem Solving With Prolog FarmerLionGoatCabbage A farmer must cross a river with a lion7 goat7 and cabbage7 but can only take one item each trip in his boat Moreover7 the farmer cannot leave the lion alone with the goat7 or the goat alone with the cabbage Write a prolog program showing the sequence of trips he should take in order to safely get all three across the river without problems Lecture 9 Pattern Recognition One area of research that is closely related to and often overlaps with machine learning is that of pattern recognition The importance of pattern recognition to machine learning and intelligence in general cannot be overstated Pattern recognition the science whose goal involves the theoretical and empirical un derstanding of how to create systems and algorithms that are able to do7 amoung other things7 o classify objects and data into different categories 0 nd patterns in data for the purpose of knowledge discovery and prediction 0 nd patterns in sets of experience feedback data for the purpose of learning and evolving 0 create memories of patterns which can be retrieved upon exposure to similar patterns Applications of Pattern Recognition 0 Machine Vision nding defects in VLSl chips7 blood testing7 character recognition7 robotics 0 Voice recognition 0 Data mining learning the preferences of a shopper7 gene analysis7 discovery of trends7 predicting weather and markets 0 security pro ling7 ngerprint analysis Pattern Classi cation As we have already seen7 pattern classi cation represents one of the main problems in both machine learning and pattern recognition The following represents the individual tasks involved with designing a pattern classi er Pattern 39 139 Hon design suppose the problem involves classifying letters written on an electronic notepad sensors a 21 gtlt 21 matrix of cells pixels which lie underneath the surface of the notepad The voltage state of each cell changes from off to on when a metallic pen makes contact directly above the cell feature generation the cell matrix causes the formation of a 21 gtlt 21 binary matrix to be stored in memory This pixel data represents the raw7 generated feature feature selection for the purpose of classi cation7 more useful features can be se lected from the raw feature For example7 the raw feature may be divided up into 9 7 gtlt 7 submatrices7 and a feature may be the presence or absence of a subpattern eg a horizontal segment of four ones classi er design once the features are selected7 they form what is called a feature vector7 which can then be input into a classi er7 which will output a classi cation Examples of classi ers include decision trees7 Boolean functions7 and neural networks evaluation testing must be performed to see ifthe percentage of correct classi cations is suf ciently high Testing will inevitably cause re evaluation of all previous stages of development Classi cation System Architecture Arti cial Neurons as Classi ers Arti cial Neurons are very similar to Boolean gates and or and inverter gates in that they have inputs and an output The main difference though is that they can receive real valued inputs output real values and have their functionality modi ed through a traininglearning process Nunan Farming z Clumiul Synaps r synapses W 1mm mHerin l heme 3991st x L rquot 1quot ck Kg mgelrn riftj sham h 7 r 330 Jquot dendm 25 If nerve tell Dang I TllJIflElJS Arti cial Neuron Output Arti cial Neuron a mathematical construct thought of as a linear threshold function which was initially intended to model the behavior of a biological neuron neuron has the tendency to synapse upon receiving a suf cient number ie A biological a number which exceeds its thresholcO of synaptic impulses from neighboring neurons Properties of an arit cial neuron associated with each neuron n is a weight vector 13 Moreover7 the dimension h of 13 represents the number of neurons or environmental impulses that feed into n each neuron n also has a threshold T If the strength of the neuron synapses feeding into n exceeds T7 then T itself will synapse ie output a value the strength of the neuron synapses feeding into n is given by the dot product 13 f where 21 represents the i th input into n7 which is also the output of the i th neuron feeding into n a discrete neuron will output 1 when 13 f gt T7 and 0 otherwise a smoothed neuron will output f13 f 7 T where f is a smoothing function7 such as A discrete neuron with weight vector 13 and threshold T may be used as a type of linear classi er For example7 given a feature vector i the vector may be classi ed into one of two categories depending on whether or not 13 f gt T But how to choose 13 and T for a given set of training vectors Perceptr39on Neuron Learning Algorithm Let W1 and W2 be two sets of feature vectors Then W1 and W2 are said to be linearly separable iff there exists vector 13 such that 113i gt 07 for every 26 W1 213flt 07 for every 26 W2 Rosenblattls Perceptr39on Learning Algorithm has the goal of nding vector 13 which linearly separates W1 and W2 Note 13 will include a component that represents the threshold 0 for each vector i in W1 U W2 add another component to f and assign it the value 1 o initialize 130 to a random vector t 07 and correct 0 o whilec0rrect lt Wl U ng get next vector i in L if13t f gt 0 and f 6 W2 gtk tt1 correct0 gtk 13t 13 t 7 f 7 else if13t f lt 0 and f 6 W1 then gtk tt1 correct0 gtk 13t 131 f 7 else correct o 13 13t 0 return 13 Perceptron Learning Example use the perceptron learning algorithm to nd a neu ronline that linearly separates W1 711 723 13 from W2 3 71 45 Arti cial Neural Networks as Classi ers An arti cial neural network is simply a collection of neurons for which the outputs of some neurons serve as the inputs to other neurons7 to potentially form a complex dynamical system They represent the most important class of nonlinear classi ers Why neural networks 0 represent an attempt to model a brain7s structure and functionality in hopes that they will be enabled with the same classi cationalpattern recognition ability that most brains possess experiments have shown they classify well in cases where the feature vectors possess errors and noise 7 represent spatial andor audio data computing with neural networks is parallel and distributed7 and thus is ideal for parallel processing learning algorithms exist for neural networks eg backpropagation Feedfor39ward Neural Network consists of a matrix of neurons with the following con nectivity properties 0 inputs to column 1 of the matrix consiste of environmental impuleses in the form of feature vectors 0 outputs from the last column of the matrix represent the output of the network 0 for 239 gt 17 inputs into a neuron in column 239 of the matrix consist of outputs from column 2397 1 of the matrix 0 each column is called a layer of the network The rst column represents the input layer7 while the nal column is called the output layer All other layers are called hidden In quot13 I Freeman Ixnfa I 3 in39 lmz39 Feedforward neural networks represent the most popular neural network architecture for the purpose of classi cation Feedforward Example 1 the EXOR problem Provide a two layer neural network which correctly classi es the vectors in W1 01 10 from W2 0 0 1 Theorem any two classes of vectors W1 and W2 can be correctly classi ed by a three layer neural network Proof 1 The rst layer rnaps real vectors to vertices of a hypercube 2 The second and third layers provide a surn of rninterrns expansion for those vertices of the hypercube which correspond to W1 Feedfor39war39d Example 2 for the following sets W1 and W2 provide a three layer network which correctly classi es the sets Training Feedforward Neural Networks Backpropagation The goal of backpropagation involves iteratively adapting the weights of each neuron to minimize a chosen error function Sumsquared error function let 21 2 be the set of training vectors7 and let represent the j th component of the output ofthe neural network using 21 as input Moreover7 let 5a be the desired j th component of the correct classi cation of i Then the error associated with 21 and induced by the network is given as m n a a a 2 52 201 0WD 7 j1 where m is the number of components of the networks output layer Furthermore7 the sum squared error function for the entire training set is given as Assuming that the training set remains constant7 we may think of E as a function of the weight vectors of all the neurons of the network and training the network involves nding an instantiation of the weight vectors so that E is a minimum The most common algorithm for doing this is called gradient descent ie taking the gradient of E with respect to all the weight vectors and changing the weights in the direction of the gradient Doing this requires using the chain rule for partial differentiation7 which allows for the weight updating to propagate backwards Once the weights are updated via backpropagation7 must be recomputed for each training vector i so that E can be re computed7 and the backpropagation process repeats Backpropagation stops when either H AE7 the change in E from one stage to the next7 becomes smaller than some threshold7 or E0 the number of stages has exceeded some tolerance threshold7 or 9 the computed error E becomes smaller than some threshold Lecture 4 Arti cial Intelligence Search Strategies The Importance of Search in AI Searching through a state space represents a primary means of problem solving when there lacks a more speci c algorithm for solving the problem Take for example the puzzle Rubiks Cube Although there is a well known algorithm for solving this puzzle7 it might be asking too much from an intelligent agent to inductively or deductively learn the algorithm A more realistic approach simply involves having the agent search through the space of cube rotation sequences until a solution is found Of course7 the quality of the search will decide if a solution is found Two fundamental qualities of a search 0 Complexity the amount of time and space required to search through a given search space 0 Completeness refers to the liklihood that a search will be successfull given that a solution exists Promising search strategies tend to strike a balance between complexity and completeness Search State Space a state space is a quadruple graph S V7 ElbG7 where 0 V is a set of vertices or states of the graph7 o E is the set of graph edges7 0 V5 is the set of initial states from where a search begins7 and o G C V is the set of goal states 0 V2 is a successor node of V1 iff l17V2 E E solution path in a search space a path connecting an initial state to a goal state Note goals can be de ned by a measurable property of a state7 or by a measurable property of the path developed by the search Part of the Eights Puzzle77 search space i r gt J Willi Depth rst search the next state examined is a successor of the most recently examined state that has an unexamined child 0 low space complexity 0 used when goal states tend to have large depths 0 tends to miss short solutions 0 low space complexity Inductive de nition for a depth rst search solution path sol let N be an arbitrary node7 then 0 basis step if N is a goal node7 then sol N o inductive step if there is a successor node N1 of N7 and soll is a path from N1 to a goal node7 then sol N solll Prolog template for depth rst search Example de ne the prolog relation sN7 N1 where the state space is that used in the sliding tiles77 puzzle Revised Prolog template for depth rst search this template prevents cycles and allows for depth limitations Iterative deepening search depth is gradually increased over time Breadth rst search 0 successors of already examined nodes are kept in a FIFO queue in contrast we may think as depth rst search as placing nodes in a LIFO stack 0 used when priority is given to nding minimal paths to a goal 0 tends to be ineffective for problems whose goal depths are sizable 0 high space complexity Search Example For the following state space7 show the order in which the nodes would be visited for both a depth rst and a breadth rst search BestFirst Search Similar to breadth rst search7 but a heap ie a queue whose members are given priority is used rather than a queue Best rst search generally requires the use of heuristics for assigning a measure which mea sures how close a node is to a goal state to each node of the state space Let S VEVbG be a state space Then function h V a R is called an evaluation function for the state space provided hn represents a measure of how close node n is to a goal state Heuristics o eurisco Greek for I discover 0 methods and rules for invention and discovery used when either no exact solution exists due to noisy data7 or when nding an exact solution is computationally infeasible Example provide some heuristic evaluation functions for the eights puzzle Qualities of Search Algorithms Admissibility an algorithm is called admissible if it always nds the path to the goal if it exists which is of minimum cost Theorem 1 best rst search in conjunction with evalution function fn is an admissible algorithm7 where fn gn hn7 where gn represents the minimum cost in traversing from the initial state to 717 and hn represents the minimum cost of traversing from n to a goal state Proof A algorithms an algorithm is called A if it uses best rst search with evaluation function fn Mn 971 where 9 represents the minimum cost in traversing from the initial state to n and hn S hn7 where hn represents the minimum cost of traversing from n to a goal state Theorem 2 A algorithms are admissible Proof Exercise Monotonicity a search algorithm is called monotone if it uses best rst search with fn Mn 971 where h has the following properties 0 7 hnj is no larger than the acutal cost of moving from m to n 0 119 0 for every 9 E G Theorem 3 f is monotone increasing along any path from the initial to a goal state Proof Theorem 4 Monotone algorithms are A ls the converse true Exercise Proof Informedness suppose hl and hg are two A heuristics We say hg is more informed than hl iff MM 3 h2n for every node n E V A Modern View of Heuristic Search for very complex problems7 basic search tech niques ie depth7 best7 and breadth rst have proved inadequate and this represents the main reason why prolog has failed to enter the mainstream of the computing world These deterministic search procedures are giving way to random search procedures A few are described below 0 genetic algorithms genetic algorithms sample paths from the search space and choose new paths from the most t77 paths already considered The new paths are obtained via recombination and mutation simulated annealing uses a simple hill climbing approach Moves are allowed to proceed downhill with a given probability P which tends to zero over time thus al lowing for more variation and moving out of local minima early in the search7 and becoming more restrictive later in the search 0 tabu search uses the concept of neighborhood and tries to nd the goal within that neighborhood If it fails7 it uses memory techniques to try to permanently avoid entering the neighborhood again Lecture 7 Introduction to Machine Learning Machine learning the study and development of algorithms that allow computers to evolve and improve at performing a given task by using past experience Examples include recognizing shapes images and spoken words driving a vehicle classifying new data based on an already existing classi cation system game playing Learning vs Understanding and Insight 0 Understanding involves the cognitive capacity to formulate knowledge of why certain learning steps proved conducive to reaching a goal and carries over to similar tasks 0 On the other hand learning has the less ambitious task of discovering a set of behavioral constraints which are conducive to attaining a goal Does not require re ection on the signi cance of the constraints within the context of the problem De nition of learning a program is said to learn form experience E with respect to some class of tasks T and a performance measure P if its performance at tasks in T as measured by P improves with experience E Learning Example 0 T shooting free throws o P measures how close a free throw came to entering the goal 0 E represents the collection of previous free throw attempts Each attempt contains information about trajectory initial velocity ball rotation and is paired with a per formance measure 0 Method of learning perform quadratic regression on E Assume that the func tion PE is a quadratic surface that is concave downward Then locate the global maximum of the best such surface that ts the data Supervised vs Unsupervised Supervised learning experience is controlled by a teacher key learning queues are provided by the teacher Unsupervised learning is determined through empirical inquiry and observation External Learning vs Internal Learning External task7 feedback7 and performance measure can be observed by an outside party but learning mechanisms often remain hidden Internal task7 feedback7 and performance measure exist as part of the internal learn ing mechanism of the machine and is not necessarily observable learning an external task often requires a hierarchy of internal learning tasks Alternative de nition of learning The process of evolving a set of internal constraints based upon the performance results P of past experience E while performing task T which will induce a machine to perform task T at optimal performance Pom Target function for learning a task a function f that is similar to a search evaluation function7 in that it helps the machine determine its next action for when it gives its next performance Quite often learning is reduced to evolving a target function until satisfactory performance is met inputs current state of the system7 including all knowledge feedback obtained up to that point outputs next step to be taken by the machine must be computationally feasible to be of any use major issue when designing a learning machine how to choose a representation of the target function f Learning hypothesis a tentative assumption made with regards to how to best perform a task The hypothesis usually exists in the form of an evaluation function7 but may also take other forms Hypothesis space the set of all possible learning hypotheses Alternative de nition of learning II a search through a hypothesis space which is con tinuously being modi ed by experience and feedback Successful learning has occurred when the complexity ofthe underlying search space has been reduced by the feedback information Major issues that drive the research in machine learning 0 How to represent a learning target function What learning algorithms exist for evolving a target function based on feedback 0 Which evaluation functions work best for which problems How to prove that a learning algorithms will ef ciently converge to an acceptable level of performance How much trainingexperience is needed to effectively evolve a target function How to choose the next training experience Concept Learning and Generalto Speci c Ordering Concept Learning lnferring a Boolean valued function from the training examples of its input and output Concept Learning Example 1 give a Boolean algebraic expression for the function flV7 C T which is true if and only if the piece of fruit with characteristics weight lV7 color 0 and texture T is a watermelon Ingredients of a concept learning task attributes and features used to describe each instance of the domain and to help classify negative and positive instances of the domain Hypothesis space H consists of a restricted class of Boolean valued functions h X a 01 where X represents the feature space Target concept 0 verges to X a 01 the concept for which the learning algorithm con Training examples 1a12a2 mam where each 6 X and 04 6 01 where oz 1 respectively 0 represents a positive respectively negative example Inductivelearning hypothesis any hypothesis found to approximate the target function well over a suf ciently large set of training examples will also approximate the target function well over other unobserved examples Mitchell7s Generalto Speci c Ordering of Hypotheses De nition of hypothesis space suppose the feature space X is n dimensional Then the hypothesis space consists of n tuples of the form lt 12 zn gt where r is either an element from the 239 th feature set or r 6 07 where Q excludes all elements from the features set while 7 includes all elements from the feature set From here on we assume this hypothesis space Hypothesis Space Example suppose the feature space has three dimensions weight light medium heavy color red green yellow and texture rough smooth bumpy Using the hypothesis space de ned above provide an hypothesis for the concept of apple Generalto speci c ordering of hypotheses hypothesis hl is said to be more general than hypothesis hg iff Vxh2 1 h1x 1 Algorithms for nding the target concept in the generalto speci c hypothesis space FindS Algorithm 1 initialize h to the most speci c hypothesis in H 2 for each positive training example z o for each attribute constraint al in h7 if the constraint is satis ed by x then do nothing Otherwise7 replace al in h by the next more general constraint that is satis ed by x 3 output hypothesis h Theorem 1 assuming there exists a target hypothesis 0 which correctly classi es all exam ples from the feature space7 Find S algorithm will converge to 0 Proof FindS Example use the nd S algorithm to nd a target hypothesis for the concept of apple using the following ordered training examples 1 medium7 red7 smoothyes 2 hecwy7 green7 smoothn0 3 medium7 green7 smooth7 yes CandidateElimination Learning Algorithm Hypothesis h is said to be consistent with respect to a set of training examples D iff h Cx7 forallz 6 D7 where Cx represents the classi cation of x Let H be a hypothesis space and D a set of training examples Then 0 the version space VMD is de ned as VHD h E 7lc0nsistenth7 o the general boundry is the set of maximally general members of VMD o the speci c boundry is the set of maximally speci c members of VMD Version Space Theorem VHD h E H38399 2 h 2 8 where s is an element from the speci c boundry7 g is an element from the general boundry7 and 2 represents the general to speci c ordering Proof of the Version Space Theorem Candidateelimination learning algorithm 0 initialize G to the maximally general hypothesis in H o initialize S to the maximally speci c hypothesis in H o if d is a positive example i remove from G any hypothesis inconsistent with d i for each hypothesis 5 E S not consistent with d remove s from S 96 96 add to S all minimal generalizations h of s such that h is consistent with d and some member of G is more general than h 96 remove from S any hypothesis that is more general than another hypothesis in S o if d is a negative example i remove from S any hypothesis inconsistent with d i for each hypothesis 9 E G not consistent with d 96 remove g from G 96 add to G all minimal specializations h of 9 such that h is consistent with d and some member of S is more speci c than h 96 remove from G any hypothesis that is less general than another hypothesis in S Proof of CandidateElimination Algorithm Correctness CandidateElimination Example use the algorithm to nd the version space for the concept of apple using the following ordered training examples 1 medium red smoothyes 2 heavy brown rough no medium orange smooth no medium green smooth yes 5 light red rough no Inductive Bias in Learning Unbiased learner one that is capable of learning every possible target concept associated with a feature space Futility of the unbiased learner a learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying new examples Inductive Bias in Learning the assumption that the target concept belongs in a given hypothesis space Lecture 2 Basics of Prolog PROLOG PROgramming in LOGic prolog language is based on predicate logic theoretically developed in the 70s by Robert Kowalski of Edinburgh University represents a major achievement for the symbolic approach to arti cial intelligence all knowledge is expressed through facts and rules written as logical formulae Problem solving is performed by a built in interpreter via logical deduction that takes the form of backtracking depth rst search considered a semi declarative language7 in that one declares facts and rules7 as opposed to writing procedures as is done in java and C greatly in uenced by LlSP LlSt Processing7 but prolog7s goal oriented style of pro gramming makes it preferable to LlSP in the Al community Prolog Program Example 1 The Simpsons Family Tree ClosedWorld Assumption if a fact is not stated explicitly in a prolog program7 then it will be deemed false unless it can be logically inferred from other stated facts Prolog sentences such as parent homerbart are called facts7 since they are considered unconditionally true Prolog rules are if then statements For example7 parentXY a ChildY7 X is considered a rule7 and in prolog it is written ChildYX 7 parentXY Prolog Program Example 2 Using the relations parent XY7 male X7 and female X7 de ne prolog rules for the relations motherXY7 grandfatherY7 and brotherXY Prolog Program Example 3 Using the parent XY7 de ne a prolog rule for the relation predecessorXY7 which is true if Y is a descendant of X in the family tree Prolog Computation Tree a tree that is constructed and depth rst searched through by the prolog interpreter in order to satisfy answer goals queries Prolog Program Example 3 For the following program7 construct the computation tree used to satisfy the goal darkXbigX Program big bear bigelephant small cat brownbear black cat gray elephant darkZ 7 blackZ darkZ 7 brownZ Prolog syntax Atoms an atom is similar to what we would call a eortstartt in a language like C in that it refers to a speci c thing Semantic and syntactic notes about atoms 0 atoms typically make up the domain of the problem under consideration For example7 in the case ofthe Simpson family tree7 the domain is the set of family members homer marge bart etc within a prolog program7 atoms typically occur in the arguments of relations and functions which in prolog are called structures syntactically7 atoms may be constructed as i strings of letters7 digits7 and underscore characters that begin with a lowercase letter ii as integers and decimal numbers7 and iii as strings of characters enclosed in single quotes Variables a variable is used to represent an arbitrary element from some domain For example7 in the relation parent XY7 X and Y are variables representing arbitrary people from a family Semantic and syntactic notes about variables 0 variables may be instantiated by atoms or other variables 0 within a prolog program7 variables typically occur in the arguments of relations and functions Moreover they are assumed to be bounded by universal quantit ers More over7 the scope of these quanti ers is one prolog sentence 0 syntactically7 variables may be constructed as strings of letters7 digits7 and underscore characters that begin with an uppercase letter or an underscore Structures a prolog structure is very similar to the notion of structure or class in C in that it represents a way to encapsulate information about a particular object Semantic and syntactic notes about variables 0 a structure is simply a composition of one or more functions The outermost function is called a furtetor7 and is what we normally think of as a relation o syntactically a function name can be constructed as a string of letters digits and underscore characters that begin with a lowercase letter 0 structures may be de ned recursively as follows Basis step if f is a function then fa1a2 an is a structure where each a is either an atom or variable Inductive step if f is a function then f51 52 5 is a structure where s is either an atom variable or another structure note structures atoms and variables are examples of what is called a term in predicate logic Prolog Sentence a prolog sentence is either a fact of the form f5152 5 where f51 52 5 is a structure or a rule of the form 517527 7 5n 3 9151175127 751i17 79m51175127 781im7 where f51525n 91511512511gm51151251im are structures Matching As witnessed in the big dark bear exarnple rnatching also known as uni cation represents the fundamental operation between two structures that is performed in order to satisfy a goal Rules for matching let S and T be two terms Then S and T are said to match iff 1 they are constants representing the same object or 2 S is a variable in which case S is instantiated to T or 3 T is a variable in which case T is instantiated to S or 4 S and T are structures and S and T have the same principle functor and all corre sponding cornponents rnatch Matching Examples Which of the following are matches For each match provide the instantiation of variables 1 authorebert34Z authorXY 2 subtract52 3 3 segmentpoint7 72 Z segmentpoint37 72pointY 1 Lists Like its ancestor LISP7 prolog programs often involve a plentitude of list processing Userfriendly de nition of a list a structure of the form 121th7 tn where each ti is a prolog term Note a term in a list may be another list The rst term t1 is called the head of the list The remaining part of the list7 15 tn is called the tail The empty list is denoted by Not so friendly recursive de nition of a list 1 Basis step the empty list7 H is a list 2 Inductive step if L is a list and t is a term7 then t7L is a list List Example 1 1 Write the list ztomummchjudy7 dam39dyolz ath7 Z7j06 using dot notation and give its tree representation 2 Write the list V7 VV7 X7 Y7 Z7 in bracket notation List Example 2 provide a prolog de nition for the relations 1 memberXY which is true if X is a member of list Y 2 concatXY7 Z which is true if the concatenation of list X with list Y equals list Z Arithmetic Arithmetic operators addition 1L7 subtraction 1L7 multiplication gtk7 division power M7 integer division modulo mod is operator in prolog7 the operator is used for matching7 and so a different operator is is needed for variable assignment to an arithmetic expression For example7 7 7 X 1 2 will answer X 1 27 while 7 7 X 51 2 will answer X 3 Comparison operators less than lt7 greater than gt7 less than or equal lt7 greater than or equal gt7 equal 7 not equal Note comparison operators do not instantiate variables7 as it is assumed that the variables are already instantiated to values Arithmetic Example write a prolog program to compute the n th Fibonacci number Use the realtion fibNF Lecture 12 Bayesian Learning Bayesian learning methods draw their roots from a branch of probability and statistics known as decision theory which invloves the theory of how to minimize risk and loss when making decisions based on uncertain information Moreover given that 1 a major task in the eld of machine learning and pattern recognition is the correct classi cation of objects and data and E0 quite often data cannot be classi ed with deterministic correct certainty and 9 associated with every classi cation problem is a riskloss function that indicates the severity of an incorrect classi cation it does not seem surprising that Bayesian methods represent a fundamental tool in the machine learning toolbox In particular Bayesian learning involves the process of calculating the most probable hypothesis that would correctly classify an object or piece of data It is based on a simple but powerful probability rule namely Bayes s Rule Some attractive aspects of Bayesian learning each training vector can be used to update probability distributions which in turn affect the probability that a given hypothesis is true providing more exibility in that a hypothesis does not get completely ruled out from a few examples primquot knowledge can be easily implemented in the form of prior probability distributions allows for classi cation results to be nondeterministic for example there is an 80 chance that this microchip has a defect77 allows for the combining of classifcations into one weighted result 0 can be used as a measure against other classi cation techniques Disadvantages of Bayesian learning include 0 knowledge about the probability distributions for each hypothesis must either be known beforehand7 or found using parametric or nonparametric estimation 0 computing the optimal hypothesis can sometimes be computationally expensive espe cially if there exist many different hypotheses Bayes7s Rule Knowledge obtained 0 a priori derived from self evident propositions no experience or testing needed 0 a posteriori derived from both reasoning an empirical observations evidence Conditional probability Suppose A and B are events of some probability space Then the probability of event A on condition that event B has been observed is is denoted by P7quotAlB7 and de ned as PrA B P A B rlt l gt MB The use of conditional probabilitie in probabili itc reasoning is analogous to the use of hypothetical assumptions in logical reasoning However7 conditional probability has the ad vantage that conclusions can be stated more exibly as probabilities rather than as truefalse statements Conditional probability Example what is the probability of rolling an eleven with a pair of dice What is the probability of rolling an eleven on condition that ve appears on the rst die Theorem of Total Probability Let E17 E2 En be 71 disjoint probability events in a probability space In other wordls7 0 for every 239 31 j Then for arbitrary event A7 PTltAgt Zn PrAElPTE i1 Proof Example a class of 30 students took an exam There where 5 A7s7 6 B7s7 15 C7s7 3 D7s7 and 1 F The extra credit problem was answered correctly by all of the A students7 half of the B students7 one third of the C students7 and one D student If a student is randomly chosen from the class7 what is the probability that heshe correctly answered the extra credit problern Bayes7s Rule Let H a hypothesis and E the obtained evidence be events in some probability space both having nonzero probabilities Then i PTE H PTltHgt PrHE 7 ln words7 Bayes7s Rule states that the a posteriori probability of hypothesis H being true can be computed using a pm39om39 probabilities P7 E7 P7quotH7 and PTE H Bayes7s Rule Example 1 a bag contains two coins a fair coin7 and a two headed coin If a coin is drawn at random from the bag7 tossed7 and a head results7 what is the probability that the chosen coin has two heads Bayes7s Rule Example 2 you are in a casino with two crap tables and four roulette tables A woman walks by you and towards one of the tables to play7 but you do not bother to see which table A few moments later7 at the table where she is playing7 you hear the table host shout seven Upon hearing this what is the probability that she is playing craps Applying Bayes7s Rule to learning and classi cation 0 maximum a posteriori MAP hypothesis the hypothesis that is determined most likely by the observed data D In other words symbols hMAp arg rhea PrhD arg rhea PrDh Prh where PrDh is called the likelihood of the data given h 0 maximum likelihood ML hypothesis similar to MAP but now all hypotheses are assumed equally likely which makes the most likely hypothesis completely depen dant upon the likelihood of the data hML arg rhea PrDh Bayesian hypothesis selection Example 1 consider the previous example of the class of 30 students Suppose a student is selected at random and it is learned that he did not correctly answer the extra credit problem What is the most likely grade that he received on the exam Bayesian hypothesis selection Example 2 recall the problem of concept learning as is de ned in Chapter 2 of Mitchell7s text Show that every hypothesis h that is consistent with a set of training data D is in fact a MAP hypothesis Theorem in attempting to learn a concept suppose the d values of the training data have Gaussian noise added to them Then choosing the hypothesis which minimizes i will yield a MAP hypothesis Bayes Optimal Classi ers Thus far we have seen how to choose a MAP hypothesis but often times the hypothesis itself may be a function that will be used to classify an object Moreover rather than classify feature vector i using only the MAP hypothesis one can classify it using a voting strategy amoung hypotheses weighting each vote according to the likelihood of the hypothesis which casts the vote In particular let C be the set of possible classifcations of feature vector 2 Then the Bayes optimal classi cation of z is given as arg Inez Z Prclh Prhlz he H Theorem let X be a feature space whose constituent vectors are to be classi ed into classes belonging to classi cation set C Let H be a collection of hypotheses that can be used for the purpose of classifying vectors in X Then the classi cation method that minimizes the expected probability of error over the entire space X is that which for arbitrary feature vector i chooses c as arg Inez Prclh Prhlz Bayesian optimal classi er Example four people catch a glimpse of the face of a bank robber just before he jumps head rst through the open window of the chevy blazer being driven by his grandmother accomplice A suspect is later brought in for the four people to identify The following table shows how high the investigator rates the beliefs of each witness as well as how positive the witness is that the suspect is indeed the robber Witness investigator con dence Witness con dence 1 8 6 2 4 5 3 7 10 4 5 2 How should the investigator classify the suspect guilty or not guilty Solution to Bayesian optimal classi er Example Gibbs Sampling since the hypothesis space H may be too large to use optimal Bayes classi er7 Gibbs sampling proposes to simply randomly select a hypothesis at random7 based on the a posteriori probabilities induced by the feature vector i to be classi ed Theorem using Gibbs sampling7 the expected probability of error is no more than twice the expected probability of error using the Bayes optimal classi er Naive Bayesian Classi er Assume that the feature space X has component sets which are nite and that the feature vectors comprising X must be classi ed according to the set of classes C Let f E X be arbitrary and assume can be written in component form as 1 xn Also let 0 E C be arbitrary The naive Bayesian classi er gets its name from the fact that it is assumed a priori that nothing is known about the values Prx1 xnlc Moreover learning these probabilities may seem computationally infeasible for even small values of n Naive Bayesian classi er makes the simplifying assumption that the components of f are conditionally independent relative to any particular classi cation 0 In other words n Prx1 xnlc H Himlo i1 Naive Bayesian classi er 0N3 arg rhea Prc Przic Naive Bayesian classi er Example 1 use the following data to develop a naive Bayesian Classi er and use it to Classify a piece of fruit with features medium bumpy green Assume that the Classes are orange apple and 77banana AAAAAAAAAA medium red smooth apple medium yellow smooth apple medium red bumpy apple light green smooth banana light green rough banana medium orange smooth orange medium yellow bumpy orange medium orange rough orange light red smooth banana light orange smooth orange medium green smooth apple light red smooth apple light yellow smooth banana medium yellow smooth banana light yellow smooth banana medium green bumpy orange light orange smooth orange light yellow smooth apple medium orange bumpy orange medium red smooth apple Naive Bayesian classi er Example 2 develop a naive Bayesian Classi er for Classifying text documents Belief Networks Belief networks as de ned here represents a general notion involving a graphs whose nodes are random variables A main purpose for developing a belief network is to approximate the joint probability distribution of the variables that comprise the network7 as well as the marginal probabilities of each variable Two very important kinds of belief networks include o Bayesian networks in this case the directed edges of the graph serve the purpose of indicating conditional dependencie and independencie between the random variables See diagram below for an example 0 Constraint networks these networks consist of a collection of constraints that are imposed on the random variables In this case nodes may also consist of constraint committees7 the inductive de nition of which is now de ned 7 basis step a set of one or more constraints is a constraint committee 7 basis step a set of one or more constraint committees is a constraint committee The committees are often formed so that tightly coupled constraints are placed in the same committee Edges of the constraint network may represent lines of communi cation between different committees7 so that committees can work together to nd instantiations of the variables which will satisfy all constraints7 and hence represent a sample to base statistical joint and marginal probability distributions upon Bayesian network example SEASON SPRINKLER Yzjl RAIN h in WET T If SLIPPERY A Constraint network Example 1 four people Ahn7 Ben7 Carrie7 Dave are to dine to gether and need to cooperate to nd a suitable place to eat Ahn and her roomate Ben prefer Asian cuisine while Carrie her roomate Dave prefer vegetarians Draw a contraint network to show how communication between the four might possibly ow Constraint network Example 2 develop a constraint network for playing the game mine sweep on a 3 gtlt 3 grid