# Artificial Intelligence Programming (CS 662)

USF

This 254-page set of class notes was uploaded by Michele Herzog on Thursday, October 29, 2015. The notes are for CS 662 at the University of San Francisco, taught by Staff in Fall. Since its upload, it has received 7 views. For similar materials, see /class/231249/cs-662-university-of-san-francisco in Computer Science at the University of San Francisco.

Date Created: 10/29/15

## Artificial Intelligence Programming: Reinforcement Learning

Chris Brooks, Department of Computer Science, University of San Francisco

### Reinforcement learning

- So far, most of the learning algorithms we've looked at have been supervised, passive, and offline: we're given a labeled dataset and need to construct a hypothesis.
- Sometimes our agent is not that lucky. It is placed in an environment and forced to learn the best action to take in each state. The environment provides a reward to the agent, but never tells it the best thing to do; the agent must select actions to take.
- This approach to learning is called reinforcement learning. It is not too different from the way children or animals are trained: reward positive actions, punish negative actions.
- How could we do this in a deterministic, episodic environment, where each action leads to a single reward? Try each action once, then always select the action with the highest reward.
- What about a stochastic, episodic environment, where each action has multiple possible outcomes? We try each action multiple times and learn the expected utility of each action. This is sometimes called a bandit problem.
- How many times is enough? Often our agent is doing active learning: it must integrate learning and performance.

### Exploration

- Issue: the agent would always like to choose the action with the highest reward, i.e., to take the action with the highest expected utility. But if the agent never tries bad-looking actions, it might be missing out on a higher reward.
- Example: consider two levers. One pays 1 every time; the other pays 0 90% of the time and 100 10% of the time.
- Early on, we would like to try lots of non-optimal actions, since our estimates are probably very inaccurate. This process is called exploration.
- As our estimates of the reward for each action get more accurate, we want to select the best action more frequently. This process is called exploitation.
- How do we balance the two in a principled way?

### Boltzmann exploration

- One way to do this is Boltzmann exploration. Let U(a) be our estimated utility for taking action a. We take action a with probability

  P(a) = e^{U(a)/k} / Σ_{a'} e^{U(a')/k}

  where k is a temperature parameter. k starts high and gradually decreases. How will this behave with k = 1? With k << 1?
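The Boltzmann rule above can be sketched in a few lines. This is an illustrative sketch, not course code: the `boltzmann_choice` helper and the lever utilities are invented for the example.

```python
import math
import random

def boltzmann_choice(utilities, k):
    """Pick an action with probability proportional to e^(U(a)/k).

    utilities: dict mapping action -> estimated utility U(a)
    k: temperature. High k gives near-uniform exploration;
       k << 1 gives near-greedy exploitation.
    """
    actions = list(utilities)
    # subtract the max utility before exponentiating, for numerical stability
    m = max(utilities.values())
    weights = [math.exp((utilities[a] - m) / k) for a in actions]
    return random.choices(actions, weights=weights)[0]

# With a high temperature both levers get tried; with a low temperature
# the apparently better lever dominates.
estimates = {"lever1": 1.0, "lever2": 10.0}
exploring = boltzmann_choice(estimates, k=100.0)
exploiting = boltzmann_choice(estimates, k=0.01)
```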
### Sequential problems

- This works great for episodic problems, but what about sequential problems? Now we need to think not only about the immediate reward an action generates, but also about the value of subsequent states.
- How did we solve this with full information? We modeled the problem as a Markov Decision Process: we knew all the states, the transition probabilities between states, and the rewards for each state. We then used value iteration to estimate utilities for each state, or policy iteration to find the policy directly.

(Figures: value iteration finds a utility for each state of the grid world, e.g., 0.812, 0.868, 0.918 along the top row, and we use these utilities to construct a policy. A second figure shows the policies found directly by policy iteration.)

### Model-free learning

- Both value iteration and policy iteration assume a great deal of knowledge about the world: the rewards and the transition function.
- What if we don't have a model? All we know is that there is a set of states and a set of actions, and we still want to learn an optimal policy. This is called model-free learning.

### Q-learning

- Learning a policy directly is difficult. Problem: our data is not of the form (state, action); instead it is a sequence s1, s2, s3, ..., with rewards R. Since we don't know the transition function, it's also hard to learn the utility of a state.
- Instead, we'll learn a function Q(s, a) that estimates the utility of taking action a in state s. More precisely, Q(s, a) represents the value of taking a in state s and then acting optimally after that:

  Q(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') U(s')

  where γ is our discount factor.
- The optimal policy is then to take the action with the highest Q-value in each state. Of course, we don't know what T or U are.

### Learning the Q function

- To learn Q, we need to be able to estimate the value of taking an action in a state, even though our rewards are spread out over time. We can do this iteratively.
- Notice that U(s) = max_a Q(s, a). We can then rewrite our equation for Q as

  Q(s, a) = R(s, a) + γ max_{a'} Q(s', a')
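A single step of the tabular update can be sketched as follows. This is an illustrative sketch using the slides' simple update form (no learning rate); the state and action names are made up.

```python
# One step of the tabular update Q(s, a) <- r + gamma * max_a' Q(s', a').
from collections import defaultdict

gamma = 0.9
Q = defaultdict(dict)             # Q[state][action] -> estimated value
Q["s1"] = {"left": 0.0, "right": 0.0}
Q["s2"] = {"left": 5.0, "right": 2.0}

def q_update(s, a, r, s2):
    """Update Q[s][a] from observed reward r and successor state s2."""
    best_next = max(Q[s2].values()) if Q[s2] else 0.0
    Q[s][a] = r + gamma * best_next

# Taking "right" in s1 yields reward 1.0 and lands in s2:
q_update("s1", "right", 1.0, "s2")
# Q["s1"]["right"] is now 1.0 + 0.9 * 5.0 = 5.5
```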
### Learning the Q function (continued)

- Let's denote our estimate of Q(s, a) as Q̂(s, a). We'll keep a table listing each state-action pair and its estimated Q-value.
- The agent observes its state s, chooses an action a, then observes the reward r = R(s, a) that it receives and the new state s'. It then updates the Q-table according to the following formula:

  Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

- The agent uses the estimate of Q̂ for s' to estimate Q̂ for s. Notice that the agent doesn't need any knowledge of R or of the transition function to execute this.
- Q-learning is guaranteed to converge as long as rewards are bounded and the agent selects state-action pairs in such a way that it visits each infinitely often. This means that the agent must have a nonzero probability of selecting each a in each s as the sequence of state-action pairs approaches infinity.
- Of course, these are theoretical guarantees. In practice, we must sample each state enough; we can do this with Boltzmann exploration, using Q̂(s, a) in place of U(a).

### Pseudocode

```
s = randomState()
while not done:
    a = selectAction(s)            # use Boltzmann exploration here
    s2 = takeAction(s, a)
    r = reward(s2)
    # move Q(s, a) toward r + gamma * best Q-value of the successor state
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
    if isGoal(s2):
        s = randomState()          # start a new episode
    else:
        s = s2
```

### Pros and cons

- Q-learning has proved very useful in some settings: no knowledge of the problem dynamics is needed, no labeled training data is needed, and the agent can integrate learning and execution.
- It also has weaknesses: convergence can be slow, the Q-table is very large, and there is not much generalization.

## Artificial Intelligence Programming: Statistical NLP

Chris Brooks, Department of Computer Science, University of San Francisco

### Outline

- n-grams
- Applications of n-grams
- Review: context-free grammars
- Probabilistic CFGs
- Information extraction

### Advantages of IR approaches

- Recall that IR-based approaches use the bag-of-words model, with TF-IDF used to account for word frequency.
- They take information about common words into account, can deal with grammatically incorrect sentences, and give us a degree of correctness rather than just "yes" or "no".

### Disadvantages of IR approaches

- No use of structural information, not even co-occurrence of words.
- Can't deal with synonyms or dereference pronouns; very little semantic analysis.

### Advantages of classical NLP

- Classical NLP approaches use a parser to generate a parse tree, which can then be used to transform knowledge into a form that can be reasoned with.
- They identify sentence structure, make semantic interpretation easier, and can handle anaphora, synonyms, etc.

### Disadvantages of classical NLP

- Doesn't take frequency into account, so there is no way to choose between different parses for a sentence.
- Can't deal with incorrect grammar, and requires a lexicon.
- Maybe we can incorporate both statistical information and structure.

### n-grams

- The simplest way to add structure to our IR approach is to count the occurrence not only of single tokens but of sequences of tokens. So far we've considered words as tokens; a token is sometimes called a "gram".
- An n-gram model considers the probability that a sequence of n tokens occurs in a row. More precisely, it is the probability P(token_n | token_{n-1}, ..., token_1).
- Our approach in assignment 3 uses 1-grams, or unigrams. We could also choose to count bigrams, or 2-grams. The sentence "Every good boy deserves fudge" contains the bigrams (every, good), (good, boy), (boy, deserves), and (deserves, fudge).
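Bigram extraction as described above can be sketched in a couple of lines; the `ngrams` helper here is illustrative, not from the course assignment.

```python
# Counting n-grams from a tokenized sentence, as in the
# "Every good boy deserves fudge" example.
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-token subsequences of `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "every good boy deserves fudge".split()
bigrams = ngrams(tokens, 2)
# [('every', 'good'), ('good', 'boy'), ('boy', 'deserves'), ('deserves', 'fudge')]
counts = Counter(bigrams)   # frequency table, for estimating probabilities
```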
- We could continue this approach to 3-grams, 4-grams, or 5-grams. Longer n-grams give us more accurate information about content, since they include phrases rather than single words. What's the downside?

### Sampling theory

- We need to be able to estimate the probability of each n-gram occurring. In assignment 3, we do this by collecting a corpus and counting the distribution of words in the corpus.
- If the corpus is too small, these counts may not reflect an n-gram's true frequency, and many n-grams will not appear in the corpus at all. For example, with a lexicon of 20,000 words there are:
  - 20,000^2 = 400 million distinct bigrams
  - 20,000^3 = 8 trillion distinct trigrams
  - 20,000^4 = 1.6 × 10^17 distinct 4-grams

### Smoothing

- So when we estimate n-gram counts from a corpus, there will be many n-grams that we never see. This might occur in your assignment: what if there's a word in your similarity set that's not in the corpus?
- The simplest thing to do is add-one smoothing: start each n-gram with a count of 1 rather than zero. Easy, but not very theoretically satisfying.

### Linear interpolation smoothing

- We can also use estimates of shorter-length n-grams to help out. Assumption: the sequence w1 w2 w3 and the sequence w1 w2 are related.
- More precisely, we want to know P(w3 | w2, w1). We count all 1-grams, 2-grams, and 3-grams, and estimate

  P(w3 | w2, w1) ≈ c1 P(w3 | w2, w1) + c2 P(w3 | w2) + c3 P(w3)

- So where do we get c1, c2, c3? They might be fixed based on past experience, or we could learn them.
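The interpolation estimate above can be sketched as follows. This is an illustrative sketch: the count tables, the weights (0.6, 0.3, 0.1), and the tiny example data are assumptions, not values from the course.

```python
# Linear-interpolation smoothing: blend trigram, bigram, and unigram
# estimates with weights c1 + c2 + c3 = 1.
def interpolated(w3, w2, w1, tri, bi, uni, total, c=(0.6, 0.3, 0.1)):
    """Estimate P(w3 | w2, w1) from raw count tables.

    tri: dict (w1, w2, w3) -> count; bi: dict (wa, wb) -> count;
    uni: dict w -> count; total: total number of tokens in the corpus.
    """
    c1, c2, c3 = c
    p_tri = tri.get((w1, w2, w3), 0) / max(bi.get((w1, w2), 0), 1)
    p_bi = bi.get((w2, w3), 0) / max(uni.get(w2, 0), 1)
    p_uni = uni.get(w3, 0) / total
    return c1 * p_tri + c2 * p_bi + c3 * p_uni

# Toy counts: the trigram is unseen, so the bigram and unigram terms carry it.
tri = {}
bi = {("good", "boy"): 1}
uni = {"good": 2, "boy": 1}
p = interpolated("boy", "good", "every", tri, bi, uni, total=10)
# 0.6*0 + 0.3*(1/2) + 0.1*(1/10) = 0.16
```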
### Application: segmentation

- One application of n-gram models is segmentation: splitting a sequence of characters into tokens, or finding word boundaries.
- Examples: speech-to-text systems; Chinese and Japanese text; genomic data; documents where other characters (such as &amp;nbsp;) represent spaces.
- The algorithm for doing this is called Viterbi segmentation. Like parsing, it is a form of dynamic programming.

### Viterbi segmentation

```
# input: a string S, a 1-gram distribution P
n = length(S)
words = array[n + 1]    # best final word ending at each position
best = array[n + 1]     # probability of the best segmentation ending there
best[0] = 1.0
for i = 1 to n:
    best[i] = 0.0
    for j = 0 to i - 1:
        word = S[j:i]   # the substring from j to i
        w = length(word)
        if P[word] * best[i - w] > best[i]:
            best[i] = P[word] * best[i - w]
            words[i] = word
# now recover the best segmentation
result = []
i = n
while i > 0:
    push words[i] onto result
    i = i - length(words[i])
return result
```

### Worked example

- Input: "cattlefish", with P(cat) = 0.1, P(cattle) = 0.3, P(fish) = 0.1, and all other 1-grams equal to 0.001.
- As the table fills in, junk prefixes such as "c", "ca", "catt", and "cattl" survive only with tiny probabilities, while real words score well: best[3] = 0.1 for "cat" and best[6] = 0.3 for "cattle".
- At i = 10, "fish" completes the string with best[10] = 0.3 × 0.1 = 0.03. The traceback pushes "fish" onto the result, jumps back four characters to i = 6, pushes "cattle", and reaches i = 0, so the segmentation is "cattle fish".
### What's going on here?

- The Viterbi algorithm is searching through the space of all combinations of substrings; states with high probability mass are pursued.
- The `best` array prevents the algorithm from repeatedly expanding portions of the search space. This is an example of dynamic programming, like chart parsing.

### Application: language detection

- n-grams have also been used successfully to detect the language a document is written in.
- Approach: consider letters as tokens, rather than words. Gather a corpus in a variety of different languages (Wikipedia works well here), process the documents, and count all two-grams. Estimate probabilities for each language L, giving a model P_L. Assumption: different languages have characteristic two-grams.
- To classify a document by language: find all two-grams in the document; call this set T. For each language L, the likelihood that the document is of language L is P_L(t1) × P_L(t2) × ... × P_L(tn). The language with the highest likelihood is the most probable language.
- This is a form of Bayesian inference; we'll spend more time on it later in the semester.
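The language-detection scheme above can be sketched as follows. This is an illustrative sketch: the two one-sentence "corpora", the floor probability for unseen bigrams, and the use of log-probabilities (to avoid underflow in the product) are all assumptions, not course code.

```python
# Character-bigram language identification: score a document by the sum of
# log-probabilities of its bigrams under each language model.
import math
from collections import Counter

def bigram_model(text):
    """Estimate bigram probabilities from a training text."""
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(pairs.values())
    return {bg: n / total for bg, n in pairs.items()}

def log_likelihood(doc, model, floor=1e-6):
    # unseen bigrams get a small floor probability (crude smoothing)
    return sum(math.log(model.get(doc[i:i + 2], floor))
               for i in range(len(doc) - 1))

models = {
    "english": bigram_model("the quick brown fox jumps over the lazy dog"),
    "german": bigram_model("der schnelle braune fuchs springt ueber den hund"),
}

def detect(doc):
    """Return the language whose model gives the document highest likelihood."""
    return max(models, key=lambda lang: log_likelihood(doc, models[lang]))
```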
### Going further

- n-grams and segmentation provide some interesting ideas: we can combine structure with statistical knowledge, and probabilities can be used to help guide search, for example by helping a parser choose between different outcomes.
- But no structure is used apart from collocation. Maybe we can apply these ideas to grammars.

### Reminder: CFGs

- Recall context-free grammars from the last lecture: a single nonterminal on the left, anything on the right.
  - S → NP VP
  - VP → Verb | Verb PP
  - Verb → run | sleep
- We can construct sentences that have more than one legal parse: "Squad helps dog bite victim." CFGs don't give us any information about which parse to select.

### Probabilistic CFGs

- A probabilistic CFG is just a regular CFG with probabilities attached to the right-hand sides of rules. For each nonterminal, the probabilities have to sum to 1; they indicate how often that nonterminal derives each right-hand side.
- Example: S → NP VP (1.0); PP → P NP (1.0); VP → V NP (0.7) | VP PP (0.3); P → with (1.0); V → saw (1.0); NP → NP PP (0.4) | astronomers (0.1) | ears (0.18) | saw (0.04) | stars (0.18) | telescopes (0.1).

### Disambiguation

- The probability of a parse tree is the product of the probabilities of each rule used to derive the tree. This lets us compare two parses and say which is more likely.
- For "astronomers saw stars with ears": attaching "with ears" to the NP "stars" gives 1.0 × 0.1 × 0.7 × 1.0 × 0.4 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0009072, while attaching it to the VP gives 1.0 × 0.1 × 0.3 × 0.7 × 1.0 × 1.0 × 0.18 × 1.0 × 0.18 = 0.00068, so the first parse is preferred.

### Faster parsing

- We can also use probabilities to speed up parsing. Recall that both top-down and chart parsing proceed in a primarily depth-first fashion: they choose a rule to apply, and based on its right-hand side, they choose another rule.
- Probabilities can be used to better select which rule to apply, or which branch of the search tree to follow. This is a form of best-first search.

### Information extraction

- An increasingly common application of parsing is information extraction: the process of creating structured information (database or knowledge-base entries) from unstructured text.
- Example: suppose we want to build a price-comparison agent that can visit sites on the web and find the best deals on flat-screen TVs.
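The price-comparison example can be sketched with the regular-expression idea from the slides (a dollar sign, some digits, a period, and two digits). The sample HTML snippet is invented for illustration.

```python
# Extracting prices from HTML with a regular expression.
import re

PRICE = re.compile(r"\$[0-9]+\.[0-9]{2}")

html = '<li>Samsung 42" flat-screen TV <b>$499.99</b> (was $649.00)</li>'
prices = PRICE.findall(html)
# ['$499.99', '$649.00']
```

Real scrapers need more than this (prices with commas, ranges, other currencies), which is part of why the approach is brittle.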
- Or suppose we want to build a database about video games. We might do this by hand, or we could write a program that parses Wikipedia pages and inserts knowledge such as madeBy(Blizzard, WorldOfWarcraft) into a knowledge base.

### Extracting specific information

- A program that fetches HTML pages and extracts specific information is called a scraper. Simple scrapers can be built with regular expressions.
- For example, prices typically have a dollar sign, some digits, a period, and two digits: \$[0-9]+\.[0-9]{2}
- This approach will work, but it has several limitations: it can only handle simple extractions, and it is brittle and page-specific.

### Extracting relational information

- Suppose we want to build a database that relates organizations to cities, e.g., in(USF, San Francisco). We want to be able to extract this information from a sentence like: "AI is the best class ever," said Chris Brooks, a professor at USF, a university in San Francisco.
- We subdivide this problem into two pieces: named entity extraction and relation extraction.

### Named entity extraction

- Named entity extraction is the process of figuring out that "Chris Brooks", "USF", and "San Francisco" are proper nouns.
- We could just have a big list of all people in an organization, or of all cities, which might work. Or we could use a program called a chunker, which is a probabilistic parser.
- A chunker only parses to a shallow level (about two deep) and identifies chunks of sentences rather than tagging every word. It will often try to identify the type of each entity, such as organization or person, typically using probabilities extracted from training corpora.

### Relation extraction

- Once we have named entities, we need to figure out how they are related. We can write augmented regular expressions that use chunk types for this: the pattern <ORG> in <CTY> will match "<organization> ... in ... <city>".
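The `<ORG> in <CTY>` pattern can be sketched over a chunked sentence. This is an illustrative sketch only: the `(text, type)` chunk format and the `extract_in_relations` helper are assumptions for the example, not NLTK's actual API.

```python
# Scan a chunked sentence for an organization followed later by the word
# "in" and then a city, emitting an in(ORG, CTY) relation.
def extract_in_relations(chunks):
    """chunks: list of (text, type) pairs, e.g. ("USF", "ORG")."""
    relations = []
    for i, (text, typ) in enumerate(chunks):
        if typ != "ORG":
            continue
        for j in range(i + 1, len(chunks)):
            if chunks[j][0] == "in":
                for k in range(j + 1, len(chunks)):
                    if chunks[k][1] == "CTY":
                        relations.append(("in", text, chunks[k][0]))
                        break
                break
    return relations

chunks = [("Chris Brooks", "PERSON"), ("a professor at", "O"),
          ("USF", "ORG"), ("a university", "O"), ("in", "O"),
          ("San Francisco", "CTY")]
# extract_in_relations(chunks) -> [("in", "USF", "San Francisco")]
```

A matcher this naive produces false positives, which is exactly the caveat the slides raise next.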
- There will be false positives; getting this highly accurate takes some care. In assignment 4, you'll get to experiment with information extraction using NLTK.

## Artificial Intelligence Programming: Introduction to Knowledge Representation

Chris Brooks, Department of Computer Science, University of San Francisco

### Introduction

- So far we've talked about search, which is a means of considering alternative possibilities. The way in which problem states were represented was typically pretty straightforward.
- The other aspect of many AI problems involves representing possible states. Our choice of representation influences the problems our agent is able to solve, the sorts of environments an agent can deal with, the complexity of search, and the sophistication of our agent.

### Declarative vs. procedural

- Knowledge representation entails a shift from a procedural way of thinking about agents to a declarative way.
  - Procedural: behavior is encoded directly in the agent program; the focus is on algorithms.
  - Declarative: agent knowledge is represented as sentences; the focus is on data.
- Goal: separate knowledge about the problem from the algorithm used to select actions.

### Knowledge representation choices

- Logic-based approaches: propositional logic, first-order logic, ontologies. Logic is a flexible, well-understood, powerful, versatile way to represent knowledge, and it often fits the way human experts describe their world. But facts are either true or false, so logic has a hard time dealing with uncertainty.
- Probabilistic approaches: probabilistic reasoning works well in domains with uncertainty, but inference can be more complex, more prior knowledge is required to use it effectively, and representational power is typically more limited.

### Declarative vs. procedural (continued)

- Agents maintain a knowledge base that allows them to reason about a problem.
- Knowledge is represented as facts and relations, and inference is typically performed automatically.
- This is sometimes called programming at the knowledge level: specify the facts known by an agent along with its goals, and focus the programming effort on encoding knowledge.

### Wumpus World

- Russell & Norvig use the Wumpus World as an example domain.
- Environment: a 4x4 grid of rooms, with gold in one room, the wumpus in another, and pits in some rooms.
- Actions: move forward, turn left, turn right, shoot arrow, grab gold.
- Sensors: perceive stench, perceive breeze, perceive gold, sense wall, hear wumpus death.
- Goal: maximize performance, which means finding the gold quickly without encountering the wumpus or falling into a pit.

(Figure: the standard 4x4 Wumpus World map, showing pits with surrounding breezes, the wumpus with surrounding stench, the gold, and the agent's START square.)

### Syntax and semantics

- Syntax defines whether a sentence is properly constructed. In arithmetic, x + 2 = 5 is syntactically correct; in a Python program, timeElapsed = 3 is syntactically correct, while 3 = timeElapsed is not.
- Semantics defines when a sentence is true or false. The semantics of x + 2 = 5 are that the sentence is true in worlds where x = 3 and false otherwise.
- Logical sentences must be true or false; there is no "degree of truth".

### Models

- A model is an assignment of values to a subset of the variables of interest in our problem. A model for the vacuum-cleaner world might indicate where the vacuum is and which rooms are clean; in the Wumpus World, a model would indicate the locations of the pits, gold, agent, arrow, and wumpus.
- A model provides an agent with a possible world: one guess at how things might be. We'll often be interested in finding models that make a sentence true or false, or all the models that could be true for a given set of sentences. Models are very much like states.

### Knowledge base

- A knowledge base is composed of sentences that assert facts about the world.
- What's the difference between a knowledge base and a database? In principle, expressiveness and usage; in practice, a knowledge base might be implemented using a database.
- Sentences describe objects of interest in the world (wumpuses, gold, pits, rooms, the agent) and relationships between objects (the agent is holding the arrow).

### Entailment

- Entailment is the idea that one sentence follows logically from another, written a ⊨ b.
- Technically, this says: in all models where a is true, b is also true. Think "if-then": x + 2 = 5 ⊨ x = 3.
- Note that entailment is a property of a set of sentences, not an instruction to an agent.

### Inference

- A knowledge base plus a model allows us to perform inference: for a given set of sentences, plus some assignment of values to variables, what can we conclude?
- Entailment tells us that it is possible to derive a sentence; inference tells us how it is derived.
- An algorithm that only derives entailed sentences is said to be sound: it doesn't make mistakes or conclude incorrect sentences.
- An inference algorithm that can derive all entailed sentences is complete: if a sentence is entailed, a complete algorithm will eventually infer it.
- If entailed sentences are goals, this is the same definition of "complete" we used for search. That means we can think of inference as search, and use the algorithms we've already learned about.

### Propositional logic

- A complex sentence is a set of terms joined with connectives (∧, ∨, ¬, →, ↔). Example: Breeze11 → (Pit12 ∨ Pit21).
- This is the sort of representation you used when constructing scorers for the crawler in assignment 3: a document is represented as a series of words that are either present (true) or absent (false). The TFIDFScorer extends this to map words to numbers, but the representational power is the same.
- Note that there's no way to indicate relationships between words: occurring near each other, synonyms, antonyms, general/specific, author, subject, date, etc.
- Propositional logic is a very simple logic: nice for examples and computationally feasible, but limited in representational power.
- Terms (Russell & Norvig call these atomic sentences) consist of a single symbol that has a truth value, e.g., Room11Clean or VacuumIn11.
### Limits of propositional logic

- Propositional logic does not have any way to deal with classes of objects. We can't concisely say "for any room, if there is a breeze, then there is a pit in the next room"; to say "at least one room is dirty" requires us to list all possibilities. We don't have functions or predicates.
- There's a computational tradeoff involved: if we're careful about how we use propositions, we can do fast (polynomial-time) inference, but we're limited in what our agent can reason about. Propositional logic is the logic underlying hardware design (Boolean logic).
- Often people will replace atomic terms with simple predicates: Room11Clean becomes Clean(Room11). As it is, this is fine; what we're missing is a way to talk about all the rooms that are clean without explicitly enumerating them. We don't have variables or quantifiers. For that, we need first-order logic (next week).

### Notation

- A ∧ B (and): true if both A and B are true.
- A ∨ B (or): true if either A or B, or both, are true.
- ¬A (not): true if A is false.
- A → B (implies): true if A is false or B is true.
- A ↔ B (equivalence): true if A and B have the same truth value.

### Still more definitions

- Logical equivalence: two sentences are logically equivalent if they are true for the same set of models. P → Q is logically equivalent to ¬P ∨ Q.
- Validity (tautology): a sentence is valid if it is true in all models, e.g., A ∨ ¬A.
- Contradiction: a sentence that is false in all models, e.g., A ∧ ¬A.
- Satisfiability: a sentence is satisfiable if it is true in some model; Room11Clean ∨ Room11Dirty is true in some worlds.
- Often our problem will be to find a model that makes a sentence true or false. A model that satisfies all the sentences we're interested in will be the goal, or solution, of our search.
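The model-based definitions above can be sketched directly. This is an illustrative sketch: sentences are represented as plain Python functions of a model (a dict of truth values), which is an assumption for the example, not the course's representation.

```python
# Model enumeration: a "model" assigns True/False to each symbol; a sentence
# is satisfiable if some model makes it true, and valid if all models do.
from itertools import product

def models(symbols):
    """Yield every assignment of truth values to the given symbols."""
    for values in product([True, False], repeat=len(symbols)):
        yield dict(zip(symbols, values))

def satisfiable(sentence, symbols):
    return any(sentence(m) for m in models(symbols))

def valid(sentence, symbols):
    return all(sentence(m) for m in models(symbols))

# A or not-A is valid; A and not-A is a contradiction.
tautology = lambda m: m["A"] or not m["A"]
contradiction = lambda m: m["A"] and not m["A"]
```

Enumerating all 2^n models is exponential, which previews the complexity concerns raised below.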
### Implication

- Implication is a particularly useful logical construct. The sentence A → B is true if A is true and B is true, or if A is false. Example: "if it is raining right now, then it is cloudy right now."
- A → B is equivalent to ¬A ∨ B. Implication will allow us to perform inference.

### Logical reasoning

- Logical reasoning proceeds by using existing sentences in an agent's KB to deduce new sentences. Deduction is guaranteed to produce true sentences, assuming a sound mechanism is used.
- Rules of inference:
  - And-elimination: from A ∧ B, conclude A.
  - Or-introduction: from A, conclude A ∨ B.
  - Contraposition: A → B can be rewritten as ¬B → ¬A.
  - Double negation: ¬¬A becomes A.
  - Distribution: A ∨ (B ∧ C) becomes (A ∨ B) ∧ (A ∨ C); A ∧ (B ∨ C) becomes (A ∧ B) ∨ (A ∧ C).
  - De Morgan's laws: ¬(A ∨ B) becomes ¬A ∧ ¬B; ¬(A ∧ B) becomes ¬A ∨ ¬B.

### Example

- Begin with:
  - R1: ¬P11 (there is no pit in [1,1])
  - R2: B11 ↔ (P12 ∨ P21) (a square has a breeze iff there is a pit in a neighboring square)
  - R3: B21 ↔ (P11 ∨ P22 ∨ P31), and so on for all other squares
- Assume the agent visits [1,1] and senses no breeze, but does sense a breeze in [2,1]. Add:
  - R4: ¬B11
  - R5: B21

### Inference as search

- We can then use good old breadth-first search (or any other search) to perform inference and determine whether a sentence is entailed by a knowledge base.
- Basic idea: begin with the statements in our KB; actions are applications of inference rules. For example, say we know (1) A → B, (2) B → C, and (3) A. One possible action is to apply Modus Ponens to (1) and (3) to conclude B; we can then apply Modus Ponens again to conclude C.

### Example (continued)

- We can use biconditional elimination to rewrite R2 as R6: (B11 → (P12 ∨ P21)) ∧ ((P12 ∨ P21) → B11).
- And-elimination on R6 produces R7: (P12 ∨ P21) → B11.
- Contraposition on R7 gives us R8: ¬B11 → ¬(P12 ∨ P21).
- Modus Ponens with R8 and R4 produces R9: ¬(P12 ∨ P21), and De Morgan's law then gives us R10: ¬P12 ∧ ¬P21.
- Our agent can conclude that there is no pit in [1,1], [1,2], or [2,1]. It is not sure about [2,2].
- Our search can proceed in a breadth-first manner (what are all the possible conclusions from the original KB?), depth-first (take one inference, then use it to make further inferences, and so on), or somewhere in between.
- A successor function defines all the applicable rules for a given knowledge base. The result of this search is called a proof.

### Resolution

- The preceding rules are sound, but not necessarily complete, and search can be inefficient: there might be many operators that can be applied in a particular state.
- Luckily, there is a rule that is complete (when coupled with a complete search algorithm) and uses a single operator. This is called resolution: from A ∨ B and ¬A ∨ C, we can conclude B ∨ C.
- Intuition: A is either true or not true. If A is true, then C must be true; if A is false, then B must be true. This can be generalized to clauses of any length.

### Conjunctive Normal Form

- Resolution works with disjunctions, so our knowledge base needs to be in this form. Conjunctive Normal Form (CNF) is a conjunction of clauses that are disjunctions: (A ∨ B ∨ C) ∧ (D ∨ E ∨ F) ∧ (G ∨ H ∨ I) ∧ ...
- Every propositional logic sentence can be converted to CNF. Recipe:
  1. Eliminate equivalence: A ↔ B becomes (A → B) ∧ (B → A).
  2. Eliminate implication: A → B becomes ¬A ∨ B.
  3. Move ¬ inwards using double negation and De Morgan's laws: ¬¬A becomes A; ¬(A ∧ B) becomes ¬A ∨ ¬B.
  4. Distribute nested clauses: A ∨ (B ∧ C) becomes (A ∨ B) ∧ (A ∨ C).

### Example

- Start with B11 ↔ (P12 ∨ P21). Eliminating equivalence produces (B11 → (P12 ∨ P21)) ∧ ((P12 ∨ P21) → B11).
- Removing implication gives us (¬B11 ∨ P12 ∨ P21) ∧ (¬(P12 ∨ P21) ∨ B11).
- We then use De Morgan's rule to move negation inwards: (¬B11 ∨ P12 ∨ P21) ∧ ((¬P12 ∧ ¬P21) ∨ B11).
- Finally, we distribute OR over AND: (¬B11 ∨ P12 ∨ P21) ∧ (¬P12 ∨ B11) ∧ (¬P21 ∨ B11).
- Now we have clauses that can be plugged into a resolution theorem prover (we can break the ANDs into separate sentences). They're less readable by a human, but more computationally useful.

### Proof by refutation

- Once your KB is in CNF, you can do resolution by refutation; in math, this is called proof by contradiction.
- Basic idea: to show that sentence A is true, insert ¬A into the KB and try to derive a contradiction.
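Resolution by refutation can be sketched compactly. This is an illustrative sketch with an assumed clause representation (a clause is a frozenset of `(symbol, is_positive)` literals), not the course's prover.

```python
# Propositional resolution by refutation: saturate the clause set and
# report unsatisfiable if the empty clause is ever derived.
def resolve(c1, c2):
    """Return all resolvents of two clauses (frozensets of literals)."""
    out = []
    for (sym, pos) in c1:
        if (sym, not pos) in c2:
            out.append((c1 - {(sym, pos)}) | (c2 - {(sym, not pos)}))
    return out

def refutes(clauses):
    """True if the clause set is unsatisfiable (derives the empty clause)."""
    clauses = set(clauses)
    while True:
        new = set()
        for a in clauses:
            for b in clauses:
                if a == b:
                    continue
                for r in resolve(a, b):
                    if not r:          # empty clause: contradiction found
                        return True
                    new.add(r)
        if new <= clauses:             # nothing new derivable: satisfiable
            return False
        clauses |= new

# KB: P12 -> B11 (as the clause ~P12 v B11) and ~B11; to prove ~P12,
# insert its negation P12 and look for a contradiction.
kb = [frozenset({("P12", True)}),
      frozenset({("P12", False), ("B11", True)}),
      frozenset({("B11", False)})]
```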
1321 A N131 V P21V 311 m Example 9 Prove that there is not a pit in 12 Pm 9 Relevant Facts CL R22 Bm V P13 V P24 a R217 ppm V 311 0 322 ppm V 311 0 R4 Bm a Insert Rn P13 into the KB Example 0 Resolve Rn with Rab to get R6 BM 9 We already have a contradiction since 84 Bm 9 Therefore the sentence we inserted into the KB must be false a Most proofs take more than one step to get to a contradiction How to use a KB Backward Chaining 1 Backward chaining starts with the goal and works backwardquot to the start a Example Ifwe want to show that A is entailed find a sentence whose consequent is A q Then try to prove that sentence s antecendents a This is sometimes called querydriven reasoning a More effective at proving a particular query since search is focused o 1 Less likely to discover new and unknown information a Meansends analysis is a similar sort of reasoning in Prolog uses backward chaining Horn clauses a Standard resolution theorem proving and propositional inference in general is exponentially hard 9 However ifwe re willing to restrict ourselves a bit the problem becomes computationally easy a A Horn clause is a disjunction with at most one positive literal CL TlA V TlB V O V D a TlA V TlB 9 These can be rewritten as implications with one consequent a A A B A O gt D u A A B gt False a Horn clauses are the basis of logic programming sometimes called rulebased oroorammummwmnm my Strengths of Propositional Logic 0 Declarative knowledge can be separated from inference 1 Can handle partial information a Can compose more complex sentences out of simpler ones a Sound and complete inference mechanisms efficient for Horn clauses How to use a KB Forward Chaini Forward chaining involves starting with a KB and continually applying Modus Ponens to derive aquot possible facts 1 This is sometimes called datadriven reasoning a Start with domain knowledge and see what that kn owiedge tells you a This is very useful for discovering new facts or rules 0 Less helpful for proving a 
specific sentence true orfalse G Search is not directed towards a goa Weaknesses of Propositional logic 9 Exponential increase in number of literals a No way to describe relations between objects a No way to quantify over objects a Firstorder logic is a mechanism for dealing with these oblems a As always there will be tradeoffs 2 There s no free lunch Arti cial Intelligence Programming Inference Chris Brooks Department of Computer Science University of San Francisco Logicbased agents 9 The big picture our agent needs to make decisions about what to do in a complex environment 0 In order to do this it needs a model of the world 0 Logic is a language for representing this model 9 Our agent will maintain and update a knowledge base KB which stores this model 9 Our agent will use inference to derive new facts that are added to the KB 0 These new facts will help our agent make decisions Department of Computer Science Universitv of San Francisco 011 Logic Summary 9 Recall that propositional logic concerns 9 facts about the world that are true or false 0 These facts can be used to construct sentences using logical connectives A V a gt 0 Example P1 V P22 9 We can convert these sentences to CNF and use resolution as an inference mechanism a Resolution is sound and complete Department of Computer Science Universitv of San Francisco 021 Logic Summary Q A big weakness of propositional logic is its closedworld assumption Q No way to talk about objects that cannot be named Q Also no way to specify classes or relations Q In practice this leads to an exponential number of sentences in the agent s KB Department of Computer Science Universitv of San Francisco 031 Logic Summary 9 First order logic addresses these issues by providing variables and quantification over these variables a FOL works with objects and relations between them which form atomic sentences 0 Atomic sentences can be joined with logical connectives objects 0 Example SiblingsBartLisaValikesaMarge gt likesa homer 
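The refutation procedure above can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the course's code: clauses are represented as frozensets of string literals with `~` marking negation, and the prover resolves all pairs of clauses until it derives the empty clause or reaches a fixed point.

```python
from itertools import combinations

def resolve(c1, c2):
    """Return all resolvents of two clauses (frozensets of literals)."""
    resolvents = []
    for lit in c1:
        comp = lit[1:] if lit.startswith('~') else '~' + lit
        if comp in c2:
            resolvents.append(frozenset((c1 - {lit}) | (c2 - {comp})))
    return resolvents

def refute(kb_clauses, query):
    """Prove `query` by inserting its negation and searching for the empty clause."""
    neg = query[1:] if query.startswith('~') else '~' + query
    clauses = set(kb_clauses) | {frozenset([neg])}
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolve(c1, c2):
                if not r:           # empty clause: contradiction found
                    return True
                new.add(r)
        if new <= clauses:          # no new clauses: query not entailed
            return False
        clauses |= new

# The wumpus-world clauses from the slides: B11 <=> (P12 v P21), plus ~B11
kb = [frozenset(['~B11', 'P12', 'P21']),
      frozenset(['~P12', 'B11']),
      frozenset(['~P21', 'B11']),
      frozenset(['~B11'])]
print(refute(kb, '~P12'))   # True: there is no pit in [1,2]
```

Resolving the inserted P12 with ¬P12 ∨ B11 yields B11, which contradicts ¬B11, exactly as in the example above.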
Quantifiers and Variables
- ∃ is the symbol for existential quantification: it means that the sentence is true for at least one element in the domain.
- ∃x female(x) ∧ playsSaxophone(x)
- What would happen if I said ∃x female(x) → playsSaxophone(x)?

Quantifiers
- In general, → makes sense with ∀ (∧ is usually too strong); ∧ makes sense with ∃ (→ is generally too weak).
- Some examples:
  - One of the Simpsons works at a nuclear plant.
  - All of the Simpsons are cartoon characters.
  - There is a Simpson with blue hair and a green dress.
  - There is a Simpson who doesn't have hair.

Nesting Quantifiers
- Often we'll want to express more complex quantifications. For example, "every person has a mother": ∀x ∃y mother(x, y)
- Notice the scope: for each x, a different y is potentially chosen.
- What if we said ∃y ∀x mother(x, y)?
- This is not a problem when nesting quantifiers of the same type: ∀x ∀y brotherOf(x, y) → siblingOf(x, y) and ∀y ∀x brotherOf(x, y) → siblingOf(x, y) are equivalent.
- We often write that as ∀x,y brotherOf(x, y) → siblingOf(x, y).

Negation
- We can negate quantifiers.
- ¬∀x yellow(x) says that it is not true that everyone is yellow; ∃x ¬yellow(x) has the same meaning: there is someone who is not yellow.
- ¬∃x daughterOf(Bart, x) says that there does not exist anyone who is Bart's daughter; ∀x ¬daughterOf(Bart, x) says that for all individuals, they are not Bart's daughter.
- In fact, we can use DeMorgan's rules with quantifiers, just like with ∧ and ∨.

More Examples
- A husband is a male spouse: ∀x,y husband(x, y) ⇔ spouse(x, y) ∧ male(x)
- Two siblings have a parent in common: ∀x,y sibling(x, y) ⇔ (x ≠ y) ∧ ∃p parent(x, p) ∧ parent(y, p)
- Everyone who goes to Moe's likes either Homer or Barney, but not both: ∀x goesTo(x, Moes) → (likes(x, Homer) ⇔ ¬likes(x, Barney))
- Everyone knows someone who is angry at Homer: ∀x ∃y knows(x, y) ∧ angryAt(y, Homer)
- Everyone who works at the power plant is scared of Mr. Burns: ∀x worksAt(x, PowerPlant) → scaredOf(x, Burns)

Audience Participation
- Everyone likes Lisa.
- Someone who works at the power plant doesn't like Homer (both ways).
- Bart, Lisa and Maggie are Marge's only children.
- People who go to Moe's are depressed.
- There is someone in Springfield who is taller than everyone else.
- When a person is fired from the power plant, they go to Moe's.
- Everyone loves Krusty except Sideshow Bob.
- Only Bart skateboards to school.
- Someone with large feet robbed the Quickie-mart.

Representing Useful Knowledge in FOL
- We can use FOL to represent class/subclass information, causality, existence, and disjoint sets.
- Example: let's suppose we are interested in building an agent that can help recommend music; we want it to be able to reason about musical artists, songs, albums, and genres.
- It would be tedious to enter every bit of information about every artist; instead, we'll enter some rules and let our agent derive entailed knowledge.

Music Example
- ∀s genre(s, Punk) → genre(s, Rock) — subclass: all Punk songs are Rock songs.
- member(JohnLennon, Beatles), member(PaulMcCartney, Beatles), member(GeorgeHarrison, Beatles), member(RingoStarr, Beatles)
- ∀x member(x, Beatles) → x ∈ {John, Paul, George, Ringo} — exclusive membership: John, Paul, George and Ringo are the Beatles.
- performedBy(Beatles, WhiteAlbum) — the White Album is a Beatles album.
- ∀x,y,z member(x, y) ∧ performedBy(y, z) → playedOn(x, z) — if someone is a member of a group, and that group performed an album, then that person played on that album.

Music Example
- genre(HolidaysInTheSun, Punk) — "Holidays in the Sun" is a Punk song.
- ∀x genre(x, Rock) → likes(Bob, x) — Bob likes all rock songs.
- We should be able to infer that Bob will like "Holidays in the Sun".
- ∀w,x,y,z likes(x, y) ∧ member(z, y) ∧ performedBy(z, w) → likes(x, w) — if someone likes albums by a band Y, and Z is a member of band Y, then that person will like albums by person Z.

Inference in Propositional Logic
- We talked about two basic mechanisms for performing inference in propositional logic:
  - Forward chaining: begin with the facts in your KB; apply inference rules until no more conclusions can be drawn.
  - Backward chaining: begin with a fact (or its negation) that you wish to prove; find facts that will justify this fact; work backwards until you find facts in your KB that support (or contradict) the fact you wish to prove.

Inference in FOL
- Can we do the same sorts of inference with first-order logic that we do with propositional logic?
- Yes, with some extra details: we need to keep track of variable bindings (substitution).
- As with propositional logic, we'll need to convert our knowledge base into a canonical form.
- In the simplest approach, we can convert our FOL sentences to propositional sentences by removing quantification; we leave in predicates for readability, but remove all variables.

Removing Quantifiers
- Universal Instantiation: we can always substitute a ground term for a variable.
- This will typically be a term that occurs in some other sentence of interest: choose a substitution for x that helps us with our proof.
- ∀x livesIn(x, Springfield) → knows(x, Homer)
- Since this is true for all x, we can substitute x/Bart: livesIn(Bart, Springfield) → knows(Bart, Homer)
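Universal Instantiation amounts to applying a binding to each variable in a sentence. A small illustrative sketch (the tuple representation and the lowercase-variables convention are assumptions for illustration, not from the slides):

```python
def is_variable(term):
    # assumed convention: variables are lowercase strings, constants capitalized
    return isinstance(term, str) and term[0].islower()

def substitute(atom, bindings):
    """Apply a substitution to an atomic sentence (predicate, arg1, arg2, ...)."""
    pred, *args = atom
    return (pred, *[bindings.get(a, a) if is_variable(a) else a for a in args])

# From LivesIn(x, Springfield), instantiate x/Bart:
rule = ('LivesIn', 'x', 'Springfield')
print(substitute(rule, {'x': 'Bart'}))   # ('LivesIn', 'Bart', 'Springfield')
```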
Removing Quantifiers
- Existential Instantiation: we can give a name to the object that satisfies a sentence.
- ∃x livesIn(x, Springfield) ∧ knows(x, Homer)
- We know this must hold for at least one object. Let's call that object K: livesIn(K, Springfield) ∧ knows(K, Homer)
- K is called a Skolem constant. K must be unused; it gives us a way of referring to an existential object.
- Once we've removed quantifiers, we can use propositional inference rules.

Propositionalization
- We can replace every existential sentence with a Skolemized version; for universally quantified sentences, substitute in every possible ground term.
- This will, in theory, allow us to use propositional inference rules.
- Problem: very inefficient. This was the state of the art until about 1960; unification of variables is much more efficient.

Unification
- The key to unification is that we only want to make substitutions for those sentences that help us prove things.
- For example, if we know:
  - ∀x livesIn(x, Springfield) ∧ worksAt(x, PowerPlant) → knows(x, Homer)
  - ∀y livesIn(y, Springfield)
  - worksAt(MrSmithers, PowerPlant)
- We should be able to conclude knows(MrSmithers, Homer) directly.
- Substitution: {x/MrSmithers, y/MrSmithers}

Generalized Modus Ponens
- This reasoning is a generalized form of Modus Ponens. Basic idea: let's say we have:
  - An implication of the form P1 ∧ P2 ∧ ... ∧ Pn → Q
  - Sentences P1', P2', ..., Pn'
  - A set of substitutions θ such that applying θ makes each Pi' identical to the corresponding Pi
- We can then apply the substitution and apply Modus Ponens to conclude Qθ.
- This technique of using substitutions to pair up sentences for inference is called unification.

Unification
- Our inference process now becomes one of finding substitutions that will allow us to derive new sentences.
- The Unify algorithm takes two sentences and returns a set of substitutions that unifies the sentences:
  - worksAt(x, PowerPlant), worksAt(Homer, PowerPlant) produces {x/Homer}
  - worksAt(x, PowerPlant), worksAt(Homer, y) produces {x/Homer, y/PowerPlant}
  - worksAt(x, PowerPlant), worksAt(FatherOf(Bart), y) produces {x/FatherOf(Bart), y/PowerPlant}
  - worksAt(x, PowerPlant), worksAt(Homer, x) fails: x can't bind to both Homer and PowerPlant.
- This last sentence is a problem only because we happened to use x in both sentences; we can replace x with a unique variable, say z1, in one sentence. This is called standardizing apart.
- What if there is more than one substitution that can make two sentences look the same? Sibling(Bart, x), Sibling(y, z) can produce {y/Bart, x/z} or {x/Bart, y/Bart, z/Bart}.
- The first unification is more general than the second: it makes fewer commitments. We want to find the most general unifier when performing inference; this is the heuristic in our search.

Unification Algorithm
- To unify two sentences, proceed recursively:
  - If either sentence is a single variable, find a unification that binds the variable to a constant.
  - Else, unify the first term, followed by the rest of each sentence.
- Example: Sibling(x, Bart) ∧ PlaysSax(x) and Sibling(Lisa, y): we can unify Sibling(x, Bart) and Sibling(Lisa, y) with {x/Lisa, y/Bart}.
- The process of finding a complete set of substitutions is a search process: the state is the list of substitutions; the successor function is the list of potential unifications and new substitution lists.

Forward Chaining
- Basic idea: begin with facts and rules (implications); continually apply Modus Ponens until no new facts can be derived.
- Requires definite clauses: implications with positive clauses in the antecedent, and positive facts.
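The recursive unification procedure described above can be sketched as follows. This is a minimal version (lowercase strings as variables and tuples as compound terms are assumed conventions), and it omits the occurs check:

```python
def is_var(t):
    # assumed convention: variables are lowercase strings
    return isinstance(t, str) and t[0].islower()

def unify(x, y, theta):
    """Return a most general unifier extending theta, or None on failure."""
    if theta is None:
        return None
    if x == y:
        return theta
    if is_var(x):
        return unify_var(x, y, theta)
    if is_var(y):
        return unify_var(y, x, theta)
    if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
        for xi, yi in zip(x, y):          # unify element by element
            theta = unify(xi, yi, theta)
            if theta is None:
                return None
        return theta
    return None

def unify_var(var, t, theta):
    if var in theta:                      # variable already bound: follow it
        return unify(theta[var], t, theta)
    if is_var(t) and t in theta:
        return unify(var, theta[t], theta)
    return {**theta, var: t}              # extend the substitution

print(unify(('WorksAt', 'x', 'PowerPlant'), ('WorksAt', 'Homer', 'y'), {}))
# {'x': 'Homer', 'y': 'PowerPlant'}
print(unify(('WorksAt', 'x', 'PowerPlant'), ('WorksAt', 'Homer', 'x'), {}))
# None: x cannot bind to both Homer and PowerPlant
```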
Forward Chaining Algorithm

    while True:
        fired = False
        for rule in rules:
            if can_unify(antecedents(rule), facts):
                fire(rule)   # assert the consequent into facts
                fired = True
        if not fired:
            return

Example
- The law says that it is a crime for an American to sell weapons to hostile nations. The country of Nono, an enemy of America, has some missiles. All of its missiles were sold to it by Colonel West, who is an American.
- Prove that West is a criminal.
- It is a crime for an American to sell weapons to hostile nations: (1) American(x) ∧ Weapon(y) ∧ Hostile(z) ∧ Sells(x, y, z) → Criminal(x)
- Nono has missiles: (2) ∃x Owns(Nono, x) ∧ Missile(x). Use Existential Elimination to substitute M1 for x: (3) Owns(Nono, M1), (4) Missile(M1)
- All Nono's missiles were sold to it by Colonel West: (5) Missile(x) ∧ Owns(Nono, x) → Sells(West, x, Nono)
- Missiles are weapons: (6) Missile(x) → Weapon(x)
- An enemy of America is a hostile nation: (7) Enemy(x, America) → Hostile(x)
- West is American: (8) American(West)
- Nono is an enemy of America: (9) Enemy(Nono, America)
- We want to forward chain until we can show all the antecedents of (1).

Example
- Algorithm: repeatedly fire all rules whose antecedents are satisfied.
- Rule 6 matches with fact 4, {x/M1}: add Weapon(M1)
- Rule 5 matches with 3 and 4, {x/M1}: add Sells(West, M1, Nono)
- Rule 7 matches with 9, {x/Nono}: add Hostile(Nono)
- Iterate again: now we can match rule 1 with {x/West, y/M1, z/Nono} and conclude Criminal(West)

Analyzing Basic Forward Chaining
- Forward chaining is sound, since it uses Modus Ponens; it is complete for definite clauses.
- Works much like BFS. This basic algorithm is not very efficient, though: finding all possible unifiers is expensive, every rule is rechecked on every iteration, and facts that do not lead to the goal are generated.

Backward Chaining
- Basic idea: work backward from the goal to the facts that must be asserted for the goal to hold.
- Uses Modus Ponens in reverse to focus search on finding clauses that can lead to a goal.
- Also uses definite clauses; search proceeds in a depth-first manner.

Backward Chaining Algorithm

    goals = [goal_to_prove]
    substitution_list = []
    while goals is not empty:
        goal = unify(goals.dequeue(), substitution_list)
        for sentence in KB:
            if unify(consequent(sentence), goal):
                goals.push(antecedents(sentence))

Backward Chaining Example
- To prove: (1) Criminal(West)
- To prove (1), prove American(West), Weapon(y), Hostile(z), Sells(West, y, z)
- American(West) unifies with (8). Still to prove: Weapon(y), Hostile(z), Sells(West, y, z)
- Weapon(y) matches the consequent of (6), so prove Missile(y); this unifies with Missile(M1), {y/M1}
- Sells(West, M1, z) matches the consequent of (5) with {z/Nono}, so prove Missile(M1) and Owns(Nono, M1); both unify with facts in the KB
- Hostile(Nono) matches the consequent of (7), so prove Enemy(Nono, America); this unifies with (9)
- The list of goals is empty, so we are done.

Analyzing Backward Chaining
- Backward chaining uses depth-first search. This means that it suffers from repeated states; also, it is not complete.
- Can be very effective for query-based systems.
- Most backward chaining systems (especially Prolog) give the programmer control over the search process, including backtracking.

Resolution
- Recall resolution in propositional logic: from A ∨ C and ¬A ∨ B, conclude B ∨ C.
- Resolution in FOL works
similarly; it requires that sentences be in CNF.

Conversion to CNF
- The recipe for converting FOL sentences to CNF is similar to propositional logic:
  1. Eliminate implications
  2. Move ¬ inwards
  3. Standardize apart
  4. Skolemize existential sentences
  5. Drop universal quantifiers
  6. Distribute ∧ over ∨

CNF Conversion Example
- Sentence: "Everyone who loves all animals is loved by someone."
- Translation: ∀x [∀y Animal(y) → Loves(x, y)] → [∃y Loves(y, x)]
- Eliminate implication: ∀x ¬[∀y ¬Animal(y) ∨ Loves(x, y)] ∨ [∃y Loves(y, x)]
- Move negation inwards:
  - ∀x [∃y ¬(¬Animal(y) ∨ Loves(x, y))] ∨ [∃y Loves(y, x)]
  - ∀x [∃y ¬¬Animal(y) ∧ ¬Loves(x, y)] ∨ [∃y Loves(y, x)]
  - ∀x [∃y Animal(y) ∧ ¬Loves(x, y)] ∨ [∃y Loves(y, x)]
- Standardize apart: ∀x [∃y Animal(y) ∧ ¬Loves(x, y)] ∨ [∃z Loves(z, x)]
- Skolemize: in this case we need Skolem functions rather than constants: ∀x [Animal(F(x)) ∧ ¬Loves(x, F(x))] ∨ Loves(G(x), x)
- Drop universals: [Animal(F(x)) ∧ ¬Loves(x, F(x))] ∨ Loves(G(x), x)
- Distribute ∧ over ∨: [Animal(F(x)) ∨ Loves(G(x), x)] ∧ [¬Loves(x, F(x)) ∨ Loves(G(x), x)]

Resolution Theorem Proving
- Resolution proofs work by inserting the negated form of the sentence to prove into the knowledge base, and then attempting to derive a contradiction.
- A set of support is used to help guide search: these are facts that are likely to be helpful in the proof. This provides a heuristic.

Resolution Algorithm

    sos = useful facts
    usable = all facts in KB
    do:
        fact = sos.pop()
        resolve fact with each clause in usable
        simplify clauses: remove duplicates and tautologies
        if a clause has no literals: return "refutation found"
    until sos is empty

Resolution Example
1. ¬American(x) ∨ ¬Weapon(y) ∨ ¬Sells(x, y, z) ∨ ¬Hostile(z) ∨ Criminal(x)
2. ¬Missile(x) ∨ ¬Owns(Nono, x) ∨ Sells(West, x, Nono)
3. ¬Enemy(x, America) ∨ Hostile(x)
4. ¬Missile(x) ∨ Weapon(x)
5. Owns(Nono, M1)
6. Missile(M1)
7. American(West)
8. Enemy(Nono, America)
9. ¬Criminal(West) (added)

Resolution Example
- Resolve 1 and 9: add 10. ¬American(West) ∨ ¬Weapon(y) ∨ ¬Sells(West, y, z) ∨ ¬Hostile(z)
- Resolve 10 and 7: add 11. ¬Weapon(y) ∨ ¬Sells(West, y, z) ∨ ¬Hostile(z)
- Resolve 11 and 4: add 12. ¬Missile(y) ∨ ¬Sells(West, y, z) ∨ ¬Hostile(z)
- Resolve 12 and 6: add 13. ¬Sells(West, M1, z) ∨ ¬Hostile(z)
- Resolve 13 and 2: add 14. ¬Missile(M1) ∨ ¬Owns(Nono, M1) ∨ ¬Hostile(Nono)
- Resolve 14 and 6: add 15. ¬Owns(Nono, M1) ∨ ¬Hostile(Nono)
- Resolve 15 and 5: add 16. ¬Hostile(Nono)
- Resolve 16 and 3: add 17. ¬Enemy(Nono, America)
- Resolve 17 and 8: contradiction!

Analyzing Resolution Theorem Proving
- Resolution is refutation-complete: if a set of sentences is unsatisfiable, resolution will discover a contradiction.
- Cannot always derive all consequences from a set of facts.
- Can produce nonconstructive proofs for existential goals: ∃x likes(x, Homer) will be proven, but without an answer for who x is.
- Can use full FOL, rather than just definite clauses.

Efficient Forward Chaining
- Recall that basic forward chaining tries to match every rule with asserted facts on every iteration. This is known as "rules finding facts"; in practice, it is not very efficient.
- The knowledge base does not change drastically between iterations: a few facts are asserted or retracted at each step.
- Also, many rules have similar left-hand sides: can matches be combined?

Rete
- Rete remembers past partial matches of rules and retains this information across iterations.
- Only new facts are tested against the left-hand side of a rule: "facts finding rules".
- Rete compiles a set of rules into a network, where nodes in the network represent tests or conditions on a rule's left-hand side.
- As facts match these nodes, activation propagates through the
network.

Rete Example
- Let's say we have the following rules (Jess syntax):

    (defrule drink-beer
        (thirsty homer)
        => (assert (drink-beer homer)))
    (defrule go-to-moes
        (thirsty homer)
        => (assert (at homer moes)))
    (defrule happy-homer
        (drink-beer homer) (at homer moes) (friday)
        => (assert (happy homer)))

Rete Example
- Single-input nodes receive a fact, test it, and propagate.
- Two-input nodes group and unify facts that match each parent.
- We can also share nodes among rules, which improves efficiency.
- (watch compilations) in Jess shows the network structure generated by a rule: "1+1+1+1+1+1+2+2+t" indicates 6 new 1-input nodes and 2 new 2-input nodes, plus one new terminal node; "1+1+1+1+2+t" indicates four shared one-input nodes plus a new two-input node and a terminal node.

Artificial Intelligence Programming: Heuristic Search
Chris Brooks, Department of Computer Science, University of San Francisco

Overview
- Heuristic search: exploiting knowledge about the problem.
- Heuristic search algorithms: best-first search, greedy search, A* search, extensions to A*.
- Constructing heuristics.

Best-first Search
- Recall uniform-cost search: nodes were expanded based on their total path cost, and a priority queue was used to implement this.
- Path cost is an example of an evaluation function; we'll use the notation f(n) to refer to an evaluation function.
- An evaluation function tells us how promising a node is: it indicates the quality of the solution that node leads to.
- By ordering and expanding nodes according to their f value, we search the best nodes first.
- If f were perfect, we would expand a straight path from the initial state to the goal state: Arad → Sibiu → Rimnicu Vilcea → Pitesti → Bucharest.
- Of course, if we had a perfect f, we wouldn't need to solve the problem in the first place. Instead, we'll try to develop heuristic functions h(n) that help us estimate f(n).

Informing Search
- Previous algorithms were able to find solutions, but were very inefficient: an exponential number of nodes expanded.
- By taking advantage of knowledge about the problem structure, we can improve performance.
- Two caveats: we have to get knowledge about the problem from somewhere, and this knowledge has to be correct.

Best-first Search
- Best-first pseudocode:

    enqueue(initialState)
    do:
        node = dequeue()
        if goalTest(node): return node
        for child in successors(node):
            insertWith(f(child), child)

  where insertWith orders our priority queue accordingly.

Greedy Search
- Let's start with the opposite of uniform-cost search: UCS used the solution cost to date as an estimate of f; greedy search uses an estimate of distance to the goal for f.
- Rationale: always pick the node that looks like it will get you closest to the solution.
- A simple estimate of f for the Romania domain: the straight-line distance between a city and Bucharest.
- Notice that there wasn't anything in the problem description about straight-line distance, or the fact that it would be a useful estimate. We used our knowledge about the way roads generally work; this is sometimes called common-sense knowledge.

Issues with Greedy Search
- Sometimes the optimal solution to a problem involves moving away from the goal. For example, to solve the 8-puzzle, you often need to undo a partial solution.
- Greedy search has many of the same appeals and weaknesses as DFS: it expands a linear number of nodes, but is not complete or optimal.
- Its ability to cut toward a goal is appealing: can we salvage this?

A* Search
- A* search is a combination of uniform-cost search and greedy search.
- A node's f(n) = g(n) + h(n), where g(n) is the current path cost and h(n) is a heuristic estimate of the distance to the goal.
- Favors solutions that are cheap to date and also look like they'll get close to the goal.
- If h(n) satisfies certain conditions, A* is both complete (always finds a solution) and optimal (always finds the best solution).
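The f(n) = g(n) + h(n) ordering can be sketched with a priority queue. A minimal illustrative version, using a fragment of the Romania road map and the straight-line distances to Bucharest (not the course's code):

```python
import heapq

def astar(start, goal, neighbors, h):
    """A* search: expand nodes in order of f(n) = g(n) + h(n)."""
    frontier = [(h(start), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        for cost, nxt in neighbors(node):
            g2 = g + cost
            if g2 < best_g.get(nxt, float('inf')):   # keep lowest-g version
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None

# Road costs for a fragment of the Romania map, and straight-line
# distances (SLD) to Bucharest.
roads = {'Arad': [(140, 'Sibiu'), (118, 'Timisoara'), (75, 'Zerind')],
         'Sibiu': [(99, 'Fagaras'), (80, 'RimnicuVilcea')],
         'RimnicuVilcea': [(97, 'Pitesti')],
         'Fagaras': [(211, 'Bucharest')],
         'Pitesti': [(101, 'Bucharest')],
         'Timisoara': [], 'Zerind': [], 'Bucharest': []}
sld = {'Arad': 366, 'Sibiu': 253, 'Timisoara': 329, 'Zerind': 374,
       'Fagaras': 176, 'RimnicuVilcea': 193, 'Pitesti': 100, 'Bucharest': 0}

cost, path = astar('Arad', 'Bucharest', lambda n: roads[n], lambda n: sld[n])
print(cost, path)  # 418 ['Arad', 'Sibiu', 'RimnicuVilcea', 'Pitesti', 'Bucharest']
```

Note that greedy search on the same data would dequeue Fagaras before Rimnicu Vilcea (176 < 193) and return the more expensive route through Fagaras (cost 450).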
Greedy Search Example (Romania, h = straight-line distance to Bucharest)
- dequeue Arad: queue = Sibiu(253), Timisoara(329), Zerind(374)
- dequeue Sibiu: queue = Fagaras(176), Rimnicu Vilcea(193), Timisoara(329), Arad(366), Zerind(374)
- dequeue Fagaras: queue = Bucharest(0), Rimnicu Vilcea(193), Timisoara(329), Arad(366), Zerind(374)
- dequeue Bucharest: solution A → S → F → B
- We found a solution quickly, but it was not optimal. What was the problem with our approach?

A* Example: Romania
- h = straight-line distance; h(Arad) = 366.
- dequeue Arad: queue = Sibiu(140+253=393), Timisoara(118+329=447), Zerind(75+374=449)
- dequeue Sibiu: queue = Rimnicu Vilcea(220+193=413), Fagaras(239+176=415), Timisoara(447), Zerind(449), Arad(280+366=646)
- dequeue Rimnicu Vilcea: queue = Fagaras(415), Pitesti(317+100=417), Timisoara(447), Zerind(449), Craiova(366+160=526), ...
- dequeue Fagaras: queue = Pitesti(417), Timisoara(447), Zerind(449), Bucharest(450+0=450), ...
- dequeue Pitesti: queue = Bucharest(418+0=418), Timisoara(447), Zerind(449), Bucharest(450), ...
- dequeue Bucharest(418): solution A → S → RV → P → B, with cost 418.
- Bucharest first entered the queue at f=450 via Fagaras, but we do not stop until a goal node is dequeued.

Optimality of A*
- Notice that we can't discard repeated states. We could always keep the version of a state with the lowest g.
- More simply, we can also ensure that we always traverse the best path to a node first: a monotonic heuristic guarantees this.
- A heuristic is monotonic if, for every node n and each of its successors n', h(n) is less than or equal to stepCost(n, n') + h(n').
- In geometry, this is called the triangle inequality.
- SLD is monotonic. In general, it's hard to find realistic heuristics that are admissible but not monotonic.
- Corollary: if h is monotonic, then f is nondecreasing as we expand the search tree. This gives an alternative proof of optimality.
- Notice also that UCS is A* with h(n) = 0.
- A* is also optimally efficient: no other complete and optimal algorithm is guaranteed to expand fewer nodes.

Optimality of A*
- A* is optimal (finds the shortest solution) as long as our h function is admissible: it always underestimates the cost to the goal.
- Proof sketch: when we dequeue a goal state, we see g, the actual cost to reach the goal. If h underestimates, a more optimal solution would have had a smaller g + h than the current goal, and thus would have already been dequeued.
- If h overestimates the cost to the goal, it's possible for a good solution to look "bad" and get buried further back in the queue.
- In our Romania example, SLD always underestimates.

Another A* Example
- (The slides step through A* on a small graph with nodes A–H, showing the node queue at each step; the per-step queue contents are not legible in these notes. The search ends by dequeuing H with f = g = 21, h = 0; the solution is A, C, D, G, H or A, C, D, F, G, H.)

Pruning and Contours
- Topologically, we can imagine A* creating a set of contours corresponding to f values over the search space.
- A* will search all nodes within a contour before expanding beyond it.
- This allows us to prune the search space: we can chop off the portion of the search tree corresponding to Zerind without searching it.

Building Effective Heuristics
- While A* is optimally efficient, actual performance depends on developing accurate heuristics.
- Ideally, h is as close to the actual cost to the goal as possible, while remaining admissible.
- Developing an effective heuristic requires some understanding of the problem domain.

Effective Heuristics: 8-puzzle
- h1 = number of misplaced tiles. This is clearly admissible, since each misplaced tile will have to be moved at least once.
- h2 = the sum of Manhattan distances between each tile's current position and its goal position.
- Also admissible: in the best case,
we'll move each tile directly to where it should go.
- Which heuristic is better?

Artificial Intelligence Programming: Ontologies
Chris Brooks, Department of Computer Science, University of San Francisco

Knowledge Engineering
- FOL gives us an answer to how to say things; it doesn't tell us what to say. In large domains, this is the hard part.
- We need to include enough knowledge to allow our system to answer all queries of interest.
- The process of building a knowledge base is referred to as knowledge engineering.

Ontologies
- An ontology is a vocabulary that describes all the objects of interest in the domain and the relations between them.
- Ontologies allow knowledge about a domain to be shared between agents, including humans.
- Allows knowledge to be re-used more easily; allows an agent to perform inference using knowledge about the current situation.

Knowledge Engineering
- A knowledge engineering process typically consists of the following steps:
  1. Determine allowable queries and types of facts that will be available. Example: will the agent need to select actions, or just make conclusions?
  2. Gather the relevant domain knowledge.
  3. Select a vocabulary of predicates, functions and constants. This is called an ontology.
  4. Encode general knowledge about the domain using the knowledge gathered in step 2. This may require revisiting step 3.
  5. Encode specific queries or problems that are to be solved. Return to steps 3 and 4 as needed.

Vocabulary
- An ontology consists of:
  - A set of concepts or classes: Professor(Brooks); ∀x Professor(x) → USFEmployee(x)
  - Features or attributes of these classes, often called slots (or properties, or roles): Salary(Brooks, 500); Name(Brooks, Chris)
  - Restrictions on slots, sometimes called facets: ∀x,y Professor(x) ∧ Salary(x, y) → y < 1000
  - Instances of classes

Ontologies vs. OO Design
- Classes are the focus of ontology design. In many ways this looks like object-oriented design: we have classes and subclasses, and properties of classes that look like data members.
- However, slots have richer semantics than data members: a slot may attach to several classes at once; we can specify constraints on the values in a slot's range; slots can exist without being assigned to a class.

Graphical Representations
- Logic is a very powerful representation language, but it can be difficult for humans to work with.
- In addition, the lack of structure between sentences in a KB can make inference difficult.
- A semantic network is a way to graphically represent and reason about some logical concepts.

Semantic Net Example
- SisterOf(Mary, John); Female(Mary); ∀x Female(x) → Person(x); Male(John); ∀x Male(x) → Person(x); ∀x Person(x) → Mammal(x); ∀x Person(x) → DefaultNumberOfLegs(x, 2); legs(John, 1)

Inference in a Semantic Net
- Inference becomes easy in a semantic net: to answer questions about John, we follow the labeled edges emanating from the John node.
- This is the same sort of inference done by modern OO languages to resolve inheritance.
- Strengths: knowledge is easily visualized; relationships between objects are clearer.
- Weaknesses: only binary relations; no negation, disjunction, or existential quantifiers.
- We can extend semantic nets to include these features, but we lose the transparency.
- Instead, use a semantic net to model class/slot relations, and FOL (or something similar) to model rules.

Protege
- Protege is a Java-based graphical tool for constructing ontologies.
- Has plugins for Jess (a rule-based inference engine) and Jython.
- Can export data in RDF and OWL for use with the Semantic Web.

Using Protege to Build a Simple Ontology
- Let's make a simple ontology to describe the USF CS department.
- We begin by asking competency questions: questions we'd like our KB to be able to answer.
  - Who is taking CS662?
  - Which professors teach classes on Monday?
  - How many students are taking both CS662 and CS601?
  - What classes should one take before taking CS662?
  - Which professors assign the most work?
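Inference by edge-following in a semantic net can be sketched as a walk up the is-a links. The dictionary representation here is an assumption for illustration; it shows how John's local value overrides the inherited default, just as in the example above:

```python
# A tiny semantic net: is-a edges plus slot values per node.
isa = {'John': 'Male', 'Mary': 'Female', 'Male': 'Person',
       'Female': 'Person', 'Person': 'Mammal'}
slots = {'Person': {'legs': 2},   # DefaultNumberOfLegs(x, 2)
         'John': {'legs': 1}}     # legs(John, 1): local value wins

def get_slot(node, slot):
    """Follow is-a links upward until a value for `slot` is found."""
    while node is not None:
        if slot in slots.get(node, {}):
            return slots[node][slot]
        node = isa.get(node)      # move up the inheritance chain
    return None

print(get_slot('John', 'legs'))   # 1  (local value)
print(get_slot('Mary', 'legs'))   # 2  (default inherited from Person)
```

This is essentially the lookup that OO languages perform to resolve inherited attributes.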
mwp am Arti cial Intelligence Programming Neural Networks Chris Brooks Department of Computer Science University of San Francisco Neural networks Much of what we ve studied so far can be classified as symbolic Al Focus on symbols and relations between them 0 Search logic decision trees MDPs The underlying assumption is that manipulation of symbols is the key requirement for intelligent behavior Neural networks focus on subsymboic behavior Intelligent behavior emerges from the interaction of simple components Department of Computer Science Universitv of San Francisco 011 Biology vs Computer Science Q In biological neurons signals are received by dendrites and propagated to other neurons via the axon Axon from another cell 9 Signaling and firing is very complex 0 Thought and behavior are produced through the interaction of thousands of neurons Department of Computer Science Universitv of San Francisco 021 Biology vs Computer Science Q Computational neural networks are related to biological neural networks primarily by analogy Q Computational neuroscience studies the modeling of biologically plausible neurons Q Al researchers are often more interested in developing effective algorithms Q As with GAs we draw upon ideas that are successful in nature and take the parts that are useful Department of Computer Science Universitv of San Francisco 031 Computational Neural Networks 9 Neural networks are composed of nodes 9 These nodes are connected by links Q Abstraction of axons 9 Each link has an associated weight that indicates the strength of the signal 0 Each node has a nonlinear activation function 9 Governs node s output as a function of the weighted sum of its inputs Department of Computer Science Universitv of San Francisco 041 PPPPPI DP Appropriate tasks for neural learning Many attributevalue pairs Realvalued inputs Real or discrete target value Noisy or errorcontaining data Long training time OK Fast evaluation of test cases needed Ability of humans to understand 
the learned hypothesis is not important Department of Computer Science Universitv of San Francisco 051 Computational Neural Networks Bias Weight at g 1 n1 Input Input Activation Ou ut Links Function Function Outpm Liifis Q Bias unit is used to control the threshold value 0 How strong the weighted input signal must be for the node to fire Department of Computer Science Universitv of San Francisco 061 Activation functions 9 Any nonlinear function can be used in principle 9 Two most common functions are Q Step function threshold function Outputs 1 if input positive zero otherwise a Sigmoidlogistic function Q Continuously differentiable a Rapid change near threshold gradual at extremes Department of Computer Science Universitv of San Francisco 071 Examples W0 15 W0 05 W0 05 Wlk W1gt W1 1 W2 1 W2 2 1 AND OR NOT a Neural nets can easily be built to perform some standard logical operations using the threshold activation function a Change the threshold depending on the function needed Department of Computer Science Universitv of San Francisco 081 Types of nodes Q We can distinguish between three types of nodes Q Input nodes Q Output nodes Q Hidden nodes Q We can also distinguish between types of networks Q Feedforward networks signals flow in one direction no cycles Q Recurrent networks Cycles in signal propagation Q We ll focus primarily on feedforward networks Department of Computer Science Universitv of San Francisco 091 l0 Function Approximators Feedforward NNs fall into a family of algorithms called nonlinear function approximators The output ofa NN is a function of its inputs Nonlinear activation function allows the representation of complex functions By adjusting weights we change the function being represented NNs are often used to efficiently approximate complex functions from data Department of Computer Science Universitv of San Francisco 0101 Classi cation With Neural Networks 9 NNs also perform classification very well 0 Map inputs into one or more outputs 0 
Output range is split into discrete classes 0 Very useful for learning tasks where what to look for is not known 9 Face recognition handwriting recognition driving a car Department of Computer Science Universitv of San Francisco 0111 Perceptrons Q Singlelayer networks perceptrons are the simplest form of NN 9 Easy to understand but computationally limited 9 Each input unit is di rectly connected to one or more output units Units JV Units Department of Computer Science Universitv of San Francisco 0121 Perceptrons Q Output is thresholded weighted sum of the inputs Q Threshold firing function used here Q 0321 1 if we 101931 102932 10an gt O Q 1 othenNise Department of Computer Science Universitv of San Franc isco 0131 lO Representational power of perceptror Output of net 230 sz39j gt O Perceptrons are capable of representing any linearly separable function Unfortunately many common functions XOR parity are not linearly separable In the early days ofAl perceptrons were popular due to the fact that their weights could be easily learned Department of Computer Science Universitv of San Francisco 0141 Perceptron Training Algorithm inputs 2 inl in2 inj weights 2 wl w2 wn initialize each to rand O5 05 output 2 0 training examples ti tinl tol t2 tin2 toZ do for t in training examples inputs 2 tin o compute net s output with current weights E to o for w in weight w w alpha tini E while notConverged Department of Computer Science Universitv of San Francisco 0151 Perceptron Learning 9 Intuition Ifthe output signal is too high weights need to be reduced 9 Turn down weights that contributed to output a Weights with zero input are not affected 9 If output is too low weights need to be increased 0 Turn up weights that contribute to output 1 Again zeroinput weights are not affected a Learning is really an optimization of weights a Alternatively we are doing a hillclimbing search through weight space Department of Computer Science Universitv of San Francisco 0161 0 Example Let s 
suppose we want to learn the majority function with three inputs.
- Firing fn: output 1 if Σi wi xi > 0.1, 0 otherwise (bias/threshold 0.1); α = 0.1; initially all weights are 0.
- Training data (inputs → output):
  (1,0,0) → 0; (0,1,1) → 1; (1,1,0) → 1; (1,1,1) → 1; (0,0,1) → 0; (1,0,1) → 1; (0,0,0) → 0; (0,1,0) → 0
- The slides then trace the weight-update rule through these examples one at a time.
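The training loop from the slides can be sketched directly in code. This is a minimal sketch, not the course's own implementation: the threshold (0.1), learning rate (0.1), all-zero initial weights, and training set follow the majority-function example, while the variable names are my own.

```python
# Perceptron training rule on the three-input majority function.
# w_i <- w_i + alpha * x_i * (t - o), repeated until no weight changes.
data = [((1, 0, 0), 0), ((0, 1, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1),
        ((0, 0, 1), 0), ((1, 0, 1), 1), ((0, 0, 0), 0), ((0, 1, 0), 0)]

alpha, threshold = 0.1, 0.1
w = [0.0, 0.0, 0.0]

def output(x):
    # Threshold firing function: 1 if the weighted sum exceeds the threshold.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > threshold else 0

changed = True
while changed:                      # repeat passes until convergence
    changed = False
    for x, t in data:
        error = t - output(x)
        if error != 0:
            changed = True
            for i in range(3):
                w[i] += alpha * x[i] * error

print([output(x) for x, _ in data])  # [0, 1, 1, 1, 0, 1, 0, 0]: matches targets
```

With these settings the net converges after a couple of passes, which matches the slides' observation that this particular example trains quickly.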
Example (trace of one pass of weight updates; each row shows the inputs, expected output, actual output, and the new weights):

  (1,0,0) expected 0 | actual 0 | w = (0, 0, 0)       unchanged
  (0,1,1) expected 1 | actual 0 | w = (0, 0.1, 0.1)
  (1,1,0) expected 1 | actual 0 | w = (0.1, 0.2, 0.1)
  (1,1,1) expected 1 | actual 1 | w = (0.1, 0.2, 0.1)  unchanged
  (0,0,1) expected 0 | actual 0 | w = (0.1, 0.2, 0.1)  unchanged
  (1,0,1) expected 1 | actual 1 | w = (0.1, 0.2, 0.1)  unchanged
  (0,0,0) expected 0 | actual 0 | w = (0.1, 0.2, 0.1)  unchanged
  (0,1,0) expected 0 | actual 1 | w = (0.1, 0.1, 0.1)

- At this point the network is trained properly.
- In many cases we would need more iterations to converge on a solution.

Gradient Descent and the Delta rule
- What about cases where we can't learn the function exactly (the function is not linearly separable)?
- In this case we want to perform as well as possible. We'll interpret this to mean minimizing the sum of squared error:
- E = 1/2 Σ_{d∈D} (td − od)², for d in the training set.

Gradient Descent and the Delta rule
- We can visualize this as a search through a space of weights.
- Defining E in this way gives a parabolic space with a single global minimum (for linear units).
- By following the gradient in this space, we find the combination of weights that minimizes error.
- Use unthresholded output: the real number computed by the weighted sum of inputs.

Gradient Descent and the Delta rule
- Gradient descent requires us to follow the steepest slope down the error surface: we consider the derivative of E with respect to each weight.
- After derivation, we find that the update rule (called the Delta rule) is:
- Δwi = α Σ_{d∈D} (td − od) xi,d
- where D is the training set, α is the training rate, td is the expected output, and od is the actual output.

Incremental learning
- Often it is not practical to compute a global weight change for the entire training set.
- Instead, we want to update weights incrementally: observe one piece of data, then update.
- Our update rule is then: wi ← wi + α (t − o) xi
- Like the perceptron learning rule, except that unthresholded output is used.
- A smaller step size α is typically used.
- No theoretical guarantees of convergence.

Multilayer Networks
- While perceptrons have the advantage of a simple learning algorithm, their computational limitations are a problem. What if we add another (hidden) layer?
- Computational power increases: with one hidden layer we can represent any continuous function; with two hidden layers, any function.
- Problem: how to find the correct weights for hidden nodes?

Multilayer Network Example
[Figure: a feedforward network with input units, a layer of hidden units, and output units ai, connected by weights Wj,i.]

Backpropagation
- Backpropagation is an extension of the perceptron learning algorithm to deal with multiple layers of nodes.
- Nodes use the sigmoid activation function: g(input) = 1 / (1 + e^(−input))
- We will still follow the gradient, where g′ gives us the gradient.

Backpropagation
- Notation: hw(x) is the vector of network outputs; y is the desired output for a training example; aj is the output of the jth hidden unit; oi is the output of the ith output unit; ti is the target output for the ith output node.
- Output error for output node i: Δi = (ti − oi) × g(inputi) × (1 − g(inputi))
- Weight updating (hidden → output): Wj,i ← Wj,i + α × aj × Δi

Backpropagation
- Updating input → hidden weights. Idea: each hidden node is responsible for a fraction of the error Δi.
- Divide Δi according to the strength of the connection between the hidden and output node:
- Δj = g(inputj) × (1 − g(inputj)) × Σi Wj,i Δi
- Update rule for input → hidden weights: Wk,j ← Wk,j + α × inputk × Δj

Backpropagation Algorithm
The whole algorithm can be summed up as:

    while not done:
        for d in training set:
            apply inputs of d, propagate forward
            for node i in output layer:
                Delta_i = output_i * (1 - output_i) * (t_i - output_i)
            for each hidden node j:
                Delta_j = output_j * (1 - output_j) * sum_i W_j,i * Delta_i
            adjust each weight: W_j,i = W_j,i + alpha * Delta_i * input_j

Theory vs Practice
- In the definition of backpropagation, a single update for all weights is computed for all data points at once: find the update that minimizes total sum of squared error.
- Guaranteed to converge in this case.
- Problem: this is often computationally (space-)intensive: it requires creating a matrix with one row for each data point and inverting it.
- In practice, updates are done incrementally instead.

Stopping conditions
When to stop training?
- Fixed number of iterations
- Total error below a set threshold
- Convergence (no change in weights)

Comments on Backpropagation
- Also works for multiple hidden layers.
- Backpropagation is only guaranteed to converge to a local minimum: it may not find the absolute best set of weights.
- Low initial weights can help with this: they make the network act more linearly (fewer minima).
- Can also use random restart: train multiple times with different initial weights.

Momentum
- Since backpropagation is a hill-climbing algorithm, it is susceptible to getting stuck in plateaus: areas where local weight changes don't produce an improvement in the error function.
- A common extension to backpropagation is the addition of a momentum term, which carries the algorithm through minima and plateaus.
- Idea: remember the direction you were going in and, by default, keep going that way. Mathematically, this means using the second derivative.

Momentum
- Implementing momentum typically means remembering what update was done in the previous iteration.
- Our update rule becomes: Δwj,i(n) = α aj Δi + m Δwj,i(n − 1)
- To consider the effect, imagine that our new delta is zero: we haven't made any improvement. Momentum will keep the weights moving in the same direction.
- It also gradually increases the step size in areas where the gradient is unchanging. This speeds up convergence and helps push through plateaus.

Design issues
- One difficulty with neural nets is determining how to encode your problem.
- Inputs must be 1 and 0, or else real-valued numbers; same for outputs.
- Symbolic variables can be given binary encodings; more complex concepts may require care to represent correctly.

Design issues
- Like some of the other algorithms we've studied, neural nets have a number of parameters that must be tuned to get good performance:
- Number of layers, number of hidden units, learning rate, initial weights, momentum term, training regimen.
- These may require trial and error to determine. Alternatively, you could use a GA or simulated annealing to figure them out.

Artificial Intelligence Programming: Intro to Python
Chris Brooks, Department of Computer Science, University of San Francisco

What is Python?
Python is:
- High-level
- Object-oriented
- Free, open-source
- Dynamically typed
- Equipped with a large collection of utility libraries
- Garbage-collected
- Mixable: works nicely as a glue language
- Easy to learn and use

Some uses of Python
Things that Python is good at:
- System utilities: Python has hooks into standard OS-level services such as files, processes, pipes, etc.
- GUIs: Python has interfaces to Tk, Qt, and WxPython.
- Embedded integration: Python has hooks that allow it to call/be called by C and C++ code, and also COM interfaces.
- Rapid interactive development: like Lisp,
Python makes it easy to quickly put together a working program.
- Scripting: Python has utilities for CGI, database access, HTTP, sockets, FTP, POP, SMTP, etc.

Invoking Python Programs
There are four ways to run a Python program:
- Interactively
- From the command line
- As a script
- From another program

Python Implementations
- /usr/bin/python on all Unix machines and on OS X.
- IDLE and PythonWin on Windows; MacPython on Macintosh.
- Also for Amiga, PalmOS, BeOS, etc.
- BoaConstructor is an open-source Python IDE; Eclipse has a Python plugin (PyDev).

Python Program Structure
- Programs consist of modules; a module corresponds to a source file.
- Modules contain blocks of statements; statements create and process objects.

Basic Python
Python has a nice set of built-in data types:
- Numbers, strings, lists, sets, dictionaries, files.
- Using built-in types makes your code easier to write and maintain, more portable, and more efficient.

Numbers
- Numbers work like we'd expect.
- There are integers, equivalent to longs in C: 1, 313, 4000
- There are long integers, which are of unlimited size: 31111L
- There are floats, equivalent to doubles in C or Java: 1.23, 3.1e5
- There are octal and hexadecimal representations, as in C: 0155, 0x3af5
- There are complex numbers, as in Lisp, Matlab, etc.: 3.0+4j

Mathematical Operators
- Python has all the usual mathematical operators (+, -, *, /, %, ** or pow() for exponentiation, abs(), etc.).
- This makes Python a very handy desk calculator.
- Operations are coerced to the most specific type: 3 + 4.0 / 2 will produce a float.
- Common error: since variables do not have declared types, be careful of rounding: 3/2 = 1 (integer division).
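The coercion and rounding points above are easy to check at the interpreter. One caveat I should flag: the slides describe Python 2, where `3 / 2` truncates to `1`; under a modern Python 3 interpreter, `/` is true division and `//` is the truncating form, as this sketch shows.

```python
# Arithmetic behaviors discussed above, run under Python 3.
print(2 ** 10)        # exponentiation: 1024
print(abs(-7))        # built-in absolute value: 7
print(3 + 4.0 / 2)    # mixed int/float arithmetic is coerced to float: 5.0
print(3 // 2)         # floor division (the Python 2 behavior of /): 1
print(3 / 2)          # true division in Python 3: 1.5
```

The `3 // 2` vs `3 / 2` distinction is the modern spelling of the rounding gotcha the slides warn about.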
Strings
- One of Python's strong suits is its ability to work with strings.
- Strings are denoted with double quotes (as in C) or single quotes.
- s1 + s2: concatenation; s1 * 3: repetition
- s1[i]: indexing; s1[i:j]: slicing; s1[-1]: last character
- "%s parrot" % "dead": formatting
- for char in s1: iteration

Strings
- Strings are immutable sequences: to change them, we need to make a copy.
- Can't do: s1[3] = 'c'. Must do: s2 = s1[0:3] + 'c' + s1[4:]
- As in Java, making lots of copies can be very inefficient. If you need to do lots of concatenation, use join() instead.
- We'll return to efficiency issues throughout the semester.

Lists
- Python has a flexible and powerful list structure.
- Lists are mutable sequences: they can be changed in place.
- Denoted with square brackets: l1 = [1, 2, 3, 4]
- Can create nested sublists: l2 = [[1, 2], [3, 4], [5, 6, 7]]
- l1 + l2: concatenation; l1 * 4: repetition
- l1[3:5], l1[3], l1[:5]: slices and indexing
- append, extend, sort, reverse built in
- range(): create a list of integers

Dictionaries
- A dictionary is a Python hash table, or associative list: an unordered collection of arbitrary objects.
- d1 = {}: new hashtable; d2 = {'spam': 2, 'eggs': 3}
- Can index by key: d2['spam']
- Keys can be any immutable object.
- Can have nested hashtables: d3 = {'spam': 1, 'other': {'eggs': 2, 'spam': 3}}; d3['other']['spam']
- has_key(), keys(), values(), for k in keys
- Typically you'll insert/delete with: d3['spam'] = 'delicious'; del d3['spam']

Tuples
- Tuples are like immutable lists. Nice for dealing with enumerated types.
- t1 = (1, 2, 3); t2 = (1, 2, 3, 4, 5)
- Can index, slice, and take length just like lists: t1[2], t1[1:2], len(t1)
- Tuples are mostly useful when you want to have a list of a predetermined size/length. They can be nested and indexed.
- Also constant-time access to elements (fixed memory locations).
- Tuples are also very useful as keys for dictionaries.
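The dictionary and tuple operations listed above fit together in a short sketch. The names `d3` and `t1` echo the slides; the exact values are illustrative, and I use `"key" in d` rather than the Python 2 `has_key()` so the snippet runs under Python 3.

```python
# Nested dictionaries, insertion/deletion by key, and a tuple as a key.
d3 = {"spam": 1, "other": {"eggs": 2, "spam": 3}}
print(d3["other"]["spam"])        # nested indexing by key: 3

d3["spam"] = "delicious"          # insert/update by key
del d3["other"]                   # delete an entry
print("spam" in d3)               # membership test (Python 3 form): True

t1 = (1, 2, 3)                    # tuples: immutable sequences
counts = {t1: "seen"}             # immutable, so usable as a dictionary key
print(counts[(1, 2, 3)])          # seen
```

A list could not serve as the key in `counts`, since mutable objects are unhashable; this is the practical payoff of tuple immutability mentioned above.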
Files
- Since it's a scripting language, Python has a lot of support for file I/O. The operators are not too different from C's.
- outfile = file(fname, 'w') or infile = file(fname, 'r')
- 'r' is the default and can be left out.
- s = infile.read(): read the entire file into the string s
- s = infile.read(N): read at most N characters into the string s
- s = infile.readline(): read one line
- l = infile.readlines(): read the whole file into a list of strings
- Unless the file is really huge, it's fastest to read it all in at once with read() or readlines().

Files
- outfile.write(s): write the string s into the file
- outfile.writelines(L): write the list of strings L into the file
- outfile.close(): this is also done by the garbage collector

Basic Python statements
- Python uses dynamic typing: there is no need to predefine variables.
- Variables are instantiated by assigning values to them; referencing a variable before assignment is an error.
- You can assign multiple variables simultaneously: spam, eggs = eggs, spam

Python variables
Variables must:
- begin with a letter or underscore
- contain any number of letters, numbers, or underscores
- No $, %, etc.
- Case matters.
- Can't use reserved words as variables.

Printing
- We've already seen the basic print: print "hello world"
- To use a formatting string, do: print "hello %s" % "bob"
- print "%s %s" % ("hello", "world")
- To suppress the linefeed, include a trailing comma.

Conditionals
The general format for an if statement is:

    if <test1>:
        <statement1>
        <statement2>
    elif <test2>:
        <statement3>
    else:
        <statementN>

- Notice the colons after the conditionals.
- Compound statements consist of the colon followed by an indented block.
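The file operations above make a simple round trip: write a list of strings, then read them back. This sketch is self-contained (it creates its own temporary file), and it uses the Python 3 `open()` spelling rather than the older `file()` constructor from the slides.

```python
# Write then read a small file, using the write/read calls listed above.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

outfile = open(path, "w")
outfile.writelines(["spam\n", "eggs\n"])   # write a list of strings
outfile.close()

infile = open(path, "r")                   # 'r' is the default mode
lines = infile.readlines()                 # whole file as a list of strings
infile.close()
print(lines)  # ['spam\n', 'eggs\n']
```

Note that `readlines()` keeps the trailing newlines, which is why the list above contains `'spam\n'` rather than `'spam'`.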
Conditionals
- Logical tests return 1 for true and 0 for false; True and False are shorthand.
- and, or, and not are available for compound tests.

Syntax
- Indentation is used to delimit blocks.
- If you are going to be editing your code on multiple machines with different editors, you may want to configure your editor to use spaces instead of tabs.
- Statements can cross multiple lines if they're within delimiters, or if you use a backslash.

Iteration
- Python has the familiar while loop.
- One wrinkle: it has an optional else clause to execute if the test is false.
- Additional loop control:
- break: exit the innermost loop without doing an else
- continue: return to the beginning of the loop
- pass: do nothing

For
- for is a generic iterator in Python: it lets you loop over any indexable data structure.
- for item in [1, 2, 3, 4, 5]
- for char in "This is a string"
- for key in dictionary
- This provides a uniform, polymorphic interface to a set of different data structures.
- Note: it's much faster to use the polymorphic operator than to access manually.
- i.e., for item in list is faster than for i in range(len(list))
- for item in dictionary is faster than for item in dictionary.keys()

Functions
- def is used to define a function; return returns a value.
- Functions maintain local scope: names are resolved locally, then globally, then built-in.

Functions
- Multiple values can be returned in a tuple. We can also provide default arguments.
- Functions can be called with keywords: myfunc(spam=1, eggs=2)
- *args can be used to catch arbitrary argument lists and store them in a tuple.
- **kwargs can be used to catch arbitrary keyword arguments and store them in a dictionary.
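The polymorphic for loop and the function features above can be shown in one short sketch; the collected values and the `greet` function are illustrative examples of mine, not from the slides.

```python
# The same for statement iterating over a list, a string, and a
# dictionary (which yields its keys).
items = []
for item in [1, 2, 3]:
    items.append(item)
for char in "ab":
    items.append(char)
for key in {"spam": 1}:
    items.append(key)
print(items)  # [1, 2, 3, 'a', 'b', 'spam']

def greet(name, greeting="hello"):      # default argument
    return "%s %s" % (greeting, name)   # two values via % formatting

print(greet("world"))                   # hello world
print(greet("bob", greeting="hi"))      # keyword call: hi bob
```

The loop body never needed to know which data structure it was walking, which is the uniform-interface point made above.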
Artificial Intelligence Programming: Bayesian Learning
Chris Brooks, Department of Computer Science, University of San Francisco

Bayes Theorem
- Recall the definition of Bayes Theorem: P(a|b) = P(b|a)P(a) / P(b)
- Let's rewrite this a bit. Let D be the data we've seen so far, and let h be a possible hypothesis:
- P(h|D) = P(D|h)P(h) / P(D)

Learning and Classification
- An important sort of learning problem is the classification problem: placing examples into one of two or more classes.
- Should/shouldn't play tennis; spam/not spam; wait/don't wait at a restaurant.
- Classification is a supervised learning task: we have access to a set of labeled training examples.
- From these, we induce a hypothesis that describes how to determine what class an example should be in.

MAP Hypothesis
- Often we're not so interested in the particular probabilities for each hypothesis.
- Instead, we want to know: which hypothesis is most likely given the data? Which classification is the most probable?
- Is PlayTennis or ¬PlayTennis more likely?
- We call this the maximum a posteriori (MAP) hypothesis.
- In this case, we can ignore the denominator P(D) in Bayes Theorem, since it is the same for all h:
- h_MAP = argmax_{h∈H} P(D|h)P(h)

Probabilistic Learning
- Decision trees are one way to do this: they tell us the most likely classification for given data.
- They work best with tabular data where each attribute has a small number of possible values.
- What if we want to know how likely a hypothesis is? We can apply our knowledge of probability to learn a hypothesis.

MAP Hypothesis
- Advantages: simpler calculation; no need to have a prior for P(D).

ML Hypothesis
- In some cases we can simplify things even further.
- What are the priors P(h) for each hypothesis? Without any other information, we'll often assume that they're equally probable: each has probability 1/|H|.
- In this case, we can just consider the conditional probability P(D|h).
- We call the hypothesis that maximizes this conditional probability the maximum likelihood hypothesis
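The MAP computation h_MAP = argmax P(D|h)P(h) can be sketched numerically. The priors and per-hypothesis lime probabilities below are the ones used in the candy-bag example that follows; the function name and encoding are my own.

```python
# MAP updating for the candy-bag example: five hypotheses about the
# bag's lime fraction, with the priors from the slides.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]        # P(h1) .. P(h5)
lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i)

def posterior_after(n_limes):
    # P(h|D) is proportional to P(D|h)P(h); the draws are i.i.d.,
    # so the likelihood of n limes in a row is P(lime|h) ** n.
    unnorm = [p * (l ** n_limes) for p, l in zip(priors, lime_prob)]
    z = sum(unnorm)                        # the normalizer alpha
    return [u / z for u in unnorm]

post = posterior_after(1)
print(max(range(5), key=lambda i: post[i]))  # 2: h3 is the MAP hypothesis
print(posterior_after(10))                    # after 10 limes, h5 dominates
```

As the slides note, the normalizer never matters for picking the MAP hypothesis, but normalizing makes the drift toward the true hypothesis visible.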
h_ML = argmax_{h∈H} P(D|h)

Example
- How do the hypotheses change as data is observed?
- Initially we start with the priors: (0.1, 0.2, 0.4, 0.2, 0.1).
- Then we draw a lime:
- P(h1|lime) = α P(lime|h1)P(h1) = 0
- P(h2|lime) = α P(lime|h2)P(h2) = α(0.25 × 0.2) = 0.05α
- P(h3|lime) = α P(lime|h3)P(h3) = α(0.5 × 0.4) = 0.2α
- P(h4|lime) = α P(lime|h4)P(h4) = α(0.75 × 0.2) = 0.15α
- P(h5|lime) = α P(lime|h5)P(h5) = α(1.0 × 0.1) = 0.1α

Example
- Imagine that we have a large bag of candy. We want to know the ratio of cherry to lime in the bag.
- We start with 5 hypotheses: h1: 100% cherry; h2: 75% cherry, 25% lime; h3: 50% cherry, 50% lime; h4: 25% cherry, 75% lime; h5: 100% lime.
- Our agent repeatedly draws pieces of candy. We want it to correctly predict the type of the next piece of candy.

Example
- Then we draw a second lime:
- P(h1|lime, lime) = α P(lime, lime|h1)P(h1) = 0
- P(h2|lime, lime) = α P(lime, lime|h2)P(h2) = α(0.0625 × 0.2) = 0.0125α
- P(h3|lime, lime) = α P(lime, lime|h3)P(h3) = α(0.25 × 0.4) = 0.1α
- P(h4|lime, lime) = α P(lime, lime|h4)P(h4) = α(0.5625 × 0.2) = 0.1125α
- P(h5|lime, lime) = α P(lime, lime|h5)P(h5) = α(1.0 × 0.1) = 0.1α
- Strictly speaking, we don't really care what α is: we can just select the MAP hypothesis, since we only want to know the most likely hypothesis.

Example
- Let's assume our priors for the different hypotheses are (0.1, 0.2, 0.4, 0.2, 0.1).
- Also, we assume that the observations are i.i.d.: each choice is independent of the others, so order doesn't matter.
- In that case, we can multiply probabilities: P(D|hi) = Πj P(dj|hi)
- Suppose we draw 10 limes in a row: P(D|h3) is 0.5^10, since the probability of drawing a lime under h3 is 0.5.

Bayesian Learning
- Eventually the true hypothesis will dominate all others.
- Caveat: this assumes the data is noise-free, or that noise is uniformly distributed.
- Notice that we can use Bayesian learning in this case either as a batch algorithm or as an incremental algorithm: we can always easily update our hypotheses to incorporate new evidence.
- This depends on the assumption that our observations are independent.

Learning bias
- What sort of bias does Bayesian learning use?
- Typically, simpler hypotheses will have larger priors. More complex hypotheses will fit the data more exactly, but there are many more of them.
- Under these assumptions, h_MAP will be the simplest hypothesis that fits the data. This is Occam's razor again.
- Think about the deterministic case, where P(h|D) is either 1 or 0.

Bayesian Optimal Classifiers
- Suppose we have three hypotheses with posteriors h1: 0.4, h2: 0.3, h3: 0.3.
- We get a new piece of data: h1 says it's positive; h2 and h3 say negative.
- h1 is the MAP hypothesis, yet there's a 0.6 chance that the data is negative.
- By combining weighted hypotheses, we improve our performance.

Bayesian Concept Learning
- Bayesian learning involves estimating the likelihood of each hypothesis. In a more complex world, where observations are not independent, this could be difficult.
- Our first cut at doing this might be a brute-force approach:
1. For each h in H, calculate P(h|D) = P(D|h)P(h) / P(D)
2. From this, output the hypothesis h_MAP with the highest posterior probability.
- This is what we did in the example.
- Challenge: Bayes Theorem can be computationally expensive to use when observations are not i.i.d.

Bayesian Optimal Classifiers
- By combining the predictions of each hypothesis, we get a Bayesian optimal classifier.
- More formally, let's say our unseen data belongs to one of v classes. The probability P(vj|D) that our new instance belongs to class vj is:
- P(vj|D) = Σ_{h∈H} P(vj|h) P(h|D)
- Intuitively, each hypothesis gives its prediction, weighted by the likelihood that that hypothesis is the correct one.
- This classification method is provably optimal on average: no other algorithm can perform better.

Bayesian Optimal Classifiers
- There's one other problem with the formulation as we have it.
- Usually we're not so interested in the hypothesis that fits the data; instead, we want to classify some unseen data, given the data we've seen so far.
- One approach would be to just return the MAP hypothesis. We can do better, though.

Problems
- However, the Bayes optimal classifier is mostly interesting as a theoretical benchmark: in practice, computing the posterior probabilities is exponentially hard.
This problem arises when instances or data are conditionally dependent upon each other. Can we get around this?

Naive Bayes classifier
- The Naive Bayes classifier makes a strong assumption that makes the algorithm practical: each attribute of an example is independent of the others.
- P(a ∧ b) = P(a)P(b) for all a and b.
- This makes it straightforward to compute posteriors.

Example
- Recall your tennis-playing problem from the decision tree homework. We want to use the training data and a Naive Bayes classifier to classify the following instance:
- Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong

The Bayesian Learning Problem
- Given a set of labeled, multivalued examples, find a function F(x) that correctly classifies an unseen example with attributes a1, a2, ..., an.
- Call the most probable category v_MAP:
- v_MAP = argmax_{vj} P(vj | a1, a2, ..., an)
- We rewrite this with Bayes Theorem as:
- v_MAP = argmax_{vj} P(a1, a2, ..., an | vj) P(vj)
- Estimating P(vj) is straightforward with a large training set: count the fraction of the set that are of class vj.
- However, estimating P(a1, a2, ..., an | vj) is difficult unless the training set is very large: we would need to see every possible attribute combination many times.

Example

  Day  Outlook   Temperature  Humidity  Wind    PlayTennis
  D1   Sunny     Hot          High      Weak    No
  D2   Sunny     Hot          High      Strong  No
  D3   Overcast  Hot          High      Weak    Yes
  D4   Rain      Mild         High      Weak    Yes
  D5   Rain      Cool         Normal    Weak    Yes
  D6   Rain      Cool         Normal    Strong  No
  D7   Overcast  Cool         Normal    Strong  Yes
  D8   Sunny     Mild         High      Weak    No
  D9   Sunny     Cool         Normal    Weak    Yes
  D10  Rain      Mild         Normal    Weak    Yes
  D11  Sunny     Mild         Normal    Strong  Yes
  D12  Overcast  Mild         High      Strong  Yes
  D13  Overcast  Hot          Normal    Weak    Yes
  D14  Rain      Mild         High      Strong  No

Naive Bayes assumption
- Naive Bayes assumes that all attributes are conditionally independent of each other. In this case:
- P(a1, a2, ..., an | vj) = Πi P(ai | vj)
- This can be estimated from the training data.
- The classifier then picks the class with the highest probability according to this equation.
- Interestingly, Naive Bayes performs well even in cases where the conditional independence assumption fails.

Example
- Our priors are: P(PlayTennis = yes) = 9/14 = 0.64; P(PlayTennis = no) = 5/14 = 0.36.
- We can estimate:
- P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33; P(Wind = strong | PlayTennis = no) = 3/5 = 0.6
- P(Humidity = high | PlayTennis = yes) = 3/9 = 0.33; P(Humidity = high | PlayTennis = no) = 4/5 = 0.8
- P(Outlook = sunny | PlayTennis = yes) = 2/9 = 0.22; P(Outlook = sunny | PlayTennis = no) = 3/5 = 0.6
- P(Temp = cool | PlayTennis = yes) = 3/9 = 0.33; P(Temp = cool | PlayTennis = no) = 1/5 = 0.2

Example
- v_yes = P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053
- v_no = P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.0206
- So we see that not playing tennis is the maximum likelihood hypothesis. Further, by normalizing, we see that the classifier predicts a 0.80 probability of not playing tennis.

Using Naive Bayes to classify spam
- One area where Naive Bayes has been very successful is in text classification, despite the violation of independence assumptions.
- Classifying spam is just a special case of text classification: given some emails labeled ham or spam, determine the category of new, unseen documents.
- Our features will be the tokens that appear in a document. Based on these, we'll predict a category.

Estimating Probabilities
- As we can see from this example, estimating probabilities through frequency is risky when our dataset is small. We only have 5 negative examples, so we may not have an accurate estimate.
- A better approach is to use the following formula, called the m-estimate:
- (nc + m·p) / (n + m)
- where nc is the number of individuals with the characteristic of interest (say, Wind = strong), n is the total number of positive/negative examples, p is our prior estimate, and m is a constant called the equivalent sample size.

Classifying spam
- Naive Bayes is only one possible way to classify spam.
- Rule-based systems (SpamAssassin); examining headers (broken From or Content-Type); blacklists; challenge-response.

Estimating Probabilities
- m determines how heavily to weight p; p is assumed to be uniform.
- So in the tennis example: P(Wind = strong | PlayTennis) = (3 + m·p) / (9 + m), where we determine m based on sample size.
- If m is zero, we just use
observed data a lfm gtgt n we use the pror 9 Otherwise m lets us weight these parameters relative influence Classifying spam with Naive Bayes a Naive Bayes has several properties that make it nice as a spam classifier a We don t need to encode specific rules a We can adapt as the types of spam chan e a Somewhat robust to spammers adding in extra text Arti cial Intelligence Programming Constraint Satisfaction Chris Brooks Department of Computer Science University of San Francisco Constraint Satisfaction Q So far we ve focused on using search to find the best solution Q In many cases you just need to find a solution that satisfies some criteria Q These criteria are called constraints Q A problem in which we want to find any solution that satisfies our constraints is a constraint satisfaction probem Q Constraints provide us with additional knowledge about the problem that we can exploit Q We can also consider optimizing constrained problems Department of Computer Science Universitv of San Francisco 011 Examples 9 Toy Problems 0 Map coloring Nqueens cryptarithmetic 0 Real life problems 0 Scheduling register allocation resource allocation Department of Computer Science Universitv of San Francisco 021 Formalizing a CSP Q We can model a CSP as a set of variables 1 2 xn Q Each variable as a domain of possible values D17D27 Q We also have a set of constraints 01 C2 Q Unary constraints a lt 10 ym0d2 0 etc Q Binary constraints a lt y a y lt 50 no two adjacent squares can be the same color etc Q Nary constraints x1 x2 azn 75 weight of chassis plus engine plus body lt 3000 lbs etc Q An assignment of values to variables that satisfies all constraints is called a consistent solution Q We might might also have an objective function y fa1 xn that lets us compare solutions Department of Computer Science Universitv of San Francisco 031 Approaches If the domain of all variables is continuous ie real numbers and constraints are all linear functions we can use linear programming to 
solve the problem.
  - Express the problem as a system of equations.
- In other cases we can use dynamic programming; dynamic programming is a form of search.
- In the most general case, we can express a CSP as a search problem.

### Solving CSPs with search
- We begin with an initial state: no values assigned to x1, x2, ..., xn.
- An action assigns a value to a variable.
- A goal is an assignment to each variable such that all constraints are met.
- The successor function returns all possible single assignments such that constraints are still met.
- Notice that our solution for this sort of problem is the goal state, as opposed to a path through the state space.

### Constraint graph
- Often the most difficult part of solving a CSP is formulating the problem.
- It is often useful to visualize the CSP as a constraint graph.
- Nodes represent variables; edges represent binary constraints.
- For n-ary constraints, we must add nodes that represent combinations of values.

### Example: Coloring a Bay Area map
- Can we color this map using only four colors (R, Y, B, G) so that no adjacent counties have the same color?
- Initially, pick a color for SF: Red.
- Our successor function will return all possible colorings that don't violate the constraint: Marin = Blue, Solano = Red, Alameda = Yellow, Napa = Yellow, etc.
- We can be more clever about how to approach this problem.
- CSPs are commutative: it doesn't matter in which order we assign colors to counties.
- Therefore, we should consider one assignment at a time.
- For the moment, let's start by always assigning a color to the county with the smallest domain:
  - SF = Red
  - SF = Red, Marin = Blue
  - SF = Red, Marin = Blue, Alameda = Yellow
  - SF = Red, Marin = Blue, Alameda = Yellow, CC = Green
  - etc.
- In this case we can find a consistent four-coloring. What about a 3-coloring?

### Example: Australia
- Three-coloring the map of Australia with Red, Green, and Blue. (The map figure shows the states: WA, NT, SA, Queensland, NSW, Victoria, Tasmania.)
- Let's draw the search tree. Initially we color Q = Red. What are the possible successors?
- Neighbors of Q have a domain of {Green, Blue} (the smallest).
- Choose a neighbor and select a color that satisfies all constraints: NSW = Green.
- Now SA has a domain of size 1: {Blue}. Coloring SA then fixes the choices for V and NT.

### Heuristics
- How do we pick which country to color next? How do we choose what color to give it?
- Intuition: always try to make decisions that leave as much flexibility as possible.
- Most constrained variable: pick the variable (country) that has the fewest possible choices.
- Least constraining value: pick the value (color) that has the least effect on possible values for other variables.

### Backtracking
- In the previous example we were fortunate: we never made a bad choice.
- What if we had colored Q = Red, NSW = Green, V = Blue?
- Usually, when solving a CSP, there are times when you have to undo a bad choice. This is called backtracking.

### Example: 8-queens
- Problem: place each queen on the board so that no queen is attacking another.
- We can reduce this to: what row should each queen be in?

### Chronological Backtracking
- The easiest approach is to use depth-first search.
- If we reach a point
where we can't place a queen, back up and undo the most recent placement.
- If that placement can't be changed, undo its predecessor.
- Always undo assignments in reverse order of when they were made. This is called chronological backtracking.
- Problem: we make a bad decision early on that isn't noticed until later.
- In the 4-queens problem, there is no solution that has the first queen in the top row.
- We'll spend a lot of time trying different combinations for queens 2 and 3, even though there is no possible solution that can be reached.
- How can we better identify which decision is causing a constraint violation?

### Backjumping
- What we'd like to do is identify those variables that are causing a problem and change them directly.
- To do this, when we reach a variable for which we cannot find a consistent value, we look for all variables that it shares a constraint with. We call these variables the conflict set.
- We then unroll our search and undo the most recently set variable in the conflict set.

### Example: Australia
- Let's say our search algorithm did the following coloring: Q = Red, NSW = Green, V = Blue, T = Red.
- There is no consistent color for SA.
- Backtracking will try all other colors for T, which cannot possibly help.
- The conflict set for SA is {Q, NSW, V}.
- We backjump and undo V. Once V = Red, we can color SA.

### CSP summary
- Many interesting real-world problems can be formulated as CSPs: scheduling, resource allocation, design, etc.
- CSPs can be solved using DFS: the successor function chooses an unassigned variable and gives it a value.
- Problem structure can be exploited to guide search: heuristics, intelligent backtracking.

### Branch-and-Bound Search
- We can also use heuristics to prune portions of a search tree.
- In many problems we need to search for an optimal solution in a depth-first manner: laying out components on a chip, finding the fastest schedule, TSP.
- How can we identify regions of the search space that can't possibly lead to an optimal solution?

### Applying heuristics
- If we have an admissible heuristic, we can identify branches that cannot lead to a solution.
- Search down the leftmost branch of the search tree and find the first candidate solution.
- As we search down the remaining branches, we apply our heuristic at each step.
- If we reach a point where the f-cost of an incomplete solution is higher than the previous best solution, we can discard the branch (assuming we want a minimal solution).
- Since h underestimates the true cost, the true solution cost will be at least as high as f.
- If f is already worse than what we've seen, there's no point in expanding this branch.
- This approach is used extensively in game-playing search to eliminate branches that can't lead to a winning strategy.
- Challenge: branch-and-bound is most effective when very good solutions are found early in the search. Why is this?
- This leads to the development of heuristics for selecting the order in which nodes are placed on the stack in DFS: the most promising node is at the top.

## Artificial Intelligence Programming: Bayesian Learning
Chris Brooks, Department of Computer Science, University of San Francisco

### Learning and Classification
- An important sort of learning problem is the classification problem.
- This involves placing examples into one of two or more classes:
  - Should/shouldn't get credit
  - Categories of documents
  - Golf-playing/non-golf-playing days
- This requires access to a set of labeled training examples, which allow us to induce a hypothesis that
describes how to decide what class an example should be in.

### Bayes' Theorem
- Recall Bayes' Theorem: P(b|a) = P(a|b)P(b) / P(a)
- Let's rewrite this a bit. Let D be the data we've seen so far, and let h be a possible hypothesis. Then:
  P(h|D) = P(D|h)P(h) / P(D)

### MAP Hypothesis
- Usually we won't be so interested in the particular probabilities for each hypothesis.
- Instead we want to know: which hypothesis is most likely, given the data? Which classification is the most probable?
- Is PlayTennis = yes or PlayTennis = no more likely?
- We call this the maximum a posteriori (MAP) hypothesis.
- In this case we can ignore the denominator in Bayes' Theorem, since it will be the same for all h:
  h_MAP = argmax_{h in H} P(D|h)P(h)
- Advantages: a simpler calculation, and no need to have a prior for P(D).

### ML Hypothesis
- In some cases we can simplify things even further.
- What are the priors P(h) for each hypothesis? Without any other information, we'll often assume that they're equally probable: each has probability 1/|H|.
- In this case we can just consider the conditional probability P(D|h).
- We call the hypothesis that maximizes this conditional probability the maximum likelihood (ML) hypothesis:
  h_ML = argmax_{h in H} P(D|h)

### Example
- Imagine that we have a large bag of candy. We want to know the ratio of cherry to lime in the bag.
- We start with 5 hypotheses:
  1. h1: 100% cherry
  2. h2: 75% cherry, 25% lime
  3. h3: 50% cherry, 50% lime
  4. h4: 25% cherry, 75% lime
  5. h5: 100% lime
- Our agent repeatedly draws pieces of candy. We want it to correctly predict the type of the next piece.
- Let's assume our priors for the different hypotheses are (0.1, 0.2, 0.4, 0.2, 0.1).
- Also, we assume that the observations are i.i.d.: each choice is independent of the others, and order doesn't matter. In that case we can multiply probabilities:
  P(D|hi) = Π_j P(dj|hi)
- Suppose we draw 10 limes in a row. P(D|h3) is (0.5)^10, since the probability of drawing a lime under h3 is 0.5.

### Example: updating on data
- How do the hypotheses change as data is observed?
- Initially we start with the priors: (0.1, 0.2, 0.4, 0.2, 0.1).
- Then we draw a lime:
  - P(h1|lime) = α P(lime|h1)P(h1) = 0
  - P(h2|lime) = α P(lime|h2)P(h2) = α (0.25)(0.2) = 0.05α
  - P(h3|lime) = α P(lime|h3)P(h3) = α (0.5)(0.4) = 0.2α
  - P(h4|lime) = α P(lime|h4)P(h4) = α (0.75)(0.2) = 0.15α
  - P(h5|lime) = α P(lime|h5)P(h5) = α (1)(0.1) = 0.1α
  - α = 2
- Then we draw a second lime:
  - P(h1|lime, lime) = α P(lime, lime|h1)P(h1) = 0
  - P(h2|lime, lime) = α (0.25)²(0.2) = 0.0125α
  - P(h3|lime, lime) = α (0.5)²(0.4) = 0.1α
  - P(h4|lime, lime) = α (0.75)²(0.2) = 0.1125α
  - P(h5|lime, lime) = α (1)²(0.1) = 0.1α
  - α ≈ 3.08
- Strictly speaking, we don't really care what α is. We can just select the MAP hypothesis, since we just want to know the most likely hypothesis.

### Bayesian Learning
- Eventually the true hypothesis will dominate all others.
  - Caveat: assuming the data is noise-free, or noise is uniformly distributed.
- Notice that we can use Bayesian learning in this case either as a batch algorithm or as an incremental algorithm: we can always easily update our hypotheses to incorporate new evidence.
- This depends on the assumption that our observations are independent.

### Learning bias
- What sort of bias does Bayesian learning use?
- Typically, simpler hypotheses will have larger priors. More complex hypotheses will fit the data more exactly, but there are many more of them.
- Under these assumptions, h_MAP will be the simplest hypothesis that fits the data. This is Occam's razor again.
- Think about the
deterministic case, where P(h|D) is either 1 or 0.

### Bayesian Concept Learning
- Bayesian learning involves estimating the likelihood of each hypothesis.
- In a more complex world where observations are not independent, this could be difficult.
- Our first cut at doing this might be a brute-force approach:
  1. For each h in H, calculate P(h|D) = P(D|h)P(h) / P(D).
  2. From this, output the hypothesis h_MAP with the highest posterior probability.
- This is what we did in the example.
- Challenge: Bayes' Theorem can be computationally expensive to use when observations are not i.i.d.

### Bayesian Optimal Classifiers
- There's one other problem with the formulation as we have it.
- Usually we're not so interested in the hypothesis that fits the data. Instead, we want to classify some unseen data, given the data we've seen so far.
- One approach would be to just return the MAP hypothesis. We can do better, though.
- Suppose we have three hypotheses with posteriors h1 = 0.4, h2 = 0.3, h3 = 0.3.
- We get a new piece of data: h1 says it's positive; h2 and h3 say it's negative.
- h1 is the MAP hypothesis, yet there's a 0.6 chance that the data is negative.
- By combining weighted hypotheses, we improve our performance.
- By combining the predictions of each hypothesis, we get a Bayesian optimal classifier.
- More formally, let's say our unseen data belongs to one of a set of classes. The probability P(vj|D) that our new instance belongs to class vj is:
  P(vj|D) = Σ_{hi in H} P(vj|hi) P(hi|D)
- Intuitively, each hypothesis gives its prediction, weighted by the likelihood that that hypothesis is the correct one.
- This classification method is provably optimal: on average, no other algorithm can perform better.

### Problems with the Bayes Optimal Classifier
- However, the Bayes optimal classifier is mostly interesting as a theoretical benchmark.
- In practice, computing the posterior probabilities is exponentially hard.
- This problem arises when instances or data are conditionally dependent upon each other.
- Can we get around this?

### Naive Bayes classifier
- The Naive Bayes classifier makes a strong assumption that makes the algorithm practical: each attribute of an example is independent of the others.
  - P(a, b) = P(a)P(b) for all a and b.
- This makes it straightforward to compute posteriors.

### The Bayesian learning problem
- Given a set of labeled, multivalued examples, find a function F(a) that correctly classifies an unseen example with attributes a1, a2, ..., an.
- Call the most probable category v_MAP:
  v_MAP = argmax_{vj in V} P(vj | a1, a2, ..., an)
- We rewrite this with Bayes' Theorem as:
  v_MAP = argmax_{vj in V} P(a1, a2, ..., an | vj) P(vj)
- Estimating P(vj) is straightforward with a large training set: count the fraction of the set that are of class vj.
- However, estimating P(a1, a2, ..., an | vj) is difficult unless our training set is very large: we need to see every possible attribute combination many times.

### Naive Bayes assumption
- Naive Bayes assumes that all attributes are conditionally independent of each other. In this case:
  P(a1, a2, ..., an | vj) = Π_i P(ai | vj)
- This can be estimated from the training data.
- The classifier then picks the class with the highest probability according to this equation.
- Interestingly, Naive Bayes performs well even in cases where the conditional independence assumption fails.

### Example
- Recall the tennis-playing problem from the decision tree homework.
- We want to use the training data and a Naive Bayes classifier to classify the following instance:
  Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong
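The computation worked through on the next slides can be sketched in a few lines of Python. This is an illustration, not part of the original notes: the 14 training examples are hard-coded from the course's table (shown next), and `naive_bayes_scores` is a name invented here. Every probability is estimated by simple frequency counts, exactly as the slides do.

```python
from collections import Counter

# The 14 training days from the course table:
# (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def naive_bayes_scores(instance):
    """Return {class: P(class) * prod_i P(attr_i = value_i | class)},
    with every probability estimated by frequency in `data`."""
    class_counts = Counter(row[-1] for row in data)
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / len(data)          # prior P(cls)
        for i, value in enumerate(instance):
            n_match = sum(1 for row in data
                          if row[-1] == cls and row[i] == value)
            score *= n_match / n_cls       # conditional P(attr_i | cls)
        scores[cls] = score
    return scores

scores = naive_bayes_scores(("Sunny", "Cool", "High", "Strong"))
# scores["Yes"] is about 0.0053 and scores["No"] about 0.0206, matching
# the slides; normalizing gives roughly a 0.80 probability of "No".
```

Note that with counts this small, a single zero frequency zeroes out the whole product for a class; that is exactly the weakness the m-estimate on the following slides addresses.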
| Day | Outlook | Temperature | Humidity | Wind | PlayTennis |
|-----|----------|-------------|----------|--------|------------|
| D1  | Sunny    | Hot  | High   | Weak   | No  |
| D2  | Sunny    | Hot  | High   | Strong | No  |
| D3  | Overcast | Hot  | High   | Weak   | Yes |
| D4  | Rain     | Mild | High   | Weak   | Yes |
| D5  | Rain     | Cool | Normal | Weak   | Yes |
| D6  | Rain     | Cool | Normal | Strong | No  |
| D7  | Overcast | Cool | Normal | Strong | Yes |
| D8  | Sunny    | Mild | High   | Weak   | No  |
| D9  | Sunny    | Cool | Normal | Weak   | Yes |
| D10 | Rain     | Mild | Normal | Weak   | Yes |
| D11 | Sunny    | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High   | Strong | Yes |
| D13 | Overcast | Hot  | Normal | Weak   | Yes |
| D14 | Rain     | Mild | High   | Strong | No  |

- Our priors are:
  - P(PlayTennis = yes) = 9/14 = 0.64
  - P(PlayTennis = no) = 5/14 = 0.36
- We can estimate:
  - P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
  - P(Wind = strong | PlayTennis = no) = 3/5 = 0.6
  - P(Humidity = high | PlayTennis = yes) = 3/9 = 0.33
  - P(Humidity = high | PlayTennis = no) = 4/5 = 0.8
  - P(Outlook = sunny | PlayTennis = yes) = 2/9 = 0.22
  - P(Outlook = sunny | PlayTennis = no) = 3/5 = 0.6
  - P(Temp = cool | PlayTennis = yes) = 3/9 = 0.33
  - P(Temp = cool | PlayTennis = no) = 1/5 = 0.2
- v(yes) = P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.005
- v(no) = P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.0206
- So we see that not playing tennis is the maximum likelihood hypothesis. Further, by normalizing, we see that the classifier predicts a 0.80 probability of not playing tennis.

### Estimating Probabilities
- As we can see from this example, estimating probabilities through frequency is risky when our data set is small.
- We only have 5 negative examples, so we may not have an accurate estimate.
- A better approach is to use the following formula, called an m-estimate:
  (nc + m·p) / (n + m)
- Here nc is the number of examples with the characteristic of interest (say, Wind = strong), n is the total number of positive/negative examples, p is our prior estimate, and m is a constant called the equivalent sample size.
- m determines how heavily to weight p; p is assumed to be uniform.
- So in the tennis example: P(Wind = strong | PlayTennis = no) = (3 + m·p) / (5 + m).
- We'll determine an m based on sample size.
- If m is zero, we just use the observed data. If m >> n, we use the prior. Otherwise, m lets us weight these parameters' relative influence.

### Using Naive Bayes to classify spam
- One area where Naive Bayes has been very successful is in text classification, despite the violation of independence assumptions.
- Classifying spam is just a special case of text classification.
- Problem: given some emails labeled ham or spam, determine the category of new and unseen documents.
- Our features will be the tokens that appear in a document. Based on these, we'll predict a category.
- For a given email, we'll want to compute the MAP hypothesis; that is, is P(spam | t1, t2, ..., tn) greater than P(ham | t1, t2, ..., tn)?
- We can use Bayes' rule to rewrite these as:
  - α P(t1, t2, ..., tn | spam) P(spam)
  - α P(t1, t2, ..., tn | ham) P(ham)
- We can then use the Naive Bayes assumption to rewrite these as:
  - α P(t1|spam) P(t2|spam) ... P(tn|spam) P(spam)
  - α P(t1|ham) P(t2|ham) ... P(tn|ham) P(ham)
- And this we know how to compute.
- We can get the conditional probabilities by counting tokens in the training set. We can get the priors from the training set, or through estimation.

## Artificial Intelligence Programming: Information Retrieval
Chris Brooks, Department of Computer Science, University of San Francisco

### Processing text
- Now that we know a little bit about how to consider and compare different states, we'll think about a problem with a harder representation: English-language text, specifically web pages.
- We'll look at several different approaches to automatically summarizing and understanding documents.

### Difficulties
- What makes working with natural language hard?
  - Non-unique parses
  - Synonyms and multiple meanings
  - Anaphora
  - Slang and technical terms
  - Analogy and metaphor
  - Misspelling and incorrect grammar
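Tying the spam-classification scheme from the Naive Bayes section above together, here is a minimal Python sketch. It is not from the original notes: the four-message corpus and the function names are invented for illustration. It smooths the token conditionals with the m-estimate (nc + m·p)/(n + m) from the notes, with a uniform prior p = 1/|vocabulary|, and sums logs to avoid floating-point underflow on long emails.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration: (tokens, label)
train = [
    ("win money now".split(), "spam"),
    ("cheap money win".split(), "spam"),
    ("meeting agenda attached".split(), "ham"),
    ("lunch meeting today".split(), "ham"),
]

def train_nb(docs, m=1.0):
    """Count token frequencies per class; return priors and a smoothed
    conditional estimator using the m-estimate (nc + m*p) / (n + m)."""
    vocab = {t for tokens, _ in docs for t in tokens}
    p = 1.0 / len(vocab)                       # uniform prior over tokens
    priors, token_counts, totals = {}, {}, {}
    for label in {lab for _, lab in docs}:
        class_docs = [tokens for tokens, lab in docs if lab == label]
        priors[label] = len(class_docs) / len(docs)
        counts = Counter(t for tokens in class_docs for t in tokens)
        token_counts[label] = counts
        totals[label] = sum(counts.values())

    def cond(token, label):
        return (token_counts[label][token] + m * p) / (totals[label] + m)

    return priors, cond

def classify(tokens, priors, cond):
    """Pick the MAP class, summing log-probabilities."""
    return max(priors, key=lambda lab: math.log(priors[lab]) +
               sum(math.log(cond(t, lab)) for t in tokens))

priors, cond = train_nb(train)
label = classify("win cheap money".split(), priors, cond)  # -> "spam"
```

Unseen tokens still get the nonzero smoothed estimate m·p/(n + m), so a single novel word never vetoes a class outright.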
