Continuation of Reinforcement Learning: Q-Learning, Instance-Based Learning, Case-Based Learning
Victor Lesser — CMPSCI 683, Fall 2004

The reinforcement-learning framework

In RL the environment is usually modeled as an MDP, defined by:
- S: the set of states of the environment
- A(s): the set of actions possible in state s ∈ S
- P(s'|s,a): the probability of a transition from s to s' given action a
- R(s,a,s'): the expected reward on the transition from s to s' given a
- γ: the discount rate for delayed reward
- discrete time t = 0, 1, 2, ...: the agent observes s_t, takes a_t, receives r_t, and moves to s_{t+1}
The goal is to find a policy π (it could be stochastic) that maximizes the value (utility, expected future reward) of each state s.

Value functions

These are called value functions (cf. evaluation functions in AI). There exist optimal value functions
  V*(s) = max_π V^π(s)    Q*(s,a) = max_π Q^π(s,a)
and corresponding optimal policies
  π*(s) = argmax_a Q*(s,a)
π* is the greedy policy with respect to Q*.

The RL agent

From experience the agent builds predictions (the value functions V and Q) and uses them to select actions (the policy π). Learning is continual and online: acting and learning happen simultaneously. Policy evaluation (making the value function consistent with the current policy) and policy improvement (making the policy greedy with respect to the current value function) proceed together.

Learning the utilities of states

Given a Markov model of the environment --
- P(s'|s,a): the probability of a transition from s to s' given a
- R(s,a,s'): the expected reward on the transition from s to s' given a
- states with probabilistic actions; terminal states have rewards/utilities
-- the problem is to learn the expected utility of each state. Percepts tell you: state, reward, and whether the state is terminal.

Naive updating: the LMS approach

Developed in the late 1950's in the area of adaptive control theory. Just keep a running average of rewards for each state: for each training sequence, compute the reward-to-go of each state in the sequence and update the utilities, accumulating reward as you go back. This generates the utility estimates that minimize the mean square error (an LMS update).

A training sequence is an instance of world transitions from an initial state to a terminal state. The additive utility assumption: the utility of a sequence is the sum of the rewards over the states of the sequence. Under this assumption, the utility of a state is the expected reward-to-go of that state. With n_i the number of times state i has been visited and R_i the observed reward-to-go, the running average is
  U(i) ← (n_i · U(i) + R_i) / (n_i + 1),   n_i ← n_i + 1

Naive updating converges very slowly because it ignores the relationship between neighboring states. The utilities of neighboring states are mutually constrained:
  U(i) = R(i) + Σ_j P_ij U(j)
Dynamic programming can solve this system of equations (one equation per state); value iteration initializes the utilities based on the rewards and repeatedly updates all values using the above equation.

Temporal-difference (TD) learning

Approximate the constraint equations without solving them for all states. Modify U(i) whenever we see a transition from i to j, using the rule
  U(i) ← U(i) + α (R(i) + U(j) − U(i))
The modification moves U(i) closer to satisfying the original equation. Rewriting this with a discount rate (Sutton, 1988):
  V(s) ← V(s) + α [r + γ V(s') − V(s)]
where r + γ V(s') − V(s) is the TD error; equivalently,
  V(s) ← (1 − α) V(s) + α [r + γ V(s')]
After each state/action/reward step, the state's estimate is moved a fraction α toward the one-step sample of its value.

Example: learning to play tic-tac-toe (assume an imperfect opponent, one that sometimes makes mistakes)

1. Make a table with one entry per state, holding the current estimate of the probability of winning from that state: 1 for states in which we have won, 0 for states in which we have lost or drawn, and 0.5 initially for everything else.
2. Now play lots of games. To pick our moves, look ahead one step: consider the state each legal move would produce and usually pick the move leading to the state with the highest estimated value (the greedy move); occasionally pick a random exploratory move instead.
3. With s the state before our greedy move and s' the state after the opponent's reply, back up the value of s toward the value of s':
  V(s) ← V(s) + α [V(s') − V(s)]
Exploratory moves do not cause backups.

Kinds of backups

TD methods incrementally compute shallow sample backups: a state is updated from a single sampled successor. A dynamic programming backup is a full backup over all possible successor states, which requires a model. Monte Carlo methods use deep backups, waiting until the end of an episode. Blending shallow and deep backups gives the TD(λ) family discussed later.
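To make the TD update concrete, here is a minimal tabular TD(0) sketch in Python. The function name and the episode format (a list of (state, reward, next-state) transitions) are illustrative assumptions, not something given in the lecture:

```python
from collections import defaultdict

def td0(episodes, alpha=0.1, gamma=0.9):
    """Tabular TD(0): after each observed transition (s, r, s'),
    move V(s) a fraction alpha toward the one-step sample r + gamma*V(s')."""
    V = defaultdict(float)                  # utility estimate per state, initially 0
    for episode in episodes:                # each episode: [(s, r, s_next), ...]
        for s, r, s_next in episode:
            td_error = r + gamma * V[s_next] - V[s]
            V[s] += alpha * td_error        # V(s) <- V(s) + alpha * TD error
    return dict(V)
```

Run over many episodes, this propagates reward backward one step per visit, which is how TD exploits the neighbor constraints without ever solving the full equation system.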
Using V* to select actions

Choose the best action from any state s using the learned V*:
  π(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
A problem: this works well if the agent knows the transition function δ: S × A → S and the reward function r: S × A → ℝ. But when it doesn't, it can't choose actions this way. How much do we need to know to learn V*?

[Figure: a simple grid world showing the immediate reward values r(s,a), the resulting values V*(s), and one optimal policy.]

The Q function

Define a new function very similar to V*:
  Q(s,a) ≡ r(s,a) + γ V*(δ(s,a))
If the agent learns Q, it can choose the optimal action even without knowing r or δ:
  π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))] = argmax_a Q(s,a)
Q is the evaluation function the agent will learn.

A training rule to learn Q

Note that Q and V* are closely related:
  V*(s) = max_{a'} Q(s,a')
which allows us to write Q recursively as
  Q(s_t,a_t) = r(s_t,a_t) + γ V*(δ(s_t,a_t)) = r(s_t,a_t) + γ max_{a'} Q(s_{t+1},a')
Let Q̂ denote the learner's current approximation to Q, and consider the training rule
  Q̂(s,a) ← r + γ max_{a'} Q̂(s',a')
where s' is the state resulting from applying action a in state s, and a' ranges over the actions available in s'.

Q-learning for deterministic worlds

For each s, a initialize the table entry Q̂(s,a) ← 0. Observe the current state s. Do forever:
- Select an action a and execute it.
- Receive the immediate reward r.
- Observe the new state s'.
- Update the table entry for Q̂(s,a) as follows: Q̂(s,a) ← r + γ max_{a'} Q̂(s',a')
- s ← s'.

[Figure: one Q-learning backup in a grid world: the Q̂ entry for the executed action is refined using the received reward plus γ times the largest Q̂ value of the successor state.]

The nondeterministic case

What if reward and next state are non-deterministic? We redefine V and Q by taking expected values:
  V^π(s) ≡ E[Σ_{t≥0} γ^t r_t]
  Q(s,a) ≡ E[r(s,a) + γ V*(δ(s,a))]
Q-learning generalizes to nondeterministic worlds. Alter the training rule to
  Q̂_n(s,a) ← (1 − α_n) Q̂_{n−1}(s,a) + α_n [r + γ max_{a'} Q̂_{n−1}(s',a')]
where α_n = 1 / (1 + visits_n(s,a)). Convergence of Q̂ to Q can still be proved (Watkins and Dayan, 1992).

Temporal-difference learning: TD(λ)

Q-learning reduces the discrepancy between successive Q estimates one step at a time.
One-step time difference:
  Q^(1)(s_t,a_t) ≡ r_t + γ max_a Q̂(s_{t+1},a)
Two-step time difference:
  Q^(2)(s_t,a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2},a)
n-step time difference:
  Q^(n)(s_t,a_t) ≡ r_t + γ r_{t+1} + ... + γ^(n−1) r_{t+n−1} + γ^n max_a Q̂(s_{t+n},a)
Blend all of these:
  Q^λ(s_t,a_t) ≡ (1 − λ) [Q^(1)(s_t,a_t) + λ Q^(2)(s_t,a_t) + λ² Q^(3)(s_t,a_t) + ...]
The closer λ is to 1, the more important the later differences are. An equivalent recursive expression:
  Q^λ(s_t,a_t) = r_t + γ [(1 − λ) max_a Q̂(s_{t+1},a) + λ Q^λ(s_{t+1},a_{t+1})]
The TD(λ) algorithm uses the above training rule. It sometimes converges faster than Q-learning, and it converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992). Tesauro's TD-Gammon uses this algorithm.

Model-based vs. model-free learning

Is it better to learn a model and a utility function, or to learn an action-value function with no model? This is a fundamental question in AI, where much of the research is based on a knowledge-based approach. Some researchers claim that the availability of model-free methods such as Q-learning makes the knowledge-based approach unnecessary or too complex.

Exploration vs. exploitation

Problem: always choosing the actions with the highest expected utility ignores their contribution to learning. There is a trade-off between immediate good and long-term good: exploration vs. exploitation. A random-walk agent learns faster but never uses that knowledge; a greedy agent learns very slowly and acts on its current, inaccurate knowledge.

One heuristic: give some weight to actions that have not been tried very often in a given state, but counter that with the knowledge that their utility may be low. The key idea is that in the early stages of learning the estimates can be unrealistically low. This is similar to simulated annealing: in the early phase of search, the agent is more willing to explore.
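Pulling these pieces together, here is a sketch of the tabular Q-learning loop with ε-greedy action selection as one simple answer to the exploration/exploitation trade-off just discussed. The environment interface (reset/step) is a hypothetical one loosely styled after common RL toolkits, and a fixed learning rate α stands in for the decaying α_n above:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.
    Assumed interface: env.reset() -> s, env.step(a) -> (r, s_next, done)."""
    Q = defaultdict(float)                    # Q[(s, a)], initially 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:         # explore: occasionally try
                a = random.choice(actions)    # an action at random
            else:                             # exploit: greedy w.r.t. Q
                a = max(actions, key=lambda act: Q[(s, act)])
            r, s_next, done = env.step(a)
            # Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]
            target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```

With α = 1 this reduces to the deterministic training rule; the α-weighted form is the one that also covers stochastic rewards and transitions.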
Function approximation

When there are too many states for a table, Q̂ can be defined as a weighted sum of state features, or by a neural net; the previous equations are adjusted to update the weights rather than one table entry per state. One can use a different neural network for each action. This approach was used very successfully in TD-Gammon's neural network. A continuous state space can also be discretized, as in the pole-balancing example (1968).

The generality of reinforcement learning

- Time steps need not refer to fixed intervals of real time.
- Actions can be low-level (e.g., voltages to motors), high-level (e.g., accept a job offer), mental (e.g., a shift in the focus of attention), etc.
- States can be low-level sensations, or they can be abstract and symbolic, based on memory, or subjective (e.g., the state of being surprised or lost).

Instance-based learning

Encode specific experiences in memory, rather than abstractions, and carry out generalization at retrieval time rather than storage time ("lazy learning"). In their most general form, instance-based methods:
- retrieve, based on a partial match under a similarity metric, a set of cases/instances most relevant to the present context; and
- adapt the retrieved experiences to the new situation, using anything from a simple k-nearest-neighbor classification to chains of reasoning.

k-nearest neighbor

Key idea: just store all training examples ⟨x_i, f(x_i)⟩.
- Nearest neighbor: given a query instance x_q, first locate the nearest training example x_n, then estimate f̂(x_q) ← f(x_n).
- k-nearest neighbor: given x_q, take a vote among its k nearest neighbors (if the target function is discrete-valued), or take the mean of the f values of its k nearest neighbors (if it is real-valued):
  f̂(x_q) ← (1/k) Σ_{i=1..k} f(x_i)

[Figure: on the left, positive and negative training examples and a query x_q that 1-nearest neighbor classifies positive but 5-nearest neighbor classifies negative; on the right, the decision surface induced by 1-nearest neighbor (a Voronoi diagram): every query falling in a region gets the value of that region's training example.]

The curse of dimensionality

Imagine instances described by 20 attributes, of which only 2 are relevant to the target function. Nearest neighbor is easily misled in such a high-dimensional X; the problem is similar to overfitting. One approach: stretch the j-th axis by a weight z_j, lengthening the axes that correspond to the more relevant attributes, with z_1, ..., z_n chosen to minimize prediction (classification) error. Cross-validation can choose the weights automatically; note that setting z_j to zero eliminates that dimension altogether.

Distance-weighted k-NN

We might want to weight nearer neighbors more heavily:
  f̂(x_q) ← Σ_{i=1..k} w_i f(x_i) / Σ_{i=1..k} w_i,   where w_i = 1 / d(x_q, x_i)²
and d(x_q, x_i) is the distance between x_q and x_i. Note that it now makes sense to use all training examples instead of just k (Shepard's method), although classification becomes much slower.

When to consider nearest neighbor

- Instances map to points in ℝⁿ; fewer than 20 attributes per instance; lots of training data.
- Advantages: training is very fast; robust to noisy training data; can learn complex target functions; no information is lost.
- Disadvantages: slow at query time; easily fooled by irrelevant attributes.

Locally weighted regression

Note that k-NN forms a local approximation to f for each query point x_q. Why not form an explicit approximation f̂(x) for the region surrounding x_q? Fit a linear function (or a quadratic, ...) to the k nearest neighbors; this produces a piecewise approximation to f. There are several choices of error to minimize, e.g.:
- squared error over the k nearest neighbors:
  E₁(x_q) ≡ (1/2) Σ_{x ∈ k nearest nbrs of x_q} (f(x) − f̂(x))²
- distance-weighted squared error over all neighbors:
  E₂(x_q) ≡ (1/2) Σ_{x ∈ D} (f(x) − f̂(x))² K(d(x_q, x))
where K is a kernel function that decreases with distance.
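The plain and distance-weighted k-NN estimators above fit in a few lines. A sketch for real-valued target functions, where the Euclidean distance metric and the (x, f(x)) example format are illustrative assumptions:

```python
import math

def knn_predict(x_q, examples, k=3, weighted=False):
    """examples: list of (x, f_x) pairs, with x a tuple of numbers.
    Plain k-NN: mean of f over the k nearest neighbors.
    Distance-weighted: w_i = 1 / d(x_q, x_i)^2 over the same neighbors."""
    def d(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(examples, key=lambda ex: d(x_q, ex[0]))[:k]
    if not weighted:
        return sum(f for _, f in nearest) / len(nearest)
    num = den = 0.0
    for x, f in nearest:
        dist = d(x_q, x)
        if dist == 0.0:          # query coincides with a stored example
            return f
        w = 1.0 / dist ** 2
        num, den = num + w * f, den + w
    return num / den
```

Setting k = len(examples) with weighted=True gives Shepard's method from the notes; for a discrete-valued target, replace the mean with a (weighted) majority vote.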
Case-based reasoning

Instance-based learning can be applied even when X ≠ ℝⁿ; it just needs a different distance metric. Case-based reasoning (CBR) is instance-based learning applied to instances with symbolic logic descriptions, e.g.:

  ((user-complaint error53-on-shutdown)
   (cpu-model PowerPC)
   (operating-system Windows)
   (network-connection PCIA)
   (memory 48meg)
   (installed-applications Excel Netscape VirusScan)
   (disk 1gig)
   (likely-cause ???))

Key elements of problem-solving CBR:
- Cases are represented as solved problems, indexed under the goals they satisfy and the planning problems they avoid.
- Retrieve the prior case sharing the most goals and avoiding the most problems (see the sketch after this list).
- Adapt the solution of the prior case to solve the new case; this may require resolving problems and/or repairing solutions.
- Index the new case and its solution under the goals satisfied and the planning problems avoided.
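To make the retrieve step concrete, here is a toy sketch of retrieval by goal overlap. The case layout and the additive scoring rule are illustrative assumptions in the spirit of this kind of indexing, not an actual CBR system's mechanism:

```python
def retrieve(case_library, goals, predicted_problems):
    """Return the stored case sharing the most input goals
    while avoiding the most predicted problems."""
    def score(case):  # case: {'goals': set, 'avoids': set, 'plan': ...}
        return (len(case['goals'] & goals)
                + len(case['avoids'] & predicted_problems))
    return max(case_library, key=score)

# Hypothetical example: one stored stir-fry case.
library = [{'goals': {'make-stir-fry', 'include-beef', 'include-green-beans'},
            'avoids': set(), 'plan': 'BEEF-WITH-GREEN-BEANS'}]
best = retrieve(library, {'make-stir-fry', 'include-beef', 'include-broccoli'}, set())
```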
CADET: case-based design

CADET stores 75 examples of mechanical devices. Each training example is a pair ⟨qualitative function, mechanical structure⟩; a new query specifies a desired function, and the target value is a mechanical structure for that function. The distance metric matches qualitative function descriptions, e.g., the size of the largest shared subgraph between two function graphs.

CHEF: case-based planning

CHEF consists of six processes:
- Problem anticipation: the planner anticipates planning problems by noticing features in the current input that have previously participated in past planning problems.
- Plan retrieval: the planner searches for a plan that satisfies as many of its current goals as possible while avoiding the problems that it has predicted.
- Plan modification: the planner alters the plans it has found to satisfy any goals from the input that are not already achieved.
- Plan repair: when a plan fails, the planner fixes the faulty plan by building up a causal explanation of why the failure occurred and using it to find the different strategies for repairing it.
- Credit assignment: along with repairing a failed plan, the planner wants to repair the characterization of the world that allowed it to create the failed plan in the first place. It does this by using the causal explanation of why the failure occurred to identify the features in the input that led to the problem, and then marking them as predictive of it.
- Plan storage: the planner places successful plans in memory, indexed by the goals that they satisfy and the problems that they avoid.

A stored case: BEEF-WITH-GREEN-BEANS

Ingredients: a half pound of beef; two tablespoons of soy sauce; one teaspoon of rice wine; a half tablespoon of corn starch; one teaspoon of sugar; a half pound of green bean; one teaspoon of salt; one chunk of garlic.
Steps: Chop the garlic into pieces the size of matchheads. Shred the beef. Marinate the beef in the garlic, sugar, corn starch, rice wine, and soy sauce. Stir-fry the spices, rice wine, and beef for one minute. Add the green bean to the spices, rice wine, and beef. Stir-fry the spices, rice wine, green bean, and beef for three minutes. Add the salt to the spices, rice wine, green bean, and beef.

Plan retrieval and modification:

  Recipe for BEEF-AND-BROCCOLI
  Found nearest recipe: BEEF-WITH-GREEN-BEANS
  Modifying recipe BEEF-WITH-GREEN-BEANS to satisfy: include broccoli in the dish
  Placing some broccoli in recipe BEEF-WITH-GREEN-BEANS
  Considering ingredient-critic:
    Before doing step: stir-fry the -Variable-
    do: chop the broccoli into pieces the size of chunks
  -- ingredient critic applied

CHEF alters old plans to satisfy new goals using a set of modification rules and a set of critics tied to the new objects (here, ingredients). The modified recipe: a half pound of beef; two tablespoons of soy sauce; one teaspoon of rice wine; a half tablespoon of corn starch; one teaspoon of sugar; a half pound of broccoli; one teaspoon of salt; one chunk of garlic. Chop the garlic into pieces the size of matchheads. Shred the beef. Marinate the beef in the garlic, sugar, corn starch, rice wine, and soy sauce. Chop the broccoli into pieces the size of chunks. Stir-fry the spices, rice wine, and beef for one minute. Add the broccoli to the spices, rice wine, and beef. Stir-fry the spices, rice wine, broccoli, and beef for three minutes. Add the salt to the spices, rice wine, broccoli, and beef.

Simulating the new plan yields:

  The beef is now tender.
  The dish now tastes salty.
  The dish now tastes savory.
  The dish now tastes sweet.
  The dish now tastes like garlic.
  The broccoli is now soggy.    <- a failure

Plan repair strategies (derived from the causal explanation of the failure):
- ALTER-PLAN:SIDE-EFFECT — replace the step that causes the violating condition with one that does not have the same side effect but achieves the same goal.
- SPLIT-AND-REFORM — split the step into two separate steps and run them independently.
- ADJUNCT-PLAN:REMOVE — add a new step, to be run along with the step that causes a side effect, that removes the side effect as it is created.

Indexing BEEF-AND-BROCCOLI under goals and problems:

  If this plan is successful, the following should be true:
    The beef is now tender.
    The broccoli is now crisp.
  Include beef in the dish.
  Include broccoli in the dish.
  Make a stir-fry dish.
  The plan avoids the failure exemplified by the state
  "the broccoli is now soggy" in recipe BEEF-AND-BROCCOLI.

Anticipating the problem on a later input (chicken and snow peas):

  Collecting and activating tests:
    Is the dish STYLE-STIR-FRY?
    Is the item a MEAT?
    Is the item a VEGETABLE?
    Is the TEXTURE of item CRISP?
  Chicken + Snow pea + Stir frying = Failure
    "Meat sweats when it is stir-fried."
    "Stir-frying in too much liquid makes crisp vegetables soggy."
  Reminded of a failure in the BEEF-AND-BROCCOLI plan:
  failure = "The vegetable is now soggy."

Retrieving a plan that avoids it:

  Searching for plan that satisfies input goals --
    Make a stir-fry dish
    Include chicken in the dish
    Include snow pea in the dish
    Avoid failure exemplified by the state "The broccoli is now soggy"
      in recipe BEEF-AND-BROCCOLI
  Driving down on: Make a stir-fry dish.  Succeeded.
  Driving down on: Include chicken in the dish.  Failed; trying more general goal.
  Driving down on: Include meat in the dish.  Succeeded.
  Driving down on: Include snow pea in the dish.  Failed; trying more general goal.
  Driving down on: Include vegetable in the dish.  Succeeded.
  Found recipe -> REC9 BEEF-AND-BROCCOLI

Next lecture

Analytical learning and explanation-based learning (some of the first work was done at UMass, on learning the rules of baseball), and a quick overview of planning.