Class Note for CMPSCI 683 at UMass(11)
Victor Lesser, CMPSCI 683, Fall 2004
Review of Neural Networks, Markov Decision Processes, Reinforcement Learning

Backpropagation: properties
• Gradient descent over the network weight vector.
• Easily generalizes to any directed graph.
• Will find a local, not necessarily global, error minimum.
• Minimizes error over the training examples; will it generalize well to subsequent examples?
• Training is slow: it can take thousands of iterations.
• Using the network after training is very fast.

When to consider neural networks
• Instances are represented by many attribute-value pairs.
• The target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes.
• The training examples may contain errors.
• Long training times are acceptable.
• Fast evaluation of the learned target function may be required.
• The ability of humans to understand the learned target function is not important.

Why reinforcement learning?
• Supervised learning is sometimes unrealistic: where will correct answers come from?
• Use feedback/rewards to learn a successful agent function.
• Rewards may be provided following each action, or only when the agent reaches a terminal state. In many cases the agent will only receive a single reward after a long sequence of actions.
• Rewards can be components of the actual utility function, or they can be hints ("nice move", "bad dog", etc.).
• Environments change, and so the agent must adjust its action choices.

Supervised learning vs. reinforcement learning
• Supervised learning: training info = desired (target) outputs. Input → Output. Objective: minimize error = (target output − actual output).
• Reinforcement learning: training info = evaluations ("rewards"/"penalties"). Input → Output. Objective: get as much reward as possible.

The agent-environment interface
• Agent and environment interact at discrete time steps t = 0, 1, 2, ...
• At each step the agent observes state s_t ∈ S, produces action a_t ∈ A(s_t), and receives the resulting reward r_{t+1} and next state s_{t+1}.
• Utility/reward depends on a sequence of decisions.
• How to learn the best action (maximizing expected reward) to take at each state?
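The agent-environment interaction loop above can be sketched in a few lines of Python. `ChainEnv` is a hypothetical toy environment invented for illustration (not from the notes): at each discrete step the agent observes s_t, emits a_t, and receives r_{t+1} and s_{t+1}.

```python
class ChainEnv:
    """Hypothetical episodic environment: states 0..3, action +1
    moves right, entering terminal state 3 yields reward 1."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action              # deterministic transition
        reward = 1.0 if self.state == 3 else 0.0
        done = self.state == 3
        return self.state, reward, done

# The loop from the slide: observe s_t, act a_t, receive r_{t+1}, s_{t+1}.
env = ChainEnv()
state, done, trajectory = 0, False, []
while not done:
    action = 1                            # a fixed policy stands in for a learned one
    next_state, reward, done = env.step(action)
    trajectory.append((state, action, reward))
    state = next_state

print(trajectory)   # [(0, 1, 0.0), (1, 1, 0.0), (2, 1, 1.0)]
```

The trajectory is exactly the (s_t, a_t, r_{t+1}) sequence a learner would consume.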
Markov decision processes (MDPs)
• S: a finite set of domain states.
• A: a finite set of actions.
• P(s' | s, a): the state transition function.
• R(s): the reward function.
• s0: the initial state (or a discrete probability distribution over starting states).
• The Markov assumption: P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ...) = P(s_{t+1} | s_t, a_t).

Partially observable MDPs (POMDPs)
Augment the completely observable MDP with the following elements:
• O: a set of observations.
• P(o | s, a): the observation function.
• A discrete probability distribution over starting states.
A POMDP can be mapped into an MDP, but this explodes the state space.

Combining rewards over time
Specify how to combine rewards over multiple time steps:
• Finite-horizon and infinite-horizon problems.
• Additive utility: the sum of rewards.
• Using a discount factor.
• Utility as average reward per time step.

Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
• A goal should specify what we want to achieve, not how we want to achieve it.
• A goal must be outside the agent's direct control, thus outside the agent.
• The agent must be able to measure success explicitly and frequently during its lifespan.

Returns
Suppose the sequence of rewards after step t is r_{t+1}, r_{t+2}, r_{t+3}, ... What do we want to maximize? We want to maximize the expected return E[R_t] for each step t.
• Episodic tasks: interaction breaks naturally into episodes, e.g. plays of a game, trips through a maze. Then
    R_t = r_{t+1} + r_{t+2} + ... + r_T,
  where T is a final time step at which a terminal state is reached, ending an episode.
• Continuing tasks: interaction does not have natural episodes, and the expected return becomes infinite. Use the discounted return
    R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0..∞} γ^k r_{t+k+1},
  where 0 ≤ γ ≤ 1 is the discount rate: γ → 0 is shortsighted, γ → 1 is farsighted.

Example: pole balancing
Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.
• As an episodic task, where an episode ends upon failure: reward = +1 for each step before failure; return = number of steps before failure.
• As a continuing task with discounted return: reward = −1 upon failure, 0 otherwise; the return is related to −γ^k for k steps before failure.
In either case, the return is maximized by avoiding failure for as long as possible.
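The discounted-return formula can be checked numerically. A minimal sketch, with an invented reward list and γ values chosen only to show the shortsighted/farsighted contrast:

```python
def discounted_return(rewards, gamma):
    """R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# A single delayed reward: gamma near 0 nearly ignores it,
# gamma near 1 still credits it.
rewards = [0.0, 0.0, 0.0, 10.0]
print(discounted_return(rewards, 0.1))   # 0.1**3 * 10, shortsighted
print(discounted_return(rewards, 0.9))   # 0.9**3 * 10, farsighted
```

With γ = 0.1 the delayed reward contributes only 0.01; with γ = 0.9 it contributes 7.29, so the discount rate directly sets how much the agent cares about the future.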
Policies
A policy is a choice of what action to take at each state. An optimal policy is a policy in which you are always choosing the action that maximizes the return (utility) of the current state.

Example: 4×3 grid world
[Figure: 4×3 grid world with terminal states +1 and −1 and computed state utilities (0.812, 0.868, 0.912 in the top row; 0.762, ..., 0.655, 0.611, 0.388 in the lower rows), with optimal policy arrows.]
• Actions succeed with probability 0.8 and move at right angles with probability 0.1 each; the agent remains in the same position when there is a wall.
• Actions incur a small cost (0.04). What happens when the cost increases?
• Why move toward 0.655 instead of 0.611?

Example: hill climbing
Get to the top of the hill as quickly as possible.
• Reward = −1 for each step where the car is not at the top of the hill; return = −(number of steps before reaching the top of the hill).
• The return is maximized by minimizing the number of steps to reach the top of the hill.

Optimal policies and the Bellman equation
The optimal policy is defined by
  policy*(i) = argmax_a Σ_j P(j | i, a) U(j),
with utilities satisfying
  U(i) = R(i) + max_a Σ_j P(j | i, a) U(j).
This can be solved using dynamic programming (Bellman, 1957). But how do we compute U(j) when it is defined recursively?

Value iteration
  repeat
    U ← U'
    for each state i do
      U'(i) ← R(i) + max_a Σ_j P(j | i, a) U(j)
  until close-enough(U, U')
We can think of value iteration as maximizing our utility over some finite horizon: we calculate U^n(i), the maximum value attainable in n steps starting in state i, for all states i, and proceed backwards:
  U^0(i) = R(i)   (the utility of taking no steps)
  U^1(i) = R(i) + max_a Σ_j P(j | i, a) U^0(j)   (take the one step with the highest utility)
  U^{n+1}(i) = R(i) + max_a Σ_j P(j | i, a) U^n(j)   (take the action with the highest utility, including the utility of the resulting state)

Properties of value iteration
• Slow to converge; convergence occurs outward from the goal.
• Information about shortcuts propagates out from the goal.
• The greedy policy is optimal before the values completely settle.
• The optimal value function is a fixed point of value iteration.

Prioritized sweeping
• State-value updates can be performed in any order in value iteration. This suggests trying to decide which states to update so as to maximize convergence speed.
• Prioritized sweeping is a variation of value iteration that is more computationally efficient (focused): put all states in a priority queue, ordered by how much we think their values might change given a step of value iteration.
• Very efficient in practice.
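The value-iteration loop above can be sketched directly. The two-state MDP below (`states`, `P`, `R`) is a made-up example for illustration; the notes' own example is the 4×3 grid world, and the stopping test stands in for the slide's close-enough(U, U').

```python
def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """Backup U(i) <- R(i) + gamma * max_a sum_j P(j|i,a) U(j)
    until successive value functions differ by less than eps."""
    U = {s: 0.0 for s in states}
    while True:
        U_new = {s: R[s] + gamma * max(
                     sum(p * U[s2] for s2, p in P[s][a].items())
                     for a in actions)
                 for s in states}
        if max(abs(U_new[s] - U[s]) for s in states) < eps:
            return U_new
        U = U_new

# Hypothetical two-state MDP: "go" toggles between A and B, "stay" stays.
states = ["A", "B"]
actions = ["stay", "go"]
P = {"A": {"stay": {"A": 1.0}, "go": {"B": 1.0}},
     "B": {"stay": {"B": 1.0}, "go": {"A": 1.0}}}
R = {"A": 0.0, "B": 1.0}

U = value_iteration(states, actions, P, R, gamma=0.5)
print(U)   # close to {'A': 1.0, 'B': 2.0}
```

At the fixed point U(B) = 1 + 0.5·U(B) = 2 and U(A) = 0.5·U(B) = 1, matching the Bellman equation above.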
(Moore & Atkeson, 1993.)

Policy iteration
Solves infinite-horizon discounted MDPs in finite time:
• Start with a value function V_0.
• Let π_1 be the greedy policy for V_0.
• Evaluate π_1 and let V_1 be the resulting value function.
• Let π_2 be the greedy policy for V_1; let V_2 be the value of π_2; and so on.
Each policy is an improvement over the last, until the optimal policy is reached (another fixed point). Since there is a finite set of policies, convergence occurs in finite time.
• Policy iteration alternates a (monotonic) greedification/improvement step with an evaluation step.
• Generalized policy iteration: intermix the two steps at a finer scale — state by state, action by action, etc.

Policy iteration algorithm
  repeat
    Π ← Π'
    U ← ValueDetermination(Π)
    for each state i do
      Π'(i) ← argmax_a Σ_j P(j | i, a) U(j)
  until Π = Π'
Value determination can be implemented using value iteration,
  U(i) ← R(i) + Σ_j P(j | i, Π(i)) U(j),
or dynamic programming, solving the linear system
  U(i) = R(i) + Σ_j P(j | i, Π(i)) U(j).
Policy iteration needs fewer iterations than value iteration, but each iteration is more expensive. (PI vs. VI is a source of disagreement among practitioners.)

What is reinforcement learning?
• Learning a model of a Markov decision process, and learning an optimal policy.
• The learner is not told which actions to take: trial-and-error search.
• Possibility of delayed reward: sacrifice short-term gains for greater long-term gains.
• The need to explore and exploit.
• Considers the whole problem of a goal-directed agent interacting with an uncertain environment.
• Learning from interaction: learning about, from, and while interacting with an external environment.
• Learning what to do — how to map situations to actions — so as to maximize a numerical reward signal.
• A collection of methods for approximating the solutions of stochastic optimal control problems.

Elements of RL
• Policy: what to do.
• Reward: what is good.
• Value: what is good because it predicts reward.
• Model: what follows what.

What the agent may learn
• A utility function on states (or histories), which can be used in order to select actions. Requires a model of the environment.
• An action-value function for each state (also called Q-learning). Does not require a model of the environment.
• A passive learner simply watches the world going by and tries to learn the utility of being in various states.
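The policy iteration algorithm above can be sketched on a hypothetical two-state MDP (invented for illustration). Here value determination uses repeated backups under the fixed policy rather than solving the linear system directly, which is one of the two options the notes mention.

```python
def policy_iteration(states, actions, P, R, gamma=0.9, sweeps=200):
    """Alternate value determination and greedification until the
    policy stops changing (the fixed point)."""
    pi = {s: actions[0] for s in states}
    while True:
        # Value determination: U(i) <- R(i) + gamma * sum_j P(j|i,pi(i)) U(j)
        U = {s: 0.0 for s in states}
        for _ in range(sweeps):
            U = {s: R[s] + gamma * sum(p * U[s2]
                 for s2, p in P[s][pi[s]].items()) for s in states}
        # Greedification: pi'(i) <- argmax_a sum_j P(j|i,a) U(j)
        pi_new = {s: max(actions, key=lambda a: sum(
                  p * U[s2] for s2, p in P[s][a].items())) for s in states}
        if pi_new == pi:
            return pi, U
        pi = pi_new

# Same hypothetical MDP as before: "go" toggles A <-> B, "stay" stays.
states = ["A", "B"]
actions = ["stay", "go"]
P = {"A": {"stay": {"A": 1.0}, "go": {"B": 1.0}},
     "B": {"stay": {"B": 1.0}, "go": {"A": 1.0}}}
R = {"A": 0.0, "B": 1.0}

pi, U = policy_iteration(states, actions, P, R, gamma=0.5)
print(pi)   # {'A': 'go', 'B': 'stay'}
```

Only two improvement rounds are needed here, illustrating the claim that PI takes fewer (but more expensive) iterations than VI.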
• An active learner must also act using the learned information, and can use its problem generator to suggest explorations of unknown portions of the environment.

Passive learning
• Given: a Markov model of the environment — states with probabilistic actions; terminal states have rewards/utilities.
• Problem: learn the expected utility of each state.

The RL framework
In RL the environment is usually modeled as an MDP, defined by:
• S: the set of states of the environment.
• A(s): the set of actions possible in state s ∈ S.
• P(s' | s, a): the probability of a transition from s to s' given action a.
• R(s, a, s'): the expected reward on a transition from s to s' given a.
• γ: the discount rate for delayed reward.
• Discrete time: t = 0, 1, 2, ...

Value functions
Find a policy π : s ∈ S → a ∈ A(s) (it could be stochastic) that maximizes the value (expected future reward) of each state s,
  V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s, π },
and of each state-action pair,
  Q^π(s, a) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s, a_t = a, π }.
These are called value functions (cf. evaluation functions in AI).

Optimal value functions and policies
There exist optimal value functions
  V*(s) = max_π V^π(s),   Q*(s, a) = max_π Q^π(s, a),
and corresponding optimal policies
  π*(s) = argmax_a Q*(s, a).
π* is the greedy policy with respect to Q*.

The RL loop
• Experience → build value predictions (V, Q) → greedification → policy → select actions.
• Continual, online: simultaneous acting and learning (π → V, Q).

Passive learning (continued)
• Given: a Markov model of the environment — states with probabilistic actions; terminal states have rewards/utilities.
• Problem: learn the expected utility of each state.
• Percepts tell you: [state, reward, terminal?].
• A training sequence is an instance of world transitions from an initial state to a terminal state.
• The additive utility assumption: the utility of a sequence is the sum of the rewards over the states of the sequence. Under this assumption, the utility of a state is the expected reward-to-go of that state.

Naive updating (the LMS approach)
• Developed in the late 1950's in the area of adaptive control theory.
• Just keep a running average of rewards for each state: for each training sequence, compute the reward-to-go for each state in the sequence and update the utilities.
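Naive LMS updating can be sketched as a running average of observed reward-to-go per state. The training sequences below are invented for illustration; each is a list of (state, reward) pairs ending at a terminal state.

```python
def lms_update(sequences):
    """Naive (LMS) updating: walk each training sequence backwards,
    accumulating reward-to-go, and keep a running average per state."""
    totals, counts = {}, {}
    for seq in sequences:                  # seq = [(state, reward), ...]
        reward_to_go = 0.0
        for state, reward in reversed(seq):
            reward_to_go += reward         # accumulate reward as you go back
            totals[state] = totals.get(state, 0.0) + reward_to_go
            counts[state] = counts.get(state, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}

# Two hypothetical training sequences through states a, b, c.
seqs = [[("a", 0.0), ("b", 0.0), ("c", 1.0)],
        [("a", 0.0), ("b", 0.5), ("c", 1.0)]]
U = lms_update(seqs)
print(U)   # {'c': 1.0, 'b': 1.25, 'a': 1.25}
```

Note that each state's estimate is computed independently of its neighbors, which is exactly why, as the notes say next, this converges very slowly.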
• Accumulate reward as you go back. This generates utility estimates that minimize the mean square error (the LMS update).
• Converges very slowly, because it ignores the relationship between neighboring states.

Adaptive dynamic programming
• Utilities of neighboring states are mutually constrained:
  U(i) = R(i) + Σ_j M_{ij} U(j).
• Can apply dynamic programming to solve the system of equations (one equation per state).
• Can use value iteration: initialize utilities based on the rewards and update all values based on the above equation.

Temporal difference (TD) learning
• Approximate the constraint equations without solving them for all states: modify U(i) whenever we see a transition from i to j, using the rule
  U(i) ← U(i) + α (R(i) + U(j) − U(i)).
• The modification moves U(i) closer to satisfying the original equation.
• Rewriting in value-function notation, the TD error is r + γV(s') − V(s), and the update
  V(s) ← V(s) + α [ r + γV(s') − V(s) ]
is equivalent to
  V(s) ← (1 − α) V(s) + α [ r + γV(s') ]   (Sutton, 1988).
• After each action, update the state s_t.

Next: Q-based reinforcement learning; additional topics in machine learning.
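The TD update rule can be sketched in a few lines; the states, rewards, and step size below are illustrative, not from the notes.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD backup: V(s) <- V(s) + alpha * (r + gamma*V(s') - V(s))."""
    delta = r + gamma * V[s_next] - V[s]   # the TD error
    V[s] = V[s] + alpha * delta
    return V[s]

# One observed transition s1 -> s2 with reward 0.5:
V = {"s1": 0.0, "s2": 1.0}
td0_update(V, "s1", 0.5, "s2", alpha=0.5, gamma=1.0)
print(V["s1"])   # 0.0 + 0.5 * (0.5 + 1.0 - 0.0) = 0.75
```

Each update nudges V(s1) toward r + γV(s2), so repeated transitions drive the estimates toward satisfying the constraint equation without ever solving the full system.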