# Stochastic Control EECS 558

These 29-page class notes were uploaded by Ophelia Ritchie on Thursday, October 29, 2015. The notes are for EECS 558 at the University of Michigan, taught by Staff in Fall. Since upload, they have received 21 views. For similar materials, see /class/231559/eecs-558-university-of-michigan in Engineering Computer Science at the University of Michigan.

Date Created: 10/29/15

## Multi-Armed Bandits

Aditya Mahajan and Ravi Agarwal, EECS 558 Stochastic Control, December 9, 2003.

**Outline:** bandit processes and the Gittins index; the multi-armed bandit problem; forwards induction and why it works; queueing models; multiple plays, sufficiency conditions, and concave envelopes.

### Bandit Process

- A Markov decision process.
- Countable state space $S$.
- Action space $\mathcal{U} = \{0, 1\}$: the freezing control and the continuation control.
- Time-invariant rewards $R(X_t, U_t)$, with $R_i(0) = 0$ and $R_i(1) = R_i$.

#### Total Expected Discounted Reward

The total expected discounted reward under any policy $\pi$ is

$$V^{\pi}(i) = E^{\pi}\!\left[\sum_{t=0}^{\infty} \beta^{t} R(X_t, U_t) \,\Big|\, X_0 = i\right].$$

- Deterministic stationary Markov policies.
- Divide the state space $S$ into a stopping set $S^0$ and a continuation set $S^1$.
- Stopping rule: a random stopping time $\tau$.

#### Gittins Index

The expected reward per unit of discounted time under $\tau$ is

$$\nu(i, \tau) = \frac{E\!\left[\sum_{t=0}^{\tau-1} \beta^{t} R(X_t) \,\big|\, X_0 = i\right]}{E\!\left[\sum_{t=0}^{\tau-1} \beta^{t} \,\big|\, X_0 = i\right]}.$$

The Gittins index is

$$\nu(i) = \sup_{\tau > 0} \nu(i, \tau),$$

and the supremum is achieved by $\tau_0 = \min\{t : \nu(X_t) < \nu(i)\}$.

### Multi-Armed Bandit Problem

- A family of $N$ alternative bandit processes.
- Countable state space; the state is $X_t = (X_1(t), X_2(t), \ldots, X_N(t))$.
- Action space $\mathcal{U} = \{1, 2, \ldots, N\}$.
- The states of all processes other than the one played remain frozen.
- Total reward at time $t$: $R(t) = R_{U_t}(X_{U_t}(t))$.

Find $\pi$ to maximize the total expected discounted reward.

#### Optimal Policy

From dynamic programming, the solution is given by

$$V^{\pi} = R^{\pi} + \beta P^{\pi} V^{\pi}.$$

**Gittins index rule.** At each time, choose the process with the highest Gittins index for continuation:

$$u_t = \arg\max_{n} \nu_n(X_n(t)).$$

Advantages:

- The index rule does not depend on the states of the other processes.
- It is a forwards induction policy, computationally less intensive than backwards induction.

### Why Does It Work?

**Backwards induction**

- Works on the principle of optimality.
- If the optimal policy from a certain stage (or time) onward is known, we can extend it to an optimal policy starting one stage earlier.

**Forwards induction**

- A randomized $s$-step look-ahead policy.
- For a given policy, maximize over all stopping times $\tau$.
- Then a second maximization over all policies.

**Does forwards induction always work?**

- It is not optimal for all Markov decision processes.
- For some policies, a lower initial reward can be compensated by higher gains later on.

**Why forwards induction works in this case**

- When one arm is played, the other arms remain frozen.
- Decisions at each stage are not irrevocable.
- There is no later compensation for not choosing the arm with the highest rate of reward.

### Queueing Models

A multi-armed bandit process in queueing (a resource allocation problem):

- $N$ parallel queues with holding costs but no arrivals.
- A single server with preemptive, memoryless service.
- Minimize the total expected discounted holding cost.

Extensions:

- Arrivals.
- Switching costs.
- Connectivity of servers and queues.
- Multiple servers.

### Multi-Armed Bandits with Multiple Plays

- A family of $N$ alternative bandit processes.
- Countable state space; the state is $X_t = (X_1(t), X_2(t), \ldots, X_N(t))$.
- $M$ arms are chosen to be played at each time.
- The total reward $R(t)$ at time $t$ is the sum of the individual rewards.

Find $\pi$ to maximize the total expected discounted reward

$$V^{\pi}(i) = E^{\pi}\!\left[\sum_{t=0}^{\infty} \beta^{t} R(t) \,\Big|\, X_0 = i\right].$$

#### Optimality of the Gittins Index

Is the Gittins index rule optimal with multiple plays?

- Decisions are not irrevocable.
- In general, the Gittins index rule is not optimal (Ishikida, 1992).

**Sufficiency condition for optimality of the Gittins index** (Pandelis and Teneketzis, 1999): for each realization $\omega$, for any $i, j$ and $k, l$ such that $\nu_i(\tau_k, \omega) > \nu_j(\tau_l, \omega)$, we have $V_i(\tau_k, \omega) \geq V_j(\tau_l, \omega)$.

#### Concave Envelopes

Let $\bar{X}_i(\cdot)$ denote the concave envelope of $X_i(\cdot)$.

- The reward obtained under any policy $\pi$ is not decreased when the original reward processes are replaced with their concave envelopes.
- The reward obtained under the Gittins index rule does not change when the reward processes are replaced by their concave envelopes.

### Solving the Gittins Index

#### Index for a Single Arm: Standard Arm

- A bandit process $X$.
- A standard process with a fixed reward $\gamma$.

$$v(i) = \max\!\left\{ R(i) + \beta \sum_{j} P_{ij}\, v(j),\; \gamma \right\}.$$

#### Retirement Option

- A bandit process $X$.
- The option of retiring with a fixed reward $M$.

$$V(i, M) = \max\!\left\{ R(i) + \beta \sum_{j} P_{ij}\, V(j, M),\; M \right\}.$$

- $V(i, M)$ is a nondecreasing, convex function of $M$.
- Optimal policies partition $S$ into three sets:
  - the strict continuation set $C(M) = \{ i : V(i, M) > M \}$,
  - the strict stopping set $S(M) = \{ i : M > R(i) + \beta \sum_{j} P_{ij} V(j, M) \}$,
  - the indifferent states $\partial(M) = \{ i : M = R(i) + \beta \sum_{j} P_{ij} V(j, M) \}$.
- The optimal value function is

$$V(i, M) = E\!\left[\sum_{t=0}^{\tau(M)-1} \beta^{t} R(X_t) + \beta^{\tau(M)} M \,\Big|\, X_0 = i\right].$$
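The retirement-option characterization gives a practical way to compute the index numerically: $V(i, M)$ can be found by value iteration, and (in Whittle's retirement formulation) the Gittins index of state $i$ equals $(1 - \beta)\,M^*(i)$, where $M^*(i)$ is the smallest retirement reward with $V(i, M) = M$. A minimal Python sketch under those definitions; the two-state example arm, the function names, and the tolerances are illustrative, not from the notes:

```python
import numpy as np

def retirement_value(R, P, beta, M, iters=2000):
    """Value iteration for V(i, M) = max{ M, R(i) + beta * sum_j P[i, j] V(j, M) }."""
    V = np.full(len(R), M, dtype=float)
    for _ in range(iters):
        V = np.maximum(M, R + beta * (P @ V))
    return V

def gittins_index(R, P, beta, i, tol=1e-8):
    """Bisect on the retirement reward M: the index of state i is (1 - beta) * M*,
    where M* is the smallest M at which retiring in state i is optimal."""
    lo, hi = R.min() / (1 - beta), R.max() / (1 - beta)
    while hi - lo > tol:
        M = 0.5 * (lo + hi)
        if retirement_value(R, P, beta, M)[i] > M + tol:
            lo = M  # continuation still strictly better: M* lies above M
        else:
            hi = M  # retiring already optimal: M* lies at or below M
    return (1 - beta) * hi

# Hypothetical two-state arm: state 0 pays 1 and falls into the absorbing,
# zero-reward state 1 with probability 1/2 each time it is played.
R = np.array([1.0, 0.0])
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])
beta = 0.9
print(gittins_index(R, P, beta, 0))  # about 1.0
print(gittins_index(R, P, beta, 1))  # about 0.0
```

For this arm the answers can be checked by hand: state 1 earns nothing forever, so its index is 0, while the arm earns exactly 1 per unit of discounted time for as long as it stays in state 0, so $\nu(0) = 1$.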
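When each arm's reward sequence is deterministic and known, the supremum over stopping times in $\nu(i, \tau)$ reduces to a maximum over finitely many horizons, so the index rule can be demonstrated end to end in a few lines. A small sketch in the same spirit as the notes; the two reward sequences are hypothetical:

```python
import numpy as np

def det_index(rewards, beta):
    """Gittins index of a deterministic arm with remaining rewards r_0, r_1, ...:
    nu = max over horizons tau >= 1 of (sum_{t<tau} beta^t r_t) / (sum_{t<tau} beta^t)."""
    disc = beta ** np.arange(len(rewards))
    num = np.cumsum(disc * np.asarray(rewards, dtype=float))
    den = np.cumsum(disc)
    return (num / den).max()

def index_rule(arms, beta):
    """At each step play the arm with the highest index; the other arms stay frozen.
    Returns the rewards in the order they are collected."""
    arms = [list(a) for a in arms]
    order = []
    while any(arms):
        idx = [det_index(a, beta) if a else -np.inf for a in arms]
        k = int(np.argmax(idx))
        order.append(arms[k].pop(0))
    return order

print(index_rule([[1, 5], [3, 2]], beta=0.9))  # → [3, 1, 5, 2]
```

The rule first collects the 3 (highest one-step reward), then switches arms because the sequence $1, 5$ has lookahead index $(1 + 0.9 \cdot 5)/(1 + 0.9) \approx 2.89$, which beats the remaining 2; checking all interleavings that preserve each arm's internal order confirms this schedule maximizes the discounted total for this example.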
