RL fundamentals
Markov Decision Processes (MDP)
- probabilistic state model
- instead of the deterministic outcomes assumed in most classical planning algorithms, MDPs assume that each action can have multiple outcomes, each with a probability.
- robot navigation, treatment for patients
- \( S, A, P(s'\vert s, a), r(s,a,s'), \gamma \)
- state space, action space, transition probability, reward, discount factor
- transition probability \( P(s'\vert s, a) \)
- given \( s \) and \( a \), the transition is conditionally independent of all previous states and actions → Markov property.
- discount factor
- how much a future reward should be discounted compared to a current reward
- reward \( r_t \)
- \( R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots = \sum_{t'=0}^{\infty} \gamma^{t'} r_{t+t'} \)
- \( R_t = r_t + \gamma R_{t+1} \)
- \( \gamma \) is less than 1 (so the infinite sum converges)
- (partially observable)
- observation model \( O(o\vert s,a) \)
- POMDP can be seen as MDP with a new state space where each state is a probability distribution over the original state space
- → exponentially-larger state space, so POMDPs are typically harder problems to solve
- solving an MDP produces a policy, rather than the sequence of actions produced by most classical planning algorithms
- the best action to choose in each state
- a policy can be deterministic or stochastic.
- Value function: expected discounted accumulated reward
- \( V^{\pi}(s)=E_{\pi}[R_t\vert s_t=s] \)
- the expected value given that the agent follows policy \( \pi \) from \( s \)
- Action value function: the value of choosing action a in state s and then following this same policy until termination
- \( Q(s, a)=\sum_{s'}P(s'\vert s,a)[r+\gamma V(s')] \)
- Bellman equation: the condition that must hold for a policy to be optimal
- recursive
- propagate the reward backward so that the agent can choose the action that leads to an \( s' \) with a high value
- \( V(s) = \max_{a} \sum_{s'} P(s'\vert s,a)[r+\gamma V(s')] \)
- the max over \( a \) is because the Bellman equation assumes that once we know the values of the successor states, we will always take the action that leads to the best one
- \( V(s) = \max_a Q(s,a) \)
- Policy extraction (works when action space is small)
- \( \pi(s) = \arg\max_a \sum_{s'}P(s'\vert s, a)[r+\gamma V(s')] \)
- \( \pi(s) = \arg\max_a Q(s,a) \)
- methods
- value-based method
- policy-based method:
- good when the action space is infinite or continuous
- bypasses learning value functions
- avoids the \( \arg\max_a \) over a big action space needed in policy extraction
- hybrid method
Value Iteration
- Find \( V^\star \) by iteratively solving the Bellman equation (sketch below)
- initialize: error, \( V[\cdot] \)
- while error > threshold
- for all states
- calculate the action values
- (for each action, the action value is the expectation of \( r+\gamma V[s'] \))
- (loop through all possible next states and their corresponding transition probabilities)
- \( V'[s] = \max_a Q(s,a) \)
- error \( = \max_s \vert V[s]-V'[s] \vert \), then \( V \leftarrow V' \)
- complexity
- \( O(\vert S\vert^2 \vert A \vert) \) per iteration (state, action, next state) → bad scaling
- can be vectorized
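A minimal tabular value-iteration sketch, assuming a transition model where `P[s]` is a dict keyed by action and `P[s][a]` is a list of `(prob, next_state, reward)` tuples, with states numbered `0..len(P)-1` (names are illustrative, not from any particular library):

```python
import numpy as np

def value_iteration(P, gamma=0.9, threshold=1e-6):
    """P[s][a] = list of (prob, next_state, reward); states are 0..len(P)-1."""
    n_states = len(P)
    V = np.zeros(n_states)
    error = float("inf")
    while error > threshold:
        V_new = np.empty(n_states)
        for s in range(n_states):
            # Q(s,a): expectation of r + gamma * V[s'] over possible next states
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                 for outcomes in P[s].values()]
            V_new[s] = max(q)
        error = np.max(np.abs(V_new - V))
        V = V_new
    # policy extraction: argmax_a of the one-step lookahead using the converged V
    policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in range(n_states)}
    return V, policy
```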
Multi-armed bandits
- problem
- each arm gives reward \( X_{a,t} \) with different probabilities
- maximize total rewards over \( T \) plays
- no states (independent events)
- drift: when the probabilities suddenly change
- recommendation system: users preferences change constantly
- \( Q(a) \)
- average reward from using arm \( a \) → \( \sum_t X_{a,t} \, / \, N(a) \)
- \( N(a) \) number of times using arm \( a \)
- minimize total regret → use to analyze multi-arm bandit algo
- regret \( \mathcal{R}(\pi,T)=T \max_a Q^\star(a) - E\big[\sum_{t=1}^{T} X_{\pi(t),t}\big] \)
- expected loss from not taking the best action
- zero-regret strategy
- regret approaches zero as \( T \) goes to infinity
- converges to an optimal policy (may not be unique)
- How to select \( a \)?
- epsilon-first
- explore: first epsilon rounds, select arm uniformly
- exploit: after epsilon rounds, select arm with \( \max Q \)
- NOT zero-regret
- (decreasing) epsilon-greedy
- explore: with probability epsilon: select arm uniformly
- exploit: with probability (1-epsilon): select arm with \( \max Q \)
- can be zero-regret with specific parameters
- softmax: sample action based on probability determined by \( Q \)
- Boltzmann distribution: \( e^{Q(a)/\tau}/\sum_b e^{Q(b)/\tau} \)
- \( \tau=1 \) → probability proportional to \( e^{Q(a)} \)
- \( \tau \gg 1 \) → uniform distribution
- \( \tau \approx 0 \) → greedy
- good for problems with drift → when \( Q \) changes, the policy adjusts automatically because the action probabilities are based on \( Q \)
- can be zero-regret with specific parameters
- upper confidence bounds (UCB) : sample action based on \( Q \) and number of times using arm \( a \)
- \( \arg\max_a [Q(a)+\sqrt{2\ln T/N(a)}] \)
- exploitation: \( Q(a) \)
- exploration: second term → fewer usage gives higher bound
- good for problems with drift → when \( Q \) changes, the policy can adjust based on \( Q \)
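A sketch of epsilon-greedy and UCB1 arm selection plus the incremental Q update, assuming `Q` and `N` are NumPy arrays of per-arm reward estimates and pull counts (function names are illustrative):

```python
import numpy as np

def select_arm(Q, N, t, epsilon=0.1, method="ucb"):
    """Q: estimated mean reward per arm, N: pull counts, t: current round (1-indexed)."""
    if method == "epsilon_greedy":
        if np.random.rand() < epsilon:            # explore: uniform random arm
            return np.random.randint(len(Q))
        return int(np.argmax(Q))                  # exploit: arm with max Q
    # UCB1: pull each arm once, then trade off Q(a) against an exploration bonus
    if np.any(N == 0):
        return int(np.argmax(N == 0))
    return int(np.argmax(Q + np.sqrt(2 * np.log(t) / N)))

def update_estimate(Q, N, a, x):
    """Incremental average so that Q[a] = (sum of rewards from arm a) / N[a]."""
    N[a] += 1
    Q[a] += (x - Q[a]) / N[a]
```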
Temporal-difference reinforcement learning
- Model-free
- no knowledge of transition
- experience by trying actions and getting rewards
- learn value function and policy directly (not transitions)
- on-policy learning estimates \( Q^\pi \) for the current \( \pi \)
- off-policy learning estimates the policy independently of the current behaviour
- methods
- monte-carlo RL
- nice toy example
- keep track of \( Q \) for every state-action
- for each episode:
- \( R_t=r_t+\gamma R_{t+1} \) (returns computed backward from the end of the episode)
- generate \( a \) using a multi-armed bandit algorithm (per state)
- update \( Q(s,a) \leftarrow Q(s,a) + [R_t-Q(s,a)]/N(s,a) \) (incremental average of returns over \( N \))
- \( N(s,a) += 1 \)
- Monte Carlo method: simulations that use some randomness to explore actions
- update at the end of episode
- high variance
- \( R_t=r_t+\gamma R_{t+1} \): the future discounted return (high variance, depends on the episode) is used to calculate \( Q(s,a) \)
- executing action \( a \) in state \( s \) in different episodes can result in very different returns → requires a lot of episodes to get good estimates
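A sketch of the per-episode Monte Carlo update described above (every-visit variant), assuming `Q` and `N` are dictionaries keyed by `(state, action)`:

```python
from collections import defaultdict

def mc_update(episode, Q, N, gamma=0.99):
    """episode: list of (state, action, reward) tuples in the order they occurred."""
    R = 0.0
    for s, a, r in reversed(episode):      # walk backward so R_t = r_t + gamma * R_{t+1}
        R = r + gamma * R
        N[(s, a)] += 1
        Q[(s, a)] += (R - Q[(s, a)]) / N[(s, a)]   # incremental average of returns

Q, N = defaultdict(float), defaultdict(int)        # then call mc_update(episode, Q, N)
```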
- temporal difference (TD) methods
- bootstrap
- update \( Q \) based on TD target: \( r_t+\gamma V(s') \)
- the reward plus an estimate of future discounted reward
- \( Q = (1-\alpha)Q+\alpha [r+\gamma V(s')] \)
- higher alpha makes recent info more important
- \( V(s') \) converges to optimal, making the TD estimate more stable than relying on the episode return
- rewrite into \( Q = Q + \alpha (r+\gamma V(s') - Q) \)
- learning rate (\( \alpha \)) × TD error
- limitation:
- for sparse rewards, it can take longer for rewards to propagate through \( Q \)
- keeping track of \( Q \) table is expensive and inefficient
- similar states that haven’t been visited have no values
- updates at every step: useful even in the early stages of an episode (no need to wait for the episode to end)
- two ways to estimate the bootstrapped value \( V(s') \)
- Q-learning - off-policy
- optimistic: estimate \( V(s') \) using \( Q \), assuming the best action is taken from \( s' \) onward
- \( V(s')= \max_{a'} Q(s',a') \)
- off-policy: ignore the actual next action that will be executed, and instead updates based on the estimated best action.
- convergence guaranteed for tabular \( Q \)
- SARSA - on-policy
- sample the next action \( a' \) from the current policy and use its \( Q \) to update
- \( V(s')= Q(s',a') \)
- the policy being used matters!
- Using a random policy, Q-learning will still converge to the optimal policy, but SARSA will not (necessarily).
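A sketch of a single TD update showing the only difference between Q-learning and SARSA: how the bootstrapped \( V(s') \) is estimated. `Q` is assumed to be a dict of dicts `Q[s][a]`; terminal-state handling is omitted:

```python
def td_update(Q, s, a, r, s_next, a_next=None, alpha=0.1, gamma=0.99):
    """One temporal-difference update. If a_next is None, do Q-learning (off-policy);
    otherwise SARSA (on-policy), bootstrapping from the action actually chosen."""
    if a_next is None:
        v_next = max(Q[s_next].values())       # Q-learning: optimistic max over a'
    else:
        v_next = Q[s_next][a_next]             # SARSA: value of the sampled next action
    td_target = r + gamma * v_next
    Q[s][a] += alpha * (td_target - Q[s][a])   # Q <- Q + alpha * (TD target - Q)
```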
- comparison of on-policy and off-policy methods
- off-policy
- learn from demonstration: Q-learning → instead of \( \arg\max_a \), use expert \( a \)
- on-policy
- safer demonstrated in the cliff env
- SARSA takes \( a' \) and the reward is immediately used to update \( Q \). Since \( a' \) can be exploratory and risky, the resulting \( Q \) will avoid states that allow these possibilities → safe but sub-optimal
- Q-learning's \( Q \) is optimistic and ignores the potentially risky exploratory action → optimal, but riskier during training (worse online returns than the on-policy method)
- combination
- pre-train: off-policy learning with some heuristics (don’t start \( Q \) with all 0)
- refine using on-policy learning
N-step reinforcement learning
- between temporal difference and monte-carlo simulation
- better for sparse rewards
- n-step bootstrapping doesn't rely on the value estimate as much
- 1-step bootstrapping can be slow → needs to visit all states, try all actions, and propagate the reward back
- less variance compared to monte-carlo
- the same update rule \( Q = Q + \alpha ( R- Q) \), but
- update \( Q \) for current state and action after n steps
- store the rewards and observed states
- return \( R = \sum_{j=0}^{n-1} \gamma^j r_{\tau+j} \) (truncated at the end of the episode)
- if \( \tau+n < T \) then add \( R \leftarrow R + \gamma^n Q(s_{\tau+n},a_{\tau+n}) \)
- \( Q(s_{\tau+n},a_{\tau+n}) \) can be
- Q-learning → \( \max_{a'} \)
- SARSA → next action from policy
- missing steps
- no updates until the n-th step of the episode
- no TD-estimate of the future reward in the last n steps of the episode.
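A sketch of the n-step return computation from the stored window of rewards, with the bootstrapped tail added only when the episode has not yet terminated (`Q` is a dict of dicts as above; names are illustrative):

```python
def n_step_return(rewards, Q, s_tail, a_tail, gamma, bootstrap=True):
    """rewards: the stored r_tau, ..., r_{tau+n-1} (fewer near the end of the episode).
    If bootstrap is True (i.e. tau + n < T), add gamma^n * Q(s_{tau+n}, a_{tau+n})."""
    R = sum(gamma ** j * r for j, r in enumerate(rewards))
    if bootstrap:
        R += gamma ** len(rewards) * Q[s_tail][a_tail]
    return R

# then the usual update for the state-action pair from n steps ago:
# Q[s_tau][a_tau] += alpha * (R - Q[s_tau][a_tau])
```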
Monte-carlo tree search
- Value iteration
- off-line planning visits every state in state space → can’t scale with big state space
- MCTS is online and deals better with a big state space
- on-line planning: action selection is guided during the solving process → doesn't need to visit every state
- planning and execution at the same time during learning (usually through simulations)
- \( Q(s,a) \) is approximated by averaging the returns of trajectories from \( s \) obtained by repeated (random) simulations
- choose \( \arg\max_a \) given current \( s \)
- features
- builds the tree incrementally
- suitable for problem with big state and action space
- Otherwise, just use value iteration
- terminates after a predefined number of iterations (without finishing the entire search) → solution not necessarily optimal
- bad for sparse reward: no guidance for exploration
- MDPs as ExpectiMax Trees (maximize the expected return)
- \( s \) → (action) → branch node → (transition probability) → \( s' \)
- each node stores
- a set of children branch nodes (use this to determine fully expanded → all actions have been tried)
- pointers to its parent node and parent action; and
- the number of times it has been visited.
- MCTS framework
- Select a node that is not fully expanded (children not explored)
- if the current \( s \) is fully expanded, choose \( a \) with a multi-armed bandit algorithm and sample the next state as the new \( s \)
- Expand: try an untried action \( a \) and get \( s' \) and \( r \)
- Simulate from \( s' \) until a terminal state or some predefined number of steps
- can use some heuristic to guide towards promising states
- record return \( R \); no need to store states along the way
- Backpropagate: the value is propagated back to the root node
- total return \( R= r+\gamma R \)
- \( N=N+1 \)
- update \( Q \)
- Upper confidence tree
- MCTS + UCB (with state now)
- \( \arg\max_a [Q(s,a)+2C\sqrt{2\ln N(s)/N(s,a)}] \)
- \( C \) is exploration constant
- if \( Q\in [0,1] \) and \( C=1/\sqrt{2} \) → Minimax algo
- limit MCTS from re-expanding already visited states
- as simulations repeat, the same state along a path will continue to be expanded
- return \( V(s) \) as the reward for any expanded state
- MCTS with pretrained \( Q \)
- the pretrained \( Q \) guides the MCTS search towards important states
- MCTS improves \( Q \) by getting rewards from states
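A compact single-player UCT sketch under simplifying assumptions: the environment is accessed only through a simulator `step(s, a) -> (s', r)`, stochastic next states are re-sampled on every descent rather than stored as chance nodes, and all names are illustrative:

```python
import math, random

class Node:
    def __init__(self, parent=None, action=None):
        self.parent, self.action = parent, action
        self.children = {}            # action -> child Node
        self.N, self.Q = 0, 0.0       # visit count, mean return

def uct(root_state, actions, step, is_terminal, n_iters=2000, gamma=0.95, C=1.4, depth=50):
    root = Node()
    for _ in range(n_iters):
        node, state, path = root, root_state, []
        # 1. Select / 2. Expand: descend with UCB until an untried action is expanded
        while not is_terminal(state):
            untried = [a for a in actions(state) if a not in node.children]
            if untried:
                a = random.choice(untried)
                node.children[a] = Node(node, a)
            else:
                a = max(node.children, key=lambda b: node.children[b].Q
                        + C * math.sqrt(math.log(node.N + 1) / (node.children[b].N + 1e-9)))
            state, r = step(state, a)
            node = node.children[a]
            path.append((node, r))
            if node.N == 0:
                break                 # stop descending at the newly expanded node
        # 3. Simulate: random rollout from s'; only the return is kept, not the states
        R, discount = 0.0, 1.0
        for _ in range(depth):
            if is_terminal(state):
                break
            state, r = step(state, random.choice(actions(state)))
            R += discount * r
            discount *= gamma
        # 4. Backpropagate: R = r + gamma * R along the path back to the root
        for node, r in reversed(path):
            R = r + gamma * R
            node.N += 1
            node.Q += (R - node.Q) / node.N
        root.N += 1
    return max(root.children, key=lambda a: root.children[a].Q)   # best action at the root
```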
Q-function approximation
- pros
- memory: a Q-table has too many states and actions to store
- with approximation, after updating even one weight, we can get values for all states
- cons
- states that share the same feature values get the same approximated \( Q(s,a) \) (and hence the same action choice) → not necessarily true for the real \( Q(s,a) \)
- linear Q-fn approx \( Q(s,a)=f(s,a)\cdot w \) → features · weights
- both have size → (number of state features) × \( \vert A\vert \)
- update with \( w_i \leftarrow w_i + \alpha \, \delta \, f_i \)
- \( \delta = r+\gamma V(s') - \sum_i w_i \, f_i \) (the TD error against the current approximation)
- loss function \( L = \tfrac{1}{2}(Q_{\text{real}}- f\cdot w)^2 \) ← convex
- taking \( \nabla_w L \) gives the update rule above
- guarantee to converge to a global optima (convex loss function)
- won’t produce optimal policy if the underlying problem is non-linear
- domain knowledge required for manually feature engineering
- deep Q-fn \( Q(s,a;\theta) \)
- states can be non-structured (or less structured), rather than using a factored state representation. → images, videos, or unstructured text
- \( \theta \leftarrow \theta + \alpha \, \delta \, \nabla_\theta Q(s,a;\theta) \)
- data-expensive: needs to learn the features as well
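A sketch of the linear Q-function approximation update derived from the gradient of the squared TD error, assuming a user-supplied feature function `features(s, a)` that returns a NumPy vector (names are illustrative):

```python
import numpy as np

def linear_q(w, features, s, a):
    """Q(s, a) = f(s, a) . w for a linear approximation."""
    return features(s, a) @ w

def linear_q_update(w, features, s, a, r, s_next, actions, alpha=0.01, gamma=0.99):
    """Q-learning-style update: w_i <- w_i + alpha * delta * f_i, with
    delta = r + gamma * max_a' Q(s', a') - Q(s, a)."""
    v_next = max(linear_q(w, features, s_next, a2) for a2 in actions)
    delta = r + gamma * v_next - linear_q(w, features, s, a)
    return w + alpha * delta * features(s, a)
```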
Reward shaping
- goal
- improve speed to convergence (with good design)
- good for sparse reward
- methods
- reward shaping (for sparse reward)
- shaped reward \( F(s,s') \) for the transition from \( s \) to \( s' \)
- the TD error becomes \( r+F(s,s')+\gamma \max_{a'} Q(s',a') -Q(s,a) \)
- potential-based reward shaping
- \( F(s,s')=\gamma \phi(s') - \phi(s) \)
- potential function needs to be carefully designed to really speed up
- guarantees the optimal policy is unchanged (cool proof)
- Q-value initialization
- similar to potential-based reward shaping → \( \phi(s) = \) pre-defined \( V(s) \)
- can simply pick some state-action pairs
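A sketch of a Q-learning update with a potential-based shaped reward, where `phi` stands in for whatever user-designed potential function is used (`Q` is a dict of dicts as before):

```python
def shaped_q_update(Q, s, a, r, s_next, phi, alpha=0.1, gamma=0.99):
    """Q-learning update with potential-based shaping F(s, s') = gamma*phi(s') - phi(s).
    Shaping changes the reward signal but not the optimal policy."""
    F = gamma * phi(s_next) - phi(s)
    td_target = r + F + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (td_target - Q[s][a])
```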
Policy-based methods
- learn policy directly, better when action space is big (cannot do policy extraction)
- policy iteration (model-based) (value-based)
- key: instead of getting optimal \( V \) (then policy), just use \( V \) to improve policy at each iteration
- cons: doesn’t scale well with big state space
- two steps:
- policy evaluation start with a policy to get \( V(s) \)
- \( V^\pi(s)=\sum_{s'} P_a(s'\vert s) [r+\gamma V^\pi(s')] \)
- where \( a \) is simply from \( \pi \)
- (compare to the Bellman equation)
- \( V^\pi(s)=\max_a \sum_{s'} P_a(s'\vert s) [r+\gamma V^\pi(s')] \)
- where \( a \) is the optimal action at that state
- until convergence → \( O(\vert S\vert^3) \) (exact solve of the linear system)
- policy improvement
- for every state, assign to the policy \( \pi(s) \) the action with the highest \( r+\gamma V^\pi(s') \) (an \( \arg\max_a \) over the one-step lookahead)
- \( O(\vert S\vert^2\vert A\vert) \)
- optimal policy is achieved within \( \vert A\vert^{\vert S\vert} \) iterations
- value iteration can take infinite iterations
- but each iteration is more expensive
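A sketch of tabular policy iteration using the same assumed `P[s][a] = [(prob, next_state, reward), ...]` model as the value-iteration sketch above, with iterative policy evaluation rather than an exact linear solve:

```python
def policy_iteration(P, gamma=0.9, eval_threshold=1e-6):
    """P[s][a] = list of (prob, next_state, reward); states are 0..len(P)-1."""
    n_states = len(P)
    policy = {s: next(iter(P[s])) for s in range(n_states)}   # arbitrary initial policy
    V = [0.0] * n_states
    while True:
        # policy evaluation: sweep V^pi(s) = sum_s' P(s'|s, pi(s)) [r + gamma * V^pi(s')]
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < eval_threshold:
                break
        # policy improvement: greedy one-step lookahead, argmax_a of r + gamma * V^pi(s')
        stable = True
        for s in range(n_states):
            best = max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```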
- policy gradient (model-free)
- policy representation needs to be differentiable
- \( J(\theta)=V^{\pi_\theta}(s_0) \)
- update
- \( \nabla J(\theta)=E[\nabla_\theta \ln \pi_\theta (s,a)\, Q(s,a)] \)
- the gradient of the log probability of the action, weighted by the expected return \( Q(s,a) \) of taking that action
- increase log prob of \( (s,a) \) if taking \( a \) at \( s \) is good
- pros
- work well with continuous action and state space or large action space
- compare to policy iteration, no need to do \( \arg\max a \) in policy improvement
- cons
- sample inefficiency (credit assignment)
- even harder to explain → no values to support the chosen policy
- REINFORCE (on-policy)
- generate the whole episode (Monte Carlo)
- update each visited \( (s,a) \) in the episode based on its return
- \( \theta = \theta + \alpha \, \gamma^t R_t \nabla_\theta \ln \pi_\theta(s,a) \)
- (a Monte Carlo update without the counter \( N \) → first-visit and only one episode)
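A sketch of REINFORCE with a tabular softmax policy over a small discrete action set (plain NumPy, as a stand-in for whatever differentiable policy representation is used; `theta` is assumed to be an array of shape `(n_states, n_actions)`):

```python
import numpy as np

def softmax_policy(theta, s):
    """theta[s] holds one preference per action; returns action probabilities."""
    z = np.exp(theta[s] - np.max(theta[s]))
    return z / z.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """episode: list of (state, action, reward).
    Update: theta <- theta + alpha * gamma^t * R_t * grad log pi(a_t | s_t)."""
    T = len(episode)
    returns, R = [0.0] * T, 0.0
    for t in reversed(range(T)):                  # compute returns backward
        R = episode[t][2] + gamma * R
        returns[t] = R
    for t, (s, a, _) in enumerate(episode):
        probs = softmax_policy(theta, s)
        grad_log = -probs                         # d log pi(a|s) / d theta[s, :]
        grad_log[a] += 1.0
        theta[s] += alpha * (gamma ** t) * returns[t] * grad_log
    return theta
```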
Actor-critic method: learn both policy and value
- actor-critic (model-free)
- reduces the sample inefficiency and high variance of Monte Carlo estimation
- actor: learn policy like policy gradient
- critic: provide value for actor to update (NOT for getting policy. i.e. no \( \arg\max a \))
- → still work with continuous action and state space or large action space
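A sketch of a one-step actor-critic update on the same tabular softmax policy, where the critic's TD error replaces the Monte Carlo return as the learning signal for the actor (all names are illustrative):

```python
import numpy as np

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    """theta: per-state action preferences (actor); V: state-value table (critic)."""
    # critic: TD error delta = r + gamma * V(s') - V(s), then move V(s) toward the target
    target = r if done else r + gamma * V[s_next]
    delta = target - V[s]
    V[s] += alpha_critic * delta
    # actor: push up log pi(a|s) in proportion to the critic's TD error (no argmax needed)
    z = np.exp(theta[s] - np.max(theta[s]))
    probs = z / z.sum()
    grad_log = -probs
    grad_log[a] += 1.0
    theta[s] += alpha_actor * delta * grad_log
```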
Modeling and abstraction for MDPs
- state abstraction
- action abstraction
- rewarding sub-goals
- partially-observable states (e.g. play cards)
- ignore unseen states (ignore cards in deck)
- keep track of unseen states (probabilities of cards not in hand in others’ hands or deck)
- heuristic
- state/action pruning
- higher-value actions have priority; drop low-value ones
- not explore states that have no value based on prior knowledge
- exploiting
- balance heuristic and learned values
- shorter episodes
- early stopping with heuristic
- Q learning and SARSA use (and learn) \( Q \) at \( s’ \) as heuristic