RL fundamentals

Markov Decision Processes (MDP)

  • probabilistic state model
    • instead of the deterministic outcomes assumed in most classical planning algorithms, MDPs assume that each action can have multiple outcomes, each with a probability.
    • examples: robot navigation, treatment planning for patients
  • \( S, A, P(s’\vert s, a), r(s,a,s’), \gamma \)
    • state space, action space, transition probability, reward, discount factor
  • transition probability \( P(s’\vert s, a) \)
    • given \( s \) and \( a \), the next state is conditionally independent of all previous states and actions → Markov property.
  • discount factor
    • how much a future reward should be discounted compared to a current reward
    • reward \( r_t \)
    • \( R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots = \sum_{t’=0}^{\infty} \gamma^{t’} r_{t+t’} \)
    • \( R_t = r_t + \gamma R_{t+1} \)
    • less than 1, i.e. \( \gamma \in [0,1) \), so the infinite sum converges
  • (partially observable)
    • observation model \( O(o\vert s,a) \)
    • POMDP can be seen as MDP with a new state space where each state is a probability distribution over the original state space
      • → exponentially-larger state space, so POMDPs are typically harder problems to solve
  • solving an MDP produces a policy, instead of the sequence of actions produced by most classical planning algorithms
    • the best action to choose in each state
    • a policy can be deterministic or stochastic.
  • Value function: expected discounted accumulated reward
    • \( V^{\pi}(s)=E_{\pi}[R_t\vert s_t=s] \)
    • the expected value given that the agent follows policy \( \pi \) from \( s \)
  • Action value function: the value of choosing action a in state s and then following this same policy until termination
    • \( Q(s, a)=\sum_{s’}P(s’\vert s,a)[r+\gamma V(s’)] \)
  • Bellman equation: the condition that must hold for a policy to be optimal
    • recursive
      • the value is computed backward from future states, so the agent can choose the action that leads to an \( s' \) with a high value
    • \( V(s) = \max_{a} \sum_{s’} P(s’\vert s,a)[r+\gamma V(s’)] \)
      • the max over \( a \) is because the Bellman equation assumes that, once the values are known, the agent always takes the action leading to the best state
    • \( V(s) = \max_a Q(s,a) \)
  • Policy extraction (works when the action space is small; see the sketch at the end of this section)
    • \( \pi(s) = \arg\max_a \sum_{s’}P(s’\vert s, a)[r+\gamma V(s’)] \)
    • \( \pi(s) = \arg\max_a Q(s,a) \)
  • methods
    • value-based method
    • policy-based method:
      • good when the action space is infinite or very large
        • bypasses learning value functions
        • avoids the argmax over a big action space required by policy extraction
    • hybrid method
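
A minimal sketch of the Bellman backup and policy extraction above, on an assumed toy two-state MDP (the states, actions, and transition table are made up for illustration):

```python
# Toy MDP (assumed for illustration): P[(s, a)] -> list of (next_state, probability, reward)
GAMMA = 0.9
P = {
    ("s0", "stay"): [("s0", 1.0, 0.0)],
    ("s0", "go"):   [("s1", 0.8, 1.0), ("s0", 0.2, 0.0)],
    ("s1", "stay"): [("s1", 1.0, 0.5)],
    ("s1", "go"):   [("s0", 1.0, 0.0)],
}
STATES, ACTIONS = ["s0", "s1"], ["stay", "go"]

def q_value(V, s, a):
    """Q(s,a) = sum_{s'} P(s'|s,a) * (r + gamma * V(s'))."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in P[(s, a)])

def extract_policy(V):
    """pi(s) = argmax_a Q(s,a) -- only feasible when the action space is small."""
    return {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}

V = {"s0": 0.0, "s1": 0.0}      # any value estimate works; here all zeros
print(extract_policy(V))        # greedy policy with respect to this V
```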


Value Iteration

  • Find \( V^\star \) by iteratively solving the Bellman equation
  • initialize: error, \( V[\cdot] \) (e.g. all zeros)
  • while error > threshold
    • for all states
      • calculate the action values
        • (for each action, the action value is the expectation of \( r+\gamma V[s’] \))
        • (loop through all possible next states and their corresponding transition probabilities)
      • \( V'[s] = \max_a Q(s,a) \)
      • error \( = \max_s \vert V[s]-V'[s] \vert \)
  • complexity
    • \( O(\vert S\vert^2 \vert A \vert) \) per iteration (state, action, next state) → bad scaling
    • can be vectorized
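
A vectorized sketch of the loop above; the [A, S, S] array layout for \( P \) and the rewards is an assumption for illustration, not from the notes:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, threshold=1e-6):
    """P[a, s, s'] = transition probability P(s'|s,a); R[a, s, s'] = reward r(s,a,s')."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)                          # initialize V[.]
    error = np.inf
    while error > threshold:
        # Q[a, s] = sum_{s'} P(s'|s,a) * (r + gamma * V[s'])  -- O(|S|^2 |A|) per sweep
        Q = np.einsum("ast,ast->as", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=0)                       # Bellman backup: V'(s) = max_a Q(s,a)
        error = np.abs(V - V_new).max()
        V = V_new
    return V, Q.argmax(axis=0)                      # values and the extracted policy
```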

Multi-armed bandits

  • problem
    • each arm gives reward \( X_{a,t} \) with different probabilities
    • maximize total rewards over \( T \) plays
    • no states (independent events)
    • drift: when the probabilities suddenly change
      • recommendation system: users preferences change constantly
  • \( Q(a) \)
    • average reward from using arm \( a \) → \( \sum_t X_{a,t} \, / \, N(a) \)
    • \( N(a) \) number of times using arm \( a \)
  • minimize total regret → used to analyze multi-armed bandit algorithms
    • regret \( \mathcal{R}(\pi,T)=T \max_a Q^\star(a) - E[\sum_{t=1}^{T} X_{\pi(t),t}] \)
    • expected loss from not taking the best action
    • zero-regret strategy
      • regret approaches zero as \( T \) goes to infinity
      • converges to an optimal policy (may not be unique)
  • How to select \( a \)? (see the sketch after this list)
    • epsilon-first
      • explore: for the first \( \epsilon T \) rounds, select arms uniformly at random
      • exploit: afterwards, select the arm with \( \max Q \)
      • NOT zero-regret
    • (decreasing) epsilon-greedy
      • explore: with probability epsilon: select arm uniformly
      • exploit: with probability (1-epsilon): select arm with \( \max Q \)
      • can be zero-regret with specific parameters
    • softmax: sample action based on probability determined by \( Q \)
      • Boltzmann distribution: \( e^{Q(a)/\tau}/\sum_b e^{Q(b)/\tau} \)
      • \( \tau=1 \) → probability proportional to \( e^{Q(a)} \)
      • \( \tau \gg 1 \) → close to a uniform distribution
      • \( \tau \approx 0 \) → greedy
      • good for problems with drift → when \( Q \) changes, the action probabilities adjust accordingly
      • can be zero-regret with specific parameters
    • upper confidence bounds (UCB) : sample action based on \( Q \) and number of times using arm \( a \)
      • \( \arg\max_a [Q(a)+\sqrt{2\ln T/N(a)}] \)
      • exploitation: \( Q(a) \)
      • exploration: second term → fewer usage gives higher bound
      • good for problems with drift → when \( Q \) changes, the policy adjusts based on \( Q \)
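
A minimal sketch of the three selection rules, assuming \( Q \) and \( N \) are kept as plain lists indexed by arm:

```python
import math
import random

def epsilon_greedy(Q, epsilon=0.1):
    """Explore uniformly with probability epsilon, otherwise exploit argmax Q."""
    if random.random() < epsilon:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])

def softmax(Q, tau=1.0):
    """Boltzmann exploration: P(a) proportional to exp(Q(a)/tau)."""
    weights = [math.exp(q / tau) for q in Q]
    return random.choices(range(len(Q)), weights=weights)[0]

def ucb(Q, N, t):
    """UCB1: argmax_a Q(a) + sqrt(2 ln t / N(a)); pull each untried arm first."""
    for a, n in enumerate(N):
        if n == 0:
            return a
    return max(range(len(Q)), key=lambda a: Q[a] + math.sqrt(2 * math.log(t) / N[a]))

# after pulling arm a and observing reward x, keep a running average:
#   N[a] += 1;  Q[a] += (x - Q[a]) / N[a]
```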

Temporal-difference reinforcement learning

  • Model-free
    • no knowledge of transition
    • experience by trying actions and getting rewards
    • learn value function and policy directly (not transitions)
    • On-policy learning estimates \( Q^\pi \) for the current \( \pi \)
    • off-policy learning estimates the policy independently of the current behaviour policy
  • methods
    • monte-carlo RL
      • nice toy example
      • keep track of \( Q \) for every state-action
        • for each episode:
          • \( R_t=r_t+\gamma R_{t+1} \) (returns computed backward from the end of the episode)
          • generate \( a \) using a multi-armed bandit algorithm (conditioned on the state)
          • \( N(s,a) += 1 \)
          • update \( Q(s,a) \leftarrow Q(s,a) + [R_t-Q(s,a)]/N(s,a) \) (running average of returns over \( N \))
      • Monte Carlo method: simulations that use some randomness to explore actions
      • update at the end of episode
      • high variance
        • \( R_t=r_t+\gamma R_{t+1} \): the full discounted return (high variance, episode-dependent) is used to calculate \( Q(s,a) \)
        • executing action \( a \) in state \( s \) in different episodes can result in very different returns → requires a lot of episodes to get good estimates
    • temporal-difference (TD) methods
      • bootstrap
        • update \( Q \) based on TD target: \( r_t+\gamma V(s’) \)
          • the reward plus an estimate of future discounted reward
          • \( Q = (1-\alpha)Q+\alpha [r+\gamma V(s’)] \)
            • higher alpha makes recent info more important
          • the estimate \( V(s’) \) improves toward the optimal value over time, making the TD target more stable than relying on the full episode return
          • rewrite into \( Q = Q + \alpha (r+\gamma V(s’) - Q) \)
            • learning rate (\( \alpha \)) * TD error
        • limitation:
          • for sparse rewards, it can take longer for rewards to propagate through \( Q \)
          • keeping track of \( Q \) table is expensive and inefficient
            • similar states that haven’t been visited have no values
      • updates at every step (no need to wait for the episode to end) → learning starts early in each episode
      • two ways to estimate the bootstrapped value \( V(s’) \) (see the sketch at the end of this section)
        • Q-learning - off-policy
          • optimistic: estimate \( V(s’) \) using \( Q \), assuming all best actions starting from \( s' \)
          • \( V(s’)= \max_{a’} Q(s’,a’) \)
          • off-policy: ignore the actual next action that will be executed, and instead updates based on the estimated best action.
          • convergence guaranteed for tabular \( Q \)
        • SARSA - on-policy
          • sample the next action \( a’ \) and use its \( Q \) value in the update
          • \( V(s’)= Q(s’,a’) \)
          • the policy being used matters!
            • Using a random policy, Q-learning will still converge to the optimal policy, but SARSA will not (necessarily).
  • comparison of on-policy and off-policy methods
    • off-policy
      • learn from demonstration: Q-learning → instead of \( \arg\max_a \), use expert \( a \)
    • on-policy
      • safer, as demonstrated in the cliff-walking environment
        • SARSA takes \( a' \) and its reward is immediately used to update \( Q \). Since \( a' \) can be exploratory and risky, the learned \( Q \) will avoid states that carry this risk → safe but sub-optimal
        • Q-learning is optimistic and ignores the potentially risky exploratory actions → optimal, but riskier during training (worse returns for off-policy learning while training)
    • combination
      • pre-train: off-policy learning with some heuristics (don’t start \( Q \) with all 0)
      • refine using on-policy learning
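
A tabular sketch of the two TD targets; the env.reset()/env.step() interface is an assumption for illustration, not from the notes:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def td_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1, sarsa=False):
    """Run one episode, updating Q with the SARSA or Q-learning target."""
    s = env.reset()
    a = epsilon_greedy(Q, s, actions, epsilon)
    done = False
    while not done:
        s2, r, done = env.step(a)                        # assumed env interface
        a2 = epsilon_greedy(Q, s2, actions, epsilon)     # action that will actually be taken
        if done:
            v_next = 0.0
        elif sarsa:
            v_next = Q[(s2, a2)]                         # SARSA: on-policy, uses the sampled a'
        else:
            v_next = max(Q[(s2, b)] for b in actions)    # Q-learning: off-policy, optimistic max
        Q[(s, a)] += alpha * (r + gamma * v_next - Q[(s, a)])   # Q = Q + alpha * TD error
        s, a = s2, a2
    return Q

Q = defaultdict(float)   # tabular Q with default value 0
```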

N-step reinforcement learning

  • between temporal difference and monte-carlo simulation
    • better for sparse rewards
      • n-step bootstrapping relies less on the bootstrapped estimate
      • 1-step bootstrapping can be slow → need to visit all states, try all actions, and propagate the reward back
    • less variance compared to monte-carlo
  • the same update rule \( Q = Q + \alpha ( R- Q) \), but
    • update \( Q \) for current state and action after n steps
      • store the rewards and observed states
    • return \( R = \sum_{j=0}^{n-1} \gamma^j r_{\tau+j} \) (see the sketch at the end of this section)
      • if \( \tau+n<T \) (the episode has not terminated within the \( n \) steps), then add \( R += \gamma^n Q(s_{\tau+n},a_{\tau+n}) \)
        • \( Q(s_{\tau+n},a_{\tau+n}) \) can be
          • Q-learning → \( \max_{a’} Q(s_{\tau+n},a’) \)
          • SARSA → next action from policy
  • missing updates
    • no updates until the n-th step of the episode
    • no TD estimate of the future reward in the last n steps of the episode
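
A small sketch of the n-step return; the bootstrap argument stands for either the Q-learning or the SARSA choice above:

```python
def n_step_return(rewards, gamma, q_bootstrap=None):
    """R = sum_{j=0}^{n-1} gamma^j * r_{tau+j}, plus gamma^n * Q(s_{tau+n}, a_{tau+n}) if given.

    rewards: the n observed rewards r_tau, ..., r_{tau+n-1}
    q_bootstrap: None if the episode terminated within the n steps (tau + n >= T)
    """
    R = sum(gamma ** j * r for j, r in enumerate(rewards))
    if q_bootstrap is not None:
        R += gamma ** len(rewards) * q_bootstrap
    return R

# e.g. a 3-step return with a bootstrapped tail value of 2.0
print(n_step_return([0.0, 0.0, 1.0], gamma=0.9, q_bootstrap=2.0))   # 0.81 + 0.729*2.0 = 2.268
```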

Monte Carlo Tree Search (MCTS)

  • Value iteration
    • off-line planning that visits every state in the state space → can’t scale to big state spaces
  • MCTS is online and deals better with big state spaces
    • on-line planning: action selection is guided during the solving process → doesn’t need to visit every state
    • planning and execution happen at the same time during learning (usually through simulations)
    • \( Q(s,a) \) is approximated by averaging the returns of trajectories from \( s \) obtained by repeated (randomized) simulations
    • choose \( \arg\max_a \) given current \( s \)
    • features
      • builds the tree incrementally
      • suitable for problem with big state and action space
        • Otherwise, just use value iteration
        • terminates after a predefined number of iterations (not finishing the entire search) → solution not optimal
      • bad for sparse reward: no guidance for exploration
  • MDPs as ExpectiMax Trees (maximize the expected return)
  • MCTS framework
    • Select: descend to a node that is not fully expanded (some children not explored)
      • if the current \( s \) is fully expanded, choose \( a \) with a multi-armed bandit algorithm and take the resulting next state as the new \( s \)
    • Expand: try an untried action \( a \) and get \( s' \) and \( r \)
    • Simulate: from \( s' \) until a terminal state or some predefined number of steps
      • can use some heuristic to guide the rollout towards promising states
      • record the return \( R \); no need to store states along the way
    • Backpropagate: the value is propagated back up to the root node
      • total return \( R= r+\gamma R \)
      • \( N=N+1 \)
      • update \( Q \)
  • Upper confidence tree
    • MCTS + UCB (now conditioned on the state)
    • \( \arg\max_a [Q(s,a)+2C\sqrt{2\ln N(s)/N(s,a)}] \) (see the sketch at the end of this section)
    • \( C \) is the exploration constant
    • if \( Q\in [0,1] \) and \( C=1/\sqrt{2} \) → Minimax algorithm
  • limit MCTS from re-expanding visited states
    • as simulations repeat, the same state along a path would otherwise continue to be expanded
    • return \( V(s) \) as the reward for any already-expanded state
  • MCTS with pretrained \( Q \)
    • the pretrained \( Q \) guides the MCTS search towards important states
    • MCTS improves \( Q \) by getting rewards from states
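
A sketch of the UCT selection rule and the backpropagation step; the Node fields are an assumed minimal layout, not from the notes:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """Per-node MCTS statistics; children maps an action to the resulting child node."""
    N: int = 0
    Q: float = 0.0
    children: dict = field(default_factory=dict)

def uct_select(node, C=1 / math.sqrt(2)):
    """argmax_a Q(s,a) + 2C * sqrt(2 ln N(s) / N(s,a)); untried actions go first."""
    def score(a):
        child = node.children[a]
        if child.N == 0:
            return float("inf")
        return child.Q + 2 * C * math.sqrt(2 * math.log(node.N) / child.N)
    return max(node.children, key=score)

def backpropagate(path, R, gamma=1.0):
    """Push the simulated return R back to the root: R = r + gamma*R at each step."""
    for node, reward in reversed(path):      # path = [(node, reward received entering it), ...]
        R = reward + gamma * R
        node.N += 1
        node.Q += (R - node.Q) / node.N      # running average of returns
```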

Q-function approximation

  • pros
    • memory: a Q-table has too many states and actions to store
    • with approximation, updating even one weight changes the value estimates for all states (generalization)
  • cons
    • states that share the same feature values get the same \( Q(s,a) \) (hence the same action choice) → not necessarily true for the real \( Q(s,a) \)
  • linear Q-fn approx \( Q(s,a)=f(s,a) \cdot w \) → features · weights (see the sketch at the end of this section)
    • both vectors have size = (number of state features) × \( \vert A\vert \)
    • update with \( w_i \leftarrow w_i + \alpha \, \delta \, f_i \)
      • \( \delta = r+\gamma V(s’) - \sum_i w_i \, f_i \)
      • loss function \( L = \frac{1}{2}(Q_{real}- f \cdot w)^2 \) ← convex
        • taking \( \nabla_w L \) gives the update rule
    • guaranteed to converge to a global optimum (convex loss function)
      • won’t produce the optimal policy if the true \( Q \) is not linear in the features
      • domain knowledge required for manual feature engineering
  • deep Q-fn \( Q(s,a;\theta) \)
    • states can be non-structured (or less structured), rather than using a factored state representation. → images, videos, or unstructured text
    • \( \theta \leftarrow \theta + \alpha \, \delta \, \nabla_\theta Q(s,a;\theta) \)
    • data-hungry: needs to learn the features as well
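
A sketch of the linear case above; the feature vector here is hand-made for illustration:

```python
import numpy as np

def linear_q(w, f_sa):
    """Q(s,a) = f(s,a) . w (dot product of the feature vector and the weight vector)."""
    return float(np.dot(w, f_sa))

def linear_q_update(w, f_sa, r, v_next, alpha=0.01, gamma=0.99):
    """w_i <- w_i + alpha * delta * f_i, with delta = r + gamma*V(s') - Q(s,a).

    v_next is max_a' Q(s',a') for Q-learning, or Q(s',a') for SARSA.
    """
    delta = r + gamma * v_next - linear_q(w, f_sa)
    return w + alpha * delta * f_sa

w = np.zeros(4)                              # one weight per feature
f_sa = np.array([1.0, 0.5, 0.0, 0.2])        # assumed hand-crafted features for some (s, a)
w = linear_q_update(w, f_sa, r=1.0, v_next=0.0)
print(w)                                     # every weight moves in proportion to its feature
```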

reward shaping

  • goal
    • improve the speed of convergence (with good design)
    • good for sparse reward
  • methods
    • reward shaping (for sparse reward)
      • shaped reward \( F(s,s’) \) for the transition from \( s \) to \( s’ \)
        • the TD error becomes \( r+F(s,s’)+\gamma \max_{a’} Q(s’,a’) -Q(s,a) \)
      • potential-based reward shaping
        • \( F(s,s’)=\gamma \phi(s’) - \phi(s) \) (see the sketch at the end of this section)
        • the potential function needs to be carefully designed to really speed things up
      • guarantees the optimal policy is preserved (cool proof)
    • Q-value initialization
      • similar to potential-based reward shaping → \( \phi(s) = \) pre-defined \( V(s) \)
      • can simply pick some state-action pairs
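
A sketch of the shaped TD error; the line-world potential \( \phi \) is a made-up example of a hand-designed potential function:

```python
from collections import defaultdict

def shaped_td_error(Q, s, a, r, s2, actions, phi, gamma=0.99):
    """TD error with potential-based shaping F(s,s') = gamma*phi(s') - phi(s)."""
    F = gamma * phi(s2) - phi(s)
    target = r + F + gamma * max(Q[(s2, b)] for b in actions)
    return target - Q[(s, a)]

# toy usage: states are positions on a line with the goal at 10; phi(s) = -distance to goal
Q = defaultdict(float)
phi = lambda s: -abs(10 - s)
delta = shaped_td_error(Q, s=3, a="right", r=0.0, s2=4, actions=["left", "right"], phi=phi)
print(delta)   # positive: stepping toward the goal looks good even with zero sparse reward
```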

policy-based methods

  • learn the policy directly; better when the action space is big (policy extraction via argmax is infeasible)
  • policy iteration (model-based) (value-based)
    • key: instead of getting optimal \( V \) (then policy), just use \( V \) to improve policy at each iteration
    • cons: doesn’t scale well with big state space
    • two steps:
      • policy evaluation: start with a policy and compute its \( V(s) \)
        • \( V^\pi(s)=\sum_{s’} P_a(s’\vert s) [r+\gamma V^\pi(s’)] \)
          • where \( a \) is simply from \( \pi \)
        • (compare to bellman eqn)
          • \( V(s)=\max_a \sum_{s’} P_a(s’\vert s) [r+\gamma V(s’)] \)
            • where \( a \) is the optimal action at that state
        • repeat until convergence → \( O(\vert S\vert ^3) \)
      • policy improvement
        • for every state, take \( \arg\max_a \) and assign the action with the highest \( r+\gamma V^\pi(s’) \) to the policy \( \pi(s) \)
        • \( O(\vert S\vert^2\vert A\vert) \)
    • optimal policy is achieved within \( \vert A\vert^{\vert S\vert} \) iterations
      • value iteration may need an unbounded number of iterations to converge exactly
    • but each iteration is more expensive
  • policy gradient (model-free)
    • policy representation needs to be differentiable
      • \( J(\theta)=V^{\pi_\theta}(s_0) \)
    • update
      • \( \nabla J (\theta)=E[\nabla_\theta \ln \pi_\theta (s,a)\, Q(s,a)] \)
      • the expected return of taking action \( a \) under \( \pi_\theta(s,a) \), multiplied by the gradient of its log-probability
        • increase log prob of \( (s,a) \) if taking \( a \) at \( s \) is good
    • pros
      • work well with continuous action and state space or large action space
        • compared to policy iteration, there is no need to take \( \arg\max_a \) in policy improvement
    • cons
      • sample inefficiency (credit assignment)
      • even harder to explain → no values to support the chosen policy
    • REINFORCE (on-policy)
      • generate a whole episode (Monte Carlo)
      • update every visited \( (s,a) \) in the episode based on the return
        • \( \theta = \theta + \alpha \, \gamma^t R_t \nabla_\theta \ln \pi_\theta(s,a) \)
        • (a Monte Carlo update without the counter \( N \) → first-visit, and only 1 episode at a time)
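
A sketch of the REINFORCE update with a linear softmax policy (the feature-based policy form is an assumption for illustration):

```python
import numpy as np

def softmax_policy(theta, f_s):
    """pi_theta(a|s) from a linear softmax: logits = theta @ f(s), theta is [n_actions, n_features]."""
    logits = theta @ f_s
    logits -= logits.max()                   # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """theta += alpha * gamma^t * R_t * grad log pi(a_t|s_t) for each step of one episode."""
    R, returns = 0.0, []
    for (_, _, r) in reversed(episode):      # episode = [(f_s, a, r), ...]
        R = r + gamma * R                    # returns computed backward: R_t = r_t + gamma*R_{t+1}
        returns.append(R)
    returns.reverse()
    for t, ((f_s, a, _), R_t) in enumerate(zip(episode, returns)):
        p = softmax_policy(theta, f_s)
        grad_log = -np.outer(p, f_s)         # d log pi(a|s) / d theta_b = (1[b==a] - p_b) * f(s)
        grad_log[a] += f_s
        theta += alpha * (gamma ** t) * R_t * grad_log
    return theta

theta = np.zeros((2, 3))                     # 2 actions, 3 features
episode = [(np.array([1.0, 0.0, 1.0]), 0, 1.0), (np.array([0.0, 1.0, 1.0]), 1, 0.0)]
theta = reinforce_update(theta, episode)
```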

Actor critic method: learn both policy and value

  • actor-critic (model-free)
    • reduces the sample inefficiency and high variance of Monte Carlo returns
    • actor: learns the policy, as in policy gradient
    • critic: provides values for the actor’s update (NOT for extracting a policy, i.e. no \( \arg\max_a \))
      • → still works with continuous or large action and state spaces
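
A sketch of one actor-critic update with a linear critic and a linear softmax actor (both forms are assumptions for illustration):

```python
import numpy as np

def softmax(logits):
    logits = logits - logits.max()
    p = np.exp(logits)
    return p / p.sum()

def actor_critic_step(theta, w, f_s, a, r, f_s2, done,
                      alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    """theta: [n_actions, n_features] actor parameters; w: [n_features] critic weights."""
    v = w @ f_s
    v_next = 0.0 if done else w @ f_s2
    delta = r + gamma * v_next - v                  # TD error supplied by the critic
    w = w + alpha_critic * delta * f_s              # critic: semi-gradient TD(0) on V(s) = w . f(s)
    p = softmax(theta @ f_s)
    grad_log = -np.outer(p, f_s)                    # grad of log pi(a|s) for a linear softmax actor
    grad_log[a] += f_s
    theta = theta + alpha_actor * delta * grad_log  # actor: the TD error replaces the Monte Carlo return
    return theta, w
```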

Modeling and abstraction for MDPs

  • state abstraction
  • action abstraction
  • rewarding sub-goals
  • partially-observable states (e.g. play cards)
    • ignore unseen states (ignore cards in deck)
    • keep track of unseen states (probabilities of cards not in hand in others’ hands or deck)
  • heuristic
    • state/action pruning
      • higher-value actions get priority; drop the low-value ones
      • don’t explore states that have no value based on prior knowledge
    • exploiting
      • balance heuristic and learned values
    • shorter episodes
      • early stopping with heuristic
      • Q-learning and SARSA use (and learn) \( Q \) at \( s’ \) as a heuristic