RL fundamentals
Markov Decision Processes (MDP)
- probabilistic state model
- instead of the deterministic outcomes assumed in most classical planning algorithms, MDPs assume that each action can have multiple outcomes, each with a probability.
- robot navigation, treatment for patients
- \( S, A, P(s'\vert s, a), r(s,a,s'), \gamma \)
- state space, action space, transition probability, reward, discount factor
- transition probability \( P(s'\vert s, a) \)
- given \( s \) and \( a \), the transition is conditionally independent of all previous states and actions → Markov property.
- discount factor
- how much a future reward should be discounted compared to a current reward
- reward \( r_t \)
- \( R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots = \sum_{t'=0}^{\infty} \gamma^{t'} r_{t+t'} \)
- \( R_t = r_t + \gamma R_{t+1} \)
- \( \gamma \) is less than 1 (so the infinite sum converges)
- (partially observable)
- observation model \( O(o\vert s,a) \)
- POMDP can be seen as MDP with a new state space where each state is a probability distribution over the original state space
- → exponentially-larger state space, so POMDPs are typically harder problems to solve
- solving an MDP produces a policy, rather than the sequence of actions produced by most classical planning algorithms
- the best action to choose in each state
- a policy can be deterministic or stochastic.
- Value function: expected discounted accumulated reward
- \( V^{\pi}(s)=E_{\pi}[R_t\vert s_t=s] \)
- the expected value given that the agent follows policy \( \pi \) from \( s \)
- Action value function: the value of choosing action a in state s and then following this same policy until termination
- \( Q(s, a)=\sum_{s'}P(s'\vert s,a)[r+\gamma V(s')] \)
- Bellman equation: the condition that must hold for a policy to be optimal
- recursive
- propagate the reward backward so that the agent can choose the action that leads to an \( s' \) with a high value
- \( V(s) = \max_{a} \sum_{s'} P(s'\vert s,a)[r+\gamma V(s')] \)
- the max over \( a \) is because the Bellman equation assumes that once we know the values of the successor states, we will always take the action that leads to the best one
- \( V(s) = \max_a Q(s,a) \)
- Policy extraction (works when action space is small)
- \( \pi(s) = \arg\max_a \sum_{s'}P(s'\vert s, a)[r+\gamma V(s')] \)
- \( \pi(s) = \arg\max_a Q(s,a) \)
- methods
- value-based method
- policy-based method:
- good when the action space is infinite or continuous
- bypasses learning value functions
- avoids the \( \arg\max_a \) over a big action space needed in policy extraction
- hybrid method
Value Iteration
- Find \( V^\star \) by iteratively solving the Bellman equation (sketch below)
- initialize: error, \( V[\cdot] \)
- while error > threshold
- for all states
- calculate the action values
- (for each action, the action value is the expectation of \( r+\gamma V[s'] \))
- (loop through all possible next states and their corresponding transition probabilities)
- \( V'[s] = \max_a Q(s,a) \)
- error \( = \max_s \vert V[s]-V'[s] \vert \), then \( V \leftarrow V' \)
- complexity
- \( O(\vert S\vert^2 \vert A \vert) \) per iteration (state, action, next state) → bad scaling
- can be vectorized
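A minimal tabular value-iteration sketch, assuming a transition model where `P[s]` is a dict keyed by action and `P[s][a]` is a list of `(prob, next_state, reward)` tuples, with states numbered `0..len(P)-1` (names are illustrative, not from any particular library):

```python
import numpy as np

def value_iteration(P, gamma=0.9, threshold=1e-6):
    """P[s][a] = list of (prob, next_state, reward); states are 0..len(P)-1."""
    n_states = len(P)
    V = np.zeros(n_states)
    error = float("inf")
    while error > threshold:
        V_new = np.empty(n_states)
        for s in range(n_states):
            # Q(s,a): expectation of r + gamma * V[s'] over possible next states
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                 for outcomes in P[s].values()]
            V_new[s] = max(q)
        error = np.max(np.abs(V_new - V))
        V = V_new
    # policy extraction: argmax_a of the one-step lookahead using the converged V
    policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in range(n_states)}
    return V, policy
```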
Multi-armed bandits
- problem
- each arm gives reward \( X_{a,t} \) with different probabilities
- maximize total rewards over \( T \) plays
- no states (independent events)
- drift: when the probabilities suddenly change
- recommendation system: users preferences change constantly
- \( Q(a) \)
- average reward from using arm \( a \) → \( \sum_t X_{a,t} \, / \, N(a) \)
- \( N(a) \) number of times using arm \( a \)
- minimize total regret → use to analyze multi-arm bandit algo
- regret \( \mathcal{R}(\pi,T)=T \max_a Q^\star(a) - E\big[\sum_{t=1}^{T} X_{\pi(t),t}\big] \)
- expected loss from not taking the best action
- zero-regret strategy
- regret approaches zero as \( T \) goes to infinity
- converges to an optimal policy (may not be unique)
- How to select \( a \)?
- epsilon-first
- explore: first epsilon rounds, select arm uniformly
- exploit: after epsilon rounds, select arm with \( \max Q \)
- NOT zero-regret
- (decreasing) epsilon-greedy
- explore: with probability epsilon: select arm uniformly
- exploit: with probability (1-epsilon): select arm with \( \max Q \)
- can be zero-regret with specific parameters
- softmax: sample action based on probability determined by \( Q \)
- Boltzmann distribution: \( e^{Q(a)/\tau}/\sum_b e^{Q(b)/\tau} \)
- \( \tau=1 \) → probability proportional to \( e^{Q(a)} \)
- \( \tau \gg 1 \) → uniform distribution
- \( \tau \approx 0 \) → greedy
- good for problems with drift → when \( Q \) changes, the policy adjusts automatically because the action probabilities are based on \( Q \)
- can be zero-regret with specific parameters
- upper confidence bounds (UCB) : sample action based on \( Q \) and number of times using arm \( a \)
- \( \arg\max_a [Q(a)+\sqrt{2\ln T/N(a)}] \)
- exploitation: \( Q(a) \)
- exploration: second term → fewer usage gives higher bound
- good for problems with drift → when \( Q \) changes, the policy can adjust based on \( Q \)
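A sketch of epsilon-greedy and UCB1 arm selection plus the incremental Q update, assuming `Q` and `N` are NumPy arrays of per-arm reward estimates and pull counts (function names are illustrative):

```python
import numpy as np

def select_arm(Q, N, t, epsilon=0.1, method="ucb"):
    """Q: estimated mean reward per arm, N: pull counts, t: current round (1-indexed)."""
    if method == "epsilon_greedy":
        if np.random.rand() < epsilon:            # explore: uniform random arm
            return np.random.randint(len(Q))
        return int(np.argmax(Q))                  # exploit: arm with max Q
    # UCB1: pull each arm once, then trade off Q(a) against an exploration bonus
    if np.any(N == 0):
        return int(np.argmax(N == 0))
    return int(np.argmax(Q + np.sqrt(2 * np.log(t) / N)))

def update_estimate(Q, N, a, x):
    """Incremental average so that Q[a] = (sum of rewards from arm a) / N[a]."""
    N[a] += 1
    Q[a] += (x - Q[a]) / N[a]
```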
Temporal-difference reinforcement learning
- Model-free
- no knowledge of transition
- experience by trying actions and getting rewards
- learn value function and policy directly (not transitions)
- on-policy learning estimates \( Q^\pi \) for the current \( \pi \)
- off-policy learning estimates the policy independently of the current behaviour
- methods
- monte-carlo RL
- nice toy example
- keep track of \( Q \) for every state-action
- for each episode:
- \( R_t=r_t+\gamma R_{t+1} \) (returns computed backward from the end of the episode)
- generate \( a \) using a multi-armed bandit algorithm (per state)
- update \( Q(s,a) \leftarrow Q(s,a) + [R_t-Q(s,a)]/N(s,a) \) (incremental average of returns over \( N \))
- \( N(s,a) += 1 \)
- Monte Carlo method: simulations that use some randomness to explore actions
- update at the end of episode
- high variance
- \( R_t=r_t+\gamma R_{t+1} \): the future discounted return (high variance, depends on the episode) is used to calculate \( Q(s,a) \)
- executing action \( a \) in state \( s \) in different episodes can result in very different returns → requires a lot of episodes to get good estimates
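A sketch of the per-episode Monte Carlo update described above (every-visit variant), assuming `Q` and `N` are dictionaries keyed by `(state, action)`:

```python
from collections import defaultdict

def mc_update(episode, Q, N, gamma=0.99):
    """episode: list of (state, action, reward) tuples in the order they occurred."""
    R = 0.0
    for s, a, r in reversed(episode):      # walk backward so R_t = r_t + gamma * R_{t+1}
        R = r + gamma * R
        N[(s, a)] += 1
        Q[(s, a)] += (R - Q[(s, a)]) / N[(s, a)]   # incremental average of returns

Q, N = defaultdict(float), defaultdict(int)        # then call mc_update(episode, Q, N)
```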
- temporal difference (TD) methods
- bootstrap
- update \( Q \) based on TD target: \( r_t+\gamma V(s') \)
- the reward plus an estimate of future discounted reward
- \( Q = (1-\alpha)Q+\alpha [r+\gamma V(s')] \)
- higher alpha makes recent info more important
- \( V(s') \) converges to optimal, making the TD estimate more stable than relying on the episode return
- rewrite into \( Q = Q + \alpha (r+\gamma V(s') - Q) \)
- learning rate (\( \alpha \)) × TD error
- limitation:
- for sparse rewards, it can take longer for rewards to propagate through \( Q \)
- keeping track of \( Q \) table is expensive and inefficient
- similar states that haven’t been visited have no values
- updates at every step: useful even in the early stages of an episode (no need to wait for the episode to end)
- two ways to estimate the bootstrapped value \( V(s') \)
- Q-learning - off-policy
- optimistic: estimate \( V(s') \) using \( Q \), assuming the best action is taken from \( s' \) onward
- \( V(s')= \max_{a'} Q(s',a') \)
- off-policy: ignore the actual next action that will be executed, and instead updates based on the estimated best action.
- convergence guaranteed for tabular \( Q \)
- SARSA - on-policy
- sample the next action \( a' \) from the current policy and use its \( Q \) to update
- \( V(s')= Q(s',a') \)
- the policy being used matters!
- Using a random policy, Q-learning will still converge to the optimal policy, but SARSA will not (necessarily).
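A sketch of a single TD update showing the only difference between Q-learning and SARSA: how the bootstrapped \( V(s') \) is estimated. `Q` is assumed to be a dict of dicts `Q[s][a]`; terminal-state handling is omitted:

```python
def td_update(Q, s, a, r, s_next, a_next=None, alpha=0.1, gamma=0.99):
    """One temporal-difference update. If a_next is None, do Q-learning (off-policy);
    otherwise SARSA (on-policy), bootstrapping from the action actually chosen."""
    if a_next is None:
        v_next = max(Q[s_next].values())       # Q-learning: optimistic max over a'
    else:
        v_next = Q[s_next][a_next]             # SARSA: value of the sampled next action
    td_target = r + gamma * v_next
    Q[s][a] += alpha * (td_target - Q[s][a])   # Q <- Q + alpha * (TD target - Q)
```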
- comparison of on-policy and off-policy methods
- off-policy
- learn from demonstration: Q-learning → instead of \( \arg\max_a \), use expert \( a \)
- on-policy
- safer demonstrated in the cliff env
- SARSA takes \( a' \) and the reward is immediately used to update \( Q \). Since \( a' \) can be exploratory and risky, the resulting \( Q \) will avoid states that allow these possibilities → safe but sub-optimal
- Q-learning's \( Q \) is optimistic and ignores the potentially risky exploratory action → optimal, but riskier during training (worse online returns than the on-policy method)
- combination
- pre-train: off-policy learning with some heuristics (don’t start \( Q \) with all 0)
- refine using on-policy learning
N-step reinforcement learning
- between temporal difference and monte-carlo simulation
- better for sparse rewards
- n-step bootstrapping doesn't rely on the value estimate as much
- 1-step bootstrapping can be slow → needs to visit all states, try all actions, and propagate the reward back
- less variance compared to monte-carlo
- the same update rule \( Q = Q + \alpha ( R- Q) \), but
- update \( Q \) for current state and action after n steps
- store the rewards and observed states
- return \( R = \sum_{j=0}^{n-1} \gamma^j r_{\tau+j} \) (truncated at the end of the episode)
- if \( \tau+n < T \) then add \( R \leftarrow R + \gamma^n Q(s_{\tau+n},a_{\tau+n}) \)
- \( Q(s_{\tau+n},a_{\tau+n}) \) can be
- Q-learning → \( \max_{a'} \)
- SARSA → next action from policy
- missing steps
- no updates until the n-th step of the episode
- no TD-estimate of the future reward in the last n steps of the episode.
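A sketch of the n-step return computation from the stored window of rewards, with the bootstrapped tail added only when the episode has not yet terminated (`Q` is a dict of dicts as above; names are illustrative):

```python
def n_step_return(rewards, Q, s_tail, a_tail, gamma, bootstrap=True):
    """rewards: the stored r_tau, ..., r_{tau+n-1} (fewer near the end of the episode).
    If bootstrap is True (i.e. tau + n < T), add gamma^n * Q(s_{tau+n}, a_{tau+n})."""
    R = sum(gamma ** j * r for j, r in enumerate(rewards))
    if bootstrap:
        R += gamma ** len(rewards) * Q[s_tail][a_tail]
    return R

# then the usual update for the state-action pair from n steps ago:
# Q[s_tau][a_tau] += alpha * (R - Q[s_tau][a_tau])
```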
Monte-carlo tree search
- Value iteration
- off-line planning visits every state in state space → can’t scale with big state space
- MCTS is online and deals better with a big state space
- on-line planning: action selection is guided during the solving process → doesn't need to visit every state
- planning and execution at the same time during learning (usually through simulations)
- \( Q(s,a) \) is approximated by averaging the returns of trajectories from \( s \) obtained by repeated (random) simulations
- choose \( \arg\max_a \) given current \( s \)
- features
- builds the tree incrementally
- suitable for problem with big state and action space
- Otherwise, just use value iteration
- terminates after a predefined number of iterations (without finishing the entire search) → solution not necessarily optimal
- bad for sparse reward: no guidance for exploration
- MDPs as ExpectiMax Trees (maximize the expected return)
- \( s \) → (action) → branch node → (transition probability) → \( s' \)
- each node stores
- a set of children branch nodes (use this to determine fully expanded → all actions have been tried)
- pointers to its parent node and parent action; and
- the number of times it has been visited.
- MCTS framework
- Select a node that is not fully expanded (children not explored)
- if the current \( s \) is fully expanded, choose \( a \) with a multi-armed bandit algorithm and sample the next state as the new \( s \)
- Expand: try an untried action \( a \) and get \( s' \) and \( r \)
- Simulate from \( s' \) until a terminal state or some predefined number of steps
- can use some heuristic to guide towards promising states
- record return \( R \); no need to store states along the way
- Backpropagate: the value is propagated back to the root node
- total return \( R= r+\gamma R \)
- \( N=N+1 \)
- update \( Q \)
- Upper confidence tree
- MCTS + UCB (with state now)
- \( \arg\max_a [Q(s,a)+2C\sqrt{2\ln N(s)/N(s,a)}] \)
- \( C \) is exploration constant
- if \( Q\in [0,1] \) and \( C=1/\sqrt{2} \) → Minimax algo
- limit MCTS from re-expanding already visited states
- as simulations repeat, the same state along a path will continue to be expanded
- return \( V(s) \) as the reward for any expanded state
- MCTS with pretrained \( Q \)
- the pretrained \( Q \) guides the MCTS search towards important states
- MCTS improves \( Q \) by getting rewards from states
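A compact single-player UCT sketch under simplifying assumptions: the environment is accessed only through a simulator `step(s, a) -> (s', r)`, stochastic next states are re-sampled on every descent rather than stored as chance nodes, and all names are illustrative:

```python
import math, random

class Node:
    def __init__(self, parent=None, action=None):
        self.parent, self.action = parent, action
        self.children = {}            # action -> child Node
        self.N, self.Q = 0, 0.0       # visit count, mean return

def uct(root_state, actions, step, is_terminal, n_iters=2000, gamma=0.95, C=1.4, depth=50):
    root = Node()
    for _ in range(n_iters):
        node, state, path = root, root_state, []
        # 1. Select / 2. Expand: descend with UCB until an untried action is expanded
        while not is_terminal(state):
            untried = [a for a in actions(state) if a not in node.children]
            if untried:
                a = random.choice(untried)
                node.children[a] = Node(node, a)
            else:
                a = max(node.children, key=lambda b: node.children[b].Q
                        + C * math.sqrt(math.log(node.N + 1) / (node.children[b].N + 1e-9)))
            state, r = step(state, a)
            node = node.children[a]
            path.append((node, r))
            if node.N == 0:
                break                 # stop descending at the newly expanded node
        # 3. Simulate: random rollout from s'; only the return is kept, not the states
        R, discount = 0.0, 1.0
        for _ in range(depth):
            if is_terminal(state):
                break
            state, r = step(state, random.choice(actions(state)))
            R += discount * r
            discount *= gamma
        # 4. Backpropagate: R = r + gamma * R along the path back to the root
        for node, r in reversed(path):
            R = r + gamma * R
            node.N += 1
            node.Q += (R - node.Q) / node.N
        root.N += 1
    return max(root.children, key=lambda a: root.children[a].Q)   # best action at the root
```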
Q-function approximation
- pros
- memory: a Q-table has too many states and actions to store
- with approximation, after updating even one weight, we can get values for all states
- cons
- states that share the same feature values get the same approximated \( Q(s,a) \) (and hence the same action choice) → not necessarily true for the real \( Q(s,a) \)
- linear Q-fn approx \( Q(s,a)=f(s,a)\cdot w \) → features · weights
- both have size → (number of state features) × \( \vert A\vert \)
- update with \( w_i \leftarrow w_i + \alpha \, \delta \, f_i \)
- \( \delta = r+\gamma V(s') - \sum_i w_i \, f_i \) (the TD error against the current approximation)
- loss function \( L = \tfrac{1}{2}(Q_{\text{real}}- f\cdot w)^2 \) ← convex
- taking \( \nabla_w L \) gives the update rule above
- guarantee to converge to a global optima (convex loss function)
- won’t produce optimal policy if the underlying problem is non-linear
- domain knowledge required for manually feature engineering
- deep Q-fn \( Q(s,a;\theta) \)
- states can be non-structured (or less structured), rather than using a factored state representation. → images, videos, or unstructured text
- \( \theta \leftarrow \theta + \alpha \, \delta \, \nabla_\theta Q(s,a;\theta) \)
- data-expensive: needs to learn the features as well
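A sketch of the linear Q-function approximation update derived from the gradient of the squared TD error, assuming a user-supplied feature function `features(s, a)` that returns a NumPy vector (names are illustrative):

```python
import numpy as np

def linear_q(w, features, s, a):
    """Q(s, a) = f(s, a) . w for a linear approximation."""
    return features(s, a) @ w

def linear_q_update(w, features, s, a, r, s_next, actions, alpha=0.01, gamma=0.99):
    """Q-learning-style update: w_i <- w_i + alpha * delta * f_i, with
    delta = r + gamma * max_a' Q(s', a') - Q(s, a)."""
    v_next = max(linear_q(w, features, s_next, a2) for a2 in actions)
    delta = r + gamma * v_next - linear_q(w, features, s, a)
    return w + alpha * delta * features(s, a)
```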
Reward shaping
- goal
- improve speed to convergence (with good design)
- good for sparse reward
- methods
- reward shaping (for sparse reward)
- shaped reward \( F(s,s') \) for the transition from \( s \) to \( s' \)
- the TD error becomes \( r+F(s,s')+\gamma \max_{a'} Q(s',a') -Q(s,a) \)
- potential-based reward shaping
- \( F(s,s')=\gamma \phi(s') - \phi(s) \)
- potential function needs to be carefully designed to really speed up
- guarantees the optimal policy is unchanged (cool proof)
- Q-value initialization
- similar to potential-based reward shaping → \( \phi(s) = \) pre-defined \( V(s) \)
- can simply pick some state-action pairs
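A sketch of a Q-learning update with a potential-based shaped reward, where `phi` stands in for whatever user-designed potential function is used (`Q` is a dict of dicts as before):

```python
def shaped_q_update(Q, s, a, r, s_next, phi, alpha=0.1, gamma=0.99):
    """Q-learning update with potential-based shaping F(s, s') = gamma*phi(s') - phi(s).
    Shaping changes the reward signal but not the optimal policy."""
    F = gamma * phi(s_next) - phi(s)
    td_target = r + F + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (td_target - Q[s][a])
```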
Policy-based methods
- learn policy directly, better when action space is big (cannot do policy extraction)
- policy iteration (model-based) (value-based)
- key: instead of getting optimal \( V \) (then policy), just use \( V \) to improve policy at each iteration
- cons: doesn’t scale well with big state space
- two steps:
- policy evaluation start with a policy to get \( V(s) \)
- \( V^\pi(s)=\sum_{s'} P_a(s'\vert s) [r+\gamma V^\pi(s')] \)
- where \( a \) is simply from \( \pi \)
- (compare to the Bellman equation)
- \( V^\pi(s)=\max_a \sum_{s'} P_a(s'\vert s) [r+\gamma V^\pi(s')] \)
- where \( a \) is the optimal action at that state
- until convergence → \( O(\vert S\vert^3) \) (exact solve of the linear system)
- policy improvement
- for every state, assign to the policy \( \pi(s) \) the action with the highest \( r+\gamma V^\pi(s') \) (an \( \arg\max_a \) over the one-step lookahead)
- \( O(\vert S\vert^2\vert A\vert) \)
- optimal policy is achieved within \( \vert A\vert^{\vert S\vert} \) iterations
- value iteration can take infinite iterations
- but each iteration is more expensive
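A sketch of tabular policy iteration using the same assumed `P[s][a] = [(prob, next_state, reward), ...]` model as the value-iteration sketch above, with iterative policy evaluation rather than an exact linear solve:

```python
def policy_iteration(P, gamma=0.9, eval_threshold=1e-6):
    """P[s][a] = list of (prob, next_state, reward); states are 0..len(P)-1."""
    n_states = len(P)
    policy = {s: next(iter(P[s])) for s in range(n_states)}   # arbitrary initial policy
    V = [0.0] * n_states
    while True:
        # policy evaluation: sweep V^pi(s) = sum_s' P(s'|s, pi(s)) [r + gamma * V^pi(s')]
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < eval_threshold:
                break
        # policy improvement: greedy one-step lookahead, argmax_a of r + gamma * V^pi(s')
        stable = True
        for s in range(n_states):
            best = max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```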
- policy gradient (model-free)
- policy representation needs to be differentiable
- \( J(\theta)=V^{\pi_\theta}(s_0) \)
- update
- \( \nabla J(\theta)=E[\nabla_\theta \ln \pi_\theta (s,a)\, Q(s,a)] \)
- the gradient of the log probability of the action, weighted by the expected return \( Q(s,a) \) of taking that action
- increase log prob of \( (s,a) \) if taking \( a \) at \( s \) is good
- pros
- work well with continuous action and state space or large action space
- compare to policy iteration, no need to do \( \arg\max a \) in policy improvement
- cons
- sample inefficiency (credit assignment)
- even harder to explain → no values to support the chosen policy
- REINFORCE (on-policy)
- generate the whole episode (Monte Carlo)
- update each visited \( (s,a) \) in the episode based on its return
- \( \theta = \theta + \alpha \, \gamma^t R_t \nabla_\theta \ln \pi_\theta(s,a) \)
- (a Monte Carlo update without the counter \( N \) → first-visit and only one episode)
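A sketch of REINFORCE with a tabular softmax policy over a small discrete action set (plain NumPy, as a stand-in for whatever differentiable policy representation is used; `theta` is assumed to be an array of shape `(n_states, n_actions)`):

```python
import numpy as np

def softmax_policy(theta, s):
    """theta[s] holds one preference per action; returns action probabilities."""
    z = np.exp(theta[s] - np.max(theta[s]))
    return z / z.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """episode: list of (state, action, reward).
    Update: theta <- theta + alpha * gamma^t * R_t * grad log pi(a_t | s_t)."""
    T = len(episode)
    returns, R = [0.0] * T, 0.0
    for t in reversed(range(T)):                  # compute returns backward
        R = episode[t][2] + gamma * R
        returns[t] = R
    for t, (s, a, _) in enumerate(episode):
        probs = softmax_policy(theta, s)
        grad_log = -probs                         # d log pi(a|s) / d theta[s, :]
        grad_log[a] += 1.0
        theta[s] += alpha * (gamma ** t) * returns[t] * grad_log
    return theta
```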
Actor-critic method: learn both policy and value
- actor-critic (model-free)
- reduces the sample inefficiency and high variance of Monte Carlo estimation
- actor: learn policy like policy gradient
- critic: provide value for actor to update (NOT for getting policy. i.e. no \( \arg\max a \))
- → still work with continuous action and state space or large action space
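A sketch of a one-step actor-critic update on the same tabular softmax policy, where the critic's TD error replaces the Monte Carlo return as the learning signal for the actor (all names are illustrative):

```python
import numpy as np

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    """theta: per-state action preferences (actor); V: state-value table (critic)."""
    # critic: TD error delta = r + gamma * V(s') - V(s), then move V(s) toward the target
    target = r if done else r + gamma * V[s_next]
    delta = target - V[s]
    V[s] += alpha_critic * delta
    # actor: push up log pi(a|s) in proportion to the critic's TD error (no argmax needed)
    z = np.exp(theta[s] - np.max(theta[s]))
    probs = z / z.sum()
    grad_log = -probs
    grad_log[a] += 1.0
    theta[s] += alpha_actor * delta * grad_log
```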
Modeling and abstraction for MDPs
- state abstraction
- action abstraction
- rewarding sub-goals
- partially-observable states (e.g. play cards)
- ignore unseen states (ignore cards in deck)
- keep track of unseen states (probabilities of cards not in hand in others’ hands or deck)
- heuristic
- state/action pruning
- higher-value actions have priority; drop low-value ones
- not explore states that have no value based on prior knowledge
- exploiting
- balance heuristic and learned values
- shorter episodes
- early stopping with heuristic
- Q learning and SARSA use (and learn) \( Q \) at \( s’ \) as heuristic