MARL algos: MADDPG / QMIX / MAPPO ...

[Background] Multi-agent RL schemes

  • training (T)
    • centralized training (CT): policies are updated based on mutual exchange of information
    • decentralized training (DT): each policy is updated on its own
  • execution (E)
    • centralized execution (CE): one unit outputs the joint actions for all agents
    • decentralized execution (DE): agents determine actions according to individual policies

DTDE

  • setup
    • each agent treats the other agents’ actions as part of the environment
      • -> the environment becomes non-stationary (violating the stationarity assumption of single-agent RL)
    • limited information for each agent
      • learning in an environment with stochastic transitions and rewards
    • each policy update influences the other agents’ policies
      • action shadowing
      • balancing exploration and exploitation
  • independent Q-learning (IQL): each agent learns its own Q-function from its own experience (a minimal sketch follows this list)
  • cons (potentially)
    • scales poorly
    • worse performance compared to training with CT
    • slower learning compared to training with CT (reported for actor-critic methods)
  • variations
    • add experience replay (concurrent experience replay trajectories, CERTs)
    • remove samples that lead to mis-coordination
    • remove samples corresponding to non-stationary transitions, based on the likelihood of the observed returns
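
A minimal sketch of independent Q-learning under the DTDE setup above, assuming a toy multi-agent environment interface (`env.reset()` returns per-agent observations, `env.step(actions)` returns per-agent rewards) and tabular per-agent Q-tables; all names here are illustrative, not taken from any specific paper.

```python
import numpy as np

def iql(env, n_agents, n_actions, episodes=1000,
        alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular independent Q-learning (IQL) sketch: each agent keeps its
    own Q-table and updates it only from its own (obs, action, reward,
    next_obs), treating the other agents as part of the environment."""
    Q = [dict() for _ in range(n_agents)]  # Q[i][(obs, action)] -> value

    def q(i, obs, a):
        return Q[i].get((obs, a), 0.0)

    def act(i, obs):
        if np.random.rand() < eps:                      # epsilon-greedy exploration
            return np.random.randint(n_actions)
        return int(np.argmax([q(i, obs, a) for a in range(n_actions)]))

    for _ in range(episodes):
        obs = env.reset()            # list of per-agent observations (hashable)
        done = False
        while not done:
            actions = [act(i, obs[i]) for i in range(n_agents)]
            next_obs, rewards, done = env.step(actions)
            for i in range(n_agents):
                # standard Q-learning target from agent i's local view only;
                # the other agents are just (non-stationary) environment dynamics
                best_next = max(q(i, next_obs[i], a) for a in range(n_actions))
                target = rewards[i] + (0.0 if done else gamma * best_next)
                Q[i][(obs[i], actions[i])] = (
                    (1 - alpha) * q(i, obs[i], actions[i]) + alpha * target)
            obs = next_obs
    return Q
```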

CTCE

  • cons
    • scales poorly: the joint action space grows exponentially with the number of agents, which is challenging for RL
    • can produce lazy agents: an agent may have no incentive to improve its policy if exploring could hurt the other agents’ (and thus the team’s) performance
  • factorization:
    • joint action distribution -> individual policies
    • value function -> sum of values

CTDE (the majority of MARL methods)

  • during training
    • agents share functions or communicate
    • condition on all agents’ previous actions -> this bypasses the non-stationarity, because the consequences of every agent’s actions can be learned
  • variations:
    • parameter sharing -> accelerates learning
    • (de)centralized value function -> guides agents with joint information
    • hierarchical methods / master-slave structure
      • -> each agent reports its local observation and outputs an action based on it, together with commands from a central unit
  • value decomposition networks (VDN)
    • the centralized critic is represented as a sum of individual value functions, each conditioned only on individual observations and actions (a minimal sketch follows this list)
    • cons (addressed by QMIX):
      • limited (purely additive) representation of agent interactions
      • does not use extra state information that may be available during training
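
A minimal sketch of the VDN-style factorization, assuming simple feed-forward per-agent Q-networks; layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class VDN(nn.Module):
    """VDN-style factorization sketch: Q_tot is the sum of per-agent
    Q-values, each conditioned only on that agent's observation."""

    def __init__(self, n_agents, obs_dim, n_actions, hidden=64):
        super().__init__()
        # one Q-network per agent (parameter sharing is common in practice,
        # but independent networks keep the sketch explicit)
        self.agent_qs = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, n_actions))
            for _ in range(n_agents)])

    def forward(self, obs, actions):
        # obs: (batch, n_agents, obs_dim); actions: (batch, n_agents), int64
        chosen = []
        for i, net in enumerate(self.agent_qs):
            q_i = net(obs[:, i])                              # (batch, n_actions)
            chosen.append(q_i.gather(1, actions[:, i:i+1]))   # Q_i(o_i, a_i)
        # centralized value = sum of individual values
        return torch.cat(chosen, dim=1).sum(dim=1, keepdim=True)  # Q_tot
```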

MADDPG

  • goal (general-purpose multi-agent learning)
    • CTDE
    • no assumptions on the environment dynamics or the communication method between agents
  • key:
    • actor-critic methods suited to multi-agent scenarios requiring cooperative, competitive, or mixed interaction
      • (a plain Q-function generally cannot contain different information at training and test time, whereas an actor-critic’s centralized critic is only needed during training)
    • centralized individual critic Q_i for agent i
      • input: state x, actions taken by all agents [a]
        • -> the environment is stationary even as the policies change
      • output: Q value for agent i
        • -> reward functions can differ across agents
    • deterministic policy
      • because the policy gradient has high variance, especially in multi-agent settings
      • update (a minimal sketch follows this list):
        • sample x, [a], x’ from the replay buffer
        • minimize (Q_i(x, [a]) - y)^2, where y = r_i + gamma * Q_i(x’, [a’])
          • [a’] comes from the (target) deterministic policies mu
          • -> requires the other agents’ policies
  • (approximate other agents’ policies)
    • if other agents’ policies are not available
    • learn them by maximizing the log probability of the other agents’ observed actions, with an entropy regularizer to encourage exploration
  • robust policies to others’ changes
    • challenges: non-stationary in MA setting
      • e.g. in competitive settings, agents can derive a strong policy by overfitting to the behavior of their competitors. Such policies are undesirable as they are brittle and may fail when the competitors alter strategies.
    • train a collection of K different sub-policies
      • At each episode, we randomly select one particular sub-policy for each agent to execute
      • maintain separate replay buffers for each subpolicy
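
A minimal sketch of the centralized critic update described above, assuming per-agent critics Q_i(x, a_1..a_N), target critics, and deterministic target policies; the batch layout and all names are illustrative, and the target policies are fed the global state here only for brevity (in MADDPG each policy conditions on its own observation).

```python
import torch
import torch.nn.functional as F

def maddpg_critic_update(i, batch, critics, target_critics,
                         target_policies, optims, gamma=0.95):
    """One gradient step on agent i's centralized critic Q_i(x, a_1..a_N).

    batch: tensors sampled from the replay buffer:
      'x' (B, state_dim), 'actions' (B, N * act_dim), 'rewards' (B, N),
      'x_next' (B, state_dim), 'done' (B, 1).
    critics / target_critics: per-agent modules mapping (x, joint action) -> Q.
    target_policies: per-agent deterministic target policies mu'.
    """
    x, a = batch['x'], batch['actions']
    x_next, r, done = batch['x_next'], batch['rewards'], batch['done']

    with torch.no_grad():
        # next joint action from every agent's deterministic target policy
        a_next = torch.cat([mu(x_next) for mu in target_policies], dim=-1)
        # y = r_i + gamma * Q_i'(x', a_1', ..., a_N')
        y = r[:, i:i+1] + gamma * (1 - done) * target_critics[i](x_next, a_next)

    # minimize (Q_i(x, a_1, ..., a_N) - y)^2
    loss = F.mse_loss(critics[i](x, a), y)
    optims[i].zero_grad()
    loss.backward()
    optims[i].step()
    return loss.item()
```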

MADDPG v.s. COMA

COMA is also an actor-critic method, but

  • COMA learns a single centralized critic for all agents, whereas MADDPG learns a centralized critic per agent, allowing agents with differing reward functions, including competitive scenarios
  • COMA combines recurrent policies with feed-forward critics, whereas MADDPG uses feed-forward policies
  • COMA learns discrete policies, whereas MADDPG learns continuous policies

QMIX

  • [CTDE]
  • features
    • factored representation, but can represent more complex centralized action-value functions than VDN (which simply sums per-agent Qs)
      • scales well with the number of agents
    • good with heterogeneous agents: my guess is that there is less action shadowing because Qtot is not a linear sum of the per-agent Qs
  • key
    • non-linear combination of per-agent values, using a specially structured network to estimate the joint critic
    • constraint: a global argmax performed on Qtot yields the same result as a set of individual argmax operations performed on each Qa.
      • enforced by making the joint action-value monotonic in the per-agent values
        • allows decentralized policies to be easily extracted (each agent takes the argmax of its own Qa), which also makes the argmax over Qtot tractable in off-policy learning
        • guarantees consistency between the centralized and decentralized policies
  • method
    • The mixing network is a feed-forward neural network
      • input: the agent network outputs
      • output: values of Qtot by mixing them monotonically
        • the weights (but not the biases) of the mixing network are restricted to be non-negative -> these weights come from hypernetworks*
        • This allows the mixing network to approximate any monotonic function arbitrarily closely (Dugas et al., 2009).
    • *hypernetworks
      • input: state (with extra info)
      • output:
        • The non-negative weights of the mixing network
          • Each hypernetwork consists of a single linear layer, followed by an absolute activation function, to ensure that the mixing network weights are non-negative.
        • biases
    • goal of this separation:
      • Qtot is allowed to depend on the extra state information in non-monotonic ways
      • if we passed the state directly into the mixing network alongside the per-agent Qs, it would be overly constraining, since the state would also be forced through the monotonic, non-negative-weight layers
      • instead, hypernetworks make it possible to condition the weights of the monotonic network on s in an arbitrary way and construct Qtot flexibly (a minimal sketch of the mixing network follows this list)
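
A minimal sketch of the QMIX mixing network and its hypernetworks as described above: per-agent Q-values are mixed monotonically, with non-negative mixing weights generated from the global state via an absolute activation. Layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QMixer(nn.Module):
    """QMIX-style monotonic mixing sketch: Q_tot(s, q_1..q_N) with
    mixing weights produced by state-conditioned hypernetworks and
    constrained to be non-negative (via abs)."""

    def __init__(self, n_agents, state_dim, embed=32):
        super().__init__()
        self.n_agents, self.embed = n_agents, embed
        # hypernetworks: single linear layers mapping the state to the
        # mixing-network weights (abs applied in forward for non-negativity)
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed)
        self.hyper_w2 = nn.Linear(state_dim, embed)
        # biases are not constrained to be non-negative
        self.hyper_b1 = nn.Linear(state_dim, embed)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed), nn.ReLU(),
                                      nn.Linear(embed, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed)
        b1 = self.hyper_b1(state).view(b, 1, self.embed)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        # monotonic mixing: non-negative weights => dQ_tot / dQ_i >= 0
        hidden = F.elu(torch.bmm(agent_qs.view(b, 1, -1), w1) + b1)
        q_tot = torch.bmm(hidden, w2) + b2
        return q_tot.view(b, 1)
```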

MAPPO

  • [CTDE] actor-critic method with an on-policy algorithm (PPO)
    • argument: challenges the common belief that off-policy algorithms are much more sample-efficient than on-policy methods
    • -> achieves sample efficiency similar to popular off-policy algorithms in cooperative scenarios
  • key modifications
    • limit data reuse across epochs and avoid mini-batching within an epoch
      • reuse is harmful due to the non-stationarity in multi-agent settings
      • reuse for at most 15 training epochs on easy tasks, and 10 or 5 epochs on more difficult tasks
    • PPO update clipping: important for controlling the non-stationarity (a minimal sketch of the clipped objective follows this list)
    • value normalization: normalize the value-function regression targets during value learning (using running estimates of the target statistics)
    • value function input: [CTDE] with global information
    • death masking: use zero vectors as the value-function inputs for dead or inactive agents
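
A minimal sketch of the per-agent PPO clipped surrogate with a centralized value function over the global state, as described above; the batch layout, advantage estimates, and all names are assumptions for illustration.

```python
import torch

def mappo_loss(policy, value_fn, batch, clip_eps=0.2, value_coef=0.5):
    """PPO clipped surrogate for one agent plus a centralized value loss.

    batch is assumed to hold flattened per-agent samples:
      'obs'      (B, obs_dim)    local observations (actor input)
      'state'    (B, state_dim)  global state (critic input)
      'actions'  (B,)            actions taken
      'logp_old' (B,)            log-probs under the behavior policy
      'adv'      (B,)            advantage estimates
      'returns'  (B,)            value-function regression targets
    """
    dist = policy(batch['obs'])          # assumed to return a torch distribution
    logp = dist.log_prob(batch['actions'])
    ratio = torch.exp(logp - batch['logp_old'])

    # clipped surrogate limits how far the policy moves per update,
    # which also limits the non-stationarity seen by the other agents
    adv = batch['adv']
    surr = torch.min(ratio * adv,
                     torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    policy_loss = -surr.mean()

    # centralized critic: conditions on the global state
    value_loss = (value_fn(batch['state']).squeeze(-1)
                  - batch['returns']).pow(2).mean()
    return policy_loss + value_coef * value_loss
```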

[1] MADDPG paper Lowe, Ryan, et al. "Multi-agent actor-critic for mixed cooperative-competitive environments." Advances in neural information processing systems 30 (2017).

[2] COMA paper Foerster, Jakob, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. "Counterfactual multi-agent policy gradients." In Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1. 2018.

[3] QMIX paper Rashid, Tabish, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. "Monotonic value function factorisation for deep multi-agent reinforcement learning." The Journal of Machine Learning Research 21, no. 1 (2020): 7234-7284.

[4] MAPPO paper Yu, Chao, Akash Velu, Eugene Vinitsky, Yu Wang, Alexandre Bayen, and Yi Wu. "The surprising effectiveness of PPO in cooperative, multi-agent games." arXiv preprint arXiv:2103.01955 (2021).

[5] hypernetwork paper* Ha, David, Andrew Dai, and Quoc V. Le. "Hypernetworks." arXiv preprint arXiv:1609.09106 (2016).

*hypernetwork

  • a hypernetwork is a network that generates the weights of another network (a minimal sketch follows)
  • it is trained end-to-end with backpropagation, and is thus usually faster than evolutionary approaches to weight generation
  • useful for CNNs and RNNs as a (relaxed) form of weight-sharing across layers
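
A minimal sketch of a hypernetwork that generates the weights of a target linear layer from a conditioning embedding; sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    """A linear layer whose weight and bias are generated by a
    hypernetwork from a conditioning embedding z, instead of being
    stored as free parameters; everything trains with backpropagation."""

    def __init__(self, z_dim, in_dim, out_dim):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # the hypernetwork: maps z to all parameters of the target layer
        self.weight_gen = nn.Linear(z_dim, out_dim * in_dim)
        self.bias_gen = nn.Linear(z_dim, out_dim)

    def forward(self, x, z):
        # x: (batch, in_dim); z: (batch, z_dim)
        w = self.weight_gen(z).view(-1, self.out_dim, self.in_dim)
        b = self.bias_gen(z)
        # per-sample generated weights: y = W(z) x + b(z)
        return torch.bmm(w, x.unsqueeze(-1)).squeeze(-1) + b

# usage: layer = HyperLinear(z_dim=8, in_dim=16, out_dim=4)
#        y = layer(torch.randn(32, 16), torch.randn(32, 8))
```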