MARL algos: MADDPG / QMIX / MAPPO ...

[Background] Multi-agent RL schemes

  • training (T)
    • centralized training (CT): policies are updated based on mutual exchange of information
    • decentralized training (DT): each policy is updated on its own
  • execution (E)
    • centralized execution (CE): one unit outputs the joint actions for all agents
    • decentralized execution (DE): agents determine actions according to individual policies

DTDE

  • setup
    • each agent treats the other agents’ actions as part of the environment
      • -> the environment becomes non-stationary (violating the stationarity assumption of single-agent RL)
    • limited information for each agent
      • learning in an environment with stochastic transitions and rewards
    • each policy update influences the other agents’ policies
      • action shadowing
      • balancing exploration and exploitation
  • independent Q-learning (IQL): each agent learns its own Q-function from its own experience (a minimal sketch follows this list)
  • cons (potentially)
    • scales poorly
    • worse performance compared to training with CT
    • slower learning compared to training with CT (reported for actor-critic methods)
  • variations
    • add experience replay (concurrent experience replay trajectories, CERTs)
    • remove samples that lead to mis-coordination
    • remove samples corresponding to non-stationary transitions, based on the likelihood of the observed returns
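
A minimal sketch of independent Q-learning under the DTDE setup above, assuming a toy multi-agent environment interface (`env.reset()` returns per-agent observations, `env.step(actions)` returns per-agent rewards) and tabular per-agent Q-tables; all names here are illustrative, not taken from any specific paper.

```python
import numpy as np

def iql(env, n_agents, n_actions, episodes=1000,
        alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular independent Q-learning (IQL) sketch: each agent keeps its
    own Q-table and updates it only from its own (obs, action, reward,
    next_obs), treating the other agents as part of the environment."""
    Q = [dict() for _ in range(n_agents)]  # Q[i][(obs, action)] -> value

    def q(i, obs, a):
        return Q[i].get((obs, a), 0.0)

    def act(i, obs):
        if np.random.rand() < eps:                      # epsilon-greedy exploration
            return np.random.randint(n_actions)
        return int(np.argmax([q(i, obs, a) for a in range(n_actions)]))

    for _ in range(episodes):
        obs = env.reset()            # list of per-agent observations (hashable)
        done = False
        while not done:
            actions = [act(i, obs[i]) for i in range(n_agents)]
            next_obs, rewards, done = env.step(actions)
            for i in range(n_agents):
                # standard Q-learning target from agent i's local view only;
                # the other agents are just (non-stationary) environment dynamics
                best_next = max(q(i, next_obs[i], a) for a in range(n_actions))
                target = rewards[i] + (0.0 if done else gamma * best_next)
                Q[i][(obs[i], actions[i])] = (
                    (1 - alpha) * q(i, obs[i], actions[i]) + alpha * target)
            obs = next_obs
    return Q
```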

CTCE

  • cons
    • scales poorly: the joint action space grows exponentially with the number of agents, which is challenging for RL
    • can produce lazy agents: an agent may have no incentive to improve its policy if exploring could hurt the other agents’ (and thus the team’s) performance
  • factorization:
    • joint action distribution -> individual policies
    • value function -> sum of values

CTDE (the majority of MARL methods)

  • during training
    • agents share functions or communicate
    • condition on all agents’ previous actions -> this bypasses the non-stationarity, because the consequences of every agent’s actions can be learned
  • variations:
    • parameter sharing -> accelerates learning
    • (de)centralized value function -> guides agents with joint information
    • hierarchical methods / master-slave structure
      • -> each agent reports its local observation and outputs an action based on it, together with commands from a central unit
  • value decomposition networks (VDN)
    • the centralized critic is represented as a sum of individual value functions, each conditioned only on individual observations and actions (a minimal sketch follows this list)
    • cons (addressed by QMIX):
      • limited (purely additive) representation of agent interactions
      • does not use extra state information that may be available during training
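
A minimal sketch of the VDN-style factorization, assuming simple feed-forward per-agent Q-networks; layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class VDN(nn.Module):
    """VDN-style factorization sketch: Q_tot is the sum of per-agent
    Q-values, each conditioned only on that agent's observation."""

    def __init__(self, n_agents, obs_dim, n_actions, hidden=64):
        super().__init__()
        # one Q-network per agent (parameter sharing is common in practice,
        # but independent networks keep the sketch explicit)
        self.agent_qs = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, n_actions))
            for _ in range(n_agents)])

    def forward(self, obs, actions):
        # obs: (batch, n_agents, obs_dim); actions: (batch, n_agents), int64
        chosen = []
        for i, net in enumerate(self.agent_qs):
            q_i = net(obs[:, i])                              # (batch, n_actions)
            chosen.append(q_i.gather(1, actions[:, i:i+1]))   # Q_i(o_i, a_i)
        # centralized value = sum of individual values
        return torch.cat(chosen, dim=1).sum(dim=1, keepdim=True)  # Q_tot
```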

MADDPG

  • goal (general-purpose multi-agent learning)
    • CTDE
    • no assumptions on the environment dynamics or the communication method between agents
  • key:
    • actor-critic methods suited to multi-agent scenarios requiring cooperative, competitive, or mixed interaction
      • (a plain Q-function generally cannot contain different information at training and test time, whereas an actor-critic’s centralized critic is only needed during training)
    • centralized individual critic Q_i for agent i
      • input: state x, actions taken by all agents [a]
        • -> the environment is stationary even as the policies change
      • output: Q value for agent i
        • -> reward functions can differ across agents
    • deterministic policy
      • because the policy gradient has high variance, especially in multi-agent settings
      • update (a minimal sketch follows this list):
        • sample x, [a], x’ from the replay buffer
        • minimize (Q_i(x, [a]) - y)^2, where y = r_i + gamma * Q_i(x’, [a’])
          • [a’] comes from the (target) deterministic policies mu
          • -> requires the other agents’ policies
  • (approximate other agents’ policies)
    • if other agents’ policies are not available
    • learn them by maximizing the log probability of the other agents’ observed actions, with an entropy regularizer to encourage exploration
  • robust policies to others’ changes
    • challenges: non-stationary in MA setting
      • e.g. in competitive settings, agents can derive a strong policy by overfitting to the behavior of their competitors. Such policies are undesirable as they are brittle and may fail when the competitors alter strategies.
    • train a collection of K different sub-policies
      • At each episode, we randomly select one particular sub-policy for each agent to execute
      • maintain separate replay buffers for each subpolicy
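
A minimal sketch of the centralized critic update described above, assuming per-agent critics Q_i(x, a_1..a_N), target critics, and deterministic target policies; the batch layout and all names are illustrative, and the target policies are fed the global state here only for brevity (in MADDPG each policy conditions on its own observation).

```python
import torch
import torch.nn.functional as F

def maddpg_critic_update(i, batch, critics, target_critics,
                         target_policies, optims, gamma=0.95):
    """One gradient step on agent i's centralized critic Q_i(x, a_1..a_N).

    batch: tensors sampled from the replay buffer:
      'x' (B, state_dim), 'actions' (B, N * act_dim), 'rewards' (B, N),
      'x_next' (B, state_dim), 'done' (B, 1).
    critics / target_critics: per-agent modules mapping (x, joint action) -> Q.
    target_policies: per-agent deterministic target policies mu'.
    """
    x, a = batch['x'], batch['actions']
    x_next, r, done = batch['x_next'], batch['rewards'], batch['done']

    with torch.no_grad():
        # next joint action from every agent's deterministic target policy
        a_next = torch.cat([mu(x_next) for mu in target_policies], dim=-1)
        # y = r_i + gamma * Q_i'(x', a_1', ..., a_N')
        y = r[:, i:i+1] + gamma * (1 - done) * target_critics[i](x_next, a_next)

    # minimize (Q_i(x, a_1, ..., a_N) - y)^2
    loss = F.mse_loss(critics[i](x, a), y)
    optims[i].zero_grad()
    loss.backward()
    optims[i].step()
    return loss.item()
```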

MADDPG v.s. COMA

COMA is also an actor-critic method, but

  • COMA learns a single centralized critic for all agents, whereas MADDPG learns a centralized critic per agent, allowing agents with differing reward functions, including competitive scenarios
  • COMA combines recurrent policies with feed-forward critics, whereas MADDPG uses feed-forward policies
  • COMA learns discrete policies, whereas MADDPG learns continuous policies

QMIX

  • [CTDE]
  • features
    • factored representation, but can represent more complex centralized action-value functions than VDN (which simply sums per-agent Qs)
      • scales well with the number of agents
    • good with heterogeneous agents: my guess is that there is less action shadowing because Qtot is not a linear sum of the per-agent Qs
  • key
    • non-linear combination of per-agent values, using a specially structured network to estimate the joint critic
    • constraint: a global argmax performed on Qtot yields the same result as a set of individual argmax operations performed on each Qa.
      • enforced by making the joint action-value monotonic in the per-agent values
        • allows decentralized policies to be easily extracted (each agent takes the argmax of its own Qa), which also makes the argmax over Qtot tractable in off-policy learning
        • guarantees consistency between the centralized and decentralized policies
  • method
    • The mixing network is a feed-forward neural network
      • input: the agent network outputs
      • output: values of Qtot by mixing them monotonically
        • the weights (but not the biases) of the mixing network are restricted to be non-negative -> these weights come from hypernetworks*
        • This allows the mixing network to approximate any monotonic function arbitrarily closely (Dugas et al., 2009).
    • *hypernetworks
      • input: state (with extra info)
      • output:
        • The non-negative weights of the mixing network
          • Each hypernetwork consists of a single linear layer, followed by an absolute activation function, to ensure that the mixing network weights are non-negative.
        • biases
    • goal of this separation:
      • Qtot is allowed to depend on the extra state information in non-monotonic ways
      • if we passed the state directly into the mixing network alongside the per-agent Qs, it would be overly constraining, since the state would also be forced through the monotonic, non-negative-weight layers
      • instead, hypernetworks make it possible to condition the weights of the monotonic network on s in an arbitrary way and construct Qtot flexibly (a minimal sketch of the mixing network follows this list)
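
A minimal sketch of the QMIX mixing network and its hypernetworks as described above: per-agent Q-values are mixed monotonically, with non-negative mixing weights generated from the global state via an absolute activation. Layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QMixer(nn.Module):
    """QMIX-style monotonic mixing sketch: Q_tot(s, q_1..q_N) with
    mixing weights produced by state-conditioned hypernetworks and
    constrained to be non-negative (via abs)."""

    def __init__(self, n_agents, state_dim, embed=32):
        super().__init__()
        self.n_agents, self.embed = n_agents, embed
        # hypernetworks: single linear layers mapping the state to the
        # mixing-network weights (abs applied in forward for non-negativity)
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed)
        self.hyper_w2 = nn.Linear(state_dim, embed)
        # biases are not constrained to be non-negative
        self.hyper_b1 = nn.Linear(state_dim, embed)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed), nn.ReLU(),
                                      nn.Linear(embed, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed)
        b1 = self.hyper_b1(state).view(b, 1, self.embed)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        # monotonic mixing: non-negative weights => dQ_tot / dQ_i >= 0
        hidden = F.elu(torch.bmm(agent_qs.view(b, 1, -1), w1) + b1)
        q_tot = torch.bmm(hidden, w2) + b2
        return q_tot.view(b, 1)
```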

MAPPO

  • [CTDE] actor-critic method with an on-policy algorithm (PPO)
    • argument: challenges the common belief that off-policy algorithms are much more sample-efficient than on-policy methods
    • -> achieves sample efficiency similar to popular off-policy algorithms in cooperative scenarios
  • key modifications
    • limit data reuse across epochs and avoid mini-batching within an epoch
      • reuse is harmful due to the non-stationarity in multi-agent settings
      • reuse for at most 15 training epochs on easy tasks, and 10 or 5 epochs on more difficult tasks
    • PPO update clipping: important for controlling the non-stationarity (a minimal sketch of the clipped objective follows this list)
    • value normalization: normalize the value-function regression targets during value learning (using running estimates of the target statistics)
    • value function input: [CTDE] with global information
    • death masking: use zero vectors as the value-function inputs for dead or inactive agents
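
A minimal sketch of the per-agent PPO clipped surrogate with a centralized value function over the global state, as described above; the batch layout, advantage estimates, and all names are assumptions for illustration.

```python
import torch

def mappo_loss(policy, value_fn, batch, clip_eps=0.2, value_coef=0.5):
    """PPO clipped surrogate for one agent plus a centralized value loss.

    batch is assumed to hold flattened per-agent samples:
      'obs'      (B, obs_dim)    local observations (actor input)
      'state'    (B, state_dim)  global state (critic input)
      'actions'  (B,)            actions taken
      'logp_old' (B,)            log-probs under the behavior policy
      'adv'      (B,)            advantage estimates
      'returns'  (B,)            value-function regression targets
    """
    dist = policy(batch['obs'])          # assumed to return a torch distribution
    logp = dist.log_prob(batch['actions'])
    ratio = torch.exp(logp - batch['logp_old'])

    # clipped surrogate limits how far the policy moves per update,
    # which also limits the non-stationarity seen by the other agents
    adv = batch['adv']
    surr = torch.min(ratio * adv,
                     torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    policy_loss = -surr.mean()

    # centralized critic: conditions on the global state
    value_loss = (value_fn(batch['state']).squeeze(-1)
                  - batch['returns']).pow(2).mean()
    return policy_loss + value_coef * value_loss
```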

[1] MADDPG paper Lowe, Ryan, et al. "Multi-agent actor-critic for mixed cooperative-competitive environments." Advances in neural information processing systems 30 (2017).

[2] COMA paper Foerster, Jakob, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. "Counterfactual multi-agent policy gradients." In Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1. 2018.

[3] QMIX paper Rashid, Tabish, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. "Monotonic value function factorisation for deep multi-agent reinforcement learning." The Journal of Machine Learning Research 21, no. 1 (2020): 7234-7284.

[4] MAPPO paper Yu, Chao, Akash Velu, Eugene Vinitsky, Yu Wang, Alexandre Bayen, and Yi Wu. "The surprising effectiveness of PPO in cooperative, multi-agent games." arXiv preprint arXiv:2103.01955 (2021).

[5] hypernetwork paper* Ha, David, Andrew Dai, and Quoc V. Le. "Hypernetworks." arXiv preprint arXiv:1609.09106 (2016).

*hypernetwork

  • a hypernetwork is a network that generates the weights of another network (a minimal sketch follows)
  • it is trained end-to-end with backpropagation, and is thus usually faster than evolutionary approaches to weight generation
  • useful for CNNs and RNNs as a (relaxed) form of weight-sharing across layers
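
A minimal sketch of a hypernetwork that generates the weights of a target linear layer from a conditioning embedding; sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    """A linear layer whose weight and bias are generated by a
    hypernetwork from a conditioning embedding z, instead of being
    stored as free parameters; everything trains with backpropagation."""

    def __init__(self, z_dim, in_dim, out_dim):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # the hypernetwork: maps z to all parameters of the target layer
        self.weight_gen = nn.Linear(z_dim, out_dim * in_dim)
        self.bias_gen = nn.Linear(z_dim, out_dim)

    def forward(self, x, z):
        # x: (batch, in_dim); z: (batch, z_dim)
        w = self.weight_gen(z).view(-1, self.out_dim, self.in_dim)
        b = self.bias_gen(z)
        # per-sample generated weights: y = W(z) x + b(z)
        return torch.bmm(w, x.unsqueeze(-1)).squeeze(-1) + b

# usage: layer = HyperLinear(z_dim=8, in_dim=16, out_dim=4)
#        y = layer(torch.randn(32, 16), torch.randn(32, 8))
```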