Multi-agent game

  • the effects of our actions and the rewards we receive depend on the actions of the other agents
  • types
    • normal form games
      • non-sequential: all agents move simultaneously (one move each), and each tries to maximise a payoff that is affected by the others' choices
    • extensive form games
      • sequential / deterministic actions / turn-based
    • stochastic games (multi-agent MDP)
      • each agent’s objective is to maximise its own expected reward considering the possible actions of all other agents.

Normal form games

  • defn \( G= (N,A,u) \)
    • \( N \): the number of agents
    • \( A=A_1 \times A_2 \times … \times A_N \): the set of action profiles
    • \( u: A \rightarrow \mathbb{R}^N \): the reward function (utility of an action profile)
  • strategies \( s_i \) (strategy profile \( S=S_1 \times S_2 \times … \times S_N \) )
    • pure strategy: always play the same action
    • mixed strategy: choose actions according to a probability distribution
  • strictly dominant strategy
    • for all \( s_{-i} \in S_{-i} \) and every other strategy \( s_i' \), \( u_i(s_i,s_{-i})>u_i(s_i',s_{-i}) \)
    • (weakly) dominant → the same with \( \geq \)
    • strategy-proof: revealing our strategy to opponents gives them no advantage
  • prisoner’s dilemma
    • assumptions: rational / self-interested / perfect information
    • each agent minimising its own penalty results in a jointly worse outcome than cooperating
      • strictly dominant strategy: whatever the opponent does (deny or admit), the best action for me is always to admit (see the sketch below)
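
A minimal sketch of this dominant-strategy check, assuming hypothetical prisoner's dilemma payoffs written as negated prison years (the numbers and names are illustrative, not from the notes above):

```python
# Check whether "admit" strictly dominates "deny" for agent 0 in the
# prisoner's dilemma (payoffs are assumed: negated years in prison).
ACTIONS = ["deny", "admit"]
PAYOFF = {  # (action_0, action_1) -> (u_0, u_1), higher is better
    ("deny",  "deny"):  (-1, -1),
    ("deny",  "admit"): (-10, 0),
    ("admit", "deny"):  (0, -10),
    ("admit", "admit"): (-5, -5),
}

def strictly_dominates(i, s, s_prime):
    """True if strategy s gives agent i strictly more utility than s_prime
    against every possible strategy of the opponent."""
    for opp in ACTIONS:
        prof_s       = (s, opp)       if i == 0 else (opp, s)
        prof_s_prime = (s_prime, opp) if i == 0 else (opp, s_prime)
        if PAYOFF[prof_s][i] <= PAYOFF[prof_s_prime][i]:
            return False
    return True

print(strictly_dominates(0, "admit", "deny"))   # True: admit dominates deny
```
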
  • best response
    • for agent \( i \), the best response to \( s_{-i} \) is a strategy \( s_i^\star \) whose utility is at least as high as that of every other \( s_i' \): \( u_i(s_i^\star,s_{-i}) \geq u_i(s_i',s_{-i}) \)
    • algorithm: loop through agent \( i \)'s strategies, \( O(\vert S_i\vert) \) (sketch below)
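
A minimal sketch of that best-response loop; the function and argument names are illustrative, and `u_i` is assumed to map a full strategy profile to agent i's utility:

```python
def best_response(i, S_i, s_minus_i, u_i):
    """Best response of agent i to the fixed strategies of the others.

    S_i: agent i's strategy set; s_minus_i: tuple of the other agents'
    strategies with agent i's slot left out; runs in O(|S_i|)."""
    def full_profile(s_i):
        return s_minus_i[:i] + (s_i,) + s_minus_i[i:]   # re-insert s_i at slot i
    return max(S_i, key=lambda s_i: u_i(full_profile(s_i)))

# e.g. prisoner's dilemma for agent 0 when agent 1 plays "admit":
#   best_response(0, ["deny", "admit"], ("admit",), u_0)  ->  "admit"
```
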
  • nash equilibrium
    • a strategy profile \( S \) in which every agent's strategy is a best response to the others'
    • algorithm:
      • loop over all strategy profiles (nested loops over the \( N \) agents' strategy sets)
      • if every \( s_i \) is a best response for agent \( i \), add the profile to the set of Nash equilibria (see the enumeration sketch after this item)
    • whether pure-strategy equilibria exist depends on the game
      • prisoner’s dilemma has one pure-strategy Nash equilibrium
      • the chicken game has two pure-strategy Nash equilibria
      • the matching pennies game has no pure-strategy Nash equilibrium
        • one player (A) wins with two heads or two tails
        • the other player (B) wins with one head and one tail
        • best strategy → play head or tail uniformly at random
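
A brute-force sketch of the enumeration described above, reusing the hypothetical prisoner's dilemma payoffs (negated prison years): a profile is kept only if no agent gains by deviating unilaterally.

```python
from itertools import product

ACTIONS = ["deny", "admit"]
PAYOFF = {  # (action_0, action_1) -> (u_0, u_1), higher is better
    ("deny",  "deny"):  (-1, -1),
    ("deny",  "admit"): (-10, 0),
    ("admit", "deny"):  (0, -10),
    ("admit", "admit"): (-5, -5),
}

def is_best_response(i, profile):
    """True if agent i cannot improve by deviating unilaterally from profile."""
    u = PAYOFF[profile][i]
    for a in ACTIONS:
        deviated = list(profile)
        deviated[i] = a
        if PAYOFF[tuple(deviated)][i] > u:
            return False
    return True

equilibria = [p for p in product(ACTIONS, repeat=2)
              if all(is_best_response(i, p) for i in range(2))]
print(equilibria)   # [('admit', 'admit')] -> the single pure-strategy equilibrium
```
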
  • mixed strategy
    • my understanding of the key idea: calculate the opponent's expected utilities \( U_{-i}(a_{-i}) \) when we take \( a_1 \) with prob \( p_1 \), \( a_2 \) with prob \( p_2 \), \( a_3 \) with prob \( p_3 \), …
      • we don't want any of their actions to have a higher expected utility than the rest (we want them to be indifferent)
    • expected utility:
      • for agent \( i \) with action \( a_i \), take the expectation over all possible opponent action profiles (say there are \( m \) of them)
      • \( U_i(a_i)=\sum_{k=1}^{m} p_k \, u_i(a_i, a_{-i}^k) \)
        • \( p_k \) is the probability that the other agents \( -i \) play the profile \( a_{-i}^k \)
    • indifference:
      • an agent is indifferent between a set of pure strategies if the expected return of all strategies is the same
    • mixed-strategy Nash equilibria
      • choose mixing probabilities so that every opponent is indifferent between their pure strategies
        • the best mixed strategy leaves the opponent with no single pure strategy that is strictly better than the rest
    • e.g. the matching pennies game described above
      • for player A, we want
        • \( U_B(head)=U_B(tail) \)
      • let \( Y \) and \( 1-Y \) be the probabilities that player A plays head and tail respectively
        • \( Y\times (-1)+(1-Y)\times 1=Y\times 1+(1-Y)\times (-1) \)
        • \( Y=1/2 \) → the best strategy is to play head and tail uniformly at random (see the sketch below)
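
A sketch of solving this indifference condition for a generic 2x2 game (assumed layout: rows are A's actions, columns are B's actions, entries are B's payoffs); applied to matching pennies it recovers \( Y=1/2 \):

```python
def indifference_mix(u_B):
    """Probability Y of A's first action that makes B indifferent in a 2x2 game.

    Solves Y*hh + (1-Y)*th = Y*ht + (1-Y)*tt for Y (assumes the denominator
    is non-zero, i.e. an interior mixed equilibrium exists)."""
    (hh, ht), (th, tt) = u_B
    return (tt - th) / (hh - th - ht + tt)

# matching pennies: B wins (+1) on a mismatch, loses (-1) on a match
u_B = [[-1, +1],    # A plays head: B's payoff for head, tail
       [+1, -1]]    # A plays tail: B's payoff for head, tail
print(indifference_mix(u_B))   # 0.5 -> A should play head/tail uniformly
```
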

Extensive form games (with full information, turn-based)

  • backward induction: model-based
    • key: from bottom to top, figure out the best response each player can take when it’s their turn
    • formulate as tree
      • each layer is a player’s turn
      • for each non-leaf node and its children (a subgame), the player whose turn it is takes the action that leads to their best possible payoff, propagated up from the leaves → the equilibrium of that subgame
    • algorithm: work from the leaves back to the root, finding the equilibria of all subgames (sketch below)
      • → the best response for every agent in the game when it is their turn
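
A small backward-induction sketch on a hand-built game tree; the `Node` layout and the example payoffs are illustrative assumptions:

```python
class Node:
    def __init__(self, player=None, payoff=None, children=None):
        self.player = player            # index of the player to move (internal nodes)
        self.payoff = payoff            # payoff vector, set only at leaves
        self.children = children or {}  # action label -> child Node

def backward_induction(node):
    """Return the payoff vector of the equilibrium of the subgame rooted at node."""
    if node.payoff is not None:          # leaf: payoffs are given
        return node.payoff
    # solve every child subgame first, then let the player to move pick the best one
    outcomes = {a: backward_induction(c) for a, c in node.children.items()}
    best = max(outcomes, key=lambda a: outcomes[a][node.player])
    return outcomes[best]

# tiny 2-player example: player 0 moves first, then player 1
tree = Node(player=0, children={
    "L": Node(player=1, children={"l": Node(payoff=(3, 1)), "r": Node(payoff=(0, 0))}),
    "R": Node(player=1, children={"l": Node(payoff=(2, 2)), "r": Node(payoff=(1, 3))}),
})
print(backward_induction(tree))   # (3, 1): player 1 best-responds in each subtree first
```
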

Stochastic games with multi-agent RL

  • we can model it as single-agent RL
    • e.g. MA Q-learning
      • a Q function for each agent; actions are selected based on these independently updated Q functions
    • independent learning - treat other agents’ actions as part of the env
    • this assumes that all the other agents' policies are stationary (since RL relies on the assumption that the environment is stationary)
      • but the other agents' policies DO change over time (they are learning too), so the learning is likely not to work (see the sketch below)
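
A sketch of independent (multi-agent) Q-learning under these assumptions: each agent keeps its own Q-table, selects actions epsilon-greedily, and treats everyone else as part of the environment; the commented `env` API is hypothetical:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
N_ACTIONS = 4

def make_agent():
    return defaultdict(float)            # Q[(state, action)] -> value

def act(Q, state):
    """Epsilon-greedy action selection from the agent's own Q-table."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: Q[(state, a)])

def update(Q, s, a, r, s_next):
    """Standard Q-learning update; the other agents are just part of the env."""
    best_next = max(Q[(s_next, a2)] for a2 in range(N_ACTIONS))
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

# one training step with a hypothetical multi-agent env:
#   actions = [act(Q_i, s) for Q_i in agents]
#   s_next, rewards = env.step(actions)     # joint action -> per-agent rewards
#   for Q_i, a_i, r_i in zip(agents, actions, rewards):
#       update(Q_i, s, a_i, r_i, s_next)
```
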
  • ways of modelling the other agents (opponents) to improve on this
    • random selection
    • follow a fixed policy
    • self-play: use the same policy for the agent and the opponents
  • mean field Q-learning
    • reduce the complexity of interactions by approximating the joint action with pairwise interactions between each agent and the mean action of its neighbours, rather than modelling all global interactions among the agents (sketch below)
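
A rough sketch of the mean-field idea only, not the full mean field Q-learning algorithm (which uses a Boltzmann policy and an iterated mean-field value): agent i's Q-function conditions on the mean action of its neighbours instead of the full joint action; the discretisation and the greedy next-state value are simplifying assumptions here.

```python
import numpy as np
from collections import defaultdict

N_ACTIONS = 3
ALPHA, GAMMA = 0.1, 0.95

Q = defaultdict(float)   # Q[(state, own_action, mean_action_bucket)] -> value

def mean_action(neighbour_actions):
    """Average of the one-hot encodings of the neighbours' actions."""
    return np.eye(N_ACTIONS)[neighbour_actions].mean(axis=0)

def bucket(mean_a, grid=4):
    """Discretise the mean action so it can index a tabular Q (sketch only)."""
    return tuple(np.round(mean_a * grid).astype(int))

def mf_q_update(s, a_i, neigh_a, r, s_next, next_neigh_a):
    """TD update where the joint action is summarised by the neighbours' mean action."""
    m, m_next = bucket(mean_action(neigh_a)), bucket(mean_action(next_neigh_a))
    v_next = max(Q[(s_next, a, m_next)] for a in range(N_ACTIONS))  # greedy value
    Q[(s, a_i, m)] += ALPHA * (r + GAMMA * v_next - Q[(s, a_i, m)])
```
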