Multi-agent game
- the effects of actions and the rewards we receive are dependent on the actions of the other agents
- types
- normal form games
- non-sequential: all agents move simultaneously, each trying to maximise a payoff that is affected by the other agents' actions
- extensive form games
- sequential / deterministic actions / turn-based
- stochastic games (multi-agent MDP)
- each agent’s objective is to maximise its own expected reward considering the possible actions of all other agents.
- defn \( G= (N,A,u) \)
- \( N \): the number of agents
- \( A=A_1 \times A_2 \times \dots \times A_N \): the set of action profiles (joint actions)
- \( u: A \rightarrow \mathbb{R}^N \): the reward function, i.e. the utility vector of an action profile (see the representation sketch below)
- strategies \( s_i \), forming a strategy profile \( s=(s_1,\dots,s_N) \in S_1 \times S_2 \times \dots \times S_N \)
- pure strategy: same action all the time
- mixed strategy: choose actions according to a probability distribution
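- a minimal sketch of how such a game could be represented in code (the payoff-array layout and the prisoner's-dilemma numbers below are illustrative assumptions, not from these notes):

```python
import numpy as np

# Sketch of a two-player normal-form game G = (N, A, u), assuming the payoffs
# are stored as an array of shape (|A_1|, |A_2|, N). The prisoner's dilemma
# numbers (years in prison, negated) are an illustrative assumption.
prisoners_dilemma = np.array([
    # opponent denies   opponent admits
    [[-1, -1],          [-10, 0]],   # I deny
    [[0, -10],          [-5, -5]],   # I admit
])

def utility(payoffs, action_profile):
    """u: A -> R^N, the utility vector for a joint action (a_1, a_2)."""
    return payoffs[action_profile]

print(utility(prisoners_dilemma, (1, 1)))   # both admit -> [-5 -5]
```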
- strictly dominant strategy
- for all \( s_{-i} \in S_{-i} \) and all \( s_i' \neq s_i \): \( u_i(s_i, s_{-i}) > u_i(s_i', s_{-i}) \)
- (weakly) dominant → the same but with \( \geq \) instead of \( > \)
- strategy-proof: revealing our strategy to our opponents does not give them an advantage
- prisoner’s dilemma
- assumptions: rational / self-interested / perfect information
- each agent minimising its own penalty leads to the worst joint outcome
- strictly dominant strategy: whatever the opponent does (deny or admit), the best action for me is always to admit (see the sketch below)
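- to make the "always admit" argument concrete, a small sketch that checks for a strictly dominant strategy (same assumed payoff layout and numbers as the earlier sketch):

```python
import numpy as np

# Check for a strictly dominant strategy in a two-player game, assuming the
# payoff array layout from the earlier sketch (shape |A_1| x |A_2| x 2).
def strictly_dominant_strategy(payoffs, player=0):
    u = payoffs[..., player]                 # this player's own payoffs
    if player == 1:
        u = u.T                              # rows = own actions, cols = opponent's
    for a in range(u.shape[0]):
        others = [b for b in range(u.shape[0]) if b != a]
        # a strictly dominates every other action against all opponent actions
        if all((u[a] > u[b]).all() for b in others):
            return a
    return None

prisoners_dilemma = np.array([
    [[-1, -1], [-10, 0]],
    [[0, -10], [-5, -5]],
])
print(strictly_dominant_strategy(prisoners_dilemma, player=0))  # -> 1 (admit)
```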
- best response
- for agent \( i \), the best response to \( s_{-i} \) is a strategy \( s_i^\star \) such that \( u_i(s_i^\star, s_{-i}) \geq u_i(s_i', s_{-i}) \) for all other strategies \( s_i' \)
- algorithm: loop through agent \( i \)'s strategies, \( O(\vert S_i\vert) \) (see the sketch below)
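- a sketch of this best-response lookup for a two-player game against a fixed opponent action (payoff layout and numbers assumed as above):

```python
import numpy as np

# Best response of one player to a fixed pure strategy of the opponent:
# loop over the player's own actions, O(|S_i|).
def best_response(payoffs, player, opponent_action):
    if player == 0:
        own_utils = payoffs[:, opponent_action, 0]
    else:
        own_utils = payoffs[opponent_action, :, 1]
    return int(np.argmax(own_utils))         # s_i* maximising u_i(s_i, s_-i)

prisoners_dilemma = np.array([
    [[-1, -1], [-10, 0]],
    [[0, -10], [-5, -5]],
])
print(best_response(prisoners_dilemma, player=0, opponent_action=0))  # -> 1 (admit)
```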
- nash equilibrium
- a strategy profile in which every agent's strategy is a best response to the other agents' strategies
- algorithm:
- loop over all combinations of agents' strategies (N nested loops, one per agent)
- if every \( s_i \) is the best response for agent \( i \), add the strategy profile to the set of Nash equilibria (see the enumeration sketch after the examples below)
- whether (and how many) pure-strategy equilibria exist depends on the game:
- prisoner’s dilemma has one pure-strategy Nash equilibrium
- the chicken game has two pure-strategy Nash equilibria
- the matching pennies game has no pure-strategy equilibria
- one player (A) wins with two heads or two tails
- the other player (B) wins with one head and one tail
- best strategy → randomise between head and tail
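- a sketch that enumerates pure-strategy Nash equilibria by brute force, reproducing the observations above for the prisoner's dilemma and matching pennies (payoff numbers assumed):

```python
import itertools
import numpy as np

# Enumerate pure-strategy Nash equilibria of a two-player game by checking, for
# every joint action, that each player's action is a best response to the other's.
def pure_nash_equilibria(payoffs):
    n1, n2, _ = payoffs.shape
    equilibria = []
    for a1, a2 in itertools.product(range(n1), range(n2)):
        best_1 = payoffs[a1, a2, 0] >= payoffs[:, a2, 0].max()
        best_2 = payoffs[a1, a2, 1] >= payoffs[a1, :, 1].max()
        if best_1 and best_2:
            equilibria.append((a1, a2))
    return equilibria

prisoners_dilemma = np.array([[[-1, -1], [-10, 0]], [[0, -10], [-5, -5]]])
matching_pennies  = np.array([[[1, -1], [-1, 1]], [[-1, 1], [1, -1]]])  # 0 = head, 1 = tail
print(pure_nash_equilibria(prisoners_dilemma))  # -> [(1, 1)]  (both admit)
print(pure_nash_equilibria(matching_pennies))   # -> []        (no pure equilibrium)
```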
- mixed strategy
- my understanding of the key here: compute the opponents' expected utilities \( U_{-i}(a_{-i}) \) when we play \( a_1 \) with probability \( p_1 \), \( a_2 \) with probability \( p_2 \), \( a_3 \) with probability \( p_3 \), …
- we don't want any single one of their actions to have a higher utility than the rest (we make them indifferent)
- expected utility:
- for agent \( i \) with action \( a_i \), take the expectation over all possible opponent action profiles (say there are \( m \) of them); see the sketch below
- \( U_i(a_i)=\sum_{k=1}^{m} p_k \, u_i(a_i, a_{-i}^k) \)
- \( p_k \) is the probability that the other agents \( -i \) play the joint action \( a_{-i}^k \)
- indifference:
- an agent is indifferent between a set of pure strategies if the expected return of all strategies is the same
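- a small sketch of this expected-utility computation against a mixed opponent, using the matching pennies payoffs from the example above (actions 0 = head, 1 = tail; A wins on a match):

```python
import numpy as np

# Expected utility of a pure action against a mixed opponent:
# U_i(a_i) = sum_k p_k * u_i(a_i, a_{-i}^k), for a two-player game.
matching_pennies = np.array([
    [[1, -1], [-1, 1]],
    [[-1, 1], [1, -1]],
])

def expected_utility(payoffs, player, own_action, opponent_probs):
    if player == 0:
        own_utils = payoffs[own_action, :, 0]   # u_A(a_A, a_B^k) for every k
    else:
        own_utils = payoffs[:, own_action, 1]   # u_B(a_A^k, a_B) for every k
    return float(np.dot(opponent_probs, own_utils))

# B's expected utility for playing head when A plays head with probability 0.5:
print(expected_utility(matching_pennies, player=1, own_action=0,
                       opponent_probs=[0.5, 0.5]))   # -> 0.0
```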
- mixed-strategy Nash equilibria
- each agent's mixed strategy makes its opponents indifferent between their pure strategies
- the best mixed strategy leaves the opponent with no single pure strategy that does better than the others
- e.g. the matching pennies game described above
- for player A, we want
- \( U_B(head)=U_B(tail) \)
- let \( Y \) and \( 1-Y \) be the probabilities that player A plays head and tail respectively
- \( Y\times(-1)+(1-Y)\times 1=Y\times 1+(1-Y)\times(-1) \) (B loses when the coins match, wins when they differ)
- \( Y=1/2 \) → A's best strategy is to play head and tail uniformly at random (see the numeric check below)
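- a quick numeric check of the indifference argument (plain Python, no solver):

```python
# Indifference condition for matching pennies, where Y is the probability that
# player A plays head and B wins on a mismatch.
def U_B(b_action, Y):
    if b_action == "head":                 # B loses (-1) when A also plays head
        return Y * (-1) + (1 - Y) * 1
    return Y * 1 + (1 - Y) * (-1)          # B loses (-1) when A also plays tail

for Y in (0.25, 0.5, 0.75):
    print(Y, U_B("head", Y), U_B("tail", Y))
# Only at Y = 0.5 are both expected utilities equal (both 0), so B has no single
# pure strategy that exploits A's mix.
```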
- backward induction: model-based
- key: from bottom to top, figure out the best response each player can take when it’s their turn
- formulate as tree
- each layer is a player’s turn
- for each non-leaf node and its children (a subgame), the player whose turn it is should take the action that leads to the best payoff for them among the values propagated up from the leaves → the equilibrium of that subgame
- algo: from the leaves up to the root, find the equilibria of all subgames
- → the best response for every agent in the game when it is their turn
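- a minimal sketch of backward induction on a small hand-made game tree (the tree structure and payoff numbers are assumptions for illustration):

```python
# Backward induction on an extensive-form game tree. Each internal node records
# whose turn it is and its children; each leaf holds a payoff vector (one entry
# per player).
def backward_induction(node):
    """Return (payoff_vector, action) chosen at this subtree's equilibrium."""
    if "payoffs" in node:                       # leaf: nothing to choose
        return node["payoffs"], None
    player = node["player"]
    best = None
    for action, child in node["children"].items():
        payoffs, _ = backward_induction(child)  # solve the subgame first
        if best is None or payoffs[player] > best[0][player]:
            best = (payoffs, action)
    return best

# Tiny two-player example: player 0 moves first, then player 1.
game = {
    "player": 0,
    "children": {
        "L": {"player": 1, "children": {"l": {"payoffs": (3, 1)},
                                        "r": {"payoffs": (0, 2)}}},
        "R": {"player": 1, "children": {"l": {"payoffs": (2, 1)},
                                        "r": {"payoffs": (1, 0)}}},
    },
}
print(backward_induction(game))   # -> ((2, 1), 'R')
```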
Stochastic games with multi-agent RL
- we can model it as single-agent RL
- e.g. MA Q-learning
- a Q-function for each agent; actions are selected based on these independently updated Q-functions (see the sketch below)
- independent learning - treat other agents’ actions as part of the env
- assuming that all the other agents’ policies are stationary (since RL is based on the assumption that the env is stationary)
- but the other agents' policies DO change over time, so the learning is likely not to work
- ways to model other agents (opponents) for improving this
- random selection
- follow a fixed policy
- self-play: use the same policy for the agent and the opponents
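- a minimal sketch of the independent multi-agent Q-learning update described above (the state encoding, action count and hyperparameters are assumptions):

```python
import random
from collections import defaultdict
import numpy as np

# Independent multi-agent Q-learning: each agent keeps its own Q-table and
# updates it as if the other agents were a stationary part of the environment.
n_agents, n_actions = 2, 3
alpha, gamma, eps = 0.1, 0.95, 0.1
Q = [defaultdict(lambda: np.zeros(n_actions)) for _ in range(n_agents)]

def select_action(i, state):
    """Epsilon-greedy action selection from agent i's own Q-table."""
    if random.random() < eps:
        return random.randrange(n_actions)
    return int(np.argmax(Q[i][state]))

def independent_q_update(state, actions, rewards, next_state):
    """One Q-learning update per agent, ignoring the other agents' policies."""
    for i in range(n_agents):
        td_target = rewards[i] + gamma * Q[i][next_state].max()
        Q[i][state][actions[i]] += alpha * (td_target - Q[i][state][actions[i]])
```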
- mean field Q-learning
- reduce the complexity of the joint action space by approximating each agent's interaction with the others as a pairwise interaction with the mean action of its neighbours, rather than modelling the full joint action of all agents (see the sketch below)
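- a rough tabular sketch of a mean-field-style Q update; the coarse discretisation of the mean action and the greedy bootstrap are simplifications of my own, not part of the original method:

```python
import numpy as np
from collections import defaultdict

# Each agent conditions its Q-values on its own action and on the mean
# (average one-hot) action of the other agents, instead of the full joint action.
n_agents, n_actions = 4, 3
alpha, gamma = 0.1, 0.95

def mean_action_key(other_actions, bins=4):
    """Average one-hot encoding of the other agents' actions, bucketed so it
    can be used as a dictionary key."""
    one_hot = np.eye(n_actions)[other_actions]          # (n_others, n_actions)
    return tuple(np.round(one_hot.mean(axis=0) * bins).astype(int))

# Q[i][(state, mean_action_key)] -> vector of action values for agent i
Q = [defaultdict(lambda: np.zeros(n_actions)) for _ in range(n_agents)]

def mf_q_update(i, s, a_i, other_actions, reward, s_next, next_other_actions):
    k, k_next = mean_action_key(other_actions), mean_action_key(next_other_actions)
    v_next = Q[i][(s_next, k_next)].max()               # greedy value of next state
    td_target = reward + gamma * v_next
    Q[i][(s, k)][a_i] += alpha * (td_target - Q[i][(s, k)][a_i])
```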