Multi-agent game
- the effects of actions and the rewards we receive are dependent on the actions of the other agents
- types
- normal form games
- non-sequential: all agents move simultaneously, each trying to maximise a payoff that is affected by the other agents' actions
- extensive form games
- sequential / deterministic actions / turn-based
- stochastic games (multi-agent MDP)
- each agent’s objective is to maximise its own expected reward considering the possible actions of all other agents.
- defn \( G= (N,A,u) \)
- \( N \): the number of agents
- \( A=A_1 \times A_2 \times \dots \times A_N \): the set of action profiles (joint actions)
- \( u: A \rightarrow \mathbb{R}^N \): the reward function, i.e. the utility vector of an action profile (see the representation sketch below)
- strategies \( s_i \), forming a strategy profile \( s=(s_1,\dots,s_N) \in S_1 \times S_2 \times \dots \times S_N \)
- pure strategy: same action all the time
- mixed strategy: choose actions according to a probability distribution
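- a minimal sketch of how such a game could be represented in code (the payoff-array layout and the prisoner's-dilemma numbers below are illustrative assumptions, not from these notes):

```python
import numpy as np

# Sketch of a two-player normal-form game G = (N, A, u), assuming the payoffs
# are stored as an array of shape (|A_1|, |A_2|, N). The prisoner's dilemma
# numbers (years in prison, negated) are an illustrative assumption.
prisoners_dilemma = np.array([
    # opponent denies   opponent admits
    [[-1, -1],          [-10, 0]],   # I deny
    [[0, -10],          [-5, -5]],   # I admit
])

def utility(payoffs, action_profile):
    """u: A -> R^N, the utility vector for a joint action (a_1, a_2)."""
    return payoffs[action_profile]

print(utility(prisoners_dilemma, (1, 1)))   # both admit -> [-5 -5]
```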
- strictly dominant strategy
- for all \( s_{-i} \in S_{-i} \) and all \( s_i' \neq s_i \): \( u_i(s_i, s_{-i}) > u_i(s_i', s_{-i}) \)
- (weakly) dominant → the same but with \( \geq \) instead of \( > \)
- strategy-proof: revealing our strategy to our opponents does not give them an advantage
- prisoner’s dilemma
- assumptions: rational / self-interested / perfect information
- each agent minimising its own penalty leads to the worst joint outcome
- strictly dominant strategy: whatever the opponent does (deny or admit), the best action for me is always to admit (see the sketch below)
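- to make the "always admit" argument concrete, a small sketch that checks for a strictly dominant strategy (same assumed payoff layout and numbers as the earlier sketch):

```python
import numpy as np

# Check for a strictly dominant strategy in a two-player game, assuming the
# payoff array layout from the earlier sketch (shape |A_1| x |A_2| x 2).
def strictly_dominant_strategy(payoffs, player=0):
    u = payoffs[..., player]                 # this player's own payoffs
    if player == 1:
        u = u.T                              # rows = own actions, cols = opponent's
    for a in range(u.shape[0]):
        others = [b for b in range(u.shape[0]) if b != a]
        # a strictly dominates every other action against all opponent actions
        if all((u[a] > u[b]).all() for b in others):
            return a
    return None

prisoners_dilemma = np.array([
    [[-1, -1], [-10, 0]],
    [[0, -10], [-5, -5]],
])
print(strictly_dominant_strategy(prisoners_dilemma, player=0))  # -> 1 (admit)
```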
- best response
- for agent \( i \), the best response to \( s_{-i} \) is a strategy \( s_i^\star \) such that \( u_i(s_i^\star, s_{-i}) \geq u_i(s_i', s_{-i}) \) for all other strategies \( s_i' \)
- algorithm: loop through agent \( i \)'s strategies, \( O(\vert S_i\vert) \) (see the sketch below)
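- a sketch of this best-response lookup for a two-player game against a fixed opponent action (payoff layout and numbers assumed as above):

```python
import numpy as np

# Best response of one player to a fixed pure strategy of the opponent:
# loop over the player's own actions, O(|S_i|).
def best_response(payoffs, player, opponent_action):
    if player == 0:
        own_utils = payoffs[:, opponent_action, 0]
    else:
        own_utils = payoffs[opponent_action, :, 1]
    return int(np.argmax(own_utils))         # s_i* maximising u_i(s_i, s_-i)

prisoners_dilemma = np.array([
    [[-1, -1], [-10, 0]],
    [[0, -10], [-5, -5]],
])
print(best_response(prisoners_dilemma, player=0, opponent_action=0))  # -> 1 (admit)
```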
- nash equilibrium
- a strategy profile in which every agent's strategy is a best response to the other agents' strategies
- algorithm:
- loop over all combinations of agents' strategies (N nested loops, one per agent)
- if every \( s_i \) is the best response for agent \( i \), add the strategy profile to the set of Nash equilibria (see the enumeration sketch after the examples below)
- whether (and how many) pure-strategy equilibria exist depends on the game:
- prisoner’s dilemma has one pure-strategy Nash equilibrium
- the chicken game has two pure-strategy Nash equilibria
- the matching pennies game has no pure-strategy equilibria
- one player (A) wins with two heads or two tails
- the other player (B) wins with one head and one tail
- best strategy → randomise between head and tail
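- a sketch that enumerates pure-strategy Nash equilibria by brute force, reproducing the observations above for the prisoner's dilemma and matching pennies (payoff numbers assumed):

```python
import itertools
import numpy as np

# Enumerate pure-strategy Nash equilibria of a two-player game by checking, for
# every joint action, that each player's action is a best response to the other's.
def pure_nash_equilibria(payoffs):
    n1, n2, _ = payoffs.shape
    equilibria = []
    for a1, a2 in itertools.product(range(n1), range(n2)):
        best_1 = payoffs[a1, a2, 0] >= payoffs[:, a2, 0].max()
        best_2 = payoffs[a1, a2, 1] >= payoffs[a1, :, 1].max()
        if best_1 and best_2:
            equilibria.append((a1, a2))
    return equilibria

prisoners_dilemma = np.array([[[-1, -1], [-10, 0]], [[0, -10], [-5, -5]]])
matching_pennies  = np.array([[[1, -1], [-1, 1]], [[-1, 1], [1, -1]]])  # 0 = head, 1 = tail
print(pure_nash_equilibria(prisoners_dilemma))  # -> [(1, 1)]  (both admit)
print(pure_nash_equilibria(matching_pennies))   # -> []        (no pure equilibrium)
```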
- mixed strategy
- my understanding of the key here: compute the opponents' expected utilities \( U_{-i}(a_{-i}) \) when we play \( a_1 \) with probability \( p_1 \), \( a_2 \) with probability \( p_2 \), \( a_3 \) with probability \( p_3 \), …
- we don't want any single one of their actions to have a higher utility than the rest (we make them indifferent)
- expected utility:
- for agent \( i \) with action \( a_i \), take the expectation over all possible opponent action profiles (say there are \( m \) of them); see the sketch below
- \( U_i(a_i)=\sum_{k=1}^{m} p_k \, u_i(a_i, a_{-i}^k) \)
- \( p_k \) is the probability that the other agents \( -i \) play the joint action \( a_{-i}^k \)
- indifference:
- an agent is indifferent between a set of pure strategies if the expected return of all strategies is the same
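- a small sketch of this expected-utility computation against a mixed opponent, using the matching pennies payoffs from the example above (actions 0 = head, 1 = tail; A wins on a match):

```python
import numpy as np

# Expected utility of a pure action against a mixed opponent:
# U_i(a_i) = sum_k p_k * u_i(a_i, a_{-i}^k), for a two-player game.
matching_pennies = np.array([
    [[1, -1], [-1, 1]],
    [[-1, 1], [1, -1]],
])

def expected_utility(payoffs, player, own_action, opponent_probs):
    if player == 0:
        own_utils = payoffs[own_action, :, 0]   # u_A(a_A, a_B^k) for every k
    else:
        own_utils = payoffs[:, own_action, 1]   # u_B(a_A^k, a_B) for every k
    return float(np.dot(opponent_probs, own_utils))

# B's expected utility for playing head when A plays head with probability 0.5:
print(expected_utility(matching_pennies, player=1, own_action=0,
                       opponent_probs=[0.5, 0.5]))   # -> 0.0
```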
- mixed-strategy Nash equilibria
- each agent's mixed strategy makes its opponents indifferent between their pure strategies
- the best mixed strategy leaves the opponent with no single pure strategy that does better than the others
- e.g. the matching pennies game described above
- for player A, we want
- \( U_B(head)=U_B(tail) \)
- let \( Y \) and \( 1-Y \) be the probabilities that player A plays head and tail respectively
- \( Y\times(-1)+(1-Y)\times 1=Y\times 1+(1-Y)\times(-1) \) (B loses when the coins match, wins when they differ)
- \( Y=1/2 \) → A's best strategy is to play head and tail uniformly at random (see the numeric check below)
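- a quick numeric check of the indifference argument (plain Python, no solver):

```python
# Indifference condition for matching pennies, where Y is the probability that
# player A plays head and B wins on a mismatch.
def U_B(b_action, Y):
    if b_action == "head":                 # B loses (-1) when A also plays head
        return Y * (-1) + (1 - Y) * 1
    return Y * 1 + (1 - Y) * (-1)          # B loses (-1) when A also plays tail

for Y in (0.25, 0.5, 0.75):
    print(Y, U_B("head", Y), U_B("tail", Y))
# Only at Y = 0.5 are both expected utilities equal (both 0), so B has no single
# pure strategy that exploits A's mix.
```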
- backward induction: model-based
- key: from bottom to top, figure out the best response each player can take when it’s their turn
- formulate as tree
- each layer is a player’s turn
- for each non-leaf node and its children (a subgame), the player whose turn it is should take the action that leads to the best payoff for them among the values propagated up from the leaves → the equilibrium of that subgame
- algo: from the leaves up to the root, find the equilibria of all subgames
- → the best response for every agent in the game when it is their turn
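- a minimal sketch of backward induction on a small hand-made game tree (the tree structure and payoff numbers are assumptions for illustration):

```python
# Backward induction on an extensive-form game tree. Each internal node records
# whose turn it is and its children; each leaf holds a payoff vector (one entry
# per player).
def backward_induction(node):
    """Return (payoff_vector, action) chosen at this subtree's equilibrium."""
    if "payoffs" in node:                       # leaf: nothing to choose
        return node["payoffs"], None
    player = node["player"]
    best = None
    for action, child in node["children"].items():
        payoffs, _ = backward_induction(child)  # solve the subgame first
        if best is None or payoffs[player] > best[0][player]:
            best = (payoffs, action)
    return best

# Tiny two-player example: player 0 moves first, then player 1.
game = {
    "player": 0,
    "children": {
        "L": {"player": 1, "children": {"l": {"payoffs": (3, 1)},
                                        "r": {"payoffs": (0, 2)}}},
        "R": {"player": 1, "children": {"l": {"payoffs": (2, 1)},
                                        "r": {"payoffs": (1, 0)}}},
    },
}
print(backward_induction(game))   # -> ((2, 1), 'R')
```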
Stochastic games with multi-agent RL
- we can model it as single-agent RL
- e.g. MA Q-learning
- a Q-function for each agent; actions are selected based on these independently updated Q-functions (see the sketch below)
- independent learning - treat other agents’ actions as part of the env
- assuming that all the other agents’ policies are stationary (since RL is based on the assumption that the env is stationary)
- but the other agents' policies DO change over time, so the learning is likely not to work
- ways to model other agents (opponents) for improving this
- random selection
- follow a fixed policy
- self-play: use the same policy for the agent and the opponents
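- a minimal sketch of the independent multi-agent Q-learning update described above (the state encoding, action count and hyperparameters are assumptions):

```python
import random
from collections import defaultdict
import numpy as np

# Independent multi-agent Q-learning: each agent keeps its own Q-table and
# updates it as if the other agents were a stationary part of the environment.
n_agents, n_actions = 2, 3
alpha, gamma, eps = 0.1, 0.95, 0.1
Q = [defaultdict(lambda: np.zeros(n_actions)) for _ in range(n_agents)]

def select_action(i, state):
    """Epsilon-greedy action selection from agent i's own Q-table."""
    if random.random() < eps:
        return random.randrange(n_actions)
    return int(np.argmax(Q[i][state]))

def independent_q_update(state, actions, rewards, next_state):
    """One Q-learning update per agent, ignoring the other agents' policies."""
    for i in range(n_agents):
        td_target = rewards[i] + gamma * Q[i][next_state].max()
        Q[i][state][actions[i]] += alpha * (td_target - Q[i][state][actions[i]])
```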
- mean field Q-learning
- reduce the complexity of the joint action space by approximating each agent's interaction with the others as a pairwise interaction with the mean action of its neighbours, rather than modelling the full joint action of all agents (see the sketch below)
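- a rough tabular sketch of a mean-field-style Q update; the coarse discretisation of the mean action and the greedy bootstrap are simplifications of my own, not part of the original method:

```python
import numpy as np
from collections import defaultdict

# Each agent conditions its Q-values on its own action and on the mean
# (average one-hot) action of the other agents, instead of the full joint action.
n_agents, n_actions = 4, 3
alpha, gamma = 0.1, 0.95

def mean_action_key(other_actions, bins=4):
    """Average one-hot encoding of the other agents' actions, bucketed so it
    can be used as a dictionary key."""
    one_hot = np.eye(n_actions)[other_actions]          # (n_others, n_actions)
    return tuple(np.round(one_hot.mean(axis=0) * bins).astype(int))

# Q[i][(state, mean_action_key)] -> vector of action values for agent i
Q = [defaultdict(lambda: np.zeros(n_actions)) for _ in range(n_agents)]

def mf_q_update(i, s, a_i, other_actions, reward, s_next, next_other_actions):
    k, k_next = mean_action_key(other_actions), mean_action_key(next_other_actions)
    v_next = Q[i][(s_next, k_next)].max()               # greedy value of next state
    td_target = reward + gamma * v_next
    Q[i][(s, k)][a_i] += alpha * (td_target - Q[i][(s, k)][a_i])
```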