Role Play: Learning Adaptive Role-Specific Strategies in Multi-Agent Interactions

Weifan Long, Wen Wen, Peng Zhai, Lihua Zhang

Abstract

The zero-shot coordination problem in multi-agent reinforcement learning (MARL), which requires agents to adapt to unseen partners, has attracted increasing attention. Traditional approaches often rely on the Self-Play (SP) framework to generate a diverse pool of policies, which serves to improve the generalization capability of the final agent. However, these frameworks may struggle to capture the full spectrum of potential strategies, especially in real-world scenarios that demand that agents balance cooperation with competition. In such settings, agents need strategies that can adapt to varying and often conflicting goals. Drawing inspiration from Social Value Orientation (SVO), where individuals maintain stable value orientations during interactions with others, we propose a novel framework called Role Play (RP). RP employs role embeddings to transform the challenge of policy diversity into the more tractable problem of role diversity. It trains a common policy conditioned on a role embedding appended to the observation, and employs a role predictor to estimate the joint role embeddings of other agents, helping the learning agent adapt to its assigned role. We theoretically prove that an approximately optimal policy can be achieved by optimizing the expected cumulative reward relative to an approximate role-based policy. Experimental results in both cooperative (Overcooked) and mixed-motive games (Harvest, CleanUp) show that RP consistently outperforms strong baselines when interacting with unseen agents, highlighting its robustness and adaptability in complex environments.

Algorithm Framework
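
The abstract describes two components: a common policy conditioned on a role embedding and a role predictor that estimates the other agents' joint role embeddings. Below is a minimal PyTorch sketch of both; the layer sizes, class names, and single-observation predictor input are our illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RoleConditionedPolicy(nn.Module):
    """Common policy: acts on the local observation concatenated with the
    agent's own role embedding (layer sizes are illustrative)."""
    def __init__(self, obs_dim, role_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + role_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, role_emb):
        # Conditioning on the role turns "many policies" into "one policy, many roles".
        return self.net(torch.cat([obs, role_emb], dim=-1))  # action logits

class RolePredictor(nn.Module):
    """Estimates the joint role embeddings of the other agents; here it reads
    a single observation, though a recurrent encoder over the trajectory
    would be a natural alternative."""
    def __init__(self, obs_dim, role_dim, n_others, hidden=128):
        super().__init__()
        self.n_others, self.role_dim = n_others, role_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_others * role_dim),
        )

    def forward(self, obs):
        out = self.net(obs)
        return out.view(*obs.shape[:-1], self.n_others, self.role_dim)

# Toy forward pass: batch of 4 observations, 8-dim roles, 6 actions, 1 other agent.
obs = torch.randn(4, 32)
role = torch.randn(4, 8)
policy = RoleConditionedPolicy(obs_dim=32, role_dim=8, n_actions=6)
predictor = RolePredictor(obs_dim=32, role_dim=8, n_others=1)
logits = policy(obs, role)      # (4, 6)
pred_roles = predictor(obs)     # (4, 1, 8)
```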

Experimental Results

Overcooked

Overcooked is a two-player cooperative game designed to test the collaboration ability of agents. Agents work together to fulfill soup orders using ingredients such as onions and tomatoes. They can move around and interact with items, for example grabbing ingredients or serving soups, depending on the game state. To complete an order, agents must combine the correct ingredients in a pot, cook them for a specified time, and then plate and serve the soup to earn a reward. Each order has a distinct cooking time and reward.
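
As a toy illustration of the order mechanic just described (ingredient matching, cook time, reward on serving), the sketch below uses made-up recipes, cook times, and reward values, not the benchmark's exact numbers.

```python
from dataclasses import dataclass

@dataclass
class Order:
    ingredients: tuple   # required ingredient multiset, e.g. three onions
    cook_time: int       # steps the pot must cook before serving
    reward: float        # payout for a correctly served soup

def serve_reward(order, pot_contents, cooked_steps):
    """Reward is earned only if the pot holds exactly the order's
    ingredients and has cooked long enough."""
    if sorted(pot_contents) == sorted(order.ingredients) and cooked_steps >= order.cook_time:
        return order.reward
    return 0.0

onion_soup = Order(ingredients=("onion", "onion", "onion"), cook_time=20, reward=20.0)
print(serve_reward(onion_soup, ["onion"] * 3, cooked_steps=20))                 # 20.0
print(serve_reward(onion_soup, ["onion", "tomato", "onion"], cooked_steps=20))  # 0.0 (wrong recipe)
```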

Asymmetric Advantages


Cramped Room


Counter Circuit


MeltingPot

Harvest

Harvest: agents face a common-pool resource dilemma in which apples regrow faster when other apples remain unharvested nearby. Agents can harvest aggressively, risking future availability, or sustainably, leaving some apples to promote long-term regeneration. Agents can also fire beams to penalize others, causing the hit agent to temporarily disappear from the game for a few steps.
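
The sketch below illustrates this common-pool dynamic in toy form: an empty cell's regrowth probability increases with the number of apples left nearby. The rates are illustrative assumptions, not MeltingPot's exact parameters.

```python
import random

def regrowth_probability(n_nearby_apples):
    """Toy regrowth rule: empty cells regrow faster when more apples remain
    nearby; with none left, the patch stays barren. Rates are illustrative."""
    rates = [0.0, 0.01, 0.05, 0.1]   # for 0, 1, 2, 3+ nearby apples
    return rates[min(n_nearby_apples, 3)]

def step_cell(has_apple, n_nearby_apples, harvested):
    """One step of a single cell: harvesting pays +1 now but lowers the
    neighborhood's future regrowth, which is the heart of the dilemma."""
    if has_apple and harvested:
        return False, 1.0
    if not has_apple and random.random() < regrowth_probability(n_nearby_apples):
        return True, 0.0
    return has_apple, 0.0
```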


CleanUp

CleanUp: a public goods game in which apple growth in an orchard is hindered by rising pollution in a nearby river; when pollution is high, apple growth stops entirely. Agents can reduce pollution by leaving the orchard to clean the polluted areas, highlighting the importance of individual effort in maintaining shared resources. Agents can also use beams to penalize others.
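
Analogously, here is a toy sketch of the public-goods dynamic: apple spawning slows as pollution rises and stops past a threshold, while cleaning effort by agents reduces pollution. The threshold and rates are illustrative assumptions.

```python
def apple_spawn_probability(pollution, threshold=0.4, max_rate=0.05):
    """Toy public-goods rule: spawning slows linearly as pollution rises
    and stops entirely past the threshold (parameters are illustrative)."""
    if pollution >= threshold:
        return 0.0
    return max_rate * (1.0 - pollution / threshold)

def step_river(pollution, n_cleaners, dirt_rate=0.02, clean_rate=0.05):
    """Pollution accrues every step and is reduced by each agent who leaves
    the orchard to clean; clipped to [0, 1]."""
    pollution += dirt_rate - clean_rate * n_cleaners
    return min(max(pollution, 0.0), 1.0)

# With no cleaners, pollution climbs toward 1 and apples stop spawning;
# a single dedicated cleaner keeps the orchard productive for everyone.
```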


Role behavior Analysis

We show the role-specific strategies learned by agents in the MeltingPot environments. The learned agent is paired with a selfish pre-trained agent, and different roles exhibit distinctly different behaviors, as detailed below.

Harvest

[Videos: role-specific behaviors in Harvest, one clip per role: Masochistic, Sadomasochistic, Sadistic, Competitive, Individualistic, Prosocial, Altruistic, Martyr. See the sketch below for the SVO weighting behind each label.]
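
The eight labels above are the categories of the classic SVO ring, where an orientation angle weights one's own reward against others' rewards. The sketch below maps each label to its conventional angle; RP's role embeddings are inspired by SVO, so this illustrates the role semantics rather than the paper's exact reward shaping.

```python
import math

# Conventional angles of the classic SVO ring (degrees). The orientation
# weights one's own reward against others': u = cos(a)*r_self + sin(a)*r_others.
ROLE_ANGLES = {
    "Individualistic": 0, "Prosocial": 45, "Altruistic": 90, "Martyr": 135,
    "Masochistic": 180, "Sadomasochistic": 225, "Sadistic": 270, "Competitive": 315,
}

def role_utility(role, r_self, r_others):
    a = math.radians(ROLE_ANGLES[role])
    return math.cos(a) * r_self + math.sin(a) * r_others

# An individualist values only its own reward; a competitor values the
# margin over others; a martyr sacrifices itself for others' gain.
for role in ROLE_ANGLES:
    print(f"{role:>16}: {role_utility(role, r_self=1.0, r_others=0.5):+.2f}")
```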

CleanUp

[Videos: role-specific behaviors in CleanUp, one clip per role: Masochistic, Sadomasochistic, Sadistic, Competitive, Individualistic, Prosocial, Altruistic, Martyr.]

BibTeX

@misc{long2024roleplaylearningadaptive,
  title={Role Play: Learning Adaptive Role-Specific Strategies in Multi-Agent Interactions},
  author={Weifan Long and Wen Wen and Peng Zhai and Lihua Zhang},
  year={2024},
  eprint={2411.01166},
  archivePrefix={arXiv},
  primaryClass={cs.MA},
  url={https://arxiv.org/abs/2411.01166},
}