Role Play: Learning Adaptive Role-Specific Strategies in Multi-Agent Interactions

Weifan Long, Wen Wen, Peng Zhai, Lihua Zhang

Abstract

The zero-shot coordination problem in multi-agent reinforcement learning (MARL), which requires agents to adapt to unseen partners, has attracted increasing attention. Traditional approaches often rely on the Self-Play (SP) framework to generate a diverse pool of policies, which serves to improve the generalization capability of the final agent. However, these frameworks may struggle to capture the full spectrum of potential strategies, especially in real-world scenarios that demand that agents balance cooperation with competition. In such settings, agents need strategies that can adapt to varying and often conflicting goals. Drawing inspiration from Social Value Orientation (SVO), where individuals maintain stable value orientations during interactions with others, we propose a novel framework called Role Play (RP). RP employs role embeddings to transform the challenge of policy diversity into the more tractable problem of role diversity. It trains a common policy conditioned on a role embedding appended to the observation, and employs a role predictor to estimate the joint role embeddings of other agents, helping the learning agent adapt to its assigned role. We theoretically prove that an approximately optimal policy can be achieved by optimizing the expected cumulative reward relative to an approximate role-based policy. Experimental results in both cooperative (Overcooked) and mixed-motive games (Harvest, CleanUp) show that RP consistently outperforms strong baselines when interacting with unseen agents, highlighting its robustness and adaptability in complex environments.

Algorithm Framework
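
The abstract describes two components: a common policy conditioned on a role embedding and a role predictor that estimates the other agents' joint role embeddings. Below is a minimal PyTorch sketch of both; the layer sizes, class names, and single-observation predictor input are our illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RoleConditionedPolicy(nn.Module):
    """Common policy: acts on the local observation concatenated with the
    agent's own role embedding (layer sizes are illustrative)."""
    def __init__(self, obs_dim, role_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + role_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, role_emb):
        # Conditioning on the role turns "many policies" into "one policy, many roles".
        return self.net(torch.cat([obs, role_emb], dim=-1))  # action logits

class RolePredictor(nn.Module):
    """Estimates the joint role embeddings of the other agents; here it reads
    a single observation, though a recurrent encoder over the trajectory
    would be a natural alternative."""
    def __init__(self, obs_dim, role_dim, n_others, hidden=128):
        super().__init__()
        self.n_others, self.role_dim = n_others, role_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_others * role_dim),
        )

    def forward(self, obs):
        out = self.net(obs)
        return out.view(*obs.shape[:-1], self.n_others, self.role_dim)

# Toy forward pass: batch of 4 observations, 8-dim roles, 6 actions, 1 other agent.
obs = torch.randn(4, 32)
role = torch.randn(4, 8)
policy = RoleConditionedPolicy(obs_dim=32, role_dim=8, n_actions=6)
predictor = RolePredictor(obs_dim=32, role_dim=8, n_others=1)
logits = policy(obs, role)      # (4, 6)
pred_roles = predictor(obs)     # (4, 1, 8)
```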

Experimental Results

Overcooked

Overcooked is a two-player cooperative game designed to test the collaboration ability of agents. Agents work together to fulfill soup orders using ingredients such as onions and tomatoes. They can move around and interact with items, for example grabbing ingredients or serving soups, depending on the game state. To complete an order, agents must combine the correct ingredients in a pot, cook them for a specified time, and then plate and serve the soup to earn a reward. Each order has a distinct cooking time and reward.
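
As a toy illustration of the order mechanic just described (ingredient matching, cook time, reward on serving), the sketch below uses made-up recipes, cook times, and reward values, not the benchmark's exact numbers.

```python
from dataclasses import dataclass

@dataclass
class Order:
    ingredients: tuple   # required ingredient multiset, e.g. three onions
    cook_time: int       # steps the pot must cook before serving
    reward: float        # payout for a correctly served soup

def serve_reward(order, pot_contents, cooked_steps):
    """Reward is earned only if the pot holds exactly the order's
    ingredients and has cooked long enough."""
    if sorted(pot_contents) == sorted(order.ingredients) and cooked_steps >= order.cook_time:
        return order.reward
    return 0.0

onion_soup = Order(ingredients=("onion", "onion", "onion"), cook_time=20, reward=20.0)
print(serve_reward(onion_soup, ["onion"] * 3, cooked_steps=20))                 # 20.0
print(serve_reward(onion_soup, ["onion", "tomato", "onion"], cooked_steps=20))  # 0.0 (wrong recipe)
```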

Asymmetric Advantages


Cramped Room


Counter Circuit


MeltingPot

Harvest

Harvest: agents face a common-pool resource dilemma in which apples regrow faster when other apples remain unharvested nearby. Agents can harvest aggressively, risking future availability, or sustainably, leaving some apples to promote long-term regeneration. Agents can also fire beams to penalize others, causing the hit agent to temporarily disappear from the game for a few steps.
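
The sketch below illustrates this common-pool dynamic in toy form: an empty cell's regrowth probability increases with the number of apples left nearby. The rates are illustrative assumptions, not MeltingPot's exact parameters.

```python
import random

def regrowth_probability(n_nearby_apples):
    """Toy regrowth rule: empty cells regrow faster when more apples remain
    nearby; with none left, the patch stays barren. Rates are illustrative."""
    rates = [0.0, 0.01, 0.05, 0.1]   # for 0, 1, 2, 3+ nearby apples
    return rates[min(n_nearby_apples, 3)]

def step_cell(has_apple, n_nearby_apples, harvested):
    """One step of a single cell: harvesting pays +1 now but lowers the
    neighborhood's future regrowth, which is the heart of the dilemma."""
    if has_apple and harvested:
        return False, 1.0
    if not has_apple and random.random() < regrowth_probability(n_nearby_apples):
        return True, 0.0
    return has_apple, 0.0
```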


CleanUp

CleanUp: a public goods game in which apple growth in an orchard is hindered by rising pollution in a nearby river; when pollution is high, apple growth stops entirely. Agents can reduce pollution by leaving the orchard to clean the polluted areas, highlighting the importance of individual effort in maintaining shared resources. Agents can also use beams to penalize others.
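
Analogously, here is a toy sketch of the public-goods dynamic: apple spawning slows as pollution rises and stops past a threshold, while cleaning effort by agents reduces pollution. The threshold and rates are illustrative assumptions.

```python
def apple_spawn_probability(pollution, threshold=0.4, max_rate=0.05):
    """Toy public-goods rule: spawning slows linearly as pollution rises
    and stops entirely past the threshold (parameters are illustrative)."""
    if pollution >= threshold:
        return 0.0
    return max_rate * (1.0 - pollution / threshold)

def step_river(pollution, n_cleaners, dirt_rate=0.02, clean_rate=0.05):
    """Pollution accrues every step and is reduced by each agent who leaves
    the orchard to clean; clipped to [0, 1]."""
    pollution += dirt_rate - clean_rate * n_cleaners
    return min(max(pollution, 0.0), 1.0)

# With no cleaners, pollution climbs toward 1 and apples stop spawning;
# a single dedicated cleaner keeps the orchard productive for everyone.
```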


Role behavior Analysis

We show the role-specific strategies learned by agents in the MeltingPot environments. The learned agent is paired with a selfish pre-trained agent, and different roles exhibit distinctly different behaviors, as detailed below.

Harvest

[Videos: role-specific behaviors in Harvest, one clip per role: Masochistic, Sadomasochistic, Sadistic, Competitive, Individualistic, Prosocial, Altruistic, Martyr. See the sketch below for the SVO weighting behind each label.]
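
The eight labels above are the categories of the classic SVO ring, where an orientation angle weights one's own reward against others' rewards. The sketch below maps each label to its conventional angle; RP's role embeddings are inspired by SVO, so this illustrates the role semantics rather than the paper's exact reward shaping.

```python
import math

# Conventional angles of the classic SVO ring (degrees). The orientation
# weights one's own reward against others': u = cos(a)*r_self + sin(a)*r_others.
ROLE_ANGLES = {
    "Individualistic": 0, "Prosocial": 45, "Altruistic": 90, "Martyr": 135,
    "Masochistic": 180, "Sadomasochistic": 225, "Sadistic": 270, "Competitive": 315,
}

def role_utility(role, r_self, r_others):
    a = math.radians(ROLE_ANGLES[role])
    return math.cos(a) * r_self + math.sin(a) * r_others

# An individualist values only its own reward; a competitor values the
# margin over others; a martyr sacrifices itself for others' gain.
for role in ROLE_ANGLES:
    print(f"{role:>16}: {role_utility(role, r_self=1.0, r_others=0.5):+.2f}")
```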

CleanUp

[Videos: role-specific behaviors in CleanUp, one clip per role: Masochistic, Sadomasochistic, Sadistic, Competitive, Individualistic, Prosocial, Altruistic, Martyr.]

BibTeX

@misc{long2024roleplaylearningadaptive,
  title={Role Play: Learning Adaptive Role-Specific Strategies in Multi-Agent Interactions},
  author={Weifan Long and Wen Wen and Peng Zhai and Lihua Zhang},
  year={2024},
  eprint={2411.01166},
  archivePrefix={arXiv},
  primaryClass={cs.MA},
  url={https://arxiv.org/abs/2411.01166},
}