Multi-Agent RL for Bomberland Game

New York University
CSCI-GA 3033 DDRL (2025) - Course Project

Code

Sample game scenes 1-3: Bomberland preview and training clips.
Learned behaviors emerging from attention-based multi-agent self-play training.

Abstract

Inspired by Coder One's Bomberland competition, we developed a multi-agent reinforcement learning system for coordinated 3v3 battles in a variant of the classic Bomberman environment. Our agents encode observations — including teammates, opponents, inventory, power-ups, and more — into dynamic entity embeddings. To enable inter-agent coordination, the policy networks apply self-attention and positional encoding over units. We explored two model variants: one processes agent trajectories as time series via attention-based temporal modeling, and the other uses standard per-frame inference without temporal concatenation. Policies are trained with Proximal Policy Optimization (PPO) and Generalized Advantage Estimation (GAE). We further adopt self-play training against periodically frozen policy snapshots to encourage strategic progression. The system is deployed within the Docker + Gym environment provided by Coder One to support scalable multi-agent experiments. Training is still ongoing.
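
As a concrete illustration of the self-play scheme described above, the hypothetical sketch below keeps a pool of periodically frozen policy snapshots and samples opponents from it. The class name, snapshot interval, and pool size are illustrative assumptions, not our actual code.

```python
import copy
import random

import torch


class SelfPlayPool:
    """Pool of periodically frozen policy snapshots used as self-play opponents.

    Hypothetical sketch: names, intervals, and pool size are illustrative only.
    """

    def __init__(self, snapshot_every: int = 500, max_snapshots: int = 10):
        self.snapshot_every = snapshot_every
        self.max_snapshots = max_snapshots
        self.snapshots: list[torch.nn.Module] = []

    def maybe_snapshot(self, episode: int, policy: torch.nn.Module) -> None:
        # Freeze a copy of the learner's current policy at a fixed episode interval.
        if episode > 0 and episode % self.snapshot_every == 0:
            frozen = copy.deepcopy(policy).eval()
            for p in frozen.parameters():
                p.requires_grad_(False)
            self.snapshots.append(frozen)
            if len(self.snapshots) > self.max_snapshots:
                self.snapshots.pop(0)  # keep only the most recent snapshots

    def sample_opponent(self, policy: torch.nn.Module) -> torch.nn.Module:
        # Before any snapshot exists, the learner simply plays against its live policy.
        return random.choice(self.snapshots) if self.snapshots else policy
```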

Motivation

Robots cooperating

Real-world systems often need multiple agents to work together.

Bomberland is a multi-agent game in which agents must survive, place bombs, and ideally coordinate their actions in real time.

Challenges in this project include:

  • Partial observability: Agents cannot directly observe information such as teammates' inventories or bomb cooldowns.
  • Shared resources: All three units on a team share a maximum of three bombs.
  • Delayed consequences: Bombs explode after a delay, so agents must plan ahead.
  • Dynamic environment: The map evolves over time (e.g., the playable zone shrinks in the late game).

Idea Inspiration

Hide and Seek simulation: OpenAI's Hide and Seek environment
OpenAI architecture: entity-wise attention and LSTM design

Our model architecture was inspired by OpenAI's Hide and Seek project. We apply the following components (a minimal sketch follows the list):

  • Entity embeddings for agents, bombs, and obstacles
  • Attention over agents
  • Temporal modeling (Model A only)
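
The PyTorch sketch below shows one way entity embeddings and attention over entities can be combined in this spirit. All dimensions, feature sizes, and names are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn


class EntityEncoder(nn.Module):
    """Embed heterogeneous entities (units, bombs, obstacles) into a shared space,
    then let them exchange information through self-attention.

    Illustrative sketch only; all sizes are assumptions.
    """

    def __init__(self, unit_dim=16, bomb_dim=8, obstacle_dim=6, d_model=64, n_heads=4):
        super().__init__()
        self.unit_proj = nn.Linear(unit_dim, d_model)
        self.bomb_proj = nn.Linear(bomb_dim, d_model)
        self.obstacle_proj = nn.Linear(obstacle_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, units, bombs, obstacles):
        # units: (B, N_u, unit_dim); bombs: (B, N_b, bomb_dim); obstacles: (B, N_o, obstacle_dim)
        tokens = torch.cat(
            [self.unit_proj(units), self.bomb_proj(bombs), self.obstacle_proj(obstacles)],
            dim=1,
        )  # (B, N_u + N_b + N_o, d_model)
        fused, _ = self.attn(tokens, tokens, tokens)  # every entity attends to every other
        return fused
```

Model B applies this kind of encoder to the current frame only, while Model A additionally stacks the encoded frames over a time window, as described in the next section.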

Model Architecture and Design

We design two model variants to explore the effect of temporal information in multi-agent coordination:

Model A: Attention-based Temporal Modeling

This variant treats a window of recent frames as a temporal sequence. For each unit, we concatenate its feature representations across multiple time steps and apply self-attention across the full sequence. Temporal position encoding is added to retain ordering.

Temporal Attention Model
Figure 1: Attention-based temporal architecture with sequence modeling.

Key modules (sketched below):

  • Entity embeddings for units, bombs, and obstacles
  • Self-attention with positional encoding over units
  • Temporal position encoding over the window of recent frames
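
A minimal sketch of the temporal part of Model A, assuming each recent frame has already been reduced to a single feature vector (e.g., by an encoder like the one above). The window length, the learned (rather than sinusoidal) position encoding, and the mean-pooling at the end are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Self-attention over a window of recent frame features (Model A sketch).

    Hypothetical dimensions; window length and pooling are assumptions.
    """

    def __init__(self, d_model=64, n_heads=4, window=8):
        super().__init__()
        # Learned temporal position encoding so the model retains frame ordering.
        self.time_pos = nn.Parameter(torch.zeros(1, window, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (B, T, d_model), one encoded feature vector per recent frame.
        x = frame_feats + self.time_pos[:, : frame_feats.size(1)]
        out, _ = self.attn(x, x, x)   # every frame attends to the whole window
        return out.mean(dim=1)        # pooled summary of the window
```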

Model B: Per-frame Inference (No Time Window)

This simpler variant processes each frame independently. It uses a similar feature encoding pipeline but does not aggregate over time; it relies solely on current-frame information, which allows faster training with fewer parameters.

Per-frame Model
Figure 2: Per-frame architecture without temporal modeling.

While lacking explicit memory, this model still supports coordination through self-attention over visible entities and positional encoding of units.

Comparing the two variants lets us evaluate the trade-off between explicit temporal modeling and architectural simplicity in multi-agent coordination tasks.
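
For contrast, here is a sketch of a per-frame actor-critic head in the spirit of Model B: self-attention with a learned unit position encoding over the current frame only, followed by per-unit action logits and a shared value estimate. The action count, head layout, and sizes are assumptions.

```python
import torch
import torch.nn as nn


class PerFramePolicy(nn.Module):
    """Per-frame actor-critic head (Model B sketch): no temporal window.

    Illustrative only; the number of actions and layer sizes are assumptions.
    """

    def __init__(self, d_model=64, n_heads=4, n_units=3, n_actions=6):
        super().__init__()
        self.unit_pos = nn.Parameter(torch.zeros(1, n_units, d_model))  # learned unit positions
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.policy_head = nn.Linear(d_model, n_actions)  # per-unit action logits
        self.value_head = nn.Linear(d_model, 1)           # shared state-value estimate

    def forward(self, unit_feats):
        # unit_feats: (B, n_units, d_model), encoded from the current frame only.
        x = unit_feats + self.unit_pos
        x, _ = self.attn(x, x, x)                # units attend to each other for coordination
        logits = self.policy_head(x)             # (B, n_units, n_actions)
        value = self.value_head(x.mean(dim=1))   # (B, 1)
        return logits, value
```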

Results and Training Insights

Model A: With Time Window

  • Trained for ~4000 episodes
  • Performance: Poor
  • Agents learn to bomb destructible blocks but often self-destruct
  • Time-series modeling increases learning difficulty

Model B: Without Time Window

  • Trained for ~2000 episodes (Ongoing)
  • Performance: Improving – agents exhibit more stable and strategic behavior
  • Fewer suicidal actions, better coordination emerging
  • Training is faster and more stable due to simpler input

Note: Other architectures were also tried. These two show the most promise so far.

Figure 3: Loss components used in PPO.
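
For reference, the standard clipped PPO objective with GAE advantages that the figure summarizes. This is the textbook formulation; the clip range $\epsilon$, coefficients $c_1$, $c_2$, and the discount parameters $\gamma$, $\lambda$ are generic symbols, not our tuned values.

```latex
\begin{aligned}
\rho_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)},
\qquad
\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \,\delta_{t+l},
\qquad
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \\
L^{\text{CLIP}}(\theta) &= \hat{\mathbb{E}}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big], \\
L(\theta) &= -\,L^{\text{CLIP}}(\theta)
+ c_1\,\hat{\mathbb{E}}_t\big[(V_\theta(s_t)-\hat{R}_t)^2\big]
- c_2\,\hat{\mathbb{E}}_t\big[\mathcal{H}[\pi_\theta(\cdot \mid s_t)]\big].
\end{aligned}
```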

Demo Video

This is an early-stage demo of Model B agents. The model is still training and will continue to improve.