Multi-Agent RL for Bomberland Game

New York University
CSCI-GA 3033 DDRL (2025) - Course Project

Code

Sample game scenes 1-3: Bomberland preview and training clips.
Learned behaviors emerging from attention-based multi-agent self-play training.

Abstract

Inspired by Coder One's Bomberland competition, we developed a multi-agent reinforcement learning system for coordinated 3v3 battles in a variant of the classic Bomberman environment. Our agents encode observations — including teammates, opponents, inventory, power-ups, and more — into dynamic entity embeddings. To enable inter-agent coordination, the policy networks apply self-attention and positional encoding over units. We explored two model variants: one processes agent trajectories as time series via attention-based temporal modeling, and the other uses standard per-frame inference without temporal concatenation. Policies are trained with Proximal Policy Optimization (PPO) and Generalized Advantage Estimation (GAE). We further adopt self-play training against periodically frozen policy snapshots to encourage strategic progression. The system is deployed within the Docker + Gym environment provided by Coder One to support scalable multi-agent experiments. Training is still ongoing.
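
As a concrete illustration of the self-play scheme described above, the hypothetical sketch below keeps a pool of periodically frozen policy snapshots and samples opponents from it. The class name, snapshot interval, and pool size are illustrative assumptions, not our actual code.

```python
import copy
import random

import torch


class SelfPlayPool:
    """Pool of periodically frozen policy snapshots used as self-play opponents.

    Hypothetical sketch: names, intervals, and pool size are illustrative only.
    """

    def __init__(self, snapshot_every: int = 500, max_snapshots: int = 10):
        self.snapshot_every = snapshot_every
        self.max_snapshots = max_snapshots
        self.snapshots: list[torch.nn.Module] = []

    def maybe_snapshot(self, episode: int, policy: torch.nn.Module) -> None:
        # Freeze a copy of the learner's current policy at a fixed episode interval.
        if episode > 0 and episode % self.snapshot_every == 0:
            frozen = copy.deepcopy(policy).eval()
            for p in frozen.parameters():
                p.requires_grad_(False)
            self.snapshots.append(frozen)
            if len(self.snapshots) > self.max_snapshots:
                self.snapshots.pop(0)  # keep only the most recent snapshots

    def sample_opponent(self, policy: torch.nn.Module) -> torch.nn.Module:
        # Before any snapshot exists, the learner simply plays against its live policy.
        return random.choice(self.snapshots) if self.snapshots else policy
```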

Motivation

Robots cooperating

Real-world systems often need multiple agents to work together.

Bomberland is a multi-agent game in which agents must survive, place bombs, and ideally coordinate their actions in real time.

Challenges in this project include:

  • Partial observability: Agents cannot directly observe information such as teammates' inventories or bomb cooldowns.
  • Shared resources: All three units on a team share a maximum of three bombs.
  • Delayed consequences: Bombs explode after a delay, so agents must plan ahead.
  • Dynamic environment: The map evolves over time (e.g., the playable zone shrinks in the late game).

Idea Inspiration

Hide and Seek simulation: OpenAI's Hide and Seek environment
OpenAI architecture: entity-wise attention and LSTM design

Our model architecture was inspired by OpenAI's Hide and Seek project. We apply the following components (a minimal sketch follows the list):

  • Entity embeddings for agents, bombs, and obstacles
  • Attention over agents
  • Temporal modeling (Model A only)
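
The PyTorch sketch below shows one way entity embeddings and attention over entities can be combined in this spirit. All dimensions, feature sizes, and names are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn


class EntityEncoder(nn.Module):
    """Embed heterogeneous entities (units, bombs, obstacles) into a shared space,
    then let them exchange information through self-attention.

    Illustrative sketch only; all sizes are assumptions.
    """

    def __init__(self, unit_dim=16, bomb_dim=8, obstacle_dim=6, d_model=64, n_heads=4):
        super().__init__()
        self.unit_proj = nn.Linear(unit_dim, d_model)
        self.bomb_proj = nn.Linear(bomb_dim, d_model)
        self.obstacle_proj = nn.Linear(obstacle_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, units, bombs, obstacles):
        # units: (B, N_u, unit_dim); bombs: (B, N_b, bomb_dim); obstacles: (B, N_o, obstacle_dim)
        tokens = torch.cat(
            [self.unit_proj(units), self.bomb_proj(bombs), self.obstacle_proj(obstacles)],
            dim=1,
        )  # (B, N_u + N_b + N_o, d_model)
        fused, _ = self.attn(tokens, tokens, tokens)  # every entity attends to every other
        return fused
```

Model B applies this kind of encoder to the current frame only, while Model A additionally stacks the encoded frames over a time window, as described in the next section.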

Model Architecture and Design

We design two model variants to explore the effect of temporal information in multi-agent coordination:

Model A: Attention-based Temporal Modeling

This variant treats a window of recent frames as a temporal sequence. For each unit, we concatenate its feature representations across multiple time steps and apply self-attention across the full sequence. Temporal position encoding is added to retain ordering.

Temporal Attention Model
Figure 1: Attention-based temporal architecture with sequence modeling.

Key modules (sketched below):

  • Entity embeddings for units, bombs, and obstacles
  • Self-attention with positional encoding over units
  • Temporal position encoding over the window of recent frames
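
A minimal sketch of the temporal part of Model A, assuming each recent frame has already been reduced to a single feature vector (e.g., by an encoder like the one above). The window length, the learned (rather than sinusoidal) position encoding, and the mean-pooling at the end are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Self-attention over a window of recent frame features (Model A sketch).

    Hypothetical dimensions; window length and pooling are assumptions.
    """

    def __init__(self, d_model=64, n_heads=4, window=8):
        super().__init__()
        # Learned temporal position encoding so the model retains frame ordering.
        self.time_pos = nn.Parameter(torch.zeros(1, window, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (B, T, d_model), one encoded feature vector per recent frame.
        x = frame_feats + self.time_pos[:, : frame_feats.size(1)]
        out, _ = self.attn(x, x, x)   # every frame attends to the whole window
        return out.mean(dim=1)        # pooled summary of the window
```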

Model B: Per-frame Inference (No Time Window)

This simpler variant processes each frame independently. It uses a similar feature encoding pipeline but does not aggregate over time; it relies solely on current-frame information, which allows faster training with fewer parameters.

Per-frame Model
Figure 2: Per-frame architecture without temporal modeling.

While lacking explicit memory, this model still supports coordination through self-attention over visible entities and positional encoding of units.

Comparing the two variants lets us evaluate the trade-off between explicit temporal modeling and architectural simplicity in multi-agent coordination tasks.
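
For contrast, here is a sketch of a per-frame actor-critic head in the spirit of Model B: self-attention with a learned unit position encoding over the current frame only, followed by per-unit action logits and a shared value estimate. The action count, head layout, and sizes are assumptions.

```python
import torch
import torch.nn as nn


class PerFramePolicy(nn.Module):
    """Per-frame actor-critic head (Model B sketch): no temporal window.

    Illustrative only; the number of actions and layer sizes are assumptions.
    """

    def __init__(self, d_model=64, n_heads=4, n_units=3, n_actions=6):
        super().__init__()
        self.unit_pos = nn.Parameter(torch.zeros(1, n_units, d_model))  # learned unit positions
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.policy_head = nn.Linear(d_model, n_actions)  # per-unit action logits
        self.value_head = nn.Linear(d_model, 1)           # shared state-value estimate

    def forward(self, unit_feats):
        # unit_feats: (B, n_units, d_model), encoded from the current frame only.
        x = unit_feats + self.unit_pos
        x, _ = self.attn(x, x, x)                # units attend to each other for coordination
        logits = self.policy_head(x)             # (B, n_units, n_actions)
        value = self.value_head(x.mean(dim=1))   # (B, 1)
        return logits, value
```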

Results and Training Insights

Model A: With Time Window

  • Trained for ~4000 episodes
  • Performance: Poor
  • Agents learn to bomb destructible blocks but often self-destruct
  • Time-series modeling increases learning difficulty

Model B: Without Time Window

  • Trained for ~2000 episodes (Ongoing)
  • Performance: Improving – agents exhibit more stable and strategic behavior
  • Fewer suicidal actions, better coordination emerging
  • Training is faster and more stable due to simpler input

Note: Other architectures were also tried. These two show the most promise so far.

Figure 3: Loss components used in PPO.
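
For reference, the standard clipped PPO objective with GAE advantages that the figure summarizes. This is the textbook formulation; the clip range $\epsilon$, coefficients $c_1$, $c_2$, and the discount parameters $\gamma$, $\lambda$ are generic symbols, not our tuned values.

```latex
\begin{aligned}
\rho_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)},
\qquad
\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \,\delta_{t+l},
\qquad
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \\
L^{\text{CLIP}}(\theta) &= \hat{\mathbb{E}}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big], \\
L(\theta) &= -\,L^{\text{CLIP}}(\theta)
+ c_1\,\hat{\mathbb{E}}_t\big[(V_\theta(s_t)-\hat{R}_t)^2\big]
- c_2\,\hat{\mathbb{E}}_t\big[\mathcal{H}[\pi_\theta(\cdot \mid s_t)]\big].
\end{aligned}
```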

Demo Video

This is an early-stage demo of Model B agents. The model is still training and will continue to improve.