Inspired by the Coder One Bomberland competition, we developed a multi-agent reinforcement learning system for coordinated 3v3 battles in a variant of the classic Bomberman environment. Our agents encode observations (teammates, opponents, inventory, power-ups, and more) into dynamic entity embeddings. To enable inter-agent coordination, the policy networks apply self-attention and positional encoding over units. We explored two model variants: one processes agent trajectories as time series via attention-based temporal modeling, and the other performs standard per-frame inference without temporal concatenation. Policies are trained with Proximal Policy Optimization (PPO) and Generalized Advantage Estimation (GAE). We further adopt self-play training against periodically frozen policy snapshots to encourage strategic progression. The system runs inside the Docker + Gym environment provided by Coder One to support scalable multi-agent experiments. Training is still ongoing.
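As background for the training setup, here is a minimal sketch of how GAE advantages can be computed from a rollout; function and variable names are illustrative, not taken from our codebase:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, dones: arrays of length T (dones are 0/1 episode-end flags)
    values: array of length T + 1 (bootstrap value appended at the end)
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual; mask the bootstrap value at episode boundaries
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    returns = advantages + np.asarray(values[:-1], dtype=np.float32)
    return advantages, returns
```

In practice, the resulting advantages are usually normalized per batch before the clipped PPO update.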
Real-world systems often need multiple agents to work together.
Bomberland is a multi-agent game in which agents must survive, place bombs, and ideally coordinate their actions in real time.
Challenges in this project include:
- Coordinating multiple units on the same team in real time
- Encoding a dynamic set of entities (teammates, opponents, bombs, power-ups) into a fixed policy input
- Deciding how much temporal context each unit needs in order to act effectively
Our model structure design was inspired by OpenAI's Hide and Seek project. We apply:
- Entity embeddings for agents, bombs, and obstacles
- Attention over agents
- Temporal modeling
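A minimal PyTorch sketch of this kind of entity encoding; the feature dimensions, module names, and the choice to attend over all entity tokens jointly are illustrative assumptions, not our exact implementation:

```python
import torch
import torch.nn as nn

class EntityEncoder(nn.Module):
    """Embed heterogeneous entities and attend over them within one frame."""

    def __init__(self, feat_dim=32, embed_dim=64, num_heads=4):
        super().__init__()
        # Separate projections per entity type (agents, bombs, obstacles)
        self.agent_proj = nn.Linear(feat_dim, embed_dim)
        self.bomb_proj = nn.Linear(feat_dim, embed_dim)
        self.obstacle_proj = nn.Linear(feat_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, agents, bombs, obstacles):
        # Each input: (batch, num_entities_of_type, feat_dim)
        tokens = torch.cat([
            self.agent_proj(agents),
            self.bomb_proj(bombs),
            self.obstacle_proj(obstacles),
        ], dim=1)                                        # (batch, total_entities, embed_dim)
        attended, _ = self.attn(tokens, tokens, tokens)  # self-attention over entities
        return attended
```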
We design two model variants to explore the effect of temporal information in multi-agent coordination:
This variant treats a window of recent frames as a temporal sequence. For each unit, we concatenate its feature representations across multiple time steps and apply self-attention across the full sequence. Temporal position encoding is added to retain ordering.
Key modules:
- Entity embeddings for agents, bombs, and obstacles
- Self-attention over units with positional encoding
- Temporal attention over the concatenated frame window, with temporal position encoding
- Policy and value heads trained with PPO and GAE
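A rough sketch of the temporal component under the same assumptions; the window size, learned positional embedding, and pooling choice are illustrative:

```python
import torch
import torch.nn as nn

class TemporalAttentionHead(nn.Module):
    """Attend over one unit's features concatenated across recent frames."""

    def __init__(self, embed_dim=64, num_heads=4, window=8):
        super().__init__()
        # Learned temporal position encoding to retain frame ordering
        self.pos = nn.Parameter(torch.zeros(1, window, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, unit_seq):
        # unit_seq: (batch, window, embed_dim) -- one unit's embedding per frame
        x = unit_seq + self.pos
        x, _ = self.attn(x, x, x)   # self-attention across the time window
        return x.mean(dim=1)        # pooled temporal summary fed to the policy
```

Pooling the attended sequence is one simple way to summarize the window; the actual model may instead use the last time step or a learned query.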
This simpler variant processes each frame independently. It uses a similar feature encoding pipeline but does not aggregate over time. It relies solely on current frame information, allowing for faster training and fewer parameters.
While lacking explicit memory, this model still supports coordination through self-attention over visible entities and positional encoding of units.
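For contrast, under the same assumptions, Model #2's decision step could look like a single-frame forward pass; the pooling strategy, head names, and action count are hypothetical:

```python
import torch
import torch.nn as nn

class PerFramePolicy(nn.Module):
    """Model #2 style: act from the current frame only, no temporal aggregation."""

    def __init__(self, embed_dim=64, num_actions=6):
        super().__init__()
        self.policy_head = nn.Linear(embed_dim, num_actions)  # action logits
        self.value_head = nn.Linear(embed_dim, 1)              # state value for PPO/GAE

    def forward(self, entity_embeddings):
        # entity_embeddings: (batch, num_entities, embed_dim) from an encoder like
        # the EntityEncoder sketch above; no frames are concatenated over time.
        pooled = entity_embeddings.mean(dim=1)  # simple pooling over visible entities
        return self.policy_head(pooled), self.value_head(pooled)
```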
Comparing the two variants lets us evaluate the trade-off between explicit temporal modeling and architectural simplicity in multi-agent coordination tasks.
Note: Other architectures were also tried. These two show the most promise so far.
This is an early-stage demo of Model #2 agents. The model is still under training and will continue to improve.