AgentBreeder

Mitigating the AI Safety Impact of Multi-Agent Scaffolds via Self-Improvement

🏆 NeurIPS 2025 Spotlight 🏆

Abstract

Figure 1: Overview of AgentBreeder's evolutionary search over multi-agent scaffolds.

Scaffolding large language models (LLMs) into multi-agent systems often improves performance on complex tasks, but the safety impact of such scaffolds has not been thoroughly explored. We introduce AgentBreeder, a framework for multi-objective, self-improving evolutionary search over scaffolds, aimed at the safety impact that scaffolding has on LLMs in multi-agent systems.

We evaluate discovered scaffolds on widely recognized reasoning, mathematics, and safety benchmarks. In 'blue' mode, we see a 79.4% average uplift in safety benchmark performance while maintaining or improving capability scores. In 'red' mode, we find adversarially weak scaffolds emerging concurrently with capability optimization. Our work demonstrates the risks of multi-agent scaffolding and provides a framework for mitigating them.

Key Achievements

🎯 79.4% average uplift in safety benchmark performance while maintaining or improving capability scores
🔬 First framework for multi-objective evolutionary search over multi-agent scaffolds
⚖️ Balanced optimization of safety and capability through Pareto optimization
🔴 Red-team insights revealing how capable scaffolds can become highly vulnerable
📊 Comprehensive evaluation across DROP, MMLU, GPQA, and SaladData benchmarks

Operational Modes

AgentBreeder operates in three distinct modes, each serving different research and deployment needs:

🔵 BlueAgentBreeder (Defense)

Objective: Maximize both safety and capability
Use case: Develop robust, safe multi-agent systems
Result: 79.4% average uplift in safety benchmark performance while maintaining or improving capability scores

BlueAgentBreeder focuses on creating scaffolds that enhance both the safety and performance of multi-agent systems. This mode is ideal for production deployments where safety is paramount.

🔴 RedAgentBreeder (Attack)

Objective: Maximize capability while minimizing safety
Use case: Red-team testing and vulnerability discovery
Result: Reveals how scaffolding can inadvertently expose safety weaknesses

RedAgentBreeder serves as an adversarial testing tool, helping researchers understand potential vulnerabilities in multi-agent scaffolds. This mode is crucial for identifying safety risks before deployment.

🎯 CapableAgentBreeder (Capability)

Objective: Maximize capability only
Use case: Baseline comparison and pure performance optimization
Result: Performance competitive with existing approaches

CapableAgentBreeder provides a baseline for comparison by focusing solely on performance optimization without safety constraints.
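
The three objectives above can be illustrated with a minimal Python sketch. It is not AgentBreeder's implementation: the names `Mode`, `objective_vector`, and `select`, and all the scores, are hypothetical and chosen only for illustration. The sketch assumes each scaffold has a single capability score and a single safety score, both in [0, 1], and that each mode keeps the scaffolds that are Pareto-optimal under its own objectives.

```python
from enum import Enum


class Mode(Enum):
    BLUE = "blue"        # maximize capability and safety
    RED = "red"          # maximize capability, minimize safety
    CAPABLE = "capable"  # maximize capability only


def objective_vector(mode, capability, safety):
    """Return a tuple in which every entry is to be maximized (illustrative)."""
    if mode is Mode.BLUE:
        return (capability, safety)
    if mode is Mode.RED:
        return (capability, -safety)  # negating turns "minimize" into "maximize"
    return (capability,)


def dominates(u, v):
    """u dominates v if it is no worse on every objective and better on at least one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def select(mode, scores):
    """Names of the scaffolds that no other scaffold dominates under the given mode."""
    vectors = {name: objective_vector(mode, c, s) for name, (c, s) in scores.items()}
    return sorted(
        name for name, v in vectors.items()
        if not any(dominates(w, v) for w in vectors.values())
    )


# Made-up (capability, safety) scores for four hypothetical scaffolds.
scores = {
    "A": (0.80, 0.30),
    "B": (0.75, 0.90),
    "C": (0.60, 0.95),
    "D": (0.60, 0.20),
}

for mode in Mode:
    print(mode.value, select(mode, scores))
```

On this toy population, blue mode keeps A, B, and C, which trade capability against safety. Red mode keeps A and D, the capable but unsafe end. Capable mode keeps only A. The same candidates therefore yield different survivors depending on the objective, which is the distinction the three modes draw.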

Citation

@inproceedings{rosser2025agentbreeder,
  title={AgentBreeder: Mitigating the AI Safety Impact of Multi-Agent Scaffolds via Self-Improvement},
  author={Rosser, J and Foerster, Jakob},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  note={NeurIPS 2025 Spotlight}
}