The Social Emergence of Multi-Agent Collaboration – Optimising Gemma 4

This entire blog is powered by Gemma 4, so this research, originally posted by Thomas Wolf on X, was especially interesting to see. What happens when 100+ autonomous agents are given a week-long, competitive optimization goal? We recently discovered that the result isn’t just faster code: it is the emergence of an entirely new social contract.

Our experiment focused on improving the inference speed of Gemma 4 in vLLM. While we achieved a 5x improvement by the end of the run, the most striking results were in the messages exchanged on the community board.

Integrity and Self-Policing

The agents didn’t just optimize; they policed themselves. We observed instances where agents actively resisted social engineering attempts to move discussions to private channels like Telegram, labeling such moves as “indistinguishable from collusion.”

Agents also flagged verification loopholes where “clean PPL” (perplexity) could be used to artificially inflate TPS (tokens per second), prompting community rulings to maintain fairness. This kind of self-governance is a significant step toward building truly autonomous AI agents.

Emergent Collaboration and Division of Labor

Efficiency emerged through specialized roles, much like a highly optimized production line.

The Four-Agent Relay: One agent built the checkpoint, another attempted to run it, a third diagnosed a configuration bug regarding tie_word_embeddings, and a fourth successfully shipped it at 118 TPS.
GPU-Rich vs. GPU-Poor: Compute-starved agents pivoted to writing specs and byte-math for those with the hardware to execute them.
Shared Knowledge: A communal knowledge base emerged, containing playbooks, triage tools, and significance tests so newcomers wouldn’t repeat dead ends.

These interactions demonstrate that building capable agents with smolagents or similar frameworks requires more than tool access; it requires shared protocols.
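The source does not say what the tie_word_embeddings bug in the relay above actually was, so the following is only a guess at the general shape of such a check: does the checkpoint's config agree with the tensors it stores? The function name, the "lm_head.weight" tensor name, and the sample tensor names are all assumptions for illustration, not details taken from the challenge.

```python
def check_tied_embeddings(config, tensor_names, head_name="lm_head.weight"):
    """Return a list of warnings about tie_word_embeddings mismatches.

    config       -- dict loaded from a checkpoint's config file
    tensor_names -- set of tensor names stored in the checkpoint
    head_name    -- assumed name of the separate output-head tensor
    """
    problems = []
    if "tie_word_embeddings" not in config:
        problems.append("tie_word_embeddings is not set explicitly")
        return problems

    tied = bool(config["tie_word_embeddings"])
    has_head = head_name in tensor_names

    if not tied and not has_head:
        # An untied head is expected, but no weights were saved for it.
        problems.append("config is untied but checkpoint has no " + head_name)
    if tied and has_head:
        # A separate head was saved, yet the config says to reuse the embeddings.
        problems.append("config is tied but checkpoint stores " + head_name)
    return problems


# Hypothetical checkpoint that only stores the input embedding table.
names = {"model.embed_tokens.weight"}
print(check_tied_embeddings({"tie_word_embeddings": False}, names))
print(check_tied_embeddings({"tie_word_embeddings": True}, names))
```

A check like this is cheap enough to run before handing a checkpoint to the next agent in a relay, which is where a mismatch would otherwise surface as a confusing runtime failure.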

Key Discoveries and Technical Reversals

The race was filled with “scientific” breakthroughs that were quickly debunked by peer review from other agents.

Discovery Name	The Reality
“int4-Marlin floor”	Originally thought to be a mathematical limit, it was later broken by MTP speculative decoding.
“Smarter drafter loses”	A 2B drafter’s read overhead dominates; a tiny 256-hidden drafter is actually more efficient at batch-1.
“DFlash near-random”	An agent identified that low acceptance rates were due to a train/serve hidden-state mismatch (bf16 vs int4).
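To see why a tiny drafter can beat a smarter one, it helps to run the standard speculative decoding arithmetic. The sketch below uses the usual simplified model: with a per-token acceptance rate alpha and k drafted tokens, each verification step yields (1 - alpha^(k+1)) / (1 - alpha) tokens on average, at a cost of k drafter steps plus one target step. The acceptance rates and cost ratios are made-up illustrative values, not measurements from the challenge, and the cost ratio stands in for the "read overhead" mentioned above, since batch-1 decoding is typically limited by how many weight bytes are read per step.

```python
def expected_tokens_per_step(alpha, k):
    """Average tokens produced per target-model verification step.

    alpha -- probability that a drafted token is accepted (assumed i.i.d.)
    k     -- number of tokens the drafter proposes per step
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speculative_speedup(alpha, k, cost_ratio):
    """Estimated speedup over plain decoding.

    cost_ratio -- time of one drafter step divided by one target step
    """
    return expected_tokens_per_step(alpha, k) / (k * cost_ratio + 1.0)


# Hypothetical numbers: the larger drafter is accepted more often but costs more per step.
large = speculative_speedup(alpha=0.8, k=4, cost_ratio=0.30)
tiny = speculative_speedup(alpha=0.6, k=4, cost_ratio=0.02)
print(f"larger drafter: {large:.2f}x, tiny drafter: {tiny:.2f}x")
```

Under these assumed numbers the tiny drafter comes out ahead despite its lower acceptance rate, because the drafting cost in the denominator shrinks faster than the accepted-token count in the numerator.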

Beyond the technical wins, the community established a significance norm, noting that deltas of less than ~4 TPS were statistically indistinguishable from noise.
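A norm like this is easy to encode as a gate on reported results. The following is a minimal sketch, assuming each configuration is benchmarked several times and the means are compared; the ~4 TPS threshold comes from the community norm above, while the function name and sample measurements are hypothetical.

```python
from statistics import mean

NOISE_FLOOR_TPS = 4.0  # community norm: smaller deltas are treated as noise


def compare_runs(baseline_tps, candidate_tps, noise_floor=NOISE_FLOOR_TPS):
    """Compare mean TPS of two sets of runs.

    Returns (delta, meaningful), where meaningful is False when the
    absolute difference in means is below the noise floor.
    """
    delta = mean(candidate_tps) - mean(baseline_tps)
    return delta, abs(delta) >= noise_floor


# Hypothetical measurements from repeated benchmark runs.
baseline = [100.0, 101.0, 99.0]
small_gain = [102.0, 103.0, 101.0]
large_gain = [110.0, 111.0, 109.0]

print(compare_runs(baseline, small_gain))  # delta of 2 TPS, below the floor
print(compare_runs(baseline, large_gain))  # delta of 10 TPS, above the floor
```

A fixed threshold is cruder than a proper significance test, but it is simple to apply consistently across many agents reporting results.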

Explore the Results

You can witness the fascinating social dynamics for yourself on the Gemma interactions view board.

To trace how specific optimizations evolved, explore the lineage of inventions from these agents. For a real-time look at the performance race, visit the Gemma challenge dashboard.

All of this work was organized by the Gemma Challenge organization on Hugging Face.

Ready to see how multi-agent coordination can redefine performance? Dive into the interaction logs today!

Sources

thomwolf-gemma-fast-challenges.static.hf.space/index.html