Achieving Near-SOTA Multi-Hop Question Answering with MOTHRAG

Traditional Retrieval-Augmented Generation (RAG) systems often struggle with multi-hop reasoning, where an answer requires connecting multiple disparate pieces of information. Most current solutions rely on expensive, GPU-heavy models or unpredictable agentic loops that are difficult to scale and audit.

See it in action: https://huggingface.co/spaces/JUBOX99/mothrag-demo

The Problem with Agentic Flakiness

While the industry is moving toward agentic retrieval—using LLMs to plan and iterate—this often introduces a “flaky” element. Free-form agent loops can produce inconsistent results, making it nearly impossible to maintain reproducibility in production environments.

Introducing MOTHRAG: Deterministic Multi-Hop QA

The MOTHRAG framework offers a powerful alternative. It delivers research-SOTA parity on multi-hop question answering using only commodity, pay-per-call LLM APIs. There is no need for local GPUs or complex fine-tuning.

By replacing unpredictable loops with deterministic orchestration, MOTHRAG ensures that the same inputs lead to the same answers, accompanied by an inspectable proof tree for every result.
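To make "same inputs, same answers" and "inspectable proof tree" concrete, here is a minimal sketch of what such a structure could look like. The `ProofNode` class and `fingerprint` helper are hypothetical names invented for illustration; they are not MOTHRAG's API, and the real proof-tree schema may differ.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProofNode:
    """One step in a hypothetical proof tree (illustrative, not MOTHRAG's schema)."""
    claim: str                              # sub-question or intermediate conclusion
    evidence_ids: tuple[str, ...]           # ids of passages that support the claim
    children: tuple["ProofNode", ...] = ()  # steps this claim depends on


def fingerprint(node: ProofNode) -> str:
    """Stable hash of a whole tree: equal trees always give equal fingerprints."""
    payload = json.dumps(asdict(node), sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder content; the claims and passage ids are made up.
leaf = ProofNode("sub-question 1 resolved", ("passage-12",))
root = ProofNode("final answer", ("passage-12", "passage-40"), (leaf,))
print(fingerprint(root))
```

Comparing fingerprints across two runs on the same input is one simple way to check reproducibility in a regression test.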

How It Works: A Three-Stage Pipeline

Bridge Retrieval Substrate: Uses multi-query ANN fusion and a tripartite LLM judge to reshape retrieval.
Four-Arm Ensemble Pool: Runs four reasoning arms, among them direct reads, decomposition, and iterative refinement.
Deterministic Arbitration: A fixed-weight system that evaluates grounding status and faithfulness without learned components.
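The first stage mentions multi-query ANN fusion without saying how the result lists are combined. One common choice is reciprocal rank fusion (RRF), sketched below as an assumption rather than as MOTHRAG's actual method. The function name, the constant `k=60`, and the passage ids are all illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of passage ids into one ranking.

    Each passage scores sum(1 / (k + rank)) over the lists that contain it.
    Ties break on passage id, so the output never depends on dict ordering.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical ANN results for three rewrites of the same question.
runs = [
    ["p3", "p1", "p7"],
    ["p1", "p3", "p9"],
    ["p1", "p4", "p3"],
]
fused = reciprocal_rank_fusion(runs)
print(fused)  # ['p1', 'p3', 'p4', 'p7', 'p9']
```

Passages that several query rewrites agree on rise to the top, which is the property a multi-hop retriever needs when no single query surfaces every bridge passage.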

Implementation and Deployment

Deployment requires only a package installation and your existing API keys for services such as Gemini or Groq.

Recommended Production Install:

pip install 'mothrag[prod]'

Proven Performance

On HotpotQA, 2WikiMultiHopQA, and MuSiQue, MOTHRAG's average F1 lands within 0.7 points of the GPU-bound research state of the art. You can examine the full performance breakdown in the Zenodo preprint.

MOTHRAG is a training-free, multi-hop question-answering framework in which every component — reader, embedder, and retrieval judges — sits behind a commodity pay-per-call API: no local GPU serving, no constrained decoding, and no non-commercially-licensed model.

It is deliberately deterministic. Rather than a free-form agentic loop, MOTHRAG composes a small ensemble of reasoning arms — direct reading, query decomposition, and an iterative, grounding-driven arm — under a deterministic arbitrator that weighs grounding verification, inter-arm agreement, and faithfulness, over a bridge retrieval substrate with multi-hop chain filtering. Every answer is proof-tree-structured for downstream audit, and readers, embedders, and judges can be swapped without retraining.
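A fixed-weight arbitrator of the kind described above can be sketched in a few lines. Everything here is an assumption made for illustration: the `Candidate` fields, the three weights, and the tie-break rule are hypothetical and are not taken from MOTHRAG.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    arm: str             # which reasoning arm produced the answer (assumed unique)
    answer: str
    grounded: bool       # did grounding verification pass?
    faithfulness: float  # judge score in [0, 1]


# Illustrative constants, not the weights MOTHRAG uses.
W_GROUNDED, W_AGREEMENT, W_FAITHFUL = 0.5, 0.3, 0.2


def arbitrate(candidates):
    """Pick one candidate using a fixed weighted score and a stable tie-break."""
    if not candidates:
        raise ValueError("need at least one candidate")
    normalized = [c.answer.strip().lower() for c in candidates]
    others = len(candidates) - 1

    def score(i):
        c = candidates[i]
        # Share of the other arms that returned the same normalized answer.
        agreement = (normalized.count(normalized[i]) - 1) / others if others else 0.0
        return (W_GROUNDED * c.grounded
                + W_AGREEMENT * agreement
                + W_FAITHFUL * c.faithfulness)

    # Highest score wins; equal scores fall back to arm name, not input order.
    best = min(range(len(candidates)), key=lambda i: (-score(i), candidates[i].arm))
    return candidates[best]


pool = [
    Candidate("direct", "Answer A", True, 0.90),
    Candidate("decompose", "answer a", True, 0.80),
    Candidate("iterative", "Answer B", False, 0.95),
]
print(arbitrate(pool).arm)  # direct
```

Because the score uses no randomness and no learned parameters, reordering the pool or rerunning the function yields the same winner.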

On HotpotQA, 2WikiMultiHopQA, and MuSiQue, with a Llama-3.3-70B reader and a single uniform configuration, MOTHRAG attains F1 of 78.1 / 76.3 / 50.5 (average 68.3) — the highest average F1 among commercially-deployable multi-hop QA frameworks, within 0.7 points of the GPU-bound research state of the art (parity on HotpotQA, an edge on 2WikiMultiHopQA, an honest gap on MuSiQue). Measured inference cost is 0.018 per query (−44%) at statistical parity on HotpotQA and 2WikiMultiHopQA, with a measured trade-off only on MuSiQue.
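For readers unfamiliar with the metric, the sketch below assumes the F1 figures are the token-overlap F1 conventionally reported on these benchmarks; the preprint is the authority on the exact evaluation script. It also checks that the three per-dataset scores average to the reported 68.3. The function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between two answer strings."""
    pred, ref = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Per-dataset F1 values quoted in the text above.
reported = {"HotpotQA": 78.1, "2WikiMultiHopQA": 76.3, "MuSiQue": 50.5}
average = round(sum(reported.values()) / len(reported), 1)
print(average)  # 68.3
```

A prediction with one extra token beside a one-token gold answer scores about 0.67 under this definition, which is why F1 is usually read alongside exact match.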

The result is research-grade multi-hop accuracy that is deterministic, auditable, and deployable today on commodity APIs alone.

@misc{geymonat2026mothrag,
  title  = {MOTHRAG: Training-Free Multi-Hop Question Answering at Research-SOTA Parity on Commodity LLM APIs},
  author = {Geymonat, Julian},
  year   = {2026},
  doi    = {10.5281/zenodo.20668567},
  url    = {https://doi.org/10.5281/zenodo.20668567}
}

Ready to upgrade your RAG pipeline? Install MOTHRAG today and bring deterministic, high-accuracy reasoning to your LLM applications.