Find Your RAG Pipeline’s Weakest Links with RAGProbe

You have built a sophisticated Retrieval-Augmented Generation (RAG) pipeline, but how do you know it will actually hold up under real-world pressure? Relying on manually written test cases is a losing battle; you can only test for the edge cases you have already anticipated.

The Manual Testing Bottleneck

Traditional evaluation methods often require significant manual effort to create diverse question sets. As discussed in our guide on why traditional software testing fails for AI, the complexity of LLM outputs makes standard unit tests insufficient. You need a way to uncover hidden failure modes—like multi-hop reasoning errors or distractor interference—without writing every single question yourself.

Automated Testing with RAGProbe

Enter RAGProbe, a specialized tool designed to find the questions your RAG pipeline will fail on—before your users do. Unlike other tools that require manual test authorship, RAGProbe analyzes your chunk corpus topology. By looking at how chunks relate to each other in embedding space, it generates adversarial questions targeting specific structural weaknesses.

multi_hop: Targets distant chunk pairs that must be combined to find an answer.
buried_fact: Finds facts hidden within a sea of near-duplicate distractors.
distractor: Identifies moderately similar chunks that act as tempting wrong answers.
near_miss: Tests retrieval boundaries at high-betweenness chokepoint chunks.

Implementation and Setup

Getting started is straightforward if you are familiar with Python environments. First, clone the repository from the official RAGProbe repository and install the package locally.

git clone https://github.com/rishavsunny12/ragProbe.git
cd ragProbe
pip install -e .

# 1. Embed chunks and build the topology graph (no API key)
ragprobe index ./chunks.jsonl

# 2. Generate adversarial questions from the graph
ragprobe generate --llm openai/gpt-4o-mini

# 3. Run questions against your RAG pipeline and grade answers
ragprobe run .ragprobe/questions.jsonl \
  --pipeline http://localhost:8000/query \
  --grader-llm openai/gpt-4o-mini \
  --output baseline.json

# 4. Later, compare a new run against the baseline to catch regressions
ragprobe run .ragprobe/questions.jsonl \
  --pipeline http://localhost:8000/query \
  --grader-llm openai/gpt-4o-mini \
  --output current.json
ragprobe diff baseline.json current.json --fail-on-regression 5

# 5. (optional) Prove the topology-aware questions are actually harder
ragprobe calibrate --pipeline http://localhost:8000/query \
  --llm openai/gpt-4o-mini \
  --grader-llm openai/gpt-4o-mini

Configuring Your LLM Providers

RAGProbe leverages LiteLLM, allowing you to use a variety of models for generation and grading. You can use powerful cloud models like GPT-4o or run entirely local setups using an Ollama server for privacy and cost savings.

Command	Flag	Purpose
generate	–llm	Draft adversarial questions
run	–grader-llm	Grade pipeline answers
calibrate	–llm	Compare difficulty sets

The Testing Workflow

To use RAGProbe effectively, your RAG pipeline must be running and accessible via HTTP. The tool simply POSTs questions to the endpoint you provide, such as your local query service, and parses the JSON response.

Achieving Regression-Free RAG

The true power of RAGProbe lies in its ability to act as a regression tester. By saving a baseline JSON file, you can compare new runs against previous versions. This ensures that any changes to your chunking strategy or embedding model do not introduce new failures into your system.

Stop guessing if your RAG pipeline is robust. Start probing it with RAGProbe today!