AI Jun 5, 2025

From LLM-as-a-Judge To Human-in-the-Loop: Rethinking Evaluation in RAG and Search With OpenSearch

Eric Pugh, OpenSource Connections & Fernando Rejon Barrera, Zeta Alpha

Everyone's using LLMs as judges. In this talk, we'll explore techniques for LLM-as-a-judge evaluation in Retrieval-Augmented Generation (RAG) systems, where prompts, filters, and retrieval strategies create endless variations. We'll dive into the LLM-as-a-judge support recently added to the OpenSearch 3.1 Search Relevance Workbench and show how to use it for your RAG solution. But this raises a question: how do you evaluate the judges?

In chess, the Elo rating system calculates the relative skill of players from their game results, with higher ratings indicating stronger players. We introduce RAGElo, an Elo-style ranking framework that uses LLMs to compare outputs without needing gold answers, bringing structure to subjective judgments at scale. We then showcase the integration of RAGElo into the Search Relevance Workbench in OpenSearch 3: a human-in-the-loop toolkit that lets you dig deep into search results, compare configurations, and spot issues that metrics miss. Together, these tools balance automation and intuition, helping you build better retrieval and generation systems with confidence.
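To make the Elo-style idea concrete, here is a minimal sketch of how pairwise LLM-as-a-judge comparisons can be aggregated into ratings over RAG configurations. It uses the standard Elo update formula; the names llm_judge, rank_configs, and the demo pipelines are hypothetical placeholders for illustration, not the actual RAGElo or Search Relevance Workbench APIs.

```python
# Sketch: Elo-style ranking of RAG configurations from pairwise LLM judgments.
# All function names and pipelines below are illustrative placeholders.
import itertools
import random
from typing import Callable, Dict, List, Tuple

K = 32        # K-factor: how far a rating moves after a single comparison
SCALE = 400   # Elo scale constant

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / SCALE))

def update(r_a: float, r_b: float, score_a: float) -> Tuple[float, float]:
    """Update both ratings after one comparison; score_a is 1 (A wins), 0.5 (tie), or 0."""
    e_a = expected_score(r_a, r_b)
    return r_a + K * (score_a - e_a), r_b + K * ((1.0 - score_a) - (1.0 - e_a))

def llm_judge(query: str, answer_a: str, answer_b: str) -> float:
    """Placeholder judge: a real implementation would prompt an LLM with the
    query and both answers and parse its verdict; here a random outcome is
    returned so the sketch runs end to end."""
    return random.choice([1.0, 0.5, 0.0])

def rank_configs(configs: Dict[str, Callable[[str], str]],
                 queries: List[str]) -> Dict[str, float]:
    """Round-robin tournament: every pair of configurations answers every
    query, the judge picks a winner (or tie), and ratings update incrementally."""
    ratings = {name: 1500.0 for name in configs}
    for query in queries:
        for a, b in itertools.combinations(configs, 2):
            score_a = llm_judge(query, configs[a](query), configs[b](query))
            ratings[a], ratings[b] = update(ratings[a], ratings[b], score_a)
    return dict(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))

if __name__ == "__main__":
    demo = {
        "bm25_only": lambda q: f"answer from a BM25-only pipeline for: {q}",
        "hybrid_rerank": lambda q: f"answer from a hybrid + reranker pipeline for: {q}",
    }
    print(rank_configs(demo, ["what is vector search?", "how do I tune recall?"]))
```

In a real setup the placeholder judge would be a prompted model, and the resulting ratings would feed a human-in-the-loop review in the Workbench rather than being trusted blindly.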