Building Conversational AI Agents with Thread-Level Eval Metrics
🎥 Recorded live at MLOps World | GenAI Summit 2025, Austin, TX (October 8, 2025)

Session Title: Building Conversational AI Agents with Thread-Level Eval Metrics

Speakers:
• Tony Kipkemboi, Head of Developer Relations, CrewAI
• Claire Longo, Lead AI Researcher, Comet

Talk Track: Agents in Production

Abstract:
Building conversational AI agents that behave reliably and intelligently requires more than tracing single prompts; it demands thread-level evaluation of entire conversations. In this joint session, Tony Kipkemboi from CrewAI and Claire Longo from Comet present a practical framework that combines agentic orchestration and rigorous evaluation to improve the real-world performance of conversational agents.

On the CrewAI side, Tony demonstrates how to design multi-agent workflows, assign specialized roles, and orchestrate dynamic reasoning chains that reflect real business processes. On the Comet Opik side, Claire introduces custom evaluation metrics, LLM-as-a-Judge techniques, and human-in-the-loop feedback systems that make it possible to measure conversational quality holistically, from intent alignment to outcome success. Together, they show how to close the loop between agent design, experimentation, and evaluation, empowering developers to move from promising prototypes to robust production systems.

What you'll learn:
• How to design and evaluate conversational AI agents with real-world reliability
• The difference between trace-level and thread-level evaluation
• How to combine CrewAI for orchestration and Comet Opik for custom eval metrics
• How to capture and integrate human-in-the-loop feedback
• Why orchestration and evaluation must be treated as two halves of the same workflow
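
To make the orchestration half concrete, here is a minimal CrewAI sketch of a two-agent conversational workflow with specialized roles. The roles, goals, and task descriptions are hypothetical placeholders, and the example assumes CrewAI's standard Agent/Task/Crew interface rather than anything shown in the session itself.

```python
# Minimal CrewAI sketch: two specialized agents orchestrated sequentially.
# Roles, goals, and task text are hypothetical placeholders.
from crewai import Agent, Task, Crew, Process

# Agent that interprets the user's request and extracts intent.
intent_agent = Agent(
    role="Intent Analyst",
    goal="Identify what the user is actually asking for",
    backstory="You specialize in understanding customer support requests.",
)

# Agent that drafts the reply based on the analyzed intent.
responder_agent = Agent(
    role="Support Responder",
    goal="Write a clear, accurate answer to the user's request",
    backstory="You turn analyzed intents into helpful responses.",
)

analyze_task = Task(
    description="Analyze the user message: {user_message}",
    expected_output="A short summary of the user's intent",
    agent=intent_agent,
)

respond_task = Task(
    description="Write a reply addressing the identified intent.",
    expected_output="A concise response to the user",
    agent=responder_agent,
)

# Sequential process: the responder receives the analyst's output as context.
crew = Crew(
    agents=[intent_agent, responder_agent],
    tasks=[analyze_task, respond_task],
    process=Process.sequential,
)

result = crew.kickoff(inputs={"user_message": "My invoice shows the wrong amount."})
print(result)
```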

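On the evaluation half, thread-level metrics score the whole conversation rather than a single prompt/response pair. The sketch below illustrates that idea as a custom metric, assuming Opik's custom-metric interface (BaseMetric and ScoreResult); the scoring logic is a hypothetical stand-in for the LLM-as-a-Judge call a real metric would make.

```python
# Sketch of a custom thread-level metric, assuming Opik's custom-metric
# interface (BaseMetric / ScoreResult). The heuristic below is a placeholder
# for an LLM-as-a-Judge call that would rate the full conversation thread.
from opik.evaluation.metrics import base_metric, score_result


class ThreadResolutionMetric(base_metric.BaseMetric):
    """Scores whether a conversation thread ended with the user's goal met."""

    def __init__(self, name: str = "thread_resolution"):
        super().__init__(name=name)

    def score(self, thread: list[dict], **kwargs) -> score_result.ScoreResult:
        # `thread` is a hypothetical list of {"role", "content"} turns.
        last_user = [t["content"] for t in thread if t["role"] == "user"][-1]
        # Placeholder heuristic; in practice this would send the whole thread
        # to an LLM judge and parse its verdict.
        resolved = "thanks" in last_user.lower()
        return score_result.ScoreResult(
            name=self.name,
            value=1.0 if resolved else 0.0,
            reason="Heuristic check of the final user turn (placeholder for an LLM judge).",
        )


# Usage: score one recorded conversation thread.
metric = ThreadResolutionMetric()
example_thread = [
    {"role": "user", "content": "My invoice shows the wrong amount."},
    {"role": "assistant", "content": "I've corrected the invoice and resent it."},
    {"role": "user", "content": "Thanks, that fixed it."},
]
print(metric.score(example_thread).value)
```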