Inspire AI: Transforming RVA Through Technology and Automation
Our mission is to cultivate AI literacy in the Greater Richmond Region through awareness, community engagement, education, and advocacy. In this podcast, we spotlight companies and individuals in the region who are pioneering the development and use of AI.
Ep 67 - RAG Done Right: Measure The Evidence Or Drift Into Error
What happens when a brilliant-sounding AI gives the wrong answer with total confidence? We dig into the quiet culprit behind so many “LLM failures”: retrieval. Rather than judging how smart a model sounds, we walk through how to judge whether it looked at the right evidence, why that matters in high-stakes domains like finance, healthcare, HR, and government, and how leaders can stop organizational drift driven by outdated or partial sources.
We break down four pillars every RAG team should track: retrieval precision and recall to balance noise versus coverage; context relevance and coverage to ensure the retrieved passages actually answer the question; groundedness and fluency so every claim traces back to evidence; and accuracy and completeness to catch stale or missing knowledge. Along the way, we share real-world patterns—chatbots citing old HR policies, assistants using superseded regulations, and tools surfacing obsolete medical guidance—and show how these errors spread when confidence outruns curation.
Then we get practical. We outline precision@K and recall@K, golden question sets tied to authoritative documents, LLM-based judging for relevance and groundedness, and continuous regression testing as knowledge bases evolve. More importantly, we frame the cultural shift: assign ownership for knowledge freshness, make sources visible next to answers, and normalize verification at every level. Treat AI answers as drafts, retrieval as evidence, and evaluation as the safeguard. If you’re running or planning a RAG system, start by asking to see retrieved sources, build a small high-stakes golden set, and set a cadence for archiving and updates.
If this conversation helped sharpen your approach to reliable AI, subscribe, share with a teammate who manages content or compliance, and leave a quick review with one insight you’re taking back to your team.
Want to join a community of AI learners and enthusiasts? AI Ready RVA is leading the conversation and is rapidly rising as a hub for AI in the Richmond Region. Become a member and support our AI literacy initiatives.
Welcome back to Inspire AI, the podcast where we help leaders stay calm, capable, and intentional in an AI-accelerated world. Today, I want to start with a story that might feel uncomfortably familiar. Imagine your organization rolls out a new AI assistant. It's fast, it's articulate, it searches your internal documents and gives confident answers to almost any question. The demo goes great. Leadership is impressed. Then, a few months later, an employee asks a simple question about company policy, and the AI confidently answers, citing a document that looks official. There's just one problem. That document is from 2010. And the policy has changed three times since then. Suddenly, what looked like a powerful productivity tool has become a liability. This isn't a language model problem, it's a retrieval problem. And that's what today's episode is all about. Here's the central idea to hold on to as we go forward. A confident answer backed by the wrong document is worse than no answer at all. Think about that. In retrieval-augmented generation, or RAG, systems, the quality of the answer is completely constrained by the quality of what gets retrieved. The model doesn't know what's right, it only knows what it's given. So today we're shifting the evaluation lens from "does the AI sound smart?" to "did the AI actually find the right information to base its answer on?" And this is what evaluation is all about. When AI systems fail, our instinct is to blame the model. But in practice, most so-called LLM failures in production RAG systems trace back to the retrieval layer. Two patterns show up again and again. The system never retrieved the relevant document at all. Or the system retrieved something that looked relevant but wasn't current, correct, or applicable. The model then does exactly what it's designed to do. It fills in the gaps fluently. That's how you get answers that sound plausible, authoritative, but completely wrong.
And this matters because RAG is no longer experimental. Over half of enterprise AI systems now rely on it, often in high-stakes domains like finance, healthcare, HR, or even government. When retrieval fails at scale, the AI doesn't just make mistakes, it amplifies institutional errors. These are real-world consequences. Think about it. What's really happening in those moments? An outdated policy document gets pulled into the context. A superseded regulation gets treated as current law. Or an old medical guideline gets framed as best practice. The AI doesn't hesitate, it doesn't hedge, it delivers the answer with confidence. And that confidence is what makes the failure so dangerous. We've seen this in real systems: enterprise chatbots misquoting HR policies, finance assistants referencing outdated reimbursement rules, even healthcare decision support tools surfacing obsolete treatment guidance. In each case, the issue wasn't intelligence, it was grounding. So what do we actually need to evaluate? How do we evaluate RAG systems responsibly? We need to stop treating the final answer as the only artifact that matters. Instead, there are several dimensions leaders and practitioners of these systems should care about. The first one is retrieval precision and recall. Precision asks: of the documents retrieved, how many were actually relevant? Recall asks: of all the relevant documents that exist, how many did the system find? Low recall means missing information, and hallucinations fill the gap. Low precision means noise, and the model gets distracted or misled. A healthy RAG system balances both precision and recall. The second dimension is context relevance and coverage. Even if documents are technically relevant, do they actually answer the question being asked? And do they cover all the important parts of that question? Partial context leads to partial truth, which is often more dangerous than obvious error.
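The precision and recall definitions above can be sketched in a few lines. This is a minimal illustration, not a production evaluator: the document IDs are hypothetical, and it assumes you already know the "gold" set of relevant documents for a query (for example, from a hand-labeled golden question set).

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Return (precision, recall) for a single query.

    precision: of the documents retrieved, what fraction were relevant?
    recall:    of all relevant documents, what fraction were retrieved?
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Example: the retriever returned 4 documents, 2 of which are relevant,
# and there were 3 relevant documents in total for this question.
p, r = precision_recall(
    ["hr-2024", "hr-2010", "travel-faq", "benefits-2024"],  # retrieved
    ["hr-2024", "benefits-2024", "parental-leave-2023"],    # actually relevant
)
print(p, r)  # → 0.5 0.6666666666666666
```

Notice how the two failure modes from the episode show up directly: the stale `hr-2010` document drags precision down, while the missed `parental-leave-2023` document drags recall down.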
The third dimension is groundedness and fluency. This is where many teams get fooled. Fluency is easy, but groundedness is hard. A grounded answer is one where every claim can be traced back to the retrieved evidence. If the model adds even a single unsupported assumption to make the answer flow better, you've lost reliability. So a stiff but faithful answer is infinitely more valuable than a smooth hallucination. The fourth dimension is accuracy and completeness. Even perfectly grounded answers can still be wrong if, as I've said before, the underlying documents are outdated or incomplete. This is where knowledge management and AI evaluation intersect. RAG systems don't just retrieve information, they preserve and sometimes resurrect organizational memory. So what does evaluation look like in practice? The good news is this isn't theoretical anymore. Teams are already using retrieval metrics like precision@K and recall@K. Think of K as how many documents we hand to the model: if K equals 5, the system says, here are the top five documents I think matter, and everything we measure happens within those K results. Teams are also using golden question sets tied to known source documents, LLM-based judges to sanity-check relevance and groundedness, and continuous regression testing as knowledge bases evolve. The key shift is cultural, not technical. Stop asking, does the AI answer questions? Start asking, how do we know it's looking at the right evidence? Here's where this becomes a real issue for leaders and practitioners. Poor retrieval is not just a technical flaw, it's a knowledge risk. If your AI consistently surfaces outdated or partial information, it doesn't just reflect institutional memory decay, it accelerates it. Over time, people begin trusting yesterday's truth more than today's reality. That's how organizations drift: quietly, confidently, and incorrectly.
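The precision@K and recall@K workflow described above can be wired into a regression test that runs whenever the knowledge base changes. The sketch below is illustrative only: the golden questions, document IDs, and `fake_retrieve` stand-in are hypothetical placeholders for your real retriever and labeled data.

```python
# Golden question set: each question mapped to the IDs of the documents
# that a correct answer must be grounded in (hand-curated, hypothetical).
GOLDEN_SET = {
    "What is the current parental leave policy?": {"hr-policy-2024"},
    "How do I submit a travel reimbursement?": {"travel-2023", "finance-faq"},
}

def precision_recall_at_k(retrieve, golden_set, k=5):
    """Average precision@K and recall@K across the golden questions."""
    precisions, recalls = [], []
    for question, relevant in golden_set.items():
        top_k = set(retrieve(question)[:k])  # the K documents handed to the model
        hits = top_k & relevant
        precisions.append(len(hits) / k)
        recalls.append(len(hits) / len(relevant))
    n = len(golden_set)
    return sum(precisions) / n, sum(recalls) / n

# Toy retriever standing in for a real vector store or search index.
def fake_retrieve(question):
    return {
        "What is the current parental leave policy?": ["hr-policy-2024", "hr-policy-2010"],
        "How do I submit a travel reimbursement?": ["travel-2023", "benefits-faq"],
    }[question]

p_at_5, r_at_5 = precision_recall_at_k(fake_retrieve, GOLDEN_SET, k=5)
# Gate deployments on a minimum recall so a stale index fails loudly
# instead of silently answering from the wrong evidence.
assert r_at_5 >= 0.5, "retrieval regression: golden-set recall dropped"
```

Run against a small set of high-stakes questions, a check like this turns "did retrieval quietly degrade?" into a test that fails before users ever see the answer.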
Evaluating retrieval is how leaders prevent that drift. So what do you do next Monday? If you're running or planning a RAG system, ask to see retrieved sources, not just answers. Build a small golden data set of high-stakes questions. Assign ownership for knowledge freshness and archival. Normalize verification, even at the executive level. When leaders model source checking, teams follow. So as we wrap up, here's the principle to remember: AI answers are drafts. Retrieval is the evidence. Evaluation is the safeguard. When you measure what AI knows, not just what it says, you turn a risky black box into a trustworthy partner. Alright. Until next time, stay curious, keep innovating, and keep strengthening the judgment that makes intelligent systems truly useful.