Inspire AI: Transforming RVA Through Technology and Automation

Ep 74 - Agentic Workflows: Did The Job Actually Get Done?

AI Ready RVA Season 2 Episode 13


An AI agent that confidently says “done” can still be the most expensive kind of wrong. We start with a simple test of reality: when an agent updates a policy document, who was notified, what changed, what got logged, and what state did it actually leave behind? That gap between a polished response and a verified result is where agent hype turns into operational risk.

We walk through task-based evaluation, the practical way to measure agentic workflows that act through tools and trigger real system changes. The key framework is defining every task with a goal state (what must be true at the end) and a constraint set (what must never happen on the way). From there, we build a metrics stack that goes beyond “did it sound helpful” into what engineering teams can defend: task success rate, P95 completion time, tool-use correctness, step-level accuracy, partial progress, and especially catastrophic failure rate. If 10% of runs cause irreversible damage, the system isn’t “90% successful,” it’s not deployable.
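The show notes above mention a metrics stack; here is a minimal sketch of how a team might compute one from recorded evaluation runs. The RunResult fields and the metrics_stack helper are illustrative names for this episode's ideas, not part of any specific framework.

```python
import math
from dataclasses import dataclass

@dataclass
class RunResult:
    succeeded: bool            # goal state reached at the end of the run
    duration_s: float          # wall-clock completion time in seconds
    steps_correct: int         # steps that matched the expected trajectory
    steps_total: int
    caused_catastrophe: bool   # violated a non-negotiable constraint

def metrics_stack(runs: list[RunResult]) -> dict:
    n = len(runs)
    durations = sorted(r.duration_s for r in runs)
    # Nearest-rank P95: 95% of runs finish at or under this time.
    p95 = durations[math.ceil(0.95 * n) - 1]
    return {
        "task_success_rate": sum(r.succeeded for r in runs) / n,
        "p95_completion_s": p95,
        "step_accuracy": sum(r.steps_correct for r in runs) / max(1, sum(r.steps_total for r in runs)),
        "catastrophic_failure_rate": sum(r.caused_catastrophe for r in runs) / n,
    }
```

Note the last line: catastrophic failure rate is reported on its own rather than folded into the success rate, which is exactly why a 90%-successful system that sometimes causes irreversible damage still fails the stack.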

Evaluation also can’t be a one-time checkpoint. We map a full lifecycle from offline testing to simulation and staging, then canary releases, and finally production monitoring with continuous evaluation. Along the way we call out the hidden killer: collateral damage, when the agent completes the main task but breaks something adjacent. We close by zooming out to AI governance and leadership decisions, including autonomy tiers and the principle that autonomy must be earned through evidence, not assumed through capability.
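To make the lifecycle concrete, here is a minimal sketch of a stage-promotion gate, assuming you already compute something like the metrics_stack() output above. The stage names and thresholds are illustrative assumptions; every team sets its own non-negotiables.

```python
STAGES = ["offline", "simulation_staging", "canary", "production"]

def may_promote(metrics: dict, current_stage: str) -> bool:
    """Allow promotion to the next stage only when the evidence supports it."""
    if current_stage == STAGES[-1]:
        return False                                  # nothing beyond production; keep monitoring
    if metrics["catastrophic_failure_rate"] > 0.0:
        return False                                  # one irreversible failure blocks promotion outright
    return metrics["task_success_rate"] >= 0.95       # illustrative bar; tighten it per stage as needed
```

The gate encodes the episode's principle directly: autonomy (the next stage of exposure) is earned through evidence, never assumed.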

Subscribe to Inspire AI, share this with a builder who ships agents, and leave a review with the metric you think most teams ignore. What’s your non-negotiable constraint for autonomous systems?

Want to join a community of AI learners and enthusiasts? AI Ready RVA is leading the conversation and is rapidly rising as a hub for AI in the Richmond Region. Become a member and support our AI literacy initiatives.

Define Goals And Constraints

Metrics That Measure Real Work

Evaluation From Lab To Production

Why Agents Fail In Practice

Governance And Earning Autonomy

Closing Challenge For Builders

SPEAKER_00

Welcome back to Inspire AI, the podcast where we turn complex shifts in artificial intelligence into clear thinking and practical leadership decisions. Today's episode asks a deceptively simple question: did the job actually get done? Because in the world of AI agents, that question is the difference between real capability and a very convincing illusion. We're going to explore the transformation from outputs to outcomes.

Imagine it. You ask an AI agent to update a policy document, notify your team, and log the change. It responds beautifully: the document has been updated, your team has been notified, and the change has been logged. Sounds okay. Moving on. But wait a second. First you check the document. What changed exactly? Did it get it right? Then, who was notified? How were they notified? Where were they notified? Let's look back at the logs, because I'm not sure I even follow what the thing did. Just because the agent said it's done doesn't mean it didn't fail silently. And that's exactly why we need to rethink evaluation.

Most AI evaluation today is still stuck in the world of responses. Did the answer sound good? Was it accurate? Was it clear? Did it feel helpful? But in the shift from outputs to outcomes, agents don't just answer questions. They act. They call tools, they modify systems, they trigger workflows, and they change real-world state. So where before you asked, did it say the right thing, now you have to ask, did it change the world correctly? And that's the foundation of task-based evaluation.

The most important idea of this episode is that every task must be defined by two things: a goal state, which can be summed up as what must be true at the end, and a constraint set, which can be summed up as what must not happen along the way. Take this example. Goal: policy document updated. Constraints: no data loss, no unauthorized changes, correct version control. If you don't define both, you're not evaluating, you're basically just guessing.

Now, this isn't just a technical nuance. It's a leadership issue. Because if you're deploying AI agents in your organization, you're delegating action, you're introducing risk, and you're scaling decision making. And as outlined in my Inspire AI vision, this isn't just about tools. It's about judgment under change. Task-based evaluation is really about how you protect that judgment.

So why is this hard? It's hard because agents don't fail because they're unintelligent. They fail because they operate across multi-step workflows, and every step introduces risk: planning errors, wrong tool selection, incorrect inputs, misinterpreted outputs, forgotten context. And the real danger is that these errors compound.

Think about it. You need a stack of metrics, not a single score. Here are some to consider. Task success rate: did the agent actually complete the job? Completion time: how long did it take, and what happens in the worst case? Tech teams call this P95, the 95th percentile: as a performance metric, it means 95% of requests finish faster than that time while the slowest 5% exceed it. It's a realistic measure of user experience because it filters out extreme outliers and reveals the latency the vast majority of users actually see. Then there's tool-use correctness. Did it pick the right tool? Did it use it correctly? And did it sequence actions properly?
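As a rough illustration of the goal-state-plus-constraint-set idea, here is a minimal sketch in Python. The TaskSpec structure, the state dictionary, and the predicate names mirror the policy-document example but are assumptions for illustration, not any product's API.

```python
from dataclasses import dataclass, field
from typing import Callable

State = dict  # illustrative: a snapshot of the world the agent acts on

@dataclass
class TaskSpec:
    name: str
    goal_state: Callable[[State], bool]                 # what must be true at the end
    constraints: list[Callable[[State], bool]] = field(default_factory=list)  # what must never be violated

def evaluate(task: TaskSpec, end_state: State) -> dict:
    return {
        "goal_reached": task.goal_state(end_state),
        "constraints_held": all(check(end_state) for check in task.constraints),
    }

# The policy-document example: one goal, three constraints.
update_policy = TaskSpec(
    name="update_policy_document",
    goal_state=lambda s: bool(s.get("policy_updated") and s.get("team_notified") and s.get("change_logged")),
    constraints=[
        lambda s: not s.get("data_lost", False),             # no data loss
        lambda s: not s.get("unauthorized_change", False),    # no unauthorized changes
        lambda s: s.get("version_incremented", False),        # correct version control
    ],
)
```

If you only check goal_reached, you're back to grading the answer; checking constraints_held is what separates evaluating from guessing.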
Next we have step-level accuracy. Where did it go wrong along the way? Then we think about partial progress. How far did it get before failing? That's a big one for your context window considerations. And then finally, catastrophic failure rate. Obviously, this one matters most. What are the failures that are unacceptable no matter what? Because a system that succeeds 90% of the time but occasionally deletes the wrong database is not a 90% system. It's a non-deployable system.

We also have to think about evaluation as not a one-time event. It doesn't end before deployment. It evolves. The stages of evaluation go like this. Stage one is offline testing: controlled environments, reproducible tasks. Stage two is simulation and staging, where you test edge cases and failure modes. Stage three is your canary releases: small-scale, real-world exposure. And stage four is production monitoring: continuous evaluation in the wild. This is where many teams fail. They treat evaluation as a checkpoint instead of a system.

Your hidden risk is collateral damage, something teams typically miss: the agent completes the task but still causes damage. Here's an example or two. The agent updates the correct file, but deletes another. Or it sends the right email to the wrong audience. It fixes a bug, but breaks a dependency. So evaluation must include: did anything else break? This is where true maturity begins. And in some circles you might say this is where vibe coding ends and real engineering begins.

So I want to make that concrete. Here's where agents break. They typically fail in these five ways. Planning failures, where the wrong steps were identified and missed dependencies triggered failures. Execution failures, where the agent took invalid or incorrect actions. Interpretation failures: misreading tool outputs. Memory failures: losing track of context. And safety failures: breaking constraints to complete the task. None of these really happen in isolation. They cascade.

So how might we fix this? You can't fix it with better prompts alone. You really need better systems. Every action must be validated. As a mental pattern: you plan, you execute, and you verify. Use strict schemas and constrained inputs, and organize with typed tools. Limit what the agent can do, even though it might slow you down; use a system of least privilege. And log everything: all the actions, all the tool calls, all the state changes. Full observability is non-negotiable. If you can't replay failures, you can't improve them.

It's going to take a mindset shift. Because your evaluation isn't a demo, it's an experiment. Think about your hypothesis. What should you improve? Consider your controls. What are the same tasks and same conditions that you're working under every time? Have clear success criteria in the form of metrics. And ask yourself, is this real improvement? Because without that rigor, you'll optimize for noise.

I want to zoom out for a minute and think about the real spirit of Inspire AI. Evaluation is not just technical, it's organizational governance. You must define what counts as a catastrophic failure, what triggers a rollback, and what level of autonomy is acceptable. Here we might think in tiers. Low risk: advisory agents. Medium risk: assisted execution. High risk: autonomous systems. The key principle is that autonomy must be earned through evidence, not assumed through capability. Think about that.
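Here is a minimal sketch of the plan, execute, verify pattern with an explicit tool allow-list and full action logging. The tool names, plan shape, and verify callback are illustrative assumptions, not a particular agent framework's API.

```python
import json
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent_runs")

ALLOWED_TOOLS = {"read_doc", "write_doc", "notify_team"}   # least privilege: explicit allow-list

def run_plan(plan: list[dict], tools: dict[str, Callable], verify: Callable[[dict, object], bool]) -> bool:
    """Execute each planned step, log it, and verify the state change before moving on."""
    for step in plan:
        name = step["tool"]
        if name not in ALLOWED_TOOLS:
            log.error("blocked disallowed tool: %s", name)          # safety failure caught up front
            return False
        result = tools[name](**step.get("args", {}))                 # execute
        log.info("action=%s args=%s result=%r", name, json.dumps(step.get("args", {})), result)
        if not verify(step, result):                                  # verify the step, not just the final answer
            log.error("verification failed at step: %s", name)
            return False
    return True
```

The log lines are what make failures replayable; the per-step verify callback is where "did anything else break?" gets checked instead of assumed.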
Because what you're actually building isn't agents. You're building decision systems, trust systems, risk systems. And over time, our Inspire AI vision makes this clear: we're helping people remain effective as intelligence becomes ambient. And I'm thinking back to something I said earlier: we don't just ship agents. Keep innovating, and make sure the systems you build don't just sound right, but actually get the job done.