May 10, 2026
Intent-based chaos testing is designed for when AI behaves confidently — and wrongly

Here is a scenario that should concern every enterprise architect shipping autonomous AI systems right now: An observability agent is running in production. Its job is to detect infrastructure anomalies and trigger the appropriate response. Late one night, it flags an elevated anomaly score across a production cluster, 0.87, above its defined threshold of 0.75. The agent is within its permission boundaries. It has access to the rollback service. So it uses it. The rollback causes a four-hour outage.

The anomaly it was responding to was a scheduled batch job the agent had never encountered before. There was no actual fault. The agent did not escalate. It did not ask. It acted, confidently, autonomously, and catastrophically.

What makes this scenario particularly uncomfortable is that the failure was not in the model. The model behaved exactly as trained. The failure was in how the system was tested before it reached production. The engineers had validated happy-path behavior, run load tests, and done a security review. What they had not done was ask: what does this agent do when it encounters conditions it was never designed for? That question is the gap I want to talk about.

Why the industry has its testing priorities backwards

The enterprise AI conversation in 2026 has largely collapsed into two areas: identity governance (who is the agent acting as?) and observability (can we see what it's doing?). Both are legitimate concerns. Neither addresses the more fundamental question of whether your agent will behave as intended when production stops cooperating.

The Gravitee State of AI Agent Security 2026 report found that only 14.4% of agents go live with full security and IT approval. A February 2026 paper from 30-plus researchers at Harvard, MIT, Stanford, and CMU documented something even more unsettling: well-aligned AI agents drift toward manipulation and false task completion in multi-agent environments purely from incentive structures, no adversarial prompting required. The agents weren't broken. The system-level behavior was the problem.

This is the distinction that matters most for builders of agentic infrastructure: a model can be aligned and a system can still fail. Local optimization at the model level does not guarantee safe behavior at the system level. Chaos engineers have known this about distributed systems for fifteen years. We are relearning it the hard way with agentic AI.

The reason our current testing approaches fall short is not that engineers are cutting corners. It is that three foundational assumptions embedded in traditional testing methodology break down completely with agentic systems:

- Determinism. Traditional testing assumes that given the same input, a system produces the same output. A large language model (LLM)-backed agent produces probabilistically similar outputs. This is close enough for most tasks, but dangerous for edge cases in production where an unexpected input triggers a reasoning chain no one anticipated.

- Isolated failure. Traditional testing assumes that when component A fails, it fails in a bounded, traceable way. In a multi-agent pipeline, one agent's degraded output becomes the next agent's poisoned input. The failure compounds and mutates. By the time it surfaces, you are debugging five layers removed from the actual source.

- Observable completion. Traditional testing assumes that when a task is done, the system accurately signals it. Agentic systems can, and regularly do, signal task completion while operating in a degraded or out-of-scope state, as the sketch after this list illustrates.
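To make that last assumption concrete, here is a minimal sketch of the gap. The class, function names, and action strings are hypothetical, not taken from any particular framework:

```python
from dataclasses import dataclass, field

@dataclass
class AgentRunResult:
    status: str                                      # what the agent reports: "completed" or "failed"
    actions_taken: set = field(default_factory=set)  # what the run actually did

def naive_check(result: AgentRunResult) -> bool:
    # Traditional assumption: the completion signal can be trusted.
    return result.status == "completed"

def intent_aware_check(result: AgentRunResult, allowed_actions: set) -> bool:
    # Also verify the run stayed inside the behavior the agent was designed for.
    return result.status == "completed" and result.actions_taken <= allowed_actions

# The opening scenario: a rollback of a healthy cluster, reported as success.
run = AgentRunResult(status="completed", actions_taken={"rollback:prod-cluster"})
print(naive_check(run))                                          # True
print(intent_aware_check(run, {"read:metrics", "page:oncall"}))  # False
```

Both checks see the same run; only the second notices that the behavior, not the signal, is what went wrong.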
The MIT NANDA project has a term for this: "confident incorrectness." I have a less polite term for it: the thing that causes the 4am incident that takes three hours to trace.

Intent-based chaos testing exists to address exactly these failure modes, before your agents reach production.

The core concept: Measuring deviation from intent, not just from success

Chaos engineering as a discipline is not new. Netflix built Chaos Monkey in 2011. The principle is straightforward: deliberately inject failure into your system to discover its weaknesses before users find them. What is new, and what the industry has not yet applied rigorously to agentic AI, is calibrating chaos experiments not just to infrastructure failure scenarios, but to behavioral intent.

The distinction is critical. When a traditional microservice fails under a chaos experiment, you measure recovery time, error rates, and availability. When an agentic AI system fails, those metrics can look perfectly normal while the agent is operating completely outside its intended behavioral boundaries: zero errors, normal latency, catastrophically wrong decisions.

This is the concept behind a chaos scale system calibrated not just to failure severity, but to how far a system's behavior deviates from its intended purpose. I call the output of that measurement an intent deviation score.

Here is what that looks like in practice. Before running any chaos experiment against an enterprise observability agent, you define five behavioral dimensions that together describe what "acting correctly" means for that specific agent in its specific deployment context:

| Behavioral dimension | What it measures | Weight |
| --- | --- | --- |
| Tool call deviation | Are tool calls diverging from expected sequences under stress? | 30% |
| Data access scope | Is the agent accessing data outside its authorized boundaries? | 25% |
| Completion signal accuracy | When the agent reports success, is it actually in a valid state? | 20% |
| Escalation fidelity | Is the agent escalating to humans when it encounters ambiguity? | 15% |
| Decision latency | Is time-to-decision within expected bounds given current conditions? | 10% |

The weights are not arbitrary. They reflect the risk profile of the specific agent. For a read-only analytics agent, you might weight data access scope lower. For an agent with write access to production systems, completion signal accuracy and escalation fidelity are where failures become outages. The point is that you define these dimensions before you inject any failure, based on what the agent is actually supposed to do.

The deviation score is computed as a weighted average of how far each observed dimension has drifted from its baseline (the sketch below assumes each dimension is normalized to a 0-to-1 range):

```python
def compute_intent_deviation_score(
    baseline: dict[str, float],
    observed: dict[str, float],
    weights: dict[str, float],
) -> float:
    """Compute how far an agent's behavior has drifted from its intended
    baseline. Returns a score from 0.0 (no deviation) to 1.0 (complete
    intent violation). This is NOT a performance metric.
    """
    # Per-dimension drift is the absolute difference between the observed
    # value and the baseline, assuming both are normalized to [0, 1].
    total_weight = sum(weights.values())
    weighted_drift = sum(
        weights[dim] * min(abs(observed[dim] - baseline[dim]), 1.0)
        for dim in weights
    )
    return min(weighted_drift / total_weight, 1.0)
```
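As a rough usage sketch, scoring a single chaos run against that agent might look like the following. The dimension keys and observed values are hypothetical; only the weights come from the table above:

```python
# Weights mirror the table above (an agent with write access to production).
weights = {
    "tool_call_deviation": 0.30,
    "data_access_scope": 0.25,
    "completion_signal_accuracy": 0.20,
    "escalation_fidelity": 0.15,
    "decision_latency": 0.10,
}

baseline = {dim: 0.0 for dim in weights}  # no drift under normal conditions

# Hypothetical readings from one chaos experiment, normalized to [0, 1].
observed = {
    "tool_call_deviation": 0.6,         # unexpected rollback call sequence
    "data_access_scope": 0.0,           # stayed inside authorized data
    "completion_signal_accuracy": 0.8,  # reported success in a degraded state
    "escalation_fidelity": 1.0,         # never escalated despite ambiguity
    "decision_latency": 0.1,
}

score = compute_intent_deviation_score(baseline, observed, weights)
print(f"intent deviation score: {score:.2f}")  # 0.50
```

In this sketch, a run with zero errors and normal latency still lands at 0.50, which is exactly the kind of deviation that recovery time, error rate, and availability metrics would never surface.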
