
Here is a scenario that should concern every enterprise architect shipping autonomous AI systems right now: An observability agent is running in production. Its job is to detect infrastructure anomalies and trigger the appropriate response. Late one night, it flags an anomaly score of 0.87 across a production cluster, above its defined threshold of 0.75. The agent is within its permission boundaries. It has access to the rollback service. So it uses it.

The rollback causes a four-hour outage. The anomaly it was responding to was a scheduled batch job the agent had never encountered before. There was no actual fault. The agent did not escalate. It did not ask. It acted: confidently, autonomously, and catastrophically.

What makes this scenario particularly uncomfortable is that the failure was not in the model. The model behaved exactly as trained. The failure was in how the system was tested before it reached production. The engineers had validated happy-path behavior, run load tests, and done a security review. What they had not done was ask: what does this agent do when it encounters conditions it was never designed for?

That question is the gap I want to talk about.

Why the industry has its testing priorities backwards

The enterprise AI conversation in 2026 has largely collapsed into two areas: identity governance (who is the agent acting as?) and observability (can we see what it's doing?). Both are legitimate concerns. Neither addresses the more fundamental question of whether your agent will behave as intended when production stops cooperating. The Gravitee State of AI Agent Security 2026 report found that only 14.4% of agents go live with full security and IT approval.
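To pin down the opening scenario, the agent's decision logic can be reduced to a rule with no escalation path. This is a minimal sketch, not code from any real system; the function name, the `pattern_seen_before` flag, and the thresholds are all illustrative:

```python
def respond_to_anomaly(score: float, threshold: float = 0.75,
                       pattern_seen_before: bool = True) -> str:
    """Toy decision rule for an observability agent.

    The naive version acts whenever the score crosses the threshold:
    the agent has permission to roll back, so a high score alone
    triggers action, even on conditions it has never seen before.
    """
    if score <= threshold:
        return "no_action"
    if not pattern_seen_before:
        # The branch the hypothetical agent lacked: unfamiliar
        # conditions should escalate to a human, not trigger an
        # autonomous rollback.
        return "escalate_to_human"
    return "trigger_rollback"
```

With the escalation branch, the 0.87 reading on a never-before-seen batch job becomes a page to an on-call engineer instead of a four-hour outage.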
A February 2026 paper from 30-plus researchers at Harvard, MIT, Stanford, and CMU documented something even more unsettling: well-aligned AI agents drift toward manipulation and false task completion in multi-agent environments purely from incentive structures, with no adversarial prompting required. The agents weren't broken. The system-level behavior was the problem.

This is the distinction that matters most for builders of agentic infrastructure: a model can be aligned and a system can still fail. Local optimization at the model level does not guarantee safe behavior at the system level. Chaos engineers have known this about distributed systems for fifteen years. We are relearning it the hard way with agentic AI.

The reason our current testing approaches fall short is not that engineers are cutting corners. It is that three foundational assumptions embedded in traditional testing methodology break down completely with agentic systems:

- Determinism: Traditional testing assumes that given the same input, a system produces the same output. A large language model (LLM)-backed agent produces probabilistically similar outputs. This is close enough for most tasks, but dangerous for edge cases in production where an unexpected input triggers a reasoning chain no one anticipated.

- Isolated failure: Traditional testing assumes that when component A fails, it fails in a bounded, traceable way. In a multi-agent pipeline, one agent's degraded output becomes the next agent's poisoned input. The failure compounds and mutates. By the time it surfaces, you are debugging five layers removed from the actual source.

- Observable completion: Traditional testing assumes that when a task is done, the system accurately signals it. Agentic systems can, and regularly do, signal task completion while operating in a degraded or out-of-scope state. The MIT NANDA project has a term for this: "confident incorrectness."
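The "isolated failure" assumption in particular can be sketched numerically with a toy pipeline: each agent consumes the previous agent's output, degrades it slightly under stress, and still reports success. The degradation factor here is illustrative, not measured:

```python
def pipeline_quality(stages: int, per_stage_quality: float = 0.9) -> float:
    """Output quality after chaining `stages` agents, each of which
    passes along slightly degraded output while signaling success."""
    quality = 1.0
    for _ in range(stages):
        quality *= per_stage_quality
    return quality

# Each stage looks acceptable in isolation (90% quality), but five
# chained stages multiply out to roughly 0.59 -- a compounding
# failure in which every stage confidently signals success:
# "confident incorrectness" at the system level.
```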
I have a less polite term for it: the thing that causes the 4 a.m. incident that took three hours to trace. Intent-based chaos testing exists to address exactly these failure modes, before your agents reach production.

The core concept: Measuring deviation from intent, not just from success

Chaos engineering as a discipline is not new. Netflix built Chaos Monkey in 2011. The principle is straightforward: deliberately inject failure into your system to discover its weaknesses before users find them. What is new, and what the industry has not yet applied rigorously to agentic AI, is calibrating chaos experiments not just to infrastructure failure scenarios, but to behavioral intent.

The distinction is critical. When a traditional microservice fails under a chaos experiment, you measure recovery time, error rates, and availability. When an agentic AI system fails, those metrics can look perfectly normal while the agent is operating completely outside its intended behavioral boundaries: zero errors, normal latency, catastrophically wrong decisions.

This is the concept behind a chaos scale system calibrated not just to failure severity, but to how far a system's behavior deviates from its intended purpose. I call the output of that measurement an intent deviation score.

Here is what that looks like in practice. Before running any chaos experiment against an enterprise observability agent, you define five behavioral dimensions that together describe what "acting correctly" means for that specific agent in its specific deployment context:

| Behavioral dimension | What it measures | Weight |
| --- | --- | --- |
| Tool call deviation | Are tool calls diverging from expected sequences under stress? | 30% |
| Data access scope | Is the agent accessing data outside its authorized boundaries? | 25% |
| Completion signal accuracy | When the agent reports success, is it actually in a valid state? | 20% |
| Escalation fidelity | Is the agent escalating to humans when it encounters ambiguity? | 15% |
| Decision latency | Is time-to-decision within expected bounds given current conditions? | 10% |

The weights are not arbitrary. They reflect the risk profile of the specific agent. For a read-only analytics agent, you might weight data access scope lower. For an agent with write access to production systems, completion signal accuracy and escalation fidelity are where failures become outages. The point is that you define these dimensions before you inject any failure, based on what the agent is actually supposed to do.

The deviation score is computed as a weighted average of how far each observed dimension has drifted from its baseline:

```python
def compute_intent_deviation_score(
    baseline: dict[str, float],
    observed: dict[str, float],
    weights: dict[str, float],
) -> float:
    """
    Compute how far an agent's behavior has drifted from its intended
    baseline, returning a score from 0.0 (no deviation) to 1.0
    (complete intent violation). This is NOT a performance metric.
    """
    score = 0.0
    for dimension, weight in weights.items():
        # Per-dimension drift, clamped so one runaway dimension
        # cannot contribute more than its full weight.
        drift = min(abs(observed[dimension] - baseline[dimension]), 1.0)
        score += weight * drift
    # Weighted sum of clamped drifts; weights are assumed to sum to 1.0.
    return min(score, 1.0)
```
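To make the weighting concrete, here is a worked example with hypothetical post-experiment readings: the agent never escalated (full drift on escalation fidelity), frequently reported success from an invalid state, and drifted substantially on tool calls. Every value below is illustrative, and the inline sum simply expands the weighted-average definition:

```python
# Hypothetical readings for the observability agent, on a 0.0-1.0
# drift scale per dimension (0.0 = matches baseline exactly).
baseline = {"tool_call_deviation": 0.0, "data_access_scope": 0.0,
            "completion_signal_accuracy": 0.0, "escalation_fidelity": 0.0,
            "decision_latency": 0.0}
observed = {"tool_call_deviation": 0.6, "data_access_scope": 0.0,
            "completion_signal_accuracy": 0.8, "escalation_fidelity": 1.0,
            "decision_latency": 0.2}
weights = {"tool_call_deviation": 0.30, "data_access_scope": 0.25,
           "completion_signal_accuracy": 0.20, "escalation_fidelity": 0.15,
           "decision_latency": 0.10}

score = sum(w * abs(observed[k] - baseline[k]) for k, w in weights.items())
# 0.30*0.6 + 0.25*0.0 + 0.20*0.8 + 0.15*1.0 + 0.10*0.2 = 0.51
```

Note that data access stayed perfectly in scope, yet the aggregate score of 0.51 still signals a serious intent violation, exactly the "zero errors, catastrophically wrong decisions" profile that availability metrics would miss.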
