
## The stochastic challenge

Traditional software is predictable: input A passed to function B always produces output C. This determinism allows engineers to develop robust tests. Generative AI, by contrast, is stochastic and unpredictable. The exact same prompt often yields different results on Monday versus Tuesday, breaking the traditional unit testing that engineers know and love.

To ship enterprise-ready AI, engineers cannot rely on mere "vibe checks" that pass today but fail when customers use the product. Product builders need to adopt a new infrastructure layer: the AI Evaluation Stack. This framework is informed by my extensive experience shipping AI products for Fortune 500 enterprise customers in high-stakes industries, where a "hallucination" is not funny; it is a serious compliance risk.

## Defining the AI evaluation paradigm

Traditional software tests are binary assertions (pass/fail). While some AI evals use binary asserts, many evaluate on a gradient. An eval is not a single script; it is a structured pipeline of assertions, ranging from strict code syntax to nuanced semantic checks, that verifies the AI system's intended function.

## The taxonomy of evaluation checks

To build a robust, cost-effective pipeline, assertions must be separated into distinct architectural layers.

### Layer 1: Deterministic assertions

A surprisingly large share of production AI failures aren't semantic "hallucinations"; they are basic syntax and routing failures. Deterministic assertions serve as the pipeline's first gate, using traditional code and regex to validate structural integrity. Instead of asking whether a response is "helpful," these assertions ask strict, binary questions:

- Did the model generate the correct JSON key/value schema?
- Did it invoke the correct tool call with the required arguments?
- Did it successfully slot-fill a valid GUID or email address?
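These checks can be expressed as ordinary unit-test code. The sketch below assumes a hypothetical tool name (`get_customer_record`) and a minimal required schema; it is an illustration of the fail-fast pattern, not a prescribed implementation:

```python
import json

# Hypothetical required top-level fields for a tool-call payload.
REQUIRED_FIELDS = {"tool_name", "arguments"}

def assert_tool_call(raw_output: str) -> tuple[bool, str]:
    """Layer 1: fail fast on structural errors before any semantic check."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "FAIL: output is not valid JSON"
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        return False, f"FAIL: missing fields {sorted(missing)}"
    if payload["tool_name"] != "get_customer_record":
        return False, f"FAIL: wrong tool {payload['tool_name']!r}"
    return True, "PASS"

# Conversational text instead of a tool call fails instantly.
print(assert_tool_call("I found the customer."))
# A well-formed tool call passes.
print(assert_tool_call(
    '{"tool_name": "get_customer_record", "arguments": {"id": "42"}}'))
```

Because these checks are plain code, they run in microseconds and cost nothing, which is what makes them viable as the first gate of the stack.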
Example: Layer 1 deterministic tool-call assertion

```json
{
  "test_scenario": "User asks to look up an account",
  "assertion_type": "schema_validation",
  "expected_action": "Call API: get_customer_record",
  "actual_ai_output": "I found the customer.",
  "eval_result": "FAIL – AI hallucinated conversational text instead of generating the required API payload."
}
```

In the example above, the test fails instantly because the model generated conversational text instead of the required tool-call payload. Architecturally, deterministic assertions must be the first layer of the stack, operating on a computationally inexpensive "fail-fast" principle. If a downstream API requires a specific schema, a malformed JSON string is a fatal error. By failing the evaluation immediately at this layer, engineering teams prevent the pipeline from triggering expensive semantic checks (Layer 2) or wasting valuable human review time (Layer 3).

### Layer 2: Model-based assertions

When deterministic assertions pass, the pipeline must evaluate semantic quality. Because natural language is fluid, traditional code cannot easily assert whether a response is "helpful" or "empathetic." This introduces model-based evaluation, commonly referred to as "LLM-as-a-Judge." While using one non-deterministic system to evaluate another seems counterintuitive, it is an exceptionally powerful architectural pattern for use cases requiring nuance. It is virtually impossible to write a reliable regex to verify whether a response is "actionable" or "polite." Human reviewers excel at this nuance, but they cannot scale to evaluate tens of thousands of CI/CD test cases. Thus, the LLM-as-a-Judge becomes a scalable proxy for human discernment.
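The judge pattern can be sketched as a small harness. In the sketch below, the rubric wording, score scale, and function names are illustrative assumptions, and the call to the actual judge model is left out; only the prompt construction and the parsing of the judge's reply are shown:

```python
import re

# Hypothetical rubric; a real one would be domain-specific and human-vetted.
RUBRIC = """Score the candidate answer against the golden answer.
1 = irrelevant refusal
2 = addresses the prompt but lacks actionable steps
3 = actionable next steps strictly within context
Reply with a line: SCORE: <1-3>"""

def build_judge_prompt(question: str, candidate: str, golden: str) -> str:
    """Assemble the three critical inputs into one judge prompt."""
    return (f"{RUBRIC}\n\nQuestion: {question}\n"
            f"Golden answer: {golden}\nCandidate answer: {candidate}")

def parse_judge_score(judge_reply: str) -> int:
    """Extract the numeric score; raise if the judge went off-script."""
    match = re.search(r"SCORE:\s*([123])", judge_reply)
    if match is None:
        raise ValueError("judge reply did not contain a score")
    return int(match.group(1))

# In production, build_judge_prompt's output would be sent to a frontier
# reasoning model; here we simulate a judge reply to show the round trip.
print(parse_judge_score("Reasoning: concrete and in-context.\nSCORE: 3"))
```

Forcing the judge to emit a machine-parseable score line, and failing loudly when it does not, keeps the non-determinism of the judge itself contained.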
### 3 critical inputs for model-based assertions

However, model-based assertions only yield reliable data when the LLM-as-a-Judge is provisioned with three critical inputs:

1. A state-of-the-art reasoning model: The judge must possess superior reasoning capabilities compared to the production model. If your app runs on a smaller, faster model for latency, the judge must be a frontier reasoning model to approximate human-level discernment.
2. A strict assessment rubric: Vague evaluation prompts ("Rate how good this answer is") yield noisy, stochastic evaluations. A robust rubric explicitly defines the gradients of failure and success. (For example, a "Helpfulness" rubric should define Score 1 as an irrelevant refusal, Score 2 as addressing the prompt but lacking actionable steps, and Score 3 as providing actionable next steps strictly within context.)
3. Ground truth (golden outputs): While the rubric provides the rules, a human-vetted "expected answer" acts as the answer key. When the LLM-Judge can compare the production model's output against a verified golden output, its scoring reliability increases dramatically.

## Architecture: the offline vs. online pipeline

A robust evaluation architecture requires two complementary pipelines. The online pipeline monitors post-deployment telemetry, while the offline pipeline provides the foundational baseline and deterministic constraints required to evaluate stochastic models safely.

### The offline evaluation pipeline

The offline pipeline's primary objective is regression testing: identifying failures, drift, and latency before production. Deploying an enterprise LLM feature without a gating offline evaluation suite is an architectural anti-pattern; it is the equivalent of merging uncompiled code into the main branch.

### Process

1. Curating the golden dataset

The offline lifecycle begins by curating a "golden dataset": a static, version-controlled repository of 200 to 500 test cases representing the AI's full operational envelope.
Each case pairs an exact input payload with an expected "golden output" (ground truth). Crucially, this dataset must reflect expected real-world traffic distributions. While most cases cover standard "happy-path" interactions, engineers must systematically incorporate edge cases, jailbreaks, and adversarial inputs. Evaluating "refusal capabilities" under stress remains a strict compliance requirement.

Example test case payload (standard tool use):

- Input: "Schedule a 30-minute follow-up meeting with the client for next Tuesday at 10 a.m."
- Expected output (golden): the system successfully invokes the `schedule_meeting` tool with the correct JSON payload: `{"duration_minutes": 30, "day": "Tuesday", "time": "10 AM", "attendee": "client_email"}`

While manually curating hundreds of edge cases is tedious, the process can be accelerated with synthetic data generation pipelines that use a specialized LLM to produce diverse TSV/CSV test payloads. However, relying entirely on AI-generated test cases introduces the risk of data contamination and bias. A human-in-the-loop (HITL) architecture is mandatory at this stage: domain experts must manually review, edit, and validate the synthetic dataset to ensure it accurately reflects real-world user intent and enterprise policy before it is committed to the repository.

2. Defining the evaluation criteria

Once the dataset is curated, engineers must design the evaluation criteria to compute a composite score for each model output.
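One way such a composite score can be computed over a golden-dataset run is sketched below. The weighting scheme, the zero-on-structural-failure rule, and the CI baseline threshold are hypothetical choices for illustration, not a prescribed standard:

```python
# Sketch: aggregate per-case results from a golden-dataset run into one
# composite score that can gate a CI/CD pipeline. Values are hypothetical.
cases = [
    {"deterministic_pass": True,  "judge_score": 3},  # happy path
    {"deterministic_pass": True,  "judge_score": 2},  # weak answer
    {"deterministic_pass": False, "judge_score": 0},  # fail-fast: judge skipped
]

def composite_score(case: dict) -> float:
    # A Layer 1 failure is fatal: the case scores zero regardless of semantics.
    if not case["deterministic_pass"]:
        return 0.0
    # Otherwise normalize the 1-3 judge score to the 0..1 range.
    return case["judge_score"] / 3

run_score = sum(composite_score(c) for c in cases) / len(cases)
print(f"Run score: {run_score:.2f}")

# Gate the deploy: block the merge if the run regresses below a baseline.
BASELINE = 0.80
print("PASS" if run_score >= BASELINE else "FAIL: regression, block merge")
```

Tracking this single number per run makes drift visible: a drop between two model versions on the same frozen dataset is a regression signal, independent of anecdotal "vibe checks."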
