{"id":80640,"date":"2026-05-19T08:57:43","date_gmt":"2026-05-19T01:57:43","guid":{"rendered":"https:\/\/hbbgroup.net\/ai-still-cant-beat-the-on-call-engineer-heres-why\/"},"modified":"2026-05-19T08:57:43","modified_gmt":"2026-05-19T01:57:43","slug":"ai-still-cant-beat-the-on-call-engineer-heres-why","status":"publish","type":"post","link":"https:\/\/hbbgroup.net\/vi\/ai-still-cant-beat-the-on-call-engineer-heres-why\/","title":{"rendered":"AI Still Can&#8217;t Beat the On-Call Engineer: Here&#8217;s Why"},"content":{"rendered":"<div>\n<div>\n<h4 color=\"#333\">In brief<\/h4>\n<ul>\n<li>ARFBench is the first AI benchmark built entirely from real production incidents.<\/li>\n<li>GPT-5 leads all existing AI models at 62.7% accuracy but falls short of domain experts at 72.7%.<\/li>\n<li>A theoretical model-expert oracle\u2014combining AI and human judgment\u2014hits 87.2% accuracy, setting the ceiling for what collaborative AI-human teams could achieve.<\/li>\n<\/ul>\n<\/div>\n<p>AI companies keep pitching autonomous <a href=\"https:\/\/resolve.ai\/glossary\/what-is-ai-sre\" target=\"_blank\" rel=\"nofollow external noopener\">site reliability engineer agents<\/a>\u2014AI that investigates production incidents in place of humans. Datadog ran the actual benchmark on real outages, and the best AI models can&#8217;t yet beat the engineers they\u2019re supposed to replace.<\/p>\n<p>The benchmark is <a href=\"https:\/\/arxiv.org\/abs\/2604.21199\" target=\"_blank\">ARFBench<\/a> (Anomaly Reasoning Framework Benchmark), a joint project from Datadog and Carnegie Mellon. It was built from 63 real production incidents, extracted from engineers&#8217; own Slack threads during live emergencies, and comprises 750 multiple-choice questions covering 142 monitoring metrics and 5.38 million data points, with every question verified by hand. No synthetic data. No textbook scenarios.<\/p>\n<p>&#8220;Trillions of dollars are lost each year due to system outages,&#8221; the researchers write. 
The benchmark tests whether AI can actually help change that.<\/p>\n<p>\u201cDespite the central role of such question-driven analysis in incident response, it remains unclear whether modern foundation models can reliably answer the kinds of time series questions engineers ask in practice,\u201d the paper reads.<\/p>\n<p>Questions come in three tiers. Tier I: Does an anomaly exist in this chart? Tier II: When did it start, how severe is it, what type?<\/p>\n<p>Tier III, the hardest, requires cross-metric reasoning: Is this chart causing the problem in that other chart? That&#8217;s where AI falls apart. GPT-5 scores just 47.5% F1 on Tier III questions; F1 is a metric that penalizes models for gaming answers by picking the most common class.<\/p>\n<h2 color=\"#333\">How every model stacked up<\/h2>\n<p>GPT-5 led all existing models at 62.7% accuracy\u2014on a test where random guessing gets 24.5%. Gemini 3 Pro scored 58.1%. Claude Opus 4.6: 54.8%. Claude Sonnet 4.5: 47.2%.<\/p>\n<p>Domain experts scored 72.7% accuracy. 
Non-domain experts\u2014time series researchers at Datadog without extensive observability experience\u2014still hit 69.7%.<\/p>\n<p>No AI model beat either human baseline.<\/p>\n<figure><img loading=\"lazy\" alt=\"ARFBench leaderboard table\" width=\"3722\" height=\"3102\" decoding=\"async\" data-nimg=\"1\" src=\"https:\/\/img.decrypt.co\/insecure\/rs:fit:3840:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/05\/arfbench_overall_accuracy_f1_corrected.png@webp\"><figcaption>Image built by Decrypt based on the ARFBench leaderboard CSV<\/figcaption><\/figure>\n<p>The model that actually topped the full leaderboard was Datadog&#8217;s own hybrid: Toto, its internal time series forecasting model, combined with Qwen3-VL 32B. Toto-1.0-QA-Experimental scored 63.9% accuracy, edging past GPT-5 while using a fraction of its parameters. On anomaly identification specifically, it outperformed every other model by at least 8.8 percentage points in F1.<\/p>\n<p>A purpose-built domain model trained on observability data should be expected to outperform a frontier general-purpose system at this specific task. That&#8217;s the point.<\/p>\n<p>The most valuable finding isn&#8217;t which model scored highest.<\/p>\n<p>&#8220;We observe substantially different error profiles between leading models and human experts, suggesting that their strengths are complementary,&#8221; the researchers write. Models hallucinate, miss metadata, and lose domain context. Humans misread precise timestamps and occasionally fail on complex instructions. The mistakes barely overlap.<\/p>\n<p>Model a theoretical &#8220;Model-Expert Oracle&#8221;, a perfect judge that picks the correct answer whenever either the AI or the human has it, and you get 87.2% accuracy and 82.8% F1, well above either alone.<\/p>\n<p>That&#8217;s not a product. 
It&#8217;s a <a href=\"https:\/\/decrypt.co\/362496\/is-agi-here-not-even-close-ai-benchmark\" target=\"_blank\">documented target<\/a>\u2014built from real emergencies, not curated datasets\u2014that quantifies exactly how much better human-AI collaboration could perform. The leaderboard is live on Hugging Face. GPT-5 sits at 62.7%. The ceiling is 87.2%.<\/p>\n<\/div>","protected":false},"excerpt":{"rendered":"<p>In brief ARFBench is the first AI benchmark built entirely from real production incidents. GPT-5 leads all existing AI models [&hellip;]<\/p>","protected":false},"author":5,"featured_media":80641,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[220],"tags":[],"class_list":["post-80640","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tien-dien-tu"],"acf":[],"_links":{"self":[{"href":"https:\/\/hbbgroup.net\/vi\/wp-json\/wp\/v2\/posts\/80640","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hbbgroup.net\/vi\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hbbgroup.net\/vi\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hbbgroup.net\/vi\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/hbbgroup.net\/vi\/wp-json\/wp\/v2\/comments?post=80640"}],"version-history":[{"count":0,"href":"https:\/\/hbbgroup.net\/vi\/wp-json\/wp\/v2\/posts\/80640\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/hbbgroup.net\/vi\/wp-json\/wp\/v2\/media\/80641"}],"wp:attachment":[{"href":"https:\/\/hbbgroup.net\/vi\/wp-json\/wp\/v2\/media?parent=80640"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hbbgroup.net\/vi\/wp-json\/wp\/v2\/categories?post=
80640"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hbbgroup.net\/vi\/wp-json\/wp\/v2\/tags?post=80640"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}