{
    "id": 80640,
    "date": "2026-05-19T08:57:43",
    "date_gmt": "2026-05-19T01:57:43",
    "guid": {
        "rendered": "https:\/\/hbbgroup.net\/ai-still-cant-beat-the-on-call-engineer-heres-why\/"
    },
    "modified": "2026-05-19T08:57:43",
    "modified_gmt": "2026-05-19T01:57:43",
    "slug": "ai-still-cant-beat-the-on-call-engineer-heres-why",
    "status": "publish",
    "type": "post",
    "link": "https:\/\/hbbgroup.net\/en_us\/ai-still-cant-beat-the-on-call-engineer-heres-why\/",
    "title": {
        "rendered": "AI Still Can&#8217;t Beat the On-Call Engineer: Here&#8217;s Why"
    },
    "content": {
        "rendered": "<div>\n<div>\n<h4 color=\"#333\">In brief<\/h4>\n<ul>\n<li>ARFBench is the first AI benchmark built entirely from real production incidents.<\/li>\n<li>GPT-5 leads all existing AI models at 62.7% accuracy but falls short of domain experts at 72.7%.<\/li>\n<li>A theoretical model-expert oracle\u2014combining AI and human judgment\u2014hits 87.2% accuracy, setting the ceiling for what collaborative AI-human teams could achieve.<\/li>\n<\/ul>\n<\/div>\n<p>AI companies keep pitching autonomous <a href=\"https:\/\/resolve.ai\/glossary\/what-is-ai-sre\" target=\"_blank\" rel=\"nofollow external noopener\">site reliability engineer agents<\/a>\u2014AI that investigates production incidents in place of humans. Datadog ran the actual benchmark on real outages, and the best AI models can&#8217;t yet beat the engineers they\u2019re supposed to replace.<\/p>\n<p>The benchmark is <a href=\"https:\/\/arxiv.org\/abs\/2604.21199\" target=\"_blank\">ARFBench<\/a> (Anomaly Reasoning Framework Benchmark), a joint project from Datadog and Carnegie Mellon. Built from 63 real production incidents, extracted from engineers&#8217; own Slack threads during live emergencies\u2014750 multiple-choice questions covering 142 monitoring metrics and 5.38 million data points, every question verified by hand. No synthetic data. No textbook scenarios.<\/p>\n<p>&#8220;Trillions of dollars are lost each year due to system outages,&#8221; the researchers write. The benchmark tests whether AI can actually help change that.<\/p>\n<p>\u201cDespite the central role of such question-driven analysis in incident response, it remains unclear whether modern foundation models can reliably answer the kinds of time series questions engineers ask in practice,\u201d the paper reads.<\/p>\n<p>Questions come in three tiers. Tier I: Does an anomaly exist in this chart? 
Tier II: When did it start, how severe is it, what type?<\/p>\n<p>Tier III, the hardest, requires cross-metric reasoning: Is this chart causing the problem in that other chart? That&#8217;s where AI falls apart. On Tier III questions, GPT-5 scores just 47.5% in F1, a metric that penalizes models for gaming answers by picking the most common class.<\/p>\n<h2 color=\"#333\">How every model stacked up<\/h2>\n<p>GPT-5 led all existing models at 62.7% accuracy\u2014on a test where random guessing gets 24.5%. Gemini 3 Pro scored 58.1%. Claude Opus 4.6: 54.8%. Claude Sonnet 4.5: 47.2%.<\/p>\n<p>Domain experts scored 72.7% accuracy. Non-domain experts\u2014time series researchers at Datadog without extensive observability experience\u2014still hit 69.7%.<\/p>\n<p>No AI model beat either human baseline.<\/p>\n<figure><img loading=\"lazy\" alt=\"ARFBench leaderboard table\" width=\"3722\" height=\"3102\" decoding=\"async\" data-nimg=\"1\" src=\"https:\/\/img.decrypt.co\/insecure\/rs:fit:3840:0:0:0\/plain\/https:\/\/cdn.decrypt.co\/wp-content\/uploads\/2026\/05\/arfbench_overall_accuracy_f1_corrected.png@webp\"><figcaption>Image built by Decrypt based on the ARFBench leaderboard CSV<\/figcaption><\/figure>\n<p>The model that actually topped the full leaderboard was Datadog&#8217;s own hybrid: Toto\u2014their internal time series forecasting model\u2014combined with Qwen3-VL 32B. Toto-1.0-QA-Experimental scored 63.9% accuracy, edging past GPT-5 while using a fraction of its parameters. 
On anomaly identification specifically, it outperformed every other model by at least 8.8 percentage points in F1.<\/p>\n<p>A purpose-built domain model, trained on observability data, outperforming a frontier general-purpose system at this specific task is the expected outcome. That&#8217;s the point.<\/p>\n<p>The most valuable finding isn&#8217;t which model scored highest.<\/p>\n<p>&#8220;We observe substantially different error profiles between leading models and human experts, suggesting that their strengths are complementary,&#8221; the researchers write. Models hallucinate, miss metadata, and lose domain context. Humans misread precise timestamps and occasionally fail on complex instructions. The mistakes barely overlap.<\/p>\n<p>Model a theoretical &#8220;Model-Expert Oracle&#8221;\u2014a perfect judge that always picks the right answer between the AI and the human\u2014and you get 87.2% accuracy and 82.8% F1. Way above either alone.<\/p>\n<p>That&#8217;s not a product. It&#8217;s a <a href=\"https:\/\/decrypt.co\/362496\/is-agi-here-not-even-close-ai-benchmark\" target=\"_blank\">documented target<\/a>\u2014built from real emergencies, not curated datasets\u2014that quantifies exactly how much better human-AI collaboration could perform. The leaderboard is live on Hugging Face. GPT-5 sits at 62.7%. The ceiling is 87.2%.<\/p>\n<\/div>",
        "protected": false
    },
    "excerpt": {
        "rendered": "<p>In brief ARFBench is the first AI benchmark built entirely from real production incidents. GPT-5 leads all existing AI models [&hellip;]<\/p>",
        "protected": false
    },
    "author": 5,
    "featured_media": 80641,
    "comment_status": "open",
    "ping_status": "open",
    "sticky": false,
    "template": "",
    "format": "standard",
    "meta": {
        "_acf_changed": false,
        "footnotes": ""
    },
    "categories": [
        220
    ],
    "tags": [],
    "class_list": [
        "post-80640",
        "post",
        "type-post",
        "status-publish",
        "format-standard",
        "has-post-thumbnail",
        "hentry",
        "category-tien-dien-tu"
    ],
    "acf": [],
    "_links": {
        "self": [
            {
                "href": "https:\/\/hbbgroup.net\/en_us\/wp-json\/wp\/v2\/posts\/80640",
                "targetHints": {
                    "allow": [
                        "GET"
                    ]
                }
            }
        ],
        "collection": [
            {
                "href": "https:\/\/hbbgroup.net\/en_us\/wp-json\/wp\/v2\/posts"
            }
        ],
        "about": [
            {
                "href": "https:\/\/hbbgroup.net\/en_us\/wp-json\/wp\/v2\/types\/post"
            }
        ],
        "author": [
            {
                "embeddable": true,
                "href": "https:\/\/hbbgroup.net\/en_us\/wp-json\/wp\/v2\/users\/5"
            }
        ],
        "replies": [
            {
                "embeddable": true,
                "href": "https:\/\/hbbgroup.net\/en_us\/wp-json\/wp\/v2\/comments?post=80640"
            }
        ],
        "version-history": [
            {
                "count": 0,
                "href": "https:\/\/hbbgroup.net\/en_us\/wp-json\/wp\/v2\/posts\/80640\/revisions"
            }
        ],
        "wp:featuredmedia": [
            {
                "embeddable": true,
                "href": "https:\/\/hbbgroup.net\/en_us\/wp-json\/wp\/v2\/media\/80641"
            }
        ],
        "wp:attachment": [
            {
                "href": "https:\/\/hbbgroup.net\/en_us\/wp-json\/wp\/v2\/media?parent=80640"
            }
        ],
        "wp:term": [
            {
                "taxonomy": "category",
                "embeddable": true,
                "href": "https:\/\/hbbgroup.net\/en_us\/wp-json\/wp\/v2\/categories?post=80640"
            },
            {
                "taxonomy": "post_tag",
                "embeddable": true,
                "href": "https:\/\/hbbgroup.net\/en_us\/wp-json\/wp\/v2\/tags?post=80640"
            }
        ],
        "curies": [
            {
                "name": "wp",
                "href": "https:\/\/api.w.org\/{rel}",
                "templated": true
            }
        ]
    }
}