As AI adoption accelerates, a new report from Datadog finds that operational complexity—not model intelligence—is the primary barrier to achieving reliable AI at scale.

Operational complexity is rapidly becoming the main obstacle to scaling artificial intelligence, with nearly seven in ten companies now using three or more models, according to a new report from Datadog Inc.
"The companies that win won't just build better models; they'll build operational control around them," Yanbing Li, Chief Product Officer at Datadog, said. "In this new era, AI observability becomes as essential as cloud observability was a decade ago."
The State of AI Engineering 2026 report found that around 5 percent of all AI model requests in production fail, with nearly 60 percent of those failures caused by system capacity limits. The pattern points to the infrastructure, not the AI model itself, as the point of failure, producing slowdowns and broken experiences in AI-powered applications.
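Capacity-limit failures of this kind (rate limiting, quota exhaustion, overloaded endpoints) are typically transient, so production systems commonly wrap model calls in retry logic with exponential backoff. The sketch below illustrates the general pattern; `CapacityError` and the retried callable are hypothetical stand-ins, not part of any specific provider's SDK.

```python
import random
import time


class CapacityError(Exception):
    """Stand-in for a provider's capacity/rate-limit error (e.g. an HTTP 429)."""


def with_backoff(fn, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying on CapacityError with exponential backoff and jitter.

    Jitter spreads retries out so many clients don't hammer a recovering
    endpoint in lockstep. The sleep function is injectable for testing.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except CapacityError:
            if attempt == max_retries - 1:
                raise  # budget exhausted; surface the failure to the caller
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            sleep(delay)
```

In practice the retried callable would be the actual model request, and the backoff parameters would be tuned against the provider's documented limits.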
The findings suggest a critical shift in the AI industry, where investment and strategy may pivot from pure model development to MLOps and observability platforms. For companies racing to deploy AI, the reliability of their underlying infrastructure could become a more significant competitive differentiator than the sophistication of their algorithms.
The challenge mirrors the early days of cloud computing, where the focus shifted from simply having servers to managing their complexity and reliability at scale. Competitive pressure is pushing both startups and large enterprises to deploy AI faster, but this speed creates risk when not paired with operational control.
"The next wave of agent failures won't be about what agents can't do but what teams can't observe," Guillermo Rauch, CEO at Vercel, said. "Unlike traditional software, agents have control flow driven by the LLM itself, making observability not just useful, but essential."
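Because the model, not the code, decides which path an agent takes, teams cannot rely on static control flow for debugging; each step has to be recorded as it happens. A minimal sketch of that idea, with an in-memory trace sink standing in for a real observability backend (all names here are illustrative, not a specific vendor's API):

```python
import time
import uuid
from contextlib import contextmanager

# In-memory sink; a real system would export spans to an observability backend.
TRACE_LOG = []


@contextmanager
def traced_step(name, **attrs):
    """Record one agent step (tool call, model call, etc.) as a span.

    Captures timing, arbitrary attributes, and whether the step raised,
    so the LLM-chosen execution path can be reconstructed afterward.
    """
    span = {"id": uuid.uuid4().hex, "name": name, "attrs": attrs,
            "start": time.time()}
    try:
        yield span
        span["status"] = "ok"
    except Exception as exc:
        span["status"] = "error"
        span["error"] = repr(exc)
        raise  # re-raise so the agent's own error handling still runs
    finally:
        span["end"] = time.time()
        TRACE_LOG.append(span)
```

An agent loop would wrap each tool invocation in `traced_step("tool_call", tool=...)`, leaving a complete record of the path the LLM actually took.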
This sentiment is echoed across the industry. A separate study from Riverbed found that while 91 percent of healthcare leaders report that AIOps ROI has met or exceeded expectations, only 31 percent of their organizations are fully prepared to operationalize their AI strategy, with data quality being a primary concern.
The focus on operational readiness marks a maturing of the AI market. While reports from institutions like Stanford HAI point to a “great divergence” in AI opinions and performance, the on-the-ground reality for engineers is one of managing increasingly fragmented and complex systems. Datadog's report, which analyzed anonymized data from thousands of customers, shows that the path to production AI is paved with operational hurdles.
"To scale AI with confidence, organizations need real-time visibility across the entire stack: from GPU utilization to model behavior to agent workflows," Datadog's Li added. "At scale, how you operate AI may matter more than the models you choose."
This operational-first mindset is becoming a recurring theme. The General Services Administration's "million hours challenge" aims to automate repetitive workflows, and new platforms from companies like SymphonyAI and Catapult are being built with embedded AI and operational dashboards to manage complexity from the ground up. The consensus is clear: as AI becomes more powerful, the systems that manage it must become more intelligent.
This article is for informational purposes only and does not constitute investment advice.