Data teams building AI agents keep running into the same failure mode. Questions that require joining structured data with unstructured content (sales figures alongside customer reviews, or citation counts alongside academic papers) break single-turn RAG systems. New research from Databricks puts a number on that failure gap. The company’s AI research team tested a multi-step agentic approach against state-of-the-art single-turn RAG baselines across nine enterprise knowledge tasks, reporting gains of 20% or more on Stanford’s STaRK benchmark suite and consistent improvement across Databricks’ own KARLBench evaluation framework. The results make the case that the performance gap between single-turn RAG and multi-step agents on hybrid data tasks is an architectural problem, not a model quality problem.

The work builds on Databricks’ earlier instructed retriever research, which showed retrieval improvements on unstructured data using metadata-aware queries. This latest research adds structured data sources (relational tables and SQL warehouses) into the same reasoning loop, addressing the class of questions enterprises most commonly fail to answer with current agent architectures.

“RAG works, but it doesn’t scale,” Michael Bendersky, research director at Databricks, told VentureBeat. “If you want to make your agent even better, and you want to understand why you have declining sales, now you have to help the agent see the tables and look at the sales data. Your RAG pipeline will become incompetent at that task.”

Single-turn retrieval cannot encode structural constraints

The core finding is that standard RAG systems fail when a query mixes a precise structured filter with an open-ended semantic search. Consider a question like “Which of our products have had declining sales over the past three months, and what potentially related issues are brought up in customer reviews on various seller sites?” The sales data lives in a warehouse.
The review sentiment lives in unstructured documents across seller sites. A single-turn RAG system cannot split that query, route each half to the right data source and combine the results.

To confirm this is an architecture problem rather than a model quality problem, Databricks reran published STaRK baselines using a current state-of-the-art foundation model. The stronger model still lost to the multi-step agent by 21% on the academic domain and 38% on the biomedical domain. STaRK is a benchmark published by Stanford researchers covering three semi-structured retrieval domains: Amazon product data, the Microsoft Academic Graph and a biomedical knowledge base.

How the Supervisor Agent handles what RAG cannot

Databricks built the Supervisor Agent as the production implementation of this research approach, and its architecture illustrates why the gains are consistent across task types. The approach includes three core steps:

Parallel tool decomposition. Rather than issuing one broad query and hoping the results cover both structured and unstructured needs, the agent fires SQL and vector search calls simultaneously, then analyzes the combined results before deciding what to do next. That parallel step is what allows it to handle queries that cross data type boundaries without requiring the data to be normalized first.

Self-correction. When an initial retrieval attempt hits a dead end, the agent detects the failure, reformulates the query and tries a different path. On a STaRK benchmark task that requires finding a paper by an author with exactly 115 prior publications on a specific topic, the agent first queries both SQL and vector search in parallel. When the two result sets show no overlap, it adapts and issues a SQL JOIN across both constraints, then calls the vector search system to verify the result before returning the answer.

Declarative configuration. The agent is not tuned to any specific dataset or task.
Connecting it to a new data source means writing a plain-language description of what that source contains and what kinds of questions it should answer. No custom code is required.

“The agent can do things like decomposing the question into a SQL query and a search query out of the box,” Bendersky said. “It can combine the results of SQL and RAG, reason about those results, make follow-up queries and then reason about whether the final answer was actually found.”

It’s not just about hybrid retrieval

Being able to source information from both structured and unstructured data isn’t an entirely new concept. LlamaIndex, LangChain and Microsoft Fabric agents all offer some form of hybrid retrieval. Bendersky draws a distinction in how the Databricks approach frames the problem architecturally.

“We almost don’t see it as a hybrid retrieval where you combine embeddings and search results, or embeddings and tables,” he said. “We see this more as an agent that has access to multiple tools.”

The practical consequence of that framing is that adding a new data source means connecting it to the agent and writing a description of what it contains. The agent handles routing and orchestration without additional code. Custom RAG pipelines require data to be converted into a format the retrieval system can read, typically text chunks with embeddings. SQL tables have to be flattened, JSON has to be normalized. Every new data source added to the pipeline means more conversion work. Databricks’ research argues that as enterprise data grows to include more source types, that burden makes custom pipelines increasingly impractical compared to an agent that queries each source in its native format.

“Just bring the agent to the data,” Bendersky said.
“You basically give the agent more sources, and it will learn to use them pretty well.”

What this means for enterprises

For data engineers evaluating whether to build custom RAG pipelines or adopt a declarative agent framework, the research offers a clear direction: if the task involves questions that span structured and unstructured data, building custom retrieval is the harder path. The research found that across all tested tasks, the only things that differed between deployments were instructions and tool descriptions. The agent handled the rest.

The practical limits are real but manageable. The approach works well with five to ten data sources. Adding too many at once, without curating which sources are complementary rather than contradictory, makes the agent slower and less reliable. Bendersky recommends scaling incrementally and verifying results at each step rather than connecting all available data upfront.

Data accuracy is a prerequisite. The agent can query across mismatched formats (JSON review feeds alongside SQL sales tables) without requiring normalization. It cannot fix source data that is factually wrong. Adding a plain-language description of each data source at ingestion time helps the agent route queries correctly from the start.

The research positions this as an early step in a longer trajectory. As enterprise AI workloads mature, agents will be expected to reason across dozens of source types, including dashboards, code repositories and external data feeds. The research argues the declarative approach is what makes that scaling tractable, because adding a new source stays a configuration problem rather than an engineering one.

“This is kind of like a ladder,” Bendersky said. “The agent will slowly get more and more information and then slowly improve overall.”
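The parallel decomposition and self-correction loop described in the article can be sketched in miniature. The tool functions below are hypothetical stand-ins for a SQL warehouse and a vector index; Databricks’ actual Supervisor Agent APIs are not public here, so this shows only the generic pattern.

```python
# Minimal sketch of parallel tool decomposition with a self-correction
# fallback. All tool functions are hypothetical stand-ins, not real APIs.
from concurrent.futures import ThreadPoolExecutor

def sql_tool(question: str) -> set[str]:
    # Stand-in: pretend the warehouse returns papers whose author has
    # exactly 115 prior publications (the structured constraint).
    return {"paper_42", "paper_77"}

def vector_search_tool(question: str) -> set[str]:
    # Stand-in: pretend semantic search returns topically relevant papers
    # (the unstructured constraint).
    return {"paper_77", "paper_99"}

def reformulate_and_retry(question: str) -> set[str]:
    # Placeholder for the retry path (e.g., a SQL JOIN across both
    # constraints followed by a verification search).
    return set()

def answer(question: str) -> set[str]:
    # Step 1: fire both tools simultaneously instead of one broad query.
    with ThreadPoolExecutor() as pool:
        sql_future = pool.submit(sql_tool, question)
        vec_future = pool.submit(vector_search_tool, question)
    sql_hits, vec_hits = sql_future.result(), vec_future.result()

    # Step 2: reason over the combined results. An intersection satisfies
    # both constraints; an empty one triggers the self-correction path.
    overlap = sql_hits & vec_hits
    return overlap if overlap else reformulate_and_retry(question)

print(answer("papers on topic X by an author with 115 publications"))
# {'paper_77'}
```

The design point is the control flow, not the stubs: both retrieval calls run before any routing decision is made, and the agent inspects their agreement before either returning or reformulating.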
43% of AI-generated code changes need debugging in production, survey finds
The software industry is racing to write code with artificial intelligence. It is struggling, badly, to make sure that code holds up once it ships.

A survey of 200 senior site-reliability and DevOps leaders at large enterprises across the United States, United Kingdom, and European Union paints a stark picture of the hidden costs embedded in the AI coding boom. According to Lightrun’s 2026 State of AI-Powered Engineering Report, shared exclusively with VentureBeat ahead of its public release, 43% of AI-generated code changes require manual debugging in production environments even after passing quality assurance and staging tests. Not a single respondent said their organization could verify an AI-suggested fix with just one redeploy cycle; 88% reported needing two to three cycles, while 11% required four to six.

The findings land at a moment when AI-generated code is proliferating across global enterprises at a breathtaking pace. Both Microsoft CEO Satya Nadella and Google CEO Sundar Pichai have claimed that around a quarter of their companies’ code is now AI-generated. The AIOps market — the ecosystem of platforms and services designed to manage and monitor these AI-driven operations — stands at $18.95 billion in 2026 and is projected to reach $37.79 billion by 2031.

Yet the report suggests the infrastructure meant to catch AI-generated mistakes is badly lagging behind AI’s capacity to produce them.

“The 0% figure signals that engineering is hitting a trust wall with AI adoption,” said Or Maimon, Lightrun’s chief business officer, referring to the survey’s finding that zero percent of engineering leaders described themselves as “very confident” that AI-generated code will behave correctly once deployed. “While the industry’s emphasis on increased productivity has made AI a necessity, we are seeing a direct negative impact.
As AI-generated code enters the system, it doesn’t just increase volume; it slows down the entire deployment pipeline.”

Amazon’s March outages showed what happens when AI-generated code ships without safeguards

The dangers are no longer theoretical. In early March 2026, Amazon suffered a series of high-profile outages that underscored exactly the kind of failure pattern the Lightrun survey describes. On March 2, Amazon.com experienced a disruption lasting nearly six hours, resulting in 120,000 lost orders and 1.6 million website errors. Three days later, on March 5, a more severe outage hit the storefront — lasting six hours and causing a 99% drop in U.S. order volume, with approximately 6.3 million lost orders. Both incidents were traced to AI-assisted code changes deployed to production without proper approval.

The fallout was swift. Amazon launched a 90-day code safety reset across 335 critical systems, and AI-assisted code changes must now be approved by senior engineers before they are deployed.

Maimon pointed directly to the Amazon episodes. “This uncertainty isn’t based on a hypothesis,” he said. “We just need to look back to the start of March, when Amazon.com in North America went down due to an AI-assisted change being implemented without established safeguards.”

The Amazon incidents illustrate the central tension the Lightrun report quantifies in survey data: AI tools can produce code at unprecedented speed, but the systems designed to validate, monitor, and trust that code in live environments have not kept pace. Google’s own 2025 DORA report corroborates this dynamic, finding that AI adoption correlates with an increase in code instability, and that 30% of developers report little or no trust in AI-generated code.

Maimon cited that research directly: “Google’s 2025 DORA report found that AI adoption correlates with an almost 10% increase in code instability.
Our validation processes were built for the scale of human engineering, but today, engineers have become auditors for massive volumes of unfamiliar code.”

Developers are losing two days a week to debugging AI-generated code they didn’t write

One of the report’s most striking findings is the scale of human capital being consumed by AI-related verification work. Developers now spend an average of 38% of their work week — roughly two full days — on debugging, verification, and environment-specific troubleshooting, according to the survey. For 88% of the companies polled, this “reliability tax” consumes between 26% and 50% of their developers’ weekly capacity.

This is not the productivity dividend that enterprise leaders expected when they invested in AI coding assistants. Instead, the engineering bottleneck has simply migrated. Code gets written faster, but it takes far longer to confirm that it works.

“In some senses, AI has made the debugging problem worse,” Maimon said. “The volume of change is overwhelming human validation, while the generated code itself frequently does not behave as expected when deployed in production. AI coding agents cannot see how their code behaves in running environments.”

The redeploy problem compounds the time drain. Every surveyed organization requires multiple deployment cycles to verify a single AI-suggested fix — and according to Google’s 2025 DORA report, a single redeploy cycle takes a day to one week on average. In regulated industries such as healthcare and finance, deployment windows are often narrow, governed by mandated code freezes and strict change-management protocols. Requiring three or more cycles to validate a single AI fix can push resolution timelines from days to weeks.

Maimon rejected the idea that these multiple cycles represent prudent engineering discipline. “This is not discipline, but an expensive bottleneck and a symptom of the fact that AI-generated fixes are often unreliable,” he said.
“If we can move from three cycles to one, we reclaim a massive portion of that 38% lost engineering capacity.”

AI monitoring tools can’t see what’s happening inside running applications — and that’s the real problem

If the productivity drain is the most visible cost, the Lightrun report argues the deeper structural problem is what it calls “the runtime visibility gap” — the inability of AI tools and existing monitoring systems to observe what is actually happening inside running applications.

Sixty percent of the survey’s respondents identified a lack of visibility into live system behavior as the primary bottleneck in resolving production incidents. In 44% of cases where AI SRE or application performance monitoring tools attempted to investigate production issues, they failed because the necessary execution-level data — variable states, memory usage, request flow — had never been captured in the first place.

The report paints a picture of AI tools operating essentially blind in the environments that matter most. Ninety-seven percent of engineering leaders said their AI SRE agents operate without significant visibility into what is actually happening in production. Approximately half of all companies (49%) reported their AI agents have only limited visibility into live execution states. Only 1% reported extensive visibility, and not a single respondent claimed full visibility.

This is the gap that turns a minor software bug into a costly outage. When an AI-suggested fix fails in production — as 43% of them do — engineers cannot rely on their AI tools to diagnose the problem, because those tools cannot observe the code’s real-time behavior. Instead, teams fall back on what the report calls “tribal knowledge”: the institutional memory of senior engineers who have seen similar problems before and can intuit the root cause from experience rather than data.
The survey found that 54% of resolutions to high-severity incidents rely on tribal knowledge rather than diagnostic evidence from AI SREs or APMs.

In finance, 74% of engineering teams trust human intuition over AI diagnostics during serious incidents

The trust deficit plays out with particular intensity in the finance sector. In an industry where a single application error can cascade into millions of dollars in losses per minute, the survey found that 74% of financial-services engineering teams rely on tribal knowledge over automated diagnostic data during serious incidents — far higher than the 44% figure in the technology sector.

“Finance is a heavily regulated, high-stakes environment where a single application error can cost millions of dollars per minute,” Maimon said. “The data shows that these teams simply do not trust AI not to make a dangerous mistake in their production environments. This is a rational response to tool failure.”

The distrust extends beyond finance. Perhaps the most telling data point in the entire report is that not a single organization surveyed — across any industry — has moved its AI SRE tools into actual production workflows. Ninety percent remain in experimental or pilot mode. The remaining 10% evaluated AI SRE tools and chose not to adopt them at all. This represents an extraordinary gap between market enthusiasm and operational reality: enterprises are spending aggressively on AI for IT operations, but the tools they are buying remain quarantined from the environments where they would deliver the most value.

Maimon described this as one of the report’s most significant revelations. “Leaders are eager to adopt these new AI tools, but they don’t trust AI to touch live environments,” he said.
“The lack of trust is shown in the data; 98% have lower trust in AI operating in production than in coding assistants.”

The observability industry built for human-speed engineering is falling short in the age of AI

The findings raise pointed questions about the current generation of observability tools from major vendors like Datadog, Dynatrace, and Splunk. Seventy-seven percent of the engineering leaders surveyed reported low or no confidence that their current observability stack provides enough information to support autonomous root cause analysis or automated incident remediation.

Maimon did not shy away from naming the structural problem. “Major vendors often build ‘closed-garden’ ecosystems where their AI SREs can only reason over data collected by their own proprietary agents,” he said. “In a modern enterprise, teams typically have a multi-tool stack to provide full coverage. By forcing a team into a single-vendor silo, these tools create an uncomfortable dependency and a strategic liability: if the vendor’s data coverage is missing a specific layer, the AI is effectively blind to the root cause.”

The second issue, Maimon argued, is that current observability-backed AI SRE solutions offer only partial visibility — defined by what engineers thought to log at the time of deployment. Because failures rarely follow predefined paths, autonomous root cause analysis using only these tools will frequently miss the key diagnostic evidence. “To move toward true autonomous remediation,” he said, “the industry must shift toward AI SRE without vendor lock-in; AI SREs must be an active participant that can connect across the entire stack and interrogate live code to capture the ground truth of a failure as it happens.”

When asked what it would take to trust AI SREs, the survey’s respondents coalesced unanimously around live runtime visibility.
Fifty-eight percent said they need the ability to provide “evidence traces” of variables at the point of failure, and 42% cited the ability to verify a suggested fix before it actually deploys. No respondents selected the ability to ingest multiple log sources or provide better natural language explanations — suggesting that engineering leaders do not want AI that talks better, but AI that can see better.

The question is no longer whether to use AI for coding — it’s whether anyone can trust what it produces

The survey was administered by Global Surveyz Research, an independent firm, and drew responses from directors, VPs, and C-level executives in SRE and DevOps roles at enterprises with 1,500 or more employees across the finance, technology, and information technology sectors. Responses were collected during January and February 2026, with questions randomized to prevent order bias.

Lightrun, which is backed by $110 million in funding from Accel and Insight Partners and counts AT&T, Citi, Microsoft, Salesforce, and UnitedHealth Group among its enterprise clients, has a clear commercial interest in the problem the report describes: the company sells a runtime observability platform designed to give AI agents and human engineers real-time visibility into live code execution. Its AI SRE product uses a Model Context Protocol connection to generate live diagnostic evidence at the point of failure without requiring redeployment. That commercial interest does not diminish the survey’s findings, which align closely with independent research from Google DORA and the real-world evidence of the Amazon outages.

Taken together, they describe an industry confronting an uncomfortable paradox. AI has solved the slowest part of building software — writing the code — only to reveal that writing was never the hard part. The hard part was always knowing whether it works.
And on that question, the engineers closest to the problem are not optimistic.

“If the live visibility gap is not closed, then teams are really just compounding instability through their adoption of AI,” Maimon said. “Organizations that don’t bridge this gap will find themselves stuck with long redeploy loops to solve ever more complex challenges. They will lose their competitive speed to the very AI tools that were meant to provide it.”

The machines learned to write the code. Nobody taught them to watch it run.
Agentic coding at enterprise scale demands spec-driven development
Presented by AWS

Autonomous agents are compressing software delivery timelines from weeks to days. The enterprises that scale agents safely will be the ones that build using spec-driven development.

There’s a moment in every technology shift where the early adopters stop being outliers and start being the baseline. We’re at that moment in software development, and most teams don’t realize it yet.

A year ago, vibe coding went viral. Non-developers and junior developers discovered they could build beyond their abilities with AI. It lowered the floor. It made prototyping much quicker, but it also introduced a surplus of slop. What the industry then needed was something that raised the ceiling — something that improved code quality and worked the way the most expert developers work. Spec-driven development did that. It laid the foundation for trustworthy autonomous coding agents.

Specs are the trust model for autonomous development

Most discussions of AI-generated code focus on whether AI can write code. The harder question is whether you can trust it. The answer runs directly through the spec.

Spec-driven development starts with a deceptively simple idea: before an AI agent writes a single line of code, it works from a structured, context-rich specification that defines what the system is supposed to do, what its properties are, and what “correct” actually means. That specification is an artifact the agent reasons against throughout the entire development process — fundamentally different from pre-agentic AI approaches of writing documentation after the fact.

Enterprise teams are building on this foundation. The Kiro IDE team used Kiro to build Kiro IDE — an agentic coding environment with native spec-driven development — cutting feature builds from two weeks to two days. An AWS engineering team completed an 18-month rearchitecture project, originally scoped for 30 developers, with six people in 76 days using Kiro.
An Amazon.com engineering team rolled out “Add to Delivery” — a feature that lets shoppers add items after checkout — two months ahead of schedule by using Kiro and spec-driven development. Alexa+, Amazon Finance, Amazon Stores, AWS, Fire TV, Last Mile Delivery, Prime Video, and more all integrate spec-driven development as part of their build approaches.

That shift changes everything downstream.

Verifiable testing is what makes autonomous agents safe to run

The spec becomes an automated correctness engine. When a developer is generating 150 check-ins per week with AI assistance, no human can manually review that volume of code. Instead, code built against a concrete specification can be verified through property-based testing and neurosymbolic AI techniques that automatically generate hundreds of test cases derived directly from the spec, probing edge cases no human would think to write by hand. These tests prove that the code satisfies the spec’s defined properties, going well beyond hand-written test suites to provably correct behavior.

Verifiable testing enables the shift from one-shot programming to continuous autonomous development. Traditional AI-assisted development operates as a single shot: you give the agent a spec, the agent produces output, and the process ends. Today’s agents continuously correct themselves, feeding build and test failures back into their own reasoning, generating additional tests to probe their own output, and iterating until they produce something both functional and verifiable. The spec is the anchor that keeps that loop from drifting.
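The property-based verification described above can be sketched in miniature. This toy uses only the standard library (real pipelines would typically use a framework such as Hypothesis); the “spec” is two properties that a sorting routine must satisfy, and none of this reflects Kiro’s actual verification machinery.

```python
# Hand-rolled sketch of property-based testing against a spec. The spec:
# the output must be ordered and contain exactly the input's elements.
import random
from collections import Counter

def sort_items(xs: list[int]) -> list[int]:
    # Code under verification -- imagine it was agent-generated.
    return sorted(xs)

def check_spec(trials: int = 200) -> bool:
    rng = random.Random(42)
    for _ in range(trials):
        # Generate a random input case rather than hand-writing one.
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        out = sort_items(xs)
        # Property 1: output is non-decreasing.
        assert all(a <= b for a, b in zip(out, out[1:]))
        # Property 2: no elements were added, dropped, or duplicated.
        assert Counter(out) == Counter(xs)
    return True

print(check_spec())  # True: every generated case satisfied the spec
```

The point of the pattern is that the test cases are derived mechanically from the spec’s properties, so coverage scales with compute rather than with how many edge cases a human thought to write down.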
Instead of developers constantly checking in to see if the agent is making the right decisions, the agent can check itself against the spec to make sure it is on the right path.

The autonomous agent of the future will write its own specs, using specifications as the mechanism for self-correction, for verification, and for ensuring that what it produces matches the intended behavior of the system.

Multi-agent, autonomous, and running right now

The developers setting the pace today operate in a fundamentally different way. They spend significant time building their spec, as well as writing steering files used by the spec to make sure the agent knows what and how to build — more time than their agent may spend building the actual software. They run multiple agents in parallel to critique a problem from different perspectives, and run multiple specs, each written for a different component of the system they are building. They let agents run for hours, sometimes days. They use thousands of Kiro credits because the output justifies it.

A year ago, agents would lose context and fall apart after 20 minutes. Now, every week you can run them longer than the week before. Agentic capabilities have improved so significantly in the last six months that genuinely complex problems are now tractable. Newer LLMs are more token-efficient than the previous generation, so for the same spend, you get dramatically more done. The challenge is that doing this well requires deep expertise. The tools, methodologies, and infrastructure exist, but orchestrating them is hard. The goal with Kiro is to bring these capabilities, and the expertise to orchestrate them, to every developer, not just the top one percent who’ve figured it out.

Infrastructure is catching up to ambition

Agents will be ten times more capable within a year. That’s the rate of improvement we’re seeing week over week.

The infrastructure to support that level of capability is converging at the same time.
Agents are now running in the cloud rather than locally, executing in parallel at scale with secure, reliable communication between agent systems. Organizations can now run agentic workloads the way they’d run any enterprise-grade distributed system — with governance, cost controls, and reliability guarantees that serious software demands. Spec-driven development is the architecture of tomorrow’s autonomous systems.

Developers are no longer restricted by how they want to solve the problem. The developers who thrive in this world are the ones building that foundation now: using spec-driven development, prioritizing testability and verification from the start, working with agents as collaborators, and thinking in systems instead of syntax.

Deepak Singh is VP of Kiro at AWS.

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
Five signs data drift is already undermining your security models
Data drift happens when the statistical properties of a machine learning (ML) model’s input data change over time, eventually rendering its predictions less accurate. Cybersecurity professionals who rely on ML for tasks like malware detection and network threat analysis find that undetected data drift can create vulnerabilities. A model trained on old attack patterns may fail to see today’s sophisticated threats. Recognizing the early signs of data drift is the first step in maintaining reliable and efficient security systems.

Why data drift compromises security models

ML models are trained on a snapshot of historical data. When live data no longer resembles this snapshot, the model’s performance dwindles, creating a critical cybersecurity risk. A threat detection model may generate more false negatives by missing real breaches or create more false positives, leading to alert fatigue for security teams.

Adversaries actively exploit this weakness. In 2024, attackers used echo-spoofing techniques to bypass email protection services. By exploiting misconfigurations in the system, they sent millions of spoofed emails that evaded the vendor’s ML classifiers. This incident demonstrates how threat actors can manipulate input data to exploit blind spots. When a security model fails to adapt to shifting tactics, it becomes a liability.

5 indicators of data drift

Security professionals can recognize the presence of drift (or its potential) in several ways.

1. A sudden drop in model performance

Accuracy, precision, and recall are often the first casualties. A consistent decline in these key metrics is a red flag that the model is no longer in sync with the current threat landscape.

Consider Klarna’s success: Its AI assistant handled 2.3 million customer service conversations in its first month and performed work equivalent to 700 agents. This efficiency drove a 25% decline in repeat inquiries and reduced resolution times to under two minutes.
Now imagine if those numbers suddenly reversed because of drift. In a security context, a similar drop in performance does not just mean unhappy clients — it also means successful intrusions and potential data exfiltration.

2. Shifts in statistical distributions

Security teams should monitor the core statistical properties of input features, such as the mean, median, and standard deviation. A significant change in these metrics from the training data could indicate the underlying data has changed.

Monitoring for such shifts enables teams to catch drift before it causes a breach. For example, a phishing detection model might be trained on emails with an average attachment size of 2MB. If the average attachment size suddenly jumps to 10MB due to a new malware-delivery method, the model may fail to classify these emails correctly.

3. Changes in prediction behavior

Even if overall accuracy seems stable, the distribution of predictions might change, a phenomenon often referred to as prediction drift.

For instance, if a fraud detection model historically flagged 1% of transactions as suspicious but suddenly starts flagging 5% or 0.1%, something in the input data has shifted. It might indicate a new type of attack that confuses the model or a change in legitimate user behavior that the model was not trained to identify.

4. An increase in model uncertainty

For models that provide a confidence score or probability with their predictions, a general decrease in confidence can be a subtle sign of drift.

Recent studies highlight the value of uncertainty quantification in detecting adversarial attacks. If the model becomes less sure about its forecasts across the board, it is likely facing data it was not trained on. In a cybersecurity setting, this uncertainty is an early sign of potential model failure, suggesting the model is operating on unfamiliar ground and that its decisions might no longer be reliable.

5. Changes in feature relationships

The correlation between different input features can also change over time. In a network intrusion model, traffic volume and packet size might be highly correlated during normal operations. If that correlation disappears, it can signal a change in network behavior that the model may not understand. A sudden feature decoupling could indicate a new tunneling tactic or a stealthy exfiltration attempt.

Approaches to detecting and mitigating data drift

Common detection methods include the Kolmogorov-Smirnov (KS) test and the population stability index (PSI). These compare the distributions of live and training data to identify deviations. The KS test determines whether two datasets differ significantly, while the PSI measures how much a variable’s distribution has shifted over time.

The mitigation method of choice often depends on how the drift manifests, as distribution changes may occur suddenly. For example, customers’ buying behavior may change overnight with the launch of a new product or a promotion. In other cases, drift may occur gradually over a more extended period. Security teams must learn to adjust their monitoring cadence to capture both rapid spikes and slow burns. Mitigation will typically involve retraining the model on more recent data to reclaim its effectiveness.

Proactively manage drift for stronger security

Data drift is an inevitable reality, and cybersecurity teams can maintain a strong security posture by treating detection as a continuous and automated process. Proactive monitoring and model retraining are fundamental practices to ensure ML systems remain reliable allies against developing threats.

Zac Amos is the Features Editor at ReHack.
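The two detection methods named in the article, the KS test and the PSI, can be sketched with the standard library alone. The ten-bin histogram and the 0.2 PSI alert threshold below are common rules of thumb, not universal constants, and production systems would typically use a tested library such as SciPy instead.

```python
# Stdlib-only sketches of two drift detectors: a two-sample
# Kolmogorov-Smirnov statistic and the population stability index (PSI).
import bisect
import math
import random

def ks_stat(a: list[float], b: list[float]) -> float:
    """Two-sample KS statistic: max gap between the empirical CDFs."""
    a_s, b_s = sorted(a), sorted(b)
    return max(
        abs(bisect.bisect_right(a_s, v) / len(a_s) -
            bisect.bisect_right(b_s, v) / len(b_s))
        for v in a_s + b_s
    )

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population stability index between a training and a live sample."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins
    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        return [max(c / len(sample), 1e-6) for c in counts]  # avoid log(0)
    p, q = fractions(expected), fractions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

rng = random.Random(0)
train = [rng.gauss(2.0, 0.5) for _ in range(5000)]     # e.g. attachment sizes (MB) at training time
stable = [rng.gauss(2.0, 0.5) for _ in range(5000)]    # live data, same distribution
drifted = [rng.gauss(10.0, 2.0) for _ in range(5000)]  # live data after a shift

print(psi(train, stable) < 0.2, psi(train, drifted) > 0.2)          # stable passes, drifted alerts
print(ks_stat(train, stable) < 0.1, ks_stat(train, drifted) > 0.9)  # same verdict from the KS statistic
```

Both detectors agree here because the simulated shift is large; in practice teams track these scores over time and alert when they cross a threshold, which is what lets drift be caught before accuracy metrics visibly degrade.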
Your developers are already running AI locally: Why on-device inference is the CISO’s new blind spot
For the last 18 months, the CISO playbook for generative AI has been relatively simple: Control the browser. Security teams tightened cloud access security broker (CASB) policies, blocked or monitored traffic to well-known AI endpoints, and routed usage through sanctioned gateways. The operating model was clear: If sensitive data leaves the network for an external API call, we can observe it, log it, and stop it. But that model is starting to break.

A quiet hardware shift is pushing large language model (LLM) usage off the network and onto the endpoint. Call it Shadow AI 2.0, or the "bring your own model" (BYOM) era: Employees running capable models locally on laptops, offline, with no API calls and no obvious network signature. The governance conversation is still framed as "data exfiltration to the cloud," but the more immediate enterprise risk is increasingly "unvetted inference inside the device."

When inference happens locally, traditional data loss prevention (DLP) doesn't see the interaction. And when security can't see it, it can't manage it.

Why local inference is suddenly practical

Two years ago, running a useful LLM on a work laptop was a niche stunt. Today, it's routine for technical teams. Three things converged:

Consumer-grade accelerators got serious: A MacBook Pro with 64GB of unified memory can often run quantized 70B-class models at usable speeds (with practical limits on context length).
What once required multi-GPU servers is now feasible on a high-end laptop for many real workflows.

Quantization went mainstream: It's now easy to compress models into smaller, faster formats that fit within laptop memory, often with acceptable quality tradeoffs for many tasks.

Distribution is frictionless: Open-weight models are a single command away, and the tooling ecosystem makes "download → run → chat" trivial.

The result: An engineer can pull down a multi‑GB model artifact, turn off Wi‑Fi, and run sensitive workflows locally, from source code review and document summarization to drafting customer communications, even exploratory analysis over regulated datasets. No outbound packets, no proxy logs, no cloud audit trail. From a network-security perspective, that activity can look indistinguishable from "nothing happened."

The risk isn't only data leaving the company anymore

If the data isn't leaving the laptop, why should a CISO care? Because the dominant risks shift from exfiltration to integrity, provenance, and compliance. In practice, local inference creates three classes of blind spots that most enterprises have not operationalized.

1. Code and decision contamination (integrity risk)

Local models are often adopted because they're fast, private, and "no approval required." The downside is that they're frequently unvetted for the enterprise environment. A common scenario: A senior developer downloads a community-tuned coding model because it benchmarks well. They paste in internal auth logic, payment flows, or infrastructure scripts to "clean it up." The model returns output that looks competent, compiles, and passes unit tests, but subtly degrades security posture (weak input validation, unsafe defaults, brittle concurrency changes, dependency choices that aren't allowed internally). The engineer commits the change.

If that interaction happened offline, you may have no record that AI influenced the code path at all.
And when you later do incident response, you'll be investigating the symptom (a vulnerability) without visibility into a key cause (uncontrolled model usage).

2. Licensing and IP exposure (compliance risk)

Many high-performing models ship with licenses that include restrictions on commercial use, attribution requirements, field-of-use limits, or obligations that can be incompatible with proprietary product development. When employees run models locally, that usage can bypass the organization's normal procurement and legal review process. If a team uses a non-commercial model to generate production code, documentation, or product behavior, the company can inherit risk that shows up later during M&A diligence, customer security reviews, or litigation. The hard part is not just the license terms; it's the lack of inventory and traceability. Without a governed model hub or usage record, you may not be able to prove what was used where.

3. Model supply chain exposure (provenance risk)

Local inference also changes the software supply chain problem. Endpoints begin accumulating large model artifacts and the toolchains around them: downloaders, converters, runtimes, plugins, UI shells, and Python packages. There is a critical technical nuance here: The file format matters. While newer formats like Safetensors are designed to prevent arbitrary code execution, older Pickle-based PyTorch files can execute malicious payloads simply by being loaded. If your developers are grabbing unvetted checkpoints from Hugging Face or other repositories, they aren't just downloading data — they could be downloading an exploit.

Security teams have spent decades learning to treat unknown executables as hostile. BYOM requires extending that mindset to model artifacts and the surrounding runtime stack.
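That mindset can be made concrete with a vetting gate in front of every model load. A minimal sketch, with a hypothetical manifest and invented file names: it refuses pickle-based formats outright and accepts an artifact only if its SHA-256 digest matches a pinned entry — essentially a bill-of-materials check for weights.

```python
import hashlib
from pathlib import Path

# Hypothetical pinned manifest -- an "SBOM for models": each approved
# artifact is recorded as file name -> SHA-256 digest at approval time.
APPROVED_ARTIFACTS = {}

# Pickle-based formats can execute code when loaded, so they are
# rejected regardless of hash. Safetensors/GGUF hold only tensor data.
RISKY_SUFFIXES = {".pt", ".pth", ".bin", ".pkl", ".ckpt"}

def sha256_file(path):
    # Stream the file in 1MB chunks; model artifacts are multi-GB.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def vet_artifact(path):
    # Returns (allowed, reason) so callers can log the denial.
    path = Path(path)
    if path.suffix.lower() in RISKY_SUFFIXES:
        return False, f"{path.suffix} is pickle-based; require safetensors or GGUF"
    expected = APPROVED_ARTIFACTS.get(path.name)
    if expected is None:
        return False, "artifact is not in the approved manifest"
    if sha256_file(path) != expected:
        return False, "hash mismatch: file differs from the pinned version"
    return True, "ok"
```

A real deployment would key the manifest on the digest rather than the file name and pull it from a governed model hub, but the control flow is the same: no known hash, no load.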
The biggest organizational gap today is that most companies have no equivalent of a software bill of materials for models: Provenance, hashes, allowed sources, scanning, and lifecycle management.

Mitigating BYOM: Treat model weights like software artifacts

You can't solve local inference by blocking URLs. You need endpoint-aware controls and a developer experience that makes the safe path the easy path. Here are three practical ways:

1. Move governance down to the endpoint

Network DLP and CASB still matter for cloud usage, but they're not sufficient for BYOM. Start treating local model usage as an endpoint governance problem by looking for specific signals:

Inventory and detection: Scan for high-fidelity indicators like .gguf files larger than 2GB, processes like llama.cpp or Ollama, and local listeners on the common default port 11434.

Process and runtime awareness: Monitor for repeated high GPU/NPU (neural processing unit) utilization from unapproved runtimes or unknown local inference servers.

Device policy: Use mobile device management (MDM) and endpoint detection and response (EDR) policies to control installation of unapproved runtimes and enforce baseline hardening on engineering devices.

The point isn't to punish experimentation. It's to regain visibility.

2. Provide a paved road: An internal, curated model hub

Shadow AI is often an outcome of friction. Approved tools are too restrictive, too generic, or too slow to approve. A better approach is to offer a curated internal catalog that includes:

Approved models for common tasks (coding, summarization, classification)

Verified licenses and usage guidance

Pinned versions with hashes (prioritizing safer formats like Safetensors)

Clear documentation for safe local usage, including where sensitive data is and isn't allowed

If you want developers to stop scavenging, give them something better.

3. Update policy language: "Cloud services" isn't enough anymore

Most acceptable use policies talk about SaaS and cloud tools.
BYOM requires policy that explicitly covers:

Downloading and running model artifacts on corporate endpoints

Acceptable sources

License compliance requirements

Rules for using models with sensitive data

Retention and logging expectations for local inference tools

This doesn't need to be heavy-handed. It needs to be unambiguous.

The perimeter is shifting back to the device

For a decade we moved security controls "up" into the cloud. Local inference is pulling a meaningful slice of AI activity back "down" to the endpoint.

5 signals shadow AI has moved to endpoints:

Large model artifacts: Unexplained storage consumption by .gguf or .pt files.

Local inference servers: Processes listening on ports like 11434 (Ollama).

GPU utilization patterns: Spikes in GPU usage while offline or disconnected from VPN.

Lack of model inventory: Inability to map code outputs to specific model versions.

License ambiguity: Presence of "non-commercial" model weights in production builds.

Shadow AI 2.0 isn't a hypothetical future; it's a predictable consequence of fast hardware, easy distribution, and developer demand. CISOs who focus only on network controls will miss what's happening on the silicon sitting right on employees' desks. The next phase of AI governance is less about blocking websites and more about controlling artifacts, provenance, and policy at the endpoint, without killing productivity.

Jayachander Reddy Kandakatla is a senior MLOps engineer.
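For teams that want a starting point before full EDR/MDM coverage lands, the first two signals on that list can be approximated in a few lines of Python. A minimal sketch, not a production detector; the 2GB threshold and port 11434 come directly from the indicators above:

```python
import socket
from pathlib import Path

# Extensions that commonly hold local model weights, plus the size
# threshold from the indicator list above.
MODEL_SUFFIXES = {".gguf", ".pt", ".safetensors"}
SIZE_THRESHOLD = 2 * 1024**3  # 2GB

def find_model_artifacts(root):
    # Signal 1: unexplained storage consumption by large weight files.
    hits = []
    for path in Path(root).rglob("*"):
        if (path.is_file()
                and path.suffix.lower() in MODEL_SUFFIXES
                and path.stat().st_size >= SIZE_THRESHOLD):
            hits.append(path)
    return hits

def local_inference_server_running(port=11434, host="127.0.0.1"):
    # Signal 2: something listening on Ollama's default port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0
```

A real deployment would also watch process names and GPU utilization, and would feed results into the same inventory that tracks other software assets rather than an ad hoc script.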
AI agent credentials live in the same box as untrusted code. Two new architectures show where the blast radius actually stops.
Four separate RSAC 2026 keynotes arrived at the same conclusion without coordinating. Microsoft's Vasu Jakkal told attendees that zero trust must extend to AI. Cisco's Jeetu Patel called for a shift from access control to action control, saying in an exclusive interview with VentureBeat that agents behave "more like teenagers, supremely intelligent, but with no fear of consequence." CrowdStrike's George Kurtz identified AI governance as the biggest gap in enterprise technology. Splunk's John Morgan called for an agentic trust and governance model. Four companies. Four stages. One problem.

Matt Caulfield, VP of Product for Identity and Duo at Cisco, put it bluntly in an exclusive VentureBeat interview at RSAC. "While the concept of zero trust is good, we need to take it a step further," Caulfield said. "It's not just about authenticating once and then letting the agent run wild. It's about continuously verifying and scrutinizing every single action the agent's trying to take, because at any moment, that agent can go rogue."

Seventy-nine percent of organizations already use AI agents, according to PwC's 2025 AI Agent Survey. Only 14.4% reported full security approval for their entire agent fleet, per the Gravitee State of AI Agent Security 2026 report of 919 organizations in February 2026. A CSA survey presented at RSAC found that only 26% have AI governance policies. CSA's Agentic Trust Framework describes the resulting gap between deployment velocity and security readiness as a governance emergency.

Cybersecurity leaders and industry executives at RSAC agreed on the problem. Then two companies shipped architectures that answer the question differently. The gap between their designs reveals where the real risk sits.

The monolithic agent problem that security teams are inheriting

The default enterprise agent pattern is a monolithic container. The model reasons, calls tools, executes generated code, and holds credentials in one process.
Every component trusts every other component. OAuth tokens, API keys, and git credentials sit in the same environment where the agent runs code it wrote seconds ago. A prompt injection gives the attacker everything. Tokens are exfiltrable. Sessions are spawnable. The blast radius is not the agent. It is the entire container and every connected service.

The CSA and Aembit survey of 228 IT and security professionals quantifies how common this remains: 43% use shared service accounts for agents, 52% rely on workload identities rather than agent-specific credentials, and 68% cannot distinguish agent activity from human activity in their logs. No single function claimed ownership of AI agent access. Security said it was a developer's responsibility. Developers said it was a security responsibility. Nobody owned it.

CrowdStrike CTO Elia Zaitsev, in an exclusive VentureBeat interview, said the pattern should look familiar. "A lot of what securing agents look like would be very similar to what it looks like to secure highly privileged users. They have identities, they have access to underlying systems, they reason, they take action," Zaitsev said. "There's rarely going to be one single solution that is the silver bullet. It's a defense in depth strategy."

CrowdStrike CEO George Kurtz highlighted ClawHavoc (a supply chain campaign targeting the OpenClaw agentic framework) during his RSAC keynote. Koi Security named the campaign on February 1, 2026. Antiy CERT confirmed 1,184 malicious skills tied to 12 publisher accounts, according to multiple independent analyses of the campaign. Snyk's ToxicSkills research found that 36.8% of the 3,984 ClawHub skills scanned contain security flaws at any severity level, with 13.4% rated critical. Average breakout time has dropped to 29 minutes. Fastest observed: 27 seconds.
(CrowdStrike 2026 Global Threat Report)

Anthropic separates the brain from the hands

Anthropic's Managed Agents, launched April 8 in public beta, split every agent into three components that do not trust each other: a brain (Claude and the harness routing its decisions), hands (disposable Linux containers where code executes), and a session (an append-only event log outside both). Separating instructions from execution is one of the oldest patterns in software: microservices, serverless functions, and message queues all rely on it.

Credentials never enter the sandbox. Anthropic stores OAuth tokens in an external vault. When the agent needs to call an MCP tool, it sends a session-bound token to a dedicated proxy. The proxy fetches the real credentials from the vault, makes the external call, and returns the result. The agent never sees the actual token. Git tokens get wired into the local remote at sandbox initialization, so push and pull work without the agent touching the credential. For security directors, this means a compromised sandbox yields nothing an attacker can reuse.

The security gain arrived as a side effect of a performance fix. Anthropic decoupled the brain from the hands so inference could start before the container booted. Median time to first token dropped roughly 60%. The zero-trust design is also the fastest design, which kills the enterprise objection that security adds latency.

Session durability is the third structural gain. A container crash in the monolithic pattern means total state loss. In Managed Agents, the session log persists outside both brain and hands. If the harness crashes, a new one boots, reads the event log, and resumes. That absence of lost state turns into a productivity gain over time. Managed Agents include built-in session tracing through the Claude Console.

Pricing: $0.08 per session-hour of active runtime, idle time excluded, plus standard API token costs.
Security directors can now model agent compromise cost per session-hour against the cost of the architectural controls.

Nvidia locks the sandbox down and monitors everything inside it

Nvidia's NemoClaw, released March 16 in early preview, takes the opposite approach. It does not separate the agent from its execution environment. It wraps the entire agent inside stacked security layers and watches every move. Anthropic and Nvidia are the only two vendors to have shipped zero-trust agent architectures publicly as of this writing; others are in development.

NemoClaw stacks five enforcement layers between the agent and the host. Sandboxed execution uses Landlock, seccomp, and network namespace isolation at the kernel level. Default-deny outbound networking forces every external connection through explicit operator approval via YAML-based policy. Access runs with minimal privileges. A privacy router directs sensitive queries to locally running Nemotron models, cutting token cost and data leakage to zero. The layer that matters most to security teams is intent verification: OpenShell's policy engine intercepts every agent action before it touches the host.

The trade-off for organizations evaluating NemoClaw is straightforward: stronger runtime visibility costs more operator staffing. The agent does not know it is inside NemoClaw. In-policy actions return normally. Out-of-policy actions get a configurable denial.

Observability is the strongest layer. A real-time terminal user interface (TUI) logs every action, every network request, every blocked connection. The audit trail is complete. The problem is cost: operator load scales linearly with agent activity. Every new endpoint requires manual approval. Observation quality is high; autonomy is low. That ratio gets expensive fast in production environments running dozens of agents.

Durability is the gap nobody's talking about. Agent state persists as files inside the sandbox. If the sandbox fails, the state goes with it.
No external session recovery mechanism exists. Long-running agent tasks carry a durability risk that security teams need to price into deployment planning before they hit production.

The credential proximity gap

Both architectures are a real step up from the monolithic default. Where they diverge is the question that matters most to security teams: how close do credentials sit to the execution environment?

Anthropic removes credentials from the blast radius entirely. If an attacker compromises the sandbox through prompt injection, they get a disposable container with no tokens and no persistent state. Exfiltrating credentials requires a two-hop attack: influence the brain's reasoning, then convince it to act through a container that holds nothing worth stealing. Single-hop exfiltration is structurally eliminated.

NemoClaw constrains the blast radius and monitors every action inside it. Its security layers limit lateral movement, and default-deny networking blocks unauthorized connections. But the agent and its generated code share the same sandbox. Nvidia's privacy router keeps inference credentials on the host, outside the sandbox, and inference API keys are proxied through the router rather than passed into the sandbox directly. Messaging and integration tokens (Telegram, Slack, Discord), however, are injected into the sandbox as runtime environment variables. The exposure varies by credential type. Credentials are policy-gated, not structurally removed.

That distinction matters most for indirect prompt injection, where an adversary embeds instructions in content the agent queries as part of legitimate work. A poisoned web page. A manipulated API response. The intent verification layer evaluates what the agent proposes to do, not the content of data returned by external tools. Injected instructions enter the reasoning chain as trusted context.
And that context sits in direct proximity to execution. In the Anthropic architecture, indirect injection can influence reasoning but cannot reach the credential vault. In the NemoClaw architecture, injected context sits next to both reasoning and execution inside the shared sandbox. That is the widest gap between the two designs.

NCC Group's David Brauchler, Technical Director and Head of AI/ML Security, advocates for gated agent architectures built on trust segmentation principles, where AI systems inherit the trust level of the data they process: untrusted input, restricted capabilities. Both Anthropic and Nvidia move in this direction. Neither fully arrives.

The zero-trust architecture audit for AI agents

The audit grid covers three vendor patterns across six security dimensions, five actions per row. It distills to five priorities:

1. Audit every deployed agent for the monolithic pattern. Flag any agent holding OAuth tokens in its execution environment. The CSA data shows 43% use shared service accounts. Those are the first targets.

2. Require credential isolation in agent deployment RFPs. Specify whether the vendor removes credentials structurally or gates them through policy. Both reduce risk, but by different amounts and with different failure modes.

3. Test session recovery before production. Kill a sandbox mid-task. Verify state survives. If it does not, long-horizon work carries a data-loss risk that compounds with task duration.

4. Staff for the observability model. Anthropic's console tracing integrates with existing observability workflows. NemoClaw's TUI requires an operator in the loop. The staffing math is different.

5. Track indirect prompt injection roadmaps. Neither architecture fully resolves this vector. Anthropic limits the blast radius of a successful injection. NemoClaw catches malicious proposed actions but not malicious returned data. Require vendor roadmap commitments on this specific gap.

Zero trust for AI agents stopped being a research topic the moment two architectures shipped.
The monolithic default is a liability. The 65-point gap between deployment velocity and security approval is where the next class of breaches will start.
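The structural difference between the two designs comes down to where a real token can appear. The sketch below illustrates the vault-plus-proxy pattern in the abstract; the class names and flow are hypothetical, not Anthropic's actual implementation. The point is that the sandboxed agent holds only a revocable session handle, and only the proxy ever touches the stored credential.

```python
import secrets

class CredentialVault:
    # Holds real credentials outside the sandbox and maps opaque
    # session handles to the services each session may use.
    def __init__(self):
        self._tokens = {}    # service name -> real credential
        self._sessions = {}  # session handle -> set of allowed services

    def store(self, service, token):
        self._tokens[service] = token

    def issue_session(self, allowed_services):
        # The only thing the sandbox ever receives: a random handle.
        handle = secrets.token_urlsafe(16)
        self._sessions[handle] = set(allowed_services)
        return handle

    def resolve(self, handle, service):
        if service not in self._sessions.get(handle, set()):
            raise PermissionError(f"session not authorized for {service}")
        return self._tokens[service]

class ToolProxy:
    # The only component that sees real credentials. The agent sends
    # its session handle; the proxy attaches the secret, makes the
    # outbound call, and returns only the result to the sandbox.
    def __init__(self, vault):
        self._vault = vault

    def call(self, handle, service, request):
        token = self._vault.resolve(handle, service)
        # A real system would make an HTTPS call here with `token` in
        # an Authorization header. The response -- never the token --
        # goes back to the sandbox.
        return {"service": service, "status": 200, "echo": request}
```

Under this pattern, a prompt-injected sandbox can only ask the proxy to act within its session's scope; it cannot read, forward, or persist the underlying credential, which is exactly the single-hop exfiltration the article says is structurally eliminated.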
Intuit compressed months of tax code implementation into hours — and built a workflow any regulated-industry team can adapt
When the One Big Beautiful Bill arrived as a 900-page unstructured document — with no standardized schema, no published IRS forms, and a hard shipping deadline — Intuit's TurboTax team had a question: Could AI compress a months-long implementation into days without sacrificing accuracy?

What they built to do it is less a tax story than a template: a workflow combining commercial AI tools, a proprietary domain-specific language and a custom unit test framework that any domain-constrained development team can learn from.

Joy Shaw, director of tax at Intuit, has spent more than 30 years at the company and lived through both the Tax Cuts and Jobs Act and the OBBB. "There was a lot of noise in the law itself and we were able to pull out the tax implications, narrow it down to the individual tax provisions, narrow it down to our customers," Shaw told VentureBeat. "That kind of distillation was really fast using the tools, and then enabled us to start coding even before we got forms and instructions in."

How the OBBB raised the bar

When the Tax Cuts and Jobs Act passed in 2017, the TurboTax team worked through the legislation without AI assistance. It took months, and the accuracy requirements left no room for shortcuts. "We used to have to go through the law and we'd code sections that reference other law code sections and try and figure it out on our own," Shaw said.

The OBBB arrived with the same accuracy requirements but a different profile. At 900-plus pages, it was structurally more complex than the TCJA. It came as an unstructured document with no standardized schema. The House and Senate versions used different language to describe the same provisions. And the team had to begin implementation before the IRS had published official forms or instructions.

The question was whether AI tools could compress the timeline without compromising the output.
The answer required a specific sequence and tooling that did not exist yet.

From unstructured document to domain-specific code

The OBBB was still moving through Congress when the TurboTax team began working on it. Using large language models, the team summarized the House version, then the Senate version, and then reconciled the differences. Both chambers referenced the same underlying tax code sections, a consistent anchor point that let the models draw comparisons across structurally inconsistent documents. By signing day, the team had already filtered provisions to those affecting TurboTax customers, narrowed to specific tax situations and customer profiles. Parsing, reconciliation and provision filtering moved from weeks to hours.

Those tasks were handled by ChatGPT and general-purpose LLMs. But those tools hit a hard limit when the work shifted from analysis to implementation. TurboTax does not run on a standard programming language. Its tax calculation engine is built on a proprietary domain-specific language maintained internally at Intuit. Any model generating code for that codebase has to translate legal text into syntax it was never trained on, and identify how new provisions interact with decades of existing code without breaking what already works.

Claude became the primary tool for that translation and dependency-mapping work. Shaw said it could identify what changed and what did not, letting developers focus only on the new provisions.
"It's able to integrate with the things that don't change and identify the dependencies on what did change," she said. "That sped up the process of development and enabled us to focus only on those things that did change."

Building tooling matched to a near-zero error threshold

General-purpose LLMs got the team to working code. Getting that code to shippable quality required two proprietary tools built during the OBBB cycle.

The first auto-generated TurboTax product screens directly from the law changes. Previously, developers curated those screens individually for each provision. The new tool handled the majority automatically, with manual customization only where needed.

The second was a purpose-built unit test framework. Intuit had always run automated tests, but the previous system produced only pass/fail results. When a test failed, developers had to manually open the underlying tax return data file to trace the cause.
"The automation would tell you pass, fail, you would have to dig into the actual tax data file to see what might have been wrong," Shaw said. The new framework identifies the specific code segment responsible, generates an explanation and allows the correction to be made inside the framework itself.

Shaw said accuracy for a consumer tax product has to be close to 100 percent. Sarah Aerni, Intuit's VP of technology for the Consumer Group, said the architecture has to produce deterministic results.
"Having the types of capabilities around determinism and verifiably correct through tests — that's what leads to that sort of confidence," Aerni said.

The tooling handles the speed. But Intuit also uses LLM-based evaluation tools to validate AI-generated output, and even those require a human tax expert to assess whether the result is correct. "It comes down to having human expertise to be able to validate and verify just about anything," Aerni said.

Four components any regulated-industry team can use

The OBBB was a tax problem, but the underlying conditions are not unique to tax. Healthcare, financial services, legal tech and government contracting teams regularly face the same combination: complex regulatory documents, hard deadlines, proprietary codebases, and near-zero error tolerance. Based on Intuit's implementation, four elements of the workflow are transferable to other domain-constrained development environments:

1. Use commercial LLMs for document analysis. General-purpose models handle parsing, reconciliation and provision filtering well. That is where they add speed without creating accuracy risk.

2. Shift to domain-aware tooling when analysis becomes implementation. General-purpose models generating code into a proprietary environment without understanding it will produce output that cannot be trusted at scale.

3. Build evaluation infrastructure before the deadline, not during the sprint. Generic automated testing produces pass/fail outputs. Domain-specific test tooling that identifies failures and enables in-context fixes is what makes AI-generated code shippable.

4. Deploy AI tools across the whole organization, not just engineering. Shaw said Intuit trained and monitored usage across all functions. AI fluency was distributed across the organization rather than concentrated in early adopters.

"We continue to lean into the AI and human intelligence opportunity here, so that our customers get what they need out of the experiences that we build," Aerni said.
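Intuit's framework and DSL are proprietary, but the idea of tests that name the responsible rule translates to any deterministic calculation engine. A toy sketch of the pattern — the field names, rules, and flat 12% rate below are invented for illustration, not real tax logic:

```python
class RuleEngine:
    # Minimal stand-in for a domain-specific calculation engine: each
    # output field is computed by a named rule with declared
    # dependencies, so a failing test can point at the rule at fault.
    def __init__(self):
        self._rules = {}  # field -> (function, dependency fields)

    def rule(self, field, deps=()):
        def register(fn):
            self._rules[field] = (fn, deps)
            return fn
        return register

    def compute(self, field, inputs):
        if field in inputs:
            return inputs[field]
        fn, deps = self._rules[field]
        return fn(*(self.compute(d, inputs) for d in deps))

    def check(self, field, inputs, expected):
        # Pass/fail plus an explanation: which rule produced the bad
        # value, and which intermediate values it saw.
        fn, deps = self._rules[field]
        dep_values = {d: self.compute(d, inputs) for d in deps}
        actual = fn(*dep_values.values())
        if actual == expected:
            return {"ok": True}
        return {"ok": False, "rule": field, "saw": dep_values,
                "expected": expected, "actual": actual}

engine = RuleEngine()

@engine.rule("taxable_income", deps=("agi", "deduction"))
def taxable_income(agi, deduction):
    return max(agi - deduction, 0)

@engine.rule("tax", deps=("taxable_income",))
def tax(taxable_income):
    return round(taxable_income * 0.12, 2)  # hypothetical flat bracket
```

A failing check here returns the rule name and the intermediate values it consumed, which is the difference between "test failed" and "the tax rule saw taxable_income of 35,000 and produced the wrong amount" — the same diagnostic jump Shaw describes.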
OpenAI introduces ChatGPT Pro $100 tier with 5X usage limits for Codex compared to Plus
OpenAI is making moves to try and court more developers and vibe coders (those who build software using AI models and natural language) away from rivals like Anthropic.

Today, the firm arguably most synonymous with the generative AI boom announced it will begin offering a new, more mid-range subscription tier — a $100 ChatGPT Pro plan — which joins its free, Go ($8 monthly), Plus ($20 monthly) and existing Pro ($200 monthly) plans for individuals using ChatGPT and related OpenAI products. OpenAI also currently offers Edu, Business ($25 per user monthly, formerly known as Team) and Enterprise (variably priced) plans for organizations in those sectors.

Why offer a $100 monthly ChatGPT Pro plan?

So why introduce a new $100 ChatGPT Pro plan, then? The big selling point from OpenAI is that the new plan offers five times greater usage limits on Codex, the company's agentic vibe coding application/harness (the name is shared by both, as well as a lineup of coding-specific language models), than the existing $20 monthly Plus plan, which seems fair given the math ($20 × 5 = $100).

As OpenAI co-founder and CEO Sam Altman wrote in a post on X: "It is very nice to see Codex getting so much love. We are launching a $100 ChatGPT Pro tier by very popular demand."

However, alongside this, OpenAI's official company account on X noted that "we're rebalancing Codex usage in [ChatGPT] Plus to support more sessions throughout the week, rather than longer sessions in a single day." That sounds a lot like OpenAI is also simultaneously reducing how much ChatGPT Plus users can use its Codex harness and application per day.

What are the new usage limits for the new $100 ChatGPT Pro plan vs. the $20 Plus?

So, what are the current limits on the $20 Plus plan? The new Pro plan gives you 5X greater than…what?
Turns out, this is trickier than you'd think to calculate, because it actually varies depending on which underlying AI model you are using to power the Codex application or harness, and whether you are working on code stored in the cloud or locally on your machine or servers. OpenAI's Developer website notes that for individual users, usage is categorized by "Local Messages" (tasks run on the user's machine) and "Cloud Tasks" (tasks run on OpenAI's infrastructure), both of which share a five-hour rolling window. Thus, the breakdown looks like this:

ChatGPT Plus ($20/month)

GPT-5.4: 33–168 local messages every 5 hours

GPT-5.4-mini: 110–560 local messages every 5 hours

GPT-5.3-Codex: 45–225 local messages and 10–60 cloud tasks every 5 hours

Code reviews: 10–25 pull requests per week

ChatGPT Pro 5X ($100/month)

GPT-5.4: 330–1,680 local messages every 5 hours

GPT-5.4-mini: 1,100–5,600 local messages every 5 hours

GPT-5.3-Codex: 450–2,250 local messages and 100–600 cloud tasks every 5 hours

Code reviews: 100–250 pull requests per week

ChatGPT Pro 20X ($200/month)

GPT-5.4: 660–3,360 local messages every 5 hours

GPT-5.4-mini: 2,200–11,200 local messages every 5 hours

GPT-5.3-Codex: 900–4,500 local messages and 200–1,200 cloud tasks every 5 hours

Code reviews: 200–500 pull requests per week

Exclusive access: Includes GPT-5.3-Codex-Spark (research preview), which has its own dynamic usage limit

And as OpenAI's Help documentation states: "The number of Codex messages you can send within these limits varies based on the size and complexity of your coding tasks, and where you execute tasks.
Small scripts or simple functions may only consume a fraction of your allowance, while larger codebases, long running tasks, or extended sessions that require Codex to hold more context will use significantly more per message."

The larger strategic implications and context

OpenAI's sudden move toward the $100 price point and expanded agentic capacity comes amid the unprecedented financial ascent of its chief rival, Anthropic. Just days ago, Anthropic revealed its annualized run-rate revenue (ARR) has topped $30 billion, surpassing OpenAI's last reported ARR of approximately $24–$25 billion. This growth has been fueled by the massive adoption of Claude Code and Claude Cowork, products that have set the benchmark for enterprise-grade autonomous coding.

The competitive friction intensified on April 4, 2026, when Anthropic officially blocked Claude subscriptions from being used to provide the intelligence for third-party agentic AI harnesses like OpenClaw. To be clear, Anthropic's Claude models themselves can still be used with OpenClaw; users must now simply pay for access to Claude models through Anthropic's application programming interface (API) or extra usage credits, rather than as part of the monthly Claude subscription tiers (which some have likened to an "all-you-can-eat" buffet, making the economics challenging for Anthropic when power users and third-party harnesses like OpenClaw consume more than the $20 or $200 the user spends monthly on the plans in tokens).
OpenClaw’s creator, Peter Steinberger, was notably hired by OpenAI in February 2026 to lead its personal agent strategy, and since joining he has actively spoken out against Anthropic’s limitations, noting that OpenAI’s Codex and models generally don’t carry the same restrictions Anthropic is now imposing. By hiring Steinberger and subsequently launching a Pro tier that provides the high-volume capacity Anthropic recently restricted, OpenAI is effectively courting the displaced OpenClaw community to reclaim the professional developer market.
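Mechanically, a shared five-hour rolling window like the one OpenAI describes behaves differently from a fixed reset: capacity returns gradually as old uses age out. The sketch below is purely illustrative; the class name, limits, and accounting are assumptions for explanation, not OpenAI's implementation.

```python
from collections import deque
import time

class RollingWindowQuota:
    """Illustrative tracker for a shared rolling usage window.

    Local messages and cloud tasks would both draw from the same
    window, loosely mirroring the shared five-hour window above.
    """

    def __init__(self, limit, window_seconds=5 * 3600):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()  # timestamps of counted uses

    def _prune(self, now):
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()

    def try_consume(self, now=None):
        """Record one use if quota remains; return True on success."""
        now = time.monotonic() if now is None else now
        self._prune(now)
        if len(self.events) >= self.limit:
            return False
        self.events.append(now)
        return True

    def remaining(self, now=None):
        now = time.monotonic() if now is None else now
        self._prune(now)
        return self.limit - len(self.events)

# Example: a hypothetical 45-message limit per 5-hour window.
quota = RollingWindowQuota(limit=45)
```

The key property is that allowance recovers continuously rather than all at once, which is why heavy bursts early in a session leave less headroom for the hours that follow.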
Mythos autonomously exploited vulnerabilities that survived 27 years of human review. Security teams need a new detection playbook
A 27-year-old bug sat inside OpenBSD’s TCP stack while auditors reviewed the code, fuzzers ran against it, and the operating system earned its reputation as one of the most security-hardened platforms on earth. Two packets could crash any server running it. Finding that bug cost a single Anthropic discovery campaign approximately $20,000. The specific model run that surfaced the flaw cost under $50. Anthropic’s Claude Mythos Preview found it. Autonomously. No human guided the discovery after the initial prompt.

The capability jump is not incremental

On Firefox 147 exploit writing, Mythos succeeded 181 times versus 2 for Claude Opus 4.6. A 90x improvement in a single generation. SWE-bench Pro: 77.8% versus 53.4%. CyberGym vulnerability reproduction: 83.1% versus 66.6%. Mythos saturated Anthropic’s Cybench CTF at 100%, forcing the red team to shift to real-world zero-day discovery as the only meaningful evaluation left. Then it surfaced thousands of zero-day vulnerabilities across every major operating system and every major browser, many one to two decades old. Anthropic engineers with no formal security training asked Mythos to find remote code execution vulnerabilities overnight and woke up to a complete, working exploit by morning, according to Anthropic’s red team assessment.

Anthropic assembled Project Glasswing, a 12-partner defensive coalition including CrowdStrike, Cisco, Palo Alto Networks, Microsoft, AWS, Apple, and the Linux Foundation, backed by $100 million in usage credits and $4 million in open-source grants. Over 40 additional organizations that build or maintain critical software infrastructure also received access. The partners have been running Mythos against their own infrastructure for weeks. Anthropic committed to a public findings report “within 90 days,” landing in early July 2026. Security directors got the announcement. They didn’t get the playbook.
“I’ve been in this industry for 27 years,” Cisco SVP and Chief Security and Trust Officer Anthony Grieco told VentureBeat in an exclusive interview at RSAC 2026. “I have never been more optimistic for what we can do to change security because of the velocity. It’s also a little bit terrifying because we’re moving so quickly. It’s also terrifying because our adversaries have this capability as well, and so frankly, we must move this quickly.”

Security directors saw this story told fifteen different ways this week, including VentureBeat’s exclusive interview with Anthropic’s Newton Cheng. As one widely shared X post summarizing the Mythos findings noted, the model cracked cryptography libraries, broke into a production virtual machine monitor, and gave engineers with zero security training working exploits by morning. What that coverage left unanswered: Where does the detection ceiling sit in the methods they already run, and what should they change before July?

Seven vulnerability classes that show where every detection method hits its ceiling

OpenBSD TCP SACK, 27 years old. Two crafted packets crash any server. SAST, fuzzers, and auditors missed a logic flaw requiring semantic reasoning about how TCP options interact under adversarial conditions. Campaign cost ~$20,000. Anthropic notes the $50 per-run figure reflects hindsight.

FFmpeg H.264 codec, 16 years old. Fuzzers exercised the vulnerable code path 5 million times without triggering the flaw, according to Anthropic. Mythos caught it by reasoning about code semantics. Campaign cost ~$10,000.

FreeBSD NFS remote code execution, CVE-2026-4747, 17 years old. Unauthenticated root from the internet, per Anthropic’s assessment and independent reproduction. Mythos built a 20-gadget ROP chain split across multiple packets. Fully autonomous.

Linux kernel local privilege escalation. Mythos chained two to four low-severity vulnerabilities into full local privilege escalation via race conditions and KASLR bypasses.
CSA’s Rich Mogull noted Mythos failed at remote kernel exploitation but succeeded locally. No automated tool chains vulnerabilities today.

Browser zero-days across every major browser. Thousands identified. Some required human-model collaboration. In one case, Mythos chained four vulnerabilities into a JIT heap spray, escaping both the renderer and the OS sandboxes. Firefox 147: 181 working exploits versus two for Opus 4.6.

Cryptography library vulnerabilities (TLS, AES-GCM, SSH). Implementation flaws enabling certificate forgery or decryption of encrypted communications, per Anthropic’s red team blog and Help Net Security. A critical Botan library certificate bypass was disclosed the same day as the Glasswing announcement. Bugs in the code that implements the math. Not attacks on the math itself.

Virtual machine monitor guest-to-host escape. Guest-to-host memory corruption in a production VMM, the technology keeping cloud workloads from seeing each other’s data. Cloud security architectures assume workload isolation holds. This finding breaks that assumption.

Nicholas Carlini, in Anthropic’s launch briefing: “I’ve found more bugs in the last couple of weeks than I found in the rest of my life combined.”

VentureBeat’s prescriptive matrix

| Vulnerability class | Why current methods miss it | What Mythos does | Security director action |
| --- | --- | --- | --- |
| OS kernel logic (OpenBSD 27yr, Linux 2-4 chain) | SAST lacks semantic reasoning. Fuzzers miss logic flaws. Pen testers time-boxed. Bounties scope-exclude kernel. | Chains 2-4 low-severity findings into local priv-esc. ~$20K campaign. | Add AI-assisted kernel review to pen test RFPs. Expand bounty scope. Request Glasswing findings from OS vendors before July. Re-score clustered findings by chainability. |
| Media codec (FFmpeg 16yr H.264) | SAST unflagged. Fuzzers hit path 5M times, never triggered. | Reasons about semantics beyond brute-force. ~$10K campaign. | Inventory FFmpeg, libwebp, ImageMagick, libpng. Stop treating fuzz coverage as security proxy. Track Glasswing codec CVEs from July. |
| Network stack RCE (FreeBSD 17yr, CVE-2026-4747) | DAST limited at protocol depth. Pen tests skip NFS. | Full autonomous chain to unauthenticated root. 20-gadget ROP chain. | Patch CVE-2026-4747 now. Inventory NFS/SMB/RPC services. Add protocol fuzzing to 2026 cycle. |
| Multi-vuln chaining (2-4 sequenced, local) | No tool chains. Pen testers hours-limited. CVSS scores in isolation. | Autonomous local chaining via race conditions + KASLR bypass. | Require AI-assisted chaining in pen test methodology. Build chainability scoring. Budget AI red teams for 2026. |
| Browser zero-days (thousands, 181 Firefox exploits) | Bounties + continuous fuzzing missed thousands. Some required human-model collaboration. | 90x over Opus 4.6. Chained 4 vulns into JIT heap spray escaping renderer + OS sandbox. | Shorten patch SLA to 72hr critical. Pre-stage pipeline for July cycle. Pressure vendors for Glasswing timelines. |
| Crypto libraries (TLS, AES-GCM, SSH, Botan bypass) | SAST limited on crypto logic. Pen testers rarely audit crypto depth. Formal verification not standard. | Found cert forgery + decryption flaws in battle-tested libraries. | Audit all crypto library versions now. Track Glasswing crypto CVEs from July. Accelerate PQC migration. |
| VMM / hypervisor (guest-to-host memory corruption) | Cloud security assumes isolation. Few pen tests target hypervisor. Bounties rarely scope VMM. | Guest-to-host escape in production VMM. | Inventory hypervisor/VMM versions. Request Glasswing findings from cloud providers. Reassess multi-tenant isolation assumptions. |

Attackers are faster. Defenders are patching once a year.

The CrowdStrike 2026 Global Threat Report documents a 29-minute average eCrime breakout time, 65% faster than 2024, with an 89% year-over-year surge in AI-augmented attacks. CrowdStrike CTO Elia Zaitsev put the operational reality plainly in an exclusive interview with VentureBeat.
“Adversaries leveraging agentic AI can perform those attacks at such a great speed that a traditional human process of look at alert, triage, investigate for 15 to 20 minutes, take an action an hour, a day, a week later, it’s insufficient,” Zaitsev said. A $20,000 Mythos discovery campaign that runs in hours replaces months of nation-state research effort.

CrowdStrike CEO George Kurtz reinforced that timeline pressure on LinkedIn the same day as the Glasswing announcement. “AI is creating the largest security demand driver since enterprises moved to the cloud,” Kurtz wrote.

The regulatory clock compounds the operational one. The EU AI Act’s next enforcement phase takes effect August 2, 2026, imposing automated audit trails, cybersecurity requirements for every high-risk AI system, incident reporting obligations, and penalties up to 3% of global revenue. Security directors face a two-wave sequence: July’s Glasswing disclosure cycle, then August’s compliance deadline.

Mike Riemer, Field CISO at Ivanti and a 25-year US Air Force veteran who works closely with federal cybersecurity agencies, told VentureBeat what he is hearing from the government. “Threat actors are reverse engineering patches, and the speed at which they’re doing it has been enhanced greatly by AI,” Riemer said. “They’re able to reverse engineer a patch within 72 hours. So if I release a patch and a customer doesn’t patch within 72 hours of that release, they’re open to exploit.” Riemer was blunt about where that leaves the industry. “They are so far in front of us as defenders,” he said.

Grieco confirmed the other side of that collision at RSAC 2026. “If you talk to an operational team and many of our customers, they’re only patching once a year,” Grieco told VentureBeat. “And frankly, even in the best of circumstances, that is not fast enough.”

CSA’s Mogull makes the structural case that defenders hold the long-term advantage: fix a vulnerability once and every deployment benefits.
But the transition period, when attackers reverse-engineer patches in 72 hours and defenders patch once a year, favors offense.

Mythos is not the only model finding these bugs. Researchers at AISLE, an AI cybersecurity startup, tested Anthropic’s showcase vulnerabilities on small, open-weights models and found that eight out of eight models detected the FreeBSD exploit. AISLE says one model had only 3.6 billion parameters and costs 11 cents per million tokens, and that a 5.1-billion-parameter open model recovered the core analysis chain of the 27-year-old OpenBSD bug. AISLE’s conclusion: “The moat in AI cybersecurity is the system, not the model.” That makes the detection ceiling a structural problem, not a Mythos-specific one. Cheap models find the same bugs. The July timeline gets shorter, not longer.

Over 99% of the vulnerabilities Mythos has identified have not yet been patched, per Anthropic’s red team blog. The public Glasswing report lands in early July 2026. It will trigger a high-volume patch cycle across operating systems, browsers, cryptography libraries, and major infrastructure software. Security directors who have not expanded their patch pipeline, re-scoped their bug bounty programs, and built chainability scoring by then will absorb that wave cold. July is not a disclosure event. It is a patch tsunami.

What to tell the board

Every security director tells the board “we have scanned everything.” Merritt Baer, CSO at Enkrypt AI and former Deputy CISO at AWS, told VentureBeat that the statement does not survive Mythos without a qualifier. “What security leaders actually mean is: we have exhaustively scanned for what our tools know how to see,” Baer said in an exclusive interview with VentureBeat.
“That’s a very different claim.”

Baer proposed reframing residual risk for boards around three tiers:

- Known-knowns: vulnerability classes your stack reliably detects.
- Known-unknowns: classes you know exist but your tools only partially cover, like stateful logic flaws and auth boundary confusion.
- Unknown-unknowns: vulnerabilities that emerge from composition, how safe components interact in unsafe ways. “This is where Mythos is landing,” Baer said.

The board-level statement Baer recommends: “We have high confidence in detecting discrete, known vulnerability classes. Our residual risk is concentrated in cross-function, multi-step, and compositional flaws that evade single-point scanners. We are actively investing in capabilities that raise that detection ceiling.”

On chainability, Baer was equally direct. “Chainability has to become a first-class scoring dimension,” she said. “CVSS was built to score atomic vulnerabilities. Mythos is exposing that risk is increasingly graph-shaped, not point-in-time.”

Baer outlined three shifts security programs need to make: from severity scoring to exploitability pathways; from vulnerability lists to vulnerability graphs that model relationships across identity, data flow, and permissions; and from remediation SLAs to path disruption, where fixing any node that breaks the chain gets priority over fixing the highest individual CVSS.

“Mythos isn’t just finding missed bugs,” Baer said. “It’s invalidating the assumption that vulnerabilities are independent. Security programs that don’t adapt, from coverage thinking to interaction thinking, will keep reporting green dashboards while sitting on red attack paths.”

VentureBeat will update this story with additional operational details from Glasswing’s founding partners as interviews are completed.
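The shift Baer describes, from scoring vulnerabilities in isolation to modeling them as a graph and prioritizing whichever node disrupts the most attack paths, can be sketched in a few lines. Everything below (the graph shape, the node names, the scoring rule) is a hypothetical illustration, not a real scoring system or any vendor's product.

```python
from itertools import chain

# Hypothetical attack graph: an edge means "this foothold enables the next step".
# Node names are illustrative, not real CVEs.
edges = {
    "phishing": ["low-priv-shell"],
    "low-priv-shell": ["race-condition", "kaslr-bypass"],
    "race-condition": ["root"],
    "kaslr-bypass": ["root"],
    "nfs-rce": ["root"],
}

def attack_paths(graph, start, goal, path=None):
    """Enumerate all simple paths from an entry point to the goal."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for nxt in graph.get(start, []):
        if nxt not in path:  # avoid revisiting nodes (no cycles)
            paths.extend(attack_paths(graph, nxt, goal, path))
    return paths

def best_node_to_fix(graph, entries, goal):
    """Pick the intermediate node whose removal breaks the most paths."""
    paths = list(chain.from_iterable(attack_paths(graph, e, goal) for e in entries))
    counts = {}
    for p in paths:
        for node in p[1:-1]:  # score intermediates, not entry or goal
            counts[node] = counts.get(node, 0) + 1
    return max(counts, key=counts.get) if counts else None

# Fixing "low-priv-shell" severs both kernel-chaining paths at once.
print(best_node_to_fix(edges, ["phishing", "nfs-rce"], "root"))
```

Under this kind of path-disruption scoring, a medium-severity foothold that sits on many chains outranks either kernel bug individually, which is exactly the inversion Baer argues per-vulnerability CVSS scoring misses.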
Claude, OpenClaw and the new reality: AI agents are here — and so is the chaos
The age of agentic AI is upon us — whether we like it or not. What started as innocent question-and-answer banter with ChatGPT back in 2022 has become an existential debate about job security and the rise of the machines. More recently, fears of reaching artificial general intelligence (AGI) have become more concrete with the advent of powerful autonomous agents like Claude Cowork and OpenClaw. Having played with these tools for some time, I offer a comparison.

First, there is OpenClaw (formerly known as Moltbot and Clawdbot). Surpassing 150,000 GitHub stars in days, OpenClaw is already being deployed on local machines with deep system access. This is like a robot “maid” (Irona, for Richie Rich fans) that you give the keys to your house. It’s supposed to clean, and you give it the autonomy to take actions and manage your belongings (files and data) as it pleases. The whole purpose is to perform the task at hand: inbox triaging, auto-replies, content curation, travel planning, and more.

Next is Google’s Antigravity, a coding agent with an IDE that accelerates the path from prompt to production. You can interactively create complete application projects and modify specific details through individual prompts. This is like having a junior developer who can not only code, but also build, test, integrate, and fix issues. In the real world, this is like hiring an electrician: They are really good at a specific job, and you only need to give them access to one specific item (your electric junction box).

Finally, there is the mighty Claude. The release of Anthropic’s Cowork, which featured AI agents for automating legal tasks like contract review and NDA triage, caused a sharp sell-off in legal-tech and software-as-a-service (SaaS) stocks (referred to as the “SaaSpocalypse”). Claude has long been the go-to chatbot; now, with Cowork, it has domain knowledge for specific industries like legal and finance. This is like hiring an accountant.
They know the domain inside-out and can complete taxes and manage invoices. Users provide specific access to highly sensitive financial details.

Making these tools work for you

The key to making these tools more impactful is giving them more power, but that increases the risk of misuse. Users must trust providers like Anthropic and Google to ensure that agent prompts will not cause harm, leak data, or provide an unfair (or illegal) advantage to certain vendors. OpenClaw is open-source, which complicates things, as there is no central governing authority. While these technological advancements are amazing and meant for the greater good, all it takes is one or two adverse events to cause panic. Imagine the agentic electrician frying all your house circuits by connecting the wrong wire. In an agent scenario, this could mean injecting incorrect code, breaking a bigger system, or adding hidden flaws that may not be immediately evident. Cowork could miss major savings opportunities when doing a user’s taxes; on the flip side, it could include illegal write-offs. Claude can do unimaginable damage when it has more control and authority.

But in the middle of this chaos, there is a real opportunity. With the right guardrails in place, agents can focus on specific actions and avoid making random, unaccounted-for decisions. Principles of responsible AI — accountability, transparency, reproducibility, security, privacy — are extremely important. Logging agent steps and requiring human confirmation are absolutely critical. Also, when agents deal with so many diverse systems, it’s important that they speak the same language. Ontology becomes very important so that events can be tracked, monitored, and accounted for. A shared domain-specific ontology can define a “code of conduct.” These ethics can help control the chaos. When tied together with a shared trust and distributed identity framework, we can build systems that enable agents to do truly useful work.
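The guardrails argued for above, logging every agent step and gating risky actions on human confirmation, can be sketched as a thin wrapper around whatever the agent wants to execute. The action names, risk list, and log format here are illustrative assumptions, not part of any of the products mentioned.

```python
import json
import time

# Hypothetical set of actions that always require a human in the loop.
RISKY_ACTIONS = {"delete_file", "send_payment", "modify_system_config"}

def run_with_guardrails(action, args, execute, confirm, log_path="agent_audit.jsonl"):
    """Log an agent action and require human confirmation for risky ones.

    `execute` performs the action; `confirm` asks a human and returns a bool.
    Both are supplied by the caller, keeping the guardrail itself generic.
    """
    record = {"ts": time.time(), "action": action, "args": args}
    if action in RISKY_ACTIONS and not confirm(action, args):
        record["status"] = "blocked"
        result = None
    else:
        result = execute(action, args)
        record["status"] = "executed"
    with open(log_path, "a") as f:  # append-only audit trail, one JSON line per step
        f.write(json.dumps(record) + "\n")
    return record["status"], result

# Example: a benign action runs; a risky one is held for human approval.
status, _ = run_with_guardrails(
    "summarize_inbox", {"folder": "inbox"},
    execute=lambda a, kw: f"ran {a}",
    confirm=lambda a, kw: False,  # the human says no
)
```

The design point is that the audit log is written on every path, blocked or executed, so the trail stays complete even when the agent is overruled.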
When done right, an agentic ecosystem can greatly reduce human cognitive load and free our workforce to perform high-value tasks. Humans will benefit when agents handle the mundane.

Dattaraj Rao is innovation and R&D architect at Persistent Systems.