Anthropic on Monday launched the most ambitious consumer AI agent to date, giving its Claude chatbot the ability to directly control a user’s Mac — clicking buttons, opening applications, typing into fields, and navigating software on the user’s behalf while they step away from their desk.

The update, available immediately as a research preview for paying subscribers, transforms Claude from a conversational assistant into something closer to a remote digital operator. It arrives inside both Claude Cowork, the company’s agentic productivity tool, and Claude Code, its developer-focused command-line agent. Anthropic is also extending Dispatch — a feature introduced last week that lets users assign Claude tasks from a mobile phone — into Claude Code for the first time, creating an end-to-end pipeline where a user can issue instructions from anywhere and return to a finished deliverable.

The move thrusts Anthropic into the center of the most heated competition in artificial intelligence: the scramble to build agents that can act, not just talk. OpenAI, Google, Nvidia, and a growing swarm of startups are all chasing the same prize — an AI that operates inside your existing tools rather than beside them. And the stakes are no longer theoretical.
Reuters reported Sunday that OpenAI is actively courting private equity firms in what it described as an “enterprise turf war with Anthropic,” a battle in which the ability to ship working agents is fast becoming the decisive weapon.

The new features are available to Claude Pro subscribers (starting at $17 per month) and Max subscribers ($100 or $200 per month), but only on macOS for now.

Inside Claude’s computer use: How Anthropic’s AI agent decides when to click, type, and navigate your Mac

The computer use feature works through a layered priority system that reveals how Anthropic is thinking about reliability versus reach.

When a user assigns Claude a task, it first checks whether a direct connector exists — integrations with services like Gmail, Google Drive, Slack, or Google Calendar. These connectors are the fastest and most reliable path to completing a task, according to Anthropic’s documentation. If no connector is available, Claude falls back to navigating the Chrome browser via Anthropic’s Claude for Chrome extension. Only as a last resort does Claude interact directly with the user’s screen — clicking, typing, scrolling, and opening applications the way a human operator would.

This hierarchy matters. As Anthropic’s help center documentation explains, “pulling messages through your Slack connection takes seconds, but navigating Slack through your screen takes much longer and is more error-prone.” Screen-level interaction is the most flexible mode — it can theoretically work with any application — but it is also the slowest and most fragile.

When Claude does interact with the screen, it takes screenshots of the user’s desktop to understand what it’s looking at and determine how to navigate. That means Claude can see anything visible on the screen, including personal data, sensitive documents, or private information.
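Conceptually, the fallback hierarchy described above works like a short routing function: prefer the fast, reliable path and only degrade to screen control when nothing else is available. The sketch below is a minimal illustration of that decision order, with hypothetical names; it is not Anthropic's actual implementation.

```typescript
// Illustrative sketch of the connector -> browser -> screen fallback
// hierarchy described above. Names and types are invented for this
// example; this is not Anthropic's implementation.

type ExecutionMode = "connector" | "browser" | "screen";

interface TaskContext {
  service: string;                 // service the task targets, e.g. "slack"
  connectedServices: Set<string>;  // services with a direct connector enabled
  browserAvailable: boolean;       // Claude for Chrome extension installed
}

// Pick the fastest viable mode: direct connectors first, then the
// Chrome extension, and raw screen control only as a last resort.
function chooseMode(task: TaskContext): ExecutionMode {
  if (task.connectedServices.has(task.service)) return "connector";
  if (task.browserAvailable) return "browser";
  return "screen";
}
```

In this framing, the slow, fragile screen mode is never chosen when a faster path exists, which is exactly the trade Anthropic's documentation describes.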
Anthropic trains Claude to avoid engaging in stock trading, inputting sensitive data, or gathering facial images, but the company is candid that “these guardrails are part of how Claude is trained and instructed, but they aren’t absolute.”

There is nothing to configure. No API keys, no terminal setup, no special permissions beyond what the user grants on a per-app basis. As Ryan Donegan, who handles communications for Anthropic, put it in a press briefing: “Download the app and it uses what’s already on your machine.”

Claude Dispatch turns your iPhone into a remote control for AI-powered desktop automation

The real strategic play may not be computer use itself but how Anthropic is pairing it with Dispatch.

Dispatch, which launched last week for Cowork and now extends to Claude Code, creates a persistent, continuous conversation between Claude on your phone and Claude on your desktop. A user pairs their mobile device with their Mac by scanning a QR code, and from that point forward, they can text Claude instructions from anywhere. Claude executes those instructions on the desktop — which must remain awake and running the Claude app — and sends back the results.

The use cases Anthropic envisions range from mundane to ambitious: having Claude check your email every morning, pull weekly metrics into a report template, organize a cluttered Downloads folder, or even compile a competitive analysis from local files and connected tools into a formatted document. Scheduled tasks allow users to set a cadence once — “every Friday,” “every morning” — and let Claude handle the rest without further prompting.

Anthropic’s blog post frames the combination of Dispatch and computer use as something of a paradigm shift.
“Claude can use your computer on your behalf while you’re away,” the company wrote, offering examples like creating a morning briefing while a user commutes, making changes in an IDE, running tests, and submitting a pull request.

One early user on social media captured the broader ambition succinctly. Gagan Saluja, who describes himself as working with Claude and AWS, wrote: “combine this with /schedule that just dropped and you’ve basically got a background worker that can interact with any app on a cron job. that’s not an AI assistant anymore, that’s infrastructure.”

First hands-on tests reveal Claude’s computer use works about half the time — and that may be the point

Anthropic is calling this a research preview for a reason. Early hands-on testing suggests the feature works well for information retrieval and summarization but struggles with more complex, multi-step workflows — particularly those that require interacting with multiple applications.

John Voorhees of MacStories, the Apple-focused publication, published a detailed hands-on evaluation of Dispatch the same day as the announcement. His results were mixed. Claude successfully located a specific screenshot on his Mac, summarized the most recent note in his Notion database, listed notes saved that day, added a URL to Notion, summarized his most recently received email, and recalled a screenshot from earlier in the session.
But it failed to open the Shortcuts app on his Mac, send a screenshot via iMessage, list unfinished Todoist tasks (due to an authorization error), list Terminal sessions, display a food order from an active Safari tab, or fetch a URL from Safari using AppleScript.

Voorhees’ verdict was measured: Dispatch “can find information on your Mac and works with Connectors, but it’s slow and about a 50/50 shot whether what you try will work.” He added that it is “not good enough to rely on when you’re away from your desk” but called it “a step in the right direction.”

Meanwhile, on GitHub, users are already surfacing technical issues. One bug report filed against Claude Code describes a scenario where the Read tool attempts to process multiple large PDF files in a single turn without checking whether the combined payload exceeds the 20MB API limit, causing the request to fail outright. The issue, which has been tagged as a bug specific to macOS, highlights the kinds of rough edges that come with shipping an early preview of a complex agentic system.

OpenClaw, NemoClaw, and the startup swarm: Why Anthropic is racing to ship AI computer use now

Anthropic’s timing is not accidental. The company is shipping computer use capabilities into a market that has been rapidly reshaped by the viral rise of OpenClaw, the open-source framework that enables AI models to autonomously control computers and interact with tools.

OpenClaw exploded earlier this year and proved that users wanted AI agents capable of taking real actions on their computers — and that they were willing to tolerate rough edges to get them. The framework spawned an entire ecosystem of derivative tools — what the community calls “claws” — that turned autonomous computer control from a research curiosity into a product category almost overnight. Nvidia entered the fray last week with NemoClaw, its own framework designed to simplify the setup and deployment of OpenClaw with added security controls.
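Returning to the GitHub report mentioned above: the failure mode it describes, batching several large files without checking the combined payload, has a conventional fix. Check sizes before issuing the request and split oversized batches across turns. The sketch below is illustrative only, with invented names; it is not Claude Code's actual Read tool.

```typescript
// Illustrative guard against the kind of failure in the GitHub report
// above: sending a batch of file reads whose combined size exceeds an
// API payload limit. Hypothetical names, not Claude Code's code.

const MAX_PAYLOAD_BYTES = 20 * 1024 * 1024; // the 20MB API limit cited above

interface FileRef {
  path: string;
  sizeBytes: number;
}

// Greedily pack files into batches whose combined size stays under the
// limit, so each batch can be sent as a separate request.
function batchBySize(files: FileRef[], limit = MAX_PAYLOAD_BYTES): FileRef[][] {
  const batches: FileRef[][] = [];
  let current: FileRef[] = [];
  let currentSize = 0;
  for (const f of files) {
    if (f.sizeBytes > limit) {
      // A single oversized file can never be sent; surface it explicitly.
      throw new Error(`${f.path} alone exceeds the ${limit}-byte limit`);
    }
    if (currentSize + f.sizeBytes > limit && current.length > 0) {
      batches.push(current);
      current = [];
      currentSize = 0;
    }
    current.push(f);
    currentSize += f.sizeBytes;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```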
Anthropic is now entering a market that the open-source community essentially created, betting that its advantages — tighter integration, a consumer-friendly interface, and an existing subscriber base — can compete with free.

Smaller startups are also pushing into the space. Coasty, which offers both a desktop app and browser-based AI agent for Mac and Windows, markets itself as providing “full browser, desktop, and terminal automation with a native experience.” One user on social media directly pitched Coasty in the replies to Anthropic’s announcement, claiming it offers “much better user experience and more accurate” results — a sign of how crowded and competitive the computer-use agent space has become in a matter of months.

The competitive dynamics extend beyond just computer use. Reuters has reported that OpenAI is sweetening its pitch to private equity firms amid what the wire service described as an “enterprise turf war with Anthropic.” The two companies are locked in an escalating battle for enterprise customers, and the ability to offer agents that can actually operate within a company’s existing software stack — not just chat about it — is increasingly the differentiator.

Prompt injection, screenshot surveillance, and the unsolved security risks of letting AI control your desktop

If the competitive pressure explains why Anthropic shipped this feature now, the safety caveats explain why the company is hedging its bets.

Computer use runs outside the virtual machine that Cowork normally uses for file operations and commands. That means Claude is interacting with the user’s actual desktop and applications — not an isolated sandbox. The implications are significant: a misclick, a misunderstood instruction, or a prompt injection attack could have real consequences on a user’s live system.

Anthropic has built several layers of defense. Claude requests permission before accessing each application.
Some sensitive apps — investment platforms, cryptocurrency tools — are blocked by default. Users can maintain a blocklist of applications Claude is never allowed to touch. The system scans for signs of prompt injection during computer use sessions. And users can stop Claude at any point.

But the company is remarkably forthright about the limits of these protections. “Computer use is still early compared to Claude’s ability to code or interact with text,” Anthropic’s blog post states. “Claude can make mistakes, and while we continue to improve our safeguards, threats are constantly evolving.”

The help center documentation goes further, explicitly warning users not to use computer use to manage financial accounts, handle legal documents, process medical information, or interact with apps containing other people’s personal information. Anthropic also advises against using Cowork for HIPAA, FedRAMP, or FSI-regulated workloads.

For enterprise and team customers, there is an additional wrinkle. Cowork conversation history is stored locally on the user’s device, not on Anthropic’s servers. But critically, enterprise features like audit logs, compliance APIs, and data exports do not currently capture Cowork activity. This means that organizations subject to regulatory oversight have no centralized record of what Claude did on a user’s machine — a gap that could be a dealbreaker for compliance-sensitive industries.

One user flagged this concern on social media with particular precision. NomanInnov8 wrote: “when the agent IS the user (same mouse, keyboard, screen), traditional forensic markers won’t distinguish human vs AI actions. How are we thinking about audit trails here?”

The question is not academic. As AI agents gain the ability to take real-world actions — sending emails, modifying files, interacting with financial systems — the ability to distinguish between human and machine actions becomes a foundational requirement for governance, liability, and compliance.
Anthropic has not yet answered it.

From excitement to anxiety: How users are reacting to Claude’s new power over their machines

The social media reaction to the announcement split roughly into three camps: those excited about the productivity implications, those concerned about the security risks, and those frustrated that they cannot yet use it.

The enthusiasm was genuine and widespread. “Legit just got the update and used it with dispatch — exactly the feature I wanted,” wrote one X user. Mike Joseph called the speed of Anthropic’s feature releases “fantastic.” Another X user noted the significance for non-technical users: “Very exciting for non-tech folks who don’t want or know how to set up OpenClaw.”

But the security concerns were equally pointed. One user, posting as Profannyti, wrote: “Granting that kind of control over your personal device doesn’t sit right. It’s almost like letting someone you barely know take the wheel and trusting everything will be fine.” As Engadget reported, experts have warned that one major concern with agentic AI is that “it can take major, sometimes dramatic actions quickly and with little warning,” and that such tools “can also be hijacked by malicious actors.”

Several users flagged practical frustrations as well. Windows users — excluded from the macOS-only research preview — expressed predictable dismay. Others reported that the new features were consuming their usage quotas at alarming rates. One Max 20x subscriber paying $200 per month complained that Dispatch was “eating my quota like crazy,” consuming 10% of their allowance in a single prompt. Another user linked to the GitHub bug report about the 20MB payload issue, calling the situation “quite urgent.”

Anthropic’s enterprise playbook: Plugins, pricing tiers, and the bet that AI agents can replace entire workflows

The pricing structure reveals where Anthropic sees the real market.
While individual Pro users get access to Cowork, the company notes that agentic tasks “consume more capacity than regular chat” because “Claude coordinates multiple sub-agents and tool calls to complete complex work.” Heavy users are nudged toward Max plans at $100 or $200 per month.

For teams, the pricing starts at $20 per seat per month for groups of five to 75 users. Enterprise pricing is custom and includes admin controls to toggle Cowork on or off for the organization.

The plugin architecture is where Anthropic’s enterprise ambitions become clearest. Plugins bundle skills, connectors, and sub-agents into a single install that turns Claude into a domain specialist — for legal work, finance, brand voice management, or other functions. Anthropic already lists plugins for legal workflows (contract review, NDA triage), finance (journal entries, reconciliation, variance analysis), and brand voice (analyzing existing documents to enforce guidelines). The company is betting that the combination of computer use, Dispatch, scheduled tasks, and domain-specific plugins will create an agent capable enough to justify enterprise procurement.

The testimonials Anthropic has gathered suggest the pitch is landing with at least some organizations. Larisa Cavallaro, identified as an AI Automation Engineer, described connecting Cowork to her company’s tech stack and asking it to identify engineering bottlenecks. Claude, she said, returned “an interactive dashboard, team-by-team efficiency analyses, and a prioritized roadmap.” Joel Hron, a CTO, offered a more philosophical framing: “The human role becomes validation, refinement, and decision-making. Not repetitive rework.”

The AI industry’s defining tension: Shipping fast enough to win, slow enough to be safe

Anthropic is shipping these capabilities at a moment of extraordinary velocity in the AI industry — and extraordinary uncertainty about what that velocity means.

The company’s own research quantifies the transformation underway.
Its economic index, published in March 2026, tracks how AI is reshaping labor markets and productivity across sectors. The data suggests that AI adoption is accelerating unevenly, with knowledge workers in technology, finance, and professional services seeing the most dramatic shifts.

Anthropic is also navigating significant external pressures beyond the product arena. Recent reporting has highlighted scrutiny from Senator Elizabeth Warren regarding Anthropic’s defense and supply chain relationships — a reminder that the company’s ambitions to build powerful autonomous agents exist within an increasingly complex political and regulatory environment.

For now, the computer use feature remains early and imperfect. Complex tasks sometimes require a second attempt. Screen interaction is meaningfully slower than direct integrations. The audit trail gap for enterprise users is a genuine liability. And the fundamental tension between giving an AI agent enough access to be useful and limiting that access enough to be safe remains unresolved.

But Anthropic is not waiting for perfection. The company is building in public, shipping capabilities it openly describes as incomplete, and betting that users will tolerate a 50 percent success rate today in exchange for the promise of something transformative tomorrow. It is a calculation that only works if the failures remain minor — a missed click, a stalled task, an unread email. The moment a failure isn’t minor, the calculus changes entirely.

The AI industry has spent the last three years proving that machines can think. Anthropic is now asking a harder question: whether humans are ready to let them act. The answer, for the moment, is a provisional yes — hedged with permissions dialogs, blocklists, and the quiet hope that nothing important gets deleted before the technology catches up to the ambition.
VentureBeat
Cloudflare’s new Dynamic Workers ditch containers to run AI agent code 100x faster
Web infrastructure giant Cloudflare is seeking to transform the way enterprises deploy AI agents with the open beta release of Dynamic Workers, a new lightweight, isolate-based sandboxing system that it says starts in milliseconds, uses only a few megabytes of memory, and can run on the same machine — even the same thread — as the request that created it. Compared with traditional Linux containers, the company says that makes Dynamic Workers roughly 100x faster to start and between 10x and 100x more memory efficient.

Cloudflare has spent months pushing what it calls “Code Mode,” the idea that large language models often perform better when they are given an API and asked to write code against it, rather than being forced into one tool call after another. The company says converting an MCP server into a TypeScript API can cut token usage by 81%, and it is now positioning Dynamic Workers as the secure execution layer that makes that approach practical at scale.

For enterprise technical decision makers, that is the bigger story. Cloudflare is trying to turn sandboxing itself into a strategic layer in the AI stack. If agents increasingly generate small pieces of code on the fly to retrieve data, transform files, call services or automate workflows, then the economics and safety of the runtime matter almost as much as the capabilities of the model. Cloudflare’s pitch is that containers and microVMs remain useful, but they are too heavy for a future where millions of users may each have one or more agents writing and executing code constantly.

The history of modern isolated runtime environments

To understand why Cloudflare is doing this, it helps to look at the longer arc of secure code execution. Modern sandboxing has evolved through three main models, each trying to build a better digital box: smaller, faster and more specialized than the one before it.

The first model is the isolate.
Google introduced the v8::Isolate API in 2011 so the V8 JavaScript engine could run many separate execution contexts efficiently inside the same process. In effect, a single running program could spin up many small, tightly separated compartments, each with its own code and variables. In 2017, Cloudflare adapted that browser-born idea for the cloud with Workers, betting that the traditional cloud stack was too slow for instant, globally distributed web tasks. The result was a runtime that could start code in milliseconds and pack many environments onto a single machine. The trade-off is that isolates are not full computers. They are strongest with JavaScript, TypeScript and WebAssembly, and less natural for workloads that expect a traditional machine environment.

The second model is the container. Containers had been technically possible for years through Linux kernel features, but the company Docker turned them into the default software packaging model when it popularized them in 2013. Containers solved a huge portability problem by letting developers package code, libraries and settings into a predictable unit that could run consistently across systems. That made them foundational to modern cloud infrastructure. But they are relatively heavy for the sort of short-lived tasks Cloudflare is talking about here. The company says containers generally take hundreds of milliseconds to boot and hundreds of megabytes of memory to run, which becomes costly and slow when an AI-generated task only needs to execute for a moment.

The third model is the microVM. Popularized by AWS Firecracker in 2018, microVMs were designed to offer stronger machine-like isolation than containers without the full bulk of a traditional virtual machine. They are attractive for running untrusted code, which is why they have started to show up in newer AI-agent systems such as Docker Sandboxes.
But they still sit between the other two models: stronger isolation and more flexibility than an isolate, but slower and heavier as well.

That is the backdrop for Cloudflare’s pitch. The company is not claiming containers disappear, or that microVMs stop mattering. It is claiming that for a growing class of web-scale, short-lived AI-agent workloads, the default box has been too heavy, and the isolate may now be the better fit.

Cloudflare’s case against the container bottleneck

Cloudflare’s argument is blunt: for “consumer-scale” agents, containers are too slow and too expensive. In the company’s framing, a container is fine when a workload persists, but it is a bad fit when an agent needs to run one small computation, return a result and disappear. Developers either keep containers warm, which costs money, or tolerate cold-start delay, which hurts responsiveness. They may also be tempted to reuse a live sandbox across multiple tasks, which weakens isolation.

Dynamic Worker Loader is Cloudflare’s answer. The API allows one Worker to instantiate another Worker at runtime with code provided on the fly, usually by a language model. Because these dynamic Workers are built on isolates, Cloudflare says they can be created on demand, run one snippet of code, and then be thrown away immediately afterward. In many cases, they run on the same machine and even the same thread as the Worker that created them, which removes the need to hunt for a warm sandbox somewhere else on the network.

The company is also pushing hard on scale. It says many container-based sandbox providers limit concurrent sandboxes or the rate at which they can be created, while Dynamic Workers inherit the same platform characteristics that already let Workers scale to millions of requests per second.
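The create-run-discard pattern described above can be sketched in plain TypeScript. The toy below only simulates the shape of the idea with an ordinary in-process function object; it is not Cloudflare's Worker Loader API and it provides none of the isolation a real isolate gives you.

```typescript
// Toy illustration of the disposable-sandbox pattern: create an
// execution context on demand, run one model-generated snippet,
// return the result, and throw the context away. NOTE: new Function
// provides no real isolation; this only demonstrates the lifecycle,
// not Cloudflare's actual Worker Loader API or its security model.

interface SnippetResult {
  ok: boolean;
  value?: unknown;
  error?: string;
}

function runDisposable(code: string, input: unknown): SnippetResult {
  try {
    // The generated snippet is expected to be an expression over
    // `input`, e.g. "input.a + input.b". Each call builds a fresh
    // function, so nothing persists between snippets.
    const fn = new Function("input", `"use strict"; return (${code});`);
    return { ok: true, value: fn(input) };
  } catch (err) {
    // Syntax or runtime errors in the generated code are contained
    // and reported rather than crashing the caller.
    return { ok: false, error: String(err) };
  }
}
```

The point of the pattern is the lifecycle: each snippet gets a fresh, short-lived context and leaves nothing behind, which is what makes per-request execution environments plausible at scale.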
In Cloudflare’s telling, that makes it possible to imagine a world where every user-facing AI request gets its own fresh, isolated execution environment without collapsing under startup overhead.

Security remains the hardest part

Cloudflare does not pretend this is easy to secure. In fact, the company explicitly says hardening an isolate-based sandbox is trickier than relying on hardware virtual machines, and notes that security bugs in V8 are more common than those in typical hypervisors. That is an important admission, because the entire thesis depends on convincing developers that an ultra-fast software sandbox can also be safe enough for AI-generated code.

Cloudflare’s response is that it has nearly a decade of experience doing exactly that. The company points to automatic rollout of V8 security patches within hours, a custom second-layer sandbox, dynamic cordoning of tenants based on risk, extensions to the V8 sandbox using hardware features like MPK, and research into defenses against Spectre-style side-channel attacks. It also says it scans code for malicious patterns and can block or further sandbox suspicious workloads automatically. Dynamic Workers inherit that broader Workers security model.

That matters because without the security story, the speed story sounds risky. With it, Cloudflare is effectively arguing that it has already spent years making isolate-based multi-tenancy safe enough for the public web, and can now reuse that work for the age of AI agents.

Code Mode: from tool orchestration to generated logic

The release makes the most sense in the context of Cloudflare’s larger Code Mode strategy. The idea is simple: instead of giving an agent a long list of tools and asking it to call them one by one, give it a programming surface and let it write a short TypeScript function that performs the logic itself.
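As a concrete illustration of the Code Mode idea, the sketch below shows the kind of snippet a model might generate against a small typed API instead of issuing one tool call per step. The `Tools` interface and its methods are invented for this example; they stand in for connector-backed tool calls.

```typescript
// Hypothetical illustration of Code Mode: rather than the model
// making separate tool calls (listOrders, then regionName per order)
// and seeing every intermediate result, it writes one function
// against a typed API and returns only the final aggregate.
// The Tools interface here is invented for this example.

interface Order {
  id: string;
  total: number;
  region: string;
}

interface Tools {
  listOrders(): Order[];
  regionName(code: string): string;
}

// The kind of snippet a model might generate: chain the calls,
// aggregate locally, and return one compact result.
function generatedSnippet(tools: Tools): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const order of tools.listOrders()) {
    const name = tools.regionName(order.region);
    totals[name] = (totals[name] ?? 0) + order.total;
  }
  return totals;
}
```

Run as separate tool calls, every intermediate order list and lookup would pass back through the model's context window; generated as code, only the final totals do.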
That means the model can chain calls together, filter data, manipulate files and return only the final result, rather than filling the context window with every intermediate step. Cloudflare says that cuts both latency and token usage, and improves outcomes especially when the tool surface is large.

The company points to its own Cloudflare MCP server as proof of concept. Rather than exposing the full Cloudflare API as hundreds of individual tools, it says the server exposes the entire API through two tools — search and execute — in under 1,000 tokens because the model writes code against a typed API instead of navigating a long tool catalog.

That is a meaningful architectural shift. It moves the center of gravity from tool orchestration toward code execution. And it makes the execution layer itself far more important.

Why Cloudflare thinks TypeScript beats HTTP for agents

One of the more interesting parts of the launch is that Cloudflare is also arguing for a different interface layer. MCP, the company says, defines schemas for flat tool calls but not for programming APIs. OpenAPI can describe REST APIs, but it is verbose both in schema and in usage. TypeScript, by contrast, is concise, widely represented in model training data, and can communicate an API’s shape in far fewer tokens.

Cloudflare says the Workers runtime can automatically establish a Cap’n Web RPC bridge between the sandbox and the harness code, so a dynamic Worker can call those typed interfaces across the security boundary as if it were using a local library. That lets developers expose only the exact capabilities they want an agent to have, without forcing the model to reason through a sprawling HTTP interface.

The company is not banning HTTP. In fact, it says Dynamic Workers fully support HTTP APIs.
But it clearly sees TypeScript RPC as the cleaner long-term interface for machine-generated code, both because it is cheaper in tokens and because it gives developers a narrower, more intentional security surface.

Credential injection and tighter control over outbound access

One of the more practical enterprise features in the release is globalOutbound, which lets developers intercept every outbound HTTP request from a Dynamic Worker. They can inspect it, rewrite it, inject credentials, respond to it directly, or block it entirely. That makes it possible to let an agent reach outside services while never exposing raw secrets to the generated code itself.

Cloudflare positions that as a safer way to connect agents to third-party services requiring authentication. Instead of trusting the model not to mishandle credentials, the developer can add them on the way out and keep them outside the agent’s visible environment. In enterprise settings, that kind of blast-radius control may matter as much as the performance gains.

More than a runtime: the helper libraries matter too

Another reason the announcement lands as more than a low-level runtime primitive is that Cloudflare is shipping a toolkit around it. The @cloudflare/codemode package is designed to simplify running model-generated code against AI tools using Dynamic Workers. At its core is DynamicWorkerExecutor(), which sets up a purpose-built sandbox with code normalization and direct control over outbound fetch behavior. The package also includes utility functions to wrap an MCP server into a single code() tool or generate MCP tooling from an OpenAPI spec.

The @cloudflare/worker-bundler package handles the fact that Dynamic Workers expect pre-bundled modules. It can resolve npm dependencies, bundle them with esbuild, and return the module map the Worker Loader expects.
The @cloudflare/shell package adds a virtual filesystem backed by a durable Workspace using SQLite and R2, with higher-level operations like read, write, search, replace, diff and JSON update, plus transactional batch writes.

Taken together, those packages make the launch feel much more complete. Cloudflare is not just exposing a fast sandbox API. It is building the surrounding path from model-generated logic to packaged execution to persistent file manipulation.

Isolates versus microVMs: two different homes for agents

Cloudflare’s launch also highlights a growing split in the AI-agent market. One side emphasizes fast, disposable, web-scale execution. The other emphasizes deeper, more persistent environments with stronger machine-like boundaries.

Docker Sandboxes is a useful contrast. Rather than using standard containers alone, it uses lightweight microVMs to give each agent its own private Docker daemon, allowing the agent to install packages, run commands and modify files without directly exposing the host system. That is a better fit for persistent, local or developer-style environments. Cloudflare is optimizing for something different: short-lived, high-volume execution on the global web.

So the trade-off is not simply security versus speed. It is depth versus velocity. MicroVMs offer a sturdier private fortress and broader flexibility. Isolates offer startup speed, density and lower cost at internet scale. That distinction may become one of the main dividing lines in agent infrastructure over the next year.

Community reaction: hype, rivalry and the JavaScript catch

The release also drew immediate attention from developers on X, with reactions that captured both excitement and skepticism.
Brandon Strittmatter, a Cloudflare product lead and founder of Outerbase, called the move “classic Cloudflare,” praising the company for “changing the current paradigm on containers/sandboxes by reinventing them to be lightweight, less expensive, and ridiculously fast.” Zephyr Cloud CEO Zack Chapple called the release “worth shouting from the mountain tops.”

But the strongest caveat surfaced quickly too: this system works best when the agent writes JavaScript. Cloudflare says Workers can technically run Python and WebAssembly, but that for small, on-demand snippets, “JavaScript will load and run much faster.” That prompted criticism from YouTuber and ThursdAI podcast host Alex Volkov, who wrote that he “got excited… until I got here,” reacting to the language constraint.

Cloudflare’s defense is pragmatic and a little provocative. Humans have language loyalties, the company argues, but agents do not. In Cloudflare’s words, “AI will write any language you want it to,” and JavaScript is simply well suited to sandboxed execution on the web. That may be true in the narrow sense the company intends, but it also means the platform is most naturally aligned with teams already comfortable in the JavaScript and TypeScript ecosystem.

The announcement also triggered immediate competitive positioning. Nathan Flurry of Rivet used the moment to contrast his Secure Exec product as an open-source alternative that supports a broader range of platforms including Vercel, Railway and Kubernetes rather than being tied closely to Cloudflare’s own stack. That reaction is worth noting because it shows how quickly the sandboxing market around agents is already splitting between vertically integrated platforms and more portable approaches.

Early use cases: AI apps, automations and generated platforms

Cloudflare is pitching Dynamic Workers for much more than quick code snippets.
The company highlights Code Mode, AI-generated applications, fast development previews, custom automations and user platforms where customers upload or generate code that must run in a secure sandbox.

One example it spotlights is Zite, which Cloudflare says is building an app platform where users interact through chat while the model writes TypeScript behind the scenes to build CRUD apps, connect to services like Stripe, Airtable and Google Calendar, and run backend logic. Cloudflare quotes Zite CTO and co-founder Antony Toron saying Dynamic Workers “hit the mark” on speed, isolation and security, and that the company now handles “millions of execution requests daily” using the system.

Even allowing for vendor framing, that example gets at the company’s ambition. Cloudflare is not just trying to make agents a bit more efficient. It is trying to make AI-generated execution environments cheap and fast enough to sit underneath full products.

Pricing and availability

Dynamic Worker Loader is now in open beta and available to all users on the Workers Paid plan. Cloudflare says dynamically loaded Workers are priced at $0.002 per unique Worker loaded per day, in addition to standard CPU and invocation charges, though that per-Worker fee is waived during the beta period. For one-off code generation use cases, the company says that cost is typically negligible compared with the inference cost of generating the code itself.

That pricing model reinforces the larger thesis behind the product: that execution should become a small, routine part of the agent loop rather than a costly special case.

The bigger picture

Cloudflare’s launch lands at a moment when AI infrastructure is becoming more opinionated. Some vendors are leaning toward long-lived agent environments, persistent memory and machine-like execution. Cloudflare is taking the opposite angle.
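To make the published rate concrete, here is a back-of-the-envelope estimate. The $0.002-per-unique-Worker-per-day figure is from Cloudflare's announcement; the workload volume below is purely illustrative, and standard CPU and invocation charges are deliberately excluded.

```python
# Back-of-the-envelope cost estimate for dynamically loaded Workers.
# The $0.002/unique-Worker/day rate is Cloudflare's published figure;
# the daily volume below is an invented illustrative workload.
PER_WORKER_PER_DAY = 0.002  # USD; waived during the open beta

def monthly_loader_cost(unique_workers_per_day: int, days: int = 30) -> float:
    """Loader fee only; excludes standard CPU and invocation charges."""
    return unique_workers_per_day * PER_WORKER_PER_DAY * days

# A hypothetical agent platform loading 10,000 unique generated Workers daily:
cost = monthly_loader_cost(10_000)
print(f"${cost:,.2f}/month")  # $600.00/month
```

At that scale the loader fee stays small next to the inference cost of generating the code, which is the comparison Cloudflare itself draws.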
For many workloads, it argues, the right agent runtime is not a persistent container or a tiny VM, but a fast, disposable isolate that appears instantly, executes one generated program, and vanishes.

That does not mean containers or microVMs go away. It means the market is starting to split by workload. Some enterprises will want deeper, more persistent environments. Others — especially those building high-volume, web-facing AI systems — may want an execution layer that is as ephemeral as the requests it serves.

Cloudflare is betting that this second category gets very large, very quickly. And if that happens, Dynamic Workers may prove to be more than just another Workers feature. They may be Cloudflare’s attempt to define what the default execution layer for internet-scale AI agents looks like.
The three disciplines separating AI agent demos from real-world deployment
Getting AI agents to perform reliably in production — not just in demos — is turning out to be harder than enterprises anticipated. Fragmented data, unclear workflows, and runaway escalation rates are slowing deployments across industries.

“The technology itself often works well in demonstrations,” said Sanchit Vir Gogia, chief analyst with Greyhound Research. “The challenge begins when it is asked to operate inside the complexity of a real organization.”

Burley Kawasaki, who oversees agent deployment at Creatio, and his team have developed a methodology built around three disciplines: data virtualization to work around data lake delays; agent dashboards and KPIs as a management layer; and tightly bounded use-case loops to drive toward high autonomy.

In simpler use cases, Kawasaki says these practices have enabled agents to handle up to 80-90% of tasks on their own. With further tuning, he estimates they could support autonomous resolution in at least half of use cases, even in more complex deployments.

“People have been experimenting a lot with proof of concepts, they’ve been putting a lot of tests out there,” Kawasaki told VentureBeat. “But now in 2026, we’re starting to focus on mission-critical workflows that drive either operational efficiencies or additional revenue.”

Why agents keep failing in production

Enterprises are eager to adopt agentic AI in some form or another — often because they fear being left behind, even before they identify tangible real-world use cases — but run into significant bottlenecks around data architecture, integration, monitoring, security, and workflow design. The first obstacle almost always has to do with data, Gogia said. Enterprise information rarely exists in a neat or unified form; it is spread across SaaS platforms, apps, internal databases, and other data stores. Some are structured, some are not. But even when enterprises overcome the data retrieval problem, integration is a big challenge.
Agents rely on APIs and automation hooks to interact with applications, but many enterprise systems were designed long before this kind of autonomous interaction was a reality, Gogia pointed out. This can result in incomplete or inconsistent APIs, and systems can respond unpredictably when accessed programmatically.

Organizations also run into snags when they attempt to automate processes that were never formally defined, Gogia said. “Many business workflows depend on tacit knowledge,” he said. That is, employees know how to resolve exceptions they’ve seen before without explicit instructions — but those missing rules and instructions become startlingly obvious when workflows are translated into automation logic.

The tuning loop

Creatio deploys agents in a “bounded scope with clear guardrails,” followed by an “explicit” tuning and validation phase, Kawasaki explained. Teams review initial outcomes, adjust as needed, then re-test until they’ve reached an acceptable level of accuracy. That loop typically follows this pattern:

Design-time tuning (before go-live): Performance is improved through prompt engineering, context wrapping, role definitions, workflow design, and grounding in data and documents.

Human-in-the-loop correction (during execution): Devs approve, edit, or resolve exceptions. In instances where humans have to intervene the most (escalation or approval), users establish stronger rules, provide more context, and update workflow steps; or they’ll narrow tool access.

Ongoing optimization (after go-live): Devs continue to monitor exception rates and outcomes, then tune repeatedly as needed, helping to improve accuracy and autonomy over time.

Kawasaki’s team applies retrieval-augmented generation to ground agents in enterprise knowledge bases, CRM data, and other proprietary sources. Once agents are deployed in the wild, they are monitored with a dashboard providing performance analytics, conversion insights, and auditability.
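The three-phase loop described above can be sketched as a minimal control flow: unhandled work escalates to a human, and each human-resolved exception becomes a new rule the agent applies next time. Everything here is an illustrative stand-in, not Creatio's implementation; the class and method names are invented.

```python
# Minimal sketch of a bounded agent loop with human-in-the-loop correction.
# All names (Agent, handle, tune) are hypothetical, not any vendor's API.
from dataclasses import dataclass, field

@dataclass
class Agent:
    rules: list = field(default_factory=list)  # guardrails tuned at design time
    escalations: int = 0
    completed: int = 0

    def handle(self, task: str) -> str:
        # Anything the current rules don't cover escalates to a human.
        if task in self.rules:
            self.completed += 1
            return "resolved"
        self.escalations += 1
        return "escalated"

    def tune(self, task: str) -> None:
        # Ongoing optimization: a human-reviewed exception becomes a new rule.
        self.rules.append(task)

agent = Agent(rules=["renewal"])
for t in ["renewal", "refund", "refund"]:
    if agent.handle(t) == "escalated":
        agent.tune(t)  # human resolves it once; the agent handles it next time

print(agent.completed, agent.escalations)  # 2 1
```

The point of the sketch is the shape of the loop, not the matching logic: early escalation spikes shrink as reviewed exceptions are folded back into the rule set.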
Essentially, agents are treated like digital workers. They have their own management layer with dashboards and KPIs. For instance, an onboarding agent will be incorporated as a standard dashboard interface providing agent monitoring and telemetry. This is part of the platform layer — orchestration, governance, security, workflow execution, monitoring, and UI embedding — that sits “above the LLM,” Kawasaki said.

Users see a dashboard of agents in use and each of their processes, workflows, and executed results. They can “drill down” into an individual record (like a referral or renewal) that shows a step-by-step execution log and related communications to support traceability, debugging, and agent tweaking. The most common adjustments involve logic and incentives, business rules, prompt context, and tool access, Kawasaki said.

The biggest issues that come up post-deployment:

Exception handling volume can be high: Early spikes in edge cases often occur until guardrails and workflows are tuned.

Data quality and completeness: Missing or inconsistent fields and documents can cause escalations; teams can identify which data to prioritize for grounding and which checks to automate.

Auditability and trust: Regulated customers, particularly, require clear logs, approvals, role-based access control (RBAC), and audit trails.

“We always explain that you have to allocate time to train agents,” Creatio’s CEO Katherine Kostereva told VentureBeat. “It doesn’t happen immediately when you switch on the agent, it needs time to understand fully, then the number of mistakes will decrease.”

“Data readiness” doesn’t always require an overhaul

When looking to deploy agents, “Is my data ready?” is a common early question. Enterprises know data access is important, but can be deterred by the prospect of a massive data consolidation project. Virtual connections, however, can give agents access to underlying systems and get around typical data lake/lakehouse/warehouse delays.
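Treating agents as digital workers implies computing the same KPIs a human team lead would track. A minimal sketch of two metrics such a dashboard might surface, task completion rate and escalation rate; the event-log schema is invented for illustration and is not a real product's format.

```python
# Compute basic agent KPIs from an execution log.
# The log schema below is illustrative, not any vendor's actual format.
from collections import Counter

events = [
    {"agent": "onboarding", "outcome": "completed"},
    {"agent": "onboarding", "outcome": "completed"},
    {"agent": "onboarding", "outcome": "escalated"},
    {"agent": "renewals",   "outcome": "completed"},
]

def kpis(events):
    counts = Counter(e["outcome"] for e in events)
    total = len(events)
    return {
        "completion_rate": counts["completed"] / total,
        "escalation_rate": counts["escalated"] / total,
    }

print(kpis(events))  # {'completion_rate': 0.75, 'escalation_rate': 0.25}
```

Tracking escalation rate over time is what makes the "early spikes until guardrails are tuned" pattern visible in the first place.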
Kawasaki’s team built a platform that integrates with data, and is now working on an approach that will pull data into a virtual object, process it, and use it like a standard object for UIs and workflows. This way, they don’t have to “persist or duplicate” large volumes of data in their database. This technique can be helpful in areas like banking, where transaction volumes are simply too large to copy into CRM, but are “still valuable for AI analysis and triggers,” Kawasaki said.

Once integrations and virtual objects are established, teams can evaluate data completeness, consistency, and availability, and identify low-friction starting points (like document-heavy or unstructured workflows). Kawasaki emphasized the importance of “really using the data in the underlying systems, which tends to actually be the cleanest or the source of truth anyway.”

Matching agents to the work

The best fit for autonomous (or near-autonomous) agents are high-volume workflows with “clear structure and controllable risk,” Kawasaki said. For instance, document intake and validation in onboarding or loan preparation, or standardized outreach like renewals and referrals.

“Especially when you can link them to very specific processes inside an industry — that’s where you can really measure and deliver hard ROI,” he said.

For instance, financial institutions are often siloed by nature. Commercial lending teams perform in their own environment, wealth management in another. But an autonomous agent can look across departments and separate data stores to identify, for instance, commercial customers who might be good candidates for wealth management or advisory services.

“You think it would be an obvious opportunity, but no one is looking across all the silos,” Kawasaki said. Some banks that have applied agents to this very scenario have seen “benefits of millions of dollars of incremental revenue,” he claimed, without naming specific institutions.
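The "virtual object" idea, querying the system of record on demand instead of copying data into the CRM, can be sketched as a lazy proxy. The class and fetch function below are hypothetical illustrations, not Creatio's implementation.

```python
# Sketch of data virtualization: a "virtual object" that fetches records
# from the underlying system of record on demand and caches them only for
# the session, instead of persisting or duplicating them in the CRM.
# (Hypothetical illustration; not any vendor's actual API.)

class VirtualObject:
    def __init__(self, fetch):
        self._fetch = fetch  # callable into the source system
        self._cache = {}     # transient, session-scoped; never persisted

    def __getitem__(self, record_id):
        if record_id not in self._cache:
            self._cache[record_id] = self._fetch(record_id)
        return self._cache[record_id]

# Stand-in for a live query against, say, a core banking ledger:
calls = []
def fetch_transaction(record_id):
    calls.append(record_id)
    return {"id": record_id, "amount": 125.0}

txns = VirtualObject(fetch_transaction)
print(txns["t-1"]["amount"])  # 125.0, fetched from the source system
print(txns["t-1"]["amount"])  # 125.0, served from the transient cache
print(len(calls))             # 1: the source was queried only once
```

The design choice matches the banking example in the text: the transactions never land in the CRM database, yet workflows and AI triggers can still read them like any standard object.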
However, in other cases — particularly in regulated industries — longer-context agents are not only preferable, but necessary. For instance, in multi-step tasks like gathering evidence across systems, summarizing, comparing, drafting communications, and producing auditable rationales.

“The agent isn’t giving you a response immediately,” Kawasaki said. “It may take hours, days, to complete full end-to-end tasks.”

This requires orchestrated agentic execution rather than a “single giant prompt,” he said. This approach breaks work down into deterministic steps to be performed by sub-agents. Memory and context management can be maintained across various steps and time intervals. Grounding with RAG can help keep outputs tied to approved sources, and users can dictate expansion to file shares and other document repositories.

This model typically doesn’t require custom retraining or a new foundation model. Whatever model enterprises use (GPT, Claude, Gemini), performance improves through prompts, role definitions, controlled tools, workflows, and data grounding, Kawasaki said. The feedback loop puts “extra emphasis” on intermediate checkpoints, he said. Humans review intermediate artifacts (such as summaries, extracted facts, or draft recommendations) and correct errors. Those can then be converted into better rules and retrieval sources, narrower tool scopes, and improved templates.

“What is important for this style of autonomous agent is you mix the best of both worlds: the dynamic reasoning of AI with the control and power of true orchestration,” Kawasaki said.

Ultimately, agents require coordinated changes across enterprise architecture, new orchestration frameworks, and explicit access controls, Gogia said. Agents must be assigned identities to restrict their privileges and keep them within bounds. Observability is critical; monitoring tools can record task completion rates, escalation events, system interactions, and error patterns.
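The "deterministic steps plus sub-agents" pattern can be sketched as a pipeline: each step is a discrete function standing in for a sub-agent call, state carries between steps, and designated checkpoints pause for human review. All names here are illustrative; this is not a specific framework's API.

```python
# Sketch of orchestrated agentic execution: work decomposed into
# deterministic steps, with shared state and human-review checkpoints.
# (Illustrative only; the step functions stand in for sub-agent calls.)

def gather(state):    state["evidence"] = ["doc-a", "doc-b"]; return state
def summarize(state): state["summary"] = f"{len(state['evidence'])} sources"; return state
def draft(state):     state["draft"] = f"Memo based on {state['summary']}"; return state

PIPELINE = [
    ("gather", gather, False),       # (name, sub-agent, needs human review?)
    ("summarize", summarize, True),  # checkpoint: human reviews the summary
    ("draft", draft, True),          # checkpoint: human reviews the draft
]

def run(pipeline, review=lambda name, state: True):
    state, log = {}, []
    for name, step, checkpoint in pipeline:
        state = step(state)
        log.append(name)                    # auditable execution trail
        if checkpoint and not review(name, state):
            return state, log + ["halted"]  # human rejected: stop here
    return state, log

state, log = run(PIPELINE)
print(log)             # ['gather', 'summarize', 'draft']
print(state["draft"])  # Memo based on 2 sources
```

Because each checkpoint sees an intermediate artifact (a summary, a draft), human corrections land at a specific step rather than on an opaque end-to-end output, which is what makes the feedback convertible into better rules and narrower tool scopes.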
This kind of evaluation must be a permanent practice, and agents should be tested to see how they react when encountering new scenarios and unusual inputs.

“The moment an AI system can take action, enterprises have to answer several questions that rarely appear during copilot deployments,” Gogia said. Such as:

What systems is the agent allowed to access?

What types of actions can it perform without approval?

Which activities must always require a human decision?

How will every action be recorded and reviewed?

“Those [enterprises] that underestimate the challenge often find themselves stuck in demonstrations that look impressive but cannot survive real operational complexity,” Gogia said.
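Gogia's four questions map naturally onto a declarative action policy. A minimal sketch of that idea follows; the policy table, action names, and decision strings are invented for illustration.

```python
# Sketch of an agent action policy answering the four questions in code:
# which systems are reachable, which actions auto-run, which always need
# a human decision, and an audit record for every attempt. (Illustrative.)

POLICY = {
    "crm.read":        {"allowed": True,  "requires_human": False},
    "crm.update":      {"allowed": True,  "requires_human": True},
    "payments.refund": {"allowed": False, "requires_human": True},
}

audit_log = []

def attempt(action: str) -> str:
    # Unknown actions are denied by default (deny-by-default posture).
    rule = POLICY.get(action, {"allowed": False, "requires_human": True})
    if not rule["allowed"]:
        decision = "denied"
    elif rule["requires_human"]:
        decision = "pending_approval"
    else:
        decision = "executed"
    audit_log.append((action, decision))  # every attempt recorded for review
    return decision

print(attempt("crm.read"))         # executed
print(attempt("crm.update"))       # pending_approval
print(attempt("payments.refund"))  # denied
```

The deny-by-default branch is the important design choice: an agent identity only ever gains capabilities that were explicitly granted, which is what keeps its privileges within bounds as the text describes.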
What is DeerFlow 2.0 and what should enterprises know about this new, powerful local AI agent orchestrator?
ByteDance, the Chinese tech giant behind TikTok, last month released what may be one of the most ambitious open-source AI agent frameworks to date: DeerFlow 2.0. It’s now going viral across the machine learning community on social media. But is it safe and ready for enterprise use?

This is a so-called “SuperAgent harness” that orchestrates multiple AI sub-agents to autonomously complete complex, multi-hour tasks. Best of all: it is available under the permissive, enterprise-friendly standard MIT License, meaning anyone can use, modify, and build on it commercially at no cost.

DeerFlow 2.0 is designed for high-complexity, long-horizon tasks that require autonomous orchestration over minutes or hours, including conducting deep research into industry trends, generating comprehensive reports and slide decks, building functional web pages, producing AI-generated videos and reference images, performing exploratory data analysis with insightful visualizations, analyzing and summarizing podcasts or video content, automating complex data and content workflows, and explaining technical architectures through creative formats like comic strips.

ByteDance offers a bifurcated deployment strategy that separates the orchestration harness from the AI inference engine. Users can run the core harness directly on a local machine, deploy it across a private Kubernetes cluster for enterprise scale, or connect it to external messaging platforms like Slack or Telegram without requiring a public IP.

While many opt for cloud-based inference via OpenAI or Anthropic APIs, the framework is natively model-agnostic, supporting fully localized setups through tools like Ollama. This flexibility allows organizations to tailor the system to their specific data sovereignty needs, choosing between the convenience of cloud-hosted “brains” and the total privacy of a restricted on-premise stack.

Importantly, choosing the local route does not mean sacrificing security or functional isolation.
Even when running entirely on a single workstation, DeerFlow still utilizes a Docker-based “AIO Sandbox” to provide the agent with its own execution environment. This sandbox — which contains its own browser, shell, and persistent filesystem — ensures that the agent’s “vibe coding” and file manipulations remain strictly contained. Whether the underlying models are served via the cloud or a local server, the agent’s actions always occur within this isolated container, allowing for safe, long-running tasks that can execute bash commands and manage data without risk to the host system’s core integrity.

Since its release last month, it has accumulated more than 39,000 GitHub stars and 4,600 forks — a growth trajectory that has developers and researchers alike paying close attention.

Not a chatbot wrapper: what DeerFlow 2.0 actually is

DeerFlow is not another thin wrapper around a large language model. The distinction matters.

While many AI tools give a model access to a search API and call it an agent, DeerFlow 2.0 gives its agents an actual isolated computer environment: a Docker sandbox with a persistent, mountable filesystem. The system maintains both short- and long-term memory that builds user profiles across sessions. It loads modular “skills” — discrete workflows — on demand to keep context windows manageable. And when a task is too large for one agent, a lead agent decomposes it, spawns parallel sub-agents with isolated contexts, executes code and Bash commands safely, and synthesizes the results into a finished deliverable.

It is similar to the approach being pursued by NanoClaw, an OpenClaw variant, which recently partnered with Docker itself to offer enterprise-grade sandboxes for agents and subagents.
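Progressive skill loading, i.e. materializing a discrete workflow into the agent's context only when a task needs it, can be sketched as a lazy registry. The skill names and loader below are invented for illustration; they are not DeerFlow's actual module layout.

```python
# Sketch of progressive skill loading: skills (discrete workflows) are
# registered by name but only pulled into the agent's context on demand,
# keeping the context window small. (Illustrative; not DeerFlow's code.)

SKILL_REGISTRY = {
    "deep_research": lambda topic: f"research plan for {topic}",
    "slide_deck":    lambda topic: f"deck outline for {topic}",
    "data_analysis": lambda topic: f"notebook scaffold for {topic}",
}

class SkillAgent:
    def __init__(self):
        self.loaded = {}  # skills currently occupying context

    def use_skill(self, name: str, topic: str) -> str:
        if name not in self.loaded:  # load on demand, not up front
            self.loaded[name] = SKILL_REGISTRY[name]
        return self.loaded[name](topic)

agent = SkillAgent()
print(agent.use_skill("deep_research", "agent infrastructure"))
print(sorted(agent.loaded))  # ['deep_research']: other skills stay unloaded
```

The registry holds three skills, but only the one actually invoked ends up in the agent's working set, which is the whole point of keeping context windows manageable.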
But while NanoClaw is extremely open-ended, DeerFlow has more clearly defined its architecture and scoped tasks. Demos on the project’s official site, deerflow.tech, showcase real outputs: agent trend forecast reports, videos generated from literary prompts, comics explaining machine learning concepts, data analysis notebooks, and podcast summaries. The framework is designed for tasks that take minutes to hours to complete — the kind of work that currently requires a human analyst or a paid subscription to a specialized AI service.

From Deep Research to Super Agent

DeerFlow’s original v1 launched in May 2025 as a focused deep-research framework. Version 2.0 is something categorically different: a ground-up rewrite on LangGraph 1.0 and LangChain that shares no code with its predecessor. ByteDance explicitly framed the release as a transition “from a Deep Research agent into a full-stack Super Agent.”

New in v2: a batteries-included runtime with filesystem access, sandboxed execution, persistent memory, and sub-agent spawning; progressive skill loading; Kubernetes support for distributed execution; and long-horizon task management that can run autonomously across extended timeframes.

The framework is fully model-agnostic, working with any OpenAI-compatible API. It has strong out-of-the-box support for ByteDance’s own Doubao-Seed models, as well as DeepSeek v3.2, Kimi 2.5, Anthropic’s Claude, OpenAI’s GPT variants, and local models run via Ollama. It also integrates with Claude Code for terminal-based tasks, and with messaging platforms including Slack, Telegram, and Feishu.

Why it’s going viral now

The project’s current viral moment is the result of a slow build that accelerated sharply this week.

The February 28 launch generated significant initial buzz, but it was coverage in machine learning media — including deeplearning.ai’s The Batch — over the following two weeks that built credibility in the research community.
Then, on March 21, AI influencer Min Choi posted to his large X following: “China’s ByteDance just dropped DeerFlow 2.0. This AI is a super agent harness with sub-agents, memory, sandboxes, IM channels, and Claude Code integration. 100% open source.” The post earned more than 1,300 likes and triggered a cascade of reposts and commentary across AI Twitter.

A search of X using Grok uncovered the full scope of that response. Influencer Brian Roemmele, after conducting what he described as intensive personal testing, declared that “DeerFlow 2.0 absolutely smokes anything we’ve ever put through its paces” and called it a “paradigm shift,” adding that his company had dropped competing frameworks entirely in favor of running DeerFlow locally. “We use 2.0 LOCAL ONLY. NO CLOUD VERSION,” he wrote.

More pointed commentary came from accounts focused on the business implications. One post from @Thewarlordai, published March 23, framed it bluntly: “MIT licensed AI employees are the death knell for every agent startup trying to sell seat-based subscriptions. The West is arguing over pricing while China just commoditized the entire workforce.” Another widely shared post described DeerFlow as “an open-source AI staff that researches, codes and ships products while you sleep… now it’s a Python repo and ‘make up’ away.”

Cross-linguistic amplification — with substantive posts in English, Japanese, and Turkish — points to genuine global reach rather than a coordinated promotion campaign, though the latter is not out of the question and may be contributing to the current virality.

The ByteDance question

ByteDance’s involvement is the variable that makes DeerFlow’s reception more complicated than a typical open-source release.

On the technical merits, the open-source, MIT-licensed nature of the project means the code is fully auditable. Developers can inspect what it does, where data flows, and what it sends to external services.
That is materially different from using a closed ByteDance consumer product.

But ByteDance operates under Chinese law, and for organizations in regulated industries — finance, healthcare, defense, government — the provenance of software tooling increasingly triggers formal review requirements, regardless of the code’s quality or openness. The jurisdictional question is not hypothetical: U.S. federal agencies are already operating under guidance that treats Chinese-origin software as a category requiring scrutiny.

For individual developers and small teams running fully local deployments with their own LLM API keys, those concerns are less operationally pressing. For enterprise buyers evaluating DeerFlow as infrastructure, they are not.

A real tool, with limitations

The community enthusiasm is credible, but several caveats apply.

DeerFlow 2.0 is not a consumer product. Setup requires working knowledge of Docker, YAML configuration files, environment variables, and command-line tools. There is no graphical installer. For developers comfortable with that environment, the setup is described as relatively straightforward; for others, it is a meaningful barrier.

Performance when running fully local models — rather than cloud API endpoints — depends heavily on available VRAM and hardware, with context handoff between multiple specialized models a known challenge. For multi-agent tasks running several models in parallel, the resource requirements escalate quickly.

The project’s documentation, while improving, still has gaps for enterprise integration scenarios. There has been no independent public security audit of the sandboxed execution environment, which represents a non-trivial attack surface if exposed to untrusted inputs.

And the ecosystem, while growing fast, is weeks old.
The plugin and skill library that would make DeerFlow comparably mature to established orchestration frameworks simply does not exist yet.

What does it mean for enterprises in the AI transformation age?

The deeper significance of DeerFlow 2.0 may be less about the tool itself and more about what it represents in the broader race to define autonomous AI infrastructure.

DeerFlow’s emergence as a fully capable, self-hostable, MIT-licensed agentic orchestrator adds yet another twist to the ongoing race among enterprises — and AI builders and model providers themselves — to turn generative AI models into something more than chatbots: something closer to full- or at least part-time employees, capable of both communication and reliable action.

In a sense, it marks the natural next wave after OpenClaw: whereas that open-source tool sought to create a dependable, always-on autonomous AI agent the user could message, DeerFlow is designed to let a user deploy a fleet of them and keep track of them, all within the same system.

The decision to implement it in your enterprise hinges on whether your organization’s workload demands “long-horizon” execution — complex, multi-step tasks spanning minutes to hours that involve deep research, coding, and synthesis. Unlike a standard LLM interface, this “SuperAgent” harness decomposes broad prompts into parallel sub-tasks performed by specialized experts. This architecture is specifically designed for high-context workflows where a single-pass response is insufficient and where “vibe coding” or real-time file manipulation in a secure environment is necessary.

The primary condition for use is the technical readiness of an organization’s hardware and sandbox environment. Because each task runs within an isolated Docker container with its own filesystem, shell, and browser, DeerFlow acts as a “computer-in-a-box” for the agent.
This makes it ideal for data-intensive workloads or software engineering tasks where an agent must execute and debug code safely without contaminating the host system. However, this “batteries-included” runtime places a significant burden on the infrastructure layer; decision-makers must ensure they have the GPU clusters and VRAM capacity to support multi-agent fleets running in parallel, as the framework’s resource requirements escalate quickly during complex tasks.

Strategic adoption is often a calculation between the overhead of seat-based SaaS subscriptions and the control of self-hosted open-source deployments. The MIT License positions DeerFlow 2.0 as a highly capable, royalty-free alternative to proprietary agent platforms, potentially functioning as a cost ceiling for the entire category. Enterprises should favor adoption if they prioritize data sovereignty and auditability, as the framework is model-agnostic and supports fully local execution with models like DeepSeek or Kimi. If the goal is to commoditize a digital workforce while maintaining total ownership of the tech stack, the framework provides a compelling, if technically demanding, benchmark.

Ultimately, the decision to deploy must be weighed against the inherent risks of an autonomous execution environment and its jurisdictional provenance. While sandboxing provides isolation, the ability of agents to execute bash commands creates a non-trivial attack surface that requires rigorous security governance and auditability. Furthermore, because the project is a ByteDance-led initiative via Volcengine and BytePlus, organizations in regulated sectors must reconcile its technical performance with emerging software-origin standards. Deployment is most appropriate for teams comfortable with a CLI-first, Docker-heavy setup who are ready to trade the convenience of a consumer product for a sophisticated and extensible SuperAgent harness.
Calling all gen AI disruptors of the enterprise! Apply now to present at Transform 2026
The Innovation Showcase is back at Transform 2026: The Orchestration of Enterprise Agentic AI at Scale, taking place July 14 and 15 in Menlo Park.

This year, we are moving beyond generative AI to autonomous agents, focusing on enterprise agentic orchestration, LLM observability and evaluation (LLMOps), RAG infrastructure, inference platforms and optimization, and agentic AI security and identity.

We’re on the hunt for the 10 most innovative autonomous agent technologies poised to redefine the enterprise. If you have built agents that can reason, plan and execute complex workflows independently to drive real business value, we want to see you on our main stage.

Innovators chosen to present at VB Transform 2026 will have the opportunity to share their tech with an audience of hundreds of AI industry decision-makers. You’ll receive direct, live feedback from a curated panel of enterprise tech thought leaders. Beyond the stage, every presenter receives exclusive editorial coverage from VentureBeat, positioning your agentic AI technology in front of our millions of monthly readers.

Who should apply?

We are looking for dynamic companies with compelling agentic AI technologies that are ready for prime time. Whether you are building specialized autonomous agents to support workers or the orchestration layers that manage AI agents, we want to hear your story.

We will select up to 10 candidates across two tracks: up to five seed to early-stage Series A startups (raised $50M or less) and up to five Series B or later startups, or units within mature, large companies (raised/allocated more than $50M).

If you have a product that delivers tangible enterprise results and a vision for the future of autonomous work, don’t miss this opportunity. Application deadline: June 1, 2026, at 5 p.m. PT.

Submit here

Read about last year’s winner: Solo.io
From lab to market: Rose Rock Bridge fast-tracks energy innovation in Tulsa
Presented by Tulsa Innovation Labs

As the global energy system evolves, companies are racing to adopt technologies that can deliver real-world solutions, especially in hard-to-abate industries. Oklahoma, long known as the oil capital of the world, is a center for energy innovation, with Rose Rock Bridge at the forefront.

A non-profit based in Tulsa, Rose Rock Bridge is a pilot deployment studio that connects early-stage energy startups with corporate energy partners, non-dilutive funding, and pilot opportunities that accelerate commercialization. Now accepting applications for its Spring 2026 cohort through April 6, it is seeking early- and growth-stage startups developing practical, scalable solutions to today’s most pressing energy challenges.

Rose Rock Bridge gives startups access to real-world commercial workflows and pilot opportunities through energy partners with more than $150 billion in market capitalization, including Devon Energy, H&P, ONEOK, and Williams. Backed by one of the strongest coalitions of strategic partners and investors of any energy-focused accelerator, incubator, or venture studio, the program enables startups to move quickly from development to real-world testing and deployment.

Here’s how it works:

Discover opportunities for energy innovation

Rose Rock Bridge starts by working directly with corporate innovation teams to identify high-priority technology solutions for their businesses, pinpointing which solutions will carry the most impact. Focus areas are formed around these findings.

“We don’t just chase the latest tech and hope to find a use for it. Our process starts at the asset level — identifying the specific operational bottlenecks and unmet requirements our partners are actually facing,” says Nishant Agarwal, Innovation Manager. “By leveraging our background in CVC and engineering, we run technical deep dives alongside partner subject matter experts to define the requirement first.
We then source technologies as a direct response to those needs. This ensures we aren’t just presenting ‘interesting research,’ but delivering solutions with a validated deployment pathway and a clear line of sight to a business case.”

Tapping into its network of 40+ universities, 10+ energy incubators, and Fortune 500 companies, Rose Rock Bridge then identifies emerging opportunities in the energy ecosystem. Rather than just selecting companies or ideas that might bring in capital, the studio chooses startups that have real potential to commercialize quickly in order to solve the industry’s most pressing challenges.

This year’s focus areas include:

Operational Agility & Integration

Reservoir & Production Enhancement

Fluid Systems

Robotics

“We’re evaluating deployment probability from day one,” says Andrada Pantelimon, Innovation Associate at Rose Rock Bridge, who manages sourcing strategy and startup operations. “Can this technology deliver a measurable bottom-line impact? Can it realistically pilot within 12 months? Is your team equipped to commercialize? Show us you’ve quantified your value proposition in operator terms and understand which business unit within a corporation might own this solution. If you can articulate those pieces clearly, you’re the kind of startup we want to support.”

Derisk technologies for early-stage startups & energy companies

The benefit is tangible for leading energy corporations seeking proven solutions to complex operational challenges. Rose Rock Bridge provides its corporate partners with validated, field-tested technologies while significantly reducing deployment risk. At the program’s conclusion, partners gain direct access to emerging innovations that have already undergone technical validation and operational feasibility assessment, with identified procurement pathways and pilot plans designed for commercial deployment.
Each cohort cycle, up to 15 startups are selected to enter a six-week virtual accelerator focused on pilot deployment. Founders participate in reverse pitch sessions with oil and gas partners, one-on-one clinics with industry and capital mentors, and hands-on commercialization workshops. Founders have the unique opportunity to refine their solutions, assess pilot feasibility, and build industry relationships. This approach derisks adoption and investment through iterative customer feedback, in-field testing, and pilots, enabling breakthrough technologies to reach commercial viability quickly and effectively.

“Our curriculum is singularly focused on preparing startups for the realities of corporate partnerships,” says Devon Fanfair, Rose Rock Bridge Manager and former Techstars Managing Director who is scaling the RRB program. “Founders aren’t just learning; they’re actively testing their assumptions with the exact customers who might deploy their technology. That rapid feedback loop is what transforms promising technologies into deployment-ready solutions with clear commercial pathways.”

At the culmination of the accelerator, teams participate in the Rose Rock Bridge showcase, with the opportunity to pitch their startup to the energy corporate partners they’ve worked alongside for the past six weeks. Four startups are selected to receive up to $100,000 in non-dilutive funding and opportunities for business support services, joining a one-year cohort designed to prepare technologies for market adoption.

“Rose Rock Bridge is a cornerstone of Tulsa Innovation Labs’ strategy to showcase our region as a national hub for energy innovation,” added Jennifer Hankins, Managing Director of Tulsa Innovation Labs.
“By linking emerging technologies with some of the nation’s largest energy leaders, we help move innovation from concept to market faster, drawing new businesses to the region, enhancing our existing businesses, and reinforcing Tulsa’s role in the global energy economy.”

Deploy viable energy solutions

Once selected as members of Rose Rock Bridge, startups pilot their technology with relevant energy partners and grow their venture in Tulsa. Support includes pilot design, execution, and go-to-market strategy; connections to follow-on investment opportunities; subsidized access to services including legal, marketing, and PR; and help establishing a Tulsa presence for partner access.

Rose Rock Bridge’s success is measured not just in pilot deployments, but in lasting commercial relationships. Multiple portfolio companies have progressed from initial field tests to multi-year contracts with Fortune 500 operators. By derisking the path from proof-of-concept to procurement, RRB has helped establish procurement pathways that might otherwise take years to develop, if they materialize at all.

Launched in 2022 with support from Tulsa Innovation Labs, the studio has helped companies advance new technologies, secure patents, launch products, and attract capital. It has derisked 33 startups, supported 16 active or in-development pilots, and invested more than $2 million in early-stage companies, generating a combined portfolio valuation of over $55 million. Examples of the studio’s success include Safety Radar, an AI-powered risk management platform, which secured its first contract with a Rose Rock Bridge partner, expanded to additional energy and aerospace clients, raised over $2 million, and established a Tulsa office. Kinitics Automation, a Canadian company, successfully piloted with one partner, resulting in deployments across multiple sites and effectively using RRB as its gateway to the U.S.
market.

Backed by corporate partners with more than $150 billion in combined market capitalization, Rose Rock Bridge reflects both the scale of the opportunity and Tulsa’s rising influence in energy innovation.

Devon Fanfair is Manager of Rose Rock Bridge.

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
You thought the generalist was dead — in the ‘vibe work’ era, they’re more important than ever
Not long ago, the idea of being a “generalist” in the workplace had a mixed reputation. The stereotype was the “jack of all trades” who could dabble in many disciplines but was a “master of none.” And for years, that was more or less true. Most people simply didn’t have access to the expertise required to do highly cross-functional work. If you needed a new graphic, you waited for a designer. If you needed to change a contract, you waited for legal. In smaller organizations and startups, this waiting game was typically replaced with inaction or improvisation — often with questionable results.

AI is changing this faster than any technology shift I’ve seen. It’s allowing people to succeed at tasks beyond their normal area of expertise. Anthropic found that AI is “enabling engineers to become more full-stack in their work,” meaning they’re able to make competent decisions across a much wider range of interconnected technologies. A direct consequence is that tasks that would have been left aside for lack of time or expertise are now being accomplished — 27% of AI-assisted work, per Anthropic’s study.
This shift closely mirrors the effects of past revolutionary technologies. The invention of the automobile or the computer did not bring us a wealth of leisure time — it mainly led us to start doing work that could not be done before. With AI as a guide, anyone can now expand their skillset and augment their expertise to accomplish more. This fundamentally changes what people can do, who can do it, how teams operate, and what leaders should expect.

Well, not so fast. The AI advances have been incredible, and if 2025 did not fully deliver on its promise of bringing AI agents to the workforce, there’s no reason to doubt it’s well on its way. But for now, it’s not perfect. If to err is human, to trust AI not to err is foolish.

One of the biggest challenges of working with AI is identifying hallucinations. The term was coined, I assume, not as a cute way to refer to factual errors, but as an apt way of describing the conviction AI exhibits in its erroneous answers. We humans have a clear bias toward confident people, which probably explains the number of smart people getting burned after taking ChatGPT at face value. And if experts can be fooled by an overconfident AI, how can generalists hope to harness the power of AI without making the same mistake?

Citizen guardrails give way to vibe freedom

It’s tempting to compare today’s AI vibe coding wave to the rise of low- and no-code tools. No-code tools gave users the freedom to build custom software tailored to their needs. However, the comparison doesn’t quite hold. The so-called “citizen developers” could only operate inside the boundaries the tool allowed. Those tight constraints were limiting, but they had the benefit of saving users from themselves — preventing anything catastrophic. AI removes those boundaries almost entirely, and with great freedom comes responsibility that most people aren’t quite prepared for.
The first stage of ‘vibe freedom’ is one of unbridled optimism encouraged by a sycophantic AI. “You’re absolutely correct!” The dreaded report that would have taken all night looks better than anything you could have done yourself and only took a few minutes.
The next stage comes almost by surprise — something isn’t quite right. You start doubting the accuracy of the work; you review it and wonder whether it wouldn’t have been quicker to just do it yourself in the first place.

Then comes bargaining and acceptance. You argue with the AI, you’re led down confusing paths, but slowly you start developing an understanding — a mental model of the AI mind. You learn to recognize the confidently incorrect, to push back and cross-check, to trust and verify.

The generalist becomes the trust layer

This is a skill that can be learned, and it can only be learned on the job, through regular practice. It doesn’t require deep specialization, but it does require awareness. Curiosity becomes essential. So does the willingness to learn quickly, think critically, spot inconsistencies, and rely on judgment rather than treating AI as infallible.

That’s the new job of the generalist: not to be an expert in everything, but to understand the AI mind well enough to catch when something is off, and to defer to a true specialist when the stakes are high. The generalist becomes the human trust layer sitting between the AI’s output and the organization’s standards. They decide what passes and what gets a second opinion.

That said, this only works if the generalist clears a minimum bar of fluency. There’s a big difference between “broadly informed” and “confidently unaware.” AI makes that gap easier to miss.

Impact on teams and hiring

Clearly, specialists will not be replaced by AI anytime soon. Their work remains critical, and it will evolve to become more strategic. What AI changes is everything around the edges: roles that felt important but were hard to fill, tasks that sat in limbo because no expert was available, backlogs created by waiting for highly skilled people to review simple work. Now, a generalist can get much farther on their own, and specialists can focus on the hardest problems.
We’re already starting to see an impact on the hiring landscape. Companies are looking to bring on people who are comfortable navigating AI: people who embrace it and use it to take on projects outside their comfort zone. Performance expectations will shift too. Many leaders are already looking less at productivity alone and more at how effectively someone uses AI. We see token usage not as a measure of cost, but as an indicator of AI adoption and, perhaps optimistically, as a proxy for productivity.

Making vibe work viable

Use AI to enhance work, not to wing it: You will get burned letting AI loose. It requires guidance and oversight.

Learn when to trust and when to verify: Build an understanding of the AI mind so you can exercise good judgment on the work produced. When in doubt or when the stakes are high, defer to specialists.

Set clear organizational standards: AI thrives on context, and so do humans. Invest in documentation of processes, procedures, and best practices.

Keep humans in the loop: AI shouldn’t remove oversight. It should make oversight easier.

Without these factors, AI work stays in the “vibe” stage. With them, it becomes something the business can actually rely on.

Return of the generalist

The emerging, AI-empowered generalist is defined by curiosity, adaptability, and the ability to evaluate the work AI produces. They can span multiple functions, not because they’re experts in each one, but because AI gives them access to specialist-level expertise. Most importantly, this new generation of generalists knows when and how to apply human judgment and critical thinking. That’s the real determining factor for turning vibes into something reliable, sustainable, and viable in the long run.

Cedric Savarese is founder and CEO of FormAssembly.
Testing autonomous agents (Or: how I learned to stop worrying and embrace chaos)
Look, we’ve spent the last 18 months building production AI systems, and we’ll tell you what keeps us up at night — and it’s not whether the model can answer questions. That’s table stakes now. What haunts us is the mental image of an agent autonomously approving a six-figure vendor contract at 2 a.m. because someone typo’d a config file.

We’ve moved past the era of “ChatGPT wrappers” (thank God), but the industry still treats autonomous agents like they’re just chatbots with API access. They’re not. When you give an AI system the ability to take actions without human confirmation, you’re crossing a fundamental threshold. You’re not building a helpful assistant anymore — you’re building something closer to an employee. And that changes everything about how we need to engineer these systems.

The autonomy problem nobody talks about

Here’s what’s wild: We’ve gotten really good at making models that *sound* confident. But confidence and reliability aren’t the same thing, and the gap between them is where production systems go to die.

We learned this the hard way during a pilot program where we let an AI agent manage calendar scheduling across executive teams. Seems simple, right? The agent could check availability, send invites, handle conflicts. Except, one Monday morning, it rescheduled a board meeting because it interpreted “let’s push this if we need to” in a Slack message as an actual directive. The model wasn’t wrong in its interpretation — it was plausible. But plausible isn’t good enough when you’re dealing with autonomy.

That incident taught us something crucial: The challenge isn’t building agents that work most of the time.
It’s building agents that fail gracefully, know their limitations, and have the circuit breakers to prevent catastrophic mistakes.

What reliability actually means for autonomous systems

When we talk about reliability in traditional software engineering, we’ve got decades of patterns: Redundancy, retries, idempotency, graceful degradation. But AI agents break a lot of our assumptions. Traditional software fails in predictable ways. You can write unit tests. You can trace execution paths. With AI agents, you’re dealing with probabilistic systems making judgment calls. A bug isn’t just a logic error — it’s the model hallucinating a plausible-sounding but completely fabricated API endpoint, or misinterpreting context in a way that technically parses but completely misses the human intent.

So what does reliability look like here? In our experience, it’s a layered approach.

Layer 1: Model selection and prompt engineering

This is foundational but insufficient. Yes, use the best model you can afford. Yes, craft your prompts carefully with examples and constraints. But don’t fool yourself into thinking that a great prompt is enough. We’ve seen too many teams ship “GPT-4 with a really good system prompt” and call it enterprise-ready.

Layer 2: Deterministic guardrails

Before the model does anything irreversible, run it through hard checks. Is it trying to access a resource it shouldn’t? Is the action within acceptable parameters? We’re talking old-school validation logic — regex, schema validation, allowlists. It’s not sexy, but it’s effective.

One pattern that’s worked well for us: Maintain a formal action schema. Every action an agent can take has a defined structure, required fields, and validation rules. The agent proposes actions in this schema, and we validate before execution.
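A minimal sketch of this schema-validation pattern might look like the following. The action names, required fields, and allowlist are invented for illustration and aren’t from any real system:

```python
# Hypothetical action allowlist and schema -- every name and field here is
# invented for illustration, not taken from a real deployment.
ALLOWED_ACTIONS = {"send_email", "create_event"}

REQUIRED_FIELDS = {
    "send_email": {"to", "subject", "body"},
    "create_event": {"title", "start", "attendees"},
}

def validate(action: dict) -> list:
    """Return a list of validation errors; an empty list means the action may run."""
    errors = []
    name = action.get("name")
    if name not in ALLOWED_ACTIONS:
        errors.append(f"action {name!r} is not on the allowlist")
        return errors
    missing = REQUIRED_FIELDS[name] - set(action.get("args", {}))
    if missing:
        errors.append(f"missing required fields: {sorted(missing)}")
    return errors

# The agent proposes a structured action; we validate before anything executes.
proposed = {"name": "send_email", "args": {"to": "a@example.com"}}
print(validate(proposed))  # ["missing required fields: ['body', 'subject']"]
```

Because these checks are deterministic, they behave identically no matter how convincingly the model argues for an action.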
If validation fails, we don’t just block it — we feed the validation errors back to the agent and let it try again with context about what went wrong.

Layer 3: Confidence and uncertainty quantification

Here’s where it gets interesting. We need agents that know what they don’t know. We’ve been experimenting with agents that can explicitly reason about their confidence before taking actions. Not just a probability score, but actual articulated uncertainty: “I’m interpreting this email as a request to delay the project, but the phrasing is ambiguous and could also mean…”

This doesn’t prevent all mistakes, but it creates natural breakpoints where you can inject human oversight. High-confidence actions go through automatically. Medium-confidence actions get flagged for review. Low-confidence actions get blocked with an explanation.

Layer 4: Observability and auditability

If you can’t debug it, you can’t trust it. Every decision the agent makes needs to be loggable, traceable, and explainable. Not just “what action did it take” but “what was it thinking, what data did it consider, what was the reasoning chain?”

We’ve built a custom logging system that captures the full large language model (LLM) interaction — the prompt, the response, the context window, even the model temperature settings. It’s verbose as hell, but when something goes wrong (and it will), you need to be able to reconstruct exactly what happened. Plus, this becomes your dataset for fine-tuning and improvement.

Guardrails: The art of saying no

Let’s talk about guardrails, because this is where engineering discipline really matters. A lot of teams approach guardrails as an afterthought — “we’ll add some safety checks if we need them.” That’s backwards. Guardrails should be your starting point. We think of guardrails in three categories.

Permission boundaries

What is the agent physically allowed to do? This is your blast radius control.
Even if the agent hallucinates the worst possible action, what’s the maximum damage it can cause?

We use a principle called “graduated autonomy.” New agents start with read-only access. As they prove reliable, they graduate to low-risk writes (creating calendar events, sending internal messages). High-risk actions (financial transactions, external communications, data deletion) either require explicit human approval or are simply off-limits.

One technique that’s worked well: Action cost budgets. Each agent has a daily “budget” denominated in some unit of risk or cost. Reading a database record costs 1 unit. Sending an email costs 10. Initiating a vendor payment costs 1,000. The agent can operate autonomously until it exhausts its budget; then, it needs human intervention. This creates a natural throttle on potentially problematic behavior.

Semantic boundaries

What should the agent understand as in-scope vs. out-of-scope? This is trickier because it’s conceptual, not just technical. We’ve found that explicit domain definitions help a lot. Our customer service agent has a clear mandate: handle product questions, process returns, escalate complaints. Anything outside that domain — someone asking for investment advice, technical support for third-party products, personal favors — gets a polite deflection and escalation.

The challenge is making these boundaries robust to prompt injection and jailbreaking attempts. Users will try to convince the agent to help with out-of-scope requests. Other parts of the system might inadvertently pass instructions that override the agent’s boundaries. You need multiple layers of defense here.

Operational boundaries

How much can the agent do, and how fast? This is your rate limiting and resource control. We’ve implemented hard limits on everything: API calls per minute, maximum tokens per interaction, maximum cost per day, maximum number of retries before human escalation.
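The action-cost budget described earlier in this section can be sketched in a few lines. The costs mirror the numbers in the text; the class and its API are invented for illustration:

```python
# Illustrative costs in abstract risk units, mirroring the examples in the text.
ACTION_COSTS = {"read_record": 1, "send_email": 10, "vendor_payment": 1000}

class ActionBudget:
    """A daily risk budget: the agent runs autonomously until the budget is
    exhausted, at which point further actions require human intervention."""

    def __init__(self, daily_limit):
        self.daily_limit = daily_limit
        self.spent = 0

    def try_spend(self, action):
        cost = ACTION_COSTS[action]
        if self.spent + cost > self.daily_limit:
            return False  # over budget: the caller should escalate to a human
        self.spent += cost
        return True

budget = ActionBudget(daily_limit=100)
print(budget.try_spend("send_email"))      # True: 10 of 100 units spent
print(budget.try_spend("vendor_payment"))  # False: 1,000 units exceeds the budget
```

Note how a high-risk action like a vendor payment is gated by construction: with a daily limit of 100 units, it always trips the human-escalation path.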
These might seem like artificial constraints, but they’re essential for preventing runaway behavior. We once saw an agent get stuck in a loop trying to resolve a scheduling conflict. It kept proposing times, getting rejections, and trying again. Without rate limits, it sent 300 calendar invites in an hour. With proper operational boundaries, it would’ve hit a threshold and escalated to a human after attempt number 5.

Agents need their own style of testing

Traditional software testing doesn’t cut it for autonomous agents. You can’t just write test cases that cover all the edge cases, because with LLMs, everything is an edge case. Here’s what’s worked for us.

Simulation environments

Build a sandbox that mirrors production but with fake data and mock services. Let the agent run wild. See what breaks. We do this continuously — every code change goes through 100 simulated scenarios before it touches production. The key is making scenarios realistic. Don’t just test happy paths. Simulate angry customers, ambiguous requests, contradictory information, system outages. Throw in some adversarial examples. If your agent can’t handle a test environment where things go wrong, it definitely can’t handle production.

Red teaming

Get creative people to try to break your agent. Not just security researchers, but domain experts who understand the business logic. Some of our best improvements came from sales team members who tried to “trick” the agent into doing things it shouldn’t.

Shadow mode

Before you go live, run the agent in shadow mode alongside humans. The agent makes decisions, but humans actually execute the actions. You log both the agent’s choices and the human’s choices, and you analyze the delta. This is painful and slow, but it’s worth it. You’ll find all kinds of subtle misalignments you’d never catch in testing. Maybe the agent technically gets the right answer, but with phrasing that violates company tone guidelines. Maybe it makes legally correct but ethically questionable decisions.
Shadow mode surfaces these issues before they become real problems.

The human-in-the-loop pattern

Despite all the automation, humans remain essential. The question is: Where in the loop? We’re increasingly convinced that “human-in-the-loop” is actually several distinct patterns:

Human-on-the-loop: The agent operates autonomously, but humans monitor dashboards and can intervene. This is your steady-state for well-understood, low-risk operations.

Human-in-the-loop: The agent proposes actions, humans approve them. This is your training-wheels mode while the agent proves itself, and your permanent mode for high-risk operations.

Human-with-the-loop: Agent and human collaborate in real time, each handling the parts they’re better at. The agent does the grunt work, the human makes the judgment calls.

The trick is making these transitions smooth. An agent shouldn’t feel like a completely different system when you move from autonomous to supervised mode. Interfaces, logging, and escalation paths should all be consistent.

Failure modes and recovery

Let’s be honest: Your agent will fail. The question is whether it fails gracefully or catastrophically. We classify failures into three categories:

Recoverable errors: The agent tries to do something, it doesn’t work, the agent realizes it didn’t work and tries something else. This is fine. This is how complex systems operate. As long as the agent isn’t making things worse, let it retry with exponential backoff.

Detectable failures: The agent does something wrong, but monitoring systems catch it before significant damage occurs. This is where your guardrails and observability pay off. The agent gets rolled back, humans investigate, you patch the issue.

Undetectable failures: The agent does something wrong, and nobody notices until much later. These are the scary ones. Maybe it’s been misinterpreting customer requests for weeks. Maybe it’s been making subtly incorrect data entries.
These accumulate into systemic issues. The defense against undetectable failures is regular auditing. We randomly sample agent actions and have humans review them. Not just pass/fail, but detailed analysis: Is the agent showing any drift in behavior? Are there patterns in its mistakes? Is it developing any concerning tendencies?

The cost-performance tradeoff

Here’s something nobody talks about enough: Reliability is expensive. Every guardrail adds latency. Every validation step costs compute. Multiple model calls for confidence checking multiply your API costs. Comprehensive logging generates massive data volumes.

You have to be strategic about where you invest. Not every agent needs the same level of reliability. A marketing copy generator can be looser than a financial transaction processor. A scheduling assistant can retry more liberally than a code deployment system. We use a risk-based approach: High-risk agents get all the safeguards, multiple validation layers, and extensive monitoring. Lower-risk agents get lighter-weight protections. The key is being explicit about these trade-offs and documenting why each agent has the guardrails it does.

Organizational challenges

We’d be remiss if we didn’t mention that the hardest parts aren’t technical — they’re organizational. Who owns the agent when it makes a mistake? Is it the engineering team that built it? The business unit that deployed it? The person who was supposed to be supervising it? How do you handle edge cases where the agent’s logic is technically correct but contextually inappropriate? If the agent follows its rules but violates an unwritten norm, who’s at fault? What’s your incident response process when an agent goes rogue? Traditional runbooks assume human operators making mistakes. How do you adapt them for autonomous systems?

These questions don’t have universal answers, but they need to be addressed before you deploy.
Clear ownership, documented escalation paths, and well-defined success metrics are just as important as the technical architecture.

Where we go from here

The industry is still figuring this out. There’s no established playbook for building reliable autonomous agents. We’re all learning in production, and that’s both exciting and terrifying. What we know for sure: The teams that succeed will be the ones who treat this as an engineering discipline, not just an AI problem. You need traditional software engineering rigor — testing, monitoring, incident response — combined with new techniques specific to probabilistic systems.

You need to be paranoid but not paralyzed. Yes, autonomous agents can fail in spectacular ways. But with proper guardrails, they can also handle enormous workloads with superhuman consistency. The key is respecting the risks while embracing the possibilities.

We’ll leave you with this: Every time we deploy a new autonomous capability, we run a pre-mortem. We imagine it’s six months from now and the agent has caused a significant incident. What happened? What warning signs did we miss? What guardrails failed? This exercise has saved us more times than we can count. It forces you to think through failure modes before they occur, to build defenses before you need them, to question assumptions before they bite you.

Because in the end, building enterprise-grade autonomous AI agents isn’t about making systems that work perfectly. It’s about making systems that fail safely, recover gracefully, and learn continuously. And that’s the kind of engineering that actually matters.

Madhvesh Kumar is a principal engineer. Deepika Singh is a senior software engineer. Views expressed are based on hands-on experience building and deploying autonomous agents, along with the occasional 3 a.m. incident response that makes you question your career choices.
Three ways AI is learning to understand the physical world
Large language models are running into limits in domains that require an understanding of the physical world — from robotics to autonomous driving to manufacturing. That constraint is pushing investors toward world models, with AMI Labs raising a $1.03 billion seed round shortly after World Labs secured $1 billion.

Large language models (LLMs) excel at processing abstract knowledge through next-token prediction, but they fundamentally lack grounding in physical causality. They cannot reliably predict the physical consequences of real-world actions. AI researchers and thought leaders are increasingly vocal about these limitations as the industry tries to push AI out of web browsers and into physical spaces. In an interview with podcaster Dwarkesh Patel, Turing Award recipient Richard Sutton warned that LLMs just mimic what people say instead of modeling the world, which limits their capacity to learn from experience and adjust to changes in the world. This is why models based on LLMs, including vision-language models (VLMs), can show brittle behavior and break with very small changes to their inputs. Google DeepMind CEO Demis Hassabis echoed this sentiment in another interview, pointing out that today’s AI models suffer from “jagged intelligence”: They can solve complex math olympiad problems but fail at basic physics because they are missing critical capabilities regarding real-world dynamics.

To solve this problem, researchers are shifting focus to building world models that act as internal simulators, allowing AI systems to safely test hypotheses before taking physical action. “World models” is an umbrella term, however, and it currently covers three distinct architectural approaches, each with different tradeoffs.

JEPA: built for real-time

The first main approach focuses on learning latent representations instead of trying to predict the dynamics of the world at the pixel level.
Endorsed by AMI Labs, this method is heavily based on the Joint Embedding Predictive Architecture (JEPA). JEPA models try to mimic how humans understand the world. When we observe the world, we do not memorize every single pixel or irrelevant detail in a scene. If you watch a car driving down a street, you track its trajectory and speed; you do not calculate the exact reflection of light on every single leaf of the trees in the background.

JEPA models reproduce this human cognitive shortcut. Instead of forcing the neural network to predict exactly what the next frame of a video will look like, the model learns a smaller set of abstract, or “latent,” features. It discards the irrelevant details and focuses entirely on the core rules of how elements in the scene interact. This makes the model robust against background noise and small changes that break other models.

This architecture is highly compute- and memory-efficient. By ignoring irrelevant details, it requires far fewer training examples and runs with significantly lower latency. These characteristics make it suitable for applications where efficiency and real-time inference are non-negotiable, such as robotics, self-driving cars, and high-stakes enterprise workflows. For example, AMI is partnering with healthcare company Nabla to use this architecture to simulate operational complexity and reduce cognitive load in fast-paced healthcare settings. In an interview with Newsweek, Yann LeCun, a pioneer of the JEPA architecture and co-founder of AMI, explained that world models based on JEPA are designed to be “controllable in the sense that you can give them goals, and by construction, the only thing they can do is accomplish those goals.”

Gaussian splats: built for space

A second approach leans on generative models to build complete spatial environments from scratch.
Adopted by companies like World Labs, this method takes an initial prompt (an image or a textual description) and uses a generative model to create a 3D Gaussian splat. A Gaussian splat is a technique for representing 3D scenes using millions of tiny mathematical particles that define geometry and lighting. Unlike flat video generation, these 3D representations can be imported directly into standard physics and 3D engines, such as Unreal Engine, where users and other AI agents can freely navigate and interact with them from any angle.

The primary benefit here is a drastic reduction in the time and one-time generation cost required to create complex interactive 3D environments. It addresses the exact problem outlined by World Labs founder Fei-Fei Li, who noted that LLMs are ultimately like “wordsmiths in the dark,” possessing flowery language but lacking spatial intelligence and physical experience. World Labs’ Marble model gives AI that missing spatial awareness. While this approach is not designed for split-second, real-time execution, it has massive potential for spatial computing, interactive entertainment, industrial design, and building static training environments for robotics. The enterprise value is evident in Autodesk’s heavy backing of World Labs to integrate these models into its industrial design applications.

End-to-end generation: built for scale

The third approach uses an end-to-end generative model to process prompts and user actions, continuously generating the scene, physical dynamics, and reactions on the fly. Rather than exporting a static 3D file to an external physics engine, the model itself acts as the engine. It ingests an initial prompt alongside a continuous stream of user actions and generates the subsequent frames of the environment in real time, calculating physics, lighting, and object reactions natively. DeepMind’s Genie 3 and Nvidia’s Cosmos fall into this category.
These models provide a simple interface for generating infinite interactive experiences and massive volumes of synthetic data. DeepMind demonstrated this with Genie 3, showcasing how the model maintains strict object permanence and consistent physics at 24 frames per second without relying on a separate memory module.

This approach translates directly into heavy-duty synthetic data factories. Nvidia Cosmos uses this architecture to scale synthetic data and physical AI reasoning, allowing autonomous vehicle and robotics developers to synthesize rare, dangerous edge-case conditions without the cost or risk of physical testing. Waymo (a fellow Alphabet subsidiary) built its world model on top of Genie 3, adapting it for training its self-driving cars. The downside of this end-to-end generative method is the substantial compute cost of continuously rendering physics and pixels simultaneously. Still, the investment is necessary to achieve the vision laid out by Hassabis, who argues that current AI lacks the deep, internal understanding of physical causality it needs to operate safely in the real world.

What comes next: hybrid architectures

LLMs will continue to serve as the reasoning and communication interface, but world models are positioning themselves as foundational infrastructure for physical and spatial data pipelines. As the underlying models mature, we are seeing the emergence of hybrid architectures that draw on the strengths of each approach. For example, cybersecurity startup DeepTempo recently developed LogLM, a model that integrates elements from LLMs and JEPA to detect anomalies and cyber threats in security and network logs.
Scale AI launches Voice Showdown, the first real-world benchmark for voice AI — and the results are humbling for some top models
Voice AI is moving faster than the tools we use to measure it. Every major AI lab (OpenAI, Google DeepMind, Anthropic, xAI) is racing to ship voice models capable of natural, real-time conversation. But the benchmarks used to evaluate those models still largely run on synthetic speech, English-only prompts, and scripted test sets that bear little resemblance to how people actually talk.

Scale AI, the large data-annotation startup whose founder was poached by Meta last year to lead its Superintelligence Lab, is still going strong and is tackling the problem head-on: today it launches Voice Showdown, what it calls the first global preference-based arena designed to benchmark voice AI through the lens of real human interaction. The product offers users a unique strategic value: free access to the world’s leading frontier models. Through Scale’s ChatLab platform, users can interact with high-tier models, which typically require multiple $20-per-month subscriptions, at no cost. In exchange, they participate in occasional blind, head-to-head “battles,” choosing which of two anonymized voice models offers the better experience and thereby providing data for the industry’s most authentic human-preference leaderboard of voice AI models.

“Voice AI is really the fastest-moving frontier in AI right now,” said Janie Gu, product manager for Showdown at Scale AI. “But the way that we evaluate voice models hasn’t kept up.”

The results, drawn from thousands of spontaneous voice conversations across more than 60 languages, reveal capability gaps that other benchmarks have consistently missed.

How Scale’s Voice Showdown works

Voice Showdown is built on ChatLab, Scale’s model-agnostic chat platform where users can freely interact with whichever frontier AI model they choose, for free, within a single app. The platform has been available to Scale’s global community of over 500,000 annotators, roughly 300,000 of whom have submitted at least one prompt.
Scale is opening the platform to a public waitlist today.

The evaluation mechanism is elegant in its simplicity: while a user is having a natural voice conversation with a model, the system occasionally (on fewer than 5% of all voice prompts) surfaces a blind side-by-side comparison. The same prompt is sent to a second, anonymous model, and the user picks which response they prefer.

This design solves three problems that plague existing voice benchmarks. First, every prompt comes from real human speech, with accents, background noise, half-finished sentences, and conversational filler, rather than synthesized audio generated from text. Second, the platform spans more than 60 languages across six continents, with over a third of battles occurring in non-English languages including Spanish, Arabic, Japanese, Portuguese, Hindi, and French. Third, because battles occur within users’ actual daily conversations, 81% of prompts are conversational or open-ended: questions without a single correct answer. That rules out automated scoring and makes human preference the only credible signal.

Voice Showdown currently runs two evaluation modes: Dictate (users speak, models respond with text) and Speech-to-Speech, or S2S (users speak, models talk back). A third mode, Full Duplex, which captures real-time, interruptible conversation, is in development.

Incentive-aligned voting

One design detail sets Voice Showdown apart from Chatbot Arena (LM Arena), the text benchmark it most closely resembles. In LM Arena, critics have noted that users sometimes cast throwaway votes with little stake in the outcome. Voice Showdown addresses this directly: after a user votes for the model they preferred, the app switches them to that model for the rest of their conversation. If you voted for GPT-4o Audio over Gemini, you’re now talking to GPT-4o Audio.
That alignment of consequence with preference discourages casual or dishonest voting.

The system also controls for confounds that could corrupt comparisons: both model responses begin streaming simultaneously (eliminating speed bias), voice gender is matched across both options (eliminating gender-preference bias), and neither model is identified by name during voting.

The new Voice AI leaderboard every enterprise decision-maker should pay attention to

Voice Showdown launches with 11 frontier models evaluated across 52 model-voice pairs as of March 18, 2026. Not all models support both evaluation modes: the Dictate leaderboard includes 8 models, while S2S includes 6.

Dictate Leaderboard (Speech-In, Text-Out)

In this mode, users provide a spoken prompt and evaluate two side-by-side text responses. Here are the baseline scores:

Gemini 3 Pro (1073)
Gemini 3 Flash (1068)
GPT-4o Audio (1019)
Qwen 3 Omni (1000)
Voxtral Small (925)
Gemma 3n (918)
GPT Realtime (875)
Phi-4 Multimodal (729)

Note: Gemini 3 Pro and Gemini 3 Flash are statistically tied for the top rank.

Speech-to-Speech (S2S) Leaderboard

In this mode, users speak to the model and evaluate two competing audio responses. Baseline scores:

Gemini 2.5 Flash Audio (1060)
GPT-4o Audio (1059)
Grok Voice (1024)
Qwen 3 Omni (1000)
GPT Realtime (962)
GPT Realtime 1.5 (920)

Note: Gemini 2.5 Flash Audio and GPT-4o Audio are statistically tied for the top rank in baseline evaluations.

Dictate rankings are led by Google’s Gemini 3 Pro and Gemini 3 Flash, which are statistically tied at #1 with Elo scores around 1,043-1,044 after style controls. GPT-4o Audio holds a clear third place. Open-weight models including Gemma 3n, Voxtral Small, and Phi-4 Multimodal trail significantly.

Speech-to-Speech (S2S) rankings show a tighter race at the top, with Gemini 2.5 Flash Audio and GPT-4o Audio statistically tied at #1 in the baseline rankings.
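The Elo scores that order these leaderboards are built from pairwise battle outcomes. Scale has not published its exact rating math; the sketch below is the standard Elo update that preference arenas of this kind are typically based on:

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Update two models' ratings after one blind A-vs-B preference vote.

    Standard Elo: the expected win probability follows a logistic curve
    on the rating gap, and the winner gains exactly what the loser sheds.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# A user prefers the 1000-rated model over the 1060-rated one; an upset
# like this moves more points than a win by the favorite would.
new_a, new_b = elo_update(1000.0, 1060.0, a_won=True)
```

Because upsets shift ratings more than expected results do, thousands of individual votes converge on stable spreads like the ones shown above, and models that are rarely preferred against stronger opponents sink quickly.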
After adjusting for response length and formatting, factors that can inflate perceived quality, GPT-4o Audio pulls ahead (1,102 Elo vs. 1,075 for Gemini 2.5 Flash Audio). Grok Voice jumps to a close second at 1,093 under style controls, suggesting its raw #3 ranking undersells its actual performance.

Qwen 3 Omni, the open-weight model from Alibaba’s Qwen team, performs better on pure preference than its popularity would suggest, ranking fourth in both modes, ahead of several higher-profile names. “When people come in, they go for the big names,” Gu noted. “But for preference, lesser-known models like Qwen actually pull ahead.”

Surprises revealed by real-world preference data

Beyond rankings, Voice Showdown’s real value is in the failure diagnostics, and those paint a more complicated picture of voice AI than most leaderboards reveal.

The multilingual gap is worse than you think

Language robustness is the starkest differentiator across models. In Dictate, Gemini 3 models lead across essentially every language tested. In S2S, the winner depends heavily on which language is being spoken: GPT-4o Audio leads in Arabic and Turkish; Gemini 2.5 Flash Audio is strongest in French; Grok Voice is competitive in Japanese and Portuguese.

But the more alarming finding is how frequently some models simply stop responding in the user’s language at all. GPT Realtime 1.5, OpenAI’s newer real-time voice model, responds in English to non-English prompts roughly 20% of the time, even on high-resource, officially supported languages like Hindi, Spanish, and Turkish. Its predecessor, GPT Realtime, mismatches at about half that rate (~10%).
Gemini 2.5 Flash Audio and GPT-4o Audio sit at ~7%.

The phenomenon runs in both directions: some models carry non-English context from earlier in a conversation into an English turn, or simply mishear a prompt and generate an unrelated response in the wrong language entirely. User verbatims from the platform capture the frustration bluntly:

“I said I have an interview today with Quest Management and instead of answering, it gave me information about ‘Risk Management.'”

“GPT Realtime 1.5 thought I was speaking incoherently and recommended mental health assistance, while Qwen 3 Omni correctly identified I was speaking a Nigerian local language.”

The reason existing benchmarks miss this: they’re built on synthetic speech optimized for clean acoustic conditions, and they’re rarely multilingual. Real speakers in real environments, with background noise, short utterances, and regional accents, break speech understanding in ways lab conditions don’t anticipate.

Voice selection is more than aesthetics

Voice Showdown evaluates models not just at the model level but at the individual voice level, and the variance within a single model’s voice catalog is striking. For one unnamed model in the study, the best-performing voice won 30 percentage points more often than the worst-performing voice from the same underlying model. Both voices share the same reasoning and generation backend; the difference is purely in audio presentation.

The top-performing voices tend to win or lose on audio understanding and content completeness, that is, whether the model heard you correctly and answered fully. But speech quality remains a deciding factor at the voice-selection level, particularly when models are otherwise comparable. “Voice directly shapes how users evaluate the interaction,” Gu said.

Models degrade in conversation

Most benchmarks test a single turn.
Voice Showdown tests how models hold up across extended conversations, and the results aren’t flattering. On Turn 1, content quality accounts for 23% of model failures. By Turn 11 and beyond, it becomes the primary failure mode at 43%. Most models see their win rates decline as conversations extend, struggling to maintain coherence across multiple exchanges. GPT Realtime variants are an exception, marginally improving on later turns, consistent with their known strengths on longer contexts and their documented weakness on the brief, noisy utterances that dominate early interactions.

Prompt length shows a complementary pattern: short prompts (under 10 seconds) are dominated by audio-understanding failures (38%), while long prompts (over 40 seconds) shift the primary failure toward content quality (31%). Shorter audio gives models less acoustic context to parse; longer requests are understood but harder to answer well.

Why some voice AI models lose

After every S2S comparison, users tag why they preferred one response over the other across three axes: audio understanding, content quality, and speech output. The failure signatures differ meaningfully by model. Qwen 3 Omni’s losses cluster around speech generation: its reasoning is competitive, but users are put off by how it sounds. GPT Realtime 1.5’s losses are dominated by audio-understanding failures (51%), consistent with its language-switching behavior on challenging prompts. Grok Voice’s failures are more balanced across all three axes, indicating no single dominant weakness but no particular strength either.

What’s next

The current leaderboard covers turn-based interaction: you speak, the model responds, repeat. But real voice conversations don’t work that way.
People interrupt, change direction mid-sentence, and talk over each other. Scale says Full Duplex evaluation, designed to capture these real-time dynamics through human preference rather than scripted scenarios or automated metrics, is coming to Showdown next. No existing benchmark captures full-duplex interaction through organic human-preference data.

The leaderboard is live at scale.com/showdown. A public waitlist to join ChatLab and vote on comparisons is open today, with users receiving free access to frontier voice models including GPT-4o, Gemini, and Grok in exchange for occasional preference votes.