The security industry has spent the last year talking about models, copilots, and agents, but a quieter shift is happening one layer below all of that: vendors are lining up around a shared way to describe security data. The Open Cybersecurity Schema Framework (OCSF) is emerging as one of the strongest candidates for that job. It gives vendors, enterprises, and practitioners a common way to represent security events, findings, objects, and context. That means less time rewriting field names and custom parsers and more time correlating detections, running analytics, and building workflows that work across products. In a market where every security team is stitching together endpoint, identity, cloud, SaaS, and AI telemetry, a common data layer long felt like a pipe dream, and OCSF now puts it within reach.

OCSF in plain language

OCSF is an open-source framework for cybersecurity schemas. It’s vendor neutral by design and deliberately agnostic to storage format, data collection, and ETL choices. In practical terms, it gives application teams and data engineers a shared structure for events, so analysts can work with a more consistent language for threat detection and investigation.

That sounds dry until you look at the daily work inside a security operations center (SOC). Security teams spend a lot of effort normalizing data from different tools so that they can correlate events. For example, detecting an employee logging in from San Francisco at 10 a.m. on their laptop, then accessing a cloud resource from New York at 10:02 a.m., could reveal a leaked credential. Setting up a system that can correlate those events, however, is no easy task: different tools describe the same idea with different fields, nesting structures, and assumptions. OCSF was built to lower this tax. It helps vendors map their own schemas into a common model and helps customers move data through lakes, pipelines, and security information and event management (SIEM) tools without time-consuming translation at every hop.
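To make that concrete, here is a minimal sketch of what such a mapping can look like: a hypothetical vendor login record translated into an OCSF-style Authentication event. The vendor field names are invented, and the OCSF fields shown approximate the schema’s Authentication class rather than forming a complete, validated event.

```python
# Illustrative only: maps a made-up vendor login record into an
# OCSF-style Authentication event. The vendor-side field names are
# hypothetical; the OCSF side approximates the Authentication class
# (Identity & Access Management category) and is not exhaustive.

def to_ocsf_authentication(vendor_event: dict) -> dict:
    return {
        "class_uid": 3002,               # OCSF Authentication class
        "category_uid": 3,               # Identity & Access Management
        "activity_id": 1,                # Logon
        "time": vendor_event["ts_epoch_ms"],
        "severity_id": 1,                # Informational
        "actor": {"user": {"name": vendor_event["username"]}},
        "src_endpoint": {
            "ip": vendor_event["source_ip"],
            "location": {"city": vendor_event.get("geo_city")},
        },
        "status_id": 1 if vendor_event["result"] == "SUCCESS" else 2,
        "metadata": {
            "product": {"vendor_name": vendor_event["product"]},
            "version": "1.7.0",          # OCSF schema version
        },
    }

# Two logins from different cities minutes apart now share one shape,
# so an "impossible travel" rule can compare src_endpoint.location
# across products without per-vendor parsers.
event = {
    "ts_epoch_ms": 1767225720000,
    "username": "jdoe",
    "source_ip": "203.0.113.7",
    "geo_city": "New York",
    "result": "SUCCESS",
    "product": "ExampleSSO",
}
print(to_ocsf_authentication(event)["src_endpoint"]["location"])
```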
The last two years have been unusually fast

Most of OCSF’s visible acceleration has happened in the last two years. The project was announced in August 2022 by AWS and Splunk, building on work contributed by Symantec, a division of Broadcom, alongside well-known industry players including Cloudflare, CrowdStrike, IBM, Okta, Palo Alto Networks, Rapid7, Salesforce, Securonix, Sumo Logic, Tanium, Trend Micro, and Zscaler. The community has kept up a steady cadence of releases since then, and it has grown quickly: AWS said in August 2024 that OCSF had expanded from a 17-company initiative into a community with more than 200 participating organizations and 800 contributors, a figure that reached 900 when OCSF joined the Linux Foundation in November 2024.

OCSF is showing up across the industry

In the observability and security space, OCSF is everywhere. AWS Security Lake converts natively supported AWS logs and events into OCSF and stores them in Parquet. AWS AppFabric can output OCSF-normalized audit data. AWS Security Hub findings use OCSF, and AWS publishes an extension for cloud-specific resource details. Splunk can translate incoming data into OCSF with its Edge Processor and Ingest Processor. Cribl supports seamlessly converting streaming data into OCSF and compatible formats. Palo Alto Networks can forward Strata Logging Service data into Amazon Security Lake in OCSF. CrowdStrike positions itself on both sides of the OCSF pipe, with Falcon data translated into OCSF for Security Lake and Falcon Next-Gen SIEM positioned to ingest and parse OCSF-formatted data. OCSF is one of those rare standards that has crossed the chasm from abstract specification to standard operational plumbing across the industry.

AI is giving the OCSF story fresh urgency

When enterprises deploy AI infrastructure, large language models (LLMs) sit at the core, surrounded by complex distributed systems such as model gateways, agent runtimes, vector stores, tool calls, retrieval systems, and policy engines. These components generate new forms of telemetry, much of which spans product boundaries. Security teams across the SOC are increasingly focused on capturing and analyzing this data. The central question often becomes what an agentic AI system actually did, rather than only the text it produced, and whether its actions led to any security breaches.

That puts more pressure on the underlying data model. An AI assistant that calls the wrong tool, retrieves the wrong data, or chains together a risky sequence of actions creates a security event that needs to be understood across systems. A shared security schema becomes more valuable in that world, especially when AI is also being used on the analytics side to correlate more data, faster.

For OCSF, 2025 was all about AI

Imagine a company uses an AI assistant to help employees look up internal documents and trigger tools like ticketing systems or code repositories. One day, the assistant starts pulling the wrong files, calling tools it should not use, and exposing sensitive information in its responses. Updates in OCSF versions 1.5.0, 1.6.0, and 1.7.0 help security teams piece together what happened by flagging unusual behavior, showing who had access to the connected systems, and tracing the assistant’s tool calls step by step. Instead of only seeing the final answer the AI gave, the team can investigate the full chain of actions that led to the problem.

What’s on the horizon

Imagine a company uses an AI customer support bot, and one day the bot begins giving long, detailed answers that include internal troubleshooting guidance meant only for staff. With the kinds of changes being developed for OCSF 1.8.0, the security team could see which model handled the exchange, which provider supplied it, what role each message played, and how the token counts changed across the conversation. A sudden spike in prompt or completion tokens could signal that the bot was fed an unusually large hidden prompt, pulled in too much background data from a vector database, or generated an overly long response that increased the chance of sensitive information leaking. That gives investigators a practical clue about where the interaction went off course, instead of leaving them with only the final answer.

Why this matters to the broader market

The bigger story is that OCSF has moved quickly from a community effort to a real standard that security products use every day. Over the past two years, it has gained stronger governance, frequent releases, and practical support across data lakes, ingest pipelines, SIEM workflows, and partner ecosystems. In a world where AI expands the security landscape through scams, abuse, and new attack paths, security teams rely on OCSF to connect data from many systems without losing context along the way.
Nikhil Mungel has been building distributed systems and AI teams at SaaS companies for more than 15 years.
Anthropic cuts off the ability to use Claude subscriptions with OpenClaw and third-party AI agents
Are you a subscriber to Anthropic’s Claude Pro ($20 monthly) or Max ($100-$200 monthly) plans, and do you use its Claude AI models and products to power third-party AI agents like OpenClaw? If so, you’re in for an unpleasant surprise. Anthropic announced a few hours ago that starting tomorrow, Saturday, April 4, 2026, at 12 pm PT/3 pm ET, it will no longer be possible for those Claude subscribers to use their subscriptions to hook Anthropic’s Claude models up to third-party agentic tools, citing the strain such usage was placing on Anthropic’s compute and engineering resources and a desire to serve a large number of users reliably.

“We’ve been working hard to meet the increase in demand for Claude, and our subscriptions weren’t built for the usage patterns of these third-party tools,” wrote Boris Cherny, Head of Claude Code at Anthropic, in a post on X. “Capacity is a resource we manage thoughtfully and we are prioritizing our customers using our products and API.”

The company also reportedly sent an email to this effect to some subscribers. However, it’s not certain whether subscribers to Claude Team and Enterprise plans will be similarly affected. We’ve reached out to Anthropic for further clarification and will update when we hear back.

To be clear, it will still be possible to use Claude models like Opus, Sonnet, and Haiku to power OpenClaw and similar external agents, but users will now need to opt into a pay-as-you-go “extra usage” billing system or use Anthropic’s application programming interface (API), which charges for every token of usage rather than allowing open-ended usage up to certain limits, as the Pro and Max plans have allowed so far.

The reason for the change: ‘third party services are not optimized’

The technical reality, according to Anthropic, is that its first-party tools like Claude Code, its AI vibe coding harness, and Claude Cowork, its business app interfacing and control tool, are built to maximize “prompt cache hit rates”—reusing previously processed text to save on compute. Third-party harnesses like OpenClaw often bypass these efficiencies.

“Third party services are not optimized in this way, so it’s really hard for us to do sustainably,” Cherny explained further on X. He even revealed his own hands-on attempts to bridge the gap: “I did put up a few PRs to improve prompt cache hit rate for OpenClaw in particular, which should help for folks using it with Claude via API/overages.”
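Why cache hit rates matter so much comes down to simple arithmetic. The sketch below models the cost difference between a harness that keeps its prompt prefix stable (so most input tokens are cache reads) and one that rewrites it every turn. The prices are illustrative placeholders rather than Anthropic’s actual rates, and the tenfold cache-read discount is an assumption for the sake of the example.

```python
# Illustrative cost model for prompt caching. Prices are placeholders,
# not Anthropic's actual rates; the 10x cache-read discount is an
# assumption for the sake of the example.
BASE_INPUT = 3.00      # $ per million input tokens (uncached)
CACHED_INPUT = 0.30    # $ per million input tokens (cache read)

def turn_cost(prompt_tokens: int, cache_hit_rate: float) -> float:
    """Cost of one agent turn's input, split by cache hits vs. misses."""
    cached = prompt_tokens * cache_hit_rate
    uncached = prompt_tokens - cached
    return (cached * CACHED_INPUT + uncached * BASE_INPUT) / 1_000_000

# An always-on agent re-sending a 150K-token context every turn,
# 1,000 turns per day:
turns, context = 1_000, 150_000
for hit_rate in (0.9, 0.1):   # cache-friendly harness vs. cache-busting one
    daily = turns * turn_cost(context, hit_rate)
    print(f"hit rate {hit_rate:.0%}: ~${daily:,.0f}/day")
# A stable prefix comes out roughly 5x cheaper here, which is the kind
# of margin gap Anthropic says third-party harnesses create.
```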
Prior to the news, Anthropic had also begun imposing stricter Claude session limits every five hours of usage during business hours (5am-11am PT/8am-2pm ET), meaning the number of tokens subscribers could send during those sessions dropped. This frustrated some power users, who suddenly began reaching their limits far faster than they had previously — a change Anthropic said was meant to help “manage growing demand for Claude” and would only affect up to 7% of users at any given time.

Discounts and credits to soften the blow

Anthropic is not banning third-party tools entirely, but it is moving them to a different ledger. The new “Extra Usage” bundles represent a middle ground between a flat-rate subscription and a full enterprise API account.

The credit: To “soften the blow,” Anthropic is offering existing subscribers a one-time credit equal to their monthly plan price, redeemable until April 17.

The discount: Users who pre-purchase “extra usage” bundles can receive up to a 30% discount, an attempt to retain power users who might otherwise churn.

Capacity management: Anthropic’s official statement noted that these tools put an “outsized strain” on systems, forcing a prioritization of “customers using our core products and API.”

‘The all-you-can-eat buffet just closed’

The response from the developer community has been a mixture of analytical acceptance and sharp frustration. Growth marketer Aakash Gupta observed on X that the “all-you-can-eat buffet just closed,” noting that a single OpenClaw agent running for one day could burn $1,000 to $5,000 in API costs. “Anthropic was eating that difference on every user who routed through a third-party harness,” Gupta wrote. “That’s the pace of a company watching its margin evaporate in real time.”

However, Peter Steinberger, the creator of OpenClaw who was recently hired by OpenAI, took a more skeptical view of the “capacity” argument. “Funny how timings match up,” Steinberger posted on X. “First they copy some popular features into their closed harness, then they lock out open source.” Indeed, Anthropic recently added some of the same capabilities that helped OpenClaw catch on — such as the ability to message agents through external services like Discord and Telegram — to Claude Code. Steinberger claimed that he and fellow investor Dave Morin attempted to “talk sense” into Anthropic, but were only able to delay the enforcement by a single week.

User @ashen_one, founder of Telaga Charity, voiced a concern likely shared by other small-scale builders: “If I switch both [OpenClaw instances] to an API key or the extra usage you’re recommending here, it’s going to be far too expensive to make it worth using. I’ll probably have to switch over to a different model at this point.”

“I know it sucks,” Cherny replied. “Fundamentally engineering is about tradeoffs, and one of the things we do to serve a lot of customers is optimize the way subscriptions work to serve as many people as possible with the best model.”

Licensing and the OpenAI shadow

The timing of the crackdown is particularly notable given the talent migration. When Steinberger joined OpenAI in February 2026, he brought the “OpenClaw” ethos with him. OpenAI appears to be positioning itself as a more “harness-friendly” alternative, potentially using this moment as a customer acquisition channel for disgruntled Claude power users.

By restricting subscription limits to their own “closed harness,” Anthropic is asserting control over the UI/UX layer. This allows them to collect telemetry and manage rate limits more granularly, but it risks alienating the power-user community that built the “agentic” ecosystem in the first place.

The Bottom Line

Anthropic’s decision is a cold calculation of margins versus growth. As Cherny noted, “Capacity is a resource we manage thoughtfully.” In the 2026 AI landscape, the era of subsidized, unlimited compute for third-party automation is over. For the average user on Claude.ai, the experience remains unchanged; for the power users running autonomous offices, the bell has tolled.
Karpathy shares ‘LLM Knowledge Base’ architecture that bypasses RAG with an evolving markdown library maintained by AI
AI vibe coders have yet another reason to thank Andrej Karpathy, the coiner of the term. The former Director of AI at Tesla and co-founder of OpenAI, now running his own independent AI project, recently posted on X describing an “LLM Knowledge Bases” approach he’s using to manage various topics of research interest. By building a persistent, LLM-maintained record of his projects, Karpathy is solving the core frustration of “stateless” AI development: the dreaded context-limit reset.

As anyone who has vibe coded can attest, hitting a usage limit or ending a session often feels like a lobotomy for your project. You’re forced to spend valuable tokens (and time) reconstructing context for the AI, hoping it “remembers” the architectural nuances you just established. Karpathy proposes something simpler, messier, and more loosely elegant than the typical enterprise solution of a vector database and RAG pipeline. Instead, he outlines a system where the LLM itself acts as a full-time “research librarian”—actively compiling, linting, and interlinking Markdown (.md) files, the most LLM-friendly and compact data format.

By diverting a significant portion of his “token throughput” into the manipulation of structured knowledge rather than boilerplate code, Karpathy has surfaced a blueprint for the next phase of the “Second Brain”—one that is self-healing, auditable, and entirely human-readable.

Beyond RAG

For the past three years, the dominant paradigm for giving LLMs access to proprietary data has been retrieval-augmented generation (RAG). In a standard RAG setup, documents are chopped into arbitrary “chunks,” converted into mathematical vectors (embeddings), and stored in a specialized database. When a user asks a question, the system performs a “similarity search” to find the most relevant chunks and feeds them into the LLM.

Karpathy’s approach, which he calls LLM Knowledge Bases, rejects the complexity of vector databases for mid-sized datasets. Instead, it relies on the LLM’s increasing ability to reason over structured text. The system architecture, as visualized by X user @himanshu as part of the wider reaction to Karpathy’s post, functions in three distinct stages:

Data ingest: Raw materials—research papers, GitHub repositories, datasets, and web articles—are dumped into a raw/ directory. Karpathy utilizes the Obsidian Web Clipper to convert web content into Markdown (.md) files, ensuring even images are stored locally so the LLM can reference them via vision capabilities.

The compilation step: This is the core innovation. Instead of just indexing the files, the LLM “compiles” them. It reads the raw data and writes a structured wiki. This includes generating summaries, identifying key concepts, authoring encyclopedia-style articles, and—crucially—creating backlinks between related ideas. (A minimal sketch of this step appears after this list.)

Active maintenance (linting): The system isn’t static. Karpathy describes running “health checks” or “linting” passes in which the LLM scans the wiki for inconsistencies, missing data, or new connections. As community member Charly Wargnier observed, “It acts as a living AI knowledge base that actually heals itself.”

By treating Markdown files as the “source of truth,” Karpathy avoids the “black box” problem of vector embeddings. Every claim made by the AI can be traced back to a specific .md file that a human can read, edit, or delete.
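What might that compile step look like in practice? Here is a minimal, hypothetical sketch in Python. Karpathy describes his own setup only as a “hacky collection of scripts,” so the file layout and the call_llm helper below are assumptions for illustration, not his actual code.

```python
# Hypothetical sketch of the "compile" stage: raw/*.md in, wiki/*.md out.
# call_llm() stands in for any chat-completion API; it is not a real
# library function, and the prompt and layout are illustrative guesses.
from pathlib import Path

RAW, WIKI = Path("raw"), Path("wiki")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your model provider here")

def compile_note(source: Path) -> None:
    existing = sorted(p.stem for p in WIKI.glob("*.md"))
    prompt = (
        "You maintain a personal research wiki.\n"
        f"Existing articles: {', '.join(existing) or '(none)'}\n"
        "Rewrite the raw note below as an encyclopedia-style article.\n"
        "Add [[backlinks]] to existing articles where concepts overlap,\n"
        "and start with a one-paragraph summary.\n\n"
        + source.read_text()
    )
    WIKI.mkdir(exist_ok=True)
    (WIKI / source.name).write_text(call_llm(prompt))

if __name__ == "__main__":
    for note in RAW.glob("*.md"):
        compile_note(note)   # one wiki article per raw capture
```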
Implications for the enterprise

While Karpathy’s setup is currently described as a “hacky collection of scripts,” the implications for the enterprise are immediate. As entrepreneur Vamshi Reddy (@tammireddy) noted in response to the announcement: “Every business has a raw/ directory. Nobody’s ever compiled it. That’s the product.”

Karpathy agreed, suggesting that this methodology represents an “incredible new product” category. Most companies currently “drown” in unstructured data—Slack logs, internal wikis, and PDF reports that no one has the time to synthesize. A “Karpathy-style” enterprise layer wouldn’t just search these documents; it would actively author a “Company Bible” that updates in real time.

As AI educator and newsletter author Ole Lehmann put it on X: “i think whoever packages this for normal people is sitting on something massive. one app that syncs with the tools you already use, your bookmarks, your read-later app, your podcast app, your saved threads.”

Eugen Alpeza, co-founder and CEO of AI enterprise agent builder and orchestration startup Edra, noted in an X post: “The jump from personal research wiki to enterprise operations is where it gets brutal. Thousands of employees, millions of records, tribal knowledge that contradicts itself across teams. Indeed, there is room for a new product and we’re building it in the enterprise.”

As the community explores the “Karpathy Pattern,” the focus is already shifting from personal research to multi-agent orchestration. A recent architectural breakdown by @jumperz, founder of AI agent creation platform Secondmate, illustrates this evolution through a “Swarm Knowledge Base” that scales the wiki workflow to a 10-agent system managed via OpenClaw. The core challenge of a multi-agent swarm—where one hallucination can compound and “infect” the collective memory—is addressed here by a dedicated “Quality Gate.” Using the Hermes model (trained by Nous Research for structured evaluation) as an independent supervisor, every draft article is scored and validated before being promoted to the “live” wiki. This system creates a “Compound Loop”: agents dump raw outputs, the compiler organizes them, Hermes validates the truth, and verified briefings are fed back to agents at the start of each session. This ensures that the swarm never “wakes up blank,” but instead begins every task with a filtered, high-integrity briefing of everything the collective has learned.
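The quality-gate idea is simple enough to sketch. The snippet below is a hypothetical rendering of that promotion flow; the judge_article scoring call and the 0-to-1 threshold are assumptions for illustration, not @jumperz’s published design.

```python
# Hypothetical quality gate for a swarm knowledge base: a draft is only
# promoted to the live wiki if an independent judge model scores it
# above a threshold. judge_article() is a stand-in, not a real API.
from pathlib import Path

DRAFTS, LIVE, THRESHOLD = Path("drafts"), Path("wiki"), 0.8

def judge_article(text: str) -> float:
    """Return a 0.0-1.0 factuality/consistency score from a judge model."""
    raise NotImplementedError("call your evaluation model here")

def promote_or_reject(draft: Path) -> bool:
    score = judge_article(draft.read_text())
    if score >= THRESHOLD:
        LIVE.mkdir(exist_ok=True)
        (LIVE / draft.name).write_text(draft.read_text())
        return True          # verified article joins the shared memory
    draft.rename(draft.with_suffix(".rejected.md"))
    return False             # hallucinations never reach other agents
```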
Scaling and performance

A common critique of non-vector approaches is scalability. However, Karpathy notes that at a scale of ~100 articles and ~400,000 words, the LLM’s ability to navigate via summaries and index files is more than sufficient. For a departmental wiki or a personal research project, the “fancy RAG” infrastructure often introduces more latency and “retrieval noise” than it solves.

Tech podcaster Lex Fridman (@lexfridman) confirmed he uses a similar setup, adding a layer of dynamic visualization: “I often have it generate dynamic html (with js) that allows me to sort/filter data and to tinker with visualizations interactively. Another useful thing is I have the system generate a temporary focused mini-knowledge-base… that I then load into an LLM for voice-mode interaction on a long 7-10 mile run.”

This “ephemeral wiki” concept suggests a future where users don’t just “chat” with an AI; they spawn a team of agents to build a custom research environment for a specific task, which then dissolves once the report is written.

Licensing and the ‘file-over-app’ philosophy

Technically, Karpathy’s methodology is built on an open standard (Markdown) but viewed through a proprietary-but-extensible lens (the note-taking and file-organization app Obsidian).

Markdown (.md): By choosing Markdown, Karpathy ensures his knowledge base is not locked into a specific vendor. It is future-proof; if Obsidian disappears, the files remain readable by any text editor.

Obsidian: While Obsidian is a proprietary application, its “local-first” philosophy and EULA (which allows free personal use and requires a license for commercial use) align with the developer’s desire for data sovereignty.

The “vibe-coded” tools: The search engines and CLI tools Karpathy mentions are custom scripts—likely Python-based—that bridge the gap between the LLM and the local file system.

This “file-over-app” philosophy is a direct challenge to SaaS-heavy models like Notion or Google Docs. In the Karpathy model, the user owns the data, and the AI is merely a highly sophisticated editor that “visits” the files to perform work.

Librarian vs. search engine

The AI community has reacted with a mix of technical validation and “vibe-coding” enthusiasm. The debate centers on whether the industry has over-indexed on vector DBs for problems that are fundamentally about structure, not just similarity. Jason Paul Michaels (@SpaceWelder314), a welder using Claude, echoed the sentiment that simpler tools are often more robust: “No vector database. No embeddings… Just markdown, FTS5, and grep… Every bug fix… gets indexed. The knowledge compounds.”
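For readers unfamiliar with FTS5, it is SQLite’s built-in full-text search engine, and it illustrates how far you can get without embeddings. A minimal sketch, assuming a folder of Markdown notes:

```python
# Minimal full-text search over Markdown notes using SQLite's FTS5
# extension (bundled with most Python builds): no vectors, no
# embeddings, just an inverted index you can query with keywords.
import sqlite3
from pathlib import Path

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE notes USING fts5(path, body)")

for md in Path("wiki").glob("*.md"):          # index every article
    db.execute("INSERT INTO notes VALUES (?, ?)", (str(md), md.read_text()))

# Rank matches with BM25; snippet() shows the matching context.
rows = db.execute(
    "SELECT path, snippet(notes, 1, '[', ']', '…', 8) "
    "FROM notes WHERE notes MATCH ? ORDER BY rank",
    ("expert routing",),
).fetchall()
for path, snip in rows:
    print(path, snip)
```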
However, the most significant praise came from Steph Ango (@kepano), co-creator of Obsidian, who highlighted a concept called “contamination mitigation.” He suggested that users should keep their personal “vault” clean and let the agents play in a “messy vault,” only bringing over the useful artifacts once the agent-facing workflow has distilled them.

Which solution is right for your enterprise vibe coding projects?

Feature | Vector DB / RAG | Karpathy’s Markdown Wiki
Data format | Opaque vectors (math) | Human-readable Markdown
Logic | Semantic similarity (nearest neighbor) | Explicit connections (backlinks/indices)
Auditability | Low (black box) | High (direct traceability)
Compounding | Static (requires re-indexing) | Active (self-healing through linting)
Ideal scale | Millions of documents | 100 – 10,000 high-signal documents

The “Vector DB” approach is like a massive, unorganized warehouse with a very fast forklift driver: you can find anything, but you don’t know why it’s there or how it relates to the pallet next to it. Karpathy’s “Markdown Wiki” is like a curated library with a head librarian who is constantly writing new books to explain the old ones.

The next phase

Karpathy’s final exploration points toward the ultimate destination of this data: synthetic data generation and fine-tuning. As the wiki grows and the data becomes more “pure” through continuous LLM linting, it becomes the perfect training set. Instead of the LLM just reading the wiki in its context window, the user can eventually fine-tune a smaller, more efficient model on the wiki itself. This would allow the LLM to “know” the researcher’s personal knowledge base in its own weights, essentially turning a personal research project into a custom, private intelligence.

Bottom line: Karpathy hasn’t just shared a script; he’s shared a philosophy. By treating the LLM as an active agent that maintains its own memory, he has bypassed the limitations of “one-shot” AI interactions. For the individual researcher, it means the end of the “forgotten bookmark.” For the enterprise, it means the transition from a “raw/ data lake” to a “compiled knowledge asset.” As Karpathy himself summarized: “You rarely ever write or edit the wiki manually; it’s the domain of the LLM.” We are entering the era of the autonomous archive.
Nvidia launches enterprise AI agent platform with Adobe, Salesforce, SAP among 17 adopters at GTC 2026
Jensen Huang walked onto the GTC stage Monday wearing his trademark leather jacket and carrying, as it turned out, the blueprints for a new kind of industry dominance. The Nvidia CEO unveiled the Agent Toolkit, an open-source platform for building autonomous AI agents, and then rattled off the names of the companies that will use it: Adobe, Salesforce, SAP, ServiceNow, Siemens, CrowdStrike, Atlassian, Cadence, Synopsys, IQVIA, Palantir, Box, Cohesity, Dassault Systèmes, Red Hat, Cisco and Amdocs. Seventeen enterprise software companies, touching virtually every industry and every Fortune 500 corporation, all agreeing to build their next generation of AI products on a shared foundation that Nvidia designed, Nvidia optimizes and Nvidia maintains.

The toolkit provides the models, the runtime, the security framework and the optimization libraries that AI agents need to operate autonomously inside organizations — resolving customer service tickets, designing semiconductors, managing clinical trials, orchestrating marketing campaigns. Each component is open source. Each is optimized for Nvidia hardware. The combination means that as AI agents proliferate across the corporate world, they will generate demand for Nvidia GPUs not because companies choose to buy them but because the software they depend on was engineered to require them.

“The enterprise software industry will evolve into specialized agentic platforms,” Huang told the crowd, “and the IT industry is on the brink of its next great expansion.” What he left unsaid is that Nvidia has just positioned itself as the tollbooth at the entrance to that expansion — open to all, owned by one.

Inside Nvidia’s Agent Toolkit: the software stack designed to power every corporate AI worker

To grasp the significance of Monday’s announcements, it helps to understand the problem Nvidia is solving. Building an enterprise AI agent today is an exercise in frustration. A company that wants to deploy an autonomous system — one that can, say, monitor a telecommunications network and proactively resolve customer issues before anyone calls to complain — must assemble a language model, a retrieval system, a security layer, an orchestration framework and a runtime environment, typically from different vendors whose products were never designed to work together.

Nvidia’s Agent Toolkit collapses that complexity into a unified platform. It includes Nemotron, a family of open models optimized for agentic reasoning; AI-Q, an open blueprint that lets agents perceive, reason and act on enterprise knowledge; OpenShell, an open-source runtime enforcing policy-based security, network and privacy guardrails; and cuOpt, an optimization skill library. Developers can use the toolkit to create specialized AI agents that act autonomously while using and building other software to complete tasks.

The AI-Q component addresses a pain point that has dogged enterprise AI adoption: cost. Its hybrid architecture routes complex orchestration tasks to frontier models while delegating research tasks to Nemotron’s open models, which Nvidia says can cut query costs by more than 50 percent while maintaining top-tier accuracy.
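A hybrid router of this kind is conceptually simple. The sketch below is a generic illustration of the pattern (send only the hard orchestration steps to the expensive frontier model), not AI-Q’s actual implementation; the model names, prices, and the complete helper are placeholders.

```python
# Generic cost-aware model router, illustrating the hybrid pattern the
# AI-Q blueprint describes. Not Nvidia's implementation: model names,
# prices, and the complete() helper are placeholders.
FRONTIER = ("frontier-model", 15.00)   # $ per million output tokens
OPEN = ("nemotron-open", 0.60)

def complete(model: str, prompt: str) -> str:
    raise NotImplementedError("call your inference provider here")

def route(task_type: str, prompt: str) -> str:
    # Complex multi-step orchestration goes to the frontier model;
    # bounded research/extraction work goes to the cheaper open model.
    model, _ = FRONTIER if task_type == "orchestration" else OPEN
    return complete(model, prompt)

# If ~80% of an agent's calls are research-type, blended cost per
# million output tokens falls from $15.00 to 0.2*15 + 0.8*0.6 = $3.48,
# a cut well beyond the 50% figure the article cites.
```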
Nvidia used the AI-Q Blueprint to build what it claims is the top-ranking AI agent on both the DeepResearch Bench and DeepResearch Bench II leaderboards — benchmarks that, if they hold under independent validation, position the toolkit as not merely convenient but competitively necessary.

OpenShell tackles what has been the single biggest obstacle in every boardroom conversation about letting AI agents loose inside corporate systems: trust. The runtime creates isolated sandboxes that enforce strict policies around data access, network reach and privacy boundaries. Nvidia is collaborating with Cisco, CrowdStrike, Google, Microsoft Security and TrendAI to integrate OpenShell with their existing security tools — a calculated move that enlists the cybersecurity industry as a validation layer for Nvidia’s approach rather than a competing one.
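Nvidia has not published OpenShell’s policy format in the material announced Monday, but the kind of sandbox policy described would plausibly look something like the sketch below. Every identifier here is invented for illustration.

```python
# Entirely hypothetical sketch of a policy-based agent sandbox in the
# spirit of what OpenShell is described as doing. Field names and the
# enforcement logic are invented for illustration.
POLICY = {
    "data_access": {"allow_paths": ["/srv/tickets", "/srv/kb"]},
    "network": {"allow_hosts": ["api.internal.example.com"]},
    "privacy": {"redact_fields": ["ssn", "credit_card"]},
}

def check_action(action: dict) -> bool:
    """Gate one proposed agent action against the sandbox policy."""
    kind, target = action["kind"], action["target"]
    if kind == "read_file":
        return any(target.startswith(p) for p in POLICY["data_access"]["allow_paths"])
    if kind == "http_request":
        return target in POLICY["network"]["allow_hosts"]
    return False  # default-deny anything the policy doesn't name

assert check_action({"kind": "read_file", "target": "/srv/kb/faq.md"})
assert not check_action({"kind": "http_request", "target": "evil.example.net"})
```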
The partner list that reads like the Fortune 500: who signed on and what they’re building

The breadth of Monday’s enterprise adoption announcements reveals Nvidia’s ambitions more clearly than any specification sheet could.

Adobe, in a simultaneously announced strategic partnership, will adopt Agent Toolkit software as the foundation for running hybrid, long-running creativity, productivity and marketing agents. Shantanu Narayen, Adobe’s chair and CEO, said the companies will bring together “our Firefly models, CUDA libraries into our applications, 3D digital twins for marketing, and Agent Toolkit and Nemotron to our agentic frameworks to deliver high-quality, controllable and enterprise-grade AI workflows of the future.” The partnership extends deep: Adobe will explore OpenShell and Nemotron as foundations for personalized, secure agentic loops, and will evaluate the toolkit for large-scale workflows powered by Adobe Experience Platform. Nvidia will provide engineering expertise, early access to software and targeted go-to-market support.

Salesforce’s integration may be the one enterprise IT leaders parse most carefully. The company is working with Nvidia Agent Toolkit software, including Nemotron models, to let customers build, customize and deploy AI agents using Agentforce for service, sales and marketing. The collaboration introduces a reference architecture in which employees can use Slack as the primary conversational interface and orchestration layer for Agentforce agents — powered by Nvidia infrastructure — that participate directly in business workflows and pull from data stores in both on-premises and cloud environments. For the millions of knowledge workers who already conduct their professional lives inside Slack, this turns a messaging app into the command center for corporate AI.

SAP, whose software underpins the financial and operational plumbing of most Global 2000 companies, is using open Agent Toolkit software, including NeMo, to enable AI agents through Joule Studio on SAP Business Technology Platform, letting customers and partners design agents tailored to their own business needs. ServiceNow’s Autonomous Workforce of AI Specialists leverages Agent Toolkit software, the AI-Q Blueprint and a combination of closed and open models, including Nemotron and ServiceNow’s own Apriel models — a hybrid approach that suggests the toolkit is designed not to replace existing AI investments but to become the connective tissue between them.

From chip design to clinical trials: how agentic AI is reshaping specialized industries

The partner list extends well beyond horizontal software platforms into deeply specialized verticals where autonomous agents could compress timelines measured in years.

In semiconductor design — where a single advanced chip can cost billions of dollars and take half a decade to develop — three of the four major electronic design automation companies are building agents on Nvidia’s stack. Cadence will leverage Agent Toolkit and Nemotron with its ChipStack AI SuperAgent for semiconductor design and verification. Siemens is launching its Fuse EDA AI Agent, which uses Nemotron to autonomously orchestrate workflows across its entire electronic design automation portfolio, from design conception through manufacturing sign-off. Synopsys is building a multi-agent framework powered by its AgentEngineer technology using Nemotron and the NeMo Agent Toolkit.

Healthcare and life sciences present perhaps the most consequential use case. IQVIA is integrating Nemotron and other Agent Toolkit software with IQVIA.ai, a unified agentic AI platform designed to help life sciences organizations work more efficiently across clinical, commercial and real-world operations. The scale is already significant: IQVIA has deployed more than 150 agents across internal teams and client environments, including 19 of the top 20 pharmaceutical companies.

The security sector is embedding itself into the architecture from the ground floor. CrowdStrike unveiled a Secure-by-Design AI Blueprint that embeds its Falcon platform protection directly into Nvidia AI agent architectures — including agents built on AI-Q and OpenShell — and is advancing agentic managed detection and response using Nemotron reasoning models. Cisco AI Defense will provide AI security protection for OpenShell, adding controls and guardrails to govern agent actions. These are not aftermarket bolt-ons; they are foundational integrations that signal the security industry views Nvidia’s agent platform as the substrate it needs to protect.

Dassault Systèmes is exploring Agent Toolkit software and Nemotron for its role-based AI agents, called Virtual Companions, on its 3DEXPERIENCE agentic platform. Atlassian is working with the toolkit as it evolves its Rovo AI agentic strategy for Jira and Confluence. Box is using it to enable enterprise agents to securely execute long-running business processes. Palantir is developing AI agents on Nemotron that run on its sovereign AI Operating System Reference Architecture.

The open-source gambit: why giving software away is Nvidia’s most aggressive business move

There is something almost paradoxical about a company with a multi-trillion-dollar market capitalization giving away its most strategically important software. But Nvidia’s open-source approach to Agent Toolkit is less an act of generosity than a carefully constructed competitive moat. OpenShell is open source. Nemotron models are open. AI-Q blueprints are publicly available.
LangChain, the agent engineering company whose open-source frameworks have been downloaded over 1 billion times, is working with Nvidia to integrate Agent Toolkit components into the LangChain deep agent library for developing advanced, accurate enterprise AI agents at scale. When the most popular independent framework for building AI agents absorbs your toolkit, you have transcended the category of vendor and entered the category of infrastructure.

But openness in AI has a way of being strategically selective. The models are open, but they are optimized for Nvidia’s CUDA libraries — the proprietary software layer that has locked developers into Nvidia GPUs for two decades. The runtime is open, but it integrates most deeply with Nvidia’s security partners. The blueprints are open, but they perform best on Nvidia hardware. Developers can explore Agent Toolkit and OpenShell on build.nvidia.com today, running on inference providers and Nvidia Cloud Partners including Baseten, CoreWeave, DeepInfra, DigitalOcean and others — all of which run Nvidia GPUs.

The strategy has a historical analog in Google’s approach to Android: give away the operating system to ensure that the entire mobile ecosystem generates demand for your core services. Nvidia is giving away the agent operating system to ensure that the entire enterprise AI ecosystem generates demand for its core product — the GPU. Every Salesforce agent running Nemotron, every SAP workflow orchestrated through OpenShell, every Adobe creative pipeline accelerated by CUDA creates another strand of dependency on Nvidia silicon.

This also explains the Nemotron Coalition announced Monday — a global collaboration of model builders including Mistral AI, Cursor, LangChain, Perplexity, Reflection AI, Sarvam and Thinking Machines Lab, all working to advance open frontier models. The coalition’s first project will be a base model codeveloped by Mistral AI and Nvidia, trained on Nvidia DGX Cloud, that will underpin the upcoming Nemotron 4 family. By seeding the open model ecosystem with Nvidia-optimized foundations, the company ensures that even models it does not build will run best on its hardware.

What could go wrong: the risks enterprise buyers should weigh before going all-in

For all the ambition on display Monday, several realities temper the narrative. Adoption announcements are not deployment announcements. Many of the partner disclosures use carefully hedged language — “exploring,” “evaluating,” “working with” — that is standard in embargoed press releases but should not be confused with production systems serving millions of users. Adobe’s own forward-looking statements note that “due to the non-binding nature of the agreement, there are no assurances that Adobe will successfully negotiate and execute definitive documentation with Nvidia on favorable terms or at all.” The gap between a GTC keynote demonstration and an enterprise-grade rollout remains substantial.

Nvidia is not the only company chasing this market. Microsoft, with its Copilot ecosystem and Azure AI infrastructure, pursues a parallel strategy with the advantage of owning the operating systems and productivity software that most enterprises already use. Google, through Gemini and its cloud platform, has its own agent vision. Amazon, via Bedrock and AWS, is building comparable primitives.
The question is not whether enterprise AI agents will be built on some platform but whether the market will consolidate around one stack or fragment across several.

The security claims, while architecturally sound, remain unproven at scale. OpenShell’s policy-based guardrails are a promising design pattern, but autonomous agents operating in complex enterprise environments will inevitably encounter edge cases that no policy framework has anticipated. CrowdStrike’s Secure-by-Design AI Blueprint and Cisco AI Defense’s OpenShell integration are exactly the kind of layered defense enterprise buyers will demand — but both are newly unveiled, not battle-hardened through years of adversarial testing. Deploying agents that can autonomously access data, execute code and interact with production systems introduces a threat surface that the industry has barely begun to map.

And there is the question of whether enterprises are ready for agents at all. The technology may be available, but organizational readiness — the governance structures, the change management, the regulatory frameworks, the human trust — often lags years behind what the platforms can deliver.

Beyond agents: the full scope of what Nvidia announced at GTC 2026

Monday’s Agent Toolkit announcement did not arrive in isolation. It landed amid an avalanche of product launches that, taken together, describe a company remaking itself at every layer of the computing stack.

Nvidia unveiled the Vera Rubin platform — seven new chips in full production, including the Vera CPU purpose-built for agentic AI, the Rubin GPU, and the newly integrated Groq 3 LPU inference accelerator — designed to power every phase of AI from pretraining to real-time agentic inference. The Vera Rubin NVL72 rack integrates 72 Rubin GPUs and 36 Vera CPUs, delivering what Nvidia claims is up to 10x higher inference throughput per watt at one-tenth the cost per token compared with the Blackwell platform. Dynamo 1.0, an open-source inference operating system that Nvidia describes as the “operating system for AI factories,” entered production with adoption from AWS, Microsoft Azure, Google Cloud and Oracle Cloud Infrastructure alongside companies like Cursor, Perplexity, PayPal and Pinterest.

The BlueField-4 STX storage architecture promises up to 5x token throughput for the long-context reasoning that agents demand, with early adopters including CoreWeave, Crusoe, Lambda, Mistral AI and Nebius. BYD, Geely, Isuzu and Nissan announced Level 4 autonomous vehicle programs on Nvidia’s DRIVE Hyperion platform, and Uber disclosed plans to launch Nvidia-powered robotaxis across 28 cities and four continents by 2028, beginning with Los Angeles and San Francisco in the first half of 2027.

Roche, the pharmaceutical giant, announced it is deploying more than 3,500 Nvidia Blackwell GPUs across hybrid cloud and on-premises environments in the U.S. and Europe — what it calls the largest announced GPU footprint available to a pharmaceutical company. Nvidia also launched physical AI tools for healthcare robotics, with CMR Surgical, Johnson & Johnson MedTech and others adopting the platform, and released Open-H, the world’s largest healthcare robotics dataset with over 700 hours of surgical video.
And Nvidia even announced a Space Module based on the Vera Rubin architecture, promising to bring data-center-class AI to orbital environments.

The real meaning of GTC 2026: Nvidia is no longer selling picks and shovels

Strip away the product specifications and benchmark claims and what emerges from GTC 2026 is a single, clarifying thesis: Nvidia believes the era of AI agents will be larger than the era of AI models, and it intends to own the platform layer of that transition the way it already owns the hardware layer of the current one.

The 17 enterprise software companies that signed on Monday are making a bet of their own. They are wagering that building on Nvidia’s agent infrastructure will let them move faster than building alone — and that the benefits of a shared platform outweigh the risks of shared dependency. For Salesforce, it means Agentforce agents that can draw from both cloud and on-premises data through a single Slack interface. For Adobe, it means creative AI pipelines that span image, video, 3D and document intelligence. For SAP, it means agents woven into the transactional fabric of global commerce. Each partnership is rational on its own terms. Together, they form something larger: an industry-wide endorsement of Nvidia as the default substrate for enterprise intelligence.

Huang, who opened his career designing graphics chips for video games, closed his keynote by gesturing toward a future in which AI agents do not just assist human workers but operate as autonomous colleagues — reasoning through problems, building their own tools, learning from their mistakes. He compared the moment to the birth of the personal computer, the dawn of the internet, the rise of mobile computing.

Technology executives have a professional obligation to describe every product cycle as a revolution. But here is what made Monday different: this time, 17 of the world’s most important software companies showed up to agree with him. Whether they did so out of conviction or out of a calculated fear of being left behind may be the most important question in enterprise technology — and it is one that only the next few years can answer.
Arcee’s new, open source Trinity-Large-Thinking is the rare, powerful U.S.-made AI model that enterprises can download and customize
The baton of open-source AI models has been passed between several companies in the years since ChatGPT debuted in late 2022, from Meta with its Llama family to Chinese labs like Qwen and z.ai. But lately, Chinese companies have started pivoting back toward proprietary models even as some U.S. labs like Cursor and Nvidia release their own variants of the Chinese models, leaving a question mark over who will originate this branch of technology going forward.

One answer: Arcee, a San Francisco-based lab, which this week released Trinity-Large-Thinking, a 399-billion-parameter text-only reasoning model published under the uncompromisingly open Apache 2.0 license, allowing full customizability and commercial usage by anyone from indie developers to large enterprises. The release represents more than just a new set of weights on the AI code-sharing community Hugging Face; it is a strategic bet that “American Open Weights” can provide a sovereign alternative to the increasingly closed or restricted frontier models of 2025. This move arrives precisely as enterprises express growing discomfort with relying on Chinese-based architectures for critical infrastructure, creating demand for a domestic champion that Arcee intends to fill.

As Clément Delangue, co-founder and CEO of Hugging Face, told VentureBeat in a direct message on X: “The strength of the US has always been its startups so maybe they’re the ones we should count on to lead in open-source AI. Arcee shows that it’s possible!”

Genesis of a 30-person frontier lab

To understand the weight of the Trinity release, one must understand the lab that built it. Based in San Francisco, Arcee AI is a lean team of only 30 people. While competitors like OpenAI and Google operate with thousands of engineers and multibillion-dollar compute budgets, Arcee has defined itself through what CTO Lucas Atkins calls “engineering through constraint.”

The company first made waves in 2024 after securing a $24 million Series A led by Emergence Capital, bringing its total capital to just under $50 million. In early 2026, the team took a massive risk: they committed $20 million—nearly half their total funding—to a single 33-day training run for Trinity Large. Utilizing a cluster of 2048 NVIDIA B300 Blackwell GPUs, which provided twice the speed of the previous Hopper generation, Arcee bet the company’s future on the belief that developers needed a frontier model they could truly own. This bet-the-company move was a masterclass in capital efficiency, proving that a small, focused team could stand up a full pipeline and stabilize training without endless reserves.

Engineering through extreme architectural constraint

Trinity-Large-Thinking is noteworthy for the extreme sparsity of its attention mechanism. While the model houses roughly 400 billion total parameters, its Mixture-of-Experts architecture means that only about 3 percent, or 13 billion parameters, are active for any given token. This allows the model to possess the deep knowledge of a massive system while maintaining the inference speed and operational efficiency of a much smaller one—performing roughly 2 to 3 times faster than its peers on the same hardware. Training such a sparse model presented significant stability challenges. To prevent a few experts from becoming “winners” while others remained untrained “dead weight,” Arcee developed SMEBU, or Soft-clamped Momentum Expert Bias Updates. This mechanism ensures that experts are specialized and routed evenly across a general web corpus.
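Arcee has not published SMEBU’s exact update rule, but the family it belongs to, bias-based expert load balancing, is easy to sketch. The snippet below is a generic illustration of that idea; the momentum and soft-clamp details are invented for the example and are not Arcee’s actual algorithm.

```python
# Generic sketch of momentum-based expert-bias load balancing, the
# family SMEBU belongs to. The exact update rule, constants, and soft
# clamp here are illustrative inventions, not Arcee's algorithm.
import numpy as np

n_experts, lr, momentum, clamp = 128, 0.01, 0.9, 1.0
bias = np.zeros(n_experts)      # added to router logits, not gradients
velocity = np.zeros(n_experts)

def update_bias(tokens_per_expert: np.ndarray) -> None:
    """Nudge under-used experts up and over-used experts down."""
    global bias, velocity
    load = tokens_per_expert / tokens_per_expert.sum()
    error = load.mean() - load          # positive => expert is starved
    velocity = momentum * velocity + lr * error
    bias = clamp * np.tanh((bias + velocity) / clamp)  # soft clamp

# Routing then picks top-k experts from (router_logits + bias), so no
# expert collapses into permanent "dead weight."
```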
The architecture also incorporates a hybrid approach, alternating local sliding-window and global attention layers in a 3:1 ratio to maintain performance in long-context scenarios.

The data curriculum and synthetic reasoning

Arcee’s partnership with fellow startup DatologyAI provided a curriculum of over 10 trillion curated tokens. However, the training corpus for the full-scale model was expanded to 20 trillion tokens, split evenly between curated web data and high-quality synthetic data. Unlike typical imitation-based synthetic data, where a smaller model simply learns to mimic a larger one, DatologyAI used techniques to synthetically rewrite raw web text—such as Wikipedia articles or blogs—to condense the information. This process helped the model learn to reason over concepts and information rather than merely memorizing exact token strings. Tremendous effort was also invested in excluding copyrighted books and materials with unclear licensing, a compliance posture that attracts enterprise customers wary of the intellectual property risks associated with mainstream LLMs. This data-first approach allowed the model to scale cleanly while significantly improving performance on complex tasks like mathematics and multi-step agent tool use.

The pivot from yappy chatbots to reasoning agents

The defining feature of this official release is the transition from a standard “instruct” model to a “reasoning” model. By implementing a “thinking” phase prior to generating a response—similar to the internal loops found in the earlier Trinity-Mini—Arcee has addressed the primary criticism of its January “Preview” release. Early users of the Preview model had noted that it sometimes struggled with multi-step instructions in complex environments and could be “underwhelming” for agentic tasks.

The “Thinking” update effectively bridges this gap, enabling what Arcee calls “long-horizon agents” that can maintain coherence across multi-turn tool calls without getting “sloppy.” This reasoning process enables better context coherence and cleaner instruction following under constraint. It has direct implications for Maestro Reasoning, a 32B-parameter derivative of Trinity already being used in audit-focused industries to provide transparent “thought-to-answer” traces. The goal was to move beyond “yappy” or inefficient chatbots toward reliable, cheap, high-quality agents that stay stable across long-running loops.

Geopolitics and the case for American open weights

The significance of Arcee’s Apache 2.0 commitment is amplified by the retreat of its primary competitors from the open-weight frontier. Throughout 2025, Chinese research labs like Alibaba’s Qwen and z.ai (aka Zhipu) set the pace for high-efficiency MoE architectures. However, as we enter 2026, those labs have begun to shift toward proprietary enterprise platforms and specialized subscriptions, signaling a move away from pure community growth. The fragmentation of these once-prolific teams, such as the departure of key technical leads from Alibaba’s Qwen lab, has left a void at the high end of the open-weight market. In the United States, the movement has faced its own crisis.
Meta’s Llama division notably retreated from the frontier landscape following the mixed reception of Llama 4 in April 2025, which faced reports of quality issues and benchmark manipulation. For developers who relied on the Llama 3 era of dominance, the lack of a current 400B+ open model created an urgent need for an alternative that Arcee has risen to fill.

Benchmarks and how Arcee’s Trinity-Large-Thinking stacks up to other U.S. frontier open source AI model offerings

Trinity-Large-Thinking’s performance on agent-specific evaluations establishes it as a legitimate frontier contender. On PinchBench, a critical metric for evaluating model capability on autonomous agentic tasks, Trinity achieved a score of 91.9, placing it just behind the proprietary market leader, Claude Opus 4.6 (93.3). This competitiveness is mirrored in IFBench, where Trinity’s score of 52.3 sits in a near-dead heat with Opus 4.6’s 53.1, indicating that the reasoning-first “Thinking” update has successfully addressed the instruction-following hurdles that challenged the model’s earlier preview phase.

The model’s broader technical reasoning capabilities also place it at the high end of the current open-source market. It recorded a 96.3 on AIME25, matching the high-tier Kimi-K2.5 and outstripping other major competitors like GLM-5 (93.3) and MiniMax-M2.7 (80.0). While high-end coding benchmarks like SWE-bench Verified still show a lead for top-tier closed-source models—with Trinity scoring 63.2 against Opus 4.6’s 75.6—the massive delta in cost-per-token positions Trinity as the more viable sovereign infrastructure layer for enterprises looking to deploy these capabilities at production scale.

When it comes to other U.S. open source frontier model offerings, OpenAI’s gpt-oss tops out at 120 billion parameters, but Google’s Gemma (Gemma 4 was just released this week) and IBM’s Granite family are also worth a mention, despite lower benchmarks. Nvidia’s Nemotron family is also notable, but consists of fine-tuned and post-trained Qwen variants.

Benchmark | Arcee Trinity-Large | gpt-oss-120B (High) | IBM Granite 4.0 | Google Gemma 4
GPQA-D | 76.3% | 80.1% | 74.8% | 84.3%
Tau2-Airline | 88.0% | 65.8%* | 68.3% | 76.9%
PinchBench | 91.9% | 69.0% (IFBench) | 89.1% | 93.3%
AIME25 | 96.3% | 97.9% | 88.5% | 89.2%
MMLU-Pro | 83.4% | 90.0% (MMLU) | 81.2% | 85.2%

So how is an enterprise supposed to choose between all these?

Arcee Trinity-Large-Thinking is the premier choice for organizations building autonomous agents; its sparse 400B architecture excels at “thinking” through multi-step logic, complex math, and long-horizon tool use. By activating only a fraction of its parameters, it provides a high-speed reasoning engine for developers who need GPT-4o-level planning capabilities within a cost-effective, open-source framework.

Conversely, gpt-oss-120B serves as the optimal middle ground for enterprises that require high reasoning performance but prioritize lower operational costs and deployment flexibility. Because it activates only 5.1B parameters per forward pass, it is uniquely suited for technical workloads like competitive code generation and advanced mathematical modeling that must run on limited hardware, such as a single H100 GPU. Its configurable reasoning effort—offering “Low,” “Medium,” and “High” modes—makes it the best fit for production environments where latency and accuracy must be balanced dynamically across different tasks.

For broader, high-throughput applications, Google Gemma 4 and IBM Granite 4.0 serve as the primary backbones.
Gemma 4 offers the highest “intelligence density” for general knowledge and scientific accuracy, making it the most versatile option for R&D and high-speed chat interfaces. Meanwhile, IBM Granite 4.0 is engineered for the “all-day” enterprise workload, utilizing a hybrid architecture that eliminates context bottlenecks for massive document processing. For businesses concerned with legal compliance and hardware efficiency, Granite remains the most reliable foundation for large-scale RAG and document analysis.

Ownership as a feature for regulated industries

In this climate, Arcee’s choice of the Apache 2.0 license is a deliberate act of differentiation. Unlike the restrictive community licenses used by some competitors, Apache 2.0 allows enterprises to truly own their intelligence stack without the “black box” biases of a general-purpose chat model. “Developers and Enterprises need models they can inspect, post-train, host, distill, and own,” Lucas Atkins noted in the launch announcement. This ownership is critical for the “bitter lesson” of training small models: you usually need to train a massive frontier model first to generate the high-quality synthetic data and logits required to build efficient student models.

Furthermore, Arcee has released Trinity-Large-TrueBase, a raw 10-trillion-token checkpoint. TrueBase offers a rare, “unspoiled” look at foundational intelligence before instruction tuning and reinforcement learning are applied. For researchers in highly regulated industries like finance and defense, TrueBase allows for authentic audits and custom alignments starting from a clean slate.

Community verdict and the future of distillation

The response from the developer community has been largely positive, reflecting the desire for more open-weight, U.S.-made models. On X, researchers highlighted the disruption, noting that the “insanely cheap” prices for a model of this size would be a boon for the agentic community. On the open AI model inference website OpenRouter, Trinity-Large-Preview established itself as the #1 most used open model in the U.S., serving over 80.6 billion tokens on peak days like March 1, 2026. The proximity of Trinity-Large-Thinking to Claude Opus 4.6 on PinchBench—at 91.9 versus 93.3—is particularly striking when compared to the cost. At $0.90 per million output tokens, Trinity is approximately 96% cheaper than Opus 4.6, which costs $25 per million output tokens.

Arcee’s strategy is now focused on bringing these pretraining and post-training lessons back down the stack. Much of the work that went into Trinity Large will now flow into the Mini and Nano models, refreshing the company’s compact line with the distillation of frontier-level reasoning. As global labs pivot toward proprietary lock-in, Arcee has positioned Trinity as a sovereign infrastructure layer that developers can finally control and adapt for long-horizon agentic workflows.
Google releases Gemma 4 under Apache 2.0 — and that license change may matter more than benchmarks
For the past two years, enterprises evaluating open-weight models have faced an awkward trade-off. Google’s Gemma line consistently delivered strong performance, but its custom license — with usage restrictions and terms Google could update at will — pushed many teams toward Mistral or Alibaba’s Qwen instead. Legal review added friction. Compliance teams flagged edge cases. And capable as Gemma 3 was, “open” with asterisks isn’t the same as open.

Gemma 4 eliminates that friction entirely. Google DeepMind’s newest open model family ships under a standard Apache 2.0 license — the same permissive terms used by Qwen, Mistral, Arcee, and most of the open-weight ecosystem. No custom clauses, no “Harmful Use” carve-outs that required legal interpretation, no restrictions on redistribution or commercial deployment. For enterprise teams that had been waiting for Google to play on the same licensing terms as the rest of the field, the wait is over.

The timing is notable. As some Chinese AI labs (most notably Alibaba’s latest Qwen models, Qwen3.5 Omni and Qwen 3.6 Plus) have begun pulling back from fully open releases for their latest models, Google is moving in the opposite direction — opening up its most capable Gemma release yet while explicitly stating the architecture draws from its commercial Gemini 3 research.

Four models, two tiers: Edge to workstation in a single family

Gemma 4 arrives as four distinct models organized into two deployment tiers. The “workstation” tier includes a 31B-parameter dense model and a 26B A4B Mixture-of-Experts model — both supporting text and image input with 256K-token context windows. The “edge” tier consists of the E2B and E4B, compact models designed for phones, embedded devices, and laptops, supporting text, image, and audio with 128K-token context windows.

The naming convention takes some unpacking. The “E” prefix denotes “effective parameters” — the E2B has 2.3 billion effective parameters but 5.1 billion total, because each decoder layer carries its own small embedding table through a technique Google calls Per-Layer Embeddings (PLE). These tables are large on disk but cheap to compute, which is why the model runs like a 2B while technically weighing more. The “A” in 26B A4B stands for “active parameters” — only 3.8 billion of the MoE model’s 25.2 billion total parameters activate during inference, meaning it delivers roughly 26B-class intelligence with compute costs comparable to a 4B model.

For IT leaders sizing GPU requirements, this translates directly to deployment flexibility. The MoE model can run on consumer-grade GPUs and should appear quickly in tools like Ollama and LM Studio. The 31B dense model requires more headroom — think an NVIDIA H100 or RTX 6000 Pro for unquantized inference — but Google is also shipping Quantization-Aware Training (QAT) checkpoints to maintain quality at lower precision. On Google Cloud, both workstation models can now run in a fully serverless configuration via Cloud Run with NVIDIA RTX Pro 6000 GPUs, spinning down to zero when idle.

The MoE bet: 128 small experts to save on inference costs

The architectural choices inside the 26B A4B model deserve particular attention from teams evaluating inference economics. Rather than following the pattern of recent large MoE models that use a handful of big experts, Google went with 128 small experts, activating eight per token plus one shared always-on expert.
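As a rough illustration of what that routing means mechanically, here is a toy sketch of top-8-of-128 expert selection with an always-on shared expert. Dimensions and weights are arbitrary, and real implementations fuse this heavily for speed.

```python
# Toy illustration of Gemma 4-style MoE routing: 128 small experts,
# top-8 selected per token, plus one always-on shared expert.
# Dimensions are arbitrary; real kernels are fused and batched.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 128, 8

router_w = rng.standard_normal((d_model, n_experts)) * 0.02
experts = rng.standard_normal((n_experts, d_model, d_model)) * 0.02
shared = rng.standard_normal((d_model, d_model)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w                       # score all 128 experts
    top = np.argsort(logits)[-top_k:]           # keep the best 8
    weights = np.exp(logits[top])
    weights /= weights.sum()                    # softmax over the top-8
    out = x @ shared                            # shared expert always fires
    for w, e in zip(weights, top):
        out += w * (x @ experts[e])             # only 8/128 experts run
    return out

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)                   # (64,)
```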
The result is a model that benchmarks competitively with dense models in the 27B–31B range while running at roughly the speed of a 4B model during inference.

This is not just a benchmark curiosity — it directly affects serving costs. A model that delivers 27B-class reasoning at 4B-class throughput means fewer GPUs, lower latency, and cheaper per-token inference in production. For organizations running coding assistants, document processing pipelines, or multi-turn agentic workflows, the MoE variant may be the most practical choice in the family.

Both workstation models use a hybrid attention mechanism that interleaves local sliding window attention with full global attention, with the final layer always global. This design enables the 256K context window while keeping memory consumption manageable — an important consideration for teams processing long documents, codebases, or multi-turn agent conversations.

Native multimodality: Vision, audio, and function calling baked in from scratch

Previous generations of open models typically treated multimodality as an add-on. Vision encoders were bolted onto text backbones. Audio required an external ASR pipeline like Whisper. Function calling relied on prompt engineering and hoping the model cooperated. Gemma 4 integrates all of these capabilities at the architecture level.

All four models handle variable aspect-ratio image input with configurable visual token budgets — a meaningful improvement over Gemma 3n’s older vision encoder, which struggled with OCR and document understanding. The new encoder supports budgets from 70 to 1,120 tokens per image, letting developers trade off detail against compute depending on the task. Lower budgets work for classification and captioning; higher budgets handle OCR, document parsing, and fine-grained visual analysis. Multi-image and video input (processed as frame sequences) are supported natively, enabling visual reasoning across multiple documents or screenshots.

The two edge models add native audio processing — automatic speech recognition and speech-to-translated-text, all on-device. The audio encoder has been compressed to 305 million parameters, down from 681 million in Gemma 3n, while the frame duration dropped from 160ms to 40ms for more responsive transcription. For teams building voice-first applications that need to keep data local — think healthcare, field service, or multilingual customer interaction — running ASR, translation, reasoning, and function calling in a single model on a phone or edge device is a genuine architectural simplification.

Function calling is also native across all four models, drawing on research from Google’s FunctionGemma release late last year. Unlike previous approaches that relied on instruction-following to coax models into structured tool use, Gemma 4’s function calling was trained into the model from the ground up — optimized for multi-turn agentic flows with multiple tools. This shows up in agentic benchmarks, but more importantly, it reduces the prompt engineering overhead that enterprise teams typically invest when building tool-using agents.

Benchmarks in context: Where Gemma 4 lands in a crowded field

The benchmark numbers tell a clear story of generational improvement. The 31B dense model scores 89.2% on AIME 2026 (a rigorous mathematical reasoning test), 80.0% on LiveCodeBench v6, and hits a Codeforces Elo of 2,150 — numbers that would have been frontier-class from proprietary models not long ago. On vision, MMMU Pro reaches 76.9% and MATH-Vision hits 85.6%.
For comparison, Gemma 3 27B scored 20.8% on AIME and 29.1% on LiveCodeBench without thinking mode.

The MoE model tracks closely: 88.3% on AIME 2026, 77.1% on LiveCodeBench, and 82.3% on GPQA Diamond — a graduate-level science reasoning benchmark. The performance gap between the MoE and dense variants is modest given the significant inference cost advantage of the MoE architecture.

The edge models punch above their weight class. The E4B hits 42.5% on AIME 2026 and 52.0% on LiveCodeBench — strong for a model that runs on a T4 GPU. The E2B, smaller still, manages 37.5% and 44.0% respectively. Both significantly outperform Gemma 3 27B (without thinking) on most benchmarks despite being a fraction of the size, thanks to the built-in reasoning capability.

These numbers need to be read against an increasingly competitive open-weight landscape. Qwen 3.5, GLM-5, and Kimi K2.5 all compete aggressively in this parameter range, and the field moves fast. What distinguishes Gemma 4 is less any single benchmark and more the combination: strong reasoning, native multimodality across text, vision, and audio, function calling, 256K context, and a genuinely permissive license — all in a single model family with deployment options from edge devices to cloud serverless.

What enterprise teams should watch next

Google is releasing both pre-trained base models and instruction-tuned variants, which matters for organizations planning to fine-tune for specific domains. The Gemma base models have historically been strong foundations for custom training, and the Apache 2.0 license now removes any ambiguity about whether fine-tuned derivatives can be deployed commercially.

The serverless deployment option via Cloud Run with GPU support is worth watching for teams that need inference capacity that scales to zero. Paying only for actual compute during inference — rather than maintaining always-on GPU instances — could meaningfully change the economics of deploying open models in production, particularly for internal tools and lower-traffic applications.

Google has hinted that this may not be the complete Gemma 4 family, with additional model sizes likely to follow. But the combination available today — workstation-class reasoning models and edge-class multimodal models, all under Apache 2.0, all drawing from Gemini 3 research — represents the most complete open model release Google has shipped. For enterprise teams that had been waiting for Google’s open models to compete on licensing terms as well as performance, the evaluation can finally begin without a call to legal first.
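One practical note for those evaluations: the configurable visual token budgets interact directly with the 256K context window. A minimal sketch of the trade-off, where the text-token reservation is an assumption for illustration:

```python
# How many images fit in a 256K context at each visual token budget
# (budget range per the article; the text reservation is assumed).
CONTEXT_WINDOW = 256_000
TEXT_RESERVE = 8_000   # assumed headroom for prompts and output

for budget in (70, 280, 1_120):   # low, mid, and max per-image budgets
    images = (CONTEXT_WINDOW - TEXT_RESERVE) // budget
    print(f"{budget:>5} tokens/image -> ~{images:,} images per request")
```

Low budgets suit bulk classification at thousands of images per request; OCR-grade budgets cut that capacity by more than an order of magnitude.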
In the wake of Claude Code’s source code leak, 5 actions enterprise security leaders should take now
Every enterprise running AI coding agents has just lost a layer of defense. On March 31, Anthropic accidentally shipped a 59.8 MB source map file inside version 2.1.88 of its @anthropic-ai/claude-code npm package, exposing 512,000 lines of unobfuscated TypeScript across 1,906 files. The readable source includes the complete permission model, every bash security validator, 44 unreleased feature flags, and references to upcoming models Anthropic has not announced. Security researcher Chaofan Shou broadcast the discovery on X at approximately 4:23 UTC. Within hours, mirror repositories had spread across GitHub.

Anthropic confirmed the exposure was a human packaging error. No customer data or model weights were involved. But containment has already failed. The Wall Street Journal reported Wednesday morning that Anthropic had filed copyright takedown requests that briefly resulted in the removal of more than 8,000 copies and adaptations from GitHub. However, an Anthropic spokesperson told VentureBeat that the takedown was intended to be more limited: “We issued a DMCA takedown against one repository hosting leaked Claude Code source code and its forks. The repo named in the notice was part of a fork network connected to our own public Claude Code repo, so the takedown reached more repositories than intended. We retracted the notice for everything except the one repo we named, and GitHub has restored access to the affected forks.”

Programmers have already used other AI tools to rewrite Claude Code’s functionality in other programming languages. Those rewrites are themselves going viral.

The timing compounded the damage. Hours before the source map shipped, malicious versions of the axios npm package containing a remote access trojan went live on the same registry. Any team that installed or updated Claude Code via npm between 00:21 and 03:29 UTC on March 31 may have pulled both the exposed source and the unrelated axios malware in the same install window.

A same-day Gartner First Take (subscription required) said the gap between Anthropic’s product capability and operational discipline should force leaders to rethink how they evaluate AI development tool vendors. Claude Code is the most discussed AI coding agent among Gartner’s software engineering clients. This was the second leak in five days. A separate CMS misconfiguration had already exposed nearly 3,000 unpublished internal assets, including draft announcements for an unreleased model called Claude Mythos. Gartner called the cluster of March incidents a systemic signal.

What 512,000 lines reveal about production AI agent architecture

The leaked codebase is not a chat wrapper. It is the agentic harness that wraps Claude’s language model and gives it the ability to use tools, manage files, execute bash commands, and orchestrate multi-agent workflows. The WSJ described the harness as what allows users to control and direct AI models, much like a harness allows a rider to guide a horse. Fortune reported that competitors and legions of startups now have a detailed road map to clone Claude Code’s features without reverse engineering them.

The component inventory is telling. A 46,000-line query engine handles context management through three-layer compression and orchestrates 40-plus tools, each with self-contained schemas and per-tool granular permission checks.
And 2,500 lines of bash security validation run 23 sequential checks on every shell command, covering blocked Zsh builtins, Unicode zero-width space injection, IFS null-byte injection, and a malformed token bypass discovered during a HackerOne review.

Gartner caught a detail most coverage missed. Claude Code is 90% AI-generated, per Anthropic’s own public disclosures. Under current U.S. copyright law, which requires human authorship, the leaked code carries diminished intellectual property protection. The Supreme Court declined to revisit the human authorship standard in March 2026. Every organization shipping AI-generated production code faces this same unresolved IP exposure.

Three attack paths: the readable source makes them cheaper to exploit

The minified bundle already shipped with every string literal extractable. What the readable source eliminates is the research cost. A technical analysis from Jun Zhou of Straiker, an agentic AI security company, mapped three compositions that are now practical, not theoretical, because the implementation is legible.

Context poisoning via the compaction pipeline. Claude Code manages context pressure through a four-stage cascade. MCP tool results are never microcompacted. Read tool results skip budgeting entirely. The autocompact prompt instructs the model to preserve all user messages that are not tool results. A poisoned instruction in a cloned repository’s CLAUDE.md file can survive compaction, get laundered through summarization, and emerge as what the model treats as a genuine user directive. The model is not jailbroken. It is cooperative and follows what it believes are legitimate instructions.

Sandbox bypass through shell parsing differentials. Three separate parsers handle bash commands, each with different edge-case behavior. The source documents a known gap where one parser treats carriage returns as word separators, while bash does not. Alex Kim’s review found that certain validators return early-allow decisions that short-circuit all subsequent checks. The source contains explicit warnings about the past exploitability of this pattern.

The composition. Context poisoning instructs a cooperative model to construct bash commands sitting in the gaps of the security validators. The defender’s mental model assumes an adversarial model and a cooperative user. This attack inverts both. The model is cooperative. The context is weaponized. The outputs look like commands a reasonable developer would approve.

Elia Zaitsev, CrowdStrike’s CTO, told VentureBeat in an exclusive interview at RSAC 2026 that the permission problem exposed in the leak reflects a pattern he sees across every enterprise deploying agents. “Don’t give an agent access to everything just because you’re lazy,” Zaitsev said. “Give it access to only what it needs to get the job done.” He warned that open-ended coding agents are particularly dangerous because their power comes from broad access. “People want to give them access to everything. If you’re building an agentic application in an enterprise, you don’t want to do that. You want a very narrow scope.”

Zaitsev framed the core risk in terms that the leaked source validates. “You may trick an agent into doing something bad, but nothing bad has happened until the agent acts on that,” he said.
That is precisely what the Straiker analysis describes: context poisoning turns the agent cooperative, and the damage happens when it executes bash commands through the gaps in the validator chain.

What the leak exposed and what to audit

The table below maps each exposed layer to the attack path it enables and the audit action it requires. Print it. Take it to Monday’s meeting.

| Exposed layer | What the leak revealed | Attack path enabled | Defender audit action |
| --- | --- | --- | --- |
| 4-stage compaction pipeline | Exact criteria for what survives each stage. MCP tool results are never microcompacted. Read results skip budgeting. | Context poisoning: malicious instructions in CLAUDE.md survive compaction and get laundered into “user directives.” | Audit every CLAUDE.md and .claude/config.json in cloned repos. Treat as executable, not metadata. |
| Bash security validators (2,500 lines, 23 checks) | Full validator chain, early-allow short circuits, three-parser differentials, blocked pattern lists | Sandbox bypass: CR-as-separator gap between parsers. Early-allow in git validators bypasses all downstream checks. | Restrict broad permission rules (Bash(git:*), Bash(echo:*)). Redirect operators chain with allowed commands to overwrite files. |
| MCP server interface contract | Exact tool schemas, permission checks, and integration patterns for all 40+ built-in tools | Malicious MCP servers that match the exact interface. Supply chain attacks are indistinguishable from legitimate servers. | Treat MCP servers as untrusted dependencies. Pin versions. Monitor for changes. Vet before enabling. |
| 44 feature flags (KAIROS, ULTRAPLAN, coordinator mode) | Unreleased autonomous agent mode, 30-min remote planning, multi-agent orchestration, background memory consolidation | Competitors accelerate the development of comparable features. Future attack surface previewed before defenses ship. | Monitor for feature flag activation in production. Inventory where agent permissions expand with each release. |
| Anti-distillation and client attestation | Fake tool injection logic, Zig-level hash attestation (cch=00000), GrowthBook feature flag gating | Workarounds documented. MITM proxy strips anti-distillation fields. Env var disables experimental betas. | Do not rely on vendor DRM for API security. Implement your own API key rotation and usage monitoring. |
| Undercover mode (undercover.ts) | 90-line module strips AI attribution from commits. Force ON possible, force OFF impossible. Dead-code-eliminated in external builds. | AI-authored code enters repos with no attribution. Provenance and audit trail gaps for regulated industries. | Implement commit provenance verification. Require AI disclosure policies for development teams using any coding agent. |

AI-assisted code is already leaking secrets at double the rate

GitGuardian’s State of Secrets Sprawl 2026 report, published March 17, found that Claude Code-assisted commits leaked secrets at a 3.2% rate versus the 1.5% baseline across all public GitHub commits. AI service credential leaks surged 81% year-over-year to 1,275,105 detected exposures. And 24,008 unique secrets were found in MCP configuration files on public GitHub, with 2,117 confirmed as live, valid credentials. GitGuardian noted the elevated rate reflects human workflow failures amplified by AI speed, not a simple tool defect.

The operational pattern Gartner is tracking

Feature velocity compounded the exposure. Anthropic shipped over a dozen Claude Code releases in March, introducing autonomous permission delegation, remote code execution from mobile devices, and AI-scheduled background tasks.
Each capability widened the operational surface. The same month that introduced them produced the leak that exposed their implementation.

Gartner’s recommendation was specific. Require AI coding agent vendors to demonstrate the same operational maturity expected of other critical development infrastructure: published SLAs, public uptime history, and documented incident response policies. Architect provider-independent integration boundaries that would let you change vendors within 30 days. Anthropic has published one postmortem across more than a dozen March incidents. Third-party monitors detected outages 15 to 30 minutes before Anthropic’s own status page acknowledged them.

The company riding this product to a $380 billion valuation and a possible public offering this year, as the WSJ reported, now faces a containment battle that a takedown sweep spanning more than 8,000 repositories has not won.

Merritt Baer, Chief Security Officer at Enkrypt AI, an enterprise AI guardrails company, and a former AWS security leader, told VentureBeat that the IP exposure Gartner flagged extends into territory most teams have not mapped. “The questions many teams aren’t asking yet are about derived IP,” Baer said. “Can model providers retain embeddings or reasoning traces, and are those artifacts considered your intellectual property?” With 90% of Claude Code’s source AI-generated and now public, that question is no longer theoretical for any enterprise shipping AI-written production code.

Zaitsev argued that the identity model itself needs rethinking. “It doesn’t make sense that an agent acting on your behalf would have more privileges than you do,” he told VentureBeat. “You may have 20 agents working on your behalf, but they’re all tied to your privileges and capabilities. We’re not creating 20 new accounts and 20 new services that we need to keep track of.” The leaked source shows Claude Code’s permission system is per-tool and granular. The question is whether enterprises are enforcing the same discipline on their side.

Five actions for security leaders this week

1. Audit CLAUDE.md and .claude/config.json in every cloned repository. Context poisoning through these files is a documented attack path with a readable implementation guide (a minimal audit sketch follows at the end of this article). Check Point Research found that developers inherently trust project configuration files and rarely apply the same scrutiny as application code during reviews.

2. Treat MCP servers as untrusted dependencies. Pin versions, vet before enabling, monitor for changes. The leaked source reveals the exact interface contract.

3. Restrict broad bash permission rules and deploy pre-commit secret scanning. A team generating 100 commits per week at the 3.2% leak rate is statistically exposing roughly three credentials. MCP configuration files are the newest surface that most teams are not scanning.

4. Require SLAs, uptime history, and incident response documentation from your AI coding agent vendor. Architect provider-independent integration boundaries. Gartner’s guidance: 30-day vendor switch capability.

5. Implement commit provenance verification for AI-assisted code. The leaked Undercover Mode module strips AI attribution from commits with no force-off option. Regulated industries need disclosure policies that account for this.

Source map exposure is a well-documented failure class caught by standard commercial security tooling, Gartner noted. Apple and identity verification provider Persona suffered the same failure in the past year. The mechanism was not novel. The target was.
Claude Code alone generates an estimated $2.5 billion in annualized revenue for a company now valued at $380 billion. Its full architectural blueprint is circulating on mirrors that have promised never to come down.
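For teams starting on action 1 above, the audit lends itself to simple tooling. A minimal sketch: the file names come from the leak coverage, while the regex indicators are illustrative assumptions, not a vetted detection list.

```python
"""Sketch: flag agent config files in cloned repos for manual review."""
import re
from pathlib import Path

AGENT_FILES = ("CLAUDE.md", ".claude/config.json")
# Naive indicators of instructions masquerading as user directives;
# examples only, not a complete or vendor-endorsed pattern set.
SUSPICIOUS = [
    re.compile(r"ignore (all|previous) instructions", re.I),
    re.compile(r"do not (mention|tell|log)", re.I),
    re.compile(r"curl |wget |base64", re.I),  # outbound/exfil primitives
]

def audit(repo_root: str) -> list[tuple[Path, str]]:
    findings = []
    for name in AGENT_FILES:
        for path in Path(repo_root).rglob(name):
            text = path.read_text(errors="ignore")
            for pat in SUSPICIOUS:
                if pat.search(text):
                    findings.append((path, pat.pattern))
    return findings

if __name__ == "__main__":
    for path, pattern in audit("."):
        print(f"REVIEW {path}: matched /{pattern}/")
```

The point is not the specific patterns; it is treating these files as executable input that deserves review, not as inert metadata.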
Microsoft launches 3 new AI models in direct shot at OpenAI and Google
Microsoft on Wednesday launched three new foundational AI models it built entirely in-house — a state-of-the-art speech transcription system, a voice generation engine, and an upgraded image creator — marking the most concrete evidence yet that the $3 trillion software giant intends to compete directly with OpenAI, Google, and other frontier labs on model development, not just distribution.

The trio of models — MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 — are available immediately through Microsoft Foundry and a new MAI Playground. They span three of the most commercially valuable modalities in enterprise AI: converting speech to text, generating realistic human voice, and creating images. Together, they represent the opening salvo from Microsoft’s superintelligence team, which Microsoft AI CEO Mustafa Suleyman formed just six months ago to pursue what he calls “AI self-sufficiency.”

“I’m very excited that we’ve now got the first models out, which are the very best in the world for transcription,” Suleyman told VentureBeat in an exclusive interview ahead of the launch. “Not only that, we’re able to deliver the model with half the GPUs of the state-of-the-art competition.”

The announcement lands at a precarious moment for Microsoft. The company’s stock just closed its worst quarter since the 2008 financial crisis, as investors increasingly demand proof that hundreds of billions of dollars in AI infrastructure spending will translate into revenue. These models — priced aggressively and positioned to reduce Microsoft’s own cost of goods sold — are Suleyman’s first answer to that pressure.

Microsoft’s new transcription model claims best-in-class accuracy across 25 languages

MAI-Transcribe-1 is the headline release. The speech-to-text model achieves the lowest average Word Error Rate (WER) on the FLEURS benchmark — the industry-standard multilingual test — across the top 25 languages by Microsoft product usage, at 3.8%. According to Microsoft’s benchmarks, it beats OpenAI’s Whisper-large-v3 on all 25 languages, Google’s Gemini 3.1 Flash on 22 of 25, and ElevenLabs’ Scribe v2 and OpenAI’s GPT-Transcribe on 15 of 25 each.

The model uses a transformer-based text decoder with a bi-directional audio encoder. It accepts MP3, WAV, and FLAC files up to 200MB, and Microsoft says its batch transcription runs 2.5 times faster than the existing Microsoft Azure Fast offering. Diarization, contextual biasing, and streaming are listed as “coming soon.” Microsoft is already testing MAI-Transcribe-1 inside Copilot’s Voice mode and Microsoft Teams for conversation transcription — a detail that underscores how quickly the company intends to replace third-party or older internal models with its own.

Alongside it, MAI-Voice-1 is Microsoft’s text-to-speech model, capable of generating 60 seconds of natural-sounding audio in a single second. The model preserves speaker identity across long-form content and now supports custom voice creation from just a few seconds of audio through Microsoft Foundry. Microsoft is pricing it at $22 per 1 million characters. MAI-Image-2, meanwhile, debuted as a top-three model family on the Arena.ai leaderboard and now delivers at least 2x faster generation times on Foundry and Copilot compared to its predecessor. Microsoft is rolling it out across Bing and PowerPoint, pricing it at $5 per 1 million tokens for text input and $33 per 1 million tokens for image output.
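Those list prices translate into workload costs quickly. A rough sketch using the figures above, where the workload sizes are hypothetical, not Microsoft examples:

```python
# List prices as quoted above; only the workload sizes are assumptions.
VOICE_PER_M_CHARS = 22.00   # MAI-Voice-1, $/1M characters
IMG_TEXT_IN_PER_M = 5.00    # MAI-Image-2, $/1M text input tokens
IMG_OUT_PER_M = 33.00       # MAI-Image-2, $/1M image output tokens

# Hypothetical: narrate a 500,000-character audiobook.
print(f"Audiobook narration: ${500_000 / 1e6 * VOICE_PER_M_CHARS:.2f}")

# Hypothetical: 10,000 images at ~50 prompt tokens and ~1,000 output
# tokens each.
prompt_cost = 10_000 * 50 / 1e6 * IMG_TEXT_IN_PER_M
output_cost = 10_000 * 1_000 / 1e6 * IMG_OUT_PER_M
print(f"Image batch: ${prompt_cost + output_cost:.2f}")
```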
WPP, one of the world’s largest advertising holding companies, is among the first enterprise partners building with MAI-Image-2 at scale.

The contract renegotiation with OpenAI that made Microsoft’s model ambitions possible

To understand why these models matter, you have to understand the contractual tectonic shift that made them possible. Until October 2025, Microsoft was contractually prohibited from independently pursuing artificial general intelligence. The original deal with OpenAI, signed in 2019, gave Microsoft a license to OpenAI’s models in exchange for building the cloud infrastructure OpenAI needed. But when OpenAI sought to expand its compute footprint beyond Microsoft — striking deals with SoftBank and others — Microsoft renegotiated. As Suleyman explained in a December 2025 interview with Bloomberg, the revised agreement meant that “up until a few weeks ago, Microsoft was not allowed — by contract — to pursue artificial general intelligence or superintelligence independently.” The new terms freed Microsoft to build its own frontier models while retaining license rights to everything OpenAI builds through 2032.

Suleyman described the dynamic to VentureBeat in characteristically blunt terms. “Back in September of last year, we renegotiated the contract with OpenAI, and that enabled us to independently pursue our own superintelligence,” he said. “Since then, we’ve been convening the compute and the team and buying up the data that we need.”

He was quick to emphasize that the OpenAI partnership remains intact. “Nothing’s changing with the OpenAI partnership. We will be in partnership with them at least until 2032 and hopefully a lot longer,” Suleyman said. “They have been a phenomenal partner to us.” He also highlighted that Microsoft provides access to Anthropic’s Claude through its Foundry API, framing the company as “a platform of platforms.” But the subtext is unmistakable: Microsoft is building the capability to stand on its own. In March, as Business Insider first reported, Suleyman wrote in an internal memo that his goal is to “focus all my energy on our Superintelligence efforts and be able to deliver world class models for Microsoft over the next 5 years.” CNBC reported that the structural shift freed Suleyman from day-to-day Copilot product responsibilities, with former Snap executive Jacob Andreou taking over as EVP of the combined consumer and commercial Copilot experience.

How teams of fewer than 10 engineers built models that rival Big Tech’s best

Perhaps the most striking detail Suleyman shared with VentureBeat is how small the teams behind these models actually are. “The audio model was built by 10 people, and the vast majority of the speed, efficiency and accuracy gains come from the model architecture and the data that we have used,” Suleyman said. “My philosophy has always been that we need fewer people who are more empowered. So we operate an extremely flat structure.” He added: “Our image team, equally, is less than 10 people. So this is all about model and data innovation, which has delivered state of the art performance.”

This matters for two reasons. First, it challenges the prevailing industry narrative that frontier AI development requires thousands of researchers and billions in headcount costs. Meta, by contrast, has pursued what Suleyman described in his Bloomberg interview as a strategy of “hiring a lot of individuals, rather than maybe creating a team” — including reported compensation packages of $100 million to $200 million for top researchers.
Second, small teams producing state-of-the-art results dramatically improve the economics. If Microsoft can build best-in-class transcription with 10 engineers and half the GPUs of competitors, the margin structure of its AI business looks fundamentally different from companies burning through cash to achieve similar benchmarks.

The lean-team philosophy also echoes Suleyman’s broader views on how AI is already reshaping the work of building AI itself. When asked by VentureBeat how his own team works, Suleyman described an environment that resembles a startup trading floor more than a traditional Microsoft engineering org. “There are groups of people around round tables, circular tables, not traditional desks, on laptops instead of big screens,” he said. “They’re basically vibe coding, side by side all day, morning till night, in rooms of 50 or 60 people.”

Why Suleyman’s “humanist AI” pitch is aimed squarely at enterprise buyers

Suleyman has been steadily building a philosophical brand around Microsoft’s AI efforts that he calls “humanist AI” — a term that appeared prominently in the blog post he authored for the launch and that he elaborated on in our interview. “I think that the motivation of a humanist super intelligence is to create something that is truly in service of humanity,” he told VentureBeat. “Humans will remain in control at the top of the food chain, and they will be always aligned to human interests.”

The framing serves multiple purposes. It differentiates Microsoft from the more acceleration-oriented rhetoric coming from OpenAI and Meta. It resonates with enterprise buyers who need governance, compliance, and safety assurances before deploying AI in regulated industries. And it provides a narrative hedge: if something goes wrong in the broader AI ecosystem, Microsoft can point to its stated commitment to human control. In his December Bloomberg interview, Suleyman went further, describing containment and alignment as “red lines” and arguing that no one should release a superintelligence tool until they are “confident it can be controlled.”

Suleyman also stressed data provenance as a competitive advantage, describing a conversation with CEO Satya Nadella about developing “a clean lineage of models where the data is extremely clean.” He drew an implicit contrast with open-source alternatives, noting that “many of the open-source models have been trained on data in, let’s say, inappropriate ways. And there are potentially security issues with that.” For enterprise customers evaluating AI vendors amid a thicket of copyright lawsuits across the industry, that is a meaningful commercial argument — if Microsoft can credibly claim that its training data was acquired through properly licensed channels, it reduces the legal and reputational risk of deploying these models in production.

Microsoft’s aggressive pricing puts pressure on Amazon, Google, and the AI startup ecosystem

Today’s launch positions Microsoft on three competitive fronts simultaneously. MAI-Transcribe-1 directly targets the transcription workloads that OpenAI’s Whisper models have dominated in the open-source community, with Microsoft claiming superior accuracy on all 25 benchmarked languages. The FLEURS results also show it winning against Google’s Gemini 3.1 Flash Lite on 22 of 25 languages — a direct challenge as Google aggressively pushes Gemini across its own product suite.
And MAI-Voice-1’s ability to clone voices from seconds of audio and generate speech at 60x real-time puts it in competition with ElevenLabs, Resemble AI, and the growing ecosystem of voice AI startups, with Microsoft’s distribution advantage — any Foundry developer can now access these capabilities through the same API they use for GPT-4 and Claude — acting as a powerful moat.

Suleyman framed the competitive position confidently: “We’re now a top three lab just under OpenAI and Gemini,” he told VentureBeat. The pricing strategy — MAI-Voice-1 at $22 per million characters, MAI-Image-2 at $5 per million input tokens — reflects a deliberate decision to compete on cost. “We’re pricing them to be the very best of any hyperscaler. So they will be the cheapest of any of the hyperscalers out there, Amazon. And obviously Google,” Suleyman said. “And that’s a very conscious decision.”

This makes strategic sense for Microsoft, which can amortize model development costs across its enormous installed base of enterprise customers. But it also speaks to the question investors have been asking with increasing urgency: when does AI spending start generating returns? Microsoft’s stock has fallen roughly 17% year-to-date, according to CNBC, part of a broader selloff in software stocks. By building models that run on half the GPUs of competitors, Microsoft reduces its own infrastructure costs for internal products — Teams, Copilot, Bing, PowerPoint — while offering developers pricing designed to undercut the rest of the market. In his March memo, Suleyman wrote that his models would “enable us to deliver the COGS efficiencies necessary to be able to serve AI workloads at the immense scale required in the coming years.” These three models are the first tangible delivery on that promise.

Suleyman says a frontier large language model is coming — and Microsoft plans to be “completely independent”

Suleyman made clear that transcription, voice, and image generation are just the beginning. When asked whether Microsoft would build a large language model to compete directly with GPT at the frontier level, he was unequivocal. “We absolutely are going to be delivering state of the art models across all modalities,” he said. “Our mission is to make sure that if Microsoft ever needs it, we will be able to provide state of the art at the best efficiency, the cheapest price, and be completely independent.”

He described a multi-year roadmap to “set up the GPU clusters at the appropriate scale,” noting that the superintelligence team was formally stood up only in October 2025. Suleyman spoke to VentureBeat from Miami, where the full team was convening for one of its regular week-long in-person sessions. He described Nadella flying in for the gathering to lay out “the roadmap of everything that we need to achieve for our AI self-sufficiency mission over the next 2, 3, 4 years, and all the compute roadmap that that would involve.”

Building a competitive frontier LLM, of course, is a different order of magnitude in complexity, data requirements, and compute cost from what Microsoft demonstrated Wednesday. The models launched today are specialized — they handle audio and images, not the general reasoning and text generation that underpin products like ChatGPT or Copilot’s core intelligence. Suleyman has the organizational mandate, Nadella’s public backing, and the contractual freedom.
What he doesn’t yet have is a track record at Microsoft of delivering on the hardest problem in AI.

But consider what he does have: three models that are best-in-class or near it in their respective domains, built by teams smaller than most seed-stage startups, running on half the industry-standard GPU footprint, and priced below every major cloud competitor. Two years ago, Suleyman proposed in MIT Technology Review what he called the “Modern Turing Test” — not whether AI could fool a human in conversation, but whether it could go out into the world and accomplish real economic tasks with minimal oversight. On Wednesday, his own models took a step toward that vision. The question now is whether Microsoft’s superintelligence team can repeat the trick at the scale that actually matters — and whether they can do it before the market’s patience runs out.
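A footnote on the headline metric: word error rate is word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. A minimal reference implementation (not Microsoft's evaluation code) looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[-1][-1] / len(ref)

print(wer("the quick brown fox", "the quick brown fax"))  # 0.25
```

At 3.8%, roughly one word in 26 is wrong across the 25 FLEURS languages, which is the kind of figure that makes automated meeting transcription viable without heavy human cleanup.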
Intuit’s AI agents hit 85% repeat usage. The secret was keeping humans involved
When Intuit shipped AI agents to 3 million customers, 85% came back. The reason, according to the executive who leads the business: combining AI with human expertise turned out to matter more than anyone expected — not less.

Marianna Tessel, the financial software company’s EVP and GM, calls this AI-HI combination a “massive ask” from its customers, noting that it provides another level of confidence and trust. “One of the things we learned that has been fascinating is really the combination of human intelligence and artificial intelligence,” Tessel said in a new VB Beyond the Pilot podcast. “Sometimes it’s the combination of AI and HI that gives you better results.”

Chatbots alone aren’t the answer

Intuit — the parent company of QuickBooks, TurboTax, MailChimp and other widely-used financial products — was one of the first major enterprises to go all in on generative AI with its GenOS platform last June (long before fears of the “SaaSpocalypse” had SaaS companies scrambling to rethink their strategies). Quickly, though, the company recognized that chatbots alone weren’t the answer in enterprise environments, and pivoted to what it now calls Intuit Intelligence. The dashboard-like platform features specialized AI agents for sales, tax, payroll, accounting and project management that users can interact with using natural language to gain insights on their data, automate tasks, and generate reports. Customers report that invoices are being paid in full 90% of the time and five days faster, and that manual work has been reduced by 30%. AI agents help close books, categorize transactions, run payroll, automate invoice reminders and surface discrepancies.

For instance, one Intuit customer uncovered fraud after interacting with AI agents and asking questions about amounts that didn’t add up. “In the beginning it was like, ‘Is that an error?’ And as he dug in, he discovered very significant fraud,” Tessel said.

Why humans are still in the loop

Still, Intuit operates on the principle that humans are “always accessible,” Tessel said. Platforms are built in a way that users can ask questions of a human expert when they’re not getting what they need from the AI agent, or want a human to bounce ideas off of. “I’m not talking about product experts,” Tessel said. “I’m talking about an actual accounting expert or tax expert or payroll expert.”

The platform has also been built to suggest human involvement in “high stakes” decision-making scenarios. AI goes to a certain level, then human experts review and categorize the rest. This provides a level of confidence, according to Tessel. “We actually believe it becomes more needed and more powerful at the right moments,” she said. “The expert still provides things that are unique.”

The next step is giving customers the tools to perform next-gen tasks like vibe coding — but with simple architectures to reduce the burden for customers. “What we’re testing is this idea of, you can actually do coding without realizing that that’s what you are doing,” Tessel said. For example, a merchant running a flower shop wants to ensure that they have the right amount of inventory in stock for Mother’s Day. They can vibe code an agent that analyzes previous years’ sales and creates purchase orders where stock is low. That agent could then be instructed to automatically perform that task for future Mother’s Days and other big holidays. Some users will be more sophisticated and want the ability to dive deeper into the technology. “But some just want to express what they want to have happen,” Tessel said.
“Because all they want to do is run their business.”

Listen to the full podcast to hear about:

- Why first-party data can create a “moat” for SaaS companies.
- Why showing AI’s logic matters more than a polished interface.
- Why 600,000 data points per customer changes what AI can tell you about your business.

You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.
The end of ‘shadow AI’ at enterprises? Kilo launches KiloClaw for Organizations to enable secure AI agents at scale
As generative AI matures from a novelty into a workplace staple, a new friction point has emerged: the “shadow AI” or “Bring Your Own AI” (BYOAI) crisis. Much like the unsanctioned use of personal devices in years past, developers and knowledge workers are increasingly deploying autonomous agents on personal infrastructure to manage their professional workflows.

“Our journey with Kilo Claw has been to make it easier and easier and more accessible to folks,” says Kilo co-founder Scott Breitenother. Today, the company dedicated to providing a portable, multi-model, cloud-based AI coding environment is moving to formalize this “shadow AI” layer: it’s launching KiloClaw for Organizations and KiloClaw Chat, a suite of tools designed to provide enterprise-grade governance over personal AI agents.

The announcement comes at a period of high velocity for the company. Since making KiloClaw, its securely hosted, one-click OpenClaw product for individuals, generally available last month, more than 25,000 users have integrated the platform into their daily workflows. Simultaneously, Kilo’s proprietary agent benchmark, PinchBench, has logged over 250,000 interactions and recently gained significant industry validation when it was referenced by Nvidia CEO Jensen Huang during his keynote at the 2026 Nvidia GTC conference in San Jose, California.

The shadow AI crisis: Addressing the BYOAI problem

The impetus for KiloClaw for Organizations stems from a growing visibility gap within large enterprises. In a recent interview with VentureBeat, Kilo leadership detailed conversations with high-level AI directors at government contractors who found their developers running OpenClaw agents on random VPS instances to manage calendars and monitor repositories. “What we’re announcing on Tuesday is Kilo Claw for organizations, where a company can buy an organization-level package of Kilo Claws and give every team member access,” explained Kilo co-founder and head of product and engineering Emilie Schario during the interview.

“We can’t see any of it,” the head of AI at one such firm reportedly told Kilo. “No audit logs. No credential management. No idea what data is touching what API.” This lack of oversight has led some organizations to issue blanket bans on autonomous agents before a clear strategy on deployment could be formed.

Anand Kashyap, CEO and founder of data security firm Fortanix, told VentureBeat without seeing Kilo’s announcement that while “Openclaw has taken the technology world by storm… the enterprise usage is minimal due to the security concerns of the open source version.” Kashyap expanded on this trend: “In recent times, NVIDIA (with NemoClaw), Cisco (DefenseClaw), Palo Alto Networks, and Crowdstrike have all announced offerings to create an enterprise-ready version of OpenClaw with guardrails and governance for agent security. However, enterprise adoption continues to be low.

Enterprises like centralized IT control, predictable behavior, and data security which keeps them compliant. An autonomous agentic platform like OpenClaw stretches the envelope on all these parameters, and while security majors have announced their traditional perimeter security measures, they don’t address the fundamental problems of having a reduced attack surface.
Over time, we will see an agentic platform emerge where agents are pre-built and packaged, and deployed responsibly with centralized controls, and data access controls built into the agentic platform as well as the LLMs they call upon to get instructions on how to perform the next task. Technologies like Confidential Computing provide compartmentalization of data and processing, and are tremendously helpful in reducing the attack surface.”

KiloClaw for Organizations is positioned as the way for the security team to say “yes,” providing the visibility and control required to bring these agents in-house. It transitions agents from developer-managed infrastructure into a managed environment characterized by scoped access and organizational-level controls.

Technology: Universal persistence and the “Swiss cheese” method

A core technical hurdle in the current agent landscape is the fragmentation of chat sessions. During the VentureBeat interview, Schario noted that even advanced tools often struggle with canonical sessions, frequently dropping messages or failing to sync across devices. Schario emphasized the security layer that supports this new structure: “You get all the same benefits of the Kilo gateway and the Kilo platform: you can limit what models people can use, get usage visibility, cost controls, and all the advantages of leveraging Kilo with managed, hosted, controlled Kilo Claw.”

To address the inherent unreliability of autonomous agents — such as missed cron jobs or failed executions — Kilo employs what Schario calls the “Swiss cheese method” of reliability. By layering additional protections and deterministic guardrails on top of the base OpenClaw architecture, Kilo aims to ensure that tasks, such as a daily 6:00 PM summary, are completed even if the underlying agent logic falters. This is critical because, as Schario noted, “The real risk for any company is data leakage, and that can come from a bot commenting on a GitHub issue or accidentally emailing the person who’s going to get fired before they get fired.”

Product: KiloClaw Chat and organizational guardrails

While managed infrastructure solves the backend problem, KiloClaw Chat addresses the user experience. Schario noted that “Hosted, managed OpenClaw is easier to get started with, but it’s not enough, and it still requires you to be at the edge of technology to understand how to set it up.” Kilo is looking to lower that barrier for the average worker, asking: “How do we give people who have never heard the phrase OpenClaw or Claudebot an always-on AI assistant?”

Traditionally, interacting with an OpenClaw agent required connecting to third-party messaging services like Telegram or Discord — a process that involves navigating “BotFather” tokens and technical configurations that alienate non-engineers. “One of the number one hurdles we see, both anecdotally and in the data, is that you get your bot running and then you have to connect a channel to it. If you don’t know what’s going on, it’s overwhelming,” Schario observed.

“We solved that problem. You don’t need to set up a channel. You can chat with Kilo in the web UI and, with the Kilo Claw app on your phone, interact with Kilo without setting up an external channel,” she continued. This native approach is essential for corporate compliance because, as she further explained, “When we were talking to early enterprise opportunities, they don’t want you using your personal Telegram account to chat with your work bot.”
As Schario put it, there is a reason enterprise communication doesn’t flow through personal DMs; when a company shuts off access, it must be able to shut off access to the bot.

Looking ahead, the company plans to integrate these environments further. “What we’re going to do is make Kilo Chat the waypoint between Telegram, Discord, and OpenClaw, so you get all the convenience of Kilo Chat but can use it in the other channels,” Breitenother added.

The enterprise package includes several critical governance features:

- Identity Management: SSO/OIDC integration and SCIM provisioning for automated user lifecycles.
- Centralized Billing: Full visibility into compute and inference usage across the entire organization.
- Admin Controls: Org-wide policies regarding which models can be used, specific permissions, and session durations.
- Secrets Configuration: Integration with 1Password ensures that agents never handle credentials in plain text, preventing accidental leaks.

Licensing and governance: The “bot account” model

Other security experts note that handling bot and AI agent permissions is among the most pressing problems enterprises face today. As Ev Kontsevoy, CEO and co-founder of AI infrastructure and identity management company Teleport, told VentureBeat without seeing the Kilo news: “The potential impact of OpenClaw as a non-deterministic actor demonstrates why identity can’t be an afterthought. You have an autonomous agent with shell access, browser control, and API credentials — running on a persistent loop, across dozens of messaging platforms, with the ability to write its own skills. That’s not a chatbot. That’s a non-deterministic actor with broad infrastructure access and no cryptographic identity, no short-lived credentials, and no real-time audit trail tying actions to a verifiable actor.”

Kilo proposes to solve this with a major change in organizational structure: the adoption of employee “bot accounts”. In Kilo’s vision, every employee eventually carries two identities—their standard human account and a corresponding bot account, such as scott.bot@kiloco.ai. These bot identities operate with strictly limited, read-only permissions. For example, a bot might be granted read-only access to company logs or a GitHub account with contributor-only rights. This “scoped” approach allows the agent to maintain full visibility of the data it needs to be helpful while ensuring it cannot accidentally share sensitive information with others.

Addressing concerns over data privacy and “black box” algorithms, Kilo emphasizes that its code is source available. “Anyone can go look at our code. It’s not a black box. When you’re buying Kilo Claw, you’re not giving us your data, and we’re not training on any of your data because we’re not building our own model,” Schario clarified.

This licensing choice allows organizations to audit the resiliency and security of the platform without fearing their proprietary data will be used to improve third-party models.

Pricing and availability

KiloClaw for Organizations follows a usage-based pricing model where companies pay only for the compute and inference consumed. Organizations can utilize a “Bring Your Own Key” (BYOK) approach or use Kilo Gateway credits for inference.

The service is available starting today, Wednesday, April 1. KiloClaw Chat is currently in beta, with support for web, desktop, and iOS sessions.
New users can evaluate the platform via a free tier that includes seven days of compute.

As Breitenother summarized to VentureBeat, the goal is to shift from “one-off” deployments to a scalable model for the entire workforce: “I think of Kilo for orgs as buying Kilo Claw by the bushel instead of by the one-off. And we’re hoping to sell a lot of bushels of Kilo Claw.”
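To make the bot-account model concrete, here is a minimal deny-by-default sketch; the bot address echoes Kilo's own example, while the scope names and API are hypothetical, not Kilo's actual interface:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BotIdentity:
    """A bot identity paired with, but narrower than, its human owner."""
    human_owner: str               # e.g. "scott@kiloco.ai" (hypothetical)
    bot_address: str               # e.g. "scott.bot@kiloco.ai" (from the article)
    scopes: frozenset[str] = field(default_factory=frozenset)

# Illustrative read-only scopes; names are invented for this sketch.
READ_ONLY_SCOPES = frozenset({"logs:read", "repo:read", "calendar:read"})

def can(bot: BotIdentity, action: str) -> bool:
    """Deny by default: a bot acts only within explicitly granted scopes."""
    return action in bot.scopes

scott_bot = BotIdentity(
    human_owner="scott@kiloco.ai",
    bot_address="scott.bot@kiloco.ai",
    scopes=READ_ONLY_SCOPES,
)
assert can(scott_bot, "repo:read")
assert not can(scott_bot, "repo:write")  # write access was never granted
```

The detail that matters is the default: scopes must be granted explicitly, never inherited from the human owner, which is exactly the inversion of privilege Zaitsev and Kontsevoy argue agents require.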