Block today announced Managerbot, a new AI agent embedded in the Square platform that proactively monitors a seller’s business, identifies emerging problems, and proposes actionable solutions — without the seller ever having to ask a question. The product marks the most tangible manifestation of CEO Jack Dorsey’s controversial bet that artificial intelligence can fundamentally reshape how his company operates, builds products, and serves the millions of small businesses that depend on Square to run day-to-day commerce.

In an exclusive interview with VentureBeat, Willem Avé, Block’s head of product at Square, described Managerbot as a decisive break from the company’s earlier Square AI assistant, which functioned as a reactive chatbot that answered seller questions about sales, employees, and business performance.

“The big shift from Square AI to Managerbot is really from reactive to proactive,” Avé said. “What that means is the primary interface is not a question box. You assign tasks to Managerbot, and that could be based on data, an insight, or a signal from your business.”

The product is beginning to roll out now, with full availability to Square sellers expected over the coming months. Block declined to say whether Managerbot would carry an additional fee or be bundled into existing Square subscriptions.

How Managerbot predicts inventory shortages, optimizes schedules, and writes marketing campaigns on its own

Avé outlined three core domains where Managerbot operates today: inventory forecasting, employee shift scheduling, and automated marketing campaign creation. In every case, the agent acts before the seller does — watching over the business, detecting patterns, and surfacing recommendations with proposed actions attached.

In the inventory domain, Managerbot continuously monitors a seller’s stock levels, sales velocity, and external signals such as weather patterns and local events, then alerts the seller when an item is about to run out — or when it should stock up ahead of anticipated demand. “In warmer weather, we can see that you sell more of a certain good,” Avé explained. “That’s the forecasting capability, combined with local data — weather, events — so we can help sellers manage both their inventory and cash flows.”

For shift scheduling — a task that Avé described as “one of those interesting, very hard computer science problems” that consumes hours of a small business owner’s week — Managerbot analyzes forecasted sales data and then generates optimized employee schedules that balance worker preferences with coverage needs. “It turns out that frontier models are actually pretty good at it,” Avé said.

The third capability tackles what Avé called “the whole bucket of things that sellers could do if they had more time” — principally marketing. Managerbot identifies sales trends across a seller’s catalog and automatically drafts win-back campaigns and promotional outreach targeted at a store’s best customer segments. Avé said Block is seeing “very meaningful lift” from Managerbot-generated campaigns compared to what some sellers create manually, though he declined to share specific performance figures publicly.

Block built Managerbot on frontier AI models from OpenAI and Anthropic — but says the real innovation is underneath

Managerbot runs on third-party frontier models — Avé specifically referenced Anthropic’s Sonnet and OpenAI’s GPT family — but Block’s competitive advantage, he argued, lies in the “agent harness” the company has built around those models.
That harness draws heavily on Goose, Block’s open-source agent framework, and incorporates learnings from its consumer-facing Money Bot on Cash App.

The challenge specific to Square is scale and complexity. A seller running a small business might interact with hundreds of different tools across invoicing, inventory, customer management, marketing, payroll, and scheduling. Managerbot must navigate all of them coherently within a single agentic loop. “This isn’t like, you know, you load a skill and call it a day — think about hundreds of skills,” Avé said. “Actually, managing the context and managing the way that we progressively disclose tools, and some of the other innovation that we have at the harness layer, is I think some of the secret sauce.”

A critical design decision shapes every interaction: Managerbot does not autonomously execute changes to a seller’s business. Every write action — whether adjusting a shift schedule, publishing a marketing campaign, or modifying inventory — requires explicit seller approval. To facilitate that approval, Managerbot generates visual UI previews showing exactly what will change before the seller clicks “yes.” “We want to earn trust with sellers, so any write action is prompted to the user to approve,” Avé said. “The seller needs a visual representation of what the change is. You can’t just describe in words all the time what you’re going to go do.”

An $80 million fine and chatbot blunders hang over Block’s push to automate financial recommendations

That human-in-the-loop caution reflects a sensitivity that gains additional weight given Block’s recent history. In January 2025, 48 state financial regulators imposed an $80 million fine on Block for violations of Bank Secrecy Act and anti-money laundering laws related to Cash App. The Connecticut Department of Banking stated in announcing the settlement that regulators “found Block was not in compliance with certain requirements, creating the potential that its services could be used to support money laundering, terrorism financing, or other illegal activities.” The Illinois Department of Financial and Professional Regulation simultaneously joined the coordinated enforcement action.

Separately, reporting from The Guardian has documented instances of Block’s customer-facing chatbots making serious errors, including telling customers to cancel or close their accounts. When VentureBeat raised this concern during the interview, Avé acknowledged the stakes but redirected to Managerbot’s specific safeguards.

“Financial accuracy and financial data — the value of these products really come from recommendations,” Avé said. “We need to be better than whatever you can feed to ChatGPT. If you take a CSV of your sales and put it in ChatGPT or Claude, we need our product to be better and answer that question either more accurately or better than what’s available in the market.” He pointed to the harness layer’s role in reducing hallucinations through tuning, prompt engineering, and optimized tool-call loops, while acknowledging the inherent limitations of probabilistic systems: “It’s never going to be zero. Obviously, these are probabilistic systems, and we have guidance and call-outs in the tool to provide that.”

On regulated domains like lending and payments, Avé was more definitive: “In any sort of regulated domains — banking, lending, payments — there are strict guardrails on what we can and can’t say to sellers. Those are just part of the product and business.”
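Avé’s description maps onto a recognizable agent-design pattern: route every state-mutating tool call through a human approval step with a rendered preview, while read-only calls run freely. Below is a minimal sketch of that pattern in Python, with entirely hypothetical names; it is an illustration, not Block’s implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    run: Callable[[dict], str]
    is_write: bool                                    # write actions need explicit approval
    preview: Callable[[dict], str] | None = None      # human-readable summary of the change

def execute(tool: Tool, args: dict, ask_user: Callable[[str], bool]) -> str:
    """Run a tool, gating any write action behind seller approval."""
    if tool.is_write:
        # Show what will change before anything is committed.
        summary = tool.preview(args) if tool.preview else f"{tool.name}({args})"
        if not ask_user(f"Apply this change?\n{summary}"):
            return "Skipped: seller declined the proposed change."
    return tool.run(args)

# Example: a hypothetical shift-schedule update.
publish_schedule = Tool(
    name="publish_schedule",
    run=lambda a: f"Published schedule for week {a['week']}",
    is_write=True,
    preview=lambda a: f"Week {a['week']}: +2 Saturday shifts, -1 Monday shift",
)

print(execute(publish_schedule, {"week": "2026-04-06"}, ask_user=lambda msg: True))
```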
Dorsey cut 4,000 jobs in the name of AI — Managerbot is the first answer to what those tools are actually building

It is impossible to evaluate Managerbot outside the context of the radical organizational surgery Block performed just weeks ago. In late February, Dorsey announced that Block would cut more than 4,000 of its roughly 10,000 employees — nearly half the workforce — explicitly citing AI as the driving rationale. As the BBC reported, Dorsey wrote that “AI fundamentally changes what it means to build and run a company.” Block’s stock surged more than 20 percent on the news, according to ABC7.

The company’s Q4 2025 earnings report, released alongside the layoff announcement, showed gross profit of $2.87 billion — up 24 percent year over year — and raised 2026 guidance to $12.2 billion in gross profit, according to AlphaSense’s earnings analysis. Block also reported a greater than 40 percent increase in production code shipped per engineer since September 2025 through the use of agentic coding tools.

As CNBC commentator Steve Sedgwick wrote in an opinion piece following the announcement, “I keep getting told on CNBC that AI will create new jobs to replace those being lost. I’ve been asking the same question for years now.” The Observer’s Mark Minevich was more pointed, calling Block’s layoffs “probably the first legitimate mass layoff driven by A.I. as the actual operating thesis.”

Managerbot, then, is the product answer to the obvious follow-up question: if Block shed 4,000 workers in the name of intelligence tools, what exactly are those intelligence tools building? Avé framed the product as proof of concept for Block’s entire strategic thesis. “Block has been in the press recently about rebuilding as an intelligence company, and it’s like, a lot of people are asking, ‘What does that mean for us?’” Avé said. “What I like to do is show, not tell. We’re building Managerbot, which I think is one of the more advanced, maybe the most advanced, small business agent out there today.”

Sellers who use Managerbot are consolidating their businesses onto Square — and that may be the real strategic payoff

Perhaps the most consequential signal Avé shared was an early behavioral pattern: sellers who begin using Managerbot are voluntarily migrating more of their business operations onto the Square platform, consolidating payroll, time cards, and shift scheduling into Block’s ecosystem to feed the agent more data. “When they start interacting with Managerbot, they want to move more of their business onto Square because they see the value,” Avé said. “They’re like, ‘I should put my payroll here. I should get time cards here. I should get my shift schedules here,’ because once all that data is in one place, they can make better decisions and manage their business better.”

This dynamic could prove to be Managerbot’s most significant long-term effect — not as a standalone feature, but as a gravitational force pulling sellers deeper into Block’s integrated commerce stack. Block’s Q4 earnings already showed Square’s new volume added (NVA) grew 29 percent year over year, with sales-led NVA surging 62 percent.

Avé also argued that Square’s first-party architecture — built organically rather than through acquisitions — gives it a structural advantage over competitors in the AI era. “We’ve kind of harmonized and canonicalized this data at a sensible layer,” he said. “It’s not super hard to create more skills for these data domains.”
When VentureBeat pressed Avé on the tension between helping sellers and upselling them on Block’s own financial products — lending, payments processing, and other services that generate revenue for the company — he acknowledged the concern but framed Managerbot’s mission in terms of decision-making quality. “The goal for Managerbot is to help sellers increase their decision-making correctness,” Avé said. “If we can make sellers better at running their business by making better decisions and giving time back, I think that’s a good thing.”

Block says Managerbot isn’t a chatbot — it’s a business protector that compounds the company’s entire AI strategy

Avé was insistent that Managerbot represents something categorically different from the chatbot-as-advisor model that has proliferated across enterprise software. “A lot of people are building chatbots as advisors — it can answer a question for you,” he said. “What we really want Managerbot to be is a protector of your business. This is identifying trends. This is spotting things that you might have missed. This is helping you run your business and take actions.”

He also argued that the agent model compounds Block’s development velocity in ways that traditional software cannot match. “It’s much more straightforward to add a capability to Managerbot than it is to build a big Web 2.0 UI,” Avé said. “If we can deliver more capabilities, more features, more value to our sellers, the whole system compounds.”

Whether that compounding materializes — and whether sellers ultimately experience Managerbot as a trusted protector or a sophisticated upsell engine — will determine much about Block’s future. The company has staked its corporate identity, its headcount, and its Wall Street narrative on the conviction that AI agents can deliver more value with fewer humans in the loop. Managerbot is the first product to carry the full weight of that promise. And the small business owners who keep their shops open with Square terminals, who juggle shift schedules on napkins and skip marketing because there aren’t enough hours in the day — they didn’t ask to be the test case for Silicon Valley’s boldest AI thesis. But as of today, they are.
How MassMutual and Mass General Brigham turned AI pilot sprawl into production results
Enterprise AI programs rarely fail because of bad ideas. More often, they get stuck in ungoverned pilot mode and never reach production. At a recent VentureBeat event, technology leaders from MassMutual and Mass General Brigham explained how they avoided that trap — and what the results look like when discipline replaces sprawl.

At MassMutual, the results are concrete: 30% developer productivity gains, IT help desk resolution times reduced from 11 minutes to one, and customer service calls cut from 15 minutes to just one or two.

“We’re always starting with why do we care about this problem?” Sears Merritt, MassMutual’s head of enterprise technology and experience, said at the event. “If we solve the problem, how are we gonna know we solved it? And, how much value is associated with doing that?”

Defining metrics, establishing strong feedback loops

MassMutual, a 175-year-old company serving millions of policy owners and customers, has pushed AI into production across the business — customer support, IT, customer acquisition, underwriting, servicing, claims, and other areas. Merritt said his team follows the scientific method, beginning with a hypothesis and testing whether it has an outcome that will tangibly drive the business forward. Some ideas are great, but they may be “intractable in the business” due to factors like lack of data or access, or regulatory constraints. “We won’t go any further with an idea until we get crystal clear on how we’re going to measure, and how we’re going to define success.”

Ultimately, it’s up to different departments and leaders to define what quality means: Choose a metric and define the minimum level of quality before a tool is placed into the hands of teams and partners. That starting point creates a quick feedback loop. “The things that we find slow us down is where there isn’t shared clarity on what outcome we’re trying to achieve,” which can lead to confusion and constant re-adjusting, said Merritt. “We don’t go to production until there is a business partner that says, ‘Yes, that works.’”

His team is strategic about evaluating emerging tools, and “extremely rigorous” when testing and measuring what “good” means. For instance, they perform trust scoring to lower hallucination rates, establish thresholds and evaluation criteria, and monitor for feature and output drift.

Merritt also operates with a no-commitment policy — meaning the company doesn’t lock itself into using a particular model. It has what he calls an “incredibly heterogeneous” technology environment combining best-of-breed models alongside mainframes running on COBOL. That flexibility isn’t accidental. His team built common service layers, microservices, and APIs that sit between the AI layer and everything underneath — so when a better model comes along, swapping it in doesn’t mean starting over.

Because, Merritt explained, “the best of breed today might be the worst of breed tomorrow, and we don’t want to set ourselves up to fall behind.”
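Merritt’s “common service layer” is a familiar abstraction pattern: business logic depends on a narrow interface, and vendors plug in behind it. Here is a minimal sketch under that assumption; the class names and stubbed backends are hypothetical, not MassMutual’s actual stack.

```python
from typing import Protocol

class CompletionBackend(Protocol):
    """The contract every model provider must satisfy."""
    def complete(self, prompt: str) -> str: ...

class VendorABackend:
    def complete(self, prompt: str) -> str:
        # A real implementation would call vendor A's API; stubbed for the sketch.
        return f"[vendor-a] {prompt[:40]}..."

class VendorBBackend:
    def complete(self, prompt: str) -> str:
        return f"[vendor-b] {prompt[:40]}..."

class AssistantService:
    """Business logic depends only on the protocol, never on a vendor SDK."""
    def __init__(self, backend: CompletionBackend):
        self.backend = backend

    def summarize_claim(self, claim_text: str) -> str:
        return self.backend.complete(f"Summarize this claim: {claim_text}")

# Swapping in tomorrow's best-of-breed model is a constructor argument, not a rewrite.
service = AssistantService(backend=VendorBBackend())
print(service.summarize_claim("Water damage reported on 2026-03-12..."))
```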
Weeding instead of letting a thousand flowers bloom

Mass General Brigham (MGB), for its part, took more of a spray-and-pray approach — at first. Around 15,000 researchers in the not-for-profit health system have been using AI, ML, and deep learning for the last 10 to 15 years, CTO Nallan “Sri” Sriraman said at the same VB event. But last year, he made a bold choice: His team shut down a sprawl of non-governed AI pilots.

Initially, “we did follow the thousand flowers bloom [methodology], but we didn’t have a thousand flowers, we had probably a few tens of flowers trying to bloom,” he said.

Like Merritt’s team at MassMutual, MGB pivoted to a more holistic view, examining why they were developing certain tools for specific departments or workflows. They questioned what capabilities they wanted and needed and what investment those required. Sriraman’s team also spoke with their primary platform providers — Epic, Workday, ServiceNow, Microsoft — about their roadmaps. This was a “pivotal moment,” he noted, as they realized they were building in-house tools that vendors were already providing (or were planning to roll out). As Sriraman put it: “Why are we building it ourselves? We are already on the platform. It is going to be in the workflow. Leverage it.”

That said, the marketplace is still nascent, which can make for difficult decisions. “The analogy I will give is when you ask six blind men to touch an elephant and say, what does this elephant look like?” Sriraman said. “You’re gonna get six different answers.”

There’s nothing wrong with that, he noted; it’s just that everybody is discovering and experimenting as the landscape keeps shifting.

Instead of a Wild West environment, Sriraman’s team distributes Microsoft Copilot to users across the business, and uses a “small landing zone” where they can safely test more sophisticated products and control token use. They also began “consciously embedding AI champions” across business groups. “This is kind of a reverse of letting a thousand flowers bloom, carefully planting and nourishing,” Sriraman said.

Observability is another big consideration; he describes real-time dashboards that manage model drift and safety and allow IT teams to govern AI “a little more pragmatically.” Health monitoring is critical with AI systems, he noted, and his team has established principles and policies around AI use, not to mention least-privilege access.

In clinical settings, the guardrails are absolute: AI systems never issue the final decision. “There’s always going to be a doctor or a physician assistant in the loop to close the decision,” Sriraman said. He cited radiology report generation as one area where AI is used heavily, but where a radiologist always signs off.

Sriraman was clear: “Thou shall not do this: Don’t show PHI [protected health information] in Perplexity. As simple as that, right?”

And, importantly, there must be safety mechanisms in place. “We need a big red button, kill it,” Sriraman emphasized. “We don’t put anything in the operational setting without that.”

Ultimately, while agentic AI is a transformative technology, the enterprise approach to it doesn’t have to be dramatically different. “There is nothing new about this,” Sriraman said. “You can replace the word BPM [business process management] from the ’90s and 2000s with AI. The same concepts apply.”
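Sriraman’s “big red button” is essentially a global kill switch checked before any AI action executes. A minimal sketch of the pattern follows, with hypothetical names and no claim to MGB’s actual design.

```python
import threading

class KillSwitch:
    """A process-wide 'big red button' checked before every AI action."""
    def __init__(self):
        self._tripped = threading.Event()

    def trip(self, reason: str):
        print(f"KILL SWITCH PRESSED: {reason}")
        self._tripped.set()

    def guard(self):
        if self._tripped.is_set():
            raise RuntimeError("AI system disabled by operations.")

switch = KillSwitch()

def draft_radiology_report(findings: str) -> str:
    switch.guard()  # refuse to run once the button is pressed
    # A human always closes the decision: the output is only ever a draft.
    return f"DRAFT (requires radiologist sign-off): {findings}"

print(draft_radiology_report("No acute abnormality identified."))
switch.trip("model drift detected on dashboard")
# Any further call to draft_radiology_report now raises instead of producing output.
```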
AI agents that automatically prevent, detect and fix software issues are here as NeuBird AI launches Falcon, FalconClaw
The mantra of the modern tech industry was arguably coined by Facebook (before it became Meta): “move fast and break things.” But as enterprise infrastructure has shifted into a dizzying maze of hybrid clouds, microservices, and ephemeral compute clusters, the “breaking” part has become a structural tax that many organizations can no longer afford to pay. Today, three-year-old startup NeuBird AI is launching a full-scale offensive against this “chaos tax,” announcing a $19.3 million funding round alongside the release of its Falcon autonomous production operations agent.

The launch isn’t just a product update; it is a philosophical pivot. For years, the industry has focused on “Incident Response” — making the fire trucks faster and the hoses bigger. NeuBird AI is arguing that the only sustainable path forward is “Incident Avoidance.” As Venkat Ramakrishnan, President and COO of NeuBird AI, put it in a recent interview: “Incident management is so old school. Incident resolution is so old school. Incident avoidance is what is going to be enabled by AI.” By grounding AI in real-time enterprise context rather than just large language model reasoning, the company aims to move site reliability engineering and DevOps teams from a reactive posture to a predictive one.

The AI divide: a reality check on automation

Accompanying the launch is NeuBird AI’s 2026 State of Production Reliability and AI Adoption Report, a survey of over 1,000 professionals that reveals a massive disconnect between the boardroom and the server room. While 74% of C-suite executives believe their organizations are actively using AI to manage incidents, only 39% of the practitioners — the engineers actually on call at 2:00 AM — agree.

This 35-point “AI Divide” suggests that while leadership is writing checks for AI platforms, the technology is often failing to reach the frontline. For engineers, the reality remains manual and grueling: the study found that engineering teams spend an average of 40% of their time on incident management rather than building new products. Gou Rao, CEO of NeuBird AI, told VentureBeat that this is a persistent operational reality: “Over the past 18 months that we have been in production, this is not a marketing slide. We have concretely been able to demonstrate a massive reduction in time to incident response and resolution.”

The consequences of this “toil” are more than just lost productivity. Alert fatigue has transitioned from a morale issue to a direct reliability risk. According to the report, 83% of organizations have teams that ignore or dismiss alerts occasionally, and 44% of companies experienced an outage in the past year tied directly to a suppressed or ignored alert. In many cases, the systems are so noisy that customers discover failures before the monitoring tools do.

Introducing NeuBird AI Falcon

NeuBird AI’s answer to this systemic failure is the Falcon engine. While the company’s previous iteration, Hawkeye, focused on autonomous resolution, Falcon extends that capability into predictive intelligence. “When we launched NeuBird AI in 2023, our first version of the agent was called Hawkeye,” Rao explains. “What we’re announcing next week at HumanX is our next-generation version of the agent, codenamed Falcon. Falcon is easily three times faster than Hawkeye and is averaging around 92% in confidence scores.”

This level of accuracy allows engineers to trust the agent’s output at face value.
Falcon represents a significant leap over previous generative AI applications in the space, particularly in its ability to forecast failure. “Falcon is really good at preventive prediction, so it can tell you what can go wrong,” Rao says. “It’s pretty accurate on a 72-hour window, even better at 48 hours, and by 24 hours it gets really, really accurate.”

One of the standout features of the new release is the Advanced Context Map. Unlike static dashboards, this is a real-time view of infrastructure dependencies and service health. It allows teams to visualize the “blast radius” of an issue as it propagates across an environment, helping engineers understand not just what is broken, but why it is failing in the context of its neighbors.

‘Minority Report’ for incident management

While many AI tools favor flashy web interfaces, NeuBird AI is leaning into the developer’s native habitat with NeuBird AI Desktop. This allows engineers to invoke the production ops agent directly from a command-line interface to explore root causes and system dependencies.

“Falcon has a desktop mode which allows it to interact with a developer’s local tools,” Rao noted. “We’re getting a lot more traction from a hands-on developer audience, especially as people go to Claude Desktop and Cursor. They’re completing the loop by using production agents talking to their coding agents.” This integration enables a “multi-agent” workflow where an engineer can use NeuBird AI’s agent to diagnose a root cause in production and then hand off that diagnosis to a coding agent like Claude Code to implement the fix.

During a live demo, Rao showcased how the agent could be set to “Sentinel Mode,” constantly sweeping a cluster for risks. If it detects an anomaly — such as a projected 5% spike in AWS costs or a misconfigured Kubernetes pod — it can flag the specific on-call engineer who has the domain expertise to fix it. “This is like ‘Minority Report’ for incident management,” one financial services executive reportedly told the team after a demo.

Context engineering: a gateway for security

A primary concern for enterprises deploying AI is security — ensuring large language models don’t go “crazy” or exfiltrate sensitive data. NeuBird AI addresses this through a proprietary approach to “context engineering.”

“The way we implemented our agent is that the large language models themselves are never actually touching the data directly,” Rao explains. “We become the gateway for how the context can be accessed.” This means the model is the reasoning engine, but NeuBird AI is the middleman that wraps the data. Furthermore, the company has implemented strict guardrails on what the agent can actually execute. “We’ve created a language that confines and restricts the agent from what it can do,” says Rao. “If it comes up with something anomalous, or something we don’t know, it won’t run. We won’t do it.”

This architectural choice allows NeuBird AI to remain model-agnostic. If a newer model from Anthropic or Google outperforms the current reasoning engine, NeuBird AI can simply switch it out without requiring the customer to change their platform. “Customers don’t want to be tied to a specific way of reasoning,” Rao asserts. “They want to be tied to a platform from which they can get the value of an agentic system.”
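Rao’s gateway and restricted action language describe a common guardrail architecture: the model proposes, and a deterministic layer validates against an explicit allowlist before anything runs. Below is a minimal sketch under those assumptions; the verbs and validators are invented for illustration and are not NeuBird’s actual design.

```python
import shlex

# The 'language that confines the agent': an explicit allowlist of verbs
# with per-verb argument validators. Anything outside it simply won't run.
ALLOWED_ACTIONS = {
    "restart_pod": lambda args: len(args) == 1 and args[0].startswith("pod/"),
    "scale_deployment": lambda args: len(args) == 2 and args[1].isdigit(),
    "fetch_logs": lambda args: len(args) == 1,
}

def gateway_execute(proposed: str) -> str:
    """Validate a model-proposed action before anything touches production."""
    verb, *args = shlex.split(proposed)
    validator = ALLOWED_ACTIONS.get(verb)
    if validator is None or not validator(args):
        # Anomalous or unknown: refuse, per Rao's "we won't do it."
        return f"REFUSED: {proposed!r} is outside the permitted action language."
    # The LLM never touched raw telemetry; this layer fetches and executes.
    return f"EXECUTED: {verb} {args}"

print(gateway_execute("restart_pod pod/checkout-7f9c"))   # allowed
print(gateway_execute("rm -rf /var/lib/etcd"))            # refused
```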
Displacing the “army”: reducing reliance on expensive observability

One of the most radical claims NeuBird AI makes is that agentic systems can actually reduce the amount of data enterprises need to store in the first place. Currently, teams rely on massive storage platforms with complex query languages.

“People use very complex observability tools like Datadog, Dynatrace, and Sysdig,” Rao says. “This is the norm today, which is why it takes an army of people to solve a problem. What we’ve been able to demonstrate with agentic systems is that you don’t need to store all that data in the first place.” Because the agent can reason across raw data sources, it can identify which signals are junk and which are critical. This shift, Rao argues, “reduces human toil and effort while simultaneously reducing your reliance on these insanely expensive observability tools.”

The practical impact of this “incident avoidance” was recently demonstrated at Deep Health. Rao recounts how their agent detected a systemic issue that was invisible to traditional tools: “Our agent was able to go in and prevent an issue from happening which would have caused this company, Deep Health, a major production outage. The customer is completely beside themselves and happy about what it could do.”

FalconClaw: operationalizing ‘tribal knowledge’

One of the most persistent problems in IT operations is the loss of “tribal knowledge” — the hard-won expertise of senior engineers that exists only in their heads. NeuBird AI is attempting to solve this with FalconClaw, a curated, enterprise-grade skills hub compatible with the OpenClaw ecosystem.

FalconClaw allows teams to capture best practices and resolution steps as “validated and compliant skills.” The tech preview launched today with 15 initial skills that work natively with NeuBird AI’s toolchain. According to Francois Martel, Field CTO at NeuBird AI, this turns hard-won expertise into a reusable asset that the AI can use automatically. It’s an attempt to standardize how agents interact with infrastructure, moving away from proprietary “black box” systems toward a multi-agent world where different AI tools can share a common set of operational abilities.

Scaling the moat: funding and leadership

The $19.3 million round was led by Xora Innovation, a Temasek-backed firm, with participation from Mayfield, M12, StepStone Group, and Prosperity7 Ventures. This brings NeuBird AI’s total funding to approximately $64 million.

The investor interest is fueled largely by the pedigree of the founding team. Gou Rao and Vinod Jayaraman previously co-founded Portworx, which was acquired by Pure Storage, and Ocarina Networks, acquired by Dell. They have recently bolstered their leadership with Venkat Ramakrishnan, another Pure Storage veteran, as President and COO.

For investors like Phil Inagaki of Xora, the value lies in NeuBird AI’s “best-in-class results across accuracy, speed and token consumption.” As cloud costs continue to spiral, the ability of an AI agent to not only fix bugs but also optimize infrastructure capacity is becoming a “must-have” rather than a “nice-to-have.” NeuBird AI claims its agent can save enterprise teams more than 200 engineering hours per month.

The path to ‘self-healing’ infrastructure

As the State of Production Reliability report notes, current incident management practices are “no longer sustainable.” With 61% of organizations estimating that a single hour of downtime costs $50,000 or more, the financial stakes of staying in a reactive loop are enormous.

NeuBird AI’s launch of Falcon and FalconClaw marks a definitive attempt to break that loop. By focusing on prevention and the “context engineering” required to make AI trustworthy for enterprise production, the company is positioning itself as the critical intelligence layer for the modern stack.
While the “AI Divide” between executives and practitioners remains a significant hurdle for the industry, NeuBird AI is betting that as engineers see the value of a CLI-driven, 92%-accurate agent that can “see around corners,” the skepticism will fade. For the site reliability engineers currently drowning in a flood of non-actionable alerts, the arrival of a reliable AI teammate couldn’t come soon enough.

NeuBird AI Falcon is available starting today, with organizations able to sign up for a free trial at neubird.ai.
Closing the data security maturity gap: Embedding protection into enterprise workflows
Presented by Capital One

Data security remains one of the least mature domains in enterprise cybersecurity. According to IBM, 35% of breaches in 2025 involved unmanaged data sources or “shadow data.” This reveals a systemic lack of basic data awareness. It’s not because of a lack of tooling or investment. It’s because many organizations still struggle with the most fundamental questions: What data do we have? Where does it live? How does it move? And who is responsible for it?

In an increasingly complex ecosystem of data sources, cloud platforms, SaaS applications, APIs, and AI models, those questions are only becoming more difficult to answer. Closing the maturity gap in data security demands a cultural shift where security is no longer treated as an afterthought. Instead, protection is embedded throughout the full data lifecycle, grounded in a robust inventory, clear classification, and scalable mechanisms that translate policy into automated guardrails.

Visibility as the foundation

The most persistent barrier to data security maturity is basic visibility. Organizations often focus on how much data they hold, but not on what that data is made up of. Does it contain personally identifiable information (PII)? Financial data? Health information? Intellectual property? Without this level of understanding and inventory, it’s a lot tougher to implement meaningful protection.

This can be avoided, however, by prioritizing enterprise capabilities that can detect sensitive data at scale across a large and varied footprint. Detection must be paired with action: deleting data where it’s no longer needed, and securing the data that remains by aligning enforcement to a well-defined policy. Mature organizations should start by treating data security as an “understanding your environment” problem: Maintain an inventory, classify what’s in the ecosystem, and align protections with the classification, rather than relying solely on perimeter controls or point solutions to scale.

Securing chaotic data

One reason data security has lagged behind other security domains is that data itself is inherently chaotic. Unlike perimeter security, which relies on explicit ports and defined boundaries, data is largely unpredictable. That is to say, the same underlying information may appear across very different formats: structured databases, unstructured documents, chat transcripts, or analytics pipelines. Each may have slightly different encodings or transformations that introduce unforeseen, and often undetected, changes to the data itself.

Human behavior compounds the challenge, with different actions introducing risks in ways that perimeter controls simply can’t anticipate. This could be anything from a credit card number copied into a free-form comment field, to a spreadsheet emailed outside its intended audience, to a dataset repurposed for a new workflow.

When protection is bolted on at the end of a workflow, organizations create blind spots. They rely on downstream checks to catch upstream design flaws. Over time, complexity accumulates and the risk of exposure becomes a question of when, not if.

A more resilient model assumes that sensitive data will surface in unexpected places and formats, so protection is embedded from the moment data is captured. Defense-in-depth becomes a design principle: segmentation, encryption at rest and in transit, tokenization, and layered access controls. Critically, these safeguards travel with the data lifecycle, from ingestion to processing, analytics, and publishing.
Instead of retrofitting controls, organizations design for chaos. They accept variability as a given and build systems that remain secure even when data diverges from expectations.

Scaling governance with automation

Data security becomes operationally sustainable when governance is enforced through automation from the start, coupled with clear expectations that create bounded contexts: teams understand what is permitted, under what conditions, and with what protections data can be used effectively.

This matters more than ever today. AI systems often require access to huge volumes of data, across domains. This makes policy implementation particularly challenging. To do so effectively and safely requires deep understanding, strong governance policies, and automated protection.

Security techniques such as synthetic data and token replacement enable organizations to preserve analytical context while making sensitive values harder to read. Policy-as-code patterns, APIs, and automation can handle tokenization, deletion, retention constraints, and dynamic access controls. With guardrails built into the platforms they use, engineers can focus more on innovating with data and elevating business outcomes securely.

AI systems must also operate within the same governance and monitoring expectations as human workflows. Permissions, telemetry, and controls around what models can access, along with the information they can publish, are essential.

Governance will always introduce a degree of friction. The goal is to make that friction well understood, navigable, and increasingly automated. Confirming purpose, registering a use case, and provisioning access dynamically based on role and need should be clear, repeatable processes.

At enterprise scale, this requires centralized capabilities that implement cybersecurity policy in the data domain. This includes detection and classification engines, tokenization and detokenization services, retention enforcement, and ownership and taxonomy mechanisms that cascade risk management expectations into daily execution.

When done well, governance becomes an enablement layer rather than a bottleneck. Metadata and classification drive protection decisions automatically while accelerating business discovery and usage. Data is protected across its lifecycle by strong defenses like tokenization, and deleted when required by regulation or internal policy. There should be no need for teams to “touch the data” manually for every control decision, with policy enforced by design.

Building for the future

Put simply, closing the data security maturity gap is less about adopting a single breakthrough technology and more about operational discipline. Build the map. Classify what you have. Embed protection into workflows so that security is repeatable at scale.

For business leaders seeking measurable progress over the next 18–24 months, three priorities stand out. First, establish a robust inventory and metadata-rich map of the data ecosystem. Visibility is non-negotiable. Second, implement classification tied to clear, actionable policy expectations. Make it obvious what protections each category demands. And finally, invest in scalable, automated protection schemes that integrate directly into development and data workflows.
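A minimal sketch of the tokenization-plus-policy-as-code pattern described above, using a toy in-memory vault: the classifications, roles, and function names are illustrative only, not Capital One Databolt’s API.

```python
import secrets

# Policy-as-code: what each classification requires, declared as data.
POLICY = {
    "PII": {"action": "tokenize"},
    "PCI": {"action": "tokenize"},
    "PUBLIC": {"action": "pass"},
}

_vault: dict[str, str] = {}  # token -> original value (a real vault is a hardened service)

def tokenize(value: str) -> str:
    token = f"tok_{secrets.token_hex(8)}"
    _vault[token] = value
    return token

def detokenize(token: str, caller_roles: set[str]) -> str:
    # Dynamic access control: only approved roles may reverse a token.
    if "fraud-review" not in caller_roles:
        raise PermissionError("Caller is not approved for detokenization.")
    return _vault[token]

def ingest(field_value: str, classification: str) -> str:
    """Apply the declared policy automatically at the moment of capture."""
    action = POLICY.get(classification, {"action": "tokenize"})["action"]
    return tokenize(field_value) if action == "tokenize" else field_value

card = ingest("4111 1111 1111 1111", "PCI")
print(card)                                 # analytics pipelines see only the token
print(detokenize(card, {"fraud-review"}))   # an approved workflow can reverse it
```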
When protection shifts from reactive bolt-on controls to proactive built-in guardrails, compliance becomes simpler, governance becomes stronger, and AI readiness becomes achievable, without compromising rigor.

Learn more about how Capital One Databolt, the enterprise data security solution from Capital One Software, can help your business become AI-ready by securing sensitive data at scale.

Andrew Seaton is Vice President, Data Engineering – Enterprise Data Detection & Protection, Capital One.

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
OCSF explained: The shared data language security teams have been missing
The security industry has spent the last year talking about models, copilots, and agents, but a quieter shift is happening one layer below all of that: Vendors are lining up around a shared way to describe security data. The Open Cybersecurity Schema Framework (OCSF) is emerging as one of the strongest candidates for that job.

It gives vendors, enterprises, and practitioners a common way to represent security events, findings, objects, and context. That means less time rewriting field names and custom parsers, and more time correlating detections, running analytics, and building workflows that work across products. In a market where every security team is stitching together endpoint, identity, cloud, SaaS, and AI telemetry, a common infrastructure long felt like a pipe dream, and OCSF now puts it within reach.

OCSF in plain language

OCSF is an open-source framework for cybersecurity schemas. It’s vendor-neutral by design and deliberately agnostic to storage format, data collection, and ETL choices. In practical terms, it gives application teams and data engineers a shared structure for events so analysts can work with a more consistent language for threat detection and investigation.

That sounds dry until you look at the daily work inside a security operations center (SOC). Security teams have to spend a lot of effort normalizing data from different tools so that they can correlate events. For example, detecting an employee logging in from San Francisco at 10 a.m. on their laptop, then accessing a cloud resource from New York at 10:02 a.m., could reveal a leaked credential. Setting up a system that can correlate those events, however, is no easy task: Different tools describe the same idea with different fields, nesting structures, and assumptions.

OCSF was built to lower this tax. It helps vendors map their own schemas into a common model, and helps customers move data through lakes, pipelines, and security information and event management (SIEM) tools without requiring time-consuming translation at every hop.
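To make the shared-language point concrete, here is a hedged sketch of the impossible-travel example as two OCSF-shaped authentication events. The skeleton follows the published Authentication class (class_uid 3002, activity_id 1 for Logon), but the structure is abridged and the values and correlation rule are illustrative, not taken from any vendor’s pipeline.

```python
from datetime import datetime, timezone

# Two logons normalized into one OCSF-shaped structure, regardless of
# which product produced them (abridged OCSF Authentication, class_uid 3002).
events = [
    {
        "class_uid": 3002, "activity_id": 1,  # Authentication: Logon
        "time": datetime(2026, 4, 3, 10, 0, tzinfo=timezone.utc),
        "user": {"name": "jsmith"},
        "src_endpoint": {"ip": "198.51.100.7", "location": {"city": "San Francisco"}},
    },
    {
        "class_uid": 3002, "activity_id": 1,
        "time": datetime(2026, 4, 3, 10, 2, tzinfo=timezone.utc),
        "user": {"name": "jsmith"},
        "src_endpoint": {"ip": "203.0.113.9", "location": {"city": "New York"}},
    },
]

# With a shared schema, the correlation rule is trivial to express.
a, b = sorted(events, key=lambda e: e["time"])
minutes_apart = (b["time"] - a["time"]).total_seconds() / 60
if (a["user"]["name"] == b["user"]["name"]
        and a["src_endpoint"]["location"]["city"] != b["src_endpoint"]["location"]["city"]
        and minutes_apart < 60):
    print(f"ALERT: impossible travel for {a['user']['name']} "
          f"({a['src_endpoint']['location']['city']} -> "
          f"{b['src_endpoint']['location']['city']} in {minutes_apart:.0f} min)")
```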
The last two years have been unusually fast

Most of OCSF’s visible acceleration has happened in the last two years. The project was announced in August 2022 by Amazon AWS and Splunk, building on work contributed by Symantec, a Broadcom division, along with other well-known names including Cloudflare, CrowdStrike, IBM, Okta, Palo Alto Networks, Rapid7, Salesforce, Securonix, Sumo Logic, Tanium, Trend Micro, and Zscaler. The OCSF community has kept up a steady cadence of releases over the last two years.

The community has grown quickly. AWS said in August 2024 that OCSF had expanded from a 17-company initiative into a community with more than 200 participating organizations and 800 contributors, a figure that grew to 900 when OCSF joined the Linux Foundation in November 2024.

OCSF is showing up across the industry

In the observability and security space, OCSF is everywhere. AWS Security Lake converts natively supported AWS logs and events into OCSF and stores them in Parquet. AWS AppFabric can output OCSF-normalized audit data. AWS Security Hub findings use OCSF, and AWS publishes an extension for cloud-specific resource details. Splunk can translate incoming data into OCSF with Edge Processor and Ingest Processor. Cribl supports seamlessly converting streaming data into OCSF and compatible formats. Palo Alto Networks can forward Strata Logging Service data into Amazon Security Lake in OCSF. CrowdStrike positions itself on both sides of the OCSF pipe, with Falcon data translated into OCSF for Security Lake and Falcon Next-Gen SIEM positioned to ingest and parse OCSF-formatted data.

OCSF is one of those rare standards that has crossed the chasm from abstract specification into standard operational plumbing across the industry.

AI is giving the OCSF story fresh urgency

When enterprises deploy AI infrastructure, large language models (LLMs) sit at the core, surrounded by complex distributed systems such as model gateways, agent runtimes, vector stores, tool calls, retrieval systems, and policy engines. These components generate new forms of telemetry, much of which spans product boundaries. Security teams across the SOC are increasingly focused on capturing and analyzing this data. The central question often becomes what an agentic AI system actually did, rather than only the text it produced, and whether its actions led to any security breaches.

That puts more pressure on the underlying data model. An AI assistant that calls the wrong tool, retrieves the wrong data, or chains together a risky sequence of actions creates a security event that needs to be understood across systems. A shared security schema becomes more valuable in that world, especially when AI is also being used on the analytics side to correlate more data, faster.

For OCSF, 2025 was all about AI

Imagine a company uses an AI assistant to help employees look up internal documents and trigger tools like ticketing systems or code repositories. One day, the assistant starts pulling the wrong files, calling tools it should not use, and exposing sensitive information in its responses. Updates in OCSF versions 1.5.0, 1.6.0, and 1.7.0 help security teams piece together what happened by flagging unusual behavior, showing who had access to the connected systems, and tracing the assistant’s tool calls step by step. Instead of only seeing the final answer the AI gave, the team can investigate the full chain of actions that led to the problem.

What’s on the horizon

Imagine a company uses an AI customer support bot, and one day the bot begins giving long, detailed answers that include internal troubleshooting guidance meant only for staff. With the kinds of changes being developed for OCSF 1.8.0, the security team could see which model handled the exchange, which provider supplied it, what role each message played, and how the token counts changed across the conversation. A sudden spike in prompt or completion tokens could signal that the bot was fed an unusually large hidden prompt, pulled in too much background data from a vector database, or generated an overly long response that increased the chance of sensitive information leaking. That gives investigators a practical clue about where the interaction went off course, instead of leaving them with only the final answer.
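As a hedged illustration, a record shaped the way that description suggests might look like the sketch below. OCSF 1.8.0 is still in development, so these fields are speculative, modeled on the article’s description rather than any published schema.

```python
# Speculative sketch of an AI-activity record shaped the way the 1.8.0
# discussion above describes; not a published OCSF class definition.
ai_event = {
    "metadata": {"version": "1.8.0-draft"},          # assumed version label
    "model": {"name": "support-bot-llm", "provider": "example-provider"},
    "messages": [
        {"role": "system", "tokens": 412},
        {"role": "user", "tokens": 63},
        {"role": "assistant", "tokens": 5_870},      # anomalous spike
    ],
}

# The investigator's clue: completion tokens far out of line with the prompt.
prompt_tokens = sum(m["tokens"] for m in ai_event["messages"] if m["role"] != "assistant")
completion_tokens = sum(m["tokens"] for m in ai_event["messages"] if m["role"] == "assistant")
if completion_tokens > 10 * prompt_tokens:
    print(f"Flag for review: {completion_tokens} completion tokens "
          f"vs {prompt_tokens} prompt tokens")
```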
Why this matters to the broader market

The bigger story is that OCSF has moved quickly from being a community effort to becoming a real standard that security products use every day. Over the past two years, it has gained stronger governance, frequent releases, and practical support across data lakes, ingest pipelines, SIEM workflows, and partner ecosystems. In a world where AI expands the security landscape through scams, abuse, and new attack paths, security teams rely on OCSF to connect data from many systems without losing context along the way.

Nikhil Mungel has been building distributed systems and AI teams at SaaS companies for more than 15 years.
Anthropic cuts off the ability to use Claude subscriptions with OpenClaw and third-party AI agents
Are you a subscriber to Anthropic’s Claude Pro ($20 monthly) or Max ($100-$200 monthly) plans, and do you use its Claude AI models and products to power third-party AI agents like OpenClaw? If so, you’re in for an unpleasant surprise.

Anthropic announced a few hours ago that starting tomorrow, Saturday, April 4, 2026, at 12 pm PT/3 pm ET, it will no longer be possible for those Claude subscribers to use their subscriptions to hook Anthropic’s Claude models up to third-party agentic tools, citing the strain such usage was placing on Anthropic’s compute and engineering resources and its desire to serve a wide number of users reliably.

“We’ve been working hard to meet the increase in demand for Claude, and our subscriptions weren’t built for the usage patterns of these third-party tools,” wrote Boris Cherny, Head of Claude Code at Anthropic, in a post on X. “Capacity is a resource we manage thoughtfully and we are prioritizing our customers using our products and API.”

The company also reportedly sent out an email to this effect to some subscribers. However, it’s not certain whether subscribers to Claude Team and Enterprise plans will be impacted similarly. We’ve reached out to Anthropic for further clarification and will update when we hear back.

To be clear, it will still be possible to use Claude models like Opus, Sonnet, and Haiku to power OpenClaw and similar external agents, but users will now need to opt into a pay-as-you-go “extra usage” billing system or utilize Anthropic’s application programming interface (API), which charges for every token of usage rather than allowing for open-ended usage up to certain limits, as the Pro and Max plans have allowed so far.

The reason for the change: ‘third party services are not optimized’

The technical reality, according to Anthropic, is that its first-party tools like Claude Code, its AI vibe coding harness, and Claude Cowork, its business app interfacing and control tool, are built to maximize “prompt cache hit rates” — reusing previously processed text to save on compute. Third-party harnesses like OpenClaw often bypass these efficiencies.

“Third party services are not optimized in this way, so it’s really hard for us to do sustainably,” Cherny explained further on X. He even revealed his own hands-on attempts to bridge the gap: “I did put up a few PRs to improve prompt cache hit rate for OpenClaw in particular, which should help for folks using it with Claude via API/overages.”

Prior to the news, Anthropic had also begun imposing stricter limits on Claude’s five-hour usage sessions during business hours (5 am-11 am PT/8 am-2 pm ET), meaning that the number of tokens subscribers could send during those sessions dropped. This frustrated some power users who suddenly began reaching their limits far faster than they had previously — a change Anthropic said was to help “manage growing demand for Claude” and would only affect up to 7% of users at any given time.
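For builders weighing that move, the pay-as-you-go path runs through Anthropic’s Messages API, where prompt caching (the cache-hit optimization Cherny describes) is opted into per content block. Here is a minimal sketch using the Anthropic Python SDK; the model string is a placeholder, so check Anthropic’s current documentation for exact names and caching requirements.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A large, stable agent prompt (caching only kicks in above a minimum size).
LONG_AGENT_PROMPT = "You are an agent harness system prompt. " * 400

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; substitute a current model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_AGENT_PROMPT,
            # Mark the stable prefix as cacheable so repeated agent turns
            # reuse it instead of re-billing the full input every time.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the overnight task queue."}],
)

# usage reports cache writes vs. reads: the hit rate Cherny is describing.
print(response.usage)
```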
Discounts and credits to soften the blow

Anthropic is not banning third-party tools entirely, but it is moving them to a different ledger. The new “Extra Usage” bundles represent a middle ground between a flat-rate subscription and a full enterprise API account.

- The Credit: To “soften the blow,” Anthropic is offering existing subscribers a one-time credit equal to their monthly plan price, redeemable until April 17.
- The Discount: Users who pre-purchase “extra usage” bundles can receive up to a 30% discount, an attempt to retain power users who might otherwise churn.
- Capacity Management: Anthropic’s official statement noted that these tools put an “outsized strain” on systems, forcing a prioritization of “customers using our core products and API.”

‘The all-you-can-eat buffet just closed’

The response from the developer community has been a mixture of analytical acceptance and sharp frustration.

Growth marketer Aakash Gupta observed on X that the “all-you-can-eat buffet just closed,” noting that a single OpenClaw agent running for one day could burn $1,000 to $5,000 in API costs. “Anthropic was eating that difference on every user who routed through a third-party harness,” Gupta wrote. “That’s the pace of a company watching its margin evaporate in real time.”

However, Peter Steinberger, the creator of OpenClaw who was recently hired by OpenAI, took a more skeptical view of the “capacity” argument.

“Funny how timings match up,” Steinberger posted on X. “First they copy some popular features into their closed harness, then they lock out open source.” Indeed, Anthropic recently added some of the same capabilities that helped OpenClaw catch on — such as the ability to message agents through external services like Discord and Telegram — to Claude Code. Steinberger claimed that he and fellow investor Dave Morin attempted to “talk sense” into Anthropic, but were only able to delay the enforcement by a single week.

User @ashen_one, founder of Telaga Charity, voiced a concern likely shared by other small-scale builders: “If I switch both [OpenClaw instances] to an API key or the extra usage you’re recommending here, it’s going to be far too expensive to make it worth using. I’ll probably have to switch over to a different model at this point.”

“I know it sucks,” Cherny replied. “Fundamentally engineering is about tradeoffs, and one of the things we do to serve a lot of customers is optimize the way subscriptions work to serve as many people as possible with the best model.”

Licensing and the OpenAI shadow

The timing of the crackdown is particularly notable given the talent migration. When Steinberger joined OpenAI in February 2026, he brought the “OpenClaw” ethos with him. OpenAI appears to be positioning itself as a more “harness-friendly” alternative, potentially using this moment as a customer acquisition channel for disgruntled Claude power users.

By restricting subscription limits to their own “closed harness,” Anthropic is asserting control over the UI/UX layer. This allows them to collect telemetry and manage rate limits more granularly, but it risks alienating the power-user community that built the “agentic” ecosystem in the first place.

The Bottom Line

Anthropic’s decision is a cold calculation of margins versus growth. As Cherny noted, “Capacity is a resource we manage thoughtfully.” In the 2026 AI landscape, the era of subsidized, unlimited compute for third-party automation is over. For the average user on Claude.ai, the experience remains unchanged; for the power users running autonomous offices, the bell has tolled.
Karpathy shares ‘LLM Knowledge Base’ architecture that bypasses RAG with an evolving markdown library maintained by AI
AI vibe coders have yet another reason to thank Andrej Karpathy, the coiner of the term. The former Director of AI at Tesla and co-founder of OpenAI, now running his own independent AI project, recently posted on X describing an “LLM Knowledge Bases” approach he’s using to manage various topics of research interest. By building a persistent, LLM-maintained record of his projects, Karpathy is solving the core frustration of “stateless” AI development: the dreaded context-limit reset.

As anyone who has vibe coded can attest, hitting a usage limit or ending a session often feels like a lobotomy for your project. You’re forced to spend valuable tokens (and time) reconstructing context for the AI, hoping it “remembers” the architectural nuances you just established.

Karpathy proposes something simpler, and more loosely, messily elegant, than the typical enterprise solution of a vector database and RAG pipeline. Instead, he outlines a system where the LLM itself acts as a full-time “research librarian” — actively compiling, linting, and interlinking Markdown (.md) files, the most LLM-friendly and compact data format.

By diverting a significant portion of his “token throughput” into the manipulation of structured knowledge rather than boilerplate code, Karpathy has surfaced a blueprint for the next phase of the “Second Brain” — one that is self-healing, auditable, and entirely human-readable.

Beyond RAG

For the past three years, the dominant paradigm for giving LLMs access to proprietary data has been Retrieval-Augmented Generation (RAG). In a standard RAG setup, documents are chopped into arbitrary “chunks,” converted into mathematical vectors (embeddings), and stored in a specialized database. When a user asks a question, the system performs a “similarity search” to find the most relevant chunks and feeds them into the LLM.

Karpathy’s approach, which he calls LLM Knowledge Bases, rejects the complexity of vector databases for mid-sized datasets. Instead, it relies on the LLM’s increasing ability to reason over structured text.

The system architecture, as visualized by X user @himanshu as part of the wider reactions to Karpathy’s post, functions in three distinct stages (a minimal sketch follows the list):

- Data Ingest: Raw materials — research papers, GitHub repositories, datasets, and web articles — are dumped into a raw/ directory. Karpathy utilizes the Obsidian Web Clipper to convert web content into Markdown (.md) files, ensuring even images are stored locally so the LLM can reference them via vision capabilities.
- The Compilation Step: This is the core innovation. Instead of just indexing the files, the LLM “compiles” them. It reads the raw data and writes a structured wiki. This includes generating summaries, identifying key concepts, authoring encyclopedia-style articles, and — crucially — creating backlinks between related ideas.
- Active Maintenance (Linting): The system isn’t static. Karpathy describes running “health checks” or “linting” passes where the LLM scans the wiki for inconsistencies, missing data, or new connections. As community member Charly Wargnier observed, “It acts as a living AI knowledge base that actually heals itself.”

By treating Markdown files as the “source of truth,” Karpathy avoids the “black box” problem of vector embeddings. Every claim made by the AI can be traced back to a specific .md file that a human can read, edit, or delete.
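A minimal sketch of the compile-and-lint loop under those assumptions: the llm() function is a stand-in for whatever model API or CLI you prefer, and the prompts and paths are illustrative, not Karpathy’s actual scripts.

```python
from pathlib import Path

def llm(prompt: str) -> str:
    """Stand-in for a real model call (API or local CLI)."""
    raise NotImplementedError("wire up your model provider here")

RAW, WIKI = Path("raw"), Path("wiki")

def compile_article(source: Path) -> None:
    """Turn one raw capture into a structured, backlinked wiki article."""
    existing = sorted(p.stem for p in WIKI.glob("*.md"))
    article = llm(
        "Write an encyclopedia-style Markdown article from the source below.\n"
        "Start with a one-paragraph summary, list key concepts, and add\n"
        f"[[backlinks]] to any related existing articles: {existing}\n\n"
        f"SOURCE:\n{source.read_text()}"
    )
    (WIKI / source.name).write_text(article)

def lint_pass() -> str:
    """The 'health check': ask the model to audit the whole wiki."""
    corpus = "\n\n".join(p.read_text() for p in WIKI.glob("*.md"))
    return llm(
        "Review this wiki for contradictions, dead backlinks, and missing\n"
        "topics. Return a fix list as Markdown.\n\n" + corpus
    )

if __name__ == "__main__":
    WIKI.mkdir(exist_ok=True)
    for raw_file in RAW.glob("*.md"):
        compile_article(raw_file)
```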
Implications for the enterprise

While Karpathy’s setup is currently described as a “hacky collection of scripts,” the implications for the enterprise are immediate. As entrepreneur Vamshi Reddy (@tammireddy) noted in response to the announcement: “Every business has a raw/ directory. Nobody’s ever compiled it. That’s the product.”

Karpathy agreed, suggesting that this methodology represents an “incredible new product” category. Most companies currently “drown” in unstructured data — Slack logs, internal wikis, and PDF reports that no one has the time to synthesize. A “Karpathy-style” enterprise layer wouldn’t just search these documents; it would actively author a “Company Bible” that updates in real time.

As AI educator and newsletter author Ole Lehmann put it on X: “i think whoever packages this for normal people is sitting on something massive. one app that syncs with the tools you already use, your bookmarks, your read-later app, your podcast app, your saved threads.”

Eugen Alpeza, co-founder and CEO of AI enterprise agent builder and orchestration startup Edra, noted in an X post: “The jump from personal research wiki to enterprise operations is where it gets brutal. Thousands of employees, millions of records, tribal knowledge that contradicts itself across teams. Indeed, there is room for a new product and we’re building it in the enterprise.”

As the community explores the “Karpathy Pattern,” the focus is already shifting from personal research to multi-agent orchestration. A recent architectural breakdown by @jumperz, founder of AI agent creation platform Secondmate, illustrates this evolution through a “Swarm Knowledge Base” that scales the wiki workflow to a 10-agent system managed via OpenClaw. The core challenge of a multi-agent swarm — where one hallucination can compound and “infect” the collective memory — is addressed here by a dedicated “Quality Gate.” Using the Hermes model (trained by Nous Research for structured evaluation) as an independent supervisor, every draft article is scored and validated before being promoted to the “live” wiki.

This system creates a “Compound Loop”: agents dump raw outputs, the compiler organizes them, Hermes validates the truth, and verified briefings are fed back to agents at the start of each session. This ensures that the swarm never “wakes up blank,” but instead begins every task with a filtered, high-integrity briefing of everything the collective has learned.
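A hedged sketch of that quality gate: an independent judge model scores each draft, and only passing articles are promoted, so a single hallucination cannot silently enter the shared wiki. The threshold and scoring stub here are hypothetical, not @jumperz’s actual system.

```python
from pathlib import Path

def judge_score(article: str) -> float:
    """Stand-in for an independent evaluator model (e.g., a structured
    grader like the Hermes setup described above). Returns 0.0-1.0."""
    raise NotImplementedError("wire up the supervisor model here")

PROMOTION_THRESHOLD = 0.8  # hypothetical cutoff
DRAFTS, LIVE, REJECTED = Path("drafts"), Path("wiki"), Path("rejected")

def quality_gate() -> None:
    """Promote agent drafts to the live wiki only if the judge approves."""
    for draft in DRAFTS.glob("*.md"):
        score = judge_score(draft.read_text())
        target = LIVE if score >= PROMOTION_THRESHOLD else REJECTED
        target.mkdir(exist_ok=True)
        draft.rename(target / draft.name)
        print(f"{draft.name}: {score:.2f} -> {target.name}/")
```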
Scaling and performance

A common critique of non-vector approaches is scalability. However, Karpathy notes that at a scale of ~100 articles and ~400,000 words, the LLM's ability to navigate via summaries and index files is more than sufficient. For a departmental wiki or a personal research project, the "fancy RAG" infrastructure often introduces more latency and "retrieval noise" than it solves.

Tech podcaster Lex Fridman (@lexfridman) confirmed he uses a similar setup, adding a layer of dynamic visualization: "I often have it generate dynamic html (with js) that allows me to sort/filter data and to tinker with visualizations interactively. Another useful thing is I have the system generate a temporary focused mini-knowledge-base… that I then load into an LLM for voice-mode interaction on a long 7-10 mile run."

This "ephemeral wiki" concept suggests a future where users don't just "chat" with an AI; they spawn a team of agents to build a custom research environment for a specific task, which then dissolves once the report is written.

Licensing and the 'file-over-app' philosophy

Technically, Karpathy's methodology is built on an open standard (Markdown) but viewed through a proprietary-but-extensible lens (the note-taking and file organization app Obsidian).

Markdown (.md): By choosing Markdown, Karpathy ensures his knowledge base is not locked into a specific vendor. It is future-proof; if Obsidian disappears, the files remain readable by any text editor.

Obsidian: While Obsidian is a proprietary application, its "local-first" philosophy and EULA (which allows for free personal use and requires a license for commercial use) align with the developer's desire for data sovereignty.

The "Vibe-Coded" Tools: The search engines and CLI tools Karpathy mentions are custom scripts — likely Python-based — that bridge the gap between the LLM and the local file system.

This "file-over-app" philosophy is a direct challenge to SaaS-heavy models like Notion or Google Docs. In the Karpathy model, the user owns the data, and the AI is merely a highly sophisticated editor that "visits" the files to perform work.

Librarian vs. search engine

The AI community has reacted with a mix of technical validation and "vibe-coding" enthusiasm. The debate centers on whether the industry has over-indexed on vector DBs for problems that are fundamentally about structure, not just similarity.

Jason Paul Michaels (@SpaceWelder314), a welder using Claude, echoed the sentiment that simpler tools are often more robust: "No vector database. No embeddings… Just markdown, FTS5, and grep… Every bug fix… gets indexed. The knowledge compounds."

However, the most significant praise came from Steph Ango (@Kepano), co-creator of Obsidian, who highlighted a concept called "Contamination Mitigation." He suggested that users should keep their personal "vault" clean and let the agents play in a "messy vault," only bringing over the useful artifacts once the agent-facing workflow has distilled them.

Which solution is right for your enterprise vibe coding projects?

| Feature | Vector DB / RAG | Karpathy's Markdown Wiki |
|---|---|---|
| Data Format | Opaque Vectors (Math) | Human-Readable Markdown |
| Logic | Semantic Similarity (Nearest Neighbor) | Explicit Connections (Backlinks/Indices) |
| Auditability | Low (Black Box) | High (Direct Traceability) |
| Compounding | Static (Requires re-indexing) | Active (Self-healing through linting) |
| Ideal Scale | Millions of Documents | 100 – 10,000 High-Signal Documents |

The "Vector DB" approach is like a massive, unorganized warehouse with a very fast forklift driver. You can find anything, but you don't know why it's there or how it relates to the pallet next to it. Karpathy's "Markdown Wiki" is like a curated library with a head librarian who is constantly writing new books to explain the old ones.
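Michaels' "just markdown, FTS5, and grep" stack illustrates how little infrastructure the right-hand column of that table actually demands. SQLite ships with the FTS5 full-text extension, so a searchable index over a wiki is a few lines of standard-library Python; the schema and paths below are illustrative.

```python
# Full-text search over a Markdown wiki using SQLite's built-in FTS5
# extension: no embeddings, no external services. Schema and paths are
# illustrative; an incremental indexer would track file mtimes instead
# of re-indexing from scratch.
import sqlite3
from pathlib import Path

db = sqlite3.connect("wiki.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS notes USING fts5(path, body)")

# Index (or re-index) every Markdown file under wiki/.
db.execute("DELETE FROM notes")
for md in Path("wiki").rglob("*.md"):
    db.execute("INSERT INTO notes VALUES (?, ?)", (str(md), md.read_text()))
db.commit()

# Query: FTS5 ranks matches with BM25 via ORDER BY rank, which is exactly
# the fast-librarian behavior the non-vector camp argues suffices here.
for (path,) in db.execute(
    "SELECT path FROM notes WHERE notes MATCH ? ORDER BY rank LIMIT 5",
    ("context window",),
):
    print(path)
```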
The next phase

Karpathy's final exploration points toward the ultimate destination of this data: Synthetic Data Generation and Fine-Tuning. As the wiki grows and the data becomes more "pure" through continuous LLM linting, it becomes the perfect training set. Instead of the LLM just reading the wiki in its "context window," the user can eventually fine-tune a smaller, more efficient model on the wiki itself. This would allow the LLM to "know" the researcher's personal knowledge base in its own weights, essentially turning a personal research project into a custom, private intelligence.

Bottom line: Karpathy hasn't just shared a script; he's shared a philosophy. By treating the LLM as an active agent that maintains its own memory, he has bypassed the limitations of "one-shot" AI interactions. For the individual researcher, it means the end of the "forgotten bookmark." For the enterprise, it means the transition from a "raw/ data lake" to a "compiled knowledge asset." As Karpathy himself summarized: "You rarely ever write or edit the wiki manually; it's the domain of the LLM." We are entering the era of the autonomous archive.
Nvidia launches enterprise AI agent platform with Adobe, Salesforce, SAP among 17 adopters at GTC 2026
Jensen Huang walked onto the GTC stage Monday wearing his trademark leather jacket and carrying, as it turned out, the blueprints for a new kind of industry dominance.

The Nvidia CEO unveiled the Agent Toolkit, an open-source platform for building autonomous AI agents, and then rattled off the names of the companies that will use it: Adobe, Salesforce, SAP, ServiceNow, Siemens, CrowdStrike, Atlassian, Cadence, Synopsys, IQVIA, Palantir, Box, Cohesity, Dassault Systèmes, Red Hat, Cisco and Amdocs. Seventeen enterprise software companies, touching virtually every industry and every Fortune 500 corporation, all agreeing to build their next generation of AI products on a shared foundation that Nvidia designed, Nvidia optimizes and Nvidia maintains.

The toolkit provides the models, the runtime, the security framework and the optimization libraries that AI agents need to operate autonomously inside organizations — resolving customer service tickets, designing semiconductors, managing clinical trials, orchestrating marketing campaigns. Each component is open source. Each is optimized for Nvidia hardware. The combination means that as AI agents proliferate across the corporate world, they will generate demand for Nvidia GPUs not because companies choose to buy them but because the software they depend on was engineered to require them.

"The enterprise software industry will evolve into specialized agentic platforms," Huang told the crowd, "and the IT industry is on the brink of its next great expansion." What he left unsaid is that Nvidia has just positioned itself as the tollbooth at the entrance to that expansion — open to all, owned by one.

Inside Nvidia's Agent Toolkit: the software stack designed to power every corporate AI worker

To grasp the significance of Monday's announcements, it helps to understand the problem Nvidia is solving. Building an enterprise AI agent today is an exercise in frustration. A company that wants to deploy an autonomous system — one that can, say, monitor a telecommunications network and proactively resolve customer issues before anyone calls to complain — must assemble a language model, a retrieval system, a security layer, an orchestration framework and a runtime environment, typically from different vendors whose products were never designed to work together.

Nvidia's Agent Toolkit collapses that complexity into a unified platform. It includes Nemotron, a family of open models optimized for agentic reasoning; AI-Q, an open blueprint that lets agents perceive, reason and act on enterprise knowledge; OpenShell, an open-source runtime enforcing policy-based security, network and privacy guardrails; and cuOpt, an optimization skill library. Developers can use the toolkit to create specialized AI agents that act autonomously while using and building other software to complete tasks.

The AI-Q component addresses a pain point that has dogged enterprise AI adoption: cost. Its hybrid architecture routes complex orchestration tasks to frontier models while delegating research tasks to Nemotron's open models, which Nvidia says can cut query costs by more than 50 percent while maintaining top-tier accuracy.
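The routing idea behind that cost claim is simple enough to sketch. The snippet below shows the general cost-tiering pattern the AI-Q description implies, not AI-Q's actual implementation: the model names, the classifier, and call_model() are all illustrative stand-ins.

```python
# Sketch of hybrid cost-tiered routing: expensive orchestration goes to a
# frontier model, high-volume research steps to a cheaper open model.
# Everything named here is an illustrative stand-in, not the AI-Q API.
FRONTIER_MODEL = "frontier-orchestrator"  # e.g., a hosted proprietary model
OPEN_MODEL = "nemotron-research"          # e.g., an open reasoning model

def classify_complexity(task: str) -> str:
    """Stand-in router: real systems use a trained classifier
    or explicit task types rather than a keyword check."""
    return "orchestration" if "plan" in task.lower() else "research"

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # wire up to your serving stack of choice

def route(task: str) -> str:
    tier = classify_complexity(task)
    model = FRONTIER_MODEL if tier == "orchestration" else OPEN_MODEL
    return call_model(model, task)

# The savings come from volume: research calls typically dominate token
# throughput, so shifting them to the cheaper tier cuts the average cost.
```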
Nvidia used the AI-Q Blueprint to build what it claims is the top-ranking AI agent on both the DeepResearch Bench and DeepResearch Bench II leaderboards — benchmarks that, if they hold under independent validation, position the toolkit as not merely convenient but competitively necessary.

OpenShell tackles what has been the single biggest obstacle in every boardroom conversation about letting AI agents loose inside corporate systems: trust. The runtime creates isolated sandboxes that enforce strict policies around data access, network reach and privacy boundaries. Nvidia is collaborating with Cisco, CrowdStrike, Google, Microsoft Security and TrendAI to integrate OpenShell with their existing security tools — a calculated move that enlists the cybersecurity industry as a validation layer for Nvidia's approach rather than a competing one.
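The sandbox idea follows a well-worn pattern: declare up front what an agent may touch, and have the runtime check every action against that declaration before executing it. The sketch below shows that pattern generically; it is not OpenShell's actual configuration format or API.

```python
# Generic policy-sandbox pattern: a declared allowlist, checked before
# every agent action. Illustrative only; OpenShell's real format and
# enforcement mechanisms are not public in this level of detail.
from dataclasses import dataclass, field

@dataclass
class AgentPolicy:
    allowed_paths: set[str] = field(default_factory=set)  # data access
    allowed_hosts: set[str] = field(default_factory=set)  # network reach
    redact_fields: set[str] = field(default_factory=set)  # privacy boundary

class PolicyViolation(Exception):
    pass

def guard_file_read(policy: AgentPolicy, path: str) -> None:
    if not any(path.startswith(root) for root in policy.allowed_paths):
        raise PolicyViolation(f"agent may not read {path}")

def guard_http(policy: AgentPolicy, host: str) -> None:
    if host not in policy.allowed_hosts:
        raise PolicyViolation(f"agent may not call {host}")

# Example: a customer-service agent confined to ticket data and one API.
policy = AgentPolicy(
    allowed_paths={"/data/tickets/"},
    allowed_hosts={"api.internal.example"},
    redact_fields={"ssn", "card_number"},
)
guard_file_read(policy, "/data/tickets/12345.json")  # passes
# guard_http(policy, "evil.example")                 # would raise
```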
The partner list that reads like the Fortune 500: who signed on and what they're building

The breadth of Monday's enterprise adoption announcements reveals Nvidia's ambitions more clearly than any specification sheet could.

Adobe, in a simultaneously announced strategic partnership, will adopt Agent Toolkit software as the foundation for running hybrid, long-running creativity, productivity and marketing agents. Shantanu Narayen, Adobe's chair and CEO, said the companies will bring together "our Firefly models, CUDA libraries into our applications, 3D digital twins for marketing, and Agent Toolkit and Nemotron to our agentic frameworks to deliver high-quality, controllable and enterprise-grade AI workflows of the future." The partnership runs deep: Adobe will explore OpenShell and Nemotron as foundations for personalized, secure agentic loops, and will evaluate the toolkit for large-scale workflows powered by Adobe Experience Platform. Nvidia will provide engineering expertise, early access to software and targeted go-to-market support.

Salesforce's integration may be the one enterprise IT leaders parse most carefully. The company is working with Nvidia Agent Toolkit software including Nemotron models, enabling customers to build, customize and deploy AI agents using Agentforce for service, sales and marketing. The collaboration introduces a reference architecture where employees can use Slack as the primary conversational interface and orchestration layer for Agentforce agents — powered by Nvidia infrastructure — that participate directly in business workflows and pull from data stores in both on-premises and cloud environments. For the millions of knowledge workers who already conduct their professional lives inside Slack, this turns a messaging app into the command center for corporate AI.

SAP, whose software underpins the financial and operational plumbing of most Global 2000 companies, is using open Agent Toolkit software including NeMo to enable AI agents through Joule Studio on SAP Business Technology Platform, letting customers and partners design agents tailored to their own business needs. ServiceNow's Autonomous Workforce of AI Specialists leverages Agent Toolkit software, the AI-Q Blueprint and a combination of closed and open models, including Nemotron and ServiceNow's own Apriel models — a hybrid approach that suggests the toolkit is designed not to replace existing AI investments but to become the connective tissue between them.

From chip design to clinical trials: how agentic AI is reshaping specialized industries

The partner list extends well beyond horizontal software platforms into deeply specialized verticals where autonomous agents could compress timelines measured in years.

In semiconductor design — where a single advanced chip can cost billions of dollars and take half a decade to develop — three of the four major electronic design automation companies are building agents on Nvidia's stack. Cadence will leverage Agent Toolkit and Nemotron with its ChipStack AI SuperAgent for semiconductor design and verification. Siemens is launching its Fuse EDA AI Agent, which uses Nemotron to autonomously orchestrate workflows across its entire electronic design automation portfolio, from design conception through manufacturing sign-off. Synopsys is building a multi-agent framework powered by its AgentEngineer technology using Nemotron and the Nemo Agent Toolkit.

Healthcare and life sciences present perhaps the most consequential use case. IQVIA is integrating Nemotron and other Agent Toolkit software with IQVIA.ai, a unified agentic AI platform designed to help life sciences organizations work more efficiently across clinical, commercial and real-world operations. The scale is already significant: IQVIA has deployed more than 150 agents across internal teams and client environments, including 19 of the top 20 pharmaceutical companies.

The security sector is embedding itself into the architecture from the ground floor. CrowdStrike unveiled a Secure-by-Design AI Blueprint that embeds its Falcon platform protection directly into Nvidia AI agent architectures — including agents built on AI-Q and OpenShell — and is advancing agentic managed detection and response using Nemotron reasoning models. Cisco AI Defense will provide AI security protection for OpenShell, adding controls and guardrails to govern agent actions. These are not aftermarket bolt-ons; they are foundational integrations that signal the security industry views Nvidia's agent platform as the substrate it needs to protect.

Dassault Systèmes is exploring Agent Toolkit software and Nemotron for its role-based AI agents, called Virtual Companions, on its 3DEXPERIENCE agentic platform. Atlassian is working with the toolkit as it evolves its Rovo AI agentic strategy for Jira and Confluence. Box is using it to enable enterprise agents to securely execute long-running business processes. Palantir is developing AI agents on Nemotron that run on its sovereign AI Operating System Reference Architecture.

The open-source gambit: why giving software away is Nvidia's most aggressive business move

There is something almost paradoxical about a company with a multi-trillion-dollar market capitalization giving away its most strategically important software. But Nvidia's open-source approach to Agent Toolkit is less an act of generosity than a carefully constructed competitive moat. OpenShell is open source. Nemotron models are open. AI-Q blueprints are publicly available.
LangChain, the agent engineering company whose open-source frameworks have been downloaded over 1 billion times, is working with Nvidia to integrate Agent Toolkit components into the LangChain deep agent library for developing advanced, accurate enterprise AI agents at scale. When the most popular independent framework for building AI agents absorbs your toolkit, you have transcended the category of vendor and entered the category of infrastructure.

But openness in AI has a way of being strategically selective. The models are open, but they are optimized for Nvidia's CUDA libraries — the proprietary software layer that has locked developers into Nvidia GPUs for two decades. The runtime is open, but it integrates most deeply with Nvidia's security partners. The blueprints are open, but they perform best on Nvidia hardware. Developers can explore Agent Toolkit and OpenShell on build.nvidia.com today, running on inference providers and Nvidia Cloud Partners including Baseten, CoreWeave, DeepInfra, DigitalOcean and others — all of which run Nvidia GPUs.

The strategy has a historical analog in Google's approach to Android: give away the operating system to ensure that the entire mobile ecosystem generates demand for your core services. Nvidia is giving away the agent operating system to ensure that the entire enterprise AI ecosystem generates demand for its core product — the GPU. Every Salesforce agent running Nemotron, every SAP workflow orchestrated through OpenShell, every Adobe creative pipeline accelerated by CUDA creates another strand of dependency on Nvidia silicon.

This also explains the Nemotron Coalition announced Monday — a global collaboration of model builders including Mistral AI, Cursor, LangChain, Perplexity, Reflection AI, Sarvam and Thinking Machines Lab, all working to advance open frontier models. The coalition's first project will be a base model codeveloped by Mistral AI and Nvidia, trained on Nvidia DGX Cloud, that will underpin the upcoming Nemotron 4 family. By seeding the open model ecosystem with Nvidia-optimized foundations, the company ensures that even models it does not build will run best on its hardware.

What could go wrong: the risks enterprise buyers should weigh before going all-in

For all the ambition on display Monday, several realities temper the narrative. Adoption announcements are not deployment announcements. Many of the partner disclosures use carefully hedged language — "exploring," "evaluating," "working with" — that is standard in embargoed press releases but should not be confused with production systems serving millions of users. Adobe's own forward-looking statements note that "due to the non-binding nature of the agreement, there are no assurances that Adobe will successfully negotiate and execute definitive documentation with Nvidia on favorable terms or at all." The gap between a GTC keynote demonstration and an enterprise-grade rollout remains substantial.

Nvidia is not the only company chasing this market. Microsoft, with its Copilot ecosystem and Azure AI infrastructure, pursues a parallel strategy with the advantage of owning the operating systems and productivity software that most enterprises already use. Google, through Gemini and its cloud platform, has its own agent vision. Amazon, via Bedrock and AWS, is building comparable primitives.
The question is not whether enterprise AI agents will be built on some platform but whether the market will consolidate around one stack or fragment across several.

The security claims, while architecturally sound, remain unproven at scale. OpenShell's policy-based guardrails are a promising design pattern, but autonomous agents operating in complex enterprise environments will inevitably encounter edge cases that no policy framework has anticipated. CrowdStrike's Secure-by-Design AI Blueprint and Cisco AI Defense's OpenShell integration are exactly the kind of layered defense enterprise buyers will demand — but both are newly unveiled, not battle-hardened through years of adversarial testing. Deploying agents that can autonomously access data, execute code and interact with production systems introduces a threat surface that the industry has barely begun to map.

And there is the question of whether enterprises are ready for agents at all. The technology may be available, but organizational readiness — the governance structures, the change management, the regulatory frameworks, the human trust — often lags years behind what the platforms can deliver.

Beyond agents: the full scope of what Nvidia announced at GTC 2026

Monday's Agent Toolkit announcement did not arrive in isolation. It landed amid an avalanche of product launches that, taken together, describe a company remaking itself at every layer of the computing stack.

Nvidia unveiled the Vera Rubin platform — seven new chips in full production, including the Vera CPU purpose-built for agentic AI, the Rubin GPU, and the newly integrated Groq 3 LPU inference accelerator — designed to power every phase of AI from pretraining to real-time agentic inference. The Vera Rubin NVL72 rack integrates 72 Rubin GPUs and 36 Vera CPUs, delivering what Nvidia claims is up to 10x higher inference throughput per watt at one-tenth the cost per token compared with the Blackwell platform. Dynamo 1.0, an open-source inference operating system that Nvidia describes as the "operating system for AI factories," entered production with adoption from AWS, Microsoft Azure, Google Cloud and Oracle Cloud Infrastructure alongside companies like Cursor, Perplexity, PayPal and Pinterest.

The BlueField-4 STX storage architecture promises up to 5x token throughput for the long-context reasoning that agents demand, with early adopters including CoreWeave, Crusoe, Lambda, Mistral AI and Nebius. BYD, Geely, Isuzu and Nissan announced Level 4 autonomous vehicle programs on Nvidia's DRIVE Hyperion platform, and Uber disclosed plans to launch Nvidia-powered robotaxis across 28 cities and four continents by 2028, beginning with Los Angeles and San Francisco in the first half of 2027.

Roche, the pharmaceutical giant, announced it is deploying more than 3,500 Nvidia Blackwell GPUs across hybrid cloud and on-premises environments in the U.S. and Europe — what it calls the largest announced GPU footprint available to a pharmaceutical company. Nvidia also launched physical AI tools for healthcare robotics, with CMR Surgical, Johnson & Johnson MedTech and others adopting the platform, and released Open-H, the world's largest healthcare robotics dataset, with over 700 hours of surgical video.
And Nvidia even announced a Space Module based on the Vera Rubin architecture, promising to bring data-center-class AI to orbital environments.

The real meaning of GTC 2026: Nvidia is no longer selling picks and shovels

Strip away the product specifications and benchmark claims and what emerges from GTC 2026 is a single, clarifying thesis: Nvidia believes the era of AI agents will be larger than the era of AI models, and it intends to own the platform layer of that transition the way it already owns the hardware layer of the current one.

The 17 enterprise software companies that signed on Monday are making a bet of their own. They are wagering that building on Nvidia's agent infrastructure will let them move faster than building alone — and that the benefits of a shared platform outweigh the risks of shared dependency. For Salesforce, it means Agentforce agents that can draw from both cloud and on-premises data through a single Slack interface. For Adobe, it means creative AI pipelines that span image, video, 3D and document intelligence. For SAP, it means agents woven into the transactional fabric of global commerce. Each partnership is rational on its own terms. Together, they form something larger: an industry-wide endorsement of Nvidia as the default substrate for enterprise intelligence.

Huang, who began his career designing graphics chips for video games, closed his keynote by gesturing toward a future in which AI agents do not just assist human workers but operate as autonomous colleagues — reasoning through problems, building their own tools, learning from their mistakes. He compared the moment to the birth of the personal computer, the dawn of the internet, the rise of mobile computing.

Technology executives have a professional obligation to describe every product cycle as a revolution. But here is what made Monday different: this time, 17 of the world's most important software companies showed up to agree with him. Whether they did so out of conviction or out of a calculated fear of being left behind may be the most important question in enterprise technology — and it is one that only the next few years can answer.
Arcee’s new, open source Trinity-Large-Thinking is the rare, powerful U.S.-made AI model that enterprises can download and customize
The baton of open source AI models has been passed between several companies in the years since ChatGPT debuted in late 2022, from Meta with its Llama family to Chinese labs like Qwen and z.ai. But lately, Chinese companies have started pivoting back toward proprietary models even as some U.S. labs like Cursor and Nvidia release their own variants of the Chinese models, leaving a question mark over who will originate this branch of technology going forward.

One answer: Arcee, a San Francisco-based lab, which this week released Trinity-Large-Thinking — a 399-billion-parameter, text-only reasoning model released under the uncompromisingly open Apache 2.0 license, allowing full customizability and commercial usage by anyone from indie developers to large enterprises. The release represents more than just a new set of weights on AI code sharing community Hugging Face; it is a strategic bet that "American Open Weights" can provide a sovereign alternative to the increasingly closed or restricted frontier models of 2025. The move arrives precisely as enterprises express growing discomfort with relying on Chinese-based architectures for critical infrastructure, creating demand for a domestic champion that Arcee intends to fill.

As Clément Delangue, co-founder and CEO of Hugging Face, told VentureBeat in a direct message on X: "The strength of the US has always been its startups so maybe they're the ones we should count on to lead in open-source AI. Arcee shows that it's possible!"

Genesis of a 30-person frontier lab

To understand the weight of the Trinity release, one must understand the lab that built it. Based in San Francisco, Arcee AI is a lean team of only 30 people. While competitors like OpenAI and Google operate with thousands of engineers and multibillion-dollar compute budgets, Arcee has defined itself through what CTO Lucas Atkins calls "engineering through constraint."

The company first made waves in 2024 after securing a $24 million Series A led by Emergence Capital, bringing its total capital to just under $50 million. In early 2026, the team took a massive risk: it committed $20 million — nearly half its total funding — to a single 33-day training run for Trinity Large. Utilizing a cluster of 2,048 NVIDIA B300 Blackwell GPUs, which provided twice the speed of the previous Hopper generation, Arcee bet the company's future on the belief that developers needed a frontier model they could truly own. This bet-the-company move was a masterclass in capital efficiency, proving that a small, focused team could stand up a full pipeline and stabilize training without endless reserves.

Engineering through extreme architectural constraint

Trinity-Large-Thinking is noteworthy for the extreme sparsity of its Mixture-of-Experts design. While the model houses roughly 400 billion total parameters, only 13 billion — about 3 percent — are active for any given token. This allows the model to possess the deep knowledge of a massive system while maintaining the inference speed and operational efficiency of a much smaller one, performing roughly 2 to 3 times faster than its peers on the same hardware.

Training such a sparse model presented significant stability challenges. To prevent a few experts from becoming "winners" while others remain untrained "dead weight," Arcee developed SMEBU, or Soft-clamped Momentum Expert Bias Updates, a mechanism that keeps token routing balanced across experts so that each specializes as training proceeds over a general web corpus.
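Arcee has not published SMEBU's exact formula, but the name situates it in a known family of auxiliary-loss-free load balancers: track per-expert load, nudge a routing bias toward starved experts with a momentum term, and soft-clamp the bias so it stays bounded. The sketch below illustrates that general family only; every constant and the clamping function are guesses for illustration, not Arcee's algorithm.

```python
# Illustrative bias-update load balancing for MoE routing, in the general
# family SMEBU's name suggests. The momentum form, learning rate, and
# tanh soft clamp are assumptions, not Arcee's published method.
import numpy as np

def balance_step(bias, expert_load, momentum_buf,
                 lr=0.01, beta=0.9, clamp=1.0):
    """One update. expert_load: fraction of tokens routed to each expert
    this step; the target is uniform load across all experts."""
    target = 1.0 / len(bias)
    raw_update = target - expert_load            # positive for starved experts
    momentum_buf = beta * momentum_buf + (1 - beta) * raw_update
    bias = bias + lr * momentum_buf
    bias = clamp * np.tanh(bias / clamp)         # soft clamp to (-clamp, clamp)
    return bias, momentum_buf

# Routing then adds the bias to the gate logits before top-k selection:
#   scores = gate_logits + bias
#   chosen = np.argsort(scores)[-k:]
# Overloaded experts accumulate negative bias and gradually lose traffic,
# so no expert collapses into untrained "dead weight."
```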
The architecture also incorporates a hybrid attention scheme, alternating local sliding-window and global attention layers in a 3:1 ratio to maintain performance in long-context scenarios.

The data curriculum and synthetic reasoning

Arcee's partnership with fellow startup DatologyAI provided a curriculum of over 10 trillion curated tokens. However, the training corpus for the full-scale model was expanded to 20 trillion tokens, split evenly between curated web data and high-quality synthetic data. Unlike typical imitation-based synthetic data, where a smaller model simply learns to mimic a larger one, DatologyAI utilized techniques to synthetically rewrite raw web text — such as Wikipedia articles or blogs — to condense the information. This process helped the model learn to reason over concepts and information rather than merely memorizing exact token strings.

To ensure regulatory compliance, tremendous effort was invested in excluding copyrighted books and materials with unclear licensing, an approach that appeals to enterprise customers wary of the intellectual property risks associated with mainstream LLMs. This data-first approach allowed the model to scale cleanly while significantly improving performance on complex tasks like mathematics and multi-step agent tool use.

The pivot from yappy chatbots to reasoning agents

The defining feature of this official release is the transition from a standard "instruct" model to a "reasoning" model. By implementing a "thinking" phase prior to generating a response — similar to the internal loops found in the earlier Trinity-Mini — Arcee has addressed the primary criticism of its January "Preview" release. Early users of the Preview model had noted that it sometimes struggled with multi-step instructions in complex environments and could be "underwhelming" for agentic tasks.

The "Thinking" update effectively bridges this gap, enabling what Arcee calls "long-horizon agents" that can maintain coherence across multi-turn tool calls without getting "sloppy." This reasoning process enables better context coherence and cleaner instruction following under constraint, and it has direct implications for Maestro Reasoning, a 32B-parameter derivative of Trinity already being used in audit-focused industries to provide transparent "thought-to-answer" traces. The goal was to move beyond "yappy," inefficient chatbots toward reliable, cheap, high-quality agents that stay stable across long-running loops.

Geopolitics and the case for American open weights

The significance of Arcee's Apache 2.0 commitment is amplified by the retreat of its primary competitors from the open-weight frontier. Throughout 2025, Chinese research labs like Alibaba's Qwen and z.ai (aka Zhipu) set the pace for high-efficiency MoE architectures. However, entering 2026, those labs have begun to shift toward proprietary enterprise platforms and specialized subscriptions, signaling a move away from pure community growth. The fragmentation of these once-prolific teams, such as the departure of key technical leads from Alibaba's Qwen lab, has left a void at the high end of the open-weight market. In the United States, the movement has faced its own crisis.
Meta's Llama division notably retreated from the frontier landscape following the mixed reception of Llama 4 in April 2025, which faced reports of quality issues and benchmark manipulation. For developers who relied on the Llama 3 era of dominance, the lack of a current 400B+ open model created an urgent need for an alternative, one Arcee has risen to fill.

Benchmarks: how Arcee's Trinity-Large-Thinking stacks up against other U.S. frontier open source AI model offerings

Trinity-Large-Thinking's performance on agent-specific evaluations establishes it as a legitimate frontier contender. On PinchBench, a critical metric for evaluating model capability on autonomous agentic tasks, Trinity achieved a score of 91.9, placing it just behind the proprietary market leader, Claude Opus 4.6 (93.3). This competitiveness is mirrored on IFBench, where Trinity's score of 52.3 sits in a near-dead heat with Opus 4.6's 53.1, indicating that the reasoning-first "Thinking" update has successfully addressed the instruction-following hurdles that challenged the model's earlier preview phase.

The model's broader technical reasoning capabilities also place it at the high end of the current open-source market. It recorded a 96.3 on AIME25, matching the high-tier Kimi-K2.5 and outstripping other major competitors like GLM-5 (93.3) and MiniMax-M2.7 (80.0). While high-end coding benchmarks like SWE-bench Verified still show a lead for top-tier closed-source models — with Trinity scoring 63.2 against Opus 4.6's 75.6 — the massive delta in cost per token positions Trinity as the more viable sovereign infrastructure layer for enterprises looking to deploy these capabilities at production scale.

Among other U.S. open source frontier model offerings, OpenAI's gpt-oss tops out at 120 billion parameters; Google's Gemma (Gemma 4 was just released this week) and IBM's Granite family are also worth a mention, despite lower benchmark scores. Nvidia's Nemotron family is notable as well, but consists of fine-tuned and post-trained Qwen variants.

| Benchmark | Arcee Trinity-Large | gpt-oss-120B (High) | IBM Granite 4.0 | Google Gemma 4 |
|---|---|---|---|---|
| GPQA-D | 76.3% | 80.1% | 74.8% | 84.3% |
| Tau2-Airline | 88.0% | 65.8%* | 68.3% | 76.9% |
| PinchBench | 91.9% | 69.0% (IFBench) | 89.1% | 93.3% |
| AIME25 | 96.3% | 97.9% | 88.5% | 89.2% |
| MMLU-Pro | 83.4% | 90.0% (MMLU) | 81.2% | 85.2% |

So how is an enterprise supposed to choose between all these?

Arcee Trinity-Large-Thinking is the premier choice for organizations building autonomous agents; its sparse 400B architecture excels at "thinking" through multi-step logic, complex math, and long-horizon tool use. By activating only a fraction of its parameters, it provides a high-speed reasoning engine for developers who need GPT-4o-level planning capabilities within a cost-effective, open-source framework.

Conversely, gpt-oss-120B serves as the optimal middle ground for enterprises that require high reasoning performance but prioritize lower operational costs and deployment flexibility. Because it activates only 5.1B parameters per forward pass, it is uniquely suited for technical workloads like competitive code generation and advanced mathematical modeling that must run on limited hardware, such as a single H100 GPU. Its configurable reasoning effort — offering "Low," "Medium," and "High" modes — makes it the best fit for production environments where latency and accuracy must be balanced dynamically across different tasks.

For broader, high-throughput applications, Google Gemma 4 and IBM Granite 4.0 serve as the primary backbones.
Gemma 4 offers the highest "intelligence density" for general knowledge and scientific accuracy, making it the most versatile option for R&D and high-speed chat interfaces. Meanwhile, IBM Granite 4.0 is engineered for the "all-day" enterprise workload, utilizing a hybrid architecture that eliminates context bottlenecks for massive document processing. For businesses concerned with legal compliance and hardware efficiency, Granite remains the most reliable foundation for large-scale RAG and document analysis.

Ownership as a feature for regulated industries

In this climate, Arcee's choice of the Apache 2.0 license is a deliberate act of differentiation. Unlike the restrictive community licenses used by some competitors, Apache 2.0 allows enterprises to truly own their intelligence stack without the "black box" biases of a general-purpose chat model. "Developers and Enterprises need models they can inspect, post-train, host, distill, and own," Lucas Atkins noted in the launch announcement. This ownership is critical for the "bitter lesson" of training small models: you usually need to train a massive frontier model first to generate the high-quality synthetic data and logits required to build efficient student models.

Furthermore, Arcee has released Trinity-Large-TrueBase, a raw checkpoint taken after 10 trillion tokens of pretraining. TrueBase offers a rare, "unspoiled" look at foundational intelligence before instruction tuning and reinforcement learning are applied. For researchers in highly regulated industries like finance and defense, TrueBase allows for authentic audits and custom alignments starting from a clean slate.

Community verdict and the future of distillation

The response from the developer community has been largely positive, reflecting the appetite for more open-weight, U.S.-made models. On X, researchers highlighted the disruption, noting that the "insanely cheap" prices for a model of this size would be a boon for the agentic community. On open AI model inference website OpenRouter, Trinity-Large-Preview established itself as the #1 most used open model in the U.S., serving over 80.6 billion tokens on peak days like March 1, 2026. The proximity of Trinity-Large-Thinking to Claude Opus 4.6 on PinchBench — 91.9 versus 93.3 — is particularly striking when compared to the cost: at $0.90 per million output tokens, Trinity is approximately 96% cheaper than Opus 4.6, which costs $25 per million output tokens.

Arcee's strategy is now focused on bringing these pretraining and post-training lessons back down the stack. Much of the work that went into Trinity Large will now flow into the Mini and Nano models, refreshing the company's compact line with the distillation of frontier-level reasoning. As global labs pivot toward proprietary lock-in, Arcee has positioned Trinity as a sovereign infrastructure layer that developers can finally control and adapt for long-horizon agentic workflows.
Google releases Gemma 4 under Apache 2.0 — and that license change may matter more than benchmarks
For the past two years, enterprises evaluating open-weight models have faced an awkward trade-off. Google's Gemma line consistently delivered strong performance, but its custom license — with usage restrictions and terms Google could update at will — pushed many teams toward Mistral or Alibaba's Qwen instead. Legal review added friction. Compliance teams flagged edge cases. And capable as Gemma 3 was, "open" with asterisks isn't the same as open.

Gemma 4 eliminates that friction entirely. Google DeepMind's newest open model family ships under a standard Apache 2.0 license — the same permissive terms used by Qwen, Mistral, Arcee, and most of the open-weight ecosystem. No custom clauses, no "Harmful Use" carve-outs that required legal interpretation, no restrictions on redistribution or commercial deployment. For enterprise teams that had been waiting for Google to play on the same licensing terms as the rest of the field, the wait is over.

The timing is notable. As some Chinese AI labs (most notably Alibaba's latest Qwen models, Qwen3.5 Omni and Qwen 3.6 Plus) have begun pulling back from fully open releases for their latest models, Google is moving in the opposite direction — opening up its most capable Gemma release yet while explicitly stating the architecture draws from its commercial Gemini 3 research.

Four models, two tiers: Edge to workstation in a single family

Gemma 4 arrives as four distinct models organized into two deployment tiers. The "workstation" tier includes a 31B-parameter dense model and a 26B A4B Mixture-of-Experts model — both supporting text and image input with 256K-token context windows. The "edge" tier consists of the E2B and E4B, compact models designed for phones, embedded devices, and laptops, supporting text, image, and audio with 128K-token context windows.

The naming convention takes some unpacking. The "E" prefix denotes "effective parameters" — the E2B has 2.3 billion effective parameters but 5.1 billion total, because each decoder layer carries its own small embedding table through a technique Google calls Per-Layer Embeddings (PLE). These tables are large on disk but cheap to compute, which is why the model runs like a 2B while technically weighing more. The "A" in 26B A4B stands for "active parameters" — only 3.8 billion of the MoE model's 25.2 billion total parameters activate during inference, meaning it delivers roughly 26B-class intelligence with compute costs comparable to a 4B model.

For IT leaders sizing GPU requirements, this translates directly to deployment flexibility. The MoE model can run on consumer-grade GPUs and should appear quickly in tools like Ollama and LM Studio. The 31B dense model requires more headroom — think an NVIDIA H100 or RTX 6000 Pro for unquantized inference — but Google is also shipping Quantization-Aware Training (QAT) checkpoints to maintain quality at lower precision. On Google Cloud, both workstation models can now run in a fully serverless configuration via Cloud Run with NVIDIA RTX Pro 6000 GPUs, spinning down to zero when idle.

The MoE bet: 128 small experts to save on inference costs

The architectural choices inside the 26B A4B model deserve particular attention from teams evaluating inference economics. Rather than following the pattern of recent large MoE models that use a handful of big experts, Google went with 128 small experts, activating eight per token plus one shared always-on expert.
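The routing arithmetic is easy to make concrete. The toy forward pass below shows top-8 selection over 128 small experts plus an always-on shared expert, following standard MoE practice; the shapes, softmax weighting, and initialization are illustrative, not Gemma 4's actual implementation.

```python
# Toy MoE forward pass: 128 small experts, top-8 routing per token, plus
# one shared always-on expert. Standard-practice illustration only, not
# Gemma 4's real code; dimensions are shrunk for readability.
import numpy as np

N_EXPERTS, TOP_K, D = 128, 8, 64
rng = np.random.default_rng(0)
experts = [rng.standard_normal((D, D)) * 0.02 for _ in range(N_EXPERTS)]
shared_expert = rng.standard_normal((D, D)) * 0.02
router = rng.standard_normal((D, N_EXPERTS)) * 0.02

def moe_forward(x):                        # x: one token's hidden state, (D,)
    logits = x @ router                    # score all 128 experts
    top = np.argsort(logits)[-TOP_K:]      # pick the 8 highest-scoring
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, top))
    return out + x @ shared_expert         # shared expert always contributes

y = moe_forward(rng.standard_normal(D))
# Only 8 of 128 expert matmuls ran for this token: that is the mechanism
# behind roughly 26B-class capacity at roughly 4B-class compute.
```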
The result is a model that benchmarks competitively with dense models in the 27B–31B range while running at roughly the speed of a 4B model during inference. This is not just a benchmark curiosity — it directly affects serving costs. A model that delivers 27B-class reasoning at 4B-class throughput means fewer GPUs, lower latency, and cheaper per-token inference in production. For organizations running coding assistants, document processing pipelines, or multi-turn agentic workflows, the MoE variant may be the most practical choice in the family.

Both workstation models use a hybrid attention mechanism that interleaves local sliding window attention with full global attention, with the final layer always global. This design enables the 256K context window while keeping memory consumption manageable — an important consideration for teams processing long documents, codebases, or multi-turn agent conversations.

Native multimodality: Vision, audio, and function calling baked in from scratch

Previous generations of open models typically treated multimodality as an add-on. Vision encoders were bolted onto text backbones. Audio required an external ASR pipeline like Whisper. Function calling relied on prompt engineering and hoping the model cooperated. Gemma 4 integrates all of these capabilities at the architecture level.

All four models handle variable aspect-ratio image input with configurable visual token budgets — a meaningful improvement over Gemma 3n's older vision encoder, which struggled with OCR and document understanding. The new encoder supports budgets from 70 to 1,120 tokens per image, letting developers trade off detail against compute depending on the task. Lower budgets work for classification and captioning; higher budgets handle OCR, document parsing, and fine-grained visual analysis. Multi-image and video input (processed as frame sequences) are supported natively, enabling visual reasoning across multiple documents or screenshots.

The two edge models add native audio processing — automatic speech recognition and speech-to-translated-text, all on-device. The audio encoder has been compressed to 305 million parameters, down from 681 million in Gemma 3n, while the frame duration dropped from 160ms to 40ms for more responsive transcription. For teams building voice-first applications that need to keep data local — think healthcare, field service, or multilingual customer interaction — running ASR, translation, reasoning, and function calling in a single model on a phone or edge device is a genuine architectural simplification.

Function calling is also native across all four models, drawing on research from Google's FunctionGemma release late last year. Unlike previous approaches that relied on instruction-following to coax models into structured tool use, Gemma 4's function calling was trained into the model from the ground up — optimized for multi-turn agentic flows with multiple tools. This shows up in agentic benchmarks, but more importantly, it reduces the prompt engineering overhead that enterprise teams typically invest when building tool-using agents.
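What native function calling buys in practice is a more reliable version of the standard multi-turn tool loop, sketched below in generic form: the model either answers or emits a structured call, the runtime executes it, and the result is fed back for the next turn. Here chat() is a stand-in for any Gemma 4 serving endpoint, and the message schema is a common convention, not Gemma 4's actual wire format.

```python
# Generic multi-turn tool loop. chat() stands in for a model call that
# returns either {"text": ...} or {"tool": name, "args": {...}}; the
# schema is a common convention, not Gemma 4's actual wire format.
import json

TOOLS = {
    # Demo stub: a real tool would hit an inventory system.
    "get_inventory": lambda sku: {"sku": sku, "on_hand": 42},
}

def chat(messages: list[dict]) -> dict:
    """Stand-in for a call to the serving endpoint."""
    raise NotImplementedError

def run_agent(user_msg: str, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_turns):
        reply = chat(messages)
        if "tool" not in reply:
            return reply["text"]                        # final answer
        result = TOOLS[reply["tool"]](**reply["args"])  # execute the call
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "max turns exceeded"
```

A model trained for this loop from the ground up fails less often at the fragile step, emitting a well-formed call on the first try, which is where prompt-engineered tool use historically broke down.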
Benchmarks in context: Where Gemma 4 lands in a crowded field

The benchmark numbers tell a clear story of generational improvement. The 31B dense model scores 89.2% on AIME 2026 (a rigorous mathematical reasoning test), 80.0% on LiveCodeBench v6, and hits a Codeforces ELO of 2,150 — numbers that would have been frontier-class from proprietary models not long ago. On vision, MMMU Pro reaches 76.9% and MATH-Vision hits 85.6%. For comparison, Gemma 3 27B scored 20.8% on AIME and 29.1% on LiveCodeBench without thinking mode.

The MoE model tracks closely: 88.3% on AIME 2026, 77.1% on LiveCodeBench, and 82.3% on GPQA Diamond — a graduate-level science reasoning benchmark. The performance gap between the MoE and dense variants is modest given the significant inference cost advantage of the MoE architecture.

The edge models punch above their weight class. The E4B hits 42.5% on AIME 2026 and 52.0% on LiveCodeBench — strong for a model that runs on a T4 GPU. The E2B, smaller still, manages 37.5% and 44.0% respectively. Both significantly outperform Gemma 3 27B (without thinking) on most benchmarks despite being a fraction of the size, thanks to the built-in reasoning capability.

These numbers need to be read against an increasingly competitive open-weight landscape. Qwen 3.5, GLM-5, and Kimi K2.5 all compete aggressively in this parameter range, and the field moves fast. What distinguishes Gemma 4 is less any single benchmark and more the combination: strong reasoning, native multimodality across text, vision, and audio, function calling, 256K context, and a genuinely permissive license — all in a single model family with deployment options from edge devices to cloud serverless.

What enterprise teams should watch next

Google is releasing both pre-trained base models and instruction-tuned variants, which matters for organizations planning to fine-tune for specific domains. The Gemma base models have historically been strong foundations for custom training, and the Apache 2.0 license now removes any ambiguity about whether fine-tuned derivatives can be deployed commercially.

The serverless deployment option via Cloud Run with GPU support is worth watching for teams that need inference capacity that scales to zero. Paying only for actual compute during inference — rather than maintaining always-on GPU instances — could meaningfully change the economics of deploying open models in production, particularly for internal tools and lower-traffic applications.

Google has hinted that this may not be the complete Gemma 4 family, with additional model sizes likely to follow. But the combination available today — workstation-class reasoning models and edge-class multimodal models, all under Apache 2.0, all drawing from Gemini 3 research — represents the most complete open model release Google has shipped. For enterprise teams that had been waiting for Google's open models to compete on licensing terms as well as performance, the evaluation can finally begin without a call to legal first.