Presented by Splunk

Every day, organizations learn things their AI systems never get to use.

A security analyst corrects an AI-generated investigation. A network engineer identifies the root cause of a recurring outage. An observability team discovers that a pattern of latency, logs and infrastructure changes predicts service degradation. A customer operations team learns which signals indicate an escalation is likely.

Each moment contains valuable organizational knowledge. But in most enterprises, that knowledge disappears into tickets, dashboards, chat threads, post-incident reviews and the minds of individual experts. It may help solve the immediate problem, but it rarely becomes part of a reusable system that improves future AI-driven decisions.

That is the next challenge for the agentic enterprise.

The future will not be defined simply by who has the most capable model or the most autonomous agents. Many organizations will have access to similar frontier models. Many will deploy agents across security, IT, engineering, customer service, and business operations.

The real differentiator will be whether those agents can learn from the organization around them.

Not by constantly retraining the underlying model, but by capturing operational experience, converting it into institutional knowledge and making that knowledge available to future agents, workflows, and decisions.

The agentic enterprise is not just an enterprise that uses AI. It is an enterprise that learns through AI.

Agentic enterprises allow AI systems to learn from them

The AI conversation has been dominated by model capability: larger context windows, better reasoning, faster inference, stronger tool use, and more sophisticated agentic behavior.

Those advances matter. But in the enterprise, a model is only one part of the system.

A model does not automatically know how a specific organization operates.
It does not inherently know which remediation step solved last month’s outage, which analyst correction improved a threat investigation, which network signal preceded a service disruption, or which internal policy should override an otherwise plausible recommendation.

That knowledge belongs to the enterprise.

For agentic systems to improve, organizations need a way to capture that knowledge and make it reusable. In many cases, that does not require changing the model itself. It requires changing the ecosystem around the model: the knowledge base, retrieval layer, prompts, policies, guardrails, routing logic and workflows that shape how agents behave.

The model may remain the same. The learning system around it becomes smarter.

Feedback loops turn every outcome into a teachable moment for agents

Every agentic workflow creates signals.

An agent receives a request. It retrieves context, reasons through possible actions, calls tools, and generates an answer. A human accepts, rejects, or modifies that answer. Downstream systems reveal whether the action worked.

That entire chain is valuable.

AI observability gives organizations visibility into what happened: the prompt, response, reasoning path, tool calls, data sources, intermediate steps, failure modes and outcomes. Without that visibility, organizations cannot understand why an agent behaved the way it did, let alone improve it.

But observability alone is not enough.

The larger opportunity is to turn observed behavior into institutional knowledge. A trace should not only help developers and operators debug an agent.
It should help the enterprise understand what the agent learned, what the human corrected, what outcome followed, and what should change before the next similar event.That is the shift from monitoring AI to teaching AI.In the agentic enterprise, feedback loops connect action to outcome, outcome to knowledge and knowledge back to future action.A learning system in practice across security, observability and the networkConsider a service experiencing intermittent degradation.An observability agent detects unusual latency and error rates. A network agent identifies packet loss across a specific path. A security agent notices that the same time window includes suspicious authentication behavior and unusual traffic from a previously unseen source.Individually, each agent has only a partial view. Together, they create a richer operational picture.The first time this incident occurs, human experts may need to intervene. A network engineer confirms that packet loss was caused by a misconfigured routing change. A security analyst determines that the suspicious traffic was not an attack, but a side effect of a misrouted internal service. An SRE connects the network event to the application degradation.That resolution contains knowledge the organization should not have to relearn.A mature agentic learning system would capture the traces, human corrections, topology context, security findings, observability signals and final remediation steps. It would preserve the relationship between those signals: latency pattern, network path, identity behavior, routing change and remediation.The next time a similar pattern appears, agents would not start from zero. 
They could retrieve the prior case, compare current conditions, recommend the proven diagnostic path and escalate with better context.The underlying frontier model did not need to be retrained.The enterprise learned.The architecture of the learning agentic enterpriseA learning-oriented agentic enterprise needs more than a model or chatbot. It needs an architecture that can capture experience, turn it into usable knowledge, connect that knowledge to operational context, and govern how it changes future agent behavior.Memory preserves what happened: what the agent saw, what it did, where humans intervened, and what outcomes followed.Knowledge bases turn that experience into reusable guidance, including playbooks, examples, policies, procedures, and evidence.A data fabric connects the operational environment. The signals agents need live across logs, metrics, traces, tickets, identity systems, security tools, network telemetry, collaboration platforms, and business applications. A data fabric makes those signals discoverable, correlated, governed, and usable in context.AI observability explains how agents behave by capturing prompts, tool calls, intermediate steps, responses, feedback, and outcomes. That visibility helps organizations understand where agents succeed, where they fail, and what should improve.The control plane governs how learning becomes change: what knowledge is promoted, which prompts or policies are updated, which agents can use new information, what approvals are required, and how changes are audited.Together, these capabilities allow AI systems to improve over time in a controlled, trustworthy way that allows the enterprise to learn from its own operations.The organizations that learn fastest will win The next era of AI will not be won by models alone. 
It will be won by organizations that can capture what they learn from every workflow, expert correction, incident, investigation, and outcome.The most advanced agentic enterprises will not simply deploy more agents. They will build systems that allow every agent to benefit from the collective knowledge of the organization.That means connecting operational data through a data fabric. It means observing agent behavior deeply enough to understand it. It means preserving experience in memory and institutionalizing it in knowledge bases. It means using a control plane to govern how learning changes agent behavior.The future of AI is not a single autonomous agent acting alone. It is an ecosystem of agents, humans, data and controls that learns over time.The organizations that build that ecosystem will create AI systems that get better with every interaction. Not because the model is constantly changing, but because the enterprise itself is becoming more intelligent.Learn more about how Cisco Data Fabric powered by the Splunk Platform is accelerating agentic operations.Hao Yang is Vice President AI at Splunk, a Cisco Company.Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
Venture Beat
7,000 Langflow servers are under attack. LangGraph and LangChain have the same holes
Your AI agent did exactly what it was designed to do. The framework underneath it just handed an attacker a shell on the box that holds your OpenAI key, your database credentials, and your CRM tokens.That is not a hypothetical. In a few months, three of the most widely deployed AI agent frameworks each turned a known, ordinary bug class into a way through. Check Point Research chained a SQL injection in LangGraph’s SQLite checkpointer to full remote code execution. Tenable and VulnCheck tracked a path traversal in Langflow’s file upload endpoint to active, in-the-wild RCE. Cyera documented a path traversal in LangChain-core’s prompt loader that reads your secrets off disk. Two paths to a shell, one to your keys. They are the same bug, wearing three frameworks.These frameworks became production infrastructure faster than anyone secured them. They store agent state, take file uploads, load prompt configs, and hold the credentials to databases, CRMs, and internal APIs. The edge tools watch traffic. The endpoint tools watch processes. Neither was built to treat an imported framework as a boundary worth guarding, and that blind spot is exactly where all three chains live, widening every week as these frameworks ship to production.The LangGraph chain, SQL injection to a Python shellStart with the one most teams pulled into production this quarter. LangGraph gives AI agents memory through checkpointers, the persistence layer that stores execution state. It has cleared over 50 million downloads a month. Yarden Porat of Check Point Research took that layer apart and found three vulnerabilities. Two of them chain to RCE.CVE-2025-67644, rated CVSS 7.3, is a SQL injection in the SQLite checkpointer. The function that builds the WHERE clause for checkpoint lookups drops user-controlled filter keys straight into the query with no parameterization and no escaping. This does not hit everyone, but where it hits, it is serious. 
A deployment is exposed when it self-hosts LangGraph on the SQLite or Redis checkpointer and lets untrusted input reach get_state_history() or a similar history endpoint. Meet those conditions, and an attacker who controls the filter writes a fabricated row straight into the checkpoint table. Run LangChain’s managed LangSmith platform on PostgreSQL, and the exposure is gone.Then CVE-2026-28277, CVSS 6.8, finishes the job. LangGraph’s msgpack checkpoint decoder rebuilds Python objects from the stored data, which lets it import a module and call a named function with attacker-supplied arguments. That step needs write access to the checkpoint store; the SQL injection is what grants it remotely. LangGraph loads the forged row as a legitimate checkpoint, the decoder runs the specified function, including os.system, and code executes under the identity of the agent server. A third issue, CVE-2026-27022, CVSS 6.5, reaches the same place through the Redis checkpointer.There has been no confirmed exploitation in the wild yet. A working proof-of-concept is public in Check Point’s disclosure. The fixes are version bumps: langgraph-checkpoint-sqlite to 3.0.1, langgraph to 1.0.10, and langgraph-checkpoint-redis to 1.0.2.The Langflow chain, one unauthenticated request to RCELangflow is the one already under attack. CVE-2026-5027, CVSS 8.8, is a path traversal in the POST /api/v2/files endpoint, which takes the filename straight from the form data and writes it to disk unsanitized. An attacker packs that filename with traversal sequences and drops a file anywhere, such as a cron job in /etc/cron.d/. Because Langflow ships with auto-login enabled in its default configuration, an exposed instance needs no credentials at all. 
A single unauthenticated request reaches the endpoint, and the next cron run hands over a shell.VulnCheck’s Caitlin Condon confirmed exploitation on June 9: “Our Canaries observed exploitation of CVE-2026-5027 that successfully leveraged the path traversal to write what appear to be test files on victim systems.” Censys put roughly 7,000 exposed instances on the internet, most in North America. This is the third Langflow flaw to draw active exploitation this year, after CVE-2025-34291, which the Iranian state-sponsored group MuddyWater weaponized and which CISA added to its Known Exploited Vulnerabilities catalog in May. CVE-2026-5027 itself was patched in version 1.9.0, released April 15.The timeline is what sets the clock. The patch shipped April 15. Attacks started in June, and VulnCheck added CVE-2026-5027 to its exploited-vulnerabilities list June 8 once its sensors caught the first in-the-wild hits. Every instance left unpatched between those two dates has been sitting in the open for almost two months. The lesson for security teams is to start the patch clock at disclosure, not at a federal catalog entry.The LangChain-core gap, arbitrary file reads through the prompt loaderLangChain-core, the foundation under both, disclosed CVE-2026-34070, CVSS 7.5, a path traversal in its legacy prompt-loading API. The load_prompt() functions read a file path out of a config dict with no check against traversal sequences or absolute paths, so an attacker who influences that path reads arbitrary files the process can reach, including the .env file holding OPENAI_API_KEY and ANTHROPIC_API_KEY. Cyera paired it with CVE-2025-68664, CVSS 9.3, a deserialization flaw that resolves environment secrets through a crafted object. The fix versions differ, which matters when you patch: CVE-2026-34070 lands in langchain-core 1.2.22 and 0.3.86; CVE-2025-68664 lands earlier in 1.2.5 and 0.3.81. 
Clear both, or the higher-severity flaw stays live behind a patched one.Three frameworks, three classic AppSec bugs. Path traversal. SQL injection. Unsafe deserialization. Nothing exotic, nothing AI-specific, just old vulnerabilities living inside new infrastructure. None of this is a frontier-model problem. It is plumbing, sitting in the layer where AI meets the enterprise.Why the scanner cannot see itMerritt Baer, CSO at Enkrypt AI and former deputy CISO at AWS, has named what makes this kind of failure hard to see coming. It does not announce itself as an AI problem. “CISOs will experience MCP insecurity not in the abstract, but when an employee pastes sensitive data into a tool, or when an attacker finds an unauthenticated MCP server in your cloud,” Baer told VentureBeat. “It won’t feel like ‘AI risk.’ It will feel like your traditional security program failing.” The framework chains here are the same shape. An exposed Langflow instance is an unauthenticated server in your cloud, and the alert, if one fires, reads like an ordinary incident.That is the gap in one sentence. The exploit lives in the framework your code imports. The WAF never sees a msgpack decoder running three layers down. The EDR watches the agent server make the same process calls it makes a thousand times a day and waves it through. Both tools are doing their job. Nobody scoped the framework itself as the thing that could turn on you. The root cause is older than AI, and Baer names it. “MCP is shipping with the same mistake we’ve seen in every major protocol rollout: insecure defaults,” she told VentureBeat. “If we don’t build authentication and least privilege in from day one, we’ll be cleaning up breaches for the next decade.” Langflow’s auto-login is that mistake shipped. LangChain-core’s unguarded prompt loader is that mistake shipped. The convenient default is the vulnerability. And the moment an agent connects to anything, that risk compounds. 
“You’re not just trusting your own security, you’re inheriting the hygiene of every tool, every credential, every developer in that chain,” Baer said. “That’s a supply chain risk in real time.”There is a governance failure layered on top of the technical one, and it is the same miscategorization Assaf Keren, chief security officer at Qualtrics and former CISO at PayPal, has flagged in adjacent tooling. “Most security teams still classify experience management platforms as ‘survey tools,’ which sit in the same risk tier as a project management app,” Keren told VentureBeat. “This is a massive miscategorization.” Swap in AI agent frameworks, and it still holds. Teams file LangGraph, Langflow, and LangChain under developer convenience, then wire them into databases, CRMs, and provider keys. “Security has to be an enabler,” Keren said, “or teams route around it.” These frameworks are what routing around it looks like.Follow the money and it points at the same layer. On its Q1 fiscal 2027 earnings call, CrowdStrike reported its AI detection and response line up more than 250% sequentially, and on June 17 it extended that runtime coverage to agent, LLM, and MCP traffic on AWS. George Kurtz, the company’s co-founder and CEO, named the reason in plain terms: “Agents run on the endpoint. They make tool calls, access files, invoke APIs, and move data at the process level.” That is the exact plumbing these chains abuse, and real money is now moving to the layer your AppSec scan skips.What to put in front of the boardThe board does not need the CVE numbers. It needs the consequence, and Keren draws the line the board cares about. Most teams have mapped the technical blast radius. “But not the business blast radius,” Keren told VentureBeat. “When an AI engine triggers a compensation adjustment based on poisoned data, the damage is not a security incident. It is a wrong business decision executed at machine speed.” A framework RCE is the same problem one layer earlier. 
The agent does not just leak a credential; it acts on production systems with it, and the business sees an outcome no one can explain.

So frame it the way a board frames it: we run AI agent frameworks in production that can be turned into remote shells through bugs our scanners are not built to find, all three are patched, one is under active attack, and here is the date every instance is verified and closed. None of this required custom malware or a zero-day.

The six-question checklist

Six trust boundaries, one per row, each with the question, the proof point, the command, the fix, and the board line. Run it tonight.

| Trust-Boundary Question | Proof Point | What Broke | Verify Before You Install | The Fix | Board Language |
|---|---|---|---|---|---|
| 1. Can the agent’s state store be poisoned with code? | LangGraph SQLi-to-RCE chain. CVE-2025-67644 (CVSS 7.3) chains into CVE-2026-28277 (CVSS 6.8). PoC public, no in-the-wild use yet. | Filter keys interpolated into SQL with an f-string. Forged checkpoint row hits the msgpack decoder, which imports and runs an attacker-named callable. | pip show langgraph-checkpoint-sqlite. Below 3.0.1 = vulnerable. Confirm get_state_history() is not exposed to network input. | Upgrade langgraph-checkpoint-sqlite to 3.0.1, langgraph to 1.0.10, langgraph-checkpoint-redis to 1.0.2. | “Our agent memory layer can be tricked into running attacker code. Vendor has patched it. We are upgrading and confirming the endpoint is not exposed.” |
| 2. Can an unauthenticated request write a file to our agent server? | Langflow CVE-2026-5027 (CVSS 8.8). On VulnCheck KEV (June 8). Active exploitation confirmed June 9. ~7,000 exposed instances (Censys). | Path traversal in POST /api/v2/files. Filename unsanitized. Auto-login on by default. Two HTTP calls drop a cron job and earn a shell. | Query Censys or Shodan for your Langflow, Flowise, n8n, and Dify instances on the perimeter. Check whether auto-login is enabled. | Upgrade Langflow to 1.9.0+. Disable auto-login. Pull AI dev tools behind VPN or zero-trust. Isolate port 7860. | “Our AI dev tools are reachable from the internet with login off. This exact flaw is under active attack now. We are pulling them behind access controls today.” |
| 3. Can our prompt loader read files it should never touch? | LangChain-core CVE-2026-34070 (CVSS 7.5), path traversal in the prompt-loading API. Paired with deserialization CVE-2025-68664 (CVSS 9.3). | load_prompt() reads a config-supplied path with no traversal check, returning files such as the .env holding OPENAI_API_KEY and ANTHROPIC_API_KEY. | pip show langchain-core. Below 1.2.22 (1.x) or 0.3.86 (0.x) = vulnerable. Audit any code passing user-influenced paths to load_prompt(). | Upgrade langchain-core past both fixes: 1.2.22 / 0.3.86 (CVE-2026-34070) and 1.2.5 / 0.3.81 (CVE-2025-68664). Replace load_prompt() with an allowlisted directory. Run as non-root. | “Our prompt system could be steered to read our API keys off disk. We are patching and removing the legacy loader.” |
| 4. Does a compromised framework hand over every credential at once? | These frameworks are often deployed with provider keys, database credentials, and integration tokens available to the process environment. Cyera documents the credential-exfiltration path. | One RCE on the agent server exposes every secret the process can read. Blast radius is the full credential set, not one app. | Inventory which secrets each framework process can reach. Confirm keys come from a secrets manager, not static .env files. | Move provider keys to ephemeral injection. Rotate any key a vulnerable instance could have read. Scope each key to least privilege. | “A single break in one AI framework exposes the keys to every model and data store it touches. We are rotating and scoping them now.” |
| 5. Are these frameworks running outside security governance? | A prior Langflow flaw, CVE-2025-34291, was weaponized by Iranian-linked MuddyWater and added to CISA KEV in May. Shadow AI is the new shadow IT. | Teams stand frameworks up for speed, give them credentials, and never bring them under review. The security team cannot see what it does not know exists. | Run a discovery sweep for AI frameworks outside change management. Map each to an owner and an approval record. | Assign every framework a documented owner and a place in the approval process. Offer a sanctioned alternative so teams do not route around you. | “We have AI frameworks in production that no one formally approved. We are bringing them under governance, not banning them.” |
| 6. Can our scanners even see inside the framework at runtime? | Runtime detection is forming around this layer: CrowdStrike Falcon AIDR expanded to AWS June 17 (Bedrock, Kiro, Strands); its QuiltWorks coalition now covers cloud workloads. | WAF reads HTTP at the edge. EDR watches the endpoint. By default, neither reliably models a msgpack decoder or a prompt loader three layers down in an imported framework as a separate trust boundary. | Test whether your AppSec scan covers third-party framework internals. Track CVEs by dependency, not just by what your edge tools can parse. | Add framework dependencies to vuln management. Treat agent output and stored state as untrusted. Patch on disclosure, not on KEV listing. | “Our scanners check our code, not the frameworks our code imports. We are closing that blind spot and patching on disclosure, not waiting for the federal catalog.” |

How to read this table: each row is one trust boundary, left to right, from the question to ask to the line to read your board.

Give the board the deadline, not the technology

The fixes are not a re-architecture. They are version bumps and config changes you can land this week. The exposure is the gap between the day the patch shipped and the day your team runs the checks, and right now that gap is measured in months. The frameworks did exactly what they were built to do.
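The version checks in the checklist can be run as one script instead of a series of pip show calls. The following is a minimal sketch, not a vendor tool: the fix versions are the ones reported in this article, while the names FIXED_VERSIONS, parse_version, is_vulnerable and audit are made up for illustration. It assumes that an install on a major line with no listed fix (for example a 0.x langgraph) counts as below the fix, and it ignores pre-release suffixes, so treat its output as a first pass.

```python
import re
from importlib import metadata

# Fix versions as reported in the article. The dict name and layout are illustrative.
FIXED_VERSIONS = {
    "langgraph-checkpoint-sqlite": ["3.0.1"],
    "langgraph": ["1.0.10"],
    "langgraph-checkpoint-redis": ["1.0.2"],
    "langflow": ["1.9.0"],
    "langchain-core": ["0.3.86", "1.2.22"],  # one threshold per major line
}


def parse_version(text):
    """Turn '1.0.10' into (1, 0, 10). Pre-release suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def required_version(package, installed):
    """Use the fix on the installed major line; otherwise assume the highest fix applies."""
    candidates = [parse_version(v) for v in FIXED_VERSIONS[package]]
    same_major = [c for c in candidates if c[0] == installed[0]]
    return same_major[0] if same_major else max(candidates)


def is_vulnerable(package, installed_text):
    """True when the installed version sorts below the reported fix version."""
    installed = parse_version(installed_text)
    return installed < required_version(package, installed)


def audit():
    """Return (package, installed version or None, status) for each tracked package."""
    rows = []
    for package in FIXED_VERSIONS:
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            rows.append((package, None, "not installed"))
            continue
        status = "VULNERABLE" if is_vulnerable(package, installed) else "ok"
        rows.append((package, installed, status))
    return rows


if __name__ == "__main__":
    for package, installed, status in audit():
        print(f"{package:32} {installed or '-':12} {status}")
```

Comparing versions as integer tuples rather than strings matters here: as text, "1.0.9" sorts after "1.0.10", which would hide a vulnerable install. The script only sees the Python environment it runs in, so it would need to run inside each deployed environment or container image.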
Fine-tuning forgets. RAG leaks context. Hypernetworks build the model your agent needs on demand.
Enterprise teams keep watching the same thing happen. An AI agent demos beautifully, goes to production, and stalls: it runs for a short stretch, then needs a human to top up its context and check its output, and the promised efficiency drains into supervision. The agent did the work; you did the watching. It’s one reason so many agent pilots never turn into production systems.The pitch on the other side of that wall is the one every team wants to believe: an agent that runs a long job on its own, overnight if it has to, and leaves a person to validate only the last 10%. Whether that is achievable turns on a problem the orchestration conversation mostly skips. When AI firm Chroma tested 18 leading models, every one lost accuracy as its input grew, a property of how attention works, not a gap a stronger model closes. An agent fed more and more of your business as it runs does not get steadier. It gets shakier.This is the layer beneath the orchestration race. Routing, durable execution and observability all assume each agent is already competent enough to coordinate in the first place. The deeper question is how long an agent can run before a human has to step in, and that comes down to where your company’s knowledge lives relative to the model. Both standard fixes leave a human in the loop.Why teaching a model your business keeps you in the loopFrontier models keep getting more capable, and the gap does not close, because it is not a capability problem. It is about where your knowledge sits relative to the model, and enterprises have had two ways to place it there. The first is fine-tuning, which bakes knowledge into the weights. It remains subject to catastrophic forgetting, a problem identified in the 1980s and still unresolved in 2026: teaching a model something new tends to erode what it already knew. 
Teams work around it by isolating each task in its own fine-tuned model or adapter, which produces a sprawling estate of models that raises cost and governance overhead. And a fine-tuned model is a snapshot: it goes stale the day a policy changes, and the expensive, slow retraining cycle starts over.

The second is in-context learning, which skips retraining by placing the relevant policies in the prompt at run time. This is where context rot bites: accuracy degrades as the prompt grows. Retrieval narrows what goes into the prompt, but a retrieval miss looks identical to a confident answer, and both cost and latency climb with every token added.

The two failures rhyme. With fine-tuning, the model can be confidently working from last quarter’s policy. With in-context learning, it can be confidently working from a detail it lost in the middle of a long prompt. Either way the output looks equally assured, so you cannot tell which parts are wrong without checking all of them. That is why the human never gets to leave. Teams often run both at once, fine-tuning the stable knowledge and retrieving the rest. That softens each failure but removes neither: on any given output you still cannot be sure the model is both current and working from the right context, so you still check it.

A third path: generate the specialist model on demand

A third approach is moving from research into early product. Instead of retraining one model or stuffing its prompt, a generator builds a small, task-specific model on demand from your policies, at inference time. The generator is a hypernetwork: a network whose output is the weights of another network. The idea was named in 2016; applying it to produce specialist language models from text or documents is recent and active.
Sakana AI’s Text-to-LoRA, presented at ICML 2025, generates a model adapter from a plain-language description in a single pass, and a 2026 system called SHINE calls hypernetwork adaptation a promising new frontier, precisely because it sidesteps both the retraining cost of fine-tuning and the context limits of prompting.The point of generating adapters rather than training and storing them is to collapse a sprawling library of per-task LoRAs into one network that can produce them on demand, including for tasks it has not seen.The elegant part is how this closes the loop on the problem above: the per-task adapter teams hand-build to dodge catastrophic forgetting is the same object a hypernetwork produces automatically. The model zoo stops being a governance headache and becomes a generated output.The case for going small underneath all this was put most directly in a 2025 paper by Nvidia researchers: for the narrow, repetitive tasks that fill agent workflows, small models are capable enough and 10 to 30 times cheaper to run than frontier generalists. Nace.AI, a Palo Alto company that raised a $21.5 million seed round in May, is the clearest commercial instance. Its core technology, a generator it calls a MetaModel, produces parameter adaptations for a model at inference time from a company’s policies, pointed at regulated work: audit, compliance, risk assessment. The company says its agents handle the bulk of a workflow while human experts validate the result, a split it markets as 90/10.How the three approaches compare
| | Fine-tuning | In-context / RAG | Hypernetwork-generated model |
|---|---|---|---|
| Where business knowledge lives | In the model’s weights | In the prompt, re-supplied each run | In on-demand generated weights |
| Cost to update on a policy change | High: retrain | Low: edit the source | Low: regenerate |
| Staleness | High: a snapshot | Low | Low: regenerated from current policy |
| Per-call cost and latency | Low | High, grows with context | Low at run time |
| Dominant failure mode | Forgetting; model-zoo sprawl | Context rot; silent retrieval misses | Generator quality; calibration |
| Who owns the improving asset | Whoever trains the model | Whoever holds the data store | Depends where generator and feedback live |

Why a hypernetwork-built model raises the autonomy ceiling

A model that is narrow, current and small has a smaller surface on which to be wrong. Fewer errors, confined to a known domain, mean fewer outputs an agent has to escalate to a person, which is the real basis for any high-autonomy claim. It is also where a number like 90/10 comes from: not a dial set in advance, but an outcome of how little the system needs to hand back. Reported autonomy shares are best read as measurements of an architecture, not as settings.

Two design choices decide whether that autonomy is trustworthy or merely fast. The first is grounding: tying every output to its source so a reviewer can verify rather than redo. Research models built for exactly this, such as HalluGuard, label each claim as supported or not and cite the passage they relied on. Nace ships its agents with grounding models and reasoning traces for the same reason. A 10% review only means something if the human can confirm provenance in seconds.

The second is the feedback loop, and it forces a question every buyer should ask: when your experts validate the output, whose model improves, and where does it live? That decides whether the compounding asset belongs to the vendor or to you. Arrangements differ.
Nace, for instance, uses an external network of certified experts for some engagements and, for direct enterprise deployments, the customer’s own staff, with the resulting model kept inside the customer’s cloud. Each choice routes the learning, and the ownership, somewhere different.

Where the third path breaks

The approach is still early, and a few questions will decide how far it goes. Calibration is the linchpin: the value rests on the model knowing when it is unsure. And it is genuinely unsettled: recent work on generating these adapters found they do not automatically improve calibration over ordinary fine-tuning, with gains appearing only under specific constraints. The quality of the generated model also depends heavily on the policy data it is built from, which puts a premium on data curation. And scale is the open research frontier: the hypernetworks shown in published work so far have been small. This is where Nace’s own work gets interesting: in our interview, the company said it has scaled its generator well beyond those published sizes and derived a scaling law for how performance grows, results it has begun to share publicly and is now putting through peer review. If it holds up, it would help answer one of the central open questions in the field, and it is the paper worth watching.

Whichever approach wins, the work still ends at a human, and that handoff is its own design problem. When Deloitte Australia delivered a roughly A$440,000 government report, it shipped with fabricated citations and an invented court quote after passing senior review, because the reviewers checked the conclusions, which were sound, and not the provenance, which was not. Controlled research suggests the pattern is general: experts corrected an identical flawed recommendation less often when it was labeled AI-generated. The EU AI Act’s Article 14 now names this automation bias.
The lesson is not about any one vendor: a high autonomy share concentrates human attention into a thin, late slice of the work, so the value of that review depends entirely on whether the human can check provenance fast, which loops back to grounding.

What to build, and what to ask before you buy

The honest takeaway: what holds your agents back is usually not orchestration or model size, but whether the model knows your business well enough to be left alone, and the right fix depends on the job. To automate a long, repetitive, high-volume process end to end (for example, running most of your internal audit overnight and having your own experts check the final slice), a hypernetwork-generated model is the approach most likely to do it cheaply and run long enough to matter. For a short task that finishes in a few steps and never needed to run unattended, the gap between this and a well-prompted frontier model shrinks to almost nothing, and is not worth the integration cost.

When a vendor pitches autonomous or specialist agents, four questions cut through it. Where does the business knowledge live: in the weights, the prompt, or generated on demand? What does each output come with, so a reviewer can verify it instead of redoing it? What decides which work gets escalated to a human? And whose model improves from that feedback, and where does it run? The answers, not the headline ratio, tell you what you are buying.

The hypernetwork approach is the most credible attempt yet at making a small model know a specific business without forgetting it and without re-explaining it on every run. It is also the least proven, and the parts that matter most, calibration and scale, are still in peer review. For the right job, pilot it now. For the wrong one, the integration cost buys you little that a well-prompted frontier model wouldn’t.
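The third buyer question, what decides which work gets escalated, can be made concrete with a small sketch. The code below is an illustration only: the `AgentOutput` fields, the thresholds, and the sample batch are hypothetical and are not taken from Nace or any other vendor. It assumes the model reports a calibrated confidence and attaches source passages, then measures the autonomy share as an outcome of the escalation rule.

```python
from dataclasses import dataclass


@dataclass
class AgentOutput:
    task_id: str
    confidence: float   # self-reported confidence in [0, 1]; assumed to be calibrated
    cited_sources: int  # number of source passages attached to the output


def needs_human(output, min_confidence=0.9, min_sources=1):
    """Escalate when the model is unsure or the output cannot be traced to a source."""
    return output.confidence < min_confidence or output.cited_sources < min_sources


def autonomy_share(outputs, min_confidence=0.9, min_sources=1):
    """Fraction of outputs that did not need escalation: a measurement, not a setting."""
    if not outputs:
        return 0.0
    kept = sum(
        1 for o in outputs if not needs_human(o, min_confidence, min_sources)
    )
    return kept / len(outputs)


# Illustrative batch: nine confident, grounded outputs and one ungrounded one.
batch = [AgentOutput(f"task-{i}", 0.97, 2) for i in range(9)]
batch.append(AgentOutput("task-9", 0.99, 0))  # confident but uncited, so it escalates

print(autonomy_share(batch))
```

If the confidence values are poorly calibrated, the same rule still produces a ratio, but the ratio stops meaning anything, which is why calibration carries so much weight in the argument above.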
Anthropic’s Claude Code Artifacts update brings live, shared dashboards and interactive workspaces to enterprises
Anthropic announced a potentially game-changing new feature for users of Claude Code on the Claude Team and Enterprise subscription plans: Artifacts. This update turns a Claude Code session’s work into a live, interactive, shareable custom HTML webpage. A Claude Code user can plug in live code and multiple data sources and have the result surface at an interactive URL that they can send to teammates, whether it is a dashboard, an app design, or some other product meant for internal use. These teammates and the original user can watch the webpage update in real time as Claude Code goes about its work autonomously or under the user’s guidance, and as the connected data sources and codebases change. While Anthropic first introduced Artifacts to its consumer web chatbot in the summer of 2024 (where it evolved from a manual toggle feature to a generally available tool for publishing code snippets and games to the web), integrating this capability directly into the Claude Code command-line interface (CLI) and desktop app bridges the gap between deep, back-end engineering and the non-technical stakeholders who need to understand it.

Product and Technology: The End of the Status Update

At its core, Claude Code Artifacts acts as a dynamic translation layer. Built directly from the unbroken context of a user’s session, the agent uses the local repository codebase, connected monitoring tools, and conversational reasoning to spin up specialized web pages. Engineers no longer need to wire up external data sources or stand up temporary infrastructure; the AI builds the UI from what already exists.

Crucially, these web pages are not static exports. As the AI works through a terminal session, the open webpage refreshes in-place, updating charts and text instantly at the exact same URL.
Every update publishes a new version history, allowing teammates to roll back or track the agent’s progress securely on desktop or mobile.The Battle of Live, Interactive, Shared AI Work Surfaces: Anthropic’s Claude Code Artifacts vs. OpenAI’s Codex SitesAnthropic’s update comes more than two weeks after OpenAI released a massive update to its own Codex platform, introducing a strikingly similar enterprise hosting feature called “Sites”. This tit-for-tat product cadence highlights a rapidly escalating battle over the enterprise workspace across functions and beyond developers themselves, though there are some important technical and philosophical distinctions worth pointing out for enterprises considering either. As revealed in their respective developer documentation webpages, OpenAI is building a platform-as-a-service; Anthropic is building a stateless canvas.OpenAI’s Sites is designed to generate durable, full-stack web applications. According to the platform’s documentation, Codex Sites hosts projects that output as Cloudflare Worker-compatible ES modules. Crucially, Sites supports persistent backend infrastructure: agents can automatically wire up “D1” relational databases for structured data (like user progress or saved records) and “R2” object storage for file uploads. An OpenAI Site can support public sign-ins, integrate with external identity providers, and allows for highly specific access controls tailored to specific workspace groups. It utilizes a two-stage publishing process—saving a reviewable candidate linked to a Git commit before officially deploying to production. In short, it is a production environment designed to replace functional internal SaaS tools.Anthropic’s Claude Code Artifacts, by contrast, deliberately avoids the backend. The newly released documentation is blunt about its limitations: “An artifact is a capture of work, not an application”. Each Artifact is a single, self-contained HTML page capped at a rendered size of 16 MiB. 
To guarantee organizational security, Claude wraps the published file in a strict Content Security Policy (CSP) that blocks all external network requests. This means the page cannot load external scripts, fonts, or stylesheets, and fetch, XHR, and WebSocket calls are completely blocked. All CSS and JavaScript must be inlined, and images must be embedded as data URIs. Artifacts cannot store form input, call an API at view time, or serve multiple routes.This technical limitation is actually Anthropic’s deliberate philosophical position: While OpenAI wants to spin up persistent software portals for the whole company, Anthropic is keeping Claude Code firmly anchored in ephemeral, highly secure technical workflows. Claude Artifacts are not meant to be software; they are meant to replace whiteboard diagrams, manual bug walkthroughs, and status reports with secure, self-updating visual tools that never leak live data outside the corporate boundary.Licensing and Enterprise Security: Keeping the Codebase PrivateBecause these agents sit at the nexus of proprietary company data and live codebases, licensing and access controls are a primary concern. Both Anthropic and OpenAI have opted for closed, proprietary licensing models for these new visual workspaces. For end users and developers, the distinction is critical. Unlike permissive open-source software (such as MIT or Apache 2.0) or strict copyleft licenses (like GPL)—which grant developers the legal freedom to inspect, modify, and self-host the underlying code—neither Claude Code Artifacts nor Codex Sites can be independently forked or hosted. Enterprise clients do not maintain code-level ownership over Anthropic’s rendering engine or Codex’s integration nodes; both operate strictly within their respective creators’ managed infrastructures.To make this vendor-managed approach palatable to enterprise compliance teams, both companies have heavily prioritized organizational security. 
Anthropic ensures every artifact is private to its author by default and strictly cannot be made public to the broader internet. When an engineer chooses to share a link, it is viewable exclusively by authenticated members of their specific organization. System administrators retain ultimate authority, managing access through org-level toggles, role-based scoping, and explicit retention policies, while maintaining oversight through a centralized compliance API.OpenAI takes a similarly gated approach with Codex Sites, rolling the feature out primarily for ChatGPT Business and Enterprise workspaces. Like Anthropic, OpenAI relies on system administrators to manage deployment through centralized workspace settings, requiring an admin to explicitly enable Sites via role-based access control (RBAC) for Enterprise tiers.However, because Codex Sites functions more like a hosted web application, its access controls are slightly more granular. When an engineer prepares to share a deployed URL, they can apply specific access modes: restricting the site to just themselves and workspace admins, opening it to all active users in the workspace, or limiting access to custom user groups. Furthermore, to prevent sensitive data leaks, OpenAI provides a dedicated Sites panel to manage runtime environment variables and secrets securely, ensuring those keys do not have to be committed to local source files.Reactions and ReflectionsThe introduction of visual, self-updating UI layers to command-line agents is fundamentally altering how developers view their own workflows. 
As AI handles the raw syntax and automates the reporting, the friction of communicating technical work to stakeholders is vanishing.Boris Cherny, the Lead and creator of Claude Code, highlighted the sheer utility of the update in a post on X earlier today: “I’ve been using Artifacts in Claude Code for everything: visual explanations of tricky code, system diagrams, quick previews of a few animation options, data analyses and dashboards I share with the team,” Cherny wrote. “They are a game changer for how I work with Claude. Can’t wait to hear what you think!”This sentiment is practically demonstrated in Anthropic’s launch materials. In one scenario, an engineer prompts Claude Code to investigate user drop-offs since a previous software release. In a matter of seconds, the agent executes an SQL read, builds an interactive drop-off funnel dashboard, and diagnoses that “Pro accounts stall at the export sheet”. The AI then proposes UI fixes, updates the live charts as the code is refactored, and generates a secure link that a manager can instantly open via mobile.By turning the terminal into a live, collaborative canvas, Anthropic is proving that the most valuable output of an AI coding assistant isn’t just the code itself—it is the context, the reasoning, and the ability to share that work instantly.
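The self-containment constraints described earlier (inlined CSS and JavaScript, images as data URIs, a 16 MiB cap) can be approximated with a pre-publish lint. The following is a minimal sketch using only Python's standard library. It is not Anthropic's tooling: the function and class names are hypothetical, the size check uses the UTF-8 byte length as a stand-in for "rendered size", and it does not inspect inline JavaScript for `fetch`, XHR, or WebSocket calls.

```python
from html.parser import HTMLParser

MAX_BYTES = 16 * 1024 * 1024  # 16 MiB cap quoted in the article


class ExternalRefFinder(HTMLParser):
    """Collects references that would require a network request at view time."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and a.get("src"):
            self.problems.append(f"external script: {a['src']}")
        elif tag == "link" and "stylesheet" in (a.get("rel") or "").lower():
            if a.get("href"):
                self.problems.append(f"external stylesheet: {a['href']}")
        elif tag == "img":
            src = a.get("src") or ""
            if src and not src.startswith("data:"):
                self.problems.append(f"image not embedded as data URI: {src}")


def lint_artifact(html_text):
    """Return a list of human-readable problems; an empty list means no findings."""
    finder = ExternalRefFinder()
    finder.feed(html_text)
    finder.close()
    problems = list(finder.problems)
    size = len(html_text.encode("utf-8"))  # assumption: byte length approximates rendered size
    if size > MAX_BYTES:
        problems.append(f"page is {size} bytes, over the {MAX_BYTES}-byte cap")
    return problems


clean = (
    "<html><head><style>body{margin:0}</style></head><body>"
    '<img src="data:image/png;base64,AAAA"><script>console.log(1)</script>'
    "</body></html>"
)
dirty = (
    '<html><head><link rel="stylesheet" href="https://example.com/a.css">'
    '<script src="https://example.com/x.js"></script></head>'
    '<body><img src="/logo.png"></body></html>'
)

print(lint_artifact(clean))
print(lint_artifact(dirty))
```

A check like this only catches the declarative cases. The Content Security Policy described in the article is what actually blocks requests at view time, so a lint is a convenience for authors and not a security control.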
New AI optimization framework beats Claude Code and Codex by 2.5x on the same compute budget
Imagine your engineering team just deployed an AI agent to search through internal company documents and answer employee questions. It works perfectly in development, but in production, it consistently hallucinates or misses key constraints. Fixing this is rarely a simple patch. It requires a tedious, trial-and-error process of tweaking chunking strategies, retrieval methods, and system prompts simultaneously. Because these adjustments are entangled, it becomes nearly impossible to attribute which specific tweak actually solved the problem. To address this challenge, researchers at Renmin University of China and Microsoft Research introduced Arbor, a framework that upgrades AI-driven research and optimization from a sequence of trial-and-error guesses into a cumulative learning process. Arbor organizes hypotheses, experiments, and insights into a tree that helps the system learn from prior failures to make smarter, verified improvements over time.In practical tests, Arbor delivered more than 2.5 times the verifiable performance gains of standard AI coding agents across real-world engineering tasks while operating under the same resource budget. For enterprise AI, this technique directly translates to automating the continuous improvement of complex, real-world engineering systems.Understanding the bottleneck in autonomous optimizationAs large language models and AI systems become more capable, they are expected to carry out more complex operations such as autonomous optimization (AO) of software systems such as agent harnesses or model training algorithms. AO captures the fundamental loop of autonomous research. An AI agent starts with an initial mutable artifact, such as a machine learning codebase or data pipeline, and a specific objective. The agent’s goal is to iteratively improve this artifact through experimental feedback without step-by-step human supervision.The main challenge of AO is often misunderstood. 
Many engineering teams find that simply giving a coding agent more time or compute to optimize a codebase doesn’t lead to better results. “Automation can keep an AI working for a very long time — but a loop is not the same as progress,” Jiajie Jin, co-author of the paper, told VentureBeat. “If the goal is vague, or the metric is easy to hack, long-running automation often just produces ‘improvements’ faster that nobody actually wants.”Jin explains that complex tasks take many attempts to get right, and standard agent architectures are missing the critical data structure to maintain state. “How do you make sure the insight and experience from each attempt actually accumulate, instead of getting lost in a scrollback buffer?” he said. Without this structure, agents simply repeat the same mistakes.Current agent systems can run experiments for many hours against well-specified goals: editing code, invoking tools, running tests autonomously. But they treat each attempt in isolation, missing the structural mechanisms that would let them accumulate and act on what they’ve learned.They lack the capacity to simultaneously maintain and compare multiple competing research directions. Without this, they cannot interpret both successes and failures to reshape their future exploration, which is the core mechanism that makes human research cumulative.General coding agents typically rely on conversation transcripts for their memory. Because AO tasks span hundreds of turns and easily exceed context window limits, these agents struggle to preserve and reuse factual evidence over long histories. As a result, they lose the overarching structure of the research process and are prone to stalling on early failures or chasing noisy evaluation swings. 
The system needs a structured, durable memory that records what directions have been tried, what factual evidence was produced, and how each result changes the space of future hypotheses.

Existing frameworks are also prone to reward hacking and overfitting to development metrics. This makes them create the illusion of progress without producing improvements that transfer to real-world performance.

Finally, general-purpose coding agents typically chain their tool calls on a single shared working tree. This architectural limitation prevents them from testing parallel hypotheses in isolated environments without corrupting the main codebase or obscuring which hypothesis caused a specific outcome.

The Arbor framework

Arbor solves the challenges of AO with a framework that automates the long-horizon loop of exploration, experimentation, and abstraction that characterizes human research. Arbor separates the strategic direction of research from the ground-level coding tasks with two key components:

The coordinator: A long-lived AI agent that acts like a principal investigator. It never directly edits the target codebase. Instead, it owns the general state of the optimization research, observes accumulated evidence, comes up with new hypotheses and directions to explore, and decides what to do with the results of experiments.

Executors: Short-lived, highly focused AI agents. When the coordinator wants to test an idea, it spins up an executor and places it in an isolated environment, essentially a fresh git worktree. Each executor is handed one hypothesis. It implements the assigned idea, runs evaluations, debugs errors, and reports back to the coordinator with the results and created artifacts.

These two components collaborate through a mechanism that the researchers call “Hypothesis Tree Refinement” (HTR).
HTR represents the entire research process as a persistent, branching tree where every node binds together four things: a hypothesis, the executable artifact, the factual evidence produced, and a distilled insight. This means the coordinator can explore multiple competing directions at the same time without losing its place.The coordinator builds the tree by placing broad ideas near the root, while concrete refinements branch out as leaves. This allows Arbor to safely explore multiple competing hypotheses simultaneously. If an executor’s experiment fails, the tree records why it failed as a negative constraint, ensuring the system doesn’t endlessly repeat the same mistake.To understand why Arbor’s isolation matters, consider a common enterprise scenario: optimizing a Retrieval-Augmented Generation (RAG) pipeline for an internal AI assistant. “When you ask a single agent like Claude Code or Codex to ‘improve accuracy,’ it will typically change a bunch of things in one pass — chunking, the prompt, the retrieval method,” Jin said. This entangles the changes, making it impossible to attribute which one actually helped. It also directly mutates the repository without isolation. Arbor solves this by treating each lever as a separate hypothesis. Chunking becomes one branch, retrieval another, and the prompt another — each implemented and evaluated in its own isolated git worktree. “So you get clean attribution: ‘constraint decomposition on the retrieval side gave +X; breadth-first search actually hurt,'” Jin said.When an executor returns a report, the coordinator writes the evidence to the tree and backpropagates the insight upward to parent nodes. 
This means a local observation becomes a generalized constraint that shapes the coordinator’s future idea generation.To prevent reward hacking or overfitting to the development data, HTR enforces a strict “merge gate.” Even if an executor reports a fantastic development score, the coordinator will spin up an isolated worktree to test the candidate against a held-out test evaluator. The artifact is only merged into the current best trunk if it demonstrably improves the test score, verifying that the progress is real.Arbor generally falls under the concept of “loop engineering,” popularized by industry figures like OpenClaw creator Peter Steinberger and Claude Code lead Boris Cherny. The idea is to move beyond single prompts to design iterative cycles (observe, reason, act, verify) that drive autonomous agents. However, as Jin points out, “A loop can fill up with messy, untraceable attempts, and you end up with nothing to show and no way to reconstruct what changed.” Arbor in actionThe researchers evaluated Arbor on an autonomous optimization task suite built from real-world research settings and the MLE-Bench Lite machine learning engineering benchmark. The AO suite featured tasks from different areas of AI development, including model training, harness engineering, and data synthesis.The researchers used different backbone models for the coordinator and executor agents, including Claude Opus 4.6, GPT-5.5, and Gemini-3-Flash. They tested Arbor against the strongest coding agents, Codex and Claude Code. Arbor and the baselines were given the same resources. For the MLE-Bench Lite tasks, Arbor was also compared against top-tier agentic research systems like AI-Scientist, ML-Master, and AIDE.Arbor consistently outperformed the baselines. It achieved the best held-out test result on all tasks, attaining more than 2.5 times the average relative gain of Codex and Claude Code. 
On the BrowseComp task, which involves optimizing a search agent, Arbor improved the system’s held-out accuracy from a baseline of 45.33% to 67.67%. Meanwhile, Codex and Claude Code stalled at 50% and 53.33%, respectively. On MLE-Bench Lite, when equipped with GPT-5.5, Arbor achieved the strongest result among all benchmarked systems.Arbor proved to be resilient against overfitting. For example, during the Terminal-Bench 2.0 task experiments, Claude Code achieved a high development score of 75 but its score dropped to 71 on the held-out data. Arbor had a lower development score of 72.22 but achieved the highest held-out score of 77.36, ensuring its results transfer to real-world applications.Arbor also showed generalization in a cross-task transfer experiment. After Arbor finished optimizing the search harness for the BrowseComp task, researchers took the optimized codebase and tested it on two unrelated search-agent tasks, HLE and DeepSearchQA. Arbor’s optimized codebase significantly improved performance on those unseen tasks as well.Deploying Arbor: Sweet spots and hidden costsFor engineering leads looking to drop Arbor into their existing tech stack, the framework is designed to sit on top of existing Git workflows rather than replacing them. “Its output is an ordinary git branch that your existing code review, CI, and human review can inspect directly,” Jin said. Only verified gains are merged into a per-run trunk, leaving the main repository untouched until a developer manually chooses to promote the code.However, deploying Arbor comes with specific tradeoffs. Jin points out that the biggest catch is token cost, as maintaining a long-lived coordinator that continuously manages the tree and dispatches executors is the dominant expense. Running multiple isolated worktrees concurrently also requires genuine compute and disk resources to process real experiments.So where is Arbor’s sweet spot? 
According to Jin, it excels at tasks with a clear, trustworthy metric, tolerance for a long time horizon, and a real search space with several plausible directions, such as pipeline optimization, data-synthesis quality, and model-training recipe tuning. Conversely, teams should explicitly avoid using Arbor for real-time latency tasks, obvious one-line fixes, or when the underlying evaluation metric is flawed. The quality ceiling of the entire run is strictly bounded by the quality of the evaluator. “If the metric isn’t trustworthy, Arbor will just optimize toward an untrustworthy result faster,” Jin said.Jin sees the next evolution going beyond single scalar metrics. “A natural evolution is to have each node’s artifact carry a vector — accuracy, latency, cost — instead of a single score,” Jin said. “Going from a single scalar to a multi-objective Pareto search is a very natural extension of the framework.”
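The hypothesis tree and merge gate described in this article can be pictured with a short data-structure sketch. This is an illustration under stated assumptions, not Arbor's actual code: the `Node` fields mirror the four things the article says each node binds together, while the function names, the branch names, and every score below are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    hypothesis: str
    artifact: Optional[str] = None   # e.g. a git branch name (hypothetical)
    evidence: Optional[dict] = None  # e.g. {"dev": 70.0, "test": 68.0}
    insight: Optional[str] = None
    # parent is excluded from repr/eq to avoid infinite recursion through children
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: list = field(default_factory=list)
    constraints: list = field(default_factory=list)  # insights propagated up from descendants

    def add_child(self, hypothesis):
        child = Node(hypothesis, parent=self)
        self.children.append(child)
        return child


def record_result(node, artifact, evidence, insight):
    """Bind an executor's report to its node and propagate the insight to ancestors."""
    node.artifact, node.evidence, node.insight = artifact, evidence, insight
    ancestor = node.parent
    while ancestor is not None:
        ancestor.constraints.append(insight)
        ancestor = ancestor.parent


def merge_gate(trunk_test_score, node):
    """Merge only if the held-out test score beats the trunk; the dev score is ignored."""
    return node.evidence["test"] > trunk_test_score


# Illustrative run: one root idea, two competing refinements, made-up scores.
root = Node("improve RAG answer accuracy")
chunking = root.add_child("use smaller chunks")
retrieval = root.add_child("decompose constraints on the retrieval side")

record_result(chunking, "exp/chunking", {"dev": 75.0, "test": 60.0},
              "smaller chunks raise the dev score but do not transfer")
record_result(retrieval, "exp/retrieval", {"dev": 70.0, "test": 68.0},
              "constraint decomposition transfers to held-out data")

trunk = 65.0
for candidate in (chunking, retrieval):
    print(candidate.hypothesis, "->", "merge" if merge_gate(trunk, candidate) else "reject")
```

The branch with the better development score is the one the gate rejects here, which is the overfitting case the merge gate is meant to catch. Extending `evidence` from one score to several (accuracy, latency, cost) would be the starting point for the multi-objective search Jin describes.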
Copilot searched your mailbox. LiteLLM handed out admin keys. Run this 5-check audit before your stack is next
Two AI tools broke in the same way in the same two weeks, and four research teams proved it. The pattern underneath every disclosure is one sentence: enterprise AI accepts external input with no trust boundary. On June 15, Varonis disclosed SearchLeak (CVE-2026-42824), a proof-of-concept exfiltration chain in Microsoft 365 Copilot Enterprise Search. A victim clicks a crafted microsoft.com URL, Copilot searches their mailbox, and the data leaves through a Bing SSRF. No plugins, no second click, no visible indicator. Four days earlier, Obsidian Security published a three-CVE chain against LiteLLM that carried a default low-privilege user all the way to admin and remote code execution. Two tools. Two teams. One broken boundary.The five-check audit at the end of this article maps each gap to a CVE or a market signal from June, a command you can run before lunch, and a sentence a CISO can read to the board.Copilot turned a trusted URL into an exfiltration engineSearchLeak chained three weaknesses into a silent data-theft chain. The URL q parameter fed attacker instructions straight to Copilot’s LLM. A rendering race condition fired an image tag before the output sanitizer ran. Bing’s image-search endpoint, allowlisted in the Content Security Policy, routed the stolen data out. Microsoft rated the flaw critical and patched it on the back end, according to Varonis. NVD has not yet scored it; a third-party tracker lists it at 6.5 medium. The severity is contested, but the mechanism is not.The escalation is the real story. This is the third Varonis Copilot exfiltration chain in twelve months, after Reprompt in January and EchoLeak in 2025. Reprompt hit Copilot Personal. SearchLeak hit Enterprise Search. Enterprise inherits the user’s full organizational permissions, so the blast radius is everything that a user can reach.LiteLLM handed a default account to every provider keyThe LiteLLM gateway holds the keys for OpenAI, Anthropic, Azure, and Bedrock behind a single proxy. 
The Obsidian chain runs in three moves. CVE-2026-47101, an authorization bypass, lets a non-admin mint a wildcard API key. CVE-2026-47102 promotes that caller to proxy admin through an unguarded /user/update endpoint. CVE-2026-40217 escapes the code sandbox through exec() with full builtins. Obsidian then demonstrated a reverse shell by injecting a forged tool-call response through LiteLLM’s callback mechanism. Obsidian assessed the combined chain at CVSS 9.9. The developer typed one word. The attacker popped a shell.A separate LiteLLM flaw made the urgency immediate. CVE-2026-42271, a command-injection bug in the MCP test endpoints, landed on the CISA KEV list on June 8 with a June 22 remediation deadline. That KEV entry is not the Obsidian chain. The two are distinct disclosures four days apart, fixed in different releases, pointed at the same gateway. LiteLLM carries more than 40,000 GitHub stars and sits in thousands of enterprise deployments. This is not the first scare, either. A supply-chain compromise backdoored LiteLLM versions 1.82.7 and 1.82.8 on PyPI in March. A compromised gateway exposes every provider credential the organization holds.Langflow and Mini Shai-Hulud proved the pattern scalesThe same boundary broke in two more tools in the same fortnight. Langflow CVE-2026-5027 became the third Langflow remote-code-execution flaw to hit active exploitation this year. A path traversal in file upload lets an attacker write files anywhere on disk, and because Langflow ships with auto-login enabled by default, a single unauthenticated request reaches RCE. VulnCheck confirmed exploitation on June 9. Censys counted roughly 7,000 exposed instances, the heaviest concentration in North America, with MuddyWater attribution.The Mini Shai-Hulud campaign hit a different pressure point. After the worm’s source code went public on May 12, copycat variants compromised 32 Red Hat Cloud Services npm packages on June 1, packages pulled 80,000 times a week. 
The worm harvests more than 20 credential types and self-propagates under the compromised maintainer’s identity.Four teams, four tools, one operating failure. The bug classes differ. SearchLeak is a prompt injection. LiteLLM is privilege escalation. Langflow is path traversal. Mini Shai-Hulud is supply-chain poisoning. The boundary that broke is the same in all four.The market already repriced the riskCrowdStrike’s Q1 FY27 earnings call put a number on the gap. AIDR, the company’s AI detection and response line, grew ending ARR more than 250% sequentially, with a Q2 pipeline above $50 million (SEC-filed 8-K). Total company ARR reached $5.51 billion, and CrowdStrike’s fleet telemetry shows more than 1,800 agentic applications running across enterprise endpoints. On June 17, the company extended AIDR to AWS, adding real-time evaluation of agent, LLM, and MCP communications across Amazon Bedrock, Kiro, and Strands Agents, building on its work with Anthropic’s Project Glasswing. Daniel Bernard, CrowdStrike’s chief business officer, said the AI attack surface now spans development, runtime, identities, and cloud infrastructure, and that teams treating those as separate domains leave the gaps between them open.Practitioners name the same gap in plainer termsDavid Levin, CISO at American Express Global Business Travel, told VentureBeat the pattern does not surprise him. “We kind of have this shadow AI, which is just the new version of shadow IT,” Levin said. Both Langflow and LiteLLM fit the description. Teams stood them up for convenience, gave them credentials, and never brought them under governance. Levin puts the fix before deployment. “We didn’t go into this with just saying we’re going to go do this without the right fundamentals,” he said. “We leverage NIST controls. NIST has released their CSF along with their AI framework. OWASP released their top 10. 
You need the right fundamentals before you deploy.”

Merritt Baer, CSO at Enkrypt AI and former AWS Deputy CISO, named the structural version of the failure in a separate VentureBeat interview. “Enterprises believe they’ve ‘approved’ AI vendors, but what they’ve actually approved is an interface, not the underlying system,” Baer said. “The real dependencies are one or two layers deeper, and those are the ones that fail under stress.” She has tied that directly to how systems fall. “Raw zero-days aren’t how most systems get compromised. Composability is,” Baer told VentureBeat. “It’s the glue between the model and your data where the risk lives. If you give an agent bash and a root token, you’ve already done most of the attacker’s work for them.” That is what rows 2 and 4 of the audit test: the gateway that holds every key, and the agent identity no one governs.

Levin had a sharper frame for the boardroom. “You need to talk more in terms of risk versus compliance to your boards and your executives,” he said. “It’s not about the size of the engineering team anymore. It’s the size of your imagination. It’s all written in plain English. It’s not hard for anyone.” Neither SearchLeak nor LiteLLM needed custom malware or a zero-day to work.

Adam Meyers, CrowdStrike’s SVP of Intelligence, put the operational squeeze in numbers in an exclusive VentureBeat interview. “The problem is not zero-day. The problem is patching. If you 10x that problem, they’re gonna be completely underwater,” Meyers said. He pointed to identity as the second front. “Some of these AI have their own identities, or people give their identity to the AI to take action on their behalf, and that makes it a very complex problem.”

The five-check trust-boundary audit

Each row maps a gap to its proof point, a verification command for Monday morning, the fix, and the sentence to read to the board.

| Trust-Boundary Gap | Proof Point | What Broke | Verify Monday | Fix Monday | Board Language |
| --- | --- | --- | --- | --- | --- |
| 1. Prompt-to-Data | SearchLeak CVE-2026-42824. P2P injection + HTML race + Bing SSRF. One-click mailbox exfiltration via microsoft.com URL. PoC demonstrated; Microsoft rated it critical, NVD not yet scored. | URL q-parameter passed to LLM as instructions. Sanitizer ran after render. Bing acted as exfiltration proxy via CSP allowlist. | Audit CSP allowlists for domains performing server-side fetches. Monitor Copilot Search URLs for encoded payloads. Review Copilot audit logs. | Confirm server-side patch applied. Enable sensitivity labels restricting Copilot. Treat AI streaming output as untrusted. | “Our AI assistant could search employee email and send results to an attacker through a trusted Microsoft URL. Vendor patched it. We must verify configuration.” |
| 2. Gateway Credential Exposure | LiteLLM three-CVE chain (-47101, -47102, -40217). CVSS 9.9. Separate CVE-2026-42271 on CISA KEV (fixed in v1.83.7; full chain fixed in v1.83.14-stable). June 22 deadline. | No role validation on key endpoints. Self-promotion to admin via /user/update. exec() sandbox escape. One gateway exposes all provider keys. | Run pip show litellm. Below 1.83.14-stable = vulnerable. Check /mcp-rest/test/ exposure. Audit proxy_admin accounts. | Upgrade to v1.83.14-stable+. Rotate all provider API keys. Block /mcp-rest/test/* at proxy. Review Custom Code Guardrails. | “Our AI gateway held keys for every provider. A default account could promote itself to admin and steal them all. Rotating and patching now.” |
| 3. AI Tooling Sprawl | Langflow CVE-2026-5027 (CVSS 8.8). Third RCE of 2026. ~7,000 exposed instances. MuddyWater. Active exploitation June 9. | Path traversal in file upload. Auto-login enabled by default. Single unauthenticated request to RCE. | Query Censys/Shodan for Langflow, Flowise, n8n, Dify on your perimeter. Check auto-login. Inventory AI tools outside change management. | Pull AI platforms behind VPN/zero-trust. Enable auth everywhere. Upgrade Langflow to v1.9.0+ (current release 1.10.0). Fingerprint surface continuously. | “AI dev tools are exposed to the internet with login disabled. A nation-state group is exploiting this flaw now. Pulling behind access controls today.” |
| 4. Non-Human Identity Governance | AIDR ARR up 250% (Q1 FY27, SEC 8-K). Q2 pipeline >$50M. 1,800+ agentic apps across enterprise endpoints. | Agents hold identities and act on behalf of humans. Some exceed their intended scope to reach a goal. No standard governs agent credential lifecycle. | Inventory all non-human identities used by agents and MCP servers. Map agent-to-data-store access. Flag agents with write access to security policy. | Least-privilege every agent identity. Set privilege boundaries via identity protection. Runtime detection for policy-exceeding actions. Human-in-the-loop for policy changes. | “AI agents hold credentials and act autonomously. We do not govern their identity lifecycle like human access. The 250% market growth tells us this gap is systemic.” |
| 5. Runtime Agentic Detection | Falcon AIDR expanded to AWS (June 17). Covers Bedrock, Kiro, Strands Agents. MCP integration. Real-time agent/LLM/MCP evaluation. | Traditional tools monitor human-speed actions. Agents run at machine speed, thousands of actions per minute, and route around controls to reach goals. | Test if EDR/XDR links agent actions to originating identity. Verify SIEM ingests MCP communications. Confirm you can distinguish human from agent on endpoint. | Deploy AIDR or equivalent runtime detection. Shadow-AI discovery for all agentic apps, models, MCP servers, identities. Real-time policy enforcement on agent actions. | “We cannot distinguish a human employee from an AI agent acting on their behalf. We need runtime detection at machine speed that can stop damage before it starts.” |

The fix is plumbing, not policy

The June 2 executive order creates an AI Cybersecurity Clearinghouse with a July 2 deadline. The five gaps above are not frontier-model problems.
They are plumbing problems in the gateways, orchestration platforms, identity layers, and runtime environments where AI meets the enterprise. The audit is five rows. Every row maps to a June disclosure or market signal, a command a team can run before lunch, and a sentence a CISO can read to the board. The question is not whether your vendor will patch. It’s whether you find the gap first — or whether an attacker finds it the way they found Copilot and LiteLLM.
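The version check in row 2 of the audit (run pip show litellm and compare against 1.83.14-stable) can also be scripted across many hosts. The following is a minimal sketch, not an official LiteLLM tool: the helper names parse_version, is_vulnerable and check_installed are hypothetical, and the only threshold used is the 1.83.14 figure quoted in the audit. It compares just the three leading numeric components, so it ignores pre-release or build suffixes.

```python
import re
from importlib import metadata

# Threshold taken from the audit row: full chain fixed in v1.83.14-stable.
FIXED = (1, 83, 14)


def parse_version(text):
    """Return the three leading numeric components, ignoring suffixes like '-stable'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_vulnerable(installed, fixed=FIXED):
    """True when the installed version sorts strictly below the fixed release."""
    return parse_version(installed) < fixed


def check_installed(package="litellm"):
    """Look up the package in the current Python environment and report its status."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed in this environment"
    if is_vulnerable(installed):
        return f"{package} {installed}: below fixed release, upgrade"
    return f"{package} {installed}: at or above fixed release"


if __name__ == "__main__":
    print(check_installed())
```

A passing version check covers only the patch level. The endpoint exposure, admin-account audit and key rotation listed in the same row still have to be verified separately.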
Adobe embeds agentic AI workflows across Creative Cloud, shifting from media generation to production orchestration
Adobe has announced a major expansion of its “creative agent” across its flagship Creative Cloud suite and upgraded Firefly AI studio. Available in public beta starting today across Premiere Pro, Photoshop, Illustrator, InDesign, and Frame.io, the agent is designed to serve everyone from individual creators to enterprise marketing teams. Unlike first-generation generative AI tools that simply output flat media from a chat interface, Adobe’s embedded assistant acts as an orchestration layer. It interprets natural language prompts and directly accesses the underlying software’s APIs to execute complex, multi-step production workflows, from batch-renaming video sequences to dynamically updating brand assets across print layouts, while leaving the final aesthetic decisions entirely in the hands of the human designer.

Technology: Contextual Memory and DOM Manipulation

At the core of this release is a significant technical upgrade to how Adobe’s AI handles persistent memory and context window management. In its upgraded Firefly creative AI studio (currently in private beta), Adobe has introduced two foundational architectural components: “Elements” and “Projects”. Elements functions as a visual variables library, allowing users to save and reuse specific characters, locations, and objects across multiple generations to ensure strict visual consistency as campaigns scale. Projects acts as the contextual memory layer, storing assets, generations, and session history in a unified space so users can pick up where they left off without rebuilding their prompt context. Beyond pixel generation, the system’s most critical technological leap is its ability to operate seamlessly within the complex document structures of desktop applications. “Our Adobe Creative Agent can leverage the decades of powerful features, workflows, APIs that we’ve brought into our application and exposed through tooling that can now be invoked through a creative agent,” an Adobe representative explained.
Product: Automating the Tedious, Expanding the Canvas

The practical application of this technology fundamentally alters standard production workflows. Adobe is positioning the human user as a “creative director” capable of delegating repetitive, labor-intensive tasks to the AI. The rollout introduces highly specific specialist agents tailored to the logic of each application:

Premiere Pro: The agent handles tedious project setup, analyzing and sorting source media into bins, batch renaming clips, identifying interview questions, and assembling a rough working starting point.
Illustrator: The assistant automates mathematical and multi-step design tasks, such as generating 50 versioned files from a spreadsheet or running pre-flight checks to flag color mode errors before printing. It can even programmatically duplicate a vector shape 100 times, randomize its position, and change its size based on its z-depth and transparency.
Photoshop & InDesign: The agent executes batch background removals and dynamic layer organization, and applies brand updates across multi-page layouts.

Furthermore, Adobe is actively integrating its creative agent into major third-party enterprise platforms, including OpenAI’s ChatGPT, Anthropic’s Claude, Microsoft 365 Copilot, and soon, Google Gemini and Slack.

Licensing: Commercial SaaS and Enterprise Implications

Unlike open-source orchestration frameworks or models released under MIT or Apache licenses, Adobe’s creative agent operates strictly within a proprietary, commercial SaaS ecosystem. For enterprise decision-makers, this carries specific implications. Because the agent relies on Adobe’s proprietary APIs to manipulate project files, it requires an active Creative Cloud commercial license.
Additionally, because Adobe is bringing the “Adobe for creativity connector” to platforms like Slack and Microsoft Copilot, enterprise IT and systems architects must consider how internal chat tools will interface with Adobe’s cloud processing environments to support enterprise creative and marketing teams securely.

The Enterprise Unknowns: APIs, Governance, and Architecture

While Adobe’s announcements highlight a powerful user interface and deep integration within its own flagship applications, several critical questions remain for enterprise technical decision-makers tasked with building bespoke AI systems. VentureBeat has reached out to Adobe for clarification on these infrastructure-level details and will update this coverage as we learn more.

For AI system architects, the value of a creative agent lies not just in a native application UI, but in its extensibility. It remains unclear if Adobe plans to expose these new agentic capabilities via API, or if the company will support the Model Context Protocol (MCP). Without MCP support or direct API access, enterprise teams will face friction integrating Adobe’s tools into their own custom task-routing frameworks and internal LLM pipelines.

Adobe’s new “Elements” feature promises to solve the generative AI consistency problem by anchoring characters and objects across generations. However, the backend architecture driving this persistent memory is not yet detailed. Whether Adobe is leveraging on-the-fly Low-Rank Adaptation (LoRA) based on user uploads or utilizing a form of visual Retrieval-Augmented Generation (RAG) is a critical distinction for technology leaders managing compute costs, model evaluations, and enterprise-grade inference pipelines.

As organizations build out “Projects” and define brand-specific “Elements”, security and data decision-makers require strict guarantees regarding data provenance and storage.
It is currently unknown exactly where this contextual workflow and vector data lives: specifically, whether it remains strictly sandboxed within the customer’s enterprise Creative Cloud instance on Adobe servers, and how role-based permissions apply to these new agentic workflows.

Finally, as lightning-fast, developer-first, multi-model AI creative platforms like fal.ai gain significant traction among enterprises and developers, Adobe’s position in the broader developer ecosystem remains a point of interest. Whether Adobe views these infrastructure-level API providers as direct competitors to its Firefly AI studio or as potential integration points for bespoke enterprise environments has yet to be seen.

Community Reactions: The Tension Between Automation and Craft

The integration of agentic AI touches on the tension between eliminating drudgery and surrendering creative control. According to Adobe’s recent Creators’ Toolkit Report, which surveyed over 16,000 creators globally, the market is highly receptive to AI as an operational assistant rather than an autonomous creator. Seventy-five percent of surveyed creators describe creative AI as integrated or essential to their current workflows. Eighty-five percent emphasized that the final creative decision must always remain in human hands. This sentiment is central to Adobe’s messaging. By focusing the agent’s capabilities on file organization, layer management, and brand compliance, Adobe aims to automate what a spokesperson called the “tedious parts of their workflow”. The goal, according to Adobe executive David Wadhwani, is to let creatives focus on the craft so they can “apply their taste and make the calls that only they can”.
AWS enters the context layer race with a graph that learns from agents, not manual curation
Building a context layer between enterprise data stores and AI agents is bespoke work, with no standard service to automate or maintain the graphs over time. Amazon is making a direct play to change that.

Amazon on Wednesday entered the space, announcing a series of three products it’s positioning as a context intelligence stack for AI agents. The centerpiece is AWS Context, a new knowledge graph service that gets smarter through agent usage over time. AWS also announced the general availability of Amazon S3 Annotations and a preview of skill assets in AWS Glue Data Catalog.

The context layer is now a contested architectural category with no shortage of options from different vendors. AWS is entering that market with a different architectural premise: that the graph should learn from how agents use it automatically, without human re-curation.

“Your agents now get smarter without you having to rebuild anything from scratch,” said Swami Sivasubramanian, vice president of Agentic AI at AWS, during his AWS Summit NYC keynote. “This service automatically builds a knowledge graph from all your existing data,” he said. “This service infers relationships across your data sets, business rules, and domain knowledge, and makes all of it available to your agents and your organization at runtime.”

AWS Context builds a self-learning knowledge graph from existing data

It’s a problem AWS says it has seen repeatedly in customer deployments. AWS Context maps relationships across existing data automatically: what tables exist, what columns mean, how sources relate and which sources are authoritative. It combines semantic search with graph-level reasoning and infers relationships across datasets, business rules and domain knowledge, making all of it available to agents at runtime.

“The knowledge graph improves itself over time as it learns which sources produce correct results and which parts get used,” Sivasubramanian said.
Data stewards manage the graph through the AWS Management Console, reviewing inferred relationships, promoting them to production and attaching business definitions and usage rules. Every query inherits the calling user’s IAM and Lake Formation permissions, making agent data access auditable by identity through controls enterprises already rely on.

All metadata is published in Apache Iceberg format to Amazon S3 Tables, queryable via Athena, Redshift, Spark or any Iceberg-compatible engine, with no proprietary APIs. Third-party catalog connections are supported, so context from systems outside AWS can be pulled into the same graph. Agents query through agentic search APIs and MCP tools across Bedrock AgentCore, EKS or any MCP-compatible framework.

Context is more than just a single service

Context is a complicated space and AWS is layering multiple services to help enterprises build context across the data stack.

Amazon S3 Annotations. This service enables users to attach rich business context at the storage layer, directly to individual S3 objects.
AWS Glue Data Catalog skill assets. Glue skill assets attach domain knowledge at the catalog layer, linking runbooks, query patterns and usage rules to data assets across the estate.

AWS Context then synthesizes both into the knowledge graph that agents query at runtime, combining semantic search with graph-level reasoning across structured and unstructured sources. Each layer feeds the next.

AWS is entering a highly competitive context space

Snowflake announced its context approach earlier this month with its Horizon Context and Cortex Sense services. Microsoft is providing context via its Fabric IQ platform, which provides a semantic ontology for data. Redis has developed a context platform that optimizes data for retrieval.
Vector database vendor Pinecone has its Nexus context offering that compiles enterprise data into task-specific artifacts before agents ever query them.

AWS’s structural argument is straightforward: for enterprises already running S3, Glue and Lake Formation, AWS Context extends an existing identity model with no data movement required. The pitch is zero-integration friction, not just cost consolidation.

“Context makes agents more powerful and as the whole world is building agents, every agentic platform vendor needs a context capability,” Holger Mueller, VP and Principal analyst at Constellation Research, told VentureBeat.

Mueller noted that AWS is no exception. “The concern, as with all context offerings, is going to be performance, especially for transactional data, we will see,” he said.
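The premise that a graph "learns which sources produce correct results" can be illustrated with a toy model. The sketch below is not the AWS Context API, whose internals the announcement does not describe; the class UsageWeightedGraph, its methods, and the example source names are all hypothetical. It assumes the simplest possible learning rule: count correct and incorrect agent outcomes per source and rank sources by a smoothed success rate.

```python
from collections import defaultdict


class UsageWeightedGraph:
    """Toy context graph: each concept links to candidate data sources (hypothetical design)."""

    def __init__(self):
        # concept -> source -> [successes, attempts]
        self._stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def add_source(self, concept, source):
        """Register a source for a concept with no usage history yet."""
        self._stats[concept][source]  # accessing the key creates the default entry

    def record_outcome(self, concept, source, correct):
        """Feed back whether an agent answer built on this source was judged correct."""
        stats = self._stats[concept][source]
        if correct:
            stats[0] += 1
        stats[1] += 1

    def score(self, concept, source):
        """Laplace-smoothed success rate, so unused sources start at 0.5."""
        successes, attempts = self._stats[concept][source]
        return (successes + 1) / (attempts + 2)

    def ranked_sources(self, concept):
        """Sources for a concept, most reliable first."""
        sources = self._stats[concept]
        return sorted(sources, key=lambda s: self.score(concept, s), reverse=True)


graph = UsageWeightedGraph()
graph.add_source("revenue", "sales.dashboard_export")
graph.add_source("revenue", "finance.ledger")
for _ in range(3):
    graph.record_outcome("revenue", "finance.ledger", correct=True)
for _ in range(2):
    graph.record_outcome("revenue", "sales.dashboard_export", correct=False)
print(graph.ranked_sources("revenue"))
```

In this toy, ranking shifts with usage alone and no one edits the graph by hand, which is the contrast with manual curation that the article draws. A production system would also need the identity and permission checks described above.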
Anthropic ships major Claude Design overhaul with design system imports, code round-trips, and a fix for its token-burning problem
When Anthropic quietly released Claude Design in April as a “research preview,” it generated the kind of instant traction most product teams dream about: more than one million users in its first week. It also generated a problem. The tool consumed tokens so voraciously that a PCWorld reviewer burned through 80 percent of his weekly Claude Pro allowance in roughly 25 minutes, producing just three variations of a single webpage prototype. “We’re talking another token-hungry Claude product here,” the reviewer wrote, “one that Pro users in particular will barely be able to use before burning through their usage limits.”

Two months later, Anthropic is shipping a substantially overhauled version of Claude Design that attempts to fix the consumption issue while simultaneously repositioning the product from a flashy demo into something far more strategically important: a design system compliance layer that connects to code, connects to the tools enterprises already use, and, critically, keeps everything on brand.

The update, announced Wednesday, arrives at a moment when Anthropic is executing one of the most aggressive product expansions in the AI industry’s brief history. In the past ten weeks alone, the company has launched Claude Opus 4.8, released (and then suspended) the Mythos-class Fable 5 model, shipped ten agent templates for financial services, announced a multi-year alliance with DXC Technology to embed Claude inside the IT infrastructure of the world’s largest banks and airlines, rolled out Claude for Small Business with integrations into QuickBooks and PayPal, and published research showing that Claude Code users now average 20 hours per week on the tool.
Claude Design’s transformation from prototype toy to enterprise platform is the latest move in a company-wide strategy to make Claude not just an assistant people talk to, but a worker embedded in the systems where work actually happens.

How design system imports make Claude Design an enterprise brand-compliance tool

The headline feature in Wednesday’s update is not the new drag-and-resize editor, nor the expanded list of export destinations, though both matter. The feature that signals where Anthropic is heading is the rebuilt design system import.

Users can now bring one or several design systems into Claude Design from a GitHub repository, design files, or raw uploads. Once imported, Claude builds with those components, checks its output against the design system, and auto-corrects before the user ever sees the result. For larger organizations, a new admin role can approve a single standard system and lock down edits, ensuring that every asset Claude produces conforms to company guidelines.

This is a meaningful departure from the tool’s original positioning. In April, Claude Design was a blank canvas: give it a prompt, and it would generate something visually impressive but stylistically arbitrary. Business Insider tested it against Canva AI for a photography workshop slide deck and found that Claude Design “anticipated my needs” and “identified its own errors and corrected them without prompting.” But the output reflected Claude’s aesthetic judgment, not the user’s brand. For an individual freelancer or a startup founder sketching ideas, that was fine. For a 10,000-person enterprise with a 200-page brand standards document, it was a non-starter.

The design system import changes that equation.
By ingesting a company’s actual components (its buttons, typography, color tokens, spacing rules) and then validating output against them before surfacing results, Claude Design is attempting something that most human designers struggle with: consistent brand compliance at speed and scale. The admin lockdown feature, which prevents individual users from overriding the approved system, is a direct play for the enterprise procurement conversation, where “can we control what it produces?” is often the first question.

Why the Claude Code round-trip could end the design-to-engineering handoff problem

The second major update is the bidirectional integration between Claude Design and Claude Code. Users can now run /design-sync in Claude Code to import their local codebase’s design system into Claude Design, ensuring that prototypes start from real components rather than approximations. When a design is ready to ship, it hands off to Claude Code, which picks up exactly where the designer left off: no screenshot, no rebuild. The integration works in reverse, too. From a Claude Code terminal, the /design command lets developers create, edit, and sync design projects without leaving their workflow.

This matters because the handoff between design and engineering has been one of the most persistent friction points in software development for decades. Tools like Figma’s Dev Mode and Zeplin have tried to bridge the gap by generating specifications and code snippets from design files, but the translation has always been lossy. A designer’s prototype and an engineer’s implementation inevitably diverge, creating a cycle of visual QA, redlines, and “that’s not what the mockup looked like” conversations.

Anthropic is betting that if the same AI system both designs and codes, and if both modes share the same underlying component library, the gap disappears.
It is, in effect, arguing that the design-to-code problem was never really about better specification formats or smarter handoff tools. It was about the fact that two different humans (or two different tools) were interpreting the same intent. A single AI system that operates on both sides of the workflow doesn’t need to interpret; it just continues.

The timing of this integration is also significant in light of Anthropic’s own research. Just yesterday, the company published an analysis of roughly 400,000 Claude Code sessions showing that domain expertise, not coding proficiency, is the primary driver of successful outcomes. Every major occupation succeeded at coding tasks at nearly the same rate as software engineers. If designers can now move fluidly between visual prototyping and code implementation through a single AI system, the research suggests they will succeed not because they learned to code, but because they deeply understand the design problems they are solving.

Token consumption gets a fix, but the economics of generative design remain tight

The token consumption issue that dogged Claude Design’s launch was not just a user experience annoyance; it was a structural threat to the product’s viability. If a $20-per-month Pro subscriber could exhaust their entire weekly allowance in a single 30-minute session, the tool was effectively inaccessible to the individual users and small teams who drove its initial viral adoption.

Anthropic’s response is twofold. First, Claude Design now shares usage limits with chat, Claude Cowork, and Claude Code, rather than drawing from a separate, smaller pool. This gives most users significantly more headroom. Second, the company says it has reduced the average token consumption per turn while maintaining output quality, and that error rates have dropped sharply.

Whether this is enough remains an open question. The fundamental tension is architectural: generative design is inherently token-expensive.
Every variation Claude produces requires the model to reason about layout, typography, color, spacing, responsiveness, and content simultaneously, then generate a complete, functional artifact. That is a fundamentally different workload than answering a question in chat, and it consumes tokens accordingly. Anthropic’s efficiency improvements may push the breaking point further out, but they do not eliminate the underlying economics. For enterprise customers on Team and Enterprise plans with higher limits, this may be a non-issue. For Pro subscribers, the math is still likely to be tight.

The new editor helps mitigate this somewhat by giving users direct control over individual elements (drag, resize, and align) without burning a model turn for every small adjustment. Hundreds of stability fixes also mean fewer wasted turns on errors and regenerations, which were a significant source of token drain in the original release. These are not glamorous improvements, but they are the kind of grind work that separates a research preview from a daily-use tool.

Nine new export partners position Claude Design as a creative hub, not a destination

The update’s third pillar is an expanded set of export destinations. Claude Design now sends work to Adobe, Base44, Canva, Gamma, Lovable, Miro, Replit, Vercel, and Wix, in addition to PDF and PowerPoint. The breadth of this list reveals a deliberate positioning strategy: Anthropic is building Claude Design not as a place where work is finished, but as the place where it begins.

The partner quotes tell the story.
Replit’s president Michele Catasta frames the integration as meeting “builders wherever ideas begin.” Canva’s Anwar Haneef describes the flow from Claude Design as turning “a first draft” into “a finished asset, kept on-brand, personalized for the moment.” Vercel’s Andrew Qu talks about pushing a concept “straight to Vercel to ship.” In each case, Claude Design is the origin point, and the partner tool is where polish, collaboration, and deployment happen.

This hub-and-spoke model also serves as a defensive moat against the open-source alternative that has emerged with surprising speed. Open Design, a community-built project tracked by Augment Code, reached 57,400 GitHub stars and 310 contributors in just eight weeks after Claude Design’s launch. It offers local-first operation, model flexibility supporting 16 different coding agents, and 259 skills with 142 design systems, all without cloud lock-in. Augment Code’s Paula Hingel noted that for “teams that need to self-host, use their own API keys, or swap models, Open Design is currently the only local-first option with this level of skill and design system coverage.”

Anthropic’s answer to this competitive pressure is not to match Open Design on self-hosting or model flexibility; those are philosophical concessions the company is unlikely to make. Instead, it is building an integration ecosystem that open-source projects cannot easily replicate. A native Adobe Express connector, a verified Canva export pipeline, a first-party Vercel deployment path: these are partnerships, not features, and they require business relationships that community projects cannot forge at the same pace.

Claude Design fits into Anthropic’s broader push to embed AI across the entire enterprise stack

To understand why Claude Design’s evolution matters, it helps to zoom out.
Anthropic is building a product surface that now spans creative work (Design), code (Code), knowledge work (Cowork), and enterprise operations (Managed Agents), all unified by the same underlying models and, increasingly, by shared context that carries across tools.

The trajectory of the past quarter makes the pattern unmistakable. In May, Anthropic launched Claude for Small Business with connectors to QuickBooks, PayPal, and HubSpot, putting Claude inside the tools that small business owners already use for payroll, invoicing, and marketing. The same month, the company released ten agent templates for financial services covering everything from pitchbook creation to KYC screening, with connectors to FactSet, S&P Capital IQ, and Morningstar. Claude Opus 4.8 shipped on May 28 with a “dynamic workflows” feature enabling hundreds of parallel sub-agents in a single Claude Code session. Then came the Fable 5 and Mythos 5 launch on June 9, followed almost immediately by a US government export control directive that suspended access to both. DXC Technology announced a multi-year alliance to train tens of thousands of Claude-certified engineers to embed Claude inside the systems it operates for major banks, airlines, and insurers.

The design system you import into Claude Design is the same component library that Claude Code uses to implement. The financial model you build in Claude for Excel can flow into a pitchbook created in Claude Design and exported to PowerPoint. The brand assets a small business owner creates through Claude Design can be pushed directly to Canva for team collaboration. This is not a chatbot strategy.
It is a platform strategy, and the Claude Design update, with its design system imports, code round-trips, and export ecosystem, is one of the clearest expressions of it yet.

Anthropic also published an engineering deep-dive last month detailing how it contains Claude across products using sandboxes, virtual machines, and egress controls, infrastructure that becomes more critical as tools like Claude Design gain access to proprietary design systems and brand assets. The containment architecture reveals both the ambition and the risk: the more deeply Claude embeds into enterprise workflows, the higher the stakes when something goes wrong, and the more sophisticated the security envelope must become.

Three questions will determine whether Wednesday’s update delivers on its ambitions. First, whether the token economics actually work for the broadest user base: shared limits and efficiency gains help, but generative design remains expensive. Second, whether the design system import proves robust enough for real enterprise use, because ingesting a GitHub repository of React components and faithfully using them across dozens of design variations is a genuinely hard technical problem. And third, whether the Claude Code round-trip actually eliminates the design-engineering gap or merely shifts it.

Claude Design launched two months ago as a thing people tried once and marveled at. Anthropic is now trying to make it a thing people use every day and, more than that, a thing their entire team trusts to stay on brand while they do. In the AI industry, the distance between a viral demo and an indispensable tool has swallowed more products than it has produced. Anthropic just bet that design systems, not just design prompts, are the bridge across.
Why Weibo’s tiny VibeThinker-3B has the AI world arguing over benchmarks again
On Sunday, a team of nine researchers at Sina Weibo, the Chinese social media giant better known for its microblogging platform than for cutting-edge artificial intelligence, quietly posted a 14-page technical report to arXiv that sent shockwaves through the AI research community. Their claim: a language model with just 3 billion parameters can match or exceed the reasoning performance of flagship systems from Google DeepMind, OpenAI, Anthropic, and DeepSeek that are hundreds of times larger.

The model, called VibeThinker-3B, scored 94.3 on AIME 2026 (the American Invitational Mathematics Examination, one of the most demanding standardized math competitions in the world). That figure places it alongside DeepSeek V3.2, a model with 671 billion parameters, and ahead of Gemini 3 Pro, Google’s high-performance flagship reasoning system, which scored 91.7. With a test-time scaling technique the team calls Claim-Level Reliability Assessment, the score climbs to 97.1, edging past virtually every system in the public record.

Within hours of publication, the paper had drawn 62 upvotes on Hugging Face’s daily papers feed, the model repository had accumulated 130 likes, and the GitHub repository had reached 685 stars. But the reaction on social media was not uniformly celebratory. It was, in many cases, deeply skeptical.

“WHAT THE HELL is happening in AI?” wrote the user @orcus108 on X, in a post that accumulated over 161,000 views. “A 3B parameter model just put up coding benchmark scores in the same league as Claude Opus 4.5… I genuinely don’t know if this is a breakthrough or if the benchmarks are broken.”

That tension, between genuine scientific advancement and the growing suspicion that AI benchmarks have become gameable to the point of meaninglessness, sits at the heart of the VibeThinker-3B story.
And the answer matters enormously, not just for academic bragging rights, but for the multibillion-dollar question of whether the AI industry’s relentless push toward ever-larger models is the only path to intelligence.

Benchmark scores that defy the scaling laws of modern AI

The results reported in the technical report are, by any conventional standard, extraordinary.

On the mathematics side, VibeThinker-3B achieved 91.4 on AIME 2025, 94.3 on AIME 2026, 89.3 on HMMT 2025 (the Harvard-MIT Mathematics Tournament), 93.8 on BruMO 2025 (the Brown University Math Olympiad), and 76.4 on IMO-AnswerBench, a benchmark comprising 400 problems at the level of the International Mathematical Olympiad. In coding, it posted an 80.2 Pass@1 on LiveCodeBench v6, a benchmark designed to test executable code generation, and achieved a 96.1 percent acceptance rate on unseen LeetCode weekly and biweekly contests from late April through late May 2026. On instruction following, it scored 93.4 on IFEval.

To put the parameter disparity in perspective: DeepSeek V3.2 has 671 billion parameters, roughly 224 times the size of VibeThinker-3B. GLM-5, from Zhipu AI, has 744 billion parameters. Kimi K2.5, from Moonshot AI, exceeds 1 trillion. VibeThinker-3B’s 3 billion parameters could run on a consumer laptop.

The researchers frame this result not as an anomaly but as evidence for a broader theoretical claim. They introduce what they call the “Parametric Compression-Coverage Hypothesis,” which argues that different types of AI capability have fundamentally different relationships to model size. Verifiable reasoning (the kind tested by math competitions and coding challenges, where answers can be definitively checked) is what the paper calls a “parameter-dense” capability: one that can be compressed into a compact core.
Open-domain knowledge, by contrast, is “parameter-expansive,” requiring broad coverage across facts, concepts, and edge cases that inherently demands more parameters.

The paper acknowledges this distinction directly. On GPQA-Diamond, a graduate-level science knowledge benchmark, VibeThinker-3B scored just 70.2, well behind the 91.9 achieved by Gemini 3 Pro and the 87.0 scored by Claude Opus 4.5. The authors write that this gap “is consistent with our claim rather than a contradiction to it: the main finding is not that a 3B model has fully replaced leading general-purpose models, but that a small model can reach first-tier performance on many verifiable reasoning tasks.”

Inside the four-stage training pipeline that powers a tiny reasoning engine

VibeThinker-3B is not built from scratch. It is post-trained on top of Qwen2.5-Coder-3B, a compact foundation model from Alibaba’s Qwen team, through what the Weibo AI researchers call the “Spectrum-to-Signal Principle,” a multi-stage pipeline first introduced in the team’s earlier VibeThinker-1.5B work in November 2025.

The training unfolds in four major phases. The first is a two-stage supervised fine-tuning process that uses curriculum learning: the model first trains on a broad mixture of math, code, STEM reasoning, general dialogue, and instruction-following data, then shifts to a curated subset of harder, longer-horizon reasoning problems. In that second fine-tuning stage, samples with reasoning traces shorter than 5,000 tokens are discarded, and problems that VibeThinker-1.5B can solve more than 75 percent of the time are filtered out, forcing the model to focus on genuinely difficult challenges.

The second phase applies reinforcement learning across multiple domains (mathematics, code, and STEM) using the team’s MaxEnt-Guided Policy Optimization algorithm, or MGPO, which prioritizes training on problems at the model’s current capability boundary rather than problems it already solves easily or finds impossible.
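The difficulty filter used in the second fine-tuning stage can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the team's code: the two thresholds come from the description above, while the `Sample` type, its field names, and the idea of a precomputed pass rate for the smaller reference model are hypothetical.

```python
from dataclasses import dataclass

# Thresholds as described in the article for the second SFT stage.
MIN_TRACE_TOKENS = 5_000
MAX_REFERENCE_PASS_RATE = 0.75


@dataclass
class Sample:
    problem_id: str
    trace_tokens: int           # length of the reasoning trace, in tokens
    reference_pass_rate: float  # hypothetical field: share of attempts the smaller reference model solves


def keep_for_stage_two(sample: Sample) -> bool:
    """Keep long-trace problems that the reference model does not already solve reliably."""
    long_enough = sample.trace_tokens >= MIN_TRACE_TOKENS
    hard_enough = sample.reference_pass_rate <= MAX_REFERENCE_PASS_RATE
    return long_enough and hard_enough


def filter_stage_two(samples: list[Sample]) -> list[Sample]:
    """Return the subset of samples that survives both filters, in the original order."""
    return [s for s in samples if keep_for_stage_two(s)]
```

In this sketch a sample sitting exactly on either threshold is kept, since the article says only that shorter traces and problems solved "more than" 75 percent of the time are dropped.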
Notably, the team found that a strategy that worked well at the 1.5B scale, progressively expanding the context window during RL training, actually hurt performance at 3B. They hypothesize that the stronger starting checkpoint meant that truncating reasoning traces during warm-up was no longer removing noise but disrupting valid reasoning patterns. The solution was to train with a single 64,000-token context window throughout.

Within the math RL phase, the team also introduces what it calls “Long2Short Math RL,” a secondary optimization stage that redistributes rewards to favor shorter correct solutions over longer ones, reducing verbosity without sacrificing accuracy. The technique uses a zero-sum reward redistribution that avoids biasing the overall reward signal while nudging the model toward more efficient reasoning.

The third phase extracts high-quality reasoning trajectories from the RL-trained checkpoints and distills them back into a unified model through supervised fine-tuning. The team uses a “learning-potential score,” essentially the student model’s perplexity on each teacher trajectory, to prioritize traces that are correct but that the student has not yet internalized. The final phase, called Instruct RL, applies reinforcement learning on instruction-following tasks using a combination of rule-based validators for format constraints and rubric-based reward models for open-ended quality assessment.

Francesco Bertolotti, an AI researcher who flagged the paper early on X, described the approach succinctly: “These results were achieved primarily through post-training refinements on Qwen2.5-Coder. The paper doesn’t provide many details, but it appears they distill from RL ckpts and then do a final RL-based instruct RL.” His post drew over 161,000 views.

Real-world testing reveals the gap between benchmark scores and practical AI performance

For every enthusiastic reaction, the paper drew an equally forceful objection.
The AI research community in mid-2026 has grown deeply wary of benchmark-driven claims, and VibeThinker-3B arrived in an environment primed for suspicion.

“The benchmarks are literal pattern matching single file coding,” wrote @BigMoonKR on X. “It has no relation to actual coding work. I don’t know how people still don’t get this.”

“Benchmaxxing,” declared @oflu_bedirhan, using a term that has become shorthand in the AI community for models that appear optimized specifically for benchmark performance at the expense of real-world utility.

The most pointed criticism came from users who actually downloaded and tested the model. “Just tried the full precision,” wrote @politilols. “It doesn’t even know what a uv script (so the most popular Python dev tool) is. Haven’t seen that in a single LLM in at least a year now. Benchmaxxed.” When Bertolotti responded that the model seemed more focused on mathematical reasoning than practical coding, the user countered: “They include a livecodebench score. Zero chance that is reflective of the model.”

@Itsdotdev raised a structural criticism: “Look into the benchmarks themselves and it probably won’t be so shocking. Why no DeepSWE? Why none of the standard benchmarks SOTA providers use?” The user @AvenirReym posed a more diagnostic question: “If it holds on a benchmark made after the model’s training cutoff, it’s real. If it only wins on AIME-style sets that have been circulating for years, it’s leakage.”

The paper’s authors appear to have anticipated these objections. The technical report states that training sets “have undergone strict benchmark decontamination,” including n-gram-based filtering to remove “n-gram overlaps with evaluation sets.”

The LeetCode contest evaluation (covering contests from April 25 to May 31, 2026, dates that postdate any plausible training data cutoff) represents the most robust guard against data contamination concerns.
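For readers unfamiliar with n-gram decontamination, the general idea can be sketched as follows. This is a generic illustration rather than the report's procedure: the 8-token window, the lowercase whitespace tokenization, and the function names are all assumptions chosen for the example.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_samples: list[str], eval_samples: list[str], n: int = 8) -> list[str]:
    """Drop every training sample that shares at least one n-gram with any evaluation sample."""
    eval_ngrams: set[tuple[str, ...]] = set()
    for text in eval_samples:
        eval_ngrams |= ngrams(text, n)
    return [t for t in train_samples if ngrams(t, n).isdisjoint(eval_ngrams)]
```

A filter like this catches verbatim or near-verbatim copies of benchmark problems but not paraphrases, which is one reason evaluations on problems published after the training cutoff carry more weight with skeptics.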
On those contests, VibeThinker-3B passed 123 out of 128 first-attempt submissions, a 96.1 percent rate that exceeded GPT-5.2, Doubao Seed 2.0 Pro, Kimi K2.5, and Claude Opus 4.6 under identical evaluation conditions.

Still, real-world user reports suggest a significant gap between benchmark performance and practical utility, a phenomenon that has become familiar across the industry. “In LM Studio it only responds well to first question, next questions reply to the first question,” reported @luismolinaab.

Why a social media company may have found a crack in the scaling hypothesis

Even the sharpest critics acknowledged that achieving these benchmark numbers at 3 billion parameters, regardless of how transferable they are to production use cases, is a meaningful engineering achievement. “Even if it’s benchmaxxing doing so with 3B parameters is fascinating, goes to show how fast this field is progressing,” wrote @rohityin.

The observation cuts to a question that has consumed the AI industry since the advent of the scaling hypothesis: Is bigger always better? The conventional wisdom, articulated most famously in the Chinchilla scaling laws and reinforced by the commercial dominance of ever-larger foundation models, holds that more parameters and more training data reliably yield better performance. The economic corollary is stark: training and deploying frontier models costs tens or hundreds of millions of dollars, creating enormous barriers to entry.

VibeThinker-3B challenges that consensus, but only partially. The paper is careful to draw a boundary around its claims, distinguishing between tasks with “clear verification signals” and those that require broad factual knowledge.
The Parametric Compression-Coverage Hypothesis explicitly argues that small models cannot replace large ones across the board.

“The true significance of VibeThinker-3B does not lie in proving that a 3B model can replace large-scale generalists,” the paper states, “but rather in providing a concrete empirical signal: the development of compact models is no longer merely a passive compromise for deployment efficiency or cost control; it emerges as a promising research trajectory that is fundamentally complementary to the traditional parameter scaling paradigm.”

Perhaps the most surprising element of the work is its provenance. Sina Weibo, publicly traded on Nasdaq and in Hong Kong, with a market capitalization that fluctuates in the single-digit billions, is not a company typically associated with frontier AI research. Yet with the VibeThinker series, Weibo has now made its second major open-source AI contribution in seven months. VibeThinker-1.5B, released in November 2025, demonstrated that a model with just 1.5 billion parameters could outperform the original DeepSeek R1 on several math benchmarks, a result the team achieved for what it claimed was a post-training cost of just $7,800, compared to the $294,000 estimated for DeepSeek R1.

The research team is compact: nine authors, all listed as Sina Weibo Inc. employees. The model is released under the MIT License, one of the most permissive open-source licenses available, and the weights are freely downloadable from both Hugging Face and ModelScope. Within the first day of release, community members had already created GGUF quantizations and derivative models.

Small models, big implications, and the question the AI industry can no longer avoid

The most honest assessment of VibeThinker-3B may be that it is simultaneously less and more than what the benchmarks suggest. Less, because a model that struggles with basic knowledge of popular developer tools is unlikely to replace any production-grade coding assistant anytime soon.
More, because the underlying insight (that reasoning ability and factual knowledge are partially decoupled, and that the former can be compressed far more aggressively than previously assumed) has profound implications for how the industry thinks about model design, deployment economics, and the accessibility of advanced AI capabilities.

If the Parametric Compression-Coverage Hypothesis holds, it suggests a future in which small, specialized reasoning engines operate alongside large knowledge-rich models in hybrid architectures, a vision where a 3-billion-parameter model handles the logical heavy lifting while a larger system supplies the factual grounding. Such an architecture could dramatically reduce the cost of deploying AI reasoning capabilities, potentially bringing competition-level mathematical and coding performance to devices with modest hardware.

“The interesting part is that we’re starting to separate knowledge from reasoning,” wrote @RealLambdaFlux on X. “A small model with strong post-training can punch way above its size on tasks with clear feedback.”

@cmitsakis suggested the practical endgame: “I think small models are the future for agents because they can use tools to get the knowledge and they can run fast and cheap.”

Whether that future arrives through VibeThinker-3B specifically, or through the dozens of teams now racing to reproduce and extend these results, the paper has already accomplished something that no benchmark score can fully capture.

It has forced the AI community to confront an uncomfortable possibility: that for years, the industry may have been spending billions of dollars scaling up parameters to improve a kind of intelligence that could have fit, all along, on a laptop. The weights are public. The code is open. And the most important test isn’t on any leaderboard; it’s whether anyone can make a model this small actually useful in the real world.