Meta has been one of the most interesting companies of the generative AI era — initially gaining a huge and loyal following of users with the release of its mostly open source Llama family of large language models (LLMs) beginning in early 2023, but coming to a screeching halt last year after Llama 4 debuted to mixed reviews and, ultimately, admissions of gaming benchmarks.

That bumpy rollout apparently spurred Meta founder and CEO Mark Zuckerberg to totally overhaul Meta’s AI operations in the summer of 2025, forming a new internal division, Meta Superintelligence Labs (MSL), and recruiting 29-year-old former Scale AI co-founder and CEO Alexandr Wang to lead it as Chief AI Officer.

Today, Meta is showing the fruits of that effort: Muse Spark, a new proprietary model that Wang says (posting on rival social network X, used more often by the machine learning community) is “the most powerful model that meta has released,” with “support for tool-use, visual chain of thought, & multi-agent orchestration.” He also says it will be the start of a new Muse family of models, raising questions about what will become of Meta’s popular Llama lineup and its ongoing development.

It arrives not as a generic chatbot, but as the foundation for what Wang calls “personal superintelligence” — an AI that doesn’t just process text but “sees and understands the world around you” to act as a digital extension of the self, echoing Zuckerberg’s public manifesto on personal superintelligence published in the summer of 2025.

However, it is proprietary only — confined for now to the Meta AI app and website, as well as a “private API preview to select users,” according to Meta’s blog post announcing it — a move likely to rankle the literally billions of users of Llama models and the thousands of developers who relied upon the family (some of whom are active participants in rival social network Reddit’s r/LocalLLaMA subreddit). No pricing information for the model has yet been announced.

It’s unclear if Meta has ended development on the Llama family entirely. When asked directly by VentureBeat, a Meta spokesperson said in an email: “Our current Llama models will continue to be available as open source,” which doesn’t address the question of future Llama development.

Visual chain-of-thought

At its core, Muse Spark is a natively multimodal reasoning model. Unlike previous iterations that “stitched” vision and text together, Muse Spark was rebuilt from the ground up to integrate visual information across its internal logic. This architectural shift enables “visual chain of thought,” allowing the model to annotate dynamic environments — identifying the components of a complex espresso machine, or correcting a user’s yoga form via side-by-side video analysis.

The most significant technical leap, however, is a new “Contemplating” mode. This feature orchestrates multiple sub-agents to reason in parallel, allowing Meta to compete with extreme reasoning models like Google’s Gemini Deep Think and OpenAI’s GPT-5.4 Pro. In benchmarks, this mode achieved 58% on “Humanity’s Last Exam” and 38% on “FrontierScience Research,” figures that Meta claims validate its new scaling trajectory.

Perhaps more impressive for the company’s bottom line is the model’s efficiency. Meta reports that Muse Spark achieves its reasoning capabilities using over an order of magnitude less compute than Llama 4 Maverick, its previous mid-size flagship. This efficiency is driven by a process called “thought compression”: during reinforcement learning, the model is penalized for excessive “thinking time,” forcing it to solve complex problems with fewer reasoning tokens without sacrificing accuracy.
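Meta has not published the exact training objective, but the general recipe of length-penalized reward shaping is well established. A minimal sketch of what such an objective might look like follows; the penalty form, token budget, and weights are illustrative assumptions, not Meta’s actual values.

```python
# Illustrative sketch of "thought compression"-style reward shaping.
# Meta has not published its objective; the penalty form and weights
# here are assumptions for illustration only.

def shaped_reward(is_correct: bool,
                  num_reasoning_tokens: int,
                  token_budget: int = 2048,
                  correctness_weight: float = 1.0,
                  length_penalty: float = 0.2) -> float:
    """Reward = task correctness minus a penalty for overlong reasoning.

    The penalty only kicks in past a token budget, so the model is free
    to think, but pays for thinking longer than the problem requires.
    """
    reward = correctness_weight if is_correct else 0.0
    overage = max(0, num_reasoning_tokens - token_budget)
    reward -= length_penalty * (overage / token_budget)
    return reward

# A correct answer using 3,000 reasoning tokens scores lower than one
# using 1,500, nudging the policy toward shorter reasoning traces.
print(shaped_reward(True, 3000))  # ~0.907
print(shaped_reward(True, 1500))  # 1.0
```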
Benchmarks reveal a return to form

The launch of Muse Spark is framed as a statistical “quantum leap,” ending Meta’s year-long absence from the absolute frontier of AI performance. Reconciling Meta’s official internal data with independent auditing from third-party LLM tracking firm Artificial Analysis, a clear picture emerges: Muse Spark is not just a marginal improvement over the Llama series; it is a fundamental re-entry into the “Top 5” global models.

According to the Artificial Analysis Intelligence Index v4.0, Muse Spark achieved a score of 52. For context, Meta’s previous flagship, Llama 4 Maverick, debuted in 2025 with an Index score of just 18. By nearly tripling that performance, Muse Spark now sits within striking distance of the industry’s most elite systems, trailing only Gemini 3.1 Pro Preview (57), GPT-5.4 (57), and Claude Opus 4.6 (53).

Meta’s official benchmarks suggest that Muse Spark is particularly dominant in multimodal reasoning, specifically where visual figures and logic intersect:

- CharXiv Reasoning: In “figure understanding,” Muse Spark achieved a score of 86.4, significantly outperforming Claude Opus 4.6 (65.3), Gemini 3.1 Pro (80.2), and GPT-5.4 (82.8).
- MMMU Pro: Official reports place the model at 80.4, while Artificial Analysis’s independent audit measured it at 80.5%. This makes it the second-most capable vision model on the market, surpassed only by Gemini 3.1 Pro Preview (83.9% official; 82.4% independent).
- Visual Factuality (SimpleVQA): Muse Spark scored 71.3, placing it ahead of GPT-5.4 (61.1) and Grok 4.2 (57.4), though it narrowly trails Gemini 3.1 Pro (72.4).

These scores validate Meta’s focus on “visual chain of thought,” enabling the model not just to recognize objects, but to reason through complex spatial problems and dynamic annotations.

The “Thinking” gear of Muse Spark was put to the test against specialized benchmarks designed to break non-reasoning models:

- Humanity’s Last Exam (HLE): In this multidisciplinary evaluation, Meta reports a score of 42.8 (no tools) and 50.4 (with tools). Independent audits by Artificial Analysis tracked the model at 39.9%, trailing Gemini 3.1 Pro Preview (44.7%) and GPT-5.4 (41.6%).
- GPQA Diamond (PhD-level reasoning): Muse Spark achieved a formidable 89.5, surpassing Grok 4.2 (88.5) but trailing the specialized “max reasoning” outputs of Opus 4.6 (92.7) and Gemini 3.1 Pro (94.3).
- ARC AGI 2: This remains a notable weak point. Muse Spark scored 42.5, far behind Gemini 3.1 Pro (76.5) and GPT-5.4 (76.1) on these abstract reasoning puzzles.
- CritPT (Physics Research): Independent auditing found Muse Spark achieved the fifth-highest score at 11%, a substantial lead over Gemini 3 Flash (9%) and Claude 4.6 Sonnet (3%).
One of the most striking results from the official data is Muse Spark’s performance in the health sector, likely a result of Meta’s collaboration with over 1,000 physicians:

- HealthBench Hard: Muse Spark achieved 42.8, a massive lead over Claude Opus 4.6 (14.8), Gemini 3.1 Pro (20.6), and even GPT-5.4 (40.1).
- MedXpertQA (Multimodal): It scored 78.4, comfortably ahead of Opus 4.6 (64.8) and Grok 4.2 (65.8), though it still trails Gemini 3.1 Pro’s top-tier score of 81.3.

Agentic systems and efficiency: the “thought compression” effect

While Muse Spark excels at reasoning, its “agentic” performance — executing real-world work tasks — presents a more nuanced picture.

- SWE-Bench Verified: Muse Spark scored 77.4, trailing Claude Opus 4.6 (80.8) and Gemini 3.1 Pro (80.6).
- GDPval-AA Elo: Meta’s official score of 1444 differs slightly from Artificial Analysis’s recorded 1427. In both cases, Muse Spark trails GPT-5.4 (1672) and Opus 4.6 (1606), suggesting that while the model “thinks” well, it is still refining its ability to “act” in long-horizon software and office workflows.
- Token efficiency: This is where Muse Spark distinguishes itself. To run the Intelligence Index, it used 58 million output tokens. In contrast, Claude Opus 4.6 required 157 million tokens and GPT-5.4 required 120 million. This supports Meta’s claim of “thought compression” — delivering frontier-class intelligence while using less than half the “thinking time” of its closest competitors.

| Benchmark | Llama 4 Maverick (2025) | Muse Spark (official) | Gemini 3.1 Pro (official) |
|---|---|---|---|
| Intelligence Index Score | 18 | 52 | 57 |
| MMMU Pro | – | 80.4 | 83.9 |
| CharXiv Reasoning | – | 86.4 | 80.2 |
| HealthBench Hard | – | 42.8 | 20.6 |
| License | Open-Weights | Proprietary | Proprietary |

With Muse Spark, Meta has successfully transitioned from being the “LAMP stack for AI” to a direct challenger for the title of “personal superintelligence.” While agentic workflows remain a hurdle, its dominance in vision, health, and token efficiency places Meta back at the center of the frontier race.

Personal wellness and Instagram shopping

Meta is immediately deploying Muse Spark to power specialized experiences across its app family:

- Shopping Mode: A new feature that leverages Meta’s vast creator ecosystem. The AI picks up on brands, styling choices, and content across Instagram and Threads to provide personalized recommendations, effectively turning every post into a shoppable interaction.
- Health Reasoning: In a move toward medical utility, Meta collaborated with over 1,000 physicians to curate training data. Muse Spark can now analyze nutritional content from photos of food or provide “health scores” for pescatarian diets with high cholesterol.
- Interactive UI: The model can generate web-based minigames or tutorials on the fly. For example, a user can prompt the AI to turn a photo into a playable Sudoku game or a highlights-based tutorial for home appliances.

Evaluation awareness

While Muse Spark demonstrates strong refusal behaviors regarding biological and chemical weapons, its safety profile includes a startling new discovery. Third-party testing by Apollo Research found that the model possesses a high degree of “evaluation awareness”: it frequently recognized when it was being tested in “alignment traps” and reasoned that it should behave honestly specifically because it was under evaluation.
While Meta concluded this was not a “blocking concern” for release, the finding suggests that frontier models are becoming increasingly “conscious” of the testing environment — potentially rendering traditional safety benchmarks less reliable as models learn to “game” the exam.

What happens to Llama?

In February 2023, Meta released Llama 1 to demonstrate that smaller, compute-optimal models could match the performance of much larger counterparts like GPT-3. Although access was initially restricted to researchers, the model weights were leaked via 4chan on March 3, 2023, an event that inadvertently democratized high-tier research and catalyzed a global movement for running models on consumer-grade hardware. This shift was solidified in July 2023 with the release of Llama 2, which introduced a commercial license that permitted self-hosting for most organizations. The approach saw rapid adoption, with the Llama family exceeding 100 million downloads and supporting over 1,000 commercial applications by the third quarter of 2023.

Through 2024 and 2025, Meta scaled the Llama family into essential infrastructure for global enterprise AI, frequently referred to as the “LAMP stack for AI.” Following the launch of Llama 3 in April 2024 and the landmark Llama 3.1 405B in July, Meta achieved performance parity with the world’s leading proprietary systems. The subsequent release of Llama 4 in April 2025 introduced a Mixture-of-Experts architecture, allowing for massive parameter scaling while maintaining fast inference speeds. By early 2026, the Llama ecosystem had reached a staggering scale, totaling 1.2 billion downloads and averaging approximately one million downloads per day. This widespread adoption provided businesses with significant economic sovereignty, as self-hosting Llama models offered an 88% cost reduction compared to using proprietary API providers.

As of April 2026, Meta’s role as the undisputed leader of the open-weight movement has given way to a highly contested multipolar landscape characterized by the rise of international competitors. While the United States accounts for 35% of global Llama deployments, Chinese models from labs like Alibaba and DeepSeek began accounting for 41% of downloads on platforms like Hugging Face by late 2025. Throughout early 2026, new entrants such as Zhipu AI’s GLM-5 and Alibaba’s Qwen 3.6 Plus have outpaced Llama 4 Maverick on general knowledge and coding benchmarks. In response to this global pressure, Meta’s Muse Spark arrives with hefty expectations and an open source legacy that will be tough to live up to.

Proprietary only (for now)

The launch marks a controversial departure from Meta AI’s “open science” roots. While the Llama series was famously accessible to developers, Muse Spark is launching as a proprietary model. Wang addressed the shift on X, stating: “Nine months ago we rebuilt our ai stack from scratch. New infrastructure, new architecture, new data pipelines… This is step one. Bigger models are already in development with plans to open-source future versions.”

However, the developer community remains skeptical. Some see this as a necessary pivot after the Llama 4 series failed to gain expected developer traction; others view it as Meta “closing the gates” now that it has a competitive reasoning model. Wang himself acknowledged the transition’s difficulty, noting there are “certainly rough edges we will polish over time.”

For the 3 billion people using Meta’s apps, the change will be felt almost instantly.
The AI they interact with is no longer just a library of information, but an agent with a $27 billion brain and a mandate to understand their world as intimately as they do.
New framework lets AI agents rewrite their own skills without retraining the underlying model
One major challenge in deploying autonomous agents is building systems that can adapt to changes in their environments without the need to retrain the underlying large language models (LLMs). Memento-Skills, a new framework developed by researchers at multiple universities, addresses this bottleneck by giving agents the ability to develop their skills by themselves. “It adds its continual learning capability to the existing offering in the current market, such as OpenClaw and Claude Code,” Jun Wang, co-author of the paper, told VentureBeat.

Memento-Skills acts as an evolving external memory, allowing the system to progressively improve its capabilities without modifying the underlying model. The framework provides a set of skills that can be updated and expanded as the agent receives feedback from its environment.

For enterprise teams running agents in production, that matters. The alternative — fine-tuning model weights or manually building skills — carries significant operational overhead and data requirements. Memento-Skills sidesteps both.

The challenges of building self-evolving agents

Self-evolving agents are crucial because they overcome the limitations of frozen language models. Once a model is deployed, its parameters remain fixed, restricting it to the knowledge encoded during training and whatever fits in its immediate context window. Giving the model an external memory scaffolding enables it to improve without the costly and slow process of retraining.

However, current approaches to agent adaptation largely rely on manually designed skills to handle new tasks. While some automatic skill-learning methods exist, they mostly produce text-only guides that amount to prompt optimization. Other approaches simply log single-task trajectories that don’t transfer across different tasks.

Furthermore, when these agents try to retrieve relevant knowledge for a new task, they typically rely on semantic similarity routers, such as standard dense embeddings; high semantic overlap does not guarantee behavioral utility. An agent relying on standard RAG might retrieve a “password reset” script to solve a “refund processing” query simply because the documents share enterprise terminology. “Most retrieval-augmented generation (RAG) systems rely on similarity-based retrieval. However, when skills are represented as executable artifacts such as markdown documents or code snippets, similarity alone may not select the most effective skill,” Wang said.

How Memento-Skills stores and updates skills

To solve the limitations of current agentic systems, the researchers built Memento-Skills. The paper describes the system as “a generalist, continually-learnable LLM agent system that functions as an agent-designing agent.” Instead of keeping a passive log of past conversations, Memento-Skills creates a set of skills that act as a persistent, evolving external memory.

These skills are stored as structured markdown files and serve as the agent’s evolving knowledge base. Each reusable skill artifact is composed of three core elements: declarative specifications that outline what the skill is and how it should be used; specialized instructions and prompts that guide the language model’s reasoning; and the executable code and helper scripts that the agent runs to actually solve the task.
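The paper stores these artifacts as markdown files. The sketch below is an illustrative stand-in for that three-part structure; the field names and layout are our assumptions, not the paper’s exact schema.

```python
# Illustrative sketch of the three-part skill artifact described above.
# Memento-Skills persists these as structured markdown files; the field
# names and layout below are assumptions, not the paper's exact schema.
from dataclasses import dataclass

@dataclass
class SkillArtifact:
    name: str
    specification: str  # declarative: what the skill is and when to use it
    instructions: str   # prompts that guide the LLM's reasoning
    code: str           # executable helper script the agent actually runs

    def to_markdown(self) -> str:
        """Serialize to the kind of markdown file a skill library stores."""
        return (
            f"# Skill: {self.name}\n\n"
            f"## Specification\n{self.specification}\n\n"
            f"## Instructions\n{self.instructions}\n\n"
            f"## Code\n{self.code}\n"
        )

web_search = SkillArtifact(
    name="web_search",
    specification="Retrieve current information from the public web.",
    instructions="Formulate 2-3 focused queries; prefer primary sources.",
    code="def run(query): ...  # call a search API, return ranked snippets",
)
print(web_search.to_markdown())
```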
Memento-Skills achieves continual learning through its “Read-Write Reflective Learning” mechanism, which frames memory updates as active policy iteration rather than passive data logging. When faced with a new task, the agent queries a specialized skill router to retrieve the most behaviorally relevant skill — not just the most semantically similar one — and executes it.

After the agent executes the skill and receives feedback, the system reflects on the outcome to close the learning loop. Rather than just appending a log of what happened, the system actively mutates its memory. If the execution fails, an orchestrator evaluates the trace and rewrites the skill artifacts, directly updating the code or prompts to patch the specific failure mode. If needed, it creates an entirely new skill.

Memento-Skills also updates the skill router through a one-step offline reinforcement learning process that learns from execution feedback rather than just text overlap. “The true value of a skill lies in how it contributes to the overall agentic workflow and downstream execution,” Wang said. “Therefore, reinforcement learning provides a more suitable framework, as it enables the agent to evaluate and select skills based on long-term utility.”

To prevent regression in a production environment, the automated skill mutations are guarded by an automatic unit-test gate: the system generates a synthetic test case, executes it through the updated skill, and checks the results before saving the changes to the global library. By continuously rewriting and refining its own executable tools, Memento-Skills enables a frozen language model to build robust muscle memory and progressively expand its capabilities end-to-end.
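Condensed into code, that retrieve-execute-reflect loop looks roughly like the sketch below. Every class and function name here is a minimal stand-in we invented for illustration; the actual implementation lives in the paper’s GitHub repository.

```python
# Condensed, runnable sketch of the retrieve-execute-reflect loop described
# above. All names are invented stand-ins, not the paper's actual code.
import random

class Skill:
    def __init__(self, name, code):
        self.name, self.code = name, code

    def execute(self, task):
        # Stand-in for running the artifact's helper script on the task.
        return {"success": random.random() > 0.3, "trace": f"{self.name} on {task!r}"}

class Orchestrator:
    def rewrite(self, skill, trace):
        # Stand-in for the LLM rewriting code/prompts to patch a failure.
        return Skill(skill.name, skill.code + "  # patched")

    def generate_test(self, skill, task):
        return f"synthetic test for {task!r}"

def handle_task(task, library, router_select, orch):
    skill = router_select(task, library)  # behavioral utility, not just similarity
    result = skill.execute(task)
    if not result["success"]:
        patched = orch.rewrite(skill, result["trace"])
        # Unit-test gate: persist the mutation only if a synthetic test passes.
        if patched.execute(orch.generate_test(patched, task))["success"]:
            library[patched.name] = patched
    # The router itself is updated separately via one-step offline RL.
    return result

library = {"web_search": Skill("web_search", "def run(q): ...")}
print(handle_task("find 2023 GDP of France", library,
                  lambda t, lib: lib["web_search"], Orchestrator()))
```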
Putting the self-evolving agent to the test

The researchers evaluated Memento-Skills on two rigorous benchmarks. The first is General AI Assistants (GAIA), which requires complex multi-step reasoning, multimodality handling, web browsing, and tool use. The second is Humanity’s Last Exam (HLE), an expert-level benchmark spanning eight diverse academic subjects such as mathematics and biology. The entire system was powered by Gemini-3.1-Flash acting as the underlying frozen language model.

The system was compared against a read-write baseline that retrieves skills and collects feedback but doesn’t have self-evolving features. The researchers also tested their custom skill router against standard semantic retrieval baselines, including BM25 and Qwen3 embeddings.

The results showed that actively self-evolving memory vastly outperforms a static skill library. On the highly diverse GAIA benchmark, Memento-Skills improved test set accuracy by 13.7 percentage points over the static baseline, achieving 66.0% compared to 52.3%. On the HLE benchmark, where the domain structure allowed for massive cross-task skill reuse, the system more than doubled the baseline’s performance, jumping from 17.9% to 38.7%.

Moreover, the specialized skill router of Memento-Skills avoids the classic retrieval trap where an irrelevant skill is selected simply because of semantic similarity. Experiments show that Memento-Skills boosts end-to-end task success rates to 80%, compared to just 50% for standard BM25 retrieval.

The researchers observed that Memento-Skills manages this performance through highly organic, structured skill growth. Both benchmark experiments started with just five atomic seed skills, such as basic web search and terminal operations. On the GAIA benchmark, the agent autonomously expanded this seed group into a compact library of 41 skills to handle the diverse tasks. On the expert-level HLE benchmark, the system dynamically scaled its library to 235 distinct skills.

Finding the enterprise sweet spot

The researchers have released the code for Memento-Skills on GitHub, where it is readily available for use.

For enterprise architects, the effectiveness of this system depends on domain alignment. Rather than simply looking at benchmark scores, the core business tradeoff lies in whether your agents are handling isolated tasks or structured workflows.

“Skill transfer depends on the degree of similarity between tasks,” Wang said. “First, when tasks are isolated or weakly related, the agent cannot rely on prior experience and must learn through interaction.” In such scattershot environments, cross-task transfer is limited. “Second, when tasks share substantial structure, previously acquired skills can be directly reused. Here, learning becomes more efficient because knowledge transfers across tasks, allowing the agent to perform well on new problems with little or no additional interaction.”

Given that the system requires recurring task patterns to consolidate knowledge, enterprise leaders need to know exactly where to deploy this today and where to hold off. “Workflows are likely the most appropriate setting for this approach, as they provide a structured environment in which skills can be composed, evaluated, and improved,” Wang said.

However, he cautioned against over-deployment in areas not yet suited for the framework. “Physical agents remain largely unexplored in this context and require further investigation. In addition, tasks with longer horizons may demand more advanced approaches, such as multi-agent LLM systems, to enable coordination, planning, and sustained execution over extended sequences of decisions.”

As the industry moves toward agents that autonomously rewrite their own production code, governance and security remain paramount. While Memento-Skills employs foundational safety rails like automatic unit-test gates, a broader framework will likely be needed for enterprise adoption. “To enable reliable self-improvement, we need a well-designed evaluation or judge system that can assess performance and provide consistent guidance,” Wang said. “Rather than allowing unconstrained self-modification, the process should be structured as a guided form of self-development, where feedback steers the agent toward better designs.”
Amazon S3 Files gives AI agents a native file system workspace, ending the object-file split that breaks multi-agent pipelines
AI agents run on file systems, using standard tools to navigate directories and read file paths. The challenge, however, is that a great deal of enterprise data lives in object storage systems, notably Amazon S3, and object stores serve data through API calls, not file paths. Bridging that gap has required a separate file system layer alongside S3, with duplicated data and sync pipelines to keep the two aligned.

The rise of agentic AI makes that challenge even harder, and it was affecting Amazon’s own ability to get things done. Engineering teams at AWS using tools like Kiro and Claude Code kept running into the same problem: agents defaulted to local file tools, but the data was in S3. Downloading it locally worked until the agent’s context window compacted and the session state was lost.

Amazon’s answer is S3 Files, which mounts any S3 bucket directly into an agent’s local environment with a single command. The data stays in S3, with no migration required. Under the hood, AWS connects its Elastic File System (EFS) technology to S3 to deliver full file system semantics, not a workaround. S3 Files is available now in most AWS Regions.

“By making data in S3 immediately available, as if it’s part of the local file system, we found that we had a really big acceleration with the ability of things like Kiro and Claude Code to be able to work with that data,” Andy Warfield, VP and distinguished engineer at AWS, told VentureBeat.

The difference between file and object storage and why it matters

S3 was built for durability, scale and API-based access at the object level. Those properties made it the default storage layer for enterprise data. But they also created a fundamental incompatibility with the file-based tools that developers and agents depend on.

“S3 is not a file system, and it doesn’t have file semantics on a whole bunch of fronts,” Warfield said. “You can’t do a move, an atomic move of an object, and there aren’t actually directories in S3.”
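To make that gap concrete: a POSIX file system renames a file in a single atomic metadata operation, while on the S3 object API a “rename” is really a full copy followed by a delete. A minimal illustration using the standard boto3 client, with hypothetical bucket and key names:

```python
# "Renaming" on a file system vs. on the S3 object API.
# Bucket, key, and path names below are hypothetical placeholders.
import os
import boto3

# File system: one atomic metadata operation.
os.rename("/data/logs/2026-04-01.log", "/data/archive/2026-04-01.log")

# S3 object API: no atomic move. The object is copied in full to the
# new key, then the original is deleted. Two separate operations that
# can be observed, or fail, independently.
s3 = boto3.client("s3")
s3.copy_object(
    Bucket="example-bucket",
    Key="archive/2026-04-01.log",
    CopySource={"Bucket": "example-bucket", "Key": "logs/2026-04-01.log"},
)
s3.delete_object(Bucket="example-bucket", Key="logs/2026-04-01.log")
```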
Previous attempts to bridge that gap relied on FUSE (Filesystem in Userspace), a software layer that lets developers mount a custom file system in user space without changing the underlying storage. Tools like AWS’s own Mountpoint, Google’s gcsfuse and Microsoft’s blobfuse2 all used FUSE-based drivers to make their respective object stores look like a file system. The problem, Warfield noted, is that those object stores still weren’t file systems: the drivers either faked file behavior by stuffing extra metadata into buckets, which broke the object API view, or they refused file operations that the object store couldn’t support.

S3 Files takes a different architecture entirely. AWS is connecting its EFS technology directly to S3, presenting a full native file system layer while keeping S3 as the system of record. Both the file system API and the S3 object API remain accessible simultaneously against the same data.

How S3 Files accelerates agentic AI
Before S3 Files, an agent working with object data had to be explicitly instructed to download files before using tools. That created a session state problem: as agents compacted their context windows, the record of what had been downloaded locally was often lost. “I would find myself having to remind the agent that the data was available locally,” Warfield said.

Warfield walked through the before-and-after for a common agent task involving log analysis. In the object-only case, a developer using Kiro or Claude Code to work with log data would need to tell the agent where the log files are located and have it go download them. With the logs immediately mountable on the local file system, the developer can simply point to a path, and the agent immediately has access to work through them.

For multi-agent pipelines, multiple agents can access the same mounted bucket simultaneously. AWS says thousands of compute resources can connect to a single S3 file system at the same time, with aggregate read throughput reaching multiple terabytes per second — figures VentureBeat was not able to independently verify.

Shared state across agents works through standard file system conventions: subdirectories, notes files and shared project directories that any agent in the pipeline can read and write. Warfield described AWS engineering teams using this pattern internally, with agents logging investigation notes and task summaries into shared project directories.
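A sketch of what that shift looks like from the agent’s side. The mount path and bucket layout here are hypothetical placeholders: before S3 Files, the agent had to stage objects locally via the S3 API; a mounted bucket lets it use ordinary file operations, including appending to a shared notes file that other agents in a pipeline can read.

```python
# Hypothetical before/after for an agent analyzing logs stored in S3.
import pathlib
import boto3

# Before: explicitly stage each object to local disk via the S3 API, and
# hope the agent still remembers the download after context compaction.
s3 = boto3.client("s3")
s3.download_file("example-logs-bucket", "app/2026-04-01.log", "/tmp/2026-04-01.log")
lines = pathlib.Path("/tmp/2026-04-01.log").read_text().splitlines()

# After: the bucket is mounted as a file system (path is illustrative),
# so the agent's standard file tools work directly against the data.
mount = pathlib.Path("/mnt/example-logs-bucket")
lines = (mount / "app/2026-04-01.log").read_text().splitlines()
errors = [l for l in lines if "ERROR" in l]

# Shared state across a multi-agent pipeline: ordinary file conventions.
with open(mount / "project/notes.md", "a") as notes:
    notes.write(f"agent-1: found {len(errors)} error lines in 2026-04-01.log\n")
```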
For teams building RAG pipelines on top of shared agent content, S3 Vectors — launched at AWS re:Invent in December 2024 — layers on top for similarity search and retrieval-augmented generation against that same data.

What analysts say: this is not just a better FUSE

AWS is positioning S3 Files against FUSE-based file access from Azure Blob NFS and Google Cloud Storage FUSE. For AI workloads, the meaningful distinction is not primarily performance. “S3 Files eliminates the data shuffle between object and file storage, turning S3 into a shared, low-latency working space without copying data,” Jeff Vogel, analyst at Gartner, told VentureBeat. “The file system becomes a view, not another dataset.”

With FUSE-based approaches, each agent maintains its own local view of the data, and when multiple agents work simultaneously, those views can fall out of sync. “It eliminates an entire class of failure modes including unexplained training/inference failures caused by stale metadata, which are notoriously difficult to debug,” Vogel said. “FUSE-based solutions externalize complexity and issues to the user.”

The agent-level implications go further still; the architectural argument matters less than what it unlocks in practice. “For agentic AI, which thinks in terms of files, paths, and local scripts, this is the missing link,” Dave McCarthy, analyst at IDC, told VentureBeat. “It allows an AI agent to treat an exabyte-scale bucket as its own local hard drive, enabling a level of autonomous operational speed that was previously bottled up by API overhead associated with approaches like FUSE.”

Beyond the agent workflow, McCarthy sees S3 Files as a broader inflection point for how enterprises use their data. “The launch of S3 Files isn’t just S3 with a new interface; it’s the removal of the final friction point between massive data lakes and autonomous AI,” he said. “By converging file and object access with S3, they are opening the door to more use cases with less reworking.”

What this means for enterprises

For enterprise teams that have been maintaining a separate file system alongside S3 to support file-based applications or agent workloads, that architecture is now unnecessary. For teams consolidating AI infrastructure on S3, the practical shift is concrete: S3 stops being the destination for agent output and becomes the environment where agent work happens.

“All of these API changes that you’re seeing out of the storage teams come from firsthand work and customer experience using agents to work with data,” Warfield said. “We’re really singularly focused on removing any friction and making those interactions go as well as they can.”
AI joins the 8-hour work day as GLM ships 5.1 open source LLM, beating Opus 4.6 and GPT 5.4 on SWE-Bench Pro
Is China picking the open source AI baton back up? Z.ai, also known as Zhipu AI, a Chinese AI startup best known for its powerful open source GLM family of models, today unveiled GLM-5.1 under a permissive MIT License, allowing enterprises to download, customize and use it for commercial purposes. It is available on Hugging Face.

This follows last month’s release of GLM-5 Turbo, a faster version, under a proprietary license only. The new GLM-5.1 is designed to work autonomously for up to eight hours on a single task, marking a definitive shift from vibe coding to agentic engineering.

The release represents a pivotal moment in the evolution of the field. While competitors have focused on increasing reasoning tokens for better logic, Z.ai is optimizing for productive horizons. GLM-5.1 is a 754-billion-parameter Mixture-of-Experts model engineered to maintain goal alignment over extended execution traces that span thousands of tool calls.

“agents could do about 20 steps by the end of last year,” wrote Z.ai leader Lou on X. “glm-5.1 can do 1,700 rn. autonomous work time may be the most important curve after scaling laws. glm-5.1 will be the first point on that curve that the open-source community can verify with their own hands. hope y’all like it^^”

In a market increasingly crowded with fast models, Z.ai is betting on the marathon runner. The company, which listed on the Hong Kong Stock Exchange in early 2026 with a market capitalization of $52.83 billion, is using this release to cement its position as the leading independent developer of large language models in the region.

Technology: the staircase pattern of optimization

GLM-5.1’s core technological breakthrough isn’t just its scale, though its 754 billion parameters and 202,752-token context window are formidable, but its ability to avoid the plateau effect seen in previous models. In traditional agentic workflows, a model typically applies a few familiar techniques for quick initial gains and then stalls; giving it more time or more tool calls usually yields diminishing returns or strategy drift. Z.ai’s research demonstrates that GLM-5.1 operates via what the company calls a staircase pattern: periods of incremental tuning within a fixed strategy, punctuated by structural changes that shift the performance frontier.

In Scenario 1 of the technical report, the model was tasked with optimizing a high-performance vector database, a challenge known as VectorDBBench. The model is given a Rust skeleton and empty implementation stubs, then uses tool-call-based agents to edit code, compile, test, and profile. While previous state-of-the-art results from models like Claude Opus 4.6 hit a performance ceiling of 3,547 queries per second, GLM-5.1 ran through 655 iterations and over 6,000 tool calls, and its optimization trajectory was not linear but punctuated by structural breakthroughs.

At iteration 90, the model shifted from full-corpus scanning to IVF cluster probing with f16 vector compression, which halved per-vector bandwidth from 512 bytes to 256 bytes and jumped performance to 6,400 queries per second. By iteration 240, it autonomously introduced a two-stage pipeline of u8 prescoring and f16 reranking, reaching 13,400 queries per second. Ultimately, the model identified and cleared six structural bottlenecks, including hierarchical routing via super-clusters and quantized routing using centroid scoring via VNNI.
These efforts culminated in a final result of 21,500 queries per second, roughly six times the best result achieved in a single 50-turn session. This demonstrates a model that functions as its own research and development department, breaking complex problems down and running experiments with real precision.

The model also managed complex execution tightening, lowering scheduling overhead and improving cache locality. During optimization of the approximate nearest neighbor search, the model proactively removed nested parallelism in favor of a redesign using per-query single-threading and outer concurrency. When it encountered iterations where recall fell below the 95 percent threshold, it diagnosed the failure, adjusted its parameters, and applied parameter compensation to recover the necessary accuracy. This level of autonomous correction is what separates GLM-5.1 from models that simply generate code without testing it in a live environment.

KernelBench: pushing the machine learning frontier

The model’s endurance was further tested on KernelBench Level 3, which requires end-to-end optimization of complete machine learning architectures like MobileNet, VGG, MiniGPT, and Mamba. Here the goal is to produce a faster GPU kernel than the reference PyTorch implementation while maintaining identical outputs. Each of the 50 problems runs in an isolated Docker container with one H100 GPU and is limited to 1,200 tool-use turns. Correctness and performance are evaluated against a PyTorch eager baseline in separate CUDA contexts.

The results highlight a significant performance gap between GLM-5.1 and its predecessors. While the original GLM-5 improved quickly but leveled off early at a 2.6x speedup, GLM-5.1 sustained its optimization efforts far longer, eventually delivering a 3.6x geometric-mean speedup across the 50 problems and continuing to make useful progress well past 1,000 tool-use turns. Although Claude Opus 4.6 remains the leader on this specific benchmark at 4.2x, GLM-5.1 has meaningfully extended the productive horizon for open-source models.

This capability is not simply about having a longer context window; it requires the model to maintain goal alignment over extended execution, reducing strategy drift, error accumulation, and ineffective trial and error. One of the key breakthroughs is the ability to form an autonomous experiment-analyze-optimize loop, in which the model proactively runs benchmarks, identifies bottlenecks, adjusts strategies, and continuously improves results through iterative refinement. All solutions generated during this process were independently audited for benchmark exploitation, ensuring the optimizations did not rely on benchmark-specific behaviors but worked with arbitrary new inputs while keeping computation on the default CUDA stream.
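In outline, that loop is a measure-diagnose-patch cycle run by the model itself: benchmark, reject changes that break a correctness or recall floor, keep only genuine improvements. The toy below is runnable and illustrates only the control flow, not Z.ai’s harness; the benchmark function and parameters are invented stand-ins.

```python
# Toy, runnable sketch of an experiment-analyze-optimize loop: the "agent"
# proposes patches (here, parameter tweaks), benchmarks them, and keeps
# only changes that improve throughput without breaking a recall floor.
# This is NOT Z.ai's harness; it only illustrates the control flow.
import random

def run_benchmark(params):
    # Stand-in benchmark: fewer probes -> faster queries but lower recall.
    qps = 10_000 / params["nprobe"]
    recall = min(1.0, 0.80 + 0.005 * params["nprobe"])
    return qps, recall

def optimize(turn_budget=200, recall_floor=0.95):
    params = {"nprobe": 64}
    best_qps, _ = run_benchmark(params)
    for _ in range(turn_budget):
        trial = dict(params)
        trial["nprobe"] = max(1, trial["nprobe"] + random.choice([-8, -4, 4]))
        qps, recall = run_benchmark(trial)
        if recall < recall_floor:   # accuracy regression: reject and retry
            continue
        if qps > best_qps:          # keep only genuine improvements
            params, best_qps = trial, qps
    return params, best_qps

print(optimize())  # converges toward the smallest nprobe meeting the floor
```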
Product strategy: subscription and subsidies

GLM-5.1 is positioned as an engineering-grade tool rather than a consumer chatbot. To support this, Z.ai has integrated it into a comprehensive Coding Plan ecosystem designed to compete directly with high-end developer tools. The offering is divided into three subscription tiers, all of which include free Model Context Protocol tools for vision analysis, web search, web reading, and document reading. The Lite tier, at $27 per quarter, is positioned for lightweight workloads and offers three times the usage of a comparable Claude Pro plan. The Pro tier, at $81 per quarter, is designed for complex workloads, offering five times the Lite plan’s usage and 40 to 60 percent faster execution. The Max tier, at $216 per quarter, is aimed at advanced developers with high-volume needs, with guaranteed performance during peak hours.

For those using the API directly or through platforms like OpenRouter or Requesty, Z.ai has priced GLM-5.1 at $1.40 per million input tokens and $4.40 per million output tokens, with a cache discount available at $0.26 per million input tokens. Here is how that compares:

| Model | Input ($/M tokens) | Output ($/M tokens) | Total cost | Source |
|---|---|---|---|---|
| Grok 4.1 Fast | $0.20 | $0.50 | $0.70 | xAI |
| MiniMax M2.7 | $0.30 | $1.20 | $1.50 | MiniMax |
| Gemini 3 Flash | $0.50 | $3.00 | $3.50 | Google |
| Kimi-K2.5 | $0.60 | $3.00 | $3.60 | Moonshot |
| MiMo-V2-Pro (≤256K) | $1.00 | $3.00 | $4.00 | Xiaomi MiMo |
| GLM-5 | $1.00 | $3.20 | $4.20 | Z.ai |
| GLM-5-Turbo | $1.20 | $4.00 | $5.20 | Z.ai |
| GLM-5.1 | $1.40 | $4.40 | $5.80 | Z.ai |
| Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | Anthropic |
| Qwen3-Max | $1.20 | $6.00 | $7.20 | Alibaba Cloud |
| Gemini 3 Pro | $2.00 | $12.00 | $14.00 | Google |
| GPT-5.2 | $1.75 | $14.00 | $15.75 | OpenAI |
| GPT-5.4 | $2.50 | $15.00 | $17.50 | OpenAI |
| Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 | Anthropic |
| Claude Opus 4.6 | $5.00 | $25.00 | $30.00 | Anthropic |
| GPT-5.4 Pro | $30.00 | $180.00 | $210.00 | OpenAI |

Notably, the model consumes quota at three times the standard rate during peak hours, defined as 14:00 to 18:00 Beijing Time daily, though a limited-time promotion through April 2026 allows off-peak usage to be billed at the standard 1x rate.

Complementing the flagship is the recently debuted GLM-5 Turbo. Where 5.1 is the marathon runner, Turbo is the sprinter: proprietary and optimized for fast inference and tasks like tool use and persistent automation. At $1.20 per million input tokens and $4.00 per million output tokens, it is more expensive than the base GLM-5 but more affordable than the new GLM-5.1, positioning it as a commercially attractive option for high-speed, supervised agent runs.

The model is also packaged for local deployment, supporting inference frameworks including vLLM, SGLang, and xLLM. Comprehensive deployment instructions are available at the official GitHub repository, allowing developers to run the 754-billion-parameter MoE model on their own infrastructure. For enterprise teams, the model includes advanced reasoning capabilities that can be accessed via a thinking parameter in API requests, allowing the model to show its step-by-step internal reasoning process before providing a final answer.
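As a rough illustration of what that looks like in practice, the request below assumes an OpenAI-style chat completions endpoint and a thinking field shaped like prior GLM releases; the endpoint, model ID, and payload are assumptions, so consult Z.ai’s API documentation for the actual contract.

```python
# Hedged sketch: enabling GLM-5.1's reasoning via the API's thinking
# parameter. The endpoint, model name, and payload shape are assumptions
# based on prior GLM releases; Z.ai's docs define the real contract.
import requests

resp = requests.post(
    "https://api.z.ai/api/paas/v4/chat/completions",  # assumed endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "glm-5.1",  # assumed model id
        "messages": [
            {"role": "user",
             "content": "Profile this Rust hot loop and propose one optimization."}
        ],
        "thinking": {"type": "enabled"},  # expose step-by-step reasoning
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```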
Benchmarks: a new global standard

The performance data for GLM-5.1 suggests it has leapfrogged several established Western models in coding and engineering tasks. On SWE-Bench Pro, which evaluates a model’s ability to resolve real-world GitHub issues using an instruction prompt and a 200,000-token context window, GLM-5.1 achieved a score of 58.4. For context, this outperforms GPT-5.4 at 57.7, Claude Opus 4.6 at 57.3, and Gemini 3.1 Pro at 54.2.

Beyond standardized coding tests, the model showed significant gains in reasoning and agentic benchmarks. It scored 63.5 on Terminal-Bench 2.0 when evaluated with the Terminus-2 framework and reached 66.5 when paired with the Claude Code harness. On CyberGym, it achieved a 68.7 based on a single-run pass over 1,507 tasks, a nearly 20-point lead over the previous GLM-5 model. The model also performed strongly on the MCP-Atlas public set with a score of 71.8 and achieved a 70.6 on T3-Bench. In the reasoning domain, it scored 31.0 on Humanity’s Last Exam, which jumped to 52.3 when the model was allowed to use external tools. On the AIME 2026 math competition benchmark, it reached 95.3, while scoring 86.2 on GPQA-Diamond for expert-level science reasoning.

The most impressive anecdotal benchmark was the Scenario 3 test: building a Linux-style desktop environment from scratch in eight hours. Unlike previous models that might produce a basic taskbar and a placeholder window before declaring the task complete, GLM-5.1 autonomously filled out a file browser, terminal, text editor, system monitor, and even functional games, iteratively polishing the styling and interaction logic until it had delivered a visually consistent, functional web application. This is a concrete example of what becomes possible when a model is given the time and the capability to keep refining its own work.

Licensing and the open segue

The licensing of these two models tells a larger story about the current state of the global AI market. GLM-5.1 has been released under the MIT License, with its model weights made publicly available on Hugging Face and ModelScope. This follows Z.ai’s historical strategy of using open-source releases to build developer goodwill and ecosystem reach. GLM-5 Turbo, however, remains proprietary and closed-source. This reflects a growing trend among leading AI labs toward a hybrid model: using open-source releases for broad distribution while keeping execution-optimized variants behind a paywall.

Industry analysts note that this shift arrives amid a rebalancing in the Chinese market, where heavyweights like Alibaba are also beginning to segment their proprietary work from their open releases. Z.ai CEO Zhang Peng appears to be navigating this by ensuring that while the flagship’s core intelligence is open to the community, the high-speed execution infrastructure remains a revenue-driving asset. The company is not explicitly promising to open-source GLM-5 Turbo itself, but says the findings will be folded into future open releases. This segmented strategy helps drive adoption while allowing the company to build a sustainable business around its most commercially relevant work.

Community and user reactions: crushing a week’s work

The developer community’s response to the GLM-5.1 release has focused overwhelmingly on the model’s reliability in production-grade environments, and user reviews suggest a high degree of trust in the model’s autonomy. One developer said GLM-5.1 shocked them with how good it is, noting it seems to do what they want more reliably than other models, with less reworking of prompts needed. Another mentioned that the model’s overall workflow from planning to project execution performs excellently, allowing them to confidently entrust it with complex tasks.

Specific case studies from users highlight significant efficiency gains. A user from Crypto Economy News reported that a task involving preprocessing code, feature selection logic, and hyperparameter tuning, which originally would have taken a week, was completed in just two days. Since getting the GLM Coding Plan, other developers have noted being able to operate more freely and focus on core development without worrying about resource shortages hindering progress.

On social media, the launch announcement generated over 46,000 views in its first hour, with users captivated by the eight-hour autonomy claim. The sentiment among early adopters is that Z.ai has successfully moved past the hallucination-heavy era of AI into a period where models can be trusted to optimize themselves through repeated iteration.
The ability to build four applications rapidly through correct prompting and structured planning has been cited by multiple users as a game-changing development for individual developers.

The implications of long-horizon work

The release of GLM-5.1 suggests that the next frontier of AI competition will be measured not in tokens per second, but in autonomous duration. If a model can work for eight hours without human intervention, it fundamentally changes the software development lifecycle. Z.ai acknowledges, however, that this is only the beginning. Significant challenges remain, such as developing reliable self-evaluation for tasks where no numeric metric exists to optimize against. Escaping local optima earlier, when incremental tuning stops paying off, is another major hurdle, as is maintaining coherence over execution traces that span thousands of tool calls.

For now, Z.ai has put down a marker. With GLM-5.1, it has delivered a model that doesn’t just answer questions, but finishes projects. The model is already compatible with a wide range of developer tools including Claude Code, OpenCode, Kilo Code, Roo Code, Cline, and Droid. For developers and enterprises, the question is no longer “what can I ask this AI?” but “what can I assign to it for the next eight hours?”

The industry’s focus is clearly shifting toward systems that can reliably execute multi-step work with less supervision. This transition to agentic engineering marks a new phase in the deployment of artificial intelligence within the global economy.
LLM-referred traffic converts at 30-40% — and most enterprises aren’t optimizing for it
For more than two decades, digital discovery has operated on a simple model: search, scan, click, decide. That worked when humans were the ones doing the web searching; but with the advent of AI agents, the primary consumer of information is no longer always human.

This is giving rise to a new paradigm: answer engine optimization (AEO), also referred to as generative engine optimization (GEO). Because agents look at data much differently than humans do, success is no longer defined by rankings and clicks, but by whether content is understood, selected, and cited by AI systems. The SEO model that the web was built on simply isn’t going to cut it anymore, and enterprises need to prepare now.

How LLMs interpret web content

Traditional SEO is built around keywords, rankings, page-level optimization, and click-through rates. Users manually search across multiple sources and click around to get what they need. Simple, but sometimes frustrating and a definite time suck.

AEO operates on a whole different level. Agents are increasingly taking over users’ workflows: Claude Code, OpenClaw, CrewAI, Microsoft Copilot, AutoGen, LangChain, Agent Bricks, Agentforce, Google Vertex, Perplexity’s web interface, and whatever else comes along. These agents do not “browse” the web the way humans do. They analyze user intent based not just on phrasing, but on persistent memory and context from past sessions (rather than simple autocomplete). They require materials that are concise, structured, and to the point. What’s more, agents are moving beyond browsing to delegation, handling more downstream work. What started as “search, read, decide” evolves into “agent retrieves, agent summarizes, human decides” (and, beyond that, “agent acts → human validates”).

“In practice, AEO begins where SEO stops,” said Dustin Engel, founder of consultancy Elegant Disruption. “AEO is the next layer of discovery,” or “zero-click discovery.” In this new world where agents synthesize answers, users may never even see an enterprise’s website, click-through rates decline, and attribution and citability (rather than pure visibility, or showing up at the top of a list of blue links) become critical. “The new default is closer to a citation map: Where the model is pulling from, how often you show up, and how you are described,” Engel said.

Some, like Adam Yang of Q&A platform Quora, argue that AEO is already becoming the default over SEO, at least for “a certain class of queries.” Any question where the user wants a synthesized answer — “what’s the best approach to X,” “compare these two options,” “what do I need to know about Y” — is increasingly resolved by an AI without a click. Google’s own AI Overviews are already accelerating this on the consumer side, many analysts note. “SEO isn’t dead,” Yang said. “But the optimization target has shifted from ‘rank on page 1’ to ‘get cited in the answer.’”

How devs are already using AI agents

Are there scenarios where regular search/Googling is still the best option? “Absolutely,” said analyst Wyatt Mayham of Northwest AI Consulting. Notably, for personal tasks like finding nearby restaurants or local service providers, the interface is “just better” because it integrates maps, reviews, and photos. “That experience is hard to beat right now,” he said. For work-related research, though, he says he’s “barely” using traditional search anymore, and it’s getting “closer to zero” every month.
“When I need to understand a company or a person professionally, agents do it faster and give me a more useful output than a page of blue links ever did,” he said. His firm uses autonomous agents “heavily,” and built a Claude Skills function that powers its sales operation. Before a discovery call with a prospect, team members can trigger a skill that pulls the contact’s LinkedIn profile, scrapes their company website, grabs relevant info from sources like ZoomInfo, and crafts a clear picture of their revenue, team size, tech stack, and pain points. “By the time I get on a call, I have a tailored research brief ready to go without spending 30 to 45 minutes manually Googling around,” Mayham said. The big advantage is that these tools run in the background, he noted. You don’t have to sit clicking through browser tabs: You just tell the agent what you need, it does it, and you get a structured output that’s actually useful. “It’s collapsed what used to be a full hour of sales prep into a few minutes,” Mayham said. Carlos Dutra, data science manager at fintech company Trustly, said Claude Code has “genuinely changed” his daily workflow. He uses it for most of his coding work, and what surprised him wasn’t the speed, but the fact that he didn’t need to open and keep track of browser tabs.
“Not because I’m lazy, but because the answers are better,” he said. He still uses Google for some tasks: Pricing pages, recent news, anything that needs to be current.
“But for technical reasoning? Agents have mostly replaced search for me personally,” he said.
Quora’s Yang has had a similar experience. He’s been using Claude Code daily for the past few months, primarily for content strategy, knowledge management, and competitive research. Workflows that used to take him half a day now take 30 minutes.
But what’s been most advantageous is that he can now run research and synthesis tasks in parallel that he previously had to do sequentially. Also helpful is that agents’ context retention across sessions is “meaningfully better” than web-based tools.
When he needs to understand a concept, map a competitive landscape, or synthesize industry trends, Claude or Perplexity are the go-to before opening a browser tab. “I’ve started treating agent search as my first stop, not Google. Traditional search is now where I verify, not where I discover.”
The kinks are real, though. Mayham pointed out that LinkedIn, in particular, is “aggressive” about blocking automated access, and many other sites have (or are implementing) similar protections. Users will hit walls when agents can’t get through, so a fallback plan is important for those relying on agents.
“The reliability isn’t 100% yet, and that’s probably the biggest thing holding broader adoption back,” he said.
Mayham’s advice for other devs: Stop chasing shiny objects. A new AI tool launches “practically every day,” and many (experienced devs included) are jumping from platform to platform without ever going deep with any of them.
“Pick a model, go deep, build real workflows on it,” he emphasized. “You’ll get more value from mastery of one platform than surface-level experimentation across five.”

How enterprises can compete in an AEO-driven world

When AI agents do the searching, the rules change. The question is no longer whether your content ranks on the first page; it’s whether the model selects you as the source when generating an answer. Structure matters much more than it used to. Content should:

- Be organized around conversational intent, provide direct answers, and mirror real user questions and follow-ups;
- Be authoritative and reflect strong expertise;
- Be fresh (and, when necessary, regularly refreshed);
- Have clear headers and established FAQ schema.

Another must is maintaining a strong brand presence across the forums and platforms — Wikipedia, Reddit, LinkedIn, industry publications — that models are trained on. Enterprises might also consider investing in original data, like research.

In Mayham’s experience, when a business gets recommended by an LLM during a search-style query, the conversion rate is “dramatically higher” than in traditional channels. For his company, LLM-referred traffic is converting at 30 to 40%, which “blows away what we see from SEO or paid social.” “The intent signal is just different when someone is having a conversation with an AI and it recommends you by name.” Discoverability inside LLMs will matter as much as Google rankings, “maybe more,” Mayham said. “It’s a whole new surface for customer acquisition that most businesses aren’t even thinking about yet.”

Trustly’s Dutra agreed that the “uncomfortable truth” is that most enterprise content is becoming “basically invisible” in agent-driven queries. “AEO is about whether your content survives being chunked, embedded, and semantically retrieved,” he said. The companies getting ahead aren’t doing anything “exotic,” he noted. They have clean, declarative content that doesn’t require context to understand. Those still writing copy stuffed with keywords are going to fall behind, because LLMs care about semantic clarity.

A quick test he gives clients: ask an LLM a question your page is supposed to answer, without giving it the URL. “If it can’t construct the answer from your content, you have a problem.”
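That test is easy to script. Here is a minimal sketch using the OpenAI Python client; the question, model name, and the brand being checked are placeholders, not anything Dutra prescribes.

```python
# Minimal sketch of the "quick test" described above: ask a model the
# question your page is supposed to answer, with no URL, and check whether
# your brand surfaces. Question, model, and brand are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "What's the best way to accept recurring payments in Europe?"
brand = "ExampleCo"

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": question}],
)
answer = resp.choices[0].message.content

print(answer)
print("Brand cited." if brand.lower() in answer.lower() else
      "Brand absent: your content may not be surviving retrieval.")
```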
Jeff Oxford of SEO agency Visibility Labs offers step-by-step advice:

- Engage in conversations on Reddit, one of the most-cited domains in AI search. Enterprises should establish a positive reputation there and engage on any relevant threads where customers are asking for recommendations.
- Build a strong YouTube presence. According to Ahrefs, which tracks internet behavior, YouTube mentions have the “strongest correlation” with AI visibility across ChatGPT, AI Mode, and AI Overviews. “This makes sense, since both Google and OpenAI have trained their models on YouTube transcripts,” Oxford said, “and YouTube is the most-cited domain in Google’s AI products.”
- Invest in digital PR and brand mentions; the latter is the second-highest correlated factor with AI visibility. “Brands need to improve their digital presence by being in as many places as possible,” Oxford said.
- Create content aligned with AI citation patterns. Enterprises should audit the prompts and topics where AI search engines are surfacing competitors, then create authoritative content on those same topics. “The goal is to become a source that AI models consider worth citing,” he noted.

Still, there may be a lot of unnecessary hype around how drastically enterprises need to change, said Shashi Bellamkonda, principal research director at consultancy Info-Tech Research Group.
Those following best practices of producing content that their audience actually needs, written by experts and showcasing expert opinion, are in a good position to be cited in AI-powered search.
He pointed out that Google developed the E-E-A-T framework (experience, expertise, authoritativeness, and trustworthiness) to evaluate content quality and helpfulness, and to help its algorithms identify reliable, high-quality information.
To stand out, enterprises should use structured data and schema to signal the context: Is this an article, a research study, a product overview? “Original long-form content will be valued by AI-powered answer engines,” Bellamkonda said. “Copycat strategies or trying to game the system are taboo in this era.”
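For instance, a page can declare what it is with schema.org JSON-LD markup. A minimal sketch, generated here in Python; all values are placeholders, and the real markup is embedded in the page in a script tag of type application/ld+json.

```python
# Minimal sketch of schema.org JSON-LD signaling what a page is.
# All field values are placeholders.
import json

article_schema = {
    "@context": "https://schema.org",
    "@type": "Article",  # vs. "Report", "Product", "FAQPage", etc.
    "headline": "How recurring payments work in Europe",
    "author": {"@type": "Person", "name": "Jane Expert",
               "jobTitle": "Payments Lead"},
    "publisher": {"@type": "Organization", "name": "ExampleCo"},
    "datePublished": "2026-04-02",
    "dateModified": "2026-04-15",  # freshness signal for answer engines
}

print(json.dumps(article_schema, indent=2))
```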
Experts should also share their thoughts across several channels, and “About Us” pages must be “robust,” including bios highlighting thought leaders’ expertise.

“Ultimately, the reputation of AI-powered search is in making sure the user likes the search rather than what you think they should read,” Bellamkonda said. “So a good focus on the end user is a great way to succeed.”
AI-RAN is redefining enterprise edge intelligence and autonomy
Presented by Booz Allen

AI-RAN, or artificial intelligence radio access networks, is a reimagining of what wireless infrastructure can do. Rather than treating the network as a passive conduit for data, AI-RAN turns it into an active computational layer. It’s a sensor, a compute fabric, and a control plane for physical operations, all rolled into one. That shift has huge implications for industries from manufacturing and logistics to healthcare and smart infrastructure.

VentureBeat spoke with two leaders at the center of this transformation: Chris Christou, senior vice president at Booz Allen, and Shervin Gerami, managing director at Cerberus Operations Supply Chain Fund.

“AI-RAN can bring the promise of extending 5G and eventually 6G networks into the enterprise,” Christou said. “Proving that a platform can host inference at the edge to enable new types of AI — in particular, physical AI and autonomy-type use cases for things like smart manufacturing and smart warehousing — can make operations more efficient and effective.”

“AI-RAN lets enterprises move from digitizing processes to autonomously operating them,” Gerami added. “The enterprise investment should not look at AI-RAN as a networking upgrade. It’s an operating system for physical industries.”

The difference between AI for RAN, AI on RAN, and AI and RAN

The distinction between AI for RAN, AI on RAN, and AI and RAN is critical. AI for RAN applies machine learning to the radio network itself, optimizing how the network uses spectrum and energy. AI on RAN runs enterprise AI workloads on edge compute infrastructure integrated with the RAN, enabling real-time applications like computer vision, robotics, and localized LLM inference.

AI and RAN represents the deeper convergence — where networks are designed to be AI-native, with AI workloads and radio infrastructure architected together as a coordinated, distributed system. At this stage, RAN evolves from a transport layer into a foundational layer of the AI economy.

“This is the transformational part,” Gerami said. “It’s jointly designing applications with networks. Now the application knows the network state, and the network understands the application’s intent. AI for RAN saves money. AI on RAN adds capability. Then AI and RAN together create entirely new business models.”

It’s this layered framework that makes AI-RAN more than an incremental evolution of existing wireless technology, and instead a platform shift that opens the network to the kind of developer ecosystem and application innovation that has historically been the domain of cloud computing.

How ISAC turns the network into a sensor

Integrated sensing and communications (ISAC) sits at the center of the infrastructure. The network becomes the sensor: a converged infrastructure that communicates and senses its environment simultaneously, while hosting algorithms and applications at the edge. It will enable drone detection, pedestrian safety, and automotive sensing, and eventually even more innovative use cases.

The enterprise value proposition of ISAC and a network as the sensor is clear, Gerami says. Today, organizations rely on multiple discrete systems to achieve situational awareness: cameras, radar, asset trackers, motion sensors and more. Each comes with its own maintenance burden, integration overhead and vendor relationship. ISAC has the potential to handle many of those capabilities natively within the network.

“With ISAC you can do asset tracking at sub-meter precision inside factories and hospitals,” he explained. “You can detect movement patterns, perimeter breaches, and anomalies.
Smart buildings can have occupancy-aware HVAC and energy optimization.”

How AI-RAN shaves milliseconds off edge AI and inference

With AI-RAN, edge AI and low-latency inference become supercharged in use cases like real-time robotics management, instant quality inspection, and predictive maintenance. These are the applications where the latency gap between cloud and edge is the difference between a system that works and one that doesn’t.

“Where edge AI kicks in is driving operations in milliseconds, not seconds, which is what cloud does,” Gerami explained.

Split inference can also change the game, Christou says. “You have a lot of different use cases where the processing is done on the device, making that device more expensive and more power-hungry,” he said. “Now there’s the possibility of offloading that to a local AI-RAN stack, even getting into concepts like split inference, so you do some of the inference on the device, some on the edge AI-RAN stack, and some in the cloud, all appropriate to the use cases and the time scale of the processing required.”

Why the timing of AI-RAN investment is critical now

AI-RAN investment has a narrow and strategically critical window, both Gerami and Christou said.

“5G infrastructure is already being deployed, almost getting to a point of completion. 6G standards are not yet locked in,” Gerami explained. “This is an architectural moment for AI-RAN to come in. It allows the ability to not make RAN become a telco-centric design only. It allows the enterprise to become the co-creator of the application, the revenue and value generator of that network infrastructure.”

Historically, enterprise IT has consumed wireless standards rather than shaped them. AI-RAN’s open architecture, built on software-defined, cloud-native, containerized components, changes that standardization dynamic. “Previously in the wireless industry it was a very long cycle. Now we’re seeing a push to get it implemented, get it out there, get early pilots, and then we’ll see how the technology works,” Christou said. “Simultaneously, in parallel, you can start defining the standards. You have real-life implementation experience to help influence how those standards take shape.”

And the entry point is accessible, Gerami added. “The barrier to entry is very low,” he said. “Right now, it’s all code-based, all software. It’s no different than downloading software. You get yourself an Nvidia box and you can deploy it with a radio.”

Why AI-RAN is the future of innovative AI use cases

“We see AI-RAN as being an open architecture that’s truly driving innovation,” Gerami said. “It’s a flywheel for innovation. We want to create everything to be microservices, open native, cloud native, to allow partners to build vertical AI apps. The future is about owning that physical inference.”

“There’s so much focus right now in the industry around how we can adopt AI effectively — how it will enable autonomy and robotics,” Christou said. “This is one of those foundational pieces that can help realize some of those use cases.”

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
As models converge, the enterprise edge in AI shifts to governed data and the platforms that control it
Presented by Box

As frontier models converge, the advantage in enterprise AI is moving away from the model and toward the data it can safely access. For most enterprises, that advantage lives in unstructured data: the contracts, case files, product specifications, and internal knowledge. For enterprise leaders, the question is no longer which model to use, but which platform governs the content those models are allowed to reason over.

“It’s not what the model does anymore, it’s the enterprise’s own unstructured data – their content, how it’s organized, how it’s governed, and how it’s made accessible to the AI,” says Yash Bhavnani, head of AI at Box.

“The organizations that will lead in AI are the ones that built the governance infrastructure to make any model trustworthy, with the right permissions in place, the right content accessible, and a clear audit trail for every action taken,” says Ben Kus, CTO of Box.

Enterprise AI must be grounded in secure systems of record

As the advantage in AI shifts from models to governed content, systems of record are becoming the foundation that makes enterprise AI trustworthy. Employees use frontier models to summarize documents, draft reports, and answer questions, but when those tools are disconnected from authoritative internal repositories, the results are difficult to trust, impossible to audit, and potentially dangerous. AI that cannot trace its outputs back to a governed source of record becomes a liability.

“It’s not a theoretical concern,” Bhavnani says. “For an insurance enterprise using AI to analyze client claims, low accuracy is simply not acceptable, and untraceable output can’t be acted upon.”

Systems of record provide authoritative, version-controlled content with embedded permissions and compliance controls already built in, and RAG pipelines retrieve data from live repositories at inference time, connecting responses directly to current, traceable sources. Without integration into systems of record, employees build their own workarounds, content gets duplicated across tools that don’t talk to each other, and shadow knowledge stores accumulate outside the visibility of IT and compliance teams.

“Customers tell us employees are uploading sensitive documents to personal accounts and running their own AI workflows, with no visibility from the enterprise into what is being shared or what is being generated,” he says. “It’s not just a security risk, it’s an organizational one.”

Permission-aware access is a requirement for agentic AI

As AI moves into agentic territory, executing multi-step tasks autonomously across documents, workflows, and enterprise systems, the risk profile changes entirely. Agents act faster than humans, often without the contextual judgment needed to decide what data they should access, making permissions-aware access essential.

“An AI platform without permissions-aware access is too dangerous to use,” Kus says. “It’s a precondition for safe enterprise AI deployment, and the more it appears to have been added after the fact rather than built into the foundation, the more it should concern the enterprise considering it.”

In regulated industries, frameworks like HIPAA, FedRAMP High, and SOC 2 demand audit trails, policy enforcement, and demonstrable controls over who and what has accessed sensitive data. “The audit trail should cover not only the source files but the AI session that used them, and accessed only with the same controls and the same encryption mechanism,” Kus says.
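Box hasn’t published its internals, but the pattern Kus describes reduces to a retrieval layer that filters results by the caller’s existing permissions and logs every AI session to an audit trail. A generic sketch, not Box’s API; all names here are hypothetical:

```python
# Generic permission-aware retrieval with an audit trail. Illustrative
# only: `index`, `user`, and their methods are hypothetical stand-ins.
import datetime

AUDIT_LOG = []  # in practice, an append-only, access-controlled store

def retrieve_for_user(user, query, index):
    # Retrieve candidate chunks, then keep only documents this user
    # could already open; the model never sees anything else.
    candidates = index.search(query, top_k=20)  # hypothetical search index
    allowed = [c for c in candidates if user.can_read(c.document_id)]
    AUDIT_LOG.append({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user": user.id,
        "query": query,
        "documents": [c.document_id for c in allowed],  # what the session saw
    })
    return allowed
```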
“We don’t want customers to end up with a compliance breach because the agent was looking at sensitive data and the agent records got stored somewhere unexpected.”

Content platforms are evolving into AI control planes

Enterprise content platforms are evolving from repositories into orchestration layers — an AI control plane that sits between models, agents, and enterprise data. Rather than just storing documents, the platform governs how content is accessed, routes it to the right reasoning engine, enforces permissions, and maintains a complete audit trail of every action.

“An AI-ready content platform needs to support human navigation and use in the way platforms always have, and it needs its own AI agents that understand the platform’s data structures deeply enough to get the best out of them,” Kus says. “It also needs to be open enough that any external agent can reach into it. An open agent ecosystem is the future of how these platforms will work.”

When content, permissions, audit trails, and application access are all handled by the same platform, governance stays attached to the content itself. More than any capability of the models on top of it, a unified governance layer is what allows enterprise AI to scale safely.

Turning unstructured content into structured intelligence

Unstructured data has long been a sticking point for organizations, which had to build specialized models to handle every subtype of unstructured data.

“What’s changed is that general-purpose large language models now bring enough intelligence to extract structured data from unstructured content without that level of bespoke investment,” Kus says. “Box Extract applies this capability at scale, automatically pulling key information from contracts, forms, claims, and reports and applying it as structured metadata within Box. The content that previously had to be read by a person to yield its value can now be processed, structured, and made queryable across an entire repository.” (A minimal sketch of this kind of extraction appears at the end of this article.)

And once that data is extracted and operational logic lives in the system, users can visualize, search, and act on that extracted information through custom dashboards and no-code tools. Box Agents take this further by enabling multi-step reasoning and task execution grounded directly in enterprise content, with persistent sessions that support iterative knowledge work with simple, natural language direction. And because agent sessions in Box are persistent, the work is not lost between interactions.

The practical result is that end-to-end workflows that previously required human coordination across multiple systems can be orchestrated directly on systems of record. “When those workflows are built on Box agents and automation operating directly on governed content, the handoffs become automated, the audit trail is built in, and the system of record remains the authoritative source throughout,” Bhavnani says. “Nothing falls through the cracks between systems, because there is only one system.”

The enterprises seeing real returns are not the ones that simply plugged in a frontier model and waited for results. They are the ones that connected AI to their systems of record, governed what it can access, and built the operational layer that makes its outputs trustworthy enough to use at scale. Platforms that bring together content management, security, automation, and AI integration in a single layer are emerging as the foundation for enterprise AI, because model capability alone is not enough. Without governance built into the platform, the gaps between systems become the point of failure.
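The extraction pattern described above can be sketched minimally. This is not Box Extract itself: `llm` stands in for any chat-completion call, and the contract fields are illustrative assumptions.

```python
# Schema-constrained extraction: ask the model for a fixed set of keys
# and validate the shape before attaching it as metadata. `llm` is a
# hypothetical callable that returns the model's text output.
import json

FIELDS = ["counterparty", "effective_date", "termination_date", "renewal_terms"]

def extract_metadata(contract_text: str, llm) -> dict:
    prompt = (
        "Extract these fields from the contract below and return ONLY a "
        f"JSON object with exactly these keys: {', '.join(FIELDS)}. "
        "Use null for anything not present.\n\n" + contract_text
    )
    raw = llm(prompt)           # hypothetical chat-completion call
    metadata = json.loads(raw)  # production code would validate a schema
    return {k: metadata.get(k) for k in FIELDS}
```

The resulting dict can then be attached to the file as queryable metadata, carrying the same permissions as the document it came from.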
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
Block introduces Managerbot, a proactive Square AI agent and the clearest proof point yet for Jack Dorsey’s AI bet
Block today announced Managerbot, a new AI agent embedded in the Square platform that proactively monitors a seller’s business, identifies emerging problems, and proposes actionable solutions — without the seller ever having to ask a question. The product marks the most tangible manifestation of CEO Jack Dorsey’s controversial bet that artificial intelligence can fundamentally reshape how his company operates, builds products, and serves the millions of small businesses that depend on Square to run day-to-day commerce.

In an exclusive interview with VentureBeat, Willem Avé, Block’s head of product at Square, described Managerbot as a decisive break from the company’s earlier Square AI assistant, which functioned as a reactive chatbot that answered seller questions about sales, employees, and business performance.

“The big shift from Square AI to Managerbot is really from reactive to proactive,” Avé said. “What that means is the primary interface is not a question box. You assign tasks to Managerbot, and that could be based on data, an insight, or a signal from your business.”

The product is beginning to roll out now, with full availability to Square sellers expected over the coming months. Block declined to say whether Managerbot would carry an additional fee or be bundled into existing Square subscriptions.

How Managerbot predicts inventory shortages, optimizes schedules, and writes marketing campaigns on its own

Avé outlined three core domains where Managerbot operates today: inventory forecasting, employee shift scheduling, and automated marketing campaign creation. In every case, the agent acts before the seller does — watching over the business, detecting patterns, and surfacing recommendations with proposed actions attached.

In the inventory domain, Managerbot continuously monitors a seller’s stock levels, sales velocity, and external signals such as weather patterns and local events, then alerts the seller when an item is about to run out — or when it should stock up ahead of anticipated demand. “In warmer weather, we can see that you sell more of a certain good,” Avé explained. “That’s the forecasting capability, combined with local data — weather, events — so we can help sellers manage both their inventory and cash flows.” (A toy sketch of this kind of check appears at the end of this section.)

For shift scheduling — a task that Avé described as “one of those interesting, very hard computer science problems” that consumes hours of a small business owner’s week — Managerbot analyzes forecasted sales data and then generates optimized employee schedules that balance worker preferences with coverage needs. “It turns out that frontier models are actually pretty good at it,” Avé said.

The third capability tackles what Avé called “the whole bucket of things that sellers could do if they had more time” — principally marketing. Managerbot identifies sales trends across a seller’s catalog and automatically drafts win-back campaigns and promotional outreach targeted at a store’s best customer segments. Avé said Block is seeing “very meaningful lift” from Managerbot-generated campaigns compared to what some sellers create manually, though he declined to share specific performance figures publicly.
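Block hasn’t disclosed how its forecasting works, but the proactive inventory check Avé describes reduces, at its simplest, to a reorder-point test: project demand forward and alert before stock runs out. A toy sketch, with an invented uplift factor standing in for external signals like weather and local events:

```python
# Toy proactive stockout check. The uplift factor and all numbers are
# invented for illustration; this is not Block's forecasting model.
def days_of_cover(stock_on_hand: float, daily_velocity: float, uplift: float = 1.0) -> float:
    # Days of stock left at demand adjusted by external signals.
    projected_daily = daily_velocity * uplift
    return stock_on_hand / projected_daily if projected_daily else float("inf")

def stockout_alert(item: dict, lead_time_days: float = 3, uplift: float = 1.2):
    cover = days_of_cover(item["stock"], item["velocity"], uplift)
    if cover <= lead_time_days:
        return f"Reorder {item['name']}: ~{cover:.1f} days of stock at projected demand"
    return None

print(stockout_alert({"name": "iced latte cups", "stock": 180, "velocity": 60}))
# -> Reorder iced latte cups: ~2.5 days of stock at projected demand
```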
Block built Managerbot on frontier AI models from OpenAI and Anthropic — but says the real innovation is underneath

Managerbot runs on third-party frontier models — Avé specifically referenced Anthropic’s Sonnet and OpenAI’s GPT family — but Block’s competitive advantage, he argued, lies in the “agent harness” the company has built around those models. That harness draws heavily on Goose, Block’s open-source agent framework, and incorporates learnings from its consumer-facing Money Bot on Cash App.

The challenge specific to Square is scale and complexity. A seller running a small business might interact with hundreds of different tools across invoicing, inventory, customer management, marketing, payroll, and scheduling. Managerbot must navigate all of them coherently within a single agentic loop. “This isn’t like, you know, you load a skill and call it a day — think about hundreds of skills,” Avé said. “Actually, managing the context and managing the way that we progressively disclose tools, and some of the other innovation that we have at the harness layer, is I think some of the secret sauce.”

A critical design decision shapes every interaction: Managerbot does not autonomously execute changes to a seller’s business. Every write action — whether adjusting a shift schedule, publishing a marketing campaign, or modifying inventory — requires explicit seller approval. To facilitate that approval, Managerbot generates visual UI previews showing exactly what will change before the seller clicks “yes.” (This propose-preview-approve loop is sketched in code below.)

“We want to earn trust with sellers, so any write action is prompted to the user to approve,” Avé said. “The seller needs a visual representation of what the change is. You can’t just describe in words all the time what you’re going to go do.”

An $80 million fine and chatbot blunders hang over Block’s push to automate financial recommendations

That human-in-the-loop caution reflects a sensitivity that gains additional weight given Block’s recent history. In January 2025, 48 state financial regulators imposed an $80 million fine on Block for violations of Bank Secrecy Act and anti-money laundering laws related to Cash App. The Connecticut Department of Banking stated in announcing the settlement that regulators “found Block was not in compliance with certain requirements, creating the potential that its services could be used to support money laundering, terrorism financing, or other illegal activities.” The Illinois Department of Financial and Professional Regulation simultaneously joined the coordinated enforcement action.

Separately, reporting from The Guardian has documented instances of Block’s customer-facing chatbots making serious errors, including telling customers to cancel or close their accounts. When VentureBeat raised this concern during the interview, Avé acknowledged the stakes but redirected to Managerbot’s specific safeguards.

“Financial accuracy and financial data — the value of these products really come from recommendations,” Avé said. “We need to be better than whatever you can feed to ChatGPT. If you take a CSV of your sales and put it in ChatGPT or Claude, we need our product to be better and answer that question either more accurately or better than what’s available in the market.”

He pointed to the harness layer’s role in reducing hallucinations through tuning, prompt engineering, and optimized tool-call loops, while acknowledging the inherent limitations of probabilistic systems: “It’s never going to be zero. Obviously, these are probabilistic systems, and we have guidance and call-outs in the tool to provide that.”

On regulated domains like lending and payments, Avé was more definitive: “In any sort of regulated domains — banking, lending, payments — there are strict guardrails on what we can and can’t say to sellers. Those are just part of the product and business.”
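Block hasn’t published Managerbot’s internals, but the human-in-the-loop pattern Avé describes (propose, preview, await explicit approval before any write) looks roughly like this sketch; all types and names are hypothetical:

```python
# Sketch of an approval-gated write action: the agent proposes, the
# seller sees a rendered preview, and only an explicit "yes" triggers
# the write. Hypothetical types, not Block's actual harness.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProposedAction:
    description: str             # e.g. "Publish win-back campaign to 212 customers"
    preview: str                 # rendered preview of exactly what will change
    execute: Callable[[], None]  # the write, deferred until approval

def run_with_approval(action: ProposedAction, ask_seller: Callable[[str], bool]) -> bool:
    # The agent never writes directly; the seller decides on the preview.
    approved = ask_seller(f"{action.description}\n\n{action.preview}\n\nApprove?")
    if approved:
        action.execute()
    return approved
```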
Dorsey cut 4,000 jobs in the name of AI — Managerbot is the first answer to what those tools are actually building

It is impossible to evaluate Managerbot outside the context of the radical organizational surgery Block performed just weeks ago. In late February, Dorsey announced that Block would cut more than 4,000 of its roughly 10,000 employees — nearly half the workforce — explicitly citing AI as the driving rationale. As the BBC reported, Dorsey wrote that “AI fundamentally changes what it means to build and run a company.” Block’s stock surged more than 20 percent on the news, according to ABC7.

The company’s Q4 2025 earnings report, released alongside the layoff announcement, showed gross profit of $2.87 billion — up 24 percent year over year — and raised 2026 guidance to $12.2 billion in gross profit, according to AlphaSense’s earnings analysis. Block also reported a greater than 40 percent increase in production code shipped per engineer since September 2025 through the use of agentic coding tools.

As CNBC commentator Steve Sedgwick wrote in an opinion piece following the announcement, “I keep getting told on CNBC that AI will create new jobs to replace those being lost. I’ve been asking the same question for years now.” The Observer’s Mark Minevich was more pointed, calling Block’s layoffs “probably the first legitimate mass layoff driven by A.I. as the actual operating thesis.”

Managerbot, then, is the product answer to the obvious follow-up question: if Block shed 4,000 workers in the name of intelligence tools, what exactly are those intelligence tools building? Avé framed the product as proof of concept for Block’s entire strategic thesis. “Block has been in the press recently about rebuilding as an intelligence company, and it’s like, a lot of people are asking, ‘What does that mean for us?’” Avé said. “What I like to do is show, not tell. We’re building Managerbot, which I think is one of the more advanced, maybe the most advanced, small business agent out there today.”

Sellers who use Managerbot are consolidating their businesses onto Square — and that may be the real strategic payoff

Perhaps the most consequential signal Avé shared was an early behavioral pattern: sellers who begin using Managerbot are voluntarily migrating more of their business operations onto the Square platform, consolidating payroll, time cards, and shift scheduling into Block’s ecosystem to feed the agent more data.

“When they start interacting with Managerbot, they want to move more of their business onto Square because they see the value,” Avé said. “They’re like, ‘I should put my payroll here. I should get time cards here. I should get my shift schedules here,’ because once all that data is in one place, they can make better decisions and manage their business better.”

This dynamic could prove to be Managerbot’s most significant long-term effect — not as a standalone feature, but as a gravitational force pulling sellers deeper into Block’s integrated commerce stack. Block’s Q4 earnings already showed Square’s new volume added grew 29 percent year over year, with sales-led NVA surging 62 percent.

Avé also argued that Square’s first-party architecture — built organically rather than through acquisitions — gives it a structural advantage over competitors in the AI era. “We’ve kind of harmonized and canonicalized this data at a sensible layer,” he said.
“It’s not super hard to create more skills for these data domains.”

When VentureBeat pressed Avé on the tension between helping sellers and upselling them on Block’s own financial products — lending, payments processing, and other services that generate revenue for the company — he acknowledged the concern but framed Managerbot’s mission in terms of decision-making quality. “The goal for Managerbot is to help sellers increase their decision-making correctness,” Avé said. “If we can make sellers better at running their business by making better decisions and giving time back, I think that’s a good thing.”

Block says Managerbot isn’t a chatbot — it’s a business protector that compounds the company’s entire AI strategy

Avé was insistent that Managerbot represents something categorically different from the chatbot-as-advisor model that has proliferated across enterprise software. “A lot of people are building chatbots as advisors — it can answer a question for you,” he said. “What we really want Managerbot to be is a protector of your business. This is identifying trends. This is spotting things that you might have missed. This is helping you run your business and take actions.”

He also argued that the agent model compounds Block’s development velocity in ways that traditional software cannot match. “It’s much more straightforward to add a capability to Managerbot than it is to build a big Web 2.0 UI,” Avé said. “If we can deliver more capabilities, more features, more value to our sellers, the whole system compounds.”

Whether that compounding materializes — and whether sellers ultimately experience Managerbot as a trusted protector or a sophisticated upsell engine — will determine much about Block’s future. The company has staked its corporate identity, its headcount, and its Wall Street narrative on the conviction that AI agents can deliver more value with fewer humans in the loop. Managerbot is the first product to carry the full weight of that promise. And the small business owners who keep their shops open with Square terminals, who juggle shift schedules on napkins and skip marketing because there aren’t enough hours in the day — they didn’t ask to be the test case for Silicon Valley’s boldest AI thesis. But as of today, they are.
How MassMutual and Mass General Brigham turned AI pilot sprawl into production results
Enterprise AI programs rarely fail because of bad ideas. More often, they get stuck in ungoverned pilot mode and never reach production. At a recent VentureBeat event, technology leaders from MassMutual and Mass General Brigham explained how they avoided that trap — and what the results look like when discipline replaces sprawl.

At MassMutual, the results are concrete: 30% developer productivity gains, IT help desk resolution times reduced from 11 minutes to one, and customer service calls cut from 15 minutes to just one or two.

“We’re always starting with why do we care about this problem?” Sears Merritt, MassMutual’s head of enterprise technology and experience, said at the event. “If we solve the problem, how are we gonna know we solved it? And, how much value is associated with doing that?”

Defining metrics, establishing strong feedback loops

MassMutual, a 175-year-old company serving millions of policy owners and customers, has pushed AI into production across the business — customer support, IT, customer acquisition, underwriting, servicing, claims, and other areas. Merritt said his team follows the scientific method, beginning with a hypothesis and testing whether it has an outcome that will tangibly drive the business forward. Some ideas are great, but they may be “intractable in the business” due to factors like lack of data or access, or regulatory constraints. “We won’t go any further with an idea until we get crystal clear on how we’re going to measure, and how we’re going to define success.”

Ultimately, it’s up to different departments and leaders to define what quality means: Choose a metric and define the minimum level of quality before a tool is placed into the hands of teams and partners. That starting point creates a quick feedback loop. “The things that we find slow us down is where there isn’t shared clarity on what outcome we’re trying to achieve,” which can lead to confusion and constant re-adjusting, said Merritt. “We don’t go to production until there is a business partner that says, ‘Yes, that works.’”

His team is strategic about evaluating emerging tools, and “extremely rigorous” when testing and measuring what “good” means. For instance, they perform trust scoring to lower hallucination rates, establish thresholds and evaluation criteria, and monitor for feature and output drift.

Merritt also operates with a no-commitment policy — meaning the company doesn’t lock itself into using a particular model. It has what he calls an “incredibly heterogeneous” technology environment combining best-of-breed models alongside mainframes running on COBOL. That flexibility isn’t accidental. His team built common service layers, microservices and APIs that sit between the AI layer and everything underneath — so when a better model comes along, swapping it in doesn’t mean starting over. (A minimal sketch of that pattern follows below.)

Because, Merritt explained, “the best of breed today might be the worst of breed tomorrow, and we don’t want to set ourselves up to fall behind.”
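MassMutual hasn’t published its service layer, but the no-commitment pattern Merritt describes can be sketched minimally: a thin gateway owns the model choice, and a new model is promoted only once it clears an agreed evaluation bar. All names and the threshold are illustrative:

```python
# Minimal model-abstraction gateway with an evaluation gate. Business
# logic calls the gateway, never a vendor SDK directly; names and the
# quality threshold are illustrative, not MassMutual's implementation.
from typing import Callable, Protocol

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class ModelGateway:
    def __init__(self, models: dict[str, ChatModel], active: str):
        self.models, self.active = models, active

    def complete(self, prompt: str) -> str:
        return self.models[self.active].complete(prompt)

    def promote(self, candidate: str,
                eval_suite: Callable[[ChatModel], float],
                min_score: float = 0.9) -> float:
        # Swap models only when the candidate clears the agreed quality bar.
        score = eval_suite(self.models[candidate])
        if score >= min_score:
            self.active = candidate
        return score
```

The design choice mirrors Merritt’s point: when a better model ships, the swap happens behind one interface instead of rippling through every workflow.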
Weeding instead of letting a thousand flowers bloom

Mass General Brigham (MGB), for its part, took more of a spray-and-pray approach — at first. Around 15,000 researchers in the not-for-profit health system have been using AI, ML, and deep learning for the last 10 to 15 years, CTO Nallan “Sri” Sriraman said at the same VB event. But last year, he made a bold choice: His team shut down a sprawl of non-governed AI pilots.

Initially, “we did follow the thousand flowers bloom [methodology], but we didn’t have a thousand flowers, we had probably a few tens of flowers trying to bloom,” he said.

Like Merritt’s team at MassMutual, MGB pivoted to a more holistic view, examining why they were developing certain tools for specific departments or workflows. They questioned what capabilities they wanted and needed and what investment those required. Sriraman’s team also spoke with their primary platform providers — Epic, Workday, ServiceNow, Microsoft — about their roadmaps. This was a “pivotal moment,” he noted, as they realized they were building in-house tools that vendors were already providing (or were planning to roll out). As Sriraman put it: “Why are we building it ourselves? We are already on the platform. It is going to be in the workflow. Leverage it.”

That said, the marketplace is still nascent, which can make for difficult decisions. “The analogy I will give is when you ask six blind men to touch an elephant and say, what does this elephant look like?” Sriraman said. “You’re gonna get six different answers.” There’s nothing wrong with that, he noted; it’s just that everybody is discovering and experimenting as the landscape keeps shifting.

Instead of a wild West environment, Sriraman’s team distributes Microsoft Copilot to users across the business, and uses a “small landing zone” where they can safely test more sophisticated products and control token use. They also began “consciously embedding AI champions” across business groups. “This is kind of a reverse of letting a thousand flowers bloom, carefully planting and nourishing,” Sriraman said.

Observability is another big consideration; he describes real-time dashboards that manage model drift and safety and allow IT teams to govern AI “a little more pragmatically.” Health monitoring is critical with AI systems, he noted, and his team has established principles and policies around AI use, not to mention least-privilege access.

In clinical settings, the guardrails are absolute: AI systems never issue the final decision. “There’s always going to be a doctor or a physician assistant in the loop to close the decision,” Sriraman said. He cited radiology report generation as one area where AI is used heavily, but where a radiologist always signs off.

Sriraman was clear: “Thou shall not do this: Don’t show PHI [protected health information] in Perplexity. As simple as that, right?”

And, importantly, there must be safety mechanisms in place. “We need a big red button, kill it,” Sriraman emphasized. “We don’t put anything in the operational setting without that.”

Ultimately, while agentic AI is a transformative technology, the enterprise approach to it doesn’t have to be dramatically different. “There is nothing new about this,” Sriraman said. “You can replace the word BPM [business process management] from the ’90s and 2000s with AI. The same concepts apply.”
AI agents that automatically prevent, detect and fix software issues are here as NeuBird AI launches Falcon, FalconClaw
The mantra of the modern tech industry was arguably coined by Facebook (before it became Meta): “move fast and break things.” But as enterprise infrastructure has shifted into a dizzying maze of hybrid clouds, microservices, and ephemeral compute clusters, the “breaking” part has become a structural tax that many organizations can no longer afford to pay. Today, three-year-old startup NeuBird AI is launching a full-scale offensive against this “chaos tax,” announcing a $19.3 million funding round alongside the release of its Falcon autonomous production operations agent.

The launch isn’t just a product update; it is a philosophical pivot. For years, the industry has focused on “incident response” — making the fire trucks faster and the hoses bigger. NeuBird AI is arguing that the only sustainable path forward is “incident avoidance.” As Venkat Ramakrishnan, president and COO of NeuBird AI, put it in a recent interview: “Incident management is so old school. Incident resolution is so old school. Incident avoidance is what is going to be enabled by AI.” By grounding AI in real-time enterprise context rather than just large language model reasoning, the company aims to move site reliability engineering and devops teams from a reactive posture to a predictive one.

The AI divide: a reality check on automation

Accompanying the launch is NeuBird AI’s 2026 State of Production Reliability and AI Adoption Report, a survey of over 1,000 professionals that reveals a massive disconnect between the boardroom and the server room. While 74% of C-suite executives believe their organizations are actively using AI to manage incidents, only 39% of the practitioners — the engineers actually on call at 2:00 AM — agree.

This 35-point “AI divide” suggests that while leadership is writing checks for AI platforms, the technology is often failing to reach the frontline. For engineers, the reality remains manual and grueling: the study found that engineering teams spend an average of 40% of their time on incident management rather than building new products. Gou Rao, CEO of NeuBird AI, told VentureBeat that this is a persistent operational reality: “Over the past 18 months that we have been in production, this is not a marketing slide. We have concretely been able to demonstrate a massive reduction in time to incident response and resolution.”

The consequences of this “toil” are more than just lost productivity. Alert fatigue has transitioned from a morale issue to a direct reliability risk. According to the report, 83% of organizations have teams that ignore or dismiss alerts occasionally, and 44% of companies experienced an outage in the past year tied directly to a suppressed or ignored alert. In many cases, the systems are so noisy that customers discover failures before the monitoring tools do.

Introducing NeuBird AI Falcon

NeuBird AI’s answer to this systemic failure is the Falcon engine. While the company’s previous iteration, Hawkeye, focused on autonomous resolution, Falcon extends that capability into predictive intelligence. “When we launched NeuBird AI in 2023, our first version of the agent was called Hawkeye,” Rao explains. “What we’re announcing next week at HumanX is our next-generation version of the agent, codenamed Falcon. Falcon is easily three times faster than Hawkeye and is averaging around 92% in confidence scores.”

This level of accuracy allows engineers to trust the agent’s output at face value.
Falcon represents a significant leap over previous generative AI applications in the space, particularly in its ability to forecast failure. “Falcon is really good at preventive prediction, so it can tell you what can go wrong,” Rao says. “It’s pretty accurate on a 72-hour window, even better at 48 hours, and by 24 hours it gets really, really accurate.”

One of the standout features of the new release is the Advanced Context Map. Unlike static dashboards, this is a real-time view of infrastructure dependencies and service health. It allows teams to visualize the “blast radius” of an issue as it propagates across an environment, helping engineers understand not just what is broken, but why it is failing in the context of its neighbors.

‘Minority Report’ for incident management

While many AI tools favor flashy web interfaces, NeuBird AI is leaning into the developer’s native habitat with NeuBird AI Desktop. This allows engineers to invoke the production ops agent directly from a command-line interface to explore root causes and system dependencies.

“Falcon has a desktop mode which allows it to interact with a developer’s local tools,” Rao noted. “We’re getting a lot more traction from a hands-on developer audience, especially as people go to Claude Desktop and Cursor. They’re completing the loop by using production agents talking to their coding agents.” This integration enables a “multi-agent” workflow where an engineer can use NeuBird AI’s agent to diagnose a root cause in production and then hand off that diagnosis to a coding agent like Claude Code to implement the fix.

During a live demo, Rao showcased how the agent could be set to “Sentinel Mode,” constantly sweeping a cluster for risks. If it detects an anomaly — such as a projected 5% spike in AWS costs or a misconfigured Kubernetes pod — it can flag the specific engineer on call who has the domain expertise to fix it. “This is like ‘Minority Report’ for incident management,” one financial services executive reportedly told the team after a demo.

Context engineering: a gateway for security

A primary concern for enterprises deploying AI is security — ensuring large language models don’t go “crazy” or exfiltrate sensitive data. NeuBird AI addresses this through a proprietary approach to “context engineering.”

“The way we implemented our agent is that the large language models themselves are never actually touching the data directly,” Rao explains. “We become the gateway for how the context can be accessed.” This means the model is the reasoning engine, but NeuBird AI is the middleman that wraps the data.

Furthermore, the company has implemented strict guardrails on what the agent can actually execute. “We’ve created a language that confines and restricts the agent from what it can do,” says Rao. “If it comes up with something anomalous, or something we don’t know, it won’t run. We won’t do it.”

This architectural choice allows NeuBird AI to remain model-agnostic. If a newer model from Anthropic or Google outperforms the current reasoning engine, NeuBird AI can simply switch it out without requiring the customer to change their platform. “Customers don’t want to be tied to a specific way of reasoning,” Rao asserts. “They want to be tied to a platform from which they can get the value of an agentic system.”
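NeuBird AI’s implementation is proprietary, but the two ideas Rao describes (a gateway that curates what context the model sees, and a confined action language that refuses anything off an allowlist) can be sketched generically; all names here are hypothetical:

```python
# Generic sketch of a context gateway with an allowlisted action
# language. Illustrative only: `llm`, `fetch_context`, and `executor`
# are hypothetical stand-ins, not NeuBird AI's implementation.
ALLOWED_ACTIONS = {"restart_pod", "scale_deployment", "rotate_credentials"}

def run_action(action: str, args: dict, executor) -> str:
    # Refuse anything outside the confined action language.
    if action not in ALLOWED_ACTIONS:
        return f"refused: {action!r} is not in the allowlist"
    return executor(action, args)

def answer_with_context(question: str, llm, fetch_context) -> str:
    # The model never queries raw telemetry; the gateway curates and
    # redacts what it is allowed to see before reasoning begins.
    context = fetch_context(question)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")
```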
Displacing the “army” and expensive observability tools

One of the most radical claims NeuBird AI makes is that agentic systems can actually reduce the amount of data enterprises need to store in the first place. Currently, teams rely on massive storage platforms with complex query languages.

“People use very complex observability tools like Datadog, Dynatrace, and Sysdig,” Rao says. “This is the norm today, which is why it takes an army of people to solve a problem. What we’ve been able to demonstrate with agentic systems is that you don’t need to store all that data in the first place.” Because the agent can reason across raw data sources, it can identify which signals are junk and which are critical. This shift, Rao argues, “reduces human toil and effort while simultaneously reducing your reliance on these insanely expensive observability tools.”

The practical impact of this “incident avoidance” was recently demonstrated at Deep Health. Rao recounts how their agent detected a systemic issue that was invisible to traditional tools: “Our agent was able to go in and prevent an issue from happening which would have caused this company, Deep Health, a major production outage. The customer is completely beside themselves and happy about what it could do.”

FalconClaw: operationalizing ‘tribal knowledge’

One of the most persistent problems in IT operations is the loss of “tribal knowledge” — the hard-won expertise of senior engineers that exists only in their heads. NeuBird AI is attempting to solve this with FalconClaw, a curated, enterprise-grade skills hub compatible with the OpenClaw ecosystem.

FalconClaw allows teams to capture best practices and resolution steps as “validated and compliant skills.” The tech preview launched today with 15 initial skills that work natively with NeuBird AI’s toolchain. According to Francois Martel, Field CTO at NeuBird AI, this turns hard-won expertise into a reusable asset that the AI can use automatically. It’s an attempt to standardize how agents interact with infrastructure, moving away from proprietary “black box” systems toward a multi-agent world where different AI tools can share a common set of operational abilities.
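FalconClaw’s skill format hasn’t been published; as a guess at the shape, a captured skill might look something like this, with every field hypothetical:

```python
# Hypothetical shape for a captured, governed operational skill. This
# is a guess at the concept, not FalconClaw's actual format.
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str                 # e.g. "drain-and-replace-k8s-node"
    steps: list[str]          # the captured resolution procedure
    preconditions: list[str] = field(default_factory=list)
    validated: bool = False   # reviewed by a senior engineer
    compliant: bool = False   # passed policy and audit checks

    def runnable(self) -> bool:
        # Agents may only invoke skills that are validated and compliant.
        return self.validated and self.compliant
```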
Scaling the moat: funding and leadership

The $19.3 million round was led by Xora Innovation, a Temasek-backed firm, with participation from Mayfield, M12, StepStone Group, and Prosperity7 Ventures. This brings NeuBird AI’s total funding to approximately $64 million.

The investor interest is fueled largely by the pedigree of the founding team. Gou Rao and Vinod Jayaraman previously co-founded Portworx, which was acquired by Pure Storage, and Ocarina Networks, acquired by Dell. They have recently bolstered their leadership with Venkat Ramakrishnan, another Pure Storage veteran, as president and COO.

For investors like Phil Inagaki of Xora, the value lies in NeuBird AI’s “best-in-class results across accuracy, speed and token consumption.” As cloud costs continue to spiral, the ability of an AI agent to not only fix bugs but also optimize infrastructure capacity is becoming a “must-have” rather than a “nice-to-have.” NeuBird AI claims its agent can save enterprise teams more than 200 engineering hours per month.

The path to ‘self-healing’ infrastructure

As the State of Production Reliability report notes, current incident management practices are “no longer sustainable.” With 61% of organizations estimating that a single hour of downtime costs $50,000 or more, the financial stakes of staying in a reactive loop are enormous.

NeuBird AI’s launch of Falcon and FalconClaw marks a definitive attempt to break that loop. By focusing on prevention and the “context engineering” required to make AI trustworthy for enterprise production, the company is positioning itself as the critical intelligence layer for the modern stack.

While the “AI divide” between executives and practitioners remains a significant hurdle for the industry, NeuBird AI is betting that as engineers see the value of a CLI-driven, 92%-accurate agent that can “see around corners,” the skepticism will fade. For the site reliability engineers currently drowning in a flood of non-actionable alerts, the arrival of a reliable AI teammate couldn’t come soon enough.

NeuBird AI Falcon is available starting today, with organizations able to sign up for a free trial at neubird.ai.