For more than two decades, digital businesses have relied on a simple assumption: When someone interacts with a website, that activity reflects a human making a conscious choice. Clicks are treated as signals of interest. Time on page is assumed to indicate engagement. Movement through a funnel is interpreted as intent. Entire growth strategies, marketing budgets, and product decisions have been built on this premise.

Today, that assumption is quietly beginning to erode.

As AI-powered tools increasingly interact with the web on behalf of users, many of the signals organizations depend on are becoming harder to interpret. The data itself is still accurate — pages are viewed, buttons are clicked, actions are recorded — but the meaning behind those actions is changing. This shift isn’t theoretical or limited to edge cases. It’s already influencing how leaders read dashboards, forecast demand, and evaluate performance.

The challenge ahead isn’t stopping AI-driven interactions. It’s learning how to interpret digital behavior in a world where human and automated activity increasingly overlap.

A changing assumption about web traffic

For decades, the foundation of the internet rested on a quiet, human-centric model. Behind every scroll, form submission, or purchase flow was a person acting out of curiosity, need, or intent. Analytics platforms evolved to capture these behaviors. Security systems focused on separating “legitimate users” from clearly scripted automation. Even digital advertising economics assumed that engagement equaled human attention.

Over the last few years, that model has begun to shift. Advances in large language models (LLMs), browser automation, and AI-driven agents have made it possible for software systems to navigate the web in ways that feel fluid and context-aware. Pages are explored, options are compared, workflows are completed — often without obvious signs of automation.

This doesn’t mean the web is becoming less human. Instead, it’s becoming more hybrid. AI systems are increasingly embedded in everyday workflows, acting as research assistants, comparison tools, or task completers on behalf of people. As a result, the line between a human interacting directly with a site and software acting for them is becoming less distinct.

The challenge isn’t automation itself. It’s the ambiguity this overlap introduces into the signals businesses rely on.

What do we mean by AI-generated traffic?

When people hear “automated traffic,” they often think of the bots of the past — rigid scripts that followed predefined paths and broke the moment an interface changed. Those systems were repetitive, predictable, and relatively easy to identify.

AI-generated traffic is different.

Modern AI agents combine machine learning (ML) with automated browsing capabilities. They can interpret page layouts, adapt to interface changes, and complete multi-step tasks. In many cases, language models guide decision-making, allowing these systems to adjust behavior based on context rather than fixed rules. The result is interaction that appears far more natural than earlier automation.

Importantly, this kind of traffic is not inherently problematic. Automation has long played a productive role on the web, from search indexing and accessibility tools to testing frameworks and integrations. Newer AI agents simply extend this evolution — helping users summarize content, compare products, or gather information across multiple sites.

The issue is not intent, but interpretation.
When AI agents interact with a site successfully on behalf of users, traditional engagement metrics may no longer reflect the same meaning they once did.

Why AI-generated traffic is becoming harder to distinguish

Historically, detecting automated activity relied on spotting technical irregularities. Systems flagged behavior that moved too fast, followed perfectly consistent paths, or lacked standard browser features. Automation exposed “tells” that made classification straightforward.

AI-driven systems change this dynamic. They operate through standard browsers. They pause, scroll, and navigate non-linearly. They vary timing and interaction sequences. Because these agents are designed to interact with the web as it was built — for humans — their behavior increasingly blends into normal usage patterns.

As a result, the challenge shifts from identifying errors to interpreting behavior. The question becomes less about whether an interaction is automated and more about how it unfolds over time. Many of the signals that once separated humans from software are converging, making binary classification less effective.

When engagement stops meaning what we think

Consider a common e-commerce scenario.

A retail team notices a sustained increase in product views and “add to cart” actions. Historically, this would be a clear signal of growing demand, prompting increased ad spend or inventory expansion.

Now imagine that a portion of this activity is generated by AI agents performing price monitoring or product comparison on behalf of users. The interactions occurred. The metrics are accurate. But the underlying intent is different. The funnel no longer represents a straightforward path toward purchase.

Nothing is “wrong” with the data — but the meaning has shifted.

Similar patterns are appearing across industries:

- Digital publishers see spikes in article engagement without corresponding ad revenue.
- SaaS companies observe heavy feature exploration with limited conversion.
- Travel platforms record increased search activity that doesn’t translate into bookings.

In each case, organizations risk optimizing for activity rather than value.

Why this is a data and analytics problem

At its core, AI-generated traffic introduces ambiguity into the assumptions underlying analytics and modeling. Many systems assume that observed behavior maps cleanly to human intent. When automated interactions are mixed into datasets, that assumption weakens.

Behavioral data may now include:

- Exploration without purchase intent
- Research-driven navigation
- Task completion without conversion
- Repeated patterns driven by automation goals

For analytics teams, this introduces noise into labels, weakens proxy metrics, and increases the risk of feedback loops. Models trained on mixed signals may learn to optimize for volume rather than outcomes that matter to the business.

This doesn’t invalidate analytics. It raises the bar for interpretation.

Data integrity in a machine-to-machine world

As behavioral data increasingly feeds ML systems that shape user experience, the composition of that data matters. If a growing share of interactions comes from automated agents, platforms may begin to optimize for machine navigation rather than human experience.

Over time, this can subtly reshape the web. Interfaces may become efficient for extraction and summarization while losing the irregularities that make them intuitive or engaging for people.
Preserving a meaningful human signal requires moving beyond raw volume and focusing on interaction context.

From exclusion to interpretation

For years, the default response to automation was exclusion. CAPTCHAs, rate limits, and static thresholds worked well when automated behavior was clearly distinct.

That approach is becoming less effective. AI-driven agents often provide real value to users, and blanket blocking can degrade user experience without improving outcomes. As a result, many organizations are shifting from exclusion toward interpretation.

Rather than asking how to keep automation out, teams are asking how to understand different types of traffic and respond appropriately — serving purpose-aligned experiences without assuming a single definition of legitimacy.

Behavioral context as a complementary signal

One promising approach is focusing on behavioral context. Instead of centering analysis on identity, systems examine how interactions unfold over time.

Human behavior is inconsistent and inefficient. People hesitate, backtrack, and explore unpredictably. Automated agents, even when adaptive, tend to exhibit a more structured internal logic. By observing navigation flow, timing variability, and interaction sequencing, teams can infer intent probabilistically rather than categorically.

This allows organizations to remain open while gaining a more nuanced understanding of activity.

Ethics, privacy, and responsible interpretation

As analysis becomes more sophisticated, ethical boundaries become more important. Understanding interaction patterns is not the same as tracking individuals.

The most resilient approaches rely on aggregated, anonymized signals and transparent practices. The goal is to protect platform integrity while respecting user expectations. Trust remains a foundational requirement, not an afterthought.

The future: A spectrum of agency

Looking ahead, web interactions increasingly fall along a spectrum. On one end, humans browse directly; in the middle, users are assisted by AI tools; on the other end, agents act independently on a user’s behalf.

This evolution reflects a maturing digital ecosystem. It also demands a shift in how success is measured. Simple counts of clicks or visits are no longer sufficient. Value must be assessed in context.

What business leaders should focus on now

AI-generated traffic is not a problem to eliminate — it’s a reality to understand.

Leaders who adapt successfully will:

- Reevaluate how engagement metrics are interpreted
- Separate activity from intent in analytics reviews
- Invest in contextual and probabilistic measurement approaches
- Preserve data quality as AI participation grows
- Treat trust and privacy as design principles

The web has evolved before, and it will evolve again. The question is whether organizations are prepared to evolve how they read the signals it produces.

Shashwat Jain is a senior software engineer at Amazon.
Fixing AI failure: Three changes enterprises should make now
Recent reports about AI project failure rates have raised uncomfortable questions for organizations investing heavily in AI. Much of the discussion has focused on technical factors like model accuracy and data quality, but after watching dozens of AI initiatives launch, I’ve noticed that the biggest opportunities for improvement are often cultural, not technical.

Internal projects that struggle tend to share common issues. For example, engineering teams build models that product managers don’t know how to use. Data scientists build prototypes that operations teams struggle to maintain. And AI applications sit unused because the people they were built for weren’t involved in deciding what “useful” really meant.

In contrast, organizations that achieve meaningful value with AI have figured out how to create the right kind of collaboration across departments, and established shared accountability for outcomes. The technology matters, but the organizational readiness matters just as much.

Here are three practices I’ve observed that address the cultural and organizational barriers that can impede AI success.

Expand AI literacy beyond engineering

When only engineers understand how an AI system works and what it’s capable of, collaboration breaks down. Product managers can’t evaluate trade-offs they don’t understand. Designers can’t create interfaces for capabilities they can’t articulate. Analysts can’t validate outputs they can’t interpret.

The solution isn’t making everyone a data scientist. It’s helping each role understand how AI applies to their specific work. Product managers need to grasp what kinds of generated content, predictions or recommendations are realistic given available data. Designers need to understand what the AI can actually do so they can design features users will find useful. Analysts need to know which AI outputs require human validation versus which can be trusted.

When teams share this working vocabulary, AI stops being something that happens in the engineering department and becomes a tool the entire organization can use effectively.

Establish clear rules for AI autonomy

The second challenge involves knowing where AI can act on its own versus where human approval is required. Many organizations default to extremes, either bottlenecking every AI decision through human review, or letting AI systems operate without guardrails.

What’s needed is a clear framework that defines where and how AI can act autonomously. This means establishing rules upfront: Can AI approve routine configuration changes? Can it recommend schema updates but not implement them? Can it deploy code to staging environments but not production?

These rules should include three elements: auditability (can you trace how the AI reached its decision?), reproducibility (can you recreate the decision path?), and observability (can teams monitor AI behavior as it happens?). Without this framework, you either slow down to the point where AI provides no advantage, or you create systems making decisions nobody can explain or control.

Create cross-functional playbooks

The third step is codifying how different teams actually work with AI systems. When every department develops its own approach, you get inconsistent results and redundant effort.

Cross-functional playbooks work best when teams develop them together rather than having them imposed from above. These playbooks answer concrete questions like: How do we test AI recommendations before putting them into production?
What’s our fallback procedure when an automated deployment fails – does it hand off to human operators or try a different approach first? Who needs to be involved when we override an AI decision? How do we incorporate feedback to improve the system?

The goal isn’t to add bureaucracy. It’s ensuring everyone understands how AI fits into their existing work, and what to do when results don’t match expectations.

Moving forward

Technical excellence in AI remains important, but enterprises that over-index on model performance while ignoring organizational factors are setting themselves up for avoidable challenges. The successful AI deployments I’ve seen treat cultural transformation and workflows just as seriously as technical implementation.

The question isn’t whether your AI technology is sophisticated enough. It’s whether your organization is ready to work with it.

Adi Polak is director for advocacy and developer experience engineering at Confluent.
NanoClaw and Docker partner to make sandboxes the safest way for enterprises to deploy AI agents
NanoClaw, the open-source AI agent platform created by Gavriel Cohen, is partnering with the containerized development platform Docker to let teams run agents inside Docker Sandboxes, a move aimed at one of the biggest obstacles to enterprise adoption: how to give agents room to act without giving them room to damage the systems around them.

The announcement matters because the market for AI agents is shifting from novelty to deployment. It is no longer enough for an agent to write code, answer questions or automate a task. For CIOs, CTOs and platform leaders, the harder question is whether that agent can safely connect to live data, modify files, install packages and operate across business systems without exposing the host machine, adjacent workloads or other agents.

That is the problem NanoClaw and Docker say they are solving together.

A security argument, not just a packaging update

NanoClaw launched as a security-first alternative in the rapidly growing “claw” ecosystem, where agent frameworks promise broad autonomy across local and cloud environments. The project’s core argument has been that many agent systems rely too heavily on software-level guardrails while running too close to the host machine.

This Docker integration pushes that argument down into infrastructure.

“The partnership with Docker is integrating NanoClaw with Docker Sandboxes,” Cohen said in an interview. “The initial version of NanoClaw used Docker containers for isolating each agent, but Docker Sandboxes is the proper enterprise-ready solution for rolling out agents securely.”

That progression matters because the central issue in enterprise agent deployment is isolation. Agents do not behave like traditional applications. They mutate their environments, install dependencies, create files, launch processes and connect to outside systems. That breaks many of the assumptions underlying ordinary container workflows.

Cohen framed the issue in direct terms: “You want to unlock the full potential of these highly capable agents, but you don’t want security to be based on trust. You have to have isolated environments and hard boundaries.”

That line gets at the broader challenge facing enterprises now experimenting with agents in production-like settings. The more useful agents become, the more access they need. They need tools, memory, external connections and the freedom to take actions on behalf of users and teams. But each gain in capability raises the stakes around containment. A compromised or badly behaving agent cannot be allowed to spill into the host environment, expose credentials or access another agent’s state.

Why agents strain conventional infrastructure

Docker president and COO Mark Cavage said that reality forced the company to rethink some of the assumptions built into standard developer infrastructure.

“Fundamentally, we had to change the isolation and security model to work in the world of agents,” Cavage said. “It feels like normal Docker, but it’s not.”

He explained why the old model no longer holds. “Agents break effectively every model we’ve ever known,” Cavage said. “Containers assume immutability, but agents break that on the very first call. The first thing they want to do is install packages, modify files, spin up processes, spin up databases — they want full mutability and a full machine to run in.”

That is a useful framing for enterprise technical decision-makers. The promise of agents is not that they behave like static software with a chatbot front end. The promise is that they can perform open-ended work.
But open-ended work is exactly what creates new security and governance problems. An agent that can install a package, rewrite a file tree, start a database process or access credentials is more operationally useful than a static assistant. It is also more dangerous if it is running in the wrong environment.

Docker’s answer is Docker Sandboxes, which use MicroVM-based isolation while preserving familiar Docker packaging and workflows. According to the companies, NanoClaw can now run inside that infrastructure with a single command, giving teams a more secure execution layer without forcing them to redesign their agent stack from scratch.

Cavage put the value proposition plainly: “What that gets you is a much stronger security boundary. When something breaks out — because agents do bad things — it’s truly bounded in something provably secure.”

That emphasis on containment rather than trust lines up closely with NanoClaw’s original thesis. In earlier coverage of the project, NanoClaw was positioned as a leaner, more auditable alternative to broader and more permissive frameworks. The argument was not just that it was open source, but that its simplicity made it easier to reason about, secure and customize for production use.

Cavage extended that argument beyond any single product. “Security is defense in depth,” he said. “You need every layer of the stack: a secure foundation, a secure framework to run in, and secure things users build on top.”

That is likely to resonate with enterprise infrastructure teams that are less interested in model novelty than in blast radius, auditability and layered control. Agents may still rely on the intelligence of frontier models, but what matters operationally is whether the surrounding system can absorb mistakes, misfires or adversarial behavior without turning one compromised process into a wider incident.

The enterprise case for many agents, not one

The NanoClaw-Docker partnership also reflects a broader shift in how vendors are beginning to think about agent deployment at scale. Instead of one central AI system doing everything, the model emerging here is many bounded agents operating across teams, channels and tasks.

“What OpenClaw and the claws have shown is how to get tremendous value from coding agents and general-purpose agents that are available today,” Cohen said. “Every team is going to be managing a team of agents.”

He pushed that idea further in the interview, sketching a future closer to organizational systems design than to the consumer assistant model that still dominates much of the AI conversation. “In businesses, every employee is going to have their personal assistant agent, but teams will manage a team of agents, and a high-performing team will manage hundreds or thousands of agents,” Cohen said.

That is a more useful enterprise lens than the usual consumer framing. In a real organization, agents are likely to be attached to distinct workflows, data stores and communication surfaces. Finance, support, sales engineering, developer productivity and internal operations may all have different automations, different memory and different access rights. A secure multi-agent future depends less on generalized intelligence than on boundaries: who can see what, which process can touch which file system, and what happens when one agent fails or is compromised.

NanoClaw’s product design is built around that kind of orchestration.
The platform sits on top of Claude Code and adds persistent memory, scheduled tasks, messaging integrations and routing logic so agents can be assigned work across channels such as WhatsApp, Telegram, Slack and Discord. The release says this can all be configured from a phone, without writing custom agent code, while each agent remains isolated inside its own container runtime.

Cohen said one practical goal of the Docker integration is to make that deployment model easier to adopt. “People will be able to go to the NanoClaw GitHub, clone the repository, and run a single command,” he said. “That will get their Docker Sandbox set up running NanoClaw.”

That ease of setup matters because many enterprise AI deployments still fail at the point where promising demos have to become stable systems. Security features that are too hard to deploy or maintain often end up bypassed. A packaging model that lowers friction without weakening boundaries is more likely to survive internal adoption.

An open-source partnership with strategic weight

The partnership is also notable for what it is not. It is not being positioned as an exclusive commercial alliance or a financially engineered enterprise bundle.

“There’s no money involved,” Cavage said. “We found this through the foundation developer community. NanoClaw is open source, and Docker has a long history in open source.”

That may strengthen the announcement rather than weaken it. In infrastructure, the most credible integrations often emerge because two systems fit technically before they fit commercially. Cohen said the relationship began when a Docker developer advocate got NanoClaw running in Docker Sandboxes and demonstrated that the combination worked.

“We were able to put NanoClaw into Docker Sandboxes without making any architecture changes to NanoClaw,” Cohen said. “It just works, because we had a vision of how agents should be deployed and isolated, and Docker was thinking about the same security concerns and arrived at the same design.”

For enterprise buyers, that origin story signals that the integration was not forced into existence by a go-to-market arrangement. It suggests genuine architectural compatibility.

Docker is also careful not to cast NanoClaw as the only framework it will support. Cavage said the company plans to work broadly across the ecosystem, even as NanoClaw appears to be the first “claw” included in Docker’s official packaging. The implication is that Docker sees a wider market opportunity around secure agent runtime infrastructure, while NanoClaw gains a more recognizable enterprise foundation for its security posture.

The bigger story: infrastructure catching up to agents

The deeper significance of this announcement is that it shifts attention from model capability to runtime design. That may be where the real enterprise competition is heading.

The AI industry has spent the last two years proving that models can reason, code and orchestrate tasks with growing sophistication. The next phase is proving that these systems can be deployed in ways security teams, infrastructure leaders and compliance owners can live with.

NanoClaw has argued from the start that agent security cannot be bolted on at the application layer. Docker is now making a parallel argument from the runtime side. “The world is going to need a different set of infrastructure to catch up to what agents and AI demand,” Cavage said. “They’re clearly going to get more and more autonomous.”

That could turn out to be the central story here.
Enterprises do not just need more capable agents. They need better boxes to put them in.

For organizations experimenting with AI agents today, the NanoClaw-Docker integration offers a concrete picture of what that box might look like: open-source orchestration on top, MicroVM-backed isolation underneath, and a deployment model designed around containment rather than trust.

In that sense, this is more than a product integration. It is an early blueprint for how enterprise agent infrastructure may evolve: less emphasis on unconstrained autonomy, more emphasis on bounded autonomy that can survive contact with real production systems.
Y Combinator-backed Random Labs launches Slate V1, claiming the first ‘swarm-native’ coding agent
The software engineering world is currently wrestling with a fundamental paradox of the AI era: as models become more capable, the “systems problem” of managing them has become the primary bottleneck to real-world productivity. While a developer might have access to the raw intelligence of a frontier model, that intelligence often degrades the moment a task requires a long horizon or a deep context window. But help appears to be on the way: San Francisco-based, Y Combinator-backed startup Random Labs has officially launched Slate V1, described as the industry’s first “swarm-native” autonomous coding agent designed to execute massively parallel, complex engineering tasks.

Emerging from an open beta, the tool uses a “dynamic pruning algorithm” to maintain context in large codebases while scaling output to enterprise complexity. Co-founded by Kiran and Mihir Chintawar in 2024, the company aims to bridge the global engineering shortage by positioning Slate as a collaborative tool for the “next 20 million engineers” rather than a replacement for human developers.

With the release of Slate V1, the team at Random Labs is attempting to architect a way out of this bind by introducing the first “swarm-native” agentic coding environment. Slate is not merely a wrapper or a chatbot with file access; it is an implementation of a “hive mind” philosophy designed to scale agentic work with the complexity of a human organization. By leveraging a novel architectural primitive called Thread Weaving, Slate moves beyond the rigid task trees and lossy compaction methods that have defined the first generation of AI coding assistants.

Strategy: Action space

At the heart of Slate’s effectiveness is a deep engagement with Recursive Language Models (RLMs). In a traditional setup, an agent might be asked to “fix a bug,” a prompt that forces the model to juggle high-level strategy and low-level execution simultaneously. Random Labs identifies this as a failure to tap into “Knowledge Overhang” — the latent intelligence a model possesses but cannot effectively access when it is tactically overwhelmed.

Slate solves this by using a central orchestration thread that essentially “programs in action space.” This orchestrator doesn’t write the code directly; instead, it uses a TypeScript-based DSL to dispatch parallel worker threads to handle specific, bounded tasks. This creates a clear separation between the “kernel” — which manages the execution graph and maintains strategic alignment — and the worker “processes” that execute tactical operations in the terminal. By mapping onto an OS-style framework, inspired by Andrej Karpathy’s “LLM OS” concept, Slate is able to treat the limited context window of a model as precious RAM, actively and intelligently managing what is retained and what is discarded.

Episodic memory and the swarm

The true innovation of the “Thread Weaving” approach lies in how it handles memory. Most agents today rely on “compaction,” which is often just a fancy term for lossy compression that risks dropping critical project state. Slate instead generates “episodes.” When a worker thread completes a task, it doesn’t return a sprawling transcript of every failed attempt; it returns a compressed summary of the successful tool calls and conclusions.

Because these episodes share context directly with the orchestrator rather than relying on brittle message passing, the system maintains a “swarm” intelligence. This architecture allows for massive parallelism.
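As a rough sketch of that orchestrator-and-episodes pattern (illustrative only: this is not Slate’s TypeScript DSL, and every name and type below is invented), the loop looks something like this:

```python
# Illustrative sketch of an orchestrator/worker "episode" loop.
# Not Slate's implementation or API; names, types and structure are invented.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

@dataclass
class Episode:
    task: str
    summary: str                                   # compressed conclusions, not the full transcript
    artifacts: list = field(default_factory=list)  # e.g. files touched, successful tool calls

def run_worker(task: str) -> Episode:
    # A real worker would drive a model plus tools (terminal, editor, search),
    # then compress its transcript down to what the orchestrator needs to know.
    return Episode(task=task, summary=f"completed: {task}")

def orchestrate(goal: str, subtasks: list) -> list:
    # The orchestrator is the "kernel": it holds the goal and the execution graph,
    # dispatches bounded tactical tasks in parallel, and keeps only the compact
    # episodes in its context window -- treating that window like scarce RAM.
    print(f"orchestrating: {goal}")
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        episodes = list(pool.map(run_worker, subtasks))
    return episodes

episodes = orchestrate(
    goal="refactor the payment module",
    subtasks=["map call sites", "update interfaces", "run the test suite"],
)
for ep in episodes:
    print(ep.summary)
```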
A developer can have Claude Sonnet orchestrating a complex refactor while GPT-5.4 executes code, and GLM 5 — a favorite for its agentic search capabilities — simultaneously researches library documentation in the background. It’s a similar approach to the one Perplexity takes with its new Computer multi-model agent. By selecting the “right model for the job,” Slate ensures that users aren’t overspending on intelligence for simple tactical steps while still benefiting from the strategic depth of the world’s most powerful models.

The business of autonomy

From a commercial perspective, Random Labs is navigating the early beta period with a mix of transparency and strategic ambiguity. While the company has not yet published a fixed-price subscription sheet, the Slate CLI documentation confirms a shift toward a usage-based credit model. Commands like /usage and /billing allow users to monitor their credit burn in real time, and the inclusion of organization-level billing toggles suggests a clear focus on professional engineering teams rather than solo hobbyists.

There is also a significant play toward integration. Random Labs recently announced that direct support for OpenAI’s Codex and Anthropic’s Claude Code is slated for release next week. This suggests that Slate isn’t trying to compete with these models’ native interfaces, but rather to act as the superior orchestration layer that allows engineers to use all of them at once, safely and cost-effectively.

Architecturally, the system is designed to maximize caching through subthread reuse, a “novel context engineering” trick that the team claims keeps the swarm approach from becoming a financial burden for users.

Stability

Perhaps the most compelling argument for the Slate architecture is its stability. In internal testing, an early version of this threading system managed to pass 2/3 of the tests on the make-mips-interpreter task within the Terminal Bench 2.0 suite. This is a task where even the newest frontier models, like Opus 4.6, often succeed less than 20% of the time when used in standard, non-orchestrated harnesses.

This success in a “mutated” or changing environment is what separates a tool from a partner. According to Random Labs’ documentation, one fintech founder in NYC described Slate as their “best debugging tool,” a sentiment that echoes the broader goal of Random Labs: to build agents that don’t just complete a prompt, but scale like an organization. As the industry moves past simple “chat with your code” interfaces, the “Thread Weaving” of Slate V1 offers a glimpse into a future where the primary role of the human engineer is to direct a hive mind of specialized models, each working in concert to solve the long-horizon problems of modern software.
Agents need vector search more than RAG ever did
What’s the role of vector databases in the agentic AI world? That’s a question that organizations have been coming to terms with in recent months.
The narrative had real momentum. As large language models scaled to million-token context windows, a credible argument circulated among enterprise architects: purpose-built vector search was a stopgap, not infrastructure. Agentic memory would absorb the retrieval problem. Vector databases were a RAG-era artifact.

The production evidence is running the other way.

Qdrant, the Berlin-based open source vector search company, announced a $50 million Series B on Thursday, two years after a $28 million Series A. The timing is not incidental. The company is also shipping version 1.17 of its platform. Together, they reflect a specific argument: The retrieval problem did not shrink when agents arrived. It scaled up and got harder.

“Humans make a few queries every few minutes,” Andre Zayarni, Qdrant’s CEO and co-founder, told VentureBeat. “Agents make hundreds or even thousands of queries per second, just gathering information to be able to make decisions.”

That shift changes the infrastructure requirements in ways that RAG-era deployments were never designed to handle.

Why agents need a retrieval layer that memory can’t replace

Agents operate on information they were never trained on: proprietary enterprise data, current information, millions of documents that change continuously. Context windows manage session state. They don’t provide high-recall search across that data, maintain retrieval quality as it changes, or sustain the query volumes autonomous decision-making generates.

“The majority of AI memory frameworks out there are using some kind of vector storage,” Zayarni said. The implication is direct: even the tools positioned as memory alternatives rely on retrieval infrastructure underneath.

Three failure modes surface when that retrieval layer isn’t purpose-built for the load. At document scale, a missed result is not a latency problem — it is a quality-of-decision problem that compounds across every retrieval pass in a single agent turn. Under write load, relevance degrades because newly ingested data sits in unoptimized segments before indexing catches up, making searches over the freshest data slower and less accurate precisely when current information matters most. Across distributed infrastructure, a single slow replica pushes latency across every parallel tool call in an agent turn — a delay a human user absorbs as inconvenience but an autonomous agent cannot.

Qdrant’s 1.17 release addresses each directly. A relevance feedback query improves recall by adjusting similarity scoring on the next retrieval pass using lightweight model-generated signals, without retraining the embedding model. A delayed fan-out feature queries a second replica when the first exceeds a configurable latency threshold. A new cluster-wide telemetry API replaces node-by-node troubleshooting with a single view across the entire cluster.

Why Qdrant doesn’t want to be called a vector database anymore

Nearly every major database now supports vectors as a data type — from hyperscalers to traditional relational systems. That shift has changed the competitive question. The data type is now table stakes. What remains specialized is retrieval quality at production scale.

That distinction is why Zayarni no longer wants Qdrant called a vector database.

“We’re building an information retrieval layer for the AI age,” he said. “Databases are for storing user data. If the quality of search results matters, you need a search engine.”

His advice for teams starting out: use whatever vector support is already in your stack.
The teams that migrate to purpose-built retrieval do so when scale forces the issue.
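In practice, that migration path tends to look like the sketch below: the same nearest-neighbor lookup, written first against vector support inside an existing Postgres database (via the pgvector extension), then against a dedicated engine such as Qdrant. This is a minimal illustration; the table, collection and connection details are invented, and the embeddings are assumed to be computed elsewhere.

```python
# Minimal sketch: the same top-10 similarity lookup, first on Postgres + pgvector,
# then on Qdrant. Connection strings, table and collection names are placeholders.
query_embedding = [0.1] * 768  # placeholder; a real system embeds the query text here

# Starting point: vector support inside the existing stack (pgvector).
import psycopg2

conn = psycopg2.connect("dbname=app")
with conn.cursor() as cur:
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    cur.execute(
        "SELECT id, title FROM documents ORDER BY embedding <-> %s::vector LIMIT 10",
        (vector_literal,),
    )
    rows = cur.fetchall()

# Later: the same lookup against a dedicated engine (Qdrant's Python client).
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
hits = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=10,
)
```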
“We see companies come to us every day saying they started with Postgres and thought it was good enough — and it’s not.”

Qdrant’s architecture, written in Rust, gives it memory efficiency and low-level performance control that higher-level languages don’t match at the same cost. The open source foundation compounds that advantage — community feedback and developer adoption are what allow a company at Qdrant’s scale to compete with vendors that have far larger engineering resources.
“Without it, we wouldn’t be where we are right now at all,” Zayarni said.

How two production teams found the limits of general-purpose databases

The companies building production AI systems on Qdrant are making the same argument from different directions: agents need a retrieval layer, and conversational or contextual memory is not a substitute for it.

GlassDollar helps enterprises including Siemens and Mahle evaluate startups. Search is the core product: a user describes a need in natural language and gets back a ranked shortlist from a corpus of millions of companies. The architecture runs query expansion on every request – a single prompt fans out into multiple parallel queries, each retrieving candidates from a different angle, before results are combined and re-ranked. That is an agentic retrieval pattern, not a RAG pattern, and it requires purpose-built search infrastructure to sustain it at volume.

The company migrated from Elasticsearch as it scaled toward 10 million indexed documents. After moving to Qdrant it cut infrastructure costs by roughly 40%, dropped a keyword-based compensation layer it had maintained to offset Elasticsearch’s relevance gaps, and saw a 3x increase in user engagement.

“We measure success by recall,” Kamen Kanev, GlassDollar’s head of product, told VentureBeat. “If the best companies aren’t in the results, nothing else matters. The user loses trust.”

Agentic memory and extended context windows aren’t enough to absorb the workload that GlassDollar needs, either. “That’s an infrastructure problem, not a conversation state management task,” Kanev said. “It’s not something you solve by extending a context window.”

Another Qdrant user is &AI, which is building infrastructure for patent litigation. Its AI agent, Andy, runs semantic search across hundreds of millions of documents spanning decades and multiple jurisdictions. Patent attorneys will not act on AI-generated legal text, which means every result the agent surfaces has to be grounded in a real document.

“Our whole architecture is designed to minimize hallucination risk by making retrieval the core primitive, not generation,” Herbie Turner, &AI’s founder and CTO, told VentureBeat.

For &AI, the agent layer and the retrieval layer are distinct by design. “Andy, our patent agent, is built on top of Qdrant,” Turner said. “The agent is the interface. The vector database is the ground truth.”

Three signals it’s time to move off your current setup

The practical starting point: use whatever vector capability is already in your stack. The evaluation question isn’t whether to add vector search — it’s when your current setup stops being adequate. Three signals mark that point: retrieval quality is directly tied to business outcomes; query patterns involve expansion, multi-stage re-ranking, or parallel tool calls; or data volume crosses into the tens of millions of documents.

At that point the evaluation shifts to operational questions: how much visibility does your current setup give you into what’s happening across a distributed cluster, and how much performance headroom does it have when agent query volumes increase.

“There’s a lot of noise right now about what replaces the retrieval layer,” Kanev said. “But for anyone building a product where retrieval quality is the product, where missing a result has real business consequences, you need dedicated search infrastructure.”
The team behind continuous batching says your idle GPUs should be running inference, not sitting dark
Every GPU cluster has dead time. Training jobs finish, workloads shift and hardware sits dark while power and cooling costs keep running. For neocloud operators, those empty cycles are lost margin.

The obvious workaround is spot GPU markets — renting spare capacity to whoever needs it. But spot instances mean the cloud vendor is still the one doing the renting, and engineers buying that capacity are still paying for raw compute with no inference stack attached.

FriendliAI’s answer is different: run inference directly on the unused hardware, optimize for token throughput, and split the revenue with the operator. FriendliAI was founded by Byung-Gon Chun, the researcher whose paper on continuous batching became foundational to vLLM, the open source inference engine used across most production deployments today.

Chun spent over a decade as a professor at Seoul National University studying efficient execution of machine learning models at scale. That research produced a paper called Orca, which introduced continuous batching. The technique processes inference requests dynamically rather than waiting to fill a fixed batch before executing. It is now industry standard and is the core mechanism inside vLLM.

This week, FriendliAI is launching a new platform called InferenceSense. Just as publishers use Google AdSense to monetize unsold ad inventory, neocloud operators can use InferenceSense to fill unused GPU cycles with paid AI inference workloads and collect a share of the token revenue. The operator’s own jobs always take priority — the moment a scheduler reclaims a GPU, InferenceSense yields.

“What we are providing is that instead of letting GPUs be idle, by running inferences they can monetize those idle GPUs,” Chun told VentureBeat.

How a Seoul National University lab built the engine inside vLLM

Chun founded FriendliAI in 2021, before most of the industry had shifted attention from training to inference. The company’s primary product is a dedicated inference endpoint service for AI startups and enterprises running open-weight models. FriendliAI also appears as a deployment option on Hugging Face alongside Azure, AWS and GCP, and currently supports more than 500,000 open-weight models from the platform.

InferenceSense now extends that inference engine to the capacity problem GPU operators face between workloads.

How it works

InferenceSense runs on top of Kubernetes, which most neocloud operators are already using for resource orchestration. An operator allocates a pool of GPUs to a Kubernetes cluster managed by FriendliAI — declaring which nodes are available and under what conditions they can be reclaimed. Idle detection runs through Kubernetes itself.

“We have our own orchestrator that runs on the GPUs of these neocloud — or just cloud — vendors,” Chun said. “We definitely take advantage of Kubernetes, but the software running on top is a really highly optimized inference stack.”

When GPUs are unused, InferenceSense spins up isolated containers serving paid inference workloads on open-weight models including DeepSeek, Qwen, Kimi, GLM and MiniMax. When the operator’s scheduler needs hardware back, the inference workloads are preempted and GPUs are returned. FriendliAI says the handoff happens within seconds.

Demand is aggregated through FriendliAI’s direct clients and through inference aggregators like OpenRouter. The operator supplies the capacity; FriendliAI handles the demand pipeline, model optimization and serving stack. There are no upfront fees and no minimum commitments.
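Idle detection of this kind can be expressed directly against the Kubernetes API. As a rough, illustrative sketch (this is not FriendliAI’s orchestrator; it simply compares each node’s allocatable GPUs against what running pods have requested):

```python
# Illustrative only: find nodes whose GPUs are not currently requested by any pod.
# Assumes kubeconfig access and GPUs exposed as the "nvidia.com/gpu" resource.
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

# Sum the GPUs requested by running pods, grouped by the node they are scheduled on.
requested = defaultdict(int)
for pod in v1.list_pod_for_all_namespaces().items:
    if pod.spec.node_name and pod.status.phase == "Running":
        for container in pod.spec.containers:
            requests = (container.resources.requests or {}) if container.resources else {}
            requested[pod.spec.node_name] += int(requests.get("nvidia.com/gpu", 0))

# Compare against each node's allocatable GPU count to find idle capacity.
for node in v1.list_node().items:
    allocatable = int(node.status.allocatable.get("nvidia.com/gpu", 0))
    idle = allocatable - requested[node.metadata.name]
    if idle > 0:
        print(f"{node.metadata.name}: {idle} GPU(s) idle and eligible for inference work")
```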
A real-time dashboard shows operators which models are running, tokens being processed and revenue accrued.

Why token throughput beats raw capacity rental

Spot GPU markets from providers like CoreWeave, Lambda Labs and RunPod involve the cloud vendor renting out its own hardware to a third party. InferenceSense runs on hardware the neocloud operator already owns, with the operator defining which nodes participate and setting scheduling agreements with FriendliAI in advance. The distinction matters: spot markets monetize capacity, InferenceSense monetizes tokens.

Token throughput per GPU-hour determines how much InferenceSense can actually earn during unused windows. FriendliAI claims its engine delivers two to three times the throughput of a standard vLLM deployment, though Chun notes the figure varies by workload type.
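To make the economics concrete, a back-of-the-envelope sketch helps. Every number below is a hypothetical placeholder, not a figure from FriendliAI or any operator; the point is only that revenue scales linearly with tokens served per idle GPU-hour.

```python
# Hypothetical back-of-the-envelope estimate of idle-cycle inference revenue.
# All inputs are invented placeholders, not vendor figures.
tokens_per_second = 5_000          # sustained per-GPU throughput for a mid-size model
price_per_million_tokens = 0.50    # blended $ charged per 1M tokens served
operator_revenue_share = 0.5       # fraction of token revenue paid to the operator
idle_hours_per_day = 6             # hours a given GPU sits unused

tokens_per_hour = tokens_per_second * 3600
gross_per_gpu_hour = tokens_per_hour / 1_000_000 * price_per_million_tokens
operator_per_gpu_day = gross_per_gpu_hour * operator_revenue_share * idle_hours_per_day

print(f"Gross per idle GPU-hour: ${gross_per_gpu_hour:.2f}")           # $9.00 with these inputs
print(f"Operator share per GPU per day: ${operator_per_gpu_day:.2f}")  # $27.00 with these inputs

# A 2-3x throughput advantage multiplies gross_per_gpu_hour (and the operator's
# cut) by the same factor, which is why tokens per GPU-hour is the metric that matters.
```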
Most competing inference stacks are built on Python-based open source frameworks. FriendliAI’s engine is written in C++ and uses custom GPU kernels rather than Nvidia’s cuDNN library. The company has built its own model representation layer for partitioning and executing models across hardware, with its own implementations of speculative decoding, quantization and KV-cache management.

Since FriendliAI’s engine processes more tokens per GPU-hour than a standard vLLM stack, operators should generate more revenue per unused cycle than they could by standing up their own inference service.

What AI engineers evaluating inference costs should watch

For AI engineers evaluating where to run inference workloads, the neocloud versus hyperscaler decision has typically come down to price and availability. InferenceSense adds a new consideration: if neoclouds can monetize idle capacity through inference, they have more economic incentive to keep token prices competitive.

That is not a reason to change infrastructure decisions today — it is still early. But engineers tracking total inference cost should watch whether neocloud adoption of platforms like InferenceSense puts downward pressure on API pricing for models like DeepSeek and Qwen over the next 12 months.
“When we have more efficient suppliers, the overall cost will go down,” Chun said. “With InferenceSense we can contribute to making those models cheaper.”