Long-horizon reasoning exposes a core weakness in AI agents: context windows fill up fast, and retrieval pipelines return noise instead of signal.

To solve this, researchers at the National University of Singapore developed MRAgent, a framework that abandons the static “retrieve-then-reason” approach. Instead, it uses a mechanism that allows an agent to dynamically develop its memory based on accumulating evidence. This multi-step memory reconstruction is integrated into the reasoning process of the large language model (LLM). While not the only framework in this space, MRAgent significantly reduces token consumption and runtime costs compared to other agentic memory management approaches.

The limits of passive retrieval in long-horizon tasks

In classic retrieval pipelines, documents are retrieved through vector search or graph traversal and passed on to an LLM for reasoning. This passive approach fails because it cannot combine reasoning with memory access, creating three major bottlenecks:

- These systems cannot revise their retrieval strategy mid-reasoning. If an agent fetches a document and discovers a crucial missing cue, such as a specific date or person, it has no way to issue a new query based on that finding.
- Fixed similarity scores and predefined graph expansions return surface-level matches that flood the LLM’s context window with irrelevant noise, degrading reasoning.
- Current systems rely heavily on pre-constructed structures such as top-k results and static relevance functions, limiting the flexibility required to scale across unpredictable, long-horizon user interactions.

The researchers argue that to overcome these limitations, developers must shift toward an “active and associative reconstruction process,” a concept inspired by cognitive neuroscience. Under this paradigm, memory recall unfolds sequentially rather than operating as a passive read-out of a static database.
The system starts with small, specific triggers from the user’s prompt, such as a person’s name, an action, or a place. These initial hints point to connecting concepts or categories instead of massive blocks of text. By following these metadata stepping stones, the agent gathers small pieces of evidence one by one. It uses each new piece of information to guide its next step until it successfully pieces together the full, accurate story.

How MRAgent implements active memory reconstruction

Instead of viewing memory as a static database, MRAgent (Memory Reasoning Architecture for LLM Agents) treats it as an interactive environment. When processing a complex query, the agent uses the backbone LLM’s reasoning abilities to explore multiple candidate retrieval paths across a structured memory graph. At each step, the LLM evaluates the intermediate evidence it has gathered and uses it to iteratively optimize its search. It infers new search constraints, pursues the paths with the best information, and prunes irrelevant branches. This allows MRAgent to piece together deeply buried information without filling the LLM’s context with noise.

To make this active exploration computationally efficient and scalable, the framework organizes its database using a “Cue-Tag-Content” mechanism. This operates as a multi-layered associative graph with three node types:

Cues: Fine-grained keywords, such as entities or contextual attributes extracted from user interactions.

Content: The actual stored memory units. These are divided into multi-granular layers, such as episodic memory for concrete events and semantic memory for stable facts and user preferences.

Tags: Semantic bridges that summarize the relational associations between specific Cues and Content.

This structure enables a highly efficient two-stage retrieval process. The LLM first navigates from Cues to candidate Tags.
Because Tags explicitly expose the semantic relationships and structural associations of the data, the agent evaluates these short summaries to judge their relevance. The LLM identifies promising traversal paths and discards irrelevant branches before spending compute and prompt tokens to access the detailed, heavy memory contents.

For example, a user might ask an AI agent, “How did Nate use the prize money when he won his third video game tournament?”

- MRAgent first extracts fine-grained starting cues from the prompt, such as “Nate,” “video game tournament,” and “win.”
- The agent maps these initial cues to the memory graph and looks at the available associative Tags connected to them. The agent sees tags like “Tournament Victory” and “Tournament Participation.” Since it is only concerned with what the person did after they won the championship, MRAgent drops the tournament participation tag and pursues the victory tag.
- The agent retrieves the episodic content linked to the chosen Cue-Tag pair: three distinct memory episodes where Nate won a tournament.
- MRAgent looks at the three memories, decides one of them in particular is relevant to the query, and discards the other two.
- With this information, it updates its cues and starts another round of discovery and pruning. From the new episodic memory it has retrieved, the agent adds “tournament earnings” to its cues and uses that to traverse new tags and home in on new memories. It repeats this process until it gathers enough information to answer the query, which could be something like “Nate saved the money.”

MRAgent performance on industry benchmarks

MRAgent operates alongside several other frameworks addressing agentic memory building. Alternatives include A-MEM, a graph-based agentic memory framework, and MemoryOS, a hierarchical memory framework. Other persistent memory frameworks include LangMem and Mem0.

The researchers tested MRAgent on the LoCoMo and LongMemEval industry benchmarks.
These test the abilities of agents to resolve queries on long-horizon tasks and conversations across dozens of sessions and hundreds of turns of dialogue. The backbone models used were Gemini 2.5 Flash and Claude Sonnet 4.5. The system was tested against standard RAG, A-MEM, MemoryOS, LangMem, and Mem0. MRAgent consistently outperformed every baseline across both models and all question types by a significant margin.

However, for enterprise developers, the most critical metric is often computational cost. In the LongMemEval tests, MRAgent slashed prompt token consumption to just 118k per sample. By comparison, A-MEM consumed 632k tokens, and LangMem burned through 3.26 million tokens per query. MRAgent also effectively halved the runtime compared to A-MEM, dropping from 1,122 seconds to 586 seconds.

What makes MRAgent efficient in practice is its on-demand behavior. Evaluating tags and pruning irrelevant paths before retrieval saves money and context space. Furthermore, the system evaluates its accumulated context and decides on its own when to stop searching, which avoids redundant data exploration.

Implementation and development catch

While MRAgent is highly effective, the Cue-Tag-Content structure needs to be prepared before the agent can query it. Developers must figure out how to architect the underlying memory database to enable the LLM to efficiently navigate associative items and prune irrelevant paths without exploding compute costs.

Fortunately, developers do not have to manually label or structure this data. The authors designed MRAgent with an automated distillation pipeline that uses LLMs to process raw interaction histories and automatically populate the memory graph.
For a developer, the job is to implement and orchestrate this automated ingestion pipeline, rather than manually tag data.

You need to set up a background job or streaming pipeline that passes raw user interactions through prompt templates to extract this metadata before storing it in your graph database.

However, the authors emphasize that this is a lightweight construction phase and MRAgent intentionally keeps ingestion simple. The authors have released the code on GitHub.
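
To make the Cue-Tag-Content idea concrete, here is a minimal Python sketch of the two-stage traversal described above. It is not the authors' implementation: the `MemoryGraph` layout, the `reconstruct` function, and the three judge callbacks are hypothetical names, the LLM's relevance judgments are replaced with simple keyword checks, and the stored memories are toy data modeled on the Nate example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class MemoryGraph:
    """Toy Cue-Tag-Content graph (hypothetical layout, not MRAgent's schema)."""
    cue_tags: dict[str, set[str]] = field(default_factory=dict)
    tag_contents: dict[tuple[str, str], list[str]] = field(default_factory=dict)
    contents: dict[str, str] = field(default_factory=dict)
    reads: int = 0  # how many full memory contents were actually opened

    def add(self, cue: str, tag: str, content_id: str, text: str) -> None:
        self.cue_tags.setdefault(cue, set()).add(tag)
        self.tag_contents.setdefault((cue, tag), []).append(content_id)
        self.contents[content_id] = text

    def read(self, content_id: str) -> str:
        self.reads += 1
        return self.contents[content_id]


def reconstruct(
    graph: MemoryGraph,
    cues: Iterable[str],
    keep_tag: Callable[[str], bool],        # stand-in for the LLM judging a Tag
    keep_content: Callable[[str], bool],    # stand-in for the LLM judging Content
    next_cues: Callable[[str], list[str]],  # stand-in for the LLM proposing new cues
    max_rounds: int = 3,
) -> list[str]:
    evidence: list[str] = []
    visited_cues: set[str] = set()
    seen_contents: set[str] = set()
    frontier = list(cues)
    for _ in range(max_rounds):
        new_frontier: list[str] = []
        for cue in frontier:
            if cue in visited_cues:
                continue
            visited_cues.add(cue)
            for tag in sorted(graph.cue_tags.get(cue, ())):
                # Stage 1: prune on the short Tag before opening any Content.
                if not keep_tag(tag):
                    continue
                for content_id in graph.tag_contents[(cue, tag)]:
                    if content_id in seen_contents:
                        continue
                    seen_contents.add(content_id)
                    # Stage 2: open the Content and keep only what is relevant.
                    text = graph.read(content_id)
                    if keep_content(text):
                        evidence.append(text)
                        new_frontier.extend(next_cues(text))
        if not new_frontier:  # nothing new to follow, so stop searching
            break
        frontier = new_frontier
    return evidence


# Toy data for illustration only.
graph = MemoryGraph()
graph.add("Nate", "Tournament Victory", "e1", "Nate won his first tournament.")
graph.add("Nate", "Tournament Victory", "e2", "Nate won his second tournament.")
graph.add("Nate", "Tournament Victory", "e3",
          "Nate won his third tournament and received tournament earnings.")
graph.add("Nate", "Tournament Participation", "e4", "Nate entered a tournament and lost.")
graph.add("tournament earnings", "Spending", "e5", "Nate saved the tournament earnings.")

evidence = reconstruct(
    graph,
    cues=["Nate"],
    keep_tag=lambda tag: tag in {"Tournament Victory", "Spending"},
    keep_content=lambda text: "third" in text or "saved" in text,
    next_cues=lambda text: ["tournament earnings"] if "tournament earnings" in text else [],
)
print(evidence)
print("contents opened:", graph.reads)
```

In this toy run the “Tournament Participation” branch is dropped at the Tag stage, so its memory is never opened. In a real system each callback would be an LLM call, and the savings would depend on how well the Tags summarize the underlying Content.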
Venture Beat
Autonomous security agents need complete data. Here’s how to check if yours is ready.
An endpoint agent cannot report its own absence. The 2026 Axonius Actionability Report, conducted with the Ponemon Institute and surveying 662 IT and security professionals, put a number on a gap SOC teams have worked around for years. Across the Axonius customer base, 12.7% of devices in a 298,000-device median inventory are missing their expected security agent.

If a device has no agent, no management console shows it. If a CMDB record is stale, no reconciliation flags it. An employee who installed Claude Enterprise outside procurement created a SaaS workspace, identity surface, and API-token footprint that endpoint telemetry alone will not reliably inventory. The coverage percentage on the EDR dashboard is structurally incomplete because the reporting mechanism cannot see what it does not cover.

That gap matters more now than it did six months ago. SOC and XDR vendors are pushing more autonomous investigation and remediation into production. Those agents will query the same dashboards, trust the same coverage percentages, and act on the same blind spots human analysts learned to work around. A human analyst second-guesses a 98% coverage number. An autonomous agent treats it as ground truth and moves at machine speed.

Three independent signals converged on the same gap

Gravitee’s 2026 survey of 900-plus executives found 88% reported confirmed or suspected AI-related incidents, and only 14.4% sent agents live with full security approval. The Axonius/Ponemon report found 52% of respondents would let autonomous agents act on recommendations, while 63% said the underlying data lacks important information. The CSA’s Agentic Trust Framework requires verified data governance before agents act on any finding.

Mike Riemer, Field CISO at Ivanti, said that known vulnerabilities on Azure’s honeypot networks are now attacked in under 90 seconds. “Traditional security measures continue to work,” Riemer told VentureBeat.
The caveat is that those measures only protect what they can see. An EDR agent deployed across 87.3% of the device inventory leaves the remaining 12.7% outside that agent’s telemetry, policy enforcement, and detection logic.

Exclusive deployment data quantifies the scale

Joe Diamond, CEO of Axonius, told VentureBeat that the average CISO sees roughly 50% of what is actually on the network. “Say 50% of their environment is sitting in dark matter,” Diamond said. “They don’t know what it is, or where it is, or who has access to it, if it’s secure, if it’s not secure.”

Deployment data from more than 900 Axonius customers confirms those numbers. TransUnion went from 70% to 99% endpoint coverage after out-of-band verification. Western Union went from 85% to 99% by consolidating data from 38 tools and cutting manual workload by half. Lumen discovered 1.1 million assets where the CMDB showed 17,000. Applied to the 298,000-device median inventory, the 12.7% gap translates to roughly 37,000 unmanaged endpoints per organization sitting outside every policy, every patch cycle, and every detection rule.

Diamond pointed to Mythos, Anthropic’s frontier reasoning model, as a sign that machine-speed offensive capability will make any unknown asset far riskier than it is today. “People tend to have shiny object syndrome,” he said. “If you didn’t understand what 50% of your environment looked like from a traditional endpoint perspective, and you think you’re going to wind sprint to granular control and governance of AI, your program will fail.” Diamond called the broader AI shift “as big, if not bigger than the internet.”

Three approaches compete to close the gap

No single architecture solves the visibility problem today. Three approaches compete, each with named tradeoffs security teams should evaluate before procurement.

A dedicated integration layer uses bidirectional API adapters to build an always-current inventory.
Axonius runs 1,400-plus adapters and now discovers shadow Claude Enterprise installations via its Anthropic adapter (GA June 15). “We created a bidirectional API integration with all the IT systems and all the security controls to build an always up-to-date inventory of what the environment looks like,” Diamond told VentureBeat.

Platform-native EDR and XDR intelligence builds richer asset context inside the agent footprint. Depth within the agent footprint is the advantage. The limitation is structural. Platform-native intelligence is bounded by what the agent can see, and the gap the Ponemon report identified lives precisely where that visibility ends.

CMDB modernization requires continuous reconciliation against three or more independent telemetry sources. Only 13% of organizations reconcile daily, according to Axonius/Ponemon data. The remaining 87% operate on stale records that feed incorrect prioritization into any automated remediation pipeline.

EDR data readiness: Five gates before autonomous remediation

Before you let autonomous SOC agents close tickets or quarantine assets, this checklist tells you whether your EDR and asset data is solid enough to trust. It is vendor-agnostic, works with any EDR and CMDB, and gives you five pass/fail gates you can run in a single working session.

Risk area: Asset inventory delta
What the data shows: Ponemon: only 45% consolidate into a single view. Forrester TEI: 150% more assets than previously identified. Lumen: 17K in CMDB vs. 1.1M discovered.
Readiness threshold: Delta ≤10% between discovery, CMDB, and EDR agent count. Delta above 10% blocks automated remediation until reconciled.
Action to take now: Run API-based discovery against all segments. Diff against CMDB and EDR console count. Reconcile quarterly minimum.

Risk area: Unmanaged AI services
What the data shows: Gravitee: 88% confirmed or suspected AI incidents. Only 14.4% with full security approval. Anthropic adapter (GA June 15) discovers unmanaged Claude Enterprise installations.
Readiness threshold: No high-risk AI services outside approved procurement. Weekly SaaS discovery scans. Unmanaged high-risk instances trigger IR triage before exception review.
Action to take now: Deploy SaaS discovery or protocol-level adapters for AI service detection. Automate weekly scans. Route unmanaged instances to IR queue.

Risk area: CMDB record accuracy
What the data shows: Ponemon: only 13% reconcile daily (RSAC 2026). Brooks Running: 20% server discrepancy between console and independent discovery. Top remediation barriers: unclear prioritization, unclear ownership, inconsistent data.
Readiness threshold: ≥85% of records validated against 3+ independent telemetry sources. No stale or orphaned records in active remediation queue.
Action to take now: Cross-reference CMDB against cloud inventory, EDR telemetry, and IdP directory. Continuous reconciliation replaces annual audit cycles.

Risk area: Endpoint agent coverage gap
What the data shows: Ponemon: an agent cannot report its own absence (p. 8). TransUnion: 70% to 99% after out-of-band verification. RSAC 2026: 12.7% of 298K median devices missing expected agent.
Readiness threshold: ≥95% agent coverage verified via out-of-band discovery. Many CISOs set this as the minimum before allowing autonomous remediation. No self-reported-only metrics in board reports.
Action to take now: Run network-based or API-driven discovery against managed device list. Coverage below 95% blocks automated remediation scoping.

Risk area: Asset ownership mapping
What the data shows: Ponemon: 32% apply tags consistently. Only 51% assign ownership on new exposures (pp. 9, 16). TransUnion: 12K to 190K assets with ownership mapped.
Readiness threshold: Owner assigned within 24 hours. Tags consistent across cloud, EDR, CMDB. Three systems showing three owners = failure.
Action to take now: Automate ownership via cloud tags, IdP group membership, or CMDB metadata. Map asset, remediation, and business owner as separate fields.

Five questions to ask before allowing autonomous SOC action

- What independently verifies endpoint-agent coverage outside the EDR console?
- How does the SOC reconcile conflicts between EDR, CMDB, cloud inventory, IdP, and discovery tools?
- Can AI agents act on assets with unknown or disputed ownership?
- Can the system distinguish “not vulnerable” from “not visible”?
- What data-quality gate blocks autonomous remediation when coverage or ownership falls below threshold?

Board-ready risk framing

Kayne McGladrey, IEEE Senior Member, has confirmed the pattern across multiple published VentureBeat interviews. The structural gap in self-reported coverage is not new. What is new is that autonomous agents will act on it at machine speed without the institutional workarounds human analysts developed over years of experience. Diamond put the board-level stakes plainly in an April 2026 press statement: “Findings pile up because the data isn’t trusted, ownership isn’t clear, and entire asset classes aren’t even in the picture.”

The CSA’s Agentic Trust Framework requires that any agent promoted to a higher autonomy level must pass five gates, including demonstrated accuracy and a security audit. The EU AI Act’s Article 50 transparency obligations take effect August 2, 2026. The May 2026 Digital Omnibus pushed high-risk system obligations to December 2027, but organizations deploying agentic SOC agents on incomplete asset data face immediate operational risk that outpaces any regulatory timeline.

The board-ready sentence: Our EDR coverage reports are structurally incomplete because an endpoint agent cannot report its own absence, and we are verifying coverage through out-of-band discovery before deploying autonomous agents that would act on those reports at machine speed.

Security director playbook

Run out-of-band asset discovery this week. Compare results against your CMDB export and EDR console count.
If the delta exceeds 10%, halt automated remediation scoping until the gap is reconciled.

Deploy SaaS discovery for AI services. Employees install AI ahead of procurement, ahead of security. Weekly scans are the minimum. Route any unmanaged high-risk instance to your incident response queue for triage before exception review.

Map asset ownership to remediation responsibility. Ponemon found only 32% of organizations apply tags consistently. If three systems show three different owners for the same asset, automated remediation has no routing target. Fix the ownership layer before deploying agents that depend on it.

Kill self-reported-only coverage metrics. Any risk calculation or board report that relies on EDR console-reported coverage alone is built on data the reporting system cannot verify. Require out-of-band verification for every coverage number that informs a risk decision.
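
Three of the five gates above (inventory delta, out-of-band agent coverage, and ownership conflicts) can be expressed as a simple pass/fail check. The Python sketch below is illustrative only. The `Snapshot` type, the function names, and the asset IDs are hypothetical, and because the checklist does not define how the delta is calculated, the code assumes it is the spread between the largest and smallest of the three asset counts, relative to the largest.

```python
from __future__ import annotations

from dataclasses import dataclass

DELTA_MAX = 0.10     # checklist threshold: delta of 10% or less
COVERAGE_MIN = 0.95  # checklist threshold: at least 95% agent coverage


@dataclass(frozen=True)
class Snapshot:
    discovered: frozenset[str]  # asset IDs from out-of-band discovery
    cmdb: frozenset[str]        # asset IDs recorded in the CMDB
    edr: frozenset[str]         # asset IDs with a reporting EDR agent


def inventory_delta(s: Snapshot) -> float:
    # Assumption: delta = (largest count - smallest count) / largest count.
    counts = [len(s.discovered), len(s.cmdb), len(s.edr)]
    return (max(counts) - min(counts)) / max(counts) if max(counts) else 0.0


def agent_coverage(s: Snapshot) -> float:
    # Coverage is measured against discovery, not against the EDR console,
    # because the console cannot see devices that have no agent.
    if not s.discovered:
        return 0.0
    return len(s.discovered & s.edr) / len(s.discovered)


def missing_agents(s: Snapshot) -> frozenset[str]:
    return s.discovered - s.edr


def disputed_owners(owners: dict[str, dict[str, str]]) -> set[str]:
    # An asset fails if the systems that track it disagree on its owner.
    return {asset for asset, by_system in owners.items()
            if len(set(by_system.values())) != 1}


def evaluate_gates(s: Snapshot, owners: dict[str, dict[str, str]]) -> dict[str, bool]:
    return {
        "inventory_delta": inventory_delta(s) <= DELTA_MAX,
        "agent_coverage": agent_coverage(s) >= COVERAGE_MIN,
        "ownership": not disputed_owners(owners),
    }


# Made-up inventory: 100 discovered assets, 95 in the CMDB, 87 with an agent.
assets = [f"host-{i:03d}" for i in range(100)]
snapshot = Snapshot(frozenset(assets), frozenset(assets[:95]), frozenset(assets[:87]))
owners = {
    "host-000": {"cloud": "payments", "edr": "payments", "cmdb": "payments"},
    "host-001": {"cloud": "payments", "edr": "it-ops", "cmdb": "unknown"},
}

gates = evaluate_gates(snapshot, owners)
remediation_allowed = all(gates.values())
print(gates, "autonomous remediation allowed:", remediation_allowed)
```

With these made-up numbers, all three gates fail: the 13 hosts without an agent appear only in the discovery set, which is why the coverage check compares against discovery rather than the console's own count.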
OpenAI unveils GPT-5.6 Sol, Terra and Luna models — but only accessible to limited preview partners for now, per US Gov
OpenAI is announcing a limited preview of its next-generation GPT-5.6 model series today, introducing three distinct, capability-tiered models (Sol, Terra, and Luna) designed to re-engineer developer and enterprise workflows.

The initial rollout is available through the API and Codex to a narrow set of trusted partners and organizations. OpenAI previewed the models and release plans to the U.S. government following an executive order issued by President Donald J. Trump on June 2, 2026, which calls upon various federal agencies to collaborate on a process for benchmarking and assessing the capabilities of new AI models to ensure they are safe and appropriate for wide release.

That process remains underway (the order said it would take 30 days, so until July 2). OpenAI says in its release blog post that it “previewed our plans and the models’ capabilities ahead of today’s launch. At [the U.S. government’s] request, we are starting with a limited preview for a small group of trusted partners.”

The flagship model, GPT-5.6 Sol, is priced at $5.00 per million input tokens and $30.00 per million output tokens (the same as GPT-5.5), and OpenAI says it delivers a major performance gain for long-running coding, cybersecurity and agentic tasks.

However, this rollout marks a highly unusual chapter in AI deployment. Because OpenAI is coordinating its release framework with the White House ahead of a broader public launch, enterprise buyers must navigate a novel landscape of real-time safety interventions, mandatory compliance parameters, and structured token caching systems.

Technology: Deep Reasoning and the Multi-Agent Paradigm

The core architectural evolution of the GPT-5.6 series centers on how compute is allocated during inference. Rather than relying on instantaneous token generation, OpenAI introduces a new max reasoning effort mode, which explicitly grants the flagship Sol model extended time to reason through highly complex problems.
Compounding this is the debut of an ultra mode. This configuration expands past the structural boundaries of a single standalone model, instead deploying specialized “subagents” to divide, conquer, and accelerate multi-step, long-horizon projects. Data from initial evaluations indicates that this subagent coordination shifts the frontier for programmatic execution:

Command-Line Automation: On Terminal-Bench 2.1, which evaluates planning, tool usage, and iterative error correction in command-line environments, GPT-5.6 Sol (Ultra) achieves a state-of-the-art score of 91.91%. This edges out GPT-5.6 Sol (Max) at 88.76% and eclipses Claude Mythos 5 at 88%.

Professional Workflows: On Agent’s Last Exam, a benchmark spanning 55 professional domains to test long-running workflows, GPT-5.6 Sol is the only model to clear the 50% success threshold, scoring 50.9% in code mode while displaying superior token efficiency relative to preceding architectures.

Quantitative Biology: On GeneBench v1, which measures long-horizon genomics analysis, the flagship model systematically outperforms GPT-5.5 while consuming fewer total tokens across simulated latency periods.

Product: Durable Tiers and Prompt Caching Economics

OpenAI is codifying its product nomenclature into permanent capability tiers that will advance independently on their own cadences. This model family provides businesses with explicit options to balance intelligence against operational latency and financial overhead:

GPT-5.6 Sol (Flagship): Optimized for deep reasoning, heavy vulnerability research, and advanced multi-agent coordination ($5.00 input / $30.00 output per 1M tokens).
GPT-5.6 Terra (Balanced): Built for efficient, high-volume production workloads, Terra delivers competitive parity with the older GPT-5.5 flagship but is explicitly “2x cheaper” at $2.50 input and $15.00 output per million tokens.

GPT-5.6 Luna (Fast): Optimized for rapid, low-cost everyday utility pipelines, priced at $1.00 input and $6.00 output per million tokens.

Predictable Prompt Caching Mechanics

To help enterprises control the unpredictable cost curves of running agentic loops, the GPT-5.6 API introduces a revamped prompt caching protocol. Developers can now implement explicit cache breakpoints, backed by a guaranteed 30-minute minimum cache lifetime. Under this framework, initial cache writes carry a 1.25x premium over the model’s standard uncached input rate, but subsequent cache reads receive a steep 90% discount. For systems that routinely pass massive context windows or codebase definitions back into the model, this predictability is a critical financial guardrail.

Furthermore, for enterprise applications where latency is the primary barrier to adoption, OpenAI is launching GPT-5.6 Sol on Cerebras hardware this July. This infrastructure partnership claims processing speeds of up to 750 tokens per second, targeting specialized enterprise applications requiring real-time, frontier-grade reasoning.

Enterprise Implications: High Security and Algorithmic Friction

For corporate engineering, information security, and compliance teams, the deployment of GPT-5.6 requires a meticulous look at its security architecture. The models are accessible under a commercial enterprise API license, with open-source options completely off the table due to the dual-use risks inherent to their cyber capabilities. To achieve clearance for release, OpenAI dedicated roughly 700,000 A100e GPU hours solely to automated red-teaming.
This compute was allocated to discovering “universal jailbreaks,” systemic attack vectors designed to bypass safeguards across varied contexts, rather than single-prompt workarounds.

This massive testing phase feeds directly into a strict, multi-layered safeguard stack that operates in real time:

Model-Level Refusals: Hardcoded boundaries trained directly into the base weights to resist masked intent or adversarial obfuscation.

Real-Time Classifiers: Auxiliary systems that evaluate cyber and biological output token-by-token as it is generated.

Reasoning Review Pauses: If a potential high-risk violation is flagged mid-generation, the pipeline automatically pauses. A secondary, larger reasoning model reviews the context of the conversation; if verified as malicious, the output is withheld before it reaches the user endpoint.

Operational Friction for Dual-Use Security Work

This real-time safety stack introduces distinct operational hurdles for enterprise security teams. Because legitimate defensive work (such as code reviews, vulnerability discovery, patch engineering, and defensive testing) frequently utilizes the exact same code primitives as offensive exploits, OpenAI admits that its classifiers may regularly trigger false positives. During this preview period, enterprise developers should expect localized latency spikes, paused API generations, and intermittent request refusals.

Persistent flagging can trigger automated account-level reviews across historical conversations to evaluate whether an enterprise client is engaging in malicious behavior or standard security research. OpenAI is currently negotiating longer-term enterprise safety compliance controls, including customer-operated safety overrides and privacy-preserving detection mechanisms, to insulate corporate data from manual review pipelines.

Importantly, OpenAI notes that under testing, Sol remains optimized for defensive containment rather than offensive deployment.
In evaluations running against the Chromium and Firefox codebases, the model successfully isolated bugs and exploitation primitives but was unable to autonomously engineer a functional, full-chain exploit, keeping it safely below the organization’s “Cyber Critical” alert threshold.

The Geopolitics of the Phased Release

The broader rollout of the GPT-5.6 series reflects an escalating entanglement between frontier AI labs and national security protocols. The decision to limit initial access to a small circle of vetted partners whose details are shared with the U.S. government stems from direct coordination regarding the developing cyber Executive Order framework.

OpenAI has taken the unusual step of publicly critiquing this sovereign gatekeeping within its official product announcement documentation. The company states plainly: “We don’t believe this kind of government access process should become the long-term default. It keeps the best tools from users, developers, enterprises, cyber defenders, and global partners who need them.”

This tension highlights the precarious position of modern tech enterprises. While organizations can leverage unprecedented agentic efficiency and robust defensive patching capabilities, as demonstrated on benchmarks like ExploitGym and ExploitBench, they must also accept that access to premier tools remains subject to diplomatic and regulatory authorization. General availability across ChatGPT and the wider public API is expected to roll out incrementally over the coming weeks.
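
The prompt caching terms quoted earlier (a 1.25x premium on cache writes, a 90% discount on cache reads) lend themselves to a quick back-of-the-envelope check. The Python sketch below covers only the input-side cost of a shared prompt prefix and makes two assumptions that the announcement as quoted does not spell out: every call after the first hits the cache, and the discount applies to the standard uncached input rate. The function names and the 200,000-token, 20-call workload are made up for illustration.

```python
SOL_INPUT_RATE = 5.00        # USD per 1M input tokens, as quoted for GPT-5.6 Sol
CACHE_WRITE_MULTIPLIER = 1.25
CACHE_READ_MULTIPLIER = 0.10  # a 90% discount leaves 10% of the standard rate


def uncached_prefix_cost(prefix_tokens: int, calls: int, rate_per_m: float) -> float:
    """Cost of resending the same prefix on every call with no caching."""
    return calls * prefix_tokens / 1_000_000 * rate_per_m


def cached_prefix_cost(prefix_tokens: int, calls: int, rate_per_m: float) -> float:
    """Cost when the first call writes the cache and later calls read it.

    Assumes every later call lands inside the cache lifetime.
    """
    if calls <= 0:
        return 0.0
    per_call = prefix_tokens / 1_000_000 * rate_per_m
    write = per_call * CACHE_WRITE_MULTIPLIER
    reads = per_call * CACHE_READ_MULTIPLIER * (calls - 1)
    return write + reads


# Hypothetical agent loop: a 200,000-token prefix reused across 20 calls.
prefix_tokens, calls = 200_000, 20
uncached = uncached_prefix_cost(prefix_tokens, calls, SOL_INPUT_RATE)
cached = cached_prefix_cost(prefix_tokens, calls, SOL_INPUT_RATE)
savings = 1 - cached / uncached

print(f"uncached: ${uncached:.2f}")
print(f"cached:   ${cached:.2f}")
print(f"savings:  {savings:.1%}")
```

Under these assumptions the write premium is recovered on the second call, since one write plus one read costs 1.35x a single uncached pass, versus 2x for two uncached passes. Output tokens and any non-cached input are billed separately and are not modeled here.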
Most companies think they’re building a software factory. They’re actually just shipping bugs faster.
Industrialized factories changed how the world produced physical goods: more output, lower costs, faster than anything that came before. Now a similar shift is happening with software. LLMs have lowered the barrier to writing code, increased individual output, and pushed organizations to think about software development as a production system. The standard software development lifecycle and CI/CD practices that have held for decades won’t hold up under that pressure. That’s where the software factory comes in, and like physical factories, it needs more than speed to actually work.

The idea of a “software factory” started to solidify over the past year. Luca Rossi’s “The Era of the Software Factory” made the case plainly: AI is not just changing how fast people write code; it’s changing the whole production system around software. The concept can mean different things: a collection of coding agents and skills files; faster CI/CD; better review systems; or more automation around software delivery. A better frame is to think of it less as a tool category and more as a set of principles. A software factory can’t just be a loose collection of prompts, agents, and plugins. It needs a platform that defines how work moves through the system and how code is generated, reviewed, tested, traced, deployed, and improved when something goes wrong.

Otherwise all you’re doing is putting yet another one-off machine into an empty room and calling it a factory.

Why is this happening now?

There are a few forces all hitting at the same time.

Companies have always wanted more software than engineers can produce. That’s why tools like Excel exist: They often fill in the gap for a lot of the software that many companies wish they could make.

AI has also lowered the barrier of entry to creating code, and this is the part everyone focuses on. Code creation is now easier, though not always cheaper or better, as evidenced by many high-profile companies fretting over their high AI bills.
The barrier to writing functional code has effectively collapsed.

More importantly, a single engineer can generate more code than they could just a few years ago. That changes the bottleneck: it’s no longer “How fast can someone write this?” or even, in some cases, “Can someone understand how to code?” Instead it becomes, “Should this be written?” And beyond that, can we actually create end products that are durable and reliable and don’t just build tech debt? Or are we just putting out more AI slop faster than ever?

That’s where the danger lies.

The dangers of the modern software factory

All of this sounds great. Factories, after all, made production faster and more consistent. They made it possible to build more cars and products, less expensively, which led to more people being able to afford cars and products. Putting environmental impacts aside, you could argue this was positive.

But like many things in engineering, there are always tradeoffs, and in this case, there are new risks.

When you increase the output of one person with machinery, digital or otherwise, you also increase the mistakes that can be made either by the individual or the machinery. The speed at which code can now be put out is on an industrial scale. Even smaller organizations can suddenly have code bases ballooning up to the size of tech company code bases a decade ago.

The data is already showing problems. Faros AI found that while task throughput per developer is up 33.7% and PR merge rate is up 16.2%, the incidents-to-PR ratio has risen 242.7% and bugs per developer are up 54%. Google’s DORA research found that more AI adoption was actually associated with worse delivery stability.

As a fractional head of data, I’ve been brought in to fix these exact issues. In the past year alone, I’ve worked on two projects where AI-generated data infrastructure slowly started to morph over time.

Between multiple engineers trying to move quickly and a lack of standards, these projects became unruly.
Code bases tend to go through some level of evolution, but as different styles blend, the LLMs in turn start to create their own mutations. Codebases developed five to six different styles within months, a process that previously took years. Layer by layer, the engineers would slowly stop understanding exactly what was going on.

The pattern echoes what happened a decade ago with self-service tooling: early productivity gains that masked downstream complexity.

And that’s why the software factory can’t just be about speed.

What makes a software factory work

There are several key principles to consider when building a software factory.

Platform over tools: Many teams are slowly implementing AI into their coding workflows at the edges, adding a PR review agent or a skills file into their repos. But building an actual software factory requires a platform, not a collection of tools at the edges. A platform provides a unified foundation where tools aren’t scattered in separate corners. Instead, they actively share data, talk to each other, and work as a single cohesive system, with standards, processes, and the work itself all connected.

Rerunnability and traceability: A real platform requires the ability to go back into any run, identify what went wrong, and rerun it, which is why one-off agents don’t make a factory. The system needs to support taking a serial ID, looking it up, and tracing exactly how it got to the output it produced. This is why state machines make more sense than loops for AI workflows: they make it far easier to rerun a process and understand what happened at each step.

Safety and guardrails: Factories are not safe places. Neither is a software factory. As more people develop on these platforms, better guardrails and safety measures need to be built in.
Testing and quality control need to be pushed to the front of the process: catching bugs at the lowest possible stage reduces the cost to fix them and limits the blast radius.

Standardization: At the enterprise level, every codebase has its own flavor. Layering a code assistant on top without standards produces an amalgamation of styles. Standardization has to be built into the process from the start.

Quality control: In older manufacturing models, quality control happened at the end of the line. The product was built, inspected, defects found, and fixed later. Toyota’s approach was different. Quality was pushed into the process itself, and workers were expected to stop the line when something was wrong. The goal wasn’t to catch defects at the end; it was to prevent them from flowing downstream in the first place. The same is true for the software factory. QC needs to be baked into the entire process, starting with how the spec is written. That means integrating static code analysis that catches obvious errors and providing templates to LLMs so they know the structure the code should follow. Without that, the bottleneck becomes the final review, or teams just push out more AI slop.

Speed without quality isn’t productivity

Improving the speed of your code output is not actual productivity if the downstream issues aren’t managed. A company is not more productive because it produces millions of cars, only to see them all fall apart within 100 miles. It’s also not more productive if all it does is produce an endless stream of proofs-of-concept that never enter production. Actual productivity is when the software factory takes ephemeral tokens and turns them into durable outputs. It’s easy to talk about lines of code and how much faster your team is moving.

The software factory that wins isn’t the one that generates the most code. It’s the one that generates the fewest defects downstream.
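
To make the rerunnability and traceability principle above concrete, here is a minimal sketch of a workflow run as a state machine. It is an illustration, not a description of any particular platform: the `Workflow` and `Run` classes, the step names, and the `run-001` serial ID are all hypothetical. Each state records the payload that entered it, so a past run can be looked up by ID and replayed from any step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# A step receives the current payload and returns (next_state, new_payload).
# Returning None as the next state ends the run.
Step = Callable[[dict], Tuple[Optional[str], dict]]


@dataclass
class Run:
    run_id: str
    # Ordered record of (state name, payload that entered that state).
    trace: List[Tuple[str, dict]] = field(default_factory=list)


class Workflow:
    def __init__(self, steps: Dict[str, Step], start: str):
        self.steps = steps
        self.start = start
        self.runs: Dict[str, Run] = {}

    def execute(self, run_id: str, payload: dict, start: Optional[str] = None) -> dict:
        run = self.runs.setdefault(run_id, Run(run_id))
        state = start or self.start
        while state is not None:
            run.trace.append((state, dict(payload)))  # shallow snapshot of the input
            state, payload = self.steps[state](payload)
        return payload

    def rerun_from(self, run_id: str, state: str) -> dict:
        """Replay a past run from one state, using the input recorded for it."""
        for name, snapshot in self.runs[run_id].trace:
            if name == state:
                return self.execute(f"{run_id}-rerun", dict(snapshot), start=state)
        raise KeyError(f"state {state!r} not found in run {run_id!r}")


# Hypothetical three-step pipeline: write_spec -> generate -> test.
def write_spec(p: dict):
    return "generate", {**p, "spec": f"spec for {p['task']}"}


def generate(p: dict):
    return "test", {**p, "code": f"code from {p['spec']}"}


def run_tests(p: dict):
    return None, {**p, "passed": "code" in p}


workflow = Workflow(
    {"write_spec": write_spec, "generate": generate, "test": run_tests},
    start="write_spec",
)
result = workflow.execute("run-001", {"task": "load orders table"})
replayed = workflow.rerun_from("run-001", "generate")
```

Because every transition is recorded against the run ID, the trace answers “how did it get to this output?” directly, and a rerun starts from the recorded input of a single state rather than from the top of an opaque loop.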
Liquid AI’s smallest model yet LFM2.5-230M beats models 4X its size at data extraction, can run ‘anywhere’
Liquid AI, founded by former MIT computer scientists, today released its smallest AI language model yet, LFM2.5-230M, and enterprises would do well to consider it for data extraction and local deployment on smartphones, laptops and robotics.

This is a 230-million-parameter foundation model explicitly designed for on-device agentic workflows, and as Liquid states in its release blog post, that small size makes it possible to run nearly “anywhere.” According to Liquid, it also outperforms models more than 4X its size on selected benchmarks, specifically doing better at data extraction than the 800-million-parameter Alibaba Qwen3.5-0.8B (Instruct) and the 1-billion-parameter Google Gemma 3 1B.

The model targets developers and engineers building lightweight data extraction pipelines and autonomous edge systems. Operating under a dual-use commercial license, the model remains free for individuals and companies generating less than $10 million in annual revenue, while requiring a paid enterprise agreement for larger corporations. This release distinguishes itself from other small AI models by utilizing the LFM2 architecture to achieve high inference speeds without the massive memory overhead typical of parameter-heavy transformers.

While major AI companies such as Anthropic, OpenAI, Google, Microsoft and Meta push parameter counts into the hundreds of billions or trillions to achieve frontier performance, a parallel race focuses entirely on the edge and local deployments. Liquid AI’s launch of LFM2.5-230M signals a pivotal shift toward architectural efficiency over brute-force scaling. By squeezing 19 trillion tokens of pre-training into a 230-million-parameter footprint, the company demonstrates that edge devices do not need massive computational power or persistent cloud connections to execute complex, multi-step agentic workflows. 
How LFM2.5-230M works

The LFM2.5-230M model diverges from standard transformer architectures, relying instead on the LFM2 framework. This architecture functions as a hybrid system, interleaving gated short-range convolutions with grouped-query attention to process information efficiently. For those tracking the evolution of efficient architectures, Liquid’s approach shares a similar conceptual goal: managing long contexts and sequential data effectively on edge hardware without the quadratic memory costs of pure attention mechanisms. The model supports an expansive 32K context window, allowing it to ingest substantial documents or continuous streams of robotic telemetry.

When analyzing the performance charts provided in the release, the architectural efficiency becomes visually apparent. The model maintains a memory footprint of under 400MB while achieving prefill and decode speeds that outpace comparable models like Gemma 3 1B IT and Granite 4.0-H-350M. On a Samsung Galaxy S25 Ultra equipped with a Qualcomm Snapdragon Gen4 CPU, the model reaches a decode speed of 213 tokens per second. Even on a highly constrained Raspberry Pi 5, the model maintains a decode rate of 42 tokens per second. Furthermore, internal benchmarking shows the GPU inference stack delivers lower end-to-end latency than competing small models across all concurrency levels.

Why it matters for enterprises

To understand why a 230-million-parameter model is necessary, one must look at how enterprises currently manage data. Organizations have traditionally relied on rigid, rule-based Extract, Transform, Load (ETL) scripts to move and process data. However, these legacy systems are notoriously brittle; a simple change in a document’s layout or a schema update can break the entire pipeline. To solve this, the industry is shifting toward “AI ETL,” where machine learning infers mappings, detects schema drift, and adapts to changes automatically. 
In a modern lightweight data extraction pipeline, an AI model connects to unstructured sources (like PDFs, emails, or web forms) and structures the data into formats like JSON without requiring hardcoded rules.

For enterprises, using a massive flagship model like Claude Opus 4.6 (which costs $5.00 per million input tokens) to parse routine invoices, format addresses, or route telemetry data is economically unviable. This is where models like LFM2.5-230M become critical. Designed explicitly as a lightweight extraction engine, it allows companies to automate repetitive formatting and data parsing at a fraction of the compute cost and latency, running directly on local hardware rather than relying on expensive, continuous cloud API calls.

Small Model Benchmarks: LFM vs. The 3B Class

The AI industry in mid-2026 is seeing a renaissance in “small” models, but the definition of “small” varies wildly.

Recently, the open-weight community was stunned by Weibo’s VibeThinker-3B, a 3-billion-parameter model built on a Qwen2-style backbone that achieved a massive 94.3 on the AIME 2026 math benchmark, rivaling 600-billion-parameter behemoths through aggressive data curation and reinforcement learning.

Similarly, Google’s Gemma 4 family, which recently crossed 200 million downloads, pushes frontier AI to the edge, including the E2B (2 billion parameters) designed specifically for mobile and IoT deployments.

By contrast, Liquid AI’s LFM2.5-230M operates in a completely different weight class. At just 230 million parameters, it is roughly one-tenth the size of Google’s smallest Gemma 4 model and VibeThinker-3B. Because of its microscopic footprint, LFM2.5-230M is not designed to compete on reasoning-heavy workloads like advanced math, coding, or creative writing, a constraint Liquid AI explicitly acknowledges.

However, in its intended domains of data extraction and tool calling, the model punches well above its weight class. 
Benchmarks released by Liquid AI show LFM2.5-230M scoring 43.26 on the BFCLv3 tool-use benchmark, dominating IBM’s Granite 4.0-350M (39.58) and completely outpacing larger 1-billion-parameter models like Google’s Gemma 3 1B IT (16.61). On CaseReportBench for data extraction, it scores 22.51, decimating the Qwen3.5-0.8B (Instruct). LFM2.5-230M proves that while 3-billion-parameter models like VibeThinker are solving advanced calculus, a 230-million-parameter model is the superior, highly optimized choice for executing structured tool calls and keeping agentic pipelines running efficiently on constrained hardware.

Advanced research uses

Because it excels at tool calling, LFM2.5-230M functions primarily as a skill-selection layer. Liquid AI demonstrated this capability by deploying the model on a Unitree G1 humanoid robot. Running entirely on-device via the robot’s onboard NVIDIA Jetson Orin compute module, the model successfully processes complex environmental commands.

As noted in the company’s technical blog, the model takes a free-form instruction like, “Hold still for 2 seconds, then walk forward at 1 meter per second for 3 meters, hold a forward one-leg kneel for 5 seconds, and walk backward at 0.5 meters per second for 3 meters,” and automatically translates it into a structured multi-step plan calling on pre-trained low-level skills provided by NVIDIA’s SONIC framework.

The base and post-trained models are available immediately on Hugging Face, with native day-one support across the inference ecosystem for llama.cpp (GGUF), MLX, vLLM, SGLang, and ONNX.

Dual-use, custom LFM Open License

Liquid AI ships LFM2.5-230M under the LFM Open License v1.0. Despite the word “open” in the title, this is not an Open Source Initiative (OSI) compliant license; it operates as a restricted, dual-use commercial framework.

For independent developers, researchers, and early-stage startups, the license functions identically to open-source software. 
Users receive a perpetual, worldwide, royalty-free license to reproduce, modify, and distribute the model, provided they retain original copyright notices and prominently state any modifications.

However, the license includes a strict “Commercial Use Limitation.” Any legal entity generating $10 million or more in annual revenue loses the right to use the model commercially under this agreement.

Large enterprises crossing this financial threshold must negotiate a separate, paid commercial agreement with Liquid AI to deploy the model in production. This strategy protects the company from having its intellectual property absorbed by major technology conglomerates for free, while still seeding the model at the grassroots developer level.
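
To illustrate the skill-selection role described in the robotics demo above, here is a minimal sketch of how a pipeline might validate a structured multi-step plan before handing it to low-level skills. Everything in it is an assumption made for illustration: the JSON plan shape, the skill names (`hold_still`, `walk`, `kneel`) and their parameters are hypothetical, and they are not the schema Liquid AI or NVIDIA’s SONIC framework actually use. The `model_output` string stands in for what a tool-calling model might emit for the instruction quoted in the article.

```python
import json
from typing import Dict, List, Set

# Hypothetical skill registry: skill name -> exact set of required arguments.
SKILLS: Dict[str, Set[str]] = {
    "hold_still": {"duration_s"},
    "walk": {"direction", "speed_mps", "distance_m"},
    "kneel": {"duration_s"},
}


def validate_plan(raw: str) -> List[dict]:
    """Parse model output and reject unknown skills or malformed arguments."""
    plan = json.loads(raw)
    if not isinstance(plan, list):
        raise ValueError("plan must be a JSON list of steps")
    for i, step in enumerate(plan):
        if not isinstance(step, dict):
            raise ValueError(f"step {i}: expected an object")
        skill = step.get("skill")
        if skill not in SKILLS:
            raise ValueError(f"step {i}: unknown skill {skill!r}")
        args = set(step.get("args", {}))
        if args != SKILLS[skill]:
            raise ValueError(
                f"step {i}: expected args {sorted(SKILLS[skill])}, got {sorted(args)}"
            )
    return plan


def estimated_duration_s(plan: List[dict]) -> float:
    """Sum timed holds, plus distance / speed for each walking step."""
    total = 0.0
    for step in plan:
        args = step["args"]
        if step["skill"] == "walk":
            total += args["distance_m"] / args["speed_mps"]
        else:
            total += args["duration_s"]
    return total


# Stand-in for the structured plan a model might produce for the quoted instruction.
model_output = json.dumps([
    {"skill": "hold_still", "args": {"duration_s": 2}},
    {"skill": "walk", "args": {"direction": "forward", "speed_mps": 1.0, "distance_m": 3}},
    {"skill": "kneel", "args": {"duration_s": 5}},
    {"skill": "walk", "args": {"direction": "backward", "speed_mps": 0.5, "distance_m": 3}},
])

plan = validate_plan(model_output)
duration = estimated_duration_s(plan)
```

Validating against a fixed registry keeps a small model in the role the article describes: it selects and parameterizes existing skills, while anything outside the registry is rejected before it reaches hardware.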
OpenAI’s updated GPT-5.5 Instant is better at shopping, complex constraints, and understanding user intent — and it’s already in the API
OpenAI has made a significant update to its most widely used language model, GPT-5.5 Instant, which is the default in the free version of ChatGPT.

The company announced the upgraded version of GPT-5.5 Instant yesterday on X, calling it “much more fun to talk to” and saying it is “better at understanding the intent behind a question and adapting its response accordingly,” as well as offering improvements in shopping results, local recommendations, and handling “complex constraints.”

However, it has not yet provided any benchmarks or numerical results to quantify these claims.

The company said the updated GPT-5.5 Instant was rolling out first to paid ChatGPT subscribers and then to free users as of today, June 25. OpenAI also updated its chat-latest API alias, which points to the latest GPT-5.5 Instant model currently used in ChatGPT, while continuing to recommend the separate gpt-5.5 model for production API usage.

That distinction matters, but it should not obscure the main news: this is primarily a ChatGPT-side update to GPT-5.5 Instant, not a new release of the broader GPT-5.5 API model family.

Let’s dig into what’s changed…

Origins of GPT-5.5 Instant, and why OpenAI updated it less than two months later

GPT-5.5 Instant was first unveiled in early May 2026, just under two months ago, to replace the aging GPT-5.3 Instant engine as the baseline default model for ChatGPT users.

Developed as a fast, high-throughput variant of OpenAI’s core flagship model family, the initial spring release focused heavily on correcting systemic factuality deficits.

Internal benchmarks from that spring deployment reported a 52.5% reduction in hallucinated claims compared to GPT-5.3 Instant on high-stakes medical, legal, and financial prompts, alongside a 37.3% drop in factual error rates on user-flagged historical conversations.

Independent evaluators noted that its predecessor, GPT-5.3 Instant, had struggled in public rankings, placing 44th overall in Arena benchmarks. 
That gave the May rollout a clear purpose: OpenAI needed a stronger default model for everyday ChatGPT interactions, not just a more capable frontier model for advanced users.

Stylistically, the initial spring model introduced a sharper conversational baseline, demonstrating a 30.2% reduction in word count and a 29.2% drop in line usage over typical advice prompts.

However, the spring deployment also introduced an operational fault line for enterprise software systems: a feature known as “memory sources.” Designed to grant users visibility into the specific past chats, files, and connected Gmail accounts shaping a personalized answer, memory sources introduced a loose, model-reported observability layer.

As reported by VentureBeat, these internal summaries frequently clashed with the deterministic logs of localized vector databases and enterprise Retrieval-Augmented Generation (RAG) pipelines.

The resulting friction created dual, competing context records, making it difficult for administrators to reconcile what the model claimed it referenced against what it actually accessed in production.

The June 24 update does not appear to expand memory sources directly. 
Instead, it focuses on making GPT-5.5 Instant better at understanding user intent, carrying context across turns, following multi-part instructions, and producing more useful shopping and local recommendations.

A smarter, more ‘fun’ ChatGPT for consumers

For everyday users of ChatGPT, the most noticeable change in GPT-5.5 Instant will be the model’s improved intent recognition.

According to OpenAI’s latest release notes, GPT-5.5 Instant has improved at identifying the underlying goal behind a user’s question, particularly in decision-support scenarios like planning, shopping, asking for advice, researching options and comparing local choices.

Historically, large language models have struggled when given prompts with multiple overlapping constraints, often dropping one or two requirements in favor of a generalized response.

The updated GPT-5.5 Instant handles these complex instructions more reliably. When users push back on an answer, clarify their meaning, or introduce new constraints mid-conversation, the model should adapt dynamically rather than stubbornly repeating its original approach.

This contextual awareness extends heavily into commerce and local recommendations. GPT-5.5 Instant now makes better use of location context to surface nearby options, weaving together product recommendations, business information, and relevant images into a more cohesive output when those elements are useful.

Furthermore, OpenAI notes that the stylistic formatting of these responses is less rigidly templated, trading robotic lists for a more intentionally designed, warmer and restrained conversational tone.

Developers can test the latest Instant behavior through chat-latest

For the developer ecosystem, the June 24 GPT-5.5 Instant update is accessible through OpenAI’s updated chat-latest API alias.

chat-latest is not the same thing as the production gpt-5.5 model slug. 
OpenAI says chat-latest points to the latest Instant model currently used in ChatGPT, and it recommends the separate gpt-5.5 model for production API usage. Developers can use chat-latest to test the newest ChatGPT-style improvements, while using gpt-5.5 when they need a stable production target.

The current chat-latest model page lists a 400,000-token context window and support for up to 128,000 maximum output tokens. Its knowledge cutoff is Aug. 31, 2025.

On pricing, chat-latest uses the same $5.00 per 1 million input tokens and $30.00 per 1 million output tokens listed on its model page. Cached inputs cost $0.50 per 1 million tokens, a 90% discount that strongly incentivizes developers to optimize prompts by placing static instructions first and dynamic data later.

The model supports text and image input, text output, streaming, function calling and structured outputs. Through the Responses API, the chat-latest page also lists support for web search, file search, image generation, code interpreter and MCP.

The practical takeaway is simple: chat-latest gives developers access to the updated Instant-style behavior, but OpenAI is still steering production API builders toward the separate gpt-5.5 model. The broader GPT-5.5 API model includes a larger feature set and different production profile, but that is not the main focus of this update.

Why this matters for enterprise AI teams

For enterprises, the June 24 GPT-5.5 Instant update lands at the intersection of two related but distinct trends: better default user experience in ChatGPT, and more reliable orchestration behavior in the API.

The consumer-facing changes make ChatGPT more useful for everyday decision-making. 
Users should see better handling of messy, real-world requests: planning a trip with several constraints, comparing products, finding nearby businesses, or adjusting a recommendation after adding a new requirement.

The enterprise relevance is less about a new technical architecture and more about default behavior. A model that better infers intent, preserves context across turns and follows multi-part constraints can make ChatGPT more reliable for employees using it for research, planning, purchasing decisions, customer-facing drafts and internal analysis.

But enterprises should remain careful about observability. Memory sources can help users understand why ChatGPT personalized an answer, but they do not provide a complete audit trail. Organizations that already rely on RAG pipelines, vector databases, orchestration logs and internal agent traces should define which record acts as the source of truth when a model’s visible memory sources do not fully match the system’s own logs.

What’s next?

The release of GPT-5.5 Instant and the updated chat-latest alias signals a maturation in how generative models are deployed.

OpenAI is moving away from models that require heavy hand-holding and toward systems that can better infer the user’s goal, preserve constraints and adapt across multiple turns.

Whether it is a consumer planning a complex multi-city vacation in ChatGPT, or a developer orchestrating a codebase-navigating agent through the API, GPT-5.5 represents a faster, smarter and more capable baseline for the future of AI workflows.

The most important takeaway for developers is also the simplest: GPT-5.5 Instant, chat-latest and gpt-5.5 are related, but they are not the same product surface. GPT-5.5 Instant is the ChatGPT model users experience directly. chat-latest is a moving alias for testing the latest Instant behavior through the API. gpt-5.5 is the production model OpenAI recommends for developers building stable applications.
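
For developers weighing the cached-input pricing mentioned above, the arithmetic can be sketched in a few lines. The three rates are the ones quoted in the article for chat-latest; the function name, the token counts in the example, and the assumption that cached tokens are billed at the cached rate while the rest of the prompt is billed at the standard input rate are illustrative, not taken from OpenAI’s documentation.

```python
# Rates quoted in the article, in dollars per 1 million tokens.
INPUT_PER_M = 5.00
CACHED_INPUT_PER_M = 0.50
OUTPUT_PER_M = 30.00


def request_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate one request's cost in dollars.

    Assumption: cached_tokens is the part of the prompt served from cache and
    billed at the cached rate; the remainder is billed at the full input rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached = input_tokens - cached_tokens
    return (
        uncached * INPUT_PER_M
        + cached_tokens * CACHED_INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    ) / 1_000_000


# Hypothetical request: 10,000-token prompt, 1,000-token reply.
cold = request_cost(10_000, 1_000)                       # nothing cached
warm = request_cost(10_000, 1_000, cached_tokens=8_000)  # 8,000-token static prefix cached
discount = 1 - CACHED_INPUT_PER_M / INPUT_PER_M          # the 90% figure
```

Under these assumptions the hypothetical request costs about $0.080 with no cache hits and about $0.044 when the first 8,000 tokens are cached, which is why static instructions belong at the front of the prompt.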
Your enterprise AI agents should automatically remember which model is right for which task. Mindstone built the capability with Rebel
AI agent orchestration platforms are popping up like weeds these days, but London-based AI transformation startup Mindstone’s Rebel might be among the most promising I’ve come across.

That’s because the system, which officially launched this week, is a local-first, agentic AI operating system distributed under a “Fair Source” license, allowing teams of under 100 users to freely adopt and customize it to suit their needs, while organizations with more users must pay for an enterprise license.

The marquee features are its simplicity and its extensive customizability for any given team, no matter how unique or specific the workflows. Everything is built around markdown, the common, open source file format. The result is an organizational memory layer that ensures agents reliably use the enterprise’s preferred AI models for each task or even subtask, dynamically switching between local and cloud models in a predictable, visible way to save costs and maintain data privacy and security as needed.

“Shared memory is the most empowering thing you could possibly do with a knowledge-worker AI,” said Greg Detre, chief technology officer (CTO) of Mindstone, in a recent video call interview with VentureBeat. “You get this feeling of being a super-organism as a company that just gets smarter and smarter.”

Rebel is available now for macOS on Intel and Apple Silicon machines, as well as Windows, with Linux support in development.

Mindstone has raised $5 million from private investors including Pearson Ventures, Moonfire Ventures and Zanichelli Venture.

A distinctive, local-first architecture based on markdown files

What makes Rebel distinctive is its local-first architecture. 
Instead of the approach found in developer-heavy agent frameworks such as LangGraph, CrewAI and AutoGPT, which require teams to wire together databases, cloud infrastructure and state-management logic, Rebel’s core agent memory and instructions live across local markdown (.md) text files. That is arguably the simplest, easiest, and most popular way to steer AI agents, one that has been widely adopted by AI developers and power users around the globe.

Mindstone says Rebel stores its state, prompts, task instructions and memory hierarchy in these files, allowing users and companies to easily inspect, move or modify them as needed. A primary configuration file, agents.md, acts as the agent’s core instruction layer and runtime boundary.

That architectural choice is partly about cost. Mindstone argues that common office formats such as Word documents and PDFs often carry formatting and metadata overhead that consumes model token context and raises API costs. Markdown keeps the information closer to raw text, allowing more of the model’s context window to be spent on the actual task rather than document structure.

The company also positions the approach as a hedge against vendor lock-in. If a company’s agent instructions, automations and memory are stored locally as text files, they are not trapped inside one SaaS provider’s interface or database. That matters more as enterprises begin giving AI systems broader access to email, calendars, documents and internal workflows.

Rebel also lets users create repeatable AI workflows. “Skills” are saved multi-step procedures an agent can reuse. “Operators” adjust how the agent behaves for a given task, such as reviewing a pitch deck from an investor’s perspective or evaluating work through a security lens. “Automations” can run scheduled background tasks, such as scanning messages or files, finding relevant updates, drafting responses, or preparing work before an employee opens the app. 
Automatically selecting the best, enterprise-preferred AI model for every task (and subtask)

Another important feature is multi-model orchestration. Rebel can break a task into parts and route different steps to different models, including splitting between local and cloud-based ones depending on the sensitivity of the information or as guided by enterprise policies. A more powerful model can handle planning or complex reasoning; a cheaper model can handle routine work; a local model can handle sensitive steps or approval checks.

This matters for enterprises that want flexibility or are seeking cost controls: not every task need be sent to the same expensive cloud model, and some enterprise workflows prohibit sensitive corporate data leaving local infrastructure.

“I want to be able to say, ‘Help me with this,’ and it knows what’s personal, what’s sensitive, and what can be shared with the whole company,” Detre explained.

That model-agnostic setup gives companies more control over cost and security. Data-heavy work can run on lower-cost models such as Llama or DeepSeek. Higher-level reasoning can be reserved for more expensive models. Sensitive work can be routed through a local model running on the user’s machine, keeping that information from leaving the device.

This approach also gives enterprise teams a way to mix cloud and local inference without treating the choice as all-or-nothing. By shifting away from centralized, monolithic cloud interfaces toward a local file-driven architecture, Mindstone is introducing a model for how enterprise technical decision-makers orchestrate autonomous workflows without forfeiting data sovereignty or predictability.

How it works in practice

Mindstone CTO Greg Detre designed Rebel’s memory system to avoid a common problem in enterprise AI: dumping large amounts of company information into a database and hoping search will retrieve the right context later.

Instead, Rebel uses a tiered memory structure. 
When an interaction happens, the system estimates how likely that information is to be useful again. Information with a high expected value is written into a local readme.md file tied to a specific project space. Information with a moderate expected value becomes a reference link back to deeper historical records. Lower-priority material is stored in an indexed memory directory, where it remains available but dormant until a relevant task calls it back.

An ROI dashboard for enterprise buyers

For larger organizations, Mindstone Pro adds an Impact Dashboard designed to show where Rebel is saving time and money across business units.

Mindstone says the dashboard uses a separate, closed LLM to evaluate telemetry and calculate business impact. The company says the system is calibrated conservatively, using the lower end of estimated performance gains to avoid inflated productivity claims.

That feature speaks to a practical problem for enterprise AI buyers: proving value without over-surveilling employees. Mindstone says the dashboard is isolated from individual workspaces, allowing IT and business leaders to evaluate adoption and return on investment without reading employees’ private agent activity.

Fair Source licensing aims to reduce platform risk

Mindstone is releasing Rebel under a Fair Source license, a model meant to sit between fully closed SaaS and permissive open source.

Under the license, Rebel’s code is viewable, auditable, modifiable and deployable. Individuals and organizations with up to 100 concurrent users can run it for free. Once an organization exceeds that threshold, it needs a commercial Mindstone Pro license.

The license also includes a two-year sunset clause. Twenty-four months after a given version is released, that version automatically converts to the MIT open-source license.

For enterprise buyers, the practical pitch is that Rebel reduces the risk of being trapped. 
If every automation, memory file and agent instruction is stored locally in markdown, a company can move its data and workflows elsewhere if needed. The product may be commercial, but the underlying work is designed to remain inspectable and portable.

Security questions focus on local approvals and shared memory

Rebel’s debut on the open access tech product sharing platform Product Hunt this week prompted technical questions about how a local-first agent should handle permissions, safety checks and shared memory.

One developer, Nikita Pokryschko, asked whether approval checks for sensitive actions could run entirely on a local model, or whether the gating logic still required a cloud call.

Detre responded by explaining Rebel’s separation between planning, execution and background safety logic. Mindstone CEO Joshua Wöhle added that companies can configure Rebel to rely entirely on a local model for gating decisions.

That distinction matters for corporate security teams. Autonomous agents often need broad permissions to read files, draft emails or interact with internal systems. If the final approval layer depends on an external cloud model, some companies may see that as a compliance risk. Mindstone is arguing that Rebel can keep those approval boundaries local.

A second discussion focused on how Rebel decides what memory can be shared. Product developer Clement Morel asked whether shareability is determined by content, user settings or learned behavior, and what happens if the system gets it wrong.

Detre said Rebel uses the user’s local “Chief-of-staff README” and defined spaces to separate private, team and company-wide information. When the agent encounters ambiguous context, the system pauses and asks the user for approval before proceeding.

That emphasis on visibility is part of Mindstone’s broader argument against opaque agent systems. 
As CEO Joshua Wöhle put it in a post on his LinkedIn account: “If an agent is going to sit inside your workspace, remember your context, and ask permission before changing the world, you should be able to see how it works. Not because everyone will read the code, but because someone can.”

Mindstone points to customer rollout as early proof

Mindstone says Rebel has already been deployed across the 250-person workforce of customer Epignosis, covering sales, engineering, product, finance and customer success teams.

“The entire organization is operating on Rebel today,” Wöhle told VentureBeat.

Over a 12-week deployment, Mindstone says Epignosis recaptured the equivalent capacity of eight full-time roles. The company says adoption spread organically after employees saw colleagues automate time-consuming work, a pattern employees reportedly called the “potatoes effect.”

The Epignosis case is central to Mindstone’s argument that enterprise AI should not be treated as a set of isolated personal tools. Rebel’s shared-memory design is meant to let workflows move across teams and improve as more employees use them.

“The border between learning and doing is fading out – and that changes everything about how you scale,” Epignosis CEO Dimitris Tsingos said in a statement provided to VentureBeat by Mindstone.

Background on Mindstone

Mindstone Learning Limited, headquartered in London, launched in 2020 under the direction of CEO Joshua Wöhle, previously a co-founder of the digital child safety firm SuperAwesome. Originally positioned in the consumer education technology market, the company built a digital curation tool likened to a “Spotify for learning” that utilized compound learning methodologies. However, following the widespread commercialization of generative artificial intelligence platforms between 2022 and 2024, Mindstone moved into business-to-business enterprise enablement. 
Leadership identified a critical “last-mile” barrier: while AI tools promised substantial productivity gains, traditional corporate training failed to equip the workforce to practically integrate them into daily operations.

Today, Mindstone functions as a comprehensive enterprise software and training ecosystem designed to maximize corporate return on investment for existing AI licenses. The product architecture systematically addresses different organizational tiers through highly contextualized, “live-fire” software applications rather than abstract slide presentations. Financially, Mindstone utilizes a hybrid capitalization strategy that interweaves institutional venture capital from entities like Moonfire Ventures and Pearson Ventures with community-based equity crowdfunding on platforms such as Seedrs and Crowdcube. Mindstone has successfully penetrated the enterprise market, securing commercial contracts with blue-chip corporations including The Home Depot, Hyatt Hotels Corporation, Pearson, and Ernst & Young. Ultimately, Mindstone positions itself as the crucial antidote to corporate inertia, ensuring organizations establish the internal competency required to execute successful AI transformations.

Mindstone’s bet: enterprise AI needs shared memory, not more seats

Rebel arrives as companies are trying to move from AI experimentation to AI operations. The first wave of enterprise adoption centered on access: giving employees chatbots, copilots and model subscriptions. Mindstone is betting the next wave will center on coordination.

That means shared memory, reusable workflows, local control, flexible model routing and measurable business impact. It also means giving enterprises a way to inspect the systems they are being asked to trust.

The company’s challenge now is execution. Local-first software can be harder to manage than cloud SaaS. Shared memory raises governance questions. Multi-model routing adds complexity. 
And enterprises will still need proof that agentic workflows can deliver reliable productivity gains without creating security or compliance headaches.
But Mindstone is making a clear argument: buying AI seats is not the same as building AI infrastructure. Rebel is its attempt to turn scattered employee experiments into an operating layer for work.
Mistral launches OCR 4, turning document extraction into a full enterprise AI play
Mistral AI on Tuesday released OCR 4, a document intelligence model that moves beyond raw text extraction to return structured representations of entire documents — complete with bounding boxes, block-type classification, and per-word confidence scores. The release marks Mistral’s fourth generation of optical character recognition technology in roughly 15 months and lands at a moment when the company’s pitch for European AI sovereignty has never been more commercially relevant.
The model supports 170 languages across 10 language groups, accepts PDF, DOC, PPT, and OpenDocument formats, and can be deployed as a single container on an organization’s own infrastructure — a capability Mistral is positioning directly at enterprises in regulated industries that cannot route sensitive documents through U.S.-jurisdiction cloud APIs.
“Mistral OCR 4 extracts and structures content from a wide range of documents,” the company said in its announcement. “Where previous generations focused on converting a page into clean text and tables, OCR 4 returns a structured representation of the document.”
The model is available immediately through the Mistral API, Document AI in Mistral Studio, Amazon SageMaker, and Microsoft Foundry, with Snowflake Parse Document support coming soon. Pricing starts at $4 per 1,000 pages, dropping to $2 per 1,000 pages through a batch API discount.
OCR 4 treats every document as a semantic map, not a wall of text
The central engineering shift in OCR 4 is structural. Rather than outputting a flat stream of extracted text — the paradigm that has defined OCR for decades — the model returns a layered representation in which every block is localized with a bounding box, classified by type (title, table, equation, signature, and others), and scored for confidence at both the page and word level.
Mistral says bounding boxes were its most-requested capability.
The reason is straightforward: without location data, downstream systems cannot trace an extracted fact back to its source on a specific page. That traceability gap has been a persistent friction point for enterprises building retrieval-augmented generation (RAG) pipelines, compliance workflows, or any application where “where did this number come from?” is a question that needs an auditable answer.
Block classification addresses a related problem. A paragraph tagged as a “title” can segment a document into hierarchical chunks for semantic search. A block tagged as a “table” can be routed to a structured-data pipeline rather than a text summarizer. A block tagged as a “signature” can trigger a redaction workflow in a compliance system.
These are not novel ideas in isolation, but packaging them as first-class outputs of the OCR model itself — rather than requiring a separate layout-analysis stage — removes an integration layer that enterprise teams have historically had to build and maintain themselves.
The confidence scores serve a dual purpose. At scale, they allow organizations to programmatically route low-confidence regions to human reviewers and auto-approve high-confidence extractions, building what the industry calls human-in-the-loop verification without requiring a person to review every page of every document. In production systems, OCR is rarely the end goal — it is the first step in a larger pipeline.
Developers building RAG systems, agent workflows, or document automation often spend more time reconstructing layout and structure than on the downstream AI logic itself.
OCR 4 aims to eliminate that reconstruction step, and if it delivers on that promise, the value accrues not just in OCR cost savings but in reduced engineering hours across the entire document pipeline.
Independent reviewers preferred Mistral’s output 72 percent of the time, but benchmarks tell a complicated story
Mistral reports that OCR 4 achieved a 72% average win rate in a head-to-head human evaluation against leading competitors, conducted by independent annotators across more than 600 real-world documents in over 12 languages. The model also achieved the top overall score on OlmOCRBench at 85.20 and scored 93.07 on OmniDocBench.
But the company itself urges caution in interpreting those numbers. In its release, Mistral took the unusual step of auditing and publicly disclosing the specific types of scoring artifacts it encountered, including ground-truth errors in the reference annotations, equivalent LaTeX notation scored as mismatches, column-reading-order assumptions, and header/footer attribution issues. “We therefore treat the aggregate score as directional rather than definitive,” the company said — a notably transparent stance from a vendor announcing a product.
That transparency is well-timed. On the public OlmOCRBench leaderboard, some researchers have noted that OCR 4 currently ranks third, behind open models like Chandra OCR 2. And some open-weight models self-report higher OmniDocBench composite scores — PaddleOCR-VL-1.6 claims 96.33 — though those results have not been independently reproduced on the public leaderboard.
Early enterprise feedback has been favorable nonetheless.
Aidan Donohue, an AI engineer at financial AI firm Rogo, said the company benchmarked OCR 4 against leading agentic document parsers on a chart-dense financial QA dataset and “reached equivalent accuracy at roughly 8x lower cost and 17x lower latency.” Ivan Mihailov, an AI engineer at intellectual property management firm Anaqua, said OCR 4 is “roughly 4x faster per page than our incumbent provider.” Enterprise buyers, however, should run their own evaluations rather than relying on any vendor’s benchmark numbers. The practical question is not which model scores highest on a leaderboard, but which model produces the fewest errors on your specific documents, in your specific languages, at a price and latency that fit your workflow.
The Anthropic export ban gave Mistral’s sovereignty pitch the proof point it needed
Mistral’s release lands in a geopolitical context that could hardly be more favorable for its strategic positioning.
On June 12, Anthropic was forced to disable all access to its newest AI models, Fable 5 and Mythos 5, after the U.S. Commerce Department used national security export controls to bar the company from distributing the models to any foreign national. Enterprise clients in finance, healthcare, SaaS, and critical infrastructure found their core intelligence services abruptly disabled, without prior warning or effective recourse. As of June 24, both models remain offline, with prediction markets giving only 57% odds of restoration before July 1.
That episode validated a warning Mistral CEO Arthur Mensch has been sounding for over a year.
As Business Insider reported, Mensch warned at London Tech Week in June 2025 about American AI companies “having the keys” for their models, calling it a scenario where European companies are “giving leverage to their providers.” He added: “At some point, you need to be able to turn it off or turn it on, and you don’t want to leave it to another country.”
The argument gained further urgency as Mensch’s broader sovereignty pitch escalated in recent months. As reported by CNBC in late May, Mensch told the outlet: “Europe is lagging behind when it comes to [the] buildout of infrastructure, and so we are investing to close that gap.” At the same time, Mensch pushed back against Pope Leo XIV’s call for AI to be “disarmed,” arguing that Europe cannot afford to fall behind U.S. tech giants. “We’re all for peace, but if you look at our rivals and adversaries in the world, they’re using artificial intelligence … we do need to have our own capabilities,” Mensch told reporters.
OCR 4’s single-container, self-hosted deployment model is the product-level expression of that argument. A U.S.-headquartered provider offering EU data residency means documents are stored in Frankfurt but governed by U.S. law. Mistral, incorporated in France and operating under EU jurisdiction, offering on-premise containerized deployment, means documents never leave the customer’s infrastructure at all. The EU AI Act’s fine enforcement provisions take effect August 2, adding regulatory pressure to the compliance calculus for European enterprises evaluating document AI vendors.
Baidu’s free, open-weight OCR model arrived one day earlier — and the contrast is revealing
Mistral’s release did not arrive in isolation.
Just one day before OCR 4 launched, Baidu shipped Unlimited-OCR on June 22 — a 3-billion-parameter MIT-licensed model that tackles one of the most persistent pain points in document AI: parsing entire PDFs and multi-page scans in a single forward pass, without chunking the input or stitching the output back together afterward.
Baidu’s model uses a technique called Reference Sliding Window Attention (R-SWA) that, as a top Hacker News commenter explained, splits the AI’s focus into two paths: maintaining full attention on the original document image while restricting memory of generated text to a tight, moving window. The result is constant KV cache size and the ability to transcribe 40-plus pages in a single forward pass. The model gathered 1,800 GitHub stars in its first 24 hours and racked up more than 479 upvotes on Hacker News, where the discussion thread ran to 109 comments.
The two releases frame what some analysts are calling the June 2026 document-AI split: self-hosted long-horizon parsing with open weights versus structured managed extraction with enterprise features.
Baidu’s model is free under an MIT license, runs on standard GPU hardware, and has no managed API or enterprise SLA. Mistral’s model is a commercial product with per-page pricing, bounding boxes, confidence scores, block classification, multi-platform distribution, and self-hosted deployment options for enterprise customers. Unlimited-OCR may be the better tool for a research team digitizing scanned dissertations on a single GPU. OCR 4 is built for the IT procurement process — the world of SLAs, data processing agreements, and compliance audits.
Beyond Baidu, the broader OCR competitive field includes Google Document AI, Amazon Textract, Azure Document Intelligence, ABBYY Vantage, and a growing number of open-weight models. On the Hacker News thread for Unlimited-OCR, practitioners offered a candid assessment of the state of the art.
Joss82, who has worked on document parsing for 10 years, wrote bluntly: “OCR still sucks in 2026.” Meanwhile, one user named SyneRyder reported success with Claude for OCR of hundreds of pages of handwritten documents, noting the model delivered results with “no corrections required” and even pointed out a continuity error in the source text. These practitioner reports underscore a key tension in the market: performance varies wildly depending on the specific document type, language, and quality of the source material.
The real play is not OCR — it is an enterprise AI stack with document intelligence as the on-ramp
Step back far enough, and Mistral’s OCR 4 release is not really an OCR story. It is an enterprise go-to-market story built on top of a $4.4 billion global intelligent document processing market that is forecast to grow at a 33.1% compound annual growth rate through 2030, according to Grand View Research.
For Mistral, OCR is a wedge into enterprise AI budgets. The model feeds directly into Mistral’s Search Toolkit, the company’s open-source composable search framework announced at the AI Now Summit. In that architecture, OCR 4 serves as the ingestion layer for retrieval-augmented generation and enterprise search pipelines, converting raw documents into citation-ready, structurally classified input. The logic is clear: once an enterprise adopts OCR 4 for document extraction, Mistral’s broader model suite — including Medium 3.5 for reasoning and the Vibe agentic platform for task execution — becomes the natural next step in the stack. That pipeline ambition is critical context for understanding Mistral’s current fundraising trajectory. Bloomberg recently reported that the company is in early discussions to raise about €3 billion ($3.5 billion) at a valuation of roughly €20 billion — nearly double the €11.7 billion valuation from its September Series C round. To date, Mistral has raised only about $4 billion, a fraction of what its largest U.S. rivals have taken in. OCR 4 and its associated enterprise revenue pipeline are part of how the company plans to justify that higher valuation, with Mistral targeting €1 billion in revenue for 2026, up from €200 million in 2025, according to Le Monde.
Mistral is a company with roughly 1,000 employees and ambitions to compete with labs that have raised 40 times as much capital. It cannot win a general-purpose model arms race against OpenAI and Anthropic. What it can do is build a differentiated enterprise stack around sovereignty, structured document intelligence, and agentic workflows — and use that stack to capture European enterprise budgets that are increasingly wary of U.S. provider dependency. The pricing structure reinforces that strategy: at $2 per 1,000 pages in batch mode, the cost of processing a 100,000-page corporate archive falls to $200, making large-scale digitization projects economically viable in ways they may not have been with token-based vision-language model pricing.
Whether Mistral can execute that vision at scale — against Google, Amazon, Microsoft, and a surging open-source ecosystem — remains an open question. But the Anthropic export control crisis is still unresolved, European data sovereignty regulations are tightening, and a potential €20 billion funding round is on the horizon. The company is holding an OCR 4 production webinar on July 7 at 6:00 PM CET.
Two weeks ago, the argument for building AI infrastructure outside the reach of U.S. export controls was theoretical. Then the U.S. government flipped a switch, and Anthropic’s most advanced models went dark for every non-American on the planet. Mistral did not cause that crisis — but it spent the last year building the product that makes it matter.
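For teams evaluating the model, the block-type dispatch and confidence-based review routing described earlier in this story can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `Block` fields, the route names, and the 0.90 threshold are hypothetical and are not Mistral's actual response schema or recommended settings. The cost helper simply reproduces the batch-pricing arithmetic quoted above.

```python
from dataclasses import dataclass

@dataclass
class Block:
    # Hypothetical shape of one extracted block; the real OCR 4 schema may differ.
    block_type: str                          # e.g. "title", "table", "signature"
    page: int
    bbox: tuple[float, float, float, float]  # location on the page, for traceability
    text: str
    confidence: float                        # assumed to lie in [0.0, 1.0]

# Hypothetical downstream destinations keyed by block type.
ROUTES = {
    "title": "chunker",          # segment the document for semantic search
    "table": "structured_data",  # send to a structured-data pipeline
    "signature": "redaction",    # trigger a compliance redaction workflow
}

def route(block: Block, threshold: float = 0.90) -> str:
    """Send low-confidence blocks to a human; dispatch the rest by type."""
    if block.confidence < threshold:
        return "human_review"
    return ROUTES.get(block.block_type, "text_pipeline")

def batch_cost_usd(pages: int, price_per_1000: float = 2.0) -> float:
    """Cost at the batch rate quoted in the article ($2 per 1,000 pages)."""
    return pages / 1000 * price_per_1000
```

In practice the threshold would be tuned on a team's own documents, in line with the article's advice to run independent evaluations rather than rely on vendor numbers.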
Alibaba’s model never trained as an agent — and improved agent performance across seven benchmarks
Alibaba’s Qwen team released Qwen-AgentWorld on Tuesday — two models trained not to act inside agent environments, but to predict what those environments return. The release covers seven domains under a single architecture: MCP, Search, Terminal, Software Engineering, Android, Web, and OS. The release extends Alibaba’s recent push into autonomous agents. Qwen3.7-Max, released in May, was built around a 35-hour autonomous execution capability. That shift targets a ceiling that teams training agents at scale run into directly. Real search engines surface whatever results exist, with no mechanism to inject controlled conditions. Live terminals do not allow injecting a low-disk-space condition on demand. Agent training is bounded by what production environments will surface, with no systematic way to expose the edge cases agents will need to handle but rarely encounter in training.
The research team trained agents inside the resulting simulator and found performance gains that exceeded what training against real environments alone produced. In a separate test, using world model training as a warm-up before agentic fine-tuning improved performance across seven benchmarks, including three the model had never seen during training.
The paper accompanying the release identified a gap in prior agent research. “We argue that world modeling is a crucial missing piece in the path to general agents.”
Qwen-AgentWorld trains on what environments return, not what agents should do
Most agent models are trained to answer one question: given what the environment just showed me, what should I do next? Qwen-AgentWorld is trained to answer the inverse: given what the agent just did, what will the environment show next?
That reversal is the core of what the paper calls a language world model: instead of optimizing for action selection, the model learns to predict the next environment state across all seven domains under a single training objective.
Prior work was narrower: WebWorld, an earlier Qwen project from February, covered web environments only; Snowflake’s Agent World Model, published the same month, generates code-driven SQL-backed environments rather than training a model to predict states. Qwen-AgentWorld is the first to span seven domains in a single model, with environment modeling baked in from the earliest pretraining stage.
Alibaba trained both models in three stages on more than 10 million environment interaction trajectories from real agent runs. Stage one teaches the model how environments behave — file systems, terminal states, browser DOM changes, API responses. Stage two trains the model to reason through what comes next before predicting it. Stage three, reinforcement learning, tightens predictions using rule-based checks and open-ended quality scoring.
Both models are Mixture-of-Experts designs — only a fraction of parameters are active per token. The 35B model activates 3B; the 397B activates 17B. Both support 256K context windows. For GUI domains (Android, Web, and OS), the models work from textual accessibility trees and UI view hierarchies rather than screenshots.
The 35B model weights and AgentWorldBench are available under Apache 2.0; the 397B weights are not publicly released.
The training results matter more than the benchmarks
The benchmark scores show how accurately the models predict what environments return. The training results show what that prediction capability is actually worth for teams building agents — and those are the numbers that matter more.
According to the researchers, agents trained inside controlled simulation outperformed agents trained in real environments. Injecting targeted perturbations — partial responses that force extra agent steps, and edge cases real environments rarely surface — pushed MCPMark from 24.6 to 33.8.
On Search, agents trained in entirely fictional worlds transferred to real search tasks, pushing WideSearch F1 Item from 34.02 to 50.31 on the open 35B model. A separate warm-up test showed that world model pretraining improved BFCL v4 from 62.29 to 71.25 and Claw-Eval from 53.60 to 64.88 with no agent-specific fine-tuning.
Researchers flag the benchmark and the overfitting risk
The paper drew immediate reaction from AI researchers on X. The concerns they raised map to what practitioners need to verify before acting on the findings.
On the training objective and transfer result, the assessment from one AI/ML researcher was direct. “Every other ‘agent’ model has been trained to act in environments,” wrote @drawais_ai, who has a PhD background and regularly breaks down AI papers. “Qwen flipped the question. They trained the model to predict the environment itself… That predictive knowledge then transfers to agent tasks even without any agent-specific fine-tuning.” He identified the Controllable Sim RL result as “the receipt” for the claim that synthetic training can substitute for real-environment RL at scale, and flagged that three of the seven transfer benchmarks were entirely out of domain.
The benchmark margin drew immediate scrutiny. “AgentWorldBench is a benchmark Alibaba built and published in the same paper,” wrote @TheSignal_Desk, who focuses on honest takes and key numbers in AI research. “They wrote the test, then topped it by 0.46.”
The sim-RL methodology is the result that @limalemonnn, who builds production AI agents, identified as most in need of scrutiny before the headline claim gets quoted. “Sim-trained agents traditionally overfit to the simulator’s quirks,” they wrote. “If the world model is too clean, the agent learns the model, not the task.” They pointed to the paper’s holdout split as the section practitioners should read before acting on the numbers.
The overfitting concern has a partial answer in the data.
The gap between uncontrolled Sim RL (MCPMark 24.6) and controlled Sim RL (MCPMark 33.8) suggests the gains depend substantially on the controllability mechanism, not simulation accuracy alone. The fictional-world Search result, where agents trained on invented environments transfer to real search tasks, is the paper’s strongest evidence against the overfitting concern.
What this means for teams building agentic pipelines
For AI engineering teams building and scaling agentic pipelines, this work signals a meaningful shift in how agent capability gets built. Teams training agents at scale now have a third option between real-environment RL and static benchmarks: controlled simulation that injects the edge cases production won’t surface.
Synthetic environments are a legitimate training layer. Controlled simulation that injects conditions real environments won’t produce is a complement to real-environment RL, not a shortcut around it.
What a model learns before agent training starts matters more than most pipelines account for. The warm-up finding — performance gains across unseen benchmarks with no agent-specific training — suggests environment grounding belongs earlier in development than current practice.
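To make the inverted objective concrete, here is a toy Python sketch of how one recorded trajectory could be shaped into training pairs for an agent model versus a world model, plus a wrapper that injects a fault on demand. It is an illustration only: the function names, the tuple-based data layout, and the toy terminal are assumptions made for this example, not the Qwen team's actual pipeline.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (agent action, observation the environment returned)

def agent_pairs(initial_obs: str, steps: List[Step]) -> List[Tuple[str, str]]:
    """Agent objective: given the latest observation, predict the action."""
    pairs, obs = [], initial_obs
    for action, next_obs in steps:
        pairs.append((obs, action))
        obs = next_obs
    return pairs

def world_model_pairs(
    initial_obs: str, steps: List[Step]
) -> List[Tuple[Tuple[str, str], str]]:
    """World-model objective: given (observation, action), predict what
    the environment returns next."""
    pairs, obs = [], initial_obs
    for action, next_obs in steps:
        pairs.append(((obs, action), next_obs))
        obs = next_obs
    return pairs

def with_fault(
    env: Callable[[str], str],
    trigger: Callable[[str], bool],
    fault_output: str,
) -> Callable[[str], str]:
    """Wrap a simulated environment so a chosen condition can be injected
    on demand, e.g. a low-disk-space error a live terminal rarely produces."""
    def faulty(action: str) -> str:
        return fault_output if trigger(action) else env(action)
    return faulty
```

For example, wrapping a toy terminal with `with_fault(env, lambda a: a.startswith("cp "), "cp: No space left on device")` forces every copy command to fail, which is the kind of controlled edge case the article says production environments will not surface on request.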
Xiaomi’s HarnessX rewrites its own AI scaffolding mid-task — and smaller models gain the most
As enterprise AI agents take on increasingly complex, long-horizon tasks, their performance is often restricted by their harness, the software scaffolding that connects the backbone LLM to its environment. Currently, harnesses are largely static and hand-crafted. Improving them is a mostly manual process, and they do not automatically improve from the execution data they collect from their environment.
To address this engineering bottleneck, researchers at Xiaomi introduced HarnessX, a framework that treats the AI harness as a composable object and autonomously applies improvements to its code. In real-world enterprise applications, this automated adaptation enables AI systems to dynamically adjust to application-specific requirements. Practical tests showed HarnessX delivering substantial performance gains across domains like software engineering and web interaction. The results demonstrate that scaling the foundation model is not the only path to more capable AI — and for smaller models, it may not even be the best one. HarnessX’s harness evolution yielded an average +14.5% performance gain across 15 model-benchmark combinations; for the open-weight Qwen3.5-9B, gains reached +44% on embodied planning tasks.
The challenges of harness engineering
In AI applications, a foundation model’s capability relies heavily on its surrounding harness. The harness acts as the operational layer that converts raw model outputs into structured, executable agent behaviors. It comprises the prompts, external tool integrations, memory management, and control flows that dictate how an AI system observes its environment, reasons through a problem, and takes action. As enterprise agents take on more complex, long-horizon workflows, harness engineering has become a fundamental part of AI development. Despite its importance, harness development remains far from a mature engineering discipline and presents three key challenges.
First, harnesses are static and hand-engineered.
Any shift in the underlying foundation model, the introduction of new tools, or a pivot to a different operational domain requires bespoke, manual code rewrites. Traditional harnesses lack mechanisms to autonomously learn and improve from past execution experiences.
Second, most existing harnesses suffer from architectural entanglement. They tightly couple prompt templates, tool wrappers, retry policies, and memory management within the same code paths. This entanglement means that tweaking one component can silently break others. Attempting to reuse a harness across different business domains often devolves into raw code copying rather than clean, modular composition.
Third, the harness and foundation model are optimized in isolation. When engineers run tests to improve the harness, the execution traces generated are typically discarded rather than used as training data to improve the model. Consequently, model upgrades do not naturally lead to harness improvements, creating a bottleneck where teams fail to capture the full value of their agent’s operational data.
HarnessX: an autonomous foundry for AI agents
HarnessX solves the engineering bottlenecks of manual harness development with what the researchers call a “unified harness foundry.” The core innovation of HarnessX is treating the harness as a “first-class object.” In software engineering terms, this means the harness is an independently serializable, modular, and substitutable entity. By separating the model configuration (i.e., which AI model is operating) from the harness configuration, engineers can seamlessly swap, adapt, and evolve the scaffolding without touching the underlying model.
HarnessX breaks agent behavior down into different components, such as context assembly, memory management, tool ecosystems, control flow, and observability. Every specific behavior is implemented as a “processor” that plugs into precise lifecycle hooks of the harness.
This modular structure allows the system to swap, add, or remove these processors without breaking the surrounding pipeline.
To automate the optimization of this modular structure, HarnessX introduces AEGIS, a trace-driven evolution engine. AEGIS frames harness adaptation as a reinforcement learning (RL) problem over the different symbolic components of the harness. Framing harness optimization as a reinforcement learning problem introduces three pathologies the researchers had to explicitly engineer against:
Reward hacking: The system might exploit shortcuts to the solution instead of genuinely solving the task.
Catastrophic forgetting: An edit that fixes a failure pattern in one domain might silently break a previously solved workflow in another.
Under-exploration: The system might iterate on minor prompt tweaks rather than exploring new, structurally superior tool configurations.
To prevent these problems, AEGIS relies on full trace observability and a four-stage pipeline:
Digester: Compresses execution traces into structured summaries to identify where the agent failed.
Planner: Analyzes these summaries to enable the system to explore structural changes rather than just local prompt tweaks.
Evolver: Generates code-level harness edits and tests to ensure they run correctly before deployment.
Critic and gate: A Critic assesses the edits to detect reward hacking, while a deterministic gate rejects any update that regresses a previously solved task to prevent catastrophic forgetting.
HarnessX enters a growing field of self-improving harness research — but what separates it is harness-model co-evolution.
The researchers highlight that optimizing either component in isolation eventually hits a wall. Evolving only the harness hits a scaffolding ceiling if the underlying model lacks the reasoning capacity to use the new tools.
Training only the model hits a training-signal ceiling if the harness never prompts the model to use its advanced capabilities.
HarnessX interleaves harness evolution with model training. The execution traces generated while the harness attempts to adapt to tasks are converted into reinforcement learning signals for the foundation model. Every time the harness improves its strategy, the model simultaneously learns to better exploit that new strategy, breaking the capability ceilings of traditional AI agent development.
HarnessX makes this co-evolution possible through cross-harness GRPO (Group Relative Policy Optimization). GRPO is the popular RL algorithm used to train reasoning models such as DeepSeek-R1. When fine-tuning the model, cross-harness GRPO pools an agent’s execution trajectories for the same task across entirely different versions of the application’s harnesses. This allows the underlying model to internalize high-level strategy shifts, like using a new API endpoint or managing an execution budget, rather than just learning minor prompt-phrasing variations.
HarnessX in action on industry benchmarks
To validate the practical utility of HarnessX, the researchers tested it across five benchmarks comprising software engineering, multi-turn customer service dialog, web navigation, open-ended multi-step reasoning, and embodied planning.
They separated the AI into two roles. The “meta-agent,” powered by Claude Opus 4.6, analyzed logs and wrote the code to evolve the harnesses. The “task agents” ran the actual workflows. To prove the framework is model-agnostic, they tested it on three different worker models: Claude Sonnet 4.6, GPT-5.4, and the open-weight Qwen3.5-9B.
HarnessX was compared against two primary baselines. The first was a static harness, representing how most enterprises deploy AI today, using hand-crafted, frozen setups with benchmark-specific prompts and tools.
The second was the Claude Code SDK, a baseline representing a single-agent evolver, included to test whether the complex, four-stage AEGIS pipeline outperformed asking a single language model to iterate on the code.
Dynamically evolving the harness yields significant gains on the same base model. HarnessX improved performance in 14 out of 15 model-benchmark combinations. Across all tests, evolving the harness yielded an average absolute performance gain of +14.5%.
The weakest models benefited the most from dynamic harness improvement. The open-weight Qwen3.5-9B saw a +44.0% performance jump on the ALFWorld embodied planning benchmark, and a +18.2% jump on SWE-bench Verified for software engineering. Co-evolution also proved highly effective. When the researchers trained the foundation model using the data generated while evolving the harness, they saw an additional +4.7% average performance boost. Improving the harness and the model simultaneously yields the highest ceiling. The co-evolution gain applies only to open-weight models.
Anecdotal evidence from the experiments shows how HarnessX solves pernicious problems when creating agent harnesses for real-world tasks. For example, in the GAIA multi-step reasoning benchmark, the task agent consistently failed because the headless browser tool it used to scrape Wikipedia timed out on the site’s JavaScript-heavy frontend. HarnessX analyzed the execution traces, diagnosed the error, and wrote a new tool that bypassed the browser entirely and queried the MediaWiki API directly for plain text. It swapped this tool into the harness and instantly unlocked the failing tasks.
During the WebShop e-commerce tests, the AI agent often got stuck in pagination loops, endlessly clicking “next page” and reformulating searches without ever committing to buying a product. Rather than just tweaking the prompt, HarnessX built an advisory processor that detected when the agent was repeating navigation actions.
It injected a warning into the context to force a decision, curing the looping behavior and raising performance.
Limits of automated harness engineering
One important caveat is that the system currently relies on powerful models to act as the meta-agent that rewrites the harness code. In their experiments, the researchers relied on closed frontier models like Claude Opus. Open-weight models are quickly improving, but their ability to serve as the meta-agent remains untested.
Another limitation worth considering is the intrinsic capabilities of the models used. If the underlying task model is fundamentally too weak to execute the complex workflows the new harness proposes, HarnessX will not be able to improve the agent’s overall abilities (the researchers observed this with the Qwen3.5-9B model on the SWE-bench coding tests).
Despite these limitations, HarnessX makes a concrete case that harness engineering — not just model scaling — is a lever practitioners can pull now. For teams running smaller open-weight models on complex workflows, the gains here are large enough to justify evaluating harness evolution as a first step before reaching for a more expensive frontier model. The researchers plan to release the code in a future update.
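Since the code is not yet public, the following Python sketch is only a guess at the shape of two ideas from the story: an advisory processor attached to a lifecycle hook, like the loop detector from the WebShop example, and a deterministic gate that rejects any harness update that regresses a previously solved task. The class name, the `before_model_call` hook, the window size, and the result format are all hypothetical and are not HarnessX's real interfaces.

```python
from collections import deque
from typing import Dict, List

class LoopAdvisor:
    """Hypothetical processor: warns when the agent keeps repeating one action."""

    def __init__(self, window: int = 3) -> None:
        self.window = window
        self.recent: deque = deque(maxlen=window)

    def before_model_call(self, context: List[str], last_action: str) -> List[str]:
        # Assumed lifecycle hook, called just before each model call.
        self.recent.append(last_action)
        repeated = len(self.recent) == self.window and len(set(self.recent)) == 1
        if repeated:
            # Inject an advisory message instead of rewriting the base prompt.
            return context + [
                "[advisory] You are repeating the same action; commit to a decision."
            ]
        return context

def gate_accepts(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> bool:
    """Deterministic gate: accept a candidate harness only if every task the
    baseline solved is still solved. Inputs map task id -> solved flag."""
    return all(
        candidate.get(task, False)
        for task, solved in baseline.items()
        if solved
    )
```

Because the gate is a plain comparison of pass/fail results rather than a model judgment, the same inputs always give the same verdict, which is what makes it a usable guard against catastrophic forgetting.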