Enterprise AI applications that handle large documents or long-horizon tasks face a severe memory bottleneck. As the context grows longer, so does the KV cache, the area where the model’s working memory is stored.

A new technique developed by researchers at MIT addresses this challenge with a fast compression method for the KV cache. The technique, called Attention Matching, manages to compact the context by up to 50x with very little loss in quality.

While it is not the only memory compaction technique available, Attention Matching stands out for its execution speed and impressive information-preserving capabilities.

The memory bottleneck of the KV cache

Large language models generate their responses sequentially, one token at a time. To avoid recalculating the entire conversation history from scratch for every predicted word, the model stores a mathematical representation of every previous token it has processed, also known as the key and value pairs. This critical working memory is known as the KV cache.

The KV cache scales with conversation length because the model is forced to retain these keys and values for all previous tokens in a given interaction. This consumes expensive hardware resources. “In practice, KV cache memory is the biggest bottleneck to serving models at ultra-long context,” Adam Zweiger, co-author of the paper, told VentureBeat. “It caps concurrency, forces smaller batches, and/or requires more aggressive offloading.” In modern enterprise use cases, such as analyzing massive legal contracts, maintaining multi-session customer dialogues, or running autonomous coding agents, the KV cache can balloon to many gigabytes of memory for a single user request.

To solve this bottleneck, the AI industry has tried several strategies, but these methods fall short when deployed in enterprise environments where extreme compression is necessary.
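To see why the cache balloons, consider the arithmetic: every token stores one key vector and one value vector per layer per KV head. The sketch below uses rough Llama-3.1-8B-style dimensions as illustrative defaults; none of these numbers come from the paper.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # One key vector and one value vector (the factor of 2) per layer,
    # per KV head, per token, at dtype_bytes per element (2 for fp16).
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

gib = kv_cache_bytes(seq_len=131_072) / 2**30
print(f"{gib:.1f} GiB")  # ~16 GiB for a single 128K-token request
```

At these dimensions, one 128K-token request consumes about 16 GiB of cache before any batching, which is why concurrency collapses at long context.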
A class of technical fixes optimizes the KV cache by either evicting tokens the model deems less important or merging similar tokens into a single representation. These techniques work for mild compression but “degrade rapidly at high reduction ratios,” according to the authors.

Real-world applications often rely on simpler techniques, the most common being to simply drop the older context once the memory limit is reached. But this causes the model to lose older information as the context grows long. Another alternative is context summarization, where the system pauses, writes a short text summary of the older context, and replaces the original memory with that summary. While this is an industry standard, summarization is highly lossy and heavily damages downstream performance because it can remove pertinent information from the context.

Recent research has shown that it is technically possible to highly compress this memory using a method called Cartridges. However, this approach requires training latent KV cache models through slow, end-to-end mathematical optimization. This gradient-based training can take several hours on expensive GPUs just to compress a single context, making it unviable for real-time enterprise applications.

How attention matching compresses without the cost

Attention Matching achieves high compaction ratios and quality while being orders of magnitude faster than gradient-based optimization. It bypasses the slow training process through clever mathematical shortcuts.

The researchers realized that to faithfully mimic how a model interacts with its memory, they need to preserve two mathematical properties when compressing the original key and value vectors into a smaller footprint. The first is the “attention output,” which is the actual information the model extracts when it queries its memory.
The second is the “attention mass,” which acts as the mathematical weight that a token has relative to everything else in the model’s working memory. If the compressed memory can match these two properties, it will behave exactly like the massive, original memory, even when new, unpredictable user prompts are added later. “Attention Matching is, in some ways, the ‘correct’ objective for doing latent context compaction in that it directly targets preserving the behavior of each attention head after compaction,” Zweiger said. While token-dropping and related heuristics can work, explicitly matching attention behavior simply leads to better results.

Before compressing the memory, the system generates a small set of “reference queries” that act as a proxy for the types of internal searches the model is likely to perform when reasoning about the specific context. If the compressed memory can accurately answer these reference queries, it will very likely succeed at answering the user’s actual questions later. The authors suggest various methods for generating these reference queries, including appending a hidden prompt to the document telling the model to repeat the previous context, known as the “repeat-prefill” technique. They also suggest a “self-study” approach where the model is prompted to perform a few quick synthetic tasks on the document, such as aggregating all key facts or structuring dates and numbers into a JSON format.

With these queries in hand, the system picks a set of keys to preserve in the compacted KV cache based on signals like the highest attention value. It then uses the keys and reference queries to calculate the matching values along with a scalar bias term.
This bias ensures that pertinent information is preserved, allowing each retained key to represent the mass of many removed keys.

This formulation makes it possible to fit the values with simple algebraic techniques, such as ordinary least squares and nonnegative least squares, entirely avoiding compute-heavy gradient-based optimization. This is what makes Attention Matching so fast compared to optimization-heavy compaction methods. The researchers also apply chunked compaction, processing contiguous chunks of the input independently and concatenating them, to further improve performance on long contexts.

Attention matching in action

To understand how this method performs in the real world, the researchers ran a series of stress tests using popular open-source models like Llama 3.1 and Qwen-3 on two distinct types of enterprise datasets. The first was QuALITY, a standard reading comprehension benchmark using 5,000 to 8,000-word documents. The second, representing a true enterprise challenge, was LongHealth, a highly dense, 60,000-token dataset containing the complex medical records of multiple patients.

The key finding was the ability of Attention Matching to compact the model’s KV cache by 50x without reducing accuracy, while taking only seconds to process the documents. To achieve that same level of quality previously, Cartridges required hours of intensive GPU computation per context.

When dealing with the dense medical records, standard industry workarounds completely collapsed. The researchers noted that when they tried standard text summarization on these patient records, the model’s accuracy dropped so low that it matched the “no-context” baseline, meaning the model performed as if it had not read the document at all. Attention Matching drastically outperforms summarization, but enterprise architects will need to dial down the compression ratio for dense tasks compared to simpler reading comprehension tests.
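The least-squares formulation can be illustrated with a single-head toy version: score a set of reference queries against the original cache, keep the keys carrying the most attention mass, then fit replacement values by ordinary least squares so the compressed cache reproduces the original attention outputs. This sketch omits the scalar bias term, the nonnegative variant, and all multi-head details described in the paper; shapes, data, and the selection heuristic are made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compact_kv(K, V, Q, keep):
    """Compress (K, V) to `keep` entries so that attention outputs on the
    reference queries Q are approximately preserved (single-head sketch)."""
    d = K.shape[1]
    A = softmax(Q @ K.T / np.sqrt(d))      # (m, n): reference attention weights
    target = A @ V                         # (m, d): attention outputs to preserve
    idx = np.argsort(A.sum(axis=0))[-keep:]  # keep keys with the most attention mass
    K_s = K[idx]
    W = softmax(Q @ K_s.T / np.sqrt(d))    # (m, keep): attention over kept keys
    # Fit new values by ordinary least squares instead of gradient descent.
    V_s, *_ = np.linalg.lstsq(W, target, rcond=None)
    return K_s, V_s

rng = np.random.default_rng(0)
K, V = rng.normal(size=(64, 16)), rng.normal(size=(64, 16))
Q = rng.normal(size=(32, 16))              # reference queries
K_s, V_s = compact_kv(K, V, Q, keep=16)    # 4x compaction of the toy cache
```

Because the fit is a closed-form linear solve rather than an optimization loop, compaction takes milliseconds even as the context grows, which is the source of the speedup over Cartridges-style training.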
As Zweiger explains, “The main practical tradeoff is that if you are trying to preserve nearly everything in-context on highly information-dense tasks, you generally need a milder compaction ratio to retain strong accuracy.”

The researchers also explored what happens in cases where absolute precision isn’t necessary but extreme memory savings are. They ran Attention Matching on top of a standard text summary. This combined approach achieved 200x compression, matching the accuracy of standard summarization alone but with a much smaller memory footprint.

One of the most interesting experiments for enterprise workflows was online compaction, though the authors note this is a proof of concept that has not been tested rigorously in production environments. The researchers tested the model on the advanced AIME math reasoning benchmark, forcing it to solve problems under a strictly capped memory limit. Whenever the model’s memory filled up, the system paused, instantly compressed its working memory by 50 percent using Attention Matching, and let it continue thinking. Even after hitting the memory wall and having its KV cache shrunk up to six consecutive times mid-thought, the model successfully solved the math problems, matching the performance of a model given unlimited memory.

There are caveats to consider. At a 50x compression ratio, Attention Matching is the clear winner in balancing speed and quality. However, if an enterprise pushes compression to extreme 100x limits on highly complex data, the slower, gradient-based Cartridges method actually outperforms it.

The researchers have released the code for Attention Matching. However, they note that this is not currently a simple plug-and-play software update. “I think latent compaction is best considered a model-layer technique,” Zweiger notes.
“While it can be applied on top of any existing model, it requires access to model weights.” This means enterprises relying entirely on closed APIs cannot implement this themselves; they need open-weight models.

The authors note that integrating this latent-space KV compaction into existing, highly optimized commercial inference engines still requires significant effort. Modern AI infrastructure uses complex tricks like prefix caching and variable-length memory packing to keep servers running efficiently, and seamlessly weaving this new compaction technique into those systems will take dedicated engineering work.

However, there are immediate enterprise applications. “We believe compaction after ingestion is a promising use case, where large tool call outputs or long documents are compacted right after being processed,” Zweiger said.

Ultimately, the shift toward mechanical, latent-space compaction aligns with the future product roadmaps of major AI players, Zweiger argues. “We are seeing compaction shift from something enterprises implement themselves into something model providers ship,” Zweiger said. “This is even more true for latent compaction, where access to model weights is needed. For example, OpenAI now exposes a black-box compaction endpoint that returns an opaque object rather than a plain-text summary.”
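The online-compaction experiment described above reduces to a simple control loop: decode until the KV cache hits its budget, shrink it in half, and resume. A toy simulation of that loop follows; the halving policy here just keeps the most recent entries, whereas Attention Matching would instead fit a smaller cache that preserves attention behavior, and the step count and budget are arbitrary.

```python
def generate_with_budget(n_steps, budget, compact):
    """Simulate decoding under a hard KV-cache budget: whenever the cache
    fills, compact it to half its size and continue generating."""
    cache, compactions = [], 0
    for t in range(n_steps):
        cache.append(t)  # stand-in for one new token's key/value entry
        if len(cache) >= budget:
            cache = compact(cache, len(cache) // 2)
            compactions += 1
    return cache, compactions

# Toy policy: keep only the most recent half of the cache.
keep_recent = lambda cache, k: cache[-k:]
cache, n = generate_with_budget(n_steps=2000, budget=512, compact=keep_recent)
print(n, len(cache))
```

The cache never exceeds the budget, at the cost of repeated mid-generation compaction events, which is exactly the regime the AIME experiment stress-tested.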
VentureBeat
Google PM open-sources Always On Memory Agent, ditching vector databases for LLM-driven persistent memory
Google senior AI product manager Shubham Saboo has turned one of the thorniest problems in agent design into an open-source engineering exercise: persistent memory.

This week, he published an open-source “Always On Memory Agent” on the official Google Cloud Platform GitHub page under a permissive MIT License, allowing commercial usage.

It was built with Google’s Agent Development Kit, or ADK, introduced in spring 2025, and Gemini 3.1 Flash-Lite, a low-cost model Google introduced on March 3, 2026 as its fastest and most cost-efficient Gemini 3 series model. The project serves as a practical reference implementation for something many AI teams want but few have productionized cleanly: an agent system that can ingest information continuously, consolidate it in the background, and retrieve it later without relying on a conventional vector database.

For enterprise developers, the release matters less as a product launch than as a signal about where agent infrastructure is headed. The repo packages a view of long-running autonomy that is increasingly attractive for support systems, research assistants, internal copilots and workflow automation. It also brings governance questions into sharper focus as soon as memory stops being session-bound.

What the repo appears to do — and what it does not clearly claim

The repo appears to use a multi-agent internal architecture, with specialist components handling ingestion, consolidation and querying. But the supplied materials do not clearly establish a broader claim that this is a shared memory framework for multiple independent agents. That distinction matters. ADK as a framework supports multi-agent systems, but this specific repo is best described as an always-on memory agent, or memory layer, built with specialist subagents and persistent storage.
Even at this narrower level, it addresses a core infrastructure problem many teams are actively working through.

The architecture favors simplicity over a traditional retrieval stack

According to the repository, the agent runs continuously, ingests files or API input, stores structured memories in SQLite, and performs scheduled memory consolidation every 30 minutes by default. A local HTTP API and Streamlit dashboard are included, and the system supports text, image, audio, video and PDF ingestion. The repo frames the design with an intentionally provocative claim: “No vector database. No embeddings. Just an LLM that reads, thinks, and writes structured memory.”

That design choice is likely to draw attention from developers managing cost and operational complexity. Traditional retrieval stacks often require separate embedding pipelines, vector storage, indexing logic and synchronization work. Saboo’s example instead leans on the model to organize and update memory directly. In practice, that can simplify prototypes and reduce infrastructure sprawl, especially for smaller or medium-memory agents. It also shifts the performance question from vector search overhead to model latency, memory compaction logic and long-run behavioral stability.

Flash-Lite gives the always-on model some economic logic

That is where Gemini 3.1 Flash-Lite enters the story. Google says the model is built for high-volume developer workloads at scale and priced at $0.25 per 1 million input tokens and $1.50 per 1 million output tokens. The company also says Flash-Lite is 2.5 times faster than Gemini 2.5 Flash in time to first token and delivers a 45% increase in output speed while maintaining similar or better quality. On Google’s published benchmarks, the model posts an Elo score of 1432 on Arena.ai, 86.9% on GPQA Diamond and 76.8% on MMMU Pro.
Google positions those characteristics as a fit for high-frequency tasks such as translation, moderation, UI generation and simulation.

Those numbers help explain why Flash-Lite is paired with a background-memory agent. A 24/7 service that periodically re-reads, consolidates and serves memory needs predictable latency and low enough inference cost to avoid making “always on” prohibitively expensive.

Google’s ADK documentation reinforces the broader story. The framework is presented as model-agnostic and deployment-agnostic, with support for workflow agents, multi-agent systems, tools, evaluation and deployment targets including Cloud Run and Vertex AI Agent Engine. That combination makes the memory agent feel less like a one-off demo and more like a reference point for a broader agent runtime strategy.

The enterprise debate is about governance, not just capability

Public reaction shows why enterprise adoption of persistent memory will not hinge on speed or token pricing alone. Several responses on X highlighted exactly the concerns enterprise architects are likely to raise. Franck Abe called Google ADK and 24/7 memory consolidation “brilliant leaps for continuous agent autonomy,” but warned that an agent “dreaming” and cross-pollinating memories in the background without deterministic boundaries becomes “a compliance nightmare.” ELED made a related point, arguing that the main cost of always-on agents is not tokens but “drift and loops.”

Those critiques go directly to the operational burden of persistent systems: who can write memory, what gets merged, how retention works, when memories are deleted, and how teams audit what the agent learned over time.

Another reaction, from Iffy, challenged the repo’s “no embeddings” framing, arguing that the system still has to chunk, index and retrieve structured memory, and that it may work well for small-context agents but break down once memory stores become much larger. That criticism is technically important.
Removing a vector database does not remove retrieval design; it changes where the complexity lives. For developers, the tradeoff is less about ideology than fit. A lighter stack may be attractive for low-cost, bounded-memory agents, while larger-scale deployments may still demand stricter retrieval controls, more explicit indexing strategies and stronger lifecycle tooling.

ADK broadens the story beyond a single demo

Other commenters focused on developer workflow. One asked for the ADK repo and documentation and wanted to know whether the runtime is serverless or long-running, and whether tool-calling and evaluation hooks are available out of the box. Based on the supplied materials, the answer is effectively both: the memory-agent example itself is structured like a long-running service, while ADK more broadly supports multiple deployment patterns and includes tools and evaluation capabilities.

The always-on memory agent is interesting on its own, but the larger message is that Saboo is trying to make agents feel like deployable software systems rather than isolated prompts. In that framing, memory becomes part of the runtime layer, not just an add-on feature.

What Saboo has shown — and what he has not

What Saboo has not shown yet is just as important as what he’s published. The provided materials do not include a direct Flash-Lite versus Anthropic Claude Haiku benchmark for agent loops in production use. They also do not lay out enterprise-grade compliance controls specific to this memory agent, such as deterministic policy boundaries, retention guarantees, segregation rules or formal audit workflows. And while the repo appears to use multiple specialist agents internally, the materials do not clearly prove a larger claim about persistent memory shared across multiple independent agents.

For now, the repo reads as a compelling engineering template rather than a complete enterprise memory platform.

Why this matters now

Still, the release lands at the right time.
Enterprise AI teams are moving beyond single-turn assistants and into systems expected to remember preferences, preserve project context and operate across longer horizons. Saboo’s open-source memory agent offers a concrete starting point for that next layer of infrastructure, and Flash-Lite gives the economics some credibility.

But the strongest takeaway from the reaction around the launch is that continuous memory will be judged on governance as much as capability. That is the real enterprise question behind Saboo’s demo: not whether an agent can remember, but whether it can remember in ways that stay bounded, inspectable and safe enough to trust in production.
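The ingest-then-consolidate pattern the repo describes is easy to prototype with Python's built-in sqlite3. The schema and merge logic below are invented for illustration and are not taken from Saboo's repo, which would delegate the rewrite step to Gemini rather than concatenate strings.

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE memories (
    id INTEGER PRIMARY KEY, topic TEXT, content TEXT, updated_at REAL)""")

def ingest(topic, content):
    # Ingestion path: append a structured memory row. No embeddings involved;
    # the model, not a vector index, decides structure.
    db.execute("INSERT INTO memories (topic, content, updated_at) VALUES (?, ?, ?)",
               (topic, content, time.time()))

def consolidate():
    # Background consolidation pass (the repo schedules this every 30 minutes).
    # Here we merge rows per topic by concatenation; the real agent would ask
    # the LLM to rewrite them into one coherent memory.
    topics = [t for (t,) in db.execute("SELECT DISTINCT topic FROM memories")]
    for topic in topics:
        rows = [r[0] for r in db.execute(
            "SELECT content FROM memories WHERE topic=? ORDER BY updated_at", (topic,))]
        if len(rows) > 1:
            db.execute("DELETE FROM memories WHERE topic=?", (topic,))
            ingest(topic, " ".join(rows))

ingest("user_prefs", "Prefers dark mode.")
ingest("user_prefs", "Works in UTC+2.")
consolidate()
```

Even this toy version surfaces the governance questions raised on X: the consolidation step silently rewrites history, so auditing what was merged and when becomes a design requirement, not an afterthought.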
Google Workspace CLI brings Gmail, Docs, Sheets and more into a common interface for AI agents
What’s old is new: the command line — the original, clunky non-graphical interface for interacting with and controlling PCs, where the user typed raw text commands — has become one of the most important interfaces in agentic AI.

That shift has been driven in part by the rise of coding-native tools such as Claude Code and Kilo CLI, which have helped establish a model where AI agents do not just answer questions in chat windows but execute real tasks through a shared, scriptable interface already familiar to developers — and one that can still be found on virtually all PCs. For developers, the appeal is practical: the CLI is inspectable, composable and easier to control than a patchwork of custom app integrations.

Now, Google Workspace — the umbrella term for Google’s suite of enterprise cloud apps, including Drive, Gmail, Calendar, Sheets, Docs, Chat and Admin — is moving into that pattern with a new CLI that lets developers and agents access these applications and the data within them directly, without relying on third-party connectors.

The project, googleworkspace/cli, describes itself as “one CLI for all of Google Workspace — built for humans and AI agents,” with structured JSON output and agent-oriented workflows included.

In an X post yesterday, Google Cloud director Addy Osmani introduced the Google Workspace CLI as “built for humans and agents,” adding that it covers “Google Drive, Gmail, Calendar, and every Workspace API.” While the tool is not officially supported by Google, other posts cast the release as a broader turning point for automation and agent access to enterprise productivity software.
Now, instead of having to set up third-party connectors like Zapier to access data and use AI agents to automate work across the Google Workspace suite of apps, enterprise developers (or indie devs and users, for that matter) can install the open-source (Apache 2.0) Google Workspace CLI from GitHub and begin setting up automated agentic workflows directly in the terminal, asking their AI model to sort email, respond, edit docs and files, and more.

Why the CLI model is gaining traction

For enterprise developers, the importance of the release is not that Google suddenly made Workspace programmable. Workspace APIs have long been available. What changes here is the interface. Instead of forcing teams to build and maintain separate wrappers around individual APIs, the CLI offers a unified command surface with structured output. Installation is straightforward — npm install -g @googleworkspace/cli — and the repo says the package includes prebuilt binaries, with releases also available through GitHub.

The repo also says gws reads Google’s Discovery Service at runtime and dynamically builds its command surface, allowing new Workspace API methods to appear without waiting for a manually maintained static tool definition to catch up. For teams building agents or internal automation, that is a meaningful operational advantage. It reduces glue code, lowers maintenance overhead and makes Workspace easier to treat as a programmable runtime rather than a collection of separate SaaS applications.

What developers and enterprises actually get

The CLI is designed for both direct human use and agent-driven workflows. For developers working in the terminal, the README highlights features such as per-resource help, dry-run previews, schema inspection and auto-pagination.
For agents, the value is clearer still: structured JSON output, reusable commands and built-in skills that let models interact with Workspace data and actions without a custom integration layer.

That creates immediate utility for internal enterprise workflows. Teams can use the tool to list Drive files, create spreadsheets, inspect request and response schemas, send Chat messages and paginate through large result sets from the terminal. The README also says the repo ships more than 100 agent skills, including helpers and curated recipes for Gmail, Drive, Docs, Calendar and Sheets.

That matters because Workspace remains one of the most common systems of record for day-to-day business work. Email, calendars, internal docs, spreadsheets and shared files are often where operational context lives. A CLI that exposes those surfaces through a common, agent-friendly interface makes it easier to build assistants that retrieve information, trigger actions and automate repetitive processes with less bespoke plumbing.

The important caveat: visible, but not officially supported

The social-media response has been enthusiastic, but enterprises should read the repo carefully before treating the project as a formal Google platform commitment. The README explicitly says: “This is not an officially supported Google product.” It also says the project is under active development and warns users to expect breaking changes as it moves toward v1.0.

That does not diminish the technical relevance of the release. It does, however, shape how enterprise teams should think about adoption.
Today, this looks more like a promising developer tool with strong momentum than a production platform that large organizations should standardize on immediately.

This is a cleaner interface, not a governance bypass

The other key point is that the CLI does not bypass the underlying controls that govern Workspace access. The documentation says users still need a Google Cloud project for OAuth credentials and a Google account with Workspace access. It also outlines multiple authentication patterns for local development, CI and service accounts, along with instructions for enabling APIs and handling setup issues.

For enterprises, that is the right way to interpret the tool. It is not magic access to Gmail, Docs or Sheets. It is a more usable abstraction over the same permissions, scopes and admin controls companies already manage.

Not a rejection of MCP, but a broader agent interface strategy

Some of the early commentary around the tool frames it as a cleaner alternative to Model Context Protocol (MCP)-heavy setups, arguing that CLI-driven execution can avoid wasting context window on large tool definitions. There is some logic to that argument, especially for agent systems that can call shell commands directly and parse JSON responses.

But the repo itself presents a more nuanced picture. It includes a Gemini CLI extension that gives Gemini agents access to gws commands and Workspace agent skills after terminal authentication. It also includes an MCP server mode through gws mcp, exposing Workspace APIs as structured tools for MCP-compatible clients including Claude Desktop, Gemini CLI and VS Code.

The strategic takeaway is not that Google Workspace is choosing CLI instead of MCP. It is that the CLI is emerging as the base interface, with MCP available where it makes sense.

What enterprises should do now

The right near-term move for enterprises is not broad rollout.
It is targeted evaluation. Developer productivity, platform engineering and IT automation teams should test the tool in a sandboxed Workspace environment and identify a narrow set of high-friction use cases where a CLI-first approach could reduce integration work. File discovery, spreadsheet updates, document generation, calendar operations and internal reporting are natural starting points.

Security and identity teams should review authentication patterns early and determine how tightly permissions, scopes and service-account usage can be constrained and monitored. AI platform teams, meanwhile, should compare direct CLI execution against MCP-based approaches in real workflows, focusing on reliability, prompt overhead and operational simplicity.

The broader trend is clear. As agentic software matures, the command line is becoming a common control plane for both developers and AI systems. Google Workspace’s new CLI does not change enterprise automation overnight. But it does make one of the most widely used productivity stacks easier to access through the interface that agent builders increasingly prefer.
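The agent-facing pattern this kind of CLI enables (shell out, parse structured JSON, follow pagination) looks roughly like this in Python. The command and the nextPageToken field name below are stand-ins; consult the repo's README for gws's actual subcommands and output shapes.

```python
import json
import subprocess

def run_json(cmd):
    """Run a CLI command and parse its structured JSON output.
    `cmd` is a list of argv strings, e.g. a gws invocation in real use."""
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return json.loads(out)

def paginate(make_cmd):
    """Collect all items from a paginated command. `make_cmd(token)` should
    return the argv for fetching one page; the 'nextPageToken' and 'items'
    field names are illustrative, not guaranteed by the tool."""
    token, items = None, []
    while True:
        page = run_json(make_cmd(token))
        items += page.get("items", [])
        token = page.get("nextPageToken")
        if not token:
            return items
```

This is the whole appeal for agent builders: one subprocess call and one json.loads replace a bespoke API client, and the same two helpers work for every Workspace surface the CLI exposes.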
OpenAI launches GPT-5.4 with native computer use mode, financial plugins for Microsoft Excel, Google Sheets
The AI updates aren’t slowing down. Just two days after OpenAI launched a new underlying AI model for ChatGPT called GPT-5.3 Instant, the company has unveiled another, even more significant upgrade: GPT-5.4.

GPT-5.4 comes in two varieties: GPT-5.4 Thinking and GPT-5.4 Pro, the latter designed for the most complex tasks.

Both will be available in OpenAI’s paid application programming interface (API) and its Codex software development application. GPT-5.4 Thinking will be available to all paid subscribers of ChatGPT (Plus, the $20-per-month plan, and up), while Pro will be reserved for ChatGPT Pro ($200 monthly) and Enterprise plan users. ChatGPT Free users will also get a taste of GPT-5.4, but only when their queries are auto-routed to the model, according to an OpenAI spokesperson.

The big headlines of this release are efficiency, with OpenAI reporting that GPT-5.4 uses far fewer tokens (47% fewer on some tasks) than its predecessors, and, arguably even more impressive, a new “native” Computer Use mode, available through the API and Codex, that lets GPT-5.4 navigate a user’s computer like a human and work across applications.

The company is also releasing a new Financial Services suite that allows GPT-5.4 to be plugged directly into users’ Microsoft Excel and Google Sheets spreadsheets and cells, enabling granular analysis and automated task completion that should speed up work across the enterprise, though it may also make fears of white-collar layoffs more pronounced on the heels of similar offerings from Anthropic’s Claude and its new Cowork application.

OpenAI says GPT-5.4 supports up to 1 million tokens of context in the API and Codex, enabling agents to plan, execute, and verify tasks across long horizons. However, it charges double the cost per 1 million tokens once the input exceeds 272,000 tokens.
Native computer use: a step toward autonomous workflows

The most consequential capability OpenAI highlights is that GPT-5.4 is its first general-purpose model released with native, state-of-the-art computer-use capabilities in Codex and the API, enabling agents to operate computers and carry out multi-step workflows across applications. OpenAI says the model can both write code to operate computers via libraries like Playwright and issue mouse and keyboard commands in response to screenshots. OpenAI also claims a jump in agentic web browsing.

Benchmark results are presented as evidence that this is not merely a UI wrapper. On BrowseComp, which measures how well AI agents can persistently browse the web to find hard-to-locate information, OpenAI reports GPT-5.4 improving by 17 percentage points over GPT-5.2, and GPT-5.4 Pro reaching 89.3%, described as a new state of the art. On OSWorld-Verified, which measures desktop navigation using screenshots plus keyboard and mouse actions, OpenAI reports GPT-5.4 at 75.0% success, compared to 47.3% for GPT-5.2, and notes reported human performance at 72.4%. On WebArena-Verified, GPT-5.4 reaches 67.3% success using both DOM- and screenshot-driven interaction, compared to 65.4% for GPT-5.2. On Online-Mind2Web, OpenAI reports 92.8% success using screenshot-based observations alone.

OpenAI also links computer use to improvements in vision and document handling. On MMMU-Pro, GPT-5.4 reaches 81.2% success without tool use, compared with 79.5% for GPT-5.2, and OpenAI says it achieves that result using a fraction of the “thinking tokens.” On OmniDocBench, GPT-5.4’s average error is reported at 0.109, improved from 0.140 for GPT-5.2. The post also describes expanded support for high-fidelity image inputs, including an “original” detail level up to 10.24M pixels.

OpenAI positions GPT-5.4 as built for longer, multi-step workflows — work that increasingly looks like an agent keeping state across many actions rather than a chatbot responding once.
Tool search and improved tool orchestration

As tool ecosystems get larger, OpenAI argues that the naive approach — dumping every tool definition into the prompt — creates a tax paid on every request: cost, latency, and context pollution. GPT-5.4 introduces tool search in the API as a structural fix. Instead of receiving all tool definitions upfront, the model receives a lightweight list of tools plus a search capability, and it retrieves full tool definitions only when they’re actually needed.

OpenAI describes the efficiency win with a concrete comparison: on 250 tasks from Scale’s MCP Atlas benchmark, running with 36 MCP servers enabled, the tool-search configuration reduced total token usage by 47% while achieving the same accuracy as a configuration that exposed all MCP functions directly in context. That 47% figure is specifically about the tool-search setup in that evaluation, not a blanket claim that GPT-5.4 uses 47% fewer tokens for every kind of task.

Improvements for developers and coding workflows

OpenAI’s coding pitch is that GPT-5.4 combines the coding strengths of GPT-5.3-Codex with stronger tool and computer-use capabilities that matter when tasks aren’t single-shot. GPT-5.4 matches or outperforms GPT-5.3-Codex on SWE-Bench Pro while delivering lower latency across reasoning efforts.

Codex also gets workflow-level knobs.
OpenAI says /fast mode delivers up to 1.5× faster performance across supported models, including GPT-5.4, describing it as the same model and intelligence “just faster.” And it describes releasing an experimental Codex skill, “Playwright (Interactive)”, meant to demonstrate how coding and computer use can work in tandem—visually debugging web and Electron apps and testing an app as it’s being built.

OpenAI for Financial Services

Alongside GPT-5.4, OpenAI is announcing OpenAI for Financial Services, a suite of secure AI products in ChatGPT built for enterprises and financial institutions, powered by GPT-5.4 for advanced financial reasoning and Excel-based modeling.

The centerpiece is ChatGPT for Excel and Google Sheets (beta), which OpenAI describes as ChatGPT embedded directly in spreadsheets to build, analyze, and update complex financial models using the formulas and structures teams already rely on.

The suite also includes new ChatGPT app integrations intended to unify market, company, and internal data into a single workflow, naming FactSet, MSCI, Third Bridge, and Moody’s. And it introduces reusable “Skills” for recurring finance work such as earnings previews, comparables analysis, DCF analysis, and investment memo drafting.

OpenAI anchors the finance push with an internal benchmark claim: model performance increased from 43.7% with GPT-5 to 88.0% with GPT-5.4 Thinking on an OpenAI internal investment banking benchmark.

Measuring AI performance against professional work

OpenAI leans on benchmarks intended to resemble real office deliverables, not just puzzle-solving. On GDPval, an evaluation spanning “well-specified knowledge work” across 44 occupations, OpenAI reports that GPT-5.4 matches or exceeds industry professionals in 83.0% of comparisons, compared to 71.0% for GPT-5.2.

The company also highlights specific improvements in the kinds of artifacts that tend to expose model weaknesses: structured tables, formulas, narrative coherence, and design quality.
In an internal benchmark of spreadsheet modeling tasks modeled after what a junior investment banking analyst might do, GPT-5.4 reaches a mean score of 87.5%, compared to 68.4% for GPT-5.2. And on a set of presentation evaluation prompts, OpenAI says human raters preferred GPT-5.4’s presentations 68.0% of the time over GPT-5.2’s, citing stronger aesthetics, greater visual variety, and more effective use of image generation.

Improving reliability and reducing hallucinations

OpenAI describes GPT-5.4 as its most factual model yet and connects that claim to a practical dataset: de-identified prompts where users previously flagged factual errors. On that set, OpenAI reports GPT-5.4’s individual claims are 33% less likely to be false and its full responses are 18% less likely to contain any errors compared to GPT-5.2.

In statements provided to VentureBeat by OpenAI and attributed to early GPT-5.4 testers, Daniel Swiecki of Walleye Capital says that on internal finance and Excel evaluations, GPT-5.4 improved accuracy by 30 percentage points, which he links to expanded automation for model updates and scenario analysis. Brendan Foody, CEO of Mercor, calls GPT-5.4 the best model the company has tried and says it’s now top of Mercor’s APEX-Agents benchmark for professional services work, emphasizing long-horizon deliverables like slide decks, financial models, and legal analysis.

Pricing and availability

In the API, OpenAI says GPT-5.4 Thinking is available as gpt-5.4 and GPT-5.4 Pro as gpt-5.4-pro.
Pricing is as follows:

- GPT-5.4: $2.50 / 1M input tokens; $15 / 1M output tokens
- GPT-5.4 Pro: $30 / 1M input tokens; $180 / 1M output tokens
- Batch + Flex: half-rate; Priority processing: 2× rate

This makes GPT-5.4 among the more expensive models to run over API compared to the entire field, as seen in the table below.

| Model | Input | Output | Total Cost | Source |
| --- | --- | --- | --- | --- |
| Qwen 3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud |
| Qwen3.5-Flash | $0.10 | $0.40 | $0.50 | Alibaba Cloud |
| deepseek-chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| deepseek-reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| MiniMax M2.5 | $0.15 | $1.20 | $1.35 | MiniMax |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | $1.75 | Google |
| MiniMax M2.5-Lightning | $0.30 | $2.40 | $2.70 | MiniMax |
| Gemini 3 Flash Preview | $0.50 | $3.00 | $3.50 | Google |
| Kimi-k2.5 | $0.60 | $3.00 | $3.60 | Moonshot |
| GLM-5 | $1.00 | $3.20 | $4.20 | Z.ai |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Baidu |
| Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | Anthropic |
| Qwen3-Max (2026-01-23) | $1.20 | $6.00 | $7.20 | Alibaba Cloud |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google |
| GPT-5.2 | $1.75 | $14.00 | $15.75 | OpenAI |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $18.00 | Anthropic |
| GPT-5.4 | $2.50 | $15.00 | $17.50 | OpenAI |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google |
| Claude Opus 4.6 | $5.00 | $25.00 | $30.00 | Anthropic |
| GPT-5.2 Pro | $21.00 | $168.00 | $189.00 | OpenAI |
| GPT-5.4 Pro | $30.00 | $180.00 | $210.00 | OpenAI |

Another important note: with GPT-5.4, requests that exceed 272,000 input tokens are billed at 2X the normal rate, reflecting the ability to send prompts larger than earlier models supported.

In Codex, compaction defaults to 272k tokens, and the higher long-context pricing applies only when the input exceeds 272k—meaning developers can keep sending prompts at or under that size without triggering the higher rate, but can opt into larger prompts by raising the compaction limit, with only those larger requests billed differently.

An OpenAI spokesperson said that in the API the maximum output is 128,000 tokens, the same as previous models.

Finally, on why GPT-5.4 is priced higher at baseline, the spokesperson attributed it to three
factors: higher capability on complex tasks (including coding, computer use, deep research, advanced document generation, and tool use), major research improvements from OpenAI’s roadmap, and more efficient reasoning that uses fewer reasoning tokens for comparable tasks—adding that OpenAI believes GPT-5.4 remains below comparable frontier models on pricing even with the increase.

The broader shift

Across the release and the follow-up clarifications, GPT-5.4 is positioned as a model meant to move beyond “answer generation” and into sustained professional workflows—ones that require tool orchestration, computer interaction, long context, and outputs that look like the artifacts people actually use at work. OpenAI’s emphasis on token efficiency, tool search, native computer use, and reduced user-flagged factual errors all point in the same direction: making agentic systems more viable in production by lowering the cost of retries—whether that retry is a human re-prompting, an agent calling another tool, or a workflow re-running because the first pass didn’t stick.
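For readers estimating costs, the long-context billing rule described in the pricing section can be sketched as a small calculator. One caveat: the post says requests exceeding 272,000 input tokens are "billed at 2X the normal rate" without specifying proration, so this sketch assumes the multiplier applies to the whole request; the rate figures are the ones listed above.

```python
# Back-of-the-envelope cost check for the listed GPT-5.4 API rates.
# Assumption: the 2x long-context rate applies to the entire request once
# input exceeds 272,000 tokens (the announcement does not spell out proration).

RATES = {                      # $ per 1M tokens: (input, output)
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.4-pro": (30.00, 180.00),
}
LONG_CONTEXT_THRESHOLD = 272_000

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of a single API request."""
    in_rate, out_rate = RATES[model]
    multiplier = 2 if input_tokens > LONG_CONTEXT_THRESHOLD else 1
    cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    return round(multiplier * cost, 4)

# A 100k-in / 10k-out request stays under the threshold; a 300k-in
# request of the same shape costs more than double under this assumption.
small = request_cost("gpt-5.4", 100_000, 10_000)
large = request_cost("gpt-5.4", 300_000, 10_000)
```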
Databricks built a RAG agent it says can handle every kind of enterprise search
Most enterprise RAG pipelines are optimized for one search behavior. They fail silently on the others. A model trained to synthesize cross-document reports handles constraint-driven entity search poorly. A model tuned for simple lookup tasks falls apart on multi-step reasoning over internal notes. Most teams find out when something breaks.

Databricks set out to fix that with KARL, short for Knowledge Agents via Reinforcement Learning. The company trained an agent across six distinct enterprise search behaviors simultaneously using a new reinforcement learning algorithm. The result, the company claims, is a model that matches Claude Opus 4.6 on a purpose-built benchmark at 33% lower cost per query and 47% lower latency, trained entirely on synthetic data the agent generated itself with no human labeling required. That comparison is based on KARLBench, which Databricks built to evaluate enterprise search behaviors.

“A lot of the big reinforcement learning wins that we’ve seen in the community in the past year have been on verifiable tasks where there is a right and a wrong answer,” Jonathan Frankle, Chief AI Scientist at Databricks, told VentureBeat in an exclusive interview. “The tasks that we’re working on for KARL, and that are just normal for most enterprises, are not strictly verifiable in that same way.”

Those tasks include synthesizing intelligence across product manager meeting notes, reconstructing competitive deal outcomes from fragmented customer records, answering questions about account history where no single document has the full answer and generating battle cards from unstructured internal data. None of those has a single correct answer that a system can check automatically.

“Doing reinforcement learning in a world where you don’t have a strict right and wrong answer, and figuring out how to guide the process and make sure reward hacking doesn’t happen — that’s really non-trivial,” Frankle said.
“Very little of what companies do day to day on knowledge tasks are verifiable.”

The generalization trap in enterprise RAG

Standard RAG breaks down on ambiguous, multi-step queries drawing on fragmented internal data that was never designed to be queried.

To evaluate KARL, Databricks built the KARLBench benchmark to measure performance across six enterprise search behaviors: constraint-driven entity search, cross-document report synthesis, long-document traversal with tabular numerical reasoning, exhaustive entity retrieval, procedural reasoning over technical documentation and fact aggregation over internal company notes. That last task is PMBench, built from Databricks’ own product manager meeting notes — fragmented, ambiguous and unstructured in ways that frontier models handle poorly.

Training on any single task and testing on the others produces poor results. The KARL paper shows that multi-task RL generalizes in ways single-task training does not. The team trained KARL on synthetic data for two of the six tasks and found it performed well on all four it had never seen.

To build a competitive battle card for a financial services customer, for example, the agent has to identify relevant accounts, filter for recency, reconstruct past competitive deals and infer outcomes — none of which is labeled anywhere in the data.

Frankle calls what KARL does “grounded reasoning”: running a difficult reasoning chain while anchoring every step in retrieved facts. “You can think of this as RAG,” he said, “but like RAG plus plus plus plus plus plus, all the way up to 200 vector database calls.”

The RL engine: why OAPL matters

KARL’s training is powered by OAPL, short for Optimal Advantage-based Policy Optimization with Lagged Inference policy.
It’s a new approach, developed jointly by researchers from Cornell, Databricks and Harvard and published in a separate paper the week before KARL.

Standard LLM reinforcement learning uses on-policy algorithms like GRPO (Group Relative Policy Optimization), which assume the model generating training data and the model being updated are in sync. In distributed training, they never are. Prior approaches corrected for this with importance sampling, introducing variance and instability. OAPL embraces the off-policy nature of distributed training instead, using a regression objective that stays stable with policy lags of more than 400 gradient steps, 100 times more off-policy than prior approaches handled. In code generation experiments, it matched a GRPO-trained model using roughly three times fewer training samples.

OAPL’s sample efficiency is what keeps the training budget accessible. Reusing previously collected rollouts rather than requiring fresh on-policy data for every update meant the full KARL training run stayed within a few thousand GPU hours. That is the difference between a research project and something an enterprise team can realistically attempt.

Agents, memory and the context stack

There has been a lot of discussion in the industry in recent months about how RAG can be replaced with contextual memory, also sometimes referred to as agentic memory.

For Frankle, it’s not an either/or discussion; rather, he sees it as a layered stack. A vector database with millions of entries sits at the base, which is too large for context. The LLM context window sits at the top. Between them, compression and caching layers are emerging that determine how much of what an agent has already learned it can carry forward.

For KARL, this is not abstract. Some KARLBench tasks required 200 sequential vector database queries, with the agent refining searches, verifying details and cross-referencing documents before committing to an answer, exhausting the context window many times over.
Rather than training a separate summarization model, the team let KARL learn compression end-to-end through RL: when context grows too large, the agent compresses it and continues, with the only training signal being the reward at the end of the task. Removing that learned compression dropped accuracy on one benchmark from 57% to 39%.

“We just let the model figure out how to compress its own context,” Frankle said. “And this worked phenomenally well.”

Where KARL falls short

Frankle was candid about the failure modes. KARL struggles most on questions with significant ambiguity, where multiple valid answers exist and the model can’t determine whether the question is genuinely open-ended or just hard to answer. That judgment call is still an unsolved problem.

The model also exhibits what Frankle described as giving up early on some queries — stopping before producing a final answer. He pushed back on framing this as a failure, noting that the most expensive queries are typically the ones the model gets wrong anyway. Stopping is often the right call.

KARL was also trained and evaluated exclusively on vector search. Tasks requiring SQL queries, file search, or Python-based calculation are not yet in scope. Frankle said those capabilities are next on the roadmap, but they are not in the current system.

What this means for enterprise data teams

KARL surfaces three decisions worth revisiting for teams evaluating their retrieval infrastructure.

The first is pipeline architecture. If your RAG agent is optimized for one search behavior, the KARL results suggest it is failing on others. Multi-task training across diverse retrieval behaviors produces models that generalize. Narrow pipelines do not.

The second is why RL matters here — and it’s not just a training detail. Databricks tested the alternative: distilling from expert models via supervised fine-tuning. That approach improved in-distribution performance but produced negligible gains on tasks the model had never seen.
RL developed general search behaviors that transferred. For enterprise teams facing heterogeneous data and unpredictable query types, that distinction is the whole game.
The third is what RL efficiency actually means in practice. A model trained to search better completes tasks in fewer steps, stops earlier on queries it cannot answer, diversifies its search rather than repeating failed queries, and compresses its own context rather than running out of room. The argument for training purpose-built search agents rather than routing everything through general-purpose frontier APIs is not primarily about cost. It is about building a model that knows how to do the job.
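The search-agent behaviors described in this article—iterative vector search, early stopping when queries stop yielding new facts, and compressing the working context when it grows too large—can be sketched as a simple loop. This is an illustrative sketch, not Databricks' implementation: the function names, the truncation-based stand-in for KARL's learned compression, and the toy "vector search" are all assumptions.

```python
# Illustrative sketch (not Databricks' code) of the KARL-style behaviors
# described above: iterative retrieval, early stopping, and context
# compression. The learned compression is stubbed as simple truncation.

def search_agent(query, vector_search, compress, max_steps=200,
                 context_limit=8):
    """Refine searches until no new facts arrive, compressing as needed."""
    context, seen = [], set()
    for step in range(max_steps):
        results = vector_search(query, step)
        new = [r for r in results if r not in seen]
        if not new:          # searches stopped yielding new information:
            break            # "giving up early" is often the right call
        seen.update(new)
        context.extend(new)
        if len(context) > context_limit:
            # KARL learns this compression end-to-end via RL; here it is
            # stubbed as keeping only the most recent facts.
            context = compress(context)
    return context

# Toy corpus: successive searches return overlapping fact lists.
docs = [["fact-a", "fact-b"], ["fact-b", "fact-c"], ["fact-c"]]
context = search_agent(
    "q",
    vector_search=lambda q, i: docs[i % len(docs)],
    compress=lambda ctx: ctx[-2:],   # crude stand-in for learned compression
    context_limit=2,
)
```

In this toy run the agent compresses once (dropping the oldest fact) and stops on the third step because the search returns nothing new, mirroring the early-stopping behavior Frankle describes.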
Black Forest Labs’ new Self-Flow technique makes training multimodal AI models 2.8x more efficient
To create coherent images or videos, generative AI diffusion models like Stable Diffusion or FLUX have typically relied on external “teachers”—frozen encoders like CLIP or DINOv2—to provide the semantic understanding they couldn’t learn on their own. But this reliance has come at a cost: a “bottleneck” where scaling up the model no longer yields better results because the external teacher has hit its limit.

Today, German AI startup Black Forest Labs (maker of the FLUX series of AI image models) has announced a potential end to this era of academic borrowing with the release of Self-Flow, a self-supervised flow matching framework that allows models to learn representation and generation simultaneously. By integrating a novel Dual-Timestep Scheduling mechanism, Black Forest Labs has demonstrated that a single model can achieve state-of-the-art results across images, video, and audio without any external supervision.

The technology: breaking the “semantic gap”

The fundamental problem with traditional generative training is that it’s a “denoising” task. The model is shown noise and asked to find an image; it has very little incentive to understand what the image is, only what it looks like. To fix this, researchers have previously “aligned” generative features with external discriminative models. However, Black Forest Labs argues this is fundamentally flawed: these external models often operate on misaligned objectives and fail to generalize across different modalities like audio or robotics.

The Labs’ new technique, Self-Flow, introduces an “information asymmetry” to solve this. Using a technique called Dual-Timestep Scheduling, the system applies different levels of noise to different parts of the input.
The student receives a heavily corrupted version of the data, while the teacher—an Exponential Moving Average (EMA) version of the model itself—sees a “cleaner” version of the same data.

The student is then tasked not just with generating the final output, but with predicting what its “cleaner” self is seeing—a process of self-distillation where the teacher is at layer 20 and the student is at layer 8. This “Dual-Pass” approach forces the model to develop a deep, internal semantic understanding, effectively teaching itself how to see while it learns how to create.

Product implications: faster, sharper, and multi-modal

The practical results of this shift are stark. According to the research paper, Self-Flow converges approximately 2.8x faster than the REpresentation Alignment (REPA) method, the current industry standard for feature alignment. Perhaps more importantly, it doesn’t plateau; as compute and parameters increase, Self-Flow continues to improve while older methods show diminishing returns.

The leap in training efficiency is best understood through the lens of raw computational steps: while standard “vanilla” training traditionally requires 7 million steps to reach a baseline performance level, REPA shortened that journey to just 400,000 steps, representing a 17.5x speedup. Black Forest Labs’ Self-Flow framework pushes this frontier even further, operating 2.8x faster than REPA to hit the same performance milestone in roughly 143,000 steps. Taken together, this evolution represents a nearly 50x reduction in the total number of training steps required to achieve high-quality results, effectively collapsing what was once a massive resource requirement into a significantly more accessible and streamlined process.

Black Forest Labs showcased these gains through a 4B parameter multi-modal model.
Trained on a massive dataset of 200M images, 6M videos, and 2M audio-video pairs, the model demonstrated significant leaps in three key areas:

- Typography and text rendering: One of the most persistent “tells” of AI images has been garbled text. Self-Flow significantly outperforms vanilla flow matching in rendering complex, legible signs and labels, such as a neon sign correctly spelling “FLUX is multimodal”.
- Temporal consistency: In video generation, Self-Flow eliminates many of the “hallucinated” artifacts common in current models, such as limbs that spontaneously disappear during motion.
- Joint video-audio synthesis: Because the model learns representations natively, it can generate synchronized video and audio from a single prompt, a task where external “borrowed” representations often fail because an image-encoder doesn’t understand sound.

In terms of quantitative metrics, Self-Flow achieved superior results over competitive baselines. On Image FID, the model scored 3.61 compared to REPA’s 3.92. For video (FVD), it reached 47.81 compared to REPA’s 49.59, and in audio (FAD), it scored 145.65 against the vanilla baseline’s 148.87.

From pixels to planning: the path to world models

The announcement concludes with a look toward world models—AI that doesn’t just generate pretty pictures but understands the underlying physics and logic of a scene for planning and robotics.

By fine-tuning a 675M parameter version of Self-Flow on the RT-1 robotics dataset, researchers achieved significantly higher success rates in complex, multi-step tasks in the SIMPLER simulator.
While standard flow matching struggled with complex “Open and Place” tasks, often failing entirely, the Self-Flow model maintained a steady success rate, suggesting that its internal representations are robust enough for real-world visual reasoning.

Implementation and engineering details

For researchers looking to verify these claims, Black Forest Labs has released an inference suite on GitHub specifically for ImageNet 256×256 generation. The project, primarily written in Python, provides the SelfFlowPerTokenDiT model architecture based on SiT-XL/2.

Engineers can utilize the provided sample.py script to generate 50,000 images for standard FID evaluation. The repository highlights that a key architectural modification in this implementation is per-token timestep conditioning, which allows each token in a sequence to be conditioned on its specific noising timestep. During training, the model utilized BFloat16 mixed precision and the AdamW optimizer with gradient clipping to maintain stability.

Licensing and availability

Black Forest Labs has made the research paper and official inference code available via GitHub and their research portal. While this is currently a research preview, the company’s track record with the FLUX model family suggests these innovations will likely find their way into their commercial API and open-weights offerings in the near future.

For developers, the move away from external encoders is a massive win for efficiency. It eliminates the need to manage separate, heavy models like DINOv2 during training, simplifying the stack and allowing for more specialized, domain-specific training that isn’t beholden to someone else’s “frozen” understanding of the world.

Takeaways for enterprise technical decision-makers and adopters

For enterprises, the arrival of Self-Flow represents a significant shift in the cost-benefit analysis of developing proprietary AI.
While the most immediate beneficiaries are organizations training large-scale models from scratch, the research demonstrates that the technology is equally potent for high-resolution fine-tuning. Because the method converges nearly three times faster than current standards, companies can achieve state-of-the-art results with a fraction of the traditional compute budget. This efficiency makes it viable for enterprises to move beyond generic off-the-shelf solutions and develop specialized models that are deeply aligned with their specific data domains, whether that involves niche medical imaging or proprietary industrial sensor data.

The practical applications for this technology extend into high-stakes industrial sectors, most notably robotics and autonomous systems. By leveraging the framework’s ability to learn “world models,” enterprises in manufacturing and logistics can develop vision-language-action (VLA) models that possess a superior understanding of physical space and sequential reasoning. In simulation tests, Self-Flow allowed robotic controllers to successfully execute complex, multi-object tasks—such as opening a drawer to place an item inside—where traditional generative models failed. This suggests that the technology is a foundational tool for any enterprise seeking to bridge the gap between digital content generation and real-world physical automation.

Beyond performance gains, Self-Flow offers enterprises a strategic advantage by simplifying the underlying AI infrastructure. Most current generative systems are “Frankenstein” models that require complex, external semantic encoders often owned and licensed by third parties. By unifying representation and generation into a single architecture, Self-Flow allows enterprises to eliminate these external dependencies, reducing technical debt and removing the “bottlenecks” associated with scaling third-party teachers.
This self-contained nature ensures that as an enterprise scales its compute and data, the model’s performance scales predictably in lockstep, providing a clearer ROI for long-term AI investments.
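The "information asymmetry" at the heart of Self-Flow—a student that sees a heavily noised input while an EMA teacher sees a cleaner one—can be sketched at a toy scale. This is a conceptual illustration only: the real method operates on diffusion-model feature maps at different network layers, not on scalars, and all names and noise schedules here are assumptions.

```python
import random

# Conceptual sketch of Self-Flow's dual-timestep asymmetry (illustrative
# only; the real method works on model features, not scalar values).
# The student sees a heavily noised input, the EMA teacher a cleaner one,
# and the student is trained to predict what its cleaner self sees.

def noised(x: float, t: float, rng: random.Random) -> float:
    """Interpolate toward pure noise: t=0 is clean data, t=1 is all noise."""
    return (1 - t) * x + t * rng.gauss(0.0, 1.0)

def ema_update(teacher_w: float, student_w: float, decay: float = 0.999) -> float:
    """The teacher's weights trail the student's as an exponential moving average."""
    return decay * teacher_w + (1 - decay) * student_w

rng = random.Random(0)
x = 0.7                          # a toy data point
t_student, t_teacher = 0.8, 0.3  # dual-timestep schedule: student is noisier
student_view = noised(x, t_student, rng)
teacher_view = noised(x, t_teacher, rng)

# Self-distillation target: the student's prediction should match the
# teacher's cleaner representation; here simply the squared gap.
loss = (student_view - teacher_view) ** 2
```

The key design point the sketch captures is that no frozen external encoder appears anywhere: the "teacher" is just a lagged copy of the model itself, which is what removes the third-party bottleneck the article describes.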
Microsoft built Phi-4-reasoning-vision-15B to know when to think — and when thinking is a waste of time
Microsoft on Tuesday released Phi-4-reasoning-vision-15B, a compact open-weight multimodal AI model that the company says matches or exceeds the performance of systems many times its size — while consuming a fraction of the compute and training data. The release marks the latest and most technically ambitious chapter in the software giant’s year-long campaign to prove that carefully engineered small models can compete with, and in key areas outperform, the industry’s largest AI systems.

The 15-billion-parameter model, available immediately through Microsoft Foundry, HuggingFace, and GitHub under a permissive license, processes both images and text and can reason through complex math and science problems, interpret charts and documents, navigate graphical user interfaces, and handle everyday visual tasks like captioning photos and reading receipts. It arrives at a moment when the AI industry is grappling with a fundamental tension: the biggest models deliver the best raw performance, but their enormous cost, latency, and energy consumption make them impractical for many real-world deployments.

“Our goal is to contribute practical insight to the community on building smaller, efficient multimodal reasoning models,” the Microsoft Research team wrote in the model’s official announcement, “and to share an open-weight model that is competitive with models of similar size at general vision-language tasks, excels at computer use, and excels on scientific and mathematical multimodal reasoning.”

How Microsoft trained a competitive vision model on one-fifth the data

Perhaps the most striking claim in the release is how little training data the model required relative to its competitors. Phi-4-reasoning-vision-15B was trained on approximately 200 billion tokens of multimodal data, built atop the Phi-4-Reasoning language backbone (itself trained on 16 billion tokens) and the foundational Phi-4 model (400 billion unique tokens).
By contrast, rival multimodal models from Alibaba’s Qwen family (2.5 VL and 3 VL), Moonshot AI’s Kimi-VL, SenseTime’s InternVL series, and Google’s Gemma3 each consumed more than one trillion tokens during training — roughly five times the total data pipeline Microsoft used.

That disparity matters enormously for economics. Training large AI models costs millions of dollars in cloud compute, and the environmental footprint of trillion-token training runs has drawn increasing scrutiny from regulators and investors alike. If Microsoft’s claims hold up under independent evaluation, the model represents a significant advance in training efficiency — one that could reshape how organizations think about the build-versus-buy calculus for AI deployment.

The secret, according to the research team, lies not in scale but in meticulous data curation. The team’s final dataset drew primarily from three sources: open-source datasets that were “meticulously filtered and improved”; high-quality domain-specific internal data; and targeted data acquisitions. The researchers described a hands-on quality assurance process in which team members manually reviewed samples from each dataset, typically spending five to ten minutes classifying data quality before deciding how to treat each source. For data with incorrect answers, they re-generated responses using GPT-4o and o4-mini. When questions were unsalvageable but images were high quality, they repurposed the images as seeds for new caption or visual question-answering data. They also reported fixing “a surprisingly large number of formatting and logical errors across widely used open-source datasets” — a finding that raises uncomfortable questions about the quality of training data underpinning many of the industry’s most prominent models.

Why the model reasons through calculus but stays quiet on captions

The model’s most technically novel contribution may be its approach to reasoning.
In the world of language-only AI, “reasoning models” — systems that spend extra compute time working through problems step by step — have become the hottest category in the field, with OpenAI’s o-series and DeepSeek’s R1 leading the charge. But extending reasoning to multimodal tasks involving images introduces a wrinkle: for many visual tasks like image captioning or optical character recognition, chain-of-thought reasoning is not only unnecessary but can actually degrade performance by introducing unnecessary verbosity and latency.

Microsoft’s solution was to build what it calls a “mixed reasoning and non-reasoning model.” The team started with Phi-4-Reasoning, already a capable reasoning language model, and then trained it on a hybrid data mixture where approximately 20 percent of samples included explicit chain-of-thought reasoning traces (wrapped in <think>…</think> tags) and 80 percent were tagged for direct response (with a <nothink> token). The model learned to invoke structured reasoning for domains like math and science where it helps, while defaulting to fast, direct responses for perception-focused tasks where it does not.

This design choice reflects a pragmatic view of reasoning that contrasts with the industry’s current enthusiasm for always-on thinking. As the research team explained: “For tasks such as image captioning and optical character recognition (OCR), reasoning is often unnecessary and can even be harmful, while mathematical and scientific problem-solving benefit from multi-step reasoning.” Users who want to override the model’s default behavior can do so by explicitly prompting with <think> or <nothink> tokens.

The team explored four possible training pipelines for multimodal reasoning and chose the one they judged to best balance capability, efficiency, and data requirements.
The alternative approaches — training reasoning and multimodal capabilities simultaneously from a non-reasoning base, learning multimodal skills first and then adding reasoning, or requiring reasoning traces for all training data — each carried significant drawbacks. Training reasoning from scratch demands enormous multimodal reasoning data. Adding reasoning after multimodal training risks catastrophic forgetting. And forcing reasoning on every query wastes compute on tasks that don’t benefit from it.

Inside the vision architecture that makes high-resolution screenshots readable

Under the hood, Phi-4-reasoning-vision-15B uses a mid-fusion architecture that pairs a SigLIP-2 vision encoder with the Phi-4-Reasoning language backbone. The choice of mid-fusion — where a pretrained vision encoder converts images into tokens that are then projected into the language model’s embedding space — over early-fusion, where images and text are processed together in a single transformer, reflects the team’s resource constraints. Early-fusion yields richer joint representations but demands significantly more compute, memory, and data.

The team conducted careful ablation studies on how to handle image resolution, an issue that matters critically for tasks like reading dense screenshots or small UI elements. They tested four approaches — Dynamic S, Multi-crop, Multi-crop with S, and dynamic resolution using SigLIP-2’s Naflex variant — and found that dynamic resolution encoders performed best, especially on high-resolution data. They selected the SigLIP-2 Naflex variant with up to 3,600 maximum tokens, which corresponds roughly to native 720p resolution and delivered particularly strong results on benchmarks requiring fine-grained visual understanding like ScreenSpot-Pro.

This matters for one of the model’s headline use cases: powering computer-using agents that navigate desktop, web, and mobile interfaces.
With strong high-resolution perception and fine-grained grounding capabilities, the model can identify and localize interactive elements like buttons, menus, and text fields — a prerequisite for the autonomous software agents that many in the industry view as the next major frontier for AI deployment. The team noted that the model’s low inference-time requirements make it particularly well suited “for interactive environments where low latency and compact model size are essential.”

The benchmarks show a model that trades brute-force accuracy for speed and efficiency

The model’s benchmark results paint a picture of a system that punches well above its weight class on efficiency while remaining competitive — though not dominant — on raw accuracy. On the team’s own evaluations across ten benchmarks, Phi-4-reasoning-vision-15B scored 84.8 on AI2D (science diagrams), 83.3 on ChartQA, 75.2 on MathVista, 88.2 on ScreenSpot v2 (UI element grounding), and 54.3 on MMMU (a broad multimodal understanding test).

Those numbers generally trail the much larger Qwen3-VL-32B models (which scored 85.0, 84.0, 81.8, 93.9, and 70.6 on the same benchmarks, respectively) but remain competitive with or ahead of similarly-sized systems like Qwen3-VL-8B and Kimi-VL-A3B. The real value proposition, as Figure 1 in the announcement illustrates, emerges when accuracy is plotted against compute time and output token count: Phi-4-reasoning-vision-15B sits at the Pareto frontier of models that are both fast and accurate, delivering competitive results in a fraction of the time required by larger systems.

The Microsoft team acknowledged that their benchmark numbers “may be lower than other previously shared numbers” because they ran all evaluations themselves rather than quoting leaderboard claims. They used temperature=0.0, greedy decoding, and a 4,096 maximum output token limit, with no custom prompting or parameter tuning.
The team committed to releasing all evaluation logs publicly — a transparency practice that remains uncommon in the field and should allow independent researchers to verify the results. Still, independent reproduction will be critical: the AI research community has grown increasingly skeptical of self-reported numbers, particularly when evaluation methodologies differ across organizations.

From edge devices to humanoid robots, the Phi family keeps expanding

Phi-4-reasoning-vision-15B does not exist in isolation. It is the latest entry in a Phi model family that has expanded rapidly over the past year, evolving from a niche research project into a central pillar of Microsoft’s AI strategy — one that now spans language, vision, on-device inference, education, and robotics. The lineage traces back through several milestones. In late 2024, Microsoft released the original Phi-4, a 14-billion-parameter language model that demonstrated the power of synthetic data and careful curation. In April 2025, the company launched Phi-4 mini reasoning (3.8 billion parameters), Phi-4 reasoning (14 billion parameters), and Phi-4 reasoning plus — with the latter reportedly approaching the performance of DeepSeek’s R1, a model with 671 billion parameters, according to TechCrunch’s reporting at the time.

The family has also extended into specialized domains. Phi Silica, an on-device small language model for Copilot+ PCs, has been used with LoRA fine-tuning to customize generation for specific tasks. In one case study detailed on the Windows Developer Blog, Microsoft’s education team used LoRA adapters with Phi Silica to generate Kahoot! quizzes, achieving a 75 percent reduction in rejection rates and a 4.6-times uplift in subjective quality scores.
On the hardware side, the Phi-4-mini model has been optimized for MediaTek’s NPU platforms, running at over 800 tokens per second for prefill on the Dimensity 9400 — fast enough for real-time AI on smartphones and tablets. And in what may be the most ambitious extension yet, Microsoft announced Rho-alpha (ρα), described as the company’s “first robotics model derived from Microsoft’s Phi series.” According to Microsoft Research, Rho-alpha translates natural language commands into control signals for robotic systems performing bimanual manipulation tasks, adding tactile sensing to the perception stack and targeting dual-arm setups and humanoid robots.

What Phi-4-reasoning-vision signals about the future of enterprise AI

The release crystallizes a broader shift in the AI industry’s center of gravity. For the past two years, the dominant narrative has held that bigger is better — that raw scale in parameters, data, and compute is the primary driver of capability. Microsoft’s Phi family represents the most visible corporate champion of the counterargument: that careful engineering of data quality, training methodology, and architecture design can substitute for brute-force scale. This thesis has significant implications for enterprise adoption. Organizations deploying AI in latency-sensitive or resource-constrained settings — edge devices, interactive applications, on-premise servers — cannot practically run trillion-parameter models. A 15-billion-parameter model that delivers 80 to 90 percent of a frontier model’s accuracy at a tenth of the inference cost could unlock deployment scenarios that were previously uneconomical.

The model’s open-weight release, accompanied by fine-tuning code and benchmark logs, also represents a competitive strategy.
By making the model freely available and deeply documented, Microsoft positions Phi as a foundation layer for an ecosystem of downstream applications — many of which will run on Azure, use Microsoft’s development tools, or integrate with its enterprise software stack.

Yet the model still trails the largest open-weight competitors on the hardest benchmarks, particularly in mathematical reasoning (where Qwen3-VL-32B-Thinking-40K scores 78.2 on MathVerse compared to 53.1 for Phi-4-reasoning-vision with forced thinking) and general multimodal understanding (MMMU scores of 72.2 versus 55.0). The 20/80 reasoning-to-non-reasoning data split is, by the team’s own admission, a heuristic that “may not be optimal for all domains or deployment contexts.” And the model’s ability to correctly decide when to reason and when to respond directly remains what the researchers called “an open problem.”

Microsoft is wagering that in the real world, where latency budgets are tight, hardware is finite, and deployment costs compound with every API call, the smartest model is not the biggest one — it’s the one that knows when to think and when to just answer. Whether that bet pays off will depend less on benchmark tables and more on what happens when millions of developers start putting Phi-4-reasoning-vision to work. The model is available now on Microsoft Foundry, Hugging Face, and GitHub. The leaderboard, as always, is open.
Pentagon vendor cutoff exposes the AI dependency map most enterprises never built
The federal directive ordering all U.S. government agencies to cease using Anthropic technology comes with a six-month phaseout window. That timeline assumes agencies already know where Anthropic’s models sit inside their workflows. Most don’t today. Most enterprises wouldn’t, either. The gap between what enterprises think they’ve approved and what’s actually running in production is wider than most security leaders realize. AI vendor dependencies don’t stop at the contract you signed; they cascade through your vendors, your vendors’ vendors, and the SaaS platforms your teams adopted without a procurement review. Most enterprises have never mapped that chain.

The inventory nobody has run

A January 2026 Panorays survey of 200 U.S. CISOs put a number on the problem: Only 15% said they have full visibility into their software supply chains, up from just 3% a year ago. And 49% of workers had adopted AI tools without employer approval, according to a BlackFog survey of 2,000 workers at companies with more than 500 employees; 69% of C-suite members said they were fine with it. That’s where undocumented AI vendor dependencies accumulate, invisible to the security team until a forced migration makes them everyone’s problem.

“If you asked a typical enterprise to produce a dependency graph that includes second- and third-order AI calls, they’d be building it from scratch under pressure,” said Merritt Baer, CSO at Enkrypt AI and former deputy CISO at AWS, in an exclusive interview with VentureBeat. “Most security programs were built for static assets. AI is dynamic, compositional, and increasingly indirect.”

When a vendor relationship ends overnight

The directive creates a forced migration unlike anything the federal government has attempted with an AI provider.
Any enterprise running critical workflows on a single AI vendor faces the same math if that vendor disappears. Shadow AI incidents now account for 20% of all breaches, adding as much as $670,000 to average breach costs, IBM’s 2025 Cost of a Data Breach Report found. You can’t execute a transition plan for infrastructure you haven’t inventoried.

Your contract with Anthropic may not exist, but your vendors’ contracts might. A CRM platform could have Claude embedded in its analytics engine. A customer service tool might call it on every ticket you process. You didn’t sign up for that exposure, but you inherited it, and when a vendor cutoff hits upstream, it cascades downstream fast. The enterprise at the end of that chain doesn’t know the dependency exists until something breaks or the compliance letter shows up. Anthropic has said eight of the 10 largest U.S. companies use Claude. Any organization in those companies’ supply chains has indirect Anthropic exposure, whether they contracted for it or not. AWS and Palantir, which hold billions in military contracts, may need to reassess their commercial relationships with Anthropic to maintain Pentagon business. The supply chain risk designation means any company doing business with the Pentagon now has to prove its workflows don’t touch Anthropic.

“Models are not interchangeable,” Baer told VentureBeat. “Switching vendors changes output formats, latency characteristics, safety filters, and hallucination profiles. That means revalidating controls, not just functionality.” She outlined a sequence that starts with triage and blast-radius assessment, moves to behavioral drift analysis, and ends with credential and integration churn. “Rotating keys is the easy part,” Baer said. “Untangling hardcoded dependencies, vendor SDK assumptions, and agent workflows is where things break.”

The dependencies your logs don’t show

A senior defense official described disentangling from Claude as an “enormous pain in the ass,” according to Axios.
If that’s the assessment inside the most well-resourced security apparatus on the planet, the question for enterprise CISOs is straightforward: How long would yours take? The shadow IT wave that followed SaaS adoption taught security teams about unsanctioned technology risk. Most caught up. They deployed CASBs, tightened SSO, and ran spend analysis. The tools worked because the threat was visible. A new application meant a new login, a new data store, a new entry in the logs. AI vendor dependencies don’t leave those traces.

“Shadow IT with SaaS was visible at the edges,” Baer said. “AI dependencies are embedded inside other vendors’ features, invoked dynamically rather than persistently installed, non-deterministic in behavior, and opaque. You often don’t know which model or provider is actually being used.”

Four moves for Monday morning

The federal directive didn’t create the AI supply chain visibility problem. It exposed it. “Not ‘inventory your AI,’ because that’s too abstract and too slow,” Baer told VentureBeat. She recommended four concrete moves that a security leader can execute in 30 days.

Map execution paths, not vendors. Instrument at the gateway, proxy, or application layer to log which services are making model calls, to which endpoints, with what data classifications. You’re building a live map of usage, not a static vendor list.

Identify control points you actually own. If your only control is at the vendor boundary, you’ve already lost. You want enforcement at ingress (what data goes into models), egress (what outputs are allowed downstream), and orchestration layers where agents and pipelines operate.

Run a kill test on your top AI dependency. Pick your most critical AI vendor and simulate its removal in a staging environment. Kill the API key, monitor for 48 hours, and document what breaks, what silently degrades, and what throws errors your incident response playbook doesn’t cover.
This exercise will surface dependencies you didn’t know existed.

Force vendor disclosure on sub-processors and models. Your AI vendors should be able to answer which models they rely on, where those models are hosted, and what fallback paths exist. If they can’t, that’s your fourth-party blind spot. Ask the questions now, while the relationship is stable. Once a cutoff hits, the leverage shifts, and the answers come too late.

The control illusion

“Enterprises believe they’ve ‘approved’ AI vendors, but what they’ve actually approved is an interface, not the underlying system,” Baer told VentureBeat. “The real dependencies are one or two layers deeper, and those are the ones that fail under stress.”

The federal directive against Anthropic is one organization’s weather event. Every enterprise will eventually face its own version, whether the trigger is regulatory, contractual, operational, or geopolitical. The organizations that mapped their AI supply chain before the storm will recover. The ones that didn’t will scramble. Map your AI vendor dependencies to the sub-tier level. Run the kill test. Force the disclosure. Give yourself 30 days. The next forced migration won’t come with a six-month warning.
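The kill test described above can be sketched in miniature. The following is a hypothetical simulation, not a real vendor SDK: a stub client stands in for the AI vendor, its key is "revoked," and each workflow step is classified as working, silently degraded, or broken. All workflow names and fallback rules here are invented for illustration.

```python
# Minimal "kill test" sketch: simulate revoking an AI vendor's API key
# and classify each workflow step as OK, silently degraded, or broken.
# The client stub, workflow names, and fallback rules are hypothetical.

class VendorDisabled(Exception):
    """Raised when a call reaches the revoked vendor."""

class ModelClient:
    def __init__(self, enabled=True):
        self.enabled = enabled

    def complete(self, prompt):
        if not self.enabled:
            raise VendorDisabled("API key revoked")
        return f"model output for: {prompt}"

def run_step(step, client):
    """Run one workflow step. A step that swallows the outage and falls
    back to stale or rule-based output is a *silent* degradation."""
    try:
        return ("ok", client.complete(step))
    except VendorDisabled:
        if step == "ticket-triage":  # this step has a static-rules fallback
            return ("degraded", "fell back to static rules")
        return ("broken", "no fallback path")

def kill_test(steps):
    client = ModelClient(enabled=False)  # the simulated vendor cutoff
    return {step: run_step(step, client) for step in steps}

report = kill_test(["ticket-triage", "contract-summarization", "email-tagging"])
for step, (status, detail) in report.items():
    print(f"{step}: {status} ({detail})")
```

In a real exercise the "client" is your staging environment with the key revoked, and the classification comes from 48 hours of monitoring rather than a lookup table.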
Did Alibaba just kneecap its powerful Qwen AI team? Key figures depart in wake of latest open source release
Alibaba’s Qwen team of AI researchers has been among the most prolific and well-regarded in the international machine learning community, shipping dozens of powerful generalized and specialized generative models starting last summer, most of them entirely open source and free.

But now, just 24 hours after shipping the open source Qwen3.5 small model series — a release that drew public praise from Elon Musk for its “impressive intelligence density” — the project’s technical architect and several other Qwen team members have exited the company under unclear circumstances, raising questions and concerns around the world about the future direction of the Qwen team and its focus on open source.

The departure of Junyang “Justin” Lin, the technical lead who steered Qwen from a nascent lab project to a global powerhouse with over 600 million downloads, alongside two colleagues — staff research scientist Binyuan Hui and intern Kaixin Li — marks a volatile inflection point for Alibaba Cloud and its role as an international open source AI leader. The three Qwen team members announced their departures on X today, though they did not share their reasons or say whether the exits were voluntary. VentureBeat reached out to sources at Alibaba for more information and will update this story when we obtain it. Lin himself signed off with a simple post: “me stepping down. bye my beloved qwen.”

While the company celebrates a technical triumph, the sudden exit of its core leadership suggests a deepening rift between the researchers who built the models and a corporate hierarchy now pivoting toward aggressive monetization.

The departing researchers’ final gift: pocket-sized intelligence

The Qwen3.5 small model series (ranging from 0.8B to 9B parameters) represents a final masterstroke in “intelligence density” from the founding team.
The models employ a Gated DeltaNet hybrid architecture that allows a 9B-parameter model to rival the reasoning capabilities of much larger systems. By utilizing a 3:1 ratio of linear attention to full attention, the models maintain a massive 262,000-token context window while remaining efficient enough to run natively on standard laptops and smartphones — even in web browsers.

Lin, a PKU humanities graduate and polyglot, has long advocated for this “algorithm-hardware co-design” to bypass compute constraints — a philosophy he detailed at the January 2026 Tsinghua AI Summit. For the developer community, Qwen3.5 wasn’t just another update; it was a blueprint for the “Agentic Inflection,” where models shift from being chatbots to autonomous “all-in-one AI workers” capable of navigating UIs and executing complex code.

The enterprise dilemma

For the 90,000+ enterprises currently deploying Qwen via DingTalk or Alibaba Cloud, the leadership vacuum creates a crisis of confidence. Many companies migrated to Qwen because it offered a “third way”: the performance of a proprietary US model with the transparency of open weights. Alibaba has recently consolidated its AI efforts into the “Qwen C-end Business Group,” merging its model labs with consumer hardware teams. The goal is clear: transition Qwen from a research project into the operating system for a new era of AI-integrated glasses and rings. However, the reported appointment of Hao Zhou, a veteran of Google DeepMind’s Gemini team, to lead the Qwen team indicates a shift from “research-first” to “metric-driven” leadership.
Industry analysts, including those cited by InfoWorld, warn that as Alibaba pushes to meet investor demands for revenue growth, the “open” in Qwen’s open-weight models may become a secondary priority — similar to what happened at Meta after the disappointing release of its Llama 4 model last spring and the subsequent reorganization of its AI division, which brought in Scale AI co-founder and CEO Alexandr Wang and was followed by the departure of preeminent researcher Yann LeCun. Enterprises relying on the Apache 2.0-licensed Qwen models now face the possibility that future flagships — such as the rumored Qwen3.5-Max — will be locked behind paid, proprietary APIs to drive Cloud DAU (daily active user) metrics. The takeaway? If you value Qwen’s open source efforts, download and preserve the models now, while you still can.

The “Gemini-fication” of Qwen?

The internal friction at Alibaba mirrors the tensions seen at OpenAI and Google: the “soul” of the machine is often at odds with the “scale” of the business. Xinyu Yang, a researcher at rival Chinese AI lab DeepSeek, captured this sentiment in a stark post on X: “Replace the excellent leader with a non-core people from Google Gemini, driven by DAU metrics. If you judge foundation model teams like consumer apps, don’t be surprised when the innovation curve flattens.” This “Gemini-fication” — the shift toward a highly regulated, product-centric culture — threatens the very agility that allowed Qwen to surpass Meta’s Llama in derivative model creation.

For the global AI community, the loss of Junyang Lin is symbolic. He was the primary bridge between China’s deep engineering talent and the Western open-source ecosystem. Without his advocacy, there are fears that the project will retreat into a “walled garden” strategy similar to its Western rivals.

‘Leaving wasn’t your choice’

The technical brilliance of the Qwen3.5 release has been overshadowed by the heartbreak of its creators.
On social media, the sentiment among the team members who built the model is one of mourning rather than celebration. Chen Cheng, a Qwen contributor, explicitly alluded to a forced departure, writing in a post on X: “I’m truly heartbroken. I know leaving wasn’t your choice… I honestly can’t imagine Qwen without you.” Li suggested the exit signaled the end of broader ambitions, such as a planned Singapore-based research hub: “Qwen could have had a Singapore base, all thanks to Junyang. But now that he’s gone, there’s no reason left to stay here.”

What happens to Qwen’s open source AI efforts from here on out?

The known facts are simple: Qwen has never been technically stronger, yet its founding core has been dismantled. As Alibaba prepares to face investors for its fiscal Q3 earnings report on March 5, the narrative will likely focus on “efficiency” and “commercial scale.” For the enterprises currently excited about the 60% cost reductions promised by Qwen3.5, the immediate future is bright. But for the larger AI community, the cost of that efficiency may be the loss of the most vibrant open-source lab in the East. As Hao Zhou takes the reins, the world is watching to see if Qwen remains a “model for the world” or becomes merely a component in Alibaba’s corporate bottom line.
Google releases Gemini 3.1 Flash Lite at 1/8th the cost of Pro
Google’s newest AI model is here: Gemini 3.1 Flash-Lite. The biggest improvements this time around come in cost and speed, especially for enterprises and developers seeking to leverage powerful reasoning and multimodal capabilities from the U.S. search and cloud giant. Positioning it as the most cost-efficient and responsive model in the Gemini 3 series, Google is offering a solution built specifically for intelligence at scale. This launch arrives just weeks after the February debut of its heavy-lifting sibling, Gemini 3.1 Pro, completing a tiered strategy that allows enterprises to scale intelligence across every layer of their infrastructure.

Technology: optimized for the “time to first token”

In the world of high-throughput AI, the metric that often dictates user experience isn’t just accuracy—it’s latency. For real-time customer support, live content moderation, or instant user interface generation, the “time to first answer token” is the primary indicator of whether an application feels like a tool or a teammate. If a model takes even two seconds to begin its response, the illusion of fluid interaction is broken. Gemini 3.1 Flash-Lite is engineered specifically for this instant feel. According to internal benchmarks and third-party evaluations, Flash-Lite outperforms its predecessor, Gemini 2.5 Flash, with a 2.5x faster time to first token. Furthermore, it boasts a 45 percent increase in overall output speed—363 tokens per second compared to 249. This speed is achieved through what Koray Kavukcuoglu, VP of research at Google DeepMind, described in an X post as an unbelievable amount of complex engineering to make AI feel instantaneous.

Perhaps the most innovative technical addition is the introduction of thinking levels. Standardized across both the Flash-Lite and Pro variants, this feature allows developers to modulate the model’s reasoning intensity dynamically.
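As a rough illustration of how such a control might be wired into an application, here is a minimal dispatcher that picks a reasoning intensity per task class. The task taxonomy, config shape, and "thinking_level" values are illustrative assumptions, not Google's documented API surface.

```python
# Sketch of a "thinking level" dispatcher: choose a reasoning-intensity
# setting per task class before sending a request. The task categories
# and the config dictionary shape are hypothetical, for illustration only.

LOW_THINKING_TASKS = {"classification", "sentiment", "tagging", "translation"}
HIGH_THINKING_TASKS = {"code-exploration", "dashboard-generation", "simulation"}

def request_config(task_type):
    """Return a generation config with the thinking level dialed to the task."""
    if task_type in HIGH_THINKING_TASKS:
        # Deeper reasoning before the first token, at higher latency and cost.
        return {"model": "flash-lite", "thinking_level": "high"}
    # Default to the cheap, fast path for routine high-volume work.
    return {"model": "flash-lite", "thinking_level": "low"}

print(request_config("sentiment"))         # routine task: low thinking
print(request_config("code-exploration"))  # complex task: high thinking
```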
For a simple classification task or high-volume sentiment analysis, the model can be dialed down for maximum speed and minimum cost. Conversely, for complex code exploration, generating dashboards, or creating simulations, the thinking can be dialed up, allowing the model to perform deeper reasoning and logic before emitting its first response.

Product: benchmarking the lite-weight heavy hitter

While the “Lite” suffix often implies a significant sacrifice in capability, the performance data suggests a model that punches well into the territory of much larger systems. Gemini 3.1 Flash-Lite achieved an Elo score of 1432 on the Arena.ai leaderboard, placing it in a competitive tier with models much larger in parameter count. Key benchmark results highlight its specialized strengths across diverse cognitive domains:

Scientific knowledge: 86.9 percent on GPQA Diamond
Multimodal understanding: 76.8 percent on MMMU-Pro
Multilingual Q&A: 88.9 percent on MMMLU
Parametric knowledge: 43.3 percent on SimpleQA Verified
Abstract reasoning: 16.0 percent on Humanity’s Last Exam (full set)

The model is particularly adept at structured output compliance—a critical requirement for enterprise developers who need AI to generate valid JSON, SQL, or UI code that won’t break downstream systems. On LiveCodeBench, Flash-Lite scored 72.0 percent, outperforming several rivals in its weight class; GPT-5 mini posted a higher 80.4 percent on a different subset but lagged significantly in speed and cost efficiency. Furthermore, its performance on CharXiv Reasoning (73.2 percent) and Video-MMMU (84.8 percent) demonstrates that its multimodal capabilities are robust enough for complex chart synthesis and knowledge acquisition from video.

The intelligence hierarchy: Flash-Lite vs. 3.1 Pro

To understand Flash-Lite’s place in the market, one must look at it alongside Gemini 3.1 Pro, which Google released in mid-February 2026 to retake the AI crown.
While Flash-Lite is the reflexes of the Gemini system, 3.1 Pro is undoubtedly the brain. The primary differentiator is the depth of cognitive processing. Gemini 3.1 Pro was engineered to double the reasoning performance of the previous generation, achieving a verified score of 77.1 percent on ARC-AGI-2—a benchmark designed to test a model’s ability to solve entirely new logic patterns it has not encountered during training. While Flash-Lite holds its own in scientific knowledge at 86.9 percent, the Pro model pushes that boundary to a staggering 94.3 percent, making it the superior choice for deep research and high-stakes synthesis.

The application focus also differs significantly based on these reasoning gaps. Gemini 3.1 Pro is capable of vibe-coding—generating animated SVGs and complex 3D simulations directly from text prompts. For example, in one demonstration, Pro coded a complex 3D starling murmuration that users could manipulate via hand-tracking. It can even reason through abstract literary themes, such as translating the atmospheric tone of Emily Brontë’s Wuthering Heights into a functional web design. Gemini 3.1 Flash-Lite, conversely, is the workhorse for high-volume execution. It handles the millions of daily tasks—translation, tagging, and moderation—that require consistent, repeatable results without the massive compute overhead of a reasoning-heavy model. It fills a wireframe with hundreds of products instantly or orchestrates intent routing with 94 percent accuracy, as reported by early testers.

1/8th the cost of the flagship Gemini 3.1 Pro (and cheaper than its predecessor, Gemini 2.5 Flash)

For enterprise technical decision-makers, the most compelling part of the Gemini 3.1 series is the reasoning-to-dollar ratio.
Google has priced Gemini 3.1 Flash-Lite at $0.25 per 1 million input tokens and $1.50 per 1 million output tokens. This pricing makes it significantly more affordable than competitors like Claude 4.5 Haiku, which is priced at $1.00 per 1 million input and $5.00 per 1 million output tokens. Even compared to Gemini 2.5 Flash, which cost $0.30 per 1 million input, Flash-Lite offers a cost reduction alongside its performance gains. When contrasted with Gemini 3.1 Pro—which maintains a price of $2.00 per million input tokens for prompts up to 200k—the strategic advantage of the dual-model approach becomes clear. In high-context usage (above 200,000 tokens per interaction), Flash-Lite is actually between 12x and 16x cheaper.

Pricing comparison (USD per 1 million tokens; total = input + output):

Model | Input | Output | Total | Source
Qwen 3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud
Qwen3.5-Flash | $0.10 | $0.40 | $0.50 | Alibaba Cloud
deepseek-chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek
deepseek-reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek
Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI
Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 | xAI
MiniMax M2.5 | $0.15 | $1.20 | $1.35 | MiniMax
Gemini 3.1 Flash-Lite | $0.25 | $1.50 | $1.75 | Google
MiniMax M2.5-Lightning | $0.30 | $2.40 | $2.70 | MiniMax
Gemini 3 Flash Preview | $0.50 | $3.00 | $3.50 | Google
Kimi-k2.5 | $0.60 | $3.00 | $3.60 | Moonshot
GLM-5 | $1.00 | $3.20 | $4.20 | Z.ai
ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Baidu
Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | Anthropic
Qwen3-Max (2026-01-23) | $1.20 | $6.00 | $7.20 | Alibaba Cloud
Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google
GPT-5.2 | $1.75 | $14.00 | $15.75 | OpenAI
Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 | Anthropic
Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google
Claude Opus 4.6 | $5.00 | $25.00 | $30.00 | Anthropic
GPT-5.2 Pro | $21.00 | $168.00 | $189.00 | OpenAI

By using a cascading architecture, an enterprise can use 3.1 Pro for the initial complex planning, architectural design, and deep logic, then hand off high-frequency, repetitive execution to Flash-Lite at one-eighth of the cost. This shift effectively moves AI from an expensive experimental cost center to a utility-grade resource that can be run over every log file, email, and customer chat
without exhausting the cloud budget.

Community and developer reactions

Early feedback from Google’s partner network suggests that the 3.1 series is successfully filling a critical gap in the market for reliable autonomy. Andrew Carr, chief scientist at Cartwheel, has tested both models and noted their unique strengths. Regarding 3.1 Pro, he highlighted its substantially improved understanding of 3D transformations, which resolved long-standing rotation-order bugs in animation pipelines. He found Flash-Lite to be a different kind of unlock for the business: “3.1 Flash-Lite is a remarkably competent model. It is lightning fast, but still somehow finds a way to follow all instructions… The intelligence to speed ratio is unparalleled in any other model.”

For consumer-facing applications, the low latency of Flash-Lite has been the key to market expansion. Kolby Nottingham, head of AI at Latitude, shared that the model achieved a 20 percent higher success rate and 60 percent faster inference times compared to their previous model, enabling sophisticated storytelling for a much wider audience than would otherwise have been possible. Reliability in data tagging has also emerged as a standout feature. Bianca Rangecroft, CEO of Whering, reported that by integrating 3.1 Flash-Lite into their classification pipeline, they achieved 100 percent consistency in item tagging, providing a highly reliable foundation for label assignment and increasing confidence in structured outputs. Kaan Ortabas, co-founder of HubX, noted that as a root orchestration engine, Flash-Lite delivered sub-10-second completions with near-instant streaming and 97 percent structured output compliance.
On the flagship side, Vladislav Tankov, director of AI at JetBrains, noted a 15 percent quality improvement in the Pro model, emphasizing that it is stronger, faster, and more efficient, requiring fewer output tokens to achieve its goals.

Licensing and enterprise availability

Both Gemini 3.1 Flash-Lite and Pro are offered through Google AI Studio and Vertex AI. As proprietary models, they follow a standard commercial software-as-a-service model rather than an open-source license. Operating through Vertex AI provides grounded reasoning within a secure perimeter, ensuring that high-volume workloads—like those being run by Databricks to achieve best-in-class results on the OfficeQA benchmark—remain protected by enterprise-grade security and data residency guarantees. However, they are also limited in terms of customizability and require persistent internet connectivity, as opposed to purely open source rivals like the powerful new Qwen3.5 series released by Alibaba over the last few weeks. The current preview status for Flash-Lite allows Google to refine safety and performance based on real-world developer feedback before general availability. For developers already building via the Gemini API, the transition to 3.1 Pro and Flash-Lite represents a direct performance upgrade at the same or lower price points, effectively lowering the barrier to entry for complex agentic workflows.

The verdict: the new standard for utility AI

The release of Gemini 3.1 Flash-Lite represents the final piece of a strategic pivot for Google. While the industry has been obsessed with state-of-the-art reasoning for the most complex problems, the vast majority of enterprise work consists of high-volume, repetitive, but high-precision tasks.
By providing both the brain in Gemini 3.1 Pro and the reflexes in Gemini 3.1 Flash-Lite, Google is signaling that the next phase of the AI race will be won by models that can think through a problem, but also execute that solution at scale.For the CTO or technical lead deciding which model to bake into their 2026 product roadmap, the Gemini 3.1 series offers a compelling argument: you no longer have to pay a reasoning tax to get reliable, instantaneous results. As Flash-Lite rolls out in preview today, the message to the developer community is clear: the barrier to intelligence at scale hasn’t just been lowered—it’s been dismantled.
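The cost ratios cited earlier (Pro at eight times Flash-Lite's price, widening to 12x to 16x at high context) can be sanity-checked from the list prices. Here is a quick back-of-the-envelope sketch; the dictionary keys are shorthand labels, and the figures are the per-million-token list prices quoted in the pricing section.

```python
# Back-of-the-envelope check of the Gemini 3.1 pricing ratios,
# using list prices in USD per 1 million tokens: (input, output).
PRICES = {
    "flash-lite":     (0.25, 1.50),
    "pro-under-200k": (2.00, 12.00),
    "pro-over-200k":  (4.00, 18.00),
}

def ratio(model_a, model_b):
    """Return (input ratio, output ratio) of model_a's price over model_b's."""
    (ai, ao), (bi, bo) = PRICES[model_a], PRICES[model_b]
    return ai / bi, ao / bo

print(ratio("pro-under-200k", "flash-lite"))  # (8.0, 8.0): Pro costs 8x on both
print(ratio("pro-over-200k", "flash-lite"))   # (16.0, 12.0): 12x-16x at high context
```

The exact blended ratio for a given workload depends on its input/output token mix, which is why the article quotes a range rather than a single multiplier.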