The Token Do-or-Die Line: Financial AI Companies Scramble to Cut Costs

NextFin News -- "To be frank, the window of opportunity isn’t long anymore. For companies, an AI transformation may have to be completed within the next two years," said Liang Zhongzhi, Senior Technical Director at Yingmi Fund.

He noted that companies that complete the transformation first will gain enormous cost and efficiency advantages, enabling them to lock in incremental growth in their respective markets. In his view, AI transformation is no longer about development; it’s a matter of corporate survival.

However, most companies’ AI implementation efforts are currently stuck. Many vendors and apps simply "stuff" an AI assistant into their existing interaction model yet still fail to solve users’ business problems, resulting in generally low usage intensity. The core issue often isn’t the technology itself, but the fact that existing production relationships can’t align with the new productive forces, and reshaping those relationships is an extremely painful process.

According to McKinsey’s The State of Organizations 2026, as many as 88% of AI pilot projects failed to scale. The main reasons were not technical flaws but a lack of evaluation mechanisms, along with governance barriers. Insufficient organizational readiness, a "slow variable," is harder to spot than technological risk.

Yingmi Fund’s exploration is highly instructive. Starting in 2026, the company proactively pursued change and launched an AI-driven overhaul. On the R&D side alone, all roles were consolidated into a single role: "product engineer." Its AI assistant, "AI Xiaogu," has cumulatively handled more than 1 million user questions. When real-world usage reaches the level of a million tokens per day, cost is no longer an abstract figure; it becomes a very real bill.

According to a recent public disclosure by Yingmi Fund Chairwoman Xiao Wen, Yingmi has deployed more than 200 models internally, with monthly token consumption reaching the hundreds-of-billions level. AI is no longer an experimental project; it has become as fundamental as water, electricity, and gas, an everyday necessity for ordinary employees in their daily work.

Three Token-related Issues in Financial Scenarios

Before exploring cost-reduction paths, Yingmi Fund tried a range of approaches, including tiered model scheduling, prompt streamlining, caching and precomputation, and RAG optimization. These delivered results, but the team hoped to find a solution closer to the underlying logic.

Liang Zhongzhi’s analysis suggests that token usage in financial scenarios has three major characteristics that directly drive up costs:

First, the context is exceptionally long. Financial decision-making requires synthesizing a large amount of information: one client’s holdings data, trading history, risk preferences, and communication records. Put together, that can easily run into several thousand or even tens of thousands of tokens. That’s simply not in the same ballpark as a code-completion request.

Second, the accuracy bar is extremely high. Individual users might tolerate an AI-written blog post being a bit wordy, but businesses can’t tolerate AI getting the return calculation wrong in an investment recommendation. This means financial scenarios often require stronger (and therefore more expensive) models, as well as more inference steps.

Third, the "value density" varies enormously from one scenario to another. A user asking "What is fund dollar-cost averaging?" and a high-net-worth client asking "How should I allocate my 5 million in assets?" may consume roughly the same number of tokens, but the business value differs by orders of magnitude.

"The term ‘token anxiety’ is spot-on," Liang Zhongzhi said, but in his view it is more a product of a cognitive stage: the anxiety often comes from "not knowing whether it’s worth it." If you can clearly calculate the business value corresponding to every unit of token consumption, the anxiety will disappear.

Beyond common forms of waste such as "showboating calls," "brute-force context stuffing," and "duplicate reasoning," Liang Zhongzhi highlighted an even more hidden kind of waste: "using probabilistic reasoning to solve deterministic problems." These are scenarios that should have been built as traditional software (build once, reuse indefinitely) but instead are repeatedly handed off to AI, creating linear costs out of thin air. Taken together, this waste may account for more than 50% of an enterprise’s token consumption in AI applications.

To address this, Yingmi Fund developed a "token arbitrage" framework:

Step one: determine whether the scenario has an optimal solution. If it does, the best approach is to develop it as traditional software (build once, reuse indefinitely, with zero marginal cost), such as a fund screener, NAV lookup, or account overview.

Step two: if there’s no optimal solution, look at whether token arbitrage holds. In a linear-cost setting, token consumption is essentially paying for "leverage that grows nonlinearly."
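
The two steps above can be read as a simple decision rule. The sketch below is one hedged way to write that rule down, not Yingmi Fund's actual implementation: the function name, the `Decision` labels, the third "skip" outcome, and every cost figure are illustrative assumptions.

```python
from enum import Enum


class Decision(Enum):
    TRADITIONAL_SOFTWARE = "build once as traditional software"
    SPEND_TOKENS = "token arbitrage holds: spend tokens"
    SKIP = "token arbitrage does not hold: do not spend tokens"


def token_arbitrage(has_optimal_solution: bool,
                    token_cost_per_call: float,
                    value_replaced_per_call: float) -> Decision:
    """Apply the two-step check to one scenario (all amounts in yuan)."""
    # Step one: a deterministic problem with an optimal solution is built
    # once as ordinary software, so no tokens are spent per call.
    if has_optimal_solution:
        return Decision.TRADITIONAL_SOFTWARE
    # Step two: otherwise, spend tokens only if each call replaces more
    # value (for example, marginal labor cost) than it costs.
    if value_replaced_per_call > token_cost_per_call:
        return Decision.SPEND_TOKENS
    return Decision.SKIP


# Hypothetical scenarios and figures, for illustration only.
print(token_arbitrage(True, 0.05, 0.0).value)     # e.g. a NAV lookup
print(token_arbitrage(False, 0.50, 100.0).value)  # e.g. financial advisory
print(token_arbitrage(False, 0.50, 0.10).value)   # e.g. low-value chat
```

The comparison in step two is deliberately crude. In practice the "value replaced" estimate is the hard part, and it is the number the article argues each team must be able to calculate.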

Based on this, Yingmi Fund chose to invest heavily in tokens for financial advisory scenarios, so that each token replaces not a few cents of compute cost, but tens or even hundreds of yuan in marginal labor cost.

"Machines in the Industrial Revolution were a one-time investment with marginal costs trending toward zero; machines in the AI era are pay-per-use, so marginal costs don’t go to zero. In the era of traditional software, you aimed for build once, reuse indefinitely; in the AI era, what you’re aiming for is that every single call creates positive value. That’s a fundamental shift in mindset," Liang Zhongzhi pointed out.

Make Tokens Something Other Than a Cost Center

In fact, fine-grained control of token costs is shifting from an elective to a required course for enterprises.

A Goldman Sachs report released in May 2026 noted that the AI industry is moving from a cost narrative to a profit narrative. The report showed that token pricing for mainstream large models, which had previously been falling by about 40% a year, has begun to stabilize, while the compute cost per token (driven by NVIDIA, AMD, Google TPU, and others) has continued to drop at an annual rate of 60% to 70%. The "scissor gap" between the two curves is opening up profit headroom. Goldman Sachs projected that by 2030, consumer- and enterprise-side agents combined will drive global token consumption to 24 times the 2026 level, reaching roughly 120 quadrillion tokens per month.

"If modern Chinese uses fewer tokens than English, then what about classical Chinese, one of the most information-dense written forms among human languages? Could that work too?"

In late 2024, a wave of "learn Chinese to save tokens" went viral on overseas social media: U.S. developers found that expressing the same meaning in Chinese used far fewer Tokens than in English.

Liang Zhongzhi verified this with hands-on tests: he wrote the same passage in English, modern Chinese, and Classical Chinese, then counted the tokens. The result was striking: Classical Chinese used only about 30% to 40% as many tokens as English.

This is also the core idea behind Token-Zip: use a low-cost, high-speed model to translate the user’s original input into Classical Chinese; then use a high-cost, high-quality model to "think" and answer in Classical Chinese; and finally convert the answer back to produce the final output. It essentially adds a "compress-decompress" layer on both ends of the expensive model.
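
The three stages described above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not Token-Zip's real code: a "model" is reduced to any function from prompt to reply, the prompt wording is invented, and the use of the cheap model for the final conversion is an assumption, since the article does not say which model performs that step.

```python
from typing import Callable

# A "model" here is any function that maps a prompt string to a reply string.
# Real model clients are deliberately left out; these are stand-ins.
Model = Callable[[str], str]


def token_zip(user_input: str, cheap_model: Model, strong_model: Model) -> str:
    """Compress with a cheap model, reason with a strong one, then decompress."""
    # 1. Compress: the low-cost model rewrites the input in Classical Chinese.
    compressed = cheap_model(
        "Translate the following text into Classical Chinese:\n" + user_input
    )
    # 2. Think: the high-cost model reads and answers in Classical Chinese,
    #    so it handles fewer tokens on both input and output.
    answer = strong_model("Answer in Classical Chinese.\n" + compressed)
    # 3. Decompress: convert the answer back into the user's language.
    #    Using the cheap model for this step is an assumption.
    return cheap_model(
        "Translate the following Classical Chinese text into English:\n" + answer
    )


# Stub models that only tag the last line of their prompt, to show call order.
def stub_cheap(prompt: str) -> str:
    return "cheap(" + prompt.splitlines()[-1] + ")"


def stub_strong(prompt: str) -> str:
    return "strong(" + prompt.splitlines()[-1] + ")"


print(token_zip("How should I rebalance?", stub_cheap, stub_strong))
# prints: cheap(strong(cheap(How should I rebalance?)))
```

The nesting in the printed result shows that the expensive model only ever sees the compressed text. The two cheap calls add cost of their own, so the scheme only pays off when the gap between the two models' prices is large enough.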

Real-world tests showed that across 54 English prompt use cases spanning 14 domains, costs were reduced by an average of 51%, and response quality also improved. "We suspect this is because the conciseness of classical Chinese forces the model to focus more on the core information and cut down on fluff," Liang Zhongzhi added.

In addition, finance involves many scenarios that require extensive natural-language interaction, such as robo-advisory services, customer inquiries, research report generation, and compliance reviews, where both inputs and outputs are primarily natural language. Token-Zip’s benchmark data showed that natural-language-dense content delivers the best compression results, for example: law 60%, education 60%, healthcare 57%, and finance/economics 45%. This means financial scenarios are inherently well suited to the compression approach represented by Token-Zip.

Over the past two years, Yingmi Fund has built a layered strategy for controlling Token costs:

First is model routing: not every scenario uses the most expensive model; only scenarios that truly require strong reasoning capabilities use top-tier models. And model selection is not a one-time decision, but a process of continuous optimization.

Second is prompt engineering and context management, including streamlining the system prompt, dynamically loading context, and optimizing few-shot examples.

Third is scenario solidification: once an AI scenario is used repeatedly and its logic stabilizes, it can be gradually solidified from "reasoning from scratch" each time into template-based execution, potentially reducing Token consumption by 80%. AI helps developers quickly validate whether a scenario is valuable and how its logic works; once validation succeeds and the pattern is stable, the scenario can be solidified.
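
As a rough illustration of how the first and third layers can fit together, the sketch below answers a scenario from a fixed template once it has been solidified, and otherwise sends it to whichever model tier the routing table currently assigns. Everything named here (the scenario names, the tier labels, the `handle` function, the stub model) is a hypothetical stand-in, not Yingmi Fund's configuration.

```python
from string import Template
from typing import Callable, Dict

# Hypothetical routing table: scenario name -> model tier.
ROUTES: Dict[str, str] = {
    "customer_inquiry": "light",
    "research_report": "standard",
    "portfolio_advice": "top",
}

# Scenarios whose logic has stabilized are answered from a template,
# which spends no model tokens at all.
SOLIDIFIED: Dict[str, Template] = {}


def route(scenario: str, default_tier: str = "light") -> str:
    """Pick a model tier; unknown scenarios start on the cheapest tier."""
    return ROUTES.get(scenario, default_tier)


def reroute(scenario: str, tier: str) -> None:
    """Routing is revisited over time rather than decided once."""
    ROUTES[scenario] = tier


def solidify(scenario: str, template: str) -> None:
    """Replace per-call reasoning with a fixed template for this scenario."""
    SOLIDIFIED[scenario] = Template(template)


def handle(scenario: str, fields: Dict[str, str],
           call_model: Callable[[str, str], str]) -> str:
    """Use a template when one exists; otherwise call the routed model."""
    template = SOLIDIFIED.get(scenario)
    if template is not None:
        return template.substitute(fields)
    prompt = scenario + ": " + ", ".join(f"{k}={v}" for k, v in fields.items())
    return call_model(route(scenario), prompt)


def stub_model(tier: str, prompt: str) -> str:
    """Stand-in for a real model call; it only echoes tier and prompt."""
    return f"[{tier} model] {prompt}"


fields = {"fund": "Fund A", "value": "1000"}
print(handle("customer_inquiry", fields, stub_model))
solidify("customer_inquiry",
         "Your holding in $fund is currently valued at $value yuan.")
print(handle("customer_inquiry", fields, stub_model))
```

The first call goes to the "light" tier. After `solidify`, the same request is served from the template without touching a model, which is the kind of saving the solidification step is aiming for.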

Of course, after these three steps are completed, for scenarios that truly require expensive models and cannot be further solidified, Token-Zip can provide an additional compression layer. In addition, Yingmi Fund has put into practice a path with the greatest strategic value: re-assetizing AI-native capabilities, i.e., packaging all internal financial capabilities (such as data queries, investment research and analysis, and trade execution) into AI-native standardized tools (MCP Servers). Each tool comes with clear semantic descriptions and standardized input/output formats, which dramatically reduces token consumption when the AI calls them.
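
To make "clear semantic descriptions and standardized input/output formats" concrete, here is a hedged sketch of what one such tool descriptor might look like. It uses the `name`, `description`, and `inputSchema` fields of an MCP tool definition. The tool itself (`query_fund_nav`), its parameters, and the small `missing_arguments` helper are hypothetical examples, not anything Yingmi Fund has published.

```python
import json
from typing import Any, Dict, List

# A tool descriptor with the name / description / inputSchema fields of an
# MCP tool definition. The tool and its parameters are hypothetical.
FUND_NAV_TOOL: Dict[str, Any] = {
    "name": "query_fund_nav",
    "description": "Return the net asset value (NAV) of one fund on one date.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "fund_code": {"type": "string", "description": "Fund identifier."},
            "date": {"type": "string", "description": "Date as YYYY-MM-DD."},
        },
        "required": ["fund_code", "date"],
    },
}


def missing_arguments(tool: Dict[str, Any], arguments: Dict[str, Any]) -> List[str]:
    """List required arguments that the caller did not supply."""
    required = tool["inputSchema"].get("required", [])
    return [name for name in required if name not in arguments]


print(json.dumps(FUND_NAV_TOOL, indent=2))
print(missing_arguments(FUND_NAV_TOOL, {"fund_code": "000001"}))
# prints: ['date']
```

A short, precise descriptor like this lets the model choose the tool and fill in its arguments without a long explanation in the prompt, which is one plausible reading of where the token savings come from.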

Overall, from model routing to scenario solidification, and on to Token-Zip and the packaging of AI-native tools, Yingmi Fund has been building a systematic token cost-control framework. The core of this framework isn’t simply "saving money," but turning every token spent into a value investment that can be calculated, measured, and optimized.

Once you understand that every token is buying you leverage for nonlinear growth, token anxiety truly fades away. "Spending tokens isn’t a bad thing, but throughout the process you must think about how to convert that token spend into incremental business growth in a steady, sustained way," Liang Zhongzhi advised.
