TMTPost (钛媒体), 24 minutes ago

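Fei-Fei Li's long-form essay goes viral: hyped text-to-video is only "skin-deep," and the real trillion-dollar opportunity lies elsewhere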

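Fei-Fei Li, a leading figure in computer vision, and her World Labs team have just published an in-depth essay on "world models" (full Chinese and English text appended below), which immediately drew strong attention across the AI industry.

At a time when major vendors are rolling out flashy AI-generated video and talking up "world models," Li makes a pointed observation:

"World model" is today one of the most overused and most semantically overloaded terms in AI.

What does the essay actually say? And for technology practitioners and investors, which trillion-dollar opportunities, ones that could upend the current industry landscape, does it point to?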

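What a real "world model" actually is

Language models gave machines a command of vocabulary and reasoning, but Li points out that the physical world runs on an entirely different substrate: the world is not made of words. A real world model must learn the statistical structure of space and time, such as how light falls on a surface and how objects follow the laws of physics.

To pin the concept down, the essay goes back to the classic agent-action-state-observation loop from reinforcement learning (the POMDP) and sorts today's mixed bag of "world models" into three functional categories: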

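1. Renderer: a visual illusionist that only paints the surface

Most of the text-to-video models now in the spotlight (for example, AI that generates a cinematic drone shot) fall into this category. They output pixel observations meant for human viewers, and what they pursue is visual fidelity. That surface is deceptive: these models carry no explicit understanding of three-dimensional structure. A drone shot can look flawless from above, but try to move through the city below and the structure falls apart.

2. Simulator: the badly underrated "physics engine"

A simulator outputs the underlying state. Its goal is not to look good but to have geometry that holds up under inspection and dynamics that respect Newton's laws. It can serve as a design tool for professionals or as a training ground for autonomous vehicles and robots.

3. Planner: the robot's "action brain"

A planner outputs actions. Given an observation and a goal, it tells the agent (a robot, for example) what to do next. Vision-Language-Action (VLA) models fall into this category.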

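For the tech and venture-capital communities, the essay sends several important signals about where to place bets.

Signal 1: Beware the renderer's critical limitation; the real linchpin is the simulator

The essay carries a clear implicit message: do not be dazzled by flashy AI video. Renderers optimize for visual plausibility rather than physical accuracy, and that ceiling matters: their output cannot be trusted to design a building or train a robot. The linchpin, overlooked by the public yet critical, is the simulator. Simulation is the bridge between visual presentation and action planning: a model that masters it can generate visuals on one side and support a robot's action planning on the other.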

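Signal 2: From trillion-dollar industrial applications to the embodied-intelligence endgame

The essay sketches both today's commercial use cases and tomorrow's opportunities:

Consumer-side gains (renderer applications): the most commercially mature category. Google's Nano Banana model has put renderer-quality image generation in the hands of potentially hundreds of millions of users, and image- and text-to-video tools are expanding rapidly in enterprise and consumer markets.

Trillion-dollar B2B industrial market (simulator applications): an enormous commercial space. NVIDIA's Omniverse platform targets what the company estimates as more than a trillion dollars of addressable market in factories, warehouses, supply chains, and digital twins. Robotics training, autonomous vehicle testing, architectural visualization, engineering, and drug discovery likewise depend on simulation.

The future of embodied robots (planner applications): today's robot demos are mostly confined to controlled lab setups, and the field faces data shortages and the sim-to-real gap. Once planners break through, however, the industry could get general-purpose robots that work reliably in kitchens, warehouses, or operating rooms.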

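Signal 3: Convergence and unification (the ultimate opportunity)

According to the essay, the technical endgame is a unified world model: a single foundation model for physical space that dissolves the boundaries between rendering, simulation, and planning. World Labs has taken a first step with Marble, which accepts multimodal prompts (text, image, video, or spatial sketch) and generates explorable 3D environments, outputting Gaussian splats for visual exploration alongside collision meshes that a physics engine can operate on.

The full text of the essay Li published on Substack follows (Chinese translation first, then the English original), lightly edited by TMTPost: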

《世界模型功能分类学》

"世界是所有发生的事物。" —— 路德维希·维特根斯坦,《逻辑哲学论》,1921年

世界并不是由语言构成的。

在早先的一篇文章中,我们提出空间智能(spatial intelligence)是人工智能的下一个前沿,而世界模型(world models)则是通往这一目标的路径。在此,World Labs 团队和我希望做进一步的深入探讨:在如今众多被构建并被称为"世界模型"的事物中,究竟是哪些功能组件真正构成了这种能力?它们各自的作用又是什么?语言模型赋予了机器对概念、词汇和推理的非凡掌控力,但物理世界(无论是虚拟的还是现实的)运行在一种截然不同的基底上。语言模型学习的是文本的统计结构,而世界模型学习的则是空间和时间的统计结构:光线如何落在物体表面,从一个从未有相机捕捉过的角度看花园是什么样子,以及物体如何对力做出反应并遵循物理定律。

这使得"世界模型"成为当今 AI 领域最重要、同时也是被过度使用的术语之一。计算机视觉、机器人学、强化学习和生成式 AI 都声称在构建世界模型,但它们各自所指的含义却大相径庭。一个能生成华丽却不符合物理规律的火焰的视频模型,一个能即兴生成可玩游戏的语言模型,以及一个能忠实模拟燃烧过程的物理引擎,都在使用着同一个名称。

古希腊人对于世界是由什么构成的(究竟是火、水还是不可分割的原子)从未达成共识,因为"世界"从来就不是一个单一的事物。它始终是特定思想家在进行推理时所需要的某种"整体性"的代名词。人工智能继承了同样的问题,而这恰恰发生在该领域最需要精确性的时刻。

分类背后的逻辑循环

要拨开这些迷雾,我们需要从一张比上述任何技术都要古老的图表开始。几十年来,包括萨顿(Sutton)和巴托(Barto)的经典教材在内的强化学习教科书,一直使用类似版本的图表来描述智能体(agent)如何与世界交互。这张图的正式名称是"部分可观测马尔可夫决策过程"(Partially Observable Markov Decision Process,简称 POMDP),而"世界模型"一词的最初定义正是源于这一传统。

智能体(可以是一个人、一个机器人或一个软件系统)会采取行动(actions)。这些行动会影响世界的状态(state)。智能体永远无法直接看到这种状态。智能体所接收到的是观测值(observations):落在视网膜上的光子、传感器的读数,以及视频帧中的像素。新的观测值指导新的行动,这一循环由此不断继续。

我们需要对"状态(state)"一词进行剖析,因为它的含义在不同领域中有所变化。这不是化学家眼中的状态(如固态、液态和气态的区别)。这是物理学家和机器人学家眼中的状态:对特定时刻世界中正在发生的事情的完整描述,包括每一个物体、每一个位置、每一个速度和每一种属性。状态是世界底层的现实;它在原则上是完整的,但对其内部的任何智能体来说,永远无法直接全景可见。观测值是智能体对该现实的局部视图,而行动则是智能体做出的回应。

这个从智能体到行动、再到状态、再到观测值并循环往复的结构,赋予了现代术语"世界模型"其技术含义。这个词组本身的历史更为悠久,可追溯至肯尼斯克雷克(Kenneth Craik)1943 年提出的理论:人类心智通过运行现实的"微缩模型"进行推理,并在 20 世纪 80 年代末和 90 年代初被引入神经网络。这一循环也解释了人们如今使用该术语时的真正含义。目前那些被称为世界模型的各种不同事物,实际上都是这个同一循环的不同投影。它们各自输出该循环中的不同部分。

世界模型的三个功能

第一种世界模型是渲染器(Renderer)

渲染器以人类肉眼可见的像素形式输出观测值,其最重要的品质是视觉保真度。一个将文本提示转化为电影级无人机镜头的视频模型就是一个渲染器。像谷歌的 Genie 3 或 World Labs 自己的 RTFM 这样的交互式系统也是如此,模型会在用户输入的条件下实时生成画面帧。这类模型并不具备对三维结构的显式理解。它生成的是观察者"将会看到的"景象,而非"实际存在的"实体。无人机镜头中的建筑物从上方看可能完美无瑕,但如果试着在下方的城市中穿行,它们就会分崩离析。

第二种是模拟器(Simulator)

模拟器输出的是状态:一种在几何、物理或动力学上高度忠实的对世界的表征,人类和计算机程序都可以对其进行计算和交互。如果说渲染器的契约纯粹是视觉层面的,那么模拟器的契约则是结构层面的,它要求几何结构经得起推敲,物理机制遵循牛顿定律,动力学表现符合物理定律下世界应有的运作方式。模拟器同时服务于两类用户群体。建筑师、设计师、电影制作人和游戏开发者等人类专业人士,需要超越纯粹视觉合理性的精确度;而强化学习智能体、机器人控制器和自动驾驶汽车等计算机程序,则将模拟器作为训练场,使其能够大规模地与世界进行交互,测试那些在现实中危险、昂贵或根本无法执行的场景。

第三种是规划器(Planner)

规划器输出的是行动。给定一个观测值和一个目标,规划器能够回答"智能体接下来应该做什么"的问题。在许多方面,这正是渲染器的逆过程。渲染器以行动为输入并产生观测值,而规划器则以观测值为输入并产生行动,从而闭合了感知-行动循环。视觉-语言-行动(VLA)模型、基于模型的系统,以及新一波的世界行动模型(World Action Models),都是对规划器的尝试:这些系统旨在决定机器人在非结构化世界中该做什么。

这三个类别涵盖了目前实际落地的绝大多数产品,它们之间的区分在实践中非常有用。然而,这些类别在本质上并不是完全割裂的。它们底层都基于关于世界如何运作的相同知识——几何学、物理学、动力学。一个能够从任何角度渲染一个杯子的模型,原则上应当能够模拟推开杯子时会发生什么,并规划出用手去拿起杯子的动作。越来越多最有趣的研究,都在刻意模糊这三者之间的界限。

为什么模拟是核心枢纽

在这三个类别中,模拟器获得的公众关注最少,但却是三者中最重要的一环。本文正是为了探讨这种不对称性。

渲染器是目前商业化最成熟的。许多图像或文本到视频(text-to-video)的产品正在消费者或企业市场中快速扩张。谷歌的 Nano Banana 模型已经将渲染器级别的图像生成技术交到了数以亿计的用户手中。技术是真实的,市场也是真实的。然而,渲染器优化的是视觉上的合理性,而非物理上的准确性,而这个上限是非常致命的。它们的输出虽然精美,但你不能指望用它们来设计建筑或训练机器人。

规划器是最具吸引力也是最处于起步阶段的,它与快速发展的机器人学习领域密切相关。过去两年里,该领域展示了一些在视频中看起来令人印象深刻的机器人演示,但我们必须坦诚地看待这些演示的实际内容。几乎所有的演示都被限制在高度受控的实验室环境中,处理的物体种类有限,且任务周期很短。没有一个在实际部署所需的复杂性、可变性或持续时长上得到过验证。从一段引人注目的演示视频,到一台能在厨房、仓库或手术室里可靠工作的机器人,两者之间的鸿沟依然巨大。尽管如此,这一领域的商业押注十分庞大。一波资金充裕的新入局者正竞相推出通用规划系统,而大型基础设施巨头们则将规划能力构建在更广泛的模拟栈之上。一个能够进行规划的机器人就是一个能够实际工作的机器人,整个行业都在竞相成为第一个跨过这道终点线的赢家。

模拟是连接这两者的桥梁。如果说语言是对世界的抽象,像素是对世界的投影,那么几何学、物理学和动力学就是世界本身。模拟器必须在这个层面上运作:它作为结构性的骨干,既能衍生出视觉外观(供渲染器使用),也能推导出行动后果(供规划器使用)。一个掌握了模拟的模型,可以将其对世界的理解投射成像素供人类观察,也可以投射成行动预测供具身智能体执行。而一个仅仅掌握了渲染或仅仅掌握了规划的模型,是做不到这两点的。这里的商业空间是巨大的。单单是英伟达(NVIDIA)的 Omniverse 平台,就瞄准了该公司估计超过万亿美元规模的潜在市场,涵盖工厂、仓库、供应链和数字孪生等领域。机器人训练、自动驾驶测试、建筑可视化、工程设计以及药物研发,都依赖于具备模拟形态的技术。

该领域最困难的未解难题也都集中于此。带有明确几何形状、材料属性和物理标注的三维数据,比渲染器用于训练的互联网视频要稀缺几个数量级。"从模拟到现实(sim-to-real)"的鸿沟——即事物在模拟中的行为与在现实中的行为之间的差异——依然存在。生成式模拟器在此之上还引入了新的风险:AI 生成的几何体可能看起来是正确的,但却包含了自相交或错误的比例,从而产生荒谬的物理现象。而包含刚体、可变形物体、流体和布料相互作用的大规模多物理场模拟,其计算成本依然比单一领域的模拟高出几个数量级。

在 World Labs,Marble 是我们向这一领域迈出的第一步。它接受多模态提示(文本、图像、视频或空间草图)并生成可探索的 3D 环境,同时输出用于视觉探索的高斯泼溅(Gaussian splats)以及物理引擎可运算的碰撞网格(collision meshes)。但这仅仅是横跨整个领域正在书写的漫长篇章的序章,因为渲染、模拟和规划之间的界限已经开始消融。

边界消融与未来展望

但这还只是开始。目前该领域最重要的趋势是这三大类别正开始相互融合。大家形成的一个共识是:渲染一个世界、模拟一个世界以及在其中采取行动所需要的知识在很大程度上是相同的。继续前面的例子:一个真正理解杯子如何放置在桌子上的模型(它的几何形状、材料属性、受力反应等),应当能够从任何角度渲染那个杯子,模拟当它被推开时会发生什么,并规划出一只手去拿起它的动作。这三个类别不过是对同一种底层理解的三个投影。

例如:近期越来越多来自各个机器人实验室的研究表明——至少在概念层面上——一个预训练的视频渲染器可以用作联合预测世界与动作的基础骨干,它通过让一个模型去想象"将会发生什么"以及"该怎么做",从而在渲染器和规划器之间架起了一座桥梁。World Labs 的 Marble 已经能够从单一模型中同时输出高斯泼溅和碰撞网格,消解了渲染器与模拟器之间的边界。每一个层级都在从被动输出向交互式系统转变:渲染器变得受行动条件控制(action-conditioned),模拟器生成的环境变得更具可控性和可编辑性,而规划器开始进行深思熟虑的推演而不再是仅仅做出被动反应。

它的逻辑终点是一个统一的世界模型:一个能够渲染逼真视图、生成准确物理结构,并规划动作序列的基础模型(foundation model),它能够根据下游消费者的需求在不同的输出模态之间灵活切换。我们依然面临诸多严峻的挑战。数据版图是不均衡的:渲染器拥有海量的互联网视频资源,而模拟器和规划器则面临 3D 资产和机器人演示数据极度短缺的问题。一味优化视觉美感可能会牺牲机器人或高保真模拟所需的精确度。在单一架构内调和这些矛盾,是当今世界模型研究中极具决定性的开放难题,而这也正是 World Labs 在不断迭代 Marble 的过程中力求解决的目标。

然而,前进的方向是清晰的。自 20 世纪 80 年代末以来,整个领域一直在押注这一点——一个足够丰富的世界模型,就是任何智能体去观察世界、构建世界并在其中采取行动所需要的全部知识底座——如今,这一信念正在推动整整一代人的研究。而赋予这场"豪赌"分量的,是目前正在发生的融合:三条最初各自独立的研发主线(其任何一条都足以驱动和塑造数十亿美元规模的产业),现在正开始汇聚为一。综合来看,随着它们之间边界的消融,它们将重塑一个更加宏大的图景:机器智能与其所栖息的物理世界之间的关系——这正是空间智能的漫长征程。

语言赋予了机器谈论这个世界的能力。而世界模型,则是机器最终去理解、想象、推理并与这个世界进行交互的必由之路。

English original:

A Functional Taxonomy of World Models

"The world is everything that is the case."

- Ludwig Wittgenstein, Tractatus Logico-Philosophicus, 1921

The world is not made of words.

In an earlier essay, we argued that spatial intelligence is AI’s next frontier and that world models are the path to it. Here, the World Labs team and I want to go one level deeper: of the many things now being built and called ‘world models,’ which functional pieces actually compose that capacity, and what is each one for?

Language models have given machines an extraordinary command of concepts, vocabulary, and reasoning, but the physical world, virtual or real, runs on a different substrate. Where language models learn the statistical structure of text, world models learn the statistical structure of space and time: how light falls on a surface, how a garden looks from an angle no camera has captured, how objects respond to force and follow the laws of physics.

That makes "world model" one of the most important and most overloaded terms in AI today. Computer vision, robotics, reinforcement learning, and generative AI each claim to be building world models, and each means something quite different. A video model that produces gorgeous but physically impossible flames, a language model improvising a playable game, and a physics engine that faithfully simulates combustion all go by the same name.

The ancient Greeks could never agree on what the world was made of, whether fire, water, or indivisible atoms, because "world" was never a single thing. It was always a stand-in for whatever totality a given thinker needed to reason about. AI has inherited the same problem, at exactly the moment when the field needs precision.

The loop beneath the taxonomy

Cutting through that confusion starts with a diagram older than any of the technology in question. Reinforcement learning textbooks, including the canonical Sutton and Barto, have used a version of the same picture for decades to describe how an agent interacts with a world. The formal name for this picture is the partially observable Markov decision process, or POMDP, and the original definition of the term "world model" belongs to that tradition.

An agent, which can be a person, a robot, or a software system, takes actions. Those actions affect the state of the world. The agent never sees the state directly. What reaches the agent are observations: the photons that fall on a retina, the readings from a sensor, and the pixels in a video frame. New observations inform new actions, and the loop continues.

The word "state" needs unpacking, because the meaning shifts from field to field. This is not the chemist’s state, the difference between solid, liquid, and gas. This is the physicist’s and roboticist’s state: a complete description of what is happening in the world at a given moment, including every object, every position, every velocity, every property. State is the underlying reality of the world; complete in principle, but never directly visible to any agent inside it. Observations are an agent’s partial view of that reality. Actions are what the agent does in response.
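The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not anything from the essay: the one-dimensional cart world, the noise level, and the names `State`, `step`, `observe`, `policy`, and `run_loop` are all assumptions made up for the example. The point it shows is that the agent only ever touches observations, while the state stays inside the world.

```python
import random
from dataclasses import dataclass


@dataclass
class State:
    """Full description of the toy world (hypothetical: a cart on a line)."""
    position: float
    velocity: float


def step(state: State, action: float, dt: float = 0.1) -> State:
    """World dynamics: the action is an acceleration applied for dt seconds."""
    velocity = state.velocity + action * dt
    position = state.position + velocity * dt
    return State(position, velocity)


def observe(state: State, rng: random.Random) -> float:
    """Partial, noisy view: the agent sees position only, never velocity."""
    return state.position + rng.gauss(0.0, 0.01)


def policy(observation: float, goal: float) -> float:
    """Toy agent: accelerate toward the goal, clipped to [-1, 1]. Not tuned."""
    return max(-1.0, min(1.0, goal - observation))


def run_loop(steps: int = 50, goal: float = 1.0, seed: int = 0) -> list[float]:
    """Agent -> action -> state -> observation, repeated `steps` times."""
    rng = random.Random(seed)
    state = State(0.0, 0.0)          # hidden from the agent
    observations = []
    for _ in range(steps):
        obs = observe(state, rng)    # what reaches the agent
        observations.append(obs)
        action = policy(obs, goal)   # the agent's response
        state = step(state, action)  # the action changes the state
    return observations
```

Calling `run_loop()` returns only the sequence of observations the agent received. The velocity component of the state never appears in it, which is the "partially observable" part of the POMDP.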

This loop — agent to action to state to observation and back — is the structure that gave the modern term "world model" its technical meaning. The phrase itself is older, traced to Kenneth Craik’s 1943 proposal that minds reason by running "small-scale models" of reality, and carried into neural networks by the late 1980s and early 1990s. And the loop also explains what people mean by the term today. The different things now being called world models are in fact different projections of this same loop. Each one outputs a different piece of it.

Three functions of a world model

The first kind of world model is a renderer. A renderer outputs observations in the form of pixels meant for human eyes, and the quality that matters most is visual fidelity. A video model that turns a text prompt into a cinematic drone shot is a renderer. So is an interactive system like Google’s Genie 3, or World Labs’ own RTFM, where the model generates frames in real time conditioned on user input. The model carries no explicit understanding of three-dimensional structure. It produces what a viewer would see, not what is. The buildings in the drone shot may look flawless from above, but try to drive through the city below and they fall apart.

The second kind is a simulator. A simulator outputs state: a geometrically, physically or dynamically faithful representation of the world that humans and computer programs can both compute on and interact with. Where the renderer’s contract is purely visual, the simulator’s contract is structural, demanding geometry that holds up under inspection, physics that respects Newton’s laws, and dynamics that behave the way the world needs to behave given the laws of physics. A simulator serves two consumers at once. Human professionals such as architects, designers, filmmakers, and game developers need accuracy beyond visual plausibility. Computer programs such as reinforcement learning agents, robot controllers, and autonomous vehicles use simulators as training grounds where they can interact with the world at scale, testing scenarios that would be dangerous, expensive, or impossible to run in reality.

The third kind is a planner. A planner outputs actions. Given an observation and a goal, a planner answers the question of what the agent should do next. This is, in many ways, the inverse of the renderer. Where a renderer takes actions as input and produces observations, a planner takes observations as input and produces actions, closing the perception-action loop. Vision-Language-Action models, model-based systems, and the new wave of World Action Models are all attempts at planners: systems that can decide what a robot should do in an unstructured world.

These three categories describe most of what is actually shipping today, and the distinction between them is useful in practice. The categories are not, however, fundamentally separate. The same underlying knowledge of how the world works—geometry, physics, dynamics—sits beneath all of them. A model that can render a cup from any angle ought, in principle, to be able to simulate what happens when the cup is pushed and plan a hand to pick the cup up. Increasingly, the most interesting research deliberately blurs the boundaries between the three.
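One way to make the three roles concrete is to write them as functions with different output types over the same toy world. The sketch below is an illustration only: the eight-cell "table", the `State` class, and the functions `simulate`, `render`, `plan`, and `drive_to` are hypothetical names invented here. Note one deliberate simplification: this `render` derives its picture from an explicit state, which is exactly what the essay says real video renderers lack.

```python
from dataclasses import dataclass

WIDTH = 8  # hypothetical table, eight cells wide


@dataclass(frozen=True)
class State:
    cup_x: int  # cell index of the cup


def simulate(state: State, action: int) -> State:
    """Simulator: (state, action) -> next state. The cup cannot leave the table."""
    return State(max(0, min(WIDTH - 1, state.cup_x + action)))


def render(state: State) -> str:
    """Renderer: outputs an observation meant for eyes ('U' marks the cup)."""
    return "".join("U" if i == state.cup_x else "." for i in range(WIDTH))


def plan(observation: str, goal_x: int) -> int:
    """Planner: (observation, goal) -> action in {-1, 0, +1}."""
    cup_x = observation.index("U")
    return (goal_x > cup_x) - (goal_x < cup_x)


def drive_to(goal_x: int, start: State, max_steps: int = 20) -> State:
    """Close the loop: render, plan, simulate, until the planner says stop."""
    state = start
    for _ in range(max_steps):
        action = plan(render(state), goal_x)
        if action == 0:
            break
        state = simulate(state, action)
    return state
```

In this toy setting all three functions rest on the same piece of knowledge, where the cup is and how it moves, which is the sense in which the categories are not fundamentally separate.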

Why simulation is the linchpin

Of the three categories, the simulator gets the least public attention, and is the most consequential of the three. This essay addresses this asymmetry.

The renderer is by far the most commercially mature. A number of image- or text-to-video products are expanding in the consumer or enterprise markets rapidly. Google’s Nano Banana model has put renderer-quality image generation in the hands of potentially hundreds of millions of users. The technology is real, and the markets are real. Yet renderers optimize for visual plausibility rather than physical accuracy, and that ceiling matters. Their outputs are beautiful, but they cannot be trusted to design a building or train a robot.

The planner is the most intriguing and the most nascent, closely connected to the rapidly evolving field of robotic learning. The field has produced robotic demos in the last two years that look impressive in videos, but candor is required about what those demos actually show. Almost all have been confined to heavily constrained laboratory setups, with narrow object sets and short task horizons. None have been validated at the complexity, variability, or duration that real-world deployment demands. The gap between a compelling demo reel and a robot that reliably works in a kitchen, a warehouse, or an operating room remains vast. The commercial bets are nonetheless substantial. A wave of well-funded entrants is racing to ship general-purpose planning systems, while the largest infrastructure players are positioning planning atop broader simulation stacks. A robot that can plan is a robot that can work, and the entire industry is racing to be the one that gets there first.

Simulation is the bridge between the two. If language is an abstraction of the world and pixels are a projection of it, then geometry, physics, and dynamics are the world itself. A simulator must work at that level: the structural backbone from which both visual appearance (for renderers) and action consequences (for planners) can be derived.

A model that masters simulation can project its understanding into pixels for human consumption, and into action predictions for embodied agents. A model that masters only rendering, or only planning, cannot do either. The commercial surface area is enormous. NVIDIA’s Omniverse alone targets what the company estimates as more than a trillion dollars of addressable market in factories, warehouses, supply chains, and digital twins. Robotics training, autonomous vehicle testing, architectural visualization, engineering, and drug discovery all depend on something simulation-shaped.

The hardest open problems in the field live there too. Three-dimensional data with explicit geometry, material properties, and physical annotations is orders of magnitude scarcer than the internet video that renderers train on. The sim-to-real gap, which is the difference between how things behave in simulation and how they behave in reality, persists. Generative simulators introduce a new risk on top of that: AI-generated geometry can look correct while containing self-intersections or wrong scale that produce nonsensical physics. Multi-physics simulation at scale, where rigid bodies, deformable objects, fluids, and cloth all interact, remains orders of magnitude more expensive than single-domain simulation.
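A small worked example shows why wrong scale alone is enough to produce nonsensical physics. The numbers are assumptions chosen for illustration: a table 0.75 m high, and the same table exported in centimetres but read by a physics engine as 75 m. The function `fall_time` is a hypothetical helper that applies the standard free-fall formula t = sqrt(2h/g), ignoring air resistance.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def fall_time(height_m: float) -> float:
    """Seconds for an object dropped from rest to fall height_m (no drag)."""
    return math.sqrt(2.0 * height_m / G)


table_height_m = 0.75     # assumed real-world table height
mis_scaled_height = 75.0  # same mesh, centimetres misread as metres

t_ok = fall_time(table_height_m)      # roughly 0.39 s
t_bad = fall_time(mis_scaled_height)  # roughly 3.9 s

print(f"correct scale: {t_ok:.2f} s, wrong scale: {t_bad:.2f} s")
```

The two meshes would render identically once the camera is scaled to match, yet a cup knocked off the mis-scaled table takes ten times as long to hit the floor, since fall time grows with the square root of height.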

At World Labs, Marble is our first move into this territory. It takes multimodal prompts (text, image, video, or spatial sketch) and generates explorable 3D environments, outputting Gaussian splats for visual exploration alongside collision meshes a physics engine can operate on. But Marble is only the first chapter of a much longer arc being written across the field as the lines between rendering, simulation, and planning begin to collapse.

Where the boundaries are collapsing and what comes next

But more is to come. The most important pattern in the field right now is that the three categories are starting to blend into one another. The shared insight is that the knowledge required to render a world, simulate it, and act in it is largely the same. Continuing the earlier example, a model that truly understands how a cup sits on a table (its geometry, material properties, response to force, etc.) should be able to render that cup from any angle, simulate what happens when the cup is pushed, and plan for a hand to pick the cup up. The three categories are three projections of a single underlying understanding.

For example, a small but growing body of recent work from various robotics labs has demonstrated that, at least conceptually, a pretrained video renderer can be used as the backbone for joint world-and-action prediction, suggesting a bridge between the renderer and the planner by letting one model imagine what will happen and what to do. World Labs’ Marble already outputs Gaussian splats and collision meshes from a single model, dissolving the boundary between the renderer and the simulator. Every level is moving from passive output to interactive system, with renderers becoming action-conditioned, simulators generating worlds that are more controllable and editable, and planners deliberating rather than just reacting.

The logical endpoint is a unified world model: one foundation model that can render photorealistic views, produce physically accurate structure, and plan action sequences, switching between output modalities depending on what the downstream consumer needs. We will still face a number of daunting challenges. The data picture is uneven, with renderers awash in internet video while simulators and planners face acute shortages of 3D assets and robot demonstrations. Optimizing for visual beauty can sacrifice the precision a robot or a high-fidelity simulation needs. Reconciling these tensions inside a single architecture is the defining open problem in world model research today, and this is what World Labs sets out to do as we continue to evolve Marble.

The direction, however, is clear. The same bet the field has been making since the late 1980s — that a sufficiently rich model of the world is all that any agent needs to see worlds, build them, and act in them — is the bet now driving an entire generation of research. What gives that "big bet" weight is the convergence already underway: three threads, each already driving and shaping multi-billion-dollar industries on its own, that began as separate research programs are starting to behave like one. Taken together, as the boundaries between them collapse, they will reshape something larger: the relationship between machine intelligence and the physical world it inhabits - the long arc of spatial intelligence.

Language gave machines a way to talk about that world. World models are how machines will finally come to understand, imagine, reason and interact with it.

(First published on the TMTPost app. Author: 硅谷Tech_news; Editor: 林深)
