Three of the world's greatest AI minds disagree fundamentally on how to build Artificial General Intelligence. One bets on learning from action. One bets on understanding the world. One bets on mastering language. Only one can be the primary path — or perhaps all three are needed.
Intelligence emerges from interaction with environments. An agent learns by trying actions, receiving rewards, and refining its strategy. Language is just one interface — the real intelligence is in the ability to act, plan, and achieve goals in complex environments.
Build agents that learn from scratch through trial and error in increasingly complex domains. Start with games (AlphaGo), move to science (AlphaFold), then generalize. Combine RL with neural networks, search, and planning. The agent discovers strategies no human ever conceived.
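The loop described above (try an action, receive a reward, refine the strategy) can be sketched with tabular Q-learning. This is a minimal illustration under stated assumptions, not how AlphaGo or any DeepMind system is built: the corridor environment, the names `train_corridor_agent` and `greedy_policy`, and every hyperparameter are made up for this example.

```python
import random


def train_corridor_agent(n_states=6, episodes=500, alpha=0.5,
                         gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D corridor.

    The agent starts in state 0 and earns reward 1.0 only when it reaches
    the last state. Actions: 0 = step left, 1 = step right.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            # Try an action: explore at random with probability epsilon
            # (or when both estimates are tied), otherwise exploit.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            # Receive a reward from the environment.
            reward = 1.0 if next_state == goal else 0.0
            # Refine the strategy: move the estimate toward
            # reward + discounted value of the best next action.
            best_next = 0.0 if next_state == goal else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * best_next
                                         - q[state][action])
            state = next_state
            if state == goal:
                break
    return q


def greedy_policy(q):
    """Best-known action for every non-goal state."""
    return [0 if left > right else 1 for left, right in q[:-1]]


q_table = train_corridor_agent()
policy = greedy_policy(q_table)
```

After training, the estimated value of stepping right grows as the agent gets closer to the goal, which is the whole "strategy" in this toy world. Note how many episodes even a six-state corridor consumes; that is the sample-inefficiency problem in miniature.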
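Proven superhuman performance: AlphaGo defeated world champion Lee Sedol (2016), AlphaFold 2 predicted protein structures at near-experimental accuracy at CASP14 (2020), and AlphaGeometry solved olympiad-level geometry problems (2024). Of these, AlphaGo is the cleanest RL example; AlphaFold is trained with supervised learning, and AlphaGeometry pairs a language model with a symbolic deduction engine. RL combined with search can discover genuinely novel solutions, not just recombinations of training data.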
Extremely sample-inefficient — needs millions of trials. Reward function design is an unsolved problem (reward hacking). Doesn't transfer well across domains. AlphaGo can't make breakfast. Each new domain requires a new agent trained from scratch.
Intelligence = building an internal model of how the world works, then using it to predict, plan, and imagine. A baby learns more about physics in 6 months than any LLM learns from the entire internet. Language is a thin layer on top of deep world understanding.
JEPA (Joint Embedding Predictive Architecture) — learn representations by predicting the future state of the world from sensory input, not by generating pixels. Self-supervised learning from video and multi-modal data. Build an "inner simulator" that can imagine consequences before acting.
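The difference between predicting in representation space and predicting pixels can be shown with a toy calculation. This is only a sketch of the idea, not the JEPA architecture itself: the weights are set by hand rather than learned, a single encoder stands in for the separate context and target encoders that real systems use, and the names `encode`, `predict`, and `embedding_loss` are hypothetical.

```python
def encode(obs, weights):
    """Linear encoder: map a raw observation to a short embedding."""
    return [sum(w * x for w, x in zip(row, obs)) for row in weights]


def predict(embedding, weights):
    """Linear predictor: guess the next embedding from the current one."""
    return [sum(w * z for w, z in zip(row, embedding)) for row in weights]


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def embedding_loss(obs_now, obs_next, enc_w, pred_w):
    """JEPA-style objective: compare prediction and target as embeddings."""
    predicted = predict(encode(obs_now, enc_w), pred_w)
    target = encode(obs_next, enc_w)  # assumption: shared encoder for brevity
    return squared_error(predicted, target)


# Hypothetical 4-"pixel" observations: positions 0-1 carry the object's state,
# positions 2-3 are unpredictable noise (say, a flickering background).
ENC_W = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0]]  # encoder keeps only the state dimensions
PRED_W = [[0.0, 1.0],
          [1.0, 0.0]]           # assumed dynamics: the two state values swap

obs_now = [1.0, 0.0, 0.3, 0.9]
obs_next = [0.0, 1.0, 0.7, 0.1]  # state swapped, noise changed arbitrarily

loss_embed = embedding_loss(obs_now, obs_next, ENC_W, PRED_W)

# A pixel-space predictor that gets the state right still has to guess the
# noise; here it simply carries the old noise values forward.
pixel_guess = [0.0, 1.0, 0.3, 0.9]
loss_pixels = squared_error(pixel_guess, obs_next)
```

In this setup the embedding-space loss is zero because the encoder discards the noise, while the pixel-space loss stays positive no matter how well the dynamics are understood. That is the intuition behind predicting abstract states instead of generating pixels.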
Biologically plausible — this is how animal brains actually work. Sample-efficient learning (babies learn from very few examples). Grounded in physical reality. Could enable common sense, intuitive physics, and the kind of understanding LLMs clearly lack.
Still largely theoretical — no world model has achieved anything close to LLM-level capabilities. JEPA papers show promise but no breakthrough product. The gap between vision and execution is enormous. LeCun has been predicting this for years without a landmark demonstration.
Language is not just an interface — it IS intelligence. Only humans have language. Only humans have advanced intelligence. Language compresses all human knowledge, reasoning, and understanding into tokens. Mastering language is mastering thought itself.
Scale transformer models on internet-scale text. Add RLHF for alignment. Add chain-of-thought for reasoning. Add tool use for acting in the world. Add vision and audio for multimodality. The "scaling hypothesis" — intelligence emerges from sufficient scale and data.
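The recipe above starts from one simple objective: predict the next token. As a hedged illustration, the sketch below swaps the transformer for a bigram count model, the smallest possible next-token predictor. The corpus and the function names `train_bigram`, `next_token_probs`, and `generate` are assumptions for this example only; nothing here reflects how GPT-4, Claude, or Gemini are actually implemented.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    tokens = text.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Turn the follower counts for `prev` into a probability distribution."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()}


def generate(counts, start, length):
    """Greedy decoding: repeatedly append the most frequent follower."""
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break  # no continuation was ever observed for this token
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)


corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
probs_after_the = next_token_probs(model, "the")
sample = generate(model, "on", 2)
```

The scaling hypothesis is the bet that replacing this lookup table with a very large transformer, and this one sentence with internet-scale text, turns the same prediction objective into general capability.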
Actually works TODAY. GPT-4, Claude, and Gemini are among the most capable general-purpose AI systems ever built. They have posted passing scores on bar-exam and medical-licensing tests and answer PhD-level science questions. They generate code, write poetry, and reason about ethics. Hundreds of millions of users. Billions in revenue. The only approach with broad real-world traction.
Hallucinations. No physical grounding. No survival instinct. No drives. You can tell an LLM to make money — it will write text about making money but doesn't want money. A cockroach wants to survive more than any LLM wants anything. Is prediction really understanding?
Even cats and dogs know how to strive for survival — it's written into their genes. Large language models cannot learn this. Even if you prompt them to make money, they appear indifferent. To them, it's all just wordplay. Even if you shut them down, they feel no emotion.
500 million years of evolution gave animals hunger, fear, desire. No AI system has intrinsic motivation. Without drives, there's no genuine agency — only the appearance of it.
Intelligence evolved in bodies that move through physical space. Touch, proprioception, pain, pleasure — these aren't extras, they may be prerequisites. A brain in a vat may never truly understand.
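Humans experience time. We remember the past, fear the future, feel urgency. LLMs have no sense of time passing. Each request is stateless: the model carries nothing from one call to the next unless earlier text is fed back into its context. Without temporal continuity, can there be consciousness?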
The Hard Problem: why does physical processing give rise to "something it is like" to be conscious? None of the three roads even attempts to answer this. Perhaps AGI doesn't require consciousness — but perhaps it does.
| Dimension | RL (Hassabis) | World Models (LeCun) | Language (Altman/Amodei) |
|---|---|---|---|
| Core idea | Learn by doing | Learn by observing | Learn by reading |
| Analogy | An athlete training | A baby watching the world | A scholar reading every book |
| Data source | Environment interaction | Video, sensory streams | Text (internet-scale) |
| Grounding | Action in environments | Physical world perception | Linguistic abstraction |
| Best result | AlphaFold (protein structure prediction) | V-JEPA (early research) | GPT-4 / Claude (general use) |
| Users today | Researchers only | Researchers only | Hundreds of millions |
| Survival instinct | Reward-shaped (artificial) | Not addressed | None |
| Language role | One of many interfaces | Thin surface layer | The core of intelligence |
| Biggest bet | Generalization across domains | JEPA architecture works | Scale is all you need |
| Risk | May never generalize | May never leave the lab | May hit a ceiling without grounding |
Language captures symbolic reasoning. World models capture physical intuition. Reinforcement learning captures goal-directed behavior. The human brain does all three — plus something we haven't identified yet. Perhaps embodiment. Perhaps drives. Perhaps consciousness itself.