ChatGPT's Core Technologies, Seen From the Application Layer 应用视角下 ChatGPT 背后的关键技术讨论

2023-02-26 · Chinese original on Zhihu 知乎原文

A 2023 application-builder's reading of ChatGPT: emergence, alignment, prompting, in-context learning, hallucination, multimodality, and the coming wave of AI-native applications.

从应用构建者视角讨论 ChatGPT 背后的关键技术：涌现、alignment、prompt、in-context learning、事实性、多模态和应用创新。

English edition adapted for native English readers; Chinese text follows the original Zhihu source. 英文版按英文读者习惯重写整理，中文版保留知乎原文。

This piece came from a period of intense discussion around large language models. Rather than write another model-centric explainer, I wanted to look at ChatGPT from the position I knew better: someone who had worked across deep learning, engineering, and internet products.

Large language models did not appear out of nowhere. To understand ChatGPT, you have to trace roughly a decade of deep-learning progress, especially in natural language processing. But for application builders, the more important question is not only how the model works. It is why the product suddenly works.

Start with the strange phenomenon of emergence

Deep learning has always been criticized for weak theory. But in the history of science, useful applications often arrived before complete theory. Major unexplained phenomena are sometimes the opening for new theory.

Emergent ability chart for large language models — The original post used emergence as the entry point for understanding LLMs.

The mysterious part of LLMs is emergent ability. Researchers still do not have a fully satisfying explanation for why performance can jump sharply once scale passes a threshold, around the 10^22 mark on the chart in the original post. If we understood this well, it might even reshape the old debate between statistical and symbolic views of intelligence.

LeCun's criticism of ChatGPT-as-AGI came from a familiar position: statistical methods should not be enough for general intelligence. Yet emergence hangs over that argument like a cloud. If similar effects appeared in vision or multimodal systems, the implications would be much larger than a commercial race between model providers.

There are several possible explanations. Maybe our evaluation metrics are too discontinuous, and the capability is present earlier than the score suggests. Maybe some knowledge and reasoning patterns are learned incorrectly at small scale, then corrected only when the model becomes large enough. Or maybe scale really does produce a qualitative change in a sufficiently complex learned distribution.
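The first explanation, that a discontinuous metric can hide steady progress, is easy to see in a toy calculation. The sketch below assumes a task scored by exact match over a 10-token answer and assumes token errors are independent; the per-token accuracies are invented for illustration and are not measurements from any model.

```python
# Toy illustration: a smooth per-token metric vs. an all-or-nothing metric.
# The accuracy values are made up; only the shape of the curve matters.

def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token of the answer is correct,
    assuming token errors are independent (a simplifying assumption)."""
    return per_token_accuracy ** answer_length

ANSWER_LENGTH = 10
per_token_accuracies = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

for p in per_token_accuracies:
    em = exact_match_rate(p, ANSWER_LENGTH)
    print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

Per-token accuracy rises steadily, while exact match stays near zero for most of the range and then climbs steeply. Under these assumptions, a curve that looks like a sudden jump can come from a capability that was improving all along.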

U-shaped scaling behavior illustration — A figure from the original post discussing non-smooth capability curves.

Alignment is the key to productization

Compute, data, and algorithms explain much of AI progress, but productization depends heavily on alignment. This may be one of the places where OpenAI was ahead of the industry in product judgment.

I am using “alignment” here in a practical product sense: the methods that make a model's latent capability line up with what users actually want. Prompting, in-context learning, chain-of-thought style interaction, and RLHF all belong to this broader product story.

Prompting as the UI/UX of the AI era

Many people compared prompts to a new kind of UI/UX. That comparison is useful. Prompting was not merely a research trick for matching downstream tasks to pretraining. It became a way for users to expose and steer model capabilities.

At first, "in-context learning" looked like a term coined mainly to set the idea apart from zero-shot learning and meta-learning. But later work showed something more surprising: demonstrations containing errors often did not hurt performance much, while demonstrations drawn from a distribution unlike the target task did. The prompt was not simply a label. It was a context-setting interface.
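As a concrete picture of what "demonstrations in the prompt" means, here is a minimal sketch of few-shot prompt construction. The `Review:` / `Sentiment:` template, the function name, and the example reviews are all hypothetical; no model or API is called.

```python
def build_few_shot_prompt(examples, query):
    """Format (input, label) demonstrations followed by the unanswered query.
    The 'Review:' / 'Sentiment:' template is a hypothetical choice."""
    blocks = [f"Review: {text}\nSentiment: {label}\n" for text, label in examples]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n".join(blocks)

# Invented demonstrations; they come from the same distribution as the query.
demos = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
]
prompt = build_few_shot_prompt(demos, "Setup took five minutes and just worked.")
print(prompt)
```

The model's weights never change here. Everything the demonstrations contribute arrives through the context window, which is why their format and distribution matter so much.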

In-context learning illustration — Prompting and in-context learning became part of the product interface.

The same applies to chain-of-thought style prompting. A model that seems weak at reasoning can improve noticeably when the interaction asks it to proceed step by step. Before we understand the mechanism deeply, AI researchers often look like alchemists: trying different spells to summon capability from a system we do not fully understand.
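The "step by step" effect needs very little machinery. A minimal sketch of zero-shot chain-of-thought prompting follows; the phrase "Let's think step by step" is the one quoted in the original Chinese post, while the `Q:` / `A:` template and the function name are assumptions made for illustration.

```python
COT_TRIGGER = "Let's think step by step."

def build_zero_shot_cot_prompt(question: str) -> str:
    """Append a reasoning trigger so the model writes out intermediate steps
    before its final answer. The Q:/A: layout is a hypothetical template."""
    return f"Q: {question}\nA: {COT_TRIGGER}"

print(build_zero_shot_cot_prompt(
    "A shop sells pens in packs of 12. How many pens are in 7 packs?"
))
```

Nothing about the model changes between the plain prompt and this one; only the interaction does.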

ChatGPT found a better alignment surface: GPT-3.5 plus RLHF, wrapped in a dialogue product. That does not mean the full capability of LLMs has been unlocked. It means interaction design became part of the model's effective intelligence.

Will LLMs look like search engines or cloud computing?

One important business question is whether large models will resemble Google-style search dominance or AWS-style infrastructure competition. My intuition is closer to the AWS analogy: one company may lead, but multiple strong providers can still exist.

Search and recommendation systems did not truly understand content. They mined user behavior and distribution feedback. In the LLM era, models begin to understand and generate content itself. That weakens some old supply-side moats and changes how distribution may work.

ChatGPT was the best product at the time and had strong user feedback, but the underlying LLM technology was not locked inside OpenAI. Google and Meta also had users, talent, and infrastructure. It was reasonable to expect serious competitors.

Hallucination and external memory

ChatGPT can produce factual errors, and its training data has a cutoff date. In production, it can extend what professionals are able to do, but it cannot simply replace expertise. The model was especially strong in technical domains partly because the web contains a large amount of high-quality programming and IT material.

After GPT-3, much work studied how models store, modify, and correct knowledge. Some research viewed the transformer's feed-forward layers as a kind of key-value memory. Other work tried to update specific facts through constrained optimization without damaging unrelated knowledge.
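The key-value reading of a feed-forward layer can be made concrete. In a layer of the form FFN(x) = W2 · relu(W1 · x), each row of W1 can be read as a key that is matched against the input, and each corresponding column of W2 as a value that is added to the output in proportion to the match. The sketch below is a toy version in pure Python, with biases omitted and hand-picked weights chosen only to make the behavior visible.

```python
def relu(x: float) -> float:
    return max(0.0, x)

def ffn_as_memory(x, keys, values):
    """Compute sum_i relu(x . key_i) * value_i.
    This equals W2 @ relu(W1 @ x) when the keys are the rows of W1
    and the values are the columns of W2 (biases omitted)."""
    out = [0.0] * len(values[0])
    for key, value in zip(keys, values):
        match = relu(sum(a * b for a, b in zip(x, key)))  # how strongly the key fires
        for j, v in enumerate(value):
            out[j] += match * v
    return out

# Two hand-picked memory slots: 2-dimensional keys, 3-dimensional values.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0, 0.0], [0.0, 5.0, 0.0]]

print(ffn_as_memory([2.0, -1.0], keys, values))  # only the first key fires
```

With the input `[2.0, -1.0]`, the first key matches with strength 2 and the second is zeroed by the ReLU, so the output is twice the first value vector. Editing a stored fact, on this view, amounts to changing a small number of such value vectors.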

From an energy and system-design perspective, I think LLMs should rely less on memorizing every factual detail and more on reasoning over external knowledge. Retrieval-augmented approaches, DeepMind's RETRO, vector databases, LangChain, GPTIndex, and the early new Bing all pointed in that direction.
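The pattern these systems share is: retrieve relevant text first, then ask the model to answer from it. Below is a minimal sketch of that loop. The bag-of-words "embedding" stands in for a learned embedding model and a vector database, the documents are invented, and the prompt template and function names are hypothetical; this is not how RETRO, LangChain, or Bing are implemented.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, documents: list, k: int = 1) -> list:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_grounded_prompt(query: str, documents: list, k: int = 1) -> str:
    """Put the retrieved text in the prompt so the model reasons over it
    instead of relying on memorized facts."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )

# Invented knowledge base for illustration.
docs = [
    "The support desk is open from 9 am to 6 pm on weekdays.",
    "Refunds are processed within five business days.",
]
print(build_grounded_prompt("When is the support desk open?", docs))
```

The knowledge base can be updated without retraining anything, which is the practical appeal of keeping facts outside the model.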

When will multimodality really arrive?

I believe multimodal large models are a prerequisite for AGI. Humans learn in a physical world; text is already an abstraction. Vision provides stronger anchors in physical regularities, which may help models learn more fundamental concepts.

But the path is not straightforward. CLIP was useful, but more like BERT than GPT-3. ViT was promising, but the tokenization problem in vision is different from language: text tokens carry semantic structure in a way image patches do not. This is part of why diffusion models became so effective in image generation while transformer-style sequence modeling faced different constraints.

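Another guess: truly large multimodal models will need sparsity. If we loosely compare parameter counts with the roughly 100 trillion synapses in the human brain, GPT-3, at 175 billion parameters, was still hundreds of times smaller; the original post rounds the gap to about a thousandfold. Scaling further while keeping inference cost manageable likely requires sparse architectures and new infrastructure.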
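One common way to make a model sparse is mixture-of-experts routing, where each input activates only a few of many sub-networks. The essay itself does not name a specific design, so treat the choice of mixture-of-experts here as an assumption. The sketch uses scalar "experts" and fixed router scores purely to show the mechanism.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    top = sorted(range(len(router_scores)),
                 key=lambda i: router_scores[i], reverse=True)[:k]
    weights = softmax([router_scores[i] for i in top])
    return dict(zip(top, weights))

def sparse_layer(x, experts, router_scores, k=2):
    """Evaluate only the selected experts; the others cost nothing at inference."""
    routes = top_k_route(router_scores, k)
    return sum(weight * experts[i](x) for i, weight in routes.items())

# Four toy experts; in a real model each would be a feed-forward network.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
scores = [0.1, 2.0, 2.0, -1.0]  # hypothetical router outputs for one input

print(top_k_route(scores, k=2))
print(sparse_layer(3.0, experts, scores))
```

Total parameter count grows with the number of experts, but the compute per input grows only with `k`. That decoupling is what makes scaling by further orders of magnitude conceivable.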

A new era of application innovation

Media often ask which jobs will be affected by AI. The better question may be: which ones will not? I was not fully optimistic about AGI, but the capabilities shown by ChatGPT and diffusion models were broad enough that most industries should take them seriously.

This wave changes human-computer interaction. We will see a generation of applications whose primary interface is natural language. For the first time, machines can interpret human intent with this level of detail, across multiple rounds, with each interaction shaped by context.

Think about Office, Photoshop, or video editing tools. Learning them often means learning a graphical programming language for instructing a computer. If users can express intent directly in natural language, many categories of software can be rebuilt.

That does not mean every AI application will succeed. Every technology has boundaries, and we do not yet know where they are. Many early AI apps will simply wrap an API without a durable moat. The deeper opportunities are in workflow design, alignment with user intent, and surviving long enough for the next maturity cycle.

Other questions worth tracking

Data quality is often misunderstood. Many people still think model training mainly requires labeled data, while the NLP scaling story depended heavily on self-supervised objectives such as masked language modeling.
Compute still matters. GPUs face physical limits too, but parallel workloads have allowed rapid growth. In the large-model era, demand for compute is obvious; the open question is whether supply costs fall as quickly as people expect.
New optimization algorithms may matter. Some researchers, including Hinton, have long questioned whether training by backpropagation with SGD is the right long-term path for intelligence.
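The masked language modeling objective mentioned in the data-quality note above shows why no human labels are needed: the labels are the original tokens themselves. A minimal sketch follows, assuming whitespace tokenization and a 15% masking rate; the rate is the one commonly associated with BERT-style training, and details such as random replacement of some masked tokens are left out.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly hide tokens; the hidden originals become the training targets."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)    # the model must predict the original token
        else:
            inputs.append(tok)
            labels.append(None)   # no loss is computed at unmasked positions
    return inputs, labels

sentence = "large language models learn from raw text without human labels".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

Any raw text becomes training data this way. GPT-style models use next-token prediction rather than masking, but the same property holds: the supervision signal comes from the text itself.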

Closing AI technology chart from the original post — A closing figure from the original Zhihu essay.

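Last week I was invited, along with a colleague, to a meetup. At the venue I discovered that the company hosting us had been started by a group of engineers I used to hear about often at university, who began by building a game engine in Qingzhiwu. Today the company has several hundred people. I also happened to run into someone I knew from Zhihu; we talked at length about large language models, and I found everyone's responses very enthusiastic. So I used the weekend to sort through my recent views and write them up more systematically.

Large language models did not appear out of nowhere. To understand the technical principles behind them, you need to work step by step through the key developments of the past ten years in deep learning, especially natural language processing. Many articles already cover this ground, but I strongly recommend one high-quality Chinese survey that gives a more systematic view of the process.

通向AGI之路：大型语言模型（LLM）技术精要 (The Road to AGI: Essentials of Large Language Model Technology)

As a practitioner who has worked across deep learning, engineering, and internet products, I want to discuss the key technologies behind ChatGPT's large language model (LLM) from a more application-oriented angle.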

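Let's start with a strange phenomenon: emergent ability

Deep learning has long been criticized for its weak theoretical foundations, but the history of science has many cases where applications came before theory, and major experimental phenomena that cannot be explained often signal an opening for new theory. So we start from the phenomenon itself and look at the most mysterious property of LLMs, emergent ability, shown in the figure.

In the figure, the x-axis is model scale and the y-axis is model performance. Researchers still cannot give a convincing explanation for why the performance of mainstream large models suddenly jumps once scale passes the 10^22 level. This question matters a great deal: answering it might settle the dispute between the statistical and symbolic routes to AGI.

LeCun recently criticized the widely held view that ChatGPT can lead to AGI, and was fiercely attacked for it on Twitter. His underlying position is that statistics-based methods should not be able to achieve AGI. Yet emergent ability hangs over human intelligence like a dark cloud.

In recent years, some work has also scaled up models in computer vision to see whether emergence appears there. Fortunately, although Google recently pushed ViT to 22B parameters, no clear emergent behavior has been observed. If similar results did appear in vision or multimodal models, the small consequence would be the disappearance of some competitive niches and a change in business logic. The large consequence would be that the intelligence we humans are so proud of might really be nothing more than statistical regularity.

Here are some of the more interesting explanations:

The evaluation metrics are not smooth enough. The capability actually starts to appear at intermediate scales but does not yet show up in the metric.
Some hard-to-grasp knowledge, concepts, and reasoning skills are learned incorrectly at first, which makes performance worse, and only further learning corrects them. The model therefore has to be large enough. The figure below, for example, shows a U-shaped curve in which LLM performance first drops and then rises as scale increases.

Quantitative change produces qualitative change. When learning the distribution of knowledge, the model resembles a Bayesian network; once its connections become complex enough, quantity turns into quality and something resembling intelligence appears.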

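The key technology for turning AI into products: alignment

The enormous progress of AI has been widely discussed in terms of compute, data, and algorithms. In productization, however, alignment has played the key role, and this may be where OpenAI's understanding was ahead of the rest of the industry. Alignment has no standard definition; what follows is alignment as I personally understand it, so please point out any mistakes.

The UI/UX of the AI era: prompt engineering

Many people agree with the comparison of prompts to the UI/UX of a new era. Prompting has been hugely successful in NLP, and there was plenty of discussion on Zhihu at the time about how important the technique was for research.

如何看待NLP领域最近比较火的prompt，能否借鉴到CV领域？ - 知乎 (Zhihu question: What do you make of the recent popularity of prompts in NLP, and can the idea be borrowed for CV?)

In the era when text-to-image generation went mainstream, prompting shone just as brightly. Prompting can be seen as one important attempt to align ourselves with the capabilities of deep learning models. Rereading the GPT-3 paper, one is struck by how important the idea of in-context learning is, and our understanding of prompts is no longer limited to matching the form of downstream tasks to pretraining or replacing fine-tuning.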

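In-context learning

At first glance, "in-context learning" looks like a term coined to distinguish the idea from zero-shot learning and meta-learning. But follow-up work built on it tells a richer story. One example is this article: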

How does in-context learning work? A framework for understanding the differences from traditional supervised learning

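People found that even when the model is given demonstrations containing errors, its actual performance is hardly affected, whereas performance drops markedly if the demonstrations differ substantially in distribution from the target task. The remarkable effect of chain-of-thought (CoT) prompting also deserves mention: LLMs have long been weak at logic and reasoning, yet simply adding "Let's think step by step" to the input improves results noticeably.

All of this suggests that LLMs have already learned a great deal of knowledge; we simply have not found a particularly good way to unlock it. Today's AI researchers are like the alchemists of old: without a breakthrough at the level of principle, they can only try one incantation after another to summon the magic. ChatGPT seems to have found a better way of aligning with and unlocking LLM capability. RLHF plus GPT-3.5 raised the bar for AI productization, but that does not mean we have fully tapped what LLMs can do.

Innovation in alignment will therefore not stop. Prompting, in-context learning, chain of thought, and reinforcement learning from human feedback are the interim results of years of tireless work across the industry. Because interaction innovation matters so much, that work will continue.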

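Is LLM innovation search-engine-style or AWS-style?

Behind this question is whether large models will end up as a Google-style monopoly held by one giant, or whether, as with AWS, several companies will offer large-model services. I lean toward the view that a single dominant LLM provider is unlikely. The market looks more like cloud computing in the AWS mold: providers sell high-quality computing services to the industry, and although one company may lead, others remain strong competitors.

LLMs differ from search and recommendation systems. In the search-and-recommendation era, models did not understand content; they did data mining on users' votes. ByteDance, for example, controlled a powerful content supply side and relied on efficient recommendation-driven distribution to build the formidable moat Douyin has today. In the LLM era, models have begun to understand content itself and can therefore create it, so does a monopoly on content supply still exist? And because the model understands both the content and the demand side, it is naturally suited to distribution.

ChatGPT is currently the best product and has collected a lot of high-quality user feedback, so it will keep some lead. But LLM technology itself is not monopolized by OpenAI, and it is unclear how much of a moat user feedback provides for user experience in the LLM field; Google and Meta, after all, have no shortage of users. I expect other giants to release strong competitors before long.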

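The problem of making things up

ChatGPT, for example, can make factual errors. Because its training data has a cutoff date, it can also run into problems in real production settings. For now, then, ChatGPT can only extend your own professional abilities and widen your boundaries; it cannot replace professionals. Our team found its output reliability and problem-solving ability far better than expected, but it still has to be used with care to avoid misleading results. Its particular strength in IT may come from the large amount of high-quality technical material on the web.

After GPT-3, a large body of work examined how models memorize knowledge and how that knowledge can be modified and corrected. Some of it studied the FFN layers, which account for about two-thirds of a Transformer's parameters, and argued that they act as a kind of key-value memory, with FFNs at different layers storing knowledge at different levels of abstraction. Other work updates specific facts by adding constraints to the optimization objective, and shows that some knowledge can indeed be corrected without noticeably affecting other stored knowledge.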
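The figure of about two-thirds of a Transformer's parameters sitting in the feed-forward (FFN) layers can be checked with a rough per-layer count. The sketch below assumes a standard layer with model width d, four d-by-d attention projection matrices, and an FFN hidden size of 4d; biases, layer norms, and embeddings are ignored.

```latex
\begin{aligned}
P_{\text{attn}} &= 4d^2
  && (W_Q, W_K, W_V, W_O \in \mathbb{R}^{d \times d}) \\
P_{\text{ffn}} &= 2 \cdot d \cdot 4d = 8d^2
  && (W_1 \in \mathbb{R}^{4d \times d},\ W_2 \in \mathbb{R}^{d \times 4d}) \\
\frac{P_{\text{ffn}}}{P_{\text{attn}} + P_{\text{ffn}}}
  &= \frac{8d^2}{12d^2} = \frac{2}{3}
\end{aligned}
```

With a different hidden-size multiplier the ratio shifts, so two-thirds is a property of the common 4d configuration rather than of every Transformer.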

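From an energy-consumption standpoint, I think future LLMs should focus more on understanding and reasoning over knowledge than on merely memorizing facts. Work on retrieval augmentation points the way, for example the RETRO framework proposed by DeepMind, which fuses embeddings from an external knowledge base into the LLM. Recent open-source projects such as LangChain and GPTIndex also make use of external databases. The new Bing, for instance, feeds search results to the LLM as input and has the LLM process them to produce the final answer. I group these methods under the label "external DB" and consider them among the more practical and feasible application approaches. They also create new opportunities for companies building vector databases.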

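When will multimodality arrive?

I believe multimodal large models are a prerequisite for AGI. We humans learn about and understand the world in four dimensions, whereas text is a much more abstract domain. Vision has better-behaved physical regularities, which can serve as anchors that help a model learn truly fundamental concepts. So far, though, no outstanding work has appeared. CLIP is very useful, but it is more like BERT in NLP than like GPT-3. ViT offers some hope, but a sequence architecture like the Transformer, which attends to local and global information together, to some extent needs discrete tokens. In NLP each token carries some conceptual meaning of its own, and the way text is segmented is meaningful. In CV, different ways of grouping pixels into patches correspond to inherently different physical meanings. A Transformer applied to CV can mirror NLP and train in a self-supervised way with MAE, which solves the training-data problem. But just as diffusion models are highly effective for image generation yet very limited on discrete text, the Transformer is not necessarily the right fit for CV.

Another guess is that multimodal large models should be sparse. Measured against the number of synapses in the human brain (on the order of 100 trillion), GPT-3's parameter count would still have to grow roughly a thousandfold. Only a sparse model can scale further while keeping inference cost down. A challenge this large creates huge opportunities for innovation across the industry. To use an imperfect analogy: who will become the Parameter Server of the new era? Will it be Google's Pathways?

So even setting aside key problems such as compute, training difficulty, and modality fusion, multimodal large models may be harder to achieve than expected. We should therefore take a more pragmatic interest in large models for images; a pretrained model there that showed emergent abilities would be the ideal outcome.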

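A new era of application innovation

Many media outlets ask which industries and jobs AI will affect. We should turn the question around and ask which industries are unlikely to be affected. I am not that optimistic about the prospects for AGI, but given the capabilities that ChatGPT and diffusion models have shown, few industries will be untouched. We should embrace AI as fully as we can. Throughout the history of our civilization, humans have invented tools, and tools have in turn reshaped humans.

This round of AI breakthroughs can change human-computer interaction. A wave of applications that use natural human language as their means of interaction will appear, marking the start of a new era.

The history of technology is full of cases where a technical shift triggered innovation in human-computer interaction, which then evolved into a new business ecosystem. ChatGPT is just one typical application, and it shows how striking interaction between humans and machines in natural language can be. For the first time in history, a machine understands human needs in this much detail; people can express what they want over many rounds, and every exchange is a unique experience.

The change in interaction will redefine many applications. Think back to learning productivity tools such as Office, Photoshop, or video editors; it was surely not a pleasant experience. These tools essentially require you to learn a graphical programming language in order to tell the computer what you need. If we can describe our needs directly in natural language, the broad prediction is that all software can be rebuilt.

Of course, this does not mean every AI-based application will succeed. Every technology has its boundaries; we just do not yet know where they are. Moreover, today's AI is still not AGI; it is more like the first appearance of the iPhone or AWS. Most applications being built around AI are thin layers over OpenAI's API with no core technical barrier. Much of the real innovation lies in product flows built on natural-language interaction and in aligning with user needs, and most of these efforts are likely to end up as fallen pioneers.

To use the iPhone analogy, today's AI resembles the first iPhone: both sit in an unstable phase of innovation. We cannot know when or where a new technical paradigm will appear, for example in multimodality, or whether new alignment methods will let large models serve customers directly. The time window for applications, and whether any moat will emerge, are also unknown. Anyone pursuing application innovation therefore needs to be mentally prepared, and should aim to survive until the next maturity cycle.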

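Other notes

Many people talk about data quality, but after a short conversation it turns out that most still believe AI models need large amounts of labeled data to train. In fact, NLP was only able to scale this far, and to produce the developments that followed, after the masked language model (MLM) training method appeared. MLM's defining feature is that it is self-supervised and needs no real human labeling. The idea is simple: in an existing text corpus, randomly mask some words (more precisely, tokens) and have the model predict them. Because the masked words are known in advance, this counts as self-supervised learning. The benefit is a large increase in the amount of data usable for training. This is essential for understanding language models, and interested readers can look up further material. Data quality, then, refers to how good the data itself is; Wikipedia data, for example, is inherently somewhat better than Reddit data.
Moore's law for compute. Years ago people were already saying that Moore's law had hit physical limits and that, without breakthroughs in basic science, we would soon be unable to make computers faster. Yet in recent years GPUs have advanced quickly and compute has grown rapidly. GPUs face the same physical limits as CPUs, but GPU workloads are inherently parallel, so the problem can be eased by adding more transistors. I do not know this field deeply. In the large-model era the demand for compute is no longer in doubt; whether supply costs will really fall as fast as people expect is a question I hope someone more experienced can answer. PS: I recently came across an interesting paper, Looped Transformers as Programmable Computers, which explores whether a Transformer could be used to build a general-purpose computer. A few years earlier, others had already set out to prove that Transformers are Turing complete.
New optimization algorithms. Stochastic gradient descent and model architecture are topics followed mostly by insiders. Hinton, for example, has never believed that SGD-based optimization is the future of AI, by analogy with the fact that nothing like backpropagation exists in the human brain. His latest work, FF (the Forward-Forward algorithm), builds on his "Router" work from a few years earlier and still focuses on this question.

Finally, I'll close with an image that has circulated widely online. Have a great weekend!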