Algorithm Engineers Are Engineers First

During campus recruiting, we often interviewed candidates for algorithm roles by starting with one or two simple coding problems before moving into machine-learning questions. Some candidates complained that we were not really asking about machine learning. In many cases, the reason was simple: the code was too weak for the interviewer to keep going.

Machine-learning roles are easy to romanticize. People imagine deriving formulas and tuning models all day. In reality, our search and recommendation work required a very different kind of ability: understanding systems, data pipelines, serving constraints, performance profiling, and the ugly details that determine whether a model can actually create value online.

The first lesson: stop chasing fancy architectures too early

In one deep-learning ranking project, we spent months running into one pitfall after another. We were overly attracted to fancy ideas from papers and hoped that changing model structures would produce gains. Most of that failed. The meaningful gains came from a more boring place: using more training data, improving sample construction and cleaning, choosing classic model structures carefully, and respecting the optimizer.

Deep learning needs much more data than traditional models; increasing sample size can visibly improve results.
At the beginning, forget many of the clever tricks in papers. A workable solution is usually a question of “how much better”, not “whether it works at all”.
Tuning speed matters. If model iteration is alchemy, the faster alchemist has an advantage.
Embeddings are extremely powerful; spend serious effort on how the model represents IDs.
Care about computation. A ranking model eventually has to serve online traffic.

Training is an engineering problem

At first, we fed the existing baseline features into deep models: DNN, DFM, LSTM, and so on. The results were worse than logistic regression. Part of the reason was embarrassing but real: to move quickly, we loaded everything into memory, limited the data scale, and handled part of the preprocessing in Python. GPU utilization was unstable because computation kept bouncing between CPU and GPU.

After profiling, we moved the sample construction to Spark and generated TFRecord data. The whole construction pipeline became nearly ten times faster than the old Hive SQL plus HDFS-to-local process, and we could use much more data. The model improved. This was not a modeling miracle; it was engineering work.

Industrial embeddings are not NLP toy sizes

In NLP papers, hundreds of thousands of words can be called large-scale. In industrial recommender systems, user IDs and item IDs easily reach millions or tens of millions. Embedding lookup can run out of memory quickly. Hashing, SimHash, metadata-based recoding, and selective treatment of sparse IDs all become practical design choices.

Wide & Deep also brings sparse-model training problems. Many implementations use dense tensors. Once feature scale reaches hundreds of millions, parameter-server communication becomes a disaster. Reading the TensorFlow source and using sparse ops can make a large difference. Sometimes the elegant solution is simply understanding the system deeply enough.

Online ranking is not batch prediction

In training, prediction code is often organized by batch size. Online ranking is different. When a user arrives, the system may need to rerank thousands of candidates. If user-side features are copied thousands of times and TensorFlow performs thousands of embedding lookups, latency will be terrible. Doing user-side lookups once at the beginning and then copying memory can dramatically reduce response time.

Other costs hide in attention modules, cross-network implementations, and small algebraic choices. A simple use of commutativity in DCN’s cross layer can bring a large performance improvement. These are not separate from algorithm work; they are part of making the algorithm real.

Theory still matters

None of this means machine-learning theory is unimportant. Theory gives you belief and direction when a project takes a year, when there is no clear intermediate output, and when the business keeps asking for KPIs. But beautiful assumptions meet cruel real-world data very quickly. Moving toward truth requires both theoretical understanding and the ability to dirty your hands in the system.

The relationship is simple: theory points the way; engineering is the blade that cuts through the road. Without theory, you have no direction. Without coding ability, you can only watch from the side. Algorithm engineers are engineers first.

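Introduction

I have been interviewing campus candidates to the point of nausea lately. Algorithm roles are simply too popular, and it is exhausting. Our interview has two parts: first one or two coding problems, and only then machine-learning questions. Many candidates do not understand this and complain about us online, saying we barely asked about machine learning. When that happens, it is usually because the code was so poor that the interviewer lost interest in the machine-learning part.

Machine-learning algorithm roles invite a common misconception: that the daily work is deriving formulas and tuning parameters. So this post uses an important recent project on our team, applying deep learning to search and recommendation, to describe how we actually work. After reading it, it should be easy to see why we require coding ability.

In fact, our coding problems are really simple; you can solve them without grinding practice problems. Look at what other companies ask: it has started to feel like being interviewed on how to build an atomic bomb and then being hired to sell tea eggs. Of course, they can afford it, and that approach lets them select very smart candidates.

Back to the topic. Since the end of last year we have been exploring deep learning in search and recommendation, covering both ranking and recall. We used to explain the importance of coding ability by pointing to collaboration with engineering colleagues and to understanding systems such as the recommendation engine and the search engine, which may be a bit abstract for new graduates. This project is probably a better illustration.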

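First, the guiding principles

Over the past six months or so we ran into many pitfalls, especially through our obsession with the fancy structures in papers and our hope that swapping models would bring gains. We were proven wrong every time. The good gains came instead from going back to basics: using more sample data, improving the logic for cleaning and constructing samples, choosing classic model structures carefully, and staying respectful of the optimization algorithm. Here are a few high-level takeaways:

Compared with traditional models, deep learning needs far more data to learn from; adding sample data visibly improves model results.
In the early stage, forget the assorted clever tricks in papers. For a solution that works, the effect is a matter of more or less, not of all or nothing. Careful tuning is more effective than randomly trying paper ideas.
Deep-learning practitioners really can call themselves parameter-tuning alchemists, so trying things faster than others is the alchemist's core competitive advantage.
Embeddings are remarkable; put most of your effort here. How well a deep model understands IDs can astonish you.
Care about your model's computational efficiency. It has to go online eventually, and the performance problem cannot be avoided.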

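Engineering ability in training: hitting pitfalls and fixing them
The sample-scale problem

At first, we fed the existing baseline feature data into deep models. We tried dnn, dfm, lstm, and so on, and the results were worse than lr. To experiment quickly, we had loaded all the data into memory, which limited the data scale, and part of the data preprocessing was done in Python, so computation switched frequently between CPU and GPU and GPU utilization fluctuated badly. After some analysis with the profiling tools that TF provides, we concluded that feature preprocessing was taking too long. In addition, the model had a large number of parameters but not enough samples, so we urgently needed more of them. We used Spark to build the sample data in TFRecord format. The whole construction process was nearly 10 times faster than the original approach of Hive SQL followed by pulling from HDFS to local disk, the amount of usable sample data grew a lot, and the model results improved considerably.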

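The problem of too many embedding IDs

Deep learning started out in image and speech scenarios, and NLP papers often describe embedding a few hundred thousand words as large-scale. People doing user and item embeddings in industry will smile at that. User IDs and item IDs very easily reach the millions or tens of millions, which makes the embedding lookup run out of memory (OOM). See my previous post: https://zhuanlan.zhihu.com/p/39774203.

Some companies choose to hash the IDs and then embed them; the TF website, for example, recommends this: https://www.tensorflow.org/guide/feature_columns#hashed_column. Others use SimHash in place of direct hashing. We can currently handle raw IDs at the million level. If we need to go larger later, we would prefer to hash only the IDs whose samples are especially sparse, or to recode IDs based on their metadata.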
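The selective treatment of sparse IDs described above can be sketched in a few lines. This is only an illustration in plain Python, not the team's implementation: the helper names `stable_hash` and `build_id_mapper`, the `min_count` threshold, and the bucket count are all hypothetical. Frequent IDs keep a dedicated embedding row, while rare IDs share a small hashed region of the table.

```python
import hashlib
from collections import Counter


def stable_hash(raw_id: str, num_buckets: int) -> int:
    """Deterministic bucket index; Python's built-in hash() is salted per process."""
    digest = hashlib.md5(raw_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def build_id_mapper(id_counts, min_count, num_hash_buckets):
    """Return (to_row, table_rows) for a hypothetical embedding table.

    IDs seen at least `min_count` times get their own row.
    All other IDs, including unseen ones, share `num_hash_buckets` hashed rows.
    """
    frequent = sorted(i for i, c in id_counts.items() if c >= min_count)
    dedicated = {raw_id: row for row, raw_id in enumerate(frequent)}
    offset = len(dedicated)  # hashed rows start after the dedicated rows

    def to_row(raw_id):
        if raw_id in dedicated:
            return dedicated[raw_id]
        return offset + stable_hash(raw_id, num_hash_buckets)

    return to_row, offset + num_hash_buckets


# Toy frequency table (made-up numbers, for illustration only).
counts = Counter({"item_a": 500, "item_b": 320, "item_c": 2, "item_d": 1})
to_row, table_rows = build_id_mapper(counts, min_count=10, num_hash_buckets=4)

for raw_id in ["item_a", "item_b", "item_c", "never_seen"]:
    print(raw_id, "->", to_row(raw_id))
```

The table size is bounded by the number of frequent IDs plus the bucket count, so collisions are confined to IDs that had little training signal anyway.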

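The sparse-model training problem introduced by the Wide model

Most wide & deep implementations actually use dense tensors. TF trains models on a parameter-server (PS) architecture, and once your feature scale reaches the hundreds of millions, network communication is a disaster. Add gRPC's poor performance, and NIC utilization stays low while most of the training time is spent on communication.

But if you take the trouble to read the TF source, the fix is actually simple: use some sparse ops. Using sparse_gather, for example, solves the network-transfer problem. It is not a complete solution, though, because TF converts the sparse tensor back to dense at compute time. Keep reading the source and you find TF's own embedding_lookup_sparse. Seen from another angle, it naturally supports training a sparse wide model: treat the sparse wide model as the case where the embedding size is 1, put a sum pooling on top, and the result is exactly the wide output we want. The solution is elegant.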
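The equivalence behind this idea can be checked without TensorFlow. The sketch below is plain Python with hypothetical helper names (`wide_dense`, `wide_sparse`) and made-up weights; it only shows that looking up one scalar weight per active feature ID and summing gives the same wide output as a dense dot product, while touching only the active IDs.

```python
def wide_dense(weights, dense_row):
    """Dense wide model: dot product over every feature slot."""
    return sum(w * x for w, x in zip(weights, dense_row))


def wide_sparse(weights, active_ids, values=None):
    """Sparse wide model: an 'embedding' of size 1 per ID, sum-pooled.

    Only the weights of the active IDs are read, which is what keeps
    the amount of data fetched from a parameter server small.
    """
    if values is None:
        values = [1.0] * len(active_ids)
    return sum(weights[i] * v for i, v in zip(active_ids, values))


num_features = 1000
weights = [0.001 * i for i in range(num_features)]  # toy weights

active_ids = [3, 42, 977]            # one-hot style sparse features
dense_row = [0.0] * num_features
for i in active_ids:
    dense_row[i] = 1.0

dense_out = wide_dense(weights, dense_row)
sparse_out = wide_sparse(weights, active_ids)
print(dense_out, sparse_out)
```

The dense version reads all 1000 weights for this example and the sparse version reads 3; at hundreds of millions of features that gap is the communication cost the text describes.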

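Distributed training speed does not improve as batch size increases

This problem is hard to spot from profiling alone. Looking at the TF implementation again, it turns out to be caused by a dimension-compression optimization that TF enables by default. To save storage, TF hash-compresses identical features within a batch, which involves a distinct operation, and the performance loss is obvious when the batch size is large. Changing a parameter disables the operation; the downside is that it wastes some memory.

There are two more core problems, TF's lack of support for sparse models and the worker checkpoint problem in distributed training, which I will not go into here.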

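Online performance:
Differences between real online serving and batch-size-based training

In real ranking, when a user arrives, the candidate set for fine ranking may contain several thousand items. Our predict code, however, was organized around batch size, as in training: it copies the user-side features several thousand times into a matrix and feeds that matrix to the model. Left to TF, this means several thousand embedding lookups, which is very slow. If we instead do the user-side lookup once at the start of the request and then do some memory copying, we can cut response time (RT) substantially.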
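A toy version of this optimization, in plain Python rather than TF, makes the saving easy to see. Everything here is hypothetical scaffolding (`CountingTable`, the two `build_inputs_*` functions, the tiny 2-dimensional vectors): the point is only that the user-side lookup is hoisted out of the per-candidate loop and its result is reused by copying.

```python
class CountingTable:
    """Toy embedding table that counts lookups so the saving is visible."""

    def __init__(self, rows):
        self.rows = rows
        self.lookups = 0

    def lookup(self, key):
        self.lookups += 1
        return self.rows[key]


def build_inputs_naive(user_table, item_table, user_id, item_ids):
    """Batch-style code: the user embedding is looked up once per candidate."""
    return [user_table.lookup(user_id) + item_table.lookup(i) for i in item_ids]


def build_inputs_hoisted(user_table, item_table, user_id, item_ids):
    """Serving-style code: one user lookup per request, then plain copies."""
    user_vec = user_table.lookup(user_id)
    # List concatenation copies user_vec into each row of the model input.
    return [user_vec + item_table.lookup(i) for i in item_ids]


user_rows = {"u1": [0.1, 0.2]}
item_rows = {i: [float(i), 0.5] for i in range(2000)}
item_ids = list(range(2000))

naive_users = CountingTable(user_rows)
naive_out = build_inputs_naive(naive_users, CountingTable(item_rows), "u1", item_ids)

hoisted_users = CountingTable(user_rows)
hoisted_out = build_inputs_hoisted(hoisted_users, CountingTable(item_rows), "u1", item_ids)

print(naive_users.lookups, hoisted_users.lookups)
```

Both functions produce the same model input; only the number of user-side lookups differs, from one per candidate to one per request.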

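Another major cost is attention. There are many fixes for this too; approximating it with a lookup table, for example, works.

Some costs come from poor implementation details in the model. In DCN's cross implementation, for example, a simple application of the commutative law brings a huge performance improvement. See: https://zhuanlan.zhihu.com/p/43364598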
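For readers unfamiliar with the cross layer, here is a minimal sketch of the reordering usually applied to it, using the DCN formula x_{l+1} = x_0 x_l^T w + b + x_l. It is an assumption that this is the same trick the linked post describes; strictly speaking, the sketch regroups the product (x_0 x_l^T) w as x_0 (x_l^T w). The function names and the toy vectors are hypothetical, and plain Python lists stand in for tensors.

```python
def cross_naive(x0, xl, w, b):
    """Builds the d x d outer product x0 * xl^T first: O(d^2) time and memory."""
    d = len(x0)
    outer = [[x0[i] * xl[j] for j in range(d)] for i in range(d)]
    return [
        sum(outer[i][j] * w[j] for j in range(d)) + b[i] + xl[i]
        for i in range(d)
    ]


def cross_fast(x0, xl, w, b):
    """Computes the scalar xl^T w first, then scales x0: O(d) time and memory."""
    s = sum(xj * wj for xj, wj in zip(xl, w))
    return [x0i * s + bi + xli for x0i, bi, xli in zip(x0, b, xl)]


# Toy inputs (made-up values, for illustration only).
x0 = [1.0, 2.0, 3.0]
xl = [0.5, -1.0, 2.0]
w = [0.1, 0.2, 0.3]
b = [0.0, 0.1, -0.1]

naive_out = cross_naive(x0, xl, w, b)
fast_out = cross_fast(x0, xl, w, b)
print(naive_out)
print(fast_out)
```

The two functions agree up to floating-point rounding, but the second never materializes the d x d matrix, which matters when d is the full concatenated embedding width and the layer runs for thousands of candidates per request.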

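Now for some rambling

Much of the work above was dug out by algorithm engineers and engineering colleagues going deep into the code details together, and the algorithm engineers in particular have to point out where the problems might be. Engineering colleagues are better than we are at performance profiling, but we know more than they do about the likely performance problems inside the model. Of course, plenty of people object that all of this is just engineering that was not done well, and that with good engineering you would not need to care. But a real breakthrough necessarily breaks the existing system. If you cannot charge in yourself when it counts, why would anyone listen to you and follow you? Most likely you will end up maintaining some peripheral business in the back.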

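So is machine-learning theory unimportant?

Of course not. This post is already too long, so I will make only two points.

The source of belief: this really matters. When a project runs for a year or so with no clear output along the way, the boss wants KPIs, and the colleagues next to you keep shipping results, what keeps you going through the pitfalls? Only the belief that comes from your understanding of theory.
Assumptions are always beautiful and real data is cruel; it slaps your left cheek and then your right. Getting closer to the truth step by step and solving the problem still depends on understanding theory, especially theory understood in the context of the business.

The relationship between engineering and theory is a bit like this: theory is the guide that points the way, and engineering is the blade that cuts through the thorns on the road ahead. Without theory there is no direction; without coding ability you can only be a bystander. Neither can be missing.

Finally, to sum up: an algorithm engineer is an engineer first.

PS: Don't panic! Get your hands dirty! Coding is not that hard.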