During campus recruiting, we often interviewed candidates for algorithm roles by starting with one or two simple coding problems before moving into machine-learning questions. Some candidates complained that we were not really asking about machine learning. In many cases, the reason was simple: the candidate's code was too weak for the interviewer to continue.
Machine-learning roles are easy to romanticize. People imagine deriving formulas and tuning models all day. In reality, our search and recommendation work required a very different kind of ability: understanding systems, data pipelines, serving constraints, performance profiling, and the ugly details that determine whether a model can actually create value online.
The first lesson: stop chasing fancy architectures too early
In one deep-learning ranking project, we spent months running into one pitfall after another. We were overly attracted to fancy paper ideas and hoped that changing model structures would produce gains. Most of those attempts failed. The meaningful gains came from a more boring place: using more training data, improving sample construction and cleaning, choosing classic model structures carefully, and respecting the optimizer.
- Deep learning needs much more data than traditional models; increasing sample size can visibly improve results.
- At the beginning, set aside most of the clever tricks from papers. They usually decide “how much better” a workable solution gets, not “whether it works at all”.
- Tuning speed matters. If model iteration is alchemy, the faster alchemist has an advantage.
- Embeddings are extremely powerful; spend serious effort on how the model represents IDs.
- Care about computation. A ranking model eventually has to serve online traffic.
Training is an engineering problem
At first, we fed the existing baseline features into deep models: DNN, DFM, LSTM, and so on. The results were worse than logistic regression. Part of the reason was embarrassing but real: to move quickly, we loaded everything into memory, limited the data scale, and handled part of the preprocessing in Python. GPU utilization was unstable because computation kept bouncing between CPU and GPU.
After profiling, we moved the sample construction to Spark and generated TFRecord data. The whole construction pipeline became nearly ten times faster than the old Hive SQL plus HDFS-to-local process, and we could use much more data. The model improved. This was not a modeling miracle; it was engineering work.
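The "load everything into memory" trap described above can be illustrated with a small sketch. This is not the Spark and TFRecord pipeline from the project; it is a plain-Python illustration of the underlying idea, streaming samples from pre-built shards one batch at a time so that data scale is no longer bounded by RAM. The CSV layout (label first, then features) and the names `iter_samples` and `iter_batches` are assumptions made for the example.

```python
import csv
from pathlib import Path
from typing import Iterable, Iterator, List, Tuple

# Hypothetical sample layout: (label, feature values).
Sample = Tuple[int, List[float]]


def iter_samples(shard_paths: Iterable[Path]) -> Iterator[Sample]:
    """Yield one parsed sample at a time, shard by shard."""
    for path in shard_paths:
        with open(path, newline="") as f:
            for row in csv.reader(f):
                yield int(row[0]), [float(v) for v in row[1:]]


def iter_batches(samples: Iterator[Sample], batch_size: int) -> Iterator[List[Sample]]:
    """Group a sample stream into batches; the last batch may be smaller."""
    batch: List[Sample] = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Only one batch is held in memory at a time, so adding more shards increases training time but not the memory footprint.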
Industrial embeddings are not NLP toy sizes
In NLP papers, a vocabulary of hundreds of thousands of words can be called large-scale. In industrial recommender systems, user IDs and item IDs easily reach millions or tens of millions, and embedding tables of that size can exhaust memory quickly. Hashing, SimHash, metadata-based recoding, and selective treatment of sparse IDs all become practical design choices.
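Of the options listed above, plain hashing is the easiest to sketch. The example below is a minimal NumPy illustration, not the project's implementation: raw IDs are hashed into a fixed number of buckets, so the table size is chosen by the engineer rather than dictated by the ID space. `NUM_BUCKETS`, `EMBED_DIM`, `bucket_of`, and `lookup` are hypothetical names and values.

```python
import zlib

import numpy as np

NUM_BUCKETS = 1_000  # hypothetical; far smaller than the raw ID space
EMBED_DIM = 8        # hypothetical embedding width

rng = np.random.default_rng(0)
table = rng.normal(scale=0.01, size=(NUM_BUCKETS, EMBED_DIM)).astype(np.float32)


def bucket_of(raw_id: str) -> int:
    """Map an arbitrary raw ID to a fixed bucket using a stable hash."""
    # zlib.crc32 gives the same value in every process, unlike Python's
    # built-in hash() for strings, which is randomized per process by default.
    return zlib.crc32(raw_id.encode("utf-8")) % NUM_BUCKETS


def lookup(raw_ids):
    """Return one embedding row per raw ID, shape (len(raw_ids), EMBED_DIM)."""
    idx = np.array([bucket_of(r) for r in raw_ids], dtype=np.int64)
    return table[idx]
```

The trade-off is collisions: distinct IDs that land in the same bucket share one vector, which is usually more tolerable for rare IDs than for frequent ones.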
Wide & Deep also brings sparse-model training problems. Many implementations use dense tensors. Once feature scale reaches hundreds of millions, parameter-server communication becomes a disaster. Reading the TensorFlow source and using sparse ops can make a large difference. Sometimes the elegant solution is simply understanding the system deeply enough.
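To see why dense tensors hurt at this scale, consider what a gradient for an embedding table looks like: a batch touches only a handful of rows, so a dense gradient is almost entirely zeros that still have to be sent over the network. The sketch below is a NumPy illustration of the sparse alternative, not TensorFlow code; TensorFlow has a comparable representation in `tf.IndexedSlices`. The function names are hypothetical.

```python
import numpy as np


def sparse_embedding_grad(ids, grad_out):
    """Gradient for an embedding table, restricted to the rows that were used.

    ids:      (batch,) integer row indices that were looked up
    grad_out: (batch, dim) gradient w.r.t. the looked-up vectors
    Returns (unique_ids, row_grads) instead of a dense (num_rows, dim) array.
    """
    unique_ids, inverse = np.unique(ids, return_inverse=True)
    inverse = inverse.reshape(-1)
    row_grads = np.zeros((unique_ids.shape[0], grad_out.shape[1]), dtype=grad_out.dtype)
    # np.add.at accumulates correctly when the same row appears more than once.
    np.add.at(row_grads, inverse, grad_out)
    return unique_ids, row_grads


def apply_sparse_sgd(table, unique_ids, row_grads, lr):
    """Plain SGD step that only touches the rows present in the batch."""
    table[unique_ids] -= lr * row_grads
```

With hundreds of millions of rows and a batch that touches a few thousand of them, the payload shrinks from the whole table to just the touched rows plus their indices.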
Online ranking is not batch prediction
In training, prediction code is often organized by batch size. Online ranking is different. When a user arrives, the system may need to rerank thousands of candidates. If user-side features are copied thousands of times and TensorFlow performs thousands of identical embedding lookups, latency will be terrible. Doing the user-side lookups once at the beginning and then copying the result in memory can dramatically reduce response time.
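The difference between the two approaches can be sketched in NumPy. This is an illustration under assumed shapes, not the production TensorFlow graph: `user_table`, `item_table`, and both function names are hypothetical, and array indexing stands in for the embedding lookup.

```python
import numpy as np

rng = np.random.default_rng(0)
user_table = rng.normal(size=(100, 4))    # hypothetical user embeddings
item_table = rng.normal(size=(5000, 4))   # hypothetical item embeddings


def build_inputs_naive(user_id, item_ids):
    """Tile the user ID first, then look up the same row once per candidate."""
    user_ids = np.full(len(item_ids), user_id)
    return np.concatenate([user_table[user_ids], item_table[item_ids]], axis=1)


def build_inputs_lookup_once(user_id, item_ids):
    """Look up the user row once, then replicate it with a plain memory copy."""
    user_vec = user_table[user_id]                       # single lookup
    user_block = np.tile(user_vec, (len(item_ids), 1))   # (N, dim) copy
    return np.concatenate([user_block, item_table[item_ids]], axis=1)
```

Both functions produce the same model input; only the number of lookups on the user side changes, from one per candidate to one per request.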
Other costs hide in attention modules, cross-network implementations, and small algebraic choices. A simple use of commutativity in DCN’s cross layer can bring a large performance improvement. These are not separate from algorithm work; they are part of making the algorithm real.
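One well-known reordering of this kind can be sketched as follows. It may or may not be the exact trick used in the project, and strictly speaking it relies on associativity of the matrix product. The DCN cross layer computes x_{l+1} = x_0 x_l^T w + b + x_l. Evaluated left to right, x_0 x_l^T builds a d-by-d matrix per example; evaluated as x_0 (x_l^T w), the inner term is a single scalar per example. The function names and shapes below are assumptions for the illustration.

```python
import numpy as np


def cross_naive(x0, xl, w, b):
    """(x0 xl^T) w + b + xl, materializing a (d, d) outer product per example."""
    outer = x0[:, :, None] * xl[:, None, :]   # (batch, d, d)
    return outer @ w + b + xl                 # (batch, d)


def cross_fast(x0, xl, w, b):
    """x0 (xl^T w) + b + xl, where xl^T w is one scalar per example."""
    s = xl @ w                                # (batch,)
    return x0 * s[:, None] + b + xl           # (batch, d)
```

The two functions agree up to floating-point rounding, but the second needs O(d) time and memory per example instead of O(d^2).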
Theory still matters
None of this means machine-learning theory is unimportant. Theory gives you belief and direction when a project takes a year, when there is no clear intermediate output, and when the business keeps asking for KPIs. But beautiful assumptions meet cruel real-world data very quickly. Moving toward truth requires both theoretical understanding and the willingness to get your hands dirty in the system.
The relationship is simple: theory points the way; engineering is the blade that clears the path. Without theory, you have no direction. Without coding ability, you can only watch from the side. Algorithm engineers are engineers first.