We’ve recently completed a series of engineering improvements to the LifeStreet prediction engine, allowing us to incorporate substantially more data into our models. This project has led to up to a 53% decrease in CPIs and a 137% increase in impression-to-install rates, yielding ROAS and scale wins across the board, and we’re excited to share the details with you.
The data processing challenge
Like all machine learning systems, LifeStreet’s predictive models become increasingly accurate with more data. More accurate predictions drive more effective bidding, thereby decreasing CPIs, improving user quality, and ultimately increasing ROAS for our advertisers.
We set out with the goal of significantly improving our ability to process the data we use for model training. The engineering difficulty stems from the large amounts of data involved: billions of rows, with thousands of data points describing each row.
An example of model training data.
We typically train dozens of models each day, and each of these models requires tens of thousands of rows of training data per second. At this scale, our data processing tasks have historically consumed hundreds of CPU cores on a daily basis, representing a large fraction of the compute resources we dedicate to model training.
We challenged ourselves to find a full tenfold improvement in our data processing speeds. If we could do this, we’d be able to train models with substantially more data, leading to more accurate predictions and increased performance for our advertisers.
Processing data to encode features
When training a model, it’s necessary to turn each row of input data into a vector of numbers, each of which is referred to as a feature. This process, called feature encoding, turns all input data into a uniform numeric format that models can learn from. A good feature encoding system supports many different ways to encode a given set of inputs, and data scientists spend significant time determining the best feature encodings to use.
Feature encoding is a process that turns all input data into a uniform numeric format that models can learn from.
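To make the idea concrete, here is a minimal sketch of feature encoding in Go. It is illustrative only and does not reflect the production system: the field names (bid_floor, os, app_id), the vocabulary, and the bucket count are all hypothetical. It shows three common ways to encode an input: passing a number through, one-hot encoding a small set of categories, and hashing a high-cardinality value into a fixed number of buckets.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strconv"
)

// Row is one raw training example: field name -> raw string value.
type Row map[string]string

// encodeNumeric parses a numeric field into a single feature.
// Missing or malformed values become 0.
func encodeNumeric(raw string) float32 {
	v, err := strconv.ParseFloat(raw, 32)
	if err != nil {
		return 0
	}
	return float32(v)
}

// encodeOneHot returns a vector with a 1 at the position of raw in vocab,
// or all zeros if raw is not in the vocabulary.
func encodeOneHot(raw string, vocab []string) []float32 {
	out := make([]float32, len(vocab))
	for i, v := range vocab {
		if v == raw {
			out[i] = 1
			break
		}
	}
	return out
}

// encodeHashed maps a high-cardinality value into one of `buckets` slots
// (the "hashing trick"). buckets must be greater than zero.
func encodeHashed(raw string, buckets int) []float32 {
	out := make([]float32, buckets)
	h := fnv.New32a()
	h.Write([]byte(raw))
	out[h.Sum32()%uint32(buckets)] = 1
	return out
}

// encodeRow concatenates the per-field encodings into one feature vector.
// The choice of fields and encodings here is hypothetical.
func encodeRow(r Row) []float32 {
	var vec []float32
	vec = append(vec, encodeNumeric(r["bid_floor"]))
	vec = append(vec, encodeOneHot(r["os"], []string{"android", "ios"})...)
	vec = append(vec, encodeHashed(r["app_id"], 8)...)
	return vec
}

func main() {
	row := Row{"bid_floor": "0.25", "os": "ios", "app_id": "com.example.game"}
	fmt.Println(encodeRow(row))
}
```

Every row produces a vector of the same length (here 1 + 2 + 8 = 11 features), which is what gives the model a uniform numeric input.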
As in many machine learning projects, our original feature encoding system was implemented in the Python programming language. Python has strong support for machine learning, and when integrated with libraries such as NumPy and TensorFlow, it offers very good performance for machine learning applications. However, because Python is an interpreted language, it processes certain tasks more slowly than other languages can.
For such workloads, performance improvements can be realized by using a compiled language instead. One such language, Go, is a relatively new language from Google that is frequently used in cloud computing applications. To capitalize on the inherent advantage of a compiled language, we elected to entirely rewrite our feature encoding system in Go. Our new system uses a technique called code generation, in which we generate a custom feature encoding program specific to each model. This allows the Go compiler to heavily optimize the generated code, significantly improving performance relative to our Python baseline.
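The code generation idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual generator: the ModelSpec and FeatureSpec types, the two feature kinds, and the model name are hypothetical. The sketch uses the standard text/template package to emit Go source for one model and go/format to check that the result parses and to format it.

```go
package main

import (
	"bytes"
	"fmt"
	"go/format"
	"text/template"
)

// FeatureSpec and ModelSpec are hypothetical descriptions of a model's inputs.
type FeatureSpec struct {
	Field string   // input column name
	Kind  string   // "numeric" or "onehot"
	Vocab []string // categories, used only for "onehot"
}

type ModelSpec struct {
	Name     string
	Features []FeatureSpec
}

// The types below hold values precomputed for the template.
type oneHotCase struct {
	Value string
	Index int // absolute position in the output vector
}

type slot struct {
	Field   string
	Numeric bool
	Offset  int
	Cases   []oneHotCase
}

type templateData struct {
	Name       string
	Width      int
	HasNumeric bool
	Slots      []slot
}

var encoderTmpl = template.Must(template.New("encoder").Parse(`// Code generated for model {{.Name}}. DO NOT EDIT.

package encoder
{{if .HasNumeric}}
import "strconv"
{{end}}
// Encode turns one raw row into a fixed-width feature vector.
func Encode(row map[string]string) [{{.Width}}]float32 {
	var vec [{{.Width}}]float32
{{range .Slots}}{{if .Numeric}}	if v, err := strconv.ParseFloat(row[{{printf "%q" .Field}}], 32); err == nil {
		vec[{{.Offset}}] = float32(v)
	}
{{else}}	switch row[{{printf "%q" .Field}}] {
{{range .Cases}}	case {{printf "%q" .Value}}:
		vec[{{.Index}}] = 1
{{end}}	}
{{end}}{{end}}	return vec
}
`))

// generateEncoder returns formatted Go source for a model-specific encoder.
func generateEncoder(m ModelSpec) (string, error) {
	data := templateData{Name: m.Name}
	for _, f := range m.Features {
		s := slot{Field: f.Field, Offset: data.Width}
		switch f.Kind {
		case "numeric":
			s.Numeric = true
			data.HasNumeric = true
			data.Width++
		case "onehot":
			for i, v := range f.Vocab {
				s.Cases = append(s.Cases, oneHotCase{Value: v, Index: data.Width + i})
			}
			data.Width += len(f.Vocab)
		default:
			return "", fmt.Errorf("feature %q: unknown kind %q", f.Field, f.Kind)
		}
		data.Slots = append(data.Slots, s)
	}
	var buf bytes.Buffer
	if err := encoderTmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	src, err := format.Source(buf.Bytes()) // fails if the output is not valid Go syntax
	if err != nil {
		return "", fmt.Errorf("generated code does not parse: %w", err)
	}
	return string(src), nil
}

func main() {
	src, err := generateEncoder(ModelSpec{
		Name: "example_model",
		Features: []FeatureSpec{
			{Field: "bid_floor", Kind: "numeric"},
			{Field: "os", Kind: "onehot", Vocab: []string{"android", "ios"}},
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(src)
}
```

In the generated source, the vector width, field names, and vocabulary values are all literal constants, so the compiler sees straight-line code rather than a loop that interprets a configuration for every row. That is one plausible reason generated encoders can be much faster than a general-purpose one.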
Exceeding performance targets
We’re pleased to report that the new feature encoder has exceeded our performance targets. We’ve benchmarked a single instance of the encoder processing over 35,000 training rows per second, which is fast enough to run feature encoding in real time as we train our models. Compared to our previous version, this reduces CPU time by over 90%.
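For readers who want to relate these figures to each other, the short Go sketch below shows how a rows-per-second number might be measured and how a speedup factor translates into a CPU time reduction. It is a hypothetical illustration, not the benchmark harness used here: the helper names and the stand-in workload are assumptions, and the conversion assumes the same workload before and after.

```go
package main

import (
	"fmt"
	"time"
)

// rowsPerSecond converts a row count and elapsed seconds into throughput.
func rowsPerSecond(rows int, seconds float64) float64 {
	if seconds <= 0 {
		return 0
	}
	return float64(rows) / seconds
}

// cpuTimeReduction returns the fraction of CPU time saved when the same
// work runs `speedup` times faster (for example, 10 gives 0.9).
func cpuTimeReduction(speedup float64) float64 {
	return 1 - 1/speedup
}

// measure times `encode` over n rows and reports rows per second.
func measure(n int, encode func(i int)) float64 {
	start := time.Now()
	for i := 0; i < n; i++ {
		encode(i)
	}
	return rowsPerSecond(n, time.Since(start).Seconds())
}

func main() {
	// Stand-in workload; a real benchmark would call the actual encoder.
	sink := 0
	rate := measure(1_000_000, func(i int) { sink += i % 7 })
	fmt.Printf("measured %.0f rows/s (sink=%d)\n", rate, sink)
	fmt.Printf("a 10x speedup cuts CPU time by %.0f%%\n", 100*cpuTimeReduction(10))
}
```

Under that assumption, a tenfold speedup corresponds to a 90% reduction in CPU time, so a reduction of over 90% is consistent with meeting the tenfold target.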
As a result, we have been able to substantially increase the size of the datasets used to train our predictive models. We’ve increased the number of input features in our production impression-to-install models by over 50%, and have been able to increase the number of rows in our training datasets by over 400%, enabling modeling techniques which were previously time-prohibitive.
By increasing the amount of data available to our models, we have substantially improved the accuracy of our impression-to-install predictions. This has led to up to a 53% decrease in CPIs and a 137% increase in our impression-to-install rates, driving ROAS and scale wins.
Continued rollout and increased testing
We’ve already switched our impression-to-install models to the new system, and are currently rolling it out to our product-specific user quality models. Early results here are also encouraging, showing increased payer rates and greater downstream ROAS.
This project has also improved our model operations workflow. By significantly decreasing the time it takes to train models, we’ve been able to substantially increase our model testing cadence, preparing and evaluating an average of seven additional model tests per day since rollout. This allows us to iterate faster to find model testing wins, and we look forward to sharing the results with you in the coming weeks and months.