With the glomex Media Exchange Service, the global marketplace for premium video content, up and running, I visited the 10th ACM RecSys conference in Boston to get a handle on what’s new in the world of recommender systems and to be inspired for what we at glomex can do for our users.
Here are the top industry lessons learned:
1. Explain recommendations to users
There are many strategies for generating recommendations, from the simplest one based on overall top lists to very sophisticated ones based on user-item interactions and user and/or item properties. However, not only will many people inside your company ask for explanations of the recommendations; it should also be transparent to the end user why they are presented with a given recommendation. Otherwise the user might be disappointed at best (e.g. by an unfitting movie recommendation on Netflix) or offended at worst (e.g. by an unfitting job recommendation on LinkedIn) and quit the service as a result.
If you have a look at Netflix or Spotify, they are very careful to present the user’s recommendations in a transparent way.
The explanations let the user clearly know where those recommendations come from and how the user’s actions and interests influence those recommendations.
2. Avoid bias in recommendations
Evan Estola, Lead Machine Learning Engineer at Meetup, gave multiple examples in his presentation of how recommender systems can go horribly wrong: from Orbitz, an online travel agency which charged Windows and Mac users differently for the same product, to the Microsoft AI which became racist based on Twitter feeds in less than a day.
While you may laugh at those stories, the danger of building a recommender system that suffers from similar biases is very real. Let’s just take the example of Meetup’s main use case of recommending a Meetup group to a person. If you base recommendations on the gender of a user and then explain that a female user gets the recommendation of a cooking group because she is female, that might be horribly offensive to the user. However, if you base the recommendation on the user’s interests and then explain that the user gets the cooking group recommendation because she is interested in exotic food preparation, it becomes much less offensive. Thus, when building a recommender system, you should question whether it is ethical to use this or that feature, even if it boosts performance.
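To make the cooking group example concrete, here is a minimal sketch of an interest-based recommender that never looks at gender and returns an explanation with every recommendation. The function name, the group names and the topics are all made up for illustration; this is not how Meetup does it.

```python
def recommend_groups(user_interests, groups, top_n=2):
    """Rank groups by the number of topics they share with the user's interests.

    Only interests are used as input (no gender or other sensitive attribute),
    and the shared interests double as the explanation shown to the user.
    """
    scored = []
    for name, topics in groups.items():
        shared = sorted(user_interests & topics)
        if shared:
            scored.append((len(shared), name, shared))
    # Most shared interests first; ties are broken alphabetically by group name.
    scored.sort(key=lambda entry: (-entry[0], entry[1]))
    return [
        {
            "group": name,
            "explanation": "Because you are interested in " + ", ".join(shared),
        }
        for _, name, shared in scored[:top_n]
    ]


# Hypothetical data, for illustration only.
groups = {
    "Thai Cooking Circle": {"exotic food", "cooking"},
    "Weekend Hikers": {"hiking", "outdoors"},
    "Chess Club": {"chess"},
}
recommendations = recommend_groups({"exotic food", "hiking"}, groups)
for recommendation in recommendations:
    print(recommendation["group"], "-", recommendation["explanation"])
```

Because the explanation is built from the same signal that produced the ranking, it cannot drift away from what the system actually did.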
3. Understand your recommender system well
If you can’t find out why a recommendation is produced the way it is, you can’t resolve potential biases in the recommender system. Thus, you need to understand how your recommender system ticks and how different inputs lead to different (or the same) recommendations.
Even though the keynote of Claudia Perlich, Chief Scientist at Dstillery, had a different focus, I’d like to take it as an example of the importance of understanding and debugging recommender systems. In the use case that Claudia presented, Dstillery predicts the probability that a site visitor clicks on or buys a product given that he or she is shown an ad, and thus only shows ads to the users with the highest click/conversion probability. The model that is used for this is simply logistic regression. Why? Because 1) Dstillery trains more than 3000 models per day (“good luck doing that with neural nets”) and 2) logistic regression is well understood and easy to debug. Claudia explained several interesting points (“models tend to go where the signal is”, “accidental clicks are more predictable than intentional”, “[prediction] performance is surprisingly stable even under random noise”) and backed each of them up with a graph visualizing the distribution of the predicted click/conversion probability (see slides) to provide intuition about what the model was doing in each case.
[Visualization of the predicted probability that a user is male based on the user’s Facebook likes. At the upper bound of the probability distribution, the accuracy of the prediction is perfect. The uncertain prediction in the middle indicates users without Facebook likes: that is where the model falls back to a prediction based on nothing, which leads to a 50/50 chance of getting the gender right. Slide from Claudia Perlich’s keynote at ACM RecSys 2016]
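To illustrate why logistic regression counts as easy to debug, here is a minimal sketch in plain Python: every feature gets exactly one weight, and that weight can be read directly. The two features and the toy data are invented for this example and have nothing to do with Dstillery’s actual models.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Predicted click probability for one visitor."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train_logistic_regression(rows, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias with plain batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(rows, labels):
            error = predict(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


# Hypothetical features: [saw_product_page, is_returning_visitor]; label 1 = clicked.
training_data = [
    ([1, 1], 1), ([1, 1], 1), ([1, 1], 1), ([1, 1], 0),
    ([1, 0], 1), ([1, 0], 1), ([1, 0], 0), ([1, 0], 0),
    ([0, 1], 1), ([0, 1], 0), ([0, 1], 0), ([0, 1], 0),
    ([0, 0], 1), ([0, 0], 0), ([0, 0], 0), ([0, 0], 0),
]
rows = [features for features, _ in training_data]
labels = [label for _, label in training_data]

weights, bias = train_logistic_regression(rows, labels)
for name, weight in zip(["saw_product_page", "is_returning_visitor"], weights):
    print(name, round(weight, 2))  # a positive weight pushes the click probability up
```

If a prediction looks odd, you can trace it back to the handful of weights that produced it, which is much harder with a deep network.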
Xavier Amatriain, VP Engineering at Quora, mentioned in his tutorial “Lessons Learned from Building Real-Life Recommender Systems” (held together with Deepak Agarwal, VP Engineering at LinkedIn) that one should start out with a simple and well-understood recommender system, his favorite starting point being matrix factorization.
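For readers who have not seen matrix factorization before, this is a minimal sketch of the idea, assuming explicit ratings and plain stochastic gradient descent. The users, items, ratings and hyperparameters are made up, and this is one common variant rather than the specific method from the tutorial.

```python
import math
import random


def factorize(ratings, n_factors=2, learning_rate=0.02, regularization=0.02,
              epochs=500, seed=0):
    """Learn one small vector per user and per item so that their dot product
    approximates the observed rating."""
    rng = random.Random(seed)
    user_factors, item_factors = {}, {}
    for user, item, _ in ratings:
        if user not in user_factors:
            user_factors[user] = [rng.uniform(0.1, 0.5) for _ in range(n_factors)]
        if item not in item_factors:
            item_factors[item] = [rng.uniform(0.1, 0.5) for _ in range(n_factors)]
    for _ in range(epochs):
        for user, item, rating in ratings:
            p, q = user_factors[user], item_factors[item]
            error = rating - sum(a * b for a, b in zip(p, q))
            for f in range(n_factors):
                p_f, q_f = p[f], q[f]
                p[f] += learning_rate * (error * q_f - regularization * p_f)
                q[f] += learning_rate * (error * p_f - regularization * q_f)
    return user_factors, item_factors


def predict_rating(user_factors, item_factors, user, item):
    return sum(a * b for a, b in zip(user_factors[user], item_factors[item]))


def rmse(ratings, user_factors, item_factors):
    squared = [(r - predict_rating(user_factors, item_factors, u, i)) ** 2
               for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical ratings on a 1-5 scale: (user, item, rating).
ratings = [
    ("ann", "news", 5), ("ann", "sports", 4), ("ann", "cooking", 1),
    ("bob", "news", 4), ("bob", "sports", 5),
    ("cat", "cooking", 5), ("cat", "travel", 4), ("cat", "news", 1),
    ("dan", "cooking", 4), ("dan", "travel", 5),
]

untrained = factorize(ratings, epochs=0)   # same seed, so same starting point
trained = factorize(ratings)
rmse_before = rmse(ratings, *untrained)
rmse_after = rmse(ratings, *trained)
print("training RMSE before:", round(rmse_before, 2), "after:", round(rmse_after, 2))
# Score an item the user has not rated yet:
print("bob / cooking:", round(predict_rating(*trained, "bob", "cooking"), 2))
```

The learned vectors are small enough to inspect by hand, which fits the advice to start with something you can understand.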
4. Don’t build a distributed recommender system
Following the point of being able to understand and debug the recommender system, this is more of a technical piece of advice: Stephanie Kaye Rogers, Software Engineer at Pinterest, advises against using a distributed architecture for the recommender system, since that just adds complexity, while most systems that work really well can run on a single machine. You might ask how that is possible with terabytes of data. Well, we know about the sparsity of training data: a user has only visited a very limited number of websites on an internet that is composed of millions of sites. Using intelligent sampling methods, we can create a good training dataset that is manageable on one machine, and recalibrate the model result using importance weighting, exactly as is done by e.g. the team at YouTube (see paper).
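As a rough sketch of the sampling idea, the following example keeps all positive events, keeps only a fraction of the negatives, and then corrects for that with importance weights. It uses plain negative downsampling, which is my own simplification and not the sampling scheme from the YouTube paper; `keep_rate` and all function names are assumptions for illustration.

```python
import random


def downsample_negatives(labels, keep_rate, seed=0):
    """Keep every positive; keep each negative with probability keep_rate.

    Returns (label, weight) pairs. Kept negatives carry the importance weight
    1 / keep_rate so that each one stands in for the negatives that were dropped.
    """
    rng = random.Random(seed)
    sample = []
    for label in labels:
        if label == 1:
            sample.append((label, 1.0))
        elif rng.random() < keep_rate:
            sample.append((label, 1.0 / keep_rate))
    return sample


def weighted_positive_rate(sample):
    """Positive rate estimated from the sample, using the importance weights."""
    total_weight = sum(weight for _, weight in sample)
    positive_weight = sum(weight for label, weight in sample if label == 1)
    return positive_weight / total_weight


def recalibrate(probability, keep_rate):
    """Map a probability predicted by a model trained on the unweighted sample
    back to the scale of the full dataset."""
    return probability / (probability + (1.0 - probability) / keep_rate)


labels = [1] * 200 + [0] * 19800            # toy data: 1% positives
true_rate = sum(labels) / len(labels)
sample = downsample_negatives(labels, keep_rate=0.1)

naive_rate = sum(label for label, _ in sample) / len(sample)
corrected_rate = weighted_positive_rate(sample)
print("events kept:", len(sample), "of", len(labels))
print("true:", true_rate, "naive:", round(naive_rate, 4), "corrected:", round(corrected_rate, 4))
```

The sample is roughly a tenth of the original size, yet the weighted estimate lands close to the true rate, while the unweighted one is far too high.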
5. Feature engineering is still necessary with neural networks
Paul Covington, Senior Software Engineer at Google working on recommendations at YouTube, pointed out that despite the promises of deep learning, you should still invest considerable effort in feature engineering. In his experience, the time you don’t spend on feature engineering you will spend on calibrating the network, and more.
Also, it might pay off to turn business rules into features: e.g. if we think that the consumption mode differs for users who have been away from the service for quite some time, we should use the time of inactivity as a feature instead of a business rule that filters recommendations.
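Here is a small sketch of what that could look like: instead of a hard-coded rule such as “show only top lists to users who were away for more than 30 days”, the days of inactivity become one more input for the model. The field names and the user record are hypothetical.

```python
from datetime import datetime


def days_inactive(last_visit, now):
    """Days since the user's last visit, never negative."""
    return max((now - last_visit).total_seconds() / 86400.0, 0.0)


def build_features(user, now):
    """Feature dictionary for one user; the inactivity is a model input,
    not a filter applied after the model has run."""
    return {
        "days_inactive": days_inactive(user["last_visit"], now),
        "videos_watched_last_30d": user["videos_watched_last_30d"],
    }


# Hypothetical user record, for illustration only.
user = {"last_visit": datetime(2016, 9, 1), "videos_watched_last_30d": 3}
features = build_features(user, now=datetime(2016, 9, 15))
print(features)
```

The model can then learn by itself how much, and in which direction, a long absence should change the recommendations.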
6. Do offline and online evaluation of the recommender system
This may sound obvious to some, but for me the fact that companies with online testing possibilities also do offline evaluation was a surprise. I always thought online results trump offline results at any time, because in the end it’s the business metrics (video minutes, retention, conversion, etc.) that count and not precision or recall. While that might be true, the fact is that online metrics take a lot of time to converge, i.e. to reach a statistically significant result in an online A/B test. User retention, for example, would take at least four weeks to produce a valid test result. Thus, if you wanted to calibrate your recommender system on online test results, the turn-around time would be awfully long.
Additionally, think of the thousands of parameters that need tweaking in a model: deploying a model to production for every one of the thousands of parameter combinations is just not practical.
What is done instead is to tweak models offline with respect to typical offline evaluation metrics (recall, accuracy, diversity, novelty, etc.) and only deploy the best performing model and run online tests on it.
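As a minimal sketch of such an offline comparison, the following example computes precision@k and recall@k on held-out interactions and picks the candidate with the best mean recall. The model names, video IDs and held-out sets are invented for illustration.

```python
def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommendations that the user actually consumed."""
    return sum(1 for item in recommended[:k] if item in relevant) / k


def recall_at_k(recommended, relevant, k):
    """Share of the user's consumed items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    return sum(1 for item in recommended[:k] if item in relevant) / len(relevant)


def mean_recall_at_k(recommendations, held_out, k):
    """Average recall@k over all users in the held-out set."""
    return sum(recall_at_k(recommendations[user], held_out[user], k)
               for user in held_out) / len(held_out)


# Hypothetical held-out interactions and ranked lists from two candidate models.
held_out = {"user_a": {"v2", "v5", "v9"}, "user_b": {"v1"}}
model_outputs = {
    "model_x": {"user_a": ["v1", "v2", "v3", "v4", "v5"],
                "user_b": ["v7", "v1", "v3", "v4", "v5"]},
    "model_y": {"user_a": ["v9", "v8", "v7", "v6", "v4"],
                "user_b": ["v2", "v3", "v4", "v5", "v6"]},
}

best_model = max(model_outputs,
                 key=lambda name: mean_recall_at_k(model_outputs[name], held_out, 5))
print("deploy for the online test:", best_model)
```

Only the winner of this cheap offline comparison would then go into the slow online A/B test.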
Regarding the online metrics, Xavier Amatriain, formerly Engineering Director at Netflix, mentioned that you should be sure what you optimize for and that those metrics are aligned with the product vision. Netflix, for example, uses member retention as a long-term metric, and streaming time and the time needed to find a video as short-term metrics for A/B test evaluation.
YouTube used to optimize for video views (which favors ad revenue), but switched to view time in order to optimize for user engagement in the long run (even if that switch decreased their revenue in the short run).
That’s it for now with my compilation of industry lessons learned about recommender systems. Stay tuned for a compilation of research insights presented at RecSys 2016. And if you want to learn lessons about recommender systems in a real-world application yourself, come join us at glomex!
by Cindy Lamm, Data Scientist