This post was written with contributions from Mike Tamir, Chief Scientist and Learning Officer at Galvanize.
Our lives have become inundated by recommendation engines. On a daily basis, we receive recommendations on things to do, purchase, and watch from the likes of Amazon, Netflix, and Yelp. At face value, we take recommendations for granted—most recommendations simply make sense: people who like this one action movie will probably like this other action movie too. But there’s a lot going on under the hood that makes them work, sometimes with remarkable insight.
To speak very broadly, recommendation engines work when you have an inventory of items—this could be movies, music, things you want to sell, or some combination thereof—and a list of users. The goal is then to match each user with the items they're most likely to be interested in purchasing, doing, or watching.
When you think about this like a data scientist, you have your user, and then you have your items, and each of those has a list of associated properties (features). One thing you can do is look at the properties the user likes—based on items they have purchased in the past or items they have reviewed or rated—as well as the properties of each item, and then start matching them that way. This is what’s referred to as content-based filtering.
Say we’re a bookseller. Books have all sorts of properties associated with them: the color of the book cover, the length, the author, the genre, and so on. You can then start building a classifier based on who each user is—what kinds of books that user has purchased or rated in the past—and the properties of each individual book.
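The idea can be sketched in a few lines of Python. The book titles and one-hot feature vectors below are made up for illustration: the user's profile is simply the average of the feature vectors of the books they already bought, and we recommend the unseen book whose features are most similar to that profile (by cosine similarity).

```python
import numpy as np

# Hypothetical one-hot book features: [sci-fi, mystery, long, short]
# (all titles and vectors here are invented for illustration)
books = {
    "Dune":        np.array([1, 0, 1, 0]),
    "Neuromancer": np.array([1, 0, 0, 1]),
    "Foundation":  np.array([1, 0, 1, 0]),
    "Gone Girl":   np.array([0, 1, 1, 0]),
}

# The user's taste profile: the average feature vector of past purchases
purchased = ["Dune", "Neuromancer"]
profile = np.mean([books[b] for b in purchased], axis=0)

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score every unseen book by its similarity to the profile
scores = {title: cosine(profile, vec)
          for title, vec in books.items() if title not in purchased}
best = max(scores, key=scores.get)  # the content-based recommendation
```

Here the sci-fi reader gets pointed at another sci-fi novel, because its feature vector lines up with the profile better than the mystery's does.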
However, this becomes complicated when you’re an online retailer such as, say, Amazon, and you go from being just a bookseller to being an everything seller. Suddenly all those book properties are no longer useful when a user wants to purchase a spatula or a pair of headphones. You might know something about a user’s book preferences, but you don’t know anything about their movie preferences. The goal now is to learn about the user across the entire inventory.
This is where collaborative filtering comes into play. Collaborative filtering means that you’re using other users—the information you have about your entire host of users—to give you answers about a particular user for whom you want to make recommendations.
The major insight of collaborative filtering is that a user’s features are not the properties of the things he or she likes. The raw features of a user are that user’s purchasing patterns themselves: what that user has bought, rated, or clicked on. Similarly, the features of an item are not things like length, genre, or author (in the case of a book); they are simply the list of users who have bought, selected, or highly rated that particular item.
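A minimal way to picture this, with an invented interaction matrix: each row is a user, each column is an item, and a 1 means the user bought (or clicked, or rated) that item. A user's features are then just their row of the matrix, and an item's features are just its column.

```python
import numpy as np

# Hypothetical user-item interaction matrix (rows = users, cols = items):
# 1 = the user bought/clicked/rated that item, 0 = no interaction.
items = ["Star Wars Lego", "Lightsaber Toy", "Cookbook"]
R = np.array([
    [1, 1, 0],   # user 0
    [1, 1, 0],   # user 1
    [0, 0, 1],   # user 2
])

# In collaborative filtering, these interactions ARE the features:
user_features = R[0]      # user 0's features: everything they interacted with
item_features = R[:, 0]   # the Lego set's features: every user who chose it
```

Notice that no one had to label the Lego set with properties like "toy" or "Star Wars": its identity is entirely defined by which users selected it.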
Now your engine is collaborative, in the sense that all of your users are doing the work of telling you what’s important about each item. They do that work by watching movies, purchasing items, and reading stories that all have one thing in common: the user likes them. If a user chooses to watch a movie, that’s a vote of confidence. And if they rate it highly, that’s a signal that this is the sort of movie they like, or the sort of item they want to purchase.
When you bring enough of these users together, you can start learning how someone’s watching patterns might inform recommendations for a type of product they have never purchased before. This way, you as the data scientist don’t have to come up with the important raw features for your entire product catalog yourself. That might not seem like much for items with simple features, such as books, but think about electronics, or any number of complex products you might be selling. All you have to worry about is the bulk collaborative work, which the wisdom of the crowd does for you.
Let’s walk through a potential user. Say we’re Amazon, and we want to recommend something to a user we know absolutely nothing about. All we know is that they’re on a computer we’ve never seen before, at an IP address we’ve never seen, in a browser using private browsing. So we know nothing about this user, and then the user goes to Amazon and searches for Star Wars Legos.
Now we have a single data point, and with that data point we can run a test. Given only the search “Star Wars Legos,” what else might we recommend? We could recommend the most popular item we sell, bananas, or we could recommend arbitrary items with similar SKUs. We could recommend other Star Wars items or other Lego items. Or, we could use all the data we already have: all the collaborative information from our user base about Star Wars Legos.
In that case, what we do is build something called an item-to-item covariance matrix: we look at every item purchased and at which other items users tend to purchase along with it, and compute a measure of that co-purchasing called covariance. On Amazon, you’ll recognize this as the “Customers Who Bought This Item Also Bought” section.
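A toy version of that item-to-item matrix, using raw co-occurrence counts as a simple stand-in for a full covariance computation (the catalog and purchase history below are invented): entry (i, j) counts how many users bought both item i and item j, and the recommendation for a query item is its most frequently co-purchased neighbor.

```python
import numpy as np

items = ["Star Wars Lego", "Lightsaber Toy", "Banana", "Cookbook"]
# Hypothetical purchase matrix: rows = users, cols = items (1 = bought)
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
])

# Item-to-item co-occurrence: entry (i, j) = number of users who bought both
C = R.T @ R
np.fill_diagonal(C, 0)  # an item trivially co-occurs with itself; ignore that

# "Customers who bought this item also bought..."
query = items.index("Star Wars Lego")
also_bought = items[int(np.argmax(C[query]))]
```

Even though bananas are the most purchased item overall in this toy data, the co-occurrence counts surface the lightsaber toy, because it is what Lego buyers in particular tend to add to their carts.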
When we don’t have much information about the user, as in this case, that’s probably where we want to stop. But what if we have much more information? In our next post, we’ll talk about how recommendation engines like Netflix use latent features and matrix factorization to produce personalized recommendations.
Want more data science tutorials and content? Subscribe to our data science newsletter.