Recommender Systems: Content-Based Recommendations & Collaborative Filtering [FULL]

At Kyoko
10 min readJul 11, 2021

--

Whilst browsing around the Internet, you may often find that a lot of what we see is unique to our own preferences, personalized in such a way that we’re able to receive information that’s relevant to our needs.

Today, we’ll discuss just how this works.

Meet Recommender Systems.

Admire the complexity.

So far in this blog, we have covered two types of ML: Supervised learning and Unsupervised learning. The system we’re discussing today is a combination of both types. We’ll cover:

  1. The Idea [Introduction]
  2. Problem Formulation
  3. Content-Based Recommendations
  4. Collaborative Filtering
  5. Low-Rank Matrix Factorization
  6. Mean Normalization

Let’s get started.

1. The Idea: an Introduction

Just as the name implies, recommender systems seek to look for certain patterns used to then filter information so as to help make useful recommendations for us.

In other words, these systems aim to predict a user’s likes and dislikes. This information will then be used to filter out and recommend items relevant to that specific user.

There are two ways to achieve this: using content-based recommendations and collaborative filtering.

Before we get into that, let’s begin with some basic notation.

2. Problem Formulation

In this particular section, we will be formulating some values to better simplify the discussed concepts. Refer back to this for notation.

Say we’d like to create a recommender system that predicts movie ratings from users by looking at other movies they’ve rated in the past and understand other users’ behaviors similar to them.

Notation:

  • n_u = number of users (4)
  • n_m = number of movies (5)
  • r(i, j) = 1 if movie i has already been rated by user j
  • y(i, j) = the rating of movie i given by user j
    — Note: this is only possible when r(i, j)=1

For example, y(4,3) = 5 since Carol rated the movie Sun Tzu, Art of War with 5 stars, r(3,1) = 0 since Alice has not rated the movie Cute Kitties but has rated other romance-related films like Love at the Beach with 5 stars. Knowing this, the recommender algorithm understands that Carol prefers action movies as opposed to Alice who’d favor the romance ones.

In short, the goal is to have our algorithm predict the rating users would’ve given for those unwatched movies (?).

3. Content-Based Recommendations

One way to do this is by using an approach called content-based recommendations. The idea is to do the recommendation based on the contents of the items (movies) in a set of features.

Say we create a new feature n that measures how much romance or action a movie is. See the following:

For each user j, we want the algorithm to predict the rating of movie i using the function (θ(j))(x(i)). Where:

  • θ(j) is the ℝ⁺¹ parameter vector for user j.
  • x(i) is the ℝ⁺¹ feature vector for movie i.
    — Recall that the bias term x0 is always 1.

— Example: given the parameter vector for Alice θ(1) as [0; 5; 0] and the feature vector for the unrated movie x(3) as [1; 0.99; 0], the rating calculated with (θ(1)ᵀ)(x(3)) is then 4.95 (5 x 0.99).

This then can be applied to a separate linear regression problem, used for learning θ(j) after having each user j rated an m(j) number of movies:

Copy-paste: minθ(j) 1/2 ∑i:r(i,j)=1 ((θ(j))ᵀ(x(i)) − y(i,j))² + λ/2 n∑k=1 (θ(j)k)²

— Note: the ‘m(j)’ has been ommitted for simplification.

To learn ALL parameter vectors θ(1), θ(2), …, θ(n_u):

Copy-paste: minθ(1),…,θ(n_u) 1/2 n_u∑j=1 ∑i:r(i,j)=1 ((θ(j))ᵀ(x(i)) − y(i,j))² + λ/2 n_u∑j=1 n∑k=1 (θ(j)k)²

— The 2 summation symbols n_u∑j=1 were added.

Applying this to the Cost Function,

Copy-paste: J(θ(1),…,θ(nu)) = 1/2 n_u∑j=1 ∑i:r(i,j)=1 ((θ(j))ᵀ(x(i)) − y(i,j))² + λ/2 n_u∑j=1 n∑k=1 (θ(j)k)²

And to Gradient Descent,

Copy-paste: θ(j)k = θ(j)k − α (∑i:r(i,j)=1 ((θ(j))ᵀx(i) − y(i,j)) x(i)k), for k=0 — separate this for the bias term x0 θ(j)k = θ(j)k − α (∑i:r(i,j)=1 ((θ(j))ᵀx(i) − y(i,j)) x(i)k + λθ(j)k), otherwise

Note: you don’t need to regularize the bias term, hence the separation.

Using these, recommender systems can rely on features that take in the content of movies to help them predict someone’s liking towards one.

However, the process of coming up with such reliable features isn’t easy. Thus, we have another alternative for recommender systems:

4. Collaborative Filtering

In contrast to finding the values of θ(j), what if we try to learn the feature values of x(i) instead? Previously, these values were already provided. But again, having someone watch and rate how romantic or action-packed is each movie just is not easy: it’s time-consuming and expensive.

We can initially make a guess for the values of θ(j), and then use it to learn the values of x(i). Using these learned values, we can learn again the values of θ(j) in such a way as to better improve the originally estimated θ(j) values used for learning x(i). These values are then used to learn again better values of x(i), and the cycle repeats.

Guess θ(j) → learn x(i) → learn θ(j) → learn x(i) → learn θ(j) → …

This way, we’re able to get better movie ratings and better recommendations for users. What’s good about this is that every user is collaborating in a way that helps the algorithm better filter movie recommendations.

Hence, we’re relying more on user collaboration than content.

4.1 The Idea

*Note: it’s supposed to use Θ, not θ.

By not knowing anything about a movie, we can infer that the first movie x(i) is more of a romantic movie just by looking at the users’ ratings.

Alice and Bob have similar likings, rating 5 stars for x1⁽¹⁾. On the other hand, Carol and Dave rated 0 stars for that movie, indicating that they both prefer another type of movie, seen from their ratings on x2⁽¹⁾.

The goal of a Collaborative Filtering recommender system is to find a value of x(i) so that the function (θ(j))(x(i)) may result in the value of the actual rating θ(j) given by the user.

For example, what feature vector x(1) value would give (θ(j))(x(i)) ≈ 5? Knowing the values of θ(1) = 5, we can sort of estimate that a reasonable value for x(1) would be 1.0. Meaning, the feature vector x(1) is 100% (1.0) a romance movie x1(1) and 0% (0.0) a action one x2(1).

All of this being the result of user collaboration.

4.2 Formulation

Given the [θ(1), …, θ(n_u)] values, the algorithm learns x(i) by minimizing the squared difference between the predicted ratings (θ(j))(x(i)) and the actual given rating y(i, j) using the following regression,

Copy-paste: minx(i) 1/2 ∑j:r(i,j)=1 ((θ(j))ᵀ(x(i)) − y(i,j))² + λ/2 n∑k=1 (x(j)k)²

To learn all the feature vectors [x(1), …, x(n_m)] for all movies m, given the ratings by a large number of users [θ(1), …, θ(n_u)],

Copy-paste: minx(1),…,x(n_u) 1/2 n_u∑i=1 ∑j:r(i,j)=1 ((θ(j))ᵀ(x(i)) − y(i,j))² + λ/2 n_u∑i=1 n∑k=1 (x(j)k)²

Hence the following gradient descent update:

Copy-paste: xk(i) ​:= xk(i) ​− α (∑j:r(i,j)=1​ [(θ(j))ᵀ(x(i)) − y(i,j)] θk(j) ​+ λxk(i)​) — if i ≠ 0.

If you notice, these formulae look very similar to that of the equations used for content-based recommendations which try to find the user parameter vector values of θ(j). What we did, in this case, is simply changing those θ(j) variables into x(i) which is what it’s trying to learn, as well as change the function of the summation symbols ∑.

4.3 The Algorithm

Collaborative filtering works by learning the values of either θ(j) and x(i) using the given features or user ratings, respectively, and so on — refer to the cycle discussed at the beginning of this section.

To summarize, the current optimization objective we have is as follows.

1/2. Given x(1), x(2), …, x(n_u), learn θ(1), θ(2), …, θ(n_u):

2/1. Given θ(1), θ(2), …, θ(n_u), learn x(1), x(2), …, x(n_u):

3. Repeat.

Instead of having to go back and forth finding θ and x values, there’s a way we can do solve for both values simultaneously. Combining the two formulae from above into one minimization objective, we may have the following Cost Function that takes in both parameters:

Copy-paste: J(x(1),…,x(nm),θ(1),…,θ(nu)) = 1/2 ∑(i,j):r(i,j)=1 ((θ(j))ᵀx(i)−y(i,j))² + λ/2 nm∑i=1 n∑k=1 (x(i)k)² + λ/2 nu∑j=1 n∑k=1 (θ(j)k)²

Note: there isn’t x0 = 1 or θ0, hence x ∈ ℝⁿ and θ ∈ ℝⁿ.

Hence, the optimization objective:

Copy-paste: min(x(1),…,x(nm),θ(1),…,θ(nu)) J(x(1),…,x(nm),θ(1),…,θ(nu))

This way the algorithm is able to minimize both parameters θ and x simultaneously, without having to continuously go between those two. This optimization objective is actually just the same as both the Gradient Descent Updates (2) from earlier if either x or θ is kept constant, respectively.

4.4 Conclusion

The following is, thus, our collaborative filtering algorithm:

  1. Initialize small random values to both the parameters x(2), …, x(n_u) and θ(1), θ(2), …, θ(n_u).
  2. Minimize J(x(1), …, x(n_m), θ(1), …, θ(n_u)) using either gradient descent or any other advanced optimization algorithm.
  3. Given the user ratings with parameter θ and a movie with learned features x, predict the star ratings of the unrated movies using θᵀx.

Using this, the algorithm is able to simultaneously learn good features for movies and make good predictions on user ratings, thus filtering movies based on the collaborative data collected on users’ movie preferences.

5. Low-Rank Matrix Factorization

5.1 Definition

Going back to our original data, we can group all the user ratings into a matrix Y based on the y(i,j) ratings of movie i made by user j,

A vectorized implementation of this would be:

Copy-paste: Y = XΘᵀ

Where,

This algorithm is the Low-Rank Matrix Factorization, known for its low-rank matrix attribute from linear algebra.

5.2 Application

This can be very helpful in finding other movies related to what a specific user has watched, to then be recommended designated to that user.

Technically, for each movie i, we’d like to learn a feature vector x(i) ∈ ℝthat sort of grabs the essence of a movie’s contents — which part of it makes the users like or dislike the movie?

Once learned, we then need to find other unwatched movies j to recommend to a specific user, based on how similar it is to the movies i that user has watched. We can do this by finding the euclidean ‘distance’ between the feature values of movies i and j, and say they’re similar the smaller their differences:

Small ∥ x(i) − x(j)∥ → Movies are similar.

Hence, recommend movies with the smallest distance.

6. Mean Normalization

6.1 The Idea

Say we have a new user who has not rated any movie,

If we were to apply the current algorithm we have, using the Y matrix,

And the given cost function,

Eve has not rated any movies, so there are no cases where r(i,j)=1. Hence, it’s only trying to find values of Θ(5) that minimizes that last regularization term, which will result in just a vector of zeros (for Θ1(5) & Θ2(5)).

Then all the predicted ratings for Θ(5) are zero (0).

This is bad.

Which movie are we going to recommend to Eve if all ratings are the same, none of which has a higher rating than the other? We can’t also assume that Eve will rate those movies as 0 once she has watched them.

This is where Mean Normalization comes in.

6.2 Mean Normalization

The idea here is to find some other rating that’s at least ideally safe, in a way. Safe enough for the algorithm to make decent predictions. Such as taking the average ratings made by other users.

To do this, we’ll normalize each row of the rating Y matrix with their mean and then subtract the mean off the ratings:

Treating the updated Y matrix as the original Y matrix used to learn Θ(j) and X(i), we’ll then denormalize them after the predictions were made. We do this by adding the averages back to the predicted ratings,

Copy-paste: (Θ(j))ᵀ(X(i)) + μ_i

This means that the ratings predicted for Eve is simply just the average ratings of all other users’ ratings,

Hence, the predicted ratings:

This means that we’re just going to recommend Eve movies based on the average star ratings of movies given by other users, without knowing her movie preferences in the first place.

Additionally, this also works for instead movies with no ratings. In this case, you’d want to normalize the columns of the rating Y matrix as opposed to the rows which is what we did do for users who haven’t rated any movies — this isn’t that important, though, as movies without ratings shouldn’t really be recommended to anyone anyway.

This is all about Recommender Systems as explained so far in the course. Next time, we’ll discuss other ML-related topics.

See you then.

— This article was originally posted on https://azumiekyu.blogspot.com/
Visit the site for better formatting. Thanks for reading!

--

--

At Kyoko
At Kyoko

Written by At Kyoko

Researcher on Artificial Intelligence (Machine Learning), Web & App Development, Human Psychology, and Sociology. I love learning new things about the world.

Responses (1)