Building a Simple Recommendation System in Python Using Cosine Similarity
So, what is a recommendation system, and why might an application need one? If you've got any type of application that uses a content management system, you may have noticed that it could be beneficial to have a way of recommending certain content to certain users. This is basically what most people refer to as "the algorithm". But what exactly is this algorithm, and how could you implement it on your platform?
The truth is, there isn't one single algorithm used to recommend content to users. There are many different methods, and which one you choose depends on your exact goals. For example, you could arrange users into clusters that are each shown certain types of content. In this article, I'll explain a few of the methods most commonly used for recommending content and how you could implement them on your platform.
Popularity Based Recommendations
This is the simplest type. All you basically do is pick some popularity metric, like which items have the most clicks, views, or likes, and recommend those items to everyone. This method isn't particularly personalized, because items with fewer interactions tend to get missed by users in favor of ones that already have a fair amount of popularity, but it can be useful when you don't have enough other data points to rely on. For example, you won't know anything about new users on your platform, so you can't give them recommendations based on their past interactions. This scenario is called the 'cold start' problem. Some of the methods I'll demonstrate next have this problem because they only work well when you have an adequate amount of past data to use for making predictions. To address cold starts, popularity-based recommendations are pretty useful.
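As a minimal sketch (the item names and the interaction log here are made up for illustration), a popularity-based recommender can be as simple as counting interactions per item and returning the top results:

```python
from collections import Counter

# Hypothetical interaction log: each entry is an item a user clicked on
interactions = [
    "item_a", "item_b", "item_a", "item_c",
    "item_a", "item_b", "item_d",
]

def most_popular(interactions: list[str], top_n: int = 3) -> list[str]:
    # Count how many interactions each item has received,
    # then return the top_n items by that count
    counts = Counter(interactions)
    return [item for item, _ in counts.most_common(top_n)]

print(most_popular(interactions))  # item_a first, since it has the most clicks
```

In a real system the counts would come from your database, and you'd probably weight recent interactions more heavily, but the core idea is just this.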
Content Based Filtering
In this type, the system recommends items to a user based on other items that have features similar to the ones they've already interacted with positively in the past. Another way of thinking about features is as categories that content can belong to. So for example, when recommending movies, you can say that a movie has a certain amount of comedy, drama, action, sci-fi, etc. associated with it. Let's look at how to implement it!
Let's break down what our approach is before we start writing any code.
- 1. Import NumPy, since we're going to work with arrays and do a little math
- 2. Represent each movie as a vector where each number is a float from 0 to 1 representing 'how much' of a specific feature it exhibits. These values would have to be determined beforehand, so you'd need to compute them from metadata, enter them manually, or use an ML model that produces embeddings some other way
- 3. Have ratings for some of those movies already. These should also be floats from 0 to 1
- 4. Build a user profile based on what the user has already shown interest in
- 5. Use Cosine Similarity to find movies similar to those
To recap, Cosine Similarity is a way to find vectors with the smallest difference in angle between them. If they're pointing in generally the same direction, we can say that they're similar. The larger the cosine of the angle between them (the closer it is to 1), the more similar they are.
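For a quick intuition before the full implementation, here's the idea on a few tiny hand-picked vectors (the numbers are arbitrary):

```python
import numpy as np

a = np.array([1.0, 0.0])   # points along the x-axis
b = np.array([0.9, 0.1])   # points in almost the same direction as a
c = np.array([0.0, 1.0])   # perpendicular to a

def cos_sim(u: np.ndarray, v: np.ndarray) -> float:
    # cosine of the angle: dot product divided by the product of the lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cos_sim(a, b))  # close to 1: very similar directions
print(cos_sim(a, c))  # 0.0: perpendicular, not similar at all
```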
import numpy as np
# Each movie is represented as a feature vector
# Features: [action, comedy, romance, sci-fi, thriller]
movies = {
"The Matrix": np.array([0.8, 0.1, 0.1, 0.9, 0.6]),
"Inception": np.array([0.7, 0.0, 0.2, 0.8, 0.8]),
"The Notebook": np.array([0.0, 0.2, 0.9, 0.0, 0.1]),
"Superbad": np.array([0.1, 0.9, 0.3, 0.0, 0.0]),
"Alien": np.array([0.6, 0.0, 0.0, 0.9, 0.8]),
"Interstellar": np.array([0.4, 0.0, 0.3, 1.0, 0.5]),
"The Hangover": np.array([0.2, 1.0, 0.2, 0.0, 0.1]),
"Blade Runner 2049": np.array([0.5, 0.0, 0.2, 1.0, 0.6]),
}
# User has rated these movies (0.0 - 1.0)
user_ratings = {
"The Matrix": 1.0,
"Inception": 0.9,
"Alien": 0.7,
}
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # np.linalg.norm() gives the Euclidean length (magnitude) of a vector,
    # which is basically the Pythagorean theorem in n dimensions.
    # The dot product on its own mixes magnitude with direction, and movies
    # with larger feature values would dominate the result. Dividing by the
    # product of the magnitudes cancels that out, leaving only the angle
    # between the vectors.
    magnitude = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / magnitude) if magnitude > 0 else 0.0
def build_user_profile(ratings: dict, catalog: dict) -> np.ndarray:
    """
    Weighted average of feature vectors for rated items.
    Higher-rated movies contribute more to the profile.
    """
    total_weight = sum(ratings.values())  # sum of all this user's ratings
    # For each movie the user has rated, scale its feature vector
    # by the rating and sum the results...
    profile = sum(
        catalog[title] * rating
        for title, rating in ratings.items()
    )
    # ...then divide by the sum of the ratings to get the weighted average
    return profile / total_weight
def recommend(
    user_profile: np.ndarray,
    catalog: dict,
    seen: set,
    top_n: int = 3
) -> list[tuple[str, float]]:
    scores = {
        title: cosine_similarity(user_profile, vec)
        for title, vec in catalog.items()
        if title not in seen
    }
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_n]
user_profile = build_user_profile(user_ratings, movies)
recommendations = recommend(user_profile, movies, seen=set(user_ratings))
print("User profile vector:", np.round(user_profile, 2))
print("\nRecommendations:")
for title, score in recommendations:
    print(f"  {title}: similarity: {score:.4f}")
Then running that, we get:
User profile vector: [0.71 0.04 0.11 0.87 0.72]
Recommendations:
  Blade Runner 2049: similarity: 0.9752
  Interstellar: similarity: 0.9419
  The Hangover: similarity: 0.1965
This is a pretty powerful way to find items similar to ones the user has shown interest in before. One drawback is that it can be limiting: users can get 'trapped' in a bubble where they're only recommended content that's very similar to what they've already seen. Also, like I mentioned above, this method suffers from the cold start problem, because you need a fair amount of data on which movies the user has already interacted with.
Collaborative Filtering
In this type, the system recommends items to a user based on how similar that user's preferences are to other users'. An example of collaborative filtering is when you're on an ecommerce site and see a section that says something like "Users who bought this also bought these...". This method generally gives strong recommendations for an individual user, but it depends on them having already demonstrated their preferences on a significant number of items, so it also has the cold start problem. Let's look at how to do it.
Implementing Collaborative Filtering
So, our goal with collaborative filtering is to find users with similar preferences so we can predict which items they haven't seen might appeal to them. To do this, we can again use Cosine Similarity, this time to find similar users based on past interactions. Before we can apply it, we need to turn each user record into a vector where each number is that user's 'score' for a particular item. To come up with these scores, you'd have to implement some kind of rating system, like a way for users to give 'stars' to an item, but you can honestly use almost any signal, for example, how long the user paused scrolling or swiping with a certain item in view. Implicit metrics like that can give you a better picture, since users don't always rate items even when they technically 'liked' them.
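As a sketch of what an implicit signal could look like (the dwell times and the max-normalization rule here are assumptions, not a standard), you might turn raw view durations into 0 to 1 scores like this:

```python
# Hypothetical dwell times in seconds for items a user viewed
dwell_seconds = {"item_a": 12.0, "item_b": 2.5, "item_c": 30.0}

def dwell_to_scores(dwell: dict) -> dict:
    # Normalize by the longest dwell time so every score lands in [0, 1]
    longest = max(dwell.values())
    return {item: seconds / longest for item, seconds in dwell.items()}

print(dwell_to_scores(dwell_seconds))
```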
Let's skip the actual rating metric implementation and assume we already have that info. So for example, if we had 5 users on our platform and 4 items in the database, and each user rated some of the items on a scale of 1-5, we could create a vector for each user's ratings and combine them into a 5 x 4 matrix like this:
import numpy as np
# User-item matrix: rows = users, cols = items, values = ratings (0 = not rated)
ratings = np.array([
[5, 3, 0, 1],
[4, 0, 4, 1],
[1, 1, 0, 5],
[0, 0, 5, 4],
[2, 0, 3, 0],
])
Now that we have each user's ratings listed out as a vector, we can use cosine similarity to see which of these vectors point in the most similar directions. Let's add a simple function that uses the cosine_similarity function from sklearn.metrics.pairwise. In the Content Based Filtering example above, we wrote our own with numpy; here we'll just use sklearn's. We pass the ratings matrix to cosine_similarity and it returns a matrix showing the similarity of each user to every other user.
...
def recommend(user_idx: int, ratings: np.ndarray, top_n: int = 2) -> list[int]:
    sim = cosine_similarity(ratings)  # (n_users, n_users)
    user_sim = sim[user_idx]          # similarity to all other users
    # Weighted sum of other users' ratings
    weighted = user_sim @ ratings     # dot product across users
    # Zero out items the user already rated
    already_rated = ratings[user_idx] > 0
    weighted[already_rated] = 0
    return np.argsort(weighted)[::-1][:top_n].tolist()
print(recommend(0, ratings))  # indices of items to recommend to user 0
If we print sim, we'll see the matrix below. It's a users x users matrix showing the similarity score of each user to every other user. For example, user 0 and user 1 have a score of about 0.62. The diagonal is always 1s, because if you compare a user to themselves, obviously they're perfectly similar.
[[1. 0.61791438 0.42289003 0.10559274 0.46880723]
[0.61791438 1. 0.30151134 0.6524727 0.9656091 ]
[0.42289003 0.30151134 1. 0.60111309 0.1067521 ]
[0.10559274 0.6524727 0.60111309 1. 0.64972212]
[0.46880723 0.9656091 0.1067521 0.64972212 1. ]]
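As a sanity check, the 0.61791438 entry for users 0 and 1 can be reproduced by hand with plain numpy, using the same dot-product-over-lengths formula from the content-based example:

```python
import numpy as np

u0 = np.array([5, 3, 0, 1])  # user 0's ratings row
u1 = np.array([4, 0, 4, 1])  # user 1's ratings row

# dot product divided by the product of the vector lengths
sim_01 = np.dot(u0, u1) / (np.linalg.norm(u0) * np.linalg.norm(u1))
print(round(sim_01, 4))  # 0.6179
```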
On the next line, all we need to do is pass in the index of a particular user, and we'll get that user's row, showing how similar they are to each of the other users, like this:
# for user 0
[1. 0.61791438 0.42289003 0.10559274 0.46880723]
Now that we have a vector describing user 0's similarity to every user, we can treat each number in it as a 'weight' for how much that user's ratings should influence the prediction. That means we can estimate a score for each item by multiplying each user's ratings by their weight and summing the results, which is exactly the dot product of the similarity vector with the ratings matrix. Why the dot product? Recall that multiplying a row vector by a matrix multiplies each element of the vector by the corresponding row of the matrix and sums down each column, producing one value per item. This is called a weighted sum. Below is the math behind it. Notice that users 1 and 4 are the most similar to user 0, so their ratings carry the most 'weight' in determining how user 0 would score each item.
user_sim (User0's row from the similarity matrix):
[1.00, 0.61791438, 0.42289003, 0.10559274, 0.46880723]
ratings matrix:
Item0 Item1 Item2 Item3
User0 [ 5, 3, 0, 1 ]
User1 [ 4, 0, 4, 1 ]
User2 [ 1, 1, 0, 5 ]
User3 [ 0, 0, 5, 4 ]
User4 [ 2, 0, 3, 0 ]
Item0 = (1.00 * 5) + (0.61791438 * 4) + (0.42289003 * 1) + (0.10559274 * 0) + (0.46880723 * 2)
Item1 = (1.00 * 3) + (0.61791438 * 0) + (0.42289003 * 1) + (0.10559274 * 0) + (0.46880723 * 0)
Item2 = (1.00 * 0) + (0.61791438 * 4) + (0.42289003 * 0) + (0.10559274 * 5) + (0.46880723 * 3)
Item3 = (1.00 * 1) + (0.61791438 * 1) + (0.42289003 * 5) + (0.10559274 * 4) + (0.46880723 * 0)
That gives us this:
# weighted
[8.83216202 3.42289003 4.40604289 4.15473548]
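You can verify those four sums with a couple of lines of numpy, where user_sim is just User0's row copied from the similarity matrix above:

```python
import numpy as np

# User0's row from the similarity matrix
user_sim = np.array([1.0, 0.61791438, 0.42289003, 0.10559274, 0.46880723])

# Same user-item matrix as before: rows = users, cols = items
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 4, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
    [2, 0, 3, 0],
])

weighted = user_sim @ ratings  # matches the four hand-computed sums above
print(weighted)
```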
Now, you may be asking yourself, "What do we do with the weighted sums for the items that the user already has scores for?" Well... we just throw them out. Since those items have already been rated by the user in question, we can just set them to zero, like this:
already_rated = ratings[user_idx] > 0  # boolean mask of the items the user already rated
weighted[already_rated] = 0  # set them to zero
so the final list of predicted ratings looks like this:
[0, 0, 4.40604289, 0]
So it's actually kind of intuitive that Item2 is the recommendation: User1 (sim ≈ 0.62) and User4 (sim ≈ 0.47) both rated it positively, and they are the users most similar to User0. User3 also rated it 5, but their similarity is only about 0.11, so their influence doesn't carry much weight. Putting all the code together, we get this:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# User-item matrix: rows = users, cols = items, values = ratings (0 = not rated)
ratings = np.array([
[5, 3, 0, 1],
[4, 0, 4, 1],
[1, 1, 0, 5],
[0, 0, 5, 4],
[2, 0, 3, 0],
])
def recommend(user_idx: int, ratings: np.ndarray, top_n: int = 2) -> list[int]:
    sim = cosine_similarity(ratings)  # (n_users, n_users)
    user_sim = sim[user_idx]          # similarity to all other users
    # Weighted sum of other users' ratings
    weighted = user_sim @ ratings     # dot product across users
    # Zero out items the user already rated
    already_rated = ratings[user_idx] > 0
    weighted[already_rated] = 0
    return np.argsort(weighted)[::-1][:top_n].tolist()
print(recommend(0, ratings))  # indices of items to recommend to user 0
So that's how you can find content to recommend to your users. A couple of other things to mention, though. Calculating these similarities every time you want to show recommendations is a waste of compute, so you might want to recompute them periodically (on a cron job or something) and have the script update fields in your database, so you can easily query similar items or users. Also, personally, I still think there's merit in letting natural discovery happen, so you should still give users the option to view items in chronological order, browse what they chose to follow, search items by keyword, etc.
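As a rough sketch of that precompute-and-store idea (the table name and schema here are made up for illustration, and an in-memory SQLite database stands in for your real one), a periodic job could write all the pairwise similarities into a table you can query cheaply at request time:

```python
import sqlite3
import numpy as np

# Same user-item ratings matrix as in the collaborative filtering example
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 4, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
    [2, 0, 3, 0],
])

def cosine_matrix(m: np.ndarray) -> np.ndarray:
    # Normalize each row to unit length; then every pairwise dot product
    # is a cosine similarity
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return unit @ unit.T

def refresh_similarities(conn: sqlite3.Connection, ratings: np.ndarray) -> None:
    # The kind of job you'd run on a schedule: recompute all pairwise
    # similarities and overwrite the cached rows
    sim = cosine_matrix(ratings)
    conn.execute("DROP TABLE IF EXISTS user_similarity")
    conn.execute(
        "CREATE TABLE user_similarity (user_a INTEGER, user_b INTEGER, score REAL)"
    )
    rows = [
        (a, b, float(sim[a, b]))
        for a in range(len(sim))
        for b in range(len(sim))
        if a != b
    ]
    conn.executemany("INSERT INTO user_similarity VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
refresh_similarities(conn, ratings)

# At request time, finding the most similar user is a cheap lookup, not a recompute
row = conn.execute(
    "SELECT user_b, score FROM user_similarity "
    "WHERE user_a = 0 ORDER BY score DESC LIMIT 1"
).fetchone()
print(row)  # user 1 is the most similar to user 0
```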