Users like options, but too many options can lead to analysis paralysis.
That’s where recommender systems come in. These tools have come a long way, making it easier for businesses to offer plenty of options without overwhelming their users. It’s the best of both worlds: variety without the decision fatigue.
Let’s unpack content-based filtering (a core part of many types of recommender systems), explain the data science techniques that make it work, dig into its advantages and disadvantages, and walk through a tutorial that will show you how to build personalized recommendations with Redis. You can clone the repo here.
What is content-based filtering?
Content-based filtering is a recommendation technique that uses machine learning to suggest items to users based on the features (i.e., the content) of those items. A recommender system using content-based filtering analyzes item features and user preferences to build a user profile, then matches that profile against new items the user hasn’t seen yet.
Content-based filtering methods break users and items down into metadata. IMDB’s recommender system might, for example, break out movies by genre tags, such as comedy, horror, or romance. It also captures information on user behavior, like the movies you click on or the search terms you’re using right now, building a user profile to keep those recommendations relevant and to support ongoing recommendations.
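To make the profile idea concrete, here is a minimal sketch in plain Python. The catalog, titles, and tags are hypothetical, and a production system would use richer features than simple tag counts.

```python
from collections import Counter

# Hypothetical catalog: each movie is described by a set of metadata tags.
CATALOG = {
    "Movie A": {"comedy", "romance"},
    "Movie B": {"horror"},
    "Movie C": {"comedy", "ensemble cast"},
    "Movie D": {"romance", "drama"},
}


def build_profile(clicked_titles, catalog):
    """Count how often each tag appears among the items a user clicked."""
    profile = Counter()
    for title in clicked_titles:
        profile.update(catalog[title])
    return profile


def recommend(profile, catalog, seen, top_n=2):
    """Rank unseen items by the summed profile weight of their tags."""
    scores = {
        title: sum(profile[tag] for tag in tags)
        for title, tags in catalog.items()
        if title not in seen
    }
    # Highest score first; ties are broken alphabetically by title.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]


clicked = ["Movie A", "Movie C"]
profile = build_profile(clicked, CATALOG)
recommendations = recommend(profile, CATALOG, seen=set(clicked))
```

Here the profile weights "comedy" most heavily because it appears in both clicked movies, and unseen movies are ranked by how much of that profile they share.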
Metadata is the foundation of content-based filtering, but recommender algorithms are where the magic happens.
Many recommender systems rely on a k-Nearest Neighbors (k-NN) model. This machine learning model finds the nearest data points (i.e., neighbors) to a given input and makes predictions based on the properties of those neighbors. In the IMDB example, a k-NN model would know that a given user clicked on a movie listing with “fast-paced,” “ensemble cast,” and “PG rating” and then recommend a new movie listing with similar attributes.
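The neighbor lookup at the core of k-NN can be sketched in a few lines. This example assumes items are described by sets of attributes and uses Jaccard overlap as the similarity measure; both choices, along with the listing titles, are illustrative assumptions rather than how any particular production system works.

```python
def jaccard(a, b):
    """Overlap between two attribute sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_neighbors(query_attrs, items, k=2):
    """Return the k items whose attributes are most similar to the query."""
    scored = [(jaccard(query_attrs, attrs), title) for title, attrs in items.items()]
    # Most similar first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(title, round(score, 2)) for score, title in scored[:k]]


# Hypothetical listings; the attribute names echo the example above.
listings = {
    "Heist Night": {"fast-paced", "ensemble cast", "PG rating"},
    "Quiet Harbor": {"slow-burn", "PG rating"},
    "Road Crew": {"fast-paced", "ensemble cast", "R rating"},
}
clicked = {"fast-paced", "ensemble cast", "PG rating"}
neighbors = nearest_neighbors(clicked, listings, k=2)
```

With dense embedding vectors, the same idea applies, but the similarity measure is typically cosine or Euclidean distance instead of set overlap.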
Of course, this basic data science approach isn’t just for movies:
- An ecommerce clothing retailer might recommend joggers because a user previously bought sweatpants.
- A social media network might recommend posts on a given topic because a user clicked through and engaged with similar posts.
- A data analytics dashboard might recommend a new template because a business user has downloaded similar ones in the past.
Content-based filtering isn’t without its limitations, so it’s often combined with collaborative filtering.
Content-based filtering vs. collaborative filtering
Recommender systems that exclusively use content-based filtering techniques tend to be limited by the scope and quality of available metadata.
When the metadata is limited, even the best algorithm can fall flat. Recommendations might miss the mark, feeling unconvincing or irrelevant. And if the metadata lacks depth, the system will likely serve up overly similar suggestions that leave users bored and uninspired.
For these reasons (among others, which we’ll get into in the next section), companies turn to collaborative filtering — either to replace content-based filtering or to complement it.
Collaborative filtering methods rely on user interactions, such as user ratings, user likes, or purchases, to make recommendations. The recommender system allows users to “collaborate” with each other through explicit feedback (such as ratings and likes) and implicit feedback (such as purchases). It then leverages feedback from other users to make informed recommendations.
For example, if someone rates a movie highly or consistently buys and reviews certain clothing brands, a collaborative filtering system will recommend those movies and brands to similar users.
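For contrast with content-based filtering, here is a minimal user-based collaborative filtering sketch. It uses no item metadata at all, only ratings. The users, ratings, and the choice of cosine similarity over co-rated items are assumptions made for illustration.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale, keyed by user and then by item.
RATINGS = {
    "ana": {"Movie A": 5, "Movie B": 4, "Movie C": 5},
    "ben": {"Movie A": 5, "Movie B": 4},
    "cy": {"Movie A": 1, "Movie B": 2, "Movie D": 5},
}


def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def recommend_from_similar_user(target, ratings, min_rating=4):
    """Find the most similar user and return their well-rated unseen items."""
    others = [name for name in ratings if name != target]
    nearest = max(others, key=lambda name: cosine(ratings[target], ratings[name]))
    picks = sorted(
        item
        for item, rating in ratings[nearest].items()
        if rating >= min_rating and item not in ratings[target]
    )
    return nearest, picks


result = recommend_from_similar_user("ben", RATINGS)
```

Because ben's ratings line up with ana's, the sketch recommends the movie ana rated highly that ben has not rated yet.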
Why content-based filtering works—and where it falls short
Content-based filtering comes with its own set of tradeoffs, but many of these can be balanced out by building a hybrid recommender system that adds collaborative filtering. Either way, if you’re building a recommender system, you need to know how to leverage the advantages and get ahead of the disadvantages.
Advantages
Content-based filtering has numerous advantages that make it a core part of many recommender systems:
- Personalized recommendations tailored to individual user preferences (including likes and dislikes) can help companies create a better user experience that leads to higher conversion rates.
- Recommending items based on their content addresses what’s known as the cold-start problem: a new item can be recommended as soon as it has metadata, and a new user can receive personalized recommendations after interacting with only a few items.
- Item-to-item recommendations (“more like this”) can still happen independent of any particular user data or user behaviors.
Content-based filtering often succeeds or struggles depending on the quality and breadth of your metadata. The richer the metadata, the easier it is to implement content-based filtering and the better the results.
Disadvantages
Content-based filtering has a few limitations, some of which are harder to get around than others.
- Content-based filtering can sometimes create “filter bubbles” (as they’re called on social media), where recommendations feel repetitive or overly narrow because the system focuses too much on the small set of items a user has interacted with.
- Sparse or poor-quality metadata can limit the effectiveness of content-based filtering because the recommendations can only be as good or as diverse as the metadata.
- Unlike collaborative filtering, content-based filtering doesn’t benefit from network effects or social proof.
Content-based filtering can also struggle in domains with complex or unstructured content, such as images, music, or videos. But if you have feature extraction methods, such as computer vision models or vision-capable LLMs, you can turn this disadvantage into an advantage using tools such as RedisVL.
Use cases for content-based filtering
Content-based filtering is widely used across industries because near-infinite options become a burden without personalized filtering and recommendations. That’s why, when you look closely, you can find content-based filtering almost everywhere online. Let’s explore a few use cases:
- Ecommerce platforms: Online retailers (whether on their own platform or a platform like Amazon or Shopify) often use content-based filtering to recommend new products to users based on the products they’ve previously purchased.
- Media streaming services: Streaming services, such as Netflix, benefit from engaging users on their platforms for as long as possible, so these services often use content-based filtering to suggest movies, TV shows, or music based on the genres and artists users have already enjoyed.
- Educational platforms: With the rise of massive open online courses (MOOCs) and the popularity of industry-specific boot camps, such as coding boot camps, content-based filtering has become a useful way for educational platforms to recommend courses and resources based on a user’s learning history and preferences.
- Healthcare applications: There’s a huge variety of treatments, diets, and exercises available online, but not all of them work for all bodies. Healthcare apps use the information they have on user preferences to recommend new exercise regimens or recipes.
In a physical store, business owners have limited space to show customers new products. Online, companies can show many more options as long as they use the right recommender system to help users find the choices that suit them best.
How to build a content-based filtering recommendation engine with Redis
With Redis, building a content-based filtering system is easy. Here, we’ll walk through how to build a movie recommendation system supported by content-based filtering using RedisVL and the IMDB movie dataset. You can clone the repo here.
At a high level, we’ll use RedisVL to generate a semantic embedding vector from each movie’s title, description, and keywords and then store and query vectors with vector similarity search to find semantically similar movies. We’ll then use additional fields, such as genre and release year, to enhance the results.
Set up your environment
Start by importing the needed libraries and defining your Redis URL.
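A minimal sketch of the Redis URL part of this step, using only the standard library. The environment variable names and the local defaults are assumptions; adjust them to match your own Redis deployment.

```python
import os

# Assumed defaults for a local Redis instance; override them through
# environment variables when connecting to a hosted deployment.
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")

# Build a redis:// connection URL from the parts above.
REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
```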
Load and preprocess the dataset
We’re using a dataset of approximately 25,000 movies from IMDB. As with any data task, the first step is to clean our data. This includes filling in missing values, converting certain fields into lists, and removing unnecessary columns.
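The three cleaning steps can be sketched as follows. This version uses plain dictionaries rather than a dataframe library, and the column names and sample rows are hypothetical stand-ins for the real dataset.

```python
# Hypothetical raw rows; the column names are illustrative only.
raw_rows = [
    {"title": "Movie A", "overview": "A heist goes wrong.",
     "genres": "Action, Crime", "rating": "7.1", "poster_path": "a.jpg"},
    {"title": "Movie B", "overview": None,
     "genres": "", "rating": None, "poster_path": "b.jpg"},
]


def clean_row(row, drop=("poster_path",)):
    """Drop unused columns, fill missing values, and split genres into a list."""
    cleaned = {key: value for key, value in row.items() if key not in drop}
    # Fill missing text with an empty string and missing ratings with 0.0.
    cleaned["overview"] = cleaned.get("overview") or ""
    rating = cleaned.get("rating")
    cleaned["rating"] = float(rating) if rating not in (None, "") else 0.0
    # Convert the comma-separated genre string into a list of lowercase tags.
    genres = cleaned.get("genres") or ""
    cleaned["genres"] = [g.strip().lower() for g in genres.split(",") if g.strip()]
    return cleaned


movies = [clean_row(row) for row in raw_rows]
```

Filling missing ratings with 0.0 is only one possible choice; depending on how the field is used later, dropping those rows or using a dataset average may be more appropriate.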
Generate vector embeddings to recommend similar content
The heart of our recommendation system is determining the similarity between movies based on their descriptions. To do this, we use a pre-trained language model from Hugging Face to generate vector embeddings for each movie’s overview and keywords. This step will take a while, but it only needs to be done once for your entire dataset.
If you don’t want to wait, you can skip this cell and load the vectors we’ve pre-generated and saved to a file for you.
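The shape of the embedding step can be illustrated without downloading a model. In the sketch below, `toy_embed` is a deliberately crude stand-in for the pre-trained language model: it hashes words into a small fixed-size vector, so it captures word overlap only, not meaning. The function names, field names, and 16-dimension size are all assumptions for illustration.

```python
import hashlib
import math


def movie_text(movie):
    """Join the fields that describe a movie into one string to embed."""
    return f"{movie['overview']} {' '.join(movie['keywords'])}"


def toy_embed(text, dims=16):
    """Stand-in for a language model: hash each word into a fixed-size,
    unit-length vector. A real model would produce semantic embeddings."""
    vector = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vector[digest[0] % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


# Hypothetical movie record.
movie = {
    "overview": "A submarine crew explores the deep.",
    "keywords": ["submarine", "ocean"],
}
text = movie_text(movie)
embedding = toy_embed(text)
```

Whatever model you use, the output has the same shape as here: one fixed-length list of floats per movie, which is what gets stored and searched in the following steps.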
Define our Redis Search schema
Next, we define a schema for RedisVL to specify the fields that each movie will have, including the vector dimensions, distance metric, and any additional fields like year, genre, or rating. We’ll load this from a yaml file, content_filtering_schema.yaml.
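As a rough illustration of what such a schema contains, here is the same kind of structure written as a Python dictionary, which RedisVL can also accept (for example through `IndexSchema.from_dict`). The index name, prefix, field names, and the 384-dimension vector size are assumptions for this sketch and may not match the tutorial’s YAML file; the vector size must equal the output size of whichever embedding model you use.

```python
# Hypothetical schema; names and values are illustrative assumptions.
SCHEMA = {
    "index": {
        "name": "movies_recommendation",
        "prefix": "movie",
        "storage_type": "hash",
    },
    "fields": [
        {"name": "title", "type": "text"},
        {"name": "genres", "type": "tag"},
        {"name": "year", "type": "numeric"},
        {"name": "rating", "type": "numeric"},
        {
            "name": "vector",
            "type": "vector",
            "attrs": {
                "dims": 384,  # must match the embedding model's output size
                "distance_metric": "cosine",
                "algorithm": "flat",
                "datatype": "float32",
            },
        },
    ],
}


def vector_field(schema):
    """Return the vector field definition from a schema dictionary."""
    return next(field for field in schema["fields"] if field["type"] == "vector")


vector_attrs = vector_field(SCHEMA)["attrs"]
```

The non-vector fields matter as much as the vector one: only fields declared in the schema can later be used as query filters.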
Look for movie recommendations with vector similarity search
Now that our data is stored in Redis, we can use vector similarity search to find movies that are similar to one another. For example, to find movies similar to the classic “20,000 Leagues Under the Sea”, we retrieve its vector embedding and use it to search for similar movies.
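Conceptually, this lookup does the following: fetch the query movie’s vector, compute its distance to every other vector, and return the closest ones. The in-memory sketch below shows that logic with cosine distance. The three-dimensional vectors and every title other than the query are made up for illustration, and Redis performs the equivalent search inside the database rather than in Python.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def similar_movies(title, vectors, k=2):
    """Return the k movies closest to `title`, excluding the movie itself."""
    query = vectors[title]
    scored = [
        (other, cosine_distance(query, vec))
        for other, vec in vectors.items()
        if other != title
    ]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical 3-dimensional embeddings; real ones have hundreds of dimensions.
vectors = {
    "20,000 Leagues Under the Sea": [0.9, 0.1, 0.0],
    "Ocean Voyage": [0.8, 0.2, 0.0],
    "Space Race": [0.1, 0.9, 0.1],
    "Office Comedy": [0.0, 0.1, 0.9],
}
results = similar_movies("20,000 Leagues Under the Sea", vectors, k=2)
```

Excluding the query movie matters in practice, because a movie is always its own nearest neighbor with a distance of zero.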
Here’s what the query results will look like.
Add filters to improve recommendations
In real-world recommendation systems, users often like to apply their own filters—like narrowing down by genre or searching with specific keywords. We can easily expand our system to include filters for these (or any other fields) set up in our schema.
There’s no one-size-fits-all approach to adding these filters because every content recommendation app will have different fields, which is why Redis supports a full host of filter types, including tags, text fuzzy matching, numeric ranges, and geo radius. Try playing around with adding filters to other fields defined in our schema to see how the results change.
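The effect of combining filters with similarity ranking can be sketched in memory. The candidate movies, their fields, and their precomputed distances below are hypothetical, and the three filter types shown (tag match, numeric range, and text match) are simplified stand-ins for the filter expressions Redis evaluates as part of the query.

```python
def matches(movie, genre=None, min_year=None, text=None):
    """Check a movie against optional tag, numeric, and text filters."""
    if genre is not None and genre not in movie["genres"]:
        return False
    if min_year is not None and movie["year"] < min_year:
        return False
    if text is not None and text.lower() not in movie["overview"].lower():
        return False
    return True


def filtered_recommendations(candidates, k=2, **filters):
    """Keep candidates that pass every filter, then take the k closest."""
    kept = [movie for movie in candidates if matches(movie, **filters)]
    return [m["title"] for m in sorted(kept, key=lambda m: m["distance"])[:k]]


# Hypothetical candidates with precomputed vector distances to a query movie.
candidates = [
    {"title": "Ocean Voyage", "genres": ["adventure"], "year": 1954,
     "overview": "A submarine crew explores the deep.", "distance": 0.05},
    {"title": "Deep Dive", "genres": ["adventure", "sci-fi"], "year": 1997,
     "overview": "Divers find a sunken city.", "distance": 0.12},
    {"title": "Reef Patrol", "genres": ["comedy"], "year": 2005,
     "overview": "Lifeguards bumble through summer.", "distance": 0.20},
    {"title": "Trench", "genres": ["adventure"], "year": 2010,
     "overview": "A submarine descends into a trench.", "distance": 0.31},
]

recent_adventure = filtered_recommendations(candidates, genre="adventure", min_year=1990)
submarine_only = filtered_recommendations(candidates, text="submarine")
```

Note how the closest match overall ("Ocean Voyage") drops out of the first result list once the year filter is applied: filters narrow the candidate set, and similarity only ranks what remains.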
Get started with RedisVL
Now you know the basics of content-based filtering, its advantages and disadvantages, a range of use cases for adopting it, and how to build a content-based recommendation system yourself using RedisVL.
With the power of Redis as a vector database, you can generate relevant recommendations that improve the user experience and boost conversion rates (among many other benefits). Whether you’re recommending products, music, movies, or books, the flexibility and performance of RedisVL make it an excellent choice for building scalable recommender systems.