Building a Content-Based Recommendation System

This article was published as a part of the Data Science Blogathon .

Introduction

Back in 2000, people used to purchase groceries from their local hypermarts. However, in the last 20 years, several online e-commerce stores have been launched. So, instead of going to a physical store, you can visit an online e-commerce store from the comfort of your home. Customers who used to shop from the physical store then started to purchase several products from such e-commerce websites. Also, in 2020, Covid-19 severely impacted the sales of physical hypermarts. This gave the e-commerce demand a boom, and since 2020, more customers have started to shop from such online stores. See the detailed report summary from McKinsey here.

Many products are available in such e-commerce stores ranging from a toothbrush to a Car. If you want to purchase something, just search for it on the internet, and there will be at least one e-commerce website serving that product. So, many new e-commerce players have been launched in recent years (it has become a very competitive space). In such a competitive space, it is quintessential for the e-commerce store to identify the customer preference and keep a customer interested to shop from their website. This is why the recommendation system helps. A straightforward example would be that if you are purchasing bread, you will possibly purchase butter or Milk. This article will show how to build a recommendation system for Bigbasket.

Importance of Recommendation Systems

Netflix: According to a recent study by McKinsey, Netflix uses personalized recommendations, and it is responsible for 80 percent of the content streamed. This has allowed Netflix to earn 1 billion dollars in a single year. Netflix doesn’t spend much on marketing but on recommendations to improve customer retention.

Amazon: Similarly, 35% of the Amazon website’s sales come from personalized recommendations. So, even Amazon does not spend much on marketing. However, it is still able to retain customers using personalized recommendations.

Source of the above information.

Type of Recommendation System

Popularity Based: Recommends the most popular or most selling products. E.g. Trending videos on Youtube.

Disadvantage: The recommendations are similar for all customers and not personalized.

Content-based: It works on similarities between products. First, we must create a vector representing all the product features. Then, we calculate the similarity between those vectors using methods like:

  1. Euclidean Distance
  2. Manhattan Distance
  3. Jaccard Distance
  4. Cosine Distance (or cosine similarity – we will be using this metric in our article)

For example, if a set of users likes action movies by Jackie Chan, this algorithm may recommend the movies having the below characteristics.

Advantage: Quick to implement

Collaborative-based: It works on similarities between users. If there are 2 users.

Recommendation System

In the illustration above, there are 2 users with similar taste preferences. Both of them liked pie and protein salad and looked like fitness enthusiasts, so they are similar. Now, the user on the right liked a can of energy drink, so we recommend the same energy drink to the user on the left.

Advantages: It gives personalized recommendations for each user basis their historic preferences.

Disadvantage: The algorithm needs a lot of historical data to train, and the model’s accuracy increases with increased data.

Hybrid: Here, we combine all the above 3 recommender systems.

netflix Recommendation System

For example, the Netflix homepage has several strips for recommending content to its subscriber.

What is Bigbasket?

big basket

Bigbasket is one of the biggest online grocery stores in India. It was launched in 2011, and several competitors have challenged it for market share. However, it can still retain its fair share in the market.

Understand the Data for Recommendation System

We will be using a publicly available dataset from Kaggle. It can be downloaded from here

A bout this file. T his dataset contains the below 10 columns:

  1. index – the serial number
  2. product – Title (or name) of the product
  3. category – Category of the product
  4. sub_category – Subcategory of the product
  5. brand – Brand of the product
  6. sale_price – Price at which product is being sold on the site
  7. market_price – The market price of the product
  8. type – Type into which product falls
  9. rating – aggregate product rating (out of 5) by customers
  10. description – Description of the product

Import Relevant Libraries

#Basic Libraries import numpy as np import pandas as pd #Visualization Libraries import matplotlib.pyplot as plt import seaborn as sns import plotly.express as px #Text Handling Libraries import re from sklearn.feature_extraction.text CountVectorizer from sklearn.metrics.pairwise import cosine_similarity

Recommendation System

EDA (Exploratory Data Analysis) and Data Cleaning

“I can’t make bricks without clay.” ― Arthur Conan Doyle.

We need data to make our recommendation system model. So, EDA is crucial because it allows us to understand the data. Around 70-80% of the time is spent in this step. There may be several issues with data. We need to spend time here and get introduced to the data.

Check for Missing data: If the data is null or missing, we cannot use it for data analysis. So, let’s look at the data distribution.
Missing count:

print('Null Data Count In Each Column') print('-'*30) print(df.isnull().sum()) print('-'*30) print('Null Data % In Each Column') print('-'*30) for col in df.columns: null_count = df[col].isnull().sum() total_count = df.shape[0] print("<> : ".format(col,null_count/total_count * 100))

Recommendation System OUTPUT

Findings from the EDA:

The above features are important for building a recommendation system. So, we will drop the rows from data that contain missing values.
Python code:

df = df.dropna() print(df.shape)
(18840, 9)
Even after dropping nulls, we have a good data size of 18840 records.

Understanding Data Types of Columns

df.dtypes
product object category object sub_category object brand object sale_price float64 market_price float64 type object rating float64 description object

Findings from this step:
The features sale_price, market_price, and rating are numeric (as they are represented using float64). The rest are all string features (represented as objects).

Univariate Analysis in Recommendation System

Here, we will understand the data distribution of several columns. A look at category column distribution

counts = df['category'].value_counts() count_percentage = df['category'].value_counts(1)*100 counts_df = pd.DataFrame() display(counts_df) px.bar(data_frame=counts_df, x='Category', y='Counts', color='Counts', color_continuous_scale='blues', text_auto=True, title=f'Count of Items in Each Category')

univariate analysis Count of item in each category

A look at sub_category column distribution

counts = df['sub_category'].value_counts() count_percentage = df['sub_category'].value_counts(1)*100 counts_df = pd.DataFrame() print('unique sub_category values',df['sub_category'].nunique()) print('Top 10 sub_category') display(counts_df.head(10)) print('Bottom 10 sub_category') display(counts_df.tail(10)) px.bar(data_frame=counts_df[:10], x='sub_category', y='Counts', color='Counts', color_continuous_scale='blues', text_auto=True, title=f'Top 10 Bought Sub_Categories') px.bar(data_frame=counts_df[-10:], x='sub_category', y='Counts', color='Counts', color_continuous_scale='blues', text_auto=True, title=f'Bottom 10 Bought Sub_Categories')

univariate analysis top 10 sub categories univariate analysis

A look at brand column distribution

column = 'brand' counts = df[column].value_counts() count_percentage = df[column].value_counts(1)*100 counts_df = pd.DataFrame() print('unique '+str(column)+' values',df['sub_category'].nunique()) print('Top 10 '+str(column)) display(counts_df.head(10)) print('Bottom 10 '+str(column)) display(counts_df.tail(10)) px.bar(data_frame=counts_df.head(10), x=column, y='Counts', color='Counts', color_continuous_scale='blues', text_auto=True, title=f'Top 10 Brand Items based on Item Counts')

Euclidean distance formula univariate analysis | Recommendation System Recommendation System

A look at type column distribution

column = 'type' counts = df[column].value_counts() count_percentage = df[column].value_counts(1)*100 counts_df = pd.DataFrame() print('unique '+str(column)+' values',df[column].nunique()) print('Top 10 '+str(column)) display(counts_df.head(10)) counts_df[counts_df['Counts']==1].shape px.bar(data_frame=counts_df.head(10), x='type', y='Counts', color='Counts', color_continuous_scale='blues', text_auto=True, title=f'Top 10 Types of Products based on Item Counts')

column distribution column distribution | Recommendation System

Ratings Analysis

Since this is a numeric column, Let’s look at the histogram of this column.
print(df['rating'].describe()) df['rating'].hist(bins=10)

rating analysis

We can see that histogram is skewed towards the right, which means that most of the products have a higher rating.

How many Ratings are between the interval of 0 to 1, 1 to 2, and so on?
pd.cut(df.rating,bins = [0,1,2,3,4,5]).reset_index().groupby(['rating']).size()
rating (0, 1] 387 (1, 2] 335 (2, 3] 1347 (3, 4] 6559 (4, 5] 10212

Feature Engineering for Recommendation System

Here, we can create new features which can improve the recommendations. We have the sale_price and market_price. Let’s create a feature discount%
Formula = [ (market_price – sale_price)/sale_price ] * 100

df['discount'] = (df['market_price']-df['sale_price'])*100/df['market_price'] print(df['discount'].describe()) pd.cut(df.discount,bins = [-1,0,10,20,30,40,50,60,80,90,100]).reset_index().groupby(['discount']).size()

Feature Engineering | Recommendation System

Since this is a numeric column, Let’s look at the histogram of this column.

df['discount'].hist()

numeric column

Bivariate Analysis

Is there any relation between rating and discount?

ax = df.plot.scatter(x='rating',y='discount')

Bivariate analysis

Findings: Looking at the scatter plot, we do not see any association between rating and discount.

Let’s Clean a Few String Columns

The category column contains ‘Kitchen, Garden & Pets.’ We will clean it, so it contains [kitchen, garden, pets]. A similar transformation will be applied to sub_category, type, and brand features.
Now, let’s create one feature (product_classification_features) which appends all the above 4 cleaned columns.

df2 = df.copy() rmv_spc = lambda a:a.strip() get_list = lambda a:list(map(rmv_spc,re.split('& |, |*|n', a))) for col in ['category', 'sub_category', 'type']: df2[col] = df2[col].apply(get_list) def cleaner(x): if isinstance(x, list): return [str.lower(i.replace(" ", "")) for i in x] else: if isinstance(x, str): return str.lower(x.replace(" ", "")) else: return '' for col in ['category', 'sub_category', 'type','brand']: df2[col] = df2[col].apply(cleaner) def couple(x): return ' '.join(x['category']) + ' ' + ' '.join(x['sub_category']) + ' '+x['brand']+' ' +' '.join( x['type']) df2['product_classification_features'] = df2.apply(couple, axis=1)

Now, we have the data ready. Let us build a simple logic for recommendations based on popularity. We will use the recommender feed – type, category, or sub_category. It will return the most popular product or the product having the highest rating.

def recommend_most_popular(col,col_value,top_n=5):

return df[df[col]==col_value].sort_values(by=’rating’,ascending = False).head(top_n)[[‘product’,col,’rating’]]

Let’s look at the most popular products for the category = Beauty & Hygiene

recommend_most_popular(col='category',col_value='Beauty & Hygiene')

Recommendation System

Let’s look at the most popular products for sub_category = Hair Care

recommend_most_popular(col='sub_category',col_value='Hair Care')

sub category | bivariate analysis

Let’s look at the most popular products for the brand = Amul

recommend_most_popular(col='brand',col_value='Amul')

amul bivariate analysis

Let’s look at the most popular products for type = Face Care

recommend_most_popular(col='type',col_value='Face Care')

Recommendation System

Let’s Build a Content-based Recommendation System

As the name suggests, these algorithms use the data of the product we want to recommend. E.g., Kids like Toy Story 1 movies. Toy Story is an animated movie created by Pixar studios – so the system can recommend other animated movies by Pixar studios like Toy Story 2. For our, e.g., we will use the product metadata like category, sub_category, type, price, etc. We will extract similar products from the product portfolio based on these features. The feature we created earlier (product_classification_features) will be helpful here.
We will use CountVectorizer to create a feature space.

e.g., You have 3 sentences:
s1 = ‘Ram is a boy
s2 = ‘Ram is good
s3 = ‘Good is that boy.’
So, we first convert them into lowercase
s1 = ‘ram is boy’
s2 = ‘ram is good’
s3 = ‘good is that boy’
So, Vocabulary consists of as unique words.
So, when you use CountVectorizer, it creates 5 features, each representing the count of occurrence of that word.

from sklearn.feature_extraction.text import CountVectorizer from scipy.spatial import distance vectorizer = CountVectorizer(lowercase=True) X = vectorizer.fit_transform([s1,s2,s3]) X.toarray() count_vect_df = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names_out()) count_vect_df
boy good is ram that
0 1 0 1 1 0
1 0 1 1 1 0
2 1 1 1 0 1

Once we have these vectors, we can use the below similarity metrics for identifying and recommending similar products

1. Euclidean Distance: It measures the straight line distance between the 2 vectors.

EUCLIDEAN DISTANCE

For our example, Euclidean Distance looks like the below:

distance.cdist(count_vect_df, count_vect_df, metric='euclidean').round(2)
array([[0. , 1. , 1.73], [1. , 0. , 1.41], [1.73, 1.41, 0. ]])

In the above example, all diagonal entries are 0, which is intuitive – each sentence’s Euclidean Distance is 0. Also, Euclidean Distance between

So, s1 is closer to s2 s2 is closer to s1 s3 is closer to s2

Manhattan Distance: It measures the absolute differences between the two vectors.

Manhattan distance

For our example, Manhattan Distance looks like the below:

distance.cdist(count_vect_df, count_vect_df, metric='cityblock').astype('int')
array([[0, 2, 3], [2, 0, 3], [3, 3, 0]])

In the above example, all diagonal entries are 0, which is intuitive – the Manhattan Distance of each sentence with itself is 0. Also, Manhattan Distance between

So, s1 is closer to s2
s2 is closer to s1
s3 is equidistant from s2 and s1

Jaccard Distance: It measures the ratio of common words between 2 sentences to the overall unique words in those 2 sentences.

For s1 and s2, it is calculated as
s1 and s2 = len()/len() = 0.5
For our example, Jaccard Distance looks like the below:

distance.cdist(count_vect_df, count_vect_df, metric='jaccard').astype('int')
array([[0. , 0.5, 0.6], [0.5, 0. , 0.6], [0.6, 0.6, 0. ]])

So, Jaccard’s distance between

So, s1 is closer to s2

s2 is closer to s1

s3 is equidistant from s2 and s1

Cosine Distance (or cosine similarity – we will be using this metric in our article) we will use cosine similarity for identifying and recommending similar products.
Cosine similarity measures the angle between the 2 vectors.
Cosine similarity can give a value between -1 to +1. A value of -1 means that the products are opposite or non-similar and a value of +1 means that the 2 products are the same.
Cosine similarity looks like below:

Cosine similarity is 1- cosine distance

1-distance.cdist(count_vect_df, count_vect_df, metric='cosine').round(2)
array([[1. , 0.67, 0.58], [0.67, 1. , 0.58], [0.58, 0.58, 1. ]])

In the above example, all diagonal entries are 1, which is intuitive – each sentence’s cosine similarity is 1.

Also, cosine similarity between

So, s1 is closer to s2

s2 is closer to s1

s3 is equidistant from s2 and s1

Let’s calculate the cosine similarity of the product_classification_features for all the products.

count = CountVectorizer(stop_words='english') count_matrix = count.fit_transform(df2['product_classification_features']) cosine_sim = cosine_similarity(count_matrix, count_matrix) cosine_sim_df = pd.DataFrame(cosine_sim)

The above matrix contains the cosine similarity of each product vs the rest of the products in catalog. Let’s build a recommender using cosine similarity

def content_recommendation_v1(title): a = df2.copy().reset_index().drop('index',axis=1) index = a[a['product']==title].index[0] top_n_index = list(cosine_sim_df[index].nlargest(10).index) try: top_n_index.remove(index) except: pass similar_df = a.iloc[top_n_index][['product']] similar_df['cosine_similarity'] = cosine_sim_df[index].iloc[top_n_index] return similar_df

Let’s see the recommendations for a few products.

title = 'Water Bottle - Orange' content_recommendation_v1(title)
product cosine_similarity
109 Glass Water Bottle – Aquaria Organic Purple 0.875
705 Glass Water Bottle With Round Base – Transparent… 0.875
1155 H2O Unbreakable Water Bottle – Pink 0.875
1500 Water Bottle H2O Purple 0.875
1828 H2O Unbreakable Water Bottle – Green 0.875
1976 Regel Tritan Plastic Sports Water Bottle – Black 0.875
2182 Apsara 1 Water Bottle – Assorted Colour 0.875
2361 Glass Water Bottle With Round Base – Yellow, B… 0.875
2485 Trendy Stainless Steel Bottle With Steel Cap -… 0.875
title = 'Dark Chocolate- 55% Rich In Cocoa' content_recommendation_v1(title)
product cosine_similarity
105 I Love You Fruit N Nut Chocolate 1.0
1144 Choco Cracker – Magical Crystal With Milk Choc… 1.0
1718 Dark Chocolate – Single Origin, India 1.0
2517 Fruit N Nut, Dark Chocolate- 55% Rich In Cocoa 1.0
3117 Colombia Classique Black, Single Origin Dark C… 1.0
3167 Dark Chocolate 1.0
4013 Almondo – Roasted Almonds Coated With Milk Cho… 1.0
4358 Sugar-Free Dark Chocolate 1.0
5867 Milk Compound Slab – MCO-11 1.0

Improving the Model

Let us tweak the algorithm. Let us also use the product column to create another cosine similarity. We will take the average of both the cosine similarity and see if the results are better.

count2 = CountVectorizer(stop_words='english',lowercase=True) count_matrix2 = count2.fit_transform(df2['product']) cosine_sim2 = cosine_similarity(count_matrix2, count_matrix2) cosine_sim_df2 = pd.DataFrame(cosine_sim2) def content_recommendation_v2(title): a = df2.copy().reset_index().drop('index',axis=1) index = a[a['product']==title].index[0] similar_basis_metric_1 = cosine_sim_df[cosine_sim_df[index]>0][index].reset_index().rename(columns=) similar_basis_metric_2 = cosine_sim_df2[cosine_sim_df2[index]>0][index].reset_index().rename(columns=) similar_df = similar_basis_metric_1.merge(similar_basis_metric_2,how='left').merge(a[['product']].reset_index(),how='left') similar_df['sim'] = similar_df[['sim_1','sim_2']].fillna(0).mean(axis=1) similar_df = similar_df[similar_df['index']!=index].sort_values(by='sim',ascending=False) return similar_df[['product','sim']].head(10) title = 'Water Bottle - Orange' content_recommendation_v2(title)
product sim
669 Sante Infuser Water Bottle – Orange 0.824798
2024 H2o Unbreakable Water Bottle – Orange 0.824798
2565 Swat Pet Water Bottle – Orange 0.824798
1912 Glass Water Bottle – Circo Orange & Lemon 0.791053
2084 Spray Glass water Bottle With Cork – Orange 0.791053
1924 Sip-It-Plastic Water Bottle 0.726175
1997 Water Bottle – Twisty, Pink 0.726175
1290 Plastic Water Bottle – Pink 0.726175
195 Water Bottle H2O Purple 0.726175
1863 Water Bottle – Apsara 1 Assorted Colour 0.695699

Findings: The results are much better for this. The orange color bottles have a higher similarity.

title = 'Dark Chocolate- 55% Rich In Cocoa' content_recommendation_v2(title)
product sim
437 Fruit N Nut, Dark Chocolate- 55% Rich In Cocoa 0.922577
2340 Sugar-Free Dark Chocolate- 55% Rich In Cocoa 0.922577
2717 Rich Cocoa Dark Chocolate Bar 0.837500
561 Dark Chocolate 0.816228
504 Dlite Rich Cocoa Dark Chocolate Bar 0.802648
3137 Bitter Chocolate- 75% Rich In Cocoa 0.800000
2544 Peru Dark Amazon, Single Origin Dark Chocolate… 0.782843
3148 Bournville Rich Cocoa 70% Dark Chocolate Bar 0.775562
548 Colombia Classique Black, Single Origin Dark C… 0.769680
2941 Tanzania Chocolat Noir, Single Origin Dark Cho… 0.769680

Findings: Even here, we see Dark chocolate products having higher similarities.

Let us run the recommendation for a few more products.
title = 'Nacho Round Chips' content_recommendation_v2(title)
product sim
3766 Nacho Chips – Crunchy Pizza 0.788675
718 Nacho Chips – Jalapeno 0.770833
3630 Nacho Chips – Cheese 0.770833
1680 Nacho Chips – Salsa 0.770833
464 Nacho Chips – Peri Peri 0.735702
3140 Nacho Chips – Sweet Chilli 0.726175
3912 Nacho Chips – Roasted Masala 0.726175
4478 Nacho Chips – Jalapeno, No Onion, No Garlic 0.695699
1233 Nacho Chips – Peri Peri 0.673202
86 Nacho Chips – Cheese With Herbs, No Onion, No … 0.673202
title = 'Chewy Mints - Lemon' content_recommendation_v2(title)
product sim
452 Chewy Candy Stick – Strawberry Flavour 0.557671
324 Orbit Sugar-Free Chewing Gum – Lemon & Lime 0.537680
1558 Chewy Mints – Spearmint Flavour Candy 0.525460
711 Chewing Gum – Peppermint 0.500000
2676 Chewing Gum – Peppermint 0.500000
1734 Chewing Gum – Peppermint 0.500000
2842 Orange Chewy Dragees 0.433928
877 Sugarfree Strawberry 0.428571
1063 White Xylitol Sugarfree Spearmint Flavour Chew… 0.428571
1543 Mint – Sugarfree, Peppermint Flavour 0.428571
title = 'Veggie - Fingers' content_recommendation_v2(title)
product sim
1186 Veggie Fingers – Veggie Delight with Corn, Car… 0.853553
980 Veggie – Nuggets 0.750000
814 Veggie Burger Patty 0.704124
2639 Veggie – Nuggets 0.687500
331 Veggie Stix 0.687500
2656 Veggie Pizza Pocket 0.641624
840 Veggie Burger Patty 0.641624
1231 Crispy Veggie Burger Patty 0.614277
1082 Chicken Fingers – Garlic 0.557678
1121 Quick Snack – Fish Fingers 0.530330

The recommendation results look relevant.

Kudos on building your content recommendation system using NLP.

Conclusion

Companies like Amazon and Netflix can drive a lot of revenue using a recommendation system. They can engage and retain existing customers by recommending relevant content to them. In this article, we saw different types of recommendation systems. We then used a publicly available dataset, did a thorough EDA, and developed a content-based recommendation system. There are several metrics we can use to identify similarities between products. We used cosine similarity for our recommendation system.

Key takeaways:

I hope you liked this guide on building content recommendations. Share your feedback with me in the comments section below.

Feel free to connect with me on LinkedIn if you want to discuss this with me.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Aakash93 06 Sep, 2022

Data Scientist with extensive experience in solving many real world business problems across different domains. Possess fine blend of business knowledge, maths/stats and technology/programming. Experienced in handling client facing roles, stakeholder management, effective communication with presentation & negotiation skills.