This article was published as a part of the Data Science Blogathon .
Back in 2000, people used to purchase groceries from their local hypermarts. However, in the last 20 years, several online e-commerce stores have been launched. So, instead of going to a physical store, you can visit an online e-commerce store from the comfort of your home. Customers who used to shop from the physical store then started to purchase several products from such e-commerce websites. Also, in 2020, Covid-19 severely impacted the sales of physical hypermarts. This gave the e-commerce demand a boom, and since 2020, more customers have started to shop from such online stores. See the detailed report summary from McKinsey here.
Many products are available in such e-commerce stores ranging from a toothbrush to a Car. If you want to purchase something, just search for it on the internet, and there will be at least one e-commerce website serving that product. So, many new e-commerce players have been launched in recent years (it has become a very competitive space). In such a competitive space, it is quintessential for the e-commerce store to identify the customer preference and keep a customer interested to shop from their website. This is why the recommendation system helps. A straightforward example would be that if you are purchasing bread, you will possibly purchase butter or Milk. This article will show how to build a recommendation system for Bigbasket.
Netflix: According to a recent study by McKinsey, Netflix uses personalized recommendations, and it is responsible for 80 percent of the content streamed. This has allowed Netflix to earn 1 billion dollars in a single year. Netflix doesn’t spend much on marketing but on recommendations to improve customer retention.
Amazon: Similarly, 35% of the Amazon website’s sales come from personalized recommendations. So, even Amazon does not spend much on marketing. However, it is still able to retain customers using personalized recommendations.
Source of the above information.
Popularity Based: Recommends the most popular or most selling products. E.g. Trending videos on Youtube.
Disadvantage: The recommendations are similar for all customers and not personalized.
Content-based: It works on similarities between products. First, we must create a vector representing all the product features. Then, we calculate the similarity between those vectors using methods like:
For example, if a set of users likes action movies by Jackie Chan, this algorithm may recommend the movies having the below characteristics.
Advantage: Quick to implement
Collaborative-based: It works on similarities between users. If there are 2 users.
In the illustration above, there are 2 users with similar taste preferences. Both of them liked pie and protein salad and looked like fitness enthusiasts, so they are similar. Now, the user on the right liked a can of energy drink, so we recommend the same energy drink to the user on the left.
Advantages: It gives personalized recommendations for each user basis their historic preferences.
Disadvantage: The algorithm needs a lot of historical data to train, and the model’s accuracy increases with increased data.
Hybrid: Here, we combine all the above 3 recommender systems.
For example, the Netflix homepage has several strips for recommending content to its subscriber.
Bigbasket is one of the biggest online grocery stores in India. It was launched in 2011, and several competitors have challenged it for market share. However, it can still retain its fair share in the market.
We will be using a publicly available dataset from Kaggle. It can be downloaded from here
A bout this file. T his dataset contains the below 10 columns:
#Basic Libraries import numpy as np import pandas as pd #Visualization Libraries import matplotlib.pyplot as plt import seaborn as sns import plotly.express as px #Text Handling Libraries import re from sklearn.feature_extraction.text CountVectorizer from sklearn.metrics.pairwise import cosine_similarity
“I can’t make bricks without clay.” ― Arthur Conan Doyle.
We need data to make our recommendation system model. So, EDA is crucial because it allows us to understand the data. Around 70-80% of the time is spent in this step. There may be several issues with data. We need to spend time here and get introduced to the data.
Check for Missing data: If the data is null or missing, we cannot use it for data analysis. So, let’s look at the data distribution.
Missing count:
print('Null Data Count In Each Column') print('-'*30) print(df.isnull().sum()) print('-'*30) print('Null Data % In Each Column') print('-'*30) for col in df.columns: null_count = df[col].isnull().sum() total_count = df.shape[0] print("<> : ".format(col,null_count/total_count * 100))
Findings from the EDA:
The above features are important for building a recommendation system. So, we will drop the rows from data that contain missing values.
Python code:
df = df.dropna() print(df.shape)
(18840, 9)Even after dropping nulls, we have a good data size of 18840 records.
df.dtypes
product object category object sub_category object brand object sale_price float64 market_price float64 type object rating float64 description object
Findings from this step:
The features sale_price, market_price, and rating are numeric (as they are represented using float64). The rest are all string features (represented as objects).
Here, we will understand the data distribution of several columns. A look at category column distribution
counts = df['category'].value_counts() count_percentage = df['category'].value_counts(1)*100 counts_df = pd.DataFrame() display(counts_df) px.bar(data_frame=counts_df, x='Category', y='Counts', color='Counts', color_continuous_scale='blues', text_auto=True, title=f'Count of Items in Each Category')
A look at sub_category column distribution
counts = df['sub_category'].value_counts() count_percentage = df['sub_category'].value_counts(1)*100 counts_df = pd.DataFrame() print('unique sub_category values',df['sub_category'].nunique()) print('Top 10 sub_category') display(counts_df.head(10)) print('Bottom 10 sub_category') display(counts_df.tail(10)) px.bar(data_frame=counts_df[:10], x='sub_category', y='Counts', color='Counts', color_continuous_scale='blues', text_auto=True, title=f'Top 10 Bought Sub_Categories') px.bar(data_frame=counts_df[-10:], x='sub_category', y='Counts', color='Counts', color_continuous_scale='blues', text_auto=True, title=f'Bottom 10 Bought Sub_Categories')
A look at brand column distribution
column = 'brand' counts = df[column].value_counts() count_percentage = df[column].value_counts(1)*100 counts_df = pd.DataFrame() print('unique '+str(column)+' values',df['sub_category'].nunique()) print('Top 10 '+str(column)) display(counts_df.head(10)) print('Bottom 10 '+str(column)) display(counts_df.tail(10)) px.bar(data_frame=counts_df.head(10), x=column, y='Counts', color='Counts', color_continuous_scale='blues', text_auto=True, title=f'Top 10 Brand Items based on Item Counts')
A look at type column distribution
column = 'type' counts = df[column].value_counts() count_percentage = df[column].value_counts(1)*100 counts_df = pd.DataFrame() print('unique '+str(column)+' values',df[column].nunique()) print('Top 10 '+str(column)) display(counts_df.head(10)) counts_df[counts_df['Counts']==1].shape px.bar(data_frame=counts_df.head(10), x='type', y='Counts', color='Counts', color_continuous_scale='blues', text_auto=True, title=f'Top 10 Types of Products based on Item Counts')
Ratings Analysis
Since this is a numeric column, Let’s look at the histogram of this column.print(df['rating'].describe()) df['rating'].hist(bins=10)
We can see that histogram is skewed towards the right, which means that most of the products have a higher rating.
How many Ratings are between the interval of 0 to 1, 1 to 2, and so on?pd.cut(df.rating,bins = [0,1,2,3,4,5]).reset_index().groupby(['rating']).size()
rating (0, 1] 387 (1, 2] 335 (2, 3] 1347 (3, 4] 6559 (4, 5] 10212
Here, we can create new features which can improve the recommendations. We have the sale_price and market_price. Let’s create a feature discount%
Formula = [ (market_price – sale_price)/sale_price ] * 100
df['discount'] = (df['market_price']-df['sale_price'])*100/df['market_price'] print(df['discount'].describe()) pd.cut(df.discount,bins = [-1,0,10,20,30,40,50,60,80,90,100]).reset_index().groupby(['discount']).size()
Since this is a numeric column, Let’s look at the histogram of this column.
df['discount'].hist()
Is there any relation between rating and discount?
ax = df.plot.scatter(x='rating',y='discount')
Findings: Looking at the scatter plot, we do not see any association between rating and discount.
The category column contains ‘Kitchen, Garden & Pets.’ We will clean it, so it contains [kitchen, garden, pets]. A similar transformation will be applied to sub_category, type, and brand features.
Now, let’s create one feature (product_classification_features) which appends all the above 4 cleaned columns.
df2 = df.copy() rmv_spc = lambda a:a.strip() get_list = lambda a:list(map(rmv_spc,re.split('& |, |*|n', a))) for col in ['category', 'sub_category', 'type']: df2[col] = df2[col].apply(get_list) def cleaner(x): if isinstance(x, list): return [str.lower(i.replace(" ", "")) for i in x] else: if isinstance(x, str): return str.lower(x.replace(" ", "")) else: return '' for col in ['category', 'sub_category', 'type','brand']: df2[col] = df2[col].apply(cleaner) def couple(x): return ' '.join(x['category']) + ' ' + ' '.join(x['sub_category']) + ' '+x['brand']+' ' +' '.join( x['type']) df2['product_classification_features'] = df2.apply(couple, axis=1)
Now, we have the data ready. Let us build a simple logic for recommendations based on popularity. We will use the recommender feed – type, category, or sub_category. It will return the most popular product or the product having the highest rating.
def recommend_most_popular(col,col_value,top_n=5):return df[df[col]==col_value].sort_values(by=’rating’,ascending = False).head(top_n)[[‘product’,col,’rating’]]
Let’s look at the most popular products for the category = Beauty & Hygiene
recommend_most_popular(col='category',col_value='Beauty & Hygiene')
Let’s look at the most popular products for sub_category = Hair Care
recommend_most_popular(col='sub_category',col_value='Hair Care')
Let’s look at the most popular products for the brand = Amul
recommend_most_popular(col='brand',col_value='Amul')
Let’s look at the most popular products for type = Face Care
recommend_most_popular(col='type',col_value='Face Care')
As the name suggests, these algorithms use the data of the product we want to recommend. E.g., Kids like Toy Story 1 movies. Toy Story is an animated movie created by Pixar studios – so the system can recommend other animated movies by Pixar studios like Toy Story 2. For our, e.g., we will use the product metadata like category, sub_category, type, price, etc. We will extract similar products from the product portfolio based on these features. The feature we created earlier (product_classification_features) will be helpful here.
We will use CountVectorizer to create a feature space.
e.g., You have 3 sentences:
s1 = ‘Ram is a boy
s2 = ‘Ram is good
s3 = ‘Good is that boy.’
So, we first convert them into lowercase
s1 = ‘ram is boy’
s2 = ‘ram is good’
s3 = ‘good is that boy’
So, Vocabulary consists of as unique words.
So, when you use CountVectorizer, it creates 5 features, each representing the count of occurrence of that word.
from sklearn.feature_extraction.text import CountVectorizer from scipy.spatial import distance vectorizer = CountVectorizer(lowercase=True) X = vectorizer.fit_transform([s1,s2,s3]) X.toarray() count_vect_df = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names_out()) count_vect_df
boy | good | is | ram | that | |
---|---|---|---|---|---|
0 | 1 | 0 | 1 | 1 | 0 |
1 | 0 | 1 | 1 | 1 | 0 |
2 | 1 | 1 | 1 | 0 | 1 |
Once we have these vectors, we can use the below similarity metrics for identifying and recommending similar products
1. Euclidean Distance: It measures the straight line distance between the 2 vectors.
For our example, Euclidean Distance looks like the below:
distance.cdist(count_vect_df, count_vect_df, metric='euclidean').round(2)
array([[0. , 1. , 1.73], [1. , 0. , 1.41], [1.73, 1.41, 0. ]])
In the above example, all diagonal entries are 0, which is intuitive – each sentence’s Euclidean Distance is 0. Also, Euclidean Distance between
Manhattan Distance: It measures the absolute differences between the two vectors.
For our example, Manhattan Distance looks like the below:
distance.cdist(count_vect_df, count_vect_df, metric='cityblock').astype('int')
array([[0, 2, 3], [2, 0, 3], [3, 3, 0]])
In the above example, all diagonal entries are 0, which is intuitive – the Manhattan Distance of each sentence with itself is 0. Also, Manhattan Distance between
So, s1 is closer to s2
s2 is closer to s1
s3 is equidistant from s2 and s1
Jaccard Distance: It measures the ratio of common words between 2 sentences to the overall unique words in those 2 sentences.
For s1 and s2, it is calculated as
s1 and s2 = len()/len() = 0.5
For our example, Jaccard Distance looks like the below:
distance.cdist(count_vect_df, count_vect_df, metric='jaccard').astype('int')
array([[0. , 0.5, 0.6], [0.5, 0. , 0.6], [0.6, 0.6, 0. ]])
So, Jaccard’s distance between
So, s1 is closer to s2
s2 is closer to s1
s3 is equidistant from s2 and s1
Cosine Distance (or cosine similarity – we will be using this metric in our article) we will use cosine similarity for identifying and recommending similar products.
Cosine similarity measures the angle between the 2 vectors.
Cosine similarity can give a value between -1 to +1. A value of -1 means that the products are opposite or non-similar and a value of +1 means that the 2 products are the same.
Cosine similarity looks like below:
Cosine similarity is 1- cosine distance
1-distance.cdist(count_vect_df, count_vect_df, metric='cosine').round(2)
array([[1. , 0.67, 0.58], [0.67, 1. , 0.58], [0.58, 0.58, 1. ]])
In the above example, all diagonal entries are 1, which is intuitive – each sentence’s cosine similarity is 1.
Also, cosine similarity between
So, s1 is closer to s2
s2 is closer to s1
s3 is equidistant from s2 and s1
Let’s calculate the cosine similarity of the product_classification_features for all the products.
count = CountVectorizer(stop_words='english') count_matrix = count.fit_transform(df2['product_classification_features']) cosine_sim = cosine_similarity(count_matrix, count_matrix) cosine_sim_df = pd.DataFrame(cosine_sim)
The above matrix contains the cosine similarity of each product vs the rest of the products in catalog. Let’s build a recommender using cosine similarity
def content_recommendation_v1(title): a = df2.copy().reset_index().drop('index',axis=1) index = a[a['product']==title].index[0] top_n_index = list(cosine_sim_df[index].nlargest(10).index) try: top_n_index.remove(index) except: pass similar_df = a.iloc[top_n_index][['product']] similar_df['cosine_similarity'] = cosine_sim_df[index].iloc[top_n_index] return similar_df
Let’s see the recommendations for a few products.
title = 'Water Bottle - Orange' content_recommendation_v1(title)
product | cosine_similarity | |
---|---|---|
109 | Glass Water Bottle – Aquaria Organic Purple | 0.875 |
705 | Glass Water Bottle With Round Base – Transparent… | 0.875 |
1155 | H2O Unbreakable Water Bottle – Pink | 0.875 |
1500 | Water Bottle H2O Purple | 0.875 |
1828 | H2O Unbreakable Water Bottle – Green | 0.875 |
1976 | Regel Tritan Plastic Sports Water Bottle – Black | 0.875 |
2182 | Apsara 1 Water Bottle – Assorted Colour | 0.875 |
2361 | Glass Water Bottle With Round Base – Yellow, B… | 0.875 |
2485 | Trendy Stainless Steel Bottle With Steel Cap -… | 0.875 |
title = 'Dark Chocolate- 55% Rich In Cocoa' content_recommendation_v1(title)
product | cosine_similarity | |
---|---|---|
105 | I Love You Fruit N Nut Chocolate | 1.0 |
1144 | Choco Cracker – Magical Crystal With Milk Choc… | 1.0 |
1718 | Dark Chocolate – Single Origin, India | 1.0 |
2517 | Fruit N Nut, Dark Chocolate- 55% Rich In Cocoa | 1.0 |
3117 | Colombia Classique Black, Single Origin Dark C… | 1.0 |
3167 | Dark Chocolate | 1.0 |
4013 | Almondo – Roasted Almonds Coated With Milk Cho… | 1.0 |
4358 | Sugar-Free Dark Chocolate | 1.0 |
5867 | Milk Compound Slab – MCO-11 | 1.0 |
Let us tweak the algorithm. Let us also use the product column to create another cosine similarity. We will take the average of both the cosine similarity and see if the results are better.
count2 = CountVectorizer(stop_words='english',lowercase=True) count_matrix2 = count2.fit_transform(df2['product']) cosine_sim2 = cosine_similarity(count_matrix2, count_matrix2) cosine_sim_df2 = pd.DataFrame(cosine_sim2) def content_recommendation_v2(title): a = df2.copy().reset_index().drop('index',axis=1) index = a[a['product']==title].index[0] similar_basis_metric_1 = cosine_sim_df[cosine_sim_df[index]>0][index].reset_index().rename(columns=) similar_basis_metric_2 = cosine_sim_df2[cosine_sim_df2[index]>0][index].reset_index().rename(columns=) similar_df = similar_basis_metric_1.merge(similar_basis_metric_2,how='left').merge(a[['product']].reset_index(),how='left') similar_df['sim'] = similar_df[['sim_1','sim_2']].fillna(0).mean(axis=1) similar_df = similar_df[similar_df['index']!=index].sort_values(by='sim',ascending=False) return similar_df[['product','sim']].head(10) title = 'Water Bottle - Orange' content_recommendation_v2(title)
product | sim | |
---|---|---|
669 | Sante Infuser Water Bottle – Orange | 0.824798 |
2024 | H2o Unbreakable Water Bottle – Orange | 0.824798 |
2565 | Swat Pet Water Bottle – Orange | 0.824798 |
1912 | Glass Water Bottle – Circo Orange & Lemon | 0.791053 |
2084 | Spray Glass water Bottle With Cork – Orange | 0.791053 |
1924 | Sip-It-Plastic Water Bottle | 0.726175 |
1997 | Water Bottle – Twisty, Pink | 0.726175 |
1290 | Plastic Water Bottle – Pink | 0.726175 |
195 | Water Bottle H2O Purple | 0.726175 |
1863 | Water Bottle – Apsara 1 Assorted Colour | 0.695699 |
Findings: The results are much better for this. The orange color bottles have a higher similarity.
title = 'Dark Chocolate- 55% Rich In Cocoa' content_recommendation_v2(title)
product | sim | |
---|---|---|
437 | Fruit N Nut, Dark Chocolate- 55% Rich In Cocoa | 0.922577 |
2340 | Sugar-Free Dark Chocolate- 55% Rich In Cocoa | 0.922577 |
2717 | Rich Cocoa Dark Chocolate Bar | 0.837500 |
561 | Dark Chocolate | 0.816228 |
504 | Dlite Rich Cocoa Dark Chocolate Bar | 0.802648 |
3137 | Bitter Chocolate- 75% Rich In Cocoa | 0.800000 |
2544 | Peru Dark Amazon, Single Origin Dark Chocolate… | 0.782843 |
3148 | Bournville Rich Cocoa 70% Dark Chocolate Bar | 0.775562 |
548 | Colombia Classique Black, Single Origin Dark C… | 0.769680 |
2941 | Tanzania Chocolat Noir, Single Origin Dark Cho… | 0.769680 |
Findings: Even here, we see Dark chocolate products having higher similarities.
Let us run the recommendation for a few more products.title = 'Nacho Round Chips' content_recommendation_v2(title)
product | sim | |
---|---|---|
3766 | Nacho Chips – Crunchy Pizza | 0.788675 |
718 | Nacho Chips – Jalapeno | 0.770833 |
3630 | Nacho Chips – Cheese | 0.770833 |
1680 | Nacho Chips – Salsa | 0.770833 |
464 | Nacho Chips – Peri Peri | 0.735702 |
3140 | Nacho Chips – Sweet Chilli | 0.726175 |
3912 | Nacho Chips – Roasted Masala | 0.726175 |
4478 | Nacho Chips – Jalapeno, No Onion, No Garlic | 0.695699 |
1233 | Nacho Chips – Peri Peri | 0.673202 |
86 | Nacho Chips – Cheese With Herbs, No Onion, No … | 0.673202 |
title = 'Chewy Mints - Lemon' content_recommendation_v2(title)
product | sim | |
---|---|---|
452 | Chewy Candy Stick – Strawberry Flavour | 0.557671 |
324 | Orbit Sugar-Free Chewing Gum – Lemon & Lime | 0.537680 |
1558 | Chewy Mints – Spearmint Flavour Candy | 0.525460 |
711 | Chewing Gum – Peppermint | 0.500000 |
2676 | Chewing Gum – Peppermint | 0.500000 |
1734 | Chewing Gum – Peppermint | 0.500000 |
2842 | Orange Chewy Dragees | 0.433928 |
877 | Sugarfree Strawberry | 0.428571 |
1063 | White Xylitol Sugarfree Spearmint Flavour Chew… | 0.428571 |
1543 | Mint – Sugarfree, Peppermint Flavour | 0.428571 |
title = 'Veggie - Fingers' content_recommendation_v2(title)
product | sim | |
---|---|---|
1186 | Veggie Fingers – Veggie Delight with Corn, Car… | 0.853553 |
980 | Veggie – Nuggets | 0.750000 |
814 | Veggie Burger Patty | 0.704124 |
2639 | Veggie – Nuggets | 0.687500 |
331 | Veggie Stix | 0.687500 |
2656 | Veggie Pizza Pocket | 0.641624 |
840 | Veggie Burger Patty | 0.641624 |
1231 | Crispy Veggie Burger Patty | 0.614277 |
1082 | Chicken Fingers – Garlic | 0.557678 |
1121 | Quick Snack – Fish Fingers | 0.530330 |
The recommendation results look relevant.
Kudos on building your content recommendation system using NLP.
Companies like Amazon and Netflix can drive a lot of revenue using a recommendation system. They can engage and retain existing customers by recommending relevant content to them. In this article, we saw different types of recommendation systems. We then used a publicly available dataset, did a thorough EDA, and developed a content-based recommendation system. There are several metrics we can use to identify similarities between products. We used cosine similarity for our recommendation system.
Key takeaways:
I hope you liked this guide on building content recommendations. Share your feedback with me in the comments section below.
Feel free to connect with me on LinkedIn if you want to discuss this with me.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Aakash93 06 Sep, 2022Data Scientist with extensive experience in solving many real world business problems across different domains. Possess fine blend of business knowledge, maths/stats and technology/programming. Experienced in handling client facing roles, stakeholder management, effective communication with presentation & negotiation skills.