Children Books Recommendation System

source: shorturl.at/xyW02

Problem Statement:

Build a recommendation system for children's books sold on Amazon, using a dataset scraped by Modhi Almannaa (see the link above).

This dataset was scraped from Amazon, the largest online selling platform. Amazon provides all the information a customer needs to review before buying a product, and this dataset focuses only on children's books.

Recommendation system algorithms:

Content-based recommendation systems recommend items to a user based on item similarity. This kind of recommender suggests products or items using their descriptions or features: it identifies how similar products are from their descriptions, and it can also take the user's previous history into account in order to recommend similar products.

Pros and Cons of content-based recommendation techniques

Pros:

  1. No cold-start problem: unlike collaborative filtering, if the items have sufficient descriptions, we avoid the "new item problem".
  2. Able to recommend to users with unique tastes.
  3. Able to recommend new and unpopular items (no first-rater problem).

Cons:

  1. Content-based systems tend toward over-specialization: they recommend items similar to those already consumed, with a tendency to create a "filter bubble".
  2. They never recommend items outside the user's content profile, even though people may have multiple interests.

This filtration strategy is based on the user's behavior, compared and contrasted with the behavior of other users in the database. The history of all users plays an important role in this algorithm. The main difference between content-based filtering and collaborative filtering is that in the latter, the interactions of all users with the items influence the recommendation algorithm, while content-based filtering takes only the concerned user's data into account.

Pros and Cons of collaborative filtering techniques

Pros:

  1. Collaborative filtering systems are powered by the people in the system, and people are expected to be better at evaluating information than a computed function.

Cons:

  1. Cold start: a major challenge of the collaborative filtering technique is how to make recommendations for a new user who has recently entered the system; this is called the cold-start user problem.
  2. Sparsity:
  • The user/rating matrix is sparse.
  • It is hard to find users who have rated the same items.
  3. First rater:
  • Cannot recommend an item that has not been previously rated.
  • New items, esoteric items.
  4. Popularity bias:
  • Cannot recommend items to someone with a unique taste.
  • Tends to recommend popular items.
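Although collaborative filtering isn't used in this article (we have no user histories), a minimal user-based sketch on a toy rating matrix shows the idea: score an unseen item for a user by averaging other users' ratings of it, weighted by user-user similarity. All names and numbers here are illustrative.

```python
import numpy as np

# Toy user-item rating matrix (rows = users, columns = books); 0 = not rated.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user 0 has not rated book 2
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Predict user 0's rating for book 2 from users 1 and 2,
# weighted by how similar their rating vectors are to user 0's.
sims = [cosine_sim(ratings[0], ratings[u]) for u in (1, 2)]
rated = [ratings[1, 2], ratings[2, 2]]
pred = sum(s * r for s, r in zip(sims, rated)) / sum(sims)
print(round(pred, 2))  # prints 1.73: user 1 is far more similar, so their low rating dominates
```

Note how the prediction stays low even though user 2 loved the book: the weighting by similarity is what makes the method "collaborative".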

As mentioned above, we are using Amazon.com data and don't have users' reading histories. Hence, we use a simple content-based recommendation system. We are going to build two recommendation systems, one using the book title and one using the book description.

We need to find books similar to a given book and then recommend those similar books to the user. How do we decide whether two books are similar or dissimilar? A similarity measure is used for this.
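The measure used later in this article is cosine similarity over TF-IDF vectors. A tiny sketch with made-up titles shows what it returns: 1.0 for identical texts, 0 for texts sharing no terms, and something in between otherwise.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up titles, just to show the shape of a similarity matrix.
titles = [
    'goodnight moon',
    'goodnight goodnight construction site',
    'the very hungry caterpillar',
]
vectors = TfidfVectorizer().fit_transform(titles)
sim = cosine_similarity(vectors)

# Each title is identical to itself (diagonal = 1.0); the two "goodnight"
# titles overlap, and the third shares no terms with the first (score 0).
print(sim.round(2))
```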

source: shorturl.at/ivGRY

The Data

This data contains information about different children's books:

The Description feature for the first book looks like this:

'\n#1\xa0NEW YORK TIMES\xa0BESTSELLER • A modern, sophisticated suspense novel from National Book Award finalist, and Printz Award honoree E. Lockhart.A beautiful and distinguished family.A private island.A brilliant, damaged girl; a passionate, political boy.A group of four friends—the Liars—whose friendship turns destructive.A revolution. An accident. A secret.Lies upon lies.True love.The truth.Read it.And if anyone asks you how it ends, just LIE."Thrilling, beautiful, and blisteringly smart, We Were Liars is utterly unforgettable." —John Green, #1 New York Times bestselling author of The Fault in Our Stars\n\n'

The Product_Details feature for the first book looks like this:

'\n\n\n\nProduct details\n\n\n\n\n\nPublisher\n:\n\nEmber; Illustrated edition (May 29, 2018)\n\n\nLanguage\n:\n\nEnglish\n\n\nPaperback\n:\n\n320 pages\n\n\nISBN-10\n:\n\n0385741278\n\n\nISBN-13\n:\n\n978-0385741279\n\n\nReading age\n:\n\n12 - 17 years\n\n\nLexile measure\n:\n\n600L\n\n\nGrade level\n:\n\n7 - 9\n\n\nItem Weight\n:\n\n12 ounces\n\n\nDimensions\n:\n\n5.49 x 0.84 x 8.24 inches\n\n\n\n\n\n\nBest Sellers Rank:\n\n#13 in Books (See Top 100 in Books)\n\n #1 in Teen & Young Adult Family Fiction\n #1 in Teen & Young Adult Fiction about Emotions & Feelings\n #1 in Teen & Young Adult Fiction about Death & Dying\n\n\n\n\n\n\nCustomer Reviews:\n\n\n\n\n\n\n\n4.5 out of 5 stars\n\n\n\n\n\n\n\n\n12,959 ratings\n\n\n\n\n\n\n\n\n\n\n\n\n'

And after some regex:

'Product detailsPublisher:Ember; Illustrated edition (May 29, 2018)Language:EnglishPaperback:320 pagesISBN-10:0385741278ISBN-13:978-0385741279Reading age:12 - 17 yearsLexile measure:600LGrade level:7 - 9Item Weight:12 ouncesDimensions:5.49 x 0.84 x 8.24 inchesBest Sellers Rank:#13 in Books (See Top 100 in Books) #1 in Teen & Young Adult Family Fiction #1 in Teen & Young Adult Fiction about Emotions & Feelings #1 in Teen & Young Adult Fiction about Death & DyingCustomer Reviews:4.5 out of 5 stars12,959 ratings'
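The article doesn't show the regex itself. Under the assumption that the cleanup only needs to delete the newline runs the scrape left behind, a single substitution reproduces the format above:

```python
import re

# A shortened copy of the raw Product_Details text shown above.
raw = ('\n\nProduct details\n\nPublisher\n:\n\nEmber; Illustrated edition '
       '(May 29, 2018)\n\nLanguage\n:\n\nEnglish\n')

# Deleting every run of newlines flattens the field into "key:value" form.
clean = re.sub(r'\n+', '', raw)
print(clean)
# Product detailsPublisher:Ember; Illustrated edition (May 29, 2018)Language:English
```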

So we can extract a lot of information from this column, like the publisher, edition, language, ISBN, and so on.

# Extract the publisher name: the text between "Publisher:" and the first ";"
df['Publisher'] = df['Product_Details'].str.extract('((?<=Publisher:).+?(?=;))', expand=False)
df['Publisher'] = df.Publisher.str.replace(r'-', '', regex=True).str.strip()

The new feature Publisher :

Publisher feature

The new feature ISBN :

# ISBN-13 consists of 13 digits
df['ISBN'] = df['Product_Details'].str.extract(r'ISBN-13:(\d+-?\d*)', expand=False)
df['ISBN'] = df.ISBN.str.replace(r'-', '', regex=True).str.strip()
ISBN Feature

The two examples above show how easy it is to extract data using the regex technique.
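As another illustration (a hypothetical further extraction in the same style), the Language value sits between "Language:" and the next key, "Paperback":

```python
import pandas as pd

# A one-row stand-in for the cleaned Product_Details column.
details = pd.Series([
    'Product detailsPublisher:Ember; Illustrated edition (May 29, 2018)'
    'Language:EnglishPaperback:320 pagesISBN-10:0385741278'
])

# Lazy match between "Language:" and the lookahead "Paperback".
language = details.str.extract(r'Language:(.+?)(?=Paperback)', expand=False).str.strip()
print(language[0])  # English
```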

new dataframe
hasNAN = df.isnull().sum()
hasNAN = hasNAN[hasNAN > 0]
hasNAN = hasNAN.sort_values(ascending=False)
print(hasNAN,df.shape)
Null values of the new dataframe

The new dataframe shape is (1200, 6). Since we have many books and the null values are less than 70% of the data, we will delete the rows containing null values.
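The dropping step itself isn't shown in the article. Here is a minimal sketch of how the cleaned frame (called df1 in the snippets below) might be produced, with toy rows and only two of the six columns:

```python
import pandas as pd

# Toy frame with one missing description (the real frame is 1200 x 6).
df = pd.DataFrame({
    'Name': ['We Were Liars', 'Book B', 'Book C'],
    'Description': ['A suspense novel...', None, 'A picture book...'],
})

# Drop rows whose Description is null and reindex.
df1 = df.dropna(subset=['Description']).reset_index(drop=True)
print(df1.shape)  # (2, 2)
```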

# Calculating the word count for book descriptions
df1['word_count'] = df1['Description'].apply(lambda x: len(str(x).split()))

# Plotting the word count
df1['word_count'].plot(
    kind='hist',
    bins=50,
    figsize=(12, 8),
    title='Word Count Distribution for book descriptions')
Distribution of word count for book description.
  • The majority of book descriptions are between 30 and 200 words.
  • We don't have many lengthy book descriptions; it is clear that Amazon.com provides short descriptions.
  • The distribution of description word counts is right-skewed.
# Converting book titles into vectors using TF-IDF with bigrams
tf = TfidfVectorizer(ngram_range=(2, 2), stop_words='english', lowercase=False)
tfidf_matrix = tf.fit_transform(df['Name'])
total_words = tfidf_matrix.sum(axis=0)

# Finding the bigram frequencies
freq = [(word, total_words[0, idx]) for word, idx in tf.vocabulary_.items()]
freq = sorted(freq, key=lambda x: x[1], reverse=True)

# Converting into a dataframe
bigram = pd.DataFrame(freq)
bigram.rename(columns={0: 'bigram', 1: 'count'}, inplace=True)

# Taking the first 20 records
bigram = bigram.head(20)

# Plotting the bigram distribution
bigram.plot(x='bigram', y='count', kind='bar',
            title='Bigram distribution for the top 20 bigrams in the book titles',
            figsize=(15, 7))
Plotting the bigram distribution
  • “Boxed Set”, “Box Set”, and “Harry Potter” are the top 3 bigrams in the book titles.
# Converting book descriptions into vectors using TF-IDF with bigrams
tf = TfidfVectorizer(ngram_range=(2, 2), stop_words='english', lowercase=False)
tfidf_matrix = tf.fit_transform(df1['Description'])
total_words = tfidf_matrix.sum(axis=0)

# Finding the bigram frequencies
freq = [(word, total_words[0, idx]) for word, idx in tf.vocabulary_.items()]
freq = sorted(freq, key=lambda x: x[1], reverse=True)

# Converting into a dataframe
bigram = pd.DataFrame(freq)
bigram.rename(columns={0: 'bigram', 1: 'count'}, inplace=True)

# Taking the first 20 records
bigram = bigram.head(20)

# Plotting the bigram distribution
bigram.plot(x='bigram', y='count', kind='bar',
            title='Bigram distribution for the top 20 bigrams in the book descriptions',
            figsize=(15, 7))
  • “New York”, “York Times”, and “board book” are the top 3 bigrams in the book descriptions.

Text Preprocessing

import re
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# Function for removing non-ASCII characters
def _removeNonAscii(s):
    return "".join(i for i in s if ord(i) < 128)

# Function for converting into lower case
def make_lower_case(text):
    return text.lower()

# Function for removing stop words
def remove_stop_words(text):
    text = text.split()
    stops = set(stopwords.words("english"))
    text = [w for w in text if w not in stops]
    text = " ".join(text)
    return text

# Function for removing punctuation
def remove_punctuation(text):
    tokenizer = RegexpTokenizer(r'\w+')
    text = tokenizer.tokenize(text)
    text = " ".join(text)
    return text

# Function for removing HTML tags
def remove_html(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)

The above functions convert the description feature to lower case and remove HTML tags, punctuation, stop words, and non-ASCII characters.

# Applying all the functions to the description and storing the result as cleaned_desc
df1['cleaned_desc'] = df1['Description'].apply(_removeNonAscii)
df1['cleaned_desc'] = df1.cleaned_desc.apply(func=make_lower_case)
df1['cleaned_desc'] = df1.cleaned_desc.apply(func=remove_html)  # strip tags before punctuation removal deletes the angle brackets
df1['cleaned_desc'] = df1.cleaned_desc.apply(func=remove_stop_words)
df1['cleaned_desc'] = df1.cleaned_desc.apply(func=remove_punctuation)

The above code stores the cleaned text in a new feature called cleaned_desc.

Comparison between description before and after cleaning:
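The comparison itself is shown as a figure; the effect can also be sketched in code. This condensed version of the pipeline uses a tiny hard-coded stop-word list instead of NLTK's, so it runs without downloading corpora:

```python
import re

# Tiny stand-in for NLTK's English stop-word list.
stops = {'a', 'the', 'and', 'of', 'it'}

def clean(text):
    text = ''.join(c for c in text if ord(c) < 128)      # drop non-ASCII (e.g. \xa0)
    text = re.sub(r'<.*?>', '', text)                    # strip HTML tags
    text = text.lower()                                  # lower-case
    words = re.findall(r'\w+', text)                     # tokenize, dropping punctuation
    return ' '.join(w for w in words if w not in stops)  # remove stop words

before = 'A beautiful and distinguished family.\xa0Read it!'
print(clean(before))  # beautiful distinguished family read
```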

  • We build a recommendation engine using book titles and descriptions.
  • Convert each book title and description into vectors using TF-IDF with bigrams.
  • The model recommends similar books based on title and description.
  • Define a function that takes a book title as input and returns the top five similar books based on the title and description.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

tfidf = TfidfVectorizer(stop_words='english')
df1['cleaned_desc'] = df1['cleaned_desc'].fillna('')

# Construct the TF-IDF matrix by applying fit_transform to the cleaned descriptions
overview_matrix = tfidf.fit_transform(df1['cleaned_desc'])

# Output the shape of the TF-IDF matrix
overview_matrix.shape

# Cosine similarity matrix (linear_kernel on the L2-normalized TF-IDF vectors)
similarity_matrix = linear_kernel(overview_matrix, overview_matrix)
similarity_matrix

# Book index mapping: title -> row index
mapping = pd.Series(df1.index, index=df1['Name'])
mapping[:3]
def recommend_books_based_on_plot(book_input):
    book_index = mapping[book_input]
    # Get similarity values with the other books:
    # similarity_score is a list of (index, similarity) pairs
    similarity_score = list(enumerate(similarity_matrix[book_index]))
    # Sort by similarity to the input book, in descending order
    similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)
    # Take the 5 most similar books, skipping the first (the book itself)
    similarity_score = similarity_score[1:6]
    # Return the book names using the mapping series
    book_indices = [i[0] for i in similarity_score]
    return df1['Name'].iloc[book_indices]

Let's now try to get a recommendation for the book ‘If Animals Kissed Good Night’ from the above recommendation function and see what it outputs.

recommend_books_based_on_plot('If Animals Kissed Good Night').to_frame()
Top 5 books recommended.

Top 5 books recommended that are similar to “If Animals Kissed Good Night”.

  • From the descriptions, we can see that the recommended books share story elements with 'If Animals Kissed Good Night'.