Goodreads Books Recommendation System

Monirah abdulaziz
5 min read · Feb 28, 2021

Done by: Lamalharbi, Aisha Y Hakami and Monirah abdulaziz

source: https://www.goodreads.com/

In this post, we show how to build a simple content-based recommender system using Goodreads.com data on computer books, obtained from Kaggle.

When we plan to buy a book, especially a scientific one, we usually ask around for good titles, research the subject area, compare similar books, or read reviews. A recommender system can take over much of this work.

Recommendation system algorithms:

Content-based recommendation system (CB)

Content-based recommendation systems recommend items to a user based on the similarity between items. Similarity is computed from the items' descriptions or features, and the user’s previous history determines which items the recommendations should resemble.
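
As a rough illustration of this idea, the sketch below builds a profile from the books a user has already read and ranks the remaining books by Jaccard similarity of their feature tags. The catalogue, the tags, and the function names are made up for this example, and Jaccard similarity is just one possible measure, not necessarily the one used later in this post.

```python
# Toy catalogue: each book is described by a set of feature tags (made-up data).
books = {
    "Intro to Algorithms": {"algorithms", "data-structures", "theory"},
    "Algorithm Design": {"algorithms", "theory", "proofs"},
    "Learning Python": {"python", "programming", "beginner"},
    "Python Cookbook": {"python", "programming", "recipes"},
}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(history, catalogue, top_n=2):
    """Rank unread books by similarity to the union of tags the user has read."""
    profile = set().union(*(catalogue[title] for title in history))
    scores = {
        title: jaccard(profile, tags)
        for title, tags in catalogue.items()
        if title not in history
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


recommendations = recommend(["Learning Python"], books)
print(recommendations)
```

A reader of "Learning Python" gets the other Python book ranked first, because it shares the most tags with the profile.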

Pros and Cons of content-based recommendation techniques

Pros :

  1. No cold-start problem for new items: unlike collaborative filtering, if the items have sufficient descriptions, we avoid the “new item problem”.
  2. Able to recommend to users with unique tastes
  3. Able to recommend new & unpopular items — No first-rater problem.

Cons :

  1. Content-based systems tend toward over-specialization: they recommend items similar to those already consumed, which can create a “filter bubble”.
  2. They never recommend items outside the user’s content profile, even though people may have multiple interests.

Collaborative Filtering (CF)

This filtering strategy compares the user’s behavior with the behavior of other users in the database. The history of all users plays an important role in this algorithm. The main difference between content-based filtering and collaborative filtering is that in the latter, the interactions of all users with the items influence the recommendation, while in content-based filtering only the concerned user’s data is taken into account.
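
To make the contrast concrete, here is a minimal user-based sketch: it finds the user whose ratings are most similar to the target user's and suggests books that neighbour has rated but the target has not. The ratings, user names, and function names are invented for illustration, and real systems typically combine many neighbours rather than one.

```python
from math import sqrt

# Toy ratings: user -> {book: rating}. All names and values are made up.
ratings = {
    "ana":   {"Book A": 5, "Book B": 4, "Book C": 1},
    "badr":  {"Book A": 5, "Book B": 5, "Book D": 4},
    "chloe": {"Book A": 1, "Book C": 5, "Book E": 4},
}


def cosine(u, v):
    """Cosine similarity over the books both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[b] * v[b] for b in common)
    norm_u = sqrt(sum(u[b] ** 2 for b in common))
    norm_v = sqrt(sum(v[b] ** 2 for b in common))
    return dot / (norm_u * norm_v)


def recommend_cf(user, all_ratings):
    """Suggest unseen books from the single most similar other user."""
    others = [name for name in all_ratings if name != user]
    nearest = max(others, key=lambda name: cosine(all_ratings[user], all_ratings[name]))
    unseen = {b: r for b, r in all_ratings[nearest].items() if b not in all_ratings[user]}
    return nearest, sorted(unseen, key=unseen.get, reverse=True)


neighbour, suggestions = recommend_cf("ana", ratings)
print(neighbour, suggestions)
```

Note that nothing about the books themselves is used here, only who rated what.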

Pros and Cons of collaborative filtering techniques

Pros:

  1. Collaborative filtering relies on the judgments of people in the system, and people are expected to be better at evaluating information than a computed function.

Cons:

  1. Cold start: a major challenge of collaborative filtering is how to make recommendations for a new user who has only recently entered the system; this is called the cold-start user problem.
  2. Sparsity:
  • The user/rating matrix is sparse.
  • Hard to find users that have rated the same items.
  3. First rater:
  • Cannot recommend an item that has not been previously rated.
  • Affects new items and esoteric items.
  4. Popularity bias:
  • Cannot recommend items to someone with a unique taste.
  • Tends to recommend popular items.
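
The sparsity and first-rater points can be checked on a small example. The sketch below, using made-up ratings and a hypothetical extra catalogue entry, measures which fraction of the user/book matrix is empty and lists books that nobody has rated yet.

```python
# Toy ratings: user -> {book: rating}. Values are made up for illustration.
ratings = {
    "ana":   {"Book A": 5, "Book B": 4},
    "badr":  {"Book A": 5, "Book D": 4},
    "chloe": {"Book C": 5},
}

# Every book that has at least one rating
rated_books = sorted({book for user_ratings in ratings.values() for book in user_ratings})

# Sparsity = share of empty cells in the user x book matrix
n_cells = len(ratings) * len(rated_books)
n_filled = sum(len(user_ratings) for user_ratings in ratings.values())
sparsity = 1 - n_filled / n_cells

# A book with no ratings can never be recommended (the first-rater problem).
catalogue = rated_books + ["Book E"]  # "Book E" is a hypothetical new arrival
never_rated = [book for book in catalogue if book not in rated_books]

print(f"sparsity = {sparsity:.2f}, never rated: {never_rated}")
```

Even in this tiny matrix more than half of the cells are empty; real rating matrices are usually far sparser.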

As mentioned above, we are using Goodreads.com data, which does not include users' reading history. Hence, we use a simple content-based recommendation system. We are going to build two recommenders, one based on the book title and one based on the book description.

We need to find books similar to a given book and then recommend them to the user. How do we decide whether two books are similar or dissimilar? We use a similarity measure.
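
One widely used similarity measure for text is cosine similarity between word-count vectors. The following is a minimal sketch under that assumption, with invented titles and helper names; it is not necessarily the exact measure or preprocessing used for the dataset in this post.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Lower-case the text and count how often each word occurs."""
    return Counter(text.lower().split())


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two word-count vectors (0 = no shared words)."""
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(a[word] * b[word] for word in a)
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# Made-up titles, used only to illustrate the measure.
titles = [
    "Introduction to Machine Learning",
    "Machine Learning in Python",
    "The Art of Computer Programming",
]

query = titles[0]
ranked = sorted(titles[1:], key=lambda t: cosine_similarity(query, t), reverse=True)
print(ranked[0])
```

The same function works on longer texts such as book descriptions, although common words like "the" would then usually be down-weighted or removed first.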

source: shorturl.at/ivGRY

The Data:

This dataset contains information about different computer books:

Data dictionary

The head of the table:

Check Nulls

Null values

EDA

Since the column Book_Id only serves as an index, we will drop it and generate a new ID.

df.drop(['Book_Id'],axis=1,inplace=True)
  • Check for duplicates and drop them if any exist.
df.shape

(1234, 9)

df[df['book_Title']=='Anime and the Visual Novel: Narrative Structure Design and Play at the Crossroads of Animation and Computer Games']
Sample of duplicated rows

Drop the duplicates and check the shape

df.drop_duplicates(inplace=True)
df.shape

(1142, 9)

df[df['book_Title']=='Anime and the Visual Novel: Narrative Structure Design and Play at the Crossroads of Animation and Computer Games']
Confirm that the duplicated rows are removed
# Generate a new id 
df=df.assign(id=(df['book_Title']).astype('category').cat.codes)
df.head()
after assigning new ID
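
To see what these two steps do, here is a small sketch on made-up rows; only the column names mirror the ones used in this post. It shows that drop_duplicates removes only fully identical rows, and that the category codes are assigned in sorted title order.

```python
import pandas as pd

# Made-up rows; only the column names follow the article.
toy = pd.DataFrame({
    "book_Title": ["Zebra Patterns", "Algorithms", "Algorithms", "Algorithms"],
    "Pages_no": [100, 300, 300, 320],
})

# Only the exact duplicate (Algorithms, 300) is removed
toy = toy.drop_duplicates().reset_index(drop=True)

# Titles become categories; the codes follow the sorted order of the titles
toy = toy.assign(id=toy["book_Title"].astype("category").cat.codes)
print(toy)
```

Because the codes are derived from the title, two rows with the same title but different details (here, the two remaining "Algorithms" rows) share one id, so the new id identifies a title rather than a row.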

Visualization

Most frequent authors

import matplotlib.pyplot as plt
# Bar chart of the ten authors with the most books in the dataset
plt.subplots(figsize=(10,7))
df.Author_Name.value_counts()[:10].plot(kind="bar")
plt.show()
Most frequent authors

Donald Ervin Knuth, Gary B. Shelly, and Sumita Arora are the three most frequent authors.

Calculating and plotting the word count for book_Title

# Calculate the word count of each title; it is stored in a new column
# (title_word_count, a name chosen here) so the original titles are not overwritten
df['title_word_count'] = df['book_Title'].apply(lambda x: len(str(x).split()))
# Plot the word count distribution
df['title_word_count'].plot(
    kind='hist',
    bins=50,
    figsize=(12, 8),
    title='Word Count Distribution for book_Title')
Word Count Distribution for book_Title
  • Most book titles are between 5 and 10 words long.
  • The distribution of title lengths is right-skewed.

Content-Based Recommender Systems

from sklearn.neighbors import NearestNeighbors

# Test point for ratings_count=40, Avg_Rating=4.5, Publish_year=2002, Pages_no=523
test_point = [40, 4.5, 2002, 523]
# Select the four feature columns by position
X = df.iloc[:, [2, 3, 4, 6]].values
# Build a nearest-neighbor object; we search for just 3 neighbors, so n_neighbors=3
nn = NearestNeighbors(n_neighbors=3).fit(X)
# kneighbors returns the distances to, and the row indices of, the neighbors of test_point
print(nn.kneighbors([test_point]))

(array([[ 3.9028323 , 7.91060048, 12.2137791 ]]), array([[626, 445, 486]]))

# Look up the two nearest books by their row positions
df.iloc[626].to_frame()
df.iloc[445].to_frame()

From the above results, we can observe that the recommended books are similar to the test point that we fed manually to the model.
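
One thing to keep in mind with this approach: Euclidean distance on raw columns is dominated by the features with the largest numeric range (such as the ratings count or the number of pages), so a difference of a few rating points barely matters. A possible refinement, not part of the code above, is to standardise each column first. The sketch below shows the effect in plain Python on made-up rows; with scikit-learn, the same idea is usually implemented with StandardScaler before fitting NearestNeighbors.

```python
from math import sqrt
from statistics import mean, pstdev

# Made-up rows: [ratings_count, Avg_Rating, Publish_year, Pages_no]
rows = [
    [60, 2.0, 2002, 520],   # close in count and pages, but poorly rated
    [90, 4.5, 2001, 530],   # same rating, slightly different count
    [900, 4.4, 1980, 200],
    [500, 3.0, 1995, 900],
]
test_point = [40, 4.5, 2002, 523]


def standardise(data, point):
    """Z-score every column using the mean and standard deviation of `data`."""
    columns = list(zip(*data))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero

    def scale(row):
        return [(x - m) / s for x, m, s in zip(row, means, stds)]

    return [scale(row) for row in data], scale(point)


def nearest(data, point):
    """Index of the row with the smallest Euclidean distance to `point`."""
    distances = [sqrt(sum((a - b) ** 2 for a, b in zip(row, point))) for row in data]
    return distances.index(min(distances))


raw_best = nearest(rows, test_point)
scaled_rows, scaled_point = standardise(rows, test_point)
scaled_best = nearest(scaled_rows, scaled_point)
print(raw_best, scaled_best)
```

On the raw values the poorly rated book wins because its ratings count is closest; after standardising, the book with the matching rating becomes the nearest neighbour.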
