Movie Recommendation System

Mar 1, 2021

This project was built by the Neon team: Aisha Y Hakami, Monirah abdulaziz, Mohammed Al-Ali, and Lamalharbi.

In daily life, when we shop online or look for a movie to watch, we usually ask our friends for a recommendation. When they recommend something they enjoyed but we don’t like, what a waste of time, right? So if there were a system that could understand you and recommend things based on your interests, that would be so cool, wouldn’t it? Well, that is precisely what recommender systems are made for.

trailer

Problem Statement

When it comes to streaming websites, understanding user behavior is always a challenge. Irrespective of gender, age, or geographical location, everyone enjoys watching movies at home. Many people like specific genres such as romance, action, or comedy, while others follow particular lead actors or directors’ visions. We are all connected via this wonderful medium. What is most exciting, however, is how distinctive our choices and combinations are when it comes to show preferences. Taking all of that together, it is remarkably difficult to generalize about a movie and say that everyone will like it. This is where recommender systems shine, acting as special assistants for each user’s needs.

Dataset

We found out that STC launched an Open Data initiative to support those who are interested in data science. The data is associated with Jawwy, their IPTV service, and we decided that this would be the beginning of the Neon recommendation system journey. The data covers the behavior of 29,487 users across about 8k shows, in roughly 3 million rows. We extracted as much as we could from the behavioral patterns of not only the users but also the movies themselves.

Modeling

We decided to use 6 models to meet user preferences as closely as we can. Those models are:

Collaborative filtering

This type of filter is based on users’ watching history. It recommends movies that we haven’t watched yet but that users similar to us have watched and liked. To determine whether two users are similar, the filter considers the movies both of them have watched. By looking at the movies in common, the algorithm recommends a movie to a user who hasn’t watched it yet, based on similar users’ watching history.
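
To make the idea concrete, here is a minimal sketch of user-based collaborative filtering in plain Python. It is not the project’s code: the watch table, the user names, and the helper functions `cosine` and `recommend_for_user` are all made up for illustration, and we assume a simple 0/1 “watched” signal.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend_for_user(target, watch, top_n=2):
    """Score movies the target has not seen by the similarity of the users who watched them."""
    scores = {}
    for other, row in watch.items():
        if other == target:
            continue
        sim = cosine(watch[target], row)
        if sim <= 0:
            continue  # users with nothing in common contribute no evidence
        for movie, seen in enumerate(row):
            if seen and not watch[target][movie]:
                scores[movie] = scores.get(movie, 0.0) + sim
    # highest score first, movie id as a tie-breaker
    return sorted(scores, key=lambda m: (-scores[m], m))[:top_n]

# Hypothetical data: one row per user, one column per movie (1 = watched)
watch = {
    "ann":  [1, 1, 0, 0, 0],
    "bob":  [1, 1, 1, 0, 0],
    "cara": [0, 0, 0, 1, 1],
}

recs = recommend_for_user("ann", watch)
print(recs)  # movie ids suggested for ann
```

Here ann and bob share two movies, so the one movie bob watched that ann has not is suggested, while cara has no overlap with ann and is ignored.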

Pros and Cons of collaborative filtering technique

Pros:

Collaborative filtering systems are driven by the people in the system, and people are expected to be better at evaluating information than a computed function.

Cons:

1. Cold start:

A major challenge for collaborative filtering is how to make recommendations for a new user who has only recently entered the system. This is called the cold-start user problem.

2. Sparsity:

The user/rating matrix is sparse.
It is hard to find users that have rated the same items.

3. First rater:

Cannot recommend an item that has not been previously rated.
This affects new items and esoteric items.

4. Popularity bias:

Cannot recommend items to someone with a unique taste.
Tends to recommend popular items.

Content-based filters

Based on what we like, the algorithm simply picks movies with similar content (story description) to recommend to us. This type of filter does not involve other users.
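
As a rough illustration of the content-based idea, the sketch below compares word counts of story descriptions. The titles, descriptions, and helper names (`bag_of_words`, `cosine_counts`, `similar_by_description`) are invented for this example, and a real pipeline would clean the text first.

```python
from collections import Counter
from math import sqrt

def bag_of_words(text):
    """Count how often each lowercase word appears in a description."""
    return Counter(text.lower().split())

def cosine_counts(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similar_by_description(title, descriptions, top_n=1):
    """Rank the other programs by how similar their descriptions are to `title`."""
    target = bag_of_words(descriptions[title])
    scores = {
        other: cosine_counts(target, bag_of_words(text))
        for other, text in descriptions.items()
        if other != title
    }
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

# Hypothetical catalogue
descriptions = {
    "Space Raid": "a crew fights aliens in deep space",
    "Star Patrol": "a crew explores deep space and meets aliens",
    "Love in Rome": "two strangers fall in love in rome",
}

closest = similar_by_description("Space Raid", descriptions)
print(closest)
```

Note that no other user appears anywhere in this sketch: the ranking depends only on the descriptions themselves.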

Pros and Cons of content-based technique

Pros:

No cold-start problem for new items: unlike collaborative filtering, if the programs have sufficient descriptions, we avoid the “new item problem”.
Able to recommend to users with unique tastes.

Cons:

Content-based filters tend toward over-specialization: they recommend items similar to those already consumed, with a tendency to create a “filter bubble”.
They never recommend items outside the user’s content profile, even though people might have multiple interests.

Hybrid filtering

Overcomes the cons of the previous filters.
Creates a weighted recommender (weights are chosen equally, combining the results of predict_cf and predict_cn).
Creates a second weighted recommender (weights are chosen roughly equally, combining the results of predict_cf, predict_cn, and predict_popularity).
Creates a popularity-based recommender with weighted CL (collaborative) and CB (content-based) predictions, in which the CL and CB predictions are scaled by the popularity prediction.
Keeps the strengths of the CL, CB, and popularity models.
Overcomes the cons of the CL, CB, and popularity models.

import pandas as pd

# mapping each positional index to its program title
indices = pd.Series(df_program_desc.index)

# defining the function that takes in a movie title
# as input and returns the top 10 recommended movies
def recommendations(title, type_of_recommendation, cosine_sim=cosine_sim,
                    cosine_sim_w=cosine_sim_w, prec_watch_mat=prec_watch_mat):
    '''
    type_of_recommendation values:

    0: the similarity scores of program description (Content Based),
    1: the similarity scores of watch history (Collaborative filtering),
    2: the similarity scores between watch history and program description (Hybrid),
    3: the similarity scores between watch history and program description
       with popularity of program as independent variables (Hybrid),
    4: the similarity scores between watch history and program description
       as dependent variables on popularity of program (Hybrid).
    '''

    # initializing the empty list of recommended movies
    recommended_movies = []

    # getting the index of the movie that matches the title
    idx = indices[indices == title].index[0]

    if type_of_recommendation == 0:
        # creating a Series with the similarity scores of program description in descending order
        score_series = pd.Series(cosine_sim[idx]).sort_values(ascending=False)

    elif type_of_recommendation == 1:
        # creating a Series with the similarity scores of watch history in descending order
        score_series = pd.Series(cosine_sim_w[idx]).sort_values(ascending=False)

    elif type_of_recommendation == 2:
        # creating a Series with the similarity scores between watch history
        # and program description in descending order
        score_series = pd.Series(cosine_sim_w[idx]*0.5 + cosine_sim[idx]*0.5).sort_values(ascending=False)

    elif type_of_recommendation == 3:
        # creating a Series with the similarity scores between watch history and program description
        # with popularity of program as independent variables in descending order
        # (this assumes prec_watch_mat holds one popularity score per program)
        score_series = pd.Series(cosine_sim_w[idx]*0.33 + cosine_sim[idx]*0.33 + prec_watch_mat*0.34).sort_values(ascending=False)

    elif type_of_recommendation == 4:
        # creating a Series with the similarity scores between watch history and program description
        # as dependent variables on popularity of program in descending order
        score_series = pd.Series((cosine_sim_w[idx]*0.5 + cosine_sim[idx]*0.5) * prec_watch_mat).sort_values(ascending=False)

    else:
        print('You have entered wrong value')
        return

    # getting the indexes of the 10 most similar movies; the queried movie is dropped
    # explicitly because with popularity in the score it is not guaranteed to rank first
    top_10_indexes = list(score_series.drop(idx).iloc[:10].index)

    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(df_program_desc.index[i])

    return recommended_movies
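
To see how the three hybrid modes can rank the same candidates differently, here is a small standalone sketch with made-up scores. The helper names `blend` and `ranking` and all the numbers are hypothetical; only the weights (0.5/0.5, 0.33/0.33/0.34, and the multiplication by popularity) mirror the function above.

```python
def blend(content, collab, popularity, mode):
    """Combine per-program scores the way modes 2, 3, and 4 do."""
    combined = []
    for cb, cf, pop in zip(content, collab, popularity):
        if mode == 2:    # equal-weight hybrid
            combined.append(0.5 * cf + 0.5 * cb)
        elif mode == 3:  # popularity as a third, independent term
            combined.append(0.33 * cf + 0.33 * cb + 0.34 * pop)
        elif mode == 4:  # popularity scales the hybrid score
            combined.append((0.5 * cf + 0.5 * cb) * pop)
        else:
            raise ValueError("mode must be 2, 3, or 4")
    return combined

def ranking(scores):
    """Return program positions ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Hypothetical scores for three candidate programs
content    = [0.9, 0.2, 0.5]   # description similarity
collab     = [0.7, 0.2, 0.5]   # watch-history similarity
popularity = [0.1, 1.0, 0.6]   # share of users who watched the program

for mode in (2, 3, 4):
    print(mode, ranking(blend(content, collab, popularity, mode)))
```

With these numbers, the additive modes keep the most similar program on top, while the multiplicative mode pushes it to the bottom because it is rarely watched. That is the practical difference between treating popularity as one more term and letting it scale the other two.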

How does Cosine Similarity work?

All the filters above are built on cosine similarity, a metric that measures how similar documents are irrespective of their size. Mathematically, it is the cosine of the angle between two vectors projected in a multi-dimensional space. Cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance (due to the size of the documents), they may still be oriented close together. The smaller the angle, the higher the cosine similarity.
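
In formula form, cos(θ) = (A · B) / (‖A‖ ‖B‖). The short sketch below, with made-up word-count vectors and hypothetical helper names, shows the size effect described above: a document and a ten-times-longer copy of it are far apart by Euclidean distance but have a cosine similarity of 1.

```python
from math import sqrt

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical word counts over a three-word vocabulary
short_doc = [1, 2, 0]
long_doc  = [10, 20, 0]   # same word proportions, ten times longer
other_doc = [0, 1, 3]     # similar length, different topic

print(cosine(short_doc, long_doc), euclidean(short_doc, long_doc))
print(cosine(short_doc, other_doc), euclidean(short_doc, other_doc))
```

Euclidean distance would call `other_doc` the closer match simply because it is about the same length, while cosine similarity picks `long_doc`, which is the one that actually talks about the same things.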

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# instantiating and generating the count matrix
count = CountVectorizer()
count_matrix_desc = count.fit_transform(df_program_desc['cleaned_desc'])
# generating the cosine similarity matrix on program description
cosine_sim = cosine_similarity(count_matrix_desc, count_matrix_desc)
# generating the cosine similarity matrix on watch history
cosine_sim_w = cosine_similarity(watch_crosstab_transpose.values, watch_crosstab_transpose.values)

We also use the Surprise library, which I am going to write an article about in the near future.

Demo

Result

We used the Hit-Rate as the evaluation metric for the cosine matrix algorithms.

To compute the Hit-Rate, we generate the top-N recommendations for a user and compare them to the movies that user has watched. Each match increases the hit count by 1, and this is repeated for the complete dataset. We then sum the number of hits for each movie in our top-N lists and divide by the total number of movies. Since we want to recommend movies that are new to the user, the closer the score is to 0, the better.
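
The calculation can be sketched as follows. This is an illustration rather than the project’s evaluation code: the function name `hit_rate` and the sample data are made up, and we assume that “total number of movies” means the total number of recommended movies across all top-N lists.

```python
def hit_rate(top_n_by_user, watched_by_user):
    """Share of recommended movies that the user had already watched."""
    hits = 0
    total = 0
    for user, recommended in top_n_by_user.items():
        watched = watched_by_user.get(user, set())
        hits += sum(1 for movie in recommended if movie in watched)
        total += len(recommended)
    return hits / total if total else 0.0

# Hypothetical top-3 lists and watch histories
top_n_by_user = {
    "u1": ["A", "B", "C"],
    "u2": ["B", "D", "E"],
}
watched_by_user = {
    "u1": {"A", "X"},   # "A" was already watched, so it counts as a hit
    "u2": {"Y"},
}

rate = hit_rate(top_n_by_user, watched_by_user)
print(rate)  # 1 hit out of 6 recommended movies
```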

The content-based model achieves a good score, while the collaborative model does not do as well. The hybrid model keeps the strengths of the previous models and gets a good score. The hybrid and popularity models are biased toward user preferences and achieve a reasonable score as well.

Thank you for reading.