Building a Movie Recommendation System with Surprise and Python

Mar 4, 2021

Done by: Monirah, Aisha Y Hakami, Lamalharbi and Mohammed Alali

Problem Identification

We are developing a recommender system for Jawwy, the IPTV service provided by STC. The proposed system is implemented in Python and JupyterLab using Sklearn and related libraries. The aim is to create business value for a local company by improving its service. We selected STC and its Jawwy service for this project because STC shared the data through its open data initiative.

In this post, we build a recommendation model using the Surprise library.

Surprise is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

Surprise was designed with the following purposes in mind:

Give users perfect control over their experiments.
Provide various ready-to-use prediction algorithms, such as baseline algorithms and neighborhood methods.
Make it easy to implement new algorithm ideas.
Provide tools to evaluate, analyze and compare the algorithms’ performance. Cross-validation procedures can be run very easily using powerful CV iterators (inspired by scikit-learn’s excellent tools).

We’ll be working with the Jawwy IPTV dataset to develop recommendation algorithms with the Surprise library, which was built by Nicolas Hug. Let’s dive in!

Dataset

Saudi operator STC launched an interactive IPTV service called InVision for its DSL customers to coincide with the start of Ramadan. The service is initially available in the three major cities Riyadh, Jeddah, and Dammam free of charge throughout the month of Ramadan. InVision provides free TV channels and pay-TV packages, time-shifting of programs, program recording, a video-on-demand library of religious and documentary programs, and a catch-up TV service to watch programs broadcast over the last seven days. In addition, users have access to a control panel to select programs and movies.

We appreciate STC's role in enabling the data community. STC launched its Open Data initiative to help prepare a technically aware Saudi generation in line with Vision 2030, supporting researchers interested in data science by letting them explore and process data through application and practice.

The data records the viewing behavior of 29,487 users across roughly 8k shows, in about 3 million rows. We try to extract as much as we can from the behavioral patterns, not only of the users but also of the programs themselves.

Modeling

In previous posts, we discussed two of the most popular approaches to recommender systems: collaborative filtering and content-based recommendation. In this post, we focus on collaborative filtering, in which a user is recommended items that people with similar tastes and preferences liked in the past. In other words, this method predicts unknown ratings by using the similarities between users.
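As a rough illustration of that idea (this is not the model we use later), the sketch below predicts a missing rating as a similarity-weighted average of the ratings other users gave to the same show. The toy ratings, the user and show names, and both helper functions are made up for this example.

```python
import math

# Toy ratings (hypothetical): user -> {show: rating in [0, 1]}
ratings = {
    "u1": {"show_a": 1.0, "show_b": 0.8, "show_c": 0.1},
    "u2": {"show_a": 0.9, "show_b": 0.7},
    "u3": {"show_a": 0.1, "show_b": 0.9, "show_c": 0.9},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the shows both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[s] * b[s] for s in common)
    norm_a = math.sqrt(sum(a[s] ** 2 for s in common))
    norm_b = math.sqrt(sum(b[s] ** 2 for s in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def predict_rating(user, show):
    """Similarity-weighted average of the ratings other users gave to `show`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or show not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[show]
        weight_total += sim
    return weighted_sum / weight_total if weight_total > 0 else None

# u2 has not rated show_c; the estimate leans toward the more similar user, u1
prediction = predict_rating("u2", "show_c")
print(prediction)
```

Plain cosine similarity on non-negative ratings tends to be high for every pair of users, which is one reason practical neighborhood methods usually center the ratings first.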

Surprise Model

Surprise provides a collection of estimators for rating prediction. It implements classical algorithms such as the main similarity-based methods, as well as matrix factorization algorithms like SVD and NMF. It also provides tools for model evaluation, such as cross-validation iterators and built-in metrics.

Surprise works with explicit ratings and does not support implicit feedback, and our dataset contains no explicit ratings, as shown below:

So we did some feature engineering to derive a rating for each user/movie pair. The idea is to treat the watching ratio as the rating.

The difficulty is that movies and series are described differently in this dataset. For a movie, the total duration looks like this:

sample of movie

While for a series it looks like this:

sample of series

For movies, it is easy to obtain the watching ratio and treat it as a user rating. We use a regex to remove the word "min" that follows the number of minutes in the duration feature (the total duration of each movie, taken from IMDb), then convert the value from minutes to seconds. Finally, we divide duration_seconds, which is how long the user watched this particular movie, by that total duration.
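The movie steps can be sketched in plain Python on a single record. The function names and the sample values are hypothetical, and capping the ratio at 1.0 mirrors the clipping we apply to the DataFrame further down.

```python
import re

def parse_minutes(duration_text):
    """Return the number of minutes in a string such as '120 min', or None."""
    match = re.search(r"(\d+)\s*min", duration_text)
    return int(match.group(1)) if match else None

def watch_ratio(watched_seconds, duration_text):
    """Fraction of the movie watched, capped at 1.0; None if the duration is unusable."""
    minutes = parse_minutes(duration_text)
    if not minutes:
        return None
    total_seconds = minutes * 60  # minutes -> seconds
    return min(watched_seconds, total_seconds) / total_seconds

print(watch_ratio(1800, "120 min"))  # 1800 / 7200 = 0.25
print(watch_ratio(9000, "120 min"))  # watched time is capped, so 7200 / 7200 = 1.0
```

A duration such as "2 Seasons" has no minute count, so the sketch returns None for it.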

Unfortunately, we cannot apply these steps to series/TV shows, because their duration feature refers to the season rather than to the running time of the episodes. When we tried to obtain this data from an external dataset, we faced other problems: some series/TV shows would be lost, and different titles can share the same name, such as Little Women and Little Women.

So we split the dataset into two DataFrames: merged_df_series for TV shows and merged_df_movie for movies.

For TV shows, we still need a total duration. Since the dataset does not contain it, we use .max(), which returns the maximum duration_seconds watched by any user for each program, and treat that value as the total duration.

df_series_duration = merged_df_series.groupby("program_name")[["duration_seconds"]].max()

Add average_watching and total_duration. We will use average_watching as the target for the Surprise model.

# average_watching: watched time divided by the program's maximum watched time, capped at 1
merged_df_series["average_watching"] = merged_df_series.apply(lambda x: 1 if x["duration_seconds"] > df_series_duration.loc[x.program_name, "duration_seconds"] else x["duration_seconds"] / df_series_duration.loc[x.program_name, "duration_seconds"], axis=1)
# total_duration: the maximum watched time for the program
merged_df_series["total_duration"] = merged_df_series.apply(lambda x: df_series_duration.loc[x.program_name, "duration_seconds"], axis=1)

The head of the series dataset:

Remember that the average watching represents the ratio of watching for each user for a particular program.
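To see what this proxy does on a handful of rows, here is a small plain-Python sketch of the same idea. The records are invented for the example; only the logic (longest observed viewing as the total duration) follows the steps above.

```python
from collections import defaultdict

# Hypothetical viewing records: (user_id, program_name, seconds watched)
records = [
    (1, "Show A", 1200),
    (2, "Show A", 2400),
    (3, "Show A", 600),
    (1, "Show B", 900),
    (2, "Show B", 900),
]

# The longest observed viewing per program stands in for its total duration
total_duration = defaultdict(int)
for _, program, seconds in records:
    total_duration[program] = max(total_duration[program], seconds)

# Watching ratio = seconds watched / proxy total duration
average_watching = {
    (user, program): seconds / total_duration[program]
    for user, program, seconds in records
}

for key, ratio in average_watching.items():
    print(key, ratio)
```

One consequence worth keeping in mind: if nobody watched a program to the end, the proxy is shorter than the real duration and every ratio for that program is overestimated.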

For Movie programs:

# Extract the minutes by removing the "min" suffix (regex) and convert to an appropriate type
merged_df_movie["total_duration"] = merged_df_movie["duration"].str.replace(r"min", "", regex=True)
merged_df_movie["duration_seconds"] = pd.to_numeric(merged_df_movie["duration_seconds"], errors="coerce").astype("Int64")
merged_df_movie["total_duration"] = pd.to_numeric(merged_df_movie["total_duration"], errors="coerce").astype("Int64")
# convert from minutes to seconds
merged_df_movie["total_duration"] = merged_df_movie["total_duration"] * 60

If the value of duration_seconds is greater than total_duration, assign the value of total_duration to duration_seconds:

merged_df_movie["duration_seconds"] = merged_df_movie.apply(lambda x: x["total_duration"] if x["duration_seconds"] > x["total_duration"] else x["duration_seconds"], axis=1)

Add the average_watching feature. We will use it as the target for the Surprise model.

merged_df_movie["average_watching"] = merged_df_movie["duration_seconds"] / merged_df_movie["total_duration"]

The head of the movies dataset:

Concatenate Dataframes

Concatenate merged_df_movie and merged_df_series and check for outliers.

The dataset looks like this:

Some EDA:

Most of the programs in the data were watched fewer than 3 times, and very few programs were watched many times, although the most-watched program was watched 2312 times.

Distribution Of Number of Watching Per User

Most of the users in the data watched fewer than 10 programs, and few users watched many, although the most active user watched 136 programs.
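Counts like these come from tallying how often each program and each user appears in the viewing log. A minimal sketch with the standard library, on invented (user_id, show_id) pairs:

```python
from collections import Counter

# Hypothetical (user_id, show_id) viewing pairs
views = [(1, "a"), (1, "b"), (2, "a"), (3, "a"), (3, "c"), (3, "b")]

watches_per_program = Counter(show for _, show in views)
watches_per_user = Counter(user for user, _ in views)

# Number of programs that were watched fewer than 3 times
rarely_watched = sum(1 for n in watches_per_program.values() if n < 3)

print(watches_per_program.most_common(1))  # most-watched program and its count
print(watches_per_user.most_common(1))     # most active user and their count
print(rarely_watched, "of", len(watches_per_program), "programs watched fewer than 3 times")
```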

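import pandas as pd
from surprise import (Reader, Dataset, SVD, SVDpp, SlopeOne, NMF, NormalPredictor,
                      KNNBaseline, KNNBasic, KNNWithMeans, KNNWithZScore, BaselineOnly)
from surprise.model_selection import cross_validate

reader = Reader(rating_scale=(0.03, 1.0))
data = Dataset.load_from_df(df_data[["user_id", "show_id", "average_watching"]], reader)

benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(),
                  KNNWithMeans(), KNNWithZScore(), BaselineOnly()]:
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=["RMSE"], cv=3, verbose=False)

    # Average the results over the folds & append the algorithm name
    # (Series.append was removed in pandas 2.0, so we use pd.concat)
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__], index=["Algorithm"])])
    benchmark.append(tmp)

# Compare the algorithms, lowest RMSE first
pd.DataFrame(benchmark).set_index("Algorithm").sort_values("test_rmse")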

Train and Predict

The BaselineOnly algorithm gave us the best RMSE, so we will train and predict with BaselineOnly, using Alternating Least Squares (ALS) to estimate the baselines.


print("Using ALS")
bsl_options = {
    "method": "als",
    "random_state": 250,
    "n_epochs": 5,
    "reg_u": 12,
    "reg_i": 5,
}
algo = BaselineOnly(bsl_options)
cross_validate(algo, data, measures=["RMSE"], cv=3, verbose=False)

Using ALS
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...

Out:

{'test_rmse': array([0.32846883, 0.32829109, 0.32849049]),
 'fit_time': (0.05880260467529297, 0.04584360122680664, 0.06278657913208008),
 'test_time': (0.1285717487335205, 0.14949917793273926, 0.12757420539855957)}
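To make the ALS step less of a black box, here is a simplified sketch of how the baselines can be estimated. A baseline prediction is the global mean plus a user bias and an item bias, and ALS alternates between solving for the item biases with the user biases fixed and the reverse. The function names and toy ratings are our own, the default arguments simply reuse the values from bsl_options, and Surprise's actual implementation may differ in its details.

```python
from collections import defaultdict

def als_baselines(ratings, n_epochs=5, reg_u=12, reg_i=5):
    """Estimate the global mean and the user/item biases with ALS.

    `ratings` is a list of (user, item, rating) tuples.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user = defaultdict(list)
    by_item = defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    bu = defaultdict(float)  # user biases, starting at 0
    bi = defaultdict(float)  # item biases, starting at 0
    for _ in range(n_epochs):
        # Fix the user biases and solve for each item bias
        for i, rows in by_item.items():
            bi[i] = sum(r - mu - bu[u] for u, r in rows) / (reg_i + len(rows))
        # Fix the item biases and solve for each user bias
        for u, rows in by_user.items():
            bu[u] = sum(r - mu - bi[i] for i, r in rows) / (reg_u + len(rows))
    return mu, bu, bi

def baseline_estimate(mu, bu, bi, user, item):
    """Global mean + user bias + item bias; an unknown user or item contributes 0."""
    return mu + bu.get(user, 0.0) + bi.get(item, 0.0)

# Hypothetical watching ratios
toy = [("u1", "a", 1.0), ("u1", "b", 0.8), ("u2", "a", 0.9),
       ("u3", "b", 0.2), ("u3", "c", 0.1)]
mu, bu, bi = als_baselines(toy)
print(baseline_estimate(mu, bu, bi, "u2", "c"))
```

The regularization terms reg_u and reg_i shrink the biases of users and items with few ratings toward zero, which matters here because most programs were watched only a few times.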

We use train_test_split() to sample a trainset and a testset of given sizes, and use RMSE as the accuracy metric. We then call the fit() method, which trains the algorithm on the trainset, and the test() method, which returns the predictions made on the testset.

from surprise import accuracy
from surprise.model_selection import train_test_split

trainset, testset = train_test_split(data, test_size=0.25)
algo = BaselineOnly(bsl_options)
predictions = algo.fit(trainset).test(testset)
accuracy.rmse(predictions)

Out:

Estimating biases using als...
RMSE: 0.3295
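For reference, RMSE is the square root of the mean squared difference between the true and the estimated ratings. A small sketch of the computation, on invented (true, estimated) pairs:

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical (true, estimated) watching ratios
pairs = [(1.0, 0.7), (0.2, 0.5), (0.6, 0.6)]
print(round(rmse(pairs), 4))
```

Since our ratings are watching ratios between 0 and 1, an RMSE of about 0.33 means the predictions are typically off by roughly a third of the scale.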

To inspect our predictions in detail, we are going to build a pandas data frame with all the predictions.

User 13618’s behavior

Here we study the watching behavior of one particular user, whose ID is 13618.

# select the rows for user_id 13618
ratings = newdf.loc[newdf["user_id"] == 13618]
# keep only the columns we need for this user
ratings = ratings[["user_id", "show_id", "average_watching"]]
ratings

import numpy as np

# get the list of all show ids
unique_ids = newdf["show_id"].unique()
# get the list of the ids that user 13618 has watched
iids1001 = newdf.loc[newdf["user_id"] == 13618, "show_id"]
# remove the watched shows from the candidates for recommendation
movies_to_predict = np.setdiff1d(unique_ids, iids1001)

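Recommended Programs for User 13618

algo = BaselineOnly(bsl_options)
algo.fit(data.build_full_trainset())

my_recs = []
for iid in movies_to_predict:
    # Pass the user id with the same type as in the DataFrame. We assume it is an
    # integer, as the filter above suggests; the string '13618' would not match it
    # and the user would be treated as unknown.
    my_recs.append((iid, algo.predict(uid=13618, iid=iid).est))
pd.DataFrame(my_recs, columns=["iid", "predictions"]).sort_values("predictions", ascending=False).head(10)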
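The same "score everything the user has not seen, then keep the best" pattern can be wrapped in a small helper that works with any scoring function. This is only a sketch: recommend_top_n and the toy scores are hypothetical, and in our setting the scoring function would be something like lambda iid: algo.predict(uid=13618, iid=iid).est.

```python
def recommend_top_n(predict, all_items, seen_items, n=10):
    """Score every unseen item with `predict` and return the n best (item, score) pairs."""
    candidates = [item for item in all_items if item not in seen_items]
    scored = [(item, predict(item)) for item in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical scores standing in for a trained model's estimates
toy_scores = {"a": 0.9, "b": 0.4, "c": 0.7, "d": 0.8}
top = recommend_top_n(toy_scores.get, all_items=toy_scores, seen_items={"a"}, n=2)
print(top)  # [('d', 0.8), ('c', 0.7)]
```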