
USE CASE

Movie Recommendation System Using KNN

Aim:
To construct a movie recommendation system using K-Nearest Neighbors (KNN) to
provide personalized movie suggestions based on user preferences and interactions.
Algorithm:
Step 1: Data Collection and Preprocessing:
- Gather a dataset of movies with relevant information such as title, genre, director,
actors, and user ratings.
- Preprocess the data by handling missing values, encoding categorical variables, and
scaling numerical features if necessary.
Step 2: Feature Selection:
- Select relevant features for similarity calculation, such as genre, director, actors, and
user ratings.
- Transform the selected features into a suitable format for input to the KNN algorithm.
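Steps 1 and 2 can be sketched on a toy table as follows. This is only an illustration: the three movies, the pipe-separated `genres` column, and the `avg_rating` column are assumptions made for the example, not part of the dataset used in the program.

```python
import pandas as pd

# Hypothetical toy catalogue; the column names are assumptions for this sketch.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Comedy|Romance", "Action", None],
    "avg_rating": [3.5, 4.5, 2.5],
})

# Handle missing values: treat an unknown genre as its own category.
movies["genres"] = movies["genres"].fillna("Unknown")

# Encode the categorical genre field as one 0/1 column per genre.
genre_features = movies["genres"].str.get_dummies(sep="|")

# Scale the numerical feature to [0, 1] so it is comparable to the 0/1 columns.
r = movies["avg_rating"]
rating_scaled = (r - r.min()) / (r.max() - r.min())

# One feature vector per movie, ready to be passed to a KNN model.
features = genre_features.assign(avg_rating=rating_scaled)
features.index = movies["title"]
print(features)
```

Each row of `features` is the vector that the distance metric in Step 3 would compare.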
Step 3: Model Training:
- Instantiate a KNN model with a chosen number of neighbors (K) and a distance
metric (e.g., Euclidean distance).
- Fit the KNN model to the feature vectors of the movies in the dataset.
Step 4: Recommendation Generation:
- For a given target movie, calculate its distance to all other movies in the dataset using
the trained KNN model.
- Select the K nearest neighbors of the target movie based on their distances.
- Recommend these K nearest neighbors as the movies most similar to the target movie.
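Steps 3 and 4 can be illustrated with a brute-force search written in NumPy. This is a minimal sketch, not the program's implementation: the helper `k_nearest`, the four titles, and the feature values are hypothetical. For a brute-force KNN, "fitting" amounts to storing the feature matrix, so the sketch goes straight to the distance computation.

```python
import numpy as np

def k_nearest(features, target_idx, k):
    """Return indices and Euclidean distances of the k rows closest to the target row."""
    diffs = features - features[target_idx]
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    dists[target_idx] = np.inf  # never recommend the target movie itself
    order = np.argsort(dists)[:k]
    return order, dists[order]

# Hypothetical feature vectors: [comedy, action, scaled rating]
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
X = np.array([
    [1.0, 0.0, 0.8],
    [1.0, 0.0, 0.6],
    [0.0, 1.0, 0.9],
    [1.0, 1.0, 0.1],
])

idx, dist = k_nearest(X, target_idx=0, k=2)
recommended = [titles[i] for i in idx]
print(recommended)
```

With K = 2 and "Movie A" as the target, the two rows with the smallest distances are returned in increasing order of distance.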
Step 5: User Interaction and Feedback:
- Provide a user interface where users can input their preferences or interact with the
recommendation system.
- Allow users to provide feedback on recommended movies (e.g., rating, like/dislike).
- Incorporate user feedback into the recommendation process to improve future
recommendations.
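One simple way to incorporate feedback is to treat it as new rating rows and rebuild the movie-by-user matrix. The sketch below assumes feedback arrives as (userId, movieId, rating) rows and that a newer rating replaces an older one for the same user and movie; both assumptions, and the sample values, are illustrative only.

```python
import pandas as pd

# Existing ratings (hypothetical values).
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 3.0, 5.0],
})

# New feedback: user 1 re-rates movie 20, user 2 rates movie 30 for the first time.
feedback = pd.DataFrame({
    "userId": [1, 2],
    "movieId": [20, 30],
    "rating": [1.0, 4.5],
})

# Append the feedback and keep only the latest rating per (user, movie) pair.
updated = (
    pd.concat([ratings, feedback], ignore_index=True)
    .drop_duplicates(subset=["userId", "movieId"], keep="last")
)

# Rebuild the movie-by-user matrix that the KNN model is fitted on.
item_user = updated.pivot(index="movieId", columns="userId", values="rating").fillna(0)
print(item_user)
```

Refitting the KNN model on the rebuilt matrix lets later recommendations reflect the feedback.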
Step 6: Evaluation and Optimization:
- Evaluate the performance of the recommendation system using metrics such as
precision, recall, and user satisfaction.
- Fine-tune parameters such as the number of neighbors (K) and the choice of distance
metric based on evaluation results.
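The precision and recall mentioned in Step 6 can be computed over the top K recommendations. The function name `precision_recall_at_k` and the sample lists are hypothetical; "relevant" here is assumed to mean movies the user is known to have liked.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall computed over the first k recommended items."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ranked recommendations and the set of movies the user liked.
recommended = ["Movie B", "Movie D", "Movie C"]
relevant = {"Movie B", "Movie C", "Movie E", "Movie F"}

p, r = precision_recall_at_k(recommended, relevant, k=2)
print(p, r)
```

Repeating this calculation for several values of K and for each candidate distance metric gives a basis for the fine-tuning described above.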
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
ratings = pd.read_csv('ratings.csv')
ratings.head()
movies = pd.read_csv("movies.csv")
movies.head()
unique_user = ratings.userId.nunique(dropna = True)
unique_movie = ratings.movieId.nunique(dropna = True)
print("number of unique user:")
print(unique_user)
print("number of unique movies:")
print(unique_movie)
total_ratings = unique_user*unique_movie
rating_present = ratings.shape[0]
ratings_not_provided = total_ratings - rating_present
print("number of ratings not provided (users who have not rated some movies):")
print(ratings_not_provided)
print("sparsity of user-item matrix is :")
print(ratings_not_provided / total_ratings)
rating_cnt = pd.DataFrame(ratings.groupby('rating').size(),columns=['count'])
rating_cnt
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
rating_cnt = pd.concat([rating_cnt, pd.DataFrame({'count': ratings_not_provided}, index=[0.0])]).sort_index()
rating_cnt
rating_cnt['log_count'] = np.log(rating_cnt['count'])
rating_cnt
rating_cnt_for_vis = rating_cnt
# name the index explicitly so the column exists whatever the index was called
ax = rating_cnt_for_vis.rename_axis('rating_value').reset_index().plot(
    x='rating_value',
    y='count',
    logy=True,
    kind='bar',
    title='count for each rating in log scale',
    figsize=(12,6))
ax.set_xlabel('rating_value')
ax.set_ylabel('count of each rating')
print("ratings of 3 and 4 are more frequent than the other ratings")
movie_freq = pd.DataFrame(ratings.groupby('movieId').size(),columns=['count'])
movie_freq.head()
movie_freq_copy = movie_freq.sort_values(by='count',ascending=False)
movie_freq_copy=movie_freq_copy.reset_index(drop=True)
ax1 = movie_freq_copy.plot(
    title='rating frequency of movie', logy=True, figsize=(12,8))
ax1.set_xlabel('number of movies')
ax1.set_ylabel('rating freq of movies')
threshold_rating_freq = 10
popular_movies_id = list(set(movie_freq.query('count>=@threshold_rating_freq').index))
ratings_with_popular_movies = ratings[ratings.movieId.isin(popular_movies_id)]
print('shape of ratings:')
print(ratings.shape)
print('shape of ratings_with_popular_movies:')
print(ratings_with_popular_movies.shape)
print("no of movies which are rated at least", threshold_rating_freq, "times:")
print(len(popular_movies_id))
print("no of unique movies present in dataset:")
print(unique_movie)
user_cnt = pd.DataFrame(ratings.groupby('userId').size(),columns=['count'])
user_cnt_copy = user_cnt
user_cnt.head()
ax = user_cnt_copy.sort_values('count',ascending=False).reset_index(drop=True).plot(
title='rating freq by user',figsize=(12,6),)
ax.set_xlabel("users")
ax.set_ylabel("rating frequency")
threshold_val = 30
active_user = list(set(user_cnt.query('count>=@threshold_val').index))
ratings_with_popular_movies_with_active_user = ratings_with_popular_movies[
    ratings_with_popular_movies.userId.isin(active_user)]
print('shape of ratings_with_popular_movies:')
print(ratings_with_popular_movies.shape)
print('shape of ratings_with_popular_movies_with_active_user:')
print(ratings_with_popular_movies_with_active_user.shape)
print("unique_user:")
print(unique_user)
print("active_user")
print(len(active_user))
print("unique_movies")
print(unique_movie)
print("popular_movies")
print(len(popular_movies_id))
print("sparsity of final ratings df:")
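# computed from the filtered data; the original listing hard-coded
# (428*2269 - 76395)/(428*2269)
n_users_final = ratings_with_popular_movies_with_active_user.userId.nunique()
n_movies_final = ratings_with_popular_movies_with_active_user.movieId.nunique()
n_ratings_final = ratings_with_popular_movies_with_active_user.shape[0]
print((n_users_final*n_movies_final - n_ratings_final)/(n_users_final*n_movies_final))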
final_ratings = ratings_with_popular_movies_with_active_user
item_user_mat = final_ratings.pivot(index='movieId', columns='userId', values='rating').fillna(0)
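The listing stops once the movie-by-user matrix is built, so the model fitting and lookup that would produce the output are not shown. The following is one possible sketch of that last stage using scikit-learn's NearestNeighbors, not the original code. The cosine metric, the `recommend` helper, and the small stand-in tables are all assumptions made so that the block runs on its own.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Tiny stand-ins so the sketch runs on its own; in the program these would be
# the real `movies` table and the `item_user_mat` matrix.
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
})
item_user_mat = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [4.0, 5.0, 0.0],
     [0.0, 0.0, 5.0],
     [5.0, 0.0, 1.0]],
    index=pd.Index([1, 2, 3, 4], name="movieId"),
    columns=[10, 20, 30],
)

# Assumption: cosine distance, which compares rating patterns across users.
knn = NearestNeighbors(metric="cosine", algorithm="brute")
knn.fit(item_user_mat.values)

def recommend(title, k=2):
    """Return the titles of the k movies whose rating vectors are closest."""
    movie_id = movies.loc[movies["title"] == title, "movieId"].iloc[0]
    row = item_user_mat.index.get_loc(movie_id)
    # Ask for k + 1 neighbours because the query movie matches itself.
    _, idx = knn.kneighbors(item_user_mat.values[row:row + 1], n_neighbors=k + 1)
    neighbour_ids = [item_user_mat.index[i] for i in idx[0] if i != row][:k]
    return movies.set_index("movieId").loc[neighbour_ids, "title"].tolist()

print("Viewer who watches this movie Movie A also watches following movies.")
for t in recommend("Movie A"):
    print(t)
```

On the real data, `k` would be set to the number of recommendations wanted (the output lists ten titles).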
Output:
system is working....
Viewer who watches this movie Father of the Bride Part II also watches following movies.
Sabrina (1995)
Miracle on 34th Street (1994)
Striptease (1996)
Juror, The (1996)
Mr. Holland's Opus (1995)
Sgt. Bilko (1996)
Twister (1996)
Grumpier Old Men (1995)
Tin Cup (1996)
Willy Wonka & the Chocolate Factory (1971)

Result:
Thus, the movie recommendation system using K-Nearest Neighbors (KNN) has been
successfully implemented.
