
Assignment - DSML http://localhost:8889/nbconvert/html/Desktop/Assignment - DSML.ipyn...

Assignment_1 - DSML
Priyanshu Jain | MBA19218 |

Date - 20/08/2020

We start by importing the libraries the project requires: pandas to handle dataframes, seaborn to create plots, matplotlib to
adjust certain plot elements, and scipy to calculate statistics.

In [1]: import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

Ques - 1
How many records are present in the dataset? Print the metadata information of the dataset.

We first load the data from the bollywood.csv file into the bollywood variable. We then check it for null values and look at
the number of records and the type of each column.

In [2]: bollywood = pd.read_csv(r'C:\Users\hp\Desktop\bollywood.csv')
bollywood.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 10 columns):
SlNo                   149 non-null int64
Release Date           149 non-null object
MovieName              149 non-null object
ReleaseTime            149 non-null object
Genre                  149 non-null object
Budget                 149 non-null int64
BoxOfficeCollection    149 non-null float64
YoutubeViews           149 non-null int64
YoutubeLikes           149 non-null int64
YoutubeDislikes        149 non-null int64
dtypes: float64(1), int64(5), object(4)
memory usage: 11.8+ KB

Ans - 1
The output above shows that no values are missing: the dataset has 149 records and each of the 10 columns has 149 non-null
entries. SlNo, Budget, YoutubeViews, YoutubeLikes and YoutubeDislikes are integers. Release Date, MovieName, ReleaseTime
and Genre are object types, and BoxOfficeCollection is a float.
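
The same checks can also be computed directly rather than read off the info() printout. The following is a minimal sketch on a hand-made three-row frame; the `sample` variable and the deliberately missing value in it are illustrative assumptions, not part of the real dataset.

```python
import pandas as pd

# Hypothetical three-row sample; the None is inserted on purpose
# so that the null check has something to find.
sample = pd.DataFrame({
    'MovieName': ['2 States', 'Table No. 21', 'Bobby Jasoos'],
    'Budget': [36, 10, 18],
    'BoxOfficeCollection': [104.0, None, 10.8],
})

n_records = len(sample)               # number of rows
null_counts = sample.isnull().sum()   # missing values per column
column_types = sample.dtypes          # dtype of each column

print(n_records)
print(null_counts)
print(column_types)
```

On the real data, applying the same three expressions to bollywood should agree with what info() reports.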

To get an overview of the columns and the data in them, we print the first five rows.

1 of 9 16-08-2020, 14:22

In [3]: bollywood.head(5)

Out[3]:
   SlNo Release Date           MovieName ReleaseTime     Genre  Budget  BoxOfficeCollection  YoutubeViews  YoutubeLikes
0     1    18-Apr-14            2 States          LW   Romance      36               104.00       8576361         26622
1     2     4-Jan-13        Table No. 21           N  Thriller      10                12.00       1087320          1129
2     3    18-Jul-14  Amit Sahni Ki List           N    Comedy      10                 4.00        572336           586
3     4     4-Jan-13    Rajdhani Express           N     Drama       7                 0.35         42626
4     5     4-Jul-14        Bobby Jasoos           N    Comedy      18                10.80       3113427          4512

Ques - 2
Which month of the year, maximum number movie releases are seen? (Note: Extract a new column called month
from ReleaseDate column.) Do a barplot.

I extracted the month from the Release Date column with a lambda function that splits each date string on '-'. I then used
groupby to group the data by Month and counted the movies released in each month.

In [4]: bollywood['Month'] = bollywood['Release Date'].apply(lambda x: x.split('-')[1])
movies_by_month = bollywood.groupby('Month')['MovieName'].count().reset_index()
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
sn.barplot(x='Month', y='MovieName', data=movies_by_month, palette='Set2')

Out[4]: <matplotlib.axes._subplots.AxesSubplot at 0x2902d5fb8c8>


Ans - 2
January has the most movie releases (20), followed by March and May. December has the fewest releases.
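
Because Month is stored as plain text, groupby returns the months in alphabetical order rather than calendar order. One possible way to get calendar order is an ordered categorical. This is a sketch using only the five release dates shown in the head() output above, and the names `sample`, `month_order`, `releases_per_month` and `busiest_month` are illustrative.

```python
import pandas as pd

# Release dates copied from the first five rows shown earlier.
sample = pd.DataFrame({
    'Release Date': ['18-Apr-14', '4-Jan-13', '18-Jul-14', '4-Jan-13', '4-Jul-14'],
})

month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Same split as in the notebook, but stored as an ordered categorical
# so that groupby returns the months in calendar order.
months = sample['Release Date'].str.split('-').str[1]
sample['Month'] = pd.Categorical(months, categories=month_order, ordered=True)

# observed=False keeps months with zero releases in the result.
releases_per_month = sample.groupby('Month', observed=False).size()
busiest_month = releases_per_month.idxmax()

print(releases_per_month)
print(busiest_month)
```

When two months tie, idxmax returns the first one in the index, so the ordering chosen here also decides which month is reported.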

Ques - 3
Which are the top 10 movies with maximum return on investment (ROI)? Calculate return on investment (ROI) as
(BoxOfficeCollection – Budget) / Budget. Draw any Plot.

We calculate the ROI of each movie with the equation above, expressed as a percentage. We then extract the 10 highest ROI
values and plot them in a barplot.

In [5]: bollywood['ROI'] = (bollywood['BoxOfficeCollection'] - bollywood['Budget']) / bollywood['Budget'] * 100

In [6]: high_roi = bollywood.nlargest(10, 'ROI').reset_index()
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
sn.barplot(x='MovieName', y='ROI', data=high_roi, palette='Set2')
plt.xticks(rotation=30)

Out[6]: (array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), <a list of 10 Text xticklabel objects>)

Ans - 3
The movies with the highest ROI are Aashiqui 2, PK, Grand Masti, The Lunchbox, Fukrey, Mary Kom, Shahid, Humpty Sharma Ki
Dulhania, Bhaag Milkha Bhaag and Chennai Express. Six of these movies were released between July and September, the
second quarter of an April-to-March financial year. Drama is the most common genre among them.
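
As a small worked example of the ROI formula, the sketch below applies it to the five rows printed by head() earlier. The `sample` and `top_two` names are illustrative, and the result says nothing about the full top-10 list.

```python
import pandas as pd

# Budget and BoxOfficeCollection copied from the first five rows shown earlier.
sample = pd.DataFrame({
    'MovieName': ['2 States', 'Table No. 21', 'Amit Sahni Ki List',
                  'Rajdhani Express', 'Bobby Jasoos'],
    'Budget': [36, 10, 10, 7, 18],
    'BoxOfficeCollection': [104.00, 12.00, 4.00, 0.35, 10.80],
})

# ROI in percent: (collection - budget) / budget * 100
sample['ROI'] = (sample['BoxOfficeCollection'] - sample['Budget']) / sample['Budget'] * 100

# nlargest returns the rows with the highest ROI, already sorted in descending order.
top_two = sample.nlargest(2, 'ROI')[['MovieName', 'ROI']].round(2)
print(top_two)
```

For 2 States this gives (104 - 36) / 36 * 100, roughly 188.89, while a movie that earned less than its budget gets a negative ROI.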

Ques - 4
Which Genre Has The Highest Release Of Movies? Do a barplot.


In [7]: genre_movie_count = bollywood.groupby('Genre').MovieName.count().reset_index()
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
sn.barplot(x='Genre', y='MovieName', data=genre_movie_count, palette='Set2')

Out[7]: <matplotlib.axes._subplots.AxesSubplot at 0x2902ce36d48>

Ans - 4
Comedy has the highest number of movie releases, 36 in total. Drama is close behind with 35.

Ques - 5
How many movies in each release times like long weekend, festive season, etc. got released? Do a barplot.

In [8]: weekend_movie_count = bollywood.groupby('ReleaseTime').MovieName.count().reset_index()
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
sn.barplot(x='ReleaseTime', y='MovieName', data=weekend_movie_count, palette='Set2')

Out[8]: <matplotlib.axes._subplots.AxesSubplot at 0x2902eaea0c8>


Ans - 5
Most movies were released on normal days. This could also explain why December has the fewest releases, since it falls in
the Christmas and New Year period.
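
One way to examine the suggested link between month and release time is to cross-tabulate the two columns. The sketch below uses only the five rows shown by head() earlier, so it illustrates the technique and does not confirm or refute the December explanation. The `sample` and `month_by_release_time` names are illustrative.

```python
import pandas as pd

# Release Date and ReleaseTime copied from the first five rows shown earlier.
sample = pd.DataFrame({
    'Release Date': ['18-Apr-14', '4-Jan-13', '18-Jul-14', '4-Jan-13', '4-Jul-14'],
    'ReleaseTime': ['LW', 'N', 'N', 'N', 'N'],
})
sample['Month'] = sample['Release Date'].str.split('-').str[1]

# Rows are months, columns are release-time categories, cells are movie counts.
month_by_release_time = pd.crosstab(sample['Month'], sample['ReleaseTime'])
print(month_by_release_time)
```

On the full dataset, the December row of this table would show how its releases are split across the release-time categories.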

Ques - 6
How many movies got released in each genre? Which genre had highest number of releases? Sort number of
releases in each genre in descending order. Do a barplot.

In [9]: genre_movie_count_sorted = genre_movie_count.nlargest(5, 'MovieName')
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
sn.barplot(x='Genre', y='MovieName', data=genre_movie_count_sorted, palette='Set2')

Out[9]: <matplotlib.axes._subplots.AxesSubplot at 0x2902eb5b9c8>

Ans - 6
The Comedy genre had the highest number of releases, 36. The other genres, Drama, Thriller, Romance and Action, had 35, 29,
25 and 24 releases respectively.
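
nlargest(5, ...) returns every genre only if there are at most five of them. A sketch of an alternative that sorts all genres in descending order without fixing that number is shown below, again on the five head() rows; the `sample` and `genre_counts` names and the 'Releases' column label are illustrative.

```python
import pandas as pd

# MovieName and Genre copied from the first five rows shown earlier.
sample = pd.DataFrame({
    'MovieName': ['2 States', 'Table No. 21', 'Amit Sahni Ki List',
                  'Rajdhani Express', 'Bobby Jasoos'],
    'Genre': ['Romance', 'Thriller', 'Comedy', 'Drama', 'Comedy'],
})

# Count movies per genre, then sort every genre from most to fewest releases.
genre_counts = (
    sample.groupby('Genre')['MovieName']
    .count()
    .sort_values(ascending=False)
    .reset_index(name='Releases')
)
print(genre_counts)
```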

Ques - 7
Calculate the average ROI for different Release Date. Which month made the highest average ROI? Do a barplot.


In [10]: month_avg_roi = bollywood.groupby('Month').ROI.mean().reset_index()
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
sn.barplot(x='Month', y='ROI', data=month_avg_roi, palette='Set2')

Out[10]: <matplotlib.axes._subplots.AxesSubplot at 0x2902ebdb1c8>

Ans - 7
Movies released in December have the highest mean ROI, 364.3%, followed by September with 261.04%. January and November
have the lowest mean ROI, close to 0%.
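
A monthly mean can be dominated by a single movie when the month has few releases, so it may help to look at the median and the number of movies alongside it. The sketch below does this for the five head() rows; the `sample` and `roi_by_month` names are illustrative and the numbers are not the full-dataset results.

```python
import pandas as pd

# Values copied from the first five rows shown earlier.
sample = pd.DataFrame({
    'Release Date': ['18-Apr-14', '4-Jan-13', '18-Jul-14', '4-Jan-13', '4-Jul-14'],
    'Budget': [36, 10, 10, 7, 18],
    'BoxOfficeCollection': [104.00, 12.00, 4.00, 0.35, 10.80],
})
sample['Month'] = sample['Release Date'].str.split('-').str[1]
sample['ROI'] = (sample['BoxOfficeCollection'] - sample['Budget']) / sample['Budget'] * 100

# Mean, median and number of movies per month in one table.
roi_by_month = sample.groupby('Month')['ROI'].agg(['mean', 'median', 'count'])
print(roi_by_month.round(2))
```

In this small sample, April ranks first on mean ROI with only one movie behind it, which is the kind of situation the count column makes visible.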

Ques - 8
Draw a histogram plot to find out the distribution of movie budgets.

seaborn.distplot draws a histogram of the budgets. Setting the kernel density estimate (kde) to True overlays a density curve
that outlines the shape of the distribution.

In [11]: sn.distplot(bollywood.Budget,kde=True,bins=15,color = 'Red')

Out[11]: <matplotlib.axes._subplots.AxesSubplot at 0x2902eca3ac8>


Ans - 8
The distribution of budgets is right-skewed, and most movies have budgets in the range 0-50.
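
Right skew can also be checked numerically: the sample skewness is positive and the mean lies above the median. The following is a sketch on the five budgets shown by head() earlier, with illustrative variable names; the values for the full dataset will differ.

```python
import pandas as pd

# Budgets copied from the first five rows shown earlier.
budgets = pd.Series([36, 10, 10, 7, 18], name='Budget')

skewness = budgets.skew()        # sample skewness; positive means a longer right tail
mean_budget = budgets.mean()
median_budget = budgets.median()

print(skewness, mean_budget, median_budget)
```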

Ques - 9
Which genre of movies typically sees more YouTube views? Draw boxplots for each genre of movies to compare.

In [12]: genre_youtube_views = bollywood.groupby('Genre').YoutubeViews.sum().reset_index()
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
sn.barplot(x='Genre', y='YoutubeViews', data=genre_youtube_views, palette='Set2')

Out[12]: <matplotlib.axes._subplots.AxesSubplot at 0x2902ed40288>

Ans - 9
Action movies have the most YouTube views in total, nearly 136.6 million, and Romance has the fewest, about 86.8 million.
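
Total views depend on how many movies each genre contains, whereas the question asks which genre typically sees more views. A boxplot compares medians and quartiles per genre, and those statistics can be sketched as below on the five head() rows. The `sample` and `views_by_genre` names are illustrative, and the outcome for the full dataset is not claimed here.

```python
import pandas as pd

# Genre and YoutubeViews copied from the first five rows shown earlier.
sample = pd.DataFrame({
    'Genre': ['Romance', 'Thriller', 'Comedy', 'Drama', 'Comedy'],
    'YoutubeViews': [8576361, 1087320, 572336, 42626, 3113427],
})

# Sum rewards genres with many movies; median describes a typical movie.
views_by_genre = sample.groupby('Genre')['YoutubeViews'].agg(['sum', 'median', 'count'])
print(views_by_genre)
```

The plot itself could be drawn with sn.boxplot(x='Genre', y='YoutubeViews', data=bollywood), which shows the same medians together with the spread.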

Ques - 10
Is there a correlation between box office collection and YouTube likes? Is the correlation positive or negative?

In [13]: plot = sn.pairplot(data=bollywood, x_vars='BoxOfficeCollection', y_vars='YoutubeLikes', palette='Set2')


Using scipy.stats.pearsonr we can calculate the Pearson correlation coefficient and its p-value for the two variables.

In [14]: corr, p = pearsonr(bollywood.BoxOfficeCollection, bollywood.YoutubeLikes)
print('Correlation between BoxOfficeCollection & YoutubeLikes: ' + str(corr) + ' ' + str(p))

Correlation between BoxOfficeCollection & YoutubeLikes: 0.6825165877731298 9.218436382353977e-22

Ans - 10
The Pearson correlation between BoxOfficeCollection and YoutubeLikes is 0.682, so the correlation is positive. The p-value is
9.2e-22, so the correlation is statistically significant.
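
To show what the coefficient measures, the sketch below computes Pearson's r from its definition (the sum of products of deviations divided by the square root of the product of the summed squared deviations) and compares it with numpy's corrcoef. It uses the four head() rows in which both values are visible, so the resulting number is not the 0.682 reported for the full dataset. The variable names are illustrative.

```python
import numpy as np

# BoxOfficeCollection and YoutubeLikes for the four rows above where both are shown.
box_office = np.array([104.00, 12.00, 4.00, 10.80])
likes = np.array([26622, 1129, 586, 4512], dtype=float)

# Deviations from the mean of each variable.
dx = box_office - box_office.mean()
dy = likes - likes.mean()

# Pearson's r from its definition.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity from numpy's correlation matrix.
r_numpy = np.corrcoef(box_office, likes)[0, 1]

print(r_manual, r_numpy)
```

A positive r means that movies with higher collections tend to have more likes, and the sign is what answers the second part of the question.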

Ques - 11
Which of the variables among Budget, BoxOfficeCollection, YoutubeView, YoutubeLikes, YoutubeDislikes are highly
correlated? Note: Draw pair plot or heatmap.


In [15]: sn.pairplot(data=bollywood[['Budget','BoxOfficeCollection','YoutubeViews','YoutubeLikes','YoutubeDislikes']])

Out[15]: <seaborn.axisgrid.PairGrid at 0x2902ee3de88>

Ans - 11
In the pair plot above, the strongest correlations appear among YoutubeViews, YoutubeLikes and YoutubeDislikes. Budget is
also well correlated with BoxOfficeCollection.
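
Reading correlations off a pair plot is approximate, so the correlation matrix can be used to rank the pairs numerically. The sketch below runs on made-up numbers: the values in `sample` are hypothetical and chosen only to illustrate the calculation, and `strongest_pair` is an illustrative name.

```python
from itertools import combinations

import pandas as pd

# Hypothetical values, NOT taken from the dataset.
sample = pd.DataFrame({
    'Budget': [5, 3, 8, 1, 9],
    'YoutubeViews': [10, 20, 30, 40, 50],
    'YoutubeLikes': [1, 2, 3, 4, 6],
})

# Pairwise Pearson correlations between all numeric columns.
corr = sample.corr()

# Collect each distinct pair once and pick the one with the largest absolute correlation.
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
strongest_pair = max(pairs, key=lambda pair: abs(pairs[pair]))

print(corr.round(3))
print(strongest_pair)
```

On the real data, corr could be computed from the same five columns passed to the pair plot and displayed with sn.heatmap(corr, annot=True).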

Thank You!!!
