
Insights of the Cinema Industry

in the period from 1990 to 2019

Instructor: Dr. Nguyen Viet Anh
Nguyễn Tuấn Ngọc – HE141341
Phan Sỹ Kiên – HE140040
Nguyễn Nam – HE140040
Vũ Hồng Nam – HE140040
Nguyễn Duy Khánh – HE130954

Description

Data Collecting And Processing

Data Collecting
Data Processing

Get A Look At The Dataset

Descriptive Measures
Distribution

Let's “Watch” The Movie

Diversity In The Cinema Industry
Diversity comes from content
Diversity comes from genre
Construct 95% confidence interval for action movie population proportion that won awards.
Construct 95% confidence interval for biography movie population proportion that won awards.
Test on the Difference in Population Proportions of action and biography movies that won awards.
Diversity comes from production country
The most of
Most Popular Movies by Popularity Score
Most Voted on Movies
Most Critically Acclaimed Movies
Most Successful Directors
The More Comedies Decline, The More Actions Thrive
The trend
Construct the 95% confidence interval of the mean worldwide gross of the action movies population.
Construct the 95% confidence interval of the mean worldwide gross of the comedy movies population.
Test on the difference in the mean of the popularity of the action and comedy movies population.
All about money!
Theater audiences are more interested in action movies
Fierce competition from other genres
Box Office profit Is Influenced By Many Factors
The numbers
Genres
MPA

Conclusion
Unanswered questions
Knowledge earned

1. Description
After more than 100 years of rapid development, cinema has grown from a
simple new form of entertainment into an art form and one of the most
important tools of mass communication in modern society.
With great interest, our team wanted to analyze how the film industry has
developed and learn about its trends, especially in the period from 1990 to
2019. This was a golden era of cinema in which many new technologies were
applied, such as 3D, 4D, Motion Capture, IMAX, VFX, and StageCraft, and box
office records were continuously broken and replaced.
We collected a dataset containing basic information (title, director, year,
runtime, genre, imdb_rating, metascore_rating, vote, mpap, keyword, budget,
country, language, released_date, award_win, award_nomination, profit,
wolrdwide_gross, us_gross, international_gross) about 12,146 movies.
After going through the dataset, and driven by our longstanding curiosity,
we have the following questions about the cinema industry:
● How diverse are movies?
○ What are the most popular contents in movies?
○ How many genres of movies have been made? Which genre is
the most popular?
○ Which countries produce the most movies?
○ Which movies are more popular?
○ Which movies have the most votes?
○ Which directors are the most successful?
● Which genre of film is taking the throne at the box office?
○ Which genres are growing in both number of movies and sales,
and which are declining?
○ What is the cause of that increase and decrease?
● What Affects Box Office profit?
○ How do domestic sales, vote, imdb, award win affect global profit?
○ Does the genre of the movie affect its profit?
○ Which MPAP rating is the most profitable?
○ Are these features enough to predict a movie's profit?
Through specific analysis and visual charts, we will have a more specific look
at cinema in many different aspects and get answers for all questions above.

2. Data Collecting And Processing

2.1. Data Collecting
We collected data on most movies that were released internationally between
1990 and 2019. The data includes basic information about each movie and its
box office profit. After some research, we found two websites that provide
this information, IMDb and Box Office Mojo:
● IMDb: title, director, year, runtime, genre, imdb_rating,
metascore_rating, vote, mpap, keyword, budget, country, language,
released_date, award_win, award_nomination.
● Box Office Mojo: wolrdwide_gross, us_gross, international_gross.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
It works with your favorite parser to provide idiomatic ways of navigating,
searching, and modifying the parse tree. We use this library to make it more
convenient to collect data from the above two sites.
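
As a rough illustration of how Beautiful Soup pulls fields out of a page, here is a minimal sketch. The HTML fragment, its class names and the `parse_movie` helper are hypothetical; the real IMDb and Box Office Mojo pages use different markup, and this is not the exact collection script used for the project.

```python
from bs4 import BeautifulSoup

# HYPOTHETICAL HTML fragment; the real pages are structured differently.
SAMPLE_HTML = """
<div class="movie">
  <h1 class="title">Example Movie</h1>
  <span class="year">1999</span>
  <span class="runtime">120 min</span>
  <ul class="genres"><li>Action</li><li>Drama</li></ul>
</div>
"""

def parse_movie(html):
    """Extract a few fields from one movie page into a dict."""
    soup = BeautifulSoup(html, "html.parser")
    card = soup.find("div", class_="movie")
    runtime_text = card.find("span", class_="runtime").get_text(strip=True)
    return {
        "title": card.find("h1", class_="title").get_text(strip=True),
        "year": int(card.find("span", class_="year").get_text(strip=True)),
        "runtime": int(runtime_text.split()[0]),  # "120 min" -> 120
        "genre": [li.get_text(strip=True) for li in card.select("ul.genres li")],
    }

movie = parse_movie(SAMPLE_HTML)
print(movie)
```

In practice the HTML string would come from an HTTP response for each movie page, and the resulting dicts would be collected into one table.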

2.2. Data Processing

The raw data cannot be used for analysis directly because some movies have
blank fields such as runtime, director, domestic_gross, budget,... Therefore
we have to clean the data:
● replace NaN values of “domestic_gross” with 0
● replace NaN values of “budget” with 0
● replace NaN values of “year” with 1 January of the released year
● replace NaN values of “keyword” with the movie’s title
● replace NaN values of “language” with English
● replace NaN values of “metascore” with the mean metascore of all
movies which have the same genre in the same year
● drop rows with a NaN value of “runtime”
● drop rows with a NaN value of “director”
● define a new feature “score” calculated from metascore and imdb
● define a new feature “popularity” using IMDb's weighted rating
● define a new feature “profit” from worldwide_gross and budget
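
The cleaning steps above could be sketched in pandas roughly as follows. The sample rows are hypothetical, and three formulas are assumptions because the list does not spell them out: `score` as the average of metascore and imdb on a 0-100 scale, the vote threshold `m` in the weighted rating as the median vote count, and `profit` as worldwide gross minus budget.

```python
import numpy as np
import pandas as pd

# HYPOTHETICAL sample rows; the real dataset has 12,146 movies.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Action", "Action", "Comedy", "Comedy"],
    "year": [2000, 2000, 2001, 2001],
    "director": ["X", "Y", None, "Z"],
    "runtime": [120.0, 95.0, 100.0, np.nan],
    "imdb": [8.0, 6.0, 7.0, 5.0],
    "metascore": [80.0, np.nan, 60.0, 50.0],
    "vote": [1000, 200, 500, 50],
    "budget": [50.0, np.nan, 10.0, 5.0],
    "keyword": ["hero", None, "love", "road"],
    "language": ["English", None, "French", "English"],
    "domestic_gross": [100.0, np.nan, 20.0, 5.0],
    "worldwide_gross": [300.0, 40.0, 30.0, 8.0],
})

# Fill missing values as described in the list.
df["domestic_gross"] = df["domestic_gross"].fillna(0)
df["budget"] = df["budget"].fillna(0)
df["keyword"] = df["keyword"].fillna(df["title"])
df["language"] = df["language"].fillna("English")
# Mean metascore of movies with the same genre in the same year.
df["metascore"] = df["metascore"].fillna(
    df.groupby(["genre", "year"])["metascore"].transform("mean")
)

# Drop rows that have no runtime or no director.
df = df.dropna(subset=["runtime", "director"]).reset_index(drop=True)

# ASSUMPTION: score = mean of metascore and imdb rescaled to 0-100.
df["score"] = (df["metascore"] + df["imdb"] * 10) / 2

# IMDb-style weighted rating: WR = v/(v+m) * R + m/(v+m) * C
# ASSUMPTION: m is the median vote count; C is the mean rating.
m = df["vote"].median()
C = df["imdb"].mean()
v = df["vote"]
df["popularity"] = v / (v + m) * df["imdb"] + m / (v + m) * C

# ASSUMPTION: profit = worldwide gross minus budget.
df["profit"] = df["worldwide_gross"] - df["budget"]
```

Note that filling a missing budget with 0 makes the computed profit equal to the worldwide gross for those movies, which is worth keeping in mind when reading the profit charts.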

3. Get A Look At The Dataset

3.1. Descriptive Measures
Min, max, mode, count, standard deviation, Q1, Q2, Q3 of all numeric columns.
Pandas describe() is used to view some basic statistical details like
percentile, mean, min, max, standard deviation, the number of values of a
dataframe or a series of numeric values.

Pandas describe() also shows the number of values, the number of unique
values, the mode and its frequency for non-numeric columns.
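
A small sketch of both uses of describe(), on a hypothetical four-row table (the column names are illustrative only):

```python
import pandas as pd

# HYPOTHETICAL rows for illustration.
df = pd.DataFrame({
    "genre": ["Action", "Comedy", "Action", "Drama"],
    "runtime": [120, 95, 130, 110],
    "imdb": [7.5, 6.0, 8.1, 7.0],
})

# Numeric columns: count, mean, std, min, 25%, 50%, 75%, max.
numeric_summary = df.describe()

# A non-numeric column: count, unique, top (the mode), freq.
genre_summary = df["genre"].describe()

print(numeric_summary)
print(genre_summary)
```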
3.2. Distribution
Distribution of imdb, metascore, score, popularity, vote, year, runtime, award
win, award nomination of dataset
We use the 'distplot' function of seaborn. This function provides access to
several approaches for visualizing the univariate or bivariate distribution of
data, including subsets of data defined by semantic mapping and faceting
across multiple subplots.

4. Let's “Watch” The Movie

4.1. Diversity In The Cinema Industry
4.1.1. Diversity comes from content
Film is like any other art form: it was born to reflect reality as well as
human thoughts, desires and wishes. A good movie is one where viewers can see
themselves in the character, understand the motivation, and empathize with
what the characters are going through.

Here is the word cloud of words in the titles of the movies. To draw these
word clouds, we use the WordCloud() function of the wordcloud library.
The word Love is the most commonly used word in movie titles. Man and girl
are also among the most commonly occurring words. This encapsulates the
ubiquitous presence of romance in movies pretty well.
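
A word cloud sizes each word by how often it appears, so the underlying computation is a word count. The sketch below does that count with the standard library only; the titles and the stop-word list are hypothetical, and the rendering step with WordCloud() is not shown.

```python
import re
from collections import Counter

# HYPOTHETICAL titles and stop words, for illustration only.
titles = [
    "Love Actually",
    "The Man Who Loved",
    "Love and Basketball",
    "Gone Girl",
    "Man on Fire",
]
STOP_WORDS = {"the", "and", "on", "who", "a", "of"}

def word_frequencies(texts, stop_words):
    """Count lower-cased words across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts

freq = word_frequencies(titles, STOP_WORDS)
print(freq.most_common(2))
```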

Next is the word cloud of words in the keywords of the movies.

Love, friendship, relationship, woman, male, girl, husband, wife, father, son
and marriage are all words that appear a lot, expressing a very popular
emotional theme in movies. In addition, sensational topics such as murder,
death and drugs also attract a lot of attention. This comes from both
subjective and objective factors.
Humans have always longed for healthy relationships with the people around
them, and love movies give us hope of finding love of our own. There is also a
scientific reason that people fall for a good love story: oxytocin, a.k.a. the
love hormone. Oxytocin is released into our bloodstream when we hear a
well-told story, and our brains react as if we were experiencing it ourselves.
Movies about murder attract us because the murder and crime genres give
people a glimpse into the minds of people who have committed crimes. We're
drawn to the tension between good and evil, and crime and murder movies
embody our fascination with that dynamic. Besides, learning about crime and
murder may simply appeal to our survival instinct, as a subconscious way of
preparing for real-life threats.

4.1.2. Diversity comes from genre

Diverse content leads to diverse movie genres. Over the past 30 years, the
movies produced and released globally fall into 20 major genres: Biography,
Comedy, Action, Drama, Crime, Adventure, Mystery, Horror, Animation, Sci-Fi,
Thriller, Fantasy, Romance, Western, Family, Music, Musical, War, History,
and Sport.

Comedy, Drama and Action are the three most produced film genres,
accounting for about 72% of the films produced between 1990 and 2019.
Drama and Comedy account for such a large number because the two genres
are easy to relate to: their content is about human-to-human relationships.
The action genre, meanwhile, gives viewers moments of relaxation. An action
movie creates a scenario of stress, but this stress happens under complete
control and is short-lived, which is something people enjoy. Viewers feel
that they are traveling with the hero, sharing the same pain, happiness and
excitement, and they become one with the hero in their imagination.
The variety of film genres exists because people have different interests and
different degrees of love for each genre of film. The following is the box
plot for the popularity of movies by genre from 1990 to 2019.

Biography, Drama, and Animation are the three genres whose popularity is
concentrated around a higher median than other genres, showing that films in
these three genres are slightly more popular. The common point of these three
genres is that their content reflects highly dramatic issues. Not only that,
animation also allows more creativity in storytelling.

The variation of the Music, Musical and Fantasy genres is relatively large,
showing the uneven quality of movies in these genres.

The audience's love partly reflects the quality of the film. We see that the
concentration and median of the Biography genre are higher than those of the
Action genre. We therefore expect the probability of a Biography film
receiving an award to be higher than that of an Action film. We constructed
the 95% confidence interval for the biography and action movie population
proportion that won awards.

Construct 95% confidence interval for action movie population proportion that won awards.

Construct 95% confidence interval for biography movie population proportion that won awards.

We calculate it using the library ‘statsmodels’. Its 'stats.proportion_confint'
function is used to calculate the confidence interval for a binomial proportion
with the default alpha of 0.05, where count is the number of successes and
nobs is the total number of trials.

Test on the Difference in Population Proportions of action and biography movies that won awards.

The parameters of interest are p1 and p2, the proportion of movies that won
awards in the biography genre (p1) and in the action genre (p2):
H0: p1 = p2
H1: p1 > p2
α = 0.05
From this, we can conclude that the population proportion of biography movies
that won awards is higher than the population proportion of action movies
that won awards.
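
The two confidence intervals and the one-sided test can be sketched with statsmodels as follows. The counts are hypothetical and only illustrate the calls; `method="normal"` (the normal approximation) is the default of `proportion_confint`, and using `proportions_ztest` for the comparison is an assumption about how the test was run.

```python
from statsmodels.stats.proportion import proportion_confint, proportions_ztest

# HYPOTHETICAL counts, for illustration only (not the report's data).
bio_wins, bio_total = 180, 300    # biography: award winners / all movies
act_wins, act_total = 600, 1500   # action: award winners / all movies

# 95% confidence interval for each population proportion.
bio_ci = proportion_confint(count=bio_wins, nobs=bio_total, alpha=0.05, method="normal")
act_ci = proportion_confint(count=act_wins, nobs=act_total, alpha=0.05, method="normal")

# Two-sample z-test of H0: p1 = p2 against H1: p1 > p2,
# with p1 = biography and p2 = action.
z_stat, p_value = proportions_ztest(
    count=[bio_wins, act_wins],
    nobs=[bio_total, act_total],
    alternative="larger",
)
reject_h0 = p_value < 0.05

print(bio_ci, act_ci)
print(z_stat, p_value, reject_h0)
```

With `alternative="larger"` the reported p-value is one-sided, matching H1: p1 > p2.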

4.1.3. Diversity comes from production country

Our data only considers movies that are globally released, so some countries
appear with fewer than ten movies because most of their movies are screened
only for domestic audiences.
If you use our calculation file, you can interact with the chart below to
see how many movies each country produces.

As we can see on the map, in addition to Hollywood, the world also has typical
movie powerhouses such as Britain, France, India, and Canada, as well as the
emerging dragons of Asian cinema such as Japan, Korea, China, and Hong Kong.
Europe can be considered the cradle of cinema. Europeans were pioneers in the
motion picture industry, with a number of innovative engineers and artists.
Over time, many new waves and classic movements were born and developed
here, such as German Expressionism, Soviet Montage, French Impressionist
Cinema and Italian Realism.
India is considered one of the world's major film markets. In particular,
Bollywood is famous for its lavish dance sequences set to romantic music. Such
typical movies are mass-produced in this country to cater to the tastes of
mass audiences. In addition, Indian filmmakers also focus on serious topics,
emphasizing depictions of realism and naturalism, symbolic elements and
concern for the environment, politics and society.
It is also impossible not to mention the Korean film industry, with recent
admirable achievements such as Parasite winning the Academy Awards for Best
Picture, Best Director and Best International Feature Film. Korean movies
with unique content are not only inspired by Western cinema and the Japanese
New Wave, but also based on Pansori, a traditional Korean art form of
storytelling.

4.1.4. The most of

Most Popular Movies by Popularity Score

In this list there are a few interesting names such as:

● The Lord of the Rings: The Return of the King and The Lord of the
Rings: The Fellowship of the Ring are both successful movies of The
Lord of the Rings franchise. The Return of the King won the Oscar for
Best Picture, and The Fellowship of the Ring was nominated for the
same award.
● Parasite caused a storm in 2019 with countless awards and was the
first Asian film to win Best Picture.
● Boyhood: a film shot over 12 years, following its actor from
childhood to adulthood.

Most Voted on Movies

In this list there are a few interesting names such as:

● The two films in The Lord of the Rings franchise are not only highly
appreciated by experts but also loved by the mass audience, as they
entered the top 10 movies with the most votes.
● 1994 was a successful year for cinema with many excellent films such
as The Shawshank Redemption, Pulp Fiction, and Forrest Gump.

Most Critically Acclaimed Movies

In this list there are a few interesting names such as:

● Parasite and Roma, the only two films in a foreign language and
produced in a country other than the United States, took first and
second places in the top 10 award-winning films.
● In this top 10, there are 5 Oscar winners for Best Picture (Parasite, 12
Years a Slave, Moonlight, The Lord of the Rings: The Return of the King,
Birdman), and the remaining 5 films were also nominated for Best
Picture.

Most Successful Directors

In this list there are a few interesting names such as:

● Most of the directors on the above list are famous for their excellent
action movies such as Steven Spielberg, Anthony Russo, Michael Bay,
Christopher Nolan.

4.2. The More Comedies Decline, The More Actions Thrive

4.2.1. The trend
According to the "Percentage of movies by genre from 1990 to 2019" pie chart
above, one would think Comedy movies dominated the box office because the
genre accounted for 27% of worldwide film releases between 1990 and 2019,
which equals the combined share of all genres other than Action and Drama.
But is Comedy really the king of the box office in terms of both the number
of movies and the profit?
The chart above gives us a lot of interesting information: the decline of
Comedy and Adventure, the steady rise of Action and Drama, and the explosion
of Animation and Biography. These changes are associated with technological
development, especially in film production. New technologies like augmented
reality, virtual reality, after effects/visual effects, live character
animations, 3D, 4D, Motion Capture, IMAX, VFX, StageCraft,... make the
production of animation and action movies much easier and more realistic.
This lets blockbusters dominate the box office throughout the year.
As for revenue, Comedy lost its throne a long time ago. From the chart above,
we can see that Comedy's revenue is unstable and has almost negligible
growth. It is worth mentioning that its total revenue in 2017 is almost the
same as in 1990, while the revenue of animation and action movies has grown
tremendously.

In 1990, the total revenue of action movies was roughly equal to the total
revenue of comedy movies, and comedy’s revenue was nearly 27 times that of
animation movies. But in 2019, action’s revenue was 5.3 times that of comedy,
and animation’s revenue was nearly 3 times that of comedy.

Construct the 95% confidence interval of the mean worldwide gross of the action movies population.

Construct the 95% confidence interval of the mean worldwide gross of the comedy movies population.

Test on the difference in the mean of the popularity of the action and comedy movies population.
The parameters of interest are the mean worldwide gross of the two genres,
action (μ1) and comedy (μ2). We are interested in determining whether μ1 > μ2.
H0: μ1 = μ2
H1: μ1 > μ2
α = 0.05
As we can see, the standard deviations of the two samples are different, so
we do not assume that the two population variances are equal. The unpooled
approach is therefore more appropriate.

Since t0 > t0.025, 1106 (which is larger than the one-sided critical value
t0.05, 1106), we reject the null hypothesis. Therefore, there is evidence to
conclude that the mean worldwide gross of action movies is higher than the
mean worldwide gross of comedy movies.
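
A minimal sketch of the two mean confidence intervals and the unpooled (Welch) test with SciPy. The two samples are randomly generated, hypothetical grosses, and the `mean_ci` helper is introduced here for illustration; the degrees of freedom follow the Welch-Satterthwaite approximation.

```python
import numpy as np
from scipy import stats

# HYPOTHETICAL worldwide grosses (million USD), generated for illustration.
rng = np.random.default_rng(0)
action = rng.normal(loc=200, scale=150, size=400)
comedy = rng.normal(loc=80, scale=60, size=500)

def mean_ci(sample, confidence=0.95):
    """t-based confidence interval for the population mean."""
    n = len(sample)
    se = sample.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return sample.mean() - t_crit * se, sample.mean() + t_crit * se

action_ci = mean_ci(action)
comedy_ci = mean_ci(comedy)

# Unpooled (Welch) test of H0: mu1 = mu2 against H1: mu1 > mu2.
t0, p_value = stats.ttest_ind(action, comedy, equal_var=False, alternative="greater")

# Welch-Satterthwaite degrees of freedom and one-sided critical value.
v1 = action.var(ddof=1) / len(action)
v2 = comedy.var(ddof=1) / len(comedy)
dof = (v1 + v2) ** 2 / (v1 ** 2 / (len(action) - 1) + v2 ** 2 / (len(comedy) - 1))
t_crit = stats.t.ppf(1 - 0.05, df=dof)
reject_h0 = t0 > t_crit

print(action_ci, comedy_ci)
print(t0, dof, t_crit, reject_h0)
```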

There are three main reasons for the decline and instability of comedy and
the dramatic growth of action over the past 30 years:
● Action movies are more profitable
● Theater audiences are more interested in action movies
● Fierce competition from other genres

4.2.2. All about money!

It appears that the problem here lies with the studios wanting to invest in
more guaranteed money-makers: blockbusters, like action movies or
superhero movies.

Cinema is also a business. Above the actors, directors and technicians are
the businessmen, the bosses of movie studios. What they are most interested
in is whether the film makes revenue and how many times higher the revenue
is than the cost.

The two graphs below show us why the more comedies decline, the more
actions thrive:
The average revenue of action movies has fluctuated from year to year, but
the general trend is still up. In 1990, an action film earned about 90
million USD on average, but in 2019 the average revenue of an action film
was 450 million. Meanwhile in the comedy genre, the average in 1990 was
about 70 million, but in 2019 it was only about 110 million.

The annual revenues of these two genres are also significantly different.
From 1990 to 2000, the annual gross revenues of the two genres were pretty
close together. But by 2008, action's annual gross jumped while comedy's
annual gross fell, turning them into two unequal opponents.

Here are line plots of the average and total profit of these two genres.
Not surprisingly, these two charts are quite similar to the two above. It seems
that the film industry is like the stock market: you have to spend money to
make money. And if you want to make the real big money, you're gambling
with an awful lot of risk. The cost of an action movie is high, but its
revenue is much higher, so the profit is large.

Like its revenue, the average profit of comedy has grown only slightly, and
for a long period it barely increased. From 1991 to 2005, the average annual
profit was even lower than it was in 1990.

Therefore, investors and film producers do not hesitate to choose action
movies, which take more money to produce but also earn more.

4.2.3. Theater audiences are more interested in action movies

The huge difference between the revenue and profit of these two genres is
due to the tastes of the audience. The vote feature clearly shows the
audience's interest in a movie. The more votes a movie has on IMDb, the more
it is known and liked.

In both charts, "Average votes per film of two genres Action and Comedy by
year" and "Total votes of two genres Action and Comedy by year", it is
interesting that both the average and total number of votes of action movies
are higher than those of comedy movies. The gap between the two lines, and
therefore in the interest of the mass audience, becomes really clear in 2008.
To explain the increased popularity of action movies, we rely on our
knowledge as well as the results of some surveys:
● Action movies have higher budgets, so marketing costs are also higher.
● New technologies in action movies are always appealing to us.
● Action movie content is often more novel and attractive than comedy.
● According to a survey by Gökçe Bayramıçlılar, 75% of respondents said
that they would rather watch sitcom series online than go to the
cinema to watch comedy.

4.2.4. Fierce competition from other genres

The slow growth of comedy is also accompanied by an increase in both the
number and revenue of other genres such as Animation, Biography, and Drama.
The number of genres was 11 in 1990 and 15 in 2019, with 4 new genres:
Musical, Music, Sport, and Fantasy.

In the chart "Percentage of movies by genre from 1990 to 2019 by Year”, we
can see that the number of animation films and biographies in 2019 almost
doubled compared to 1990. More specifically, the number of animation movies
is smaller than that of comedies, but their revenue in 2019 is almost three
times the revenue of comedies (according to the chart "Total of grosses by
genre from 1990 to 2019 by year").

4.3. Box Office profit Is Influenced By Many Factors

A lot of things affect the profit of a movie, from subjective factors such as
the director, actors, production studio, content, marketing strategy,... to
objective factors such as the market, release time, score,... With the data
we've collected, we'll look at how a movie's domestic gross, popularity,
votes, review score, number of awards, genre, and MPAP rating influence its
profit.
4.3.1. The numbers
Correlation is a statistical measure of the linear relationship between two
variables. It can also be defined as a measure of dependence between two
different variables. When there are multiple variables and the goal is to
find the correlation between every pair of them, the results are stored in
a data structure called a correlation matrix.
A correlation heatmap is a graphical representation of a correlation matrix,
showing the correlation between different variables.
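
A correlation matrix can be computed directly in pandas, as in this sketch. The values are hypothetical, and `profit` is assumed here to be worldwide gross minus budget.

```python
import pandas as pd

# HYPOTHETICAL values (million USD, thousand votes), for illustration only.
df = pd.DataFrame({
    "budget": [10, 50, 100, 150, 200],
    "worldwide_gross": [30, 120, 310, 400, 650],
    "vote": [5, 40, 90, 60, 200],
})
df["profit"] = df["worldwide_gross"] - df["budget"]  # ASSUMPTION

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(method="pearson")
print(corr.round(2))

# seaborn.heatmap(corr, annot=True) would draw this matrix as a heatmap.
```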

The regression plots in seaborn are primarily intended to add a visual
guide that helps to emphasize patterns in a dataset during exploratory data
analysis. In other words, they show us the relationship and the degree of
connection between two features.
The regression plot of budget and profit helps us understand whether movies
with a higher budget generate a higher profit. We can see that there's a
strong correlation between the budget and the profit.
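
The line drawn in such a regression plot is a least-squares fit. A minimal NumPy sketch of that fit, on hypothetical budget and profit values, looks like this:

```python
import numpy as np

# HYPOTHETICAL budgets and profits (million USD), for illustration only.
budget = np.array([10, 50, 100, 150, 200], dtype=float)
profit = np.array([20, 70, 210, 250, 450], dtype=float)

# Least-squares line: profit ~ slope * budget + intercept.
slope, intercept = np.polyfit(budget, profit, deg=1)

# Pearson correlation coefficient between the two features.
r = np.corrcoef(budget, profit)[0, 1]

print(f"profit = {slope:.3f} * budget + {intercept:.3f}, r = {r:.3f}")
```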
Similarly, we see that worldwide gross, international gross and domestic
gross are also closely related to profit: each has a strong positive linear
relationship with it. This is expected because profit is partly calculated
from them.
In addition, vote, imdb, won awards and nominated awards are also associated
with profit. These features will play an important role in building the movie
profit prediction algorithm.
Although we were able to establish a moderate positive correlation between
rating and vote, we couldn't establish an acceptable correlation between
them and profit.

4.3.2. Genres
Movie sales can also be related to the genre of the film. Each genre has a
different number of followers. Some genres suit the majority of audiences,
such as action, comedy, family, animation,... Meanwhile, some genres appeal
to a narrower audience, such as crime, horror, thriller,...

Below are box plots of profit by genres.

Through the chart above, we see that mystery and animation films are the two
genres with the highest median profit, followed by action, adventure, and
sci-fi. In particular, action movies seem to have the most outliers because
the number of action movies is relatively large and their quality is uneven.

4.3.3. MPA
Another factor that also greatly affects profit is the film's MPA rating,
which indicates the age of the audience that the film aims for. We will
consider 4 main ratings, which are the R, PG-13, PG and G labels:
● R – Restricted: Under 17 requires accompanying parent or adult
guardian. Contains some adult material. Parents are urged to learn
more about the film before taking their young children with them.
● PG-13 – Parents Strongly Cautioned: Some material may be
inappropriate for children under 13. Parents are urged to be cautious.
Some material may be inappropriate for pre-teenagers.
● PG – Parental Guidance Suggested: Some material may not be suitable
for children. Parents urged to give "parental guidance". May contain
some material parents might not like for their young children.
● G – General Audiences: All ages admitted. Nothing that would offend
parents for viewing by children.
The G label has the highest average profit, followed by PG, PG-13 and finally R.
PG-13 has many outliers since it is the label for most films.
This is reasonable because the G label reaches the largest audience.
Therefore, producers often try to choose scripts and filmmaking methods
that limit violence and gore so that the film can be labeled G or PG-13 in order
to gain the most profit.
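
The comparison of profit by rating (and, in the same way, by genre) comes down to a group-by summary. A minimal pandas sketch with hypothetical profits:

```python
import pandas as pd

# HYPOTHETICAL profits (million USD), for illustration only.
df = pd.DataFrame({
    "mpap": ["G", "PG", "PG-13", "PG-13", "R", "R", "G"],
    "profit": [300, 150, 120, 160, 40, 60, 200],
})

# Number of films, mean and median profit for each rating.
summary = (
    df.groupby("mpap")["profit"]
      .agg(["count", "mean", "median"])
      .sort_values("mean", ascending=False)
)
print(summary)
```

Replacing "mpap" with "genre" gives the numbers behind the genre box plots.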

5. Conclusion
5.1. Unanswered questions
● How do the director and main actors affect the profit of the film?
(Because we don't know how to handle the director column, nor
have we collected the main actors of each movie.)
● Is the average profit for each genre the same across
countries? (The data we have is from two US websites, so the
movies from the US are more numerous and their data more accurate.)
● What content makes the movie profitable? Which audience will
watch movies in theaters more?
● How do online movie streaming platforms like Netflix, Amazon
Prime, HBO Max,... affect the box office profit of movies?
● How has the COVID-19 pandemic changed the film industry?
5.2. Knowledge earned
After working together on this project, we have all learned a lot of
new knowledge as well as reinforced old knowledge:
● Revise the knowledge of probability and statistics such as
probability, sampling distribution, descriptive statistics,
confidence interval, test of hypotheses, simple linear regression,
● Learn more about how to collect, clean and process data, and how to
calculate and graph from that data in Python.
● Improve the ability to work in groups, make documents and
● Learn more about the film industry and answer the group's
questions about movies.
