APPLIED DATA SCIENCE

MAPÚA UNIVERSITY

#658 MURALLA ST., INTRAMUROS, MANILA 1002, METRO MANILA

SCHOOL OF MECHANICAL AND MANUFACTURING ENGINEERING

CASE STUDY:

The Role of Data Science in the Success of Netflix Inc.

In Partial Fulfillment for the Course:

DS100-2 / B9

APPLIED DATA SCIENCE

Submitted by:

AGCAOILI, Vir Franciz A.

AQUINO, Peter Wyn Gian S.

HOSSAIN, Shailani B.

PARADO, Jacob A.

Submitted to:

Ms. Eliza Eleazar

4th of August 2020

Background / History of the enterprise

Netflix, Inc. was founded by two tech entrepreneurs, Reed Hastings and Marc Randolph, and
began operations in 1997. The company’s head office is in Los Gatos, California. Netflix’s
main business is a subscription-based online streaming service for TV shows, originals,
movies, and other content. As the largest media service provider, it has over 148 million
members across 190 countries; it is not available in China, Iran, North Korea, Crimea, or
Syria. In its early days Netflix suffered heavy losses, but as the number of internet users
rose, it changed its business model from traditional DVD rental and sales to online video
streaming in 2007 and was able to reduce those losses. Along with streaming movies and TV
shows from other studios, Netflix also produces its own movies and TV shows. Netflix began
expanding worldwide in 2010, starting with Canada, then Latin American countries in 2011,
followed by the United Kingdom and other European countries such as Denmark, the Netherlands,
and Norway from 2012 to 2015. In 2012 Netflix split its DVD rental service into a division
separate from its online streaming division. As of 2017, the DVD rental division had around
3.3 million customers, and Netflix planned to keep the service for a few more years. The
biggest challenges Netflix currently faces are retaining existing subscribers while growing
the number of new ones, increased competition from other streaming providers such as Hulu,
Disney, WarnerMedia, and Amazon, and the rising cost of producing original content. To
overcome these challenges Netflix uses big data analytics. It has invested heavily in big
data analytics research, spending over $1 billion on it. Today it has a separate division,
Netflix Research, that concentrates on data analytics areas such as customer experience,
recommendations, and machine learning. Much of this investment in data science and data
analytics goes into its recommendation systems, which learn about users and provide
recommendations accordingly.

The role of Data Science / Data Analytics in transforming / innovating the enterprise

Data science is in the DNA of Netflix, and the company leverages it to improve every part
of the user experience. Over the years Netflix has used data science for its content
recommendation engine, to choose which movies and television programs to deliver, and to
improve the user experience.

Netflix was one of the early adopters of big data analytics. In 2006 it launched a
competition, the Netflix Prize, offering $1 million to anyone who could improve the accuracy
of its existing recommendation system, Cinematch, by 10%. The challenge was to build an
algorithm that predicts users’ movie preferences from their past ratings. Netflix released a
dataset of around 100 million ratings given by 480 thousand users to 17 thousand movies.
Each record had the form: user, movie, date of rating, and rating given by the user.
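
To make the 10% target concrete: the Netflix Prize scored entries by root-mean-square error
(RMSE) between predicted and actual ratings, so a 10% improvement means a 10% lower RMSE than
the baseline. The sketch below only illustrates that arithmetic. The ratings and the two sets
of predictions are invented for the example, and the function names are our own.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def improvement_pct(baseline_rmse, candidate_rmse):
    """Relative RMSE reduction of a candidate over a baseline, in percent."""
    return 100.0 * (baseline_rmse - candidate_rmse) / baseline_rmse


# Hypothetical ratings on a 1-5 scale (not taken from the Netflix dataset).
actual = [4, 3, 5, 2, 1]
baseline_predictions = [3.5, 3.5, 3.5, 3.5, 3.5]   # e.g. always predict one value
candidate_predictions = [3.9, 3.2, 4.4, 2.5, 1.8]  # a model that tracks the user better

base = rmse(baseline_predictions, actual)
cand = rmse(candidate_predictions, actual)
print(f"baseline RMSE:  {base:.4f}")
print(f"candidate RMSE: {cand:.4f}")
print(f"improvement:    {improvement_pct(base, cand):.1f}%")
```

A candidate qualifies under this rule when improvement_pct returns 10 or more.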

Steps done by the Analytics team based on the Analytics lifecycle

The main reason Netflix analyzes the data it collects is to keep people subscribing. With
more accurate recommendations, people are more likely to renew for the next month, and the
platform can also attract new subscribers.

So what data is collected from Netflix subscribers? Several kinds of data are collected, such
as events when the user pauses, rewinds, leaves, or fast-forwards content. The place (through
zip code) and the date when a person watches are also recorded, as is the device used. The
ratings given, search history, and browsing and scrolling behavior are also taken into
consideration. Lastly, the nature of the show and the credit calculation are also included.
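
As an illustration of what one such collected record might look like, here is a minimal
sketch of a playback event as a data structure. The class name, field names, and sample values
are all assumptions made for this example; the source does not describe Netflix’s actual
schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Action(Enum):
    """Viewer interactions named in the text."""
    PLAY = "play"
    PAUSE = "pause"
    REWIND = "rewind"
    FAST_FORWARD = "fast_forward"
    LEAVE = "leave"


@dataclass(frozen=True)
class PlaybackEvent:
    """One hypothetical viewing event; every field name here is illustrative."""
    user_id: str
    title_id: str
    action: Action
    timestamp: datetime        # date and time of viewing
    zip_code: str              # place, recorded through zip code
    device: str                # device the user watched on
    position_seconds: int      # where in the title the action happened
    rating: Optional[int] = None


def count_actions(events):
    """Tally how often each kind of action occurs in a list of events."""
    return Counter(event.action for event in events)


events = [
    PlaybackEvent("u1", "title_a", Action.PLAY, datetime(2020, 8, 1, 20, 0), "1002", "tv", 0),
    PlaybackEvent("u1", "title_a", Action.PAUSE, datetime(2020, 8, 1, 20, 15), "1002", "tv", 900),
    PlaybackEvent("u1", "title_a", Action.PAUSE, datetime(2020, 8, 1, 20, 40), "1002", "tv", 2100),
    PlaybackEvent("u1", "title_a", Action.LEAVE, datetime(2020, 8, 1, 20, 41), "1002", "tv", 2100),
]

for action, count in count_actions(events).items():
    print(action.value, count)
```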

The algorithmic results can be computed online in real time, offline in batch, or nearline,
somewhere in between. Each approach has advantages and disadvantages that need to be taken
into account for each use case. Online computation can respond better to recent events and
user interaction, but it has to respond to requests in real time. This can limit the
computational complexity of the algorithms employed as well as the amount of data that can be
processed. Offline computation has fewer limitations on the amount of data and the
computational complexity of the algorithms, since it runs in batch with relaxed timing
requirements. Nearline computation is an intermediate compromise between these two modes, in
which online-like computations can be performed without having to be served in real time.

In any case, the choice of online, nearline, or offline processing is not an either/or
question. All approaches can and should be combined. The modeling part, for example, can be
done in a hybrid offline/online manner. This is not a natural fit for traditional supervised
classification applications, where the classifier has to be trained in batch from labeled
data and is only applied online to classify new inputs. However, approaches such as Matrix
Factorization are a more natural fit for hybrid online/offline modeling: some factors can be
precomputed offline while others can be updated in real time to create a fresher result.
Other unsupervised approaches such as clustering also allow for offline computation of the
cluster centers and online assignment of clusters.
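
The hybrid idea for Matrix Factorization can be sketched as follows. This is a toy example
under our own assumptions, not Netflix’s implementation: the item factors are treated as
precomputed offline and kept fixed, while a single user’s factors are refreshed online with
stochastic gradient descent steps as new ratings arrive. The factor values, learning rate,
and regularization strength are invented for illustration.

```python
# Item factors, assumed to have been precomputed offline by a batch job.
ITEM_FACTORS = {
    "title_a": [0.9, 0.1],
    "title_b": [0.2, 0.8],
    "title_c": [0.7, 0.3],
}


def predict(user_vec, item_vec):
    """Predicted rating is the dot product of user and item factors."""
    return sum(u * q for u, q in zip(user_vec, item_vec))


def online_update(user_vec, item_vec, rating, lr=0.05, reg=0.02):
    """One SGD step on the user's factors only; the item factors stay fixed."""
    error = rating - predict(user_vec, item_vec)
    return [u + lr * (error * q - reg * u) for u, q in zip(user_vec, item_vec)]


# Online part: the user's vector starts small and is refreshed per new rating.
user = [0.1, 0.1]
for _ in range(200):
    user = online_update(user, ITEM_FACTORS["title_a"], 5.0)  # user liked title_a
    user = online_update(user, ITEM_FACTORS["title_b"], 1.0)  # user disliked title_b

for title, factors in ITEM_FACTORS.items():
    print(title, round(predict(user, factors), 2))
```

Because only the small user vector changes, each update is cheap enough to run while the user
is still browsing, and the expensive item factors can be recomputed later in batch.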

Much of the computation they need to do when running personalization machine learning
algorithms can be done offline. There are two main kinds of tasks that fall in this category:
model training and batch computation of intermediate or final results. In the model training
jobs, they collect relevant existing data and apply a machine learning algorithm that produces
a set of model parameters (referred to here as the model). This model is usually encoded and
stored in a file for later consumption. Although most of the models are trained offline in
batch mode, they also have some online learning techniques where incremental training is
performed online. Batch computation of results is the offline computation process defined
above, in which existing models and corresponding input data are used to compute results that
will be used later, either for subsequent online processing or for direct presentation to the
user.
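
The two offline tasks can be sketched in a few lines. In this minimal example, which is an
assumption-laden stand-in rather than a real training pipeline, "training" is just a per-title
mean rating, the model is stored as a JSON file, and the batch step loads that file to
precompute recommendations for every user. All names and data are hypothetical.

```python
import json
import os
import tempfile
from collections import defaultdict


def train_model(ratings):
    """Model training: here simply the mean rating per title."""
    totals = defaultdict(lambda: [0.0, 0])
    for _user, title, rating in ratings:
        totals[title][0] += rating
        totals[title][1] += 1
    return {title: total / count for title, (total, count) in totals.items()}


def save_model(model, path):
    """Encode the model parameters and store them in a file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(model, handle)


def load_model(path):
    """Read previously stored model parameters."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def batch_top_n(model, seen_by_user, n=2):
    """Batch computation: rank each user's unseen titles with the stored model."""
    results = {}
    for user, seen in seen_by_user.items():
        unseen = [title for title in model if title not in seen]
        results[user] = sorted(unseen, key=lambda t: model[t], reverse=True)[:n]
    return results


# Hypothetical (user, title, rating) triples.
ratings = [("u1", "a", 5), ("u2", "a", 4), ("u1", "b", 2), ("u3", "c", 3), ("u2", "c", 5)]
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(train_model(ratings), model_path)

model = load_model(model_path)
seen = {"u1": {"a", "b"}, "u2": {"a", "c"}, "u3": {"c"}}
recommendations = batch_top_n(model, seen)
print(recommendations)
```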

Both of these tasks need refined data to process, which is usually generated by running a
database query. Since these queries run over large amounts of data, it can be beneficial to
run them in a distributed fashion, which makes them very good candidates for running on Hadoop
via either Hive or Pig jobs. Once the queries have completed, they need a mechanism for
publishing the resulting data. They have several requirements for that mechanism. First, it
should notify subscribers when the result of a query is ready. Second, it should support
different repositories (not only HDFS, but also S3 or Cassandra, for instance). Finally, it
should handle errors transparently and allow for monitoring and alerting. At Netflix they use
an internal tool named Hermes that provides all of these capabilities and integrates them into
a coherent publish-subscribe framework. It allows data to be delivered to subscribers in near
real-time. In some sense, it covers some of the same use cases as Apache Kafka, but it is not
a message/event queue system.
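
The publish-subscribe requirements listed above can be illustrated with a small in-process
sketch. This is not Hermes and does not reflect its API, which the source does not describe.
The class, method names, topic, and storage location below are hypothetical, and the example
only shows the pattern: subscribers register for a topic, are notified when a result is ready,
and a failing subscriber is recorded rather than stopping delivery to the others.

```python
from collections import defaultdict


class ResultPublisher:
    """Toy publish-subscribe hub for 'query result is ready' notifications."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.failures = []  # kept for monitoring and alerting

    def subscribe(self, topic, callback):
        """Register a callback to be called when results for a topic are ready."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, location):
        """Notify every subscriber; record errors instead of propagating them."""
        delivered = 0
        for callback in self._subscribers[topic]:
            try:
                callback(topic, location)
                delivered += 1
            except Exception as exc:
                self.failures.append((topic, repr(exc)))
        return delivered


received = []


def trainer(topic, location):
    # A well-behaved subscriber: remembers where the new data lives.
    received.append((topic, location))


def broken_subscriber(topic, location):
    # A subscriber that fails, to show that errors are recorded.
    raise RuntimeError("downstream job unavailable")


publisher = ResultPublisher()
publisher.subscribe("daily_views", trainer)
publisher.subscribe("daily_views", broken_subscriber)

# The location string is a made-up example of a repository path.
delivered_count = publisher.publish("daily_views", "s3://example-bucket/daily_views/")
print(delivered_count, received, publisher.failures)
```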

At Netflix, the near-real-time event flow is managed through an internal framework called
Manhattan. Manhattan is a distributed computation system that is central to their algorithmic
architecture for recommendation. It is somewhat similar to Twitter’s Storm, but it addresses
different concerns and responds to a different set of internal requirements. The data flow is
managed mostly through logging via Chukwa to Hadoop for the initial steps of the process.
Later they use Hermes as the publish-subscribe mechanism.

Methods / resources used by the team

Netflix’s ability to collect and use data is the reason behind its success. According to
Netflix, its recommendation system is worth over $1 billion a year in customer retention,
because it accounts for over 80% of the content streamed on the platform. Netflix also uses
its big data and analytics tools to decide whether to greenlight original content. To an
outsider, it might look like Netflix is throwing cash at whatever it can get, but in reality
it greenlights original content based on several touch points derived from its user base.

Since Netflix deals with a lot of data, it is beneficial to run its queries on Hadoop through
Pig or Hive. The results must be published to, and supported by, not just HDFS but other
repositories such as S3 and Cassandra. For this, Netflix developed an in-house tool called
Hermes. It is a publish-subscribe framework like Kafka, but it provides additional features
such as multi-DC support, a tracking mechanism, JSON to Avro conversion, and a GUI called the
Hermes console (Morgan, 2019). They wanted a tool that could monitor, alert, and handle errors
transparently and effectively.

Hadoop makes distributed computing possible by providing a set of software and tools. It uses
the MapReduce programming model to process Big Data stored across a cluster, and many
companies use it today for large-scale data processing and analytics. HDFS stands for Hadoop
Distributed File System. It is one of the core components of the Hadoop ecosystem and
functions as its storage system: MapReduce jobs read their input from it and write their
results to it. It can provide high aggregate bandwidth across the cluster.

JavaScript Object Notation (JSON) is a lightweight data-interchange format. It is easy for
humans to read and write, and easy for machines to parse and generate. It is based on a subset
of the JavaScript Programming Language Standard ECMA-262 3rd Edition - December 1999.

Avro is a row-oriented remote procedure call and data serialization framework developed within
Apache’s Hadoop project. It uses JSON for defining data types and protocols, and serializes
data in a compact binary format.

GUI stands for graphical user interface, a system of interactive visual components for
computer software. A GUI displays objects that convey information and represent actions that
can be taken by the user. The objects change color, size, or visibility when the user
interacts with them.
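
To show how MapReduce and JSON fit together, the sketch below imitates the three MapReduce
phases (map, shuffle, reduce) in a single Python process, counting plays per title from
JSON-encoded records. It does not use Hadoop or its APIs; a real job would spread the map and
reduce work across the machines of a cluster. The record fields and values are invented for
the example.

```python
import json
from collections import defaultdict


def map_phase(line):
    """Map: parse one JSON record and emit a (title, 1) pair."""
    record = json.loads(line)
    yield record["title"], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Hypothetical viewing records, one JSON object per line.
lines = [
    '{"user": "u1", "title": "title_a"}',
    '{"user": "u2", "title": "title_b"}',
    '{"user": "u3", "title": "title_a"}',
]

pairs = [pair for line in lines for pair in map_phase(line)]
play_counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(play_counts)
```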

Results

Through Netflix’s data analytics, personalization and recommendation save the company $1
billion a year. Netflix is able to collect many data points to create a detailed profile of
each subscriber. This profile is far more detailed than the personas created through
conventional marketing. It is one of the important factors in attracting new subscribers to
the platform and encouraging existing users to keep subscribing.

They also discovered, somewhat surprisingly, that binary information (simply whether or not a
user rated a movie) is useful in itself, because people do not select and rate movies at
random.

Netflix has been able to ensure a high engagement rate with its original content, such
that 90 percent of Netflix users have engaged with its original content.

Netflix’s big data approach to content is so successful that, compared to the TV industry,
where just 35 percent of shows are renewed past their first season, Netflix renews 93 percent of
its original series.

Netflix even uses big data and analytics for custom marketing. For example, to promote ‘House
of Cards’, Netflix cut over ten different versions of the trailer. If you watched lots of TV
shows centered on women, you got a trailer focused on the female characters. However, if you
watched a lot of content directed by David Fincher, you got a trailer that focused on him.
Netflix did not have to spend too much time and resources on marketing the show because it
already knew how many people would be interested in it and what would incentivize them to
tune in.

References

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3473148

https://seleritysas.com/blog/2019/04/05/how-netflix-used-big-data-and-analytics-to-generate-billions/

https://neilpatel.com/blog/how-netflix-uses-analytics/

https://towardsdatascience.com/the-netflix-data-scientist-interview-35093d4c20aa#:~:text=Data%20science%20is%20in%20the,and%20to%20improve%20user%20experience.

https://towardsdatascience.com/netflix-recommender-system-a-big-data-case-study-19cfa6d56ff5

https://netflixtechblog.com/system-architectures-for-personalization-and-recommendation-e081aa94b5d8
