
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Amritsar College of Engineering & Technology, Amritsar


(Autonomous College under UGC Act, 1956 [2(f) and 12(B)])

Project Report
On
“ANALYSIS OF YOUTUBE VIDEOS”

Submitted in partial fulfillment of the requirements for the award of the degree of

Bachelor of Technology
in
COMPUTER SCIENCE & ENGINEERING

Batch
(2018-2022)

Subject - Big Data Analytics


(ACCS-16503)

Submitted to:
Er. AJAY SHARMA

Submitted by:
MRIDUL MAHAJAN (1800258)
NANCY (1800259)
NARINDERPAL SINGH (1800260)
NAVJOT KAUR (1800262)
DECLARATION
We hereby declare that the project work entitled “ANALYSIS OF YOUTUBE VIDEOS
DATASET” is an authentic record of our own work, carried out as per the requirements for
the award of the degree of B.Tech (CSE) at Amritsar Group of Colleges, Amritsar, under the
guidance of Er. AJAY SHARMA.

(Signature of students)
MRIDUL MAHAJAN (1800258)
NANCY (1800259)
NARINDERPAL SINGH (1800260)
NAVJOT KAUR (1800262)

Date:

Certified that the above statement made by the students is correct to the best of my
knowledge and belief.

Signature:
Examined By:
Er. Ajay Sharma
ACKNOWLEDGEMENT

This is a humble effort to express our sincere gratitude towards those who have guided and
helped us to complete this project.
A project is a major milestone in the study period of a student. As such, this project was a
challenge to us and an opportunity to prove our calibre. We are highly grateful and obliged
to everyone who helped us out of the problems we faced.
We, MRIDUL MAHAJAN, NANCY, NARINDERPAL SINGH, and NAVJOT KAUR,
students of the Bachelor of Technology (CSE) degree, would like to take this opportunity to
express our sincere regards to our esteemed supervisor Er. Ajay Sharma, Associate
Professor, and Mr. Vinod Sharma, Head of the Department of Computer Science &
Engineering, for their guidance, constructive criticism, valuable suggestions, and
long-standing efforts, which brought this report to its present form.
INDEX PAGE
S.No. Content
1. Introduction to Big Data Analytics
2. Apache Pig
3. Analysis of YouTube Videos Dataset

An Introduction to Big Data Analytics
Big data analytics can be defined as the process of examining large and varied data sets.
Advanced analytics techniques are applied to this data to uncover hidden patterns, unknown
correlations, market trends, customer preferences, and other useful information, which helps
organizations make informed decisions.

Big data analytics examines large amounts of data to uncover hidden patterns, correlations,
and other insights. With today’s technology, it is possible to analyze your data and get
answers from it almost immediately. Big data analytics helps you understand your
organization better and make informed decisions without blindly relying on guesses.

History and Evolution of Big Data Analytics


The concept of big data has been around for years; most organizations now understand that if
they capture all the data that streams into their businesses, they can apply analytics and get
significant value from it. But even in the 1950s, decades before anyone uttered the term “big
data,” businesses were using basic analytics: essentially, numbers in a spreadsheet that were
manually examined to uncover insights and trends.

Values of Big Data Analytics


Big data analytics helps organizations harness their data and use it to identify new
opportunities. That, in turn, leads to smarter business moves, more efficient operations,
higher profits, and happier customers. The sections below show how different industries
realize these values.

Uses of Big Data analytics across different industries

Banking
Large amounts of information stream into banks; managing all this data and extracting
proper insights from it is possible only with big data analytics. This is important for
understanding customers and boosting their satisfaction, as well as for minimizing risk and
fraud.

Government
When government agencies are able to harness and apply analytics to their big data, they gain
significant ground when it comes to managing utilities, running agencies, dealing with traffic
congestion or preventing crime.

Health Care
Patient records, treatment plans, prescription information: when it comes to health care,
everything needs to be done quickly and accurately, and, in some cases, with enough
transparency to satisfy stringent industry regulations. When big data is managed effectively,
health care providers can uncover hidden insights that improve patient care.

Education
Educators armed with data-driven insight can make a significant impact on school systems,
students, and curricula. By analyzing big data, they can identify at-risk students, make sure
students are making adequate progress, and implement better systems for the evaluation and
support of teachers and principals.

Manufacturing
Armed with insight that big data can provide, manufacturers can boost quality and output
while minimizing waste – processes that are key in today’s highly competitive market. More
and more manufacturers are working in an analytics-based culture, which means they can
solve problems faster and make more agile business decisions.

Retail
Customer relationship building is critical to the retail industry, and the best way to manage it
is to manage big data. Retailers need to know the best way to market to customers, the most
effective way to handle transactions, and the most strategic way to bring back lapsed
business. Big data remains at the heart of all those things.

Final Thoughts
Apart from the wide range of benefits big data analytics offers, there are some pitfalls, such
as a lack of internal analytics skills; hiring skilled data scientists and data engineers to fill
this gap costs more money.

Data management issues may also arise, depending on the amount and variety of data
involved. In addition, integrating Hadoop, Spark, and other big data tools into a cohesive
architecture that meets an organization’s big data analytics needs is a challenging proposition
for many IT and analytics teams, which have to identify the right mix of technologies.
Apache Pig:

Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large
data sets by representing them as data flows. Pig is generally used with Hadoop; we can
perform all the data manipulation operations in Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin.
This language provides various operators with which programmers can develop their own
functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers write scripts in the Pig Latin language. All
these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component
known as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts
into MapReduce jobs.
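
As a minimal sketch of what such a script looks like (the file path, delimiter, and field
names here are hypothetical), the following loads a delimited file, filters it, and stores the
result; Pig translates each statement into MapReduce work behind the scenes:

-- Load a hypothetical tab-delimited log file with an explicit schema
logs = LOAD '/data/access_log.txt' USING PigStorage('\t')
       AS (ip:chararray, url:chararray, hits:int);

-- Keep only the heavily visited pages
popular = FILTER logs BY hits > 100;

-- Write the filtered relation back to HDFS
STORE popular INTO '/data/popular_pages';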

Apache Pig – Architecture:

The language used to analyze data in Hadoop with Pig is known as Pig Latin. It is a
high-level data processing language which provides a rich set of data types and operators to
perform various operations on the data.
To perform a particular task using Pig, programmers need to write a Pig script in the Pig
Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs,
embedded). After execution, these scripts go through a series of transformations applied by
the Pig framework to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, which makes
the programmer’s job easy. The architecture of Apache Pig is shown below.
Apache Pig Components:

As shown in the figure, there are various components in the Apache Pig framework. Let us
take a look at the major components.

Parser:

Initially, the Pig scripts are handled by the Parser. It checks the syntax of the script, does
type checking, and performs other miscellaneous checks. The output of the parser is a DAG
(directed acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes, and the data
flows are represented as edges.

Optimizer:

The logical plan (DAG) is passed to the logical optimizer, which carries out the logical
optimizations such as projection and pushdown.

Compiler:

The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine

Finally, the MapReduce jobs are submitted to Hadoop in sorted order, where they are
executed to produce the desired results.
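
You can inspect these stages yourself: Pig’s EXPLAIN operator prints the logical plan built
by the parser and optimizer, the physical plan, and the MapReduce plan produced by the
compiler for any relation. A small sketch in the Grunt shell (the file path and the shortened
schema are hypothetical):

grunt> youtube = LOAD '/youtube1.csv' USING PigStorage(',')
       AS (video_id:chararray, views:double);
grunt> top = ORDER youtube BY views DESC;
grunt> EXPLAIN top;  -- prints the logical, physical, and MapReduce plans for 'top'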

Pig Latin Data Model:


The data model of Pig Latin is fully nested, and it allows complex non-atomic data types
such as map and tuple. Given below is the diagrammatical representation of Pig Latin’s data
model.

Atom:

Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored
as a string and can be used as a string or a number. int, long, float, double, chararray, and
bytearray are the atomic types of Pig. A piece of data or a simple atomic value is known as a
field.
Example − ‘raja’ or ‘30’

Tuple:

A record that is formed by an ordered set of fields is known as a tuple; the fields can be of
any type. A tuple is similar to a row in an RDBMS table.
Example − (Raja, 30)

Bag:

A bag is an unordered set of tuples. In other words, a collection of (non-unique) tuples is
known as a bag. Each tuple can have any number of fields (flexible schema). A bag is
represented by ‘{}’. It is similar to a table in an RDBMS, but unlike a table in an RDBMS,
it is not necessary that every tuple contain the same number of fields or that the fields in the
same position (column) have the same type.

Example − {(Raja, 30), (Mohammad, 45)}

A bag can be a field in a relation; in that context, it is known as an inner bag.

Example − (Raja, 30, {(9848022338, raja@gmail.com)})

Map:

A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and
should be unique. The value can be of any type. A map is represented by ‘[]’.
Example − [name#Raja, age#30]

Relation:

A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee
that tuples are processed in any particular order).
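
To make these types concrete, here is a hedged sketch of a LOAD statement whose schema
mixes atomic and complex types (the file and field names are invented for illustration):

-- A hypothetical file where each record carries an atom, a tuple, a bag, and a map
people = LOAD '/people.txt'
         AS (name:chararray,                          -- atom
             address:tuple(city:chararray, pin:int),  -- tuple
             phones:bag{t:tuple(num:chararray)},      -- bag of tuples
             props:map[]);                            -- map with untyped values

-- 'people' itself is a relation: an unordered bag of tuples
describe people;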

Apache Pig Execution Modes:

You can run Apache Pig in two modes, namely, local mode and MapReduce (HDFS) mode.

Local Mode:

In this mode, all the files are installed and run from your local host and local file system.
There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.

MapReduce Mode:

MapReduce mode is where we load or process the data that exists in the Hadoop File System
(HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to
process the data, a MapReduce job is invoked in the back-end to perform a particular
operation on the data that exists in the HDFS.
Apache Pig Execution Mechanisms:
Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and
embedded mode.

Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the
Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using
Dump operator).

Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin
script in a single file with .pig extension.

Embedded Mode (UDF) − Apache Pig provides the provision of defining our own
functions (User Defined Functions) in programming languages such as Java, and using them
in our script.
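
As a quick sketch of how these modes and mechanisms are invoked from the command line
(the script name here is hypothetical):

# Start the Grunt shell in local mode (no Hadoop/HDFS needed)
pig -x local

# Start the Grunt shell in MapReduce mode (the default)
pig -x mapreduce

# Batch mode: run a Pig Latin script stored in a file
pig myscript.pig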
DATASET OF YOUTUBE VIDEOS

DETAILS:- There are 11 columns and 40883 rows.

Columns :-

• video_id
• title
• channel
• category_id
• publish_date
• views
• likes
• dislikes
• comments
• thumbnail
• comments_disabled
Queries & Outputs :-

1) Load youtube1.csv into a relation, assign a suitable schema to it, and dump the relation
on the console.
Ans:-

➢ youtube = load '/youtube1.csv' using PigStorage(',') as (video_id:chararray,
title:chararray, channel:chararray, category_id:int, publish_time:chararray, views:double,
likes:double, dislikes:double, comment:double, thumbnail:chararray, comments_d:chararray);

Output:-

➢ Describe youtube;
➢ dump youtube;
I. Convert the date field into datetime format.

➢ y = foreach youtube generate video_id, title, channel, category_id,
ToDate(publish_time,'dd-MM-yyyy HH:mm') as d1, views, likes, dislikes, comment,
thumbnail, comments_d;

➢ Describe y;

➢ Dump y;
2) Store the relation into a file.
Ans:-
➢ store youtube into '/mri';
3) Generate a 2% sample of this dataset, display it, and store it.
Ans:-

➢ se = sample youtube 0.02;

➢ dump se;

Output:-

➢ store se into '/k';


4) Display the unique category_id values from youtube1.csv.
Ans:-

➢ cat = foreach youtube generate category_id;
➢ d = distinct cat;

➢ dump d;
5) List the video_id and thumbnail where the publish year is 2017 and the title starts with ‘E’.
Ans:-

➢ v = filter y by STARTSWITH(title,'E') and GetYear(d1) == 2017;

➢ data = foreach v generate video_id,thumbnail;

➢ dump data;
6) Rank the videos by channel, where category_id = 1.
Ans:-
➢ ch = filter youtube by category_id == 1;
➢ ch1 = rank ch by channel;
➢ dump ch1;

Output:-
7) Display category_id and title where comments_disabled is ‘TRUE’, and count the number
of videos with comments disabled.
Ans:-
➢ fun = filter youtube by comments_d matches '.*TRUE.*';
➢ f0 = foreach fun generate category_id, title;
➢ dump f0;
➢ fun1 = group fun by comments_d;
➢ fun2 = foreach fun1 generate group, COUNT(fun.comments_d);
➢ dump fun2;

Output:-
8) Display channel, title, likes, and dislikes of videos published in NOV and DEC 2017,
ordered by likes desc.

Ans:-
➢ to = filter y by GetYear(d1) == 2017 and GetMonth(d1) >= 11;

➢ to1 = foreach to generate title, channel, likes, dislikes;

➢ t2 = order to1 by likes desc;

➢ dump t2;

Output:-
9) Split this dataset into 4 relations: one containing dislikes < 8000, another with
dislikes >= 8000 and < 150000, another with dislikes >= 150000 and < 700000, and the
remaining rows in a 4th relation.
Store all relations into the root of HDFS.
Ans:-
➢ split youtube into c1 if dislikes < 8000, c2 if dislikes >= 8000 and dislikes < 150000,
c3 if dislikes >= 150000 and dislikes < 700000, c4 otherwise;

➢ store c1 into '/m1';

Output:-

➢ store c2 into '/m2';

➢ store c3 into '/m3';

➢ store c4 into '/m4';


10) Display title, channel, and dislikes of videos published in NOV with dislikes > 5 lakhs.
Ans:-
➢ m = filter y by GetMonth(d1)==11 and dislikes > 500000;
➢ m1 = foreach m generate title,channel,dislikes;
➢ dump m1;
Output:-
11) List the video_id, title, likes, and thumbnail of all videos released before 2017, ordered
by likes desc.
Ans:-
➢ ht = filter y by GetYear(d1)<2017;
➢ ht1 = foreach ht generate video_id,title,likes,thumbnail;
➢ ht2 = order ht1 by likes desc;
➢ dump ht2;
Output:-
12) Count the number of videos released in the year 2018, grouped by comments_disabled.
Ans:-
➢ y1 = filter y by GetYear(d1) == 2018;

➢ y2 = group y1 by comments_d;

➢ y3 = foreach y2 generate group,COUNT(y1.comments_d);

➢ dump y3;

Output:-
13)
(i) Find the maximum dislikes over all videos with channel_title = ‘nigahiga’.
(ii) Find the minimum likes over all videos with channel_title = ‘Ed Sheeran’.
Ans:-
➢ i) f = filter youtube by channel matches '.*nigahiga.*';
➢ fc = group f by channel;
➢ fcc = foreach fc generate group, MAX(f.dislikes);
➢ dump fcc;
Output:

➢ ii) f2 = filter youtube by channel matches '.*Ed Sheeran.*';

➢ fc1 = group f2 by channel;
➢ fcc1 = foreach fc1 generate group, MIN(f2.likes);
➢ dump fcc1;
Output:-
14) List the details of all videos of channel ‘NELK’ whose publish year is 2017, ordered by
views.
Ans:-

➢ un = filter y by channel matches '.*NELK.*' and GetYear(d1) == 2017;

➢ un1 = order un by views;

➢ dump un1;

Output:-
15) Display the top 40 videos (video_id, title, likes, comments_d) according to their rank on
the basis of likes, where comments are disabled.
Ans:-
➢ b = filter youtube by comments_d matches '.*TRUE.*';

➢ b0 = order b by likes desc;

➢ b1 = limit b0 40;

➢ b2 = foreach b1 generate video_id, title, likes, comments_d;

➢ d2 = rank b2 by likes desc DENSE;

➢ dump d2;

Output:-
16) List all the videos where dislikes > 6000 and likes < 150000, ordered by video_id desc.
Ans:-
➢ fg = filter youtube by likes <150000 and dislikes >6000;
➢ fg1 = order fg by video_id desc;
➢ dump fg1;

Output:-

17) List thumbnail and publish_date of videos published in 2016 after 9 P.M.

Ans:-
➢ j = filter y by GetYear(d1)==2016 and GetHour(d1)>=21;
➢ j1 = foreach j generate thumbnail,d1;
➢ dump j1;
Output:-
18) List the average comments by category_id.
Ans:-
➢ r = group youtube by category_id;
➢ r1 = foreach r generate group, AVG(youtube.comment);
➢ dump r1;

Output:-

19) List the channel_title and thumbnail where category_id is 23 and the title contains the
string ‘oss’.
Ans:-
➢ t = filter youtube by category_id == 23 and title matches '.*oss.*';
➢ tf = foreach t generate channel,thumbnail;
➢ dump tf;

Output:-
20) List the title, channel_title, and publish_time where video_id starts with ‘2’ and ends
with ‘E’.
Ans:-
➢ h = filter y by STARTSWITH(video_id,'2') and ENDSWITH(video_id,'E');

➢ h1 = foreach h generate title,channel,d1;

➢ dump h1;

Output:-
21) Display title along with the channel (by joining two copies of the dataset).
Ans:-
➢ youtube = load '/youtube1.csv' using PigStorage(',') as (video_id:chararray,
title:chararray, channel:chararray, category_id:int, publish_time:chararray, views:double,
likes:double, dislikes:double, comment:double, thumbnail:chararray, comments_d:chararray);

➢ youtube1 = load '/youtube1.csv' using PigStorage(',') as (video_id:chararray,
title:chararray, channel:chararray, category_id:int, publish_time:chararray, views:double,
likes:double, dislikes:double, comment:double, thumbnail:chararray, comments_d:chararray);

➢ a2 = join youtube by title, youtube1 by channel;

➢ dump a2;
