
Big Data Analytics

Group Project

Submitted to
Prof. R K Jena

Submitted by

Anisha Vijay 201811004


Faizan Ali Sayyed 201811017
Yatindra Bapna 201831050

CONTENTS

1. Introduction to Big Data Analytics for Business
2. HDFS and Map-Reduce
3. Hive
   3.2 Hive Case study
      3.2.1 About Data set
      3.2.2 List of Queries
      3.2.3 Solutions
4. PIG
   4.2 PIG Case study
      4.2.1 About Data set
      4.2.2 List of Queries
      4.2.3 Solutions
5. Conclusion and Learning
Introduction to Big Data Analytics for Business

The act of gathering and storing large amounts of information for eventual analysis is ages old.
However, a new term with a broadly similar meaning has come about: Big Data. In simple
terms, big data is data that cannot be handled by a traditional RDBMS. Big data comes in large
volumes, often in petabytes or zettabytes and more, and it may be in structured or
unstructured format, which makes such data complicated to manage. But the data has to be managed
and analyzed to make predictions, analyze consumer behavior, forecast trends and support better
choices. Big data analytics is the method of analyzing this data, in which different tools are
used to extract the desired results. Such tools include Hadoop and other vendor-specific
products. Big data analytics is making life easier. Big data refers to larger, more complex data sets,
especially from new data sources. These data sets are so voluminous that traditional data
processing software just can't manage them. But these massive volumes of data can be used to
address business problems you wouldn't have been able to tackle before.

On a broad scale, data analytics technologies and techniques provide a means to analyze data
sets and draw conclusions from them, which helps organizations make informed business
decisions. Business intelligence (BI) queries answer basic questions about business operations
and performance. Big data analytics is a form of advanced analytics, which involves complex
applications with elements such as predictive models, statistical algorithms and what-if
analysis powered by high-performance analytics systems.

Example of Big Data:


A network of cafés can collect data about their customers' activities. Say a customer visited
a café to buy breakfast – this leaves one entry in the database. Next morning, the customer
redeemed a promo coupon – another entry is added. The customer commented on a social
network about how impressed they are with the café – this adds one more entry. To get the full picture,
the café should be able to store and process the data of all its customers (no matter whether
it is transactional, web behavior or text data), while each minute brings new data entries. This
leads us to the convenience of storing data on numerous computers in a distributed manner and
running processing jobs in parallel.
A crucial segment which has gained momentum is social media analytics.

In the age of Facebook, Instagram and Twitter we can't just ignore these platforms. People
praise and post negative criticism on social media without a second thought, so it becomes
crucial to give it equal, if not more, importance. There are many software products available in
the market for data analytics, and they provide a lot of services embedded in them. They may
have threatened the independent service providers who charge big firms a fortune for every
service. For example, if a firm wants to extract data from a particular website and also use social
media analytics, providers charge separately for each service. There are times when one service
provider may not even have the other analytics software; in that case the personnel have to
approach a whole different software company to get the job done. This creates multiple software
clients, costs a fortune, and makes it troublesome to manage so many providers. There were
companies that spent over a billion dollars annually on these services. In 2010 the data analytics
industry earned billions of dollars providing these services as separate offerings. Big data will
continue to grow, and introducing more and more servers is not the best solution as it will just
add to a company's expenses. If only there were a single compact solution for every need of every
industry, the world would be a better place to live.

Big Data Applications:


Big data is used to improve many aspects of our cities and countries. For example, it
allows cities to optimize traffic flows based on real-time traffic information as well as social
media and weather data. A number of cities are currently piloting big data analytics with the
aim of turning themselves into Smart Cities, where the transport infrastructure and utility
processes are all joined up: a bus would wait for a delayed train, and traffic signals would
predict traffic volumes and operate to minimize jams. Big data is also applied heavily in improving
security and enabling law enforcement. You may be aware of the revelations that the
National Security Agency (NSA) in the U.S. uses big data analytics to foil terrorist plots (and
maybe spy on us). Others use big data techniques to detect and prevent cyber-attacks. Police
forces use big data tools to catch criminals and even predict criminal activity, and credit card
companies use big data to detect fraudulent transactions.

Sentiment Analysis:
A large airline company started monitoring tweets about its flights to see how customers were
feeling about upgrades, new planes, entertainment, etc. Nothing special there, except that they
began feeding this information into their customer support platform and resolving issues in real
time.

One memorable instance occurred when a customer tweeted negatively about lost luggage
before boarding his connecting flight. The airline picked up the tweet and offered him a
free first-class upgrade on the way back. They also tracked the luggage and gave him information
on where it was and when it would be delivered.

Needless to say, he was pretty shocked about it and tweeted like a happy camper throughout
the rest of his trip.

Sentiment analysis is the analysis of the sentiment behind a piece of data. A basic task in sentiment
analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect
level: whether the expressed opinion in a document, a sentence or an entity feature/aspect is
positive, negative, or neutral. Advanced, "beyond polarity" sentiment classification looks, for
instance, at emotional states such as "angry," "sad," and "happy."

HDFS and Map-Reduce
Introduction
HDFS and MapReduce are Hadoop's two main parts, where HDFS covers the 'infrastructure'
perspective and MapReduce the 'programming' aspect. Although HDFS is currently an
Apache Hadoop sub-project, it was originally created as web search engine infrastructure for
the Apache Nutch project.

The primary data storage system used by Hadoop applications is the Hadoop Distributed File
System (HDFS). It uses a NameNode and DataNode architecture to implement a distributed
file system that provides high-performance data access across highly scalable Hadoop clusters.
HDFS is a main component of many Hadoop ecosystem technologies, as it offers a reliable way
to manage pools of big data and to support the associated big data analytics applications.

How HDFS works


HDFS supports fast data transfer between compute nodes. At the beginning, it was
closely coupled with MapReduce, a programmatic data processing framework. When HDFS
receives data, it breaks the information down into separate blocks and distributes them to distinct
nodes within a cluster, allowing highly efficient parallel processing.

In addition, the Hadoop Distributed File System is designed to be highly fault-tolerant. The
file system replicates, or copies, each piece of data multiple times and distributes the copies
to individual nodes, placing at least one copy on a server rack that is different from the others.
As a consequence, the data from nodes that crash can be found elsewhere in the cluster.

This ensures that processing can continue while data is recovered. HDFS uses a master/slave
architecture. In its original incarnation, each Hadoop cluster consisted of a single NameNode,
which managed file system operations, and supporting DataNodes, which managed data
storage on individual compute nodes. The HDFS components combine to support applications
with large data sets.
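As a small illustration (a sketch assuming a running Hadoop installation and a hypothetical file named sales.csv), the standard HDFS shell commands below copy a file into HDFS, list the target directory, and raise the replication factor of the file to three copies:

hadoop fs -put sales.csv /user/data/         # the file is split into blocks and distributed across DataNodes
hadoop fs -ls /user/data/                    # list the contents of the HDFS directory
hdfs dfs -setrep -w 3 /user/data/sales.csv   # keep three replicas of each block for fault tolerance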

Features of HDFS

 It is suitable for distributed storage and processing.
 Hadoop provides a command-line interface to interact with HDFS.
 The built-in NameNode and DataNode servers help users easily check the status of the
cluster.
 It provides streaming access to file system data.
 HDFS provides file permissions and authentication.

Goals of HDFS
Fault detection and recovery - Because HDFS includes a large amount of commodity
hardware, component failure is common. HDFS should therefore have mechanisms for quick
and automatic fault detection and recovery.

Huge datasets - To handle applications with enormous datasets, HDFS should scale to hundreds
of nodes per cluster.

Hardware at data - A requested task can be performed efficiently when the computation takes
place close to the data. Especially where large data sets are involved, this reduces network
traffic and increases throughput.

MapReduce
MapReduce is a processing technique and a programming model for Java-based distributed
computing. There are two significant tasks in the MapReduce algorithm, namely Map and
Reduce. Map takes a set of data and converts it into another set of data, in which individual
elements are broken down into tuples (key/value pairs). The reduce task takes the output from
a map as its input and combines those data tuples into a smaller set of tuples. As the name
MapReduce implies, the reduce task is always carried out after the map job. MapReduce's main
benefit is that data processing is simple to scale across multiple computing nodes. Under the
MapReduce model, the data processing primitives are called mappers and reducers.
Decomposing a data processing application into mappers and reducers is sometimes not trivial.
However, once we write an application in the MapReduce form, scaling the application to run
over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a
configuration change. This simple scalability is what has drawn many programmers to the
MapReduce model.

The whole process of computation is broken down into the mapping, shuffling and reducing
stages.

Mapping Stage: This is the first phase of MapReduce and involves reading data from the Hadoop
Distributed File System (HDFS). The data may be in a folder or in files. The input data file is
fed to the mapper function one line at a time. The mapper then processes the data and reduces
it into smaller blocks of data.

Reducing Stage: The reducer stage can consist of multiple processes. During the shuffling
phase, the data is transferred from the mapper to the reducer. Without the successful shuffling
of the data there would be no input to the reducer phase. The shuffling process can, however,
begin even before the mapping process is complete. Next, the data is sorted to decrease the
time taken by the reduce step. The sorting effectively helps the reducing process by providing
a cue when the next key in the sorted input data is different from the previous key. The reduce
task needs a specific key-value pair in order to call the reduce function, which takes the
key-value pair as its input. The output of the reducer can be written directly to HDFS.
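As an illustrative (hypothetical) word-count example, suppose the input consists of the two lines "big data" and "big hive". The three stages would then produce:

Map output: (big, 1), (data, 1), (big, 1), (hive, 1)
Shuffle and sort: big -> [1, 1]; data -> [1]; hive -> [1]
Reduce output: (big, 2), (data, 1), (hive, 1)

Here the mapper emits a (word, 1) pair for every word it reads, the shuffle phase groups all values that share the same key, and the reducer sums the values for each key.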

Hive
Introduction
Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It sits
on top of Hadoop to summarize Big Data and makes querying and analyzing easy.

Hive is open source software that allows programmers to analyze Hadoop's big data
sets. In the business intelligence sector, the volume of data sets being gathered and
analyzed keeps increasing, making traditional data warehousing solutions more costly.
Hadoop with the MapReduce framework is used as an alternative for analyzing data sets of
enormous size.

Although Hadoop has proven to be helpful for working on enormous data sets, its
MapReduce framework is very low level, requiring programmers to write custom programs that
are difficult to maintain and reuse. This is where Hive comes to the rescue of programmers.

Hive offers a declarative, SQL-like language called HiveQL, which is used to express
queries. Using the SQL-like HiveQL, users can readily conduct data analysis. These queries are
compiled by the Hive engine into MapReduce jobs to be executed on Hadoop.
Additionally, it is also possible to plug custom MapReduce scripts into queries.

How Apache Hive works


Hadoop processing initially depended exclusively on the MapReduce framework, which
required users to understand sophisticated Java programming in order to query data
effectively. The motivation behind Apache Hive was to simplify the writing of queries and, in turn,
to open unstructured Hadoop data to a wider community of users in organisations.

Hive has three primary functions: summarizing, querying and analyzing data. It supports
queries expressed in a language called HiveQL, or HQL, a declarative SQL-like language that,
in its first incarnation, automatically translated SQL-style queries into MapReduce jobs executed
on the Hadoop platform. Additionally, HiveQL supported custom MapReduce scripts that could
be plugged into queries.

When SQL queries are submitted via Hive, they are first received by a driver component that
creates session handles and forwards the request, via Java Database Connectivity / Open
Database Connectivity interfaces, to a compiler, which eventually hands jobs off for execution.
Hive supports data serialization/deserialization and improves flexibility in schema design
by including a system catalog called the Hive Metastore.

How Hive has evolved


Like Hadoop itself, Hive has evolved to encompass more than just MapReduce. The inclusion
of the YARN resource manager in Hadoop 2.0 helped expand the ways Hive can be used, as did
other parts of the Hadoop ecosystem. Over time, HiveQL and the Hive engine have added
support for distributed processing execution via Apache Tez and Apache Spark, alongside the
original MapReduce engine.

Early Hive file support consisted of text files (also known as flat files), SequenceFiles (flat files
composed of binary key/value pairs) and Record Columnar Files (RCFiles), which store table
data in a columnar fashion. Hive's columnar storage support has since grown to include
Optimized Row Columnar (ORC) files and Parquet files.
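The file format is chosen when a table is created. A minimal HiveQL sketch, using a hypothetical table, could look like this:

CREATE TABLE sales_orc (id INT, amount DOUBLE)
STORED AS ORC;   -- columnar storage; TEXTFILE, RCFILE or PARQUET could be specified instead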

Since its beginnings, Hive query execution and interactivity have been a subject of attention,
because query results lagged behind those of more familiar SQL engines. Apache Hive
committers started work on the Stinger project in 2013 to increase efficiency, bringing Apache
Tez and directed acyclic graph (DAG) processing to the warehouse system.

Stinger was also accompanied by new methods that enhanced efficiency by incorporating a
cost-based optimizer, in-memory hash joins, a vector query engine, and other improvements.
Recent versions of Hive have recorded query throughput reaching 100,000 queries per hour
and analytics processing of 100 million rows per second.

Uses of Hive:
1. Apache Hive provides distributed storage.

2. Hive offers tools for easy extraction/transformation/loading (ETL) of data.

3. It supports a variety of data formats out of the box.

4. Using Hive, we can access files stored in the Hadoop Distributed File System (HDFS is
used to query and manage large datasets residing in it) or in other data storage systems
such as Apache HBase.

Limitations of Hive
1. Hive is not intended for online transaction processing (OLTP); it is used only for online
analytical processing.
2. Hive supports overwriting or appending data, but does not support updating and deleting
data.
3. Sub-queries are not supported in Hive.

Why is Hive used in spite of Pig?


Despite the availability of Pig, the following are the reasons why Hive is used:

Hive-QL is a declarative language like SQL, whereas Pig Latin is a data flow language. Pig is a
language and environment for exploring very large datasets through data flows, while Hive is a
distributed data warehouse.

Hive Commands:
Data Definition Language (DDL)
DDL statements are used to build and modify tables and other objects in the database.
Example:
CREATE, DROP, TRUNCATE, ALTER, SHOW, DESCRIBE statements.
Go to the Hive shell by giving the command sudo hive and enter the
command 'create database <database name>' to create a new database in Hive.
To list the databases in the Hive warehouse, enter the command 'show databases'.
The command to use a database is USE <database name>.
DESCRIBE provides information about the schema of a table.
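A minimal DDL sketch, assuming a hypothetical retail database and customers table, ties these statements together:

CREATE DATABASE retail;
USE retail;
CREATE TABLE customers (id INT, name STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
SHOW TABLES;            -- list the tables in the current database
DESCRIBE customers;     -- show the schema of the table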
Data Manipulation Language (DML)
DML statements are used to retrieve, store, modify, delete, insert and update data in the
database.
Example:
LOAD, INSERT statements.
Syntax:
LOAD DATA <LOCAL> INPATH <file path> INTO TABLE [tablename]
Insert Command:
The INSERT command is used to load data into a Hive table. Inserts can be done to a table or to a
partition.
• INSERT OVERWRITE is used to overwrite the existing data in the table or partition.
• INSERT INTO is used to append data to the existing data in a table. (Note: the INSERT INTO
syntax works from version 0.8 onwards.)
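A minimal DML sketch for the same hypothetical customers table (customers_copy is a second, hypothetical table with the same schema):

LOAD DATA LOCAL INPATH '/home/cloudera/customers.csv' INTO TABLE customers;
INSERT OVERWRITE TABLE customers_copy SELECT * FROM customers;   -- replaces the existing data
INSERT INTO TABLE customers_copy SELECT * FROM customers;        -- appends to the existing data (Hive 0.8 onwards)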

Hive Case study

About Data set: FIFA 19


FIFA 19 maintains a database of all professional football players in the world. The
dataset consists of player details for 18,207 players. It includes player name, FIFA ID, age,
nationality, club, value, wage, contract validity and release clause. A second table gives
performance details for these players, including position, age, overall rating, potential rating,
international reputation on a scale of 1 to 5, skill moves rating on a scale of 1 to 5, height and
weight. Both tables have multiple attributes that are categorical or numeric in nature.

Questions:
1. Find the performance matrix of all the players based on ID.
2. Find the sum of the total wages of all players.
3. Find the scope for improvement towards the potential score for each player.
4. Find the players with 5-star skill moves.
5. Find the body mass index (BMI) for each player.
6. Find the count of players with nationality Brazil.
7. Compare the total value of players belonging to nationality France.
8. How many distinct countries have players playing football?
9. What is the average wage of a football player?
10. Find 10 distinct clubs for the top-value players.

List of Queries:
 Create table Fifa1 (id int, name string, foot string, position string, age int, overall
int, potential int, rep int, skills int, height double, weight double) row format
delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
 Create table Fifa2 (id int, name string, age int, nationality string, club string, value
double, wage int, contract string, clause double) row format delimited fields
terminated by ',' lines terminated by '\n' stored as textfile;
 Set hive.cli.print.header = true;
 hadoop fs -put Fifa1 /user/Faizan
 hadoop fs -ls /user/Faizan
 hadoop fs -put Fifa2 /user/Faizan
 hadoop fs -ls /user/Faizan
 Load data local inpath '/home/cloudera/Faizaqn/Hive/Fifa19 Bigdata Dataset.csv'
overwrite into table Fifa1;
 Load data local inpath '/home/cloudera/Faizaqn/Hive/Fifa19 Bigdata
Dataset2.csv' overwrite into table Fifa2;
 Select * from Fifa1;

 Select * from Fifa2;

1. select id, overall, rep, skills from Fifa1;


ID  Overall  International Reputation  Skill Moves
158023 94 5 4
20801 94 5 5
190871 92 5 5
193080 91 4 1
192985 91 4 4
183277 91 4 4
177003 91 4 4
176580 91 5 3
155862 91 4 3
200389 90 3 1
188545 90 4 4
182521 90 4 3
182493 90 3 2
168542 90 4 4
215914 89 3 2
211110 89 3 4
202126 89 3 3
194765 89 4 4
192448 89 3 1
192119 89 4 1
189511 89 4 3
179813 89 4 3
167495 89 5 1
153079 89 4 4
138956 89 4 2
231747 88 3 5
209331 88 3 4
200145 88 3 2

Similar records exist for all 10,716 IDs


2. select sum(wage) from Fifa2;
163455 Euros

3. select id, overall, potential, (potential - overall) as diff from Fifa1;

ID Overall Potential Difference
158023 94 94 0
20801 94 94 0
190871 92 93 1
193080 91 93 2
192985 91 92 1
183277 91 91 0
177003 91 91 0
176580 91 91 0
155862 91 91 0
200389 90 93 3
188545 90 90 0
182521 90 90 0
182493 90 90 0
168542 90 90 0
215914 89 90 1
211110 89 94 5
202126 89 91 2
194765 89 90 1
192448 89 92 3
192119 89 90 1
189511 89 89 0
179813 89 89 0
167495 89 89 0
153079 89 89 0
138956 89 89 0
231747 88 95 7
209331 88 89 1
200145 88 90 2

Similar records exist for all 10,716 IDs


4. select id, name, foot, skills from Fifa1 where skills = 5;
ID  Name  Preferred Foot  Skill Moves
20801 Cristiano Ronaldo Right 5
190871 Neymar Jr Right 5
231747 K. Mbappé Right 5
189242 Coutinho Right 5
176676 Marcelo Left 5
195864 P. Pogba Right 5
190483 Douglas Costa Left 5
189509 Thiago Right 5
204485 R. Mahrez Left 5
41236 Z. Ibrahimović Right 5
202556 M. Depay Right 5
193082 J. Cuadrado Right 5
183898 A. Di María Left 5
20775 Quaresma Right 5
213345 K. Coman Right 5
208808 Q. Promes Right 5
156616 F. Ribéry Right 5
227055 Gelson Martins Right 5
212404 F. Bernardeschi Left 5
198717 W. Zaha Right 5

In total, 51 records have 5-star skill moves

5. select id, name, (weight * 0.453592) / ((height * 0.3048) * (height * 0.3048)) as BMI,
height * 0.3048 as h1, weight * 0.453592 as w1 from Fifa1;

ID  Name  BMI  h1 (m)  w1 (kg)
158023 L. Messi 24.66438 1.71 72.12
20801 Cristiano Ronaldo 23.99333 1.86 83.01
190871 Neymar Jr 21.71751 1.77 68.04
193080 De Gea 20.67151 1.92 76.20
192985 K. De Bruyne 29.72363 1.53 69.85
183277 E. Hazard 24.4205 1.74 73.94
177003 L. Modrić 21.87357 1.74 66.22
176580 L. Suárez 26.59953 1.80 86.18
155862 Sergio Ramos 25.33955 1.80 82.10
200389 J. Oblak 25.17333 1.86 87.09
188545 R. Lewandowski 24.63957 1.80 79.83
182521 T. Kroos 23.51959 1.80 76.20
182493 D. Godín 22.55111 1.86 78.02
168542 David Silva 22.17321 1.74 67.13
215914 N. Kanté 25.55312 1.68 72.12
211110 P. Dybala 31.97175 1.53 74.84
202126 H. Kane 25.69778 1.86 88.90
194765 A. Griezmann 23.31013 1.77 73.03
192448 M. ter Stegen 24.51778 1.86 84.82
192119 T. Courtois 24.52849 1.98 96.16
189511 Sergio Busquets 22.02667 1.86 76.20

6. select nationality, count(id) from Fifa2 where nationality = 'Brazil' group by nationality;

Brazil 738

7. select Fifa2.id, Fifa2.nationality, Fifa2.value from Fifa1 join Fifa2 on (Fifa1.id = Fifa2.id)
where Fifa2.nationality = 'France';

ID Nationality Value
235456 France 600
231103 France 600
184763 France 600
240057 France 600
232117 France 600
244402 France 600
240050 France 600
200876 France 600
243627 France 1.1
177568 France 600
172952 France 600
228759 France 600
244117 France 600
228240 France 600
215914 France 63
225168 France 600
237198 France 600
194765 France 78
237708 France 1.1
244350 France 600
220030 France 600
225149 France 600
213368 France 600
209784 France 600

select sum(Fifa2.value) from Fifa1 join Fifa2 on (Fifa1.id = Fifa2.id)
where Fifa2.nationality = 'France';

France 100940.4

8. select nationality, count(id) from Fifa2 group by nationality;
Country  Count of Players
Albania 25
Algeria 54
Angola 11
Antigua & Barbuda 1
Argentina 681
Armenia 8
Australia 89
Austria 146
Azerbaijan 3
Barbados 1
Belarus 4
Belgium 184
Benin 11
Bermuda 1
Bolivia 17
Bosnia Herzegovina 44
Brazil 738
Bulgaria 17
Burkina Faso 13
Burundi 1
Cameroon 62
Canada 27
Cape Verde 19
Central African Rep. 3
Chad 2
Chile 222
China PR 84
Colombia 351
Comoros 4
Congo 10
Costa Rica 24
Croatia 93

9. select avg(wage) from Fifa2;


15,000 pounds a week

10. select Fifa2.id as id, Fifa2.name as name, Fifa2.club as club, Fifa2.value as value
from Fifa1 join Fifa2 on (Fifa1.id = Fifa2.id) order by value desc limit 10;

ID  Name  Club  Value

158023 L. Messi FC Barcelona 110.5
20801 Cristiano Ronaldo Juventus 77
190871 Neymar Jr Paris Saint-Germain 118.5
193080 De Gea Manchester United 72
192985 K. De Bruyne Manchester City 102
183277 E. Hazard Chelsea 93
177003 L. Modrić Real Madrid 67
200389 J. Oblak Atlético Madrid 68
188545 R. Lewandowski FC Bayern München 77
182521 T. Kroos Real Madrid 76.5
202126 H. Kane Tottenham Hotspur 83.5
194765 A. Griezmann Atlético Madrid 78

Introduction to PIG

Apache Pig is a platform for analyzing large data sets as data flows. It is intended
to provide an abstraction over MapReduce, reducing the complexity of writing a MapReduce
program. With Apache Pig, we can very easily carry out data manipulation operations in
Hadoop.
The features of Apache Pig are:
 Pig allows programmers who do not know Java to write complex data transformations.
 Apache Pig has two primary parts: the Pig Latin language and the Pig run-time
environment in which Pig Latin programs are executed.
 Pig provides a simple data flow language known as Pig Latin for Big Data analytics,
with SQL-like functionalities such as join, filter, limit, etc.
 Developers who work with scripting languages and SQL leverage Pig Latin. This gives
developers ease of programming with Apache Pig. Pig Latin provides a variety of
built-in operators to read, write, and process large data sets, such as join, sort, filter,
etc. Thus Pig evidently has a rich set of operators.
 Programmers write scripts using Pig Latin to analyze data, and these scripts are
internally converted to Map and Reduce tasks by the Pig MapReduce engine. Before
Pig, writing MapReduce tasks was the only way to process the data stored in HDFS.
 If a programmer wants to write custom functions which are unavailable in Pig, Pig allows
them to write User Defined Functions (UDFs) in any language of their choice, like Java,
Python, Ruby, Jython or JRuby, and embed them in the Pig script. This provides
extensibility to Apache Pig.
 Pig can process any kind of data, i.e. structured, semi-structured or unstructured data,
coming from various sources.
 Approximately 10 lines of Pig code are equal to 200 lines of MapReduce code. Pig can
handle inconsistent schemas (in the case of unstructured data). Apache Pig extracts the data,
performs operations on that data and dumps the data in the required format in HDFS,
i.e. ETL (Extract, Transform, Load).
 Apache Pig automatically optimizes the tasks before execution, i.e. automatic
optimization. It allows programmers and developers to concentrate on the whole
operation instead of creating mapper and reducer functions separately.
Apache Pig is used for analyzing and performing tasks involving ad-hoc processing. Apache
Pig is used:
Where we need to process huge data sets like web logs, streaming online data, etc.
Where we need data processing for search platforms (where different types of data need to be
processed); for example, Yahoo uses Pig for 40 percent of its jobs, including news feeds and its
search engine.
Where we need to process time-sensitive data loads. Here, data needs to be extracted and
analyzed quickly. For example, machine learning algorithms require time-sensitive data loads:
Twitter needs to quickly extract data about user activities (i.e. tweets, re-tweets and likes),
analyze the data to find patterns in user behaviour, and make recommendations immediately,
such as trending tweets.
Apache Pig Architecture
For writing a Pig script, we need the Pig Latin language, and to execute it, we need an execution
environment.

We submit Pig scripts, written in Pig Latin using built-in operators, to the Apache Pig
execution environment.
There are three ways to execute a Pig script:
Grunt Shell: This is Pig's interactive shell, provided to execute all Pig scripts.
Script File: Write all the Pig commands in a script file and execute the Pig script file.
This is executed by the Pig Server.
Embedded Script: If some functions are unavailable in the built-in operators, we can
programmatically create User Defined Functions to provide that functionality using other
languages like Java, Python, Ruby, and so forth, and embed them in the Pig Latin script file. Then,
execute that script file.
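For illustration, the following commands (a sketch, assuming Pig is installed and on the PATH) show how the first two modes are launched from the shell:

pig -x local          # start the Grunt interactive shell in local mode
pig myscript.pig      # run a Pig Latin script file (here a hypothetical myscript.pig) in MapReduce mode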
Parser

After passing through the Grunt shell or Pig Server, Pig scripts are handed to the Parser. The
Parser does type checking and checks the syntax of the script. The parser outputs a DAG
(directed acyclic graph), which represents the Pig Latin statements and logical operators: the
logical operators are represented as the nodes and the data flows are represented as edges.
Optimizer
The DAG is then submitted to the optimizer. The optimizer performs optimization activities
such as split, merge, transform and reorder operators. This optimizer provides the
automatic optimization feature to Apache Pig. The optimizer basically aims to reduce the
quantity of data in the pipeline at any instant of time while processing the extracted
data, and for that it performs functions like:
 PushUpFilter: If there are multiple conditions in the filter and the filter can be
split, Pig splits the conditions and pushes up each condition separately.
Selecting these conditions earlier helps in reducing the number of records remaining in
the pipeline.
 PushDownForEachFlatten: Applying flatten, which produces a cross product between a
complex type such as a tuple or a bag and the other fields in the
record, as late as possible in the plan. This keeps the number of records low
in the pipeline.
 ColumnPruner: Omitting columns that are never used or no longer needed, reducing
the size of the record. This can be applied after each operator, so that fields
can be pruned as aggressively as possible.
 MapKeyPruner: Omitting map keys that are never used, reducing the size of
the record.
 LimitOptimizer: If the limit operator is applied immediately after a load or sort
operator, Pig converts the load or sort operator into a limit-sensitive implementation,
which does not require processing the complete data set. Applying the
limit earlier reduces the quantity of data, as illustrated in the sketch below.
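A minimal Pig Latin sketch of the LimitOptimizer case, assuming a hypothetical comma-separated file /user/data/scores.csv:

scores = LOAD '/user/data/scores.csv' USING PigStorage(',') AS (id:int, score:int);
top20 = LIMIT scores 20;    -- the limit sits directly after the load, so Pig need not read the whole data set
DUMP top20;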
Compiler
After the optimization process, the compiler compiles the optimized code into a series of
MapReduce jobs. The compiler is responsible for automatically converting Pig jobs
into MapReduce jobs.
Execution engine
Finally, these MapReduce jobs are submitted to the execution engine for execution. The
MapReduce jobs are then executed and give the required result. The result can be displayed
on the screen using the DUMP statement and can be saved in HDFS using the STORE
statement.
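As an end-to-end illustration of this flow, the classic word-count script below (a sketch, assuming a hypothetical input file in HDFS) passes through the parser, optimizer and compiler described above and then runs as MapReduce jobs:

lines = LOAD '/user/data/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
DUMP counts;                                    -- display the result on the screen
STORE counts INTO '/user/data/wordcount_out';   -- or save it in HDFS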

PIG Case study

About Data set: Students


There are two datasets: Student Details and Student Score. Both datasets comprise records of
5,000 students. The Student Details dataset consists of roll number, gender, ethnicity, parental
level of education, type of lunch and test preparation course status. The Student Score dataset
consists of test scores in mathematics, reading and writing along with the total score. The data
has been classified into several categories. All the attributes in Student Details are of nominal
data type, while the test scores and total score are continuous numeric data. Scores for each
student range from 0 to 100, based on an objective test.

Questions:
1. Display the Total Score and Roll No. of all students.
2. List Math_Score in descending order.
3. List the Roll No. of students who are male.
4. Display the parental education level of all students whose test preparation is completed.
5. List the roll_no, writing_score and reading_score of all students who have a
Total_Score of more than 180.
6. List the Total_Score of the students who take the standard lunch.
7. Display the race, number of students, and maximum Total_Score for each race.

List of Queries:
 A = load '/user/bapna/StudentScore_Pig.csv' using PigStorage(',') as (Roll_no:long,
math_score:int, reading_score:int, writing_score:int, Total_Score:int);
 Dump A;
 B = load '/user/bapna/Students_Pig.csv' using PigStorage(',') as (Roll_no:long,
gender:chararray, race:chararray, parental_level_of_education:chararray,
lunch:chararray, test_preparation_course:chararray);
 Dump B;

Q1) Display Total Score and Roll No. of all students?


 C = foreach A generate Total_Score, Roll_no;
 DUMP C;

Roll No Total Score


11001 218

11002 247
11003 278
11004 148
11005 229
11006 232
11007 275
11008 122
11009 195
11010 148
11011 164

Q2) List Math_Score in descending order?


 D = foreach A generate math_score;
 D = ORDER D by math_score DESC;
 DUMP D;

Math
Score
100
100
100
100
100
100
100
99
99
99
99

Q3) List the Roll No. of students who are male?


 E = JOIN A by Roll_no, B by Roll_no;
 E = filter E by B::gender == 'male';
 E = foreach E generate A::Roll_no, B::gender;
 DUMP E;

Roll No gender

11004 male
11005 male
11008 male
11009 male
11011 male
11012 male
11014 male
11017 male
11019 male
11021 male
11023 male

Q4) Display the parental education level of all students whose test preparation is completed?
 F = filter B by test_preparation_course == 'completed';
 F = foreach F generate Roll_no, parental_level_of_education, test_preparation_course;
 DUMP F;

Roll No  Parental level of education  Test preparation course
11002 some college completed
11007 some college completed
11009 high school completed
11014 some college completed
11019 master's degree completed
11022 some college completed
11025 bachelor's degree completed
11036 associate's degree completed
11039 associate's degree completed
11044 some college completed

11047 associate's degree completed
11049 associate's degree completed
11050 high school completed
11052 associate's degree completed

Q5) List the roll_no, writing_score and reading_score of all students who have a
Total_Score of more than 180?
 G = filter A by Total_Score > 180;
 G = foreach G generate Roll_no, writing_score, reading_score;
 DUMP G;

Roll No reading score writing score


11459 100 100
11917 100 100
11963 100 100
11115 100 100
11180 100 100
11713 100 99
11166 100 100
11626 97 99
11150 100 93
11686 99 100
11904 100 100

Q6) List the Total_Score of the students who take the standard lunch?
 H = JOIN A by Roll_no, B by Roll_no;
 H = filter H by B::lunch == 'standard';
 H = foreach H generate A::Roll_no, A::Total_Score, B::lunch;
 DUMP H;

Roll No  Total Score  lunch

11001 218 standard
11002 247 standard
11003 278 standard
11005 229 standard
11006 232 standard
11007 275 standard
11011 164 standard
11012 135 standard
11013 219 standard
11014 220 standard
11015 161 standard

Q7) Display the race, number of students, and maximum Total_Score for each race?
 I = JOIN A by Roll_no, B by Roll_no;
 J = group I by B::race;
 K = foreach J generate group, MAX(I.A::Total_Score) as max_score, COUNT(I) as
num_students;
 DUMP K;

Roll No Total Score Race/Ethnicity


11547 289 group A
11001 218 group B
11002 247 group C
11009 195 group D
11033 193 group E

Conclusion & Learning:

The availability of Big Data, low-cost commodity hardware, and the latest information management
and analytics software have produced a unique moment in the history of data analysis. The
convergence of these trends means that, for the first time, we have the capability
to analyze astonishing data sets rapidly and cost-effectively. These are neither theoretical
nor trivial capabilities. They represent a real step forward and a clear opportunity to make huge gains in
efficiency, productivity, revenue, and profitability.

The era of Big Data analytics is here, and these are genuinely revolutionary times if business
and technology professionals keep working together and delivering on the promise.

The key learnings from this project are as follows:

 The need for and importance of Big Data analytics in various business contexts.
 Understanding the challenges of managing Big Data.
 The use of Hive and Pig for finding key elements of a dataset.
 The differences between Hive and Pig coding when inferring useful insights.
 Finding relationships across different datasets at the same time.

