
Saad Khan

MSCS
2nd semester
Content

 Introduction
 What is Big Data.
 Characteristics of Big Data
 Storing, Selecting and processing of Big Data
 Why Big Data
 How it is Different
 Hive
 Pig
 Flume
Introduction

 Big Data may well be the Next Big Thing in the IT world

 The first organizations to embrace it were online and startup firms. Firms like
Google, eBay, LinkedIn, and Facebook were built around big data from the
beginning.

 Big data burst upon the scene in the first decade of the 21st century
What is BIG DATA?
 ‘Big Data’ is similar to ‘small data’, but bigger in size.

 It aims to solve new problems, or old problems in a better way.
What is BIG DATA (Cont..)

 Walmart handles more than 1 million customer transactions every hour.

 Facebook handles 40 billion photos from its user base.


Characteristics of Big Data
Volume

 A typical PC might have had 10 gigabytes of storage in 2000

 Today, Facebook ingests 500 terabytes of new data every day

 A Boeing 737 will generate 240 terabytes of flight data during a single
flight across the US.
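
To put the Facebook figure above in perspective, a daily volume can be converted to a sustained rate. This is a rough sketch only. It assumes decimal units (1 TB = 10^12 bytes) and an even spread over 24 hours, neither of which the slide states.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_TB = 10**12  # assumption: decimal terabytes
BYTES_PER_GB = 10**9   # assumption: decimal gigabytes


def daily_tb_to_gb_per_second(tb_per_day):
    """Convert a daily volume in terabytes to an average rate in GB/s."""
    return tb_per_day * BYTES_PER_TB / SECONDS_PER_DAY / BYTES_PER_GB


if __name__ == "__main__":
    # 500 TB per day works out to a little under 6 GB every second.
    print(f"{daily_tb_to_gb_per_second(500):.2f} GB/s")
```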
Velocity

 Clickstreams and ad impressions capture user behavior at millions of events
per second.

 High-frequency stock trading algorithms reflect market changes within
microseconds.

 Machine-to-machine processes exchange data between billions of devices.


Variety

 Big Data isn't just numbers, dates, and strings. Big Data is also geospatial
data, 3D data, audio and video, and unstructured text, including log files
and social media.

 Traditional database systems were designed to address smaller volumes of
structured data, fewer updates, or a predictable, consistent data structure.
Storing Big Data

 Selecting data source for analysis

 Eliminating redundant data

 Establishing the role of NoSQL


Selecting Big Data Stores

 Choosing the correct data stores based on your data characteristics.

 Moving code to data.

 Implementing polyglot data store solutions


Processing Big Data

 Mapping data to the programming framework

 Connecting and extracting data from storage.

 Transforming data for processing.


Why Big Data

 Increase in storage capacity.

 Increase in processing power.

 Availability of data (different data types).


How is big data different?

 Automatically generated by a machine
(e.g. a sensor embedded in an engine)

 Typically an entirely new source of data
(e.g. use of the internet)

 Not designed to be friendly
(e.g. text streams)
Hive
What is Hive?

 Hive is a data warehouse infrastructure tool to process structured data in
Hadoop. It resides on top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.

 Hive was initially developed by Facebook. Later the Apache Software
Foundation took it up and developed it further as an open-source project
under the name Apache Hive.
Features of Hive

 It stores the schema in a database and the processed data in HDFS (Hadoop
Distributed File System).

 It is designed for OLAP.

 It provides an SQL-like language for querying called HiveQL or HQL.
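
The kind of OLAP-style summary that HiveQL expresses can be sketched as follows. This is not Hive itself: Python's built-in sqlite3 module stands in for it so the example runs anywhere. The `sales` table and its columns are hypothetical. Only the SELECT statement is meant to carry over, since it uses aggregate syntax (COUNT, SUM, GROUP BY, ORDER BY) that HiveQL shares with standard SQL.

```python
import sqlite3


def revenue_by_region(rows):
    """Summarize (region, amount) rows with an SQL aggregate query."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table; in Hive the schema would live in the Metastore.
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    query = (
        "SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue "
        "FROM sales GROUP BY region ORDER BY revenue DESC"
    )
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [("north", 10.0), ("south", 5.0), ("north", 2.5)]
    for row in revenue_by_region(sample):
        print(row)
```

The point of the sketch is that the analyst writes a declarative query and leaves the execution strategy to the engine.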


Architecture Of Hive

 User Interface - Hive is data warehouse infrastructure software that
enables interaction between the user and HDFS. The user interfaces that Hive
supports are Hive Web UI, Hive command line, and Hive HD Insight.

 Metastore - Hive chooses respective database servers to store the schema
or metadata of tables, databases, columns in a table, their data types, and
HDFS mapping.
Architecture Of Hive(Cont..)

 HiveQL Process Engine - HiveQL is similar to SQL and queries the schema
information in the Metastore. It is one of the replacements for the
traditional approach of writing a MapReduce program.

 HDFS or HBase - The Hadoop Distributed File System or HBase is the data
storage technique used to store data in the file system.
Working of Hive

 Get Plan - The driver takes the help of the query compiler, which parses
the query to check the syntax and the query plan or the requirements of the
query.

 Get Metadata - The compiler sends a metadata request to the Metastore.

 Send Metadata - The Metastore sends metadata as a response to the
compiler.
Working of Hive(Cont..)

 Send Plan- The compiler checks the requirement and resends the plan to
the driver. Up to here, the parsing and compiling of a query is complete.

 Execute Plan - The driver sends the execution plan to the
execution engine.
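
The hand-offs between driver, compiler, Metastore, and execution engine can be traced with a toy simulation. It is a sketch of the message order only, not of Hive's real classes. Every class and method name here is hypothetical, and the "parser" just looks for the word after FROM.

```python
class Metastore:
    """Holds table schemas (hypothetical stand-in for the Hive Metastore)."""

    def __init__(self, schemas):
        self.schemas = schemas

    def get_metadata(self, table):
        return self.schemas[table]


class Compiler:
    def __init__(self, metastore):
        self.metastore = metastore

    def get_plan(self, query, trace):
        trace.append("get_plan")
        words = query.split()
        table = words[words.index("FROM") + 1]  # naive parse, demo only
        trace.append("get_metadata")            # compiler asks the Metastore
        columns = self.metastore.get_metadata(table)
        trace.append("send_metadata")           # Metastore replies
        trace.append("send_plan")               # plan goes back to the driver
        return {"table": table, "columns": columns}


class ExecutionEngine:
    def execute(self, plan, trace):
        trace.append("execute_plan")
        column_list = ", ".join(plan["columns"])
        return "scan " + plan["table"] + " (" + column_list + ")"


class Driver:
    def __init__(self, compiler, engine):
        self.compiler = compiler
        self.engine = engine

    def run(self, query):
        trace = []
        plan = self.compiler.get_plan(query, trace)
        result = self.engine.execute(plan, trace)
        return trace, result


if __name__ == "__main__":
    store = Metastore({"logs": ["ts", "url"]})
    driver = Driver(Compiler(store), ExecutionEngine())
    steps, output = driver.run("SELECT * FROM logs")
    print(steps)
    print(output)
```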
Pig
What is Pig?

 A platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs.

 Compiles down to MapReduce jobs

 Developed by Yahoo!
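
For readers new to MapReduce, the model that Pig scripts compile down to can be sketched in a few lines. This is a single-process illustration of the map, shuffle, and reduce phases using word count, the usual teaching example. It is not code that Pig generates, and the function names are chosen for this sketch.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data", "Big pig"]))
```

On a real cluster the map and reduce functions run in parallel on many machines, and the framework performs the shuffle over the network.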
Pig Components

 Two Main Components
 High-level language (Pig Latin)
 Set of commands

 Two Execution Modes
 Local: reads/writes to the local file system
 MapReduce: connects to a Hadoop cluster and reads/writes to HDFS
Why Pig?

 Common design patterns as keywords (joins, distinct, counts)

 Data flow analysis

 Avoids Java-level errors


Language Features of Pig

 Keywords
 Load, Filter, Foreach ... Generate, Group By, Store, Join, Distinct, Order By, …

 Aggregations
 Count, Avg, Sum, Max, Min

 Schema
 Defined at query time, not when files are loaded
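
The data-flow style behind these keywords can be imitated in plain Python. This is a sketch of the idea, not Pig Latin. Each helper is named after the keyword it mimics, and the record fields (`region`, `amount`) are hypothetical. As in Pig, the field names are only interpreted when the pipeline runs, not when the records are loaded.

```python
def load(records):
    """LOAD: read the input relation."""
    return list(records)


def filter_by(rows, predicate):
    """FILTER: keep only rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def group_by(rows, key):
    """GROUP BY: collect rows that share the same key value."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    return groups


def foreach_generate(groups, aggregate):
    """FOREACH ... GENERATE: compute one output value per group."""
    return {key: aggregate(rows) for key, rows in groups.items()}


def total_by_region(records):
    rows = load(records)
    rows = filter_by(rows, lambda r: r["amount"] > 0)
    groups = group_by(rows, "region")
    return foreach_generate(groups, lambda rs: sum(r["amount"] for r in rs))


if __name__ == "__main__":
    data = [
        {"region": "north", "amount": 10},
        {"region": "south", "amount": -1},
        {"region": "north", "amount": 5},
    ]
    print(total_by_region(data))
```

Each step takes a relation and returns a new one, which is why the slide describes Pig as data flow analysis.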
Flume
What is Flume?

 Apache Flume is a tool/service/data ingestion mechanism for collecting,
aggregating, and transporting large amounts of streaming data such as log
files and events from various sources to a centralized data store.

 Flume is a highly reliable, distributed, and configurable tool. It is principally
designed to copy streaming data (log data) from various web servers to
HDFS.
Flume Architecture

 Flume Event
 An event is the basic unit of the data transported inside Flume.

 Flume Agent.
 [Illustration: the internal components of an agent and how they
collaborate with each other.]
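
A Flume agent is commonly described as three cooperating parts: a source that receives events, a channel that buffers them, and a sink that delivers them. The slides do not name these parts, so treat that as background from Flume's documentation. The sketch below models the hand-off in a single process. All class and function names are hypothetical and none of this is Flume's actual Java API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Basic unit of data: a byte payload plus optional headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


class MemoryChannel:
    """Bounded in-memory buffer between source and sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._queue = deque()

    def put(self, event):
        if len(self._queue) >= self.capacity:
            raise OverflowError("channel is full")
        self._queue.append(event)

    def take(self):
        return self._queue.popleft() if self._queue else None


class ListSink:
    """Stand-in for a real destination such as HDFS."""

    def __init__(self):
        self.store = []

    def drain(self, channel):
        event = channel.take()
        while event is not None:
            self.store.append(event.body)
            event = channel.take()


def log_source(lines, channel):
    """Source: wrap each log line in an Event and push it to the channel."""
    for line in lines:
        channel.put(Event(body=line.encode("utf-8")))


if __name__ == "__main__":
    channel = MemoryChannel(capacity=10)
    sink = ListSink()
    log_source(["GET /", "GET /cart"], channel)
    sink.drain(channel)
    print(sink.store)
```

Buffering in the channel is what lets the source keep accepting data when the sink is temporarily slower.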
Application of Flume

 Assume an e-commerce web application wants to analyze the customer
behavior from a particular region.

 To do so, they would need to move the available log data into Hadoop for
analysis. Here, Apache Flume comes to our rescue.

 Flume is used to move the log data generated by application servers into
HDFS at a higher speed
Features of Flume

 Flume ingests log data from multiple web servers into a centralized store.

 Using Flume, we can get the data from multiple servers immediately into
Hadoop.

 Flume supports a large set of source and destination types.

 Flume can be scaled horizontally.


Advantages of Flume

 Using Apache Flume we can store the data in any of the centralized
stores (HBase, HDFS).

 Flume provides the feature of contextual routing.

 Flume is reliable, fault tolerant, scalable, manageable, and customizable
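
Contextual routing means sending each event to a destination chosen from the event's own context, typically a header value. The sketch below shows the idea for the region example used earlier. It is plain Python, not Flume configuration. The header name, channel names, and the `route` function are all hypothetical.

```python
def route(events, header, mapping, default):
    """Assign each event body to a channel based on one header value.

    events  -- iterable of dicts with "headers" and "body" keys
    header  -- name of the header to inspect
    mapping -- header value -> channel name
    default -- channel name for events with no matching header value
    """
    channels = {name: [] for name in set(mapping.values()) | {default}}
    for event in events:
        target = mapping.get(event["headers"].get(header), default)
        channels[target].append(event["body"])
    return channels


if __name__ == "__main__":
    incoming = [
        {"headers": {"region": "eu"}, "body": "order 1"},
        {"headers": {"region": "us"}, "body": "order 2"},
        {"headers": {}, "body": "order 3"},
    ]
    routes = {"eu": "eu_channel", "us": "us_channel"}
    print(route(incoming, "region", routes, "default_channel"))
```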


Any Questions?
