Big Data Overview

Saad Khan
MSCS
2nd semester
Content
 Introduction
 What is Big Data.
 Characteristics of Big Data.
 Storing, Selecting and processing of Big Data
 Why Big Data
 How it is Different
 Hive
 Pig
 Flume
Introduction
 Big Data may well be the Next Big Thing in the IT world
 The first organizations to embrace it were online and startup firms. Firms like
Google, eBay, LinkedIn, and Facebook were built around big data from the
beginning.
 Big data burst upon the scene in the first decade of the 21st century
What is BIG DATA?
 ‘Big Data’ is similar to ‘small data’, but bigger in size.
 Its aim is to solve new problems, or old problems in a better way.
What is BIG DATA (Cont..)
 Walmart handles more than 1 million customer transactions every hour.
 Facebook handles 40 billion photos from its user base.

Characteristics of Big Data
Volume
 A typical PC might have had 10 gigabytes of storage in 2000
 Today, Facebook ingests 500 terabytes of new data every day
 A Boeing 737 will generate 240 terabytes of flight data during a single
flight across the US.
Velocity
 Clickstreams and ad impressions capture user behavior at millions of events
per second.
 High-frequency stock trading algorithms reflect market changes within
microseconds.
 Machine-to-machine processes exchange data between billions of devices.

Variety
 Big Data isn't just numbers, dates, and strings. Big Data is also geospatial
data, 3D data, audio and video, and unstructured text, including log files
and social media.
 Traditional database systems were designed to address smaller volumes of
structured data, fewer updates, or a predictable, consistent data structure.
Storing Big Data
 Selecting data source for analysis
 Eliminating redundant data
 Establishing the role of NoSQL

Selecting Big Data Stores
 Choosing the correct data stores based on your data characteristics.
 Moving code to data.
 Implementing polyglot data store solutions

Processing Big Data
 Mapping data to the programming framework
 Connecting and extracting data from storage.
 Transforming data for processing.
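The three steps above can be sketched with a toy MapReduce-style word count. This is a minimal illustration in plain Python, not Hadoop code; the function names and the sample log lines are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for data held in storage (e.g. lines of a log file).
STORAGE = [
    "error disk full",
    "info job started",
    "error network down",
]


def extract_records(storage):
    """Connect to storage and extract raw records (here: iterate over a list)."""
    for line in storage:
        yield line


def map_phase(record):
    """Transform a record into (key, value) pairs that the framework can group."""
    for word in record.split():
        yield word, 1


def reduce_phase(pairs):
    """Sum the values for each key, as a reducer would."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def word_count(storage):
    pairs = (
        pair
        for record in extract_records(storage)
        for pair in map_phase(record)
    )
    return reduce_phase(pairs)


counts = word_count(STORAGE)
print(counts["error"])  # 2
```

In a real cluster the map and reduce phases run on many machines in parallel; the sketch only shows how data is shaped to fit that model.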

Why Big Data
 Increase in storage capacities.
 Increase in processing power.
 Availability of data (different data types).

How is big data different?
 Automatically generated by a machine (e.g. a sensor embedded in an engine)
 Typically an entirely new source of data (e.g. use of the internet)
 Not designed to be friendly (e.g. text streams)
Hive
What is Hive?
 Hive is a data warehouse infrastructure tool to process structured data in
Hadoop. It resides on top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
 Hive was initially developed by Facebook; later the Apache Software
Foundation took it up and developed it further as open source under
the name Apache Hive.
Features of Hive
 It stores the schema in a database and the processed data in HDFS (Hadoop
Distributed File System).
 It is designed for OLAP.
 It provides an SQL-type query language called HiveQL or HQL.
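The split between schema and data can be pictured with a small Python sketch. It is only an analogy under stated assumptions: the `METASTORE` dictionary stands in for the schema database, the `FILES` dictionary stands in for HDFS, and the `orders` table with its columns is hypothetical.

```python
import csv
import io
from collections import Counter

# Toy "metastore": the schema lives apart from the data.
# A dict stands in for the database that would hold it.
METASTORE = {
    "orders": {
        "columns": [("order_id", int), ("region", str), ("amount", float)],
        "location": "orders.csv",  # hypothetical; Hive would record an HDFS location
    }
}

# Toy "file system": raw delimited text with no embedded schema.
FILES = {"orders.csv": "1,north,20.0\n2,south,35.5\n3,north,10.0\n"}


def read_table(name):
    """Apply the stored schema to the raw rows when the table is read."""
    meta = METASTORE[name]
    reader = csv.reader(io.StringIO(FILES[meta["location"]]))
    for raw in reader:
        yield {
            column: cast(value)
            for (column, cast), value in zip(meta["columns"], raw)
        }


# Roughly what the HiveQL query
#   SELECT region, COUNT(*) FROM orders GROUP BY region;
# would compute over this table:
orders_per_region = Counter(row["region"] for row in read_table("orders"))
print(orders_per_region)
```

The aggregate at the end is the kind of summarizing, read-heavy query that OLAP workloads consist of.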

Architecture of Hive
 User Interface - Hive is data warehouse infrastructure software that
enables interaction between the user and HDFS. The user interfaces that Hive
supports are Hive Web UI, Hive command line, and Hive HD.
 Metastore - Hive uses a database server to store the schema or metadata
of tables, databases, columns in a table, their data types, and the
HDFS mapping.
Architecture Of Hive(Cont..)
 HiveQL Process Engine - HiveQL is an SQL-like language for querying the
schema information in the Metastore. It is one of the replacements for the
traditional approach of writing a MapReduce program.
 HDFS or HBase - The Hadoop Distributed File System or HBase is the data
storage technique used to store the data.
Working of Hive
 Get Plan - The driver takes the help of the query compiler, which parses
the query to check the syntax and the query plan or the requirements of
the query.
 Get Metadata - The compiler sends a metadata request to the Metastore.
 Send Metadata - The Metastore sends the metadata as a response to the
compiler.
Working of Hive(Cont..)
 Send Plan - The compiler checks the requirements and resends the plan to
the driver. Up to here, the parsing and compiling of the query is complete.
 Execute Plan - The driver sends the execution plan to the
execution engine.
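The Get Plan, Get Metadata, Send Metadata, Send Plan, and Execute Plan steps can be traced in a toy Python sketch. It assumes a deliberately tiny query form (`SELECT <column> FROM <table>`); the classes are hypothetical stand-ins and do not reproduce Hive's real interfaces.

```python
class Metastore:
    """Holds table schemas (column names), apart from the data itself."""

    def __init__(self, schemas):
        self.schemas = schemas

    def get_metadata(self, table):
        # Send Metadata: answer the compiler's request.
        return self.schemas[table]


class Compiler:
    def __init__(self, metastore):
        self.metastore = metastore

    def get_plan(self, query):
        # Get Plan: parse the query and check its syntax.
        tokens = query.rstrip(";").split()
        if (len(tokens) != 4 or tokens[0].upper() != "SELECT"
                or tokens[2].upper() != "FROM"):
            raise SyntaxError("toy parser only accepts: SELECT <column> FROM <table>")
        column, table = tokens[1], tokens[3]
        # Get Metadata: ask the metastore for the table's columns.
        columns = self.metastore.get_metadata(table)
        if column not in columns:
            raise ValueError(f"unknown column: {column}")
        # Send Plan: return the plan to the driver.
        return {"table": table, "column_index": columns.index(column)}


class ExecutionEngine:
    def __init__(self, data):
        self.data = data

    def execute(self, plan):
        rows = self.data[plan["table"]]
        return [row[plan["column_index"]] for row in rows]


class Driver:
    def __init__(self, compiler, engine):
        self.compiler = compiler
        self.engine = engine

    def run(self, query):
        plan = self.compiler.get_plan(query)
        # Execute Plan: hand the finished plan to the execution engine.
        return self.engine.execute(plan)


metastore = Metastore({"orders": ["order_id", "region"]})
engine = ExecutionEngine({"orders": [(1, "north"), (2, "south")]})
driver = Driver(Compiler(metastore), engine)
result = driver.run("SELECT region FROM orders")
print(result)  # ['north', 'south']
```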
Pig
What is Pig?
 A platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs
 Compiles down to MapReduce jobs
 Developed by Yahoo!
Pig Components
 Two Main Components
 High-Level Language (Pig Latin)
 Set of Commands
 Two Execution Modes
 Local: reads/writes to the local file system
 MapReduce: connects to a Hadoop cluster and reads/writes to HDFS
Why Pig?
 Common design patterns as keywords (joins, distinct, counts)
 Data flow analysis
 Avoids Java-level errors

Language Features of Pig
 Keywords
 Load, Filter, Foreach Generate, Group By, Store, Join, Distinct, Order By, …
 Aggregations
 Count, Avg, Sum, Max, Min
 Schema
 Defined at query time, not when files are loaded
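The data-flow style behind these keywords can be imitated in plain Python. The sketch below is an analogy, not Pig Latin: each helper mirrors one keyword (Load, Filter, Group By, Foreach Generate with Count), the schema is supplied when the query runs, and the sample rows and field names are hypothetical.

```python
from collections import defaultdict

# Raw tab-separated lines; the "file" itself carries no schema.
RAW_LINES = ["alice\t34\tnorth", "bob\t17\tsouth", "carol\t45\tnorth"]


def load(lines, schema):
    """LOAD: attach field names and types at query time."""
    for line in lines:
        values = line.split("\t")
        yield {name: cast(value) for (name, cast), value in zip(schema, values)}


def filter_rows(rows, predicate):
    """FILTER: keep only the rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def group_by(rows, key):
    """GROUP BY: collect rows under each distinct key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def foreach_generate_count(groups):
    """FOREACH ... GENERATE with COUNT: one output value per group."""
    return {key: len(members) for key, members in groups.items()}


# Each step feeds the next, like statements in a data-flow script.
schema = [("name", str), ("age", int), ("region", str)]
users = load(RAW_LINES, schema)
adults = filter_rows(users, lambda row: row["age"] >= 18)
by_region = group_by(adults, "region")
counts = foreach_generate_count(by_region)
print(counts)  # {'north': 2}
```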
Flume
What is Flume?
 Apache Flume is a tool/service/data ingestion mechanism for collecting,
aggregating, and transporting large amounts of streaming data such as log
files and events from various sources to a centralized data store.
 Flume is a highly reliable, distributed, and configurable tool. It is principally
designed to copy streaming data (log data) from various web servers to
HDFS.
Flume Architecture
 Flume Event
 An event is the basic unit of the data transported inside Flume.
 Flume Agent.
 [Illustration: the internal components of an agent and how they collaborate
with each other.]
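A Flume agent is built from three kinds of component: a source that receives events, a channel that buffers them, and a sink that delivers them to the destination. The Python sketch below imitates that hand-off in memory; the class names and the sample log lines are hypothetical, and it is not Flume's own API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Basic unit of data: a byte payload plus optional headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


class MemoryChannel:
    """Buffers events between the source and the sink."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def take(self):
        return self._queue.popleft() if self._queue else None


class LogSource:
    """Receives log lines and places them on the channel as events."""

    def __init__(self, channel, host):
        self.channel = channel
        self.host = host

    def receive(self, line):
        event = Event(body=line.encode("utf-8"), headers={"host": self.host})
        self.channel.put(event)


class ListSink:
    """Drains the channel into a list that stands in for a central store."""

    def __init__(self, channel):
        self.channel = channel
        self.stored = []

    def drain(self):
        event = self.channel.take()
        while event is not None:
            self.stored.append(event)
            event = self.channel.take()


channel = MemoryChannel()
source = LogSource(channel, host="web-01")
sink = ListSink(channel)

source.receive("GET /index.html")
source.receive("GET /cart")
sink.drain()
print(len(sink.stored))  # 2
```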
Application of Flume
 Assume an e-commerce web application wants to analyze customer
behavior from a particular region.
 To do so, it would need to move the available log data into Hadoop for
analysis. Here, Apache Flume comes to the rescue.
 Flume is used to move the log data generated by application servers into
HDFS at a higher speed.
Features of Flume
 Flume ingests log data from multiple web servers into a centralized store.
 Using Flume, we can get the data from multiple servers immediately into
Hadoop.
 Flume supports a large set of source and destination types.
 Flume can be scaled horizontally.

Advantages of Flume
 Using Apache Flume, we can store the data in any of the centralized
stores (HBase, HDFS).
 Flume provides the feature of contextual routing.
 Flume is reliable, fault tolerant, scalable, manageable, and customizable.
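Contextual routing means choosing where an event goes based on its content or headers. The sketch below shows the idea in plain Python, loosely in the spirit of a header-based channel selector; the `region` header, the channel names, and the sample events are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from a header value to a channel name.
ROUTES = {"US": "us_channel", "EU": "eu_channel"}
DEFAULT_CHANNEL = "other_channel"


def select_channel(headers, header_name="region"):
    """Pick a channel from the event's headers, falling back to a default."""
    return ROUTES.get(headers.get(header_name), DEFAULT_CHANNEL)


# Each event is a (headers, body) pair.
events = [
    ({"region": "US"}, b"order placed"),
    ({"region": "EU"}, b"cart updated"),
    ({}, b"health check"),
]

channels = defaultdict(list)
for headers, body in events:
    channels[select_channel(headers)].append(body)

print(sorted(channels))  # ['eu_channel', 'other_channel', 'us_channel']
```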

Any Questions?
