
BIG DATA

K.LOGESWARAN AP(Sr.G) | AI | KEC


Can you think of…?

• Can you think of running a query on a 20,980,000 GB file?


• What if we get a new data set like this every day?
• What if we need to execute complex queries on this data set every day?
• Does anybody really deal with this type of data set?
• Is it possible to store and analyze this data?
• Yes, Google deals with more than 20 PB of data every day.
In fact, in a minute
• Email users send more than 204 million messages;
• Mobile Web receives 217 new users;
• Google receives over 2 million search queries;
• YouTube users upload 48 hours of new video;
• Facebook users share 684,000 bits of content;
• Twitter users send more than 100,000 tweets;
• Consumers spend $272,000 on Web shopping;
• Apple receives around 47,000 application downloads;
• Brands receive more than 34,000 Facebook 'likes';
• Tumblr blog owners publish 27,000 new posts;
• Instagram users share 3,600 new photos;
• Flickr users add 3,125 new photos;
• Foursquare users perform 2,000 check-ins;
• WordPress users publish close to 350 new blog posts.
And this was one year back… Damn!!
BIG DATA
• Data that is very large in size is called Big Data.

• Big data is data that contains greater variety, arriving in increasing volumes and with more velocity.

• Big Data refers to extremely large, very fast, highly diverse and complex data that cannot be managed by traditional data management tools.
or
• Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data-processing application software.
Big Data Applications
Data sources
• Social networking sites: Facebook, Google and LinkedIn all generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.

• The New York Stock Exchange is an example of Big Data: it generates about one terabyte of new trade data per day.

• A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousand flights per day, data generation reaches many petabytes.

• E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts of logs, from which users' buying trends can be traced.

• Weather stations: All the weather stations and satellites give very large volumes of data, which are stored and processed to forecast the weather.

• Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly; for this they store the data of their millions of users.

Volume: data quantity
Velocity: data speed
Variety: data types
Data sources
Volume
• A typical PC might have had 10 gigabytes of storage in 2000.
• Today, Facebook ingests 500 terabytes of new data every day.
• A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.
• Smartphones, with the data they create and consume, and sensors embedded into everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.

Velocity
Click streams and ad impressions capture user behavior at millions of events per second.

High-frequency stock trading algorithms reflect market changes within microseconds.

Machine-to-machine processes exchange data between billions of devices.

Infrastructure and sensors generate massive log data in real time.

Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
Variety
Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.

Traditional database systems were designed to address smaller volumes of structured data, fewer updates, and a predictable, consistent data structure.

Big Data analysis includes these different types of data.
Wholeness of BIG DATA
1.1 UNDERSTANDING BIG DATA
 Big Data can be examined at two levels.
 At the business level, data is collected, analyzed and utilized for the benefit of the business; the resulting insights help make better decisions.
 At the technical level, this special kind of data poses unique challenges in storing and processing, and offers unique benefits, across space, time and function.
 There are huge opportunities for technology providers to innovate and manage the entire life cycle of data: to generate, store, organize, analyze and visualize this data.
1.2 Capturing BIG DATA
 The four V's (Volume, Velocity, Variety, Veracity) arrive together with the data.
1.2 Capturing BIG DATA - Volume
 Volume is the amount of data generated by organizations or individuals.
 The data generated is doubling every year.
 The data is too huge to extract meaningful, specific information from within a reasonable period of time.
1.2 Capturing BIG DATA - Volume
• A major reason for data growth is the falling cost of storing data, which decreases by 30-40 percent per year.
• The different forms and functions of data are also increasing.
• The cost of computation and communication of data is also coming down.
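As a rough illustration of this trend (not a figure from the slide), the short Python sketch below compounds an assumed 35 percent annual decline in cost per gigabyte; the starting price and the exact rate are assumptions for the example.

# Illustrative only: compound a 30-40% annual decline in storage cost.
# The starting cost and the 35% rate are assumed values for the sketch.
cost_per_gb = 0.10  # assumed starting cost in dollars per GB
annual_decline = 0.35

for year in range(1, 6):
    cost_per_gb *= (1 - annual_decline)
    print(f"Year {year}: ${cost_per_gb:.4f} per GB")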
1.2 Capturing BIG DATA - Velocity
 Big data is generated by billions of devices and communicated at high speed via the Internet.
 The increased velocity of data is due to the increased speed of the Internet.
 Internet speeds at homes and offices have become up to 100 times faster.
 An increased variety of sources, such as mobile devices and sensors, can generate data from anywhere, at any time.
1.2 Capturing BIG DATA - Variety
 There are three kinds of variety in data:
 Form of data
 Data types range from numbers to text, graph, map, audio, video, etc.
 A single file can be a composite of data types:
 Text documents have graphs and pictures inside them.
 Video songs have audio embedded in them.
 Audio and video have different, complex storage formats.
 Function of data
 Data comes from human conversations, songs and movies, new product designs, old archived data, etc.
 The processing for each kind of data is different: for example, recognizing people's faces in pictures, comparing voices to identify a speaker, or comparing handwriting to identify the writer.
1.2 Capturing BIG DATA - Variety
 There are three kinds of variety in data (continued):
 Source of data
 Mobile phones and tablets allow data to be accessed and generated any time, anywhere.
 Web access and search logs are further sources of data.
 Business systems generate structured business transactional information.
 Sensors, such as temperature and pressure sensors on machines, and RFID tags on assets, generate data.
 There are three broad types of data sources:
 Human-to-human communication
 Human-to-machine communication
 Machine-to-machine communication
1.2 Capturing BIG DATA - Veracity
 Veracity relates to the truthfulness, believability and quality of data.
 The source of information may not be authoritative.
 Data may not be communicated and received correctly, due to human or technical failures.
 Data provided and received may be intentionally wrong, for competitive or security reasons.
1.3 Benefitting from Big Data

 Monitoring and tracking applications
 Analysis and insight
 New product development
1.4 Management of Big Data
 Some emerging insights into making better use of Big Data:
 Focus on protecting and enhancing customer relationships and the customer experience.
 Solve a real pain point: Big Data should be deployed for specific business objectives.
 Organizations are beginning their pilot implementations by using existing and newly accessible internal sources of data.
 Combining data-based analysis with human intuition and perspectives is better than going just one way.
1.4 Management of Big Data
 The faster you analyze the data, the greater its predictive value; the value of data depreciates with time.
 Don't throw away data if no immediate use can be seen for it; data has value beyond what you initially anticipate. Maintain one copy of your data, not multiple.
 Data is expected to continue to grow at exponential rates; storage costs continue to fall, data generation continues to grow, and data-based applications continue to grow in capability and functionality.
 Big Data is transforming business, just as IT did. Big Data is a new phase representing a digital world.
1.5 Organizing Big Data
 Given the huge quantities of data, the cost of storing and processing it is a major driver for the choice of an organizing pattern.
 Given the fast speed of data, it is also desirable to keep control over the data by maintaining counts and averages over time, unique values received, etc.
 Given the variety in form factors, data needs to be stored and analyzed differently.
 Given the different quality levels of data, the various data sources may need to be ranked and prioritized before being served to the audience.
1.6 Analyzing Big Data
 Big Data can be utilized to visualize a flowing or a static situation.
 It is analyzed in two ways, contrasted in the sketch below:
 Big Data in motion
 The incoming stream of data is processed in real time, for quick and effective statistics about the data.
 Big Data at rest
 The data is stored and structured, and standard analytical techniques are applied to batches of data to generate insights.
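A toy contrast between the two modes in plain Python; the event values and the running mean are stand-ins invented for this sketch, not a real pipeline.

# Toy contrast between stream (in motion) and batch (at rest) analysis.
events = [12, 7, 25, 3, 18]  # assumed stand-in for incoming data

# Data in motion: update statistics as each event arrives.
count, total = 0, 0
for value in events:
    count += 1
    total += value
    print(f"streaming: after {count} events, running mean = {total / count:.2f}")

# Data at rest: store everything first, then analyze the batch.
stored = list(events)
print(f"batch: mean over {len(stored)} stored events = {sum(stored) / len(stored):.2f}")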
1.6 Analyzing Big Data

 The bar shows the number of page views, and the inner, darker bar shows the number of unique visitors.
 The dashboard could also show the view by days, weeks, or years.
1.6 Analyzing Big Data
 Text data can be combined, filtered, cleaned, thematically analyzed, and visualized in a word cloud.
 For example, a word cloud can be built from a recent stream of tweets (i.e., Twitter messages) from US presidential candidates Hillary Clinton and Donald Trump.
 Larger words imply a greater frequency of occurrence in the tweets.
1.7 Technology challenges of Big Data
 Storing huge volumes
 The solution distributes data across a large cluster of inexpensive commodity machines, and ensures that every piece of data is stored on multiple machines, to guarantee that at least one copy is always available.
 Hadoop is the most well-known clustering technology for Big Data. Its data storage pattern is called the Hadoop Distributed File System (HDFS).
 Ingesting streams at an extremely fast pace
 This requires creating special ingesting systems that can open an unlimited number of channels for receiving data.
 These queuing systems can hold data, from which consumer applications can request and process data at their own pace (a minimal sketch appears after this list).
 Apache Spark is the most popular system for streaming applications.
 Handling a variety of forms and functions of data
 This concerns the structuring and access of all varieties of data.
 HBase, for example, stores each data element separately along with its key identifying information. This is called a key-value pair format.
 Cassandra stores data in a column-family (wide-column) format.
 Languages such as Pig and Hive are used to access this data.
 Processing data at huge speeds
 The challenge is to avoid moving large amounts of data from storage to the processor, as discussed on the next slide.
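To make the queuing idea concrete, here is a minimal single-process sketch using Python's standard queue module; real ingest systems such as Kafka play this role at far larger scale, and the message format below is invented for the example.

# Minimal sketch of an ingest queue: a producer enqueues data as fast as
# it arrives; a consumer dequeues and processes at its own pace.
import queue
import threading
import time

ingest_queue = queue.Queue()

def producer():
    for i in range(5):
        ingest_queue.put(f"event-{i}")  # assumed message format

def consumer():
    while True:
        item = ingest_queue.get()
        if item is None:  # sentinel: no more data
            break
        time.sleep(0.1)  # simulate slow processing at the consumer's pace
        print(f"processed {item}")

t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer)
t_prod.start(); t_cons.start()
t_prod.join()
ingest_queue.put(None)  # signal the consumer to stop
t_cons.join()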
1.7 Technology challenges of Big Data
 Processing data at huge speeds
 Moving large amounts of data from storage to the processor would consume enormous network capacity and choke the network.
 The alternative is to "move the processing to where the data is stored."
 The task logic is distributed throughout the cluster of machines where the data is stored.
 Those machines work, in parallel, on the data assigned to them.
 A follow-up process consolidates the outputs of all the small tasks and delivers the final results.
 MapReduce, also invented by Google, is the best-known technology for the parallel processing of distributed Big Data; a minimal sketch follows.
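Below is a minimal single-machine sketch of the MapReduce word-count pattern; real MapReduce distributes the map and reduce tasks across the cluster, which this in-process version only imitates, and the input documents are invented.

# Single-machine imitation of MapReduce word count:
# map emits (word, 1) pairs, shuffle groups by key, reduce sums the counts.
from collections import defaultdict

documents = ["big data is big", "data moves to processing"]  # invented input

# Map phase: each "task" emits key-value pairs for its own chunk of data.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle phase: group all values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: consolidate each group into a final result.
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)  # {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'to': 1, 'processing': 1}

The three phases mirror the slide: the map work happens where the data lives, the shuffle groups intermediate results by key, and the reduce step consolidates the small outputs into the final result.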
Assignment 1
Liberty Stores Case Exercise:
Liberty Stores Inc. is a specialized global retail chain that sells organic food, organic clothing, wellness products, and education products to enlightened LOHAS (Lifestyles of Health and Sustainability) citizens worldwide. The company is 20 years old and is growing rapidly. It now operates in 5 continents, 50 countries, and 150 cities, and has 500 stores. It sells 20,000 products and has 10,000 employees. The company has revenues of over $5 billion and a profit of about 5% of its revenue. The company pays special attention to the conditions under which its products are grown and produced. It donates about one-fifth (20%) of its pre-tax profits to global and local charitable causes.
 Q1: Create a comprehensive Big Data strategy for the CEO of the company.
 Q2: How can Big Data systems such as IBM Watson help this company?
Big Data Architecture
CASELET: Google Query Architecture
Big Data Architecture
 There are many sources of data. All data is funneled in through an ingest system.
 The data is forked into two sides:
 A stream processing system
 Streaming data processing happens as the data flows through the system. This results in analysis and reporting of events as they happen. Examples include fraud detection and intrusion detection.
 A batch processing system
 Batch processing is when the processing and analysis happen on a set of data that has already been stored over a period of time.
 The outcomes of this processing can be sent to NoSQL databases for later retrieval, or sent directly for consumption by applications and devices.
Big Data Architecture
Big data sources
 The sources of data for an application depend upon the data needed to perform the analysis.
 The data will vary in origin, size, speed, form, and function, as described by the 4 Vs.
 Data sources can be internal or external to the organization.
Big Data Architecture
 A big data solution typically comprises these logical layers:
 Data ingest layer
 Batch processing layer
 Stream processing layer
 Data organizing layer
 Data consumption layer
 Infrastructure layer
 Distributed file system layer
 Each layer can be represented by one or more available technologies.
Big Data Architecture
Data ingest layer
 Used for acquiring data from the data sources.
 Data can be acquired at various speeds and in various quantities.
 The data is then sent to a batch processing system, to a stream processing system, or directly to HDFS.
Batch processing layer and stream processing layer
 The analysis layer reads data from the file system or from the NoSQL databases.
 Data is processed using parallel programming to produce the desired results.
 This layer needs to understand the data sources and data types, the algorithms that would work on that data, and the format of the desired outcomes.
 The output of this layer can be sent for instant reporting, or stored in a NoSQL database for an on-demand report for the client.
Big Data Architecture
 Data organizing layer
 This layer receives data from both the batch and stream processing layers.
 NoSQL databases are used to organize the data for easy access.
 SQL-like languages, such as Hive and Pig, can be used to easily access the data and generate reports.
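As a rough picture of the key-value organization used by stores like HBase, the sketch below keeps each record under a row key for direct lookup; the row keys and field names are invented for illustration and do not use any actual HBase API.

# Rough sketch of key-value organization: each data element is stored
# with its identifying key, HBase-style. Table and fields are invented.
page_views = {
    "2024-01-01#home": {"views": 1200, "unique_visitors": 800},
    "2024-01-01#cart": {"views": 300, "unique_visitors": 250},
}

# "Query": look up a row directly by its key, with no full scan needed.
row = page_views["2024-01-01#home"]
print(row["views"], row["unique_visitors"])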
Big Data Architecture
 Data consumption layer
 This layer consumes the output provided by the analysis layers, directly or through the organizing layer.
 The outcome could be standard reports, data analytics, dashboards and other visualization applications, or recommendation engines, on mobile and other devices.
 Infrastructure layer
 Manages the raw resources of storage, compute, and communication through a cloud computing paradigm.
 Distributed file system layer
 Includes the Hadoop Distributed File System (HDFS), along with supporting applications, such as YARN (Yet Another Resource Negotiator), that enable efficient access to data storage and its transfer.
Big Data Architecture Examples: IBM Watson
Netflix
 Netflix is one of the largest providers of online video entertainment. It handles 400 billion online events per day.
 As a cutting-edge user of big data technologies, Netflix constantly innovates its mix of technologies to deliver the best performance.
 Kafka is the common messaging system for all incoming requests.
 The entire infrastructure is hosted on Amazon Web Services (AWS).
 The data stores are AWS S3 as well as Cassandra and HBase.
 Spark is used for stream processing.
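For a flavor of how an event might enter a Kafka-based pipeline like Netflix's, here is a hedged sketch using the kafka-python client; the broker address, topic name, and event fields are placeholders, and this is not Netflix's actual code.

# Hedged sketch of publishing an event to Kafka with the kafka-python
# client. Broker address, topic name, and event fields are placeholders.
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"user_id": 42, "action": "play", "title_id": 1234}  # invented event
producer.send("viewing-events", event)  # assumed topic name
producer.flush()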
eBay
 eBay is the second-largest e-commerce company in the world.
 It delivers 800 million listings from 25 million sellers to 160 million buyers.
 To manage this huge stream of activity, eBay uses a stack of Hadoop, Spark, Kafka, and other elements.
PayPal
 This payments-facilitation company needs to understand and acquire customers, and to process a large number of payment transactions.
Apache Hadoop - Distributed Computing

A distributed file storage system is a clever way of storing huge quantities of data in a networked collection of commodity machines.
 It is secure and cost-effective.
 It offers speed and ease of retrieval and processing.
Apache Hadoop - HADOOP FRAMEWORK

 The Apache Hadoop distributed computing framework is composed of the following modules:
 Hadoop Common: contains the libraries and utilities needed by the other Hadoop modules.
 Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines.
 YARN: a resource-management platform responsible for managing computing resources in clusters and using them for the scheduling of users' applications.
 MapReduce: an implementation of the MapReduce programming model for large-scale data processing.
 It facilitates concurrent processing by splitting petabytes of data into smaller chunks and processing them in parallel on Hadoop commodity servers.
 In the end, it aggregates all the data from the multiple servers to return a consolidated output back to the application.
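The split-process-aggregate idea can be imitated on a single machine with Python's multiprocessing module; the chunk size and the per-chunk function below are illustrative stand-ins for real MapReduce tasks.

# Imitating "split into chunks, process in parallel, aggregate" with a
# process pool. Chunk size and the work function are illustrative stand-ins.
from multiprocessing import Pool

def process_chunk(chunk):
    return sum(chunk)  # stand-in for a real per-chunk computation

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunk_size = 100_000
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool() as pool:
        partial_results = pool.map(process_chunk, chunks)  # parallel phase

    print(sum(partial_results))  # consolidation phase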
Apache Hadoop - HDFS DESIGN GOALS

 The Hadoop Distributed File System (HDFS) is a distributed and scalable file system.
 It is designed for applications that deal with very large data sizes.
 It is also designed to deal with mostly immutable files, i.e., data is written once but read many times.
 The major design goals of HDFS are listed on the next slide.
Apache Hadoop - HDFS DESIGN GOALS

 The major design goals of HDFS:
 Hardware failure management: hardware failures are common, so one must plan for them.
 Huge volume: the capacity to store large files with fast throughput.
 High speed: a mechanism providing low-latency access for streaming applications (latency is the time it takes a data packet to travel from one designated point to another).
 High variety: maintain simple data coherence (uniformity of data across a shared resource), by writing data once but reading it many times.
 Plug-and-play: maintain easy accessibility of data using any hardware, software, and database platform.
 Network efficiency: minimize network bandwidth requirements by minimizing data movement.
Apache Hadoop - MASTER-SLAVE ARCHITECTURE

 Hadoop is an architecture for organizing computers in a master-slave relationship.
 A Hadoop cluster has two types of nodes:
 Master: a single master node, called the NameNode.
 Slaves: a large number of slave worker nodes, called DataNodes.
 A small Hadoop cluster includes a single master and multiple worker nodes.
 A large Hadoop cluster consists of a master and thousands of small, ordinary machines as worker nodes.
Apache Hadoop - MASTER-SLAVE ARCHITECTURE

 MASTER NODE (NameNode)
 The master node manages the overall file system and its namespace, and controls access to files by clients.
 The master node is aware of the DataNodes, i.e., which blocks of which file are stored on which DataNode.
 It also controls the processing plan for all applications running on the data in the cluster.
 Only one master node is available, which makes it a single point of failure.
 To overcome failure, the master node has a hot backup always ready to take over, in case the master node dies unexpectedly.
 The master node uses a transaction log to persistently record every change that occurs to the file system.
Apache Hadoop - MASTER-SLAVE ARCHITECTURE

 WORKER NODES (DataNodes)
 Worker nodes store the data blocks in their storage space, as directed by the master node.
 Each DataNode contains many disks, to maximize storage capacity and access speed.
 DataNodes have no awareness of the overall distributed file structure.
Apache Hadoop - Architecture

 The NameNode stores all relevant information about all the DataNodes and about the files stored on those DataNodes.
 This information includes:
 For every DataNode: its name, rack, capacity, and health.
 For every file: its name, replicas, type, size, timestamp, location, health, etc.
 On DataNode failure:
 The data on the failed DataNode is accessed from its replicas on other DataNodes.
 The failed DataNode can be automatically recreated on another machine, by writing all of its file blocks from the other healthy replicas.
 Each DataNode sends a heartbeat message to the NameNode periodically. Without this message, the DataNode is assumed to be dead.
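The heartbeat rule can be sketched as simple bookkeeping: record the last time each node reported, and presume dead any node whose report is older than a timeout. The node names and the timeout value are assumptions for the sketch, not HDFS defaults.

# Sketch of heartbeat bookkeeping: a node is presumed dead if its last
# heartbeat is older than the timeout. Names and timeout are assumptions.
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds; assumed value
last_heartbeat = {}       # node name -> time of last heartbeat

def record_heartbeat(node):
    last_heartbeat[node] = time.time()

def dead_nodes():
    now = time.time()
    return [n for n, t in last_heartbeat.items()
            if now - t > HEARTBEAT_TIMEOUT]

record_heartbeat("datanode-1")
record_heartbeat("datanode-2")
print(dead_nodes())  # [] while both nodes report within the timeout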
Apache Hadoop - Architecture

 Role of the NameNode:
 It tries to ensure that files are evenly spread across the DataNodes in the cluster.
 It tries to optimize the networking load.
 It tries to store the fragments of a file on the same node, for speed of reading and writing.
Apache Hadoop - BLOCK SYSTEM

 A block of data is the fundamental storage unit in HDFS.
 HDFS stores large files (typically gigabytes to terabytes) by storing segments (called blocks) of each file across multiple machines.
 All storage capacity and file sizes are measured in blocks.
 A block ranges from 16 MB to 128 MB in size, with a default block size of 64 MB.
 Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk resides on a different DataNode.
 Every data file takes up a number of blocks, depending on its size.
 E.g., 1 terabyte of storage holds about 16,000 blocks (1 TB divided by 64 MB is exactly 16,384 blocks).
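The block arithmetic in the example can be checked directly; with binary units, 1 TB split into 64 MB blocks gives exactly 16,384 blocks.

# Check the block arithmetic: 1 TB of data in 64 MB blocks.
TB = 2 ** 40            # bytes in a binary terabyte
block_size = 64 * 2 ** 20  # 64 MB in bytes

print(TB // block_size)  # 16384 blocks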
Apache Hadoop - ENSURING DATA INTEGRITY

 Hadoop ensures that no data is lost or corrupted during storage or processing.
 Only one client can write to or append to a file at a time.
 If some data on a DataNode is indeed lost or corrupted:
 A healthy replica of the lost block is used to recreate the data.
 To ensure integrity:
 A checksum algorithm is applied to all data written to HDFS.
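A minimal illustration of the checksum idea, using CRC32 from Python's standard library; the chunk contents are invented, and this is only a sketch of the principle, not the HDFS implementation.

# Minimal checksum illustration: compute a CRC32 when data is written,
# verify it when the data is read back. Chunk contents are invented.
import zlib

data = b"block of file data"        # stand-in for an HDFS data chunk
stored_checksum = zlib.crc32(data)  # computed at write time

# Later, on read: recompute and compare.
received = b"block of file data"
if zlib.crc32(received) == stored_checksum:
    print("data intact")
else:
    print("corruption detected; read from another replica")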
