• Big data refers to the massive data sets that are collected from a variety of data
sources for business needs, to reveal new insights for optimized decision making.
• "Big data" is a field that treats ways to analyze, systematically extract information
from, or otherwise deal with data sets that are too large or complex to be dealt with
by traditional data-processing software.
• Big data generates value from the storage and processing of digital data that cannot
be analyzed with traditional computing techniques.
• Volume: The quantity of generated and stored data. The size of the data
determines the value and potential insight, and whether it can be
considered big data or not.
• Variety: The type and nature of the data. This helps people who
analyze it to effectively use the resulting insight. Big data draws from text,
images, audio, and video; it can also complete missing pieces through data
fusion.
• Velocity: The speed at which the data is generated and
processed to meet the demands and challenges that lie in the path of
growth and development. Big data is often available in real time.
Compared to small data, big data is produced more continually. Two
kinds of velocity related to big data are the frequency of generation and
the frequency of handling, recording, and publishing.
• Veracity: An extended characteristic of big data, referring to data
quality and data value. The quality of captured data can vary greatly,
affecting the accuracy of analysis.
Mr. Ganesh Bhagwat
Big Data Analysis
Big Data Characteristics
• Structured: Any data that can be stored, accessed, and processed in a
fixed format is termed 'structured' data.
• Recommendation systems
• Human genome mapping (genome -> the complete set of genetic information in an organism)
• The New York Stock Exchange generates about one terabyte of new trade data per
day.
• Social Media: Statistics show that 500+ terabytes of new data get ingested into the
databases of the social media site Facebook every day. This data is mainly generated in
terms of photo and video uploads, message exchanges, posting comments, etc.
• A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With
many thousands of flights per day, data generation reaches many petabytes.
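The jet-engine figure turns into petabytes quickly. A back-of-envelope sketch in Python (the flight count and flight duration here are assumed round numbers for illustration, not airline statistics):

```python
# Back-of-envelope estimate of daily jet-engine data volume.
# Only the 10 TB per 30 min rate comes from the text above; the
# flight count and duration are hypothetical round numbers.

TB_PER_30_MIN = 10        # from the example above
FLIGHT_HOURS = 2          # assumed average flight duration
FLIGHTS_PER_DAY = 5_000   # assumed number of daily flights

tb_per_flight = TB_PER_30_MIN * (FLIGHT_HOURS * 60 / 30)   # 40 TB
tb_per_day = tb_per_flight * FLIGHTS_PER_DAY               # 200,000 TB
pb_per_day = tb_per_day / 1024                             # ~195 PB

print(f"{tb_per_flight:.0f} TB per flight, ~{pb_per_day:.0f} PB per day")
```

Even with conservative assumptions, the total lands in the hundreds of petabytes per day, which is why "many petabytes" is not an exaggeration.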
• Big Data is a term used to describe a collection of data that is huge in size
and yet growing exponentially with time.
• Examples of Big Data generation include stock exchanges, social media sites, jet
engines, etc.
• Big Data can be 1) structured, 2) unstructured, or 3) semi-structured.
• Volume, Variety, Velocity, and Variability are a few characteristics of big data.
• Improved customer service, better operational efficiency, and better decision
making are a few advantages of big data.
Why is Hadoop important?
• Ability to store and process huge amounts of any kind of data, quickly. With data
volumes and varieties constantly increasing, especially from social media and the Internet
of Things (IoT), that's a key consideration.
• Computing power. Hadoop's distributed computing model processes big data fast. The
more computing nodes you use, the more processing power you have.
• Fault tolerance. Data and application processing are protected against hardware
failure. If a node goes down, jobs are automatically redirected to other nodes to make
sure the distributed computing does not fail. Multiple copies of all data are stored
automatically.
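The fault-tolerance point can be illustrated with a toy model of block replication. This is a minimal sketch of the idea, not Hadoop's actual implementation; the node and block names are made up:

```python
# Toy model of HDFS-style block replication (illustrative only):
# each block is stored on REPLICATION distinct nodes, so losing
# a single node loses no data.
import itertools

REPLICATION = 3
nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_1", "blk_2", "blk_3", "blk_4"]

# Assign each block to REPLICATION distinct nodes, round-robin.
placement = {}
cycle = itertools.cycle(nodes)
for blk in blocks:
    placement[blk] = {next(cycle) for _ in range(REPLICATION)}

def available(block, live_nodes):
    """A block is readable if at least one replica sits on a live node."""
    return bool(placement[block] & set(live_nodes))

# Simulate node2 failing: every block is still readable elsewhere.
live = [n for n in nodes if n != "node2"]
assert all(available(b, live) for b in blocks)
print("all blocks survive a single node failure")
```

The same reasoning is why jobs can be redirected on failure: another node already holds a copy of the data the failed node was processing.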
• Scalability. You can easily grow your system to handle more data
simply by adding nodes. Little administration is required.
• Commodity hardware: computer hardware that is affordable and easy to obtain.
• Example 1
– Transfer speed is around 100 MB/s and a standard disk is 1 TB.
– Time to read an entire disk is about 3 hours (~10,000 seconds).
– Faster processors might not help, because:
• Network bandwidth is now more of a limiting factor.
• Physical limits of processor chips are being reached.
• Example 2
– If a 100 GB data set is to be scanned, then:
• With remote storage at 10 MB/s bandwidth, it would take ~165 minutes.
• With local storage at 50 MB/s, it would take ~33 minutes.
So it is better to move computation to the data rather than moving the data.
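The arithmetic above can be reproduced in a few lines; note that the Example 2 times correspond to a 100 GB data set read at MB/s rates:

```python
# Reproducing the transfer-time arithmetic from the two examples.
def read_time_seconds(size_mb, speed_mb_per_s):
    """Time to read size_mb megabytes at speed_mb_per_s MB/s."""
    return size_mb / speed_mb_per_s

# Example 1: a 1 TB disk at 100 MB/s takes ~10,000 s (~2.8 hours).
disk_secs = read_time_seconds(1_000_000, 100)

# Example 2: scanning a 100 GB data set.
remote_mins = read_time_seconds(100_000, 10) / 60   # ~167 min
local_mins = read_time_seconds(100_000, 50) / 60    # ~33 min

print(f"disk: {disk_secs:.0f} s, "
      f"remote: {remote_mins:.0f} min, local: {local_mins:.0f} min")
```

The five-fold gap between remote and local reads is the whole argument for moving computation to where the data lives.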
Data Visualization
• Data visualization is the graphical representation of data points and information,
making it easy and quick for users to understand. A good data visualization has a
clear meaning and purpose and is easy to interpret without requiring context.
Data visualization tools provide an accessible way to see and understand trends,
outliers, and patterns in data using visual elements such as charts, graphs, and maps.
• Data visualization is the process of translating large data sets and metrics into
charts, graphs and other visuals. The resulting visual representation of data makes it
easier to identify and share real-time trends, outliers, and new insights about the
information represented in the data.
• Visual noise: Most objects in the data set are too close to each other;
users cannot distinguish them as separate objects on the screen.
• Information loss: Reducing the visible data set helps, but leads to
information loss.
• Large image perception: Data visualization methods are limited not only by the
aspect ratio and resolution of the device, but also by physical perception limits.
• High rate of image change: Users observing rapidly changing data cannot
react to the rate or intensity of change on the display.
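A common response to the information-loss and perception-limit challenges above is to downsample before plotting. The naive sketch below keeps evenly spaced points; real tools use smarter schemes (binning, largest-triangle-three-buckets, etc.), so this is only the simplest instance of the idea:

```python
# Naive downsampling: keep at most max_points evenly spaced samples
# so the chart stays near one point per pixel. This trades detail
# (information loss) for legibility (less visual noise).
def downsample(points, max_points):
    """Return at most max_points evenly spaced samples from points."""
    if len(points) <= max_points:
        return list(points)
    step = len(points) / max_points
    return [points[int(i * step)] for i in range(max_points)]

series = list(range(1_000_000))       # a million raw data points
visible = downsample(series, 2_000)   # roughly one point per pixel

print(len(visible))
```

The trade-off is exactly the one the bullet list names: the reduced set is readable, but any spike that falls between kept samples disappears.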
The 3Vs
Let's take a moment to further examine the Vs.
Volume
Volume involves determining or calculating how much of something there is,
or in the case of big data, how much of something there will be.
Velocity
Velocity is the rate or pace at which something is occurring. The measured
velocity can, and usually does, change over time. Velocity directly affects
outcomes.
Variety
Thinking back to our previous mention of relational databases, it is generally
accepted that relational databases are highly structured, although they may
contain text in VARCHAR, CLOB, or BLOB fields.
Solutions to these challenges
1. Meeting the need for speed: One possible solution is hardware: increased memory and
powerful parallel processing. Another method is putting data in-memory, using a grid
computing approach where many machines are used.
2. Understanding the data: One solution is to have the proper domain expertise in place.
3. Addressing data quality: It is necessary to ensure the data is clean through the process of
data governance or information management.
4. Displaying meaningful results: One way is to cluster data into a higher-level view where
smaller groups of data are visible and the data can be effectively visualized.
5. Dealing with outliers: Possible solutions are to remove the outliers from the data or create
a separate chart for the outliers.
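Point 5 can be sketched with the common 1.5×IQR fence; the threshold and the sample values below are illustrative assumptions, not a fixed standard:

```python
# Split outliers into their own series so the main chart keeps a
# readable scale, per point 5 above. Uses the common 1.5*IQR rule.
import statistics

def split_outliers(values, k=1.5):
    """Return (typical, outliers) using the k*IQR fence."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    typical = [v for v in values if lo <= v <= hi]
    outliers = [v for v in values if v < lo or v > hi]
    return typical, outliers

data = [12, 13, 12, 14, 13, 11, 12, 250, 13, 12, -90]
main, extreme = split_outliers(data)
print(main)     # the typical readings
print(extreme)  # [250, -90]
```

Plotting `main` and `extreme` on separate charts keeps the bulk of the data from being squashed into a flat line by two extreme values.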
• When it comes to big data, simple data visualization tools with their
basic features become somewhat inadequate. The concepts and models necessary to
efficiently and effectively visualize big data can be daunting, but are not
unattainable.
• Using workable approaches (studied in the following chapters of this book), the
reader will review some of the most popular (or currently trending) tools, such as:
• Hadoop
• R
• Data Manager
• D3
• Tableau
• Python
• Splunk
• This is done in an effort to meet the challenges of big data visualization and support
better decision making.