• Big data refers to the massive data sets that are collected from a variety of data
sources for business needs, to reveal new insights for optimized decision making.
• "Big data" is a field that treats ways to analyze, systematically extract information
from, or otherwise deal with data sets that are too large or complex to be dealt with
by traditional data-processing software.
• Big data generates value from the storage and processing of digital data that cannot
be analyzed with traditional computing techniques.
• Volume: The quantity of generated and stored data. The size of the data
determines the value and potential insight, and whether it can be
considered big data or not.
• Variety: The type and nature of the data. This helps people who
analyze it to effectively use the resulting insight. Big data draws from text,
images, audio, and video; it can also complete missing pieces through data
fusion.
• Velocity: The speed at which the data is generated and
processed to meet the demands and challenges that lie in the path of
growth and development. Big data is often available in real time.
Compared to small data, big data is produced more continually. Two
kinds of velocity related to big data are the frequency of generation and
the frequency of handling, recording, and publishing.
• Veracity: An extended characteristic of big data, referring to data
quality and data value. The quality of captured data can vary greatly,
affecting the accuracy of analysis.
Mr. Ganesh Bhagwat
Big Data Analysis
Big Data Characteristics
• Structured: Any data that can be stored, accessed, and processed in a
fixed format is termed 'structured' data.
• Recommendation systems
• Human genome mapping (genome -> the complete set of genetic information in an organism)
• The New York Stock Exchange generates about one terabyte of new trade data per
day.
• Social Media: Statistics show that 500+ terabytes of new data get ingested into the
databases of the social media site Facebook every day. This data is mainly generated in
terms of photo and video uploads, message exchanges, posting comments, etc.
• A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With
many thousands of flights per day, data generation reaches many petabytes.
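The jet-engine figure turns into petabytes quickly. A back-of-envelope sketch in Python (the flight count and flight duration here are assumed round numbers for illustration, not airline statistics):

```python
# Back-of-envelope estimate of daily jet-engine data volume.
# Only the 10 TB per 30 min rate comes from the text above; the
# flight count and duration are hypothetical round numbers.

TB_PER_30_MIN = 10        # from the example above
FLIGHT_HOURS = 2          # assumed average flight duration
FLIGHTS_PER_DAY = 5_000   # assumed number of daily flights

tb_per_flight = TB_PER_30_MIN * (FLIGHT_HOURS * 60 / 30)   # 40 TB
tb_per_day = tb_per_flight * FLIGHTS_PER_DAY               # 200,000 TB
pb_per_day = tb_per_day / 1024                             # ~195 PB

print(f"{tb_per_flight:.0f} TB per flight, ~{pb_per_day:.0f} PB per day")
```

Even with conservative assumptions, the total lands in the hundreds of petabytes per day, which is why "many petabytes" is not an exaggeration.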
• Big Data is a term used to describe a collection of data that is huge in size
and yet growing exponentially with time.
• Examples of Big Data generation include stock exchanges, social media sites, jet
engines, etc.
• Big Data can be 1) structured, 2) unstructured, or 3) semi-structured.
• Volume, Variety, Velocity, and Variability are a few characteristics of big data.
• Improved customer service, better operational efficiency, and better decision
making are a few advantages of big data.
Why is Hadoop important?
• Ability to store and process huge amounts of any kind of data, quickly. With data
volumes and varieties constantly increasing, especially from social media and the Internet
of Things (IoT), that's a key consideration.
• Computing power. Hadoop's distributed computing model processes big data fast. The
more computing nodes you use, the more processing power you have.
• Fault tolerance. Data and application processing are protected against hardware
failure. If a node goes down, jobs are automatically redirected to other nodes to make
sure the distributed computing does not fail. Multiple copies of all data are stored
automatically.
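The fault-tolerance point can be illustrated with a toy model of block replication. This is a minimal sketch of the idea, not Hadoop's actual implementation; the node and block names are made up:

```python
# Toy model of HDFS-style block replication (illustrative only):
# each block is stored on REPLICATION distinct nodes, so losing
# a single node loses no data.
import itertools

REPLICATION = 3
nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_1", "blk_2", "blk_3", "blk_4"]

# Assign each block to REPLICATION distinct nodes, round-robin.
placement = {}
cycle = itertools.cycle(nodes)
for blk in blocks:
    placement[blk] = {next(cycle) for _ in range(REPLICATION)}

def available(block, live_nodes):
    """A block is readable if at least one replica sits on a live node."""
    return bool(placement[block] & set(live_nodes))

# Simulate node2 failing: every block is still readable elsewhere.
live = [n for n in nodes if n != "node2"]
assert all(available(b, live) for b in blocks)
print("all blocks survive a single node failure")
```

The same reasoning is why jobs can be redirected on failure: another node already holds a copy of the data the failed node was processing.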
• Scalability. You can easily grow your system to handle more data
simply by adding nodes. Little administration is required.
• Commodity hardware: computer hardware that is affordable and easy to obtain.
• Example 1
– Transfer speed is around 100 MB/s and a standard disk is 1 TB.
– Time to read an entire disk is about 3 hours (~10,000 seconds).
– Faster processors might not help, because:
• Network bandwidth is now more of a limiting factor.
• Physical limits of processor chips are being reached.
• Example 2
– If a 100 GB data set is to be scanned, then:
• With remote storage at 10 MB/s bandwidth, it would take ~165 minutes.
• With local storage at 50 MB/s, it would take ~33 minutes.
So it is better to move computation to the data rather than moving the data.
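The arithmetic above can be reproduced in a few lines; note that the Example 2 times correspond to a 100 GB data set read at MB/s rates:

```python
# Reproducing the transfer-time arithmetic from the two examples.
def read_time_seconds(size_mb, speed_mb_per_s):
    """Time to read size_mb megabytes at speed_mb_per_s MB/s."""
    return size_mb / speed_mb_per_s

# Example 1: a 1 TB disk at 100 MB/s takes ~10,000 s (~2.8 hours).
disk_secs = read_time_seconds(1_000_000, 100)

# Example 2: scanning a 100 GB data set.
remote_mins = read_time_seconds(100_000, 10) / 60   # ~167 min
local_mins = read_time_seconds(100_000, 50) / 60    # ~33 min

print(f"disk: {disk_secs:.0f} s, "
      f"remote: {remote_mins:.0f} min, local: {local_mins:.0f} min")
```

The five-fold gap between remote and local reads is the whole argument for moving computation to where the data lives.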
Data Visualization
• Data visualization is the graphical representation of data points and information,
making it easy and quick for users to understand. A good data visualization has a
clear meaning and purpose and is easy to interpret without requiring context.
Data visualization tools provide an accessible way to see and understand trends,
outliers, and patterns in data using visual elements such as charts, graphs, and maps.
• Data visualization is the process of translating large data sets and metrics into
charts, graphs and other visuals. The resulting visual representation of data makes it
easier to identify and share real-time trends, outliers, and new insights about the
information represented in the data.
• Visual noise: Most objects in the data set are too close to each other;
users cannot distinguish them as separate objects on the screen.
• Information loss: Reducing the visible data set helps, but leads to
information loss.
• Large image perception: Data visualization methods are limited not only by the
aspect ratio and resolution of the device, but also by physical perception limits.
• High rate of image change: Users observing rapidly changing data cannot
react to the rate or intensity of change on the display.
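A common response to the information-loss and perception-limit challenges above is to downsample before plotting. The naive sketch below keeps evenly spaced points; real tools use smarter schemes (binning, largest-triangle-three-buckets, etc.), so this is only the simplest instance of the idea:

```python
# Naive downsampling: keep at most max_points evenly spaced samples
# so the chart stays near one point per pixel. This trades detail
# (information loss) for legibility (less visual noise).
def downsample(points, max_points):
    """Return at most max_points evenly spaced samples from points."""
    if len(points) <= max_points:
        return list(points)
    step = len(points) / max_points
    return [points[int(i * step)] for i in range(max_points)]

series = list(range(1_000_000))       # a million raw data points
visible = downsample(series, 2_000)   # roughly one point per pixel

print(len(visible))
```

The trade-off is exactly the one the bullet list names: the reduced set is readable, but any spike that falls between kept samples disappears.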
The 3Vs
Let's take a moment to further examine the Vs.
Volume
Volume involves determining or calculating how much of something there is,
or in the case of big data, how much of something there will be.
Velocity
Velocity is the rate or pace at which something is occurring. The measured
velocity can, and usually does, change over time. Velocity directly affects
outcomes.
Variety
Thinking back to our previous mention of relational databases, it is generally
accepted that relational databases are highly structured, although they may
contain text in VARCHAR, CLOB, or BLOB fields.
Solutions to these challenges
1. Meeting the need for speed: One possible solution is hardware: increased memory and
powerful parallel processing. Another method is putting data in-memory, using a grid
computing approach where many machines are used.
2. Understanding the data: One solution is to have the proper domain expertise in place.
3. Addressing data quality: It is necessary to ensure the data is clean through the process of
data governance or information management.
4. Displaying meaningful results: One way is to cluster data into a higher-level view where
smaller groups of data are visible and the data can be effectively visualized.
5. Dealing with outliers: Possible solutions are to remove the outliers from the data or create
a separate chart for the outliers.
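Point 5 can be sketched with the common 1.5×IQR fence; the threshold and the sample values below are illustrative assumptions, not a fixed standard:

```python
# Split outliers into their own series so the main chart keeps a
# readable scale, per point 5 above. Uses the common 1.5*IQR rule.
import statistics

def split_outliers(values, k=1.5):
    """Return (typical, outliers) using the k*IQR fence."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    typical = [v for v in values if lo <= v <= hi]
    outliers = [v for v in values if v < lo or v > hi]
    return typical, outliers

data = [12, 13, 12, 14, 13, 11, 12, 250, 13, 12, -90]
main, extreme = split_outliers(data)
print(main)     # the typical readings
print(extreme)  # [250, -90]
```

Plotting `main` and `extreme` on separate charts keeps the bulk of the data from being squashed into a flat line by two extreme values.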
• When it comes to big data, simple data visualization tools with their
basic features become somewhat inadequate. The concepts and models necessary to
efficiently and effectively visualize big data can be daunting, but are not
unattainable.
• Using workable approaches (studied in the following chapters of this book), the
reader will review some of the most popular (or currently trending) tools, such as:
• Hadoop
• R
• Data Manager
• D3
• Tableau
• Python
• Splunk
• This is done in an effort to meet the challenges of big data visualization and support
better decision making.