
Understanding BIG DATA

A presentation on Big Data and underlying concepts
by
Makinde Hakeem O.
h.makinde@nlrc-gov.ng
Data growth over time: 2000 (GB) → 2008 (TB) → 2018 (PB)
BIG DATA?
(Scope)
•Big Data?
•Challenges
•Characteristics
•Architecture
•Technology
BIG DATA?
(Definitions)
•A dataset whose size is beyond the ability of commonly used tools to capture, curate, manage and process within a tolerable elapsed time.
BIG DATA?
(Definitions)
•Term used to refer to data sets that are too
large or complex for traditional data-
processing application software to
adequately deal with.
BIG DATA?
(Definitions)
•Data with many cases (rows) offer greater
statistical power, while data with higher
complexity (more attributes or columns) may
lead to a higher false discovery rate.
BIG DATA?
(Definitions)
•Big Data requires a set of techniques and technologies with new forms of integration to reveal insights from datasets that are diverse, complex and of massive scale.
BIG DATA?
(Definitions)
•Big Data represents the information assets
characterized by such a high volume, velocity
and variety to require specific technology and
analytical methods for its transformation into
value.
BIG DATA?
(Data Sets)
Datasets grow rapidly because they are generated by:
•Mobile devices
•Aerial (remote sensing) devices
•Software logs
•Cameras
•Microphones
•Radio-frequency identification (RFID) readers
•Wireless sensor networks
BIG DATA?
(Definitions)
•Big Data is characterized by huge amounts (volume) of frequently updated data (velocity), in various formats (variety) such as numeric, textual, and images/videos.
BIG DATA?
(Philosophy)
•Unstructured Data
•Semi-structured Data
•Structured Data
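The three categories can be contrasted with small samples; the records and values below are invented purely for illustration:

```python
import csv
import io
import json

# Structured: fixed schema, e.g. rows and named columns in a CSV table.
structured = io.StringIO("id,name,amount\n1,Ada,300\n2,Gauss,450\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing but with a flexible schema, e.g. JSON.
semi = json.loads('{"id": 3, "name": "Euler", "tags": ["math", "physics"]}')

# Unstructured: free text with no schema; any structure must be inferred.
unstructured = "Payment of 300 received from Ada on Monday."

print(rows[0]["name"], semi["tags"], len(unstructured.split()))
```

Structured data can be queried by column directly; the semi-structured record carries its own field names; the unstructured sentence requires parsing or analytics before it yields the same facts.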
BIG DATA?
(Challenges)
•Storage
•Capturing
•Analysis
•Visualization
BIG DATA?
(Storage: Challenges)
•DAS: Direct Attached Storage {SSD, SATA}
•SAN: Storage Area Network
•NAS: Network Attached Storage
BIG DATA?
(Storage: DAS {SSD, SATA})
[Diagram: DAS (Direct Attached Storage) built from SSD (Solid State Drive) and SATA drives]
BIG DATA?
(Storage: SAN)

[Diagram: SAN (Storage Area Network)]
BIG DATA?
(Storage: NAS)

[Diagram: NAS (Network Attached Storage)]
BIG DATA?
(Challenges: Capturing)
BIG DATA?
(Challenges: Analysis)
•Data must be processed with advanced tools
(analytics and algorithms) to reveal meaningful
information. For example, to manage a factory
one must consider both visible and invisible
issues with various components.
BIG DATA?
(Challenges: Analysis)
•Information generation algorithms must detect and address invisible issues such as machine degradation, component wear, etc., on the factory floor.
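As a minimal, hypothetical sketch (the readings, baseline, and tolerance are invented), detecting such invisible drift can be as simple as comparing a windowed average of sensor readings against a known healthy baseline:

```python
# Flag gradual machine degradation from vibration readings by comparing
# a recent moving average against a healthy baseline level.
def detect_degradation(readings, window=5, baseline=1.0, tolerance=0.2):
    """Return indices where the windowed mean drifts beyond tolerance."""
    flagged = []
    for i in range(window, len(readings) + 1):
        mean = sum(readings[i - window:i]) / window
        if abs(mean - baseline) > tolerance:
            flagged.append(i - 1)  # last index of the drifting window
    return flagged

# Healthy readings hover near the baseline; wear shows as a slow drift
# that no single reading would reveal on its own.
vibration = [1.0, 1.05, 0.98, 1.02, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6]
print(detect_degradation(vibration))  # → [7, 8, 9]
```

Real information generation algorithms are far more sophisticated, but the principle is the same: the issue is invisible in any individual reading and only appears through analysis over time.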
BIG DATA?
(Challenges: Visualization)
•Data visualization is viewed by many
disciplines as a modern equivalent of visual
communication. It involves the creation and
study of the visual representation of data.
BIG DATA?
(Challenges: Visualization)
•Effective visualization helps users analyze and reason about data and evidence. It makes complex data more accessible, understandable and usable. Users may have particular analytical tasks, such as making comparisons or understanding causality.
BIG DATA?
(Challenges: Visualization)
•Data visualization uses statistical graphics, plots, information graphics and other tools. Numerical data may be encoded using dots, lines, or bars to visually communicate a quantitative message.
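A minimal sketch of the bar encoding, using invented monthly counts and plain text instead of a graphics library; the idea is the same one statistical graphics use, i.e. mapping each numeric value to a proportional bar length:

```python
# Encode numeric values as bar lengths so magnitudes can be compared
# at a glance, the core idea behind a bar chart.
def ascii_bars(data, width=20):
    peak = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(width * value / peak)  # length ∝ value
        lines.append(f"{label:<8} {bar} {value}")
    return "\n".join(lines)

monthly_events = {"Jan": 120, "Feb": 300, "Mar": 240}  # illustrative data
print(ascii_bars(monthly_events))
```

Even in this toy form, the comparison task ("which month was busiest?") is answered faster from the bars than from the raw numbers.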
BIG DATA?
(Characteristics)
•Volume
•Variety
•Velocity
•Veracity
BIG DATA?
(Characteristics: Volume)
Volume is the quantity of generated and
stored data. The size of the data determines
the value and potential insight, and whether it
can be considered big data or not.
BIG DATA?
(Characteristics: Variety)
Variety is the type and nature of the data. This helps
people who analyze it to effectively use the resulting
insight. Big data draws from text, images, audio,
video; plus it completes missing pieces through data
fusion.
BIG DATA?
(Characteristics: Velocity)
In this context, velocity is the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time.
BIG DATA?
(Characteristics: Velocity)
Compared to small data, big data are produced
more continually. Two kinds of velocity related to
Big Data are the frequency of generation and the
frequency of handling, recording, and publishing.
BIG DATA?
(Characteristics: Veracity)
Veracity extends the definition of big data, referring to data quality and data value. The quality of captured data can vary greatly, affecting accurate analysis.
BIG DATA?
(Architecture)
•Symmetric Multiprocessing
•Parallel Computing
•Multicore Computing
•Distributed Computing
•Clustered Computing
•Grid Computing
BIG DATA?
(Architecture: Parallel Computing)
•This is the implementation of clustered computers (nodes), driven by a special OS, working simultaneously to obtain greatly increased processing capability.
BIG DATA?
(Architecture: Parallel Computing)
•Large problems can often be divided into smaller ones,
which can then be solved at the same time. There are
several different forms of parallel computing: bit-level,
instruction-level, data, and task parallelism.
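A sketch of data parallelism under these ideas: a large summation is divided into smaller pieces, each solved at the same time by a separate worker process. The chunk count and data are illustrative only:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """Each worker solves one smaller piece of the large problem."""
    return sum(chunk)

def chunked(data, n):
    """Divide the large problem into n roughly equal smaller ones."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    numbers = list(range(1, 1001))
    pieces = chunked(numbers, 4)                 # divide the problem
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, pieces)  # solve simultaneously
    print(sum(partials))                          # combine the results
```

Bit-level, instruction-level, and task parallelism follow the same divide-and-combine logic at different granularities, from individual arithmetic operations up to whole independent jobs.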
BIG DATA?
(Architecture: Parallel Computing)
•There are various implementations of parallel
computing technologies by numerous organizations
across the world each with features that best
addresses individual requirements.
BIG DATA?
(Architecture: Parallel Computing)
The two most popular in the context of Big Data are:
•MapReduce (Google, 2004)
•Hadoop (Apache, 2012)
BIG DATA?
(Parallel Computing: MapReduce)
In 2004, Google published a paper on a process called
MapReduce that uses a similar architecture. The MapReduce
concept provides a parallel processing model, and an
associated implementation was released to process huge
amounts of data.
BIG DATA?
(Parallel Computing: MapReduce)
With MapReduce, queries are split and distributed
across parallel nodes and processed in parallel (the
Map step). The results are then gathered and delivered
(the Reduce step). The framework was very successful.
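The Map and Reduce steps can be sketched in miniature. This in-process word count only illustrates the programming model, not Google's distributed implementation; in a real cluster each map call would run on a different node:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map step: emit (key, value) pairs; here, (word, 1) per word."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(mapped_pairs):
    """Reduce step: gather pairs by key and combine their values."""
    counts = defaultdict(int)
    for word, n in mapped_pairs:
        counts[word] += n
    return dict(counts)

documents = ["big data needs big tools", "data tools scale"]
mapped = chain.from_iterable(map_phase(d) for d in documents)  # Map
print(reduce_phase(mapped))                                    # Reduce
```

Because every map call is independent, the work distributes naturally across parallel nodes, and the reduce step is the only point where results must be brought back together.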
BIG DATA?
(Parallel Computing: Hadoop)
An implementation of the MapReduce framework was
also adopted by an Apache open-source project named
Hadoop. Apache Spark was developed in 2012 in
response to limitations in the MapReduce paradigm, as it
adds the ability to set up many operations (not just map
followed by reduce).
