Professional Documents
Culture Documents
Chapter 9
1
Big Data Management Techniques
• volume
• velocity
• variety
• veracity
• value
Big Data Management Techniques
Volume: Typical data sources that are responsible for generating high data
volumes can include:
• structured data
• unstructured data
• semi-structured data
Big Data Management Techniques
Structured data
Structured data conforms to a data model or schema and is often stored in
tabular form. It is used to capture relationships between different entities and is
therefore most often stored in a relational database.
Unstructured data
Data that does not conform to a data model or data schema is known as
unstructured data.
Unstructured data has a faster growth rate than structured data.
Big Data Management Techniques
Semi-structured data
Semi-structured data has a defined level of structure and consistency, but is not
relational in nature. Instead, semi-structured data is hierarchical or graph-
based.
Examples: electronic data interchange (EDI) files, spreadsheets, RSS feeds and
sensor data.
Overview
➢ Big Data Storage Concept
12
BD Storage
• Data acquired from external sources is often not in a format or structure that
can be directly processed.
• Data wrangling includes steps to filter, cleanse and otherwise prepare the
data for downstream analysis.
BD Storage
• From a storage perspective, a copy of the data is first stored in its acquired
format, and, after wrangling, the prepared data needs to be stored again.
• Each node in the cluster has its own dedicated resources, such as memory, a
processor, and a hard drive.
• A cluster can execute a task by splitting it into small pieces and distributing
their execution onto different computers that belong to the cluster.
BD Storage
Clusters
BD Storage
File System
• A file system is the method of storing and organizing data on a storage device, such as flash
drives, DVDs and hard drives.
• A file is an atomic unit of storage used by the file system to store data.
• A file system provides a logical view of the data stored on the storage device and presents it
as a tree structure of directories and files.
• Operating systems employ file systems to store and retrieve data on behalf of applications.
• Each operating system provides support for one or more file systems (e.g. NTFS on
Microsoft Windows).
BD Storage
File System
BD Storage
Distributed File System
• A distributed file system is a file system that can store large files spread
across the nodes of a cluster.
• To the client, files appear to be local; however, the files are distributed
throughout the cluster.
• This local view is presented via the distributed file system and it enables the
files to be accessed from multiple locations.
BD Storage
Distributed File System
• A NoSQL database often provides a query interface that can be called from
within an application.
BD Storage
Sharding
• Sharding is the process of horizontally partitioning a large dataset into a
collection of smaller, more manageable datasets called shards.
• The shards are distributed across multiple nodes, where a node is a server
or a machine.
• Each shard is stored on a separate node and each node is responsible for
only the data stored on it.
• Each shard shares the same schema, and all shards collectively represent
the complete dataset.
BD Storage
Sharding
BD Storage
Replication
• Replication provides scalability and availability due to the fact that the same
data is replicated on various nodes.
• Fault tolerance is also achieved since data redundancy ensures that data is
not lost when an individual node fails.
BD Storage
Replication
Overview
➢ Big Data Storage Concept
26
BD Processing
Parallel Data Processing
Hadoop is a versatile framework that provides both processing and storage capabilities.
BD Processing
Processing Workloads
A processing workload in Big Data is defined as the amount and nature of data
that is processed within a certain amount of time.
Workloads are usually divided into two types:
• batch
• transactional
BD Processing
Processing Workloads
38
BD Analysis
Basic types of data analysis:
• quantitative analysis
• qualitative analysis
• data mining
• statistical analysis
• machine learning
• semantic analysis
• visual analysis
BD Analysis
Quantitative analysis: a data analysis technique that focuses on quantifying
the patterns and correlations found in the data. Based on statistical practices, it
involves analyzing a large number of observations from a dataset.
BD Analysis
Qualitative analysis: a data analysis technique that focuses on describing
various data qualities using words.
BD Analysis
Data mining or data discovery, refers to automated, software-based techniques
that sift through massive datasets to identify patterns and trends.
Data mining forms the basis for predictive analytics and business intelligence (BI).
BD Analysis
Statistical analysis uses statistical methods based on mathematical formulas as a
means for analyzing data. Could be quantitative (most often) or qualitative.
It can also be used to infer patterns and relationships within the dataset, such as
regression and correlation.
BD Analysis
This is the basic concept of machine learning is to combine human knowledge
with the processing speed of machines, to be able to process large amounts of
data without requiring much human intervention.
• Classification
• Clustering
• Outlier Detection
• Filtering
BD Analysis
Semantic analysis represents practices for extracting meaningful information
from textual and speech data.
• Text Analytics
• Sentiment Analysis
BD Analysis
Visual Analysis involves the graphic representation of data to enable or enhance
its visual perception. It acts as a discovery tool in the field of Big Data.
Visual analysis helps to identify and highlight hidden patterns, correlations and
anomalies. It also encourages the formulation of questions from different angles.
They also facilitate the identification of areas of interest and the discovery of
extreme (high/low) values within a dataset.
A heat map itself is a visual, color-coded representation of data values. Each value
is given a color according to its type or the range that it falls under.
Heat map depicting the sales of three divisions within a company over a period of six months.
BD Analysis
Example
• How can I visually identify any patterns related to carbon emissions across a
large number of cities around the world?
• How can I see if there are any patterns of different types of cancers in
relation to different ethnicities?
Time series analysis helps to uncover patterns within data that are time-
dependent. Once identified, the pattern can be extrapolated for future
predictions.
Time series analyses are usually used for forecasting by identifying long-term
trends, seasonal periodic patterns and irregular short-term variations in the
dataset.
BD Analysis
Example
BD Analysis
• How much yield should the farmer expect based on historical yield data?
• Are two individuals related to each other via a long chain of ancestry?
The GIS provides tooling that enables interactive exploration of the spatial data,
for example measuring the distance between two points, or defining a region
around a point as a circle with a defined distance-based radius.
BD Analysis
• Applications of spatial data analysis include operations and logistic
optimization, environmental sciences and infrastructure planning.
• Data used as input for spatial data analysis can either contain exact locations,
such as longitude and latitude, or the information required to calculate
locations, such as zip codes or IP addresses.
• Spatial data analysis can be used to determine the number of entities that
fall within a certain radius of another entity.
BD Analysis
BD Analysis
• Where are the high and low concentrations of a particular mineral based on
readings taken from a number of sample locations within an area?
61