
CS6CRT19 Big Data Analytics Module 1

Big Data Definitions


Big Data is high-volume, high-velocity and/or high-variety information
assets that require new forms of processing for enhanced decision making, insight
discovery, and process optimization.
Other definitions found in the existing literature include the following:
A collection of data sets so large or complex that traditional data processing
applications are inadequate. - Wikipedia
Data of a very large size, typically to the extent that its manipulation and
management present significant logistical challenges. - Oxford English Dictionary

The Characteristics (5Vs) of Big Data


For a dataset to be considered Big Data, it must possess one or more of the
following characteristics that require accommodation in the solution design and
architecture of the analytic environment:
1. Volume
2. Velocity
3. Variety
4. Veracity
5. Value
Volume
The anticipated volume of data that is processed by Big Data solutions is
substantial and ever-growing. High data volumes impose distinct data storage and
processing demands, as well as additional data preparation, curation, and
management processes.
Velocity
In Big Data environments, data can arrive at fast speeds, and enormous
datasets can accumulate within very short periods of time. From an enterprise’s point
of view, the velocity of data translates into the amount of time it takes for the data
to be processed once it enters the enterprise’s perimeter. Coping with the fast inflow
of data requires the enterprise to design highly elastic and available data processing
solutions and corresponding data storage capabilities.
Variety
Data variety refers to the multiple formats and types of data that need to be
supported by Big Data solutions. Data variety brings challenges for enterprises in
terms of data integration, transformation, processing, and storage.
Veracity
Veracity refers to the quality or fidelity of data. Data that enters Big Data
environments needs to be assessed for quality, which can lead to data processing
activities to resolve invalid data and remove noise. In relation to veracity, data can
be part of the signal or noise of a dataset.
Noise is data that cannot be converted into information and thus has no value,
whereas signals have value and lead to meaningful information. Data with a high
signal-to-noise ratio has more veracity than data with a lower ratio. Data that is
acquired in a controlled manner (for example, via online customer registration)
usually contains less noise. Data acquired via uncontrolled sources (such as blog
postings) contains more noise.
Value
Value is defined as the usefulness of data for an enterprise. The value
characteristic is intuitively related to the veracity characteristic in that the higher the
data fidelity, the more value it holds for the business. Value is also dependent on
how long it takes to process the data because analytics results have a shelf-life. For
instance, a 20-minute-delayed stock quote has no value for making a stock trade.


Types/Sources of Big Data


The following are the types (sources) of big data, as suggested by IBM and
the Big Data task team:
● Social networks and web data, such as Facebook, Twitter, e-mails, blogs,
and YouTube.
● Transactions data and Business Processes data, such as credit card
transactions, flight bookings, etc. and public agencies data such as medical
records, insurance business data, etc.
● Customer master data, such as data for facial recognition and for the name,
date of birth, marriage anniversary, gender, location and income category.
● Machine-generated data, such as machine-to-machine or Internet of Things
(IoT) data, and data from sensors, trackers, web logs, and computer
system logs. Computer-generated data is also considered machine-generated
data; programs that process data held in data repositories, such as databases
or files, likewise generate machine-generated data.
● Human-generated data, such as biometrics data, human-machine interaction
data, e-mail records on a mail server, and a MySQL database of student grades.

Classification/Nature of Data
Data can be classified based on its nature, as structured, semi-structured, and
unstructured data.
Structured Data
Structured data conforms to data schemas and data models and is typically
found in tables. Structured data enables the following (a short sketch follows the list):
● Data insert, delete, update, and append
● Indexing to enable faster data retrieval

● Scalability, which enables increasing or decreasing capacities and data
processing operations such as storing, processing, and analytics.
● Transaction processing, which follows ACID (Atomicity, Consistency,
Isolation, and Durability) rules.
● Encryption and decryption for data security
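
As a minimal illustration of these operations (not part of the original notes), the
following Python sketch uses the built-in sqlite3 module on a hypothetical students
table; the table name, columns, and values are assumptions made for the example:

import sqlite3

# In-memory relational database; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, grade REAL)")
cur.execute("CREATE INDEX idx_name ON students (name)")   # indexing for faster retrieval

# Insert and update run inside a transaction, following the ACID rules.
with conn:                                                 # commits on success, rolls back on error
    cur.execute("INSERT INTO students (name, grade) VALUES (?, ?)", ("Anu", 8.5))
    cur.execute("UPDATE students SET grade = ? WHERE name = ?", (9.0, "Anu"))

print(cur.execute("SELECT * FROM students").fetchall())
conn.close()
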
Semi-structured Data
Examples of semi-structured data are XML (Extensible Markup Language) and
JSON (JavaScript Object Notation) documents. Semi-structured data contains tags
or other markers, which separate semantic elements and enforce hierarchies of
records and fields within the data. Semi-structured data does not conform to formal
data models, such as the relational database and table models.
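
For example, a JSON document carries its structure in tags and nesting rather than in
a fixed relational schema. A minimal Python sketch (the record shown is hypothetical):

import json

# A hypothetical semi-structured record: fields are self-describing and nested.
doc = '{"name": "Anu", "courses": [{"code": "CS6CRT19", "title": "Big Data Analytics"}]}'

record = json.loads(doc)                 # parse the JSON text into Python objects
print(record["name"])                    # Anu
print(record["courses"][0]["code"])      # CS6CRT19
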
Unstructured Data
Unstructured data does not possess data features such as tables or a database.
Unstructured data is found in file types such as .TXT and .CSV. Data may exist as
key-value pairs, and may have internal structure, such as in emails. The data does not
reveal relationships, hierarchies, or object-oriented features, such as extensibility.
Relationships, schema, and features need to be established separately. Growth in
data today is mostly in the form of unstructured data.


Challenges of Conventional Systems

1. The Uncertainty of Data Management:

One disruptive facet of big data management is the use of a wide range of
innovative data management tools and frameworks whose designs are dedicated to
supporting operational and analytical processing. NoSQL (not only SQL)
frameworks are used that differ from traditional relational database
management systems and are largely designed to fulfill the performance demands
of big data applications, such as managing large amounts of data with quick response
times. There are a variety of NoSQL approaches, such as hierarchical object
representation (for example JSON, XML, and BSON) and the concept of key-value
storage. The wide range of NoSQL tools and developers and the state of the market
are creating uncertainty in data management.
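
To make the key-value idea concrete, the following Python sketch (purely illustrative,
not any particular NoSQL product) stores whole documents against keys instead of rows
in fixed tables; the keys and field names are hypothetical:

import json

store = {}   # a stand-in for a key-value store: each key maps to a schema-free document

store["user:42"] = json.dumps({"name": "Anu", "tags": ["student", "bigdata"]})
store["user:43"] = json.dumps({"name": "Biju", "city": "Kochi"})   # different fields are fine

user = json.loads(store["user:42"])      # look up a document directly by its key
print(user["name"], user["tags"])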

2. Talent Gap in Big Data:

It is difficult to follow the tech media and analyst community without being
bombarded with content touting the value of big data analysis and the
corresponding reliance on a wide range of disruptive technologies. The tools that
have evolved in this sector range from traditional relational database tools with
alternative data layouts designed to maximize access speed while reducing the
storage footprint, to NoSQL data management frameworks, in-memory analytics,
and the broad Hadoop ecosystem. The reality is that there is a lack of skills
available in the market for big data technologies. The typical expert has gained
experience through tool implementation and its use as a programming model, rather
than in the broader aspects of big data management.


3. Getting Data into Big Data Structure:

It might seem obvious that the intent of big data management is to analyze
and process large amounts of data. Many people have raised expectations about
analyzing huge data sets on a big data platform, yet they may not be aware of the
complexity behind transmitting, accessing, and delivering data and information
from a wide range of sources and then loading that data into a big data platform.
The intricate aspects of data transmission, access, and loading are only part of the
challenge. The requirement to handle transformation and extraction is not limited
to conventional relational data sets.

4. Syncing Across Data Sources:

Once you import data into big data platforms, you may also realize that data
copies migrated from a wide range of sources at different rates and on different
schedules can rapidly get out of synchronization with the originating systems.
Keeping data synchronized means ensuring that data coming from one source is not
out of date compared with data coming from another source, and it also requires
commonality of data definitions, concepts, metadata, and the like. As in traditional
data management and data warehouses, the sequences of data transformation,
extraction, and migration all give rise to situations in which data is at risk of
becoming unsynchronized.
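
One simple way to detect such drift, sketched below in Python under the assumption that
every source exposes a last-updated timestamp for each record (the sources, fields, and
tolerance are hypothetical):

from datetime import datetime, timedelta

# Hypothetical copies of the same customer record held by two sources.
crm_copy = {"customer_id": 42, "updated_at": datetime(2024, 1, 10, 9, 0)}
billing_copy = {"customer_id": 42, "updated_at": datetime(2024, 1, 12, 17, 30)}

def out_of_sync(a, b, tolerance=timedelta(hours=24)):
    """Flag record copies whose update times differ by more than the allowed lag."""
    return abs(a["updated_at"] - b["updated_at"]) > tolerance

if out_of_sync(crm_copy, billing_copy):
    print("Copies are out of sync; refresh the older source before merging.")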

5. Extracting Information from the Data in Big Data Integration:

The most practical use cases for big data involve making data available,
augmenting existing data storage, and allowing end users to access the data through
business intelligence tools for data discovery. These business intelligence tools must
be able to connect to different big data platforms and provide transparency to data
consumers, eliminating the need for custom coding. At the same time, as the number
of data consumers grows, there is a need to support an increasing number of
simultaneous user accesses. This demand may also spike at any time in reaction to
different aspects of business process cycles. Ensuring right-time data availability to
data consumers therefore becomes a challenge of big data integration.

6. Miscellaneous Challenges:

Other challenges may occur while integrating big data. These include data
integration, skill availability, solution cost, the volume of data, the rate of
transformation of data, and the veracity and validity of data. One challenge is the
ability to merge data that differs in source or structure, and to do so at a reasonable
cost and in a reasonable time. It is also a challenge to process large amounts of data
at a reasonable speed so that information is available to data consumers when they
need it. Data sets must also be validated while data is transferred from one source to
another or onward to consumers.

Intelligent Data Analysis (IDA)

Intelligent Data Analysis (IDA) discloses hidden facts that are not known previously
and provides potentially important information or facts from large quantities of data. It also
helps in making a decision. IDA helps to obtain useful information, necessary data and
interesting models from a lot of data available online in order to make the right choices.

Steps Involved in IDA

IDA, in general, includes three stages: (1) preparation of data; (2) data mining;
and (3) data validation and explanation. The preparation of data involves selecting the
required data from the relevant data source and incorporating it into a data set that can
be used for data mining. The main goal of intelligent data analysis is to obtain knowledge.
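
A minimal Python sketch of the three stages, using a hypothetical list of purchase
amounts (the data, the filtering rule, and the statistic are assumptions for illustration):

# Hypothetical raw data: purchase amounts, including missing and invalid entries.
raw = [120.0, None, 85.5, -1.0, 430.0, 95.0]

# 1. Preparation: select usable records from the source and build the data set.
prepared = [x for x in raw if x is not None and x >= 0]

# 2. Data mining: extract a simple pattern (here, the average purchase).
average = sum(prepared) / len(prepared)

# 3. Validation and explanation: check the result and state it as knowledge.
assert prepared, "no valid records survived preparation"
print(f"Average purchase over {len(prepared)} valid records: {average:.2f}")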


Processes in Big Data Analytics


Big data analytics refers to collecting, processing, cleaning, and analyzing
large datasets to help organizations operationalize their big data.

1. Collect Data

Data collection looks different for every organization. With today’s
technology, organizations can gather both structured and unstructured data from a
variety of sources — from cloud storage to mobile applications to in-store IoT
sensors and beyond. Some data will be stored in data warehouses where business
intelligence tools and solutions can access it easily.

Raw or unstructured data that is too diverse or complex for a warehouse may
be assigned metadata and stored in a data lake.

2. Process Data

Once data is collected and stored, it must be organized properly to get accurate
results on analytical queries, especially when it’s large and unstructured. Available
data is growing exponentially, making data processing a challenge for organizations.

One processing option is batch processing, which looks at large data blocks
over time. Batch processing is useful when there is a longer turnaround time between
collecting and analyzing data.

Stream processing looks at small batches of data at once, shortening the
delay time between collection and analysis for quicker decision-making. Stream
processing is more complex and often more expensive.
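
A minimal Python sketch contrasting the two options, assuming a hypothetical list of
sensor readings: batch processing analyses a whole block after it has accumulated,
while stream processing acts on each record as it arrives.

readings = [21.5, 22.0, 21.8, 35.9, 22.1]       # hypothetical sensor values

# Batch processing: analyse one large block of collected data.
def batch_average(block):
    return sum(block) / len(block)

print("batch average:", batch_average(readings))

# Stream processing: act on each record (or small batch) as it arrives.
def stream_alerts(source, threshold=30.0):
    for value in source:                         # source could be a live feed
        if value > threshold:
            yield f"alert: reading {value} exceeds {threshold}"

for alert in stream_alerts(iter(readings)):
    print(alert)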

3. Clean Data

Data requires scrubbing to improve data quality and get stronger results; all
data must be formatted correctly, and any duplicative or irrelevant data must be
eliminated or accounted for. Dirty data can obscure and mislead, creating flawed
insights.
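
A minimal cleaning sketch using pandas (assuming pandas is available; the records,
column names, and rules are hypothetical):

import pandas as pd

# Hypothetical raw records with inconsistent formatting, a duplicate, and a missing name.
df = pd.DataFrame({
    "name": ["Anu ", "anu", "Biju", None],
    "spend": ["100", "100", "250", "75"],
})

df["name"] = df["name"].str.strip().str.title()   # format text consistently
df["spend"] = pd.to_numeric(df["spend"])          # correct the column type
df = df.drop_duplicates()                         # eliminate duplicative rows
df = df.dropna(subset=["name"])                   # drop records missing key fields
print(df)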

4. Analyze Data

Getting big data into a usable state takes time. Once it’s ready, advanced
analytics processes can turn big data into big insights. Some of these big data
analysis methods include the following (a small sketch follows the list):

● Data mining sorts through large datasets to identify patterns and relationships
by identifying anomalies and creating data clusters.


● Predictive analytics uses an organization’s historical data to make
predictions about the future, identifying upcoming risks and opportunities.


● Deep learning imitates human learning patterns by using artificial
intelligence and machine learning to layer algorithms and find patterns in the
most complex and abstract data.
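
As a small illustration of the predictive-analytics idea above (not of any specific
product), the sketch below fits a model to hypothetical historical sales with
scikit-learn and predicts the next period; the figures are invented for the example:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: month number versus sales.
months = np.array([[1], [2], [3], [4], [5]])
sales = np.array([100, 110, 125, 140, 150])

model = LinearRegression().fit(months, sales)     # learn the historical trend
forecast = model.predict(np.array([[6]]))         # predict the upcoming month
print(f"forecast for month 6: {forecast[0]:.1f}")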

Analysis Vs Reporting

Following are the five major differences between Analysis and Reporting:

1. Purpose

Reporting has helped companies monitor their data since even before the digital
technology boom. Organizations depend on the information it brings to their
business, as reporting extracts that information and makes it easier to understand.

Analysis interprets data at a deeper level. While reporting can link data across
channels, provide comparisons, and make information easier to understand
(think of dashboards, charts, and graphs, which are reporting tools and not analysis
reports), analysis interprets this information and provides recommendations for
action.

2. Tasks

Reporting includes building, configuring, consolidating, organizing,
formatting, and summarizing. These tasks are similar to those mentioned above, such
as turning data into charts and graphs and linking data across multiple channels.

Analysis consists of questioning, examining, interpreting, comparing, and
confirming. With big data, predicting is possible as well.

3. Outputs

Reporting has a push approach, as it pushes information to users, and outputs
come in the form of canned reports, dashboards, and alerts.

Analysis has a pull approach, where a data analyst draws information to
probe further and to answer business questions. Analysis presentations consist
of insights, recommended actions, and a forecast of their impact on the
company—all in a language that’s easy to understand at the level of the user who’ll
be reading and deciding on it.

4. Delivery

Analysis requires a more custom approach, with human minds doing superior
reasoning and analytical thinking to extract insights, and technical skills to provide

efficient steps towards accomplishing a specific goal. This is why data analysts and
data scientists are in demand these days, as organizations depend on them to come up
with recommendations that leaders and business executives can use to make decisions
about their businesses.

5. Value

Reporting itself is just numbers. Without drawing insights and getting reports
aligned with your organization’s big picture, you can’t make decisions based on
reports alone.

Data analysis is the most powerful tool to bring into your business. Employing
the powers of analysis can be comparable to finding gold in your reports, which
allows your business to increase profits and further develop.

Modern Big Data Analytics Tools and Technology

Big data analytics cannot be narrowed down to a single tool or technology.
Instead, several types of tools work together to help you collect, process, cleanse,
and analyze big data. Some of the major players in big data ecosystems are listed
below.

● Hadoop is an open-source framework that efficiently stores and processes big
datasets on clusters of commodity hardware. This framework is free and can
handle large amounts of structured and unstructured data, making it a valuable
mainstay for any big data operation.
● NoSQL databases are non-relational data management systems that do not
require a fixed schema, making them a great option for big, raw, unstructured
data. NoSQL stands for “not only SQL,” and these databases can handle a
variety of data models.
● Spark is an open-source cluster computing framework that uses implicit data
parallelism and fault tolerance to provide an interface for programming entire
clusters. Spark can handle both batch and stream processing for fast
computation (a short usage sketch follows this list).
● R programming: R is a free, open-source programming language and
software environment for statistical computing and graphics. It is used by
data miners for developing statistical software and for data analysis. It
has become a highly popular tool for big data in recent years.
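
As a minimal sketch of how one of these tools is used (assuming PySpark is installed;
the input file path is hypothetical), a Spark job that counts words could look roughly
like this:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Hypothetical input file; Spark distributes the work across the cluster.
lines = spark.read.text("logs.txt")
words = lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
counts = words.groupBy("word").count()

counts.show()
spark.stop()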

Statistical Concepts: Sampling Distributions

In statistics, a population is the entire pool from which a statistical sample is
drawn. A population may refer to an entire group of people, objects, events, hospital
visits, or measurements. A population can thus be said to be an aggregate
observation of subjects grouped together by a common feature.

A lot of data drawn and used by academicians, statisticians, researchers,
marketers, analysts, etc. are actually samples, not populations. A sample is a subset
of a population.

A sampling distribution is a probability distribution of a statistic obtained
from a larger number of samples drawn from a specific population. The sampling
distribution of a given population is the distribution of frequencies of a range of
different outcomes that could possibly occur for a statistic of a population.
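
A small simulation in Python (using numpy; the population and sample sizes are
hypothetical) makes the idea concrete: draw many samples from one population and look
at the distribution of the sample means.

import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=100_000)   # hypothetical population

# Draw many samples and record the statistic (here, the mean) of each one.
sample_means = [rng.choice(population, size=30, replace=False).mean() for _ in range(1_000)]

# The collection of sample means approximates the sampling distribution of the mean.
print("mean of sample means:", np.mean(sample_means))
print("spread of sample means (standard error):", np.std(sample_means))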

Re-Sampling

The problem with the sampling process is that we only have a single estimate
of the population parameter, with little idea of the variability or uncertainty in the
estimate. One way to address this is by estimating the population parameter multiple
times from our data sample. This is called resampling.

Re-sampling is the method that consists of creating or drawing repeated
samples from the original sample. Resampling involves the selection of randomized
cases with replacement from the original data sample, in such a manner that each
sample drawn has the same number of cases as the original data sample.
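
A minimal bootstrap-style sketch in Python (using numpy; the sample values are
hypothetical): repeatedly resample the original sample with replacement, each resample
having the same number of cases, and observe how the estimate varies.

import numpy as np

rng = np.random.default_rng(1)
sample = np.array([12, 15, 14, 10, 18, 20, 11, 16])   # hypothetical original sample

# Each resample is drawn with replacement and has the same size as the original sample.
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(2_000)]

# The spread of the resampled estimates indicates the uncertainty in the estimate.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean: {sample.mean():.2f}, approximate 95% interval: [{low:.2f}, {high:.2f}]")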

Statistical Inference

The primary objective of a sample study is to create inferences (conclusions)
about the population by examining only a part of the population. Inferences created
in such a way are called statistical inferences. Statistical inference is a process by
which we create conclusions about the population based on samples drawn from the
population.

Prediction Error

Predictive analytical processes use new and historical data to forecast activity,
behaviour, and trends. A prediction error is the failure of some expected event to
occur. When a prediction fails, humans can respond by examining the
predictions and failures and deciding on methods to overcome such errors in the
future. Applying that kind of knowledge can inform decisions and improve the
quality of future predictions.
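
A brief Python sketch of measuring prediction error (the forecast and actual figures are
hypothetical): compare what was predicted with what actually happened, and use the error
to judge and improve the method.

# Hypothetical forecast versus actual daily sales.
predicted = [100, 120, 130, 150]
actual = [110, 118, 140, 135]

errors = [p - a for p, a in zip(predicted, actual)]
mae = sum(abs(e) for e in errors) / len(errors)     # mean absolute error

print("errors per day:", errors)
print("mean absolute error:", mae)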
