
UNIT 3

Introduction to Big Data: Definition, Types, Characteristics & Examples

What is Data?

Data is the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.

What is Big Data?

Big Data is also data, but of enormous size. 'Big Data' is a term used to describe a collection of data that is huge in volume and yet growing exponentially with time. In short, such data is so large and complex that no traditional data management tool can store or process it efficiently.

In this tutorial, you will learn:

 Examples of Big Data
 Types of Big Data
 Characteristics of Big Data
 Advantages of Big Data Processing

Examples of Big Data


Following are some examples of Big Data:

Stock Exchange

The New York Stock Exchange generates about one terabyte of new trade data per day.

Social Media

Statistics show that more than 500 terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.

Jet Engine

A single jet engine can generate more than 10 terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
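As a rough back-of-the-envelope illustration (the flight count and average flight duration below are assumed values for illustration, not figures from these notes), the daily volume can be estimated in a few lines of Python:

# Back-of-the-envelope estimate of daily jet-engine data volume.
# All inputs are illustrative assumptions.
tb_per_30_min = 10          # ~10 TB per 30 minutes of flight time
avg_flight_hours = 2        # assumed average flight duration
flights_per_day = 25_000    # assumed number of daily flights

tb_per_flight = tb_per_30_min * (avg_flight_hours * 60 / 30)
pb_per_day = flights_per_day * tb_per_flight / 1_000   # 1 PB = 1,000 TB
print(f"{pb_per_day:.0f} PB of engine data per day")    # -> 1000 PB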

Types Of Big Data


Big Data can be found in three forms:

1. Structured
2. Unstructured
3. Semi-structured

Structured

Any data that can be stored, accessed, and processed in the form of a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value out of it. However, nowadays we foresee issues when the size of such data grows to a huge extent; typical sizes are in the range of multiple zettabytes.

Do you know? 10^21 bytes (one billion terabytes) make up one zettabyte.

Looking at these figures one can easily understand why the name Big Data is given and imagine the
challenges involved in its storage and processing.

Do you know? Data stored in a relational database management system is one example of
a 'structured' data.

Examples Of Structured Data

An 'Employee' table in a database is an example of structured data:

Employee_ID | Employee_Name   | Gender | Department | Salary (Rs.)
2365        | Rajesh Kulkarni | Male   | Finance    | 650000
3398        | Pratibha Joshi  | Female | Admin      | 650000
7465        | Shushil Roy     | Male   | Admin      | 500000
7500        | Shubhojit Das   | Male   | Finance    | 500000
7699        | Priya Sane      | Female | Finance    | 550000
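A minimal sketch of how such structured data is typically defined and queried, using Python's built-in sqlite3 module (the table mirrors the example above):

import sqlite3

# Structured data: the schema (fixed format) is known in advance.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employee (
        Employee_ID   INTEGER PRIMARY KEY,
        Employee_Name TEXT,
        Gender        TEXT,
        Department    TEXT,
        Salary        INTEGER
    )
""")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?, ?)",
    [
        (2365, "Rajesh Kulkarni", "Male", "Finance", 650000),
        (3398, "Pratibha Joshi", "Female", "Admin", 650000),
        (7465, "Shushil Roy", "Male", "Admin", 500000),
        (7500, "Shubhojit Das", "Male", "Finance", 500000),
        (7699, "Priya Sane", "Female", "Finance", 550000),
    ],
)

# The fixed format makes querying straightforward.
for row in conn.execute(
        "SELECT Employee_Name, Salary FROM Employee WHERE Department = 'Finance'"):
    print(row)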

Unstructured
Any data with unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it comes to processing it for deriving value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays organizations have a wealth of data available to them, but unfortunately they don't know how to derive value out of it, since this data is in its raw form or unstructured format.

Examples Of Unstructured Data

The output returned by 'Google Search'

Semi-structured

Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is not actually defined by, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.

Examples Of Semi-structured Data

Personal data stored in an XML file-

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>

<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
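A minimal sketch of reading this semi-structured data in Python with the standard-library xml.etree.ElementTree module; the records carry self-describing tags but no externally enforced schema (a root element is added here only so the records parse as one document):

import xml.etree.ElementTree as ET

# Semi-structured: self-describing tags, no rigid table definition.
xml_data = """<people>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</people>"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    print(rec.findtext("name"), rec.findtext("age"))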

Data Growth over the years

Please note that web application data, which is unstructured, consists of log files, transaction history files, etc. OLTP systems are built to work with structured data, wherein data is stored in relations (tables).

Characteristics of Big Data


(i) Volume – The name Big Data itself is related to a size which is enormous. The size of data plays a very crucial role in determining its value. Whether particular data can actually be considered Big Data or not also depends upon its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.

(ii) Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only data sources considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.

(iii) Velocity – The term 'velocity' refers to the speed of data generation. How fast the data is generated and processed to meet demands determines the real potential of the data.

Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
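A minimal sketch of the velocity idea: instead of waiting for a complete, static dataset, each record is handled as it arrives (the event source below is simulated):

import random
import time

def event_stream():
    # Simulated continuous source (e.g., a sensor or application log).
    while True:
        yield {"ts": time.time(), "value": random.random()}
        time.sleep(0.1)

# Process each event on arrival rather than as one batch.
for i, event in enumerate(event_stream()):
    print(f"processed event {i}: {event['value']:.3f}")
    if i >= 4:      # stop the demonstration after a few events
        break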

(iv) Variability – This refers to the inconsistency which the data can show at times, hampering the process of handling and managing the data effectively.

Benefits of Big Data Processing

The ability to process Big Data brings multiple benefits, such as:

o Businesses can utilize outside intelligence while making decisions


Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.

o Improved customer service

Traditional customer feedback systems are being replaced by new systems designed with Big Data technologies. In these new systems, Big Data and natural language processing technologies are used to read and evaluate consumer responses.

o Early identification of risks to products/services, if any


o Better operational efficiency

Big Data technologies can be used to create a staging area or landing zone for new data before identifying which data should be moved to the data warehouse. In addition, such integration of Big Data technologies with the data warehouse helps an organization offload infrequently accessed data.

Summary

 Big Data is data that is huge in size and yet growing exponentially with time.
 Examples of Big Data generation include stock exchanges, social media sites, jet engines, etc.
 Big Data can be 1) structured, 2) unstructured, or 3) semi-structured.
 Volume, Variety, Velocity, and Variability are a few characteristics of Big Data.
 Improved customer service, better operational efficiency, and better decision making are a few advantages of Big Data.

What do you mean by data processing?

Data processing is the conversion of data into a usable and desired form. This conversion, or “processing”, is carried out using a predefined sequence of operations, either manually or automatically. Most data processing is done using computers; when the conversion is carried out by the computer itself, it is referred to as automatic data processing. The output, or “processed” data, can be obtained in different forms like images, graphs, tables, vector files, audio, charts, or any other desired format, depending on the software or method of data processing used.

Fundamentals of data processing & how data is processed

Any activity which requires a collection of data involves data processing. The collected data needs to be stored, sorted, processed, analyzed, and presented. This complete process can be divided into six simple primary stages:

 Data collection
 Storage of data
 Sorting of data
 Processing of data
 Data analysis
 Data presentation and conclusions
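A minimal sketch of these six stages as a pipeline of Python functions; each stage body is a placeholder, since real implementations depend entirely on the data and tools involved:

# Skeleton of the six-stage data processing pipeline.
def collect():                 # 1. Data collection
    return [3, 1, 2, 2, None]

def store(data):               # 2. Storage (here: kept in memory)
    return list(data)

def sort_and_filter(data):     # 3. Sorting of data
    return sorted(x for x in data if x is not None)

def process(data):             # 4. Processing (placeholder transformation)
    return [x * 10 for x in data]

def analyze(data):             # 5. Data analysis (placeholder summary)
    return {"count": len(data), "mean": sum(data) / len(data)}

def present(result):           # 6. Data presentation and conclusions
    print(result)

present(analyze(process(sort_and_filter(store(collect())))))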

Once the data is collected, the need for data entry emerges, so that the data can be stored. Storage can be done in physical form, using paper, notebooks, or any other physical medium. With the emergence of and growing emphasis on computer systems, Big Data, and data mining, collected data is large and a number of operations need to be performed for meaningful analysis and presentation, so the data is stored in digital form. Having the raw and processed data in digital form enables the user to perform a large number of operations in a short time and allows conversion into different formats. The user can thus select the output which best suits the requirement.
This continuous use and processing of data follows a cycle, called the data processing cycle (or information processing cycle), which might provide instant results or take time, depending upon the processing the data needs. The complexity in the field of data processing is increasing, which is creating a need for advanced techniques.
Storage of data is followed by sorting and filtering. This stage is profoundly affected by the format in which the data is stored, and further depends on the software used. Everyday, non-complex data can be stored as text files, tables, or a combination of both in Microsoft Excel or similar software. As tasks become complex and require specific, specialized operations, they call for different data processing tools and software meant to cater to those peculiar needs.
Storing, sorting, filtering, and processing of data can be done by a single piece of software or by a combination of software, whichever is feasible and required. Data processing carried out by software follows a predefined set of operations. Most modern software allows users to perform different actions based on the analysis or study to be carried out, and provides the output file in various formats.

Different types of output files obtained as “processed” data

 Plain text file – These constitute the simplest form of processed data. Most of these files are user-readable and easy to comprehend. Negligible or no further processing is needed for these types of files. They are exported as Notepad or WordPad files.
 Table/spreadsheet – This file format is most suitable for numeric data. Having digits in rows and columns allows the user to perform various operations, like filtering and sorting in ascending/descending order, making the data easy to understand and use. Various mathematical operations can be applied when using this output format.
 Charts & graphs – The option to get the output in the form of charts and graphs is handy and now a standard feature in most software. This option is beneficial when dealing with numerical values reflecting trends and growth/decline. Though ample charts and graphs are available to match diverse requirements, there exist situations that call for a user-defined option. In case no inbuilt chart or graph is available, the option to create your own, i.e., custom charts/graphs, comes in handy.
 Maps/Vector or image file – When dealing with spatial data, the option to export the processed data into maps, vector, and image files is of great use. Having the information on maps is of particular use for urban planners who work on different types of maps. Image files are obtained when dealing with graphics and are not directly human-readable.
 Other formats/raw files – These are software-specific file formats which can be used and processed by specialized software. These output files may not be a complete product and may require further processing, so multiple rounds of data processing may be needed.
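A minimal sketch of exporting the same processed data into several of these output formats, assuming the pandas library is available:

import pandas as pd

df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "sales": [120, 150, 90]})

df.to_csv("sales.csv", index=False)      # table/spreadsheet output
df.to_json("sales.json")                 # raw, software-specific output
with open("sales.txt", "w") as f:        # plain-text output
    f.write(df.to_string(index=False))

# Chart output (uncomment if matplotlib is installed):
# df.plot(x="month", y="sales", kind="bar").get_figure().savefig("sales.png")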

Methods of data processing

 Manual data processing – In this method, data is processed manually, without the use of a machine, tool, or electronic device. All the calculations and logical operations are performed by hand on the data.
 Mechanical data processing – Data processing is done using mechanical devices or very simple electronic devices like calculators and typewriters. When the processing need is simple, this method can be adopted.
 Electronic data processing – This is the modern technique for processing data, and it is the fastest and best available method, with the highest reliability and accuracy. This method uses computers and is employed by most agencies; the use of software forms part of this type of data processing. Data and a set of instructions are given to the computer as input, and the computer automatically processes the data according to the given set of instructions. The computer is also known as an electronic data processing machine.

Types of data processing on the basis of process/steps performed

There are various types of data processing, some of the most popular types are as follows:
 Batch Processing
 Real-time processing
 Online Processing
 Multiprocessing
 Time-sharing
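A minimal sketch contrasting the first two types: batch processing accumulates records and handles them in groups, while real-time processing handles each record the moment it arrives (the record source here is simulated):

records = range(10)     # simulated incoming records

# Batch processing: accumulate, then process in fixed-size groups.
batch, batch_size = [], 4
for r in records:
    batch.append(r)
    if len(batch) == batch_size:
        print("batch processed:", batch)
        batch = []
if batch:                # process any leftover partial batch
    print("batch processed:", batch)

# Real-time processing: handle each record immediately.
for r in records:
    print("processed immediately:", r)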

What makes data processing important

Nowadays more and more data is collected for academic, scientific-research, private and personal, institutional, and commercial use. This collected data needs to be stored, sorted, filtered, analyzed, and presented, and may even require transfer, for it to be of any use. The process can be simple or complex depending on the scale at which data collection is done and the complexity of the results required. The time consumed in obtaining the desired result depends on the operations which need to be performed on the collected data and on the nature of the required output file. This problem becomes starker when dealing with very large volumes of data, such as those collected by multinational companies about their users, sales, manufacturing, etc. Data processing services and companies dealing with personal information and other sensitive information must be careful about data protection.
The need for data processing becomes more and more critical in such cases. Here, data mining and data management come into play, without which optimal results cannot be obtained. Each stage, from data collection to presentation, has a direct effect on the output and usefulness of the processed data. Sharing a dataset with a third party must be done carefully, as per a written data processing agreement and service agreement; this prevents data theft, misuse, and loss of data.

What type of data needs to be processed?

Data in any form and of any type requires processing most of the time. This data can be categorised as personal information, financial transactions, tax credits, banking details, computational data, images, and almost anything else you can think of. The quantum of processing required will depend on the specialized processing which the data requires, and subsequently on the output you require. With the increase in demand and the requirement for automatic data processing and electronic data processing, a competitive market for data services has emerged.

Six stages of data processing

1. Data collection
Collecting data is the first step in data processing. Data is pulled from available sources,
including data lakes and data warehouses. It is important that the data sources available are
trustworthy and well-built so the data collected (and later used as information) is of the highest
possible quality.

2. Data preparation
Once the data is collected, it enters the data preparation stage. Data preparation, often referred to as “pre-processing”, is the stage at which raw data is cleaned up and organized for the following stage of data processing. During preparation, raw data is diligently checked for errors. The purpose of this step is to eliminate bad data (redundant, incomplete, or incorrect data) and begin to create high-quality data for the best business intelligence.
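A minimal sketch of this preparation step using pandas (assumed to be available): redundant and incomplete rows are removed before the data moves on:

import pandas as pd

raw = pd.DataFrame({
    "customer": ["Anna", "Anna", "Ben", None, "Cara"],
    "amount":   [100,    100,    250,  75,    None],
})

# Eliminate "bad data": duplicate and incomplete rows.
prepared = raw.drop_duplicates().dropna()
print(prepared)     # only the complete, unique rows remain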

3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data warehouse like Redshift) and translated into a language that it can understand. Data input is the first stage in which raw data begins to take the form of usable information.
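A minimal sketch of the input step: the prepared data is loaded into a destination system, here an SQLite database standing in for a warehouse (pandas assumed available; Salesforce or Redshift would use their own loaders):

import sqlite3
import pandas as pd

prepared = pd.DataFrame({"customer": ["Anna", "Ben"], "amount": [100, 250]})

conn = sqlite3.connect("warehouse.db")   # stand-in for a real warehouse
prepared.to_sql("transactions", conn, if_exists="replace", index=False)

print(pd.read_sql("SELECT * FROM transactions", conn))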

4. Processing
During this stage, the data input to the computer in the previous stage is actually processed for interpretation. Processing is often done using machine learning algorithms, though the process itself may vary depending on the source of the data being processed (data lakes, social networks, connected devices, etc.) and its intended use (examining advertising patterns, medical diagnosis from connected devices, determining customer needs, etc.).
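As one illustrative (not authoritative) example of such processing, the sketch below segments a toy customer dataset with k-means clustering, assuming scikit-learn is installed:

import numpy as np
from sklearn.cluster import KMeans

# Toy purchase data: (visits per month, average spend).
X = np.array([[1, 20], [2, 25], [1, 22], [9, 200], [10, 210], [8, 190]])

# Group customers into two segments.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)    # e.g. [0 0 0 1 1 1] -- two clear segments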

5. Data output/interpretation

The output/interpretation stage is the stage at which data finally becomes usable to non-data scientists. It is translated, readable, and often in the form of graphs, videos, images, plain text, etc. Members of the company or institution can now begin to self-serve the data for their own data analytics projects.

6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is then stored for
future use. While some information may be put to use immediately, much of it will serve a purpose
later on. Plus, properly stored data is a necessity for compliance with data protection legislation
like GDPR. When data is properly stored, it can be quickly and easily accessed by members of the
organization when needed.

The future of data processing


The future of data processing lies in the cloud. Cloud technology builds on the convenience of current electronic data processing methods and accelerates their speed and effectiveness. Faster, higher-quality data means more data for each organization to utilize and more valuable insights to extract.
