
Unit 1 Introduction to BIG DATA ANALYSIS

EVOLUTION OF TECHNOLOGY
• Telephone → Mobile/Android
• Bulky desktop → FD/HDD/Cloud
• Car → Smart car

WHY?
• Phone data
• Self-driving car data
• Smart AC
• Social media
• Amazon, Flipkart

Big Data is the term for a collection of data sets so large and complex that it becomes difficult to
process them using on-hand database system tools or traditional data processing applications.

How do we classify data as Big Data?

Using the 5 V's:

Volume

Variety

Velocity

Value

Veracity

Big Data As an Opportunity

What is Data?

The quantities, characters, or symbols on which operations are performed by a computer, which
may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical,
or mechanical recording media.

What is Big Data?


Big Data is also data, but of a huge size. Big Data is a term used to describe a collection of
data that is huge in volume and yet growing exponentially with time. In short, such data is so
large and complex that none of the traditional data management tools are able to store or
process it efficiently.

Following are some examples of Big Data:

• The New York Stock Exchange generates about one terabyte of new trade data per day.

• Social media: Statistics show that 500+ terabytes of new data are ingested into the
databases of the social media site Facebook every day. This data is mainly generated
through photo and video uploads, message exchanges, comments, etc.

• A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With
many thousand flights per day, data generation reaches many petabytes.

Types Of Big Data

Big Data can be found in three forms:

1. Structured
2. Unstructured
3. Semi-structured

Structured

Any data that can be stored, accessed and processed in a fixed format is termed 'structured'
data. Over time, computer science has achieved great success in developing techniques for
working with such data (where the format is well known in advance) and deriving value out of
it. However, nowadays we are foreseeing issues when the size of such data grows to a huge
extent, with typical sizes being in the range of multiple zettabytes.

Do you know? 10^21 bytes equal one zettabyte; in other words, one billion terabytes form a zettabyte.

Employee_ID | Employee_Name   | Gender | Department | Salary_In_lacs
------------|-----------------|--------|------------|---------------
2365        | Rajesh Kulkarni | Male   | Finance    | 650000
3398        | Pratibha Joshi  | Female | Admin      | 650000
7465        | Shushil Roy     | Male   | Admin      | 500000
7500        | Shubhojit Das   | Male   | Finance    | 500000
7699        | Priya Sane      | Female | Finance    | 550000

Looking at these figures, one can easily understand why the name Big Data is given and imagine
the challenges involved in its storage and processing.

Do you know? Data stored in a relational database management system is one example of
'structured' data.

Examples Of Structured Data

An 'Employee' table in a database is an example of Structured Data
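
To make the idea of a fixed, well-known format concrete, here is a minimal sketch that stores and queries such structured data. It uses Python's built-in sqlite3 module purely for illustration; the table and values simply mirror the Employee example above.

import sqlite3

# Store the Employee table in SQLite, a small relational database bundled with Python.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Employee (
    Employee_ID INTEGER, Employee_Name TEXT, Gender TEXT,
    Department TEXT, Salary_In_lacs INTEGER)""")

rows = [
    (2365, "Rajesh Kulkarni", "Male", "Finance", 650000),
    (3398, "Pratibha Joshi", "Female", "Admin", 650000),
    (7465, "Shushil Roy", "Male", "Admin", 500000),
    (7500, "Shubhojit Das", "Male", "Finance", 500000),
    (7699, "Priya Sane", "Female", "Finance", 550000),
]
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?, ?)", rows)

# Because the schema is fixed and known in advance, querying is straightforward.
for row in conn.execute(
        "SELECT Employee_Name, Salary_In_lacs FROM Employee WHERE Department = 'Finance'"):
    print(row)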

Unstructured

Any data with unknown form or structure is classified as unstructured data. In addition to its
huge size, unstructured data poses multiple challenges in terms of processing it to derive value.
A typical example of unstructured data is a heterogeneous data source containing a combination
of simple text files, images, videos, etc. Nowadays organizations have a wealth of data available
with them but, unfortunately, they don't know how to derive value out of it since this data is in
its raw, unstructured form.

Examples Of Unstructured Data

The output returned by 'Google Search'

Semi-structured

Semi-structured data can contain both forms of data. We can see semi-structured data as
structured in form, but it is actually not defined with, for example, a table definition as in a
relational DBMS. An example of semi-structured data is data represented in an XML file.

Examples Of Semi-structured Data

Personal data stored in an XML file-


<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
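
As a minimal sketch of how such semi-structured records can be processed (assuming the records above are wrapped in a single root element so the file is well-formed XML), Python's built-in xml.etree.ElementTree module can extract the fields even though no table schema was defined up front.

import xml.etree.ElementTree as ET

# Hypothetical file contents: a few of the records above wrapped in a <people> root element.
xml_data = """<people>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</people>"""

root = ET.fromstring(xml_data)

# The "schema" (which tags to read) is applied here, at read time,
# rather than being enforced by the storage format itself.
for rec in root.findall("rec"):
    name = rec.findtext("name")
    age = int(rec.findtext("age"))
    print(name, age)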

Data Growth over the years

Please note that web application data, which is unstructured, consists of log files, transaction
history files, etc. OLTP systems are built to work with structured data, wherein data is stored in
relations (tables).

Characteristics Of Big Data

(i) Volume – The name Big Data itself relates to a size which is enormous. The size of data plays a
very crucial role in determining its value. Also, whether particular data can actually be
considered Big Data or not depends upon the volume of data. Hence, 'Volume' is one
characteristic which needs to be considered while dealing with Big Data.

(ii) Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
In earlier days, spreadsheets and databases were the only sources of data considered by
most applications. Nowadays, data in the form of emails, photos, videos, monitoring
devices, PDFs, audio, etc. is also being considered in analysis applications. This variety of
unstructured data poses certain issues for storing, mining and analyzing data.

(iii) Velocity – The term 'velocity' refers to the speed of data generation. How fast the data is
generated and processed to meet demands determines the real potential of the data.

Big Data velocity deals with the speed at which data flows in from sources like business
processes, application logs, networks and social media sites, sensors, mobile devices, etc. The
flow of data is massive and continuous.

(iv) Variability – This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of handling and managing the data effectively.

Benefits of Big Data Processing

Ability to process Big Data brings in multiple benefits, such as:

• Businesses can utilize outside intelligence while taking decisions.
Access to social data from search engines and sites like Facebook and Twitter is enabling
organizations to fine-tune their business strategies.

• Improved customer service.
Traditional customer feedback systems are getting replaced by new systems designed
with Big Data technologies. In these new systems, Big Data and natural language
processing technologies are used to read and evaluate consumer responses.

• Early identification of risk to the product/services, if any.

• Better operational efficiency.
Big Data technologies can be used for creating a staging area or landing zone for new data before
identifying what data should be moved to the data warehouse. In addition, such integration of
Big Data technologies and a data warehouse helps an organization offload infrequently accessed
data.

What are we doing with this Big Data?


We can use this big data to process and draw some meaningful insights out of it. There are
various frameworks available to process big data. The list below gives popular frameworks
that are widely used by big data developers and analysts.

Apache Hadoop: we can write a MapReduce program to process the data.

Spark: we can write a Spark program to process the data; using Spark we can also process live
streams of data.

Apache Flink: this framework is also used to process streams of data.

And many more, like Storm and Samza.
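
As a minimal sketch of what such a program looks like (assuming a local installation of the pyspark package; the input file name is illustrative only), the classic word count below maps each word to a count of 1 and then reduces by key, which is exactly the MapReduce pattern mentioned above.

from pyspark.sql import SparkSession

# Build a local Spark session (assumes the pyspark package is installed).
spark = SparkSession.builder.master("local[*]").appName("WordCount").getOrCreate()

# Read a hypothetical text file; the path is illustrative only.
lines = spark.sparkContext.textFile("input.txt")

# Map: split lines into words and emit (word, 1) pairs.
# Reduce: sum the counts for each word.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):
    print(word, count)

spark.stop()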

Big Data Analytics


Big Data analytics is the process of collecting, organizing and analyzing a large amount of data
to uncover hidden patterns, correlations and other meaningful insights. It helps an organization
understand the information contained in its data and use it to find new opportunities to
improve the business, which in turn leads to more efficient operations, higher profits and
happier customers.

To analyze such a large volume of data, Big Data analytics applications enable big data analysts,
data scientists, predictive modelers, statisticians and other analytics professionals to analyze the
growing volume of structured and unstructured data. The analysis is performed using specialized
software tools and applications. Using these tools, various data operations can be performed, like
data mining, text mining, predictive analysis, forecasting, etc.; these processes are performed
separately and are part of high-performance analytics. Using Big Data analytics tools and
software enables an organization to process a large amount of data and obtain meaningful
insights that lead to better business decisions in the future.

Key Technologies behind Big Data Analytics


Analytics comprises various technologies that help you get the most valuable information from the
data.

Hadoop
Hadoop is an open-source framework that is widely used to store large amounts of data and run
various applications on clusters of commodity hardware. It has become a key technology for
big data because of the constant increase in the variety and volume of data, and its distributed
computing model provides faster access to data.

Data Mining
Once the data is stored in a data management system, you can use data mining techniques to
discover patterns that are used for further analysis and to answer complex business questions.
With data mining, repetitive and noisy data can be removed, leaving only the relevant
information that is used to accelerate the pace of making informed decisions.
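
For illustration, here is a minimal sketch of one common data mining technique, clustering. It assumes the scikit-learn library is available, and the tiny customer dataset is invented purely for the example.

from sklearn.cluster import KMeans
import numpy as np

# Invented toy data: [orders per month, average order value] per customer.
customers = np.array([
    [2, 300], [3, 250], [2, 280],       # occasional, low-value buyers
    [15, 1200], [18, 1100], [16, 1300]  # frequent, high-value buyers
])

# Group customers into 2 clusters; the discovered segments can then be
# analyzed further (e.g. targeted offers per segment).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)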

Text Mining
With text mining, we can analyze text data from the web, such as comments and likes from social
media, and other text-based sources such as email, for example to identify whether a mail is spam.
Text mining uses technologies like machine learning and natural language processing to analyze
large amounts of data and discover various patterns.
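
A minimal sketch of the spam example (again assuming scikit-learn; the tiny labelled dataset is invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training data: a few mails labelled spam (1) or not spam (0).
mails = [
    "win a free prize now", "limited offer click here",
    "meeting agenda for monday", "please review the attached report",
]
labels = [1, 1, 0, 0]

# Turn text into word-count features, then train a Naive Bayes classifier.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(mails)
model = MultinomialNB().fit(features, labels)

# Classify a new, unseen mail.
new_mail = vectorizer.transform(["free prize offer, click here now"])
print(model.predict(new_mail))  # 1 means predicted spam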

Predictive Analytics
Predictive analytics uses data, statistical algorithms and machine learning techniques to identify
the likelihood of future outcomes based on historical data. It is all about providing the best
assessment of what will happen in the future, so that organizations can feel confident in their
current business decisions.
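
A minimal sketch of predictive analytics using a simple linear regression model (scikit-learn again; the monthly sales figures are invented):

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented historical data: month number vs. units sold.
months = np.array([[1], [2], [3], [4], [5], [6]])
units_sold = np.array([120, 135, 150, 160, 178, 190])

# Fit a simple trend model on the historical data...
model = LinearRegression().fit(months, units_sold)

# ...and use it to forecast demand for the next two months.
print(model.predict(np.array([[7], [8]])))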

Benefits of Big Data Analytics


Big Data Analytics has become popular among various organizations. Industries such as e-commerce,
social media, healthcare, banking and entertainment widely use analytics to understand various
patterns, collect and utilize customer insights, detect fraud, monitor financial market activity, etc.

Let's take the e-commerce industry as an example:

E-commerce companies like Amazon, Flipkart, Myntra and many other online shopping sites make
use of big data.

They collect customer data in several ways, for example:

• Information about the items searched by the customer.
• Information regarding the customer's preferences.
• Information about the popularity of products, and much other data.

Using these kinds of data, organizations derive patterns and provide better customer
service, for example:

• Displaying the popular products that are being sold.
• Showing products related to the products that a customer bought (a simple sketch of this idea
follows the list).
• Providing secure money transactions and identifying any fraudulent transactions being made.
• Forecasting the demand for products, and much more.
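
As a minimal sketch of the "related products" idea, the snippet below does a simple co-purchase count in plain Python. The orders are invented, and real recommender systems are far more elaborate; this only illustrates the underlying pattern.

from collections import Counter
from itertools import combinations

# Invented orders: each order is the set of products bought together.
orders = [
    {"phone", "phone case", "charger"},
    {"phone", "charger"},
    {"laptop", "mouse"},
    {"phone", "phone case"},
]

# Count how often each pair of products appears in the same order.
pair_counts = Counter()
for order in orders:
    for pair in combinations(sorted(order), 2):
        pair_counts[pair] += 1

# Products most often bought together with "phone".
related = [(pair, n) for pair, n in pair_counts.items() if "phone" in pair]
print(sorted(related, key=lambda item: item[1], reverse=True))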

Conclusion
Big Data is a game-changer. Many organizations are using more analytics to drive strategic
actions and offer a better customer experience. A slight improvement in efficiency or the smallest
savings can lead to a huge profit, which is why most organizations are moving towards big data.

Traditional vs. Big Data business approach

• Data architecture
Traditional data uses a centralized database architecture in which large and complex problems
are solved by a single computer system. Centralized architecture is costly and ineffective for
processing large amounts of data. Big data is based on a distributed database architecture in
which a large block of data is divided into several smaller blocks, and the solution to a problem
is then computed by several different computers present in a given computer network. The
computers communicate with each other in order to find the solution to the problem. The
distributed database provides better computing and lower cost, and also improves performance
compared to the centralized database system. This is because centralized architecture is based
on mainframes, which are not as economical as the microprocessors in a distributed database
system. The distributed database also has more computational power than the centralized
database system used to manage traditional data.

• Types of data
Traditional database systems are based on structured data, i.e. traditional data is stored in
fixed formats or fields in a file. Examples of structured data include relational database
systems (RDBMS) and spreadsheets, which only answer questions about what happened.
A traditional database only provides insight into a problem at a small level. However, in order
to enhance an organization's ability to gain more insight into the data, and also to learn about
the metadata, unstructured data is used. Big data uses semi-structured and unstructured data
and improves the variety of data gathered from different sources like customers, audiences or
subscribers. After collection, big data is transformed into knowledge-based information.

• Volume of data
Traditional database systems can store only small amounts of data, ranging from gigabytes
to terabytes. Big data, however, helps to store and process large amounts of data, consisting
of hundreds of terabytes or petabytes of data and beyond. Storing this massive amount of
data reduces the overall cost of storing data and helps in providing business intelligence.

• Data schema
Big data uses a dynamic schema for data storage. Both unstructured and structured
information can be stored, and any schema can be used, since the schema is applied only after
a query is generated. Big data is stored in raw form, and the schema is applied only when the
data is to be read. This process is beneficial in preserving the information present in the data.
The traditional database is based on a fixed schema which is static in nature: the schema is
applied during write operations, and the stored structure cannot be changed once the data is
saved.
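
A minimal sketch of this schema-on-read idea in plain Python (the raw event records are invented; the point is that each query decides which fields to interpret, and how, only at read time):

import json

# Raw, schema-less records as they might land in a data lake.
raw_records = [
    '{"user": "a01", "event": "click", "ts": "2024-01-05T10:00:00"}',
    '{"user": "a02", "event": "purchase", "amount": 499, "ts": "2024-01-05T10:02:00"}',
    '{"user": "a01", "event": "purchase", "amount": 129, "ts": "2024-01-05T10:05:00"}',
]

# Schema-on-read: only when we query do we decide which fields matter.
# Here the "schema" is simply: purchase events with a numeric amount.
total_revenue = 0
for line in raw_records:
    record = json.loads(line)
    if record.get("event") == "purchase":
        total_revenue += record.get("amount", 0)

print(total_revenue)  # 628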

• Data relationship
In a traditional database system, relationships between data items can be explored easily,
as the amount of information stored is small. However, big data contains massive or
voluminous data, which increases the difficulty of figuring out the relationships between
data items.

• Scaling
Scaling refers to the demand for the resources and servers required to carry out a computation.
Big data is based on a scale-out architecture, under which distributed approaches to computing
are employed across more than one server, so the computation load is shared rather than
handled by a single application-based system. However, achieving scalability in a traditional
database is very difficult, because the traditional database runs on a single server and requires
expensive servers to scale up.

• Higher cost of traditional data

A traditional database system requires complex and expensive hardware and software in order
to manage a large amount of data. Also, moving the data from one system to another requires
additional hardware and software resources, which increases the cost significantly. In the case
of big data, the massive amount of data is segregated between various systems, so the amount
of data per system decreases. The use of big data is therefore quite simple: it makes use of
commodity hardware and open-source software to process the data.


• Accuracy and confidentiality

Under a traditional database system it is very expensive to store massive amounts of data, so
not all the data can be stored. This decreases the amount of data available for analysis, which
in turn decreases the accuracy and confidence of the results. In big data, by contrast, the cost
of storing voluminous data is lower, so the data is stored in big data systems and points of
correlation are identified, which provides highly accurate results.

Sr. No | Traditional data                                    | Big Data
-------|-----------------------------------------------------|----------------------------------------------------------
1      | Here the data is "structured"                       | Data is "unstructured" or semi-structured
2      | The size of the data is very small                  | The size is more than the traditional data
3      | Here the data is centralized                        | Here data is distributed
4      | It is easy to work with or manipulate               | It is difficult to handle the data
5      | Normal system configuration is sufficient to process | High system configuration is required to process the data
6      | A traditional database system is enough             | Special kinds of functions are used to manipulate the data
7      | Expensive                                           | Not expensive
