
Lecture Notes

Big Data Analytics (18CS72)

VTU Syllabus:

Module-1:

Introduction to Big Data Analytics:

Big Data, Scalability and Parallel Processing, Designing Data Architecture, Data Sources, Quality, Pre-Processing and Storing, Data Storage and Analysis, Big Data Analytics Applications and Case Studies.

What do you mean by Data?


Data is a collection of raw facts and figures. Data is unprocessed, which is why it is called a collection of raw facts and figures. We collect data from different sources; after collection, data is entered into a machine for processing. Data may be a collection of words, numbers, pictures, sounds, etc.

Examples of Data:

• Student data on admission form - a bundle of admission forms contains name, father’s name, address, photograph, etc.

• Student’s examination data - in the examination system of a college/school, data about the marks obtained in different subjects by all students, the exam schedule, etc. is collected.

• Census report - data of citizens - during a census, data of all citizens is collected, like the number of persons living in a home, literate or illiterate, number of children, caste, religion, etc.

What is information?
Processed data is called information. When raw facts and figures are processed and arranged in some proper order, they become information. Information has proper meaning and is useful in decision-making. In other words, information is data that has been processed in such a way as to be meaningful to the person who receives it.

Ex: The data collected in a survey report is: ‘HYD20M’

If we process the above data, we understand that the code is information about a person as follows: HYD is the city name ‘Hyderabad’, 20 is the age, and M represents ‘MALE’.
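A minimal Python sketch of this processing step. The fixed layout (3-letter city code, 2-digit age, 1-letter gender code) and the city lookup table are assumptions taken from this single example:

# A minimal sketch: turning the raw survey code into information.
# Layout assumed from the example: 3-letter city, 2-digit age, 1-letter gender.
CITY_NAMES = {"HYD": "Hyderabad"}  # assumed, illustrative lookup table

def decode_survey_code(code):
    city, age, gender = code[:3], code[3:5], code[5]
    return {
        "city": CITY_NAMES.get(city, city),
        "age": int(age),
        "gender": "MALE" if gender == "M" else "FEMALE",
    }

print(decode_survey_code("HYD20M"))
# {'city': 'Hyderabad', 'age': 20, 'gender': 'MALE'}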

Flow of Data:

Data —> Information —> Actionable intelligence —> Better decisions —> Enhanced business value

What do you mean by Big Data?
“Big Data” is data whose scale, diversity, and complexity require new
architecture, techniques, algorithms, and analytics to manage it and extract
value and hidden knowledge from it.

Big data analytics examines large and varied types of data to uncover hidden patterns, correlations and other insights.

Define Big Data:

“Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight and decision making.” - Gartner

System of Units to represent data:

When dealing with big data, we deal with quantities expressed in units such as megabytes, gigabytes, terabytes, etc. The system of units used to represent data sizes (decimal, SI units) is:

Kilobyte (KB) = 10^3 bytes
Megabyte (MB) = 10^6 bytes
Gigabyte (GB) = 10^9 bytes
Terabyte (TB) = 10^12 bytes
Petabyte (PB) = 10^15 bytes
Exabyte (EB) = 10^18 bytes
Zettabyte (ZB) = 10^21 bytes
Yottabyte (YB) = 10^24 bytes

Data to Big Data:
 ‘Big Data’ is a term used to describe a collection of data that is huge in size and yet growing exponentially with time.
 Normally we work on data of size MB (Word docs, Excel sheets) or at most GB (movies, code), but data of petabyte size, i.e. 10^15 bytes, is called Big Data.
 It is stated that almost 90% of today's data has been generated in the past 2 to 3 years.
 In short, such data is so large and complex that none of the traditional data management tools are able to store or process it efficiently.
 As per an IDC (International Data Corporation) report, by 2020 about 1.7 MB of new data will be created every second for every person in the world.
 The amount of total data in the world will reach around 44 zettabytes (44 trillion gigabytes) by 2020 and 175 zettabytes by 2025. The total volume of data is seen to double roughly every two years.

An Insight on Data Size:
 Byte: one grain of rice
 KB (10^3): one cup of rice
 MB (10^6): 8 bags of rice
 GB (10^9): 3 semi trucks of rice
 TB (10^12): 2 container ships of rice
 PB (10^15): half of Bangalore
 Exabyte (10^18): 1/4th of India
 Zettabyte (10^21): fills the Pacific Ocean
 Yottabyte (10^24): an Earth-sized bowl
 Brontobyte (10^27): an astronomical size, roughly the distance from the Earth to the Sun, i.e. 150 million kilometres (93 million miles) or ~8 light minutes.

What are the Sources of Big Data?
 The New York Stock Exchange generates about one terabyte of new trade
data per day.

 Social media statistics show that 500+ terabytes of new data get ingested into the databases of social media sites such as Facebook, Google and LinkedIn every day. This data is mainly generated through photo/image and video uploads, message exchanges, comments, etc.

 A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.

 E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts of logs from which users' buying trends can be traced.

 Weather stations: All the weather stations and satellites give very large volumes of data, which are stored and processed to forecast the weather.

 Telecom companies: Telecom giants like Reliance Jio, Airtel and Vodafone study user trends and publish their plans accordingly; for this they store the data of their millions of users.

History of Big Data Innovation/ Evolution of Big Data:

Harnessing Big Data:

 OLTP: Online Transaction Processing (DBMSs)
 OLAP: Online Analytical Processing (Data Warehousing)
 RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)

What are the Characteristics of Big Data?

Big Data was initially defined by the “3Vs”, but now there are “6Vs” of Big Data, which are also termed the characteristics of Big Data, as follows:

1. Volume: The name ‘Big Data’ itself is related to a size which is enormous. If the volume of data is very large, then it is considered ‘Big Data’.

2. Velocity: It refers to the high speed of accumulation of data. This determines how fast the data is generated and processed to meet demand.

3. Variety: It refers to the nature of the data, that is:

 Structured
 Semi-structured
 Unstructured
 Multi-structured

4. Veracity: It refers to inconsistencies and uncertainty in data. Data in bulk can create confusion, whereas a smaller amount of data can convey half or incomplete information.

5. Value: The bulk of data has no value unless we turn it into something useful. Data in itself is of no use or importance; it needs to be converted into something valuable in order to extract information.

6. Variability: It means how often the meaning or shape of the data changes. Example: as if we were eating the same ice-cream daily and the taste just kept changing.

Classify Big Data:
Big data is classified by the sources of data availability and their various formats, as follows:

1. Structured Data:

Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data, e.g. data stored in a relational database management system.

Examples –

An 'Employee' table in a database is an example of Structured Data

• Relational data, Geo-location, credit card numbers, addresses, etc.

Nearly 15–20% of data is in structured or semi-structured form.
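A minimal sketch of structured data in Python, using an assumed 'Employee' table in an in-memory SQLite database (the table name and columns are illustrative):

import sqlite3

# A minimal sketch of structured (fixed-format) data: an assumed
# 'Employee' table in a relational database. Columns are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL)")
conn.execute("INSERT INTO Employee VALUES (1, 'Asha', 'CSE', 55000.0)")

# Every row follows the same fixed schema, so it can be queried directly.
for row in conn.execute("SELECT name, dept FROM Employee"):
    print(row)  # ('Asha', 'CSE')
conn.close()

Because every record follows the same schema, such data is the easiest to store, index and query.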

2. Semi-structured data:

Semi-structured data is information that does not reside in a relational database but that has some organizational properties which make it easier to analyze.

Example –

• XML data.

<note>

<to>You</to>

<from>Me</from>

<heading>Reminder</heading>

<body>Don't forget me this weekend!</body>

</note>

 JSON Data:
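For example, the same note represented as JSON (an illustrative equivalent of the XML above) might be:

{
  "note": {
    "to": "You",
    "from": "Me",
    "heading": "Reminder",
    "body": "Don't forget me this weekend!"
  }
}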

3. Multi-Structured data: Consists of multiple formats of data, viz. structured, semi-structured and/or unstructured data.

Multi-structured data sets can have many formats. They are found in non-transactional systems.

Example: streaming data on customer interactions, data from multiple sensors, data at web or enterprise servers, or data-warehouse data in multiple formats.

4. Unstructured Data:

Data which does not follow a pre-defined standard or any organized format. This kind of data is also not a fit for relational databases. There are many platforms to manage and store unstructured data, such as NoSQL databases.

Examples –

• Word, PDF, text, media logs, email data, output returned by 'Google Search', etc.

Data Quality:
• High quality means data which correctly enables all the required operations: analysis, decisions, planning and knowledge discovery.
• A definition of high quality data, especially for artificial intelligence applications, can be data with five R's as follows: Relevancy, Recent, Range, Robustness and Reliability. Relevancy is of utmost importance.

• A uniform definition of data quality is difficult.

• A reference can be made to a set of values of quantitative or qualitative conditions, which must be specified to say whether data quality is high or low.

Data Integrity:

Data integrity refers to the maintenance of consistency and accuracy in data over its usable life.

Software which stores, processes, or retrieves the data should maintain the integrity of the data. Data should be incorruptible.

Noise:

Noise in data refers to data giving additional meaningless information besides true
(actual/required) information.

Outlier:
An outlier in data refers to data which appears not to belong to the dataset.

• For example, data that is outside an expected range.

• Actual outliers need to be removed from the dataset, or else the result will be affected by a small or large amount.
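A minimal Python sketch of range-based outlier removal (the data and the valid range are assumed, illustrative choices; real pipelines derive thresholds from domain knowledge or statistics such as the inter-quartile range):

# A minimal sketch: removing out-of-range values from a dataset.
# The expected range used here is an assumed, illustrative rule.
readings = [21.5, 22.0, 21.8, 95.0, 22.3, -40.0, 21.9]  # sensor temps, deg C

LOW, HIGH = 0.0, 50.0  # assumed valid range for this sensor

cleaned = [x for x in readings if LOW <= x <= HIGH]
outliers = [x for x in readings if not (LOW <= x <= HIGH)]

print(cleaned)   # [21.5, 22.0, 21.8, 22.3, 21.9]
print(outliers)  # [95.0, -40.0]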

Data Wrangling:

Data wrangling refers to the process of transforming and mapping the data from
one format to another format, which makes it valuable for analytics and data
visualizations.
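A minimal wrangling sketch in Python, mapping records from CSV into JSON (the field names and mapping rules are illustrative assumptions):

import csv, io, json

# A minimal sketch of data wrangling: transforming and mapping records
# from one format (CSV) to another (JSON). Fields are illustrative.
raw = "city,age,gender\nHYD,20,M\nBLR,34,F\n"

records = []
for row in csv.DictReader(io.StringIO(raw)):
    records.append({
        "city": row["city"],
        "age": int(row["age"]),            # map text to a number
        "is_male": row["gender"] == "M",   # map a code to a boolean
    })

print(json.dumps(records, indent=2))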

Big Data architecture:
Big data architecture refers to the logical and physical structure that dictates
how high volumes of data are ingested, processed, stored, managed, and
accessed.

 Data sources: Data sources, both open and third-party, play a significant role in the architecture.

Ex: Relational databases, data warehouses, cloud-based data warehouses, SaaS applications, real-time data from company servers, and sensors such as IoT devices.

 Data Storage: Data is stored in file stores that are distributed in nature and that can hold big files in a variety of formats.

Ex.: HDFS, Microsoft Azure, AWS and GCP (Google Cloud Platform) storage, among other blob containers (Blob stands for Binary Large Object, which includes objects such as images and multimedia files).

 Batch Processing: Each chunk of data is split into different categories using long-running jobs, which filter and aggregate the data and also prepare it for analysis.
 Real-Time Message Ingestion: All real-time streaming systems that cater to the data being generated at the time it is received. Message-based ingestion stores include Apache Kafka, Apache Flume, Azure Event Hubs, and others.

 Stream processing: Stream processing handles all of that streaming data in the form of windows or streams and writes it to the sink; a minimal sketch of the windowing idea appears after this list. Engines include Apache Spark, Flink, Storm, etc.
 Analytics-Based Datastore: To analyze and process already-processed data, analytical tools use a data store based on HBase or any other NoSQL data-warehouse technology.
 Reporting and Analysis: The generated insights must be presented effectively, and that is accomplished by reporting and analysis tools that utilize embedded technology to produce useful graphs, analyses and insights beneficial to the business. For example, Cognos, Hyperion, and others.

There are four types of analytics on big data:

 Diagnostic: Explains why a problem is happening.
 Descriptive: Describes the current state of a business through historical data.
 Predictive: Projects future results based on historical data.
 Prescriptive: Takes predictive analytics a step further by projecting the best future efforts.

 Orchestration: Data-based solutions that are repetitive in nature and are contained in workflow chains, which can transform the source data and also move data across sources as well as sinks, and load it into stores. Ex.: Sqoop, Oozie, Data Factory, and others.
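A minimal, framework-free sketch of the windowing idea behind stream processing, in pure Python (real engines such as Spark, Flink or Storm provide windowing as built-in, distributed operators; the window size and sink are illustrative):

from collections import deque

# A minimal sketch of windowed stream processing: keep a sliding
# window over the incoming events and write an aggregate to a "sink".
WINDOW_SIZE = 3
window = deque(maxlen=WINDOW_SIZE)

def sink(value):
    print("windowed average:", round(value, 2))

stream = [4, 8, 15, 16, 23, 42]  # stand-in for an unbounded event stream
for event in stream:
    window.append(event)
    if len(window) == WINDOW_SIZE:
        sink(sum(window) / WINDOW_SIZE)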

Data pre-processing:

• Data pre-processing is an important step at the ingestion layer.

• Pre-processing is a must before data mining and analytics.

• Pre-processing is also a must before running a Machine Learning (ML) algorithm, since analytics needs prior screening of data quality.

• Pre-processing is also necessary when data is being exported to a cloud service or data store.

Data Pre-Processing includes (a minimal sketch follows the list):

(i) Dropping out-of-range, inconsistent and outlier values

(ii) Filtering unreliable, irrelevant and redundant information

(iii) Data cleaning, editing, reduction and/or wrangling

(iv) Data validation and transformation

(v) ELT processing.
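A minimal pre-processing sketch, assuming pandas is available (the column names, valid ranges and cleaning rules are illustrative assumptions):

import pandas as pd

# A minimal pre-processing sketch: drop out-of-range values, filter
# redundant rows, clean missing fields. Columns and ranges are assumed.
df = pd.DataFrame({
    "age":  [20, 34, -5, 34, None],
    "city": ["HYD", "BLR", "HYD", "BLR", "DEL"],
})

df = df[df["age"].between(0, 120)]  # (i) drop out-of-range values
df = df.drop_duplicates()           # (ii) filter redundant records
df = df.dropna()                    # (iii) clean missing values
df["age"] = df["age"].astype(int)   # (iv) validate / transform types

print(df)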

Data Formats used during Pre-Processing:

(i) Comma-separated values (CSV)

(ii) JavaScript Object Notation (JSON)

(iii) Tag-Length-Value (TLV)

(iv) Key-value pairs

(v) Hash-key-value pairs.

Big Data architecture is further classified into two advanced deployment models as follows:

 Lambda Architecture:

A single Lambda architecture handles both batch (static) data and real-time streaming data. It is employed to solve the problem of computing arbitrary functions. In this deployment model, latency is reduced while accuracy is retained, with only negligible error.

 Kappa Architecture:

Like the Lambda architecture, the Kappa architecture is intended to handle both real-time streaming and batch data. In addition to avoiding the extra cost that comes with the Lambda architecture, the Kappa architecture replaces the data-sourcing medium with message queues.

Five layers in Big Data Architecture Design:

Big Data architecture is the logical or physical layout/structure of how Big Data will be stored, accessed and managed within an IT environment.

Data processing architecture consists of five layers:

(i) Identification of data sources,

(ii) Acquisition, Ingestion, Extraction, Pre-processing, Transformation of data,

(iii) Data storage in files, servers, clusters or the cloud

(iv) Data processing

(v) Data consumption by a number of programs and tools such as business intelligence, data mining, discovering patterns/clusters, artificial intelligence (AI), machine learning (ML), text analytics, descriptive and predictive analytics, and data visualization.

Logical layer 1 (L1): It is for identifying data sources, which are external, internal or both.

Layer 2 (L2): It is for data ingestion. Data ingestion means a process of absorbing information, just like the process of absorbing nutrients and medications into the body by eating or drinking them.

Ingestion is the process of obtaining and importing data for immediate use or transfer.

Layer 3 (L3): It is for storage of data from the L2 layer.

Layer 4 (L4): It is for data processing, using software such as MapReduce, Hive, Pig or Spark.
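As a minimal illustration of the L4 processing style, here is a word-count sketch in the spirit of MapReduce, in pure Python (real frameworks such as Hadoop MapReduce or Spark distribute the map and reduce phases across cluster nodes; the input lines are illustrative):

from collections import defaultdict

# A minimal, single-machine sketch of the MapReduce idea used at L4.

def map_phase(line):
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

lines = ["big data needs big storage", "big data needs processing"]
pairs = [kv for line in lines for kv in map_phase(line)]
print(reduce_phase(pairs))
# {'big': 3, 'data': 2, 'needs': 2, 'storage': 1, 'processing': 1}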

The top layer (L5): It is for data consumption. Data is used in analytics, visualizations, reporting, and export to cloud or web servers.

Scalability in Big Data Architecture Design:
Scalability enables increase or decrease in the capacity of data storage,
processing and analytics.

Scalability is the capability of a system to handle the workload as per the magnitude of the work. System capability needs to increase with increased workloads.

Two types of scalability:

 Vertical Scalability
 Horizontal Scalability

Vertical Scalability: It means scaling up the given system's resources and increasing the system's analytics, reporting and visualization capabilities.

Scaling up also means designing the algorithm according to the architecture that uses resources efficiently.

For example, if x terabytes of data take time t for processing and code complexity increases by a factor n, then scaling up means that processing takes a time equal to, less than, or much less than (n × t).

Horizontal scalability: It means scaling out to use multiple processors as a single entity so a business can scale beyond the compute capacity of a single server.

It means increasing the number of systems working in coherence, and scaling out or distributing the workload.

Processing different subsets of a large dataset deploys horizontal scalability.

Scaling out is basically using more resources and distributing the processing and storage tasks in parallel.

For example, if r resources in a system process x terabytes of data in time t, then (p × x) terabytes are processed on p parallel distributed nodes such that the time taken remains t, or is slightly more than t due to the additional time required for inter-processing-node communication (IPC).
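A minimal sketch of this scaling model in Python (the IPC overhead function is an assumed, illustrative stand-in for real communication cost):

# A minimal sketch of the horizontal-scaling model above: p-times the
# data on p nodes takes roughly t, plus a small IPC overhead.

def scale_out_time(t_single, p, ipc_overhead_per_node=0.02):
    # Assumed overhead model: cost grows linearly with the node count.
    return t_single * (1 + ipc_overhead_per_node * p)

t = 10.0  # hours to process x TB on one node (assumed)
for p in (1, 4, 16):
    print(p, "nodes:", round(scale_out_time(t, p), 2), "hours")
# 1 nodes: 10.2 hours / 4 nodes: 10.8 hours / 16 nodes: 13.2 hours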

Parallel Processing in Big Data Architecture:
Big Data needs processing of large data volumes and therefore needs intensive computation. Complex applications processing large datasets (terabyte to petabyte datasets) need hundreds of computing nodes.

Processing this much distributed data within a short time and at minimum cost is problematic.

Alternative ways for scaling up and out:

• Massively Parallel Processing Platforms

• Cloud Computing

• Grid and Cluster Computing

• Volunteer Computing

Massively Parallel Processing Platforms:

It is impractical or impossible to execute many complex programs on a single computer system, especially with limited computer memory. Here, it is required to scale up the computer system or use massively parallel processing (MPP) platforms.

Parallelization of tasks can be done at several levels:

 distributing separate tasks onto separate threads on the same CPU,
 distributing separate tasks onto separate CPUs on the same computer,
 distributing separate tasks onto separate computers.

The system executes multiple program instructions or sub-tasks at any moment in time. The total time taken will be much less than with a single compute resource.
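A minimal sketch of the second level (separate CPUs on the same computer), using Python's standard multiprocessing module (the sub-task itself is an illustrative placeholder):

from multiprocessing import Pool

# A minimal sketch of task parallelism: distribute independent
# sub-tasks onto separate CPU cores of the same machine.

def subtask(chunk):
    return sum(x * x for x in chunk)  # illustrative unit of work

if __name__ == "__main__":
    chunks = [range(0, 1000), range(1000, 2000), range(2000, 3000)]
    with Pool(processes=3) as pool:
        partials = pool.map(subtask, chunks)  # sub-tasks run in parallel
    print(sum(partials))  # combine the partial results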

The distributed computing model is the best example of MPP processing.

Cloud Computing:
Cloud computing is a type of Internet-based computing that provides shared processing resources and data to computers and other devices on demand.

• One of the best approaches for data processing is to perform parallel and distributed computing in a cloud-computing environment.

• It offers high data security compared to other distributed technologies.

Ex.: Amazon Web Services (AWS), Digital Ocean, Elastic Compute Cloud (EC2), Microsoft Azure, Apache CloudStack, Amazon Simple Storage Service (S3).

Cloud computing features are:

(i) On-demand service

(ii) Resource pooling

(iii) Scalability

(iv) Accountability

(v) Broad network access.

Cloud computing allows availability of computer infrastructure and services on an “on-demand” basis.

The computing infrastructure includes data storage devices, development platforms, databases, computing power and software applications.

Cloud services can be classified into three fundamental types:

1. Infrastructure as a Service (IaaS)

2. Platform as a Service (PaaS)

3. Software as a Service (SaaS)

Grid Computing :
Grid Computing can be defined as a network of computers working
together to perform a task that would rather be difficult for a single machine.
All machines on that network work under the same protocol to act as a virtual
supercomputer.

Computers on the network contribute resources like processing power and storage
capacity to the network.
Grid Computing is a subset of distributed computing, where a virtual
supercomputer comprises machines on a network connected by some bus,
mostly Ethernet or sometimes the Internet.
It can also be seen as a form of parallel computing where, instead of many CPU cores on a single machine, the cores are spread across various locations.

Working:
A grid computing network mainly consists of three types of machines:

1. Control Node: A computer, usually a server or a group of servers, which administers the whole network and keeps account of the resources in the network pool.

2. Provider: A computer that contributes its resources to the network resource pool.

3. User: A computer that uses the resources on the network.

 When a computer makes a request for resources to the control node, the control node gives the user access to the resources available on the network.

 When it is not in use, it should ideally contribute its resources to the network.

 Hence a normal computer on the node can swing between being a user or a provider based on its needs.

 The nodes may consist of machines with similar platforms running the same OS, called homogeneous networks, or machines with different platforms running various different OSs, called heterogeneous networks.

 This is what distinguishes grid computing from other distributed computing architectures.

 For controlling the network and its resources, a software/networking protocol is used, generally known as middleware. This is responsible for administering the network; the control nodes are merely its executors.

 As a grid computing system should use only the unused resources of a computer, it is the job of the control node to ensure that no provider is overloaded with tasks.

Currently, grid computing is being used in various institutions to solve a lot
of mathematical, analytical, and physics problems.

Advantages of Grid Computing:

1. It is not centralized, as no servers are required except the control node, which is used only for controlling and not for processing.
2. Multiple heterogeneous machines, i.e. machines with different operating systems, can use a single grid computing network.
3. Tasks can be performed in parallel across various physical locations and the users don't have to pay for them (with money).

Disadvantages of Grid Computing:

1. The software of the grid is still in the evolution stage.
2. A super-fast interconnect between computing resources is the need of the hour.
3. Licensing across many servers may make it prohibitive for some applications.
4. Many groups are reluctant to share resources.
5. Trouble in the control node can bring the whole network to a halt.

Cluster Computing:

A cluster is a group of computers connected by a network. The group works together to accomplish the same task.

Clusters are used mainly for load balancing. They shift processes between nodes to keep an even load on the group of connected computers.

Hadoop architecture uses similar methods.

Volunteer computing:
Volunteer computing is a distributed computing paradigm which uses the computing resources of volunteers. Volunteers are organizations or members who own personal computers.

Some issues with volunteer computing systems are:

1. Heterogeneity of the volunteered computers

2. Drop-outs from the network over time

3. Their irregular availability

4. Incorrect results at volunteers are unaccountable, as they come from essentially anonymous volunteers.

Big Data Analytics:

Data analytics can be formally defined as the statistical and mathematical data analysis that enables clustering data into relevant groups, segmenting data into distributed partitions, ranking data based on relevancy, and predicting future possibilities using data.

• Analytics uses historical data and predicts new values or results.

• Analytics suggests/recommends techniques for improving the enterprise business.

Data analysis helps in finding business intelligence and aids decision making.

Data analytics has to go through the following phases before deriving new facts, providing business intelligence and generating new knowledge.

1. Descriptive analytics: It is the examination of data or content, often performed manually, to describe what has happened.

2. Predictive analytics: The branch of advanced analytics that makes predictions about future outcomes from historical data, using data mining techniques and machine learning.

3. Prescriptive analytics: It deals with the use of technology to help businesses make better decisions through the analysis of raw data, in order to maximize profits.

KIT/CSE/BJ Page 25
4. Cognitive analytics [Opinion Mining/Sentiment Analysis]: It is the use of computerized models to simulate the human thought process in complex situations where the answers may be ambiguous and uncertain.

Traditional and Big Data analytics architecture reference model

Applications of Big Data:
• Travel and tourism: Big data helps in predicting requirements such as those for travel facilities. Through this, businesses have noticed significant improvement.

• Finance and banking: This sector extensively uses big data to understand
customer behaviour through patterns and other trends.

• Healthcare: There has been a revolution in the healthcare sector, thanks to big data. Through predictive analytics, healthcare personnel are able to provide personalized services to patients, thereby improving outcomes.

• Telecommunication and multimedia: Given how much data is generated in this sector daily, big data technologies are required to handle such huge data.

• Tracking customer spending habits and shopping behaviour: In big retail stores (like Amazon, Walmart, Big Bazaar, etc.) the management team has to keep data on customers' spending habits (on which products customers spend, on which brands they wish to spend, how frequently they spend), shopping behaviour, and customers' most-liked products (so that they can keep those products in the store). Based on data about which products are being searched/sold most, the production/collection rate of those products gets fixed.

• Recommendation: By tracking customer spending habits and shopping behaviour, big retail stores provide recommendations to the customer. E-commerce sites like Amazon, Walmart and Flipkart do product recommendation.

• Smart traffic system: Data about the traffic conditions on different roads is collected through cameras kept beside the roads at the entry and exit points of the city, and through GPS devices placed in vehicles (Ola, Uber cabs, etc.). All such data are analyzed, and jam-free or less congested, less time-taking routes are recommended.

• Secure air traffic system: Sensors are present at various places in a flight (like the propellers, etc.). These sensors capture data like the speed of the flight, moisture, temperature and other environmental conditions. Based on analysis of such data, environmental parameters within the flight are set up and varied.

• Auto-driving car: Big data analysis helps drive a car without human intervention. At various spots on the car, cameras and sensors are placed that gather data like the size of surrounding cars, obstacles, the distance from those, etc.

• IoT: Manufacturing companies install IoT sensors in machines to collect operational data. By analyzing such data, it can be predicted how long a machine will work without any problem and when it will require repair, so that the company can take action before the machine develops a lot of issues or goes totally down. Thus, the cost of replacing the whole machine can be saved.

• Education sector: Organizations conducting online educational courses utilize big data to search for candidates interested in those courses. If someone searches YouTube for a tutorial video on a subject, then online or offline course-providing organizations on that subject send that person online ads about their courses.

• Media and entertainment sector: Media and entertainment service providers like Netflix, Amazon Prime and Spotify analyze data collected from their users, such as what types of videos and music users are watching and listening to.

Previous Year VTU Questions with Answer
(Module-1)

1. Define Big Data. Explain the Evolution of Big Data and their
characteristics (10 Marks)
2. What is grid computing? List and explain the features, drawbacks of
grid computing (10 Marks)
3. Discuss the functions of each of the five layers in Big Data
architecture design (10 Marks)
4. Illustrate the various phases involved in Big Data Analytics with a neat diagram. (10 Marks)

Sample Questions
Big Data Analytics (18CS72)

Module-1

1. What do you mean by Big Data?
2. Define Big Data according to Gartner.
3. Compare data and information.
4. “Any size of data can be called big data” - explain.
5. Compare PB with YB.
6. What are the sources of Big Data?
7. Explain the various characteristics of Big Data.
8. Compare structured, unstructured and semi-structured data with examples.
9. Compare OLTP, OLAP and RTAP.
10. Explain the evolution of Big Data.
11. Describe the five-layer Big Data architecture.
12. Compare Lambda and Kappa architectures.
13. What do you mean by scalability? Classify scalability.
14. What do you mean by data pre-processing?
15. What is an outlier?
16. What do you mean by data wrangling?
17. Explain “data quality”.
18. What is the purpose of grid computing? Write the working principle.
19. Define big data analytics.
20. Mention the importance of big data analytics.
21. Explain the different types of big data analytics.
22. Mention some big data analytics tools.
23. Describe various applications of big data.

