VTU Syllabus:
Module-1:
Examples of Data:
KIT/CSE/BJ Page 1
• Census Report – data of citizens: during a census, data about all citizens is collected, such as the number of persons living in a home, whether they are literate or illiterate, number of children, caste, religion, etc.
What is information?
Processed data is called information. When raw facts and figures are
processed and arranged in some proper order, they become information.
Information has proper meaning and is useful in decision-making. In
other words, information is data that has been processed in such a way as to be
meaningful to the person who receives it.
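To make the distinction concrete, here is a minimal sketch (the marks data is hypothetical): the raw figures become information only once they are processed and arranged in a meaningful order.

```python
# Raw facts: just numbers, with no meaning on their own.
raw_marks = [72, 45, 88, 91, 60]

# Processing: arrange in proper order and summarize.
ranked = sorted(raw_marks, reverse=True)   # arranged in order
average = sum(raw_marks) / len(raw_marks)  # a meaningful summary value

print("Top score:", ranked[0])    # 91
print("Class average:", average)  # 71.2
```

The sorted list and the average are information: they answer questions ("who topped?", "how did the class do?") that the raw list alone does not.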
Flow of Data:
What do you mean by Big Data?
“Big Data” is data whose scale, diversity, and complexity require new
architectures, techniques, algorithms, and analytics to manage it and extract
value and hidden knowledge from it.
Big data analytics examines large volumes of varied data to uncover
hidden patterns, correlations and other insights.
When dealing with big data, we measure sizes in units such as megabytes,
gigabytes, terabytes and so on. Here is the system of units used to represent data.
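The conversion between these units can be sketched as follows (a minimal illustration using decimal SI prefixes, where each step up is a factor of 1000):

```python
# Units of data, each 1000x the previous (decimal SI prefixes).
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    """Convert a raw byte count to the largest sensible unit."""
    for unit in UNITS[:-1]:
        if num_bytes < 1000:
            return f"{num_bytes:g} {unit}"
        num_bytes /= 1000
    return f"{num_bytes:g} {UNITS[-1]}"

print(human_readable(1.7e6))   # 1.7 MB (the IDC per-person-per-second figure)
print(human_readable(44e21))   # 44 ZB  (the estimated world data by 2020)
```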
Data to Big Data:
'Big Data' is a term used to describe a collection of data that is huge in size and
yet growing exponentially with time.
Normally we work on data of size MB (Word documents, Excel sheets) or at most
GB (movies, code), but data of petabyte size, i.e. 10^15 bytes, is called
Big Data.
It is stated that almost 90% of today's data has been generated in the past
2 to 3 years.
In short, such data is so large and complex that none of the traditional data
management tools can store or process it efficiently.
As per an IDC (International Data Corporation) report, by 2020 about 1.7 MB of
new data will be created per person in the world every second.
The total amount of data in the world was expected to reach around 44
zettabytes (44 trillion gigabytes) by 2020 and 175 zettabytes by 2025. The
total volume of data is observed to double roughly every two years.
An Insight on Data Size:
Byte: one grain of rice
KB (10^3): one cup of rice
MB (10^6): 8 bags of rice
GB (10^9): 3 semi-trucks of rice
TB (10^12): 2 container ships of rice
PB (10^15): rice covering half of Bangalore
Exabyte (10^18): rice covering one-fourth of India
Zettabyte (10^21): fills the Pacific Ocean
Yottabyte (10^24): an Earth-sized bowl of rice
Brontobyte (10^27): an astronomical size, roughly the distance from the Earth
to the Sun, i.e. 150 million kilometres (93 million miles) or ~8 light
minutes.
What are the Sources of Big Data?
The New York Stock Exchange generates about one terabyte of new trade
data per day.
Social media: statistics show that 500+ terabytes of new data get
ingested into the databases of social media sites such as Facebook, Google
and LinkedIn every day. This data is mainly generated through
photo/image and video uploads, message exchanges, comments, etc.
Weather stations: all weather stations and satellites give very huge volumes of
data, which are stored and processed to forecast the weather.
History of Big Data Innovation/ Evolution of Big Data:
Harnessing Big Data:
What are the Characteristics of Big Data ?
Big Data was originally defined by the “3Vs”, but there are now “6Vs” of Big
Data, which are also termed the characteristics of Big Data, as follows:
1. Volume: The name ‘Big Data’ itself is related to a size which is enormous.
If the Volume of data is very large, then it is actually considered as a ‘Big
Data’.
Variety: the data comes in many forms –
• Structured
• Semi-structured
• Unstructured
• Multi-structured
5. Value: The bulk of data has no value unless we turn it into something
useful. Data in itself is of no use or importance; it needs to be converted
into something valuable so that information can be extracted from it.
Classify Big Data:
Big data is classified according to the source of the data and its various formats,
as follows:
1. Structured Data:
Any data that can be stored, accessed and processed in a fixed format is
termed 'structured' data.
Examples – data stored in a relational database management system (RDBMS).
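As an illustrative sketch (the 'employee' table and its rows are hypothetical), structured data can be created and queried through a fixed schema, here using Python's built-in sqlite3 module:

```python
import sqlite3

# Structured data: a fixed-format relational table with a known schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Asha", "Sales"), (2, "Ravi", "HR")],
)

# The fixed schema means rows can be accessed and processed uniformly.
rows = conn.execute("SELECT name FROM employee WHERE dept = 'Sales'").fetchall()
print(rows)  # [('Asha',)]
conn.close()
```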
2. Semi-structured data:
Semi-structured data does not conform to a fixed tabular schema, but contains
tags or markers that label and separate the data elements.
Example –
• XML data.
<note>
<to>You</to>
<from>Me</from>
<heading>Reminder</heading>
</note>
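A sketch of how such semi-structured XML is processed: the tags themselves describe the data, so a parser can pull out labelled fields without a fixed relational schema. This uses Python's standard xml.etree module on the <note> document above:

```python
import xml.etree.ElementTree as ET

# The <note> document from the text: tags label the data elements.
xml_text = """<note>
<to>You</to>
<from>Me</from>
<heading>Reminder</heading>
</note>"""

root = ET.fromstring(xml_text)        # parse the tagged text into a tree
print(root.find("to").text)           # You
print(root.find("heading").text)      # Reminder
```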
JSON Data:
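The notes' JSON example did not survive extraction; the snippet below is a hypothetical equivalent of the XML note above, showing the same idea in JSON: labelled key-value pairs without a fixed schema.

```python
import json

# Hypothetical JSON counterpart of the <note> XML document.
json_text = '{"to": "You", "from": "Me", "heading": "Reminder"}'

note = json.loads(json_text)   # parse the text into a Python dict
print(note["heading"])         # Reminder
```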
3. Multi-structured data:
Multi-structured data sets can have many formats. They are found in non-
transactional systems.
4. Unstructured Data:
Any data with an unknown form or structure is classified as unstructured data.
Examples –
• Word, PDF and text files, media logs, email data, output returned by
'Google Search', etc.
Data Quality:
• High quality means data which enables all the required operations,
analysis, decisions, planning and knowledge discovery to be carried out correctly.
• A definition of high-quality data, especially for artificial intelligence
applications, is data with the five R's: Relevancy, Recency, Range,
Robustness and Reliability. Relevancy is of utmost importance.
Data Integrity:
Software which stores, processes or retrieves data should maintain the integrity of
the data. Data should be incorruptible.
Noise:
Noise in data refers to meaningless additional information that accompanies the
true (actual/required) information.
Outlier:
An outlier in data refers to data which appears not to belong to the dataset.
• Actual outliers need to be removed from the dataset, else the result will be
affected by a small or large amount.
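One common way to detect such outliers is the 1.5 × IQR (interquartile range) rule; the sketch below uses a hypothetical dataset and rough quartiles:

```python
# Flag values far outside the interquartile range as outliers.
def find_outliers(values):
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]       # rough quartiles
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

data = [10, 12, 11, 13, 12, 11, 95]  # 95 does not appear to belong here
print(find_outliers(data))           # [95]
```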
Data Wrangling:
Data wrangling refers to the process of transforming and mapping the data from
one format to another format, which makes it valuable for analytics and data
visualizations.
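A minimal sketch of wrangling (the CSV input and field names are hypothetical): records are transformed from one format (CSV) to another (JSON), with a field renamed and a unit converted along the way.

```python
import csv
import io
import json

# Hypothetical raw input: city temperatures in Fahrenheit, as CSV text.
csv_text = "name,temp_f\nDelhi,98.6\nMysuru,77.0\n"

records = []
for row in csv.DictReader(io.StringIO(csv_text)):
    records.append({
        "city": row["name"],                                      # map/rename field
        "temp_c": round((float(row["temp_f"]) - 32) * 5 / 9, 1),  # convert unit
    })

# The wrangled records are now in a format ready for analytics tools.
print(json.dumps(records))
# [{"city": "Delhi", "temp_c": 37.0}, {"city": "Mysuru", "temp_c": 25.0}]
```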
Big Data architecture:
Big data architecture refers to the logical and physical structure that dictates
how high volumes of data are ingested, processed, stored, managed, and
accessed.
Data sources: Data sources, both open and third-party, play a significant
role in the architecture.
Data storage: Data is stored in file stores that are distributed in nature
and can hold a variety of large files in different formats.
Stream processing: Stream processing, on the other hand, handles streaming
data in the form of windows or streams and writes the results to
the sink. Tools include Apache Spark, Flink, Storm, etc.
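The windowing idea can be sketched in miniature (a toy moving average over hypothetical sensor readings, nothing like the scale of Spark or Flink):

```python
from collections import deque

# Slide a fixed-size window over a stream and emit one result per window.
def sliding_averages(stream, window_size=3):
    window = deque(maxlen=window_size)   # keeps only the latest readings
    out = []                             # stands in for the "sink"
    for reading in stream:
        window.append(reading)
        if len(window) == window_size:
            out.append(sum(window) / window_size)
    return out

readings = [10, 20, 30, 40, 50]          # hypothetical sensor stream
print(sliding_averages(readings))        # [20.0, 30.0, 40.0]
```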
Analytics-based datastore: To analyze already-processed data, analytical
tools use a data store based on HBase or another NoSQL data warehouse
technology.
Reporting and analysis: The generated insights must then be presented,
which is accomplished by reporting and analysis tools that use embedded
technology to produce useful graphs, analyses and insights beneficial to the
business. Examples: Cognos, Hyperion, and others.
Data pre-processing:
• Pre-processing is necessary when data is being exported to a cloud service
or when a data store requires pre-processed data.
Lambda Architecture:
A single Lambda architecture handles both batch (static) data and real-
time streaming data. It is employed to solve the problem of computing arbitrary
functions. In this deployment model, latency is reduced, only negligible errors
occur, and accuracy is retained.
Kappa Architecture:
Kappa architecture addresses the same problem with a single stream-processing
pipeline that handles both real-time and historical (batch) data.
(iv) Data-processing
(v) Data consumption by a number of programs and tools, such as business
intelligence, data mining, discovering patterns/clusters, artificial
intelligence (AI), machine learning (ML), text analytics, descriptive and
predictive analytics, and data visualization.
KIT/CSE/BJ Page 17
Logical layer 1 (L1): It is for identifying data sources, which are external, internal or
both.
Layer 2 (L2): It is for data ingestion. Ingestion is the process of obtaining and
importing data for immediate use or transfer.
Layer 4 (L4): It is for data processing using software such as MapReduce,
Hive, Pig or Spark.
The top layer (L5): It is for data consumption. Data is used in analytics,
visualizations, reporting, and export to cloud or web servers.
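The MapReduce processing done at the L4 layer can be sketched in miniature as a word count: each chunk of input is mapped to partial counts, and the partials are then reduced into one result (the input chunks here are hypothetical).

```python
from collections import Counter
from functools import reduce

# Hypothetical input split into chunks, as a distributed system would do.
chunks = ["big data needs processing", "big data is big"]

mapped = [Counter(chunk.split()) for chunk in chunks]  # map: partial counts
totals = reduce(lambda a, b: a + b, mapped)            # reduce: merge partials

print(totals["big"])   # 3
print(totals["data"])  # 2
```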
KIT/CSE/BJ Page 18
Scalability in Big Data Architecture Design:
Scalability enables increase or decrease in the capacity of data storage,
processing and analytics.
Vertical Scalability
Horizontal Scalability
Scaling up also means designing the algorithm according to the architecture so that
it uses resources efficiently.
For example, if x terabytes of data take time t for processing, and the code size
with increasing complexity increases by a factor n, then scaling up means that
processing takes time equal to, less than, or much less than (n × t).
Scaling out is basically using more resources and distributing the processing
and storage tasks in parallel.
For example, if r resources in a system process x terabytes of data in time t, then
(p × x) terabytes are processed on p parallel distributed nodes such that the time
taken remains t, or is slightly more than t (due to the additional time required for
inter-processing-node communication, IPC).
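This scale-out estimate can be sketched numerically (the overhead factor below is a hypothetical assumption for illustration, not a figure from the text):

```python
# Estimate scale-out time: p*x TB on p nodes should take about t,
# plus a small communication (IPC) overhead that grows with node count.
def scale_out_time(t: float, p: int, comm_overhead: float = 0.05) -> float:
    """Time to process p*x TB on p nodes, given time t for x TB on one."""
    return round(t * (1 + comm_overhead * (p - 1)), 3)

t = 10.0                     # hypothetical: hours for x TB on one node set
print(scale_out_time(t, 1))  # 10.0 (single node, no overhead)
print(scale_out_time(t, 4))  # 11.5 (4x the data, slightly more than t)
```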
KIT/CSE/BJ Page 19
Parallel Processing in Big Data Architecture:
Big Data needs processing of large data volumes and therefore needs
intensive computation. Complex applications processing large
datasets (terabyte to petabyte) need hundreds of computing nodes.
• Cloud Computing
• Volunteer Computing
KIT/CSE/BJ Page 20
The distributed computing model is the best example of MPP (Massively
Parallel Processing).
Cloud Computing:
“Cloud computing is a type of Internet-based computing that provides
shared processing resources and data to computers and other devices on
demand.”
• One of the best approaches for data processing is to perform parallel and
distributed computing in a cloud-computing environment.
Examples: Amazon Web Services (AWS), Digital Ocean, Elastic Compute Cloud (EC2),
Microsoft Azure, Apache CloudStack, Amazon Simple Storage Service (S3).
(iii) Scalability,
(iv) Accountability
KIT/CSE/BJ Page 21
2. Platform as a Service (PaaS):
Grid Computing :
Grid computing can be defined as a network of computers working
together to perform a task that would be difficult for a single machine.
All machines on the network work under the same protocol to act as a virtual
supercomputer.
Computers on the network contribute resources like processing power and storage
capacity to the network.
Grid Computing is a subset of distributed computing, where a virtual
supercomputer comprises machines on a network connected by some bus,
mostly Ethernet or sometimes the Internet.
It can also be seen as a form of parallel computing where, instead of many CPU
cores on a single machine, multiple machines spread across various locations
contribute their cores.
Working:
A grid computing network mainly consists of three types of machines: a control
node (a server that administers the network), providers (computers that
contribute resources) and users (computers that request resources).
When a computer makes a request for resources to the control node, the
control node gives the user access to the resources available on the
network.
KIT/CSE/BJ Page 22
Hence a normal computer on the node can switch between being a user and
a provider based on its needs.
The nodes may consist of machines with similar platforms running the same
OS, called homogeneous networks, or machines with different platforms
running various different OSs, called heterogeneous networks.
KIT/CSE/BJ Page 23
Currently, grid computing is being used in various institutions to solve a lot
of mathematical, analytical, and physics problems.
Cluster Computing:
Clusters are used mainly for load balancing. They shift processes between
nodes to keep an even load on the group of connected computers.
Volunteer Computing:
Volunteer computing is a distributed computing paradigm which uses the
computing resources of volunteers. Volunteers are organizations or individuals
who own personal computers.
KIT/CSE/BJ Page 24
2. Drop-outs from the network over time
Data analytics has to go through the following phases before deriving new
facts, providing business intelligence and generating new knowledge.
1. Descriptive analytics:
KIT/CSE/BJ Page 25
4. Cognitive analytics [Opinion Mining/Sentiment Analysis]:
It is the use of computerized models to simulate the human thought process in
complex situations where the answers may be ambiguous and uncertain.
KIT/CSE/BJ Page 26
Applications of Big Data:
• Travel and tourism: Big data helps in predicting requirements such as
those for travel facilities. Through this, businesses have seen
significant improvements.
• Finance and banking: This sector extensively uses big data to understand
customer behaviour through patterns and other trends.
• Smart Traffic System: Data about traffic conditions on different roads is
collected through cameras placed beside the roads and at the entry and exit
points of the city, and through GPS devices placed in vehicles (Ola, Uber
cabs, etc.). All such data is analyzed, and jam-free or less congested,
faster routes are recommended.
• Secure Air Traffic System: Sensors are present at various parts of the
aircraft (such as the propellers). These sensors capture data like the speed
of the flight, moisture, temperature and other environmental conditions.
Based on analysis of such data, environmental parameters within the flight
are set up and varied.
• Auto-Driving Car: Big data analysis helps a car drive without human
intervention. Cameras and sensors placed at various spots on the car
gather data such as the size of surrounding vehicles, obstacles, the
distance from them, etc.
KIT/CSE/BJ Page 28
Previous Year VTU Questions with Answer
(Module-1)
1. Define Big Data. Explain the Evolution of Big Data and their
characteristics (10 Marks)
2. What is grid computing? List and explain the features, drawbacks of
grid computing (10 Marks)
3. Discuss the functions of each of the five layers in Big Data
architecture design (10 Marks)
4. Illustrate the various phases involved in Big Data Analytics with neat
diagram.(10 Marks)
Sample Questions
Big Data Analytics (18CS72)
Module-1