
Data Analytics (CS-603)

Javed Moazzam
Assistant Professor
Computer Science & Engineering

Lecture No-01

Topic – Introduction to Big Data


Course Outcome:
At the end of the course, students will be able to:

• Develop relevant programming abilities.
• Demonstrate proficiency with statistical analysis of data.
• Build and assess data-based models.
• Execute statistical data analysis with professional statistical software.
• Demonstrate skill in data management.
• Apply data science concepts and methods to solve problems in real-world contexts and communicate these solutions effectively.
What is Big Data?

Big data is defined as a complex and voluminous set of information comprising structured, unstructured, and semi-structured datasets, which is challenging to manage using traditional data-processing tools. It requires additional infrastructure to govern, analyze, and convert into insights.

Big data is a quantity of data that is enormous in volume and constantly expanding. No typical data management system can effectively store or analyze this data because of its magnitude and complexity.

Big data is a collection of structured, semi-structured, and unstructured information gathered by businesses that can be mined for insights and used in advanced analytics applications such as predictive modeling and machine learning.
Importance of Big Data
• Saving Cost
• Driving Efficiency
• Analyzing the Market
• Improving Customer Experience
• Supporting Innovation
• Detecting Fraud
• Improving Productivity
• Enabling Agility
Saving Cost: When a company has to store a lot of data, big data platforms such as Apache Hadoop and Spark can help save costs. These technologies help businesses find more efficient ways to conduct operations, which also affects the bottom line. For example, the cost of a product return is typically 1.5 times that of standard shipping, so insights that reduce returns translate directly into savings.

Driving Efficiency: Using real-time, in-memory analytics, businesses can gather data from various sources. Big data tools let them evaluate this data quickly, making it easier to act promptly on what they discover. These tools can also increase operational effectiveness by automating repetitive processes and tasks, giving employees more time for activities that demand cognitive skills.

Analyzing the Market: Big data analysis helps firms better understand the state of the market. For instance, studying purchase patterns lets businesses determine which items are most popular and develop them accordingly.

Improving Customer Experience: Big data enables companies to tailor products to their target market without spending a fortune on ineffective advertising campaigns. By tracking point-of-sale (POS) transactions and online purchases, businesses can study consumer patterns and use those insights to create focused, targeted marketing strategies that meet consumer expectations and foster brand loyalty.

Supporting Innovation: Business innovation relies on the insights you can uncover through big data analytics. It enables you to innovate with new products and services while updating existing ones. Knowing what consumers think about your goods and services can guide product development. In today's competitive marketplace, businesses must put procedures in place to keep track of feedback, product success, and competitors. Big data analytics also makes real-time market monitoring possible, which supports timely innovation.

Detecting Fraud: Big data is used primarily by financial companies and the public sector to identify fraud. Data analysts use artificial intelligence and machine learning algorithms to find anomalies and suspicious transaction patterns.

Improving Productivity: Modern big data tools make it possible for data scientists and analysts to examine enormous amounts of data efficiently, giving them a quick overview of far more data and raising their output. Big data analytics also lets data scientists and analysts see how efficient their data pipelines are, helping them decide how to carry out their duties and tasks more effectively.

Enabling Agility: Big data analytics can help businesses become more innovative and adaptable in the marketplace. Analyzing large consumer data sets helps enterprises gain insights ahead of the competition and handle customer pain points more effectively.
Big Data 3Vs
Volume
Volume is how much data there is, measured in units ranging from gigabytes up to zettabytes (ZB) and yottabytes (YB). Industry trends predict a significant increase in data volume over the next few years. Storing and processing such enormous volumes used to be a problem, but data gathered from all these sources is now organized using distributed systems such as Hadoop. Understanding the usefulness of the data requires knowledge of its magnitude, and volume is also one criterion for deciding whether a data set counts as big data.
Data Volume
• 44x increase from 2009 to 2020: from 0.8 zettabytes (ZB) to 35 ZB
• Data volume is increasing exponentially
Velocity
Velocity describes how quickly data is processed. Any significant data operation has to run at a high rate. The phenomenon comprises the linkage of incoming data sets, bursts of activity, and the pace of change. Sensors, social media platforms, and application logs all continuously generate enormous volumes of data. If the flow of data is not constant, it is not worth spending time or effort on it.
• Data is being generated fast and needs to be processed fast
• Online data analytics
• Late decisions → missing opportunities

Examples:
• E-promotions: based on your current location, your purchase history, and what you like → send promotions right now for the store next to you
• Healthcare monitoring: sensors monitoring your activities and body

Variety
Variety refers to the many types of big data. Because it affects performance, variety is one of the main problems the big data sector is now dealing with. It's crucial to organize your data so that you can manage its diversity effectively. Variety is the wide range of information you collect from numerous sources:
• Relational data (tables/transactions/legacy data)
• Text data (web)
• Semi-structured data (XML)
• Graph data: social networks, the Semantic Web (RDF), …
• Streaming data: you can only scan the data once
• A single application can generate/collect many types of data
• Big public data (online, weather, finance, etc.)
Types of Big Data

As the definitions above note, big data comes in three forms: structured, semi-structured, and unstructured data.
Drivers for Big Data

A number of business drivers are at the core of this success and explain why Big Data has quickly risen to become one of the most talked-about topics in the industry. Five main business drivers can be identified:
• The digitization of society
• Connectivity through cloud computing
• Increased knowledge about data science
• Social media applications
• The upcoming Internet of Things (IoT)
The digitization of society

Big Data is largely consumer driven and consumer oriented. Most of the data in the world is generated by consumers, who are nowadays 'always-on'. Most people now spend 4-6 hours per day consuming and generating data through a variety of devices and (social) applications. With every click, swipe, or message, new data is created in a database somewhere in the world. Because everyone now has a smartphone in their pocket, data creation adds up to incomprehensible amounts. Some studies estimate that 60% of all data was generated within the last two years, a good indication of the rate at which society has digitized.
Connectivity through cloud computing

Cloud computing environments, where data is remotely stored in distributed storage systems, have made it possible to quickly scale IT infrastructure up or down and facilitate a pay-as-you-go model. This means that organizations that want to process massive quantities of data, and thus have large storage and processing requirements, do not have to invest in large quantities of IT infrastructure. Instead, they can license the storage and processing capacity they need and pay only for the amounts they actually use. As a result, most Big Data solutions leverage the possibilities of cloud computing to deliver their solutions to enterprises.
Increased knowledge about data science

In the last decade, the terms "data science" and "data scientist" have become tremendously popular. In October 2012, Harvard Business Review called data scientist "the sexiest job of the 21st century", and many other publications have featured this new job role in recent years. The demand for data scientists (and similar job titles) has increased tremendously, and many people have actively become engaged in the domain of data science.
Social media applications

Everyone understands the impact that social media has on daily life. In the study of Big Data, however, social media plays a role of paramount importance, not only because of the sheer volume of data produced every day through platforms such as Twitter, Facebook, LinkedIn, and Instagram, but also because social media provides nearly real-time data about human behavior.

Social media data provides insight into the behaviors, preferences, and opinions of 'the public' on a scale never known before, which makes it immensely valuable to anyone able to derive meaning from such large quantities of data. Social media data can be used to identify customer preferences for product development, target new customers for future purchases, or even target potential voters in elections. Social media data might even be considered one of the most important business drivers of Big Data.
The upcoming internet of things (IoT)

The Internet of Things (IoT) is the network of physical devices, vehicles, home
appliances and other items embedded with electronics, software, sensors,
actuators, and network connectivity which enables these objects to connect and
exchange data. It is increasingly gaining popularity as consumer goods providers
start including ‘smart’ sensors in household appliances. Whereas the average
household in 2010 had around 10 devices that connected to the internet, this
number is expected to rise to 50 per household by 2020. Examples of these
devices include thermostats, smoke detectors, televisions, audio systems and even
smart refrigerators.
Apache Spark

Apache Spark is a lightning-fast, open source data-processing engine for machine learning and AI
applications, backed by the largest open source community in big data.

Apache Spark (Spark) is an open source data-processing engine for large data sets. It is designed to
deliver the computational speed, scalability, and programmability required for Big Data—specifically
for streaming data, graph data, machine learning, and artificial intelligence (AI) applications.

Spark's analytics engine processes data 10 to 100 times faster than alternatives. It scales by
distributing processing work across large clusters of computers, with built-in parallelism and fault
tolerance. It even includes APIs for programming languages that are popular among data analysts
and data scientists, including Scala, Java, Python, and R.

Spark is often compared to Apache Hadoop, and specifically to MapReduce, Hadoop's native data-processing component. The chief difference between Spark and MapReduce is that Spark processes and keeps the data in memory for subsequent steps, without writing to or reading from disk, which results in dramatically faster processing speeds.
How Apache Spark Works

Apache Spark has a hierarchical master/slave architecture. The Spark Driver is the master node that
controls the cluster manager, which manages the worker (slave) nodes and delivers data results to
the application client.
Based on the application code, Spark Driver generates the SparkContext, which works with the
cluster manager—Spark’s Standalone Cluster Manager or other cluster managers like Hadoop YARN,
Kubernetes, or Mesos— to distribute and monitor execution across the nodes. It also creates
Resilient Distributed Datasets (RDDs), which are the key to Spark’s remarkable processing speed.
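
To make the flow concrete, here is a minimal sketch (assuming a local PySpark installation; the application name is invented) of how a driver program obtains a SparkContext. The master URL "local[*]" runs Spark on all local cores and stands in for a real cluster manager such as Standalone, YARN, Kubernetes, or Mesos.

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("IntroToBigData").setMaster("local[*]")
    sc = SparkContext(conf=conf)   # the driver's entry point to the cluster
    print(sc.version)              # confirm the context is live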

Resilient Distributed Datasets (RDDs)


Resilient Distributed Datasets (RDDs) are fault-tolerant collections of elements that can be distributed
among multiple nodes in a cluster and worked on in parallel. RDDs are a fundamental structure in
Apache Spark.
Spark loads data into an RDD for processing either by referencing a data source or by parallelizing an existing collection with the SparkContext parallelize method. Once data is loaded into an RDD, Spark performs transformations and actions on it in memory, which is the key to Spark's speed. Spark also keeps the data in memory unless the system runs out of memory or the user decides to write the data to disk for persistence.
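
As a small illustration, reusing the SparkContext sc from the sketch above (the numeric collection is invented), this is how an existing collection is parallelized into an RDD and kept in memory for reuse:

    numbers = sc.parallelize(range(1, 1_000_001), 8)  # RDD with 8 partitions
    numbers.cache()                 # ask Spark to keep the RDD in memory
    total = numbers.sum()           # first action computes and caches it
    evens = numbers.filter(lambda n: n % 2 == 0).count()  # served from memory
    print(total, evens)             # 500000500000 500000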

Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. Users can perform two types of RDD operations: transformations and actions. Transformations are operations applied to create a new RDD. Actions instruct Apache Spark to apply computation and pass the result back to the driver.

Spark supports a variety of actions and transformations on RDDs, and Spark handles the distribution itself, so users don't have to worry about computing the right distribution.
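
The short sketch below, again reusing sc (the word list is invented), shows the difference: transformations only build up a lineage of new RDDs, and nothing executes until an action asks for a result.

    words = sc.parallelize(["spark", "hadoop", "spark", "mllib"])
    pairs = words.map(lambda w: (w, 1))             # transformation: lazy
    counts = pairs.reduceByKey(lambda a, b: a + b)  # transformation: lazy
    result = counts.collect()  # action: triggers execution, returns to driver
    print(result)              # [('spark', 2), ('hadoop', 1), ('mllib', 1)]
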
Apache Spark Machine Learning

Spark has various libraries that extend its capabilities to machine learning, artificial intelligence (AI), and stream processing.

Apache Spark MLlib


One of the critical capabilities of Apache Spark is the machine learning functionality available in Spark MLlib. MLlib provides an out-of-the-box solution for classification and regression, collaborative filtering, clustering, distributed linear algebra, decision trees, random forests, gradient-boosted trees, frequent pattern mining, evaluation metrics, and statistics. These capabilities, combined with the various data types Spark can handle, make Apache Spark an indispensable Big Data tool.
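
As a hedged sketch of what MLlib usage looks like, here is a tiny classification example via the DataFrame-based pyspark.ml API; the four-row training set is invented purely for illustration.

    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("MLlibSketch").getOrCreate()
    train = spark.createDataFrame(
        [(0.0, Vectors.dense(0.0, 1.1)),
         (1.0, Vectors.dense(2.0, 1.0)),
         (0.0, Vectors.dense(0.1, 1.2)),
         (1.0, Vectors.dense(2.2, 0.9))],
        ["label", "features"])
    model = LogisticRegression(maxIter=10).fit(train)  # distributed training
    model.transform(train).select("label", "prediction").show()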

Spark GraphX
Spark also includes Spark GraphX, an addition designed to solve graph problems. GraphX is a graph abstraction that extends RDDs for graphs and graph-parallel computation. Spark GraphX integrates with graph databases that store interconnectivity information, such as the web of connections in a social network.

Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, fault-tolerant
processing of live data streams. As Spark Streaming processes data, it can deliver data to file
systems, databases, and live dashboards for real-time streaming analytics with Spark's machine
learning and graph-processing algorithms. Built on the Spark SQL engine, Spark Streaming also
allows for incremental batch processing that results in faster processing of streamed data.
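
A minimal streaming sketch using the Spark-SQL-based Structured Streaming API: it counts words arriving on a socket and assumes a line-oriented text server on localhost:9999 (for example, one started with nc -lk 9999).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("StreamSketch").getOrCreate()
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()   # incrementally updated state
    query = (counts.writeStream.outputMode("complete")
             .format("console").start())     # print updated counts per batch
    query.awaitTermination()
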
Predictive Analytics

"Predictive analytics is an advanced form of data analytics that attempts to answer the question, 'What might happen next?'"

Predictive analytics is the process of using data to forecast future outcomes. The process uses data
analysis, machine learning, artificial intelligence, and statistical models to find patterns that might
predict future behavior. Organizations can use historic and current data to forecast trends and
behaviors seconds, days, or years into the future with a great deal of precision.

How does predictive analytics work?

Data scientists use predictive models to identify correlations between different elements in selected
datasets. Once data collection is complete, a statistical model is formulated, trained, and modified
to generate predictions.
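
A hedged, minimal sketch of that formulate-train-predict loop, here using scikit-learn (one toolkit among many, not prescribed by the text) on synthetic data:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(200, 1))        # historical input feature
    y = 3.0 * X.ravel() + rng.normal(0, 1, 200)  # observed outcome plus noise

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_train, y_train)  # train on the past
    print(model.predict(X_test[:5]))                  # forecast unseen cases
    print(model.score(X_test, y_test))                # validate the fit
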
Deep Packet Inspection (DPI)

Deep packet inspection (DPI), also known as packet sniffing, is a method of examining the content of data packets as they pass a checkpoint on the network. With normal stateful packet inspection, a device checks only the information in the packet's header, such as the destination Internet Protocol (IP) address, source IP address, and port number. DPI examines a larger range of metadata and data connected with each packet the device handles: the inspection covers both the header and the payload the packet is carrying.
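
As a hedged sketch of the header-versus-payload distinction, the snippet below uses the third-party scapy library (an assumption, not named in the text; live capture requires administrator privileges):

    from scapy.all import sniff, IP, TCP, Raw

    def inspect(pkt):
        if IP in pkt and TCP in pkt:
            # Stateful/header inspection stops at fields like these:
            print(pkt[IP].src, "->", pkt[IP].dst, "port", pkt[TCP].dport)
            # Deep packet inspection also reads the carried payload:
            if Raw in pkt:
                if b"password" in bytes(pkt[Raw].load).lower():  # toy rule
                    print("  DPI rule matched in payload")

    sniff(filter="tcp", prn=inspect, count=10)  # examine 10 TCP packets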
