Javed Moazzam
Assistant Professor
Computer Science & Engineering
Lecture No-01
• Students will develop the ability to build and assess data-based models.
• Students will execute statistical data analysis with professional statistical software.
Big data refers to data sets that are enormous in volume and growing rapidly. Because of their
magnitude and complexity, traditional data management systems cannot effectively store or analyze
them.
Driving Efficiency : Using real-time, in-memory analytics, businesses can gather data from various sources. Big data
tools let them evaluate this data quickly, making it easier to act promptly on what they discover. These tools can also
improve operational effectiveness by automating repetitive processes and tasks, giving employees more time for work
that demands cognitive skills.
Analyzing the Market : Big data analysis helps firms better understand the state of the market. For instance, studying
purchase patterns enables businesses to identify the most popular items and develop their offerings accordingly.
Improving Customer Experience : Big data enables companies to tailor products to their target market without
spending a fortune on ineffective advertising campaigns. By tracking point of sale (POS) transactions and online
purchases, businesses can use big data to study consumer patterns. Using these insights, focused and targeted
marketing strategies are created to assist companies in meeting consumer expectations and fostering brand loyalty.
Supporting Innovation : Business innovation relies on the insights you may uncover through big data analytics. It
enables you to innovate around new products and services while updating existing ones. Product development can be
aided by knowing what consumers think about your goods and services. Businesses must put in place procedures that
assist them in keeping track of feedback, product success, and rival companies in today’s competitive marketplace.
Big data analytics also makes real-time market monitoring possible, which aids in timely innovation.
Detecting Fraud : Big data is primarily used by financial companies and the public sector to identify fraud. Data
analysts utilize artificial intelligence and machine learning algorithms to find abnormalities and transaction trends.
Improving Productivity : Modern big data tools enable data scientists and analysts to examine enormous amounts of
data efficiently, giving them a fast overview of more data and raising their output. Big data analytics also gives data
scientists and analysts insight into the efficiency of their data pipelines, helping them decide how to carry out their
duties and tasks more effectively.
Enabling Agility : Big data analytics can help businesses become more innovative and adaptable in the marketplace.
Analyzing large consumer data sets helps enterprises gain insights ahead of the competition and address customer
pain points more effectively.
The 3 Vs of Big Data
Volume
The volume of your data is how much of it there is – measured in gigabytes, zettabytes
(ZB), and yottabytes (YB). Industry trends predict a significant increase in data volume
over the next few years. Storing and processing such an enormous volume of data used to be a
problem; nowadays, data gathered from all these sources is organized using distributed systems
such as Hadoop. Understanding the usefulness of the data requires knowledge of its magnitude,
and volume is also one way to determine whether a data set qualifies as big data.
Data volume is increasing exponentially: a 44x increase from 2009 to 2020, from 0.8 ZB to 35 ZB.
Velocity
Velocity describes how quickly data is generated and processed. Any significant big data
operation has to run at a high rate. This phenomenon comprises the linkage of incoming data
sets, bursts of activity, and the pace of change. Sensors, social media platforms, and
application logs all continuously generate enormous volumes of data; if the data flow is not
constant, the time and effort invested in it is wasted.
Data is being generated fast and needs to be processed fast (online data analytics); late
decisions mean missed opportunities. For example, e-promotions: based on your current location,
your purchase history, and what you like, the store next to you can send you promotions right now.
Big Data is largely consumer driven and consumer oriented. Most of the data in
the world is generated by consumers, who are nowadays ‘always-on’. Most people
now spend 4-6 hours per day consuming and generating data through a variety of
devices and (social) applications. With every click, swipe or message, new data is
created in a database somewhere around the world. Because everyone now has a
smartphone in their pocket, data creation adds up to incomprehensible amounts.
Some studies estimate that 60% of all data was generated within the last two years,
which is a good indication of the rate at which society has digitized.
Connectivity through cloud computing
In the last decade, the terms data science and data scientist have become
tremendously popular. In October 2012, Harvard Business Review called the data
scientist the “sexiest job of the 21st century,” and many other publications have
featured this new job role in recent years. Demand for data scientists (and
similar job titles) has increased tremendously, and many people have become
actively engaged in the domain of data science.
Social media applications
Everyone understands the impact that social media has on daily life. However, in
the study of Big Data, social media plays a role of paramount importance. Not
only because of the sheer volume of data that is produced every day through
platforms such as Twitter, Facebook, LinkedIn and Instagram, but also because
social media provides nearly real-time data about human behavior.
Social media data provides insights into the behaviors, preferences and opinions of
‘the public’ on a scale that has never been known before. Due to this, it is
immensely valuable to anyone who is able to derive meaning from these large
quantities of data. Social media data can be used to identify customer preferences
for product development, target new customers for future purchases, or even
target potential voters in elections. Social media data might even be considered
one of the most important business drivers of Big Data.
The upcoming internet of things (IoT)
The Internet of things (IoT) is the network of physical devices, vehicles, home
appliances and other items embedded with electronics, software, sensors,
actuators, and network connectivity which enables these objects to connect and
exchange data. It is increasingly gaining popularity as consumer goods providers
start including ‘smart’ sensors in household appliances. Whereas the average
household in 2010 had around 10 devices that connected to the internet, this
number is expected to rise to 50 per household by 2020. Examples of these
devices include thermostats, smoke detectors, televisions, audio systems and even
smart refrigerators.
Apache Spark
Apache Spark is a lightning-fast, open source data-processing engine for machine learning and AI
applications, backed by the largest open source community in big data.
Apache Spark (Spark) is an open source data-processing engine for large data sets. It is designed to
deliver the computational speed, scalability, and programmability required for Big Data—specifically
for streaming data, graph data, machine learning, and artificial intelligence (AI) applications.
Spark's analytics engine processes data 10 to 100 times faster than alternatives. It scales by
distributing processing work across large clusters of computers, with built-in parallelism and fault
tolerance. It even includes APIs for programming languages that are popular among data analysts
and data scientists, including Scala, Java, Python, and R.
Spark is often compared to Apache Hadoop, and specifically to MapReduce, Hadoop’s native data-
processing component. The chief difference between Spark and MapReduce is that Spark processes
and keeps the data in memory for subsequent steps—without writing to or reading from disk—
which results in dramatically faster processing speeds.
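The MapReduce pattern that Spark improves upon can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce phases on one machine, using hypothetical input lines; Hadoop and Spark distribute these same phases across a cluster.

```python
from functools import reduce
from collections import defaultdict

# A minimal plain-Python sketch of the MapReduce word-count pattern.
# Real Hadoop/Spark jobs distribute these phases across many nodes;
# here each phase runs locally to illustrate the data flow.

lines = ["big data tools", "big data analytics", "spark processes data"]

# Map phase: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted values by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: reduce(lambda a, b: a + b, vals) for word, vals in groups.items()}

print(counts)  # e.g. {'big': 2, 'data': 3, ...}
```

MapReduce writes intermediate results of each phase to disk; Spark's speed advantage comes from keeping these intermediate collections in memory.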
How Apache Spark Works
Apache Spark has a hierarchical master/slave architecture. The Spark Driver is the master node that
controls the cluster manager, which manages the worker (slave) nodes and delivers data results to
the application client.
Based on the application code, Spark Driver generates the SparkContext, which works with the
cluster manager—Spark’s Standalone Cluster Manager or other cluster managers like Hadoop YARN,
Kubernetes, or Mesos— to distribute and monitor execution across the nodes. It also creates
Resilient Distributed Datasets (RDDs), which are the key to Spark’s remarkable processing speed.
Each dataset in an RDD is divided into logical partitions, which may be computed on different
nodes of the cluster. And, users can perform two types of RDD operations: transformations and
actions. Transformations are operations applied to create a new RDD. Actions are used to instruct
Apache Spark to apply computation and pass the result back to the driver.
Spark supports a variety of actions and transformations on RDDs. This distribution is done by Spark,
so users don’t have to worry about computing the right distribution.
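The transformation/action split described above can be illustrated with a small plain-Python sketch. The `MiniRDD` class below is hypothetical and is not the real RDD API; it only mimics Spark's key idea that transformations are lazy (they record work without doing it) while actions force evaluation and return a result to the driver.

```python
# A minimal plain-Python sketch of Spark's transformation/action split.
# Transformations (map, filter) are lazy: they only describe what to do.
# Actions (collect) force evaluation. Python generators give us the
# same deferred-execution behavior; this is NOT the real RDD API.

class MiniRDD:
    def __init__(self, data):
        self._data = data  # an iterable; nothing is computed yet

    # --- transformations: return a new MiniRDD, compute nothing ---
    def map(self, fn):
        return MiniRDD(fn(x) for x in self._data)

    def filter(self, pred):
        return MiniRDD(x for x in self._data if pred(x))

    # --- action: force evaluation, return a result to the "driver" ---
    def collect(self):
        return list(self._data)

rdd = MiniRDD(range(10))
# No work happens until collect() is called.
result = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
print(result)  # [0, 4, 16, 36, 64]
```

In real Spark, this laziness lets the engine see the whole chain of transformations before executing, so it can plan the distribution and keep intermediate data in memory.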
Apache Spark Machine Learning
Spark has various libraries that extend its capabilities to machine learning, artificial
intelligence (AI), and stream processing.
Spark GraphX
Spark also includes Spark GraphX, a component designed to solve graph problems. GraphX is a
graph abstraction that extends RDDs for graphs and graph-parallel computation. Spark GraphX
integrates with graph databases that store interconnectivity information or webs of connection
information, like that of a social network.
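To make "graph-parallel computation" concrete, the sketch below counts vertex degrees over an edge list in plain Python. The per-edge update is exactly the kind of computation GraphX distributes across partitions; the network and names are hypothetical, and this is not the GraphX API.

```python
from collections import Counter

# A tiny sketch of graph-parallel computation in the spirit of GraphX:
# computing vertex degrees over an edge list. Plain Python, not GraphX.

# Edges of a small undirected "social network" (hypothetical names).
edges = [("ann", "bob"), ("bob", "cat"), ("ann", "cat"), ("cat", "dan")]

# Each edge contributes one degree to both endpoints -- a per-edge
# computation that a graph engine can run on many partitions at once.
degrees = Counter()
for u, v in edges:
    degrees[u] += 1
    degrees[v] += 1

print(dict(degrees))  # {'ann': 2, 'bob': 2, 'cat': 3, 'dan': 1}
```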
Spark Streaming
Spark Streaming is an extension of the core Spark API that enables scalable, fault-tolerant
processing of live data streams. As Spark Streaming processes data, it can deliver data to file
systems, databases, and live dashboards for real-time streaming analytics with Spark's machine
learning and graph-processing algorithms. Built on the Spark SQL engine, Spark Streaming also
allows for incremental batch processing that results in faster processing of streamed data.
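The incremental micro-batch model behind Spark Streaming can be sketched without Spark itself: incoming records are buffered and then processed in small batches. The generator below is a hypothetical stand-in for the streaming engine, using simulated sensor readings.

```python
# A minimal sketch of micro-batch stream processing, the model Spark
# Streaming popularized: records are grouped into small batches and
# each batch is processed as a unit. Plain Python, not the Spark API.

def micro_batches(stream, batch_size):
    """Group an incoming record stream into fixed-size batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

# Simulated sensor readings arriving over time.
readings = [3, 1, 4, 1, 5, 9, 2, 6]

# Process each micro-batch, e.g. compute a per-batch average that
# could feed a live dashboard.
batch_averages = [sum(b) / len(b) for b in micro_batches(readings, 3)]
print(batch_averages)  # averages of [3,1,4], [1,5,9], [2,6]
```

A real streaming job would trigger batches on a time interval rather than a fixed count, but the incremental structure is the same.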
Predictive Analytics
“Predictive analytics is an advanced form of data analytics that attempts to answer the question,
‘What might happen next?’”
Predictive analytics is the process of using data to forecast future outcomes. The process uses data
analysis, machine learning, artificial intelligence, and statistical models to find patterns that might
predict future behavior. Organizations can use historic and current data to forecast trends and
behaviors seconds, days, or years into the future with a great deal of precision.
Data scientists use predictive models to identify correlations between different elements in selected
datasets. Once data collection is complete, a statistical model is formulated, trained, and modified
to generate predictions.
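The formulate-train-predict loop described above can be sketched with the simplest possible statistical model: an ordinary least-squares line fitted to historic observations and extrapolated forward. The sales figures are hypothetical, and real predictive analytics would use richer models and validation.

```python
# A minimal sketch of predictive modeling: fit a linear trend to
# historic data, then extrapolate to forecast a future value.
# Pure-Python ordinary least squares on hypothetical monthly sales.

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Historic monthly sales (hypothetical training data).
months = [1, 2, 3, 4, 5]
sales = [100, 120, 140, 160, 180]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # predict month 6
print(forecast)  # 200.0 for this perfectly linear series
```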
Deep Packet Inspection (DPI)
Deep packet inspection (DPI), also known as packet sniffing, is a method of examining the content
of data packets as they pass by a checkpoint on the network. With normal types of stateful packet
inspection, the device only checks the information in the packet’s header, like the destination
Internet Protocol (IP) address, source IP address, and port number. DPI examines a larger range of
metadata and data connected with each packet the device interfaces with. Under this definition of
DPI, the inspection process examines both the header and the data the packet is carrying.
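The header-versus-payload distinction can be made concrete with a short sketch. The packet layout below is simplified and hypothetical (4-byte source IP, 4-byte destination IP, 2-byte port, then payload); real DPI engines parse actual IP/TCP headers and apply far richer payload rules.

```python
import struct

# A sketch of stateful inspection vs. deep packet inspection (DPI).
# Hypothetical packet layout: 4-byte src IP, 4-byte dst IP,
# 2-byte port (big-endian), then the payload.

def parse_packet(packet: bytes):
    src, dst, port = struct.unpack("!4s4sH", packet[:10])
    return {"src": ".".join(map(str, src)),   # bytes -> dotted quad
            "dst": ".".join(map(str, dst)),
            "port": port,
            "payload": packet[10:]}

def shallow_inspect(pkt):
    # Stateful inspection: looks only at header fields.
    return pkt["port"] == 80

def deep_inspect(pkt):
    # DPI: also examines the data the packet is carrying.
    return b"forbidden" in pkt["payload"]

raw = (bytes([10, 0, 0, 1]) + bytes([192, 168, 1, 5])
       + struct.pack("!H", 80) + b"GET /forbidden HTTP/1.1")

pkt = parse_packet(raw)
print(shallow_inspect(pkt), deep_inspect(pkt))  # True True
```

The shallow check would pass any traffic on port 80; only the deep check notices what the payload actually contains, which is the point of DPI.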