
SCHOOL OF COMPUTING

INDIAN INSTITUTE OF INFORMATION TECHNOLOGY UNA
HIMACHAL PRADESH

Advanced Operating System


(CSPE31)

Submitted To: Mr. Sahil
Submitted By: Chiraag Mittal (18111)
Apache Spark

Apache Spark was developed by a team at UC Berkeley in 2009. Since then, Apache Spark
has seen very high adoption from top technology companies such as Google, Facebook,
Apple and Netflix, and demand keeps growing. According to one survey, the worldwide
Apache Spark market will grow at a CAGR of 67% between 2019 and 2022. Spark market
revenue is rising fast and may reach $4.2 billion by 2022, with a cumulative market
valued at $9.2 billion (2019 - 2022).

Introduction -
Spark is an Apache project advertised as “lightning-fast cluster computing”. It has a thriving
open-source community and is the most active Apache project at the moment.

Spark provides a faster and more general data processing platform. Spark lets you run
programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. In 2014, Spark
overtook Hadoop by completing the 100 TB Daytona GraySort contest 3x faster on
one-tenth the number of machines, and it also became the fastest open-source engine for
sorting a petabyte. Spark also makes it possible to write code more quickly, as you have
over 80 high-level operators at your disposal.

Another important aspect of learning Apache Spark is the interactive shell
(REPL) that it provides out of the box. Using the REPL, one can test the outcome of each
line of code without first needing to write and execute the entire job. The path to
working code is thus much shorter, and ad-hoc data analysis becomes possible.

Features of Spark -

1. Speed: According to Apache, Spark can run applications on a Hadoop cluster up to
100 times faster in memory and up to 10 times faster on disk. Spark achieves
such speed by overcoming a drawback of MapReduce, which always writes
all intermediate results to disk. Spark does not need to write intermediate
results to disk and can work in memory using a DAG, lazy evaluation, RDDs and
caching. Spark's highly optimized execution engine is what makes it so fast.
2. Fault Tolerance: Spark’s optimized execution engine not only makes it fast but is
also fault-tolerant. It achieves this using an abstraction layer called RDD (Resilient
Distributed Datasets) in combination with DAG, which is built to handle failures of
tasks or even node failures.
3. Lazy Evaluation: Spark works on lazy evaluation techniques. This means that the
processing (transformations) on Spark RDDs/Datasets is evaluated lazily,
i.e. the output RDDs/Datasets are not computed as soon as a transformation is
declared; they are materialized only when needed, i.e. when an action is performed.
The transformations just become part of the DAG, which is executed when an action
is called.
4. Multiple Language Support: Spark provides support for multiple programming
languages such as Scala, Java, Python and R, and also offers Spark SQL, which is
very similar to SQL.
5. Reusability: Spark code written once for batch processing jobs can also be reused
for stream processing, and the same code can join historical batch data with
stream data on the fly.
6. Machine Learning: MLlib is Spark's machine learning library, available out of
the box for creating ML pipelines for data analysis and predictive analytics.
7. Graph Processing: Apache Spark also supports graph processing. Using the GraphX
API, again provided out of the box, one can write graph-processing jobs and do
graph-parallel computation.
8. Stream Processing and Structured Streaming: Spark can be used for batch
processing and can also cater to stream processing use cases with
micro-batches. Spark Streaming ships with Spark, so one does not need
any other streaming tools or APIs. Spark Streaming also supports Structured
Streaming and has in-built connectors for Apache Kafka, which come in
very handy while developing streaming applications.
9. Spark SQL: Spark has amazing SQL support and has an in-built SQL optimizer.
Spark SQL features are used heavily in warehouses to build ETL pipelines.
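The lazy evaluation described in feature 3 can be illustrated with plain Python generators. This is a conceptual sketch, not the Spark API: the generator pipeline plays the role of RDD transformations, and consuming it plays the role of an action.

```python
# Conceptual sketch of Spark-style lazy evaluation using plain Python
# generators (a stand-in for RDD transformations, not the Spark API).

log = []  # records when each element is actually read

def numbers(n):
    for i in range(n):
        log.append(f"read {i}")
        yield i

def double(rdd):            # a "transformation": nothing runs yet
    return (x * 2 for x in rdd)

def keep_even(rdd):         # another lazy "transformation"
    return (x for x in rdd if x % 2 == 0)

pipeline = keep_even(double(numbers(3)))
assert log == []            # no work has been done yet

result = list(pipeline)     # the "action" triggers the whole pipeline
assert result == [0, 2, 4]
```

Just as in Spark, building the pipeline is cheap; the data is only touched when a terminal operation asks for results.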

Use Cases of Apache Spark -


As the adoption of Spark across industries continues to rise steadily, it is giving birth to
unique and varied Spark applications. These Spark applications are being successfully
implemented and executed in real-world scenarios. Following are some of the Spark
applications.

● Data Streaming :

Apache Spark is easy to use and provides a language-integrated API for stream
processing. It is also fault-tolerant, i.e., it preserves correct processing
semantics on failure without extra work and recovers lost data easily.

This technology is used to process streaming data. Spark Streaming has the
capacity to handle large additional workloads. The most common ways businesses
use it are:
● Streaming ETL
● Data enrichment
● Trigger event detection
● Complex session analysis

Let us try to understand Spark Streaming from an example.

Suppose a big retail chain company wants to get a real-time dashboard to keep a
close eye on its inventory and operations. Using this dashboard the management
should be able to track how many products are being purchased, shipped and
delivered to customers.

Spark Streaming can be an ideal fit here.


The order management system pushes order statuses to a queue (which could be Kafka),
from where the streaming process reads every minute and picks up all the orders with
their statuses. The Spark engine then processes these and emits the output status
counts. The Spark Streaming process runs like a daemon until it is killed or an error
is encountered.
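The dashboard pipeline above can be sketched in plain Python as a stand-in for Spark Streaming's micro-batch model. The order records are hypothetical, hard-coded lists standing in for what would be read from the queue each interval.

```python
from collections import Counter

# Each micro-batch is the set of order events read from the queue
# (e.g. Kafka) since the last interval; here they are hard-coded.
batches = [
    [("o1", "purchased"), ("o2", "purchased"), ("o3", "shipped")],
    [("o4", "shipped"), ("o5", "delivered")],
]

running_counts = Counter()  # state carried across micro-batches

def process_batch(batch):
    """Update and emit status counts for one interval, as the engine would."""
    running_counts.update(status for _, status in batch)
    return dict(running_counts)

snapshot = {}
for batch in batches:
    snapshot = process_batch(batch)   # the dashboard reads this each minute

assert snapshot == {"purchased": 2, "shipped": 2, "delivered": 1}
```

Real Spark Streaming would distribute this counting across executors and checkpoint the running state, but the shape of the computation, small batches folded into a running aggregate, is the same.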

● Machine Learning :

Three commonly used techniques in Machine Learning are:


● Classification: Gmail organizes mail using the labels you provide and filters
spam into a separate folder. This is how classification works.
● Clustering: Taking Google News as a sample, it categorizes news items based on
the title and the content of the news.
● Collaborative filtering: Facebook uses this to show users ads or products as per
their history, purchases, and location.

Spark with machine learning algorithms helps in performing advanced analytics,
which assists customers with their queries on sets of data. It is the Machine
Learning Library (MLlib) that holds all these components.
Machine learning capabilities further help secure your real-time data against
malicious activity.
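The collaborative-filtering idea above can be sketched with a toy user-based recommender. This is a hypothetical illustration using cosine similarity over made-up ratings, much simpler than MLlib's actual ALS-based recommender.

```python
import math

# Minimal user-based collaborative-filtering sketch (not MLlib's ALS):
# recommend items a similar user rated. All ratings are hypothetical.
ratings = {
    "alice": {"phone": 5, "case": 4, "charger": 5},
    "bob":   {"phone": 5, "case": 5},
    "carol": {"book": 4, "lamp": 3},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user):
    # find the most similar other user, then suggest their items
    # that the target user has not rated yet
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: cosine(ratings[user], ratings[u]))
    return [i for i in ratings[nearest] if i not in ratings[user]]

assert recommend("bob") == ["charger"]   # bob looks like alice, so: charger
```

A production system would factorize a huge, distributed user-item matrix instead of comparing users pairwise, but the underlying intuition, "similar users like similar items", is the same.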

● Interactive Analysis :

● Spark provides easy-to-use APIs and is a strong tool for interactive data
analysis. It is available in Python and Scala.
● MapReduce was made to handle batch processing, and SQL-on-Hadoop engines are
usually considered slow. With Spark, it is fast to run exploratory queries
against live data without sampling.
● Structured Streaming is also a new feature that helps in web analytics by
allowing customers to run interactive queries against live web-visitor data.

● Fog Computing :

● With the rise of big data analytics comes the concept of IoT (Internet of Things).
IoT embeds objects and devices with small sensors that interact with each other,
and users are making use of it in revolutionary ways.
● Fog computing is a decentralized computing infrastructure in which data, compute,
storage, and applications are located somewhere between the data source and the
cloud. It brings the advantages of the cloud closer to where data is created and
acted upon, much the way edge computing does.
● Spark is well suited to fog workloads: it runs programs up to 100 times faster in
memory and 10 times faster on disk than Hadoop, helps write apps quickly in Java,
Scala, Python, and R, and includes SQL, streaming, and sophisticated analytics
that can run anywhere (standalone, cloud, etc.).
Additional Features of Spark -
● Spark is the favourite of Developers as it allows them to write applications in Java,
Scala, Python, and even R.
● Spark is backed by an active developer community, and it is also supported by a
dedicated company — Databricks.
● Although a majority of Spark applications use HDFS as the underlying data file
storage layer, it is also compatible with other data sources like Cassandra, MySQL,
and AWS S3.
● Spark was developed on top of the Hadoop ecosystem, which allows for easy and fast
deployment of Spark.
● From being a niche technology, Spark has now become a mainstream tech, thanks to
the ever-increasing pile of data generated by the fast-growing numbers of IoT and
other connected devices.

Industries Using Apache Spark -


● Finance:

Spark is used in the finance industry across different functional and technology
domains.

A typical use case is building a data warehouse for batch processing and daily
reporting. The Spark DataFrame abstraction has been used as a generic ingestion
platform capable of ingesting data from multiple sources in different formats.

Financial services companies also use Apache Spark MLlib to create and train
models for fraud detection. Some of the banks have started using Spark as a tool for
classifying text in money transfers.
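Classifying transfer text can be illustrated with a toy keyword-scoring classifier. The categories and keywords here are hypothetical, and this is far simpler than the trained MLlib models banks would actually use, but it shows the basic shape of the task.

```python
# Toy keyword-scoring classifier for money-transfer descriptions — a
# hypothetical illustration of text classification, not a bank's model.
CATEGORIES = {
    "salary":    {"payroll", "salary", "wages"},
    "utilities": {"electric", "water", "gas", "utility"},
    "rent":      {"rent", "lease", "landlord"},
}

def classify(description):
    """Assign the category whose keyword set best overlaps the text."""
    words = set(description.lower().split())
    scores = {cat: len(words & kws) for cat, kws in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

assert classify("Monthly payroll salary deposit") == "salary"
assert classify("Transfer to landlord for rent") == "rent"
assert classify("Gift for a birthday") == "other"
```

A real pipeline would tokenize and vectorize the text (e.g. TF-IDF) and train a model such as logistic regression on labelled transfers, but the input/output contract, description in, category out, is the same.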

● HealthCare:

The healthcare industry is among the newest adopters of advanced technologies like
big data and machine learning to provide high-tech facilities to its patients.
Apache Spark is penetrating fast and is becoming the heartbeat of the latest
healthcare applications. Hospitals use Spark-enabled healthcare applications to
analyze patients' medical histories and identify possible health issues.
Another very interesting problem in hospitals is Operating Room (OR) scheduling:
it is difficult to schedule and predict available OR block times. This leads to
empty, unused operating rooms and longer waiting times for patients awaiting
their procedures.
Let’s see a use case. A basic surgical procedure costs around $15-20 per minute,
so the OR is a scarce and valuable resource that needs to be utilized carefully
and optimally. OR efficiency differs depending on OR staffing and allocation, not
the workload, so a loss of efficiency means a loss for the patient. Time
management is therefore of the utmost importance here.

Spark and MLlib solve the problem through a predictive model that identifies
available OR time two weeks in advance, allowing hospitals to confirm waitlist
cases two weeks ahead instead of when blocks are normally released, four days out.
This OR scheduling can be done by taking the historical data and running a
linear regression model with multiple variables.
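A minimal version of that regression can be sketched with ordinary least squares on a single variable. The historical numbers below are invented for illustration; an MLlib model would fit many more variables (surgeon, day of week, procedure type) over far more data.

```python
# Ordinary least-squares fit on hypothetical historical OR data — a
# single-variable stand-in for a multi-variable Spark MLlib regression.
# x: booked block minutes, y: minutes actually used.
x = [480, 480, 360, 360, 240, 240]
y = [430, 410, 330, 310, 220, 200]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form simple linear regression: slope and intercept.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

def predict_used_minutes(booked):
    return intercept + slope * booked

# Predicted unused time in a 480-minute block — time that could be
# released to waitlist cases two weeks in advance.
unused = 480 - predict_used_minutes(480)
```

On this toy data the fit predicts roughly an hour of a full-day block going unused, which is exactly the kind of signal the scheduling model surfaces ahead of time.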

● Retail:

Big retail chains face the usual problem of optimizing their supply chain to
minimize cost and wastage, improve customer service, and gain insights into
customers' shopping behaviour so as to serve them better and, in the process,
optimize profit.
To achieve these goals, retail companies face many challenges, such as keeping
inventory up to date based on sales, and predicting sales and inventory during
promotional events and sale seasons. They also need to track customers' orders
in transit and on delivery.
All these pose huge technical challenges. Apache Spark and MLlib are being used by
a lot of these companies to capture real-time sales and invoice data, ingest it and
then figure out the inventory. Spark MLlib analytics and predictive models are being
used to predict sales during promotions and sale seasons to match the inventory and
be ready for the event. The historical data on customers’ buying behaviour is also
used to provide the customer with personalized suggestions and improve customer
satisfaction.

Benefits of using Apache Spark -

Companies across many industries have been benefiting from Apache Spark,
for different reasons:
1. Speed of execution
2. Multi-language support
3. Machine learning library
4. Graph processing library
5. Batch processing as well as Stream & Structured stream processing

Apache Spark is beneficial for small as well as large enterprises. Spark offers a
complete solution to many common problems, such as ETL and warehousing.
Conclusion -

Apache Spark has the capability to process huge amounts of data very efficiently,
with high throughput. It can solve problems in batch processing and near-real-time
processing, can be used to implement a lambda architecture, and supports Structured
Streaming. It can also solve many complex data analytics and predictive analytics
problems with the help of the MLlib component, which comes out of the box.
