Scalable
Apache Spark is an open source framework intended to provide
parallelized data processing at scale. Its high-level functions can be
used to carry out different data processing tasks on datasets of diverse
sizes and schemas. This is accomplished by distributing workloads
across anywhere from several servers to thousands of machines,
running on a cluster of computers orchestrated by a cluster manager
like Mesos or Hadoop YARN. Hardware resources can therefore be
increased linearly with every new computer added. It is worth
clarifying that adding hardware to the cluster does not necessarily
translate into a linear increase in computing performance, and hence a
linear reduction in processing time, because internal cluster
management, data transfer, network traffic, and so on also consume
resources, subtracting them from Spark's effective computing
capabilities. Although running in cluster mode leverages Spark's full
distributed capacity, Spark can also be run locally on a single
computer, in what is called local mode.
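As a brief, hedged illustration of the difference (the application name below is made up), the master URL is what selects local execution versus a cluster manager; a minimal Scala sketch might look like this:

import org.apache.spark.sql.SparkSession

// Local mode: the driver and the executors run inside a single JVM on this
// machine; local[*] uses all available cores.
val spark = SparkSession.builder()
  .appName("local-mode-example")
  .master("local[*]")
  .getOrCreate()

// In cluster mode the master is usually not hard-coded; it is supplied at
// submission time (e.g., spark-submit --master yarn), and the cluster manager
// (YARN, Mesos, Kubernetes, or Spark standalone) schedules the executors.
println(spark.sparkContext.master)

spark.stop()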
If you have searched for information about Spark before, you have
probably read something like "Spark runs on commodity hardware." It
is important to understand the term "commodity hardware." In the
context of big data, commodity hardware does not denote low quality,
but rather equipment based on market standards, which is
general-purpose, widely available, and hence affordable, as opposed to
purpose-built computers.
Ease of Use
Spark makes the life of data engineers and data scientists operating on
large datasets easier. Spark provides a single unified engine and API for
diverse use cases such as streaming, batch, or interactive data
processing. These tools make it easy to cope with diverse scenarios
like ETL processes, machine learning, or graphs and graph-parallel
computation. Spark also provides about a hundred operators for data
transformation and the notion of dataframes for manipulating semi-
structured data.
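As a small illustration of that unified API (a hedged sketch; the column names and values are invented), the same DataFrame abstraction and built-in operators serve interactive exploration as well as batch jobs:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("dataframe-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A tiny in-memory dataset standing in for a real semi-structured source.
val sales = Seq(
  ("2023-01-01", "books", 120.0),
  ("2023-01-01", "music", 35.5),
  ("2023-01-02", "books", 80.0)
).toDF("date", "category", "amount")

// A couple of the many built-in transformation operators.
sales
  .groupBy($"category")
  .agg(sum($"amount").alias("total_amount"))
  .orderBy(desc("total_amount"))
  .show()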
Figure 1-2 Spark communication architecture with worker nodes and executors
Spark Core
Spark Core is the bedrock on top of which in-memory computing, fault
tolerance, and parallel computing are built. The Core also provides
data abstraction via RDDs and, together with the cluster manager,
arranges data over the different nodes of the cluster. The high-level
libraries (Spark SQL, Streaming, MLlib for machine learning, and
GraphX for graph data processing) also run on top of the Core.
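For instance, a minimal RDD computation run in the Spark shell (a sketch with arbitrary numbers) shows the Core splitting a collection into partitions and processing them in parallel across the available executors:

// sc (the SparkContext) is created automatically by the Spark shell.
val numbers = sc.parallelize(1 to 1000, numSlices = 8)  // 8 partitions

// Transformations such as map are lazy; the action reduce triggers the job.
val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)

println(s"Sum of squares: $sumOfSquares")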
Spark APIs
Spark incorporates a series of application programming interfaces
(APIs) for different programming languages (SQL, Scala, Java, Python,
and R), paving the way for the adoption of Spark by a great variety of
professionals with different development, data science, and data
engineering backgrounds. For example, Spark SQL permits interaction
with RDDs as if we were submitting SQL queries to a traditional
relational database. This feature has made it easier for many
transactional database administrators and developers to embrace
Apache Spark.
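For example, a hedged sketch of that SQL-style interaction (the table and column names are invented) could look like this in the Spark shell:

import spark.implicits._  // spark (the SparkSession) is provided by the shell

val employees = Seq(
  ("Alice", "Engineering", 95000),
  ("Bob", "Marketing", 70000),
  ("Carol", "Engineering", 88000)
).toDF("name", "department", "salary")

// Register the DataFrame so it can be queried with plain SQL.
employees.createOrReplaceTempView("employees")

spark.sql(
  """SELECT department, AVG(salary) AS avg_salary
    |FROM employees
    |GROUP BY department
    |ORDER BY avg_salary DESC""".stripMargin
).show()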
Let’s now review each of the four libraries in detail.
Spark Streaming
Spark Structured Streaming is a high-level library built on top of the
core Spark SQL engine. Structured Streaming enables Spark's fault-
tolerant, real-time processing of unbounded data streams without
users having to think about how the streaming itself takes place. Spark
Structured Streaming provides fault-tolerant, fast, end-to-end,
exactly-once, at-scale stream processing, and it lets you express
streaming computations in the same fashion as batch computations
are expressed on static data. This is achieved by executing the
streaming process incrementally and continuously and updating the
outputs as the incoming data is ingested.
With Spark 2.3, a new low-latency processing mode called continuous
processing was introduced, achieving end-to-end latencies as low as
1 ms while ensuring at-least-once2 message delivery. The at-least-once
concept is depicted in Figure 1-5. By default, Structured Streaming
internally processes the information as micro-batches, meaning data is
processed as a series of tiny batch jobs.
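A minimal Structured Streaming sketch in the spirit of the classic word-count example illustrates how a streaming query is written almost exactly like a batch one (the host and port are placeholders; you could feed it text locally with a tool such as nc -lk 9999):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("structured-streaming-wordcount")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Unbounded input: lines of text arriving on a TCP socket.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The same DataFrame operators used for batch data.
val wordCounts = lines
  .select(explode(split($"value", " ")).alias("word"))
  .groupBy($"word")
  .count()

// Incrementally update and print the running counts as new data is ingested.
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()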
Spark GraphX
GraphX is a high-level Spark library for graphs and graph-parallel
computation, designed to solve graph problems. GraphX extends the
Spark RDD capabilities by introducing this new graph abstraction to
support graph computation, and it includes a collection of graph
algorithms and builders to optimize graph analytics.
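As a brief, hedged sketch of that abstraction (the vertices and edges are made up), a property graph can be built from two RDDs in the Spark shell and analyzed with one of the bundled algorithms, such as PageRank:

import org.apache.spark.graphx.{Edge, Graph}

// Vertices: (id, user name); edges: "follows" relationships between users.
val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "follows")
))

val graph = Graph(users, follows)

// Rank vertices by importance and attach the user names to the results.
val ranks = graph.pageRank(tol = 0.001).vertices
ranks.join(users)
  .map { case (_, (rank, name)) => (name, rank) }
  .collect()
  .foreach(println)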
The Apache Spark ecosystem described in this section is portrayed
in Figure 1-6.
Figure 1-6 The Apache Spark ecosystem
In Figure 1-7 we can see the Apache Spark ecosystem of connectors.
Figure 1-8 Hierarchical relationship between data, data analytics, and data insight extraction
1.6 Summary
In this chapter we briefly looked at the Apache Spark architecture,
implementation, and ecosystem of applications. We also covered the
two different types of data processing Spark can deal with, batch and
streaming, and the main differences between them. In the next chapter,
we are going to go through the Spark setup process, the Spark
application concept, and the two different types of Apache Spark RDD
operations: transformations and actions.
Footnotes
1 www.databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
2 With the at-least-once message delivery semantic, a message can be delivered
more than once; however, no message can be lost.
3 www.statista.com/statistics/871513/worldwide-data-created/
Now that you have an understanding of what Spark is and how it works, we can get you set up to start using
it. In this chapter, I’ll provide download and installation instructions and cover Spark command-line utilities
in detail. I’ll also review Spark application concepts, as well as transformations, actions, immutability, and
lazy evaluation.
This will download the file spark-3.3.0-bin-hadoop3.tgz (or a similarly named file, depending on the version you chose), a
compressed archive that contains all the binaries you will need to execute Spark in local mode on your
local computer or laptop.
What is great about setting up Apache Spark in local mode is that you don't need to do much work. We
basically need to install Java and set some environment variables. Let's see how to do it in several
environments.
First, check whether Java is already installed:
$ java -version
If Java is already installed on your system, you will see a message similar to the following:
$ java -version
java version "18.0.2" 2022-07-19
Java(TM) SE Runtime Environment (build 18.0.2+9-61)
Java HotSpot(TM) 64-Bit Server VM (build 18.0.2+9-61, mixed mode, sharing)
Your Java version may be different; in this case, it is Java 18.
If you don’t have Java installed
1. Open a browser window, and navigate to the Java download page as seen in Figure 2-2.
2. Click the Java file of your choice and save the file to a location (e.g., /home/<user>/Downloads).
$ cd PATH/TO/spark-3.3.0-bin-hadoop3.tgz_location
and execute the following command to extract the archive (adjust the file name if your version differs):
$ tar -xzf spark-3.3.0-bin-hadoop3.tgz
Then, as root, move the extracted directory to /usr/local/spark:
$ su -
Password:
$ cd /home/<user>/Downloads/
$ mv spark-3.3.0-bin-hadoop3 /usr/local/spark
$ exit
Add the following lines to your ~/.bashrc file to set the SPARK_HOME environment variable and add Spark's binaries to your PATH:
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
Then source the ~/.bashrc file to update the environment variables in your current session:
$ source ~/.bashrc
Now launch the Spark shell:
$ $SPARK_HOME/bin/spark-shell
If Spark is installed successfully, you will see output similar to the following:
Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 18.0.2)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
You can try the installation a bit further by taking advantage of the README.md file that is present in the
$SPARK_HOME directory:
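As a minimal sketch (assuming the shell was started from the $SPARK_HOME directory so the relative path resolves), you could load the file into an RDD and run a couple of quick checks:

scala> val readme = sc.textFile("README.md")
scala> readme.count()
scala> readme.filter(line => line.contains("Spark")).count()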
The Spark context Web UI is available by typing the following URL in your browser:
http://localhost:4040
There, you can see the jobs, stages, storage space, and executors that are used for your small application.
The result can be seen in Figure 2-3.
Figure 2-3 Apache Spark Web UI showing jobs, stages, storage, environment, and executors used for the application running on
the Spark shell
java -version
You should see the same signature you copied earlier; if not, something went wrong, and you should try
downloading the file again.
b. SPARK_HOME: \PATH\TO\YOUR\SPARK-DIRECTORY
c. HADOOP_HOME: \PATH\TO\YOUR\HADOOP-DIRECTORY
You will have to repeat the previous step twice, to introduce the three variables.
5. Then, click your Edit button, Figure 2-6 left, to edit your PATH.
b. %SPARK_HOME%\bin
c. %HADOOP_HOME%\bin
Configure system and environment variables for all the computer’s users.
7. If you want to add those variables for all the users of your computer, apart from the user performing this
installation, you should repeat all the previous steps for the System variables, Figure 2-6 bottom-left,
clicking the New button to add the environment variables and then clicking the Edit button to add them
to the system PATH, as you did before.
To reduce the verbosity of the shell output, open Spark's logging configuration file (conf/log4j2.properties, created by copying the provided conf/log4j2.properties.template), locate the line
rootLogger.level = info
and change it to
rootLogger.level = ERROR
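Back in the Spark shell, the RDD used below is created with the usual textFile call. As a minimal sketch (assuming README.md is in the directory from which you started the shell; adjust the path otherwise):

scala> val file = sc.textFile("README.md")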
This will create an RDD. You can view the first lines of the file by using the next instruction:
file.take(10).foreach(println)