
1) How can Apache Spark fit into a data application? Include specific Spark functionalities that can be applied in a data application.

Written in Scala, Spark can process data from sources such as the Hadoop Distributed File System (HDFS), NoSQL databases, and data warehouses like Apache Hive. The framework also supports in-memory processing, which improves the performance of big data analytics applications.
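A minimal sketch of how these pieces could fit together in one application is shown below. The HDFS path and the Hive table name are hypothetical, and the exact sources would depend on the application; the point is that one Spark job can read from several stores and cache intermediate results in memory for repeated analysis.

```scala
import org.apache.spark.sql.SparkSession

object DataAppSketch {
  def main(args: Array[String]): Unit = {
    // Build a SparkSession with Hive support so the application can read
    // Hive tables alongside files in HDFS.
    val spark = SparkSession.builder()
      .appName("data-application-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Read raw events from HDFS (path is hypothetical).
    val events = spark.read.json("hdfs:///data/raw/events")

    // Read a reference table managed by Apache Hive (table name is hypothetical).
    val customers = spark.table("warehouse.customers")

    // Cache the joined result in memory so repeated analytical queries
    // do not re-read the underlying storage.
    val enriched = events.join(customers, "customer_id").cache()

    enriched.groupBy("country").count().show()

    spark.stop()
  }
}
```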

2) Why is parallelism important and how does Spark parallelize tasks? Provide at least two specific examples for each.

Parallelism is important because a single machine often cannot process a large dataset in a reasonable amount of time; splitting the work across many cores or machines shortens it. Spark uses Resilient Distributed Datasets (RDDs) to perform parallel processing across a cluster or across the cores of a single machine: data is divided into partitions, and each partition is processed by a separate task. Spark also offers easy-to-use APIs for operating on large datasets in several programming languages, including APIs for transforming data and familiar DataFrame APIs for manipulating semi-structured data.
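The sketch below illustrates two ways Spark parallelizes work, run locally on four cores for simplicity (on a real cluster the master URL would point at the cluster manager). The sample numbers and words are made up for illustration: the RDD is split into four partitions processed by separate tasks, and the DataFrame aggregation is likewise broken into one task per partition.

```scala
import org.apache.spark.sql.SparkSession

object ParallelismSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parallelism-sketch")
      .master("local[4]") // 4 local cores; on a cluster this would be the cluster URL
      .getOrCreate()
    val sc = spark.sparkContext

    // Example 1: parallelize an in-memory collection into an RDD with 4 partitions;
    // each partition is processed by a separate task, potentially on a different core or node.
    val numbers = sc.parallelize(1 to 1000000, numSlices = 4)
    val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)
    println(s"Sum of squares: $sumOfSquares")

    // Example 2: a DataFrame aggregation is also split into tasks, one per partition,
    // so the groupBy below runs in parallel across the available cores.
    import spark.implicits._
    val words = Seq("spark", "rdd", "dataframe", "spark", "cluster").toDF("word")
    words.groupBy("word").count().show()

    spark.stop()
  }
}
```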

3) What is a DataFrame in Spark and how is it different from a SQL table? Provide at least two specific examples for each.

A Python/R DataFrame is a table of data with rows and named columns, much like a spreadsheet. It is stored on a single computer, so what you can do with it is limited to the resources of that one machine.

A Spark DataFrame has the same row-and-column structure, but it can span thousands of computers. When a computation would take too long on a single machine, Spark manages and coordinates the execution of tasks across a cluster, pooling those resources so they can be used as if they were one large computer.
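A small sketch of the distinction is below, using a tiny made-up sales dataset. The Spark DataFrame is created in application code and distributed across partitions rather than stored in a database; unlike a SQL table, it only becomes queryable with SQL after it is registered as a temporary view.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A Spark DataFrame: rows and named columns, partitioned across the cluster
    // (here just local cores). The data is hypothetical.
    val sales = Seq(
      ("US", 120.0),
      ("DE", 80.5),
      ("US", 45.0)
    ).toDF("country", "amount")

    // Example 1: the DataFrame API plans and executes the aggregation in parallel.
    sales.groupBy("country").sum("amount").show()

    // Example 2: registering the DataFrame as a temporary view makes it queryable
    // with SQL, without it ever living in a database like a regular SQL table.
    sales.createOrReplaceTempView("sales")
    spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country").show()

    spark.stop()
  }
}
```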
