
PySpark

About PySpark
 PySpark is the Python library for Apache Spark, an open-source big data processing and analytics engine.
 Apache Spark is designed for distributed data processing, enabling the processing of large-scale data
across a cluster of computers, making it suitable for big data applications.
 The main goal of Spark is to provide a unified and easy-to-use platform for processing and analyzing
data, regardless of whether the workload is batch processing (processing data in fixed-size chunks) or stream
processing (processing data in real time).
 PySpark allows Python developers to interact with Spark and perform data processing tasks using
Python code. It provides an API in Python that integrates well with Spark's distributed processing
capabilities.
 PySpark provides support for various data sources, including Hadoop Distributed File System (HDFS),
Apache Cassandra, Apache HBase, Amazon S3, and more.

Components of PySpark
The main components of PySpark include:
 Spark Core: The core of Spark that provides basic functionality for distributed data processing, including
data parallelism, fault tolerance, and distributed task scheduling.
 Spark SQL: A module that provides a SQL-like interface for querying structured and semi-structured data,
making it easy to work with structured data using SQL commands.
 Spark Streaming: A module for processing real-time data streams, allowing you to process and analyze data
in real-time as it arrives.
 MLlib (Machine Learning Library): A library in Spark that provides a set of machine learning algorithms
and tools for tasks like classification, regression, clustering, etc.
 GraphX: A library for graph processing and analysis, useful for tasks like social network analysis,
recommendations, and more.

Using PySpark
 To use PySpark, you need to have Apache Spark installed on your cluster, and you interact with it using
the Python API provided by PySpark.
 PySpark code is typically written in Python, and it's then executed on the Spark cluster to perform
distributed data processing tasks.
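
A minimal sketch of starting a PySpark application (the application name "example-app" is illustrative): since Spark 2.0, SparkSession is the unified entry point, and the lower-level SparkContext is available from it.

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the unified entry point since Spark 2.0
spark = SparkSession.builder.appName("example-app").getOrCreate()

# The lower-level SparkContext is available from the session for RDD-level work
sc = spark.sparkContext
print(spark.version)

spark.stop()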

PySpark Applications
 PySpark is widely used in the data science and machine learning community.
 It supports data science libraries written in Python, including NumPy and TensorFlow.
 It is chosen for its efficient processing of large datasets.
 It is used by many organizations, including Walmart, Trivago, Sanofi, and Runtastic.
PySpark Features

 In-Memory Computation
 Spark stores data in the RAM of the cluster's servers, which allows quick access and in turn accelerates
analytics.
 This is the process of storing and processing data entirely in memory, rather than relying heavily on
disk-based storage and retrieval.
 By reducing disk I/O and enabling faster access to data, it significantly improves the performance and
efficiency of data processing tasks.

 Lazy Evaluation
 Lazy evaluation is an evaluation strategy that delays the evaluation of an expression until its value is
needed.
 In practice, this means you can apply as many TRANSFORMATIONs as you want, but Spark will
not start executing them until an ACTION is called (see the sketch after this feature list).
 Fault Tolerance
 The capability to keep operating and to recover lost data after a failure occurs.
 Immutability
 Immutable Resilient Distributed Datasets (RDDs)
 RDDs in PySpark are designed to be immutable, meaning they cannot be modified once created.
 Instead, transformations on RDDs create new RDDs.
 Immutability ensures data integrity and consistency throughout the processing pipeline.
 Partitioning
 Partitioning refers to the division of a large dataset into smaller, more manageable chunks called
partitions
 Partitioning enables parallel and efficient data processing across multiple nodes.
 Cache and persistence
Persistence
 The ability to persist RDDs or DataFrames in memory or on disk storage so they can be reused.
Caching
 Caching stores the data partitions in the cluster's memory or disk, making it readily available for
subsequent actions or transformations without recomputation.
 Built-in optimization when using DataFrames
 DataFrame operations are automatically optimized before execution (by Spark SQL's Catalyst optimizer).
 This improves performance, minimizes resource utilization, and enhances overall efficiency.
 Can be used with many cluster managers (Spark Standalone, YARN, Mesos, Kubernetes, etc.)
 PySpark's compatibility with multiple cluster managers allows for flexible deployment options and
integration with existing infrastructure
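
A small sketch illustrating lazy evaluation, caching, and partitioning together (it assumes an existing SparkSession named spark; the numbers are illustrative):

# Transformations build a plan lazily; nothing runs until an action is called
rdd = spark.sparkContext.parallelize(range(1, 1001), numSlices=8)
squares = rdd.map(lambda x: x * x)              # transformation: no job yet
evens = squares.filter(lambda x: x % 2 == 0)    # still lazy

evens.cache()                    # mark the RDD for in-memory reuse
print(evens.count())             # action: triggers execution and fills the cache
print(evens.take(5))             # served from the cached partitions
print(evens.getNumPartitions())  # 8 partitions processed in parallel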

Advantages
 PySpark is a general-purpose, in-memory, distributed processing engine that allows you to process data
efficiently in a distributed fashion
 Applications running on PySpark can be up to 100x faster than traditional disk-based systems such as
Hadoop MapReduce.
 PySpark is well suited for data ingestion pipelines:
o the end-to-end workflows that handle the process of collecting, preparing, and loading data from
various sources into a target storage or processing system
 Using PySpark we can process data from Hadoop HDFS, AWS S3, and many file systems.
 PySpark can also process real-time data using Spark Streaming and Kafka. With PySpark Streaming you
can stream files from the file system as well as data from a socket.
 PySpark natively has machine learning and graph libraries.
Or, in summary:
 Fast, distributed, in-memory data processing.
 Applications up to 100x faster than traditional disk-based systems.
 Powerful data ingestion capabilities for collecting, preparing, and loading data from various sources into a
target storage or processing system
 Processing of various data sources from Hadoop HDFS, AWS S3, and many file systems.
 Real-time data streaming with PySpark.
 Built-in machine learning and graph libraries.
Difference between Spark and PySpark
 PySpark is a Python API for Apache Spark, while Spark is an open-source big data processing framework
written in Scala. The main differences between PySpark and Spark are:
 PySpark is written in Python, while Spark is written in Scala.
 PySpark is easier to use as it has a more user-friendly interface, while Spark requires more expertise in
programming.
 PySpark can be slower than Spark because of the overhead introduced by the Python interpreter, while
Spark can provide better performance due to its native Scala implementation.
 PySpark has access to some, but not all, of Spark's libraries, while Spark has a rich set of libraries for data
processing.
 Spark has a larger community of users and contributors than PySpark.

PySpark Architecture

 Apache Spark works in a master-slave architecture where the master is called “Driver” and slaves are
called “Workers”.
 When you run a Spark application, Spark Driver creates a context that is an entry point to your
application, and all operations (transformations and actions) are executed on worker nodes, and the
resources are managed by Cluster Manager
 Cluster Manager Types
 Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
 Apache Mesos – a cluster manager that can also run Hadoop MapReduce and PySpark
applications.
 Hadoop YARN – the resource manager in Hadoop 2; this is the most commonly used cluster manager.
 Kubernetes – an open-source system for automating deployment, scaling, and management of
containerized applications.
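
The cluster manager is selected through the master URL when the application is created. A hedged sketch (the host names are placeholders):

from pyspark.sql import SparkSession

# "local[*]" runs Spark locally; "yarn", "spark://host:7077", or
# "k8s://https://host:443" would select YARN, Standalone, or Kubernetes instead
spark = (SparkSession.builder
         .appName("cluster-manager-demo")
         .master("local[*]")
         .getOrCreate())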
PySpark Modules & Packages
 PySpark RDD (pyspark.RDD)
o SIGNATURE
class pyspark.RDD(jrdd: JavaObject, ctx: SparkContext, jrdd_deserializer: pyspark.serializers.Serializer =
AutoBatchedSerializer(CloudPickleSerializer()))
o jrdd: This parameter is the Java representation of the RDD (JavaObject) that the PySpark RDD
wraps. RDDs are implemented in Scala (the primary language of Spark), and PySpark provides a
Python API to interact with these Scala-based RDDs.
o ctx: This parameter represents the SparkContext object, which serves as the entry point to any Spark
functionality in PySpark. The SparkContext manages the configuration of the Spark application and
coordinates the execution of tasks across the cluster.
o jrdd_deserializer: This parameter represents the deserializer used for the RDD. It specifies how the
data within the RDD should be serialized and deserialized. The default deserializer is
AutoBatchedSerializer(CloudPickleSerializer()), which serializes data using CloudPickleSerializer
and supports automatic batching for more efficient serialization.
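
In practice you rarely call this constructor yourself; RDDs are obtained from a SparkContext. A minimal sketch (the file path is hypothetical):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
numbers = sc.parallelize([1, 2, 3, 4, 5])        # RDD from a local collection
lines = sc.textFile("data/input.txt")            # RDD from a text file (hypothetical path)
print(numbers.map(lambda x: x * 2).collect())    # [2, 4, 6, 8, 10]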

 PySpark DataFrame and SQL (pyspark.sql)


o SIGNATURE
class pyspark.sql.DataFrame(jdf, sql_ctx)
o This constructor is used to create an instance of the DataFrame class, which represents a distributed
collection of data organized into named columns.
o jdf: This parameter represents the underlying Java DataFrame (jdf) that the PySpark DataFrame
wraps. PySpark provides a Python API to interact with DataFrames, but the actual implementation is
in Scala. The jdf parameter is the Java representation of the DataFrame that the PySpark DataFrame
object wraps.
o sql_ctx: This parameter represents the Spark SQL Context object. The Spark SQL Context is the
entry point for working with structured data in Spark, and it provides the ability to execute SQL
queries and access DataFrames. In newer versions of Spark (2.0 and later), SparkSession is
recommended instead of SQLContext. A SparkSession combines the functionalities of SparkContext,
SQLContext, HiveContext, and StreamingContext into a single entry point.
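
Likewise, DataFrames are normally created through a SparkSession rather than this constructor. A minimal sketch with inline data (the column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

df.printSchema()
df.createOrReplaceTempView("people")                      # expose to SQL
spark.sql("SELECT name FROM people WHERE id = 1").show()  # query with SQL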

 PySpark Streaming (pyspark.streaming)


o constructor is used to create an instance of the StreamingContext class, which is the entry point for
all Spark Streaming functionality in PySpark.
o SIGNATURE
class pyspark.streaming.StreamingContext(sparkContext, batchDuration=None, jssc=None)
o sparkContext: This parameter represents the SparkContext object, which serves as the entry point
to any Spark functionality in PySpark.
o batchDuration: This parameter specifies the time interval at which Spark Streaming will receive
and process data. It defines the size of each micro-batch in the streaming context. If set to None, you
can later set the batch duration using the ssc.remember() method.
o jssc: This parameter represents the underlying Java StreamingContext (jssc) that the PySpark
StreamingContext wraps.
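
A sketch of the legacy DStream API: a StreamingContext with a 10-second batch interval reading from a socket (the host and port are placeholders; newer applications typically use Structured Streaming instead):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext.getOrCreate()
ssc = StreamingContext(sc, batchDuration=10)     # micro-batches every 10 seconds

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
lines.map(lambda line: len(line)).pprint()       # print each batch's results

ssc.start()
ssc.awaitTermination()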

 PySpark MLlib (pyspark.ml, pyspark.mllib)


o In Apache Spark, there are two libraries for machine learning: pyspark.ml and pyspark.mllib. These
libraries provide machine learning functionality to perform various tasks such as classification,
regression, clustering, and collaborative filtering in distributed environments.
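
A hedged sketch of the DataFrame-based pyspark.ml API: assemble feature columns into a vector and fit a logistic regression model (the data and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-demo").getOrCreate()
df = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 5.0, 1.0), (2.0, 9.0, 1.0)],
    ["f1", "f2", "label"])

# Combine the raw columns into the single "features" vector column MLlib expects
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(train).select("label", "prediction").show()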
 PySpark GraphFrames (GraphFrames)
o PySpark GraphFrames is a library for Apache Spark that provides support for processing and
analyzing graphs and networks in a distributed manner.
o GraphFrames extends the capabilities of PySpark by introducing specialized graph data structures
and algorithms, making it easier to work with graph data in Spark-based applications.
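
A sketch assuming the external graphframes package is installed (it is distributed separately from Spark) and an existing SparkSession named spark; the vertex and edge data are illustrative:

from graphframes import GraphFrame

# Vertices need an "id" column; edges need "src" and "dst" columns
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()   # a basic graph statistic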
 PySpark Resource (pyspark.resource) – new in PySpark 3.0

Resilient Distributed Dataset (RDD)


 Resilient Distributed Dataset [RDD]
 Fundamental data structure of PySpark
 Schema-less data structures that can handle both structured and unstructured data.
 Helps a programmer perform in-memory computations on large clusters.
 fault-tolerant, immutable distributed collections of objects, which means once you create an RDD you
cannot change it.
 Each dataset in RDD is divided into logical partitions, which can be computed on different nodes of the
cluster.
 Resilient: RDDs are fault-tolerant, meaning they can recover from failures. If a partition is lost, it can be
recomputed using the lineage information stored with the RDD.
 Distributed: RDDs are partitioned across multiple nodes in a cluster.

What are RDDs?

 RDDs are the elements that run and work on multiple nodes to perform parallel processing in a cluster.
 RDDs are immutable, meaning you cannot change them once you create an RDD.
 These are fault-tolerant, so they automatically recover in case of failure.
 You can apply multiple operations on these RDDs to achieve a particular task
 There are two ways to apply operations −
o Transformation
o Actions
 Transformations are applied on an RDD to produce another RDD.
 Actions are performed on an RDD to produce a non-RDD value.

Transformations
 An operation that takes an RDD as input and produces another RDD as output.
 Once a transformation is applied to an RDD, it returns a new RDD; the original RDD remains unchanged,
which is why RDDs are immutable.
 Applying transformations builds a Directed Acyclic Graph (DAG) of computations, which is executed only
when an action is applied.
 This is why transformations are said to be lazily evaluated.
 Transformations are the operations used to create a new RDD. They follow the principle of lazy
evaluation (execution will not start until an action is triggered). A few common transformations are given
below (see the sketch after this list):
o map, flatMap, filter, distinct, reduceByKey, mapPartitions, sortBy
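
A minimal sketch of chaining transformations (it assumes an existing SparkContext named sc; no Spark job is submitted by these lines alone):

words = sc.parallelize(["spark", "pyspark", "spark", "rdd"])
pairs = words.map(lambda w: (w, 1))             # new RDD, original unchanged
counts = pairs.reduceByKey(lambda a, b: a + b)  # still lazy: only the DAG grows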

Actions
 An operation that is applied on an RDD to produce a non-RDD value.
 Actions are applied on a resultant RDD and produce a non-RDD value, thus triggering the lazily
defined transformations of the RDD.
 Actions are operations that trigger the execution of transformations on RDDs or DataFrames and return
results to the driver program or save data to external storage.
 Actions are the processes applied on an RDD that make Apache Spark perform the computation and
pass the result back to the driver. A few common actions are listed below (see the sketch after this list):
o collect, collectAsMap, reduce, countByKey/countByValue, take, first
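
Continuing the sketch above, an action forces the DAG to execute and returns plain Python values to the driver:

print(counts.collect())   # e.g. [('spark', 2), ('pyspark', 1), ('rdd', 1)]
print(counts.count())     # 3
print(words.first())      # 'spark'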

o PySpark RDD Operations
o Actions in PySpark RDDs
o Transformations in PySpark RDDs
o PySpark Pair RDD Operations
o Transformations in Pair RDDs
o Actions in Pair RDDs

 Transformation Operations:

o map(func): Applies a function to each element of the RDD and returns a new RDD with the results.
o filter(func): Returns a new RDD containing only the elements that satisfy the given predicate function.
o flatMap(func): Similar to map, but each input item can be mapped to zero or more output items.
o distinct(): Returns a new RDD with distinct elements.
o union(otherRDD): Returns a new RDD containing the elements from both RDDs.
o intersection(otherRDD): Returns a new RDD containing the common elements between two RDDs.
o subtract(otherRDD): Returns a new RDD with elements from the source RDD that are not present in the
other RDD.
o groupByKey(): Groups the elements of the RDD by key, creating a new RDD of (key, values) pairs.
o reduceByKey(func): Performs a reduce operation on the values of each key in the RDD.
o sortByKey(ascending=True) or sortByKey(ascending=False): Sorts the RDD by key in either
ascending or descending order.
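
A short worked example of a few of these transformations (it assumes an existing SparkContext named sc; element order in the outputs may vary):

a = sc.parallelize([1, 2, 2, 3, 4])
b = sc.parallelize([3, 4, 5])

print(a.distinct().collect())        # [1, 2, 3, 4]
print(a.union(b).collect())          # [1, 2, 2, 3, 4, 3, 4, 5]
print(a.intersection(b).collect())   # [3, 4]
print(a.subtract(b).collect())       # [1, 2, 2]

kv = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(kv.reduceByKey(lambda x, y: x + y).sortByKey().collect())  # [('a', 4), ('b', 2)]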

 Action Operations
RDDs
o collect(): Returns all elements of the RDD as a list to the driver program. Use with caution on large
datasets as it brings all data to the driver.
o count(): Returns the number of elements in the RDD.
o first(): Returns the first element in the RDD.
o take(n): Returns the first n elements of the RDD as a list.
o top(n): Returns the top n elements of the RDD based on their natural ordering.
o reduce(func): Aggregates the elements of the RDD using a function func.
o foreach(func): Applies a function func to each element of the RDD (useful for side effects).
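
A quick demonstration of these RDD actions (it assumes an existing SparkContext named sc):

nums = sc.parallelize([5, 3, 8, 1, 9])

print(nums.count())                     # 5
print(nums.first())                     # 5
print(nums.take(3))                     # [5, 3, 8]
print(nums.top(2))                      # [9, 8]
print(nums.reduce(lambda x, y: x + y))  # 26
nums.foreach(lambda x: None)            # runs on the executors; returns nothing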

DataFrames
o show(n=20, truncate=True): Prints the first n rows of the DataFrame in tabular form.
o count(): Returns the number of rows in the DataFrame.
o first(): Returns the first row of the DataFrame as a Row object.
o take(n): Returns the first n rows of the DataFrame as a list of Row objects.
o collect(): Returns all rows of the DataFrame as a list of Row objects to the driver program. Use with
caution on large datasets as it brings all data to the driver.
o head(n=1): Returns the first n rows of the DataFrame as a list of Row objects.
o describe(*cols): Computes statistics (count, mean, standard deviation, min, and max) for specified
columns.
o printSchema(): Prints the schema of the DataFrame.
o write: Returns a DataFrameWriter used to save the DataFrame to external storage (e.g., Parquet, CSV,
JSON).
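
A quick demonstration of these DataFrame operations (it assumes an existing SparkSession named spark; the output path is illustrative):

df = spark.createDataFrame([(1, "Alice", 29.0), (2, "Bob", 34.0)],
                           ["id", "name", "age"])

df.show(truncate=False)
df.printSchema()
df.describe("age").show()
print(df.count(), df.first(), df.take(1))
df.write.mode("overwrite").parquet("output/people.parquet")  # hypothetical path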
