About PySpark
PySpark is the Python library for Apache Spark, an open-source big data processing and analytics engine.
Apache Spark is designed for distributed data processing, enabling the processing of large-scale data
across a cluster of computers, making it suitable for big data applications.
The main goal of Spark is to provide a unified, easy-to-use platform for processing and analyzing
data, whether it arrives as batches (fixed-size chunks processed together) or as streams (data
processed in real time as it arrives).
PySpark allows Python developers to interact with Spark and perform data processing tasks using
Python code. It provides an API in Python that integrates well with Spark's distributed processing
capabilities.
PySpark provides support for various data sources, including Hadoop Distributed File System (HDFS),
Apache Cassandra, Apache HBase, Amazon S3, and more.
Components of PySpark
The main components of PySpark include:
Spark Core: The core of Spark that provides basic functionality for distributed data processing, including
data parallelism, fault tolerance, and distributed task scheduling.
Spark SQL: A module that provides a SQL-like interface for querying structured and semi-structured data,
making it easy to work with structured data using SQL commands.
Spark Streaming: A module for processing real-time data streams, allowing you to process and analyze data
in real-time as it arrives.
MLlib (Machine Learning Library): A library in Spark that provides a set of machine learning algorithms
and tools for tasks like classification, regression, clustering, etc.
GraphX: A library for graph processing and analysis, useful for tasks like social network analysis,
recommendations, and more.
Using PySpark
To use PySpark, you need to have Apache Spark installed on your cluster, and you interact with it using
the Python API provided by PySpark.
PySpark code is typically written in Python, and it's then executed on the Spark cluster to perform
distributed data processing tasks.
PySpark Applications
PySpark is widely used in the data science and machine learning community.
It works alongside Python data science libraries such as NumPy and TensorFlow.
It is chosen for its efficient processing of large datasets.
It is used by many organizations, including Walmart, Trivago, Sanofi, and Runtastic.
PySpark Features
In-Memory Computation
Spark stores the data in the RAM of servers which allows quick access and in turn accelerates the speed
of analytics.
In-memory computation is the practice of storing and processing data entirely in memory, rather than
relying heavily on disk-based storage and retrieval.
By reducing disk I/O and enabling faster access to data, it significantly improves the performance and
efficiency of data processing tasks.
Lazy Evaluation
Lazy Evaluation is an evaluation strategy that delays the evaluation of an expression until its value is
needed
Lazy evaluation means that you can apply as many TRANSFORMATIONs as you want, but Spark will
not start executing the process until an ACTION is called.
Fault Tolerance
the capability to keep operating and to recover lost data after a failure occurs
Immutability
Immutable Resilient Distributed Datasets (RDDs)
RDDs in PySpark are designed to be immutable, meaning they cannot be modified once created.
Instead, transformations on RDDs create new RDDs.
immutability ensures data integrity and consistency throughout the processing pipeline
Partitioning
Partitioning refers to the division of a large dataset into smaller, more manageable chunks called
partitions
Partitioning enables parallel and efficient data processing across multiple nodes.
Cache and persistence
Persistence
ability to cache RDD or DataFrames in memory or disk storage
Caching
Caching stores the data partitions in the cluster's memory or disk, making it readily available for
subsequent actions or transformations without recomputation.
In-built optimization when using DataFrames
When you use DataFrames, Spark automatically optimizes the execution of data processing
operations to improve performance, minimize resource utilization, and enhance overall efficiency.
Can be used with many cluster managers (Standalone, YARN, Mesos, Kubernetes, etc.)
PySpark's compatibility with multiple cluster managers allows for flexible deployment options and
integration with existing infrastructure
Advantages
PySpark is a general-purpose, in-memory, distributed processing engine that allows you to process data
efficiently in a distributed fashion
Applications running on PySpark can be up to 100x faster than traditional Hadoop MapReduce systems for in-memory workloads.
PySpark is well suited for data ingestion pipelines: the end-to-end workflows that handle collecting,
preparing, and loading data from various sources into a target storage or processing system.
Using PySpark we can process data from Hadoop HDFS, AWS S3, and many file systems.
PySpark is also used to process real-time data using Spark Streaming and Kafka. Using PySpark
Streaming, you can also stream files from the file system as well as from a socket.
PySpark natively has machine learning and graph libraries.
Difference between Spark and PySpark
PySpark is a Python API for Apache Spark, while Spark is an open-source big data processing framework
written in Scala. The main differences between PySpark and Spark are:
PySpark is written in Python, while Spark is written in Scala.
PySpark is easier to use as it has a more user-friendly interface, while Spark requires more expertise in
programming.
PySpark can be slower than Spark because of the overhead introduced by the Python interpreter, while
Spark can provide better performance due to its native Scala implementation.
PySpark has access to some, but not all, of Spark's libraries, while Spark has a rich set of libraries for data
processing.
Spark has a larger community of users and contributors than PySpark.
PySpark Architecture
Apache Spark works in a master-slave architecture where the master is called “Driver” and slaves are
called “Workers”.
When you run a Spark application, the Spark Driver creates a context that is the entry point to your
application; all operations (transformations and actions) are executed on worker nodes, and the
resources are managed by the Cluster Manager.
Cluster Manager Types
Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
Apache Mesos – Mesos is a cluster manager that can also run Hadoop MapReduce and PySpark
applications.
Hadoop YARN – the resource manager in Hadoop 2. This is the most widely used cluster manager.
Kubernetes – an open-source system for automating deployment, scaling, and management of
containerized applications.
PySpark Modules & Packages
PySpark RDD (pyspark.RDD)
o SIGNATURE
class pyspark.RDD(jrdd: JavaObject, ctx: SparkContext, jrdd_deserializer: pyspark.serializers.Serializer =
AutoBatchedSerializer(CloudPickleSerializer()))
o jrdd: This parameter is the Java representation of the RDD (JavaObject) that the PySpark RDD
wraps. RDDs are implemented in Scala (the primary language of Spark), and PySpark provides a
Python API to interact with these Scala-based RDDs.
o ctx: This parameter represents the SparkContext object, which serves as the entry point to any Spark
functionality in PySpark. The SparkContext manages the configuration of the Spark application and
coordinates the execution of tasks across the cluster.
o jrdd_deserializer: This parameter represents the deserializer used for the RDD. It specifies how the
data within the RDD should be serialized and deserialized. The default deserializer is
AutoBatchedSerializer(CloudPickleSerializer()), which serializes data using CloudPickleSerializer
and supports automatic batching for more efficient serialization.
RDDs are the elements that run and operate on multiple nodes to perform parallel processing in a cluster.
RDDs are immutable, meaning you cannot change them once you create an RDD.
These are fault-tolerant, so they automatically recover in case of failure.
You can apply multiple operations on these RDDs to achieve a particular task
There are two ways to apply operations −
o Transformation
o Actions
Transformations are applied on an RDD to give another RDD.
Transformations
operation that takes an RDD as input and produces another RDD as output.
Once a transformation is applied to an RDD, it returns a new RDD; the original RDD remains
unchanged, and RDDs are thus immutable.
Applying transformations builds a Directed Acyclic Graph (DAG) of computations, which is executed
only when an action is applied.
This is why transformations are described as lazily evaluated.
Transformations are the processes used to create a new RDD. They follow the principle of lazy
evaluation (execution will not start until an action is triggered). A few transformations are given
below:
o map, flatMap, filter, distinct, reduceByKey, mapPartitions, sortBy
Actions
An operation applied to an RDD that produces a non-RDD value.
Actions are applied on a resultant RDD and produce a non-RDD value, thus removing the laziness of
the transformations on the RDD.
Actions are operations that trigger the execution of transformations on RDDs or DataFrames and return
results to the driver program or save data to external storage.
Actions are the processes applied on an RDD to make Apache Spark perform the computation and
pass the result back to the driver. A few actions are given below:
o collect, collectAsMap, reduce, countByKey/countByValue, take, first
Transformation Operations
o map(func): Applies a function to each element of the RDD and returns a new RDD with the results.
o filter(func): Returns a new RDD containing only the elements that satisfy the given predicate function.
o flatMap(func): Similar to map, but each input item can be mapped to zero or more output items.
o distinct() :Returns a new RDD with distinct elements.
o union(otherRDD): Returns a new RDD containing the elements from both RDDs.
o intersection(otherRDD): Returns a new RDD containing the common elements between two RDDs.
o subtract(otherRDD): Returns a new RDD with elements from the source RDD that are not present in
the other RDD.
o groupByKey() :Groups the elements of the RDD by key, creating a new RDD of (key, values) pairs.
o reduceByKey(func) :Performs a reduce operation on the values of each key in the RDD.
o sortByKey(ascending=True) or sortByKey(ascending=False): Sorts the RDD by key in ascending or
descending order.
Action Operations
RDD Operations
o collect(): Returns all elements of the RDD as a list to the driver program. Use with caution on large
datasets as it brings all data to the driver.
o count(): Returns the number of elements in the RDD.
o first(): Returns the first element in the RDD.
o take(n): Returns the first n elements of the RDD as a list.
o top(n): Returns the largest n elements of the RDD, in descending order.
o reduce(func): Aggregates the elements of the RDD using a function func.
o foreach(func): Applies a function func to each element of the RDD (useful for side effects).
DataFrames
o show(n=20, truncate=True): Prints the first n rows of the DataFrame in tabular form.
o count(): Returns the number of rows in the DataFrame.
o first(): Returns the first row of the DataFrame as a Row object.
o take(n): Returns the first n rows of the DataFrame as a list of Row objects.
o collect(): Returns all rows of the DataFrame as a list of Row objects to the driver program. Use with
caution on large datasets as it brings all data to the driver.
o head(n=1): Returns the first n rows of the DataFrame as a list of Row objects.
o describe(*cols): Computes statistics (count, mean, standard deviation, min, and max) for specified
columns.
o printSchema(): Prints the schema of the DataFrame.
o write: Writes the DataFrame to external storage (e.g., Parquet, CSV, JSON).