
# Big Data Analytics with PySpark (Cheat Sheet)

## Setting Up PySpark Environment

● Install PySpark: !pip install pyspark
● Initialize SparkContext: from pyspark import SparkContext; sc =
SparkContext()
● Create SparkSession: from pyspark.sql import SparkSession; spark =
SparkSession.builder.appName('AppName').getOrCreate()
● Check Spark version: spark.version
● Configure Spark properties: spark.conf.set("spark.sql.shuffle.partitions",
"50")
● List configured properties: spark.sparkContext.getConf().getAll()
● Stop SparkContext: sc.stop()
● Set the Python interpreter for PySpark in Jupyter Notebook: %env PYSPARK_PYTHON=python3
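
Putting the one-liners above together, here is a minimal, self-contained setup sketch. It assumes only a local pyspark installation; the app name and the shuffle-partition value are arbitrary placeholders.

```python
# Minimal setup sketch: build a session, inspect it, and shut it down.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("AppName")                                   # placeholder app name
    .config("spark.sql.shuffle.partitions", "50")         # smaller value for local runs
    .getOrCreate()
)

print(spark.version)                    # Spark version in use
sc = spark.sparkContext                 # underlying SparkContext
print(sc.getConf().getAll()[:5])        # a few of the configured properties

spark.stop()                            # stops the session and its SparkContext
```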

## Data Loading and Saving

● Read CSV file: df = spark.read.csv('path/to/file.csv', header=True, inferSchema=True)
● Read Parquet file: df = spark.read.parquet('path/to/file.parquet')
● Read from database (JDBC): df = spark.read.format("jdbc").option("url",
"jdbc_url").option("dbtable", "table_name").option("user",
"username").option("password", "password").load()
● Write DataFrame to CSV: df.write.csv('path/to/output.csv',
mode='overwrite')
● Write DataFrame to Parquet: df.write.parquet('path/to/output.parquet',
mode='overwrite')
● Load a text file: rdd = sc.textFile('path/to/textfile.txt')
● Save RDD to a text file: rdd.saveAsTextFile('path/to/output')
● Read JSON file: df = spark.read.json('path/to/file.json')
● Write DataFrame to JSON: df.write.json('path/to/output.json',
mode='overwrite')
● DataFrame to RDD conversion: rdd = df.rdd
● RDD to DataFrame conversion: df = rdd.toDF(['column1', 'column2'])
● Read multiple files: df = spark.read.csv(['path/to/file1.csv',
'path/to/file2.csv'])
● Read from HDFS: df =
spark.read.text("hdfs://namenode:port/path/to/file.txt")
● Save DataFrame as a Hive table: df.write.saveAsTable("database.tableName")
● Specifying schema explicitly: from pyspark.sql.types import StructType,
StructField, IntegerType, StringType; schema =
StructType([StructField("id", IntegerType(), True), StructField("name",
StringType(), True)]); df =
spark.read.schema(schema).csv('path/to/file.csv')
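
A short end-to-end I/O sketch combining several of the snippets above. The file paths, schema, and column names are illustrative placeholders; adjust them to your data.

```python
# I/O sketch: explicit schema, CSV in, Parquet out, DataFrame <-> RDD round trip.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("io-demo").getOrCreate()

# An explicit schema avoids the extra pass over the file that inferSchema needs.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
df = spark.read.schema(schema).option("header", True).csv("path/to/file.csv")

# Write columnar Parquet and read it back.
df.write.mode("overwrite").parquet("path/to/output.parquet")
df2 = spark.read.parquet("path/to/output.parquet")

# DataFrame -> RDD -> DataFrame.
rdd = df2.rdd
df3 = rdd.toDF(["id", "name"])
df3.show()
```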

## Data Processing and Transformation

● Select columns: df.select("column1", "column2").show()
● Filter rows: df.filter(df["age"] > 30).show()
● GroupBy and aggregate: df.groupBy("department").agg({"salary": "avg",
"age": "max"}).show()
● Join DataFrames: df1.join(df2, df1.id == df2.id).show()
● Sort DataFrame: df.sort(df.age.desc()).show()
● Distinct values: df.select("column").distinct().show()
● Column operations (add, subtract, etc.): df.withColumn("new_column",
df["salary"] * 0.1 + df["bonus"]).show()
● Rename column: df.withColumnRenamed("oldName", "newName").show()
● Drop column: df.drop("column_to_drop").show()
● Handle missing data: df.na.fill({"column1": "value1", "column2":
"value2"}).show()
● User-defined functions (UDF): from pyspark.sql.functions import udf; from pyspark.sql.types import LongType; square_udf = udf(lambda x: x * x, LongType()); df.withColumn("squared", square_udf(df["number"])).show()
● Pivot tables: df.groupBy("department").pivot("gender").agg({"salary":
"avg"}).show()
● Window functions: from pyspark.sql.window import Window; from
pyspark.sql.functions import rank; windowSpec =
Window.partitionBy("department").orderBy("salary");
df.withColumn("rank", rank().over(windowSpec)).show()
● Running SQL queries directly on DataFrames:
df.createOrReplaceTempView("table"); spark.sql("SELECT * FROM table
WHERE age > 30").show()
● Sampling DataFrames: df.sample(withReplacement=False,
fraction=0.1).show()
● Concatenating columns: from pyspark.sql.functions import concat_ws;
df.withColumn('full_name', concat_ws(' ', df['first_name'],
df['last_name'])).show()

● Splitting a column into multiple columns: from pyspark.sql.functions import split; df.withColumn('splitted', split(df['full_name'], ' ')).show()
● Collecting a column as a list: from pyspark.sql.functions import collect_list; df.groupBy("department").agg(collect_list("name").alias("names")).show()
● Converting DataFrame column to Python list: names_list =
df.select("name").rdd.flatMap(lambda x: x).collect()
● Using when-otherwise for conditional logic: from pyspark.sql.functions
import when; df.withColumn("category", when(df["age"] < 30,
"Young").otherwise("Old")).show()
● Exploding a list to rows: from pyspark.sql.functions import explode;
df.withColumn('name', explode(df['names'])).show()
● Aggregating with custom expressions: from pyspark.sql.functions import
expr; df.groupBy("department").agg(expr("avg(salary) as
average_salary")).show()
● Calculating correlations: df.stat.corr("column1", "column2")
● Handling date and timestamp: from pyspark.sql.functions import
current_date, current_timestamp; df.withColumn("today",
current_date()).withColumn("now", current_timestamp()).show()
● Repartitioning DataFrames: df.repartition(10).rdd.getNumPartitions()
● Caching DataFrames for optimization: df.cache()
● Applying map and reduce operations on RDDs: rdd.map(lambda x: x *
x).reduce(lambda x, y: x + y)
● Using broadcast variables for efficiency: broadcastVar = sc.broadcast([1,
2, 3])
● Accumulators for aggregating information across executors: acc =
sc.accumulator(0); rdd.foreach(lambda x: acc.add(x))
● DataFrame descriptive statistics: df.describe().show()
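
The following sketch strings several of the transformations above together on a small in-memory DataFrame, so it runs without any external files. The column names and values are made up for illustration.

```python
# Transformation sketch on an in-memory DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("transform-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", "IT", 3000), (2, "Bob", "IT", 4000), (3, "Cara", "HR", 3500)],
    ["id", "name", "department", "salary"],
)

# Filter, derive a bonus column, and label rows conditionally.
df = (
    df.filter(F.col("salary") > 2500)
      .withColumn("bonus", F.col("salary") * 0.1)
      .withColumn("band", F.when(F.col("salary") >= 3500, "senior").otherwise("junior"))
)

# Aggregate per department, then rank salaries within each department.
df.groupBy("department").agg(F.avg("salary").alias("avg_salary")).show()

w = Window.partitionBy("department").orderBy(F.col("salary").desc())
df.withColumn("rank", F.rank().over(w)).show()
```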

## Performance Optimization

● Broadcast join for large and small DataFrames: from pyspark.sql.functions import broadcast; large_df.join(broadcast(small_df), "key").show()
● Reducing output partitions without a full shuffle (coalesce): df.coalesce(1).write.csv('path/to/output', mode='overwrite')
● Partition tuning for better parallelism:
df.repartition("column").write.parquet('path/to/output')
● Caching intermediate DataFrames: intermediate_df = df.filter(df["age"] >
30).cache()

● Using columnar storage formats like Parquet:
df.write.parquet('path/to/output.parquet')
● Optimizing Spark SQL with explain plans: df.explain(True)
● Minimizing data serialization cost with Kryo:
spark.conf.set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
● Leveraging off-heap memory storage: spark.conf.set("spark.memory.offHeap.enabled", "true"); spark.conf.set("spark.memory.offHeap.size", "2g")
● Adjusting the size of shuffle partitions:
spark.conf.set("spark.sql.shuffle.partitions", "200")
● Using vectorized (pandas) UDFs: import pandas as pd; from pyspark.sql.functions import pandas_udf; then define
@pandas_udf("integer")
def square_udf(s: pd.Series) -> pd.Series: return s * s
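
A compact sketch of the broadcast-join and caching ideas above. The DataFrames are generated in memory; the output path and the shuffle-partition setting are placeholders.

```python
# Broadcast-join sketch with plan inspection and caching.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("perf-demo").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "50")     # tune to your data volume

large_df = spark.range(0, 1_000_000).withColumnRenamed("id", "key")
small_df = spark.createDataFrame([(0, "zero"), (1, "one")], ["key", "label"])

# Broadcasting the small side avoids shuffling the large side.
joined = large_df.join(broadcast(small_df), "key")
joined.explain(True)                                     # look for BroadcastHashJoin

joined.cache()                                           # reuse without recomputation
joined.coalesce(1).write.mode("overwrite").parquet("path/to/output.parquet")
```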

## Advanced Analytics and Machine Learning with PySpark

● Linear Regression Model: from pyspark.ml.regression import LinearRegression; lr = LinearRegression(featuresCol='features', labelCol='label'); lrModel = lr.fit(train_df)
● Classification Model (Logistic Regression): from pyspark.ml.classification
import LogisticRegression; logr =
LogisticRegression(featuresCol='features', labelCol='label'); logrModel =
logr.fit(train_df)
● Decision Tree Classifier: from pyspark.ml.classification import
DecisionTreeClassifier; dt =
DecisionTreeClassifier(featuresCol='features', labelCol='label'); dtModel
= dt.fit(train_df)
● Random Forest Classifier: from pyspark.ml.classification import
RandomForestClassifier; rf =
RandomForestClassifier(featuresCol='features', labelCol='label'); rfModel
= rf.fit(train_df)
● Gradient-Boosted Tree Classifier: from pyspark.ml.classification import
GBTClassifier; gbt = GBTClassifier(featuresCol='features',
labelCol='label'); gbtModel = gbt.fit(train_df)
● Clustering with K-Means: from pyspark.ml.clustering import KMeans; kmeans
= KMeans().setK(3).setSeed(1); model = kmeans.fit(dataset)
● Building a Pipeline: from pyspark.ml import Pipeline; from pyspark.ml.feature import HashingTF, Tokenizer; from pyspark.ml.classification import LogisticRegression; tokenizer = Tokenizer(inputCol="text", outputCol="words"); hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features"); lr = LogisticRegression(maxIter=10, regParam=0.001); pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
● Model Evaluation (Binary Classification): from pyspark.ml.evaluation
import BinaryClassificationEvaluator; evaluator =
BinaryClassificationEvaluator(); print('Area Under ROC',
evaluator.evaluate(predictions))
● Model Evaluation (Multiclass Classification): from pyspark.ml.evaluation
import MulticlassClassificationEvaluator; evaluator =
MulticlassClassificationEvaluator(metricName="accuracy"); accuracy =
evaluator.evaluate(predictions); print("Test Accuracy = %g" % accuracy)
● Hyperparameter Tuning using CrossValidator: from pyspark.ml.tuning
import ParamGridBuilder, CrossValidator; paramGrid =
ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build(); cv =
CrossValidator(estimator=lr, estimatorParamMaps=paramGrid,
evaluator=evaluator, numFolds=3); cvModel = cv.fit(train_df)
● Feature Transformation - VectorAssembler: from pyspark.ml.feature import
VectorAssembler; assembler =
VectorAssembler(inputCols=['feature1','feature2'], outputCol="features");
output = assembler.transform(df)
● Feature Scaling - StandardScaler: from pyspark.ml.feature import
StandardScaler; scaler = StandardScaler(inputCol="features",
outputCol="scaledFeatures", withStd=True, withMean=False); scalerModel =
scaler.fit(df); scaledData = scalerModel.transform(df)
● Text Processing - Tokenization: from pyspark.ml.feature import Tokenizer;
tokenizer = Tokenizer(inputCol="document", outputCol="words"); wordsData
= tokenizer.transform(documentDF)
● Text Processing - Stop Words Removal: from pyspark.ml.feature import
StopWordsRemover; remover = StopWordsRemover(inputCol="words",
outputCol="filtered"); filteredData = remover.transform(wordsData)
● Principal Component Analysis (PCA): from pyspark.ml.feature import PCA;
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures"); model =
pca.fit(df); result = model.transform(df).select("pcaFeatures")
● Handling Missing Values: df.na.fill({'column1': 'value1', 'column2':
'value2'}).show()
● Using SQL Functions for Data Manipulation: from pyspark.sql.functions
import col, upper; df.select(col("name"),
upper(col("name")).alias("name_upper")).show()
● Applying Custom Functions with UDF: from pyspark.sql.functions import
udf; from pyspark.sql.types import IntegerType; my_udf = udf(lambda x:
len(x), IntegerType()); df.withColumn("string_length",
my_udf(col("string_column"))).show()
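
A minimal pipeline sketch tying together feature assembly, logistic regression, cross-validation, and evaluation from the snippets above. The dataset is synthetic, and the column names, grid values, and split ratio are illustrative.

```python
# Pipeline sketch: assemble features, fit logistic regression, cross-validate, evaluate.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

spark = SparkSession.builder.appName("ml-demo").getOrCreate()

# Synthetic, linearly separable data: the label flips at feature1 >= 50.
data = spark.createDataFrame(
    [(float(i), float(i % 7), 0.0 if i < 50 else 1.0) for i in range(100)],
    ["feature1", "feature2", "label"],
)
train_df, test_df = data.randomSplit([0.75, 0.25], seed=42)

assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
pipeline = Pipeline(stages=[assembler, lr])

# Grid-search the regularization parameter with 3-fold cross-validation.
evaluator = BinaryClassificationEvaluator()              # areaUnderROC by default
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build()
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)

cvModel = cv.fit(train_df)
predictions = cvModel.transform(test_df)
print("Area under ROC:", evaluator.evaluate(predictions))
```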

## Streaming Data Analysis with PySpark

● Creating a Streaming DataFrame from a Socket Source: lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
● Writing Streaming Data to Console: query =
lines.writeStream.outputMode("append").format("console").start()
● Using Watermarking to Handle Late Data: from pyspark.sql.functions import window, col; windowedCounts = lines.withWatermark("timestamp", "10 minutes").groupBy(window(col("timestamp"), "5 minutes")).count()
● Aggregating Stream Data: aggregatedStream = streamData.groupBy("column").agg({"value": "sum"})
● Querying Streaming Data in Memory: query =
streamData.writeStream.queryName("aggregated_data").outputMode("complete"
).format("memory").start()
● Reading from a Kafka Source: kafkaStream =
spark.readStream.format("kafka").option("kafka.bootstrap.servers",
"host1:port1,host2:port2").option("subscribe", "topicName").load()
● Writing Stream Data to Kafka: query =
streamData.selectExpr("to_json(struct(*)) AS
value").writeStream.format("kafka").option("kafka.bootstrap.servers",
"host:port").option("topic", "outputTopic").start()
● Triggering Streaming Queries: query =
streamData.writeStream.outputMode("append").trigger(processingTime='5
seconds').format("console").start()
● Managing Streaming Queries: query.status, query.stop()
● Using Foreach and ForeachBatch for Custom Sinks: query =
streamData.writeStream.foreachBatch(customFunction).start()
● Stateful Stream Processing: mapGroupsWithState belongs to the Scala/Java Dataset API; in PySpark (3.4+), use streamData.groupBy("key").applyInPandasWithState(updateFunction, outputStructType, stateStructType, outputMode, timeoutConf)
● Handling Late Data and Watermarking: lateDataHandledStream =
streamData.withWatermark("timestampColumn", "1
hour").groupBy(window(col("timestampColumn"), "10 minutes"),
"keyColumn").count()
● Streaming Deduplication: streamData.withWatermark("eventTime", "10
minutes").dropDuplicates(["userID", "eventTime"])
● Continuous Processing Mode: query =
streamData.writeStream.format("console").trigger(continuous="1
second").start()

● Monitoring Streaming Queries: spark.streams.addListener(myListener), where myListener subclasses pyspark.sql.streaming.StreamingQueryListener (Spark 3.4+); query.lastProgress reports per-batch metrics
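
A runnable streaming sketch based on the built-in rate source, so it needs no socket or Kafka broker. The window length, watermark, trigger interval, and run duration are arbitrary illustration values.

```python
# Streaming sketch: rate source -> watermarked window count -> console sink.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# The rate source emits (timestamp, value) rows continuously.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Drop events more than 1 minute late and count rows per 10-second window.
counts = (
    stream.withWatermark("timestamp", "1 minute")
          .groupBy(window(col("timestamp"), "10 seconds"))
          .count()
)

query = (
    counts.writeStream
          .outputMode("update")                          # emit only changed windows
          .trigger(processingTime="5 seconds")
          .format("console")
          .start()
)

query.awaitTermination(30)                               # run for ~30 seconds
query.stop()
```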

## Spark Performance Tuning and Best Practices

● Broadcast Variables for Large Lookups: broadcastVar = sc.broadcast(largeLookupTable)
● Partitioning Strategies for Large Datasets: df.repartition(200,
"keyColumn")
● Persisting DataFrames in Memory: from pyspark import StorageLevel; df.persist(StorageLevel.MEMORY_AND_DISK)
● Optimizing Spark SQL Joins: df.join(broadcast(smallDf), "key")
● Minimizing Data Shuffles: df.coalesce(1)
● Using Kryo Serialization for Faster Processing:
spark.conf.set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
● Adjusting Spark Executor Memory: spark.conf.set("spark.executor.memory",
"4g")
● Tuning Spark SQL Shuffle Partitions:
spark.conf.set("spark.sql.shuffle.partitions", "200")
● Leveraging DataFrame Caching Wisely: df.cache()
● Avoiding Unnecessary Operations in Transformations: Avoid complex
operations inside loops or iterative transformations
● Monitoring and Debugging with Spark UI: Access the application UI at http://<driver-host>:4040 while the job is running
● Efficient Use of Accumulators for Global Aggregates: acc =
sc.accumulator(0)
● Optimizing Data Locality for Faster Processing: Keep data close to the computation to minimize data transfer
● Utilizing DataFrames and Datasets over RDDs for Optimized Performance:
Prefer DataFrames and Datasets APIs for leveraging Catalyst optimizer and
Tungsten execution engine
● Applying Best Practices for Data Skew: Use salting or repartitioning to mitigate data skew in joins and aggregations (see the salting sketch after this list)
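
As referenced in the data-skew item above, here is one possible salting sketch. The skewed and dimension DataFrames are generated in memory, and the number of salt buckets (8) is an arbitrary choice.

```python
# Salting sketch: spread one hot join key across several salt buckets.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark import StorageLevel

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

# Skewed fact table: ~90% of rows share key 0. Small dimension table to join.
fact_df = spark.range(0, 100_000).withColumn(
    "key", F.when(F.col("id") % 10 == 0, F.col("id") % 3).otherwise(F.lit(0))
)
dim_df = spark.createDataFrame([(0, "a"), (1, "b"), (2, "c")], ["key", "label"])

n_salts = 8                                              # arbitrary bucket count

# Random salt on the skewed side; replicate the small side once per salt value.
salted_fact = fact_df.withColumn("salt", (F.rand(seed=7) * n_salts).cast("int"))
salted_dim = dim_df.crossJoin(spark.range(n_salts).withColumnRenamed("id", "salt"))

joined = salted_fact.join(salted_dim, ["key", "salt"]).drop("salt")
joined.persist(StorageLevel.MEMORY_AND_DISK)             # cache the reused result
print(joined.count())
```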

By: Waleed Mousa
