● Splitting a column into multiple columns: from pyspark.sql.functions import split; df.withColumn('name_parts', split(df['full_name'], ' ')).show() (combined with when, collect_list, and explode in the sketch after this list)
● Collecting a column as a list: from pyspark.sql.functions import collect_list; df.groupBy("department").agg(collect_list("name").alias("names")).show()
● Converting a DataFrame column to a Python list: names_list = df.select("name").rdd.flatMap(lambda x: x).collect()
● Using when-otherwise for conditional logic: from pyspark.sql.functions import when; df.withColumn("category", when(df["age"] < 30, "Young").otherwise("Old")).show()
● Exploding a list to rows: from pyspark.sql.functions import explode; df.withColumn('name', explode(df['names'])).show()
● Aggregating with custom expressions: from pyspark.sql.functions import expr; df.groupBy("department").agg(expr("avg(salary) as average_salary")).show()
● Calculating correlations: df.stat.corr("column1", "column2")
● Handling dates and timestamps: from pyspark.sql.functions import current_date, current_timestamp; df.withColumn("today", current_date()).withColumn("now", current_timestamp()).show()
● Repartitioning DataFrames: df.repartition(10).rdd.getNumPartitions()
● Caching DataFrames for optimization: df.cache()
● Applying map and reduce operations on RDDs: rdd.map(lambda x: x * x).reduce(lambda x, y: x + y)
● Using broadcast variables for efficiency: broadcastVar = sc.broadcast([1, 2, 3])
● Accumulators for aggregating information across executors: acc = sc.accumulator(0); rdd.foreach(lambda x: acc.add(x))
● DataFrame descriptive statistics: df.describe().show()
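The snippets above each stand alone; the minimal runnable sketch below ties a few of them together (split, when-otherwise, collect_list, explode). The local SparkSession setup and the sample data are assumptions for illustration, not part of the original snippets.

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, when, collect_list, explode

# Hypothetical local session and sample data for illustration.
spark = SparkSession.builder.master("local[*]").appName("cheatsheet-demo").getOrCreate()
df = spark.createDataFrame(
    [("Ada Lovelace", 28, "eng"), ("Alan Turing", 41, "research")],
    ["full_name", "age", "department"],
)

# Split the name and bucket rows by age...
labelled = (
    df.withColumn("name_parts", split(df["full_name"], " "))
      .withColumn("category", when(df["age"] < 30, "Young").otherwise("Old"))
)

# ...then collect names per department and explode them back out to rows.
grouped = labelled.groupBy("department").agg(collect_list("full_name").alias("names"))
grouped.select("department", explode("names").alias("name")).show()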
Performance Optimization
● Broadcast join for large and small DataFrames: from pyspark.sql.functions import broadcast; largeDf.join(broadcast(smallDf), "key").show()
● Broadcast Variables for Large Lookups: broadcastVar = sc.broadcast(largeLookupTable)
● Partitioning Strategies for Large Datasets: df.repartition(200, "keyColumn")
● Persisting DataFrames in Memory: from pyspark import StorageLevel; df.persist(StorageLevel.MEMORY_AND_DISK)
● Optimizing Spark SQL Joins: df.join(broadcast(smallDf), "key")
● Minimizing Data Shuffles: df.coalesce(1) (coalesce reduces the partition count without a full shuffle)
● Using Kryo Serialization for Faster Processing: must be set before the session starts, e.g. SparkSession.builder.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
● Adjusting Spark Executor Memory: set at launch, e.g. spark-submit --executor-memory 4g (spark.executor.memory cannot be changed at runtime)
● Tuning Spark SQL Shuffle Partitions: spark.conf.set("spark.sql.shuffle.partitions", "200")
● Leveraging DataFrame Caching Wisely: df.cache(), but only for DataFrames reused across multiple actions
● Avoiding Unnecessary Operations in Transformations: avoid complex operations inside loops or iterative transformations
● Monitoring and Debugging with Spark UI: the application UI runs on the driver at http://<driver-host>:4040
● Efficient Use of Accumulators for Global Aggregates: acc = sc.accumulator(0)
● Optimizing Data Locality for Faster Processing: keep data close to the computation to minimize network transfer
● Utilizing DataFrames and Datasets over RDDs for Optimized Performance: prefer the DataFrame and Dataset APIs to leverage the Catalyst optimizer and the Tungsten execution engine
● Applying Best Practices for Data Skew: use salting or repartitioning to mitigate skew in joins and aggregations (see the salting sketch after this list)
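For the data-skew item above, the sketch below shows one common salting approach. The DataFrame names, the sample data, and the SALT_BUCKETS value are assumptions for illustration: the idea is to fan a hot join key out across partitions by appending a random salt, and to replicate the small side once per salt value so every salted key still finds its match.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("salting-demo").getOrCreate()

# Hypothetical skewed data: almost every row shares the key "a".
events = spark.createDataFrame([("a", i) for i in range(1000)] + [("b", 0)], ["key", "value"])
dims = spark.createDataFrame([("a", "Alpha"), ("b", "Beta")], ["key", "label"])

SALT_BUCKETS = 8  # assumed tuning knob; size it to the observed skew

# Salt the skewed side so the hot key spreads across partitions...
salted_events = events.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# ...and replicate the small side once per salt value so the join still matches.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
salted_dims = dims.crossJoin(salts)

joined = salted_events.join(salted_dims, on=["key", "salt"]).drop("salt")
joined.groupBy("key", "label").count().show()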