Flip the values in the Gender column of an employee DataFrame (id, name, Gender, salary): every 'M' becomes 'F' and every 'F' becomes 'M'.
Solution:
======
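Assuming an input DataFrame like the following (the sample rows and column names are my assumption; only the id-4 row appears in the original question):
data = [(1, 'A', 'M', 1000), (2, 'B', 'F', 2000), (3, 'C', 'M', 3000), (4, 'D', 'F', 500)]
df1 = spark.createDataFrame(data, ['id', 'name', 'Gender', 'salary'])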
from pyspark.sql.functions import when

df1 = df1.withColumn("Gender",
                     when(df1.Gender == 'F', 'M')
                     .when(df1.Gender == 'M', 'F')
                     .otherwise(None))
Another solution would be to convert the DataFrame into an RDD and apply a map transformation, as below:
def convertGender(emp):
    # emp is a row of (id, name, gender, salary); swap the gender value
    if emp[2] == 'M':
        return (emp[0], emp[1], 'F', emp[3])
    elif emp[2] == 'F':
        return (emp[0], emp[1], 'M', emp[3])
    else:
        return (emp[0], emp[1], None, emp[3])

# toDF() without arguments infers default column names (_1.._4); pass the original names to keep them
df = df.rdd.map(lambda emp: convertGender(emp)).toDF()
================-==========================================
Another interview question: how are jobs, stages, and tasks split up when a Spark action is triggered?
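In short, each action submits a job, the job is broken into stages at shuffle boundaries, and each stage runs one task per partition. A minimal sketch of my own (not from the original post) to see this in the Spark UI:
# 4 partitions -> stage 0 runs 4 tasks
rdd = spark.sparkContext.parallelize(range(100), 4)
# reduceByKey needs a shuffle -> a new stage starts here
counts = rdd.map(lambda x: (x % 3, 1)).reduceByKey(lambda a, b: a + b)
# collect() is the action: it submits 1 job with 2 stages
counts.collect()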
===================-=========================================
Given a table of daily payments, add a variance_flag column: 1 if the payment is higher than the previous day's, -1 if it is lower, and null when there is no previous day.
Expected Output:
===========
Date Payments variance_flag
15-02-2022 100 null
16-02-2022 500 1
17-02-2022 900 1
18-02-2022 300 -1
19-02-2022 400 1
20-02-2022 100 -1
21-02-2022 1000 1
=======-
from pyspark.sql.functions import col, lag, lit, to_date, when
from pyspark.sql.window import Window

payments = [('15-02-2022',100),('16-02-2022',500),('17-02-2022',900),('18-02-2022',300),
            ('19-02-2022',400),('20-02-2022',120),('21-02-2022',1000)]
df_payments = spark.createDataFrame(payments,['Date','Payments'])
df_payments = df_payments.withColumn('Date', to_date(col('Date'),'dd-MM-yyyy'))
windowSpec = Window.orderBy(col('Date'))
df_payments = df_payments.withColumn('Lag_Payments', lag('Payments',1).over(windowSpec)) \
    .withColumn('variance_flag', when(col('Lag_Payments').isNull(), lit(None))
                .when(col('Lag_Payments') > col('Payments'), -1)
                .when(col('Lag_Payments') < col('Payments'), 1)) \
    .drop('Lag_Payments')
df_payments.show()
+----------+--------+-------------+
| Date|Payments|variance_flag|
+----------+--------+-------------+
|2022-02-15| 100| null|
|2022-02-16| 500| 1|
|2022-02-17| 900| 1|
|2022-02-18| 300| -1|
|2022-02-19| 400| 1|
|2022-02-20| 120| -1|
|2022-02-21| 1000| 1|
+----------+--------+-------------+
=============-
with mytable as (select *, lag(Payments,1) over (order by date) as temp_pay from t1)
select date, Payments,case
when temp_pay is Null then Null
when Payments>temp_pay then 1 else -1 end as variance_flag from mytable;
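To run a query like this against the PySpark DataFrame built above, one option is to register it as a temp view (the table name t1 is taken from the query; the view registration is my addition):
df_payments.createOrReplaceTempView("t1")
spark.sql("""
    with mytable as (select *, lag(Payments, 1) over (order by Date) as temp_pay from t1)
    select Date, Payments,
           case when temp_pay is null then null
                when Payments > temp_pay then 1
                else -1 end as variance_flag
    from mytable
""").show()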
=========-=
Using "lag" window function on "payments" column we can achieve this output.
==========-=
select *, case when lag(payment) over (order by date) is null then null
               when payment > lag(payment) over (order by date) then 1
               else -1 end as variance_flag
from payment
========================-====================-======================================
Given match results with columns Team1, Team2, and Winner, compute each team's total matches, total wins, and total losses.
Output :
======
Team,Total_match,Total_win,Total_loss
--------------------------------------
India,2,2,0
Srilanka,2,0,2
Aus,2,1,1
==========-==========
from pyspark.sql.functions import coalesce, col, lit

lst_data = [("India","Aus","India"),("Srilanka","Aus","Aus"),("Srilanka","India","India")]
schema = ["Team1","Team2","Winner"]
df = spark.createDataFrame(lst_data, schema)
df1 = df.groupBy("Team1").count().withColumnRenamed("Team1","team")
df2 = df.groupBy("Team2").count().withColumnRenamed("Team2","team")
df3 = df1.unionAll(df2).groupBy("team").sum("count").withColumnRenamed("sum(count)","total_match")
df4 = df.groupBy("Winner").count().withColumnRenamed("count","total_win")
df5 = df3.join(df4, df3.team == df4.Winner, "left") \
    .withColumn("total_win", coalesce(df4.total_win, lit(0))) \
    .select("team", "total_match", "total_win") \
    .withColumn("total_loss", col("total_match") - col("total_win"))
df5.show()
===========-=========
=========-======
val a=List(("India","Aus","India"),("Japan","Aus","Aus"),("Japan","India","India"))
val df=a.toDF("team1","team2","win")
+-----+-----+-----+
|team1|team2| win|
+-----+-----+-----+
|India| Aus|India|
|Japan| Aus| Aus|
|Japan|India|India|
+-----+-----+-----+
val df2=df.select("team1").union(df.select("team2"))
val df3=df2.groupBy("team1").count().withColumnRenamed("count","Total_Matches")
+-----+-------------+
|team1|Total_Matches|
+-----+-------------+
|India| 2|
| Aus| 2|
|Japan| 2|
+-----+-------------+
val df4=df.groupBy("win").count().withColumnRenamed("count","winner")
+-----+------+
| win|winner|
+-----+------+
|India| 2|
| Aus| 1|
+-----+------+
df3.join(df4, col("team1") === col("win"), "left").drop("win").na.fill(0).withColumn("loss", col("Total_Matches") - col("winner")).show
+-----+-------------+------+----+
|team1|Total_Matches|winner|loss|
+-----+-------------+------+----+
|India| 2| 2| 0|
| Aus| 2| 1| 1|
|Japan| 2| 0| 2|
+-----+-------------+------+----+
==============================-======
==============-=
Hope the query below gives the solution if you use Spark SQL:
With teams_cte as
(Select team1 as team, (case when team1=winner then 1 else 0 end) as won from table
Union all
Select team2 as team, (case when team2=winner then 1 else 0 end) as won from table)
Select team, count(*) as total_matches, sum(won) as total_won, count(*)-sum(won) as total_loss
From teams_cte
Group by team;
=============-====
==============-==========
This is a very good question. I was asked this question in one of my interviews, but in SQL. I just attempted to solve it using PySpark.
===============-==============================================================
If 199 out of 200 partitions have executed in Spark, but after 1 hour you get an error, what will you do?
Mostly a data-skew problem causes this kind of issue, where the last 1-2 tasks keep running while the other executors have already finished their tasks. The idea is to redistribute the skewed data into a more even (normalised) form, for example by salting the key or repartitioning, as sketched below.
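A rough illustration of salting (my own sketch, not from the original post; fact_df, dim_df, and the column key are hypothetical names):
from pyspark.sql.functions import col, concat_ws, explode, floor, lit, rand, sequence

SALT_BUCKETS = 10  # assumption: 10 buckets is enough to spread the hot keys

# Skewed (large) side: append a random salt 0..9 to the join key.
fact_salted = (fact_df
    .withColumn("salt", floor(rand() * SALT_BUCKETS))
    .withColumn("join_key", concat_ws("_", col("key"), col("salt").cast("string"))))

# Small side: replicate each row once per salt value so every salted key can match.
dim_salted = (dim_df
    .withColumn("salt", explode(sequence(lit(0), lit(SALT_BUCKETS - 1))))
    .withColumn("join_key", concat_ws("_", col("key"), col("salt").cast("string"))))

# The hot key is now spread across SALT_BUCKETS values, so no single task holds all of its rows.
result = fact_salted.join(dim_salted.drop("salt", "key"), on="join_key", how="inner")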
========================================================-==========================
==========-