
Data Question:

======================
Flip the values of the Gender column (m <-> f) in the below DataFrame.

Input:
+---+----+------+------+
| id|name|Gender|salary|
+---+----+------+------+
|  1|   A|     m|  2500|
|  2|   B|     f|  1500|
|  3|   C|     m|  5500|
|  4|   D|     f|   500|
+---+----+------+------+

Output: flipped DataFrame

+---+----+------+------+
| id|name|Gender|salary|
+---+----+------+------+
|  1|   A|     f|  2500|
|  2|   B|     m|  1500|
|  3|   C|     f|  5500|
|  4|   D|     m|   500|
+---+----+------+------+

Solution:
======
from pyspark.sql.functions import when

df1 = df1.withColumn("Gender", when(df1.Gender == 'f', 'm').when(df1.Gender == 'm', 'f').otherwise(None))

Please feel free to add an alternative solution, if any.


Only one 'when' statement would suffice.
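
For example, a minimal sketch of that idea, assuming the Gender column only ever holds 'm' or 'f' (with any third value this single-branch version would mislabel it as 'f'):

from pyspark.sql.functions import when

# Single 'when': anything that is not 'f' is treated as 'm' and flipped to 'f'.
df1 = df1.withColumn("Gender", when(df1.Gender == 'f', 'm').otherwise('f'))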

Another solution would be to convert it into an RDD and apply a map transformation, like below:

def convertGender(emp):
    # emp is a row of (id, name, Gender, salary)
    if emp[2] == 'm':
        return (emp[0], emp[1], 'f', emp[3])
    elif emp[2] == 'f':
        return (emp[0], emp[1], 'm', emp[3])
    else:
        return (emp[0], emp[1], None, emp[3])

df = df.rdd.map(lambda emp: convertGender(emp)).toDF(["id", "name", "Gender", "salary"])

==========================================================

Another interview question: how are jobs, stages, and tasks split when a Spark action is triggered?

==========================================================

Data Engineer Interview Question:


======================
Input:
====
Date Payments
15-02-2022 100
16-02-2022 500
17-02-2022 900
18-02-2022 300
19-02-2022 400
20-02-2022 120
21-02-2022 1000

Expected Output:
===========
Date Payments variance_flag
15-02-2022 100 null
16-02-2022 500 1
17-02-2022 900 1
18-02-2022 300 -1
19-02-2022 400 1
20-02-2022 120 -1
21-02-2022 1000 1

Tip: We can use a window function.


Please feel free to add answers in the comments. Stay tuned for more questions.

========

from pyspark.sql.functions import col, to_date, lag, when
from pyspark.sql.window import Window

payments = [('15-02-2022',100),('16-02-2022',500),('17-02-2022',900),('18-02-2022',300),
            ('19-02-2022',400),('20-02-2022',120),('21-02-2022',1000)]

df_payments = spark.createDataFrame(payments, ['Date','Payments'])
df_payments = df_payments.withColumn('Date', to_date(col('Date'), 'dd-MM-yyyy'))

windowSpec = Window.orderBy(col('Date'))

df_payments = df_payments.withColumn('Lag_Payments', lag('Payments', 1).over(windowSpec)) \
    .withColumn('variance_flag',
                when(col('Lag_Payments').isNull(), None)
                .when(col('Lag_Payments') > col('Payments'), -1)
                .when(col('Lag_Payments') < col('Payments'), 1)) \
    .drop('Lag_Payments')

df_payments.show()

+----------+--------+-------------+
|      Date|Payments|variance_flag|
+----------+--------+-------------+
|2022-02-15|     100|         null|
|2022-02-16|     500|            1|
|2022-02-17|     900|            1|
|2022-02-18|     300|           -1|
|2022-02-19|     400|            1|
|2022-02-20|     120|           -1|
|2022-02-21|    1000|            1|
+----------+--------+-------------+

==============

with mytable as (select *, lag(Payments, 1) over (order by Date) as temp_pay from t1)
select Date, Payments,
       case when temp_pay is null then null
            when Payments > temp_pay then 1
            else -1 end as variance_flag
from mytable;

===========

Using the "lag" window function on the "Payments" column, we can achieve this output.

============

select *,
       case when lag(payment) over (order by date) is null then null
            when payment > lag(payment) over (order by date) then 1
            else -1 end as variance_flag
from payment;

====================================================================================

Data Engineer Interview Question:


======================
Write a Spark program to get the below Output based on the below given Input
Input :
=====
Team1,Team2,Winner
-----------------
India,Aus,India
Srilanka,Aus,Aus
Srilanka,India,India

Output :
======
Team,Total_match,Total_win,Total_loss
--------------------------------------
India,2,2,0
Srilanka,2,0,2
Aus,2,1,1

====================

from pyspark.sql.functions import col, sum, coalesce, lit

lst_data = [("India","Aus","India"),("Srilanka","Aus","Aus"),("Srilanka","India","India")]
schema = ["Team1","Team2","Winner"]
df = spark.createDataFrame(lst_data, schema)

df1 = df.groupBy("Team1").count().withColumnRenamed("Team1","team")
df2 = df.groupBy("Team2").count().withColumnRenamed("Team2","team")
df3 = df1.unionAll(df2).groupBy("team").sum("count").withColumnRenamed("sum(count)","total_match")
df4 = df.groupBy("Winner").count().withColumnRenamed("count","total_win")
df5 = df3.join(df4, df3.team == df4.Winner, "left") \
    .withColumn("total_win", coalesce(df4.total_win, lit(0))) \
    .select("team", "total_match", "total_win") \
    .withColumn("total_loss", col("total_match") - col("total_win"))

====================

select t.team, count(*) as total_match, sum(t.win) as total_win, sum(t.loss) as total_loss
from (select Team2 as team,
             case when Team2 = Winner then 1 else 0 end as win,
             case when Team2 != Winner then 1 else 0 end as loss
      from temp
      union all
      select Team1 as team,
             case when Team1 = Winner then 1 else 0 end as win,
             case when Team1 != Winner then 1 else 0 end as loss
      from temp) t
group by t.team;

================

val a = List(("India","Aus","India"),("Japan","Aus","Aus"),("Japan","India","India"))
val df = a.toDF("team1","team2","win")
+-----+-----+-----+
|team1|team2|  win|
+-----+-----+-----+
|India|  Aus|India|
|Japan|  Aus|  Aus|
|Japan|India|India|
+-----+-----+-----+
val df2 = df.select("team1").union(df.select("team2"))
val df3 = df2.groupBy("team1").count().withColumnRenamed("count","Total_Matches")
+-----+-------------+
|team1|Total_Matches|
+-----+-------------+
|India|            2|
|  Aus|            2|
|Japan|            2|
+-----+-------------+
val df4 = df.groupBy("win").count().withColumnRenamed("count","winner")
+-----+------+
|  win|winner|
+-----+------+
|India|     2|
|  Aus|     1|
+-----+------+
df3.join(df4, col("team1") === col("win"), "left").drop("win").na.fill(0)
   .withColumn("loss", col("Total_Matches") - col("winner")).show
+-----+-------------+------+----+
|team1|Total_Matches|winner|loss|
+-----+-------------+------+----+
|India|            2|     2|   0|
|  Aus|            2|     1|   1|
|Japan|            2|     0|   2|
+-----+-------------+------+----+
====================================

WITH gro AS (SELECT team1 team FROM tri
             UNION
             SELECT team2 FROM tri)
SELECT team,
       tm.total_match,
       NVL(total_win, 0) tot_win,
       tm.total_match - NVL(total_win, 0) Total_loss
  FROM gro,
       (SELECT winner, COUNT(*) Total_win FROM tri GROUP BY winner) win,
       (SELECT team1, COUNT(*) total_match
          FROM (SELECT team1 FROM tri UNION ALL SELECT team2 FROM tri) GROUP BY team1) tm
 WHERE gro.team = win.winner(+)
   AND tm.team1 = gro.team;

================

Hope the below query gives the solution if you use Spark SQL:

With teams_cte as
(Select team1 as team, (case when team1=winner then 1 else 0 end) as won from table
Union all
Select team2 as team, (case when team2=winner then 1 else 0 end) as won from table)
Select team, count(*) as total_matches, sum(won) as total_won, count(*)-sum(won) as total_loss
From teams_cte
Group by team;

========================

This is a very good question. I was asked this question in one of my interviews, but in SQL. I just attempted to solve it using PySpark.

from pyspark.sql.functions import col, count

df_t1 = df.select(col("Team1").alias("Team"))
df_t2 = df.select(col("Team2").alias("Team"))
df_all_teams = df_t1.union(df_t2)
df_all_teams_Agg = df_all_teams.groupby("Team").agg(count("Team").alias("Total_matchs"))
df_all_teams_Agg.show(truncate=False)

df_winner = df.groupby("Winner").agg(count("Winner").alias("Total_Won_matchs"))
df_winner.show(truncate=False)

df_all = df_all_teams_Agg.join(df_winner, df_all_teams_Agg.Team == df_winner.Winner, 'left') \
    .select(df_all_teams_Agg.Team, df_all_teams_Agg.Total_matchs, df_winner.Total_Won_matchs)
df_all = df_all.fillna(0)
df_all = df_all.withColumn("Total_Loss_match", df_all["Total_matchs"] - df_all["Total_Won_matchs"])
df_all.show(truncate=False)

==============================================================================

If 199 out of 200 partitions in a Spark job complete but, after an hour, the last task fails with an error, what would you do?

This is usually a data-skew problem: the last one or two tasks keep running on a heavily skewed partition while the other executors have already finished their tasks. The idea is to redistribute the data more evenly, for example by repartitioning or salting the skewed key.
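
A minimal PySpark sketch of the salting idea, assuming a DataFrame df that is skewed on a customer_id key used in a groupBy aggregation; the column names, the amount field, and the salt count here are illustrative, not from the original post:

from pyspark.sql.functions import floor, rand, sum as _sum

NUM_SALTS = 10  # illustrative value; tune to the degree of skew

# Spread rows of each hot key across NUM_SALTS sub-keys so they land in different partitions.
salted = df.withColumn("salt", floor(rand() * NUM_SALTS))

# Aggregate per (customer_id, salt) first, then combine the partial results per customer_id.
partial = salted.groupBy("customer_id", "salt").agg(_sum("amount").alias("partial_amount"))
result = partial.groupBy("customer_id").agg(_sum("partial_amount").alias("total_amount"))
result.show()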

====================================================================================

Data Engineer Interview Question:


======================
Input:
111,"d1,d2,d3"
112,"d1,d4"
113,"d5,d6,d1"

1. Create a DataFrame.


2. Transform it into the below format; this is the required output:
id,dept
111,d1
111,d2
111,d3
112,d1
112,d4
113,d5
113,d6
113,d1

val df = sparkSession.read.option("delimiter", ",")
  .csv(InputFile)

val df2 = df.withColumn("id", col("_c0"))
  .withColumn("dept", explode(functions.split(col("_c1"), ",")))
  .drop("_c1").drop("_c0")
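
For reference, a PySpark sketch of the same split-and-explode approach, assuming the rows are created inline rather than read from a file:

from pyspark.sql.functions import col, explode, split

data = [(111, "d1,d2,d3"), (112, "d1,d4"), (113, "d5,d6,d1")]
df = spark.createDataFrame(data, ["id", "depts"])

# One output row per department in the comma-separated list.
df_out = df.select(col("id"), explode(split(col("depts"), ",")).alias("dept"))
df_out.show()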

Feel free to add answers in the comments. Stay tuned for the answer.

===========
