
1. How do you handle real-time pipelines? Suppose we have CSV files or structured data in an RDBMS, and some records get updated or inserted. Those changes should be reflected in the output table in the target. How will you handle real-time data using Databricks? (see the sketch below)
2. In which cases do we use a broadcast join? (see the sketch below)
3. What are the other ways to improve application performance?
4. How do you archive files that are more than 10 days old? (see the sketch below)
5. How will you roll back to an older version if required? (see the sketch below)
6. How do you remove duplicate records from a table without using distinct? (see the sketch below)
7. Explain cache and persist. (see the sketch below)
8. Which types of data are you getting from the client?
9. What is the difference between a Parquet file and a Delta table?
10. Are you storing your tables as managed tables or external tables?
11. What is the difference between a managed table and an external table? (see the sketch below)
12. What is the difference between dropping a managed table and dropping an external table?
13. How do you union two DataFrames?
14. While doing a union, if the 1st DataFrame has 4 columns and the 2nd DataFrame has 5 columns, will the union happen or not? (see the sketch below)
15. I want to create a new column and insert values into it based on some conditions. How can we do that? (see the sketch below)
16. How do you join 2 DataFrames?
17. How can we create a new column and add a rank to it using partitionBy? (see the sketch below)
18. How will you extract the current date in PySpark and Spark SQL? → current_date() (see the combined date-function sketch below)
19. How can we add 5 days to the current date? → df1 = df.withColumn("date_plus_5_days", date_add(df["current_date"], 5))
20. I want to extract the 1st day of the current month. How can we extract it? → trunc() with the 'month' argument
21. I want to extract the last day of the current month. How can we extract it? → last_day()
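
Below are short, hedged PySpark sketches for some of the questions above; all table names, paths, columns, and sample data are made up for illustration.

For question 1, one common pattern on Databricks is Auto Loader (cloudFiles) plus a MERGE into a Delta target inside foreachBatch, so inserts and updates from the source land incrementally in the output table. This is only one possible design; the paths, schema, and key below are hypothetical.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

def upsert_to_target(batch_df, batch_id):
    # MERGE each micro-batch into the Delta target on the business key (hypothetical: id).
    target = DeltaTable.forName(spark, "my_db.target_table")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

# Incrementally pick up new/changed CSV files with Auto Loader.
stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .schema("id INT, name STRING, amount DOUBLE")   # hypothetical schema
    .load("dbfs:/mnt/raw/incoming/"))

(stream.writeStream
    .foreachBatch(upsert_to_target)
    .option("checkpointLocation", "dbfs:/mnt/chk/target_table/")
    .trigger(availableNow=True)   # or a processingTime trigger for a continuously running job
    .start())
```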
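
For question 2, a broadcast join is used when one side of the join is small enough to fit in executor memory (typically a dimension or lookup table). A minimal sketch, assuming a large orders DataFrame and a small customers DataFrame, which also shows the basic join syntax asked in question 16:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: a large fact table and a small dimension table.
orders = spark.createDataFrame(
    [(1, "C1", 100.0), (2, "C2", 250.0), (3, "C1", 75.0)],
    ["order_id", "customer_id", "amount"],
)
customers = spark.createDataFrame(
    [("C1", "Alice"), ("C2", "Bob")],
    ["customer_id", "name"],
)

# broadcast() hints Spark to ship the small DataFrame to every executor,
# so the large side is joined locally without shuffling it.
joined = orders.join(broadcast(customers), on="customer_id", how="inner")
joined.show()
```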
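
For question 4, one possible approach on Databricks, assuming the files live under DBFS/mounted storage and `dbutils` is available in the notebook. The paths are hypothetical, and `FileInfo.modificationTime` (milliseconds) is only populated on recent Databricks runtimes:

```python
import time

# Hypothetical source and archive locations.
source_dir = "dbfs:/mnt/raw/incoming/"
archive_dir = "dbfs:/mnt/raw/archive/"

cutoff_ms = (time.time() - 10 * 24 * 60 * 60) * 1000  # 10 days ago, in milliseconds

# dbutils is only available inside a Databricks notebook or job.
for f in dbutils.fs.ls(source_dir):
    # Move files whose modification time is older than the cutoff.
    if not f.isDir() and f.modificationTime < cutoff_ms:
        dbutils.fs.mv(f.path, archive_dir + f.name)
```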
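
For question 5, assuming the target is a Delta table, rollback is typically done with Delta time travel or RESTORE; the table name and version number here are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect the table history to find the version to go back to.
spark.sql("DESCRIBE HISTORY my_db.orders").show(truncate=False)

# Read an older snapshot (time travel) without modifying the table.
old_snapshot = spark.sql("SELECT * FROM my_db.orders VERSION AS OF 3")

# Permanently roll the table back to that version.
spark.sql("RESTORE TABLE my_db.orders TO VERSION AS OF 3")
```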
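
For question 6, two common alternatives to distinct() are dropDuplicates() and a window function that keeps one row per key; the sample data and key columns are made up:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "A", "2024-01-01"), (1, "A", "2024-01-02"), (2, "B", "2024-01-01")],
    ["id", "name", "load_date"],
)

# Option 1: dropDuplicates on the business key.
deduped1 = df.dropDuplicates(["id", "name"])

# Option 2: keep the latest record per key using row_number().
w = Window.partitionBy("id", "name").orderBy(col("load_date").desc())
deduped2 = (
    df.withColumn("rn", row_number().over(w))
      .filter(col("rn") == 1)
      .drop("rn")
)
deduped2.show()
```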
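
For question 7, a short sketch of cache() versus persist(): cache() is persist() with the default storage level (MEMORY_AND_DISK for DataFrames in Spark 3.x), while persist() lets you choose the storage level explicitly:

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()

# cache() = persist() with the default storage level.
df = spark.range(1_000_000)
df.cache()
df.count()          # the first action materializes the cache

# persist() with an explicit storage level.
df2 = spark.range(1_000_000).persist(StorageLevel.MEMORY_AND_DISK)
df2.count()

# Release the cached data when it is no longer needed.
df.unpersist()
df2.unpersist()
```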
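
For questions 11 and 12, a small sketch contrasting a managed table (the metastore owns both metadata and data) with an external table (metadata only; data stays at the external path). Dropping a managed table deletes the data files as well; dropping an external table removes only the metadata. Table names and the path are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Managed table: data is stored in the metastore-managed location.
spark.sql("CREATE TABLE my_db.managed_orders (id INT, amount DOUBLE) USING DELTA")

# External table: only metadata is registered; data lives at the given path.
spark.sql("""
    CREATE TABLE my_db.external_orders (id INT, amount DOUBLE)
    USING DELTA
    LOCATION 'dbfs:/mnt/datalake/orders/'
""")

# DROP on the managed table removes metadata AND the underlying files;
# DROP on the external table removes only the metadata, and the files
# at dbfs:/mnt/datalake/orders/ remain.
spark.sql("DROP TABLE my_db.managed_orders")
spark.sql("DROP TABLE my_db.external_orders")
```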
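
For questions 13 and 14, union() matches columns by position and requires the same number of columns, so a plain union of a 4-column and a 5-column DataFrame fails; unionByName with allowMissingColumns=True (Spark 3.1+) is one way to handle the mismatch. The column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df4 = spark.createDataFrame([(1, "A", 10, "x")], ["id", "name", "qty", "flag"])
df5 = spark.createDataFrame([(2, "B", 20, "y", 99.9)], ["id", "name", "qty", "flag", "price"])

# Different column counts: a plain positional union raises an AnalysisException.
# df4.union(df5)   # would fail

# unionByName with allowMissingColumns=True fills the missing column with nulls.
combined = df4.unionByName(df5, allowMissingColumns=True)
combined.show()
```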
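
For question 15, the usual approach is withColumn() with when()/otherwise(); the condition and labels here are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 55), (2, 80), (3, 35)], ["id", "score"])

# New column whose value depends on conditions over existing columns.
df = df.withColumn(
    "grade",
    when(col("score") >= 75, "high")
    .when(col("score") >= 50, "medium")
    .otherwise("low"),
)
df.show()
```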
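
For question 17, a sketch using a window with partitionBy() and rank(); the data and columns are illustrative:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, rank

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("Sales", "Amit", 5000), ("Sales", "Neha", 7000), ("HR", "Ravi", 4000)],
    ["dept", "emp", "salary"],
)

# Rank employees within each department by salary (highest first).
w = Window.partitionBy("dept").orderBy(col("salary").desc())
df = df.withColumn("salary_rank", rank().over(w))
df.show()
```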
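
For questions 18 to 21, the built-in date functions named in the inline answers, shown together; the one-row DataFrame is just a driver for the expressions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date, date_add, trunc, last_day

spark = SparkSession.builder.getOrCreate()

df = spark.range(1).select(
    current_date().alias("today"),                            # Q18: current date
    date_add(current_date(), 5).alias("plus_5_days"),         # Q19: current date + 5 days
    trunc(current_date(), "month").alias("first_of_month"),   # Q20: 1st day of the month
    last_day(current_date()).alias("last_of_month"),          # Q21: last day of the month
)
df.show()

# Spark SQL equivalents:
spark.sql("""
    SELECT current_date()                  AS today,
           date_add(current_date(), 5)     AS plus_5_days,
           trunc(current_date(), 'month')  AS first_of_month,
           last_day(current_date())        AS last_of_month
""").show()
```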

Link: https://chat.openai.com/share/dfc92d08-7683-48d6-8b65-23bec4a43da7
