
Ad Hoc Guidelines

1. Developers must follow the defined directory structure when creating directories on the edge node and in HDFS.

2. Unused files and folders must be cleaned up once or twice a week.

3. Use a temp folder for intermediate data in Spark/Hive jobs. Once the process is completed, the program should remove the temp folder it created (see the Scala sketch after this list).

4. Do not keep files in the sudo user's home directory; keep them in the designated project folder instead.

5. When creating backup tables in Hive, follow the naming convention (see the example after this list). Example:

<hive_tablename_bkp_yyyymmdd>

6. Backup tables older than 2 months may be deleted by the admin team.
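
A minimal sketch of guideline 3, assuming a Scala Spark job; the path, app name, and object name are illustrative:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    object TempCleanupJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("temp-cleanup-example").getOrCreate()
        // Illustrative temp location; follow the agreed directory structure (guideline 1).
        val tempDir = new Path("/data/project/temp/job_run_001")
        val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
        try {
          // ... write intermediate results under tempDir here ...
        } finally {
          // Remove the created temp folder (recursively) once the process is complete.
          if (fs.exists(tempDir)) fs.delete(tempDir, true)
          spark.stop()
        }
      }
    }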
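
For guideline 5, a backup of an orders table taken on 15 Jan 2024 might look like this (database and table names are illustrative):

    CREATE TABLE sales_db.orders_bkp_20240115 AS
    SELECT * FROM sales_db.orders;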

Spark Guidelines

1. Use the right YARN queue: check which queue the spark-submit parameter --queue specifies, and make sure the job is not running in the default queue (see the spark-submit example after this list).
2. Check that every Spark context is stopped once the job finishes; verify that sparkContext.stop() is present in the code.
3. If any spark-shell is left open, kill the Spark session; do not keep it in an idle state.
4. If the same DataFrame is reused, persist it with persist(StorageLevel.MEMORY_AND_DISK_SER_2) (see the Scala sketch after this list).
5. Executor and memory allocations should be set explicitly in the spark-submit command when running Spark jobs, as in the spark-submit example below.
6. Avoid df.count() whenever it is not necessary, since it triggers a full job just to materialize the count.
7. Use Parquet or ORC format when saving data into Hive tables through Spark.
8. When joining a small table with a large table, use a broadcast join on the small table.
9. Use shared variables (broadcast variables and accumulators) whenever necessary.
10. When writing queries, retrieve only the columns relevant to the query instead of using SELECT * to get all columns.
11. Repartitioning causes a shuffle, and a shuffle is an expensive operation, so repartition() should be evaluated on a per-application basis.
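
A hedged example for guidelines 1 and 5: a spark-submit invocation with an explicit (non-default) queue and explicit executor and memory settings. The queue name, resource sizes, class, and jar are illustrative, not prescribed values:

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --queue etl_queue \
      --num-executors 10 \
      --executor-cores 4 \
      --executor-memory 8g \
      --driver-memory 4g \
      --class com.example.MyJob \
      my-job.jar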
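
A minimal Scala sketch covering guidelines 2, 4, 7, 8, and 10; the database, table, and column names are illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast
    import org.apache.spark.storage.StorageLevel

    object SparkGuidelinesSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("guidelines-sketch")
          .enableHiveSupport()
          .getOrCreate()
        try {
          // Guideline 10: select only the columns the query needs.
          val orders = spark.table("sales_db.orders").select("order_id", "cust_id", "amount")
          val customers = spark.table("sales_db.customers").select("cust_id", "region")

          // Guideline 4: persist a DataFrame that is reused more than once.
          orders.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

          // Guideline 8: broadcast the small table when joining it with a large one.
          val joined = orders.join(broadcast(customers), "cust_id")

          // Guideline 7: save into Hive in Parquet format.
          joined.write.format("parquet").mode("overwrite").saveAsTable("sales_db.orders_with_region")

          orders.unpersist()
        } finally {
          // Guideline 2: stop the context once the job is finished.
          spark.stop()
        }
      }
    }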

Hive Guidelines

1. Always specify a LOCATION when creating a Hive table (see the HiveQL sketch after this list).
2. Use exact column names in the SELECT statement instead of SELECT *.
3. Use partition columns in the WHERE clause so Hive can prune partitions.
4. Use the Tez engine instead of the MR engine for Hive performance optimization.
5. Vectorization in Hive: to improve the performance of operations such as scans, aggregations, filters, and joins, use vectorized query execution, which processes batches of 1024 rows at a time instead of a single row.
6. For temporary calculations or temporary data checks, use internal (managed) tables.
7. Do not store data in text file or sequence file formats, as they occupy more space.
8. Use MapJoin whenever necessary (hive.auto.convert.join).
9. Avoid locking of tables: it is extremely important to make sure that the tables used as sources in any Hive query are not being used by another process.
10. Avoid calculated fields in JOIN and WHERE clauses (see the rewrite example after this list).
11. Use SORT BY instead of ORDER BY when a total ordering is not required; ORDER BY funnels all data through a single reducer.
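
A hedged HiveQL sketch of guidelines 1, 3, 4, 5, 8, and 11; the database, table, path, and column names are illustrative:

    -- Guideline 1: give an explicit location when creating a Hive table.
    CREATE EXTERNAL TABLE sales_db.orders (
      order_id BIGINT,
      cust_id  BIGINT,
      amount   DECIMAL(10,2)
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC
    LOCATION '/data/sales_db/orders';

    -- Guidelines 4, 5, 8: Tez engine, vectorized execution, automatic map joins.
    SET hive.execution.engine=tez;
    SET hive.vectorized.execution.enabled=true;
    SET hive.auto.convert.join=true;

    -- Guidelines 2, 3, 11: explicit columns, a partition column in WHERE, SORT BY.
    SELECT order_id, cust_id, amount
    FROM sales_db.orders
    WHERE order_date = '2024-01-15'
    SORT BY cust_id;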
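
For guideline 10, a function applied to a column in the WHERE clause must run for every row and can block partition pruning; comparing the raw column against constants avoids that (illustrative query):

    -- Slower: calculated field in WHERE.
    --   SELECT order_id FROM sales_db.orders WHERE substr(order_date, 1, 4) = '2024';
    -- Faster: range predicate on the raw partition column.
    SELECT order_id
    FROM sales_db.orders
    WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';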
