
Ad Hoc Guidelines

1. Developers must follow the defined directory structure when creating directories on the edge node and in HDFS.

2. Unused files and folders must be cleaned up once or twice a week.

3. Use a temp folder for intermediate data in Spark/Hive jobs. Once the process is completed, the program should remove the temp folder it created (see the Scala sketch after this list).

4. Do not keep files in the sudo user's home directory; keep them in the designated project folder instead.

5. When creating backup tables in Hive, follow the naming convention (see the example after this list). Example:

<hive_tablename_bkp_yyyymmdd>

6. Backup tables older than 2 months may be deleted by the admin team.
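
A minimal sketch of guideline 3, assuming a Scala Spark job; the path, app name, and object name are illustrative:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    object TempCleanupJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("temp-cleanup-example").getOrCreate()
        // Illustrative temp location; follow the agreed directory structure (guideline 1).
        val tempDir = new Path("/data/project/temp/job_run_001")
        val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
        try {
          // ... write intermediate results under tempDir here ...
        } finally {
          // Remove the created temp folder (recursively) once the process is complete.
          if (fs.exists(tempDir)) fs.delete(tempDir, true)
          spark.stop()
        }
      }
    }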
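
For guideline 5, a backup of an orders table taken on 15 Jan 2024 might look like this (database and table names are illustrative):

    CREATE TABLE sales_db.orders_bkp_20240115 AS
    SELECT * FROM sales_db.orders;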

Spark Guidelines

1. Use the right YARN queue: check which queue the spark-submit parameter --queue specifies, and make sure the job is not running in the default queue (see the spark-submit example after this list).
2. Check that every Spark context is stopped once the job finishes; verify that sparkContext.stop() is present in the code.
3. If any spark-shell is left open, kill the Spark session; do not keep it in an idle state.
4. If the same DataFrame is reused, persist it with persist(StorageLevel.MEMORY_AND_DISK_SER_2) (see the Scala sketch after this list).
5. Executor and memory allocations should be set explicitly in the spark-submit command when running Spark jobs, as in the spark-submit example below.
6. Avoid df.count() whenever it is not necessary, since it triggers a full job just to materialize the count.
7. Use Parquet or ORC format when saving data into Hive tables through Spark.
8. When joining a small table with a large table, use a broadcast join on the small table.
9. Use shared variables (broadcast variables and accumulators) whenever necessary.
10. When writing queries, retrieve only the columns relevant to the query instead of using SELECT * to get all columns.
11. Repartitioning causes a shuffle, and a shuffle is an expensive operation, so repartition() should be evaluated on a per-application basis.
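
A hedged example for guidelines 1 and 5: a spark-submit invocation with an explicit (non-default) queue and explicit executor and memory settings. The queue name, resource sizes, class, and jar are illustrative, not prescribed values:

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --queue etl_queue \
      --num-executors 10 \
      --executor-cores 4 \
      --executor-memory 8g \
      --driver-memory 4g \
      --class com.example.MyJob \
      my-job.jar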
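
A minimal Scala sketch covering guidelines 2, 4, 7, 8, and 10; the database, table, and column names are illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast
    import org.apache.spark.storage.StorageLevel

    object SparkGuidelinesSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("guidelines-sketch")
          .enableHiveSupport()
          .getOrCreate()
        try {
          // Guideline 10: select only the columns the query needs.
          val orders = spark.table("sales_db.orders").select("order_id", "cust_id", "amount")
          val customers = spark.table("sales_db.customers").select("cust_id", "region")

          // Guideline 4: persist a DataFrame that is reused more than once.
          orders.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

          // Guideline 8: broadcast the small table when joining it with a large one.
          val joined = orders.join(broadcast(customers), "cust_id")

          // Guideline 7: save into Hive in Parquet format.
          joined.write.format("parquet").mode("overwrite").saveAsTable("sales_db.orders_with_region")

          orders.unpersist()
        } finally {
          // Guideline 2: stop the context once the job is finished.
          spark.stop()
        }
      }
    }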

Hive Guidelines

1. Always specify a LOCATION when creating a Hive table (see the HiveQL sketch after this list).
2. Use exact column names in the SELECT statement instead of SELECT *.
3. Use partition columns in the WHERE clause so Hive can prune partitions.
4. Use the Tez engine instead of the MR engine for Hive performance optimization.
5. Vectorization in Hive: to improve the performance of operations such as scans, aggregations, filters, and joins, use vectorized query execution, which processes batches of 1024 rows at a time instead of a single row.
6. For temporary calculations or temporary data checks, use internal (managed) tables.
7. Do not store data in text file or sequence file formats, as they occupy more space.
8. Use MapJoin whenever necessary (hive.auto.convert.join).
9. Avoid locking of tables: it is extremely important to make sure that the tables used as sources in any Hive query are not being used by another process.
10. Avoid calculated fields in JOIN and WHERE clauses (see the rewrite example after this list).
11. Use SORT BY instead of ORDER BY when a total ordering is not required; ORDER BY funnels all data through a single reducer.
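
A hedged HiveQL sketch of guidelines 1, 3, 4, 5, 8, and 11; the database, table, path, and column names are illustrative:

    -- Guideline 1: give an explicit location when creating a Hive table.
    CREATE EXTERNAL TABLE sales_db.orders (
      order_id BIGINT,
      cust_id  BIGINT,
      amount   DECIMAL(10,2)
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC
    LOCATION '/data/sales_db/orders';

    -- Guidelines 4, 5, 8: Tez engine, vectorized execution, automatic map joins.
    SET hive.execution.engine=tez;
    SET hive.vectorized.execution.enabled=true;
    SET hive.auto.convert.join=true;

    -- Guidelines 2, 3, 11: explicit columns, a partition column in WHERE, SORT BY.
    SELECT order_id, cust_id, amount
    FROM sales_db.orders
    WHERE order_date = '2024-01-15'
    SORT BY cust_id;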
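
For guideline 10, a function applied to a column in the WHERE clause must run for every row and can block partition pruning; comparing the raw column against constants avoids that (illustrative query):

    -- Slower: calculated field in WHERE.
    --   SELECT order_id FROM sales_db.orders WHERE substr(order_date, 1, 4) = '2024';
    -- Faster: range predicate on the raw partition column.
    SELECT order_id
    FROM sales_db.orders
    WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';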
