• Please log in 10 minutes before the class starts and check your internet connection to avoid network issues during the LIVE session.
• All participants are muted by default to avoid background noise. The instructor will unmute you if required. Use the “Questions” tab in your webinar tool to interact with the instructor at any point during the class.
• Feel free to ask and answer questions to make your learning interactive. The instructor will address your queries at the end of the ongoing topic.
• To connect to your Personal Learning Manager (PLM), dial +917618772501.
• A dedicated support team is available to assist with your queries. You can reach us anytime on the numbers below:
US: 1855 818 0063 (Toll-Free) | India: +91 9019117772
• Your feedback is much appreciated. Please share feedback after each class to help us enhance your learning experience.
▪ Started at Facebook
▪ Data was collected by nightly cron jobs into Oracle DB
▪ “ETL” via hand-coded Python
▪ Grew from tens of GBs (2006) to 1 TB/day of new data (2007), now 10x that
▪ Allows programmers to plug in custom mappers and reducers
▪ Provides tools to enable easy data ETL
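As a hedged sketch of what plugging in a custom mapper can look like, HiveQL's TRANSFORM clause streams rows through an external program. The table page_views and its columns are hypothetical, and /bin/cat is only a stand-in for a real script: it echoes its input unchanged.

```sql
-- Hypothetical table, created only so the example stands alone.
CREATE TABLE IF NOT EXISTS page_views (userid BIGINT, page_url STRING);

-- Stream each row through an external program acting as the "mapper".
-- Columns reach the program as tab-separated text on stdin, and its
-- stdout is split back into the columns named after AS.
-- /bin/cat is a placeholder that returns every row unchanged.
SELECT TRANSFORM (userid, page_url)
       USING '/bin/cat'
       AS (userid, page_url)
FROM page_views;
```

A real mapper would replace /bin/cat with a script shipped to the cluster, for example one registered with ADD FILE.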
▪ Data Mining
▪ Customer-facing Business Intelligence
▪ Predictive Modeling, Hypothesis Testing
[Figure: Hive architecture. Shell, JDBC and ODBC clients; Driver (compiles, optimizes, executes); Metastore; Hadoop (Master, Resource Manager, Name Node, DFS).]
[Figure: Hive components. Driver, Compiler, Execution Engine, Metastore.]
[Figure: Metastore configurations. Local metastore (Driver, Metastore, MySQL) and remote metastore (Driver, Metastore Server JVM, MySQL).]
▪ Hive does not verify the data when it is loaded, but rather when a query is issued.
▪ Schema on read makes for a very fast initial load, since the data does not have to be read,
parsed and serialized to disk in the database’s internal format. The load operation is just a file
copy or move.
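A minimal sketch of schema on read, assuming a hypothetical comma-delimited file at /tmp/page_views.csv in HDFS and a hypothetical table page_views_raw:

```sql
-- Hypothetical table; the schema is only metadata at this point.
CREATE TABLE IF NOT EXISTS page_views_raw (
  userid   INT,
  page_url STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- The load moves the file into the table's directory.
-- Nothing is parsed or validated here, which is why it is fast.
LOAD DATA INPATH '/tmp/page_views.csv' INTO TABLE page_views_raw;

-- The schema is applied now, at query time. A field that cannot be
-- read as INT (for example the text 'abc') comes back as NULL
-- instead of failing the load.
SELECT userid, page_url FROM page_views_raw LIMIT 10;
```

The trade-off is that malformed records surface only when a query reads them, so the NULLs in the result are the first sign of bad input.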
Primitive Types
▪ Integers: TINYINT (1-byte integer), SMALLINT (2-byte integer), INT (4-byte integer), BIGINT (8-byte integer)
▪ Boolean type: BOOLEAN (TRUE/FALSE)
Databases
Tables
▪ Schemas in namespaces
Partitions
Buckets or Clusters
▪ Partitions divided further into buckets based on some other column
▪ Used for data sampling
[Figure: example table with columns timestamp, Userid, referer_url, page_url, IP, divided into partitions and then into buckets.]
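The hierarchy above can be sketched in HiveQL roughly as follows. The column names follow the example table on the slide; the partition column dt, the bucket count of 32, and the date value are assumptions made for illustration.

```sql
-- view_time replaces the slide's "timestamp" column name to avoid
-- clashing with the TIMESTAMP type keyword.
CREATE TABLE IF NOT EXISTS page_view (
  view_time   BIGINT,
  userid      BIGINT,
  referer_url STRING,
  page_url    STRING,
  ip          STRING
)
PARTITIONED BY (dt STRING)               -- one HDFS subdirectory per dt value
CLUSTERED BY (userid) INTO 32 BUCKETS;   -- rows hashed on userid into 32 buckets per partition

-- Sampling: scan a single bucket, roughly 1/32 of the partition's rows.
-- The date is an arbitrary example value.
SELECT userid, page_url
FROM page_view TABLESAMPLE (BUCKET 1 OUT OF 32 ON userid)
WHERE dt = '2008-06-08';
```

Filtering on dt lets Hive skip every other partition directory, and sampling on the bucketing column lets it read one bucket instead of the whole partition.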
External tables: create the table in another HDFS location rather than in the warehouse directory
▪ For an external table, Hive is not responsible for managing the data
▪ CREATE EXTERNAL TABLE external_Table (dummy STRING) LOCATION 'path/to/hdfs/directory';
▪ Hive does not delete the table's data (the HDFS files) when the table is dropped
▪ It leaves the data untouched; only the metadata about the table is deleted
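The drop behaviour can be checked with a short sequence like this one. The location is a hypothetical HDFS directory assumed to hold data files already.

```sql
-- Hypothetical location; Hive records it but does not take ownership of the files.
CREATE EXTERNAL TABLE IF NOT EXISTS external_table (dummy STRING)
LOCATION '/data/hypothetical/external_table';

-- The "Table Type" row of the output reads EXTERNAL_TABLE.
DESCRIBE FORMATTED external_table;

-- Removes only the metastore entry; the files under the location stay in place.
DROP TABLE external_table;

-- Re-creating the table over the same location makes the same data queryable again.
CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/data/hypothetical/external_table';

SELECT COUNT(*) FROM external_table;
```

Dropping a managed (non-external) table would instead delete both the metadata and the files in the warehouse directory.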
http://www.edureka.co/blog/hive-commands/
▪ The script runs and executes all the queries one by one in a single run.
▪ The final output is saved in the /user/hive/warehouse/healthdb.db/healthcaresampledsdeidentified directory.
http://www.edureka.co/blog/apache-hadoop-hive-script/
2   edureka@2.com   EN   GB
3   edureka@3.com   FR   FR

Transaction Table
Id   Product Id   UserId   Purchase Amount   Item Description
1    Prod-1       1        300               A jumper
2    Prod-1       2        300               A jumper
3    Prod-1       2        300               A jumper
4    Prod-2       3        100               A rubber chicken
5    Prod-1       3        300               A jumper
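The slides show these tables without the query that accompanies them. Assuming the intent is to relate each transaction to its customer through UserId, a join along the following lines would do it. The table names and every customer column name (id, email, lang_code, country_code) are guesses, since the slides do not give them.

```sql
-- Hypothetical DDL: the customer table's name and columns are not in the slides.
CREATE TABLE IF NOT EXISTS customers (
  id           INT,
  email        STRING,
  lang_code    STRING,
  country_code STRING
);

CREATE TABLE IF NOT EXISTS transactions (
  id               INT,
  product_id       STRING,
  user_id          INT,
  purchase_amount  INT,
  item_description STRING
);

-- Inner join: one output row per transaction that has a matching customer.
SELECT c.id, c.email, t.product_id, t.purchase_amount
FROM customers c
JOIN transactions t ON c.id = t.user_id;

-- Left outer join: also keeps customers with no transactions,
-- with the transaction columns returned as NULL.
SELECT c.id, c.email, t.product_id, t.purchase_amount
FROM customers c
LEFT OUTER JOIN transactions t ON c.id = t.user_id;
```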
[Figure: a Hive script reads data from a Hive table in HDFS, de-identifies columns, and stores the data back in a Hive table.]
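A rough sketch of the read, de-identify, store flow, using Hive's built-in sha2 function (available from Hive 1.3.0) as a stand-in for whatever de-identification routine the course script uses. The source table healthcare_sample and all column names are hypothetical; only the database and output table names come from the output path given earlier.

```sql
CREATE DATABASE IF NOT EXISTS healthdb;
USE healthdb;

-- Hypothetical source table and columns.
CREATE TABLE IF NOT EXISTS healthcare_sample (
  patient_id   INT,
  patient_name STRING,
  email        STRING,
  disease      STRING
);

-- Target table; identifying columns become strings because they hold hashes.
CREATE TABLE IF NOT EXISTS healthcaresampledsdeidentified (
  patient_id   STRING,
  patient_name STRING,
  email        STRING,
  disease      STRING
);

-- Read from the source, hash the identifying columns with SHA-256,
-- and write the result back into a Hive table.
INSERT OVERWRITE TABLE healthcaresampledsdeidentified
SELECT sha2(CAST(patient_id AS STRING), 256),
       sha2(patient_name, 256),
       sha2(email, 256),
       disease
FROM healthcare_sample;
```

An unsalted hash is shown only for brevity; low-entropy values such as short IDs can be recovered by brute force, so a production script would likely use a keyed or encrypted scheme.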