SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO:
it handles both serialization and deserialization, and it also interprets the
results of deserialization as individual fields for processing.
Reading from HDFS
HDFS files –> InputFileFormat –> <key, value> –> Deserializer –> Row object
Writing to HDFS
Row object –> Serializer –> <key, value> –> OutputFileFormat –> HDFS files
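The two pipelines above map onto clauses of a table definition. A minimal sketch, assuming a hypothetical table name (the format and SerDe class names are the standard Hive/Hadoop ones):

```sql
-- Each pipeline stage corresponds to a clause of the CREATE TABLE:
-- InputFileFormat/OutputFileFormat come from STORED AS,
-- the (de)serializer comes from ROW FORMAT SERDE.
CREATE TABLE demo_tb (      -- illustrative table name
  id INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```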
TextInputFormat/HiveIgnoreKeyTextOutputFormat
This pair reads/writes data in plain-text file format.
SequenceFileInputFormat/SequenceFileOutputFormat
This pair reads/writes data in Hadoop SequenceFile format.
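Hive provides a `STORED AS` shorthand that selects this format pair; a minimal sketch with an illustrative table name:

```sql
-- STORED AS SEQUENCEFILE selects SequenceFileInputFormat
-- and SequenceFileOutputFormat for the table.
CREATE TABLE seq_tb (
  id INT,
  name STRING
)
STORED AS SEQUENCEFILE;
```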
MetadataTypedColumnsetSerDe
We use this Hive SerDe to read/write delimited records such as CSV,
tab-separated, or Control-A-separated records (quoted fields are not
supported yet).
LazySimpleSerDe
This SerDe reads the same data formats as MetadataTypedColumnsetSerDe.
Moreover, it creates objects in a lazy way, which offers better
performance.
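LazySimpleSerDe is what Hive uses for ordinary delimited text tables; a minimal sketch, assuming an illustrative table name:

```sql
-- ROW FORMAT DELIMITED tables are handled by LazySimpleSerDe,
-- which only materializes a field when the query actually reads it.
CREATE TABLE emp_tb (
  id INT,
  name STRING,
  salary FLOAT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```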
Built-in SerDes
Avro
ORC
Parquet
CSV
JsonSerDe
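For the columnar formats above, `STORED AS` picks the matching SerDe automatically; JSON needs the SerDe named explicitly. A minimal sketch with illustrative table names:

```sql
-- STORED AS ORC (or PARQUET, AVRO) selects the matching built-in SerDe.
CREATE TABLE orc_tb (id INT, name STRING) STORED AS ORC;

-- JSON records are read with the bundled JsonSerDe class.
CREATE TABLE json_tb (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```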
create table wc_csvserde_tb(
id int,
name string,
salary float,
address string,
city string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"quotechar"="\"")
stored as textfile
tblproperties('skip.header.line.count'='1');
hadoop@hadoop:~$ ls /home/hadoop/Desktop/wc_session.csv
/home/hadoop/Desktop/wc_session.csv
hive (bwt_session)> desc wc_csvserde_tb;
OK
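The session would typically continue by loading the CSV shown above into the table and querying it; a sketch (the path comes from the `ls` output, and the query result is omitted since it depends on the file's contents):

```sql
-- Load the local CSV into the CSV-SerDe table, then inspect a few rows.
LOAD DATA LOCAL INPATH '/home/hadoop/Desktop/wc_session.csv'
OVERWRITE INTO TABLE wc_csvserde_tb;

SELECT * FROM wc_csvserde_tb LIMIT 5;
```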