Loader
NOTICE
This document was generated from Huawei study material. Treat the information in this document as supporting material.
• Loader is a data loading tool. Functional enhancements have been made to Loader based on the open-source Sqoop.
• Loader is used to exchange data and files between FusionInsight Hadoop, relational databases, and file systems.
• This shows the position of Loader in FusionInsight: it exchanges data and files between FusionInsight Hadoop, relational databases, and file systems.
Features of Loader
• Loader manages jobs on the WebUI and also provides CLIs (command-line interfaces) to meet customer requirements for program scheduling and script automation.
• Loader uses MapReduce for parallel data processing. Job parameters affect how MapReduce splits the input, so proper parameter configuration is required to ensure optimal data import performance.
• In addition, Loader servers are deployed in active/standby mode, which ensures high reliability.
• Secure Loader versions are configured in a unified manner by FusionInsight Manager.
• This is a web page of FusionInsight Manager, where we can see the health status of the Loader components. We can also perform operations such as starting or stopping services and downloading clients.
• Click LoaderServer (active) here to go to the Loader job management page.
• Loader provides various job conversion rules for cleaning data and converting it into the target data structure in different service scenarios. If no conversion is required in the actual application, no conversion rules need to be specified, except for the conversion operator mentioned here.
• Loader also provides the following operators:
o EL Operation: specifies an algorithm to calculate field values.
o String Operation: converts the case (upper/lower) of existing fields to generate new fields.
- String Reverse: reverses existing string fields to generate new fields.
- String Trim: clears spaces contained in existing string fields to generate new fields.
o Filter Rows: filters rows that meet the configured logical trigger conditions.
o Update Fields: updates field values when certain conditions are met.
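The operators above can be sketched as small record-transformation functions. This is a minimal illustration of their behavior, not Loader's actual API; all function and field names are invented for the example.

```python
# Hedged sketch of the conversion operators described above; the record is
# modeled as a dict of field name -> value, which is our assumption.
def string_reverse(record, src, dst):
    record[dst] = record[src][::-1]          # new field from reversed string
    return record

def string_trim(record, src, dst):
    record[dst] = record[src].strip()        # clear surrounding spaces
    return record

def filter_rows(records, condition):
    # keep only rows matching the configured logical condition
    return [r for r in records if condition(r)]

def update_fields(record, condition, field, value):
    if condition(record):                    # update only when condition is met
        record[field] = value
    return record

row = {"name": "  alice  "}
row = string_trim(row, "name", "name_clean")
row = string_reverse(row, "name_clean", "name_rev")
print(row["name_clean"], row["name_rev"])    # alice ecila

people = [{"age": 17}, {"age": 30}]
print(filter_rows(people, lambda r: r["age"] >= 18))  # [{'age': 30}]
```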
• Then, how do we create a Loader job? Here is an example of loading an SFTP file into HDFS.
• To create a new job, the first step is to configure the basic information, including Name, Type, Connection, Group, Queue, and Priority.
• Name identifies the new job and must be unique.
• Type specifies whether the job imports or exports data.
• Connection provides data source connection information for the new job. If no suitable connection is available, click Add here to create one.
• Group indicates the job group, and Queue indicates the Yarn queue to which the new job belongs.
• Priority indicates the priority level of the job in the Yarn queue.
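The basic information above can be pictured as a simple configuration record. The keys mirror the WebUI fields; every value here is a made-up example, not something from a real Loader deployment.

```python
# Illustrative representation of step 1 (basic job information); all
# values are invented placeholders for the example SFTP-to-HDFS job.
job_basic_info = {
    "name": "sftp_to_hdfs_demo",   # must be unique among Loader jobs
    "type": "IMPORT",              # whether the job imports or exports data
    "connection": "sftp_conn_01",  # existing data source connection
    "group": "default",            # job group
    "queue": "root.default",       # Yarn queue the job is submitted to
    "priority": "NORMAL",          # priority level within the Yarn queue
}
print(job_basic_info["name"])
```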
• Next, configure the input information, including Input path, File split type, Filter type, Path filter, File filter, Encode type, Suffix name, and Compression.
• Input path can be a directory or the name of the source file.
• File split type can be set to FILE or SIZE. When the parameter is set to FILE, a file is not split and is processed by a single Map job; the file name and content remain unchanged during the data read. When the parameter is set to SIZE, the file structure is changed: a file is split into multiple segments, which are read by different Map jobs.
• Usually, FILE is recommended when the original file name needs to be preserved. SIZE is recommended when the original file name does not need to be preserved or when a very large file is to be processed.
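The difference between the two split types can be sketched as two planning strategies. This is an illustration of the concept, not Loader's internal implementation; segment sizes and function names are assumptions.

```python
# Illustrative sketch: FILE mode assigns each whole file to one Map job;
# SIZE mode cuts files into fixed-size segments for different Map jobs.
def plan_file_mode(files):
    """files: dict of name -> size; one task per file, file kept intact."""
    return [(name, 0, size) for name, size in files.items()]

def plan_size_mode(files, segment_size):
    """Split every file into segment_size-byte ranges (offset, end)."""
    tasks = []
    for name, size in files.items():
        offset = 0
        while offset < size:
            end = min(offset + segment_size, size)
            tasks.append((name, offset, end))
            offset = end
    return tasks

files = {"a.log": 300, "b.log": 120}
print(plan_file_mode(files))        # 2 Map jobs, one per file
print(plan_size_mode(files, 100))   # 5 segments spread across Map jobs
```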
• Filter type indicates the file filtering criterion.
• Path filter is used with Filter type to specify the expression for filtering the directories in the input path of the source files. If there are multiple filter conditions, use commas to separate them. If the value is empty, the directories are not filtered.
• File filter is also used with Filter type to specify the expression for filtering the file names of the source files. If there are multiple filter conditions, use commas to separate them. The value cannot be empty.
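Comma-separated filter conditions can be sketched like this, assuming a wildcard-style Filter type. The matching logic below uses Python's `fnmatch` as a stand-in; it is not Loader's real evaluation code.

```python
# Sketch of comma-separated wildcard filtering; fnmatch is our assumed
# stand-in for whatever expression syntax the configured Filter type uses.
from fnmatch import fnmatch

def match_any(name, patterns):
    """True if name matches any comma-separated pattern; empty -> no filter."""
    if not patterns:
        return True
    return any(fnmatch(name, p.strip()) for p in patterns.split(","))

files = ["app.log", "app.tmp", "data.csv"]
print([f for f in files if match_any(f, "*.log,*.csv")])  # ['app.log', 'data.csv']
```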
• Encode type indicates the encoding format of a source file.
• Suffix name indicates the suffix added to a source file after the source file is imported.
• Compression indicates whether to compress data to reduce I/O resource consumption when an SFTP server is used for data transmission.
• Loader operators have three types: input operator, transform (conversion) operator, and output operator.
o Input operator: used in the first step of data conversion. This type of operator converts data into fields. Only one input operator can be used in each conversion.
o Conversion operator: used in the intermediate step of data conversion. This type of operator is optional, and conversion operators can be used together in any combination. Conversion operators can process only fields; therefore, an input operator must be used first to convert data into fields.
• The last step is to configure the output information, including Storage type, File type, Compression format, Output directory, File operate type, and Number.
• Storage type indicates the target data storage type, which can be HDFS, HBASE, or HIVE.
• File type indicates the file type in which data is imported. Available values are TEXT_FILE, BINARY_FILE (binary read byte), and SEQUENCE_FILE (sequence file).
• Compression format specifies the data compression format used after the data is imported to HDFS.
• Output directory indicates the target directory. If a file with the same name already exists in the target directory, Loader provides five operate types:
o OVERRIDE: overrides the old file.
o RENAME: renames the new file and imports it to the target directory.
o APPEND: appends the new file content to the old file.
o IGNORE: ignores the new file and keeps the old file.
o ERROR: generates an error report when a file has the same name as one in the target directory.
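The five operate types can be sketched as a conflict-resolution function. This is purely illustrative: the rename scheme and return values are invented, not Loader's actual behavior.

```python
# Minimal sketch of the five duplicate-name policies described above.
import os

def resolve_conflict(target_dir, existing, new_name, policy):
    """Return the (action, file name) for importing new_name."""
    if new_name not in existing:
        return ("write", new_name)                 # no conflict at all
    if policy == "OVERRIDE":
        return ("write", new_name)                 # replace the old file
    if policy == "RENAME":
        return ("write", new_name + ".1")          # invented rename scheme
    if policy == "APPEND":
        return ("append", new_name)                # add content to old file
    if policy == "IGNORE":
        return ("skip", new_name)                  # keep the old file
    if policy == "ERROR":
        raise FileExistsError(os.path.join(target_dir, new_name))

print(resolve_conflict("/out", {"a.txt"}, "a.txt", "IGNORE"))  # ('skip', 'a.txt')
```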
• Number indicates the number of Map jobs.
• Then go to the Loader job management page. This page displays all current jobs and their last execution status.
• Select a job and then click the History button in the Operation column to view the execution records of the specified job.
• When using Loader to transform data, data that does not meet the Loader conversion rules is called dirty data. Users can check it on the job history page. Dirty data is stored in HDFS.
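The dirty-data idea can be sketched as routing records that fail a conversion rule into a separate collection. This is a conceptual illustration only; in a real deployment, Loader writes dirty data to HDFS rather than keeping it in memory.

```python
# Hedged sketch: records failing a conversion rule become "dirty data".
def convert(records, rule):
    """Apply rule to each record; collect failures instead of aborting."""
    clean, dirty = [], []
    for r in records:
        try:
            clean.append(rule(r))
        except (ValueError, KeyError):
            dirty.append(r)        # does not meet the conversion rule
    return clean, dirty

rows = [{"age": "42"}, {"age": "n/a"}]
clean, dirty = convert(rows, lambda r: {"age": int(r["age"])})
print(len(clean), len(dirty))      # 1 1
```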