DATAWAREHOUSE AND AUTOMATION
All about knowledge of datawarehouse and automation of routine tasks
ETL : JOB CONTROL TABLE AND ITS IMPLEMENTATION FOR INCREMENTAL LOAD.
A job control table is used with ETL tools such as Informatica, DataStage and SSIS to pick up the data inserted or updated since the last run of an ETL job. The diagram below was drawn with Informatica as the ETL tool; the same design can be implemented in other ETL tools with some modifications.

Tables used are as below:

1. ETL Batch table.

2. ETL Control table.


To view the diagram clearly, save the image to your local disk and zoom in.

Initial values in the ETL Control table: for every dataflow job, one row is inserted into the ETL Control table with process name = <name of the job> and with both the High and Low Watermark dates set to 1/1/1900 0:00 (midnight). This can be done once, as part of the deployment script.

ETL_Control_Table

Batch_ID  Job_Name        Process Start Date  Process End Date  LWM Date       HWM Date       Process Status Code  Process Status Description  Failure Reason
-1        wf_Appointment  NULL                NULL              1/1/1900 0:00  1/1/1900 0:00
-1        wf_Patient      NULL                NULL              1/1/1900 0:00  1/1/1900 0:00

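
The one-time seeding step could look roughly like the sketch below. It is only an illustration, not the original deployment script: it assumes SQLite as a stand-in database, stores dates as ISO strings, and uses a hypothetical helper name (`seed_control_table`). The check before each insert makes the script safe to run more than once.

```python
import sqlite3

# 1/1/1900 0:00 written as an ISO string (an assumption made for SQLite).
INITIAL_WATERMARK = "1900-01-01 00:00:00"


def seed_control_table(conn, job_names):
    """Insert one initial row (Batch_ID = -1) per job, skipping existing ones."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ETL_Control_Table (
               Batch_ID INTEGER, Process_Name TEXT,
               Process_Start_DateTime TEXT, Process_End_DateTime TEXT,
               LWMDate TEXT, HWMDate TEXT,
               Process_Status_Code TEXT, Process_Status_Description TEXT,
               Failure_Reason TEXT)"""
    )
    for name in job_names:
        exists = conn.execute(
            "SELECT 1 FROM ETL_Control_Table "
            "WHERE Process_Name = ? AND Batch_ID = -1",
            (name,),
        ).fetchone()
        if exists is None:
            conn.execute(
                "INSERT INTO ETL_Control_Table "
                "(Batch_ID, Process_Name, LWMDate, HWMDate) VALUES (-1, ?, ?, ?)",
                (name, INITIAL_WATERMARK, INITIAL_WATERMARK),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
seed_control_table(conn, ["wf_Appointment", "wf_Patient"])
seed_control_table(conn, ["wf_Appointment", "wf_Patient"])  # second run adds nothing
seed_rows = conn.execute(
    "SELECT Batch_ID, Process_Name, LWMDate, HWMDate "
    "FROM ETL_Control_Table ORDER BY Process_Name"
).fetchall()
print(seed_rows)
```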
Explanation of the flow:

1. The Batch Identifier is a sequentially generated number that is unique for each run of the jobs. A batch ID is generated when the jobs start, and the batch start date is inserted into the ETL Batch table. The batch end date is updated at the end of each workflow.

The batch table is used to monitor the performance of the jobs over a period of time.

2. Each dataflow job, which runs after the Batch Identifier job, looks up its own most recent successful run in the Job Control table. The High Watermark date of that run becomes the Low Watermark date of the current run, and the High Watermark date of the current run is the maximum date found in the source system.

Low Watermark date = High Watermark date of the most recent successful run.

High Watermark date = Max date of source records.

The incremental data is retrieved using these two watermark dates.

3. Once the dataflow completes, its execution status is written to the Job Control table together with the Low and High Watermark dates. This record is used to derive the Low Watermark of the next run.

In case of failure, the error message is also recorded in the control table.

4. The batch end date is updated with the dataflow's completion date.


Note: if a particular workflow is restarted without restarting the entire batch, the same batch ID is used, and on completion the batch end date is updated in the batch table.
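
The four steps can be sketched in a few lines of Python. This is a minimal illustration under several assumptions, not the Informatica implementation: SQLite stands in for the database, dates are ISO strings, `src_patient` and its `updated_at` column are a hypothetical source table, the seed row (Batch_ID = -1) counts as the starting point for the first run, and the extract window is taken as "after the LWM, up to and including the HWM".

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ETL_Batch (
    Batch_ID INTEGER PRIMARY KEY AUTOINCREMENT,
    Batch_Start_DateTime TEXT, Batch_End_DateTime TEXT);
CREATE TABLE ETL_Control_Table (
    Batch_ID INTEGER, Process_Name TEXT,
    Process_Start_DateTime TEXT, Process_End_DateTime TEXT,
    LWMDate TEXT, HWMDate TEXT, Process_Status_Code TEXT,
    Process_Status_Description TEXT, Failure_Reason TEXT);
CREATE TABLE src_patient (id INTEGER, updated_at TEXT);
INSERT INTO ETL_Control_Table (Batch_ID, Process_Name, LWMDate, HWMDate)
    VALUES (-1, 'wf_Patient', '1900-01-01 00:00:00', '1900-01-01 00:00:00');
INSERT INTO src_patient VALUES
    (1, '2014-08-03 09:00:00'), (2, '2014-08-04 00:00:00');
""")


def now():
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")


def start_batch(conn):
    """Step 1: insert a batch row and return its sequential Batch_ID."""
    cur = conn.execute(
        "INSERT INTO ETL_Batch (Batch_Start_DateTime) VALUES (?)", (now(),))
    return cur.lastrowid


def run_dataflow(conn, batch_id, process_name, load):
    """Steps 2 to 4 for one dataflow; `load` is a caller-supplied callable."""
    started = now()
    # Step 2: LWM = HWM of the latest successful run (or of the seed row).
    lwm = conn.execute(
        "SELECT MAX(HWMDate) FROM ETL_Control_Table WHERE Process_Name = ? "
        "AND (Process_Status_Code = 'Y' OR Batch_ID = -1)",
        (process_name,)).fetchone()[0]
    # HWM = max date in the source; fall back to the LWM if the source is empty.
    hwm = conn.execute(
        "SELECT MAX(updated_at) FROM src_patient").fetchone()[0] or lwm
    rows = conn.execute(
        "SELECT id, updated_at FROM src_patient "
        "WHERE updated_at > ? AND updated_at <= ? ORDER BY id",
        (lwm, hwm)).fetchall()
    try:
        load(rows)
        status = ("Y", "Success", None)
    except Exception as exc:
        status = ("E", "Error", str(exc))
    # Step 3: record the outcome together with both watermarks.
    conn.execute(
        "INSERT INTO ETL_Control_Table VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (batch_id, process_name, started, now(), lwm, hwm, *status))
    # Step 4: stamp the batch end date.
    conn.execute(
        "UPDATE ETL_Batch SET Batch_End_DateTime = ? WHERE Batch_ID = ?",
        (now(), batch_id))
    conn.commit()
    return rows


first = run_dataflow(conn, start_batch(conn), "wf_Patient", load=lambda rows: None)
conn.execute("INSERT INTO src_patient VALUES (3, '2014-08-05 00:00:00')")
second = run_dataflow(conn, start_batch(conn), "wf_Patient", load=lambda rows: None)
print(len(first), len(second))
```

Because the lookup only considers successful runs, a failed run does not advance the Low Watermark, so the next run picks up the same window again.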

Table structure of ETL Batch and Control table:

ETL Batch table:

Column Name           DataType  Description
Batch_ID              number    Batch Identifier, this will be generated sequentially.
Batch_Start_DateTime  dateTime  Start DateTime of the batch.
Batch_End_DateTime    dateTime  End DateTime of the batch.



ETL Control Table:

Column Name                 DataType      Description
Batch_ID                    number        Batch Identifier, generated sequentially before the jobs are executed.
Process_Start_DateTime      dateTime      Start DateTime of the process.
Process_Name                varchar(100)  Name of the process.
Process_End_DateTime        dateTime      End DateTime of the process.
LWMDate                     date          Low Watermark Date, i.e. the date from which records should be fetched from the source.
HWMDate                     date          High Watermark Date, i.e. the date up to which records should be fetched from the source.
Process_Status_Code         char(1)       Status code associated with the process. The sample data below uses Y and E.
Process_Status_Description  varchar(20)   Status description associated with the process status code. The sample data below uses Success and Error.
Failure_Reason              varchar(255)  Description of the error if the process failed.

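
As a rough sketch, the two structures above could be created as follows. SQLite is an assumption made only so the example runs anywhere; the column names and declared types follow the tables above, with `number` written as INTEGER, and the primary key on `ETL_Batch` is an addition for illustration. A production database would use its own native types.

```python
import sqlite3

DDL = """
CREATE TABLE ETL_Batch (
    Batch_ID             INTEGER PRIMARY KEY,
    Batch_Start_DateTime DATETIME,
    Batch_End_DateTime   DATETIME
);
CREATE TABLE ETL_Control_Table (
    Batch_ID                   INTEGER,
    Process_Start_DateTime     DATETIME,
    Process_Name               VARCHAR(100),
    Process_End_DateTime       DATETIME,
    LWMDate                    DATE,
    HWMDate                    DATE,
    Process_Status_Code        CHAR(1),
    Process_Status_Description VARCHAR(20),
    Failure_Reason             VARCHAR(255)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# PRAGMA table_info returns one row per column; index 1 is the column name.
control_columns = [row[1] for row in conn.execute("PRAGMA table_info(ETL_Control_Table)")]
batch_columns = [row[1] for row in conn.execute("PRAGMA table_info(ETL_Batch)")]
print(batch_columns)
print(control_columns)
```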
Sample Data of the Control tables:
ETL Batch

Batch_ID  Batch_St_Dt     Batch_End_Dt
1         8/5/2014 16:21  8/5/2014 17:30
2         8/6/2014 16:21  8/6/2014 16:37

ETL_Control_Table

Batch_ID  Job_Name        Process Start Date  Process End Date  LWM Date       HWM Date       Process Status Code  Process Status Description  Failure Reason
-1        wf_Appointment  NULL                NULL              1/1/1900 0:00  1/1/1900 0:00
-1        wf_Patient      NULL                NULL              1/1/1900 0:00  1/1/1900 0:00
1         wf_Appointment  8/5/2014 16:21      8/5/2014 16:30    1/1/1900 0:00  8/4/2014 0:00  Y                    Success
1         wf_Patient      8/5/2014 16:30      8/5/2014 17:30    1/1/1900 0:00  8/4/2014 0:00  Y                    Success
2         wf_Appointment  8/6/2014 16:21      8/6/2014 16:35    8/4/2014 0:00  8/5/2014 0:00  Y                    Success
2         wf_Patient      8/6/2014 16:35      8/6/2014 16:37    8/4/2014 0:00  8/5/2014 0:00  E                    Error                       Data too large for column

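
To show how the sample rows would be read, the sketch below loads them into SQLite and derives the Low Watermark each job would use on its next run, plus the duration of each batch. It assumes the dates are month/day/year, rewrites them as ISO strings so they compare correctly as text, keeps only the columns needed for the two queries, and treats the seed rows (Batch_ID = -1) as a valid starting point.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ETL_Batch (
    Batch_ID INTEGER, Batch_Start_DateTime TEXT, Batch_End_DateTime TEXT);
CREATE TABLE ETL_Control_Table (
    Batch_ID INTEGER, Process_Name TEXT,
    LWMDate TEXT, HWMDate TEXT, Process_Status_Code TEXT);
INSERT INTO ETL_Batch VALUES
    (1, '2014-08-05 16:21', '2014-08-05 17:30'),
    (2, '2014-08-06 16:21', '2014-08-06 16:37');
INSERT INTO ETL_Control_Table VALUES
    (-1, 'wf_Appointment', '1900-01-01 00:00', '1900-01-01 00:00', NULL),
    (-1, 'wf_Patient',     '1900-01-01 00:00', '1900-01-01 00:00', NULL),
    (1,  'wf_Appointment', '1900-01-01 00:00', '2014-08-04 00:00', 'Y'),
    (1,  'wf_Patient',     '1900-01-01 00:00', '2014-08-04 00:00', 'Y'),
    (2,  'wf_Appointment', '2014-08-04 00:00', '2014-08-05 00:00', 'Y'),
    (2,  'wf_Patient',     '2014-08-04 00:00', '2014-08-05 00:00', 'E');
""")

# Next LWM per job = HWM of its latest successful run (or of the seed row).
next_lwm = dict(conn.execute(
    "SELECT Process_Name, MAX(HWMDate) FROM ETL_Control_Table "
    "WHERE Process_Status_Code = 'Y' OR Batch_ID = -1 "
    "GROUP BY Process_Name"))

# Batch duration in whole minutes, for monitoring runs over time.
FMT = "%Y-%m-%d %H:%M"
batch_minutes = {
    batch_id: int((datetime.strptime(end, FMT)
                   - datetime.strptime(start, FMT)).total_seconds() // 60)
    for batch_id, start, end in conn.execute(
        "SELECT Batch_ID, Batch_Start_DateTime, Batch_End_DateTime FROM ETL_Batch")
}

print(next_lwm)
print(batch_minutes)
```

Here wf_Patient failed in batch 2, so its next run would start again from 8/4/2014, while wf_Appointment would move on to 8/5/2014.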
