knowerce|consulting

Datacamp ETL
Documentation November 2009

info@knowerce.sk www.knowerce.sk


Document information
Creator: Knowerce, s.r.o., Vavilovova 16, 851 01 Bratislava, info@knowerce.sk, www.knowerce.sk
Author: Štefan Urbánek, stefan@knowerce.sk
Date of creation: 12.11.2009
Document revision: 1

Document Restrictions
Copyright (C) 2009 Knowerce, s.r.o., Stefan Urbanek
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".


Contents
Introduction
Overview
    System Context
    Objects and classes
Installation
    Software Requirements
    Preparation
    Database initialisation
    Configuration
Running ETL Jobs
    Launching
        Manual Launching
        Scheduled using cron
        Running Programmatically
    What jobs will be run
    Job Status
Job Management
    Scheduling
    Forced run
Creating a Job Bundle
    Example: Public Procurement Extraction ETL job
    Job Utility Methods
    Errors and Failing a Job
Defaults
    ETL System Defaults
    Using defaults in jobs
Appendix: ETL Tables
    etl_jobs
    etl_job_status
    etl_defaults
    etl_batch
Cron Example


Introduction
This document describes the architecture, structures and processes of the Datacamp Extraction, Transformation and Loading (ETL) framework. The purpose of the framework is to perform automated, scheduled data processing, usually in the background.

Main features:
■ scheduled or manual launching of ETL jobs
■ job management and configuration through a database
■ logging
■ ETL job plug-in API

ETL tools provided:
■ parallel URL downloader
■ record transformation functions
■ table comparisons
■ table mappings


Project Page and Sources
Project page with sources: http://github.com/Stiivi/Datacamp-ETL
Wiki documentation: http://wiki.github.com/Stiivi/Datacamp-ETL/
Related project Datacamp: http://github.com/Stiivi/datacamp

Support
General Discussion Mailing List: http://groups.google.com/group/datacamp
Development Mailing List (recommended for the Datacamp-ETL project): http://groups.google.com/group/datacamp-dev


Overview
System Context
The Datacamp ETL framework has a plug-in based architecture and runs on top of a database server.

Diagram: the ETL loads job module bundles from one or more ETL modules directories and runs them against a DB server hosting the ETL staging database; extracted and temporary files are kept in a separate working directory.

Objects and classes
The core of the ETL framework consists of the Job Manager and Job objects. There are two categories of classes: job management classes and utility classes that are not necessary for data processing.
Class diagram: job management classes (Job Manager, ETL Defaults, Batch, Job Status, Job, Job Info) and utility classes (Download Manager, Download Batch) supporting the Extraction, Transformation and Loading job types.

The classes and the functionality they provide:

■ Batch – information about data processed by the ETL
■ Download Batch – list of files and additional information for automated parallel downloading and processing
■ Download Manager – performs parallel download of a huge number of URLs
■ ETL Defaults – stores configuration variables in a key-value dictionary
■ Job – abstract class for ETL jobs; provides utilities for running, logging and error handling
■ Job Info – information about a job: name, type, scheduling, …
■ Job Manager – configures and launches jobs, handles errors
■ Job Status – information about a job run: when it was run, what the result was and the reason for a failure


Installation
Software Requirements ¹
■ database server
■ ruby
■ rails
■ gems: sequel

Preparation
I. Create a directory where working files, such as dumps and ETL files, will be stored, for example: /var/lib/datacamp
II. Create a database. For use with the Datacamp web application create two schemas:
   ■ data schema, for example: datacamp_data
   ■ staging schema (for ETL), for example: datacamp_staging
III. Create a database user that has full access (SELECT, INSERT, UPDATE, CREATE TABLE, …) to the Datacamp ETL schemas.

Check: at this point you should have:
■ sources
■ a working directory
■ one or two database schemas
■ a database user with appropriate permissions
A quick connection check is sketched below.
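To verify the user and schemas before going further, you can connect with the sequel gem from the requirements list. This is only an illustrative check; the connection URL below is an example and must be adjusted to your own user, password and schema names.

require 'sequel'

# Example connection URL: user, password and schema must match your setup.
db = Sequel.connect('mysql://datacamp:secret@localhost/datacamp_staging')

# The ETL user needs CREATE TABLE rights in the staging schema;
# creating and dropping a scratch table verifies that quickly.
db.create_table(:permission_check) { Integer :id }
db.drop_table(:permission_check)

puts 'connection and permissions look fine'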

Database initialisation
To initialize the ETL database schema, run the appropriate SQL script from the install directory, for example:
mysql -u root -p datacamp_staging < install/etl_tables.mysql.sql

¹ The framework currently works only with a MySQL server, as there are a couple of MySQL-specific code residues. This will change in the future.


Configuration
Create config.yml in the ETL directory. You can use config.yml.example as a template. The configuration variables are:
■ etl_files_path – path for working files: downloaded, extracted and temporary files
■ dataset_dump_path – Datacamp application specific; where Datacamp datasets are dumped (dumps are shared by the ETL and the application)
■ log_file – file where logs are written; if not set, standard error output (stderr) is used
■ job_search_path – path where ETL job bundles are stored
■ connection – database connection

# ETL Configuration
###########################################################
# Paths

# Where temporary ETL files are stored (such as files downloaded from the web)
etl_files_path: /var/lib/datacamp-etl

# Path to dump datasets. This path should be accessible by the application to
# provide the dump API
dataset_dump_path: /var/lib/datacamp-etl/dumps

# Path for log file
log_file: /var/lib/datacamp-etl/etl.log

# Path to ETL jobs
# All jobs are searched here. The directory should contain subdirectories
# similar to OS X bundles in the form: job_name.job_type. Example: foo.loading
job_search_path: /usr/lib/datacamp-etl/jobs

###########################################################
# Database Connection

connection:
  host: localhost
  username: root
  password:
  charset: utf8
  staging_schema: datacamp_staging
  dataset_schema: datacamp_data
  app_schema: datacamp_app
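For orientation, the connection section corresponds roughly to a Sequel connection such as the sketch below. The exact way the framework builds its connections is internal and may differ; the mysql adapter name and the :encoding option are assumptions.

require 'yaml'
require 'sequel'

config = YAML.load_file('config.yml')
conn   = config['connection']

# Illustrative only: connect to the staging schema using the configured values.
staging = Sequel.connect(
  :adapter  => 'mysql',
  :host     => conn['host'],
  :user     => conn['username'],
  :password => conn['password'],
  :database => conn['staging_schema'],
  :encoding => conn['charset']
)
puts staging.tables.inspect   # list tables in the staging schema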


Running ETL Jobs
Launching

Manual Launching
Jobs are run by simply launching the etl.rb script:
ruby etl.rb

The script looks for config.yml in the current directory. You can pass another configuration file:
ruby etl.rb --config another_config.yml

Scheduled using cron
Usually you will want to run the ETL automatically and periodically. To do so, configure a cron job for the Datacamp ETL by creating a cron script. There is an example in install/etl_cron_job, where you have to change the ETL_PATH, CONFIG and probably RUBY variables. See the appendix, where the example file is listed.

Running Programmatically
Alternatively, configure a JobManager manually and run all jobs:
job_manager = JobManager.new
# … configure job_manager here …
job_manager.run_scheduled_jobs

The log is written to the preconfigured file or to standard error output. See the Installation instructions for how to configure the log file.

What jobs will be run
By default, only jobs that are enabled, are scheduled for the current day and have not already been run successfully are launched. If all jobs succeed, any subsequent launch of the ETL should not run any jobs. All unsuccessful jobs are re-tried. Disabled jobs are never run. For more information see Job Management.
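The rule above can be pictured with a small sketch. This is not the framework's actual code – the job hash mimics a row of the etl_jobs table described later, and the succeeded_today argument is a hypothetical stand-in for a lookup in etl_job_status.

# Illustrative only: mirrors the selection rule described above.
# `job` is a hash shaped like an etl_jobs row; `succeeded_today` says whether
# the job already finished with status 'ok' today.
def should_run?(job, today_name, succeeded_today)
  return false unless job[:is_enabled] == 1
  return true if job[:force_run] == 1
  scheduled = (job[:schedule] == 'daily') || (job[:schedule] == today_name)
  scheduled && !succeeded_today
end

puts should_run?({:is_enabled => 1, :schedule => 'daily', :force_run => 0}, 'monday', false)   # => true
puts should_run?({:is_enabled => 1, :schedule => 'daily', :force_run => 0}, 'monday', true)    # => false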


Job Status
Each job leaves a footprint of its run in the etl_job_status table. The table contains the following information:

■ job_name – task which was run
■ job_id – identifier of the job
■ status – current status of the job: ok, running, failed
■ phase – if the job has more phases, this column identifies which phase the job is in
■ message – error message when a job fails
■ start_date – when the job started
■ end_date – when the job finished, or NULL if the job is still running

Possible job statuses are:
■ running – the job is still running (or the ETL crashed and did not reset the job status)
■ ok – the job finished correctly
■ failed – the job did not finish correctly; see phase and message for more information

Example of successful runs – this is what you want to achieve (screenshot omitted).

Example of mixed statuses, including failed ones (screenshot omitted).
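Without the screenshots, a similar overview can be obtained by querying the table directly. A minimal sketch using the sequel gem listed in the requirements; the connection URL is an example and must match your configuration.

require 'sequel'

db = Sequel.connect('mysql://root@localhost/datacamp_staging')   # example connection URL

# Show the ten most recent job runs, newest first.
db[:etl_job_status].order(Sequel.desc(:start_date)).limit(10).each do |run|
  puts "#{run[:job_name]}: #{run[:status]} #{run[:message]}"
end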


Job Management
Jobs are managed through the etl_jobs table, where you specify:

■ job_name – name of a job (see below)
■ job_type – type of a job: extraction, transformation, loading, …
■ is_enabled – set to 1 when the job is enabled
■ run_order – number which specifies the order in which jobs are run; jobs are run from the lowest number to the highest, and if the number is the same for several jobs the behaviour is undefined
■ schedule – when the job is run
■ force_run – run the job despite the scheduling rule

To add a new job, insert a row into the table and set the job information. To remove a job, just delete its row. An example insert is sketched below.
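As an illustration, a new job row can be inserted with the sequel gem; the connection URL and job values below are examples only, and the column names follow the etl_jobs table in the appendix. Any convenient MySQL client would do equally well.

require 'sequel'

db = Sequel.connect('mysql://root@localhost/datacamp_staging')   # example connection URL

# Register a hypothetical extraction job that runs first, every day.
db[:etl_jobs].insert(
  :name       => 'public_procurement',
  :job_type   => 'extraction',
  :is_enabled => 1,
  :run_order  => 1,
  :schedule   => 'daily',
  :force_run  => 0
)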

Scheduling
Jobs can currently be scheduled on a daily basis:
■ daily – run each day
■ monday, tuesday, wednesday, thursday, friday, saturday, sunday – run on the particular week day
Once a job has been run successfully by the scheduler, the job manager does not run it again unless explicitly told to by the force_run flag.

Forced run
There is a way to run jobs out of schedule by setting the force_run flag. This allows data managers to re-run an ETL job remotely, without requiring access to the system where the ETL processes are hosted. The job will be run the next time the scheduler runs. For example: if the ETL is scheduled in cron to run hourly, the job is re-run within the next hour; if it is scheduled for daily runs, it will be run the next day. The flag is reset to 0 after each run to prevent the job from running again. The reason for this behaviour is to prevent lengthy, time- and CPU-consuming jobs from being run unintentionally, and to protect already processed data from possible inconsistencies introduced by running jobs at unexpected times.


This behaviour can be modified using the ETL system defaults; a sketch of setting them follows the list:
■ force_run_all – run all enabled jobs, regardless of their scheduling time
■ reset_force_run_flag – controls whether the force_run flag is cleared after a run; set this to 0 for development and testing, so that forced jobs are re-run each time the ETL script is launched
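Both keys live in the etl_defaults table under the reserved etl domain (see the Defaults chapter). A minimal sketch of switching them, assuming a Sequel connection to the staging schema and assuming that '1'/'0' in the varchar value column are read as true/false:

require 'sequel'

db = Sequel.connect('mysql://root@localhost/datacamp_staging')   # example connection URL

# Launch all enabled jobs on the next ETL run, regardless of their schedule.
db[:etl_defaults].insert(:domain => 'etl', :default_key => 'force_run_all', :value => '1')

# Development/testing: do not clear force_run flags, so forced jobs run on every launch.
db[:etl_defaults].insert(:domain => 'etl', :default_key => 'reset_force_run_flag', :value => '0')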


Creating a Job Bundle
Jobs are implemented as “bundles”, in other words directories containing all the code and information necessary for the job. The only requirement for a bundle is that it follows a certain naming convention and contains a ruby script with the job class:
■ the bundle directory should be named: job_name.job_type
■ the bundle should contain a ruby file: job_name_job_type.rb
■ the ruby file should contain a class with the camelized job name and job type, JobNameJobType, which should be a subclass of the appropriate job class (Extraction, Transformation, Loading)

The class should implement a run method with the main job code.

Example: Public Procurement Extraction ETL job
I. Create a job bundle directory: mkdir public_procurement.extraction
II. Create a Ruby file: public_procurement.extraction/public_procurement_extraction.rb
III. Implement a class named PublicProcurementExtraction:
class PublicProcurementExtraction < Extraction
  def run
    # … job code goes here …
  end
end

Job Utility Methods
There are several utility methods for job writers:
■ files_directory – directory where working, extracted, downloaded and temporary files are stored; this directory is job specific, each job has its own directory by default
■ logger – object for writing into the ETL manager log
■ message, phase – set job status information
A sketch of a run method using these helpers follows.
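A minimal sketch of how these helpers might appear inside a job's run method. The attribute-style setters for message and phase, the logger.info call and the download details are illustrative assumptions, not documented framework API beyond the three helpers listed above.

class PublicProcurementExtraction < Extraction
  def run
    self.phase = 'download'                      # report which phase the job is in
    logger.info 'downloading source documents'   # written to the ETL manager log

    # Working, downloaded and temporary files belong in the job-specific directory.
    target = File.join(files_directory, 'documents.csv')
    # … download the source data and write it to `target` …

    self.phase   = 'load'
    self.message = 'loaded 0 records'            # shown in the etl_job_status table
  end
end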

Each job also has access to the defaults dictionary. See the chapter about Defaults for more information.

Errors and Failing a Job
It is recommended to raise an exception on error. The exception will be handled by the job manager and the job will be closed properly, with the appropriate status and message set.
raise "unable to connect to data source"

This will result in a failed job with the same message as the exception.


Defaults
Defaults is a configurable key-value dictionary used by ETL jobs and by the ETL system as well. The key-value pairs are stored by domain. A domain usually corresponds to a job name; for example, the invoices loading job and the invoices transformation job share the common domain invoices. The domain etl is reserved for ETL system configuration. The purpose of defaults is to be able to configure ETL jobs remotely and in a more convenient way. Defaults are stored in the etl_defaults table, which contains: domain, default_key and value.

ETL System Defaults
■ force_run_all – on the next ETL run all enabled jobs are launched, regardless of their scheduling (see Running ETL Jobs); default value if the key does not exist: FALSE
■ reset_force_run_flag – after running a forced job (see Running ETL Jobs), clear its flag so it will not be run again; default value if the key does not exist: TRUE

Using defaults in jobs
A job has access to the defaults domain based on its job name. To retrieve a value from defaults:
url = defaults[:download_url]
count = defaults[:count].to_i

Retrieve a value, or set it to a default value if it is not found:

batch_size = defaults.value(:batch_size, 200).to_i

This will look for the batch_size key; if it does not exist, the key will be created and assigned the value 200. To store a default value:
defaults[:count] = count

Values are committed when the job finishes. Example:


@batch_size = defaults.value(:batch_size, 200).to_i
@download_threads = defaults.value(:download_threads, 10).to_i
@download_fail_threshold = defaults.value(:download_fail_threshold, 10).to_i


Appendix: ETL Tables
etl_jobs
■ id (int) – object identifier
■ name (varchar) – job name
■ job_type (varchar) – job type
■ is_enabled (int) – flag whether the job is run or not
■ run_order (int) – order in which the jobs are run; if more jobs have the same order number, the behaviour is undefined
■ last_run_date (datetime) – date and time when the job was last run
■ last_run_status (varchar) – status of the last run
■ schedule (varchar) – how the job is scheduled
■ force_run (int) – force the job to be run the next time the ETL runs

etl_job_status
■ id (int) – object identifier
■ job_name (varchar) – job name
■ job_id (int) – job identifier
■ status (varchar) – current or last run status
■ phase (varchar) – phase in which the job currently is while running, or was when it finished
■ message (varchar) – status message provided by the job object, or an exception message
■ start_date (datetime) – when the job was run
■ end_date (datetime) – when the job finished


etl_defaults

■ id (int) – association id
■ domain (varchar) – domain name (usually corresponds to job name)
■ default_key (varchar) – key
■ value (varchar) – value for key

etl_batch

■ id (int)
■ batch_type (varchar)
■ batch_source (varchar)
■ data_source_name (varchar)
■ data_source_url (varchar)
■ valid_due_date (date)
■ batch_date (date)
■ username (varchar)
■ created_at (datetime)
■ updated_at (datetime)


Cron Example
#!/bin/bash
#
# ETL cron job script
#
# Ubuntu/Debian: put this script in /etc/cron.daily
# Other unixes: schedule appropriately in /etc/crontab

#####################################################################
# ETL Configuration

# Path to your ETL installation
ETL_PATH=/usr/lib/datacamp-etl

# Configuration file (database connection and other paths)
CONFIG=$ETL_PATH/config.yml

# Ruby interpreter path
RUBY=/usr/bin/ruby

#####################################################################

ETL_TOOL=etl.rb

$RUBY -I $ETL_PATH $ETL_PATH/$ETL_TOOL --config $CONFIG
