You are on page 1of 90

DataWarehousing & DataStage - Session 2

Data Stage : Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

What is Datastage?
Datastage is an ETL tool used to design jobs for Extraction, Transformation and Load Ideal Tool for data integration projects-such as data warehouses and data marts.

9/26/2013 4:25 AM

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

What is a DATA WAREHOUSE?


A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context.

9/26/2013 4:25 AM 3

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Data Warehouse
A data warehouse is a
subject-oriented integrated time-varying non-volatile

collection of data that is used primarily in organizational decision making.


-- Bill Inmon, Building the Data Warehouse 1996

9/26/2013 4:25 AM

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Data Warehouse Architecture


Relational Databases

Optimized Loader

ERP Systems

Extraction Cleansing Data Warehouse Engine Analyze Query

Purchased Data

Legacy Data Metadata Repository 9/26/2013 4:25 AM 5

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Why Data Warehouse?


Who are our lowest/highest margin customers ?

What is the most effective distribution channel?

Who are my customers and what products are they buying?

What product prom-otions have the biggest impact on revenue?

to the

competition

Which customers are most likely to go

9/26/2013 4:25 AM

What impact will new products/services have on revenue and margins?

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

The Motivation

DATA

INFORMATION

UNDERSTANDING

DECISION

9/26/2013 4:25 AM

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

INTRODUCTION TO DATASTAGE

9/26/2013 4:25 AM

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

What is DataStage ?
DataStage is a client server application. Server can be installed in either Windows or Unix Operating Systems. Client can be installed in Windows Communication between the client tools and DataStage server Design jobs for Extraction, Transformation, and Loading (ETL) Ideal Tool for data integration projects-such as data warehouses, data marts and system migration.

9/26/2013 4:25 AM

Data Stage: Basics

9/26/2013

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

DATASTAGE -- ETL TOOL

Extract

Load

Transform

The Foundation for Business Intelligence

9/26/2013 4:25 AM

10

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

DataStage Architecture
SERVER

Multiple processes dynamically stated to support Parallelization and development


Extraction/ Transformation rules and Metadata

NT/ UNIX
Intel Alpha Unix Solaris ENGINE

WIN 95/NT

MANAGER
CLIENT

DESIGNER

DIRECTOR

ADMIN

Graphical workflow style tools for point-and-click specifications of sources, targets and transformation requirements
9/26/2013 4:25 AM 11

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

DataStage Server and Clients

9/26/2013 4:25 AM

12

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

DataStage Terminology
Project: A Project is a collection of related Jobs Job : A job is an executable Program which is built using different stages in GUI Stages: They represent the processing steps required. Links: They represent the flow of data between different stages. Shared Containers: Defines reusable logic Sequences: Allows to run a sequence of related Jobs.

9/26/2013 4:25 AM

13

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

DATASTAGE SERVER COMPONENTS:


REPOSITORY Central Store for all info required to build a DW

DATASTAGE SERVER Runs jobs

DATASTAGE CLIENT COMPONENTS:


ADMINISTRATOR To create and move projects, set up DS users DESIGNER Design Interface to create DS jobs DIRECTOR GUI to schedule, run and monitor jobs. MANAGER GUI to view and edit Repository contents
9/26/2013 4:25 AM 14

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

DataStage Administrator

9/26/2013 4:25 AM

15

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Client Logon

9/26/2013 4:25 AM

16

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

DataStage Manager

9/26/2013 4:25 AM

17

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

DataStage Designer

9/26/2013 4:25 AM

18

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

DataStage Director

9/26/2013 4:25 AM

19

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Developing in DataStage
Define global and project properties in Administrator Import meta data into Manager Build job in Designer Compile Designer Validate, run, and monitor in Director

9/26/2013 4:25 AM

20

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

DataStage Projects

9/26/2013 4:25 AM

21

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Project Properties
Projects can be created and deleted in Administrator Project properties and defaults are set in Administrator

9/26/2013 4:25 AM

22

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Setting Project Properties To set project properties, log onto Administrator, select your project, and then click Properties

9/26/2013 4:25 AM

23

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Projects General Tab

9/26/2013 4:25 AM

24

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Environment Variables

9/26/2013 4:25 AM

25

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Permissions Tab

9/26/2013 4:25 AM

26

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Tunables Tab

9/26/2013 4:25 AM

27

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Parallel Tab

9/26/2013 4:25 AM

28

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Managing Meta Data

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

What Is Metadata?

Data

Source
Meta Data

Transform

Target
Meta Data

Meta Data Repository


9/26/2013 4:25 AM 30

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

DataStage Manager

9/26/2013 4:25 AM

31

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Manager Contents
Metadata describing sources and targets: Table definitions DataStage objects: jobs, routines, table definitions, etc.

9/26/2013 4:25 AM

32

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Import and Export


Any object in Manager can be exported to a file Can export whole projects Use for backup Sometimes used for version control Can be used to move DataStage objects from one project to another Use to share DataStage jobs and projects with other developers

9/26/2013 4:25 AM

33

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Export Procedure
In Manager, click Export>DataStage Components Select DataStage objects for export Specified type of export: DSX, XML Specify file path on client machine

9/26/2013 4:25 AM

34

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Exporting DataStage Objects

9/26/2013 4:25 AM

35

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Exporting DataStage Objects

9/26/2013 4:25 AM

36

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Import Procedure
In Manager, click Import>DataStage Components Select DataStage objects for import

9/26/2013 4:25 AM

37

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Importing DataStage Objects

9/26/2013 4:25 AM

38

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Metadata Import
Import format and column destinations from sequential files Import relational table column destinations Imported as Table Definitions Table definitions can be loaded into job stages

9/26/2013 4:25 AM

39

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Sequential File Import Procedure


In Manager, click Import>Table Definitions>Sequential File Definitions Select directory containing sequential file and then the file Select Manager category Examined format and column definitions and edit is necessary

9/26/2013 4:25 AM

40

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Designing and Documenting Jobs

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

What Is a Job?
Executable DataStage program Created in DataStage Designer, but can use components from Manager Built using a graphical user interface Compiles into Orchestrate shell language (OSH)

9/26/2013 4:25 AM

42

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Job Development Overview


In Manager, import metadata defining sources and targets In Designer, add stages defining data extractions and loads And Transformers and other stages to defined data transformations Add links defining the flow of data from sources to targets Compiled the job In Director, validate, run, and monitor your job
9/26/2013 4:25 AM 43

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Designer Work Area

9/26/2013 4:25 AM

44

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Designer Toolbar
Show/hide metadata markers

Job properties

Compile

9/26/2013 4:25 AM

45

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Tools Palette

9/26/2013 4:25 AM

46

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Adding Stages and Links


Stages can be dragged from the tools palette or from the stage type branch of the repository view Links can be drawn from the tools palette or by right clicking and dragging from one stage to another

9/26/2013 4:25 AM

47

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Sequential File Stage


Used to extract data from, or load data to, a sequential file Specify full path to the file Specify a file format: fixed width or delimited Specified column definitions Specify write action

9/26/2013 4:25 AM

48

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Designer - Create New Job

9/26/2013 4:25 AM

49

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Drag Stages and Links Using Palette

9/26/2013 4:25 AM

50

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Assign Meta Data

9/26/2013 4:25 AM

51

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Editing a Sequential Source Stage

9/26/2013 4:25 AM

52

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Transformer Stage
Used to define constraints, derivations, and column mappings A column mapping maps an input column to an output column In this module will just defined column mappings (no derivations)

9/26/2013 4:25 AM

53

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Transformer Stage Elements

9/26/2013 4:25 AM

54

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Create Column Mappings

9/26/2013 4:25 AM

55

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Creating Stage Variables

9/26/2013 4:25 AM

56

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Result

9/26/2013 4:25 AM

57

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Adding Job Parameters


Makes the job more flexible Parameters can be:
Used in constraints and derivations Used in directory and file names

Parameter values are determined at run time

9/26/2013 4:25 AM

58

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Adding Job Documentation


Job Properties
Short and long descriptions

Shows in Manager

Annotation stage
Is a stage on the tool palette Shows on the job GUI (work area)

9/26/2013 4:25 AM

59

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Job Properties Documentation

9/26/2013 4:25 AM

60

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Annotation Stage on the Palette

9/26/2013 4:25 AM

61

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Annotation Stage Properties

9/26/2013 4:25 AM

62

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Final Job Work Area with Documentation

9/26/2013 4:25 AM

63

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Compiling a Job

9/26/2013 4:25 AM

64

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Errors or Successful Message

9/26/2013 4:25 AM

65

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Running Jobs

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Prerequisite to Job Execution


Result from Designer compile

9/26/2013 4:25 AM

67

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

DataStage Director
Can schedule, validating, and run jobs Can be invoked from DataStage Manager or Designer
Tools > Run Director

9/26/2013 4:25 AM

68

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Running Your Job

9/26/2013 4:25 AM

69

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Run Options Parameters and Limits

9/26/2013 4:25 AM

70

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Director Log View

9/26/2013 4:25 AM

71

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Message Details are Available

9/26/2013 4:25 AM

72

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Other Director Functions


Schedule job to run on a particular date/time Clear job log Set Director options
Row limits Abort after x warnings

9/26/2013 4:25 AM

73

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Standards and Techniques

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Job Presentation

Document using the annotation stage

9/26/2013 4:25 AM

75

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Naming conventions
Stages named after the
Data they access

Function they perform


DO NOT leave defaulted stage names like Sequential_File_0

Links named for the data they carry


DO NOT leave defaulted link names like DSLink3

9/26/2013 4:25 AM

76

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Stage and Link Names

Stages and links renamed to data they handle

9/26/2013 4:25 AM

77

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Create Reusable Job Components

Container

9/26/2013 4:25 AM

78

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Partitioning and Collecting Icons

Partitioner Collector

9/26/2013 4:25 AM

79

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

More Stages

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Transformer Stage Functions


Control data flow Create derivations

9/26/2013 4:25 AM

81

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Rejecting Data Example

Property Reject Mode = Output

If Not Found property

Constraint Other/log option

9/26/2013 4:25 AM

82

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Transformer Stage Properties

9/26/2013 4:25 AM

83

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Stage Variables

Show/Hide button

9/26/2013 4:25 AM

84

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

1. The Join Stage


Four types:

2 sorted input links, 1 output link "left outer" on primary input, "right outer" on secondary input Pre-sort make joins "lightweight": few rows need to be in RAM

Inner Left Outer Right Outer Full Outer

9/26/2013 4:25 AM

85

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Job Control Using the Job Sequencer

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Job Control Options


Manually write job control Code generated in Basic Use the job control tab on the job properties page Generates basic code which you can modify Job Sequencer Build a controlling job much the same way you build other jobs Comprised of stages and links

No basic coding
9/26/2013 4:25 AM 87

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Job Sequencer
Build like a regular job Type Job Sequence Has stages and links Job Activity stage represents a DataStage job Links represent passing control
9/26/2013 4:25 AM 88

Stages

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

Example
Job Activity stage contains conditional triggers

9/26/2013 4:25 AM

89

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com

DataWarehousing & DataStage - Session 2

QUESTIONS ??????????
9/26/2013 4:25 AM 90

Data Stage: Basics

Atul Singh
atulsingh@in.ibm.com