DataStage How to Kick Start

Published by: Anand Priya on May 09, 2013
Copyright:Attribution Non-commercial







Information Server 8.1 Introduction

Course Outline
This course will explain the concepts of DataStage, its architecture, and how to apply it to a 'real life' scenario in a business case-study in which you'll solve business problems. We will begin by looking at the big picture and discuss why businesses need ETL tools and where DataStage fits in the product set. Once we've talked about the very basic architecture of DataStage, we'll investigate a business case-study and learn about a company called Amalgamated Conglomerate Corporation (ACC), a fictitious holding company, and its business and technical needs. We'll then go about solving this company's problems with DataStage in a Guided Tour Product Simulation.

In a practice environment, you'll become ACC’s Senior DataStage Developer. In this capacity, you will assess an existing DataStage Job, build your own Job, modify a Job, and build a Sequence Job. Using the DataStage clients, you'll log onto the DataStage server and look at a Job that was built previously by the former DataStage Developer at ACC. You’ll then build your own Job by importing metadata, building a Job design, compiling, running, troubleshooting, and then fixing this Job. You’ll then modify a Job and finally build a special type of Job called a Sequence Job. Let’s get started!

This section begins by talking about the DataStage tool, what it does, and some of the concepts that underpin its use. We’ll discuss how it fits in the Information Server suite of products and how it can be purchased as a stand-alone product or in conjunction with other products in the suite. We’ll briefly cover DataStage’s architecture and its clients.

Business Intelli Solutions Inc



What Is DataStage?
InfoSphere DataStage is an ETL (Extract, Transform, and Load) tool that is a part of the InfoSphere suite of products. It functions as a stand-alone product or in conjunction with other products in the suite. DataStage provides a visual UI (a non-linear programming tool), which you can use in a point-and-click fashion to quickly build DataStage Jobs that will perform extractions, transformations, and loads of data for use in data warehousing, system migrations, data integration projects, and data marts.

Sometimes DataStage is purchased with QualityStage, which performs data cleansing (we’ll touch on it in this training but be sure to take the “Introduction to QualityStage” FlexLearning module as well). When implementing DataStage as a stand-alone product, you will still need to install InfoSphere Information Server components with it. Other components or products in the suite can be added at a later time. The other products in the suite are QualityStage, Information Analyzer, Business Glossary, and Metadata Workbench.

InfoSphere Information Server is a suite of tools that is used to manage your information needs. Several components come with Information Server. One of the main components is DataStage, which provides the ETL capability. The components are listed below.

IBM InfoSphere Metadata Workbench: Provides end-to-end metadata management, depicting the relationships between sources and consumers.

IBM InfoSphere Change Data Capture: Real-time change data capture (CDC) and replication solution across heterogeneous environments.

IBM InfoSphere Change Data Capture for Oracle Replication: Real-time data distribution and high availability/disaster recovery solution for Oracle environments.

IBM InfoSphere Information Analyzer: Profiles and establishes an understanding of source systems and monitors data rules.

IBM InfoSphere Business Glossary: Creates, manages, and searches metadata definitions.

IBM InfoSphere QualityStage: Standardizes and matches information across heterogeneous sources.

IBM InfoSphere DataStage: Extracts, transforms, and loads data between multiple sources and targets.

IBM InfoSphere DataStage MVS Edition: Provides native data integration capabilities for the mainframe.

IBM InfoSphere Federation Server: Defines integrated views across diverse and distributed information sources, including cost-based query optimization and integrated caching.

IBM InfoSphere Information Services Director: Allows information access and integration processes to be published as reusable services in a service-oriented architecture.

IBM InfoSphere Information Server FastTrack: Simplifies and streamlines communication between the business analyst and developer by capturing business requirements and automatically translating them into DataStage ETL Jobs.

Connectivity Software: Provides efficient and cost-effective cross-platform, high-speed, real-time, batch, and change-only integration for your data sources.

This training will focus on DataStage, which gives you the ability to import, export, create, and manage metadata from a wide variety of sources to use within these DataStage ETL Jobs. After Jobs are created, they can be scheduled, run, and monitored, all within the DataStage environment.

DataStage provides a point-and-click user interface in which there is a Canvas. You drag-and-drop icons that represent objects (Stages and Links) from a Palette onto the Canvas to build Jobs. For instance, you might drag and drop icons for a source, a transformation, and a target onto the Canvas. The data might then flow from a Source Stage via a Link to a Transformation Stage through another Link to a Database Stage, for example. Stages and Links present a graphical environment that guides you, the developer, through needed steps. You are presented with fill-in-the-blank-style boxes that enable you to quickly and easily configure data flows. When connecting to a database, for example, you drag a proprietary database Stage icon onto the Canvas; you don’t have to worry about how that database’s native code works to still be able to leverage its high performance. After configuring some particulars and doing a compile, the result is a working Job that is executable within the DataStage environment as well as a Job design which provides a graphical look as to how the data or information is flowing and being used.
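The source-to-transformation-to-target flow described above can be pictured with a small Python sketch. This is only an analogy, not DataStage code: the function names, the sample rows, and the idea of treating each function call boundary as a Link are assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]

def source_stage() -> Iterator[Row]:
    """Source Stage: emits rows (hard-coded here in place of a real file or table)."""
    yield {"id": 1, "name": "alice"}
    yield {"id": 2, "name": "bob"}

def transform_stage(rows: Iterable[Row]) -> Iterator[Row]:
    """Transformation Stage: upper-cases the name column of every row."""
    for row in rows:
        yield {**row, "name": str(row["name"]).upper()}

def target_stage(rows: Iterable[Row]) -> List[Row]:
    """Target Stage: collects rows into a list that stands in for a database table."""
    return list(rows)

# Each hand-off between functions plays the role of a Link between two Stages.
loaded = target_stage(transform_stage(source_stage()))
print(loaded)
```

In the real product the equivalent wiring is done by dropping Stages on the Canvas and drawing Links, so no code of this kind is written by hand.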



Information Server Backbone
In order to understand the environment in which we will work as Developers, let’s first very briefly look at the architecture and framework behind the InfoSphere Information Server and its various components (collectively known as InfoSphere). InfoSphere has various product modules within it including the Information Services Director, the Business Glossary, Information Analyzer, DataStage, QualityStage, and Federation Server. These product modules are all supported by common metadata access services and metadata analysis services. They all sit on top of the repository. They all work with one another and can be used as an integrated product suite, or DataStage can be used as a stand-alone application.

Whichever component or components you have, you will need to install the basics of the Information Server itself including the repository and the application layer. From there, this 3-tiered architecture fulfills many information integration needs, allowing metadata and terminology collected with one product during one phase of a project to flow to other products throughout the enterprise, enabling common understandings of data and leveraging common discoveries.

Business Intelli Solutions Inc www.businessintelli.com

Let’s just review the products in the suite again.
• The Information Analyzer allows you to analyze data in databases, files, and data stores and gives you the knowledge to determine what types of Jobs need to be developed.
• The Business Glossary maps your business terms to your technical terms so that all the members of your organization can have common nomenclature during all business activities and stages of technical development, thus reducing ambiguity among varying audiences.
• DataStage and QualityStage allow you to then manipulate the data and cleanse it in-line while you're working with it.
• The Information Services Director lets you turn your Jobs into real-time Web Services.
• The Federation Server is a tool that lets you put together disparate and heterogeneous data stores through a single point of view for the organization's entire data world: databases still stored on mainframes as well as a variety of other computers are made to look as one repository.

These product modules feed to DataStage, allowing you to create Jobs that will function in a high efficiency manner to move data and/or cleanse it as necessary. DataStage and QualityStage have been integrated into a single canvas so that all your activities within a Job can flow from a traditional DataStage Stage to one of the QualityStage Stages, thereby allowing you to, in the process of doing your other ETL activities, cleanse your data inline. This tutorial will focus primarily on the ETL activities with a little bit of knowledge on DataStage's QualityStage Stages that go along with it.

DataStage, by itself, can pull and push metadata, table definitions, and information from and to various targets and sources of data. DataStage will provide the components needed to get data from one place to another and manage it on an ongoing basis. DataStage also integrates with various other tools in addition to databases, such as MQ Series, which provides reliable delivery of messaging traffic, EDI, and other means of data communication. DataStage has both real-time and batch capabilities.

Using DataStage, you will be able to build data warehouses and other new repositories for integrated forms and views of data for your entire organization. DataStage is also very useful in data migration activities. As mergers and acquisitions occur, the underlying IT systems can, using DataStage, be brought together into a single trusted view or common unified view with which to make better business decisions and implementations. DataStage can help with ordering and receiving activities as well by putting data to and/or pulling data from an EDI wire, for example, effectively parsing it, and moving it into a data warehouse or to online operating systems so that you can have quicker access and more robust delivery of your business data.

Now that we've talked about the overall Information Server's architecture and components, let's take a more detailed look at DataStage's components.

DataStage Architecture
DataStage has a 3-tiered architecture consisting of clients, dedicated engines, and a shared repository. These are all components within the InfoSphere Information Server using its shared and common repository. The clients include an Administrator client, a Designer client, and the Director client. The Administrator is used for setting up and managing individual DataStage Projects and each Project's related common project information. The Designer is used to develop and manage DataStage applications and all their related components such as metadata. The Designer is where you will be doing most of your development. The Director allows you to actively monitor runtime activities, watch Jobs, schedule Jobs, and keep track of events once your Job has been developed, deployed, and is running.

Inside DataStage, there are two engines, the traditional Server engine (used for DataStage 'Server' Jobs) and the Parallel engine (used for DataStage Parallel Jobs). The 'Server’ Jobs are typically single-threaded and have evolved over the years during various versions of DataStage since its inception. The Parallel engine allows you to create Jobs that can, at runtime, dynamically increase or decrease their efficiency and optimize the use of resources at hand within your hardware. The Parallel engine can dynamically change the way in which it performs its parallelism. The Shared or 'Common' repository holds all of the information such as Job designs, logs of the Job runs, and metadata that you will need while developing Jobs. Now, let's take a look at the Administrator client's UI.

Administrator Client
We're looking at the Administrator's UI under the tab entitled 'General'. Here, you can enable job administration from within the Director client, enable runtime column propagation (RCP) for all of your Parallel Jobs, enable editing of the internal Job references, and enable sharing metadata when importing from Connectors (the DataStage objects that automatically connect to various proprietary sources and databases with minimal configuration). The tab also gives you some auto-purge settings for your log. What is RCP and how does it work?

DataStage is a role-based tool. Usually, when it is installed the Administrator assigns roles such as Developer or Operator. This means that the UI may have different options available to different people. The person with the role of ‘administrator’ may have more options than does a ‘developer’. Thus, when you look at the actual product, your UI may be different than others'.

Under the Permissions tab, DataStage can assign functions to individuals who will be using the tool. The Administrator, developers, and operators may not be the same person. There may be differing functions made available to different people. You can thus determine which user gets what role for that particular Project. This can all be designated for each separate DataStage Project.

Traces are available should you wish to put tracing on the engine. The Schedule tab allows us to dictate which scheduler will be used. In the Windows environment, we would need to designate who will be the authorized user to run Jobs from the scheduler. The Mainframe tab allows us to set needed options when working with the mainframe version of DataStage. This would allow us to create Jobs that could then be ported up to the mainframe and then executed in their native form with the appropriate JCL. Other tabs let us set the memory requirements on a per-Project basis. The Parallel tab gives us options as to how the parallel operations will work with the parallel engine. Under the Sequence tab, the Sequencer options are set. And under the Remote tab of the Administrator client, we make settings for remote parallel Job deployment on the USS system or remote Job deployment on a Grid.

DataStage Designer
The second of the three DataStage clients is the Designer. As an ETL Developer, this is where you'll spend a good deal of your time. Let's examine the main areas of the UI. In the upper-left (shown highlighted) is the Repository Browser area. It shows us the metadata objects and Jobs found within DataStage. Toward the lower-left is the Palette. The Palette contains all of the Stages (drag and droppable icons). You would select a Database source icon, for instance, and drag it to the right. On the right is the Canvas. This is where you would drop the Stages that you have chosen to be part of your Job design.

You can see that the Job in the graphic above is a Parallel Job and that it contains two Stages that have been joined together with a Link. Links can be added from the Palette directly or you can just right-click on a Stage, drag, and then drop on the next Stage to create the Link dynamically. In this particular example, data will flow from a Row Generator Stage to the Peek Stage.

The Designer is where you will create the kinds of working code that you need in order to properly move your data from one source to another. There are over 100 Stages that come with the product out of the box (at installation time there are approximately 60 base Stages installed, an additional 40 that are optional, and 50-100 more based on service packs and add-ons, and then you can build your own custom Stages in addition).

DataStage Director
The third DataStage client is the Director. Director gives us a runtime view of the Job. Above, we can see an individual Job log. There are messages and other events shown in this Job run that are logged to this area and kept inside the Shared Repository. We can see the types of things that occur during a Job run such as the Job starting, setting environment variables, main activities of the program, and then eventually the successful completion of the Job. Should there be any warnings or runtime errors, they can be found and analyzed here to determine what went wrong in order to take corrective action. Color-coding shows that green messages are just informational; a yellow message is a warning and may not directly affect the running of the Job but should probably be looked at. If you should ever see a red message in here, this is a fatal error, which must be resolved before the Job can run successfully.


DataStage Repository
In the Repository Browser, we have various folder icons that represent different categories within the DataStage Project. Some come automatically with the Project and then you can add your own and name them. Organize them in a way suitable to your DataStage Project, providing for easy export and import. Standard folders include those for Jobs, Table Definitions, Rule Sets, and other DataStage objects. Tip: Create a Folder named after your initiative and then order or move all the appropriate objects for just that initiative under that one common location.

Table Definitions are also known as metadata or schemas. These terms are often used interchangeably. Each typically contains a collection of information that defines the metadata about the individual record or row within a table or file. A piece of metadata can describe column names, column lengths, or columns' data types or it can describe fields within a file.
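To make the idea of a Table Definition concrete, here is a minimal Python sketch of column metadata (name, type, length) and a check of one row against it. The class, the column names, and the `validate` helper are hypothetical and chosen only for illustration; they are not DataStage's own storage format.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class Column:
    """One column of a table definition: name, data type, maximum length."""
    name: str
    sql_type: str
    length: int
    nullable: bool = False

# Hypothetical table definition for a customer record.
CUSTOMER_TABLE_DEF = [
    Column("CUST_ID", "Integer", 10),
    Column("CUST_NAME", "VarChar", 50),
    Column("REGION", "Char", 2, nullable=True),
]

def validate(row: Dict[str, object], table_def: List[Column]) -> List[str]:
    """Return a list of problems found when checking a row against the definition."""
    problems = []
    for col in table_def:
        value = row.get(col.name)
        if value is None:
            if not col.nullable:
                problems.append(f"{col.name}: missing value")
        elif len(str(value)) > col.length:
            problems.append(f"{col.name}: longer than {col.length}")
    return problems

print(validate({"CUST_ID": 1, "CUST_NAME": "Ann"}, CUSTOMER_TABLE_DEF))
```

The point of the sketch is that the metadata lives apart from the data, so the same definition can describe a database table or the fields of a file.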

Other folders include one for Routines (sub-routines able to be called from within a Job). Shared Containers are pieces of DataStage Jobs that can be pulled together, used, and reused as modular components. Stage Types (highlighted) is a master list of all the Stages that are available to you in this Project. Stages can also be added later through various options depending on what product(s) have already been installed and depending on what your administrator has made available to you. The Standardization Rules folder contains out-of-the-box or default rules for QualityStage.

Transforms come from the standard Server engine and work like macros, letting you define a function once and reuse it wherever it is called. WAVES Rules are also used by QualityStage for address verification modules. Match Specifications are used by QualityStage's Match Stages. Machine Profiles are used with IBM mainframes. IMS “view sets” are for working with legacy IMS types of Database Management Systems.

Steps to Create a DataStage Job
This section talks about the steps to building your first Job. First, you'll need to understand some basic concepts, then we'll talk about setting up an environment, next you'll connect to the sources, then we'll talk about how to import table definitions, and then we'll provide you with an understanding of the various types of Stages and how they are used. Then, we'll talk about working with RCP, creating Parameter Sets, and understanding how to use the CLI. In the next lesson, you will put all of this knowledge to use and begin building Jobs in a case-study scenario.

This section covers the following:
1. Understand Some Underpinnings
   1. Types of Jobs
   2. Design Elements of Parallel Jobs
   3. Pipeline Parallelism
   4. Partition Parallelism
   5. Three-Node Partitioning
   6. Job Design Versus Execution
   7. Architectural Setup Option
2. Setting up a new DataStage environment
   1. Setting up the DataStage Engine for the First Time
   2. DataStage Administrator Projects Tab
   3. Environment Reporting Variables
   4. DataStage Administrator Permissions Tab
   5. Export Window
   6. Choose Stages; Passive Stages
   7. Connect to Databases
3. Import Table Definitions
4. Active Stages; The Basics
5. An Important Active Stage; The Transformer Stage
6. Advanced Concepts When Building Jobs; RCP and More
7. Pulling Jobs Together; Parameter Sets, CLI, etc.


Let’s now shift our discussion to the process of developing Jobs in DataStage. What are the steps involved in developing Jobs? First, you define global and Project properties using the Administrator client that we mentioned earlier in the tutorial. This includes defining how you want the Job to be run. For instance, do you want Runtime Column Propagation or any other common variables in your environment that you might be using throughout the Project? We'll discuss this in greater depth later in this tutorial.

Next, you go into the Designer client and import metadata into the repository for use, later, in building your Jobs. Then, using the Designer tool, you actually build the Job and compile it. The next step is to use the Director tool to run and monitor the Job (Jobs can also be run from within the Designer but the Job log messages must be viewed from within the Director client). So, as you’re testing your Jobs it can be a good idea to have both tools open. This way, you can get very detailed information about your Job and all of the things that are happening in it.



Administrator Client
Let’s say that we want to create a Job that extracts data from our primary operational system. We could go to the Administrator tool, set up our Jobs to be able to use RCP, and set the number of purges in the logs for however often we want to run purges. Still in the Administrator, we could then put in some common environmental variables, for instance, the name of the database that is being accessed, the user ID, and password. These can be set there and encrypted in a common location so that everyone can use these without necessarily exposing security in your system.

Designer Client
To then create this kind of extract Job, we would next want to go into our Designer tool to first import metadata from the database itself for the table or tables which we’ll be pulling. Also within the Designer we want to use Stages that will connect us to the database. We do this by dragging and dropping the appropriate Stage onto the Canvas. The database Stages come pre-built to meet the needs of many proprietary databases and satisfy their native connectivity requirements, thus providing high-speed transfer capability while eliminating the need to do a lot of the configuration. Inside the Stage(s), we will use the variables (that we defined in our common environment using the Administrator tool) to define the Job as we see it needs to be done. We do this by double-clicking the Stage in question and filling out the particulars that it requests.

We’ll pull that data out and store it into either a local repository such as a flat file (by dragging a Sequential File Stage onto the Canvas and creating a link between the two Stages) or store it into one of DataStage’s custom Data Sets (a proprietary data store which keeps the data in a partitioned, ready-to-use, high-speed format so that it doesn’t need to be parsed for re-use as it would otherwise need to be from a flat file). And then we could, for instance, put it directly into another database (by dragging a Stage onto the Canvas at the end of that data flow).
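The practice of defining connection details once at the project level and then referring to them inside individual Stages can be sketched in Python as follows. The variable names, the `$NAME` placeholder syntax (taken from Python's `string.Template`), and the `resolve` helper are assumptions for illustration and do not reproduce DataStage's own parameter syntax.

```python
from string import Template
from typing import Dict

# Hypothetical project-level variables, defined once for everyone to share.
# A real password would be stored encrypted and is left out of this sketch.
PROJECT_VARS = {"DB_NAME": "ACC_OPS", "DB_USER": "etl_user"}

# Hypothetical Stage settings that refer to the shared variables by name.
stage_settings = {"database": "$DB_NAME", "user": "$DB_USER", "table": "ORDERS"}

def resolve(settings: Dict[str, str], variables: Dict[str, str]) -> Dict[str, str]:
    """Replace each $NAME placeholder with the value of the shared variable."""
    return {key: Template(value).substitute(variables) for key, value in settings.items()}

resolved = resolve(stage_settings, PROJECT_VARS)
print(resolved)
```

Because the Stage holds only the variable name, changing the database in one place updates every Job that refers to it.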

Types of Jobs
Let’s talk about the three types of Jobs that we would most likely be developing. These fit into the categories of Parallel Jobs, Job Sequences (which control other Jobs), and Server Jobs (the legacy Jobs from earlier versions of DataStage).

Parallel Jobs are executed by DataStage using a parallel engine. They have built-in functionality for what is called pipeline and partition parallelism. These are means of efficiency that can be set. For instance, you can configure your parallelism dynamically. We’ll discuss these things in greater detail later in the tutorial. When Jobs are compiled, they are compiled into a DataStage-specific language called Orchestrate Scripting Language, or OSH. The OSH executes various Operators. Operators are pre-built functions relating to the Stages in our Job Design. The executable that the OSH executes is accomplished with C++ class instances. These are then monitored in the DataStage Director with the runtime Job monitoring process. You can also create custom Operators using a toolkit and the C++ language. If you have special data that needs to be written to a special device, such as one for scanning and tracking control on delivery trucks, then you could write a custom module to put the data directly into that format without having to go through any other intermediate tools.

Job Sequences, as we mentioned, are Jobs that control other Jobs. Master Sequencer Jobs can, for example, kick off other Jobs and other activities including Command Line activities, other control logic, or looping that needs to go on in order to re-run a Job over and over again. The Job Sequencer can control all your Jobs at once without having to schedule each individual Job with the scheduler. Although you still have the ability to call any/each Job individually, you can also group them together through the Job Sequencer and call them that way. In addition, you can perform activities based on their failures and do that either specifically, when a local Job fails, or globally, when anything happens and an exception is thrown. A common Command Line API is provided. Thus, you can embed this into third-party schedulers and can then execute Jobs in whatever fashion you choose.

The third type of Job is the Server Job. Server Jobs are legacy DataStage Jobs that continue to exist and are supported as DataStage applications, but you may also find them useful for their string-parsing capabilities. These are executed by the DataStage Server engine and are compiled into a form of BASIC and then run against the Server engine. Runtime monitoring is built into the DataStage engine and can also be viewed from the Director client. Server Jobs cannot be parallelized in the same way that the Parallel Jobs can be. Legacy Server Jobs do not follow the same parallelization scheme that the Parallel Jobs do. Therefore, they cannot easily have their parallelism changed dynamically at runtime as the Parallel Jobs can. However, you can achieve parallel processing capabilities by either sequencing these Jobs with the Job Sequencer to run at the same time, or using other options to gain efficiencies with their performance.
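The control role of a Job Sequence, running Jobs in order and reacting when one of them fails, can be sketched in Python. This is an analogy only: `run_sequence`, the three stand-in Jobs, and the failure handler are hypothetical names, and the sketch does not call any DataStage API.

```python
from typing import Callable, Dict, List, Tuple

Job = Callable[[], None]

def run_sequence(jobs: List[Tuple[str, Job]],
                 on_failure: Callable[[str, Exception], None]) -> Dict[str, str]:
    """Run Jobs in order; on the first failure call the handler and stop."""
    status: Dict[str, str] = {}
    for name, job in jobs:
        try:
            job()
            status[name] = "finished"
        except Exception as exc:
            status[name] = "failed"
            on_failure(name, exc)  # the "exception handler" activity
            break                  # later Jobs are not started
    return status

# Stand-in Jobs for the sketch: the load step fails on purpose.
def extract_job() -> None:
    pass

def load_job() -> None:
    raise RuntimeError("target table is locked")

def report_job() -> None:
    pass

failures: List[str] = []
status = run_sequence(
    [("extract", extract_job), ("load", load_job), ("report", report_job)],
    on_failure=lambda name, exc: failures.append(f"{name}: {exc}"),
)
print(status, failures)
```

Here the report step never starts because the load step failed, which is the kind of dependency a Sequence lets you express once instead of scheduling each Job separately.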

Design Elements of Parallel Jobs
One of the nice things about having a visual medium like DataStage is that you can easily take the notions of Extract, Transform, and Load and represent them visually in a manner that is familiar to people. For instance, you can drag an Extract Stage out onto the visual Canvas and drop it on the left of the screen, flow through transformations in the middle of the Canvas, and place the outgoing outputs or Loading Stages on the right side. DataStage lets you put elements such as Stages and Links in any fashion that you wish, using an orderly, flow chart style, left-to-right, top-to-bottom visual design. This flow makes future interpretation of your Jobs simple. Those who are new to DataStage are able to look at, see, and understand the activities and flows that are going on within the DataStage Job. This also means that other Developers' Jobs will be intuitively understandable to you.

When we start designing a Parallel Job, we should think of the basic elements, which are Stages and Links. Stages, as we mentioned, get implemented ‘under the covers’ as OSH Operators (the pre-built components that you don’t have to worry about and that will execute based on your Job design on the graphical Canvas). Next, we have Passive Stages, which represent the ‘E’ and the ‘L’ of ETL (Extract and Load). These allow you to read data and write data. For example, Passive Stages include things such as the Sequential File Stage, the Stage to read DB2, the Oracle Stage, and Peek Stages that allow you to debug and look at some of the Job’s interim activity without having to land your data somewhere. Other Passive Stages might include the MQ Series Stage and various other third-party component Stages such as SAP and Siebel Stages.

Processing or active Stages are the ‘T’ of ETL (Transformation). They do the transformation and other ‘heavy lifting’ work. These include the Transformer Stage, which allows you to transform data, manipulate column order, and do a variety of other functions to your data to enrich it and potentially to validate and pre-cleanse it before applying some of the QualityStage Stages to it. You will learn to rely heavily on the Transformer Stage. It provides a variety of functionality including the ability to filter rows, as well as individually modifying each column of data that goes out. It allows you to propagate the data that goes out in a number of ways and, in some cases, merge two streams of data together or pieces from two streams into one common stream. These are also assisted by specific Stages to do filtering, perform aggregation of the data (summations or counting of data or finding first and last), generate the data (such as generating rows or columns), and split or merge the data (using the ability to direct data down multiple paths simultaneously, or pulling the data together, either by joining it into one common larger column or by funneling it into multiple rows within the same data).

Now that we’ve talked about some Stages, let’s talk about the things that help the data flow from one Stage to the next. Links are the ‘pipes’ through which the data moves from Stage to Stage. Although the Stages are visually predominant on the Canvas, in actuality, all the action is happening on the Links. DataStage is a link-based system. When you open up a Stage by double-clicking it, you get various dialogue boxes that refer to input and output (input to that Stage and output from that Stage) in DataStage. These represent “Links” coming into and out of the Stage (when viewed from the Canvas). The settings that you define in these two areas within the Stage affect the Link coming in and the Link going out of the Stage. Your settings, including the metadata that you chose, are all kept with the Link. And therefore, when you modify Jobs, it’s easy to move those pieces of data from one Stage type to another Stage without losing data or having to re-enter it, because all the metadata and accompanying information stay with the Links.
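The kind of work the text assigns to active Stages (filtering rows, deriving output columns, then aggregating) can be sketched in plain Python. The function names, the sample orders, and the cleansing rule are assumptions for illustration; a real Transformer Stage is configured through its dialogue boxes rather than written as code like this.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator

def transformer(rows: Iterable[dict]) -> Iterator[dict]:
    """Filter out non-positive amounts and derive a cleaned region column."""
    for row in rows:
        if row["amount"] <= 0:
            continue  # row filter: rejected rows are not passed down the Link
        yield {"region": row["region"].strip().upper(), "amount": row["amount"]}

def aggregator(rows: Iterable[dict]) -> Dict[str, int]:
    """Sum the amount column per region, as an aggregation Stage would."""
    totals: Dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)

# Hypothetical input rows with inconsistent spacing and letter case.
orders = [
    {"region": " east", "amount": 100},
    {"region": "west ", "amount": 0},
    {"region": "East", "amount": 50},
    {"region": "west", "amount": 25},
]

totals = aggregator(transformer(orders))
print(totals)
```

The pre-cleansing step matters here: without normalising the region values, " east" and "East" would be summed as two different groups.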

Pipeline Parallelism
The purpose of parallelism is to create more efficiency and get more work out of your machine(s). This is done in two ways, the first of which is Pipeline Parallelism. Here, each of the Operators (that correspond to the various Stages that we see on our Canvas) set themselves up within their own processing space and pass data from one to the other rapidly without having to be re-instantiated continually or without having to do all the work in a given Stage before starting the next portion of the pipelining process. For example, this allows us to execute transformer, cleansing, and loading processes simultaneously. You can think of it as a conveyer-belt moving the rows from process to process. This reduces disk usage for the staging areas by not requiring you to land data as often. It uses your processor more efficiently and thus maximizes the investment in your hardware. There are some limits on the scalability depending on how much CPU processing horsepower is available and how much memory is available to store things dynamically.
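The conveyer-belt picture can be illustrated with Python generators, which hand each row to the next step as soon as it is produced instead of finishing one step before starting the next. Note the limits of the analogy: generators run in a single thread, so this sketch shows only the row-by-row hand-off without landing data, not the simultaneous execution that the text describes. All names are hypothetical.

```python
from typing import Iterable, Iterator, List

events: List[str] = []  # records the order in which the steps touch each row

def extract(n: int) -> Iterator[int]:
    for i in range(n):
        events.append(f"extract {i}")
        yield i

def transform(rows: Iterable[int]) -> Iterator[int]:
    for row in rows:
        events.append(f"transform {row}")
        yield row * 10

def load(rows: Iterable[int]) -> List[int]:
    loaded = []
    for row in rows:
        events.append(f"load {row}")
        loaded.append(row)
    return loaded

result = load(transform(extract(2)))
print(events)
```

The recorded events interleave (the first row is loaded before the second is even extracted), which is the behaviour that removes the need for a staging file between steps.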

Partition Parallelism

The second factor of parallelism is called Partition Parallelism. This is more of a ‘divide and conquer’ approach in which we divide an incoming stream of data into subsets to be separately processed by an Operator. Each one of these subsets is called a partition. Each partition of data is processed by the same Operator. For instance, let’s say that you want to do a filtering type of operation. Each partition will be filtered exactly the same way but multiple filters will be set up in order to handle each of the partitions themselves. This facilitates a near-linear scalability of your hardware. That means that we could run 8 times faster on 8 processors, 24 times faster on 24 processors, and so on. This assumes that data is evenly distributed and that no other factors, such as network latency or other external issues, limit the flow of your data.
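As a rough sketch of ‘divide and conquer’, the Python below splits a stream into four partitions and applies the same filter to each one. The row layout, the filter rule, and the partition count are assumptions made for illustration; the thread pool only shows the structure (one filter instance per partition) and is not meant to reproduce DataStage’s scaling behaviour.

```python
from concurrent.futures import ThreadPoolExecutor

PARTITION_COUNT = 4  # assumed number of partitions for this illustration


def keep_large_orders(partition):
    """The same filter logic is applied to every partition."""
    return [row for row in partition if row["amount"] >= 100]


# Hypothetical input stream: 12 orders with amounts 25, 50, ..., 300.
rows = [{"order_id": i, "amount": i * 25} for i in range(1, 13)]

# Divide the incoming stream into equally sized subsets (partitions).
partitions = [rows[i::PARTITION_COUNT] for i in range(PARTITION_COUNT)]

# One filter instance per partition, run side by side.
with ThreadPoolExecutor(max_workers=PARTITION_COUNT) as pool:
    filtered_partitions = list(pool.map(keep_large_orders, partitions))

# Collect the partitions back into a single stream.
collected = [row for part in filtered_partitions for row in part]
print(len(collected), "rows survived the filter")
```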

Three-Node Partitioning

Data can be partitioned into multiple streams. The way in which we partition will depend upon how we need to process this data. The simplest of the methods is called Round Robin, in which we assign each incoming row to the next subsequent partition in order to create a balanced workload. In other words, it works like a card dealer in Vegas dealing out cards in the deck to every player around the table, thus dealing all the cards evenly to each player. Other times, we need to segregate the data within the partition based on some key values, whether they are numeric or alpha values. This is known as Hash partitioning (or, with numeric keys, you can use Modulus partitioning) and in this way, you keep the relevant data together by these common values. This is very important when we’re doing things such as sorting the data or possibly filtering the data on these key values. Should the data become separated (the key value fields start going into different partitions), then, when the Operation occurs, it will not fully work correctly, as some of the data that should have been removed may be in another partition and thereby mistakenly ‘kept alive’ and moving down another stream in a parallel Operation. Then, at the other end of the partitioning process when all the data is collected, we still have multiple copies when there shouldn’t be. If we had put them all correctly on one partition, then only one of the duplicates would survive, which was our desired outcome.
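The duplicate-removal problem described above can be reproduced with a small Python sketch. Everything here is hypothetical (the customer rows, the three partitions, and the use of `zlib.crc32` as a stand-in hash function); it is not how DataStage implements its partitioners, only a model of why the choice matters.

```python
import zlib


def round_robin_partition(rows, partition_count):
    """Deal rows out like cards: row 0 to partition 0, row 1 to partition 1, ..."""
    partitions = [[] for _ in range(partition_count)]
    for index, row in enumerate(rows):
        partitions[index % partition_count].append(row)
    return partitions


def hash_partition(rows, partition_count, key):
    """Send every row with the same key value to the same partition."""
    partitions = [[] for _ in range(partition_count)]
    for row in rows:
        # crc32 is used only as a stable stand-in hash function.
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % partition_count
        partitions[bucket].append(row)
    return partitions


def remove_duplicates(partition, key):
    """Keep the first row for each key value seen within this one partition."""
    seen = set()
    survivors = []
    for row in partition:
        if row[key] not in seen:
            seen.add(row[key])
            survivors.append(row)
    return survivors


def dedupe_and_collect(partitions, key):
    """Run the same de-duplication on each partition, then collect the results."""
    collected = []
    for partition in partitions:
        collected.extend(remove_duplicates(partition, key))
    return collected


rows = [{"customer": name} for name in ["ann", "bob", "ann", "cy", "bob", "ann"]]

round_robin_result = dedupe_and_collect(round_robin_partition(rows, 3), "customer")
hash_result = dedupe_and_collect(hash_partition(rows, 3, "customer"), "customer")

print(len(round_robin_result), "rows after round robin")   # a duplicate "ann" survives
print(len(hash_result), "rows after hash partitioning")    # one row per customer
```

With round robin, two copies of the same customer land in different partitions and both survive; with key-based partitioning, each customer is de-duplicated in one place.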

The kind of partition that you perform matters a lot! One of the reasons why Round Robin is the default is that if you can evenly distribute the data, it will be processed much more quickly. If you have to partition by range or some type of hash partitioning scheme, the quantities of data in each partition may become imbalanced and one or two of the partitions may finish long before the others and thus reduce performance.

Job Design Versus Execution

Let’s look at an example that will give you an idea of what happens when we design the Job and then what happens when we run it. At the top of the graphic, you can see that the data is flowing from two Oracle Stages and the data is brought together inside of a Merge operation. From there, we take the merged data and we do an aggregation on it and finally we send the data out to a DB2 database. In the background, let’s say that we want to partition this data four ways (in DataStage terminology, this would be known as 4 Nodes). Here’s what we would see: for each of the Oracle Stages (toward the left of the data flow) and the other Stages in this design, we would create four instances of each to handle the four separate streams of data. In this kind of activity, when doing summation, it is extremely important that we partition or segregate our data and that we partition it in a way that will work with our design.
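To see why summation depends on the partitioning, consider this small Python sketch with four nodes. The region/amount rows and the toy key-to-node rule are hypothetical; the point is only that rows sharing a key must sit in the same partition for one complete total to come out.

```python
from collections import defaultdict

NODE_COUNT = 4  # the four-way partitioning used in the example above


def sum_by_key(partition):
    """Aggregation run independently inside one partition."""
    totals = defaultdict(int)
    for key, amount in partition:
        totals[key] += amount
    return dict(totals)


# Five hypothetical rows that should come together into a single sum.
rows = [("east", 10), ("east", 20), ("east", 30), ("east", 40), ("east", 50)]

# Round robin: the rows for "east" are spread over all four nodes.
round_robin = [rows[i::NODE_COUNT] for i in range(NODE_COUNT)]

# Key-based: every row with the same key goes to the same node.
key_based = [[] for _ in range(NODE_COUNT)]
for key, amount in rows:
    node = sum(key.encode("utf-8")) % NODE_COUNT  # toy stand-in for a hash
    key_based[node].append((key, amount))

round_robin_results = [sum_by_key(p) for p in round_robin if p]
key_based_results = [sum_by_key(p) for p in key_based if p]

print(round_robin_results)  # several partial sums for the same key
print(key_based_results)    # one complete sum
```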

Here, we want to partition it by the values by which we are going to do the summation. This way, we don’t accidentally sum on two different things and lose parts of our rows. Let’s say that we have 5 rows that need to come together for a sum; we want them all to be together in the same partition, so that all the summation can be done together. Otherwise, we might get four values out of this phase (highlight on the 4 instances of the Aggregation Stage) with only one of the set of partitions summing anything at all! Note that we will discuss these partitioning strategies in greater detail later in the tutorial. For now, though, let’s continue our discussion in a different direction.

Architectural Setup Options

This section of the tutorial will talk about some of the underpinnings of DataStage in general and specifically some of the architectural setups that we can use. Now that we’ve talked about the considerations that we must be aware of as we do our partitioning, both in the design of our Job and then in the choice of partitioning strategy, let’s talk about some architectural setup options.

In one setup, we have deployed all of the Information Server DataStage components on a single machine. All of the components are hosted in one environment. This includes the WebSphere Domain, the Xmeta repository using DB2 (other databases can be used for the repository such as Oracle or SQL Server), and the DataStage Server itself. The clients can be stored on the same machine as well if it is a Windows machine. In other words, let’s say that we are carrying a laptop around: it can contain all the co-located components (and it could still be connected to by a remote client). What are some reasons for putting all of these components on a single machine? Perhaps that is all the resources that you have available or perhaps you have very poor network performance and don’t want the resulting latency going across the network wire to the repository.

Now let’s talk about putting the components on two machines. A typical two-machine setup would consist of the Metadata repository and the Domain layer on Machine A and the DataStage Server on Machine B, with the clients potentially connecting to Machine B remotely. One of the benefits of this separation (having two machines) is that the metadata repository on Machine A can be utilized concurrently (by a product such as Information Analyzer) without impacting the performance of DataStage as it runs on Machine B.

In a 3rd scenario, major components are separated onto different machines. The database repository is on Machine A (maybe because you are using a machine that already has your database farms installed on it). The Metadata Server backbone is installed on Machine B so that Information Analysis and Business Glossary activities don’t risk negatively impacting any development on the DataStage Server: several DataStage development teams can keep working and utilizing their resources most effectively. And, you have the DataStage Server on Machine C, connected to by remote clients. Finally, it should be said that you must determine the optimal setup in your environment depending on your own conditions and requirements.

Setting up a new DataStage environment

In this last section, we talked about some different options. In this next section of the tutorial, we are going to talk about DataStage’s architecture ‘under the cover’ and some of the ways that we can set up the product. We’ll also talk about setting up the DataStage engine itself for the first time using the Administrator client and see how to then set up DataStage Projects. Within a Project, we’ll see how to import and export table definitions into that Project. Next we’ll look at Jobs within Projects and talk about a number of important development objects such as Passive DataStage Stages. For example, we’ll talk about the Data Set Stage, the Sequential File Stage, and various relational data Stages. We’ll also talk about some Active Stages, which are important when we are trying to combine and/or manipulate data. These include the Lookup Stage, the Join Stage, the Merge Stage, and the Transformer Stage (at the heart of many Jobs).

Then we’ll talk about some key concepts that you’ll want to know about as you are developing Jobs, such as the use of Runtime Column Propagation (RCP) and how to create Shared Containers for re-usability. Then we’ll talk about ways of implementing and deploying our DataStage application that can make things easier on you and the team. These might include creating Parameter Sets, using various Run options, putting together Sequencers to organize all of your Jobs, and using the Command Line Interface (CLI), in which Jobs can be called from a third-party scheduler.

Of course, we’ll discuss parallelism more in depth. Within InfoSphere, we’ll talk about where we perform parallelism. This may be within Sequencers or at the Job level; we will talk about strategies for different ways to apply parallelism. Specifically, we’ll talk in more detail about the two forms of parallelism (Pipeline parallelism and Partitioning parallelism). We’ll see how these things work and then what happens when the parallel Job gets compiled. Finally, we’ll discuss how to apply these concepts and scale our previously constructed DataStage application up or down depending on the resources that we have available to us.

Setting up the DataStage Engine for the First Time

When we’re setting up a new DataStage engine, the first thing is to go into the Administrator client and make sure that some of the basics have been taken care of. When we log on, we are presented with the Attach to DataStage dialog box shown above. We can see it shows the Domain to which we want to connect with its corresponding port number of the application server, the user ID and password for the DataStage Administrator with which we will connect, and the DataStage Server name or its IP address. Often, you will find that the Server’s name is the same as the beginning portion of the Domain name (but this does not have to be the case; it depends on the architecture and name of the computer and how it is set up in your company’s Domain Name Server (DNS)).

DataStage Administrator Projects Tab

Within the Administrator client, we have a tab entitled Projects. We use this for each of the individual Projects that we want to set up here within DataStage. The Command button allows you to issue native DataStage engine commands when necessary, such as the list.readu command, which would show you any locks that are still active in DataStage (as well as any other commands associated with the DataStage engine or with UniVerse, which was the predecessor to the DataStage engine). Below that, you can see the Project pathname box. This shows us where the Project directory is located. This is extremely important as you must know where the Project is located on the machine (it is located on the machine on which the DataStage Server has been installed).

Let’s take a look at the Properties button. The Properties button brings up the dialog box shown above. It will show us the settings for the selected Project (datastage1 shown in the graphic). Various tabs across the top each have their own options. One thing that you’ll probably want to do is to check the first option. This will enable job administration in Director. Enabling job administration in Director will allow developers to do things such as stopping their jobs, re-setting their jobs, clearing job information, and so on. One of the times that you may NOT want to do this is in a production environment. Otherwise, typically, you do want to give people the ability to administer Jobs in the Director client as they see fit.

Next is the option to enable RCP for parallel Jobs. We’ll cover this more in depth a little later in the tutorial but just be aware that this is where you enable it for the datastage1 (each) Project. The ‘Enable editing of internal references in jobs’ option will help you when dealing with lineage and impact analysis (if you change Jobs or objects over time). The ‘Share metadata when importing from Connectors’ option, again, allows us to perform cross-application impact analysis.

The ‘autopurge of the job log’ option is very important. You always want to set up default purge options so that the logs don’t fill up and consume large disk space on your machine (the machine that the DataStage Server is installed on, not necessarily the machine that houses your clients like the Administrator client). You have your choice of purging them based on the number of runs or by the number of days old. Some strategies include setting it to the 5 previous runs within your development and test environment. In your production environment, which may go for long periods of time between each run of a particular Job, you may want to set it to ‘every 30 days’ where you plan on running your Jobs on a daily basis.

Next is a button that allows you to protect a Project. Since most people want to make changes regularly during development work, it’s not very useful for development environments. It is useful in production, however, where you don’t want people to make modifications to Jobs.

The Environment button allows you to set global environment variables for your entire Project. These will be Project-specific. Project-specific Environment variables are only for this one particular Project. These will be separate from your overall DataStage Environment variables, which are global to the entire Server.

The “Generate operational metadata” checkbox will allow the data, in particular the row counts, to be captured as it is being generated, for later analysis and for understanding the data in its ‘operational’ form. Metadata is usually understood as descriptive (how big is a field, what type of data is it) but operational metadata tells us things like “How frequently has it been run” and “How much volume has gone through it”. This can be important later as you determine your scalability and future considerations such as capacity planning.
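The two auto-purge choices described above (keep the last N runs, or keep entries newer than N days) can be modelled with a short Python sketch. The log-entry layout, the function names, and the dates are hypothetical; this only illustrates the two policies and is not the mechanism DataStage uses internally.

```python
from datetime import date, timedelta


def purge_by_runs(log_entries, keep_runs):
    """Keep only the entries that belong to the `keep_runs` most recent runs."""
    if keep_runs <= 0:
        return []
    recent_runs = sorted({entry["run"] for entry in log_entries})[-keep_runs:]
    return [entry for entry in log_entries if entry["run"] in recent_runs]


def purge_by_age(log_entries, max_age_days, today):
    """Keep only the entries logged within the last `max_age_days` days."""
    cutoff = today - timedelta(days=max_age_days)
    return [entry for entry in log_entries if entry["logged_on"] >= cutoff]


today = date(2024, 1, 31)  # arbitrary reference date for the illustration

# Eight hypothetical runs, logged 35, 30, 25, ... 0 days before `today`.
log_entries = [
    {"run": n, "logged_on": today - timedelta(days=40 - 5 * n)}
    for n in range(1, 9)
]

development_log = purge_by_runs(log_entries, keep_runs=5)
production_log = purge_by_age(log_entries, max_age_days=30, today=today)

print(len(development_log), "entries kept by the run-count policy")
print(len(production_log), "entries kept by the age policy")
```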

When you click on the Environment button, it opens the Environment variables dialog box. Here, there are two panes in the window. On the left, we have our categories of environmental variables such as General, which includes Parallel environmental variables. The other category is User Defined. Under each category area, the corresponding pane on the right displays the name of the environmental variable to be used, its prompt, and then the value that will be used for the entire DataStage Project.

Environment Reporting Variables

A standard category is for environmental variables for reporting, such as APT_DUMP_SCORE, APT_MSG_FILELINE, and APT_NO_JOBMON. By setting these here, you can have all the developers on the Project use them in a standardized way. However, each of these variables can also be set individually in each Job, at runtime. That is to say, you can have just one Job contain a different value from the other uses of that same environmental variable throughout the rest of the Project.
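The relationship between Project-wide defaults and a per-Job override can be pictured with a small Python sketch. The variable names come from the text, but the values shown and the `effective_environment` helper are assumptions for illustration only; they do not reflect the actual defaults or how DataStage resolves variables internally.

```python
from collections import ChainMap

# Hypothetical Project-level defaults (values are placeholders, not real defaults).
project_defaults = {
    "APT_DUMP_SCORE": "False",
    "APT_MSG_FILELINE": "False",
    "APT_NO_JOBMON": "False",
}


def effective_environment(project_defaults, job_overrides):
    """Job-level values win; anything not overridden falls back to the Project."""
    return dict(ChainMap(job_overrides, project_defaults))


# One Job asks for a different value at runtime; every other Job keeps the defaults.
debug_job_env = effective_environment(project_defaults, {"APT_DUMP_SCORE": "True"})
normal_job_env = effective_environment(project_defaults, {})

print(debug_job_env["APT_DUMP_SCORE"], normal_job_env["APT_DUMP_SCORE"])
```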


DataStage Administrator Permissions Tab

The Permissions tab is where we set up users and groups for a variety of authorizations. We would add a user and then assign a product role. Once we create a new user, we can see in the User Role drop-down menu that there are several roles available for us to assign to that user or group. These consist of Operator, Super Operator, Developer, and Production Manager. The “DataStage Operator to view the full log” checkbox is typically used unless you have a particular reason not to, such as sensitive data or Operators being overwhelmed by too much information. Typically, we allow Operators to view the full log so that if they encounter problems during a DataStage Job, they can easily communicate that information to their second line of support when escalating a problem.

Let’s continue our discussion by moving over to the Tracing tab. This allows us to perform engine traces. Normally, you will not use this unless you are working with IBM support while experiencing a difficult problem that cannot be solved any other way.

The Parallel tab contains some options. The checkbox at the top allows you to see the Orchestrate Shell (OSH) scripting language that has been generated from within your Job properties box. It creates a tab entitled OSH. Further down, there are some advanced options for doing things such as creating format defaults for various data types used in the parallel framework for this Project.

Under the tab entitled Sequence, we can add checkpoints so that the Job sequences are restartable upon any failure. We can automatically handle activities that fail, log warnings after activities that finish with a status other than ‘OK’, and log report messages after each Job run.
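The idea of a restartable sequence can be sketched in Python as follows. The `run_sequence` function, the checkpoint set, and the two sample activities are all hypothetical; the sketch only shows the behaviour the checkpoint option is described as giving (a restart skips what already finished), not how the Sequence Job implements it.

```python
def run_sequence(activities, checkpoints):
    """Run activities in order, skipping any that a previous run already completed."""
    log = []
    for name, action in activities:
        if name in checkpoints:
            log.append(f"skipped {name} (checkpoint found)")
            continue
        try:
            action()
        except Exception as error:
            # Stop here; the checkpoints recorded so far are kept for the restart.
            log.append(f"failed {name}: {error}")
            return False, log
        checkpoints.add(name)
        log.append(f"finished {name}")
    checkpoints.clear()  # the whole sequence completed, so the next run starts fresh
    return True, log


attempts = {"load": 0}


def extract_job():
    """Hypothetical activity that always succeeds."""


def load_job():
    """Hypothetical activity that fails on its first attempt only."""
    attempts["load"] += 1
    if attempts["load"] == 1:
        raise RuntimeError("target unavailable")


activities = [("extract", extract_job), ("load", load_job)]
checkpoints = set()

first_ok, first_log = run_sequence(activities, checkpoints)    # fails at "load"
second_ok, second_log = run_sequence(activities, checkpoints)  # restarts at "load"

print(first_log)
print(second_log)
```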

Work on Projects and Jobs

Once we get our engine configured and set our basic Project properties, the next thing that we will want to do is to work on a new Project. If we generate all these DataStage Jobs from scratch, we won’t be re-using anything from previous Projects. However, it often makes sense to leverage work that we have already done. We may want to import things from a previous Project. These include program objects such as Jobs, Sequences, and Shared Containers; Metadata objects (for instance, Table Definitions); Data Connections; things used globally at runtime such as Parameter Sets; and other objects, such as routines and transforms, that we want to bring along and re-use from a previous Project.

Exporting and Importing Objects


Export Window

In order to make this happen, the first thing that we need to do is to export the objects from one environment. Then we will be able to import them. Here in the Export window, we can highlight certain objects and then add them to our export set. One important thing to consider is the ‘Job components to export’ drop-down menu shown highlighted. This will allow you to export the job designs with executables, where applicable. Let’s say that we are promoting all of these objects to a production environment that does not have its own compiler. You will then need to have these pre-compiled Jobs ready to run when they reach their target. You can opt to exclude ‘read only’ items. This is a typical default.

Next, you would choose which file you would like to export it into. This is a text file that will ‘live’ on your client machine. All of the export functionality is done through client components. Once we have identified all the objects that we want to export, we simply click the Export button to begin the export.


We export either into an XML file or the native DSX file. Once this is accomplished, then we can begin our import. For the import, we would use the DataStage Designer client, click its import option and select the objects to import from the flat file into our new environment. Let's take a look at that dialog box.

This file that we are importing may serve as a backup in that it allows us to import object-by-object. Our export is exactly like a database export in so far as it allows us to pull out one object from the export at a time. Should someone accidentally delete something (in the new environment), instead of losing the entire box, we only lose one object, which can then be restored, without losing any of the other work that you’ve done in the meantime, by importing it from the export file.

On the Import screen, first, you will need to identify the export file. You then have a choice to ‘Import all’ or ‘Import selected’. At that time, you can choose to ‘Overwrite without a query’ if you know that everything that you want is the most current. You can also choose to perform an ‘Impact analysis’ to see what objects are related to each other.

Choose Stages

Now that we’ve talked about setting up our Project, let’s now focus on building Jobs. Once you have set things up and are ready to begin developing Jobs, there are some very important key components that we will use during our development. In particular, let’s talk about the DataStage Stages that we see on our Canvas. There are two classes of Stages, Passive Stages and Active Stages. First, let’s cover the Passive Stages.

Passive Stages

Passive Stages are typically used for ‘data in’ and ‘data out’, such as working with sequential files, databases, and Data Sets (Data Sets are the DataStage proprietary high-speed data format). So the Passive Stages are beginning points in the data flow or ending points, whereas the Active Stages are continuous flow Stages that go in the middle of the data flow. Active Stages are the things that actually perform work on the data: as rows are coming in, they immediately go out. The Active Stages manipulate the data. For example, the Transformer Stage, the Filter Stage, and the Modify Stage pass data as they receive it, unlike a Passive Stage, which either only inputs data or only outputs data. Most notable of the Active type is the Transformer Stage, which has many functions. We’ll get to this Stage a little later.

Types of File Data

Let’s describe the types of file data that we’ll be dealing with as we begin to develop Jobs. Primarily this falls into three categories: sequential file data (which should be of either fixed or variable length); Data Sets, which are the DataStage proprietary format for high-speed access; and complex flat file data, which is most notably used with older COBOL-type structures. These allow us to look at complicated data structures embedded inside of files and are used in proprietary formats, much the way that COBOL does.

How Sequential Data is Handled

When we are using our Sequential File Stage, it is important to know that DataStage has an import/export operator working on the data (not to be confused with importing and exporting objects, components, and metadata). This Sequential File Stage’s import and export is about taking the data from its native format, parsing it, and putting it into an internal structure (within DataStage) so that it can be used in the Job and then, likewise, on the way out (when writing it), taking that internal structure and writing it back out into a flat file format.

Business Intelli Solutions Inc www. we’ll see how we can push those rejected files out.businessintelli. we could look in the log and see messages relating to things such as ‘How many records were imported successfully’ and ‘How many were rejected’.1 Introduction After the Job has been run. Later.com .Information Server 8. it is because they cannot be converted correctly during the import or the export. When the records get rejected.

Normally, this is going to execute in sequential mode but you can use one of the features of the Sequential File Stage to be able to read multiple files at the same time. These will execute in parallel when reading multiple files. If we set up multiple readers, we can read chunks of a single file in parallel. The Stage needs to ‘know’ how the file is divided into rows and it needs to ‘know’ what the record format is (e.g. how the row is divided into columns). Typically, a field of data within a record is either delimited or in a fixed position. We can use record delimiters (such as the new-line) and column delimiters (such as the comma) to describe this layout.
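As a rough picture of what the import step does with delimiters and rejects, here is a small Python sketch. The `import_records` function, the two-column layout, and the sample file text are hypothetical; the sketch only mirrors the behaviour described above (split on the record delimiter, split on the column delimiter, convert, and reject what cannot be converted).

```python
def import_records(text, columns, record_delimiter="\n", column_delimiter=","):
    """Parse delimited text into typed rows; collect records that cannot be converted."""
    imported, rejected = [], []
    for record in text.split(record_delimiter):
        if not record:
            continue  # ignore the empty piece after a trailing record delimiter
        fields = record.split(column_delimiter)
        if len(fields) != len(columns):
            rejected.append(record)  # wrong number of columns
            continue
        try:
            row = {name: convert(value) for (name, convert), value in zip(columns, fields)}
        except ValueError:
            rejected.append(record)  # a field could not be converted to its type
            continue
        imported.append(row)
    return imported, rejected


# Hypothetical record format: a string column followed by an integer column.
columns = [("sku", str), ("quantity", int)]
file_text = "A100,5\nB200,abc\nC300\nD400,7\n"

imported, rejected = import_records(file_text, columns)
print(len(imported), "records imported;", len(rejected), "rejected")
```

In a Job, the rejected records would travel down a reject link rather than being returned in a list.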

Job Design Using Sequential Stages

The graphic above shows how we can read data from the Selling_Group_Mapping file, send it into our Copy Stage and load it into the target flat file. However, if anything is not read correctly by the Selling_Group_Mapping link, it will be sent down the dashed reject link entitled Source_Rejects and, in this case, read by a Peek Stage. Over on the right, by the target, if anything isn’t written correctly, it will be sent out to the Target_File_Target_Rejects link to the TargetRejects Peek Stage. To configure the Sequential File Stage, we would double-click it.

Sequential Source Columns Tab

Then, within the Sequential File Stage, on the Output tab, under the Columns sub-tab, we would configure our Input file column definitions. The column metadata under the Columns sub-tab looks like all of the other metadata that we’ve seen for this file. In other words, all of the Column Names, Keys, SQL Types, Lengths, Scales, etc. look like the metadata that we have elsewhere. Next, let’s say that we clicked on the sub-tab entitled ‘Properties’.

Under the Properties tab, we are presented with some options. These options help us define how we will read the file. For instance, there is the file’s name, its read method, and other options such as whether or not the first line is a column header, whether or not we will reject unreadable rows, etc.

Datasets

The other primary file that we use is the Data Set. This kind of file stores data in a binary format that is non-readable without a DataStage viewer. It preserves the partitioning that we established during our Jobs. This way, the data contained in the Data Set is already partitioned and ready to go into the next parallel Job in the proper format. Data Sets are very important to our parallel operations in that they work within the parallel framework and use all the same terminology and are therefore accessed more easily.

We use a Data Set as an intermediate point to land data that does not need to be interchanged with any application outside of DataStage: it will only be used within DataStage Jobs going from one Job to the next (or just within a Job). It is not a useful form of data storage for sending to other people as an FTP file, or for EDI, or for any of the other normal forms of data interchange. It is a proprietary DataStage file format that can be used within the InfoSphere world for intermediate staging of data where we do not want to use the database or a sequential file.

Data Set Management Utility

You may need to look at a Data Set to verify that it looks ‘right’ or to find specific values as a part of your testing. You can look at a Data Set using the Designer client: there is a tool option that allows you to do data set management. It lets you first seek out the file that you want. It then opens a screen, which will show you how the file is separated. It shows the number of partitions and nodes and it shows how the data is balanced between them. And, if you would like, it also includes a display option for your data. Let’s look at this display.

Data and Schema Displayed

A Data Set cannot be read like a typical sequential file by native tools such as the VI Editor or your Notepad editor in Windows. When looking at the Data Viewer, we will see data in the normal tabular format, just as you view all other DataStage data using the View command.

Connect to Databases

Now that we’ve talked about some of the basic Passive Stages for accessing file data, we will take a look at some of the basic Database Stages that allow us to connect to a variety of relational databases. However, before we can access our databases, we need to talk a little bit about working with relational data, so let’s begin there. To do this, we need to know how to import relational data. To import relational data, we will use the Table Definition Import utility within the Designer client, or you can use the orchdbutil. The orchdbutil is the preferred method to get correct type conversions. However, in any situation, you need to verify that your type conversions have come through correctly and will suit the downstream needs (they must suit the needs of subsequent Stages and other activities within DataStage).

You will need to work with Data Connection objects. Data Connection objects store all of our database connection information in one single named object so that, when we go into our other Stages in Jobs, we won’t need to include every piece of detailed information, such as the User ID and password, for the various databases that we’re connecting to.

Next we need to see what Stages are available to access the relational data. We just talked briefly about Connector Stages. They provide parallel support, are the most functional, and provide a consistent GUI and functionality across all the relational data types. Then, we have the Enterprise Stages. These are the legacy Stages from previous versions of DataStage. They provide parallel support as well with the parallel extender set. There are also things called Plug-in Stages: these are the oldest family of connectivity Stages for databases and other relational sources. These were ported to the current version of Information Server in order to support DataStage Server Jobs and their functionality.

When you have one of these Stages, it gives you the ability to select the data you want. We have the ability to use Select statements as we would in any database. With DataStage, you get a feature called the SQL Builder utility that quickly builds up such statements, allowing you to get the data that you desire. When we’re writing the data, DataStage has a similar functionality that allows you to build INSERT, UPDATE, and DELETE statements, also using the SQL Builder.

Import Table Definitions

The first thing that we want to do is to import our table definitions. To import these table definitions, we can use either ODBC or the Orchestrate schema definitions. Orchestrate schema imports are better because the data types tend to be more accurate. From within the Designer client, you simply click on Import>Table Definitions>Orchestrate Schema Definitions. Alternatively, you could choose to import table definitions using the ODBC option. Likewise, you can use the Plug-In table definitions or other sources from which you can import metadata, including legacy information such as COBOL files, which can then later be translated into pulling that same data out from corresponding databases.

ODBC Import
When we’re using our ODBC Import, first we select the ODBC data source name (DSN). This will need to have been set up for you by your System Administrator before you use the screen shown above (otherwise, all the entries will be blank). Many will need a username and password although some may come pre-configured.

Orchestrate Schema Import
The Orchestrate Schema Import utility does require certain bits of information including the database type, the name of the database from which you are pulling it, the server on which it is hosted, username, and password.

Connector Stage Types
There are several Connector Stage types including ODBC, which conforms to the ODBC 3.5 standard and is level 3 compliant and certified for use with Oracle, DB2 UDB, SQL Server, and various other databases. This will also include the suite of DataDirect drivers, which allow you to connect to these various data sources and give you unlimited connections. The next connector type is the DB2 UDB Stage. This is useful for DB2 versions 8.1 and 8.2. There is also a connector for the WebSphere MQ to allow us to connect to MQ series queues. It can be used with MQ 5.3 and 6.0 for client/server. And there is also WSMB 5.0. The last Connector Stage type is for Teradata. This gives us fast access to Teradata databases.

Connector Stages
Now that we’ve done some of the basic set up for our relational data, we want to be able to use it on our Canvas. You can see, in the graphic above, that an ODBC connector has been dragged and dropped on the left. Next we would want to configure it so that we can then pull our data from an ODBC source. Let’s take a look inside of the Stage.

Stage Editor
Inside of the Connector Stage, you can see a Navigator panel, an area in which we can see the link properties, and then below, we can see the various other properties of the connection itself. Once we have chosen the connector for our ODBC Connector Stage, we will then want to configure an SQL statement that will allow us to pull data. We can use the SQL Builder tool highlighted on the right. Or we can manually enter our own SQL. Alternatively, we can use SQL that is file-based.

Connector Stage Properties
When configuring the Connector Stage, we should have our connection information and know what SQL we will be using. This may include any transaction information and session management information that we entered (this would be our record count and commitment control shown highlighted). Additionally, we can enter any Before SQL commands or After SQL commands that may need to occur such as dropping indexes, re-creating indexes, removing constraints, or various other directives to the database.

Building a Query Using SQL Builder
When you build a query using the SQL Builder utility, there are certain things that you’ll need to know: Be sure that you are using the proper table definition and be sure that the Locator tab information is specified fully and correctly. You can also drag the table definition to the SQL Builder Canvas. And, you can drag on the columns that you want to select.


Data Connection
Data Connections are objects in DataStage that allow us to store all of the characteristics and important information that we need to connect to our various data sources. The Data Connection stores the database parameters and values as a named object in the DataStage repository. It is associated with a Stage type. The property values can be specified in a Job Stage of the given type by loading the Data Connection into that Stage.


Creating a New Data Connection Object
You can see the icon that represents the Data Connection highlighted above.


Select the Stage Type
Inside the Data Connection, we can select the Stage type that we’re interested in using and that we want to associate this to. Then we specify any parameters that are needed by that particular database. Obviously, different databases have different required parameters, so the selection of the Stage type is important.



Loading the Data Connection
Once we have built our Data Connection, then we will want to load the Data Connection into the Stage that we have selected.

Active Stages: Lookup, Merge, and Join Stages
Aside from Passive Stages, our other main classification of Stages is the Active Stage. Some of the Active Stages that we’ll be looking at include a number of functionalities such as data combination, the transformation of data, filtering of data, modification of metadata, and a number of other things for which there are corresponding Active Stages.

Data Combination
First, let’s talk about data combination. For this, the three Stages that we use are the Lookup Stage, the Merge Stage, and the Join Stage. These Stages combine two or more input links. These Stages differ, mainly, in the way that they use memory. They also differ in how they treat rows of data when there are unmatched key values. Some of these Stages also have input requirements such as needing to sort the data or deduplicate it prior to its combination.

The Lookup Stage
Some of the features of the Lookup Stage include its requirement for one input for its primary input link. You can have multiple reference links. It can only have one primary output link (and an optional ‘reject’ link if that option is selected).

Lookup Failure Actions
The Lookup Stage is limited to one output link; however, with its lookup failure options, you can include a reject link. Other lookup failure options include continuing with the row of data through the process (passing the data through with a null value for what should have been looked up), dropping the row, or we could just fail the Job (which will abort the Job).

The Lookup Stage can also return multiple matching rows, so you must be careful how you use it. The Lookup Stage builds a Hash table in memory from the Lookup file(s). This data is indexed by using a hash key, which gives it a high-speed lookup capability. You should make sure that the Lookup data is small enough to fit into physical memory.

Lookup Types
There are different types of Lookups depending on whether we want to do an equality match, a caseless match, or a range lookup on the reference link.
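The three lookup types can be illustrated outside DataStage with a small Python sketch. This is not DataStage code: the function names and the sample data are hypothetical, and the range lookup assumes non-overlapping ranges sorted by their lower bound.

```python
from bisect import bisect_right


def equality_lookup(table, key):
    """Exact match: the key must equal a reference key character for character."""
    return table.get(key)


def caseless_lookup(table, key):
    """Caseless match: compare keys after folding both sides to one case."""
    folded = {k.casefold(): v for k, v in table.items()}
    return folded.get(key.casefold())


def range_lookup(ranges, value):
    """Range lookup: return the result whose inclusive [low, high] range holds value.

    Assumes `ranges` is a list of (low, high, result) tuples sorted by `low`
    with no overlaps (an assumption of this sketch).
    """
    lows = [low for low, _, _ in ranges]
    i = bisect_right(lows, value) - 1
    if i >= 0 and ranges[i][0] <= value <= ranges[i][1]:
        return ranges[i][2]
    return None


items = {"Widget": 10}
bands = [(0, 9, "small"), (10, 99, "medium")]
```

With this data, an equality lookup of "widget" finds nothing, a caseless lookup of "WIDGET" finds 10, and a range lookup of 10 falls in the "medium" band.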

Lookup Example
Let’s see how a Lookup Stage might typically fit into a Job. We have a primary input coming in from the left and the data flows into the Lookup Stage. Coming in from the top is our Reference data. Notice that the reference link is dashed (a string of broken lines). Then coming from our Lookup Stage is the output link going to the target. As you are dragging objects onto the Canvas and building the Job, always draw the Primary link before the Reference link. What would happen if we clicked on the Lookup Stage to see how it's configured? Let's take a look.

Lookup Stage With an Equality Match
Within the Lookup Stage, there is a multi-pane window. Everything on the left represents the data coming into the Lookup Stage: both the Primary link and the Reference link(s). Everything on the right is coming out of the Lookup Stage. On the top of the screen is more of a graphical representation of each row. Toward the bottom of the screen is a more metadata-oriented view in which you can see each of the links and their respective characteristics. Notice that the metadata references (shown highlighted lower left) look like the table definitions that we used earlier to pull metadata into the DataStage Project. This same metaphor applies to most other Stages as well.

Highlighted toward the top is the place where you connect column(s) from the input area (the Item input link) by dragging them down and dropping them onto the equivalent fields of the corresponding Reference link (the Warehouse Item, in the lower blue area). In this case, we are using a single column for the equality match: a “Warehouse Item” must be equal to an “Item”.

Next, we can move all the columns from the input area on the left to the output link area on the right but replace the Item description (highlighted in the right box) with the Warehouse Item description (the one that we found in the Reference link). So, we are taking all of our input columns and we are adding a new column to them based on what we looked up on the Warehouse Item ID.

When you click on the icon with golden-chains (shown highlighted) it gives you a new dialog screen.

Specifying Lookup Failure Actions
This dialog box shows us the constraints that we can use. Here we can determine what actions to take should our lookup fail. You can continue the row, drop the row, fail the row, or reject it. Also, if you use a condition during your lookup (highlight on fail under conditions not met), you can use these same options of continue, drop, fail, or reject for if/when your condition is not met.
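To make the four failure actions concrete, here is a minimal Python sketch of a hash-based lookup. It is an illustration of the idea only, not how DataStage implements the Stage; the function name `lookup`, the `on_failure` argument, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def lookup(primary_rows, reference_rows, key, on_failure="continue"):
    """Match each primary row against an in-memory hash table of reference rows.

    on_failure is one of "continue", "drop", "fail", "reject".
    Returns (output_rows, reject_rows).
    """
    # Build the hash table once; a key may map to several reference rows.
    table = defaultdict(list)
    for ref in reference_rows:
        table[ref[key]].append(ref)
    ref_columns = [c for c in (reference_rows[0] if reference_rows else {}) if c != key]

    output, rejects = [], []
    for row in primary_rows:
        matches = table.get(row[key])
        if matches:
            # One output row per matching reference row.
            output.extend({**row, **ref} for ref in matches)
        elif on_failure == "continue":
            # Pass the row through with nulls for the looked-up columns.
            output.append({**row, **dict.fromkeys(ref_columns)})
        elif on_failure == "drop":
            continue
        elif on_failure == "reject":
            rejects.append(row)
        else:  # "fail": abort the whole run
            raise RuntimeError(f"lookup failed for {key}={row[key]!r}")
    return output, rejects


primary = [{"item": 1, "qty": 5}, {"item": 2, "qty": 7}]
reference = [{"item": 1, "descr": "bolt"}]
```

Because the whole reference table sits in a dictionary, memory use grows with the size of the reference data, which is why the lookup data should fit into physical memory.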

Join Stage
The Join Stage involves the combination of data. There are 4 types of joins: inner, left outer, right outer, and the full outer join. As with a Lookup Stage or a join in an SQL query, in the Join Stage we will use columns to pull our data together. The input links must be sorted before they come into a Join Stage. The Join Stage will use much less memory: It is known as a ‘light-weight’ Stage. The data doesn't need to be indexed and instead has come into the Stage in a sorted fashion. We must identify a left and a right link as they come in. This Stage supports additional ‘intermediate’ links.
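The reason sorted input keeps memory use low can be seen in a sort-merge join, sketched below in Python. This is a generic illustration rather than DataStage's implementation; `sorted_join` and its `how` argument are hypothetical names, only two input links are handled, and both inputs are assumed to be lists already sorted by the key.

```python
from itertools import groupby
from operator import itemgetter


def _groups(rows, key):
    """Yield (key value, rows with that key) from rows sorted by key."""
    for k, g in groupby(rows, key=itemgetter(key)):
        yield k, list(g)


def sorted_join(left, right, key, how="inner"):
    """Join two key-sorted lists of dicts; how is inner, left, right, or full."""
    left_cols = [c for c in (left[0] if left else {}) if c != key]
    right_cols = [c for c in (right[0] if right else {}) if c != key]
    out = []
    lg, rg = _groups(left, key), _groups(right, key)
    l, r = next(lg, None), next(rg, None)
    # Only one key group per side is held in memory at any time.
    while l is not None or r is not None:
        if r is None or (l is not None and l[0] < r[0]):
            if how in ("left", "full"):  # left key with no partner
                out += [{**row, **dict.fromkeys(right_cols)} for row in l[1]]
            l = next(lg, None)
        elif l is None or r[0] < l[0]:
            if how in ("right", "full"):  # right key with no partner
                out += [{**dict.fromkeys(left_cols), **row} for row in r[1]]
            r = next(rg, None)
        else:  # keys match: emit every left/right combination
            out += [{**a, **b} for a in l[1] for b in r[1]]
            l, r = next(lg, None), next(rg, None)
    return out


left_rows = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right_rows = [{"id": 2, "b": "p"}, {"id": 3, "b": "q"}]
```

Unlike the hash table used by a lookup, nothing here needs to be indexed up front; the price is that both inputs must arrive sorted on the join key.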

Job with Join Stage
Above is an example of a Job that uses the Join Stage. Notice the green Join Stage icon in the middle. Similar to the Lookup Stage, here, we have a primary input, a secondary input, and an output. But, in this case, notice that the right outer link input is not a dashed line (highlight on the upper link). Remember that the reference link that we saw earlier did have a dashed line.

Join Stage Editor
When we get inside the Join Stage, we will use the Properties editor. Here, we will use the attributes necessary to create the join. We will specify the join type that we want and what kind of key we will be joining on. Similar editors are found in other Stages.

The Merge Stage
The Merge Stage is quite similar to the Join Stage. Much like the Join Stage, its input links must be sorted. Instead of a left and a right link, we have a 'Master' link and one or more Secondary links. It is a ‘light-weight’ Stage in that it uses little memory (this is because we expect the data to already be sorted before coming into the Stage and therefore there are no keys in memory [indexes to be used]). Unmatched Master rows can be kept or dropped. Keeping them gives the effect of a left-outer join. Unmatched Secondary rows can be captured in a Reject link and dispositioned accordingly.

When the data is not ‘pre-sorted’, the sorting can be done by using an explicit Sort Stage or you can use an On-link sort. If you have an On-link sort, then you will see its icon on the Link (the icon is shown in the lower-left of the graphic). The icon toward the right of each link above is the partitioning icon; it looks like a fan (partitioning the data out, and then collecting the partitioned data back up).

Coming out of the Merge Stage, we have a solid line for the main output and a dashed line for the rejects. The latter will capture the incoming rows that were not matched in the merge.

Merge Stage Properties
Inside the Merge Stage, you must specify the attributes by which you will pull together the two sets of data. You must also specify what to do with the unmatched master records, whether to warn on unmatched masters, and whether to warn on rejected updates. This is another way to combine data, which will create more columns within one row and, in some cases, produce multiple rows.
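The master/secondary behaviour can be sketched in a few lines of Python. This is a simplified illustration, not the Stage itself: it handles a single secondary (update) link, assumes both inputs are sorted by the key with unique key values, and uses the hypothetical names `merge_stage` and `keep_unmatched_masters`.

```python
def merge_stage(master, update, key, keep_unmatched_masters=True):
    """Merge one key-sorted update list into a key-sorted master list.

    Returns (merged_rows, rejected_update_rows).
    """
    update_cols = [c for c in (update[0] if update else {}) if c != key]
    merged, rejects = [], []
    i = 0
    for m in master:
        # Update rows whose key sorts before this master key matched nothing.
        while i < len(update) and update[i][key] < m[key]:
            rejects.append(update[i])
            i += 1
        if i < len(update) and update[i][key] == m[key]:
            merged.append({**m, **update[i]})  # master row gains update columns
            i += 1
        elif keep_unmatched_masters:
            merged.append({**m, **dict.fromkeys(update_cols)})  # left-outer effect
    rejects.extend(update[i:])  # updates past the last master key
    return merged, rejects


masters = [{"id": 1, "n": "a"}, {"id": 2, "n": "b"}, {"id": 4, "n": "d"}]
updates = [{"id": 2, "v": 20}, {"id": 3, "v": 30}]
```

Keeping unmatched masters yields three rows (two of them with a null `v`), dropping them yields one, and in both cases the update row with `id` 3 lands on the reject output.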

The Funnel Stage
The Funnel Stage provides a way to bring many rows from different input links into one common output link. The important thing here is that all sources must have identical metadata! If your links do not have this, then they must first go through some kind of Transformer Stage or Modify Stage in order to make all of their metadata match before coming into the Funnel Stage.

The Funnel Stage works in three modes. The Continuous mode is where records are combined in no particular order. The first one coming into the Funnel is the first one put out. The Sort mode is where we combine the input records in an order defined by keys. This produces a sorted output if all input links are sorted by the same key. The Sequence mode outputs all of the records from the first input link and then outputs all from the second input link, and so on (based on the number of links that you have coming into your Funnel Stage).
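The three modes map onto familiar list operations, as this Python sketch suggests. It is only an analogy: the function names are hypothetical, and the round-robin interleave in `funnel_continuous` is a stand-in for "whichever row arrives first", since real arrival order is not deterministic.

```python
import heapq
from itertools import chain, zip_longest

_MISSING = object()  # sentinel for links that have run out of rows


def funnel_sequence(*links):
    """Sequence mode: all rows of the first link, then the second, and so on."""
    return list(chain(*links))


def funnel_sort(*links, key):
    """Sort mode: merge links that are each already sorted by the same key."""
    return list(heapq.merge(*links, key=key))


def funnel_continuous(*links):
    """Continuous mode: no particular order (modelled here as round-robin)."""
    return [row
            for batch in zip_longest(*links, fillvalue=_MISSING)
            for row in batch
            if row is not _MISSING]


link_a = [1, 4, 7]
link_b = [2, 3, 9]
```

Note that `funnel_sort` only produces sorted output because each input is already sorted, which mirrors the requirement stated for Sort mode.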

Funnel Stage Example
Above, you can see an example of the Funnel Stage’s use.

Let’s continue our discussion on Active Stages. Here, we’ll briefly talk about the Sort Stage and the Aggregator Stage. Obviously, the Sort Stage is used to sort data that requires sorting, as was the case with the Join Stage and the Merge Stage. As mentioned, sorts can be done on the input link to a given Stage. This on-link sort is configured within a given Stage on its Input link’s Partitioning tab. Just set the partitioning to anything other than Auto.

Alternatively, you can have a separate Sort Stage. One of the advantages to the Sort Stage is that it gives you more options for controlling memory usage during the sort. The advantage to on-link sorting (within the Stage) is that it is quick and easy and can often be done in conjunction with any partitioning that you’ll be doing.

Sorting Alternatives
The Sort Stage is highlighted above. In this example, we have sequential data coming into a Sort Stage before it is moved to a Remove_Duplicates Stage and then sent out to a Data Set. Below, we have a different example in which a file is being sorted directly into a Data Set but in this case the sort is happening directly on the link (on-link sort).

Our next Active Stage is the Aggregator Stage. The purpose of this Stage is to perform data aggregations. When using it, you specify one or more key columns, which define the aggregation units (or groups). Columns to be aggregated must be specified. Next, you specify the aggregation functions. Some examples of functions are counting values (such as nulls/non-nulls), summing values, or determining the maximum, the minimum, or ranges of values that you are looking for. The grouping method (hash table or pre-sort) is often a performance issue. If your data is already sorted, then the hash table is not necessary. If you don’t want to take the time to do a sort of your data prior to aggregation, then using the default hash option will allow you to pull the data in without any sorting ahead of time.
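The trade-off between the two grouping methods can be sketched in Python. This is a generic illustration under assumed names (`aggregate_hash`, `aggregate_presorted`) and sample data; it is not the Aggregator Stage's actual implementation.

```python
from itertools import groupby


def _new():
    return {"count": 0, "sum": 0, "min": None, "max": None}


def _update(acc, value):
    acc["count"] += 1
    acc["sum"] += value
    acc["min"] = value if acc["min"] is None else min(acc["min"], value)
    acc["max"] = value if acc["max"] is None else max(acc["max"], value)


def aggregate_hash(rows, key, column):
    """Hash method: input order does not matter, but every group stays in memory."""
    groups = {}
    for row in rows:
        _update(groups.setdefault(row[key], _new()), row[column])
    return groups


def aggregate_presorted(rows, key, column):
    """Pre-sort method: rows must be sorted by key; one group in memory at a time."""
    for k, group in groupby(rows, key=lambda r: r[key]):
        acc = _new()
        for row in group:
            _update(acc, row[column])
        yield k, acc


sales = [
    {"region": "E", "amt": 5},
    {"region": "W", "amt": 2},
    {"region": "E", "amt": 3},
]
by_hash = aggregate_hash(sales, "region", "amt")
by_sort = dict(aggregate_presorted(sorted(sales, key=lambda r: r["region"]), "region", "amt"))
```

Both approaches give the same totals; they differ in whether you pay for a sort up front or for a table that holds every group at once.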

Job with Aggregator Stage
In this Job we can see the Aggregator Stage’s icon. The Aggregator Stage uses a Greek Sigma character as a symbol for summation. Data flows from a Data Set (at the top of the data flow) and is copied two ways. One way (to the right) is sent to a Join Stage. Another way (straight down) is sent to an Aggregator Stage to aggregate the row counts and then (flowing from the Aggregator to the Join) is passed back into the same Join so that each row that came into this Job can also have, attached to it, the total number of rows that are being counted.

Remove Duplicates Stage
Our next Active Stage is the Remove Duplicates Stage. The process of removing duplicates can be accomplished using the Sort Stage with the Unique option. However, this leaves us with no choice as to which duplicate to keep. Use of the Unique option provides for a Stable sort which always retains the first row in the group or a Non-Stable sort in which it is indeterminate as to which row will be kept. The alternative for removing duplicates is to use the Remove Duplicates Stage. This tactic gives you more sophisticated ways to remove duplicates. In particular, it lets you choose whether you want to retain the first or the last of the duplicates in the group.
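The first-or-last choice is easy to picture in Python. The sketch below assumes the rows are already sorted by the key (as they would be after a Sort Stage); `remove_duplicates` and its `retain` argument are hypothetical names for illustration.

```python
from itertools import groupby


def remove_duplicates(sorted_rows, key, retain="first"):
    """Keep one row per key value from key-sorted rows: the first or the last."""
    out = []
    for _, group in groupby(sorted_rows, key=lambda r: r[key]):
        group = list(group)
        out.append(group[0] if retain == "first" else group[-1])
    return out


dup_rows = [
    {"id": 1, "v": "a"},
    {"id": 1, "v": "b"},
    {"id": 2, "v": "c"},
]
```

Which row counts as "first" or "last" depends entirely on the order in which the rows arrive, so any secondary sort order should be chosen with that in mind.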

This sample Job shows us an example of the Remove Duplicates Stage and we can see what the icon looks like. From the Sort Stage (toward the left of the graphic), the data comes into a Copy Stage toward the bottom, which then runs the sorted data up into the Remove Duplicates Stage. Then the data is sent out to a Data Set.

An Important Active Stage: The Transformer Stage
One of the most important Active Stages is the Transformer Stage. It provides for four things: Column Mappings, Derivations, Constraints, and Expressions to be referenced.

It is one of the most powerful Stages. With column mappings, we can change the metadata, its layout, and its content. Also, within the Transformer Stage are derivations. These are written in a BASIC code and the final compiled code is C++ generated object code. There are a number of out-of-the-box derivations that you can use allowing you to do things such as: string manipulation, character detection, concatenation, and other activities. You can also write your own custom derivations and save these in your Project for use in many Jobs.

Other features of the Transformer include Constraints. The Constraint allows you to filter data much like the Filter Stage does. You can direct data down different Output links and process it differently or process different forms of the data.

The Transformer's use of expressions is for constraints and/or derivations to use as reference. We can use the input columns that are coming into the Transformer to determine how we want to either derive or constrain our data. We can use Job parameters and/or functions (whether they be the native ones provided with DataStage or ones that you have custom-created). We can use system variables and constants, Stage variables (local in their scope to just the Stage, as opposed to being global for the whole Job), and we can also use external routines within our Transformer.

Job with a Transformer Stage
The Transformer Stage has an icon that looks like a T shape with an arrow. In the example Job above, we have the data coming into the Transformer. There is a reject link coming out of the Transformer (performed with the Otherwise option) and two other links coming out of the Transformer as well to populate two different files.

Inside the Transformer Stage
Inside a Transformer Stage we have a look-and-feel similar to that of the Lookup Stage that we talked about earlier, in that we have an Input link (highlighted on the left) and all of our Output links on the right. These two output links (the two boxes on the right) give us a graphical view, specify the column names, and also give us places to enter any derivations. The top area (highlight on upper-most blue box on the right) allows us to create Stage variables, which we can use to make local calculations in the process of doing all of our other derivations. Again, as with the Lookup Stage, the bottom half of the screen resembles the table definitions that you will see inside of your repository (highlight on center lower half). This gives you the very detailed information about the metadata on each column.

(Yellow highlight on the Golden-Chain button in the toolbar at the top) When you click the Golden Chain button, this will bring you to the section that allows you to do Constraints.

Defining a Constraint
Here you can create an expression that will let you put certain data into one link or a different link or not put it into a link. You can type in your constraint into the highlighted area that also uses BASIC terminology. When you right-click in any of the white space, DataStage brings up options for your Expression Builder (highlighted toward the left).

Alternatively, you can also use the Ellipsis button to bring up the same menu. The example above shows an UpCase function and a Job parameter named Channel_Description. At the beginning of the expression, we are using an input column. We compare our input column to an upper-cased version of one of our Job parameters.

Defining a Derivation
When we are defining or building a derivation, we also use the BASIC-like code to create expressions within a particular cell. These derivations can occur within the Stage Variable area or down within any one of the Links. As with the constraints, a right-click or clicking the Ellipsis button brings up the context-sensitive code menu to help you build your expression.

If Then Else Derivation
Using our Transformer Stage, we can also build up “If, then, else” derivations where we can put conditional logic that will affect the data on our output link.

String Functions and Operators
Some things included in the Transformer Stage’s functionality are string functions and operators. The Transformer can use a Substring operator, it can Upcase/Downcase, and/or it can find the length of strings. It provides a variety of other string functions as well such as string substitution and finding positional information.

Checking for NULLs
Other out-of-the-box functionality of the Transformer includes checking for NULLs. DataStage can identify whether or not there are NULLs, assist us in how we want to handle them, and can do various testing, setting, or replacing NULLs as we see fit.
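For readers more at home in a general-purpose language, the kinds of derivation described above (conditional logic, substring, upper-casing, length, NULL replacement) have direct Python analogues. The sketch below is only an analogy: it is not Transformer expression syntax, and the `derive` function, the column names, and the "UNKNOWN" default are hypothetical.

```python
def derive(row):
    """Compute output columns from one input row; None stands in for NULL."""
    name = row.get("name")
    is_null = name is None  # NULL test
    return {
        "name_upper": None if is_null else name.upper(),    # upper-casing
        "prefix": None if is_null else name[:3],             # substring
        "name_len": 0 if is_null else len(name),             # string length
        "name_or_default": "UNKNOWN" if is_null else name,   # NULL replacement
        # if / then / else logic deciding an output value
        "size": "long" if (not is_null and len(name) > 5) else "short",
    }


derived = derive({"name": "widget"})
derived_null = derive({"name": None})
```

Testing for NULL before applying a string function, as each line does here, avoids errors when a value is missing.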

Transformer Functions
Other functions of the Transformer include those for Date and Time, logic, NULL handling, Numbers, Strings, and those for type-conversion.

Transformer Execution Order
It is important to know that, when a Transformer is executed, there is a very specific order of execution. The first things that are executed in the Transformer for each row are the derivations in the Stage Variables. The second things that are executed are the constraints for each link that is going out of the Transformer. Then within each link, the next things that are executed are the column derivations, and they are executed in the earlier links before the later links. The derivations in higher columns are executed before lower columns: Everything has a top-down flow to it.
1. Derivations in Stage variables (from top to bottom)
2. Constraints (from top to bottom)
3. Within each Link, column derivations (from top to bottom)

Let’s look at a quick example of how this might work. So, at the beginning we will execute all of the Stage variables (from top to bottom) and then execute each constraint (from top to bottom). For each constraint, the Transformer will fire off the corresponding output Link, and within that Link, we do all the columns from top to bottom. Which means that, if we do something in a column above, we can use its results in the column below. TIP: You must be very, very careful if you ever re-arrange your metadata (because of how the order of execution could affect your results within a Transformer).
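The execution order described above can be modelled in a short Python sketch. It follows the text's description (Stage variables, then constraints, then column derivations top to bottom, with an "otherwise" link for rows that satisfy no constraint) but is an illustrative model only; `run_transformer`, the link layout, and the sample columns are hypothetical.

```python
def run_transformer(rows, stage_vars, links, otherwise=None):
    """Process rows in the order: Stage variables, constraints, column derivations.

    stage_vars: list of (name, fn(row, svars)), evaluated top to bottom.
    links: list of {"name", "constraint": fn(row, svars),
                    "columns": [(col, fn(row, svars, out_so_far))]}.
    otherwise: optional name of the link that receives unmatched rows.
    """
    outputs = {link["name"]: [] for link in links}
    if otherwise:
        outputs[otherwise] = []
    for row in rows:
        svars = {}
        for name, fn in stage_vars:          # 1. Stage variables, top to bottom
            svars[name] = fn(row, svars)
        fired = [l for l in links            # 2. constraints, top to bottom
                 if l["constraint"](row, svars)]
        for link in fired:                   # 3. columns, earlier links first
            out = {}
            for col, fn in link["columns"]:
                out[col] = fn(row, svars, out)  # may use columns derived above
            outputs[link["name"]].append(out)
        if not fired and otherwise:
            outputs[otherwise].append(dict(row))
    return outputs


channel_description = "retail"  # hypothetical Job parameter value
stage_vars = [("net", lambda row, sv: row["price"] * row["qty"])]
links = [{
    "name": "matched",
    # input column compared with the upper-cased parameter
    "constraint": lambda row, sv: row["channel"] == channel_description.upper(),
    "columns": [
        ("net", lambda row, sv, out: sv["net"]),
        ("double_net", lambda row, sv, out: out["net"] * 2),  # uses column above
    ],
}]
rows = [
    {"channel": "RETAIL", "price": 3, "qty": 2},
    {"channel": "WEB", "price": 1, "qty": 1},
]
result = run_transformer(rows, stage_vars, links, otherwise="missing")
```

Swapping the two entries in `columns` would raise a `KeyError`, because `double_net` would then run before `net` exists; this is the kind of breakage the TIP about re-arranging metadata warns against.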

Transformer Reject Links
You can also have Reject links coming out of your Transformer. Reject links differentiate themselves from other links by how they are designated inside the constraint. Let’s quickly see how this is done.

Off to the right, we have the “Otherwise Log” option or heading. Checking the respective checkbox creates a Reject link as necessary.

Our “Otherwise” link creates a straight line (such as the Missing Data Link shown above) whereas a “Reject” link would have a dashed line (such as the Link all the way to the left).

Advanced Concepts When Building Jobs
We’ve discussed the key objects in DataStage and we have set up DataStage. We’ve been able to use Passive Stages to pull data in from various files or relational database sources and we’ve talked about various Active Stages that let us manipulate our data. Now, we need to cover a few other important concepts that will help us in our development of DataStage Jobs.
• Run Time Column Propagation
• Shared Containers

Runtime Column Propagation (RCP)
One of the most important concepts is that of Runtime Column Propagation (RCP). When RCP is turned on, the columns of data can flow through a Stage without being explicitly defined in the Stage. We’ll see some examples of this later. With RCP enabled, the target columns in a Stage need not have any columns explicitly mapped to them: No column mapping is enforced at design-time. With RCP, the input columns are mapped to unmapped columns by name. This is all done ‘behind the scenes’ for you.

Let’s talk about how the implicit columns get into a Job. The implicit columns are read from sequential files associated with schemas or from the tables when read from relational databases using a “select” or they can be explicitly defined as an output column in a Stage that is earlier in the data flow of our Job.

The main benefits of RCP are twofold. First, it gives us greater Job flexibility. This way, a Job can process input with different layouts. We don’t have to have a separate Job for multiple record types that are just slightly different. Second, RCP gives us the ability to create more re-usability within components, such as the Shared Containers (there are many ways to create re-usable bits of DataStage ETL code). This way you can create a component of logic and apply it to a single named column while all the other columns will flow through untouched.

When we want to enable RCP, it must be enabled at the Project-level. If it’s not enabled there, then you won’t find the option in your Job. You must enable RCP for the entire Project if you intend to use it. And, if you do, then you can also choose to have each newly created Link use RCP by default or not. This default setting can be overridden at the Job-level or at the Link-level within each Stage. At the Job level, we can enable it for the entire Job or just for individual Stages. Within each Stage, we look at the Output Column tab in order to decide whether or not we want to use RCP.

Here’s an example of where we set up and enable RCP just within a Stage (highlight on checkbox of Enabling RCP at Stage Level slide). When you see this checkmark, you know that columns coming out of this Stage will be RCP-enabled.

When RCP is Disabled
Let’s say that we had four columns coming in to this Stage (shown above as the input link on the left) and four columns going out (on the right). Then, if we only drew derivations to two of the columns (the upper two on the right), the two lower columns would remain in a RED state (ineligible derivations) and this Job will not compile. However, let’s see what we can do with RCP.

When RCP is enabled we can see that the output link’s columns (lower right) do not have anything in the red state. By name matching, DataStage will assign the two columns on the input link (on the left) to two columns going out on the output link (lower right) automatically. Or, we could leave these two columns completely off (the ones in the lower right of the previous graphic) and they would be carried in an invisible fashion through subsequent Stages, thereby allowing you to run many Jobs that have row-types that have the upper two columns explicitly but also have the lower two (or more) columns implicitly. In other words, we can run this Job one time with a record that has four columns, another time with eight columns, and another time with six columns. The only thing that they all have to share is that the very first two columns (highlight on top two columns) must be the same and must be explicit.
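The idea of explicit columns plus columns carried through by name can be modelled in Python. This is a conceptual sketch only: `transform_with_rcp` and the column names are hypothetical, and with `rcp=False` the sketch simply omits the unmapped columns, whereas the text explains that a real Job with unmapped target columns would not compile.

```python
def transform_with_rcp(row, derivations, rcp=True):
    """Apply explicit derivations; with rcp, carry all other columns through by name."""
    out = {col: fn(row) for col, fn in derivations.items()}
    if rcp:
        for col, value in row.items():
            out.setdefault(col, value)  # propagate columns that were never mapped
    return out


# Only the first two columns are explicit in the design.
derivations = {
    "id": lambda r: r["id"],
    "name": lambda r: r["name"].upper(),
}
four_cols = {"id": 1, "name": "a", "x": 10, "y": 20}
six_cols = {"id": 2, "name": "b", "x": 1, "y": 2, "z": 3, "w": 4}
```

The same `derivations` handle both the four-column and the six-column record, which is the Job flexibility that RCP is meant to provide.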

businessintelli. so let’s review and continue our discussion about it. when it is turned on. Target columns in the Stage do not need to have any columns explicitly mapped to them. no column mapping is enforced at designtime when RCP is turned on. Let’s look at an example of RCP so that we can better understand it. Therefore. and other things that allow you greater flexibility and help create greater reusability within your DataStage Jobs. RCP is how implicit columns get into a Job.Information Server 8. Business Intelli Solutions Inc www.com . The main benefits of RCP are Job flexibility so that we can process input with different or varying layouts. Even people who have been in the field for years can have difficulty grasping both the power and the difficulties of understanding RCP better. we can work with a Sequential File Stage or the Modify Stage to take unknown data and give it an explicit metadata tag. Using this schema. a schema file is another way to define our metadata much like a table definition.1 Introduction Runtime Column Propagation (RCP) In Detail Because of the importance of concepts including Runtime Column Propagation (RCP). Understanding RCP often takes several years in the field to understand. As you’ll recall. the use of Shared Containers (the chunks of DataStage code that can be re-used). it is worth talking about them in greater detail. will allow columns of data to flow through a Stage without being explicitly defined in the Stage. It also works very well within DataStage’s Shared Containers (the user-defined chunks of DataStage ETL applications). the use of schemas. We can also define it by explicitly defining it as an output column in an earlier Stage in our data flow (Using a previous Stage in the Job design). RCP also allows us to create re-usability. We can define their metadata by using a schema file in conjunction with a Sequential File Stage. RCP. 
We can also pull in our metadata implicitly when we read it from a database using our “select” statement. Input columns are mapped to unmapped columns by name.

The graphic above summarizes the main points of RCP, but by revisiting the example that we looked at a few moments ago, we can gain a deeper understanding of it. So let’s review our example in which RCP is NOT used.

Let’s review how it functions without RCP. We’re looking within a Transformer Stage at a typical example of columns when RCP has not been enabled. We can see four columns on the input link to the left. To the lower right, we can see four outgoing columns (on the output link). But the two lower columns, SPEC_HANDLING_CODE and DISTR_CHANNEL_DESC, aren’t connected to the input columns and don’t have any derivations. These two columns are therefore left in a red color, indicating an incomplete derivation, and this Job won’t compile.

When RCP is Enabled
Now, with the same Transformer in the exact same situation but with RCP enabled, the bottom two columns are no longer in a red state and therefore this Job will compile. When this Job runs, the two columns on the bottom left will be assigned by name to the two columns on the bottom right. If we had additional explicitly named input link columns beyond the four highlighted to the left, they would be carried invisibly along to the next Stage, where they could be either used or not used as necessary. This means that no individual column-checking is required at design time.

This has mixed benefits. It gives us a lot of flexibility but can create headaches at runtime if you are not aware of the data that is passing through your Job, and it can negatively impact downstream activities. Thus you must be very careful when using RCP.

The important things to know when using RCP are what data is coming in, what kind of metadata needs to get processed there, and what data is going out of the Stage. This applies both to the number of columns and to the metadata type. If they do not match, then what you are bringing in will need to be modified at some point in the processing to work with the final target metadata of the Job. For instance, if you bring in a column of numeric data that needs to go out as character data, you’ll need to perform some type of conversion function on it. If a column comes in with one length but needs to go out with a longer length, you’ll need to adjust it within either a Transformer Stage or a Modify Stage. Likewise, if you have a longer column coming in, you’ll need to make sure that it is truncated properly to fit into the outgoing column space so that it doesn’t overflow and create a runtime failure.

Business Use Case Scenario
Let’s talk about a good use of RCP in business terms. Let’s say that you have a large conglomerate company and five companies underneath it in a corporate hierarchy. The conglomerate wants to take data from all five of these companies’ operational systems and populate a data warehouse with Customer Information. Each of the sub-systems has a different layout for its Customer data, but they all have one thing in common: each has its own Customer ID field. However, these Customer ID values may overlap: the value ‘123’ at Company A may represent a totally different Customer than the value ‘123’ at Company B.

The conglomerate has a single Customer ID that can track each of them distinctly. But the sub-companies will not know what their corporate conglomerate Customer ID is until it comes time to pull all of this data into the warehouse, because their systems weren’t built knowing about any such ID. The conglomerate does know about it. It can pull the customer data from all of the sub-companies, add another column to say ‘from which company it came’, and then process the data through a DataStage Job that relates the sub-system’s ID to the corporate ID via a cross-reference stored in the corporate data warehouse. The Job then writes the data out with the new number attached, so that it can be processed subsequently knowing to which corporate Customer ID each sub-system’s row belongs.

So, if we were to process Customer records from the five companies separately, one company might have 10 columns in its Customer table while another has 20. Only two columns would be needed: the source system ID and the source system Customer ID. The Job could then use this information to do a Lookup against a reference from the conglomerate’s warehouse that would in turn find the corresponding corporate conglomerate Customer ID.

If we do this in a file-to-file example, our Job specifically identifies only 2 of the columns coming in and only 3 of the columns coming out of our Stage. We read the input file, process the data, and write it out. Even though we only see the 2 or 3 columns that we have explicitly identified, when we write our output file, all of the columns that came from the input file will be written in addition to the new conglomerate Customer ID that we looked up. This is the case with RCP on. We therefore need only one Job instead of 5 separate Jobs, one for each of the specific data layouts coming from the sub-companies.

Shared Containers
Shared Containers are encapsulated pieces of DataStage Job designs. They are components of Jobs that are stored in a separate container and can then be reused in various other Job designs. We can apply “stored Transformer business logic” to a variety of different situations. In the previous example we performed some logic to match Customer Information from the sub-companies with a corporate Customer ID. This logic could be encapsulated, stored, and then re-applied.

Creating a Shared Container
In the example above, the Developer has selected the Copy Stage and the Transformer Stage to become a Shared Container. Clicking (E)dit on the menu bar, then the (C)onstruct Container option, and then (S)hared completes the task. This same Job would then have a slightly different look. Let’s see what it would look like.

Using a Shared Container in a Job
Above, you can see the Shared Container icon where the selected Stages used to be. This same Shared Container can now be used in many other Jobs. Shared Containers can use RCP although they don’t have to do so. The combination of Shared Container and RCP is a very powerful tool for re-use, so let’s examine the concept of combining the two.

If we create a Shared Container in conjunction with RCP, then the metadata inside the Shared Container only needs to be what is necessary to carry out the functions that occur in that Shared Container. Recall for a moment the example of the conglomerate and the five sub-companies. Suppose that the Shared Container is where we did our Lookup to find our corporate conglomerate number. All we really needed to do was to pass into the Shared Container the 2 columns we explicitly knew. Any other columns (regardless of how many there are, since different sub-companies have different numbers of columns) are passed in as well. On the way out of the Transformer (which adds the conglomerate Customer ID), all the columns are passed out together with the newly added corporate identifier that we looked up in our Shared Container.
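As a rough analogy, a Shared Container used with RCP behaves like a function that names only the columns it needs and passes everything else through. The Python sketch below is an illustration under stated assumptions, not DataStage code: the column names `SOURCE_SYSTEM_ID`, `SOURCE_CUSTOMER_ID`, and `CORPORATE_CUSTOMER_ID`, the cross-reference contents, and the function name are all hypothetical.

```python
# Hypothetical cross-reference held in the corporate data warehouse:
# (source system ID, source Customer ID) -> corporate Customer ID.
CROSS_REFERENCE = {
    (1, "123"): "C-0001",
    (2, "123"): "C-0002",
}


def add_corporate_customer_id(row, cross_reference=CROSS_REFERENCE):
    """Reusable logic that names only two input columns.

    All other columns in the row are copied through untouched,
    and one new column is added on the way out.
    """
    key = (row["SOURCE_SYSTEM_ID"], row["SOURCE_CUSTOMER_ID"])
    enriched = dict(row)  # every incoming column is carried along
    enriched["CORPORATE_CUSTOMER_ID"] = cross_reference.get(key)
    return enriched


# Two sub-companies with different layouts use the same logic.
company_a_row = {"SOURCE_SYSTEM_ID": 1, "SOURCE_CUSTOMER_ID": "123",
                 "NAME": "Joe Smith", "CITY": "Springfield"}
company_b_row = {"SOURCE_SYSTEM_ID": 2, "SOURCE_CUSTOMER_ID": "123",
                 "NAME": "Emmanuel Jones Enterprises"}

print(add_corporate_customer_id(company_a_row))
print(add_corporate_customer_id(company_b_row))
```

The same function serves both layouts because it never lists the extra columns, which is the re-use benefit the text attributes to combining Shared Containers with RCP.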

For instance.businessintelli. Then.1 Introduction Mapping Input/Output Links to the Container When mapping the input and output links to the Shared Container. By using RCP we only have to specify the few columns that really matter to us while we are inside the Shared Container. we will know not only the Source CustomerID that appeared on the billing invoice and other information from that particular sub-company but also be able to tie it to our conglomerate database so that we can distinguish it from other customers from other sub-companies. Maybe even in Jobs that weren’t intended to process just the Customer data. When we come back out of the Shared Container we can re-identify any of the columns as necessary – depending on what we want to do with them Shared Containers Continued The main idea around Shared Containers is that they let us create re-usable code.com . we might want to process Order data where we still need to Lookup the conglomerate’s Customer ID just for this Order. you need to select the Shared Container link with which the input link will be mapped to.Information Server 8. as we process this Order into our data warehouse. Business Intelli Solutions Inc www. It can be used in a variety of Jobs. produce the appropriate output including the concatenated conglomerate Customer ID. and tag that into the Order. By simply passing in our whole Order record but only exposing the Customer number (the sub-system Customer identifier) and the source sub-system from which it came so that the Shared Container can do the Lookup. You will need to select the Container link to map the input link as well as the appropriate columns that you’ll need.

Again, Shared Containers do not necessarily have to be RCP-enabled, but the combination of the two is a very powerful re-usability tool.

Job Control
Now that we’ve created the DataStage environment, have talked about how to pull in metadata, and have seen how to build Jobs, the next topic answers the question “How do we control all of these Jobs and process them in an orderly flow without having to do each one individually?” The main element that we use for this is called a Job Sequence. A Job Sequence is a special type of Job. It is a master controlling Job that controls the execution of a set of subordinate Jobs. The subordinate Jobs might themselves be of the Sequencer type, in which case they are sub-Sequencers, or they can be ordinary Jobs, whether Parallel or Server Jobs.

One of the things that a Job Sequencer does is to pass values to the subordinate Jobs’ parameters. It also controls the order of execution by using Links. These Links are different from our Job Links, though, in that there is no metadata being passed on them. The Sequence Link tells us what should be executed next and when it should be executed. The Sequence specifies the conditions under which the subordinate Jobs get executed using what are known as “Triggers”. We can specify a complex flow of control using Loops and the All or Some options, and we can also do things such as ‘Wait for File’ before our Job starts. For example, if we were waiting for a file to be FTP’ed to our system, the Sequence can sit in a waiting mode until that file comes. When it does arrive, the rest of the activities within that Sequencer will be kicked off.

The Job Sequencers will allow us to do system activities. For example, you can use the e-mail option that ties into your native system and sends an e-mail. They also allow us to execute system commands and executables (that we might normally execute from the Command Line). Any Command Line options can be executed from within the Sequencer. For example, you could send a message to your Tivoli Work Manager by writing a Command Line option that writes a message to the Tivoli Command Center so that operators can see what has happened.

The Sequencer can include restart checkpoints. Should there be some failure within your Sequence, you don’t have to go back and re-run all the Jobs. You can skip over all the ones that worked, then pick up and re-run the one that failed.

When you create a Job Sequencer, you open a new Job Sequence and specify whether or not it’s re-startable. Then, on the Job Sequencer’s Canvas, you add Stages that will execute the Jobs, Stages that will execute system commands and other executables, and/or special purpose Stages that will perform functions such as looping or setting up local variables. Next, you add Links between Stages. This specifies the order in which the Jobs are executed. The order of sub-sequences, Jobs, system commands, looping, and e-mails is determined by the flow that you create on the Canvas.

You can specify triggers on the Links. For instance, you might create a condition so that when the flow is coming out of Activity A, one Link goes to Activity B if the Activity A Job finished successfully, but a different Link goes to Activity C if the Job ‘errors-out’. The ‘error-out’ Link might then go to an e-mail that sends an error message to someone’s console and stops the Sequencer at that point. Job Sequences also allow us to specify particular error handling, such as the global ability to trap errors, and you can enable and disable re-start checkpoints within the Sequencer. Now let's see some of the Stages that can be used within the Sequencer.

Here is a graphic of some of the Stages that can be used within the Sequencer. These include:
• the EndLoop Activity Stage
• the Exception Handler Stage
• the Execute Command Stage (runs a native OS command)
• the Job Activity Stage (used to execute a DataStage Job)
• the Nested Condition Stage
• the Notification Activity Stage (typically tied into DataStage’s email function)
• the Routine Activity Stage (used to call DataStage Server routines)
• the Sequencer Stage (which gives us a rendezvous point to bring together and coordinate multiple Links once we have started many Jobs in parallel)
• the StartLoop Activity Stage
• the Terminator Activity Stage (a forced stop)
• the UserVariables Activity Stage (used to create local variables)
• the Wait For File Activity Stage

Sequencer Job Example
Here we can see an example of a Sequencer Job. At the top left, the first Stage of this Job Sequence has nothing coming into it, so it will start executing immediately. Normally the Stage at the bottom would as well. However, the bottom Stage is an Exception Handler Stage and will only kick off if there is an exception during our runtime. In a different example, if there were many Stages independent of one another, they would normally all start at the same time within a typical Job Sequence.

In the top flow of the Job Sequence above, the first Stage waits for a file and then kicks off Job 1. If Job 1 finishes successfully, the flow follows the green Link to Job 2, and so on. The green color of the Link indicates that there is a trigger on the Link (this trigger basically says ‘if the Job was successful, go to Job 2’). If Job 3 completes successfully, then the flow goes on to execute a command. The red Links indicate error activity. Should any of Jobs 1, 2, or 3 fail, the flow follows the red Link to a Sequencer Stage. A Sequencer Stage can wait for all or any of its incoming Links. In this case, if any one of the Links is followed to it, it sends a notification warning by email that there has been a Job failure.
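The flow in this example (wait for a file, run Jobs 1 to 3 on success triggers, execute a command at the end, and send a notification on any failure) can be sketched in Python. This is a simplified model for illustration only: the function `run_sequence`, the helper `make_job`, and the returned status strings are hypothetical, and a real Wait For File activity blocks until the file arrives rather than returning at once.

```python
def run_sequence(file_has_arrived, jobs, run_command, notify):
    """Model of the top flow: each Job runs only if the previous one
    succeeded (green Link); any failure follows the red Link to a
    notification and stops the flow."""
    if not file_has_arrived():
        return "waiting"
    for name, run_job in jobs:
        if not run_job():
            notify(f"Job failure in {name}")
            return "failed"
    run_command()
    return "finished"


log = []


def make_job(name, succeeds=True):
    """Build a stand-in for a Job Activity that records that it ran."""
    def run_job():
        log.append(name)
        return succeeds
    return run_job


status = run_sequence(
    file_has_arrived=lambda: True,
    jobs=[
        ("Job 1", make_job("Job 1")),
        ("Job 2", make_job("Job 2", succeeds=False)),
        ("Job 3", make_job("Job 3")),
    ],
    run_command=lambda: log.append("command"),
    notify=log.append,
)
print(status, log)
```

In this run Job 2 is made to fail, so Job 3 and the final command are never reached and the only extra entry in the log is the failure notification.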

Exception Handler Stage
There is one final note about our Sequencer Job: it is not a scheduler unto itself. It just sequences the activities. There must be some external scheduler, whether that is the DataStage scheduler or a third-party enterprise scheduler, to start up this whole Sequencer Job. Only from there will it call all the other Jobs for you. Think of it as an entry point into your application from whatever external source you intend to use. That source might be someone kicking it off from a Command Line, or something like a Tivoli Work Manager that executes the Job on a daily or monthly schedule, or on a conditional schedule in which some other activity must occur first and then, in turn, calls this Job.

Pulling Jobs Together
• Parameter Sets
• Running Jobs from the Command Line
• Other Director Functions

One of the areas that will help us to pull our Jobs together more effectively is the use of Parameter Sets. We will also need to learn to execute Jobs from the Command Line, so we will talk a little about how the Command Line can be used in this manner and how it helps us deploy our applications effectively and efficiently.

Parameter Sets
Let's start by talking about Parameter Sets. Parameters are needed for various functions within our Jobs and are extremely useful in providing Jobs with flexibility and versatility. A Parameter Set allows us to store a number of parameters in a particular named object. One or more value files can then be named and specified for it. A value file stores a value for each parameter within the Parameter Set. These values can then be picked up at runtime.

Parameter Sets can be added to the Job’s parameter list, which is on the Parameter tab within the Job Properties. This makes it very convenient for us. As we develop our Jobs, we don’t have to enter a long list of parameters for every single Job, which would risk mis-keying or omissions that prevent our parameters from passing successfully from the Sequencer Job into our lower level Jobs.

An example of a Parameter Set might include all the settings that belong to a certain environment. We might set up a Parameter Set entitled “environment” and have it contain parameters such as the “Host Name”, the “Host Type”, the “Database Name”, the “Database Type”, the “Usernames”, and the “Passwords” (not to be confused with the DataStage Connection Objects). We could then have one set of values in a value file used specifically for development, another set for testing, and another set for production. We could then move our Job from one environment to the next and simply select the appropriate values for that environment from the Parameter Set.
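The relationship between a Parameter Set and its value files can be modelled with a short Python sketch. It is an illustration only: the parameter names, the host and database values, the `<placeholder>` passwords, and the function `resolve_parameters` are all hypothetical, and actual value files are managed through the DataStage clients rather than a Python dictionary.

```python
# One named Parameter Set ("environment") with a fixed list of parameters.
ENVIRONMENT_PARAMETERS = (
    "HostName", "HostType", "DatabaseName", "DatabaseType",
    "Username", "Password",
)

# One value file per environment; every value here is made up.
VALUE_FILES = {
    "development": {
        "HostName": "dev-host", "HostType": "unix",
        "DatabaseName": "CUSTDEV", "DatabaseType": "db2",
        "Username": "dev_user", "Password": "<placeholder>",
    },
    "production": {
        "HostName": "prod-host", "HostType": "unix",
        "DatabaseName": "CUSTPROD", "DatabaseType": "db2",
        "Username": "prod_user", "Password": "<placeholder>",
    },
}


def resolve_parameters(value_file_name):
    """Pick up the values for one environment at runtime and check
    that the value file covers every parameter in the set."""
    values = VALUE_FILES[value_file_name]
    missing = [name for name in ENVIRONMENT_PARAMETERS if name not in values]
    if missing:
        raise KeyError(f"value file {value_file_name!r} is missing {missing}")
    return {f"environment.{name}": values[name] for name in ENVIRONMENT_PARAMETERS}


print(resolve_parameters("development")["environment.HostName"])
print(resolve_parameters("production")["environment.HostName"])
```

Moving the Job from development to production changes only the value file name that is selected, not the Job or its parameter list.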

Running Jobs from the Command Line
Now that we’ve learned the basics of how to sequence our Jobs and group our parameters, we have an application. Let’s say that we’ve built, developed, and tested a Job and are now ready to put it into production. The next question is “How do we execute it?” Although it can be executed by hand using the DataStage Director client, this is not the typical way that applications are run. Instead, there is a series of Command Line options that allow us to start the Job from the native Command Line, either by typing in the command or by having our scheduler call the Job. The graphic above shows some of the particulars.

The InfoSphere suite provides a utility for DataStage called dsjob. dsjob is a multifunctional Command Line interface to the DataStage Application Programming Interface (API) that allows us to perform a number of functions. One of the most common functions is to run a Job. The top command in the graphic above will run a Job. In it, we are passing parameters that include the number of rows that we want to run, the name of the Project that we want to run from, and the name of the Job. dsjob also has other functions, such as giving us information about a Job’s run status, a summary of its messages, and even Link information. The example above (in the second bullet point) displays a summary of all Job messages in the log.

You can use the -logsum or the -logdetail option of the dsjob command to write log output to a file, which can then either be archived or used by another DataStage Job and loaded into a database with all your log information (any database; it doesn’t have to be part of the DataStage repository). As mentioned earlier, Xmeta is the repository for all of DataStage. With DataStage version 8.1, you are able to store the logged information directly in the Xmeta repository or keep it locally in the DataStage engine. You may choose the latter for performance reasons, so that you don’t tie up your Xmeta database with a lot of logging activity while the Job is running, which could affect your network should your Xmeta be remotely located from your DataStage server. This option lets you write all the log information into the local DataStage engine’s repository and then, only at the end, pull that data out and process it into your master repository (whether that be Xmeta or some other relational data source) so that you can keep it for long term usage. Because this happens after the application has finished, it does not negatively impact the performance of the application while it is running.

By executing the dsjob command on your Command Line with various options after it, you’ll be able to accomplish a number of automated functions using a whole host of other tools (since the Command Line is the most common method for launching other activities). dsjob’s functions are documented in the “Parallel Job Advanced Developer’s Guide”. Typically, when we use the dsjob command it will be encapsulated inside some kind of shell script (whether on a Windows or UNIX type of system or whatever system DataStage is running on), and we would want to make sure that all the options are correct before issuing the command.

Other Director Functions
We mentioned that we can schedule Jobs using the DataStage Director, but more often than not, Jobs are scheduled by an external scheduler. To do this, we must run Jobs from the Command Line, as most schedulers use Command Line execution.
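Since dsjob is usually wrapped in a script, one cautious approach is to assemble and inspect the command before issuing it. The Python sketch below only builds the argument lists and does not run anything. The project name `ACC_PROJECT`, the Job name `AcmeFeedJob`, the parameter name `SourceSystemID`, and both helper functions are hypothetical. The `-run`, `-rows`, `-param`, and `-logsum` options and the project-then-job argument order reflect the commonly documented dsjob syntax, but they should be checked against the Parallel Job Advanced Developer’s Guide for your release.

```python
def build_run_command(project, job, rows=None, params=None):
    """Assemble a 'dsjob -run' command as an argument list."""
    command = ["dsjob", "-run"]
    if rows is not None:
        # Limit the number of rows processed, as in the graphic's example.
        command += ["-rows", str(rows)]
    for name, value in (params or {}).items():
        command += ["-param", f"{name}={value}"]
    command += [project, job]
    return command


def build_log_summary_command(project, job):
    """Assemble a 'dsjob -logsum' command that summarizes the Job log."""
    return ["dsjob", "-logsum", project, job]


run_cmd = build_run_command(
    "ACC_PROJECT", "AcmeFeedJob", rows=100, params={"SourceSystemID": 1}
)
log_cmd = build_log_summary_command("ACC_PROJECT", "AcmeFeedJob")

# Printing the commands lets us verify every option before issuing them.
print(" ".join(run_cmd))
print(" ".join(log_cmd))
```

On a machine where dsjob is installed, a list built this way could be handed to a scheduler wrapper or to Python’s `subprocess.run`, with the log summary redirected to a file for archiving.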

Product Simulation
Welcome to the Guided Tour Product Simulation. Earlier in the tutorial, we discussed many of the important concepts involved in DataStage. In this interactive section, without having to perform any installations, you will ‘use the product’ to create DataStage Jobs, navigate around the UI, and ‘learn by doing’.

We will first introduce you to a case-study of a business use-case scenario in which you will learn about a business entitled Amalgamated Conglomerate Corporation. Amalgamated Conglomerate Corporation (ACC) has hired you as their DataStage Developer! ACC is a large holding company that owns many subsidiary companies. In our scenario, we’re going to talk about what happens to ACC over a period of a year as they acquire new companies.

In this section you will do the following:
1. Understand the Business Problem and Case Scenario
2. Assess the Existing Architecture
3. Look at a Feed Job
4. Create a Feed Job
5. Modify the Consolidation Job
6. Create a Sequence Job

Let’s look at ACC’s corporate structure as it looks at the beginning of the year, on January 1st. As you can see in the graphic above, ACC has four subsidiary companies: Acme Universal Manufacturing (Company A), Big Box Stores (Company B), Cracklin’ Communications (Company C), and Eco Research and Design (Company E). All of these belong to ACC, and information from all of them will need to be pulled together from time to time.

Now, some time has gone by and, on March 3rd, the state of ACC has changed: your employer has acquired a new company called Disks of the World (Company D). Disks of the World will now need to be able to feed its information into ACC’s Data Warehouse. Let’s go back to January 1 and see what the feeds looked like before Disks of the World (Company D) came on board.

Each of the individual subsidiaries (Companies A, B, C, and E) feeds its data into a common customer Data Mart, but this is done in two steps. Let’s quickly talk about how this is done from more of a technical standpoint. Above, we can see four particular feeds, one for each company and that particular company's customers. The feeds are all turned into files. Each FEED FILE is generated by an individual DataStage DS FEED JOB. These files are then later pulled together by a consolidation job (a DataStage Job) and run into the COMMON CUSTOMER DATA MART.

The reason why we have several different feeds running to files is that each of these subsidiary companies puts out its data at a different frequency. Acme (Company A) puts out its data once per day. Big Box (Company B) has such a large feed that it puts it out once an hour. Cracklin’ Communications (Company C) has a large feed, but not quite as large as Company B’s, so it puts it out four times per day. Eco Research (Company E) has such a small customer base that it only produces its feed one time per week. ACC’s business requirement, though, is for us to have a daily consolidation of these feeds into the Common Customer Data Mart, and to put all the feed files into the Data Mart at one time, in “one fell swoop”. The consolidation will therefore pull in many files from Companies B and C, probably only one from Company A, and, once per week, there will also be one in the heap from Company E.

Now let’s go ahead in time to March 3rd. You’ll recall that ACC has now acquired another company called Disks of the World. The DataStage Developers for ACC (that’s you) will now need to add a new DataStage feed for the customer data coming from Company D (Disks of the World) into the Common Customer Data Mart. You will build a DataStage feed that accommodates their need to produce their data multiple times per day during their sales cycle. This feed will be landed to the same set of files (the Feed files at the center of the graphic) that are then picked up by another DataStage Job entitled Consolidation DS Job (which combines them into the mart).

Finally, you will need to build a special type of Job called a Sequencer Job. This type of Job is used to control the running of other DataStage Jobs. In other words, we’ll see how all of the Jobs (all DS FEED JOBs and the CONSOLIDATION JOB) can be brought together by using a Sequencer Job. This Sequencer will call all the Jobs in the correct order. This way, we don’t have to schedule them individually in our corporate-wide scheduling software. We will just schedule the one Sequencer.

In sum, as the DataStage Developer for ACC, you will need to do the following:

1. View the existing Acme DS FEED JOB to see how the previous DataStage Developer at ACC did it.
2. Then, create a new DataStage Job, shown highlighted above (Company D’s DS FEED JOB).
3. Then, modify ACC’s DS CONSOLIDATION JOB to bring this new, extra feed (from Company D) into the data mart.
4. Finally, later in this Guided Tour Product Simulation, you will also create a Job that controls all of these Jobs!

ACC’s Business Requirements

5. There is a technical problem that has been identified and reported to ACC in the past. You will need to understand the issue so that, as you develop a new DS FEED JOB, you will help solve this problem: ACC will need to recognize any particular customer as having come from a particular subsidiary.

6. Since each of the subsidiary companies developed their own systems independently (of ACC), the numbering systems that they use for their customers are different and may well inappropriately overlap.

7. For example, Company A has a Customer 123 that represents Joe Smith but, over at Company B, there is a Customer 123 that represents Emmanuel Jones Enterprises. Even though these two share the same number, they are not the same customer.

8. If we put them into the Common Customer Data Mart ‘as is’, then we would have ambiguity and not know which customer we were dealing with.

9. Anytime that ACC runs one of these Jobs, the Feed Jobs designate from which source system (Company) they are coming. For example, “Source system 1” or “coming from Company A” for Acme, “Source system 2” or “coming from Company B” for Big Box Stores, and so forth. In other words, each Feed Job has designated its subsidiary company uniquely. This way, when it comes time to load the Common Customer Data Mart (with the Consolidation DS Job), we won’t have ambiguity in our data. We’ll be able to load it correctly and distinctly. Let’s talk about how the Consolidation DataStage Job would do this.

10. The Common Customer Data Mart will have its own numbering system for these customers, and it will differentiate by using the Customer ID that comes from the company (Joe Smith = Customer 123), along with an additional identifier that indicates which company the data came from (Company A is known to the Data Mart as Source System ID = 1, for instance). These two fields, when combined, will create a new field that is an alternate key and is now a unique identifier in the Common Customer Data Mart for Joe. This number will now be able to distinguish any customer from any subsidiary.

11. One of the things that all of these DS Feed Jobs have in common is that the DataStage Jobs will need to find a common customer ID for the corporation. Our DataStage “Feed” Jobs will need to be able to find out if their customer is already known to the Common Customer Data Mart. If the customer is already known to the Mart, then the unique ID must be retrieved from the Common Customer Data Mart and added to the data. For example, if Joe’s address is updated, Customer_123SourceSysID_1 is added to the ADDRESS update for Joe Smith.

12. However, if the customer is not known, then we will hand it off to ACC’s Data Mart team to add it in a separate process (a ‘Black Box’ process Job that we will not see in this tutorial), which will then insert the new customer into the Data Mart for the first time.

13. When the data is processed, the source system number will be sent into the Job (as a parameter). Then, the Job must perform a lookup. For this we will use both the source system number or ID and the Customer ID from the source system to do a lookup into the table to see if this compound key already exists. If the customer is new, then it will be sent to another process for insertion into the Customer Data Mart. This Insertion Job is not discussed in this tutorial and is considered a “Black Box” process used by Amalgamated Conglomerate Corporation to add new customers.

14. In other words, the Job uses those two columns to find the true key for the Common Customer Data Mart, which should then be returned into the Job so that that column (the true key) can be carried along with all the other native source columns that are coming in the row as it passes through the Job. This is called a “reverse lookup”.

15. The Common Customer ID is a unique key within the Common Customer Data Mart. This key has no meaning to each individual source system. So the only way that the source system can find this unique key is through this feed process. The feed process enriches the data by looking at the Common Customer Data Mart and finding this Common Customer ID to help differentiate all of our different customers across all of our subsidiary companies.

16. Additionally, we need to code the Job so that it will convert all of the source system-specific metadata from the source system’s metadata layout into the target metadata layout. Coming into the Job, the metadata appears as it would in the source system (Acme’s layout of the customer) but, on the way out, we should see the same type of data but captured in columns and fields which conform to the Common Customer Data Mart’s metadata structure.

17. In summary, the unique Customer ID in the Common Customer Data Mart has compounded the Source System Identifier with a source-specific Customer ID into a field that, for example, might contain a value like “9788342”. And this field is unique so that no customer is overlapped with another.
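The reverse lookup and the metadata conversion described above can be pictured with a small Python sketch. This is only an illustration of the logic, not DataStage code and not how the ACC Jobs are actually built: the in-memory lookup table, the Acme column names (cust_no, cust_name, addr), the target column names, and the second key value 9788343 are all assumptions made up for the example. Only the sample key 9788342 and the Customer 123 / Source System 1 pairing come from the text.

```python
from typing import Optional

# Hypothetical stand-in for the Data Mart lookup table:
# (source system ID, source customer ID) -> Common Customer ID.
COMMON_CUSTOMER_LOOKUP = {
    (1, 123): 9788342,  # Joe Smith at Company A (Acme)
    (2, 123): 9788343,  # Emmanuel Jones Enterprises at Company B (made-up key)
}

# Hypothetical mapping from Acme's column layout to the Data Mart layout.
ACME_TO_MART_COLUMNS = {
    "cust_no": "SOURCE_CUSTOMER_ID",
    "cust_name": "CUSTOMER_NAME",
    "addr": "ADDRESS",
}


def reverse_lookup(source_system_id: int, source_customer_id: int) -> Optional[int]:
    """Return the Common Customer ID for the compound key, or None if unknown."""
    return COMMON_CUSTOMER_LOOKUP.get((source_system_id, source_customer_id))


def process_feed(rows, source_system_id):
    """Convert rows to the target layout and split them into known and new customers.

    source_system_id plays the role of the Job parameter described in the text.
    """
    known, new = [], []
    for row in rows:
        # Rename the source columns into the target metadata layout.
        out = {ACME_TO_MART_COLUMNS[name]: value for name, value in row.items()}
        out["SOURCE_SYSTEM_ID"] = source_system_id
        common_id = reverse_lookup(source_system_id, row["cust_no"])
        if common_id is None:
            # Unknown customer: hand off to the separate "Black Box" insertion process.
            new.append(out)
        else:
            # Known customer: carry the true key along with the other columns.
            out["COMMON_CUSTOMER_ID"] = common_id
            known.append(out)
    return known, new


if __name__ == "__main__":
    acme_rows = [
        {"cust_no": 123, "cust_name": "Joe Smith", "addr": "1 Main St"},
        {"cust_no": 999, "cust_name": "New Customer", "addr": "2 Side St"},
    ]
    known_rows, new_rows = process_feed(acme_rows, source_system_id=1)
    print("known:", known_rows)
    print("new:", new_rows)
```

Because the lookup key is the pair of source system ID and source customer ID, Customer 123 from Company A and Customer 123 from Company B resolve to different Common Customer IDs, which is the ambiguity the business requirements set out to remove.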

Look At Acme DS Feed Job

Now that we know the business requirements, the architecture, and the development strategy, let’s look at the DS Feed Job for the Acme subsidiary (Company A) that was built previously by another DataStage Developer at ACC (highlighted above). In this Job, we will see how it pulls the data from Acme’s operational database and puts that data into a file after it has done a lookup to find the appropriate Common Customer ID from the Data Mart. After we’ve seen it and how it works, we’ll look at the Consolidation Job. There, we will see how all of the individual Feed files are brought together prior to being loaded into the Common Customer Data Mart. Then, we will go to a section of the tutorial, after ACC makes another acquisition in March, in which you will have to build a similar DS Feed Job for Disks of the World (Company D).

In order to look at the DataStage Feed Job for Acme Universal Manufacturing, we’ll need to bring up the DataStage Designer Client. We will need the highlighted client to look at, create, and modify DataStage Jobs.
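As a rough picture of what “bringing the individual Feed files together” means, here is a minimal Python sketch. It is not the Consolidation Job itself: it assumes, purely for illustration, that each Feed Job writes a comma-separated file whose header matches a hypothetical target layout (MART_COLUMNS), and it simply checks that layout and appends the rows.

```python
import csv
import io

# Hypothetical target layout shared by every feed file.
MART_COLUMNS = [
    "COMMON_CUSTOMER_ID",
    "SOURCE_SYSTEM_ID",
    "SOURCE_CUSTOMER_ID",
    "CUSTOMER_NAME",
]


def consolidate(feed_files):
    """Combine rows from several open feed files into one list.

    Raises ValueError if a feed does not use the expected column layout.
    """
    combined = []
    for feed in feed_files:
        reader = csv.DictReader(feed)
        if reader.fieldnames != MART_COLUMNS:
            raise ValueError(f"unexpected feed layout: {reader.fieldnames}")
        combined.extend(reader)
    return combined


if __name__ == "__main__":
    header = ",".join(MART_COLUMNS) + "\n"
    # In-memory stand-ins for the files written by two Feed Jobs.
    acme_feed = io.StringIO(header + "9788342,1,123,Joe Smith\n")
    big_box_feed = io.StringIO(header + "9788343,2,123,Emmanuel Jones Enterprises\n")
    for row in consolidate([acme_feed, big_box_feed]):
        print(row)
```

The layout check reflects the requirement that every feed arrives in the Common Customer Data Mart’s metadata structure, so a new feed such as Company D’s can be added to the list without changing the consolidation logic.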

Our Login screen comes up. Here, we’ll need to enter username/password information. At the top of the screen, you can see the domain in which we are working. At the bottom of the screen, we need to make sure that we are working with the correct DataStage Project; in this case it is the AmalCorp Project (Amalgamated Conglomerate Corporation Project). Then, with a password filled in for you (for purposes of this tutorial), enter a username of student and click the OK button. (The NEXT button below has been disabled for the remainder of the Guided Tour Product Simulation.)

Our Designer client is now connecting to the DataStage Server via authentication and authorization from the Server’s Security layer. A screen will then pop up within which we can work on our Job. First, we are asked if we want to create a Job and, if so, what type. For right now, since we will look at an existing Job, just click the Cancel button.

From here, please develop as per the spec……
