The Data Warehousing and Business Intelligence

DataStage is a tool set for designing, developing, and running applications that populate one or more tables in a data warehouse or data mart. It consists of client and server components.

Client Components
DataStage has four client components, which are installed on any PC running Windows 95, Windows 2000, or Windows NT 4.0 with Service Pack 4 or later:

• DataStage Designer. A design interface used to create DataStage applications (known as jobs). Each job specifies the data sources, the transforms required, and the destination of the data. Jobs are compiled to create executables that are scheduled by the Director and run by the Server.
• DataStage Director. A user interface used to validate, schedule, run, and monitor DataStage jobs.
• DataStage Manager. A user interface used to view and edit the contents of the Repository.
• DataStage Administrator. A user interface used to configure DataStage projects and users.

Server Components
There are three server components, which are installed on a server:

• Repository. A central store that contains all the information required to build a data mart or data warehouse.
• DataStage Server. Runs executable jobs that extract, transform, and load data into a data warehouse.
• DataStage Package Installer. A user interface used to install packaged DataStage jobs and plug-ins.

DataStage Features
DataStage has the following features to aid the design and processing required to build a data warehouse:

• Uses graphical design tools. With simple point-and-click techniques you can draw a scheme to represent your processing requirements.
• Extracts data from any number or type of database.
• Handles all the meta data definitions required to define your data warehouse. You can view and modify the table definitions at any point during the design of your application.
• Aggregates data. You can modify SQL SELECT statements used to extract data.
• Transforms data. DataStage has a set of predefined transforms and functions you can use to convert your data. You can easily extend the functionality by defining your own transforms to use.
• Loads the data warehouse.

You always enter DataStage through a DataStage project. When you start a DataStage client you are prompted to connect to a project. Each project contains:

• DataStage jobs.
• Built-in components. These are predefined components used in a job.
• User-defined components. These are customized components created using the DataStage Manager or DataStage Designer.

A complete project may contain several jobs and user-defined components. There is a special class of project called a protected project. Normally nothing can be added, deleted, or changed in a protected project; users can view objects in the project and perform tasks that affect the way a job runs rather than the job's design. Users with Production Manager status can import existing DataStage components into a protected project and manipulate projects in other ways.

A DataStage job populates one or more tables in the target database. There is no limit to the number of jobs you can create in a DataStage project. DataStage jobs are defined using the DataStage Designer, but you can view and edit some job properties using the DataStage Manager. The job design contains:

• Stages to represent the processing steps required
• Links between the stages to represent the flow of data

There are three basic types of DataStage job:

• Server jobs. These are compiled and run on the DataStage server. A server job will connect to databases on other machines as necessary, extract data, process it, then write the data to the target data warehouse.
• Parallel jobs. These are available only if you have Enterprise Edition installed. Parallel jobs are compiled and run on a DataStage UNIX server, and can be run in parallel on SMP, MPP, and cluster systems.
• Mainframe jobs. These are available only if you have Enterprise MVS Edition installed. A mainframe job is compiled and run on the mainframe. Data extracted by such jobs is then loaded into the data warehouse.

There are two other entities that are similar to jobs in the way they appear in the DataStage Designer, and are handled by it. These are:

• Shared containers. These are reusable job elements. They typically comprise a number of stages and links. Copies of shared containers can be used in any number of server jobs and edited as required.
• Job sequences. A job sequence allows you to specify a sequence of DataStage jobs to be executed, and actions to take depending on results.

Server and mainframe jobs consist of individual stages. Each stage describes a data source, a particular process, or a data mart. For example, one stage may extract data from a data source, while another transforms it. Stages are added to a job and linked together using the Designer. There are three types of stage:

• Built-in stages. Supplied with DataStage and used for extracting, aggregating, transforming, or writing data.
• Plug-in stages. Additional stages, separately installed, that perform specialized tasks which the built-in stages do not support.
• Job Sequence stages. Special built-in stages which allow you to define sequences of activities to run.

The links between the stages represent the flow of data into or out of a stage. The following diagram represents the simplest job you could have: a data source, a Transformer stage, and the final database.

   Data Source --> Transformation --> Target DW

You must specify the data you want at each stage, and how it is handled. For example, do you want all the columns in the source data, or only a select few? Should the data be aggregated or converted before being passed on to the next stage? You can use DataStage with MetaBrokers in order to exchange metadata with other data warehousing tools. You might, for example, import table definitions from a data modelling tool.

DataStage Manager
The DataStage Manager is a means of viewing and managing the contents of the Repository. You can use the DataStage Manager to:

• Import table or stored procedure definitions
• Create table or stored procedure definitions

It also lets you define and manage other Repository objects such as data elements, server job routines, mainframe routines, custom transforms, machine profiles, and plug-ins. Some of these tasks can also be performed from the DataStage Designer. There are also more specialized tasks that can only be performed from the DataStage Manager. These include:

• Performing usage analysis queries
• Reporting on Repository contents
• Importing, exporting, and packaging DataStage jobs

You can use the DataStage Manager to:

• Create items
• Rename items
• Select multiple items
• View or edit item properties
• Delete items
• Delete a category
• Copy items
• Move items between categories
• Create empty categories

DataStage Designer
The DataStage Designer is a graphical design tool used by developers to design and develop a DataStage job. The right mouse button accesses various shortcut menus in the DataStage Designer window. You can choose to create a new job as follows:

• Server job. These run on the DataStage Server, connecting to other data sources as necessary.
• Parallel job. These are compiled and run on the DataStage server in a similar way to server jobs, but support parallel processing on SMP, MPP, and cluster systems.
• Mainframe job. These are available only if you have installed Enterprise MVS Edition. Mainframe jobs are uploaded to a mainframe, where they are compiled and run.
• Server shared container. These are reusable job elements. Copies of shared containers can be used in any number of server jobs and edited as required. They can also be used in parallel jobs to make server job functionality available.
• Parallel shared container. These are reusable job elements. Copies of shared containers can be used in any number of parallel jobs and edited as required.
• Job sequence. A job sequence allows you to specify a sequence of DataStage server and parallel jobs to be executed, and actions to take depending on results.

Or you can choose to open an existing job of any of these types.

Specifying Designer Options
You can specify default display settings and the level of prompting used when the Designer is started. By default, DataStage initially starts with no jobs open; you can use the DataStage options to specify that the Designer always opens a new server, parallel, or mainframe job, a server or parallel shared container, or a job sequence when it starts. To specify the Designer options, choose Tools > Options. The Options dialog box appears. The dialog box has a tree in the left pane containing a number of branches, each giving access to pages of settings for individual areas of the DataStage Designer, as follows:

• Appearance branch: General options, Repository Tree options, Palette options, Graphical Performance Monitor options
• Default branch: General options, Mainframe options
• Expression Editor branch: Server and parallel options
• Job Sequencer branch: SMTP Defaults, Default Trigger Colors
• Meta data branch: General options
• Printing branch: General options
• Prompting branch: General options, Confirmation options
• Transformer branch: General options

Architecture Approach: Assessing Your Data
Before you design your application, you must assess your data. DataStage jobs can be quite complex, and so it is advisable to consider the following before starting a job:

• The number and type of data sources. You will need a stage for each data source you want to access. For each different type of data source you will need a different type of stage.
• The location of the data. Is your data on a networked disk or a tape? You may find that if your data is on a tape, you will need to arrange for a custom stage to extract the data.
• Whether you will need to extract data from a mainframe source. If this is the case, you will need Enterprise MVS Edition installed and you will use mainframe jobs that actually run on the mainframe.
• The content of the data. What columns are in your data? Can you import the table definitions, or will you need to define them manually? Are definitions of the data items consistent between data sources?
• The data warehouse. What do you want to store in the data warehouse and how do you want to store it? Not all the data may be eligible, so you must also determine the data you want to load into your data mart or data warehouse.

Salient Activities in a DataStage Build Initiative

Import Table Definitions - Table definitions are the key to your DataStage project and specify the data to be used at each stage of a DataStage job. Table definitions are stored in the Repository and are shared by all the jobs in a project. You need, as a minimum, table definitions for each data source and one for each data target in the data warehouse.

You can view, import, or create table definitions using the DataStage Manager or the DataStage Designer. The following properties are stored for each table definition:

• General information about the table or file that holds the data records
• Column definitions describing the individual columns in the record

CREATING TRANSFORMS - Transforms are used in the Transformer stage to convert your data to a format you want to use in the final data mart. Each transform is built from functions or routines used to convert the data from one type to another. You can use the built-in transforms supplied with DataStage; if the built-in transforms are not suitable, or you want a specific transform to act on a specific data element, you can create custom transforms. You can enter or view the definition of a transform in the Transform dialog box. To provide even greater flexibility, you can also define your own custom routines and functions from which to build custom transforms. There are three ways of doing this (a sketch of the first option follows this list):

• Entering the code within DataStage (using BASIC functions)
• Creating a reference to an externally cataloged routine
• Importing external ActiveX (OLE) functions
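As an illustration of the first option, here is a minimal, hedged sketch of the kind of BASIC code you might enter as a custom routine to build a transform on. The routine name (StandardizeName) and its single argument (Arg1) are hypothetical; the convention of returning the result in Ans applies to DataStage routine bodies.

   * Hypothetical transform routine "StandardizeName", one argument: Arg1.
   * Returns the argument trimmed of surplus blanks and folded to upper case;
   * a null input is passed through unchanged.
   If IsNull(Arg1) Then
      Ans = Arg1
   End Else
      Ans = UpCase(Trim(Arg1))
   End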

USING STORED PROCEDURES - If you are accessing data from, or writing data to, a database via an ODBC connection, you can use a stored procedure to define the data to use. A stored procedure can:

• Have associated parameters, which may be input or output
• Return a value (like a function call)
• Create a result set in the same way as an SQL SELECT statement

DataStage supports the use of stored procedures (with or without input arguments) and the creation of a result set, but does not support output arguments or return values. A stored procedure may have a return value or output parameters defined, but these are ignored at run time. The definition for a stored procedure (including the associated parameters and meta data) is stored in the Repository. You can import, create, or edit a stored procedure definition using the DataStage Manager or DataStage Designer, and these stored procedure definitions can be used when you edit an ODBC stage in your job design. When writing stored procedures for use with DataStage, you should follow these guidelines:

• Group parameter definitions for output parameters after the definitions for input parameters.
• When using raiserror to return a user error, set the severity so that the error is treated as informational.

Creating DataStage Jobs - A DataStage job populates one or more tables in the target database. There is no limit to the number of jobs you can create in a DataStage project. Jobs are designed and developed using the DataStage Designer. A job design contains:

• Stages to represent the data sources, data marts, and processing steps required
• Links between the stages to represent the flow of data

There are three different types of job within DataStage:

• Server jobs. These are available if you have installed DataStage Server. They run on the DataStage Server, connecting to other data sources as necessary.
• Parallel jobs. These are only available if you have installed Enterprise Edition. These run on DataStage servers that are SMP, MPP, or cluster systems.
• Mainframe jobs. These are only available if you have installed Enterprise MVS Edition. Mainframe jobs are uploaded to a mainframe, where they are compiled and run.

For general background information about each type of job, and for design tips, see the manuals provided in PDF format (or, optionally, in printed form). The manuals are:

• Server Job Developer's Guide
• Parallel Job Developer's Guide
• Mainframe Job Developer's Guide

Note that before you start to develop your job, you must:

1. First assess your data.
2. Then create your data warehouse.
3. Then define table or stored procedure definitions.
4. Optionally create and assign data elements.

The STAGE in a DataStage Job - A job consists of stages linked together which describe the flow of data from a data source to a final data warehouse. A stage usually has at least one data input and one data output; however, some stages can accept more than one data input, and output to more than one stage. Server jobs, parallel jobs, and mainframe jobs have different types of stages. The stages that are available in the DataStage Designer depend on the type of job that is currently open in the Designer; the stages that appear in the tool palette depend on whether you have a server, parallel, or mainframe job open, and on whether you have customized the tool palette. Stages are linked together using the tool palette. The on-line help tells you about the individual stage types that each type of job supports, and is designed to provide you with quick information when you are actually working on a job.

Plug-In Stage - You may find that the built-in stage types do not meet all your requirements for data extraction or transformation. In this case, you need to use a plug-in stage. The function and properties of a plug-in stage are determined by the plug-in used when the stage is inserted.

Plug-ins are written to perform specific tasks, for example, to bulk load data into a data mart. Two plug-ins are always installed with DataStage: BCPLoad and Orabulk. You can also choose to install a number of other plug-ins when you install DataStage.

LINK - A DataStage job flows as per the links created in the job. To link stages, do one of the following:

• Click the Link shortcut in the General group of the tool palette. Click the first stage and drag the link to the second stage. The link is made when you release the mouse button.
• Use the mouse to point at the first stage, right-click, then drag the link to the second stage and release it.
• Use the mouse to select the first stage. Position the mouse cursor on the edge of the stage until the cursor changes to a circle, then click and drag the mouse to the other stage. The link is made when you release the mouse button.

The Transformer stage allows you to specify the execution order of links coming into and going out from the stage. When looking at a job design in DataStage, you can check the link execution order by placing the mouse pointer over a link that is an input to or an output from a Transformer stage. A ToolTip appears displaying the message "Input execution order = n" for input links, and "Output execution order = n" for output links; in both cases n gives the link's place in the execution order. If an input link is number 1, then it is the primary link. Where a link is an output from one Transformer stage and an input to another Transformer stage, the output link information is shown when you rest the pointer over it.

JOB Properties - Each job in a project has properties, including optional descriptions and job parameters. You can view and edit the job properties from the DataStage Designer or the DataStage Manager:

• From the Designer, open the job in the DataStage Designer window and choose Edit > Job Properties.
• From the Manager, double-click a job in the DataStage Manager window display area, or select the job and choose File > Properties.

The Job Properties dialog box appears. The dialog box differs depending on whether it is a server job, a parallel job, a mainframe job, or a job sequence. A server job has up to six pages: General, Parameters, Job control, NLS, Performance, and Dependencies. Note that the NLS page is not available if you open the dialog box from the Manager, even if you have NLS installed.

Parallel jobs have up to eight pages: General, Parameters, Job Control, Dependencies, NLS, Generated OSH (shown only if the Generated OSH visible option has been selected in the Administrator client), Execution, and Defaults. A mainframe job has five pages: General, Parameters, Environment, Extensions, and Operational meta data. A job sequence has up to four pages: General, Parameters, Job Control, and Dependencies.

Code Reusability - A container is a group of stages and links. Containers enable you to simplify and modularize your server job designs by replacing complex areas of the diagram with a single container stage. DataStage provides two types of container:

• Local containers. These are created within a job and are only accessible by that job. A local container is edited in a tabbed page of the job's Diagram window. The main purpose of using a DataStage local container is to simplify a complex design visually to make it easier to understand in the Diagram window; if the DataStage job has lots of stages and links, it may be easier to create additional containers to describe a particular sequence of steps. Containers are linked to other stages or containers in the job by input and output stages.
• Shared containers. These are created separately and are stored in the Repository in the same way that jobs are. Shared containers also help you to simplify your design but, unlike local containers, they are reusable by other jobs. You can use shared containers to make common job components available throughout the project. You can also include server shared containers in parallel jobs as a way of incorporating server job functionality into a parallel stage (for example, you could use one to make a server plug-in stage available to a parallel job). There are two types of shared container: Server shared containers, used in server jobs (these can also be used in parallel jobs), and Parallel shared containers, used in parallel jobs.

Job Sequences - DataStage provides a graphical Job Sequencer which allows you to specify a sequence of server or parallel jobs to run. The sequence can also contain control information; for example, you can specify different courses of action to take depending on whether a job in the sequence succeeds or fails. Once you have defined a job sequence, it can be scheduled and run using the DataStage Director. It appears in the DataStage Repository and in the DataStage Director client as a job. The job sequence supports the following types of activity:

• Job. Specifies a DataStage server job.
• Routine. Specifies a routine. This can be any routine in the DataStage Repository (but not transforms).
• ExecCommand. Specifies an operating system command to execute.
• Email Notification. Specifies that an email notification should be sent at this point of the sequence (uses SMTP).
• Wait-for-file. Waits for a specified file to appear or disappear.
• Run-activity-on-exception. There can only be one of these in a job sequence. It is executed if a job in the sequence fails to run (other exceptions are handled by triggers).

TRIGGERS
The control flow in the sequence is dictated by how you interconnect activity icons with triggers. There are three types of trigger:

• Conditional. A conditional trigger fires the target activity if the source activity fulfills the specified condition. The condition is defined by an expression, and can be one of the following types: OK (activity succeeds), Failed (activity fails), Warnings (activity produced warnings), ReturnValue (a routine or command has returned a value), User status (allows you to define a custom status message to write to the log), or Custom (allows you to define a custom expression).
• Unconditional. An unconditional trigger fires the target activity once the source activity completes, regardless of what other triggers are fired from the same activity.
• Otherwise. An otherwise trigger is used as a default where a source activity has multiple output triggers, but none of the conditional ones have fired.

Different activities can output different types of trigger:

• Wait-for-file and ExecuteCommand activities: Unconditional; Otherwise; Conditional - OK; Conditional - Failed; Conditional - Custom; Conditional - ReturnValue
• Routine activities: Unconditional; Otherwise; Conditional - OK; Conditional - Failed; Conditional - Custom; Conditional - ReturnValue
• Job activities: Unconditional; Otherwise; Conditional - OK; Conditional - Failed; Conditional - Warnings; Conditional - Custom; Conditional - UserStatus
• Nested Condition activities: Unconditional; Otherwise; Conditional - Custom
• Email Notification and Run-activity-on-exception activities: Unconditional

CONTROL ENTITIES
The Job Sequencer provides additional control entities to help control execution in a job sequence. Nested Conditions and Sequencers are represented in the job design by icons and joined to activities by triggers.

End of Build - Releasing a Job
If you are developing a job for users on another DataStage system, you must release it. A job can be released when it has been compiled and validated successfully at least once in its life. Jobs are released using the DataStage Manager. To release a job:

1. From the DataStage Manager, browse to the required category in the Jobs branch in the project tree.
2. Select the job you want to release in the display area.
3. Choose Tools > Release job. The Job Release dialog box appears.
4. Select the job that you want to release.
5. Click Release Job to release the selected job, or Release All to release all the jobs in the tree.

A physical copy of the chosen job is made (along with all the routines and code required to run the job) and it is recompiled. The released job is automatically assigned a name and version number using the format jobname%reln.n.n, where jobname is the name of the job you chose to release and n.n.n is the job version number. This is known as a "fixed job release": when you refer to a job by its released name, it always equates to that particular version of the job. If you want to develop and enhance a job design, you must edit the original job; to use the changes you have made, you must release the job again.

DataStage Job Debugging
The DataStage debugger provides you with basic facilities for testing and debugging your job designs. The debugger is run from the DataStage Designer, and it can be used from many places in the Designer. The debugger enables you to set breakpoints on the links in your job. When you run the job in debug mode, the job will stop when it reaches a breakpoint. You can then step to the next action (reading or writing) on that link, or step to the processing of the next row of data (which may be on the same link or another link). Breakpoints are validated when the job is compiled. Any breakpoints you have set remain if the job is closed and reopened. If a link is deleted, or either end moved, the breakpoint is deleted. If a link is deleted and another of the same name created, the new link does not inherit the breakpoint. Breakpoints are not inherited when a job is saved under a different name.

Note: Released jobs cannot be copied or renamed using the Manager.

Integrating with an existing EDW - MetaBrokers allow you to exchange enterprise meta data between DataStage and other data warehousing tools. For example, you can use MetaBrokers to import table definitions into DataStage that you have set up using a data modeling tool. Similarly, you can export meta data from a DataStage job to a business intelligence tool to help it in its analysis of your data warehouse.

Tool Customization Using BASIC programming: DataStage BASIC is a business-oriented programming language designed to work efficiently with the DataStage environment. It is easy for a beginning programmer to use, yet powerful enough to meet the needs of an experienced programmer. The power of DataStage BASIC comes from statements and built-in functions that take advantage of the extensive database management capabilities of DataStage. These benefits, combined with other BASIC extensions, result in a development tool well-suited for a wide range of applications. DataStage BASIC programmers should understand the meanings of the following terms:

• BASIC program
• Source code
• Object code
• Variable
• Function
• Keyword

BASIC Program: A BASIC program is a set of statements directing the computer to perform a series of tasks in a specified order. A BASIC statement is made up of keywords and variables.

Source Code: Source code is the original form of the program written by the programmer.

Object Code: Object code is compiler output, which can be executed by the DataStage RUN command or called as a subroutine.

Variable: A variable is a symbolic name assigned to one or more data values stored in memory. A variable's value can be numeric or character string data, or the null value; it can be defined by the programmer, or it can be the result of operations performed by the program. Variable names begin with an alphabetic character and can include alphanumeric characters, periods ( . ), dollar signs ( $ ), and percent signs ( % ). Variable names can be as long as the physical line, but only the first 64 characters are significant. Upper- and lowercase letters are interpreted as different; for example, REC and Rec are different variables.

Function: A BASIC intrinsic function performs mathematical or string manipulations on its arguments. It is referenced by its keyword name and is followed by the required arguments enclosed in parentheses. Functions can be used in expressions; in addition, function arguments can be expressions that include functions. DataStage BASIC contains both numeric and string functions.

• Numeric functions. BASIC can perform certain arithmetic or algebraic calculations, such as calculating the sine (SIN), cosine (COS), or tangent (TAN) of an angle passed as an argument.
• String functions. A string function operates on ASCII character strings. For example, the TRIM function deletes extra blank spaces and tabs from a character string, and the STR function generates a particular character string a specified number of times.
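For instance, the fragment below is a minimal sketch, written as the body of a test routine (returning its result in Ans); the input value is made up purely for illustration:

   RawValue = "   Data   Warehouse   "
   CleanValue = Trim(RawValue)      ;* strips leading, trailing, and redundant blanks
   Underline = Str("-", 20)         ;* repeats the "-" character 20 times
   Angle = 30
   Ratio = Sin(Angle)               ;* SIN expects the angle in degrees
   Ans = CleanValue : " / " : Underline : " / " : Ratio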

Keyword: A BASIC keyword is a word that has special significance in a BASIC program statement. The case of a keyword is ignored; for example, READU and readu are the same keyword.

BASIC programming allows a developer to perform a wide gamut of activities. The summarized list is given below:

• Optional statement labels (that is, statement numbers)
• Statement labels of any length
• Multiple statements allowed on one line
• Computed GOTO statements
• Complex IF statements
• Multiline IF statements
• Priority CASE statement selection
• String handling with variable length strings up to 2**32 - 1 characters
• External subroutine calls
• Direct and indirect subroutine calls
• Magnetic tape input and output
• Retrieve data conversion capabilities
• DataStage file access and update capabilities
• File-level and record-level locking capabilities
• Pattern matching

Best Practices: Design Approach
It is always preferred to follow a lifecycle framework when deploying iterations to the data warehouse. Consider the long-term effects of building 500+ jobs to support the load of an enterprise data warehouse. You're going to have iterative releases where more subject areas and tables are deployed, as well as bug fixes and corrections to existing tables and processes. As a warehouse matures, you find out where you made short-sighted decisions in the initial framework and methodology. You're going to need to do things like set up a consistent parameter framework where all of your jobs are location/host independent. You're going to want to set up job design standards, shared/common libraries of functions, routines, etc. You're asking the right questions right off the bat.

Release Approach
I recommend that one considers the merits of a release control system, whereby Iteration #1 of the warehouse is assigned a release name, for example, EDW_1_0. You will create the project under that release name. As you build and work on the next release, EDW_1_1, in a same-named project, EDW_1_0 remains available for bug fixes to the current production release. EDW_1_1 would be a full code set of EDW_1_0, plus the enhancements and revisions. Since you're going to have to coordinate database changes with the implementation of the code changes, this approach allows you to create the EDW_1_1 project in the production environment ahead of time, and deploy a completed release into the project and have it compiled and in place awaiting your implementation day. On implementation day, you have the database changes enacted for that release, and you update whatever schedule pointers are necessary to reflect that EDW_1_1 is now the current released project.

Synchronizing Data Model Changes with the Warehouse
One has a challenge as the ETL Architect/Warehouse Architect in coordinating code changes with database changes, and if you are dealing with a matured warehouse you now have to toss in the OLAP reporting changes that go along with the database changes. This coordination effort is like the Russians and the US working on the international space station: sometimes things go wrong in the translation and the parts don't mate perfectly. It's a fully integrated approach and you have to consider the full lifecycle crossing from back room to front office. By versioning your full EDW application at a project level, you gain the ability to work on maintenance issues on the current release (EDW_1_0) while the second generation (EDW_1_1) is undergoing user acceptance testing and the third generation release (EDW_1_2) is being developed. This approach allows you to seamlessly migrate your application across the multiple hosts required for development, system test, user acceptance/QA testing, and production. This type of approach follows the iterative approach espoused by Inmon and the lifecycle of Kimball.

On that note, you're going to want to set up release-versioned directory structures within your system environment so that each version can have a discrete working environment. For example, /var/opt/EDW/edw_1_0 could be the base directory for assorted subdirectories to contain runtime files, control files, SQL scripts, and log folders for your release, as well as DataStage's working directories for sequential files, hash files, and other sundry pieces. You're going to have shell scripts, control files, SQL scripts — lots of pieces and parts outside of the ETL tool. You're going to want a good version control product that can manage everything, from your data model to your shell scripts to your DS objects; I recommend something robust like PVCS to manage your objects. From that standpoint, you can now tag and manage your common/shared objects across the board and package a complete release.

As for downstream data marts, they too should be developed in versioned projects; your scripts, control files, etc. have to follow the release. Though nothing has been mentioned along the lines of the presentation layer (Business Objects, MicroStrategy, etc.), if you're running independent teams working on the EDW and the marts, each data mart could be running on its own independent lifecycle. You may have a data mart enhancement that is running on a delayed reaction because the EDW release changes aren't in place. Trying to bring a data mart enhancement online at the same time as an EDW enhancement can be VERY difficult. If you have a high-volume issue with a backfill, this could take the data marts/EDW offline during the backfill process, which may be undesirable.

Incorporating and Maintaining Plug-n-Play Features in the DWH
It is better to have a common/shared library of routines and functions that you're going to keep synchronized across all of the projects.

Other Guidelines:
• Standard naming conventions.
• Achieve parameterization as far as practicable.
• Convert repetitive transformations to routines.
• Reduce the number of active database lookups in a job; use pre-loaded hash files for the same.
• Use synonyms vs. fully qualified table names.
• Incorporate checkpoint-restart logic (if the business requires it).
• Use Enterprise Standard Event Handler shared containers for across-the-board activities like error handling (including error prioritizing), event handling (like sending notifications), generating log files, etc.
• Use the appropriate plug-in if there is only one type of RDBMS in place in the data warehouse; i.e., if the DWH is working on Oracle 8/9.x only, then use the OCI9 stage and not the ODBC stage.

Points of Caution: Organizing a DataStage project depends on the number of jobs you expect to have. Here are some of my observations based on my experience with DataStage:

• 500+ jobs in a project causes a long refresh time in the DataStage Director. During this refresh, your Director client is completely locked up, and any open edit windows are hung until the refresh completes. A Director refresh will also hang a Monitor dialog box until the refresh completes. Increasing the refresh interval to 30 seconds mitigates the occurrence of the refresh, but does not lessen its impact. Some platforms have a lot less of an issue with this.
• DataStage itself seems to take longer to do things like pull a job up. If a project gets over 1000 jobs you see some performance loss browsing through the jobs. I would try to keep the number of jobs under 250 in each project.
• Compiling a routine can take minutes, depending on how many jobs there are and how many jobs use the routine. Routines are seldom changed: either they work or they do not work.
• The usage analysis links on import add a lot of overhead to the import process.
• Reusability is not an issue. Jobs usually cannot be reused, and routines are easily copied from one project to another. Replicating metadata is not a problem either; it does not take long to re-import table definitions, or to export them and import them into another project.

If you can separate your jobs into projects that never overlap, then do it. If there is some overlap in functionality, then you cannot easily run jobs in two separate projects. If you do not separate, then you may have issues in isolating sensitive data; financial data may be sensitive and need specific developers working on it. The number of jobs somewhat corresponds to the number of tables in each target. There is a trade-off between the benefits gained by having fewer projects and the complexity of MetaStage and Reporting Assistant; these tools can extract the ETL business rules and allow you to report against them.

SOME PRACTICAL TOPICS

Executing DS jobs thru Autosys/Unix environments – Enterprise schedulers like Autosys on the UNIX platform can be used to schedule DataStage jobs. To interface DataStage jobs with an external scheduler we need to use the Command Line Interface (CLI) of DataStage.

Command Syntax:

   dsjob [-file <file> <server> | [-server <server>][-user <user>][-password <password>]] <primary command> [<arguments>]

Valid primary command options are: -run, -stop, -lprojects, -ljobs, -linvocations, -lstages, -llinks, -projectinfo, -jobinfo, -stageinfo, -linkinfo, -lparams, -paraminfo, -log, -logsum, -logdetail, -lognewest, -report, -jobid.

So by using the various command options we can get the relevant information about a job or about the project. The entire logic can be written in a shell file and invoked through an Autosys Command Job. A DataStage job failure is passed back to AutoSys as an exit code; the simplest way to achieve this is to use the -jobstatus parameter when invoking the job. Sample code for the same is given below:

   #Get Job Status
   #$DSSERVER - Datastage Server name. Information is available from the prod.profile file
   #$DSUSERID - Datastage Server User ID. Information is available from the prod.profile file
   #$DSPASSWORD - Datastage Server Password. Information is available from the prod.profile file
   #$PROJECTNAME - The name of the Datastage Project. Information is available from the prod.profile file
   #${BinFileDirectory} - The BIN directory of the Datastage Engine. Information is available from the prod.profile file
   #$1 - Job Name. Passed as command line parameter
   #$2 - Log file Name. Passed as command line parameter

   #Execute the datastage job
   $BinFileDirectory/dsjob -server $DSSERVER -user $DSUSERID -password $DSPASSWORD -run $PROJECTNAME $1

   #Loop to check the status of the job from the defined log file
   while [ 1 -eq 1 ]
   do
      dsjob -jobinfo $PROJECTNAME $1 > $2
      jobstatus=`grep 'Job Status' $2 | cut -d':' -f 2 | cut -d'(' -f 1`
      echo $jobstatus
      if [[ $jobstatus != ' RUNNING ' ]]
      then
         if [[ $jobstatus != ' RUN OK ' ]]
         then
            auditstatus='FAILURE'
            echo $auditstatus
            exit 1
         else
            auditstatus='SUCCESS'
            echo $auditstatus
            exit 0
         fi
      fi
   done

DS Jobs options on "Restartability", Reaching Commit Points, Logs and Archives

Restartability always carries with it a burden of needing to stage data on disk, since DataStage is intended to keep data in memory as much as possible (for speed). You have to design for this, so it should be implemented only if there is a business requirement; otherwise one should go for the option of a START-OVER.

Here is an overview of the approach to implement a check-point restart. A hierarchy of control jobs (job sequences) is the easiest way to accomplish restartability; the main control jobs should never abort. You control the absolute level of restart capabilities, and you also gain the ability to start and stop at milestones, track completed jobs and waiting jobs, do milestone tracking, etc. One should leverage the power of the underlying BASIC language to create a customized job control with automatic 'Resurrection' ability, where it picks up right from where it left off in the job stream, so that recovery can be 100% automatic. Realize that DataStage has a wonderful API library of job control functions, which greatly extends the ability to manage hundreds of jobs in a single process.

A good approach is to develop a job control library that reads a simple dependency matrix. The dependency matrix can be an Excel spreadsheet designed on the dependency tree by listing jobs and a space-separated list of immediate predecessor jobs. It is a good approach because it allows full metadata exposure as to the process execution flow. You can customize parameter value assignments to each job's needs (parameters can be read from a file and set in a job at runtime). Once you have a dependency tree, you can have a job control process to manage the execution of the jobs. The only challenge becomes maintaining the dependency tree; one can maintain the tree in an Oracle (or any other RDBMS) table.

When writing such job control code, it is not advisable to abort: never use ABORT or STOP statements, and never call DSLogFatal. I prefer to log warnings and other restart status information. Never return non-zero codes from before/after subroutines (instead pass results and status as return values). Use DSJ.ERRNONE as the second argument for DSAttachJob. One could then create a job control routine to process a particular selection of jobs in the project (maybe all of them, maybe just the ones with a status of DSJS.FAILED or DSJS.CRASHED), for each of these executing your "dump log" job.
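As an illustration of that API, here is a minimal, hedged sketch of one step of such a job control routine. The job name, parameter name, and log messages are hypothetical; a production version would read them from the dependency matrix described above rather than hard-coding them:

   JobName = "LoadCustomerDim"                 ;* hypothetical job taken from the dependency matrix
   hJob = DSAttachJob(JobName, DSJ.ERRNONE)    ;* DSJ.ERRNONE: report problems via status, do not abort
   ErrCode = DSSetParam(hJob, "ProcessDate", "2004-06-30")   ;* hypothetical job parameter
   ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
   ErrCode = DSWaitForJob(hJob)                ;* block until the job finishes
   Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
   If Status = DSJS.RUNOK Or Status = DSJS.RUNWARN Then
      Call DSLogInfo(JobName : " completed with status " : Status, "JobControl")
   End Else
      Call DSLogInfo(JobName : " failed with status " : Status : " - flagged for restart", "JobControl")
   End
   ErrCode = DSDetachJob(hJob)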
Reaching Commit Points
The commit points can be controlled through error thresholds. The error threshold can be a global parameter maintained at the prod.profile level or be job specific; in the second case the error threshold for each job needs to be passed as a parameter to the particular job. There are two approaches that can be considered while using an error threshold:

Approach 1 – Figure out the count of erroneous records (as per the given business logic) and compare it with the error threshold. If the count of erroneous records is greater than the error threshold, then have the control job abort the processing job and write to a log file. This approach is mainly used where the DWH design is in the form of a batch architecture.

Approach 2 – Let the processing job continue with its activity and do the comparison with the error threshold intermittently during the job activity. Have an external job monitor the processing job and ABORT the processing job as soon as the error threshold is reached (a minimal sketch of such a monitor is given at the end of this section). This approach is applied in a Real Time DWH.

Process Logging
In all approaches it is advisable to have a DataStage job to dump the requisite contents from the DataStage job logs into, for example, delimited text files. Use standard UNIX commands like grep/awk to read from the log files.
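The fragment below is a hedged sketch of the heart of the external monitor described in Approach 2, using the same job control API; the job, stage, and link names are hypothetical, and the threshold value is only illustrative (in practice it would be passed in as a routine argument or job parameter):

   ErrorThreshold = 100                         ;* illustrative value; normally passed in
   hJob = DSAttachJob("ProcessDailyFeed", DSJ.ERRNONE)   ;* hypothetical processing job
   Loop
      RejectCount = DSGetLinkInfo(hJob, "xfmValidate", "lnkRejects", DSJ.LINKROWCOUNT)
      If RejectCount > ErrorThreshold Then
         Call DSLogInfo("Error threshold exceeded: " : RejectCount, "ThresholdMonitor")
         ErrCode = DSStopJob(hJob)              ;* stop the processing job
         Exit
      End
      If DSGetJobInfo(hJob, DSJ.JOBSTATUS) <> DSJS.RUNNING Then Exit
      Sleep 30                                  ;* re-check every 30 seconds
   Repeat
   ErrCode = DSDetachJob(hJob)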

Executing Jobs in a Controlled Sequence within the DS environment
DataStage provides a graphical Job Sequencer which allows you to specify a sequence of server jobs or parallel jobs to run. The sequence can also contain control information; for example, you can specify different courses of action to take depending on whether a job in the sequence succeeds or fails. Once you have defined a job sequence, it can be scheduled and run using the DataStage Director. It appears in the DataStage Repository and in the DataStage Director client as a job.

Designing a job sequence is similar to designing a job. You create the job sequence in the DataStage Designer, add activities (as opposed to stages) from the tool palette, and join these together with triggers (as opposed to links) to define control flow. Each activity has properties that can be tested in trigger expressions and passed to other activities further on in the sequence. Activities can also have parameters, which are used to supply job parameters and routine arguments. The job sequence itself has properties, and can have parameters, which can be passed to the activities it is sequencing. The job sequence supports the following types of activity:

• Job - Specifies a DataStage server or parallel job.
• Routine - Specifies a routine. This can be any routine in the DataStage Repository (but not transforms).
• ExecCommand - Specifies an operating system command to execute.
• Email Notification - Specifies that an email notification should be sent at this point of the sequence (uses SMTP).
• Wait-for-file - Waits for a specified file to appear or disappear.
• Run-activity-on-exception - There can only be one of these in a job sequence. It is executed if a job in the sequence fails to run (other exceptions are handled by triggers).

Control Entities
The Job Sequencer provides additional control entities to help control execution in a job sequence. Nested Conditions and Sequencers are represented in the job design by icons and joined to activities by triggers. A nested condition allows you to further branch the execution of a sequence depending on a condition. The pseudo code:

   Load/init jobA
   Run jobA
   If ExitStatus of jobA = OK then       /* tested by trigger */
      If Today = "Wednesday" then        /* tested by nested condition */
         run jobW
      If Today = "Saturday" then
         run jobS
   Else
      run jobB

A sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers. The sequencer operates in two modes:

• ALL mode. In this mode all of the inputs to the sequencer must be TRUE for any of the sequencer outputs to fire.
• ANY mode. In this mode, output triggers can be fired if any of the sequencer inputs are TRUE.

Tips on performance and tuning of DS jobs
Here is an overview of some design techniques for getting the best possible performance from the DataStage jobs that one is designing.

Translating Stages and Links to Processes
When you design a job you see it in terms of stages and links. When it is compiled, the DataStage engine sees it in terms of processes that are subsequently run on the server. How does the DataStage engine define a process? It is here that the distinction between active and passive stages becomes important. Active stages, such as the Transformer and Aggregator, perform processing tasks, while passive stages, such as the Sequential file stage and ODBC stage, read or write data sources and provide services to the active stages.

At its simplest, active stages become processes and passive stages mark the process boundaries. But the situation becomes more complicated where you connect active stages together or passive stages together. What happens when you have a job that links two passive stages together? Obviously there is some processing going on: under the covers DataStage inserts a cut-down Transformer stage between the passive stages which just passes data straight from one stage to the other, and this becomes a process when the job is run. What happens where you have a job that links two or more active stages together? By default this will all be run in a single process, all adjacent active stages being run in that one process.

Behavior of DataStage jobs on Single and Multiple Processor systems
The default behavior when compiling DataStage jobs is to run all adjacent active stages in a single process. This makes good sense when you are running the job on a single processor system. When you are running on a multi-processor system it is better to run each active stage in a separate process, so the processes can be distributed among available processors and run in parallel. There are two ways of doing this:

• Explicitly – by inserting IPC stages between connected active stages.
• Implicitly – by turning on inter-process row buffering, either project wide (using the DataStage Administrator) or for individual jobs (in the Job Properties dialog box).

The IPC facility can also be used to produce multiple processes where passive stages are directly connected. This means that an operation reading from one data source and writing to another could be divided into a reading process and a writing process, able to take advantage of multiprocessor systems.

Partitioning and Collecting
With the introduction of the enhanced multi-processor support at Release 6 onwards, there are opportunities to further enhance the performance of server jobs by partitioning data. The Link Partitioner stage allows you to partition the data you are reading so it can be processed by individual processes running on multiple processors. The Link Collector stage allows you to collect partitioned data together again for writing to a single data target.

Diagnosing the Jobs
Once the jobs have been designed it is better to run some diagnostics to see if performance could be improved. There may be two factors affecting the performance of your DataStage job:

• It may be CPU limited
• It may be I/O limited

You can obtain detailed performance statistics on a job to enable you to identify those parts of a job that might be limiting performance, and so make changes to increase performance. The collection of performance statistics can be turned on and off for each active stage in a DataStage job. This is done via the Tracing tab of the Job Run Options dialog box: select the stage you want to monitor and select the Performance statistics check box. Use shift-click to select multiple active stages to monitor from the list.

Interpreting Performance Statistics

The performance statistics relate to the per-row processing cycle of an active stage, and of each of its input and output links. The information shown is:

• Percent. The percentage of overall execution time that this part of the process used.
• Count. The number of times this part of the process was executed.
• Minimum. The minimum elapsed time in microseconds that this part of the process took for any of the rows processed.
• Average. The average elapsed time in microseconds that this part of the process took for the rows processed.

Care should be taken when interpreting these figures. For example, when in-process active stage to active stage links are used, the percent column will not add up to 100%. Also be aware that if you collect statistics for the first active stage, the entire cost of the downstream active stage is included in the active-to-active link; this distortion remains even where you are running the active stages in different processes (by having inter-process row buffering enabled), unless you are actually running on a multi-processor system. If the Minimum figure and the Average figure are very close, this suggests that the process is CPU limited. If the Job monitor window shows that one active stage is using nearly 100% of CPU time, this also indicates that the job is CPU limited. Otherwise, poorly performing jobs may be I/O limited.

Additional Information to improve Job performance

CPU Limited Jobs – Single Processor Systems. The performance of most DataStage jobs can be improved by turning in-process row buffering on and recompiling the job. This allows connected active stages to pass data via buffers rather than row by row. You can turn in-process row buffering on for the whole project using the DataStage Administrator; alternatively, you can turn it on for individual jobs via the Performance tab of the Job Properties dialog box.

CPU Limited Jobs – Multi-processor Systems. The performance of most DataStage jobs on multiprocessor systems can be improved by turning on inter-process row buffering and recompiling the job. This enables the job to run using a separate process for each active stage, which will run simultaneously on a separate processor. You can turn inter-process row buffering on for the whole project using the DataStage Administrator; alternatively, you can turn it on for individual jobs via the Performance tab of the Job Properties dialog box. If you have one active stage using nearly 100% of CPU, you can improve performance by running multiple parallel copies of the stage process. This is achieved by duplicating the CPU-intensive stage or stages (using a shared container is the quickest way to do this) and inserting a Link Partitioner and Link Collector stage before and after the duplicated stages.

CAUTION: You cannot use inter-process row buffering if your job uses COMMON blocks in transform functions to pass data between stages. This is not recommended practice, and it is advisable to redesign your job to use row buffering rather than COMMON blocks.

I/O Limited Jobs. Although it can be more difficult to diagnose I/O limited jobs and improve them, there are certain basic steps you can take:

• If you have split processes in your job design by writing data to a Sequential file and then reading it back again, you can use an Inter Process (IPC) stage in place of the Sequential stage. This will split the process and reduce I/O and elapsed time, as the reading process can start reading data as soon as it is available rather than waiting for the writing process to finish.
• If an intermediate Sequential stage is being used to land a file so that it can be fed to an external tool, for example a bulk loader or an external sort, it may be possible to invoke the tool as a filter command in the Sequential stage and pass the data directly to the tool.
• If you are processing a large data set, you can use the Link Partitioner stage to split it into multiple parts without landing intermediate files.

If a job still appears to be I/O limited after taking one or more of the above steps, you can use the performance statistics to determine which individual stages are I/O limited. The following can be done:

1. Run the job with a substantial data set and with performance tracing enabled for each of the active stages.
2. Analyze the results and compare them for each stage. In particular, look for active stages that use less CPU than others, and which have one or more links where the average elapsed time is high.

Once you have identified the stage, the actions you take might depend on the types of passive stage involved in the process. For all stage types you might consider:

• redistributing files across disk drives
• changing memory or disk hardware
• reconfiguring databases
• reconfiguring the operating system

Hash File Design - Poorly designed hashed files can be a cause of disappointing performance and can have particular performance implications. Hashed files are commonly used to provide a reference table based on a single key; performing lookups can be fast on a well designed file but slow on a poorly designed one. Another use is to host slowly-growing dimension tables in a star-schema warehouse design; again, a well designed file will make extracting data from dimension files much faster. There are various steps you can take within your job design to speed up operations that read and write hash files:

• Pre-loading. You can speed up read operations on reference links by pre-loading a hash file into memory. Specify this on the Hash File stage Outputs page.
• Write Caching. You can specify a cache for write operations such that data is written there and then flushed to disk. This ensures that hashed files are written to disk in group order, rather than the order in which individual rows are written (which would by its nature necessitate time-consuming random disk accesses). If server caching is enabled, you can specify the type of write caching when you create a hash file; the file then always uses the specified type of write cache. Otherwise you can turn write caching on at the stage level via the Outputs page of the Hash File stage.
• Pre-allocating. If you are using dynamic files, you can speed up loading the file by doing some rough calculations and specifying a minimum modulus accordingly. This greatly enhances operation by cutting down or eliminating split operations. You can calculate the minimum modulus as follows: minimum modulus = estimated data size / (group size * 2048). When you have calculated your minimum modulus you can create a file specifying it (using the Create File feature of the Hash File dialog box) or resize an existing file specifying it (using the RESIZE command).
• Calculating static file modulus. You can calculate the modulus required for a static file using a similar method to that described above for dynamic files: modulus = estimated data size / (separation * 512). When you have calculated your modulus you can create a file specifying it, or resize an existing file specifying it (using the RESIZE command).
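As a purely illustrative worked example (the data volume here is invented; substitute your own estimate): if a dynamic hashed file is expected to hold roughly 400 MB of reference data and is created with a group size of 2, then minimum modulus = 400,000,000 / (2 * 2048) = approximately 97,656, so you might create or resize the file with a modulus of around 100,000.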

DWH Architecture based on a Universal File Format (UFF)

Creating a UFF is another approach to architecting a DWH, in which all the input file formats are converted to a UFF. This approach aims at having a specific converter module per file that reads the input file and converts it to a UFF based on the metadata. The metadata would capture all the possible attributes that the feed files would have, along with the layout of each file. There would be a common module/process that would process the UFF and load it to the DWH. This approach of split processing would act as a common post-load process that can run in multiple threads and will not have any dependency on the processing of the individual input files.

There would also be the capability to process a file having a new attribute (outside the list). The architecture would load the data for this attribute into the staging area but would not propagate it to the Datamarts/EDW. In the staging area this attribute would be marked as UNK (Unknown) and a notification would be sent to the process owner(s) and/or support team regarding the occurrence of this new attribute. After the process owner validates this attribute, the UNK tag would be taken off and the attribute can flow to the Datamarts/EDW for reporting.

In case the data source changes from a feed file to RDBMS table(s), this approach can still be applied. Considering Oracle as our target database, one can implement an Oracle transparent or procedural gateway to pull data from the other RDBMS, convert it to a UFF using the specific converter module, and then use the common module to load the UFF to the staging area.

A sample pseudo code in DataStage implementing the above approach is given below:

$INCLUDE DSINCLUDE JOBCONTROL.H
*************************************************************************
* Define the Event And Error Logging variables
EventNo = 0
Result = ""
Action = ""
ErrorNo = 0
OprnCode = ""
AMPM = ""
RR = 0

* Declare the Library Functions
* Function for Event Logging
Deffun InsertEventLog(EventNo, Result, Action, OprnCode, hdbc) Calling "DSU.InsertEventLog"
* Function for Error Logging
Deffun InsertErrorLog(ErrorNo, Result, MailInfo, OprnCode, hdbc) Calling "DSU.InsertErrorLog"
* Function for getting information from a process Queue - this is a workaround for implementing a REAL TIME DWH
Deffun UpdateProcessQue(FileName, ArgFileStatus, ArgFileDateTime, hdbc) Calling "DSU.UpdateProcessQue"
* Get Metadata Information
Deffun GetMetaDataString(hdbc) Calling "DSU.GetMetaDataString"
* Get File Layout details
Deffun GetPosFromArray(AttribStr, AttribPos, AttribName, hdbc) Calling "DSU.GetPosFromArray"
* Create and Load UFF to Staging area
Deffun InsertUFFData(RecStr, FileName, hdbc) Calling "DSU.InsertUFFData"
* Sanity check to see if any new attribute has been processed
Deffun GetMetaDataCount(hdbc) Calling "DSU.GetMetaDataCount"
* Next 3 Functions are for Email Notification
Deffun GetFromMailAddress(Dummy) Calling "DSU.GetFromMailAddress"
Deffun GetToMailAddress(ArgOprnCd) Calling "DSU.GetToMailAddress"
Deffun SendMail(ArgToAddressList, ArgFromAddress, ArgSubject, ArgMessageBody) Calling "DSU.SendMail"

*************************************************************************
status = SQLAllocEnv(henv)
status = SQLAllocConnect(henv, hdbc)
status = SQLConnect(hdbc, "DSN NAME", "USID", "PWD")
iTotalCount = 0
*************************************************************************
* Function GetMetaDataCount is called to get the count of Meta data from table META_DATA
iTotalCount = GetMetaDataCount(hdbc)

***************************** Variable declarations *****************************
* Stores the META DATA, later converted to comma delimited form
dim MetData(iTotalCount)
* Stores the position of the META DATA held in the above array, at the respective positions
dim DataPos(iTotalCount)
* Virtual representation of the intermediate file which was being generated earlier:
* stores [Start Section], then the headers in comma delimited format, then the detail values;
* this goes on for each section
dim ArrTmp(300)
ArrTmpCnt = 0

* Setting the Mailing Information
ToAddressList = GetToMailAddress("PROCESS Group")
FromAddress = GetFromMailAddress("dummy")
Subject = "EDSS Autogenerated message (Process Group Name)"

* Check the file extension; if it is not .AHT then move the file to the Error\INVALIDFILE folder
If UpCase(Right(FileName, Len(FileName) - Index(FileName, ".", 1))) <> "AHT" then goto InvalidFile

* Open the raw data file for processing
OPENSEQ FilePath : "\WORK\" : FileName to INFILE THEN
   PRINT "INPUT FILE OPENED FOR PROCESSING"
END ELSE
   PRINT "DID NOT OPEN INPUT FILE"
END
Ans3 = InsertEventLog(6, "P", FileName : " picked from Queue.", "AHT", hdbc)

* Counter of the array ArrTmp; its length denotes the number of records
* RecStr represents a UFF record with comma delimited values at their respective positions;
* it stores the string that is to be inserted in the table STAGING_FILE_DATA
RecStr = ""
MetaData = ""
DataVal = ""
ErrorCode = 0
CNT = 0
RR = 0
WR = 0
x = 0
iCnt = 0
iCtrTemp = 1
RecCount = 0
Ans = ""
Ans1 = ""
Ans2 = ""
Ans3 = ""
Ans4 = ""
Ans5 = ""
cn = 0
CheckStr = ""
RecNull = 0

* This label parses the details of a particular section
ReadSection:
* Reinitializing the variables
MetaData = ""
DataVal = ""
* If EOF is encountered then goto the EndOfFile section
*if Status() = 1 then goto EndOfFile
* Parse the first line of the raw data file to get the headers
READSEQ A FROM INFILE THEN MetaData = A ELSE ErrorCode = -101
* If the file is blank then do not process it
if Status() = 1 then goto EmptyFile
* Parse the second line of the raw data file to get the details
READSEQ A FROM INFILE THEN DataVal = A ELSE ErrorCode = -102
If TrimB(TrimF(DataVal)) = "" then goto IncompleteFile
* If EOF is encountered then goto the EndOfFile section
if Status() = 1 then goto EndOfFile
* Parse the third line of the raw data file; ignore this line
READSEQ A FROM INFILE THEN RR = RR + 1 ELSE ErrorCode = -103
* Parse the fourth line of the raw data file to get the headers; append to the metadata
READSEQ A FROM INFILE THEN MetaData = MetaData : "," : A ELSE ErrorCode = -104
* If EOF is encountered then goto the EndOfFile section
if Status() = 1 then goto EndOfFile

" * Call this loop till the count of delimited values in a record and stores the value of meta data at the respective position for x = 1 to dcount(tmpA.MailInfo.MetData(x).x)). char(34)."." : A if Status()=1 then goto EndOfFile iCtrTemp = iCtrTemp + 1 Next CNT ****************************************************************** ****************************************** * This lable takes care of EOF EndOfFile: * Nothing is done ****************************************************************** ****************************************** ArrTmpCnt = ArrTmpCnt + 1 ****************************************************************** ****************************************** * This lable creates the UFF and stores in ArrTmp * Making Log Ans3 = InsertEventLog(14.hdbc) DataPos(x) = Ans1 .""))) <> "" THEN MetData(x) = TrimB(TrimF(EReplace(UpCase(field(tmpA. char(34).".".""))) * Call this function to get the position of the Meta data Ans1 = GetPosFromArray(AttribStr.x)). 300 considered as the upper limit of the details (assumption) For CNT = 1 to 300 READSEQ A FROM INFILE THEN ArrTmp(iCtrTemp) = DataVal : ".1) MailInfo = FileName : " in Operation AHT.2. "") RecNull = ISNULL(CheckStr) IF ((CheckStr = "") OR (RecNull = 1)) then goto IncompleteFile *First element of the array denotes the start of section ArrTmp(iCtrTemp) = "[StartSection]" iCtrTemp = iCtrTemp + 1 ArrTmp(iCtrTemp) = MetaData iCtrTemp = iCtrTemp + 1 * Append the details of the current section in the array."~".") if TrimB(TrimF(EReplace(UpCase(field(tmpA.".".hdbc) CreateUFF: AttribStr = "" AttribPos = "" ArrTmpCnt = ArrTmpCnt + 1 tmpA = ArrTmp(ArrTmpCnt) Ans1 = ArgMetaDataString AttribStr = field(Ans1.1) AttribPos = field(Ans1."."P"."~".The Data Warehousing and Business Intelligence CheckStr = DataVal CheckStr = EReplace (CheckStr.FileName : " UFF conversion initiated.AttribPos.1.".".

".x)). FileName .". strMove.The Data Warehousing and Business Intelligence End next x * Calling the function again to get the latest count of Attributes.iTotalCount) ArrTmpCnt = ArrTmpCnt+1 A = ArrTmp(ArrTmpCnt) if TrimF(TrimB(A[1. hdbc) * Send notification for the empty file. hdbc) Ans8 = InsertEventLog(12." SendMailResult = SendMail(ToAddressList. MessageBody) goto ExitProcess .".hdbc) *Update Process Queue Ans5 = UpdateProcessQue(FileName .DataPos(x)) next x * Call this function to insert the UFF record in STAGING_FILE_DATA Ans2 = InsertUFFData(RecStr. "E". "E". iTotalCount = GetMetaDataCount(hdbc) dim ArrUFF(iTotalCount) for cn = 1 to iCtrTemp . MessageBody = FileName : " is invalid.1 RecStr = "" RecStr = STR(". "AHT".".1])) = "" then goto EndProcessing for x = 1 to dcount(A.". Output. strMove. "AHT".TrimB(TrimF(EReplace(UpCase(field(A. FileName : " Queue status updated. "AHT".".1. Subject. if any new attributes are added.hdbc) *Log Actions in Event Log Ans7 = InsertEventLog(9. Output. SystemReturnCode) END *Log Errors in Error Log Ans4=InsertErrorLog(3.""))): ".") RecStr = EReplace(RecStr. SystemReturnCode) END ELSE strMove = "mv " : DQuote(FilePath : "/WORK/" : FileName) : " " : FilePath : "/ERROR/INVALIDFILE/" Call DSExecute("SH". "P". "".".". char(34). FileName.".hdbc) next cn goto EndProcessing ****************************************************************** ************************************************* ****************************************************************** ************************************************* InvalidFile: strMove = "" If StrOSType="NT" then strMove = "move " : DQuote(FilePath : "\WORK\" : FileName) : " " : FilePath : "\ERROR\INVALIDFILE\" Call DSExecute("NT". FileName : " Mail notification sent. FromAddress. Please verify.

"AHT". hdbc) Ans10 = InsertEventLog(9.hdbc) Ans5 = UpdateProcessQue(FileName. "I". SystemReturnCode) END Ans4 = InsertErrorLog(3. "AHT".hdbc) Ans5 = UpdateProcessQue(FileName . SystemReturnCode) END Ans4=InsertErrorLog(2. "E". strMove. "E". Output. "E". "I".hdbc) *Log Actions in Event Log . strMove. SystemReturnCode) END ELSE strMove = "mv " : DQuote(FilePath : "/WORK/" : FileName) : " " : FilePath : "/ERROR/INVALIDFILE/" Call DSExecute("SH".The Data Warehousing and Business Intelligence ****************************************************************** ************************************************* ****************************************************************** ************************************************* IncompleteFile: strMove = "" CloseSeq INFILE If StrOSType = "NT" then strMove = "move " : DQuote(FilePath : "\WORK\" : FileName) : " " : FilePath : "\ERROR\INVALIDFILE\" Call DSExecute("NT"." SendMailResult = SendMail(ToAddressList. MessageBody) goto ExitProcess ****************************************************************** ************************************************* ****************************************************************** ************************************************* EmptyFile: CloseSeq INFILE strMove = "" SendMailResult = "" If StrOSType="NT" then strMove = "move " : DQuote(FilePath : "\WORK\" : FileName) : " " : FilePath : "\ERROR\EMPTYFILE\" Call DSExecute("NT". hdbc) Ans11 = InsertEventLog(12. FileName. FileName : " Error encountered while parsing the file " . "AHT". FileName .hdbc) *Log Actions in Event Log Ans9 = InsertEventLog(3. "Mail notification sent for file " : FileName . MessageBody = FileName : " is Incomplete. strMove. "". Please verify. Output. FromAddress. Output. "AHT". SystemReturnCode) END ELSE strMove = "mv " : DQuote(FilePath : "/WORK/" : FileName) : " " : FilePath : "/ERROR/EMPTYFILE/" Call DSExecute("SH". Subject. strMove. "Queue status updated for " :FileName . hdbc) * Send notification for the Incomplete file. Output. "".

*************************************************************************
*************************************************************************
EndProcessing:
Ans = iTotalCount
CloseSeq INFILE
strMove = ""
If StrOSType = "NT" then
   strMove = "move " : DQuote(FilePath : "\WORK\" : FileName) : " " : FilePath : "\PROCESSED\"
   Call DSExecute("NT", strMove, Output, SystemReturnCode)
END ELSE
   strMove = "mv " : DQuote(FilePath : "/WORK/" : FileName) : " " : FilePath : "/PROCESSED/"
   Call DSExecute("SH", strMove, Output, SystemReturnCode)
END
* Log Actions in Event Log
Ans10 = InsertEventLog(4, "P", FileName : " UFF conversion done.", "AHT", hdbc)
Ans3 = InsertEventLog(15, "P", FileName : " File load successful.", "Process Group Name", hdbc)
* Update Process Queue
Ans5 = UpdateProcessQue(FileName, "U", "", hdbc)
Ans10 = InsertEventLog(9, "I", "Queue status updated for " : FileName, "AHT", hdbc)
Ans11 = InsertEventLog(12, "I", "Mail notification sent for file " : FileName, "AHT", hdbc)
*************************************************************************
*************************************************************************
ExitProcess:
Ans = iTotalCount
*************************************************************************

Sanity Checks in a DWH

In the entire ETL process there are 2 key check points which should pass a sanity check: the pre-load stage (implement a pre-load sanity) and the post-load stage (implement a post-load sanity).

The pre-load sanity includes a check on the following areas:
• Confirming successful load of the data for the previous business day.
• Confirming data accuracy between the actual feed file and the header/trailer information.

The post-load sanity includes a check on the following areas:
• Process to delete duplicate records after the staging load.
• Process to allow re-keying if the dimensional lookup has failed during the Staging to Fact load. The records that have failed the lookup go into an unkeyed file with the key = UNK (Unknown). This process can be architected in the following 2 ways:
  o Abort the load process if there is a lookup failure.
  o Create an intermediate keyed file holding the [key/value] pairs of the records to be loaded to the fact table. A decision can then be taken to load these records with the key values as UNK or to run them through the keying process once again.
• Data validation checks on 2 or more fact tables after they are loaded, through a post-load process.

Sanity checks can be done by implementing the logic in a shared container or at database level, whichever is appropriate.

Key Deliverable Documents:

Architecture/Approach Document (High Level) – Provides a bird's eye view of the DWH architecture.

Detailed Design Document – Provides low-level details of what has been discussed in the Approach document.

FMEA (Failure Mode Effects Analysis) Document – three questions need to be addressed in the FMEA:
1) What risks are there to the business if we do not make this change?
2) What other systems/components can this change impact?
3) What can go wrong in making this change?

The FMEA Document should have the following:
• Item/Process Step – whether internal (like a process that is part of the ETL load) or external (like the arrival of a feed file from the data provider).
• Potential Failure Mode – What can fail or pose a risk of failure.
• Potential Effect(s) of Failure – Who or what will be affected if the failure occurs.
• Potential Cause(s) of Failure.
• Occurrence – ranked on a scale of 1-10. A higher number indicates a higher possibility of occurrence.
• Current Controls – Any process in place to prevent, predict, or notify when the failure occurs.
• Severity – ranked on a scale of 1-10. A higher number indicates higher severity.
• Detection – ranked on a scale of 1-10. A higher number indicates that there is more chance of the business being affected before we can prevent the failure.
• Risk Priority Number (RPN) – The product of Severity, Occurrence, and Detection (Severity * Occurrence * Detection). This allows the risks to be ranked so that the high-priority ones can be addressed first; a worked example follows this list.
• Recommended Actions – To prevent the failures.
• Responsibility and Target Completion Date – Who is to implement the recommended actions, and by when.
• Action Taken – Activity that was actually done to mitigate the risk.
• Risk re-assessment – Performed after the recommended actions are implemented.
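To make the RPN arithmetic concrete, here is a minimal DataStage BASIC sketch that scores one hypothetical FMEA line item; the ratings are illustrative values, not figures from this document.

* Illustrative only: Risk Priority Number for one hypothetical FMEA item,
* e.g. "feed file arrives late from the data provider".
Severity   = 7                           ;* impact if the failure occurs (1-10)
Occurrence = 4                           ;* likelihood of the failure (1-10)
Detection  = 6                           ;* chance it reaches the business undetected (1-10)
RPN = Severity * Occurrence * Detection  ;* 7 * 4 * 6 = 168
Call DSLogInfo("RPN for late feed file: " : RPN, "FMEAScore")

Items with the highest RPN are addressed first, and re-scoring them after the recommended actions are implemented provides the risk re-assessment.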
