You are on page 1of 5

Whats New in DataStage version 8.

x Summary
DataStage has seen major transformations in the past years from an extract-transform-load (ETL) tool running in what was called the UniVerse engine, to what is now a DataStage engine. With the need to adapt to demands of volume processing, Ascential acquired and integrated the parallel processing engine into DataStage. To provide data quality and data integration solutions, Ascential added more products that included MetaStage, QualityStage and ProfileStage. Enter IBM Information Server, the latest release from IBM after acquiring Ascential. IBM Information Server delivers improved functionalities that make job administration, development and maintenance easier.

What is Information Server?


IBM Information Server is the integrated set of components that include WebSphere DataStage and Quality Stage, WebSphere Information Analyzer (hybrid of ProfileStage and AuditStage), Federation Server, and Business Glossary that share a common metadata repository, common administration, common logging and common reporting. Each component is separately orderable to fit the clients needs. IBM Information Server is built on a service-oriented architecture. Jobs for data analysis, data cleansing and data transformation can run as a service; making them available to any application at any time. This approach enhances job reuse and supports standardized processes to be implemented across projects and applications that call these services. Organizations looking to implement or upgrade from previous versions will find a host of new features that simplify job design and help cut implementation effort. Websphere DataStaqe alone is packed with new tools to better manage the job performance and improve operational efficiency. This article will examine the new features in DataStage version 8. Other Information Server components will not be discussed.

Whats new in WebSphere DataStage


There are many new features and added functionalities in WebSphere DataStage that help to cut development time, simplify job design and improve job performance. Among these features are Data Connection Object, ODBC Connector, SQL Builder, Slowly Changing Dimension stage, Range Look-up, Advanced and Quick Find, Parameter Set, Common Logging, Resource Estimation tool, Performance Analysis tool and Job Compare. The following are some additional details on each of these new features:

Data Connection Object


Data connection objects are reusable components that store database connection information. One data connection object can be created for each particular database and has to be associated with a particular stage type. Once configured, they can be simply

Alpine Consulting, Inc.

Page 1 of 5

Confidential Material

Whats New in DataStage version 8.x


dragged and dropped during job design, a very handy feature when creating jobs that read from or write to the same database.

ODBC Connector
ODBC Connectors are for accessing external data sources and can be used to read, write, look up and filter data or simply to test the database connectivity during job design. The Test Button on connectors allows developers to test database connections without having to view the data or to run the job. ODBC Connector is the best stage to use for your database because it gives the maximum parallel performance and offers more features compared to database enterprise stages and plug-in stages. ODBC Connector has DataStage schemas to data types, attributes specify, runtime errors a schema reconciliation function that automatically compares external-resource schemas such as a database. Schemas include and field lengths. Based on the reconciliation rules that you or extra transformation on mismatched schemas can be avoided.

ODBC Connecter has its own reject-handling function which eliminates the need to add a Modify or a Transformer stage for capturing SQL errors or for aborting jobs. A choice between number of rows or percentage or rows rejected can be specified for terminating the job run. ODBC Connector can transfer large objects (LOB) using inline or reference methods. However, a connector is the only stage that does reference methods so another connector is needed to transfer the LOB inline later in the job.

SQL Builder
The enterprise database stages in WebSphere DataStage as well as the ODBC Connector stage provide a graphical interface for building SQL that makes it easier for developers to view, create and edit SQL. While previous releases support auto-generated queries that synchronize the SQL select list with the DataStage column list, the SQL that can be created is very limited the table alias has to be manually appended to the columns and the where clause has to be manually keyed in. The SQL Builder provides a drag-anddrop approach for creating the column list and the where clause and supports most database functions. It is important to note that the SQL Builder is limited to one or two source tables and a simple SQL statement.

Slowly Changing Dimension Stage


Slowly Changing Dimension Stage (SCD) is a new stage that supports both Type 1 and Type 2 processing. The SCD stage allows for in-memory lookup updates, surrogate key generation and updates to dimension tables: processes that were previously implemented in a series of stages or in multiple jobs. SCD can handle a series of updates to the same dimension row because the changes are applied to the lookup table in memory and are mirrored to the dimension update link.

Range Look-up
Range Look-up is equivalent to the operator between. Lookup against a range of values was difficult to implement in previous DataStage versions. By having this functionality in

Alpine Consulting, Inc.

Page 2 of 5

Confidential Material

Whats New in DataStage version 8.x


the lookup stage, comparing a source column to a range of two lookup columns or a lookup column to a range of two source columns can be easily implemented.

Advanced Find
The Quick Find and the Advanced Find are very useful tools for developers as it does a quick search within the project or within the repository for object usage or dependencies. The results are presented in a detailed view or graphical view. This facility can be used for impact analysis or when doing project-wide job changes.

Parameter Set
Parameter Set is a new repository object containing parameter names and values that can be shared across jobs. Parameter Sets provide an easier and faster method when adding parameters to a job, eliminating the need to add parameters invidiually to each job. Parameter values can be stored in a file, adding flexibility when changing parameter values at runtime or when changing environment.

Resource Estimation
Resource Estimation tool is designed for planning how much resources will have to be allocated to the DataStage jobs as volume of data grows. The Resource Estimation utility can project how much CPU, scratch and disk requirements per operator per partition will be utilized by the job given a sample size of data. The output is presented in a graphical form that can also be generated in a HTML report.

Performance Analysis
The Performance Analysis tool is useful for identifying bottlenecks in a job. The graphical report visualizes the start and end time for each stage in the job, the physical amount of time needed for each stage to complete in each partition, the total CPU and system time for each stage, the memory utilization for each stage over the course of the job, and much more.

Job Compare
While it is a good to keep copies of different job implementations, we frequently lose track of the job changes. In our attempt to improve job design and performance, our experiments often lead us back to the older copy of the job. This is where the Job Compare feature of WebSphere DataStage comes in handy. Job Compare also works on other objects like routines, shared containers, and table definitions; whether they are in the same or in different repositories.

Job Locking
In WebSphere DataStage, jobs can be opened in read-only mode. Developers need not guess who has the job locked as this information is provided when opening jobs that are in use. Sessions can also be disconnected from the web console if there is a need such as restarting the DataStage server while there are existing connections. Unlocking a job

Alpine Consulting, Inc.

Page 3 of 5

Confidential Material

Whats New in DataStage version 8.x


from a disconnected session is very easily managed from the web console without having to follow multiple steps in the DataStage Administrator client, which is a major improvement over previous versions.

Job Monitoring
Time-based and record-based monitoring of jobs is available, both of which can adjust to system loads, thus eliminating an unnecessary refresh when the DataStage server is at its peak performance.

Other Key Improvements


One drawback of previous releases of DataStage EE was job startup. In version 8, this drawback is markedly reduced. Small parallel jobs will now run faster. Buffering and process combination is also improved to optimize job performance. Information Server web console has common logging and common reporting features that work across components. Both features provide easy searches using log or report date constraints, wildcard filters, saved queries and reports. The ODBC enterprise stage for WebSphere DataStage can now connect to more external data sources including IBM DB2, Informix, Oracle, Sybase, SQL Server, Teradata, MySQL, MS Access, and others. In Information Server, WebSphere QualityStage is tightly integrated with WebSphere DataStage and shares the same design canvas, which enables users to design jobs with data transformation stages and data quality stages in the same session. The new designer client is now called DataStage and QualityStage Designer, as are the Administrator and Director clients. In Information Server, the Manager client is gone and the functionalities are incorporated into DataStage and QualityStage Designer thereby saving developer time when importing and exporting jobs. Metadata analysis and report is also available inside DataStage and QualityStage Designer.

Conclusion
The look and feel of DataStage and QualityStage canvas remains the same but the new functionalities are major enhancements over the previous version. Data Connection Object, Parameter Set, Range Look-up and Slowly Changing Dimension are all designed to simplify design, help cut implementation effort and reduce cost. Advance Find provides a good way to do impact analysis, an importanat step in project management. Resource Estimation is as important for project planning. Meanwhile, Perfornamce Analysis tool is another useful feature that can be used throughout the lifecycle of a job. By knowing what causes a performance bottleneck, production support groups can better cope with the ever-shrinking batch windows. While Advance Find will not perform a Replace function and SQL Builder will not let us build complex SQL, all the changes in version 8 have positive impact on job development, production support and project management. Combined with the features offered in Information Server, existing customers who are looking to upgrade or new DataStage clients will benefit from the new enhancements.

Alpine Consulting, Inc.

Page 4 of 5

Confidential Material

Whats New in DataStage version 8.x

References IBM Information Server, Version 8, http://publib.boulder.ibm.com/infocenter

Author Luisa Garcia Senior Information Management Consultant, Alpine Consulting, Inc. For more information on DataStage, Information Server, or related products, please contact Alpine Consulting at 847.605.0788 or info@alpineinc.com.

Alpine Consulting, Inc.

Page 5 of 5

Confidential Material

You might also like