Extraction, Transformation & Load (ETL)
Batch processing of large volumes of data. Typical uses:
• Load a warehouse, mart, or analytical and reporting applications
• Application/Data Integration: load packaged applications or external systems through their APIs or interface databases
• Data Migration
Extraction – heterogeneous data sources:
• Relational & non-relational databases
• Sequential flat files, complex flat files, COBOL files, VSAM data, XML data, etc.
• Packaged applications (e.g. SAP, Siebel, etc.)
• Incremental/changed data or complete/snapshot data
• Internal data or third-party data
• Push/Pull
Cleansing & validation:
• Simple – range checks, duplicate checks, NULL value transforms, etc.
• Specialized/Complex – name & address validations, de-duplication, etc.
Transformation:
• Computations (arithmetic, string, date, etc.)
• Pivot
• Split or concatenate
• Aggregate
• Filter
• Join, look-up
Load:
• Historical vs. refresh load
• Incremental vs. snapshot
• Bulk loading vs. record-level loading
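To make the cleansing and load steps above concrete, here is a minimal Python sketch (conceptual only, not DataStage code; the row layout and field names are invented) showing a range check, a NULL value transform, a duplicate check, and a stand-in load:

raw_rows = [
    {"id": 1, "amount": 120.0, "state": "MA"},
    {"id": 1, "amount": 120.0, "state": "MA"},   # duplicate row
    {"id": 2, "amount": None,  "state": "il"},   # NULL value
    {"id": 3, "amount": -5.0,  "state": "CA"},   # out-of-range value
]

def cleanse(rows):
    """Simple cleansing: duplicate checks, NULL value transforms, range checks."""
    seen, clean = set(), []
    for r in rows:
        key = (r["id"], r["state"].upper())
        if key in seen:                      # duplicate check
            continue
        seen.add(key)
        amount = r["amount"] if r["amount"] is not None else 0.0  # NULL transform
        if amount < 0:                       # range check: drop negatives
            continue
        clean.append({"id": r["id"], "amount": amount, "state": r["state"].upper()})
    return clean

def load(rows):
    """Stand-in for a bulk or record-level load into a warehouse table."""
    for r in rows:
        print("LOAD", r)

load(cleanse(raw_rows))

In a real ETL tool these steps would be configured as stages in a data flow rather than hand-written.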
ETL Platform Options
– Database features including SQL, stored procedures, etc.: Oracle, Teradata, etc.
– Code-based custom scripts: PL/SQL, Pro*C, Cobol, etc.
– Engine-based products: IBM/Ascential DataStage, Ab Initio, Informatica PowerCenter, etc.
Usual features provided by ETL tools:
– Graphical data flow definition interfaces for easy development
– Native & ODBC connectivity to standard databases, packages, etc.
– Metadata maintenance components
– Metadata import & export from standard databases, packages, etc.
– Options for sharing or reusing developed components
– Facility to call external routines or write custom code for complex requirements
– Batch definition to handle dependencies between data flows to create the application
– ETL engines that handle the data manipulation without depending on the database engines
– Inbuilt standard functions & transformations – e.g. aggregate, sort, date, etc.
– Run-time support for monitoring the data flow and reading message logs
– Scheduling options
Architecture of a Typical ETL Tool (diagram):
• Source & target databases, exchanging data with the ETL engine
• ETL metadata repository, exchanging metadata with the engine and the development environment
• ETL engine
• GUI-based development environment: metadata definition/import/export, data flow & transformation definition, batch definition, test & debug, schedule
• Run-time environment: trigger ETL, monitor flow, view logs
Optional additional functions:
– Cleansing capability: name & address cleansing, de-duplication
– Data profiling
– Metadata management
– Run audit
– Pre-built templates
– Additional adaptors for interfacing with third-party products, models & protocols
DataStage Components
Server components: DataStage Server, Repository, DataStage Package Installer
Client components: DataStage Designer, DataStage Director, DataStage Administrator, DataStage Manager
DataStage Components (overview diagram):
• Server/Engine: executes jobs, reading from sources and writing to targets
• Repository: ETL metadata, maintained in an internal format
• Manager: manage the Repository, create custom routines & transforms, import & export component definitions
• Designer: assemble jobs, debug, compile jobs, execute jobs
• Director: execute jobs, monitor jobs, view job logs
DataStage Server Components
• DataStage Server: available for Win NT, 2000, Server 2003, IBM AIX, HP HP-UX, HP Compaq Tru64, Red Hat Enterprise Linux AS, Sun Solaris, etc. The server runs the executables, managing data. DataStage applications are organized into Projects; each server can handle multiple projects.
• Repository: contains all the metadata, mapping rules, etc. The DataStage repository is maintained in an internal format & not in the database.
• Package Installer
Note: DataStage uses OS-level security – only the root/admin user can administer the server.
DataStage Client Components
Windows-based components; they need to access the server at development time as well.
• Designer: to create DataStage ‘jobs’, which are compiled to create the executables
• Director: validate, run, schedule, and monitor jobs
• Manager: view and edit the contents of the Repository
• Administrator: setting up users, creating and moving projects, and setting up purging criteria
Designer, Director & Manager can connect to one Project at a time.
Most DataStage configuration tasks are carried out using the DataStage Administrator, a client program provided with DataStage. To access the DataStage Administrator:
1. From the Ascential DataStage program folder, choose DataStage Administrator.
2. Log on to the server. If you do so as an Administrator (for Windows NT servers) or as dsadm (for UNIX servers), you have unlimited administrative rights; otherwise your rights are restricted as described in the previous section.
3. The DataStage Administration window appears. The General page lets you set server-wide properties. The controls and buttons on this page are enabled only if you logged on as an administrator, and only when at least one project exists.
DataStage Manager
1. Primary interface to the DataStage Repository.
2. Used to store and manage re-usable metadata for the jobs.
3. Used to import and export components between the file system and DataStage projects.
4. Custom routines and transforms can also be created in the Manager.
The DataStage Director is the client component that validates, runs, schedules, and monitors jobs run by the DataStage Server. It is the starting point for most of the tasks a DataStage operator needs to do in respect of DataStage jobs.
DataStage Director window parts: Menu Bar, Toolbar, Job Category Pane, Display Area, Status Bar.
The display area is the main part of the DataStage Director window and appears in the right pane. There are three views:
• Job Status – The default view. It displays the status of all jobs in the category currently selected in the job category tree. If you hide the job category pane, it displays the status of all server jobs in the current project, regardless of their category, and the Job Status view includes a Category column.
• Job Schedule – Displays a summary of scheduled jobs and batches in the currently selected job category. If the job category pane is hidden, the display area shows all scheduled jobs and batches, regardless of their category.
• Job Log – Displays the log file for a job chosen from the Job Status view or the Job Schedule view.
DataStage Designer is used to:
• Create DataStage Jobs that are compiled into executable programs
• Design the jobs that extract, integrate, aggregate, load, and transform the data
• Create and reuse metadata and job components
It allows you to use familiar graphical point-and-click techniques to develop processes for extracting, cleansing, transforming, integrating and loading data.
Use Designer to:
• Specify how data is extracted
• Specify data transformations
• Decode data going into the target tables using reference lookups
• Aggregate data
• Split data into multiple outputs on the basis of defined constraints
The Designer graphical interface lets you select stage icons, drop them onto the Designer work area, and add links. Then, still working in the Designer, you define the required actions and processes for each stage and link. A job created with the Designer is easily scalable: you can create a simple job, get it working, then insert further processing, additional data sources, and so on.
1. Enter the name of your host in the Host system field. This is the name of the system where the DataStage server components are installed.
2. Enter your user name in the User name field. This is your user name on the server system.
3. Enter your password in the Password field.
4. Choose the project to connect to from the Project list. This list box displays all the projects installed on your DataStage server. At this point, you may only have one project installed on your system and this is displayed by default.
5. Select the Save settings check box to save your logon settings.
The DataStage Designer window consists of the following parts:
• One or more Job windows where you design your jobs
• A Toolbar from where you select Designer functions
• A Tool Palette from which you select job components
• A Debug Toolbar from where you select debug functions
• The Repository window where you view components in a project
• The Property Browser window where you view the properties of the selected job
• A Status Bar which displays one-line help for the window components, and information on the current state of job operations, for example compilation
For full information about the Designer window, including the functions of the pull-down and shortcut menus, refer to the DataStage Designer Guide.
STAGES IN DATASTAGE
FILE: SEQUENTIAL FILE, DATA SET
PROCESSING: TRANSFORMER, COPY, FILTER, SORTER, AGGREGATOR, FUNNEL, REMOVE DUPLICATE, JOIN, LOOKUP, MERGE, MODIFY
DATABASE: NETEZZA, TERADATA, ORACLE
Objectives
Learn how to create an Enterprise Edition Job that generates data, and take a look at some of that data:
• Create a Job
• Select and position stages
• Connect stages with links
• Import a schema
• Set stage options
• Save, Compile & Run a Job
• View and Delete Job Log
Stages used: Row Generator, Peek
Creating a New Job
To create a new job: select File > New and select Parallel, OR click the New Program icon on the toolbar.
Create the following flow:
• Select the Row Generator stage
• Drag it onto the Parallel Canvas and drop it
• Select the Peek stage
• Drag it onto the Parallel Canvas and drop it
• Right-click on the Row Generator stage and drag a Link onto Peek
Does your flow look like the one above?
Importing a Schema
1. Select to import a schema
2. Enter the appropriate path and file name (Instructor will provide details)
3. You can also use the File Browser
Importing a Schema
Make sure you put it into the right category – it should reflect your userid. Click on Next/Import/Finish to import.
Importing a Schema – End Goal
Did everything go smoothly? After clicking on Finish, select the imported schema. This lets you select which columns you want to bring in. We want all of the columns! Click OK.
Column Properties
Row Generator specific options: here you can select specific properties for the data you are going to generate. Double-click to access additional options. Click on Next> to step through column properties… Click on Close when done.
Final Touches
Your job should look like this… however, the ‘eye’ should not wink. Notice the new icon on the link, indicating the presence of metadata.
Next:
• Click on the Compile icon
• Save the job (Lab2) under your own Category
• Did it compile successfully?
Ready to Run – Action: click to Run; click for the Log.
Running Job – after you select Run, click for the Log.
Clearing the Job Log
Tips:
• Clear away unnecessary Job Logs
• Use the <Copy> button, then paste as text into any editor of your choice
Objectives
Learn how to modify the simple data-generating program to sort the data and save it:
• Create a copy of a Job
• Edit an existing Job
• Create a Dataset
• Handle Errors
• View a Dataset
New stage used: Sort
Create a Copy of a Job
If necessary, open the Job created in Lab 2.
1. Access stage properties for the Peek stage
2. Select the Input tab
3. Override the default Partition type from (Auto) to Hash, and click to specify Sort Insertion
Next: click OK. What happens?
Insert a Sort
Let’s sort by birth_date:
• Select the birth_date column from the Available list
• Once selected, you should see birth_date listed under Selected
Food for Thought: Why the Hash partitioning type?
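As a rough answer to the food-for-thought question, the following Python sketch (illustrative only, not DataStage code; only the birth_date column comes from the lab) shows why Hash partitioning pairs naturally with an inserted sort: all rows with the same key value land in the same partition, so each partition can be sorted independently:

from datetime import date

rows = [
    {"name": "John Parker",    "birth_date": date(1979, 4, 24)},
    {"name": "Susan Calvin",   "birth_date": date(1967, 12, 24)},
    {"name": "Ann Claybourne", "birth_date": date(1960, 10, 29)},
    {"name": "Jane Studdock",  "birth_date": date(1962, 2, 24)},
]

NUM_PARTITIONS = 2
partitions = [[] for _ in range(NUM_PARTITIONS)]

for row in rows:
    # Hash partitioner: the same birth_date always lands in the same partition.
    p = hash(row["birth_date"]) % NUM_PARTITIONS
    partitions[p].append(row)

for p, part in enumerate(partitions):
    part.sort(key=lambda r: r["birth_date"])   # the inserted sort runs per partition
    for row in part:
        print(f"partition {p}:", row["birth_date"], row["name"])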
Sort Insertion
Note the new icon (A→Z) that appears on the link, denoting the presence of a sort. Select Save As… from the File menu and save the Job (Lab3). Choose one of these to compile and run your job. Are your results sorted on birth_date?
Let’s Stage the Data
We’ll now save the output of the sort for later use. Attach a Dataset stage to the program by:
• Placing a Dataset stage on the Canvas
• Right-clicking on Peek and drawing a Link over to the Dataset stage
Your Job should now look like this:
Viewing a Dataset
Right-click on the Dataset stage and select View DSLinkX data (Note: link names may vary). Click OK to bring up the Data Browser.
Objectives
Use the Lookup stage to replace state codes with state names.
Learn how to:
• Use the Lookup operator
• Start thinking about partitioning
New operators used: Lookup, Entire partitioner
Remember the Records in Lab 2?
They look like this:
John Parker M 1979-04-24 MA 0 1 0 0
Susan Calvin F 1967-12-24 IL 0 1 1 1
William Mandella M 1962-04-07 CA 0 1 2 2
Ann Claybourne F 1960-10-29 FL 0 1 3 3
Frank Chalmers M 1969-12-10 NY 0 1 4 4
Jane Studdock F 1962-02-24 TX 0 1 5 5
One of the fields is a two-character state code. Let’s expand it out into a full state name.
The State Table
We have a table that maps state codes to state names:
Alabama	AL
Alaska	AK
American Samoa	AS
Arizona	AZ
Arkansas	AR
California	CA
Colorado	CO
Connecticut	CT
Delaware	DE
District of Columbia	DC
[…]
It is a Unix text file with a tab after the state full name. We imported this file in Lab 5a. We’ll use that table to tack on the expanded state name to the rows generated in Lab 2.
What We’re Going To Build…
• Uses the states...txt file as the lookup table
• Has a TAB delimiter between the state_name & state columns
• Use the state column as the lookup key
• Note that the source data has a column called state while the lookup table has state_code
• Use the same schema as Lab 2
• Generate 100 rows
Reminder: Don’t forget to perform column mapping (see next slide)
Lookup Mapping
What You Should See…
Sample Peek output: each record now carries the expanded state name as its final column. Make sure the state names match the state codes, for example:
Peek.0: John Parker M 1979-04-24 0087228.46 MA 0 1 0 0 Massachusetts
Peek.0: John Calvin M 1961-11-30 0025966.55 OH 0 1 12 12 Ohio
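For reference, a minimal Python sketch of what the Lookup stage does in this lab: append the full state name by keying into the state table on the two-character code. The dictionary stands in for the tab-delimited states file, and the sample rows are taken from Lab 2 (this is plain Python, not DataStage):

state_table = {          # would normally be read from the tab-delimited states file
    "MA": "Massachusetts",
    "IL": "Illinois",
    "CA": "California",
    "FL": "Florida",
}

source_rows = [
    ("John Parker",  "M", "1979-04-24", "MA"),
    ("Susan Calvin", "F", "1967-12-24", "IL"),
]

for name, gender, birth_date, state in source_rows:
    state_name = state_table.get(state)    # lookup on the state key
    if state_name is None:
        continue                           # unmatched rows could go to a reject link
    print(name, gender, birth_date, state, state_name)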
Objectives
Learn how to:
• Use the Join stage to find out which products the customer purchased
• Use an InnerJoin
New stages used: Join, Remdup, Hash partitioner
Background - What We Have
Customers of ACME Hardware place orders for products. We have two simple tables to model this:
• customer_order table: tells us which orders were placed by each customer (columns: customer, order; for example customers 1–4 placing orders 1001, 1116, 1147, 1032, 1161, 1106, 1132, 1195, 1007, 1021, 1072, 1139)
• order_product table: tells us how many of each product are in an order (columns: order, product, quantity; for example orders 1001 and 1002 containing screws, nuts, bolts, nails and washers in quantities such as 137, 200, 145, …)
Note the data types involved. Use Integer and Varchar types where appropriate when defining the table definitions.
Background - What We Want
Q: Which products have been ordered by each customer? A: Customer 1 has ordered washers, bolts, screws and …
Go ahead and assemble this flow, but do so in a more optimized manner – see next slide (save as Lab8a and a copy for Lab 9).
• Use cust_order.txt & order_prod_plus.txt as input files
• See the previous slide for file layouts – whitespace-delimited fields
• Note: column ordering matters!
• Make sure you get the column data types correct also
These two jobs are equivalent!
Notice the different partitioner/collector icons. Can you visually determine how the data is being handled?
We want to join the tables using the order number as the join key. This means the tables need to be hashed and sorted on the order number. Make sure you use 'order_prod_plus.txt' as your input. The resulting table will have one record for each row of the order_product table, with the customer field added. If we then sort these records on customer and product and remove duplicated customer/product combinations, we have our answer… 234 records should be written out.
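The same join-and-dedup logic, expressed as a short Python sketch (conceptual only; the tiny in-memory tables stand in for cust_order.txt and order_prod_plus.txt):

customer_order = [          # (customer, order)
    (1, 1001), (2, 1002),
]
order_product = [           # (order, product, quantity)
    (1001, "screws", 137), (1001, "nuts", 200),
    (1002, "screws", 253), (1002, "nuts", 197),
]

# Inner join on the order number (conceptually: hash/sort both inputs on order).
order_to_customer = {order: cust for cust, order in customer_order}
joined = [
    (order_to_customer[order], product, quantity)
    for order, product, quantity in order_product
    if order in order_to_customer
]

# Sort on customer/product and remove duplicated customer/product combinations.
joined.sort(key=lambda r: (r[0], r[1]))
seen, result = set(), []
for customer, product, quantity in joined:
    if (customer, product) not in seen:
        seen.add((customer, product))
        result.append((customer, product))

for customer, product in result:
    print(customer, product)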
Using Lookup and Merge
This is what your flows should look like.
• Using Lookup: note the "PhantomOrders" links leading to "Customerless" files
This is what your flows should look like.
• Using Merge:
Order Matters!
Remember:
• Lookup captures unmatched Source rows on the Primary link
• Merge captures unmatched Update rows on the Secondary link(s)
Tip:
• Always check the Link Ordering tab on the Stage page
New Results from Lookup and Merge
• Outputs: Lookup and Merge should yield outputs with 234 rows, just as InnerJoin
• Rejects: Lookup and Merge populate the "Customerless" file with the following two rows:
"1000","gaskets","28"
"1000","widgets","14"
You caught ACME Hardware red-handed: they tried to boost their stock by reporting a phantom order of 28 gaskets and 14 widgets!
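A minimal sketch of how those two rows end up on the reject link: the lookup against customer_order fails for order 1000, so the corresponding order_product rows are routed to the "Customerless" output instead of the main output (plain Python, not the actual Lookup or Merge stages):

customer_order = {1001: 1, 1002: 2}                 # order -> customer
order_product = [
    (1001, "screws", 137),
    (1000, "gaskets", 28),                          # no matching customer
    (1000, "widgets", 14),                          # no matching customer
]

matched, rejects = [], []
for order, product, qty in order_product:
    if order in customer_order:
        matched.append((customer_order[order], order, product, qty))
    else:
        rejects.append((order, product, qty))       # the "Customerless" rows

print("matched:", matched)
print("rejects:", rejects)                          # the two phantom-order rows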
Objectives
Use the Aggregator stage to see how many of each product each customer has on order.
New stage used: Aggregator
The Aggregator stage is a processing stage. It classifies data rows from a single input link into groups and computes totals or other aggregate functions for each group.
Our InnerJoin Job Was A Bit Incomplete…
We did almost enough work in the InnerJoin lab (Lab 8a) to find out how many of each product each customer has on order. Now that we know about the Aggregator stage, we can finish the job. Go back to the version of Lab 8a:
• Remove the implicit Remdup (sort unique)
• Insert an Aggregator
Aggregator Options
• Method: sort (we could have a lot of customer/product groups…)
• Grouping keys: customer and product
• Column for Calculation: quantity
• Function to apply: Sum
• Output column name of result column: quantity
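Conceptually, the Aggregator is doing a group-by-and-sum, as in this small Python sketch (the rows are invented; the grouping keys and summed column match the options above):

from collections import defaultdict

joined_rows = [                      # (customer, product, quantity)
    (1, "screws", 137), (1, "screws", 63), (1, "nuts", 200),
    (2, "screws", 253),
]

totals = defaultdict(int)
for customer, product, quantity in joined_rows:
    totals[(customer, product)] += quantity        # Sum over the quantity column

for (customer, product), quantity in sorted(totals.items()):
    print(customer, product, quantity)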
What You Should Have…
Your Job should look like this. Compile and Run your Job.
Modify Stage
The Modify stage is a processing stage. It can have a single input link and a single output link. The Modify stage alters the record schema of its input data set: you can drop or keep columns from the schema, or change the type of a column. The modified data set is then output.
Dropping and Keeping Columns
The following example takes a data set comprising the columns CUSTID, NAME, ADDRESS, CITY, STATE, ZIP, AREA, PHONE, REPID, CREDITLIMIT, and COMMENTS.
The Modify stage is used to drop the REPID, CREDITLIMIT, and COMMENTS columns. To do this, the stage properties are set as follows:
DROP REPID, CREDITLIMIT, COMMENTS
You could achieve the same effect by specifying which columns to keep, rather than which ones to drop. In the case of this example the required specification to use in the stage properties would be:
KEEP CUSTID, NAME, ADDRESS, CITY, STATE, ZIP, AREA, PHONE
Changing Data Type
You could also change the data types of one or more of the columns from the above example. Say you wanted to convert the CUSTID from decimal to string: you would specify a new column to take the converted data, and specify the conversion in the stage properties:
conv_CUSTID:string = string_from_decimal(CUSTID)
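The three Modify operations above (drop, keep, type conversion) amount to the following, shown here as a rough Python sketch over a dict standing in for one record (this is ordinary Python with made-up values, not Modify specification syntax):

record = {
    "CUSTID": 100.0, "NAME": "Smith", "ADDRESS": "1 Main St", "CITY": "Boston",
    "STATE": "MA", "ZIP": "02101", "AREA": "617", "PHONE": "555-0100",
    "REPID": 7, "CREDITLIMIT": 5000, "COMMENTS": "n/a",
}

# DROP REPID, CREDITLIMIT, COMMENTS
dropped = {k: v for k, v in record.items()
           if k not in {"REPID", "CREDITLIMIT", "COMMENTS"}}

# KEEP CUSTID, NAME, ADDRESS, CITY, STATE, ZIP, AREA, PHONE (same effect)
kept = {k: record[k] for k in
        ["CUSTID", "NAME", "ADDRESS", "CITY", "STATE", "ZIP", "AREA", "PHONE"]}

# conv_CUSTID:string = string_from_decimal(CUSTID) -- rough stand-in conversion
converted = dict(record, conv_CUSTID=str(record["CUSTID"]))

print(dropped == kept)            # True: dropping and keeping give the same schema
print(converted["conv_CUSTID"])   # CUSTID now also available as a string column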
Copy Stage
The Copy stage is a processing stage. It can have a single input link and any number of output links. The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop or change the order of columns.
The Copy stage properties are fairly simple. The only property is Force, and we do not need to set it in this instance as we are copying to multiple data sets (and DataStage will not attempt to optimize it out of the job). We need to concentrate on telling DataStage which columns to drop on each output link. The easiest way to do this is using the Outputs page Mapping tab. When you open this for a link, the left pane shows the input columns; simply drag the columns you want to preserve across to the right pane. We repeat this for each link as follows:
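As a rough illustration of that per-link mapping, here is a minimal Python sketch in which every input record is copied to every output and each output keeps only the columns mapped to it (the output names and column choices are invented, and this is not DataStage code):

input_rows = [
    {"CUSTID": 1, "NAME": "Smith", "STATE": "MA", "PHONE": "555-0100"},
    {"CUSTID": 2, "NAME": "Jones", "STATE": "IL", "PHONE": "555-0200"},
]

# Per-output column mappings, like the Mapping tab settings on each output link.
output_mappings = {
    "out_contact": ["CUSTID", "NAME", "PHONE"],
    "out_geo":     ["CUSTID", "STATE"],
}

outputs = {name: [] for name in output_mappings}
for row in input_rows:                           # every record goes to every output
    for name, columns in output_mappings.items():
        outputs[name].append({c: row[c] for c in columns})

for name, rows in outputs.items():
    print(name, rows)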
The Funnel stage is a processing stage. It copies multiple input data sets to a single output data set. This operation is useful for combining separate data sets into a single large data set. The stage can have any number of input links and a single output link.
The continuous funnel method is selected on the Stage page Properties tab of the Funnel stage:
The continuous funnel method does not attempt to impose any order on the data it is processing. It simply writes rows as they become available on the input links. In our example the stage has written a row from each input link in turn, and the final, funneled data reflects that interleaving.
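A small Python sketch of the continuous funnel behaviour, purely illustrative: rows are written to the single output as they become available on the input links, here simulated by taking one row from each input in turn (the input data is made up):

input_links = [
    [{"src": "A", "n": 1}, {"src": "A", "n": 2}],
    [{"src": "B", "n": 1}, {"src": "B", "n": 2}],
    [{"src": "C", "n": 1}],
]

output = []
iterators = [iter(link) for link in input_links]
while iterators:
    still_open = []
    for it in iterators:                 # take the next available row from each link
        row = next(it, None)
        if row is not None:
            output.append(row)
            still_open.append(it)
    iterators = still_open

for row in output:
    print(row)                           # rows from A, B and C interleaved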
Filter Stage
The Filter stage is a processing stage. It can have a single input link, any number of output links and, optionally, a single reject link. The Filter stage transfers, unmodified, the records of the input data set which satisfy the specified requirements and filters out all other records. You can specify different requirements to route rows down different output links. The filtered-out records can be routed to a reject link.
Specifying the Filter
The operation of the Filter stage is governed by the expressions you set in the Where property on the Properties tab. The Where property supports standard SQL expressions, except when comparing strings. You can use the following elements to specify the expressions:
• Input columns
• Requirements involving the contents of the input columns
• Optional constants to be used in comparisons
• The Boolean operators AND and OR to combine requirements
When a record meets the requirements, it is written unchanged to the specified output link.
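A rough Python sketch of this routing behaviour (the column names and Where-style predicates are invented, and this is not DataStage expression syntax): rows that satisfy a requirement are written unchanged to the corresponding output link, and everything else goes to the reject link:

rows = [
    {"CUSTID": 1, "STATE": "MA", "CREDITLIMIT": 5000},
    {"CUSTID": 2, "STATE": "IL", "CREDITLIMIT": 900},
    {"CUSTID": 3, "STATE": "CA", "CREDITLIMIT": 50},
]

# Ordered (predicate, output link) pairs, like multiple Where clauses.
where_clauses = [
    (lambda r: r["STATE"] == "MA" and r["CREDITLIMIT"] > 1000, "out_high_ma"),
    (lambda r: r["CREDITLIMIT"] > 500,                         "out_medium"),
]

outputs = {name: [] for _, name in where_clauses}
rejects = []

for row in rows:
    for predicate, link in where_clauses:
        if predicate(row):
            outputs[link].append(row)     # written unchanged to that output link
            break
    else:
        rejects.append(row)               # filtered-out rows go to the reject link

print(outputs)
print(rejects)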