Extraction, Transformation & Load (ETL)
Batch processing of large volumes of data. Typical uses:
• Load a warehouse, mart, or analytical and reporting applications
• Application/data integration: load packaged applications or external systems through their APIs or interface databases
• Data migration
Heterogeneous data sources: relational & non-relational databases; sequential flat files, complex flat files, COBOL files, VSAM data, XML data, etc.; packaged applications (e.g. SAP, Siebel, etc.). Incremental/changed data or complete/snapshot data; internal data or third-party data; push/pull.
Cleansing & validation:
• Simple – range checks, duplicate checks, NULL value transforms, etc.
• Specialized/Complex – name & address validations, de-duplication, etc.
Transformation: computations (arithmetic, string, date, etc.); pivot; split or concatenate; aggregate; filter; join, look-up.
Load: historical vs. refresh load; incremental vs. snapshot; bulk loading vs. record-level loading. (A minimal sketch of an extract-transform-load flow follows below.)
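To make the extract-transform-load steps concrete, here is a minimal Python sketch of a batch ETL flow. The file name, column names, and the SQLite target are illustrative assumptions only, not part of any particular ETL tool:

    import csv
    import sqlite3

    # Extract: read a delimited flat file (one of many possible heterogeneous sources).
    def extract(path):
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    # Transform: simple cleansing/validation (NULL and range checks) plus an arithmetic computation.
    def transform(rows):
        out = []
        for r in rows:
            amount = r.get("amount", "").strip()
            if not amount:                          # NULL/blank check
                continue
            r["amount"] = round(float(amount), 2)   # arithmetic computation
            if r["amount"] < 0:                     # range check
                continue
            out.append(r)
        return out

    # Load: bulk insert into a target table (a stand-in for a warehouse or mart load).
    def load(rows, db="warehouse.db"):
        con = sqlite3.connect(db)
        con.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
        con.executemany("INSERT INTO sales VALUES (?, ?)",
                        [(r["customer"], r["amount"]) for r in rows])
        con.commit()
        con.close()

    load(transform(extract("sales.csv")))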
• ETL Platform Options
  – Database features including SQL, stored procedures, etc.: Oracle, Teradata, etc.
  – Engine-based products: IBM/Ascential DataStage, Informatica PowerCenter, Ab Initio, etc.
  – Code-based custom scripts: PL/SQL, Pro*C, COBOL, etc.
• Usual features provided by ETL tools:
  – Graphical data flow definition interfaces for easy development
  – Native & ODBC connectivity to standard databases, packages, etc.
  – Metadata maintenance components
  – Metadata import & export from standard databases, packages, etc.
  – Inbuilt standard functions & transformations – e.g. date, sort, aggregate, etc.
  – Options for sharing or reusing developed components
  – Facility to call external routines or write custom code for complex requirements
  – Batch definition to handle dependencies between data flows to create the application
  – ETL engines that handle the data manipulation without depending on the database engines
  – Run-time support for monitoring the data flow and reading message logs
  – Scheduling options
Architecture of a Typical ETL Tool
• ETL Engine: moves data between the source & target databases
• ETL Metadata Repository: holds the metadata used by the engine
• GUI-Based Development Environment: metadata definition/import/export; data flow & transformation definition; batch definition; test & debug; schedule
• Run-time Environment: trigger ETL; monitor flow; view logs
• Optional additional functions:
  – Cleansing capability: name & address cleansing, de-duplication
  – Data profiling
  – Metadata management
  – Run audit
  – Pre-built templates
  – Additional adaptors for interfacing with third-party products, models & protocols
Server Components: DataStage Server, Repository, DataStage Package Installer
Client Components: DataStage Designer, DataStage Director, DataStage Administrator, DataStage Manager
DataStage Components
• Server/Engine: reads from the sources and writes to the targets
• Repository: ETL metadata, maintained in an internal format
• Manager: manage the Repository; create custom routines & transforms; import & export component definitions
• Designer: assemble jobs; debug; compile jobs
• Director: execute jobs; monitor jobs; view job logs
DataStage Server Components
• DataStage Server: runs the executables, managing data. Available for Windows NT, 2000, Server 2003, Sun Solaris, IBM AIX, HP HP-UX, HP Compaq Tru64, Red Hat Enterprise Linux AS, etc.
• Repository: contains all the metadata, mapping rules, etc. DataStage applications are organized into Projects; each server can handle multiple projects. The DataStage repository is maintained in an internal format & not in the database.
• Package Installer
Note: DataStage uses OS-level security; only a root/admin user can administer the server.
DataStage Client Components
• Windows-based components; they need to access the server at development time as well
• Designer: used to create DataStage 'jobs', which are compiled to create the executables
• Director: validate, schedule, run, and monitor jobs
• Manager: view and edit the contents of the Repository
• Administrator: setting up users, creating and moving projects, and setting up purging criteria
• Designer, Director & Manager can connect to one Project at a time
The DataStage Administrator is a client program provided with DataStage. Most DataStage configuration tasks are carried out using the DataStage Administrator. To access the DataStage Administrator:
1. From the Ascential DataStage program folder, choose DataStage Administrator.
2. Log on to the server. If you do so as an Administrator (for Windows NT servers) or as dsadm (for UNIX servers), you have unlimited administrative rights; otherwise your rights are restricted as described in the previous section.
3. The DataStage Administration window appears. The General page lets you set server-wide properties. The controls and buttons on this page are enabled only if you logged on as an administrator. It is enabled only when at least one project exists.
DataStage Manager
1. Primary interface to the DataStage Repository.
2. Used to store and manage re-usable metadata for the jobs.
3. Used to import and export components from the file system to DataStage projects.
4. Custom routines and transforms can also be created in the Manager.
The DataStage Director is the client component that validates, runs, schedules and monitors jobs run by the DataStage Server. It is the starting point for most of the tasks a DataStage operator needs to do in respect of DataStage jobs.
DataStage Director window: Menu Bar, Toolbar, Job Category Pane, Display Area, Status Bar.
The display area is the main part of the DataStage Director window and appears in its right pane. There are three views:
Job Status – the default view. It displays the status of all jobs in the category currently selected in the job category tree. If you hide the job category pane, the Job Status view includes a Category column and displays the status of all server jobs in the current project, regardless of their category.
Job Schedule – displays a summary of scheduled jobs and batches in the currently selected job category. If the job category pane is hidden, the display area shows all scheduled jobs and batches, regardless of their category.
Job Log – displays the log file for a job chosen from the Job Status view or the Job Schedule view.
DataStage Designer is used to:
• Create DataStage jobs that are compiled into executable programs
• Design the jobs that extract, integrate, aggregate, load, and transform the data
• Create and reuse metadata and job components
It allows you to use familiar graphical point-and-click techniques to develop processes for extracting, cleansing, transforming, integrating and loading data.
Use Designer to:
• Specify how data is extracted
• Specify data transformations
• Decode data going into the target tables using reference lookups
• Aggregate data
• Split data into multiple outputs on the basis of defined constraints
The Designer graphical interface lets you select stage icons, drop them onto the Designer work area, and add links. Then, still working in the Designer, you define the required actions and processes for each stage and link. A job created with the Designer is easily scalable. This means that you can easily create a simple job, get it working, then insert further processing, additional data sources, and so on.
1. Enter the name of your host in the Host system field. This is the name of the system where the DataStage server components are installed.
2. Enter your user name in the User name field. This is your user name on the server system.
3. Enter your password in the Password field.
4. Choose the project to connect to from the Project list. This list box displays all the projects installed on your DataStage server. At this point, you may only have one project installed on your system and this is displayed by default.
5. Select the Save settings check box to save your logon settings.
The DataStage Designer window consists of the following parts:
• One or more Job windows where you design your jobs
• The Repository window where you view components in a project
• The Property Browser window where you view the properties of the selected job
• A Toolbar from where you select Designer functions
• A Tool Palette from which you select job components
• A Debug Toolbar from where you select debug functions
• A Status Bar which displays one-line help for the window components and information on the current state of job operations, for example, compilation
For full information about the Designer window, including the functions of the pull-down and shortcut menus, refer to the DataStage Designer Guide.
STAGES IN DATASTAGE
FILE: SEQUENTIAL FILE, DATA SET
PROCESSING: TRANSFORMER, COPY, FILTER, SORTER, AGGREGATOR, FUNNEL, REMOVE DUPLICATE, JOIN, LOOK UP, MERGE, MODIFY
DATABASE: NETEZZA, TERADATA, ORACLE
Objectives
Create an Enterprise Edition job that generates data, and take a look at some of that data.
Learn how to:
• Create a job
• Select and position stages
• Connect stages with links
• Import a schema
• Set stage options
• Save, compile & run a job
• View and delete the job log
Stages used: Row Generator, Peek
Creating a New Job
To create a new job: select File > New and select Parallel, OR click the New Program icon on the toolbar.
Create the following flow:
• Select the Row Generator stage
• Drag it onto the Parallel Canvas and drop it
• Select the Peek stage
• Drag it onto the Parallel Canvas and drop it
• Right-click on the Row Generator stage and drag a link onto Peek
Does your flow look like the one above?
Importing a Schema
1. Select to import a schema
2. Enter the appropriate path and file name (instructor will provide details)
3. You can also use the File Browser
Importing a Schema
Make sure you put it into the right category; it should reflect your userid.
Click on Next/Import/Finish to import.
Importing a Schema: End Goal
Did everything go smoothly? After clicking on Finish, select the imported schema. This lets you select which columns you want to bring in. We want all of the columns! Click OK.
Column Properties
Row Generator-specific options: here you can select specific properties for the data you are going to generate.
• Double-click to access additional options
• Click on Next> to step through the column properties
• Click on Close when done
Final Touches
Your job should look like this. Notice the new icon on the link, indicating the presence of metadata. However, the 'eye' should not wink.
Next:
• Click on the Compile icon
• Save the job (Lab2) under your own category
• Did it compile successfully?
Ready to Run
• Click to run
• Click for the log
Running the Job
After you click, select Run. Click for the log.
Clearing the Job Log
Tips:
• Clear away unnecessary job logs
• Use the <Copy> button and paste as text into any editor of your choice
Objectives
Modify the simple data-generating program to sort the data and save it.
Learn how to:
• Create a copy of a job
• Edit an existing job
• Create a dataset
• Handle errors
• View a dataset
New stage used: Sort
Create a Copy of a Job
If necessary, open the job created in Lab 2.
1. Access the stage properties for the Peek stage
2. Select the Input tab
3. Override the default Partition type: from (Auto) to Hash
Click here to specify Sort Insertion.
Next: click OK. What happens?
Insert a Sort
Let's sort by birth_date:
• Select the birth_date column from the Available list
• Once selected, you should see birth_date listed under Selected
Food for thought: why the Hash partitioning type? (A rough sketch follows below.)
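As a rough answer to that question (an illustrative Python sketch, not DataStage internals, with invented sample rows): hash partitioning on a key sends every row with the same key value to the same partition, so each partition can then be sorted independently and rows that share a key never end up split across partitions.

    from zlib import crc32

    rows = [
        {"name": "John Parker", "birth_date": "1979-04-24"},
        {"name": "Susan Calvin", "birth_date": "1967-12-24"},
        {"name": "Ann Claybourne", "birth_date": "1960-10-29"},
        {"name": "Jane Studdock", "birth_date": "1962-02-24"},
    ]
    NUM_PARTITIONS = 2

    # Hash partition on the key: equal key values always map to the same partition.
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for r in rows:
        p = crc32(r["birth_date"].encode()) % NUM_PARTITIONS
        partitions[p].append(r)

    # Sort each partition independently on birth_date.
    for part in partitions:
        part.sort(key=lambda r: r["birth_date"])

    for i, part in enumerate(partitions):
        print("partition", i, [r["birth_date"] for r in part])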
Sort Insertion
Note the new icon that appears on the link, denoting the presence of a sort.
Select Save As… from the File menu and save the job (Lab3). Choose one of these to compile and run your job.
Are your results sorted on birth_date?
Let's Stage the Data
We'll now save the output of the sort for later use. Attach a Dataset stage to the job by:
• Placing a Dataset stage on the canvas
• Right-clicking on Peek and drawing a link over to the Dataset stage
Your job should now look like this:
Viewing a Dataset
Right-click on the Dataset stage and select View DSLinkX data (note: link names may vary). Click OK to bring up the Data Browser.
Objectives
Use the Lookup stage to replace state codes with state names.
Learn how to:
• Use the Lookup operator
• Start thinking about partitioning
New operators used: Lookup, Entire partitioner
Remember the Records in Lab 2?
They look like this:
John Parker M 1979-04-24 MA 0 1 0 0
Susan Calvin F 1967-12-24 IL 0 1 1 1
William Mandella M 1962-04-07 CA 0 1 2 2
Ann Claybourne F 1960-10-29 FL 0 1 3 3
Frank Chalmers M 1969-12-10 NY 0 1 4 4
Jane Studdock F 1962-02-24 TX 0 1 5 5
One of the fields is a two-character state code. Let's expand it out into a full state name.
The State Table
We have a table that maps state codes to state names:
Alabama	AL
Alaska	AK
American Samoa	AS
Arizona	AZ
Arkansas	AR
California	CA
Colorado	CO
Connecticut	CT
Delaware	DE
District of Columbia	DC
[…]
It is a Unix text file with a tab after the state full name. We imported this file in Lab 5a. We'll use that table to tack on the expanded state name to the rows generated in Lab 2.
What We're Going To Build
• Uses the states.txt file as the lookup table
• Has a TAB delimiter between the state_name & state columns
• Uses the state column as the lookup key
• Note that the source data has a column called state while the lookup table has state_code
• Uses the same schema as Lab 2
• Generates 100 rows
Reminder: don't forget to perform the column mapping (see next slide)
Lookup Mapping
What You Should See
Sample output from Peek: each generated row now ends with the expanded state name (Massachusetts, New York, Ohio, Colorado, Florida, Kentucky, Michigan, California, New Jersey, South Dakota, …). Make sure the state names match the state codes.
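Conceptually, the lookup amounts to the following Python sketch (a hypothetical re-creation: the two source rows are invented, and states.txt is assumed to be the tab-delimited file described earlier):

    # Build the lookup table from the tab-delimited states file: "Full Name<TAB>CODE".
    state_names = {}
    with open("states.txt") as f:
        for line in f:
            name, code = line.rstrip("\n").split("\t")
            state_names[code] = name

    # Source rows carry a two-character state code in the "state" column.
    source_rows = [
        {"name": "John Parker", "state": "MA"},
        {"name": "Susan Calvin", "state": "IL"},
    ]

    # Lookup: the state code is the key; append the expanded state name to each row.
    for row in source_rows:
        row["state_name"] = state_names[row["state"]]
        print(row)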
Objectives
Use the Join stage to find out which products the customer purchased.
Learn how to: use an InnerJoin.
New stages used: Join, Remdup, Hash partitioner
Background - What We Have
Customers of ACME Hardware place orders for products. We have two simple tables to model this.

customer_order table: tells us which orders were placed by each customer.
customer  order
1         1001
1         1116
1         1147
2         1032
2         1161
3         1106
3         1132
3         1195
4         1007
4         1021
4         1072
4         1139

order_product table: tells us how many of each product are in an order.
order  product  quantity
1001   screws   137
1001   nuts     200
1001   bolts    145
1001   nails    159
1001   nuts     197
1002   screws   253
1002   washers  330
1002   bolts    527
1002   nails    370
1002   nuts     162
…      screws   135
…      washers  351

Note the data types involved. Use Integer and Varchar types where appropriate when defining the table definitions.
Background - What We Want
Q: Which products have been ordered by each customer? A: Customer 1 has ordered washers, bolts, screws and …
Go ahead and assemble this flow, but do so in a more optimized manner – see next slide (save as Lab8a and a copy for Lab 9).
• Use cust_order.txt & order_prod_plus.txt as input files
• See previous slide for the file layouts – whitespace-delimited fields
• Note: column ordering matters!
• Make sure you get the column data types correct also
These two jobs are equivalent!
Notice the different partitioner/collector icons. Can you visually determine how the data is being handled?
We want to join the tables using the order number as the join key. This means the tables need to be hashed and sorted on the order number. The resulting table will have one record for each row of the order_product table, with the customer field added. If we then sort these records on customer and product and remove duplicated customer/product combinations, we have our answer: 234 records should be written out. Make sure you use 'order_prod_plus.txt' as your input. (A rough sketch of this join-sort-dedup logic follows below.)
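As a rough Python sketch of that logic (the handful of rows below are invented for illustration and are not the lab's data files):

    customer_order = [          # (customer, order)
        (1, 1001), (1, 1116), (2, 1002),
    ]
    order_product = [           # (order, product, quantity)
        (1001, "screws", 137), (1001, "nuts", 200),
        (1002, "screws", 253), (1116, "nuts", 35),
    ]

    # Inner join on the order number: only orders present in both tables survive.
    customers_by_order = {}
    for customer, order in customer_order:
        customers_by_order.setdefault(order, []).append(customer)

    joined = []
    for order, product, qty in order_product:
        for customer in customers_by_order.get(order, []):
            joined.append((customer, product))   # one record per order_product row, customer added

    # Sort on customer and product, then remove duplicate customer/product combinations.
    joined.sort()
    answer = []
    for rec in joined:
        if not answer or answer[-1] != rec:
            answer.append(rec)

    print(answer)   # each customer/product combination appears exactly once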
Using Lookup and Merge
This is what your flows should look like.
• Using Lookup: note the "PhantomOrders" links leading to the "Customerless" files
• Using Merge: this is what your flow should look like.
Order Matters!
Remember:
• Lookup captures unmatched Source rows on the Primary link
• Merge captures unmatched Update rows on the Secondary link(s)
Tip: always check the Link Ordering tab on the Stage page.
New Results from Lookup and Merge
• Outputs: Lookup and Merge should yield outputs with 234 rows, just as InnerJoin does
• Rejects: Lookup and Merge populate the "Customerless" file with the following two rows:
"1000", "gaskets", "28"
"1000", "widgets", "14"
You caught ACME Hardware red-handed: they tried to boost their stock by reporting a phantom order of 28 gaskets and 14 widgets!
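The reject behaviour can be sketched like this in Python (miniature, invented data): with customer_order as the reference/master and order_product as the source/update rows, any order number with no matching customer lands in the "Customerless" output. In DataStage the distinction between the two stages is which link the unmatched rows come from: Lookup rejects unmatched primary (source) rows, while Merge rejects unmatched secondary (update) rows.

    # Reference/master: order number -> customer.
    customer_order = {1001: 1, 1002: 2}

    # Source/update rows: (order, product, quantity).
    order_product = [
        (1001, "screws", 137),
        (1002, "nuts", 200),
        (1000, "gaskets", 28),    # no matching customer: will be rejected
        (1000, "widgets", 14),    # no matching customer: will be rejected
    ]

    matched, customerless = [], []
    for order, product, qty in order_product:
        if order in customer_order:
            matched.append((customer_order[order], order, product, qty))
        else:
            customerless.append((order, product, qty))

    print("matched:", matched)
    print("Customerless rejects:", customerless)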
Objectives
Use the Aggregator stage to see how many of each product each customer has on order.
New stage used: Aggregator
The Aggregator stage is a processing stage. It classifies data rows from a single input link into groups and computes totals or other aggregate functions for each group.
Our InnerJoin Job Was a Bit Incomplete…
We did almost enough work in the InnerJoin lab (Lab 8a) to find out how many of each product each customer has on order. Now that we know about the Aggregator stage, we can finish the job.
Go back to the version of Lab 8a:
• Remove the implicit Remdup (sort unique)
• Insert an Aggregator
Aggregator Options
• Method: sort (we could have a lot of customer/product groups)
• Grouping keys: customer and product
• Set Column for Calculation = quantity
• Function to apply: Sum
• Output Column Name of the result column: quantity
(A conceptual sketch of this grouping appears below.)
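The sort-method aggregation boils down to the following Python sketch (invented rows): sort on the grouping keys, group by customer and product, and sum the quantity column.

    from itertools import groupby

    # (customer, product, quantity) rows, e.g. the output of the inner join.
    rows = [
        (1, "screws", 137), (1, "screws", 35), (1, "nuts", 200),
        (2, "bolts", 145), (2, "bolts", 527),
    ]

    # Sort method: rows must be ordered on the grouping keys before grouping.
    rows.sort(key=lambda r: (r[0], r[1]))

    for (customer, product), group in groupby(rows, key=lambda r: (r[0], r[1])):
        quantity = sum(q for _, _, q in group)   # Sum of the column for calculation
        print(customer, product, quantity)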
What You Should Have
Your job should look like this. Compile and run your job.
Modify Stage
The Modify stage is a processing stage. It can have a single input link and a single output link. The Modify stage alters the record schema of its input data set: you can drop or keep columns from the schema, or change the type of a column. The modified data set is then output.
Dropping and Keeping Columns
The following example takes a data set comprising these columns: CUSTID, NAME, ADDRESS, CITY, STATE, ZIP, AREA, PHONE, REPID, CREDITLIMIT, COMMENTS.
The Modify stage is used to drop the REPID, CREDITLIMIT, and COMMENTS columns. To do this, the stage properties are set as follows: DROP REPID, CREDITLIMIT, COMMENTS
You could achieve the same effect by specifying which columns to keep, rather than which ones to drop. In the case of this example the required specification to use in the stage properties would be: KEEP CUSTID, NAME, ADDRESS, CITY, STATE, ZIP, AREA, PHONE
Changing Data Type
You could also change the data types of one or more of the columns from the above example. Say you wanted to convert the CUSTID from decimal to string. To do this, you would specify a new column to take the converted data, and specify the conversion in the stage properties: conv_CUSTID:string = string_from_decimal(CUSTID)
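In plain Python terms (an illustrative stand-in for the modify specifications above, with invented sample values), dropping columns, keeping columns, and converting CUSTID from decimal to string look like this:

    from decimal import Decimal

    record = {
        "CUSTID": Decimal("10023"), "NAME": "ACME", "ADDRESS": "1 Main St",
        "CITY": "Boston", "STATE": "MA", "ZIP": "02101", "AREA": 617,
        "PHONE": "5550100", "REPID": 7, "CREDITLIMIT": Decimal("5000"),
        "COMMENTS": "priority customer",
    }

    # DROP: remove REPID, CREDITLIMIT and COMMENTS from the record schema.
    dropped = {k: v for k, v in record.items()
               if k not in ("REPID", "CREDITLIMIT", "COMMENTS")}

    # KEEP: the same effect, expressed as the columns to retain.
    keep = ("CUSTID", "NAME", "ADDRESS", "CITY", "STATE", "ZIP", "AREA", "PHONE")
    kept = {k: record[k] for k in keep}

    # Changing data type: a new column takes CUSTID converted from decimal to string.
    kept["conv_CUSTID"] = str(record["CUSTID"])

    print(dropped)
    print(kept)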
Copy Stage
The Copy stage is a processing stage. It can have a single input link and any number of output links. The Copy stage copies a single input data set to a number of output data sets: each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop or change the order of columns.
The Copy stage properties are fairly simple. The only property is Force, and we do not need to set it in this instance as we are copying to multiple data sets (and DataStage will not attempt to optimize it out of the job). We need to concentrate on telling DataStage which columns to drop on each output link. The easiest way to do this is using the Outputs page Mapping tab. When you open this for a link, the left pane shows the input columns; simply drag the columns you want to preserve across to the right pane. We repeat this for each link as follows:
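Conceptually (a Python sketch, not the Mapping tab itself, with invented link and column names), the Copy stage sends every input record to each output link and the per-link mapping keeps only the columns dragged across:

    input_rows = [
        {"CUSTID": 1, "NAME": "ACME", "STATE": "MA"},
        {"CUSTID": 2, "NAME": "Widgets Inc", "STATE": "NY"},
    ]

    # Columns preserved on each output link (the Mapping tab equivalent).
    output_links = {
        "names_link":  ("CUSTID", "NAME"),
        "states_link": ("CUSTID", "STATE"),
    }

    # Every input record is copied to every output link, projected to that link's columns.
    outputs = {link: [{col: row[col] for col in cols} for row in input_rows]
               for link, cols in output_links.items()}

    for link, rows in outputs.items():
        print(link, rows)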
The Funnel stage is a processing stage. It copies multiple input data sets to a single output data set. This operation is useful for combining separate data sets into a single large data set. The stage can have any number of input links and a single output link.
The continuous funnel method is selected on the Stage page Properties tab of the Funnel stage:
The continuous funnel method does not attempt to impose any order on the data it is processing. It simply writes rows as they become available on the input links. In our example the stage has written a row from each input link in turn. A sample of the final, funneled, data is as follows:
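The original sample data is a screenshot; as a conceptual stand-in, here is a Python sketch (invented inputs) of a continuous funnel that interleaves rows from its input links as they arrive, without imposing any order:

    from itertools import zip_longest

    input_links = [
        [{"id": 1}, {"id": 2}, {"id": 3}],      # data set 1
        [{"id": 10}, {"id": 11}],               # data set 2
        [{"id": 20}, {"id": 21}, {"id": 22}],   # data set 3
    ]

    # Continuous funnel: take a row from each input link in turn as rows become available.
    funneled = []
    for batch in zip_longest(*input_links):
        funneled.extend(row for row in batch if row is not None)

    print(funneled)   # one combined data set; no particular order is guaranteed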
Filter Stage
The Filter stage is a processing stage. It can have a single input link, any number of output links and, optionally, a single reject link. The Filter stage transfers, unmodified, the records of the input data set which satisfy the specified requirements and filters out all other records. The filtered-out records can be routed to a reject link, if required. You can specify different requirements to route rows down different output links.
Specifying the Filter
The operation of the Filter stage is governed by the expressions you set in the Where property on the Properties tab. When a record meets the requirements, it is written unchanged to the specified output link. The Where property supports standard SQL expressions, except when comparing strings. You can use the following elements to specify the expressions:
• Input columns
• Requirements involving the contents of the input columns
• Optional constants to be used in comparisons
• The Boolean operators AND and OR to combine requirements
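The Where requirements behave like row-level predicates; here is a Python sketch of routing rows down different output links with an optional reject link (the column names, conditions and link names are invented for illustration):

    rows = [
        {"name": "John Parker", "state": "MA", "amount": 250.0},
        {"name": "Susan Calvin", "state": "IL", "amount": 75.0},
        {"name": "Ann Claybourne", "state": "FL", "amount": -10.0},
    ]

    # Each output link has its own Where requirement (input columns, constants, AND/OR).
    where = {
        "big_orders":   lambda r: r["amount"] > 100 and r["state"] == "MA",
        "small_orders": lambda r: 0 <= r["amount"] <= 100,
    }

    outputs = {link: [] for link in where}
    reject = []
    for r in rows:
        matched = False
        for link, requirement in where.items():
            if requirement(r):
                outputs[link].append(r)   # written unchanged to the matching output link
                matched = True
        if not matched:
            reject.append(r)              # optionally routed to the reject link

    print(outputs)
    print("reject:", reject)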