Tuesday, September 28, 2010
1. Use the same configuration file for all your jobs.
You may have two nodes configured for each CPU on your DataStage server. This lets your high volume jobs run quickly, but it works great for slowing down your small volume jobs. A parallel job with a lot of nodes to partition across is a bit like the solid wheel on a velodrome racing bike: it takes a lot of time to crank up to full speed, but once you are there it is lightning fast. If you are processing only a handful of rows, the configuration file will instruct the job to partition those rows across a lot of processes and then repartition them at the end. So a job that would take a second or less on a single node can run for 5-10 seconds across a lot of nodes, and a squadron of these jobs will slow down your entire DataStage batch run!
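For reference, the nodes come from the parallel configuration file named by the APT_CONFIG_FILE setting. A minimal two-node file might look like the sketch below; the host name and directory paths are placeholders, so substitute your own. Keeping a separate one-node version of this file for small-volume jobs avoids the start-up and repartitioning cost described above.

    {
      node "node1" {
        fastname "etl_host"
        pools ""
        resource disk "/ds/datasets" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
      }
      node "node2" {
        fastname "etl_host"
        pools ""
        resource disk "/ds/datasets" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
      }
    }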
2. Use a sparse database lookup on high volumes.
This is a great way to slow down any ETL tool, and it works on server jobs and parallel jobs alike. The main difference is that server jobs only do sparse database lookups; the only way to avoid a sparse lookup is to dump the table into a hash file. Parallel jobs by default do cached lookups, where the entire database table is moved into a lookup fileset, either in memory or, if it's too large, into scratch space on disk. You can slow parallel jobs down by changing the lookup to a sparse lookup, so that for every row processed it sends a lookup SQL statement to the database. So if you process 10 million rows you can send 10 million SQL statements to the database! That will put the brakes on!
3. Keep resorting your data.
Sorting is the Achilles heel of just about any ETL tool. The average ETL job is like a busy restaurant: it makes a profit by getting the diners in and out quickly and serving multiple seatings. If the restaurant seats 100 people it can feed several hundred in a couple of hours by processing each diner quickly and getting them out the door. The sort stage is like having to wait until every person who is going to eat at that restaurant that night has arrived and has been put in order of height before anyone gets their food: you need to read every row before you can output your sort results. You can really slow your DataStage parallel jobs down by putting in more than one sort, or by giving a job data that is already sorted by the SQL select statement but sorting it again anyway!
4. Design single-threaded bottlenecks.
This is really easy to do in server edition and harder (but possible) in parallel edition. Devise a step on the critical path of your batch processing that takes a long time to finish and only uses a small part of the DataStage engine. Some good bottlenecks: a large volume Server Job that hasn't been made parallel by multiple instance or interprocess functionality; a scripted FTP of a file that keeps the entire DataStage Parallel engine waiting; a bulk database load via a single update stream; reading a large sequential file from a parallel job without using multiple readers per node.
5. Turn on debugging and forget that it's on.
In a parallel job you can turn on a debugging setting that forces it to run in sequential mode, forever! Just turn it on to debug a problem and then step outside the office and get run over by a tram. It will be years before anyone spots the bottleneck!
6. Let the disks look after themselves.
Never look at what is happening on your disk I/O; that's a Pandora's Box of better performance! Disk striping or partitioning or choosing the right disk type or changing the location of your scratch space are all things that stand between you and slower job run times. You can get some beautiful drag and slow down by ignoring your disk I/O, as parallel jobs write a lot of temporary data and datasets to the scratch space on each node and write out to large sequential files.
7. Keep Writing that Data to Disk.
Staging of data can be a very good idea. It can give you a rollback point for failed jobs, it can give you a modular job design, and it can give you a transformed dataset that can be picked up and used by multiple jobs. It can also slow down parallel jobs like no tomorrow, especially if you stage to sequential files! All that repartitioning to turn native parallel datasets into a dumb ASCII file and then import and repartition to pick it up and process it again. Sequential files are the Forrest Gump of file storage: simple and practical but dumb as all hell. It costs time to write to one and time to read and parse them, so designing an end to end process that writes data to sequential files repeatedly will give you massive slow down times.
8. Validate every field.
A lot of data comes from databases. Often DataStage pulls straight out of these databases or saves the data to an ASCII file before processing it. One way to slow down your job, and slow down your ETL development and testing, is to validate and transform metadata even though you know there is nothing wrong with it. DataStage has implicit validation and conversion of all imported data that checks it really is the metadata you say it is. You can then do explicit metadata conversion and validation on top of that. Some fields need explicit metadata conversion, such as numbers in VARCHAR fields, dates in string fields and packed fields, but most don't. For example, validating that a field is VARCHAR(20) using DataStage functions even though the database defines the source field as VARCHAR(20). Adding a layer of validation you don't need should slow those jobs down; a sketch of this kind of redundant check appears below.
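As an illustration of the kind of validation you don't need, here is a hypothetical Transformer derivation (the link and column names are made up) that re-checks the length of a column the source database already declares as VARCHAR(20):

    If Len(lnkSource.CustomerName) <= 20 Then lnkSource.CustomerName Else lnkSource.CustomerName[1,20]

Every row pays for this expression even though it can never truncate anything; multiply it across dozens of columns and jobs and you get exactly the slow down this tip promises.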
9. Write extra steps in database code.
The same phrase gets uttered on many an ETL project: "I can write that in SQL", or "I can write that in Java", or "I can do that in an Awk script". Yes, we know, we know that just about any programming language can do just about anything, but leaving a complex set of steps as a prequel or sequel to an ETL job is like leaving a turd on someone's doorstep. You'll be long gone when someone comes to clean it up. This is a sure fire way to end up with a step in the end to end integration that is not scalable, is poorly documented, cannot be easily modified and slows everything down. If someone starts saying "I can write that in...", just say "okay, if you sign a binding contract to support it for every day that you have left on this earth".
10. Don't do Performance Testing.
Do not take your highest volume jobs into performance testing. Just keep the default settings, default partitioning and your first draft design, throw that into production and get the hell out of there.

DataStage tip for beginners
Monday, September 27, 2010
Here are the tips from Vincent McBurney.

Import Export
* When you do an export, cut and paste the export file name. While export and import independently remember the last file name used, they do not share that name between each other. When you go to your project and run an import, paste the file name instead of having to browse for it.
* There is an Export option to export by individual job name or export by category name. This is on the second tab in the export form. Often when you go to export something it is on the wrong option, eg. you want a job but it is showing the category (done this myself). When you switch from category export to individual job export, back on tab 1 your job is still not highlighted. It is quick to fix: close the export form and open it again, and the job name or category you have highlighted will be automatically picked.
* If you want to export several jobs that are not in the same category, use the append option. Highlight and export the first job, close the export window, find and highlight the second job, and in the export form click the "Append" option to add to the file. Continue until all jobs have been selected and exported.
* On the Options tab is a check box to include "Referenced shared containers" in the export, to also export those.
* When you do an export there is a "View" button. Click this to open the export file and run any type of search and replace on job parameter values when moving between dev and test.
* If you want to copy and paste settings between jobs (especially the transformers), for example database login values or transformer functions, open each job in a separate Designer session. Most property windows in DataStage are modal and you can only have one property window open per Designer session; by opening two Designers you can have two property windows open at the same time and copy or compare them more easily.
Things you could easily miss
You could use DataStage for months and not see some of these time savers.
* You can load metadata into a stage by using the "Load" button on the Columns tab or by dragging and dropping a table definition from the Designer repository window onto a link in your job. For sequential file stages the drag and drop is faster as it loads both the column names and the format values in one go; if you used the Load button you would need to load the column names and then the format details separately.
* There is an automap button in a lot of stages that maps fields with the same names.
* When you add a shared container into your job you need to map the columns of the container to your job link. What you might miss is the extra option you get on the Columns tab "Load" button: in addition to the normal column load you get "Load from Container", which is a quick way to load the container metadata into your job.
* Can't get a Modify function or Transformer function working correctly? Trial and error is often the only way to work out the syntax of a function, and if you do this in a large and complex job it can be time consuming to debug due to job startup times. Consider having a couple of test jobs in your dev project with a row generator, a modify or transformer stage and a peek stage. Have a column of each type in this test job and use it throughout your project as a quick way to test a function or conversion (a small example follows this list).
* Don't create a job from an empty canvas. Always copy and use an existing job.
* Don't create shared containers from a blank canvas; always build and test a full job and then turn part of it into a container.
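For instance, a throwaway test job like that is a quick place to check date conversion syntax before it goes into a real derivation. The column name below is hypothetical; the conversion codes are standard DataStage BASIC Iconv/Oconv date formats:

    Oconv(Iconv(lnkTest.DateString, "D-YMD[4,2,2]"), "D/MDY[2,2,4]")

Feed it a value such as "2010-09-27" from the row generator and the peek stage should show "09/27/2010"; if the format codes are wrong you find out in seconds rather than after a full job startup.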
* You can put job parameters into stage properties text boxes, eg. #filedir#/#filename#, but you may not know that you can also put macros into property text boxes, eg. #filedir#/#filename#_#DSJobName#. In this example the first two values are job parameters and the third is not a job parameter, it's a DataStage macro.

Sequence Jobs
My most annoying Sequence job "feature" is the constant need to enter job parameters over and over and over again. You need to set the parameters for every flippin job activity stage, even though they are likely to have the same or similar parameter lists and settings. If you have 10-20 parameters per job (as I normally do) it becomes very repetitive and is open to manual coding errors. Under version 7.1 and earlier you could copy and paste a job activity stage, change the job name and retain most of the parameter settings. A faster way is to do the job renaming in an export file:
* In an empty sequence job add your first job activity stage and set all parameter values, or copy one in from an existing job.
* Copy and paste as many copies of this job activity as you need for your sequence.
* Close the sequence job, export it and click the View button; you need the export to retrieve the stage names.
* Copy the name of the last job activity stage and search for it in the export file. When the cursor is on that part of the export file, search and replace the old job name with the new job name. Make sure you only replace to the bottom of the file (most text editors should have this option). This will rename the job of the last activity stage.
* Repeat this for the second last job activity, then the third last, and so on until you have replaced all job names back to the second job activity stage.
* Import the job into your project.
* Open the sequence job. This should give you the same set of job activity stages, but with each one pointing at a different job and with the full set of job parameters set.

Data Generation Using DataStage
Sunday, September 26, 2010
This note assumes some familiarity with the DataStage transformation engine. DataStage is normally used to process multiple input files and/or table selections, perform lookups for related data, and process data transformations for loading to one or more target database tables or files. Support for data generation is provided through the following process. The DataStage Transformer Stage needs either an input link OR stage variables; the stage automatically stops when no rows are output. Use the Transformer Stage as a source stage with one or more output links. No input links should be specified. Set up a stage variable (it just needs to be declared, not necessarily used). Within the Transformer, place a Constraint on the output link(s) using @OUTROWNUM for a specified number of rows or, optionally, the stage variable if you increment/decrement it. Define your derivations with your test data.
SIMPLE DEMONSTRATION
Below is a simple Job to produce test data rows. The DataStage Transformer stage on the left (GenerateSomeData) contains the logic for data field generation, while the Sequential File Stage on the right ("Target") represents a very simple target. Normally, input data sources would be shown on the left, but here no inputs are required! There needs to be a stage variable in place for data generation to work, but it does not need to be used or referenced. Double-clicking the Transformer stage opens the screen below. The number of rows created is controlled by the Transformer Constraint. As shown in the WriteGeneratedData Derivation below, the Constraint for the WriteGeneratedData link is @OUTROWNUM < 1000; in other words, stop when the outgoing row number counter reaches 1000.
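Put together, the source Transformer setup amounts to the sketch below. The stage variable and column derivations are hypothetical stand-ins for whatever your job needs; only @OUTROWNUM and the overall pattern come from the description above.

    Stage variable (declared only, never referenced):
      svDummy = 0
    Constraint on the WriteGeneratedData output link:
      @OUTROWNUM < 1000
    Example column derivations:
      Id        <-  @OUTROWNUM
      FirstName <-  "Name_" : @OUTROWNUM
      Quantity  <-  Mod(@OUTROWNUM, 100) + 1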
INCLUDE INTERNAL DATA IN TEST SUITE
Here is a job with a data generation stage driving another Transformer stage to look up specific row data. The TestFile hash index contains 10 rows, keyed 0-9, where field1 = FirstName and field2 = LastName. The goal is, for every input row up to 999, to take the last character (0-9) and use it as the key to "look up" the FirstName/LastName.
Here is the actual data in the hash lookup file.
This is the second Transformer Stage, where the lookup is being done. Notice the substring WriteGeneratedData.Id[Len(WriteGeneratedData.Id),1] in the key derivation, which picks off the last character of the input row for the lookup key.

RANDOM NUMBER GENERATION
A critical feature for a data generation tool is the ability to generate random numbers for data entered into a DataStage job. The NumericRandomGenerator routine generates random numbers for a range of numbers: NumericRandomGenerator(Offset, NumberRange, RepeatSeed).
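NumericRandomGenerator is a custom routine and its internals are not shown in this note. Purely as an illustration of what a routine with that argument list might look like, the sketch below uses the standard DataStage BASIC RND function and RANDOMIZE statement; the exact behaviour (centring on Offset, inclusive range, seeding rules) is an assumption, not the author's implementation.

    * Hypothetical routine body; Offset, NumberRange and RepeatSeed are the routine's arguments
    If RepeatSeed <> 0 Then RANDOMIZE(RepeatSeed)          ;* seed only when a repeatable sequence is wanted
    Ans = Offset - NumberRange + RND(2 * NumberRange + 1)  ;* a value in the range Offset plus or minus NumberRange

Read this way, the call NumericRandomGenerator(AvgQuantity, AvgQuantity/2, 0) used later in this note would return a value between 50% and 150% of AvgQuantity; treat the exact arithmetic as a guess rather than the documented routine.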
Pseudorandom ranges of numbers can be very important in building test suites of data. A pseudorandom range of numbers is multiple sets of numbers in different ranges, but random within each range. An example would be:

    1-25:    4, 7, 13, 22
    26-50:   44, 29, 27, 31
    51-75:   67, 54, 65, 69
    76-100:  98, 84, 79, 85

Suites of pseudorandom number ranges can be created using the Offset argument and stage variables. A transformation using this process is shown below. DataStage allows the developer to control the complexity of random and pseudorandom number generation. The DataStage Aggregator and Pivot stages were used in a job to test the randomness of the NumericRandomGenerator: a test was developed to run high numbers of random numbers and test the parameters of the process. The output of the job is shown below.

RANDOM DATE GENERATION
A related feature to random number generation is random date generation. Random date generation uses a range of dates between which random dates are output. The DataStage routine DateRandomGenerator uses and creates dates in any of the following formats:

    Database Date Type                       Date Format Designator   Example Data
    DataStage Internal Numeric Date Format   D or Internal            55555
    ODBC                                     ODBC                     2001-01-01 00:00:00.000
    Oracle OCI                               OCI                      55555
    Sybase OC                                OC                       55555
    Informix CLI                             CLI                      55555
    DB2 DB-Connect                           DB2                      55555

The syntax is DateRandomGenerator(StartDate, EndDate, Offset, DateFormat, RepeatSeed):

    Argument     Description                                                       Example
    StartDate    Starting date according to the above format                       2001-01-01 or 12846
    EndDate      Ending date according to the above format                         2001-01-01 or 12846
    Offset       Number of days offset forward from the dates                      365
    DateFormat   Date format designator according to the above chart               ODBC
    RepeatSeed   Random number repeating start seed, 0 if no repeating             0
                 randomness desired. Should only be used once at the
                 beginning of the process.
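Based purely on that argument list, a derivation for a hypothetical OrderDate column could call the routine like this (the column name and date range are made up):

    OrderDate <- DateRandomGenerator("2001-01-01", "2001-12-31", 0, "ODBC", 0)

With the ODBC designator each generated row would get a timestamp-style date somewhere between the start and end dates, with no day offset and no repeatable seed.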
EXAMPLE DATA GENERATION: BUILDING A DATA WAREHOUSE FACT TABLE
One example requiring random number generation is building a data warehouse fact table within a star schema. A star schema fact table represents the joining of several dimension tables with measures such as quantity and extended price. In building test data, a distribution across products, time, and customers approximating a realistic slice of reality is important. The test data generator may need to create more sales for weekdays than weekends, but may randomly distribute sales across weekdays.

DataStage has several features that will help model complex data generation requirements:
* Powerful hash index features that allow the data modeler to build a representative mix without changing data sources. Combining random numbers with hash indexes allows assigning real customers and products to the generated test data.
* The ability to rerun with slightly different parameters: a separate model for large purchasers and small purchasers, modeling growth, or adding more functional processes. This allows each model to be simple and verifiable while building a larger model.

Hash tables can be used in a simple example of product selection. If Product A is ordered twice as often as Product B, it can be stored twice in the product table. Several copies of a product can be stored in the hash table for different models of usage. If a product is only ordered in conjunction with another product, or if both are selected at the same time, the product table can be split into two hash indexes, with the second only accessed in conjunction with the product selected in the first. Cross-reference tables can be used to match customers with their product preferences, or with other products as cross-selling opportunities.

If the average quantity of a product sold is stored in the hash table, it can be used as the basis for a random selection. A function such as NumericRandomGenerator(AvgQuantity, AvgQuantity/2, 0), following the NumericRandomGenerator(Offset, NumberRange, RepeatSeed) signature above, is used to provide a range of Quantity numbers from 50% to 150% of the AvgQuantity. If 90% of the orders only receive one quantity while 10% get hundreds, 10 records can be stored with different average quantities. A sketch of this kind of derivation follows.
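As a concrete illustration, a Quantity derivation on the generated fact row might look like the line below; ProductLookup is an invented link name standing in for the hash lookup that returns the product's AvgQuantity:

    Quantity <- NumericRandomGenerator(ProductLookup.AvgQuantity, ProductLookup.AvgQuantity/2, 0)

Each generated row then carries a quantity between 50% and 150% of that product's average, so popular products stored several times in the hash file also show realistic order sizes.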
A sample job is shown below, selecting typical product purchases for a customer. The Generate_Random_Keys stage creates random entries with customer keys and product keys in the range of 0 to [the total number], where [the total number] is the maximum number for each customer or product. The hash indexes have a unique number added in addition to the natural key; as discussed above, there may be multiple copies of popular products and customers to increase their frequency in the resulting file. Only the First_Product selection is selected in the "random" hash table search. Products related to the first, such as a cable set for a computer, are selected from keys brought back from the First_Product search. More stages can be added to select products up to an average amount per customer order. This process can be made as complex as the ETL processes DataStage typically models.