10 Ways to Make DataStage Run Slower
Everyone wants to tell you how to make your ETL jobs run faster, so here is how to make them slower! The Structured Data blog has posted a list, Top Ways How Not To Scale Your Data Warehouse, which is a great discussion of bad ways to manage an Oracle Data Warehouse. It inspired me to find 10 ways to make DataStage jobs slower! How do you put the brakes on a DataStage job that is supposed to be running on a massively scalable parallel architecture?
1. Use the same configuration file for all your jobs.
You may have two nodes configured for each CPU on your DataStage server, and this lets your high-volume jobs run quickly, but it works great for slowing down your small-volume jobs. A parallel job with a lot of nodes to partition across is a bit like the solid wheel on a velodrome racing bike: it takes a lot of time to crank up to full speed, but once you are there it is lightning fast. If you are processing only a handful of rows, the configuration file will instruct the job to partition those rows across a lot of processes and then repartition them at the end. So a job that would take a second or less on a single node can run for 5-10 seconds across a lot of nodes, and a squadron of these jobs will slow down your entire DataStage batch run!
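For reference, a multi-node configuration file looks roughly like the sketch below (the host name and disk paths are made-up placeholders). A tiny job forced through both of these nodes pays the startup and repartitioning cost for nothing; the usual fix is to keep a second, single-node file and point small jobs at it via the APT_CONFIG_FILE parameter.

```
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/ds/data1" {pools ""}
    resource scratchdisk "/ds/scratch1" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/ds/data2" {pools ""}
    resource scratchdisk "/ds/scratch2" {pools ""}
  }
}
```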
2. Use a sparse database lookup on high volumes.
This is a great way to slow down any ETL tool, and it works on server jobs and parallel jobs alike. The main difference is that server jobs only do sparse database lookups; the only way to avoid a sparse lookup is to dump the table into a hash file. Parallel jobs by default do cached lookups, where the entire database table is moved into a lookup fileset, either in memory or, if it's too large, into scratch space on disk. You can slow parallel jobs down by changing the lookup to a sparse lookup, so that for every row processed it sends a lookup SQL statement to the database. If you process 10 million rows you can send 10 million SQL statements to the database! That will put the brakes on!
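The difference is easy to see outside DataStage. Here is a toy Python sketch using sqlite3 (the table and column names are invented for illustration): the sparse version fires one SQL statement per input row, while the cached version pulls the table once and probes an in-memory dict.

```python
import sqlite3

# Build a small in-memory "database table" to look up against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"cust_{i}") for i in range(1000)])

def sparse_lookup(rows):
    # Sparse lookup: one SQL round trip per input row.
    out = []
    for key in rows:
        cur = conn.execute("SELECT name FROM customers WHERE id = ?", (key,))
        out.append(cur.fetchone()[0])
    return out

def cached_lookup(rows):
    # Cached lookup: read the whole table once, then probe a dict.
    cache = dict(conn.execute("SELECT id, name FROM customers"))
    return [cache[key] for key in rows]

rows_to_process = list(range(1000)) * 10  # 10,000 input rows
assert sparse_lookup(rows_to_process) == cached_lookup(rows_to_process)
```

Same answers, but the sparse version made 10,000 database calls to get them, which is exactly the trade DataStage makes when you flip a lookup to sparse on a high-volume stream.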
3. Keep resorting your data.
Sorting is the Achilles heel of just about any ETL tool. The average ETL job is like a busy restaurant: it makes a profit by getting the diners in and out quickly and serving multiple seatings. If the restaurant seats 100 people, it can feed several hundred in a couple of hours by processing each diner quickly and getting them out the door. The sort stage is like having to wait until every person who is going to eat at that restaurant that night has arrived, and has been put in order of height, before anyone gets their food. You need to read every row before you can output your sort results. You can really slow your DataStage parallel jobs down by putting in more than one sort, or by taking data that is already sorted by the SQL SELECT statement and sorting it again anyway!
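The blocking behaviour can be sketched in a few lines of Python (a simulation, not DataStage itself): the sort cannot emit its first output row until the last input row has been read, which is why every extra sort stage stalls the whole pipeline.

```python
def arriving_rows(log):
    # Simulates rows streaming into a stage; records when each is read.
    for row in [3, 1, 2]:
        log.append(("read", row))
        yield row

def blocking_sort(rows, log):
    # sorted() must consume every input row before yielding the first output.
    for row in sorted(rows):
        log.append(("emit", row))
        yield row

log = []
result = list(blocking_sort(arriving_rows(log), log))
# All reads happen before any emit: the restaurant waits for every diner.
assert result == [1, 2, 3]
assert [op for op, _ in log] == ["read"] * 3 + ["emit"] * 3
```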
4. Design single-threaded bottlenecks
This is really easy to do in server edition and harder (but possible) in parallel edition. Devise a step on the critical path of your batch processing that takes a long time to finish and only uses a small part of the DataStage engine. Some good bottlenecks: a high-volume server job that hasn't been made parallel via multiple-instance or interprocess functionality; a scripted FTP of a file that keeps an entire DataStage parallel engine waiting; a bulk database load via a single update stream; reading a large sequential file from a parallel job without using multiple readers per node.
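To see why multiple readers help, here is a Python sketch of the partitioning idea behind that "readers per node" option (my own simplified take, not DataStage's actual implementation): split the file into byte ranges aligned on record boundaries so that independent readers can each take a range.

```python
def chunk_offsets(data: bytes, n_readers: int):
    """Split a byte buffer into up to n_readers ranges aligned on newline
    boundaries, so each reader starts at the beginning of a record."""
    size = len(data)
    approx = size // n_readers
    offsets = [0]
    for i in range(1, n_readers):
        pos = data.find(b"\n", i * approx)
        offsets.append(size if pos == -1 else pos + 1)
    offsets.append(size)
    # Drop degenerate or duplicate boundaries.
    return [(offsets[i], offsets[i + 1]) for i in range(n_readers)
            if offsets[i] < offsets[i + 1]]

data = b"row1\nrow2\nrow3\nrow4\n"
ranges = chunk_offsets(data, 2)
# Each (start, end) range can be handed to an independent reader process.
rows = [data[s:e].decode().splitlines() for s, e in ranges]
assert ranges == [(0, 15), (15, 20)]
assert rows == [["row1", "row2", "row3"], ["row4"]]
```

A single reader is the anti-pattern: one process crawls the file while the rest of the parallel engine sits idle waiting for rows.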
5. Turn on debugging and forget that it's on
In a parallel job you can turn on a debugging setting that forces it to run in sequential mode, forever! Just turn it on to debug a problem and then step outside the office and get run over by a tram. It will be years before anyone spots the bottleneck!
6. Let the disks look after themselves