These are the mistakes that ETL designers can make when processing scary high data volumes.

Dan Lindstedt is a very large data guru and he has a series of outstanding blog posts on very large databases; the latest is ETL Engines: VLDW & Loading / Transforming. I first came across Dan on ETL/ELT forums, where he has over 20,000 forum posts. He popped up on B-Eye-Network blogs. Dan has had no comments on his latest post series yet, as B-Eye don't have a reader friendly design and it discourages reader participation. For example, I just got three comments over a weekend on three old archived ETL posts. ITToolbox is a friendly place for reader participation.

My favourite part of Dan's latest post is the 17 mistakes that ETL Designers make with very large data. In fact that's the title I would have gone for! For some reason blog titles with a number in them attract more hits. I've shown Dan's list of 17 below with my own comments on how that impacts DataStage developers. I would love to hear your own contributions for common ETL design mistakes.

1) Incorporating Inserts, Updates, and Deletes in to the _same_ data flow / same process.
Agree 100%. I believe in having at least three bands in your processing: extract from source to file, process the file to a load-ready dataset, and load the load-ready dataset. For a standard target table load I would have an insert job, an update job, a delete job (if needed) and a bulk load job (for large volumes).

2) Sourcing multiple systems at the same time, depending on heterogeneous systems for data.
I can see this working well for smaller volumes, and the combination of DataStage accessing data from multiple systems via a Federation Server plugin is intriguing, but this type of cross database joining would be nasty on very large volumes. Pulling the smaller data volumes into target tables or lookup datasets would be faster.

3) Targeting more than 1 or 2 target tables
Agree 100%. You could take it to two tables if there is a very close parent/child relationship like a master table and an attribute table, or a header table and a detail table. But that's the exception; most large volume ETL jobs should be preparing data for one target table.

4) moving rows that are too wide through a single process
I don't know how you get around this one. If you have a lot of columns then you gotta get it in there!

5) loading very large data sets to targets WITH INDEXES ON
DataStage makes this easy to manage: on a standard database stage (insert or bulk load activities) you can use the before-SQL tab to turn indexes off and the after-SQL tab to turn them back on. The statements on those tabs are run just once for each job (and not per row). You don't need indexes and keys if you have other ways to check your referential integrity.

6) not running a cost-based optimizer in the database
7) not keeping statistics up to date in the database
8) not producing the correct indexes on the sources / lookups that need to be accessed
Examples of why you need to be in the good books with your DBAs!
"My DataStage job is running slow. I think there is something wrong with the database table."
"What is DataStage?"
"Could you just check the table, maybe run a trace?"
"You sure it's not DataStage?"
"Can you check the table. I've got the job running right now!"
"I ran a query on the table and it's running fine."
"Could you maybe get off Facebook and do a friggen trace!"

9) not purchasing enough RAM for the ETL server to house the RAM caches in memory.
DataStage Parallel Jobs need a lot of RAM for the lookup stage. The IBM Information Server Blade starts with 4G of RAM per blade for two dual core CPUs on each blade.

10) running on a 32 bit environment which causes significant OS swapping to occur
11) running on a 32 bit environment which causes significant OS swapping to occur
12) running on a 32 bit environment which causes significant OS swapping to occur
I think the treblification of this one is an OS swapping joke. I am going to be talking about 64 bit processing with the IBM architects when I meet them at IoD 2007 next month. The Information Server can run on some 64 bit environments but it will be interesting to find out what plans IBM have for expanding this.

13) Trying to do "too much" inside of a single data flow, increasing complexity and dropping performance
This is a tricky one, kind of like rows that are too wide: sometimes you need to do it and you hope the massively parallel architecture is up to it. DataStage can make almost every type of stage work in parallel so it can get away with extra steps, however this does make the job harder to debug and sometimes you get those random out of resource errors.

14) believing that "I need to process all the data in one pass because it's the fastest way to do it." This is completely false: multi-passing the data can actually improve performance by orders of magnitude, IF parallelism can be increased.
I'm not sure what Dan means by multi-passing and I'll ask in his comments thread.

15) Letting the database "bounce" errors back to the ETL tool, dropping flow rates and throughput rates by factors of 4x to 10x.
More detail about this is in Dan's post. Dan points out that each reject row slows down the job by a factor of 4 as the ETL job stops processing to handle that reject row. If you try to trap database rejects in a Server Job you use a reject link from a Transformer prior to the database stage. Parallel Jobs are more efficient as they use a reject link out of the database stage and on a parallel architecture can push the handling of those rows into a new process. I haven't seen any overhead in this design if you don't get rejects, and database rejects should be under 0.01% of your rows or else your design is faulty. On smaller volume jobs I use the database reject link and insert instead of bulk load a lot for a more robust auditing of rows. I've never tried to trap bulk load or multi load or tpump or any other type of native database load errors back in DataStage and I don't think the stages can accept them anyway; let the database handle them.

16) "THINKING" in a transactional mode, rather than a batch mode, and processing each row, one row at a time (like they would code a cursor in a database language).
A bit hard to avoid in a row-by-row ETL tool! The parallel architecture and caching and memory sharing and a bunch of other things make it fast.

17) LOOPING inside an ETL process, because they think it's necessary (transactional processing again).
Fortunately this is very hard to do in a DataStage job or people would misuse it! In DataStage looping on a per row basis can be done via lookups to database stored procedures or custom code modules but most standard DataStage stages do an action per row.

Think outside the box. A lot of ETL sites have templates for building new jobs or design standards or example jobs. These are excellent for 90% of your ETL work, however very large data jobs may need custom designs. You should look at what has been done in other jobs but also be ready to take a job into performance testing to try out dozens of other configurations. As Dan says in his post, "performance and tuning at these volumes usually means going contrary to the grain of what you've typically learned in building ETL load routines".
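For anyone who wants to try the idea behind mistake 5 outside of DataStage, here is a rough sketch of the before-SQL / after-SQL pattern in Python. It is only an illustration under a few assumptions: SQLite stands in for the real target database, and the sales table, the idx_sales_customer index and the bulk_load function are hypothetical names made up for the example. On a real warehouse platform you may have other options, such as disabling an index rather than dropping it, so check with your DBA.

```python
import sqlite3

# Hypothetical table and index names, used only for illustration.
BEFORE_SQL = "DROP INDEX IF EXISTS idx_sales_customer"
AFTER_SQL = "CREATE INDEX idx_sales_customer ON sales (customer_id)"


def bulk_load(conn, rows):
    """Load rows with the index dropped, then rebuild the index once."""
    conn.execute(BEFORE_SQL)  # runs once per load, not once per row
    conn.executemany(
        "INSERT INTO sales (sale_id, customer_id, amount) VALUES (?, ?, ?)",
        rows,
    )
    conn.execute(AFTER_SQL)  # runs once, after all rows are in
    conn.commit()


def index_names(conn):
    """Return the names of the indexes currently defined on sales."""
    return [row[1] for row in conn.execute("PRAGMA index_list('sales')")]


# In-memory database standing in for the real target.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER, customer_id INTEGER, amount REAL)"
)
conn.execute(AFTER_SQL)  # the table starts out indexed, as in production

bulk_load(conn, [(i, i % 100, i * 1.5) for i in range(10_000)])
```

The two SQL statements play the role of the before-SQL and after-SQL tabs: each runs once around the whole load, so the database builds the index in one go instead of maintaining it for every inserted row.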