
These are the mistakes that ETL designers can make when processing scary high data volumes.
Dan Linstedt is a very large data guru and he has a series of outstanding blog posts on very large databases; the latest is ETL Engines: VLDW & Loading / Transforming. I first came across Dan on ETL/ELT forums, where he has over 20,000 forum posts, and he later popped up on the B-Eye-Network blogs. Dan has had no comments on his latest post series yet, as B-Eye doesn't have a reader-friendly design and that discourages reader participation. For example, I got three comments over a weekend on three old archived ETL posts. ITToolbox is a friendly place for reader participation.
My favourite part of Dan's latest post is the 17 mistakes that ETL Designers make with very large data. In fact that's the title I would have gone for! For some reason blog titles with a number in them attract more hits. I've shown Dan's list of 17 below with my own comments on how each one impacts DataStage developers. I would love to hear your own contributions for common ETL design mistakes.
1) Incorporating Inserts, Updates, and Deletes into the _same_ data flow / same process.
Agree 100%. I believe in having at least three bands in your processing: extract from source to file, process the file into a load-ready dataset, and load the load-ready dataset. For a standard target table load I would have an insert job, an update job, a delete job (if needed) and a bulk load job (for large volumes).
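As a rough sketch of what separate passes look like, here is a minimal Python/SQLite illustration of splitting inserts and updates into their own statements against a load-ready staging table. The table and column names (customer_stage, customer_dim, customer_id) are invented for the example; in DataStage each pass would be its own job rather than code.

```python
# Minimal sketch, assuming invented names (customer_stage, customer_dim):
# one pass for inserts and a separate pass for updates, never mixed in one flow.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_stage (customer_id INTEGER, name TEXT);            -- load-ready data
    CREATE TABLE customer_dim   (customer_id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customer_stage VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO customer_dim   VALUES (1, 'Alicia');
""")

# Insert-only pass: rows whose keys are not yet in the target.
conn.execute("""
    INSERT INTO customer_dim (customer_id, name)
    SELECT s.customer_id, s.name
    FROM customer_stage s
    WHERE NOT EXISTS (SELECT 1 FROM customer_dim d WHERE d.customer_id = s.customer_id)
""")

# Update-only pass: rows whose keys already exist in the target.
conn.execute("""
    UPDATE customer_dim
    SET name = (SELECT s.name FROM customer_stage s
                WHERE s.customer_id = customer_dim.customer_id)
    WHERE customer_id IN (SELECT customer_id FROM customer_stage)
""")
conn.commit()
print(conn.execute("SELECT * FROM customer_dim ORDER BY customer_id").fetchall())
```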
2) Sourcing multiple systems at the same time, depending on heterogeneous systems for data.
I can see this working well for smaller volumes - and the combination of DataStage accessing data from multiple systems via a Federation Server plugin is intriguing, but this type of cross database joining would be nasty on very large volumes. Pulling the smaller data volumes into target tables or lookup datasets would be faster.
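To make that concrete, here is a hedged Python sketch of landing the small source into a local lookup and enriching the large stream against it, instead of a federated cross-database join. The file and column names (country_codes.csv, sales.csv, country_code) are hypothetical.

```python
# Sketch only: land the small reference set locally and join in the ETL tier,
# rather than a cross-database join on large volumes. File/column names invented.
import csv

# Small source: pull it once into an in-memory lookup (a lookup fileset or
# dataset would play this role in DataStage).
with open("country_codes.csv", newline="") as f:
    country_lookup = {row["code"]: row["name"] for row in csv.DictReader(f)}

# Large source: stream it and enrich each row against the local lookup,
# writing a load-ready output file for a later load job.
with open("sales.csv", newline="") as src, open("sales_enriched.csv", "w", newline="") as out:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ["country_name"])
    writer.writeheader()
    for row in reader:
        row["country_name"] = country_lookup.get(row["country_code"], "UNKNOWN")
        writer.writerow(row)
```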
3) Targeting more than 1 or 2 target tables
Agree 100%. You could take it to two tables if there is a very close parent/child relationship like a master table and an attribute table, or a header table and a detail table. But that's the exception; most large volume ETL jobs should be preparing data for one target table.
4) moving rows that are too wide through a single process
I don't know how you get around this one. If you have a lot of columns then you gotta get it in there!
5) loading very large data sets to targets WITH INDEXES ON
DataStage makes this easy to manage: on a standard database stage (insert or bulk load activities) you can use the before-SQL tab to turn indexes off and the after-SQL tab to turn them back on. The statements on those tabs are run just once for each job (and not per row). You don't need indexes and keys if you have other ways to check your referential integrity.
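A hedged sketch of that pattern, assuming Oracle-style SQL and made-up table and index names; in DataStage the two statements would sit on the before-SQL and after-SQL tabs rather than in code.

```python
# Sketch only: the SQL is Oracle-flavoured and sales_fact / sales_fact_ix1 are
# invented names. The before/after statements run once per job, not per row.
BEFORE_SQL = ["ALTER INDEX sales_fact_ix1 UNUSABLE"]  # stop index maintenance during the load
AFTER_SQL = ["ALTER INDEX sales_fact_ix1 REBUILD"]    # rebuild once at the end

def load_with_index_toggle(conn, rows):
    """Run before-SQL once, bulk insert the rows, then run after-SQL once (DB-API style)."""
    cur = conn.cursor()
    for stmt in BEFORE_SQL:
        cur.execute(stmt)
    cur.executemany("INSERT INTO sales_fact (sale_id, amount) VALUES (:1, :2)", rows)
    for stmt in AFTER_SQL:
        cur.execute(stmt)
    conn.commit()
```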
6) not running a cost-based optimizer in the database
7) not keeping statistics up to date in the database
8) not producing the correct indexes on the sources / lookups that need to be accessed
Examples of why you need to be in the good books with your DBAs!
"My DataStage job is running slow, I think there is something wrong with the dat
abase table."
"You sure it's not DataStage?"
"Can you check the table, maybe run a trace?"
"What is DataStage?"
"Could you just check the table, I've got the job running right now!"
"I ran a query on the table and it's running fine."
"Could you maybe get off Facebook and do a friggen trace!"
9) not purchasing enough RAM for the ETL server to house the RAM caches in memory.
DataStage Parallel Jobs need a lot of RAM for the lookup stage. The IBM Information Server Blade starts with 4G of RAM per blade for two dual core CPUs on each blade.
10) running on a 32 bit environment which causes significant OS swapping to occur
11) running on a 32 bit environment which causes significant OS swapping to occur
12) running on a 32 bit environment which causes significant OS swapping to occur
I think the treblification of this one is an OS swapping joke. I am going to be talking about 64 bit processing with the IBM architects when I meet them at IoD 2007 next month. The Information Server can run on some 64 bit environments but it will be interesting to find out what plans IBM have for expanding this.
13) Trying to do "too much" inside of a single data flow, increasing complexity
and dropping performance
This is a tricky one - kind of like rows that are too wide - sometimes you need to do it and you hope the massively parallel architecture is up to it. DataStage can make almost every type of stage work in parallel so it can get away with extra steps; however, this does make the job harder to debug and sometimes you get those random out of resource errors...
14) believing that "I need to process all the data in one pass because it's the
fastest way to do it." This is completely false, multi-passing the data can actu
ally improve performance by orders of magnitude. IF parallelism can be increased
.
I'm not sure what Dan means by multi-passing and I'll ask in his comments thread.
15) Letting the database "bounce" errors back to the ETL tool, dropping flow rates and throughput rates by factors of 4x to 10x.
More detail about this is in Dan's post. If you try to trap database rejects in a Server Job you use a reject link from a Transformer prior to the database stage. Dan points out that each reject row slows down the job by a factor of 4 as the ETL job stops processing to handle that reject row.
Parallel Jobs are more efficient as they use a reject link out of the database stage and on a parallel architecture can push the handling of those rows into a new process. I haven't seen any overhead in this design if you don't get rejects, and database rejects should be under 0.01% of your rows or else your design is faulty.
I've never tried to trap bulk load or multi load or tpump or any other type of native database load errors back in DataStage, and I don't think the stages can accept them anyway; I let the database handle them. On smaller volume jobs I often use the database reject link, with insert instead of bulk load, for more robust auditing of rows.
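As an illustration of the reject-link idea (not DataStage's actual implementation), here is a small Python sketch that validates rows up front and routes failures down a separate path, so the main load never pauses to deal with them one at a time. The column names and rules are made up.

```python
# Sketch only: pre-validate rows and send failures down a separate "reject"
# path instead of letting the database bounce them back row by row.
def split_rejects(rows, required=("customer_id", "amount")):
    """Yield ("good", row) or ("reject", row, reason) without stopping the flow."""
    for row in rows:
        missing = [col for col in required if not row.get(col)]
        if missing:
            yield "reject", row, f"missing: {', '.join(missing)}"
        else:
            yield "good", row

good, rejects = [], []
for tag, *payload in split_rejects([{"customer_id": 1, "amount": 9.5},
                                    {"customer_id": None, "amount": 3.0}]):
    (good if tag == "good" else rejects).append(payload)

print(len(good), "rows to load,", len(rejects), "rejects to audit")
```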
16) "THINKING" in a transactional mode, rather than a batch mode, and processing
each row, one row at a time (like they would code a cursor in a database langua
ge).
A bit hard to avoid in a row-by-row ETL tool! The parallel architecture and caching and memory sharing and a bunch of other things make it fast.
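A hedged illustration of the difference in mindset, using SQLite and invented table names: the cursor-style loop issues one statement per row, while the batch version does the same work in a single set-based statement.

```python
# Sketch only: transactional (row-at-a-time) habit versus batch (set-based) habit.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i * 10.0, "OPEN") for i in range(1000)])

# Transactional habit: a cursor-style loop, one UPDATE per row.
for (order_id,) in conn.execute("SELECT order_id FROM orders WHERE amount > 5000").fetchall():
    conn.execute("UPDATE orders SET status = 'BIG' WHERE order_id = ?", (order_id,))

# Batch habit: one set-based statement does the same work in a single pass.
conn.execute("UPDATE orders SET status = 'BIG' WHERE amount > 5000")
conn.commit()
```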
17) LOOPING inside an ETL process, because they think it's necessary (transactional processing again).
Fortunately this is very hard to do in a DataStage job or people would misuse it! In DataStage looping on a per row basis can be done via lookups to database stored procedures or custom code modules, but most standard DataStage stages do an action per row.

Think outside the box.


A lot of ETL sites have templates for building new jobs or design standards or example jobs. These are excellent for 90% of your ETL work; however, very large data jobs may need custom designs. You should look at what has been done in other jobs but also be ready to take a job into performance testing to try out dozens of other configurations. As Dan says in his post - "performance and tuning at these volumes usually means going contrary to the grain of what you've typically learned in building ETL load routines".
