Evaluating ETL and Data Integration Platforms
Wayne Eckerson and Colin White
TDWI REPORT SERIES

Research Sponsors: Business Objects, DataMirror Corporation, Hummingbird Ltd., Informatica Corporation

Acknowledgements
TDWI would like to thank the many people who contributed to this report. First, we appreciate the many users who responded to our survey, especially those who responded to our requests for phone interviews. Second, we'd like to thank Steve Tracy and Pieter Mimno, who reviewed draft manuscripts and provided feedback, as well as our report sponsors, who reviewed outlines, survey questions, and report drafts. Finally, we would like to recognize TDWI's production team: Denelle Hanlon, Theresa Johnston, Marie McFarland, and Donna Padian.

About TDWI
The Data Warehousing Institute (TDWI), a division of 101communications LLC, is the premier provider of in-depth, high-quality education and training in the business intelligence and data warehousing industry. TDWI supports a worldwide membership program, quarterly educational conferences, regional seminars, onsite courses, leadership awards programs, numerous print and online publications, and a public and private (Members-only) Web site.

This special report is the property of The Data Warehousing Institute (TDWI) and is made available to a restricted number of clients only upon these terms and conditions. TDWI reserves all rights herein. Reproduction or disclosure in whole or in part to parties other than the TDWI client, who is the original subscriber to this report, is permitted only with the written permission and express consent of TDWI. This report shall be treated at all times as a confidential and proprietary document for internal use only. The information contained in the report is believed to be reliable but cannot be guaranteed to be correct or complete.
For more information about this report or its sponsors, and to view the archived report Webinar, please visit: www.dw-institute.com/etlreport/.

© 2003 by 101communications LLC. All rights reserved. Printed in the United States. The Data Warehousing Institute is a trademark of 101communications LLC. Other product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies. The Data Warehousing Institute is a division of 101communications LLC based in Chatsworth, CA.

Table of Contents
Scope, Methodology, and Demographics
Executive Summary: The Role of ETL in BI
ETL in Flux
The Evolution of ETL
Framework and Components
A Quick History
Code Generation Tools
Engine-Based Tools
Code Generators versus Engines
Database-Centric ETL
Data Integration Platforms
ETL versus EAI
ETL Trends and Requirements
Large Volumes of Data
Diverse Data Sources
Shrinking Batch Windows
Operational Decision Making
Data Quality Add-Ons
Meta Data Management
Packaged Solutions
Enterprise Infrastructure
Summary
Build or Buy?
Why Buy?
Maintaining Custom Code
Why Build?
Build and Buy
User Satisfaction with ETL Tools
Challenges in Deploying ETL
Pricing
Recommendations
Buy If You...
Build If You...
Data Integration Platforms
Platform Characteristics
Data Integration Characteristics
High Performance and Scalability
Built-In Data Cleansing and Profiling
Complex, Reusable Transformations
Reliable Operations and Robust Administration
Diverse Source and Target Systems
Update and Capture Utilities
Global Meta Data Management
Summary
ETL Evaluation Criteria
Available Resources
Vendor Attributes
Overall Product Considerations
Design Features
Meta Data Management Features
Transformation Features
Data Quality Features
Performance Features
Extract and Capture Features
Load and Update Features
Operate and Administer Component
Integrated Product Suites
Company Services and Pricing
Conclusion

by Wayne W. Eckerson and Colin White

About the Authors
WAYNE ECKERSON is director of research for The Data Warehousing Institute (TDWI), the leading provider of high-quality, in-depth education and research services to data warehousing and business intelligence professionals worldwide. Eckerson oversees TDWI's Member publications and research services. Eckerson has written and spoken on data warehousing and business intelligence since 1994. He has published in-depth reports and articles about data quality, data warehousing, customer relationship management, online analytical processing (OLAP), Web-based analytical tools, analytic applications, and portals, among other topics. In addition, Eckerson has delivered presentations at industry conferences, user group meetings, and vendor seminars. He has also consulted with many vendor and user firms. Prior to joining TDWI, Eckerson was a senior consultant at the Patricia Seybold Group, and director of the Group's Business Intelligence & Data Warehouse Service, which he launched in 1996.

COLIN WHITE is president of Intelligent Business Strategies.
White is well known for his in-depth knowledge of leading-edge business intelligence, enterprise portal, database, and Web technologies, and how they can be integrated into an IT infrastructure for building and supporting the intelligent business. With more than 32 years of IT experience, White has consulted for dozens of companies throughout the world and is a frequent speaker at leading IT events. He is a faculty member at TDWI and currently serves on the advisory board of TDWI's Business Intelligence Strategies program. He also chairs a portals and Web Services conference. White has co-authored several books, and has written numerous articles on business intelligence, portals, database, and Web technology for leading IT trade journals. He writes a regular column for DM Review magazine entitled "Intelligent Business Strategies." Prior to becoming an analyst and consultant, White worked at IBM and Amdahl.

TDWI Definitions
ETL: Extract, transform, and load (ETL) tools play a critical part in creating data warehouses, which form the bedrock of business intelligence (see below). ETL tools sit at the intersection of myriad source and target systems and act as a funnel to pull together and blend heterogeneous data into a consistent format and meaning and populate data warehouses.

Business Intelligence: TDWI uses the term "business intelligence," or BI, as an umbrella term that encompasses ETL, data warehouses, reporting and analysis tools, and analytic applications. BI projects turn data into information, knowledge, and plans that drive profitable business decisions.

About the TDWI Report Series
The TDWI Report Series is designed to educate technical and business professionals about critical issues in BI.
TDWI's in-depth reports offer objective, vendor-neutral research consisting of interviews with industry experts and a survey of BI professionals worldwide. TDWI in-depth reports are sponsored by vendors who collectively wish to evangelize a discipline within BI or an emerging approach or technology.

Demographics
Position: Corporate IT professional (63%); systems integrator or external consultant (26%); business sponsor or business user (5%); other (6%).
Company Revenues: Less than $10 million (15%); $10 million to $100 million (18%); $100 million to $1 billion (25%); $1 billion to $10 billion (28%); $10 billion+ (15%).
Country: USA (62%); Canada (6%); Scandinavian countries (5%); India (4%); United Kingdom (3%); Mexico (2%); other (18%).
Industry: Financial services, software/Internet, consulting/professional services, manufacturing (non-computer), retail/wholesale/distribution, insurance, telecommunications, healthcare, state/local government, federal government, education, and other industries (3 to 14 percent each).
Organization Structure: Corporate (51%); business unit/division (21%); department/functional group (21%); does not apply (6%).

Report Scope. This report examines the current and future state of ETL tools. It describes how business requirements are fueling the creation of a new generation of products called data integration platforms. It examines the pros and cons of building versus buying ETL functionality and assesses the challenges and success factors involved with implementing ETL tools. It finishes by providing a list of evaluation criteria to assist organizations in selecting ETL products or validating current ones.
Methodology. The research for this report was conducted by interviewing industry experts, including consultants, industry analysts, and IT professionals who have implemented ETL tools. The research is also based on a survey of 1,000+ business intelligence professionals that TDWI conducted in November 2002.

Survey Methodology. TDWI received 1,051 responses to the survey. Of these, TDWI qualified 741 respondents who had both deployed ETL functionality and were either IT professionals or consultants at end-user organizations. Branching logic accounts for the variation in the number of respondents to each question. For example, respondents who built ETL programs were asked a few different questions from those who bought vendor ETL tools. Multi-choice questions and rounding account for totals that don't equal 100 percent.

Survey Demographics. Most respondents were corporate IT professionals who work at large U.S. companies in a range of industries. (See the demographic breakouts above.)

Executive Summary: The Role of ETL in Business Intelligence

ETL is the heart and soul of business intelligence (BI). ETL processes bring together and combine data from multiple source systems into a data warehouse, enabling all users to work off a single, integrated set of data: a single version of the truth. The result is an organization that no longer spins its wheels collecting data or arguing about whose data is correct, but one that uses information as a key process enabler and competitive weapon. In these organizations, BI systems are no longer nice to have, but essential to success.
These systems are no longer stand-alone and separate from operational processing; they are integrated with overall business processes. As a result, an effective BI environment based on integrated data enables users to make strategic, tactical, and operational decisions that drive the business on a daily basis.

Why ETL is Hard. According to most practitioners, ETL design and development work consumes 60 to 80 percent of an entire BI project. With such an inordinate amount of resources tied up in ETL work, it behooves BI teams to optimize this layer of their BI environment.

ETL is so time consuming because it involves the unenviable task of re-integrating the enterprise's data from scratch. Over the span of many years, organizations have allowed their business processes to "dis-integrate" into dozens or hundreds of local processes, each managed by a single fiefdom (e.g., departments, business units, divisions) with its own systems, data, and view of the world.

With the goal of achieving a single version of truth, business executives are appointing BI teams to "re-integrate" what has taken years or decades to pull apart. Equipped with ETL and modeling tools, BI teams are now expected to swoop in like conquering heroes and rescue the organization from information chaos. Obviously, the challenges and risks are daunting.

Mitigating Risk. ETL tools are perhaps the most critical instruments in a BI team's toolbox. Whether built or bought, a good ETL tool in the hands of an experienced ETL developer can speed deployment, minimize the impact of systems changes and new user requirements, and mitigate project risk. A weak ETL tool in the hands of an untrained developer can wreak havoc on BI project schedules and budgets.

ETL in Flux
Given the demands placed on ETL and the more prominent role that BI is playing in corporations, it is no wonder that this technology is now in a state of flux.

More Complete Solutions.
Organizations are now pushing ETL vendors to deliver more "complete BI solutions." Primarily, this means handling additional back-end data management and processing responsibilities, such as providing data profiling, data cleansing, and enterprise meta data management utilities. A growing number of users also want BI vendors to deliver a complete solution that spans both back-end data management functions and front-end reporting and analysis applications.

Better Throughput and Scalability. They also want ETL tools to increase throughput and performance to handle exploding volumes of data and shrinking batch windows. Rather than refresh the entire data warehouse from scratch, they want ETL tools to capture and update changes that have occurred in source systems since the last load.

More Sources, Greater Complexity, Better Administration. ETL tools also need to handle a wider variety of source system data, including Web, XML, and packaged applications. To integrate these diverse data sets, ETL tools must also handle more complex mappings and transformations, and offer enhanced administration to improve reliability and speed deployments.

Near-Real-Time Data. Finally, ETL tools need to feed data warehouses more quickly with more up-to-date information. This is because batch processing windows are shrinking and business users want integrated data delivered on a timelier basis (i.e., the previous day, hour, or minute) so they can make critical operational decisions without delay.

Clearly, the market for ETL tools is changing and expanding.
In response to user requirements, ETL vendors are transforming their products from single-purpose ETL products into multi-purpose data integration platforms. BI professionals need help understanding these market and technology changes as well as how to leverage the convergence of ETL with new technologies to optimize their BI architectures and ensure a healthy return on their ETL investments.

The Evolution of ETL

Framework and Components
Before we examine the future of ETL, we will define the major components in an ETL framework. ETL stands for extract, transform, and load. That is, ETL programs periodically extract data from source systems, transform the data into a common format, and then load the data into the target data store, usually a data warehouse.

As an acronym, however, ETL only tells part of the story. ETL tools also commonly move or transport data between sources and targets, document how data elements change as they move between source and target (i.e., meta data), exchange this meta data with other applications as needed, and administer all run-time processes and operations (e.g., scheduling, error management, audit logs, and statistics). A more accurate acronym might be EMTDLEA!

The diagram below depicts the major components involved in ETL processing. The following bullets describe each component in more detail:

Illustration S1. The diagram shows the core components of an ETL product. Courtesy of Intelligent Business Strategies.
[ETL Processing Framework: source adapters connect legacy applications, databases, and files to extract, transform, and load services; a design manager, meta data repository, transport services, and administration and operations services support design time and runtime; target adapters feed databases and files.]

Design manager: Provides a graphical mapping environment that lets developers define source-to-target mappings, transformations, process flows, and jobs. The designs are stored in a meta data repository.

Meta data management: Provides a repository to define, document, and manage information (i.e., meta data) about the ETL design and runtime processes. The repository makes meta data available to the ETL engine at run time and to other applications.

Extract: Extracts source data using adapters, such as ODBC, native SQL formats, or flat file extractors. These adapters consult meta data to determine which data to extract and how.

Transform: ETL tools provide a library of transformation objects that let developers transform source data into target data structures and create summary tables to improve performance.

Load: ETL tools use target data adapters, such as SQL or native bulk loaders, to insert or modify data in target databases or files.

Transport services: ETL tools use network and file protocols (e.g., FTP) to move data between source and target systems, and in-memory protocols (e.g., data caches) to move data between ETL run-time components.

Administration and operation: ETL utilities let administrators schedule, run, and monitor ETL jobs as well as log all events, manage errors, recover from failures, and reconcile outputs with source systems.

The components above come with most vendor-supplied ETL tools, but they can also be built by data warehouse developers. A majority of companies have a mix of packaged and homegrown ETL applications.
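To make the extract, transform, and load stages concrete, the sketch below shows a minimal hand-coded pipeline of the kind these tools automate. It is only an illustration: the table names, columns, and transformation rules are hypothetical, and two SQLite databases stand in for an operational source system and a data warehouse.

```python
import sqlite3

# Hypothetical source and target systems.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.executescript("""
    CREATE TABLE orders (order_id INTEGER, cust TEXT, amount_cents INTEGER);
    INSERT INTO orders VALUES (1, ' acme ', 1250), (2, 'GLOBEX', 99000);
""")
target.execute(
    "CREATE TABLE fact_orders (order_id INTEGER, customer TEXT, amount_usd REAL)"
)

# Extract: pull raw records from the source system.
rows = source.execute("SELECT order_id, cust, amount_cents FROM orders").fetchall()

# Transform: blend heterogeneous data into a consistent format and meaning
# (here: trimmed, uppercased customer names and dollar amounts).
clean = [(oid, cust.strip().upper(), cents / 100.0) for oid, cust, cents in rows]

# Load: insert the conformed records into the warehouse table.
target.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", clean)
target.commit()

print(target.execute("SELECT * FROM fact_orders").fetchall())
# [(1, 'ACME', 12.5), (2, 'GLOBEX', 990.0)]
```

A vendor tool replaces each of these steps with configurable components: source and target adapters for the two connections, a transformation library for the cleanup logic, and a meta data repository recording how each element changed in transit.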
In some cases, they use different tools in different projects, or they use custom code to augment the functionality of an ETL product. The pros and cons of building and buying ETL components are discussed later in the report. (See "Build or Buy?")

A Quick History

Code Generation Tools
In the early 1990s, most organizations developed custom code to extract and transform data from operational systems and load it into data warehouses. In the mid-1990s, vendors recognized an opportunity and began shipping ETL tools designed to reduce or eliminate the labor-intensive process of writing custom ETL programs.

Early vendor ETL tools provided a graphical design environment that generated third-generation language (3GL) programs, such as COBOL. Although these early code-generation tools simplified ETL development work, they did little to automate the runtime environment or lessen code maintenance and change control work. Often, administrators had to manually distribute and manage compiled code, schedule and run jobs, or copy and transport files.

Engine-Based Tools
To automate more of the ETL process, vendors began delivering "engine-based" products in the mid to late 1990s that employed proprietary scripting languages running within an ETL or DBMS server. These ETL engines use language interpreters to process ETL workflows at runtime. The ETL workflows defined by developers in the graphical environment are stored in a meta data repository, which the engine reads at runtime to determine how to process incoming data.

Although this interpretive approach more tightly unifies design and execution environments, it doesn't necessarily eliminate all custom coding and maintenance.
To handle complex or unique requirements, developers often resort to coding custom routines and exits that the tool accesses at runtime. These user-developed routines and exits increase complexity and maintenance, and therefore should be used sparingly.

All Processing Occurs in the Engine. Another significant characteristic of an engine-based approach is that all processing takes place in the engine, not on source systems (although hybrid architectures exist). The engine typically runs on a Windows or UNIX machine and establishes direct connections to source systems. If the source system is non-relational, administrators use a third-party gateway to establish a direct connection or create a flat file to feed the ETL engine.

Some engines support parallel processing, which enables them to process ETL workflows in parallel across multiple hardware processors. Others require administrators to manually define a parallel processing framework in advance (using partitions, for example), a much less dynamic environment.

Code Generators versus Engines
There is still a debate about whether ETL engines or code generators, which have improved since their debut in the mid-1990s, offer the best functionality and performance.

Benefits of Code Generators. "The benefit of code generators is that they can handle more complex processing than their engine-based counterparts," says Steve Tracy, assistant director of information delivery and application strategy at Hartford Life Insurance in Simsbury, CT. "Many engine-based tools typically want to read a record, then write a record, which limits their ability to efficiently handle certain types of transformations."
Consequently, code generators also eliminate the need for developers to maintain user-written routines and exits to handle complex transformations in ETL workflows. This avoids creating "blind spots" in the tools and their meta data.

In addition, code generators produce compiled code to run on various platforms. Compiled code is not only fast, it also enables organizations to distribute processing across multiple platforms to optimize performance. "Because the [code generation] product ran on multiple platforms, we did not have to buy a huge server," says Kevin Light, managing consultant of business intelligence solutions at EDS Canada. "Consequently, our clients' overall capital outlay [for a code generation product] was less than half what it would cost them to deploy a comparable engine-based product."

Benefits of Engines. Engines, on the other hand, concentrate all processing in a single server. Although this can become a bottleneck, administrators can optimally configure the engine platform to deliver high performance. Also, since all processing occurs on one machine, administrators can more easily monitor performance and quickly upgrade capacity if needed to meet service-level agreements.

This approach also removes political obstacles that arise when ETL administrators try to distribute transformation code to run on various source systems. By offloading transformation processing from source systems, ETL engines interfere less with critical runtime operations.

Enhanced Graphical Development. In addition, most engine-based ETL products offer visual development environments that make them easier to use. These graphical workspaces enable developers to create ETL workflows comprised of multiple ETL objects for defining data mappings and transformations. (See Illustration S2.)
Although code generators also offer visual development environments, they often aren't as easy to use as engine-based tools, lengthening overall development times, according to many practitioners.

Database-Centric ETL
Today, several DBMS vendors embed ETL capabilities in their DBMS products (as well as OLAP and data mining capabilities). Since these vendors offer ETL capabilities at little or no extra charge, organizations are seriously exploring this option because it promises to reduce costs and simplify their BI environments.

In short, users are now asking, "Why should we purchase a third-party ETL tool when we can get ETL capabilities for free from our database vendor of choice? What are the additional benefits of buying a third-party ETL tool?"

To address these questions, organizations first need to understand the level and type of ETL processing that database vendors support. Today, there are three basic groups:

Cooperative ETL. Here, third-party ETL tools can leverage common database functionality to perform certain types of ETL processing. For example, third-party ETL tools can leverage stored procedures and enhanced SQL to perform transformations and aggregations in the database where appropriate. This enables third-party ETL tools to optimize performance by exploiting the optimization, parallel processing, and scalability features of the DBMS. It also improves recoverability since stored procedures are maintained in a common recoverable data store.

Complementary ETL. Some database vendors now offer ETL functions that mirror features offered by independent ETL vendors.
For example, materialized views create summary tables that can be automatically updated when detailed data in one or more associated tables changes. Summary tables can speed up query performance by several orders of magnitude. In addition, some DBMS vendors can issue SQL statements that interact with Web Services-based applications or messaging queues, which are useful when building and maintaining near-real-time data warehouses.

Competitive ETL. Most database vendors offer graphical development tools that exploit the ETL capabilities of their database products. These tools provide many of the features offered by third-party ETL tools, but at zero cost, or for a license fee that is a fraction of the price of independent tools. At present, however, database-centric ETL solutions vary considerably in quality and functionality. Third-party ETL tools still offer important advantages, such as the range of data sources supported, transformation power, and administration. Today, database-centric ETL products are useful for building departmental data marts, but this will change over time as DBMS vendors enhance their ETL capabilities.

Illustration S2. Many ETL tools provide a graphical interface for mapping sources to targets and managing complex workflows.

In summary, database vendors currently offer ETL capabilities that both enhance and compete with independent ETL tools. We expect database vendors to continue to enhance their ETL capabilities and compete aggressively to increase their share of the ETL market.

Data Integration Platforms
During the past several years, business requirements for BI projects have expanded dramatically, placing new demands on ETL tools. These requirements are outlined in the following section (see "ETL Trends and Requirements").
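The summary-table pattern behind materialized views, described above, can be sketched independently of any particular DBMS. SQLite has no materialized views, so this illustration (with hypothetical table names) emulates one with a trigger that keeps an aggregate table in step with its detail table; a real materialized view performs this refresh inside the database engine.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Detail table and its "materialized" summary.
    CREATE TABLE sales (region TEXT, amount REAL);
    CREATE TABLE sales_summary (region TEXT PRIMARY KEY, total REAL);

    -- Emulate automatic refresh: each new detail row is folded into the
    -- summary, as a materialized view would do when the base table changes.
    CREATE TRIGGER maintain_summary AFTER INSERT ON sales
    BEGIN
        INSERT OR IGNORE INTO sales_summary VALUES (NEW.region, 0);
        UPDATE sales_summary
           SET total = total + NEW.amount
         WHERE region = NEW.region;
    END;
""")

db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("EAST", 100.0), ("WEST", 50.0), ("EAST", 25.0)])

# Queries read the small summary table instead of scanning all detail rows.
print(db.execute("SELECT region, total FROM sales_summary ORDER BY region").fetchall())
# [('EAST', 125.0), ('WEST', 50.0)]
```

The speedup the report describes comes from exactly this trade: a little extra work at load time buys queries that touch a handful of summary rows instead of millions of detail rows.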
Consequently, ETL vendors have begun to dramatically change their products to meet these requirements. The result is a new generation of ETL tool that TDWI calls a data integration platform.

As a platform, this emerging generation of ETL tools provides greater performance, throughput, and scalability to process larger volumes of data at higher speeds. To deal with shrinking batch windows, these platforms also load data warehouses more quickly and reliably, using change data capture techniques, continuous processing, and improved runtime operations. The platforms also provide expanded functionality, especially in the areas of data quality, transformation power, and administration.

As a data integration hub, these products connect to a broader array of databases, systems, and applications as well as other integration hubs. They capture and process data in batch or real time using either a hub-and-spoke or peer-to-peer information delivery architecture. They coordinate and exchange meta data among heterogeneous systems to deliver a highly integrated environment that is easy to use and adapts well to change.

Solution Sets. Ultimately, data integration platforms are designed to meet a larger share of an organization's BI requirements. To do this, some ETL vendors are extending their product lines horizontally, adding data quality tools and near-real-time data capture facilities to provide a complete data management solution. Others are extending vertically, adding analytical tools and applications to provide a complete BI solution.

ETL versus EAI
To be clear, ETL tools are just one of several technologies vying to become the preeminent enterprise data integration platform. One of the most important technologies is enterprise application integration (EAI) software, such as BEA Systems, Inc.'s WebLogic Platform and Tibco Software Inc.'s ActiveEnterprise.
EAI software enables developers to create real-time, event-driven interfaces among disparate transaction systems. For example, during the Internet boom, companies flocked to EAI tools to connect e-commerce systems with back-end inventory and shipping systems to ensure that Web sites accurately reflected product availability and delivery times.

Although EAI software is event-driven and supports transaction-style processing, it does not generally have the same transformation power as an ETL tool. On the other hand, most ETL tools do not support real-time data processing. To overcome these limitations, some ETL and EAI vendors are now partnering to provide the best of both approaches. EAI software captures data and application events in real time and passes them to the ETL tool, which transforms the data and loads it into the BI environment.

Already, some ETL vendors have begun incorporating EAI functionality into their core engines. These tools contain adapters to operational applications and operate in an "always awake" mode so they can receive and process data and events generated by the applications in real time.

Future Reality. It won't be long before all ETL and EAI vendors follow suit, integrating both ETL and EAI capabilities within a single product. The winners in this race will have a substantial advantage in vying to deliver a complete, robust data integration platform.

ETL Trends and Requirements

As mentioned earlier, business requirements are fueling the evolution of ETL into data integration platforms.
Business users are demanding access to more timely, detailed, and relevant information to inform critical business decisions and drive fast-moving business processes.

Large Volumes of Data

More business users today want access to more granular data (e.g., transactions) and more years of history across more subject areas than ever before. Not surprisingly, data warehouses are exploding in size and scope. Terabyte data warehouses, a rarity several years ago, are now fairly commonplace. The largest data warehouses exceed 50 terabytes of raw data, and the size of these data warehouses is putting enormous pressure on ETL administrators to speed up ETL processing. Our survey indicates that the percentage of organizations loading more than 500 gigabytes will triple in 18 months, while those loading less than 1 gigabyte will decrease by a third. (See Illustration 1.)

Illustration 1. The average data load will increase significantly in the next 18 months. Based on 756 respondents.

Eighteen-Month Plans for Data Load Volumes
                 TODAY    IN 18 MONTHS
Less than 1GB     59%        40%
1GB to 500GB      38%        50%
500GB+             3%        10%

Diverse Data Sources

One reason for increased data volumes is that business users want the data warehouse to cull data from a wider variety of systems. According to our survey, organizations now extract data from an average of 12 distinct data sources. This average will inexorably increase over time as organizations expand their data warehouses to support more subject areas and groups in the organization. Although almost all companies use ETL to extract data from relational databases, flat files, and legacy systems, a significant percentage now want to extract data from application packages, such as SAP R/3 (39 percent), XML files (15 percent), Web-based data sources (15 percent), and EAI software (12 percent). (See Illustration 2.)

Spreadmarts. In addition, many respondents wrote in the "other" category that they want ETL tools to extract data from Microsoft Excel and Access files. In many companies, critical data is locked up in these personal data stores, which are controlled by executives or managers running different divisions or departments. Each group populates these data stores from different sources at different times using different rules and definitions. As a result, these spreadsheets become surrogate independent data marts, or "spreadmarts," as TDWI calls them. The proliferation of spreadmarts precipitates organizational feuds ("dueling spreadsheets") that can paralyze an organization and drive a CEO to the brink of insanity.

Shrinking Batch Windows

With the expansion in data volumes, many organizations are finding it virtually impossible to load a data warehouse in a single batch window. There are no longer enough hours at night or on the weekend to finish loading data warehouses containing hundreds of gigabytes or terabytes of data.

24x7 Requirements. Compounding the problem, batch windows are shrinking or becoming non-existent. This is especially true in large organizations whose data warehouses and operational systems span regions or continents and must be available around the clock. There is no longer any systems "downtime" in which to execute time-consuming extracts and loads.

Variable Update Cycles. Finally, the growing number and diversity of sources that feed the data warehouse make it difficult to coordinate a single batch load. Each source system operates on a different schedule, and each contains data with different levels of "freshness" or relevancy for different groups of users.
For example, retail purchasing managers find little value in sales data if they can't view it within one day, if not immediately. Accountants, however, may want to examine general ledger data only at the end of the month.

Operational Decision Making

Another reason that ETL tools must support continuous extract/load processing is that users want timelier data. Data warehouses traditionally support strategic or tactical decision making based on historical data compiled over several months or years. Typical strategic or tactical questions are:

What are our year-over-year trends?
How close are we to meeting this month's revenue goals?
If we reduce prices, what impact will it have on our sales and inventory for the next three months?

Illustration 2. Types of data sources that ETL programs process. Multi-choice question, based on 755 respondents.

Data Sources
Relational databases: 89%
Flat files: 81%
Mainframe/legacy systems: 65%
Packaged applications: 39%
Replication or change data capture utilities: 15%
XML: 15%
Web: 15%
EAI/messaging software: 12%
Other: 4%

Operational BI. However, business users increasingly want to make operational decisions based on yesterday's or today's data. They are now asking:

Based on yesterday's sales, which products should we ship to which stores at what prices to maximize today's revenues?
What is the most profitable way to reroute packages off a truck that broke down this morning?
What products and discounts should I offer to the customer I'm currently speaking with on the phone?

According to our survey, more than two-thirds of organizations already load their data warehouses on a nightly basis. (See Illustration 3.)
However, in the next 18 months, the number of organizations that will load their data warehouses multiple times a day will double, and the number that will load data warehouses in near real time will triple!

Illustration 3. Data warehouse load frequency. Based on 754 respondents.

Data Warehouse Load Frequency
                         TODAY    IN 18 MONTHS
Monthly                   32%        27%
Weekly                    34%        29%
Daily/nightly             69%        65%
Multiple times per day    15%        30%
Near real time             6%        19%

To meet operational decision-making requirements, ETL processing will need to support two types of processing: (1) large batch ETL jobs that involve extracting, transforming, and loading a significant amount of data, and (2) continuous processing that captures data and application events from source systems and loads them into data warehouses in rapid intervals or near real time. Most organizations will need to merge these two types of processing, which is complicated because of the numerous interdependencies that administrators must manage to ensure consistency of information in the data warehouse and reports.

Data Quality Add-Ons

Many organizations want to purchase complete solutions, not technology or tools for building infrastructure and applications.

Add-on Products. When asked which add-on products from ETL vendors are "very important," users expressed significant interest in data cleansing and profiling tools. (See Illustration 4.) This is not surprising, since many BI projects fail because the IT department or (more than likely) an outside consultancy underestimates the quality of source data.

Code, Load, and Explode. A common scenario is the "code, load, and explode" phenomenon. This is where ETL developers code the extracts and transformations, then start processing, only to discover an unacceptably large number of errors due to unanticipated values in the source data files. They fix the errors and rerun the ETL process, only to find more errors, and so on. This ugly scenario repeats itself until project deadlines and budgets become imperiled and angry business sponsors halt the project.

"In my experience, 10 percent [of the time and cost associated with ETL] is learning the ETL tool the first time out and making it conform to your will," said John Murphy, an IT professional, in a newsgroup posting. "Ninety percent is the interminable iterative process of learning just how little you know about the source data, how dirty that data really is, how difficult it is to establish a cleansing policy that results in certifiably correct data on the target that matches the source, and how long it takes to re-work all the assumptions you made at the outset while you were still an innocent."

The problem with current ETL tools is that they assume the data they receive is clean and consistent. They can't assess the consistency or accuracy of source data, and they can't handle specialized cleansing routines, such as name and address scrubbing. To meet these needs, ETL vendors are partnering with or acquiring data quality vendors to integrate specialized data cleansing routines within ETL workflows.

Meta Data Management

As the number and variety of source and target systems proliferate, BI managers are seeking automated methods to document and manage the interdependencies among data elements in the disparate systems. In short, they want a global meta data management solution. As the hub of a BI project, an ETL tool is well positioned to document, manage, and coordinate information about data within data modeling tools, source systems, data warehouses, data marts, analytical tools, portals, and even other ETL tools and repositories.
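Returning briefly to data quality: even a rudimentary profiling pass can head off the "code, load, and explode" scenario described earlier by checking source values against expected domains before any load runs. The sketch below is illustrative only; the column names and validation rules are invented, and real profiling tools infer far richer statistics.

```python
def profile(rows, rules):
    """Report rows violating expected-value rules before loading."""
    violations = []
    for i, row in enumerate(rows):
        for col, check in rules.items():
            if not check(row.get(col)):
                violations.append((i, col, row.get(col)))
    return violations

# Hypothetical source extract with two unanticipated values
source = [
    {"age": 34, "state": "CA"},
    {"age": -1, "state": "CA"},   # out-of-range age
    {"age": 28, "state": "??"},   # malformed state code
]
rules = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "state": lambda v: isinstance(v, str) and v.isalpha() and len(v) == 2,
}
bad = profile(source, rules)
```

Running such a pass before coding the transformations surfaces the dirty values up front, rather than mid-load when deadlines are already at risk.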
Many users would like to see ETL tools play a more pivotal role in managing global meta data, although others think ETL tools will always be inherently unfit to play the role of global traffic cop. "Certainly, ETL vendors should store meta data for their own domain, but if they start trying to become a global meta data repository for all BI tools, they will have trouble," says Tracy from Hartford Life. "I'd prefer they focus on improving their performance, manageability, impact analysis reports, and ability to support complex logic."

Enforce Data Consistency. Nevertheless, large companies want a global meta data repository to manage and distribute "standard" data definitions, rules, and other elements within and throughout a network of interconnected BI environments. This global repository could help ensure data consistency and a "single version of the truth" throughout an organization.

Although ETL vendors often pitch their global meta data management capabilities, most tools today fall short of user expectations. (And even single-vendor, all-in-one BI suites fall short.) When asked in an open-ended question, "What would make your ETL tool better?" many survey respondents wrote "better meta data management" or "enhanced meta data integration."

Illustration 4. A significant percentage of users want ETL tools to offer data cleansing and profiling capabilities. Based on 740 respondents.

Rating the Importance of Add-On Products ("very important")
Data cleansing tools: 43%
Data profiling tools: 24%
Packaged data models: 21%
Analytic tools: 10%
Packaged analytic reports: 9%

"I don't know if ETL tools should be used for a [BI] meta data repository, but no one else is stepping up to the plate," says EDS Canada's Light. Light says the gap between expectation and reality sometimes puts him, as a consultant, in a tough position. "Many clients think the ETL tools can do it all, and we [as consultants] end up as the meat in a sandwich trying to force a tool to do something it can't."

Packaged Solutions

Besides add-on products, an increasing number of organizations are looking to buy comprehensive, off-the-shelf packages that provide near-instant insight into their data. For instance, more than a third of respondents (39 percent) said they are more attracted to vendors that offer ETL as part of a complete application or analytic suite. (See Illustration 5.)

Kerr-McGee Corporation, an energy and chemicals firm in Oklahoma City, purchased a packaged solution to support its Oracle Financials application. "We had valid numbers within a week, which is incredible," says Brian Morris, a data warehousing specialist at the firm. "We don't have a lot of time, so anything that is pre-built is helpful."

A packaged approach excels, as in Kerr-McGee's case, when there is a single source, which is itself a packaged application with a documented data model and API. The package Kerr-McGee used also supplied analytic reports, which the firm did not use since it already had reports defined in another reporting tool.

Some Skepticism.
However, there is skepticism among some users about ETL vendors who want to deliver end-to-end BI packages. "We want ETL vendors to focus on their core competency, not their brand," says Cynthia Connolly, vice president of application development at Alliance Capital Management. "We want them to enhance their ETL products, not branch off into new areas."

Illustration 5. Analytic suites. Based on 747 respondents.

Does a vendor's ability to offer ETL as part of a complete analytic suite make the vendor and its products more attractive to your team?
Yes: 39%
No: 43%
Not sure: 18%

Enterprise Infrastructure

Multi-Faceted Usage. Finally, organizations also want to broaden their use of ETL to support additional applications besides BI. (See Illustration 6.) In essence, they want to standardize on a single infrastructure product to support all their enterprise data integration requirements. A majority of organizations already use ETL tools for BI, data migration, and conversion projects. But a growing number will use ETL tools to support application integration and master reference data projects in the next 18 months.

SUMMARY

As data warehouses amass more data from more sources, BI teams need ETL tools to process data more quickly, efficiently, and reliably. They also need ETL tools to assume a larger role in coordinating the flow of data among source and target systems and in managing data consistency and integrity. Finally, many want ETL vendors to provide more complete solutions, either to meet an organization's enterprise infrastructure needs or to deliver an end-to-end BI solution.

Build or Buy?

First Decision. As BI teams examine their data integration needs, the first decision they need to make is whether to build or buy an ETL tool.
Despite the ever-improving functionality offered by vendor ETL tools, there are still many organizations that believe it is better to hand-write ETL programs than to use off-the-shelf ETL software. These companies point to the high cost of many ETL tools and the abundance of programmers on their staff to justify their decision.

Strong Market to Buy. Nonetheless, the vast majority of BI teams (82 percent) have purchased a vendor ETL tool, according to our survey. Overall, 45 percent use vendor tools exclusively, while 37 percent use a combination of tools and custom ETL code. Only 18 percent exclusively write ETL programs from scratch. (See Illustration 7.)

Illustration 6. Besides BI, organizations want to use ETL products for a variety of data integration needs. Based on 740 responses.

Current and Planned Uses of ETL (today / in 18 months)
Data warehousing/business intelligence: 81% / 87%
Data migration/conversion: 40% / 45%
Application integration: 27% / 29%
Reference data integration (e.g., customer hubs): 10% / 11%

Illustration 7. The vast majority of organizations building data warehouses have purchased an ETL tool from a vendor. Based on 761 responses.

Build or Buy?
Buy only: 45%
Build & buy: 37%
Build only: 18%

Why Buy?

The primary reason that organizations purchase vendor ETL tools is to minimize the time and cost of developing and maintaining proprietary code.

"We need to expand the scale and scope of our data warehouse, and we can't do it by hand-cranking code," says Richard Bowles, data warehouse developer at Safeway Stores plc in London. Safeway is currently evaluating vendor ETL tools in preparation for the expansion of its data warehouse.
The problem with Safeway's custom COBOL programs, according to Bowles, is that it takes too much time to test each program whenever a change is made to a source system or target data warehouse. First, you need to analyze which programs are affected by a change, then check business rules, rewrite the code, update the COBOL copybooks, and finally test and debug the revised programs, he says. Custom code equals high maintenance. "We want our team focused on building new routines to expand the data warehouse, not preoccupied with maintaining existing ones," Bowles says.

Tools that Speed Deployment Save Money. Many BI teams are confident that vendor ETL tools can better help them deliver projects on time and within budget. These managers ranked "speed of deployment" as the number one reason they decided to purchase an ETL tool, according to our survey. (See Table 1.)

Ken Kirchner, a data warehousing manager at Werner Enterprises, Inc., in Omaha, NE, is using an ETL tool to replace a team of outside consultants that botched the firm's first attempt at building a data warehouse. "Our consultants radically underestimated what needed to be done. It took them 15 months to code 130 stored procedures," says Kirchner. "A [vendor ETL] tool will reduce the number of people, time, and money we need to deliver our data warehouse."

EDS Canada's Light says that in many situations clients have older ETL environments where the business rules are encapsulated in COBOL or other 3GLs, making it impossible to do impact analysis. Moving to an engine-based tool allows the business rules to be maintained in a single location, which facilitates maintenance.

Table 1. Ranking of reasons for buying an ETL tool.

Reasons for Buying an ETL Tool
1. To deploy ETL more quickly
2. It had all the functionality we need
3. Help us manage meta data
4. Build a data integration infrastructure

Why Build?

Cheaper and Faster. However, some organizations believe it is cheaper and quicker to code ETL programs than to use a vendor ETL tool. (See Table 2.)

"Custom software is what saves us time and money, even in maintenance, because it's geared to our business," says Mike Galbraith, director of Global Business Systems Development at Tyco Electronics in Harrisburg, PA. The key for ETL programs, or any software development project, is to write good code, Galbraith says. "We use software engineering techniques. We write our code from specifications and a meta data model. Everything is self-documenting, and that has saved us a lot."

Hubert Goodman, director of business intelligence at Cummins, Inc., a leading maker of diesel engines, says the cost to maintain his group's custom ETL code is less than the annual maintenance fee and training costs that another department is paying to use a vendor ETL tool. Goodman's team uses object-oriented techniques to write efficient, maintainable PL/SQL code. To keep costs down and speed turnaround times, he outsources ETL code development overseas to programmers in India. (Some practitioners say it is impractical to outsource GUI-based ETL development because of the interactive nature of such tools.)

The Hard Job of ETL. Several BI managers also said vendor ETL tools don't address the really challenging aspects of migrating source data into a data warehouse, specifically how to identify and clean dirty data, build interfaces to legacy systems, and deliver near-real-time data.
For these managers, the math doesn't add up: why spend $200,000 or more to purchase an ETL tool that only handles the "easy" work of mapping and transforming data?

Build and Buy

Slow Migration to ETL Packages. Despite these arguments, there is a slow and steady migration away from writing custom ETL programs. More than a quarter (26 percent) of organizations that use custom code for ETL will soon replace it with a vendor ETL tool. Another 23 percent are planning to augment their custom code with a vendor ETL tool. (See Illustration 8.)

When asked why they are switching, many BI managers said that vendor ETL tools didn't exist when they first implemented their data warehouse. Others said they initially couldn't afford a vendor ETL tool or didn't have time to investigate new tools or train team members. Typically, these teams leave existing hand code in place and use the ETL package to support a new project or source system. Over time, they displace custom code with the vendor ETL tool as their team gains experience with the tool and the tool proves it can deliver adequate performance and functionality.

Table 2. Ranking of reasons for coding ETL programs. Based on respondents who code ETL exclusively.

Reasons for Coding ETL Programs
1. Less expensive
2. Our culture encourages in-house development
3. Build a data integration infrastructure
4. Deploy more quickly

Mix and Match. Some teams, however, don't plan to abandon custom code. Their strategy is to use whatever approach makes sense in each situation. These teams often use custom code to handle processes that vendor ETL tools don't readily support out of the box.
For example, Preetam Basil, a Priceline.com data warehousing architect, points out that Priceline.com uses a vendor ETL tool for the bulk of its ETL processing, but it has developed UNIX and SQL code to extract and scrub data from its voluminous Web logs.

Illustration 8. Almost half (49 percent) of respondents plan to replace or augment their custom code with a vendor ETL tool. Based on 415 respondents.

Eighteen-Month Plans for Your ETL Custom Code
Continue to enhance the custom code: 43%
Replace it with a vendor ETL tool: 26%
Augment it with a vendor ETL tool: 23%
Outsource it to a third party: 2%
Other: 7%

User Satisfaction with ETL Tools

Overall, most BI teams are generally pleased with their ETL tools. According to our survey, 28 percent of data warehousing teams are "very satisfied" with their vendor ETL tool, and 51 percent are "mostly satisfied." Vendor ETL tools get a slightly higher satisfaction rating than hand-coded ETL programs. (See Illustration 9.)

A majority of organizations (52 percent) plan to upgrade to the next version of their ETL tool within 18 months, a sign that the team is committed to the product. Another 22 percent will maintain the product "as is" during the period. A small fraction plan to replace their vendor ETL tool with another vendor ETL tool (7 percent) or with custom code (2 percent). (See Illustration 10.)

Illustration 9. Vendor ETL tools have a higher satisfaction rating than custom ETL programs for organizations that use either vendor ETL tools or custom code exclusively. Based on 341 and 134 responses, respectively.

Satisfaction Rating: Vendor Tools versus Custom Code
                      Vendor ETL tool    Custom code
Very satisfied             28%              24%
Mostly satisfied           51%              46%
Somewhat satisfied         16%              12%
Not very satisfied          5%              18%
Very few organizations have abandoned vendor ETL tools. Most have purchased one tool and continue to use it in a production environment. (See Illustrations 11a and 11b.) In other words, there is very little ETL "shelfware."

Illustration 11a. Most BI teams use only one ETL tool. Based on 617 responses.

ETL Products in Production (number of ETL tools)
0: 6%   1: 60%   2: 25%   3: 6%   4+: 3%

Illustration 11b. Very few ETL tools go unused. Based on 617 responses.

ETL Shelfware: How many vendor ETL tools has your team purchased that it does not use?
0: 81%   1: 15%   2: 4%   3: 1%

Illustration 10. Most teams are committed to their current vendor ETL tool. Based on 622 responses.

Eighteen-Month Plans for Your Vendor ETL Tool
Upgrade to the next or latest version: 52%
Maintain as is: 22%
Augment it with custom code or functions: 11%
Replace it with a tool from another vendor: 7%
Augment it with a tool from another vendor: 3%
Replace it with hand-written code: 2%
Outsource it to a third party: 0%
Other: 4%

Challenges in Deploying ETL

Despite this rosy picture, BI teams encounter numerous challenges when deploying vendor ETL tools. The top two challenges are "ensuring adequate data quality" and "understanding source data," according to our survey. (See Table 3.) Although these problems are not necessarily a function of the ETL tool or code, they do represent an opportunity for ETL vendors to expand the capabilities of their toolsets to meet these needs.

Complex Transformations. The next most significant challenge is designing complex transformations and mappings. "The goal of [an ETL] tool is to minimize the amount of code you have to write outside of the tool to handle transformations," says Alliance Capital Management's Connolly. Unfortunately, most ETL tools fall short in this area. "We wish there was more flexibility in writing transformations," says Brian Koscinski, ETL project lead at American Eagle Outfitters in Warrendale, PA. "We often need to drop a file into a temporary area and use our own code to sort the data the way we want it before we can load it into our data warehouse."

According to our survey, only a small percentage of teams (11 percent) are currently extending vendor ETL tools with custom code. The rest are using a "reasonable amount" of custom code, or none at all, to augment their vendor ETL tools.

Skilled ETL Developers. Another significant challenge, articulated by many survey respondents and users interviewed for this report, is finding and training skilled ETL programmers. Many say the skills of ETL developers and their familiarity with the tools can seriously affect the outcome of a project.

Priceline.com's Basil agrees. "Initially, we didn't have ETL developers who were knowledgeable about data warehousing, database design, or ETL processes. As a result, they built inefficient processes that took forever to load."

EDS Canada's Light says it's imperative to hire experienced ETL developers at the outset of a project, no matter what the cost, to establish development standards. "Otherwise, the problems encountered down the line will be much larger and harder to grapple with." Others add that it's wise to hire at least two ETL developers in case one leaves in midstream.

Most BI managers are frustrated by the time it takes developers to learn a new tool (about three months) and then become proficient with it (12 or more months).
Although the skill and experience of a developer count heavily here, a good ETL development environment, along with high-quality training and support offered by the vendor, can accelerate the learning curve.

The Temptation to Code. Developers often complain that it takes much longer to learn the ETL tool and manipulate transformation objects than simply to code the transformations from scratch. And many give in to temptation. But perseverance pays off. Developers are five to six times more productive once they learn an ETL tool compared to using a command-line interface, according to Pieter Mimno.

Table 3. Data quality issues are the top two challenges facing ETL developers.

Ranking of ETL Challenges
1. Ensuring adequate data quality
2. Understanding source data
3. Creating complex mappings
4. Creating complex transformations
5. Ensuring adequate performance
6. Providing access to meta data
7. Finding skilled ETL programmers
8. Collecting and maintaining meta data
9. Ensuring adequate scalability
10. Loading data in near real time
11. Ensuring adequate reliability
12. Integrating with third-party tools or applications
13. Creating and integrating user-defined functions
14. Integrating with load utilities
15. Debugging programs

Pricing

Sticker Shock. Perhaps the biggest hurdle with vendor ETL tools is their price tag. Many data warehousing teams experience sticker shock when vendor ETL salespeople quote license fees. Many teams weigh the perceived value of a tool against its list price, which often exceeds $200,000 for many tools, and walk away from the deal.

"For $200,000 you can buy at least 4,000 man-hours of development, and the tool's ongoing maintenance fee pays for one-half of a head count," says data warehousing consultant Darrell Piatt.
The price of ETL tools is a difficult pill to swallow for companies that don't realize that a data warehouse is different from a transaction system, says Bryan LaPlante, senior consultant with Pragmatek, a Minneapolis consultancy that works with mid-market companies. These companies don't realize that an ETL tool makes it easier to keep up with the continuous flow of changes that are part and parcel of every data warehousing program, he says.

But even companies that understand the importance of ETL tools find it difficult to open their checkbooks. "Although ETL tools are much more functional than they were five years ago, most are still not worth buying at list price," says Piatt.

Spread the Wealth. To justify the steep costs, many teams sell the tool internally as an infrastructure component that can support multiple projects. "The initial purchase is expensive, but it is easier to justify when you use it for multiple projects," says Heath Hatchett, group manager of business intelligence engineering at Intuit in Mountain View, CA. Others negotiate with vendors until they get the price they want. "It's a buyers' market right now," says Safeway's Bowles.

Given the current economy and the increasing maturity of the ETL market, the list prices for ETL products are starting to fall. The downward pressure on prices will continue for the next few years as software giants bundle robust ETL programs into core data management products, sometimes at no extra cost, and BI vendors package ETL tools inside comprehensive BI suites. In addition, a few start-ups now offer small-footprint products at extremely affordable prices.

Recommendations

Every BI project team has different levels and types of resources to meet different business requirements. Thus, there is no right or wrong decision about whether to build or buy ETL capabilities. But here are some general recommendations to guide your decision, based on our research.
Buy If You...

Want to Minimize Hand Coding. ETL tools enable developers to create ETL workflows and objects using a graphical workbench that minimizes hand coding and the problems it can create.

Want to Speed Project Deployment. In the hands of experienced ETL programmers, an ETL tool can boost productivity and shorten deployment time significantly. (However, be careful: these productivity gains don't happen until ETL programmers have worked with a tool for at least a year.)

Want to Maintain ETL Rules Transparently. Rather than bury ETL rules within code, ETL tools store target-to-source mappings and transformations in a meta data repository accessible via the tool's GUI and/or an API. This reduces downstream maintenance costs and insulates the team when key ETL programmers leave.

Have Unique Requirements the Tool Supports. If the tool supports your unique requirements, such as providing adapters to a packaged application your company just purchased, it may be worthwhile to invest in a vendor ETL tool.

THE DATA WAREHOUSING INSTITUTE  www.dw-institute.com  Evaluating ETL and Data Integration Platforms

Need to Standardize Your BI Architecture. Purchasing a scalable ETL tool with sufficient functionality is a good way to enable your organization to establish a standard infrastructure for BI and other data management projects. Reusing the ETL tool in multiple projects helps to justify purchasing the tool.

Build If You...

Have Skilled Programmers Available. Building ETL programs makes sense when your IT department has skilled programmers available who understand BI and are not bogged down delivering other projects.

Practice Sound Software Engineering Techniques. The key to coding ETL is writing efficient code that is designed from a common meta model and rigorously documented.

Have a Solid Relationship with a Consulting Firm.
Consultancies that you've worked with successfully in the past and that understand data warehousing can efficiently write and document ETL programs. (Just make sure they teach you the code before they leave!)

Have Unique Functionality. If you have a significant number of complex data sources or transformations that ETL tools can't support out of the box, then coding ETL programs is a good option.

Have Existing Code. If you can reuse existing, high-quality code in a new project, then it may be wiser to continue to write code.

Data Integration Platforms

As mentioned earlier, many vendors are extending their ETL products to meet new user requirements. The result is a new generation of ETL products that TDWI calls data integration platforms. These products extend ETL tools with a variety of new capabilities, including data cleansing, data profiling, advanced data capture, incremental updates, and a host of new source and target systems. (See Illustration S3.) We can divide the main features of a data integration platform into platform characteristics and data integration characteristics. No ETL product today delivers all the features outlined below.

Illustration S3. Data integration platforms extend the capabilities of ETL tools. [Diagram: source adapters connect databases and files, legacy applications, application packages, Web and XML data, and event data through capture, extract, clean, profile, transform, update, and load services to target adapters, supported by a design manager, administration and operations services, transport services, runtime meta data services, meta data import/export, a global meta data repository, application integration software, and an adapter development kit.]

However, most ETL vendors are moving quickly to deliver full-fledged data integration platforms.
Platform Characteristics:
- High performance and scalability
- Built-in data cleansing and profiling
- Complex, reusable transformations
- Reliable operations and robust administration

Data Integration Characteristics:
- Diverse source and target systems
- Update and capture facilities
- Near-real-time processing
- Global meta data management

High Performance and Scalability

Parallelization. Data integration platforms offer exceedingly high throughput and scalability due to parallel processing facilities that exploit high-performance computing platforms. The parallel engine processes workflow streams in parallel using multiple operating system threads, multi-pass SQL, and an in-memory data cache to hold temporary data without storing it to disk. Where dependencies exist among or within streams, the engine uses pipelining to optimize throughput. The high-performance computing platform supports clusters of servers running multiple CPUs and load balancing software to optimize performance under large, sustained loads.

Benchmarks. In the mid-1990s, ETL engines achieved a maximum throughput of 10 to 15 gigabytes of data per hour, which was sufficient to meet the needs of most BI projects at the time. Today, however, ETL engines boast throughput of between 100 and 150 gigabytes an hour or more, thanks to steady improvements in memory management, parallelization, distributed processing, and high-performance servers.(1)

Built-In Data Cleansing and Profiling

To avoid the "code, load, and explode" syndrome mentioned earlier, good data integration platforms provide built-in support for data cleansing and profiling functionality.

Data Profiling Tools. Data profiling tools provide an exhaustive inventory of source data. (Although many companies use SQL to sample source data values, this is akin to sighting only the tip of an iceberg.)
Data profiling tools identify the range of values and formats of every field as well as all column, row, and table dependencies. The tools spit out reports that serve as the Rosetta Stone of your source files, helping you translate between hopelessly out-of-date copy books and what's really in your source systems.

Data Cleansing Tools. Data cleansing tools validate and correct business rules in source data. Most data cleansing tools apply specialized functions for scrubbing name and address data: (1) parsing and standardizing the format of names and addresses, (2) verifying those names/addresses against a third-party database, (3) matching names to remove duplicates, and (4) householding names to consolidate mailings to a single address.

(1) There are many variables that affect throughput: the complexity of transformations, the number of sources that must be integrated, the complexity of target data models, the capabilities of the target database, and so on. Your throughput rate may vary considerably from these rates.

Some data cleansing tools now can also apply many of the above functions to specific types of non-name-and-address data, such as products. Data integration platforms integrate data cleansing routines within a visual ETL workflow. (See Illustration S4.) At runtime, the ETL tool issues calls to the data cleansing tool to perform its validation, scrubbing, or matching routines. Ideally, data integration platforms exchange meta data with profiling and cleansing tools. For instance, a data profiling tool can share errors and formats with an ETL tool so developers can design mappings and transformations to correct the dirty data.
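The four name-and-address functions above can be sketched in a few lines of Python. This is a toy illustration only: a real cleansing tool parses free-form addresses and verifies them against a licensed third-party reference database, both of which are reduced here to hypothetical stand-ins.

```python
import re
from collections import defaultdict

def standardize(record):
    """(1) Parse and standardize the format of a name and address."""
    name = " ".join(record["name"].upper().split())
    addr = re.sub(r"\bSTREET\b", "ST", record["address"].upper())
    return {"name": name, "address": " ".join(addr.split())}

def verify(record, reference_addresses):
    """(2) Verify the address against a (stand-in) third-party database."""
    return record["address"] in reference_addresses

def dedupe(records):
    """(3) Match standardized records to remove duplicates."""
    seen, unique = set(), []
    for r in records:
        key = (r["name"], r["address"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def household(records):
    """(4) Group (household) records that share one address."""
    groups = defaultdict(list)
    for r in records:
        groups[r["address"]].append(r["name"])
    return dict(groups)

raw = [
    {"name": "jane  doe", "address": "12 Oak Street"},
    {"name": "JANE DOE", "address": "12 Oak St"},
    {"name": "John Doe", "address": "12 oak street"},
]
clean = dedupe([standardize(r) for r in raw])
households = household(clean)   # one mailing per address
```

In practice, step (3) uses fuzzy matching rather than exact key comparison, which is precisely why these tools are hard to replicate with hand-written SQL.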
Also, a data cleansing tool can share information about matched or householded customer records so the ETL tool can combine them into a single record before loading the data warehouse.

Complex, Reusable Transformations

Extensible, Reusable Objects. Data integration platforms provide a robust set of transformation objects that developers can extend and combine to create new objects and reuse them in various ETL workflows. Ideally, developers should be able to deploy a custom object that dynamically adapts to its environment, minimizing the time developers spend updating multiple instances of the object.

"Most ETL tools today limit reuse because they are context sensitive," says Steve Tracy at Hartford Life Insurance in Simsbury, CT. "They let you copy and paste an object to support 37 instances of a source system, but you have to configure each object separately for each. This becomes painful if there is a change in the source system and you have to reconfigure all 37 instances of the object."

External Code. Although ETL tools have gained a raft of built-in and extensible transformation objects, developers always run into situations where they need to write custom code to support complex or unique transformations.

To address this pain point, data integration platforms provide internal scripting languages that enable developers to write custom code without leaving the tool. But if a developer is more comfortable coding in C, C++, or another language, the tool should also be able to call external routines from within an ETL workflow and manage those routines as if they were internal objects. In the same way, the tool should be able to call third-party schedulers, security systems, meta data repositories, and analytical tools.

Illustration S4. Data integration platforms integrate data cleansing routines into an ETL graphical workflow.
Reliable Operations and Robust Administration

No matter how much horsepower a system possesses, if its runtime environment is unstable, overall performance suffers. In our survey, 86 percent of respondents rated "reliability" as a "very important" ETL feature. (See Illustration 13.) This was the highest rating of any ETL feature respondents were asked to evaluate.

Robust Schedulers. Data integration platforms provide much more robust operational and administrative features than the current generation of ETL tools. For example, they provide robust schedulers that orchestrate the flow of inter-dependent jobs (both ETL and non-ETL) and run in a lights-out environment. Or they interface with a third-party scheduling tool so the organization can manage both ETL and non-ETL jobs within the same environment in a highly granular fashion.

Conditional Execution. Job processing is more reliable because data integration platforms are "smart" enough to perform conditional execution based on thresholds, content, or inter-process dependencies. This reduces the amount of coding needed to prevent such outages, especially as BI environments become more complex.

Debugging and Error Recovery. In addition, data integration platforms provide robust debugging and error recovery capabilities that minimize how much code developers have to write to recover from errors in the design or runtime environments. These tools also deliver reports and diagnostics that are easy to understand and actionable. Instead of logging what happened and when, users want the tools to say why it happened and recommend fixes.

Enterprise Consoles.
Surprisingly, only 18 percent of companies said a single ETL operations console and graphical monitoring was "very important." (See Illustration 15.) This indicates that most organizations do not yet need to manage multiple ETL servers. This will change as ETL usage grows and organizations decide to standardize the scheduling and management of ETL processes across the enterprise. We suspect most organizations will use in-house enterprise systems management tools, such as Computer Associates' Unicenter, to manage distributed ETL systems.

Diverse Source and Target Systems

Intelligent Adapters. The mark of a good data integration platform is the number and diversity of source and target systems it supports. Besides supporting relational databases and flat files, which most ETL tools support today, data integration platforms provide adapters to intelligently connect to a variety of complex and unique data stores.

Application Interfaces and Data Types. For example, operational application packages, such as SAP's R/3, contain unique data types, such as pooled and clustered tables, as well as several different application and data interfaces. Data integration platforms offer native support for these interfaces and can intelligently handle unique data types.

Web Services and XML. Data integration platforms also provide ample support to the Web world. We expect XML and Web Services to constitute an ever greater portion of the processing that ETL tools perform. XML is becoming a standard means of interchanging both data and meta data. Web Services (which use XML) are becoming the standard method for enabling heterogeneous applications to interact over the Internet. Not surprisingly, almost half (47 percent) of survey respondents said Web Services were an "important" or "fairly important" feature for ETL tools to possess.
It's likely that Web Services and XML will become a de facto method for streaming XML data from operational systems to ETL tools and from data warehouses to reporting and analytical tools and applications.

Desktop Data Stores. Also, data integration platforms support pervasive desktop applications, such as Microsoft Excel and Access. By connecting to these sources via XML and Web Services, data integration platforms enable organizations to halt the growth of spreadmarts and the organizational infighting they create.

Adapter Development Kits. Finally, data integration platforms provide adapter development kits that let developers create custom adapters to link to unique or uncommon data sources or applications.

Update and Capture Utilities

To meet business requirements for more timely data and negotiate shrinking batch windows, data integration platforms support a variety of data capture and loading techniques.

Incremental Updates. First, data integration platforms update data warehouses incrementally instead of rebuilding or "refreshing" them (i.e., dimensional tables) from scratch every time. More than one-third of our survey respondents (34 percent) said incremental update is a "very important" feature in ETL tools. (See Illustration 14.)

Change Data Capture. To incrementally update a data warehouse, ETL tools need to detect and download only the changes that occurred in the source system since the last load, a process called "change data capture." (Sometimes, the source system has a facility that identifies changes since the last extract and feeds them to the ETL tool.) The ETL tool then uses SQL update, insert, and delete commands to update the target file with the changed data.
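The change data capture cycle just described can be sketched in a few lines of Python, using SQLite as a stand-in for both source and target. All table and column names are hypothetical; real CDC facilities typically read database logs rather than a timestamp column.

```python
import sqlite3

# Toy source and target databases (hypothetical schemas).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_customers (id INTEGER PRIMARY KEY, name TEXT,
                                last_modified TEXT);
    CREATE TABLE dw_customers  (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO src_customers VALUES (1, 'Acme',    '2003-01-01'),
                                     (2, 'Globex',  '2003-02-10'),
                                     (3, 'Initech', '2003-02-11');
    INSERT INTO dw_customers VALUES (1, 'Acme');  -- loaded previously
""")

last_load = "2003-02-01"  # high-water mark from the previous run

# Detect only the rows that changed since the last load...
changes = conn.execute(
    "SELECT id, name FROM src_customers WHERE last_modified > ?",
    (last_load,)).fetchall()

# ...and apply them as inserts/updates instead of a full rebuild.
for row_id, name in changes:
    conn.execute("INSERT OR REPLACE INTO dw_customers (id, name) "
                 "VALUES (?, ?)", (row_id, name))

rows = conn.execute(
    "SELECT id, name FROM dw_customers ORDER BY id").fetchall()
```

The efficiency gain is that only two changed rows cross the wire here, no matter how large the unchanged remainder of the source table grows.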
This process is more efficient than bulk "refreshes," where the ETL tool completely rebuilds all dimension tables from scratch and appends new transactions to the fact table in a data warehouse. More than a third (38 percent) of our respondents said that change data capture was a "very important" feature in ETL tools. (See Illustration 14.)

More Frequent Loads. Another way ETL tools can minimize load times is to extract and load data at more frequent intervals (every hour or less), executing many small batch runs throughout the day instead of a single large one at night or on the weekends.

For example, Intuit Inc. uses a vendor ETL tool to continuously extract data every couple of minutes from an Oracle 11i source system and load it into the data warehouse, according to Intuit's Hatchett. However, for its large nightly loads, it uses hardware clusters on its ETL server and load balancing to meet its ever-shrinking batch windows.

One way to accommodate continuous loads without interfering with user queries is for administrators to create mirror images of tables or data partitions. The ETL tool then loads the mirror image in the background while end users query the "active" table or partition in the foreground. Once the load is completed, administrators switch pointers from the original partition to the newly updated one.(2)

Bulk Loading. Data integration platforms leverage native bulk load utilities to "block slam" data into target databases rather than simply using SQL inserts and updates. Data integration platforms can update target tables by selectively applying SQL within a bulk load operation, something that tools today can't support.
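The mirror-image switch described above can be illustrated with a small SQLite example in which a view plays the role of the "pointer" that administrators flip: users query the view while the load fills a shadow table in the background, and the switch itself is near-instant. Table and view names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_a (amount INTEGER);  -- the "active" copy
    CREATE TABLE sales_b (amount INTEGER);  -- the mirror
    INSERT INTO sales_a VALUES (100), (200);
    -- End users query the view, never the tables directly.
    CREATE VIEW sales AS SELECT * FROM sales_a;
""")

# Background load fills the mirror while queries still hit sales_a.
conn.execute("INSERT INTO sales_b SELECT amount FROM sales_a")
conn.execute("INSERT INTO sales_b VALUES (300)")  # newly arrived data

# Switch the pointer: redefine the view over the freshly loaded mirror.
conn.executescript("""
    DROP VIEW sales;
    CREATE VIEW sales AS SELECT * FROM sales_b;
""")

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Production databases offer faster mechanisms for the same idea (partition exchange, synonym or pointer swaps), but the principle is identical: readers never see a half-loaded table.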
(2) Administrators must make sure that the ETL tool or database does not update summary tables in the target database until a "quiet period," usually at night, and users must be educated about the data's volatility during the period when it is being continuously updated.

Near-real-time Data Capture. Finally, organizations seek ETL tools that capture and load data into the data warehouse in near real time. This "trickle feed" process is needed because business users want integrated data delivered on a timelier basis (i.e., from the previous day, hour, or minute) so they can make critical tactical and operational decisions without delay.

Priceline.com, for example, uses middleware to capture source data changes into a staging area and then uses a vendor ETL tool to load them into the data warehouse. It has built a custom program to feed multiple sessions in parallel to its vendor ETL tool, but is planning to upgrade soon to a new version that supports parallel processing.

EAI Software. As mentioned earlier in this report, most ETL tools today partner with EAI tools to deliver near-real-time capabilities. The EAI tools typically use application-level interfaces to peel off new transactions or events as soon as they are generated by the source application. They then deliver these events to any application that needs them either immediately (near real time), at predefined intervals (scheduled), or when the target application asks for them (publish and subscribe).

Bi-directional, Peer-to-Peer Interfaces. Although EAI tools are associated with delivering real-time data, they support a variety of delivery methods.
Also, in contrast to ETL tools, their application interfaces are typically bidirectional and peer-to-peer in nature, instead of unidirectional and hub-and-spoke. EAI software flows data and events back and forth among applications, each alternately serving as source or target.

Unlike most of today's ETL tools, data integration platforms will blend the best of both ETL and EAI capabilities in a single toolset. Few organizations want to support dual data integration platforms and would prefer to source ETL and EAI capabilities from a single vendor. The upshot is that data integration platforms will process data in batch or near real time depending on business requirements. They will also support unidirectional and bidirectional flows of data and events among disparate applications and data stores using a hub-and-spoke or peer-to-peer architecture.

Global Meta Data Management

To optimize and coordinate the flow of data among applications and data stores, data integration platforms offer global meta data management services. The tools automatically document, exchange, and synchronize meta data among various applications and data stores in a BI environment.

Meta Data Repository and Interchange. To do this, the tools maintain a global repository of meta data. The repository stores meta data about ETL mappings and transformations. It may also contain relevant meta data from upstream or downstream systems, such as data modeling and analytical tools.

Common Warehouse Metamodel. More importantly, the tools exchange and synchronize meta data with other tools via a robust meta data interchange interface, such as the Object Management Group's Common Warehouse Metamodel (CWM), which is based on XML. According to our survey, 48 percent of respondents rated "openness" and 41 percent rated "meta data interface" as "very important" design features.
(See Illustration 12.) Although some ETL vendors recently announced support for CWM, and although the specifications are considered robust, it's still too early to tell whether these standards will gain substantial momentum to facilitate widespread meta data integration.

Impact Analysis Reports. Data integration platforms also generate impact analysis reports that analyze the dependencies among elements used by all systems in a BI environment. The reports identify objects that may be affected by a change, such as adding a column, revising a calculation, or changing a data type or attribute. "Such reports reduce the amount of time and effort required to [manage changes] and reduce the need for custom work," wrote one survey respondent. Moreover, when appropriate, the tools automatically update each other with such changes without IT intervention. (See Illustration S5 for a sample impact analysis report.) Thirty-three percent of respondents said impact analysis reports are a "very important" feature.

Data Lineage. In addition, data integration platforms offer "data lineage" reports that describe or depict the origins of a data element or report and the business rules and transformations that were applied to it as it moved from a source system to the data warehouse or a data mart. Data lineage reports help business users understand the nature of the data they are analyzing.

SUMMARY

Data integration platforms represent the next generation of ETL tools. These tools extend current ETL capabilities with enhanced performance, transformation power, reliability, administration, and meta data integration.
The platforms also support a wider array of source and target systems and provide utilities to capture and load data more quickly and efficiently, including supporting near-real-time data capture.

ETL Evaluation Criteria

Guidelines for Selecting ETL Products. It may take several years for major ETL tools to evolve into data integration platforms and support all the features described above. In the meantime, you may need to select an ETL tool or reevaluate the one you are using. If so, there are many things to consider. The following list is a detailed, but by no means comprehensive, list of evaluation criteria.

Illustration S5. ETL tools should quickly generate reports that clearly outline dependencies among data elements in a BI environment and highlight the impact of changes in source systems or downstream applications. [Sample impact analysis report.]

It's important to remember that not all the criteria below may be applicable or important to your environment. The key is to first understand your needs and requirements and then build criteria from there. The list that follows can help trigger ideas for important features that you may need.

We've tried to provide descriptive attributes to qualify the requirements so you can understand the extent to which a tool meets the criteria. If asked, all vendors will say their tool can perform certain functions. It's better to phrase questions in a way that lets you ascertain their true level of support.

Available Resources

ETL Matrix. The evaluation list below is a summary version of a matrix document that TDWI Members can access at the following URL: www.dw-institute.com/etlmatrix/. Another set of criteria is available at: www.clickstreamdatawarehousing.com/ClickstreamETLCriteria.htm.

ETL Product Guide.
If you would like to view a complete list of ETL products, go to TDWI's Marketplace Online (www.dw-institute.com/marketplace), click on the "Data Integration" category, and then select the "DW Mapping and Transformation" subcategory and related subcategories.

Courses on ETL. Finally, TDWI also offers several courses on ETL, including TDWI Data Acquisition, TDWI Data Cleansing, and Evaluating and Selecting ETL and Data Cleansing Tools. See our latest conference and seminar catalogs for more details (www.dw-institute.com/education/).

Vendor Attributes

Before examining tools, you need to evaluate the company selling the ETL tool. You want to find out (1) whether the vendor is financially stable and won't disappear, and (2) how committed the vendor is to its ETL toolset and the ETL market. Asking these questions can quickly shorten your list of vendors and the time spent analyzing products! For more details, ask questions in the following areas:

Mission and Focus
  o Primary focus of company
  o Percent of revenues and profits from ETL
  o ETL market and product strategy

Maturity and Scale
  o Years in business; public or private?
  o Number of employees? Salespeople? Distributors? VARs?

Financial Health
  o Year-over-year trends in revenues, profits, stock price
  o Cash reserves; debt

Financial Indicators
  o License-to-service revenues
  o Percent of revenues spent on R&D

Relationships
  o Number of distributors, OEMs, VARs

Overall Product Considerations

Before diving in and evaluating an ETL tool's features and functions, make sure you have the right mindset. The first principle of tool selection is to question and verify everything. "It's never as easy as the vendor says," writes a BI professional from a major electronics retailer.
"Verify everything they say down to the last turn of the screw before the sale is complete and the software license agreement is signed."

Platforms. After adopting a skeptical attitude, step back and see if the product meets your overall needs in key areas. Topping the list is whether the product runs on platforms that you have in house and will work with your existing tools and utilities (e.g., security, schedulers, analytical tools, data modeling tools, source systems, etc.).

Full Lifecycle Support. Second, find out whether the product supports all ETL processes and provides full lifecycle support for the development, testing, and production of ETL programs. You don't want a partial solution. Also, make sure it supports the preferred development style of your ETL developers, whether that's procedural, object-oriented, or GUI-driven.

Requisite Performance and Functionality. Finally, make sure the tool provides adequate performance and supports the transformations you need. The only way to validate this is to have vendors perform a proof of concept (POC) with your data. Although a POC may take several days to set up and run, it is the only way to really determine whether the tool will meet your business and technical requirements and is worth buying. You should never pay the vendor to perform a POC, nor agree in advance to purchase the product if they complete the POC successfully.

"It's really important to take these tools for a test drive with your data," says EDS Canada's Light. Mimno adds, "A proof of concept exposes how different the tools really are. This is something you'll never ascertain from evaluations on paper only."

Design Features

Ease of Use. Most BI teams would like vendors to make ETL tools easier to learn and use. A graphical development environment can enhance the usability of an ETL product and flatten the steep learning curve.
Almost all survey respondents (84 percent) said a visual mapping interface, for example, was either a "very important" or "fairly important" design feature.

Illustration 12. Ease of use is rated as a very important design feature. Based on 746 respondents. [Bar chart of "very important" ratings for eight design, transformation, and meta data features: ease of use, visual mapping interface, debugging, openness, single logical meta data store, meta data interfaces, transformation language and power, and meta data reports (e.g., impact analysis).]

Good ETL tools save developers time by letting them combine and reuse objects, tasks, and processes in various workflows. Workflows are visual and interactive, letting developers focus on one step at a time. Wizards and cut-and-paste features also speed development, as do interactive debuggers that don't require developers to write code.
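The combine-and-reuse idea above can be sketched in a few lines of Python: individual transformation steps are composed into a named "container" object that can then be dropped into multiple workflows. This is an illustration of the concept only, not any vendor's API; all names are hypothetical.

```python
def trim(value):
    """One small transformation step."""
    return value.strip()

def upper(value):
    """Another small step."""
    return value.upper()

def container(*steps):
    """Combine multiple steps into one reusable object."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run

clean_name = container(trim, upper)   # built once...

workflow_a = [clean_name]             # ...reused in many workflows
workflow_b = [clean_name, trim]

result = clean_name("  jane doe ")
```

The payoff is exactly the one Tracy describes above: change the container's definition once and every workflow that references it picks up the change, instead of reconfiguring dozens of pasted copies.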
More specifically, ask whether the ETL tool provides:
- An integrated, graphical development environment
- Interactive workflow diagrams to depict complex flows of jobs, processes, tasks, and data
- A graphical drag-and-drop interface for mapping source and target data
- The ability to combine multiple jobs or tasks into a "container" object that can be visually depicted and opened up in a workflow diagram
- Reuse of containers and custom objects in multiple workflows and projects
- A procedural or object-oriented development environment
- A scripting language to write custom transformation objects and routines
- Exits to call external code or routines on the same or different platforms
- Version control with check-in and check-out
- An audit trail that tracks changes to every process, job, task, and object

Meta Data Management Features

As the hub of a data warehousing or data integration effort, an ETL tool is well positioned to capture and manage meta data from a multiplicity of tools. Today, most ETL tools automatically document information about their development and runtime processes in a meta data repository (e.g., relational tables or a proprietary engine) which can be accessed via query tools, a documented API, or a Web browser.

Today, the best ETL tools import and export technical and business meta data with targeted upstream and downstream systems in the BI environment. They also generate impact analysis and data lineage reports across these tools. Most should be working on ways to automate the exchange of meta data using CWM and Web Services standards and keep heterogeneous BI systems in synch.
For more details on meta data, ask whether an ETL product provides:
- Self-documenting meta data for all design and runtime processes
- Support for technical, business, and operational meta data
- A rich repository for storing meta data
- A hierarchical repository that can manage and synchronize multiple local or globally distributed meta data repositories
- The ability to reverse-engineer source meta data, including flat files and spreadsheets
- A robust interface to interchange meta data with third-party tools
- Robust impact analysis reports
- Data lineage reports that show the origin and derivation of an object

Transformation Features

Extensible, Reusable, Context-Independent Objects. The best ETL tools provide a rich library of transformation objects that can be extended and combined to create new, context-independent, reusable objects. The tools also provide an internal scripting language and transparent support for calls to external routines.

To learn about transformation features, ask whether the tool provides:
- A rich library of base transformation objects
- The ability to extend base objects
- Context-independent objects or containers
- Record- and set-level processing logic and transforms
- Generation of surrogate keys in memory in a single pass
- Support for multiple types of transformation and mapping functions
- The ability to create temporary files, if needed for complex transformations
- The ability to perform recursive processing or loops

Data Quality Features

The best ETL tools provide data profiling facilities, and can spawn data cleansing routines from within ETL workflows.
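One item on the transformation checklist above, generating surrogate keys in memory in a single pass, can be sketched as follows: each distinct natural key receives a small integer warehouse key as rows stream through, with the lookup table held in memory. The function and field names are hypothetical.

```python
def assign_surrogate_keys(rows, key_field):
    """Single-pass, in-memory surrogate key generation (a sketch)."""
    keys = {}  # natural key -> surrogate key, held in memory
    for row in rows:
        natural = row[key_field]
        if natural not in keys:
            keys[natural] = len(keys) + 1  # next available key
        yield {**row, "sk": keys[natural]}

rows = [{"cust": "A"}, {"cust": "B"}, {"cust": "A"}]
out = list(assign_surrogate_keys(rows, "cust"))
```

Doing this in memory in one pass avoids a round trip to the database for every row, which is one reason the tools that support it scale well.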
To get details on data profiling and cleansing facilities, ask whether the ETL product:
• Provides internal or third-party data profiling and cleansing tools
• Integrates cleansing tasks into visual workflows and diagrams
• Enables profiling, cleansing, and ETL tools to exchange data and meta data
• Automatically generates rules for ETL tools to build mappings that detect and fix data defects
• Profiles formats, dependencies, and values of source data files
• Parses and standardizes records and fields to a common format
• Verifies record and field contents against external reference data
• Matches and consolidates file records

Performance Features
Performance features received some of the highest ratings in our survey. (See Illustration 13.) As the volume of data increases, batch windows shrink, and users require more timely data, teams are looking to ETL tools to make up the difference. The best ETL tools offer a parallel processing engine that runs on a high-performance computing server. It leverages threads, clusters, load balancing, and other facilities to speed throughput and ensure adequate scalability.

Illustration 13. Reliability and performance/throughput were rated as very important features by more than 80 percent of 750 survey respondents.
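In miniature, the parallel-processing approach described above maps independent partitions of source data onto a pool of workers. The sketch below uses Python threads purely for illustration; a real engine would use native threads, hardware processors, or clustered servers, and the transform and data here are invented:

```python
# Minimal sketch of parallel ETL throughput: transform independent
# data partitions concurrently with a worker pool. The transform and
# the partitions are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

def transform(partition):
    # Placeholder transform: standardize a name field in each record.
    return [{**rec, "name": rec["name"].upper()} for rec in partition]

def run_parallel(partitions, max_workers=4):
    # Partitions are independent, so they can be dispatched to workers;
    # pool.map returns results in the original partition order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transform, partitions))

partitions = [
    [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}],
    [{"id": 3, "name": "cal"}],
]
results = run_parallel(partitions)
```

The design point is the one the report makes: throughput scales only when the work can be split into partitions with no dependencies between them.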
[Bar chart, "Important Performance Features: Rating," showing ratings for parallel processing, real-time data capture, incremental update, reliability, availability, scalability, and performance and throughput.]

To learn more about performance features, ask whether the product provides:
• A flexible, high-performance way to process and transform source data
• Intelligent management of aggregate/summary tables
o Bundled with product or extra-priced option
• High-speed processing using multiple concurrent workflows, multiple hardware processors, system threads, and clustered servers
• In-memory cache to avoid creating intermediate temporary files
• Linear performance improvement when adding processors and memory

Extract and Capture Features
Leading ETL tools provide native support for a wide range of source databases, a top requirement among our survey respondents. (See Illustration 14.) The lowest common denominator is support for ODBC/JDBC, but make sure the tool can directly access the sources you support, preferably at no additional cost and without the use of a third-party gateway. Users also emphasized the importance of extracting and integrating data from multiple source systems. "The ability to use an ETL tool to extract data from three sources and combine them together is worth the effort compared to writing the routine in something like PL/SQL," says American Eagle's Koscinski.

To understand extract capabilities, ask whether the tool provides:
• The ability to schedule extracts by time, interval, or event
• Robust adapters that extract data from various sources
• A development framework for creating custom data adapters
• A robust set of rules for selecting source data
• Selection, mapping, and merging of records from multiple source files
• A facility to capture only changes in source files
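When a source offers no native change log, capturing only the changes in source files is often implemented by diffing the current extract against the prior snapshot on a primary key. A minimal sketch, with invented keys and fields:

```python
# Minimal changed-data-capture sketch: compare the current extract to
# the previous snapshot by primary key and emit inserts, updates, and
# deletes. Key and field names are hypothetical.
def capture_changes(previous, current, key="id"):
    prev = {rec[key]: rec for rec in previous}
    curr = {rec[key]: rec for rec in current}
    # Keys only in the new extract are inserts; keys that vanished are
    # deletes; shared keys whose contents differ are updates.
    inserts = [curr[k] for k in curr.keys() - prev.keys()]
    deletes = [prev[k] for k in prev.keys() - curr.keys()]
    updates = [curr[k] for k in curr.keys() & prev.keys()
               if curr[k] != prev[k]]
    return {"insert": inserts, "update": updates, "delete": deletes}

old = [{"id": 1, "qty": 5}, {"id": 2, "qty": 3}]
new = [{"id": 1, "qty": 7}, {"id": 3, "qty": 1}]
delta = capture_changes(old, new)
```

Snapshot differencing trades extract cost for simplicity; tools that read database logs avoid rescanning the source entirely, which is why native change data capture rated highly in the survey.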
Illustration 14. Users want an ETL tool to support a wide range of sources. Based on 745 respondents who rated the above items as very important.
[Bar chart, "Extract and Load Features: User Ratings," showing ratings for support for native load utilities, Web services support, real-time data capture, breadth of sources, incremental update, breadth of targets supported, and change data capture.]

Load and Update Features
To understand load and update features, ask whether the product can:
• Load data into multiple types of target systems
• Load data into heterogeneous target systems in parallel
• Support data partitions on the target
• Both refresh and append data in the target database
• Incrementally update target data
• Support high-speed load utilities via an API
• Turn off referential integrity and drop indexes, if desired
• Take a snapshot of data after a load for recovery purposes, if desired
• Automatically generate DDL to create tables

Operate and Administer Component
The ultimate test of an ETL product is how well it runs the programs that ETL developers design, and how well it recovers from errors. Thus, monitoring and managing the ETL runtime environment is a critical requirement. Our survey respondents rated "error reporting and recovery" especially high (79 percent), with "scheduling" not far behind (66 percent). (See Illustration 15.)

For detailed information on administrative features, ask whether the product provides:
• A visual console for managing and monitoring ETL processes
• A robust, graphical scheduler to define job schedules and dependencies
• The ability to validate jobs before running them
• A command line or application interface to control and run jobs
• Robust graphical monitoring with near real-time displays
• The ability to restart from checkpoints without losing data
• MAPI-compliant notification of errors and job statistics
• Built-in logic for defining and managing rejected records
• A robust set of easy-to-read, useful administrative reports
• A detailed job or audit log that records all activity and events in the job
• Rigorous security controls, including links to LDAP and other directories

Illustration 15. Error reporting leads all administration features, according to 745 respondents who rated the above items as very important.
[Bar chart, "Rating of Administration Features," showing ratings for single operations console, graphical monitoring, error reporting and recovery, single logical meta data store, debugging, and scheduling.]

Integrated Product Suites
Many vendors today bundle ETL tools within a business intelligence or data management suite. Their goal is to ensure that a suite (a complete, integrated solution) will provide more value and be more attractive to more organizations.

To drill into add-on products, ask:
• Is the package geared to a specific source application, or is it more generic?
• How integrated is the ETL tool with other tools in the suite?
o Do they have a common look and feel?
o Do they share meta data in an automated, synchronized fashion?
o Do they share a common install and management system?

Company Services and Pricing
As mentioned earlier in this report, many user organizations find the price of ETL tools too high. Almost three-quarters of survey respondents (71 percent) consider pricing among the top three factors affecting their decision to buy a vendor ETL tool. (See Illustration 16.)

Bundles versus Components. Users pay close attention to vendor packaging options.
Vendors that offer all-in-one packages at high list prices can often price themselves out of contention. Vendors that break their packages into components, enabling users to buy only what they need, are more attractive to users who want to minimize their initial cost outlays. However, runtime charges can ruin the equation if a vendor charges the customer each time a component runs on a different platform.

More specifically, ask about:
• Pricing for a four-CPU system on Windows and UNIX
• Annual maintenance fees
• Level of support, training, consulting, and documentation provided
• Pricing for add-on software and interfaces required to run the product
• Annual user conferences
• Cost of moving from a single server to multiple servers
• User group availability
• Online discussion groups

Illustration 16. Pricing is a significant factor affecting purchasing decisions. Based on 621 respondents.
[Pie chart, "How Does Pricing Affect Your ETL Purchasing Decision?":]
• A lot. It's a top factor. (10%)
• Significantly. Among the top 3 factors. (61%)
• Somewhat. An important but not critical factor. (26%)
• Not much. It is a requirement for doing business. (4%)

Conclusion
ETL tools are the traffic cops of business intelligence applications. They control the flow of data between myriad source systems and BI applications. As BI environments expand and grow more complex, ETL tools need to change to keep pace.

Next-Generation Functionality. Today, organizations need to move more data from more sources more quickly into a variety of distributed BI applications. To manage this data flow, ETL tools need to evolve from batch-oriented, single-threaded processing that extracts and loads data in bulk to continuous, parallel processes that capture data in near real time. They also need to provide enhanced administration and ensure reliable operations and high availability.
As the traffic cops of the BI environment, they need to connect to a wider variety of systems and data stores (notably XML, Web Services-based data sources, and spreadsheets) and coordinate the exchange of information among these systems via a global meta data management system. To simplify and speed the development of complex designs, the tools need a visual work environment that allows developers to create and reuse custom transformation objects. Finally, ETL tools need to deliver a larger part of the BI solution by integrating data cleansing and profiling capabilities, EAI functionality, toolkits for building custom adapters, and possibly reporting and analysis tools and analytic applications.

This is a huge list of new functionality. The sum total represents a next-generation ETL product, or a data integration platform. It will take ETL vendors several years to extend their products to support these areas, but many are well on their way.

Focus on Your Business Requirements. The key is to understand your current and future ETL processing needs and then identify the product features and functions that support those needs. With this knowledge, you can then identify one or more products that can adequately meet your ETL and BI needs.

Business Objects
3030 Orchard Parkway
San Jose, CA 95134
408.953.6000
Web: www.businessobjects.com

Business Objects is the world's leading provider of business intelligence (BI) solutions. Business intelligence lets organizations access, analyze, and share information internally with employees and externally with customers, suppliers, and partners. It helps organizations improve operational efficiency, build profitable customer relationships, and develop differentiated product offerings.
The company's products include data integration tools, the industry's leading integrated business intelligence platform, and a suite of enterprise analytic applications. Business Objects is the first to offer a complete BI solution composed of best-of-breed components, giving organizations the means to deploy end-to-end BI to the enterprise, from data extraction to analytic applications. Business Objects has more than 17,000 customers in over 80 countries. The company's stock is publicly traded under the ticker symbols NASDAQ: BOBJ and Euronext Paris (Euroclear code 12074). It is included in the SBF 120 and IT CAC 50 French stock market indexes.

DataMirror
3100 Steeles Avenue East, Suite 1100
Markham, Ontario L3R 8T3 Canada
905.415.0310
Fax: 905.415.0340
Email: info@datamirror.com
Web: www.datamirror.com

DataMirror (Nasdaq: DMCX; TSX: DMC), a leading provider of enterprise application integration and resiliency solutions, gives companies the power to manage, monitor, and protect their corporate data in real time. DataMirror's comprehensive family of LiveBusiness solutions enables customers to easily and cost-effectively capture, transform, and flow data throughout the enterprise. DataMirror unlocks the experience of now by providing the instant data access, integration, and availability companies require today across all computers in their business. 1,700 companies have gone live with DataMirror software, including Debenhams, Energis, GMAC Commercial Mortgage, the London Stock Exchange, OshKosh B'Gosh, Priority Health, Tiffany & Co., and Union Pacific Railroad.

Hummingbird Ltd.
1 Sparks Avenue
Toronto, Ontario M2H 2W1 Canada
Email: getinfo@hummingbird.com
Web: www.hummingbird.com

Headquartered in Toronto, Canada, Hummingbird Ltd. (NASDAQ: HUMC, TSE: HUM) is a global enterprise software company employing 1,300 people in nearly 40 offices around the world.
Hummingbird's revolutionary Hummingbird Enterprise, an integrated information and knowledge management solution suite, manages the entire lifecycle of information and knowledge assets. Hummingbird Enterprise creates a 360-degree view of all enterprise content, both structured and unstructured data, with a portfolio of products that are both modular and interoperable. Today, five million users rely on Hummingbird to connect, manage, access, publish, and search their enterprise content. For more information, please visit: http://www.hummingbird.com.

Informatica Corporation
Headquarters
2100 Seaport Boulevard
Redwood City, CA 94063
650.385.5000
Toll-free U.S.: 800.653.3871
Fax: 650.385.5500

Informatica offers the industry's only integrated business analytics suite, including the industry-leading enterprise data integration platform, a suite of packaged and domain-specific data warehouses, the only Internet-based business intelligence solution, and market-leading analytic applications. Its data integration solution, comprising the Informatica PowerCenter, PowerMart, and Informatica PowerConnect(tm) products, helps companies fully leverage and move data from virtually any corporate system into data warehouses, operational data stores, staging areas, or other analytical environments. Offering real-time performance, scalability, and extensibility, the Informatica data integration platform can handle the challenging and unique analytic requirements of even the largest enterprises.

The Data Warehousing Institute
5200 Southcenter Blvd., Suite 250
Seattle, WA 98188
Local: 206.246.5059
Fax: 206.246.5952
Email: info@dw-institute.com
Web: www.dw-institute.com

Membership
As the data warehousing and business intelligence field continues to evolve and develop, it is necessary for information technology professionals to connect and interact with one another.
TDWI provides these professionals with the opportunity to learn from each other, network, share ideas, and respond as a collective whole to the challenges and opportunities in the data warehousing and BI industry. Through Membership with TDWI, these professionals make positive contributions to the industry and advance their professional development. TDWI Members benefit through increased knowledge of all the hottest trends in data warehousing and BI, which makes TDWI Members some of the most valuable professionals in the industry. TDWI Members are able to avoid common pitfalls, quickly learn data warehousing and BI fundamentals, and network with peers and industry experts to give their projects and companies a competitive edge in deploying data warehousing and BI solutions. TDWI Membership includes more than 4,000 Members who are data warehousing and information technology (IT) professionals from Fortune 1000 corporations, consulting organizations, and governments in 45 countries. Benefits to Members from TDWI include:
• Quarterly Business Intelligence Journal
• Biweekly TDWI FlashPoint electronic bulletin
• Quarterly TDWI Member Newsletter
• Annual Data Warehousing Salary, Roles, and Responsibilities Report
• Quarterly Ten Mistakes to Avoid series
• TDWI Best Practices Awards summaries
• Semiannual What Works: Best Practices in Business Intelligence and Data Warehousing corporate case study compendium
• TDWI's Marketplace Online comprehensive product and service guide
• Annual technology poster
• Periodic research report summaries
• Special discounts on all conferences and seminars
• Fifteen-percent discount on all publications and merchandise
Membership with TDWI is available to all data warehousing, BI, and IT professionals for an annual fee of $245 ($295 outside the U.S.). TDWI also offers a Corporate Membership for organizations that register 5 or more individuals as TDWI Members.