Getting Started with Learning About Data Warehousing
A Definition of Data Warehousing
A Definition of Decision Support
The Case for Data Warehousing
The Case Against Data Warehousing
Actions for Data Warehouse Success
Data Warehousing Gotchas
Performing Data Warehousing Software Evaluations
An (Informal) Taxonomy of Data Warehouse Data Errors
Data Warehousing Political Issues
Different Aspects of Data Warehouse Architecture
What to Learn About in Order to Speed Up Data Warehouse Querying
What to Learn About in Order to Speed Up Data Warehouse Loading
How to Save Money on Your Data Warehousing Efforts
Using Data Warehousing in Strategic Decision Making
Maintenance Issues for Data Warehousing Systems
What Decision Support Tools are Used For
Is Web Data Analysis (i.e., Web Data Mining) Different?

Getting Started with Learning About Data Warehousing

If you are new to this field and the way you like to get into a new field is by getting an overview, I suggest that you:

Read the books "Building the Data Warehouse" by W. H. Inmon, "The Data Warehouse Toolkit" by Ralph Kimball, "Data Warehouse from Architecture to Implementation" by Barry Devlin, and "Data Warehousing in the Real World" by Sam Anahory and Dennis Murray

With due respect to all the other fine books on data warehousing and decision support, when read in combination I believe these four books provide a great introduction to and overview of the strategic and tactical issues system developers face (even though the books are several years old; despite what you read in the trade media, data warehousing does not change that much).
Especially valuable are Inmon's overview and description of the iterative nature of data warehouse development, Kimball's description of data modeling principles and query/report tools, Devlin's descriptions of data extraction, cleaning, and loading issues and metadata, and Anahory/Murray's description of what can be done so a system can run efficiently and their description of the main tasks in a data warehouse project.

If you are a really ambitious reader, consider a few other titles. "The Data Warehouse Lifecycle Toolkit" by Ralph Kimball, et al., is a 700+ page, clearly written description of a methodology for constructing data warehouses. If you use Oracle, "Oracle8i Data Warehousing" by Gary Dodge and Tim Gorman provides practical technical advice that even a non-DBA can understand and appreciate. Finally, "Data Warehouse Design Solutions" by Christopher Adamson and Michael Venerable provides insight on model design for specific business problems. (By the way, the above material contains the only recommendations of commercial products in this site. There is no commercial connection between this site and the authors or publishers of the books just cited.)
Visit a couple of organizations that have had warehousing systems in production for over a year

You will get an excellent education if you can ask an organization that "has done it" what the biggest issues were in developing its systems and what the biggest issues are in maintaining them. Also, ask what the organization felt it did right and what it felt it could have done differently. I believe that if you do this you will learn a great deal about aspects of data warehousing that do not get discussed much in the literature: specifically the politics of data warehousing projects, the maintenance burdens data warehousing imposes, and how to deal with data warehousing software/hardware vendors and consultants.

Read up on some fundamental technical topics

You may find you will be greatly helped by reading up on SQL queries (especially multi-table and summary queries and subqueries), database indexing, join processing, and how query optimization works. Also helpful would be some knowledge about how logical structures can be created and how database partitioning can be used in conjunction with logical structures. There are many fine books on SQL. The latter knowledge will most likely be found in books aimed at DBAs for specific commercial databases.

Build something!

Computer texts love to cite a (supposedly) Confucian quote: "What I hear I forget. What I see I remember. What I do I understand." Well, this quote is apt in the case of learning about data warehousing. After you build something, no matter how modest, you will gain a more profound appreciation of the topic.

A Definition of Data Warehousing

My favored definition of a data warehouse is a slightly modified version of Ralph Kimball's definition on page 310 of The Data Warehouse Toolkit:

A data warehouse is a copy of transaction data specifically structured for querying and reporting.

Ralph states that a data warehouse is "a copy of transaction data specifically structured for query and analysis".
Two quibbles I have with Ralph's definition are: 1) Sometimes non-transaction data are stored in a data warehouse, though probably 95-99% of the data usually are transaction data. 2) I say "querying and reporting" rather than "query and analysis" because the main outputs from data warehouse systems are either tabular listings (queries) with minimal formatting or highly formatted "formal" reports. Queries and reports generated from data stored in a data warehouse may or may not be used for analysis. For more information about why the transaction data are copied, you may want to see my essay The Case for Data Warehousing.

What I especially like about Ralph's definition is what he does not say.

The form of the stored data has nothing to do with whether something is a data warehouse. A data warehouse can be normalized or denormalized. It can be a relational database, multidimensional database, flat file, hierarchical database, object database, etc. Data warehouse data often gets changed. And data warehouses often focus on a specific activity or entity.

Data warehousing is not necessarily for the needs of "decision makers" or used in the process of decision making.
Of course if you want to define every user as a decision maker and all activities as decision making processes, then my assertion is false. But in my experience, the overwhelming majority of uses of data warehouses are for quite mundane, non-decision making purposes rather than as grist for making decisions with wide ranging effects (so-called "strategic" decisions). In fact, I would assert that most data warehouses are used for post-decision monitoring of the effects of decisions (or, as some people might say, for "operational" issues). By the way, this is not saying that using data warehousing in the decision making process is not a wonderful, potentially high return effort. But my caution is that though the trade press, vendors, and many industry experts trumpet the role of data warehousing vis-à-vis decision making, this is an area that in reality we do not clearly understand. (See the writing of Peter Keen for more on this perspective.)

A Definition of Decision Support

The term decision support, if my knowledge of the history of this area is correct, goes back to the 1970s when it was coined by some academics associated with the Massachusetts Institute of Technology. Since then, many academic definitions have been offered. My purpose in this essay is to provide a definition that may lend clarity to practitioners.

A decision support system or tool is one specifically designed to allow business end users to perform computer generated analyses of data on their own.

I believe the essence of decision support is, in the language of the 1960s, to allow end users to do their own thing. I note that this definition is still fuzzy because what constitutes analyses and "on their own" are debatable points.

We cannot say that decision support systems or tools necessarily support the making of decisions.

What's in a name? As far as I know, cognitive researchers do not agree on how decisions are made.
Therefore, saying that these tools support making decisions is not a provable statement. Nor is it, in my opinion, an insightful way of defining these tools.

These tools do not analyze by themselves; rather, they help a person analyze.

In other words, the tools facilitate analyses rather than perform analyses. If you want to learn more about how the tools facilitate analyses, see my essay on What Decision Support Tools are Used For.

Data warehousing and decision support systems and tools do not necessarily go hand in hand.

Many data warehouses are not used as decision support systems. And decision support systems or tools do not necessarily require the use of a data warehouse as a source for data. I assert that, by far, the most used decision support tools are spreadsheets not connected in any automated way with a data warehouse.

Business intelligence seems to have become the vendors' preferred synonym for decision support.

My guess is that this is because decision support has an academic connotation and, as just mentioned, decision support systems do not necessarily support decisions. On the other hand, business intelligence systems do not necessarily make a business more intelligent. By the way, the consultant-coined term business intelligence goes back to the late 1980s, fell out of use, and then was revived by the DW/DSS world in the late 1990s. Confusingly, business intelligence is also used as a synonym for competitive intelligence (and is probably a more apt term for that area). By the way, "analytics" seems to be an up and coming name for this area, despite the mid-1990s consultant-coined term "analytical applications" never taking hold.
The Case for Data Warehousing

The following is a list of the basic reasons why organizations implement data warehousing. This list was put together because too much of the data warehousing literature confuses "next order" benefits with these basic reasons. For example, spend a little time reading data warehouse trade material and you will read about using a data warehouse to "convert data into business intelligence", "make management decision making based on facts not intuition", "get closer to the customers", and the seemingly ubiquitously used phrase "gain competitive advantage". In probably 99% of the data warehousing implementations, data warehousing is only one step out of many in the long road toward the ultimate goal of accomplishing these highfalutin objectives.

The basic reasons organizations implement data warehouses are:

To perform server/disk bound tasks associated with querying and reporting on servers/disks not used by transaction processing systems

Most firms want to set up transaction processing systems so there is a high probability that transactions will be completed in what is judged to be an acceptable amount of time. Reports and queries can require a much greater range of limited server/disk resources than transaction processing, so running them on the servers/disks used by transaction processing systems can lower the probability that transactions complete in an acceptable amount of time. Or, running queries and reports, with their variable resource requirements, on the servers/disks used by transaction processing systems can make it quite complex to manage servers/disks so there is a high enough probability that acceptable response time can be achieved. Firms therefore may find that the least expensive and/or most organizationally expeditious way to obtain high probability of acceptable transaction processing response time is to implement a data warehousing architecture that uses separate servers/disks for some querying and reporting.

To use data models and/or server technologies that speed up querying and reporting and that are not appropriate for transaction processing
There are ways of modeling data that usually speed up querying and reporting (e.g., a star schema) but may not be appropriate for transaction processing because the modeling technique will slow down and complicate transaction processing. Also, there are server technologies that may speed up query and reporting processing but may slow down transaction processing (e.g., bit-mapped indexing) and server technologies that may speed up transaction processing but slow down query and report processing (e.g., technology for transaction recovery). Do note that whether and by how much a modeling technique or server technology is a help or hindrance to querying/reporting and transaction processing varies across vendors' products and according to the situation in which the technique or technology is used.

To provide an environment where a relatively small amount of knowledge of the technical aspects of database technology is required to write and maintain queries and reports and/or to provide a means to speed up the writing and maintaining of queries and reports by technical personnel
Often a data warehouse can be set up so that simpler queries and reports can be written by less technically knowledgeable personnel. Nevertheless, less technically knowledgeable personnel often "hit a complexity wall" and need IS help. IS, however, may also be able to more quickly write and maintain queries and reports written against data warehouse data. It should be noted, however, that much of the improved IS productivity probably comes from the lack of bureaucracy usually associated with establishing reports and queries in the data warehouse.

To provide a repository of "cleaned up" transaction processing systems data that can be reported against and that does not necessarily require fixing the transaction processing systems

Please read my essay An (Informal) Taxonomy of Data Warehouse Data Errors for an explanation of the type of "errors" that need cleaning up. The data warehouse provides an opportunity to clean up the data without changing the transaction processing systems. Note, however, that some data warehousing implementations provide a means to capture corrections made to the data warehouse data and feed the corrections back into transaction processing systems. Sometimes it makes more sense to handle corrections this way than to apply changes directly to the transaction processing system.

To make it easier, on a regular basis, to query and report data from multiple transaction processing systems and/or from external data sources and/or from data that must be stored for query/report purposes only

For a long time firms that need reports with data from multiple systems have been writing data extracts and then running sort/merge logic to combine the extracted data and then running reports against the sort/merged data. In many cases this is a perfectly adequate strategy. However, if a company has large amounts of data that need to be sort/merged frequently, if data purged from transaction processing systems needs to be reported upon, and most importantly, if the data need to be "cleaned", data warehousing may be appropriate.

To provide a repository of transaction processing system data that contains data from a longer span of time than can efficiently be held in a transaction processing system and/or to be able to generate reports "as was" as of a previous point in time
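To make the "as was" idea in this last reason a little more concrete, here is a minimal sketch in Python using the standard sqlite3 module. It is not taken from the essays: the table employee_grade_history, its columns, and the sample rows are all hypothetical, and keeping effective-dated rows is only one common way to support point-in-time reports. The idea is that each change of grade adds a new row with a validity range instead of overwriting the old value, so a report can ask who was at a given grade on a given past date.

```python
import sqlite3

# In-memory database with a hypothetical effective-dated history table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee_grade_history (
    employee_id INTEGER,
    grade       TEXT,
    salary      INTEGER,
    valid_from  TEXT,  -- ISO date on which this row became effective
    valid_to    TEXT   -- ISO date on which it stopped being effective (exclusive)
);
INSERT INTO employee_grade_history VALUES
    (1, 'Level 3', 40000, '1996-06-01', '1997-03-01'),
    (1, 'Level 4', 46000, '1997-03-01', '9999-12-31'),
    (2, 'Level 2', 30000, '1996-01-01', '1997-02-01'),
    (2, 'Level 3', 35000, '1997-02-01', '9999-12-31');
""")

def salaries_as_of(as_of_date, grade):
    """Return (employee_id, salary) for employees who held `grade` on `as_of_date`."""
    return conn.execute(
        """
        SELECT employee_id, salary
        FROM employee_grade_history
        WHERE grade = ?
          AND valid_from <= ?
          AND ? < valid_to
        ORDER BY employee_id
        """,
        (grade, as_of_date, as_of_date),
    ).fetchall()

print(salaries_as_of("1997-01-01", "Level 3"))  # [(1, 40000)]
print(salaries_as_of("1997-04-01", "Level 3"))  # [(2, 35000)]
```

A system that stores only the current grade of each employee could answer neither question; the extra rows are the price of being able to report on the past.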
Older data are often purged from transaction processing systems so the expected response time can be better controlled. For querying and reporting, this purged data and the current data may be stored in the data warehouse where there presumably is less of a need to control expected response time or the expected response time is at a much higher level. As for "as was" reporting, sometimes it is difficult, if not impossible, to generate a report based on some characteristic at a previous point in time. For example, if you want a report of the salaries of employees at grade Level 3 as of the beginning of each month in 1997, you may not be able to do this because you only have a record of current employee grade level. To be able to handle this type of reporting problem, firms may implement data warehouses that handle what is called the "slowly changing dimension" issue.

To prevent persons who only need to query and report transaction processing system data from having any access whatsoever to transaction processing system databases and logic used to maintain those databases

The concern here is security. For example, data warehousing may be interesting to firms that want to allow reporting and querying only over the Internet.

Some firms implement data warehousing for all the reasons cited. Some firms implement data warehousing for only one of the reasons cited.

By the way, I am not saying that a data warehouse has no "business" objectives. (I grit my teeth when I say that because I am not one to assume that an IT objective is not a business objective. We IT people are businesspeople too.) I do believe that the achievement of a "business" objective for a data warehouse necessarily comes about because of the achievement of one or many of the above objectives.

If you examine the list you may be struck that the need for data warehousing is mainly caused by the limitations of transaction processing systems. These limitations of transaction processing systems are not, however, inherent. That is, the limitations will not be in every implementation of a transaction processing system. Also, the limitations of transaction processing systems will vary in how crippling they are.

Finally, to repeat the point I made initially, a firm that expects to get business intelligence, better decision making, closeness to its customers, and competitive advantage simply by plopping down a data warehouse is in for a surprise. Obtaining these next order benefits requires firms to figure out, usually by trial and error, how to change business practices to best use the data warehouse and then to change their business practices. And that can be harder than implementing a data warehouse.

The Case Against Data Warehousing

The literature is full of testimonials for data warehousing. There is almost nothing about the arguments against data warehousing. In this paper I attempt to slightly fill that void by shedding light on business and cultural factors that greatly lessen the value of data warehousing for certain organizations. By the way, when I refer to data warehousing, I refer to both centralized data warehousing systems and data marts.

Some of the reasons data warehousing efforts may not be appropriate for certain organizations are:

Data warehousing systems, for the most part, store historical data that have been generated in internal transaction processing systems.

This is a small part of the universe of data available to manage a business. Sometimes this part has limited value. That is, sometimes the business end user community does not have a strong interest in old transaction processing system data beyond what are available in basic reports generated in transaction processing systems. This lack of interest often stems from the fact that the markets in which a business competes are in great flux or that the internal structure of the organization is in perpetual transition. If these conditions exist, there may not be a solid historical base to compare current performance with. Also, sometimes there is a lack of interest in looking at this data in any in-depth way because a business is so simple that a data warehouse is overkill.

If most of your business needs are to report on data in one transaction processing system and/or all the historical data you need are in that system and/or the data in the system are clean and/or your hardware can support reporting against the live system data and/or the structure of the system data is relatively simple and/or your firm does not have much interest in end user ad hoc query/report tools, data warehousing may not be for your business. Whew! You can say that again. Anyway, you may find that as more of these conditions are met, the less value data warehousing may add to your firm. And once you get away from the big "Fortune 500, centralized IS" type shops most of the data warehousing vendors slant their marketing to, these conditions describe the reporting needs of many firms.

Data warehousing can have a learning curve that may be too long for impatient firms.

Despite the speed of the data warehousing development effort, it takes time for an organization to figure how it can change its business practices to get a substantial return on its data warehousing investment. I speculate that rigorous analysis of the return on most of the major data warehousing implementers' investments would find a much longer average payback period than you would surmise from reading the trade press.

Data warehousing can become an exercise in data for the sake of the data.

Data warehouses, like most other complex systems, take a life of their own. Organizations find that there are unlimited opportunities to add data to their data warehouse. Unfortunately, adding data without questioning the business value of the data can lessen the business value of the data warehouse and quickly increase the cost of maintaining the data warehouse.

Data warehousing systems can complicate business processes significantly.

If your organization does not know how to throw out processes (pardon my calling producing, distributing, and reading a report a "process"), data warehousing can quickly add clutter to the business environment. Data warehousing, if unchecked, can foster the "institutionalization" of easily created reports whose reason for being quickly is forgotten while people still toil to process these reports. Though the interest in business process reengineering seems to have waned, some of the appreciation of how complicated processes can slowly strangle a business has remained.

In certain organizations ad hoc end user query/reporting tools do not "take".
This is of concern to organizations that believe they can get their return on investment by having users write many of their own queries and reports. In some firms there are profound cultural barriers in the business organization to the acceptance of a tool that allows a person to ask questions on his own. Trying to promote the use of such a tool in these organizations is setting yourself up for failure. Or, sometimes these tools do not take because a business is so complicated that only relatively simple reports with little business value can be written by end users.

Many "strategic applications" of data warehousing have a short life span and require the developers to put together a technically inelegant system quickly.

Some developers are reluctant to work this way. Again, the importance of the culture should not be underestimated. This time, though, the issue is in the IS organization. If your sell of the data warehousing project is the ability to do this strategic work (which is probably now being done by your users with large and complex spreadsheets) as opposed to the usual development of canned and semi-canned reports and queries, ask yourself if the IS culture can accept this mode of working. For many organizations this approach to systems work is much harder to accept than most people realize.

There is a limited number of people available who have worked with the full data warehousing system project "life cycle".

I refer to availability of both employees and consultants. Systems of some depth require a considerable amount of time to develop fully. In other words, it takes a long time to gain experience with the usual problems that develop at different phases of a data warehousing effort. You should be wary of a consultant who says he has experience implementing scores of data warehouses in a couple of years. Usually this experience will be with a well-defined part of a data warehousing project that was amenable to outsourcing or with minor projects.

Data warehousing systems can require a great deal of "maintenance" which many organizations cannot or will not support.

Despite the best efforts to architect a system so "maintenance" (in quotation marks because it seems often there is never the closure to the initial data warehousing effort that the term "maintenance" implies) demands are minimized, many systems by their very nature require a great deal of care and feeding once they are in "production". It is important to note that the more successful a warehouse is with the users, the more maintenance it may require. By the way, it's very easy for the users to quickly go sour on a system they were enthusiastic about at roll-out time if the system personnel do not support the maturing of the system. Organizations that cannot or will not staff to meet these maintenance demands should think twice before they jump into the data warehousing business.

Sometimes the cost to capture data, clean it up, and deliver it in a format and time frame that is useful for the end users is too much of a cost to bear.

The percentage of time that must be devoted to extracting, cleaning, and loading data has been well discussed in the literature. It should be pointed out that there are some potential "show-stoppers" in these efforts. Loading data from previous years can require the knowledge of transaction processing system developers who have long since moved on. Cleaning data so they are in a form that is acceptable to users from different functional areas may require arbitration skills the typical data warehousing developer may not possess. Also, data may have to be loaded into a data warehousing system in a processing window that just isn't big enough. Sometimes compromises are acceptable getarounds. Often, though, compromises end up substantially compromising the value of the information in the data warehouse.

Finally, you may have seen articles that state that data warehousing failure rates are between 10% and 90%. Though how these failure rates are determined is suspect, there is no denying that data warehousing is risky. Now the fact that these efforts are risky does not bolster the case against data warehousing. Data warehousing has not repealed the positive relationship between risk and expected return in capital projects. However, if your organization does not know how to manage risky projects, then data warehousing may not be for you.

You may have gotten the impression from reading the trade press that data warehousing is only for large organizations because it requires huge staffs and huge budgets. Well, most of the trade press is dominated by vendors/consultants/publications trying to market to large organizations with huge staffs and huge budgets. Though I have no way to prove this, I think most data warehousing efforts are done by small staffs with modest budgets. In fact, in terms of numbers, smaller organizations are probably much more "into" data warehousing than larger organizations. It is only recently that practical technology for huge organizations who lust for multi-terabyte databases has become available. The technology for more modestly sized data warehouses, on the other hand, has been available for many years.

Actions for Data Warehouse Success

The following are some suggestions for the warehouse builder. These are points I rarely see discussed or I do not see discussed enough in the barrage of articles about data warehousing.

From day one establish that warehousing is a joint user/builder project

Warehouse projects will fail if the builders get specs from the users, go off for 6 months, and then come back with the 'finished' project. Warehouses are iterative! (I think the word iterative means there are lots of mistakes in the projects.) Builders and users working with each other will not reduce the number of iterations, but it will reduce the size of them. By the way, see Peter Block's Flawless Consulting for a great discussion of how to bring about 'joint' projects.

Establish that maintaining data quality will be an ONGOING joint user/builder responsibility

Organizations undertaking warehousing efforts almost continually discover data problems. Best to establish right up front that this project is going to entail some additional ongoing responsibility.

Train the users one step at a time

Typically users are trained once. In several days they learn both the basics and intermediate and sometimes advanced aspects of using a tool. Slow down! Consider providing training initially in the minimum needed for the user to get something useful from the tool. Then let the user use the tool for a while (meaning several days, weeks, or months). Having basic training and some hands on experience, the user will have a much better context with which to grasp the next level. Finally, once the basics and the next level are learned, schedule advanced training. Keep training the users! After a year using the tool, [...]

Train the users about the data stored in the data warehouse

Users often need more training about the stored data than about the tools used to access the data. Note that users are often used to seeing data in canned reports and seeing data in its "raw" form can be confusing. Do not assume the data are self-explanatory or that any metadata you may provide will answer every question.

Consider doing a high level corporate data model / data warehouse architecture "exercise" in three weeks

Actually, the key point regarding time is to "time-box" the exercise into a relatively short time. After about three weeks, the marginal benefits from additional time devoted to these types of exercises rapidly decrease. The corporate model is going to identify, at a high level, subjects and relationships and most importantly, what are the chunks of information that it makes sense to deliver in different projects. The architecture part of the exercise is to determine the dimensions, attribute names, definitions of derived data, and information sources that you will attempt to use consistently in your data warehousing efforts. The exercise also consists of coming to an agreement as to how to keep the corporate model up-to-date and how to make sure future data warehousing efforts pay attention to the architectural principles.

Once you know what raw data you want to feed into the data warehouse, request that data

If you have done some reading on data warehouse development you probably have read that figuring out the process of extracting, transforming, and loading (ETL) usually takes the majority of the time in initial data warehouse development. In project management lingo, figuring out ETL is usually on the critical path. You are probably going to have to ask one of the programmers of the legacy feeder systems to initially get this data for you. For reasons of politics, overwork, and just plain lack of knowledge of how data are physically stored in a system, the feeder system programmer often can take a while to get you that data. If you know what raw data you need, request it as soon as you know it.

Implement a user accessible automated directory to information stored in the warehouse

The majority of successful warehousing efforts I have seen included providing some means for the warehouse user to locate stored information. Most of the time this involved building a separate database with directory information. And most of the time, a pretty simple database sufficed for initial use.

Determine a plan to test the integrity of the data in the warehouse

Do not underestimate the importance of user faith in the integrity of the warehouse data. Huge warehouse efforts quickly go sour if after system roll-out users find multiple mistakes. A good investment of time in the initial stages of a warehouse project is for the builder and user to jointly determine what checks will be made on the warehouse data during development and what checks need to be made on an ongoing basis. The checks include tying warehouse data controls back to controls in feeder systems, checking the correctness of aggregation logic, and testing whether classification codes were assigned correctly.
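Two of the integrity checks named in this part of the essay, tying warehouse control totals back to the feeder system and checking aggregation logic, can be sketched as follows. This is only an illustration under stated assumptions: the tables feeder_orders, wh_orders, and wh_region_totals, their columns, and the sample rows are hypothetical, and the sketch uses Python's standard sqlite3 module rather than any particular warehouse product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feeder_orders    (order_id INTEGER, region TEXT, amount INTEGER);
CREATE TABLE wh_orders        (order_id INTEGER, region TEXT, amount INTEGER);
CREATE TABLE wh_region_totals (region TEXT, total_amount INTEGER);

INSERT INTO feeder_orders VALUES (1, 'East', 100), (2, 'East', 250), (3, 'West', 75);
INSERT INTO wh_orders     VALUES (1, 'East', 100), (2, 'East', 250), (3, 'West', 75);
-- The West aggregate is deliberately wrong so the check has something to find.
INSERT INTO wh_region_totals VALUES ('East', 350), ('West', 70);
""")

def control_totals(table):
    """Row count and amount sum for one of the two known detail tables."""
    if table not in ("feeder_orders", "wh_orders"):
        raise ValueError(f"unexpected table: {table}")
    return conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM {table}"
    ).fetchone()

def aggregate_mismatches():
    """Regions whose stored aggregate differs from the sum of warehouse detail rows."""
    return conn.execute("""
        SELECT d.region, d.detail_total, a.total_amount
        FROM (SELECT region, SUM(amount) AS detail_total
              FROM wh_orders GROUP BY region) AS d
        JOIN wh_region_totals AS a ON a.region = d.region
        WHERE d.detail_total <> a.total_amount
        ORDER BY d.region
    """).fetchall()

# Control check: the warehouse detail should tie back to the feeder system.
print(control_totals("feeder_orders") == control_totals("wh_orders"))  # True
# Aggregation check: stored summaries should match recomputed detail.
print(aggregate_mismatches())  # [('West', 75, 70)]
```

The inner join only compares regions present on both sides; a fuller check would also report regions that appear in the detail but not in the aggregate table, and vice versa.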
Market and sell your data warehousing systems
For the most part, use of data warehousing systems is optional. This means you have to identify the potential users of the systems, help them understand what the benefits of the system are, and then make them want to keep coming back to use the system.

From the start get warehouse users in the habit of 'testing' complex queries
Many people will assume that the query result is correct. At the very least, get the user in the habit of eyeballing the query or report to check if several records that should be included are, in fact, included and that several records that should not be included are, in fact, not included.

Be prepared to support beginning users immediately and at any time
We developers often greatly underestimate users' hesitation to begin using the data warehouse. This hesitation could be because of user fear of technology or user fear that they will not get IS support. Users also may want to use the data warehouse for the first time during the weekend or at 6:00 in the morning or 8:00 at night. The distractions are less at those times. So, the first point is to be available to help when the user wants to try to use the data warehouse the first time. If you want to make that beginning user a committed customer of your data warehouse, you had better be available to support the user when he starts out, whatever the day or the hour.

Coordinate system roll-out with network administration personnel
Use of data warehousing systems can bring about some strange spikes in network activity. If you keep network administration people informed of the roll-out schedule, chances are they will monitor network activity for you and be ready to make adjustments to the network as necessary.

Have a good grasp of desktop databases and spreadsheets
Even if you are dealing with a 100 TB database, there are so many little tasks to be done in a data warehousing project where knowledge of these tools will be helpful. Skillful use of these tools during development can be a huge productivity enhancer.

Maintain the audit trail to the feeder systems
That is, make it as easy as possible to tie the data in the data warehouse to the feeder systems. Your users have to trust the numbers in the data warehouse. You owe this to the users in order to maintain their trust.

Data Warehousing Gotchas
Here are some points for the warehouse builder I rarely see discussed or I do not see discussed enough in the barrage of articles about data warehousing. Forewarned is forearmed!

You are going to spend much time extracting, cleaning, and loading data
The usual figure quoted is that 80% of the time building a data warehouse will be spent on this type of work. (No one has ever explained how this percentage was obtained though.) Suffice it to say, though,
the amount of time on these tasks is often grossly underestimated. Note that this point is about extracting and cleaning and loading. Though by now many people are aware that cleaning the data is complex, extracting data and loading data are equally, if not more, complex.

Despite best efforts at project management, data warehousing project scope will increase
To paraphrase data warehousing author W. H. Inmon, traditional projects start with requirements and end with data. Data warehousing projects start with data and end with requirements. Once warehouse users see what they can do with 2000's technology, they will want much more. (Which is fine!) One piece of advice for the warehouse builder is never to ask the warehouse user what information he wants. Rather, ask what information he wants next.

You are going to find problems with systems feeding the data warehouse
Problems that have gone undetected for years will pop up. You are going to have to make a decision on whether to fix the problem in what you thought was the 'read-only' data warehouse or fix the transaction processing system.

You will find the need to store data not being captured by any existing system
A very common problem is to find the need to store data that are not kept in any transaction processing system. For example, when building sales reporting data warehouses, there is often a need to include information on off-invoice adjustments not recorded in an order entry system. In this case the data warehouse developer faces the possibility of modifying the transaction processing system or building a system dedicated to capturing the missing information.

You will need to validate data not being validated by transaction processing systems
Typically once data are in the warehouse many inconsistencies are found with fields containing 'descriptive' information. For example, many times no controls are put on customer names. Therefore, you could have 'DEC', 'Digital', and 'Digital Equipment' in your database. This is going to cause problems for a warehouse user who expects to perform an ad hoc query selecting on customer name. The warehouse developer, again, may have to modify the transaction processing systems or develop (or buy) some data scrubbing technology.

Some transaction processing systems feeding the warehousing system will not contain detail
This problem is often encountered in customer or product oriented warehousing systems. Often it is found that a system which contains information that the designer would like to feed into the warehousing system does not contain information down to the product or customer level. By the way, this is what some people label a 'granularity' problem.

You will underbudget for the resources skilled in the feeder system platforms
In addition to understanding the feeder system data, you may find it advantageous to build some of the "cleaning" logic on the feeder system platform if that platform is a mainframe. Often cleaning involves a great deal of sort/merging - tasks at which mainframe utilities often excel. Also, you may find that you want to build aggregates on the mainframe because aggregation also involves substantial sorting.
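To make the customer-name problem concrete ('DEC', 'Digital', and 'Digital Equipment' all meaning the same customer), here is a minimal sketch of the kind of scrubbing step a warehouse developer might write. The alias table, the function names, and the choice of 'Digital Equipment' as the canonical spelling are all assumptions made for illustration; a real project would maintain such a table with the business users, or buy scrubbing technology.

```python
import re

# Hypothetical alias table: every known spelling, lower-cased, mapped to one
# canonical customer name. The entries are illustrative only.
CUSTOMER_ALIASES = {
    "dec": "Digital Equipment",
    "digital": "Digital Equipment",
    "digital equipment": "Digital Equipment",
}


def normalize_key(raw_name):
    """Lower-case the name, trim it, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", raw_name.strip().lower())


def canonical_customer(raw_name):
    """Return the canonical name, or None when the spelling is unknown."""
    return CUSTOMER_ALIASES.get(normalize_key(raw_name))


def split_by_resolution(raw_names):
    """Separate names we can map from names a person has to review."""
    resolved = {}
    unresolved = []
    for name in raw_names:
        canonical = canonical_customer(name)
        if canonical is None:
            unresolved.append(name)
        else:
            resolved[name] = canonical
    return resolved, unresolved
```

Names that come back unresolved are the ones to route to a person. Guessing at them automatically is how a second, quieter data quality problem gets created.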
Many warehouse end users will be trained and never or seldom apply their training
I once read a study that claimed that only one quarter of the people who get training in a query tool actually become heavy users of the tool. By the way, if this happens do some honest research on why.

After end users receive query and report tools, requests for IS written reports may increase
This phenomenon was seen with many of the information centers of the 1980s. It comes about because the query and report tools allow the users to gain a much better appreciation of what technology could do. However, for many reasons the users are unable to use the new tools themselves to realize the potential. Granted there are many reports that are so complex that IS expertise is going to be required no matter what tool the end user has. However, many times this phenomenon points to training needs.

Your warehouse users will develop conflicting business rules
Many warehouse tools allow users to perform calculations. The tools will allow users to perform the same calculation differently. For instance, suppose you are summarizing beverage sales by flavor category. Also suppose that the flavor category includes cherry and cola. If you have a cherry cola brand there is a chance that two users will classify the brand in different categories. You will find that there are means to incorporate some of the business rules in your warehouse. However, the number of possible business rules is so large that you will not be able to incorporate all rules.

Your warehouse users may not know how to use data
After many years of using whatever reports have been thrown in their faces, the users may not know what data to use their newfangled decision support tools to retrieve. To use a phrase from pop sociology, the users have been "culturally conditioned" to use what they are given and to never ask for more.

Large scale data warehousing can become an exercise in data homogenizing
Data have quirks! Sometimes when we developers combine detailed data for different subjects, in our efforts to make everything 'fit' we can take the life out of the data. For instance, if your company sells dog food and auto tires, you want to be careful if you are building a sales data warehouse for both lines of business. You have to make a judgment call as to whether these businesses fit the same logical and/or physical model.

'Overhead' can eat up great amounts of disk space
A popular way to design a decision support relational database is with star or snowflake schemas. Persons taking this approach usually also build aggregate fact tables. If there are many dimensions to the data, be aware that the combination of the aggregate tables and indexes to the fact tables and aggregate fact tables can eat up many times more space than the raw data. If you are using multidimensional databases, be aware that certain products pre-calculate and store summarized data. As with star/snowflake schemas, storage of this calculated data can eat up far more storage than the raw data.

The time it takes to load the warehouse will expand to the amount of the time in the available window... and then some
You'll do yourself well by understanding the different ways to approach updating the warehouse. Before you decide that you can do complete refreshes, be aware that "There's all day Sunday to load the database!" have been famous last words of more than a handful of warehouse developers.

You are going to have a tough problem with security - especially if you make your data warehouse Web-accessible
You are going to face a paradox - the more accessible you make your data warehouse (and by accessible, I don't just mean making it Web accessible - I mean architecting it in a way that people want to use it), the greater the security risk you are exposing yourself to. Frankly, restricting people to "need to know" does not cut it in the organization of the 2000s. But, on the other hand, exposing information to theft from anyplace in the globe is not too great for job security either.

You are building a HIGH maintenance system
Reorganizations, product introductions, new pricing schemes, new customers, changes in production systems, etc. are going to affect the warehouse. If the warehouse is going to stay 'current' (and being current will be a big selling point of the warehouse), changes to the warehouse have to be made fast.

The data warehouse data you do not reconcile with the feeder systems will cause the problems
For certain data warehouse data you are going to think that there is no logical way that data in the feeder systems can be reconciled with what is in the warehouse. Then, when a user looks at a report and tells you "I think there is a problem", it will be with the unreconciled data. Unfortunately, you will then discover there is a way, albeit roundabout, to reconcile the data.

You will fail if you concentrate on resource optimization to the neglect of project, data, and customer management issues and an understanding of what adds value to the customer
If you provide a system that is fast and technically elegant but adds little value or has suspect data, you will probably lose your customer from day one and will have a tough time getting him back. For the most part, use of data warehousing systems is optional. The customer has to want to use the system.

Performing Data Warehouse Software Evaluations
Here are some ideas that may make the process of evaluating data warehousing software more effective. This is not a comprehensive list of tasks to follow in a technology evaluation. Rather, these are points that seem to be rarely discussed or followed in this wave of interest in data warehousing. An excellent paper to read along with this essay is Nigel Pendse's How not to buy an OLAP product which has advice that, for the most part, is applicable to buying any sort of data warehousing/decision support technology.

Do the evaluation yourself
That is, do not rely solely (or even in large part) on the ideas of someone outside your organization. There is no "metaphysically" best technology out there. All technologies have to be evaluated in the context of your organization's needs, expectations, limitations, and resources - which you know better
than any outsider. Also, you can never be sure of the outsider's biases. Outsiders' main worth really comes from their knowledge of criteria you can use in the evaluation - though you have to decide the weight of each criterion.

Be skeptical of data warehousing pundits' endorsements or reviews of technology
Often these pundits get compensated handsomely for these objective appearing endorsements or reviews.

Always first ask whether technology already in-house can do the job
Successful data warehousing/decision support systems can often be built without the specialized tools you see listed in this site. Taking on additional technology in your organization always imposes some burdens that should always be recognized before you hand over your organization's money.

Get references
Talking to reference sites is one of the most effective means of getting practical information. You would be surprised how important operational issues surface while doing evaluations. Some hints on reference gathering practices that have worked for me are:
Ask the software vendor for a complete list of referenceable sites - Try to have options as to which organizations you will call.
If this is a major decision for your company, call 5-6 sites - You need a minimum number of sites to help you detect patterns.
Make a telephone appointment to talk with the reference - The reference will appreciate this.
Send your questions to the reference in advance - Some of the references will be more comfortable if they know what you'll be asking.
Plan on 20 minutes with the reference - Again the reference will appreciate this.
Ask open-ended questions - You will find some interesting information with skillful questions.
Send a thank you note to your references asking if it would be okay to make a quick follow-up call if necessary - This will lay the groundwork if you have to call about another issue.

Read stock analyst reports on publicly held vendors and the industry outlook
Though these reports are intended mainly to get people to buy stocks, many times these reports can be an excellent source of background information on a vendor. Many libraries will have a large collection of these reports stored on CD.

If you are going to see multiple vendor demos, build a test case that each vendor will follow
This will allow you to compare apples to apples and peaches to peaches. Because departing from the standard vendor dog and pony show takes time on the part of the vendor, many will be unwilling to do this unless you are talking about a major purchase. One more point. Leave some open time at the end of the demo so the vendors can show features that were not covered well in the test case.
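One hedged way to keep a common test case comparable across vendors, and to make the weight of each criterion explicit, is a simple weighted score sheet. The sketch below is only an illustration: the criterion names, the weights, and the 1-to-5 rating scale are assumptions, not a method this essay prescribes, and the numbers mean nothing until your own evaluation team sets them.

```python
def weighted_score(weights, ratings):
    """Weighted average of one vendor's ratings.

    weights: criterion name -> weight chosen by your own evaluation team
    ratings: criterion name -> rating the vendor earned on the test case
    """
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"no rating recorded for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[c] * ratings[c] for c in weights) / total_weight


def rank_vendors(weights, ratings_by_vendor):
    """Return (vendor, score) pairs, best score first."""
    scores = {
        vendor: weighted_score(weights, ratings)
        for vendor, ratings in ratings_by_vendor.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical criteria and ratings, for illustration only.
example_weights = {"ease_of_use": 2, "ease_of_maintenance": 1}
example_ratings = {
    "Vendor A": {"ease_of_use": 5, "ease_of_maintenance": 2},
    "Vendor B": {"ease_of_use": 3, "ease_of_maintenance": 5},
}
example_ranking = rank_vendors(example_weights, example_ratings)
```

The score is an aid to discussion rather than the decision itself. Its main use is forcing the team to agree on weights before the demos begin.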
Check how well the software handles maintenance
Most of the time spent with a software tool will be with maintenance. See how well the tool handles changes. For example, most tools work with something like a data dictionary. See what the consequences are of changing the name of a field in the data dictionary. See how the dictionary helps you locate and change queries, forms, reports, macros, etc. that may be affected by the name change.

Understand the tradeoffs the software makes
Usually there is not a free lunch! Designers of tools trade off speed, capacity, ease of use, computer resource consumption, ease of development, and ease of maintenance. For instance, several report and query tools can be made quite accessible to end users if you are willing to maintain extensive data dictionaries. Several OLAP tools attain quick retrieval times by requiring the storage of huge amounts of pre-calculated numbers. To prevent some nasty surprises once the tool has been purchased, make sure the persons making the buying decision understand these tradeoffs.

Have a representative team perform the evaluation
Often technology acquisitions fail or go awry because a group within an organization felt it did not get its views heard during the evaluation. One of the first steps in a technology evaluation is to identify all 'interested parties' in the acquisition. Make sure these parties are asked how they want to be represented in the evaluation. If parties that are in conflict with each other will actively participate, and if you do not have the skills and/or patience to be a mediator, seek the services of an outside facilitator. Facilitation skills can be especially helpful if you have sessions dedicated to setting criteria, making your short list, and making the final decision.

If you're evaluating an end user tool, let an end user lead the evaluation effort
It seems odd but some organizations buy end user tools with little input from the end users of these tools.

Check the financial stability of the vendor
If you work for an organization with an accounts receivable department, the people in that department can help you with this. A simple check could save you some major potential grief.

Go to the vendor road shows to talk with other attendees
Sometimes I think that the audience at the vendor road shows is the best source of information. If you'll make a point of talking with several other attendees, chances are you will come across a person who is at the same stage in evaluating warehousing tools. You will find that you and that person can exchange information that is mutually beneficial.

An (Informal) Taxonomy of Data Warehouse Data Errors
You may have seen publications that tell you that you may have to spend the majority of your data warehouse development time building the means for both the initial and recurring extraction, transforming, and loading of data. What I have not seen, though, is much in-depth discussion of what exactly are those errors in the dirty data that you will spend your time cleaning up. What follows is a list of common errors. If you know the possibility that certain errors exist, you will be more prone to spot them and to plan your project to attack the errors in a manageable way. Forewarned is forearmed. Perhaps the material in this paper can help you formulate a checklist of errors you will be checking for.

Quotations are around the word errors because some errors are not, in the metaphysical sense, erroneous. Also, if you are a relational database expert, bear with my imprecise use of some terminology. Finally, note that when I refer to a data warehouse, I refer to the database that is directly fed with data from the source systems - not the data marts (or whatever you want to call them) that are fed with cleansed data.

The categories of "errors"
I place "errors" into four categories. So, with some awkwardness, let me suggest that errors involve data that are either:
Incomplete
Incorrect
Incomprehensible
Inconsistent

Incomplete errors
These consist of:

Missing records
This means a record that should be in a source system is not there. Usually this is caused by a programmer who diddled with a file and did not clean up completely. (I read a white paper about how users have to "fess up" about bad data. Actually, system personnel usually cause MUCH more headaches than users.) Note you may not spot this type of error unless you have another system or old reports to tie to.

Missing fields
These are fields that should be there but are not. There is often a mistaken belief that a source system requires entry of a field.

Records or fields that, by design, are not being recorded
That is, by intelligent or careless design, data you want to store in the data warehouse are not being recorded anywhere. I further divide this situation into three categories. First, there may be dimension table attributes you will want to record but which are not in any system feeding the data warehouse. For example, the marketing user may have a personal classification scheme for products indicating the degree to which items are being promoted. Second, if you are feeding the same type of data in from multiple systems you may find that one of the source systems does
not record a field your user wants to store in the data warehouse. Third, there may be "transactions" you need to store in the data warehouse that are not recorded in an explicit manner. For example, updating the source system may not necessarily cause the recording of a transaction. Or, sometimes adjustments to source system data are made downstream from the source system. Off-invoice adjustments made in general ledger systems are a big offender. In this case you may find that the grain of the information to be stored in the warehouse may be lost in the downstream system.

Incorrect errors
You can say that again! That is, the data really are incorrect.

Wrong (but sometimes right) codes
This usually occurs when an old transaction processing system is assigning a code that the transaction processing system users do not care about. Now if the code is not valid, you are going to catch it. The "gotcha" comes when the code is wrong but it is still a valid code. For example, you may have to extract data from an ancient repair parts ordering system that was programmed in 1968 to assign a product code of 100 to all transactions. Now, however, product code 100 stands for something other than repair parts.

Wrong calculations, aggregations
This situation refers to when you decide to or have to load data that have already been calculated or aggregated outside the data warehouse environment. You will have to make a judgment call on whether to check the data. You may find it necessary to bring data into the warehouse environment solely to allow you to check the calculation.

Duplicate records
There usually are two situations to be dealt with. First, there are duplicate records within one system whose data are feeding the warehouse. Second, there is information that is duplicated in multiple systems that feed in the same type of information. For instance, maybe you are feeding in data from an order entry system for products and an order entry system for services. Unbeknownst to you, your branch in West Wauwatosa is booking services in both the product and service order entry systems. (The possibility of a situation like this may sound crazy until you encounter the quirks in real world systems.) In both cases, note that you may miss the duplicates if you feed already aggregated data into the warehouse.

Wrong information entered into source system
Sometimes a source system contains data that were simply incorrectly entered into the system. For example, someone may have keypunched 6/9/96 as 9/6/96. Now the obvious action is to correct the source system. However, sometimes, for various reasons, the source system cannot be corrected. Note that if you have many errors in a source system that cannot be corrected, you have a much larger issue in that you do not really have a reliable "system of record".

Incorrect pairing of codes
This is best described by an example. Sometimes there are supposed to be rules that state that if a part number suffix is XXX, then the category code should be either A, B, or C. In more technical terms, there is a non-arithmetic relationship between attributes whose rules have been broken.

Incomprehensibility errors
These are the types of conditions that make source data difficult to read.

Multiple fields within one field
This is the situation where a source system has one field which contains information that the data warehouse will carry in multiple fields. By far the most common occurrence of this problem is when a whole name, e.g., "Joe E. Brown", is kept in one field in the source system and it is necessary to parse this into three fields in the warehouse.

Weird formatting to conserve disk space
This occurs when the programmer of the source system resorted to some out of the ordinary scheme to save disk space. In addition to singular fields being formatted strangely, the programmer may also have instituted a record layout that varies.

Unknown codes
Many times you can figure out what 99% of the codes mean. However, you usually find that there will be a handful of records with unknown codes and usually these records contain huge or minuscule dollar amounts and are several years old.

Spreadsheets and word processing files
Often in order to perform the initial load of a data warehouse it is necessary to extract critical data being held in spreadsheet files and/or "merge list" files. However, often anything goes in these files. They may contain a semblance of a structure with data that are half validated.

Many-to-many relationships and hierarchical files that allow multiple parents
Watch out for this architecture in source systems. It is easy to incorrectly transfer data organized in such a manner.
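As a small illustration of the "multiple fields within one field" case, the sketch below splits a whole name such as "Joe E. Brown" into three warehouse fields. The function name and the rule of handling only simple two-part and three-part names are assumptions made for the example. Real name data (suffixes, compound surnames, company names typed into a person field) need far more care, which is why anything the function cannot handle is returned as None for review instead of being guessed at.

```python
def split_whole_name(whole_name):
    """Split 'First Middle Last' or 'First Last' into three fields.

    Returns a (first, middle, last) tuple, or None when the name does not
    fit either simple pattern and should be reviewed by a person.
    """
    parts = whole_name.split()
    if len(parts) == 2:
        first, last = parts
        return (first, "", last)
    if len(parts) == 3:
        first, middle, last = parts
        return (first, middle, last)
    return None


# Hypothetical source values, for illustration only.
source_names = ["Joe E. Brown", "Joe Brown", "Brown"]
parsed = [split_whole_name(name) for name in source_names]
```

Counting how many source records come back as None is a cheap early measure of how much manual cleanup the name field is going to need.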
Inconsistency errors
The category of inconsistency errors encompasses the widest range of problems. Obviously similar data from different systems can easily be inconsistent. However, data within one system can be inconsistent across locations, reporting units, and time.

Inconsistent use of different codes
Much of the data warehousing literature gives the example of one system that uses "M" and "F" and another system that uses "1" or "2" to distinguish gender. May I suggest that you wish that this is the toughest data cleaning problem you will face.

Inconsistent meaning of a code
This is usually an issue when the definition of an organizational entity changes over time. For example, say in 1995 you have customers A, B, C, and D. In 1996, customer A buys customer B. In 1997, customer A buys customer C. In 1998, customer A sells off part of what was A and C to customer D. When you build your warehouse in 1999, based on the type of business analysis you perform, you may face the dilemma of how to identify the sales to customers A, B, C, and D in previous years.

Overlapping codes
This is a situation where one source system records, say, all its sales to customer A with three customer numbers and another source system records its sales to customer A with two different customer numbers. Now, the obvious solution is to use one customer number here. The problem is that there is usually some good business reason why there are five customer numbers.

Different codes with the same meaning
For example, some records may indicate a color of violet and some may indicate a color of purple. The data warehouse users may want to see these as one color. More annoyingly, sometimes spaces and other extraneous information have been inconsistently embedded in codes.

Inconsistent names and addresses
Strictly speaking this is a case of different codes with the same meaning. My unscientific impression of this type of problem is that decent knowledge of string searching will allow you to relatively easily make name and address information 80% consistent. Going for 90% consistency requires a huge jump in the level of effort. Going for 95% consistency requires another incremental huge jump in effort. As for 100% consistency in a database of substantial size, you may want to decide if sending a person to Mars is easier.

Inconsistent business rules
This, for the most part, is a fancy way of saying that calculated numbers are calculated differently. Normally, you will probably avoid loading calculated numbers into the warehouse but there sometimes is the situation where this must be done. As
noted before, you may have to feed data into the warehouse solely to check calculations. This can also mean that a non-arithmetic relationship between two fields (e.g., if a part number suffix is XXX, then the category code should be either A, B, or C) is not consistently followed.

Inconsistent aggregating
Strictly speaking this is a case of inconsistent business rules. In a nutshell, this refers to when you need to compare multiple sets of aggregated data and the data are aggregated differently in the source systems. I believe the most common instance of this type of problem is where data are aggregated by customer.

Inconsistent grain of the most atomic information
Certain times you need to compare multiple sets of information that are not available at the same grain. For example, customer and product profitability systems compare sales and expenses by product and customer. Often sales are recorded by product and customer but expenses are recorded by account and profit center. The problem occurs when there is not necessarily a relation between the customer or product grain of the sales data and the account - profit center grain of the expense data.

Inconsistent timing
Strictly speaking this is a case of inconsistent grain of the most atomic information. This problem especially comes into play when you buy data. For example, if you work for a pickle company you might want to analyze purchased scanner data for grocery store sales of gherkins. Perhaps you purchase weekly numbers. When someone comes up with the idea to produce a monthly report that incorporates monthly expense data from internal systems, you'll find that you are, well, in a pickle.

Inconsistent use of an attribute
For example, an order entry system may have a field labeled shipping instructions. You may find that this field contains the name of the customer purchasing agent, the e-mail address of the customer, etc. A more difficult situation is when different business policies are used to populate a field. For example, perhaps you have a fact table with ledger account numbers. You may find that entity A uses account '1000' for administrative expenses while entity B uses '1500' for administrative expenses. (This problem gets more interesting if entity A uses '1500' and entity B uses '1000' for something other than administrative expenses.)

Inconsistent date cut-offs
Strictly speaking this is a case of inconsistent use of an attribute. This is when you are merging data from two systems that follow different policies as to dating transactions. As you can imagine, the issue comes up most with dating sales and sales returns.

Inconsistent use of nulls, spaces, empty values, etc.
Now this is not the hardest problem to correct in a warehouse. It is easy, though, to forget about this until it is discovered at the worst possible time.

Lack of referential integrity
It is surprising how many source systems have been built without this basic check.

Out of synch fact data
Certain summary information may be derived independently from data in different fact tables. For example, a total sales number may be derived from adding up either transactions in a ledger debit/credit fact table or transactions in a sales invoice fact table. Obviously there may be differences because one table is updated later than another table. Often, however, the differences are symptoms of deeper problems.

Some ending thoughts
I hope this paper adds to the understanding of what takes up the majority of time in a data warehouse. Let me offer the following ending thoughts:

Be prepared for a lot of tedious work. Most of these errors do not jump out at you. Probably the most important "tools" for solving these problems are a sharp eye and endurance for checking an abundance of detail information. You may spend much more time checking for errors than cleaning up errors.

The errors of inconsistency are the most difficult to handle. At least that is my experience.

The complexity of a data warehouse increases geometrically with the number of sources of data fed into it. Having to reconcile inconsistent systems is the reason. For example, if it takes 100 hours to reconcile data from two source systems, you can expect that it will take on the order of 400, not 200, hours to reconcile data from four source systems.

The complexity of a data warehouse increases geometrically with the span of time of data to be fed into it. My previous comment applies. Note, however, that reconciling inconsistencies over time may be even harder because the people who know what happened in previous years may not be around to answer your questions.
You will be faced with an economic and political question as to how erroneous the data in your system will be. Completely fixing some of these problems can be quite expensive. More vexingly, often what constitutes "correct" data is debatable. What you do, more often than not, boils down to a question of money and politics.

Data Warehousing Political Issues
This paper is a list of political issues that frequently come up in data warehousing projects. My working definition of a data warehousing "political issue" is a situation where the equally valid and reasonable goals and interests of two or more parties collide with each other. That is, these are situations where there is great potential for conflict. Though these issues can appear minor and even petty, they can account for a good portion of the mental wear and tear experienced by data warehouse developers.

People often get blindsided by politics. Data warehousing projects probably are typical in this respect. My hope is that this paper might give readers some advance warning of these issues. Though what is done about these issues varies by organization, I believe the best advice to data warehouse implementers is to do your best to spot these issues early and then pick your battles wisely.

I recommend that you read Marc Demarest's The Politics of Data Warehousing in conjunction with this paper. In his June 1997 paper, Marc comments on how little extended discussion of politics there is in the data warehousing literature. As of the writing of this paper, to the best of my knowledge, that situation still has not changed. This is unfortunate because ambitious data warehousing projects are rife with political issues.

Data warehousing experiences all the usual political problems (i.e., resources, deadlines, etc.) that occur in complex technology projects. Just check into literature about IS project management and you will find a wealth of material on these issues. Finally, in this paper I try to list the political issues that are peculiar to data warehousing. In this paper, I have classified the political issues into those that are within the IS organization (IS to IS), those that are between IS and the users (IS to Users), and those that are between users (User to User).

IS to IS issues
Internecine conflicts in IS projects can be the most difficult to deal with.

Where does the data warehousing development group report to
The issue is whether the data warehousing development group should be a free standing development organization or whether it should be part of a group that traditionally has concentrated its efforts on transaction processing development. Often transaction processing development organizations have been driven by their work order backlogs and the need to react to whatever is the crisis on hand. Some persons believe that data warehousing, however, best flourishes when done with an entrepreneurial orientation rather than with a reactive orientation. On the other hand, many organizations quickly come to depend on data warehousing systems for day-to-day work. These data warehousing systems need to be as "industrial safe" as some of the transaction processing systems. Placing the data warehousing effort in a separate development group can lessen knowledge transfer and appreciation of how to make data warehouses industrial safe.
Who should administer the data warehousing databases - the DBA group or the data warehousing development group
The need to make data warehouse database structure changes can be relatively frequent. Proliferating data marts, uncertainty about usage patterns, and the "I'll know what I want when I see it" nature of data warehouse development can necessitate table and index changes. Data warehouse developers, concerned about losing the favor and interest of data warehouse users, want changes made quickly and get quite frustrated being put on the DBA backlog. On the other hand, DBAs often have knowledge about how to make database processing industrial safe. Cutting the DBA organization out of the data warehousing support loop can deprive the data warehousing effort of some valuable wisdom.

How to gain the cooperation of feeder system developers who appear to have much more to lose than to gain in the data warehouse development effort
Data warehousing efforts often bring to light problems in feeder transaction processing systems that may have been "hidden" for years. The developers of these systems, whose knowledge is often crucial to the data warehousing effort, may be reluctant to help if they feel that the data warehousing effort is going to be an audit of their work.

Should feeder system problems be corrected in the data warehouse or in the feeder system
Actually, in a firm with complex feeder systems, the question often becomes whether: 1) The feeder system should be fixed or 2) The feeder system should be left alone and the data in the warehouse should be fixed or 3) Data should be fixed in the data warehouse with the fixes fed back to the feeder system. And to further complicate matters, usually there are multiple problems with different groups suggesting different combinations of actions.

Against what data should reports be written
Often an organization quickly discovers that quite a few reports can be written against data in the data warehouse or against data in the transaction processing systems. This can be quite perplexing to organizations where there is not agreement as to what the data warehouse is for.

How big is the data warehousing batch processing window
Often there is need for a time period where transaction processing systems are kept stable so changes made to the systems can be captured and fed into the data warehouse. When changes cannot be easily identified, a typical course of action is to compare a previous copy of the transaction system database with the current database. After the changes are identified, a copy of the current database is made for comparison in the next processing cycle. On the other hand, the need to "freeze" transaction processing system databases can cause inconveniences to other processing. How much time should be allotted to the window in which transaction processing system databases are frozen can be a source of contention.

Who has ongoing responsibility for data quality monitoring
Data quality is not a one time concern to many firms that implement data warehouses. In some firms, it is not uncommon for previously undiscovered data quality problems to occur after the big push to clean data for the initial load of the data warehouse is done. Firms find it
necessary to install procedures to regularly audit data quality. And in most firms it is unclear who should have responsibility for executing these procedures.

How are requests to make feeder transaction processing system changes approved and how is knowledge about the changes communicated
Small changes in feeder transaction processing systems can have major impacts on the feed to a data warehouse. Conflicts arise when transaction processing system developers, under pressure from their users to make changes, now have to work with data warehouse developers to assess the impact on downstream systems. Even more vexing situations come when a change is made in the feeder transaction processing system and is not communicated to the data warehouse developers.

IS to User issues
User issues can be especially thorny with data warehouses because, unlike with transaction processing systems, use of data warehousing systems is often optional. Unless data warehouses are tailored to their preferences, users may quickly decide not to use the data warehouse.

How to gain the cooperation of a user whose spreadsheet is being automated
Often part of the goal of a data warehouse is to automate the production of a spreadsheet or series of spreadsheets that have been manually created by a user. This user's cooperation will be needed in the data warehouse development. Sometimes the user's corporate identity is tied to the spreadsheets and he or she feels (rightfully) threatened by the prospect of automation. Though dealing with this sensitive personnel issue probably should be the responsibility of user management, often the IS organization has the burden of figuring out how to gain cooperation.

Why should users give up control of user managed databases
Many user departments have, on their own, developed databases that meet some of their key reporting needs. Often these systems were built by user organizations on their own because the IS organization was unwilling or unable to help the users or the users were skeptical about the level of support they would receive if they were to work with IS. It is highly likely that, when a data warehouse that will subsume the functions of these user managed databases is proposed, these users will be skeptical about whether the IS organization can do as good a job supporting the user reporting needs as the users did on their own.

Should design be for the needs of the masses or for the needs of the most demanding user
In many data warehousing projects it is not uncommon for the IS organization to find one to a handful of users whose "needs" go way beyond those of most of the data warehouse users. Usually, the need is for a far greater level of detail and/or for far more history and/or for a series of reports of a high degree of both technical and business complexity. It can be quite expensive and time consuming to satisfy the needs of these far more demanding users. On the other hand, these users can have a peculiar need that is especially beneficial to the business and/or can be people whose support is vital to the success of the project.

When should requirements be frozen (and unfrozen). What requirements should be frozen
Data warehousing development is iterative. Also, there can be many start-stop cycles in data warehousing requirements definition. This does not mean that requirements never get frozen. Rather, some requirements may be frozen while some are always loose. Managing requirements definition in a data warehouse effort can require a deft political touch.

How to pass responsibility for running and maintaining a report from the users to IS
Users write reports that the business comes to depend on for day-to-day functioning. Here is what often happens: 1) The reports become too technically difficult for the users to change and/or 2) The report "code" becomes lost or corrupted and/or 3) The user leaves the organization (usually without documenting the report). In these cases, IS usually gets called in. This need to obtain IS involvement can create great consternation in an IS organization that thought that building a data warehouse was going to get it out of the report writing business.

Who should have responsibility for maintaining data warehouse data not fed by transaction processing systems
Often as part of a data warehouse it is necessary to manually maintain dimension tables and conversion tables that contain data not in any transaction processing system. For example, sometimes budget, forecast, or quota data must be manually maintained. This maintenance can be quite involved. Determining whether users and/or IS should bear the maintenance burden can be a major issue.

In how timely a manner are data corrected
Sometimes users are used to being able to make a correction to data and then immediately run reports against corrected data. Perhaps the users have been running reports against a transaction system database which could immediately be adjusted. Perhaps the users had their own database or spreadsheets which they could adjust at their will and then generate reports. Problems come if data warehouse developers design systems so corrections are now incorporated into the data warehouse during a batch feed at the end of the day or at the end of the week or at the end of the month.

Who is in charge of ongoing audit of data quality
As mentioned before, data errors pop up after the data warehouse is implemented. Also, problems occur because sometimes data is not fed from the transaction processing systems or fed multiple times. Many times it is necessary to make someone explicitly responsible for regularly auditing data. Unfortunately, it often is not clear who this person should be.

How many data marts should there be
Users want their own data marts for a variety of reasons. Some of the reasons are: 1) The desire to put their data on different hardware platforms so their reporting needs are less impacted by other people's processing 2) The desire to modify data at their own discretion (though this may strike terror in a data warehousing purist) 3) The desire not to have to work with other groups on resolving data definition issues. Some reasons sometimes do make good business sense. However, it can get quite expensive to support a proliferating number of data marts.

User to User issues
These are issues that involve potential conflicts among the users of a data warehouse. This does not mean that IS is not involved. Rather, IS can be right in the middle between users.

Who has access to what data
As can be imagined, one business group may not want another business group to see its data and one location may not want another location to see its data. Also common is for division personnel not to want corporate personnel to see detail division data. Perhaps more complicated to deal with are concerns of one user group that another user group may misinterpret data. Often one functional area thinks another won't understand certain data, e.g., Sales says Finance won't understand "its" numbers and Finance says Sales won't understand "its" numbers. Often people whose formal job it is to analyze information question whether people whose formal job is not to analyze information will misinterpret data, e.g., financial and market analysts question whether line accountants and sales people can understand certain data.

Who has final say over the correctness of data
If multiple user organizations are going to be accessing the same data, there will be ongoing disagreements about the "correctness" of data added to the data warehouse. These debates about correctness will not be about which items are in error. Rather, these will be debates regarding interpretation of data. Note that an unexpected consequence of data warehousing is that while before users might be able to reconcile their differences by making adjustments to summarized numbers, data warehousing may force them to agree on how the detail should be interpreted.

What dimensions, attributes, calculations, etc. should be defined similarly
You may have seen some data warehousing literature that talks about how the data warehouse should create a "common view" (or some similar term) of all the data. To put this in what I believe are more concrete terms, I believe that this is referring to making sure that dimensions conform, that attributes are used consistently, and that calculations are always calculated the same way. Though this is a nice ideal, I believe that most firms do not have the patience to do this. Rather, through a great deal of give and take, firms implementing a data warehouse decide on a subset of dimensions, attributes, and calculations whose definition is worthwhile making the effort to calculate similarly.

How to define a customer. How is profitability calculated
Most firms end up wanting to determine similar definitions of customers and profitability. It is my opinion that these definition tasks probably cause more political issues than any other definition tasks. It is very common to want to report profitability by customer and/or by product. If so, the firm may have issues as to what a customer is. A customer may be a legal entity, it may be a location, or it may be the people performing a function for a legal entity or a location. To determine profitability, it may be necessary to include expense allocations, the determination of which can be politically contentious. Finally, another common major issue regarding profitability is when a sale should be recognized. Note that a common use of a data warehouse is to report profitability for internal purposes in a way more meaningful than profitability as calculated per generally accepted accounting principles.

Conclusion
If you go through these issues I believe you will see three common threads regarding why data warehousing projects engender political issues: 1) Data warehousing imposes new
obligations whose responsibilities are unclear 2) Data warehousing requires changes in processes that an organization is comfortable with 3) Data warehousing requires agreement on some, but not all, definitions of data.

Different Aspects of Data Warehouse Architecture
This page is a list of the different aspects of data warehouse architecture. I am presenting this list because the data warehousing literature usually muddles the subject of architecture by lumping different types of decisions together or by forgetting certain types of decisions. Also, the literature makes these decisions seem much more black and white than they are. For example, in the area of what I call reporting and staging data store architecture, much of the literature discusses only the "enterprise" data warehouse, the dependent data mart, and the independent data mart options. In reality, there are many more variations being used that cannot easily be given a snappy label. This list will not attempt to provide detailed explanations of the different types of architecture.

Architecture is a pretty nebulous term. I think of architecture as a system design decision that is usually not easily changed. The decision is not easily changed because of the amount of work, money, and politics involved in doing so. This is a list of aspects of architecture that the data warehouse decision maker will have to deal with themselves. There are many other architecture issues that affect the data warehouse, e.g., network topology, but these have to be made with all of an organization's systems in mind (and with people other than the data warehouse team being the main decision makers).

Data consistency architecture
Doug Hackney's excellent but confusingly titled article on what he calls incremental data mart enterprise architecture is the most succinct statement of what this means. This is the choice of what data sources, dimensions, business rules, semantics, and metrics an organization chooses to put into common usage. (Though the article does not say it explicitly, it is also the equally important choice of what data sources, dimensions, business rules, semantics, and metrics an organization chooses not to put into common usage.) This is by far the hardest aspect of architecture to implement and maintain because it involves organizational politics. However, determining this architecture has more to do with determining the place of the data warehouse in your business than any other architectural decision. In my opinion, the decisions involved in determining this architecture should drive all other architectural decisions. Unfortunately, the determination of this architecture seems more often to be backed into than consciously made.

Reporting data store and staging data store architecture
The main reasons we store data in data warehousing systems are so they can be: 1) reported against, 2) cleaned up, and (sometimes) 3) transported to another data store where they can be reported against and/or cleaned up. Determining where we hold data to report against is what I call the reporting data store architecture. All other decisions are what I call staging data store architecture. As mentioned before, there are infinite variations of this architecture. Many writings on this aspect of architecture take on a religious overtone. That is, rather than discussing what will make most sense for the organization implementing the data warehouse, the discussion is often one of architectural purity and beauty or of the writer's conception of rightness and wrongness.
Data modeling architecture
This is the choice of whether you wish to use denormalized, normalized, object-oriented, proprietary multidimensional, etc. data models. As you may guess, it makes perfect sense for an organization to use a variety of models.

Tool architecture
This is your choice of the tools you are going to use for reporting and for what I call infrastructure.

Processing tiers architecture
This is your choice of what physical platforms will do what pieces of the concurrent processing that takes place when using a data warehouse. This can range from an architecture as simple as host-based reporting to one as complicated as the diagram on page 32 of Ralph Kimball's "The Data Webhouse Toolkit".

Security architecture
If you need to restrict access down to the row or field level, you will probably have to use some other means to accomplish this other than the usual security mechanisms at your organization. Note that while security may not be technically difficult to implement, it can cause political consternation.

As a final comment, let me assert that in the long run, decisions on data consistency architecture will probably have much more influence on the return on investment in the data warehouse than any other architectural decisions. To get the most return from a data warehouse (or any other system), business practices have to change in conjunction with or as a result of the system implementation. Conscious determination of data consistency architecture is almost always a prerequisite to using a data warehouse to effect business practice change.

What to Learn About in Order to Speed Up Data Warehouse Querying
This paper is a laundry list of items data warehouse implementers may wish to learn more about in order to speed up their data warehouse queries or to make the data warehouse "environment" more responsive to the bulk of the data warehouse query users. This paper will not attempt to provide detailed explanations of these topics. Nor is including a topic in this list a declaration that knowledge of the topic will definitely speed up querying. Rather, data warehouse implementers may use this paper as a starting point in their search for ways to speed up queries. This list includes topics that are relevant to many of the relational database and data access tool technologies. Some topics that apply, to the best of my knowledge, to one or two vendors' technologies are not listed.

SQL SELECT statements
This is bedrock knowledge. It is quite worthwhile to get a book on SQL (there are quite a few good ones) and review (or learn) this topic. Though you may think that your query tool's SQL generation
capabilities lessen the need for this knowledge, you will eventually find the SQL knowledge quite helpful.

How does your database join tables, union tables, use indexes, choose access paths
This is some more bedrock knowledge. Unfortunately, this information may not be that accessible. If the information exists, it may be poorly written, written for an academic audience, and/or scattered among many manuals. Nevertheless, it is worth making a determined effort to understand these topics. The vendor/consultant community would do itself well if it tried much harder to communicate this information in coherent and comprehensible terms.

What statistics your database provides on query execution
Sometimes those of us building stores of information for users to analyze forget about our own information needs. You need this information to identify which queries are especially resource consumptive. You probably will be concerned with a clump of queries that are far more consumptive than average. Sometimes the resolution of consumption issues is a simple rewrite of the query. Sometimes resolution is more technically involved and requires doing many things listed in this paper. And sometimes the solution is to do nothing - you just have to accept that your data warehouse has to support these demanding queries.

Aggregate tables
This is probably the most used method of speeding up queries. There are many discussions of this in the literature. The books "The Data Warehouse Lifecycle Toolkit", "The Data Warehouse Toolkit", and "Data Warehousing in the Real World" have especially good non-technology specific discussions of this topic.

Aggregate navigators/query redirectors
This is the technology that automatically directs a query to aggregated data if such data are available and appropriate for the query.

Partitioning
This is probably the second most common method of speeding up queries. At the very least, it is dividing one table into several tables usually based on the time the table data represent. Note that partitioning comes in many ways, shapes, and forms. Note that both tables and indexes may be partitioned.

B-tree indexing
Adding numerous indexes is another common method for speeding up queries. Note that persons with a transaction processing mindset may have a hard time accepting as much use of these indexes as is usually helpful in a data warehouse.

Dimensional modeling
With certain database technologies, this modeling can reduce the amount of sort/merging that goes on when joining tables. And, some query tools may generate more efficient SQL if data are modeled
dimensionally. Also, if you use surrogate keys in conjunction with dimension modeling, joins may be more efficient.

Bit mapped indexing
This technique can work well when a field takes on a low number of distinct values (i.e., low cardinality) and tends to be in WHERE clauses often.

Completely denormalizing aggregate tables
If these tables can be heavily indexed and can be maintained by complete refreshing, the requirements of join processing can be eliminated.

Parallelizing query execution
Developments in database technology have made doing this much easier. Note, however, the number of users running queries and the amount of data to be returned in a query can sometimes limit this technique's effectiveness.

Reducing the width of large tables that get scanned
There are also many ways to do this. Before getting fancy with this it is worth taking the time to understand what actually takes up space in your database tables.

Archiving/purging data
Sometimes the cost of having to scan through older data exceeds the benefit of having it available in the unlikely possibility someone wants to examine it.

Loading tables completely in memory
Presuming the memory is available to do this and you have researched other topics in this paper, this may be an interesting strategy.

Solid State Disk
Supposedly prices have come down in the last few years.

Locating different files used concurrently on different disks
This is basic stuff but it can be helpful.

Striping files
This means spreading a file over several physical disks. Look into the topic of RAID for more details.

Defragmentation of table and index files
This is more basic stuff.
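To make the bit mapped indexing item in this list more concrete, here is a minimal sketch of the idea in plain Python. It is not how any particular database implements the feature. The sample rows, the column names, and both helper functions are hypothetical and exist only for illustration. Each distinct value of a low cardinality field gets one integer used as a bit string, and a WHERE clause with AND becomes a bitwise AND of two such integers.

```python
from collections import defaultdict


def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int whose bit i is set
    when row i holds that value."""
    index = defaultdict(int)
    for position, row in enumerate(rows):
        index[row[column]] |= 1 << position
    return dict(index)


def matching_positions(bitmap):
    """Return the row positions whose bits are set in `bitmap`."""
    positions = []
    position = 0
    while bitmap:
        if bitmap & 1:
            positions.append(position)
        bitmap >>= 1
        position += 1
    return positions


# Hypothetical sample data with two low cardinality fields.
rows = [
    {"region": "East", "status": "open"},
    {"region": "West", "status": "closed"},
    {"region": "East", "status": "closed"},
    {"region": "East", "status": "open"},
]
region_index = build_bitmap_index(rows, "region")
status_index = build_bitmap_index(rows, "status")

# Equivalent of: WHERE region = 'East' AND status = 'open'
hits = region_index.get("East", 0) & status_index.get("open", 0)

if __name__ == "__main__":
    print(matching_positions(hits))
```

The sketch also hints at why low cardinality matters: the index holds one bitmap per distinct value, so a field with many distinct values would need many sparse bitmaps.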
Disk controllers
Too few can be a query bottleneck.

What your query tool attempts to do via SQL and what it does internally
The book "The Data Warehouse Toolkit" has a good discussion of where query tools may fall short. The reason you need to learn about this is to prevent using the query tool where it is inefficient or to know when you might build some "get arounds".

Query accelerators
These help you generate more efficient SQL. Note that they are probably more helpful to those who report off of highly normalized databases.

Query nannies
This is my term for technologies that warn (scold?) the user if he submits an inefficient query. Some of these provide hints about how to make the query more efficient and some (I have heard) actually try to fix up the queries.

Query governors
These stop queries usually after a specified number of rows have been returned and/or a specified time has elapsed.

Query scheduling capabilities
This does not necessarily speed up a given query. However, scheduling resource consumptive queries for off-hours times may free up resources for other queries during prime time.

Query queuing
As with scheduling, this does not speed a given query up. However, this facility gives you a means so priority queries (such as a query needed to gain information for the monthly close of the financial books) can execute faster.

"Productionizing" regularly used, highly resource consumptive queries
Certain queries probably should be written by someone with a great deal of knowledge of how to make queries efficient.

Storing the image of the report
If a report based on a query is used by many people and on-line retrieval of the report is needed, the image of the report may be stored. The query then need be run only once and perhaps at a less busy time. There are tools that allow intelligent retrieval of stored report data.

Query tool caching of results
Some tools store the results of some queries. If the same query is run again, the tool may check to see if the results are stored. Or, if a subset of a previously retrieved result set is desired, the tool will read the previously retrieved query result set rather than the data warehouse.

Query tool preview of a subset of records
When a query is being developed, some tools make it easy to retrieve a small subset of records that meet the query criteria. This makes it quicker to test the query and cuts down the number of potentially expensive test queries.

Multi-tiered architectures/Application partitioning
Some query tools allow you to run different components (i.e., "tiers" or "partitions") of the tool on different hardware servers.

Network bottlenecks
Though you do not have to become an expert at network topologies, it pays to trace the flow of data from the server to the user's workstation in order to see if there are any mismatched network components. For example, Fast Ethernet may be in your new facility but your user may have a 10Mbps network interface card. Or, your user may have a card that was advertised to perform at 100Mbps which in actuality performs at 30Mbps. Also, if some of your users will run queries that generate large result sets (and do not assume that only lengthy reports bring back large result sets to the query tool), find out how your network people load balance. They are more used to dealing with predictable transaction processing than extremely variable data warehousing demands. And if necessary, find out the costs of dropping more cable so you can put your users that run large result set producing queries on dedicated network segments. If you have invested millions in the data warehouse, the cost of an electrician and wire may be worth it.

The cost of installing more/faster CPU, memory, disk
Sometimes buying metal is (by far) the least expensive way to speed up your queries.

Database technology designed specifically for data warehousing and third party indexing technology designed to speed up queries
Look at my Database page and Query and Load Accelerators page for more information.

Making two copies of the data warehouse - one for "operational" users and one for "analytical" users
It actually is hard to draw a line between what is operational use and what is analytical use of a data warehouse. However, in a typical data warehouse most of the users (usually with more "operational" needs) are running IS written, parameterized queries. A relatively small number of users (usually with more "analytical" needs) are running potentially highly resource consumptive ad hoc queries. Though it is not necessarily pretty, sometimes the best way to handle this mixed use of the data warehouse is to create a separate copy of the data warehouse for each user group.

Some final thoughts about speeding up queries:

You best expect that many of your queries are going to run a "long" time. You will prevent some problems if you spend some time teaching your users about what, in general, will take a long time.
In line with what I just said, though, you can spend plenty of time tuning queries. In reality the area of speeding up queries involves plenty of guesswork, doing things by intuition, trial and error, and making uncomfortable trade-offs. Though many IS people like to spend their time tuning queries, this tuning time can take IS away from other data warehouse problems whose solution is more meaningful to the business.

What to Learn About in Order to Speed Up Data Warehouse Loading
This paper is another laundry list of items data warehouse implementers may wish to learn more about in order to speed up the process of extracting, transforming and loading data (henceforth simply referred to as loading) or to make these processes less prone to errors. This paper will not attempt to provide detailed explanations of these topics. Nor is including a topic in this list a declaration that knowledge of the topic will definitely speed up loading. Rather, data warehouse implementers may use this paper as a starting point in their search for ways to speed up loading. This list does not include points relevant to a specific vendor's technology. Your DBA should know some ways of speeding up the load that apply only to the technology of your DBMS vendor.

How often the users really need updated data
Oftentimes data warehouse developers unquestioningly give in to the most extreme demands for freshness of data or they automatically assume data need to be updated far more often than makes business sense. Though you sometimes read ridiculous articles in the trade press and from industry analysts (who have coined the awful term "information latency") about how the business world wants to know everything immediately, the reality is quite different. If your data warehouse is not there to support day-to-day monitoring and analysis, question why it should be updated daily. If your data warehouse is not there for week-to-week monitoring and analysis, question why it should be updated weekly. By the way, if you do decide to update weekly or monthly, try to design your loading process so you are not tied to loading at a specific interval. There may be certain "crunch" times when you have to load more frequently.

What input file formatting will speed up bulk loading
Oftentimes operations done on the input data on the feeder system platform (e.g., sorting, eliminating packed and signed fields) can speed up loading.

What facilities does the database have for bulk loading data and which of those facilities does it make sense to use
Many databases have ways of speeding up loading at the expense of data integrity checking. Note that certain bulk loaders do more than load - they will reformat data and sometimes aggregate data.

How to drop and re-establish indices and how to set index fill factors
If you update a large portion of the database (I've heard estimates from 10 - 25% up), you may want to learn about dropping indices before a database load and then re-establishing them after the load. If you do not drop indices, you want to make sure you set the index fill factors so your server's disk drives do not waste time looking for space in which to write index updates.

How to parallelize table load and index maintenance or re-creation
Dropping indices and bulk loading in parallel can drastically improve loading time. By the way, learn the differences between pipeline, component, and data parallelism. Given the circumstances, these different types of parallelism can have widely varying amounts of effect.

How indices are used by your database optimizer
You need to learn this so you can figure out whether your indices are actually going to get used. In more recent versions of DBMS software, you may be able to get away with fewer indices than in older versions.

How to load databases via a stream
Certain ETL tools will allow you to extract, transform, and load in one process. That is, it is not necessary to create intermediate files. You do, though, have to be careful about data source, size, platform, scalability restrictions and limitations on how sophisticated your transformations can efficiently be.

Where processes can be done in memory
If you have got the available memory, learn how to use it. Sorts especially can be sped up by doing them in memory.

Where does it make sense to transform the data
There may be faster places to do it than in your data warehouse database system. You may want to work with flat files and a dedicated sort/merge utility either on the data warehouse platform or, if the source data are on another platform, you may want to do it on that platform. The problem with doing this on the source system platform, though, is that you then will need people skilled in that platform and you may be invading someone else's fiefdom.

Where does it make sense to aggregate the data
Sometimes if you do the aggregating outside the data warehouse database environment you can create multiple aggregate output files in one "pass" of the input data. You will probably have to learn how to use memory very carefully if you do this (and have a lot of memory on the server on which you are doing the aggregating).

What integrity checks should be done in the loading process
After you perform the initial load of data warehouse tables, you may want to start a "discussion" of how all the errors you found should be trapped in the feeder systems (preferably at data entry time).

What domain integrity checks should be in the data warehouse database
Depending on how you resolve the above two issues, you have to investigate the sensibility of incorporating referential integrity or any other type of domain integrity checking in your database.

What statistics are available on aggregate table usage
As you might have read ad nauseam, building a data warehouse is an iterative undertaking. You will probably create aggregates that seldom get used. You need these statistics for making the case for deleting the aggregates (though be forewarned this can get you into a quirky political aspect of data warehouse management.)

What level of data it makes sense to aggregate and what non-additive measures are sensible to include in your aggregate tables
Say you have region, territory, customer, product, and salesperson dimensions. You may find that you get the most benefit by creating a region, territory, customer, product, and salesperson aggregate and that, say, an additional region, territory, customer, product aggregate adds little to the performance of your queries. A complicating factor, though, is use of non-additive measures in your aggregates because they will force you to re-aggregate. Suffice it to say that you should think twice before adding these measures to your aggregates.

What are alternate methods for changed data capture
Presuming you must incrementally update your data warehouse database and you are not extracting from date stamped transaction records in the feeder system, you may find you have a technically daunting task in capturing changed information. Be aware that you may have options in how you do this and the options will differ in speed.

How to modify feeder systems so changes to records are written to flat files
Though this usually is not worth it, if this is done it can eliminate the time needed to go through sometimes time consuming, convoluted processing to determine what feeder system data has changed.

How to use report scraping software
If a report that has the data you need to extract is available, sometimes it makes sense to put the report image in a file and use software specially designed to extract data from report image files. You do run a risk if the report format changes. But this technique often makes sense for extracting data from systems whose code hasn't been touched in the last ten years.

Whether you should incrementally update or rebuild a table
Sometimes you have the option to either incrementally update a table or rebuild a table. You may find that after a certain level of update activity it is faster to rebuild than to update. A rule of thumb sometimes stated is that if 20% of the records will be updated, it is faster to rebuild. This is a rough rule and the actual threshold will vary. Nevertheless, if you have options, it may be worth experimenting with them.

What are non-FTP ways of transferring data
FTP-ing can be slow. There are a number of high speed transfer technologies to investigate. Also, don't forget about using compression technology in conjunction with transferring. Also, don't forget about tape. Even if you have to send a tape overnight for early delivery, tape is sometimes the fastest way to transfer data.

How to perform disk mirroring and hot backups
Disk mirroring and hot backups will not speed up loading the data warehouse database (in fact, if a disk is mirrored while being bulk loaded, loading time can greatly increase) but they can give you some greatly desired flexibility and breathing room. With mirrored disks, you can "break" the mirror, update the copy, and restore the mirror with the updated copy. This means that you can still have your data warehouse available while loading it. (Though be careful that you understand how mirroring can be handled by both hardware and software). Similarly, hot backups allow you to have your data warehouse database available when backing it up. By the way, a cycle of partial backups followed by a full backup is also worth looking into.

How to set a restartable checkpoint
Again, checkpoints will not by themselves speed up the loading process. However, if you have a tight window for loading the data warehouse and that loading takes considerable time, availability of a checkpoint can be a lifesaver when the load crashes (which it does at the worst times).

How to schedule loading processes
Loading a data warehouse usually requires quite a few processes. Obviously, you want to understand where there are and are not dependencies so you can "multi-task" these processes as much as possible. Where there are dependencies, you want to do risk analyses so you can find out whether it is worth the effort to build in restart capabilities in the intermediate processes. And you want to make sure you have the human and automated support for scheduling the way you want to.

Partial updating of multidimensional (MOLAP) databases
Many of these tools allow you to only recalculate some of the calculated numbers stored in the "cube". Most of these tools that have the capability will warn you that you do so at the risk of possibly getting data out of synch.

How to make a copy of your transaction system database
If you really want to use your data warehouse only for production reporting, you may be better off just copying the transaction database periodically as is. Architectural purists hate this solution but sometimes it just makes sense to handle your reporting needs this way.

How to defragment table and index files
This is basic knowledge it will probably do you well to know.

How to distribute data on multiple physical disks
If you can afford multiple disks, you may want to make sure input data, data warehouse tables, indexes, and logs (if you do not disable logging) are on different physical disks. In fact, you may want to learn about striping to spread a file over multiple disks and partitioning to divide a logical file into many physical files spread over different disks.

How certain forms of RAID technology can both speed and slow loading
RAID technology can both help and harm loading speed.
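
The scheduling point above (knowing where there are and are not dependencies so that load processes can be "multi-tasked") can be illustrated with a small sketch. This is only one possible approach, using Python's standard graphlib module (Python 3.9 or later); the step names in LOAD_STEPS and the function plan_waves are hypothetical and stand in for whatever processes a real load involves. Each "wave" groups steps whose prerequisites are all finished, so the steps inside one wave could be run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical load steps, each mapped to the set of steps it depends on.
LOAD_STEPS = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_orders": {"extract_orders"},
    "transform_customers": {"extract_customers"},
    "load_fact_sales": {"transform_orders", "transform_customers"},
    "rebuild_indexes": {"load_fact_sales"},
}


def plan_waves(steps):
    """Group steps into waves; steps within a wave have no mutual dependencies."""
    sorter = TopologicalSorter(steps)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock its dependents
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(plan_waves(LOAD_STEPS), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

A plan like this also shows where a crash would force the most rework, which may help with the risk analysis about where restart capabilities are worth building.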
How to use multiple disk controllers
You will want high-speed interconnects to these controllers.

What is the cost of installing more/faster CPU, memory, disk
Sometimes buying metal is (by far) the least expensive way to speed up loading.

Some final comments
In the long run long loading times usually will cause bigger problems than long query times. It is not completely uncommon that data warehouse development teams find themselves with systems they have promised to update daily but then they find the update time stretches to 12, 14, 16, and maybe even 20 hours. You can throw more and more technology at this but ultimately your best tactics are the ability to understand what really is most important to the business and good user expectation management.

How to Save Money on Your Data Warehousing Efforts

This essay is not a list of tactics to be used in deploying the technology of your choice. Rather this is a list of pointers that may prompt a data warehouse developer to think twice before making those project management, political, and technical design decisions whose cumulative effect is to force far more resources to be committed to a data warehousing effort than what was expected.

First, note how much more discretion there usually is in the design and implementation of data warehousing systems as opposed to transaction processing systems. In a transaction processing system, the users of the system, the service level provided to the users, the data to be stored in the system, the technology to be used, and, in many cases, the functionality of the system are usually subject to relatively little discretion. In a data warehousing effort, though, there is generally far greater discretion over these factors. However, for lack of time, political pressure, or unquestioning acceptance of mainstream industry thinking, data warehousing developers often fail to understand the range of choices they have. I hope these pointers will give you a little pause.

Have a reason besides expediency for building a report or query in the data warehouse as opposed to the feeder transaction processing system
You probably won't be far into your data warehousing efforts when you see a report or query that could be done in the data warehousing system or in the feeder transaction processing system. And since you're the data warehouse developer you'll probably decide that the report or query is easier to do in the data warehouse. Welcome to the slippery slope! You're going to find more reports and queries that could go "both ways". Before you know it, you can end up with a data warehousing system that is in effect your "production" report and query generation system and which requires the same service level as the feeder transaction processing system. You may even end up doing transaction processing in your data warehouse (some data warehousing analysts politely call this "a feedback mechanism") to send corrected data back to the transaction processing system.

Now, unless it is done by design, do not let your data warehouse be the main source for operational-oriented query and report functionality that, in the big picture, ought to be in the feeder transaction processing systems. That being said, using a data warehouse for unbundling the querying and reporting functionality from a transaction processing system may be a good investment if you do it by design. If this unbundling is done insidiously, you can quickly back yourself into supporting, at great cost, two production systems that provide duplicate functionality.

Set expectations about response time before the users use the data warehouse
You had best discuss performance issues with your users at the very start of your data warehouse investigations. Else they may expect response time to be the same as moving a cell in an Excel worksheet. If you do not discuss expected performance issues with your users, you are setting yourself up for costly (and possibly perpetual) rework of your design when the data warehouse performance does not meet the initial expectations of the users. These "obvious" points never get mentioned enough:
1) Data warehousing performance can fluctuate far more than transaction processing system performance (e.g., for some reason every user will want to do a five year trend analysis at the same time)
2) Not everyone starts using the data warehouse at the same rate. As more users start using the system, average performance tends to drop
3) If your data warehouse is being used for ad hoc end user work, you most likely won't be able to "tune" your data warehouse system for everything your users are going to throw at it.

Do the work to determine the economics of different service levels
Get an appreciation of how much increments to the data warehouse service level cost. This type of analysis is an "art" but an art that your database/hardware vendor/consultant (with your questioning every assumption they make) should be able to help you with. Be skeptical about comparing this type of analysis between different sets of technologies. By the way, the important knowledge is how making adjustments with a given set of technologies will change cost and expected performance.

Do the analysis of whether platforms your organization has been using for a long time are appropriate for your data warehousing efforts
Mainframe, proprietary midrange, and file server network operating systems are legitimate platforms for data warehousing. Before data warehousing was called data warehousing, these platforms were being used quite successfully for data warehousing systems. In fact, though you will not read about it in the trade media, these platforms still are being used successfully for data warehousing. The platforms are not always appropriate but if you have a substantial investment in these platforms and the "keepers" of those platforms are not overly resistant, you may find doing the analysis worthwhile.

Do the analysis of whether your users should directly report/query against data stored in the transaction processing systems
In the 1970s, the mainstream industry wisdom was that data should be extracted and reported against. In the 1980s the mainstream wisdom did a "180" and said that "data shall not be duplicated" and that you should go against the real stuff. In the 1990s, the mainstream wisdom did another "180". Reporting against transaction processing system data is not always appropriate, but unless you automatically want to accept mainstream wisdom which never seems to consider the varieties of situations people face, it is worthwhile to do the analysis. (And then in the 2000s you will be considered in the avant garde and you will be a source for mainstream wisdom.)

Bargain with the database and hardware vendors
Chances are you are going to buy your database and your hardware from some well known, historically profitable vendors. If you do your homework, you will find written material (not specifically about data warehousing though) and consultants available to advise you how to deal with specific vendors.

If you will have large numbers of users who only run canned reports, consider the alternatives to providing these users with "full blown" client based report and query, OLAP tools
In the typical data warehouse, the majority of users will strictly be running canned reports. (Estimates that 75% - 98% of data warehouse users are strictly report users have appeared in the trade press.) A great deal of money can be spent licensing and supporting functionality that the users will rarely use. Alternatives to providing canned report users with full blown tools vary based on the technology you are using and the politics of the situation. But the alternatives are usually there if you look.

Implement query efficiency enhancing design techniques that do not require special hardware or software
Specifically learn about using aggregate tables and partitioning. These techniques can be used with any type of database or file access methods. Though these techniques can be overused, they generally are the simplest, most effective, and least expensive ways to speed up retrieval of information.

Itemize possible data cleaning tasks and, with the data warehouse users, examine if each of the major tasks is worth the effort
You will probably come up with a long list of data problems many of which are not worth the effort to clean up. Note that "worth" is a judgment that the data warehouse developers and the users have to agree upon.

Think twice before building the means to perform complex calculations that few business users understand
It is not that uncommon for one business user to decide that he or she needs the data warehouse to store or report a set of numbers that are extremely difficult to determine and more importantly, that most business users have a hard time understanding. In this case, the data warehouse developer has to diplomatically discuss whether it is worth calculating a set of numbers that perhaps only one business user will understand. Sometimes it is, most times it is not.

If the main reason you are considering a data warehouse is to get around the difficulties caused by a dysfunctional transaction processing system, do the work of costing how much it will cost to fix the transaction processing system before you make the data warehouse decision
It may not be surprising that the primary motivation for the construction of many data warehouses is to get around the difficulties caused by a problematic transaction processing system. Immediately deciding upon a data warehouse as a "fix" can be an expensive mistake. If you don't do the work of costing how much it will cost to fix the transaction processing systems, you may never understand what is really causing the problems. And then you're setting yourself up for a situation where the same problems recur in the data warehouse and you end up supporting both a dysfunctional transaction processing system and a dysfunctional data warehouse.

If most of your business needs are to report on data in one transaction processing system and/or all the historical data you need are in that system and/or the data in the system are clean and/or your hardware can support reporting against the live system data and/or the structure of the system data is relatively simple and/or your firm does not have much interest in end user ad hoc query/report tools, you may not NEED a data warehouse
Sometimes a good report generator will do just fine.

Question whether you really will benefit from certain categories of tools
For some data warehouse implementations, certain types of tools just do not make good business sense. For example, if you have no need for the slice-and-dice or modeling capabilities of OLAP tools, a report and query tool may meet your reporting needs more than adequately. If you have to perform fairly complex data transformations and/or you have relatively few data sources and targets, you may be better off coding by hand than using a so called "data mart" tool. The database you use for transaction processing may do just fine based on the number of users, amount of data, and time you have to load the database. Before buying data mining tools do your best to assess whether they will yield "actionable" insights worth the effort in making the data mining tool work.

Accept that data warehousing is going to be technically messy
If someone were ever to write "The Zen Of Data Warehousing" (perish the thought - please), one of the concepts would probably be that at some point, the more technically elegant you try to make these systems, the messier (and more costly and less beneficial) they end up being. There are no rules for determining where this point is. Use your judgment and intuition to make the determination.

Using Data Warehousing in Strategic Decision Making

Though you can read many definitions of data warehouses that say that these systems are designed for "strategic decision makers" (or some other similar term) there is little written about actually using data warehouses in strategic decision making processes. In this essay, I would like to offer some insight into using data warehouses in such decision making exercises.

First, let me define strategic decision making. There probably are thousands of published definitions. For working purposes let me say that a strategic decision is one that involves spending a lot of money and/or firing/re-assigning/hiring a lot of people and/or that is going to cause a lot of pain/joy until the next strategic decision is made. (Of course "a lot of" is a relative term.)

I assert that most of the uses of data warehouses are not for strategic decision making. Probably the most important reason for this is that strategic decision making usually is not done that often. Rather, I believe that most data warehouses are used primarily for post decision monitoring of the effects of decisions. Nevertheless, some data warehouses do get used in strategic decision making and are used very profitably.

What follows are some personal observations on how you may actually use a data warehouse in a strategic decision making exercise.

Systems for strategic decision making tend to be relatively short-lived.
The amount of time spent using these systems sometimes can be measured in days counted on one hand. Those couple of days using the system, though, can bring more payoff than some canned reporting system used for years.

Creating "special" databases, modeling (not in the IS sense of the word), and formal reporting are the most time consuming tasks when using data warehouses in strategic decision making.
Later I will go into more detail regarding these topics.
Usually the work must be done quickly and is requested with little advance notice.
This work usually has to be done in anything from a long afternoon to several weeks. There is usually no time for formal interviewing and extended data modeling exercises. The "requirements" are usually gleaned from "business" meetings which IS may have a little struggle to get into or are related secondhand from attendees of these meetings. These requirements are usually ambiguous. IS usually has to put on its business hat and figure out what is really needed by the business. This is "figure it out as you go along work" where IS often must take the part of the business analyst.

Sometimes data cleanliness is much less of a concern in strategic decision making.
Sometimes the analysis being done with highly summarized data and/or the need for speed lessens the need for extremely clean data. I do suggest, however, that whatever the data expectations are, you keep an audit trail that lets you trace how data were derived from feeder systems.

You may need to create special databases.
Often you need to run repeated queries against a subset of the data warehouse. The subset may be one created by an extract query with quite complex constraints. Or, you may need to repeatedly access new aggregates and calculations or you may have to repeatedly concurrently access data that are not in the production data warehouse or that are in the production database but are not easily combined. For the sake of simplicity and efficiency, your best course is to create a special database. You may be thinking you created a data warehouse so you would not have to build special "extracts" but, as I just mentioned, often there just is no way of avoiding these extracts. (For more on somewhat similar ideas about these special databases, see Thomas Davenport's description of a "data deli" and Ralph Kimball's discussion of "behavioral studies".)

You are doing this work because when you built the data warehouse, you built it according to what then was the common view of the business. You will probably have to aggregate data differently, use different calculations for derived numbers, and combine data that have never before been combined. The work you are doing allows the business to see a point of view that is not the common view of the business. (In other words, perhaps to no surprise, a part of many effective strategic decision making exercises is to see the business in a different perspective.)

You may have to "feed" data into user maintained spreadsheet models.
Much of the use of data warehousing for strategic decision making ultimately involves "feeding" user maintained spreadsheets. These "feeds" are either links to data stored in a data warehouse or the actual loading of data into spreadsheets. The spreadsheets are used because the user needs to change complex calculations - maybe as part of a scenario analysis but usually because there is continual doubt about how certain calculations should be made - and the user is most knowledgeable about doing these changes in the spreadsheet environment. (To put this in slightly more technical terms, many of these calculations are inter-record, cross dimensional calculations). Many OLAP tools allow a great deal of flexibility in making calculations but these capabilities tend to be too difficult for the user who is in a hurry in the strategic decision making exercise. Note also that oftentimes it is necessary to, in turn, feed spreadsheet data into the special databases you have created.

You may have to create some highly formatted reports.
The information from the data warehouse has to be communicated to people who do not have and/or want direct access to the data warehouse. In a strategic decision making exercise, despite the rush, your users may want to communicate the information in printed reports that look just "so". These reports are usually being created to persuade someone. Many of your users will want a polished look to the reports in order to convey credibility. Also, graphs are usually created for these exercises. By the way, there is usually some give and take as to whether these reports and graphs should be created manually (i.e., with a word processor, presentation tool, spreadsheet) or generated directly from the database.

Now some advice:

Probably the most important determinant of the benefit you will get from technology is your ability to figure out the most insightful questions that the technology enables you to ask.
Do not assume that your users have full appreciation of the power of the technology. Users will tend to either grossly underestimate or overestimate the power of the data warehouses in these strategic decision making exercises. Unless you have some users with good gut instincts about technology, IS has to take the part of the business analyst to spur the imagination of the users.

Try to get in "the loop" early.
These exercises come up unexpectedly. This means that either IS can miss an opportunity or be faced with an impossible task that must be done quickly. Note that there are usually politics in getting in the loop early. However, having previously built up a relationship of trust with a "decision maker" helps greatly.

Do not let the knowledge of the systems stay in the minds of the outside technical consultants
This trite and obvious piece of advice needs to be repeated. If the key knowledge of your systems is in the heads of consultants, you may be up the creek when these exercises come up. The technical consultants are gone and not available when these opportunities come up. And do not make yourself completely dependent on outside resources whose availability you cannot control.

Do not put everything you can possibly think of in the data warehouse.
When you are initially designing the warehouse, do not try to design for every contingency that could occur in a strategic decision making exercise. You are not going to be able to foresee everything that will be needed in these exercises. Do, though, try to keep atomic data in some electronically retrievable format. Do your best to conform the main dimensions of data used in your business. (That means customer, product, financial account, and internal "entity", i.e., people and department, identification.) Do address the slowly changing dimension issue.

Learn spreadsheets and how your data warehouse can interact with them.
We in the data warehouse world often forget that the spreadsheet is by far the most used decision support tool. Persons supporting data warehouses that really will be used for decision support should be encouraged to learn the scripting language of the spreadsheet (which for most people is Visual Basic for Applications) so they have the flexibility in coming up with solutions in these strategic decision making exercises.
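
As one illustration of "feeding" a spreadsheet by actually loading data into it, the sketch below writes a query result to CSV, a format most spreadsheets can open. It is only a sketch under stated assumptions: it uses Python with the standard sqlite3 and csv modules rather than a spreadsheet scripting language, and the table sales_summary, its columns, and the function export_query_to_csv are hypothetical names made up for the example.

```python
import csv
import io
import sqlite3


def export_query_to_csv(conn, sql, out):
    """Write the result of `sql` to `out` as CSV, header row first.

    Returns the number of data rows written.
    """
    cursor = conn.execute(sql)
    writer = csv.writer(out)
    # cursor.description holds one tuple per result column; item 0 is its name.
    writer.writerow([column[0] for column in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


if __name__ == "__main__":
    # A tiny in-memory database stands in for the data warehouse (an assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales_summary (region TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO sales_summary VALUES (?, ?)",
        [("East", 10.5), ("West", 7.0)],
    )
    buffer = io.StringIO()
    export_query_to_csv(
        conn, "SELECT region, total FROM sales_summary ORDER BY region", buffer
    )
    print(buffer.getvalue())
```

Keeping the query text alongside the exported file is one simple way to preserve the audit trail back to the warehouse data.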
Don't "production-ize" your work.
The technical work done in these exercises is usually not "industrial strength" and it is probably not worth the effort to make it so. You may learn, though, that you need to modify your production data warehouse database. By the way, do keep your work around so you can cannibalize code for the next strategic decision making exercise.

Do not claim that data warehousing alone will necessarily improve strategic decision making
It needs to be oft-repeated that if a person is a mediocre decision maker, technology alone will not make that person a better decision maker - especially in the realm of strategic decision making where, despite our 100 TB databases, much more remains unknown than known.

It is hard to calculate the expected ROI of a data warehouse project. Most businesses have to go on faith that the effort somehow will be worth it. If you do not justify a data warehouse before building it, it is smart, perhaps imperative, to justify the data warehouse after the fact. And the best way you are going to do this is "anecdotally" with successful war stories like a strategic decision making exercise. Well, despite the messiness of the work, success (or, sometimes, just participation) in a strategic decision making exercise, can strongly bolster the belief that the data warehouse was worth the effort. Don't miss these opportunities.

Maintenance Issues for Data Warehousing Systems

Another important aspect of data warehousing and decision support systems (hereafter referred to as DW/DSS systems and I know that is redundant) where I see little public discussion is maintenance of these systems. Here I present some of the issues that you may face when your systems are "in production", as if these systems ever achieve the stability implied by that term. This list is presented because, just as mentioned in my gotchas page, forewarned is forearmed! How you will deal with the issues will depend on your environment.

You will be challenged to learn about business and feeder system changes that will affect the DW/DSS systems
You as the system developer would like to know of developments that will affect the DW/DSS systems in time to allow adequate time to assess what is impacted, make changes, test changes, etc. Of course this is no new concern to anyone doing systems maintenance. If you are responsible for a system being fed from, say, 10 sources, you may have much more exposure than you have with the typical transaction processing system. And though intelligent use of the data extraction, cleaning, and loading tools and the information catalogs can greatly ease the burden here, many changes will require a fair amount of effort. Also, keeping informed and assessing the impact of technically driven changes to the feeder systems may be more difficult than keeping track of the business driven changes. If your IS organization has change control meetings, it is a major mistake for a DW/DSS developer not to attend those meetings regularly.

You will have to figure out if, when, and how to purge data
There comes a point when it does not make business sense to hold certain data in the warehousing system. This usually comes sooner than you expect. Either you are at some type of capacity limit or more likely, you are restructuring data and it is not worth the effort to restructure certain data. When you are at this point you may realize that the DW/DSS system has become a breeding ground for corporate information pack rats ("Why just last week ______ asked for an analysis going back to 1956!"). Before you get into a discussion about purging data, one piece of advice is to learn about less expensive, alternative means of storage.

You will have to determine which queries and reports should be IS written and which should be user written
Probably when you got started into this area you had an idea about who would be doing what. Anyway, after you have been in production a while you have seen how reality has differed from your expectations. A very common IS expectation is that the end users will take over the overwhelming majority of query and report writing duties. And an all too common reality is that IS ends up taking over almost all the query and report writing or IS writes some semi-canned queries and the potential of the system for answering ad hoc questions never gets fully realized. You may have a challenge on two fronts. You may have to push the end users into "deep water". You may also have to convince your IS staff that the report and query building tools are not "toys".

You will have to balance the need for building aggregate structures for processing efficiency with the desire not to build a maintenance nightmare
Many DW/DSS systems involve building structures to contain aggregated information. These "structures" can be many things - separate tables in relational systems, dimensions in the OLAP world, etc. And if you are like most DW/DSS developers, after a while you will see countless ways to add or refine these aggregate structures usually in the name of reducing end user retrieval time. The issue you face is balancing your desire to speed things up with the need to be careful with how much of a maintenance burden you want to take on.

You will find endless opportunities to tune DW/DSS system databases
I once saw a quote from the director of IS of a well-known retailing business who said that the biggest data warehousing lesson he learned is "there aren't many data warehousing experts out there". It is unlikely that an "expert" can foresee all the problems. And many of the problems are so crazy that the only way you are going to solve them is on a trial-and-error basis. If you are allowing a fair degree of end user developed access to systems and your systems are large and complex, you will discover that there are myriad ways to drag the systems down to a crawl. By the way, you may have sold the DW concept as a way that "killer queries" will not drag down your "production" systems. Now that you've put in a data warehousing system, you will find out that the users are just as dependent on the data warehousing systems for recurring needs as they are on the so-called production systems and killer queries hurt wherever they occur.

You will be motivated to store data in the data warehouse "for data's sake"
You and/or the users of the system will see "holes" in the data you store in the data warehouse. Mainly for the sake of completeness, you will be tempted to add this data. Unfortunately, when you have yielded to this temptation several times, you will find you have exploded the size and complexity of your data warehouse without proper consideration of whether the incremental size and complexity had business worth.
There are two aspects of this burden. First, you have to consider developer time. Secondly, you have to consider the amount of time it takes to update your systems on a recurring basis.

You will be uncertain which tools are most appropriate for a certain task
DW/DSS systems present IS with yet another set of tools with overlapping uses. You will find that it is not clear what is the best tool for many applications. For instance, at a technical level, if you have invested in relational and multidimensional database technology, you will find that for many applications, it is a toss-up as to which database technology will do the job better. Many organizations also have a heavy duty tool and a more lightweight tool that have similar ends. You will come across many situations where it is not clear whether to go heavy duty or lightweight.

You will be uncertain whether to create certain reports/queries in the data warehousing system or in the "feeder" transaction processing system
You are best advised to have some guidelines as to what goes where. If not, you may eventually find that you have almost a clone of your transaction processing system in your data warehousing system.

You will have to figure out how to test the effect of structure changes on end user written queries and reports
After a while you are going to make some database structure changes that may affect the reports and queries that your end users have written. In order that the need to re-test their work does not come as too bad a surprise to your end users, may I suggest that you get them into good housekeeping habits early on. This means, for example, not keeping their work in 10 different directories and storing descriptions of their work.

You will have to determine how problems with feeder system update processing affect DW/DSS system update processing
Again, as in the last point, if you have 10 systems feeding your data warehouse, you are going to have to develop an appreciation of what to do when there is a processing problem with one or several of those feeder systems. At the simplest level, this means determining if and when you will process updates to the data warehousing system. At a more difficult level, this means determining if and how to process partial updates to the warehousing system. The dependencies in DW/DSS update processing can get quite complex. Do take the time to understand these dependencies, especially if you do not have the most well-behaved feeder systems.

You will be pressured to implement a means to interactively correct data in the data warehouse (and perhaps send back corrections to the transaction processing system)
And you thought your data warehouse was read-only! I am not saying this is necessarily bad. Though, you have to be careful you are not setting yourself up to building a clone of a dysfunctional transaction processing system.

You will find that maintaining a data warehouse architecture may be much harder than establishing the architecture
By architecture, I refer to consistent use of dimensions, definitions of derived data, attribute names, and data sources for specific information. Unless there is someone with responsibility to keep his eye on subsequent data warehouse development, it is easy to quickly lose the benefits of the hard work it usually takes to establish the architecture. By the way, the person keeping his eye on this development must:
1) Have some judgment - your expectations of what should remain consistent will change over time
2) Be able to work in a persuasive, not coercive manner - data warehouse developers especially resent "architecture police".

You will find that the business changes the meanings of attributes over time and that these changes can be overlooked
For example, say that you work for a fruit distribution company. Perhaps it has a policy of using category code "100" for sales of apples and oranges. If the company suddenly starts using code "150" for oranges, there now is a question of how, well, apples to apples and oranges to oranges comparison should be made for historical purposes. Often there is no "right" way to handle these issues that come up in historical comparisons, though your dimension table change capture mechanism may handle the change (I hope you know about slowly changing dimensions). You do, though, have to do your best so you know there is an issue.

You will have to keep reconciling feeder systems with the DW/DSS systems
After things are going smoothly for a while, sometimes there is a tendency to be slack in whatever process you have implemented to reconcile systems. Also, if you have end users reconcile information, you may find that it is an ongoing discussion as to how to handle responsibility for regular reconciliation.

You will have to rework how you have implemented security
Most firms, if their data warehousing systems are used for ad hoc reporting, will find their security schemes are either too loose or too tight. You will find that assigning security is a balancing act. You want to minimize security breaches but on the other hand you do not want to minimize the chance of a user discovering some useful business insight as a result of his examining something that someone else might have thought was beyond the scope of his everyday concerns.

You will have to perform euthanasia on some DW/DSS systems
DW/DSS systems tend to be changed frequently. They experience entropy much more quickly than, say, general ledger systems. If your firm is used to keeping and patching a system for as long as you keep a refrigerator (and these days there are firms like that dipping their feet in DW/DSS for the first time), you may be in for a surprise.

You will find it is far more expensive (and complex) to maintain a data warehouse than to build one
Hope you got that point by now!

What Decision Support Tools are Used For
In the section on the "dirty little secrets of data warehousing" in her fascinating book "e-Data", Jill Dyché notes many IT departments don't really know how the business is using its data warehouse. It is not necessarily bad, though, if IT does not know all the specific uses. Sometimes the sign of a great warehouse is that the users "run with it" on their own. Nevertheless, it is possible to get a general idea just what the decision support (a.k.a. business intelligence) tools used to access a data warehouse are being used for. Perhaps data warehouse support people can do a better job if they have a better feel for what the tools are really being used for. In this essay, I will attempt to make a general statement about use of these tools.

The main uses of decision support tools are:

To check that "everything" is okay
Surprise! Nothing will be done with many, perhaps most, of the queries and reports created with decision support tools. They are run to confirm a person's usually not crisply defined notion but intuitively felt notion of "okayness". If I were able to write the essay on "The Zen of Data Warehousing" (which I will not), I would say a primary function of decision support tools is to support non-action.

To compare the same type of information in different time periods
This is simply the usual daily, weekly, monthly, quarterly, yearly comparisons.

To compare information about customers, products, cost/profit centers, financial accounts
Sometimes this is side by side comparisons of a series of measures. Sometimes this is identification of the most, the least, the earliest, the latest, etc.

To confirm the "obvious"
Most end users the reports and queries are ultimately being produced for have a pretty good gut feel for what is going on in their area of concern. Decision support tools do not tell these people anything amazing that the people don't already suspect. But the information produced with the tools gives them confidence their gut feel is okay.

To convey information in a more digestible manner
These tools are often used to convey what a person or persons already know. These knowing people use the tools simply to present information to other people in a way that it is more easily read.

To figure out how something "works"
Most people are not looking for some grand Unified Theory of how firm XYZ works. Rather, they want to understand some small aspect of an operation like Customer A always pays on time, Customer B usually pays late and still takes the early payment discount, etc.
To check performance versus formal and informal goals or constraints
That is, measures of what actually occurred are compared with budgets, quotas, forecasts, or some other types of goals.

To identify the out of the ordinary
Usually the ultimate consumer of the tool's output has somewhat vague criteria of what is out of the ordinary. The decision support tools kind of do double duty in that they help refine the criteria of what is out of the ordinary and identify what fit the refined criteria of out of ordinariness.

To confirm and sometimes to discover trends and relationships
With all respect to the people working hard on data mining, I think that most good businesspeople have an intuitive feeling of the most important trends and relationships between factors that are affecting their business. The decision support tools perform the function of confirming their intuition. Yes, the tools also can help discover trends and relationships but it is difficult (though potentially profitable) to sift out the meaningless and spurious trends.

To grab a little piece of information out of a large volume of information
These tools make picking that virtual needle out of that virtual haystack a lot simpler.

To help advocate a position
These tools are not just for "objective" presentation of the facts. Often they are cleverly used to help bolster the case for doing (or not doing) something.

To provide a report "of record"
For all kinds of reasons it is often necessary for people to agree that "these are the numbers". Note they do not have to agree on all the data - just some data whose credibility must be accepted for actions to be taken. Decision support tools often are used to produce this "official" information.

To get around an Information Technology department that does not have the time or the resources to write reports
Often end users use these tools out of impatience with the IT department. Or, the IT department gives the user these tools to relieve the pressure off of itself. The end users in these cases often write reports that could hardly be called analyses.

To provide data for a what if analysis or a forecast
That is, the tools are used to feed data into a spreadsheet where the actual what-if analysis or forecast will be done. The tools can do some of the what-if-ing and forecasting themselves but most business users are more comfortable doing this work in spreadsheets.

To repeat points I have made in other essays, despite their name most of these tools are not used as the sole input into making a non-trivial decision. Nor do they directly supply what I would consider to be business intelligence. Decisions are made and business intelligence is garnered only with the combination of the output of the decision support tools, human judgment and intuition, and the ability to put the information spit out by tools into a context of information that is much wider than any data warehouse, transaction processing system, knowledge repository can handle.

Is Web Data Analysis (i.e., Web Mining) Different?
The topic of analyzing web data (also referred to as clickstream data) is one of the more discussed topics in the niche of data warehousing/decision support. Though there has been some intelligent writing on the topic, most of what is written seems to be the same unquestioning praise of supposedly revolutionary changes that analyzing this data is going to bring about. In this essay I would like to challenge some of the usual industry hyperbole. This essay is not meant to be a how-to primer but rather to raise some questions in the mind of the reader.

Web data are the record of what actions a user takes with his mouse and keyboard while visiting a site
That is all it is. It is not that mysterious. In fact, if data could be characterized as mundane, web data would have to rank among the most mundane.

Web data are just another source of data - with its own quirks and with limitations that come with all other sources of data
If you have worked with a variety of other data sources, you probably know much of what you need to know about working with web data. Yes, web data have quirks but what data (especially data as detailed as raw web data) do not have quirks.

The primary beneficiaries of web data analysis are web designers
Not many bet-your-company (and bet-your-career) decisions are going to be made with the results of web data analysis. Mostly it will be used for making many little decisions about how to modify the design of a web site. On the other hand, if your company is betting its continuance on smart use of its web site (and, except for the dot-coms, not many companies fall into that category), the cumulative effect of these little decisions may be company and career endangering.

The businesspeople will want and benefit most from highly aggregated web data that are usually combined with non-web data
Most web data has far more detail than the usual marketing or financial person wants to see. And these people think in terms of relative performance of "channels", most of which, for non dot-com companies, are not web based.

The person who is going to get the most insight from web data is the person who understands designing web sites so they are used profitably and who understands the power of data analysis
These people are hard to find! Sorry about the stereotypes but, at least in my limited exposure to good web designers and people who may not be hands-on designers but do have a good feel for the power of a web site, they are very different people from the financial and marketing analysts that data warehousing/decision support developers are used to working with. Most students of effective good web design do not strike me as people who want to sit down with a query/report tool or OLAP tool and refine some analysis for three hours.

Often web data analysis yields conclusions that would be immediately obvious to a good web designer
Web data analysis can serve as a very expensive substitute for a good web designer. On the other hand, sometimes web data analysis can be an inexpensive substitute for a very expensive web designer.

You can deliver "real-time" access to web data but your users will not be able to analyze it in real time
I read the pundits who say now you have got to go out and build usually expensive means to let users analyze web data generated up to the last millisecond. I don't know who the pundits work with but most people I have encountered who analyze data are not polymaths who can, on a recurring hourly basis, disgorge meaningful analyses.

The value of detailed web data declines pretty fast over time
Though many data warehousing implementers won't admit it, most data loses value over time. (If you want to be a little more academic, the expected value of the data declines over time.) Because web sites change so much, the value of the web data declines quickly. Imagine doing a traditional cost center spending analysis. Now imagine what would happen if the cost centers and their reporting hierarchy would change every day. This is kind of what it is like to analyze some web data.

In the same vein, the value of old detailed web data is dubious
I have read the publications predicting petabyte sized warehouses of months and even years of web data. What I have not read, though, is what people will do with older web data. Probably any web site that generates that much detailed data changes so often that, except at a very aggregated level, it is hard and perhaps meaningless to compare older data with newer data.
Web data is far "dirtier" than the usual data warehouse data
Web data often present problems with identifying web site users, identifying what was viewed, identifying the sequence of user activity on a web site, and identifying when the user started and stopped looking at a web site. Data may have gaps or data may be suspect. Many of these problems are not solvable given the design goals of a web site.

If session data are culled from multiple servers, you probably have a unique problem
If the servers' clocks are not exactly (!!) in sync, you are going to have a hard time tracing user activity.

If your site generates pages dynamically, you may have to write your own system to track the dynamic content
This information also has to be correlated with the log file analysis. If a page consists of multiple dynamically generated areas, then you have a more complicated problem.

Web data relies on some pretty fuzzy categorization
All you may know about the web site user is (what you think are) the sequence of his clicks. To make this data sensible, you may have to categorize the pages on the web site. Also, you may have to categorize users by their clicking sequences. These categorizations can get pretty fuzzy. By that, I mean there may be many, many ways to categorize with no compelling reason to use one categorization method over another. Also, though it is not exactly categorization, you have to define a "session": when a user started and stopped accessing a web site. The definition of a session can be arbitrary.

Web data issues make it harder to do the manual judgment tasks needed to use data mining tools to separate useful information from gibberish
By now there is awareness that a great deal of judgment that can only be provided by a human being is needed for most data mining work. As you can imagine, all the problems with web data make it harder to do these judgment tasks that no software can do.

Often cursory analysis of web data produces most of the value that can be gained from analyzing the data
Or, in more academic terms, the marginal value of additional analysis may drop pretty rapidly. The data may be so dirty and so fuzzy that analyzing it further may not be worth it.

Web data by itself do not give you much information about the web site user
Unless the web site user has bought something from the site, you know very little about the site user.
(I read that most registration information, if given, is false.) And even if a site user has bought something, you get a little information that is usually not great. In actuality, you need to combine the web data with data from internal and external (like Equifax, etc.) non-Web data to learn something about the web site user.

Web data do not give you that much information about why a person does not become a customer
When you read that web data is supposed to help you find why a person did not become a customer, you find you do this by analyzing the clicks of a customer who left the site without buying. That is, the last page a person clicked on is supposed to be important to analyze. Remember, usually the only thing you know about the non-customer is his clicking pattern. Analysis of clicking patterns, as mentioned before, can be quite moot.

Some marketing writers have questioned the effectiveness of the extremely targeted marketing some firms attempt via web data analysis
Though I make no claim to be a marketing expert, some of the supposed experts whose publications I have read have questioned the effectiveness of finely segmenting markets (which at its most extreme is segmenting markets to one person). They say that at some point in segmenting a market it is actually possible to get negative marginal returns. I interpret their writings to mean that marketers have to be humble about their understanding of consumer behavior. Also, though it seems counterintuitive, much more can be effectively acted upon by observation of group behavior rather than by observation of individual behavior.

This essay is not meant to dissuade anyone from analyzing web data. Web data analysis can be extremely profitable. But like all other applications of data warehousing/decision support, web data analysis has to be done intelligently. We have to know who are our real users, honestly acknowledge the data problems we cannot solve or can partially solve, and make our decisions on how much we want to analyze with an eye to expected marginal benefits versus marginal expected costs.
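
For readers who want to see how arbitrary the definition of a "session" can be, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the record layout (user id, timestamp, page), the function name sessionize, and the 30-minute inactivity cutoff are all hypothetical choices made for this example, not anything prescribed in this essay or by any particular log analysis tool. The sketch starts a new session whenever the gap between two clicks by the same user exceeds the cutoff.

```python
from datetime import datetime, timedelta

# Assumed cutoff: 30 minutes of inactivity ends a session (an arbitrary choice).
SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(clicks, timeout=SESSION_TIMEOUT):
    """Group (user_id, timestamp, page) records into sessions.

    A new session starts when a user's gap since the previous click
    exceeds `timeout`. Returns a list of dicts, one per session.
    """
    sessions = []
    last_seen = {}  # user_id -> (timestamp of previous click, index of open session)
    for user, ts, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        prev = last_seen.get(user)
        if prev is None or ts - prev[0] > timeout:
            # First click for this user, or the gap was too long: open a session.
            sessions.append({"user": user, "start": ts, "end": ts, "pages": [page]})
            idx = len(sessions) - 1
        else:
            # Gap within the cutoff: extend the user's open session.
            idx = prev[1]
            sessions[idx]["end"] = ts
            sessions[idx]["pages"].append(page)
        last_seen[user] = (ts, idx)
    return sessions


# Hypothetical clickstream for one visitor (made-up data for illustration).
example_clicks = [
    ("u1", datetime(2001, 1, 1, 9, 0), "/home"),
    ("u1", datetime(2001, 1, 1, 9, 5), "/products"),
    ("u1", datetime(2001, 1, 1, 9, 25), "/cart"),
    ("u1", datetime(2001, 1, 1, 10, 30), "/home"),
]

if __name__ == "__main__":
    for cutoff in (timedelta(minutes=30), timedelta(minutes=10)):
        print(cutoff, len(sessionize(example_clicks, cutoff)), "sessions")
```

Run against the same four clicks, a 30-minute cutoff and a 10-minute cutoff give different session counts, and nothing in the data says which cutoff is "right". The sketch also assumes a single, trustworthy clock; if the clicks came from several servers whose clocks disagree, the sort order itself becomes suspect.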