CERTIFICATE

This is to certify that the seminar report entitled “DATA WAREHOUSING” is a genuine work submitted by Shruti Vaish and is duly accepted in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology, as per the specification of the Board of Technical Education, for the session 2010-2011. This is an original work of the student carried out under my supervision and guidance.

Seminar Guide: Amit Kumar Kar, CS/IT Department

Head of Department: Rajiv Ranjan Tripathi, CS/IT Department

ACKNOWLEDGEMENT

No individual effort, big or small, can ever be realized without the collective effort of friends and guides. I consider myself exceptionally fortunate that I had an indulgent guide, learned professors and caring friends to steer me successfully through one of the most challenging assignments of my academic career. Today, when my endeavour has reached its fruition, I look back in mute gratitude to one and all without whose help, I am sure, reality would have remained a dream. On the way to the completion of my seminar a lot of difficulties were faced, but I was always motivated by the proverb: “Where there is a will, there’s a way”. Above all, it was through the guidance and support of my seminar guide Mr. Amit Kumar Kar and coordinators Mr. Shrawan Kumar Pandey and Mrs. Aradhana Shukla that I could prepare myself for my presentation. So, it is my pleasure to present my cordial thanks to them. Finally, I thank God for making this mortal venture possible.

SHRUTI VAISH
B.Tech Computer Science & Engg
Third Year

ABSTRACT

Data warehouses and on-line analytical processing (OLAP) tools have become essential elements of decision support systems. Traditionally, data warehouses are refreshed periodically (for example, nightly) by extracting, transforming, cleaning and consolidating data from several operational data sources. The data in the warehouse is then used to periodically generate reports, or to rebuild multidimensional (data cube) views of the data for on-line querying and analysis. Increasingly, however, we are seeing business intelligence applications in telecommunications, electronic commerce, and other industries, that are characterized by very high data volumes and data flow rates, and that require continuous analysis and mining of the data. For such applications, rather different data warehousing and on-line analysis architectures are required.

In this paper, we first motivate the need for a new architecture by summarizing the requirements of these applications. Then, we describe a few approaches that are being developed, including virtual data warehouses or enterprise portals that support access through views or links directly to the operational data sources. We discuss the relative merits of these approaches.

In this architecture, data flows continuously into a data warehouse, and is staged into one or more OLAP tools that are used as computation engines to continuously and incrementally build summary data cubes, which might then be stored back in the data warehouse. Data at different levels of aggregation may have different life spans depending on how they are to be used for downstream analysis and data mining. Retirement policies define when to discard data from the warehouse (i.e., move data from the warehouse into off-line archival storage). The key features of the architecture are the following: incremental data reduction using OLAP engines to generate summaries and enable data mining; staging large volumes and flow rates of data with different life spans at different levels of aggregation; and scheduling operations on data depending on the type of processing to be performed and the age of the data.


INTRODUCTION

It has become a strategic imperative for corporations to know more about their customers and prospects than ever before. Corporations are competing in a world that is moving faster, and in more directions, than at any time in history. The need for information is growing at an increasing rate; thus, the more we know, the more we need to know, and be able to make meaningful analysis of it. Where once it was the job of the information technologists to study customer data, nowadays even the president of the company may need to sieve through the databases of the corporation to retrieve clues for better marketplace performance. (Rolleigh and Thomas, 2002)

The importance of data warehousing in the commercial segment arises from the need for enterprises to gather all of their information into a single place for in-depth analysis, and the desire to decouple such analysis from online transaction processing systems. (Widom, 1995) But this isn't an easy endeavor. Before people can get ready access to data, a lot of behind-the-scenes groundwork must be done. And then the warehouse has to be tailored to the specific requirements of the company; this means a lot of careful, tedious and time-consuming steps are required. It is unwise to just go out and have your information technology organization buy a data warehouse. If any corporation does take such an action it will undoubtedly join the ranks of many big-name corporations that have made huge investments that have failed to provide any return on investment (ROI). If a company does indeed want to succeed with data warehousing, it has to build cross-organizational consensus and support for a way of business that is empowered by real customer data. (Rolleigh and Thomas, 2002)

Let's look at what goes into creating a rich data warehouse, and what we need to know about it. This paper will first introduce the concept of a data warehouse in a simple, straightforward manner, followed by the major components of a data warehouse and the various structures of a data warehouse. The paper will then present the data warehousing methodologies. Finally, it will discuss the advantages and disadvantages of data warehousing, ending with the conclusion.

WHAT IS A DATA WAREHOUSE?

Data warehousing is a concept. It is a set of hardware and software components that can be used to better analyze the massive amounts of data that companies are accumulating, in order to make better business decisions. (Davis et al, 1999) Bill Inmon, widely considered the 'father' of data warehousing, describes it as: "A subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making process". This implies that within a warehouse the data will be organized by entity (such as customer or product) rather than by application (sales or purchase), that both current and historical time-stamped data will be present, and that once stored in the warehouse the data cannot be changed (Figure 1).

Figure 1
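The properties in Inmon's definition can be illustrated with a small sketch. This is only an illustration under stated assumptions: the `CustomerFact` and `Warehouse` names, the fields, and the sample values are hypothetical and do not come from any warehousing product. Records are keyed by subject (a customer), carry a time stamp, and are appended rather than updated.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a stored record cannot be modified (non-volatile)
class CustomerFact:
    customer_id: int  # subject-oriented: keyed by entity, not by application
    balance: float
    as_of: date       # time-variant: every record carries a time stamp


class Warehouse:
    """Hypothetical append-only store; the names are illustrative only."""

    def __init__(self):
        self._facts = []

    def load(self, fact):
        # New snapshots are appended; existing ones are never updated.
        self._facts.append(fact)

    def history(self, customer_id):
        # Current and historical records for one subject, oldest first.
        rows = [f for f in self._facts if f.customer_id == customer_id]
        return sorted(rows, key=lambda f: f.as_of)


wh = Warehouse()
wh.load(CustomerFact(1, 100.0, date(2010, 1, 31)))
wh.load(CustomerFact(1, 250.0, date(2010, 2, 28)))
history = wh.history(1)
```

Keeping every snapshot, instead of overwriting the balance, is what allows the same store to answer both "what is the value now" and "what was it at the end of January".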

However, a simple definition of a data warehouse, as Ralph Kimball puts it, is that “a data warehouse is a copy of transaction data specifically structured for querying and analysis.” (Kimball, 1996) But this definition does not encompass the entirety of data warehousing. Sometimes nontransaction data are stored in a data warehouse, though probably 95-99% of the data usually are transaction data. Additionally, "querying and reporting" rather than "query and analysis" is key when talking about the functionality of data warehousing, because the main outputs from data warehouse systems are either tabular listings (queries) with minimal formatting or highly formatted "formal" reports. Queries and reports generated from data stored in a data warehouse may or may not be used for analysis. (Greenfield)

Nevertheless, data warehousing doesn't just make data available. Data warehousing is the process of making your operational data available to your business managers and decision support applications. Proper warehousing focuses on efficient information access. Of course, this efficiency doesn't happen magically. Corporations must first identify what it is that they require from the data and the decision support applications, and then they must evaluate the current operational data to determine how to transform that data into what adds value to the output provided by the corporations. The tools that you choose for your warehousing solution will take data from your operational systems (extract it), convert your operational data into business information using your defined business rules (transform it), and create a data warehouse (load it). (www.sas.com/rnd/warehousing)

Finally, we can explain a data warehouse as being analogous to a physical warehouse. Operational systems create data "parts" that are loaded into the warehouse. Some of those parts are summarized into information "components" and stored in the warehouse. Data Warehouse users make requests and are delivered information "products" that are created from the components and parts stored in the warehouse.
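The extract, transform and load steps mentioned in this section can be sketched in a few lines. This is a minimal sketch, not a description of any vendor's tool: the `operational_rows` data, the `sales` table and the business rules are all assumptions made for illustration, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical operational records; the field names are illustrative.
operational_rows = [
    {"order_id": 1, "amount": "19.99", "region": "north "},
    {"order_id": 2, "amount": "5.00", "region": "SOUTH"},
]


def extract(rows):
    # Extract: read the records out of the operational system.
    return list(rows)


def transform(rows):
    # Transform: apply business rules (numeric amounts, tidy region names).
    return [
        (r["order_id"], float(r["amount"]), r["region"].strip().title())
        for r in rows
    ]


def load(conn, rows):
    # Load: write the converted records into the warehouse table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER, amount REAL, region TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(operational_rows)))
loaded = conn.execute(
    "SELECT order_id, amount, region FROM sales ORDER BY order_id"
).fetchall()
```

Keeping the three steps as separate functions mirrors the way the business rules (the transform step) can change without touching how data is read or stored.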

A Data Warehouse is typically a blending of technologies, including relational and multidimensional databases, client/server architecture, extraction/transformation programs, graphical user interfaces, and more. Data warehousing is one of the hottest industry trends, for good reason. A well-defined and properly implemented data warehouse can be a valuable competitive tool. (Perkins)

THE COMPONENTS OF A DATA WAREHOUSE

The following describes the components of a data warehouse (Figure 2).

Figure 2

Summarized Data :- There are two kinds of summarized data: lightly summarized data and highly summarized data. Lightly summarized data are the hallmark of a Data Warehouse. All departments in a corporation do not have the same information requirements, so effective Data Warehouse design provides for customized, lightly summarized data for every department. Highly summarized data are primarily for the executives. Highly summarized data can come from either the lightly summarized data used by enterprise elements or from current detail. If executives require more detailed information they have the capability of accessing increasing levels of detail through a "drill down" process.
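The relationship between current detail, lightly summarized data, highly summarized data and drill-down can be sketched as follows. The rows, the department and month fields, and the function names are hypothetical; the sketch only assumes that a light summary is a per-department total and a high summary is an enterprise-wide total built from it.

```python
from collections import defaultdict

# Hypothetical current-detail rows: (department, month, sale amount).
detail = [
    ("toys", "2010-01", 120.0),
    ("toys", "2010-02", 80.0),
    ("books", "2010-01", 50.0),
    ("books", "2010-02", 70.0),
]


def lightly_summarize(rows):
    # One total per department and month.
    totals = defaultdict(float)
    for dept, month, amount in rows:
        totals[(dept, month)] += amount
    return dict(totals)


def highly_summarize(light):
    # One enterprise-wide total per month, built from the light summary.
    totals = defaultdict(float)
    for (_dept, month), amount in light.items():
        totals[month] += amount
    return dict(totals)


def drill_down(rows, month):
    # Return the detail rows that lie behind a monthly figure.
    return [r for r in rows if r[1] == month]


light = lightly_summarize(detail)
high = highly_summarize(light)
january_detail = drill_down(detail, "2010-01")
```

An executive looking at the January figure in `high` could call `drill_down` to see the departmental rows that produced it.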

Current Detail :- The heart of a Data Warehouse is its current detail, where the bulk of data resides. Current detail comes directly from operational systems and may be stored as raw data or as aggregations of raw data. Every data entity in current detail is a snapshot, at a moment in time, representing the instance when the data are accurate. Current detail is typically two to five years old. Current detail refreshment occurs as frequently as necessary to support enterprise requirements.

System of Record :- A system of record is the source of the data that feed the data warehouse. Data in a data warehouse differ from operational systems data in that they can only be read, not modified. Thus, it is necessary that a data warehouse be populated with the highest quality data available, i.e., data that are most timely, complete, accurate, and have the best structural conformance to the data warehouse.

Integration and Transformation Programs :- Even the highest quality operational data cannot usually be copied, as is, into a data warehouse. As operational data items pass from their systems of record to a data warehouse, integration and transformation programs convert them from application-specific data into enterprise data. These integration and transformation programs perform functions such as:
• Reformatting, recalculating, or modifying key structures.
• Adding time elements.
• Identifying default values.
• Supplying logic to choose between multiple data sources.
• Summarizing, tallying, and merging data from multiple sources.
When either operational or Data Warehouse environments change, integration and transformation programs are modified to reflect that change.
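The functions of an integration and transformation program can be made concrete with a small sketch. Everything here is assumed for illustration: the two source systems (`billing` and `crm`), the key formats, and the rule that the CRM name wins are hypothetical, not taken from the report's sources.

```python
from datetime import date

# Hypothetical records for the same customer from two operational systems.
billing = {"cust": "C-001", "name": "ACME LTD", "country": None}
crm = {"cust": "1", "name": "Acme Ltd.", "country": "IN"}


def to_enterprise_key(raw):
    # Reformat application-specific keys into one enterprise key structure.
    digits = "".join(ch for ch in raw if ch.isdigit())
    return int(digits)


def integrate(billing_rec, crm_rec, load_date):
    return {
        "customer_key": to_enterprise_key(billing_rec["cust"]),
        # Logic to choose between sources: prefer CRM (an assumed rule).
        "name": crm_rec["name"] or billing_rec["name"],
        # Default value when neither source supplies a country.
        "country": billing_rec["country"] or crm_rec["country"] or "UNKNOWN",
        # Time element added on the way into the warehouse.
        "snapshot_date": load_date,
    }


record = integrate(billing, crm, date(2010, 12, 31))
```

If either source system changes its key format, only `to_enterprise_key` needs to be modified, which matches the point that these programs are updated whenever the operational environment changes.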

Archives :- Data Warehouse archives contain old data (normally over two years old) of significant, continuing interest and value to the enterprise. There is usually a massive amount of data stored in the Data Warehouse archives, with a low incidence of access. Archive data are most often used for forecasting and trend analysis. Archives include not only old data (in raw or summarized form), they also include the metadata that describes the old data's characteristics.

Metadata :- One of the most important parts of a Data Warehouse is its metadata, or data about data. Also called Data Warehouse architecture, metadata is integral to all levels of the Data Warehouse, but exists and functions in a different dimension from other warehouse data. Metadata that is used by Data Warehouse developers to manage and control Data Warehouse creation and maintenance resides outside the Data Warehouse. (Perkins)

Along with the various components of a data warehouse there are various structures of a data warehouse too.

STRUCTURES OF A DATA WAREHOUSE

There are various structures of a data warehouse that a corporation can adopt based on its needs: the physical data warehouse, the logical data warehouse and the data mart.

Figure 3: Structures of a data warehouse

Physical Data Warehouse :- is a physical database in which all the data for the data warehouse are stored, along with metadata and processing logic for scrubbing, organizing, packaging and processing the detail data.

Logical Data Warehouse :- like the physical data warehouse, also contains metadata, including enterprise rules and processing logic for scrubbing, organizing, packaging and processing the data, but does not contain actual data. Instead, it contains the information necessary to access the data wherever they reside. This structure is effective only when there is a single source for the data and they are known to be accurate and timely.

Data Mart :- is a subset of an enterprise-wide data warehouse, which typically supports an enterprise element (department, region, function, etc.). As part of an iterative data warehouse development process, an enterprise builds a series of physical (or logical) data marts over time and links them via an enterprise-wide logical data warehouse or feeds them from a single physical warehouse. (Perkins)
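The difference between a logical data warehouse and a data mart can be sketched as below. This is a simplified illustration under assumptions: the `sources` dictionary stands in for operational databases, and the `LogicalWarehouse` class and `build_data_mart` function are hypothetical names, not part of any product.

```python
# Hypothetical source systems, standing in for operational databases.
sources = {
    "billing": [
        {"region": "north", "amount": 10.0},
        {"region": "south", "amount": 4.0},
    ],
    "orders": [{"region": "north", "amount": 7.5}],
}


class LogicalWarehouse:
    """Holds metadata about where data lives, but no data of its own."""

    def __init__(self, catalog):
        self.catalog = catalog  # subject name -> list of source names

    def query(self, subject):
        # Resolve the subject through metadata and read from the sources.
        rows = []
        for source_name in self.catalog[subject]:
            rows.extend(sources[source_name])
        return rows


def build_data_mart(rows, region):
    # A data mart: a physical copy of the subset one region needs.
    return [dict(r) for r in rows if r["region"] == region]


logical = LogicalWarehouse({"sales": ["billing", "orders"]})
north_mart = build_data_mart(logical.query("sales"), "north")
```

Here the logical warehouse stores only the catalog, while the data mart holds its own copy of the rows, so the mart keeps working even if a source is temporarily unavailable.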

DATA WAREHOUSING METHODS

Several warehousing methodologies are used throughout the warehousing community. All of these fall into one of two categories: the big bang approach or the iterative approach.

Big Bang Approach

A big bang methodology tries to solve all known problems by creating a huge data warehouse before you release it for evaluation and testing. Many people believe that this process is necessary to deliver on your objectives. Based on your objectives, the amount of data to be incorporated, and your intimate knowledge of your business and data, you may be able to accomplish your warehousing project with a big bang methodology. But there are some considerations to take heed of. To create a data warehouse, the corporation must plan its warehouse, collect business requirements, evaluate and install the necessary software and hardware, and become familiar with its corporate data. While these tasks are taking place:
• The business goals of the corporation can change due to changes in the market or technology.
• Management supporters can lose interest in this project if you don't keep them involved and show rapid results.
• The corporate data could change (they may start collecting Web log data).
• New releases of the firm’s chosen software may become available (warehousing is still an evolving market and even the best tools continue to improve and change).

The items listed above are just a few of the business and technical changes that could impact your plans. If you are not plugged into the proper channels, any one of these changes could cause your project to fail because you cannot quickly respond to the necessary changes.

Iterative Approach

With an iterative methodology, the corporation breaks its warehousing project into small, manageable chunks, referred to as projects. In the iterative approach, the same planning tasks are performed that are required in the big bang approach, but evaluation of all of your deliverables up front is not required. The corporation must design its overall architecture, but when entering the planning phase, it needs to concentrate only on its first project or iteration. After each project, a review of the architecture, its development process, and the corporation’s business requirements is done.

The value of smaller projects within the larger warehousing process is:
• It shows a faster return on the company investment because it delivers one solution quickly. This keeps the management supporters involved and interested in the project.
• It can adjust to changes in the business requirements faster because the team is small and works on manageable projects that have short delivery schedules.
• Early involvement by the corporation’s user community provides real-situation testing, which provides it with user needs and defect reports. This feedback can improve the corporation’s goals and processes for the next iteration. Additionally, users are provided with better feedback when they can see the system than when they have to envision it from a slide presentation.

Although the business may ask for everything to be delivered by the warehouse at once, taking a "big bang" approach may not be a prudent step. Instead, breaking the project into parts or releases, and establishing a clear set of objectives and definition of success for each release of the project, will be appropriate. Each release will need to go through the complete system development lifecycle of requirements, design, build, test, implement, and support. By delivering functionality and business value with each short release, a corporation can integrate its data, create the proper roles, train its users, gain insight and assimilate lessons learned. Perhaps most important, an iterative approach enables the project team to demonstrate a few quick wins in terms of business value delivered. At the same time, by using an iterative approach a corporation can adjust its data warehouse's content to add new sources, which were not previously considered, as it gains insight into its business and analytics. Due to its iterative nature, a data warehouse is a journey, not a destination: there is always more data which can be integrated. (Karakizis)

As exhibited in Figure 4, the iteration methodology thus starts with the Initial Organization phase: identifying the corporation’s readiness for undertaking a data warehousing project, evaluating the feasibility of the data warehouse, and getting agreement on the goals and purpose of the warehouse. It is followed by the Analysis phase: gathering business requirements. Third is the Design phase: analyzing and designing the warehouse system architecture. It is followed by the Development phase: creation of the actual data warehouse structure and population of the data warehouse. Fifth is the Testing phase: data cleansing and parameter clarification (this may send the project back to the design phase for another iteration). It is followed by the Implementation phase: rolling out the production environment and providing user training. And finally comes the Maintenance phase: monitoring, updating and cleansing data. (Wierschem et al)

Figure 4: Initial Organization, Analysis, Design, Development, Testing, Implementation, Maintenance

At this point, provisions for an easy path for user feedback should be established, a review of the five steps must be done, and necessary adjustments should be made. After the review of the process, the next iteration or project should be started from the assessment phase. The assessment and requirements phases should require less process time after successive iterations.

GETTING THE RIGHT TOOLS TO BUILD A DATA WAREHOUSE

As in any endeavor, selecting the correct tools is paramount for success. As the old Chinese adage says, “To accomplish a goal, make sure the proper tools are selected.” Given the complexity of the data warehousing system and the cross-departmental implications of the project, it is easy to see why the proper selection of tools and personnel is very important.

This section of the paper will present information on such selections. There are two steps that top management is concerned with when building a data warehouse. One step is choosing a vendor. However, in doing so there are certain basic but nonetheless critical issues that have to be evaluated.

1. Support :- What type of support is offered? It is industry standard for vendors to charge an annual support fee that is 15-20% of the software product license. But the question is, will any software issues be handled promptly by the vendor or not?
2. Professional Services :- This encompasses consulting and education. What type of consulting proposal does the vendor give? Are the personnel requirements and consulting rates reasonable? It might be wise to speak with members of the consulting team before signing on the dotted line. On the education front, what type of training is available? And how much knowledge transfer is the consulting team willing to do? Does the consulting team purposely hold off information so that either 1) you will need to send more people to the vendor's education classes, or 2) you will need to hire additional consulting to make any changes to the system?
3. Stability :- More than anything else, this is probably the most important measure, for the simple reason that it questions whether the vendor is going to be around for a while or not, and whether it will be able to make enhancements to its tool. It may even be more important than the current functionalities that the tool itself provides.

The other step is selecting the right team to build the data warehouse. In this there are two possibilities: one is to use external consultants and the other, to hire permanent employees.

The pros of hiring external consultants are:
1. They are usually more experienced in data warehousing implementations. The fact of the matter is, even today, people with extensive data warehousing backgrounds are difficult to find. With that, when there is a need to ramp up a team quickly, the easiest route to go is to hire external consultants.

The pros of hiring permanent employees are:
1. They are less expensive. With hourly rates for experienced data warehousing professionals running from $100/hr and up, and even more for Big-5 or vendor consultants, hiring permanent employees is a much more economical option.
2. They are less likely to leave. With consultants, whether they are on contract, via a Big-5 firm, or one of the tool vendor firms, they are likely to leave at a moment's notice. This makes knowledge transfer very important. Of course, the flip side is that these consultants are much easier to get rid of, too.

However, management must understand that there are various entities that play important roles in a data warehouse project.
1. Project Manager: This person will oversee the progress and be responsible for the success of the data warehousing project.
2. DBA: This role is responsible for keeping the database running smoothly. Additional tasks for this role may be to plan and execute a backup/recovery plan, as well as performance tuning.

3. Technical Architect: This role is responsible for developing and implementing the overall technical architecture of the data warehouse, from the backend hardware/software to the client desktop configurations.
4. ETL Developer: This role is responsible for planning, developing, and deploying the extraction, transformation, and loading routine for the data warehouse.
5. Front End Developer: This person is responsible for developing the front-end, whether it be client-server or over the web.
6. OLAP (On-Line Analytical Processing) Developer: OLAP is the foundation for a range of essential business applications, including sales and marketing analysis, planning, budgeting, statutory consolidation, profitability analysis, balanced scorecard, performance measurement and data warehouse reporting. (Pendse and Creeth) The role of the OLAP Developer is thus very crucial. He/She is responsible for the development of OLAP cubes.
7. Trainer: A significant role is the trainer. After the data warehouse is implemented, a person on the data warehouse team needs to work with the end users to get them familiar with how the front end is set up so that the end users can get the most benefit out of the data warehouse system.
8. Data Modeler: This role is responsible for taking the data structure that exists in the enterprise and modeling it into a schema that is suitable for OLAP analysis.
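To give a rough idea of what the OLAP cubes mentioned above compute, the sketch below rolls a small fact table up along chosen dimensions. It is a toy illustration only: the `facts` rows, the three dimensions and the `roll_up` function are hypothetical, and real OLAP engines precompute and index such aggregates rather than scanning the rows each time.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, quarter, revenue).
facts = [
    ("phone", "east", "Q1", 100.0),
    ("phone", "west", "Q1", 60.0),
    ("modem", "east", "Q1", 40.0),
    ("phone", "east", "Q2", 90.0),
]
DIMENSIONS = ("product", "region", "quarter")


def roll_up(rows, keep):
    """Aggregate revenue over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    cube = defaultdict(float)
    for row in rows:
        cube[tuple(row[i] for i in idx)] += row[3]
    return dict(cube)


by_product = roll_up(facts, ["product"])
by_region_quarter = roll_up(facts, ["region", "quarter"])
grand_total = roll_up(facts, [])
```

Each call corresponds to one view of the cube; choosing the dimensions and the measure is the kind of decision the Data Modeler makes when designing a schema for OLAP analysis.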

ADVANTAGES AND DISADVANTAGES OF DATA WAREHOUSING

Data warehousing has been increasingly popular in many organizations around the world. It is not with blind belief that corporations are investing millions of dollars in data warehousing projects. Data warehousing makes retrieving information so easy that when a user query is submitted to the warehouse, the needed information is already there, with inconsistencies and differences already resolved. This makes it much easier and more efficient to run queries over data that originally came from different sources. (WHIPS)

Along with its numerous ease-of-use benefits, data warehousing provides other qualitative advantages too. A successful data warehouse project will provide numerous lasting benefits to a company. It enables improved knowledge of relationships among products and services and their performances. The ability to make quick and proper analyses that pave the way for better decision making can be gained from a successful data warehousing project and thus give the company a strong competitive advantage over the competition. (Smith) The benefits of the development of a data warehouse would include:
• More accurate predictions of customer demand based on the use of trends analysis.
• The response improvement in direct marketing campaigns through the use of household demographics and current customer analysis.
• The improvement in vendor relations and price reductions by targeting selected vendors with increased level of purchasing over the enterprise.
• The significant savings from improved data quality across the enterprise. (Smith)

• The ability to run complex queries easily and efficiently, since query execution does not involve data translation and communication with remote sources.
• Simplicity of the system design. For example, there is no need to perform query optimization over heterogeneous sources, a very difficult problem faced by other approaches.
• Convenience for end users, since they can use a single data model and query language.
• Information at the warehouse is under the control of the warehouse users, thus it can be stored safely and reliably for as long as necessary. (WHIPS)

From the management’s point of view the benefits and rewards are abounding for a company that builds and maintains a data warehouse correctly. The corporation will make dramatic cost savings and its revenues will soar. For example, the telecom industry uses data warehouses to target customers who may want certain phone services rather than doing "blanket" phone and mail campaigns and aggravating customers with unsolicited calls during dinner. Furthermore, there will be an increase in analysis of marketing databases to cross-sell products, less storage on the mainframe and the ability to identify and keep the most profitable customers while getting a better picture of who they are, and it's easy to see why data warehousing is spreading faster. Some of the soft benefits of data warehousing come in the technology's effect on users. When built and used correctly, a warehouse changes users' jobs, granting them faster access to more accurate data and allowing them to give better customer service. A company must not forget, however, that the goal for any data warehousing project is to lower operating costs and generate revenue; this is an investment, after all, and quantifiable ROI should be expected over time. (Wailgum, 2001)

Even though the benefits of data warehousing by far outweigh the disadvantages, there are certain disadvantages of data warehousing that companies must pay heed to. The most important disadvantages are:
• Expensive initial data warehouse set up. However, after the system is in place the cost should be low and cover only the maintenance and future modifications of the system. Also, there is a high cost in getting data translated and copied to existing databases in time to be useful for the end user. (OCS Consulting)
• Cost and time are also borne to develop the required new skill-set for warehouse developers and end users.
• A data warehouse is complex to develop; it cannot just be bought as an off-the-shelf product and is designed specifically for an organization's needs. The choice of hardware, software and structure requires careful consideration, as does how they will progressively work together in the future.
• A data warehouse takes time to build; time should be given to the project, and the difficulties in getting a data warehouse up and running and developed should not be underestimated.
• The re-education of the programmers often proves to be a disadvantage, as change is often resisted until familiarity is gained with the new approach.
• The data warehouse will require management, although in comparison to the management of the current environment it will mean that overall less time is actually required in the Data Warehousing approach.
• Data held in one place highlights data integrity problems and vulnerability from the public domain; thus advanced security to prevent unwanted users, including competitors, from accessing the database will be of critical importance for the company. (Davis et al, 1999)

CONCLUSION

Thus, we see that data warehousing does not have to be an enigma to the managers of companies. Management must believe and understand that data warehousing is of strategic importance to a company. Even though it is an exhausting task, a successful data warehousing project is crucial for companies to run a successful business. Therefore, even if management has to invest a massive amount of capital to build a data warehouse, it must do so in view of the myriad benefits that will crop up. Any company that doesn’t see the importance and benefits of data warehousing, and is blinded by the cost and daunted by the size of the task, will feel the devastating impact of the competing businesses that have undertaken successful data warehousing projects.
