Ashutosh Chandra, Prachi Sharma, Richa Palyaal, Shikha Jain
A data warehouse is a repository of information gathered from multiple sources and stored under a unified schema at a single site. The data warehouse is a relational database organized to hold information in a structure that best supports reporting and analysis. "A copy of transaction data specifically structured for query and analysis" (Ralph Kimball, 1996)
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context (Barry Devlin, 1992)
A process of transforming data into information and making it available to users in a timely enough manner to make a difference (Forrester Research, 1996)
A collection of integrated, subject-oriented databases designed to support the DSS function, where each unit of data is relevant at some moment of time (Bill Inmon, 1991)
Organized around major subjects, such as customer, product, and sales. Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. Provides a simple and concise view around particular subject issues by excluding data that is not useful in the decision support process.
Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, and on-line transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, etc. among the different data sources.
• E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.
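The conversion step can be illustrated with the hotel-price example. The following is a minimal Python sketch, not a definitive implementation: the record fields (`amount`, `currency`, `tax_included`, `tax_rate`, `breakfast`), the `HotelPrice` type, and the fixed exchange rates are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical fixed exchange rates to USD; a real warehouse would
# look these up in a maintained reference table.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10, "INR": 0.012}


@dataclass
class HotelPrice:
    """Unified warehouse representation (assumed for this sketch)."""
    hotel: str
    price_usd: float          # nightly price in USD, tax included
    breakfast_included: bool


def convert(record: dict) -> HotelPrice:
    """Convert one source record into the unified representation."""
    amount = record["amount"] * RATES_TO_USD[record["currency"]]
    # Some sources quote prices before tax; normalize to tax-inclusive.
    if not record["tax_included"]:
        amount *= 1 + record.get("tax_rate", 0.0)
    # Sources encode the breakfast flag differently (True, "Y", "yes").
    breakfast = record["breakfast"] in (True, "Y", "yes")
    return HotelPrice(record["hotel"], round(amount, 2), breakfast)
```

Each source keeps its own conventions; only the converted form is stored, so queries against the warehouse can compare prices directly.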
The time horizon for the data warehouse is significantly longer than that of operational systems. Operational database: current-value data. Data warehouse data: provides information from a historical perspective (e.g., the past 5-10 years).
Every key structure in the data warehouse contains an element of time, explicitly or implicitly, but the key of operational data may or may not contain a "time element".
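The difference in key structure can be sketched with Python's built-in `sqlite3` module. The table names (`op_product`, `dw_product_price`), the columns, and the `record_price` helper are hypothetical: the operational table is keyed by product alone and holds only the current value, while the warehouse table includes a date in its key and so keeps the history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE op_product (
    product_id INTEGER PRIMARY KEY,   -- no time element in the key
    price REAL
);
CREATE TABLE dw_product_price (
    product_id INTEGER,
    effective_date TEXT,              -- explicit time element in the key
    price REAL,
    PRIMARY KEY (product_id, effective_date)
);
""")


def record_price(product_id, date, price):
    # Operational system: overwrite the current value.
    conn.execute("INSERT OR REPLACE INTO op_product VALUES (?, ?)",
                 (product_id, price))
    # Warehouse: add a new row for this point in time.
    conn.execute("INSERT INTO dw_product_price VALUES (?, ?, ?)",
                 (product_id, date, price))


record_price(1, "2023-01-01", 9.99)
record_price(1, "2024-01-01", 11.49)

op_rows = conn.execute("SELECT * FROM op_product").fetchall()
dw_rows = conn.execute(
    "SELECT * FROM dw_product_price ORDER BY effective_date").fetchall()
```

After the two price changes, `op_rows` holds a single row with the latest price, while `dw_rows` holds one row per effective date.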
A physically separate store of data transformed from the operational environment. Operational update of data does not occur in the data warehouse environment. Does not require transaction processing, recovery, and concurrency control mechanisms. Requires only two operations in data accessing: initial loading of data and access of data.
A Data warehouse Architecture (DWA) is a way of representing the overall structure of data, communication, processing and presentation that exists for end-user computing within the enterprise.
Three parts of the data warehouse:
The data warehouse that contains the data and associated software.
Data acquisition (back-end) software that extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse.
Client (front-end) software that allows users to access and analyze data from the warehouse.
Data flows into the data warehouse through the "load manager". The data is extracted from the operational databases and supplemented by data imported from external sources. The load manager primarily performs an extract, transform, load (ETL) operation: data extraction, data transformation, and data loading.
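The three ETL steps can be sketched as three small functions. This is a minimal Python sketch under stated assumptions: the `sales` table, the `(customer, amount)` row shape, and the cleaning rules are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3


def extract(operational_rows, external_rows):
    """Extraction: combine operational rows with externally imported rows."""
    return list(operational_rows) + list(external_rows)


def transform(rows):
    """Transformation: standardize names and drop rows with no amount."""
    cleaned = []
    for customer, amount in rows:
        if amount is None:
            continue
        cleaned.append((customer.strip().title(), float(amount)))
    return cleaned


def load(conn, rows):
    """Loading: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


def run_etl(conn, operational_rows, external_rows):
    load(conn, transform(extract(operational_rows, external_rows)))


warehouse = sqlite3.connect(":memory:")
run_etl(warehouse,
        operational_rows=[("  alice ", "10.5"), ("BOB", None)],
        external_rows=[("carol", 3)])
loaded = warehouse.execute(
    "SELECT customer, amount FROM sales ORDER BY customer").fetchall()
```

Keeping the steps as separate functions makes it possible to test the cleaning rules without touching either the sources or the warehouse.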
The query manager provides an interface between the warehouse and its users. It performs tasks such as directing queries to the appropriate tables, monitoring the effectiveness of the indexes and summary data, and scheduling queries.
The primary components of data warehouses are: Data Sources, Data Transformation, Data Warehouse, Reporting, Metadata, Operations, and Optional Components.
Data Sources: Data sources are any electronic repositories of information from which data is passed to the data warehouse, either on a transaction-by-transaction basis (for real-time data warehouses) or on a regular cycle.
Data Transformation: The data transformation layer receives data from the data sources, cleans and standardizes it, and loads it into the data repository.
Data Warehouse: The data warehouse is a relational database organized to hold information in a structure that best supports reporting and analysis.
Reporting: The data in the data warehouse must be available to all of its users if the data warehouse is to be useful.
Metadata: Metadata, or "data about data", is used to inform users of the data warehouse about its status and the information held within it.
Operations: Data warehouse operations comprise the processes of loading, manipulating, and extracting data from the data warehouse. Operations also cover user management, security, capacity management, and related functions.
In addition, the following components also exist in some data warehouses:
Dependent Data Marts: A dependent data mart is a physical database (either on the same hardware as the data warehouse or on a separate hardware platform) that receives all its information from the data warehouse.
Logical Data Marts: A logical data mart is a filtered view of the main data warehouse but does not physically exist as a separate data copy.
Operational Data Store: An ODS is an integrated database of operational data. Its sources include legacy systems, and it contains current or near-term data.
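The distinction between a dependent and a logical data mart can be sketched with Python's built-in `sqlite3` module. The table and view names are hypothetical, and for simplicity the dependent mart lives in the same database as the warehouse: the logical mart is a view over the warehouse table, while the dependent mart is a physical copy filled from it.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
CREATE TABLE sales (region TEXT, amount REAL);
INSERT INTO sales VALUES ('north', 100.0), ('south', 50.0);

-- Logical data mart: a filtered view, no separate copy of the data.
CREATE VIEW logical_mart_north AS
    SELECT * FROM sales WHERE region = 'north';

-- Dependent data mart: a physical table populated from the warehouse.
CREATE TABLE dependent_mart_north AS
    SELECT * FROM sales WHERE region = 'north';
""")

# A later load into the warehouse ...
dw.execute("INSERT INTO sales VALUES ('north', 25.0)")

# ... is visible through the view immediately, but not in the physical
# copy until that copy is refreshed from the warehouse.
logical_count = dw.execute(
    "SELECT COUNT(*) FROM logical_mart_north").fetchone()[0]
dependent_count = dw.execute(
    "SELECT COUNT(*) FROM dependent_mart_north").fetchone()[0]
```

Here `logical_count` reflects the new row while `dependent_count` does not, which is the practical trade-off: a view is always current but is computed on each query, whereas a physical mart can sit on separate hardware but must be reloaded.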
Helps in reporting on and analyzing the data. Increases data consistency. Increases productivity and decreases computing costs. Is able to combine data from different sources in one place. Provides an infrastructure that could support changes to data and replication of the changed data back into the operational systems.
Extracting, cleaning, and loading data can be time consuming. There can be compatibility problems with systems already in place, e.g. a transaction processing system. Training must be provided to end users, who may still end up not using the data warehouse. Security can develop into a serious issue, especially if the data warehouse is web accessible.
Data warehousing is such a new field that it is difficult to estimate which new developments are most likely to affect it. Clearly, the development of parallel database servers with improved query engines is likely to be one of the most important. Parallel servers will make it possible to access huge databases in much less time.
Data Warehousing is not a new phenomenon. All large organizations already have data warehouses, but they are just not managing them. Over the next few years, the growth of data warehousing is going to be enormous with new products and technologies coming out frequently. In order to get the most out of this period, it is going to be important that data warehouse planners and developers have a clear idea of what they are looking for and then choose strategies and methods that will provide them with performance today and flexibility for tomorrow.