Data Warehousing & Mining

Published by: 0202shruti on Dec 20, 2009
Copyright:Attribution Non-commercial

Datawarehousing & Mining - www.neteffect.in
Data Warehousing & Mining
Data Warehouse Architecture:
Architecture, in the context of an organization's data warehousing efforts, is a conceptualization of how the data warehouse is built. There is no right or wrong architecture. The worthiness of an architecture can be judged by how well the conceptualization aids in the building, maintenance, and usage of the data warehouse.
One possible simple conceptualization of a data warehouse architecture consists of the following interconnected layers:
 
Operational database layer
The source data for the data warehouse - an organization's ERP systems fall into this layer.

Informational access layer
The data accessed for reporting and analysis, and the tools for reporting and analyzing data - BI tools fall into this layer. The Inmon-Kimball differences about design methodology, discussed later in this article, also have to do with this layer.

Data access layer
The interface between the operational and informational access layers - tools to extract, transform, and load data into the warehouse fall into this layer.

Metadata layer
The data directory - this is usually more detailed than an operational system's data directory. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.

Normalized versus dimensional approach for storage of data
 
There are two leading approaches to storing data in a data warehouse - the dimensional approach and the normalized approach.
In the dimensional approach, transaction data are partitioned into "facts", which are generally numeric transaction data, and "dimensions", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order.
A key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data warehouse tends to operate very quickly. The main disadvantages of the dimensional approach are:
1) In order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated, and
2) It is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which it does business.
In the normalized approach, the data in the data warehouse are stored following, to a degree, Codd's normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The main advantage of this approach is that it is straightforward to add information into the database. A disadvantage of this approach is that, because of the number of tables involved, it can be difficult for users both to
1) join data from different sources into meaningful information and then
2) access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.
These approaches are not exact opposites of each other. Dimensional approaches can involve normalizing data to a degree.
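The split of a sales transaction into facts and dimensions described above can be sketched in Python. This is only an illustrative sketch: the class names, field names, the split_transaction helper and the sample record are all assumptions made for the example, not part of any particular warehouse product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SalesFacts:
    """Numeric measures of one sales transaction (hypothetical fields)."""
    units_ordered: int
    price_paid: float


@dataclass(frozen=True)
class SalesDimensions:
    """Reference information that gives the facts their context."""
    order_date: date
    customer_name: str
    product_number: str
    ship_to: str
    bill_to: str
    salesperson: str


def split_transaction(txn):
    """Partition one flat transaction record into (facts, dimensions)."""
    facts = SalesFacts(txn["units_ordered"], txn["price_paid"])
    dims = SalesDimensions(
        txn["order_date"], txn["customer_name"], txn["product_number"],
        txn["ship_to"], txn["bill_to"], txn["salesperson"],
    )
    return facts, dims


# Made-up sample record, as it might arrive from an operational system.
transaction = {
    "units_ordered": 3,
    "price_paid": 149.97,
    "order_date": date(2009, 12, 20),
    "customer_name": "A. Customer",
    "product_number": "P-100",
    "ship_to": "Warehouse Road 1",
    "bill_to": "Office Street 2",
    "salesperson": "S. Seller",
}

facts, dims = split_transaction(transaction)
```

Here facts holds only the numeric measures, while dims holds everything a report would filter or group by.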
 
Evolution in organization use of data warehouses
 
Organizations generally start off with relatively simple use of data warehousing. Over time, more sophisticated use of data warehousing evolves. The following general stages of use of the data warehouse can be distinguished:

Offline Operational Databases
Data warehouses in this initial stage are developed by simply copying the data of an operational system to another server, where the processing load of reporting against the copied data does not impact the operational system's performance.

Offline Data Warehouse
Data warehouses at this stage are updated from data in the operational systems on a regular basis, and the data warehouse data is stored in a data structure designed to facilitate reporting.

Real Time Data Warehouse
Data warehouses at this stage are updated every time an operational system performs a transaction (e.g., an order, a delivery or a booking).

Integrated Data Warehouse
Data warehouses at this stage are updated every time an operational system performs a transaction. The data warehouses then generate transactions that are passed back into the operational systems.
 
Fact table:
In data warehousing, a fact table consists of the measurements, metrics or facts of a business process. It is often located at the centre of a star schema, surrounded by dimension tables. Fact tables provide the (usually) additive values which act as independent variables by which dimensional attributes are analyzed. Fact tables are often defined by their grain. The grain of a fact table represents the most atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as "Sales volume by Day by Product by Store". Each record in this fact table is therefore uniquely defined by a day, product and store. Other dimensions might be members of this fact table (such as location/region) but these add nothing to the uniqueness of the fact records. These "affiliate dimensions" allow for additional slices of the independent facts but generally provide insights at a higher level of aggregation (a region is made up of many stores).
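The idea of grain can be illustrated with a small Python sketch. It assumes a hypothetical list of raw sale lines and a hypothetical store-to-region mapping; the function names are invented for the example. Each fact record is keyed by (day, product, store), and region acts as an affiliate dimension that only rolls the facts up to a coarser level.

```python
from collections import defaultdict

# Hypothetical raw sale lines: (day, product, store, units)
raw_sales = [
    ("2009-12-01", "TV", "Store A", 2),
    ("2009-12-01", "TV", "Store A", 1),
    ("2009-12-01", "TV", "Store B", 4),
    ("2009-12-02", "Radio", "Store A", 5),
]

# Hypothetical affiliate dimension: each store belongs to exactly one region.
store_region = {"Store A": "North", "Store B": "South"}


def build_fact_table(rows):
    """Aggregate raw lines to the grain 'units by day by product by store'."""
    fact = defaultdict(int)
    for day, product, store, units in rows:
        fact[(day, product, store)] += units  # the key is the grain
    return dict(fact)


def units_by_region(fact, region_of):
    """Roll the facts up through the affiliate dimension (region)."""
    totals = defaultdict(int)
    for (_day, _product, store), units in fact.items():
        totals[region_of[store]] += units
    return dict(totals)


sales_fact = build_fact_table(raw_sales)
region_totals = units_by_region(sales_fact, store_region)
```

Because region is fully determined by store, adding it to the key would not make any record more unique; it only offers a higher-level slice of the same facts.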
 
 
A data warehouse dimension provides the means to "slice and dice" data in a data warehouse. Dimensions provide structured labeling information to otherwise unordered numeric measures. For example, "Customer", "Date", and "Product" are all dimensions that could be applied meaningfully to a sales receipt. A dimensional data element is similar to a categorical variable in statistics.
The primary function of dimensions is threefold: to provide filtering, grouping and labeling. For example, in a data warehouse where each person is categorized as having a gender of male, female or unknown, a user of the data warehouse would then be able to filter or categorize each presentation or report by either filtering based on the gender dimension or displaying results broken out by the gender.
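The filtering and grouping roles of a dimension can be sketched in a few lines of Python. The records, the "purchases" measure and the helper names below are made up for illustration.

```python
from collections import Counter

# Hypothetical records: one row per person, with a gender dimension
# and a numeric measure (number of purchases).
people = [
    {"name": "P1", "gender": "female", "purchases": 3},
    {"name": "P2", "gender": "male", "purchases": 1},
    {"name": "P3", "gender": "unknown", "purchases": 2},
    {"name": "P4", "gender": "female", "purchases": 4},
]


def filter_by(rows, dimension, value):
    """Filtering: keep only the rows whose dimension has the given value."""
    return [row for row in rows if row[dimension] == value]


def group_sum(rows, dimension, measure):
    """Grouping: total a measure for each distinct value of a dimension."""
    totals = Counter()
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)


female_rows = filter_by(people, "gender", "female")
purchases_by_gender = group_sum(people, "gender", "purchases")
```

The keys of purchases_by_gender double as the labels a report would print next to each total, which is the third role of a dimension.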
 
Star Schema:
The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse schema. The star schema consists of a few "fact tables" (possibly only one, justifying the name) referencing any number of "dimension tables". The star schema is considered an important special case of the snowflake schema.
 
Example
Consider a database of sales, perhaps from a store chain, classified by date, store and product. The schema shown in the image is a star schema version of the sample schema provided in the snowflake schema article.
Fact_Sales is the fact table and there are three dimension tables: Dim_Date, Dim_Store and Dim_Product.
Each dimension table has a primary key on its Id column, relating to one of the columns of the Fact_Sales table's three-column primary key (Date_Id, Store_Id, Product_Id). The non-primary key Units_Sold column of the fact table in this example represents a measure or metric that can be used in calculations and analysis. The non-primary key columns of the dimension tables represent additional attributes of the dimensions (such as the Year of the Dim_Date dimension).
 
Star schema used by example query.
 
The following query extracts how many TV sets have been sold, for each brand and country, in 1997.
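A query of this kind can be sketched against a small in-memory SQLite database using Python's sqlite3 module. Only the table names, the Id keys, Year and Units_Sold come from the schema description above; the Brand, Country and Product_Category columns, the 'tv' category value and every sample row are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Star schema: one fact table referencing three dimension tables.
# Brand, Country and Product_Category are assumed column names.
conn.executescript("""
CREATE TABLE Dim_Date    (Id INTEGER PRIMARY KEY, Year INTEGER);
CREATE TABLE Dim_Store   (Id INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE Dim_Product (Id INTEGER PRIMARY KEY, Brand TEXT,
                          Product_Category TEXT);
CREATE TABLE Fact_Sales (
    Date_Id    INTEGER REFERENCES Dim_Date(Id),
    Store_Id   INTEGER REFERENCES Dim_Store(Id),
    Product_Id INTEGER REFERENCES Dim_Product(Id),
    Units_Sold INTEGER,
    PRIMARY KEY (Date_Id, Store_Id, Product_Id)
);
-- Made-up sample data.
INSERT INTO Dim_Date VALUES (1, 1997), (2, 1998);
INSERT INTO Dim_Store VALUES (1, 'India'), (2, 'Japan');
INSERT INTO Dim_Product VALUES
    (1, 'BrandX', 'tv'), (2, 'BrandY', 'tv'), (3, 'BrandX', 'radio');
INSERT INTO Fact_Sales VALUES
    (1, 1, 1, 10),   -- 1997, India, BrandX tv
    (1, 2, 1, 5),    -- 1997, Japan, BrandX tv
    (1, 1, 2, 7),    -- 1997, India, BrandY tv
    (2, 1, 1, 99),   -- 1998, excluded by the year filter
    (1, 1, 3, 50);   -- radio, excluded by the category filter
""")

# Join the fact table to each dimension, filter on dimension attributes,
# then sum the measure per brand and country.
query = """
SELECT P.Brand, S.Country, SUM(F.Units_Sold) AS Units_Sold
FROM Fact_Sales F
JOIN Dim_Date    D ON F.Date_Id    = D.Id
JOIN Dim_Store   S ON F.Store_Id   = S.Id
JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE D.Year = 1997 AND P.Product_Category = 'tv'
GROUP BY P.Brand, S.Country
ORDER BY P.Brand, S.Country
"""
rows = conn.execute(query).fetchall()
conn.close()
```

The shape is typical of star schema queries: the fact table supplies the measure, and every filter and grouping column comes from a dimension table reached through a single join.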
 
Normalization:
Database normalization, sometimes referred to as canonical synthesis, is a technique for designing relational database tables to minimize duplication of information and, in so doing, to safeguard the database against certain types of logical or structural problems, namely data anomalies. For example, when multiple instances of a given piece of information occur in a table, the possibility exists that these instances will not be kept consistent when the data within the table is updated, leading to a loss of data integrity. A table that is sufficiently normalized is less vulnerable to problems of this kind, because its structure reflects the basic assumptions for when multiple instances of the same information should be represented by a single instance only.
Higher degrees of normalization typically involve more tables and create the need for a larger number of joins, which can reduce performance. Accordingly, more highly normalized tables are typically used in database applications involving many isolated transactions (e.g. an automated teller machine), while less normalized tables tend to be used in database applications that need to map complex relationships between data entities and data attributes (e.g. a reporting application, or a full-text search application).
Database theory describes a table's degree of normalization in terms of normal forms of successively
