
DATA WAREHOUSE CONCEPTS

Data warehousing is the coordinated, architected, and periodic copying of data from various sources, both inside and outside the enterprise, into an environment optimized for analytical and informational processing. A data warehouse system has the following characteristics:

✓ It provides centralization of corporate data assets.
✓ It is contained in a well-managed environment.
✓ It has consistent and repeatable processes defined for loading data from corporate applications.
✓ It is built on an open and scalable architecture that can handle future expansion of data.
✓ It provides tools that allow its users to effectively process the data into information without a high degree of technical support.

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse there will be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.

Ralph Kimball provided a more concise definition of a data warehouse: a data warehouse is a copy of transaction data specifically structured for query and analysis. This is a functional view of a data warehouse. Kimball did not address how the data warehouse is built as Inmon did; rather, he focused on the functionality of a data warehouse.

Different data warehousing systems have different structures. Some may have an ODS (operational data store), while some may have multiple data marts. Some may have a small number of data sources, while some may have dozens of data sources. In view of this, it is far more reasonable to present the different layers of a data warehouse architecture rather than discussing the specifics of any one system. In general, all data warehouse systems have the following layers:

• Data Source Layer
• Data Extraction Layer
• Staging Area
• ETL Layer
• Data Storage Layer
• Data Logic Layer
• Data Presentation Layer
• Metadata Layer
• System Operations Layer

The picture below shows the relationships among the different components of the data warehouse architecture:

Each component is discussed individually below.

Data Source Layer

This represents the different data sources that feed data into the data warehouse. The data source can be of any format: plain text files, relational databases, other types of databases, Excel files, and so on can all act as a data source. Many different types of data can be a data source:

• Operations data, such as sales data, HR data, product data, inventory data, marketing data, and systems data.
• Web server logs with user browsing data.
• Internal market research data.
• Third-party data, such as census data, demographics data, or survey data.

All these data sources together form the Data Source Layer.

Data Extraction Layer

Data gets pulled from the data source into the data warehouse system. There is likely some minimal data cleansing, but there is unlikely to be any major data transformation.

Staging Area

This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes it easier for subsequent data processing / integration.

ETL Layer

This is where data gains its "intelligence", as logic is applied to transform the data from a transactional nature to an analytical nature. This layer is also where data cleansing happens.

Data Storage Layer

This is where the transformed and cleansed data sit. Based on scope and functionality, 3 types of entities can be found here: data warehouse, data mart, and operational data store (ODS). In any given system, you may have just one of the three, two of the three, or all three types.

Data Logic Layer

This is where business rules are stored. Business rules stored here do not affect the underlying data transformation rules, but they do affect what the report looks like.

Data Presentation Layer

This refers to the information that reaches the users. It can be in the form of a tabular or graphical report in a browser, an emailed report that gets automatically generated and sent every day, or an alert that warns users of exceptions, among others.

Metadata Layer

This is where information about the data stored in the data warehouse system is kept. A logical data model would be an example of something that's in the metadata layer.

System Operations Layer

This layer includes information on how the data warehouse system operates, such as ETL job status, system performance, and user access history.

Data Warehouse Design

After the tools and team personnel selections are made, the data warehouse design can begin. The following are the typical steps involved in the data warehousing project cycle:

• Requirement Gathering
• Physical Environment Setup
• Data Modeling
• ETL
• OLAP Cube Design
• Front End Development
• Report Development
• Performance Tuning
• Query Optimization
• Quality Assurance
• Rolling out to Production
• Production Maintenance
• Incremental Enhancements

Each of the phases listed above represents a typical data warehouse design phase and has several sections:

• Task Description: This section describes what typically needs to be accomplished during this particular data warehouse design phase.
• Time Requirement: A rough estimate of the amount of time this particular data warehouse task takes.
• Deliverables: Typically, at the end of each data warehouse task, one or more documents are produced that fully describe the steps and results of that particular task. This is especially important for consultants to communicate their results to the clients.
• Possible Pitfalls: Things to watch out for. Some of them are obvious, some of them not so obvious. All of them are real.

The Additional Observations section contains my own observations on data warehouse processes not included in any of the design steps.

Requirement Gathering

Task Description
The first thing that the project team should engage in is gathering requirements from end users. Because end users are typically not familiar with the data warehousing process or concept, the help of the business sponsor is essential. Requirement gathering can happen as one-to-one meetings or as Joint Application Development (JAD) sessions, where multiple people are talking about the project scope in the same meeting.

The primary goal of this phase is to identify what constitutes success for this particular phase of the data warehouse project. In particular, end user reporting / analysis requirements are identified, and the project team will spend the remaining period of time trying to satisfy these requirements. Associated with the identification of user requirements is a more concrete definition of other details such as hardware sizing information, training requirements, data source identification, and, most importantly, a concrete project plan indicating the finishing date of the data warehousing project.

Based on the information gathered above, a disaster recovery plan needs to be developed so that the data warehousing system can recover from accidents that disable the system. Without an effective backup and restore strategy, the system will only last until the first major disaster, and, as many data warehousing DBAs will attest, this can happen very quickly after the project goes live.

Time Requirement
2 - 8 weeks.

Deliverables
• A list of reports / cubes to be delivered to the end users by the end of this current phase.
• An updated project plan that clearly identifies resource loads and milestone delivery dates.

Possible Pitfalls
This phase often turns out to be the trickiest phase of the data warehousing implementation. The reason is that, because data warehousing by definition includes data from multiple sources spanning many different departments within the enterprise, there are often political battles that center on the willingness of information sharing. Even though a successful data warehouse benefits the enterprise, there are occasions where departments may not feel the same way. As a result of the unwillingness of certain groups to release data or to participate in the data warehousing requirement definition, the data warehouse effort either never gets off the ground or does not start in the direction originally defined. When this happens, it is ideal to have a strong business sponsor. If the sponsor is at the CXO level, she can often exert enough influence to make sure everyone cooperates.

Physical Environment Setup

Task Description
Once the requirements are somewhat clear, it is necessary to set up the physical servers and databases. At a minimum, it is necessary to set up a development environment and a production environment. There are also many data warehousing projects where there are three environments: Development, Testing, and Production.

It is not enough to simply have different physical environments set up. The different processes (such as ETL, OLAP cube, and reporting) also need to be set up properly for each environment.

Having different environments is very important for the following reasons:
• All changes can be tested and QA'd first without affecting the production environment.
• Development and QA can occur during the time users are accessing the data warehouse.
• When there is any question about the data, having separate environment(s) will allow the data warehousing team to examine the data without impacting the production environment.

Time Requirement
Getting the servers and databases ready should take less than 1 week.

Deliverables
• Hardware / software setup document for all of the environments, including hardware specifications and scripts / settings for the software.

Possible Pitfalls
To save on capital, data warehousing teams will often decide to use only a single database and a single server for the different environments, with environment separation achieved by either a directory structure or by setting up distinct instances of the database. This is problematic for the following reasons:

1. Sometimes the server needs to be rebooted for the development environment. Having a separate development environment will prevent the production environment from being impacted by this.
2. There may be interference when having different database environments on a single box. For example, having multiple long queries running on the development database could affect the performance of the production database.

It is best for the different environments to use distinct application and database servers. In other words, the development environment will have its own application server and database servers, and the production environment will have its own set of application and database servers.

Data Modeling

Task Description
This is a very important step in the data warehousing project. Indeed, it is fair to say that the foundation of the data warehousing system is the data model. A good data model will allow the data warehousing system to grow easily, as well as allowing for good performance. In a data warehousing project, the logical data model is built based on user requirements, and then it is translated into the physical data model. The detailed steps can be found in the Conceptual, Logical, and Physical Data Modeling section.

Part of the data modeling exercise is often the identification of data sources. Sometimes this step is deferred until the ETL step. However, my feeling is that it is better to find out early where the data exists, or, better yet, whether it even exists anywhere in the enterprise at all. Should the data not be available, this is a good time to raise the alarm. If this were delayed until the ETL phase, rectifying it would become a much tougher and more complex process.

Time Requirement
2 - 6 weeks.

Deliverables
• Identification of data sources.
• Logical data model.
• Physical data model.

Possible Pitfalls
It is essential to have a subject-matter expert as part of the data modeling team. This person can be an outside consultant or someone in-house who has extensive experience in the industry. Without this person, it becomes difficult to get a definitive answer on many of the questions, and the entire project gets dragged out.

ETL

Task Description
The ETL (Extraction, Transformation, Loading) process typically takes the longest to develop, and this can easily take up to 50% of the data warehouse implementation cycle or longer. The reason is that it takes time to get the source data, understand the necessary columns, understand the business rules, and understand the logical and physical data models.

Time Requirement
1 - 6 weeks.

Deliverables
• Data Mapping Document
• ETL Script / ETL Package in the ETL tool

Possible Pitfalls
There is a tendency to give this particular phase too little development time. This can prove suicidal to the project, because end users will usually tolerate less formatting, longer times to run reports, less functionality (slicing and dicing), or fewer delivered reports; the one thing they will not tolerate is wrong information.

A second common problem is that some people make the ETL process more complicated than necessary. In ETL design, the primary goal should be to optimize load speed without sacrificing quality. This is, however, sometimes not followed. There are cases where the design goal is to cover all possible future uses, whether they are practical or just a figment of someone's imagination. When this happens, ETL performance suffers, and often so does the performance of the entire data warehousing system.
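As a rough illustration of the loading step described above, the sketch below loads a hypothetical staging table into a fact table in plain SQL; the names (stg_sales, fact_sales, dim_product) are illustrative assumptions, and in practice this logic usually lives inside the ETL tool.

    -- Minimal ETL load sketch (hypothetical staging and warehouse tables).
    -- Step 1: basic cleansing in the staging area: drop rows with no product code.
    DELETE FROM stg_sales WHERE product_code IS NULL;

    -- Step 2: load the fact table, resolving the natural product code
    -- to the surrogate key held in the product dimension.
    INSERT INTO fact_sales (date_key, product_key, store_key, sales_amount)
    SELECT s.sales_date,
           p.product_key,
           s.store_id,
           s.sales_amount
    FROM   stg_sales s
    JOIN   dim_product p ON p.product_code = s.product_code;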

OLAP Cube Design

Task Description
Usually the design of the OLAP cube can be derived from the Requirement Gathering phase. More often than not, users have some idea of what they want, but it is difficult for them to specify the exact report / analysis they want to see. When this is the case, it is usually a good idea to include enough information so that they feel they have gained something through the data warehouse, but not so much that it stretches the data warehouse scope by a mile. Remember that data warehousing is an iterative process; no one can ever meet all the requirements all at once.

Time Requirement
1 - 2 weeks.

Deliverables
• Documentation specifying the OLAP cube dimensions and measures.
• Actual OLAP cube / report.

Possible Pitfalls
Make sure your OLAP cube-building process is optimized. It is common for the data warehouse to be at the bottom of the nightly batch load, and after the loading of the data warehouse there usually isn't much time remaining for the OLAP cube to be refreshed. As a result, it is worthwhile to experiment with the OLAP cube generation paths to ensure optimal performance.
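As a rough sketch of what a cube refresh boils down to, the SQL below pre-aggregates a hypothetical fact_sales table along the product and store dimensions; the names are assumptions, and a real cube would normally be generated by the OLAP tool rather than by hand.

    -- Pre-aggregate sales by product and store so the nightly cube refresh
    -- has far fewer rows to process (hypothetical table names).
    CREATE TABLE agg_sales_product_store AS
    SELECT product_key,
           store_key,
           SUM(sales_amount) AS total_sales,
           COUNT(*)          AS txn_count
    FROM   fact_sales
    GROUP  BY product_key, store_key;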

Front End Development

Task Description
Regardless of the strength of the OLAP engine and the integrity of the data, if the users cannot visualize the reports, the data warehouse brings zero value to them. Hence front end development is an important part of a data warehousing initiative.

So what are the things to look out for in selecting a front-end deployment methodology? The most important thing is that the reports should be delivered over the web. These days it is no longer desirable nor feasible to have the IT department doing program installations on end users' desktops just so that they can view reports. So, whatever strategy one pursues, make sure the ability to deliver over the web is a must, so that the only thing the user needs is a standard browser.

The front-end options range from internal front-end development using scripting languages such as ASP, PHP, or Perl, to off-the-shelf products such as Seagate Crystal Reports, to higher-level products such as Actuate. In addition, many OLAP vendors offer a front end of their own. When choosing vendor tools, make sure they can be easily customized to suit the enterprise, especially the possible changes to the reporting requirements of the enterprise. Possible changes include not just differences in report layout and report content, but also possible changes in the back-end structure. For example, if the enterprise decides to change from Solaris/Oracle to Microsoft 2000/SQL Server, will the front-end tool be flexible enough to adjust to the changes without much modification?

Another area to be concerned with is the complexity of the reporting tool. For example, do the reports need to be published on a regular interval? Are there very specific formatting requirements? Is there a need for a GUI interface so that each user can customize her reports?

Time Requirement
1 - 4 weeks.

Deliverables
• Front End Deployment Documentation.

Possible Pitfalls
Just remember that the end users do not care how complex or how technologically advanced your front end infrastructure is. All they care about is that they receive their information in a timely manner and in the way they specified.

Report Development

Task Description
Report specification typically comes directly from the requirements phase. To the end user, the only direct touchpoint he or she has with the data warehousing system is the reports they see. So report development, although not as time consuming as some of the other steps such as ETL and data modeling, nevertheless plays a very important role in determining the success of the data warehousing project.

One would think that report development is an easy task. How hard can it be to just follow instructions to build the report? Unfortunately, this is not true. There are several points the data warehousing team needs to pay attention to before releasing a report:

User customization: Do users need to be able to select their own metrics? And how do users need to be able to filter the information? The report development process needs to take those factors into consideration so that users can get the information they need in the shortest amount of time possible.

Report delivery: What report delivery methods are needed? In addition to delivering the report to the web front end, other possibilities include delivery via email, via text messaging, or in some form of spreadsheet. There are reporting solutions in the marketplace that support report delivery as a flash file. Such a flash file essentially acts as a mini-cube and allows end users to slice and dice the data on the report without having to pull data from an external source.

Access privileges: Special attention needs to be paid to who has access to what information. For example, a sales report may show 8 metrics covering the entire company to the company CEO, while the same report may only show 5 of the metrics, covering only a single district, to a District Sales Director.

Report development does not happen only during the implementation phase. After the system goes into production, there will certainly be requests for additional reports. These types of requests generally fall into two broad categories:

1. Data is already available in the data warehouse. In this case, it should be fairly straightforward to develop the new report in the front end. There is no need to wait for a major production push before making new reports available.

2. Data is not yet available in the data warehouse. This means that the request needs to be prioritized and put into a future data warehousing development cycle.

Time Requirement
1 - 2 weeks.

Deliverables
• Report Specification Documentation.
• Reports set up in the front end / reports delivered to the users' preferred channel.

Possible Pitfalls
Make sure the exact definitions of the report are communicated to the users. Otherwise, user interpretation of the report can be erroneous.

Performance Tuning

Task Description
There are three major areas where a data warehousing system can use a little performance tuning:

• ETL: Given that the data load is usually a very time-consuming process (and hence typically relegated to a nightly load job), and that data warehousing-related batch jobs are typically of lower priority, the window for data loading is not very long. A data warehousing system whose ETL process finishes right on time is going to have a lot of problems, simply because the jobs often do not get started on time due to factors beyond the control of the data warehousing team. As a result, it is always an excellent idea for the data warehousing group to tune the ETL process as much as possible.

• Query Processing: Sometimes, especially in a ROLAP environment or in a system where the reports are run directly against the relational database, query performance can be an issue. A study has shown that users typically lose interest after 30 seconds of waiting for a report to return. My experience has been that ROLAP reports, or reports that run directly against the RDBMS, often exceed this time limit, so it is ideal for the data warehousing team to invest some time in tuning the queries, especially the most popular ones. We present a number of query optimization ideas below.

• Report Delivery: It is also possible that end users are experiencing significant delays in receiving their reports due to factors other than query performance. For example, network traffic, server setup, and even the way the front end was built sometimes play significant roles. It is important for the data warehouse team to look into these areas for performance tuning.

Time Requirement
3 - 5 days.

Deliverables
• Performance tuning document - goal and result.

Possible Pitfalls
Make sure the development environment mimics the production environment as much as possible; performance enhancements seen on less powerful machines sometimes do not materialize on the larger, production-level machines.

Query Optimization

For any production database, SQL query performance becomes an issue sooner or later. Having long-running queries not only consumes system resources that make the server and application run slowly, but may also lead to table locking and data corruption issues. So, query optimization becomes an important task.

First, we offer some guiding principles for query optimization:

1. Understand how your database is executing your query. Nowadays all databases have their own query optimizer and offer a way for users to understand how a query is executed. For example, which index from which table is being used to execute the query? The first step to query optimization is understanding what the database is doing. Different databases have different commands for this. For example, in MySQL one can use "EXPLAIN [SQL Query]" to see the query plan, and in Oracle one can use "EXPLAIN PLAN FOR [SQL Query]" to see the query plan.

2. Retrieve as little data as possible. The more data returned from the query, the more resources the database needs to expend to process and store it. So, for example, if you only need to retrieve one column from a table, do not use 'SELECT *'.

3. Store intermediate results. Sometimes the logic for a query can be quite complex. Often it is possible to achieve the desired result through the use of subqueries, inline views, and UNION-type statements. In those cases, the intermediate results are not stored in the database but are immediately used within the query. This can lead to performance issues, especially when the intermediate results have a large number of rows. The way to increase query performance in those cases is to store the intermediate results in a temporary table and break up the initial SQL statement into several SQL statements. In many cases, you can even build an index on the temporary table to speed up query performance even more. Granted, this adds a little complexity in query management (i.e., the need to manage temporary tables), but the speedup in query performance is often worth the trouble.
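The sketch below illustrates two of the principles above with hypothetical table names: inspecting the query plan, then breaking a heavy query into an intermediate table plus a simpler final query. The plan-display syntax follows the Oracle style quoted in the text; MySQL would use EXPLAIN directly, and temporary-table handling varies by database.

    -- Inspect the plan before tuning (Oracle style; in MySQL: EXPLAIN SELECT ...).
    EXPLAIN PLAN FOR
    SELECT store_key, SUM(sales_amount)
    FROM   fact_sales
    WHERE  date_key >= 20230101
    GROUP  BY store_key;

    -- Store an intermediate result instead of repeating a heavy subquery,
    -- index it, then run the simpler final query against it.
    CREATE TABLE tmp_recent_sales AS
    SELECT store_key, product_key, sales_amount
    FROM   fact_sales
    WHERE  date_key >= 20230101;

    CREATE INDEX ix_tmp_recent_sales_store ON tmp_recent_sales (store_key);

    SELECT store_key, SUM(sales_amount) AS total_sales
    FROM   tmp_recent_sales
    GROUP  BY store_key;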

Below are several specific query optimization strategies:

• Use Index: Using an index is the first strategy one should use to speed up a query. In fact, this strategy is so important that index optimization is also discussed separately.

• Aggregate Table: Pre-populate tables at higher levels so that less data needs to be parsed.

• Vertical Partitioning: Partition the table by columns. This strategy decreases the amount of data a SQL query needs to process.

• Horizontal Partitioning: Partition the table by data value, most often time. This strategy decreases the amount of data a SQL query needs to process.

• Denormalization: The process of denormalization combines multiple tables into a single table. This speeds up query performance because fewer table joins are needed.

• Server Tuning: Each server has its own parameters, and tuning server parameters so that the server can fully take advantage of the hardware resources can often significantly speed up query performance.
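A hedged sketch of two of the strategies above, with hypothetical warehouse table names: adding an index on a commonly filtered column, and horizontal partitioning by time. The partitioning clause shown is Oracle-style range partitioning; other databases use different syntax.

    -- Index a column that reports filter on frequently.
    CREATE INDEX ix_fact_sales_date ON fact_sales (date_key);

    -- Horizontal partitioning: split the fact table by date range so queries
    -- for one year scan only that year's partition (Oracle-style syntax).
    CREATE TABLE fact_sales_part (
        date_key     INTEGER,
        store_key    INTEGER,
        sales_amount DECIMAL(12,2)
    )
    PARTITION BY RANGE (date_key) (
        PARTITION p_2022 VALUES LESS THAN (20230101),
        PARTITION p_2023 VALUES LESS THAN (20240101)
    );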

Quality Assurance

Task Description
Once the development team declares that everything is ready for further testing, the QA team takes over. The QA team is always from the client. Usually the QA team members will know little about data warehousing, and some of them may even resent the need to learn another tool or tools. This makes the QA process a tricky one.

Sometimes the QA process is overlooked. On my very first data warehousing project, the project team worked very hard to get everything ready for Phase 1, and everyone thought that we had met the deadline. There was one mistake, though: the project managers failed to recognize that it is necessary to go through the client QA process before the project can go into production. As a result, it took five extra months to bring the project to production (the original development time had been only 2 1/2 months).

Time Requirement
1 - 4 weeks.

Deliverables
• QA Test Plan
• QA verification that the data warehousing system is ready to go to production

Possible Pitfalls
As mentioned above, usually the QA team members know little about data warehousing, and some of them may even resent the need to learn another tool or tools. Make sure the QA team members get enough education so that they can complete the testing themselves.

Rollout To Production

Task Description
Once the QA team gives the thumbs up, it is time for the data warehouse system to go live. Some may think this is as easy as flipping on a switch, but usually it is not true. Depending on the number of end users, it sometimes takes up to a full week to bring everyone online! Fortunately, nowadays most end users access the data warehouse over the web, making going to production sometimes as easy as sending out a URL via email.

Time Requirement
1 - 3 days.

Deliverables
• Delivery of the data warehousing system to the end users.

Possible Pitfalls
Take care to address the user education needs. There is nothing more frustrating than spending several months developing and QA'ing the data warehousing system, only to have little usage because the users are not properly trained. Regardless of how intuitive or easy the interface may be, it is always a good idea to send the users to at least a one-day course to let them understand what they can achieve by properly using the data warehouse.

For example. There is nothing more frustrating than staring at something another person did. the original geographical designations may be different. but now because sales are going so well. This is a definite no-no. many of the tool vendors position their products as business intelligence software rather than data warehousing software. Observations 1)Quick Implementation Time 2)Lack Of Collaboration With Data Mining Efforts 3)Industry Consolidation 4)How To Measure Success Business Intelligence Business intelligence is a term commonly associated with data warehousing. In fact. So. I would very strongly recommend that the typical cycle of development --> QA --> Production be followed.chalasani maintenance. there are often needs for incremental enhancements. the company may originally have 4 sales regions. regardless of how simple the change may seem. start on that as soon as possible.Kamesh. now they have 10 sales regions. There are other occasions where the two terms are used interchangeably. but simply small changes that follow the business itself. if there is another phase of the data warehouse planned. I am not talking about a new data warehousing phases. A data warehousing (or data mart) system is the backend. Incremental Enhancements Task Description Once the data warehousing system goes live. Many unexpected problems will pop up if this is done. it is very tempting to just go ahead and make the change in production. or the . yet unable to figure it out due to the lack of proper documentation. So. exactly what is business inteligence? Business intelligence usually refers to the information that is available for the enterprise to make decisions on. Another pitfall is that the maintenance phase is usually boring. Deliverables • Change management documentation • Actual change to the data warehousing system Possible Pitfalls Because a lot of times the changes are simple to make.

1) Tools
The most common tools used for business intelligence are listed below, in order of increasing cost, increasing functionality, increasing business intelligence complexity, and decreasing number of total users.

Excel
Take a guess: what is the most common business intelligence tool? You might be surprised to find out it's Microsoft Excel. There are several reasons for this:
1. It's relatively cheap.
2. It's commonly used. You can easily send an Excel sheet to another person without worrying about whether the recipient knows how to read the numbers.
3. It has most of the functionality users need to display data.
In fact, it is still so popular that all third-party reporting / OLAP tools have an "export to Excel" functionality. Even for home-built solutions, the ability to export numbers to Excel usually needs to be built. Excel is best used for business operations reporting and goals tracking.

Reporting tool
In this discussion, I am including both custom-built reporting tools and commercial reporting tools. They provide some flexibility in terms of the ability for each user to create, schedule, and run their own reports. Business operations reporting and dashboards are the most common applications for a reporting tool. The Reporting Tool Selection section discusses how one should select a reporting tool.

OLAP tool
OLAP tools are usually used by advanced users. They make it easy for users to look at the data from multiple dimensions. The OLAP Tool Selection section discusses how one should select an OLAP tool.

Data mining tool
Data mining tools are used for finding correlations among different factors. They are usually used only by very specialized users, and in an organization, even a large one, there are usually only a handful of users using data mining tools.

2) Uses
Business intelligence usage can be categorized into the following categories:

1. Business operations reporting
The most common form of business intelligence is business operations reporting. This includes the actuals and how the actuals stack up against the goals. This type of business intelligence often manifests itself in the standard weekly or monthly reports that need to be produced.

2. Forecasting
Many of you have no doubt run into the need for forecasting, and all of you would agree that forecasting is both a science and an art. It is an art because one can never be sure what the future holds. What if competitors decide to spend a large amount of money on advertising? What if the price of oil shoots up to $80 a barrel? At the same time, it is also a science because one can extrapolate from historical data, so it's not a total guess.

3. Dashboard
The primary purpose of a dashboard is to convey the information at a glance. For this audience, there is little, if any, need for drilling down on the data. At the same time, presentation and ease of use are very important for a dashboard to be useful.

4. Multidimensional analysis
Multidimensional analysis is the "slicing and dicing" of the data. It offers good insight into the numbers at a more granular level. This requires a solid data warehousing / data mart backend, as well as business-savvy analysts to get to the necessary data.

5. Finding correlation among different factors
This is diving very deep into business intelligence. Questions asked include "How do different factors correlate to one another?" and "Are there significant time trends that can be leveraged/anticipated?"

Dimensional Data Model

The dimensional data model is most often used in data warehousing systems. This is different from the 3rd normal form, commonly used for transactional (OLTP) type systems. As you can imagine, the same data would be stored differently in a dimensional model than in a 3rd normal form model. To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:

Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.

Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.

Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: a date column, a store column, and a sales amount column.

Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields: one for the unique ID that identifies the quarter, and one or more additional fields that specify how that particular quarter is represented on a report (for example, the first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").
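To make the terms concrete, here is a minimal SQL sketch of a fact table and one lookup (dimension) table; the table and column names are hypothetical and the data types would be adapted to the target database.

    -- Lookup (dimension) table for the Time dimension at daily grain.
    CREATE TABLE dim_date (
        date_key     INTEGER PRIMARY KEY,    -- surrogate key, e.g. 20230115
        calendar_day DATE        NOT NULL,
        month_name   VARCHAR(20) NOT NULL,   -- attribute
        quarter_name VARCHAR(10) NOT NULL,   -- e.g. 'Q1 2023'
        year_number  INTEGER     NOT NULL
    );

    -- Fact table: sales amount by store by product by day.
    CREATE TABLE fact_sales (
        date_key     INTEGER NOT NULL REFERENCES dim_date (date_key),
        store_key    INTEGER NOT NULL,
        product_key  INTEGER NOT NULL,
        sales_amount DECIMAL(12,2) NOT NULL
    );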

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

In designing data models for data warehouses / data marts, the most commonly used schema types are the Star Schema and the Snowflake Schema. Whether one uses a star or a snowflake largely depends on personal preference and business needs. Personally, however, I am partial to snowflakes.

Fact Table Granularity

The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table. This constitutes two steps:

1. Determine which dimensions will be included.
2. Determine where along the hierarchy of each dimension the information will be kept.

The determining factors usually go back to the requirements.

Which Dimensions To Include
Determining which dimensions to include is usually a straightforward process, because business processes will often dictate clearly what the relevant dimensions are. For example, in an off-line retail world, the dimensions for a sales fact table are usually time, geography, and product. This list, however, is by no means a complete list for all off-line retailers. A supermarket with a Rewards Card program, where customers provide some personal information in exchange for a rewards card and the supermarket offers lower prices on certain items for customers who present a rewards card at checkout, will also have the ability to track the customer dimension. Whether the data warehousing system includes the customer dimension will then be a decision that needs to be made, typically based on whether there is a business case to analyze the information at that particular level.

What Level Within Each Dimension To Include
Determining which part of the hierarchy the information is stored along each dimension is a bit more tricky. This is where user requirements (both stated and possibly future) play a major role.

date.e. Say we are a bank with the following fact table: Date Account Current_Balance Profit_Margin . and we have a fact table with the following columns: Date Store Product Sales_Amount The purpose of this table is to record the sales amount for each product in each store on a daily basis. and the data warehousing team needs to fight the urge of the "dumping the lowest level of detail into the data warehouse" symptom. The first example assumes that we are a retailer. it makes sense to use 'hour' as the lowest level of granularity in the time dimension. the sum of Sales_Amount for all 7 days in a week represent the total sales amount for that week. but based on the industry knowledge.) If so. • Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table. the data warehousing team may foresee that certain requirements will be forthcoming that may result in the need of additional details.chalasani In the above example. and prior experience will become invaluable here. and product.. but not the others. it is prudent for the data warehousing team to design the fact table such that lower-level information is included. then 'day' can be used as the lowest level of granularity. because you can sum up this fact along any of the three dimensions present in the fact table -. and only includes what is practically needed. Sales_Amount is the fact. Fact And Fact Table Types Types of Facts There are three types of facts: • Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table. Since the lower the level of detail. Sometimes this can be more of an art than science. Let us use examples to illustrate each of the three types of facts. • Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.Kamesh. This will avoid possibly needing to re-design the fact table in the future. trying to anticipate all future requirements is an impossible and hence futile exercise. On the other hand. looking at how certain products may sell by different hours of the day. If daily analysis is sufficient. Note that sometimes the users will not specify certain requirements. store. the larger the data amount in the fact table. the granularity exercise is in essence figuring out the sweet spot in the tradeoff between detailed level of analysis and data storage. Sales_Amount is an additive fact. will the supermarket wanting to do analysis along at the hourly level? (i. For example. In such cases. In this case.

The second example assumes that we are a bank, with the following fact table:

Date
Account
Current_Balance
Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact: it makes sense to add it up for all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add it up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add it up at either the account level or the day level.

Types of Fact Tables
Based on the above classifications, there are two types of fact tables:

• Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive facts. The first example presented here is a cumulative fact table.

• Snapshot: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Star Schema

In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. Each dimension is represented as a single table. The primary key in each dimension table is related to a foreign key in the fact table. A star schema can be simple or complex: a simple star consists of one fact table, while a complex star can have more than one fact table.

Let's look at an example. Assume our data warehouse keeps store sales data, and the different dimensions are time, store, product, and customer. (Figure: sample star schema, with the sales fact table at the center connected to the time, store, product, and customer lookup tables.) The lines between two tables indicate that there is a primary key / foreign key relationship between the two tables. Note that different dimensions are not related to one another. All measures in the fact table are related to all the dimensions that the fact table is related to; in other words, they all have the same level of granularity.
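A hedged sketch of how a report queries such a star: the fact table is joined to each dimension lookup table on its surrogate key. The table, column, and store names are hypothetical, continuing the earlier sales example.

    -- Star join: total sales by product name and month for one store.
    SELECT p.product_name,
           d.month_name,
           SUM(f.sales_amount) AS total_sales
    FROM   fact_sales  f
    JOIN   dim_date    d ON d.date_key    = f.date_key
    JOIN   dim_product p ON p.product_key = f.product_key
    JOIN   dim_store   s ON s.store_key   = f.store_key
    WHERE  s.store_name = 'Downtown'
    GROUP  BY p.product_name, d.month_name;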

Snowflake Schema

The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimensional table, whereas in a snowflake schema that dimensional table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.

For example, take a Time dimension that consists of 2 different hierarchies:
1. Year → Month → Day
2. Week → Day

We will have 4 lookup tables in a snowflake schema: a lookup table for year, a lookup table for month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is then connected to Day. Week is only connected to Day. (Figure: sample snowflake schema illustrating the above relationships in the Time dimension.)

The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.
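A minimal sketch of how the normalized Time dimension above might look in SQL, with hypothetical table names; the day table carries foreign keys into both of its parent hierarchies.

    -- Snowflaked Time dimension: one lookup table per hierarchy level.
    CREATE TABLE lu_year  (year_key  INTEGER PRIMARY KEY,
                           year_number INTEGER NOT NULL);
    CREATE TABLE lu_month (month_key INTEGER PRIMARY KEY,
                           month_name VARCHAR(20) NOT NULL,
                           year_key  INTEGER NOT NULL REFERENCES lu_year (year_key));
    CREATE TABLE lu_week  (week_key  INTEGER PRIMARY KEY,
                           week_label VARCHAR(20) NOT NULL);
    CREATE TABLE lu_day   (day_key   INTEGER PRIMARY KEY,
                           calendar_day DATE NOT NULL,
                           month_key INTEGER NOT NULL REFERENCES lu_month (month_key),
                           week_key  INTEGER NOT NULL REFERENCES lu_week  (week_key));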

Slowly Changing Dimension

The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a nutshell, it applies to cases where the attribute for a record varies over time. We give an example below.

Christina is a customer of ABC Inc. She first lived in Chicago, Illinois. So, the original entry in the customer lookup table has the following record:

Customer Key   Name        State
1001           Christina   Illinois

At a later date, she moved to Los Angeles, California, in January 2003. How should ABC Inc. now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem.

There are in general three ways to solve this type of problem, and they are categorized as follows:

Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.
Type 3: The original record is modified to reflect the change.

We next take a look at each of the scenarios and what the data model and the data look like for each of them. Finally, we compare and contrast the three alternatives.

In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept. In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

After Christina moved from Illinois to California, the new information replaces the old record, and we have the following table:

Customer Key   Name        State
1001           Christina   California

Advantages: This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

Disadvantages: All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.

Usage: About 50% of the time.

When to use Type 1: Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.

In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key. In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

After Christina moved from Illinois to California, we add the new information as a new row into the table:

Customer Key   Name        State
1001           Christina   Illinois
1005           Christina   California

Advantages: This allows us to accurately keep all historical information.

Disadvantages: This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern. This also necessarily complicates the ETL process.

Usage: About 50% of the time.

When to use Type 2: Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.

In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest: one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active. In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:

• Customer Key
• Name
• Original State
• Current State
• Effective Date

After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):

Customer Key   Name        Original State   Current State   Effective Date
1001           Christina   Illinois         California      15-JAN-2003

Advantages: This does not increase the size of the table, since the existing information is updated. This allows us to keep some part of history.

Disadvantages: Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.

Usage: Type 3 is rarely used in actual practice.

When to use Type 3: Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
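The sketch below shows, in hedged SQL form, roughly how each approach handles Christina's move. The customer table, columns, and surrogate key values follow the hypothetical example above; in practice an ETL tool would typically generate the new surrogate key and effective date.

    -- Type 1: overwrite in place; history is lost.
    UPDATE customer SET state = 'California' WHERE customer_key = 1001;

    -- Type 2: keep the old row and add a new row with its own surrogate key.
    INSERT INTO customer (customer_key, name, state)
    VALUES (1005, 'Christina', 'California');

    -- Type 3: keep original and current values in separate columns
    -- (assumes the table has original_state, current_state, effective_date).
    UPDATE customer
    SET    current_state  = 'California',
           effective_date = DATE '2003-01-15'
    WHERE  customer_key = 1001;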

Conceptual, Logical, and Physical Data Models

The three levels of data modeling (the conceptual data model, the logical data model, and the physical data model) were discussed in prior sections. Here we compare these three types of data models. The table below compares the different features:

Feature                Conceptual   Logical   Physical
Entity Names               ✓            ✓
Entity Relationships       ✓            ✓
Attributes                              ✓
Primary Keys                            ✓          ✓
Foreign Keys                            ✓          ✓
Table Names                                        ✓
Column Names                                       ✓
Column Data Types                                  ✓

(Figure: conceptual model design, logical model design, and physical model design for a single data model.)

We can see that the complexity increases from conceptual to logical to physical. This is why we always first start with the conceptual data model (so we understand at a high level what the different entities in our data are and how they relate to one another), then move on to the logical data model (so we understand the details of our data without worrying about how they will actually be implemented), and finally the physical data model (so we know exactly how to implement our data model in the database of choice). In a data warehousing project, sometimes the conceptual data model and the logical data model are considered as a single deliverable.

Data Integrity

Data integrity refers to the validity of data, meaning data is consistent and correct. In the data warehousing field, we frequently hear the term "Garbage In, Garbage Out." If there is no data integrity in the data warehouse, any resulting report and analysis will not be useful.

In a data warehouse or a data mart, there are three areas where data integrity needs to be enforced:

Database level
We can enforce data integrity at the database level. Common ways of enforcing data integrity include:

• Referential integrity: The relationship between the primary key of one table and the foreign key of another table must always be maintained. For example, a primary key cannot be deleted if there is still a foreign key that refers to this primary key.
• Primary key / Unique constraint: Primary keys and the UNIQUE constraint are used to make sure every row in a table can be uniquely identified.
• Not NULL vs NULL-able: Columns identified as NOT NULL may not have a NULL value.
• Valid values: Only allowed values are permitted in the database. For example, if a column can only have positive integers, a value of '-1' cannot be allowed.

ETL process
For each step of the ETL process, data integrity checks should be put in place to ensure that source data is the same as the data in the destination. The most common checks include record counts and record sums.

Access level
We need to ensure that data is not altered by any unauthorized means either during the ETL process or in the data warehouse. To do this, there need to be safeguards against unauthorized access to data (including physical access to the servers), as well as logging of all data access history. Data integrity can only be ensured if there is no unauthorized access to the data.
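A minimal sketch of enforcing these checks, using hypothetical warehouse tables; the constraint syntax shown is standard SQL and may need small adjustments for a particular database.

    -- Database-level integrity: primary key, NOT NULL, valid values, referential integrity.
    CREATE TABLE dim_product (
        product_key  INTEGER     PRIMARY KEY,
        product_code VARCHAR(20) NOT NULL UNIQUE
    );

    CREATE TABLE fact_sales (
        date_key     INTEGER       NOT NULL,
        product_key  INTEGER       NOT NULL REFERENCES dim_product (product_key),
        sales_amount DECIMAL(12,2) NOT NULL,
        CHECK (sales_amount >= 0)   -- only allowed values are permitted
    );

    -- ETL-level check: compare record counts between source and destination.
    SELECT COUNT(*) AS source_rows FROM stg_sales;
    SELECT COUNT(*) AS target_rows FROM fact_sales;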

Source System
A database, application, file, or other storage facility from which the data in a data warehouse is derived.

Target System
A database, application, file, or other storage facility to which the "transformed source data" is loaded in a data warehouse.

Staging Area
A place where data is processed before entering the warehouse.

Cleansing
The process of resolving inconsistencies and fixing the anomalies in source data, typically as part of the ETL process.

Transformation
The process of manipulating data. Any manipulation beyond copying is a transformation. Examples include cleansing, aggregating, and integrating data from multiple sources.

Transportation
The process of moving copied or transformed data from a source to a data warehouse.

Mapping
The definition of the relationship and data flow between source and target objects.

Metadata
Data that describes data and other structures, such as objects, business rules, and processes. For example, the schema design of a data warehouse is typically stored in a repository as metadata, which is used to generate scripts used to build and populate the data warehouse. A repository contains metadata.

Guidelines to work with Informatica PowerCenter

• Repository: This is where all the metadata information is stored in the Informatica suite. The PowerCenter Client and the Repository Server access this repository to retrieve, store, and manage metadata.

• Repository Server: The repository server takes care of all the connections between the repository and the PowerCenter Client.

• PowerCenter Client: The Informatica client is used for managing users, identifying source and target system definitions, creating mappings and mapplets, creating sessions, running workflows, etc.

• PowerCenter Server: The PowerCenter server does the extraction from the source and then loads the data into targets.

• Designer: Source Analyzer, Mapping Designer, and Warehouse Designer are tools that reside within the Designer wizard. Source Analyzer is used for extracting metadata from source systems. Mapping Designer is used to create mappings between sources and targets; a mapping is a pictorial representation of the flow of data from source to target. Warehouse Designer is used for extracting metadata from target systems, or metadata can be created in the Designer itself.

• Data Cleansing: PowerCenter's data cleansing technology improves data quality by validating, correctly naming, and standardizing address data. For example, a person's address may not be the same in all source systems because of typos and postal code errors, or the city name may not match the address. These errors can be corrected by using the data cleansing process, and standardized data can then be loaded into the target systems (the data warehouse).

• Transformation: Transformations help to transform the source data according to the requirements of the target system. Sorting, filtering, aggregation, and joining are some examples of transformations. Transformations ensure the quality of the data being loaded into the target, and this is done during the mapping process from source to target.

• Workflow Manager: Workflows help to load the data from source to target in a sequential manner. For example, if the fact tables are loaded before the lookup tables, the target system will raise an error since the fact table violates the foreign key validation. To avoid this, workflows can be created to ensure the correct flow of data from source to target.

• Workflow Monitor: This monitor is helpful in monitoring and tracking the workflows created in each PowerCenter Server.

• PowerCenter Connect: This component helps to extract data and metadata from ERP systems like IBM's MQSeries, Peoplesoft, SAP, Siebel, etc., and other third-party applications.

• Power Exchange: Informatica Power Exchange, as a stand-alone service or along with PowerCenter, helps organizations leverage data by avoiding manual coding of data extraction programs. Power Exchange supports batch, real-time, and changed data capture options for mainframe sources (DB2, VSAM, IMS, etc.), midrange sources (AS400 DB2, etc.), relational databases (Oracle, SQL Server, DB2, etc.), and flat files on Unix, Linux, and Windows systems.

• Power Channel: This helps to transfer large amounts of encrypted and compressed data over LAN and WAN, through firewalls, and to transfer files over FTP, etc.

• Metadata Exchange: Metadata Exchange enables organizations to take advantage of the time and effort already invested in defining data structures within their IT environment when used with PowerCenter. For example, an organization may be using data modeling tools such as Erwin, Embarcadero, Oracle Designer, Sybase PowerDesigner, etc. for developing data models. The functional and technical teams will have spent much time and effort creating the data model's data structures (tables, columns, data types, procedures, functions, triggers, etc.). By using Metadata Exchange, these data structures can be imported into PowerCenter to identify source and target mappings, which leverages that time and effort. There is no need for the Informatica developer to create these data structures once again.

• Power Analyzer: Power Analyzer provides organizations with reporting facilities. PowerAnalyzer makes accessing, analyzing, and sharing enterprise data simple and easily available to decision makers. With PowerAnalyzer, an organization can extract, filter, format, and analyze corporate information from data stored in a data warehouse, data mart, operational data store, or other data storage models. PowerAnalyzer works best with a dimensional data warehouse in a relational database, but it can also run reports on data in any table in a relational database that does not conform to the dimensional model. PowerAnalyzer enables organizations to gain insight into business processes and develop business intelligence.
• Super Glue: Superglue is used for loading metadata into a centralized place from several sources. Reports can be run against this Superglue repository to analyze metadata.
• Power Mart: Power Mart is a departmental version of Informatica for building, deploying, and managing data warehouses and data marts. Power Center is used for corporate enterprise data warehouses, and Power Mart is used for departmental data warehouses like data marts. Power Center supports global repositories and networked repositories and can be connected to several sources. Power Mart supports a single repository and can be connected to fewer sources when compared to Power Center. Power Mart can grow into an enterprise implementation, and it aids developer productivity through a codeless environment.

Active Transformation
An active transformation can change the number of rows that pass through it from source to target, i.e. it can eliminate rows that do not meet the condition in the transformation.

Passive Transformation
A passive transformation does not change the number of rows that pass through it, i.e. it passes all rows through the transformation.
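The active/passive distinction is easiest to see with two tiny functions. This is a conceptual sketch in plain Python, not PowerCenter code, and the sample rows are invented: the filter-like function may change the row count, while the expression-like function always emits exactly one output row per input row.

    rows = [{"product": "A", "qty": 5},
            {"product": "B", "qty": 0},
            {"product": "C", "qty": 3}]

    def active_filter(rows, condition):
        """Active: may emit fewer rows than it receives."""
        return [r for r in rows if condition(r)]

    def passive_expression(rows):
        """Passive: one output row per input row; only a derived field is added."""
        return [{**r, "in_stock": r["qty"] > 0} for r in rows]

    print(len(active_filter(rows, lambda r: r["qty"] > 0)))  # 2 rows: count changed
    print(len(passive_expression(rows)))                     # 3 rows: count preserved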

Transformations can be Connected or UnConnected.

Connected Transformation
A connected transformation is connected to other transformations or directly to a target table in the mapping.

UnConnected Transformation
An unconnected transformation is not connected to other transformations in the mapping. It is called within another transformation and returns a value to that transformation.

Following is the list of transformations available in Informatica:
• Aggregator Transformation
• Expression Transformation
• Filter Transformation
• Joiner Transformation
• Lookup Transformation
• Normalizer Transformation
• Rank Transformation
• Router Transformation
• Sequence Generator Transformation
• Stored Procedure Transformation
• Sorter Transformation
• Update Strategy Transformation
• XML Source Qualifier Transformation
• Advanced External Procedure Transformation
• External Procedure Transformation

Aggregator Transformation
Aggregator transformation is an Active and Connected transformation. This transformation is useful to perform calculations such as averages and sums (mainly to perform calculations on multiple rows or groups), for example to calculate the total of daily sales or the average of monthly or yearly sales. Aggregate functions such as AVG, FIRST, COUNT, PERCENTILE, MAX, SUM etc. can be used in the Aggregator transformation.

Expression Transformation
Expression transformation is a Passive and Connected transformation. It can be used to calculate values in a single row before writing to the target, for example to calculate the discount for each product, to concatenate first and last names, or to convert a date to a string field.
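For a concrete picture of these two transformations, the sketch below (plain Python with invented column names, not Designer configuration) groups sales rows to compute SUM and AVG, then derives a discounted price and a concatenated name one row at a time.

    from collections import defaultdict

    sales = [{"region": "East", "amount": 100.0},
             {"region": "East", "amount": 300.0},
             {"region": "West", "amount": 200.0}]

    # Aggregator-style logic: works on groups of rows (SUM, AVG, ...).
    by_region = defaultdict(list)
    for row in sales:
        by_region[row["region"]].append(row["amount"])
    for region, amounts in by_region.items():
        print(region, "SUM =", sum(amounts), "AVG =", sum(amounts) / len(amounts))

    # Expression-style logic: works on one row at a time.
    customers = [{"first": "Jane", "last": "Doe", "price": 80.0, "discount": 0.1}]
    for row in customers:
        full_name = row["first"] + " " + row["last"]        # concatenation
        net_price = row["price"] * (1 - row["discount"])    # per-row calculation
        print(full_name, net_price)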

Filter Transformation
Filter transformation is an Active and Connected transformation. It can be used to filter rows in a mapping that do not meet the condition, for example to find all the employees who are working in Department 10, or to find the products that fall in the price range between $500 and $1000.

Joiner Transformation
Joiner transformation is an Active and Connected transformation. It can be used to join two sources coming from two different locations or from the same location, for example to join a flat file and a relational source, to join two flat files, or to join a relational source and an XML source. In order to join two sources, there must be at least one matching port. While joining two sources, it is a must to specify one source as master and the other as detail. The Joiner transformation supports the following types of joins (illustrated in the sketch after this section):
• Normal
• Master Outer
• Detail Outer
• Full Outer
A normal join discards all the rows of data from the master and detail sources that do not match, based on the join condition. A master outer join discards all the unmatched rows from the master source and keeps all the rows from the detail source plus the matching rows from the master source. A detail outer join keeps all rows of data from the master source and the matching rows from the detail source; it discards the unmatched rows from the detail source. A full outer join keeps all rows of data from both the master and detail sources.
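The four join types can be paraphrased with ordinary inner/outer join logic. The sketch below is illustrative Python, not how the Joiner is configured in the Designer; the sample rows and the matching port "id" are invented. It shows which unmatched rows each option keeps.

    master = [{"id": 1, "name": "Widget"}, {"id": 2, "name": "Gadget"}]
    detail = [{"id": 1, "amount": 100}, {"id": 3, "amount": 50}]

    def join(master, detail, keep_unmatched_master=False, keep_unmatched_detail=False):
        """Pair master and detail rows on the matching port 'id'."""
        out = []
        master_by_id = {m["id"]: m for m in master}
        detail_ids = {d["id"] for d in detail}
        for d in detail:
            m = master_by_id.get(d["id"])
            if m is not None:
                out.append({**m, **d})                 # matching rows
            elif keep_unmatched_detail:
                out.append({**d, "name": None})        # unmatched detail kept
        if keep_unmatched_master:
            for m in master:
                if m["id"] not in detail_ids:
                    out.append({**m, "amount": None})  # unmatched master kept
        return out

    print(join(master, detail))                              # Normal
    print(join(master, detail, keep_unmatched_detail=True))  # Master Outer
    print(join(master, detail, keep_unmatched_master=True))  # Detail Outer
    print(join(master, detail, True, True))                  # Full Outer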

Lookup Transformation
Lookup transformation is Passive, and it can be either Connected or UnConnected. It is used to look up data in a relational table, view, or synonym. The lookup definition can be imported either from source or from target tables.

Differences between Connected and UnConnected Lookup transformations:
• A Connected lookup receives input values directly from the mapping pipeline, whereas an UnConnected lookup receives values from a :LKP expression in another transformation.
• A Connected lookup returns multiple columns from the same row, whereas an UnConnected lookup has one return port and returns one column from each row.
• A Connected lookup supports user-defined default values, whereas an UnConnected lookup does not support user-defined default values.

Rank Transformation
Rank transformation is an Active and Connected transformation. It is used to select the top or bottom rank of data, for example to select the top 10 regions where the sales volume was highest, or to select the 10 lowest priced products.

Normalizer Transformation
Normalizer transformation is an Active and Connected transformation. It is used mainly with COBOL sources, where most of the time data is stored in de-normalized format. The Normalizer transformation can also be used to create multiple rows from a single row of data.

Router Transformation
Router transformation is an Active and Connected transformation. It is similar to the Filter transformation; the only difference is that the Filter transformation drops the data that does not meet the condition, whereas the Router has an option to capture the data that does not meet the condition. It is useful to test multiple conditions, and it has input, output and default groups. For example, if we want to route data like where State=Michigan, State=California, State=New York and all other states, it is easy to route the data to different tables.
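The Router behaviour can be sketched in a few lines of plain Python (the State example comes from the text above; the group names and sample rows are invented). Rows are tested against every group condition, and anything that matches no group lands in the default group instead of being dropped, which is the key difference from a Filter.

    rows = [{"name": "A", "state": "Michigan"},
            {"name": "B", "state": "California"},
            {"name": "C", "state": "Texas"}]

    # Output groups and their conditions, plus a DEFAULT group for the rest.
    groups = {
        "MICHIGAN":   lambda r: r["state"] == "Michigan",
        "CALIFORNIA": lambda r: r["state"] == "California",
        "NEW_YORK":   lambda r: r["state"] == "New York",
    }

    routed = {name: [] for name in groups}
    routed["DEFAULT"] = []   # unlike a Filter, nothing is silently discarded

    for row in rows:
        matched = False
        for name, condition in groups.items():
            if condition(row):
                routed[name].append(row)
                matched = True
        if not matched:
            routed["DEFAULT"].append(row)

    print(routed)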

Sequence Generator Transformation
Sequence Generator transformation is a Passive and Connected transformation. It is used to create unique primary key values, to cycle through a sequential range of numbers, or to replace missing keys. It has two output ports to connect to transformations; by default it has the two fields CURRVAL and NEXTVAL (you cannot add ports to this transformation). The NEXTVAL port generates a sequence of numbers when connected to a transformation or target. CURRVAL is the NEXTVAL value plus one, or NEXTVAL plus the Increment By value.

Stored Procedure Transformation
Stored Procedure transformation is a Passive transformation, and it can be Connected or UnConnected. It is useful to automate time-consuming tasks, and it is also used for error handling, to drop and recreate indexes, to determine the space available in the database, for specialized calculations, etc. The stored procedure must exist in the database before creating a Stored Procedure transformation, and the stored procedure can exist in a source, a target, or any database with a valid connection to the Informatica Server. A stored procedure is an executable script with SQL statements, control statements, user-defined variables and conditional statements.

Sorter Transformation
Sorter transformation is a Connected and Active transformation. It allows sorting data either in ascending or descending order according to a specified field. It can also be configured for case-sensitive sorting and to specify whether the output rows should be distinct.

Source Qualifier Transformation
Source Qualifier transformation is an Active and Connected transformation. When adding a relational or a flat file source definition to a mapping, it is a must to connect it to a Source Qualifier transformation. The Source Qualifier performs various tasks such as overriding the default SQL query, filtering records, joining data from two or more tables, etc.

Update Strategy Transformation
Update Strategy transformation is an Active and Connected transformation. It is used to update data in a target table, either to maintain a history of the data or of recent changes. You can specify how to treat source rows in the table: insert, update, delete or data driven.
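A data-driven update strategy amounts to flagging each incoming row before it reaches the target. The sketch below is illustrative Python only (the customer table and the deleted marker are invented; real mappings use Informatica's own row flags): each source row is marked for insert, update or delete depending on what already exists in the target.

    # Illustrative data-driven update strategy.
    target = {101: {"name": "Jane", "city": "Austin"},
              103: {"name": "Ann",  "city": "Leeds"}}
    source = [
        {"id": 101, "name": "Jane", "city": "Dallas"},   # changed          -> UPDATE
        {"id": 102, "name": "Ravi", "city": "Chennai"},  # new              -> INSERT
        {"id": 103, "deleted": True},                    # removed upstream -> DELETE
    ]

    def flag(row):
        if row.get("deleted"):
            return "DELETE"
        return "UPDATE" if row["id"] in target else "INSERT"

    for row in source:
        action = flag(row)
        if action == "INSERT":
            target[row["id"]] = {"name": row["name"], "city": row["city"]}
        elif action == "UPDATE":
            target[row["id"]].update(name=row["name"], city=row["city"])
        else:
            target.pop(row["id"], None)   # DELETE: remove if present
        print(action, row["id"])

    print(target)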

XML Source Qualifier Transformation
XML Source Qualifier is a Passive and Connected transformation. It is used only with an XML source definition, and it represents the data elements that the Informatica Server reads when it executes a session with XML sources.

Advanced External Procedure Transformation
Advanced External Procedure transformation is an Active and Connected transformation. It operates in conjunction with procedures, which are created outside of the Designer interface to extend PowerCenter/PowerMart functionality. It is useful for creating external transformation applications, such as sorting and aggregation, which require all input rows to be processed before emitting any output rows.

External Procedure Transformation
External Procedure transformation is an Active transformation, and it can be Connected or UnConnected. Sometimes the standard transformations, such as the Expression transformation, may not provide the functionality that you want. In such cases an External Procedure is useful to develop complex functions within a dynamic link library (DLL) or UNIX shared library, instead of creating the necessary Expression transformations in a mapping. A loose illustration of the shared-library idea follows after the differences below.

Differences between Advanced External Procedure and External Procedure transformations:
• External Procedure returns a single value, whereas Advanced External Procedure returns multiple values.
• External Procedure supports COM and Informatica procedures, whereas AEP supports only Informatica procedures.
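The shared-library idea behind the External Procedure transformation can be illustrated outside Informatica with Python's ctypes. This is a loose analogy only: real external procedures are registered through the Designer and follow Informatica's procedure interface, while the sketch simply loads the standard C math library as a stand-in for a custom DLL or UNIX shared library (library lookup may vary by platform).

    import ctypes
    import ctypes.util

    # Locate and load a shared library at run time (libm stands in for a
    # custom DLL / UNIX shared library holding the "complex function").
    lib_path = ctypes.util.find_library("m") or ctypes.util.find_library("c")
    libm = ctypes.CDLL(lib_path)

    # Declare the signature of the external routine we want to call.
    libm.sqrt.restype = ctypes.c_double
    libm.sqrt.argtypes = [ctypes.c_double]

    # Each input value is handed to the external routine and the single
    # return value flows back, much like an External Procedure call per row.
    rows = [4.0, 9.0, 16.0]
    print([libm.sqrt(value) for value in rows])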
