
On TV, a low-resolution photo is never a problem for crime scene investigators.

Load it into a computer, zoom in on a reflection, click "enhance", and presto! The criminal is revealed.

In the world of data warehousing, we cannot rely on the CSI effect. If we start with low-resolution data, the detail is lost forever. Eventually, someone will ask a question that requires more detail, and it simply won't be there to answer the question. A few simple guidelines will help you avoid this unfortunate situation.

A Common Problem

Omission of detail is one of the biggest frustrations caused by legacy designs. It is the number three problem I encounter in design reviews (after failure to use surrogate keys and failure to plan for slowly changing dimensions).

The reason for this is simple: people often design their schema to support current requirements. Unfortunately, this does not work in data warehousing. Analytic requirements constantly change. As a designer, your job is to produce solutions that will answer questions that are not yet known.

This may sound like an impossible task, but it's not. It's one of the strong suits of dimensional design. Three guidelines for high-resolution design will help future-proof your solution.

1. Match Grain to Source Data, Not Requirements

When you're designing a fact table, you need to establish its grain. A statement of grain defines what is represented by a row in a fact table. This is the "resolution" of your measurements. (For a refresher on grain, see the post Rule 1: State Your Grain.)

Don't set grain at the level of detail that meets requirements. Set it at the level of detail at which data is available.

For example, suppose you are designing a fact table that measures sales. Someone asks for daily totals of sales in dollars. Don't assume daily totals of sales are sufficient. Eventually someone will want to go deeper, perhaps looking for products that are bought together. This requires setting your grain at the order line level of detail.

Once you have identified the process a fact table represents, look at the resolution of the source data. That should guide your decisions on grain.

Of course, this needs to be kept within reason. If the available level of detail would result in fact tables that are too large or expensive to manage, you might need to do some summarization. But be careful; you'll be working with low-resolution data in the future.

2. Capture All the Facts that Fit

Remember that each fact table describes a process. When you fill in the facts, don't just include the ones that people asked for. Include all available facts that describe that process.

Returning to our sales data, this means you should not stop at sales dollars. What else is known about sales? Quantities, tax, and other information are likely to be readily available, and many operational systems may also offer a rudimentary concept of cost or margin. Include them!

3. Fill Out Dimensions

The last guideline for a high-resolution design involves the dimensions. You can probably guess what it is: when you design a dimension table, don't just include attributes someone asked for. Find what is available in the source system, and include it all.

A simple "Product Name" and "Product ID" might be all that's needed to support sales by product. But what else do source tables hold that describes your products? Manufacturers? Suppliers? Colors? Sizes? Weights? Include all this in your designs.

Remember, dimensions are the source of all context in our reports. The more we fill them out, the more kinds of questions people will be able to ask.

Initial Design is the Right Time

Match your grain, facts and dimensions to the available data, even if that goes beyond the requirements.

Of course, you can recover from errors in any of these areas. This will require modifying your design, ETL process, and so forth. When it's decided that additional detail is needed going forward, you can make these changes.

But it is often difficult to add detail to historic data already in the data warehouse. For example, it may require accessing backup data. Worse, it may put ETL developers through the difficult process of performing slow change processing in reverse.

Think high-resolution today, and you can avoid these problems tomorrow.

Image Credit: Alan Cleaver via Creative Commons 2.0

Posted by Chris Adamson at 8:10 AM Labels: Basics
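To make the first guideline (match grain to source data) concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (order_line_facts, order_id, product, sale_date, sales_dollars) and the sample rows are hypothetical, chosen only for illustration. It suggests that daily totals can still be derived from order-line grain, while a "bought together" question can only be answered if the order lines were kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_line_facts (
        order_id      INTEGER,
        product       TEXT,
        sale_date     TEXT,
        sales_dollars REAL
    )
""")
conn.executemany(
    "INSERT INTO order_line_facts VALUES (?, ?, ?, ?)",
    [
        (1, "tent", "2010-05-01", 120.0),
        (1, "stove", "2010-05-01", 40.0),
        (2, "tent", "2010-05-01", 120.0),
        (2, "stove", "2010-05-01", 40.0),
        (3, "lantern", "2010-05-02", 25.0),
    ],
)

# The stated requirement (daily totals) is still easy to meet
# by summarizing the detailed rows.
daily_totals = conn.execute("""
    SELECT sale_date, SUM(sales_dollars)
    FROM order_line_facts
    GROUP BY sale_date
    ORDER BY sale_date
""").fetchall()

# A later question that a daily summary table could not answer:
# which pairs of products appear on the same order?
bought_together = conn.execute("""
    SELECT a.product, b.product, COUNT(*)
    FROM order_line_facts a
    JOIN order_line_facts b
      ON a.order_id = b.order_id AND a.product < b.product
    GROUP BY a.product, b.product
""").fetchall()

print(daily_totals)
print(bought_together)
```

Had the fact table been loaded at daily grain, the second query would have nothing to work with, because order_id and product would already have been summarized away.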
TUESDAY, APRIL 27, 2010

Basics: What is a Dimensional Model


In a recent post, I asserted that a star schema is the implementation of a dimensional model in a relational database. Many readers may have missed this point, because it was embedded in a conversation about the term normalization. Here, then, is a clutter-free discussion of the dimensional model.

Activities and Conditions

A dimensional model is a structured framework for measurement. Usually, a dimensional model describes a process or activity. For example, a retailer might design a dimensional model of sales transactions.

A dimensional model may also describe conditions, as measured at predefined intervals. A water utility might design a dimensional model of reservoir levels, measured daily.

Facts and Dimensions

A dimensional model describes measurement of a process (or conditions) through facts and dimensions.

Facts are the measurements. For the retailer's sales activities, facts that measure each sales transaction include the quantity sold and the price paid. For the utility's reservoir conditions, a single fact is measured: gallons on hand.

Facts are not useful without context. For example, "one hundred units sold" is a measurement of quantity sold, but it has no context. One hundred of what? When were they sold? Where?

Dimensions are used to describe the context of facts. For the retailer, dimension values provide context for each measurement. These include the date and time of each purchase, the product sold, and the store in which it was purchased. For the utility's model of reservoir status, dimensions include the date the measurement was taken, the reservoir or facility where the measurement was taken, and the inspector who measured it.

If you can define a process, facts and dimensions, you have the core of a dimensional model. Of course, there is a bit more to it than that (the concept of grain is also essential), but that's the basic idea.

Uses

The dimensional model is most famous for its role as the basis for a star schema. Before we get to that, though, it's important to recognize that it has other uses.

First, a dimensional model is an excellent way to describe requirements for an analytic system. A model of process measurement is far more efficient and flexible than a list of specific business questions. A single dimensional model may be able to answer thousands of questions, including some that have not yet been thought of.

A dimensional model is also an excellent tool for planning your data warehouse strategy, and for managing the scope of implementation projects. This is particularly true when the model is translated to a database design and linked to sources of data. (I've written about this before.)

Stars, Snowflakes, Cubes

A dimensional model can serve as the basis for a database design. This, of course, is what it is famous for. (If you can even say it is famous, that is.)

• When a dimensional model is implemented in a relational database, it is called a star schema (or sometimes a snowflake schema).
• When implemented in a multidimensional database, it is called a cube. (I touched on this previously, as well.)

If you work with any of these things, you are working with a dimensional model.

- Chris
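Because a dimensional model is a description of measurement rather than a storage structure, it can be written down without reference to any database. The following is a minimal sketch in Python; the DimensionalModel class, the questions helper, the attribute names, and the grain statements are all hypothetical and serve only to restate the retailer and water utility examples above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DimensionalModel:
    """A process (or condition), its measurements, and their context."""
    process: str
    grain: str
    facts: List[str] = field(default_factory=list)
    dimensions: List[str] = field(default_factory=list)


# The retailer's model of sales transactions.
retail_sales = DimensionalModel(
    process="sales transactions",
    grain="one row per product sold on a transaction",  # assumed grain
    facts=["quantity_sold", "price_paid"],
    dimensions=["date", "time", "product", "store"],
)

# The utility's model of reservoir conditions, measured daily.
reservoir_levels = DimensionalModel(
    process="reservoir levels",
    grain="one row per reservoir per day",  # assumed grain
    facts=["gallons_on_hand"],
    dimensions=["date", "reservoir", "inspector"],
)


def questions(model: DimensionalModel) -> List[str]:
    """List the simplest 'fact by dimension' questions a model supports."""
    return [f"{fact} by {dim}" for fact in model.facts for dim in model.dimensions]


print(questions(retail_sales))
```

Even this toy enumeration hints at why a model is a more compact statement of requirements than a list of questions: combinations of dimensions, filters, and levels of summarization multiply quickly.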

Image: Slide together Polyhedra by fdecomite. Licensed under Creative Commons 2.0

Posted by Chris Adamson at 1:00 PM Labels: Basics
TUESDAY, MARCH 2, 2010

Just the Facts?


A reader asks:

I'm having a problem in picking out facts. For example, are all dates facts -- start date, end date and DoB? What about personal data like post code, job title, status (active/left) etc.? Would you say that if you can't do math on it, it's not a fact?
- [Name Withheld], UK

Good question. It allows me to cover the basics, as well as touch on some advanced topics. First, the basics.

Facts vs. Dimensions

Facts are measurements. I really prefer that word, because it is a bit more descriptive. But the rest of the world already knows "facts". Examples that are clearly measurements:

• order dollars
• gross pay
• quantity sold
• account balance

In the original question, all the attributes mentioned are dimensions. (Dates, personal data, and so forth.) Dimensions are used to give facts context. You use dimension values to filter or group facts. Here are some in action:

• order dollars (fact) by order date (dimension)
• gross pay (fact) by department code (dimension)
• quantity sold (fact) for January (dimension value)
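The examples above map directly onto grouping and filtering in SQL. Here is a minimal sketch using Python's built-in sqlite3 module. For brevity it keeps the dimension values in the same table as the facts, which a real star schema would not do; the table name, column names, and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_facts (
        order_date    TEXT,     -- dimension
        order_month   TEXT,     -- dimension
        order_dollars REAL,     -- fact
        quantity_sold INTEGER   -- fact
    )
""")
conn.executemany(
    "INSERT INTO order_facts VALUES (?, ?, ?, ?)",
    [
        ("2010-01-15", "January", 100.0, 4),
        ("2010-01-15", "January", 50.0, 1),
        ("2010-02-03", "February", 75.0, 3),
    ],
)

# "order dollars (fact) by order date (dimension)": the dimension groups the fact.
dollars_by_date = conn.execute("""
    SELECT order_date, SUM(order_dollars)
    FROM order_facts
    GROUP BY order_date
    ORDER BY order_date
""").fetchall()

# "quantity sold (fact) for January (dimension value)": the dimension filters the fact.
january_quantity = conn.execute("""
    SELECT SUM(quantity_sold)
    FROM order_facts
    WHERE order_month = 'January'
""").fetchone()[0]

print(dollars_by_date)
print(january_quantity)
```

In both queries the fact sits inside the aggregate function, while the dimension appears in the GROUP BY or WHERE clause.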

You can usually "do math" on facts. The most common example is adding them together, or aggregating them, as might be done in the examples above. Don't use that as a criterion to identify facts, though. Many numbers are not facts. And, to make matters more complicated, some non-numbers are facts.

Many Numbers are Not Facts

Some numbers are really dimensions. You can tell because we use them to give context to facts. A numeric department code is obviously a dimension. We wouldn't normally add the values together, but we might use them to break out a measurement -- for example: budget (fact) by department (dimension).

Less obvious are things like "unit price". It sounds like it might be a fact. After all, we can certainly add up money, right? Actually, when we add up sales, we don't add up unit prices, we add up "extended prices". Extended price is a fact. It is equivalent to unit price times quantity.

Look again, and you will also see that unit price behaves like the other dimensions we looked at. We can use it to group or filter facts. "How many did we sell at a unit price of 10 cents?" That question filters a fact (quantity sold) by a dimension value (10 cent unit price).

Some Facts are Not Numbers

There are also facts that are not numbers. These are rare, but they do occur. Test results or grades are the most common example. You can't do math on "pass" or "fail," but they are measurements.

Facts like this are sometimes called "text facts." A test_result fact that can take on the values "pass" or "fail" might be stored in a fact table that contains a row for each student taking a particular test on a particular day.

If they take on a relatively low number of possible values, consider converting text facts into discrete facts. In this case, we might replace test_result with two facts: number_passed and number_failed. For each row in the fact table, one of these will take on the value 1, the other the value 0. This may seem wasteful, but now it is easy to get a sum(number_passed) or sum(number_failed).

Some Stars Have No Facts

If you don't have any facts, you can still have a fact table. Sometimes, the mere occurrence of an event is all you need to know about. For example, you might use a fact table to track the fact that an employee was present on a given day.

Factless fact tables can also track conditions. For example, each year, circumstances might make employees eligible to participate in particular programs. Aside from the dimensions (Employee, Year and Program) there might not be any facts to record. That's OK; you can still count up rows that meet various conditions.

- Chris

Do you have a question? Send it in. I answer all my email, though sometimes it takes a while.

Image: Jack Webb as Joe "Just the Facts" Friday, from the Public Domain
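The two less common cases described in this post, text facts converted to discrete facts and factless fact tables, can be sketched with Python's built-in sqlite3 module. The number_passed and number_failed columns come from the text; the table names (test_facts, eligibility_facts), the remaining columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Text fact converted to two discrete facts: exactly one of them is 1 per row.
conn.execute("""
    CREATE TABLE test_facts (
        student       TEXT,
        test          TEXT,
        test_date     TEXT,
        number_passed INTEGER,
        number_failed INTEGER
    )
""")
raw_results = [
    ("ann", "algebra", "2010-03-01", "pass"),
    ("bob", "algebra", "2010-03-01", "fail"),
    ("cy", "algebra", "2010-03-01", "pass"),
]
for student, test, test_date, result in raw_results:
    conn.execute(
        "INSERT INTO test_facts VALUES (?, ?, ?, ?, ?)",
        (student, test, test_date, int(result == "pass"), int(result == "fail")),
    )
passed, failed = conn.execute(
    "SELECT SUM(number_passed), SUM(number_failed) FROM test_facts"
).fetchone()

# Factless fact table: only dimension values, one row per eligibility.
conn.execute(
    "CREATE TABLE eligibility_facts (employee TEXT, year INTEGER, program TEXT)"
)
conn.executemany(
    "INSERT INTO eligibility_facts VALUES (?, ?, ?)",
    [
        ("ann", 2010, "tuition"),
        ("bob", 2010, "tuition"),
        ("ann", 2010, "wellness"),
        ("cy", 2009, "tuition"),
    ],
)
# With no facts to sum, the analysis counts rows that meet a condition.
eligible_2010 = conn.execute("""
    SELECT program, COUNT(*)
    FROM eligibility_facts
    WHERE year = 2010
    GROUP BY program
    ORDER BY program
""").fetchall()

print(passed, failed)
print(eligible_2010)
```

The discrete facts behave like any other additive fact under SUM, and the factless table is queried with COUNT(*) instead.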

Posted by Chris Adamson at 1:43 PM Labels: Basics, Design Techniques, People, Q and A
WEDNESDAY, FEBRUARY 3, 2010

Dimensional and Relational: Not Opposites


A common misconception holds that the terms dimensional and relational are opposites. They are not. The word "dimensional" describes a design method. The word "relational" describes a data storage technology.
• A dimensional model is a design approach that describes a process in terms of measurements (known as facts) and their context (dimensions)
• A star schema is a dimensional model implemented using relational storage technology -- that is, in a relational database (RDBMS)
• A cube is a dimensional model implemented using multi-dimensional storage technology -- that is, in a multidimensional database (MDB)

This simple diagram illustrates these concepts:

As you can see from the diagram, a star schema is both relational and dimensional. So is a snowflake schema.

By the way, don't let this confuse you: most modern-day DBMS products accommodate both kinds of storage.

Posted by Chris Adamson at 9:32 AM Labels: Basics
WEDNESDAY, DECEMBER 9, 2009

Rule 1: State Your Grain


Make sure you have a statement of grain for each fact table or cube in your dimensional design.

I receive lots of questions from people who are working through an issue with a star or cube. Most of the time, I must counter with a question of my own: What is the grain? Without this basic information, it is usually impossible to comment on whatever design issue the person is facing.

Being able to state grain is important, and not just for its value as a conversation starter. Here's a brief look at what grain is, how to define it, and what happens if you don't.

What is Grain

A statement of grain identifies what is represented by a single, granular row in a fact table (or an un-aggregated measure in a cube). Each and every row in the fact table should meet this definition, with no wiggle-room.

When grain is not explicitly defined, or is defined in an ambiguous way, all manner of problems may arise. In addition to hampering your ability to talk about the design with someone else, ill-defined grain can cause severe technical challenges in the reporting process. It may even lead to reports that are just plain wrong.

Defining Grain

There are two ways to state the grain of a fact table. The first way is to use business language, referencing a specific, well-understood artifact of the activity described by the star. For example, a star that describes orders may have the following grain:

Order measurements at the order-line level of detail.

That sums things up pretty well. Each row of the fact table corresponds to a single order line.

Sometimes it is not easy to state grain in this manner. When that's the case, you use the dimensions in a star to indicate what each unique row in the fact table represents. For example, a star that tracks the processing milestones of mortgage applications might have the following grain:

Processing measurements by application and status.

This fact table will have a new row each time an application (one dimension) undergoes a change in status (another dimension).

When using dimensional terms to define grain, do not simply rattle off all the dimensions present in the star. Instead, list only those that are necessary to define a unique row. In the mortgage status change example, it is not necessary to mention that the star will also contain dimensions for date, customer, mortgage officer, mortgage product, and so forth.

Business definitions of grain are usually used for transaction fact tables. Dimensional grain definitions are commonly used for snapshots, accumulating snapshots, derived schemas and aggregates.

Fuzzy Grain

Poorly defined grain can lead to trouble. Ill-defined grain can mask a situation where a fact table is actually being used to track two or more processes. It can also lead to situations in which the fact table contains two or more levels of aggregation.

In the former case, single-process reporting will be hampered. BI developers will be bending over backwards to focus on the relevant subset of data. In the latter case, double-counting, triple-counting or worse is possible.

This is why "state your grain" is rule #1 in any dimensional design.

- Chris
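The double-counting risk of fuzzy grain is easy to reproduce. Here is a minimal sketch using Python's built-in sqlite3 module, in which a fact table meant to hold order lines also contains a daily summary row. The table name, column names, and sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_facts (
        sale_date     TEXT,
        order_line    TEXT,   -- NULL marks a daily summary row
        sales_dollars REAL
    )
""")
conn.executemany(
    "INSERT INTO sales_facts VALUES (?, ?, ?)",
    [
        ("2009-12-01", "1001-1", 60.0),   # order-line grain
        ("2009-12-01", "1001-2", 40.0),   # order-line grain
        ("2009-12-01", None, 100.0),      # daily total: a second level of aggregation
    ],
)

# A report writer who trusts the table sums everything and double-counts.
naive_total = conn.execute(
    "SELECT SUM(sales_dollars) FROM sales_facts"
).fetchone()[0]

# Getting the right answer requires knowing about, and filtering out,
# the rows that violate the stated grain.
order_line_total = conn.execute(
    "SELECT SUM(sales_dollars) FROM sales_facts WHERE order_line IS NOT NULL"
).fetchone()[0]

print(naive_total, order_line_total)
```

If every row met a single statement of grain, the first query would be safe and the extra filter would be unnecessary.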

Image by GravityX9 licensed under Creative Commons 2.0

Posted by Chris Adamson at 12:09 PM Labels: Basics


WEDNESDAY, MAY 20, 2009

Do I really need Surrogate Keys?


Here is the #1 most frequently asked question I receive:

Q: Do I really need surrogate keys?

A: You absolutely must have a unique identifier for each dimension table, one that does not come from a source system. A surrogate key is the best way to handle this, but there are other possibilities. The case for the surrogate key is entirely pragmatic.

Read on for a full explanation.

Dimensions Need their Own Unique Identifier

It is crucial that the dimensional schema be able to handle changes to source data in whatever manner makes most sense from an analytic perspective. This may be different from how the change is handled in the source.

For this reason, every dimension table needs its own unique identifier -- not one that comes from a source system. (For more on handling changes, start with this post.)

A surrogate key makes the best unique identifier. It is simply an integer value, holding no meaning for end users. A single, compact column, it keeps fact table rows small and makes SQL easy to read and write. It is simple to manage during the ETL process.

Compound Keys Work, But Why Take that Route?

An alternative is to supplement a unique identifier from the source system (also known as a natural key) with a sequence number. This results in a compound key, or multi-part key.

This kind of compound key also allows the dimension to handle changes differently than the source does. But there are several disadvantages:
• Fact table rows become larger, as they must include the multi-part key for each dimension
• The ETL process must manage a sequence for each natural key value, rather than a single sequence for a surrogate key
• SQL becomes more difficult to read, write or debug
• Multi-part dimension keys can disrupt star join optimizers
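The first and third disadvantages can be seen side by side in a small sketch using Python's built-in sqlite3 module. The same customer dimension is referenced by two hypothetical fact tables: one carries a single surrogate key, the other a compound key made of the natural key plus a sequence number. All table names, column names, and values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY,  -- surrogate key
        customer_id  TEXT,                 -- natural key from the source
        version_seq  INTEGER,              -- sequence used by the compound-key alternative
        state        TEXT
    );
    CREATE TABLE sales_by_surrogate (customer_key INTEGER, sales_dollars REAL);
    CREATE TABLE sales_by_compound (
        customer_id TEXT, version_seq INTEGER, sales_dollars REAL
    );
    INSERT INTO customer_dim VALUES (1, 'C-8472', 1, 'NY'), (2, 'C-8472', 2, 'CA');
    INSERT INTO sales_by_surrogate VALUES (1, 10.0), (2, 25.0);
    INSERT INTO sales_by_compound VALUES ('C-8472', 1, 10.0), ('C-8472', 2, 25.0);
""")

# Surrogate key: one narrow column in the fact table, one join condition.
surrogate_result = conn.execute("""
    SELECT d.state, SUM(f.sales_dollars)
    FROM sales_by_surrogate f
    JOIN customer_dim d ON f.customer_key = d.customer_key
    GROUP BY d.state
    ORDER BY d.state
""").fetchall()

# Compound key: two columns per dimension in the fact table, and every
# join must list every part of the key.
compound_result = conn.execute("""
    SELECT d.state, SUM(f.sales_dollars)
    FROM sales_by_compound f
    JOIN customer_dim d
      ON f.customer_id = d.customer_id
     AND f.version_seq = d.version_seq
    GROUP BY d.state
    ORDER BY d.state
""").fetchall()

print(surrogate_result)
print(compound_result)
```

Both queries return the same answer; the difference is in the width of each fact row and the amount of join logic, which grows with every dimension in the star.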

So: a compound key takes more space, is not any easier, and may disrupt performance. Why bother? (By the way, many source system identifiers are already compound keys. Adding a sequence number will make them even larger!)

Sometimes it is suggested that the natural key be supplemented with a date, rather than a sequence. This may simplify ETL slightly, but the rest of the drawbacks remain. Plus, dates are even bigger than a sequence number. Worse, the date will appear in the fact table. That is sure to lead to trouble! (This is not to say that datestamps on dimension rows are a bad thing. To the contrary, they can be quite useful. They just don't work well as part of a compound key.)

That's the basic answer. I'll respond to some common follow-up questions over the coming weeks.

- Chris

Posted by Chris Adamson at 12:34 PM Labels: Basics, Q and A, Slow Changes
TUESDAY, OCTOBER 9, 2007

For Slowly Changing Dimensions, "Change" is Relative

There's a difference between the way we think about Slowly Changing Dimensions and the way we document them. In this post, I'll highlight this difference by examining the two most common slow change techniques.

The term "slowly changing dimension" originated with Ralph Kimball, who identified three techniques for dealing with changed data. Commonly abbreviated as SCDs, these techniques are applied in any form of dimensional design, regardless of the data warehouse architecture.

In practice, there is a subtle but important difference between the way we think about these changes and the way we describe them in a dimensional design. This sometimes leads to confusion.

Before I explain this important distinction, let me review the difference between surrogate and natural keys, and describe the two most common SCD techniques. (Future posts will look at other slow change techniques.)

Natural Keys and Surrogate Keys

We usually think of dimension tables in a star schema as corresponding to something in a source system. For example, each row in a customer dimension table relates to a single customer in a source system. Each column is loaded from one or more sources, based on a set of rules.

The link back to a source system is preserved in the form of a natural key -- usually a unique identifier in a source system, such as a customer_id. But the star schema design does not rely on this natural key, or business key, to uniquely identify rows in dimension tables. Instead, a surrogate key is introduced. This surrogate key gives the dimensional design flexibility to handle changes differently than they are handled in source systems, while preserving the ability to perform joins using a single column.

Type 1 and Type 2 Slow Changes

Slowly changing dimension techniques determine how the dimensional model will respond to changes in the source system. If the customer with id 8472 changes, what do we do with that change? Alert readers may already be concerned about what I mean by "change" here, but let's first recap the two most common techniques.

• Type 1: Update. When the dimensional model responds to a change in source data by updating a column, Kimball calls this a type 1 change. For example, if a customer's date of birth changes, it is probably appropriate to update the corresponding row for that customer in the dimension table. Under this scenario, any facts that were already associated with the dimension table row have effectively been revised as well. A report of sales dollars by date of birth, for example, will provide different results immediately before and after the type 1 change is applied. The type 1 change does not preserve history of the attribute value.
• Type 2: New Row. A more common response to a changed data element is to insert a new row into the dimension table. For example, when the address of customer 8472 changes, we create a new row for the customer in the dimension table. This row has a different surrogate key, and the new address. Customer 8472 now has two rows in the dimension, each with its own surrogate key. This preserves the history of the attribute, and does not revise any previously stored facts. New facts will be associated with the new version of customer 8472; old facts remain associated with the old version.

For the most part, these two techniques form the basis of a dimensional model's response to change. (Future posts will consider the less common type 3 change, and additional techniques.) While these concepts are fairly easy to understand, it is important to look a bit deeper.

We think about slow changes with respect to the source

Notice the way that the original problem was framed. I asked how the dimensional schema would "respond to changes in the source data." This is how we usually think about the problem, and for good reason. After all, the source data exists before it is loaded into the dimensional schema. If birth_date changes, we overwrite; if address changes, we insert a new record.
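The effect of each technique on previously stored facts can be seen in a minimal sketch using Python's built-in sqlite3 module. Customer 8472, birth_date, and address come from the text; the table names, the literal values, and the hand-assigned surrogate keys are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY,  -- surrogate key
        customer_id  INTEGER,              -- natural key
        birth_date   TEXT,
        address      TEXT
    );
    CREATE TABLE sales_facts (customer_key INTEGER, sales_dollars REAL);
    INSERT INTO customer_dim VALUES (1, 8472, '1970-01-01', '12 Elm St');
    INSERT INTO sales_facts VALUES (1, 100.0);
""")

# Type 1: birth_date changes in the source, so the existing row is overwritten.
conn.execute(
    "UPDATE customer_dim SET birth_date = '1971-01-01' WHERE customer_id = 8472"
)

# Type 2: address changes in the source, so a new row is inserted with a new
# surrogate key. The old row, and the facts that point to it, are untouched.
conn.execute("INSERT INTO customer_dim VALUES (2, 8472, '1971-01-01', '9 Oak Ave')")
conn.execute("INSERT INTO sales_facts VALUES (2, 40.0)")  # a later sale

report = """
    SELECT d.{attribute}, SUM(f.sales_dollars)
    FROM sales_facts f
    JOIN customer_dim d ON d.customer_key = f.customer_key
    GROUP BY d.{attribute}
    ORDER BY d.{attribute}
"""
sales_by_birth_date = conn.execute(report.format(attribute="birth_date")).fetchall()
sales_by_address = conn.execute(report.format(attribute="address")).fetchall()

print(sales_by_birth_date)
print(sales_by_address)
```

The birth-date report shows every sale under the corrected date, so history has been revised. The address report still shows the earlier sale under the old address, so history has been preserved.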

Now observe that a change to the source does not always result in a change in the dimensional schema. In the example, a change in address resulted in a new row -- not a changed row. No data is changed. Still, we refer to this process as the occurrence of a type 2 change. Why? Because we think about slow changes with respect to the source data. And there, a change did occur.

We document slow changes with respect to the star

The most common way to document the dimensional schema's response to change is on the dimensional side, on an attribute-by-attribute basis. For each column in a dimension table, we note how changes in the source data will be handled. Our customer example might be documented as follows:

In the diagram, each non-key attribute is tagged with a 1 or a 2. This indicates whether changes in the source of the attribute should be handled as type 1 or type 2 changes.

Documenting SCD behavior in this way is handy. ETL developers use this information to design a scheme for performing incremental loads. Report developers use this information to understand how facts will be grouped when combined with different dimension attributes.

The only drawback to documenting SCD rules in this way is that it can lead to confusion. By tagging an attribute as a "type 2 SCD" we risk implying that attribute values may change. After all, the "C" in "SCD" stands for "change." But of course, this attribute does not change. Rather, its classification as a type 2 attribute means "for a given natural key, if the source for this attribute undergoes a change, it will be necessary to insert a new row."

In future posts, I will look at some common misconceptions about slowly changing dimensions, and discuss additional techniques for handling changes.
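To illustrate how attribute-level tags like these might drive an incremental load, here is a minimal sketch in plain Python. The SCD_TYPE mapping, the load_customer function, and the list-of-dicts dimension are hypothetical simplifications; the sketch also assumes that rows are appended in order, so the last row for a natural key is its current version.

```python
from itertools import count

# Attribute-level documentation: 1 = overwrite, 2 = insert a new row.
SCD_TYPE = {"birth_date": 1, "address": 2}

_next_key = count(1)  # a single sequence of surrogate keys


def load_customer(dim, source_row):
    """Apply one source row to the dimension, following the tags above."""
    versions = [r for r in dim if r["customer_id"] == source_row["customer_id"]]
    if not versions:
        # First time this natural key is seen: simply add it.
        dim.append({"customer_key": next(_next_key), **source_row})
        return
    current = versions[-1]
    changed = [a for a in SCD_TYPE if source_row[a] != current[a]]
    if any(SCD_TYPE[a] == 2 for a in changed):
        # Type 2: existing rows are not changed; a new row is inserted.
        dim.append({"customer_key": next(_next_key), **source_row})
    for attribute in changed:
        if SCD_TYPE[attribute] == 1:
            # Type 1: overwrite the value on every existing version.
            for row in versions:
                row[attribute] = source_row[attribute]


dim = []
load_customer(dim, {"customer_id": 8472, "birth_date": "1970-01-01", "address": "12 Elm St"})
load_customer(dim, {"customer_id": 8472, "birth_date": "1971-01-01", "address": "12 Elm St"})
load_customer(dim, {"customer_id": 8472, "birth_date": "1971-01-01", "address": "9 Oak Ave"})

for row in dim:
    print(row)
```

Note that the address value on the first row is never modified. The "change" tagged as type 2 exists only in the source; in the star it appears as an additional row.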

Copyright (c) 2007 Chris Adamson

Posted by Chris Adamson at 9:43 AM Labels: Basics, Design Techniques, Slow Changes
TUESDAY, MAY 1, 2007

10 Things You Should Know About that Sample Star Schema


Today, many of us learn about the star schema by studying a sample database that comes with a software product -- usually one that covers sales or orders. Here are 10 terms and principles of dimensional modeling to go with that sample schema you've worked with.

The star schema has become a de facto standard for the design of analytic databases. Sample stars are often included with RDBMS software, BI tools and ETL tools. They are also used for tutorials and training. Almost universally, the sample schema describes a sales or order taking process, similar to the one depicted in the figure below:

Figure 1: A demo schema usually represents orders or sales.

You may have learned about the star schema by working with a sample like this one. If so, you probably have an intuitive grasp of star schema design principles. Here are ten terms and principles you should know that describe important features of the sample star.

Most of this is probably readily apparent if you've worked with a sample schema -- what may be new is the terminology. The first two you probably know:

1. Facts are measurements that describe a business process. They are almost always numeric -- but not all numeric attributes are facts. You can find facts (or measurements) in almost any analytic request -- "Show me sales dollars by product" (sales dollars). "How many widgets were sold by John Smith in May?" (quantity ordered). There are some schemas that do not include facts -- we'll look at those in another post.

2. Dimensions give facts context. They may be textual or numeric. They are used to specify how facts are "filtered" and "broken out" on reports. You can usually find dimensions after the words "by" or "for" in an analytic request. "Show me sales dollars by product" (product). "What are margin dollars by Month and Salesperson?" (month, salesperson).

3. Dimension tables are wide. Dimension tables usually group together a set of related dimension attributes, though there are situations where a dimension may include a set of attributes not related to one another. Dimension tables are not normalized, and usually have a lot of attributes -- far more than appear in most sample schemas. This allows a rich set of detail to be used in analyzing facts. 100 or more columns is not uncommon for some dimensions. For this reason, we often call dimension tables wide.

4. Dimensions have surrogate keys. The primary key for each dimension table is an attribute specifically created for the dimensional schema. It is an integer assigned by the ETL process, and has no inherent meaning. It is not a reused key from a source system, such as a customer ID or product code. We call those source identifiers natural keys; they may exist in the star, but do not serve as unique identifiers. In the sample schema, customer_key is a surrogate key generated for the star schema; customer_id is a natural key carried over from a source system. By assigning surrogate keys, we enable the star to handle changes to source data differently than the source system does. For example, in a source system a customer record may be overwritten, while we want the star schema to track changes. Performance considerations also come into play -- a surrogate key avoids the need for multi-column joins.

5. Type 2 changes track history. The term "Slowly Changing Dimension" (or SCD) describes how the data warehouse responds to changes in the source of dimensional data. There are several techniques that can be applied when the source of dimension detail changes. The most common is referred to as a "Type 2" change: an entirely new record is written to the dimension table. For example, if a customer moves, the record may simply be updated in a source system. But in the star schema, we choose to add a new row to the customer dimension, complete with a new surrogate key. All prior facts remain associated with the "old" customer record; all future facts will be associated with the new record.

6. Type 1 changes overwrite history. The Type 1 change is used when source data changes are not deemed significant, or may be the correction of an error. In such cases, we perform an update to an existing row in a dimension. For example, if a customer's gender is updated in the source, we may choose to update it in the corresponding dimension records. All prior facts are now associated with the changed value. In addition to Type 1 and Type 2 changes, there are other SCD techniques. Hybrid approaches exist as well. Every design should identify which technique(s) will be used for each attribute of each dimension table.

7. Fact tables are narrow. A fact table row is usually entirely composed of numeric attributes: the facts, and foreign key references to the dimensions. Because of these characteristics, each fact table row is narrow, at least in contrast with wide dimension rows full of textual values. The narrowness of fact tables is important, because they will accumulate far more rows than dimension tables, and at a much faster rate.

8. Fact tables are usually sparse. Rows are recorded in the fact table only when there is something to measure. For example, not every customer orders every product from every salesperson each day. Rows are only recorded when there is an order. This helps manage the growth of the fact table. It also saves us from having to filter out a huge number of rows that have no sales dollars when displaying results in a report. (Usually, you don't want a customer sales report to list every product -- only the ones they bought. You can use an outer join when you do want every product listed.)

9. Fact table grain. The level of detail represented by a row in a fact table is referred to as its grain. Facts that are recorded with different levels of detail belong in separate fact tables. This avoids an array of reporting difficulties, as well as kludges such as including special rows in dimension tables for "not applicable." Determining the grain of a fact table is an important design step and helps avoid future confusion. (There are times when "not applicable" attributes are necessary, but they are most often a sign of the need for another fact table.) In the example, the grain is sales by customer, product, salesperson and date. A better design might capture sales at the order line level of detail.

10. Additivity. Facts are usually additive. This means they can be summed across any dimension value. For example, order_dollars can be aggregated across customers, products, salespeople, or time periods, producing meaningful results. Additive facts are stored in the fact table. We also store additive facts that might be computed from other facts. (order_dollars might be the sum of extended_cost and margin_dollars, but why include only two out of the three?) Some facts are non-additive. For example, margin rate is a percentage. Two sales at 50% margin do not equate to a single sale at 100% margin -- this fact is not additive. In the star, we store the fully additive components of margin (order_dollars and margin_dollars) and let front-end tools compute the ratio. There are also semi-additive facts, which we will look at in the next post.

Most of these terms and principles can be learned by working with a sample schema. But there are many important principles that the typical "Sales" model does not reveal. In a future post, I'll look at the top 10 things the demo schema does not teach you.
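The additivity point (item 10) can be checked with a few lines of plain Python. The sketch below uses the order_dollars and margin_dollars facts named in the text; the two sample rows are hypothetical, chosen so that each sale has a 50% margin.

```python
# Two fact rows, each storing only the fully additive components of margin.
sales_facts = [
    {"order_dollars": 100.0, "margin_dollars": 50.0},   # 50% margin
    {"order_dollars": 300.0, "margin_dollars": 150.0},  # 50% margin
]

# Non-additive: summing the per-row rates gives a meaningless 100%.
summed_rates = sum(r["margin_dollars"] / r["order_dollars"] for r in sales_facts)

# Additive components: sum each fact first, then compute the ratio,
# as a front-end tool would.
total_order_dollars = sum(r["order_dollars"] for r in sales_facts)
total_margin_dollars = sum(r["margin_dollars"] for r in sales_facts)
margin_rate = total_margin_dollars / total_order_dollars

print(summed_rates)  # 1.0, which is not a valid margin rate for these sales
print(margin_rate)   # 0.5
```

Because the ratio is computed after aggregation, the same two stored facts yield a correct margin rate at any level of summarization.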

Ten Things You Won't Learn from that Demo Schema


Many people learn about dimensional modeling by studying a sample star schema database that comes with a software product. These sample databases are useful learning tools, but only to a point. Here are 10 things you won't learn by studying that demo.

If you've learned everything you know about star schema by working with a sample database, you probably have a good intuitive grasp of star schema design principles. In a previous post, I provided a list of 10 terms and principles that most sample databases illustrate well. But there are many important things about a dimensional data warehouse that are not revealed by the typical "Orders" or "Sales" demo. Here are the top 10 things you will not learn from that sample database.

1. Multiple Fact Tables
Most sample databases contain a single star: one fact table and its associated dimension tables. But it is rare to find a business process that can be modeled with a single fact table; it is impossible to find an enterprise that can. Most real-world designs will involve multiple fact tables, sharing a set of common dimensions. When facts become available at different times, or with different dimensionality, they almost always belong in separate fact tables. Modeling them in a single fact table can have negative consequences. Mixed grain issues may result, complicating the load and making reporting very difficult. For example, building reports focused on only one of the facts can result in a confusing preponderance of extra rows containing the value zero.

2. Conformance
With multiple fact tables, it is also important that each star be designed so that it works with others. A design principle called conformance helps ensure that as we build each new star, it works well with those that came before it. This avoids the dreaded stovepipe. This principle allows a set of star schemas to be planned around a set of common dimensions and implemented incrementally.

3. Drilling Across
It's also important to understand how to properly build a report that accesses data from multiple stars. A single SQL select statement won't do the job. Double counting, or worse, can result. Instead, we follow a process called drilling across, where each star is queried individually. The two result sets are then combined based on their common attributes. These drill-across reports are some of the most powerful in the data warehouse.
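To make the drill-across process concrete, here is a small sketch using Python's built-in sqlite3 module. The two fact tables (order_facts and shipment_facts), their columns, the sample rows, and the helper query_star are hypothetical, invented only for illustration. The sketch first shows the double counting produced by a single select that joins both fact tables, then queries each star separately and merges the results on the shared product attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE order_facts (product_key INTEGER, quantity_ordered INTEGER);
    CREATE TABLE shipment_facts (product_key INTEGER, quantity_shipped INTEGER);
    INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO order_facts VALUES (1, 10), (1, 5), (2, 7);
    INSERT INTO shipment_facts VALUES (1, 4), (1, 6);
""")

# The tempting single SELECT: joining the two fact tables pairs every
# order row with every shipment row for the product, inflating both sums.
naive = conn.execute("""
    SELECT SUM(o.quantity_ordered), SUM(s.quantity_shipped)
    FROM order_facts o
    JOIN shipment_facts s ON s.product_key = o.product_key
    WHERE o.product_key = 1
""").fetchone()


def query_star(fact_table, fact_column):
    """Phase 1: summarize one star at the level of the common dimension.

    The table and column names are fixed literals in this sketch,
    never user input, so formatting them into the SQL is safe here.
    """
    sql = f"""
        SELECT p.product_name, SUM(f.{fact_column})
        FROM {fact_table} f
        JOIN product p ON p.product_key = f.product_key
        GROUP BY p.product_name
    """
    return dict(conn.execute(sql).fetchall())


ordered = query_star("order_facts", "quantity_ordered")
shipped = query_star("shipment_facts", "quantity_shipped")

# Phase 2: merge the two result sets on the common attribute
# (a full outer join, so products missing from one star are kept).
report = {
    name: (ordered.get(name, 0), shipped.get(name, 0))
    for name in sorted(set(ordered) | set(shipped))
}
print(naive)
print(report)
```

In this sample data the real totals for Widget are 15 ordered and 10 shipped, which is what the merged report shows; the single joined select reports 30 and 20, and an inner join would also drop Gadget because it has no shipments.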

4. Snapshot Fact Tables
The fact table found in most demo stars is usually called a transaction fact table. But there are real-world situations where other types of fact table designs are called for. A snapshot design is useful for capturing the result of a series of transactions at a point in time; for example, the balance of each account in the general ledger at the end of each day. This type of design introduces the concept of semi-additivity, which can be a problem for many ad hoc query tools. It makes no sense to add together yesterday's balance and today's balance. It is not uncommon to compute averages based on the data in a snapshot star. But one must be careful here; the SQL AVG() function may not always be what you need.

5. Factless Fact Tables
Another type of fact table often contains no facts at all. Factless fact tables are useful in situations where there appears to be nothing to measure aside from the occurrence of an event, such as a customer contact. They also come in handy when we want to capture information about which there may be no event at all, such as eligibility. In addition to transaction, snapshot and factless designs, there are other types of fact table as well. It is not uncommon to need more than one, even when modeling a single activity.

6. Roles and Aliasing
Many business processes involve a dimension in multiple roles. For example, in accounting a transaction may include the employee who makes a purchase, as well as the employee who approves it. There is no need for separate "Purchaser" and "Approver" dimensions. A single "Employee" dimension will do the job. The fact table will have two foreign key references to the Employee dimension: one that represents the purchaser, and one that represents the approver. We use SQL "aliasing" when querying this schema in order to capture the two employee roles.

7. Advanced Slow Change Techniques
If you are lucky, you were able to learn about Type 1 and Type 2 Slowly Changing Dimension techniques from the demo schema. I described these in a previous post. Often, analytic requirements call for more.

A Type 3 change allows you to "have it both ways," analyzing all past and future transactions as if the change had occurred (retroactively) or not at all. There are also hybrid approaches, one of which tracks the "transaction-time" version of the changed data element as well as the "current-value" of the data element. And then there's the time-stamped dimension technique, also called a transaction dimension. In this version, each row receives an effective date/time and an expiration date/time. This provides Type 2 functionality, but also allows point-in-time analysis of the dimensional data.

8. Bridge Tables
Perhaps the most confusing technique for the novice dimensional designer is the use of bridge tables. These tables are used when the standard one-to-many relationship between dimension and fact does not apply. There are three situations where bridge tables come in handy:

An attribute bridge resolves situations where a dimension attribute may repeat multiple times. For example, a dimension table called "Company" may include an attribute called "Industry." Some companies have more than one industry. Rather than flattening into "Industry 1," "Industry 2," and so on, an attribute bridge captures as many industries as needed.

A dimension bridge resolves situations where an entire dimension may repeat with respect to facts. For example, there may be multiple salespeople involved in a sale. Instead of loading the fact table with multiple salesperson keys, a dimension bridge gracefully manages the group of salespeople.

A hierarchy bridge resolves situations where a recursive hierarchy exists within a dimension. For example, companies own other companies. At times, users may want to roll up transactions that occur beneath a specific company, or vice versa. Instead of flattening the hierarchy, which imposes limitations and complicates analysis, a hierarchy bridge can be joined to the transaction data in various ways, allowing multiple forms of analysis.

All bridge implementations have implications for usage, or report building. Improper use of a bridge can result in double counting or incorrect results. Bridges also make deployment of business intelligence tools more difficult.

9. Derived Schemas
Useful stars can also be derived from existing stars. Often called "second-line" solutions, these derived schemas can accelerate specific types of analysis with powerful results. The merged fact table combines stars to avoid drilling across. The sliced fact table partitions data based on a dimension value, useful in distributed collection and analysis. The pivoted fact table restructures row-wise data for columnar analysis and vice versa. And set operation fact tables provide precomputed results for union, intersect and minus operations on existing stars.

10. Aggregate Schemas
One of the reasons the star schema has become so popular is that it provides strong query performance. Still, there are times when we want results to come back faster. Aggregate schemas partially summarize the data in a base schema, allowing the database to compute query results faster. Of course, designers need to identify aggregates that will provide the most help, and queries and reports will receive no benefit unless they actually use the aggregate. Aggregates are usually designed as separate tables, instead of providing multiple levels of summarization in a single star. This avoids double counting errors, and can allow the exploitation of an automated query rewrite mechanism so that applications do not need to be aware of the aggregates.

I limited this list to 10 things. That's enough to make the point: a demo schema will only take you so far. When I teach Advanced Star Schema Design courses, I find that even people with many years of experience have questions, and always learn something new.

If you want to learn more, read the books recommended in the sidebar of this blog. Take a class on Advanced Star Schema Design. Interact with your peers at The Data Warehouse Institute conferences. And keep reading this blog. There will always be more to learn.
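As one illustration of the caution about averages in item 4, the sketch below builds a tiny daily balance snapshot with Python's built-in sqlite3 module. The table name balance_snapshot, its columns, and the sample figures are assumptions made for the example. It contrasts a plain AVG() over snapshot rows with an average of the daily totals, which is one common thing people actually mean by "average balance."

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE balance_snapshot (day TEXT, account TEXT, balance REAL);
    INSERT INTO balance_snapshot VALUES
        ('2024-01-01', 'A', 100), ('2024-01-01', 'B', 300),
        ('2024-01-02', 'A', 200), ('2024-01-02', 'B', 400);
""")

# Balances are semi-additive: summing across accounts within one day
# is meaningful, but summing the same account across days is not.
per_day = conn.execute(
    "SELECT day, SUM(balance) FROM balance_snapshot GROUP BY day ORDER BY day"
).fetchall()

# AVG() divides by the number of rows (account-days), so it answers
# "what is the average balance of one account on one day?"
row_avg = conn.execute(
    "SELECT AVG(balance) FROM balance_snapshot"
).fetchone()[0]

# The average total balance per day divides by the number of days instead.
daily_avg = conn.execute(
    "SELECT SUM(balance) / COUNT(DISTINCT day) FROM balance_snapshot"
).fetchone()[0]

print(per_day, row_avg, daily_avg)
```

With this sample data the row-level average is 250, while the average daily total is 500. Neither number is wrong; they answer different questions, which is why the choice has to be made deliberately rather than left to a tool's default.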

10 Things You Should Know About that Sample Star Schema


Today, many of us learn about the star schema by studying a sample database that comes with a software product, usually one that covers sales or orders. Here are 10 terms and principles of dimensional modeling to go with that sample schema you've worked with.

The star schema has become a de facto standard for the design of analytic databases. Sample stars are often included with RDBMS software, BI Tools and ETL tools. They are also used for tutorials and training. Almost universally, the sample schema describes a sales or order taking process, similar to the one depicted in the figure below:

Figure 1: A demo schema usually represents orders or sales.

You may have learned about the star schema by working with a sample like this one. If so, you probably have an intuitive grasp of star schema design principles. Here are ten terms and principles you should know that describe important features of the sample star. Most of this is probably readily apparent if you've worked with a sample schema; what may be new is the terminology. The first two you probably know:

1. Facts are measurements that describe a business process. They are almost always numeric, but not all numeric attributes are facts. You can find facts (or measurements) in almost any analytic request: "Show me sales dollars by product" (sales dollars). "How many widgets were sold by John Smith in May?" (quantity ordered). There are some schemas that do not include facts; we'll look at those in another post.

2. Dimensions give facts context. They may be textual or numeric. They are used to specify how facts are "filtered" and "broken out" on reports. You can usually find dimensions after the words "by" or "for" in an analytic request. "Show me sales dollars by product" (product). "What are margin dollars by Month and Salesperson?" (month, salesperson).

3. Dimension tables are wide. Dimension tables usually group together a set of related dimension attributes, though there are situations where a dimension may include a set of attributes not related to one another. Dimension tables are not normalized, and usually have a lot of attributes, far more than appear in most sample schemas. This allows a rich set of detail to be used in analyzing facts. 100 or more columns is not uncommon for some dimensions. For this reason, we often call dimension tables wide.

4. Dimensions have Surrogate Keys. The primary key for each dimension table is an attribute specifically created for the dimensional schema. It is an integer assigned by the ETL process, and has no inherent meaning. It is not a reused key from a source system, such as a customer ID or product code. We call such source-system identifiers natural keys; they may exist in the star, but do not serve as unique identifiers. In the sample schema, customer_key is a surrogate key generated for the star schema; customer_id is a natural key carried over from a source system. By assigning surrogate keys, we enable the star to handle changes to source data differently than the source system does. For example, in a source system a customer record may be overwritten, while we want the star schema to track changes. Performance considerations also come into play: a surrogate key avoids the need for multi-column joins.

5. Type 2 Changes track history. The term "Slowly Changing Dimension" (or SCD) describes how the data warehouse responds to changes in the source of dimensional data. There are several techniques that can be applied when the source of dimension detail changes. The most common is referred to as a "Type 2" change: an entirely new record is written to the dimension table. For example, if a customer moves, the record may simply be updated in a source system. But in the star schema, we choose to add a new row to the customer dimension, complete with a new surrogate key. All prior facts remain associated with the "old" customer record; all future facts will be associated with the new record.

6. Type 1 Changes overwrite history. The Type 1 change is used when a change to source data is not deemed significant, or is the correction of an error. In such cases, we perform an update to an existing row in a dimension. For example, if a customer's gender is updated in the source, we may choose to update it in the corresponding dimension records. All prior facts are now associated with the changed value. In addition to Type 1 and Type 2 changes, there are other SCD techniques. Hybrid approaches exist as well. Every design should identify which technique(s) will be used for each attribute of each dimension table.

7. Fact tables are narrow. A fact table row is usually composed entirely of numeric attributes: the facts, and foreign key references to the dimensions. Because of these characteristics, each fact table row is narrow, at least in contrast with wide dimension rows full of textual values. The narrowness of fact tables is important, because they will accumulate far more rows than dimension tables, and at a much faster rate.

8. Fact tables are usually sparse. Rows are recorded in the fact table only when there is something to measure. For example, not every customer orders every product from every salesperson each day. Rows are only recorded when there is an order. This helps manage the growth of the fact table. It also saves us from having to filter out a huge number of rows that have no sales dollars when displaying results in a report. (Usually, you don't want a customer sales report to list every product, only the ones they bought. You can use an outer join when you do want every product listed.)

9. Fact Table Grain. The level of detail represented by a row in a fact table is referred to as its grain. Facts that are recorded with different levels of detail belong in separate fact tables. This avoids an array of reporting difficulties, as well as kludges such as including special rows in dimension tables for "not applicable." Determining the grain of a fact table is an important design step and helps avoid future confusion. (There are times when "not applicable" attributes are necessary, but they are most often a sign of the need for another fact table.) In the example, the grain is sales by customer, product, salesperson and date. A better design might capture sales at the order line level of detail.

10. Additivity. Facts are usually additive. This means they can be summed across any dimension value. For example, order_dollars can be aggregated across customers, products, salespeople, or time periods, producing meaningful results. Additive facts are stored in the fact table. We also store additive facts that might be computed from other facts. (order_dollars might be the sum of extended_cost and margin_dollars, but why include only two out of the three?) Some facts are non-additive. For example, margin rate is a percentage. Two sales at 50% margin do not equate to a single sale at 100% margin; this fact is not additive. In the star, we store the fully additive components of margin (order_dollars and margin_dollars) and let front-end tools compute the ratio. There are also semi-additive facts, which we will look at in the next post.

Most of these terms and principles can be learned by working with a sample schema. But there are many important principles that the typical "Sales" model does not reveal. In a future post, I'll look at the top 10 things the demo schema does not teach you.
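The point about storing additive components and computing the ratio afterwards can be shown with a short sketch. It uses Python's built-in sqlite3 module; the order_facts table is reduced to the two columns named in the text, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_facts (order_dollars REAL, margin_dollars REAL);
    INSERT INTO order_facts VALUES (100, 50), (300, 75);
""")

# Additive facts can simply be summed.
total_orders, total_margin = conn.execute(
    "SELECT SUM(order_dollars), SUM(margin_dollars) FROM order_facts"
).fetchone()

# Margin rate computed from the summed components: the ratio of the sums.
margin_rate = total_margin / total_orders

# Treating the per-row rate as if it could be aggregated gives a
# different figure, because each row is weighted equally regardless of size.
avg_of_rates = conn.execute(
    "SELECT AVG(margin_dollars / order_dollars) FROM order_facts"
).fetchone()[0]

print(total_orders, total_margin, margin_rate, avg_of_rates)
```

Here the two orders have margin rates of 50% and 25%. Averaging those rates gives 37.5%, but the margin rate on the combined 400 dollars of orders is 31.25%, which is what a report should show. Keeping order_dollars and margin_dollars in the fact table lets the ratio be recomputed correctly at any level of summary.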

Q&A: Degenerate Dimensions, ETL and BI


A question from a reader about including dimensions in the fact table:

Q: Question concerning an argument I am having with a colleague. We have a transaction fact table that will have an attribute called "Reason Overpaid". This attribute can only contain one of 10 values. Is it better to create a "Reason Overpaid" dimension and put a FK in the fact table referencing the dimension, or just have the "Reason Overpaid" description in the fact table?

A:

This is one argument I will not be able to settle. Either approach is fine.

Stored in the fact table, this attribute would be known as a degenerate dimension. It is perfectly acceptable there, but you may decide to move it to a separate table for other reasons.

Degenerate Dimension

A degenerate dimension is nothing more than a dimension attribute stored in the fact table. This technique is commonly employed when there is an attribute left over that doesn't really fit into any of the other dimension tables. Your "Reason Overpaid" attribute can be stored in the fact table as a degenerate dimension. You can still use it in the exact same way as any other dimension attribute: as a way to filter queries, group results, break subtotals, and so forth. Keeping it in the fact table avoids unnecessary complexity: a new table and key attribute to manage and load, a new foreign key lookup when processing facts, and most importantly an extra join to include in queries. That said, a dimension table for the attribute may make sense in some situations.

Junk Dimension

If there is more than one degenerate dimension, consider moving them all to a separate dimension table. This is called a junk dimension. The attributes are not directly related to one another and there is no natural key. The table is populated with the Cartesian product of all possible values of its attributes.

ETL Consistency Concerns

If your "Reason Overpaid" will also appear in other fact tables, worries about ETL consistency may arise. Degenerate dimensions are still OK in this situation, but now two or more fact tables will contain the attribute, and it will be necessary to be sure it is loaded consistently. Creating a separate dimension table allows the values to be created exactly once, avoiding any problems that might be created by inconsistent ETL processing. While I would not go to a separate table for this reason, I do understand why many designers opt to do so. The next situation is a different story.

BI Tool Capabilities

If your "Reason Overpaid" will also appear in other fact tables, the capabilities of your BI software may come into play.

The scenario is this: you are configuring your BI tool to auto-generate SQL queries for users. You'd like to have an item they can request called "Reason Overpaid", but the tool does not understand that it can appear in two places in the database schema. Creating a dimension table for the attribute solves this problem. Both fact tables can link to the same dimension table. The tool can now have a definitive place to go for "Reason Overpaid", and may even be able to use it as the basis for comparing data in two fact tables. This is a strong reason to go with a separate table. Luckily, many BI tools can be configured to acknowledge that a dimension may appear in more than one place, in which case this is not an issue. And if you are building cubes for the purposes of BI reporting, you can trust your developers to choose the right attribute. If you're interested in reading more about how BI tools may influence your dimensional design, be sure to check Chapter 16, "Design and Business Intelligence" in my latest book, Star Schema The Complete Reference.
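To make the two options from the question concrete, here is a small sketch using Python's built-in sqlite3 module. Everything in it is hypothetical: the table names payment_facts, payment_facts_fk and reason_overpaid_dim, the overpaid_dollars fact, and the two sample reasons are invented for illustration. It stores "Reason Overpaid" once as a degenerate dimension and once in a separate dimension table, then runs the same grouped report against each design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Option 1: the attribute lives in the fact table (degenerate dimension).
    CREATE TABLE payment_facts (reason_overpaid TEXT, overpaid_dollars REAL);
    INSERT INTO payment_facts VALUES
        ('Duplicate invoice', 120), ('Pricing error', 40),
        ('Duplicate invoice', 30);

    -- Option 2: the attribute lives in its own dimension table.
    CREATE TABLE reason_overpaid_dim (
        reason_key INTEGER PRIMARY KEY, reason_overpaid TEXT);
    CREATE TABLE payment_facts_fk (reason_key INTEGER, overpaid_dollars REAL);
    INSERT INTO reason_overpaid_dim VALUES
        (1, 'Duplicate invoice'), (2, 'Pricing error');
    INSERT INTO payment_facts_fk VALUES (1, 120), (2, 40), (1, 30);
""")

# Degenerate dimension: group directly on the fact table, no join needed.
degenerate = conn.execute("""
    SELECT reason_overpaid, SUM(overpaid_dollars)
    FROM payment_facts
    GROUP BY reason_overpaid
    ORDER BY reason_overpaid
""").fetchall()

# Separate dimension: the same report costs one extra join.
with_dimension = conn.execute("""
    SELECT d.reason_overpaid, SUM(f.overpaid_dollars)
    FROM payment_facts_fk f
    JOIN reason_overpaid_dim d ON d.reason_key = f.reason_key
    GROUP BY d.reason_overpaid
    ORDER BY d.reason_overpaid
""").fetchall()

print(degenerate)
print(with_dimension)
```

Both queries return the same rows, which is the sense in which either approach is fine for reporting. The differences are in what surrounds the query: the second design adds a table, a key lookup during the load, and a join, while giving ETL processes and BI tools a single place where the values are defined.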
