
THOUGHTSPOT DATA MODELLING
Spot-On Data Modeling
First Edition

A ThoughtSpot Field Guide


About the Author
Misha Beek is a seasoned Lead Solution Architect at ThoughtSpot, spearheading the company's initiatives in the Europe, Middle East and Africa (EMEA) region. With a global perspective, Misha has been at the forefront of implementing ThoughtSpot use cases worldwide, showcasing an exceptional knack for tackling intricate data and security models using an array of innovative techniques. His unique approach combines traditional wisdom with cutting-edge methods, resulting in data models that are both efficient and high-performing.
Misha's "ThoughtSpot Data Security Field Guide" has reached its 3rd edition, garnering nearly a thousand downloads. This trusted go-to resource offers practical insights and real-world examples for implementing robust security models in ThoughtSpot, making it an invaluable reference for ThoughtSpot professionals.
Misha's breadth of knowledge and collaborative spirit are exemplified by this newly released guide. Much like its predecessor, it is poised to become the go-to reference for data modelers seeking practical wisdom and actionable insights to enhance their skills and competence.

Misha Beek
Lead Solutions Architect
ThoughtSpot
misha-beek

Table of Contents
1 BEST PRACTICES LISTED IN THIS FIELD GUIDE 10

2 A WORD FROM THE AUTHOR 12

3 EXPLORING DATA MODELING FOR PRACTICAL INSIGHTS 14

3.1 Understanding Dimensional Modeling 14

3.2 Effective practices for successful dimensional modeling 14

3.3 Guiding Principles for Success in Dimensional Modeling 15

3.4 Avoiding common pitfalls in dimensional modeling 16

4 FUNDAMENTALS: DATA MODEL TYPES 18

4.1 Dimensional Data Models 18


4.1.1 Introduction 18
4.1.2 Advantages 18
4.1.3 Considerations 19
4.1.4 Understanding Dimensional Model Types 20

4.2 Flat Denormalised Table 25


4.2.1 Definition 25
4.2.2 Example 26
4.2.3 Advantages 26
4.2.4 Disadvantages 26

4.3 Precalculated Views/OLAP Cube 28


4.3.1 Definition 28
4.3.2 Example 28
4.3.3 Advantages 28
4.3.4 Disadvantages 29

4.4 OLTP/Transactional Data Model 29


4.4.1 Definition 29
4.4.2 Example 30
4.4.3 Advantages 31
4.4.4 Disadvantages 31

4.5 Data Vault 32

4.5.1 Definition 32
4.5.2 Example 33
4.5.3 Advantages 33

4.6 Model Types Compared 34

4.7 Test Your Knowledge 36

5 FUNDAMENTALS: FACTS AND DIMENSIONS 38

5.1 Introduction 38

5.2 Fact Table Types 39


5.2.1 Time-Related Fact Tables 39
5.2.2 Aggregation and analysis fact tables 45
5.2.3 Context and relationship fact tables 49
5.2.4 Specialized Fact Tables 51
5.2.5 Fact table types: wrapping it up 53

5.3 Dimension Types 56


5.3.1 Hierarchical and Organizational Dimensions 56
5.3.2 Data Consistency and Reusability Dimensions 58
5.3.3 Optimizing Performance and Complexity 61
5.3.4 Customized and Specialized Dimensions 69
5.3.5 Dimension Types: Wrapping it up 73

5.4 Testing your Knowledge 75

6 EFFECTIVE DESIGN PATTERNS AND BEST PRACTICES FOR OVERCOMING COMMON CHALLENGES IN DIMENSIONAL MODELS 82

7 BRIDGING CHASMS, FANNING INSIGHTS: CHASM AND FAN TRAPS UNVEILED 85

7.1 Introduction 85

7.2 Understanding Chasm and Fan Traps 85


7.2.1 What is a Chasm Trap? 85
7.2.2 Challenges Posed by Chasm Traps 86
7.2.3 Variants of chasm traps 86
7.2.4 What is a fan trap? 87

7.3 Modeling Chasm and Fan Traps 87

7.3.1 Chasm traps 87
7.3.2 Fan Traps 92

7.4 Considerations and conclusion 95

8 MASTERING DIMENSIONAL RELATIONSHIPS: JOIN CARDINALITY, ROLE PLAYING DIMENSIONS AND JOIN PATHS 97

8.1 Introduction 97

8.2 Understanding Dimensional Relationships 97


8.2.1 What is join cardinality? 97
8.2.2 Importance of join direction 98
8.2.3 Role playing dimensions 99
8.2.4 Multiple join paths 100

8.3 Modelling Dimensional Relationships 100


8.3.1 1-to-1 Relationships 100
8.3.2 Many-to-many relationships 101
8.3.3 Role playing dimensions 104
8.3.4 Range joins in dimensional models 104

8.4 Considerations and Conclusion 113

9 THE PITFALLS OF OUTER JOINS 114

9.1 Introduction 114

9.2 Understanding Outer Joins 114


9.2.1 What is an outer join? 114
9.2.2 Challenges with outer joins 115
9.2.3 Increased complexity 115

9.3 Modelling Outer Joins 117


9.3.1 Avoid outer joins where possible 117
9.3.2 Populating fact tables appropriately and creating default members for dimensions 117
9.3.3 Leveraging factless fact and bridge tables 118

9.4 Considerations and Conclusion 119

10 SCALING DATA PEAKS: A DEEP DIVE INTO HIERARCHIES 120

10.1 Introduction 120

10.2 Understanding Hierarchies 120
10.2.1 What are hierarchies in dimensional modeling? 120
10.2.2 Types of hierarchies 121

10.3 Modeling Hierarchies 125


10.3.1 Balanced hierarchies 125
10.3.2 Advanced attribute hierarchies 127
10.3.3 Ragged hierarchies 130
10.3.4 Unbalanced hierarchies - Solution 1: Bridge tables 131
10.3.5 Unbalanced hierarchies - Solution 2: Node table 140
10.3.6 Unbalanced hierarchies - Solution 3: Path string attributes 145

10.4 Comparison of the various implementation techniques 151

10.5 Considerations and Conclusion 154

11 EVOLVING RELATIONS IN TIME: A DEEP DIVE INTO SLOWLY CHANGING DIMENSIONS 158

11.1 Introduction 158

11.2 Understanding slowly changing dimensions 158


11.2.1 What are slowly changing dimensions? 158
11.2.2 Challenges with slowly changing dimensions 159

11.3 Modelling slowly changing dimensions 160


11.3.1 Type 1: Overwrite 160
11.3.2 Type 2: Add new record 160
11.3.3 Type 3: Add columns 161
11.3.4 Type 4: Maintain history in separate table (aka "bifurcation") 161

11.4 Considerations and Conclusion 162

12 FROM FINE TO COARSE: CRAFTING DATA MODELS WITH PRECISION AND GRAIN 165

12.1 Introduction 165

12.2 Understanding Granularity 165


12.2.1 What is granularity? 165
12.2.2 Denormalization: To denormalize or Not? 167
12.2.3 Handling mixed granularity models 168
12.2.4 Search experience 169

12.3 Modeling for mixed grain 170
12.3.1 Option 1: Augmenting the fact table 170
12.3.2 Another option: splitting dimension 171

12.4 Considerations and Conclusion 175

13 UNBOXED INSIGHTS: UNLEASHING THE POTENTIAL OF FLEXIBLE DATA STRUCTURES 176

13.1 Introduction 176

13.2 Understanding flexible data structures 176


13.2.1 What are key value pairs? 176
13.2.2 What is unstructured/semi structured data? 177

13.3 Modeling flexible data structures 178


13.3.1 Key value pairs 178
13.3.2 Unstructured/semi structured Data 179

13.4 Considerations and Conclusion 181

14 DIMENSIONAL MASTERY: EXPLORING THE DATE DIMENSION 182

14.1 Introduction 182

14.2 Understanding Date Dimensions 182


14.2.1 What is a date dimension? 182
14.2.2 Using ThoughtSpot keywords: Simplifying time queries 182

14.3 Modeling date & time 183

14.4 Using the date dimension for Look Back Measures 184
14.4.1 Introduction 184
14.4.2 Requirement definition 184
14.4.3 Solution design 185
14.4.4 Example: How have sales changed this year versus last year during the same period? 186
14.4.5 Conclusion 189

14.5 Considerations and Conclusion 189

15 COUNTING COINS ACROSS CONTINENTS: A GUIDE TO CURRENCY CONVERSION IN ANALYTICS 191

15.1 Introduction 191

15.2 Understanding Currency Conversion 191


15.2.1 What is currency conversion? 191
15.2.2 Challenges of currency conversion 192

15.3 Modeling Currency Conversion 192

15.4 Considerations and Conclusion 195

16 EVOLVING DATA STRUCTURES: A GUIDE TO LATE-BINDING ATTRIBUTES 198

16.1 Introduction 198

16.2 Understanding Late-Binding Attributes 198


16.2.1 What are late-binding attributes? 198
16.2.2 Late-binding and varying attributes per row 199
16.2.3 Use cases for late-binding attributes 199
16.2.4 Strengths and challenges 201

16.3 Modeling Late-Binding Attributes 203


16.3.1 Customer profiles 203
16.3.2 Survey responses 204
16.3.3 Implementation strategies for late-binding attributes 205

16.4 Modeling late-binding attributes for search 209


16.4.1 Design considerations 209
16.4.2 Data structure 209
16.4.3 Strong typing 210
16.4.4 Combining field and answer in a tag 210
16.4.5 AND condition for profile filters 211

16.5 Conclusion 211

17 BEYOND SUMS: MASTERING COMPLEX DATA METRICS 213

17.1 Introduction 213

17.2 Understanding Non-Additive, Semi-Additive and Derived Measures 213


17.2.1 What are derived measures? 213
17.2.2 What are non-additive measures? 215
17.2.3 What are semi-additive Measures? 215

17.3 Modeling Non-additive, Semi-additive and Derived Measures 216

17.3.1 Derived measures 216
17.3.2 Non-additive measures 217
17.3.3 Semi-additive measures 224

17.4 Considerations and Conclusion 227

1 Best Practices listed in this Field Guide
Apart from providing extensive information on data modeling scenarios and
design patterns, this field guide also provides best practices and top tips for
setting up your data model.

Using this icon, we will highlight best practices based on typical field experiences.

This icon indicates tips/tricks from the field, or useful knowledge.

The complete list of best practices in this field guide can be found below.
Best Practice 1 - Choose the right dimensional modeling technique 36
Best Practice 2 - When to use a transactional fact table? 40

Best Practice 3 - When to use a snapshot fact table? 41


Best Practice 4 - When to use an accumulating snapshot fact table? 42
Best Practice 5 - When to use a periodic snapshot fact table? 43
Best Practice 6 - When to use a partitioned fact table? 44
Best Practice 7 - When to use a cumulative fact table? 46
Best Practice 8 - When to use an aggregated fact table? 47
Best Practice 9 - When to use a derived fact table? 48
Best Practice 10 - When to use a factless fact table? 50

Best Practice 11 - When to use a bridge fact table? 51


Best Practice 12 - When to use a multi-value fact table? 53
Best Practice 13 - When to use role playing dimensions 57
Best Practice 14 - When to use fixed-depth hierarchy dimensions 58
Best Practice 15 - When to use conformed dimensions 59

Best Practice 16 - When to use universal dimensions 60


Best Practice 17 - Implement Conformed/Universal Dimensions 61
Best Practice 18 - When to use degenerate dimensions 62
Best Practice 19 - When to use Transaction Dimensions 64
Best Practice 20 - When to use mini-dimensions 65
Best Practice 21 - When to use junk dimensions 66
Best Practice 22 - When to use shrunken dimensions 67

Best Practice 23 - When to use late-binding dimensions 67
Best Practice 24 - When to use composite dimensions 68
Best Practice 25 - When to use snapshot dimensions 70
Best Practice 26 - When to use custom dimensions 71
Best Practice 27 - When to use derived dimensions 72
Best Practice 28 - ThoughtSpot handles chasm traps automatically! 96
Best Practice 29 - ThoughtSpot handles fan traps automatically! 96
Best Practice 30 - Understand join cardinality 98
Best Practice 31 - Importance of join direction 99
Best Practice 32 - Consider transforming 1-to-1 relationships 101

Best Practice 33 - Consider transforming many-to-many relationships 103


Best Practice 34 - Avoid outer joins where possible 117
Best Practice 35 - Populate fact tables appropriately and create default members for dimensions 117
Best Practice 36 - Use factless fact or bridge tables 118
Best Practice 37 - Avoid abstract names in hierarchies 126
Best Practice 38 - Limit the depth of ragged hierarchies 131
Best Practice 39 - Load data at the lowest granularity for enhanced analysis 166

Best Practice 40 - Storing exchange rates 196


Best Practice 41 - Use bridge tables 196
Best Practice 42 - Standardize currency codes 196
Best Practice 43 - Use fact-based currency conversion 196
Best Practice 44 - Apply currency conversion on load 197

2 A Word from the Author
Hello there, fellow data wranglers and aspiring modelers of the digital universe!
Before you dive headfirst into the world of data
modeling, I wanted to take a moment to share a few
words of wisdom, encouragement, and maybe even a
sprinkle of humor. You see, data modeling is a bit like
juggling chainsaws while riding a unicycle on a tightrope
suspended over a pit of hungry alligators – exciting,
challenging, and definitely not for the faint of heart.
Now, I know what you might be thinking. "How hard can
it be to organize data?" Well, let me tell you, it's a bit
like trying to herd cats through a maze designed by M.C.
Escher, blindfolded, on a moonless night. In other
words, it's not a walk in the park. And here's the kicker:
a lot of people think they understand data modeling,
believing it's a piece of cake. But trust me, that's where
it often goes wrong.
Data modeling isn't just about slapping together a few
tables and calling it a day. It's a nuanced dance of
relationships, attributes, and constraints that requires the finesse of a concert
pianist. But fear not, for within the pages of this field guide, you'll find the
roadmap to navigate this labyrinthine world of data modeling with wit and
wisdom.
I'll delve into both straightforward and more complex design patterns, all
relevant to data modeling, which is like adding layers of flavor to a lasagna –
some patterns are classic and comforting, while others are bold and
adventurous.
But do not forget: data modeling isn't just hard; it's also incredibly rewarding.
It's like solving a puzzle that unlocks the secrets of your organization's data,
revealing hidden insights and transforming chaos into clarity. Plus, you get to
wield the power of knowledge and help your business make better decisions.
In this field guide, my colleagues and I have poured our hearts, souls, and possibly
a few cups of coffee into distilling our years of experience in the field into
practical, actionable advice. We've made the mistakes so you don't have to

(trust us, some of those mistakes were doozies). You'll find real-world examples
and step-by-step guides on how to implement the very best design patterns and
solutions in your data model.
Now, let me take a moment to tip my hat to some of the brilliant minds in the
field who contributed to this guide. One of the wonders of being part of the
ThoughtSpot community is that you're in the midst of some seriously sharp
individuals. A handful of the ingenious use cases and solutions you'll find in these
pages owe their existence to the creativity and insights of my colleagues. David
Stefkovich deserves a special mention, but I must also give credit to Damian
Waldron, Bill Lay, and Nick Cooper for their invaluable input. It's this kind of
collaborative effort that makes projects like this truly shine.
So, whether you're a seasoned data guru or a wide-eyed beginner, fasten your
seatbelt, put on your thinking cap, and get ready for a rollercoaster ride through
the fascinating world of data modeling. It may be tough, but with the right
guidance, and the knowledge you'll gain from this field guide, you'll conquer this
data-driven jungle like a pro.

Happy modeling!

Misha Beek

3 Exploring data modeling for practical
insights
In this guide, we'll delve into Data Modeling, uncovering its key principles,
relevance to modern data environments, and practical application of best
practices. Whether you're an experienced data pro or a newcomer, this guide is
your companion for a thorough exploration of essential practices, valuable tips,
and proven design patterns for successful data modeling.
Before diving into the technical aspects in later chapters, this section focuses on
non-technical facets and essential best practices.

3.1 UNDERSTANDING DIMENSIONAL MODELING

Dimensional Modeling is a strategic framework that organizes complex data into
a clear structure for robust analysis and intuitive exploration. Think of it as a
blueprint for your data universe, where dimensions provide distinct views and
intersections weave a tapestry of knowledge. By creating a dimensional model,
you bridge the gap between raw data and insights, transforming information
into understanding.
In today's digital age, organizations grapple with a dual challenge: a surge in
data and the pursuit of value. Dimensional Modeling helps untangle data
complexities by arranging them into hierarchies, ensuring data integrity and
user-friendly interfaces. This equips decision-makers with tools for well-
informed choices.

3.2 EFFECTIVE PRACTICES FOR SUCCESSFUL DIMENSIONAL MODELING

In Dimensional Modeling, technical knowledge is vital, but it's the practical, non-
technical approaches that lead to success. These involve understanding
requirements, proper documentation, effective communication, and adhering to
guiding principles.
At the core, a meaningful dimensional model starts by understanding
stakeholders' needs. This means engaging in discussions and carefully outlining

requirements. Through this process, your model begins to take shape, mirroring
real operational aspects. Active listening, asking the right questions, and
aligning technical solutions with tangible business problems all contribute to
creating a fitting model.
Documentation becomes a vital tool in navigating the complexities of
Dimensional Modeling. This includes capturing design choices, mapping out data
connections, and creating user guides. These documents simplify intricate
details, paving the way for innovation and practical solutions.
However, the heart of our journey lies in collaboration.
Effective communication acts as the glue that binds
the technical and non-technical sides. It's about
translating complex technical terms into
understandable concepts, fostering a shared
understanding. This synergy between different
perspectives drives us toward our common
goal.
In this pragmatic interplay of technical and non-
technical aspects, our path through Dimensional
Modeling is guided by a practical approach. By
focusing on understanding, documentation, and
collaboration, we navigate the complexities of data, shaping it into a powerful
tool for informed decision-making.

3.3 GUIDING PRINCIPLES FOR SUCCESS IN DIMENSIONAL MODELING

Embrace these guiding principles:


n Business-Centric Approach: Craft dimensions and facts to match user
interpretation of data, aligning with real-world scenarios.
n User-Friendly Naming: Use intuitive names for dimensions, attributes, and
measures, simplifying query design and understanding.
n Iterative Refinement: Use user feedback to create a dynamic model that
meets evolving needs and enhances user experience.
n User-Centric Design: Prioritize usability, navigation, and efficient querying
for a user-friendly model.

n Ensuring Data Quality: Implement data quality checks during Extract
Transform & Load (ETL) processes to maintain integrity and support
meaningful analysis.
n Collaboration Dynamics: Engage with business users to refine the model
in line with actual business needs.
n Embracing Change: Stay updated on developments to ensure your model
remains effective.
n Scalability: Design your model to accommodate growth and scalability.
n Documentation: Document design, relationships, and rules for
understanding and collaboration.
n Business-IT Collaboration: Collaborate to ensure your model aligns with
business needs.
n Self-Service Design: Create interfaces empowering users to explore data
independently.
n Education: Invest in training to maximize benefits.
n Continual Evolution: Continuously refine your model to accommodate new
requirements, dimensions, and data sources.

3.4 AVOIDING COMMON PITFALLS IN DIMENSIONAL MODELING

While navigating the seas of Dimensional Modeling, be mindful of common
pitfalls that can undermine your efforts:

n Insufficient/Incorrect
Dimensional Modeling: Ensure
alignment with reporting
requirements and include desired
events. Mismatched requirements
can lead to omitted dimensions
and facts crucial for accurate
analysis.

n Overcomplicating the Model: Focus on essentials and prioritize reporting
over excessive data integrity. Strive for simplicity to enhance comprehension
and performance.
n Ignoring Data Quality and Integrity: Neglecting data quality leads to
inaccurate insights. Rigorously validate, cleanse, and integrate data for
reliable results.
n Failing to Capture Granularity: Balance granularity and usability to avoid
superficial insights. Understand business analytical needs for effective data
modeling.
n Neglecting Future Scalability: Design for growth to accommodate
evolving requirements. Ensure flexibility for new dimensions, facts, and
reporting needs.
n Ignoring Performance Optimization: Optimize performance to avoid
sluggish query response times. Incorporate indexing, aggregations, and
proper join strategies.
n Understand Existing Model and Patterns Before Making Changes:
Modify existing models with care, understanding their intricacies and patterns
to avoid unintended consequences.
n Avoid Overloading Single Worksheets for Complex Use Cases:
Distribute complex analytical needs across multiple worksheets for clarity and
efficiency.
Having explored the less technical dimensions of data modeling, we're now
ready to dive deeper into the heart of data modeling itself. In the upcoming
sections, we'll immerse ourselves in best practices, insightful tips, and intricate
examples that illuminate the path of effective data modeling.

4 Fundamentals: Data model types

4.1 DIMENSIONAL DATA MODELS

4.1.1 Introduction
A dimensional data model is a data modeling technique used in data
warehousing and business intelligence to structure data for easy analysis and
reporting. It has several advantages that make it popular in decision support
systems and analytical environments.

4.1.2 Advantages
Here are some of the key advantages of a dimensional data model:
n Simplicity and ease of understanding: Dimensional data models are
designed to be intuitive and easy to understand, even for non-technical
users. They use familiar concepts like dimensions (descriptive attributes) and
measures (quantitative data) to represent business entities and their metrics.
n Faster query performance:
Dimensional models are optimized
for querying and reporting. Their
denormalized structure allows for
quick retrieval of data, reducing the
complexity of joins and aggregation
operations. As a result, analytical
queries can be executed more
efficiently, providing faster response
times.
n Flexibility and agility:
Dimensional data models are highly
flexible and can adapt easily to
changing business requirements. When new dimensions or measures are
needed, they can be added without affecting the existing structure, making
it convenient to incorporate new data and insights as the business evolves.
n Enhanced user experience: Due to its user-centric design, a dimensional
data model empowers business users to perform self-service data exploration

and analysis. With a well-structured model, users can navigate through data
hierarchies and relationships to gain meaningful insights without relying
heavily on IT or data experts.
n Better support for business analytics: Dimensional models are purpose-
built for analytical applications and business intelligence. They provide a clear
representation of business metrics and enable advanced analytics like drill-
down, roll-up, slicing, dicing, and pivot operations.
n Business-focused design: Dimensional data models are designed with a
focus on meeting business requirements and answering business questions.
This alignment with business needs enables users to quickly find the
information they need to make informed decisions.
n Improved data quality: Dimensional data models promote better data
quality and consistency by adhering to standardized dimensions and
hierarchies. This ensures that the data used for analysis is consistent across
different reports and dashboards.

4.1.3 Considerations
Although dimensional data models bring significant benefits to data warehousing
and analytical setups, their implementation requires careful consideration of
certain factors. These key considerations encompass:
n Data Redundancy: Dimensional data models often denormalize data to
improve query performance and user experience. However, denormalization
can lead to data redundancy, as
the same information might be
stored in multiple places within the
model. This can increase storage
requirements and may lead to
data integrity issues if updates are
not properly managed.
n Complexity in Hierarchies:
Hierarchical relationships in
dimensional models can
sometimes be complex, especially
when dealing with multi-level
hierarchies. Maintaining and
representing these hierarchies

correctly may require careful attention and additional effort. See chapter 10
for various implementation techniques for hierarchies.
n Difficulties in Real-time Data: Real-time data integration poses challenges
when using dimensional models because these models are predominantly
designed for batch processing. The need for data transformation during
loading can complicate real-time data integration processes.
n Inflexibility for Certain Queries: While dimensional models excel in many
analytical queries, they may not be optimal for certain complex analytical
operations, especially those requiring ad-hoc or exploratory data analysis. In
such cases, other data models like data vault or snowflake schemas might
be more suitable.
n Maintenance Overhead: Dimensional models require ongoing
maintenance, especially when introducing changes to the model, like adding
new dimensions or measures. Proper documentation is essential to prevent
data inconsistencies and ensure smooth maintenance.
n Data Governance Challenges: With denormalized data and ease of data
integration, it can be challenging to enforce strict data governance rules,
leading to potential data quality issues and inconsistency.

4.1.4 Understanding Dimensional Model Types


There are several types of dimensional models, each with its unique
characteristics, benefits, and drawbacks. In this section, we will delve into the
different dimensional model types, explore their definitions, and highlight
examples, advantages, and disadvantages.

4.1.4.1 Star Schema


4.1.4.1.1 Definition
The star schema is one of the most widely used and straightforward dimensional
models. It consists of a central fact table connected directly to multiple
dimension tables. The fact table holds quantitative measures or metrics, while
each dimension table represents a specific attribute or category related to the
fact data.

4.1.4.1.2 Example
Imagine a sales database where the central fact table contains sales revenue
and quantity data. The dimension tables may include attributes such as product
name, product category, date, and location. Each dimension table is linked to
the fact table via foreign key relationships.

Figure 1 A sample Star Schema Model
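To make this concrete, here is a minimal sketch of the same idea using pandas; the table and column names (fact_sales, dim_product, dim_date) are illustrative and not taken from the figure.

import pandas as pd

# Central fact table: one row per sale, holding measures plus foreign keys.
fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2],
    "date_id": [20230115, 20230118, 20230118],
    "quantity": [3, 2, 5],
    "revenue": [150.0, 120.0, 300.0],
})

# Dimension tables: descriptive attributes keyed by a surrogate key.
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Widget", "Gadget"],
    "category": ["Tools", "Electronics"],
})
dim_date = pd.DataFrame({
    "date_id": [20230115, 20230118],
    "month": ["2023-01", "2023-01"],
})

# A typical star-schema query: join the fact table to its dimensions, then aggregate.
revenue_by_category = (
    fact_sales
    .merge(dim_product, on="product_id")
    .merge(dim_date, on="date_id")
    .groupby(["category", "month"], as_index=False)["revenue"]
    .sum()
)
print(revenue_by_category)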

4.1.4.1.3 Advantages
n Simple and easy to understand, making it a preferred choice for small to
medium-sized data warehouses.
n Efficient query performance due to denormalized nature, reducing the
number of joins required for analysis.
n Optimal for analytical processes where data retrieval often involves
aggregations.

4.1.4.1.4 Disadvantages
n Start-up Costs: existing data structures do not always exist in relational
format to support dimensional modeling.
n Redundancy in dimension tables can lead to increased storage requirements.
n Maintenance challenges when updates or changes to dimension attributes
are required.
n Not suitable for highly normalized data.

4.1.4.2 Snowflake Schema

4.1.4.2.1 Definition
The snowflake schema is an extension of the star schema. In this model,
dimension tables are further normalized by breaking them into multiple related
tables. The normalization helps reduce data redundancy but may increase query
complexity.

4.1.4.2.2 Example
In the previous sales database example, a snowflake schema could involve
breaking down the product dimension table into multiple tables, such as product
details and product categories. Each of these tables would hold specific attribute
data related to products. Similarly, the location table could be divided into
multiple tables, introducing dimensions for cities and countries.

Figure 2 Sample snowflake schema
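As a rough sketch of the snowflaked product dimension described above (names and values are illustrative), note the extra join that is now needed to resolve a product's category:

import pandas as pd

# In the snowflake variant the category attributes live in their own table.
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Widget", "Gadget"],
    "category_id": [10, 20],  # reference to the normalized category table
})
dim_category = pd.DataFrame({
    "category_id": [10, 20],
    "category_name": ["Tools", "Electronics"],
})

# Resolving a product's category now takes an additional join compared to the star schema.
product_with_category = dim_product.merge(dim_category, on="category_id")
print(product_with_category)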

4.1.4.2.3 Advantages
n Improved data storage efficiency due to reduced redundancy.
n Easier maintenance when updating shared dimension attributes.
n Suitable for data warehouses with limited storage capabilities.

4.1.4.2.4 Disadvantages
n Increased query complexity due to multiple joins required to access data from
normalized tables.
n Potentially slower query performance compared to star schema due to
additional joins.
n May not be as intuitive to understand and implement for some users.

4.1.4.3 Galaxy Schema (Constellation Schema)

4.1.4.3.1 Definition
The galaxy schema, also known as the constellation schema, combines multiple
star schemas or snowflake schemas into a more complex and interconnected
structure. This model is useful when dealing with diverse and heterogeneous
data sources.

4.1.4.3.2 Example
Consider a data warehouse that combines sales data, financial data, and
customer data. The galaxy schema links all these fact tables and dimension
tables together.

Figure 3 Sample galaxy schema

4.1.4.3.3 Advantages
n Provides a comprehensive view of diverse and complex data sources.
n Enables analysis across multiple business areas or departments.
n Offers flexibility for accommodating varied data structures.

4.1.4.3.4 Disadvantages
n Increased complexity in schema design and maintenance.
n Query performance may suffer due to the complexity of the model.
n Requires a deeper understanding of data relationships and business
requirements.

4.1.4.4 Fact Constellation (Fact Galaxy)


4.1.4.4.1 Definition
The fact constellation, also known as a fact galaxy or inverted star schema, is a variant of the
galaxy schema without a central fact table. Instead, each fact table is connected
directly to relevant dimension tables. This model is suitable when facts are not
directly related to each other.

4.1.4.4.2 Example

Figure 4 Sample fact constellation

In a retail business, separate fact tables may exist for sales, inventory, and
returns. Each fact table links to common dimension tables like product, date,
and location.

4.1.4.4.3 Advantages
n Facilitates easy integration of unrelated facts and dimensions.
n Simplifies data model design for scenarios with diverse data sources.
n Better suited for scenarios where facts don't have direct correlations.

4.1.4.4.4 Disadvantages
n Query performance might suffer when querying data across multiple fact
tables.
n Potential data redundancy in dimension tables due to multiple connections.
n Requires careful consideration of data relationships to avoid data
inconsistencies.

4.2 FLAT DENORMALISED TABLE

4.2.1 Definition
A flat denormalized table approach in data modeling refers to a design where
data is stored in a single table with all relevant information, including redundant
data, combined in a denormalized fashion. This approach stands in contrast to
traditional normalized data modeling, where data is distributed across multiple
related tables to reduce redundancy and improve data integrity. The flat
denormalized table approach is sometimes also called a "wide table" or
"flattened table" approach.
In this approach, instead of having separate tables for each entity (e.g.,
customers, orders, products), all relevant data is merged into a single table,
typically through the process of joining and aggregating data from various
sources. The resulting table includes all the required attributes, making it easier
to query and analyze data without the need for complex joins and relationships.

4.2.2 Example
For example, consider an e-commerce scenario where you have customer data,
order data, and product data:

Figure 5 Flat table example - Normalized approach
Figure 6 Flat table example - Denormalised
4.2.3 Advantages
n Simplified Queries: The flat table eliminates the need for complex joins,
making queries easier to write and understand.
n Improved Query Performance: By reducing join operations, query
performance can be improved, especially for analytical queries.
n Better Reporting: The denormalized table facilitates faster and simpler
reporting since all required data is available in a single location.
n Reduced Data Complexity: Data is presented in a user-friendly way,
reducing the need for complex joins and data manipulation.

4.2.4 Disadvantages
n Increased Query Processing Time: Aggregating data from a large flat
denormalized table can lead to slower query processing times. Since the table
contains redundant data, the size of the table can be substantial, and
aggregating this data requires more time and resources.

n Cost of Indexing: Indexing denormalized tables can be expensive due to
the increased size and complexity of the data. Indexes that worked well on
normalized tables may not provide the same performance benefits on
denormalized tables.
n Storage Overhead: The denormalized table contains redundant information,
leading to increased storage requirements. This can be especially problematic
when dealing with large datasets, leading to additional storage costs and
potential storage constraints.
n Increased Memory Usage: Aggregating data from a flat denormalized
table often requires more memory resources during query execution. This
can cause memory contention and potentially lead to performance
bottlenecks, particularly on systems with limited memory.
n Search Experience: Users must be careful when selecting fields for
calculations in denormalized tables, as some fields might already be
aggregated or at a different granularity level. Incorrectly choosing
aggregated fields can lead to skewed results and inaccuracies in the analysis.
n Update Anomalies: Data updates can be more cumbersome due to multiple
occurrences of the same information in the table and any changes to the data
model or relationships may require extensive modifications to the flat table.
n Mixed grain issues: The issue of mixed granularity in flat denormalized
tables arises when the table combines data at different levels of detail within
the same structure. This can lead to data inconsistencies and difficulties in
querying and aggregating data efficiently, as the table may contain attributes
from different entities with varying levels of specificity. Handling mixed
granularity requires careful consideration of data organization and querying
strategies to ensure data accuracy and proper analysis.
n Implementing slowly changing dimensions: Adding slowly changing
dimensions (SCDs) to a flat denormalized table involves introducing
additional columns to track historical changes of attributes. This increases
data redundancy and complexity in managing updates, query logic, and
referential integrity, while requiring careful planning for proper indexing and
data validation.

4.3 PRECALCULATED VIEWS/OLAP CUBE

4.3.1 Definition
A precalculated view/OLAP (Online Analytical Processing) Cube data model is a
type of multidimensional data model used in business intelligence and data
warehousing. It organizes data in a way that allows for efficient and fast analysis
of large datasets. OLAP cubes are particularly designed to facilitate analytical
queries and reporting tasks, offering a more user-friendly approach for decision-
makers to explore and gain insights from their data.

4.3.2 Example
Let's consider a retail business as an
example. The business may have a large
database containing sales data,
including information about products,
customers, regions, and time periods.
With an OLAP Cube data model, the data
can be organized into a multidimensional
structure like in Figure 7.
The OLAP cube would pre-calculate aggregations for various combinations of
dimensions and measures, creating a data structure that enables rapid querying
and analysis of data across different dimensions.

Figure 7 Sample OLAP Cube
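To illustrate what "pre-calculated" means in practice, here is a small, assumed sketch that materializes an aggregate for every combination of three dimensions, which is essentially what a cube build does (data and names are invented for illustration):

import itertools
import pandas as pd

sales = pd.DataFrame({
    "product": ["Widget", "Widget", "Gadget", "Gadget"],
    "region": ["EMEA", "AMER", "EMEA", "AMER"],
    "month": ["2023-01", "2023-01", "2023-02", "2023-02"],
    "revenue": [100.0, 200.0, 150.0, 250.0],
})

# Pre-compute an aggregate for every combination of the chosen dimensions.
dimensions = ["product", "region", "month"]
cube = {}
for r in range(1, len(dimensions) + 1):
    for dims in itertools.combinations(dimensions, r):
        # Each grouping corresponds to one pre-calculated "slice" of the cube.
        cube[dims] = sales.groupby(list(dims), as_index=False)["revenue"].sum()

# Answering "revenue by region" is now a lookup rather than a scan of the raw data.
print(cube[("region",)])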

4.3.3 Advantages
n Faster Query Performance: By pre-calculating aggregations and storing
them in the cube, OLAP queries can return results much faster than
traditional relational databases when dealing with large datasets. This
enhances the responsiveness of analytical tools and reduces query
processing time.
n Simplified Complex Queries: OLAP cubes allow users to perform complex
analytical queries with a simple drag-and-drop interface or using a few clicks.
Users can easily navigate through hierarchies, drill down into details, and

ThoughtSpot Data Security, 1st Edition | Spot-On Data Modelling | A ThoughtSpot Field Guide | Page 28
perform slice-and-dice operations to explore data from different
perspectives.
n Improved Decision Making: With OLAP cubes, decision-makers can
quickly analyze historical trends, spot patterns, and identify opportunities
and challenges in their business. The intuitive interface enables them to
interactively explore data, gain insights, and make data-driven decisions.
n Reduced Database Load: OLAP cubes offload the query processing burden
from the operational database since they store pre-aggregated data. This
helps in improving the overall performance of the transactional system as it
doesn't have to handle complex analytical queries.

4.3.4 Disadvantages
n Data Size and Maintenance: OLAP cubes can be resource intensive
because of their storage requirements. Cube sizes can grow significantly,
especially for large datasets, and they need to be regularly updated and
maintained to ensure data accuracy and relevancy.
n Limited Real-Time Data: Since OLAP cubes are typically refreshed
periodically, they may not reflect the most current data in real-time. There
could be a delay between data updates in the operational system and the
data being available in the OLAP cube.
n Cube Design Complexity: Designing an effective OLAP cube can be
complex, especially when dealing with multiple dimensions and measures.
Properly defining hierarchies, relationships, and aggregations requires careful
planning and understanding of the business requirements.
n Inflexible Schema: OLAP cubes are based on predefined dimensions and
measures, making it challenging to handle ad hoc analysis or accommodate
changes in data requirements. Any changes to the cube structure may
involve significant effort and reprocessing of data.

4.4 OLTP/TRANSACTIONAL DATA MODEL

4.4.1 Definition
An Online Transaction Processing (OLTP) or transactional data model is designed
for the efficient management of day-to-day operational transactions in a

database. It is used to process high volumes of small, individual transactions in
real-time, such as recording sales, processing orders, updating inventory, and
managing customer information. Unlike Online Analytical Processing (OLAP)
data models, OLTP focuses on quick data processing and ensures data integrity
and consistency.

4.4.2 Example
Let's consider a bank management system as an example.

Figure 8 Sample OLTP/Transactional Model

The OLTP data model for the platform may include tables such as:
n Customers: Stores information about individual customers, such as name,
contact details, and address.
n Customer Types: Information on the type of customer.
n Customer Purchases: Products and services the customer has bought.
n Orders: Records each customer's purchase orders, along with order details
like order ID, date, and status.
n Products & Services: The available products and services.
n Accounts: The bank accounts of all the customers.

n Account Types: The type of accounts available.
n Merchants: The merchants selling the various products and services.
n Transactions: The financial transactions the customers have been involved
in.
n Transaction Types: The available transaction types.
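A brief, illustrative sketch of why reporting directly on such a normalized model gets expensive: answering even a simple analytical question (revenue per customer type) has to walk a chain of joins across several of the tables above. All names and values below are hypothetical.

import pandas as pd

customers = pd.DataFrame({"customer_id": [1], "name": ["Ann"], "customer_type_id": [10]})
customer_types = pd.DataFrame({"customer_type_id": [10], "type_name": ["Retail"]})
orders = pd.DataFrame({"order_id": [100], "customer_id": [1]})
transactions = pd.DataFrame({"transaction_id": [1000], "order_id": [100], "amount": [150.0]})

# Four tables just to answer "revenue per customer type" - a long join path that
# a dimensional model would collapse into a single fact-to-dimension join.
revenue_by_type = (
    transactions
    .merge(orders, on="order_id")
    .merge(customers, on="customer_id")
    .merge(customer_types, on="customer_type_id")
    .groupby("type_name", as_index=False)["amount"]
    .sum()
)
print(revenue_by_type)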

4.4.3 Advantages
n Real-time Transaction Processing: OLTP systems are optimized for quick
and real-time processing of individual transactions. They ensure that each
transaction is promptly recorded in the database, enabling immediate access
to the latest data.
n Data Integrity: OLTP systems enforce data integrity constraints to maintain
the accuracy and consistency of data. This helps prevent errors and ensures
that the data remains reliable for day-to-day operations.
n Concurrent Access: OLTP data models are designed to support multiple
concurrent users accessing the database simultaneously. This is crucial for
applications with a large user base, and high transaction volumes.
n High Availability: OLTP databases are typically set up with redundant
hardware and failover mechanisms to ensure high availability. This minimizes
downtime and always keeps the system accessible.

4.4.4 Disadvantages
n Performance for Analytical Queries: OLTP systems are not optimized for
complex analytical queries, leading to slower query performance compared
to OLAP databases.
n Data Redundancy: Normalized data structures in OLTP systems may result
in data redundancy and increased storage requirements, impacting
performance and storage costs.
n Complexity in Reporting: Generating complex reports in OLTP systems is
challenging due to the normalized data structure, requiring multiple table
joins and leading to slower query performance.
n Challenges with Long Join Paths: Joining multiple tables in normalized
models can require additional computation and impact query performance.

n Modeling Complexity: Multiple join paths between entities make modelling
for search more challenging in normalized models.
n Lack of Codified Business Rules: Business rules are not embedded in the
model, requiring end-users to have a deeper level of data literacy training.
n Confusing Search Experience: Search experience may be confusing for
users due to the normalized structure and requires advanced modeling
techniques.
n Increased Learning Curve: Normalized models require advanced skills in
ThoughtSpot, increasing the learning curve for the data admin team.
n Additional Documentation and Technical Debt: Implementing and
maintaining normalized models requires more documentation and results in
technical debt over time.

4.5 DATA VAULT

4.5.1 Definition
A Data Vault data model is a specific approach to designing a data warehouse
that aims to provide a scalable, flexible, and reliable foundation for capturing
and storing data from various sources. It was developed by Dan Linstedt in the
early 2000s and has gained popularity as an effective approach for handling
complex data integration challenges.
The Data Vault methodology addresses some common issues that traditional
data warehousing approaches might face, such as data silos, high maintenance
costs, and difficulty in adapting to changing business requirements. It achieves
these goals through the following key principles:
n Hub-and-Spoke Architecture: The data model consists of three main types
of tables: Hubs, Links, and Satellites. This structure helps to maintain data
integrity and supports historical data tracking.
n Hubs: Hubs represent business entities and act as the central repository for
core business keys. They provide a way to uniquely identify business entities
and are essentially the primary keys for each business entity.
n Links: Links connect multiple Hubs and represent the relationships between
these entities. They store the surrogate keys from the related Hubs to
establish relationships.

ThoughtSpot Data Security, 1st Edition | Spot-On Data Modelling | A ThoughtSpot Field Guide | Page 32
n Satellites: Satellites store descriptive attributes about the Hubs and Links,
including changes over time. They hold historical data and capture changes
to attributes, allowing for easy data auditing and historical tracking.
The Data Vault model promotes a strong separation between business keys
(natural keys from source systems) and surrogate keys (system-generated keys
used internally for relationships). This separation enhances data traceability, as
the model doesn't rely on the primary keys from source systems.

4.5.2 Example

Figure 9 Sample Data Vault Model
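As a minimal, assumed sketch of the hub/link/satellite split for a customer-order relationship (hash keys and load metadata are heavily simplified, and all names are illustrative):

import pandas as pd

# Hubs hold only the business keys, plus load metadata.
hub_customer = pd.DataFrame({
    "customer_hk": ["c1"],
    "customer_number": ["CUST-001"],
    "load_date": ["2023-01-15"],
    "record_source": ["crm"],
})
hub_order = pd.DataFrame({
    "order_hk": ["o1"],
    "order_number": ["ORD-1001"],
    "load_date": ["2023-01-15"],
    "record_source": ["shop"],
})

# Links record the relationship between hubs via their surrogate keys.
link_customer_order = pd.DataFrame({
    "link_hk": ["l1"],
    "customer_hk": ["c1"],
    "order_hk": ["o1"],
    "load_date": ["2023-01-15"],
})

# Satellites carry the descriptive attributes and their full history.
sat_customer = pd.DataFrame({
    "customer_hk": ["c1", "c1"],
    "load_date": ["2023-01-15", "2023-06-01"],
    "name": ["Ann Smith", "Ann Jones"],  # attribute change tracked over time
    "city": ["London", "London"],
})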

4.5.3 Advantages
n Scalability: Data Vault is designed to scale efficiently as data volume and
complexity grow, making it suitable for large enterprises with diverse data
sources.
n Flexibility: The model can easily accommodate changes in source systems
and business requirements, reducing the impact of changes on the overall
data warehouse structure.
n Auditing and Compliance: The historical tracking of data changes in
Satellites facilitates auditing and compliance with data governance
regulations.
n Data Integration: It helps to integrate data from various sources
effectively, breaking down data silos.

ThoughtSpot Data Security, 1st Edition | Spot-On Data Modelling | A ThoughtSpot Field Guide | Page 33
4.6 MODEL TYPES COMPARED

Data Model Type | Description | Best Used For | Preferred By
Dimensional Models | Hierarchical model with a fact table(s) connected to dimension tables. | Business Intelligence (BI) and analytics, ad-hoc querying, reporting, data warehousing. | Data warehousing platforms, Business Intelligence (BI) tools, and analytical databases.
Denormalized Flat Table | Single table with all the relevant data, avoiding joins. | Fast query performance, simple analytics, real-time reporting, simple data marts. | Real-time processing platforms, NoSQL databases, streaming analytics engines.
Precalculated Views/OLAP Cube | Multidimensional data structure for fast analysis. | Complex analytical queries, pre-aggregation, multidimensional slicing and dicing, interactive dashboards. | Online Analytical Processing (OLAP) databases, data warehousing platforms with OLAP capabilities.
OLTP/Transactional Model | Designed for transactional applications. | Online transaction processing, managing day-to-day business transactions, maintaining data integrity. | Transactional databases, operational systems, and Online Transaction Processing (OLTP) platforms.
Data Vault | Focus on flexibility, scalability, historical tracking. | Large-scale data warehousing projects, handling data source changes, data integration from multiple sources, data auditing. | Large-scale data warehousing platforms, data integration tools, data lakes with structured metadata.

Here's a brief explanation of when each data model type is best used:
n Dimensional Models: Dimensional models, such as star schema and
snowflake schema, are best used for BI and analytics scenarios. They
facilitate simplified and efficient querying, reporting, and data exploration,
making them ideal for data warehousing and decision-making tasks. These
types of models are typically preferred by data warehousing platforms,
Business Intelligence (BI) tools, and analytical databases because
Dimensional models are optimized for querying and analyzing data in data
warehousing and analytical environments. They work well with BI tools and
platforms that prioritize ad-hoc querying, reporting, and data exploration.
n Denormalized Flat Table: Denormalized flat tables work best for
applications that require fast query performance and simple data structures.
They are suitable for real-time reporting, analytics on smaller datasets, and
cases where joins can be costly. These types of models are typically preferred
by Real-time processing platforms, NoSQL databases, and streaming

ThoughtSpot Data Security, 1st Edition | Spot-On Data Modelling | A ThoughtSpot Field Guide | Page 34
analytics engines because denormalized flat tables are suited for real-time
processing and simple data structures. They can be efficiently handled by
platforms that focus on fast query performance, real-time reporting, and
support for semi-structured and unstructured data.
n Precalculated Views/OLAP Cube: Precalculated views and OLAP cubes are
optimal for complex analytical queries that involve pre-aggregated data.
They offer high performance and support interactive dashboards and
multidimensional analysis. These model types are typically preferred by
Online Analytical Processing (OLAP) databases, data warehousing platforms
with OLAP capabilities, because precalculated views and OLAP cubes are
designed for complex analytical queries that involve pre-aggregated data.
OLAP databases and platforms are specifically built to handle
multidimensional analysis, and interactive dashboards.
n OLTP/Transactional Model: OLTP/transactional models are designed for
online transaction processing systems. They are used in applications that
manage day-to-day business operations and ensure data consistency and
integrity. These types of models are typically preferred by transactional
databases, operational systems, and Online Transaction Processing (OLTP)
platforms, because OLTP/transactional models are optimized for handling
business transactions and maintaining data consistency. These models are
ideal for operational systems that manage day-to-day business operations
and require real-time data processing.
n Data Vault: Data Vault is most useful for large-scale data warehousing
projects where data integration from diverse sources and historical tracking
are crucial. It offers flexibility in handling changes and scaling the data
architecture. These types of models are typically preferred by large-scale data
warehousing platforms, data integration tools, data lakes with structured
metadata, because Data Vault is suitable for data warehousing projects
involving diverse data sources, historical tracking, and data integration.
Platforms that support data lakes and structured metadata can accommodate
Data Vault implementations effectively.
Based on our experience, dimensional models typically offer superior analytical
capabilities, usually making them the preferred choice. However, specific use
cases might benefit from a denormalized table approach. When considering our
supported Cloud Data Platforms, all of them can handle both models. That said,
apart from Redshift, most platforms tend to favor dimensional models over
denormalized structures.

Best Practice 1 - Choose the right dimensional modeling
technique
Understand the primary dimensional modelling techniques
discussed in this chapter. Choose the technique that best fits the
business requirements and data complexity. Balance simplicity and
performance considerations when making this choice.

4.7 TEST YOUR KNOWLEDGE

Welcome to the Knowledge Challenge! This section is designed to put your
understanding of the topics covered in this chapter to the test. Don't worry,
it's all in good fun! Choose the best option that you believe answers each
question correctly.

1. Which of the following is an advantage of a dimensional data model?

a. Complex hierarchical relationships
b. Data redundancy and denormalization
c. Improved data quality and consistency
d. Faster query performance and ease of understanding

2. What is one of the key disadvantages associated with dimensional data models?

a. Difficulty in real-time data integration
b. Enhanced user experience and self-service data exploration
c. Improved data quality and consistency
d. Flexibility and agility in adapting to changing business requirements

3. What is a characteristic of a Star Schema?

a. It consists of a central fact table connected directly to multiple dimension tables.
b. It further normalizes dimension tables by breaking them into multiple related tables.
c. It combines multiple star schemas into a more complex and interconnected structure.
d. It does not have a central fact table and each fact table is connected directly to relevant dimension tables.

4. What is an advantage of using a Snowflake Schema?

a. Improved data storage efficiency due to reduced redundancy.
b. Easier maintenance when updating shared dimension attributes.
c. Increased query complexity due to multiple joins required.
d. Optimal for analytical processes involving aggregations.

5. Which type of schema is suitable for scenarios where facts don't have direct correlations?

a. Star Schema
b. Snowflake Schema
c. Galaxy Schema
d. Fact Constellation

6. What is a characteristic of a Flat Denormalized Table approach in data modeling?

a. It reduces query processing time by eliminating the need for aggregations.
b. It involves breaking down dimension tables into multiple related tables.
c. It stores data in a single table with all relevant information combined.
d. It organizes data in a multidimensional structure for efficient analysis.

7. What is an advantage of using a Pre-calculated View/OLAP Cube data model?

a. Simplified Complex Queries
b. Real-time Transaction Processing
c. High Availability
d. Improved Query Performance

8. Which data model type is optimized for handling day-to-day operational transactions in a database?

a. Star Schema
b. Data Vault
c. OLTP/Transactional Model
d. Snowflake Schema

9. Which data model type is designed for large-scale data warehousing projects involving diverse data sources and historical tracking?

a. Star Schema
b. Snowflake Schema
c. Flat Denormalized Table
d. Data Vault

Answers:
1: d, 2: a, 3: a, 4: a, 5: d, 6: c, 7: d, 8: c, 9: d

5 Fundamentals: Facts and dimensions

5.1 INTRODUCTION

Understanding the foundational concepts of dimensional modeling is essential.


This chapter sets the stage by embarking on a comprehensive journey through
the intricate world of Fact Types and Dimension Types. We delve into the core
principles that govern the structuring of data, facilitating meaningful analysis
and reporting.
In the first section, "Fact Types," we dissect various types of fact tables, each
tailored to capture specific facets of business operations. From temporal insights
offered by Time-Related Fact Tables to the nuanced analysis of Aggregation and
Analysis Fact Tables, we explore how these structures empower organizations
to extract valuable insights. The journey continues into the realm of Context
and Relationship Fact Tables, unveiling the significance of capturing connections
between dimensions. Specialized Fact Tables reveal the potential of
unconventional data representation.

Please note that the guidelines in this chapter


are general in nature. Your choice of dimension
and fact types will depend on the specific
requirements and complexities of your data
project. Consider the unique attributes of your
data and your analytical needs when selecting a
dimension type.
Also, understand that the examples provided
here aspire to showcase different types of facts
and dimensions. There is no one-size-fits-all
solution, and no rigid rule dictates using a
specific type in every scenario. The decision
largely hinges on your individual use case. While
data modeling has mathematical aspects, it also
involves an artful understanding of the data and
its intended use.

Moving forward, we transition to "Dimension Types," where the intricacies of
hierarchical organization, data consistency, and optimization strategies come to
light.

5.2 FACT TABLE TYPES

We have grouped the fact table types based on their similarities in terms of
characteristics, purposes, and use cases.

5.2.1 Time-Related Fact Tables

5.2.1.1 Transactional Fact Table

5.2.1.1.1 Definition
The transactional fact table captures detailed data at the most granular level,
representing individual business transactions or events.

5.2.1.1.2 Use Case


Suitable for operational and transactional reporting, tracking day-to-day
business activities.

5.2.1.1.3 Advantages
n Detailed Analysis: Provides granular data for in-depth analysis of individual
transactions.
n Accurate Reporting: Enables precise transactional reporting and
performance tracking.

5.2.1.1.4 Considerations
n Data Volume: May lead to a large data volume, requiring efficient storage
and query optimization.
n Query Performance: Complex queries on detailed data might affect query
response time.

5.2.1.1.5 Example
OrderID ProductID CustomerID OrderDate Quantity SalesAmount

1001 P101 C501 2023-01-15 3 $150.00

1002 P102 C502 2023-01-18 2 $120.00

... ... ... ... ... ...

Table 1 Example Transactional Fact Table

Best Practice 2 - When to use a transactional fact table?


n When you need to capture individual business transactions or
events at a granular level.
n Your focus is on operational and transactional reporting.
n You are looking for detailed analysis and precise transactional
reporting.

5.2.1.2 Snapshot Fact Table


5.2.1.2.1 Definition
The snapshot fact table captures data at specific points in time, facilitating
historical analysis and trend identification.

5.2.1.2.2 Use Case


Useful for tracking changes over time, historical reporting, and identifying
trends.

5.2.1.2.3 Advantages
n Historical Analysis: Supports historical reporting and trend identification.
n Simplified Queries: Provides fixed snapshots for simplified querying.

5.2.1.2.4 Considerations
n Data Redundancy: Snapshot tables may duplicate data for each snapshot
period.
n Limited Granularity: Granularity is fixed at snapshot intervals, limiting
real-time analysis.

5.2.1.2.5 Example
SnapshotDate ProductID QuantityInStock Price

2023-01-31 P101 150 $50.00

2023-01-31 P102 80 $70.00

... ... ... ...

Table 2 Example snapshot fact table
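As a rough illustration, the sketch below derives such a month-end snapshot from a transactional source using Python and pandas. The movements table, its column names, and the snapshot date are hypothetical assumptions made purely for this example.

```python
import pandas as pd

# Hypothetical inventory movements captured at transaction grain
movements = pd.DataFrame({
    "ProductID": ["P101", "P101", "P102"],
    "MovementDate": pd.to_datetime(["2023-01-10", "2023-01-25", "2023-01-12"]),
    "QuantityChange": [200, -50, 80],
})

# Build the snapshot row set: stock on hand as of a fixed point in time
snapshot_date = pd.Timestamp("2023-01-31")
snapshot = (
    movements[movements["MovementDate"] <= snapshot_date]
    .groupby("ProductID", as_index=False)["QuantityChange"].sum()
    .rename(columns={"QuantityChange": "QuantityInStock"})
    .assign(SnapshotDate=snapshot_date)
)
print(snapshot)
```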

Best Practice 3 - When to use a snapshot fact table?


n You want to capture data at specific points in time for historical
analysis and trend identification.
n Your goal is historical reporting and trend analysis.
n You are willing to accept fixed granularity at snapshot
intervals and potential data redundancy.

5.2.1.3 Accumulating snapshot fact table


5.2.1.3.1 Definition
The accumulating snapshot fact table captures cumulative or aggregated data
over time to track progress or changes.

5.2.1.3.2 Use Case


Ideal for tracking progress or state changes over intervals.

5.2.1.3.3 Advantages
n Progress Tracking: Monitors progress or state changes over intervals.
n Milestone Analysis: Useful for tracking milestones and stages.

5.2.1.3.4 Considerations
n Data Updates: Requires updates as milestones are reached, potentially
impacting data integrity.
n Complexity: Handling incremental updates can add complexity to ETL
processes.

5.2.1.3.5 Example
ProjectID MilestoneDate TasksCompleted TotalTasks

P001 2023-01-15 5 10

P001 2023-02-15 8 10

... ... ... ...

Table 3 Example accumulating snapshot fact table
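A minimal sketch of how the cumulative counts might be produced with pandas, assuming a hypothetical task-completion feed and a fixed project scope; all names here are illustrative only.

```python
import pandas as pd

# Hypothetical feed: one row per completed task
tasks = pd.DataFrame({
    "ProjectID": ["P001"] * 8,
    "CompletedDate": pd.to_datetime(
        ["2023-01-05", "2023-01-10", "2023-01-12", "2023-01-14",
         "2023-01-15", "2023-02-01", "2023-02-07", "2023-02-14"]),
})
total_tasks = 10  # assumed project scope

# One row per milestone date, accumulating completions up to that date
milestones = pd.to_datetime(["2023-01-15", "2023-02-15"])
accumulating = pd.DataFrame([
    {"ProjectID": "P001",
     "MilestoneDate": m,
     "TasksCompleted": int((tasks["CompletedDate"] <= m).sum()),
     "TotalTasks": total_tasks}
    for m in milestones
])
print(accumulating)
```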

Best Practice 4 - When to use an accumulating snapshot


fact table?
n You are tracking progress or state changes over intervals.
n Milestone analysis is a key requirement.
n You can manage the complexity of handling incremental
updates.

5.2.1.4 Periodic snapshot fact table


5.2.1.4.1 Definition
The periodic snapshot fact table captures data at regular intervals, enabling
analysis over those intervals.

5.2.1.4.2 Use Case


Useful for trend analysis and performance monitoring over specific time periods.

5.2.1.4.3 Advantages
n Trend Analysis: Facilitates trend analysis over consistent time intervals.
n Performance Optimization: Provides aggregated data for faster querying.

5.2.1.4.4 Considerations
n Limited Flexibility: Aggregated data might not capture all details of
individual transactions.
n Data Staleness: Data may not reflect real-time changes between snapshot
intervals.

5.2.1.4.5 Example
WeekEnding ProductID UnitsSold Revenue

2023-01-07 P101 120 $6,000

2023-01-07 P102 90 $5,400

... ... ... ...

Table 4 Example periodic snapshot fact table
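The sketch below shows one way to roll transaction-grain order lines up to the week-ending grain of Table 4 using pandas; the source table and its column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical order lines at transaction grain
orders = pd.DataFrame({
    "OrderDate": pd.to_datetime(["2023-01-02", "2023-01-05", "2023-01-06"]),
    "ProductID": ["P101", "P101", "P102"],
    "UnitsSold": [50, 70, 90],
    "Revenue": [2500, 3500, 5400],
})

# Aggregate to a regular weekly interval (weeks ending Saturday)
weekly = (
    orders
    .groupby([pd.Grouper(key="OrderDate", freq="W-SAT"), "ProductID"])
    [["UnitsSold", "Revenue"]].sum()
    .reset_index()
    .rename(columns={"OrderDate": "WeekEnding"})
)
print(weekly)
```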

Best Practice 5 - When to use a periodic snapshot fact


table?
n You need to perform trend analysis and performance monitoring
over specific time periods.
n You can handle data staleness between snapshot intervals.
n The trade-off of limited flexibility is acceptable for your use
case.

5.2.1.5 Partitioned Fact Table


5.2.1.5.1 Definition
The partitioned fact table is split into smaller partitions based on a specific
attribute.

5.2.1.5.2 Use Case


Optimizes query performance and enhances manageability by restricting data
access.

5.2.1.5.3 Advantages
n Query Performance: Enhances query performance by restricting data
access and reducing join complexity.
n Manageability: Improves data management and maintenance by
segmenting data into smaller partitions.
n Historical Data: Facilitates efficient handling of historical data and time-
based queries.

5.2.1.5.4 Considerations
n Partitioning Strategy: Requires careful selection of partitioning key and
strategy to ensure optimal performance.
n Data Loading: Loading data into partitioned tables may require specific ETL
processes.
n Complexity: Managing partitions and optimizing queries may require
advanced database management skills.

5.2.1.5.5 Example
Region Date ProductID UnitsSold

East 2023-01-15 P101 100

East 2023-01-15 P102 80

... ... ... ...

Table 5 Example partitioned fact table
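As an illustration only, the snippet below physically partitions a small fact by Region using pandas and pyarrow; the file path, table, and column names are hypothetical, and in practice the same idea is usually applied natively inside the data warehouse.

```python
import pandas as pd

sales = pd.DataFrame({
    "Region": ["East", "East", "West"],
    "Date": pd.to_datetime(["2023-01-15", "2023-01-15", "2023-01-16"]),
    "ProductID": ["P101", "P102", "P101"],
    "UnitsSold": [100, 80, 60],
})

# Write one folder per Region value (requires pyarrow); queries filtered
# on Region then only need to read the matching partition.
sales.to_parquet("sales_fact", partition_cols=["Region"], index=False)

# Reading back with a partition filter prunes the other regions
east_only = pd.read_parquet("sales_fact", filters=[("Region", "=", "East")])
print(east_only)
```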

Best Practice 6 - When to use a partitioned fact table?


n You need to optimize query performance and data management
by segmenting data.
n You are willing to invest in the careful selection of partitioning
key and strategies.
n You can manage the complexity of optimizing queries and data
loading for partitions.

5.2.1.6 Key differences between these fact table types


Aspect: Definition
n Transactional Fact Table: Captures individual business transactions with detailed, atomic-level data.
n Snapshot Fact Table: Captures data at specific points in time for historical analysis and trend identification.
n Accumulating Snapshot Fact Table: Captures cumulative or aggregated data over time to track progress or changes.
n Periodic Snapshot Fact Table: Captures data at regular intervals for analysis over those intervals.
n Partitioned Fact Table: Splits a fact table into smaller partitions based on a specific attribute, enhancing performance and manageability.

Aspect: Use Case / Purpose
n Transactional Fact Table: Tracks operational data and transactional reporting.
n Snapshot Fact Table: Tracks changes over time, historical reporting, and trend analysis.
n Accumulating Snapshot Fact Table: Monitors progress or state changes over intervals.
n Periodic Snapshot Fact Table: Trend analysis and performance monitoring over specific time periods.
n Partitioned Fact Table: Optimizes query performance and data management by segmenting data.

Aspect: Granularity of Data
n Transactional Fact Table: Granular, detailed data at the atomic level.
n Snapshot Fact Table: Captures data at fixed intervals, providing snapshots of data.
n Accumulating Snapshot Fact Table: Cumulative or incremental data capturing progress over stages.
n Periodic Snapshot Fact Table: Captures data at regular, structured intervals.
n Partitioned Fact Table: Optimizes data access and performance through data partitioning.

Aspect: Time-Based Considerations
n Transactional Fact Table: Focuses on individual transactional events.
n Snapshot Fact Table: Captures specific points in time for analysis.
n Accumulating Snapshot Fact Table: Captures changes over time with incremental updates.
n Periodic Snapshot Fact Table: Captures data at consistent intervals, facilitating time-based analysis.
n Partitioned Fact Table: Enhances query performance for specific attributes, often time-related.

Aspect: Example
n Transactional Fact Table: Order transactions with attributes like order date, product, quantity, and amount.
n Snapshot Fact Table: Monthly snapshots of sales data by product category.
n Accumulating Snapshot Fact Table: Project progress tracking with cumulative task completion percentages.
n Periodic Snapshot Fact Table: Weekly website visit counts for different pages.
n Partitioned Fact Table: Sales data partitioned by regions for improved query performance.

Table 6 Key differences between these fact table types

5.2.2 Aggregation and analysis fact tables

5.2.2.1 Cumulative fact table

5.2.2.1.1 Definition
The cumulative fact table stores cumulative or running total values over time.

5.2.2.1.2 Use Case


Suitable for tracking accumulative measures, such as revenue, for analysis.

5.2.2.1.3 Advantages
n Running Totals: Easily tracks cumulative values over time.
n Insightful Analysis: Enables analysis of progressive measures.

5.2.2.1.4 Considerations
n Data Volume: Cumulative values may grow significantly over time.
n Complexity: Managing running totals requires careful maintenance.

5.2.2.1.5 Example
Date ProductID CumulativeRevenue

2023-01-01 P101 $10,000

2023-01-02 P101 $11,500

... ... ...

Table 7 Example cumulative fact table
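A minimal sketch of building such a running total with pandas; the daily revenue feed and its columns are assumed for illustration.

```python
import pandas as pd

daily = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-01"]),
    "ProductID": ["P101", "P101", "P102"],
    "Revenue": [10000.0, 1500.0, 4000.0],
})

# Running total per product, ordered by date
cumulative = daily.sort_values(["ProductID", "Date"]).copy()
cumulative["CumulativeRevenue"] = (
    cumulative.groupby("ProductID")["Revenue"].cumsum()
)
print(cumulative)
```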

Best Practice 7 - When to use a cumulative fact table?


n You need to track cumulative or running total values over time
n You are analyzing progressive measures
n You are prepared to manage data volume growth and
maintenance complexity

5.2.2.2 Aggregated Fact Table


5.2.2.2.1 Definition
The aggregated fact table stores pre-aggregated data to improve query
performance.

5.2.2.2.2 Use Case


Enhances query response time for frequently used summary reports.

5.2.2.2.3 Advantages
n Improved Performance: Enhances query response time for summary-level
reporting.
n Query Simplicity: Simplifies complex calculations during querying.

5.2.2.2.4 Considerations
n Data Loss: Aggregation may lead to loss of detail, affecting detailed
analysis.
n Data Refresh: Requires periodic refresh to incorporate new data.

5.2.2.2.5 Example
Year Month ProductCategory TotalSales

2023 Jan Electronics $50,000

2023 Jan Clothing $30,000

... ... ... ...

Table 8 Example aggregated fact table
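The sketch below pre-aggregates hypothetical order lines to the Year / Month / Category grain shown in Table 8; the table and column names are assumptions.

```python
import pandas as pd

lines = pd.DataFrame({
    "OrderDate": pd.to_datetime(["2023-01-03", "2023-01-20", "2023-01-08"]),
    "ProductCategory": ["Electronics", "Electronics", "Clothing"],
    "SalesAmount": [30000, 20000, 30000],
})

# Pre-aggregate once during ETL so summary reports avoid scanning detail rows
aggregated = (
    lines
    .assign(Year=lines["OrderDate"].dt.year,
            Month=lines["OrderDate"].dt.strftime("%b"))
    .groupby(["Year", "Month", "ProductCategory"], as_index=False)["SalesAmount"]
    .sum()
    .rename(columns={"SalesAmount": "TotalSales"})
)
print(aggregated)
```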

Best Practice 8 - When to use an aggregated fact table?


n You need to enhance query response time for summary-level
reporting.
n Simplifying complex calculations during querying is a
requirement.
n You can accept potential data loss due to aggregation.

5.2.2.3 Derived Fact Table

5.2.2.3.1 Definition
The derived fact table contains calculated measures derived from other fact
tables or external sources.

5.2.2.3.2 Use Case


Offers a centralized source for calculated measures, reducing redundancy and
simplifying analysis.

5.2.2.3.3 Advantages
n Centralized Calculations: Offers a consolidated source for calculated
measures.
n Reduced Redundancy: Eliminates redundancy in storing calculated values.

5.2.2.3.4 Considerations
n Maintenance: Requires updates if source data or calculations change.
n Processing Overhead: Deriving measures during ETL adds processing
overhead.

5.2.2.3.5 Example
Date ProductID SalesAmount CostAmount Profit

2023-01-15 P101 $1,000 $700 $300

2023-01-18 P102 $800 $500 $300

... ... ... ... ...

Table 9 Example derived fact table
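A small sketch of deriving the Profit measure during ETL rather than at query time; the input columns follow the example above, everything else is assumed.

```python
import pandas as pd

sales = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-15", "2023-01-18"]),
    "ProductID": ["P101", "P102"],
    "SalesAmount": [1000.0, 800.0],
    "CostAmount": [700.0, 500.0],
})

# Derive the measure once so every consumer sees the same calculation
derived = sales.assign(Profit=sales["SalesAmount"] - sales["CostAmount"])
print(derived)
```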

Best Practice 9 - When to use a derived fact table?


n You need to store calculated measures derived from other fact
tables or external sources.
n Consolidated calculated measures are a requirement.
n You are prepared for updates if source data or calculations
change

5.2.2.4 Key differences between these fact table types


Aspect: Definition
n Cumulative Fact Table: Contains cumulative values or running totals over time.
n Aggregated Fact Table: Stores pre-aggregated data to improve query performance.
n Derived Fact Table: Contains calculated measures derived from other fact tables or external sources.

Aspect: Use Case / Purpose
n Cumulative Fact Table: Tracks accumulative measures over time, enabling trend analysis.
n Aggregated Fact Table: Enhances query response time for summary-level reporting.
n Derived Fact Table: Offers a consolidated view of calculated measures, reducing redundancy.

Aspect: Data Aggregation
n Cumulative Fact Table: Captures incremental changes to a measure over time, building a running total.
n Aggregated Fact Table: Stores summarized data at different levels of granularity for faster querying.
n Derived Fact Table: Stores calculated measures derived from existing measures or external data.

Aspect: Examples
n Cumulative Fact Table: Running total of year-to-date sales.
n Aggregated Fact Table: Quarterly total revenue by region.
n Derived Fact Table: Profit margin calculated as (Revenue - Costs).

Table 10 Key differences between these fact table types

5.2.3 Context and relationship fact tables

5.2.3.1 Factless fact table

5.2.3.1.1 Definition
A Factless Fact Table captures events or occurrences without measures. It
contains only foreign keys referring to dimension tables and serves as a "fact"
to represent relationships between dimensions.

5.2.3.1.2 Use Case


Factless Fact Tables are useful for tracking patterns, associations, or scenarios
where no direct measures apply. They are particularly valuable for capturing
relationships between dimensions.

5.2.3.1.3 Advantages
n Relationship Tracking: Captures relationships between dimensions.
n Pattern Identification: Useful for identifying patterns and associations.

5.2.3.1.4 Considerations
n Lack of measures: No measures for direct analysis; relies on associations.
n Complexity: Requires joins with dimension tables for meaningful analysis.

5.2.3.1.5 Example
Date CustomerID ProductID Action

2023-01-15 C501 P101 Purchase

2023-01-18 C502 P102 Purchase

... ... ... ...

Table 11 Example factless fact table
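Since the table carries no measures, analysis comes from counting rows after joining to the dimensions. A minimal sketch with pandas, assuming a hypothetical product dimension that carries a Category attribute:

```python
import pandas as pd

# Factless fact: one row per customer/product event, no numeric measure
events = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-15", "2023-01-18"]),
    "CustomerID": ["C501", "C502"],
    "ProductID": ["P101", "P102"],
    "Action": ["Purchase", "Purchase"],
})
product_dim = pd.DataFrame({
    "ProductID": ["P101", "P102"],
    "Category": ["Electronics", "Clothing"],
})

# Count the occurrences per category by joining and sizing the groups
purchases_per_category = (
    events.merge(product_dim, on="ProductID")
    .groupby("Category").size()
    .rename("PurchaseCount")
)
print(purchases_per_category)
```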

Best Practice 10 - When to use a factless fact table?
n You are capturing events or occurrences without measures but
with a focus on relationships between dimensions.
n You need to track patterns and associations.
n You are comfortable with additional joins with dimension tables
for meaningful analysis

5.2.3.2 Bridge Fact Table

5.2.3.2.1 Definition
A Bridge Table resolves a many-to-many relationship between dimension tables
by creating a link between them. It contains the primary keys of both related
dimensions and may include additional attributes related to the relationship.

5.2.3.2.2 Use Case


Bridge Tables handle complex relationships between dimensions, allowing for
detailed analysis and reporting of many-to-many associations.

5.2.3.2.3 Advantages
n Many-to-Many Resolution: Resolves complex many-to-many
relationships.
n Detailed Analysis: Allows detailed analysis of associations.

5.2.3.2.4 Considerations
n Data Redundancy: May duplicate dimension data, affecting storage
efficiency.
n Query Complexity: Requires additional joins, potentially impacting query
performance.

5.2.3.2.5 Example
StudentID CourseID EnrollmentDate

S501 C101 2023-01-15

S502 C101 2023-01-18

... ... ...

Table 12 Example bridge fact table
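A minimal sketch of navigating the many-to-many relationship through the bridge with pandas; the student and course dimensions shown here are hypothetical.

```python
import pandas as pd

students = pd.DataFrame({"StudentID": ["S501", "S502"], "Name": ["Ana", "Bo"]})
courses = pd.DataFrame({"CourseID": ["C101"], "Title": ["Data Modeling"]})

# Bridge table: one row per student/course pair
enrollment_bridge = pd.DataFrame({
    "StudentID": ["S501", "S502"],
    "CourseID": ["C101", "C101"],
    "EnrollmentDate": pd.to_datetime(["2023-01-15", "2023-01-18"]),
})

# Resolve the many-to-many association by joining through the bridge
course_roster = (
    enrollment_bridge
    .merge(students, on="StudentID")
    .merge(courses, on="CourseID")
)
print(course_roster)
```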

Best Practice 11 - When to use a bridge fact table?


n You need to resolve many-to-many relationships between
dimensions.
n You are dealing with complex relationships that require detailed
analysis of many-to-many associations.
n You can manage data redundancy and accept additional joins.

5.2.3.3 Key differences between these fact table types


Aspect: Definition
n Factless Fact Table: Captures events or occurrences without measures.
n Bridge Table: Resolves many-to-many relationships between dimensions.

Aspect: Use Case / Purpose
n Factless Fact Table: Tracks patterns, associations, or scenarios where no direct measures apply.
n Bridge Table: Handles complex relationships, enabling detailed analysis of many-to-many associations.

Aspect: Contents
n Factless Fact Table: Contains foreign keys referring to dimensions.
n Bridge Table: Contains primary keys of related dimensions and may include attributes related to the relationship.

Aspect: Focus
n Factless Fact Table: Represents relationships between dimensions.
n Bridge Table: Acts as an intermediary to navigate and resolve complex associations between dimensions.

Table 13 Key differences between factless fact tables and bridge fact tables

5.2.4 Specialized Fact Tables

5.2.4.1 Multi-Valued Fact Table

5.2.4.1.1 Definition
A multi-valued fact table is designed to capture and represent multiple related
attributes or values for a single business event, transaction, or occurrence.
These attributes are often non-numeric and provide additional context or
dimensions to the event.

5.2.4.1.2 Use Case
Multi-valued fact tables are used in situations where a single event can be
associated with multiple attributes or values that don't fit into a traditional
measure-based fact table. Some common use cases include:
n Categorization: Capturing multiple categories or labels associated with an
event. For example, categorizing customer complaints by type, severity, and
resolution.
n Tags or Keywords: Storing keywords or tags associated with a document,
article, or product.
n Attributes with Variable Counts: Tracking varying counts of different
attributes. For instance, recording the number of times each type of service
was performed during a maintenance visit.
n Multi-Valued Relationships: Representing relationships between entities.
For example, capturing multiple recipients for a single email.

5.2.4.1.3 Advantages
n Rich Context: Multi-valued fact tables provide richer context and additional
dimensions to business events.
n Flexible Analysis: They enable flexible analysis and reporting on attributes
that don't fit into traditional measures.

5.2.4.1.4 Considerations
n Data Volume: Multi-valued fact tables may increase data volume due to the
inclusion of multiple attributes for each event.
n Complexity: Managing and querying multi-valued attributes may add
complexity to data processing.

5.2.4.1.5 Example
Consider a scenario in an e-commerce system where a customer places an order
for multiple products, each having different attributes like color, size, and
quantity. A traditional fact table might capture the order total and quantities
sold, but a multi-valued fact table would store the following:

OrderID ProductID Color Size Quantity

1001 P101 Red M 2

1001 P102 Blue L 1

1001 P103 Green S 3

Table 14 Example multi-value fact table
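As a rough sketch, a multi-valued attribute often arrives as a list of values per event and is expanded into one fact row per value. The email/recipient scenario below is an assumption used purely to illustrate the expansion step.

```python
import pandas as pd

# Hypothetical source where a single email event has several recipients
emails = pd.DataFrame({
    "EmailID": ["E1", "E2"],
    "SentDate": pd.to_datetime(["2023-01-15", "2023-01-18"]),
    "Recipients": [["C501", "C502"], ["C503"]],
})

# Expand the multi-valued attribute into one row per recipient
multi_valued_fact = (
    emails.explode("Recipients")
    .rename(columns={"Recipients": "RecipientID"})
    .reset_index(drop=True)
)
print(multi_valued_fact)
```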

Best Practice 12 - When to use a multi-value fact table?


n You are capturing multiple related attributes or values for a
single business event, transaction, or occurrence.
n You require rich context and additional dimensions for your
data.
n You are willing to manage potential data volume increase and
increased complexity.

5.2.5 Fact table types: wrapping it up


This section discusses various types of fact tables, each designed to meet specific data
storage and analysis needs. We categorize these fact table types into Time-
Related, Aggregation and Analysis, Context and Relationship, and Specialized
Fact Tables. Selecting the right type of fact table depends on the specific use
case and objectives, considering factors such as data granularity, query
performance, historical analysis, and relationship tracking.
The various fact table types are summarized below.

Figure 10 Fact table types

Time-Related Fact Tables

Transactional Fact Table (Granular, Operational, Detailed)
n When you need to capture individual business transactions or events at a granular level.
n Your focus is on operational and transactional reporting.
n You are looking for detailed analysis and precise transactional reporting.

Snapshot Fact Table (Historical, Fixed Intervals)
n You want to capture data at specific points in time for historical analysis and trend identification.
n Your goal is historical reporting and trend analysis.
n You are willing to accept fixed granularity at snapshot intervals and potential data redundancy.

Accumulating Snapshot Fact Table (Progress Tracking, Milestones)
n You are tracking progress or state changes over intervals.
n Milestone analysis is a key requirement.
n You can manage the complexity of handling incremental updates.

Periodic Snapshot Fact Table (Trend Analysis, Limited Flexibility)
n You need to perform trend analysis and performance monitoring over specific time periods.
n You can handle data staleness between snapshot intervals.
n The trade-off of limited flexibility is acceptable for your use case.

Partitioned Fact Table (Query Optimization, Data Segmentation)
n You need to optimize query performance and data management by segmenting data.
n You are willing to invest in the careful selection of partitioning key and strategies.
n You can manage the complexity of optimizing queries and data loading for partitions.

Aggregation and Analysis Fact Tables

Cumulative Fact Table (Running Totals, Progressive Analysis)
n You need to track cumulative or running total values over time.
n You are analyzing progressive measures.
n You are prepared to manage data volume growth and maintenance complexity.

Aggregated Fact Table (Query Performance, Data Loss)
n You need to enhance query response time for summary-level reporting.
n Simplifying complex calculations during querying is a requirement.
n You can accept potential data loss due to aggregation.

Derived Fact Table (Calculated Measures, Updates)
n You need to store calculated measures derived from other fact tables or external sources.
n Consolidated calculated measures are a requirement.
n You are prepared for updates if source data or calculations change.

Context and Relationship Fact Tables

Factless Fact Table (Relationship Tracking, Patterns)
n You are capturing events or occurrences without measures but with a focus on relationships between dimensions.
n You need to track patterns and associations.
n You are comfortable with additional joins with dimension tables for meaningful analysis.

Bridge Fact Table (Many-to-Many, Complex Relationships)
n You need to resolve many-to-many relationships between dimensions.
n You are dealing with complex relationships that require detailed analysis of many-to-many associations.
n You can manage data redundancy and accept additional joins.

Specialized Fact Tables

Multi-Valued Fact Table (Multiple Attributes, Rich Context)
n You are capturing multiple related attributes or values for a single business event, transaction, or occurrence.
n You require rich context and additional dimensions for your data.
Table 15 Summary of fact table types

5.3 DIMENSION TYPES

There is a multitude of different dimension types, and each is used for different
purposes and use cases. For ease of reading, we have grouped them by
similarities in purpose and characteristics.

5.3.1 Hierarchical and Organizational Dimensions

5.3.1.1 Role-Playing Dimensions

5.3.1.1.1 Description
A role-playing dimension is a single dimension table that is used multiple times
in a fact table, each time representing a different perspective or "role." Each
role of the dimension represents a different attribute or set of attributes within
the same dimension table. For example, a date dimension could be used for
both "Order Date" and "Shipping Date" in a sales scenario.

5.3.1.1.2 Use Case


Role-playing dimensions are used when you need to analyze a fact table from
different temporal perspectives, each associated with a different attribute of the
same dimension. This allows for consistent and meaningful analysis without the
need to duplicate dimension tables.

5.3.1.1.3 Example
Date Key Year Month Day

20230101 2023 01 01

20230102 2023 01 02

... ... ... ...

Table 16 Example role playing dimension

In a retail business, the "Date" dimension is reused in both the "Sales" and
"Returns" fact tables. In the "Sales" fact table, it's used as "Order Date," while
in the "Returns" fact table, it's the "Return Date." This allows separate analysis
of sales and returns using the same dimension.
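A minimal sketch of the same date dimension playing two roles against one fact, using pandas merges; the key and column names are illustrative assumptions.

```python
import pandas as pd

date_dim = pd.DataFrame({
    "DateKey": [20230101, 20230105],
    "Year": [2023, 2023],
    "Month": [1, 1],
    "Day": [1, 5],
})
sales_fact = pd.DataFrame({
    "OrderID": [1001],
    "OrderDateKey": [20230101],
    "ShipDateKey": [20230105],
})

# The single date dimension is joined twice, once per role
enriched = (
    sales_fact
    .merge(date_dim.add_prefix("Order"), on="OrderDateKey")  # role: Order Date
    .merge(date_dim.add_prefix("Ship"), on="ShipDateKey")    # role: Ship Date
)
print(enriched[["OrderID", "OrderYear", "OrderDay", "ShipYear", "ShipDay"]])
```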

Best Practice 13 - When to use role playing dimensions
n When you need to analyze a fact table from different temporal
perspectives, each associated with a different attribute of the
same dimension. This allows for consistent and meaningful
analysis without the need to duplicate dimension tables.

5.3.1.2 Fixed-depth hierarchies

5.3.1.2.1 Definition
A fixed depth hierarchy refers to a hierarchical structure within a dimension
where the number of levels in the hierarchy is predetermined and remains
consistent. Each level represents a specific attribute of the dimension, and the
hierarchy is defined with a fixed number of levels. Fixed-depth hierarchies are
often used for organized and standardized analysis, allowing users to drill down
or roll up within the defined levels. Note that there is a section on hierarchies,
including non-fixed-depth hierarchies, later in this document (Chapter 10).

5.3.1.2.2 Use case


Fixed depth hierarchies are used when you have a well-defined and consistent
structure for hierarchical attributes. They provide an organized way to navigate
and analyze dimension data at different levels of granularity.

5.3.1.2.3 Example
Country State City

USA New York New York

USA California Los Angeles

... ... ...

Table 17 Example fixed-depth hierarchy

A "Geography" dimension with a fixed three-level hierarchy: "Country," "State,"


and "City." Users can consistently drill down from country-level data to city-level
data, facilitating analysis.

Best Practice 14 - When to use fixed-depth hierarchy
dimensions
n When you have a well-defined and consistent structure for
hierarchical attributes. They provide an organized way to
navigate and analyze dimension data at different levels of
granularity.

5.3.1.3 Key differences between these entity types


Aspect: Multiplicity
n Role-Playing Dimensions: A single dimension table is used multiple times with different roles.
n Fixed Depth Hierarchies: A single hierarchy within a dimension with a fixed number of levels.

Aspect: Perspective vs. Structure
n Role-Playing Dimensions: Represent different perspectives or attributes of the same dimension.
n Fixed Depth Hierarchies: Define a structured hierarchical arrangement of attributes within a dimension.

Aspect: Variety of Data
n Role-Playing Dimensions: Each role may have distinct data related to the perspective it represents.
n Fixed Depth Hierarchies: Focus on organizing data within a predefined hierarchy.

Aspect: Analytical Flexibility
n Role-Playing Dimensions: Allow for different analyses of the same dimension from multiple temporal viewpoints.
n Fixed Depth Hierarchies: Provide consistent analysis through organized hierarchical navigation.

Table 18 Key differences between role-playing dimensions and fixed depth hierarchies

In summary, role-playing dimensions involve using a single dimension table for


different perspectives, while fixed depth hierarchies focus on organizing
dimension data in a structured hierarchical arrangement with a fixed number of
levels. Each serves a unique purpose in dimensional modeling and contributes
to effective data analysis.

5.3.2 Data Consistency and Reusability Dimensions

5.3.2.1 Conformed Dimensions


5.3.2.1.1 Definition
A conformed dimension refers to a dimension that is consistent and shared
across multiple fact tables within the same data warehouse. It ensures that the

same dimension, with the same attributes and values, is used consistently
across different areas of analysis. Conformed dimensions help maintain data
integrity, simplify data integration, and enable accurate cross-functional
reporting.

5.3.2.1.2 Use Case


Conformed dimensions are used when you have multiple fact tables that need
to share the same dimension for consistent analysis and reporting. For example,
if you have separate fact tables for sales, inventory, and returns, a conformed
product dimension would ensure that product-related attributes like product ID,
product name, and category are consistent across these fact tables.

5.3.2.1.3 Example
Product ID Product Name Category Manufacturer

P001 Smartphone Electronics ABC Corp

P002 Laptop Electronics XYZ Inc

... ... ... ...

Table 19 Example conformed dimension

A multinational corporation has multiple business units, each with its own sales
data. However, they all share a common "Product" dimension with consistent
attributes like product ID, name, and category. This ensures unified reporting
across the organization.

Best Practice 15 - When to use conformed dimensions


n When you have multiple fact tables that need to share the same
dimension for consistent analysis and reporting. Ensures data
integrity, simplifies integration, and enables accurate cross-
functional reporting.

5.3.2.2 Universal Dimensions


5.3.2.2.1 Definition
A universal dimension refers to a dimension that is used to track data elements
that are common and consistent across multiple fact tables, often relating to
standard measurements or classifications. Universal dimensions are designed to

reduce redundancy by centralizing the storage of attributes that have the same
meaning across different analytical contexts.

5.3.2.2.2 Use Case


Universal dimensions are used when certain attributes, such as currency codes,
units of measure, or geographic regions, are shared across multiple fact tables.
Instead of creating duplicate instances of the same dimension for each fact
table, a universal dimension allows you to maintain a single set of attributes
that can be referenced by different fact tables.

5.3.2.2.3 Example
Currency Code Currency Name Exchange Rate

USD US Dollar 1.00

EUR Euro 0.85

... ... ...

Table 20 Example universal dimension


A "Currency" dimension with attributes like currency code and exchange rate is
shared across fact tables for sales, purchases, and financial reporting.

Best Practice 16 - When to use universal dimensions


n When certain attributes, such as currency codes, units of
measure, or geographic regions, are shared across multiple fact
tables. Reduces redundancy and centralizes the storage of
attributes.

5.3.2.3 Key differences between these entity types


Conformed dimensions might seem very similar to universal dimensions (Section
5.3.2.2), as both are concepts in dimensional modeling that ensure
consistency and accuracy, but there are some key differences:

Aspect: Scope
n Conformed Dimension: Shared within a specific data warehouse.
n Universal Dimension: Shared across multiple data marts or warehouses.

Aspect: Usage
n Conformed Dimension: Ensures consistency within a specific context.
n Universal Dimension: Provides standardized attributes across contexts.

Aspect: Attributes
n Conformed Dimension: Contains attributes relevant to a business area.
n Universal Dimension: Contains attributes with consistent meanings.

Aspect: Redundancy
n Conformed Dimension: Reduces redundancy within a data warehouse.
n Universal Dimension: Reduces redundancy across data marts or warehouses.

Aspect: Application
n Conformed Dimension: Used for consistent analysis and reporting.
n Universal Dimension: Used for standard measurements or classifications.

Table 21 Key differences between conformed dimensions and universal dimensions

In summary, while both conformed dimensions and universal dimensions focus


on maintaining data consistency, conformed dimensions apply to a specific data
warehouse, ensuring consistency within it, while universal dimensions are
broader in scope and facilitate consistency across multiple data marts or
warehouses.

Best Practice 17 - Implement Conformed/Universal


Dimensions
Create conformed dimensions that are shared across multiple facts
to ensure consistent and accurate reporting. Maintain a single
version of truth for each dimension to avoid data discrepancies.

5.3.3 Optimizing Performance and Complexity

5.3.3.1 Degenerate Dimensions

5.3.3.1.1 Definition
A degenerate dimension refers to attributes from a dimension that are stored
directly within the fact table, rather than being represented in a separate
dimension table. These attributes are often identifiers or codes associated with
a specific transaction and have no relevance outside the context of that
transaction.

5.3.3.1.2 Use Case


Degenerate dimensions are used to simplify the data model by avoiding the
creation of additional dimension tables for attributes that are only relevant to a

single fact table. They are typically employed to handle transactional data, such
as order numbers, invoice numbers, or receipt numbers.

5.3.3.1.3 Example
Sales ID Customer ID Invoice Number Quantity Unit Price Total Price

1001 C001 I001 2 $300 $600

1002 C002 I002 1 $800 $800

... ... ... ... ... ...

Table 22 Example degenerate dimension

In a sales fact table, the "Invoice Number" is a degenerate dimension. Instead


of creating a separate dimension, the invoice number is stored directly in the
fact table, simplifying searches and maintaining data integrity.

Best Practice 18 - When to use degenerate dimensions


n Use to simplify the data model by avoiding the creation of
additional dimension tables for attributes that are only relevant
to a single fact table. Typically employed to handle transactional
data like order numbers or invoice numbers.

5.3.3.2 Transaction dimensions


5.3.3.2.1 Definition
A transaction dimension is a specialized dimension in a data warehouse that
focuses on capturing attributes associated with individual transactions within a
fact table. Unlike standard dimensions, which provide attributes for general
analysis, transaction dimensions are tailored to offer context-specific
information related to each transaction event. They enrich the fact table by
providing additional data points that describe the circumstances or
characteristics of the transaction.

5.3.3.2.2 Use Case


Transaction dimensions find their most prominent utility in scenarios where the
need arises to scrutinize transactional data with a finer lens. They are commonly
employed for the following use cases:

1. Enhanced Analysis: Transaction dimensions allow analysts to perform more
detailed and context-aware analysis of individual transactions. This can be
particularly valuable when examining the factors that influence specific
events, such as purchases, orders, or service requests.
2. Temporal Insights: When dealing with time-series data, transaction
dimensions enable the inclusion of attributes like transaction date, time of
day, or even user-specific information, providing a comprehensive temporal
context to each transaction.
3. Fine-Grained Reporting: Organizations often require in-depth,
transaction-level reporting for auditing, regulatory compliance, or quality
control purposes. Transaction dimensions facilitate the creation of such
reports by capturing relevant details.
4. Behavioral Analysis: Transaction dimensions enable the study of customer
behavior, such as identifying patterns in purchasing habits, tracking the
evolution of user preferences, or understanding the sequence of events
leading to specific outcomes.

5.3.3.2.3 Example
Let's illustrate the concept of transaction dimensions with an example from a
retail business:
Consider a large online retailer that wants to analyze its e-commerce
transactions. The fact table records individual sales transactions, including
information about products sold, customers, order dates, and quantities. To gain
a deeper understanding of these transactions, the retailer decides to implement
a transaction dimension.
Here's what the transaction dimension might look like:

Transaction ID Transaction Date Customer ID Payment Method Shipping Method

1001 2023-08-15 10:30:00 AM C001 Credit Card Standard

1002 2023-08-15 02:45:00 PM C002 PayPal Express

... ... ... ... ...

In this example, the Transaction Dimension provides transaction-specific details,


such as the date and time of the transaction, the customer involved, the
payment method used, and the chosen shipping method. By linking this
dimension to the fact table, the retailer can analyze transactional data in greater

depth. For instance, they can determine the most popular payment methods
during certain times of the day, assess customer preferences for shipping
options, or conduct temporal analysis of transaction volumes.

Best Practice 19 - When to use Transaction Dimensions


n Use when you need to capture additional details related to a
specific transaction within the fact table. These details can
include attributes such as transaction date, time of day, or user
information. Transaction dimensions are used to enrich
transactional data with context-specific attributes that provide
insights into the circumstances surrounding each transaction.
They enhance the analytical value of the fact table by allowing
for more detailed analysis.

5.3.3.3 Mini Dimension


5.3.3.3.1 Definition
A mini-dimension is a subset of attributes from a larger dimension that are
extracted to create a smaller, specialized dimension table. Mini-dimensions are
often used to address performance issues caused by high cardinality attributes
in the main dimension. By creating a mini-dimension, you can improve query
performance while still keeping the necessary attributes for analysis.

5.3.3.3.2 Use Case


Mini-dimensions are used when you have high cardinality attributes in a
dimension that lead to increased storage requirements and query complexity.
By creating a mini-dimension, you can optimize performance for specific queries
without sacrificing important attributes.

5.3.3.3.3 Example
Category ID Category Name Description

CAT001 Electronics Electronic Devices

CAT002 Clothing Apparel

... ... ...

Table 23 Example mini-dimension

A "Product Category" mini dimension is extracted from the "Product" dimension,
including attributes for analyzing sales at the broader category level.

Best Practice 20 - When to use mini-dimensions


n Use when you have high cardinality attributes in a dimension
that lead to increased storage requirements and query
complexity. Mini dimensions optimize performance for specific
queries without sacrificing important attributes.

5.3.3.4 Junk Dimension


5.3.3.4.1 Definition
A junk dimension, also known as a "garbage dimension," is a technique that
involves combining multiple low-cardinality flags or indicators into a single
dimension table. These flags might represent different attributes, such as
Yes/No indicators or categorical values, that are used primarily for filtering or
grouping in queries. Combining them into a junk dimension simplifies schema
design and reduces the number of dimension tables.

5.3.3.4.2 Use Case


Junk dimensions are used to reduce the complexity of the schema by
consolidating multiple low-cardinality attributes that are not individually
meaningful but are frequently used together in queries. This streamlines the
design and enhances query performance.

5.3.3.4.3 Example
Promotion Flag ID Summer Sale Black Friday Discount Applied

FLAG001 Y N N

FLAG002 N Y Y

... ... ... ...

Table 24 Example junk dimension

A "Promotion Flags" junk dimension groups various promotion-related


attributes, like "Summer Sale" and "Black Friday" for streamlined schema
design.
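A small sketch of how such a junk dimension can be built from the distinct flag combinations found on the fact rows; the fact table and flag names are assumptions for the example.

```python
import pandas as pd

# Low-cardinality flags currently carried on every fact row
fact = pd.DataFrame({
    "OrderID": [1001, 1002, 1003],
    "SummerSale": ["Y", "N", "Y"],
    "BlackFriday": ["N", "Y", "N"],
    "DiscountApplied": ["N", "Y", "N"],
})
flag_cols = ["SummerSale", "BlackFriday", "DiscountApplied"]

# Junk dimension = the distinct flag combinations, each with a surrogate key
junk_dim = fact[flag_cols].drop_duplicates().reset_index(drop=True)
junk_dim["PromotionFlagID"] = [f"FLAG{i + 1:03d}" for i in range(len(junk_dim))]

# The fact then keeps a single key instead of the individual flags
fact_slim = fact.merge(junk_dim, on=flag_cols)[["OrderID", "PromotionFlagID"]]
print(junk_dim)
print(fact_slim)
```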

Best Practice 21 - When to use junk dimensions
n Use to reduce the complexity of the schema by consolidating
multiple low-cardinality flags or indicators into a single
dimension table. Simplifies schema design and reduces the
number of dimension tables.

5.3.3.5 Shrunken Dimension

5.3.3.5.1 Definition
A shrunken dimension refers to a subset of attributes from a larger dimension
that are selected for a higher level of summary. Shrunken dimensions are used
to optimize query performance and reduce complexity when analyzing data at a
coarser granularity. Shrunken dimensions are often derived from a larger
dimension to provide a more focused view of data for specific analytical needs.

5.3.3.5.2 Use Case


Shrunken dimensions are useful when you need to perform summary-level
analysis and want to improve query performance by eliminating attributes that
are not relevant at that level. For example, a shrunken "Quarter" dimension
derived from a "Date" dimension might contain only attributes related to
quarters (e.g., Quarter ID, Quarter Number, Start Date, End Date) to facilitate
analysis at a quarterly level.

5.3.3.5.3 Example
Quarter ID Quarter Number Start Date End Date

QTR001 Q1 2023-01-01 2023-03-31

QTR002 Q2 2023-04-01 2023-06-30

... ... ... ...

Table 25 Example shrunken dimension

A "Quarter" shrunken dimension derived from the "Date" dimension includes


attributes for analyzing data at the quarterly level.
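A minimal sketch of deriving such a quarter-grain dimension from a day-grain date dimension with pandas; the column names are assumptions and only a few rows are shown.

```python
import pandas as pd

# Day-grain date dimension (simplified)
date_dim = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-01", "2023-02-15", "2023-04-01"]),
})

# Shrink to quarter grain, keeping only quarter-level attributes
q = date_dim["Date"].dt.to_period("Q")
quarter_dim = (
    pd.DataFrame({
        "QuarterNumber": "Q" + q.dt.quarter.astype(str),
        "StartDate": q.dt.start_time,
        "EndDate": q.dt.end_time.dt.normalize(),
    })
    .drop_duplicates()
    .reset_index(drop=True)
)
print(quarter_dim)
```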

Best Practice 22 - When to use shrunken dimensions
n Use when you need to perform summary-level analysis and
want to improve query performance by eliminating attributes
that are not relevant at that level.

5.3.3.6 Late-binding dimensions

5.3.3.6.1 Definition
A late-binding dimension allows flexibility in adding new attributes to a
dimension table without altering the schema. Late-binding dimensions defer the
binding of attributes until they are needed, which can simplify management and
accommodate changes to attributes over time. Note that late-binding
dimensions are discussed in detail in chapter 16.

5.3.3.6.2 Use Case


Late-binding dimensions are used when there's a need for agility in handling
evolving or rapidly changing data, allowing new attributes to be added without
immediate schema changes.

5.3.3.6.3 Example
Product ID Product Name Attribute 1 Attribute 2 ...

P001 Smartphone 4G Support 64GB Memory ...

P002 Laptop SSD Drive 16GB RAM ...

... ... ... ... ...

Table 26 Example late-binding dimension


In a technology industry, a "Product" dimension is designed as late-binding. As
new attributes like processor type emerge, they can be added without altering
the dimension's structure.

Best Practice 23 - When to use late-binding dimensions


n Use when there's a need for agility in handling evolving or
rapidly changing data, allowing new attributes to be added
without immediate schema changes.

5.3.3.7 Composite dimensions

5.3.3.7.1 Definition
Created by merging attributes from different dimensions into a single dimension
table, reducing the number of dimension tables in the schema.

5.3.3.7.2 Use Case


To simplify schema design by combining attributes from different dimensions
with shared characteristics. Composite dimensions reduce the number of
dimensions while preserving analytical capabilities.

5.3.3.7.3 Example
Channel ID Channel Type Channel Details

CH001 Online Website

CH002 In-Store Retail Store A

... ... ...

Table 27 Sample composite dimension

A "Sales Channel" composite dimension combines attributes from both "Online


Channel" and "In-Store Channel" dimensions for simplified reporting.

Best Practice 24 - When to use composite dimensions


n Use to simplify schema design by combining attributes from
different dimensions with shared characteristics, reducing the
number of dimension tables.

5.3.3.8 Key differences between these entity types


Aspect: Attribute Focus
n Mini-Dimensions: Optimize high-cardinality attributes.
n Junk Dimensions: Combine low-cardinality attributes.
n Shrunken Dimensions: Optimize for summary-level queries.
n Late-Binding Dimensions: Allow flexibility in attribute addition.
n Degenerate Dimensions: Simplify schema design for transactional data.
n Transaction Dimensions: Enrich the analysis with transaction-specific details.
n Composite Dimensions: Simplify schema design by combining shared attributes from different dimensions.

Aspect: Optimization
n Mini-Dimensions: Performance for high-cardinality attributes.
n Junk Dimensions: Schema simplicity and query performance.
n Shrunken Dimensions: Performance for summary-level queries.
n Late-Binding Dimensions: Flexibility for evolving data needs.
n Degenerate Dimensions: Simplify schema for transactional data.
n Transaction Dimensions: Improve query performance for transaction analysis.
n Composite Dimensions: Simplify schema design by reducing dimension tables.

Aspect: Purpose
n Mini-Dimensions: Performance optimization for specific attributes.
n Junk Dimensions: Simplify schema design and query performance.
n Shrunken Dimensions: Performance optimization for higher-level aggregation.
n Late-Binding Dimensions: Accommodate evolving data requirements.
n Degenerate Dimensions: Simplify schema design for transactional data.
n Transaction Dimensions: Enhance analysis of transactional data.
n Composite Dimensions: Simplify schema design by consolidating attributes.

Aspect: Attributes
n Mini-Dimensions: Subset of attributes from the main dimension.
n Junk Dimensions: Combined low-cardinality attributes.
n Shrunken Dimensions: Subset of attributes for summary levels.
n Late-Binding Dimensions: Attributes can be added without schema changes.
n Degenerate Dimensions: Attributes specific to transaction data.
n Transaction Dimensions: Contextual details related to transactions.
n Composite Dimensions: Combined attributes from different dimensions.
Table 28 Key differences between these dimension types

In summary, mini-dimensions, junk dimensions, shrunken dimensions, late-binding
dimensions, degenerate dimensions, transaction dimensions, and composite
dimensions each have a unique purpose in optimizing and simplifying dimensional
modeling, based on the nature of the attributes and the analytical requirements
of the data.

5.3.4 Customized and Specialized Dimensions

5.3.4.1 Snapshot Dimensions

5.3.4.1.1 Definition
Snapshot dimensions are used to capture point-in-time attributes related to a
fact record. These attributes represent the state of a dimension at a specific
moment in time and are associated with the fact record when the event
occurred. Snapshot dimensions are particularly useful for scenarios where

historical changes need to be tracked, such as in data warehousing for
regulatory compliance or auditing purposes.

5.3.4.1.2 Use Case


Snapshot dimensions are used when you need to maintain historical context for
a fact record by capturing the attributes as they were at the time of the event.
For example, in a customer churn analysis, you might use a snapshot dimension
to capture customer details at the time they initiated a service cancellation.

5.3.4.1.3 Example
Customer ID Account Balance Credit Limit Snapshot Date

C001 $1,000 $2,000 2023-06-30

C002 $500 $1,000 2023-06-30

... ... ... ...

Table 29 Example snapshot dimension

A "Customer Account Snapshot" dimension captures monthly attributes like


account balance, credit limit, and payment status for historical financial analysis.

Best Practice 25 - When to use snapshot dimensions


n Use when you need to maintain historical context for fact
records by capturing attributes as they were at specific points
in time.

5.3.4.2 Custom dimensions

5.3.4.2.1 Definition
Custom dimensions are dimension tables that are tailored to specific business
needs and are not predefined. They are designed to accommodate specialized
data elements or unique analytical requirements that are not adequately
covered by standard dimensions. Custom dimensions offer flexibility regarding
attribute selection and structure, allowing organizations to capture domain-
specific information.

5.3.4.2.2 Use Case
Custom dimensions are used when standard dimensions do not fully capture the
nuances of a business scenario or when you need to analyze attributes that are
specific to your organization's operations. For example, a custom dimension
could be created to track project-specific attributes in a consulting firm's data
warehouse.

5.3.4.2.3 Example
Patient ID Diagnosis Code Diagnosis Description Treatment

P001 D001 Influenza Rest

P002 D002 Fractured Arm Cast

... ... ... ...

Table 30 Example custom dimension

A "Patient Diagnosis" dimension in a healthcare system includes medical


condition and treatment attributes for specialized medical analysis.

Best Practice 26 - When to use custom dimensions


n Use when standard dimensions do not fully capture the nuances
of a business scenario or when you need to analyze attributes
specific to your organization's operations.

5.3.4.3 Derived dimensions


5.3.4.3.1 Definition
Derived dimensions are created by deriving new attributes from existing ones.
These dimensions allow you to analyze data from a new perspective by
performing calculations or transformations on existing attributes. Derived
dimensions provide insights into calculated or derived measures, enabling more
nuanced analysis.

5.3.4.3.2 Use Case


Derived dimensions are used when you want to explore data from a different
angle by creating attributes that are not directly stored in the source data. For

ThoughtSpot Data Security, 1st Edition | Spot-On Data Modelling | A ThoughtSpot Field Guide | Page 71
instance, you could create a derived dimension to analyze profit margins by
subtracting cost-related attributes from revenue-related attributes.

5.3.4.3.3 Example
Product ID Profit Margin Net Profit

P001 20% $200

P002 15% $150

... ... ...

Table 31 Example derived dimension

A "Profitability" dimension includes calculated attributes like "Profit Margin" and


"Net Profit," derived from "Revenue" and "Cost" dimensions.

Best Practice 27 - When to use derived dimensions


n Use when you want to explore data from a different angle by
creating attributes that are not directly stored in the source
data. Typically created by deriving new attributes from existing
ones.

5.3.4.4 Key differences between these entity types


Snapshot dimensions, custom dimensions and derived dimensions all look a
bit similar, but they serve different purposes and have distinct characteristics:

Aspect: Purpose
n Snapshot Dimensions: Capture historical attributes at specific times.
n Custom Dimensions: Tailor attributes to specific business needs.
n Derived Dimensions: Create new attributes based on calculations.

Aspect: Use Case
n Snapshot Dimensions: Maintain historical context for fact records.
n Custom Dimensions: Address unique analytical or domain scenarios.
n Derived Dimensions: Analyze data from new perspectives.

Aspect: Attribute Source
n Snapshot Dimensions: Captured as they existed at a specific time.
n Custom Dimensions: Designed to address specific business needs.
n Derived Dimensions: Derived from existing data attributes.

Aspect: Attribute Flexibility
n Snapshot Dimensions: Focus on maintaining historical records.
n Custom Dimensions: Tailored to address unique business scenarios.
n Derived Dimensions: Create new perspectives by deriving attributes.
Table 32 Differences between snapshot, custom and derived dimensions

In summary, snapshot dimensions capture historical attributes at specific times;


custom dimensions are designed to address specialized business requirements,
and derived dimensions create new attributes by performing calculations or
transformations on existing data. Each type of dimension serves a distinct
purpose in enhancing data analysis within a dimensional model.

5.3.5 Dimension Types: Wrapping it up


This section discusses various types of dimension tables, each designed to meet
specific data storage and analysis needs. These dimension types are categorized
into Hierarchical and Organizational, Data Consistency and Reusability,
Optimizing Performance and Complexity, and Customized and Specialized
Dimensions. Selecting the right dimension type depends on the specific
use case and objectives, considering various factors.
The various dimension types are summarized below.

Figure 11 Dimension types

Hierarchical and Organizational Dimensions

Role-Playing Dimensions (Perspective Switching, Consistency)
n When you need to analyze a fact table from different temporal perspectives, each associated with a different attribute of the same dimension. This allows for consistent and meaningful analysis without the need to duplicate dimension tables.

Fixed Depth Hierarchies (Organized Structure, Consistent Levels)
n When you have a well-defined and consistent structure for hierarchical attributes. They provide an organized way to navigate and analyze dimension data at different levels of granularity.

Data Consistency and Reusability Dimensions

Conformed Dimensions (Data Integrity, Cross-Functional)
n When you have multiple fact tables that need to share the same dimension for consistent analysis and reporting. Ensures data integrity, simplifies integration, and enables accurate cross-functional reporting.

Universal Dimensions (Shared Attributes, Standardized Data)
n When certain attributes, such as currency codes, units of measure, or geographic regions, are shared across multiple fact tables. Reduces redundancy and centralizes the storage of attributes.

Optimizing Performance and Complexity

Degenerate Dimensions (Simplified Schema, Transaction Data)
n Use to simplify the data model by avoiding the creation of additional dimension tables for attributes that are only relevant to a single fact table. Typically employed to handle transactional data like order numbers or invoice numbers.

Transaction Dimensions (Contextual Details, Enriched Analysis)
n Use when you need to capture additional details related to a specific transaction within the fact table. These details can include attributes such as transaction date, time of day, or user information.

Mini Dimension (Query Optimization, High Cardinality)
n Use when you have high cardinality attributes in a dimension that lead to increased storage requirements and query complexity. Mini dimensions optimize performance for specific queries without sacrificing important attributes.

Junk Dimension (Schema Simplification, Low Cardinality)
n Use to reduce the complexity of the schema by consolidating multiple low-cardinality flags or indicators into a single dimension table. Simplifies schema design and reduces the number of dimension tables.

Shrunken Dimension (Query Performance, Summary-Level)
n Use when you need to perform summary-level analysis and want to improve query performance by eliminating attributes that are not relevant at that level.

Late-Binding Dimensions (Evolving Data, Schema Flexibility)
n Use when there's a need for agility in handling evolving or rapidly changing data, allowing new attributes to be added without immediate schema changes.

Composite Dimensions (Simplified Schema, Combined Attributes)
n Use to simplify schema design by combining attributes from different dimensions with shared characteristics, reducing the number of dimension tables.

Customized and Specialized Dimensions

Snapshot Dimensions (Historical Context, Point-in-Time)
n Use when you need to maintain historical context for fact records by capturing attributes as they were at specific points in time.

Custom Dimensions (Tailored Attributes, Business-Specific)
n Use when standard dimensions do not fully capture the nuances of a business scenario or when you need to analyze attributes specific to your organization's operations.

Derived Dimensions (Calculated Attributes, New Perspectives)
n Use when you want to explore data from a different angle by creating attributes that are not directly stored in the source data. Typically created by deriving new attributes from existing ones.
Table 33 The various dimension types

5.4 TESTING YOUR KNOWLEDGE

Get ready for a knowledge showdown! In this chapter's Knowledge Quest, we'll be putting your grasp of the topics to the test with a series of multiple-choice questions. Think of it as a fun way to challenge your knowledge and reinforce your learning. Select the most suitable option for each question and see how well you fare. Ready to embark on this educational adventure? Let's dive in!

1. What is the primary purpose of a Transactional Fact Table?
a. Capturing data at specific points in time for historical analysis
b. Providing fixed snapshots for simplified querying
c. Tracking individual business transactions or events
d. Monitoring progress or state changes over intervals

2. What is a key advantage of a Snapshot Fact Table?
a. Supports historical reporting and trend identification
b. Provides granular data for in-depth analysis
c. Monitors progress or state changes over intervals
d. Captures data at regular, structured intervals

3. When is it recommended to use a Periodic Snapshot Fact Table?
a. When you need to capture individual business transactions
b. When you want to optimize query performance through partitioning
c. When you need to perform trend analysis over specific time periods
d. When you want to monitor progress or state changes over intervals

4. What is the primary purpose of a Cumulative Fact Table?
a. Storing pre-aggregated data to improve query performance
b. Tracking accumulative measures over time for trend analysis
c. Contains calculated measures derived from other fact tables
d. Enhancing query response time for frequently used summary reports

5. What is a key advantage of an Aggregated Fact Table?
a. Offers a consolidated source for calculated measures
b. Simplifies complex calculations during querying
c. Easily tracks cumulative values over time
d. Captures incremental changes to a measure over time

6. When is it recommended to use a Derived Fact Table?
a. When you need to enhance query response time for summary-level reporting
b. When you want to store cumulative or running total values over time
c. When you need to store calculated measures derived from other fact tables or external sources
d. When you are willing to accept potential data loss due to aggregation

7. What is the primary purpose of a Factless Fact Table?
a. Capturing detailed transactional data
b. Tracking patterns and associations
c. Resolving many-to-many relationships
d. Storing pre-aggregated data
8. When is a Bridge Table typically used?
a. When you need to track cumulative values over time
b. To capture individual business transactions
c. To resolve many-to-many relationships between dimensions
d. For optimizing query performance

9. What is a key consideration when working with a Bridge Table?
a. Data redundancy
b. Lack of measures for direct analysis
c. Complexity of managing running totals
d. Limited granularity

10. In which scenario would a Multi-Valued Fact Table be most beneficial?
a. Tracking total revenue over time
b. Categorizing customer complaints by type, severity, and resolution
c. Capturing data at specific points in time
d. Storing summarized data for faster querying

11. Which type of fact table would be most suitable for capturing individual business transactions or events at a granular level?
a. Cumulative Fact Table
b. Transactional Fact Table
c. Aggregated Fact Table
d. Factless Fact Table

12. If your primary focus is on historical reporting and trend analysis, which type of fact table should you choose?
a. Accumulating Snapshot Fact Table
b. Aggregated Fact Table
c. Snapshot Fact Table
d. Derived Fact Table

13. When dealing with complex relationships that require detailed analysis of many-to-many associations, which type of fact table is most appropriate?
a. Bridge Fact Table
b. Partitioned Fact Table
c. Multi-Valued Fact Table
d. Cumulative Fact Table

14. When would you use role-playing dimensions in dimensional modeling?
a. When you need to organize dimension data in a hierarchical arrangement.
b. When you want to analyze a fact table from different temporal perspectives, using the same dimension.
c. When you have a well-defined and consistent structure for hierarchical attributes.
d. When you need to track events or occurrences without measures.

15. What is a characteristic feature of fixed-depth hierarchies in dimensions?
a. They allow for different analyses of the same dimension from multiple temporal viewpoints.
b. They provide consistent analysis through organized hierarchical navigation.
c. They involve using a single dimension table for different perspectives.
d. They are used to capture events or occurrences without measures.

16. In a retail business, how might a role-playing dimension be used?
a. To track customer complaints by type, severity, and resolution.
b. To represent different perspectives or attributes of the same dimension in sales and returns scenarios.
c. To navigate and analyze dimension data at different levels of granularity.
d. To store keywords or tags associated with a document, article, or product.

17. What is the primary purpose of a conformed dimension in dimensional modeling?
a. To centralize the storage of attributes with the same meaning across different analytical contexts.
b. To ensure consistency within a specific data warehouse and enable accurate cross-functional reporting.
c. To track data elements that are common and consistent across multiple fact tables.
d. To reduce redundancy by sharing attributes relevant to a business area.

18. When would you use a universal dimension in dimensional modeling?
a. When you need to centralize the storage of attributes with the same meaning across different analytical contexts.
b. When you have multiple fact tables that need to share the same dimension for consistent analysis and reporting.
c. When certain attributes, such as currency codes or geographic regions, are shared across multiple fact tables.
d. When you want to track data elements that are common and consistent within a specific data warehouse.

19. What is the primary purpose of a degenerate dimension in dimensional modeling?
a. To centralize the storage of attributes with the same meaning across different analytical contexts.
b. To simplify the data model by avoiding the creation of additional dimension tables for transaction-specific attributes.
c. To capture additional details related to a specific transaction within the fact table.
d. To optimize query performance by eliminating attributes that are not relevant at a higher level of summary.

20. When would you use a transaction dimension in dimensional modeling?
a. When you need to consolidate multiple low-cardinality flags or indicators into a single dimension table.
b. When you have high cardinality attributes in a dimension, and you want to optimize performance for specific queries without sacrificing important attributes.
c. When you need to capture additional details related to a specific transaction within the fact table.
d. When you need to scrutinize transactional data with a finer lens and gain context-specific insights related to each transaction event.

21. What is the purpose of a mini-dimension in dimensional modeling?
a. To consolidate multiple low-cardinality flags or indicators into a single dimension table.
b. To extract a subset of attributes from a larger dimension, addressing performance issues caused by high cardinality attributes.
c. To store attributes from a dimension directly within the fact table.
d. To provide a focused view of data for specific analytical needs by selecting a subset of attributes from a larger dimension.

22. What is the primary use case for a junk dimension in dimensional modeling?
a. To reduce the complexity of the schema by consolidating multiple low-cardinality flags or indicators into a single dimension table.
b. To perform summary-level analysis and improve query performance by eliminating attributes that are not relevant at that level.
c. To capture additional details related to a specific transaction within the fact table.
d. To provide context-specific information related to each transaction event.

23. When would you use a shrunken dimension in dimensional modeling?
a. When you have high cardinality attributes in a dimension, and you want to optimize performance for specific queries without sacrificing important attributes.
b. When you need to perform summary-level analysis and want to improve query performance by eliminating attributes that are not relevant at that level.
c. When you need to capture additional details related to a specific transaction within the fact table.
d. When you need to scrutinize transactional data with a finer lens and gain context-specific insights related to each transaction event.

24. When should you use a snapshot dimension in dimensional modeling?
a. When you need to capture additional details related to a specific transaction within the fact table.
b. When you want to maintain historical context for fact records by capturing attributes as they were at specific points in time.
c. When you have high cardinality attributes in a dimension that lead to increased storage requirements and query complexity.
d. When you want to explore data from a different angle by creating attributes that are not directly stored in the source data.

25. What is the primary purpose of a custom dimension in dimensional modeling?
a. To capture historical attributes at specific times.
b. To tailor attributes to specific business needs that are not adequately covered by standard dimensions.
c. To create new attributes by performing calculations or transformations on existing data.
d. To analyze data from a different angle by creating attributes that are not directly stored in the source data.

26. When should you use a Shrunken Dimension in dimensional modeling?
a. When you need to capture additional details related to a specific transaction within the fact table.
b. When there's a need for agility in handling evolving or rapidly changing data.
c. When you need to perform summary-level analysis and want to improve query performance by eliminating attributes that are not relevant at that level.
d. When you want to explore data from a different angle by creating attributes that are not directly stored in the source data.

27. What is the primary purpose of a Composite Dimension in dimensional modeling?
a. To simplify schema design by combining attributes from different dimensions with shared characteristics, reducing the number of dimension tables.
b. To tailor attributes to specific business needs that are not adequately covered by standard dimensions.
c. To create new attributes by performing calculations or transformations on existing data.
d. To analyze data from a different angle by creating attributes that are not directly stored in the source data.

28. When should you use a Universal Dimension in dimensional modeling?
a. When you have high cardinality attributes in a dimension that lead to increased storage requirements and query complexity.
b. When certain attributes, such as currency codes, units of measure, or geographic regions, are shared across multiple fact tables.
c. When you want to maintain historical context for fact records by capturing attributes as they were at specific points in time.
d. When you have a well-defined and consistent structure for hierarchical attributes.

Answers:
1:C, 2:A, 3:C, 4:B, 5:B, 6:C, 7:B, 8:C, 9:A, 10:B, 11:B, 12:C, 13:A, 14:B, 15:B,
16:B, 17:B, 18:C, 19:B, 20:D, 21:B, 22:A, 23:B, 24:B, 25:B, 26:C, 27:A, 28:B

6 Effective design patterns and best
practices for overcoming common
challenges in dimensional models
In a dimensional model, which is a data modeling technique used in data
warehousing and business intelligence, some elements can be more challenging
to model than others. The goal of a dimensional model is to make data accessible
and easily understandable for reporting and analysis. In no particular order,
here are some of the most difficult things to model in a dimensional model:

n Chasm traps and fan traps (Chapter 7). Chasm traps and fan traps can
create ambiguous data interpretations and/or can cause overcounting and
inaccuracies, but they can also be a helpful design pattern.

n Join cardinality (Chapter 8): Addressing join cardinality in dimension
models is crucial to optimize query performance, maintain data accuracy, and
simplify queries. Incorrect handling of join cardinality can lead to slow
performance, data anomalies, and inconsistent results. Ignoring join
cardinality may cause data redundancy, hinder scalability, and make model
maintenance challenging.
n Outer joins (Chapter 9): In dimensional models, outer joins can cause data
accuracy and query performance issues when joining fact and dimension
tables with missing or null values, leading to incomplete or incorrect results,
data duplication, and hindered query optimization.
n Hierarchies (Chapter 10): Representing hierarchical relationships between
dimensions can be complex. For example, dealing with a product hierarchy
where products are categorized into various levels like category,
subcategory, and individual product can be challenging to model efficiently.
n Slowly changing dimensions (SCDs) (Chapter 11): Managing slowly
changing dimensions, which are dimensions that change over time, can be
difficult. There are different types of SCDs, and choosing the appropriate
strategy (Type 1, Type 2, Type 3, etc.) requires careful consideration.
n Granularity, denormalization and mixed grain facts (Chapter 12):
Granularity choice is critical, as overly detailed data can lead to inefficiencies,
while aggregated data may lack the necessary detail. Denormalization
streamlines queries but can lead to redundancy and increased storage. Mixed
grain facts pose a challenge when measures of varying detail levels coexist,
potentially causing inconsistencies.
n Working with less structured data (Chapter 13): Mastering the
management of flexible data structures is crucial in today's data landscape.
Organizations seeking deeper insights must effectively utilize key-value
pairs, unstructured, and semi-structured data. Techniques such as pivoting
and structured representation are invaluable in converting these data formats
into ones compatible with ThoughtSpot.
n Date dimension and when you need one (Chapter 14): This chapter
underscores the significance of the Date Dimension as a foundational
framework for handling time-related data. It streamlines time interpretation,
enabling navigation through different periods, historical comparisons, and
trend identification. ThoughtSpot's keywords enhance querying efficiency for
time intervals, reducing the need for a dedicated Date Dimension table in

many cases. However, in advanced scenarios, such as the 'look back
measures' design pattern, the Date Dimension table remains indispensable,
particularly in the retail industry.
n Currency conversion (Chapter 15): If dealing with data from multiple
countries with different currencies, handling currency conversion in the
dimensional model can be challenging. It requires maintaining historical
exchange rates and implementing proper calculations for accurate reporting.
n Late-binding attributes (Chapter 16): Late-binding attributes accommodate evolving or rapidly changing data by allowing new attributes to be added without immediate schema changes. Designing for this agility while keeping the model consistent and performant requires careful consideration.
n Non-additive, semi-additive and derived measures (Chapter 17):
Calculated or derived measures, such as profit margin or growth rate, involve
complex expressions and dependencies on other measures. Some measures,
like ratios or percentages, cannot be aggregated at all and certain measures,
such as account balances or inventory levels, cannot be simply aggregated
using standard summation techniques. Modeling these types of measures
while maintaining their integrity can be challenging.
Overcoming these challenges requires a deep understanding of the data, the
business requirements, and the trade-offs between complexity and performance
in the dimensional model design. In the following chapters we will describe the
challenges and our best practices in these scenarios.

7 Bridging chasms, fanning insights: Chasm
and fan traps unveiled

7.1 INTRODUCTION

In the domain of dimensional models, the emergence of chasm traps and fan
traps is not uncommon. In this section, we will
delve into the nature of these traps, the challenges
they present, the scenarios where they become
problematic, and the potential benefits they offer.
Moreover, we will explore various manifestations of
chasm traps, including regular, nested, and chained
variations, as well as bridge tables and fan traps.
Lastly, we will examine how ThoughtSpot
addresses these challenges and even employs them
for specific use cases.

7.2 UNDERSTANDING CHASM AND FAN TRAPS

7.2.1 What is a Chasm Trap?


A Chasm Trap arises within a dimensional model when multiple fact tables are
interconnected through shared conformed dimensions, yet a direct relationship
between these fact tables is absent. This situation can lead to confusion when
amalgamating information from disparate fact tables in a single query, resulting
in inaccuracies and misleading insights.
For example, consider a scenario where a manufacturing company's data
warehouse houses two fact tables: "Production" and "Sales." These tables share
dimensions like "Product", "Date", and "Location."
n The "Production" fact table contains data about the quantity of products
manufactured on specific dates at various locations.
n The "Sales" fact table includes information about quantities of products sold
on particular dates at different locations.

These fact tables share dimensions such as "Product," which contains details
about products; "Date," capturing calendar dates; and "Location," featuring
geographical attributes. Due to the common dimensions shared by both fact
tables, a chasm trap arises.

Figure 12 Sample Chasm Trap Model

7.2.2 Challenges Posed by Chasm Traps


Chasm traps introduce a couple of notable challenges:
n Interpretation of Results: Queries involving traps can yield misinterpreted
results, casting doubt on the credibility of insights generated.
n Performance Impact: Unresolved traps can adversely impact query
performance, as the database engine grapples with navigating complex
interrelationships.

7.2.3 Variants of chasm traps


Beyond the fundamental chasm trap, it's important to be aware of its variations:
n Nested Chasm Trap
n Chained Chasm Trap

Figure 13 Standard chasm trap/bridge table
Figure 14 Nested chasm trap
Figure 15 Chained chasm trap

7.2.4 What is a fan trap?


A fan trap arises when there exists a many-to-many-to-one relationship among
three interconnected tables, with metrics present in both of the "many side"
tables. This often transpires when performing joins between two fact tables and
a dimension table. This scenario can lead to instances of over-counting and
inaccuracies during the analysis of data.
Consider a data model involving tables - "Customer," "Order," and "Order
Details."

Figure 16 Sample fan trap

7.3 MODELING CHASM AND FAN TRAPS

7.3.1 Chasm traps


Let’s take the sample model from Figure 12 and populate it with the following
test data:

Production Fact Table

Date Key Product Key Location Key Quantity Produced

101 201 301 2

102 202 301 3

102 201 302 5

102 203 302 2

Figure 17 Contents of the production table

Sales Fact Table

Date Key Product Key Location Key Quantity Sold

101 201 301 10

101 202 302 10

102 203 302 5

102 204 302 15

Figure 18 Contents of the sales table

Date Dimension

Date Key Date Day Month Year

101 2023-07-01 1 7 2023

102 2023-07-02 2 7 2023

Figure 19 Contents of the date dimension

Product Dimension

Product Key Product ID Product Name Category

201 PROD-001 Widget A Widgets

202 PROD-002 Widget B Widgets

203 PROD-003 Widget C Widgets

204 PROD-004 Widget D Widgets

Figure 20 Contents of the product dimension

Location Dimension

Location Key Location ID Location Name Address

301 LOC-001 Warehouse A 123 Main St

302 LOC-002 Warehouse B 456 Oak Ave

Figure 21 Contents of the location dimension

Now, consider the scenario where you aim to generate a report displaying the
overall quantity produced and sold for every product at each location within a
defined timeframe. Because there is no direct linkage between the "Production"
and "Sales" fact tables, attempting a straightforward join with the common
dimensions (Date, Product, Location) would inevitably yield inaccurate
outcomes.
Without addressing the chasm trap, in SQL, the query might look like this:
SELECT
p.Product_Name,
l.Location_Name,
SUM(pr.Quantity_Produced) AS Total_Produced,
SUM(s.Quantity_Sold) AS Total_Sold
FROM Production_Fact pr
JOIN Date_Dimension d ON pr.Date_Key = d.Date_Key
JOIN Product_Dimension p ON pr.Product_Key = p.Product_Key
JOIN Location_Dimension l ON pr.Location_Key = l.Location_Key
JOIN Sales_Fact s ON s.Date_Key = d.Date_Key
AND s.Product_Key = p.Product_Key
AND s.Location_Key = l.Location_Key
GROUP BY
p.Product_Name,
l.Location_Name;

The above query would likely produce incorrect results due to over-counting and data inconsistencies, as the "Production" and "Sales" fact tables are not directly related to each other. Let's execute this query (to simulate this, we have created a SQL View in ThoughtSpot so that the SQL executed is exactly as shown above):

Figure 22 Executing the wrong query causing overcounting

In the incorrect result, we can see that the quantity sold for Widget B is overstated due to the chasm trap: Widget B was only sold 10 times, yet the result shows 20! We also see in this example that Widget D is not reported, as it was not sold.
The correct way of dealing with a chasm trap in SQL is by splitting up the
process, i.e. in one sub-query collect the sales data, in a 2nd the production data
and merge those results together.
WITH
/*
** Query 0: Collects Production Data
*/
"qt_0" AS (
SELECT
"ta_1"."PRODUCT_NAME" "ca_1",
CASE
WHEN sum("ta_2"."QUANTITY_PRODUCED") IS NOT NULL THEN
sum("ta_2"."QUANTITY_PRODUCED")
ELSE 0
END "ca_2"
FROM
"DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."PRODUCTION_FACT" "ta_2"
JOIN
"DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."PRODUCT_DIMENSION"
"ta_1"
ON "ta_2"."PRODUCT_KEY" = "ta_1"."PRODUCT_KEY"
GROUP BY "ca_1"
),
/*
** Query 1: Collects Sales Data
*/
"qt_1" AS (

SELECT
"ta_3"."PRODUCT_NAME" "ca_3",
CASE
WHEN sum("ta_4"."QUANTITY_SOLD") IS NOT NULL THEN
sum("ta_4"."QUANTITY_SOLD")
ELSE 0
END "ca_4"
FROM "DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."SALES_FACT"
"ta_4"
JOIN
"DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."PRODUCT_DIMENSION"
"ta_3"
ON "ta_4"."PRODUCT_KEY" = "ta_3"."PRODUCT_KEY"
GROUP BY "ca_3"
)
/*
** Final Query: Merge Results
*/
SELECT
CASE
WHEN "ta_5"."ca_1" IS NOT NULL THEN "ta_5"."ca_1"
ELSE "ta_6"."ca_3"
END "ca_5",
CASE
WHEN "ta_5"."ca_2" IS NOT NULL THEN "ta_5"."ca_2"
ELSE 0
END "ca_6",
CASE
WHEN "ta_6"."ca_4" IS NOT NULL THEN "ta_6"."ca_4"
ELSE 0
END "ca_7"
FROM "qt_0" "ta_5"
FULL OUTER JOIN "qt_1" "ta_6"
ON (EQUAL_NULL("ta_5"."ca_1","ta_6"."ca_3"))
And when we execute this query we see we get all the correct results:

Figure 23 Correct results from the chasm trap

7.3.2 Fan Traps
Let's populate the illustrative model from Figure 16 with some test data:

CUSTOMER

CUST_ID CUST_NAME

100 Ethan

101 Olivia

102 Liam

103 Ava

Figure 24 Sample data for the customer table

ORDER

ORDER_ID CUST_ID ORDER_TOTAL

1 100 1100

2 101 1300

3 100 1400

4 102 1200

Figure 25 Sample data for the order table

ORDER_DETAIL

ORDER_ID QTY PROD_ID

1 3 10

1 2 11

2 4 13

2 5 12

3 4 12

4 5 10

Figure 26 Sample data for the order detail table

Upon executing a single SQL query that spans the fan trap, overcounting will
happen.
SELECT c.CUST_NAME, od.PROD_ID, o.ORDER_ID, SUM(o.ORDER_TOTAL),
SUM(od.QTY)
FROM CUSTOMER AS c
INNER JOIN "ORDER" AS o ON c.CUST_ID = o.CUST_ID
INNER JOIN ORDER_DETAIL AS od on o.ORDER_ID = od.ORDER_ID
GROUP BY c.CUST_NAME, od.PROD_ID, o.ORDER_ID;

Figure 27 Running a single pass query against a fan trap

This is clear in Figure 27, where overcounting happens. The cumulative value
tallies to 7,400, despite the knowledge that the overall order total (achieved by
summing all values in the order table) amounts to merely 5,000. The underlying
cause for this anomaly lies in the presence of two entries each for Ethan's initial
order (Order 1) and Olivia's order (Order 2) in the order detail table. The fan
trap exacerbates this duplication, leading to the inflated value of 7,400 (5,000
+ 1,100 + 1,300).
For effectively querying against a fan trap, the accurate SQL approach involves
a division of the query into two parts, like addressing a chasm trap, followed by
the consolidation of results.
WITH
"qt_1" AS (
SELECT
"ta_3"."CUST_ID" "ca_4",
"ta_3"."CUST_NAME" "ca_5",
CASE

WHEN sum("ta_4"."QTY") IS NOT NULL THEN sum("ta_4"."QTY")
ELSE 0
END "ca_6"
FROM "DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."ORDER_DETAIL"
"ta_4"
JOIN "DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."ORDER"
"MTA_0"
ON "ta_4"."ORDER_ID" = "MTA_0"."ORDER_ID"
JOIN "DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."CUSTOMER"
"ta_3"
ON "MTA_0"."CUST_ID" = "ta_3"."CUST_ID"
GROUP BY
"ca_4",
"ca_5"

),
"qt_0" AS (
SELECT
"ta_1"."CUST_ID" "ca_1",
"ta_1"."CUST_NAME" "ca_2",
CASE
WHEN sum("ta_2"."ORDER_TOTAL") IS NOT NULL THEN
sum("ta_2"."ORDER_TOTAL")
ELSE 0
END "ca_3"
FROM "DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."ORDER" "ta_2"
JOIN "DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."CUSTOMER"
"ta_1"
ON "ta_2"."CUST_ID" = "ta_1"."CUST_ID"
GROUP BY
"ca_1",
"ca_2"

)
SELECT
"ta_5"."ca_4" "ca_7",
"ta_5"."ca_5" "ca_8",
CASE
WHEN "ta_6"."ca_3" IS NOT NULL THEN "ta_6"."ca_3"
ELSE 0
END "ca_9",
CASE
WHEN "ta_5"."ca_6" IS NOT NULL THEN "ta_5"."ca_6"
ELSE 0
END "ca_10"
FROM "qt_1" "ta_5"
LEFT OUTER JOIN "qt_0" "ta_6"
ON (
(EQUAL_NULL("ta_5"."ca_4","ta_6"."ca_1"))
AND (EQUAL_NULL("ta_5"."ca_5","ta_6"."ca_2"))
)

The results can be seen below:

Figure 28 Correct results of the fan trap

7.4 CONSIDERATIONS AND CONCLUSION

Effectively managing chasm traps and fan traps requires a thoughtful approach.
These challenges can introduce complexities into data analysis and reporting,
but there are strategies to overcome them.
When dealing with chasm traps, it's crucial to recognize the potential for misinterpretation and performance degradation. Subqueries provide a useful solution to address these issues. By separating the data retrieval into a distinct subquery for each fact table and its dimensions, the risk of overcounting is mitigated. Merging the outcomes of these subqueries ensures that the correct total quantities are reported.
Similarly, in the case of fan traps, the SQL approach of dividing the query into
two parts is effective in addressing the complexities introduced by the many-to-
many-to-one relationship. By executing this query in steps, the risks of
overcounting and data distortion are mitigated. Consolidating the results of
these subqueries allows for accurate analysis, ensuring reliable insights for
decision-making.
In ThoughtSpot, the handling of chasm traps and fan traps is a seamless and
automated process. ThoughtSpot's inherent intelligence recognizes the
presence of these traps and employs the appropriate SQL logic to address them.

This automation is a significant advantage for users, as it eliminates the need
for manual intervention and reduces the likelihood of errors in query design.
Moreover, ThoughtSpot's ability to handle these traps seamlessly offers users
the opportunity to focus on the insights and analysis rather than getting bogged
down by intricate data modeling challenges. This capability not only enhances
user experience but also accelerates the decision-making process by providing
accurate and reliable results.
In conclusion, the strategic handling of chasm traps and fan traps is essential
for maintaining the accuracy and reliability of data analysis within dimensional
models. Whether through subqueries and result consolidation or ThoughtSpot's
automated mechanisms, the ultimate goal is to provide users with the tools they
need to derive meaningful insights and make informed decisions based on
accurate data.

Best Practice 28 - ThoughtSpot handles chasm traps automatically!
The good news is that within ThoughtSpot you do not have to do anything in case of a chasm trap. ThoughtSpot will automatically recognize it and execute the correct SQL; the query above and the results in Figure 23 were generated by ThoughtSpot.
You can also utilize this mechanism in certain use cases. For example, the fact constellation schema type described in section 4.1.4.4 is based on it, and a bridge table, for example one used to resolve a many-to-many relationship, is itself a chasm trap.

Best Practice 29 - ThoughtSpot handles fan traps automatically!
The silver lining is that within the ThoughtSpot environment,
addressing a fan trap is a seamless process. ThoughtSpot's
inherent functionality discerns the presence of a fan trap
automatically and seamlessly executes the appropriate SQL logic.
In fact, the SQL query outlined above, along with the outcomes
displayed in Figure 28, are both products of ThoughtSpot's
automated handling.

8 Mastering dimensional relationships: Join
cardinality, role playing dimensions and
join paths

8.1 INTRODUCTION

One of the key aspects of dimensional modeling is understanding join cardinality, which refers to the relationship between tables and the number of rows that can be associated between them. In this chapter, we will explore the concepts of join cardinality, discuss the pitfalls to watch out for, and provide best practices for transforming 1-to-1 and many-to-many relationships. Additionally, we will highlight the importance of join direction in optimizing query performance.

8.2 UNDERSTANDING DIMENSIONAL RELATIONSHIPS

8.2.1 What is join cardinality?


Join cardinality refers to the relationship between two tables in a database or
data warehouse. It describes how rows in one table correspond to rows in
another table when they are joined. There are three primary types of join
cardinality:
n One-to-one (1:1) relationship: In a one-to-one relationship, each row in
the first table is associated with only one row in the second table, and vice
versa. This type of relationship is relatively straightforward, and it is rarely a
challenge to handle in dimensional models.
n One-to-many (1:N) relationship: In a one-to-many relationship, each row
in the first table can be associated with multiple rows in the second table,

but each row in the second table can only be associated with one row in the
first table. This is a common relationship type and can be effectively managed
in dimensional models.
n Many-to-many (N:N) relationship: In a many-to-many relationship, each row in the first table can be associated with multiple rows in the second table, and vice versa. Many-to-many joins are generally avoided in dimensional models, as they can lead to data redundancy and complicate query performance. However, in some cases they are unavoidable due to the nature of the data.
Let's use the example of a data warehouse for a university that stores
information about students and courses. In this scenario, students can enroll
in multiple courses, and each course can have multiple students enrolled.
This creates a many-to-many relationship between the “Student” and
“Course” tables. Resolving this relationship appropriately is essential to
maintain the integrity and efficiency of the data warehouse.

Best Practice 30 - Understand join cardinality


Understanding join cardinality is essential for accurate and efficient
data analysis and database design. It ensures precise query
results, data integrity, and optimized performance by guiding
proper join types and relationships

8.2.2 Importance of join direction


Join direction plays a critical role in optimizing query performance in dimensional
models. The join direction refers to the order in which tables are joined during
a query. In a dimensional model the join direction is typically from the fact table
(many side) to the dimension (one side).
Incorrect join directions can sometimes cause unexpected query execution plans
and potentially unexpected results.
There are a few use cases where you might want to reverse the join direction.
For example, if you want to use a ‘date selector’ dimension, which will allow you
to control multiple dates at the same time.

Best Practice 31 - Importance of join direction
Optimize Join Direction: Understand that join direction influences
query performance and should typically be from the fact table
(many side) to the dimension (one side).

8.2.3 Role playing dimensions


Role-playing dimensions in dimensional modeling refer to the practice of using
a single dimension table multiple times within a fact table, each time
representing a different perspective or role. This allows for more efficient
storage and querying of data without creating redundant dimension tables.
For example, consider a "Date" dimension table that stores various attributes about dates such as year, quarter, month, and day. In a traditional approach, you might create a separate physical copy of this dimension for each role it plays. With role-playing dimensions, however, you can reference the same "Date" dimension table multiple times from your fact table to capture different date-related information.
Let's say you have a sales fact table that records daily sales transactions. You
can use the "Date" dimension table twice in this context:
n Order date role: This role-playing instance of the "Date" dimension table represents the order date for each transaction. It helps you analyze sales trends and patterns based on when orders were placed.
n Shipping date role: This role-playing instance of the "Date" dimension table represents the date when orders were shipped. It enables you to analyze delivery efficiency and shipping-related insights.

Figure 29 Role Playing Dimensions
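To make this concrete, the following is a minimal SQL sketch of the pattern: one physical date dimension joined twice under different aliases. The table and column names (Sales_Fact, Date_Dimension, Order_Date_Key, Ship_Date_Key) are illustrative assumptions, not a specific schema from this guide.

-- One physical Date_Dimension playing two roles via two aliased joins
SELECT
    od.Calendar_Year     AS Order_Year,    -- "order date" role
    sd.Calendar_Year     AS Ship_Year,     -- "shipping date" role
    SUM(f.Sales_Amount)  AS Total_Sales
FROM Sales_Fact f
JOIN Date_Dimension od ON f.Order_Date_Key = od.Date_Key
JOIN Date_Dimension sd ON f.Ship_Date_Key  = sd.Date_Key
GROUP BY od.Calendar_Year, sd.Calendar_Year;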

8.2.4 Multiple join paths
In the preceding section, as we explored
the concept of role-playing dimensions, we
clarified that multiple join paths need not
necessarily be problematic and can indeed
offer benefits. Nonetheless, there are
instances where they might introduce
potential complications, as ThoughtSpot
must opt for a single path during searches.
Figure 30 Multiple Join Paths

Consider the following scenario involving an employee table, an account table, and a branch table, as shown in Figure 30. To address this, you essentially have two choices:
n Evaluate the necessity of the join and potentially eliminate it (See Figure 31).
n In situations where multiple join paths lead to confusion or impose
constraints related to chasm traps, consider duplicating the table to alleviate
the issue (See Figure 32).

Figure 31 Eliminating a join path
Figure 32 Split up the branch table

8.3 MODELLING DIMENSIONAL RELATIONSHIPS

8.3.1 1-to-1 Relationships


1-to-1 relationships are simple to handle in dimensional models. When dealing
with a 1-to-1 relationship, consider whether the two tables should be merged
into a single table. If both tables contain distinct and independent attributes, it

might be better to keep them separate. However, if the attributes are related
and often queried together, combining them can reduce redundancy and
simplify queries.

Best Practice 32 - Consider transforming 1-to-1 relationships
Consider table merging: Evaluate whether merging two tables with
a 1-to-1 relationship is appropriate, based on the attributes'
independence and querying patterns.

8.3.2 Many-to-many relationships


We'll demonstrate each best practice to handle the many-to-many relationship
between the "Students" and "Courses" dimensions.

8.3.2.1 Bridge or junction tables


To tackle many-to-many joins, it is recommended to introduce bridge or junction
tables. A bridge table acts as an intermediary between the two dimensions and
is used to break down the many-to-many relationship into two one-to-many
relationships.
For our example, let’s create a bridge table called "Enrollments" that connects
the "Students" and "Courses" dimensions. The "Enrollments" table will store
associations between students and the courses they are enrolled in. It typically
contains foreign keys referencing the primary keys of the "Students" and
"Courses" tables.

Enrollments

EnrollmentID StudentID CourseID

1 101 CSCI101

2 101 MATH201

3 102 CSCI101

4 103 MATH201

Table 34 Enrollments table
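To illustrate how the bridge table resolves the many-to-many relationship into two one-to-many joins, here is a minimal SQL sketch; the Students and Courses tables and their key columns are assumed for the example and are not a prescribed schema.

-- Walk the two one-to-many joins through the Enrollments bridge table
-- to count enrolled students per course:
SELECT
    c.CourseID,
    c.CourseName,
    COUNT(DISTINCT e.StudentID) AS Enrolled_Students
FROM Courses c
JOIN Enrollments e ON e.CourseID  = c.CourseID   -- one course -> many enrollments
JOIN Students    s ON s.StudentID = e.StudentID  -- one student -> many enrollments
GROUP BY c.CourseID, c.CourseName;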

8.3.2.2 Natural keys vs surrogate keys
To improve performance and maintain data integrity, it's beneficial to use
surrogate keys in bridge tables. Surrogate keys are system-generated unique
identifiers that have no inherent meaning and help avoid complex composite
keys. By implementing surrogate keys in the bridge table, it becomes easier to
manage relationships and track changes over time.

Use natural keys and surrogate keys wisely


Choose natural keys for dimensions whenever possible, but
leverage surrogate keys for improved performance and
flexibility, especially in bridge tables.

8.3.2.3 Denormalization
Although denormalization is not the ideal solution for most cases in a
dimensional model, it can be considered when dealing with complex many-to-
many relationships. By denormalizing certain attributes from the bridge table
into the fact table, you can reduce the need for joining multiple tables during
queries, thus improving query performance.
Incorporate denormalization by including relevant attributes from the
"Students" and "Courses" dimensions directly into the "Enrollments" fact table.
This step reduces the need for joins during query execution.

Enrollments

EnrollmentID  StudentID  StudentName  CourseID  CourseName
1             101        John         CSCI101   Computer Science 101
2             101        John         MATH201   Mathematics 201
3             102        Emily        CSCI101   Computer Science 101
4             103        Michael      MATH201   Mathematics 201

Table 35 Enrollments fact table (with denormalized attributes)
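As a sketch of how such a denormalized table could be produced, assuming Students and Courses dimension tables with the obvious keys (one possible implementation, not the only one):

-- Copy frequently queried dimension attributes into the fact table
CREATE TABLE Enrollments_Denormalized AS
SELECT
    e.EnrollmentID,
    e.StudentID,
    s.StudentName,      -- denormalized from the Students dimension
    e.CourseID,
    c.CourseName        -- denormalized from the Courses dimension
FROM Enrollments e
JOIN Students s ON s.StudentID = e.StudentID
JOIN Courses  c ON c.CourseID  = e.CourseID;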

8.3.2.4 Aggregations and summarizations
When dealing with large datasets, many-to-many joins can slow down query
performance. To address this, consider pre-aggregating and summarizing the
data. Materialized views or aggregated tables can be employed to store pre-
calculated results, allowing for faster query responses, and reducing the need
for complex joins.
In the case of our example:
To improve query performance, create aggregated tables that pre-calculate
summary information from the "Enrollments" fact table. For example, you could
have an aggregated table showing the number of students enrolled in each
course.

Course Enrollment Count

CourseID CourseName EnrollmentCount

CSCI101 Computer Science 101 2

MATH201 Mathematics 201 2

Table 36 Aggregated table - Course enrollment count
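For example, the aggregated table above could be materialized with a query along the following lines (a sketch; the exact DDL and refresh strategy depend on your platform):

-- Pre-aggregate enrollment counts per course to avoid repeated joins at query time
CREATE TABLE Course_Enrollment_Count AS
SELECT
    c.CourseID,
    c.CourseName,
    COUNT(*) AS EnrollmentCount
FROM Enrollments e
JOIN Courses c ON c.CourseID = e.CourseID
GROUP BY c.CourseID, c.CourseName;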

Best Practice 33 - Consider transforming many-to-many relationships
Transforming a many-to-many relationship is essential to ensure efficient data management and accurate analysis. The relationship can be transformed using any of the four techniques described above:
n Bridge or junction tables
n Surrogate keys in bridge tables
n Denormalization
n Aggregations and summarizations
This transformation enhances data integrity, streamlines querying, and supports optimized performance, making the data model more comprehensible and responsive to analytical needs while avoiding complications associated with direct many-to-many joins.

8.3.3 Role playing dimensions
Please refer to Figure 29 and then:
1. Establish multiple joins between the fact and dimension tables in your
database and import them into ThoughtSpot.
2. Include the fact table in a worksheet.
3. Incorporate the same dimension table fields multiple times, each for a distinct role. When dealing with multiple join paths (e.g., two), you'll be prompted to specify the relevant join upon adding dimension fields, as depicted in Figure 33.

Figure 33 Choosing a join path
4. Choose the suitable relation/join path for each added field.
5. Assign unique names to each field, such as 'order date' and 'shipping date'.
By employing role-playing dimensions, you avoid the need to duplicate data in
separate dimension tables while still gaining the flexibility to analyze data from
different angles. This approach enhances data integrity, optimizes storage, and
simplifies query design in dimensional modeling.

8.3.4 Range joins in dimensional models


While traditional joins serve as the foundation for connecting data tables through
precise matches, there are situations that demand a more adaptable approach.
This is precisely where range joins find their significance, frequently employed
when dealing with Slowly Changing Dimensions (SCDs). Range joins introduce
a heightened level of flexibility, enabling data professionals to create
associations between datasets not solely reliant on exact matches but also
accommodating a range of values. In the following section, we will delve into
the intricacies of range joins, furnish practical examples, and delve into their
advantages and drawbacks.

8.3.4.1 Examples of range joins

8.3.4.1.1 Example 1: Date ranges


Consider a sales database where you want to associate sales transactions with
specific promotional periods. Instead of performing an exact match between
transaction dates and promotion start/end dates, you can use range joins to link
transactions to promotions based on date ranges. For instance, a transaction on
July 5th, 2023, would fall within the promotion that spans from July 1st to July
15th, 2023.
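In SQL terms, a range join of this kind simply replaces the equality predicate with a BETWEEN condition. The sketch below assumes hypothetical Sales_Fact and Promotion tables:

-- Associate each transaction with the promotion whose date range covers it
SELECT
    s.Transaction_ID,
    p.Promotion_Name,
    s.Sales_Amount
FROM Sales_Fact s
JOIN Promotion  p
  ON s.Transaction_Date BETWEEN p.Promotion_Start_Date AND p.Promotion_End_Date;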

8.3.4.1.2 Example 2: Price bands


In a pricing analysis scenario, you might have products categorized into price
bands (e.g., low, medium, high). Instead of matching each product to a specific
price, range joins allow you to link products to price bands. This approach
simplifies analysis by considering a broader range of prices for product
categories.

8.3.4.2 Pros of range joins


n Increased Flexibility: Range joins offer the flexibility to connect data sets
based on broader criteria, accommodating variations and fluctuations in data.
n Handling Overlapping Data: In cases where data periods or ranges
overlap, range joins provide an elegant solution for resolving conflicts,
ensuring that data is appropriately associated.
n Simplified Queries: Range joins can simplify complex queries by reducing
the need for intricate filtering conditions, making data retrieval more
efficient.

8.3.4.3 Cons of range joins


n Complexity: Implementing range joins can be more complex than
traditional exact-match joins, requiring careful consideration of data ranges
and potential overlaps.
n Performance Impact: Depending on the volume of data and the complexity
of the range joins, there may be a performance impact, as these operations
can be computationally intensive.

n Data Maintenance: Managing and maintaining range-based relationships
can be challenging, particularly as data evolves and new ranges are
introduced.

8.3.4.4 Use case: Modeling techniques for assisting with supply chain
analytics
Retail analysts often grapple with numerous responsibilities, from monitoring
product availability to comparing sales across stores and replenishing low stock
items. In this section, we will explore various modeling techniques and tools
created in ThoughtSpot that aid retail analysts in managing their daily tasks
effectively, by looking at two typical examples.

8.3.4.4.1 Example 1: Which products are running low in supply


A common challenge faced by retail analysts is identifying products that are running low in supply across various levels of the supply chain, including distribution regions, distribution centers, stores, and planograms (Figure 34). Traditional systems often rely on pre-calculated metrics like Weeks of Supply (WOS) stored in fact tables. However, pre-aggregation can limit flexibility, especially when starting at an aggregate level and drilling down to detailed levels. Here we will offer a different solution to this challenge.

Figure 34 Inventory aggregation levels

8.3.4.4.1.1 Calculate weeks of supply using range join


Range join is a powerful feature that allows analysts to calculate weeks of supply
in a more flexible manner. Unlike pre-aggregated methods, range join enables
in-line averages at the time of search, making it easier to handle complex supply
chain scenarios.

Figure 35 High level model using the range joins
Figure 36 The week dimension
Figure 37 Defining the range join in TML
Figure 38 The joins defined in ThoughtSpot
Figure 39 How the range join looks in the TS UI
Figure 40 Modifying the join condition in TML to create the BETWEEN
Figure 41 Which products are running low on supply?

Figure 42 Query visualizer
Figure 43 Generated SQL
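Conceptually, the SQL behind this approach resembles the hedged sketch below: inventory snapshots are joined to a week dimension over a date range so that weeks of supply can be derived in-line at query time. All table and column names are illustrative, and the sketch assumes one end-of-week inventory snapshot row per product and daily rows in the sales fact.

-- Range join: assign each sale to the week of the inventory snapshot,
-- then derive weeks of supply = on-hand quantity / units sold that week
SELECT
    i.Product_Key,
    w.Week_Start,
    MAX(i.On_Hand_Qty)                                 AS On_Hand,
    SUM(s.Units_Sold)                                  AS Units_Sold_In_Week,
    MAX(i.On_Hand_Qty) / NULLIF(SUM(s.Units_Sold), 0)  AS Weeks_Of_Supply
FROM Inventory_Fact i
JOIN Week_Dimension w
  ON i.Snapshot_Date BETWEEN w.Week_Start AND w.Week_End   -- the range join
JOIN Sales_Fact s
  ON  s.Product_Key = i.Product_Key
  AND s.Sales_Date BETWEEN w.Week_Start AND w.Week_End
GROUP BY i.Product_Key, w.Week_Start;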

8.3.4.4.2 Example 2: Compare my sales to nearby store sales
Retail analysts are always comparing their stores to other stores to identify
trends that need attention. Retailers often compare foot traffic and sales with
other retailers to assess their own performance and competitiveness in the
market.
Similarly, analyzing sales data in
relation to other retailers can provide
insights into market trends, customer
preferences, and the effectiveness of
promotional campaigns. Retailers may
compare their sales figures with those of
similar retailers to evaluate their market
share, identify growth opportunities, or
make pricing and inventory decisions.
An analyst not only compares the store to other stores in the state, county, or ZIP code, but also would like to drill down to a radius, locating stores in the same neighborhood.

1 Replicated the Sales and Returns using a view joined into a Nearby Store
2 Insert Bridge Tables that allow filtering between my store and nearby stores
3 Store Distance = VIEW AS SELECT A.STORE_ID AS STORE_ID, B.STORE_ID AS STORE_NEARBY_ID, HAVERSINE(A.STORE_LATITUDE, A.STORE_LOGITUDE, B.STORE_LATITUDE, B.STORE_LOGITUDE) AS STORE_DISTANCE FROM DS_DIM_STORE A, DS_DIM_NEARBY_STORE B;

Figure 44 How do my sales compare to stores within a 5-mile radius?

Figure 45 Query plan for "How do my sales compare to stores within a 5-mile radius?"
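Using the Store Distance view defined above, the comparison in Figure 44 can be expressed roughly as follows. This is a sketch only: the sales table (DS_FACT_SALES), its columns, and the chosen store key are assumptions, and the distance threshold must match the unit returned by your HAVERSINE implementation (kilometres on many platforms).

-- Compare my store's sales with sales of stores within a given radius
SELECT
    d.STORE_ID          AS MY_STORE,
    d.STORE_NEARBY_ID   AS NEARBY_STORE,
    d.STORE_DISTANCE,
    SUM(s.SALES_AMOUNT) AS NEARBY_STORE_SALES
FROM STORE_DISTANCE d
JOIN DS_FACT_SALES s
  ON s.STORE_ID = d.STORE_NEARBY_ID
WHERE d.STORE_ID = 1001          -- my store (illustrative key)
  AND d.STORE_DISTANCE <= 8      -- roughly a 5-mile radius if the unit is km
GROUP BY d.STORE_ID, d.STORE_NEARBY_ID, d.STORE_DISTANCE;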

8.3.4.4.3 Example 3: Locate nearby store to replenish low stock alerts
One of the most critical alerts for retail analysts is the out-of-stock alert. In this
requirement, we explore how analysts can quickly locate inventory at different
levels to replenish out-of-stock conditions efficiently, starting with nearby stores
before resorting to distant distribution centers.

The Haversine formula


The Haversine function, which takes its name from the haversine (half versed sine) trigonometric function and was popularized for navigation by R.W. Sinnott, is a crucial mathematical tool for measuring
distances between two points on a
sphere's surface, like Earth's. It finds
widespread use in geospatial
calculations and accurate distance
measurements, notably in
geographical and mapping
applications. By considering latitude
and longitude coordinates, it computes
the shortest "as-the-crow-flies"
distance between two locations,
accommodating Earth's curved
surface. This function finds application
across domains like navigation,
logistics, and location-based services,
aiding tasks such as route planning,
location tracking, and proximity
analysis. Essentially, Haversine
simplifies complex distance
calculations on curved surfaces,
proving invaluable in geospatial
analysis and decision-making.
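As a quick illustration, assuming the underlying warehouse exposes a HAVERSINE(lat1, lon1, lat2, lon2) function as used in the view above (typically returning kilometres), the distance between two coordinate pairs can be computed directly:

-- Distance between two points given as latitude/longitude (illustrative values)
SELECT HAVERSINE(40.7486, -73.9864, 40.7580, -73.9855) AS DISTANCE_KM;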

Figure 46 Locate nearby inventory to avoid a stock-out condition (1 = nearby stores, 2 = my store; example: move inventory from CVS to ThoughtSpot by foot)

Range joins represent a valuable tool in the toolkit of data professionals working
with dimensional models, especially when implementing Slowly Changing
Dimensions (SCDs). They provide the means to establish connections between
data sets that would be difficult or impossible to achieve with traditional joins.
By allowing for flexibility and accommodating overlapping data, range joins
enhance the analytical capabilities of dimensional models.
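A common pattern, sketched below under the assumption of a Type 2 customer dimension with Valid_From/Valid_To columns, is to join each fact row to the dimension version that was in effect on the transaction date:

-- Range join for an SCD Type 2 dimension: pick the row whose validity
-- window covers the transaction date
SELECT
    f.Order_ID,
    c.Customer_Name,
    c.Customer_Segment
FROM Sales_Fact f
JOIN Customer_Dimension c
  ON  f.Customer_ID = c.Customer_ID
  AND f.Order_Date BETWEEN c.Valid_From AND c.Valid_To;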

8.4 CONSIDERATIONS AND CONCLUSION

Mastering dimensional relationships is a cornerstone of effective data modeling.


Join cardinality, role-playing dimensions, and handling multiple join paths are
crucial skills that enable you to create data models that accurately reflect the
complexities of your organization's data while optimizing query performance. By
understanding the principles behind these concepts and applying best practices,
you can transform raw data into meaningful insights that drive informed
decisions, and strategic actions.

9 The pitfalls of outer joins

9.1 INTRODUCTION

Dimensional models are essential for organizing data in data warehousing environments, providing easy-to-understand structures that facilitate efficient querying and reporting. However, incorporating outer joins in dimensional models can lead to various challenges, and when combined with row-level security (RLS) it can get even more complicated. In this section, we will explore the reasons outer joins in dimensional models can be problematic and propose alternative strategies to simplify data management while preserving data integrity and security.

Handle nulls and missing data appropriately


Address null values and missing data in your dimensional
model. Determine whether nulls should be treated as a
separate attribute value or ignored during analysis. Ensure
consistent handling of nulls across dimensions and facts.

9.2 UNDERSTANDING OUTER JOINS

9.2.1 What is an outer join?


An outer join is a join operation that merges datasets from multiple tables, relying on shared attributes, while also including records that don't find matches in one or both of the tables. This approach becomes particularly valuable when dealing with dimension tables marked by data gaps or incompleteness. To illustrate, consider a data warehouse scenario involving a "Product" dimension table and a "Sales" fact table. When certain products remain unsold, an outer join applied to these tables would yield an outcome encompassing all products from the "Product" dimension. If sales data is available, it is incorporated accordingly. Notably, products lacking sales data, or unsold products, are also presented in the results. This method allows for a holistic perspective that embraces both sold and unsold products, enriching the analytical panorama.
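As a hedged sketch using this Product/Sales example (table and column names are illustrative), such an outer join might look like this:

-- All products appear in the result; unsold products show NULL totals
SELECT
    p.Product_Name,
    SUM(s.Quantity_Sold) AS Total_Sold
FROM Product_Dimension p
LEFT OUTER JOIN Sales_Fact s
  ON s.Product_Key = p.Product_Key
GROUP BY p.Product_Name;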

9.2.2 Challenges with outer joins

9.2.2.1 Data integrity and completeness


Outer joins in dimensional models can introduce data integrity issues by creating
null values in result sets. These null values may affect aggregations and
calculations, leading to inaccurate reporting and analysis. Incomplete data can
be misleading and compromise the decision-making process.
Suppose we have a sales fact table that captures sales transactions and a product dimension table that holds information about products. If we use an outer join between the fact table and the product dimension, it may result in null values for products that haven't been sold yet. This can lead to data integrity issues and affect aggregations, causing inaccurate reporting and analysis.

Figure 47 Outer join to products

9.2.3 Increased complexity

Outer joins can make queries more complex and harder to maintain. As the number of dimensions and relationships grows, the query complexity escalates, potentially leading to performance degradation. This complexity also poses challenges for analysts and developers when understanding and optimizing the data model.

Figure 48 Increased complexity with multiple outer joins

A search on this data model would generate a SQL statement similar to the one shown below.
-- Incorrect: Complex query with multiple outer joins
SELECT *
FROM sales
LEFT OUTER JOIN products ON sales.product_id = products.product_id
LEFT OUTER JOIN customers ON sales.customer_id = customers.customer_id
LEFT OUTER JOIN regions ON sales.region_id = regions.region_id
LEFT OUTER JOIN time ON sales.transaction_date = time.date;

9.2.3.1 Query performance


While modern database systems are optimized for performance, the use of outer
joins can still negatively impact query execution times. These joins require more
computational resources and may lead to slower responses when handling large
datasets.

9.2.3.2 Row-level security challenges


Row-Level Security (RLS) is a crucial mechanism for managing data access
based on user permissions. However, its integration with outer joins in
dimensional models can introduce complications. Security policies need to
account for null values resulting from outer joins, leading to intricate rule
definitions and potential security vulnerabilities.
Let's examine the example illustrated in Figure 47, featuring a Sales fact table
and a Product dimension. Suppose we aim to secure this model to restrict access
to specific products, and an outer join is employed. In such a scenario, any Sales
records lacking corresponding products will be hidden from view.
The presence of outer joins introduces null values in the Sales records that do
not match any products in the dimension. When RLS is applied to control data
access, these null-associated Sales rows are filtered out due to insufficient
product information. Consequently, authorized users will be unable to observe
these Sales records, even if they hold significance for analysis or reporting
purposes. This situation may lead to data gaps, potentially impeding
comprehensive insights and decision-making.
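
As a hedged illustration of the effect (table and column names are hypothetical; an actual RLS rule in ThoughtSpot is defined on the table rather than written inline), a product-based restriction effectively adds a predicate like the one below, and rows whose product columns are NULL can never satisfy it:

-- Sales rows with no matching product have NULL product attributes
-- and are silently dropped by the product-based filter
SELECT s.sale_id, s.sales_amount, p.product_name
FROM fact_sales s
LEFT OUTER JOIN dim_product p ON p.product_id = s.product_id
WHERE p.product_group = 'Beverages';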
For more comprehensive information on data models and data security, we
recommend exploring our Data Security Field Guide, which provides detailed
insights into safeguarding data assets in various contexts.

9.3 MODELLING OUTER JOINS

9.3.1 Avoid outer joins where possible


In the pursuit of clean and meaningful results, the choice between inner and
outer joins holds significant importance. Our first best practice advises
leveraging inner joins over outer joins whenever the circumstances permit.
Inner joins eliminate the uncertainty of null values, ensuring that only complete
and valid data contributes to the outcome. This approach not only bolsters data
integrity but also simplifies queries, making the analytical process smoother and
more accurate.

Best Practice 34 - Avoid outer joins where possible


Whenever feasible, strive to use inner joins instead of outer joins.
Inner joins eliminate the possibility of null values and ensure that
only complete and valid data is included in the result set. This
approach improves data integrity and simplifies queries.

9.3.2 Populating fact tables appropriately and creating default members for dimensions
The second practice delves into the importance of meticulously populating fact
tables to avert null value-related predicaments stemming from outer joins. It
also suggests crafting default members for dimensions, ensuring a seamless
data flow even when certain information is missing. By using default values or
adept data imputation techniques, data completeness prevails, circumventing
the need for outer joins. These default members, acting as placeholders, warrant
that all requisite dimension attributes boast valid values. Imagine a product
dimension with absent entries – by introducing a default member like
"Unknown" or "N/A," the gaps are addressed and the analysis remains
unobstructed.

Best Practice 35 - Populate fact tables appropriately and create default members for dimensions
Ensure that fact tables are populated appropriately to avoid null
values caused by outer joins. Wherever data is missing, use default
values or data imputation techniques to maintain data

completeness and prevent the need for outer joins. These default
members act as placeholders, ensuring that all necessary
dimension attributes have valid values.
For example, in a product dimension, if specific products are not
available, create a default member like "Unknown" or "N/A" to
represent missing product entries.

FACT_SALES DIM_PRODUCT

SALE_ID PRODUCT_ID … PRODUCT_ID PRODUCT_NAME …

1 1 … 1 Banana …

2 -1 … -1 Unknown …

3 2 … 2 Pear …

Table 37 Creating default members for dimensions
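
A minimal sketch of how such a default member might be wired up during data loading, assuming hypothetical staging and target tables:

-- Map missing product keys to the default 'Unknown' member (key -1)
INSERT INTO fact_sales (sale_id, product_id, sales_amount)
SELECT stg.sale_id,
       COALESCE(stg.product_id, -1) AS product_id,  -- -1 references the 'Unknown' row in DIM_PRODUCT
       stg.sales_amount
FROM staging_sales stg;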

9.3.3 Leveraging factless fact and bridge tables


Our final best practice introduces a creative approach that sidesteps the
challenges posed by outer joins – using factless fact tables. These specialized
tables capture events or occurrences without numerical measurements,
elegantly serving as bridges between dimension tables. By avoiding outer joins
altogether, data integrity remains intact. In parallel, we discuss the utility of
bridge tables to address intricate many-to-many relationships between
dimensions. Acting as intermediaries, these tables streamline the resolution of
complex relationships, steering clear of the complications that can arise from
direct many-to-many joins.

Best Practice 36 - Use factless fact or bridge tables


In some cases, it's appropriate to create factless fact tables that
capture events or occurrences without numerical measurements.
Factless fact tables act as bridges between dimension tables and
help maintain data integrity without the need for outer joins.

Consider a scenario where we want to track customers' visits to
different stores. Instead of using an outer join with a store
dimension, we can create a factless fact table like "store_visits"
with columns "customer_id" and "store_id" to record visits:

Figure 49 Use a factless fact to avoid outer joins
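
A minimal sketch of this idea, using hypothetical table and column names (the visit_date column is an illustrative addition):

-- Factless fact table: one row per store visit, no numeric measure
CREATE TABLE store_visits (
    customer_id INT,
    store_id    INT,
    visit_date  DATE
);

-- Visits per store can then be answered with plain inner joins
SELECT st.store_name,
       COUNT(*) AS visits
FROM store_visits v
JOIN dim_store st ON st.store_id = v.store_id
GROUP BY st.store_name;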

Another approach to address many-to-many relationships between


dimensions is by utilizing bridge tables, which follows a similar
concept. Bridge tables act as intermediate connectors, facilitating
the resolution of complex relationships between dimensions.

9.4 CONSIDERATIONS AND CONCLUSION

This chapter has explored the complex realm of outer joins within dimensional
models, offering insights that go beyond traditional modeling techniques.
By uncovering the challenges tied to outer joins, we've shed light on the
potential issues that arise when these joins intersect with row-level security
(RLS). The interplay of data gaps, null values, and intricate queries can cast
doubt on the accuracy and reliability of analytical outcomes. However, armed
with alternative approaches, you now have the means to navigate these
challenges while strengthening the core structure of your data model.

10 Scaling data peaks: A deep dive into
hierarchies

10.1 INTRODUCTION

The incorporation of hierarchical structures is a recurring


necessity to effectively represent various aspects of
a business. These hierarchies serve as essential
tools for visualizing organizational structures,
geographical locations, product categorization,
calendars, and charts of accounts, among
other vital components. While ThoughtSpot,
with its advanced search functionality,
liberates users from traditional hierarchical
drilling down, there are situations where a
guided experience from parent to child levels
becomes valuable and relevant.

10.2 UNDERSTANDING HIERARCHIES

10.2.1 What are hierarchies in dimensional modeling?


Hierarchies present a structured and organized way to represent data
relationships, enabling users to navigate through complex data sets with ease.
They offer a top-down approach to data exploration, allowing users to move
from higher-level aggregated data to granular details effortlessly. This
hierarchical representation is particularly valuable for analytical purposes, as it
aids decision-makers in identifying patterns, trends, and insights across
different levels of the data model.
However, even with the flexibility provided by ThoughtSpot's search
functionality, there are scenarios where hierarchical structures are
indispensable. For instance, when analyzing organizational reporting
relationships or conducting in-depth geographical analysis, hierarchical support
becomes crucial to gain a comprehensive understanding of the data.

In this section, we will delve into
different implementation techniques
for hierarchies, considering the
specific use cases and characteristics
of each approach. Whether dealing
with balanced, unbalanced, or ragged
hierarchies, organizations can make
informed decisions to optimize their
data models' hierarchical
representation and utilization. By
understanding and leveraging these
techniques, businesses can unlock the
full potential of their hierarchical data,
enabling a deeper understanding of their operations and making data-driven decisions with confidence.
Figure 50 Sample hierarchies

10.2.2 Types of hierarchies


When it comes to modeling hierarchies, there are several techniques available,
and the most suitable one depends on the specific use case and characteristics
of the hierarchy being modeled. Consider the following elements when choosing
the most appropriate implementation technique:
n Balanced: Determine whether your hierarchy is balanced or unbalanced. In
a balanced hierarchy, the branches have the same depth. An example of a
balanced hierarchy is a calendar. On the other hand, organizational charts in
a company are typically unbalanced.
n Raggedness: Assess whether your hierarchy is ragged or not. A ragged
hierarchy is one where at least one member has a parent that is more than
one level above the child. An example of a ragged hierarchy is geographical
locations.
n User searches: Consider the type of questions that end users are expected
to ask concerning this hierarchy. Understanding the users' query patterns
can help refine the modeling approach.
By carefully analyzing these factors, you can make an informed decision about
which technique is best suited to effectively represent and utilize the hierarchy
in your data model.

10.2.2.1 Balanced hierarchies
The concept of a balanced hierarchy finds a perfect embodiment in the product
hierarchy, where each product is systematically organized into five distinct
levels. This hierarchical structure is known for its uniform depth across
branches, simplifying comprehension and making it an excellent choice for
organizing product-related data. Its predictable and efficient performance
further enhances its appeal, ensuring smooth data navigation and facilitating
insightful analyses for business decision-makers.
Similarly, in traditional data modeling
scenarios, dates often present a well-
structured, balanced hierarchy, flowing
seamlessly from days to months, quarters, and
years. However, with the advent of
ThoughtSpot's advanced keyword-based
search functionality, the conventional
approach of implementing complex date
hierarchies is frequently rendered
unnecessary. ThoughtSpot empowers users to
interact with date-related data using natural
language queries, unlocking a more dynamic
and intuitive data exploration experience. This
flexibility not only accelerates the analysis
process but also liberates users from rigid
hierarchies, fostering creative insights and
enabling them to unearth valuable information
without constraints.

10.2.2.1.1 Advanced attribute hierarchies


Incorporating advanced attribute hierarchies within dimensions adds layers of context to your data, enabling more sophisticated analysis and insights.
Figure 51 Sample product hierarchy
By incorporating advanced attribute hierarchies, your dimensional model
provides a multidimensional view of the data, allowing users to explore diverse
dimensions simultaneously. This approach uncovers hidden relationships,
patterns, and trends that may impact business decisions. Users can conduct
more nuanced analyses, discovering connections between geographic

attributes and other factors, thereby guiding strategic initiatives and
optimizations.

10.2.2.2 Ragged hierarchies


The provided example showcases a
ragged hierarchy, illustrating the
modeling of geographical locations.
In North America, the hierarchy
follows the sequence of Continent
=> Country => State => City,
while in Europe, it is structured as
Continent => Country => City. It is
important to note that this example
has been simplified for clarity,
omitting additional complexities.
For instance, in Europe, there
might be an intermediary level
between continent and countries
for sovereign countries like the United Kingdom. Moreover, different countries could have alternative administrative divisions such as counties, provinces, and more. The purpose of this demonstration is to highlight the presence of missing levels in a hierarchy, not to capture all the intricate details.
Figure 52 Sample geographical locations hierarchy (Ragged)
The ragged hierarchy concept emerges when one or more levels are absent or
differ between different branches of the hierarchy. This can occur due to
variations in geographical or organizational structures across different regions
or entities. Acknowledging the existence of ragged hierarchies is crucial in data
modeling, as it influences the way data is organized and queried, demanding a
flexible approach to accommodate diverse hierarchy configurations.

10.2.2.3 Unbalanced hierarchies

Figure 53 Sample organization chart (Unbalanced hierarchy)


Organization charts often exhibit an unbalanced hierarchical structure. As
exemplified in the given scenario, Tessa Miller, the CEO, holds the topmost
position in the organization, with both Evan, the COO, and Sue, the Executive
Secretary, reporting directly to her. Interestingly, neither Sue, the Executive
Secretary, nor Alison, the System Admin, manage any subordinates, whereas
Evan, in his role as COO, oversees a team of individuals.
This unbalanced hierarchy can be attributed to two key reasons:
1. Varying depth: The depth of the branches within the hierarchy is not
uniform, meaning some branches extend to more levels than others. This
discrepancy can result from the diverse responsibilities and reporting
structures across different roles within the organization.
2. Logical equivalents: The levels of branches in the hierarchy do not
represent logical equivalents. For instance, the role of an executive secretary
is not analogous to that of a COO. Such dissimilarities in responsibilities and

managerial functions contribute to the unbalanced nature of the organization
chart.
Recognizing these disparities in hierarchy is essential for effective data modeling
and analysis, as it informs decision-makers about the unique dynamics within
the organization and influences the way data is organized and interpreted.
Embracing the intricacies of unbalanced hierarchies empowers organizations to
create more accurate representations of their structure and ensures a more
comprehensive understanding of their workforce and managerial relationships.

10.3 MODELING HIERARCHIES

In this section, we will explore different implementation techniques for


hierarchies, each offering unique advantages and considerations. Subsequently,
the following section will delve into the pros and cons of these techniques,
providing a comprehensive analysis of their strengths and limitations.

Selecting the most suitable technique can be influenced by multiple factors. In some instances, the nature of the hierarchy itself might dictate the choice, while in other cases, the decision may hinge on the specific use case and your project requirements. Balancing these factors may involve making certain compromises to achieve the desired outcomes effectively.
By examining these techniques and their respective trade-offs, you can make informed decisions tailored to the unique demands of your data model, thus optimizing the representation and utilization of hierarchies within your business context.

Column name
PRODUCT_ITEM
PRODUCT_TYPE
PRODUCT_LINE
PRODUCT_FAMILY
Table 38 A balanced product hierarchy

10.3.1 Balanced hierarchies


Balanced hierarchies, also known as fixed depth hierarchies, are prevalent in
various fields and offer straightforward modeling solutions. The typical approach
involves denormalizing the hierarchy into columns within the dimension. The
number of columns added determines the roll-up levels to be tracked. For
instance, a balanced product hierarchy, as shown in Table 38, may comprise
five levels. Moreover, you can incorporate multiple hierarchies into the same

table by simply adding additional columns, enhancing flexibility in your data
model.
By joining the PRODUCT_ITEM column to your fact table, the balanced hierarchy
becomes an integral part of your model, enabling you to roll up data to the
specified levels. Nevertheless, it is crucial to ensure a seamless search
experience for end users.

Best Practice 37 - Avoid abstract names in hierarchies


Avoid naming the hierarchy columns in a numeric manner like
"xxxx-level-1," "xxxx-level-2," and so on, as it may confuse users
and obscure the business context. Utilizing abstract names for the
individual levels can render them practically meaningless and even
undermine the proper representation of a ragged hierarchy.
Striving for clear and informative attribute names enhances user
comprehension and contributes to a more intuitive and effective
data exploration experience.

Let's consider a scenario involving an e-commerce company that sells various


products across different categories. We'll model balanced hierarchies in a
dimensional model for product categories. Imagine you have the following
product category hierarchy:
n Electronics
u Computers
u Mobile phones
u TVs
n Clothing
u Men's clothing
u Women's clothing
u Accessories
n Home and living
u Furniture
u Kitchen appliances
u Home decor

To model a balanced hierarchy in a dimensional model, you would create a
"Product" dimension table with multiple attributes representing different levels
of the hierarchy. Here's a simplified version of how the dimension might look:

DIM_PRODUCT

Product Key Product Name Product Type Product Line

1 Laptop Electronics Computers

2 Smartphone Electronics Mobile Phones

3 TV Electronics TVs

4 Shirt Clothing Men's Clothing

5 Dress Clothing Women's Clothing

6 Watch Clothing Accessories

7 Sofa Home and Living Furniture

8 Blender Home and Living Kitchen Appliances

9 Vase Home and Living Home Decor

Figure 54 Sample product dimension with hierarchy

In this example, each product is associated with its corresponding category


levels. The hierarchy maintains balance by ensuring that each category level
has a consistent number of members, even if certain products are missing in
specific categories. For instance, even if there are no products listed in the
"Electronics > Computers" category, the hierarchy still maintains its structure
by including the "Electronics" and "Computers" levels.
This balanced hierarchy simplifies querying and reporting across different
category levels, allowing users to analyze sales, inventory, or other metrics with
consistency and accuracy. It also provides flexibility to add new products to the
hierarchy without disrupting the overall structure.
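
A minimal SQL sketch of the kind of roll-up this enables (fact table and key names are hypothetical; ThoughtSpot generates comparable SQL from a search):

-- Roll sales up the denormalised hierarchy columns
SELECT p.product_type,
       p.product_line,
       SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
GROUP BY p.product_type, p.product_line;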

10.3.2 Advanced attribute hierarchies


Imagine a multinational retail company that wants to analyze sales performance
based on geographic attributes. We'll demonstrate how advanced attribute
hierarchies can enrich the analysis by adding diverse perspectives to the data.

n Step 1: Define initial dimensions and fact table
Dimensions:
u Date Dimension: Contains attributes like Date, Month, Quarter, and Year.
u Product Dimension: Includes attributes such as Product ID, Product Name,
and Product Category.
u Geography Dimension: Encompasses attributes like Region, Country, and
City.
Fact Table:
u Sales Fact Table: Stores sales transactions with measures like Sales
Amount and Quantity Sold, along with foreign keys referencing Date,
Product, and Geography dimensions.
n Step 2: Implementing advanced attribute hierarchies
Integrate advanced attribute hierarchies within the Geography dimension to
provide deeper insights into the data.
Geography Dimension: Create multiple attribute hierarchies within the
Geography dimension:

u Geographic Levels: Include Region, Country, and City attributes in a


traditional hierarchy for standard geographic analysis.

u Population Perspective: Add Population Density as an attribute, enabling
comparisons of sales performance against population density.
u Economic Perspective: Integrate Economic Index as an attribute, allowing
analysis of sales in relation to the economic situation of each location.
n Step 3: Performing analysis
Let's explore how users can leverage these advanced attribute hierarchies
for more comprehensive analysis:
Scenario 1: Regional sales comparison by economic index:
u Users start by selecting a specific Region from the standard geographic
hierarchy.
u They then switch to the Economic Index hierarchy and choose different
economic levels (e.g., High, Medium, Low).
u The analysis reveals sales performance for the selected region based on
its economic status, uncovering potential correlations between economic
factors and sales.
Scenario 2: Population density impact on product categories:
u Users select a country from the geographic hierarchy.
u They further analyze by choosing a Population Density range (e.g., High,
Medium, Low).
u By selecting a Product Category from the Product hierarchy, users can
visualize the impact of population density on the popularity of different
product categories.
Scenario 3: City-level analysis with combined attributes:
u Users select a specific city from the geographic hierarchy.
u They then refine the analysis by combining attributes, such as choosing a
Population Density range and an Economic Index level.
u By analyzing sales trends in this manner, users can gain insights into how
various attributes interact and influence sales patterns.
Implementing this approach empowers analysts to perform advanced
geographic analyses, leading to more informed decisions and actionable insights
that go beyond traditional geographic analysis.
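
As a hedged sketch of the first scenario, assuming hypothetical column names such as geography_key and economic_index_band (the Economic Index banded into High/Medium/Low), the underlying query could resemble:

-- Sales performance by region and economic band
SELECT g.region,
       g.economic_index_band,
       SUM(f.sales_amount) AS sales_amount
FROM fact_sales f
JOIN dim_geography g ON g.geography_key = f.geography_key
GROUP BY g.region, g.economic_index_band;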

Utilize advanced attribute hierarchies for enhanced
analysis
Incorporate advanced attribute hierarchies within dimensions
to provide richer context and facilitate more nuanced analysis.
For example, within a geography dimension, create hierarchies
based on administrative levels, population density, or
economic indicators.

10.3.3 Ragged hierarchies


As demonstrated in the earlier example, ragged hierarchies present a challenge
due to the potential absence of certain levels within the branches. When the
number of levels in the hierarchy is relatively low, it is possible to model the
ragged hierarchy as a balanced one and address the empty levels using two
main approaches:
n Propagating the parent level: In this method, the parent level's data is
extended downward (or, in some cases, the child level data is propagated
upward). However, a notable challenge with this approach lies in the naming
convention of the columns. While the column name 'State' correctly
represents the data for North America, it becomes inaccurate for Europe.
n Using specific values: Alternatively, you can populate the empty levels
with specific values like 'Not Applicable' or more descriptive labels such as
'Other state/Non US-state', allowing users to identify and interpret the data
correctly.
For the given ragged example, the resulting table could resemble the one
depicted in Table 39.

Using specific values:
North America | Europe
USA | England
California | N/A or NULL
San Francisco | London

Or, propagating a level:
North America | Europe
USA | England
California | London
San Francisco | London

Table 39 A geographical locations hierarchy
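
A hedged sketch of how either option could be derived during data preparation (source table and column names are hypothetical):

-- Fill the missing State level for a ragged geography
SELECT continent,
       country,
       COALESCE(state, 'Not applicable') AS state,  -- option: explicit placeholder value
       -- alternatively: COALESCE(state, city) to propagate a neighbouring level
       city
FROM src_geography;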

Best Practice 38 - Limit the depth of ragged hierarchies
It's important to note that this compromise works well for
hierarchies with a narrow range, typically comprising 3-4 levels.
Extending this approach to hierarchies with 4-8 or 10 levels may become impractical, as the attribute names assigned to the various
levels must retain their meaningful context. Striking the right
balance between hierarchy depth and accurate representation is
key to a successful modeling strategy.

10.3.4 Unbalanced hierarchies - Solution 1: Bridge tables


Unbalanced hierarchies frequently occur in organizational structures, where
employees at a particular level may have an extensive number of direct reports,
while others at the same level might have very few or none at all. Departmental
hierarchies also serve as a common example of unbalanced structures.
To address unbalanced hierarchies effectively, one flexible solution involves
employing a bridge table. Typically used to resolve many-to-many relationships,
bridge tables also prove to be a powerful solution for handling hierarchies. Let's
delve into a straightforward example using the following hypothetical structure
depicted in Figure 55:

Figure 55 Implementation of a fictional organization chart

To accommodate the hierarchical relationship between employee data in


DIM_EMPLOYEE and sales data in FACT_SALES, we have introduced a bridge
table. Positioned between DIM_EMPLOYEE and FACT_SALES, this bridge table
serves as a link enabling hierarchical reporting without involving the FACT table
directly, which we will explore further in subsequent discussions.
Let's outline the key joins within this setup:
n EMPLOYEE_HIERARCHY_to_FACT_SALES: This join connects
FACT_SALES to the bridge table using the EMPLOYEE_ID from FACT_SALES
and the CHILD_EMPLOYEE_ID from the hierarchy table.
n parent_EMP_hierarchy: This join links the PARENT_EMPLOYEE_ID from
the hierarchy table to the EMPLOYEE_ID in DIM_EMPLOYEE, establishing the
connection between parent employees and their corresponding records.
n child_EMP_hierarchy: This join connects the CHILD_EMPLOYEE_ID from
the hierarchy table to the EMPLOYEE_ID in DIM_EMPLOYEE, establishing the
link between child employees and their respective records.

10.3.4.1 Contents of the hierarchy table

The hierarchy table serves as a repository for all possible paths between
different levels, encompassing even the pathway to the node itself (called a
zero-length pathway).
This table comprises several essential columns:
n PARENT_EMPLOYEE_ID: Represents the 'parent' level, indicating the
superior employee or entity in the hierarchy.
n CHILD_EMPLOYEE_ID: Lists the child levels accessible to the parent,
signifying the subordinate employees or entities.
n DEPTH: Specifies the distance between the parent and the corresponding
child, providing insights into their hierarchical relationship.
n BOTTOM_FLAG: An indicator flag denoting that a specific child level is the
lowest one and does not have any further children beneath it.
n TOP_FLAG: An indicator flag signifying that a particular level is the highest
in the hierarchy, with no additional levels above it.
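
Taken together, a minimal DDL sketch of this bridge table, assuming the column names above (data types are illustrative):

-- Hypothetical structure of the employee hierarchy bridge table
CREATE TABLE employee_hierarchy (
    parent_employee_id INT,  -- the superior employee in the relationship
    child_employee_id  INT,  -- the subordinate employee reachable from the parent
    depth              INT,  -- distance between parent and child (0 for the node itself)
    bottom_flag        INT,  -- 1 when the child has no further children
    top_flag           INT   -- 1 when the parent is the top of the hierarchy
);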

The bridge table will be populated with
the data illustrated in Figure 56.
Upon inspection, it becomes clear that
the hierarchy table comprises 18 rows,
effectively representing our
hierarchical structure encompassing 7
distinct nodes. For instance, node 1
exhibits access to all nodes, including
itself at depth 0.

10.3.4.2 Worksheet example


With this table now available, let's proceed to create a straightforward example using its data. Incorporating all the relevant tables and inputting employee data for both parent and child employees, our sample fact table contains test data summarizing sales figures per employee.
Figure 56 Contents of the bridge table

Figure 57 A worksheet example for the organizational hierarchy

Let's initiate our first search to retrieve the
same information, aiming to display
individual sales figures for each employee.
Since we seek straightforward answers, we
can employ the child employee, as we've
modeled the employee_id from the fact table
to join with the hierarchy table.
Alternatively, if we wish to determine the
sales figures for each level, encompassing
both individual employees and their
respective team members, we can select the parent employee details instead of the child. This will yield an aggregated result representing the combined sales generated by the parent employee and all their underlying team members.
Figure 58 Contents of the fact table

Figure 59 Searching for individual sales figures per employee

Figure 60 Utilizing the hierarchy

Upon conducting this search, we discovered that everyone under Employee 1,


including the person themselves, contributed a total of $4,920.07 in sales. By
further traversing the hierarchy, we ascertain that Employee 2, for instance,
achieved a total sales figure of $2,783.42, which accounts for the combined
sales of Employee 2, Employee 4, Employee 5, and Employee 7. This value
corresponds to the sum of individual sales for each of these employees, forming
a comprehensive understanding of their sales performance within the hierarchy.
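
For reference, a hedged SQL sketch of the roll-up this search performs behind the scenes (column names are hypothetical; ThoughtSpot generates its own SQL):

-- Total sales per parent employee, including all underlying team members
-- (the zero-length path means the parent's own sales are included)
SELECT parent.employee_name,
       SUM(f.amount) AS team_sales
FROM fact_sales f
JOIN employee_hierarchy h ON h.child_employee_id = f.employee_id
JOIN dim_employee parent  ON parent.employee_id = h.parent_employee_id
GROUP BY parent.employee_name;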

10.3.4.3 Create a hierarchical pivot in ThoughtSpot


In the field, we frequently encounter requests to present hierarchies in a
clickable drill-down format, akin to the tree view familiar to customers from
other tools. For this visualization to function effectively in ThoughtSpot, a few
steps need to be taken:
1. Ensure Cumulative Values: The hierarchy elements must have
cumulative values, where the parent's rolled-up value exactly matches the
sum of its underlying children.

2. Denormalize the tree: Transform the flexible structure into a balanced
hierarchy, as discussed earlier in the document. To achieve this, we require
an additional piece of information in our bridge table, representing the level
of the parent node in the hierarchy (e.g., root/top as level 0, the next level
as level 1, and so on).
3. Define the number of levels: Decide on the number of levels needed for
your use case; in the example, we have four levels (Level0 to Level3).
4. Formulate formulas: Using formulas, calculate the levels and split them
out accordingly. For each level, update the formula accordingly (e.g.,
Level0 formula: max(if (parent level = 0) then parent employee name else
null)). Replace the amount with a formula representing the sum of the
amounts.
5. Create a view: Select all the levels, the amount (the formula), and the child employee name (for grouping purposes), and save this as a view.

6. Utilize the view: Use the view as input for the pivot table (without the
child employee name, used only for grouping).

Figure 61 Utilizing the view and hierarchy in the pivot

Please note that this example may contain null values for various reasons,
including nodes having their own values and the possibility of a ragged hierarchy due
to the denormalization. Understanding these nuances will aid in interpreting the
visualizations effectively. While this example can be further refined regarding
null handling, inner joins/outer joins, and display, its primary purpose is to
illustrate the underlying concept.

10.3.4.4 What additional capabilities does this solution offer?


n Easy extension for shared ownership: The solution can be effortlessly
extended to accommodate shared ownership scenarios. For instance, if a
node (let's say node D) in the organization hierarchy is 50% owned by node
B and 50% by node C, any attributed value (e.g., revenue, payments) for

node D should be apportioned upward with a 50% weighting to node B and
another 50% weighting to node C. To achieve this, simply add an extra
attribute, "percent ownership," to the bridge table and update it for all nodes
where the hierarchy ends at node D.
n Implementation of slowly changing hierarchies: Implementing slowly
changing hierarchies is straightforward by adding effective start and end
dates to the bridge table. It is essential to note that when using this feature,
all searches require a date filter
to freeze the hierarchy at a
specific point in time to obtain
accurate results.
n Hierarchical reporting: The
additional join between the
bridge table and the employee
table allows for seamless
hierarchical reporting, including
questions such as "who reports
to whom." To ensure its
effectiveness, include depth or
any attribute from the bridge
table in your queries. If no
specific attribute is needed from
the bridge table, you can force
its inclusion in any search by
creating a view or by adding a
"fake" RLS rule (e.g., 'True') to
the bridge table and enabling
strict RLS.
Figure 62 Hierarchical reporting: who reports to whom?
n Identification of managers: Easily determine who is a manager by creating a formula that evaluates whether a particular node is a manager or not. This can be
done using a formula such as:
if (bottom_flag = 1 and child_employee_id = parent_employee_id)
then 'not a manager' else 'manager'

Once these results are indexed, you can conveniently filter for managers or
non-managers.

n Team size calculation: Calculate the size of a team using the following
formula:
group_count(child_employee_id, parent_employee_id)

This formula allows you to determine the total count of employees within a
team, facilitating team size analysis.

10.3.5 Unbalanced hierarchies - Solution 2: Node table


An alternative solution for implementing hierarchies is the Node table approach,
which shares similarities with the previous solution but exhibits several
differences. This example revolves around a general ledger system with a chart
of accounts, where the chart of accounts forms a ragged hierarchy. Transactions
are recorded at each leaf node or account within this hierarchy. This model
accommodates scenarios where transactions can be posted at any node in the
hierarchy, necessitating additional entries in the hierarchies table for each level
that experiences transactional activity.
The following diagram illustrates an instance of a ragged hierarchy for a chart
of accounts. In this context, transactions are booked at the leaf nodes,
represented by Account [999], such as ACCOUNT 100 and ACCOUNT 230.
Additionally, 'Top' is included as a logical grouping within the diagram.

Figure 63 A ragged hierarchy for a chart of accounts

10.3.5.1 Data model

The relationship between GL Transactions and GL Accounts is established


through a primary-foreign key connection. Additionally, GL Transactions and GL
Account Nodes are linked through a relationship join, where the AccountID
serves as the matching criterion.

10.3.5.1.1 GL transactions
This table captures transactional data, with each transaction being recorded at
the leaf level of the hierarchy. It contains essential information pertaining to
accounts.

AccountID Amount

100 40

110 60

120 80

210 40

220 160

230 40

10.3.5.1.2 GL accounts
This table facilitates easy selection of the account involved in the transactions.
It enables users to quickly identify the specific accounts that have been
transacted against.

AccountID Account Code Account Name Account Description

100 100 Name 1 100 : Name 1

110 110 Name 2 110 : Name 2

120 120 Name 3 120 : Name 3

210 210 Name 4 210 : Name 4

220 220 Name 5 220 : Name 5

230 230 Name 6 230 : Name 6

10.3.5.1.3 GL account nodes


For each entry in the GL Accounts table, data is normalized or unpivoted in the
GL Account Nodes table, adhering to the following guidelines:
n The Selection Node stands as the primary column that users choose, and it
is indexed in ThoughtSpot, providing search suggestions to users.
n If a node is at the top of the hierarchy, the Parent Node will share the same
value as the Selection Node.
n If a node represents the bottom of the hierarchy, the Child Node will share
the same value as the Selection Node.
n The Level attribute serves as an indicator of the hierarchical depth, conveying
information about the node's position within the tree structure.

ID Selection Node Child Node Parent Node Level

100 Account 100 Account 100 Fixed Assets 3

100 Fixed Assets Account 100 Assets 2

100 Assets Fixed Assets Assets 1

110 Account 110 Account 110 Cash 4

110 Cash Account 110 Current Assets 3


110 Current Assets Cash Assets 2

110 Assets Current Assets Assets 1

120 Account 120 Account 120 Accounts Receivable 4

120 Accounts Receivable Account 120 Current Assets 3

120 Current Assets Accounts Receivable Assets 2

120 Assets Current Assets Assets 1

210 Account 210 Account 210 Short Term 4

210 Short Term Account 210 Loans 3

210 Loans Short Term Liabilities 2

210 Liabilities Loans Liabilities 1

220 Account 220 Account 220 Long Term 4

220 Long Term Account 220 Loans 3

220 Loans Long Term Liabilities 2

220 Liabilities Loans Liabilities 1

230 Account 230 Account 230 Accounts Payable 3

230 Accounts Payable Account 230 Liabilities 2

230 Liabilities Accounts Payable Liabilities 1
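
Using the tables above, a minimal SQL sketch shows how amounts roll up to any node through this join (identifiers are adapted to SQL-friendly names; ThoughtSpot produces comparable SQL from a search on the Selection Node):

-- Because each account appears once per ancestor in GL_ACCOUNT_NODES,
-- grouping by Selection Node sums every leaf transaction underneath it
-- (e.g. 'Current Assets' returns 60 + 80 = 140 with the sample data above)
SELECT n.selection_node,
       SUM(t.amount) AS total_amount
FROM gl_transactions t
JOIN gl_account_nodes n ON n.id = t.accountid
GROUP BY n.selection_node;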

10.3.5.2 Search experience


As a user, navigating and searching for specific nodes within the hierarchy is
straightforward, as can be seen in Figure 64. Additionally, if we select "Current
Assets" from the selection node column in the search, we can easily break down
the results to the next level of the hierarchy by adding the "Child Node" field to
the search, as shown in Figure 65.

Figure 64 Navigating particular nodes in the hierarchy
Figure 65 Adding the child node to the search

Furthermore, should the user wish to analyze "Cash" specifically, they can
simply modify the Selection Node filter value to "Cash," automatically updating
the Child Node value to reflect the relevant context (See Figure 66).
Similarly, for users who require information about the parent node to navigate
up the hierarchy, the Parent Node column can be added to the search, providing
the necessary hierarchical context (Figure 67). This flexible approach empowers
users to explore and drill down into the hierarchy, enhancing their analytical
capabilities and understanding of the data.

Figure 66 Analyzing cash specifically Figure 67 Including parent information

10.3.5.3 Limitation: Data model knowledge required
One limitation of the node table hierarchy is that business users must possess a good understanding of the data model. When multiple nodes are selected in a search, it can lead to confusing results. For example, consider a search involving two node selections: "Current Assets" and "Cash." Since "Cash" is a child of "Current Assets," the line item amounts for both nodes are correct. However, the Total Amount is not accurate, as "Cash" is included twice in the calculation.
Figure 68 Limitation when selecting multiple nodes
To avoid such discrepancies, users need to be cautious while selecting multiple
nodes to ensure the results align with their intended analysis. Having a clear
understanding of the hierarchical relationships within the data model can
significantly enhance the accuracy and reliability of search outcomes. Proper
training and guidance for business users regarding the nuances of the node table
hierarchy can mitigate potential confusion and facilitate more effective data
exploration and analysis.

10.3.6 Unbalanced hierarchies - Solution 3: Path string attributes
In these alternative solutions, we employ clever attributes to encode the path
leading to specific nodes within the hierarchy.
The path string attribute approach involves defining a path string for each level,
starting with the parent's path string, and then adding letters like A, B, or C
from left to right in that pattern. The final character in the path string is either
a plus (+) or a dot (.), serving as an indicator of the node's children's status:
n A+ indicates the presence of children.
n A. denotes that the node represents a bottom node in its respective branch.

Figure 69 Our org chart revisited with path attributes
Figure 70 Adding the path strings to the tables

Note: The example presented here utilizes a simple letter scheme, which limits
the number of children per node to a maximum of 26. However, in more
extensive implementations, you can opt for a more sophisticated pattern. For
instance, you can use two characters per level or incorporate numbers to
accommodate a larger and more complex hierarchy. The flexibility to design
advanced patterns allows for scalability and adaptability, making the approach
suitable for a wide range of hierarchical structures in diverse applications.

The strength of this solution lies in its swift and efficient navigation through the
hierarchical tree. In conventional systems, regular expressions (as
demonstrated in the second column below) are often utilized. However, in
ThoughtSpot, regular expressions are not supported. Instead, we can employ
formulas (as seen in column 3), providing a better experience. Alternatively,
more advanced users familiar with the path string attribute can leverage the
search bar, as demonstrated in column 4.

Search For | Traditional SQL with regular expressions | ThoughtSpot Formula | ThoughtSpot Search Bar

All Employees | A* | left(pathstring,1) = 'A' | pathstring begins with 'a'

Leaf nodes only (4,6,7) | *. | right(pathstring,1) = '.' | pathstring ends with '.'

Managers (1,2,3,5) | *+ | right(pathstring,1) = '+' | pathstring ends with '+'

Top Node | ?+ | right(pathstring,1) = '+' and strlen(pathstring) = 2 | 'a+'

Employee 2 and its children | AA? | left(pathstring,2) = 'AA' (see note below) | pathstring begins with 'aa'
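
For readers who also query the underlying tables directly, a minimal SQL sketch of the same prefix logic (table and column names are hypothetical; the path strings follow the scheme above):

-- Employee 2 and everything beneath it: prefix match on the path string
SELECT employee_id, employee_name, pathstring
FROM dim_employee
WHERE pathstring LIKE 'AA%';

-- Leaf nodes only: path strings that end with a dot
SELECT employee_id
FROM dim_employee
WHERE pathstring LIKE '%.';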

Note: Please be aware that these generic filters, particularly when


implemented with formulas, provide a satisfactory search experience.
However, the search experience can significantly diminish when attempting
more specific queries. For instance, if we aim to select 'Employee 2' and all its
children in the hierarchy, the search process becomes more challenging and
less straightforward. This highlights the need for further optimization and
tailored approaches to handle such specific searches effectively within the
hierarchy.

Creating a flexible search that covers specific scenarios like selecting 'Node 2' and all its children is not straightforward unless the end user comprehends the structure of the path strings. To approach this, we can utilize formulas to store the key field of the desired node (Node 2) in a variable and then employ group_aggregate and other functions to filter the data, as depicted in Figure 71.
Figure 71 Using group_aggregate and other functions
Here we have implemented the following formulas:

Formula | Definition

selected_employee_id | 2

selected_principal_path_string | left(group_aggregate(max(pathstring), {}, {employee_id = selected_employee_id}), strlen(group_aggregate(max(pathstring), {}, {employee_id = selected_employee_id})) - 1)

filter_me | left(pathstring, strlen(selected_principal_path_string)) = selected_principal_path_string

To adapt this search for 'Node 3' and its children, all that is required is to modify
the value of the selected_employee_id variable to 3.
This implementation strategy represents the highest level of flexibility
achievable with the current version of ThoughtSpot. Alternatively, users who
grasp the path string structure could attempt searching for path strings
beginning with 'AA', but this approach might not be as user-friendly.
Please note that different or more flexible implementations might be possible
through subqueries, views, and joining them with the main table, but these
approaches are notably more complex to implement (if feasible at all).

10.3.6.1 Unbalanced hierarchies - Solution 4: Modified pre-ordered tree traversal
An alternative to the previous approach is the modified pre-ordered tree
traversal method. In this approach, each node is associated with a pair of
numbers that uniquely identify the nodes below it.
In this setup, every node possesses attributes called "left" and "right," which
enables comprehensive querying of the entire hierarchy. For instance, querying
where the "left" attribute falls between 1 and 14 would retrieve the complete
hierarchy. Additionally, leaf nodes can be identified when the difference between
the "left" and "right" attributes is precisely 1.

Figure 72 Our fictional org chart with index numbers

To perform similar searches as in the previous section, the following queries


would be required:

Search For | Traditional SQL

All Employees | emp_left >= min_left (*) and emp_right <= max_right (**)

Leaf nodes only (4,6,7) | emp_right - emp_left = 1

Managers (1,2,3,5) | emp_right - emp_left > 1

Top Node | emp_right = max_right (**)

Employee 2 and its children | emp_left >= 2 and emp_right <= 9 (see note below)

ThoughtSpot Search Bar: search bar queries are not really user friendly here, as they would require an understanding of the left/right values of each level.

(*) min_left is a formula calculating the minimum value for emp_left (i.e. 1 in this case)

(**) max_right is a formula calculating the maximum value for emp_right (i.e. 14 in this case)

Note: Like the path string approach, generic
filters can be implemented using formulas,
but conducting search bar searches without
knowledge of the left and right indices can
prove to be quite challenging. Additionally,
performing more specific searches, like
finding 'Employee 2' and its children in the
hierarchy, becomes more intricate. The most
adaptable solution would likely involve
creating a formula containing the starting
node (e.g., node 2) and utilizing two group
aggregates to determine the left and right
indices for that node.

For example:

In this example we have the following formulas:

Formula | Definition

selected_employee_id | 2

selected_left | group_aggregate(min(emp_left), {}, {employee_id = selected_employee_id})

selected_right | group_aggregate(max(emp_right), {}, {employee_id = selected_employee_id})

filter_me | emp_left >= selected_left and emp_left <= selected_right
If we want to do the same search but then for Employee 3, we just need to
change the value of the selected_employee_id to 3.

10.4 COMPARISON OF THE VARIOUS IMPLEMENTATION TECHNIQUES

In the previous section, we discussed six distinct techniques for implementing


hierarchies. However, these techniques can be grouped into three categories
based on the type of hierarchy to implement or the nature of the solution.

Hierarchy Type | Solution Type | Implementation Technique

Balanced Hierarchies | – | Balanced Hierarchies; Ragged Hierarchies

Unbalanced Hierarchies | Using additional hierarchy table | Bridge Tables; Node Table

Unbalanced Hierarchies | Using additional hierarchy attributes | Path String Attributes; Modified pre-ordered tree traversal

As a result, it would be sensible to compare and analyze them within these three
distinct groups.

10.4.1.1 Balanced hierarchy: Balanced vs ragged
When dealing with balanced hierarchies, there are two implementation
techniques available, both based on the flattened table approach. The choice
between these techniques largely depends on whether your hierarchy exhibits
ragged characteristics or not. It's essential to consider the level of raggedness
in your hierarchy; in cases of extreme raggedness, it may be more appropriate
to opt for one of the unbalanced hierarchy approaches instead. By carefully
evaluating the nature of your hierarchy, you can make an informed decision to
select the most suitable implementation technique that best aligns with your
specific data structure and requirements.

10.4.1.2 Unbalanced hierarchy: Implementation using bridge or node tables
The first group of implementation approaches for unbalanced hierarchies uses an additional table to define and control the hierarchy. Although the concept is similar, there are key differences between the two approaches.

 | Bridge Table | Node Table

Functionality | Values can exist on every node/level of the hierarchy; more flexible, more features | Values can exist on leaf nodes (bottom) only; less flexible, fewer features

Number of rows in table | More rows than the node solution: for each node, (number of children + 1) rows | Fewer rows than the bridge solution: for each node, (number of leaf nodes under this node) rows

Scalability | Less than Node Table | More than Bridge Table

Search Experience | Good | Good

When considering scalability and performance, it is crucial to assess the number
of rows in each table. The bridge table, in this comparison, generally contains
more rows than the node table, which may pose potential challenges for very
large hierarchies. However, determining the threshold for what qualifies as a
"large hierarchy" is a pertinent question.
For instance, in the case of the largest retailers worldwide, their product
hierarchies could encompass up to a million products. However, it's important
to note that product hierarchies tend to be balanced, making the balanced
hierarchy approach more suitable.
On the other hand, for HR solutions attempting to model all employees of one
of the top 30 largest global companies, including their hierarchical reporting
relationships, the bridge table solution might not be the optimal choice. In such
cases, assessing the appropriateness of either approach becomes crucial.
In summary, when choosing between these two options, you must weigh the
tradeoff between flexibility and scalability/performance. Carefully evaluating
your specific use case and hierarchy size will enable you to make a well-informed
decision that best aligns with your data model's needs and overall performance
requirements.

10.4.1.3 Unbalanced hierarchies: Path string attributes vs modified pre-ordered tree traversal
 | Path String | Modified pre-ordered tree traversal

Functionality | Same | Same

Number of columns required | 1 | 2

Scalability | Not very good | Not very good

Performance | Fast | Fast

Search Experience | Poor | Poor

Hierarchy/ETL Maintenance | Complex (any change requires hierarchy rebuild) | Complex (any change requires hierarchy rebuild)

10.4.1.4 Balanced vs bridge/node table vs attribute solution
Finally, the following table will list some differences between the three groups
of implementation strategies:

Hierarchy type/solution | Balanced | Bridge/Node Table | Path String/Tree traversal

Ease of Understanding | High | Medium | Low

Implementation Effort | Low | High | High

Flexibility | Low | High | Low

ETL Effort | Low | High | High

Change Impact | Medium | Low | (Very) High

Search Experience | High (*) | Medium – High | Low – Medium

(*) The effectiveness of using ragged balanced hierarchies depends on whether proper column naming has been
implemented for each level. Ensuring that each level possesses an understandable name, rather than generic labels like
"level-1," "level-2," etc., is crucial for enhancing the search experience and overall usability of the hierarchy within the
system. Clear and descriptive column names enable users to intuitively navigate and interpret the data, making the
implementation of ragged balanced hierarchies more advantageous when well-structured and appropriately labeled.

It is important to consider that attribute solutions may not always be the most
suitable choice for implementation in ThoughtSpot due to their poor search
experience. As search experience is a critical aspect of the platform, alternative
solutions are often preferred. However, in specific scenarios where detailed
analysis of the hierarchy is not a primary requirement, attribute solutions can
offer a faster and simpler alternative. It ultimately depends on the specific use
case and the level of analysis needed to determine whether attribute solutions
can effectively serve as a viable option within ThoughtSpot.

10.5 CONSIDERATIONS AND CONCLUSION

Hierarchies serve as vital tools for organizing and analyzing complex data
relationships, providing a structured approach that aids in uncovering insights
and patterns. Throughout this chapter, we have explored the diverse facets of
hierarchies, delving into their types, implementations, and considerations for
various use cases.

Balanced hierarchies, characterized by their uniform depth and streamlined
structure, are adept at simplifying comprehension and enabling efficient data
navigation. While traditional data modeling scenarios, such as calendar dates,
often align well with balanced hierarchies, the emergence of advanced search
functionalities, like those offered by ThoughtSpot, is reshaping the landscape.
The ability to interact with data using natural language queries is revolutionizing
the exploration process, transcending rigid hierarchies, and fostering creativity
in insights generation.
Advanced attribute hierarchies add layers of context to data models, facilitating
multidimensional analysis and unveiling intricate relationships. This approach
empowers analysts to draw nuanced insights from geographical attributes,
population perspectives, and economic indicators, thereby guiding strategic
initiatives.
Ragged hierarchies, exemplified by varying branch depths or absent levels in
different branches, demand flexible modeling approaches. Solutions like
propagating parent-level data or using specific values to represent missing
levels enable meaningful representation and interpretation of data.
Unbalanced hierarchies, where branch depths and logical equivalents vary,
introduce complexities in data modeling. Solutions such as bridge tables, node
tables, path string attributes, and modified pre-ordered tree traversals offer
distinct methods for handling such hierarchies, each with its own benefits and
limitations. Businesses must consider the
nature of their data, user requirements,
and organizational structures to
determine the most suitable modeling
technique.
In conclusion, hierarchies serve as
indispensable tools for navigating
complex data landscapes. By
carefully choosing and
implementing appropriate
techniques, organizations can
unleash the potential of their
hierarchical data, driving deeper
understanding, informed decision-
making, and strategic growth. As the data
landscape continues to evolve, mastering

hierarchical modeling becomes a critical skill for organizations seeking to
harness the power of their data.

Validate and document hierarchies


Regularly validate hierarchies within dimensions to ensure
accuracy. Document hierarchies clearly to assist users in
understanding the structure and relationships, promoting
consistent and effective analysis.

Avoid excessive hierarchies


Limit the number of hierarchies within dimensions to avoid
overwhelming users. Focus on the most essential attributes for
drilling down and aggregating data.

Consider surrogate hierarchies


In traditional dimensional modeling, hierarchies are often
based on the natural attributes of a dimension, such as a
calendar hierarchy with levels like year, quarter, month, and
day. While these hierarchies provide essential ways to
aggregate and analyze data, they might not always align with
user preferences or reporting needs. This is where surrogate
hierarchies come into play.
A surrogate hierarchy is an alternative hierarchy within a
dimension that is designed to facilitate specific analytical
scenarios or user preferences. Unlike the natural hierarchy, a
surrogate hierarchy is defined based on business logic, user
behavior, or specific use cases, rather than following the direct
attributes of the dimension.
Consider a sales dimension that includes the attributes
"Product," "Category," and "Subcategory." The natural
hierarchy might follow the product category structure, but
users may also want to analyze sales by a different
perspective, such as "Brand." Creating a surrogate hierarchy
based on the "Brand" attribute allows users to easily navigate
and analyze sales data by brand, regardless of its position in
the original hierarchy.

Natural Hierarchy:
Product → Category → Sub Category
Surrogate Hierarchy:
Product → Category → Sub Category → Brand

11 Evolving relations in time: A deep dive
into slowly changing dimensions

11.1 INTRODUCTION

One crucial aspect of dimensional modeling is handling Slowly Changing Dimensions (SCDs). Slowly Changing Dimensions refer to the entities or
attributes in a database that undergo changes over time but at a slower rate
than the data itself. These changes can include updates, inserts, or deletes,
making them complex to manage. In this chapter, we will delve into the issues
associated with Slowly Changing Dimensions and explore best practices to
effectively implement them with detailed examples.

11.2 UNDERSTANDING SLOWLY CHANGING DIMENSIONS

11.2.1 What are slowly changing dimensions?


There are in total seven types of slowly changing dimensions. In this section we will only discuss types 1-4, as these are the ones you are most likely to encounter in the field; the remaining types are rarer and are mainly combinations of these four.

11.2.1.1 Type 1: Overwrite


In the Type 1 approach, historical changes are not preserved. When an update
occurs, the existing data in the dimension is simply overwritten with the new
values. This method is straightforward and maintains the latest information but
lacks historical context, which may be critical for some analyses.

11.2.1.2 Type 2: Add new record
Type 2 approach introduces a new row in the dimension table for each change,
and we need two additional data columns to capture validity of the row. This
way, a complete history of changes is preserved, enabling time-based analysis.
However, this method can lead to increased storage requirements, and queries
might become more complex as they need to consider multiple records for a
single entity.

11.2.1.3 Type 3: Add columns


The Type 3 approach involves adding columns to the existing dimension table
to store some limited historical data. This method strikes a balance between
storage efficiency and historical tracking. However, it may not be suitable for
scenarios where more extensive historical data is needed.

11.2.1.4 Type 4: Maintain history in separate table (aka "bifurcation")


In Type 4 SCD, historical attributes are stored in a separate table, while the
current attribute values are kept in the main dimension table. This approach
allows for efficient storage and maintains the historical data. However, it may
result in more complex queries due to the need to join the main table with the
history table.

11.2.2 Challenges with slowly changing dimensions


n Data consistency: Handling SCDs requires careful consideration to maintain
data consistency. Incorrectly updating or inserting data can lead to
inconsistencies in historical analyses.
n Query complexity: As the number of historical records grows in some of
these SCDs, queries can become more complex and slower, impacting
performance.
n Storage requirements: Certain SCDs increase storage requirements
significantly, as they store multiple versions of the same entity.
n Historical reporting: Providing accurate historical reports may become
challenging, especially when dealing with Type 1 or Type 3 SCDs.

11.3 MODELLING SLOWLY CHANGING DIMENSIONS

Assume we have the following initial Employee dimension table:

DIM_EMPLOYEE

EmployeeID EmployeeName Department

101 John Smith HR

Table 40 Employee dimension for SCD descriptions

Now, we will apply changes to the Employee dimension using each SCD type to
illustrate the concepts.

11.3.1 Type 1: Overwrite


In this scenario, the Type 1 SCD strategy is demonstrated. The initial state of
the "Employee" dimension table (DIM_EMPLOYEE) includes an employee named
John Smith in the HR department. Using the Type 1 SCD strategy, John Smith's
department is updated to "Finance." The result is that the existing data is simply
overwritten, and no historical information is preserved. The updated record
directly replaces the original record, reflecting the most current information.
This approach is suitable when historical data isn't essential, and only the latest
values matter.

DIM_EMPLOYEE

EmployeeID EmployeeName Department

101 John Smith Finance

Table 41 Updating John's department (SCD Type 1)
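As a minimal sketch, the Type 1 change above is nothing more than an in-place update; the statement below assumes the DIM_EMPLOYEE table exactly as shown in Table 40:

-- SCD Type 1: overwrite the attribute in place; no history is kept
UPDATE dim_employee
SET department = 'Finance'
WHERE employeeid = 101;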

11.3.2 Type 2: Add new record


Like the previous scenario, John Smith's department is updated to "Finance."
However, with Type 2 SCD, instead of overwriting the existing record, a new
record is inserted into the "Employee" dimension table. This new record includes
an "EffectiveStartDate" and "EffectiveEndDate" to indicate the validity period of
each version of the record. The old record for John Smith in the HR department
remains in the table, but its "EffectiveEndDate" is adjusted to signify when it's

no longer valid. This approach maintains a historical record of changes while
accommodating new information.

DIM_EMPLOYEE

EmployeeID EmployeeName Department EffectiveStartDate EffectiveEndDate

101 John Smith HR 2023-01-01 2023-08-02

101 John Smith Finance 2023-08-03 9999-12-31

Table 42 Updating John's department (SCD Type 2)
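A hedged sketch of the corresponding load logic is shown below; the column names follow Table 42, and real pipelines would typically wrap this in a MERGE statement or use a dedicated SCD mechanism in the ETL tool:

-- SCD Type 2: close the currently valid row...
UPDATE dim_employee
SET effectiveenddate = '2023-08-02'
WHERE employeeid = 101
  AND effectiveenddate = '9999-12-31';

-- ...and insert the new version with an open-ended validity period
INSERT INTO dim_employee
  (employeeid, employeename, department, effectivestartdate, effectiveenddate)
VALUES
  (101, 'John Smith', 'Finance', '2023-08-03', '9999-12-31');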

11.3.3 Type 3: Add columns


In this case, the Type 3 SCD strategy is used. Once again, John Smith's
department is updated, but this time, two new columns are introduced:
"PrevDepartment" and "PrevDepStartDate." These columns store the previous
department value and the date when the change occurred. The updated
department value is stored in the "Department" column. This approach allows
for the tracking of the most recent change and the previous value without
creating new records or overwriting existing ones. It strikes a balance between
maintaining a limited history and accommodating changes.

DIM_EMPLOYEE

EmployeeID EmployeeName Department PrevDepartment PrevDepStartDate

101 John Smith Finance HR 2023-01-01

Table 43 Updating John's department (SCD Type 3)
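For illustration only, and assuming the extra columns from Table 43 are added once up front, the Type 3 change can be expressed as a single update that shifts the current value into the "previous" columns before overwriting it:

-- One-time schema change to hold the limited history
ALTER TABLE dim_employee ADD COLUMN prevdepartment VARCHAR(50);
ALTER TABLE dim_employee ADD COLUMN prevdepstartdate DATE;

-- SCD Type 3: preserve the outgoing value, then overwrite the current one
UPDATE dim_employee
SET prevdepartment   = department,
    prevdepstartdate = '2023-01-01',   -- start date of the previous value, as in Table 43
    department       = 'Finance'
WHERE employeeid = 101;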

11.3.4 Type 4: Maintain history in separate table (aka "bifurcation")
The Type 4 SCD strategy is exemplified here, with a twist. John Smith's
department is updated to "Finance," and a new table named "Employee_History"
is introduced. The original "Employee" dimension table retains the updated
record, while the "Employee_History" table holds the historical data. This
separation keeps the primary dimension table smaller and more current, while
historical data is stored separately. This strategy is beneficial when dimension
attributes experience high volatility, and it's particularly useful for maintaining

historical information without affecting the primary dimension table's
performance. It allows for efficient storage and retrieval of historical records.

DIM_EMPLOYEE

EmployeeID EmployeeName Department EffectiveStartDate EffectiveEndDate

101 John Smith Finance 2023-08-03 9999-12-31

Table 44 Updating John's department - employee table (SCD Type 4)

DIM_EMPLOYEE_HISTORY

EmployeeID EmployeeName Department EffectiveStartDate EffectiveEndDate

101 John Smith HR 2023-01-01 2023-08-02

Table 45 Updating John's department - employee history table (SCD Type 4)
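A rough sketch of the Type 4 move, using the table names from Tables 44 and 45 (illustrative only, not production ETL):

-- SCD Type 4: archive the outgoing version in the history table...
INSERT INTO dim_employee_history
  (employeeid, employeename, department, effectivestartdate, effectiveenddate)
SELECT employeeid, employeename, department, effectivestartdate, '2023-08-02'
FROM dim_employee
WHERE employeeid = 101;

-- ...then keep only the current version in the main dimension table
UPDATE dim_employee
SET department         = 'Finance',
    effectivestartdate = '2023-08-03',
    effectiveenddate   = '9999-12-31'
WHERE employeeid = 101;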


Lastly, in the final table, a scenario is presented where the Type 4 technique is
used to create a mini-dimension. In this approach, a subset of dimension
attributes with relatively high volatility is extracted and managed in a separate
mini-dimension. Each unique profile in the mini-dimension is assigned a
surrogate key. Both the surrogate keys of the primary dimension and the mini-
dimension profile are stored as foreign keys in the fact table, enabling efficient
querying and data retrieval. This strategy optimizes performance and storage
while accommodating changes in volatile dimension attributes.

Table 46 Using SCD-4 as a mini-dimension

11.4 CONSIDERATIONS AND CONCLUSION

In the journey of exploring Slowly Changing Dimensions (SCDs), we've navigated through various strategies to manage evolving data within a
dimensional modeling context. SCDs refer to attributes or entities that undergo

changes at a slower pace than the data itself, presenting challenges in
maintaining data accuracy and historical context. Through this chapter, we've
unearthed the complexities and best practices of handling SCDs, shedding light
on how different strategies can be employed to cater to distinct requirements.
A successful approach to managing Slowly Changing Dimensions is founded on
a balanced understanding of the nature of your data and the desired analytical
outcomes. Each SCD type presents a trade-off between storage, query
performance, and historical context. The choice of strategy should align with
your specific business needs and technological environment. By comprehending
the nuances and benefits of each strategy, you can empower your data
architecture to gracefully accommodate changes while delivering valuable
insights.

Some SCDs might require range joins


Some of the SCD types require a range join, which in
ThoughtSpot can be implemented using a relationship or you
can implement it in your CDW.
In Falcon, you could do it using the following command:
ALTER TABLE "wholesale_buys"
ADD RELATIONSHIP "REL_fruit" WITH "retail_sales" AS
  "wholesale_buys"."fruit" = "retail_sales"."fruit" AND
  ("wholesale_buys"."date_order" < "retail_sales"."date_sold" AND
   "retail_sales"."date_sold" < "wholesale_buys"."expire_date");
Note: A range join (when defined in TQL) always needs an
equality condition besides the range join
If you need to define joins in the ThoughtSpot user interface,
you will need to do this in two steps:
n Define the join between the two tables and add the two joins
for both the < and > part (as you cannot do that yet in the
interface)
n Open/Edit the TML in the source table and find the join and
change the ‘=’ to the proper operator (<,>,<=,>=)
Please consider:
n A date range join at large volumes of data will be slower than a primary-foreign key relationship join.

n If modeled correctly, i.e. no overlapping dates, then there
will be no duplicate records.
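If you choose to resolve the range join in your CDW rather than in ThoughtSpot, a hedged SQL sketch of the equivalent query could look like the following; it reuses the fruit tables from the TQL example above, and the sales_amount measure is a hypothetical column added purely for illustration:

-- Range join in the CDW: an equality condition on fruit plus the date-range condition
SELECT
    w.fruit,
    w.date_order,
    r.date_sold,
    r.sales_amount                 -- hypothetical measure on the retail sales table
FROM wholesale_buys w
JOIN retail_sales r
  ON w.fruit = r.fruit
 AND r.date_sold > w.date_order
 AND r.date_sold < w.expire_date;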

Want to avoid the range join?


n Can it be modeled using a different SCD type?
n Otherwise, we can potentially denormalize it and bring the
date fields down to the fact:
Step 1 : In the fact table include additional columns
corresponding to the range

SPID TranID Date Amount FromDate ToDate

SP1 T1 02/01/22 10 01/01/22 31/12/22

SP2 T2 01/01/22 20 01/01/22 31/12/23

SP3 T3 01/01/22 30 01/01/22 31/12/23

SP1 T4 01/01/23 40 01/01/23 31/12/23

SP2 T5 01/01/23 50 01/01/22 31/12/23

SP3 T6 01/01/23 60 01/01/22 31/12/23

Step 2: Create an FK-PK relationship on SPID, FromDate and ToDate from the fact table to the dimension, which is an SCD (see the join sketch after the table below):

SPID SalesRep Department FromDate ToDate

SP1 Name1 D1 01/01/22 31/12/22

SP2 Name2 D2 01/01/22 31/12/23

SP3 Name3 D3 01/01/22 31/12/23

SP1 Name1 D2 01/01/23 31/12/23
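Once the range columns are denormalized onto the fact, the relationship becomes a plain equality join on the composite key. A minimal sketch, using hypothetical table names for the two tables above:

-- Composite-key join replacing the range join (fact and SCD dimension as shown above)
SELECT f.tranid, f.amount, d.salesrep, d.department
FROM sales_fact f
JOIN salesrep_dim d
  ON f.spid     = d.spid
 AND f.fromdate = d.fromdate
 AND f.todate   = d.todate;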

12 From fine to coarse: Crafting data models
with precision and grain

12.1 INTRODUCTION

A nuanced understanding of granularity and its implications is essential for constructing effective
and performance-driven data models. Granularity
defines the level of detail at which data is
captured, playing a pivotal role in shaping the
precision of analyses and the efficiency of
queries. This chapter delves into the intricacies of
granularity and explores strategies to manage
mixed grain models, where varying levels of
granularity intersect, presenting unique challenges and
opportunities. From loading data at the lowest granularity to
making informed decisions about denormalization and tackling mixed grain
complexities, this chapter offers insights to help you navigate the intricate world
of data modeling.

12.2 UNDERSTANDING GRANULARITY

12.2.1 What is granularity?


Granularity is the level of detail at which data is captured and stored within a
database. It signifies the depth and specificity of information contained in each
data record. In dimensional modeling, defining granularity is fundamental as it
directly impacts the precision of analyses and the performance of queries.
For instance, consider a sales data scenario. If we define granularity at the daily
level, each data record would represent a single sale made on a specific day. On
the other hand, if we set granularity at the monthly level, data records would
capture summarized sales for each month.

12.2.1.1 Importance of loading data at the lowest grain
As a best practice, we recommend loading data at the lowest level of granularity
(not summarized). This approach offers several advantages:
n Enhanced query flexibility: Having data at the lowest level of detail
empowers users to identify aggregate outliers and drill down to understand
underlying trends. This translates to heightened query flexibility and the
ability to explore data comprehensively.
n Reduced cognitive demand: Users are relieved from the burden of
performing complex calculations. For instance, they don't need to manually
calculate unique counts for accurate answers. Furthermore, calculations like
averages are less prone to producing erroneous results.
n Dynamic aggregation: Instead of relying on pre-calculated ratios, data
should encompass columns necessary for calculating numerators and
denominators. This ensures we can dynamically aggregate solutions,
adapting to users' changing requirements.

Best Practice 39 - Load data at the lowest granularity for enhanced analysis
Storing data at its most detailed granularity is a key best practice,
offering numerous advantages for efficient analysis. By keeping
data unsummarized, users gain enhanced query flexibility,
enabling them to spot outliers, delve into underlying trends, and
comprehend data comprehensively.
Maintaining data at the finest level of detail eliminates the need for
intricate calculations, such as unique counts and averages. This
streamlines analysis and minimizes the risk of calculation errors.
Moreover, this approach supports dynamic aggregation by
retaining essential columns for on-the-fly metric calculations.
Unlike rigid pre-calculated ratios, this adaptability ensures that
users can adjust their aggregations according to evolving needs.
In summary, prioritizing data granularity forms the basis for
insightful analysis. It empowers users to uncover valuable insights
while simplifying their analytical workflows.

12.2.2 Denormalization: To denormalize or Not?
Denormalization is a pivotal concept in dimensional modeling, involving the
consolidation of related data to optimize query performance. However,
denormalization should be approached with careful consideration of specific
scenarios to strike the right balance between improved querying and data
integrity. Let's explore when and how denormalization should be applied:

12.2.2.1 Minimizing joins and simplifying data


A primary motivation behind denormalization is to reduce the complexity of joins
and enhance query performance. By consolidating data into a single table,
denormalization minimizes the need for joining multiple tables, leading to faster
response times.

12.2.2.2 Denormalization in 1:M relationships


In a one-to-many (1:M) relationship, denormalizing the "1" side can prove
beneficial. This consolidation helps to fetch related data more efficiently, as
queries don't require additional joins to access relevant information.

12.2.2.3 Appropriate denormalization for small attributes


For tables containing just one or two small attributes, denormalization can
streamline queries without dramatically affecting data size. However, when
dealing with dimension tables boasting a plethora of attributes, opting for
denormalization might inadvertently inflate the data volume. In such cases,
retaining the table's integrity by preserving it as a dimension table can be a
judicious decision.

12.2.2.4 Caution with measures and denormalization


Exercise caution when dealing with tables that include measures (quantifiable
data, e.g., sales, revenue) in a denormalization context. Denormalizing these
tables, except in one-to-one (1:1) relationships, can lead to unintended
consequences. Measures are sensitive to data duplication, which can result in
skewed or inaccurate analyses.
In essence, the judicious application of denormalization can considerably
enhance query performance and user experience within a dimensional model.
By striking a balance between minimizing joins and preserving data integrity,

businesses can navigate the complexities of denormalization to optimize their
analytical capabilities.

12.2.2.5 Degenerate dimensions


For simple, single-valued attributes that don't need a separate dimension table,
you can incorporate them directly into the fact table as degenerate dimensions.
Examples include invoice numbers, order IDs, or transaction IDs.

12.2.2.6 Mini-dimensions
For attributes with a relatively small number of unique values, consider creating
mini-dimensions. These compact dimension tables can be used to manage low-
cardinality attributes efficiently, reducing the overall complexity of your model.

12.2.3 Handling mixed granularity models


In this section, we address the intricacies of mixed granularity models,
particularly in the context of chasm traps within ThoughtSpot. A specific focus
is placed on situations where chasm traps emerge at different levels of
granularity, leading to complex challenges in modeling. We explore this scenario
by considering Sales and Budget Facts, each captured at distinct levels, to
dissect the complexities of managing mixed grain facts.
For instance, let's consider a dimensional model with a Sales Fact table capturing
data daily, including attributes like Date, Department, Sales Rep, Customer, and
Item. In contrast, the Budgets Fact table operates at a monthly level, containing
Date, Department, and Sales Rep information, but not extending down to the
customer level. This unique scenario creates a puzzle when attempting to
combine total Sales and Budgets by Customer, generating a perplexing situation
for analysis.

Figure 73 A data model with mixed grain facts

12.2.4 Search experience


To illustrate the challenges of this mixed granularity model, let's envision a
Business User's search journey. The user aims to understand Total Sales and
Total Budgets per Customer, embarking on a search to gain insights that span
beyond the granularity defined by Budgets. However, this exploration leads to
a mixture of results that can be puzzling.

Figure 74 Searching the mixed grain model


The challenge becomes evident due to the discrepancy between the requested
Customer-level data and the inherent granularity of Budgets. The results of the

search are complex and contestable, primarily stemming from the inherent
differences in granularity between Sales and Budgets.

Figure 75 Results from the search on the mixed grain model

To dissect this situation, we break down the query process into three steps:
n Query 1 retrieves the total sales for each customer.
n Query 2 fetches the total budgets.
n Query 3 combines the results, attempting to allocate total budgets to each customer row.

Figure 76 The query plan for this search

12.3 MODELING FOR MIXED GRAIN

12.3.1 Option 1: Augmenting the fact table


The key to effective modeling for search results involves a practical approach of
aligning facts at their lowest attribute level. This transformation not only ensures
precise calculations but also presents information in a clear and user-friendly
manner, promoting accuracy and ease of understanding.
In simple terms, the objective is as follows:

Figure 77 Correctly modeling mixed grain

So, how can we achieve this outcome?


n Step 1: Establish conformed fact tables: The journey begins by
enhancing each fact table with additional columns that establish connections
across dimensions. For instance, introduce a CUSTOMERID column to the
BUDGETS table, assigning a default value (e.g., -99) to all rows.
n Step 2: Elevate attribute tables: Enhance attribute tables with an extra
row that holds an ID value of -99, accompanied by attributes such as 'N/A'.
In our example, this augmentation applies to the customer table.
n Step 3: Bridging the gap: Seamlessly establish links between fact tables
and attribute tables. In this context, establish a connection between Budgets
and Customer, effectively creating a bridge that spans the gap.
By following these practical steps, a solution emerges. The enigma of the chasm
trap is effectively resolved, providing users with a coherent, accurate, and
intuitive search experience.
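A minimal SQL sketch of steps 1 and 2, assuming a BUDGETS fact table and a CUSTOMER dimension with a CUSTOMER_NAME column (the column names here are illustrative):

-- Step 1: add a conformed CUSTOMERID to the budgets fact and default it to -99
ALTER TABLE budgets ADD COLUMN customerid INT DEFAULT -99;
UPDATE budgets SET customerid = -99;   -- ensure existing rows are populated as well

-- Step 2: add the 'N/A' member to the customer dimension so the join resolves
INSERT INTO customer (customerid, customer_name)
VALUES (-99, 'N/A');

Step 3 is then a matter of defining the join between BUDGETS.CUSTOMERID and CUSTOMER.CUSTOMERID in ThoughtSpot, just like any other fact-to-dimension relationship.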

12.3.2 Another option: splitting dimension


In certain scenarios, there exists an alternative avenue to address the
complexities of multi-grain challenges. This alternative comes into play when
varying levels of granularity are contained within the same dimension. In such
cases, the two fact tables are still linked via the same dimension but at different
levels. A common example of this arises with a date/time dimension, where one

fact table operates at a daily level, while the other operates at a higher level
such as monthly or another designated period.
For better clarity, consider the model depicted in Figure 78. This model incorporates two distinct fact tables: a Sales Fact table characterized by daily granularity per product, and a Target Fact table that outlines target quantities for a specified period, which could span a month, several weeks, or any other interval higher than daily.

Figure 78 Multi-grain dimensional model
To illustrate this concept, let's populate this model with practical sample data:

PRODUCT_DIMENSION

PRODUCT COLOR PRICE

ProductA Red 100.00

ProductB Blue 150.00

ProductC Green 200.00

Figure 79 Sample data for the PRODUCT_DIMENSION table

DATE_DIMENSION

DATE PERIOD YEAR

2023-01-01 202301 2023

2023-01-02 202301 2023

2023-01-03 202301 2023

2023-01-04 202301 2023

… Populate all values so we have data for the complete month of January 2023 …

2023-01-30 202301 2023

2023-01-31 202301 2023

Figure 80 Sample data for the DATE_DIMENSION table

SALES_FACT

DATE PRODUCT QTY

2023-01-15 ProductA 50

2023-01-10 ProductB 40

2023-01-20 ProductC 60

Figure 81 Sample data for the SALES_FACT table

TARGET_FACT

PERIOD TARGET_QTY

202301 500

202302 600

202303 700

Figure 82 Sample data for the TARGET_FACT table

Upon importing these tables into ThoughtSpot and configuring them, a straightforward worksheet encompassing key attributes can be devised for testing purposes. This test search aims to uncover the disparities between actual sales and target sales within a given period, as shown in Figure 83.

Figure 83 Running a search query to get sales vs target
While the actual quantity (Total Qty) is accurately reported, the Target Quantity
exhibits a significant overcount. This overcount arises due to the distinct grain

levels of the two tables. During the join process, the system inadvertently
duplicates the 500 value 31 times, corresponding to each date within that
period.
To tackle this issue, several options can be explored. One approach is as detailed
in the preceding section. Alternatively, one can avoid using the period key and
substitute it with the date of the period's first day. However, this would
concentrate all target quantities on a single day.
In this context, a more fitting resolution involves splitting the dimension. This is
possible because both grains coexist within the same primary dimension but
operate at differing levels, i.e., daily versus period.
To achieve this we split the date dimension into two dimensions:
n DATE_DIMENSION, which contains the DATE and the PERIOD (FK)
n PERIOD_DIMENSION, containing the PERIOD and YEAR
These dimensions are then joined together, and the TARGET_FACT is linked to the PERIOD_DIMENSION instead of the DATE_DIMENSION, as shown in Figure 84.

Figure 84 Resolving the multi-grain issue by splitting dimensions

When this reconfigured setup is imported into ThoughtSpot, configured, and relevant attributes are included in a worksheet, the updated search can be performed, as depicted in Figure 85.
The outcome of this approach successfully rectifies the overcounting issue,
accurately reporting the target quantity as 500. This technique serves as an
effective means of addressing multi-grain challenges in your data model.
Notably, this technique is most suitable when the facts are linked via the same
primary dimension, such as Date in this context, albeit at differing levels.
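A minimal sketch of the split in SQL, assuming the DATE_DIMENSION columns shown earlier (DATE, PERIOD, YEAR):

-- Derive the coarser-grained period dimension from the existing date dimension
CREATE TABLE period_dimension AS
SELECT DISTINCT period, year
FROM date_dimension;

-- The daily dimension keeps DATE plus PERIOD (now a foreign key to PERIOD_DIMENSION);
-- YEAR lives only at the period level
ALTER TABLE date_dimension DROP COLUMN year;

TARGET_FACT.PERIOD is then joined to PERIOD_DIMENSION.PERIOD, while SALES_FACT continues to join to DATE_DIMENSION.DATE.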

Figure 85 Rerunning our actuals vs target search

12.4 CONSIDERATIONS AND CONCLUSION

In conclusion, granularity stands as a foundational concept that underpins the accuracy and effectiveness of dimensional models. By recognizing the
significance of granularity, organizations can make informed choices that
empower users to perform comprehensive analyses with ease and accuracy.
The practice of loading data at the lowest level of granularity fosters flexible
querying, reduces cognitive load, and facilitates dynamic aggregation for
adaptable metrics. Similarly, the art of denormalization, when judiciously
applied, streamlines query performance while maintaining data integrity.
The challenge of mixed granularity models, exemplified through the enigma of
chasm traps, highlights the importance of thoughtful modeling approaches.
Whether through bridging conformed fact tables, splitting dimensions, or other
techniques, managing mixed grain models is essential for yielding accurate and
meaningful insights.
By navigating the nuances of granularity, organizations can construct data
models that amplify analytical capabilities and support informed decision-
making.

13 Unboxed insights: Unleashing the
potential of flexible data structures

13.1 INTRODUCTION

This chapter explores the dynamic landscape of flexible data structures within the context of
analytical frameworks. In an increasingly data-
driven world, the nuances of working with diverse
data formats, including key-value pairs,
unstructured, and semi-structured data, pose
unique challenges and opportunities. This chapter
delves into the intricacies of managing such data,
highlighting strategies to unlock its potential within
structured environments.

13.2 UNDERSTANDING FLEXIBLE DATA STRUCTURES

13.2.1 What are key value pairs?


Modeling key/value pairs within a dimensional framework presents specific
challenges due to their dynamic nature, which often doesn't align well with the
structured schema of traditional models. Key/value pairs are used to encompass
attributes with diverse and undefined characteristics.
The main issue with key/value pairs becomes apparent in the search experience,
where users might need to be educated about how to query this type of data.
To illustrate, consider the table shown below.

Trans_ID Measure Code Measure Value

1 Sales 100

1 Tax 10

1 Cost 50

Figure 86 Key/value pair sample data

When queried, the resulting output lacks optimal usability, and users are
required to understand attributes such as 'measure_value' and 'measure_code,'
which can diminish the overall search experience.

Figure 87 Searching key/value pairs

13.2.2 What is unstructured/semi structured data?


Regarding unstructured and semi-structured data, certain considerations need
to be made, especially in the context of ThoughtSpot's capabilities.
In the context of ThoughtSpot, the emphasis primarily lies on structured data.
As a result, support for directly handling semi-structured and unstructured data
within the platform is not (yet) available.
Consider a simplified example of some Twitter feed data:
{
"user": {
"id": "123456789",
"username": "twitteruser",
"full_name": "John Doe",
"followers_count": 10000,
"location": "New York, NY",
"verified": true
},
"tweet": {
"id": "987654321",
"text": "Excited to announce the launch of our new product! 🚀
#ProductLaunch",
"created_at": "2023-08-07T15:30:00Z",
"retweet_count": 50,
"favorite_count": 150

},
"hashtags": ["ProductLaunch"],
"mentions": ["@colleague1", "@partner"],
"media": [
{
"type": "image",
"url": "https://example.com/image.jpg"
}
]
}
In this example, we have a JSON object representing a Twitter feed with several
key components:
n user: Information about the user who posted the tweet.
n tweet: Details of the tweet itself, including its content and engagement
metrics.
n hashtags: An array of hashtags used in the tweet.
n mentions: An array of user mentions within the tweet.
n media: An array containing media elements associated with the tweet, such
as images or videos.
Given the current data structure, our options are limited. To make meaningful
progress, we must leverage the capabilities of the underlying data platform to
flatten or unpack this information and import the results of that process into
ThoughtSpot.

13.3 MODELING FLEXIBLE DATA STRUCTURES

13.3.1 Key value pairs


To tackle this challenge, the solution involves transforming the data through
pivoting (converting rows into columns). Various methods can be employed for
this purpose:
n Database-level pivot: This involves creating a database view to pivot the
data and then importing this view into ThoughtSpot. While it does provide
the desired search experience, its scalability might be limited as the data
volume increases.

SELECT
    Trans_ID,
    MAX(CASE WHEN measure_code = 'Sales' THEN measure_value END) AS Sales,
    MAX(CASE WHEN measure_code = 'Tax'   THEN measure_value END) AS Tax,
    MAX(CASE WHEN measure_code = 'Cost'  THEN measure_value END) AS Cost
FROM KeyValueData
GROUP BY Trans_ID;
n ThoughtSpot pivot: Within ThoughtSpot, formulas can be utilized to pivot
the data. Although this method aligns well with search expectations, concerns
about scalability could arise over time.
n ETL incorporation: The ETL process can also incorporate the pivot
functionality, enabling pre-calculation and persistence of the pivoted data.
This option guarantees the best search experience and performance.
In conclusion, effectively managing key/value pairs involves navigating the
challenges of dynamically represented data. Pivoting and various
implementation methods can significantly enhance the usability and
effectiveness of searches in such scenarios.

13.3.2 Unstructured/semi structured Data


To address the challenge of incorporating unstructured and semi-structured data
into a structured data model compatible with ThoughtSpot, data warehouses
offer a viable solution. For instance, platforms like Snowflake allow for the
storage of data in formats such as JSON or Parquet, leveraging a "variant" data
type.
Here's how the resolution process would work:
n Structured representation: Within a data warehouse, unstructured or
semi-structured data can be stored using the variant data type offered by
platforms like Snowflake.
n SQL transformation: Utilizing SQL keywords and functions, data engineers
can perform transformations that flatten the nested structure of the
unstructured or semi-structured data. This process converts it into a
structured tabular format, enabling efficient querying and analysis.
n Selective flattening: As a best practice, it is recommended to flatten or
transform only the attributes from the source data that are actually required
for searching and analysis. Performing this transformation incurs a
computational cost. By focusing on the necessary attributes, organizations
can optimize performance and minimize resource consumption.

n Table or view creation: Following the transformation, structured tables or
views are generated from the unstructured or semi-structured data. These
structured entities align with ThoughtSpot's requirements and are conducive
to seamless integration.
n Import and analysis: Finally, the newly created structured tables or views
can be easily imported into ThoughtSpot for comprehensive analysis and
exploration.
Here is a Snowflake example illustrating how you could extract the data from
the JSON tweet sample provided earlier. It's important to highlight that this
example covers the complete extraction. However, it's advisable to adhere to
best practices by only unpacking the specific data elements that are essential
for your analysis.
CREATE OR REPLACE VIEW flattened_tweet_data AS
SELECT
json_data:user:id::string AS user_id,
json_data:user:username::string AS username,
json_data:user:full_name::string AS full_name,
json_data:user:followers_count::integer AS followers_count,
json_data:user:location::string AS location,
json_data:user:verified::boolean AS verified,
json_data:tweet:id::string AS tweet_id,
json_data:tweet:text::string AS tweet_text,
json_data:tweet:created_at::timestamp AS created_at,
json_data:tweet:retweet_count::integer AS retweet_count,
json_data:tweet:favorite_count::integer AS favorite_count,
json_data:hashtags::array AS hashtags,
json_data:mentions::array AS mentions,
json_data:media[0]:type::string AS media_type,
json_data:media[0]:url::string AS media_url
FROM json_data;
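If you also need the array elements as rows (for example one row per hashtag), Snowflake's LATERAL FLATTEN can complement the view above; a small sketch against the same json_data column:

-- One row per hashtag, keeping the tweet id so the result can be joined back
SELECT
    t.json_data:tweet:id::string AS tweet_id,
    h.value::string              AS hashtag
FROM json_data t,
     LATERAL FLATTEN(input => t.json_data:hashtags) h;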
By harnessing this strategy and adhering to the principle of selective
transformation, organizations can bridge the gap between unstructured or semi-
structured data and the structured data model demanded by ThoughtSpot.
While the platform itself thrives on structured data, the capabilities of modern
data warehouses and SQL-driven transformations empower users to transform
and organize diverse data formats into a format suitable for impactful analysis
within ThoughtSpot.

13.4 CONSIDERATIONS AND CONCLUSION

In conclusion, navigating the realm of flexible data structures demands a thoughtful and pragmatic approach. As organizations strive for deeper insights,
the ability to harness the power of key-value pairs, unstructured, and semi-
structured data becomes a strategic advantage. By employing techniques like
pivoting and structured representation, businesses can transform these forms
of data into formats compatible with ThoughtSpot. As the landscape of data
continues to evolve, mastering the art of managing flexible data structures
ensures that organizations remain at the forefront of data-driven innovation.

14 Dimensional mastery: Exploring the date
dimension

14.1 INTRODUCTION

One vital element of a dimensional model is the Date Dimension, which acts as a key piece in
understanding time-related information. In
this chapter, we'll dive into what the Date
Dimension is, why it matters, and how it fits
into dimensional modeling. We'll also explore
how ThoughtSpot's keywords play a role and
can replace a date dimension.

14.2 UNDERSTANDING DATE DIMENSIONS

14.2.1 What is a date dimension?


Think of the date dimension as a special tool that helps us understand time
better. It's like a compass for navigating through time-related data. This
dimension lets us analyze events across different periods, compare historical
data, and spot trends. With various attributes and categories, it transforms
abstract time data into something we can easily interpret.

14.2.2 Using ThoughtSpot keywords: Simplifying time queries
ThoughtSpot offers keywords that make it easier to ask questions about time
intervals, like Yearly, Quarterly, Monthly, Weekly, Hourly, Detailed. These
keywords are super helpful because they replace the need for a dedicated
Date Dimension table in many cases. Furthermore, ThoughtSpot offers support
for custom calendars which in combination with these keywords is powerful.

14.3 MODELING DATE & TIME

As previously stated, ThoughtSpot can manage most standard date/time dimension functions through its comprehensive set of keywords. However,
specific situations arise where a dedicated Date Dimension table becomes
necessary:

n To join non-conformed data: if we're dealing with different sets of facts that need to be joined, a shared date dimension provides the common key to bring those facts together.
n You want to define 'Smart Date Hierarchies' to provide users with flexible and intelligent ways to analyze data over different time perspectives, which is not provided out-of-the-box by ThoughtSpot and its keywords, e.g.:
u Seasonality: Add attributes related to seasonality, such as "Holiday
Season," "Back-to-School Season," or "Summer Season." These attributes
can help users identify trends and patterns related to specific periods of
the year.
u Comparative Analysis: Introduce attributes for comparative analysis, such
as "Same Period Last Year" and "Previous Period." (Like-for-like
comparisons). These attributes enable users to compare current
performance with historical periods, facilitating trend analysis. This is
especially useful when you want to utilize the versus keyword in
ThoughtSpot which does not work in combination with ‘group’ functions
(see section 14.4).
u Day Part Analysis: Create attributes for different parts of the day, such as
"Morning," "Afternoon," and "Evening." This can help users understand
how performance varies during different time periods within a day.
u Time-to-Event Analysis: If applicable, add attributes for "Time to Event"
analysis, such as "Days to Next Promotion" or "Days to Product Launch."
These attributes provide insights into how certain events impact business
performance.
u In section 17.3.3.2 we also describe a design pattern where the date
dimension can help with semi-additive measures
In the grand world of dimensional modeling, the date dimension is like a guide
to help us make sense of time-related data. While ThoughtSpot's keywords

ThoughtSpot Data Security, 1st Edition | Spot-On Data Modelling | A ThoughtSpot Field Guide | Page 183
simplify things, the date dimension table is a powerful tool when we need to dig
deeper.
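As a rough sketch of what such an enriched date dimension might look like (every column name below is illustrative rather than a prescribed schema):

-- Illustrative date dimension with "smart" attributes on top of the basics
CREATE TABLE date_dimension (
    date_key                DATE PRIMARY KEY,
    calendar_year           INT,
    calendar_month          INT,
    season                  VARCHAR(30),  -- e.g. 'Holiday Season', 'Summer Season'
    same_period_last_year   DATE,         -- supports like-for-like comparisons
    previous_period         VARCHAR(6),   -- e.g. the prior YYYYMM for period-over-period analysis
    days_to_next_promotion  INT           -- time-to-event attribute
);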

14.4 USING THE DATE DIMENSION FOR LOOK BACK MEASURES

14.4.1 Introduction
In the world of retail analytics, gaining insights into historical performance is
paramount. Retailers often need to compare sales, market share, and other key
metrics year-over-year to understand trends and make informed decisions.
However, performing these comparisons efficiently can be challenging,
especially when dealing with large datasets. In this section, we will delve into
utilizing a date dimension for ‘look back measures’, which allows retailers to
answer crucial questions about their performance over time.

14.4.2 Requirement definition


Retailers continually seek ways to understand how their products are performing
in the market. Whether it's comparing sales figures between this year and last
year or assessing market share changes, these insights can drive strategic
decision-making. This 'look back measures' design pattern addresses that need, especially because the 'group_' functions in ThoughtSpot, which are often used for such comparisons, do not work in combination with the powerful 'versus' keyword.
Here are some key questions that this design pattern helps answer:
1. How have sales changed this year versus last year during the same period?
2. What is the share of MULO (Multi-Outlet) this year versus last year during
the same period?
3. How has the share changed over time?
4. How does the share compare this year versus last year to another retailer
during the same period?
5. Bonus: How much foot traffic went through the store?
These questions are crucial for retailers to understand their performance,
market position, and the impact of various factors on their business.

14.4.3 Solution design
Our solution for look back measures begins with the creation of a Week
Dimension that defines the look-back periods. The process involves the following
steps:
1. Week dimension: Create a Week Dimension that defines the look-back
periods.
2. Views and join: Create SELECT * views over the Fact and join them to the
Look Back Period.
3. Worksheet labeling: Insert the Look Back view in the worksheet and label
the measure as "1 year ago" or any relevant timeframe.

Figure 88 Utilizing the date dimension for look back periods

This solution allows users to easily compare data from different time periods
and gain insights into their retail performance.
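A hedged sketch of steps 1 and 2 is shown below; every table and column name is hypothetical, but the essence is a week dimension that maps each week to its look-back week, plus a SELECT * view over the fact joined through that mapping:

-- Week dimension that defines the look-back period for each week
CREATE TABLE week_dimension (
    week_id            INT PRIMARY KEY,   -- e.g. 202336
    week_start_date    DATE,
    week_id_last_year  INT                -- the corresponding week one year earlier
);

-- SELECT * view over the sales fact, joined to the look-back period;
-- in the worksheet its measures are labelled "1 year ago"
CREATE OR REPLACE VIEW sales_1_year_ago AS
SELECT w.week_id AS report_week_id,
       f.*
FROM week_dimension w
JOIN sales_fact f
  ON f.week_id = w.week_id_last_year;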

14.4.3.1 Defining look back measures


Once the foundational elements are in place, you can define the look back
measures. Some key formulas include:
n Sales This Year minus Sales Last Year = Sales Diff from Last Year
n (Sales This Year over Sales Last Year) - 1 = Sales % from Last
Year

These measures enable retailers to analyze sales variations and trends between
different time periods efficiently.

14.4.3.2 Application
With the solution and measures defined, ThoughtSpot users can quickly perform
analyses and answer the critical questions mentioned earlier. For instance, they
can easily assess how sales have changed over time, calculate the share of
MULO this year versus last year, and track share changes over specified time
windows.

14.4.4 Example: How have sales changed this year versus last year during the same period?
The ability to visualize and interact with data in real time empowers users to
make data-driven decisions swiftly.

14.4.4.1 Enhancing share measures


This design pattern extends beyond sales comparisons. Users can define share
measures to gain deeper insights into market share dynamics. Here's how it
works:

Figure 89 Worksheet model (Share measures)

n Sales This Year over MULO Sales This Year = Share of MULO
n Sales Last Year over MULO Sales Last Year = Share of MULO Last Year
n Share of MULO - Share of MULO Last Year = YoY Share Change
n YoY Share Change over time (Last week vs. Last 12 weeks vs. Last 26 weeks
vs. Last 52 weeks)
These share measures provide retailers with a comprehensive view of their
market position and how it evolves over different timeframes.

14.4.4.2 Example: What is the share of MULO this year versus last year
during the same period?

Figure 90 What is the share of MULO this year versus last year during the same period?

Figure 91 Query Plan for What is the share of MULO this year versus last year during the same
period?

14.4.4.3 Example: How has the share changed over time?

Figure 92 How has the share changed over time?

14.4.4.4 Example: How does the share compare this year versus last year
to another retailer during the same period?

Figure 93 How does the share compare this year versus last year to another retailer during the
same period?

14.4.5 Conclusion
Utilizing the Date Dimension enables retailers to efficiently compare
performance data across different time periods, answer critical questions, and
gain valuable insights into their business. By harnessing the power of
ThoughtSpot's analytics platform, retailers can make data-driven decisions that
drive growth and success in a competitive market.

14.5 CONSIDERATIONS AND CONCLUSION

This chapter has highlighted the importance of the date dimension, which serves
as a fundamental framework for understanding time-related data. This

dimension simplifies the interpretation of time, allowing us to navigate through
different time periods, make historical comparisons, and uncover underlying
trends.
ThoughtSpot's keywords make it easier to query time intervals, reducing the
need for a dedicated date dimension table in many scenarios. These keywords
provide added flexibility and precision. Nevertheless, there are situations where
a dedicated date dimension table is essential, particularly when more advanced
and non-standard functionalities are required.
One of many examples is the combination of the date dimension with the 'look back measures' design pattern in ThoughtSpot, which has a profound impact on the retail industry. It equips retailers with the necessary resources to efficiently analyze performance data across various timeframes, offering solutions to critical questions and valuable insights into their operations.
In conclusion, the date dimension serves as a guiding light for exploring
temporal data complexities. ThoughtSpot's keywords simplify query processes,
while the Date Dimension table remains a powerful tool for in-depth analysis
when needed.

15 Counting coins across continents: A guide
to currency conversion in analytics

15.1 INTRODUCTION

In the globalized business world, organizations often deal with data from multiple countries with different currencies. To
provide accurate and consistent reporting and analysis,
dimensional models in data warehousing and business
intelligence systems must handle currency conversion
effectively. This article explores the challenges associated
with currency conversion in a dimensional model and
presents best practices with detailed examples to address
these challenges.

15.2 UNDERSTANDING CURRENCY CONVERSION

15.2.1 What is currency conversion?


Currency conversion in a dimensional data model involves the process of
translating financial values from one currency to another, allowing for consistent
analysis and reporting across different currency denominations. This is crucial
for organizations that operate in multiple countries or regions, as it ensures that
financial data can be accurately compared and analyzed irrespective of currency
variations.
The conversion process typically requires a centralized currency conversion
table that holds exchange rates for various currencies over time. When querying
the data, the appropriate exchange rates are applied to convert the values into
a common currency, often referred to as a reporting currency. This ensures that
financial metrics such as revenue, costs, and profits can be accurately
aggregated and analyzed in a uniform manner, providing a holistic view of the
organization's performance. Additionally, currency conversion helps in creating
standardized financial reports and dashboards that reflect consistent monetary
values, enabling effective decision-making and performance evaluation across
diverse markets and currencies.

15.2.2 Challenges of currency conversion
n Exchange rate management: Currency conversion requires up-to-date
and historical exchange rates. Keeping track of exchange rates over time and
ensuring the accuracy of conversions can be complex, especially when
dealing with various currencies and frequent rate fluctuations.
n Consistency and integrity: Maintaining the consistency and integrity of
currency conversions is essential to avoid errors in reporting and analysis.
Inconsistent conversions can lead to discrepancies in financial statements
and misleading business decisions.
n Multi-currency transactions: Dealing with multi-currency transactions in
a single report or analysis requires proper handling of different currency
conversions for accurate aggregations.
n Dimensional hierarchies: Currency conversion needs to be applied at
various levels of dimensional hierarchies, such as at the country, region, or
global level. Ensuring correct conversions across these hierarchies is crucial.

15.3 MODELING CURRENCY CONVERSION

Let’s consider a simple sales dimensional model for a global company.


Transactions can be made in multiple currencies and the amounts will also be
provided in a base currency which is standard across the whole company.
Including a base rate currency and amount in the fact table can be a good idea
for handling consistency and rounding errors in multi-currency transactions. This
approach is often referred to as "multi-currency accounting" or "multi-currency
triangulation". The base rate currency and amount can act as a reference point
for all transactions, ensuring that all values are consistent and can be accurately
converted to other currencies without accumulating rounding errors.
n Transaction table: Contains sales transactions data, including the amount
sold and the currency code used in the transaction, it also contains the
amount in the base currency. The base currency is fixed across the whole
company so for performance reasons, it is probably better to calculate this
as part of the ETL/ingestion process.

n Exchange rate table: This holds the exchange rate between a source and a target currency. For the simplicity of the example we have not included dates here, but in the real world you probably want to store an exchange rate for a certain date/period.
n User table: This table stores the user and their preferred currency.
n User currency bridge: This table contains the currencies used by the users. This is a degenerate bridge table. It's an optimization technique used in dimensional modeling to avoid creating unnecessary dimension tables while still capturing the necessary relationships between facts and dimensions. In our case we use it to resolve the many-to-many relationship and avoid null records being returned when no user is selected. This should never happen as a user should always be specified, but it will limit the amount of data processed by the query.
Our data model will look like the one shown in Figure 94.

Figure 94 A data model for a multi-currency use case
We will populate the model with the following data:

EXCHANGE_RATE

| EXCHANGE_RATE_ID | FROM_CURRENCY_ID | FROM_CURRENCY_CODE | FROM_CURRENCY_NAME | TO_CURRENCY_ID | TO_CURRENCY_CODE | TO_CURRENCY_NAME | RATE |
|------------------|------------------|--------------------|--------------------|----------------|------------------|------------------|------|
| 1 | 1 | USD | US Dollar     | 2 | EUR | Euro          | 0.85 |
| 2 | 1 | USD | US Dollar     | 3 | GBP | British Pound | 0.72 |
| 3 | 1 | USD | US Dollar     | 1 | USD | US Dollar     | 1    |
| 4 | 2 | EUR | Euro          | 1 | USD | US Dollar     | 1.18 |
| 5 | 2 | EUR | Euro          | 3 | GBP | British Pound | 0.84 |
| 6 | 2 | EUR | Euro          | 2 | EUR | Euro          | 1    |
| 7 | 3 | GBP | British Pound | 1 | USD | US Dollar     | 1.39 |
| 8 | 3 | GBP | British Pound | 2 | EUR | Euro          | 1.19 |
| 9 | 3 | GBP | British Pound | 3 | GBP | British Pound | 1    |

Table 47 Contents of the EXCHANGE_RATE table

TRANSACTION

TRANSACTION_ID | AMOUNT | CURRENCY_ID | BASERATE_AMOUNT | BASERATE_CURRENCY_ID | DATE | DESCRIPTION
1 | 100 | 1 | 100 | 1 | 08/01/2023 | Book Purchase
2 | 50 | 2 | 59 | 1 | 08/02/2023 | Clothing Purchase
3 | 200 | 2 | 236 | 1 | 08/03/2023 | Electronics Purchase
4 | 150 | 3 | 212 | 1 | 08/03/2023 | Grocery Shopping

Table 48 Contents of the TRANSACTION table

USER

USER_ID | USER_NAME | PREF_CURRENCY_ID
1 | Alice | 1
2 | Bob | 2

Table 49 Contents of the USER table

USER_CURRENCIES

PREF_CURRENCY_ID

Table 50 Contents of the USER_CURRENCIES table

When we now import these tables in ThoughtSpot, create a worksheet using
these tables and place a RLS rule on the USER table (ts_username =
user_name), we can see this in action.
For example, if we log in as Alice we would get the following results:

Figure 95 Multi-Currency data results when logged in as Alice
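For reference, the conversion the worksheet performs for Alice is roughly equivalent to the query below. This is a minimal sketch against the sample tables above (the bridge table is omitted for brevity, and the CURRENT_USER() join merely stands in for the ThoughtSpot RLS rule ts_username = user_name); it is not the SQL ThoughtSpot itself generates.

SELECT
    u.user_name,
    t.transaction_id,
    t.description,
    t.baserate_amount                    AS amount_in_base_currency,
    er.rate                              AS base_to_preferred_rate,
    t.baserate_amount * er.rate          AS amount_in_preferred_currency
FROM "TRANSACTION" t
JOIN "USER" u
  ON u.user_name = CURRENT_USER()        -- stand-in for the RLS rule ts_username = user_name
JOIN exchange_rate er
  ON er.from_currency_id = t.baserate_currency_id
 AND er.to_currency_id   = u.pref_currency_id;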

Implement bridge table for multi-currency transactions


If you want to support multi-currency transactions, create a
bridge table that connects the transaction table to the
exchange rate table. This obviously requires a few changes to
the transaction table as well to deal with the different
currencies.

15.4 CONSIDERATIONS AND CONCLUSION

Currency conversion within a dimensional data model is a vital aspect for


organizations operating in diverse global markets. The challenges of managing
multiple currencies necessitate effective strategies for accurate reporting and
analysis.
This chapter explored the intricacies of currency conversion, uncovering the
need for standardized exchange rates, the implementation of bridge tables for
multi-currency transactions, and the significance of standardized currency
codes. By utilizing fact-based currency conversion and applying these practices
during the data loading process, organizations can ensure consistent and
accurate currency conversions for their financial analysis.

ThoughtSpot's capabilities further enhance the potential to create
comprehensive multi-currency reports and dashboards that reflect accurate and
uniform monetary values. With the complexity of global business environments,
adopting best practices for currency conversion empowers organizations to
make informed decisions and drive successful operations across various
currencies and markets.

Best Practice 40 - Storing exchange rates


Create a dedicated exchange rate table in the data warehouse to
store historical and current exchange rates. Include attributes such
as currency codes, date, and exchange rate values.

Best Practice 41 - Use bridge tables


Implement bridge tables to manage currency conversions
efficiently. These bridge tables connect the fact table to the
exchange rate table and provide a way to handle multi-currency
transactions.

Best Practice 42 - Standardize currency codes


Standardize currency codes across the dimensional model to
ensure consistency in reporting. Use ISO currency codes to avoid
confusion and make conversions more straightforward.

Best Practice 43 - Use fact-based currency conversion


Perform currency conversion at the fact table level rather than the
dimension table level. This approach ensures consistent
conversions for all relevant measures in the fact table.

Best Practice 44 - Apply currency conversion on load
When loading data into the data warehouse, apply currency
conversion directly during the ETL (Extract, Transform, Load)
process. This approach simplifies queries and ensures consistent
conversions throughout the data.
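To make this practice concrete, the sketch below applies the conversion while loading the transaction table. The staging table stg_transaction and the hard-coded base currency (currency_id 1 = USD) are assumptions for illustration only; in a real pipeline you would also look up the rate for the transaction date.

INSERT INTO "TRANSACTION"
    (transaction_id, amount, currency_id, baserate_amount, baserate_currency_id, date, description)
SELECT
    s.transaction_id,
    s.amount,
    s.currency_id,
    ROUND(s.amount * er.rate, 2)         AS baserate_amount,      -- converted once, stored with the fact
    1                                    AS baserate_currency_id, -- company-wide base currency (USD)
    s.date,
    s.description
FROM stg_transaction s
JOIN exchange_rate er
  ON er.from_currency_id = s.currency_id
 AND er.to_currency_id   = 1;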

16 Evolving data structures: A guide to late-
binding attributes

16.1 INTRODUCTION

The concept of late-binding attributes plays a pivotal role in ensuring


flexibility and adaptability in data warehouses. Late-binding attributes
are those attributes that can change, even on a per-row
basis within a dataset. This chapter explores the
concept of late-binding attributes, their benefits,
challenges, and a detailed example showcasing
their resolution through traditional dimensional
modeling techniques.

16.2 UNDERSTANDING LATE-BINDING


ATTRIBUTES

16.2.1 What are late-binding attributes?


In traditional data modeling, early-binding attributes are defined during the
initial design phase of a data warehouse. These attributes are fixed and
predetermined, providing stability to the data structure. However, as businesses
grow and evolve, their data requirements change, leading to challenges when
accommodating new attributes or modifications to existing ones. Early-binding
attributes may hinder the agility of data warehousing solutions, requiring
significant effort and resources for adjustments.
Late-binding attributes, on the other hand, offer a solution to this issue. By
allowing attributes to be determined at a later stage, typically during data
loading or transformation, organizations can swiftly adapt to changing business
needs. This flexibility is especially important in industries where attributes are
subject to frequent modifications or where historical data must be preserved
accurately.

16.2.2 Late-binding and varying attributes per row
A powerful application of late-binding attributes is seen when dealing with
datasets where attributes can differ per row. This scenario is common in
situations where data is collected from multiple sources, each producing data
with distinct attributes. Consider a retail business that aggregates sales data
from various franchise locations. Each location might have unique attributes,
such as specific promotional codes, local tax rates, or regional product offerings.
Late-binding attributes allow for seamless integration of such diverse data. As
new franchise locations are added, the data management system can effortlessly
accommodate the unique attributes associated with each location without
requiring schema alterations. Queries can then be crafted to aggregate and
analyze this data, regardless of the varying attributes present in each row.

16.2.3 Use cases for late-binding attributes


Late-binding attributes are often used to handle scenarios where the attributes
associated with entities can change over time or vary on a per-instance basis.
Late-Binding Attribute use cases can typically be divided into two categories:
n Custom properties/characteristics. These are use cases where there are different characteristics to be recorded, but not every characteristic applies to every entity. Strictly speaking these are not really late-binding attributes, and they are typically resolved using a bridge table. The attributes might differ per row of data, but there is a finite set of attributes.
n Truly late-binding attributes. In these use cases both the attribute name and the value are late-binding. For example, row 1 can refer to one set of attributes and values, and row 2 can refer to a completely different set of attributes and values. Theoretically the set of attributes is unlimited.

16.2.3.1 Custom properties/characteristics

16.2.3.1.1 Custom properties and tags


Late-binding attributes can be used
to attach custom properties or tags
to entities. For example, in an e-
commerce platform, products could
have user-defined tags that indicate
specific attributes like "On Sale,"
"New Arrival," or "Limited Stock."
This is a very straightforward example, as the attribute name is actually static (Tag); it is just the values which are dynamic.

Figure 96 Sample Ecommerce models with tags

16.2.3.1.2 Healthcare patient attributes


In healthcare, patient attributes like
allergies, genetic markers, and
medical conditions can vary for each
patient and change over time. Late-
binding attributes allow healthcare
systems to capture and manage this
dynamic patient-specific data.
Figure 97 Sample healthcare model with dynamic attributes

The fact table "Patient Encounters" records details of patient visits, including encounter ID, patient ID,
date, procedure code, and revenue generated. The "Patients" dimension table
contains early-binding attributes such as patient first name, last name, date of
birth, gender, and address. The "Patient Attributes" dimension table stores late-
binding attributes for patients, such as genetic markers and allergies.
Additionally, there is a "Procedures" dimension table that holds details about
medical procedures, including procedure codes, procedure names, and
departments.
Late-binding attributes, such as genetic markers and allergies, are stored in the
"Patient Attributes" table. This allows the healthcare organization to
accommodate new attributes without altering the core structure of the "Patients"
dimension table. Queries involving late-binding attributes would dynamically

join the "Patient Attributes" table as needed to retrieve additional patient
information.

16.2.3.1.3 Other examples


n Inventory Management: In a warehouse or inventory management
system, items might have attributes such as location, shelf life, or supplier.
These attributes can change over time or be different for each item.
n Asset Management: Assets in an organization, such as equipment,
vehicles, or IT assets, can have attributes that differ based on the asset type,
usage, or location.
n Event Management and Logging: In event tracking or logging, you might
have events with variable attributes. For instance, an event could have
different custom properties depending on the context, making late-binding
attributes a flexible solution.
n Social Media Posts: In a social media platform, posts can have various
attributes like tags, sentiment scores, or location data, which can differ for
each post.
n Product Configurations: For configurable products, attributes like color,
size, or features can change depending on the chosen configuration.
n Real Estate Listings: Real estate listings can have different attributes based
on the property type, location, or features, making late-binding attributes
useful for capturing this information.

16.2.4 Strengths and challenges

16.2.4.1 Benefits of late-binding attributes


The utilization of late-binding attributes in dimensional modeling offers several
notable benefits:
n Agility: Late-binding attributes enable rapid adaptation to changing business
requirements without compromising existing data structures. This agility is
crucial in dynamic industries like healthcare, where new medical knowledge
and technological advancements continually influence data needs.
n Scalability: Late-binding attributes facilitate scalability by reducing the
complexity of modifying existing data structures. Organizations can focus on

extending their data model to include new attributes rather than
restructuring their entire data warehouse.
n Reduced maintenance: The separation of early-binding and late-binding
attributes minimizes the need for extensive maintenance efforts when
adapting to changes. This efficiency can translate to cost savings and
resource optimization.

16.2.4.2 Considerations and challenges of late-binding attributes


While late-binding attributes offer numerous benefits, they also introduce
considerations and challenges that organizations must address:
n Data integrity and consistency: Late-binding attributes can potentially
lead to data integrity and consistency challenges. When attributes can
change per row, ensuring that the data remains accurate and coherent
becomes crucial. In the healthcare example, maintaining a reliable patient
history while accommodating dynamic attributes requires meticulous data
management processes.
n Query performance: Queries involving late-binding attributes might
experience performance bottlenecks due to the need for dynamic joins or
aggregations. Optimizing query performance becomes a complex task, and
proper indexing and query design are essential to maintain acceptable
response times.
n Schema complexity: Introducing late-binding attributes can lead to
increased schema complexity. Organizations must carefully design their
dimensional model to strike a balance between accommodating evolving
attributes and maintaining a manageable and understandable schema.
n Data transformation overhead: Transforming data to incorporate late-
binding attributes can add overhead to the ETL (Extract, Transform, Load)
process. Extracting, cleansing, and loading data with varying attributes
requires robust ETL pipelines that handle data transformations efficiently.
n Versioning and auditing: Tracking changes to late-binding attributes and
maintaining historical versions can be challenging. Proper versioning and
auditing mechanisms must be implemented to ensure accountability,
traceability, and compliance.
n Data governance and documentation: Managing late-binding attributes
demands robust data governance practices. Organizations need to document

attribute definitions, update policies, and communicate changes effectively
to stakeholders to prevent misunderstandings and inconsistencies.

16.3 MODELING LATE-BINDING ATTRIBUTES

16.3.1 Customer profiles


Late-binding attributes can be used to store additional customer information
that might not be standardized, such as preferences, interests, or hobbies. For
example, a client has a multi-tenant setup where each of their tenants can set
up several different profile characteristics for their own clients.
For example:
Tenant 1 wants to track some sports related questions on their clients, such as:
n What is your favorite football/soccer
club?
n What is your favorite baseball team?
n What is your favorite American
Football team?
n How many sports do you play?
Tenant 2 wants to track questions related
to food and going out, such as:
n What is your favorite food?
n What is your favorite drink?
n How many nights a week do you go out?
n What is your favorite night club?

Figure 98 EAV Type model for capturing customer profiles


One strategy is to model this in an EAV type of model, like shown in Figure 98.
And the data for the two main tables could look like this:

TENANT_ID | PROFILE_ID | CUSTOM_FIELD_ID | CUSTOM_FIELD_ANSWER
1 | 10001 | 1001 | Barcelona
1 | 10001 | 1002 | Cincinnati
1 | 10001 | 1003 | Vikings
1 | 10001 | 1004 | 2
2 | 20001 | 1001 | Burger
2 | 20001 | 1002 | Gin Tonic
2 | 20001 | 1003 | 3
2 | 20001 | 1004 | Bootshaus

Table 51 Sample data for TBL_CUSTOM_FIELD_VALUE

The Custom Field itself would come from another table (via the
CUSTOM_FIELD_ID), for example:

TENANT_ID | CUSTOM_FIELD_ID | CUSTOM_FIELD_NAME | CUSTOM_FIELD_TYPE
1 | 1001 | What is your favourite football/soccer club? | VARCHAR
1 | 1002 | What is your favourite baseball team? | VARCHAR
1 | 1003 | What is your favourite American Football team? | VARCHAR
1 | 1004 | How many sports do you play? | NUMBER
2 | 1001 | What is your favourite food? | VARCHAR
2 | 1002 | What is your favourite drink? | VARCHAR
2 | 1003 | How many nights a week do you go out? | NUMBER
2 | 1004 | What is your favourite night club? | VARCHAR

Table 52 Sample data for TBL_CUSTOM_FIELD
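To show how the two tables work together, the sketch below reads one profile's answers back out of the EAV structure. The table and column names are those from Tables 51 and 52, and the profile id is simply one of the sample rows.

SELECT
    v.tenant_id,
    v.profile_id,
    f.custom_field_name    AS question,
    v.custom_field_answer  AS answer,
    f.custom_field_type
FROM tbl_custom_field_value v
JOIN tbl_custom_field f
  ON f.tenant_id       = v.tenant_id
 AND f.custom_field_id = v.custom_field_id
WHERE v.profile_id = 10001;              -- one profile of tenant 1, for illustration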

16.3.2 Survey responses


Imagine a scenario where you're storing survey responses, and each survey
may have different questions with varying attributes. These attributes might
include question types (multiple choice, text, rating), response values, and
timestamps. The solution to this would be similar as the one described in the
previous section.

16.3.3 Implementation strategies for late-binding
attributes
Below we show some likely implementation strategies for the data model for late-binding attributes. The complete list of strategies includes:

16.3.3.1 Denormalize attributes and values into the table


With this strategy you will just add a pair of attribute and value to the table for
each late-binding field. For example:

TBL_PROFILE

TENANT_ID | INTEGER
PROFILE_ID | INTEGER
CUSTOM_FIELD_1 | VARCHAR
CUSTOM_FIELD_ANSWER_1 | VARCHAR
CUSTOM_FIELD_2 | VARCHAR
CUSTOM_FIELD_ANSWER_2 | VARCHAR
CUSTOM_FIELD_3 | VARCHAR
CUSTOM_FIELD_ANSWER_3 | VARCHAR
CUSTOM_FIELD_4 | VARCHAR
CUSTOM_FIELD_ANSWER_4 | VARCHAR

Table 53 Implementing late-binding attributes using denormalised attributes

16.3.3.2 Custom property mapping


For custom property mapping as discussed in section 16.2.3.1, you will most likely utilize bridge tables to map these properties. Note that in certain cases you might prefer to move these attributes (as they are not really late-binding anymore) into a mini dimension (if they are high cardinality) or a junk dimension (low cardinality) for performance reasons, but this obviously also depends on the use case.

16.3.3.3 EAV type implementations
An example of this we discussed in section 16.3.1.

16.3.3.4 Semi structured solutions (when supported)


Some Cloud Data Platforms, such as Snowflake, support storage of semi-structured data. Snowflake supports this via its VARIANT data type. Let's work
through an example using the same use case as described in section 16.3.1.
For this we just focus on the part of the model where the late-binding attributes
are stored. We have defined a table like this:

PROFILE_QUESTIONS

PROFILE_QUESTION_ID INTEGER

TENANT_ID INTEGER

USER_ID INTEGER

QUESTION_DATA VARIANT

Table 54 Profile questions table using variant

And we populate this table with the sample data from the example:

PROFILE_QUESTION_ID | TENANT_ID | USER_ID | QUESTION_DATA
1 | 1 | 1 | { "What is your favourite football/soccer club?": "Barcelona", "data_type": "VARCHAR", "type": "Sports" }
2 | 1 | 1 | { "What is your favourite baseball team?": "Cincinnati", "data_type": "VARCHAR", "type": "Sports" }
3 | 1 | 1 | { "What is your favourite American Football team?": "Vikings", "data_type": "VARCHAR", "type": "Sports" }
4 | 1 | 1 | { "How many sports do you play?": 2, "data_type": "NUMBER", "type": "Sports" }
5 | 2 | 2 | { "What is your favourite food?": "Burger", "data_type": "VARCHAR", "type": "Food" }
6 | 2 | 2 | { "What is your favourite drink?": "Gin Tonic", "data_type": "VARCHAR", "type": "Food" }
7 | 2 | 2 | { "How many nights a week do you go out?": 3, "data_type": "NUMBER", "type": "Food" }
8 | 2 | 2 | { "What is your favourite night club?": "Bootshaus", "data_type": "VARCHAR", "type": "Food" }
The question_data column stores JSON-like structures with varying attributes


for each question/answer entry.
Using built-in Snowflake JSON features we can now extract and pivot this data
using the following query:
SELECT
pq.TENANT_ID,
MAX(pq.QUESTION_DATA:"What is your favourite football/soccer
club?"::STRING) AS "What is your favourite football/soccer club?",
MAX(pq.QUESTION_DATA:"What is your favourite baseball
team?"::STRING) AS "What is your favourite baseball team?",
MAX(pq.QUESTION_DATA:"What is your favourite American Football
team?"::STRING) AS "What is your favourite American Football
team?",
MAX(pq.QUESTION_DATA:"How many sports do you play?"::NUMBER) AS
"How many sports do you play?",
MAX(pq.QUESTION_DATA:"What is your favourite food?"::STRING) AS
"What is your favourite food?",
MAX(pq.QUESTION_DATA:"What is your favourite drink?"::STRING)
AS "What is your favourite drink?",
MAX(pq.QUESTION_DATA:"How many nights a week do you go
out?"::NUMBER) AS "How many nights a week do you go out?",
MAX(pq.QUESTION_DATA:"What is your favourite night
club?"::STRING) AS "What is your favourite night club?"
FROM PROFILE_QUESTIONS pq
GROUP BY pq.TENANT_ID;
Which will return the following data:

TENANT_ID | What is your favourite football/soccer club? | What is your favourite baseball team? | What is your favourite American Football team? | How many sports do you play? | What is your favourite food? | What is your favourite drink? | How many nights a week do you go out? | What is your favourite night club?
1 | Barcelona | Cincinnati | Vikings | 2 | | | |
2 | | | | | Burger | Gin Tonic | 3 | Bootshaus

Table 55 JSON results unpacked

Here you can see that for tenant 1, we get their selected set of sports related
questions and for tenant 2, their food & going out related questions.
You can convert this SQL statement into a view which then can easily be
imported into ThoughtSpot.
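As a sketch, the wrapper could look like the statement below; the view name is just an example, and the remaining pivoted columns are the same MAX(...) expressions as in the query above.

CREATE OR REPLACE VIEW vw_profile_questions_pivoted AS
SELECT
    pq.tenant_id,
    MAX(pq.question_data:"What is your favourite football/soccer club?"::STRING)
        AS "What is your favourite football/soccer club?"
    -- ...add the remaining MAX(...) expressions from the query above here...
FROM profile_questions pq
GROUP BY pq.tenant_id;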

Option | Pro | Con
Denormalise attributes and values | Fast to implement. Allows actions on custom fields (search, sort). | Fields are generic, not strongly typed. Table is inefficient sizewise, fields might never be used for certain tenants. Number of fields needs to be anticipated.
Semi Structured Solutions | Fast to implement. Flexible, allowing any number and types of fields. | No DB actions possible.
EAV type implementations | Both flexible and efficient. DB actions can be performed, and the data is normalized somewhat to reduce wasted space. | Slight increase in development time and complexity of your queries, but there really aren't too many cons here. Not very scalable when applied across multiple tables.
EAV type implementations (strongly typed) | Strongly typed version of option 3. More operations possible. | No comparisons or sort operations possible. Not very scalable when applied across multiple tables.
Custom property mapping | Easy to implement. Not really late binding, so all analysis capabilities available. | Not really late binding.

Table 56 Pros and cons of the various late-binding attribute modeling techniques

16.4 MODELING LATE-BINDING ATTRIBUTES FOR SEARCH

16.4.1 Design considerations


Creating a successful, user-friendly search experience with late-binding attributes is heavily contingent upon the specific requirements. These requirements
dictate how these attributes will be harnessed by end users. If their purpose is
solely for presentation without any search, filtering, or drill-down functionality,
various choices become available. These options differ based on whether they
should be consolidated within a single row or distributed across multiple rows.
Nevertheless, our focus lies in the realm of comprehensive searching and
analysis. Thus, incorporating filter choices, drill-down capabilities, and similar
functionalities remains pivotal for us.

16.4.2 Data structure


In the preceding section, we explored different methodologies for structuring
late-binding attribute data within a data model, each presenting its own
advantages and drawbacks. However, in our standard ThoughtSpot
implementations, these approaches merely mark the initial phase, as the
resultant search experience from most of these strategies tends to be
unfavorable.
This predicament mainly arises from the fact that most
of the aforementioned options involve storing data
within an Entity-Attribute-Value (EAV) or key-
value design pattern. These patterns, despite
their utility, often yield a suboptimal search
experience due to the challenges in
formulating queries against such data
structures. EAV models also commonly
suffer from performance constraints during
analysis operations, leading many to
regard them as anti-patterns. Although the
utilization of semi-structured data types
offers a step toward improvement, it still
demands additional effort, effectively
reverting the attributes back to early binding
attributes.

In a broader context, the most optimal solution entails fully modeling all fields.
This approach not only delivers a superior search experience but also generally
enhances performance. It is evident from the various scenarios that while extra
work might be necessary to optimize the search experience, the scalability of the
strategy should be a central consideration.

16.4.3 Strong typing


While storing custom fields within a relational structure allows for strong typing,
adapting this arrangement for optimal search functionality presents distinct
challenges. Often, the process of modeling custom fields for search entails
pivoting the data to consolidate all custom fields as columns within a single row.
However, a potential issue arises when disparate types emerge within a single
column across different rows, a situation incompatible with many data
platforms. Referencing Table 52 highlights this concern, where the 3rd question
illustrates varying data types between tenants 1 and 2.
Preserving data types in custom field scenarios during search-oriented modeling
becomes notably intricate due to these factors. Theoretically, a solution could
involve utilizing OAuth and implementing Row-Level Security (RLS) in your
Cloud Data Warehouse (CDW), followed by casting the data to appropriate types
prior to importing it into ThoughtSpot. This approach relies on the notion that,
post RLS application, each tenant would possess only one data row, allowing for
subsequent type casting. However, the scalability of such a method raises
doubts and feasibility concerns.

16.4.4 Combining field and answer in a tag


Yet another hurdle to surmount is the prevalent practice of storing late-binding
attributes and their corresponding values in separate columns (except for the
semi-structured solution). This initial separation poses a notable challenge for
filtering since filters are typically confined to a single column. Consider, for
instance, placing a filter on the answer column – while you can pinpoint an
answer, discerning the associated question remains elusive. In the case of
"Cincinnati," it could signify a baseball team, an American football team, or even
a city-based inquiry.
A potential remedy for this conundrum involves crafting a tag: a field that
amalgamates both the question and the answer, subsequently employed for
filtering purposes. To accommodate lengthy queries, it is advisable to generate

succinct tags. For instance, instead of 'What is your favorite baseball team?' you
might opt for 'MLB.' Upon fusion with the corresponding answer, the outcome
could resemble 'MLB:Cincinnati Reds,' thus creating a unified field suitable for
filtering.
This approach doesn't negate the utility of individual question and answer fields,
which maintain readability within a table. Consequently, by integrating these aspects, a workable balance is achieved. However, it's
crucial to acknowledge that this solution exclusively addresses filtering
concerns.
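A minimal sketch of building such a tag during ETL or in a view is shown below. The short_code column (for example 'MLB' for the baseball question) is an assumed addition to TBL_CUSTOM_FIELD, not part of the earlier sample model.

SELECT
    v.tenant_id,
    v.profile_id,
    f.custom_field_name                            AS question,
    v.custom_field_answer                          AS answer,
    f.short_code || ':' || v.custom_field_answer   AS tag   -- e.g. 'MLB:Cincinnati Reds'
FROM tbl_custom_field_value v
JOIN tbl_custom_field f
  ON f.tenant_id       = v.tenant_id
 AND f.custom_field_id = v.custom_field_id;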

16.4.5 AND condition for profile filters


Furthermore, it is more than likely that you want multiple filters to be combined
via an AND pattern and not IN (OR). For example, you want to drill down to
profiles of users who support a particular baseball team and those who support
a particular American football team, as there might be a correlation, such as
location.
To be able to achieve this AND functionality, you will need to pivot all your
custom fields to one row into separate columns allowing you to apply separate
fields on each of them. Because alternatively if you have your custom fields in
rows, the filter will be applied in OR style. For example:
n User 1 supports baseball team 'Cincinnati Reds' and American Football team
'Cincinnati Bengals'
n User 2 also supports the baseball team 'Cincinnati Reds', but supports the
'Vikings' as their favorite American Football team.
If a search is executed on baseball team 'Cincinnati Reds' and American Football team 'Cincinnati Bengals', it will also return User 2, because the conditions are generated as part of an IN clause (which has the same functionality as OR).
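The sketch below illustrates this pivot with conditional aggregation against the earlier custom-field tables; the two output column names are assumptions chosen for readability. Once each custom field sits in its own column, filters placed on those columns are combined with AND, so the search described above only returns User 1.

SELECT
    v.tenant_id,
    v.profile_id,
    MAX(CASE WHEN f.custom_field_name = 'What is your favourite baseball team?'
             THEN v.custom_field_answer END)       AS favourite_baseball_team,
    MAX(CASE WHEN f.custom_field_name = 'What is your favourite American Football team?'
             THEN v.custom_field_answer END)       AS favourite_american_football_team
FROM tbl_custom_field_value v
JOIN tbl_custom_field f
  ON f.tenant_id       = v.tenant_id
 AND f.custom_field_id = v.custom_field_id
GROUP BY v.tenant_id, v.profile_id;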

16.5 CONCLUSION

Late-binding attributes offer flexibility in data warehousing, allowing for


adaptable adjustments to changing business needs. Explored in this chapter are
their benefits, challenges, and practical applications through a healthcare
example within traditional dimensional modeling.

Early-binding attributes, defined during initial data modeling, can hinder
adaptability. In contrast, late-binding attributes provide agility, scalability, and
reduced maintenance, crucial for dynamic sectors like healthcare.
However, these attributes introduce challenges such as data integrity, query
performance, and schema complexity. Nonetheless, they empower custom
property mapping and dynamic attribute handling.
Implementation strategies like denormalization, EAV types, and semi-structured
solutions have been discussed, each with its pros and cons to consider.
In the context of search-oriented modeling, combining field and answer in a tag
improves filtering without sacrificing readability.
Notably, achieving an AND pattern for filter combination is emphasized. This
requires structuring custom fields into separate columns for precise filtration.
Ultimately, though various strategies are explored, explicitly modeling all
attributes remains the optimal solution. This effectively transforms late-binding
attributes back into early-binding attributes. Despite challenges, this approach
empowers organizations to adapt, analyze, and stay competitive in a rapidly
evolving landscape.

17 Beyond sums: Mastering complex data
metrics

17.1 INTRODUCTION

In the ever-evolving landscape of data analytics, the


journey toward more insightful decision-making demands
a deeper understanding of the intricate world beyond
simple sums. Traditional aggregate measures often fall
short when facing the complexities of modern data analysis.
The objective of this chapter is to navigate through the
realms of non-additive, semi-additive, and derived
measures. These advanced metrics transcend the
boundaries of basic aggregation, enabling us to
uncover more nuanced insights from our data.

17.2 UNDERSTANDING NON-


ADDITIVE, SEMI-ADDITIVE AND DERIVED MEASURES

17.2.1 What are derived measures?


At the heart of Derived Measures lies a fundamental principle: to improve the
speed and efficacy of analytical queries. By pre-calculating ratios, percentages,
and other calculated metrics, Derived Measures eliminate the computational
complexity that often accompanies queries involving multifaceted calculations.
This not only expedites query execution but also streamlines the analytical
process, allowing business users to focus on deriving insights rather than
grappling with intricate mathematical operations. In essence, Derived Measures
offer a gateway to a more efficient, user-friendly, and responsive analytical
environment.

17.2.1.1 Examples of derived measures


The real-world applicability of Derived Measures becomes evident when
considering practical scenarios:

n Profit margin enhancement: In financial analysis, the computation of
profit margin (net profit divided by revenue) is a recurring requirement.
Through Derived Measures, this calculation is performed during the ETL
(Extract, Transform, Load) process and stored directly in the fact table.
Analysts can now swiftly compare profit margins across products, regions, or
time periods without the overhead of repetitive calculations.
n Conversion rate streamlining: In the realm of e-commerce, calculating
conversion rates (purchases divided by website visits) is pivotal. Derived
Measures transform this calculation into a one-time operation, enabling
marketers to assess the success of campaigns and analyze user behavior
without being entangled in constant computations.
n Insights into inventory turnover: Inventory management necessitates
the calculation of inventory turnover (cost of goods sold divided by average
inventory). By integrating this metric as a Derived Measure, analysts can
seamlessly explore trends in product turnover, thereby optimizing inventory
control strategies.
n Customer loyalty exploration: Tracking customer retention rates
(percentage of customers retained over a period) is indispensable for
customer relationship management. Derived Measures expedite the
assessment of customer loyalty, facilitating insights into the effectiveness of
engagement initiatives.

17.2.1.2 Benefits of using derived measures


n Elevated query performance: Derived Measures shift the computational
burden from query execution to the ETL process, resulting in accelerated
query response times and enhanced user satisfaction.
n Empowerment of business users: By eliminating the need for complex
calculations, Derived Measures democratize data access. Business users can
delve into data without being constrained by technical intricacies, leading to
informed decision-making.
n Unwavering accuracy: The utilization of Derived Measures ensures
consistency and accuracy across analyses. Calculations are performed
uniformly during data loading, minimizing the risk of discrepancies.

n Seamless scalability: Derived Measures maintain their performance
benefits as data volumes expand, contributing to the scalability of the data
warehouse infrastructure.

17.2.1.3 Challenges and considerations


While the advantages of Derived Measures are compelling, several factors merit
consideration:
n Storage trade-offs: Storing pre-calculated metrics contributes to increased
fact table size. Striking a balance between performance gains and storage
costs is imperative.
n Data maintenance: The accuracy of Derived Measures hinges on their
timely update following changes in underlying data. Effective data
maintenance protocols are essential to sustain data integrity.
n Calculation complexity: Not all calculations are suitable for pre-
computation. Complex, infrequently used calculations may still be better
performed during query execution.

17.2.2 What are non-additive measures?


Non-additive measures defy straightforward aggregation due to their inherent
nature. Think of ratios like profit margins or percentages of completion.
Summing up these measures across dimensions doesn't provide meaningful
results. Instead, we need alternative approaches that capture their essence
accurately.

17.2.3 What are semi-additive Measures?


Semi-additive measures refer to specific types of numerical data that can be
aggregated across some dimensions but not others. Unlike fully additive
measures that can be summed across all dimensions, semi-additive measures
exhibit different aggregation behaviors based on the context of analysis.
Examples of semi-additive measures include account balances, inventory levels,
and stock prices. These measures cannot be aggregated using standard
summation techniques because their aggregation may lead to inaccurate results
or misinterpretations.
For example, for an account balance you most likely want to be able to perform
the following searches:

n Show the Latest Balance ( Latest Balance by Account)
n Show Year End Balance (Year-end Balance by Account)
n Show Month End Balance (Month-end Balance by Account)
n Show Week End Balance (Week-end Balance by Account)
n Show month-end balance trend (Month-end Balance by Account Monthly)
n Compare month-end balance over the past 3 months (Month-end Balance by Account last month vs 2 months ago vs 3 months ago)

17.2.3.1 Challenges of modeling semi-additive measures


Modeling semi-additive measures presents unique challenges in designing data
warehouses and analytical databases. Some key challenges include:
n Aggregation complexity: Determining which dimensions can and cannot
be aggregated for a given semi-additive measure requires careful analysis of
the business context. Differentiating between additive, semi-additive, and
non-additive measures is critical to ensure accurate analysis.
n Data integrity: Ensuring data integrity is essential when dealing with semi-
additive measures. Aggregating these measures inappropriately can result in
inconsistencies, leading to unreliable analytical insights.
n Query performance: Aggregating semi-additive measures efficiently
requires specialized approaches. Incorrect query design can lead to poor
performance due to the need for complex calculations or extensive
processing.
n Historical data: Semi-additive measures often have a temporal aspect.
Balances or inventory levels may need to be tracked over time, introducing
additional complexities when dealing with historical data.

17.3 MODELING NON-ADDITIVE, SEMI-ADDITIVE AND


DERIVED MEASURES

17.3.1 Derived measures


Consider a retail company that operates across multiple regions. The company's
data warehouse captures sales data, including revenue and costs, and aims to

provide insights into various financial performance metrics. A typical data model
for this scenario might look something like the model below.

Figure 99 Sample retail data model

One of the key derived metrics in this scenario is the profit margin, which is
calculated as (Revenue - Cost) / Revenue. Instead of computing this ratio during
each query, a Derived Measure can be introduced in the fact table.

Figure 100 Adding a derived fact to the fact table

With the Profit Margin derived fact stored directly in the fact table, analytical
queries involving profit margin calculations can now be executed significantly
faster. This approach optimizes query performance and enhances the overall
usability of the data warehouse.
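As an illustration, the derived fact could be populated during the ETL step with a statement along these lines; fact_sales, revenue, cost and profit_margin are assumed names based on the sample retail model, not a prescribed schema.

UPDATE fact_sales
SET profit_margin = CASE
                        WHEN revenue <> 0 THEN (revenue - cost) / revenue
                        ELSE NULL        -- avoid division by zero
                    END;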

17.3.2 Non-additive measures


To handle non-additive measures, we can employ various modeling techniques:

17.3.2.1 Weighted averages
For ratios and percentages, calculate weighted averages instead of simple sums.
Multiply the measure value by a relevant weight, sum these values, and then
divide by the sum of weights.

17.3.2.1.1 When to Use


Weighted averages are suitable when you have measures that are ratios or
percentages, and you want to calculate an aggregated value that accounts for
varying weights or significance.

17.3.2.1.2 Pros
n Suitable for ratios and percentages: Weighted averages are well-suited for
handling non-additive measures like ratios and percentages, providing a
meaningful way to aggregate them.
n Accurate aggregation: Weighted averages provide accurate aggregation by
considering the significance of each value.

17.3.2.1.3 Cons
n Limited applicability: This technique is specifically designed for calculating
weighted averages of non-additive measures and may not be applicable to
all types of measures.

17.3.2.2 Fact table augmentation


Expand the fact table to include additional fields that help differentiate non-
additive values. For instance, for percentage completion, include a "Total Tasks"
field. While this increases data redundancy, it allows meaningful aggregation.

17.3.2.2.1 When to Use


Fact table augmentation is useful when you want to add additional fields to the
fact table to differentiate non-additive values, such as including a "Total Tasks"
field for percentage completion.

17.3.2.2.2 Pros
n Customized aggregation: Augmentation allows you to store and aggregate
non-additive values with additional context, providing more meaningful
results.

n Improved query performance: Precomputed values can enhance query
performance for non-additive measures that require complex calculations.

17.3.2.2.3 Cons
n Data redundancy: Augmenting the fact table can lead to increased data
redundancy, as additional fields need to be stored alongside the raw data.
n Increased storage: Storing additional fields in the fact table can result in
increased storage requirements.

17.3.2.3 Separate fact tables


Create separate fact tables for additive and non-additive measures. This keeps
aggregation straightforward for additive measures while enabling specialized
treatment for non-additive ones.

17.3.2.3.1 When to use


Separate fact tables are suitable when you have a mix of additive and non-
additive measures and want to provide specialized treatment for non-additive
ones.

17.3.2.3.2 Pros
n Specialized treatment: Separate fact tables allow you to handle non-additive
measures differently, optimizing performance and maintaining clarity.
n Improved query performance: Precomputed values in separate fact tables
enhance query performance for non-additive measures.

17.3.2.3.3 Cons
n Additional complexity: Managing multiple fact tables can introduce some level of complexity to the data model.
n Maintenance overhead: Handling separate fact tables requires careful design and maintenance to ensure data consistency and accuracy.

17.3.2.4 Detailed example: Calculating weighted average defect rate


Let's dive into a practical example involving a manufacturing company with
multiple product types and their respective defect rates. Defect rate, a non-
additive measure, requires careful handling.

17.3.2.4.1 The data model
We'll be using a Product Dimension and a Date Dimension. The Product
Dimension represents different product types, and the Date Dimension includes
quarters and years. We've also stored the weight index in the product dimension
for simplicity. In real-world cases, you might calculate this from your fact table.
Additionally, we need a fact table to capture product defect rates and relevant
information.

Figure 101 Data model for recording defect rates

We populate this model with the following test data:

DIM_PRODUCT

PRODUCT_ID PRODUCT_NAME PRODUCT_CATEGORY WEIGHT_INDEX

1 Product A Category X 0.4000

2 Product B Category Y 0.3000

3 Product C Category Z 0.3000

Table 57 Test data for DIM_PRODUCT

DIM_DATE

DATE_ID YEAR QUARTER

101 2023 Q1

102 2023 Q2

103 2023 Q3

Table 58 Test data for DIM_DATE

FACT_DEFECT_RATE

FACT_ID PRODUCT_ID DATE_ID DEFECT_RATE


1 1 101 0.0200

2 2 101 0.0500

3 3 101 0.0300

4 1 102 0.0300

5 2 102 0.0600

6 3 102 0.0400

7 1 103 0.0200

8 2 103 0.0400

9 3 103 0.0300

Table 59 Test data for FACT_DEFECT_RATE

17.3.2.4.2 Weighted Average Calculation


To calculate the weighted average defect rate:
1. Assign weights: Assign weights to each percentage value. These weights
represent the importance or significance of each percentage in the overall
average. The sum of the weights should be equal to 1 (or 100% if using
percentages as weights).
2. Calculate weighted values: Multiply each percentage value by its corresponding weight to get the weighted value for each percentage. The formula for that is: weighted value = weight * percentage.
3. Sum weighted values: Sum up all the weighted values calculated in the
previous step.
4. Calculate weighted average: Divide the sum of the weighted values by
the sum of the weights.

Weighted Average Calculation
The formula for weighted average calculation is:
Weighted Average = (Σ(weight * percentage)) / Σ(weight)
Where:
n Σ denotes the sum of all items.
n "weight" represents the weight assigned
to each percentage.
n "percentage" is the actual percentage
value.
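Expressed directly in SQL against the sample tables above, the quarterly weighted average could be computed roughly as follows; the worksheet formula shown in the next section is the ThoughtSpot-side equivalent.

SELECT
    d.year,
    d.quarter,
    SUM(f.defect_rate * p.weight_index) / SUM(p.weight_index) AS weighted_avg_defect_rate
FROM fact_defect_rate f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_date    d ON d.date_id    = f.date_id
GROUP BY d.year, d.quarter
ORDER BY d.year, d.quarter;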

17.3.2.5 Implementation in ThoughtSpot


Once we've brought those three tables into ThoughtSpot and set them up, we
can proceed to create a worksheet based on them. It's important to note that
since the defect_rate itself qualifies as a Non-Additive measure, we need to set
it up as an attribute and designate it as non-additive.

Table 60 DEFECT_RATE is non-additive

Within the worksheet, we'll need to formulate a calculation for determining the
weighted average. The formula mirrors the one
mentioned earlier, with a slight adjustment.
Here, we multiply the denominator by the
number of quarters, our lowest date unit. This
adjustment ensures the formula remains
effective when rolling up to a year.
(sum(defect_rate * weight_index) / (sum(weight_index) * count(quarter)))

Upon executing a search for the average defect rate per quarter, the outcome
is displayed, as depicted below. Notably, even upon aggregation, the calculation
maintains its accuracy.

Table 61 Searching for the defect rate per quarter
Table 62 Searching for the defect rate per year

Always remember to customize the weights according to your specific


circumstances and verify that they sum up to 1 or 100%, depending on your
preference.

17.3.2.6 Implementing this example with the other techniques


The remaining techniques are essentially variations of the method described
above.

17.3.2.6.1 Augmented fact tables


The fundamental concept here involves precomputing certain intermediary
outcomes and incorporating them into the fact table. This is achieved by

introducing two additional fields: one for the sum of the average weight rate
and another for the cumulative sum of the weight index.

FACT_ID | PRODUCT_ID | DATE_ID | DEFECT_RATE | WEIGHTED_DEFECT_RATE | SUM_WEIGHT_INDEX
1 | 1 | 101 | 0.0200 | 0.0080 | 0.4000
2 | 2 | 101 | 0.0500 | 0.0150 | 0.3000
3 | 3 | 101 | 0.0300 | 0.0090 | 0.3000
4 | 1 | 102 | 0.0300 | 0.0120 | 0.4000
5 | 2 | 102 | 0.0600 | 0.0180 | 0.3000
6 | 3 | 102 | 0.0400 | 0.0120 | 0.3000
7 | 1 | 103 | 0.0200 | 0.0080 | 0.4000
8 | 2 | 103 | 0.0400 | 0.0120 | 0.3000
9 | 3 | 103 | 0.0300 | 0.0090 | 0.3000

Table 63 Augmented Fact Table

Now, you can easily compute the weighted average by summing up the
WEIGHTED_DEFECT_RATE and then dividing it by the total
SUM_WEIGHT_INDEX.
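A sketch of that calculation against the augmented table (column names as in Table 63) could look like this:

SELECT
    d.year,
    d.quarter,
    SUM(f.weighted_defect_rate) / SUM(f.sum_weight_index) AS weighted_avg_defect_rate
FROM fact_defect_rate f
JOIN dim_date d ON d.date_id = f.date_id
GROUP BY d.year, d.quarter;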

17.3.2.6.2 Separate fact tables


In this approach, you would maintain two distinct fact tables: one dedicated to
'typical' facts and another exclusively for non-additive measures. Within this
second table, you can then proceed to apply the same methodologies as outlined
earlier.

17.3.3 Semi-additive measures


We present two distinct methodologies for implementing semi-additive
measures tailored to your specific use case. The first method leverages the
capabilities of ThoughtSpot through the utilization of aggregation functions,
while the second offers a more sophisticated and refined approach by
embedding the functionality directly into the data model.

17.3.3.1 Approach 1: Leveraging ThoughtSpot aggregation functions
In this approach, we create a dynamic moving measure, formulated as a
formula, to determine the maximum date within the context of the search.
Whether it's analyzed on a yearly, quarterly, monthly, weekly, or daily basis,
this formula plays a pivotal role in the process. The formula is defined as follows:
Business effective date = group_max(snapshot_date, snapshot_date)
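For comparison, the same "latest snapshot" logic can be sketched in plain Snowflake-style SQL. FACT_BALANCE and its columns are assumed names for a balance snapshot fact, not part of a model defined earlier.

-- Latest balance per account, using a window function and QUALIFY.
SELECT account_id, snapshot_date, balance
FROM fact_balance
QUALIFY snapshot_date = MAX(snapshot_date) OVER (PARTITION BY account_id);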

The execution of this method generates a


query plan, as depicted in Figure 102. While
this solution exhibits functional feasibility, it
might encounter scalability issues with larger
datasets. To address this limitation and
provide a more optimal solution, the subsequent section details an alternative approach.

Figure 102 Query plan for group aggregate

17.3.3.2 Approach 2: Enriching the date


dimension
In this alternative approach, we will integrate
the required functionality directly into the
data model. Our focus is on enhancing the date dimension by incorporating Boolean indicators for each day that depict the semi-additive calendar attributes.

Date | is_year_end | is_year_start | is_quarter_end | is_quarter_start | is_month_end | is_month_start | is_week_end | is_week_start | is_latest
1-jan-2023 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0
2-jan-2023 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
31-mar-2023 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0
1-apr-2023 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
31-dec-2023 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1

Figure 103 Extended date dimension

Within this table, each entry corresponds to a specific date, and the associated
Boolean indicators are assigned values of 1 or 0 based on the predefined criteria.
It's important to note that the values within the "is_latest" column would
typically be dynamically determined based on the actual data within your
system. As the latest date may evolve over time, this column requires periodic
updates to remain accurate.
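A sketch of that periodic refresh is shown below; dim_date and fact_balance are assumed names, and the latest date should be derived from whichever fact table drives your model.

UPDATE dim_date
SET is_latest = CASE
                    WHEN "DATE" = (SELECT MAX(snapshot_date) FROM fact_balance) THEN 1
                    ELSE 0
                END;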
Subsequently, we can import this extended date dimension into ThoughtSpot
and configure its usage. In the worksheet employing this table, we will introduce
several formulas to further enhance the search experience:

Name Definition

Latest if (is_latest = 1 ) then ‘latest’ else null

Year Start if (is_year_start = 1) then ‘year start’ else null

Year End if (is_year_end = 1) or (is_latest = 1) then ‘year end’ else null

Quarter Start if (is_quarter_start = 1) then ‘quarter start’ else null

Quarter End if (is_quarter_end = 1) or (is_latest = 1) then ‘quarter end’ else null

Month Start if (is_month_start = 1) then ‘month start’ else null

Month End if (is_month_end = 1) or (is_latest = 1) then ‘month end’ else null

Week Start if (is_week_start = 1) then ‘week start’ else null

Week End if (is_week_end = 1) or (is_latest = 1) then ‘week end’ else null

Figure 104 Formulas to add to worksheet

Please be aware that the inclusion of the condition (is_latest = 1) within the
year end, quarter end, month end, and week end formulas is an optional step.
By incorporating this condition, you will extend the consideration to include the
last recorded date within your dataset as the respective period's end.
With these preparations in place, we can seamlessly utilize user-friendly terms
during our search queries, such as 'month end':

Figure 105 Searching for our balance at month end

17.4 CONSIDERATIONS AND CONCLUSION

In wrapping up our exploration of complex data metrics, it's clear that non-
additive, semi-additive, and derived measures offer valuable tools for gaining
deeper insights from data. While these advanced metrics provide the potential
for more nuanced analysis, they also come with practical considerations.
When working with derived measures, it's important to balance the benefits of
pre-calculated metrics with the increased storage they require. Non-additive and
semi-additive measures demand attention to data integrity and query
performance. Techniques like weighted averages, fact table augmentation, and
separate fact tables provide diverse solutions, each with its own set of
advantages and challenges.
In the realm of semi-additive measures, choosing between ThoughtSpot
aggregation functions and extending the data model involves careful planning
to ensure scalability and accuracy.

List of Figures
FIGURE 1 A SAMPLE STAR SCHEMA MODEL ................................................................................................................ 21
FIGURE 2 SAMPLE SNOWFLAKE SCHEMA ...................................................................................................................... 22
FIGURE 3 SAMPLE GALAXY SCHEMA .............................................................................................................................. 23
FIGURE 4 SAMPLE FACT CONSTELLATION ..................................................................................................................... 24
FIGURE 5 FLAT TABLE EXAMPLE - NORMALIZED APPROACH ....................................................................................... 26
FIGURE 6 FLAT TABLE EXAMPLE - DENORMALISED...................................................................................................... 26
FIGURE 7 SAMPLE OLAP CUBE .................................................................................................................................... 28
FIGURE 8 SAMPLE OLTP/TRANSACTIONAL MODEL .................................................................................................... 30
FIGURE 9 SAMPLE DATA VAULT MODEL ...................................................................................................................... 33
FIGURE 10 FACT TABLE TYPES ...................................................................................................................................... 53
FIGURE 11 DIMENSION TYPES ...................................................................................................................................... 73
FIGURE 12 SAMPLE CHASM TRAP MODEL ................................................................................................................... 86
FIGURE 13 STANDARD CHASM TRAP/BRIDGE TABLE ................................................................................................... 87
FIGURE 14 NESTED CHASM TRAP ................................................................................................................................. 87
FIGURE 15 CHAINED CHASM TRAP ............................................................................................................................... 87
FIGURE 16 SAMPLE FAN TRAP ....................................................................................................................................... 87
FIGURE 17 CONTENTS OF THE PRODUCTION TABLE .................................................................................................... 88
FIGURE 18 CONTENTS OF THE SALES TABLE ............................................................................................................... 88
FIGURE 19 CONTENTS OF THE DATE DIMENSION ........................................................................................................ 88
FIGURE 20 CONTENTS OF THE PRODUCT DIMENSION ................................................................................................. 88
FIGURE 21 CONTENTS OF THE LOCATION DIMENSION ................................................................................................ 89
FIGURE 22 EXECUTING THE WRONG QUERY CAUSING OVERCOUNTING..................................................................... 90
FIGURE 23 CORRECT RESULTS FROM THE CHASM TRAP ............................................................................................. 91
FIGURE 24 SAMPLE DATA FOR THE CUSTOMER TABLE ................................................................................................ 92
FIGURE 25 SAMPLE DATA FOR THE ORDER TABLE ....................................................................................................... 92
FIGURE 26 SAMPLE DATA FOR THE ORDER DETAIL TABLE .......................................................................................... 92
FIGURE 27 RUNNING A SINGLE PASS QUERY AGAINST A FAN TRAP ........................................................................... 93
FIGURE 28 CORRECT RESULTS OF THE FAN TRAP ........................................................................................................ 95
FIGURE 28 ROLE PLAYING DIMENSIONS ..................................................................................................................... 99
FIGURE 29 MULTIPLE JOIN PATHS ............................................................................................................................. 100
FIGURE 31 ELIMINATING A JOIN PATH ....................................................................................................................... 100
FIGURE 32 SPLIT UP THE BRANCH TABLE................................................................................................................... 100
FIGURE 32 CHOOSING A JOIN PATH ........................................................................................................................... 104
FIGURE 33 INVENTORY AGGREGATION LEVELS ......................................................................................................... 106
FIGURE 35 HIGH LEVEL MODEL USING THE RANGE JOINS ........................................................................................ 107
FIGURE 36 THE WEEK DIMENSION ............................................................................................................................. 107
FIGURE 37 DEFINING THE RANGE JOIN IN TML ........................................................................................................ 107
FIGURE 38 THE JOINS DEFINED IN THOUGHTSPOT .................................................................................................. 107
FIGURE 39 HOW THE RANGE JOIN LOOKS IN TS UI ................................................................................................. 107
FIGURE 40 MODIFYING THE JOIN CONDITION IN TML TO CREATE THE BETWEEN .............................................. 107
FIGURE 41 WHICH PRODUCTS ARE RUNNING LOW ON SUPPLY? .............................................................................. 108
FIGURE 42 QUERY VISUALIZER .................................................................................................................................. 108
FIGURE 43 GENERATED SQL ..................................................................................................................................... 108
FIGURE 44 HOW DO MY SALES COMPARE TO STORE WITHIN A 5 MILE RADIUS? .................................................... 110
FIGURE 45 QUERY PLAN FOR HOW DO MY SALES COMPARE TO STORE WITHIN A 5 MILE RADIUS......................... 110
FIGURE 46 LOCATE NEARBY INVENTORY TO AVOID STOCK-OUT CONDITION ......................................................... 112
FIGURE 46 OUTER JOIN TO PRODUCTS ...................................................................................................................... 115
FIGURE 47 INCREASED COMPLEXITY WITH MULTIPLE OUTER JOINS ........................................................................ 115
FIGURE 49 USE A FACTLESS FACT TO AVOID OUTER JOINS...................................................................................... 119
FIGURE 49 SAMPLE HIERARCHIES .............................................................................................................................. 121
FIGURE 50 SAMPLE PRODUCT HIERARCHY ................................................................................................................. 122
FIGURE 51 SAMPLE GEOGRAPHICAL LOCATIONS HIERARCHY (RAGGED) ................................................................ 123
FIGURE 53 SAMPLE ORGANIZATION CHART (UNBALANCED HIERARCHY) ................................................................ 124
FIGURE 54 SAMPLE PRODUCT DIMENSION WITH HIERARCHY ................................................................................... 127
FIGURE 55 IMPLEMENTATION OF A FICTIONAL ORGANIZATION CHART.................................................................... 132
FIGURE 55 CONTENTS OF THE BRIDGE TABLE ........................................................................................................... 134
FIGURE 57 A WORKSHEET EXAMPLE FOR THE ORGANIZATIONAL HIERARCHY ......................................................... 134
FIGURE 57 CONTENTS OF THE FACT TABLE ............................................................................................................... 135
FIGURE 59 SEARCHING FOR INDIVIDUAL SALES FIGURES PER EMPLOYEE ............................................................... 135
FIGURE 60 UTILIZING THE HIERARCHY ...................................................................................................................... 136
FIGURE 61 UTILIZING THE VIEW AND HIERARCHY IN THE PIVOT ............................................................................. 138
FIGURE 61 HIERARCHICAL REPORTING: WHO REPORTS TO WHOM? ....................................................................... 139
FIGURE 63 A RAGGED HIERARCHY FOR A CHART OF ACCOUNTS .............................................................................. 140
FIGURE 64 NAVIGATING PARTICULAR NODES IN THE HIERARCHY ........................................................................... 144
FIGURE 65 ADDING THE CHILD NODE TO THE SEARCH ............................................................................................. 144
FIGURE 66 ANALYZING CASH SPECIFICALLY .............................................................................................................. 144
FIGURE 67 INCLUDING PARENT INFORMATION .......................................................................................................... 144
FIGURE 67 LIMITATION WHEN SELECTING MULTIPLE NODES ................................................................................... 145
FIGURE 69 OUR ORG CHART REVISITED WITH PATH ATTRIBUTES ........................................................................... 146
FIGURE 70 ADDING THE PATH STRINGS TO THE TABLES .......................................................................................... 146
FIGURE 70 USING GROUP_AGGREGATE AND OTHER FUNCTIONS ............................................................................ 147
FIGURE 71 OUR FICTIONAL ORG CHART WITH INDEX NUMBERS .............................................................................. 149
FIGURE 73 A DATA MODEL WITH MIXED GRAIN FACTS ............................................................................................. 169
FIGURE 74 SEARCHING THE MIXED GRAIN MODEL .................................................................................................... 169
FIGURE 75 RESULTS FROM THE SEARCH ON THE MIXED GRAIN MODEL .................................................................. 170
FIGURE 75 THE QUERY PLAN FOR THIS SEARCH ........................................................................................................ 170
FIGURE 77 CORRECTLY MODELING MIXED GRAIN ..................................................................................................... 171
FIGURE 77 MULTI-GRAIN DIMENSIONAL MODEL ....................................................................................................... 172
FIGURE 79 SAMPLE DATA FOR THE PRODUCT_DIMENSION TABLE .................................................................. 172
FIGURE 80 SAMPLE DATA FOR THE DATE_DIMENSION TABLE ........................................................................... 173
FIGURE 81 SAMPLE DATA FOR THE SALES_FACT TABLE ....................................................................................... 173
FIGURE 82 SAMPLE DATA FOR THE TARGET_FACT TABLE .................................................................................... 173
FIGURE 82 RUNNING A SEARCH QUERY TO GET SALES VS TARGET ......................................................................... 173
FIGURE 83 RESOLVING THE MULTI-GRAIN ISSUE BY SPLITTING DIMENSIONS ....................................................... 174
FIGURE 85 RERUNNING OUR ACTUALS VS TARGET SEARCH ..................................................................................... 175
FIGURE 86 KEY/VALUE PAIR SAMPLE DATA................................................................................................................ 177
FIGURE 87 SEARCHING KEY/VALUE PAIRS ................................................................................................................. 177
FIGURE 88 UTILIZING THE DATE DIMENSION FOR LOOK BACK PERIODS................................................................. 185
FIGURE 89 WORKSHEET MODEL (SHARE MEASURES) .............................................................................................. 187
FIGURE 90 WHAT IS THE SHARE OF MULO THIS YEAR VERSUS LAST YEAR DURING THE SAME PERIOD?............ 188
FIGURE 91 QUERY PLAN FOR WHAT IS THE SHARE OF MULO THIS YEAR VERSUS LAST YEAR DURING THE SAME PERIOD? ............................................ 188
FIGURE 92 HOW HAS THE SHARE CHANGED OVER TIME?......................................................................................... 189
FIGURE 93 HOW DOES THE SHARE COMPARE THIS YEAR VERSUS LAST YEAR TO ANOTHER RETAILER DURING THE SAME PERIOD? ..................................... 189
FIGURE 94 A DATA MODEL FOR A MULTI-CURRENCY USE CASE ............................................................................... 193
FIGURE 95 MULTI-CURRENCY DATA RESULTS WHEN LOGGED IN AS ALICE ............................................................ 195
FIGURE 96 SAMPLE ECOMMERCE MODELS WITH TAGS ............................................................................................. 200
FIGURE 97 SAMPLE HEALTHCARE MODEL WITH DYNAMIC ATTRIBUTES ................................................................... 200
FIGURE 98 EAV TYPE MODEL FOR CAPTURING CUSTOMER PROFILES ..................................................................... 203
FIGURE 99 SAMPLE RETAIL DATA MODEL ................................................................................................................... 217
FIGURE 100 ADDING A DERIVED FACT TO THE FACT TABLE ..................................................................................... 217
FIGURE 101 DATA MODEL FOR RECORDING DEFECT RATES ..................................................................................... 220
FIGURE 102 QUERY PLAN FOR GROUP AGGREGATE .................................................................................................. 225
FIGURE 103 EXTENDED DATE DIMENSION ................................................................................................................ 226
FIGURE 104 FORMULAS TO ADD TO WORKSHEET ..................................................................................................... 226
FIGURE 105 SEARCHING FOR OUR BALANCE AT MONTH END................................................................................... 227
List of Tables
TABLE 1 EXAMPLE TRANSACTIONAL FACT TABLE ........................................................................................................ 40
TABLE 2 EXAMPLE SNAPSHOT FACT TABLE ................................................................................................................... 41
TABLE 3 EXAMPLE ACCUMULATING SNAPSHOT FACT TABLE ........................................................................................ 42
TABLE 4 EXAMPLE PERIODIC SNAPSHOT FACT TABLE .................................................................................................. 43
TABLE 5 EXAMPLE PARTITIONED FACT TABLE .............................................................................................................. 44
TABLE 6 KEY DIFFERENCES BETWEEN THESE FACT TABLE TYPES ............................................................................... 45
TABLE 7 EXAMPLE CUMULATIVE FACT TABLE ................................................................................................................ 46
TABLE 8 EXAMPLE AGGREGATED FACT TABLE .............................................................................................................. 47
TABLE 9 EXAMPLE DERIVED FACT TABLE ...................................................................................................................... 48
TABLE 10 KEY DIFFERENCES BETWEEN THESE FACT TABLE TYPES ............................................................................. 49
TABLE 11 EXAMPLE FACTLESS FACT TABLE .................................................................................................................. 49
TABLE 12 EXAMPLE BRIDGE FACT TABLE...................................................................................................................... 51
TABLE 13 KEY DIFFERENCES BETWEEN FACTLESS FACT TABLES AND BRIDGE FACT TABLES.................................... 51
TABLE 14 EXAMPLE MULTI-VALUE FACT TABLE ............................................................................................................ 53
TABLE 15 SUMMARY OF FACT TABLE TYPES ................................................................................................................. 55
TABLE 16 EXAMPLE ROLE PLAYING DIMENSION ........................................................................................................... 56
TABLE 17 EXAMPLE FIXED-DEPTH HIERARCHY ............................................................................................................ 57
TABLE 18 KEY DIFFERENCES BETWEEN ROLE-PLAYING DIMENSIONS AND FIXED DEPTH HIERARCHIES .................. 58
TABLE 19 EXAMPLE CONFORMED DIMENSION.............................................................................................................. 59
TABLE 20 EXAMPLE UNIVERSAL DIMENSION ................................................................................................................ 60
TABLE 21 KEY DIFFERENCES BETWEEN CONFORMED DIMENSIONS AND UNIVERSAL DIMENSIONS ......................... 61
TABLE 22 EXAMPLE DEGENERATE DIMENSION ............................................................................................................. 62
TABLE 23 EXAMPLE MINI-DIMENSION .......................................................................................................................... 64
TABLE 24 EXAMPLE JUNK DIMENSION .......................................................................................................................... 65
TABLE 25 EXAMPLE SHRUNKEN DIMENSION ................................................................................................................ 66
TABLE 26 EXAMPLE LATE-BINDING DIMENSION .......................................................................................................... 67
TABLE 27 SAMPLE COMPOSITE DIMENSION ................................................................................................................. 68
TABLE 28 KEY DIFFERENCES BETWEEN THESE DIMENSION TYPES ............................................................................. 69
TABLE 29 EXAMPLE SNAPSHOT DIMENSION ................................................................................................................. 70
TABLE 30 EXAMPLE CUSTOM DIMENSION .................................................................................................................... 71
TABLE 31 EXAMPLE DERIVED DIMENSION .................................................................................................................... 72
TABLE 32 DIFFERENCES BETWEEN SNAPSHOT, CUSTOM AND DERIVED DIMENSIONS .............................................. 73
TABLE 33 THE VARIOUS DIMENSION TYPES ................................................................................................................. 75
TABLE 34 ENROLLMENTS TABLE ................................................................................................................................. 101
TABLE 35 ENROLLMENTS FACT TABLE (WITH DENORMALIZED ATTRIBUTES) .......................................................... 102
TABLE 36 AGGREGATED TABLE - COURSE ENROLLMENT COUNT .............................................................................. 103
TABLE 37 CREATING DEFAULT MEMBERS FOR DIMENSIONS ..................................................................................... 118
TABLE 38 A BALANCED PRODUCT HIERARCHY ........................................................................................................... 125
TABLE 39 A GEOGRAPHICAL LOCATIONS HIERARCHY ............................................................................................... 130
TABLE 40 EMPLOYEE DIMENSION FOR SCD DESCRIPTIONS .................................................................................... 160
TABLE 41 UPDATING JOHN'S DEPARTMENT (SCD TYPE 1) ..................................................................................... 160
TABLE 42 UPDATING JOHN'S DEPARTMENT (SCD TYPE 2) ..................................................................................... 161
TABLE 43 UPDATING JOHN'S DEPARTMENT (SCD TYPE 3) ..................................................................................... 161
TABLE 44 UPDATING JOHN'S DEPARTMENT - EMPLOYEE TABLE (SCD TYPE 4) ..................................................... 162
TABLE 45 UPDATING JOHN'S DEPARTMENT - EMPLOYEE HISTORY TABLE (SCD TYPE 4) ..................................... 162
TABLE 46 USING SCD-4 AS A MINI-DIMENSION ..................................................................................................... 162
TABLE 47 CONTENTS OF THE EXCHANGE_RATE TABLE ...................................................................................... 194
TABLE 48 CONTENTS OF THE TRANSACTION TABLE ............................................................................................ 194
TABLE 49 CONTENTS OF THE USER TABLE ............................................................................................................... 194
TABLE 50 CONTENTS OF THE USER_CURRENCIES TABLE .................................................................................. 194
TABLE 51 SAMPLE DATA FOR TBL_CUSTOM_FIELD_VALUE ............................................................................ 204
TABLE 52 SAMPLE DATA FOR TBL_CUSTOM_FIELD ............................................................................................ 204
TABLE 53 IMPLEMENTING LATE-BINDING ATTRIBUTES USING DENORMALISED ATTRIBUTES ................................. 205
TABLE 54 PROFILE QUESTIONS TABLE USING VARIANT ............................................................................................ 206
TABLE 55 JSON RESULTS UNPACKED ........................................................................................................................ 208
TABLE 56 PROS AND CONS OF THE VARIOUS LATE-BINDING ATTRIBUTE MODELING TECHNIQUES ....................... 208
TABLE 57 TEST DATA FOR DIM_PRODUCT ............................................................................................................ 220
TABLE 58 TEST DATA FOR DIM_DATE .................................................................................................................... 220
TABLE 59 TEST DATA FOR FACT_DEFECT_RATE................................................................................................. 221
TABLE 60 DEFECT_RATE IS NON-ADDITIVE .......................................................................................................... 222
TABLE 61 SEARCHING FOR THE DEFECT RATE PER QUARTER ................................................................................... 223
TABLE 62 SEARCHING FOR THE DEFECT RATE PER YEAR .......................................................................................... 223
TABLE 63 AUGMENTED FACT TABLE........................................................................................................................... 224