Spot-On Data Modeling
First Edition
ThoughtSpot Data Security, 1st Edition | Spot-On Data Modelling | A ThoughtSpot Field Guide | Page 2
Table of Contents
1 BEST PRACTICES LISTED IN THIS FIELD GUIDE 10
4.5 DataVault 32
4.5.1 Definition 32
4.5.2 Example 33
4.5.3 Advantages 33
5.1 Introduction 38
7.1 Introduction 85
7.3.1 Chasm traps 87
7.3.2 Fan Traps 92
8.1 Introduction 97
10.2 Understanding Hierarchies 120
10.2.1 What are hierarchies in dimensional modeling? 120
10.2.2 Types of hierarchies 121
12.3 Modeling for mixed grain 170
12.3.1 Option 1: Augmenting the fact table 170
12.3.2 Another option: splitting dimension 171
14.4 Using the date dimension for Look Back Measures 184
14.4.1 Introduction 184
14.4.2 Requirement definition 184
14.4.3 Solution design 185
14.4.4 Example: How have sales changed this year versus last year during the same period? 186
14.4.5 Conclusion 189
15.1 Introduction 191
17.3.1 Derived measures 216
17.3.2 Non-additive measures 217
17.3.3 Semi-additive measures 224
1 Best Practices listed in this Field Guide
Apart from providing extensive information on data modeling scenarios and
design patterns, this field guide also provides best practices and top tips for
setting up your data model.
The complete list of best practices in this field guide can be found below.
Best Practice 1 - Choose the right dimensional modeling technique 36
Best Practice 2 - When to use a transactional fact table? 40
Best Practice 23 - When to use late-binding dimensions 67
Best Practice 24 - When to use composite dimensions 68
Best Practice 25 - When to use snapshot dimensions 70
Best Practice 26 - When to use custom dimensions 71
Best Practice 27 - When to use derived dimensions 72
Best Practice 28 - ThoughtSpot handles chasm traps automatically! 96
Best Practice 29 - ThoughtSpot handles fan traps automatically! 96
Best Practice 30 - Understand join cardinality 98
Best Practice 31 - Importance of join direction 99
Best Practice 32 - Consider transforming 1-to-1 relationships 101
2 A Word from the Author
Hello there, fellow data wranglers and aspiring modelers of the digital universe!
Before you dive headfirst into the world of data
modeling, I wanted to take a moment to share a few
words of wisdom, encouragement, and maybe even a
sprinkle of humor. You see, data modeling is a bit like
juggling chainsaws while riding a unicycle on a tightrope
suspended over a pit of hungry alligators – exciting,
challenging, and definitely not for the faint of heart.
Now, I know what you might be thinking. "How hard can
it be to organize data?" Well, let me tell you, it's a bit
like trying to herd cats through a maze designed by M.C.
Escher, blindfolded, on a moonless night. In other
words, it's not a walk in the park. And here's the kicker:
a lot of people think they understand data modeling,
believing it's a piece of cake. But trust me, that's where
it often goes wrong.
Data modeling isn't just about slapping together a few
tables and calling it a day. It's a nuanced dance of
relationships, attributes, and constraints that requires the finesse of a concert
pianist. But fear not, for within the pages of this field guide, you'll find the
roadmap to navigate this labyrinthine world of data modeling with wit and
wisdom.
I'll delve into both straightforward and more complex design patterns, all
relevant to data modeling, which is like adding layers of flavor to a lasagna –
some patterns are classic and comforting, while others are bold and
adventurous.
But do not forget: data modeling isn't just hard; it's also incredibly rewarding.
It's like solving a puzzle that unlocks the secrets of your organization's data,
revealing hidden insights and transforming chaos into clarity. Plus, you get to
wield the power of knowledge and help your business make better decisions.
In this field guide, my colleagues and I have poured our hearts, souls, and
possibly a few cups of coffee into distilling our years of experience in the field
into practical, actionable advice. We've made the mistakes so you don't have to
(trust us, some of those mistakes were doozies). You'll find real-world examples
and step-by-step guides on how to implement the very best design patterns and
solutions in your data model.
Now, let me take a moment to tip my hat to some of the brilliant minds in the
field who contributed to this guide. One of the wonders of being part of the
ThoughtSpot community is that you're in the midst of some seriously sharp
individuals. A handful of the ingenious use cases and solutions you'll find in these
pages owe their existence to the creativity and insights of my colleagues. David
Stefkovich deserves a special mention, but I must also give credit to Damian
Waldron, Bill Lay, and Nick Cooper for their invaluable input. It's this kind of
collaborative effort that makes projects like this truly shine.
So, whether you're a seasoned data guru or a wide-eyed beginner, fasten your
seatbelt, put on your thinking cap, and get ready for a rollercoaster ride through
the fascinating world of data modeling. It may be tough, but with the right
guidance, and the knowledge you'll gain from this field guide, you'll conquer this
data-driven jungle like a pro.
Happy modeling!
Misha Beek
3 Exploring data modeling for practical insights
In this guide, we'll delve into Data Modeling, uncovering its key principles,
relevance to modern data environments, and practical application of best
practices. Whether you're an experienced data pro or a newcomer, this guide is
your companion for a thorough exploration of essential practices, valuable tips,
and proven design patterns for successful data modeling.
Before diving into the technical aspects in later chapters, this section focuses on
non-technical facets and essential best practices.
In Dimensional Modeling, technical knowledge is vital, but it's the practical, non-
technical approaches that lead to success. These involve understanding
requirements, proper documentation, effective communication, and adhering to
guiding principles.
At the core, a meaningful dimensional model starts by understanding
stakeholders' needs. This means engaging in discussions and carefully outlining
requirements. Through this process, your model begins to take shape, mirroring
real operational aspects. Active listening, asking the right questions, and
aligning technical solutions with tangible business problems all contribute to
creating a fitting model.
Documentation becomes a vital tool in navigating the complexities of
Dimensional Modeling. This includes capturing design choices, mapping out data
connections, and creating user guides. These documents simplify intricate
details, paving the way for innovation and practical solutions.
However, the heart of our journey lies in collaboration. Effective communication
acts as the glue that binds the technical and non-technical sides. It's about
translating complex technical terms into understandable concepts, fostering a
shared understanding. This synergy between different perspectives drives us
toward our common goal.
In this pragmatic interplay of technical and non-technical aspects, our path
through Dimensional Modeling is guided by a practical approach. By focusing on
understanding, documentation, and collaboration, we navigate the complexities
of data, shaping it into a powerful tool for informed decision-making.
- Ensuring Data Quality: Implement data quality checks during Extract, Transform & Load (ETL) processes to maintain integrity and support meaningful analysis.
- Collaboration Dynamics: Engage with business users to refine the model in line with actual business needs.
- Embracing Change: Stay updated on developments to ensure your model remains effective.
- Scalability: Design your model to accommodate growth and scalability.
- Documentation: Document design, relationships, and rules for understanding and collaboration.
- Business-IT Collaboration: Collaborate to ensure your model aligns with business needs.
- Self-Service Design: Create interfaces empowering users to explore data independently.
- Education: Invest in training to maximize benefits.
- Continual Evolution: Continuously refine your model to accommodate new requirements, dimensions, and data sources.
- Insufficient/Incorrect Dimensional Modeling: Ensure alignment with reporting requirements and include desired events. Mismatched requirements can lead to omitted dimensions and facts crucial for accurate analysis.
- Overcomplicating the Model: Focus on essentials and prioritize reporting over excessive data integrity. Strive for simplicity to enhance comprehension and performance.
- Ignoring Data Quality and Integrity: Neglecting data quality leads to inaccurate insights. Rigorously validate, cleanse, and integrate data for reliable results.
- Failing to Capture Granularity: Balance granularity and usability to avoid superficial insights. Understand business analytical needs for effective data modeling.
- Neglecting Future Scalability: Design for growth to accommodate evolving requirements. Ensure flexibility for new dimensions, facts, and reporting needs.
- Ignoring Performance Optimization: Optimize performance to avoid sluggish query response times. Incorporate indexing, aggregations, and proper join strategies.
- Understand Existing Models and Patterns Before Making Changes: Modify existing models with care, understanding their intricacies and patterns to avoid unintended consequences.
- Avoid Overloading Single Worksheets for Complex Use Cases: Distribute complex analytical needs across multiple worksheets for clarity and efficiency.
Having explored the less technical dimensions of data modeling, we're now
ready to dive deeper into the heart of data modeling itself. In the upcoming
sections, we'll immerse ourselves in best practices, insightful tips, and intricate
examples that illuminate the path of effective data modeling.
4 Fundamentals: Data model types
4.1.1 Introduction
A dimensional data model is a data modeling technique used in data
warehousing and business intelligence to structure data for easy analysis and
reporting. It has several advantages that make it popular in decision support
systems and analytical environments.
4.1.2 Advantages
Here are some of the key advantages of a dimensional data model:
- Simplicity and ease of understanding: Dimensional data models are designed to be intuitive and easy to understand, even for non-technical users. They use familiar concepts like dimensions (descriptive attributes) and measures (quantitative data) to represent business entities and their metrics.
- Faster query performance: Dimensional models are optimized for querying and reporting. Their denormalized structure allows for quick retrieval of data, reducing the complexity of joins and aggregation operations. As a result, analytical queries can be executed more efficiently, providing faster response times.
- Flexibility and agility: Dimensional data models are highly flexible and can adapt easily to changing business requirements. When new dimensions or measures are needed, they can be added without affecting the existing structure, making it convenient to incorporate new data and insights as the business evolves.
- Enhanced user experience: Due to its user-centric design, a dimensional data model empowers business users to perform self-service data exploration
and analysis. With a well-structured model, users can navigate through data hierarchies and relationships to gain meaningful insights without relying heavily on IT or data experts.
- Better support for business analytics: Dimensional models are purpose-built for analytical applications and business intelligence. They provide a clear representation of business metrics and enable advanced analytics like drill-down, roll-up, slicing, dicing, and pivot operations.
- Business-focused design: Dimensional data models are designed with a focus on meeting business requirements and answering business questions. This alignment with business needs enables users to quickly find the information they need to make informed decisions.
- Improved data quality: Dimensional data models promote better data quality and consistency by adhering to standardized dimensions and hierarchies. This ensures that the data used for analysis is consistent across different reports and dashboards.
4.1.3 Considerations
Although dimensional data models bring significant benefits to data warehousing
and analytical setups, their implementation requires careful consideration of
certain factors. These key considerations encompass:
- Data Redundancy: Dimensional data models often denormalize data to improve query performance and user experience. However, denormalization can lead to data redundancy, as the same information might be stored in multiple places within the model. This can increase storage requirements and may lead to data integrity issues if updates are not properly managed.
- Complexity in Hierarchies: Hierarchical relationships in dimensional models can sometimes be complex, especially when dealing with multi-level hierarchies. Maintaining and representing these hierarchies
correctly may require careful attention and additional effort. See chapter 10
for various implementation techniques for hierarchies.
- Difficulties in Real-time Data: Real-time data integration poses challenges when using dimensional models because these models are predominantly designed for batch processing. The need for data transformation during loading can complicate real-time data integration processes.
- Inflexibility for Certain Queries: While dimensional models excel in many analytical queries, they may not be optimal for certain complex analytical operations, especially those requiring ad-hoc or exploratory data analysis. In such cases, other data models like data vault or snowflake schemas might be more suitable.
- Maintenance Overhead: Dimensional models require ongoing maintenance, especially when introducing changes to the model, like adding new dimensions or measures. Proper documentation is essential to prevent data inconsistencies and ensure smooth maintenance.
- Data Governance Challenges: With denormalized data and ease of data integration, it can be challenging to enforce strict data governance rules, leading to potential data quality issues and inconsistency.
4.1.4.1 Star Schema
4.1.4.1.2 Example
Imagine a sales database where the central fact table contains sales revenue
and quantity data. The dimension tables may include attributes such as product
name, product category, date, and location. Each dimension table is linked to
the fact table via foreign key relationships.
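A minimal sketch of such a star schema, using SQLite as a stand-in for a warehouse (the table and column names below are illustrative, not prescribed by this guide):

```python
import sqlite3

# Illustrative star schema: one central fact table, two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key      INTEGER PRIMARY KEY,
    product_name     TEXT,
    product_category TEXT
);
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,
    full_date TEXT,
    year      INTEGER
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    revenue     REAL,
    quantity    INTEGER
);
""")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Gadgets')")
conn.execute("INSERT INTO dim_date VALUES (20240101, '2024-01-01', 2024)")
conn.execute("INSERT INTO fact_sales VALUES (1, 20240101, 99.5, 3)")

# A typical analytical query: aggregate the fact table,
# grouped by a descriptive attribute from one dimension.
row = conn.execute("""
    SELECT p.product_category, SUM(f.revenue), SUM(f.quantity)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.product_category
""").fetchone()
# row -> ('Gadgets', 99.5, 3)
```

Note how each dimension is one join hop away from the fact table; this is what keeps star-schema queries short and predictable.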
4.1.4.1.3 Advantages
- Simple and easy to understand, making it a preferred choice for small to medium-sized data warehouses.
- Efficient query performance due to its denormalized nature, reducing the number of joins required for analysis.
- Optimal for analytical processes where data retrieval often involves aggregations.
4.1.4.1.4 Disadvantages
- Start-up Costs: Source data does not always exist in a relational format that supports dimensional modeling.
- Redundancy in dimension tables can lead to increased storage requirements.
- Maintenance challenges when updates or changes to dimension attributes are required.
- Not suitable for highly normalized data.
4.1.4.2 Snowflake Schema
4.1.4.2.1 Definition
The snowflake schema is an extension of the star schema. In this model,
dimension tables are further normalized by breaking them into multiple related
tables. The normalization helps reduce data redundancy but may increase query
complexity.
4.1.4.2.2 Example
In the previous sales database example, a snowflake schema could involve
breaking down the product dimension table into multiple tables, such as product
details and product categories. Each of these tables would hold specific attribute
data related to products. Similarly, the location table could be divided into
multiple tables, introducing separate dimensions for cities and countries.
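Sketched in the same illustrative SQLite style (names invented for this example), snowflaking moves the category attributes into their own table, at the cost of an extra join hop:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    revenue     REAL
);
""")
conn.execute("INSERT INTO dim_category VALUES (10, 'Gadgets')")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget', 10)")
conn.execute("INSERT INTO fact_sales VALUES (1, 42.0)")

# Note the extra join compared to a star schema:
# fact -> product -> category, instead of fact -> product.
row = conn.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_key  = f.product_key
    JOIN dim_category c ON c.category_key = p.category_key
    GROUP BY c.category_name
""").fetchone()
# row -> ('Gadgets', 42.0)
```

The category name is now stored once, regardless of how many products share it, which is the storage benefit the normalization buys.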
4.1.4.2.3 Advantages
- Improved data storage efficiency due to reduced redundancy.
- Easier maintenance when updating shared dimension attributes.
- Suitable for data warehouses with limited storage capabilities.
4.1.4.2.4 Disadvantages
- Increased query complexity due to the multiple joins required to access data from normalized tables.
- Potentially slower query performance compared to a star schema due to the additional joins.
- May not be as intuitive to understand and implement for some users.
4.1.4.3 Galaxy Schema
4.1.4.3.1 Definition
The galaxy schema, also known as the constellation schema, combines multiple
star schemas or snowflake schemas into a more complex and interconnected
structure. This model is useful when dealing with diverse and heterogeneous
data sources.
4.1.4.3.2 Example
Consider a data warehouse that combines sales data, financial data, and
customer data. The galaxy schema links all these fact tables and dimension
tables together.
4.1.4.3.3 Advantages
- Provides a comprehensive view of diverse and complex data sources.
- Enables analysis across multiple business areas or departments.
- Offers flexibility for accommodating varied data structures.
4.1.4.3.4 Disadvantages
- Increased complexity in schema design and maintenance.
- Query performance may suffer due to the complexity of the model.
- Requires a deeper understanding of data relationships and business requirements.
4.1.4.4.2 Example
In a retail business, separate fact tables may exist for sales, inventory, and
returns. Each fact table links to common dimension tables like product, date,
and location.
4.1.4.4.3 Advantages
- Facilitates easy integration of unrelated facts and dimensions.
- Simplifies data model design for scenarios with diverse data sources.
- Better suited for scenarios where facts don't have direct correlations.
4.1.4.4.4 Disadvantages
- Query performance might suffer when querying data across multiple fact tables.
- Potential data redundancy in dimension tables due to multiple connections.
- Requires careful consideration of data relationships to avoid data inconsistencies.
4.2 FLAT DENORMALIZED TABLE
4.2.1 Definition
A flat denormalized table approach in data modeling refers to a design where
data is stored in a single table with all relevant information, including redundant
data, combined in a denormalized fashion. This approach stands in contrast to
traditional normalized data modeling, where data is distributed across multiple
related tables to reduce redundancy and improve data integrity. The flat
denormalized table approach is sometimes also called a "wide table" or
"flattened table" approach.
In this approach, instead of having separate tables for each entity (e.g.,
customers, orders, products), all relevant data is merged into a single table,
typically through the process of joining and aggregating data from various
sources. The resulting table includes all the required attributes, making it easier
to query and analyze data without the need for complex joins and relationships.
4.2.2 Example
For example, consider an e-commerce scenario where you have customer data,
order data, and product data:
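As an illustrative sketch (the column names are invented for this example), the customer, order, and product entities can be merged into one wide table, with the customer attributes repeated on every order row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One wide table: customer, order, and product attributes live side by side.
conn.execute("""
CREATE TABLE sales_flat (
    customer_name TEXT,
    customer_city TEXT,
    order_id      INTEGER,
    order_date    TEXT,
    product_name  TEXT,
    revenue       REAL
)
""")
rows = [
    ("Ada", "London", 1, "2024-01-01", "Widget",   10.0),
    ("Ada", "London", 2, "2024-01-02", "Sprocket",  5.0),  # customer attrs repeated
]
conn.executemany("INSERT INTO sales_flat VALUES (?, ?, ?, ?, ?, ?)", rows)

# No joins needed; the redundancy is the trade-off.
total = conn.execute(
    "SELECT SUM(revenue) FROM sales_flat WHERE customer_name = 'Ada'"
).fetchone()[0]
# total -> 15.0
```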
4.2.3 Advantages
- Simplified Queries: The flat table eliminates the need for complex joins, making queries easier to write and understand.
- Improved Query Performance: By reducing join operations, query performance can be improved, especially for analytical queries.
- Better Reporting: The denormalized table facilitates faster and simpler reporting since all required data is available in a single location.
- Reduced Data Complexity: Data is presented in a user-friendly way, reducing the need for complex joins and data manipulation.
4.2.4 Disadvantages
- Increased Query Processing Time: Aggregating data from a large flat denormalized table can lead to slower query processing times. Since the table contains redundant data, the size of the table can be substantial, and aggregating this data requires more time and resources.
- Cost of Indexing: Indexing denormalized tables can be expensive due to the increased size and complexity of the data. Indexes that worked well on normalized tables may not provide the same performance benefits on denormalized tables.
- Storage Overhead: The denormalized table contains redundant information, leading to increased storage requirements. This can be especially problematic when dealing with large datasets, leading to additional storage costs and potential storage constraints.
- Increased Memory Usage: Aggregating data from a flat denormalized table often requires more memory resources during query execution. This can cause memory contention and potentially lead to performance bottlenecks, particularly on systems with limited memory.
- Search Experience: Users must be careful when selecting fields for calculations in denormalized tables, as some fields might already be aggregated or at a different granularity level. Incorrectly choosing aggregated fields can lead to skewed results and inaccuracies in the analysis.
- Update Anomalies: Data updates can be more cumbersome due to multiple occurrences of the same information in the table, and any changes to the data model or relationships may require extensive modifications to the flat table.
- Mixed grain issues: The issue of mixed granularity in flat denormalized tables arises when the table combines data at different levels of detail within the same structure. This can lead to data inconsistencies and difficulties in querying and aggregating data efficiently, as the table may contain attributes from different entities with varying levels of specificity. Handling mixed granularity requires careful consideration of data organization and querying strategies to ensure data accuracy and proper analysis.
- Implementing slowly changing dimensions: Adding slowly changing dimensions (SCDs) to a flat denormalized table involves introducing additional columns to track historical changes of attributes. This increases data redundancy and complexity in managing updates, query logic, and referential integrity, while requiring careful planning for proper indexing and data validation.
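As a hedged sketch of that last point, here is one common way (SCD type 2) such tracking columns might look; the `valid_from`, `valid_to`, and `is_current` names are hypothetical, not taken from this guide:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical SCD type 2 columns on a denormalized customer table:
# each attribute change closes the current row and inserts a new version.
conn.execute("""
CREATE TABLE customer_flat (
    customer_id INTEGER,
    city        TEXT,
    valid_from  TEXT,
    valid_to    TEXT,     -- NULL means "still current"
    is_current  INTEGER
)
""")
conn.execute("INSERT INTO customer_flat VALUES (1, 'London', '2023-01-01', NULL, 1)")

def change_city(conn, customer_id, new_city, as_of):
    """Close the current version and insert a new one (SCD type 2)."""
    conn.execute(
        "UPDATE customer_flat SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (as_of, customer_id),
    )
    conn.execute(
        "INSERT INTO customer_flat VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_city, as_of),
    )

change_city(conn, 1, "Paris", "2024-06-01")
versions = conn.execute(
    "SELECT city, is_current FROM customer_flat ORDER BY valid_from"
).fetchall()
# versions -> [('London', 0), ('Paris', 1)]
```

Every query against such a table must now filter on `is_current` (or a date range), which is exactly the added query-logic complexity the bullet above warns about.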
4.3 PRECALCULATED VIEWS/OLAP CUBE
4.3.1 Definition
A precalculated view/OLAP (Online Analytical Processing) Cube data model is a
type of multidimensional data model used in business intelligence and data
warehousing. It organizes data in a way that allows for efficient and fast analysis
of large datasets. OLAP cubes are particularly designed to facilitate analytical
queries and reporting tasks, offering a more user-friendly approach for decision-
makers to explore and gain insights from their data.
4.3.2 Example
Let's consider a retail business as an example. The business may have a large
database containing sales data, including information about products,
customers, regions, and time periods. With an OLAP cube data model, the data
can be organized into a multidimensional structure like in Figure 7.
Figure 7: Sample OLAP Cube
The OLAP cube would pre-calculate aggregations for various combinations of
dimensions and measures, creating a data structure that enables rapid
querying and analysis of data across different dimensions.
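As a rough illustration of the pre-aggregation idea (SQLite and the table names below are stand-ins, not a real OLAP engine), a rollup can be materialized once so that later queries read a small summary table instead of the raw facts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("EMEA", 2024, 100.0), ("EMEA", 2024, 50.0), ("APAC", 2024, 70.0),
])

# Cube-style pre-aggregation: compute the rollup once and store it.
conn.execute("""
CREATE TABLE sales_cube AS
SELECT region, year, SUM(revenue) AS revenue
FROM sales
GROUP BY region, year
""")

# Later queries hit the summary table, not the (potentially huge) fact data.
emea = conn.execute(
    "SELECT revenue FROM sales_cube WHERE region = 'EMEA' AND year = 2024"
).fetchone()[0]
# emea -> 150.0
```

A real cube would pre-compute many such rollups (one per dimension combination) and refresh them on a schedule, which is where the staleness trade-off discussed below comes from.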
4.3.3 Advantages
- Faster Query Performance: By pre-calculating aggregations and storing them in the cube, OLAP queries can return results much faster than traditional relational databases when dealing with large datasets. This enhances the responsiveness of analytical tools and reduces query processing time.
- Simplified Complex Queries: OLAP cubes allow users to perform complex analytical queries with a simple drag-and-drop interface or a few clicks. Users can easily navigate through hierarchies, drill down into details, and
perform slice-and-dice operations to explore data from different perspectives.
- Improved Decision Making: With OLAP cubes, decision-makers can quickly analyze historical trends, spot patterns, and identify opportunities and challenges in their business. The intuitive interface enables them to interactively explore data, gain insights, and make data-driven decisions.
- Reduced Database Load: OLAP cubes offload the query processing burden from the operational database since they store pre-aggregated data. This helps improve the overall performance of the transactional system, as it doesn't have to handle complex analytical queries.
4.3.4 Disadvantages
- Data Size and Maintenance: OLAP cubes can be resource-intensive because of their storage requirements. Cube sizes can grow significantly, especially for large datasets, and they need to be regularly updated and maintained to ensure data accuracy and relevancy.
- Limited Real-Time Data: Since OLAP cubes are typically refreshed periodically, they may not reflect the most current data in real time. There could be a delay between data updates in the operational system and the data being available in the OLAP cube.
- Cube Design Complexity: Designing an effective OLAP cube can be complex, especially when dealing with multiple dimensions and measures. Properly defining hierarchies, relationships, and aggregations requires careful planning and understanding of the business requirements.
- Inflexible Schema: OLAP cubes are based on predefined dimensions and measures, making it challenging to handle ad hoc analysis or accommodate changes in data requirements. Any changes to the cube structure may involve significant effort and reprocessing of data.
4.4 ONLINE TRANSACTION PROCESSING (OLTP)
4.4.1 Definition
An Online Transaction Processing (OLTP) or transactional data model is designed
for the efficient management of day-to-day operational transactions in a
database. It is used to process high volumes of small, individual transactions in
real-time, such as recording sales, processing orders, updating inventory, and
managing customer information. Unlike Online Analytical Processing (OLAP)
data models, OLTP focuses on quick data processing and ensures data integrity
and consistency.
4.4.2 Example
Let's consider a bank management system as an example.
The OLTP data model for the platform may include tables such as:
- Customers: Stores information about individual customers, such as name, contact details, and address.
- Customer Types: Information on the type of customer.
- Customer Purchases: Products and services the customer has bought.
- Orders: Records each customer's purchase orders, along with order details like order ID, date, and status.
- Products & Services: The available products and services.
- Accounts: The bank accounts of all the customers.
- Account Types: The types of accounts available.
- Merchants: The merchants selling the various products and services.
- Transactions: The financial transactions the customers have been involved in.
- Transaction Types: The available transaction types.
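To illustrate the transaction-processing style such a model supports, here is a small SQLite sketch (the account IDs and amounts are invented) of an atomic transfer between two accounts, where both updates commit together or not at all:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 20.0)])

def transfer(conn, src, dst, amount):
    """Move money atomically: commit both updates, or roll both back."""
    with conn:  # the connection context manager commits on success,
                # and rolls back if any statement raises
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE account_id = ?",
            (amount, src))
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE account_id = ?",
            (amount, dst))

transfer(conn, 1, 2, 30.0)
balances = dict(conn.execute("SELECT account_id, balance FROM accounts"))
# balances -> {1: 70.0, 2: 50.0}
```

This all-or-nothing behavior is the data-integrity guarantee the advantages below refer to.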
4.4.3 Advantages
- Real-time Transaction Processing: OLTP systems are optimized for quick and real-time processing of individual transactions. They ensure that each transaction is promptly recorded in the database, enabling immediate access to the latest data.
- Data Integrity: OLTP systems enforce data integrity constraints to maintain the accuracy and consistency of data. This helps prevent errors and ensures that the data remains reliable for day-to-day operations.
- Concurrent Access: OLTP data models are designed to support multiple concurrent users accessing the database simultaneously. This is crucial for applications with a large user base and high transaction volumes.
- High Availability: OLTP databases are typically set up with redundant hardware and failover mechanisms to ensure high availability. This minimizes downtime and keeps the system accessible at all times.
4.4.4 Disadvantages
- Performance for Analytical Queries: OLTP systems are not optimized for complex analytical queries, leading to slower query performance compared to OLAP databases.
- Data Redundancy: Normalized data structures in OLTP systems may result in data redundancy and increased storage requirements, impacting performance and storage costs.
- Complexity in Reporting: Generating complex reports in OLTP systems is challenging due to the normalized data structure, requiring multiple table joins and leading to slower query performance.
- Challenges with Long Join Paths: Joining multiple tables in normalized models can require additional computation and impact query performance.
- Modeling Complexity: Multiple join paths between entities make modeling for search more challenging in normalized models.
- Lack of Codified Business Rules: Business rules are not embedded in the model, requiring end-users to have a deeper level of data literacy training.
- Confusing Search Experience: The search experience may be confusing for users due to the normalized structure and requires advanced modeling techniques.
- Increased Learning Curve: Normalized models require advanced skills in ThoughtSpot, increasing the learning curve for the data admin team.
- Additional Documentation and Technical Debt: Implementing and maintaining normalized models requires more documentation and results in technical debt over time.
4.5 DATA VAULT
4.5.1 Definition
A Data Vault data model is a specific approach to designing a data warehouse
that aims to provide a scalable, flexible, and reliable foundation for capturing
and storing data from various sources. It was developed by Dan Linstedt in the
early 2000s and has gained popularity as an effective approach for handling
complex data integration challenges.
The Data Vault methodology addresses some common issues that traditional
data warehousing approaches might face, such as data silos, high maintenance
costs, and difficulty in adapting to changing business requirements. It achieves
these goals through the following key principles:
n Hub-and-Spoke Architecture: The data model consists of three main types
of tables: Hubs, Links, and Satellites. This structure helps to maintain data
integrity and supports historical data tracking.
n Hubs: Hubs represent business entities and act as the central repository for
core business keys. They provide a way to uniquely identify business entities
and are essentially the primary keys for each business entity.
n Links: Links connect multiple Hubs and represent the relationships between
these entities. They store the surrogate keys from the related Hubs to
establish relationships.
n Satellites: Satellites store descriptive attributes about the Hubs and Links,
including changes over time. They hold historical data and capture changes
to attributes, allowing for easy data auditing and historical tracking.
The Data Vault model promotes a strong separation between business keys
(natural keys from source systems) and surrogate keys (system-generated keys
used internally for relationships). This separation enhances data traceability, as
the model doesn't rely on the primary keys from source systems.
4.5.2 Example
4.5.3 Advantages
n Scalability: Data Vault is designed to scale efficiently as data volume and
complexity grow, making it suitable for large enterprises with diverse data
sources.
n Flexibility: The model can easily accommodate changes in source systems
and business requirements, reducing the impact of changes on the overall
data warehouse structure.
n Auditing and Compliance: The historical tracking of data changes in
Satellites facilitates auditing and compliance with data governance
regulations.
n Data Integration: It helps to integrate data from various sources
effectively, breaking down data silos.
4.6 MODEL TYPES COMPARED
Here's a brief explanation of when each data model type is best used:
n Dimensional Models: Dimensional models, such as the star schema and
snowflake schema, are best suited to BI and analytics scenarios. They simplify
querying, reporting, and data exploration, making them ideal for data
warehousing and decision-making. These types of models are typically
preferred by data warehousing platforms, Business Intelligence (BI) tools, and
analytical databases, because dimensional models are optimized for querying
and analyzing data in analytical environments and work well with tools that
prioritize ad-hoc querying, reporting, and data exploration.
n Denormalized Flat Table: Denormalized flat tables work best for
applications that require fast query performance and simple data structures.
They suit real-time reporting, analytics on smaller datasets, and cases where
joins are costly. These types of models are typically preferred by real-time
processing platforms, NoSQL databases, and streaming analytics engines,
because denormalized flat tables suit real-time processing and simple data
structures and can be handled efficiently by platforms that focus on fast
query performance, real-time reporting, and support for semi-structured and
unstructured data.
n Precalculated Views/OLAP Cube: Precalculated views and OLAP cubes are
optimal for complex analytical queries over pre-aggregated data. They offer
high performance and support interactive dashboards and multidimensional
analysis. These model types are typically preferred by Online Analytical
Processing (OLAP) databases and data warehousing platforms with OLAP
capabilities, because such platforms are built specifically for
multidimensional analysis and interactive dashboards.
n OLTP/Transactional Model: OLTP/transactional models are designed for
online transaction processing systems. They are used in applications that
manage day-to-day business operations and must ensure data consistency and
integrity. These types of models are typically preferred by transactional
databases, operational systems, and Online Transaction Processing (OLTP)
platforms, because they are optimized for handling business transactions,
maintaining data consistency, and real-time data processing.
n Data Vault: Data Vault is most useful for large-scale data warehousing
projects where data integration from diverse sources and historical tracking
are crucial. It offers flexibility in handling change and scaling the data
architecture. These types of models are typically preferred by large-scale
data warehousing platforms, data integration tools, and data lakes with
structured metadata, because such platforms can accommodate diverse data
sources, historical tracking, and data integration effectively.
Based on our experience, dimensional models typically offer superior analytical
capabilities, making them the preferred choice in most cases. However, specific
use cases might benefit from a denormalized table approach. All of our
supported Cloud Data Platforms can handle both models, although, apart from
Amazon Redshift, most tend to favor dimensional models over denormalized
structures.
Best Practice 1 - Choose the right dimensional modeling
technique
Understand the primary dimensional modeling techniques
discussed in this chapter. Choose the technique that best fits the
business requirements and data complexity. Balance simplicity and
performance considerations when making this choice.
b. It further normalizes dimension tables by breaking them into multiple
related tables.
d. It does not have a central fact table and each fact table is connected
directly to relevant dimension tables.
4. What is an advantage of using a Snowflake Schema?
5. Which type of schema is suitable for scenarios where facts don't have direct
correlations?
a. It reduces query processing time by eliminating the need for aggregations.
c. It stores data in a single table with all relevant information combined.
9. Which data model type is designed for large-scale data warehousing projects
involving diverse data sources and historical tracking?
Answers:
1: d, 2: a, 3: a, 4: a, 5: d, 6: c, 7: d, 8: c, 9: d
5 Fundamentals: Facts and dimensions
5.1 INTRODUCTION
Moving forward, we transition to "Dimension Types," where the intricacies of
hierarchical organization, data consistency, and optimization strategies come to
light.
We have grouped the types of fact tables based on their similarities in
characteristics, purposes, and use cases.
5.2.1.1.1 Definition
The transactional fact table captures detailed data at the most granular level,
representing individual business transactions or events.
5.2.1.1.3 Advantages
n Detailed Analysis: Provides granular data for in-depth analysis of individual
transactions.
n Accurate Reporting: Enables precise transactional reporting and
performance tracking.
5.2.1.1.4 Considerations
n Data Volume: May lead to a large data volume, requiring efficient storage
and query optimization.
n Query Performance: Complex queries on detailed data might affect query
response time.
5.2.1.1.5 Example
OrderID ProductID CustomerID OrderDate Quantity SalesAmount
5.2.1.2.3 Advantages
n Historical Analysis: Supports historical reporting and trend identification.
n Simplified Queries: Provides fixed snapshots for simplified querying.
5.2.1.2.4 Considerations
n Data Redundancy: Snapshot tables may duplicate data for each snapshot
period.
n Limited Granularity: Granularity is fixed at snapshot intervals, limiting
real-time analysis.
5.2.1.2.5 Example
SnapshotDate ProductID QuantityInStock Price
5.2.1.3.3 Advantages
n Progress Tracking: Monitors progress or state changes over intervals.
n Milestone Analysis: Useful for tracking milestones and stages.
5.2.1.3.4 Considerations
n Data Updates: Requires updates as milestones are reached, potentially
impacting data integrity.
n Complexity: Handling incremental updates can add complexity to ETL
processes.
5.2.1.3.5 Example
ProjectID MilestoneDate TasksCompleted TotalTasks
P001 2023-01-15 5 10
P001 2023-02-15 8 10
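Reading an accumulating snapshot usually means taking the latest row per entity. A minimal sketch in Python, using the example rows above (the function name is illustrative, not from any library):

```python
# Accumulating-snapshot rows from the example above:
# (ProjectID, MilestoneDate, TasksCompleted, TotalTasks)
rows = [
    ("P001", "2023-01-15", 5, 10),
    ("P001", "2023-02-15", 8, 10),
]

def completion_pct(project_id, snapshot_rows):
    """Completion percentage from the latest snapshot row for a project.

    ISO date strings compare chronologically, so max() by date works."""
    latest = max(
        (r for r in snapshot_rows if r[0] == project_id),
        key=lambda r: r[1],
    )
    return 100.0 * latest[2] / latest[3]

print(completion_pct("P001", rows))  # latest snapshot: 8 of 10 tasks -> 80.0
```

The same "latest row wins" logic is what the ETL update step maintains as each milestone is reached.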
5.2.1.4.3 Advantages
n Trend Analysis: Facilitates trend analysis over consistent time intervals.
n Performance Optimization: Provides aggregated data for faster querying.
5.2.1.4.4 Considerations
n Limited Flexibility: Aggregated data might not capture all details of
individual transactions.
n Data Staleness: Data may not reflect real-time changes between snapshot
intervals.
5.2.1.4.5 Example
WeekEnding ProductID UnitsSold Revenue
5.2.1.5.3 Advantages
n Query Performance: Enhances query performance by restricting data
access and reducing join complexity.
n Manageability: Improves data management and maintenance by
segmenting data into smaller partitions.
n Historical Data: Facilitates efficient handling of historical data and time-
based queries.
5.2.1.5.4 Considerations
n Partitioning Strategy: Requires careful selection of partitioning key and
strategy to ensure optimal performance.
n Data Loading: Loading data into partitioned tables may require specific ETL
processes.
n Complexity: Managing partitions and optimizing queries may require
advanced database management skills.
5.2.1.5.5 Example
Region Date ProductID UnitsSold
Partitioning in this way improves both query performance and manageability.
5.2.2.1.1 Definition
The cumulative fact table stores cumulative or running total values over time.
5.2.2.1.3 Advantages
n Running Totals: Easily tracks cumulative values over time.
n Insightful Analysis: Enables analysis of progressive measures.
5.2.2.1.4 Considerations
n Data Volume: Cumulative values may grow significantly over time.
n Complexity: Managing running totals requires careful maintenance.
5.2.2.1.5 Example
Date ProductID CumulativeRevenue
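The running total the table stores is precomputed during ETL rather than summed at query time. A minimal sketch in Python with illustrative data:

```python
from itertools import accumulate

# Transaction-level revenue for one product, ordered by date (illustrative).
daily = [("2023-01-01", 100.0), ("2023-01-02", 50.0), ("2023-01-03", 25.0)]

# The cumulative fact table stores the running total per date,
# so reports read a precomputed value instead of summing at query time.
cumulative_fact = list(zip(
    (date for date, _ in daily),
    accumulate(amount for _, amount in daily),
))
# cumulative_fact holds (Date, CumulativeRevenue) pairs: 100.0, 150.0, 175.0
```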
5.2.2.2.3 Advantages
n Improved Performance: Enhances query response time for summary-level
reporting.
n Query Simplicity: Simplifies complex calculations during querying.
5.2.2.2.4 Considerations
n Data Loss: Aggregation may lead to loss of detail, affecting detailed
analysis.
n Data Refresh: Requires periodic refresh to incorporate new data.
5.2.2.2.5 Example
Year Month ProductCategory TotalSales
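The aggregate rows are produced once in the ETL step by grouping transaction-level data. A minimal sketch in Python (data and grouping keys are illustrative):

```python
from collections import defaultdict

# Transaction-level source rows: (year, month, product_category, sales_amount)
transactions = [
    (2023, 1, "Electronics", 200.0),
    (2023, 1, "Electronics", 300.0),
    (2023, 1, "Clothing", 150.0),
    (2023, 2, "Electronics", 100.0),
]

# ETL step: pre-aggregate to (Year, Month, ProductCategory) -> TotalSales,
# trading transaction-level detail for faster summary queries.
aggregate_fact = defaultdict(float)
for year, month, category, amount in transactions:
    aggregate_fact[(year, month, category)] += amount
```

Note the trade-off named above: once aggregated, the individual transactions behind each total are no longer recoverable from this table.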
5.2.2.3.1 Definition
The derived fact table contains calculated measures derived from other fact
tables or external sources.
5.2.2.3.3 Advantages
n Centralized Calculations: Offers a consolidated source for calculated
measures.
n Reduced Redundancy: Eliminates redundancy in storing calculated values.
5.2.2.3.4 Considerations
n Maintenance: Requires updates if source data or calculations change.
n Processing Overhead: Deriving measures during ETL adds processing
overhead.
5.2.2.3.5 Example
Date ProductID SalesAmount CostAmount Profit
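The derived measure is calculated once during ETL so every report shares the same definition. A minimal sketch in Python matching the columns above (row values are illustrative):

```python
# Source fact rows: (Date, ProductID, SalesAmount, CostAmount)
source_rows = [
    ("2023-01-01", "P100", 120.0, 80.0),
    ("2023-01-01", "P101", 60.0, 45.0),
]

# The derived fact table materializes Profit = SalesAmount - CostAmount once,
# centralizing the calculation instead of repeating it in every query.
derived_fact = [
    (date, product, sales, cost, sales - cost)
    for date, product, sales, cost in source_rows
]
```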
Use Case / Purpose: the cumulative fact table tracks accumulative measures
over time, enabling trend analysis; the aggregate fact table enhances query
response time for summary-level reporting; the derived fact table offers a
consolidated view of calculated measures, reducing redundancy.
Examples: a running total of year-to-date sales (cumulative); quarterly total
revenue by region (aggregate); profit margin calculated as (Revenue - Costs)
(derived).
5.2.3.1.1 Definition
A Factless Fact Table captures events or occurrences without measures. It
contains only foreign keys referring to dimension tables and serves as a "fact"
to represent relationships between dimensions.
5.2.3.1.3 Advantages
n Relationship Tracking: Captures relationships between dimensions.
n Pattern Identification: Useful for identifying patterns and associations.
5.2.3.1.4 Considerations
n Lack of measures: No measures for direct analysis; relies on associations.
n Complexity: Requires joins with dimension tables for meaningful analysis.
5.2.3.1.5 Example
Date CustomerID ProductID Action
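Because a factless fact table carries no measures, analysis works by counting rows. A minimal sketch in Python matching the columns above (event rows are illustrative):

```python
from collections import Counter

# Factless fact rows hold only keys, no measures:
# (Date, CustomerID, ProductID, Action)
events = [
    ("2023-01-01", "C1", "P1", "viewed"),
    ("2023-01-01", "C1", "P2", "viewed"),
    ("2023-01-02", "C2", "P1", "viewed"),
]

# Counting rows answers questions like "how often was each product viewed?"
views_per_product = Counter(product for _, _, product, _ in events)
```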
Best Practice 10 - When to use a factless fact table?
n You are capturing events or occurrences without measures but
with a focus on relationships between dimensions.
n You need to track patterns and associations.
n You are comfortable with additional joins with dimension tables
for meaningful analysis.
5.2.3.2.1 Definition
A Bridge Table resolves a many-to-many relationship between dimension tables
by creating a link between them. It contains the primary keys of both related
dimensions and may include additional attributes related to the relationship.
5.2.3.2.3 Advantages
n Many-to-Many Resolution: Resolves complex many-to-many
relationships.
n Detailed Analysis: Allows detailed analysis of associations.
5.2.3.2.4 Considerations
n Data Redundancy: May duplicate dimension data, affecting storage
efficiency.
n Query Complexity: Requires additional joins, potentially impacting query
performance.
5.2.3.2.5 Example
StudentID CourseID EnrollmentDate
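Resolving the many-to-many relationship means joining both dimensions through the bridge. A minimal sketch in Python matching the columns above (student and course data are illustrative):

```python
# Dimension tables (primary key -> attribute).
students = {"S1": "Ada", "S2": "Grace"}
courses = {"C1": "Databases", "C2": "Statistics"}

# Bridge table: one row per (student, course) pair resolves the many-to-many
# relationship; EnrollmentDate is an attribute of the relationship itself.
enrollment_bridge = [
    ("S1", "C1", "2023-09-01"),  # (StudentID, CourseID, EnrollmentDate)
    ("S1", "C2", "2023-09-01"),
    ("S2", "C1", "2023-09-02"),
]

# Joining both dimensions through the bridge answers
# "which students are enrolled in which courses".
pairs = sorted((students[s], courses[c]) for s, c, _ in enrollment_bridge)
```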
Table 12 Example bridge fact table
Table 13 Key differences between factless fact tables and bridge fact tables
5.2.4.1.1 Definition
A multi-valued fact table is designed to capture and represent multiple related
attributes or values for a single business event, transaction, or occurrence.
These attributes are often non-numeric and provide additional context or
dimensions to the event.
5.2.4.1.2 Use Case
Multi-valued fact tables are used in situations where a single event can be
associated with multiple attributes or values that don't fit into a traditional
measure-based fact table. Some common use cases include:
n Categorization: Capturing multiple categories or labels associated with an
event. For example, categorizing customer complaints by type, severity, and
resolution.
n Tags or Keywords: Storing keywords or tags associated with a document,
article, or product.
n Attributes with Variable Counts: Tracking varying counts of different
attributes. For instance, recording the number of times each type of service
was performed during a maintenance visit.
n Multi-Valued Relationships: Representing relationships between entities.
For example, capturing multiple recipients for a single email.
5.2.4.1.3 Advantages
n Rich Context: Multi-valued fact tables provide richer context and additional
dimensions to business events.
n Flexible Analysis: They enable flexible analysis and reporting on attributes
that don't fit into traditional measures.
5.2.4.1.4 Considerations
n Data Volume: Multi-valued fact tables may increase data volume due to the
inclusion of multiple attributes for each event.
n Complexity: Managing and querying multi-valued attributes may add
complexity to data processing.
5.2.4.1.5 Example
Consider a scenario in an e-commerce system where a customer places an order
for multiple products, each having different attributes like color, size, and
quantity. A traditional fact table might capture the order total and quantities
sold, but a multi-valued fact table would store the following:
OrderID ProductID Color Size Quantity
1001 P102 Blue L 1
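The idea can be sketched in Python: one row per order line with its attribute values, with order-level measures recovered by aggregation. The first row below is illustrative, extending the single example row above:

```python
# Multi-valued fact: one row per (order, product, attribute combination)
# rather than one summary row per order.
order_lines = [
    # (OrderID, ProductID, Color, Size, Quantity)
    (1001, "P101", "Red", "M", 2),   # illustrative extra line
    (1001, "P102", "Blue", "L", 1),
]

# Order-level measures are recovered by aggregating the attribute rows.
total_quantity = sum(qty for order, _, _, _, qty in order_lines if order == 1001)
```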
Group: Aggregation & Analysis. Type: Cumulative Fact Table. Summary: Running
Totals, Progressive Analysis. When to use:
n You need to track cumulative or running total values over time.
n You are analyzing progressive measures.
n You are prepared to manage data volume growth and maintenance complexity.
Group: Specialized Fact Tables. Type: Multi-Valued Fact Table. Summary:
Multiple Attributes, Rich Context. When to use:
n You are capturing multiple related attributes or values for a single
business event, transaction, or occurrence.
n You require rich context and additional dimensions for your data.
5.3 DIMENSION TYPES
There is a multitude of different dimension types, each used for different
purposes and use cases. For ease of reading, we have grouped them by
similarities in purpose and characteristics.
5.3.1.1.1 Description
A role-playing dimension is a single dimension table that is used multiple times
in a fact table, each time representing a different perspective or "role." Each
role of the dimension represents a different attribute or set of attributes within
the same dimension table. For example, a date dimension could be used for
both "Order Date" and "Shipping Date" in a sales scenario.
5.3.1.1.3 Example
Date Key Year Month Day
20230101 2023 01 01
20230102 2023 01 02
In a retail business, the "Date" dimension is reused in both the "Sales" and
"Returns" fact tables. In the "Sales" fact table, it's used as "Order Date," while
in the "Returns" fact table, it's the "Return Date." This allows separate analysis
of sales and returns using the same dimension.
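The mechanics can be sketched in Python: one date dimension, referenced twice by the same fact row under different roles (keys and attribute names are illustrative):

```python
# One date dimension table (DateKey -> attributes), as in the example above.
date_dim = {
    20230101: {"Year": 2023, "Month": 1, "Day": 1},
    20230105: {"Year": 2023, "Month": 1, "Day": 5},
}

# A fact row references the same dimension twice, once per role.
sale = {"OrderDateKey": 20230101, "ShipDateKey": 20230105}

order_month = date_dim[sale["OrderDateKey"]]["Month"]  # "Order Date" role
ship_day = date_dim[sale["ShipDateKey"]]["Day"]        # "Shipping Date" role
```

The single physical table serves both perspectives, which is exactly what avoids duplicating dimension tables.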
Best Practice 13 - When to use role playing dimensions
n When you need to analyze a fact table from different temporal
perspectives, each associated with a different attribute of the
same dimension. This allows for consistent and meaningful
analysis without the need to duplicate dimension tables.
5.3.1.2.1 Definition
A fixed depth hierarchy refers to a hierarchical structure within a dimension
where the number of levels in the hierarchy is predetermined and remains
consistent. Each level represents a specific attribute of the dimension, and the
hierarchy is defined with a fixed number of levels. Fixed-depth hierarchies are
often used for organized and standardized analysis, allowing users to drill down
or roll up within the defined levels. Note that there is a section on
hierarchies, including non-fixed-depth hierarchies, later in this document
(chapter 10).
5.3.1.2.3 Example
Country State City
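Rolling up within the fixed levels is a simple grouping operation. A minimal sketch in Python with illustrative geography and sales figures:

```python
# Fixed-depth hierarchy: every row has exactly the Country > State > City levels.
geo_dim = [
    ("USA", "California", "San Francisco"),
    ("USA", "California", "Los Angeles"),
    ("USA", "New York", "New York City"),
]
city_sales = {"San Francisco": 10.0, "Los Angeles": 20.0, "New York City": 30.0}

# Rolling up one level: sum city-level sales to the State level.
state_sales = {}
for country, state, city in geo_dim:
    state_sales[state] = state_sales.get(state, 0.0) + city_sales[city]
```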
Best Practice 14 - When to use fixed-depth hierarchy
dimensions
n When you have a well-defined and consistent structure for
hierarchical attributes. They provide an organized way to
navigate and analyze dimension data at different levels of
granularity.
Variety of Data: role-playing dimensions may hold distinct data for each
role's perspective, while fixed-depth hierarchies focus on organizing data
within a predefined hierarchy.
Analytical Flexibility: role-playing dimensions allow different analyses of
the same dimension from multiple temporal viewpoints, while fixed-depth
hierarchies provide consistent analysis through organized hierarchical
navigation.
Table 18 Key differences between role-playing dimensions and fixed depth hierarchies
same dimension, with the same attributes and values, is used consistently
across different areas of analysis. Conformed dimensions help maintain data
integrity, simplify data integration, and enable accurate cross-functional
reporting.
5.3.2.1.3 Example
Product ID Product Name Category Manufacturer
A multinational corporation has multiple business units, each with its own sales
data. However, they all share a common "Product" dimension with consistent
attributes like product ID, name, and category. This ensures unified reporting
across the organization.
reduce redundancy by centralizing the storage of attributes that have the same
meaning across different analytical contexts.
5.3.2.2.3 Example
Currency Code Currency Name Exchange Rate
Aspect: Application. Conformed Dimension: used for consistent analysis and
reporting. Universal Dimension: used for standard measurements or
classifications.
5.3.3.1.1 Definition
A degenerate dimension refers to attributes from a dimension that are stored
directly within the fact table, rather than being represented in a separate
dimension table. These attributes are often identifiers or codes associated with
a specific transaction and have no relevance outside the context of that
transaction.
single fact table. They are typically employed to handle transactional data, such
as order numbers, invoice numbers, or receipt numbers.
5.3.3.1.3 Example
Sales ID Customer ID Invoice Number Quantity Unit Price Total Price
1. Enhanced Analysis: Transaction dimensions allow analysts to perform more
detailed and context-aware analysis of individual transactions. This can be
particularly valuable when examining the factors that influence specific
events, such as purchases, orders, or service requests.
2. Temporal Insights: When dealing with time-series data, transaction
dimensions enable the inclusion of attributes like transaction date, time of
day, or even user-specific information, providing a comprehensive temporal
context to each transaction.
3. Fine-Grained Reporting: Organizations often require in-depth,
transaction-level reporting for auditing, regulatory compliance, or quality
control purposes. Transaction dimensions facilitate the creation of such
reports by capturing relevant details.
4. Behavioral Analysis: Transaction dimensions enable the study of customer
behavior, such as identifying patterns in purchasing habits, tracking the
evolution of user preferences, or understanding the sequence of events
leading to specific outcomes.
5.3.3.2.3 Example
Let's illustrate the concept of transaction dimensions with an example from a
retail business:
Consider a large online retailer that wants to analyze its e-commerce
transactions. The fact table records individual sales transactions, including
information about products sold, customers, order dates, and quantities. To gain
a deeper understanding of these transactions, the retailer decides to implement
a transaction dimension.
Here's what the transaction dimension might look like:
depth. For instance, they can determine the most popular payment methods
during certain times of the day, assess customer preferences for shipping
options, or conduct temporal analysis of transaction volumes.
5.3.3.3.3 Example
Category ID Category Name Description
A "Product Category" mini dimension is extracted from the "Product" dimension,
including attributes for analyzing sales at the broader category level.
5.3.3.4.3 Example
Promotion Flag ID Summer Sale Black Friday Discount Applied
FLAG001 Y N N
FLAG002 N Y Y
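A junk dimension can simply enumerate every combination of its low-cardinality flags. A minimal sketch in Python; the flag names follow the example above, but the key numbering is illustrative (it enumerates all combinations rather than matching the example rows):

```python
from itertools import product

# Three low-cardinality Y/N flags; the junk dimension enumerates all
# combinations so the fact table stores one small key instead of three columns.
flag_names = ["Summer Sale", "Black Friday", "Discount Applied"]
junk_dim = {
    f"FLAG{i:03d}": dict(zip(flag_names, combo))
    for i, combo in enumerate(product("YN", repeat=len(flag_names)), start=1)
}
# 2^3 = 8 rows cover every possible flag combination.
```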
Best Practice 21 - When to use junk dimensions
n Use to reduce the complexity of the schema by consolidating
multiple low-cardinality flags or indicators into a single
dimension table. Simplifies schema design and reduces the
number of dimension tables.
5.3.3.5.1 Definition
A shrunken dimension refers to a subset of attributes from a larger dimension
that are selected for a higher level of summary. Shrunken dimensions are used
to optimize query performance and reduce complexity when analyzing data at a
coarser granularity. Shrunken dimensions are often derived from a larger
dimension to provide a more focused view of data for specific analytical needs.
5.3.3.5.3 Example
Quarter ID Quarter Number Start Date End Date
Best Practice 22 - When to use shrunken dimensions
n Use when you need to perform summary-level analysis and
want to improve query performance by eliminating attributes
that are not relevant at that level.
5.3.3.6.1 Definition
A late-binding dimension allows flexibility in adding new attributes to a
dimension table without altering the schema. Late-binding dimensions defer the
binding of attributes until they are needed, which can simplify management and
accommodate changes to attributes over time. Note that late-binding
dimensions are discussed in detail in chapter 16.
5.3.3.6.3 Example
Product ID Product Name Attribute 1 Attribute 2 ...
5.3.3.7 Composite dimensions
5.3.3.7.1 Definition
Created by merging attributes from different dimensions into a single dimension
table, reducing the number of dimension tables in the schema.
5.3.3.7.3 Example
Channel ID Channel Type Channel Details
5.3.4.1.1 Definition
Snapshot dimensions are used to capture point-in-time attributes related to a
fact record. These attributes represent the state of a dimension at a specific
moment in time and are associated with the fact record when the event
occurred. Snapshot dimensions are particularly useful for scenarios where
historical changes need to be tracked, such as in data warehousing for
regulatory compliance or auditing purposes.
5.3.4.1.3 Example
Customer ID Account Balance Credit Limit Snapshot Date
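Querying a snapshot dimension is an "as-of" lookup: find the latest snapshot on or before a given date. A minimal sketch in Python matching the columns above (row values and the function name are illustrative):

```python
# Snapshot dimension rows: (CustomerID, AccountBalance, CreditLimit, SnapshotDate)
snapshots = [
    ("C1", 1000.0, 5000.0, "2023-01-31"),
    ("C1", 1200.0, 5000.0, "2023-02-28"),
    ("C1", 900.0, 6000.0, "2023-03-31"),
]

def as_of(customer_id, date, rows):
    """Latest snapshot on or before `date` (ISO strings sort chronologically)."""
    candidates = [r for r in rows if r[0] == customer_id and r[3] <= date]
    return max(candidates, key=lambda r: r[3])

# Balance as a fact dated 2023-03-15 would have seen it:
# the February snapshot applies.
balance = as_of("C1", "2023-03-15", snapshots)[1]
```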
5.3.4.2.1 Definition
Custom dimensions are dimension tables that are tailored to specific business
needs and are not predefined. They are designed to accommodate specialized
data elements or unique analytical requirements that are not adequately
covered by standard dimensions. Custom dimensions offer flexibility regarding
attribute selection and structure, allowing organizations to capture domain-
specific information.
5.3.4.2.2 Use Case
Custom dimensions are used when standard dimensions do not fully capture the
nuances of a business scenario or when you need to analyze attributes that are
specific to your organization's operations. For example, a custom dimension
could be created to track project-specific attributes in a consulting firm's data
warehouse.
5.3.4.2.3 Example
Patient ID Diagnosis Code Diagnosis Description Treatment
instance, you could create a derived dimension to analyze profit margins by
subtracting cost-related attributes from revenue-related attributes.
5.3.4.3.3 Example
Product ID Profit Margin Net Profit
Use Case: snapshot dimensions maintain historical context for fact records;
custom dimensions address unique analytical or domain scenarios; derived
dimensions analyze data from new perspectives.
Table 32 Differences between snapshot, custom and derived dimensions
Group: Hierarchical and Organizational Consistency. Type: Role-Playing
Dimensions. Summary: Perspective Switching. When to use:
n When you need to analyze a fact table from different temporal perspectives,
each associated with a different attribute of the same dimension. This allows
for consistent and meaningful analysis without the need to duplicate
dimension tables.
Group: Data Consistency and Reusability. Type: Conformed Dimensions. Summary:
Data Integrity, Cross-Functional. When to use:
n When you have multiple fact tables that need to share the same dimension
for consistent analysis and reporting. Ensures data integrity, simplifies
integration, and enables accurate cross-functional reporting.
Group Type Summary When to use
Customized and Snapshot Historical n Use when you need to maintain historical
Specialized Dimensions Context, Point- context for fact records by capturing
in-Time attributes as they were at specific points
in time.
2. What is a key advantage of a Snapshot Fact Table?
a. Supports historical reporting and trend identification
c. Monitors progress or state changes over intervals
a. When you need to capture individual business transactions
b. When you want to optimize query performance through partitioning
c. When you need to perform trend analysis over specific time periods
d. When you want to monitor progress or state changes over intervals
a. When you need to enhance query response time for summary-level reporting
b. When you want to store cumulative or running total values over time
c. When you need to store calculated measures derived from other fact tables
or external sources
d. When you are willing to accept potential data loss due to aggregation
8. When is a Bridge Table typically used?
a. Tracking total revenue over time
c. Capturing data at specific points in time
11. Which type of fact table would be most suitable for capturing individual
business transactions or events at a granular level?
12. If your primary focus is on historical reporting and trend analysis, which type
of fact table should you choose?
13. When dealing with complex relationships that require detailed analysis of
many-to-many associations, which type of fact table is most appropriate?
a. When you need to organize dimension c. When you have a well-defined and
data in a hierarchical arrangement. consistent structure for hierarchical
attributes.
ThoughtSpot Data Security, 1st Edition | Spot-On Data Modelling | A ThoughtSpot Field Guide | Page 77
b. When you want to analyze a fact table d. When you need to track events or
from different temporal perspectives, occurrences without measures.
using the same dimension.
a. They allow for different analyses of the c. They involve using a single dimension
same dimension from multiple temporal table for different perspectives.
viewpoints.
b. When you have multiple fact tables d. When you want to track data elements
that need to share the same dimension that are common and consistent within a
for consistent analysis and reporting. specific data warehouse.
ThoughtSpot Data Security, 1st Edition | Spot-On Data Modelling | A ThoughtSpot Field Guide | Page 78
a. To centralize the storage of attributes c. To capture additional details related to
with the same meaning across different a specific transaction within the fact table.
analytical contexts.
a. When you need to consolidate multiple c. When you need to capture additional
low-cardinality flags or indicators into a details related to a specific transaction
single dimension table. within the fact table.
22. What is the primary use case for a junk dimension in dimensional modeling?
ThoughtSpot Data Security, 1st Edition | Spot-On Data Modelling | A ThoughtSpot Field Guide | Page 79
a. When you have high cardinality c. When you need to capture additional
attributes in a dimension, and you want details related to a specific transaction
to optimize performance for specific within the fact table.
queries without sacrificing important
attributes.
a. When you need to capture additional c. When you have high cardinality
details related to a specific transaction attributes in a dimension that lead to
within the fact table. increased storage requirements and query
complexity.
b. When you want to maintain historical d. When you want to explore data from a
context for fact records by capturing different angle by creating attributes that
attributes as they were at specific points are not directly stored in the source data.
in time.
a. When you need to capture additional c. When you need to perform summary-
details related to a specific transaction level analysis and want to improve query
within the fact table. performance by eliminating attributes that
are not relevant at that level.
b. When there's a need for agility in d. When you want to explore data from a
handling evolving or rapidly changing different angle by creating attributes that
data. are not directly stored in the source data.
ThoughtSpot Data Security, 1st Edition | Spot-On Data Modelling | A ThoughtSpot Field Guide | Page 80
a. To simplify schema design by c. To create new attributes by performing
combining attributes from different calculations or transformations on existing
dimensions with shared characteristics, data.
reducing the number of dimension tables.
a. When you have high cardinality c. When you want to maintain historical
attributes in a dimension that lead to context for fact records by capturing
increased storage requirements and query attributes as they were at specific points
complexity. in time.
Answers:
1:C, 2:A, 3:C, 4:B, 5:B, 6:C, 7:B, 8:C, 9:A, 10:B, 11:B, 12:C, 13:A, 14:B, 15:B,
16:B, 17:B, 18:C, 19:B, 20:D, 21:B, 22:A, 23:B, 24:B, 25:B, 26:C, 27:A, 28:B
6 Effective design patterns and best
practices for overcoming common
challenges in dimensional models
A dimensional model is a data modeling technique used in data warehousing and
business intelligence; its goal is to make data accessible and easily
understandable for reporting and analysis. Even so, some elements are more
challenging to model than others. In no particular order, here are some of the
most difficult things to model in a dimensional model:
n Chasm traps and fan traps (Chapter 7). Chasm traps and fan traps can
create ambiguous data interpretations and/or can cause overcounting and
inaccuracies, but they can also be a helpful design pattern.
n Join cardinality (Chapter 8): Addressing join cardinality in dimensional
models is crucial to optimize query performance, maintain data accuracy, and
simplify queries. Incorrect handling of join cardinality can lead to slow
performance, data anomalies, and inconsistent results. Ignoring join
cardinality may cause data redundancy, hinder scalability, and make model
maintenance challenging.
n Outer joins (Chapter 9): In dimensional models, outer joins can cause data
accuracy and query performance issues when joining fact and dimension
tables with missing or null values, leading to incomplete or incorrect results,
data duplication, and hindered query optimization.
n Hierarchies (Chapter 10): Representing hierarchical relationships between
dimensions can be complex. For example, dealing with a product hierarchy
where products are categorized into various levels like category,
subcategory, and individual product can be challenging to model efficiently.
n Slowly changing dimensions (SCDs) (Chapter 11): Managing slowly
changing dimensions, which are dimensions that change over time, can be
difficult. There are different types of SCDs, and choosing the appropriate
strategy (Type 1, Type 2, Type 3, etc.) requires careful consideration.
n Granularity, denormalization and mixed grain facts (Chapter 12):
Granularity choice is critical, as overly detailed data can lead to inefficiencies,
while aggregated data may lack the necessary detail. Denormalization
streamlines queries but can lead to redundancy and increased storage. Mixed
grain facts pose a challenge when measures of varying detail levels coexist,
potentially causing inconsistencies.
n Working with less structured data (Chapter 13): Mastering the
management of flexible data structures is crucial in today's data landscape.
Organizations seeking deeper insights must effectively utilize key-value
pairs, unstructured, and semi-structured data. Techniques such as pivoting
and structured representation are invaluable in converting these data formats
into ones compatible with ThoughtSpot.
n Date dimension and when you need one (Chapter 14): This chapter
underscores the significance of the Date Dimension as a foundational
framework for handling time-related data. It streamlines time interpretation,
enabling navigation through different periods, historical comparisons, and
trend identification. ThoughtSpot's keywords enhance querying efficiency for
time intervals, reducing the need for a dedicated Date Dimension table in
many cases. However, in advanced scenarios, such as the 'look back
measures' design pattern, the Date Dimension table remains indispensable,
particularly in the retail industry.
n Currency conversion (Chapter 15): If dealing with data from multiple
countries with different currencies, handling currency conversion in the
dimensional model can be challenging. It requires maintaining historical
exchange rates and implementing proper calculations for accurate reporting.
n Late-binding attributes (Chapter 16): Attributes that are bound to the
model late, rather than fixed at design time, involve complex dependencies
on other data. Incorporating these into the dimensional model requires
careful consideration of the underlying logic.
n Non-additive, semi-additive and derived measures (Chapter 17):
Calculated or derived measures, such as profit margin or growth rate, involve
complex expressions and dependencies on other measures. Some measures,
like ratios or percentages, cannot be aggregated at all and certain measures,
such as account balances or inventory levels, cannot be simply aggregated
using standard summation techniques. Modeling these types of measures
while maintaining their integrity can be challenging.
Overcoming these challenges requires a deep understanding of the data, the
business requirements, and the trade-offs between complexity and performance
in the dimensional model design. In the following chapters we will describe the
challenges and our best practices in these scenarios.
7 Bridging chasms, fanning insights: Chasm
and fan traps unveiled
7.1 INTRODUCTION
In the domain of dimensional models, the emergence of chasm traps and fan
traps is not uncommon. In this section, we will
delve into the nature of these traps, the challenges
they present, the scenarios where they become
problematic, and the potential benefits they offer.
Moreover, we will explore various manifestations of
chasm traps, including regular, nested, and chained
variations, as well as bridge tables and fan traps.
Lastly, we will examine how ThoughtSpot
addresses these challenges and even employs them
for specific use cases.
These fact tables share dimensions such as "Product," which contains details
about products; "Date," capturing calendar dates; and "Location," featuring
geographical attributes. Due to the common dimensions shared by both fact
tables, a chasm trap arises.
Figure 13 Standard chasm trap/bridge table
Figure 14 Nested chasm trap
Figure 15 Chained chasm trap
(Diagram: Production Fact Table joined to the Date, Product, and Location
dimensions.)
Now, consider the scenario where you aim to generate a report displaying the
overall quantity produced and sold for every product at each location within a
defined timeframe. Because there is no direct linkage between the "Production"
and "Sales" fact tables, attempting a straightforward join with the common
dimensions (Date, Product, Location) would inevitably yield inaccurate
outcomes.
Without addressing the chasm trap, in SQL, the query might look like this:
SELECT
p.Product_Name,
l.Location_Name,
SUM(pr.Quantity_Produced) AS Total_Produced,
SUM(s.Quantity_Sold) AS Total_Sold
FROM Production_Fact pr
JOIN Date_Dimension d ON pr.Date_Key = d.Date_Key
JOIN Product_Dimension p ON pr.Product_Key = p.Product_Key
JOIN Location_Dimension l ON pr.Location_Key = l.Location_Key
JOIN Sales_Fact s ON s.Date_Key = d.Date_Key
AND s.Product_Key = p.Product_Key
AND s.Location_Key = l.Location_Key
GROUP BY
p.Product_Name,
l.Location_Name;
The above query would likely produce incorrect results due to over-counting and
data inconsistencies, as the "Production" and "Sales" fact tables are not directly
related to each other. Let's execute this query¹:
¹ To simulate this, we have created a SQL view in ThoughtSpot to make sure the
SQL is exactly as in the SQL code described above.
Figure 22 Executing the wrong query causing overcounting
In the incorrect result, we can see that the quantities for Widget B are
overcounted due to the chasm trap: Widget B was sold only 10 times, yet the
result shows 20. We also see that Widget D is not reported at all, because it
was never sold.
The correct way of dealing with a chasm trap in SQL is to split up the process:
collect the sales data in one subquery and the production data in a second,
then merge those results together.
WITH
/*
** Query 0: Collects Production Data
*/
"qt_0" AS (
SELECT
"ta_1"."PRODUCT_NAME" "ca_1",
CASE
WHEN sum("ta_2"."QUANTITY_PRODUCED") IS NOT NULL THEN
sum("ta_2"."QUANTITY_PRODUCED")
ELSE 0
END "ca_2"
FROM
"DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."PRODUCTION_FACT" "ta_2"
JOIN
"DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."PRODUCT_DIMENSION"
"ta_1"
ON "ta_2"."PRODUCT_KEY" = "ta_1"."PRODUCT_KEY"
GROUP BY "ca_1"
),
/*
** Query 1: Collects Sales Data
*/
"qt_1" AS (
SELECT
"ta_3"."PRODUCT_NAME" "ca_3",
CASE
WHEN sum("ta_4"."QUANTITY_SOLD") IS NOT NULL THEN
sum("ta_4"."QUANTITY_SOLD")
ELSE 0
END "ca_4"
FROM "DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."SALES_FACT"
"ta_4"
JOIN
"DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."PRODUCT_DIMENSION"
"ta_3"
ON "ta_4"."PRODUCT_KEY" = "ta_3"."PRODUCT_KEY"
GROUP BY "ca_3"
)
/*
** Final Query: Merge Results
*/
SELECT
CASE
WHEN "ta_5"."ca_1" IS NOT NULL THEN "ta_5"."ca_1"
ELSE "ta_6"."ca_3"
END "ca_5",
CASE
WHEN "ta_5"."ca_2" IS NOT NULL THEN "ta_5"."ca_2"
ELSE 0
END "ca_6",
CASE
WHEN "ta_6"."ca_4" IS NOT NULL THEN "ta_6"."ca_4"
ELSE 0
END "ca_7"
FROM "qt_0" "ta_5"
FULL OUTER JOIN "qt_1" "ta_6"
ON (EQUAL_NULL("ta_5"."ca_1","ta_6"."ca_3"))
And when we execute this query we see we get all the correct results:
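The mechanics of the trap and its fix can also be reproduced with a small, self-contained sketch using Python's sqlite3 module. The data below is invented purely to mirror the Widget B and Widget D symptoms described above; the table and column names follow the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE production_fact (product_name TEXT, quantity_produced INT);
    CREATE TABLE sales_fact      (product_name TEXT, quantity_sold INT);
    -- Illustrative data: Widget B has two production runs but one sale of 10.
    INSERT INTO production_fact VALUES ('Widget B', 6), ('Widget B', 6), ('Widget D', 4);
    INSERT INTO sales_fact      VALUES ('Widget B', 10);
""")

# Naive join across the chasm trap: the single sales row is repeated once per
# matching production row, doubling quantity_sold, and the inner join
# silently drops the never-sold Widget D.
naive = cur.execute("""
    SELECT p.product_name, SUM(p.quantity_produced), SUM(s.quantity_sold)
    FROM production_fact p
    JOIN sales_fact s ON s.product_name = p.product_name
    GROUP BY p.product_name
""").fetchall()

# Correct approach: aggregate each fact table in its own subquery first,
# then merge the pre-aggregated results.  (SQLite 3.39+ also supports
# FULL OUTER JOIN; a LEFT JOIN suffices here because every sold product
# was also produced.)
correct = cur.execute("""
    WITH prod AS (SELECT product_name, SUM(quantity_produced) AS produced
                  FROM production_fact GROUP BY product_name),
         sold AS (SELECT product_name, SUM(quantity_sold) AS sold
                  FROM sales_fact GROUP BY product_name)
    SELECT prod.product_name, prod.produced, COALESCE(sold.sold, 0)
    FROM prod LEFT JOIN sold ON sold.product_name = prod.product_name
    ORDER BY prod.product_name
""").fetchall()

print(naive)    # [('Widget B', 12, 20)] -- sold is overcounted, Widget D lost
print(correct)  # [('Widget B', 12, 10), ('Widget D', 4, 0)]
```

ThoughtSpot generates this shape of query automatically; the sketch only makes the mechanics visible.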
7.3.2 Fan Traps
Let's populate the illustrative model from Figure 16 with some test data:
CUSTOMER
CUST_ID CUST_NAME
100 Ethan
101 Olivia
102 Liam
103 Ava
ORDER
ORDER_ID CUST_ID ORDER_TOTAL
1 100 1100
2 101 1300
3 100 1400
4 102 1200
ORDER_DETAIL
ORDER_ID PROD_ID QTY
1 3 10
1 2 11
2 4 13
2 5 12
3 4 12
4 5 10
Executing a single SQL query that spans the fan trap causes overcounting.
SELECT c.CUST_NAME, od.PROD_ID, o.ORDER_ID, SUM(o.ORDER_TOTAL),
SUM(od.QTY)
FROM CUSTOMER AS c
INNER JOIN "ORDER" AS o ON c.CUST_ID = o.CUST_ID
INNER JOIN ORDER_DETAIL AS od on o.ORDER_ID = od.ORDER_ID
GROUP BY c.CUST_NAME, od.PROD_ID, o.ORDER_ID;
This is clear in Figure 27, where overcounting happens. The cumulative value
tallies to 7,400, despite the knowledge that the overall order total (achieved by
summing all values in the order table) amounts to merely 5,000. The underlying
cause for this anomaly lies in the presence of two entries each for Ethan's initial
order (Order 1) and Olivia's order (Order 2) in the order detail table. The fan
trap exacerbates this duplication, leading to the inflated value of 7,400 (5,000
+ 1,100 + 1,300).
To query a fan trap correctly, split the SQL into two parts, just as when
addressing a chasm trap, and then consolidate the results.
WITH
"qt_1" AS (
SELECT
"ta_3"."CUST_ID" "ca_4",
"ta_3"."CUST_NAME" "ca_5",
CASE
WHEN sum("ta_4"."QTY") IS NOT NULL THEN sum("ta_4"."QTY")
ELSE 0
END "ca_6"
FROM "DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."ORDER_DETAIL"
"ta_4"
JOIN "DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."ORDER"
"MTA_0"
ON "ta_4"."ORDER_ID" = "MTA_0"."ORDER_ID"
JOIN "DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."CUSTOMER"
"ta_3"
ON "MTA_0"."CUST_ID" = "ta_3"."CUST_ID"
GROUP BY
"ca_4",
"ca_5"
),
"qt_0" AS (
SELECT
"ta_1"."CUST_ID" "ca_1",
"ta_1"."CUST_NAME" "ca_2",
CASE
WHEN sum("ta_2"."ORDER_TOTAL") IS NOT NULL THEN
sum("ta_2"."ORDER_TOTAL")
ELSE 0
END "ca_3"
FROM "DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."ORDER" "ta_2"
JOIN "DATAMODELLING_FIELD_GUIDE"."CHASM_FAN_TRAPS"."CUSTOMER"
"ta_1"
ON "ta_2"."CUST_ID" = "ta_1"."CUST_ID"
GROUP BY
"ca_1",
"ca_2"
)
SELECT
"ta_5"."ca_4" "ca_7",
"ta_5"."ca_5" "ca_8",
CASE
WHEN "ta_6"."ca_3" IS NOT NULL THEN "ta_6"."ca_3"
ELSE 0
END "ca_9",
CASE
WHEN "ta_5"."ca_6" IS NOT NULL THEN "ta_5"."ca_6"
ELSE 0
END "ca_10"
FROM "qt_1" "ta_5"
LEFT OUTER JOIN "qt_0" "ta_6"
ON (
(EQUAL_NULL("ta_5"."ca_4","ta_6"."ca_1"))
AND (EQUAL_NULL("ta_5"."ca_5","ta_6"."ca_2"))
)
The results can be seen below:
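Using the exact test data from this section, a short sqlite3 sketch confirms both the inflated total of 7,400 and the corrected total of 5,000:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE customer     (cust_id INT, cust_name TEXT);
    CREATE TABLE "order"      (order_id INT, cust_id INT, order_total INT);
    CREATE TABLE order_detail (order_id INT, prod_id INT, qty INT);
    INSERT INTO customer     VALUES (100,'Ethan'),(101,'Olivia'),(102,'Liam'),(103,'Ava');
    INSERT INTO "order"      VALUES (1,100,1100),(2,101,1300),(3,100,1400),(4,102,1200);
    INSERT INTO order_detail VALUES (1,3,10),(1,2,11),(2,4,13),(2,5,12),(3,4,12),(4,5,10);
""")

# Single query spanning the fan trap: each order total is repeated once per
# order_detail row, so orders 1 and 2 (two detail rows each) count twice.
naive_total = cur.execute("""
    SELECT SUM(o.order_total)
    FROM customer c
    JOIN "order" o       ON o.cust_id   = c.cust_id
    JOIN order_detail od ON od.order_id = o.order_id
""").fetchone()[0]

# Correct approach: aggregate order totals at the order grain first
# (detail quantities would be aggregated in a separate subquery).
correct_total = cur.execute("""
    SELECT SUM(t) FROM (
        SELECT SUM(order_total) AS t FROM "order" GROUP BY cust_id
    )
""").fetchone()[0]

print(naive_total, correct_total)  # 7400 5000
```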
Effectively managing chasm traps and fan traps requires a thoughtful approach.
These challenges can introduce complexities into data analysis and reporting,
but there are strategies to overcome them.
When dealing with chasm traps, it's crucial to recognize the potential for
misinterpretation and performance degradation. Subqueries provide a useful
solution to address these issues. By separating the data retrieval process into
distinct subqueries for each dimension with the fact, the risk of overcounting is
mitigated. Merging the outcomes of these subqueries ensures that the correct
total quantities are reported.
Similarly, in the case of fan traps, the SQL approach of dividing the query into
two parts is effective in addressing the complexities introduced by the many-to-
many-to-one relationship. By executing this query in steps, the risks of
overcounting and data distortion are mitigated. Consolidating the results of
these subqueries allows for accurate analysis, ensuring reliable insights for
decision-making.
In ThoughtSpot, the handling of chasm traps and fan traps is a seamless and
automated process. ThoughtSpot's inherent intelligence recognizes the
presence of these traps and employs the appropriate SQL logic to address them.
This automation is a significant advantage for users, as it eliminates the need
for manual intervention and reduces the likelihood of errors in query design.
Moreover, ThoughtSpot's ability to handle these traps seamlessly offers users
the opportunity to focus on the insights and analysis rather than getting bogged
down by intricate data modeling challenges. This capability not only enhances
user experience but also accelerates the decision-making process by providing
accurate and reliable results.
In conclusion, the strategic handling of chasm traps and fan traps is essential
for maintaining the accuracy and reliability of data analysis within dimensional
models. Whether through subqueries and result consolidation or ThoughtSpot's
automated mechanisms, the ultimate goal is to provide users with the tools they
need to derive meaningful insights and make informed decisions based on
accurate data.
8 Mastering dimensional relationships: Join
cardinality, role playing dimensions and
join paths
8.1 INTRODUCTION
but each row in the second table can only be associated with one row in the
first table. This is a common relationship type and can be effectively managed
in dimensional models.
n Many-to-many (N:N) relationship: In a dimensional model, entities are
represented as dimensions and facts. Dimensions contain descriptive
attributes, while facts store numeric measures. Many-to-many joins are
generally avoided in dimensional models, as they can lead to data
redundancy and complicate query performance. However, in some cases,
they are unavoidable due to the nature of the data.
Let's use the example of a data warehouse for a university that stores
information about students and courses. In this scenario, students can enroll
in multiple courses, and each course can have multiple students enrolled.
This creates a many-to-many relationship between the “Student” and
“Course” tables. Resolving this relationship appropriately is essential to
maintain the integrity and efficiency of the data warehouse.
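Such a relationship is typically resolved with a bridge (associative) table that turns the N:N relationship into two 1:N joins. A minimal sketch, with invented student and course rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE students    (student_id INT PRIMARY KEY, student_name TEXT);
    CREATE TABLE courses     (course_id TEXT PRIMARY KEY, course_name TEXT);
    -- Bridge table: one row per student/course combination.
    CREATE TABLE enrollments (student_id INT, course_id TEXT);
    INSERT INTO students VALUES (101,'Ada'),(102,'Ben');
    INSERT INTO courses  VALUES ('CSCI101','Intro CS'),('MATH201','Calculus');
    INSERT INTO enrollments VALUES (101,'CSCI101'),(101,'MATH201'),(102,'CSCI101');
""")

# The many-to-many relationship is now two clean one-to-many joins
# through the bridge table.
pairs = cur.execute("""
    SELECT s.student_name, c.course_name
    FROM enrollments e
    JOIN students s ON s.student_id = e.student_id
    JOIN courses  c ON c.course_id  = e.course_id
    ORDER BY s.student_name, c.course_name
""").fetchall()
print(pairs)  # [('Ada', 'Calculus'), ('Ada', 'Intro CS'), ('Ben', 'Intro CS')]
```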
Best Practice 31 - Importance of join direction
Optimize Join Direction: Understand that join direction influences
query performance and should typically be from the fact table
(many side) to the dimension (one side).
8.2.4 Multiple join paths
In the preceding section, as we explored the concept of role-playing
dimensions, we clarified that multiple join paths need not necessarily be
problematic and can indeed offer benefits. Nonetheless, there are instances
where they might introduce potential complications, as ThoughtSpot must opt
for a single path during searches. Consider the following scenario involving an
employee table, an account table, and a branch table (Figure 30 Multiple Join
Paths). To address this, you essentially have two choices:
n Evaluate the necessity of the join and potentially eliminate it (See Figure 31).
n In situations where multiple join paths lead to confusion or impose
constraints related to chasm traps, consider duplicating the table to alleviate
the issue (See Figure 32).
might be better to keep them separate. However, if the attributes are related
and often queried together, combining them can reduce redundancy and
simplify queries.
Enrollments
ENROLLMENT_ID STUDENT_ID COURSE_ID
1 101 CSCI101
2 101 MATH201
3 102 CSCI101
4 103 MATH201
8.3.2.2 Natural keys vs surrogate keys
To improve performance and maintain data integrity, it's beneficial to use
surrogate keys in bridge tables. Surrogate keys are system-generated unique
identifiers that have no inherent meaning and help avoid complex composite
keys. By implementing surrogate keys in the bridge table, it becomes easier to
manage relationships and track changes over time.
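A minimal sketch of the idea, using SQLite's auto-generated integer keys as stand-in surrogate keys (the enrollment columns are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# A system-generated surrogate key replaces the composite natural key
# (student_id, course_id, ...) of the bridge table.
cur.executescript("""
    CREATE TABLE enrollments (
        enrollment_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        student_id    INT,
        course_id     TEXT,
        enrolled_on   TEXT
    );
    INSERT INTO enrollments (student_id, course_id, enrolled_on)
    VALUES (101,'CSCI101','2024-01-10'),
           (101,'CSCI101','2024-09-02');  -- a re-enrollment
""")

# The same natural combination occurs twice, yet each row is still
# uniquely addressable by its surrogate key.
keys = [r[0] for r in cur.execute(
    "SELECT enrollment_sk FROM enrollments ORDER BY enrollment_sk")]
print(keys)  # [1, 2]
```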
8.3.2.3 Denormalization
Although denormalization is not the ideal solution for most cases in a
dimensional model, it can be considered when dealing with complex many-to-
many relationships. By denormalizing certain attributes from the bridge table
into the fact table, you can reduce the need for joining multiple tables during
queries, thus improving query performance.
Incorporate denormalization by including relevant attributes from the
"Students" and "Courses" dimensions directly into the "Enrollments" fact table.
This step reduces the need for joins during query execution.
Enrollments
8.3.2.4 Aggregations and summarizations
When dealing with large datasets, many-to-many joins can slow down query
performance. To address this, consider pre-aggregating and summarizing the
data. Materialized views or aggregated tables can be employed to store pre-
calculated results, allowing for faster query responses, and reducing the need
for complex joins.
In the case of our example:
To improve query performance, create aggregated tables that pre-calculate
summary information from the "Enrollments" fact table. For example, you could
have an aggregated table showing the number of students enrolled in each
course.
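Using the enrollment rows from the example above, a small sqlite3 sketch builds such an aggregated table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE enrollments (enrollment_id INT, student_id INT, course_id TEXT);
    INSERT INTO enrollments VALUES
        (1,101,'CSCI101'),(2,101,'MATH201'),(3,102,'CSCI101'),(4,103,'MATH201');
    -- Pre-calculated summary table: number of students enrolled per course.
    CREATE TABLE agg_course_enrollment AS
        SELECT course_id, COUNT(DISTINCT student_id) AS student_count
        FROM enrollments
        GROUP BY course_id;
""")

# Queries at the summary grain now read the small aggregate directly
# instead of re-joining and re-grouping the detail rows.
summary = cur.execute(
    "SELECT course_id, student_count FROM agg_course_enrollment ORDER BY course_id"
).fetchall()
print(summary)  # [('CSCI101', 2), ('MATH201', 2)]
```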
8.3.3 Role playing dimensions
Please refer to Figure 29 and then:
1. Establish multiple joins between the fact and dimension tables in your
database and import them into ThoughtSpot.
2. Include the fact table in a worksheet.
3. Incorporate the same dimension table fields multiple times, each for a
distinct role. When dealing with multiple join paths (e.g., 2), upon adding
dimension fields you'll be prompted to specify the relevant join, as depicted
in Figure 33 (Choosing a join path).
4. Choose the suitable relation/join path for each added field.
5. Assign unique names to each field, such as 'order date' and 'shipping date'.
By employing role-playing dimensions, you avoid the need to duplicate data in
separate dimension tables while still gaining the flexibility to analyze data from
different angles. This approach enhances data integrity, optimizes storage, and
simplifies query design in dimensional modeling.
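The underlying SQL pattern can be sketched briefly: the same physical date dimension is joined twice under different aliases, one per join path (the keys and dates below are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE date_dim    (date_key INT PRIMARY KEY, full_date TEXT);
    CREATE TABLE orders_fact (order_id INT, order_date_key INT, ship_date_key INT);
    INSERT INTO date_dim    VALUES (1,'2024-03-01'),(2,'2024-03-04');
    INSERT INTO orders_fact VALUES (42, 1, 2);
""")

# One physical date_dim plays two roles via two aliased joins,
# each following a different join path from the fact table.
rows = cur.execute("""
    SELECT o.order_id,
           od.full_date AS order_date,
           sd.full_date AS shipping_date
    FROM orders_fact o
    JOIN date_dim od ON od.date_key = o.order_date_key
    JOIN date_dim sd ON sd.date_key = o.ship_date_key
""").fetchall()
print(rows)  # [(42, '2024-03-01', '2024-03-04')]
```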
8.3.4.1 Examples of range joins
n Data Maintenance: Managing and maintaining range-based relationships
can be challenging, particularly as data evolves and new ranges are
introduced.
8.3.4.4 Use case: Modeling techniques for assisting with supply chain
analytics
Retail analysts often grapple with numerous responsibilities, from monitoring
product availability to comparing sales across stores and replenishing low stock
items. In this section, we will explore various modeling techniques and tools
created in ThoughtSpot that aid retail analysts in managing their daily tasks
effectively, by looking at some typical examples.
Figure 35 High level model using the range joins
Figure 36 The week dimension
Figure 38 The joins defined in ThoughtSpot
Figure 39 How the range join looks in the ThoughtSpot UI
Figure 41 Which products are running low on supply?
8.3.4.4.2 Example 2: Compare my sales to nearby store sales
Retail analysts are always comparing their stores to other stores to identify
trends that need attention. Retailers often compare foot traffic and sales with
other retailers to assess their own performance and competitiveness in the
market.
Similarly, analyzing sales data in
relation to other retailers can provide
insights into market trends, customer
preferences, and the effectiveness of
promotional campaigns. Retailers may
compare their sales figures with those of
similar retailers to evaluate their market
share, identify growth opportunities, or
make pricing and inventory decisions.
An analyst not only compares the store to other stores in the state, county, or
ZIP code, but also wants to drill down to a radius to locate stores in the same
neighborhood.
1 Replicated the Sales and Returns using a view joined into a Nearby Store
2 Insert Bridge Tables that allow filtering between my store and nearby stores
3 Store Distance =
  VIEW AS SELECT A.STORE_ID AS STORE_ID,
                 B.STORE_ID AS STORE_NEARBY_ID,
                 HAVERSINE(A.STORE_LATITUDE, A.STORE_LOGITUDE,
                           B.STORE_LATITUDE, B.STORE_LOGITUDE) AS STORE_DISTANCE
  FROM DS_DIM_STORE A, DS_DIM_NEARBY_STORE B;
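The HAVERSINE function used in the view is assumed to be a warehouse built-in returning the great-circle distance between two latitude/longitude pairs (Snowflake, for example, ships such a function, returning kilometres). Its logic can be sketched in plain Python:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# One degree of longitude along the equator is roughly 111.2 km.
d = haversine_km(0.0, 0.0, 0.0, 1.0)
print(round(d, 1))  # 111.2
```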
Figure 45 Query plan for how do my sales compare to store within a 5 mile radius
8.3.4.4.3 Example 3: Locate nearby store to replenish low stock alerts
One of the most critical alerts for retail analysts is the out-of-stock alert. In this
requirement, we explore how analysts can quickly locate inventory at different
levels to replenish out-of-stock conditions efficiently, starting with nearby stores
before resorting to distant distribution centers.
1 Nearby Stores
2 My Store
Range joins represent a valuable tool in the toolkit of data professionals working
with dimensional models, especially when implementing Slowly Changing
Dimensions (SCDs). They provide the means to establish connections between
data sets that would be difficult or impossible to achieve with traditional joins.
By allowing for flexibility and accommodating overlapping data, range joins
enhance the analytical capabilities of dimensional models.
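A common application is joining a fact row to the dimension version whose validity window contains the fact date. A minimal sketch with an invented Type 2 customer dimension (ISO date strings compare correctly as text):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    -- Type 2 SCD: each customer version is valid for a date range.
    CREATE TABLE customer_scd (cust_id INT, tier TEXT, valid_from TEXT, valid_to TEXT);
    CREATE TABLE sales        (cust_id INT, sale_date TEXT, amount INT);
    INSERT INTO customer_scd VALUES
        (1,'Silver','2023-01-01','2023-06-30'),
        (1,'Gold'  ,'2023-07-01','9999-12-31');
    INSERT INTO sales VALUES (1,'2023-03-15',100),(1,'2023-08-01',200);
""")

# Range join: each sale matches the single dimension version whose
# validity window contains the sale date.
rows = cur.execute("""
    SELECT s.sale_date, c.tier, s.amount
    FROM sales s
    JOIN customer_scd c
      ON c.cust_id = s.cust_id
     AND s.sale_date BETWEEN c.valid_from AND c.valid_to
    ORDER BY s.sale_date
""").fetchall()
print(rows)  # [('2023-03-15', 'Silver', 100), ('2023-08-01', 'Gold', 200)]
```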
9 The pitfalls of outer joins
9.1 INTRODUCTION
remain unsold, an outer join applied to these tables would yield an outcome
encompassing all products from the "Product" dimension. If sales data is
available, it is incorporated accordingly. Notably, products lacking sales data, or
unsold products, are also presented in the results. This method allows for a
holistic perspective that embraces both sold and unsold products, enriching the
analytical panorama.
9.2.3 Increased complexity
Outer joins can make queries more complex and harder to maintain. As the
number of dimensions and relationships grows, query complexity escalates,
potentially leading to performance degradation. This complexity also poses
challenges for analysts and developers when understanding and optimizing the
data model.
Figure 48 Increased complexity with multiple outer joins
A search on this data model would generate a SQL statement similar to the one
shown below.
-- Incorrect: Complex query with multiple outer joins
SELECT *
FROM sales
LEFT OUTER JOIN products ON sales.product_id = products.product_id
LEFT OUTER JOIN customers ON sales.customer_id = customers.customer_id
LEFT OUTER JOIN regions ON sales.region_id = regions.region_id
LEFT OUTER JOIN time ON sales.transaction_date = time.date;
9.3 MODELLING OUTER JOINS
completeness and prevent the need for outer joins. These default
members act as placeholders, ensuring that all necessary
dimension attributes have valid values.
For example, in a product dimension, if specific products are not
available, create a default member like "Unknown" or "N/A" to
represent missing product entries.
FACT_SALES          DIM_PRODUCT
1   1  …            1   Banana   …
2  -1  …           -1   Unknown  …
3   2  …            2   Pear     …
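A minimal sketch of the default-member technique: fact rows with no known product are loaded with the default key -1, so a plain inner join keeps every fact row and no outer join is needed:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE dim_product (product_key INT PRIMARY KEY, product_name TEXT);
    CREATE TABLE fact_sales  (sale_id INT, product_key INT);
    INSERT INTO dim_product VALUES (1,'Banana'), (-1,'Unknown'), (2,'Pear');
    -- Sale 2 has no known product, so it is loaded with the default key -1.
    INSERT INTO fact_sales VALUES (1,1),(2,-1),(3,2);
""")

# A plain inner join now returns every fact row, including the sale
# whose product is unknown.
rows = cur.execute("""
    SELECT f.sale_id, p.product_name
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    ORDER BY f.sale_id
""").fetchall()
print(rows)  # [(1, 'Banana'), (2, 'Unknown'), (3, 'Pear')]
```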
Consider a scenario where we want to track customers' visits to
different stores. Instead of using an outer join with a store
dimension, we can create a factless fact table like "store_visits"
with columns "customer_id" and "store_id" to record visits:
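A minimal sketch of such a factless fact table (the visit rows are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    -- Factless fact table: no measure columns, just the event of a visit.
    CREATE TABLE store_visits (customer_id INT, store_id INT);
    INSERT INTO store_visits VALUES (100,1),(100,2),(101,1),(102,1);
""")

# Counting rows answers "how many visits per store" without any numeric
# measure and without outer joins.
visits = cur.execute("""
    SELECT store_id, COUNT(*) FROM store_visits
    GROUP BY store_id
    ORDER BY store_id
""").fetchall()
print(visits)  # [(1, 3), (2, 1)]
```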
This chapter has explored the complex realm of outer joins within dimensional
models, offering insights that go beyond traditional modeling techniques.
By uncovering the challenges tied to outer joins, we've shed light on the
potential issues that arise when these joins intersect with row-level security
(RLS). The interplay of data gaps, null values, and intricate queries can cast
doubt on the accuracy and reliability of analytical outcomes. However, armed
with alternative approaches, you now have the means to navigate these
challenges while strengthening the core structure of your data model.
10 Scaling data peaks: A deep dive into
hierarchies
10.1 INTRODUCTION
10.2 UNDERSTANDING
HIERARCHIES
In this section, we will delve into different implementation techniques for hierarchies, considering the specific use cases and characteristics of each approach. Whether dealing with balanced, unbalanced, or ragged hierarchies, organizations can make informed decisions to optimize their data models' hierarchical representation and utilization. By understanding and leveraging these techniques, businesses can unlock the full potential of their hierarchical data, enabling a deeper understanding of their operations and making data-driven decisions with confidence.

Figure 50 Sample hierarchies
10.2.2.1 Balanced hierarchies
The concept of a balanced hierarchy finds a perfect embodiment in the product
hierarchy, where each product is systematically organized into five distinct
levels. This hierarchical structure is known for its uniform depth across
branches, simplifying comprehension and making it an excellent choice for
organizing product-related data. Its predictable and efficient performance
further enhances its appeal, ensuring smooth data navigation and facilitating
insightful analyses for business decision-makers.
Similarly, in traditional data modeling
scenarios, dates often present a well-
structured, balanced hierarchy, flowing
seamlessly from days to months, quarters, and
years. However, with the advent of
ThoughtSpot's advanced keyword-based
search functionality, the conventional
approach of implementing complex date
hierarchies is frequently rendered
unnecessary. ThoughtSpot empowers users to
interact with date-related data using natural
language queries, unlocking a more dynamic
and intuitive data exploration experience. This
flexibility not only accelerates the analysis
process but also liberates users from rigid
hierarchies, fostering creative insights and
enabling them to unearth valuable information
without constraints.
attributes, and other factors, thereby guiding strategic initiatives and
optimizations.
10.2.2.3 Unbalanced hierarchies
managerial functions contribute to the unbalanced nature of the organization
chart.
Recognizing these disparities in hierarchy is essential for effective data modeling
and analysis, as it informs decision-makers about the unique dynamics within
the organization and influences the way data is organized and interpreted.
Embracing the intricacies of unbalanced hierarchies empowers organizations to
create more accurate representations of their structure and ensures a more
comprehensive understanding of their workforce and managerial relationships.
table by simply adding additional columns, enhancing flexibility in your data
model.
By joining the PRODUCT_ITEM column to your fact table, the balanced hierarchy
becomes an integral part of your model, enabling you to roll up data to the
specified levels. Nevertheless, it is crucial to ensure a seamless search
experience for end users.
To model a balanced hierarchy in a dimensional model, you would create a
"Product" dimension table with multiple attributes representing different levels
of the hierarchy. Here's a simplified version of how the dimension might look:
DIM_PRODUCT
PRODUCT_ID  PRODUCT_NAME  CATEGORY     SUBCATEGORY
3           TV            Electronics  TVs
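As a sketch, the flattened dimension and a roll-up query might look as follows; the products, level values, and amounts are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One column per hierarchy level, all flattened into the dimension
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT, subcategory TEXT, category TEXT);
CREATE TABLE fact_sales (product_id INTEGER, amount REAL);

INSERT INTO dim_product VALUES
    (1, 'OLED 55in', 'TVs', 'Electronics'),
    (2, 'LED 42in',  'TVs', 'Electronics'),
    (3, 'Blender',   'Kitchen', 'Appliances');
INSERT INTO fact_sales VALUES (1, 900.0), (2, 400.0), (3, 80.0);
""")

# Rolling up to any level of the hierarchy is a GROUP BY on that level's column
by_subcategory = conn.execute("""
    SELECT d.subcategory, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.subcategory ORDER BY d.subcategory
""").fetchall()
print(by_subcategory)
```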
n Step 1: Define initial dimensions and fact table
Dimensions:
u Date Dimension: Contains attributes like Date, Month, Quarter, and Year.
u Product Dimension: Includes attributes such as Product ID, Product Name,
and Product Category.
u Geography Dimension: Encompasses attributes like Region, Country, and
City.
Fact Table:
u Sales Fact Table: Stores sales transactions with measures like Sales
Amount and Quantity Sold, along with foreign keys referencing Date,
Product, and Geography dimensions.
n Step 2: Implementing advanced attribute hierarchies
Integrate advanced attribute hierarchies within the Geography dimension to
provide deeper insights into the data.
Geography Dimension: Create multiple attribute hierarchies within the
Geography dimension:
u Population Perspective: Add Population Density as an attribute, enabling
comparisons of sales performance against population density.
u Economic Perspective: Integrate Economic Index as an attribute, allowing
analysis of sales in relation to the economic situation of each location.
n Step 3: Performing analysis
Let's explore how users can leverage these advanced attribute hierarchies
for more comprehensive analysis:
Scenario 1: Regional sales comparison by economic index:
u Users start by selecting a specific Region from the standard geographic
hierarchy.
u They then switch to the Economic Index hierarchy and choose different
economic levels (e.g., High, Medium, Low).
u The analysis reveals sales performance for the selected region based on
its economic status, uncovering potential correlations between economic
factors and sales.
Scenario 2: Population density impact on product categories:
u Users select a country from the geographic hierarchy.
u They further analyze by choosing a Population Density range (e.g., High,
Medium, Low).
u By selecting a Product Category from the Product hierarchy, users can
visualize the impact of population density on the popularity of different
product categories.
Scenario 3: City-level analysis with combined attributes:
u Users select a specific city from the geographic hierarchy.
u They then refine the analysis by combining attributes, such as choosing a
Population Density range and an Economic Index level.
u By analyzing sales trends in this manner, users can gain insights into how
various attributes interact and influence sales patterns.
Implementing this approach empowers analysts to perform advanced
geographic analyses, leading to more informed decisions and actionable insights
that go beyond traditional geographic analysis.
Utilize advanced attribute hierarchies for enhanced
analysis
Incorporate advanced attribute hierarchies within dimensions
to provide richer context and facilitate more nuanced analysis.
For example, within a geography dimension, create hierarchies
based on administrative levels, population density, or
economic indicators.
Best Practice 38 - Limit the depth of ragged hierarchies
It's important to note that this compromise works well for
hierarchies with a narrow range, typically comprising 3-4 levels.
Extending this approach to hierarchies with 4-8 or 10 levels may
become impractical, as the attribute names assigned to the various
levels must retain their meaningful context. Striking the right
balance between hierarchy depth and accurate representation is
key to a successful modeling strategy.
Figure 55 Implementation of a fictional organization chart
10.3.4.1 Contents of the hierarchy table
The hierarchy table serves as a repository for all possible paths between
different levels, encompassing even the pathway to the node itself (called a
zero-length pathway).
This table comprises several essential columns:
n PARENT_EMPLOYEE_ID: Represents the 'parent' level, indicating the
superior employee or entity in the hierarchy.
n CHILD_EMPLOYEE_ID: Lists the child levels accessible to the parent,
signifying the subordinate employees or entities.
n DEPTH: Specifies the distance between the parent and the corresponding
child, providing insights into their hierarchical relationship.
n BOTTOM_FLAG: An indicator flag denoting that a specific child level is the
lowest one and does not have any further children beneath it.
n TOP_FLAG: An indicator flag signifying that a particular level is the highest
in the hierarchy, with no additional levels above it.
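One way to derive such a hierarchy (closure) table from a plain parent-child list is sketched below; the seven-node tree is invented for illustration and does not reproduce the figure:

```python
# Adjacency list: child -> parent (None marks the root); the tree is invented
parent_of = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
has_children = {p for p in parent_of.values() if p is not None}

rows = []  # (PARENT_EMPLOYEE_ID, CHILD_EMPLOYEE_ID, DEPTH, BOTTOM_FLAG, TOP_FLAG)
for node in parent_of:
    ancestor, depth = node, 0
    # Walk upward, emitting one row per ancestor, incl. the zero-length path
    while ancestor is not None:
        rows.append((
            ancestor, node, depth,
            int(node not in has_children),     # child has no children of its own
            int(parent_of[ancestor] is None),  # ancestor is the top of the tree
        ))
        ancestor, depth = parent_of[ancestor], depth + 1

print(len(rows))  # every node/ancestor pair gets a row
```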
The bridge table will be populated with
the data illustrated in Figure 56.
Upon inspection, it becomes clear that
the hierarchy table comprises 18 rows,
effectively representing our
hierarchical structure encompassing 7
distinct nodes. For instance, node 1
exhibits access to all nodes, including
itself at depth 0.
Let's initiate our first search to retrieve the same information, aiming to display individual sales figures for each employee. Since we seek straightforward answers, we can employ the child employee, as we've modeled the employee_id from the fact table to join with the hierarchy table.

Alternatively, if we wish to determine the sales figures for each level, encompassing both individual employees and their respective team members, we can select the parent employee details instead of the child. This will yield an aggregated result representing the combined sales generated by the parent employee and all their underlying team members.

Figure 58 Contents of the fact table
Figure 60 Utilizing the hierarchy
2. Denormalize the tree: Transform the flexible structure into a balanced
hierarchy, as discussed earlier in the document. To achieve this, we require
an additional piece of information in our bridge table, representing the level
of the parent node in the hierarchy (e.g., root/top as level 0, the next level
as level 1, and so on).
3. Define the number of levels: Decide on the number of levels needed for
your use case; in the example, we have four levels (Level0 to Level3).
4. Formulate formulas: Using formulas, calculate the levels and split them
out accordingly. For each level, update the formula accordingly (e.g.,
Level0 formula: max(if (parent level = 0) then parent employee name else
null)). Replace the amount with a formula representing the sum of the
amounts.
5. Create a view: Select all the levels, the amount (the formula), and the
child employee name (for grouping purposes), and save this as a view.
6. Utilize the view: Use the view as input for the pivot table (without the
child employee name, used only for grouping).
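The steps above can be sketched in a few lines of code; the bridge rows, names, and level count below are invented, and parent_level is the extra column introduced in step 2:

```python
# Bridge rows (parent_id, child_id, parent_level) for a tiny invented chain:
# Alice (level 0) manages Bob (level 1), who manages Carol (level 2)
bridge = [
    (1, 1, 0), (1, 2, 0), (1, 3, 0),
    (2, 2, 1), (2, 3, 1),
    (3, 3, 2),
]
names = {1: 'Alice', 2: 'Bob', 3: 'Carol'}

MAX_LEVELS = 4  # step 3: a fixed number of level columns (Level0..Level3)

# Steps 4-5: per child employee, fill one column per level with the parent name
view = {}
for parent, child, level in bridge:
    view.setdefault(child, [None] * MAX_LEVELS)[level] = names[parent]

for child in sorted(view):
    print(names[child], view[child])
```

Each child employee ends up with one row whose Level0..Level3 columns spell out its chain of ancestors, which is exactly what the pivot table in step 6 consumes.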
Figure 61 Utilizing the view and hierarchy in the pivot
Please note that this example may contain null values for various reasons,
including nodes having their own values and the possibility of a ragged hierarchy due
to the denormalization. Understanding these nuances will aid in interpreting the
visualizations effectively. While this example can be further refined regarding
null handling, inner joins/outer joins, and display, its primary purpose is to
illustrate the underlying concept.
node D should be apportioned upward with a 50% weighting to node B and
another 50% weighting to node C. To achieve this, simply add an extra
attribute, "percent ownership," to the bridge table and update it for all nodes
where the hierarchy ends at node D.
n Implementation of slowly changing hierarchies: Implementing slowly
changing hierarchies is straightforward by adding effective start and end
dates to the bridge table. It is essential to note that when using this feature,
all searches require a date filter
to freeze the hierarchy at a
specific point in time to obtain
accurate results.
n Hierarchical reporting: The
additional join between the
bridge table and the employee
table allows for seamless
hierarchical reporting, including
questions such as "who reports
to whom." To ensure its
effectiveness, include depth or
any attribute from the bridge
table in your queries. If no
specific attribute is needed from
the bridge table, you can force
its inclusion in any search by
creating a view or by adding a
"fake" RLS rule (e.g., 'True') to
the bridge table and enabling
strict RLS.
n Identification of managers: Easily determine who is a manager by creating
a formula that evaluates whether a particular node is a manager or not. This
can be done using a formula such as:
if (bottom_flag = 1 and child_employee_id = parent_employee_id)
then 'not a manager' else 'manager'
Once these results are indexed, you can conveniently filter for managers or
non-managers.

Figure 62 Hierarchical reporting: who reports to whom?
n Team size calculation: Calculate the size of a team using the following
formula:
group_count(child_employee_id, parent_employee_id)
This formula allows you to determine the total count of employees within a
team, facilitating team size analysis.
10.3.5.1 Data model
10.3.5.1.1 GL transactions
This table captures transactional data, with each transaction being recorded at
the leaf level of the hierarchy. It contains essential information pertaining to
accounts.
AccountID Amount
100 40
110 60
120 80
210 40
220 160
230 40
10.3.5.1.2 GL accounts
This table facilitates easy selection of the account involved in the transactions.
It enables users to quickly identify the specific accounts that have been
transacted against.
The node table contains the columns ID, Selection Node, Child Node, Parent Node, and Level.
Figure 64 Navigating particular nodes in the hierarchy
Figure 65 Adding the child node to the search
Furthermore, should the user wish to analyze "Cash" specifically, they can
simply modify the Selection Node filter value to "Cash," automatically updating
the Child Node value to reflect the relevant context (See Figure 66).
Similarly, for users who require information about the parent node to navigate
up the hierarchy, the Parent Node column can be added to the search, providing
the necessary hierarchical context (Figure 67). This flexible approach empowers
users to explore and drill down into the hierarchy, enhancing their analytical
capabilities and understanding of the data.
10.3.5.3 Limitation: Data model knowledge required
One limitation of the node table hierarchy is that business users must possess a good understanding of the data model. When multiple nodes are selected in a search, it can lead to confusing results. For example, consider a search involving two node selections: "Current Assets" and "Cash." Since "Cash" is a child of "Current Assets," the line item amounts for both nodes are correct. However, the Total Amount is not accurate, as "Cash" is included twice in the calculation.

Figure 68 Limitation when selecting multiple nodes
To avoid such discrepancies, users need to be cautious while selecting multiple
nodes to ensure the results align with their intended analysis. Having a clear
understanding of the hierarchical relationships within the data model can
significantly enhance the accuracy and reliability of search outcomes. Proper
training and guidance for business users regarding the nuances of the node table
hierarchy can mitigate potential confusion and facilitate more effective data
exploration and analysis.
Figure 69 Our org chart revisited with path attributes
Figure 70 Adding the path strings to the tables
Note: The example presented here utilizes a simple letter scheme, which limits
the number of children per node to a maximum of 26. However, in more
extensive implementations, you can opt for a more sophisticated pattern. For
instance, you can use two characters per level or incorporate numbers to
accommodate a larger and more complex hierarchy. The flexibility to design
advanced patterns allows for scalability and adaptability, making the approach
suitable for a wide range of hierarchical structures in diverse applications.
The strength of this solution lies in its swift and efficient navigation through the
hierarchical tree. In conventional systems, regular expressions (as
demonstrated in the second column below) are often utilized. However, in
ThoughtSpot, regular expressions are not supported. Instead, we can employ
formulas (as seen in column 3), providing a better experience. Alternatively,
more advanced users familiar with the path string attribute can leverage the
search bar, as demonstrated in column 4.
Search For    Traditional SQL with regular expressions    ThoughtSpot Formula    ThoughtSpot Search Bar
Creating a flexible search that covers specific scenarios like selecting 'Node 2'
and all its children is not straightforward unless the end user comprehends the
structure of the path strings. To approach this, we can utilize formulas to store
the key field of the desired node (Node 2) in a variable and then employ
group_aggregate and other functions to filter the data, as depicted in Figure 71.
Here we have implemented the following formulas:
Formula                 Definition
selected_employee_id    2
filter_me               left(pathstring, strlen(selected_principal_path_string)) = selected_principal_path_string
To adapt this search for 'Node 3' and its children, all that is required is to modify
the value of the selected_employee_id variable to 3.
This implementation strategy represents the highest level of flexibility
achievable with the current version of ThoughtSpot. Alternatively, users who
grasp the path string structure could attempt searching for path strings
beginning with 'AA', but this approach might not be as user-friendly.
Please note that different or more flexible implementations might be possible
through subqueries, views, and joining them with the main table, but these
approaches are notably more complex to implement (if feasible at all).
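The prefix test behind filter_me can be illustrated outside ThoughtSpot. The path strings below follow the letter scheme from the example; the employee ids are hypothetical:

```python
# Hypothetical path strings: root 'A', its children 'AA' and 'AB', and so on
pathstrings = {1: 'A', 2: 'AA', 3: 'AB', 4: 'AAA', 5: 'AAB', 6: 'ABA', 7: 'ABB'}

def node_and_children(selected_employee_id):
    # The filter_me idea: left(pathstring, strlen(selected)) = selected,
    # i.e. keep every path that starts with the selected node's path
    selected = pathstrings[selected_employee_id]
    return sorted(emp for emp, path in pathstrings.items()
                  if path[:len(selected)] == selected)

print(node_and_children(2))  # Node 2 and all of its children
print(node_and_children(3))  # adapting the search only means changing the id
```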
Figure 72 Our fictional org chart with index numbers
Search For                 ThoughtSpot Formula
All Employees              emp_left >= min_left (*) and emp_right <= max_right (**)
Leaf nodes only (4,6,7)    emp_right - emp_left = 1
Managers (1,2,3,5)         emp_right - emp_left > 1

Search bar queries are not really user-friendly here, as they would require an
understanding of the left/right values of each level.

(*) min_left is a formula calculating the minimum value for emp_left (i.e. 1 in this case)
(**) max_right is a formula calculating the maximum value for emp_right (i.e. 14 in this case)
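A sketch of how the left/right indices could be assigned and then used; the tree here is hypothetical, so its node numbering need not match the figure:

```python
# Hypothetical tree; parent -> list of children
children = {1: [2, 3], 2: [4, 5], 3: [6, 7], 4: [], 5: [], 6: [], 7: []}
left, right, counter = {}, {}, 1

def visit(node):
    # Modified pre-ordered tree traversal: stamp a left index on the way
    # down and a right index on the way back up
    global counter
    left[node] = counter; counter += 1
    for child in children[node]:
        visit(child)
    right[node] = counter; counter += 1

visit(1)

leaves   = sorted(n for n in children if right[n] - left[n] == 1)
managers = sorted(n for n in children if right[n] - left[n] > 1)

# Everyone in employee 2's subtree sits inside 2's left/right interval
team_of_2 = sorted(n for n in children if left[2] <= left[n] and right[n] <= right[2])
print(leaves, managers, team_of_2)
```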
Note: Like the path string approach, generic
filters can be implemented using formulas,
but conducting search bar searches without
knowledge of the left and right indices can
prove to be quite challenging. Additionally,
performing more specific searches, like
finding 'Employee 2' and its children in the
hierarchy, becomes more intricate. The most
adaptable solution would likely involve
creating a formula containing the starting
node (e.g., node 2) and utilizing two group
aggregates to determine the left and right
indices for that node.
For example:
Formula                 Definition
selected_employee_id    2
If we want to do the same search but then for Employee 3, we just need to
change the value of the selected_employee_id to 3.
n Ragged hierarchies
n Hierarchies table
n Node table
n Path string attributes
n Modified pre-ordered tree traversal
As a result, it would be sensible to compare and analyze them within these three
distinct groups.
10.4.1.1 Balanced hierarchy: Balanced vs ragged
When dealing with balanced hierarchies, there are two implementation
techniques available, both based on the flattened table approach. The choice
between these techniques largely depends on whether your hierarchy exhibits
ragged characteristics or not. It's essential to consider the level of raggedness
in your hierarchy; in cases of extreme raggedness, it may be more appropriate
to opt for one of the unbalanced hierarchy approaches instead. By carefully
evaluating the nature of your hierarchy, you can make an informed decision to
select the most suitable implementation technique that best aligns with your
specific data structure and requirements.
Number of rows in table:
n Bridge table: more rows than the node solution, as for each node a row is
stored for every path to its descendants, including the zero-length path to itself.
n Node table: fewer rows than the bridge solution.
When considering scalability and performance, it is crucial to assess the number
of rows in each table. The bridge table, in this comparison, generally contains
more rows than the node table, which may pose potential challenges for very
large hierarchies. However, determining the threshold for what qualifies as a
"large hierarchy" is a pertinent question.
For instance, in the case of the largest retailers worldwide, their product
hierarchies could encompass up to a million products. However, it's important
to note that product hierarchies tend to be balanced, making the balanced
hierarchy approach more suitable.
On the other hand, for HR solutions attempting to model all employees of one
of the top 30 largest global companies, including their hierarchical reporting
relationships, the bridge table solution might not be the optimal choice. In such
cases, assessing the appropriateness of either approach becomes crucial.
In summary, when choosing between these two options, you must weigh the
tradeoff between flexibility and scalability/performance. Carefully evaluating
your specific use case and hierarchy size will enable you to make a well-informed
decision that best aligns with your data model's needs and overall performance
requirements.
Path string attributes vs modified pre-ordered tree traversal:

Number of columns required    1 (path string)    2 (left/right indices)
10.4.1.4 Balanced vs bridge/node table vs attribute solution
Finally, the following table will list some differences between the three groups
of implementation strategies:
(*) The effectiveness of using ragged balanced hierarchies depends on whether proper column naming has been
implemented for each level. Ensuring that each level possesses an understandable name, rather than generic labels like
"level-1," "level-2," etc., is crucial for enhancing the search experience and overall usability of the hierarchy within the
system. Clear and descriptive column names enable users to intuitively navigate and interpret the data, making the
implementation of ragged balanced hierarchies more advantageous when well-structured and appropriately labeled.
It is important to consider that attribute solutions may not always be the most
suitable choice for implementation in ThoughtSpot due to their poor search
experience. As search experience is a critical aspect of the platform, alternative
solutions are often preferred. However, in specific scenarios where detailed
analysis of the hierarchy is not a primary requirement, attribute solutions can
offer a faster and simpler alternative. It ultimately depends on the specific use
case and the level of analysis needed to determine whether attribute solutions
can effectively serve as a viable option within ThoughtSpot.
Hierarchies serve as vital tools for organizing and analyzing complex data
relationships, providing a structured approach that aids in uncovering insights
and patterns. Throughout this chapter, we have explored the diverse facets of
hierarchies, delving into their types, implementations, and considerations for
various use cases.
Balanced hierarchies, characterized by their uniform depth and streamlined
structure, are adept at simplifying comprehension and enabling efficient data
navigation. While traditional data modeling scenarios, such as calendar dates,
often align well with balanced hierarchies, the emergence of advanced search
functionalities, like those offered by ThoughtSpot, is reshaping the landscape.
The ability to interact with data using natural language queries is revolutionizing
the exploration process, transcending rigid hierarchies, and fostering creativity
in insights generation.
Advanced attribute hierarchies add layers of context to data models, facilitating
multidimensional analysis and unveiling intricate relationships. This approach
empowers analysts to draw nuanced insights from geographical attributes,
population perspectives, and economic indicators, thereby guiding strategic
initiatives.
Ragged hierarchies, exemplified by varying branch depths or absent levels in
different branches, demand flexible modeling approaches. Solutions like
propagating parent-level data or using specific values to represent missing
levels enable meaningful representation and interpretation of data.
Unbalanced hierarchies, where branch depths and logical equivalents vary,
introduce complexities in data modeling. Solutions such as bridge tables, node
tables, path string attributes, and modified pre-ordered tree traversals offer
distinct methods for handling such hierarchies, each with its own benefits and
limitations. Businesses must consider the
nature of their data, user requirements,
and organizational structures to
determine the most suitable modeling
technique.
In conclusion, hierarchies serve as indispensable tools for navigating complex data landscapes. By carefully choosing and implementing appropriate techniques, organizations can unleash the potential of their hierarchical data, driving deeper understanding, informed decision-making, and strategic growth. As the data landscape continues to evolve, mastering
hierarchical modeling becomes a critical skill for organizations seeking to
harness the power of their data.
Natural Hierarchy:
Product → Category → Sub Category

Surrogate Hierarchy:
Product → Category → Sub Category → Brand
11 Evolving relations in time: A deep dive
into slowly changing dimensions
11.1 INTRODUCTION
11.2.1.2 Type 2: Add new record
The Type 2 approach introduces a new row in the dimension table for each change,
and we need two additional data columns to capture the validity of each row. This
way, a complete history of changes is preserved, enabling time-based analysis.
However, this method can lead to increased storage requirements, and queries
might become more complex as they need to consider multiple records for a
single entity.
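A minimal sketch of a Type 2 update using validity dates; the column names and the '9999-12-31' open-end convention are illustrative choices, not mandated:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_employee (
    employee_key INTEGER PRIMARY KEY,  -- surrogate key: one per version
    employee_id  INTEGER,              -- business key: stable across versions
    department   TEXT,
    valid_from   TEXT,
    valid_to     TEXT                  -- '9999-12-31' marks the current version
)""")
conn.execute("INSERT INTO dim_employee VALUES (1, 100, 'Sales', '2020-01-01', '9999-12-31')")

def scd2_change(employee_id, department, change_date):
    """Type 2: close the current row, then insert a new version."""
    conn.execute("UPDATE dim_employee SET valid_to = ? "
                 "WHERE employee_id = ? AND valid_to = '9999-12-31'",
                 (change_date, employee_id))
    conn.execute("INSERT INTO dim_employee (employee_id, department, valid_from, valid_to) "
                 "VALUES (?, ?, ?, '9999-12-31')",
                 (employee_id, department, change_date))

scd2_change(100, 'Marketing', '2023-06-01')

history = conn.execute("SELECT department, valid_from, valid_to FROM dim_employee "
                       "WHERE employee_id = 100 ORDER BY valid_from").fetchall()
print(history)  # both versions survive; time-based analysis stays possible
```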
11.3 MODELLING SLOWLY CHANGING DIMENSIONS
DIM_EMPLOYEE
Now, we will apply changes to the Employee dimension using each SCD type to
illustrate the concepts.
DIM_EMPLOYEE
no longer valid. This approach maintains a historical record of changes while
accommodating new information.
DIM_EMPLOYEE
DIM_EMPLOYEE
historical information without affecting the primary dimension table's
performance. It allows for efficient storage and retrieval of historical records.
DIM_EMPLOYEE
DIM_EMPLOYEE_HISTORY
changes at a slower pace than the data itself, presenting challenges in
maintaining data accuracy and historical context. Through this chapter, we've
unearthed the complexities and best practices of handling SCDs, shedding light
on how different strategies can be employed to cater to distinct requirements.
A successful approach to managing Slowly Changing Dimensions is founded on
a balanced understanding of the nature of your data and the desired analytical
outcomes. Each SCD type presents a trade-off between storage, query
performance, and historical context. The choice of strategy should align with
your specific business needs and technological environment. By comprehending
the nuances and benefits of each strategy, you can empower your data
architecture to gracefully accommodate changes while delivering valuable
insights.
n If modeled correctly, i.e. no overlapping dates, then there
will be no duplicate records.
12 From fine to coarse: Crafting data models
with precision and grain
12.1 INTRODUCTION
12.2.1.1 Importance of loading data at the lowest grain
As a best practice, we recommend loading data at the lowest level of granularity
(not summarized). This approach offers several advantages:
n Enhanced query flexibility: Having data at the lowest level of detail
empowers users to identify aggregate outliers and drill down to understand
underlying trends. This translates to heightened query flexibility and the
ability to explore data comprehensively.
n Reduced cognitive demand: Users are relieved from the burden of
performing complex calculations. For instance, they don't need to manually
calculate unique counts for accurate answers. Furthermore, calculations like
averages are less prone to producing erroneous results.
n Dynamic aggregation: Instead of relying on pre-calculated ratios, data
should encompass columns necessary for calculating numerators and
denominators. This ensures we can dynamically aggregate solutions,
adapting to users' changing requirements.
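The dynamic-aggregation point can be illustrated with a small example: averaging a pre-calculated ratio column gives a different (and usually misleading) answer than summing the numerator and denominator first. The numbers below are invented:

```python
# Per-transaction rows at the lowest grain (invented numbers)
transactions = [
    {"discount": 10.0, "amount": 100.0},
    {"discount": 10.0, "amount": 100.0},
    {"discount": 90.0, "amount": 1000.0},
]

# Pre-calculated ratio column, then averaged: every row weighs the same
avg_of_ratios = sum(t["discount"] / t["amount"] for t in transactions) / len(transactions)

# Numerator and denominator kept separately, aggregated, then divided
ratio_of_sums = (sum(t["discount"] for t in transactions) /
                 sum(t["amount"] for t in transactions))

print(round(avg_of_ratios, 4), round(ratio_of_sums, 4))  # the two disagree
```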
12.2.2 Denormalization: To denormalize or not?
Denormalization is a pivotal concept in dimensional modeling, involving the
consolidation of related data to optimize query performance. However,
denormalization should be approached with careful consideration of specific
scenarios to strike the right balance between improved querying and data
integrity. Let's explore when and how denormalization should be applied:
businesses can navigate the complexities of denormalization to optimize their
analytical capabilities.
12.2.2.6 Mini-dimensions
For attributes with a relatively small number of unique values, consider creating
mini-dimensions. These compact dimension tables can be used to manage low-
cardinality attributes efficiently, reducing the overall complexity of your model.
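As a minimal sketch (all table and column names are illustrative assumptions), a mini-dimension for low-cardinality demographic attributes might look like this:

```sql
-- Hypothetical mini-dimension holding low-cardinality attributes,
-- split out of a large customer dimension.
CREATE TABLE DIM_CUSTOMER_DEMOGRAPHICS (
    DEMOGRAPHICS_KEY INTEGER PRIMARY KEY,
    AGE_BAND         VARCHAR(20),   -- e.g. '18-25', '26-35'
    INCOME_BAND      VARCHAR(20),   -- e.g. 'Low', 'Medium', 'High'
    GENDER           VARCHAR(10)
);

-- The fact table references the mini-dimension directly,
-- so the wide customer dimension stays stable as demographics change.
CREATE TABLE FACT_SALES (
    DATE_KEY         INTEGER,
    CUSTOMER_KEY     INTEGER,
    DEMOGRAPHICS_KEY INTEGER,       -- FK to DIM_CUSTOMER_DEMOGRAPHICS
    SALES_AMOUNT     DECIMAL(10,2)
);
```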
Figure 73 A data model with mixed grain facts
search are complex and contestable, primarily stemming from the inherent
differences in granularity between Sales and Budgets.
Figure 77 Correctly modeling mixed grain
fact table operates at a daily level, while the other operates at a higher level
such as monthly or another designated period.
For better clarity, consider the model depicted in Figure 78. This model incorporates two distinct fact tables: a Sales Fact table characterized by daily granularity per product, and a Target Fact table that outlines target quantities for a specified period, which could span a month, several weeks, or any other interval higher than daily.
Figure 78 Multi-grain dimensional model
To illustrate this concept, let's populate this model with practical sample data:
PRODUCT_DIMENSION
DATE_DIMENSION
… Populate all values so we have data for the complete month of January 2023 …
SALES_FACT
2023-01-15 ProductA 50
2023-01-10 ProductB 40
2023-01-20 ProductC 60
TARGET_FACT
PERIOD TARGET_QTY
202301 500
202302 600
202303 700
levels of the two tables. During the join process, the system inadvertently
duplicates the 500 value 31 times, corresponding to each date within that
period.
To tackle this issue, several options can be explored. One approach is as detailed
in the preceding section. Alternatively, one can avoid using the period key and
substitute it with the date of the period's first day. However, this would
concentrate all target quantities on a single day.
In this context, a more fitting resolution involves splitting the dimension. This is
possible because both grains coexist within the same primary dimension but
operate at differing levels, i.e., daily versus period.
To achieve this we split the date dimension into two dimensions:
n DATE_DIMENSION, which contains the DATE and the PERIOD (fk)
n PERIOD_DIMENSION, containing the PERIOD and YEAR
Figure 84 Resolving the multi-grain issue by splitting dimensions
These dimensions are then joined together, and the TARGET_FACT is linked to the PERIOD_DIMENSION instead of the DATE_DIMENSION. When this reconfigured setup is imported into ThoughtSpot, configured, and relevant attributes are included in a worksheet, the updated search can be performed, as depicted in Figure 85.
The outcome of this approach successfully rectifies the overcounting issue,
accurately reporting the target quantity as 500. This technique serves as an
effective means of addressing multi-grain challenges in your data model.
Notably, this technique is most suitable when the facts are linked via the same
primary dimension, such as Date in this context, albeit at differing levels.
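The split described above can be sketched as DDL (data types and the exact foreign-key syntax are assumptions; the table and column names follow the example in the text):

```sql
-- The period-level dimension, e.g. PERIOD 202301 for January 2023.
CREATE TABLE PERIOD_DIMENSION (
    PERIOD INTEGER PRIMARY KEY,
    YEAR   INTEGER
);

-- The day-level dimension, carrying a foreign key to the period grain.
CREATE TABLE DATE_DIMENSION (
    "DATE" DATE PRIMARY KEY,
    PERIOD INTEGER REFERENCES PERIOD_DIMENSION (PERIOD)
);

-- TARGET_FACT now joins on PERIOD_DIMENSION rather than DATE_DIMENSION,
-- so a monthly target of 500 is no longer duplicated once per day.
```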
Figure 85 Rerunning our actuals vs target search
13 Unboxed insights: Unleashing the
potential of flexible data structures
13.1 INTRODUCTION
1 Sales 100
1 Tax 10
1 Cost 50
Figure 86 Key/value pair sample data
When queried, the resulting output lacks optimal usability, and users are
required to understand attributes such as 'measure_value' and 'measure_code,'
which can diminish the overall search experience.
},
"hashtags": ["ProductLaunch"],
"mentions": ["@colleague1", "@partner"],
"media": [
{
"type": "image",
"url": "https://example.com/image.jpg"
}
]
}
In this example, we have a JSON object representing a Twitter feed with several
key components:
n user: Information about the user who posted the tweet.
n tweet: Details of the tweet itself, including its content and engagement
metrics.
n hashtags: An array of hashtags used in the tweet.
n mentions: An array of user mentions within the tweet.
n media: An array containing media elements associated with the tweet, such
as images or videos.
Given the current data structure, our options are limited. To make meaningful
progress, we must leverage the capabilities of the underlying data platform to
flatten or unpack this information and import the results of that process into
ThoughtSpot.
SELECT
  transaction_id,  -- assumed key column; the sample data identifies each record by its first column
  MAX(CASE WHEN measure_code = 'Cost'
      THEN measure_value END) AS Cost,
  MAX(CASE WHEN measure_code = 'Sales'
      THEN measure_value END) AS Sales
FROM KeyValueData
GROUP BY transaction_id;
n ThoughtSpot pivot: Within ThoughtSpot, formulas can be utilized to pivot
the data. Although this method aligns well with search expectations, concerns
about scalability could arise over time.
n ETL incorporation: The ETL process can also incorporate the pivot
functionality, enabling pre-calculation and persistence of the pivoted data.
This option guarantees the best search experience and performance.
In conclusion, effectively managing key/value pairs involves navigating the
challenges of dynamically represented data. Pivoting and various
implementation methods can significantly enhance the usability and
effectiveness of searches in such scenarios.
n Table or view creation: Following the transformation, structured tables or
views are generated from the unstructured or semi-structured data. These
structured entities align with ThoughtSpot's requirements and are conducive
to seamless integration.
n Import and analysis: Finally, the newly created structured tables or views
can be easily imported into ThoughtSpot for comprehensive analysis and
exploration.
Here is a Snowflake example illustrating how you could extract the data from
the JSON tweet sample provided earlier. It's important to highlight that this
example covers the complete extraction. However, it's advisable to adhere to
best practices by only unpacking the specific data elements that are essential
for your analysis.
CREATE OR REPLACE VIEW flattened_tweet_data AS
SELECT
json_data:user:id::string AS user_id,
json_data:user:username::string AS username,
json_data:user:full_name::string AS full_name,
json_data:user:followers_count::integer AS followers_count,
json_data:user:location::string AS location,
json_data:user:verified::boolean AS verified,
json_data:tweet:id::string AS tweet_id,
json_data:tweet:text::string AS tweet_text,
json_data:tweet:created_at::timestamp AS created_at,
json_data:tweet:retweet_count::integer AS retweet_count,
json_data:tweet:favorite_count::integer AS favorite_count,
json_data:hashtags::array AS hashtags,
json_data:mentions::array AS mentions,
json_data:media[0]:type::string AS media_type,
json_data:media[0]:url::string AS media_url
FROM json_data;
By harnessing this strategy and adhering to the principle of selective
transformation, organizations can bridge the gap between unstructured or semi-
structured data and the structured data model demanded by ThoughtSpot.
While the platform itself thrives on structured data, the capabilities of modern
data warehouses and SQL-driven transformations empower users to transform
and organize diverse data formats into a format suitable for impactful analysis
within ThoughtSpot.
13.4 CONSIDERATIONS AND CONCLUSION
14 Dimensional mastery: Exploring the date
dimension
14.1 INTRODUCTION
14.3 MODELING DATE & TIME
n To join non-conformed data: if we're dealing with different sets of facts that need to be joined, the date dimension table helps. We could use the date dimension to join those facts together.
n You want to define ‘Smart Date Hierarchies’ to provide users with flexible and
intelligent ways to analyze data over different time perspectives, which is not
provided out-of-the-box by ThoughtSpot and its keywords, e.g.:
u Seasonality: Add attributes related to seasonality, such as "Holiday
Season," "Back-to-School Season," or "Summer Season." These attributes
can help users identify trends and patterns related to specific periods of
the year.
u Comparative Analysis: Introduce attributes for comparative analysis, such
as "Same Period Last Year" and "Previous Period." (Like-for-like
comparisons). These attributes enable users to compare current
performance with historical periods, facilitating trend analysis. This is
especially useful when you want to utilize the versus keyword in
ThoughtSpot which does not work in combination with ‘group’ functions
(see section 14.4).
u Day Part Analysis: Create attributes for different parts of the day, such as
"Morning," "Afternoon," and "Evening." This can help users understand
how performance varies during different time periods within a day.
u Time-to-Event Analysis: If applicable, add attributes for "Time to Event"
analysis, such as "Days to Next Promotion" or "Days to Product Launch."
These attributes provide insights into how certain events impact business
performance.
u In section 17.3.3.2 we also describe a design pattern where the date dimension can help with semi-additive measures.
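Attributes like the seasonality and comparative examples above can be derived directly on the date dimension. A hedged sketch (assuming a Snowflake-style DATEADD and illustrative column names):

```sql
-- Sketch: extending a date dimension with "smart" attributes.
-- The month ranges used for each season are assumptions for illustration.
SELECT
    d.date_key,
    d.calendar_date,
    CASE
        WHEN d.month_num = 12          THEN 'Holiday Season'
        WHEN d.month_num IN (8, 9)     THEN 'Back-to-School Season'
        WHEN d.month_num IN (6, 7)     THEN 'Summer Season'
        ELSE 'Regular'
    END AS seasonality,
    DATEADD(year, -1, d.calendar_date) AS same_day_last_year  -- comparative analysis attribute
FROM date_dimension d;
```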
In the grand world of dimensional modeling, the date dimension is like a guide
to help us make sense of time-related data. While ThoughtSpot's keywords
simplify things, the date dimension table is a powerful tool when we need to dig
deeper.
14.4.1 Introduction
In the world of retail analytics, gaining insights into historical performance is
paramount. Retailers often need to compare sales, market share, and other key
metrics year-over-year to understand trends and make informed decisions.
However, performing these comparisons efficiently can be challenging,
especially when dealing with large datasets. In this section, we will delve into
utilizing a date dimension for ‘look back measures’, which allows retailers to
answer crucial questions about their performance over time.
14.4.3 Solution design
Our solution for look back measures begins with the creation of a Week
Dimension that defines the look-back periods. The process involves the following
steps:
1. Week dimension: Create a Week Dimension that defines the look-back
periods.
2. Views and join: Create SELECT * views over the Fact and join them to the
Look Back Period.
3. Worksheet labeling: Insert the Look Back view in the worksheet and label
the measure as "1 year ago" or any relevant timeframe.
This solution allows users to easily compare data from different time periods
and gain insights into their retail performance.
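Step 2 of the solution can be sketched as follows (a minimal sketch: the look-back column `week_key_last_year` on the week dimension and all other names are assumptions):

```sql
-- A "1 year ago" view over the sales fact, joined via a look-back
-- column defined on the week dimension.
CREATE OR REPLACE VIEW sales_fact_1y_ago AS
SELECT
    w.week_key,        -- the current week the user filters on
    f.product_key,
    f.sales_amount     -- surfaced in the worksheet as "Sales 1 Year Ago"
FROM week_dimension w
JOIN sales_fact f
  ON f.week_key = w.week_key_last_year;  -- look-back offset resolved by the dimension
```

Because the offset lives in the dimension, the same pattern extends to "Last 12 weeks", "Last 26 weeks", and so on by adding further look-back columns.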
These measures enable retailers to analyze sales variations and trends between
different time periods efficiently.
14.4.3.2 Application
With the solution and measures defined, ThoughtSpot users can quickly perform
analyses and answer the critical questions mentioned earlier. For instance, they
can easily assess how sales have changed over time, calculate the share of
MULO this year versus last year, and track share changes over specified time
windows.
The ability to visualize and interact with data in real time empowers users to
make data-driven decisions swiftly.
n Sales This Year over MULO Sales This Year = Share of MULO
n Sales Last Year over MULO Sales Last Year = Share of MULO Last Year
n Share of MULO Last Year - Share of MULO = YoY Share Change
n YoY Share Change over time (Last week vs. Last 12 weeks vs. Last 26 weeks
vs. Last 52 weeks)
These share measures provide retailers with a comprehensive view of their
market position and how it evolves over different timeframes.
14.4.4.2 Example: What is the share of MULO this year versus last year
during the same period?
Figure 90 What is the share of MULO this year versus last year during the same period?
Figure 91 Query Plan for What is the share of MULO this year versus last year during the same
period?
14.4.4.3 Example: How has the share changed over time?
14.4.4.4 Example: How does the share compare this year versus last year
to another retailer during the same period?
Figure 93 How does the share compare this year versus last year to another retailer during the
same period?
14.4.5 Conclusion
Utilizing the Date Dimension enables retailers to efficiently compare
performance data across different time periods, answer critical questions, and
gain valuable insights into their business. By harnessing the power of
ThoughtSpot's analytics platform, retailers can make data-driven decisions that
drive growth and success in a competitive market.
This chapter has highlighted the importance of the date dimension, which serves
as a fundamental framework for understanding time-related data. This
dimension simplifies the interpretation of time, allowing us to navigate through
different time periods, make historical comparisons, and uncover underlying
trends.
ThoughtSpot's keywords make it easier to query time intervals, reducing the
need for a dedicated date dimension table in many scenarios. These keywords
provide added flexibility and precision. Nevertheless, there are situations where
a dedicated date dimension table is essential, particularly when more advanced
and non-standard functionalities are required.
One of the many examples is the combination of the date dimension and the 'look back measures' design pattern in ThoughtSpot, which has a profound impact on the retail industry. It equips retailers with the necessary resources to efficiently analyze performance data across various timeframes, offering solutions to critical questions and valuable insights into their operations.
In conclusion, the date dimension serves as a guiding light for exploring
temporal data complexities. ThoughtSpot's keywords simplify query processes,
while the Date Dimension table remains a powerful tool for in-depth analysis
when needed.
15 Counting coins across continents: A guide
to currency conversion in analytics
15.1 INTRODUCTION
15.2.2 Challenges of currency conversion
n Exchange rate management: Currency conversion requires up-to-date
and historical exchange rates. Keeping track of exchange rates over time and
ensuring the accuracy of conversions can be complex, especially when
dealing with various currencies and frequent rate fluctuations.
n Consistency and integrity: Maintaining the consistency and integrity of
currency conversions is essential to avoid errors in reporting and analysis.
Inconsistent conversions can lead to discrepancies in financial statements
and misleading business decisions.
n Multi-currency transactions: Dealing with multi-currency transactions in
a single report or analysis requires proper handling of different currency
conversions for accurate aggregations.
n Dimensional hierarchies: Currency conversion needs to be applied at
various levels of dimensional hierarchies, such as at the country, region, or
global level. Ensuring correct conversions across these hierarchies is crucial.
n Exchange rate table: This holds the exchange rate between a source and a target currency. For the simplicity of the example we have not included dates here, but in the real world you probably want to store an exchange rate for a certain date/period.
n User table: This table stores the user and their preferred currency.
n User currency bridge: This table contains the currencies used by the users. This is a degenerate bridge table. It's an optimization technique used in dimensional modeling to avoid creating unnecessary dimension tables while still capturing the necessary relationships between facts and dimensions. In our case we use it to resolve the many-to-many relationship and avoid null records being returned when no user is selected. This should never happen as a user should always be specified, but it will limit the amount of data processed by the query.
Figure 94 A data model for a multi-currency use case
Our data model will look like the one shown in Figure 94.
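A conversion query over this model could be sketched as follows (column names are assumptions inferred from the sample data; "USER" is quoted because it is a reserved word in most databases):

```sql
-- Convert each transaction amount into the user's preferred currency
-- via the exchange rate table.
SELECT
    u.user_name,
    t.category,
    t.amount * r.rate AS converted_amount
FROM TRANSACTION t
JOIN "USER" u
  ON u.user_id = t.user_id
JOIN EXCHANGE_RATE r
  ON r.source_currency_id = t.currency_id
 AND r.target_currency_id = u.pref_currency_id;
```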
We will populate the model with the following data:
EXCHANGE_RATE
TRANSACTION
2 50 2 59 1 08/02/2023 Clothing Purchase
USER
1 Alice 1
2 Bob 2
USER_CURRENCIES
PREF_CURRENCY_ID
When we now import these tables in ThoughtSpot, create a worksheet using
these tables and place a RLS rule on the USER table (ts_username =
user_name), we can see this in action.
For example, if we log in as Alice we would get the following results:
ThoughtSpot's capabilities further enhance the potential to create
comprehensive multi-currency reports and dashboards that reflect accurate and
uniform monetary values. With the complexity of global business environments,
adopting best practices for currency conversion empowers organizations to
make informed decisions and drive successful operations across various
currencies and markets.
Best Practice 44 - Apply currency conversion on load
When loading data into the data warehouse, apply currency
conversion directly during the ETL (Extract, Transform, Load)
process. This approach simplifies queries and ensures consistent
conversions throughout the data.
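This best practice can be sketched as an ETL step (all names are assumptions, with EUR as an assumed reporting currency):

```sql
-- Convert once, at load time, so every downstream query sees a
-- single, consistent reporting currency.
INSERT INTO SALES_FACT (date_key, product_key, amount_eur)
SELECT
    s.date_key,
    s.product_key,
    s.amount * r.rate AS amount_eur   -- converted during the load
FROM STAGING_SALES s
JOIN EXCHANGE_RATE r
  ON r.source_currency = s.currency
 AND r.target_currency = 'EUR';
```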
16 Evolving data structures: A guide to late-
binding attributes
16.1 INTRODUCTION
16.2.2 Late-binding and varying attributes per row
A powerful application of late-binding attributes is seen when dealing with
datasets where attributes can differ per row. This scenario is common in
situations where data is collected from multiple sources, each producing data
with distinct attributes. Consider a retail business that aggregates sales data
from various franchise locations. Each location might have unique attributes,
such as specific promotional codes, local tax rates, or regional product offerings.
Late-binding attributes allow for seamless integration of such diverse data. As
new franchise locations are added, the data management system can effortlessly
accommodate the unique attributes associated with each location without
requiring schema alterations. Queries can then be crafted to aggregate and
analyze this data, regardless of the varying attributes present in each row.
16.2.3.1 Custom properties/characteristics
join the "Patient Attributes" table as needed to retrieve additional patient
information.
extending their data model to include new attributes rather than
restructuring their entire data warehouse.
n Reduced maintenance: The separation of early-binding and late-binding
attributes minimizes the need for extensive maintenance efforts when
adapting to changes. This efficiency can translate to cost savings and
resource optimization.
attribute definitions, update policies, and communicate changes effectively
to stakeholders to prevent misunderstandings and inconsistencies.
1 10001 1003 Vikings
1 10001 1004 2
2 20001 1003 3
The Custom Field itself would come from another table (via the
CUSTOM_FIELD_ID), for example:
16.3.3 Implementation strategies for late-binding
attributes
Below we show some likely implementation strategies for the late-binding attributes data model. The complete list of strategies includes:
TBL_PROFILE
TENANT_ID INTEGER
PROFILE_ID INTEGER
CUSTOM_FIELD_1 VARCHAR
CUSTOM_FIELD_ANSWER_1 VARCHAR
CUSTOM_FIELD_2 VARCHAR
CUSTOM_FIELD_ANSWER_2 VARCHAR
CUSTOM_FIELD_3 VARCHAR
CUSTOM_FIELD_ANSWER_3 VARCHAR
CUSTOM_FIELD_4 VARCHAR
CUSTOM_FIELD_ANSWER_4 VARCHAR
16.3.3.3 EAV type implementations
An example of this was discussed in section 16.3.1.
PROFILE_QUESTIONS
PROFILE_QUESTION_ID INTEGER
TENANT_ID INTEGER
USER_ID INTEGER
QUESTION_DATA VARIANT
And we populate this table with the sample data from the example:
7 2 2 { "How many nights a week do you go out?": 3,
"data_type": "NUMBER", "type": "Food" }
TENANT_ID | What is your favourite football/soccer club? | What is your favourite baseball team? | What is your favourite American Football team? | How many sports do you play? | What is your favourite food? | What is your favourite drink? | How many nights a week do you go out? | What is your favourite night club?
2 | | | | | Burger | Gin Tonic | 3 | Bootshaus
Here you can see that for tenant 1, we get their selected set of sports related
questions and for tenant 2, their food & going out related questions.
You can convert this SQL statement into a view which then can easily be
imported into ThoughtSpot.
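The pivoting SQL statement referenced above is not reproduced here; a sketch of what it might look like in Snowflake (table and column names are assumptions based on the PROFILE_QUESTIONS example) is:

```sql
-- Flatten the VARIANT column into key/value pairs, then pivot selected
-- questions into one row per tenant/user.
SELECT
    tenant_id,
    user_id,
    MAX(CASE WHEN f.key = 'What is your favourite food?'
             THEN f.value::string END) AS favourite_food,
    MAX(CASE WHEN f.key = 'How many nights a week do you go out?'
             THEN f.value::string END) AS nights_out_per_week
FROM profile_questions,
     LATERAL FLATTEN(input => question_data) f
GROUP BY tenant_id, user_id;
```

The CASE filters also skip the bookkeeping keys in the VARIANT (such as "data_type" and "type"), so only actual questions are pivoted out.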
Technique | Pros | Cons
Denormalise attributes and values | Fast to implement. Allows actions on custom fields (search, sort). | Fields are generic, not strongly typed. Table is inefficient size-wise; fields might never be used for certain tenants.
EAV type implementations | Both flexible and efficient. DB actions can be performed, and the data is normalized somewhat to reduce wasted space; there really aren't too many cons. | Slight increase in development time and complexity of your queries. Not very scalable when applied across multiple tables.
Table 56 Pros and cons of the various late-binding attribute modeling techniques
16.4 MODELING LATE-BINDING ATTRIBUTES FOR SEARCH
In a broader context, the most optimal solution entails fully modeling all fields.
This approach not only delivers a superior search experience but also generally
enhances performance. It is evident from the various scenarios that while extra work might be necessary to optimize the search experience, the scalability of the strategy should be a central consideration.
succinct tags. For instance, instead of 'What is your favorite baseball team?' you
might opt for 'MLB.' Upon fusion with the corresponding answer, the outcome
could resemble 'MLB:Cincinnati Reds,' thus creating a unified field suitable for
filtering.
This approach doesn't negate the utility of individual question and answer fields, which maintain readability within a table. By integrating both aspects, a good balance is achieved. However, it's crucial to acknowledge that this solution exclusively addresses filtering concerns.
16.5 CONCLUSION
Early-binding attributes, defined during initial data modeling, can hinder
adaptability. In contrast, late-binding attributes provide agility, scalability, and
reduced maintenance, crucial for dynamic sectors like healthcare.
However, these attributes introduce challenges such as data integrity, query
performance, and schema complexity. Nonetheless, they empower custom
property mapping and dynamic attribute handling.
Implementation strategies like denormalization, EAV types, and semi-structured
solutions have been discussed, each with its pros and cons to consider.
In the context of search-oriented modeling, combining field and answer in a tag
improves filtering without sacrificing readability.
Notably, achieving an AND pattern for filter combination is emphasized. This
requires structuring custom fields into separate columns for precise filtration.
Ultimately, though various strategies are explored, explicitly modeling all
attributes remains the optimal solution. This effectively transforms late-binding
attributes back into early-binding attributes. Despite challenges, this approach
empowers organizations to adapt, analyze, and stay competitive in a rapidly
evolving landscape.
17 Beyond sums: Mastering complex data
metrics
17.1 INTRODUCTION
n Profit margin enhancement: In financial analysis, the computation of
profit margin (net profit divided by revenue) is a recurring requirement.
Through Derived Measures, this calculation is performed during the ETL
(Extract, Transform, Load) process and stored directly in the fact table.
Analysts can now swiftly compare profit margins across products, regions, or
time periods without the overhead of repetitive calculations.
n Conversion rate streamlining: In the realm of e-commerce, calculating
conversion rates (purchases divided by website visits) is pivotal. Derived
Measures transform this calculation into a one-time operation, enabling
marketers to assess the success of campaigns and analyze user behavior
without being entangled in constant computations.
n Insights into inventory turnover: Inventory management necessitates
the calculation of inventory turnover (cost of goods sold divided by average
inventory). By integrating this metric as a Derived Measure, analysts can
seamlessly explore trends in product turnover, thereby optimizing inventory
control strategies.
n Customer loyalty exploration: Tracking customer retention rates
(percentage of customers retained over a period) is indispensable for
customer relationship management. Derived Measures expedite the
assessment of customer loyalty, facilitating insights into the effectiveness of
engagement initiatives.
n Seamless scalability: Derived Measures maintain their performance
benefits as data volumes expand, contributing to the scalability of the data
warehouse infrastructure.
n Show the Latest Balance (Latest Balance by Account)
n Show Year End Balance (Year-end Balance by Account)
n Show Month End Balance (Month-end Balance by Account)
n Show Week End Balance (Week-end Balance by Account)
n Show month-end balance trend (Month-end Balance by Account Monthly)
n Compare month-end balance over the past 3 months (Month-end Balance by Account last month vs 2 months ago vs 3 months ago)
provide insights into various financial performance metrics. A typical data model for this scenario might look like the model below.
One of the key derived metrics in this scenario is the profit margin, which is
calculated as (Revenue - Cost) / Revenue. Instead of computing this ratio during
each query, a Derived Measure can be introduced in the fact table.
With the Profit Margin derived fact stored directly in the fact table, analytical
queries involving profit margin calculations can now be executed significantly
faster. This approach optimizes query performance and enhances the overall
usability of the data warehouse.
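A sketch of this ETL step follows (column names are assumptions; NULLIF guards against division by zero):

```sql
-- Persist the derived profit margin directly in the fact table at load time.
INSERT INTO sales_fact (date_key, product_key, revenue, cost, profit_margin)
SELECT
    date_key,
    product_key,
    revenue,
    cost,
    (revenue - cost) / NULLIF(revenue, 0) AS profit_margin  -- derived measure
FROM staging_sales;
```

Note that a stored row-level margin is still non-additive: when aggregating, it should be recomputed as a weighted ratio (see section 17.3.2.1) rather than averaged.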
17.3.2.1 Weighted averages
For ratios and percentages, calculate weighted averages instead of simple sums.
Multiply the measure value by a relevant weight, sum these values, and then
divide by the sum of weights.
17.3.2.1.2 Pros
n Suitable for ratios and percentages: Weighted averages are well-suited for
handling non-additive measures like ratios and percentages, providing a
meaningful way to aggregate them.
n Accurate aggregation: Weighted averages provide accurate aggregation by
considering the significance of each value.
17.3.2.1.3 Cons
n Limited applicability: This technique is specifically designed for calculating
weighted averages of non-additive measures and may not be applicable to
all types of measures.
17.3.2.2.2 Pros
n Customized aggregation: Augmentation allows you to store and aggregate
non-additive values with additional context, providing more meaningful
results.
n Improved query performance: Precomputed values can enhance query
performance for non-additive measures that require complex calculations.
17.3.2.2.3 Cons
n Data redundancy: Augmenting the fact table can lead to increased data
redundancy, as additional fields need to be stored alongside the raw data.
n Increased storage: Storing additional fields in the fact table can result in
increased storage requirements.
17.3.2.3.2 Pros
n Specialized treatment: Separate fact tables allow you to handle non-additive
measures differently, optimizing performance and maintaining clarity.
n Improved query performance: Precomputed values in separate fact tables
enhance query performance for non-additive measures.
17.3.2.3.3 Cons
Additional Complexity: Managing multiple fact tables can introduce some level
of complexity to the data model.
Maintenance Overhead: Handling separate fact tables requires careful design
and maintenance to ensure data consistency and accuracy.
17.3.2.4.1 The data model
We'll be using a Product Dimension and a Date Dimension. The Product
Dimension represents different product types, and the Date Dimension includes
quarters and years. We've also stored the weight index in the product dimension
for simplicity. In real-world cases, you might calculate this from your fact table.
Additionally, we need a fact table to capture product defect rates and relevant
information.
DIM_PRODUCT

DIM_DATE

DATE_KEY   YEAR   QUARTER
101        2023   Q1
102        2023   Q2
103        2023   Q3

FACT_DEFECT_RATE

ID   PRODUCT_KEY   DATE_KEY   DEFECT_RATE
1    1             101        0.0200
2    2             101        0.0500
3    3             101        0.0300
4    1             102        0.0300
5    2             102        0.0600
6    3             102        0.0400
7    1             103        0.0200
8    2             103        0.0400
9    3             103        0.0300
Weighted Average Calculation
The formula for weighted average calculation is:
Weighted Average = Σ(weight * percentage) / Σ(weight)
Where:
n Σ denotes the sum of all items.
n "weight" represents the weight assigned to each percentage.
n "percentage" is the actual percentage value.
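To make the formula concrete, here is a small worked example in Python. The defect rates are the 2023 Q1 rows from FACT_DEFECT_RATE; the weight_index values are invented purely for illustration:

```python
# 2023 Q1 defect rates from FACT_DEFECT_RATE; weight_index per product
# is an invented example value.
weights = [1.0, 2.0, 3.0]
defect_rates = [0.02, 0.05, 0.03]

# Weighted Average = sum(weight * percentage) / sum(weight)
weighted_avg = sum(w * r for w, r in zip(weights, defect_rates)) / sum(weights)
print(round(weighted_avg, 4))  # 0.035, vs. a simple average of ~0.0333
```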
Within the worksheet, we'll need to formulate a calculation for determining the
weighted average. The formula mirrors the one mentioned earlier, with a slight
adjustment: here, we multiply the denominator by the number of quarters, our
lowest date unit. This adjustment ensures the formula remains effective when
rolling up to a year.

sum(defect_rate * weight_index) / (sum(weight_index) * count(quarter))

Upon executing a search for the average defect rate per quarter, the outcome
is displayed below. Notably, even upon aggregation, the calculation maintains
its accuracy.
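The roll-up behavior can be sanity-checked in Python. This sketch assumes that sum(weight_index) aggregates over the distinct products and count(quarter) counts the quarters in the search scope; the weight_index values themselves are invented:

```python
# FACT_DEFECT_RATE sample rows as (product_key, date_key, defect_rate);
# weight_index per product is an invented example value.
weights = {1: 1.0, 2: 2.0, 3: 3.0}
facts = [
    (1, 101, 0.02), (2, 101, 0.05), (3, 101, 0.03),
    (1, 102, 0.03), (2, 102, 0.06), (3, 102, 0.04),
    (1, 103, 0.02), (2, 103, 0.04), (3, 103, 0.03),
]
quarters = sorted({d for _, d, _ in facts})

# The correct yearly figure: average the three quarterly weighted averages.
per_quarter = [
    sum(weights[p] * r for p, d, r in facts if d == q) / sum(weights.values())
    for q in quarters
]
expected = sum(per_quarter) / len(quarters)

# The worksheet formula: sum(defect_rate * weight_index)
#   / (sum(weight_index) * count(quarter)), with sum(weight_index) taken
#   over the distinct products and count(quarter) over the quarters in scope.
formula = sum(weights[p] * r for p, _, r in facts) / (
    sum(weights.values()) * len(quarters)
)

assert abs(formula - expected) < 1e-12  # the two agree at the year grain
print(round(formula, 6))  # 0.037222
```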
Table 61 Searching for the defect rate per quarter
Table 62 Searching for the defect rate per year
The fact table can also be augmented by introducing two additional fields: one
for the weighted defect rate and another for the cumulative sum of the weight
index.
Now, you can easily compute the weighted average by summing up the
WEIGHTED_DEFECT_RATE and then dividing it by the total
SUM_WEIGHT_INDEX.
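A minimal sketch of this ratio-of-sums behavior, again with invented weight_index values: because each row carries both its numerator and its denominator contribution, two plain sums stay correct at any grain:

```python
# Augmented fact rows as (date_key, weighted_defect_rate, sum_weight_index),
# i.e. each row stores defect_rate * weight_index and its weight_index
# (weights invented for illustration).
augmented = [
    (101, 1.0 * 0.02, 1.0), (101, 2.0 * 0.05, 2.0), (101, 3.0 * 0.03, 3.0),
    (102, 1.0 * 0.03, 1.0), (102, 2.0 * 0.06, 2.0), (102, 3.0 * 0.04, 3.0),
]

def weighted_average(rows):
    # Ratio of two plain sums -- both parts are fully additive, so the
    # result is correct at any level of aggregation.
    return sum(wr for _, wr, _ in rows) / sum(w for _, _, w in rows)

q1 = [row for row in augmented if row[0] == 101]
print(round(weighted_average(q1), 4))         # 0.035 for 2023 Q1
print(round(weighted_average(augmented), 4))  # 0.04 across both quarters
```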
17.3.3.1 Approach 1: Leveraging ThoughtSpot aggregation functions
In this approach, we create a dynamic measure, expressed as a formula, that
determines the maximum date within the context of the search. Whether the data
is analyzed on a yearly, quarterly, monthly, weekly, or daily basis, this
formula plays a pivotal role. It is defined as follows:
Business effective date = group_max(snapshot_date, snapshot_date)
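As a rough Python analog of what group_max achieves here (with invented snapshot data): within each grouping of the search, only the row at the latest snapshot date contributes to the semi-additive measure:

```python
# Invented balance snapshots as (month, snapshot_date, balance).
from collections import defaultdict

snapshots = [
    ("2023-01", "2023-01-15", 900.0),
    ("2023-01", "2023-01-31", 1000.0),
    ("2023-02", "2023-02-28", 1100.0),
]

# Keep, per grouping, only the row at the latest snapshot date -- the
# effect that group_max(snapshot_date, snapshot_date) produces in search.
latest = defaultdict(lambda: ("", 0.0))
for month, day, balance in snapshots:
    if day > latest[month][0]:  # ISO date strings sort chronologically
        latest[month] = (day, balance)

closing = {month: bal for month, (_, bal) in latest.items()}
print(closing)  # {'2023-01': 1000.0, '2023-02': 1100.0}
```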
Date          is_quarter_start  is_month_start  is_quarter_end  is_month_end  is_week_start  is_year_start  is_week_end  is_year_end  is_latest
1-jan-2023    0                 1               0               1             0              1              0            1            0
2-jan-2023    0                 0               0               0             0              0              0            0            0
...           ...               ...             ...             ...           ...            ...            ...          ...          ...
31-mar-2023   0                 0               0               0             1              0              1            0            0
1-apr-2023    0                 0               0               1             0              1              0            1            0
...           ...               ...             ...             ...           ...            ...            ...          ...          ...
31-dec-2023   1                 0               1               0             1              0              1            0            1
Within this table, each entry corresponds to a specific date, and the associated
Boolean indicators are assigned values of 1 or 0 based on the predefined criteria.
It's important to note that the values within the "is_latest" column would
typically be dynamically determined based on the actual data within your
system. As the latest date may evolve over time, this column requires periodic
updates to remain accurate.
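The flag columns themselves are straightforward to derive when building the dimension. A minimal Python sketch, assuming simple calendar quarters (the week flags and is_latest are left out, since they depend on the chosen week convention and on the live data respectively):

```python
# Sketch of deriving the extended date dimension flags; week flags and
# is_latest are omitted (week convention and live data dependent).
from datetime import date, timedelta

def date_flags(d: date) -> dict:
    next_day = d + timedelta(days=1)
    return {
        "is_month_start": d.day == 1,
        "is_month_end": next_day.day == 1,
        "is_quarter_start": d.day == 1 and d.month in (1, 4, 7, 10),
        "is_quarter_end": next_day.day == 1 and next_day.month in (1, 4, 7, 10),
        "is_year_start": d.month == 1 and d.day == 1,
        "is_year_end": d.month == 12 and d.day == 31,
    }

flags = date_flags(date(2023, 3, 31))
print(flags["is_quarter_end"], flags["is_month_end"])  # True True
```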
Subsequently, we can import this extended date dimension into ThoughtSpot
and configure its usage. In the worksheet employing this table, we will introduce
several formulas to further enhance the search experience:
Figure 104 Formulas to add to the worksheet (Name / Definition)
Please be aware that the inclusion of the condition (is_latest = 1) within the
year end, quarter end, month end, and week end formulas is an optional step.
By incorporating this condition, you will extend the consideration to include the
last recorded date within your dataset as the respective period's end.
With these preparations in place, we can seamlessly utilize user-friendly terms
during our search queries, such as 'month end':
In wrapping up our exploration of complex data metrics, it's clear that non-
additive, semi-additive, and derived measures offer valuable tools for gaining
deeper insights from data. While these advanced metrics provide the potential
for more nuanced analysis, they also come with practical considerations.
When working with derived measures, it's important to balance the benefits of
pre-calculated metrics with the increased storage they require. Non-additive and
semi-additive measures demand attention to data integrity and query
performance. Techniques like weighted averages, fact table augmentation, and
separate fact tables provide diverse solutions, each with its own set of
advantages and challenges.
In the realm of semi-additive measures, choosing between ThoughtSpot
aggregation functions and extending the data model involves careful planning
to ensure scalability and accuracy.
List of Figures
FIGURE 1 A SAMPLE STAR SCHEMA MODEL ................................................................................................................ 21
FIGURE 2 SAMPLE SNOWFLAKE SCHEMA ...................................................................................................................... 22
FIGURE 3 SAMPLE GALAXY SCHEMA .............................................................................................................................. 23
FIGURE 4 SAMPLE FACT CONSTELLATION ..................................................................................................................... 24
FIGURE 5 FLAT TABLE EXAMPLE - NORMALIZED APPROACH ....................................................................................... 26
FIGURE 6 FLAT TABLE EXAMPLE - DENORMALISED...................................................................................................... 26
FIGURE 7 SAMPLE OLAP CUBE .................................................................................................................................... 28
FIGURE 8 SAMPLE OLTP/TRANSACTIONAL MODEL .................................................................................................... 30
FIGURE 9 SAMPLE DATA VAULT MODEL ...................................................................................................................... 33
FIGURE 10 FACT TABLE TYPES ...................................................................................................................................... 53
FIGURE 11 DIMENSION TYPES ...................................................................................................................................... 73
FIGURE 12 SAMPLE CHASM TRAP MODEL ................................................................................................................... 86
FIGURE 13 STANDARD CHASM TRAP/BRIDGE TABLE ................................................................................................... 87
FIGURE 14 NESTED CHASM TRAP ................................................................................................................................. 87
FIGURE 15 CHAINED CHASM TRAP ............................................................................................................................... 87
FIGURE 16 SAMPLE FAN TRAP ....................................................................................................................................... 87
FIGURE 17 CONTENTS OF THE PRODUCTION TABLE .................................................................................................... 88
FIGURE 18 CONTENTS OF THE SALES TABLE ............................................................................................................... 88
FIGURE 19 CONTENTS OF THE DATE DIMENSION ........................................................................................................ 88
FIGURE 20 CONTENTS OF THE PRODUCT DIMENSION ................................................................................................. 88
FIGURE 21 CONTENTS OF THE LOCATION DIMENSION ................................................................................................ 89
FIGURE 22 EXECUTING THE WRONG QUERY CAUSING OVERCOUNTING..................................................................... 90
FIGURE 23 CORRECT RESULTS FROM THE CHASM TRAP ............................................................................................. 91
FIGURE 24 SAMPLE DATA FOR THE CUSTOMER TABLE ................................................................................................ 92
FIGURE 25 SAMPLE DATA FOR THE ORDER TABLE ....................................................................................................... 92
FIGURE 26 SAMPLE DATA FOR THE ORDER DETAIL TABLE .......................................................................................... 92
FIGURE 27 RUNNING A SINGLE PASS QUERY AGAINST A FAN TRAP ........................................................................... 93
FIGURE 28 CORRECT RESULTS OF THE FAN TRAP ........................................................................................................ 95
FIGURE 28 ROLE PLAYING DIMENSIONS ..................................................................................................................... 99
FIGURE 29 MULTIPLE JOIN PATHS ............................................................................................................................. 100
FIGURE 31 ELIMINATING A JOIN PATH ....................................................................................................................... 100
FIGURE 32 SPLIT UP THE BRANCH TABLE................................................................................................................... 100
FIGURE 32 CHOOSING A JOIN PATH ........................................................................................................................... 104
FIGURE 33 INVENTORY AGGREGATION LEVELS ......................................................................................................... 106
FIGURE 35 HIGH LEVEL MODEL USING THE RANGE JOINS ........................................................................................ 107
FIGURE 36 THE WEEK DIMENSION ............................................................................................................................. 107
FIGURE 37 DEFINING THE RANGE JOIN IN TML ........................................................................................................ 107
FIGURE 38 THE JOINS DEFINED IN THOUGHTSPOT .................................................................................................. 107
FIGURE 39 HOW THE RANGE JOIN LOOKS IN TS UI ................................................................................................. 107
FIGURE 40 MODIFYING THE JOIN CONDITION IN TML TO CREATE THE BETWEEN .............................................. 107
FIGURE 41 WHICH PRODUCTS ARE RUNNING LOW ON SUPPLY? .............................................................................. 108
FIGURE 42 QUERY VISUALIZER .................................................................................................................................. 108
FIGURE 43 GENERATED SQL ..................................................................................................................................... 108
FIGURE 44 HOW DO MY SALES COMPARE TO STORE WITHIN A 5 MILE RADIUS? .................................................... 110
FIGURE 45 QUERY PLAN FOR HOW DO MY SALES COMPARE TO STORE WITHIN A 5 MILE RADIUS......................... 110
FIGURE 46 LOCATE NEARBY INVENTORY TO AVOID STOCK-OUT CONDITION ......................................................... 112
FIGURE 46 OUTER JOIN TO PRODUCTS ...................................................................................................................... 115
FIGURE 47 INCREASED COMPLEXITY WITH MULTIPLE OUTER JOINS ........................................................................ 115
FIGURE 49 USE A FACTLESS FACT TO AVOID OUTER JOINS...................................................................................... 119
FIGURE 49 SAMPLE HIERARCHIES .............................................................................................................................. 121
FIGURE 50 SAMPLE PRODUCT HIERARCHY ................................................................................................................. 122
FIGURE 51 SAMPLE GEOGRAPHICAL LOCATIONS HIERARCHY (RAGGED) ................................................................ 123
FIGURE 53 SAMPLE ORGANIZATION CHART (UNBALANCED HIERARCHY) ................................................................ 124
FIGURE 54 SAMPLE PRODUCT DIMENSION WITH HIERARCHY ................................................................................... 127
FIGURE 55 IMPLEMENTATION OF A FICTIONAL ORGANIZATION CHART.................................................................... 132
FIGURE 55 CONTENTS OF THE BRIDGE TABLE ........................................................................................................... 134
FIGURE 57 A WORKSHEET EXAMPLE FOR THE ORGANIZATIONAL HIERARCHY ......................................................... 134
FIGURE 57 CONTENTS OF THE FACT TABLE ............................................................................................................... 135
FIGURE 59 SEARCHING FOR INDIVIDUAL SALES FIGURES PER EMPLOYEE ............................................................... 135
FIGURE 60 UTILIZING THE HIERARCHY ...................................................................................................................... 136
FIGURE 61 UTILIZING THE VIEW AND HIERARCHY IN THE PIVOT ............................................................................. 138
FIGURE 61 HIERARCHICAL REPORTING: WHO REPORTS TO WHOM? ....................................................................... 139
FIGURE 63 A RAGGED HIERARCHY FOR A CHART OF ACCOUNTS .............................................................................. 140
FIGURE 64 NAVIGATING PARTICULAR NODES IN THE HIERARCHY ........................................................................... 144
FIGURE 65 ADDING THE CHILD NODE TO THE SEARCH ............................................................................................. 144
FIGURE 66 ANALYZING CASH SPECIFICALLY .............................................................................................................. 144
FIGURE 67 INCLUDING PARENT INFORMATION .......................................................................................................... 144
FIGURE 67 LIMITATION WHEN SELECTING MULTIPLE NODES ................................................................................... 145
FIGURE 69 OUR ORG CHART REVISITED WITH PATH ATTRIBUTES ........................................................................... 146
FIGURE 70 ADDING THE PATH STRINGS TO THE TABLES .......................................................................................... 146
FIGURE 70 USING GROUP_AGGREGATE AND OTHER FUNCTIONS ............................................................................ 147
FIGURE 71 OUR FICTIONAL ORG CHART WITH INDEX NUMBERS .............................................................................. 149
FIGURE 73 A DATA MODEL WITH MIXED GRAIN FACTS ............................................................................................. 169
FIGURE 74 SEARCHING THE MIXED GRAIN MODEL .................................................................................................... 169
FIGURE 75 RESULTS FROM THE SEARCH ON THE MIXED GRAIN MODEL .................................................................. 170
FIGURE 75 THE QUERY PLAN FOR THIS SEARCH ........................................................................................................ 170
FIGURE 77 CORRECTLY MODELING MIXED GRAIN ..................................................................................................... 171
FIGURE 77 MULTI-GRAIN DIMENSIONAL MODEL ....................................................................................................... 172
FIGURE 79 SAMPLE DATA FOR THE PRODUCT_DIMENSION TABLE .................................................................. 172
FIGURE 80 SAMPLE DATA FOR THE DATE_DIMENSION TABLE ........................................................................... 173
FIGURE 81 SAMPLE DATA FOR THE SALES_FACT TABLE ....................................................................................... 173
FIGURE 82 SAMPLE DATA FOR THE TARGET_FACT TABLE .................................................................................... 173
FIGURE 82 RUNNING A SEARCH QUERY TO GET SALES VS TARGET ......................................................................... 173
FIGURE 83 RESOLVING THE MULTI-GRAIN ISSUE BY SPLITTING DIMENSIONS ....................................................... 174
FIGURE 85 RERUNNING OUR ACTUALS VS TARGET SEARCH ..................................................................................... 175
FIGURE 86 KEY/VALUE PAIR SAMPLE DATA................................................................................................................ 177
FIGURE 87 SEARCHING KEY/VALUE PAIRS ................................................................................................................. 177
FIGURE 88 UTILIZING THE DATE DIMENSION FOR LOOK BACK PERIODS................................................................. 185
FIGURE 89 WORKSHEET MODEL (SHARE MEASURES) .............................................................................................. 187
FIGURE 90 WHAT IS THE SHARE OF MULO THIS YEAR VERSUS LAST YEAR DURING THE SAME PERIOD?............ 188
FIGURE 91 QUERY PLAN FOR WHAT IS THE SHARE OF MULO THIS YEAR VERSUS LAST YEAR DURING THE SAME
PERIOD? ............................................................................................................................................................... 188
FIGURE 92 HOW HAS THE SHARE CHANGED OVER TIME?......................................................................................... 189
FIGURE 93 HOW DOES THE SHARE COMPARE THIS YEAR VERSUS LAST YEAR TO ANOTHER RETAILER DURING THE
SAME PERIOD?..................................................................................................................................................... 189
FIGURE 94 A DATA MODEL FOR A MULTI-CURRENCY USE CASE ............................................................................... 193
FIGURE 95 MULTI-CURRENCY DATA RESULTS WHEN LOGGED IN AS ALICE ............................................................ 195
FIGURE 96 SAMPLE ECOMMERCE MODELS WITH TAGS ............................................................................................. 200
FIGURE 97 SAMPLE HEALTHCARE MODEL WITH DYNAMIC ATTRIBUTES ................................................................... 200
FIGURE 98 EAV TYPE MODEL FOR CAPTURING CUSTOMER PROFILES ..................................................................... 203
FIGURE 99 SAMPLE RETAIL DATA MODEL ................................................................................................................... 217
FIGURE 100 ADDING A DERIVED FACT TO THE FACT TABLE ..................................................................................... 217
FIGURE 101 DATA MODEL FOR RECORDING DEFECT RATES ..................................................................................... 220
FIGURE 102 QUERY PLAN FOR GROUP AGGREGATE .................................................................................................. 225
FIGURE 103 EXTENDED DATE DIMENSION ................................................................................................................ 226
FIGURE 104 FORMULAS TO ADD TO WORKSHEET ..................................................................................................... 226
FIGURE 105 SEARCHING FOR OUR BALANCE AT MONTH END................................................................................... 227
List of Tables
TABLE 1 EXAMPLE TRANSACTIONAL FACT TABLE ........................................................................................................ 40
TABLE 2 EXAMPLE SNAPSHOT FACT TABLE ................................................................................................................... 41
TABLE 3 EXAMPLE ACCUMULATING SNAPSHOT FACT TABLE ........................................................................................ 42
TABLE 4 EXAMPLE PERIODIC SNAPSHOT FACT TABLE .................................................................................................. 43
TABLE 5 EXAMPLE PARTITIONED FACT TABLE .............................................................................................................. 44
TABLE 6 KEY DIFFERENCES BETWEEN THESE FACT TABLE TYPES ............................................................................... 45
TABLE 7 EXAMPLE CUMULATIVE FACT TABLE ................................................................................................................ 46
TABLE 8 EXAMPLE AGGREGATED FACT TABLE .............................................................................................................. 47
TABLE 9 EXAMPLE DERIVED FACT TABLE ...................................................................................................................... 48
TABLE 10 KEY DIFFERENCES BETWEEN THESE FACT TABLE TYPES ............................................................................. 49
TABLE 11 EXAMPLE FACTLESS FACT TABLE .................................................................................................................. 49
TABLE 12 EXAMPLE BRIDGE FACT TABLE...................................................................................................................... 51
TABLE 13 KEY DIFFERENCES BETWEEN FACTLESS FACT TABLES AND BRIDGE FACT TABLES.................................... 51
TABLE 14 EXAMPLE MULTI-VALUE FACT TABLE ............................................................................................................ 53
TABLE 15 SUMMARY OF FACT TABLE TYPES ................................................................................................................. 55
TABLE 16 EXAMPLE ROLE PLAYING DIMENSION ........................................................................................................... 56
TABLE 17 EXAMPLE FIXED-DEPTH HIERARCHY ............................................................................................................ 57
TABLE 18 KEY DIFFERENCES BETWEEN ROLE-PLAYING DIMENSIONS AND FIXED DEPTH HIERARCHIES .................. 58
TABLE 19 EXAMPLE CONFORMED DIMENSION.............................................................................................................. 59
TABLE 20 EXAMPLE UNIVERSAL DIMENSION ................................................................................................................ 60
TABLE 21 KEY DIFFERENCES BETWEEN CONFORMED DIMENSIONS AND UNIVERSAL DIMENSIONS ......................... 61
TABLE 22 EXAMPLE DEGENERATE DIMENSION ............................................................................................................. 62
TABLE 23 EXAMPLE MINI-DIMENSION .......................................................................................................................... 64
TABLE 24 EXAMPLE JUNK DIMENSION .......................................................................................................................... 65
TABLE 25 EXAMPLE SHRUNKEN DIMENSION ................................................................................................................ 66
TABLE 26 EXAMPLE LATE-BINDING DIMENSION .......................................................................................................... 67
TABLE 27 SAMPLE COMPOSITE DIMENSION ................................................................................................................. 68
TABLE 28 KEY DIFFERENCES BETWEEN THESE DIMENSION TYPES ............................................................................. 69
TABLE 29 EXAMPLE SNAPSHOT DIMENSION ................................................................................................................. 70
TABLE 30 EXAMPLE CUSTOM DIMENSION .................................................................................................................... 71
TABLE 31 EXAMPLE DERIVED DIMENSION .................................................................................................................... 72
TABLE 32 DIFFERENCES BETWEEN SNAPSHOT, CUSTOM AND DERIVED DIMENSIONS .............................................. 73
TABLE 33 THE VARIOUS DIMENSION TYPES ................................................................................................................. 75
TABLE 34 ENROLLMENTS TABLE ................................................................................................................................. 101
TABLE 35 ENROLLMENTS FACT TABLE (WITH DENORMALIZED ATTRIBUTES) .......................................................... 102
TABLE 36 AGGREGATED TABLE - COURSE ENROLLMENT COUNT .............................................................................. 103
TABLE 37 CREATING DEFAULT MEMBERS FOR DIMENSIONS ..................................................................................... 118
TABLE 38 A BALANCED PRODUCT HIERARCHY ........................................................................................................... 125
TABLE 39 A GEOGRAPHICAL LOCATIONS HIERARCHY ............................................................................................... 130
TABLE 40 EMPLOYEE DIMENSION FOR SCD DESCRIPTIONS .................................................................................... 160
TABLE 41 UPDATING JOHN'S DEPARTMENT (SCD TYPE 1) ..................................................................................... 160
TABLE 42 UPDATING JOHN'S DEPARTMENT (SCD TYPE 2) ..................................................................................... 161
TABLE 43 UPDATING JOHN'S DEPARTMENT (SCD TYPE 3) ..................................................................................... 161
TABLE 44 UPDATING JOHN'S DEPARTMENT - EMPLOYEE TABLE (SCD TYPE 4) ..................................................... 162
TABLE 45 UPDATING JOHN'S DEPARTMENT - EMPLOYEE HISTORY TABLE (SCD TYPE 4) ..................................... 162
TABLE 46 USING SCD-4 AS A MINI-DIMENSION ..................................................................................................... 162
TABLE 47 CONTENTS OF THE EXCHANGE_RATE TABLE ...................................................................................... 194
TABLE 48 CONTENTS OF THE TRANSACTION TABLE ............................................................................................ 194
TABLE 49 CONTENTS OF THE USER TABLE ............................................................................................................... 194
TABLE 50 CONTENTS OF THE USER_CURRENCIES TABLE .................................................................................. 194
TABLE 51 SAMPLE DATA FOR TBL_CUSTOM_FIELD_VALUE ............................................................................ 204
TABLE 52 SAMPLE DATA FOR TBL_CUSTOM_FIELD ............................................................................................ 204
TABLE 53 IMPLEMENTING LATE-BINDING ATTRIBUTES USING DENORMALISED ATTRIBUTES ................................. 205
TABLE 54 PROFILE QUESTIONS TABLE USING VARIANT ............................................................................................ 206
TABLE 55 JSON RESULTS UNPACKED ........................................................................................................................ 208
TABLE 56 PROS AND CONS OF THE VARIOUS LATE-BINDING ATTRIBUTE MODELING TECHNIQUES ....................... 208
TABLE 57 TEST DATA FOR DIM_PRODUCT ............................................................................................................ 220
TABLE 58 TEST DATA FOR DIM_DATE .................................................................................................................... 220
TABLE 59 TEST DATA FOR FACT_DEFECT_RATE................................................................................................. 221
TABLE 60 DEFECT_RATE IS NON-ADDITIVE .......................................................................................................... 222
TABLE 61 SEARCHING FOR THE DEFECT RATE PER QUARTER ................................................................................... 223
TABLE 62 SEARCHING FOR THE DEFECT RATE PER YEAR .......................................................................................... 223
TABLE 63 AUGMENTED FACT TABLE........................................................................................................................... 224