Modern data engineering playbook
Shifting mindsets: why you should treat data as a product
Quality is king: finding the value in your data test strategy
Introduction
Data has been dubbed ‘the new oil’ – highly valuable, and able to power entire industries. But unless
it’s handled carefully and strategically, it can be worthless or even detrimental to your business.
No matter what business you’re in, data can give you a clear point of difference and competitive
advantage. And now that it is also the by-product of every digital step we take, your teams can
collect and store more data than ever – using it to make better strategic decisions, and to create
tailored, frictionless customer experiences.
However, as our need for insights at speed increases, centralized data platform architectures are
failing to keep up. They lack the flexibility to enable dispersed teams to make informed and timely
decisions. With the added challenges of governance, privacy, security and data quality,
organizations are struggling to manage the complexity of operationalizing their data assets.
The modern data stack (MDS) – a collection of tools and patterns used for data integration – has
emerged to address these challenges. It helps you analyze data, improve efficiencies, and unearth
new opportunities. And there is a growing array of off-the-shelf tools to choose from, avoiding the
need for custom-built solutions. In turn, that enables lower costs and faster time to production.
But there is more to reaping the benefits of MDS than having the right tools. Its success is
underpinned by modern engineering practices and software delivery principles that allow you to
accelerate development, de-risk large projects and continuously improve products. You also need
the right skillsets across your teams, so they can make data more accessible, target the right
problems and build the right products. And of course, valuable data intelligence relies entirely on the
quality of data.
These are the things we address in this playbook, helping you save time, reduce risks, and improve
the return on investment of your data projects.
Learn how shifting to a data product mindset will help you build the right thing and build the thing right
– and how to assemble the right team to make it happen. Explore practices and principles that will
speed up production, and find out how to save time by catching data quality issues early. And discover
how you can embed security and privacy from the start to improve the quality of your product, build
trust with your customers and allow you to move faster.
Shifting mindsets:
why you should treat
data as a product
By Keith Schulze and Kunal Tiwary
Today, organizations are increasingly recognizing the potential value of data – yet many fail to realize
a return on investment from their data assets.
When it comes to developing data assets, many organizations take a ‘build it and they will come’
approach. While this philosophy may have paid off for Kevin Costner in Field of Dreams, organizations
might not be so lucky. Because there is one inherent shortcoming of the approach: it doesn’t consider
the needs of data users.
What if we flipped the mindset, and considered some valuable user-centric lessons from our product
teams? What if we managed data as a product – not just an asset? We’re seeing this shift in perception
gain traction, allowing organizations to unlock more value from data projects.
Rethinking data
The data as a product mindset is one of the four principles of data mesh, a style of data management
which decentralizes project architecture models. Data as a product treats the data users as
customers, developing data products to bring them value and help them achieve their end goals. For
example, if your customer’s end goal is to reduce churn rate by 10%, you will need to start with that
goal and work backwards – and develop a churn forecasting data product that will meet this need.
Thinking of data as a product means putting those user needs at the heart of their design. It’s
designed to be shared – not controlled.
Zhamak Dehghani, author of Data Mesh: Delivering Data Value at Scale, and founder of the data mesh
concept, describes it this way: “Data as a product is very different from data as an asset. What do you
do with an asset? You collect and hoard it. With a product it’s the other way around. You share it and
make the experience of that data more delightful.”
Product thinking requires a deep knowledge and understanding of your customer. Your teams can
then build for real world problems – and continuously develop products that offer more value.
That’s why it’s so critical to start by knowing who your customer is and what is most valuable to them.
What problems are they trying to solve? What’s at stake for them if they can’t use or access the data
easily? Those customers might be internal or external – the key is to think beyond simply offering data
sources and expecting users to adapt or compromise the way they work to use them.
Unfortunately, there’s no silver bullet here. It takes time to understand your customers and their goals,
and involves real-world testing and constant refinement. And once you’ve solved that for one group of
customers, how do you scale and expand this? Can you make those products reusable, satisfying the
needs of a broader range of customers?
At Thoughtworks, we have adapted the Double-Diamond design process model to make sure that we
build the right thing and build it right. This starts with identifying what a customer needs. We use a
structured discovery and inception process to uncover these requirements for any new data product.
We then apply a set of well-understood practices and tools that are known to deliver high-quality
software and data.
[Figure: Double-Diamond design process – Strategy (understand the whys and hows) feeds into Solution (execute the outcome)]
Much like software products, data products benefit from a responsible and accountable team
who continuously improve performance and release new features in a safe environment. This also
shortens the feedback loops needed to evolve or enhance these products, and encourages direct
communication between the producer and the consumer of data products – cutting out lengthy and
convoluted central planning processes.
The owners of a data product are also accountable for maintaining agreed levels of service. This is
important because without clear accountability, there might be complex processes and competing
priorities to contend with when services go down.
For example, retail organizations use a number of metrics to facilitate demand planning (e.g. forecast
accuracy, order fill rate). Different teams depend on these metrics to forecast and provision stock
to meet demand. Any delays or errors in reporting can have severe impacts on downstream
business processes, leading to unhappy customers and either a loss of revenue or a surplus of stock
that comes at a cost to the business.
As a business evolves, there may be other demand planning metrics that would allow for more
accurate forecasts; any delay in implementing these also means a sacrifice in potential profit.
Businesses need to continuously evolve their demand planning process to use the most accurate
metrics – and ensure that the metrics are reliable and high quality. Any error should be fixed promptly
to minimize the impact on downstream consumers.
Having a team with clear accountability to develop, evolve and maintain these metrics as a product
with high levels of service ensures that:
1. You’re always providing downstream consumers with the most accurate metrics to
facilitate demand planning, and
2. You minimize the impacts of outages on other teams in the demand planning process.
Moving away from teams aligned to archetypes or skill sets, to small product-oriented teams with
tightly focused goals is one way to get there. These teams may require a blend of different capabilities
– such as data engineers, data scientists, QAs and designers – to develop a product that meets the
needs of customers.
Creating a culture where learning from failure is embraced and celebrated is also critical to the
success of developing effective data products. Finding what doesn’t work, or where friction points lie,
allows teams to adjust their thinking and approach for future projects – and continually improve
products and customer experience along the way.
Discoverable: If you want a customer to use a data product, they need to be able to find it.
Discoverability can take many forms, from a primitive list of datasets on an internal wiki system
to a full-fledged data catalog. Irrespective of the implementation, catalogs should house
important meta information about the data products such as their owners, source of origin,
lineage, and sample datasets.
Addressable: Data products should also have a single unique address where they can be
found. This makes it easy for your customer to find them – and reduces the time your data
teams spend on helping people locate them.
Self-describing and interoperable: Data products should provide users and automated
systems with metadata in a way that allows users to self-service. For example, a data product
should expose metadata describing the data sources used to build the data product, and the
schema and information about outputs of the data product.
Trustworthy and secure: For a customer to use a data product confidently, the product
should commit to an agreed level of trustworthiness (or quality). This means defining a set of
Service Level Objectives (SLO) and measurable Service Level Indicators (SLI) upfront – and
implementing automated mechanisms to test and report SLI metrics regularly.
Secure and governed by a global access control: With the rapid rise in data breaches, it has
never been more important to secure your data products and build in security. While registered
data sets should be automatically discoverable to all customers, they should not be accessible
by default. Users will need to request access to each dataset they need, with data owners
granting or denying access individually, using federated access controls.
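To make these characteristics concrete, the sketch below shows what a data product’s published metadata might look like. It is a minimal, hypothetical descriptor – the class, field names and the churn-forecast example are our own illustration, not a standard – covering discoverability (owner, lineage), addressability (a single unique address), self-description (output schema) and a simple freshness SLI checked against an agreed SLO:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class DataProductDescriptor:
    """Hypothetical metadata a data product might publish to a catalog."""
    name: str                      # discoverable: listed in the catalog under this name
    owner: str                     # who is accountable for the product
    address: str                   # addressable: one unique location for consumers
    source_systems: list[str]      # self-describing: lineage back to the sources
    output_schema: dict[str, str]  # self-describing: column name -> type
    freshness_slo: timedelta       # trustworthy: agreed maximum data age

    def freshness_sli_met(self, last_updated: datetime) -> bool:
        """SLI check: is the data fresher than the agreed SLO allows?"""
        return datetime.now(timezone.utc) - last_updated <= self.freshness_slo

churn_product = DataProductDescriptor(
    name="customer-churn-forecast",
    owner="subscriptions-data-team",
    address="warehouse://analytics/subscriptions/churn_forecast",
    source_systems=["crm", "billing"],
    output_schema={"customer_id": "string", "churn_probability": "float"},
    freshness_slo=timedelta(hours=24),
)

# Data updated two hours ago meets a 24-hour freshness SLO.
print(churn_product.freshness_sli_met(datetime.now(timezone.utc) - timedelta(hours=2)))
```

In practice a catalog tool would store and surface this metadata; the point is only that each of the characteristics above maps to something machine-readable that can be tested automatically.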
Engineering practices
to accelerate your data
product delivery
By David Tan and Mitchell Lisle
Approaching the development of data products as you would approach building software is a good
starting point. But data products are typically more complicated than software applications because
they are software and data intensive. Not only do teams have to navigate the many different
components and tools that are part and parcel of software development, they also need to grapple
with the complexity of data. Given this additional dimension, it can be all too easy for teams to get
mired in cumbersome development processes and production deployments, leading to anxiety and
release delays.
At Thoughtworks, we find that intentionally applying “sensible default” engineering practices allows us
to deliver data products sustainably and at speed. In this article, we’ll dive into how this can be done.
From development and deployment to operation, sensible default practices help us build the thing
right. These practices include:
As we will elaborate later in this chapter, these practices are essential in managing the complexity of
modern data stacks and accelerating value delivery because they give teams the following
characteristics, which help them deliver quality at speed:
Fast feedback: Find out whether a change has been successful in moments, not days.
Whether it’s knowing unit tests have passed, you haven’t broken production, or a customer is
happy with what you’ve built.
Simplicity: Build for what you need now, not what you think might be coming. This lets you limit
complexity, while enabling you to make choices that allow your software to rapidly change and
meet upcoming requirements.
Repeatability: Have the confidence and predictability that comes from removing manual tasks
that might introduce inconsistencies – and spend your time on what matters, not on troubleshooting.
Similar to the practical test pyramid for software delivery, the practical test data grid (Figure 1) helps
guide how and where you invest your effort to get a clear, timely picture of either data quality or code
quality, or both. The grid considers the following data-testing layers:
• Point data tests capture a single scenario which can be reasoned about logically, for
instance a function that counts the number of words in a blog post. These tests should be
cheap to implement and there should be many of them, to set expectations in a range of
specific circumstances.
• Sample data tests provide valuable feedback about the data as a whole without processing large
volumes. They allow you to understand fuzzier expectations and variation in data, especially over
time. While they bring additional complexity and require some threshold tuning, they will uncover
issues point tests don’t capture. Consider using synthetic samples for these tests.
• Global data tests uncover unanticipated scenarios by testing against all available data. They’re
also the least targeted, most subject to outside changes, and most computationally expensive.
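The three layers can be illustrated with a small sketch. The word-count function and the thresholds below are hypothetical examples, not from a real pipeline: a point test pins down one scenario exactly; a sample test asserts a fuzzier expectation within a tuned band; a global test scans all available records for scenarios nobody anticipated:

```python
def word_count(post: str) -> int:
    """Toy transformation under test: count the words in a blog post."""
    return len(post.split())

# Point data test: a single scenario with an exact, cheap-to-check expectation.
assert word_count("data as a product") == 4

# Sample data test: a fuzzy expectation over a (possibly synthetic) sample,
# with a threshold that needs tuning over time.
sample = ["short post", "a slightly longer blog post", "one more example post here"]
avg_words = sum(word_count(p) for p in sample) / len(sample)
assert 2 <= avg_words <= 10, "average word count drifted outside the expected band"

# Global data test: run against all available data to surface edge cases,
# such as empty posts or unusual whitespace.
all_posts = sample + ["", "edge-case \t whitespace   post"]
assert all(word_count(p) >= 0 for p in all_posts)
```

Note how cost rises up the layers: the point test is trivial to run on every commit, while the global test over real data volumes may only be affordable as a scheduled check.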
[Figure 1: Practical data test grid – a code axis (unit tests, service tests) crossed with a data axis (point, sample and global data tests), with the fewest tests at the broadest, most expensive layers]
You can apply these tests to data alone or combine them with code tests to verify various stages
of data transformation – in which case you would consider the two dimensions as a practical data
test grid. Again, this is not a prescription, you needn’t fill every cell and the boundaries aren’t always
precise, but this grid helps direct our testing and monitoring effort for fast and economical feedback
on quality in data-intensive systems.
• In the code plane, code flows along the Y-axis, from bottom to top, between environments (e.g.
development, test, production). In traditional software engineering terms it’s a CI/CD pipeline – the
flow that software engineers are typically familiar with.
• In the data plane, data flows along the X-axis from left to right in each environment where data
is transformed from one form to another. This is a data pipeline and is something data experts
understand very well.
• In the reverse data plane, data flows along the Y-axis in the opposite direction of the code
plane. You can bring samples of production data into test environments using privacy-preserving
or obfuscation techniques such as masking or differential privacy – or you can create
purely synthetic data modeled on samples of production data.
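A minimal sketch of the reverse data plane, assuming a hypothetical customer record layout: direct identifiers are masked with a salted one-way hash before a production sample is copied down into a test environment. A real implementation would use a proper de-identification tool and keep the salt in a secret manager; this only shows the shape of the technique.

```python
import hashlib

SALT = "replace-with-a-secret-salt"  # assumption: held in a secret manager, never in code

def mask(value: str) -> str:
    """One-way masking of a direct identifier via a salted hash."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:12]

def to_test_environment(record: dict) -> dict:
    """Copy a production record down the reverse data plane, masking PII fields."""
    pii_fields = {"name", "email"}  # hypothetical schema for illustration
    return {k: mask(v) if k in pii_fields else v for k, v in record.items()}

prod_record = {"name": "Ada Lovelace", "email": "ada@example.com", "churn_score": 0.12}
test_record = to_test_environment(prod_record)
print(test_record["churn_score"])  # non-sensitive values pass through unchanged
```

Because the masking is deterministic, joins across masked datasets still work in the test environment, while the original identifiers never leave production.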
[Figure: Data pipelines across environments – in production, the data plane transforms production source data into target data; in a pre-production environment, the same transformation runs on production-like data; in development, it runs on mock data. The code plane promotes the transformation code between these environments.]
However, the historical data needed to train machine learning models to create forecasts was locked
up in operational databases, and the existing data store could not scale to serve this new use case.
We applied sensible default engineering practices to build a scalable streaming data architecture for
Company Y. The pipelines ingested data from the operational data store to an analytical data store to
service the new product feature.
• Our team built the artifacts once and deployed them to each environment through automated
deployment. We applied infrastructure as code (i.e. application infrastructure, deployment
configuration and observability are specified in code) and provisioned everything automatically. If the
build broke in the pre-production environment, we either fixed it quickly (in 10 minutes or less) or
rolled the change back.
• Post-deployment tests made sure everything was working as expected, giving us the
confidence to let the CI/CD pipeline automatically deploy the changes to production (i.e.
continuous deployment).
• Observability and proactive notifications allowed everyone on the team to know the health of
our system at any time. And logging with correlation identifiers helped trace an event through
distributed systems and observe the data pipelines.
• Where necessary, we refactored (rewriting small pieces of code in existing features) and managed
tech debt as part of stories and iterations, not as an afterthought buried in the backlog.
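The fix-fast-or-roll-back discipline described above can be sketched as a small deployment routine. The deploy, rollback and check functions here are hypothetical placeholders; in a real setup this logic would live in your CI/CD tooling:

```python
import time

def deploy(version: str) -> None:
    print(f"deploying {version}")  # placeholder for the real deployment step

def rollback(version: str) -> None:
    print(f"rolling back to {version}")  # placeholder for the real rollback step

def post_deployment_checks() -> bool:
    """Placeholder smoke tests: hit health endpoints, run data sanity queries."""
    return True

def deploy_or_roll_back(new: str, previous: str, timeout_s: int = 600) -> bool:
    """Deploy, verify within the timeout (e.g. 10 minutes), roll back on failure."""
    deploy(new)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if post_deployment_checks():
            return True   # checks pass: the change stays
        time.sleep(30)    # give the system time to settle, then re-check
    rollback(previous)    # time is up: restore the last known-good version
    return False
```

The key design choice is that the rollback path is exercised by the pipeline itself, so recovering from a bad change is routine rather than an emergency.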
For more complex changes, our team uses feature toggles, data toggles or blue-green deployment.
If we are averse to committing code to the main branch, it’s usually a sign that we’re missing some of the
essential practices mentioned above. For instance, if you’re afraid your changes might cause
production issues, make your tests more comprehensive. And if you need another pair of eyes to
review pull requests, consider pair programming to speed up feedback. It’s important to note that
trunk-based development (TBD) is a practice that’s only possible if we have the safety net and quality
gates provided by the preceding practices.
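Feature and data toggles like those mentioned above can be as simple as a configuration lookup that decouples deploying code from releasing it. A minimal sketch, with toggle and metric names invented for illustration:

```python
# A data toggle: the same pipeline code reads the old or new metric per environment,
# so the new code can ship to production "dark" and be switched on when ready.
TOGGLES = {
    "use_new_demand_metric": {"test": True, "production": False},
}

def is_enabled(toggle: str, environment: str) -> bool:
    """Unknown toggles and environments default to off."""
    return TOGGLES.get(toggle, {}).get(environment, False)

def demand_metric(environment: str) -> str:
    if is_enabled("use_new_demand_metric", environment):
        return "forecast_accuracy_v2"  # new metric, validated in test first
    return "forecast_accuracy_v1"      # current production metric

print(demand_metric("production"))  # -> forecast_accuracy_v1
print(demand_metric("test"))        # -> forecast_accuracy_v2
```

In a real system the toggle state would come from a config service rather than a hard-coded dictionary, so it can be flipped without a redeploy.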
In our next chapter, we’ll look at how to build a team to help you build the right thing.
Consider your favorite sporting team: each person brings a unique set of skills and experience to the
team and plays a specific role. Powerful and accurate, quarterbacks call the shots on the field, while
linebackers defend with strength and speed. Versatile midfielders connect the two and keep the ball
moving smoothly. To be a successful team, these roles and skills must work together seamlessly.
Data teams work similarly. As an integral part of larger processes, a successful team will bring
together complementary skills, experience and knowledge to deal with critical questions such as:
• What new markets, services, or cost optimizations is the business embarking on?
• How could data help our people make these decisions?
• Is this data available to key decision makers in a form that is understandable?
As the modern data value chain evolves, organizations are making a fundamental shift in how they
think about data – and how they build teams around it. As you focus on democratizing data across
your organization, think about how you can align your internal structures around the people who make
up your data teams. Here’s what we have found useful in building data teams for specific products.
• Subscriptions: forecasting churn and helping predict which customers are likely to
cancel their subscription
• Content: content recommendations and ranking
• Player: client related statistics
• Payment: fraud detection
Much like software teams, data product teams own their products collectively. Each product should
have a nominated product owner who acts as the team’s ambassador and key communicator
to stakeholders and other data product teams. They drive the product roadmap and lifecycle,
communicating expectations and facilitating collaboration.
One of the most effective ways to build your data team is to develop roles that focus on specific
aspects of the product’s infrastructure, development and lifecycle. Each role brings a different set of
skills, strengths, and approaches needed to create value from data. The nature of the data product
you want to build will dictate the roles you’ll need.
Developing effective data products requires a specialist team with a set of multidisciplinary skills, experience and knowledge.
It’s important to note that a role is distinct from an individual who plays a part in your team. In fact,
some people may fall under multiple roles. You should consider the roles needed to build the product,
and those that will operate and maintain it. Different teams will:
• Create data products and make sure they’re delivered through reliable data pipelines
• Use data products and combine them with advanced analytics to create new business value
• Ensure data products are dependable and run smoothly
In some organizations it’s also common to see platform teams with specialized knowledge in certain
technical competencies such as infrastructure or data science. This helps reduce the cognitive load
on a data product team, because team members don’t need to be specialists outside of their skill sets.
Roles in practice
Consider a scenario where you need to build a new financial services offering for a retailer. Your goal
is to serve aggregated customer credit history information to an external entity, safely and securely, to
help them make credit lending decisions.
Let’s consider some of the roles you will need to build and evolve the customer credit history data
product into the future.
• Product owner: The product owner is accountable for maximizing your data product’s value. They
also play an important role in supporting the team at key decision and prioritization points. They
will help prioritize which insight should be built first, based on the return on investment of the
feature and the effort involved.
• Business analyst: They play a vital role in understanding and aligning the value of your data
product with the needs of customers and the wider business.
• Data engineer: They build the pipelines that source data from several internal systems, and
transform and aggregate the data into a form that supports credit decision making.
• Infrastructure engineer: They build reusable and scalable infrastructure around the data product
to facilitate reproducibility and continuous integration/deployment, and automate as much as possible.
• Backend engineer: They build business logic in the form of data APIs to make data integrations
with UI, other products and visualisation tools easy.
• Quality assurance: They are accountable for maintaining data and product quality. This role is
essential to build trust with the customer.
• Data scientist: Depending on how well-defined the credit decision-making process is, and whether
we know what data is required, this role may be needed to help build that understanding.
Remember, you don’t need dedicated people for every role. You might have experienced team
members who fit many roles, or some of your core platforms might play a supporting role. For
example, you might ask someone to be a security and privacy champion and make them accountable
for making sure that the team follows good security practices. They can work closely with a core
security platform team to run important activities like setting security objectives or running threat
modeling sessions.
The skills and knowledge required at each stage of the data product’s lifecycle are different – so the
roles you’ll need will change along the way.
Courage to speak up: Speaking up can take many forms. For example, questioning why we
are doing things that aren’t aligned towards our values and goals. Or raising challenges and
providing constructive suggestions in a team retrospective (assuming a psychologically safe
environment) as the first step towards continuous improvement.
Defining what success looks like for your data team is critical – and goals should focus on enabling
decisions (outcomes) rather than the number of dashboards built (activities). For the Subscription
domains in the Netflix example above, a success metric might be to reduce churn by 10% over the
next quarter – rather than measuring the number of data points acquired to produce that metric.
Holding data team sessions is a great way to communicate these goals and how they fit in with the
rest of the organization and build internal understanding of the teams’ functions and roles.
[Figure: Habits of effective data teams – measure to improve; focus on customers and high-value work; focus on developing a learning culture; leverage what is working and bake in repeatability; seek to shorten the line from ideation to value; and embrace a test-and-learn culture that seeks to reduce uncertainty]
As the managers and coaches of elite sporting teams know, an unbalanced team can fall short
when it’s time to perform. We’ve seen many organizations fail to realize the importance of balancing
various skill sets when forming their data teams. This often leads to poorly engineered products or
user experience – or turning products into a feature factory without considering the use case and
customer value.
When you can draw on a broad range of expertise, and have the right structures to get the most value
out of each skill set, your teams can target the right problems and thrive. As the famous saying goes,
teamwork divides the task and multiplies the success.
In a previous chapter, we took a deep dive into the engineering practices that will help you deliver your
data products quickly, safely and sustainably. Now we’ll explore delivery planning principles to guide
how we shape and sequence our work. These enable teams to add incremental value, de-risk large
projects, and find opportunities for continuous improvement when delivering data-intensive products.
Note: We won’t touch on typical iteration planning and release planning activities, such as story
estimates and tracking cycle time, as they are becoming increasingly common in the industry. Those
practices are still important, and we want to complement that by sharing additional principles and
practices that provide an intentional focus on creating value.
The downside of this approach is that users can only provide valuable feedback after significant
investments of time and effort. It also leads to late integration issues when horizontal slices eventually
come together, increasing the risk of release delays.
So, how can we plan and sequence our work to release value early – and often? How can we
avoid the quicksand of data engineering for the sake of data engineering, and create a cadence of
demonstrable business value in every iteration? The answer is by flipping our thinking, and slicing
work vertically.
1. At the level of stories, you articulate and demonstrate business value in all stories, and ensure
that the majority of stories are independently shippable units of value.
2. At the level of iterations, you regularly demonstrate value to users by delivering a collection of
vertically sliced stories within a reasonable timeframe.
3. At the level of releases, you plan, sequence and prioritize a collection of stories that’s oriented
towards creating demonstrable business value.
[Figure: Horizontal vs vertical slicing – rather than building out whole horizontal layers (IaaS, PaaS, data products, dashboards, channels) one at a time, a vertical slice cuts through every layer, from infrastructure up to the customer experience]
Tracing a thin slice of value through the components of a data ecosystem and delivering value in
vertically sliced user stories enables you to test and learn more cost-effectively, solve customer
problems more efficiently, and deliver value sooner. We will describe what this looks like in action in
the case study below.
[Figure: Knowledge–risk quadrants – known knowns, known unknowns, unknown knowns and unknown unknowns, mapped against available data (low to high) and risk (low to high), with unknown unknowns carrying the highest risk]
In essence, DDHD (data-driven hypothesis development) is about formulating hypotheses, running
small experiments with clear outcomes and criteria, and using the data we collect to tell stories and
share lessons with your team, business and stakeholders. The case study below will illustrate this in
action, but for now, this is what a hypothesis looks like:
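One common shape for such a hypothesis is “we believe that X will result in Y; we’ll know we have succeeded when we see Z”. The sketch below captures that template as a small structure, with example values that are entirely our own invention:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """A 'we believe / will result in / we'll know when' hypothesis template."""
    we_believe_that: str
    will_result_in: str
    we_will_know_when: str

    def __str__(self) -> str:
        return (f"We believe that {self.we_believe_that} "
                f"will result in {self.will_result_in}. "
                f"We'll know we have succeeded when {self.we_will_know_when}.")

example = Hypothesis(
    we_believe_that="exposing a churn-forecast data product to the retention team",
    will_result_in="earlier, better-targeted retention offers",
    we_will_know_when="churn drops measurably within one quarter",
)
print(example)
```

Writing the success signal down before the experiment starts is what keeps the experiment honest: the outcome is judged against a criterion chosen in advance, not one fitted afterwards.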
DDHD creates a space and a framework for teams to run short experiments, to learn as we deliver
value incrementally, and to continuously apply our lessons learned to have greater impact and reduce
risks of costly and unvalidated investments.
The four key metrics provide insight into the flow and friction of value delivery, and they’re a great
starting point for identifying what is working well, and what needs to be improved. For organizations
that do not have the platform tooling that allows teams to measure the four key metrics, you can get
started simply by surveying teams regularly with DORA’s quick check tool. Even though the precision
is not as high, this method gives you a first indication of where your organization stands, and what
the trends are.
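As a sketch of what measuring these might look like once tooling is in place, the four key metrics can be derived from deployment and incident records. The record layout and the numbers below are invented for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical deployment log: (deployed_at, commit_time_of_oldest_change, failed?)
deployments = [
    (datetime(2024, 1, 8),  datetime(2024, 1, 5),  False),
    (datetime(2024, 1, 9),  datetime(2024, 1, 8),  True),
    (datetime(2024, 1, 11), datetime(2024, 1, 10), False),
    (datetime(2024, 1, 12), datetime(2024, 1, 11), False),
]
# Hypothetical incident log: (started_at, restored_at)
incidents = [(datetime(2024, 1, 9, 10), datetime(2024, 1, 9, 13))]

# Deployment frequency: deployments per day over the one-week window.
deployment_frequency = len(deployments) / 7
# Lead time: average time from commit to running in production.
lead_time = sum(((d - c) for d, c, _ in deployments), timedelta()) / len(deployments)
# Change fail percentage: share of deployments that failed.
change_fail_percentage = 100 * sum(f for *_, f in deployments) / len(deployments)
# Mean time to restore: average incident duration.
mean_time_to_restore = sum(((r - s) for s, r in incidents), timedelta()) / len(incidents)

print(change_fail_percentage)  # -> 25.0
print(mean_time_to_restore)    # -> 3:00:00
```

The computations are trivial; the hard part in practice is capturing deployment and incident events reliably, which is exactly what the survey-based quick check works around.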
In addition, you should also measure outcomes-oriented metrics that are relevant to your
organization, such as improvement in efficiency and customer satisfaction. Outcomes-oriented
metrics help to align teams towards activities that contribute to organizational goals rather than
busywork. The diagram below gives an example of what outcomes-oriented metrics might look like in
the insurance domain.
The four key metrics*:
• Mean time to restore (less is good): on average, how long (hours, days, weeks or months) it takes to restore service.
• Change fail percentage (less is good): what percentage of changes to production had to be rolled back or hotfixed/patched.
• Lead time (less is good): on average, how long (hours, days, weeks or months) it takes for committed code to reach production.

Additional engineering metrics:
• Build failure rate (less is good): in a specified interval, what percentage of builds failed for any reason.
• Security warnings (less is good): how many security warnings (of high, medium or low severity) are currently reported by the underlying tool.
• Tech debt (less is good): how long (days, weeks or months) it would take to rewrite the code that resulted from prioritizing speedy delivery over perfect code.

Outcomes-oriented metrics**:
• Future sensing (more is good): how many alternatives were identified that made a product usable and useful.
• Improvement in efficiency and effectiveness (less is good): for a delivery analyst, the mean time to identify a claim processing issue.
• Improvement in customer experience (less is good): the number of issues reported by the processing agency.

*Based on the book Accelerate by Nicole Forsgren, Jez Humble and Gene Kim
**Examples from an insurance processing project
However, these metrics should be undergirded by a healthy organizational culture and value-oriented
mindset. Otherwise, we can fall into the trap of dysfunctional metrics. As Goodhart’s law states: “When
a measure becomes a target, it ceases to be a good measure.” And keep in mind the Hawthorne effect
– when your team knows they’re being measured, there might be a reflex to bend the rules and find
loopholes to meet the targets.
Rather than wait for a data platform to be completed (i.e. horizontal slicing), we applied lean product
development and data mesh principles to build a data product that provides real customer value in
the short term, while supporting extensibility towards a shared platform in the medium to long term.
This allowed us to deliver a low-risk solution fast (in just over 4 months), while also providing valuable
insight into an ongoing data platform development effort.
[Figure: Lean product development cycle – customer need → build → evolve → improvement opportunity]
Phase 1: Discovery
With the help of domain experts, we refined Company X’s business requirements to outline data
sources needed for the credit history product. We found we only needed seven (out of 150) tables
from the source database to deliver the minimum requirements. This reduced data ingestion efforts,
as we didn’t need to process or clean unnecessary data. Over 6 weeks, we also refined the features
and cross-functional requirements of the customer credit history data product, and aligned on the
intended business value.
We articulated hypotheses to help us find the shortest path to the ‘correct’ solutions. These
hypotheses helped us stay on the right track towards our goal of building the right product. For
example, we could validate our approach by running an experiment and collecting data on one of
our hypotheses:
Phase 2: Delivery
Once everyone was aligned on the product form and value it should deliver, we started developing a
minimum viable product (MVP). Scoping an MVP can be difficult. We aimed for the thinnest ‘vertical’
slice that provided feedback about the viability of the data product, and ensured that it was close
enough to the final product from a customer perspective to continually test our hypotheses. The MVP
also uncovered potential edge cases, hidden or missed product opportunities, and possible obstacles.
This early feedback helped to identify risks, and showed us where to focus our risk mitigation efforts
when further developing the product.
It also helped to define the data sources and transformations that we could leverage when iterating
on future releases of the product. Our focus for the production delivery phase was to implement
well-governed transformations on supportable and extensible data infrastructure – and to serve the
results to data product consumers. We applied our sensible default engineering practices, such as
test-driven development (TDD), infrastructure as code, CI/CD, and observability in both code and data
planes, among others.
• Building awareness: Are there any gaps or opportunities in your current delivery
planning practices?
• Being open to what needs to change: How would you apply the principles and practices outlined
in this chapter to help you improve your delivery planning?
• Executing the change: Connect industry-tested recommended practices with practical experience
in successfully delivering data products.
In the next chapter, we’ll share how you can save hours by better managing the quality of your data.
Data is an integral part of every product delivery and customer engagement. When designing data
systems you need to understand both data and technology, while appreciating the ultimate value a
product will bring to customers. It doesn’t matter how big or small your data product is, establishing
sensible defaults helps balance the trade-offs of particular technology decisions.
Assessing best practices around data management is a good starting point that can guide your data
design choices. As data engineers, you should select technology tools that work well together for
a project and are suitable for the broader needs of data across the organization. At its core, every
technology decision needs to be driven by the value it will provide to the business.
In this chapter, we cover the baseline principles to get you started and some considerations for
balancing the trade-offs of technology.
Default principles
Given today’s complex data and technological landscape, the notion that there can be a single best
way of doing anything seems ambitious. But sensible default practices are a great starting point
because they’re an effective way of building architecture on shared values. They also allow you to be
technology agnostic, while focusing on the elements of good design. And for data projects, they can
provide an initial set of baseline principles: effective practices and techniques to get you started.
Figure 1: Data default practices (in addition to core engineering sensible defaults):
• Capacity and performance planning
• Incremental value delivery and measurement
• Built-in observability
• Data quality
• Built-in security, privacy and compliance
• Built-in discoverability
• [DS] Ethics and identifying biases
• [DS] Tracked, measured and reproducible experiments
As with software projects, we believe it’s essential to establish best practices around data
management for modern data platforms, including:
However, there may be circumstances which make the default choice invalid — or at least suboptimal.
For example, if your product differentiation requires ultra low latency reads at the expense of
consistency, you will need to use a specialized niche data store for your use case.
Business processes are akin to a flow of events – and virtually all data you deal with is streaming in
this flow. Data is almost always produced and updated continually at its source, and it is constantly
arriving. Waiting until the end of the day to process data in a batch is a bit like buying a newspaper to
find out what happened in the world yesterday. While for some use cases, such as billing and payroll
systems, this is acceptable, others require more immediate streaming data processing.
When designing appropriate architecture, focus on developing solutions that process data to meet
the business outcome you’re working towards. You need to step back and appreciate what the
architecture is there to support. Are you building a system to facilitate “transactions” (think sales on
an ecommerce website, or a payment processing system)? Or are you trying to “analyze” history to
identify trends and use aggregated data to draw insights? Start with the problem you are trying to
solve. Then look at the characteristics of the relevant business data, and consider how technology
and project architecture might be able to better support those workloads.
While it’s crucial to use the right data store for the right use case, you want to solve the problem, not
build to the constraints of the technology. The wrong technology choices can misdirect engineering
effort and undermine the likelihood of future business success.
But finding the right technology can be time consuming. To speed up the process, Amazon founder,
Jeff Bezos suggests you don’t deliberate over easily reversible, “two-way door” decisions. Simply
walk through the door and see if you like it — if you don’t, go back. You can make these decisions fast
and even automate them.
Few, if any, data decisions are hard to reverse. But those that are need to be made carefully. A
data technology radar is a great way to balance risk in your technology portfolio and pollinate
innovation across teams, experimenting accordingly. It allows you to work out what kind of
technology organization you want to be – and objectively assess the data tools that are working and
those that are not.
It’s also important to only invest in and use custom-built solutions where it is a differentiator to the
business or provides a competitive advantage. Consider what’s important to your business:
• Security, schemas, design and lineage – the non-negotiables for system design
• An upfront triage of latency, correctness and durability – deciding the ranking is still important
• Batch, harmonize and consolidate delivery – because costs ebb and flow
Moving fast may lead to duplication, while consolidating effort on a centralized data platform will
eliminate duplication but take longer. You need to understand the trade-offs and make data architecture
decisions at the enterprise level quickly. Slow decisions can cause your data infrastructure to
proliferate unnecessarily, which can be difficult to reverse if you later want to consolidate your infrastructure.
Optimizing for sensible data defaults can make it easier to select the right technology and make good
architecture choices. It can also help you validate whether your investment is providing the business
with the competitive advantage you expect it to – helping you make more informed decisions.
In the next chapter, we’ll share how you can implement an effective data strategy that provides the
foundations for managing data quality.
Quality is king:
finding the value in
your data test strategy
By Simon Aubury and Kunal Tiwary
Imagine what your data teams could achieve with an extra two days per week. The thought is exciting,
isn’t it? But where would this additional time come from? The answer may lie in better managing the
quality of your data. The best way to do that is by catching issues early with the help of a rigorous
data testing strategy.
The impact of moving data testing upstream shouldn’t be underestimated. Research suggests that
data teams spend 30-40% of their time focusing on data quality issues. That’s a significant amount
of time they could be spending on revenue-generating activities like creating better products and
features or improving access to faster and more accurate insights across their organization. And
beyond productivity and organizational effectiveness, data downtime – caused by missing, inaccurate,
or compromised data – can cost companies millions of dollars each year and erode trust in your data
team as a revenue driver for the organization.
Missing values in datasets can lead to failures in production systems, incorrect data can lead to the
wrong business decisions being made and changes in data distribution can degrade the performance
of machine learning models. Irrelevant product recommendations could impact customer experience
and lead to a loss of revenue. In sectors such as healthcare, the consequences can be far more
significant. Incorrect data can lead to the wrong medication being prescribed, triggering adverse
reactions, or even death.
In this chapter we’ll take a look at how to implement an effective data strategy. But first, we’ll look at
the core considerations that lay the foundations for data quality.
Freshness: The importance of data recency is context specific. A security monitoring or fraud
detection application requires very fresh data to make sure aberrations are detected and can
be dealt with quickly, whereas training a machine learning model can tolerate more latent data.
Accuracy: Life impacting decisions such as drug effectiveness and financial decisions have
little room for inaccurate data. However, we may sacrifice accuracy for speed when offering
suggestions on a retail online store or streaming service.
Consistency: Definitions of common terms need to be consistent. For example, what does
“current” mean when we talk about customers – purchased last week or two months ago? Or
what constitutes a “customer” – already signed on, authenticated, a real human?
Understanding the data source: The source of data or how data is captured can affect its
accuracy. If a customer service representative at a bank selects a drop-down field in a hurry or
without validation, a manual error could lead to incorrect account closure reports.
Data producer teams should also aim to capture finer-grained metrics such as:
• Error tolerance
• Ownership – if a quality check is broken, who is going to fix it and when?
• How much is data quality worth? Will adding more data improve your analytics? What are the
appropriate and agreed thresholds and tolerances for data quality for the business? Is 90%, 99%,
99.9% accuracy expected or acceptable for the end user of the data?
• Service level agreements (SLAs) – how much downtime does the business allow for each year?
As the owners of data quality, data producers are ultimately responsible for knowing these thresholds
and meeting agreed expectations.
Enabling new data ingestion within continuous integration/continuous delivery (CI/CD) pipelines
also ensures imported data goes through quality tests – and doesn’t break
existing checks. There may be instances where checks should break the pipeline and send a high
urgency alert. If a check flags a negative house number within a pipeline running on real estate data
for instance, you need to immediately address the issue and stop the run. Whereas if a house number
is simply out of range, you can continue the pipeline runs with simple alerts.
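The house-number example above can be sketched as a severity-aware check: a hard failure stops the pipeline run, while a soft failure only records an alert. This is an illustrative sketch, not the API of any particular framework; the exception name and thresholds are our own.

```python
class DataQualityError(Exception):
    """Raised for checks severe enough to stop the pipeline run."""

def check_house_number(number: int, alerts: list) -> None:
    if number < 0:
        # Impossible value: stop the run and send a high-urgency alert.
        raise DataQualityError(f"negative house number: {number}")
    if number > 20000:
        # Merely out of expected range: keep running, but record an alert.
        alerts.append(f"house number out of expected range: {number}")

alerts = []
check_house_number(42, alerts)       # passes silently
check_house_number(99999, alerts)    # pipeline continues, with an alert
try:
    check_house_number(-3, alerts)   # would halt the run
except DataQualityError as exc:
    print(f"pipeline halted: {exc}")
```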
Making the results from these quality checks available to the wider business is incredibly important.
For example, an address list that is 15% incomplete may delay the marketing team’s campaign launch,
while a variance of 1% in an engineering measurement could jeopardise an expensive manufacturing
process. Making quality levels visible as part of a metadata catalog can also be immensely valuable,
as it allows data consumers to make informed decisions when considering the data’s use cases.
Many data quality frameworks today offer profiling reports that include the error/failure distribution.
You can find some good open-source frameworks – we’ve had positive experiences with Great
Expectations, Deequ and Soda – that can help you implement data quality tests through a range of
features. Depending on the level of integration you require, some key features to consider for your
framework include:
Treat your development environment and pipeline tests just as you would treat them in a production
setting. Communicate with source application teams about data quality issues and fix them at the
source application. For example, Know Your Customer (KYC) checks in your pipeline rely on
non-nullable customer attributes. If the source application doesn’t enforce validation on the system,
null/empty values will be ingested – making the transformations pointless or invalid for many rows of
data. Monitoring metrics such as row counts, totals and averages and setting timely failure alerts will
also reduce time-to-resolution.
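To make the KYC example concrete, here is a minimal sketch of a non-null check plus the kind of row-count, total and average metrics mentioned above. The field names, record layout and helper names are hypothetical.

```python
def non_null_rate(rows, field):
    """Share of rows where a mandatory attribute is actually present."""
    present = sum(1 for r in rows if r.get(field) not in (None, ""))
    return present / len(rows)

def profile(rows, amount_field):
    """Coarse metrics worth tracking over time to spot sudden drift."""
    amounts = [r[amount_field] for r in rows if r.get(amount_field) is not None]
    return {"row_count": len(rows),
            "total": sum(amounts),
            "average": sum(amounts) / len(amounts) if amounts else 0.0}

rows = [{"customer_id": "c1", "amount": 10.0},
        {"customer_id": "",   "amount": 30.0}]
print(non_null_rate(rows, "customer_id"))  # 0.5 - would fail a KYC check
print(profile(rows, "amount"))  # {'row_count': 2, 'total': 40.0, 'average': 20.0}
```

Tracking these coarse metrics per run, and alerting when they move outside agreed tolerances, is what turns a one-off check into ongoing monitoring.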
Organizations need to build infrastructure that enables data producer teams to fix and resolve issues
and deploy changes quickly to continually maintain and evolve the data quality paradigm. Once they
do that, they can begin focusing more on value-adding activities, rather than simply fixing problems.
Measuring and communicating data quality will help you achieve alignment on the state and
business relevance of data across the organization. It will also allow data teams to make continuous
improvements over time.
In the first half of 2022, there were 817 data compromises impacting over 53 million people in
the United States. Each of these security breaches cost an average of US$4.35 million – a 12.7%
increase since 2020.
Governments around the world are implementing stricter laws around data privacy, as more
organizations and individuals are affected by breaches every day. As engineering teams play an
increasingly important role in this space, it wouldn’t be right to wrap up our deep dive into data
engineering without discussing security and privacy.
Security and privacy are often used interchangeably, but they are not the same. Security enables
privacy, but doesn’t guarantee it. Privacy typically refers to a user’s ability to control, access, and
regulate their personal information, while security refers to the system that protects that data
from getting into the wrong hands. You can have data security without data privacy, but not the
other way around.
They are equally important and any good information management system will ensure personal
data is treated appropriately.
To make things even more complicated, retrofitting security and privacy to a product or solution
that is already in production can lead to further risks, as it increases the surface area for security
or data breaches. For any product that processes data that could be considered Personally
Identifiable Information (PII), security and privacy are especially critical to consider from the very
beginning of a project.
And that is what we mean by “shifting left:” In software engineering, shifting left is a conscious effort
to embed certain practices earlier in the development lifecycle – left being the start of a product
lifecycle, right being the end.
[Figure: The value in shifting left, while keeping values on the right – security activities range from proactive (shift left: security architecture, security controls, proactive security measures, features and security testing) to reactive (shift right: network security controls, continuous testing, pentesting, production security)]
Many organizations will have security engineers or even a Chief Information Security Officer (CISO),
yet many lack technical expertise when it comes to privacy. This has led to the emergence of the
privacy engineer – a specialist software engineering role that ensures privacy considerations are
embedded into product development, rather than left as an afterthought. The need for this role
has intensified partly because there are today increasing legislative requirements with which
organizations’ practices, processes and products must comply. Greater awareness of the ethical
dimension of technology also makes the role particularly valuable. In the past, data collection was
something often done with little consideration for users’ personal privacy. However, today privacy
can be a differentiator: there are plenty of examples of privacy being placed at the center of
product development.
In the same way that we — as developers — think about technical debt, we need to also start paying
attention to our “privacy debt”. With data breaches increasing, companies have a decision to make
about when they tackle this debt. Those that successfully shift left on security and privacy will
significantly reduce the probability of ending up on a growing list of companies who have failed to
protect their data.
Safeguarding data
Personally Identifiable Information (PII) is data that directly identifies, in isolation or in combination
with other data, an individual. Without the right security in place, hackers can access this data and
create profiles – using that information to impersonate them or sell it to other criminals.
Consideration is required when storing any form of personal data. Stop and reflect: is collecting
PII necessary to deliver the customer experience? Can you retain trust? For example, a retail
recommendation system can offer a tailored experience using a broad age bracket, without capturing
a customer’s date of birth. For organizations that do use PII, being able to find and identify sensitive
data is critical to protecting their customers and their reputation. Technologies like data catalogs and
appropriate governance frameworks are particularly useful here for ensuring data can be effectively
organized and secured.
It is important to note that simply obfuscating or masking PII fields in a dataset does not necessarily
de-identify an individual’s data. It may be possible to re-identify the data using other contextual
information. For example, hackers use uniqueness as a path to exploit vulnerabilities, so knowing all
the ways your data can be unique to an individual is important. A driver of a distinctly coloured car
in a large city might be fairly unique, but that same driver in a small country town would be identified
easily. This also goes for machine learning approaches that are trained on data with outliers. Outliers
can reveal sensitive information about your data and can inadvertently leak it through a prediction API.
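One way to reason about the distinctly coloured car example is a k-anonymity-style uniqueness count over quasi-identifiers: any combination of attributes shared by fewer than k individuals is a re-identification risk. A minimal sketch, with hypothetical fields and data:

```python
from collections import Counter

def risky_combinations(rows, quasi_identifiers, k=2):
    """Return quasi-identifier combinations shared by fewer than k individuals."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return [combo for combo, n in counts.items() if n < k]

rows = [
    {"car_colour": "white",  "location": "large city"},
    {"car_colour": "white",  "location": "large city"},
    {"car_colour": "orange", "location": "small town"},  # unique -> risky
]
print(risky_combinations(rows, ["car_colour", "location"]))
# [('orange', 'small town')]
```

In practice you would run this over the full set of candidate quasi-identifiers with a much larger k, since combinations that look common in isolation can still be unique together.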
One of the complexities organizations face is the need for access to PII to allow development teams to
experiment and test as they’re working. Test environments are often not subject to the same security
and privacy controls, because they don’t contain production data. But a data science workflow needs to have
data that represents production, so you can train models and do analysis to understand what that
model will do in production.
There are many ways you can do this without giving access to production data directly:
1. Generate fake data that matches the schema of your production data. For some use cases, such
as data validation checks, this can be enough to help ensure pipelines work adequately and
without error. But if your goal is to train and release a model into production, you should avoid fake
data. Not training your model on data that is as close to the real thing as possible raises ethical
and accuracy concerns.
2. Evaluate whether synthetic data, or data that is representative of the real scenario but generated
by a model, could work for your use case. You may still need to apply additional privacy preserving
techniques on top of this data.
3. Generate a subset of secure, anonymous data. Anonymization is a challenging and at times
impossible task when it comes to PII. Simply removing obvious fields like names, addresses and
other identifiers does not mean you cannot re-identify an individual in that dataset. Having a
good understanding of privacy engineering practices such as masking, differential privacy and
encrypted computation are important to do this effectively.
4. Build an isolated secure environment specifically for model building and training with access to
a copy of production data. This approach is more costly and introduces risk since you’re copying
data to another location. You’ll also need a separate environment with all of the same security and
privacy controls as production.
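Option 1 above – fake data that matches the production schema – can be sketched in a few lines of standard-library Python. The schema and value generators here are hypothetical; in practice you would derive the schema from your real tables.

```python
import random
import string

# Hypothetical schema: field name -> generator of schema-valid fake values.
SCHEMA = {
    "customer_id": lambda: "C" + "".join(random.choices(string.digits, k=6)),
    "age_bracket": lambda: random.choice(["18-24", "25-34", "35-49", "50+"]),
    "balance": lambda: round(random.uniform(0, 10_000), 2),
}

def fake_rows(n):
    """Generate n rows of schema-valid fake data - enough to exercise
    pipeline and validation checks, but not to train a production model."""
    return [{field: gen() for field, gen in SCHEMA.items()} for _ in range(n)]

rows = fake_rows(3)
assert all(set(r) == set(SCHEMA) for r in rows)  # every row matches the schema
```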
The biggest shift you have to make to improve security and privacy is to think small. Data minimization
is your friend – helping you build what you want, using the smallest subset of data you really need.
1. Appoint security champions with experience in security activities and processes to help
guide your development teams on the right decisions to make.
2. Put a data classification process in place to allow you to tag sensitive data and apply governance
policies across the organisation based on the sensitivity of your data.
Here is a simple mental model for your data as it enters your systems.
[Figure: Identify → Classify → Keep evolving, with a lean mindset]
3. Run frequent security workshops, such as threat modelling, so you can make an impact
incrementally, prioritizing work as you go. Focus on small, actionable changes you can make to
ensure these sessions continue to deliver value to your security infrastructure.
4. Use the CIA (Confidentiality, Integrity, Availability) triad to think about security:
• Confidentiality: The asset cannot be accessed by people or systems that shouldn’t access it.
• Integrity: The asset cannot be changed by people or systems that shouldn’t change it.
• Availability: Every person and system that should be able to access an asset can do so.
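The classification step (point 2 above) can be sketched as a simple tagging pass over a schema, with governance policies keyed off the resulting tags. The rules and field names here are hypothetical; real classifiers would be richer and usually catalog-driven.

```python
# Hypothetical rules: substring of a column name -> sensitivity tag.
RULES = {"name": "PII", "email": "PII", "dob": "PII", "amount": "CONFIDENTIAL"}
DEFAULT_TAG = "INTERNAL"

def classify(columns):
    """Tag each column so governance policies can key off its sensitivity."""
    tags = {}
    for col in columns:
        tags[col] = next((tag for key, tag in RULES.items() if key in col.lower()),
                         DEFAULT_TAG)
    return tags

print(classify(["customer_name", "email_address", "order_amount", "store_id"]))
# {'customer_name': 'PII', 'email_address': 'PII',
#  'order_amount': 'CONFIDENTIAL', 'store_id': 'INTERNAL'}
```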
[Figure: Security activities across the delivery lifecycle – security objectives analysis; high-level security requirements and sizing; product design with security in mind (agile threat modelling, risk prioritisation); application architecture security (security solution or countermeasure design, security design and review); joint user story security kick-off; automatic source code scan; dependencies and deployment security check; review of “high risk” source code from a security perspective; penetration test against the to-be-released application; security sign-off during each iteration. Shifting security activities left and introducing them early and throughout the delivery life cycle ensures security and compliance, continuous feedback and early detection.]
Read our Responsible tech playbook for more useful tools, techniques and principles. Use it to inform
your next planning session, and to better address critical security and privacy considerations; many of
the tools and frameworks you find in it will help you to assess your own risks.
The later you leave security and privacy in the development process, the more at risk you are of a
significant data breach or security incident. Investing in everything needed to embed security and
privacy in your system is well worth it in the long run: for your organization and your customers.
Modern data engineering helps you get the most value from your data, make better decisions and
create more tailored customer experiences – at speed.
But to truly harness modern practices, your business needs to embrace a new way of thinking about
data. Shifting to a data product mindset requires an organization-wide cultural change. You may also
need to realign internal structures and platforms to empower data teams.
Because an empowered team with the right set of skills, experience and knowledge will be able to
develop the right solutions for the right problems faster – and thrive.
And by applying sensible defaults to your practices and principles, your data teams can add
more value, reduce project risks, and find opportunities for improvement while delivering data
products at speed.
Shifting to modern data engineering practices can be complex and take time. At Thoughtworks,
we have helped many organizations make the move – and unlock the true potential of their data at
scale. If you need help with any of the areas discussed in this playbook or would like to share your
experience with modern data engineering, get in touch now.
solutions@thoughtworks.com
thoughtworks.com/en-au/what-we-do/data-and-ai
About Thoughtworks
Thoughtworks is a global technology consultancy
that integrates strategy, design and engineering
to drive digital innovation. We are over 12,500
people strong across 50 offices in 18 countries.
Over the last 25+ years, we’ve delivered
extraordinary impact together with our clients by
helping them solve complex business problems
with technology as the differentiator.
thoughtworks.com
Meeanjin / Brisbane
Land of the Turrbal and Jagera/Yuggera Peoples
Level 19, 127 Creek Street
Brisbane, Queensland 4000
Phone +61 7 3129 4506
Naarm / Melbourne
Land of the Wurundjeri People of the Kulin Nation
Level 23, 303 Collins Street
Melbourne, Victoria 3000
Phone +61 3 9691 6500