Research Sponsored by Equalum
Eckerson Group provides research, consulting, and education services to help organizations
get more value from their data. Our experts each have 25 years of experience in the field,
specializing in business intelligence, data architecture, data governance, analytics, and data
management. We provide organizations with expert guidance during every step of their data
and analytics journey. Get more value from your data. Put an expert on your side. Learn what
Eckerson Group can do for you!
Table of Contents
Executive Summary
Introduction
Why Customize?
Why Standardize?
Adoption Patterns
About Equalum
Executive Summary
Enterprises use both standard and custom data pipelines to process raw data into
analytics-ready data sets. They must apply the right mix, creating efficiency and scale via
standardization and automation where possible, but still accommodating customization in
order to innovate. But many get the balance wrong and standardize too slowly, limiting the
value they derive from analytics.
This report examines the impact of standardization and customization on data pipelines,
with a focus on design, building, testing, rollout, operations, and adaptation. It seeks to help
architects, data engineers, application developers, and data scientists strike the right balance
in their environments. Enterprises can start by standardizing overly customized data pipelines
and demonstrating clear ROI with bite-size projects. As they standardize more and regain
the right balance, they will drive data democratization, increased productivity, and reduced
risk. As they standardize, enterprises can also free up and redirect resources to custom work,
fostering innovation and increasing analytics value.
Introduction
To visualize the strain on modern data pipelines, stand in the middle of a seesaw and try
to balance it while friends pile on either side. You’ll need to make some fast and tricky
calculations to avoid hitting the dirt.
Enterprise data teams need to strike a similar balance with the pipelines that process data for
analytics. On one hand, they must standardize where possible, using automation to handle
high data volumes and varieties. On the other hand, they must use hand-coded programs and
scripts to customize and support the specialized demands of advanced analytics. Swing too
far one way, and you cannot scale. Swing the other way, and you cannot innovate.
This report seeks to help data analytics leaders and their teams strike the right balance,
standardizing where efficiency matters and customizing where new ideas matter. One of the
key lessons: just like balancing a seesaw, these opposing forces need each other.
Data pipeline. A data pipeline comprises the tasks and processes, including ingestion, transformation, and preparation, that convert raw data into analytics-ready data sets.
Enterprises create and manage many pipelines, each comprising a distinct combination of
data sources, transformation tasks, and targets in order to support various analytics use cases
required by the business.
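To make the concept concrete, the minimal sketch below expresses one such pipeline as three functions, assuming a relational source, a warehouse target, and pandas for transformation; the connection strings, table names, and columns are illustrative only, not a prescribed implementation.

```python
# A minimal sketch of a single data pipeline: ingest raw data, transform it,
# and load an analytics-ready data set. All names shown are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

def ingest(source_url: str, query: str) -> pd.DataFrame:
    """Pull raw records from a source system."""
    return pd.read_sql(query, create_engine(source_url))

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Standardize types, drop duplicates, and derive an analytics-ready view."""
    clean = raw.drop_duplicates(subset=["order_id"])
    clean["order_date"] = pd.to_datetime(clean["order_date"])
    clean["revenue"] = clean["quantity"] * clean["unit_price"]
    return clean

def load(prepared: pd.DataFrame, target_url: str, table: str) -> None:
    """Deliver the prepared data set to the analytics target."""
    prepared.to_sql(table, create_engine(target_url), if_exists="append", index=False)

# One run of the pipeline: raw data in, analytics-ready data out.
load(transform(ingest("postgresql://src/sales", "SELECT * FROM orders")),
     "postgresql://dw/analytics", "fact_orders")
```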
Custom approach. “Custom” tasks and processes apply to one or a few specialized pipelines.
Data engineers, data scientists, and application developers can spend significant time creating
and refining custom pipeline components—related to transformation in particular—to address
specialized requirements. They develop and test custom program code, as well as scripts to
manage that program’s tasks. They insert those programs into production data pipelines, then
manage, monitor, and update the resulting custom data pipeline to fix errors, address new
requirements, and so on.
Standard approach. At the other end of the spectrum, “standard” tasks and processes
apply to many data pipelines with little or no adaptation. For example, many pipeline tools
provide an automated graphical interface for users to configure, execute, and monitor basic
extraction, transformation, and loading (ETL) tasks. They apply roughly the same sequence of
drag-and-drop steps to multiple pipelines without requiring the underlying software code to
be manually rewritten or changed.
Hybrid approach. Open source communities and commercial vendors alike now offer
extensive libraries of modular code, including machine learning and other advanced
algorithms that enterprises can plug into their data pipelines. Depending on the amount
of adaptation and customization involved, these modular libraries can help data analytics
teams address specialized requirements with less work than would otherwise be necessary.
The amount of work required determines whether such hybrid approaches lean toward the
“standard” or “custom” end of the spectrum. Templates constitute another hybrid approach,
enabling data teams to reuse custom code and effectively standardize it across multiple
pipelines.
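The sketch below illustrates the template idea under simple assumptions: a piece of custom transformation logic is written once, wrapped in a parameterized template, and reused across several pipelines. The class, sources, and targets shown are hypothetical.

```python
# A minimal sketch of a pipeline "template": custom logic written once,
# then parameterized and reused across many pipelines.
from dataclasses import dataclass
from typing import Callable
import pandas as pd

@dataclass
class PipelineTemplate:
    source: str                                         # connection string or topic
    target: str                                         # warehouse table or lake path
    transform: Callable[[pd.DataFrame], pd.DataFrame]   # reusable custom step

    def run(self, extract: Callable[[str], pd.DataFrame],
            load: Callable[[pd.DataFrame, str], None]) -> None:
        load(self.transform(extract(self.source)), self.target)

# The same template instantiated for two different pipelines.
dedupe = lambda df: df.drop_duplicates()
orders_pipeline  = PipelineTemplate("postgresql://src/orders",  "dw.fact_orders",  dedupe)
returns_pipeline = PipelineTemplate("postgresql://src/returns", "dw.fact_returns", dedupe)
```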
Why Customize?
Enterprises must frequently take a custom approach to data pipelines in order to support
innovative use cases, which have only multiplied in the wake of the COVID-19 pandemic. They
must answer creative questions, often on a real-time basis. For example:
• How can we continually monitor the resulting revenue and use those findings to
re-calibrate prices?
Architects, data engineers, and application developers collaborate with data scientists to help
the business address such innovative questions with custom ingestion, transformation, and
analytics programs.
Why Standardize?
Enterprises standardize to improve the operational efficiency of data pipelines so they can
deliver timely and well-structured data to BI analysts, business managers, and data scientists.
Data teams identify modular tasks and processes that can be applied to many pipelines using
GUI tools and templates. Standard data delivery increases the productivity of IT and business
stakeholders throughout the enterprise.
Well-run enterprises standardize some existing tasks over time to manage rising data
supply and rising data demand at scale. Hybrid approaches help drive the evolution toward
standardization for many data pipelines. Data teams develop and deploy custom code, then
standardize the repeated, scalable, and tactical aspects later with GUI-based tools and
templates.
Figure 1 compares these approaches and illustrates the evolution of processes over time. Standardizing repeated tasks frees up time to build custom solutions. Those custom solutions, meanwhile,
often result in pipeline tasks and processes that can be standardized over time to improve
efficiency and scale. Table 1 compares the goals, methods, and mutual benefits of these two
approaches to data pipelines.
• Architects, as their name implies, architect data pipelines, creating guidelines to achieve quality and consistency in design and development.
• Architects and data engineers design and build pipelines, often with the help of
data scientists. They identify users, define use cases, and inventory data sources.
They apply design patterns such as ETL or streaming to manage the ingestion,
processing, and delivery of data to support those use cases. Application
developers might also help design and build pipelines that integrate with
operational applications.
• Data engineers operate and adapt pipelines. They maintain, monitor, and tune
various pipeline components, including sources, targets, and processors, as well
as their interconnections. In addition, they add and remove these components to adapt pipelines as requirements change.
• Analysts and data scientists consume the data emanating from the data
pipelines and provide feedback on data quality, service-level agreements, and
so on to assist architects and engineers with ongoing architecture, design, and
adaptations.
• Architects and data engineers also help data stewards and other governance
managers create policies and controls, which data engineers then implement.
Data stewards monitor data usage and enforce those policies to maintain data
quality.
Figure 2 illustrates the stages and contributions of various roles to this iterative and
collaborative process.
Enterprise data teams can most easily standardize the design/build and operate/adapt stages
of the lifecycle.
Siloed data. Although this is hardly new or surprising, the problem of data silos persists in
enterprise environments thanks to years of accumulated one-off architectural decisions. Data
teams in IT departments or lines of business steadily adopt new data sources and analytics
targets to address custom use cases. This creates a sprawling patchwork of poorly integrated
data types, formats, and platforms that delay or prevent standardization efforts.
Enterprise data teams also accumulate multiple data pipeline and analytics tools, many
of which address only narrow parts of their environments. Technical limitations on tool
compatibility, and the preferences of organizational fiefdoms, limit the ability of enterprise
data teams to centralize tools. Standardization suffers.
Resistance to change. Data teams and enterprises resist change for many familiar reasons.
Specialized data engineers and application developers might view their backlog of manual
programming and scripting requests, however tedious, as a form of job security. Stakeholders
of all types—data engineers, analysts, and data scientists in particular—do not have much
spare time to learn new processes or tools. The status quo remains a comfort zone.
Siloed data and tools, budget constraints for new tools, and staff resistance to change cause enterprise data
teams to favor existing custom tasks and processes over standardization.
Adoption Patterns
Enterprises pursue numerous strategic initiatives to democratize data usage and drive digital
transformation. These inter-related initiatives, including real-time streaming, cloud migration,
data modernization, self-service, and advanced analytics, drive demand for both standard
and custom data pipelines.
Real-time streaming. Enterprises replace legacy batch data ingestion and processing with
real-time streaming mechanisms such as change data capture (CDC) and message streaming.
Real-time streaming enables new insights and improves efficiency by eliminating repeated
batch copies of unchanged data.
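As a rough illustration of the pattern, the sketch below consumes CDC events from a message stream so that only changed rows reach the analytics target; the Kafka topic, broker address, event fields, and loader helpers are hypothetical, not tied to any particular product.

```python
# A minimal sketch of change data capture (CDC) consumption, assuming change
# events are published to a Kafka topic as JSON documents.
import json
from kafka import KafkaConsumer  # kafka-python

def upsert_into_target(table: str, row: dict) -> None:
    """Hypothetical loader that inserts or updates a row in the analytics target."""
    ...

def delete_from_target(table: str, row: dict) -> None:
    """Hypothetical loader that removes a row from the analytics target."""
    ...

consumer = KafkaConsumer(
    "orders.cdc",
    bootstrap_servers="broker:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Only changed rows arrive, so the target stays current without re-copying
# unchanged data in nightly batch loads.
for event in consumer:
    change = event.value
    if change["op"] in ("insert", "update"):
        upsert_into_target(change["table"], change["after"])
    elif change["op"] == "delete":
        delete_from_target(change["table"], change["before"])
```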
Cloud migration. Data teams complement and replace on-premises platforms such as
databases, data warehouses, and data lakes with infrastructure-as-a-service (IaaS) offerings
from cloud service providers. IaaS platforms create financial and operational flexibility.
Data scientists help design pipelines that support advanced analytics use cases such as machine learning. Application developers also help integrate operationally focused data pipelines with business applications where appropriate.
Let’s examine how custom, standard, and hybrid approaches shape each of these steps,
as well as the resulting impact on operating costs in the form of time, effort, and software
licensing.
1. Design. Once architects and data engineers identify target users and use cases,
they define the product—i.e., the output of their pipelines—in terms of data
types, formats, latency, and so on. Then they inventory the necessary data
inputs spanning both internal and external sources. Outputs and inputs, in
turn, drive the design of the “middle”—including data acquisition, ingestion,
transformation, and preparation—which becomes the beating heart of the
data pipeline. Architects and data engineers must select their design pattern
carefully—batch ETL, streaming, virtualization, and so on—to ensure they deliver
correctly structured data in a timely fashion. They must also carefully design
the transformation tasks required to improve, enrich, and format data. Many
enterprises adopt data streaming in order to improve efficiency and enable real-
time analytics.
Data teams should seek to standardize where possible during the design phase.
Do they really need to hand-craft data ingestion and transformation with custom
programs and scripts? In many cases, the answer is no. Commercial, automated
pipeline tools address the most common data types, sources, targets, and
transformation tasks, creating efficiency benefits that justify their licensing costs
over time. Data teams should seek budget approval and start evaluating pipeline
tools. Key evaluation criteria include ease of use, breadth of source and target
support, and support for real-time data streaming.
You can’t standardize everything, of course. Unusual data sources might require
custom integration, for example, and advanced algorithms such as machine
learning often require specialized transformation processes. Data engineers
should scope the custom programs and scripts they will need to develop, and
find “hybrid” tools such as templates and code libraries that can help them along
the way.
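One way to capture these design decisions is as declarative configuration, as in the hypothetical sketch below: every step a standard, GUI- or config-driven tool can handle is simply declared, and the one genuinely specialized step is flagged for custom code. The pipeline name, sources, and options are illustrative.

```python
# A minimal sketch of a design-phase decision record expressed as configuration.
# Standard connectors and transformations are declared; only the specialized
# step requires hand-written code. All names shown are hypothetical.
pipeline_design = {
    "name": "customer_360_daily",
    "pattern": "batch_etl",            # alternatives: "streaming", "virtualization"
    "sources": [
        {"type": "postgres", "object": "crm.customers"},       # standard connector
        {"type": "s3_csv",   "object": "weblogs/*.csv"},        # standard connector
    ],
    "transformations": [
        {"step": "deduplicate", "keys": ["customer_id"]},                 # standard
        {"step": "type_cast",   "column": "signup_date", "to": "date"},  # standard
        {"step": "custom",      "module": "churn_features"},   # needs custom code
    ],
    "target": {"type": "warehouse", "table": "analytics.customer_360"},
    "latency": "daily",
}
```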
2. Build. Next, data engineers tackle the hard work of building data pipelines. These
pipelines must connect to source databases, IoT sensors, social media feeds,
and so on, then ingest, transform, and deliver the data on a periodic, scheduled,
or real-time basis to an analytics target such as a data warehouse, data lake,
or NoSQL platform. They might need to incorporate often-changing source
schemas into their data targets, gather metadata, and monitor data lineage.
They might also need to schedule jobs, execute workloads, coordinate task
interdependencies, and monitor execution status on a real-time basis.
When custom code is required, data engineers, data scientists, and application
developers need to ensure their custom components integrate with the rest
of the pipeline. For example, they might use open APIs that maintain future
integration options.
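The sketch below shows one way such integration might look: the custom logic implements a small, documented interface so a standard runner can execute it alongside built-in steps and swap it out later. The interface, class, and column names are hypothetical.

```python
# A minimal sketch of a custom component plugging into an otherwise standard
# pipeline through a small, stable interface.
from abc import ABC, abstractmethod
import pandas as pd

class Transform(ABC):
    """Contract the standard pipeline runner expects every step to honor."""
    @abstractmethod
    def apply(self, df: pd.DataFrame) -> pd.DataFrame: ...

class SessionizeClickstream(Transform):
    """Custom, hand-written logic for a specialized analytics requirement."""
    def __init__(self, gap_minutes: int = 30):
        self.gap = pd.Timedelta(minutes=gap_minutes)

    def apply(self, df: pd.DataFrame) -> pd.DataFrame:
        df = df.sort_values(["user_id", "event_time"])
        new_session = df.groupby("user_id")["event_time"].diff() > self.gap
        df["session_id"] = new_session.groupby(df["user_id"]).cumsum()
        return df

def run_steps(df: pd.DataFrame, steps: list[Transform]) -> pd.DataFrame:
    """The standard runner treats custom and built-in steps identically."""
    for step in steps:
        df = step.apply(df)
    return df
```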
4. Roll out. With upfront testing complete, data engineers, data scientists, and
application developers can put their data pipelines into production. They
package, stage, and release custom code, all while keeping a close eye on their
full environments to spot unexpected cascading changes or surprises. They
should employ continuous delivery best practices, meaning that code should
be “releasable” throughout its development in order to remove risk from the
rollout. They also might employ DevOps best practices to streamline rollouts with
improved communication and collaboration between parties.
Standard pipelines that use automated, GUI-driven tools accelerate rollouts and
reduce risk by reducing configuration and execution errors. Hybrid tools such as
templates and code libraries similarly reduce errors and therefore help streamline
rollouts, although to a lesser degree.
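As a rough illustration, the sketch below shows the kind of automated check that keeps pipeline code "releasable" before every rollout; it runs under pytest in any continuous delivery step. The transformation under test and its columns are hypothetical.

```python
# A minimal sketch of a pre-rollout check for a pipeline transformation.
import pandas as pd

def dedupe_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation: keep only the latest record per order."""
    return df.drop_duplicates(subset=["order_id"], keep="last")

def test_dedupe_orders_keeps_latest_record():
    raw = pd.DataFrame({
        "order_id": [1, 1, 2],
        "status":   ["created", "shipped", "created"],
    })
    result = dedupe_orders(raw)
    assert len(result) == 2
    assert result.loc[result["order_id"] == 1, "status"].item() == "shipped"
```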
5. Operate and adapt. Like living organisms, data pipelines need ongoing support
in the form of continuous monitoring and enhancement. Cross-functional teams of data engineers, analysts, and data scientists share responsibility for this ongoing work.
Once again, automated pipeline tools greatly improve speed and flexibility.
Data teams can change many data sources or targets, or reconfigure basic
transformation tasks, with a few clicks. Given the interdependencies involved
in modern environments, they should carefully model, predict, and assess
the cascading impacts of such changes on standard and especially custom
components.
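The sketch below illustrates one simple way to reason about those cascading impacts: walk a dependency graph of pipeline components to list everything downstream of a proposed change. The graph contents are hypothetical.

```python
# A minimal sketch of impact analysis before changing a shared pipeline component.
from collections import deque

# component -> components that consume its output (illustrative entries)
dependencies = {
    "crm_source":         ["customer_ingest"],
    "customer_ingest":    ["dedupe_step", "cdc_stream"],
    "dedupe_step":        ["customer_360_table"],
    "cdc_stream":         ["realtime_dashboard"],
    "customer_360_table": ["churn_model", "bi_reports"],
}

def downstream_impact(changed: str) -> set[str]:
    """Return every component that could be affected by changing `changed`."""
    impacted, queue = set(), deque([changed])
    while queue:
        for consumer in dependencies.get(queue.popleft(), []):
            if consumer not in impacted:
                impacted.add(consumer)
                queue.append(consumer)
    return impacted

# Example: what does swapping the CRM source connector put at risk?
print(sorted(downstream_impact("crm_source")))
```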
Figure 4 compares the cumulative operating costs of standard, custom, and hybrid
approaches to data pipelines.
Although standard data pipeline approaches carry higher design and building costs due to software licensing,
their efficiency benefits drive costs lower than custom and hybrid approaches during subsequent stages in the
data pipeline lifecycle.
Data democratization. Standardized, automated tools for accessing and analyzing data
enable more business-oriented managers and analysts to make data-driven decisions. These
standard tools, meanwhile, free up data engineers to build custom data pipelines and support
more requests for advanced analytics.
Productivity. Standard and hybrid data pipeline approaches improve output per unit of input
on several dimensions. Data engineers build and operate more data pipelines in less time.
Developers integrate pipelines with business applications more easily. Business managers and
analysts answer their questions faster.
Reduced risk. Balanced data pipeline approaches also reduce operational risk. Business
managers make better decisions because they have more data at their disposal. Data teams
commit to analytics projects with new confidence they will meet deadlines, budgets, and
SLAs.
Higher analytics value. Business managers and leaders gain a higher return on investment
in their data assets. They create more and better insights, faster, improving both the top and
bottom line.
Effectively balancing standard, custom, and hybrid approaches to data pipelines yields higher analytics value,
innovation, data democratization, productivity, and lower risk.
• Focus on ROI. Start by standardizing pipelines that offer low risk and high return
in a short time frame, even if they are smaller in scale. Your initiative will need
to demonstrate a clear ROI out of the gate in order to gain approval for larger
projects. You can achieve this by selecting a cost-effective, highly automated,
easily learned tool that can replace one or more existing tools and their
associated maintenance fees.
And finally … plan for growth! Expect demand for your team’s services to continue to rise,
driving the need for continued rebalancing as we enter the post-COVID-19 world.
• Eckerson Research publishes insights so you and your team can stay abreast
of the latest tools, techniques, and technologies in the field.
• Eckerson Education keeps your data analytics team current on the latest
developments in the field through three- and six-hour workshops and public
seminars.
Unlike other firms, Eckerson Group focuses solely on data analytics. Our veteran
practitioners each have more than 25 years of experience in the field. They specialize in
every facet of data analytics—from data architecture and data governance to business
intelligence and artificial intelligence. Their primary mission is to share their hard-won
lessons with you.
Our clients say we are hard-working, insightful, and humble. We take the compliment! It all
stems from our love of data and desire to serve—we see ourselves as a bunch of continuous
learners, interpreting the world of data for you and others.
About Equalum
Develop and operationalize your batch and streaming pipelines with infinite
scalability and speed with Equalum
Traditional Change Data Capture and ETL processes and tools cannot adequately perform
under the pressure of modern data volumes and velocities. The strain on legacy data
systems leads to data latency, broken pipelines, and stale data used for business analytics
and daily operations. Equalum built the industry’s most scalable and comprehensive data
ingestion platform, combining streaming Change Data Capture with modern data
transformation capabilities. While real-time ingestion and integration are a core strength of
the platform, Equalum also supports high scale batch processing.
Equalum supports both structured and semi-structured data formats, and can run on premises, in public clouds, or in hybrid environments. Equalum's library of optimized CDC connectors is one of the largest in the world, and more are developed and rolled out on a continuous basis, largely based on customer demand. Equalum's multi-modal approach to data ingestion can power a multitude of use cases, including CDC data replication, CDC ETL ingestion, batch ingestion, and more.
Equalum also leverages open source data frameworks by orchestrating Apache Spark, Kafka, and others under the hood. The platform's easy-to-use, drag-and-drop UI eliminates IT productivity bottlenecks with rapid deployment and simple data pipeline setup. The platform's comprehensive data monitoring eliminates the need for endless DIY patch fixes to broken pipelines and for challenging management of open source frameworks, empowering the user with immediate system diagnostics, solution options, and visibility into data integrity.
Headquartered in Silicon Valley and Tel Aviv, Equalum is proud to work with some of the world's top industrial, financial, and media enterprises.