
The InfoQ eMag / Issue #92 / February 2021

MODERN DATA ENGINEERING

The Future of Data Engineering
Beyond the Database, and beyond the Stream Processor: What's the Next Step for Data Management?
Combining DataOps and DevOps: Scale at Speed

FACILITATING THE SPREAD OF KNOWLEDGE AND INNOVATION IN PROFESSIONAL SOFTWARE DEVELOPMENT

Modern Data Engineering

IN THIS ISSUE

The Future of Data Engineering

Beyond the Database, and beyond the Stream Processor: What's the Next Step for Data Management?

Data Gateways in the Cloud Native Era

Combining DataOps and DevOps: Scale at Speed

PRODUCTION EDITOR Ana Ciobotaru / COPY EDITOR Maureen Spencer / DESIGN Dragos Balasoiu, Ana Ciobotaru
GENERAL FEEDBACK feedback@infoq.com / ADVERTISING sales@infoq.com / EDITORIAL editors@infoq.com
CONTRIBUTORS

Chris Riccomini is a software engineer with more than a decade of experience at Silicon Valley tech companies including PayPal, LinkedIn, and WePay, a JPMorgan Chase company. Riccomini is active in open source. He is co-author of Apache Samza, a stream-processing framework, and is also a member of the Apache project management committee (PMC), Apache Airflow PMC, and Apache Samza PMC. He is a strategic investor and advisor for startups in the data space, where he advises founders and technical leaders on product and engineering strategy.

Ben Stopford is Lead Technologist, Office of the CTO at Confluent (a company that backs Apache Kafka). He has worked on a wide range of projects, from implementing the latest version of Kafka's replication protocol to assessing and shaping Confluent's strategy. He is the author of the book Designing Event-Driven Systems (O'Reilly, 2018).

Bilgin Ibryam is a product manager at Red Hat and a committer and member of the Apache Software Foundation. He is an open source evangelist, blogger, occasional speaker, and the author of the books Kubernetes Patterns and Camel Design Patterns. In his day-to-day job, Bilgin enjoys mentoring, coding, and leading developers to be successful with building open source solutions. His current work focuses on blockchain, distributed systems, microservices, devops, and cloud-native application development.

Sam Bocetta is a former security analyst, having spent the bulk of his career as a network engineer for the Navy. He is now semi-retired and educates the public about security and privacy technology. Much of Sam's work involved penetration testing ballistic systems; he analyzed networks looking for entry points, then created security-vulnerability assessments based on his findings.
A LETTER FROM THE EDITOR

Thomas Betts is the Lead Editor for Architecture and Design at InfoQ, and a Sr. Principal Software Engineer at Blackbaud. For over two decades, his focus has always been on providing software solutions that delight his customers. He has worked in a variety of industries, including retail, finance, health care, defense, and travel. Thomas lives in Denver with his wife and son, and they love hiking and otherwise exploring beautiful Colorado.

Data architecture is being disrupted, echoing the evolution of software architecture over the past decade.

The changes coming to data engineering will look and sound familiar to those who have watched monoliths be broken up into microservices: DevOps to DataOps; API Gateway to Data Gateway; Service Mesh to Data Mesh. While this will have benefits in agility and productivity, it will come with a cost of understanding and supporting a next-generation data architecture.

Data engineers and software architects will benefit from the guidance of the experts in this eMag as they discuss various aspects of breaking down traditional silos that defined where data lived, how data systems were built and managed, and how data flows in and out of the system.

Future of Data Engineering
In his QCon presentation on The Future of Data Engineering, Chris Riccomini focuses on data pipelines and how they mature from batch to real-time processing, which leads to more integration and a need for automation. He advocates for a move from a monolith to micro-warehouses. Because the job of a data engineer is to help an organization move and process data, that means creating self-service tools that enable producers and consumers of data to define the pipelines they need.

Beyond the Database, and beyond the Stream Processor: What's the Next Step for Data Management?
Looking beyond the database and stream processor, Ben Stopford asks, "What's the next step for data management?" The shift from traditional databases to stream processing has fundamentally changed the interaction model, from passive data to active data. Modern applications need to effectively work with both these models, and that means rethinking what a database is, what it means to us, and how we interact with both the data it contains and the event streams that connect it all together.

Data Gateways in the Cloud Native Era
Just as API gateways are necessary to tame microservices, data gateways focus on the data aspect to offer abstractions, security, scaling, federation, and contract-driven development features. In his article, Data Gateways in the Cloud Native Era, Bilgin Ibryam says the freedom for microservices to use the most suitable database leads to a polyglot persistence layer, which necessitates advanced gateway capabilities. There is no one-size-fits-all option for a data gateway, and Ibryam covers several features and various products to evaluate based on your needs.

Combining DataOps and DevOps: Scale at Speed
With all the new tools available and ideas of how to improve data engineering, Sam Bocetta looks at how companies can adapt their business models so that they are better able to process, analyze, and derive value from big data. In Combining DataOps and DevOps: Scale at Speed, he says development teams need consistent access to high-quality data. Companies need to realize that data is a vital commodity, and embrace a data-centric view of the enterprise, rather than organizing teams around the tools and technologies used to manage data. When done correctly, DataOps provides a cultural transformation that promotes communication between all data stakeholders.

The Future of Data Engineering

by Chris Riccomini, Software Engineer

"The future of data engineering" is a fancy title for presenting stages of data pipeline maturity and building out a sample architecture as I progress, until I land on a modern data architecture and data pipeline. I will also hint at where things are headed for the next couple of years.

It's important to know me and my perspective when I'm predicting the future, so that you can couch mine with your own perspectives and act accordingly.

I work at WePay, which is a payment-processing company. JPMorgan Chase acquired the company a few years ago. I work on data infrastructure and data engineering. Our stack is Airflow, Kafka, and BigQuery, for the most part. Airflow is, of course, a job scheduler that kicks off jobs and does workflow things. BigQuery is a data warehouse hosted by Google Cloud. I make some references to Google Cloud services here, and you can definitely swap them with the corresponding AWS or Azure services.

We, at WePay, use Kafka a lot. I spent about seven years at LinkedIn, the birthplace of Kafka, which is a pub/sub, write-ahead log. Kafka has become the backbone of a log-based architecture. At LinkedIn, I spent a bunch of time doing everything from data science to service infrastructure, and so on. I also wrote Apache Samza, which is a stream-processing system, and helped build out their Hadoop ecosystem. Before that, I spent time as a data scientist at PayPal.

There are many definitions for "data engineering". I've seen people use it when talking about business analytics and in the context of data science. I'm going to throw down my definition: a data engineer's job is to help an organization move and process data. To move data means streaming pipelines or data pipelines; to process data means data warehouses and stream processing. Usually, we're focused on asynchronous, batch or streaming stuff as opposed to synchronous real-time things.
I want to call out the key word here: "help". Data engineers are not supposed to be moving and processing the data themselves but are supposed to be helping the organization do that.

Maxime Beauchemin is a prolific engineer who started out, I think, at Yahoo and passed through Facebook, Airbnb, and Lyft. Over the course of his adventures, he wrote Airflow, which is the job scheduler that we and a bunch of other companies use. He also wrote Superset. In his "The Rise of the Data Engineer" blog post a few years ago, Beauchemin said that "... data engineers build tools, infrastructure, frameworks, and services." This is how we go about helping the organization to move and process the data.

The reason that I put this presentation together was a 2019 blog post from a company called Ada in which they talk about their journey to set up a data warehouse. They had a MongoDB database and were starting to run up against its limits when it came to reporting and some ad hoc query things. Eventually, they landed on Apache Airflow and Redshift, which is AWS's data-warehousing solution.

What struck me about the Ada post was how much it looked like a post that I'd written about three years earlier. When I landed at WePay, they didn't have much of a data warehouse and so we went through almost the exact same exercise that Ada did. We eventually landed on Airflow and BigQuery, which is Google Cloud's version of Redshift. The Ada post and mine are almost identical, from the diagrams to even the structure and sections of the post.

This was something we had done a few years earlier and so I threw down the gauntlet on Twitter and predicted Ada's future. I claimed to know how they would progress as they continued to build out their data warehouse: one step would be to go from batch to a real-time pipeline, and the next step would be to a fully self-serve or automated pipeline.

I'm not trying to pick on Ada. I think it's a perfectly reasonable solution. I just think that there's a natural evolution of a data pipeline and a data warehouse and the modern data ecosystem, and that's really what I want to cover here.

Figure 1: Getting cute with land/expand/on demand for pipeline evolution
I refined this idea with the tweet in Figure 1. The idea was that we initially land with nothing, so we need to set up a data warehouse quickly. Then we expand as we do more integrations, and maybe we go to real-time because we've got Kafka in our ecosystem.

Finally, we move to automation for on-demand stuff. That eventually led to my "The Future of Data Engineering" post in which I discussed four future trends.

The first trend is timeliness, going from this batch-based periodic architecture to a more real-time architecture. The second is connectivity; once we go down the timeliness route, we start doing more integration with other systems. The last two tie together: automation and decentralization. On the automation front, I think we need to start thinking about how we operate not just our operations but our data management. And then decentralizing the data warehouse.

I designed a hierarchy of data-pipeline progression. Organizations go through this evolution in sequence.

Figure 2: The six stages of data-pipeline maturity

The reason I created this path is that everyone's future is different because everyone is at a different point in their life cycle.

The future at Ada looks very different than the future at WePay because WePay may be farther along on some dimensions - and then there are companies that are even farther along than WePay.

These stages let you find your current starting point and build your own roadmap from there.

Stage 0: None
You're probably at this stage if you have no data warehouse. You probably have a monolithic architecture.

You're maybe a smaller company and you need a warehouse up and running now. You probably don't have too many data engineers and so you're doing this on the side.
Stage 0 looks like Figure 3, with a lovely monolith and a database. You take a user and you attach it to the database.

Figure 3: Stage 0 of data-pipeline maturity

This sounds crazy to people that have been in the data-warehouse world for a while but it's a viable solution when you need to get things quickly up and running. The data appears to the user as basically real-time data because you're reading directly from the database. It's easy and cheap.

This is where WePay was when I landed there in 2014. We had a PHP monolith and a monolithic MySQL database. The users we had, though, weren't happy and things were starting to tip over. We had queries timing out. We had users impacting each other - most OLTP systems that you're going to be using do not have strong isolation and multi-tenancy so users can really get in each other's way.

Because we were using MySQL, we were missing some of the fancier analytic SQL stuff that our data-science and business-analytics people wanted and report generation was starting to break. It was a pretty normal story.

Stage 1: Batch
We started down the batch path, and this is where the Ada post and my earlier post come in.

Going into Stage 1, you probably have a monolithic architecture. You might be starting to lean away from that but usually it works best when you have relatively few sources. Data engineering is now probably your part-time job. Queries are timing out because you're exceeding the database capacity, whether in space, memory, or CPU.

The lack of complex analytical SQL functions is becoming an issue for your organization as people need those for customer-facing or internal reports. People are asking for charts, business intelligence, and all that kind of fun stuff.

Figure 4: Stage 1 is the classic batch-based approach to data
This is where the classic batch-based approach comes in. Between the database and the user, you stuff a data warehouse that can accomplish a lot more OLAP and fulfill analytic needs. To get data from the database into that data warehouse, you have a scheduler that periodically wakes up to suck in the data.
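As a rough illustration of this stage, the sketch below shows what such a periodic load can look like as an Airflow DAG. It is a minimal example, not WePay's actual pipeline; the table names, the 15-minute interval, and the load function are hypothetical placeholders.

```python
# A minimal sketch of the Stage 1 pattern: a scheduler that wakes up
# periodically and copies changed rows into the warehouse.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path

TABLES = ["payments", "accounts"]  # hypothetical source tables


def incremental_load(table, **context):
    # Pull only rows modified since the last run, then append them to the
    # warehouse. The actual extract/load calls depend on your stack.
    since = context["execution_date"]
    print(f"Loading rows from {table} modified after {since}")


with DAG(
    dag_id="warehouse_incremental_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval=timedelta(minutes=15),  # the ~15-minute latency described above
    catchup=False,
) as dag:
    for table in TABLES:
        PythonOperator(
            task_id=f"load_{table}",
            python_callable=incremental_load,
            op_kwargs={"table": table},
        )
```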
That's where WePay was at about a year after I joined. This architecture is fantastic in terms of tradeoffs. You can get the pipeline up pretty quickly these days - when I did it in 2016, it took a couple of weeks. Our data latency was about 15 minutes, so we did incremental partition loads, taking little chunks of data and loading them in. We were running a few hundred tables. This is a nice place to start if you're trying to get something up and running but, of course, you outgrow it.

The number of Airflow workflows that we had went from a few hundred to a few thousand. We started running tens or hundreds of thousands of tasks per day, and that became an operational issue because of the probability that some of those are not going to work. We also discovered - and this is not intuitive for people who haven't run complex data pipelines - that the incremental or batch-based approach requires imposing dependencies or requirements on the schemas of the data that you're loading. We had issues with create_time and modify_time and ORMs doing things in different ways and it got a little complicated for us.

DBAs were impacting our workload; they could do something that hurt the replica that we're reading off of and cause latency issues, which in turn could cause us to miss data.

Hard deletes weren't propagating - and this is a big problem if you have people who delete data from your database. Removing a row or a table or whatever can cause problems with batch loads because you just don't know when the data disappears. Also, MySQL replication latency was affecting our data quality and periodic loads would cause occasional MySQL timeouts on our workflow.

Stage 2: Realtime
This is where real-time data processing kicks off. This approaches the cusp of the modern era of real-time data architecture and it deserves a closer look than the first two stages.

You might be ready for Stage 2 if your load times are taking too long. You've got pipelines that are no longer stable, whether because workflows are failing or your RDBMS is having trouble serving the data. You've got complicated workflows and data latency is becoming a bigger problem: maybe the 15-minute jobs you started with in 2014 are now taking an hour or a day, and the people using them aren't happy about it. Data engineering is probably your full-time job now.

Figure 5: Stage 2 of data-pipeline maturity
Your ecosystem might have something like Apache Kafka floating around. Maybe the operations folks have spun it up to do log aggregation and run some operational metrics over it; maybe some web services are communicating via Kafka to do some queuing or asynchronous processing.

From a data-pipeline perspective, this is the time to get rid of that batch processor for ETL purposes and replace it with a streaming platform. That's what WePay did. We changed our ETL pipeline from Airflow to Debezium and a few other systems, so it started to look like Figure 6.

Figure 6: WePay's data architecture in 2017

The hatched Airflow box now contains five boxes, and we're talking about many machines so the operational complexity has gone up. In exchange, we get a real-time pipeline.

Kafka is a write-ahead log that we can send messages to (they get appended to the end of the log) and we can have consumers reading from various locations in that log. It's a sequential read and sequential write kind of thing.
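The sketch below illustrates that append/consume interaction using the confluent-kafka Python client; the topic name, payload, and consumer group are hypothetical.

```python
# A minimal sketch of the log interaction described above.
import json
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "localhost:9092"})
# Each produced message is appended to the end of the topic's log.
producer.produce("payments", value=json.dumps({"account": 42, "amount": 9.99}))
producer.flush()

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "warehouse-loader",
    "auto.offset.reset": "earliest",  # start reading from the beginning of the log
})
consumer.subscribe(["payments"])

while True:
    msg = consumer.poll(1.0)  # each consumer tracks its own position (offset) in the log
    if msg is None or msg.error():
        continue
    print(msg.offset(), msg.value())
```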
We use it with the upstream connectors. Kafka has a component called Kafka Connect. We heavily use Debezium, a change-data-capture (CDC) connector that reads data from MySQL and funnels it in real time into Kafka.

CDC is essentially a way to replicate data from one data source to others. Wikipedia's fancy definition of CDC is "… the identification, capture, and delivery of the changes made to the enterprise data sources." A concrete example is what something like Debezium will do with a MySQL database. When I insert a row, update that row, and later delete that row, the CDC feed will give me three different events: an insert, the update, and the delete. In some cases, it will also provide the before and the after states of that row. As you can imagine, this can be useful if you're building out a data warehouse.
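The snippet below sketches what those three change events roughly look like and how a downstream copy can be rebuilt by replaying them. The field layout follows the spirit of Debezium's envelope (an op code plus before/after images) but is heavily simplified; a real Debezium record also carries source metadata, timestamps, and schema information.

```python
# Simplified shapes of the insert/update/delete change events described above.
insert_event = {"op": "c", "before": None,
                "after": {"id": 7, "email": "a@example.com"}}

update_event = {"op": "u", "before": {"id": 7, "email": "a@example.com"},
                "after": {"id": 7, "email": "b@example.com"}}

delete_event = {"op": "d", "before": {"id": 7, "email": "b@example.com"},
                "after": None}


def apply(table, event):
    """Replay one change event against a dict acting as the downstream copy."""
    if event["op"] == "d":
        table.pop(event["before"]["id"], None)
    else:
        row = event["after"]
        table[row["id"]] = row


replica = {}
for e in (insert_event, update_event, delete_event):
    apply(replica, e)
print(replica)  # {} -> the row was created, updated, then removed downstream
```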
Debezium can use a bunch of sources. We use MySQL, as I mentioned. One of the things in that Ada post that caught my eye was the fact that they were using MongoDB - sure enough, Debezium has a MongoDB connector. We contributed a Cassandra connector to Debezium a couple of months ago. It's incubating and we're still getting up off the ground with it ourselves but that's something that we're going to be using heavily in the near future.

Figure 7: Debezium sources

Last but not least in our architecture, we have KCBQ, which stands for Kafka Connect BigQuery (I do not name things creatively). This connector takes data from Kafka and loads it into BigQuery. The cool thing about this, though, is that it leverages BigQuery's real-time streaming insert API.

One of the cool things about BigQuery is that you can use its RESTful API to post data into the data warehouse in real time and it's visible almost immediately. That gives us a latency from our production database to our data warehouse of a couple of seconds.
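As an illustration, here is a minimal streaming insert using the official google-cloud-bigquery Python client, the same RESTful streaming-insert mechanism described above. The project, dataset, and table names are placeholders, and this is a sketch of the API rather than the KCBQ connector itself.

```python
# A minimal sketch of a BigQuery streaming insert with the Python client.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.warehouse.payments"  # hypothetical destination table

rows = [
    {"payment_id": 1, "account": 42, "amount": 9.99},
    {"payment_id": 2, "account": 43, "amount": 4.50},
]

# Rows sent this way are queryable within seconds, which is where the
# couple-of-seconds end-to-end latency described above comes from.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Streaming insert failed:", errors)
```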

This pattern opens up a lot of use cases. It lets you do real-time metrics and business intelligence off of your data warehouse. It also allows you to debug, which is not immediately obvious - if your engineers need to see the state of their database in production right now, being able to go to the data warehouse to expose that to them so that they can figure out what's going on with their system with essentially a real-time view is pretty handy.

You can also do some fancy monitoring with it. You can impose assertions about what the shape of the data in the database should look like so that you can be satisfied that the data warehouse and the underlying web service itself are healthy.

Figure 8 shows some of the inevitable problems we encountered in this migration. Not all of our connectors were on this pipeline, so we found ourselves between the new cool stuff and the older painful stuff.

Figure 8: Problems at WePay in the migration to Stage 2

Datastore is a Google Cloud system that we were using; that was still Airflow-based.

Cassandra didn't have a connector and neither did Bigtable, which is a Google Cloud equivalent of HBase.

We had BigQuery but BigQuery needed more than just our primary OLTP data; it needed logging and metrics. We had Elasticsearch and this fancy graph database (which we're going to be open-sourcing soon) that also needed data.
The ecosystem was looking more complicated. We're no longer talking about this little monolithic database but about something like Figure 9, which comes from Confluent and is pretty accurate.

Figure 9: The data ecosystem is no longer a monolith

You have to figure out how to manage some of this operational pain. One of the first things you can do is to start integration so that you have fewer systems to deal with. We used Kafka for that.

Stage 3: Integration
If you think back 20 years to enterprise-service-bus architectures, that's really all data integration is. The only difference is that streaming platforms like Kafka along with the evolution in stream processing have made this viable.

You might be ready for data integration if you've got a lot of microservices. You have a diverse set of databases as Figure 8 depicts. You've got some specialized, derived data systems; I mentioned a graph database but you may have special caches or a real-time OLAP system. You've got a team of data engineers now, people who are responsible for managing this complex workload. Hopefully, you have a happy, mature SRE organization that's more than willing to take on all these connectors for you.

Figure 10: Stage 3 of data-pipeline maturity

Figure 10 shows what data integration looks like. We still have the base data pipeline that we've had so far. We've got a service with a database, we've got our streaming platform, and we've got our data warehouse, but now we also have web services, maybe a NoSQL thing, or a NewSQL thing. We've got a graph database and search system plugged in.

Figure 11 depicts where WePay was at the beginning of 2019. Things were becoming more complicated. Debezium connects not only to MySQL but to Cassandra as well, with the connector that we'd been working on. At the bottom is Kafka Connect Waltz (KCW). Waltz is a ledger that we built in house that's Kafka-ish in some ways and more like a database in other ways, but it services our ledger use cases and needs.

Figure 11: WePay's data ecosystem at the start of 2019

We are a payment-processing system so we care a lot about data transactionality and multi-region availability and so we use a quorum-based write-ahead log to handle serializable transactions. On the downstream side, we've got a bunch of stuff going on.
We were incurring a lot of pain and have many boxes on our diagram. This is getting more and more complicated.

The reason we took on this complexity has to do with Metcalfe's law. I'm going to paraphrase the definition and probably corrupt it: it essentially states that the value of a network increases as you add nodes and connections to it. Metcalfe's law was initially intended to apply to communication devices, like adding more peripherals to an Ethernet network.

So, we're getting to a network effect in our data ecosystem. In a post in early 2019, I thought through the implications of Kafka as an escape hatch. You add more systems to the Kafka bus, all of which are able to load their data in, expose it to other systems, and slurp up the data in Kafka, and you leverage this network effect in your data ecosystem.

We found this to be a powerful architecture because the data becomes portable. I'm not saying it'll let you avoid vendor lock-in but it will at least ameliorate some of those concerns. Porting data is usually the harder part to deal with when you're moving between systems. The idea is that it becomes theoretically possible, if you're on Splunk for example, to plug in Elasticsearch alongside it to test it out - and the cost to do so is certainly lower.

Data portability also helps with multi-cloud strategy. If you need to run multiple clouds because you need high availability or you want to pick cloud vendors to save money, you can use Kafka and the Kafka bus to move the data around.

Lastly, I think it leads to infrastructure agility. I alluded to this with my Elasticsearch example but if you come across some new hot real-time OLAP system that you want to check out or some new cache that you want to plug in, having your data already in your streaming platform in Kafka means that all you need to do is turn on the new thing and plug in a sink to load the data. It drastically lowers the cost of testing new things and supporting specialized infrastructure.

You can easily plug in things that do one or two things really well, when before you might have had to decide between tradeoffs like supporting a specialized graph database or using an RDBMS which happens to have joins. By reducing the cost of specialization, you can build a more granular infrastructure to handle your queries.

The problems in Stage 3 look a little different. When WePay bought into this integration architecture, we found ourselves still spending a lot of time on fairly manual tasks like those in Figure 12.

Figure 12: WePay's problems in Stage 3

In short, we were spending a lot of time administering the systems around the streaming platform - the connectors, the upstream databases, the downstream data warehouses - and our ticket load looked like Figure 13.

Figure 13: Ticket load at WePay in Stage 3

Fans of JIRA might recognize Figure 13. It is a screenshot of our support load in JIRA in 2019. It starts relatively low then it skyrockets and it never fully recovered, although there's a nice trend late in the year that relates to the next step of our evolution.
Stage 4: Automation
We started investing in automation. This is something you've got to do when your system gets this big. I think most people would say we should have been automating all along.

You might be ready for Stage 4 if your SREs can't keep up, you're spending a lot of time on manual toil, and you don't have time for the fun stuff.

Figure 14 shows the two new layers that appear in Stage 4. The first is the automation of operations, and this won't surprise most people. It's the DevOps stuff that has been going on for a long time. The second layer, data-management automation, is not quite as obvious.

Figure 14: Stage 4 adds two new layers to the data ecosystem

Let's first cover automation for operations. Google's Site Reliability Engineering handbook defines toil as manual, repeatable, automatable stuff. It's usually interrupt-driven: you're getting Slack messages or tickets or people are showing up at your desk asking you to do things. That is not what you want to be doing.

The Google book says, "If a human operator needs to touch your system during normal operations, you have a bug."

But the "normal operations" of data engineering were what we were spending our time on. Anytime you're managing a pipeline, you're going to be adding new topics, adding new data sets, setting up views, and granting access.

This stuff needs to get automated. Great news! There's a bunch of solutions for this: Terraform, Ansible, and so on.

We at WePay use Terraform and Ansible but you can substitute any similar product.
Figure 15: Some systemd_log thing in Terraform that logs some stuff when you're using compaction (which is an exciting policy to use with your systemd_logs)

Figure 16: Managing your Kafka Connect connectors in Terraform

You can use it to manage your topics. Figures 15 and 16 show some Terraform automations. Not terribly surprising.
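The figures show this kind of automation expressed in Terraform. As an illustration of the same idea done programmatically, the sketch below creates topics with the confluent-kafka AdminClient; the topic names, partition counts, and compaction setting are hypothetical.

```python
# Scripting the same topic-management toil directly against the cluster.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topics = [
    NewTopic("db.payments", num_partitions=12, replication_factor=3),
    # A compacted topic, in the spirit of the systemd_log example in Figure 15.
    NewTopic("systemd_log", num_partitions=6, replication_factor=3,
             config={"cleanup.policy": "compact"}),
]

futures = admin.create_topics(topics)
for topic, future in futures.items():
    try:
        future.result()  # raises if creation failed
        print(f"created {topic}")
    except Exception as exc:
        print(f"{topic}: {exc}")
```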

Yes, we should have been doing this, but we kind of were doing this already. We had Terraform, we had Ansible for a long time - we had a bunch of operational tooling. We were fancy and on the cloud.

We had a bunch of scripts to manage BigQuery and automate a lot of our toil like creating views in BigQuery, creating data sets, and so on.
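A sketch of what that kind of scripted toil can look like with the google-cloud-bigquery client is shown below; the project, dataset, view, and SQL are hypothetical placeholders rather than WePay's actual tooling.

```python
# Creating a dataset and a view, the sort of repeated task worth scripting.
from google.cloud import bigquery

client = bigquery.Client()

# Create a dataset for a team if it doesn't already exist.
dataset = bigquery.Dataset("my-project.payments_reporting")
dataset.location = "US"
client.create_dataset(dataset, exists_ok=True)

# Create a view that exposes only the columns the team needs.
view = bigquery.Table("my-project.payments_reporting.daily_totals")
view.view_query = """
    SELECT DATE(created_at) AS day, SUM(amount) AS total
    FROM `my-project.warehouse.payments`
    GROUP BY day
"""
client.create_table(view, exists_ok=True)
```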
So why did we have such a high ticket load? The answer is that we were spending a lot of time on data management. We were answering questions like "Who's going to get access to this data once I load it?", "Security, is it okay to persist this data indefinitely or do we need to have a three-year truncation policy?", and "Is this data even allowed in the system?" As a payment processor, WePay deals with sensitive information and our people need to follow geography and security policies and other stuff like that.

We have a fairly robust compliance arm that's part of JPMorgan Chase. Because we deal with credit cards, we have PCI audits and we deal with credit-card data. Regulation is here and we really need to think about this. Europe has GDPR. California has CCPA. PCI applies to credit-card data. HIPAA for health. SOX applies if you're a public company. New York has SHIELD. This is going to become more and more of a theme, so get used to it. We have to get better at automating this stuff or else our lives as data engineers are going to be spent chasing people to make sure this stuff is compliant.

I want to discuss what that might look like. As I get into the futuristic stuff, I get more vague or hand-wavy, but I'm trying to keep it as concrete as I can. First thing you want to do for automated data management is probably to set up a data catalog. You probably want it centralized, i.e., you want to have one with all the metadata. The data catalog will have the locations of your data, what schemas that data has, who owns the data, and lineage, which is essentially the source and path of the data.

The lineage for my initial example is that it came from MySQL, it went to Kafka, and then it got loaded into BigQuery - that whole pipeline. Lineage can even track encryption or versioning, so you know what things are encrypted and what things are versioned as the schemas evolved.

There's a bunch of activity in this lineage area. Amundsen is a data catalog from Lyft. You have Apache Atlas. LinkedIn open-sourced DataHub in 2020. WeWork has a system called Marquez. Google has a product called Data Catalog. I know I'm missing more.

These things generally do a lot, more than one thing, but I want to show a concrete example. I yanked Figure 17 from the Amundsen blog.

It has fake data, the schema, the field types, the data types, everything. At the right, it has who owns the data - and notice that Add button there. It tells us what source code generated the data — in this case, it's Airflow, as indicated by that little pinwheel — and some lineage. It even has a little preview. It's a pretty nice UI.

Underneath it, of course, is a repository that actually houses all this information. That's really useful because you need to get all your systems to be talking to this data catalog.

Figure 18: Your data ecosystem needs to talk to your data catalog
That Add button in the Owned By section is important. You don't as a data engineer want to be entering that data yourself. You do not want to return to the land of manual data stewards and data management. Instead, you want to be hooking up all these systems to your data catalog so that they're automatically reporting stuff about the schema, about the evolution of the schema, about the ownership when the data is loaded from one to the next.

First off, you need your systems like Airflow and BigQuery, your data warehouses and stuff, to talk to the data catalog. I think there's quite a bit of movement there.

You then need your data-pipeline streaming platforms to talk to the data catalog. I haven't seen as much yet for that. There may be stuff coming out that will integrate better, but right now I think that's something you've got to do on your own.

I don't think we've done a really good job of bridging the gap on the service side. You want your service stuff in the data catalog as well: things like gRPC protobufs, JSON schemas, and even the DBs of those databases.

Once you know where all your data is, the next step is to configure access to it. If you haven't automated this, you're probably going to Security, Compliance, or whoever the policymaker is and asking if this individual can see this data whenever they make access requests - and that's not where you want to be. You want to be able to automate the access-request management so that you can be as hands off with it as possible.

This is kind of an alphabet soup with role-based access control (RBAC), identity access management (IAM), and access-control lists (ACLs). Access control is just a bunch of fancy words for a bunch of different features for managing groups, user access, and so on.

You need three things to do this: you need your systems to support it, you need to provide tooling to policymakers so they can configure the policies appropriately, and you need to automate the management of the policies once the policymakers have defined them.

There has been a fair amount of work done to support this aspect. Airflow has RBAC, which was a patch WePay submitted. Airflow has taken this seriously and has added a lot more, like DAG-level access control. Kafka has had ACLs for quite a while.

Figure 19: Managing Kafka ACLs with Terraform

You can use tools to automate this stuff. We want to automate adding a new user to the system and configuring their access. We want to automate the configuration of access controls when a new piece of data is added to the system. We want to automate service-account access as new web services come online.

There's occasionally a need to grant someone temporary access to something. You don't want to have to set a calendar reminder to revoke the access for this user in three weeks. You want that to be automated. The same goes for unused access. You want to know when users aren't using all the permissions that they're granted so that you can strip those unused permissions to limit the vulnerability of the space.
Now that your data catalog tells you where all the data is and you have policies set up, you need to detect violations. I mostly want to discuss data loss prevention (DLP) but there's also auditing, which is keeping track of logs and making sure that the activities and systems are conforming to the required policies.

I'm going to talk about Google Cloud Platform because I use it and I have some experience with its data-loss solution. There's a corresponding AWS product called Macie. There's also an open-source project called Apache Ranger, with a bit of an enforcement and monitoring mechanism built into it; that's more focused on the Hadoop ecosystem. What all these things have in common is that you can use them to detect the presence of sensitive data where it shouldn't be.

Figure 20: Detecting sensitive data

Figure 20 is an example. A piece of submitted text contains a phone number, and the system sends a result that says it is "very likely" that it has detected an infoType of phone number. You can use this stuff to monitor your policies. For example, you can run DLP checks on a data set that is supposed to be clean - i.e., not have any sensitive information in it - and if a check finds anything like a phone number, Social Security number, credit card, or other sensitive information, it can immediately alert you that there's a violation in place.
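For illustration, the sketch below runs the kind of inspection shown in Figure 20 using the google-cloud-dlp Python client. The project ID and sample text are placeholders, and the exact request shape can vary slightly between client versions.

```python
# A sketch of a Cloud DLP inspection check against supposedly clean data.
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project"  # hypothetical project

response = client.inspect_content(
    request={
        "parent": parent,
        "inspect_config": {
            "info_types": [{"name": "PHONE_NUMBER"}, {"name": "CREDIT_CARD_NUMBER"}],
            "include_quote": True,
        },
        # In practice you would sample rows from a table that is supposed to
        # be clean rather than passing a literal string.
        "item": {"value": "Call me at (555) 253-0000."},
    }
)

for finding in response.result.findings:
    # e.g. PHONE_NUMBER VERY_LIKELY -> raise an alert / record a violation
    print(finding.info_type.name, finding.likelihood.name)
```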
There's a little bit of progress here. Users can use the data catalog and find the data that they need, we have some automation in place, and maybe we're using Terraform to manage ACLs for Kafka or to manage RBAC in Airflow.

But there's still a problem and that is that data engineering is probably still responsible for managing that configuration and those deployments. The reason for that is mostly the interface. We're still getting pull requests, in Terraform, DSL, YAML, JSON, Kubernetes ... it's nitty-gritty.
It might be a tall order to ask security teams to make changes to that. Asking your compliance wing to make changes is an even taller order. Going beyond your compliance people is basically impossible.

Stage 5: Decentralization
You're probably ready to decentralize your data pipeline and your data warehouses if you have a fully automated real-time data pipeline but people are still coming to ask you to load data.

Figure 21: Decentralization is Stage 5 of the data ecosystem

If you have an automated data pipeline and data warehouse, I don't think you need a single team to manage all this stuff. I think the place where this will first happen, and we're already seeing this in some ways, is in a decentralization of the data warehouse. I think we're moving towards a world where people are going to be empowered to spin up multiple data warehouses and administer and manage their own.

I frame this line of thought based on our migration from monolith to microservices over the past decade or two. Part of the motivation for that was to break up large, complex things, to increase agility, to increase efficiency, and to let people move at their own pace. A lot of those characteristics sound like your data warehouse: it's monolithic, it's not that agile, you have to ask your data engineering team to do things, and maybe you're not able to do things at your own pace. I think we're going to want to do the same thing - go from a monolith to microwarehouses - and we're going to want a more decentralized approach.

I'm not alone in this thought. Zhamak Dehghani wrote a great blog post that is such a great description of what I'm thinking. She discusses the shift from this monolithic view to a more fragmented or decentralized view. She even discusses policy automation and a lot of the same stuff that I'm thinking about.

I think this shift towards decentralization will take place in two phases. Say you have a set of raw tools - Git, YAML, JSON, etc. - and a beaten-down engineering team that is getting requests left and right and running scripts all the time. To escape that, the first step is simply to expose that raw set of tools to your other engineers.

They're comfortable with this stuff, they know Git, they know pull requests, they know YAML and JSON and all that. You can at least start to expose the automated tooling and pipelines to those teams so that they can begin to manage their own data warehouses.

An example of this would be a team that does a lot of reporting. They need a data warehouse that they can manage so you might just give them keys to the castle, and they can go about it. Maybe there's a business-analytics team that's attached to your sales organization and they need a data warehouse. They can manage their own as well.

This is not the end goal; the end goal is full decentralization. But for that we need much more development of the tooling that we're providing, beyond just Git, YAML, and the RTFM attitude that we sometimes throw around.
We need polished UIs, something that you can give not only to an engineer who's been writing code for 10 years but to almost anyone in your organization.

If we can get to that point, I think we will be able to create a fully decentralized warehouse and data pipeline where Security and Compliance can manage access controls while data engineers manage the tooling and infrastructure.

This is what Maxime Beauchemin meant by "... data engineers build tools, infrastructure, frameworks, and services." Everyone else can manage their own data pipelines and their own data warehouses and data engineers can help them do that. There's that key word "help" that I drew attention to at the beginning.

Figure 21 is my view of a modern decentralized data architecture. We have real-time data integration, a streaming platform, automated data management, automated operations, decentralized data warehouses and pipelines, and happy engineers, SREs, and users.

TL;DR

• The job of data engineers is to help an organization move and process data but not to do that themselves.

• Data architecture follows a predictable evolutionary path from monolith to automated, decentralized, self-serve data microwarehouses.

• Integration with connectors is a key step in this evolution but increases workload and requires automation to correct for that.

• Increasing regulatory scrutiny and privacy requirements will drive the need for automated data management.

• A lack of polished, non-technical tooling is currently preventing data ecosystems from achieving full decentralization.
SPONSORED ARTICLE

5 Elements for a DevOps-Friendly Analytics Solution

When analytics providers say their platform and tools will work with your tech stack, what do they really mean? Here are five important factors to consider when evaluating solutions for data visualization, dashboards and reporting:

1. Security
If the analytics solution is DevOps-friendly, it will continue handling security with the same methods your DevOps team is currently using. You won't have to recreate or replicate BI security information in two different places. The rising trend of DevSecOps teams is an indicator of how essential security is to DevOps teams everywhere. Consider the following:

• Authorization controls. Whatever DevOps is currently using—credentials, enterprise services account, or something else—the analytics platform should adapt to it.

• Standard-based SSO. It's important for user experience that embedded analytics applications play within a single sign-on (SSO) environment—and important for DevOps that they do it based on standards. Any custom elements divert DevOps resources to figuring out SSO peculiarities and decrease the odds of being able to quickly debug any problems.

• Defense against known vulnerabilities. Look for an embedded analytics vendor who is already protecting against common security vulnerabilities that any web application will face.

• Incident visibility. If there is a security incident, does the platform provide detailed event logging? DevOps should be able to quickly find the Who, What, and When of everything going on at the time.

2. Data
A DevOps-friendly analytics solution will not force you to use a proprietary data store, replicate data, or change your current data schemas. You should be able to store your data in place via relational databases, web services, or your own proprietary solution. Applications incorporating analytics should be able to use data in your existing systems and deliver strong performance across disparate sources.

3. Environment
A DevOps-friendly analytics solution will work in any environment. That means applications will work on clouds, in containers, and across mixed and changing deployment environments. These may include Windows or Linux; on-premises, cloud, and hybrid architectures; and mobile devices and browsers. It also means applications can be deployed in a container, or parts of them dispersed across multiple containers.

4. Architecture
If the analytics solution is DevOps-friendly, you should be able to deploy it into your current architecture using standard methods (with minimal architecture-specific steps). For instance, it may be as simple as installing an ASP.NET application on an IIS server or a Java application on an Apache Tomcat server.

5. Release Cycles
A DevOps-friendly embedded analytics solution won't drag release cycles. If you're moving toward continuous integration and continuous delivery (CI/CD), you want to control exactly what gets deployed and when through your build pipelines. Does the embedded analytics platform allow you to deploy smaller size incremental update packages when broader changes aren't needed?

Sustainable Innovation and Differentiation
Your analytics solution will affect how frequently you release standout software, how competitive you are, and how sustainable your advantage is over time. In summary, look for one that:

• Utilizes your technology—including security frameworks and tech stack architecture—as it is

• Leverages your existing processes so you can build and release application updates faster

• Offers flexible scaling so it grows with your business

Logi Analytics: The Embedded Analytics Experts
Logi Analytics empowers the world's software teams with the most intuitive, developer-grade embedded analytics solutions and a team of dedicated people, invested in your success.

Logi leverages your existing tech stack, so you can quickly build, manage and deploy your application. And because Logi supports unlimited customization and white-labeling, you have total control to make the application uniquely your own.

Over 2,200 application teams have trusted Logi to help power their businesses with sophisticated analytics capabilities.
Beyond the Database, and beyond the Stream Processor: What's the Next Step for Data Management?

by Ben Stopford, Lead Technologist, Office of the CTO at Confluent

At QCon London in March 2020, I gave a talk on why both stream processors and databases remain necessary from a technical standpoint and explored industry trends that make consolidation likely in the future. These trends map onto common approaches from active databases like MongoDB to streaming solutions like Flink, Kafka Streams, or ksqlDB.

I work at Confluent, the company founded by the creators of Apache Kafka. These days, I work in the Office of the CTO. One of the things we did last year was to look closely at the differences between stream processors and databases. This led to a new product called ksqlDB. With the rise in popularity of event-streaming systems and their obvious relationship to databases, it's useful to compare how the different models handle data that's moving versus data that is stationary. Maybe more importantly, there is clear consolidation happening between these fields. Databases are becoming increasingly active, emitting events as data is written, and stream processors are increasingly passive, providing historical queries over datasets they've accumulated.

You might think this kind of consolidation at a technical level is an intellectual curiosity, but if you step back a little it really points to a more fundamental shift in the way that we build software. Marc Andreessen, now a venture capitalist in Silicon Valley, has an excellent way of putting this: "software is eating the world". Investing in software companies makes sense simply because both individuals and companies consume more software over time. We buy more applications for our phones, we buy more internet services, companies buy more business software, etc. This software makes our world more efficient. That's the trend.
But this idea that we as an industry merely consume more software is a shallow way to think about it. A more profound perspective is that our businesses effectively become more automated. It's less about buying more software and more about using software to automate our business processes, so the whole thing works on autopilot — a company is thus "becoming software", in a sense. Think Netflix: their business started with sending DVDs via postal mail to customers, who returned the physical media via post. By 2020, Netflix has become a pure software platform that allows customers to watch any movie immediately with the click of a button. Let's look at another example of what it means for businesses to "become software".

Think about something like a loan-processing application for a mortgage someone might get for their home. This is a business process that hasn't changed for a hundred years. There's a credit officer, there's a risk officer, and there's a loan officer. Each has a particular role in this process, and companies write or purchase software that makes the process more efficient. The purchased software helps the credit officer do her job better, or helps the risk officer do his job better. That's the weak form of this analogy: we buy more software to help these people do a better job.

These days, modern digital companies don't build software to help people do a better job: they build software that takes humans completely out of the critical path. Continuing the example, today, we can get a loan application approved in only a few seconds (a traditional mortgage will take maybe a couple of weeks, because it follows that older manual process). This uses software, but it's using software in a different way. Many different pieces of software are chained together into a single, fully automated process. Software talks to other software, and the resulting loan is approved in a few seconds.

So, software makes our businesses more efficient, but think about what this means for the architecture of those systems. In the old world, we might build an application that helps a credit officer do her job better, say, using a three-tier architecture. The credit officer talks to a user interface, then there's some kind of back-end server application running behind that, and a database, and that all helps the person do risk analysis, or whatever it might be, more efficiently.

As companies become more automated, and their business processes become more automated, we end up with many applications talking to one another: software talking to software. This is a humongous shift in system design as it's no longer about helping human users, it's about doing the work in a fully automated fashion.

From Monoliths to Event-Driven Microservices: the evolution of software architecture
We see the same thing in the evolution of software architecture: monolith to distributed monolith to microservices, see Figure 1 below. There is something special about the event-driven microservices though. Event-driven microservices don't talk directly to the user interface. They automate business processes rather than responding to users clicking buttons and expecting things to happen. So in these architectures there is a user-centric side and a software-centric side. As architectures evolve from left to right, in Figure 1, the "user" of one piece of software is more likely to be another piece of software, rather than being a human, so software evolution also correlates with Marc Andreessen's observation.

Figure 1: The evolution of software systems.

Modern architectures and the consequences of traditional databases
We've all used databases. We write a query, and send it to a server somewhere with lots of data on it. The database answers our questions. There's no way we'd ever be able to answer data-centric questions like these by ourselves. There's simply too much data for our brains to parse. But with a database, we simply send a question and we can get an answer. It's wonderful. It's powerful.
The breadth of database systems available today is staggering. Something like Cassandra lets us store a huge amount of data for the amount of memory the database is allocated; Elasticsearch is different, providing a rich, interactive query model; Neo4j lets us query the relationship between entities, not just the entities themselves; things like Oracle or PostgreSQL are workhorse databases that can morph to different types of use case.

Each of these platforms has slightly different capabilities that make it more appropriate to a certain use case but at a high level, they're all similar. In all cases, we ask a question and wait for an answer.

This hints at an important assumption all databases make: data is passive. It sits there in the database, waiting for us to do something. This makes a lot of sense: the database, as a piece of software, is a tool designed to help us humans — whether it's you or me, a credit officer, or whoever — interact with data.

But if there's no user interface waiting, if there's no one clicking buttons and expecting things to happen, does it have to be synchronous? In a world where software is increasingly talking to other software, the answer is: probably not.

One alternative is to use event streams. Stream processors allow us to manipulate event streams similar to the way that databases manipulate data that is held in files.

That's to say: stream processors are built for active data, data that is in motion, and they're built for asynchronicity. But anyone who has used a stream processor probably recognizes that it doesn't feel much like a traditional database.

In a traditional database interaction, the query is active and the data is passive. Clicking a button and running the query is what makes things happen. The data passively waits for us to run that query. If we're off making a cup of tea, the database doesn't do anything. We have to initiate that action ourselves.

In a stream processor, it's the other way around. The query is passive. It sits there running, just waiting for events to arrive. The trigger isn't someone clicking a button and running the query, it's the data itself — an event emitted by some other system or whatever it might be. The interaction model is fundamentally different. So what if we combined them together?

Figure 2: The database versus stream processing.
Event Streams are the key to solving the data issues of modern distributed architectures
With that in mind, I want to dig a little deeper into some of the fundamental data structures used by stream processors, and compare how those relate to databases. Probably the most fundamental relationship here is the one between events, streams and tables.

Events are the building block and, conceptually, they are a simple idea: a simple recording of something that happened in the world at a particular point in time. So, an event could be somebody placing an order, paying for a pair of trousers, or moving a rook in a chess game. It could be the position of your phone or another kind of continuous event stream.

Individual events form into streams, and the picture changes further. An event stream represents the variance in some variable: the position of your phone as you drive, the progress of your order, or the moves in a game of chess.

All are exact recordings of what changed in the world. By comparison, a database table represents a snapshot of the current state at a single point in time. You have no idea what happened previously!

In fact, a stream closely relates to the idea of event sourcing, this programming paradigm that stores data in a database as events with the objective of retaining all this information.

If we want to derive our current state, i.e. a table, we can do this by replaying the event streams.

Figure 3: We can think of chess as a sequence of events or as records of positions

Chess is a good analogy for this. Think of a database table as describing the position of each piece. The position of each of those pieces tells me the current state of a game. I can store that somewhere and reload it if I want to. But that's not the only option; representing a chess game as events gives quite a different result.

We store the sequence of events from the well-known opening position all chess games start from. The current position of the board at any point in the game can then be derived by applying all subsequent moves to the opening position.

Note that the event-based approach contains more information about what actually happened. Not only do we know the positions at one point in the game, we also know how the game unfolded. We can, for example, determine whether we arrived at a position on the board through a brilliant move of one player versus a terrible blunder of the opponent.
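A small sketch of the two representations may help: the same game stored as a snapshot of piece positions versus derived by replaying move events. The board and move encoding below are deliberately simplified and purely illustrative.

```python
# The table view: a snapshot of where the pieces are right now.
opening_position = {"e2": "white_pawn", "e7": "black_pawn", "g1": "white_knight"}

# The stream view: every move that happened, in order.
moves = [
    {"from": "e2", "to": "e4"},
    {"from": "e7", "to": "e5"},
    {"from": "g1", "to": "f3"},
]


def replay(position, events):
    """Fold the event stream into a table: the board state at this point."""
    board = dict(position)
    for move in events:
        board[move["to"]] = board.pop(move["from"])
    return board


current_position = replay(opening_position, moves)
# The snapshot tells us where the pieces are; the stream also tells us how
# they got there, and lets us rebuild the board at any earlier point.
print(current_position)
```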
in a chess game. It could be the Think of a database table as
position of your phone or another describing the position of each Now we can think of an event
kind of continuous event stream.  piece. The position of each of stream as a special type of table.
those pieces tells me the current It’s a particular type of table that
Individual events form into state of a game. I can store that doesn’t exist in most databases.
streams, and the picture somewhere and reload it if I want It’s immutable. We can’t change
changes further. An event stream to. But that’s not the only option, it. And it’s append-only — we can
represents the variance in some representing a chess game as only insert new records to it. By
variable: the position of your events gives quite a different contrast, traditional database
phone as you drive, the progress result. tables let us update and delete as
of your order, or the moves in a well as insert. They’re mutable. 
game of chess. We store the sequence of events
from the well-known opening If I write a value to a stream, it is
All, exact recordings of what position all chess games start automatically going to live forever.
changed in the world. By from. The current position of the I can’t go back and change some
comparison, a database table board at any point in the game arbitrary event, so it’s more like an
represents a snapshot of the can then be derived by applying all accounting ledger with double-

27
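To make the stream-to-table relationship concrete, here is a minimal, illustrative sketch in ksqlDB-style SQL (the technology the article itself points to later). The stream, topic, and column names are invented for the example: an append-only stream of order events is collapsed into a table holding the latest state per key, which is the "replay the events to derive the current state" idea described above.

-- An immutable, append-only stream of events, backed by a Kafka topic (illustrative names).
CREATE STREAM order_events (order_id VARCHAR KEY, status VARCHAR)
  WITH (kafka_topic = 'order_events', value_format = 'JSON', partitions = 1);

-- A table derived from the stream: the current state per order,
-- continuously recomputed as new events arrive.
CREATE TABLE order_status AS
  SELECT order_id, LATEST_BY_OFFSET(status) AS current_status
  FROM order_events
  GROUP BY order_id
  EMIT CHANGES;

Replaying order_events from the beginning always reproduces the same order_status table, which is the chess analogy in miniature.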
If I write a value to a stream, it is automatically going to live forever. I can't go back and change some arbitrary event, so it's more like an accounting ledger with double-entry bookkeeping. Everything that ever happened is recorded in the ledger. But when we think about data in motion there is another good reason for taking this approach.

Events are typically published to other parts of the system or the company, so an event might already have been consumed by somebody else by the time you want to change it.

So unlike in a database, where you can just update the row in question, when data is moving you actually need to create a compensating event that can propagate that change to all listeners.

All in all, this makes events a far better way to represent data in a distributed architecture.

Streams and Tables: two sides of the same coin?
So events provide a different type of data model — but counterintuitively, internally, a stream processor actually uses tables, much like a database.

These internal tables are analogous to the temporary tables traditional databases use to hold intermediate results, but unlike temporary tables in a database, they are not discarded when the query is done. This is because streaming queries run indefinitely, so there is no reason to discard them.

Figure 4: Stream processors use tables internally to hold intermediary results, but data in and out is all represented by streams.

Figure 4 depicts how a stream processor might conduct credit scoring. There's a payment stream for accounts on the left. A credit-scoring function computes each user's credit score and keeps the result in a table. This table, which is internal to the stream processor, is much smaller than the input stream.

The stream processor continues to listen to the payments coming in, updating the credit scores in the internal table as it does so, and outputting the results via another event stream.

So, the thing to note is, when we use the stream processor, we create tables, we just don't really show them to anybody. It's purely an implementation detail. All the interaction is via the event streams.

This leads to what is known as the stream/table duality, where the stream represents history, every single event, or state change, that has happened to the system. This can be collapsed into a table via some function, be it a simple "group by key" or a complex machine learning routine. That table can be turned back into an event stream by "listening" to it. Note that, most of the time, the output stream won't be the same as the input, as the processing is usually lossy, but if we keep the original input stream in Kafka we can always rewind and regenerate.

So, we have this kind of duality: streams can go to tables, tables can go back to streams. In technologies like ksqlDB these two ideas are blended together: you can do streaming transformations from stream to stream, and you can also create tables, or "materialized views" as they are sometimes referred to, and query those like a regular database. Some database technologies like MongoDB and RethinkDB are blending these concepts together too, but they approach the problem from the opposite direction.
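As a hedged sketch, the credit-scoring pipeline of Figure 4, expressed as a ksqlDB-style materialized view, might look roughly like this. The payments schema is assumed, and the trivial aggregation stands in for whatever real scoring function would be used:

-- Input: a stream of payment events (illustrative schema).
CREATE STREAM payments (account_id VARCHAR KEY, amount DOUBLE)
  WITH (kafka_topic = 'payments', value_format = 'JSON', partitions = 1);

-- A continuously maintained table: the stream processor's internal state,
-- surfaced as a materialized view. A real credit-scoring function would be
-- far richer than this stand-in aggregation.
CREATE TABLE credit_scores AS
  SELECT account_id,
         COUNT(*)    AS payment_count,
         AVG(amount) AS avg_payment
  FROM payments
  GROUP BY account_id
  EMIT CHANGES;

The changelog of credit_scores is itself a stream, so downstream services can subscribe to score changes instead of polling the table, which is the table-to-stream half of the duality.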
How Stream Processors and Databases differ
A stream processor performs some operations just like a database does. For example, a stream-table join (see Figure 5) is algorithmically very similar to an equi-join in a database. As events arrive, the corresponding value is looked up in the table via its primary key.

Figure 5: Joining a stream with a table.

However, joining two streams together using a stream-stream join is very different, because we need to join events as they arrive, as depicted in Figure 6 below. As the events move into the stream processor, they get buffered inside an index and the stream processor looks for the corresponding event in the other stream. It also uses the event timestamp on the event to determine the order in which to process them, so it doesn't matter which event came first. This is quite different from a traditional database.

Figure 6: Joining two streams.
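The two kinds of join could be written along these lines in ksqlDB-style SQL, assuming an orders_stream, a payments_stream, and a customers_table (keyed by customer_id) have already been declared; all names here are illustrative:

-- Stream-table join: each order event is enriched by a key lookup into the customers table.
SELECT o.order_id, o.amount, c.region
FROM orders_stream o
JOIN customers_table c ON o.customer_id = c.customer_id
EMIT CHANGES;

-- Stream-stream join: events from both sides are buffered and matched,
-- with a time bound (WITHIN) on how far apart the two events may arrive.
SELECT o.order_id, p.payment_id
FROM orders_stream o
JOIN payments_stream p WITHIN 1 HOUR ON o.order_id = p.order_id
EMIT CHANGES;

The WITHIN clause is what bounds how long events are buffered while waiting for their counterpart on the other stream.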
Stream processors have other features which databases don't have. They can correlate events in time, typically using the concept of windows.

For example, we can use a time window to restrict which messages contribute to a join operation. Say we want to aggregate a stream of temperature measurements using windows that each span five minutes of time to compute a five-minute temperature average. It's very hard to do that inside a traditional database in a realtime way. A stream processor can also do more advanced correlations. For example, a session window is a difficult thing to implement in a database.

A session has no defined length; it lasts for some period of time and dynamically ends after a period of inactivity. Bob is looking for trousers on our website, maybe he buys some, and then goes away — that's a session. A session window allows us to detect and isolate that amorphous period.

Another unique property of stream processors is their ability to handle late and out-of-order data. For example, they retain old windows which can be updated retrospectively if late or out-of-order data arrives. Traditional databases can be programmed to do similar types of queries, but not in a way that yields either accurate or performant realtime results.
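The five-minute temperature average and the session window described above might be expressed roughly like this in ksqlDB-style SQL; the temperature_readings and website_clicks streams and the 30-minute inactivity gap are assumptions made for the example:

-- Tumbling window: a five-minute average temperature per sensor.
SELECT sensor_id,
       AVG(temperature) AS avg_temp
FROM temperature_readings
WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY sensor_id
EMIT CHANGES;

-- Session window: groups a user's events together until a gap of
-- inactivity (here 30 minutes) closes the session.
SELECT user_id, COUNT(*) AS events_in_session
FROM website_clicks
WINDOW SESSION (30 MINUTES)
GROUP BY user_id
EMIT CHANGES;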
The benefits of a hybrid, stream-oriented database
So by now, it should be clear that we have two very different types of query. In stream processing, we have the notion of a push query: queries that push data out of the system. Traditional databases have this notion of a pull query: ask a question and get an answer returned back. What's interesting to me is the hybrid world that sits between them, combining both approaches. We can send a select statement and get a response. At the same time, we can also listen to changes on that table as they happen. We can have both interaction models.

Figure 7: A materialized view, defined by a stream processor, which the second application interacts with either via a query or via a stream.

So, I believe we are in a period of convergence, where we will end up with a unified model that straddles streams and tables, handles both the asynchronous and the synchronous, and provides users with an interaction model that is both active and passive. There is, in fact, a unified model for describing all of this, which we can create using something familiar like SQL, albeit with some extensions — a single model that rules them all.

A query can run from the start of time to now, from now until the end of time, or from the start of time to the end of time. "Earliest to now" is just what a regular database does: the query executes over all rows in the database and terminates when it has covered them all. "Now to forever" is what stream processors do today: the query starts on the next event it receives and continues running forever, recomputing itself as new data arrives and creating new output records.

"Earliest to forever" is a less-well-explored combination but is a very useful one. Say we're building a dashboard application. With "earliest to forever", we can run a query that loads the right data into the dashboard, and then continues to keep it up to date. There is one last subtlety: are historical queries event-based (all moves in the chess game) or snapshots (the positions of the pieces now)? This is the other option a unified model must include.

Figure 8: The unified interaction model.

Put this all together and we get a universal model that combines stream processing and databases, as well as a middle ground between them: queries that we want to be kept up to date. There is work in progress to add such streaming SQL extensions to the official ANSI SQL standard, but we can try it out today in technologies like ksqlDB. For example, here is the SQL for push ("now to forever") and pull ("earliest to now") queries.

Figure 9: Push and pull queries in the unified interaction model.
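Figure 9 shows this SQL as an image. As a sketch of the same idea, reusing the illustrative credit_scores table from earlier, the two interaction models differ only in how the query terminates:

-- Pull query ("earliest to now"): ask a question, get an answer, terminate.
SELECT avg_payment FROM credit_scores WHERE account_id = 'bob';

-- Push query ("now to forever"): subscribe to the result and receive
-- a new row every time the underlying state changes.
SELECT account_id, avg_payment FROM credit_scores EMIT CHANGES;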
While stream processors are becoming more database-like, the reverse is also true. Databases are becoming more like stream processors. We see this in active databases like MongoDB, Couchbase, and RethinkDB. They don't have the same primitives for processing event streams, handling asynchronicity, or handling time and temporal correlations, but they do let us create streams from tables, and compared to tools like ksqlDB they are better at performing pull queries, as they're approaching the problem from that direction.

So, whether you come at this from the stream processing side or the database side, there is a clear drive towards a centre ground. I think we're going to see a lot more of it.

The database: is it time for a rethink?
If you think back to where we started, Andreessen's observation of a world being eaten by software, this suggested a world where software talks to software. Where user interfaces, the software that helps you and me, are a smaller part of the whole package.

We see this today in all manner of businesses across ride sharing, finance, automotive — it's coming up everywhere.

This means two things for data. Firstly, we need data tooling that can handle both the asynchronous and the synchronous. Secondly, we also need different interaction models: models which push results to us and chain stages of autonomous data processing together.

For the database this means the ability to query passive data sets and get answers to questions users have. But it also means active interactions that push data to different subscribing services.

On one side of this evolution are active databases: MongoDB, Couchbase, RethinkDB, etc. On the other are stream processors: ksqlDB, Flink, Hazelcast Jet.

Whichever path prevails, one thing is certain: we need to rethink what a database is, what it means to us, and how we interact with both the data it contains and the event streams that connect modern businesses together.

TL;DR

• As companies, and their business processes, become more automated, we end up with many applications talking to one another. This is a humongous shift in system design, as it's about doing the work in a fully automated fashion by machines.

• In traditional databases, data is passive and queries are active: the data passively waits for something to run a query. In a stream processor, data is active and the query is passive: the trigger is the data itself. The interaction model is fundamentally different.

• In modern applications, we need the ability to query passive data sets and get answers for users' actions, but we also need active interaction through data that is pushed as an event stream to different subscribing services.

• We need to rethink what a database is, what it means to us, and how we interact with both the data it contains and the event streams that connect it all together.
Data Gateways in the Cloud Native Era
by Bilgin Ibryam, Product Manager at Red Hat

These days, there is a lot of excitement around 12-factor apps, microservices, and service mesh, but not so much around cloud-native data. The number of conference talks, blog posts, best practices, and purpose-built tools around cloud-native data access is relatively low. One of the main reasons for this is that most data access technologies are architected and created in a stack that favors static environments rather than the dynamic nature of cloud environments and Kubernetes.

In this article, we will explore the different categories of data gateways, from more monolithic ones to ones designed for the cloud and Kubernetes. We will see what technical challenges are introduced by the Microservices architecture and how data gateways can complement API gateways to address these challenges in the Kubernetes era.

Application architecture evolutions
Let's start with what has been changing in the way we manage code and data in the past decade or so. I still remember the time when I started my IT career by creating frontends with Servlets, JSP, and JSF. In the backend, EJBs, SOAP, and server-side session management were the state-of-the-art technologies and techniques. But things changed rather quickly with the introduction of REST and the popularization of JavaScript.

REST helped us decouple frontends from backends through a uniform interface and resource-oriented requests. It popularized stateless services and enabled response caching by moving all client session state to clients, and so forth. This new architecture was the answer to the huge scalability demands of modern businesses.

A similar change happened with the backend services through the Microservices movement. Decoupling from the frontend was not enough, and the monolithic backend had to be
decoupled into bounded contexts, enabling independent fast-paced releases. These are examples of how architectures, tools, and techniques evolved, pressured by the business need for fast software delivery of planet-scale applications.

Application architecture evolution brings new challenges

That takes us to the data layer. One of the existential motivations for microservices is having independent data sources per service. If you have microservices touching the same data, that sooner or later introduces coupling and limits independent scalability or releasing. It is not only an independent database but also a heterogeneous one, so every microservice is free to use the database type that fits its needs.

While decoupling the frontend from the backend and splitting monoliths into microservices gave the desired flexibility, it created challenges not present before. Service discovery and load balancing, network-level resilience, and observability turned into major areas of technology innovation addressed in the years that followed.

Similarly, creating a database per microservice, with the freedom to choose different datastore technologies, is a challenge. That shows itself more and more recently with the explosion of data and the demand for accessing data not only by the services but also by real-time reporting and AI/ML needs.

The rise of API gateways
With the increasing adoption of Microservices, it became apparent that operating such an architecture is hard. While having every microservice independent sounds great, it requires tools and practices that we didn't need and didn't have before.

This gave rise to more advanced release strategies such as blue/green deployments, canary releases, and dark launches. Then that gave rise to fault injection and automatic recovery testing.

And finally, that gave rise to advanced network telemetry and tracing. All of these created a whole new layer that sits between the frontend and the backend. This layer is occupied primarily with API management gateways, service discovery, and service mesh technologies, but also with tracing components, application load balancers, and all kinds of traffic management and monitoring proxies. This even includes projects such as Knative, with activation and scaling-to-zero features driven by the networking activity.

With time, it became apparent that creating microservices at a fast pace and operating them at scale requires tooling we didn't need before. Something that was fully handled by a single load balancer had to be replaced with a new advanced management layer. A new technology layer, a new set of practices and techniques, and a new group of users responsible for it were born.
The case for data gateways
Microservices influence the data layer in two dimensions. First, they demand an independent database per microservice. From a practical implementation point of view, this can range from an independent database instance to independent schemas and logical groupings of tables. The main rule here is that only one microservice owns and touches a dataset, and all data is accessed through the APIs or events of the owning microservice.

The second way a microservices architecture influences the data layer is through datastore proliferation. Just as it enables microservices to be written in different languages, this architecture allows every microservices-based system to have a polyglot persistence layer. With this freedom, one microservice can use a relational database, another one can use a document database, and a third one an in-memory key-value store.

While microservices allow you all that freedom, again it comes at a cost. It turns out operating a large number of datastores comes at a cost that existing tooling and practices were not prepared for. In the modern digital world, storing data in a reliable form is not enough.

Data is useful when it turns into insights, and for that, it has to be accessible in a controlled form by many. AI/ML experts, data scientists, and business analysts all want to dig into the data, but the application-focused microservices and their data access patterns are not designed for these data-hungry demands.

This is where data gateways can help you. A data gateway is like an API gateway, but it understands and acts on the physical data layer rather than the networking layer.

API and Data gateways offering similar capabilities at different layers

Here are a few areas where data gateways differ from API gateways.

Abstraction
An API gateway can hide implementation endpoints and help upgrade and roll back services without affecting service consumers. Similarly, a data gateway can help abstract a physical data source and its specifics, and help alter, migrate, or decommission it without affecting data consumers.

Security
An API manager secures resource endpoints based on HTTP methods. A service mesh secures traffic based on network connections. But none of them can understand and secure the data, and its shape, that is passing through them. A data gateway, on the other hand, understands the different data sources and the data model and acts on them. It can apply RBAC per data row and column, and filter, obfuscate, and sanitize individual data elements whenever necessary. This is a more fine-grained security model than the networking- or API-level security of API gateways.
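As a purely illustrative sketch of row- and column-level control, independent of any particular product, a data gateway could expose a virtual view like the one below instead of the physical table. The payments table, its columns, and the fixed 'EU' filter are invented; a real gateway would derive the allowed rows from the caller's data role:

-- A virtual view that consumers query instead of the physical table.
-- Column-level control: only the last four digits of the card number are exposed.
-- Row-level control: a fixed region stands in for a filter a real gateway
-- would derive from the caller's role.
CREATE VIEW payments_for_analysts AS
SELECT payment_id,
       account_id,
       amount,
       '****-****-****-' || RIGHT(card_number, 4) AS card_number
FROM   payments
WHERE  region = 'EU';

Consumers would be granted access to the view only, never to the underlying table.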
Scaling
API gateways can do service discovery and load balancing, and assist the scaling of services through an orchestrator such as Kubernetes. But they cannot scale data. Data can scale only through replication and caching. Some data stores can do replication in cloud-native environments, but not all. Purpose-built tools, such as Debezium, can perform change data capture from the transaction logs of data stores and enable data replication for scaling and other use cases.

A data gateway, on the other hand, can speed up access to all kinds of data sources by caching data and providing materialized views. It can understand the queries, optimize them based on the capabilities of the data source, and produce the most performant execution plan. The combination of materialized views and the stream nature of change data capture would be the ultimate data scaling technique, but there are no known cloud-native implementations of this yet.

Federation
In API management, response composition is a common technique for aggregating data from multiple different systems. In the data space, the same technique is referred to as heterogeneous data federation. Heterogeneity is the degree of differentiation in various data sources, such as network protocols, query languages, query capabilities, data models, error handling, transaction semantics, etc. A data gateway can accommodate all of these differences as a seamless, transparent data-federation layer.

Schema-first
API gateways allow contract-first service and client development with specifications such as OpenAPI. Data gateways allow schema-first data consumption based on the SQL standard. A SQL schema for data modeling is the equivalent of an OpenAPI specification for APIs.

Many shades of data gateways
In this article, I use the terms API and data gateways loosely to refer to a set of capabilities. There are many types of API gateways, such as API managers, load balancers, service mesh, service registry, etc. It is similar with data gateways, which range from huge monolithic data virtualization platforms that want to do everything to data federation libraries, and from purpose-built cloud services to end-user query tools.

Let's explore the different types of data gateways and see which fit the definition of "a cloud-native data gateway." When I say a cloud-native data gateway, I mean a containerized first-class Kubernetes citizen. I mean a gateway that is open source, uses open standards, can be deployed on hybrid/multi-cloud infrastructures, works with different data sources and data formats, and is applicable to many use cases.

Classic data virtualization platforms
In the very first category of data gateways are the traditional data virtualization platforms such as Denodo and TIBCO/Composite. While these are the most feature-laden data platforms, they tend to do too much and want to be everything from API management to metadata management, data cataloging, environment management, deployment, configuration management, and whatnot. From an architectural point of view, they are very much like the old ESBs, but for the data layer. You may manage to put them into a container, but it is hard to put them into the cloud-native citizen category.

Databases with data federation capabilities
Another emerging trend is the fact that databases, in addition to storing data, are also starting to act as data federation gateways and allow access to external data. For example, PostgreSQL implements the ANSI SQL/MED specification for a standardized way of handling access to remote objects from SQL databases. That means remote data stores, such as SQL, NoSQL, file, LDAP, web, and big data sources, can all be accessed as if they were tables in the same PostgreSQL database. SQL/MED stands for Management of External Data, and it is also implemented by the MariaDB CONNECT engine, DB2, the Teiid project discussed below, and a few others.
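In PostgreSQL, SQL/MED surfaces as foreign data wrappers. As a rough illustration (the server address, credentials, and table names are placeholders), attaching and querying a table from a remote PostgreSQL instance looks like this:

-- Install the wrapper and describe the remote server.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER orders_db
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'orders.internal', dbname 'orders');
CREATE USER MAPPING FOR CURRENT_USER
  SERVER orders_db
  OPTIONS (user 'reporting', password 'secret');

-- Expose a remote table locally and query it like any other table.
IMPORT FOREIGN SCHEMA public LIMIT TO (orders)
  FROM SERVER orders_db INTO public;

SELECT customer_id, SUM(total) AS lifetime_value
FROM orders
GROUP BY customer_id;

A variety of community wrappers cover non-relational sources as well, which is what lets a single PostgreSQL endpoint act as a modest federation gateway.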
Starting with SQL Server 2019, you can also query external data sources without moving or copying the data. The PolyBase engine enables a SQL Server instance to process Transact-SQL queries that access external data in SQL Server, Oracle, Teradata, and MongoDB.

GraphQL data bridges
Compared to traditional data virtualization, this is a new category of data gateways focused on fast web-based data access. The common thing around Hasura, Prisma, and SpaceUpTech is that they focus on GraphQL data access by offering a lightweight abstraction on top of a few data sources. This is a fast-growing category specialized for enabling rapid web-based development of data-driven applications rather than BI/AI/ML use cases.

Open-source data gateways
Apache Drill is a schema-free SQL query engine for NoSQL databases and file systems. It offers JDBC and ODBC access to business users, analysts, and data scientists on top of data sources that don't support such APIs. Again, having uniform SQL-based access to disparate data sources is the driver. While Drill is highly scalable, it relies on Hadoop- and Apache ZooKeeper-style infrastructure, which shows its age.

Teiid is a project sponsored by Red Hat, and I'm most familiar with it. It is a mature data federation engine purposefully re-written for the Kubernetes ecosystem. It uses the SQL/MED specification for defining the virtual data models and relies on the Kubernetes Operator model for the building, deployment, and management of its runtime on OpenShift.

Once deployed, the runtime can scale as any other stateless cloud-native workload on Kubernetes and integrate with other cloud-native projects. For example, it can use Keycloak for single sign-on and data roles, Infinispan for distributed caching needs, export metrics to and register with Prometheus for monitoring, use Jaeger for tracing, and even integrate with 3scale for API management. But ultimately, Teiid runs as a single Spring Boot application acting as a data proxy and integrating with other best-of-breed services on OpenShift rather than trying to reinvent everything from scratch.
Architectural overview of Teiid data gateway

On the client side, Teiid offers standard SQL over JDBC/ODBC and OData APIs. Business users, analysts, and data scientists can use standard BI/analytics tools such as Tableau, MicroStrategy, Spotfire, etc. to interact with Teiid. Developers can leverage the REST API or JDBC for custom-built microservices and serverless workloads. In either case, for data consumers, Teiid appears as a standard PostgreSQL database accessed over its JDBC or ODBC protocols, but offering additional abstractions and decoupling from the physical data sources.

PrestoDB is another popular open-source project, started by Facebook. It is a distributed SQL query engine targeting big data use cases through its coordinator-worker architecture. The coordinator is responsible for parsing statements, planning queries, managing workers, fetching results from the workers, and returning the final results to the client. The worker is responsible for executing tasks and processing data. Recently the PrestoDB community split and created a fork called PrestoSQL that is now part of The Linux Foundation.

While forking is a common and natural path for many open-source projects, unfortunately, in this case, the similarity in the names and all of the other community-facing artifacts generates some confusion. Regardless of this, both distributions of Presto are among the most popular open-source projects in this space.

Cloud-hosted data gateway services
With the move to cloud infrastructure, the need for data gateways doesn't go away but increases instead. Here are a few cloud-based data gateway services.

AWS Athena is an ANSI SQL-based interactive query service for analyzing data, tightly integrated with Amazon S3. It is based on PrestoDB and supports additional data sources and federation capabilities too. Another similar service by Amazon is AWS Redshift Spectrum. It is focused on the same functionality, i.e. querying S3 objects using SQL. The main difference is that Redshift Spectrum requires a Redshift cluster, whereas Athena is a serverless offering that doesn't require any servers. BigQuery is a similar service from Google.
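To give a feel for the model rather than a tested recipe, an Athena-style workflow declares an external table over files in S3 and then queries it with plain SQL; the bucket, schema, and SerDe are placeholder choices:

-- A table definition that maps JSON files in S3 to columns; no data is loaded or copied.
CREATE EXTERNAL TABLE access_logs (
  user_id string,
  url     string,
  status  int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-bucket/access-logs/';

-- Interactive, serverless querying straight over the objects.
SELECT url, COUNT(*) AS hits
FROM access_logs
WHERE status = 500
GROUP BY url
ORDER BY hits DESC
LIMIT 10;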
These tools require minimal to no setup; they can access on-premise or cloud-hosted data and process huge datasets. But they couple you to a single cloud provider, as they cannot be deployed on multiple clouds or on-premise. They are ideal for interactive querying rather than acting as a hybrid data frontend for other services and tools to use.

Secure tunneling data-proxies
With cloud-hosted data gateways comes the need for accessing on-premise data. Data has gravity and might also be affected by regulatory requirements preventing it from moving to the cloud. It may also be a conscious decision to keep the most valuable asset (your data) free from cloud coupling. All of these cases require cloud access to on-premise data, and cloud providers make it easy to reach your data. Azure's On-premises Data Gateway is such a proxy, allowing access to on-premise data stores from Azure Service Bus.

In the opposite scenario, accessing cloud-hosted data stores from on-premise clients can be challenging too. Google's Cloud SQL Proxy provides secure access to Cloud SQL instances without having to whitelist IP addresses or configure SSL.

The Red Hat-sponsored open-source project Skupper takes a more generic approach to address these challenges. Skupper solves Kubernetes multi-cluster communication challenges through a layer 7 virtual network that offers advanced routing and secure connectivity capabilities. Rather than being embedded into the business service runtime, it runs as a standalone instance per Kubernetes namespace and acts as a shared sidecar capable of secure tunneling for data access or other general service-to-service communication. It is a generic secure-connectivity proxy applicable to many use cases in the hybrid cloud world.
Connection pools for serverless workloads
Serverless takes software decomposition a step further than microservices. Rather than splitting services by bounded context, serverless is based on the function model, where every function is short-lived and performs a single operation.

These granular software constructs are extremely scalable and flexible, but come at a cost that previously wasn't present. It turns out rapid scaling of functions is a challenge for connection-oriented data sources such as relational databases and message brokers. As a result, cloud providers offer transparent data proxies as a service to manage connection pools effectively. Amazon RDS Proxy is such a service that sits between your application and your relational database to efficiently manage connections to the database and improve scalability.

Conclusion
Modern cloud-native architectures combined with the microservices principles enable the creation of highly scalable and independent applications. The large choice of data storage engines, cloud-hosted services, protocols, and data formats gives the ultimate flexibility for delivering software at a fast pace. But all of that comes at a cost that becomes increasingly visible with the need for uniform, real-time data access from emerging user groups with different needs. Keeping microservices data only for the microservice itself creates challenges that have no good technological and architectural answers yet.

Data gateways, combined with cloud-native technologies, offer features similar to API gateways but for the data layer, and can help address these new challenges. Data gateways vary in specialization, but they tend to consolidate on providing uniform SQL-based access, enhanced security with data roles, caching, and abstraction over physical data stores.

Data has gravity, requires granular access control, is hard to scale, and is difficult to move on, off, and between cloud-native infrastructures. Having a data gateway component as part of the cloud-native tooling arsenal, one that is hybrid, works on multiple cloud providers, and supports different use cases, is becoming a necessity.

TL;DR

• Application architectures have evolved to split the frontend from the backend, and further split the backend into independent microservices.

• Modern distributed application architectures created the need for API Gateways and helped popularize API Management and Service Mesh technologies.

• Microservices give the freedom to use the most suitable database type depending on the needs of the service. Such a polyglot persistence layer raises the need for API Gateway-like capabilities but for the data layer.

• Data Gateways act like API Gateways but focus on the data aspect. A Data Gateway offers abstraction, security, scaling, federation, and contract-driven development features.

• There are many types of Data Gateways, from traditional data virtualization technologies to light GraphQL translators, cloud-hosted services, connection pools, and fully open-source alternatives.
Combining DataOps and DevOps: Scale at Speed
by Sam Bocetta, Security Analyst, semi-retired, educates the public about security and privacy technology

Over the past decade, hundreds of organizations have made the shift to adopt the cloud as a way to obtain access to its automated, scalable, and on-demand infrastructure. The shift has changed software development requirement timeframes from weeks to mere minutes.

Around the same time, the cloud's scalability has also encouraged organizations to look at new development models. DevOps and the cloud have, together, broken down the walls between people and technology.

DevOps and continuous delivery processes have become widespread in most of our industries, enabling enterprises to radically increase the integrity, consistency, and output of new software.

Organizations are rushing to adopt the latest and best technological advancements. New strategies are being implemented through data-driven decision making, and the infrastructure needed to integrate new breakthroughs - from artificial intelligence (AI) to machine learning and automation - is easily accessible.

But even in a world where software has become lightweight, scalable, and automated, there's one thing that prevents organizations from truly shining - and that is how readily their development teams can actually access their data. In order to move quickly, development teams need consistent access to high-quality data.

If it takes days to refresh the data in a test environment, teams are caught in a difficult position: move slightly slower, or make concessions on quality to the detriment of your customers, subscribers, or users.

DataOps and DevOps - A Better Understanding of How It Works
DataOps is really an extension of DevOps standards and processes into the data analytics world. The DevOps philosophy underscores consistent and flawless collaboration between developers, quality assurance teams, and IT Ops administrators. DataOps
does the same for administrators and engineers who store, analyze, archive, and deliver data.

To put it another way, DataOps is about streamlining the processes involved in processing, analyzing, and deriving value from big data. This aims to break down the silos in the data storage and analytics fields, which have historically isolated different teams from each other. Improved communication and cooperation between different teams will lead to faster outcomes and better time-to-value. DataOps is a way to automate the data processing and storage workflows in the same way that DevOps does when creating applications.

DevOps
DevOps combines IT/Ops and developers to work closely together in order to deliver software of a higher quality.

DevOps works in a simulated environment, and due to the radical advances of cloud-based developments, we can witness how organizations are now moving DevOps to their cloud environments. With additional continuous integration and automation of testing and delivery, DevOps breaks complicated tasks into much simpler ones.

DataOps
Adopting DevOps will require multiple alterations to your infrastructure. To make the most of DevOps, you'll want to move to a microservice-based workflow that benefits from containers and other progressive technologies - hence the massive rise in Software-as-a-Service (SaaS) offerings, specifically due to the prior rise of DataOps. SaaS appeals to a massive entrepreneurial demographic, since almost anyone with knowledge or skills can help build a SaaS company.

DataOps also calls for administrators and engineers to make use of next-generation data technology to develop their data storage and analytics infrastructure. They need solutions for data processing that are scalable and readily available - think cluster-based, robust storage.

The DataOps architecture also needs to be able to handle a number of workloads in order to achieve the same versatility as the DevOps implementation pipeline. Creating a data management tool set of diverse solutions, from log aggregators such as Splunk and Sumo Logic to big data analytics applications such as Hadoop and Spark, is crucial to achieving this agility.

Embracing the Changes
We need to step away from organizing our teams and technologies around the tools we use to manage data, such as application creation, information management, identity access management, and analytics and data science. Instead, we need to realize that data is a vital commodity, and to bring together all those who use or handle data to take a data-centric view of the enterprise.

When building applications or data-rich systems, development teams learn to look past the data delivery mechanics and instead concentrate on the policies and limitations that control data in their organization. They can then align their infrastructure more closely to enable data to flow across the organization to those who need it.

To make the shift, DataOps needs teams to recognize the challenges of today's technology environment and to think creatively about specific approaches to data challenges in their organization. For example, you might have information about individual users and their functions, data attributes and what needs to be protected for individual audiences, as well as knowledge of the assets needed to deliver the data where it is required.

Getting teams together that have different ideas helps the company to evolve faster. Instead of waiting minutes, hours, or even weeks for data, environments need to be created in minutes and at the pace required to allow the rapid creation and delivery of applications and solutions. At the same time, companies do not have to choose between access and security;
they can function assured that their data is adequately protected for all environments and users, without lengthy manual checks and authorisations.

When done correctly, DataOps provides a cultural transformation that promotes communication between all data stakeholders. Data management will now be the collective responsibility of personnel, database managers, and developers, as well as security and compliance officers. And although Chief Data Officers track data governance and efficiency, they seldom take any interest in non-production needs.

Innovation fails when no one takes charge of cross-functional data management. Nevertheless, companies can ensure that confidential data is secure through powerful collaborative data systems, and that the right data is made accessible to the right people, whenever and wherever they need it - right through from the engineers who supply the data to the data scientists who analyze it, to the developers who check it.

The next ten years are set to reshape the face of computing as IoT devices, machine learning, augmented or virtual reality, voice computing, and more become common across all industries. With this change, more data, more privacy and security concerns, and much more regulation will come into play.

This will put extraordinary pressure on organizations, and whoever comes up with solutions first will reap the benefits. With DataOps, IT can overcome the expense, sophistication, and risk that come with the management of data and become a business enabler, while users can get the data they need to unlock their innovative capacity. If DevOps and cloud have been the key enablers of today's digital economy, DataOps is set to be the generator of our future data economy.

Uber and Netflix Show Us the Way Forward
While the way in which DataOps is implemented will be different in every organization, it can be instructive to look at the way in which the concept has been applied in real-world contexts. Two of the most pioneering companies in this regard have been Uber and Netflix, both of whom have been very open about the way in which they use DataOps within their business models.

Uber, for instance, uses a machine learning (ML) platform known as Michelangelo to process the huge amounts of data that the firm collects, and to share this across the organization. Michelangelo helps manage DataOps in a way similar to DevOps by encouraging iterative development of models and democratizing access to data and tools across teams.

This system also makes use of a number of bespoke tools - one called Horovod, which coordinates parallel data processing across hundreds of GPUs, and Manifold, a visualization tool that is used to assess the efficacy of ML models.

Netflix is also a company that processes huge amounts of data every day, and one in which that data needs to be accessible to thousands of individual clients. The core of the Netflix user experience is their recommendation engine, which currently runs in Spark. However, the company is continually testing new models in order to improve data availability and the accuracy of the recommendations that their ML algorithms provide.

Unlike many companies, however, Netflix runs these tests offline, and only deploys new models in consumer-facing systems once they have been proven to be effective. In other words, they are careful to ensure the balance between stability and flexibility that characterises effective DataOps approaches.

Why DataOps and DevOps Are a Match Made in Heaven
Today's millennial consumers are more aware of their brand engagement, and not only want great products but also want great customized experiences when using these products. Many forward-thinking companies are therefore in the midst of
the transition from a product economy to a service economy.

For example:

• Android and iPhone integrate customer support in their product bundle

• When buying a new vehicle, BMW includes daily car maintenance in the buying price

• Our smartphone technology now includes food delivery, maps, GPS, and even online banking as a service with their product delivery

This shift from product to service as a priority is also reflected in the delivery of software, enabling companies to provide innovation, speed, reliability, frequency, and operation on the customer's behalf.

With cloud automation, companies are now able to shift their focus and assimilate user experience seamlessly from machine-based functions to IaaS (infrastructure-as-a-service), PaaS (platform-as-a-service), and SaaS. DevOps helps this by removing the discrepancy between development and support.

We Can Shift from Stability to Agility
With an increase in production speed, the industry has been challenged to adjust its go-to-market strategy, but mostly to shift its focus from stability and efficiency to innovation and flexibility. Faster technology innovations result in shorter production stages, creative designs, and higher delivery rates.

The emergence of social media marketing and future technologies is shifting control away from production and keeping customers or users at the core. Branding and marketing mechanisms now react to consumer preferences rather than trying to unlock them. From SMEs to start-ups, companies need to encourage and support creative responsiveness and focus on waste reduction.

It's time for IT organizations to enable software as a service with the aid of DevOps methodologies and cloud automation. DevOps combined with cloud helps to assess the quality of the customer's experience. This cross-department and cross-functional cooperation strengthens an organization's operations and helps it achieve an advantage in its market.

One thing digital transformation has taught us is that software and hardware have to work in unison. Each corporation must adapt to the combination of digital applications with material systems or components.

While DevOps offers advancements in software development and ongoing efficiency to its users, the cloud offers simplicity of use and quality in the product by optimizing operational performance. As a result, DevOps in conjunction with cloud fulfills user expectations with the help of sophisticated execution.

TL;DR

• DataOps is all about streamlining the processes that are involved in processing, analyzing, and deriving value from big data.

• Development teams need to learn how to look past the data delivery mechanics and instead concentrate on the policies and limitations that control data in their organization.

• Uber and Netflix have both been very open about the way in which they use DataOps within their business models.

• Many forward-thinking companies have found themselves in the midst of the transition from a product-based economy to a service-based economy.
Read recent issues

Service Mesh: Ultimate Guide
This eMag aims to answer pertinent questions for software architects and technical leaders, such as: what is a service mesh?, do I need a service mesh?, and how do I evaluate the different service mesh offerings?

2020: Year in Review
2020 is probably the most extended year we will see in our whole life. A vast number of people have spent the most significant part of the year at home. Remote work went from "something to be explored" to teams' reality around the world. In this eMag, we've packed in some of the most relevant InfoQ content of 2020. And there was no way to avoid content on remote activities.

Microservices: Testing, Observing, and Understanding
This eMag takes a deep dive into the techniques and culture changes required to successfully test, observe, and understand microservices.