Modern Data Engineering

IN THIS ISSUE
Beyond the Database, and Beyond the Stream Processor: What’s the Next Step for Data Management?
Combining DataOps and DevOps: Scale at Speed

PRODUCTION EDITOR Ana Ciobotaru / COPY EDITOR Maureen Spencer / DESIGN Dragos Balasoiu, Ana Ciobotaru
GENERAL FEEDBACK feedback@infoq.com / ADVERTISING sales@infoq.com / EDITORIAL editors@infoq.com
The InfoQ eMag / Issue #92/ February 2021

“The future of data engineering” is a fancy title for presenting stages of data pipeline maturity and building out a sample architecture as I progress, until I land on a modern data architecture and data pipeline. I will also hint at where things are headed for the next couple of years.

It’s important to know me and my perspective when I’m predicting the future, so that you can couch mine with your own perspectives and act accordingly.

I work at WePay, which is a payment-processing company. JPMorgan Chase acquired the company a few years ago. I work on data infrastructure and data engineering. Our stack is Airflow, Kafka, and BigQuery, for the most part. Airflow is, of course, a job scheduler that kicks off jobs and does workflow things. BigQuery is a data warehouse hosted by Google Cloud. I make some references to Google Cloud services here, and you can definitely swap them with the corresponding AWS or Azure services.

We, at WePay, use Kafka a lot. I spent about seven years at LinkedIn, the birthplace of Kafka, which is a pub/sub, write-ahead log. Kafka has become the backbone of a log-based architecture. At LinkedIn, I spent a bunch of time doing everything from data science to service infrastructure, and so on. I also wrote Apache Samza, which is a stream-processing system, and helped build out their Hadoop ecosystem. Before that, I spent time as a data scientist at PayPal.

There are many definitions for “data engineering”. I’ve seen people use it when talking about business analytics and in the context of data science. I’m going to throw down my definition: a data engineer’s job is to help an organization move and process data. To move data means streaming pipelines or data pipelines; to process data means data warehouses and stream processing. Usually, we’re focused on asynchronous, batch …
and that’s really what I want to cover here.

I refined this idea with the tweet in Figure 1. The idea was that we initially land with nothing, so we need to set up a data warehouse quickly. Then we expand as we do more integrations, and maybe we go to real-time because we’ve got Kafka in our ecosystem.

Finally, we move to automation for on-demand stuff. That eventually led to my “The Future of Data Engineering” post in which I discussed four future trends.

The first trend is timeliness, going from this batch-based periodic architecture to a more real-time architecture. The second is connectivity; once we go down the timeliness route, we start doing more integration with other systems. The last two tie together: automation and decentralization. On the automation front, I think we need to start thinking about how we operate not just our operations but our data management. And then decentralizing the data warehouse.

I designed a hierarchy of data-pipeline progression. Organizations go through this evolution in sequence. The reason I created this path is that everyone’s future is different because everyone is at a different point in their life cycle. The future at Ada looks very different than the future at WePay because WePay may be farther along on some dimensions - and then there are companies that are even farther along than WePay. These stages let you find your current starting point and build your own roadmap from there.

Stage 0: None
You’re probably at this stage if you have no data warehouse. You probably have a monolithic architecture. You’re maybe a smaller company and you need a warehouse up and running now. You probably don’t have too many data engineers and so you’re doing this on the side.
Stage 0 looks like Figure 3, with a … you’re reading directly from the …

Stage 1: Batch
This is where the classic batch-based approach comes in. Between the database and the user, you stuff a data warehouse that can accomplish a lot more OLAP and fulfill analytic needs. To get data from the database into that data warehouse, you have a scheduler that periodically wakes up to suck in the data.

That’s where WePay was at about a year after I joined. This architecture is fantastic in terms of tradeoffs. You can get the pipeline up pretty quickly these days - when I did it in 2016, it took a couple of weeks. Our data latency was about 15 minutes, so we did incremental partition loads, taking little chunks of data and loading them in. We were running a few hundred tables. This is a nice place to start if you’re trying to get something up and running but, of course, you outgrow it.

The number of Airflow workflows that we had went from a few hundred to a few thousand. We started running tens or hundreds of thousands of tasks per day, and that became an operational issue because of the probability that some of those are not going to work. We also discovered - and this is not intuitive for people who haven’t run complex data pipelines - that the incremental or batch-based approach requires imposing dependencies or requirements on the schemas of the data that you’re loading. We had issues with create_time and modify_time and ORMs doing things in different ways and it got a little complicated for us.

DBAs were impacting our workload; they could do something that hurt the replica that we’re reading off of and cause latency issues, which in turn could cause us to miss data.

Hard deletes weren’t propagating - and this is a big problem if you have people who delete data from your database. Removing a row or a table or whatever can cause problems with batch loads because you just don’t know when the data disappears. Also, MySQL replication latency was affecting our data quality and periodic loads would cause occasional MySQL timeouts on our workflow.

Stage 2: Realtime
This is where real-time data processing kicks off. This approaches the cusp of the modern era of real-time data architecture and it deserves a closer look than the first two stages.

You might be ready for Stage 2 if your load times are taking too long. You’ve got pipelines that are no longer stable, whether because workflows are failing or your RDBMS is having trouble.
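The Stage 1 pattern above - a scheduler that periodically picks up changed rows by watermark - can be sketched as a toy in a few lines of Python. This is illustrative only, not WePay’s actual pipeline: `source_rows` stands in for a replica table, `warehouse` for BigQuery, and `modify_time` is the watermark column mentioned above.

```python
# Toy incremental loader: each run picks up only rows modified
# since the last watermark, mimicking a periodic Airflow task.
source_rows = [
    {"id": 1, "modify_time": 10, "amount": 5},
    {"id": 2, "modify_time": 20, "amount": 7},
    {"id": 3, "modify_time": 30, "amount": 9},
]

warehouse = {}   # id -> row, standing in for the warehouse table
watermark = 0    # highest modify_time loaded so far

def run_incremental_load():
    """Load only rows changed since the last run, then advance the watermark."""
    global watermark
    batch = [r for r in source_rows if r["modify_time"] > watermark]
    for row in batch:
        warehouse[row["id"]] = row   # upsert into the warehouse
    if batch:
        watermark = max(r["modify_time"] for r in batch)
    return len(batch)

first = run_incremental_load()   # initial run loads all three rows
source_rows.append({"id": 2, "modify_time": 40, "amount": 8})  # row 2 updated
second = run_incremental_load()  # next run loads only the changed row
```

Note that this sketch exhibits exactly the limitation called out above: a hard delete never appears in the changed-rows query, so a deleted row silently lingers in the warehouse.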
… also provide the before and the after states of that row. As you can imagine, this can be useful if you’re building out a data warehouse.

Debezium can use a bunch of sources. We use MySQL, as I mentioned. One of the things in that Ada post that caught my eye was BigQuery’s real-time streaming insert API. One of the cool things about BigQuery is that you can use its RESTful API to post data into the data warehouse in real time and it’s visible almost immediately. That gives us a latency from our production database to our data warehouse of a couple of seconds.

… should look like so that you can be satisfied that the data warehouse and the underlying web service itself are healthy. Figure 8 shows some of the inevitable problems we encountered in this migration. Not all of our connectors were on this pipeline, so we found ourselves between the new cool stuff and the older painful stuff.
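A change event in the style Debezium emits carries both the before and the after state of a row, which makes hard deletes visible: a delete arrives as an event whose after state is null. A minimal sketch of applying such events to a warehouse table (the field names here are simplified, not Debezium’s exact envelope):

```python
# Apply Debezium-style change events (simplified field names) to a
# dict standing in for a warehouse table. A delete has after=None,
# so - unlike periodic batch loads - row removal is observable.
warehouse = {}

def apply_change_event(event):
    key = event["key"]
    if event["after"] is None:   # hard delete
        warehouse.pop(key, None)
    else:                        # insert or update
        warehouse[key] = event["after"]

events = [
    {"key": 1, "before": None, "after": {"id": 1, "state": "new"}},      # insert
    {"key": 1, "before": {"id": 1, "state": "new"},
     "after": {"id": 1, "state": "captured"}},                           # update
    {"key": 2, "before": None, "after": {"id": 2, "state": "new"}},      # insert
    {"key": 2, "before": {"id": 2, "state": "new"}, "after": None},      # hard delete
]

for e in events:
    apply_change_event(e)
```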
… skyrockets and it never fully recovered, although there’s a nice trend late in the year that relates to the next step of our evolution.

Stage 4: Automation
We started investing in automation. This is something you’ve got to do when your system gets this big. I think most people would say we should have been automating all along.

You might be ready for Stage 4 if your SREs can’t keep up, you’re spending a lot of time on manual toil, and you don’t have time for the fun stuff.

Figure 14 shows the two new layers that appear in Stage 4. The first is the automation of operations, and this won’t surprise most people. It’s the DevOps stuff that has been going on for a long time. The second layer, data-management automation, is not quite as obvious.

Figure 14: Stage 4 adds two new layers to the data ecosystem

Let’s first cover automation for operations. Google’s Site Reliability Engineering handbook defines toil as manual, repeatable, automatable stuff. It’s usually interrupt-driven: you’re getting Slack messages or tickets or people are showing up at your desk asking you to do things. That is not what you want to be doing.

The Google book says, “If a human operator needs to touch your system during normal operations, you have a bug.”

But the “normal operations” of data engineering were what we were spending our time on. Anytime you’re managing a pipeline, you’re going to be adding new topics, adding new data sets, setting up views, and granting access. This stuff needs to get automated. Great news! There’s a bunch of solutions for this: Terraform, Ansible, and so on. We at WePay use Terraform and Ansible but you can substitute any similar product.
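The reconcile-to-desired-state loop at the heart of Terraform-style tools can be sketched in a few lines. This is a toy illustration of the idea, not Terraform itself, and the topic names are invented:

```python
# Reconcile actual Kafka topics (a set) toward a declared desired state,
# returning the plan - the same create/delete diff a Terraform-style
# tool would show before applying.
def plan(desired, actual):
    return {"create": sorted(desired - actual),
            "delete": sorted(actual - desired)}

def apply(plan_, actual):
    actual |= set(plan_["create"])
    actual -= set(plan_["delete"])
    return actual

desired = {"payments", "ledger", "refunds"}      # declared configuration
actual = {"payments", "deprecated-events"}       # what the cluster has now

p = plan(desired, actual)
actual = apply(p, actual)
```

The point of the pattern is that adding a topic or a grant becomes a pull request against the declared state rather than a ticket for a human operator.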
You can use it to manage your topics. Figures 15 and 16 show some Terraform automations. Not terribly surprising. Yes, we should have been doing this, but we kind of were doing this already. We had Terraform, we had Ansible for a long time. We have a fairly robust compliance …

… truncation policy?”, and “Is this data even allowed in the system?” As a payment processor, WePay deals with sensitive information and our people need to follow geography and security policies and other stuff like that.

The lineage for my initial example is that it came from MySQL, it went to Kafka, and then it got loaded into BigQuery - that whole pipeline. Lineage can even track encryption or versioning, so you know what things are encrypted and what things are versioned as …
Figure 18: Your data ecosystem needs to talk to your data catalog

Figure 20: Detecting sensitive data
… for unused access. You want to know when users aren’t using all the permissions that they’re granted so that you can strip those unused permissions to limit the vulnerability of the space.

Now that your data catalog tells you where all the data is and you have policies set up, you need to detect violations. I mostly want to discuss data loss prevention (DLP) but there’s also auditing, which is keeping track of logs and making sure that the activities and systems are conforming to the required policies.

I’m going to talk about Google Cloud Platform because I use it and I have some experience with its data-loss solution. There’s a corresponding AWS product called Macie. There’s also an open-source project called Apache Ranger, with a bit of an enforcement and monitoring mechanism built into it; that’s more focused on the Hadoop ecosystem. What all these things have in common is that you can use them to detect the presence of sensitive data where it shouldn’t be.

Figure 20 is an example. A piece of submitted text contains a phone number, and the system sends a result that says it is “very likely” that it has detected an infoType of phone number. You can use this stuff to monitor your policies. For example, you can run DLP checks on a data set that is supposed to be clean - i.e., not have any sensitive information in it - and if a check finds anything like a phone number, Social Security number, credit card, or other sensitive information, it can immediately alert you that there’s a violation in place.

There’s a little bit of progress here. Users can use the data catalog and find the data that they need, we have some automation in place, and maybe we’re using Terraform to manage ACLs for Kafka or to manage RBAC in Airflow.

But there’s still a problem, and that is that data engineering is probably still responsible for managing that configuration and those deployments. The reason for that is mostly the interface. We’re still getting pull requests, Terraform, DSL, YAML, JSON, Kubernetes ... it’s nitty-gritty.
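The DLP check described above can be caricatured with a couple of regexes. This is a crude stand-in for Cloud DLP, and the patterns are illustrative US-style formats rather than what GCP actually uses:

```python
import re

# Crude stand-in for a DLP scan: flag likely phone numbers and SSNs
# in submitted text, returning the detected "infoTypes".
INFO_TYPES = {
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan(text):
    """Return the sorted list of infoTypes detected in a piece of text."""
    findings = []
    for info_type, pattern in INFO_TYPES.items():
        if pattern.search(text):
            findings.append(info_type)
    return sorted(findings)

clean = scan("order 42 shipped to warehouse 7")   # supposed-to-be-clean data
dirty = scan("call me at 415-555-0199")           # contains a phone number
```

Run against a data set that is supposed to be clean, any non-empty result is the violation alert the article describes.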
It might be a tall order to ask security teams to make changes to that kind of configuration. Asking your compliance wing to make changes is an even taller order. Going beyond your compliance people is basically impossible.

Stage 5: Decentralization
You’re probably ready to decentralize your data pipeline and your data warehouses if you have a fully automated real-time data pipeline but people are still coming to ask you to load data.

I frame this line of thought based on our migration from monolith to microservices over the past decade or two. Part of the motivation for that was to break up large, complex things, to increase agility, to increase efficiency, and to let people move at their own pace. A lot of those characteristics sound like your data warehouse: it’s monolithic, it’s not that agile, you have to ask your data engineering team to do things, and maybe you’re not able to do things at your own pace.

… view. She even discusses policy automation and a lot of the same stuff that I’m thinking about.

I think this shift towards decentralization will take place in two phases. Say you have a set of raw tools - Git, YAML, JSON, etc. - and a beaten-down engineering team that is getting requests left and right and running scripts all the time. To escape that, the first step is simply to expose that raw set of tools to your other engineers.
We need polished UIs, something …
Beyond the Database, and Beyond the Stream Processor: What’s the Next Step for Data Management?

At QCon London in March 2020, I gave a talk on why both stream processors and databases remain necessary from a technical standpoint and explored industry trends that make consolidation likely in the future. These trends map onto common approaches from active databases like MongoDB to streaming solutions like Flink, Kafka Streams, or ksqlDB.

I work at Confluent, the company founded by the creators of Apache Kafka. These days, I work in the Office of the CTO. One of the things we did last year was to look closely at the differences between stream processors and databases. This led to a new product called ksqlDB. With the rise in popularity of event-streaming systems and their obvious relationship to databases, it’s useful to compare how the different models handle data that’s moving versus data that is stationary. Maybe more importantly, there is clear consolidation happening between these fields. Databases are becoming increasingly active, emitting events as data is written, and stream processors are increasingly passive, providing historical queries over datasets they’ve accumulated.

You might think this kind of consolidation at a technical level is an intellectual curiosity, but if you step back a little it really points to a more fundamental shift in the way that we build software. Marc Andreessen, now a venture capitalist in Silicon Valley, has an excellent way of putting this: “software is eating the world”. Investing in software companies makes sense simply because both individuals and companies consume more software over time. We buy more applications
for our phones, we buy more …
Traditional databases have this notion of a pull-query: ask a question and get an answer returned back. What’s interesting to me is the hybrid world that sits between them, combining both approaches. We can send a select statement and get a response. At the same time, we can also listen to changes on that table as they happen. We can have both interaction models.

… for describing all of this, which we can create using something familiar like SQL, albeit with some extensions — a single model that rules them all.

A query can run from the start of time to now, from now until the end of time, or from the start of time to the end of time. “Earliest to now” is just what a regular database does. … data arrives and creating new output records. “Earliest to forever” is a less-well-explored combination but is a very useful one. Say we’re building a dashboard application. With “earliest to forever”, we can run a query that loads the right data into the dashboard, and then continues to keep it up to date. There is one last subtlety: are historical queries event-based (all moves in the chess game) or snapshots (the positions of the pieces now)? This is the other option a unified model must include.
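The hybrid interaction model described above - send a select and get an answer, but also listen to changes on the table as they happen - can be simulated in miniature. This is purely illustrative; ksqlDB exposes these as SQL pull and push queries, not through a Python API like this:

```python
# A toy "table" offering both interaction models: pull (ask once,
# get a point-in-time answer) and push (subscribe, get every change).
class HybridTable:
    def __init__(self):
        self.rows = {}
        self.subscribers = []

    def pull(self, key):
        """Pull query: a point-in-time answer, like a SELECT."""
        return self.rows.get(key)

    def subscribe(self, callback):
        """Push query: results are pushed to the caller as data changes."""
        self.subscribers.append(callback)

    def upsert(self, key, value):
        self.rows[key] = value
        for cb in self.subscribers:   # active data: the write is the trigger
            cb(key, value)

table = HybridTable()
pushed = []
table.subscribe(lambda k, v: pushed.append((k, v)))

table.upsert("game1", "e4")   # each move is pushed to subscribers
table.upsert("game1", "e5")
answer = table.pull("game1")  # pull sees only the current state
```

Notice the chess analogy from the text: the push subscriber saw every move, while the pull query saw only the current position.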
Figure 9: Push and pull queries in the unified interaction model

… correlations, but they do let us create streams from tables and compared to tools like ksqlDB they are better at performing pull queries, as they’re approaching the problem from that direction.

So, whether you come at this from the stream-processing side or the database side there is a clear drive towards a centre ground. I think we’re going to see a lot more of it.

The database: is it time for a rethink?
If you think back to where we started: Andreessen’s observation of a world being eaten by software, this suggested a world where software talks to software. Where user interfaces, the software that helps you and me, is a smaller part of the whole package. We see this today in all manner of businesses across ride sharing, finance, automotive — it’s coming up everywhere.

This means two things for data. Firstly, we need data tooling that can handle both the asynchronous and the synchronous. Secondly, we also need different interaction models: models which push results to us and chain stages of autonomous data processing together.

For the database this means the ability to query passive data sets and get answers to questions users have. But it also means active interactions that push data to different subscribing services. On one side of this evolution are active databases: MongoDB, Couchbase, RethinkDB, etc. On the other are stream processors: ksqlDB, Flink, Hazelcast Jet.

Whichever path prevails, one thing is certain: we need to rethink what a database is, what it means to us, and how we interact with both the data it contains and the event streams that connect modern businesses together.

TL;DR
• As companies become more automated, and their business processes become more automated, we end up with many applications talking to one another. This is a humongous shift in system design as it’s about doing the work in a fully automated fashion by machines.
• In traditional databases, data is passive and queries are active: the data passively waits for something to run a query. In a stream processor, data is active and the query is passive: the trigger is the data itself. The interaction model is fundamentally different.
• In modern applications, we need the ability to query passive data sets and get answers for user actions, but we also need active interaction through data that is pushed as an event stream to different subscribing services.
• We need to rethink what a database is, what it means to us, and how we interact with both the data it contains and the event streams that connect it all together.
These days, there is a lot of excitement around 12-factor apps, microservices, and service mesh, but not so much around cloud-native data. The number of conference talks, blog posts, best practices, and purpose-built tools around cloud-native data access is relatively low. One of the main reasons for this is that most data access technologies are architected and created in a stack that favors static environments rather than the dynamic nature of cloud environments and Kubernetes.

In this article, we will explore the different categories of data gateways, from more monolithic ones to ones designed for the cloud and Kubernetes. We will see what technical challenges the Microservices architecture introduces and how data gateways can complement API gateways to address these challenges in the Kubernetes era.

Application architecture evolutions
Let’s start with what has been changing in the way we manage code and data in the past decade or so. I still remember the time when I started my IT career by creating frontends with Servlets, JSP, and JSFs. In the backend, EJBs, SOAP, and server-side session management were the state-of-the-art technologies and techniques. But things changed rather quickly with the introduction of REST and the popularization of Javascript.

REST helped us decouple frontends from backends through a uniform interface and resource-oriented requests. It popularized stateless services and enabled response caching by moving all client session state to clients, and so forth. This new architecture was the answer to the huge scalability demands of modern businesses.

A similar change happened with the backend services through the Microservices movement. Decoupling from the frontend was not enough, and the monolithic backend had to be
decoupled into bounded contexts enabling independent fast-paced releases. These are examples of how architectures, tools, and techniques evolved, pressured by the business needs for fast software delivery of planet-scale applications.

Application architecture evolution brings new challenges

That takes us to the data layer. One of the existential motivations for microservices is having independent data sources per service. If you have microservices touching the same data, that sooner or later introduces coupling and limits independent scalability or releasing. It is not only an independent database but also a heterogeneous one, so every microservice is free to use the database type that fits its needs.

While decoupling frontend from backend and splitting monoliths into microservices gave the desired flexibility, it created challenges not present before. Service discovery and load balancing, network-level resilience, and observability turned into major areas of technology innovation addressed in the years that followed.

Similarly, creating a database per microservice, with the freedom and technology choice of different datastores, is a challenge. That shows itself more and more recently with the explosion of data and the demand for accessing data not only by the services but also by other real-time reporting and AI/ML needs.

The rise of API gateways
With the increasing adoption of Microservices, it became apparent that operating such an architecture is hard. While having every microservice independent sounds great, it requires tools and practices that we didn’t need and didn’t have before.

This gave rise to more advanced release strategies such as blue/green deployments, canary releases, and dark launches. Then that gave rise to fault injection and automatic recovery testing. And finally, that gave rise to advanced network telemetry and tracing. All of these created a whole new layer that sits between the frontend and the backend. This layer is occupied primarily with API management gateways, service discovery, and service mesh technologies, but also with tracing components, application load balancers, and all kinds of traffic management and monitoring proxies. This even includes projects such as Knative with activation and scaling-to-zero features driven by the networking activity.

With time, it became apparent that creating microservices at a fast pace and operating microservices at scale requires tooling we didn’t need before. Something that was fully handled by a single load balancer had to be replaced with a new advanced management layer. A new technology layer, a new set of practices and techniques, and
a new group of users responsible were born.

The case for data gateways
Microservices influence the data layer in two dimensions. First, it demands an independent database per microservice. From a practical implementation point of view, this can range from an independent database instance to independent schemas and logical groupings of tables. The main rule here is: only one microservice owns and touches a dataset. And all data is accessed through the APIs or events of the owning microservice.

The second way a microservices architecture influences the data layer is through datastore proliferation. Similarly to enabling microservices to be written in different languages, this architecture allows the freedom for every microservices-based system to have a polyglot persistence layer. With this freedom, one microservice can use a relational database, another one can use a document database, and a third microservice uses an in-memory key-value store.

While microservices allow you all that freedom, again it comes at a cost. It turns out operating a large number of datastores comes at a cost that existing tooling and practices were not prepared for. In the modern digital world, storing data in a reliable form is not enough.

Data is useful when it turns into insights, and for that it has to be accessible in a controlled form by many. AI/ML experts, data scientists, and business analysts all want to dig into the data, but the application-focused microservices and their data access patterns are not designed for these data-hungry demands.

This is where data gateways can help you. A data gateway is like an API gateway, but it understands and acts on the physical data layer rather than the networking layer. Here are a few areas where data gateways differ from API gateways.

Abstraction
An API gateway can hide implementation endpoints and help upgrade and rollback services without affecting service consumers. Similarly, a data gateway can help abstract a physical data source and its specifics, and help alter, migrate, or decommission it without affecting data consumers.

Security
An API manager secures resource endpoints based on HTTP methods. A service mesh secures based on network connections. But none of them can understand and secure the data and its shape that is passing through them. A data gateway, on the other hand, understands the different data sources and the data model and acts on them. It can apply RBAC per data row and column, filter, obfuscate, and sanitize the individual data elements whenever necessary. This is a
more fine-grained security model.
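The row- and column-level enforcement just described can be sketched as a filter applied between the consumer and the datastore. This is a toy model of the idea; a real data gateway drives this from policy definitions, not hard-coded rules like these:

```python
# Toy data-gateway security layer: per-role row filtering plus
# column obfuscation applied before rows reach the consumer.
ROWS = [
    {"id": 1, "region": "EU", "card": "4111111111111111", "amount": 10},
    {"id": 2, "region": "US", "card": "5500005555555559", "amount": 20},
]

POLICIES = {
    "analyst": {"regions": {"US"}, "masked": {"card"}},
    "admin":   {"regions": {"EU", "US"}, "masked": set()},
}

def query(role):
    """Return only the rows and column values the role is allowed to see."""
    policy = POLICIES[role]
    out = []
    for row in ROWS:
        if row["region"] not in policy["regions"]:   # row-level RBAC
            continue
        safe = dict(row)
        for col in policy["masked"]:                 # column obfuscation
            safe[col] = "****" + safe[col][-4:]
        out.append(safe)
    return out

analyst_view = query("analyst")   # one row, card number masked
admin_view = query("admin")       # all rows, unmasked
```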
… PostgreSQL database. SQL/MED stands for Management of External Data, and it is also implemented by the MariaDB CONNECT engine, DB2, the Teiid project discussed below, and a few others.

Starting in SQL Server 2019, you can now query external data sources without moving or copying the data. The PolyBase engine of SQL Server processes Transact-SQL queries to access external data in SQL Server, Oracle, Teradata, and MongoDB.

GraphQL data bridges
Compared to traditional data virtualization, this is a new category of data gateways focused on fast web-based data access. The common thing around Hasura, Prisma, and SpaceUpTech is that they focus on GraphQL data access by offering a lightweight abstraction on top of a few data sources. This is a fast-growing category specialized for enabling rapid web-based development of data-driven applications rather than BI/AI/ML use cases.

Open-source data gateways
Apache Drill is a schema-free SQL query engine for NoSQL databases and file systems. It offers JDBC and ODBC access to business users, analysts, and data scientists on top of data sources that don’t support such APIs. Again, having uniform SQL-based access to disparate data sources is the driver. While Drill is highly scalable, it relies on Hadoop or Apache Zookeeper’s kind of infrastructure, which shows its age.

Teiid is a project sponsored by Red Hat and I’m most familiar with it. It is a mature data federation engine purposefully re-written for the Kubernetes ecosystem. It uses the SQL/MED specification for defining the virtual data models and relies on the Kubernetes Operator model for the building, deployment, and management of its runtime on Openshift.

Once deployed, the runtime can scale as any other stateless cloud-native workload on Kubernetes and integrate with other cloud-native projects. For example, it can use Keycloak for single sign-on and data roles, Infinispan for distributed caching needs, export metrics and register with Prometheus for monitoring, Jaeger for tracing, and even 3scale for API management. But ultimately, Teiid runs as a single Spring Boot application acting as a data proxy and integrating with other best-of-breed services on Openshift rather than trying to reinvent everything from scratch.
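What federation engines like Drill or Teiid do at scale - one uniform query surface over heterogeneous stores - can be miniaturized as a façade over two in-memory “sources”. This is purely illustrative of the uniform-access idea; neither project works this way internally:

```python
# Tiny federation facade: one query interface over two differently
# shaped backends, mimicking uniform SQL-style access to disparate
# data sources.
relational = [("alice", 30), ("bob", 25)]        # rows as tuples
documents = [{"name": "carol", "age": 41}]       # JSON-ish documents

def scan_relational():
    # Normalize tuple rows into a common record shape.
    for name, age in relational:
        yield {"name": name, "age": age}

def scan_documents():
    # Normalize documents into the same record shape.
    for doc in documents:
        yield {"name": doc["name"], "age": doc["age"]}

SOURCES = {"users_sql": scan_relational, "users_docs": scan_documents}

def select(source, predicate):
    """Uniform access: the same call works regardless of backend shape."""
    return [row for row in SOURCES[source]() if predicate(row)]

adults_sql = select("users_sql", lambda r: r["age"] >= 30)
adults_docs = select("users_docs", lambda r: r["age"] >= 30)
```

The consumer never learns whether the rows came from tuples or documents; that abstraction is the core value proposition of a data gateway.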
Architectural overview of Teiid

Secure tunneling data-proxies
… generic secure-connectivity proxy gives the ultimate flexibility for …
Combining DataOps and DevOps: Scale at Speed
by Sam Bocetta, Security Analyst, semi-retired, educates the public about security and privacy technology

Over the past decade, hundreds of organizations have made the shift to adopt the cloud as a way to obtain access to its automated, scalable, and on-demand infrastructure. The shift has changed software development requirement timeframes from weeks to mere minutes.

Around the same time, the cloud’s scalability has also encouraged organizations to look at new development models. DevOps and the cloud have, together, broken down the walls between people and technology. DevOps and continuous delivery processes have become widespread in most of our industries, enabling enterprises to radically increase the integrity, constancy, and output of new software.

Organizations are rushing to advance to the latest and best technological advancements. New strategies are being implemented through data-driven decision making, and the infrastructure needed to integrate new breakthroughs - from artificial intelligence (AI) to machine learning and automation - is easily accessible.

But even in a world where software has become lightweight, scalable, and automated, there’s one thing that prevents organizations from truly shining - and that is how readily their development teams can actually access their data. In order to move quickly, development teams need consistent access to high-quality data. If it takes days to refresh the data in a test environment, teams are caught in a difficult position: move slightly slower, or make concessions on quality to the detriment of your customers, subscribers, or users.

DataOps and DevOps - A Better Understanding of How It Works
DataOps is really an extension of DevOps standards and processes into the data analytics world. The DevOps philosophy underscores consistent and seamless collaboration between developers, quality assurance teams, and IT Ops administrators. DataOps
does the same for the administrators and engineers who store, analyze, archive, and deliver data.

The InfoQ eMag / Issue #92 / February 2021

To put it another way, DataOps is about streamlining the processes involved in processing, analyzing, and deriving value from big data. It aims to break down the silos in the data storage and analytics fields, which have historically isolated teams from one another. Improved communication and cooperation between teams leads to faster outcomes and better time-to-value. DataOps is a way to automate the data processing and storage workflows in the same way that DevOps does when creating applications.
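The CI analogy can be made concrete. Below is a minimal, hypothetical sketch in plain Python (no real pipeline framework; the function, field names, and rules are invented for illustration) of the kind of automated quality gate a DataOps workflow might run before promoting a batch of data, much as a DevOps pipeline runs tests before promoting a build.

```python
# Hypothetical DataOps-style quality gate: validate a batch of records
# before it is promoted downstream, the way CI validates a build.

def validate_batch(records, required_fields=("user_id", "amount")):
    """Return a list of human-readable failures; an empty list means 'promote'."""
    failures = []
    for i, rec in enumerate(records):
        for field in required_fields:
            if rec.get(field) is None:
                failures.append(f"row {i}: missing {field}")
        amount = rec.get("amount")
        if isinstance(amount, (int, float)) and amount < 0:
            failures.append(f"row {i}: negative amount {amount}")
    return failures

batch = [
    {"user_id": "u1", "amount": 12.5},
    {"user_id": None, "amount": 3.0},   # bad row: missing user_id
    {"user_id": "u3", "amount": -7.0},  # bad row: negative amount
]

# A non-empty result fails the gate, so the batch is not promoted.
for problem in validate_batch(batch):
    print(problem)
```

In a real setup the same check would run automatically on every refresh, so a test environment can be rebuilt in minutes with data that is already known to be sound.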
DevOps
DevOps brings IT/Ops and developers together to deliver software of a higher quality.

DevOps works in a simulated environment, and thanks to the radical advances of cloud-based development, organizations are now moving DevOps into their cloud environments. With continuous integration and automated testing and delivery, DevOps breaks complicated tasks into much simpler ones.

DataOps
Adopting DevOps will require multiple alterations to your infrastructure. To make the most of DevOps, you'll want to move to a microservice-based workflow that benefits from containers and other progressive technologies, hence the massive rise in Software-as-a-Service (SaaS) offerings. SaaS appeals to a massive entrepreneurial demographic, since almost anyone with the relevant knowledge or skills can help build a SaaS company.

DataOps also calls on administrators and engineers to make use of next-generation data technology to develop their data storage and analytics infrastructure. They need data-processing solutions that are scalable and readily available: think cluster-based, robust storage.

The DataOps architecture also needs to be able to handle a variety of workloads in order to achieve the same versatility as the DevOps delivery pipeline. Creating a data management tool set from diverse solutions, from log aggregators such as Splunk and Sumo Logic to big-data analytics applications such as Hadoop and Spark, is crucial to achieving this agility.

Embracing the Changes
We need to step away from organizing our teams and technologies around the tools we use to manage data: application creation, information management, identity and access management, analytics, and data science. Instead, we need to realize that data is a vital commodity and bring together everyone who uses or handles data to take a data-centric view of the enterprise.

When building applications or data-rich systems, development teams that learn to look past the data-delivery mechanics and concentrate instead on the policies and limitations that control data in their organization can align their infrastructure more closely to enable data to flow to those who need it.

To make the shift, DataOps needs teams to recognize the challenges of today's technology environment and to think creatively about specific approaches to the data challenges in their organization. For example, you might have information about individual users and their functions, about which data attributes need to be protected for which audiences, and about the assets needed to deliver the data where it is required.

Getting together teams that have different ideas helps the company evolve faster. Instead of waiting minutes, hours, or even weeks for data, environments need to be created in minutes, at the pace required for the rapid creation and delivery of applications and solutions. At the same time, companies do not have to choose between access and security; they can operate assured that their data is both accessible and protected.
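One way to get access and protection at the same time is to encode the protection rules as data themselves. The sketch below is illustrative only; the field-classification table, audience names, and masking rule are all invented assumptions. It shows per-audience masking, so each team sees every record it needs while sensitive attributes stay protected.

```python
# Hypothetical per-audience field policy: who may see which attribute, and how.
# "clear" = full value, "masked" = partially redacted, "hidden" = dropped.
FIELD_POLICY = {
    "email":   {"analytics": "masked", "support": "clear"},
    "country": {"analytics": "clear",  "support": "clear"},
    "card_no": {"analytics": "hidden", "support": "masked"},
}

def mask(value):
    """Keep only the first character of a value; a toy redaction rule."""
    value = str(value)
    return value[0] + "***" if value else "***"

def view_for(record, audience):
    """Return the record as a given audience is allowed to see it."""
    out = {}
    for field, value in record.items():
        rule = FIELD_POLICY.get(field, {}).get(audience, "hidden")
        if rule == "clear":
            out[field] = value
        elif rule == "masked":
            out[field] = mask(value)
        # "hidden" fields are dropped entirely
    return out

user = {"email": "ana@example.com", "country": "RO", "card_no": "4111222233334444"}
print(view_for(user, "analytics"))  # {'email': 'a***', 'country': 'RO'}
print(view_for(user, "support"))
```

Because the policy is a plain lookup table rather than code scattered through every pipeline, it can be reviewed, versioned, and changed in one place, which is exactly the kind of centralized control the data-centric view calls for.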
Many forward-thinking companies have found themselves in the midst of the transition from a product economy to a service economy. For example:

• Android and iPhone integrate customer support into their product bundle.
• When buying a new vehicle, BMW includes daily car maintenance in the purchase price.
• Our smartphones now bundle food delivery, maps, GPS, and even online banking as services alongside the product itself.

This shift from product to service as a priority is also reflected in the delivery of software, enabling companies to provide innovation, speed, reliability, frequency, and operation on the customer's behalf.

With cloud automation, companies are now able to shift their focus and assimilate user experience seamlessly, from machine-based functions to IaaS (infrastructure-as-a-service), PaaS (platform-as-a-service), and SaaS. DevOps helps by removing the discrepancy between development and support.

We Can Shift from Stability to Agility
With the increase in production speed, the industry has been challenged to adjust its go-to-market strategy, but mostly to shift its focus from stability and efficiency to innovation and flexibility. Faster technology innovation results in shorter production stages, more creative designs, and higher delivery rates.

The emergence of social media marketing and new technologies is shifting control away from production and keeping customers and users at the core. Branding and marketing mechanisms now react to consumer preferences rather than dictating them. From SMEs to start-ups, companies need to encourage and support creative responsiveness and focus on waste reduction.

It's time for IT organizations to enable software as a service with the aid of DevOps methodologies and cloud automation. DevOps combined with the cloud helps to assess the quality of the customer's experience. This cross-department, cross-functional cooperation strengthens an organization's operations and helps it gain an advantage in its market.

One thing digital transformation has taught us is that software and hardware have to work in unison. Each corporation must adapt to the combination of digital applications with material systems or components.

While DevOps offers advancements in software development and ongoing efficiency to its users, the cloud offers simplicity of use and quality in the product by optimizing operational performance. As a result, DevOps in conjunction with the cloud fulfills user expectations with the help of sophisticated execution.

TL;DR
• DataOps is all about streamlining the processes involved in processing, analyzing, and deriving value from big data.
• Development teams need to learn how to look past the data delivery mechanics and instead concentrate on the policies and limitations that control data in their organization.
• Uber and Netflix have both been very open about the way in which they use DataOps within their business models.
• Many forward-thinking companies have found themselves in the midst of the transition from a product-based economy to a service-based economy.