Gregor Hohpe
This book is for sale at http://leanpub.com/cloudstrategy
Each environment presented its unique set of challenges but also shared
noteworthy commonalities, which this book distills into concrete advice.
Each migration involved specific vendors and products, both on premises
and in the commercial cloud. In this book, though, I stay away from
specific products as much as possible, using them only as occasional examples
where I consider it helpful. There's ample material describing products
already, and while products come and go, architectural considerations tend
to stay. Instead, as with 37 Things, I prefer to take a fresh look at some well-
trodden topics and buzzwords to give readers a novel way of approaching
some of their problems.
Cloud Stories
Corporate IT can be a somewhat uninspiring and outright arduous topic.
But IT doesn’t have to be boring. That’s why I share many of the anecdotes
that I collected from the daily rut of cloud migration alongside the
architectural reflections.
Readers appreciated several attributes of 37 Things’ writing style and
content, which I aimed to repeat for this book:
About this Book iii
So, just as with 37 Things, I hope that this book equips you with a few
catchy slogans that you can back up with solid architectural insight.
There are many ways to the cloud. The worst you can do
is transport your existing processes to the cloud, which will
earn you a new data center but not a cloud - surely not
what you set out to achieve! Therefore, it's time to
question existing assumptions about your infrastructure and
operational model.
Getting Involved
My brain doesn’t stop generating new ideas just because the book is
published, so have a look at my blog to see what’s new:
https://architectelevator.com/blog
http://twitter.com/ghohpe
http://www.linkedin.com/in/ghohpe
To provide feedback and help make this a better book, please join our
private discussion group:
https://groups.google.com/d/forum/cloud-strategy-book
Acknowledgments
Books aren’t written by a sole author locked up in a hotel room for a
season (if you watched The Shining, you know where that leads…). Many
people have knowingly or unknowingly contributed to this book through
hallway conversations, meeting discussions, manuscript reviews, Twitter
dialogs, or casual chats over a beer. Chef kept me going by feeding me
tasty pasta, pizza, and cheesecake.
Part I: Understanding
Cloud
Rethinking Cloud
This part helps you take a fresh look at cloud in the context of business
transformation. Along the way, you'll realize a few things that the shiny brochures
likely didn’t tell you:
Procuring a Cloud?
Most traditional IT processes are designed around the procurement of
individual components, which are then integrated in house or, more
frequently, by a system integrator. Much of IT even defines itself by the bill
of materials it procured over time, from being “heavy on SAP” to
being an “Oracle Shop” or a “Microsoft House”. So, when the time comes to
move to the cloud, enterprises tend to follow the same, proven process to
procure a new component for their ever-growing IT arsenal. After all, they
have learned that following the same process leads to the same desirable
results. Right? Not necessarily.
Cloud isn’t just some additional element that you add to your IT portfolio.
It becomes the fundamental backbone of your IT: cloud is where your data
resides, your software runs, your security mechanisms protect your assets,
and your analytics crunch numbers. Embracing cloud resembles a full-on
IT outsourcing more than a traditional IT procurement.
Cloud isn’t IT Procurement; it’s a Lifestyle Change 4
The closest example to embracing cloud that IT has likely seen in the past
decades is the introduction of a major ERP system - many IT staff will
still get goose bumps from those memories. Installing the ERP software
was likely the easiest part, whereas integration and customization typically
incurred significant effort and cost, often running into the hundreds of millions of
dollars.
We’re moving to the cloud not because it’s so easy, but because of the
undeniable benefits it brings, much like ERP did. In both cases the benefits
don’t derive from simply installing a piece of software. Rather, they
depend on your organization adjusting to a new way of working that’s
embedded in the platform. That change was likely the most difficult part of
the ERP implementation but also the one that brought the most significant
benefits. Luckily, clouds are built as flexible platforms and hence leave
more room for creativity than the “my way or the highway” attitude of
some ERP systems.
Procurement
Procurement is the process of evaluating and purchasing software and
hardware components. Because a good portion of IT’s budget flows
through this process, it’s subject to careful oversight.
Many IT processes are driven through budget control: if you want to start a
project, you need budget approval; likewise if you want to buy any piece
of software or hardware. In the traditional view of IT as a cost center¹,
such a setup makes good sense: if it’s about the money, let’s minimize the
amount that we spend.
Traditional IT budgets are set up at least a year in advance, making
predictability a key consideration. No CFO or shareholder likes to find
out 9 months into the fiscal year that IT is going to overrun its budget
by 20%. Therefore, IT procurement tends to negotiate multi-year license
terms for the software they buy. They “lock in discounts” by signing up
for a larger package to accommodate increasing usage over time, despite
not being able to fully utilize it right from the start. If this reminds you
of the free cell phone minutes that are actually free only after you pay
for them in bulk, you might be onto something. And what’s being
locked in here isn’t so much the discount as the customer.
Cloud’s critical innovation, and the reason it’s turned IT upside down,
is its elastic pricing model: you don’t pay for resources upfront but only
for what you actually consume. Such a pricing model enables major cost
savings, for example because you don’t pay for capacity that you
aren’t yet utilizing. However, elasticity also takes away the predictability
that IT so much cherishes. I have seen a CIO trying to prevent anyone
from ordering a new (virtual) server, inhibiting the very benefit of rapid
provisioning in the cloud. Such things can happen when a new way of
working clashes with existing incentives.
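The arithmetic behind this trade-off is simple enough to sketch. The figures below are entirely made up, but they show how a “locked-in discount” on up-front peak capacity can still cost more than elastic pay-per-use while utilization is ramping up:

```python
# Hypothetical numbers: compare reserving peak capacity up front
# with paying only for what is actually consumed each month.
monthly_usage = [10, 12, 15, 20, 28, 35, 45, 50, 55, 60, 70, 80]  # usage units per month

unit_price = 5.0        # pay-per-use price per unit
peak_capacity = 80      # capacity you'd have to reserve up front
upfront_discount = 0.6  # the "locked-in discount": 40% off list price

# Elastic model: pay for actual consumption each month.
pay_per_use = sum(u * unit_price for u in monthly_usage)

# Traditional model: pay for peak capacity all year, at a discount.
pay_up_front = peak_capacity * unit_price * upfront_discount * len(monthly_usage)

print(f"pay per use:  {pay_per_use:.0f}")   # 2400
print(f"pay up front: {pay_up_front:.0f}")  # 2880
```

Despite the 40% discount, the up-front deal costs more here, because the early months pay for capacity that sits idle - the cell-phone-minutes effect in miniature.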
To select a product, IT organizations traditionally compile massive checklists of
requirements, scoring each product along those dimensions. They add up the
scores and go into negotiation with the vendor with the highest number.
This approach works reasonably well if your organization has a thorough
understanding of the product scope and the needs you have, and is able
to translate those into desired product features. Although this process has
never been a great one (is the product with a score of 82.3 really better
than the one with 81.7?), it was alright for well-known components like
relational databases.
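To make the jab at decimal-point scoring concrete, here is a minimal sketch of such a weighted checklist evaluation. The vendors, features, weights, and scores are all invented:

```python
# Hypothetical feature checklist with weights; each vendor is scored 0-10 per feature.
weights = {"storage": 0.30, "compute": 0.25, "security": 0.25, "analytics": 0.20}

vendor_scores = {
    "Vendor A": {"storage": 9, "compute": 8, "security": 7, "analytics": 9},
    "Vendor B": {"storage": 8, "compute": 9, "security": 9, "analytics": 7},
}

def weighted_total(scores):
    # Sum of weight * score, scaled to a 0-100 range.
    return round(10 * sum(weights[f] * s for f, s in scores.items()), 1)

totals = {vendor: weighted_total(scores) for vendor, scores in vendor_scores.items()}
print(totals)  # {'Vendor A': 82.5, 'Vendor B': 83.0}
```

Is Vendor B at 83.0 really better than Vendor A at 82.5? Nudge the weights slightly and the ranking flips - which is exactly why such precision is illusory.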
Unfortunately, this approach doesn’t work for cloud platforms. Cloud
platforms are vast in scope and our definition of what a cloud should do is
largely shaped by the cloud provider’s existing and upcoming offerings. So
we’re caught in a loop of cloud providers telling us what cloud is so we can
score their offering based on that definition. As I jested in 37 Things, if you
have never seen a car and visit a well-known automotive manufacturer
from the Stuttgart area, you’ll walk out with a star emblem on the hood as
the first item in your feature list (see “The IT World is Flat” in 37 Things).
Trust me, speaking to IT staff after a vendor meeting makes even that
cheeky analogy seem mild.
Because traditional scoring doesn’t work well for clouds, a better approach
is to compare your company’s vision with the provider’s product strategy
and philosophy. For that, you need to know why you’re going to the cloud
and that buying a faster car doesn’t make you a better driver.
Massive checklists also assume that you can make a conscious decision
based on a snapshot in time. However, clouds evolve rapidly. A checklist
from today becomes useless at the latest by the time the cloud provider
holds its annual re:Invent / Ignite / Next event.
Related to the previous consideration, IT should look to
understand the provider’s product strategy and expected evolution. Not many
providers will tell you these straight out, but you can reverse engineer
a good bit from their product roadmaps. After all, the cloud platform’s
constant evolution is one of the main motivators for wanting to deploy on
top of it. To speak mathematically, you’re more interested in the vector
than the current position.
Operations
Cloud doesn’t just challenge traditional procurement processes but
operational processes as well. Those processes exist to “keep the lights on”,
making sure that applications are up-and-running, hardware is available,
and we have some idea of what’s going on in our data center. It’s not
a big surprise that cloud changes the way we operate our infrastructure
significantly.
Side-by-side
This list is just meant to give you a flavor of why selecting a cloud isn’t
your typical IT procurement - it’s not nearly exhaustive. For example,
cloud also challenges existing financial processes. The examples do make
clear, though, that applying your traditional procurement and operational
processes to a cloud adoption is likely to start you off on the wrong
foot.
The following table summarizes the differences to highlight that cloud is
a 180-degree departure from many well-known IT processes. Get ready!
Changing Lifestyle
So, going to the cloud entails a major lifestyle change for IT and perhaps
even the business. Think of it like moving to a different country. I have
moved from the US to Japan and had a great experience, in large part
because I adopted the local lifestyle: I didn’t bring a car, moved into a much
smaller (but equally comfortable) apartment, learned basic Japanese, and
got used to carrying my trash back home (the rarity of public trash bins
is a favorite topic of visitors to Japan). If I had aimed for a 3000 sq. ft.
home with a two-car garage, insisted on driving everywhere, and kept
asking people in English for the nearest trash bin, it would have been
a rather disappointing experience. And in that case I should probably have
asked myself why I was moving to Japan in the first place. When in Rome,
do as the Romans do (or the Japanese - you get my point).
2. Cloud Thinks in the First
Derivative
In economies of speed, cloud is natural.
Digital IT
The difference between making things digital and using digital technology
to change the model applies directly to corporate IT, which has grown up
making existing things digital.
¹https://wiki.c2.com/
Change is Abnormal
Because traditional IT is so used to making existing things digital, it has a
well-defined process for it. Taking something from the current “as-is” state
to a reasonably well-known future “to-be” state is known as a “project”.
Projects are expensive and risky. Therefore, much effort goes into defining
the future “to-be” state from which a (seemingly always positive) business
case is calculated. Next, delivery teams meticulously plan out the project
execution by decomposing tasks into a work breakdown structure. Once
the project is delivered, hopefully vaguely following the projections,
there’s a giant sigh of relief as things return to “business as usual”, a
synonym for avoiding future changes. The achieved future state thus
remains for a good while. That’s needed to harvest the ROI (Return on
Investment) projected in the up-front business case. This way of thinking
is depicted on the left side of this diagram:
Two assumptions lie behind this well-known way of working. First, you
need to have a pretty good idea what you are aiming for to define a good
“target state”. Second, “change” is considered an unusual activity. That’s
why it’s tightly packaged into a project with all the associated control
mechanisms. This way, change is well contained and concludes with a
happy ending, the launch party, which routinely takes place before the
majority of users have even seen the delivered system.
Organizations working this way benefit from scale when they recover the
project investment: the bigger they are, the better the returns. They operate
based on Economies of Scale.
The right-hand side works differently. Here, instead of being a good
guesser, companies need to be fast learners so they can figure out quickly
what works and what doesn’t. They also need low cost of experimentation
and the ability to course correct frequently. So, instead of scale, they’re
living by the rules of Economies of Speed.
The models are almost diametrically opposed. Whereas the left side favors
planning and optimization, the right side favors experimentation and
disruption. On the left, deviation from a plan is undesired while on the
right side it’s the moment of most learning. But even more profoundly,
the two sides also use a different language.
For people who have adopted a world of constant change, absolutes aren’t
as useful. If things are in constant flux, absolutes have less meaning - they
change right after having been measured or established.
The illustration below demonstrates how in a static world we look
at absolute positions. When there’s movement, here drawn as vectors
because I am not able to animate the book, our attention shifts to the
direction and speed of movement.
The same is true in IT. Hence, the folks living in the world depicted on the
right speak in relative values. Instead of budget, they speak of consumption;
instead of headcount, they are more concerned about burn rate, i.e.,
spend per month; instead of fixating on deadlines, they aim to maximize
progress. Those metrics represent rates over time, mathematically known
as the first derivative of a function (assuming the x axis represents time).
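As a tiny illustration (with invented figures), the same spending data looks unremarkable as absolute totals but reveals a clear trend once you look at the rate of change:

```python
# Absolute view: cumulative cloud spend at the end of each month (made-up figures, in $k).
cumulative_spend = [100, 210, 330, 470, 640]

# First-derivative view: month-over-month burn rate, i.e. the rate of change.
burn_rate = [later - earlier
             for earlier, later in zip(cumulative_spend, cumulative_spend[1:])]

print(burn_rate)  # [110, 120, 140, 170] - the burn rate is accelerating
```

The absolute numbers answer “where are we?”; the differences answer “how fast are we moving, and is it speeding up?” - the question that matters in a world of constant change.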
It’s clear that the cloud providers didn’t exactly invent the data center. If
all they had done was build a better data center, they’d be traditional IT
outsourcing providers, not cloud providers. What cloud did fundamentally
change is allowing IT infrastructure to be procured based on a consump-
tion model that’s a perfect fit for a world of constant change. Cloud also
made provisioning near-instant, which is particularly valuable for folks
operating in economies of speed.
When organizations speak about their journey to the cloud, you’ll hear
a lot of fantastic things like dramatic cost savings, increased velocity, or
even a full-on business transformation. However exciting these might be,
in many cases they turn out to be mere aspirations - wishes, to be a bit
more blunt. And we know that wishes usually don’t come true. So, if you
want your cloud journey to succeed, you’d better have an actual strategy.
Strategies
To have a chance at making your wishes come true, you’ll need a strategy.
Interestingly, such strategies come in many different shapes. Some people
are lucky enough to inherit money or marry into a wealthy family. Others
play the lottery - a strategy that’s easily implemented but has a rather
low chance of success. Some work hard to earn a good salary and there
are those who contemplate a bank robbery. While many strategies are
possible, not all are recommended. Or, as I remind people: some strategies
might get you into a very large house with many rooms. It’s called jail.
Clayton Christensen, author of The Innovator’s Dilemma¹, gives valuable
advice on strategy in his book How Will You Measure Your Life?². Those
strategies apply both to enterprises and personal life, and not just for
cars and power boats. Christensen breaks a strategy down into three core
elements:
Although large companies have a lot of resources and can afford to have
a broad strategy, all three elements still apply. Even the most successful
companies have limited resources. A company with unlimited funds,
should it exist, would still be limited by how many qualified people it can
hire and how fast it can onboard them. So, they still need to set priorities.
Hence, a strategy that doesn’t require you to make difficult trade-offs
likely isn’t one. Or as Harvard Business School marketing guru Michael
Porter aptly pointed out:
¹Clayton Christensen, The Innovator’s Dilemma, 2011, Harper Business
²Clayton Christensen, How Will You Measure Your Life?, 2012, Harper Business
Wishful Thinking isn’t a Strategy 23
Many businesses that do have a clearly articulated strategy fall into the
next trap: a large gap between what’s shown on the PowerPoint slides and
the reality. That’s why Andy Grove, former CEO of Intel, famously said
that “to understand a company’s strategy, look at what they actually do
rather than what they say they will do.”
I refer to this process as expand and contract - broaden your view to avoid
blind spots and then narrow back towards a concrete plan. Many of the
decision models in the later chapters in this book (see Part IV) are designed
to help you with this exercise. They define a clear language of choices that
help you make informed decisions.
Once you have a strategy, the execution is based on making conscious
decisions in line with the strategy, all the while balancing trade-offs.
Principles help you make this connection between your execution and
your vision. The next chapter shows how that works.
For example, one dial might define how to distribute workloads across
multiple clouds. Many organizations will want to turn this dial to “10”
to gain ultimate independence from the cloud provider, ignoring the fact
that the products promising that capability are virtually all proprietary
and lock you in even more. Hence, turning this dial up will also move
another dial. That’s why setting a strategy is interesting but challenging.
A real strategy will need to consider how to set that dial meaningfully
over time. Perhaps you start with one cloud, so you can build the required
skills and also receive vendor support more easily. You wouldn’t want to
paint yourself into a corner completely, so perhaps you structure your
applications such that they isolate dependencies on proprietary cloud
services. That’d be like turning the dial to “2”. Over time, you can consider
balancing your workloads across clouds, slowly turning up the dial. You
may not need to shift applications from one cloud to the next all the time,
so you’re happy to leave the dial at “5”.
To make such decisions more concrete, this book gives you decision
models that put names and reference architectures behind the dials and
the numbers. In this example, the chapter on multi-cloud lays out five
options and guides you towards dialing in the right number.
If this approach reminds you of complex systems thinking, you are spot
on. The exercise here is to understand the behavior of the system you’re
managing. It’s too complex to take it apart and see how it all works, so
you need to use heuristics to set the controls and constantly observe the
resulting effect. The corollary to this insight is that you can’t manage what
you can’t understand³.
Goals
As Christensen pointed out, execution is key if you want to give your
strategy a chance of leading to the intended (wished for) result. While
you are executing, it isn’t always easy to know how you’re doing, though
- not having won the lottery yet doesn’t necessarily mean that there was
no progress. On the other hand, just reading fancy car magazines might
make your wishes more concrete, but it doesn’t accomplish much in terms
of being able to actually buy one.
So, to know whether you’re heading in the right direction and are making
commensurate progress, you’ll need concrete intermediate goals. These
goals will depend on the strategy you chose. If your strategy is to earn
money through an honest job, a suitable initial goal can be earning a
degree from a reputable educational institution. Setting a specific amount
of savings aside per month is also a reasonable goal. If you are eyeing
a bank vault, a meaningful goal might be to identify a bank with weak
security. Diverse strategies lead to diverse goals.
While goals are useful for tracking progress, they aren’t necessarily the
end result. I still don’t have a boat. Nevertheless, they allow you to
make concrete and measurable progress. On the upside, my strategy of
pursuing a career in IT met the opportunity to live and work on three
continents. Following Christensen’s second rule, I happily embraced those
opportunities and thus achieved more than I had set out to do.
IT Wishes
Returning to IT, given how much is at stake, we’d think that IT
is full of well-worked-out strategies, meticulously defined and tracked
via specific and achievable goals that are packaged into amazing slide
decks. After all, what else would upper management and all the strategy
consultants be doing?
Sadly, reality doesn’t always match the aspirations, even if it’s merely
to have a well-laid-out strategy. Large companies often believe that they
have sufficient resources to not require laser-sharp focus. Plus, because
business has been going fairly well, the strategy often comes down to
“do more of the same, just a bit better.” Lastly, large enterprises tend to
have slow feedback cycles and limited transparency, both of which hinder
tracking progress and responding to opportunities.
It’s useful to have wishes - at least you know why you are going to the
cloud in the first place. You just need to make sure that these objectives
are supplemented by a viable plan for how to achieve them. They should
also be somewhat realistic to avoid inevitable disappointment.
Calibrating your wishes and pairing them up with a viable path is the
essence of defining a strategy.
Turning wishes into a viable strategy is hard work. The strategy has to be
both forward-looking and executable, presenting an interesting tension
between the two. A well-considered set of principles can be an important
connecting element.
candidate principles (not all of which would pass my test above) but
otherwise leans a bit heavily on TOGAF-flavored enterprise architecture.
Cloud Principles
Just like architecture, principles live in context. It therefore isn’t meaning-
ful to provide a list of principles to be copy-pasted to other organizations.
To give some concrete inspiration, I will share a few principles that I have
developed with organizations in the past, separating “higher-level”, more
strategic principles from more narrowly focused ones.
High-level Principles
A slogan adorning virtually all cloud strategy documents is multi-cloud.
Sadly, too many of these documents state that the architects “favor
multi-cloud”. If you’re lucky, there’s a reason attached, usually to avoid lock-in.
That’s all good, but it doesn’t really help guide concrete decisions, so it’s
hardly a very meaningful principle. Should I deploy all applications on
all clouds? That might be overkill. Are all clouds equal? Or do I have a
primary cloud and one or two as fallbacks? That’s where things get a lot
more interesting.
With one customer I defined the preference clearly in the following
principle:
Again, the opposite is also viable: anything that’s needed but isn’t yet
available, we build as quickly as possible. Sadly in large organizations,
“as quickly as possible” typically translates into 18 months. By then the
cloud providers have often delivered the same capability, likely orders of
magnitude cheaper. Strategy takes discipline.
Specific Principles
In other situations, when dealing with a smaller and more nimble
organization, we developed more specific principles. Some of these principles
were designed to counterbalance prevalent tendencies that made us less
productive. We essentially used the principles to gently course correct -
that’s totally fine as long as the principles don’t become too tactical.
and funding request. How the story got from A to B is left as an exercise
for the reader.
Capability ≠ Benefit
A critically important but often neglected step when evaluating products
is the translation of a tool’s abstract capability into concrete value for
the organization. A system that can scale to 10,000 transactions per
second is as impressive as a car with a top speed of 300 km/h. However,
it brings about as much benefit to a typical enterprise as that car does in
a country that enforces a 100 km/h speed limit. If anything, either will
get you into trouble: either a speeding ticket (or jail), or being
stuck with overpriced, highly specialized consultants helping you
implement a simple use case in an overly complex tool.
Look for products that generate value even if you only use
a subset of features, thus giving you a “ramp” to adoption.
IT’s inclination to look too far ahead when procuring tools often
isn’t the result of ignorance but of excessive friction in approval
processes. Because procuring a new product requires elaborate evaluations,
business cases, security reviews, and so on, it’d be very inefficient to start
with a simple tool and upgrade soon after - you’d just be doing the whole
process over again. Instead, teams are looking for a tool that’ll suit them
three years down the road, even if it’s a struggle for the first two. So,
systemic friction is once again the culprit for much of IT’s ailments.
If You Don’t Know How to Drive… 42
Back in IT, we can find quite a few analogies. While your enterprise is
proudly making its first steps into the cloud, you are made to believe
that everyone else is already blissfully basking in a perimeter-less-multi-
hybrid-cloud-native-serverless nirvana. Even if that was the case, it’d still
be fine to be going at your own pace. You should also be a bit skeptical
as to whether there’s some amplification and wishful thinking at play. Go
out and have a look - you might see some of your peers at the tanning
studio.
Organizational Architectures
Organizations therefore need to decide to what extent they can train
existing staff and how they are going to acquire new skill sets. Some
organizations are looking to start with a small center of excellence, which
is expected to carry the change to the remainder of the organization.
Other organizations have trained and certified everyone, including their
CEO. The happy medium might be somewhere in the middle but won’t
be the same for every kind of organization. So, as you are designing
your technical cloud architecture, you’ll also want to be defining your
organizational architecture.
Bring ’em Home
So, it’s no surprise that many IT departments are rethinking their out-
sourcing strategy. Most companies I worked with were bringing software
delivery and DevOps skills back in house, usually only held back by
their ability to attract talent. In-house development benefits from faster
feedback cycles. It also ensures that any learning is retained inside the
organization as opposed to being lost to a third-party provider. Organizations
also maintain better control over the software they develop in house.
One challenge with insourcing is that an IT organization that builds software
looks quite different from one that mostly operates third-party software.
Instead of oversight committees and procurement departments, you’ll need
product owners that aren’t program managers in disguise and architects
that can make difficult trade-offs under project pressure instead of com-
menting on vendor presentations. So, just like outsourcing, insourcing is
also a big deal and one that should be well thought through.
²Ries, The Lean Startup, 2011, Currency
³Reinertsen, The Principles of Product Development Flow, 2009, Celeritas Publishing
Cloud is Outsourcing 50
Digital Outsourcing
Strangely, though, while traditional companies are (rightly) reeling things
back in, their competitors from the digital side appear to do exactly the
opposite: they rely heavily on outsourcing. No server is to be found on
their premises as everything is running in the cloud. Commodity software
is licensed and used in a SaaS model. Food is catered and office space
is leased. HR and payroll services are outsourced to third parties. These
companies maintain a minimum of assets and focus solely on product
development. They’re the corporate equivalent of the “own nothing”
trend.
Outsourcing à la Cloud
Cloud isn’t just outsourcing; it might well pass as the mother of all
outsourcings. Activities from facilities and hardware procurement all
the way to patching operating systems and monitoring applications are
provided by a third party. Unlike traditional infrastructure outsourcing
that typically just handles data center facilities or server operations,
cloud offers fully managed services like data warehouses, pre-trained ma-
chine learning models, or end-to-end software delivery and deployment
platforms. In most cases you won’t even be able to inspect the service
provider’s facilities due to security considerations.
⁴https://architectelevator.com
[Comparison of traditional outsourcing vs. cloud along the dimensions: Full Control, Transparency, Short-term Commitments, Evolution, Economics]
However, what those lines entail makes all the difference. On the left side
it’s lengthy contracts and service requests (or emailing spreadsheets for
the less fortunate) that go into a black hole from which the requested
service (or something vaguely similar) ultimately emerges. Cloud in
contrast feels like interacting with a piece of commercial software. Instead
of filing a service request, you call an API, just like you would with
your ERP package. Or you use the browser interface via a cloud console.
Provisioning isn’t a black hole, pricing is transparent. New services are
announced all the time. You could call cloud platforms a service-oriented
architecture for infrastructure.
This simple comparison highlights that organizations who take a static
view of the world won’t notice the stark differences between the models.
Organizations that think in the first derivative, however, will see that
cloud is a lifestyle changer.
It appears that cloud managed to draw a very precise dividing line between
a scale-based implementation and a democratic, speed-based business
model. Perhaps that’s the real magic trick of cloud.
Outsourcing is Insurance
Anyone who’s spent time in traditional IT knows that there’s another
reason for outsourcing: if something blows up, such as a software project
or a data center consolidation gone wrong, with outsourcing you’re
further from the epicenter of the disaster and hence have a better chance
of survival. When internal folks point out to me that a contractor’s day
rate could pay for two internals, I routinely remind them that the system
integrator rate includes life insurance for our IT managers. “Blame shield”
is a blunt term sometimes used for this setup. No worries, the integrator
who took the hit will move on to other customers, largely unscathed.
Sadly, or perhaps luckily, cloud doesn’t include this element in the
outsourcing arrangement. “You build it, you run it” also translates into
“you set it on fire, you put it out.” I feel that’s fair game.
7. Cloud Turns Your
Organization Sideways
And your customers will like it.
and the database layer. So, this structure, which looks like it can localize
change, often runs counter to end-to-end dependencies.
Optimizing End-to-End
Cloud gives organizations an edge in Economies of Speed because it
takes friction out of the system, such as having to wait for delivery and
installation of servers. To get there, though, you need to think and optimize
end-to-end, cutting across the existing layers. Each transition between
layers can be a friction point and introduce delays - service requests,
shared resource pools, or waiting for a steering committee decision.
Hence, a cloud operating model works best if you turn the layer cake on its
side. Instead of slicing by application / middleware / infrastructure, you
give each team end-to-end control and responsibility over a functional
area. Instead of trying to optimize each function individually, teams
optimize end-to-end value delivery. Following this logic, the layers,
which represent technical elements, turn into vertical slices that represent
functional areas or services.
That this works has been demonstrated by the Lean movement in manu-
facturing, which for about half a century benefited from optimizing end-
to-end flow instead of individual processing steps.
Infra Teams
Having applications teams automate deployment doesn’t imply that there’s
nothing left for infrastructure teams to do. They are still in charge of
commissioning cloud interconnects, deploying common tools for moni-
toring and security, and creating automation among many other things.
My standard answer to anyone in IT worrying about whether there will
be something left to do for them is “I have never seen anyone in IT not
being busy.” As IT is asked to pick up pace and deliver more value, there’s
plenty to do! Operations teams that used to manually maintain the steady
state are now helping support change. Essentially, they have been moved
into the first derivative.
in the article Why Central Cloud Teams Fail³ - note that the authors sell
cloud training, giving them an incentive to propose training the whole
organization, but the point perhaps prevails.
In summary, CoEs can be useful to “start the engine” but just like a two-
speed architecture, they aren’t a magic recipe and should be expected to
have a limited lifespan.
Organizational Debt
There is likely a lot more to be said about changes to organizational
structures, processes, and HR schemes needed to maximize the benefit
of moving to the cloud. Although the technical changes aren’t easy, they
can often be implemented more quickly than the organizational changes.
After all, as my friend Mark Birch, who used to be the Regional Director
for Stack Overflow in APAC, pointedly concluded:
Pushing the technical change while neglecting (or simply not advancing
at equal pace) the organizational change is going to accumulate organiza-
tional debt. Just like technical debt, “borrowing” from your organization
allows you to speed up initially, but the accumulating interest will
ultimately come back to bite you just like overdue credit card bills. Soon,
your technical progress will slow down due to organizational resistance.
In any case, the value that the technical progress generates will diminish
- an effect that I call the value gap⁴.
³https://info.acloud.guru/resources/why-central-cloud-teams-fail-and-how-to-save-yours
⁴https://aws.amazon.com/blogs/enterprise-strategy/is-your-cloud-journey-stuck-in-the-value-gap/
8. Retain / Re-Skill / Replace
/ Retire
The four “R”s of migrating your workforce.
When IT leaders and architects speak about cloud migrations they usually
refer to migrating applications to the cloud. To plan such a move, several
frameworks with varying numbers of R’s exist. However, an equally large
or perhaps even bigger challenge is how to migrate your workforce to
embrace cloud and work efficiently in the new environment. Interestingly,
there’s another set of “R’s” for that.
Retain
Re-Skill
Replace
Retire
Some roles aren’t needed any more and hence don’t need to
be replaced. For example, many tasks associated with data
center facilities, such as planning, procurement, and instal-
lation of hardware are likely no longer required in a cloud
operating model. Folks concerned with design and plan-
ning might move towards evaluating and procuring cloud
providers’ services, but this function largely disappears once
an organization completes the move to the cloud and shuts
down its physical data centers. In many IT organizations such
tasks were outsourced to begin with, softening the impact on
internal staff.
In real life, just like on the technical side of cloud migration, you will have
a mix of “migration paths” for different groups of people. You’ll be able
to retain and re-skill the majority of staff and might have to replace or
retire a few. Choosing will very much depend on your organization and
the individuals. However, having a defined and labeled list of options will
help you make a meaningful selection.
who were simply bored and would do a stellar job once your organization
adjusts its operating model.
In a famous interview¹ Adrian Cockcroft, now VP of Cloud Architecture
Strategy at AWS, responded to a CIO lamenting that his organization
doesn’t have the same skills as Netflix with “well, we hired them from
you!” He went on to explain that “What we found, over and over again, was
that people were producing a fraction of what they could produce at most
companies.” So, have a close look to see whether your people are being
held back by the system.
You might also encounter the inverse: folks who were very good at their
existing task but are unwilling to embrace a new way of working. It’s
similar to the warning label on financial instruments: “past performance
is no guarantee of future results.”
Training can also come in the form of working side-by-side with an expert.
This can be an external or a well-qualified internal person. Just like before,
you’d want to use this feedback channel to know who your best candidates
are.
Some organizations feel that having to (re-)train their people is a major
investment that puts them at a disadvantage compared to their “digital”
competitors who seem to only employ people who know the cloud,
embrace DevOps, and never lose sight of business value. However, this is a
classic “the grass is greener” fallacy: the large tech leaders invest an enor-
mous amount into training and enablement, ranging from onboarding
bootcamps through weekly tech talks and brown bags to self-paced learning
modules. They are happy to make this investment because they know the
ROI is high.
So, you need to find a way to dry up the mud before you can utilize top
talent. However, once again you’re caught in a chicken-and-egg situation:
draining the swamp also requires talent. You might be fortunate enough
to find an athlete who is willing to also help remove friction. Such a
person might see changing the running surface as an even more interesting
challenge than just running.
²https://www.linkedin.com/pulse/drain-swamp-before-looking-top-talent-gregor-hohpe/
More likely, you’ll need to fix some of the worst friction points before
you go out to hire. You might want to start with those that are likely to
be faced first by new hires to at least soften the blow. Common friction
points early in the onboarding process include inadequate IT equipment
such as laptops that don’t allow software installs and take 15 minutes to
boot due to device management (aka “corporate spyware”). Unresponsive
or inflexible IT support is another common complaint.
HR onboarding processes also run the risk of turning off athletes before
they even enter the race. One famous figure in agile software development
was once dropped from the recruiting process by a major internet company
because he told them to just “Google his name” instead of submitting
a resume. I guess they weren’t ready for him.
You can build similar isolation layers, for example for budget allocation.
³https://domainlanguage.com/ddd/reference/
Up Your Assets
One constant concern among organizations looking to attract desirable
technical skills is their low (perceived) employer attractiveness. How
would a traditional insurance company / bank / manufacturing business
be able to compete for talent against Google, Amazon, and Facebook?
The good news is that, although many don’t like to hear it, you likely
don’t need the same kind of engineer that the FAANGs (Facebook, Apple,
Amazon, Netflix, Google) employ. That doesn’t imply that your engineers
are less smart - they just don’t need to build websites that scale to 1 billion
users or develop globally distributed transactional databases. They need to
be able to code applications that delight business users and transform the
insurance / banking / manufacturing industry. They’ll have tremendous
tools at their fingertips, thanks to cloud, and hence can focus on delivering
value to your business.
Re-label?
Movie Recipes
It’s well known that Hollywood movies follow a few proven recipes.
A common theme starts with a person leading a seemingly normal if
perhaps ambitious life. To achieve an ambitious goal, this person engages
someone from the underworld to do a bit of dirty work for them. To at least
initially preserve the character’s moral standing, the person justifies their
actions with a pressing need or a vague principle of the greater good. The
counterpart from the underworld often comes in the form of a hitman, who
is tasked with scaring away, kidnapping, or disposing of a person who
stands in the way of a grand plan. Alas, things don’t go as planned and so
the story unfolds.
A movie following this recipe fairly closely is The Accountant with Ben
Affleck. Here, the CEO of a major robotics company hired a hitman
to dispose of his sister and his CFO, who might have gotten wind of
his financial manipulations. Independently, the company also hired an
accountant to investigate some irregularities with their bookkeeping. As
it turns out, the accountant brings a lot more skills to the table than just
adding numbers. After a lot of action and countless empty shells, things…
(spoiler alert!) don’t end well for the instigator.
Don’t Hire a Digital Hitman 71
While as an IT leader you may not have to worry about someone else
outbidding you to turn your own hitman against you, you may still hit a
few snags, such as:
• The transformation expert you are hiring may feel that it’s entirely
reasonable to quit after half a year as that’s how things work
where they’re from.
• They may simply ignore all corporate rules - after all they are here
to challenge and change things, right? But perhaps that shouldn’t
include blowing the budget halfway through the year and storming
into the CEO’s office every time they have what they think is a good
idea.
• They may also feel that it’s OK to publicly speak about the work
they’re doing, including all of your organization’s shortcomings.
Asking someone who grew up in the Googles of the world how to become
like them is a bit like asking trust fund kids for investment advice: they
may tell you to invest a small portion of your assets into a building
downtown and convert it into luxury condos while keeping the penthouse
for yourself. This may in fact be sound advice, but tough to implement for
someone who doesn’t have $50 million as starting capital.
The IT equivalents of trust fund investment advice can take many forms,
most of which backfire:
• The new, shiny (and expensive) innovation center may not actually
feed anything back into the core business. This could be because
the business isn’t ready for it or because the innovation center has
become completely detached from the business.
• The new work model may be shunned by employees or it might
violate existing labor contract agreements and hence can’t be
implemented.
• The staff may lack the skills to support a shiny new “digital”
platform that was proposed.
Stamina
First of all, don’t expect a digital hitman who carries silver bullets
to instantly kill your legacy zombies. Look for someone who has
shown the stamina to see a transformation through from beginning
to end. In many cases this would have taken several years.
Enterprise Experience
You can’t just copy-paste one company’s culture onto another - you
need to plot a careful path. Otherwise, your candidate may be extremely
frustrated with your existing platforms and processes and leave
quickly. Look for someone who has had some exposure to enterprise
systems and environments.
Teacher
Having worked for a “digital” company doesn’t automatically qual-
ify a person as a transformation agent. Yes, they’ll have some
experience working in the kind of organization that you’re inspired
by (or compete against). However, that doesn’t mean they know
how to take you from where you are to where you want to be. Think
about it this way: not every native speaker of a specific language is
a good teacher of that language.
Engine Room
Knowing how a digital company acts doesn’t mean the person
knows the kind of technology platform that’s needed underneath.
Yes, “move to the cloud” can’t be all wrong, but it doesn’t tell
you how to get there, in which order, and with which expectations.
Therefore, find someone who can ride the architect elevator at least
a few floors down.
Vendor Neutral
Smart advice is relatively easy to come by compared to actual
implementation. Look for someone who has spent time inside
a transforming organization as opposed to looking on from the
outside. Likewise watch out for candidates who are looking to drive
the transformation through product investment alone. If the person
spews out a lot of product names you might be looking at a sales
person in disguise. Once hired, such a person will cozy up to
vendors selling you transformation snake oil. This is not the person
you want to lead your transformation.
By Michele Danieli
Enterprise Architecture
Filling this mandate translates into three concrete tasks that are material
to the success of the cloud journey:
These roles and their needs existed before shifting to a cloud operating
model. However, the move to the cloud changes the way these players
interact. In traditional IT, long term investment in vendor technologies,
long procurement lead times, and mature but constrained technology
options (both in infrastructure and software) limited the pace of change
and innovation. Hence much time was spent carefully selecting software
and managing the technology lifecycle. In this context, EA was needed to
lay out a multi-year roadmap and ensure convergence from the present to
the defined future state.
Virtuous Cycles
Cloud gives enterprises more options that can be adopted at a faster
pace and with a much lower investment. Cloud also elevates the roles
of developers, operations engineers, and product owners because it turns
your organization sideways. Traditional organization charts don’t reflect
this dynamic well. I visualize this new organizational model as a circle
connecting three vertices: Product Owners, Developers and Engineers.
These individuals are guided by an external circle of stakeholders: Busi-
ness, Infrastructure Managers, Cloud Providers and Service Managers.
Enterprise Architecture in the Cloud 82
The faster the inner loop spins, the more value it delivers to the business
and the more confidence is built into the team: product owners and devel-
opers working in shorter sprints; developers pushing to production more
frequently; engineers setting up and running more stable environments
faster.
Any disturbance will be seen as reducing output and is therefore un-
welcome. To the parties in the loop, enterprise-wide IT standards that
constrain technical choices are considered friction. To make matters
worse, such standards are often based on a legacy operating model that
doesn’t suit the way modern teams like to work. For example, existing
standards for build tool chains might not support the level of automation
engineers desire, affecting the time it takes the team to deliver features.
While the inner loop spins fast on the product cycle, the outer loop
explores strategic opportunities, often enabled by new technologies such
as machine learning, cognitive computing, and robotics. Although this
loop moves more slowly than the inner loop, it also speeds up in a modern
enterprise. As a result, product ideas flow more quickly from external to
internal circles who return continuous feedback. Instead of defining multi-
year plans, teams now probe and explore, consuming IT services provided
by cloud providers.
Modern enterprise architects need to fit into these loops to avoid becoming
purely ceremonial roles. Does it mean they are giving way to Product
Owners, Developers, Engineers, and Cloud Architects? Not necessarily.
While the role isn’t fundamentally different from the past, the wheel
spins much faster and thus requires a more pragmatic and non-dogmatic
approach to strategy. It also requires leaving the ivory tower behind to
work closely with developers, engineers and infrastructure managers in
the engine room.
Part III: Moving to the
Cloud
Widespread “advice” and buzzwords can muddy the path to cloud enlight-
enment a good bit. My favorite one is the notion of wanting to become a
“cloud native” organization. As someone who’s moved around the world
a good bit, I feel obliged to point out that “being native” is rather the
Measuring Progress
So, it’s better to put all that vocabulary aside and focus on making a
better IT for our business. Your goal shouldn’t be to achieve a certain
label (I’ll give you any label you like for a modest fee…), but to improve on
concrete core metrics that are relevant to the business and the customer.
Many organizations track their cloud migrations by the number of servers
or applications migrated. Such metrics are IT-internal proxy metrics and
should make way for outcome-oriented metrics such as uptime or release
frequency.
Plotting a Path
Many resources that help you with the mechanics of a cloud migration
are available from cloud service providers and third parties, ranging from
planning aids and frameworks to conversion tools. For example, AWS⁵,
Azure⁶, and Google Cloud⁷ each provide an elaborate Cloud Adoption
Framework.
⁵https://aws.amazon.com/professional-services/CAF/
⁶https://azure.microsoft.com/en-us/cloud-adoption-framework/
⁷https://cloud.google.com/adoption-framework
Cost
existing workload to the cloud. Savings for example come from using
managed services and the cloud’s elastic “pay as you grow” pricing model
that allows suspending compute resources that aren’t utilized. However,
realizing these savings requires transparency into resource utilization and
high levels of deployment automation - cloud savings have to be earned.
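To illustrate what “earning” savings looks like, the decision of which resources to suspend can be automated. The sketch below is a minimal, hypothetical example: the instance names, utilization figures, and the 5% threshold are all made up. In practice you’d populate the data from your provider’s monitoring API and feed the result to its stop-instances call.

```python
def instances_to_suspend(avg_cpu_percent, threshold=5.0):
    """Return IDs of instances whose average CPU utilization is below
    the (illustrative) threshold - candidates for stopping to save cost."""
    return [iid for iid, cpu in sorted(avg_cpu_percent.items())
            if cpu < threshold]

# Made-up utilization data; a real version would query the cloud
# provider's monitoring service instead.
idle = instances_to_suspend({"web-1": 42.0, "batch-7": 1.2, "dev-3": 0.4})
print(idle)  # ['batch-7', 'dev-3']
```

The point of such a loop is the transparency part: without utilization data per resource, there’s nothing to act on.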
Uptime
No one likes to pay for IT that isn’t running, making uptime another
core metric for the CIO. Cloud can help here as well: a global footprint
allows applications to run across a network of global data centers, each
of which features multiple availability zones. Cloud providers also main-
tain massive network connections to the internet backbone, affording
them improved resilience against distributed denial-of-service (DDoS)
attacks. Set up correctly, cloud applications can achieve 99.9% availability or higher,
which is often difficult or expensive to realize on premises. Additionally,
applications can be deployed across multiple clouds to further increase
availability.
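The effect of redundancy across zones or clouds can be estimated with a simple calculation. Assuming deployments fail independently (an idealization - correlated failures do happen), the availability of n redundant deployments, each with availability a, is 1 − (1 − a)ⁿ:

```python
def composite_availability(a, n):
    """Availability of n redundant, independently failing deployments,
    each with availability a: the system is up if at least one is up."""
    return 1 - (1 - a) ** n

# Two zones at 99% each already exceed the "three nines" mark:
print(round(composite_availability(0.99, 2), 4))  # 0.9999
```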
Why Exactly Are You Going to the Cloud Again? 90
Scalability
Assuring uptime under heavy load requires scalability. Cloud touts instant
availability of compute and storage resources, allowing applications to
scale out to remain performant even under increased load. Many cloud
providers have demonstrated that their infrastructures can easily with-
stand tens of thousands of requests per second, more than enough for most
enterprises. Conveniently, the cloud model makes capacity management
the cloud provider’s responsibility, freeing your IT from having to main-
tain unused hardware inventory. You just provision new servers when
you need them.
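The shift in capacity management can be shown with a back-of-the-envelope sizing calculation; the request rates and per-server capacity below are illustrative numbers, not benchmarks:

```python
import math

def servers_needed(requests_per_second, capacity_per_server):
    """How many instances a given load requires - in the cloud this
    number can be adjusted on demand instead of bought up front."""
    return max(1, math.ceil(requests_per_second / capacity_per_server))

# A spike from 2,000 to 30,000 req/s just means asking for more
# instances, rather than having 60 servers idle most of the year:
print(servers_needed(2_000, 500), servers_needed(30_000, 500))  # 4 60
```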
Organizations looking to benefit from cloud for scale might want to
have a look at serverless compute that manages scaling of instances
transparently inside the platform. However, you can also achieve massive
scale with regular virtual machine instances, e.g. for large batch or number
crunching jobs.
Performance
Many end users access applications from the internet, for example from
a browser or a mobile application. Hosting applications with a cloud
provider can offer shorter paths and thus a better user experience
compared to applications hosted on premises. Global SaaS applications
like Google’s Gmail are proof of the cloud’s performance capabilities.
Performance is also a customer-relevant metric, unlike operational cost
reduction, which is rarely passed on to the customer.
Velocity
Speed isn’t just relevant once the application is running but also on
the path there, i.e. during software delivery. Modern cloud tools can
greatly assist in accelerating software delivery. Fully managed build tool
chains and automated deployment tools allow your teams to focus on
feature development while the cloud takes care of compiling, testing, and
deploying software. Once you realize that you shouldn’t run software you
didn’t build, you are going to focus on building software and the velocity
with which you can do it.
Most cloud providers (and hosted source control providers like GitHub or
GitLab) offer fully managed CI/CD pipelines that rebuild and test your
software whenever source code changes. Serverless compute also vastly
simplifies deployment.
Security
Insight
Cloud isn’t all about computing resources. Cloud providers offer services
that are impractical or impossible to implement at the scale of corporate
data centers. For example, large-scale machine learning systems or pre-
trained models inherently live in the cloud. So, gaining better analytic
capabilities or better insights, for example into customer behavior or
manufacturing efficiency, can be a good motivator for going to the cloud.
Customers looking for improved insight will benefit from open-source
machine learning frameworks such as TensorFlow, pre-trained models
such as Amazon Rekognition⁵ or Google Vision AI⁶, or data warehouse
tools like Google BigQuery or Amazon Redshift.
¹https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)
²https://en.wikipedia.org/wiki/Meltdown_(security_vulnerability)
³https://aws.amazon.com/ec2/nitro/
⁴https://cloud.google.com/blog/products/gcp/titan-in-depth-security-in-plaintext
⁵https://aws.amazon.com/rekognition/
⁶https://cloud.google.com/vision
Transparency
Priorities
Despite all the great things that cloud provides, it can’t solve all problems
all at once. Even with cloud, if a strategy sounds too good to be true, it
probably is. So having a catalog of goals to choose from allows a balanced
discussion of which problems to address first. Perhaps you want to focus
on uptime and transparency first and tackle cost later, knowing well that
savings have to be earned. Equally important to setting the initial focus is
defining what comes later, or what’s out of scope altogether.
Metrics like counting migrated applications are dangerous proxies for your
real objectives: they can lure you into believing that you’re making progress
when you’re not actually providing value to the business.
Compute Cloud (EC2) were the first modern-era cloud services offered
back in 2006, a focus on these elements appears natural.
In enterprise parlance, servers and storage translate into “infrastructure”.
Therefore, when looking for a “home” for their cloud migration effort,
it seems natural to assign it to the infrastructure and operations team, i.e.
the “run” side of the house. After all, they should be familiar with the type
of services that the cloud offers. They are also often the ones who have
the best overview of current data center size and application inventory,
although “best” doesn’t mean “accurate” in most organizations.
Serving Servers
To understand why running a cloud migration out of the infrastructure
side of the house should be taken with caution, let’s look at what a
server actually does. In my typical unfiltered way, I define a server as an
expensive piece of hardware that consumes a lot of energy and is obsolete
in three years.
Based on that definition, it’s easy to see that most business and IT users
wouldn’t be too keen on buying servers, whether it’s on premises or in the
cloud. Instead of buying servers, people want to run and use applications.
It’s just that traditional IT services have primed us to believe that before
we can do so, we first have to order a server.
No One Wants a Server 97
Time is Money
Being able to programmatically provision compute resources is certainly
useful. However, focusing purely on hardware provisioning times and
cost misses the real driver behind cloud computing. Let’s be honest,
cloud computing didn’t exactly invent the data center. What it did was
revolutionize the way we procure hardware with elastic pricing and
instant provisioning. Not taking advantage of these disruptive changes
is like going to a gourmet restaurant and ordering French fries - you’d be
eating but entirely missing the point.
Being able to provision a server in a few seconds is certainly nice, but it
doesn’t do much for the business unless it translates into business value.
Server provisioning time often makes up a small fraction of the time
it takes for a normal code change to go into production (or into a test
environment). Reducing the overall time customers have to wait for a new
feature to be released, the time to value, is a critical metric for businesses
operating in Economies of Speed. Velocity is the currency of the digital
world. And it needs much more than just servers.
Rapid deployment is also a key factor for elastic scaling: to scale out across
more servers, running software has to be deployed to those additional
servers. If that takes a long time, you won’t be able to handle sudden load
spikes. That means that you have to keep already configured servers in
stand-by, which drives up your cost. Time is money, also for software
deployment.
Looking Sideways
Naturally, cloud providers realize this shift and have come a long way
since the days when they only offered servers and storage. Building an
application on EC2 and S3 alone, while technically totally viable, feels
a little bit like building a web application in HTML 4.0 - functional,
but outdated. Only a minority of Amazon’s 100+ cloud services these days
relates purely to infrastructure. You’ll find databases of all forms, con-
tent delivery, analytics, machine learning, container platforms, serverless
frameworks, and much more.
Although there are still minor differences in VM-level cloud services
between the providers, the rapid commoditization of infrastructure has
the cloud service providers swiftly moving up the stack from IaaS to mid-
dleware and higher-level run-times. “Moving up the stack” is something
that IT has been doing for a good while, e.g. when moving from physical
machines to virtual ones in the early 2000s. However, once this process
reaches from IaaS (Infrastructure as a Service) and PaaS (Platform as a
Service) to FaaS (Functions as a Service), aka “Serverless”, there isn’t much
more room to move up any further.
Gone are the days where delivery tool chains were cobbled together one-
offs. Modern tool chains run in containers, are easily deployed, scalable,
restartable, and transparent.
¹Hohpe, The Software Architect Elevator, 2020, O’Reilly
Carrying your existing processes into the cloud will get you
another data center, not a cloud.
Moving to the cloud with your old way of working is the cardinal sin of
cloud migration: you’ll get yet another data center, probably not what you
were looking for! Last time I checked, no CIO needed more data centers.
In the worst case, you need to face the possibility that you’re not getting
a cloud at all, but an Enterprise Non-Cloud.
The answer to common concerns about cloud often lies in taking a step back
and looking at the problem from a different angle. For example, several
modern cloud platforms utilize containers for deployment automation
and orchestration. When discussing container technology with customers,
a frequent response is: “many of my enterprise applications, especially
commercial off-the-shelf software, don’t run in containers, or at least not
without the vendor’s support.” Fair point! The answer to this conundrum,
interestingly, lies in asking a much more fundamental question.
It’s impressive how much load your IT truck can handle, but nevertheless
it’s time to slim down.
Stuck in the middle, IT would shell out money for licenses and hardware
assets, but lack control over both the software features and their own
infrastructure. Instead of a sandwich, the image of a vise might also come
to your mind. Aren’t we glad cloud came and showed us that this isn’t a
particularly clever setup?
Installation is cumbersome
Yup, it’s like that porcelain shop: you break, you pay. In
this case, you pay for support. And if something breaks you
have to convince external support that it’s not due to your
environment. All the while the business is beating down on you, asking
why stuff isn’t running as promised.
And despite all this effort, the worst part is that you don’t
have control on either end. If you need a new feature in the
software you are running, you’ll have to wait until the next
release - that is, if your feature is on the feature list. If not,
you may be waiting a looooong time. Likewise, operations
contracts are typically fixed in scope for many years. No
Kubernetes for you.
Running software you didn’t build isn’t actually all that great. It’s amazing
how enterprise IT has become so used to this model that it accepts it as
normal and doesn’t even question it.
Luckily, the wake-up call came with cloud and the need to question our
basic assumptions.
Salesforce, with its CRM offering, was really the first major company to
lead us on that virtuous route back around 2005, even before the modern
cloud providers started offering us servers in an IaaS model. Of course,
it’s far from the only company following this model: you can handle your
communications and documentation needs in the cloud thanks to Google
G Suite¹ (which started in 2006 as “Google Apps for Your Domain”), HR
and performance management with SAP SuccessFactors², ERP with SAP
Cloud ERP³, service management with ServiceNow⁴, and integrate it
all with Mulesoft CloudHub⁵ (note: these are just examples; I have no
affiliation with these companies and the list of SaaS providers is nearly
endless).
Anything as a Service
Service-oriented models aren’t limited to software delivery. These days
you can lease an electric power generator by the kWh produced or a train
engine by the passenger mile run. Just like SaaS, such models place more
responsibility on the provider. That’s because the provider deals with
operational issues without any opportunity of pointing the finger at the
consumer:
If the train engine doesn’t run, in the traditional model it’d be the
customer’s (i.e. the train company’s) problem. In many cases the customer
would even pay extra for service and repairs. In a service model, the
manufacturer doesn’t get to charge the customer in case of operational
¹https://gsuite.google.com
²https://www.successfactors.com
³https://www.sap.com/sea/products/erp/erp-cloud.html
⁴https://www.servicenow.com/
⁵https://www.mulesoft.com/platform/saas/cloudhub-ipaas-cloud-based-integration
Don’t Run Software You Didn’t Build 107
issues. Instead, the manufacturer takes the hit because the customer
won’t pay for the passenger miles not achieved. This feedback loop
motivates the provider to correct any issue quickly. The same is (or at least
should be) true for software as a service: if you pay per transaction, you
won’t pay for software that can’t process transactions.
Naturally, you are happier with running software (or a running train), but
the feedback loop established by pricing based on generated value leads to
a higher-quality product and more reliable operations. The same principle
applies in a DevOps model: engineers invest more in building quality
software if they have to deal with any production issues themselves.
Security
As described earlier, the days when running applications on your
premises meant that they were more secure are over. Attack vectors
like compromised hardware or large-scale denial-of-service attacks
by nation-state actors make it nearly impossible to protect IT
assets without the scale and operational automation of large cloud
providers. Plus, on-premises often isn’t actually on your premises -
it’s just another data center run by an outsourcing provider.
Latency
Latency isn’t just a matter of distance, but also a matter of the num-
ber of hops, and the size of the pipes (saturation drives up latency
or brings a service down altogether). Most of your customers and
partners will access your services from the internet, so often the
cloud, and therefore the SaaS offering, is closer to them than your
ISP and your data center.
Control
Control is when reality matches what you want it to be. In many
enterprises, manual processes severely limit control: someone files
a request, which gets approved, followed by a manual action.
Who guarantees the action matches what was approved? A jump
server and session recording? Automation and APIs give you tighter
control in most cases.
Cost
Although cloud doesn’t come with a magic cost reduction button,
it does give you better cost transparency and more levers to control
cost. Especially hidden and indirect costs go down dramatically
with SaaS solutions.
Integration
One could write whole books about integration - oh, wait⁶! Natu-
rally, a significant part of installing software isn’t just about getting
it to run, but also to connect it to other systems. A SaaS model
doesn’t make this work go away completely, but the availability
of APIs and pre-built interfaces (e.g. Salesforce and GSuite⁷) makes
it a good bit easier. From that perspective, it’s also no surprise that
Salesforce acquired Mulesoft, one of the leading integration-as-a-
service vendors.
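The “control” point above deserves a concrete sketch: with automation and APIs you can mechanically compare the approved (desired) state of a resource against what actually exists, something a jump server and session recording can’t do. The resource attributes below are made up for illustration:

```python
# Minimal sketch of a "drift check": compare the desired state of a
# resource (what was approved) against what actually exists.
# The attribute names and values are hypothetical; a real tool would
# query a cloud provider's API for the actual state.

def find_drift(desired: dict, actual: dict) -> dict:
    """Return the keys whose actual value differs from the approved one."""
    return {
        key: {"desired": desired[key], "actual": actual.get(key)}
        for key in desired
        if desired[key] != actual.get(key)
    }

approved = {"instance_type": "small", "open_ports": [443], "encrypted": True}
observed = {"instance_type": "small", "open_ports": [443, 22], "encrypted": True}

drift = find_drift(approved, observed)
print(drift)  # only the port list drifted from what was approved
```

Because the check is just code, it can run continuously, which is the tighter control that manual approval workflows can’t provide.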
Many enterprises who moved to the cloud have found that not all their
expectations have been met, or at least not as quickly as they would
have liked. Although an unclear strategy or inflated expectations can be
culprits, in many cases the problem lies closer to home. The migration
journey deprived the enterprise of exactly those great properties that the
cloud was going to bring them.
• Onboarding Process
• Hybrid Cloud
• Virtual Private Cloud
• Legacy Applications
• Cost Recovery
Onboarding
Enterprises have special requirements for cloud accounts that differ from
start-ups and consumers:
Most of these steps are necessary to connect the cloud model to existing
procurement and billing processes, something that enterprises can’t just
abandon overnight. However, they typically lead to a semi-manual sign-
up process for project teams to “get to the cloud”. Likely, someone has to
approve the request, link it to a project budget, and define spend limits. Also,
some enterprises have restrictions on which cloud providers can be used,
sometimes depending on the kind of workload.
Cloud developers might need to conduct additional steps, such as config-
uring firewalls so that they are able to access cloud services from within
the corporate network. Many enterprises will require developer machines
to be registered with device management and be subjected to endpoint
security scans (aka “corporate spyware”).
Don’t Build an Enterprise Non-Cloud! 111
Hybrid Network
For enterprises, Hybrid Cloud is a reality because not all applications can
be migrated overnight. This means that applications running in the cloud
communicate with those on premises, usually over a cloud interconnect
that links the virtual private cloud (VPC) with the existing on-premises
network, making the VPC look like an extension of the on-premises
network.
Enterprises aren’t going to want all their applications to face the internet
and many also want to be able to choose IP address ranges and connect
servers with on-premises services. Many enterprises are also not too keen
to share servers with their cloud tenant neighbors. Others yet are limited
to physical servers by existing licensing agreements. Most cloud providers
can accommodate this request, for example with dedicated instances¹ or
dedicated hosts (e.g. AWS² or Azure³).
Cost Recovery
Enterprise IT typically needs to recover its cloud spending from IT’s
internal customers. It’s often difficult to allocate these costs on a per-
service or per-instance basis, so IT often adds an “overhead” charge to the
existing cloud charges, which appears reasonable.
There may be additional fixed costs levied per business unit or per project
team, such as common infrastructure, aforementioned VPNs, jump hosts,
firewalls, and much more. As a result, internal customers pay a base fee
on top of the cloud service fee.
Remembering NIST
The US Department of Commerce’s National Institute of Standards and
Technology (NIST) published a very useful definition of cloud computing
in 2011 (PDF download⁴). It used to be cited quite a bit, but I haven’t seen
it mentioned very much lately - perhaps everyone knows by now what
cloud is and the ones who don’t are too embarrassed to ask. The document
defines five major capabilities for cloud (edited for brevity):
On-demand self-service
A consumer can unilaterally provision computing capabilities, such
as server time and network storage, as needed automatically with-
out requiring human interaction.
Broad network access
Capabilities are available over the network and accessed through
standard mechanisms.
Resource pooling
The provider’s computing resources are pooled to serve multiple
consumers using a multi-tenant model, with different physical and
virtual resources dynamically assigned and reassigned according to
demand.
Rapid elasticity
Capabilities can be elastically provisioned and released to scale
rapidly outward and inward with demand.
Measured service
Cloud systems automatically control and optimize resource use by
leveraging a metering capability (typically pay-per-use).
So, after going back to the fundamental definition of what a cloud is, you
might start to feel that something doesn’t 100% line up. And you’re spot
on!
⁴https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-145.pdf
That’s bad news as it means that despite all good intentions your enter-
prise didn’t get a cloud! It got yet another good ole corporate data center,
which is surely not what it was looking for.
What Now?
So, how do you make sure your enterprise cloud remains deserving of the
label? While there’s no three-step recipe, a few considerations can help:
Calibrate Expectations
Realization is the first step to improvement. So, being aware of these traps
is a major first step towards not falling into them, or at least not falling
in unknowingly. It also behooves us to temper rosy visions of cost savings and
digital transformation. Moving all your old junk to a new house means
you’ll be living with the same junk just in a fancier environment. Likewise,
carrying your enterprise baggage into the cloud won’t transform anything.
Measurable Goals
A cloud migration without clear measurable goals risks getting off track as
IT departments become lost in all the shiny new technology toys. Instead,
be clear why you’re going to the cloud: to save cost, to improve uptime,
to launch new products faster, to secure your data, to scale more easily,
to sound more modern - there are many reasons for going to the cloud.
Prioritizing and measuring progress is more likely to keep you on track.
Segmentation
Enterprise IT loves harmonization but one cloud size won’t fit all applica-
tions. Some applications don’t need to undergo all firewall-compartment-
peering-configuration-review-and-approval steps. Perhaps some applica-
tions, for example simple ones not holding customer data, can just go
straight to the cloud, just as long as billing isn’t on Jamie’s credit card.
More traps
Cloud migrations navigate treacherous waters. Many enterprises fall
into the trap of wanting to avoid lock-in at all cost and look to achieve
that by not using cloud-provider-managed services, because most
of them are proprietary. This means no DynamoDB, Athena, AWS Secrets
Manager, SQS, BigQuery, Spanner, Rekognition, Cloud AutoML, and so
on. Now, you may still have a cloud, but one that predates the NIST
definition from 2011. Also not the place where you want to be. If you
embrace cloud, you should embrace managed services.
Happy Clouds
When enterprises embark on a cloud journey, they often focus on the great
new things they will get. But equally important is leaving some of your
enterprise baggage behind.
15. Cloud Migration per
Pythagoras
Time to dig out your geometry books.
Choosing the right path to the cloud isn’t always easy. Just moving all
the old junk into the cloud is more likely to yield you another data center
than a cloud transformation whereas re-architecting every last application
will take a long time and not be cost effective. So, you’ll need to think
a bit more carefully about what to do with your application portfolio.
Decision models help you make conscious decisions and sometimes even
third-grade geometry lessons can help you understand the trade-offs.
While the model plots simple axes for “up” / “modernize” and “out” /
“migrate”, each direction is made up of multiple facets and gradations:
Cloud Migration per Pythagoras 118
Up
Moving an application “up the stack” can occur along multiple dimen-
sions:
Out
Migration Triangles
Although the title plays off the tough career management principle used
by some consulting firms (“if you can’t keep moving up, you have to
get out”), in our case up-and-out or out-and-up is definitely allowed and
actually encouraged.
Using the up-or-out plane to plot the migration path makes it apparent that
migrations usually aren’t a one-shot type of endeavor. So, rather than just
putting applications into a simple bucket like “re-host” or “re-architect”,
the visual model invites you to plot a multi-step path for your applications,
highlighting that a cloud journey is indeed a journey and not a single
decision point.
³https://cloud.google.com/blog/products/application-development/5-principles-for-
cloud-native-architecture-what-it-is-and-how-to-master-it
Remember Pythagoras?
Once you plot different sequences of migration options on sheets of paper,
you might feel like you’re back in trigonometry class. And that thought is
actually not as far fetched as it might seem. One of the most basic formulas
you might remember from way back when is Pythagoras’ Theorem. He
concluded some millennia ago that for a right triangle (one that has one
right angle), the sum of the squares of the sides equals the square of the
hypotenuse: c² = a² + b².
Pythagoras’ Theorem
Without reaching too deep into high school math, this definition has a few
simple consequences⁶:
Let’s see how these properties help you plan your cloud migration.
⁴https://en.wikipedia.org/wiki/Charles_Joseph_Minard
⁵https://www.edwardtufte.com/tufte/posters
⁶These properties apply to all triangles, but in our up-or-out plane we deal with right
triangles, so it’s nice to be able to fall back on Pythagoras.
However, migrating first and then re-architecting makes for a longer path
than taking the hypotenuse. Hence it might be considered less “efficient”
- spelled with air quotes here, as efficiency can also be a dangerous goal.
The same holds true for re-architecting first and then migrating out - in
either case the sum of the sides is larger than the hypotenuse:
However, if generating value is your key metric and you factor in the cost
of delay, this path might still be the most beneficial for your business.
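Put in numbers, the two-step path is always longer than the diagonal. A quick sketch, with made-up effort units for “out” and “up”:

```python
import math

# Illustrative effort units (made up): "out" = migration effort,
# "up" = modernization effort for one application.
out, up = 4.0, 3.0

two_step = out + up             # migrate first, then re-architect
diagonal = math.hypot(out, up)  # re-architect while migrating (hypotenuse)

print(two_step)  # 7.0
print(diagonal)  # 5.0 - the classic 3-4-5 right triangle
```

The triangle inequality guarantees the diagonal is shorter, but, as the text notes, the shortest path isn’t automatically the best one once value delivery and the cost of delay enter the picture.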
Replatform [A]
Shifting the application as is, but changing the operating system or
middleware (the platform the application sits on), usually motivated
by licensing restrictions or outdated operating systems not available
in the cloud.
Repurchase/Replace [A/G/M]
Replacing the application with another commercially available (or
open source) one, preferably in a SaaS model.
Refactor / Rearchitect [A/G/M]
Completely redoing the application’s deployment architecture, e.g.
by breaking a monolith into microservices, replacing one-off com-
ponents with cloud services, or extracting state from services into
a NoSQL data store.
Revise [G]
Updating legacy applications to prepare them for a subsequent
rehost.
Rebuild [G/M]
Perhaps the more drastic version of refactor, implying that the
application is rebuilt from scratch. Just like with a house, there is
likely a gray area between remodeling and rebuilding.
Retire [A]
Getting rid of the application. This can happen when sweeping
the application portfolio to prepare for a cloud migration reveals
unused or orphaned applications.
Retain [A]
Keeping the application where it is, as it is, on premises.
Or, if you prefer to pivot this matrix, here is the list of “R”s I could extract
from each of the above online resources:
So, while the “R”s are undeniably catchy, the semantics behind them are
fuzzy, and it’ll be difficult to lock them down in any case.
Decision Models
Sketching paths on a simple chart admittedly also has fuzzy semantics
(“up” can mean rearchitecting completely or perhaps just refactoring or
automating) but the fact that the model is abstract conveys this fact
more clearly than a set of seemingly precise words. Sketches also make
communicating your strategy to a wide range of stakeholders easier than
reciting lengthy variations of verbs starting with “R”.
Cloud Architecture
Unfortunately, many vendor certifications fuel the notion that cloud archi-
tecture is mostly about selecting services and memorizing the respective
features. It feels to me a bit like becoming a certified Lego artist by
managing to recite all colors and shapes of Lego bricks (do they make
a blue 1x7?).
So, let’s look under the covers and take a true architect’s point of view.
Yes, this will include popular concepts like multi-hybrid and hybrid-
multi-cloud, but perhaps not in the way it’s described in the marketing
brochures. There is no “best” architecture, just the one that’s most suit-
able for your situation and your objectives. Hence, defining your cloud
architecture requires a fair amount of thinking - something that you
definitely should not outsource¹⁰. That thinking is helped by decision
models presented in this chapter.
Decision Models
One could fill a whole book on cloud architecture, and several people have
done so (I was lucky enough to write the foreword for Cloud Computing
Patterns¹²). Cloud service providers have added more architecture guid-
ance (e.g. Azure maintains a nice site with cloud architecture patterns¹³).
This part is not looking to repeat what’s already been written. Rather, it
invites you to pick a novel point of view that shuns the buzzwords and fo-
cuses on meaningful decisions. To help you become a better informed and
more disciplined decision maker, several decision models and frameworks
guide you through major decision points along your cloud journey:
¹²Fehling, Leymann, Retter, Schupeck, Arbiter, Cloud Computing Patterns, Springer 2014
¹³https://docs.microsoft.com/en-us/azure/architecture/patterns/
16. Multi-cloud: You’ve Got
Options
But options don’t come for free.
Multi-Hybrid Split
The core premise of a multi-hybrid cloud approach sounds appealing
enough: your workloads can move from your premises to the cloud and
back and even between different clouds whenever needed; and all that
ostensibly with not much more than the push of a button. Architects are
born skeptics and thus inclined (and paid) to take a look under the covers
to better understand the constraints, costs, and benefits of such solutions.
The first step in dissecting the buzzwords is to split the multi-hybrid
combo-buzzword into two, separating hybrid from multi. Each has dif-
ferent driving forces behind it, so it’s best to try simple definitions:
• Hybrid: running workloads both in the cloud and on premises, where
the piece in the cloud and the piece remaining on-premises interact
to do something useful.
• Multi: running workloads with more than one cloud provider.
Multi-cloud options
Taking an architecture point of view on multi cloud means identifying
common approaches, the motivations behind each, and the architectural
trade-offs involved. To get there, it’s best to first take a step back from
the technical platform and examine common scenarios and the value
they yield. I have observed (and participated in) quite a few different
setups that would fall under the general label of “multi-cloud”. Based on
my experience they can be broken down into the following five distinct
scenarios:
1. Arbitrary: Workloads are in more than one cloud but for no partic-
ular reason.
2. Segmented: Different clouds are used for different purposes.
3. Choice: Projects (or business units) have a choice of cloud provider.
4. Parallel: Single applications are deployed to multiple clouds, for
example to achieve higher levels of availability.
5. Portable: Workloads can be moved between clouds at will.
A higher number in this list isn’t necessarily better - each option has its
advantages and limitations. Rather, it’s about finding the approach that
Multi-cloud: You’ve Got Options 132
best suits your needs and making a conscious choice. The biggest mistake
could be choosing an option that provides capabilities that aren’t needed.
Because each option has a cost, as we will soon see.
Breaking down multi cloud into multiple flavors that each identify the
drivers and benefits is also a nice example of how elevator architects see
shades of gray where many others only see black or white. Referring
to five distinct options and elaborating them with simple vocabulary
enables an in-depth conversation that largely avoids technical jargon and
is therefore suitable to a wide audience. That’s what the Architect Elevator
is all about.
Multi-Cloud Scenarios
Let’s look at each of the five ways of doing multi-cloud individually, with
a particularly keen eye on the key capabilities it brings and what things
we should watch out for. At the end we’ll summarize what we learned in
a decision table.
Arbitrary
If enterprise IT has taught us one thing, it’s likely that reality rarely lives up
to the slide decks. Applying this line of reasoning (and the usual dosage
of cynicism) to multi-cloud, we find that a huge percentage of enterprise
multi-cloud isn’t the result of divine architectural foresight but simply
poor governance and excessive vendor influence.
This flavor of multi-cloud means that you have some workloads running
in the orange cloud, some others in the light blue cloud, and a few more
under the rainbow. You don’t have much of an idea why things are in one
cloud or the other. In many cases enterprises started with orange, then
received a huge service credit from light blue thanks to existing license
agreements, and finally some of the cool kids loved the rainbow stuff so
much that they ignored the corporate standard. And if you look carefully,
you may see some red peeking in due to existing relationships.
Strategy isn’t exactly the word to be used for this type of multi-cloud
setup. It’s not all bad, though: at least you are deploying something to the
cloud! That’s a good thing because before you can steer you first have
to move². So, at least you’re moving. You’re gathering experience and
building skills with multiple technology platforms, which you can use to
settle on the provider that best meets your needs. So, while arbitrary isn’t
a viable target picture, it’s a common starting point.
Segmented
When pursuing this approach, it’s helpful to understand the seams be-
tween your applications so you don’t incur excessive egress charges
because half your application ends up on the left and the other half on
the right. Also, keep in mind that vendor capabilities are rapidly shifting,
especially in segments like machine learning or artificial intelligence.
Snapshot comparisons therefore aren’t particularly meaningful and may
unwittingly lead you into this scenario, only to find out six months later
that your preferred vendor offers comparable functionality.
Vendor pushes can make you slip from segmented back into
arbitrary.
Choice
Many might not consider the first two examples as true multi cloud. What
they are looking for (and pitching) is being able to deploy your workloads
freely across cloud providers, thus minimizing lock-in (or the perception
thereof), usually by means of building abstraction layers or governance
frameworks. Architects look behind this ploy by considering the finality
of the cloud decision. For example, should you be able to change your mind
after your initial choice and, if so, how easy do you expect the switch to
be?
The least complex and most common case is giving your developers an
initial choice of cloud provider but not expecting them to keep changing
their mind. This choice scenario is common in large organizations with
shared IT service units. Central IT is generally expected to support a wide
range of business units and their respective IT preferences. Freedom of
choice might also result from the desire to remain neutral, such as in the
public sector, or from a regulatory guideline to avoid placing “all eggs in
one basket”, for example in financial services or similarly critical industries.
A choice setup typically has central IT manage the commercial rela-
tionships with the cloud providers. Some IT departments also develop
a common tool set to create cloud provider account instances to ensure
central spend tracking and corporate governance.
The advantage of this setup is that projects are free to use proprietary
cloud services, such as managed databases, if it matches their preferred
trade-off between avoiding lock-in and minimizing operational overhead.
As a result, business units get an unencumbered, dare I say native, cloud
experience. Hence, this setup makes a good initial step for multi-cloud.
Parallel
While the previous option gives you a choice among cloud service
providers, you are still bound by the service level of a single provider.
Many enterprises are looking to deploy critical applications across multi-
ple clouds in their pursuit to achieve higher levels of availability than they
could achieve with a single provider.
Being able to deploy the same application in parallel to multiple clouds
requires a certain degree of decoupling from the cloud provider’s proprietary
features. This can be achieved in a number of ways, for example:
While absorbing differences inside your code base might sound kludgy,
it’s what ORM frameworks have been successfully doing for relational
databases for over a decade.
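One way to achieve such decoupling, analogous to what ORMs do for databases, is a thin, application-owned abstraction with one adapter per cloud. The interface below is a hypothetical sketch; real adapters would wrap each provider’s SDK:

```python
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """Thin, application-owned abstraction over provider storage.
    Hypothetical interface for illustration, not a real library."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(ObjectStore):
    """Stand-in adapter; a real one would wrap a cloud provider's SDK."""
    def __init__(self):
        self._objects: dict[str, bytes] = {}
    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data
    def get(self, key: str) -> bytes:
        return self._objects[key]

# Application code depends only on ObjectStore, so swapping the adapter
# swaps the cloud without touching business logic.
def archive_order(store: ObjectStore, order_id: str, payload: bytes) -> None:
    store.put(f"orders/{order_id}", payload)

store = InMemoryStore()
archive_order(store, "42", b"example payload")
```

The price, as with ORMs, is that the abstraction tends to expose only the features common to all providers.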
The critical aspect to watch out for is complexity, which can easily undo
the anticipated uptime gain. Additional layers of abstraction and more
tooling also increase the chance of a misconfiguration, which causes
unplanned downtime. I have seen vendors suggesting designs that deploy
across each vendor’s three availability zones, plus a disaster recovery
environment in each, times three cloud providers. With each component
occupying 3 * 2 * 3 = 18 nodes, I’d be skeptical whether this amount of
machinery really gives you higher availability than using 9 nodes (one
per zone and per cloud provider).
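A back-of-the-envelope calculation shows both the promise and the caveat. The availability figures are illustrative, not vendor SLAs:

```python
def parallel_availability(*avail: float) -> float:
    """Availability of an active-active deployment across clouds whose
    failures are assumed to be independent: it's down only when all are."""
    p_all_down = 1.0
    for a in avail:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down

# One provider at "three nines" vs. two independent providers:
dual = parallel_availability(0.999, 0.999)
print(round(dual, 6))  # 0.999999, roughly "six nines"

# The gain only materializes if failures are truly independent: a shared
# misconfiguration or a bug deployed to both clouds takes both deployments
# down at once and erases the benefit.
```

The formula explains why added machinery is so dangerous here: every shared component re-couples the failure modes and quietly invalidates the independence assumption behind the calculation.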
Second, seeking harmonization across both deployments may not be
what’s actually desired. The higher the degree of commonality across
clouds, the higher the chance that you’ll deploy a broken application
or encounter issues on both clouds, undoing the resilience benefit. The
extreme example is space probes or similar systems that require extreme
reliability: they use two separate teams to avoid any form of commonality.
So, when designing for availability, keep in mind that the cloud provider’s
platform isn’t the only outage scenario - human error and application soft-
ware issues (bugs or run-time issues such as memory leaks, overflowing
queues etc.) can be a bigger contributor to outages.
Portable
Experienced architects know all too well that polishing objects to become
ever more shiny comes at a
cost. Dollar cost is the apparent concern, but you also need to factor in
additional complexity, having to manage multiple vendors, finding the
right skill set, and long-term viability (will we ditch all this container stuff
and go serverless?). Those factors can’t be solved with money.
It’s therefore paramount to understand and clearly communicate your pri-
mary objective. Are you looking at multi-cloud so you can better negotiate
with vendors, to increase your availability, or to support deploying in
regions where only one provider or the other may have a data center?
Remember that “avoiding lock-in” is only a meta-goal, which, while
architecturally desirable, needs to be justified by a tangible benefit.
Choosing Wisely
The following table summarizes the multi-cloud choices, their main
drivers, and the side-effects to watch out for:
And once you split workloads across cloud and on premises, more likely
than not, those two pieces will need to interact, as captured in our
definition from the previous chapter:
Accordingly, the core of a hybrid cloud strategy isn’t when but how,
specifically how to slice, meaning which workloads should move out into
the cloud and which ones remain on premises. That decision calls for some
architectural thinking, aided by taking the Architect Elevator a few floors
down.
First, we want to sharpen our view of what makes a “hybrid”. Having some
workloads in the cloud while others remain on your premises sadly means
that you have just added yet another computing environment to your
existing portfolio - not a great result.
The difficult part in making a hybrid car wasn’t sticking a battery and
an electric motor into a petrol-powered car. Getting the two systems
to work seamlessly and harmoniously was the critical innovation.
That’s what made the Toyota Prius a true hybrid and such a success.
Because the same is true for hybrid clouds, we’ll augment our definition:
With this clarification, let’s come back to cloud strategy and see what
options we have for slicing workloads.
• Lower Risk: Lastly, the list helps you consider applicability, pros,
and cons of each option consistently. It can also tell you what
to watch out for when selecting a specific option.
Therefore, our decision model once again helps us make better decisions
and communicate them such that everyone understands which route we
take and why.
This list’s main purpose isn’t so much to find new options that no one has
ever thought of - you might have seen more. Rather, it’s a collection
of well-known options that makes for a useful sanity check. Let’s have a
look at each approach in more detail.
Hybrid Cloud: Slicing the Elephant 145
Separating by tier
Although the excitement of “two speed IT” wore off, this approach can
still be a reasonable intermediary step. Just don’t mistake it for a long-
term strategy.
Separating by generation
Closely related is the split by new vs. old. As explained above, newer
systems are more likely to feel at home in the cloud because they are
generally designed to be scalable and independently deployable. They’re
also usually smaller and hence benefit more from lean cloud platforms
such as serverless.
Again, several reasons make this type of split compelling:
• Modern components are more likely to use modern tool chains and
architectures, such as microservices, continuous integration (CI),
automated tests and deployment, etc. Hence they can take better
advantage of cloud platforms.
• Splitting by new vs old may not align well with the existing systems
architecture, meaning it isn’t hitting a good seam. For example,
splitting highly cohesive components across data center boundaries
will likely result in poor performance or increased traffic cost.
• Unless you are in the process of replacing all components over time,
this strategy doesn’t lead to a “100% in the cloud” outcome.
• “New” is a rather fuzzy term, so you’d want to precisely define the
split for your workloads.
Separating by criticality
Many enterprises prefer to first dip their toe into the cloud water before
going all out (or “all in” as the providers would see it). They are likely
to try the cloud with a few simple applications that can accommodate
the initial learning curve and the inevitable mistakes. “Failing fast” and
learning along the way makes sense:
Overall it’s a good “dip your toe into the water” tactic that certainly beats
a “let’s wait until all this settles” approach.
Separating by lifecycle
There are many ways to slice the proverbial elephant (eggplant for
vegetarians like me). Rather than just considering splits across run-
time components, a whole different approach is to split by application
lifecycle. For example, you can run your tool chain, test, and staging
environments in the cloud while keeping production on premises, perhaps
because regulatory constraints require you to do so.
Shifting build and test environments into the cloud is a good idea for
several reasons:
This option may be useful for development teams that are restricted from
running production workloads in the cloud but still want to utilize the
cloud.
Not all your data is accessed by applications all the time. In fact, a vast
majority of enterprise data is “cold”, meaning it’s rarely accessed. Cloud
providers offer appealing options for such data, for example historical
records or back-ups. Amazon’s Glacier⁷ was one of the earliest
offerings to specifically target that use case. Other providers have special
⁶https://en.wikipedia.org/wiki/Equifax#May%E2%80%93July_2017_data_breach
⁷https://aws.amazon.com/glacier/
archive tiers for their storage service, e.g. Azure Storage Archive Tier⁸ and
GCP Coldline Cloud Storage ⁹.
Archiving data in the cloud can make good sense, but watch out for a few
caveats:
• Data recovery costs can be high, so you’d only want to move data
that you rarely need.
• The data you’d want to back up likely contains customer or other
proprietary data, so you might need to encrypt or otherwise protect
that data, which can interfere with optimizations such as data de-
duplication.
Still, using cloud backup is a good way for enterprises to quickly start
benefiting from the cloud.
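As a concrete illustration, object stores can move aging data into an archive tier automatically via a lifecycle rule. This sketch uses AWS S3’s lifecycle configuration format; the prefix and the 90-day threshold are made-up examples:

```json
{
  "Rules": [
    {
      "ID": "archive-old-records",
      "Status": "Enabled",
      "Filter": { "Prefix": "records/" },
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
```

With such a rule in place, archiving happens without any application changes, which is what makes this slice of the hybrid elephant so easy to carve off.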
⁸https://azure.microsoft.com/en-us/services/storage/archive/
⁹https://cloud.google.com/storage/docs/storage-classes
Lastly, one can split workloads by the nature of the system’s operational
state. The most apparent division in operational state is whether things are
going well (“business as usual”) or whether the main compute resources
are unavailable (“disaster”). While you or your regulator may have a
strong preference for running systems on premises, during a data center
outage you might find a system running in the cloud preferable over none
running at all. This reasoning can apply, for example, to critical systems
like payment infrastructure.
This approach to slicing workload is slightly different from the others as
the same workload would be able to run in both environments, but under
different circumstances:
However, life isn’t quite that simple. There are drawbacks as well:
• To run emergency operations from the cloud, you need to have your
data synchronized into a location that’s both accessible from the
cloud and unaffected by the outage scenario that you are planning
for. More likely than not, that location would be the cloud, at which
point you should consider running the system from the cloud in any
case.
• Using the cloud just for emergency situations deprives you of the
benefits of operating in the cloud in the vast majority of the cases.
• You might not be able to spin up the cloud environment on a
moment’s notice. In that case you’d need to have the environment
on standby, which will rack up charges with little benefit.
So, this approach may be suitable mainly for compute tasks that don’t
require a lot of data. For example, for an on-premises e-commerce site
that faces an outage you could allow customers to place orders by picking
from a (public) catalog and storing orders until the main system comes
back up. It’ll likely beat a cute HTTP 500 page.
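The order-capture fallback described above boils down to store-and-forward: accept and persist orders while the main system is down, then replay them when it returns. Here's a minimal Python sketch of that pattern; the class and method names are illustrative, not taken from any particular product.

```python
import time
from collections import deque

class OrderFallback:
    """Store-and-forward: capture orders in the cloud while the main
    system is down, replay them once it is reachable again."""

    def __init__(self):
        self.pending = deque()  # orders captured during the outage

    def place_order(self, customer_id, items):
        # During the outage we only persist the order; fulfillment
        # happens after the main system recovers.
        self.pending.append(
            {"customer": customer_id, "items": items, "ts": time.time()})
        return len(self.pending)

    def replay(self, submit):
        """Push queued orders into the recovered system, FIFO."""
        replayed = 0
        while self.pending:
            submit(self.pending.popleft())
            replayed += 1
        return replayed

# Usage: queue two orders during the outage, then replay them.
fallback = OrderFallback()
fallback.place_order("c1", ["book"])
fallback.place_order("c2", ["pen", "ink"])
received = []
count = fallback.replay(received.append)
```

In a real deployment the queue would of course live in durable cloud storage rather than in memory, but the behavioral contract is the same.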
Last, but not least, we may not have to conjure an outright emergency
to occasionally shift some workloads into the cloud while keeping them
on premises under normal circumstances. An approach that used to be
frequently cited is bursting into the cloud, meaning you keep a fixed ca-
pacity on premises and temporarily extend into the cloud when additional
capacity is needed.
Again, you gain some desirable benefits:
• The cloud’s elastic billing model is ideal for this case as short bursts
are going to be relatively cheap.
• You retain the comfort of operating on premises for most of the time
and are able to utilize your already paid for hardware.
But there are reasons that people don’t talk about this option quite as much
anymore:
• Again, you need access to data from the cloud instances. That
means you either replicate the data to the cloud, which is likely
what you tried to avoid in the first place, or have your cloud
instances operate on data that's kept on premises, which
likely implies major latency penalties and can incur significant data
access costs.
• You need to have an application architecture that can run on
premises and in the cloud simultaneously under heavy load. That's
no small feat and means significant engineering effort for a rare
case.
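The bursting logic itself is simple; the hard parts are the data access and dual-environment architecture noted above. A toy Python sketch of the routing decision (all names hypothetical) might look like this:

```python
class BurstScheduler:
    """Route jobs to fixed on-premises capacity first; overflow
    'bursts' into the cloud (illustrative, not a real product API)."""

    def __init__(self, onprem_slots):
        self.onprem_slots = onprem_slots
        self.onprem_jobs = []
        self.cloud_jobs = []  # each cloud job incurs elastic billing

    def submit(self, job):
        if len(self.onprem_jobs) < self.onprem_slots:
            self.onprem_jobs.append(job)  # use already-paid-for hardware
            return "on-prem"
        # Capacity exhausted: burst. Hidden cost: the job's data must
        # be reachable from the cloud instance.
        self.cloud_jobs.append(job)
        return "cloud"

sched = BurstScheduler(onprem_slots=2)
placements = [sched.submit(f"job-{i}") for i in range(3)]
# the first two jobs stay on premises; the third bursts into the cloud
```

The one-line placement decision hides exactly the two drawbacks listed above: where the data lives and whether the application can actually run split across both environments.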
Rather than bursting regular workloads into the cloud, this option works
best for dedicated compute-intensive tasks, such as the simulations
regularly run in financial institutions. It's also a great fit for one-time
compute tasks where you rig up machinery in the cloud for a single run.
For example, this worked well for the New York Times digitizing 6 million
photos¹⁰ or for calculating pi to 31 trillion digits¹¹. To be fair, though, these
cases had no on-premises equivalent, so they likely wouldn't qualify as a
hybrid setup.
We just saw how a hybrid cloud setup provides many ways to split
workloads across cloud and on premises. Equally interesting is how
cloud providers make hybrid cloud happen in the first place, i.e. how
they enable a cloud operating model on your premises. Because different
vendors follow different philosophies and choose different approaches,
it’s the architects’ job to look behind the slogans and understand the
assumptions, decisions, and the ramifications that are baked into the
respective approaches. “Doors closing. Elevator going down…”
(The astute reader will have noticed that the diagrams are rotated by 90
degrees from the ones in the previous chapter, now showing cloud and on
premises side-by-side. This is the artistic license I took to make showing
the layers easier.)
The first option isn’t as horrible as it might sound. You could still use
modern software approaches like CI/CD, automated testing, and frequent
releases. In many cases, option #2 is a significant step forward, assuming
you are clear on what you expect to gain from moving to the cloud. Lastly,
if you want to achieve a true hybrid cloud setup, you don’t just enable
cloud capabilities on premises, but also want uniform management and
deployment across the environments.
Which one is the correct target picture for your enterprise depends largely
on how you will split applications. For example, if applications run either
in the cloud or on premises, but are never split half-and-half, then uniform management
may not be as relevant for you. Once again, you must keep in mind
that not every available feature translates into an actual benefit for your
organization.
The Cloud - Now On Your Premises 158
Scale
Most on-premises data centers are two or more orders of magnitude
smaller than so-called hyperscaler data centers. The smaller scale
impacts economics but also capacity management, i.e. how much
hardware can be on stand-by for instant application scaling. For
example, running one-off number crunching jobs on hundreds or
thousands of VMs likely won't work on premises, and even if it did, it'd
be highly uneconomical.
Custom Hardware
Virtually all cloud providers use custom hardware for their data
centers, from custom network switches and custom security mod-
ules to custom data center designs. They can afford to do so based
on their scale. In comparison, on-premises data centers have to make
do with commercial off-the-shelf hardware.
Diversity
On-premises environments have usually grown over time, often
decades, and therefore look like they contain everything including
the proverbial kitchen sink. Cloud data centers are neatly built and
highly homogeneous, making them much easier to manage.
Bandwidth
Large-scale cloud providers have enormous bandwidth and super-
low latency straight from the internet backbone. Corporate data
centers often have thin pipes. Network traffic also typically needs
to make many hops: 10, 20, or 30 milliseconds isn’t uncommon just
to reach a corporate server. By that time, Google’s search engine
has already returned the results.
Geographic reach
Cloud data centers aren’t just bigger, they’re also spread around the
globe and supported by cloud endpoints that give users low-latency
access. Corporate data centers are typically confined to one or two
locations.
Physical access
In your own data center you can usually “touch” the hardware and
conduct manual configurations or replace faulty hardware. Cloud
data centers are highly restricted and usually do not permit access
by customers.
Skills
An on-premises operations team typically has different skill sets
than teams working for a cloud service provider. This is often the
case because on-premises teams are more experienced in traditional
operational approaches. Also, the operational tasks might be outsourced
altogether to a third party, leaving very little skill in house.
With cloud and on-premises data centers containing quite different hard-
ware infrastructures, the abstraction layer absorbs such differences so that
applications “see” the same kind of virtual infrastructure. Once you have
such an abstraction in place, a great benefit is that you can layer additional
higher-level services on top, which will “inherit” the abstraction and
also work across cloud and on-premises, giving this approach a definite
architectural elegance.
¹https://en.wikipedia.org/wiki/David_Wheeler_(computer_scientist)
By far the most popular example of this approach is Kubernetes, the open
source project that has grown from a container orchestrator into something
of a universal deployment, management, and operations platform for any
computing infrastructure, sort of like the JVM, but focused on operational
aspects as opposed to a language run-time.
Advantages
Considerations
Of course, what looks good in theory will also face real-life limitations:
• The abstraction only works for applications that respect it. For
example, Kubernetes assumes your application is designed to op-
erate in a container environment, which may be sub-optimal or not
possible for traditional applications. If you end up having multiple
abstractions for different types of applications, things get a lot more
complicated.
• Using such an abstraction layer likely locks you into a specific prod-
uct. So, while you can perhaps change cloud providers more easily,
changing to a different abstraction or removing it altogether will be
very difficult. For most enterprise users, VMware comes to mind.
An open source abstraction reduces this concern somewhat, but
certainly doesn’t eliminate it. You are still tied to the abstraction’s
APIs and you won’t be able to just fork it and operate it on your
own.
I label this approach the Esperanto⁵ ideal. It’s surely intellectually appeal-
ing - wouldn’t it be nice if we all spoke the same language? However, just
like Esperanto, broad adoption is unlikely because it requires each person
to learn yet another language and they won’t be able to express themselves
as well as in their native language. And sufficiently many people speak
English, anyway.
Once you have a carbon copy of your cloud on premises, you can shift
applications back and forth or perhaps even split them across cloud and
on-premises without much additional complexity. You can then connect
the on-premises cloud to the original cloud’s control plane and manage it
just like any other cloud region. That’s pretty handy.
Different flavors of this option vary in how the cloud will be set up on
your premises. Some solutions are fully integrated as a hardware/software
solution while others are software stacks to be deployed on third party
hardware.
The best-known products following this approach are AWS Outposts⁶,
which delivers custom hardware to your premises, and Azure Stack⁷,
a software solution that’s bundled with certified third party vendors’
hardware. In both cases, you essentially receive a shipment of a rack to
be installed in your data center. Once connected, you have a cloud on
premises! At least that’s what the brochures claim.
Advantages
or the rack might have an optimized power supply that powers all
servers, rendering it a rack-size blade⁸.
• You can get a fully certified and optimized stack that assures that
what runs in the cloud will also run on premises. In contrast, the
abstraction layer approach runs the risk of a disconnect between
the abstraction and the underlying infrastructure.
• You don’t need to do a lot - the cloud comes to you in a fully inte-
grated package that essentially only needs power and networking.
Considerations
Questions to Ask
The good news is that different vendors have packaged the presented
solutions into product offerings. Understanding the relevant decisions
behind each approach and the embedded trade-offs allows you to select
the right solution for your needs. Based on the discussions above, here's
a list of questions you might want to ask vendors offering products based
on these approaches:
⁹https://www.vmware.com/products/vsphere/projectpacific.html
Asking such questions might not make you new friends on the vendor
side, but it’ll surely help you make a more informed decision.
Additional Considerations
Deploying and operating workloads aren't the only elements of a hybrid
cloud. At a minimum, the following dimensions also come into play:
Monitoring
Deployment
Data Synchronization
Plotting a Path
So, once again, there’s a lot more behind a simple buzzword and the
associated products. As each vendor tends to tell the story from their point
of view, which is to be expected, it’s up to us architects to “zoom out” to see
the bigger picture without omitting relevant technical decisions. Making
the trade-offs inherent in the different approaches visible to decision
makers across the whole organization is the architect’s responsibility.
Otherwise, there'll be surprises later on, and one thing we have
learned in IT is that there are no good surprises.
¹⁰https://aws.amazon.com/storagegateway/features
19. Don’t Get Locked-up
Into Avoiding Lock-in.
Options have a price¹.
• First, architects realize that lock-in isn’t binary: you’re always going
to be locked into some things to some degree - and that’s OK.
• Second, you can avoid lock-in by creating options, but those options
have a cost, not only in money, but also in complexity or, worse yet,
new lock-in.
• Lastly, architects need to peek behind common associations, for
example the notion that open source software frees you from lock-
in.
At the same time, there’s hardly an IT department that hasn’t been held
captive by vendors who knew that migrating off their solutions is difficult.
Therefore, wanting to minimize lock-in is an understandable desire. No
one wants to get fooled twice. So, let’s see how to best balance lock-in
when defining a cloud strategy.
Shades of Lock-in
As we have seen, lock-in isn’t a binary property that has you either locked
in or not. Actually, lock-in comes in more flavors than you might have
expected:
Vendor Lock-in
This is the kind that IT folks generally mean when they mention
“lock-in”. It describes the difficulty of switching from one vendor
to a competitor. For example, if migrating from Siebel CRM to
Salesforce CRM or from an IBM DB2 database to an Oracle one will
³https://aws.amazon.com/outposts/
Don’t Get Locked-up Into Avoiding Lock-in. 174
cost you an arm and a leg, you are “locked in”. This type of lock-in
is common as vendors generally (more or less visibly) benefit from
it. It can also come in the form of commercial arrangements, such
as long-term licensing and support agreements that once earned you a
discount on the license fees.
Product Lock-in
Related, but nevertheless distinct, is being locked into a
product. Open source products may avoid vendor lock-in, but
they don’t remove product lock-in: if you are using Kubernetes or
Cassandra, you are certainly locked into a specific product’s APIs,
configurations, and features. If you work in a professional (and
especially enterprise) environment, you will also need commercial
support, which will again lock you into a vendor contract - see
above. Heavy customization, integration points, and proprietary
extensions are forms of product lock-in: they make it difficult to
switch to another product, even if it’s open source.
Version lock-in
You may even be locked into a specific product version. Version
upgrades can be costly if they break existing customizations and
extensions you have built. Some version upgrades might even
require you to rewrite your application - AngularJS vs. Angular
2 comes to mind. You feel this lock-in particularly badly when a
vendor decides to deprecate your version or discontinues the whole
product line, forcing you to choose between being out of support or
doing a major overhaul.
Architecture lock-in
You can also be locked into a specific kind of architecture. Sticking
with the Kubernetes example, you are likely structuring your appli-
cation into services along domain context boundaries, having them
communicate via APIs. If you wanted to migrate to a serverless
platform, you'd need to make your services finer-grained, externalize state
management, and connect them via events, among other
changes. These aren't minor adjustments, but a major overhaul of
your application architecture. That’s going to cost you, so once
again you’re locked in.
Platform lock-in
A special flavor of product lock-in is being locked into a platform,
such as our cloud platforms. Such platforms not only run your
applications, but they also hold your user accounts and associated
So, next time someone speaks about lock-in, you can ask them which of
these eight dimensions they are referring to. While most of our computers
work in a binary system, architecture does not.
Accepted Lock-in
While we typically aren’t looking for more lock-in, we might be inclined,
and well advised, to accept some level of lock-in. Adrian Cockcroft of AWS
is famous for starting his talks by asking who in the audience is worried
You will likely accept some lock-in in return for the benefits
you receive.
Effort
This is the additional work to be done in terms of person-hours. If
you opt to deploy in containers on top of Kubernetes in order to
reduce cloud provider lock-in, this item would include the effort to
learn new concepts, write Docker files, and get the spaces lined up
in your YAML files.
Expense
This is the additional cash expense, e.g. for product or support
licenses, to hire external providers, or to attend KubeCon. It can
also include operational costs if wanting to be portable requires
you to operate your own service instead of utilizing a cloud-provider-
managed one. I have seen cost differences of several orders of
magnitude, for example for secrets management.
Underutilization
This cost is indirect, but can be significant. Reduced lock-in often
comes in the form of an abstraction layer across multiple products
or platforms. However, such a layer may not provide feature parity
with the underlying platforms, either because it's lagging behind or
because certain features won't fit the abstraction. In a bad case, the layer may be
the lowest common denominator across all supported products and
may prevent you from using vendor-specific features. This in turn
means you get less utility out of the software you use and pay for. If
it’s an important platform feature that the abstraction layer doesn’t
support, the cost can be high.
Complexity
Complexity is another often under-estimated cost in IT, but one of
the most severe ones. Adding an abstraction layer increases com-
plexity as there’s yet another component with new concepts and
new constraints. IT routinely suffers from excessive complexity,
which drives up cost and reduces velocity due to new learning
curves. It can also hurt system availability because complex systems
are harder to diagnose and prone to systemic failures. While you
can absorb monetary cost with bigger budgets, reducing excessive
complexity is very hard - it often leads to even more complexity.
It’s so much easier to keep adding stuff than taking it away.
New Lock-ins
Lastly, avoiding one lock-in often comes at the expense of another
one. For example, you may opt to avoid AWS CloudFormation⁴
and instead use HashiCorp's Terraform⁵ or Pulumi⁶, both of which
are great products and support multiple cloud providers. However,
now you are tied to another product from an additional vendor and
need to figure out whether that’s OK for you. Similarly, most multi-
cloud frameworks promise to let you move workloads across cloud
providers but lock you into the framework. You’ll have to decide
which form of lock-in concerns you more.
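The "underutilization" cost described above is easy to see in code: an abstraction layer that must work across several backends can only expose the features all of them share. A small Python sketch with two entirely hypothetical storage providers illustrates the lowest-common-denominator effect:

```python
class ProviderA:
    """Hypothetical provider with an extra, proprietary feature."""
    def __init__(self): self._store, self.rules = {}, []
    def put(self, key, data): self._store[key] = data
    def get(self, key): return self._store[key]
    def set_lifecycle(self, rule): self.rules.append(rule)  # A-only feature

class ProviderB:
    """Hypothetical provider without lifecycle support."""
    def __init__(self): self._store = {}
    def put(self, key, data): self._store[key] = data
    def get(self, key): return self._store[key]

class PortableStorage:
    """Provider-neutral wrapper: exposes only the feature intersection,
    so ProviderA's lifecycle rules become unreachable."""
    def __init__(self, backend): self.backend = backend
    def put(self, key, data): self.backend.put(key, data)
    def get(self, key): return self.backend.get(key)

store = PortableStorage(ProviderA())
store.put("invoice", b"...")
data = store.get("invoice")
has_lifecycle = hasattr(store, "set_lifecycle")  # False: feature lost
```

You pay for ProviderA's richer feature set but can no longer reach it through the portable layer, which is precisely the underutilization cost.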
Optimal Lock-in
With both being locked-in and avoiding it carrying a cost, it’s the archi-
tect’s job to balance the two and perhaps find the sweet spot. We might
even say that this sums up the essence of the architect’s job: dismantle the
buzzwords, identify the real drivers, translate them into an economic risk
model, and then find the best spot for the given business context.
⁷https://www.thoughtworks.com/radar/techniques/generic-cloud-usage
⁸https://landscape.cncf.io/
⁹https://en.wikipedia.org/wiki/Service_Component_Architecture
¹⁰https://architectelevator.com/architecture/failure-doesnt-respect-abstraction/
Sometimes you can find ways to reduce lock-in at low (overall) cost, for
example by using an object-relational mapping (ORM) framework like
Hibernate¹¹ that increases developer productivity and reduces database
vendor lock-in at the same time. Those opportunities don’t require a lot
of deliberation and should just be done. However, other decisions might
deserve a second, or even third look.
To find the optimal degree of lock-in, you need to sum up two cost
elements (see graph):
hold. The second will be the price of the option itself, i.e. how much you
pay for acquiring the option. The fact that a Nobel Prize in Economics¹²
was awarded for the Black-Scholes option pricing model tells us that this
is not a simple equation to solve. However, you can imagine that the price
of the option increases with a reduced strike price: the easier you want
the later switch to be, the more you need to prepare right now.
For us the relevant curve is the red line (labeled “total”), which tracks the
sum of both costs. Without any complex formulas we can easily see that
minimizing switching cost isn’t the most economical choice as it’ll drive
up your up-front investment for diminishing returns.
Let’s take a simple example. Many architects favor not being locked into
a specific cloud provider. How much would it cost to migrate your ap-
plication from the current provider to another one, for example having to
re-code automation scripts, change code that accesses proprietary services,
etc? $100,000? That’s a reasonable chunk of money for a small-to-medium
application. Next, you’ll need to consider the likelihood of needing to
switch clouds over the application’s lifespan. Maybe 5%, or perhaps even
lower? Now, how much would it cost you to bring that switching cost
down to near zero? Perhaps a lot more than the $5,000 ($100,000 x 5%)
you'd expect to save, making it a poor economic choice. Could you
amortize the upfront investment across many applications? Perhaps, but
a significant portion of the costs would also multiply.
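The back-of-the-envelope math in this example is worth writing down, because it generalizes: the expected saving (switching cost times probability of switching) is the ceiling for what you should spend on reducing lock-in. A few lines of Python, using the numbers from the text:

```python
# Numbers taken from the example in the text.
switching_cost = 100_000  # cost to migrate the app to another cloud later
probability = 0.05        # likelihood of ever needing to switch

# The expected saving is the ceiling for any up-front avoidance spend.
expected_saving = switching_cost * probability  # $5,000

def worth_it(upfront_investment):
    """An option to switch only pays off if it costs less than the
    expected switching cost it eliminates."""
    return upfront_investment < expected_saving

cheap_measure = worth_it(2_000)       # e.g. a low-cost ORM-style measure
expensive_measure = worth_it(50_000)  # e.g. a full abstraction layer
```

A real model would also discount for the option's lifespan and uncertainty, but even this crude version shows why driving switching costs to zero is rarely economical.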
peace of mind, but it's not the most economical, and therefore not the most
rational, choice.
commercial offerings that put you at the mercy of the vendor’s product
roadmap (which makes it important to understand the vendor’s product
philosophy - see The IT World is Flat in 37 Things¹³). Although this is
a significant benefit for community-driven projects, many “cloud-scale”
open source projects see roughly half of the contributions from a single
commercial vendor, who is often also the instigator of the project and
a heavy commercial supporter of the affiliated foundation. Having the
development heavily influenced by that vendor’s strategy might work
well for you if it aligns with yours - or not, if it doesn’t. So, it’s up to
you to decide whether you’re happily married / locked-in to that project.
Although many open source products operate across multiple cloud
platforms, most can’t shield us from the specifics of each cloud platform.
This effect is most apparent in automation tools, such as Terraform, which
give us a common language syntax but can't shield us from the many
vendor-specific constructs, i.e. which options are available to configure a
VM. Implementation details thus “leak” through the common open source
frame. As a result, these tools reduce the switching cost from one cloud
to another, for example because you don’t have to learn a new language
syntax, but certainly don’t bring it down to zero. In return you might pay
a price in feature lag between a specific platform and the abstraction layer.
Solving that equation is the architect’s job.
In summary, open source is a great asset for corporate IT. Without it,
we wouldn’t have Linux operating systems, the MySQL database, the
Apache web server, the Hadoop distributed processing framework, and
many other modern software amenities. There are many good reasons to
be using open source software, but it’s the architect’s job to look behind
blanket statements and simplified associations, for example related to
lock-in.
Maneuvering Lock-in
Consciously managing lock-in is a key part of a cloud architect’s role.
Trying to drive potential switching costs down to zero is just as foolish
as completely ignoring the topic. Some simple techniques help us reduce
lock-in for little cost while others just lock us into the next product or
architectural layer. Open standards, especially interface standards such as
open APIs, do reduce lock-in because they limit the blast radius of change.
¹³https://architectelevator.com/book
Lock-in has a bad reputation in the enterprise, and generally rightly so. Almost
any enterprise has been held hostage by a product or vendor in the past.
Cloud platforms are different in several aspects:
1. Most cloud platform vendors aren’t your typical enterprise tool ven-
dors (the one that might have been, has transformed substantially).
2. An application's dependence on a specific cloud platform isn't
trivial, but it's generally smaller than the dependence on complete
enterprise products like ERP or CRM software.
3. A large portion of the software deployed in the cloud will be
custom-developed applications that are likely to have a modular
structure and automated build and test cycles. This makes a poten-
tial platform switch much more digestible than trying to relocate
your Salesforce APEX code or your SAP ABAP scripts.
4. In the past a switch between vendors was often necessitated by
severe feature disparity or price gouging. In the first decade of cloud
computing, providers have repeatedly reduced prices and released
dozens of new features each year, making such a switch less likely.
So, it’s worth taking an architectural point of view on lock-in. Rather than
getting locked up in it, architects need to develop skills for picking locks.
20. The End of
Multi-tenancy?
Cloud makes us revisit past architecture assumptions.
Multi-tenancy
Multi-tenant systems have been a common pattern in enterprise software
for several decades. According to Wikipedia¹:
The key benefit of such a multi-tenant system is the speed with which
a new share or (logical) customer instance can be created. Because all
tenants share a single system instance, no new software instances have
to be deployed when a new customer arrives. Also, operations are more
efficient because common services can be shared across tenants.
That’s where this architectural style derives its name - an apartment
building accommodates multiple tenants so you don’t need to build a new
house for each new tenant. Rather, a single building efficiently houses
many tenants, aided by common infrastructure like hallways and shared
plumbing.
¹https://en.wikipedia.org/wiki/Multitenancy
The End of Multi-tenancy? 186
No Software!
In recent IT history, the company whose success was most directly based
on a multi-tenant architecture was Salesforce. Salesforce employed a
Software-as-a-Service model for its CRM (Customer Relationship Man-
agement) software well before this term gained popularity for cloud
computing platforms. Salesforce operated the software, including updates,
scaling, etc., freeing customers from having to install software on their
premises.
The model, aided by an effective “no software” marketing campaign,
was extremely successful. It capitalized on enterprises’ experience that
software takes a long time to install and causes all sorts of problems
along the way. Instead, Salesforce could sign up new customers almost
instantaneously. Often the software was sold directly to the business
under the (somewhat accurate) claim that it can be done without the IT
department. “Somewhat”, because integrations with existing systems are
a significant aspect of a CRM deployment.
Tenant Challenges
While multi-tenant systems have apparent advantages for customers, they
pose several architectural challenges. To make a single system appear like
a set of individual systems, data, configuration, as well as non-functional
characteristics, such as performance, have to be isolated between the
tenants - people want thick walls in their apartments. This isolation
introduces complexity into the application and the database schema, for
example by having to propagate the tenant’s identity through the code
base and database structure for each operation. Any mistake in this
architecture would run the risk of exposing one tenant’s data to another
- a highly undesired scenario. Imagine finding your neighbor’s groceries
in your fridge, or worse.
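The tenant-ID propagation described above can be sketched in a few lines: in a shared database, every table carries a tenant identifier and every query must filter on it. A minimal illustration using Python's built-in sqlite3 (table and tenant names are made up):

```python
import sqlite3

# Single shared database: every row and every query must carry the
# tenant's identity. Forgetting the WHERE clause even once would
# expose one tenant's data to another.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (tenant_id TEXT, item TEXT)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [("acme", "widget"), ("acme", "gadget"),
                ("globex", "sprocket")])

def orders_for(tenant_id):
    """Tenant isolation has to be enforced in every single query."""
    rows = db.execute(
        "SELECT item FROM orders WHERE tenant_id = ? ORDER BY rowid",
        (tenant_id,))
    return [item for (item,) in rows]

acme_orders = orders_for("acme")      # only acme's rows
globex_orders = orders_for("globex")  # only globex's rows
```

The complexity the text mentions lives in that ever-present `WHERE` clause: it must appear in every query, view, and index, which is exactly the kind of cross-cutting concern that's easy to get wrong.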
Also, scaling a single system for a large number of tenants can quickly
hit limits. So, despite its great properties, multi-tenancy, like most other
architectural decisions, is a matter of trade-offs. You have to manage a
more complex system for the benefit of adding customers quickly and
easily. Past systems architects chose the optimum balance between the two
and succeeded in the market. Now, cloud computing platforms challenge
some of the base assumptions and might lead us to a new optimum. Oddly,
the path to insight leads past waterfowl…
Duck Typing
The object-oriented world has a notion of duck typing², which indicates
that instead of classifying an object by a single, global type hierarchy,
it’s classified by its observable behavior. So, instead of defining a duck as
being a waterfowl, which is a bird, which in turn is part of the animal
kingdom, you define a duck by its behavior: if it quacks like a duck and
walks like a duck, it’s considered a duck.
²https://en.wikipedia.org/wiki/Duck_typing
In software design this means that the internal make-up or type hierarchy
of a class takes a back seat to the object’s observable behavior: if an object
has a method to get the current time, perhaps it’s a Clock object, regardless
of its parent class. Duck typing is concerned with whether an object can
be used for a particular purpose and is less interested in an overarching
ontology. Hence it allows more degrees of freedom in system composition.
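In Python, which popularized the term, duck typing takes only a few lines to demonstrate: the Robot below shares no ancestry with Duck, yet passes the behavioral test.

```python
class Duck:
    def quack(self): return "Quack!"
    def walk(self): return "waddle"

class Robot:
    # No relation to Duck in any type hierarchy...
    def quack(self): return "Quack!"
    def walk(self): return "waddle"

def is_duck_like(thing):
    """Classify by observable behavior, not by ancestry."""
    return hasattr(thing, "quack") and hasattr(thing, "walk")

robot_is_duck = is_duck_like(Robot())  # True: behaves like a duck
duck_is_duck = is_duck_like(Duck())    # True
string_is_duck = is_duck_like("duck")  # False: no quack, no walk
```

Note that `is_duck_like` never inspects a parent class; the object's internal make-up takes a back seat to what it can do.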
Duck Architecture
It’s possible to apply the same notion of typing based on behavior to large-
scale system architecture, giving us a fresh take on architecture concepts
like multi-tenancy. Traditionally, multi-tenancy is considered a structural
property: it describes how the system was constructed, e.g. by segmenting
a single database and configuration for each user, i.e. tenant.
Following the “duck” logic, though, multi-tenancy becomes an observable
property. Much like quacking like a duck, a multi-tenant system’s key
property is its ability to create new and isolated user instances virtually
instantaneously. How that capability is achieved is secondary - that’s up
to the implementer.
The “duck architecture” supports the notion that the sole purpose of a sys-
tem’s structure is to exhibit a desirable set of observable behaviors. Your
customers won't appreciate the beauty of your application architecture
when that application is slow, buggy, or mostly unavailable.
Revisiting Constraints
The goal of a system’s structure is to provide desirable properties. How-
ever, the structure is also influenced by a set of constraints - not everything
can be implemented. Therefore, the essence of architecture is to implement
the desirable properties within the given set of constraints:
time infrastructure. And this very aspect of automation is the one that
allows us to reconsider multi-tenancy as a structural system property.
Thanks to cloud automation, deploying and configuring new system
instances and databases has become much easier and quicker than in
the past. A system can therefore exhibit the properties of a multi-tenant
system without being architected like one. The system would simply
create actual new instances for each tenant instead of hosting all tenants
in a single system. It can thus avoid the complexity of a multi-tenant ar-
chitecture but retain the benefits. So, it behaves like a multi-tenant system
(remember the duck?) without incorporating the structural property.
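This "behave multi-tenant without being multi-tenant" idea can be sketched as automated per-tenant provisioning; everything below is illustrative, standing in for real infrastructure automation:

```python
class SingleTenantStack:
    """One dedicated (virtual) instance plus database per tenant; no
    tenant ID needs to be threaded through the code base."""
    def __init__(self, tenant):
        self.tenant = tenant
        self.data = {}  # this tenant's fully isolated state

def provision(tenant, registry):
    """With automation, creating an instance is near-instant, so the
    system *behaves* multi-tenant without the structural property."""
    registry[tenant] = SingleTenantStack(tenant)
    return registry[tenant]

registry = {}
provision("acme", registry)
provision("globex", registry)
# Isolation comes for free: the stacks share no state at all.
isolated = registry["acme"].data is not registry["globex"].data
```

Compare this with the shared-database sketch earlier in the chapter: here the observable property (fast, isolated tenant creation) is the same, but the isolation risk and the tenant-ID plumbing disappear, traded for the cost of running more instances.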
which is pampered like a favorite pet, possibly even has a cute name, and
is expected to live a long and happy life.
Actually, in my opinion the “pet” metaphor doesn’t quite hold. I have seen
many servers that had a bit of space left being (ab-)used as mail or DHCP
server, development machine, experimentation ground, or whatever else
came to developers’ minds. So, some of those servers, while enjoying a
long life, might resemble a pack mule more than a cute bunny. And, in
case there’s any doubt, Urban Dictionary confirms¹ that mule skinners
and bosses alike take good care of their pack mules.
Automation allows us to reconsider this tendency to prolong a server’s
life. If setting up a new server is fast and pain-free, and tearing one down
means we don’t have to pay for it anymore, then we might not have to
hold on to our favorite server until death do us part. Instead, we are happy
to throw (virtual) servers away and provision new ones as we see fit.
Consistency
If servers are long-lived elements that are individually and manually
maintained, they’re bound to be unique. Each server will have slightly
different software and versions thereof installed, which is a great way of building a test laboratory. It’s not that great for production environments, where security and rapid troubleshooting are important. Unique servers are sometimes referred to as snowflakes, playing off the fact that each one of them is unique.
¹https://www.urbandictionary.com/define.php?term=pack%20mule
Disposing of and re-creating servers vastly increases the chances that servers
are in a consistent state based on common automation scripts or so-called
golden images. It works best if developers and operations staff refrain
from manually updating servers, except perhaps in emergency situations.
In IT jargon, not updating servers makes them immutable and avoids
configuration drift.
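The idea can be sketched in a few lines (illustrative only, no real tooling): servers re-created from one "golden" specification stay consistent, whereas a manually patched server drifts - and the disposability fix is to re-create rather than repair.

```python
# A stand-in for a golden image: one shared, versioned specification.
GOLDEN_IMAGE = {"os": "linux-5.10", "runtime": "python3.11", "agent": "1.4"}


def build_server(name: str) -> dict:
    # Every server starts life from the same automated specification.
    return {"name": name, **GOLDEN_IMAGE}


fleet = [build_server(f"web-{i}") for i in range(3)]

# A "small manual fix" on one box: classic configuration drift.
fleet[0]["runtime"] = "python3.9"


def drifted(servers):
    # Which servers no longer match the golden specification?
    return [s["name"] for s in servers
            if any(s[k] != v for k, v in GOLDEN_IMAGE.items())]


assert drifted(fleet) == ["web-0"]

# The disposability answer: don't patch the snowflake, re-create it.
fleet[0] = build_server("web-0")
assert drifted(fleet) == []
```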
Disposing of and re-creating your systems from scripts has another massive advantage: it ensures that you are always in a well-defined, clean state.
Whereas a developer who “just needed to make that small fix” is annoying,
a malware infection is far more serious. So, disposability makes your
environment not only more consistent, but also consistently clean and
therefore safe.
Transparency
One of the main hindrances in traditional IT is the astonishing lack of
transparency. Answering simple questions like how many servers are provisioned, who owns them, what applications run on them, or what
patch level they are on, typically requires a mini-project that’s guaranteed
to yield a somewhat inaccurate or at least out-of-date answer.
Of course, enterprise software vendors stand by to happily address any
of IT’s ailments big or small, including this one. So, we are sure to find
tools like a CMDB (Configuration Management Database) or other system
management tools in most enterprises. These tools track the inventory
of servers including their vital attributes, often by scanning the estate of
deployed systems. The base assumption behind these tools is that server
provisioning occurs somehow through a convoluted and opaque
process. Therefore, transparency has to be created in retrospect through
additional, and expensive, tooling. It’s a little bit like selling you a magic
hoverboard that allows you to levitate back out of the corner you just
painted yourself into. Cool, for sure, but not really necessary had you just
planned ahead a little.
The New -ility: Disposability 196
Isn’t creating such scripts labor-intensive? Sure, the first ones will have an additional learning curve, but the more you minimize diversity, the more you can reuse the same scripts and thus win twice. Once you factor
in the cost of procuring, installing, and operating a CMDB or similar tools,
you’re bound to have the budget to develop many more scripts than the
tool vendors would want you to believe.
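The underlying shift is worth making explicit: when provisioning happens through scripts, the provisioning records *are* the inventory, so the CMDB questions become simple lookups instead of a scanning project. A minimal sketch (the data model is invented, not a real CMDB product):

```python
inventory = []  # populated as a side effect of provisioning, not by scanning


def provision(owner: str, app: str, patch_level: str) -> dict:
    # Transparency for free: the record is written at creation time.
    server = {"owner": owner, "app": app, "patch_level": patch_level}
    inventory.append(server)
    return server


provision("team-checkout", "cart-service", "2024-06")
provision("team-checkout", "cart-service", "2024-06")
provision("team-search", "indexer", "2024-05")

# Questions that traditionally need a mini-project become one-liners.
assert len(inventory) == 3                                    # how many?
assert {s["owner"] for s in inventory} == {"team-checkout",   # who owns them?
                                           "team-search"}
# Which servers lag behind on patches? (ISO-style dates compare as strings.)
stale = [s["app"] for s in inventory if s["patch_level"] < "2024-06"]
assert stale == ["indexer"]
```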
Less Stress
Lastly, if you want to know whether a team embraced disposability as one
of their guiding principles, there’s a simple test. Ask them what happens
if one of their servers dies (assume no data loss, but the loss of all software and configuration installed on it). The fewer tissue packs are
needed (or expletives yelled) in response to such a scenario, the closer
the team is to accepting disposability as a new -ility.
If their systems are disposable, one of them keeling over isn’t anything
unexpected or dramatic - it becomes routine. Therefore, disposability is
also a stress reducer for the teams. And with all the demands on software
delivery, any reduction in stress is highly welcome.
²In reality, each IT operation also has an environmental cost due to energy consumption
and associated cooling cost. In the big scheme of IT operations, server provisioning is likely
a negligible fraction.
Part V: Building (for)
the Cloud
Because no one wants a server, the cloud should make it easier to build
useful applications. Applications are the part of IT that’s closest to the
customer and hence the essence of a so-called digital transformation.
Application delivery allows organizations to innovate and differentiate.
That’s why rapid feature releases, application security, and reliability
should be key elements of any cloud journey. However, application
ecosystems have also become more complex. Models have helped us make
better decisions when architecting the cloud, so let’s look at a model for
an application-centric view of the cloud.
Applications Differentiate
Once you realize that running software you didn’t build is a bad deal,
you should make it a better deal to build software. Custom software is
what gives organizations a competitive edge, delights their customers,
and is the anchor of innovation. That’s what’s behind the shift from IT
as service provider to IT as competitive differentiator. So, it behooves
everyone in an IT organization to better understand the software delivery
lifecycle and its implications. Discussing software applications typically
starts by separating software delivery, i.e. when software is being made,
from operations, when software is being executed and available to end
users.
The clover leaf appropriately places the application in the center and
represents the essential ecosystem of supporting mechanisms as the four
leaves. Going counter-clockwise, the model contains the following leaves:
Delivery Pipeline
The build, deployment, and configuration mechanisms compile and
package code into deployable units and distribute them to the run-
time platform. A well-oiled delivery pipeline speeds up software
delivery because it constitutes a software system’s first derivative.
The oil comes in form of automation, which enables approaches like
continuous integration (CI) and continuous delivery (CD).
Run-time Platform
Running software needs an underlying environment that manages
aspects like distributing workloads across available compute re-
sources, triggering restarts on failure, and routing network traffic to
the right places. This is the land of virtualization, VMs, containers,
and serverless run-times.
Monitoring / Operations
Once you have applications running, especially those consisting of
many individual components, you’re going to need a good idea
of what your application is doing. For example, is it operating as expected, are there unexpected load spikes, is the infrastructure out of memory, or is CPU consumption significantly increased from the norm?
The Application-centric Cloud 203
The simple model shows that there’s more to a software ecosystem than
just a build pipeline and run-time infrastructure. An application-centric
cloud would want to include all four elements of the clover leaf model.
The vertical axis represents the software “stack” where the lower layers
indicate the run-time “underneath” the application. The communications
channels that connect multiple applications sit on top. So, we find that
the positioning of the leaves has meaning. However, the model still makes
sense without paying attention to this finer detail.
Models like this one that can be viewed at different levels of abstraction
are the ones architects should strive for because they are suited to
communicating with a broad audience across levels of abstraction - the
essence of the Architect Elevator.
Growing Leaves
Building an application platform for cloud, for example by means of an
engineering productivity team, takes time. A common question, therefore,
is in which order to grow the leaves on the clover. Although there is
no universal answer, a common guideline I give is that you need to
grow them somewhat evenly. That’s the way nature does it - you won’t
find a clover that first grows one leaf to full size, then two, then three.
Instead each leaf grows little by little. The same holds true for application
delivery platforms. Having a stellar run-time platform but absolutely no
monitoring likely isn’t a great setup.
If I were pressed for more specific advice, building the platform in counter-clockwise circles around the clover leaf is most natural. First, you’d want
to have software delivery (left) worked out. If you can’t deploy software,
everything else becomes unimportant. If you initially deploy on basic
VMs, that’s perhaps OK, but soon you’d want to enhance your run-time to
become automated, resilient, and easily scalable (bottom). For that you’ll
need better monitoring (right). Communications (top) is also important,
but can probably wait until you’ve established solid operational capabilities
for the other parts (Martin Fowler’s famous article reminds us that you
must be this tall to use microservices²). From there, keep iterating to
continue growing the four petals.
Is this model better than the original? You be the judge - it’s certainly
richer and conveys more in a single picture. But showing more isn’t always
better - the model has become noisier. You could perhaps show the
two versions to different audiences - developers might appreciate the small
petals whereas executives might not.
Defusing buzzwords and positioning them in a logical context is always
helpful. Finding simple models is a great way to accomplish that. Once
they’re done, such models might seem obvious. However, that’s one of
their critical quality attributes: you want your good ideas to look obvious
in hindsight.
23. What Do Containers
Contain?
Metaphors help us reason about complex systems.
Artifact Packaging
Containers package application artifacts based on a Dockerfile, a
specification of the installation steps needed to run the application. The Dockerfile would for example include the installation of
libraries that the application depends on. From this specification,
Docker builds an image file that contains all required dependencies.
Image Inheritance
A particularly clever feature that Docker included in the generation
of the container image files is the notion of a base image. Let’s
say you want to package a simple Node application into a Docker
image. You’d likely need the run-time, libraries, npm and a few other
items installed for your application to run. Instead of having to
figure out exactly which files are needed, you can simply reference
a Node base image that already contains all the necessary files. All
you need to do is add a single line to your Dockerfile, e.g. FROM
node:boron.
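Putting these pieces together, a minimal Dockerfile for such a Node application might look like the sketch below. The base image tag is the one from the text; the file names (`package.json`, `server.js`) are assumptions about the application, not prescribed by Docker.

```dockerfile
# Base image that already contains Node, npm, and their dependencies
FROM node:boron

WORKDIR /app

# Install the application's own library dependencies
COPY package.json .
RUN npm install

# Add the application code itself
COPY . .

# Command executed when a container is started from this image
CMD ["node", "server.js"]
```

Running `docker build` against this specification produces the layered image file described above.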
Lightweight Virtualization
Lastly, when talking about containers, folks will often refer to
cgroups², a Linux mechanism for process isolation. cgroups support lightweight virtualization by allowing each container to behave as if it had its own system environment, such as its own file system.
²https://en.wikipedia.org/wiki/Cgroups
The term container thus combines three important core concepts. When
folks mention the word “container”, they might well refer to different
aspects, leading to confusion. It also helps us understand why containers
revolutionized the way modern applications are deployed and executed:
the concept of modern containers spans all the way from building to
deploying and running software.
Container Benefits
My Architect Elevator Workshops³ include an exercise that asks participants to develop a narrative to explain the benefits of using containers
to senior executives. Each group pretends to be talking to a different
stakeholder, including the CIO, the CFO, the CMO (Chief Marketing
Officer), and the CEO. They each get two minutes to deliver their pitch.
The exercise highlights the importance of avoiding jargon and building a
ramp⁴ for the audience. It also highlights the power of metaphors. Relating
to something known, such as a shipping container, can make it much
easier for the audience to understand and remember the benefits of Docker
containers.
Having run this exercise many times, we distilled a few nice analogies that
utilize the shipping container metaphor to explain the great things about
Docker containers:
If you load your goods into a container and seal the doors,
nothing goes missing while en route (as long as the container
doesn’t go overboard, which does occasionally happen).
The same is true for Docker containers. Instead of the usual
semi-manual deployment that is bound to be missing a
³https://architectelevator.com/workshops
⁴https://architectelevator.com/book
So, it turns out that the shipping container analogy holds pretty well and
helps us explain why containers are such a big deal for cloud computing.
The following table summarizes the analogies:
Thin Walls
Shipping containers provide pretty good isolation from one another - you
don’t care much what else is shipping along with your container. However,
the isolation isn’t quite as strong as placing goods on separate vessels. For
example, pictures circulating on the internet show a toppling container
pulling the whole stack with it overboard. The same is true for compute
containers: a hypervisor provides stronger isolation than containers.
Some applications might not be happy with this lower level of isolation.
It might help to remember that one of the largest container use cases and the most popular container orchestrator originated from an environment
where all applications were built from a single source code repository and
highly standardized libraries, vastly reducing the need for isolation from
“bad” tenants. This might not be the environment that you’re operating in,
⁷https://k3s.io/
⁸https://www.raspberrypi.org/products/
⁹Late-model Raspberry Pi’s have compute power similar to Linux computers from just
a decade ago, so they might not be a perfect example for embedded systems that operate
under power and memory constraints.
so once again you shouldn’t copy-paste others’ approaches but first seek
to understand the trade-offs behind them.
It’s OK if not everything runs in a container - that doesn’t make containers useless. Likewise, although many things can be made to run in a container,
that doesn’t mean that it’s the best approach for all use cases.
Border Control
If containers are opaque, meaning that you can’t easily look inside, the
question arises whether containers are the perfect vehicle for smuggling
prohibited goods in and out of a country¹⁰. Customs offices deal with this in several ways:
¹⁰Recent world news taught us that a musical instrument case might be a suitable vehicle
to smuggle yourself out of a country.
• Each container must have a signed-off manifest. I’d expect that the
freight’s origin and the shipper’s reputation are a consideration for
import clearance.
• Customs may randomly select your container for inspection to
detect any discrepancies between manifest and actual contents.
• Most ports have powerful X-ray machines that can have a good
look inside a container.
Also, Docker was open, open source actually, something that can’t be said
of many traditional operations tools. Lastly, it provided a clean interface
¹¹https://www.thoughtworks.com/radar/techniques/container-security-scanning
Looking for a bit more depth, Mike Roberts’s authoritative article on serverless architectures³ from 2018 is a great starting point. He distinguishes serverless third-party services from FaaS - Function as a Service, which
allows you to deploy your own “serverless” applications. Focusing on the
latter, those are applications…
That’s a lot of important stuff in one sentence, so let’s unpack this a bit:
Stateless
I split the first property into two parts. In a curious analogy to
being “server-less”, statelessness is also a matter of perspective.
As I mused some 15 years ago⁴, almost every modern compute
architecture outside of an analog computer holds state (and the
analog capacitors could even be considered holding state). So, one
needs to consider context. In this case, being stateless refers to
application-level persistent state, such as an operational data store.
Serverless applications externalize such state instead of relying on
the local file system or long-lived in-memory state.
Compute Containers
Another loaded term - when used in this context, the term containers appears to imply a self-contained deployment and run-time unit,
not necessarily a Docker container. Having self-contained units is
important because ephemeral applications (see below) need to be
automatically re-incarnated. Having all code and dependencies in
some sort of container makes that easier.
²https://aws.amazon.com/lambda/faqs/
³https://martinfowler.com/articles/serverless.html
⁴https://www.enterpriseintegrationpatterns.com/ramblings/20_statelessness.html
Serverless = Worry Less? 222
Event-triggered
If serverless applications are deployed as stateless container units,
how are they invoked? The answer is: per event. Such an event
can be triggered explicitly from a front-end, or implicitly from a
message queue, a data store change, or similar sources. Serverless
application modules also communicate with each other by publish-
ing and consuming events, which makes them loosely coupled and
easily re-composable.
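These two properties can be sketched together in a FaaS-style handler (illustrative only, not any specific provider’s API): the function holds no durable state of its own - it externalizes it to a store - and it is invoked once per event, emitting a follow-up event for downstream modules.

```python
# Stands in for an external database or object store; the function itself
# keeps no long-lived in-memory or file-system state between invocations.
store = {}


def handle_order_event(event: dict) -> dict:
    # Event-triggered: invoked per event, e.g. from a queue or front-end.
    order_id = event["order_id"]
    # Stateless: durable state lives outside, so the runtime may dispose
    # of this function instance and re-create it at any time.
    store[order_id] = {"status": "received", "items": event["items"]}
    # Publishing a follow-up event keeps modules loosely coupled.
    return {"type": "order.received", "order_id": order_id}


out = handle_order_event({"order_id": "42", "items": ["book"]})
assert out == {"type": "order.received", "order_id": "42"}
assert store["42"]["status"] == "received"
```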
Ephemeral
One of those fancy IT words, ephemeral means that things live only
for a short period of time. Another, more blunt way of describing
this property is that things are easily disposable. So, serverless
applications are expected to do one thing and finish up because
they might be disposed of right after. Placing this constraint is a
classic example of application programmers accepting higher
complexity (or more responsibility) to enable a more powerful run-
time infrastructure.
Managed by Third Party
This aspect is likely the main contributor to serverless’ much-
debated name. Although “managed” is a rather “soft” term (I
manage to get out of bed in the morning…), in this context it likely
implies that all infrastructure-related aspects are handled by the
platform and not the developer. This would presumably include
provisioning, scaling, patching, dealing with hardware failure, and
so on.
Auto-scales
A serverless platform is expected to scale based on demand, giving
⁵https://blog.symphonia.io/posts/2017-06-22_defining-serverless-part-1
Is it a Game Changer?
So now that we have a better idea of what types of systems deserve
to be called “serverless”, it’s time to have a look at why it’s such a
big deal. On one hand, serverless is the natural evolution of handing
over larger portions of the non-differentiating toil to the cloud providers.
While on a virtual machine you have to deal with hardening images,
installing packages, and deploying applications, containers and container
⁶https://blog.symphonia.io/posts/2017-06-26_defining-serverless-part-3
Both Google App Engine, a serverless platform that existed well before
the term was coined, and AWS Lambda, AWS’s category-defining service, follow the opaque model. You deploy your code to the platform from the
command line or via a ZIP file (for Lambda) without having to know what
the underlying packaging and deployment mechanism is.
Google Cloud Run follows the opposite approach by layering serverless
capabilities on top of a container platform. One critical capability being
added is scale-to-zero, which is emulated using Knative⁸ (talk about
another not-so-stellar-name…) to route traffic to a special Activator⁹ that
can scale up container pods when traffic comes in.
From an architectural point of view, both approaches have merit but
reflect a different point of view. When considering a product line architecture, layering is an elegant approach as it allows you to develop a common
base platform both for serverless and traditional (serverful? servermore?)
container platforms. However, when considering the user’s point of view,
an opaque platform provides a cleaner overall abstraction and gives you
more freedom to evolve the lower layers without exposing this complexity
to the user.
Language Constraints
Platform Integration
By Jean-Francois Landreau
Our architecture team regularly authored compact papers that shared the
essence of our IT strategy with a broad internal audience of architects,
CIOs and IT managers. Each paper, or Tech Brief, tackled a specific
technology trend, elaborated the impact for our specific environment,
and concluded with actionable guidance. I was assigned to write such a
Tech Brief to give development teams and managers directions for moving
applications to our new cloud platform. The paper was intended both as
useful guidance but also as a form of guard rail for us to avoid someone
¹https://12factor.net/
Keep Your FROSST 228
• Frugal
• Relocatable
• Observable
• Seamlessly updatable
• internally Secured
• failure Tolerant
Let’s look at each one, discuss why it’s relevant and how it relates to the
other characteristics.
Frugal
Being frugal doesn’t mean writing code that ekes out every last bit of memory and every CPU cycle (we
leave that to microcontrollers). It mainly means to not be unnecessarily
wasteful, for example by using an n² algorithm when a simple n or n log(n)
version is available.
Being prudent with memory or CPU consumption will help you stay in
budget when your application meets substantial success and you have
to scale horizontally. Whether your application or service consumes 128
or 512 MB RAM might not matter for a handful of instances, but if
you’re running hundreds or thousands, it makes a difference. Frugality
in application start-up time supports the application’s relocatability and
seamless updatability thanks to quick distribution and launch times.
Frugality can be achieved in many ways, for example by using efficient
library implementations for common problems instead of clunky one-
offs. It can also mean not lugging around huge libraries when only a
tiny fraction is required. Lastly, it could mean storing large in-memory data sets in an efficient format like Protocol Buffers if they’re accessed infrequently enough to justify the extra CPU cycles for decoding (letting the
bit twiddling happen under the covers).
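The n² versus n distinction is easy to show with a toy de-duplication example: both functions below return the same result, but the quadratic one re-scans a growing list for every element, while the linear one pays a single cheap set lookup.

```python
def dedup_quadratic(items):
    # O(n^2): "item not in result" is itself a linear scan over a list.
    result = []
    for item in items:
        if item not in result:
            result.append(item)
    return result


def dedup_linear(items):
    # O(n): set membership is (amortized) constant time - same output,
    # far fewer cycles once the input grows.
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


data = ["eu", "us", "eu", "apac", "us"]
assert dedup_quadratic(data) == dedup_linear(data) == ["eu", "us", "apac"]
```

At a handful of elements nobody notices the difference; at hundreds of thousands of elements per request, multiplied across many instances, the frugal version is what keeps you in budget.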
Relocatable
Observable
Seamlessly Updatable
Internally Secured
Failure Tolerant
Things will break. But the show must go on. Some of the largest E-
commerce providers are said to rarely be in a “100% functioning state”, but
their site is very rarely down. A user may not be able to see the reviews
for a product but is still able to order it. The secret: a composable website
with each component accepting failure from dependent services without
propagating them to the customer.
Failure tolerance is supported by the application’s frugality, relocatability
and observability characteristics, but also has to be a conscious design
decision for its internal workings. Failure tolerance supports seamless
updates because during start-up dependent components might not be
available. If your application doesn’t tolerate such a state and refuses to
start up, an explicit start-up sequence has to be defined, which not only
makes the complete startup brittle but also slower.
On the cloud, it’s a bad practice to believe that the network always works
or that dependent services are always responding. The application should
have a degraded mode to cope with such situations and a mechanism
to recover automatically when the dependencies are available again. For
example, an E-commerce site could simply show popular items when
personalized recommendations are temporarily unavailable.
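That degraded mode can be sketched in a few lines (a simulation with invented names - the "personalization service" below always fails to stand in for a network outage): the failure is absorbed with a fallback instead of being propagated to the customer.

```python
POPULAR_ITEMS = ["umbrella", "notebook", "coffee mug"]


def personalized_recommendations(user_id: str) -> list:
    # Stands in for a call to a dependency that is currently down.
    raise TimeoutError("recommendation service unavailable")


def recommendations(user_id: str) -> list:
    try:
        return personalized_recommendations(user_id)
    except Exception:
        # Degraded but functional: the customer still sees something
        # useful, and normal behavior resumes once the dependency recovers.
        return POPULAR_ITEMS


assert recommendations("u-123") == POPULAR_ITEMS
```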
No one likes to pay for systems that aren’t running. Hence it’s not
surprising that system uptime is a popular item on a CIO’s set of goals.
That doesn’t mean it’s particularly popular with IT staff, though. The
team generally tasked with ensuring uptime (and blamed in case
of failure) is IT operations. That team’s main objective is to assure that
systems are up-and-running and functioning properly and securely. So it’s
no surprise that this team isn’t especially fond of outages. Oddly enough,
cloud also challenges this way of working in several ways.
Avoiding Failure
The natural and obvious way to assure uptime is to build systems that
don’t break. Such systems are built by buying high-quality hardware, meticulously testing applications (sometimes using formal methods), and providing ample hardware resources in case applications do leak (or need) memory. Essentially, these approaches are based on upfront planning,
verification, and prediction to avoid later failure.
IT likes systems that don’t break because they don’t need constant
attention and no one is woken up in the middle of the night to deal
with outages. The reliability of such systems is generally measured by
the number of outages encountered in a given time period. The time that
such a system can operate without incident is described by the system’s
MTBF – Mean Time Between Failure, usually measured in hours.
High-end critical hardware, such as network switches, sports an MTBF
of 50,000 hours or more, which translates into a whopping 6 years of
continuous operation. Of course, there’s always a nuance. For one, it’s
a mean figure assuming some (perhaps normal) distribution. So, if your
switch goes out in a year, you can’t get your money back - you were
simply on the near end of the probability distribution - someone else might
get 100,000 hours in your place. Second, many MTBF hours, especially
astronomical ones like 200,000 hours, are theoretical values calculated
based on component MTBF and placement. Those are design MTBF and
don’t account for oddball real-life failure scenarios - real life always has
a surprise or two in stock for us, such as someone stumbling over a cable
or a bug shorting a circuit¹. Lastly, in many cases a high MTBF doesn’t
actually imply that nothing breaks but rather that the device will keep
working even if a component fails. More on that later.
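The arithmetic behind these hour figures is simple enough to check yourself:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous operation per year

# 50,000 hours MTBF: roughly the "6 years" cited for high-end switches.
mtbf_years = 50_000 / HOURS_PER_YEAR
assert round(mtbf_years, 1) == 5.7

# An "astronomical" design MTBF of 200,000 hours would be over two decades -
# a calculated figure, not a promise that your unit survives that long.
design_mtbf_years = 200_000 / HOURS_PER_YEAR
assert round(design_mtbf_years) == 23
```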
Robustness
Systems that don’t break are called robust. Robust systems resist disturbance. They are like a castle with very thick walls. If someone decides to
disturb your dwelling by throwing rocks, those rocks will simply bounce
off. Therefore, robust systems are convenient as you can largely ignore the
disturbance. Unless, of course, someone shows up with a trebuchet and a
very large stone. Also, living in a castle with very thick walls may be all
cool and retro, but often not the most comfortable option.
This example also shows that simply counting the number of outages is
a poor proxy metric (see 37 Things) - an outage that causes 30 seconds
of downtime should not be counted the same as one that lasts 24 hours.
The redeeming property of traditional IT is that it might not even detect
the 30-second outage and hence doesn’t count it. However, poor detection
shouldn’t be the defining element!
An essential characteristic of robust systems is that they don’t break
very often. This means that repairs are rare, and therefore often not well
practiced. You know you have robust systems when you have IT all up in
arms during an outage, running around like carrots with their heads cut
off (I am a vegetarian, you remember). People who drive a German luxury car will know an analogous experience: the car rarely breaks, but when it does,
oh my, will it cost you! The ideal state, of course, is that if a failure occurs
nothing really bad happens. Such systems are resilient.
Resilience
Resilient systems absorb disturbance. They focus on fast recovery because
it’s known that robustness can only go so far. Rather than rely on up-
front planning, such systems bet on redundancy and automation to detect
and remedy failures quickly. Rather than a castle with thick walls, such
systems are more like an expansive estate - if the big rock launched by
the trebuchet annihilates one building, you move to another one. Or, you
build from wood and put a new roof up in a day. In any case you have also
limited the blast radius of the failure - a smashed or burnt down building
won’t destroy the whole estate.
In IT, robustness is usually implemented at the infrastructure level,
whereas resilience generally spans both infrastructure and applications.
For example, applications that are broken down and deployed in small units, perhaps in containers, can exhibit resilience by using orchestration tools that detect failures and restart or relocate affected components.
Antifragile Systems
Resilient systems absorb failure, so we should be able to test their
resilience by simulating or even actually triggering a failure, without causing user-visible outages. That way we can make sure that the resilience
actually works.
So, when dealing with resilient systems, causing disturbance can actually
make the system stronger. That’s what Nassim Taleb calls Antifragile
Systems² - these aren’t systems that are not fragile (that’d be robust) but
systems that actually benefit from disturbance.
While at first thought it might be difficult to imagine a system that
welcomes disturbance, it turns out that many highly resilient systems
fall into this category. The most amazing system, and one close to
home, deserves the label “antifragile”: our human body (apologies to all
animal readers). For example, immunizations work by injecting a small
disturbance into our system to make it stronger.
²Nassim Taleb: Antifragile: Things That Gain From Disorder, Random House, 2012
Keep Calm And Operate On 238
Antifragile systems add a second, outer system to this setup. The outer
system injects disturbances, observes the system behavior, and improves
the correcting action mechanism.
Chaos Engineering
When the fire brigade trains, they usually do so in what I refer to as the
“pre-production” environment. They have special training grounds with
charred up buses and aircraft fuselage mock-ups that they set on fire to
inject disturbance and make the overall system, which heavily depends
on the preparedness of the brigade, more resilient.
In software, we deal with bits and bytes that we can dispose of at will.
And because we don’t actually set things on fire, it’s feasible for us to
test in production. That way we know that the resilient behavior in pre-
production actually translates into production, where it matters most.
Naturally, one may wonder how smart it is to inject disturbance into
a production system. It turns out that in a resilient system, routinely
injecting small disturbances builds confidence and utilizes the antifragile
properties of the overall system (including people and processes). That
insight is the basis for Chaos Engineering⁴, described as the discipline of experimenting on a system in order to build confidence in the system’s
capability to withstand turbulent conditions in production.
Don’t try to be antifragile before you are sure that you are
highly resilient!
Likewise, one must be careful to not get lured into wanting to be antifragile
before all your systems are highly resilient. Injecting disturbance into a
fragile system doesn’t make it stronger; it simply wreaks havoc.
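A toy chaos experiment illustrates the precondition (pure simulation, invented names): in a service that is already redundant, killing one replica must not cause user-visible failures - that is exactly what the experiment verifies.

```python
import random

# True = healthy replica behind a load balancer.
replicas = {"web-1": True, "web-2": True, "web-3": True}


def handle_request() -> str:
    healthy = [name for name, up in replicas.items() if up]
    if not healthy:
        raise RuntimeError("total outage")  # this must never happen
    return random.choice(healthy)  # the load balancer routes around failures


def inject_failure():
    # The "chaos monkey": kill one randomly chosen instance in production.
    victim = random.choice(list(replicas))
    replicas[victim] = False


inject_failure()

# Confidence-building check: every request is still served, and only
# healthy replicas serve them.
served_by = {handle_request() for _ in range(100)}
assert all(replicas[r] for r in served_by)
```

Run the same experiment against a system with a single, fragile instance and `handle_request` raises - which is the point of the warning above: build the resilience first, then test it with chaos.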
DON’T…
DO…
By now you have migrated existing applications and built new ones so
that they take advantage of the cloud computing platform. But you’re not
quite done yet. A cloud journey isn’t a one-time shot, but an ongoing cycle of learning and optimization.
Many cloud initiatives are motivated, and justified, by cost savings. Cost is
always a good argument, particularly in traditional organizations, which
still view IT as a cost center. And although lower cost is always welcome,
once again things aren’t that simple. Time to take a closer look at reducing
your infrastructure bill.
Server Sizing
I am somewhat amused that when discussing cloud migrations, IT departments
talk so much about cost and saving money. The amusement stems from the fact
that most on-premises servers are massively over-sized and thus waste huge
amounts of money. So, if you want to talk about saving money, perhaps look
at your server sizing first!
I estimate that existing IT hardware is typically 2-5x larger than actually
needed. The reasons are easy to understand. Because provisioning a server
takes a long time, teams would rather play it safe and order a bigger one just in
case. Also, you can’t order capacity at will, so you need to size for peak
load, which may take place a few days a year. And, let’s be honest, in
most organizations spending a few extra dollars on a server gets you in
less trouble than a poorly performing application.
I tend to joke that when big IT needs a car, they buy a
Rolls-Royce. You know, just in case, and because it scored better
on the feature comparison. When they forget to fill the tank
and the car breaks down, their solution is to buy a second
Rolls-Royce. Redundancy, you know!
If you’re after savings, you must complete at least two more important
steps after the actual cloud migration:
your workloads and your spend. Most clouds even give you automated
cost-savings reports if your hardware is heavily underutilized.
The nice thing about the cloud is that, unlike your on-premises server that's
already been purchased, you have many ways to correct oversized or unneeded
hardware. Thanks to being designed for the first derivative, the formula
for cloud spend is quite simple: you are charged the size of the hardware
or storage multiplied by the unit cost and by the length of time you use it.
You can imagine your total cost as the volume of a cube:
For the more mathematically inclined, the formula reads like this:
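As a minimal sketch, the formula (resource size multiplied by unit price multiplied by usage time) translates into code like this; the per-vCPU-hour price is a made-up figure, as real unit prices vary by provider:

```python
# Cloud spend = resource size x unit price x usage time. The per-vCPU-hour
# price below is a made-up figure; real unit prices vary by provider.

def cloud_cost(size_units: float, unit_price_per_hour: float, hours: float) -> float:
    """Total spend for one resource: size x unit price x duration."""
    return size_units * unit_price_per_hour * hours

# An 8-vCPU server at a hypothetical $0.05 per vCPU-hour, on all month:
always_on = cloud_cost(8, 0.05, 730)

# The same server running only 8 hours a day on 22 weekdays:
office_hours = cloud_cost(8, 0.05, 8 * 22)

print(f"${always_on:.2f} vs. ${office_hours:.2f}")  # $292.00 vs. $70.40
```

The two scenarios differ only in the time factor, which previews the two levers discussed next: resource size and lifetime.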
Since the unit cost is largely fixed by the provider (more on that later),
you have two main levers for saving: resource size and lifetime.
OPTIMIZING SIZE
OPTIMIZING TIME
The sad part is that if you negotiate a 30% discount from your vendor, that
server is still your most expensive one. Now if I want to really get the
executives’ attention, particularly CFOs, I tell them to imagine that half
the servers in their data center aren’t doing anything. Most of them won’t
ARCHITECTURE CHANGES
Batched usage (left) vs. sporadic usage (right) favor different pricing models
Additionally, if your data is largely static, you can save a lot in the pay-
per-query model by creating interim results tables that hold aggregated
results. Once again it shows that knowing your needs is the most important
ingredient in reducing cost.
Sounds complicated? Yes, it can be. But such is the double-edged sword of
cost control (and transparency): you have much more control over pricing
but need to work to realize those savings. Once again, there is no magic
bullet. The good news is that in our example, both models likely will result
in a bill that’s much lower than setting up a traditional data warehouse
that’s idle 99% of the time.
Many teams start comparing prices by looking at unit prices for basic
services such as compute nodes. This is likely a sign that you're stuck in
a traditional, data-center-oriented frame of mind. In the cloud, your biggest
levers are sizing and resource lifetime. Data often has the longest lifespan,
so that's where it makes sense to look at cost per unit.
Luckily, history has shown that the competition between the “big three”
generally works in your favor: most cloud providers have reduced prices
many times.
Doing Nothing
Cynics might note that since some cloud providers are rumored to be incurring
significant losses, you in fact get more than you pay for,
at least for as long as the investors keep bankrolling the operation. There’s
another, slightly less cynical cost savings strategy that most are too shy
to mention: doing nothing (after migrating to the cloud, that is). Not only
does compute power become cheaper all the time, but the cloud providers
also deal with depreciation and hardware refresh cycles for you, letting
you enjoy continuous price deflation. So, you'll save by doing
nothing - perhaps some things in the cloud are a kind of magic.
Premature Optimization
Premature optimization is widely considered the root of much evil in
computer science. So, when is the right time to optimize cost? In the end,
it’s a simple economic equation: how much savings can you realize per
engineer hour invested? If your overall cloud run rate is $1000 / month it’s
unlikely that you’ll be getting a huge return on investing many engineer
hours into optimization. If it’s $100,000 / month, the situation is likely
different. In any case, it’s good to establish a culture that makes cost a
first-class citizen.
Cloud Savings Have To Be Earned 254
Optimizing Globally
Modern frameworks can unexpectedly consume compute power and drive
up cost, for example by requiring five nodes for leader election, a few more
as control plane nodes, another two as logging aggregators, an inbound
and an outbound proxy, etc. Suddenly, your simple demo app consumes
10 VMs and your bill runs into hundreds of dollars per month.
In return, such frameworks give you productivity gains, and servers are
cheaper than developers. Higher productivity might also give you an edge
in the market, which will outweigh most potential savings. So, once again
it’s about optimizing globally, not locally. Transparency is your best friend
in this exercise.
DON’T…
DO…
• Understand pricing models and how they fit your usage patterns.
• Use automation to reduce cost.
• Use transparency to understand and manage cost.
• Spend time understanding your own usage patterns.
• Read price reduction case studies carefully to see what impacted the
price and whether that applies to you.
28. It’s Time To Increase
Your “Run” Budget
Cloud challenges budgeting misconceptions.
A CIO who burns too much money just “keeping the lights on” will have
a difficult time meeting the demand from the business, which needs projects
to be completed. Therefore, almost every CIO’s agenda includes reducing
the run budget in favor of the project budget. This pressure only builds in
times when the business has to compete with “digital disruptors”, which
places additional demands on IT to deliver more value to the business.
Some analysts have therefore started to rephrase “Run vs. Change” as Run-
Grow-Transform¹ to highlight that change also incurs expense.
There are many ways to reduce operational costs, including optimizing
server sizing, decommissioning data centers, and of course moving to the
cloud. Somewhat ironically, though, the move to the cloud shifts capital
investment towards consumption-based billing, which is traditionally
associated with run costs. It can also highlight operational costs that were
buried in project budgets.
A real story about building an internal cloud platform and recovering its
cost illustrates some of these challenges.
ing that the strict separation between build and run might no longer make
much sense in modern IT organizations. The same is true for a DevOps
mindset that favors “you build it, you run it”, meaning project teams
are also responsible for operations. However, as we saw, such a way of
working can run up against existing accounting practices.
actual spend. Anything that can be bundled into the project cost benefits
from this effect. So, the half day that the software developer spent fixing
the CI/CD pipeline might show up as only $100 in the current fiscal year
because the remaining $400 is pushed to the subsequent years. A shared
service will have a difficult time competing with that accounting magic!
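The accounting effect just described can be sketched numerically; the straight-line schedule and the five-year horizon are illustrative assumptions:

```python
# Capitalizing an expense spreads it over several fiscal years, so only a
# fraction hits the current year's budget. Five-year straight-line
# depreciation is an illustrative assumption.

def straight_line(cost: float, years: int) -> list[float]:
    """Spread a capitalized cost evenly over the given number of years."""
    return [cost / years] * years

# Half a day of developer time, booked at $500 and capitalized:
schedule = straight_line(500, 5)
print(schedule[0])        # 100.0 - all that shows up this fiscal year
print(sum(schedule[1:]))  # 400.0 - pushed into subsequent years
```

The $400 doesn't disappear; it simply burdens future budgets, as the next paragraph points out.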
The astute reader might have noticed that pushing expenses into the future
makes you look better in the current year, but burdens future years with
cost from prior investments. That’s absolutely correct - I have seen IT
organizations where past capitalization ate most of the current year’s
budget - the corporate equivalent of having lived off your credit card
for too long. However, project managers often have short planning and
reward horizons, so this isn’t as disconcerting to them as it might be for
you.
Cloud Accounting
The apparent conflict between a novel way of utilizing IT resources
and existing accounting methods has not escaped the attention of bodies
like the IFRS Foundation³, which publishes the International Financial
Reporting Standards, similar to GAAP in the US. While the IFRS doesn’t
yet publish rules specific to cloud computing, it does include language
on intangible assets that a customer controls, which can apply to cloud
resources. The guidance, for example, is that implementation costs related
to such intangible assets can be capitalized. EY published an in-depth
paper on the topic⁴ in July 2020. If you think reading financial guidance
is a stretch for a cloud architect, consider all the executives who have to
listen to IT pitching Kubernetes to them.
a result, people stuck in the old way of thinking will often fail to see the
new model’s huge advantages.
However, a new model can also completely change the business’ view on
IT spending. Traditionally, hardly any business would be able to determine
how much operating a specific software feature costs them. Hence, they’d
be unable to assess whether the feature is profitable, i.e. whether the IT
spending stands in any relation to the business value generated.
Cloud services, especially fine-grained serverless runtimes, change all this.
Serverless platforms allow you to track the operational cost associated
with each IT element down to individual functions, giving you fine-
grained transparency and control over the Return-on-Investment (ROI)
of your IT assets. For example, you can see how much it costs you to sign
up a customer, to hold a potentially abandoned shopping cart, to send a
reminder e-mail, or to process a payment. Mapping these unit costs against
the business value they generate allows you to determine the profitability
of individual functions. If you conclude that customer sign-ups are
profitable per unit, then you should be excited to see IT spend on sign-ups
increase!
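A sketch of such a per-function profitability check follows; all cost and value figures are invented for illustration, and real unit costs would come from your provider's per-function billing data:

```python
# Map the unit cost of each serverless function against the business value
# per invocation to spot unprofitable features. All figures are invented;
# real unit costs would come from your provider's per-function billing data.

cost_per_call = {                # $ per invocation (hypothetical)
    "customer_signup": 0.004,
    "cart_reminder_email": 0.001,
    "payment_processing": 0.010,
}
value_per_call = {               # estimated business value per invocation
    "customer_signup": 2.50,
    "cart_reminder_email": 0.05,
    "payment_processing": 0.30,
}

def roi(function_name: str) -> float:
    """Business value generated per dollar of IT spend for one function."""
    return value_per_call[function_name] / cost_per_call[function_name]

for name in cost_per_call:
    print(f"{name}: {roi(name):.0f}x value per dollar spent")
```

With this mapping in place, rising spend on a high-ROI function is good news rather than a budget alarm.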
DON’T…
DO…
Most corporate IT was born out of automating the business: IT took
something that was done with pen and paper and automated it. What
started mostly with mainframes and spreadsheets (VisiCalc¹ turned 40 in
2019) soon became the backbone of the modern enterprise. IT automation
paid for itself largely by saving manual labor, much like automation of
a car assembly line. So when we speak about automation these days, it’s
only natural to assume that we’re looking to further increase efficiency.
However, cloud also challenges this assumption.
A slew of recent software innovations has set out to change this: Con-
tinuous Integration (CI) automates software test and build, Continuous
Automation isn’t About Efficiency 267
Speed
Repeatability
Confidence
Resilience
It’s easily forgotten that a new feature release isn’t the only
time you deploy software. In case of a hardware failure, you
need to deploy software quickly to a new set of machines.
With highly automated software delivery, this becomes a
matter of seconds or minutes - often before the first incident
call can be set up. Deployment automation increases uptime
and resilience.
Transparency
Compliance
Cloud Ops
Because moving to the cloud places a significant part of system operations
in the hands of the cloud provider, especially with fully-managed SaaS
services, there’s often a connotation that cloud requires little to no oper-
ational effort, occasionally reflected in the moniker “NoOps”. Although
cloud computing brings a new operational model that’s based on higher
levels of automation and on outsourcing IT operations, today's push to
automate IT infrastructure isn't driven by the desire to eliminate jobs and
increase efficiency. Operations teams therefore should embrace, not fear
automation of software delivery.
run out of valuable things to do. If anything, the value proposition of well-
run IT operations goes up in an environment that demands high rates of
change and instant global scalability.
Speed = Resilience
Once you speed up your software delivery and operations through au-
tomation, it does more than just make you go faster: it allows you to work
in an entirely different fashion. For example, many enterprises are stuck in
the apparent conflict between cost and system availability: the traditional
way to make systems more available was to add more hardware to better
distribute load and to allow failover to a standby system in case of a critical
failure.
Once you have a high degree of automation and can deploy new appli-
cations within minutes on an elastic cloud infrastructure, you often don’t
need that extra hardware sitting around - after all, the most expensive
server is the one that’s not doing anything. Instead, you quickly provision
new hardware and deploy software when the need arises. This way you
achieve increased resilience and system availability at a lower cost, thanks
to automation!
Automation isn’t About Efficiency 271
DON’T…
DO…
Have you ever been to the supermarket when you really only needed milk,
but then picked up a few small items along the way, only to find that
your bill at checkout is $50? To make matters worse, because you really
didn’t buy anything for more than $1.99, you’re not sure what to take out
of the basket because it won't make a big difference anyway. If you have,
then you’re emotionally prepared for what could happen to you on your
journey to the cloud.
$40. Even though it isn’t cheap, the pricing model makes it relatively easy
to guess how much the total bill will be (depending on your locale you
might get tricked with a mandatory gratuity, state and local tax, health
care cost recovery charges, resort charge, credit card fee, and all kinds of
other creations that share the sole goal of extracting more money from
your wallet.)
Cloud computing is more like going to the grocery store: you’ll have many
more line items, each at a relatively low price. So, if you exercise good
discipline it can be significantly cheaper. However, it’s also more difficult
to predict total costs.
So, when enterprises first embark on the cloud journey, they need to get
used to a new pricing model that includes more dimensions of charges,
such as compute, storage, reserved capacity, network, API calls, etc. They
also need to get used to seemingly minuscule amounts adding up into non-
trivial cost. Or, as the Germans say “Kleinvieh macht auch Mist” - even
small animals produce manure.
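A toy illustration of how seemingly minuscule per-unit charges add up; the unit prices are illustrative, not any provider's actual rates:

```python
# "Kleinvieh macht auch Mist": many tiny per-unit charges add up to a
# real bill. Unit prices below are illustrative, not actual provider rates.

monthly_usage = {
    # item: (unit price in $, units consumed per month)
    "API calls": (0.000004, 50_000_000),
    "GB stored": (0.023, 2_000),
    "GB egress": (0.09, 500),
}

total = sum(price * units for price, units in monthly_usage.values())
for item, (price, units) in monthly_usage.items():
    print(f"{item}: ${price * units:,.2f}")
print(f"total: ${total:,.2f}")  # three "cheap" line items, ~$291/month
```

None of the three unit prices would raise an eyebrow on its own; the volume is what turns them into a noticeable bill.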
Cost Control
The cloud operating model gives developers previously unheard-of levels
control over IT spend. Developers incur cost both via their applications
(for example, a chatty application makes more API calls) as well as
with machine sizing and provisioning. Cloud providers generally charge
identical amounts for production and development machine instances,
causing test and build farms plus individual developer machines to quickly
add up despite them likely being much smaller instances.
Although most enterprises set billing alerts and limits, with that power
over spending also comes additional responsibility. Being a responsible
developer starts with shutting down unused instances, selecting conser-
vative machine sizes, purging old data, or considering more cost-efficient
alternatives. Done well, developers’ control over cost leads to a tighter
and therefore better feedback cycle. Done poorly, it can lead to runaway
spending.
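A minimal sketch of such a guardrail follows, assuming spend data is pulled from your provider's billing API; the linear projection is deliberately simplistic:

```python
import calendar
import datetime

# Simple spending guardrail: flag when month-to-date spend projects past
# the team's monthly budget. In practice, spend_to_date would come from
# your cloud provider's billing API; here it's passed in directly.

def projected_monthly_spend(spend_to_date: float, today: datetime.date) -> float:
    """Linearly extrapolate month-to-date spend to a full month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month

def within_budget(spend_to_date: float, budget: float, today: datetime.date) -> bool:
    """Return True if the projection stays within the monthly budget."""
    return projected_monthly_spend(spend_to_date, today) <= budget

# $450 spent by June 10th projects to $1,350 for the 30-day month:
print(within_budget(450, budget=1_000, today=datetime.date(2021, 6, 10)))  # False
```

Even a crude projection like this catches runaway spending weeks before the invoice arrives.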
So, when performing actions that differ from the regular load patterns,
it's important to understand the impact they might have on the various
dimensions of the pricing model. It might be helpful to either throttle
down bulk loads or to observe and quickly undo any auto scaling after
the load is finished.
Infinite Loops
Orphans
The next category is less dramatic, but also sneaks up on you. Developers
tend to try a lot of things out. In the process they create new instances
or load data sets into one of the many data stores. Occasionally, these
instances are simply forgotten, unless they reach a volume where a
billing alert hints that something is awry.
One additional line item is unlikely to hit your bill hard, but will
nevertheless surprise you. Some cloud pricing schemes include a dependent service
for free when deploying a certain type of service. For example, some cloud
providers include an elastic IP address with a virtual machine instance.
However, if you stop that instance, you will need to pay for the elastic IP.
Some folks who provisioned many very small instances are surprised to
find that after stopping them, the cost remains almost the same due to the
elastic IP charges.
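A quick back-of-the-envelope calculation illustrates the effect; both hourly rates are invented for illustration, not real provider prices:

```python
# Stopping tiny instances may barely reduce the bill if each one holds an
# elastic IP that is billed only while the instance is stopped. The hourly
# rates below are invented for illustration, not real provider prices.

INSTANCES = 100
VM_RATE = 0.006       # $/hour per small VM (hypothetical)
IDLE_IP_RATE = 0.005  # $/hour per elastic IP on a stopped VM (hypothetical)
HOURS_PER_MONTH = 730

running_bill = INSTANCES * VM_RATE * HOURS_PER_MONTH
stopped_bill = INSTANCES * IDLE_IP_RATE * HOURS_PER_MONTH

print(f"all running: ${running_bill:.2f}")  # $438.00
print(f"all stopped: ${stopped_bill:.2f}")  # $365.00 - barely cheaper
```

With small instances, the idle-IP charge nearly cancels out the savings from stopping the VMs, which is exactly the surprise described above.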
In another example, many cloud providers will charge you egress band-
width cost if you connect from one server to another via its external
IP address, even if both servers are located in the same region. So,
accidentally looking up a server’s public IP address instead of the internal
one will cost you.
It pays to have a closer look at the pricing fine print, but it's unrealistic
to expect that you can predict every situation. Running a cost optimizer is
likely a good idea for any sizable deployment.
Being Prepared
While sudden line items in your bill aren't pleasant, it might bring
some relief to know that you're not the only one facing this problem.
And in this case it's more than just “misery loves company.” A whole
slew of tools and techniques has been developed to help address this
common problem. Just blaming the cloud providers isn't quite fair, as your
unintentional usage actually did incur infrastructure cost for them.
Beware the Supermarket Effect! 277
Checking Out
You might have observed how supermarkets place (usually overpriced)
items appealing to kids near the cashier’s desk, so that the kids can toss
something into the shopping cart without the parents noticing. Some IT
managers might feel they have had a similar experience. Cloud providers
routinely dangle shiny objects in front of developers, who are always eager
to try something new. And, look, they are free of charge for the initial trial!
I have previously likened developers to screaming children, who chant “Mommy,
Mommy, I want my Kubernetes operators - all my friends have them!”
So, you might have to manage some expectations. On the upside, I give
the cloud providers a lot of credit for devising transparent and elastic
pricing models. Naturally, elasticity goes both ways, but if you manage
your spend wisely, you can achieve significant savings even if you find
the occasional chocolate bar on your receipt.
DON’T…
DO…