Cloud Strategy

A Decision-based Approach to a Successful Cloud Journey

Gregor Hohpe
This book is for sale at http://leanpub.com/cloudstrategy

This version was published on 2020-08-13

This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and many iterations to get reader feedback, pivot until you have the right book and build traction once you do.

© 2019 - 2020 Gregor Hohpe


Tweet This Book!
Please help Gregor Hohpe by spreading the word about this book on
Twitter!
The suggested tweet for this book is:
I changed my cloud migration from wishful thinking to a strategy
defined by architectural principles.
The suggested hashtag for this book is #CloudStrategyBook.
Find out what other people are saying about the book by clicking on this
link to search for this hashtag on Twitter:
#CloudStrategyBook
Contents

About this Book

Part I: Understanding Cloud

1. Cloud isn’t IT Procurement; it’s a Lifestyle Change
2. Cloud Thinks in the First Derivative
3. Wishful Thinking isn’t a Strategy
4. Principles Connect Strategy and Decisions
5. If You Don’t Know How to Drive…

Part II: Organizing for the Cloud

6. Cloud is Outsourcing
7. Cloud Turns Your Organization Sideways
8. Retain / Re-Skill / Replace / Retire
9. Don’t Hire a Digital Hitman
10. Enterprise Architecture in the Cloud

Part III: Moving to the Cloud

11. Why Exactly Are You Going to the Cloud Again?
12. No One Wants a Server
13. Don’t Run Software You Didn’t Build
14. Don’t Build an Enterprise Non-Cloud!
15. Cloud Migration per Pythagoras

Part IV: Architecting the Cloud

16. Multi-cloud: You’ve Got Options
17. Hybrid Cloud: Slicing the Elephant
18. The Cloud - Now On Your Premises
19. Don’t Get Locked-up Into Avoiding Lock-in
20. The End of Multi-tenancy?
21. The New -ility: Disposability

Part V: Building (for) the Cloud

22. The Application-centric Cloud
23. What Do Containers Contain?
24. Serverless = Worry Less?
25. Keep Your FROSST
26. Keep Calm And Operate On

Part VI: Embracing the Cloud

27. Cloud Savings Have To Be Earned
28. It’s Time To Increase Your “Run” Budget
29. Automation isn’t About Efficiency
30. Beware the Supermarket Effect!

Author Biography


About this Book
Every enterprise finds itself in some stage of a cloud migration these
days, or cloud transformation, as it’s often called. Cloud computing is
an amazing resource that can provide fully managed platforms, instant
scaling, auto-optimizing and even auto-healing operations, per-second
billing, pre-trained machine learning models, and globally distributed
transactional data stores. So, it’s no wonder that most enterprises want
to take advantage of such capabilities.
Still, migrating a whole enterprise to the cloud isn’t as easy as pushing
a button. Migration itself can be a costly endeavor while lifting and
shifting legacy applications to the cloud is unlikely to bring the anticipated
benefits. At the same time, though, re-architecting applications to run
optimally in the cloud is bound to be cost prohibitive and not always a
worthwhile investment. So, enterprises need to have a better and more
nuanced strategy than “we’re moving to the cloud.”
A sound cloud strategy isn’t something you can copy from another
organization - each organization has a different starting point and has to
work under different constraints, both internal and external. Also, there
are no simple recipes for cloud success. Instead, you need a set of proven
decision models that help you analyze your situation, evaluate options,
understand trade-offs, and articulate your choice to a broad audience.
Now that sounds like architecture, doesn’t it?
Lastly, a cloud strategy isn’t just about technology but requires close align-
ment between business and IT goals. Realizing cloud’s potential hinges on
shedding old habits and changing the organization’s very beliefs. That’s
why it’s called transformation. I wrote a whole book on transformation,
or more specifically the role of architects in transformation: 37 Things One
Architect Knows About IT Transformation¹, so I’d encourage you to read
that book first if you have an interest in architecture and organizational
changes. Cloud Strategy applies the “elevator architect” thinking from 37
Things to cloud architecture and cloud migrations.
¹https://leanpub.com/37things

Life Teaches The Best Lessons


Just like 37 Things, this book contains many anecdotes and the occasional
punch line based on my real-world experience with cloud migrations. I
have been in charge of a major cloud transformation in three different
roles:

• As chief architect of a major financial services provider, devising and building a private cloud platform to speed up application
delivery.
• As technical director at a major cloud provider, advising strategic
clients in Asia and Europe, including some of the largest retailers
and telecommunications companies, on both cloud strategy and
organizational transformation.
• As a Singapore smart nation fellow, laying out an overarching cloud
strategy at the national level.

Each environment presented its unique set of challenges but also shared
noteworthy commonalities, which this book distills into concrete advice.
Each migration involved specific vendors and products, both on premises
and in the commercial cloud. In this book, though, I stay away from spe-
cific products as much as possible, using them only as occasional examples
where I consider it helpful. There’s ample material describing products
already and while products come and go, architectural considerations tend
to stay. Instead, as with 37 Things, I prefer to take a fresh look at some well-
trodden topics and buzzwords to give readers a novel way of approaching
some of their problems.

Cloud Stories
Corporate IT can be a somewhat uninspiring and outright arduous topic.
But IT doesn’t have to be boring. That’s why I share many of the anecdotes
that I collected from the daily rut of cloud migration alongside the
architectural reflections.
Readers appreciated several attributes of 37 Things’ writing style and
content, which I aimed to repeat for this book:

• Real Experience - rather than painting rosy pictures of what could be done, I try to describe what worked (or perhaps didn’t) and why,
based on actual experience.
• Unfiltered Opinion - I prefer to call things the way they are. Also, I
am not shy to highlight downsides or limitations. There are plenty
of marketing brochures already, so I’m not keen to add another one.
• Engaging Stories - stories stick, so I try to package complex topics
into approachable stories and engaging anecdotes.
• Less Jargon, More Thought - IT people are well-known for spewing
out the latest buzzwords. But few can tell you when to use which
products and what constraints are built into them. I am aiming for
the opposite.
• Valuable Take-aways - Stories are nice, but architects also need
concrete advice to make their cloud migration successful. I share
what I know.
• Useful References - A lot has been written on cloud computing,
architecture, and IT strategy. I am not here to regurgitate what
has been already written but want to synthesize new insights. I am
happy to point you to related material.

So, just as with 37 Things, I hope that this book equips you with a few
catchy slogans that you’re able to back up with solid architecture insight.

Better Decisions with Models


Although cloud computing is founded in advanced technology, this book
isn’t deeply technical. You won’t find instructions on how to have your
CI pipeline auto-generate YAML Helm Charts for fully-automated multi-
cluster container orchestration management in a provider-neutral fashion.
You might, however, find guidelines on how you would go about deciding
whether such a setup is a good match for your organization.
This book focuses on meaningful decisions, those that involve conscious
and sometimes difficult trade-offs. Individual product features step aside
in favor of a balanced comparison of architectural approaches. Consid-
ering both strengths and weaknesses leads to vendor-neutral decision
models, often accompanied by questions that you should ask the vendor
or yourself.

Employing the “Architect Elevator” notion to better connect the IT engine room to the business penthouse means that elevating the level of discussion isn’t dumbing things down. Rather, it’s like a good map that guides
you well because it omits unnecessary detail. This book therefore removes
noise and highlights critical aspects and connections that are too often
overlooked. It’ll make you see the forest and not just the trees, sharpening
your thinking and decision making at the relevant level.

What will I learn?


This book is structured into six major sections, which roughly follow the
cloud journey that a complex organization is likely to undertake:

Part I: Understanding the Cloud

Cloud is very different from procuring a traditional IT product. So, rather than follow a traditional selection and procurement process, you’ll have to rethink the way your IT works.

Part II: Organizing for the Cloud

Cloud impacts more than technology. Getting the most out of cloud necessitates organizational changes, affecting both
structure and processes.

Part III: Moving to the Cloud

There are many ways to the cloud. The worst you can do
is transport your existing processes to the cloud, which will
earn you a new data center but not a cloud - surely not
what you set out to achieve! Therefore, it’s time to
question existing assumptions about your infrastructure and
operational model.

Part IV: Architecting the Cloud

There’s a lot more to cloud architecture than picking the right vendor or product. It’s best to dodge all the buzzwords
and use architectural decision models instead. This includes
multi- and hybrid-cloud but perhaps not in the way the
marketing brochures laid it out.

Part V: Building (for) the Cloud

The cloud is a formidable platform. However, applications running on top of this platform need to do their part also.
This part looks at what makes an application cloud-ready,
what serverless is all about, and what’s the big deal about
containers.

Part VI: Embracing the Cloud

Congratulations! You have applications in the cloud. Are you done? Not quite yet! First, to reap the full benefits of cloud,
you need to go all in. Second, cloud may still have a few
pleasant and not-so-pleasant surprises for you in stock.

Although you’re welcome to read all chapters in sequence, the book is designed to be read in any order that best suits your needs. So, you are
just as welcome to dive into a topic that’s most relevant to you and follow
the many cross references to related chapters.

Will it answer my questions?


I often warn my workshop participants that they should expect to leave
with more questions than they came with. Similarly, this book presents
a new way of thinking rather than an instruction sheet, which I consider
a good thing for two reasons. First, you’ll have better questions in your
mind, the ones that lead you to making critical decisions. And second,
you’ll have better tools to answer these questions within their specific
context as opposed to following some generic “framework”.
There is no copy-paste for transformation. So, this book likely won’t tell
you exactly what to do but will allow you to make better decisions for
yourself. Think about it as learning how to fish (see the cover).

Do’s and Don’ts


Much of this book is dedicated to looking beneath the surface of the
cloud technology buzzwords, aiming to give enterprises a deeper and more
nuanced view on what’s really involved in a cloud migration. However, as an architect or IT leader you’re also expected to devise an execution
plan and lead your organization on a well-defined path. For that you need
concrete, actionable advice.
Several chapters therefore include a Do’s and Don’ts section at the end that
summarizes recommendations and provides words of caution. You can use
them as a checklist to avoid falling into the same traps as others before
you. Think of yourself as Indiana Jones - you’re the one who dodges all
the traps filled with skeletons. It’s challenging and might be a close call
sometimes, but you come out as the hero.

What’s with the fish, again?


The cover shows a swarm of fish that resembles a large fish. I took it in
the Enoshima Aquarium in Japan, just a short train ride south of Tokyo,
not far from Kamakura. Keeping with the theme of using personal photos
of fish from 37 Things, I selected this swarm because it illustrates how
the sum of parts has its own shape and dynamic - a swarm is more than
a bunch of fish. The same is true for complex architectures and cloud in
particular.

Getting Involved
My brain doesn’t stop generating new ideas just because the book is
published, so have a look at my blog to see what’s new:

https://architectelevator.com/blog

Also, follow me on Twitter or LinkedIn to see what I am up to or to comment on my posts:

http://twitter.com/ghohpe
http://www.linkedin.com/in/ghohpe

To provide feedback and help make this a better book, please join our
private discussion group:
https://groups.google.com/d/forum/cloud-strategy-book

Acknowledgments
Books aren’t written by a sole author locked up in a hotel room for a
season (if you watched The Shining, you know where that leads…). Many
people have knowingly or unknowingly contributed to this book through
hallway conversations, meeting discussions, manuscript reviews, Twitter
dialogs, or casual chats over a beer. Chef kept me going by feeding me
tasty pasta, pizza, and cheesecake.
Part I: Understanding Cloud

Dedicating several chapters of a book to understanding cloud might seem like carrying owls to Athens. After all, cloud has become as ubiquitous
as IT itself and the (on-line) shelves are full of related books, articles,
blog posts, and product briefs. However, much of the available material is
either product-centric or promises amazing benefits without giving much
detail about how to actually achieve them. In my mind, that’s putting the
technology cart before the enterprise horse.

Putting Cloud in Context


When embarking on a cloud journey, it’s good to first take a step back and
realize that cloud is a much bigger deal than it might seem from the outset.
This way organizations can avoid treating a cloud transformation like yet
another IT project. Instead, they need to prepare for a full-on lifestyle
change.
To really appreciate the impact that cloud computing makes, it’s good to
first realize that IT’s role in the enterprise is changing. And to put that
in context, it’s good to look at how the business evolves. And lastly, to
understand why the business needs to change, it’s helpful to look at how
the competitive landscape has evolved.
Modern organizations that have grown up with the cloud think and work
differently from traditional enterprises who are embarking on a cloud
migration. It’s therefore good to understand why cloud is such a good
fit for them and how (or whether) their structure and their behaviors
translate to your organization’s situation. Each starting point is different
and so is the journey.

It’s Your Cloud Journey


Adopting cloud because everyone else does it might be better than doing
nothing, but it’s unlikely to be a suitable starting point for a sound strat-
egy. Instead, you need to have a clear view of why you’re moving to the
cloud in the first place and what success looks like to your organization.
Then, you can weave a path from where you are to that success. Along the
way you’ll know that there isn’t a simple endpoint or even a stable target
picture: cloud evolves and so does your competitive playing field, making
it a perpetual journey. It’s therefore well worth putting some thought into
starting out right and understanding what conscious decisions and trade-
offs you are making.

Rethinking Cloud
This part helps you take a fresh look at cloud in the context of business transformation. Along the way, you’ll realize a few things that the shiny brochures
likely didn’t tell you:

• That Cloud isn’t IT procurement, but a full-on lifestyle change.
• That Cloud-ready organizations think in the first derivative.
• That Wishful thinking isn’t a strategy.
• That Principles link strategy and decisions.
• That If you don’t know how to drive, buying a faster car is a bad idea.
1. Cloud isn’t IT Procurement; it’s a Lifestyle Change
You don’t buy a cloud, you embrace it.

Corporate IT is fundamentally structured around a buy over build approach. This makes good sense because it isn’t particularly useful for the
average enterprise to build its own accounting, human resources, payroll,
or inventory system. The same is true of much of the IT infrastructure:
enterprises procure servers, network switches, storage devices, application
servers, and so on.
Naturally, enterprises tend to follow the same approach when they look
at cloud and cloud vendors. Sadly, this can lead to trouble before the first
application is ever migrated to the cloud.

Procuring a Cloud?
Most traditional IT processes are designed around the procurement of
individual components, which are then integrated in house or more
frequently by a system integrator. Much of IT even defines itself by the bill
of materials that they procured over time, from being “heavy on SAP” to
being an “Oracle Shop” or a “Microsoft House”. So, when the time comes to
move to the cloud, enterprises tend to follow the same, proven process to
procure a new component for their ever-growing IT arsenal. After all, they
have learned that following the same process leads to the same desirable
results. Right? Not necessarily.
Cloud isn’t just some additional element that you add to your IT portfolio.
It becomes the fundamental backbone of your IT: cloud is where your data
resides, your software runs, your security mechanisms protect your assets,
and your analytics crunch numbers. Embracing cloud resembles a full-on
IT outsourcing more than a traditional IT procurement.

Cloud isn’t an additional element that you add to your IT portfolio. It resembles full-on IT outsourcing more than IT procurement.

The closest example to embracing cloud that IT has likely seen in the past
decades is the introduction of a major ERP system - many IT staff will
still get goose bumps from those memories. Installing the ERP software
was likely the easiest part whereas integration and customization typically
incurred significant effort and cost, often into the hundreds of millions of
dollars.
We’re moving to the cloud not because it’s so easy, but because of the
undeniable benefits it brings, much like ERP did. In both cases the benefits
don’t derive from simply installing a piece of software. Rather, they
depend on your organization adjusting to a new way of working that’s
embedded in the platform. That change was likely the most difficult part of
the ERP implementation but also the one that brought the most significant
benefits. Luckily, clouds are built as flexible platforms and hence leave
more room for creativity than the “my way or the highway” attitude of
some ERP systems.

How Cloud is Different


Applying traditional procurement processes to cloud is likely to lead to
disappointment. That’s because most of these processes are based on
assumptions that were true in the past but that don’t hold for cloud.
Attempting to implement new technology using an old model is like
printing out emails and filing the paper. You think no one does it? I still get
occasional mails with a “save the environment; don’t print this mail” at
the bottom. Adopting new technology is fairly easy. Adjusting your way
of thinking and working takes much more time.
Two classic examples of existing processes that won’t mesh with cloud
are procurement and operations. Let’s have a look at each area and how
cloud changes the way we think about it:

Procurement
Procurement is the process of evaluating and purchasing software and
hardware components. Because a good portion of IT’s budget flows
through it, procurement tends to follow strict processes to assure that money is spent wisely, fairly, and sparingly.

Predictability versus Elasticity

Many IT processes are driven through budget control: if you want to start a
project, you need budget approval; likewise if you want to buy any piece
of software or hardware. In the traditional view of IT as a cost center¹,
such a setup makes good sense: if it’s about the money, let’s minimize the
amount that we spend.
Traditional IT budgets are set up at least a year in advance, making
predictability a key consideration. No CFO or shareholder likes to find
out 9 months into the fiscal year that IT is going to overrun its budget
by 20%. Therefore, IT procurement tends to negotiate multi-year license
terms for the software they buy. They “lock in discounts” by signing up
for a larger package to accommodate increase in usage over time despite
not being able to fully utilize it right from the start. If this reminds you
of the free cell phone minutes that are actually free only after you pay
for them in bulk, you might be onto something. And the one who’s being
locked in here isn’t so much the discount but the customer.
Cloud’s critical innovation, and the reason it’s turned IT upside down,
is its elastic pricing model: you don’t pay for resources upfront but only
for what you actually consume. Such a pricing model enables a major cost
savings potential, for example because you don’t pay for capacity that you
aren’t yet utilizing. However, elasticity also takes away the predictability
that IT so much cherishes. I have seen a CIO trying to prevent anyone
from ordering a new (virtual) server, inhibiting the very benefit of rapid
provisioning in the cloud. Such things can happen when a new way of
working clashes with existing incentives.

Feature Checklist versus Vision

To spend money wisely, IT procurement routinely compares multiple vendor offerings. Some organizations, especially in the public sector, even
have a regulatory requirement to request multiple vendor offers to assure
disciplined and transparent spending. To decide between vendors, pro-
curement makes a list of required features and non-functional requirements, scoring each product along those dimensions. They add up the scores and go into negotiation with the vendor with the highest number.
¹https://www.linkedin.com/pulse/reverse-engineering-organization-gregor-hohpe/
This approach works reasonably well if your organization has a thorough
understanding of the product scope and the needs you have, and is able
to translate those into desired product features. Although this process has
never been a great one (is the product with a score of 82.3 really better
than the one with 81.7?), it was alright for well-known components like
relational databases.
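To see why that arithmetic conveys more precision than it deserves, here’s a minimal sketch of such a weighted scorecard (in Python; the feature categories, weights, and scores are invented for illustration, not taken from any real evaluation):

    # Hypothetical feature weights (summing to 1.0) and 0-10 scores per vendor.
    weights = {"managed databases": 0.3, "ML services": 0.2,
               "compliance certifications": 0.3, "global regions": 0.2}

    vendor_scores = {
        "Vendor A": {"managed databases": 9, "ML services": 7,
                     "compliance certifications": 8, "global regions": 9},
        "Vendor B": {"managed databases": 8, "ML services": 9,
                     "compliance certifications": 8, "global regions": 8},
    }

    for vendor, scores in vendor_scores.items():
        total = sum(weights[f] * scores[f] for f in weights) * 10  # scale to 100
        print(f"{vendor}: {total:.1f}")  # e.g. Vendor A: 83.0, Vendor B: 82.0

The spreadsheet math is trivial; the hard part - and the part that breaks down for cloud platforms - is coming up with a feature list and weights that aren’t already shaped by the vendors’ own catalogs.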
Unfortunately, this approach doesn’t work for cloud platforms. Cloud
platforms are vast in scope and our definition of what a cloud should do is
largely shaped by the cloud provider’s existing and upcoming offerings. So
we’re caught in a loop of cloud providers telling us what cloud is so we can
score their offering based on that definition. As I jested in 37 Things, if you
have never seen a car and visit a well-known automotive manufacturer
from the Stuttgart area, you’ll walk out with a star emblem on the hood as
the first item in your feature list (see “The IT World is Flat” in 37 Things).
Trust me, speaking to IT staff after a vendor meeting makes even that
cheeky analogy seem mild.
Because traditional scoring doesn’t work well for clouds, a better approach
is to compare your company’s vision with the provider’s product strategy
and philosophy. For that, you need to know why you’re going to the cloud
and that buying a faster car doesn’t make you a better driver.

Snapshot versus Evolution

Massive checklists also assume that you can make a conscious decision
based on a snapshot in time. However, clouds evolve rapidly. A checklist
from today becomes useless at the latest by the time the cloud provider holds
their annual re:Invent / Ignite / Next event.
Related to the previous consideration, IT should therefore look to under-
stand the provider’s product strategy and expected evolution. Not many
providers will tell you these straight out, but you can reverse engineer
a good bit from their product roadmaps. After all, the cloud platform’s
constant evolution is one of the main motivators for wanting to deploy on
top of it. To speak mathematically, you’re more interested in the vector
than the current position.

Product versus Platform

Most items procured by IT are products: they serve a specific purpose, perhaps in coordination with other components. This makes it easy to
look at them in isolation.
Cloud is a giant platform that forms the basis for software delivery
and operations. While some large software systems can also be heavily
customized and may be positioned as a platform, cloud is different in
that it’s an extremely broad and flexible playing field. Attempting a
metaphor, you could say that traditionally IT has purchased artwork, but
with cloud it’s buying a blank canvas and some magic pens. Therefore,
when embarking on a cloud journey many more considerations come into
play.

Local Optimization versus Global Optimization

When selecting products, IT commonly looks at each product individually, following a best-of-breed approach that picks the best solution for each
specific task.
Cloud platforms contain hundreds of individual products making a com-
parison between individual products rather meaningless unless you’re
limiting your move to the cloud to one very specific use case like pre-
trained machine learning models. When looking at cloud as a platform,
though, you need to look at the whole and not just the pieces (remember
the book cover?). That means optimizing across the platform as opposed
to locally for each component, an exercise that is more complex and will
require coordination across different parts of the organization.

Matching the Business to the Software

Procuring a product is traditionally done by matching the product’s capabilities against the organization’s needs. The underlying assumption
is that the product that most closely matches your organization’s way of
working will provide the most value to your business. If you have a family
of five, you’ll want a minivan, not a two-seater sports car.
In the case of cloud, though, you’re not looking to replace an existing
component that serves a specific organizational need. Rather the opposite,
you’re looking for a product that enables your organization to work in a
fundamentally different way (that’s what we call “transformation”). As a result, you should be adjusting your organization’s operating model to the
platform you’re buying. Hence, you should see which model underlying
the cloud platform you find most appropriate for your organization and
work backwards from there. Although the platforms might look roughly
similar from the outside, upon closer inspection you’ll realize that they
are built under different assumptions that reflect the provider’s culture.

Operations
Cloud doesn’t just challenge traditional procurement processes but oper-
ational processes as well. Those processes exist to “keep the lights on”,
making sure that applications are up-and-running, hardware is available,
and we have some idea of what’s going on in our data center. It’s not
a big surprise that cloud changes the way we operate our infrastructure
significantly.

Segregation of Infrastructure from Applications

Most IT departments distinguish infrastructure operations from application delivery, typically reflected in the organizational structure where
“change” departments build applications to be handed (or tossed) over to
“run” departments to be operated. Critical mechanisms such as security,
hardware provisioning, and cost control are the operations teams’ re-
sponsibility whereas features delivery and usability are in the application
teams’ court.
A central theme of cloud is infrastructure automation, which gives devel-
opment teams direct access to infrastructure configuration and thus blurs
the lines. Especially aspects like security, cost control, scaling, or resilience
span both application and infrastructure.

Control versus Transparency

Traditional IT generally suffers from poor transparency into its system inventory. Therefore, many processes are based on control, such as not
giving developers access to configure infrastructure elements, restricting
deployments, or requiring manual inspections and approvals. These mech-
anisms are often at odds with modern software delivery approaches such
as DevOps or Continuous Delivery.

Cloud dramatically increases transparency, allowing you to monitor usage and policy violations rather than restricting processes upfront. Using
the transparency to perform actual monitoring can significantly increase
compliance while reducing the process burden - process controls were
usually just proxies for desirable outcomes with a weak link to reality.
However, it requires you to change your IT lifestyle and dive deeper into
the platform’s technical capabilities. For example, in an environment with
high degrees of automation you can scan deployment scripts for policy
violations.
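To illustrate that last point, here is a minimal sketch of such a check (in Python; the descriptor format and the “no public buckets” policy are invented for illustration - real setups would scan their actual infrastructure-as-code templates or query the provider’s APIs):

    import json

    # Hypothetical deployment descriptor, e.g. emitted by a deployment pipeline.
    deployment = json.loads("""
    {
      "resources": [
        {"name": "web-assets",    "type": "storage_bucket", "public_access": true},
        {"name": "customer-data", "type": "storage_bucket", "public_access": false}
      ]
    }
    """)

    # Policy: no storage bucket may allow public access.
    violations = [r["name"] for r in deployment["resources"]
                  if r["type"] == "storage_bucket" and r.get("public_access")]

    if violations:
        print("Policy violations:", ", ".join(violations))  # -> web-assets
    else:
        print("All resources compliant.")

Because the check runs against the same artifacts that drive the deployment, compliance is verified where changes actually happen rather than in a separate approval step.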

Resilience through Redundancy vs. Automation

IT traditionally increases system up-time through redundancy. If the server running an important application fails, there’s a fully configured
stand-by server ready to take over, minimizing any disruption. As can
be easily seen, such an approach leads to rather unfavorable economics:
half the production servers are essentially doing nothing. Alas, with
monolithic applications and manual deployments, it was the only choice
as it would take far too long to deploy a new instance.
Cloud thrives on scale-out architectures and automation, meaning new
application instances can be added quickly and easily. This allows the
elimination of “warm standby servers”. Instead, new application or service
instances can be deployed when a failure occurs. This is one of several
examples of how cloud can bring enormous cost savings but only if you
adjust the way you work.

Side-by-side
This list is just meant to give you a flavor of why selecting a cloud isn’t
your typical IT procurement - it’s not nearly exhaustive. For example,
cloud also challenges existing financial processes. The examples do make
clear, though, that applying your traditional procurement and operational
processes to a cloud adoption is likely going to put you on the wrong
starting point.
The following table summarizes the differences to highlight that cloud is
a 180 degree departure from many well-known IT processes. Get ready!

Capability          Traditional            Cloud
Budgeting           Predictability         Elasticity
Suitability         Feature Checklist      Vision
Functionality       Snapshot               Evolution
Scope               Component              Platform
Optimization        Local                  Global
Alignment           Product to Business    Business to Product
Operational Model   App vs. Infra          Apps and Infra
Compliance          Control                Transparency
Resilience          Redundancy             Automation

Same, but Very Different


Despite the stark differences from traditional IT, the major cloud providers’ product portfolios might look quite similar to one another. However,
having worked not only at two cloud vendors but also with many cloud
customers, I can confidently state that the organizations behind the cloud
platforms have very different cultures and operating models. If you have
a chance to visit multiple vendors for an executive briefing, you should
not just pay attention to the technical content, but also try to understand
the organization’s culture and underlying assumptions.

Because cloud is a journey, compare cloud providers not just by their products but also by their history and cultural DNA.

A telling indicator is the core business that each vendor engaged in before they started to offer cloud services. That history has shaped the
organization’s organizing principles and values, and also their product
strategy. I won’t elaborate on the differences here because I’d like you
to go have a look for yourself (and I also don’t want to get into trouble).
However, I am sure that spending a bit of time with each vendor outside of the typical sales pitch will give you a very vivid picture of what
I am hinting at.
Because cloud is a journey more than a destination, it requires a long-
term partnership. Therefore, I highly recommend having a look behind
the scenes, at the organization and not just the product, to understand the
provider’s culture and whether it matches your aspirations.

Cloud in the Enterprise


Cloud providers dealing with enterprises face an interesting dilemma.
On one hand they represent a non-traditional IT model that requires
enterprises to transform. However, they still need to help those enterprises
on the way. So, the providers make their clouds “enterprise ready” while trying not to lose their digital roots. “Enterprise” features such as industry
certifications are highly valuable and necessary, but one sometimes won-
ders whether glitzy customer experience centers overlooking vast control
rooms where no real crisis ever seems to be taking place are really needed
to engage enterprises.

When I visit customer experience centers, I feel like going to a fancy casino. I am mightily impressed until I remember where all the money comes from.

Commitment-based pricing models favored by most enterprises stand in contrast to cloud’s elasticity - discounts are given for multi-year
agreements that specify a committed minimum spend. Traditionally such
plans compensate for the high cost of enterprise sales, e.g. all those folks
flying around the world for 60 minute customer meetings and elaborate
conferences with well-known musical acts. Isn’t cloud supposed to put
away with much of that tradition? Cynics define “enterprise software” as
being bloated, outdated, inflexible, and expensive. Let’s hope that cloud
and traditional enterprise meet somewhere in the middle!
Transforming organizations is challenging in both directions. Whereas
traditional enterprises install free baristas because that’s what they ob-
served in their digital counterparts, internet-scale companies copy the
cheesy customer experience centers that they observe in traditional en-
terprise vendors. Both initiatives are unlikely to have the intended effect.

Changing Lifestyle
So, going to the cloud entails a major lifestyle change for IT and perhaps
even the business. Think of it like moving to a different country. I have
moved from the US to Japan and had a great experience, in large part
because I adopted the local lifestyle: I didn’t bring a car, moved into a much
smaller (but equally comfortable) apartment, learned basic Japanese, and
got used to carrying my trash back home (the rarity of public trash bins
is a favorite topic of visitors to Japan). If I had aimed for a 3000 sq. ft.
home with a two-car garage, insisted on driving everywhere, and kept
asking people in English for the nearest trash bin, it would have been
a rather disappointing experience. And in that case I should have likely
asked myself why I was even moving to Japan in the first place. If in Rome,
do like the Romans do (or the Japanese - you get my point).
2. Cloud Thinks in the First Derivative
In economies of speed, cloud is natural.

Major cloud migrations routinely go hand-in-hand with a so-called digital transformation, an initiative to help the business become more digital.
Even though it’s not totally clear what being “digital” means, it appears
to imply being able to better compete with so-called “digital disruptors”
who threaten existing business models. Interestingly, almost all these dis-
ruptors operate on the cloud, so even though the cloud alone doesn’t make
you digital, there appears to be some connection between being “digital”
and being in the cloud. Interestingly the answer lies in mathematics.

Making Things Digital


Before diving into the intricate relationship between cloud computing and
first derivatives, I have to take a brief moment to get on my “digital”
soap box. When a company speaks about transforming, it usually aims
to become more “digital”, referring to the internet giants or FAANGs -
Facebook, Apple, Amazon, Netflix, and Google.
The issue I take with the term “digital” is that most things became digital
three or four decades ago: we replaced the slide ruler with a calculator in
the 70ies, the tape with the CD in the 80ies, and the letter with e-mail in
the 90ies. Oh, and along the way analog watches were replaced by those
funny “digital” ones that only show the time when you push a button
before being again replaced by digital ones that mimic analog displays.
So, making things digital isn’t anything particularly new, really.

We made most things digital decades ago. Now it’s about using digital technology to fundamentally change the model.

If we’re not in the business of making things digital anymore, what do we refer to when a company wants to become “digital”? We want to
use modern, i.e. “digital”, technology to fundamentally change the way
the business works. A simple example helps illustrate the difference. Mi-
crosoft’s Encarta was a digital encyclopedia launched in 1993. It initially
sold in the form of a CD-ROM, which made it superior to a traditional paper
encyclopedia in most every regard: it had more content, was more up-to-
date, had multimedia content, was easier to search, and surely easier to
carry. Despite all these advantages people don’t use Encarta anymore - the
product was discontinued in 2009. The reason is now obvious: Encarta was
a digital copy of an encyclopedia but failed to fundamentally change the
model. The well-known product that did, of course, is Wikipedia. Based on
Ward Cunningham’s Wiki¹, Wikipedia questioned the basic assumptions
about how an encyclopedia is created, distributed, and monetized. Instead
of being a relatively static, authoritative set of works created by a small set
of experts, it leveraged a global computer network and universal browser
access to create a collaborative model. The new model quickly surpassed
the “digital clone” and rendered it obsolete.

Not all changes in model need to be as dramatic as Wikipedia, Uber, or Airbnb. Leanpub, the platform on which I am authoring this book,
changed the book publishing model in a subtle but important way.
Realizing that eBooks have a near-zero distribution cost, Leanpub
encourages authors to publish books before they’re finished, making
the platform a cross between Kickstarter and book publisher. Such a
model is a stark contrast to traditional print publishing, which relies
on lengthy editing and quality assurance processes to publish the
final book once in high quality.
https://www.kickstarter.com/

Digital IT
The difference between making things digital and using digital technology
to change the model applies directly to corporate IT, which has grown
¹https://wiki.c2.com/
primarily by making digital copies of existing processes. In an insurance


business, IT would automate new customer sign-ups, renewals, claims
handling, and risk calculation. In manufacturing it would automate or-
ders, shipping, inventory management, and the assembly line. By speeding
up these processes and reducing cost, often by orders of magnitude, IT
quickly evolved into the backbone of the business. What it didn’t do is
change the existing business model. But that’s exactly what the Ubers, Airbnbs, and Wikipedias of the world have done. This shift is particularly
dangerous for corporate IT who’s done extremely well creating digital
copies, often handling budgets in the billions of dollars and generating
commensurate business value. Just like Encarta.
Classic IT sometimes argues that “digital” disruptors are technology com-
panies as opposed to their “normal” financial services or manufacturing
business. I disagree. Having worked on one of Google’s core systems
for several years I can firmly state that Google is first and foremost an
advertising business; advertising is what generates the vast majority of the
revenue (and profits). Tesla is making cars, Amazon excels at fulfillment,
and Airbnb helps travelers find accommodations. That’s “normal” busi-
ness. The difference is that these companies have leveraged technology to
redefine how that business is conducted. And that’s what “going digital”
is all about.

Change is Abnormal
Because traditional IT is so used to making existing things digital, it has a
well defined process for it. Taking something from the current “as-is” state
to a reasonably well-known future “to-be” state is known as a “project”.
Projects are expensive and risky. Therefore, much effort goes into defining
the future “to-be” state from which a (seemingly always positive) business
case is calculated. Next, delivery teams meticulously plan out the project
execution by decomposing tasks into a work breakdown structure. Once
the project is delivered, hopefully vaguely following the projections,
there’s a giant sigh of relief as things return to “business as usual”, a
synonym for avoiding future changes. The achieved future state thus
remains for a good while. That’s needed to harvest the ROI (Return on
Investment) projected in the up-front business case. This way of thinking
is depicted on the left side of this diagram:

Project thinking (left) vs. Constant Change (right)

Two assumptions lie behind this well-known way of working. First, you
need to have a pretty good idea what you are aiming for to define a good
“target state”. Second, “change” is considered an unusual activity. That’s
why it’s tightly packaged into a project with all the associated control
mechanisms. This way, change is well contained and concludes with a
happy ending, the launch party, which routinely takes place before the
majority of users have even seen the delivered system.

The Digital World Has No Target Picture


Both assumptions are false in the digital world. So-called digital com-
panies aren’t looking to optimize existing processes – they chart new
territory by looking for new models. Now, finding new models isn’t so
easy, so these companies work via discovery and experimentation to
figure out what works and what doesn’t. Digital companies don’t have
the luxury of a well-defined target picture. Rather, they live in a world of
constant change (indicated on the right side of the diagram).

Digital companies don’t have the luxury of a well-defined target picture. Rather, they live in a world of constant change.

The difference has a fundamental effect on how IT needs to operate. On the left-hand side the core competence is being a good guesser. Good
prediction (or guessing) is needed to define a future target state and to
execute to plan afterwards. Lastly, organizations also benefit from their
scale when they recover the project investment: the bigger they are, the
better the returns. They operate based on Economies of Scale.
The right-hand side works differently. Here, instead of being a good
guesser, companies need to be fast learners so they can figure out quickly
what works and what doesn’t. They also need low cost of experimentation
and the ability to course correct frequently. So, instead of scale, they’re
living by the rules of Economies of Speed.
The models are almost diametrically opposed. Whereas the left side favors
planning and optimization, the right side favors experimentation and
disruption. On the left, deviation from a plan is undesired while on the
right side it’s the moment of most learning. But even more profoundly,
the two sides also use a different language.

Absolutes in a Changing World


Because traditional IT is based on the assumption that change is tempo-
rary and the normal state is defined by stability, people living in this model
think and speak in absolute values. They manage via headcount, time, and
budgets: “the project takes 10 engineers for 6 months and costs 1.5 million
dollars”; “the server hardware costs 50,000 dollars plus another 20,000 for
licenses”. Absolute values give them predictability and simplify planning.

Organizations operating in constant change think and speak in relative values because absolutes have little meaning for them.

For the people who adopted a world of constant change, absolutes aren’t
as useful. If things are in constant flux, absolutes have less meaning - they
change right after having been measured or established.
The illustration below demonstrates how in a static world we look
at absolute positions. When there’s movement, here drawn as vectors
because I am not able to animate the book, our attention shifts to the
direction and speed of movement.

Static view (left) vs. Movement (right)

The same is true in IT. Hence, the folks living in the world depicted on the
right speak in relative values. Instead of budget, they speak of consump-
tion; instead of headcount they are more concerned about burn rate, i.e.
spend per month; instead of fixating on deadlines, they aim to maximize
progress. Those metrics represent rates over time, mathematically known
as the first derivative of a function (assuming the x axis represents time).

Cloud Speaks Relatives


Because these organizations think in relative terms, or the first derivative,
cloud is a natural fit for their mental model. That’s because the cloud
operates in a consumption model: the server is no longer $50,000 but
three dollars an hour. Instead of absolute investments, we speak in rates
per hour or per second. Hence, the cloud model makes perfect sense for
organizations living in the first derivative because both speak the same
language. That explains why virtually all of them operate in the cloud.

Cloud speaks in the first derivative, pricing hardware in rates per time unit, not absolute investment.
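To put the slogan into symbols (a rough illustration using the figures mentioned above, not a price comparison of any particular provider): if S(t) denotes cumulative infrastructure spend, traditional IT budgets the absolute value of S, while cloud prices its first derivative, the burn rate:

    \text{burn rate} = \frac{dS}{dt}, \qquad
    \frac{dS}{dt} = \$3/\text{hour} \;\Rightarrow\; S(\text{one year}) \approx 3 \cdot 24 \cdot 365 \approx \$26{,}280

The roughly $26,000 only accrues if the server actually runs all year; the $50,000 purchase, by contrast, is spent on day one whether the capacity is used or not.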

It’s clear that the cloud providers didn’t exactly invent the data center. If
all they had done was build a better data center, they’d be traditional IT
outsourcing providers, not cloud providers. What cloud did fundamentally
change is allowing IT infrastructure to be procured based on a consump-
tion model that’s a perfect fit for a world of constant change. Cloud also
made provisioning near-instant, which is particularly valuable for folks
operating in economies of speed.

Relatives Reduce Stress And Slack


Working with relatives has another significant advantage. No, I am not talking about hiring the in-laws but about embracing the first derivative,
happens if the prediction for a project timeline comes out short? We
deny reality and embark on a death march. At least the opposite should
be alright, then? What if we are about to come in under budget? Well,
we need to find ways to spend money quickly and create busywork lest
anyone finds out and cuts the budget for next time.
The first derivative, measured in velocity or burn rate, doesn’t lead to such
project travesties. All we know is that faster progress and a lower burn rate
are better, so we’re constantly optimizing without burning or slacking. So,
working with relative figures does indeed reduce stress.

New Meets Old


Traditional organizations adopting the cloud’s consumption model can
face some unexpected headwind, though. Their IT processes have been
built around a traditional up-front planning model for decades. IT depart-
ments are asked to estimate their infrastructure sizing and cost for years
into the future. Such a process misses the cloud’s key capability, elasticity,
which enables scaling infrastructure up and down as needed. Boxing that
capability into a fixed budget number makes about as little sense as
requiring someone to run waving a flag in front of a motor car² - a clear
sign that an outdated process hinders modern technology.
A common concern about cloud frequently voiced in traditional organi-
zations is spend control: if your infrastructure scales up and down at will,
how do you predict and cap costs? First-derivative organizations think
differently. It’s not that they have money to give away. Rather, they realize
that for a certain hardware cost of x dollars they can serve n users and
generate y dollars in business value. So they don’t have a problem paying
double for twice as many users and twice as much business benefit! For
them increasing IT spend is a sign of business success (and cost often won’t
actually double).
²https://en.wikipedia.org/wiki/Locomotive_Acts

This apparent conflict in thinking between budget predictability and value generation is one of the reasons that cloud implies an IT lifestyle change.
It also hints at why increasing your run budget may be a good thing.
3. Wishful Thinking isn’t a Strategy
Wishes are free, but rarely come true.

When organizations speak about their journey to the cloud, you’ll hear
a lot of fantastic things like dramatic cost savings, increased velocity, or
even a full-on business transformation. However exciting these might be,
in many cases they turn out to be mere aspirations, wishes to be a bit more
blunt. And we know that wishes usually don’t come true. So, if you want
your cloud journey to succeed, you better have an actual strategy.

Learning from Real Life


We all have aspirations in our life. For many people, the list starts with
a baseline of health and a happy family, perhaps an interesting job, and
topped off with some nice items like a stately home, a fancy car, and, if
you’re especially lucky, a nice boat.
I know - my favorite 43’ powerboat was on my wish list long enough
for the British manufacturer to go belly up in the meantime (better than
the boat, I suppose). Apparently I wasn’t the only one whose wishes didn’t
turn into the reality of actually buying one. On the upside, with that much
money “saved”, I now feel entitled to dream of a fancy sports car!

Wishing for things is free. Making them come true isn’t.

While it’s OK and actually encouraged to have ambitions, we need to be careful that these ambitions don’t remain wishes. Wishes are popular
because it’s easy to have them.

Strategies
To have a chance at making your wishes come true, you’ll need a strategy.
Interestingly, such strategies come in many different shapes. Some people
are lucky enough to inherit money or marry into a wealthy family. Others
play the lottery - a strategy that’s easily implemented but has a rather
low chance of success. Some work hard to earn a good salary and there
are those who contemplate a bank robbery. While many strategies are
possible, not all are recommended. Or, as I remind people: some strategies
might get you into a very large house with many rooms. It’s called jail.
Clayton Christensen, author of The Innovator’s Dilemma¹, gives valuable
advice on strategy in his book How Will You Measure Your Life?². Those
strategies apply both to enterprises and personal life, and not just for
cars and power boats. Christensen breaks a strategy down into three core
elements:

You need to set priorities
They tell you which direction to take at the next decision point
when you’re faced with making difficult trade-offs.
You need to balance your plan with opportunities that might come
along
Sometimes a course correction makes more sense than holding
steady. On the other hand, a strategy that changes every week likely
isn’t one.
You need to allocate resources along with your strategy
The strategy is only as good as the execution. One of the most
valuable resources to invest in your personal life is time, so allocate
it wisely.

Although large companies have a lot of resources and can afford to have
a broad strategy, all three elements still apply. Even the most successful
companies have limited resources. A company with unlimited funds,
should it exist, would still be limited by how many qualified people it can
hire and how fast it can onboard them. So, they still need to set priorities.
Hence, a strategy that doesn’t require you to make difficult trade-offs
likely isn’t one. Or as Harvard Business School strategy guru Michael Porter aptly pointed out:
¹Clayton Christensen, The Innovator’s Dilemma, 2011, Harper Business
²Clayton Christensen, How Will You Measure Your Life?, 2012, Harper Business

Strategy is defined by what you’re not doing.

Many businesses that do have a clearly articulated strategy fall into the
next trap: a large gap between what’s shown on the PowerPoint slides and
the reality. That’s why Andy Grove, former CEO of Intel, famously said
that “to understand a company’s strategy, look at what they actually do
rather than what they say they will do.”

Strategy = Creativity + Discipline


IT strategies can be as diverse as strategies in real life, perhaps minus
robbing banks. Some organizations go “all in” and move all their IT assets
to the cloud in one fell swoop while others test the waters by migrating
just a few applications so they can learn how to get the most out of the
cloud. The former organization might realize value sooner but the latter
can ultimately obtain more value. Of course, you could also think about
combining the two, but that’d imply a longer path.
Building a meaningful strategy relies on both creativity and decision-
making discipline, two traits that are often considered at odds with
one another. One requires you to think big and brainstorm whereas the
other needs you to make conscious trade-offs and lay out a plan. So, you’d
want to explore many options and then carefully select those that you will
pursue.

To define a strategy you want to explore many options and then carefully select those to pursue.

I refer to this process as expand and contract - broaden your view to avoid
blind spots and then narrow back towards a concrete plan. Many of the
decision models in the later chapters in this book (see Part IV) are designed
to help you with this exercise. They define a clear language of choices that
help you make informed decisions.
Once you have a strategy, the execution is based on making conscious
decisions in line with the strategy, all the while balancing trade-offs.
Principles help you make this connection between your execution and
your vision. The next chapter shows how that works.

Strategy = Setting the Dials


I often explain the exercise of setting an IT strategy as follows. You’re
essentially looking at a large black box with a complex machinery inside.
That machinery is your IT, which can do amazing things but also has a
lot of moving parts. What you need to do is to identify the levers and
dials on this box. They tell you what choices you have. Next, you need to
develop a rough understanding of how these levers influence the behavior
of the box and how they’re interrelated. Not all levers can be set independently - you can’t simply turn all dials to “10”. Therefore, you
next need to decide the setting for each lever - those reflect the strategic
direction you set. To make things ever more interesting, the levers don’t
remain in the same position over time but rather have to be adjusted as
time progresses.

Strategy is like setting dials on a complex circuit

For example, one dial might define how to distribute workloads across
multiple clouds. Many organizations will want to turn this dial to “10”
to gain ultimate independence from the cloud provider, ignoring the fact
that the products promising that capability are virtually all proprietary
and lock you in even more. Hence, turning this dial up will also move
another dial. That’s why setting a strategy is interesting but challenging.
A real strategy will need to consider how to set that dial meaningfully
over time. Perhaps you start with one cloud, so you can build the required
skills and also receive vendor support more easily. You wouldn’t want to
paint yourself into a corner completely, so perhaps you structure your
applications such that they isolate dependencies on proprietary cloud
Wishful Thinking isn’t a Strategy 25

services. That’d be like turning the dial to “2”. Over time, you can consider
balancing your workloads across clouds, slowly turning up the dial. You
may not need to shift applications from one cloud to the next all the time,
so you’re happy to leave the dial at “5”.
To make such decisions more concrete, this book gives you decision
models that put names and reference architectures behind the dials and
the numbers. In this example, the chapter on multi-cloud lays out five
options and guides you towards dialing in the right number.
If this approach reminds you of complex systems thinking, you are spot
on. The exercise here is to understand the behavior of the system you’re
managing. It’s too complex to take it apart and see how it all works, so
you need to use heuristics to set the controls and constantly observe the
resulting effect. The corollary to this insight is that you can’t manage what
you can’t understand³.

Goals
As Christensen pointed out, execution is key if you want to give your
strategy a chance of leading to the intended (wished for) result. While
you are executing, it isn’t always easy to know how you’re doing, though
- not having won the lottery yet doesn’t have to mean that there was
no progress. On the other hand, just reading fancy car magazines might
make your wishes more concrete, but it doesn’t accomplish much in terms
of being able to actually buy one.
So, to know whether you’re heading in the right direction and are making
commensurate progress, you’ll need concrete intermediate goals. These
goals will depend on the strategy you chose. If your strategy is to earn
money through an honest job, a suitable initial goal can be earning a
degree from a reputable educational institution. Setting a specific amount
of savings aside per month is also a reasonable goal. If you are eyeing
a bank vault, a meaningful goal might be to identify a bank with weak
security. Diverse strategies lead to diverse goals.

One concrete goal derived from my “work hard to be able to buy a boat” strategy was to get a motor boat license as
a student - I had time and they don’t expire. So, if my boat
wish ever comes true, I’ll be ready to get behind the helm.
³https://www.linkedin.com/pulse/you-cant-manage-what-understand-gregor-hohpe/
Wishful Thinking isn’t a Strategy 26

While goals are useful for tracking progress, they aren’t necessarily the
end result. I still don’t have a boat. Nevertheless, they allow you to
make concrete and measurable progress. On the upside, my strategy of
pursuing a career in IT met the opportunity to live and work on three
continents. Following Christensen’s second rule, I happily embraced those
opportunities and thus achieved more than I had set out to do.

IT Wishes
Returning to IT, given how much is at stake, we'd think that IT
is full of well-worked-out strategies, meticulously defined and tracked
via specific and achievable goals that are packaged into amazing slide
decks. After all, what else would upper management and all the strategy
consultants be doing?
Sadly, reality doesn’t always match the aspirations, even if it’s merely
to have a well-laid-out strategy. Large companies often believe that they
have sufficient resources to not require laser-sharp focus. Plus, because
business has been going fairly well, the strategy often comes down to
“do more of the same, just a bit better.” Lastly, large enterprises tend to
have slow feedback cycles and limited transparency, both of which hinder
tracking progress and responding to opportunities.

Cloud Wishes = Appealing but Unrealistic
I have spent a good part of the last five years discussing and developing
cloud strategies for my employers and my clients. When working with
large organizations, I have observed how often wishing takes precedence
over developing a clear strategy.
Organizations (or sometimes vendors) promise a straight-out 30% cost
reduction, 99.99% uptime for even the oldest applications, and a full-on
digital transformation across the whole organization. Like all wishes, these
objectives sound really, really good. That’s a common property of wishes
- rarely does someone wish for something crappy or uninteresting.
That’s why wishes are a very slippery slope for organizations. “Outcome
based management” can degenerate into selecting unsubstantiated wishes
Wishful Thinking isn’t a Strategy 27

over well-balanced strategies, which naturally won’t be able to match the


results projected by sheer aspiration. Soon after, the wishing contest turns
into a lying contest to get initiatives funded. By the time reality hits, the
folks having presented their wishes will have moved on to the next project
and likely blame lack of execution for the wishes not having come true.

Be careful to not fall for wishes that promise amazing outcomes. Demand to see a viable path to results, intermediate
goals, and conscious trade-offs.

It’s useful to have wishes - at least you know why you are going to the
cloud in the first place. You just need to make sure that these objectives
are supplemented by a viable plan on how to achieve them. They should
also be somewhat realistic to avoid preprogrammed disappointment.
Calibrating your wishes and pairing them up with a viable path is the
essence of defining a strategy.

Transformation Doesn’t Have a SKU


Making trade-offs can be challenging, especially for wealthy organiza-
tions. They’re so used to getting it all that they believe it’s all just a
matter of securing sufficient funding. Such organizations resemble spoiled
children who are used to getting any toy they wish for. Usually, their room
is so full of toys that they can’t find anything anymore. I have seen many
IT environments that look just like that - you can surely picture the CIO
looking for his or her blockchain among all the other IoT, AI, RPA, and
AR initiatives.
One critical lesson for such organizations, discussed in more detail in 37
Things, is that an IT transformation isn’t something you can buy with
money - it doesn’t have a SKU⁴. Rather, transformation forces you to
question the very things that helped you become successful in the past.
Ironically, the more successful an organization has been, the more difficult
this exercise becomes. Wishing for a change certainly has a low likelihood
of success.
⁴Stock-Keeping Unit, meaning an item you can buy off the shelf
Wishful Thinking isn’t a Strategy 28

Beware the Proxy


In personal life, when you are looking for ways to get closer to realizing
some of your wishes, it’s easy to fall for products that are offered as
substitute stepping stones. In your pursuit to become an athlete, you
are lured by fancy sneakers, perhaps bearing your idol’s signature; your
path to a fancy lifestyle seems lined with brand-name suits, dresses, and
handbags, which might even sport the logo of your favorite sports car.
Sadly, buying Michael Jordan branded sneakers isn’t any more likely to
make you a professional athlete than buying a Ferrari cap makes you an
F1 driver. While they might be decent shoes, they are merely a proxy
for a real goal like winning the next match or going for a workout every
morning.
IT is littered with such proxies. They also have well-known labels and
price tags many orders of magnitude higher than even the fanciest
sneakers, which can actually run into the thousands of dollars. To make
matters worse, IT getting far ahead of itself by buying fancy “sneakers”
doesn't just look silly, it's also outright dangerous.
4. Principles Connect
Strategy and Decisions
It’s not the destination. It’s the turns you took.

Turning wishes into a viable strategy is hard work. The strategy has to be
both forward-looking and executable, presenting an interesting tension
between the two. A well-considered set of principles can be an important
connecting element.

A Strategy You Need


Developing a cloud strategy is significantly harder than stating aspira-
tions, another corporate euphemism for wishes. For example, are you go-
ing to move a specific class of applications to the cloud first? Will you “lift-
and-shift” or re-architect applications to take full advantage of the cloud?
Or perhaps use different strategies for different types of applications? If
your organization is global, will you migrate by geographic region? Or by
business unit? What about continents where cloud infrastructure is still
sparse? Will you wait for cloud providers to build out their infrastructure
or deploy an on-premises cloud in the meantime? Can you get rid of some
applications altogether, replacing them with a SaaS offering? There are
so many decisions to be made that this book contains a whole chapter
on Moving to the Cloud. And the inherent trade-offs even lead us to
Pythagoras’ Theorem.

Decisions Define the Journey


When embarking on a journey, the defining element isn’t the destination
but the turns you take along the way. Are you going for the scenic route,
staying on the freeway, or taking a risky shortcut? All are viable options, and
the option you take will depend on your preferences. However, in most
cases you'd want to have some consistency in decision making. Perhaps
you prefer to take the freeway because you have a long way to go, allowing
just a short scenic detour where there’s a major sight not too far from the
freeway.
37 Things devotes an entire chapter to establishing the connection be-
tween architecture and decisions. A picture with boxes and lines could
depict an architecture but it could also be abstract art. As an architect,
what you’re interested in are the decisions that were made, especially the
ones that aren’t obvious or trivial.
I use a simple drawing to highlight the importance of decisions. The
sketch on the left shows a boilerplate house. The one on the right is quite
similar but includes two critical decisions that define the roof’s angle and
overhang:

Decisions define architecture

Those decisions define the architecture, despite the sketch remaining
crude. For those who think that pointy roofs aren’t a particularly signif-
icant decision, I might point to Japan, a country with surprisingly heavy
snowfall in some regions, where the houses in Shirakawago in the Gifu
prefecture obtained World Heritage status thanks to their unique design
with pointed roofs. These houses also highlight that the architecture
decisions are only meaningful in a given context - you won’t find such
houses by the beach.

Strategy Informs Decisions


In my architecture workshops, which discuss the role of the architect in
complex enterprise settings, I share the following diagram:

Strategy and principles provide decision consistency

The diagram highlights that consistency in decision making can be
achieved by basing decisions on principles, which are in turn derived from
an overarching strategy. The strategy lays out the overall direction you are
heading in and how you want to go about achieving it. In a sense, it sets
the frame and defines important boundaries.

An important first step of any strategy is defining a set of guiding principles.

Principles make this strategy more concrete and applicable to making
decisions, which in turn define your architecture. Hence one of the
first steps when laying out a strategy will be defining a set of guiding
principles. It won’t be easy, but skipping this step will simply defer and
multiply the pain for later on.

Principles for Defining Principles


Defining meaningful principles isn’t an easy task. Principles easily slip
back into wishful thinking by portraying an ideal state rather than
something that makes a conscious trade-off. I haven’t been able to distill a
simple litmus test for good principles but I have come across enough bad
ones that I have compiled a few tests for meaningful principles:

• If the opposite of a principle is nonsense, it's likely not a good one.
Everyone wants happy customers and a high quality product, so
putting them in a principle won’t guide a lot of decisions.
• Principles that include product names or specific architectures
(usually in the form of buzzwords) run the risk of being decisions that
wanted to be elevated to principles - an unwelcome form of reverse
engineering.
• Principles should pass the test of time. No one is a clairvoyant, but
it helps to imagine whether the principle is still likely to make sense
in a few years’ time. A good proxy can be looking back three years
and finding poor examples.
• It helps if the list of principles employs parallelism, meaning they
apply the same grammatical construct. For example it’s useful if all
principles consist of full sentences or just nouns. Mixing the two
will make them look awkward.
• Principles should be memorable. Few teams will walk around with
the list of principles on a cheat sheet, making principles that no one
remembers unlikely to influence many decisions. If you’re placing
the principles on motivational posters plastered on the office walls
you’re likely kidding yourself.
• Although there is no magic count for the number of principles, less
than a handful might be sparse whereas more than a dozen will be
difficult to remember.

If you’re looking for an academic foundation for using principles, there’s


a short paper titled Impact of Principles on Enterprise Engineering¹.
The part I found most useful is how principles are positioned as the
connecting elements between root causes and outcomes. There’s also
a whole book titled Architecture Principles² that includes a catalog of
¹https://www.researchgate.net/publication/43205891_Impact_of_Principles_on_
Enterprise_Engineering
²Greefhorst, Proper, Architecture Principles, 2011, Springer
candidate principles (not all of which would pass my test above) but
otherwise leans a bit heavily on TOGAF-flavored enterprise architecture.

Looking for Inspiration


One company famous for being guided by principles is Amazon, who
cherishes their 14 Leadership Principles³ so much that they accompany
employees from the initial interview to performance appraisals. Now
you might wonder whether Amazon’s first and most famous leadership
principle, Customer Obsession, violates my first guideline above. I’d say it
doesn’t - the principle goes far beyond just having “satisfied customers”.
Instead, it mandates that “Although leaders pay attention to competitors,
they obsess over customers.” There’s clearly a trade-off here: you glance
left and right at your competition but you have the customer in firm focus.
The opposite would also be meaningful and is widely practiced.
To gain some inspiration on how to develop sound principles, I tend to
refer to Agile methods, one of the simplest but most profoundly misunder-
stood approaches in software development. The Agile Manifesto⁴ defines
four core values that are augmented by 12 principles. Each principle is
stated in a short paragraph of one or two sentences, for example:

“Welcome changing requirements, even late in development. Agile
processes harness change for the customer’s competitive advantage.”
“Deliver working software frequently, from a couple of weeks to a couple
of months, with a preference to the shorter timescale.”
“Working software is the primary measure of progress.”

Although the principles aren’t entirely parallel in structure, they do


place an important emphasis and pass our tests above. For example,
many organizations loathe late changes because they are expensive and
annoying. Such places would adopt the opposite principle.
A related structure can be, not quite coincidentally, found in Kent Beck’s
classic book Extreme Programming Explained⁵. Kent progresses from a
³https://www.amazon.jobs/en/principles
⁴https://agilemanifesto.org/
⁵Kent Beck, Extreme Programming Explained, 2nd edition, 2014, Addison-Wesley
core set of values, which relate to our definition of a strategy, to 14
principles, most of which consist of a single noun, elaborated in a short
chapter. For example, the principle of flow establishes that XP favors
“continuous flow of activities rather than discrete phases”, in contrast to
many earlier software development frameworks. Kent then progresses to
derive concrete practices from this set of principles.

Cloud Principles
Just like architecture, principles live in context. It therefore isn’t meaning-
ful to provide a list of principles to be copy-pasted to other organizations.
To give some concrete inspiration, I will share a few principles that I have
developed with organizations in the past, separating “higher-level”, more
strategic principles from more narrowly focused ones.

High-level Principles
A slogan adorning virtually all cloud strategy documents is multi cloud.
Sadly, too many of these documents state that the architects “favor multi
cloud”. If you’re lucky, there’s a reason attached, usually to avoid lock-in.
That’s all good, but it doesn’t really help guide concrete decisions, so it’s
hardly a very meaningful principle. Should I deploy all applications on
all clouds? That might be overkill. Are all clouds equal? Or do I have a
primary cloud and one or two as fallbacks? That’s where things get a lot
more interesting.
With one customer I defined the preference clearly in the following
principle:

Multi-Cloud Doesn’t Mean Uniform Cloud

As a central IT group we give business units a choice of
cloud. However, we aren’t looking to provide a uniform
experience across clouds. Each application should use the
cloud’s features, especially managed services, if they allow
us to deliver software faster and in higher quality.

Another principle that I used in the past related to pace of adoption:



We anticipate cloud evolution. Waiting can be a viable strategy.

The rate at which cloud platforms evolve marginalizes efforts
to build heavyweight platforms in house. We therefore care-
fully evaluate what we build ourselves versus simply waiting
a few months for the feature to become part of the cloud
platform.

Again, the opposite is also viable: anything that’s needed but isn’t yet
available, we build as quickly as possible. Sadly in large organizations,
“as quickly as possible” typically translates into 18 months. By then the
cloud providers have often delivered the same capability, likely orders of
magnitude cheaper. Strategy takes discipline.

Specific Principles
In other situations, when dealing with a smaller and more nimble orga-
nization, we developed more specific principles. Some of these principles
were designed to counterbalance prevalent tendencies that made us less
productive. We essentially used the principles to gently course correct -
that’s totally fine as long as the principles don’t become too tactical.

Use Before Reuse

Don’t build things that could be reused some day. Build


something that is useful now and refine it when it’s actually
used by other services.

This principle was intended to counter-balance engineers' well-intentioned
tendency to design for the future, occasionally forgetting the present need.
The principle vaguely relates to my saying that “reuse is a dangerous word
because it describes components that were designed to be widely used but
aren’t.” This principle could apply well to a lot of platform initiatives.

Design from front-to-back

Don’t build elaborate APIs that mimic the back-end system’s


design. Instead, build consumer-driven APIs that provide the
services in a format that the front-ends prefer.

During a transition we wanted to design interfaces such that they reflect
where we are heading, not where we came from. Easy-to-use APIs
also have a higher chance of being widely used, so they don’t remain
“reusable”.
To make the principles more punchy and memorable, each of them
received a clever slogan:

• Use Before Reuse: “Avoid gold plating.”


• Design from front-to-back: “Customer-centricity also applies to
APIs.”

We tested the set of ten principles in a group setting and managed to
reconstruct it from collective memory.
One could consider applying a MECE-test (Mutually Exclusive, Collec-
tively Exhaustive) to the set of principles. Although I can see that as a
useful intellectual exercise, I don’t feel that principles have to live up to
that much structure. For example, I am generally fine with principles plac-
ing an emphasis (and hence not being collectively exhaustive). However,
if your set of principles leaves a giant blind spot, a MECE check can help
you identify that.

Dangerous Disconnect - The Hourglass


When defining strategies, one clear and present danger is having a nice
set of high-level principles and then jumping into decisions or project
proposals without a clear linkage between the two. This problem typically
manifests itself in presentations that start by pitching colorful buzzwords,
supported by a long list of purported benefits, before suddenly jumping to
propose the purchase of a specific tool or the approval of a large project.
The logical connection between the two remains very thin. I refer to such
presentations or arguments as the “hourglass”.
You can detect the hourglass from the following recurring pattern. The
talk starts with an elaborate list of benefits and buzzwords. However, as
soon as you are enamored with wanting to become digital, cloud native,
cloud first, data driven, agile and lean, a rather foggy and confusing
segment follows (often aided by incomprehensible diagrams). Finally, the
thin trickle in the middle leads to a conclusion of a significant headcount
and funding request. How the story got from A to B is left as an exercise
for the reader.

The IT Presentation Hourglass

A good strategy brings value by making a strong connection between the
desirable outcome for the business and the detailed decision. An hourglass
is rather the opposite, so watch out for this common antipattern.
5. If You Don’t Know How to
Drive…
…buying a faster car is the worst thing you can do.

Window shopping is a popular pastime: it doesn't cost very much and
you can look at many exciting things, be it luxury fashion or exotic cars.
Sometimes window shopping lures you into actually buying one of these
shiny objects just to find out they weren’t really made for the average
consumer. Or, to put it another way: the average driver is likely better off
with a Golf (or perhaps a BMW) than the latest Lamborghini. The same
holds true for corporate IT when it goes shopping for cloud¹.

Shiny objects can make you blind


Favoring a buy-over-build model, IT spends a fair amount of time shop-
ping around for solutions. In the process, enterprises can compare vendor
solutions, conduct evaluations, and also learn quite a bit. Of course,
looking for new IT solutions is a bit like window shopping for cars,
clothing, or real estate. For a moment you can break away from the
constraints of reality and get a taste of the life in luxury: the racy two
seater (that doesn’t have room for the kids), the fancy dress (that’s not
washable and a bit too sheer), and the exquisite country home (that’s half
an hour’s drive away from the next store). All these have their lure but
ultimately give way to the pressures of reality. And that’s generally a good
thing, unless you’re keen to take out the two seater wearing an evening
dress to get milk (I guarantee it will get old).
So, when looking at products, IT or not, we are well advised to separate the
“being enamored by shiny objects” from the “let’s actually buy something
that serves our needs” mode. While the former surely is more fun, the
latter is more important.
¹The analogies in this chapter refer to several traditional gender role models. I assure
the readers that these are intended purely for sake of metaphor and don’t indicate any
endorsement in either direction by the author.
If You Don’t Know How to Drive… 39

Capability ≠ Benefit
A critically important but often neglected step when evaluating products
is the translation of a tool’s abstract capability into concrete value for
the organization. Some systems that can scale to 10,000 transactions per
second are as impressive as a car with a top speed of 300 km/h. However,
they bring about as much benefit to a typical enterprise as that car in a
country which enforces a 100 km/h speed limit. If anything, either will
get you into trouble. You either end up with a speeding ticket (or in
jail) or stuck with overpriced, highly specialized consultants helping you
implement a simple use case in an overly complex tool.

Not every tool feature translates into concrete value for
your organization. Fancy features that you don’t need just
mean you’re paying for something that doesn’t have any
value to you.

Naturally, vendors benefit from us ogling at shiny objects. It gives them
a shot at selling us that two-seater. A common sales technique for this
goes as follows. The salesperson leads us into describing ourselves as more
sophisticated/modern/stylish than we really are. The next step then is the
suggestion to buy a product that’s suitable to a person or company of
that inflated standing. This technique is nicely illustrated in Bob Cialdini’s
classic Influence²: a young woman comes to survey his social habits, which
he naturally overstates a fair bit. Next, she builds a rational business case
for him to sign up for some form of social club membership that gives
discounts for folks who frequently go out. Or, perhaps closer to (enterprise)
home: “organizations like yours who are successfully moving to the cloud
realize substantial savings by making a larger up-front commitment” -
wanna tell them that your cloud migration is going a bit more slowly
than anticipated?
²Robert Cialdini, Influence: The New Psychology of Modern Persuasion, 1984, Quill
If You Don’t Know How to Drive… 40

The best tool is the one that suits your level
Fancy tools need fancy skills, meaning a product that matches your capa-
bilities is best for you. Or, put another way, if you’re a poor driver, about
the dumbest thing you can do is buy a faster car. It’ll only make a crash
more likely and more expensive - probably not what you were looking for
unless you’re angling to be immortalized in YouTubes amazingly popular
“supercar fails” channels. It seems that Schadenfreude drives viewership,
but perhaps not quite to the point of making the economics work out. So,
first become a better driver and consider upgrading once you’ve maxed
out your vehicle’s capabilities.

The one time I tested my driving skills on the Nürburgring Nordschleife in an (admittedly underpowered) BMW 1-
series rental car, I was passed by a VW Golf and a (surely
overpowered) minivan. A faster car wouldn’t have helped
much except get me into trouble.³

Back in IT the same applies. Transformation takes place by changing
an organization’s assumptions and operating model. There is no SKU
for transformation - it’s not something that you can buy. Better tools
help you work in better ways (you probably won’t take the Yugo to the
Nürburgring), but there’s a cycle of continuous improvement where both
tools and capabilities incrementally ratchet up.

Look for products that generate value even if you only use
a subset of features, thus giving you a “ramp” to adoption.

When procuring IT tools, organizations therefore shouldn't pick those
products that have the longest list of features, but those that can grow with
them, e.g. by providing value even if you use only a subset of features.
I refer to this capability as affording incremental value delivery. Or to
quote Alan Kay: “Simple things should be simple, complex things should
be possible.”
³I do get my redemption, however, every time I pass someone riding a much more
expensive mountain bike.
If You Don’t Know How to Drive… 41

Giant leaps don’t happen


As our capabilities improve, isn’t it OK to buy a tool one or two sizes
ahead of where you are now? It’s like buying kids’ shoes half a size bigger
- they’ll be the right size in just 2 months. On the other hand, wanting to
have an adult racing cycle when you’re just learning how to balance is
dangerous.
The same holds true for IT tools. A unified version control system and
some form of continuous integration are a great step forward towards
accelerated software delivery even if you still have to work on shortening
your release cycles and improving your test coverage. A fully automated
build-deploy-kubernetes-container-helm-chart-generator setup is likely
less helpful - it’s a great aspiration and may get you some bragging rights
but is likely a bit too much to bite off in one sitting for traditional IT.
So, when looking at tools, have a clear progression in mind and a balanced
view on how far you can jump in one setting. If you’re a bit behind it’s ever
so tempting to just make one big leap to catch up. The reality is, though,
that if a small step is difficult to make, a giant jump is guaranteed to end
up in utter failure.

Organizations that have fallen behind are tempted to make
one giant leap ahead to catch up. Sadly, this is the most
unlikely path to success.

IT’s inclination to want to look too far ahead when procuring tools often
isn’t the results of ignorance but of excessive friction in approval pro-
cesses. Because procuring a new product requires elaborate evaluations,
business cases, security reviews, and so on, it’d be very inefficient to start
with a simple tool and upgrade soon after - you’d just be doing the whole
process over again. Instead, teams are looking for a tool that’ll suit them
three years down the road, even if it’s a struggle for the first two. So,
systemic friction is once again the culprit for much of IT’s ailments.
If You Don’t Know How to Drive… 42

You can only sell people what they don't already have
There’s a common assumption that products are made and sold to address
an unmet customer need. However, I don’t find that equation to be quite
that simple. In many cases it appears that demand is first created just so
it can be subsequently addressed. Allow me to try another analogy from
daily life.
Having lived and worked on three continents I have observed how the
fashion and beauty industries set quite different, and often opposing,
ideals in different regions. In Asia, it’s desirable to be fair skinned (a heavy
tan makes you look like a farmer, supposedly), whereas in Europe the
number of tanning studios is only barely outdone by fitness centers (being
tanned means you can afford an exotic beach vacation or perhaps at least
a tanning studio). Asian women often wish they were more full-figured
while everyone else seems to be trying to lose weight and become more
petite. The list of contrasting goals continues to include high cheekbones
(European models tend to have them, Asian models usually don’t), wide
or narrow noses, round or almond shaped eyes, and so on.
While body image is always a delicate topic, I’ve personally come to derive
two conclusions from this:

1. Be happy being whoever you are regardless of what's on the
billboards.
2. People who sell you stuff will promote an ideal that’s rare and
difficult, if not impossible, to achieve. See #1.

Back in IT, we can find quite a few analogies. While your enterprise is
proudly making its first steps into the cloud, you are made to believe
that everyone else is already blissfully basking in a perimeter-less-multi-
hybrid-cloud-native-serverless nirvana. Even if that was the case, it’d still
be fine to be going at your own pace. You should also be a bit skeptical
as to whether there’s some amplification and wishful thinking at play. Go
out and have a look - you might see some of your peers at the tanning
studio.
If You Don’t Know How to Drive… 43

Don’t fret if you’re not on the bleeding edge. There are


successful businesses that don’t run all their applications as
microservices in containers which they deploy 1000 times
a day. It’s OK.

Marketing isn’t reality


Technology moves fast, but technology adoption is often not quite as fast
as the readers of quarterly earnings reports or attendees of semi-annual
launch events would like. Transforming IT infrastructure and changing
the fundamental operating model may take a sizable enterprise two, three,
or even five years. In the meantime vendors have to announce new things
to stay in business.
Hence we get to see shiny demos that auto-deploy, auto-heal, auto-
migrate, and almost auto-procure (lucky for us, not quite yet). Looking
ahead at product evolution is quite useful when selecting vendors or
products, so marketing serves a valuable function (I have many good
friends in marketing and highly appreciate their work). But we mustn’t
confuse vision with reality - it’ll just lead to frustration. Marketing shares
a possible target picture - we are the ones who have to walk the path, and
that’s usually done by taking one step after another.

Does a better knife make you a better cook?
Indulge me in sharing one last metaphor. I love cooking - it serves an
important therapeutic function that lets the hands and the intuition do
their work while my brain cools off. I have spent a good bit of time in
specialty shops and department stores’ kitchen departments around the
globe - Europe being my favorite source for pots and baking supplies
while I look to Japan for knives, ceramics, wooden products, and hyper-
specialized tools like my ginger grater cleaner. Over time, the significant
price increase for basic kitchen tools, such as knives, struck me. General
inflation aside, a very good German knife used to run some 30-40 Euros.
Modern kitchen shops are now full of knives for easily five times that,
sometimes running North of 400 Euros. While there’s certainly a bit of
If You Don’t Know How to Drive… 44

priming at play (see Making Decisions in 37 Things), another reason given by the stores was a change in cooking habits.
In most Western societies the domestic role model, up until the mid-
to-late 20th century, was based on the woman taking care of food and
child rearing while men would work in the fields or later in factories and
offices. This division of labor led to knives being a tool - a thing that you
use to cut your meat or vegetables. And a 30-Euro knife managed that
just fine. Men increasingly taking an interest in cooking changed kitchen
supplies, especially knives, from being basic tools to being seen as hobby or
even vanity items. This change has enabled a completely different product
selection and pricing model - the cook is now some form of modern-age
samurai who chooses to wield his 200-times folded carbon steel blade at
tomatoes rather than enemies.
Translating my cooking experiences back to IT, I like to ask vendors
pitching their shiny tools: will buying a better knife make me a better
cook? Or will it rather put my fingers at risk? I do enjoy a good knife,
but my experience tells me that good cooking comes from practice and a
good understanding of how to prepare and assemble the ingredients. And
you could say the same of good architecture! So, buy a solid IT tool, not a
vanity item, and invest in your own skill!
Part II: Organizing for
the Cloud

Cloud is a fundamental lifestyle change for IT. Consequently, getting
the most out of cloud computing requires changes to the organization,
including departmental structures, processes, career schemes, and HR
guidelines. Moving to the cloud is hence as much an organizational as it is
a technology topic. The relationship between technical and organizational
change is one of the core themes of the Architect Elevator and applies
directly to cloud transformations. Pushing the technical agenda while
ignoring the organizational transformation is likely to lead you into what
I call the Value Gap⁴.
Even adopting cloud in its pure technological form is an organizational
change. After all, it means outsourcing a big portion of the IT respon-
sibilities. On the other hand, moving to the cloud doesn’t mean that all
operational concerns and functions disappear - quite the contrary. Organizing for the cloud impacts everything from operations to business functions, up to and including financial management.

Behavior over Structure


Organizations are often portrayed by their structure - the classic org chart.
So, many organizations wonder what new kind of structure they should
embrace when they move to the cloud: how should they slice departments
and which reporting lines should they establish? Sadly, structural changes
alone rarely bring the anticipated results. That’s because changing the
⁴https://aws.amazon.com/blogs/enterprise-strategy/is-your-cloud-journey-stuck-in-
the-value-gap/
If You Don’t Know How to Drive… 46

way of working, including the written and often unwritten processes, is just as essential to successfully moving to the cloud. For example, if
requisitioning a server requires a lengthy approval process, reducing the
actual technical provisioning time won’t make a big difference. And re-
organizing teams into squads won’t help unless the processes change to
give them autonomy.

It’s all About People


To not leave this book devoid of references to The Matrix movie trilogy,
the Matrix Reloaded includes a famous scene that has the Merovingian
comment on Neo stopping a hail of bullets mid air with “OK, you have
some skill.” Successfully migrating applications to or building applications
for the cloud also requires some skill, perhaps short of stopping bullets
with your bare hands.
Rather, technical teams need to be familiar with the wealth of products and
services being offered, account management and permissions schemes,
modern application architectures, DevSecOps, and much more. A decade
ago when much of IT staff was hired, hardly any of these concepts existed.
On one hand that’s a great level setter - everyone starts fresh. But it also
means that not learning new technologies and new ways of working puts
you at risk of quickly being left behind.

Organizational Architectures
Organizations therefore need to decide to what extent they can train
existing staff and how they are going to acquire new skill sets. Some
organizations are looking to start with a small center of excellence, which
is expected to carry the change to the remainder of the organization.
Other organizations have trained and certified everyone, including their
CEO. The happy medium might be somewhere in the middle but won’t
be the same for every kind of organization. So, as you are designing
your technical cloud architecture, you’ll also want to be defining your
organizational architecture.
If You Don’t Know How to Drive… 47

Organizing for the Cloud


Organizations built for the cloud have learned that…

• Cloud is outsourcing, but of a special kind.


• Cloud turns your organization on its side.
• Cloud requires new skills but that doesn’t mean you need to hire
different people.
• Hiring a digital hitman is bound to end in mayhem.
• Cloud gives Enterprise Architecture a new meaning.
6. Cloud is Outsourcing
And outsourcing is always a big deal.

Traditional IT relies heavily on outsourcing to third party providers
for software development, system integration, and operations. Doing
so afforded IT some perceived efficiencies but also introduced friction,
which becomes a hindrance in an environment defined by Economies of
Speed. Many enterprises are therefore bringing much of their IT back in
house. However, moving to cloud is outsourcing “par excellence”: all your
infrastructure is hosted and operated by someone else. And we’re told the
more we hand over to the cloud provider to operate, the better off we are.
How do the two ideals reconcile?

Don’t Outsource Thinking!


Classic IT is a buy-over-build environment because traditionally IT wasn’t
a differentiator for the business and thus could be safely outsourced.
Outsourcing is most prevalent in data center facilities and operations,
system integration of packaged software, and software development. In
these respective areas I have seen ratios of outsourced to in-house staff of
5 to 1 and sometimes even 10 to 1. In the extreme case all that’s left in IT
is a procurement and program management umbrella, which motivated
me to author a blog post with the intentionally controversial title Don’t
Outsource Thinking¹.
Outsourcing is driven by thinking in Economies of Scale. Because the
outsourcing provider has more people qualified to do the job and is
typically much larger (sometimes employing hundreds of thousands of
people), it’s assumed that the provider can perform the respective task
more efficiently. In return for this efficiency, IT departments give up some
amount of control as part of the contractual agreement. In an IT world
that’s perceived as a commodity, lack of control isn’t a major concern
as the provider would agree to operate based on well-understood and
acknowledged best practices.
¹https://architectelevator.com/strategy/dont-outsource-thinking/

Efficiency versus Agility


Much has changed since. The competitive environment is no longer
defined by a company’s size, but by its speed and agility. That’s why
digital disruptors can challenge incumbent businesses despite being orders
of magnitude smaller in size. Because these challengers operate at higher
velocity, they can realize faster (and cheaper) Build-Measure-Learn cy-
cles², a critical capability when charting new territory. These companies
also understand the Cost of Delay ³, the price paid for not having a product
on the market sooner.
In these Economies of Speed the benefits realized by moving fast outweigh
the efficiencies to be gained from being large. Traditional IT outsourcing
doesn’t mesh well with these rules of engagement. Driven by the quest for
efficiency gains, outsourcing contracts often span multiple years (some-
times decades!) and are bound by rather rigid contractual agreements
that specify each party’s obligations in excruciating detail. Changes are
cumbersome at best and likely expensive - change request is the most
feared phrase for anyone who has signed an outsourcing contract.

Bring’em Home
So, it’s no surprise that many IT departments are rethinking their out-
sourcing strategy. Most companies I worked with were bringing software
delivery and DevOps skills back in house, usually only held back by
their ability to attract talent. In-house development benefits from faster
feedback cycles. It also assures that any learning is retained inside the or-
ganization as opposed to being lost to a third party provider. Organizations
also maintain better control over the software they develop in house.
One challenge with insourcing is that an IT that builds software looks
quite different from one that mostly operates third-party software. In-
stead of oversight committees and procurement departments, you’ll need
product owners that aren’t program managers in disguise and architects
that can make difficult trade-offs under project pressure instead of com-
menting on vendor presentations. So, just like outsourcing, insourcing is
also a big deal and one that should be well thought through.
²Ries, The Lean Startup, 2011, Currency
³Reinertsen, The Principles of Product Development Flow, 2009, Celeritas Publishing

A recurring challenge is the apparent chicken-and-egg problem of attracting talented staff. Qualified people want to work with other highly skilled
folks, not in an environment dominated by budgeting and procurement
processes. So you’ll need a few folks to “break the ice”, who are either
brave or foolish enough (or more likely both) to sign up for a major
transformation. I know because I have been one of them - that's the story of The
Software Architect Elevator⁴.
Don’t look for talent just on the outside. Chances are you are sitting on
unrealized potential: motivated employees who are held back by rigid
processes and procedures. Take a good walk around the engine room or
invite people to share some of their pet projects. You may find some gems.

Digital Outsourcing
Strangely, though, while traditional companies are (rightly) reeling things
back in, their competitors from the digital side appear to do exactly the
opposite: they rely heavily on outsourcing. No server is to be found on
their premises as everything is running in the cloud. Commodity software
is licensed and used in a SaaS model. Food is catered and office space
is leased. HR and payroll services are outsourced to third parties. These
companies maintain a minimum of assets and focus solely on product
development. They’re the corporate equivalent of the “own nothing”
trend.

Outsourcing a la Cloud
Cloud isn’t just outsourcing; it might well pass as the mother of all
outsourcings. Activities from facilities and hardware procurement all
the way to patching operating systems and monitoring applications are
provided by a third party. Unlike traditional infrastructure outsourcing
that typically just handles data center facilities or server operations,
cloud offers fully managed services like data warehouses, pre-trained ma-
chine learning models, or end-to-end software delivery and deployment
platforms. In most cases you won’t even be able to inspect the service
provider’s facilities due to security considerations.
⁴https://architectelevator.com

So, what makes outsourcing to the cloud so different from traditional
outsourcing? Despite being similar in principle, cloud excels by providing
all the important capabilities that traditional outsourcing could not:

Full Control

In a cloud environment, you get exactly the infrastructure
that you specify via an API or a visual console. Hence,
you have full control over the types and sizes of servers,
the network compartments they’re part of, which operating
system they run, etc. Instead of being bound by a restrictive
order sheet that attempts to harvest efficiencies by restricting
choice, cloud gives you utmost flexibility. With that flexibil-
ity also comes some responsibility, though. More on that a
little later.

Transparency

Cloud also gives you complete transparency into service levels, system status, and billing details. Except in rather unusual
circumstances there is no “blame game” when something
goes wrong - the transparency protects both sides.

Short-term Commitments

Elastic billing is a key tenet of cloud computing. Although
cloud providers do encourage (and incentivize) large multi-
year commitments, you’re essentially paying for usage and
can scale down at will. Shifting your workloads to another
cloud provider is often more of a technical consideration than
a legal one.

Evolution

Terms that are set in stone are a significant downside of
traditional outsourcing. Folks who signed an outsourcing
contract in 2015 might have felt that provisioning a server
within several days was acceptable and that a server with
more than 512 GB of RAM would never be needed. Five
years later, we complain if a new server takes more than a
few minutes to spool up while memory-hungry applications
like SAP HANA happily consume Terabytes of RAM in the
double-digit range. Updating existing outsourcing contracts
to include these shifts will be time consuming and expensive.
In contrast, cloud providers release new services all the time
- at the time of this writing AWS could provision 8-socket
bare-metal servers with 24TB RAM.

Economics

Cloud providers routinely reduce prices for their services.
AWS, the longest-running of the major cloud providers,
boasts of having slashed prices 70 times between 2006 and
2020. Try that with your outsourcing provider.

So, while cloud is indeed outsourcing, it redefines what outsourcing actually means. This major shift has become possible through a combination of
technical advancements such as software-defined infrastructure and the
willingness of a provider to break with a traditional business model.

Once Again, It’s the Lines!


Those who have read 37 Things know that I am proud to reject any pur-
ported architecture diagram that contains no lines. Lines are important,
more important than the boxes in my view, because they define how
elements interact and how the system as a whole behaves. And after all,
that’s what architecture is all about. So: “no lines, no architecture.”
The same is true for the relationship between your organization and your
infrastructure. If you were going to draw a simple diagram of you and your
(cloud or traditional) outsourcing provider you’d see two boxes labeled
“you” and “them”, connected by a line. Using a cloud platform would not
look much different:

Comparing traditional outsourcing with cloud

However, what those lines entail makes all the difference. On the left side
it’s lengthy contracts and service requests (or emailing spreadsheets for
the less fortunate) that go into a black hole from which the requested
service (or something vaguely similar) ultimately emerges. Cloud in
contrast feels like interacting with a piece of commercial software. Instead
of filing a service request, you call an API, just like you would with
your ERP package. Or you use the browser interface via a cloud console.
Provisioning isn’t a black hole, pricing is transparent. New services are
announced all the time. You could call cloud platforms a service-oriented
architecture for infrastructure.
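
To make that difference tangible, here is a minimal sketch of what “calling an API” looks like in practice. It uses the AWS SDK for Python (boto3); the region, AMI ID, and tag values are placeholder assumptions, and every major cloud offers equivalent SDKs, command-line tools, and consoles.

    # A minimal sketch: provisioning a server with a single API call via the
    # AWS SDK for Python (boto3). The AMI ID, region, and tag values are
    # placeholders - adjust them to your own account.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-central-1")

    # What used to be a service request form is now one self-service call.
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder image
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "team", "Value": "payments"}],
        }],
    )
    print(response["Instances"][0]["InstanceId"])

The point isn't the specific call but the interaction model: a self-service request with an immediate, machine-readable response instead of a ticket disappearing into a queue.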
This simple comparison highlights that organizations who take a static
view of the world won’t notice the stark differences between the models.
Organizations that think in the first derivative, however, will see that
cloud is a lifestyle changer.

Core vs. Non-Core


A fundamental difference between outsourcing to a cloud and outsourcing
software delivery hinges on the notion of what in IT is core vs. non-
core. In much of traditional IT operations is king: the quality of IT is
defined by its operational excellence. As IT shifted from a cost-driven
commodity to becoming a differentiator and innovation driver, though,
software delivery moved into the limelight. That’s why it should be held
in house as you would do with any other core asset. That’s also why you
should not consider cloud an infrastructure topic.

Outsourcing a Mess is a Bigger Mess


There’s an old rule in outsourcing: if your IT is a mess, outsourcing will
make it only worse. That’s so because messy, largely manual IT is able
to function only thanks to an extensive Black Market (see 37 Things).
Black markets aren’t just inefficient, they also can’t be outsourced: Joe in
operations who used to pull a few strings for you is no longer in charge.
Knowing that cloud is outsourcing, you should also expect to get the most
out of cloud if you first clean up your house. That means for example
knowing your application inventory and dependencies. Because the usual
state of affairs is that these things aren’t known, most cloud providers
have devised clever scanning and monitoring tools that aim to figure out
what software runs on which server and which other applications it talks
to. They’re certainly helpful but no substitute for actually cleaning up.
It’s a bit like someone coming to your house before a move to tell you
that you have 35 plates and 53 mismatching forks. You might appreciate
the information, but it doesn’t really make things better unless you know
which forks are family heirlooms and which ones came with the happy
meal.

Cloud Leverages Economies of Scale


It’s noteworthy that although cloud is a key ingredient into making
organizations successful in Economies of Speed, it itself relies heavily
on economies of scale. Building large data centers and private, global
networks takes enormous investment that only a few large hyperscale
providers can afford.

Cloud managed to draw a very precise line between a
scale-optimized implementation and a speed-based busi-
ness model.

It appears that cloud managed to draw a very precise dividing line between
a scale-based implementation and a democratic, speed-based business
model. Perhaps that’s the real magic trick of cloud.

Outsourcing is Insurance
Anyone who’s spent time in traditional IT knows that there’s another
reason for outsourcing: if something blows up, such as a software project
or a data center consolidation gone wrong, with outsourcing you’re
further from the epicenter of the disaster and hence have a better chance
of survival. When internal folks point out to me that a contractor’s day
rate could pay for two internals, I routinely remind them that the system
integrator rate includes life insurance for our IT managers. “Blame shield”
is a blunt term sometimes used for this setup. No worries, the integrator
who took the hit will move on to other customers, largely unscathed.
Sadly, or perhaps luckily, cloud doesn’t include this element in the
outsourcing arrangement. “You build it, you run it” also translates into
“you set it on fire, you put it out.” I feel that’s fair game.
7. Cloud Turns Your
Organization Sideways
And your customers will like it.

Traditional IT maintains a fairly strict separation between infrastructure
and application teams. Application teams delight customers with func-
tionality while infrastructure teams procure hardware, assure reliable
operations, and protect IT assets. The frequent quarrels between the
respective teams might hint that maybe not everything is perfect in this
setup. The conflict is easy to see because the two groups have opposing
goals: application delivery teams are incentivized to release more features
while the infrastructure and operations teams want to minimize change
to assure reliability. Cloud puts the final nail into this already strained
relationship.

The IT Layer Cake


Traditional organizations function by layers. Just like in systems archi-
tecture, working in layers has distinct advantages: concerns are cleanly
separated, dependencies flow in one direction, and interfaces between
units are well defined. Layering also allows you to swap one layer out
for another one - in organizations that’s called outsourcing.
However, as is invariably the case with architecture, there are also
downsides. For example, layering can cause translation errors between
the layers and latency can be high for messages to traverse the layers.
Those concerns apply to both technical and organizational systems. Ad-
ditionally, changes often have to ripple through multiple layers, slowing
things down. Most developers have seen situations where a simple change
like adding a field to the user interface required changes to the UI code,
the WAF (Web Application Firewall), the web server, the API gateway, the
back-end API, the application server code, the object-relational mapping,
and the database layer. So, this structure, which looks like it can localize
change, often runs counter to end-to-end dependencies.

Layering favors local optimization over end-to-end optimization, often causing friction at the interfaces.

Lastly, and perhaps most importantly, layering favors local optimization.
Clean separation of concerns between the layers leads us towards optimiz-
ing one individual layer. In organizations you notice this effect when hav-
ing to fill out an endless form when requesting something from another
team. The form allows that team to be highly optimized internally but
it burdens other teams in ways that aren’t measured anywhere. The end
result is a collection of highly optimized teams where nothing seems to get
done due to excessive friction. The technical equivalent is a microservices
architecture where all communication occurs via SOAP/XML - most CPU
cycles are going to be spent parsing XML and garbage collecting DOM
fragments.

Optimizing End-to-End
Cloud gives organizations an edge in Economies of Speed because it
takes friction out of the system, such as having to wait for delivery and
installation of servers. To get there, though, you need to think and optimize
end-to-end, cutting across the existing layers. Each transition between
layers can be a friction point and introduce delays - service requests,
shared resource pools, or waiting for a steering committee decision.
Hence, a cloud operating model works best if you turn the layer cake on its
side. Instead of slicing by application / middleware / infrastructure, you
give each team end-to-end control and responsibility over a functional
area. Instead of trying to optimize each function individually, they are
optimizing the end-to-end value delivery. Following this logic, the layers,
which represent technical elements, turn into vertical slices that represent
functional areas or services.

The layers representing technical elements turn into vertical slices representing functional areas or services.

That this works has been demonstrated by the Lean movement in manu-
facturing, which for about half a century benefited from optimizing end-
to-end flow instead of individual processing steps.

Provisioning is Another Commit


A big driver for loosening up the boundaries between the IT layers is
infrastructure automation. With techniques such as Infrastructure-as-
Code and gitops, provisioning and configuring infrastructure becomes
as simple as checking a new version of a file into the source code
repository (a “commit” in git). This has tremendous advantages as the
source code repository has now become the single source of truth for
application and infrastructure. Moreover, having a full description of your
deployment environment in your source code repository allows you to
automatically scan these configurations, for example for compliance with
security guidelines.
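
To make this more concrete, here is a minimal sketch of what such an automated compliance scan might look like. It assumes - purely for illustration - that infrastructure definitions live as YAML files under an infrastructure/ folder and contain firewall_rules entries; the file layout, field names, and the rule itself are hypothetical and would need to be adapted to your actual IaC tool:

# compliance_scan.py - a sketch of scanning checked-in infrastructure
# definitions against a security guideline. File layout and field names
# are hypothetical; adapt them to your IaC tool's format.
import sys
from pathlib import Path

import yaml  # PyYAML


def violations(doc):
    """Return guideline violations found in one parsed definition."""
    found = []
    for rule in doc.get("firewall_rules", []):
        # Example guideline: SSH must never be open to the whole internet.
        if rule.get("port") == 22 and rule.get("source") == "0.0.0.0/0":
            found.append(f"{rule.get('name', '<unnamed>')}: SSH open to the internet")
    return found


def main(infra_dir="infrastructure"):
    problems = []
    for path in Path(infra_dir).glob("**/*.yaml"):
        doc = yaml.safe_load(path.read_text()) or {}
        problems += [f"{path}: {v}" for v in violations(doc)]
    for p in problems:
        print("VIOLATION:", p)
    return 1 if problems else 0  # a non-zero exit code fails the pipeline


if __name__ == "__main__":
    sys.exit(main())

Run as a step in the delivery pipeline, such a check turns a written security guideline into an automated gate: a commit that violates the guideline simply doesn't get deployed.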
Having a software delivery pipeline create not just a software artifact
but also the infrastructure that it runs on makes a separation between
application and infrastructure within the context of a project team rather
meaningless.

Infra Teams
Having application teams automate deployment doesn't imply that there's
nothing left for infrastructure teams to do. They are still in charge of
commissioning cloud interconnects, deploying common tools for moni-
toring and security, and creating automation among many other things.
My standard answer to anyone in IT worrying about whether there will
be something left to do for them is “I have never seen anyone in IT not
being busy.” As IT is asked to pick up pace and deliver more value, there’s
plenty to do! Operations teams that used to manually maintain the steady
state are now helping support change. Essentially, they have been moved
into the first derivative.

Engineering Productivity Team


Alas, things are rarely black or white in IT, and that’s also the case with
layers. There’s a new layer, or unit perhaps, that will actually support
your journey to the cloud, which also resides “beneath” the application
development teams. The new layer’s role is to make sure that the software
delivery teams are as productive as they can be while relieving each team
from having to rig up their own deployment infrastructure, such as build
servers or monitoring.
Such teams are commonly called “Engineering Productivity Teams” or
“Developer Productivity Engineering” (DPE). Although on the surface,
such a team might look like another layer, there are a couple of defining
differences:

• The team’s role is to be an enabler, not a gating function. A


good productivity team shortens the time to initial deploy and
enables teams to deploy frequently. Infrastructure teams were often
perceived to be doing the opposite.
• Rather than focus on undifferentiated work (your customers won’t
buy more products because of the brand of server you installed),
productivity teams contribute to customer-visible business metrics,
such as the time it takes to implement a new feature.
• The team doesn’t absolve applications teams from their operational
responsibilities. Rather, this critical feedback cycle remains within
the application teams and for example allows them to manage their
error budgets, a key practice of Site Reliability Engineering (SRE).
• Names matter. Instead of naming teams by what they do, the name
should express the value they deliver. This team makes developers
more productive and the name reflects that.
• Focusing on software delivery reflects the realization that cloud
isn’t just about infrastructure.

Interestingly, the value of common software delivery infrastructure also


hasn’t escaped the cloud providers who are now no longer just “moving up
the stack” but also provide tools that span horizontally across the devel-
opment lifecycle, such as fully hosted continuous integration and delivery
pipelines. Such services simplify the work of engineering productivity
teams who now can focus on configuring and integrating existing services
as opposed to having to build or operate them manually.

Centers of Excellence Aren't Always an Excellent Idea
Another new team that is often introduced during a cloud journey is a
Center of Excellence (CoE) or Center of Competence (CoC). As the name
suggests, this team is designed to hold a set of experts that are well versed
in the new technology and help bring it to the rest of the organization. The
appeal is apparent: instead of re-training a whole organization, you seed
a small team that can then act as a multiplier to transform the rest of the
organization. While the idea is legit, I have seen such teams, or perhaps a
specific implementation thereof, face several challenges.
First, the idea of transforming an organization from a small central team
is analogous to the Missionary Model¹. Missionaries are sent out to “convert”
existing cultures to a new way of thinking. As one astute reader of my
blog once pointed out, being a missionary is very hard - some missionaries
ended up being food. While IT missionaries are likely to survive, bringing
change into teams that work off existing incentive systems and retain their
existing leadership is nevertheless difficult at best.
If converting a whole organization in a missionary model is difficult,
the natural conclusion is that building tools and frameworks is a better
way to increase the impact such a team can have across the organiza-
tion. Unfortunately, teams following this route often fall into the trap
of prescriptive governance. Such traditional vehicles are more likely to
disable than enable the organization. People usually won’t embrace a new
technology because you make it harder for them to use.
While many organizations rightly worry about things spinning out of
control if no guard rails are in place, I routinely remind them that in order
to steer, you first have to move². So, in my experience it’s best to first
get things rolling, perhaps with fewer constraints and more handholding
than desired, and then define governance based on what has been learned.
Lastly, many CoE teams that are actually successful in moving the organi-
zation towards a new way of working fall victim to their own success by
becoming a bottleneck for inbound requests. This effect is nicely described
in the article Why Central Cloud Teams Fail³ - note that the authors sell
cloud training, giving them an incentive to propose training the whole
organization, but the point likely still stands.
¹https://www.linkedin.com/pulse/transformation-missionaries-boot-camps-shanghaiing-gregor-hohpe/
²https://www.linkedin.com/pulse/before-you-can-steer-first-have-move-gregor-hohpe/

While CoEs can be a useful way to get a transformation


started, they aren’t a magic solution to transforming a
whole organization.

In summary, CoEs can be useful to “start the engine” but just like a two-
speed architecture, they aren’t a magic recipe and should be expected to
have a limited lifespan.

Organizational Debt
There is likely a lot more to be said about changes to organizational
structures, processes, and HR schemes needed to maximize the benefit
of moving to the cloud. Although the technical changes aren’t easy, they
can often be implemented more quickly than the organizational changes.
After all, as my friend Mark Birch, who used to be the Regional Director
for Stack Overflow in APAC, pointedly concluded:

There is no Stack Overflow for transformation where you just


cut and paste your culture change and compile.

Pushing the technical change while neglecting (or simply not advancing
at equal pace) the organizational change is going to accumulate organiza-
tional debt. Just like technical debt, “borrowing” from your organization
allows you to speed up initially, but the accumulating interest will
ultimately come back to bite you just like overdue credit card bills. Soon,
your technical progress will slow down due to organizational resistance.
In any case, the value that the technical progress generates will diminish
- an effect that I call the value gap⁴.
³https://info.acloud.guru/resources/why-central-cloud-teams-fail-and-how-to-save-yours
⁴https://aws.amazon.com/blogs/enterprise-strategy/is-your-cloud-journey-stuck-in-the-value-gap/
8. Retain / Re-Skill / Replace
/ Retire
The four “R”s of migrating your workforce.

When IT leaders and architects speak about cloud migrations they usually
refer to migrating applications to the cloud. To plan such a move, several
frameworks with varying numbers of R’s exist. However, an equally large
or perhaps even bigger challenge is how to migrate your workforce to
embrace cloud and work efficiently in the new environment. Interestingly,
there’s another set of “R’s” for that.

Migrating the Workforce


New technology requires new skills. Not just new technology skills like
formulating Helm charts but also new mindsets that debunk existing
beliefs as described in “Reverse Engineering the Organization” in 37
Things. For example, if an organization equates “fast” with “poor quality”
- they do it “quick and dirty” - they will consciously or unconsciously
resist higher software delivery velocity. Therefore, change management
is a major part of any cloud journey. Organizations are made up of people,
so people are a good starting point for this aspect of your cloud journey.

The 4 “Rs” of People Transformation


When it comes to preparing your workforce for the cloud journey, just
like with applications, you have several options:

Retain

Although cloud permeates many aspects of the organization,


not all jobs will fundamentally change. For example, you
might be able to keep existing project managers, assuming


they haven’t degenerated into pure budget administrators.
Retaining staff might also include providing additional train-
ing, for example on agile methods or a shift from projects to
products.
Interestingly, even though applications built for the cloud
are different, developers often fall into the Retain category
because they are already used to adopting new technology.
Chances are they have already tinkered with cloud technol-
ogy before your organization went “all in” on cloud. The most
important aspect for developers is that they don’t see cloud
just as a new set of tools but understand the fundamental
shifts, such as making everything disposable.

Re-Skill

Other roles might remain by title but require completely


different skill sets. This is often the case in infrastructure
and operations where once manual processes are now au-
tomated. This doesn’t mean folks will be out of work (after
all, automation isn’t about efficiency), but the nature of the
work is changing substantially. Instead of (semi-)manually
provisioning servers or other IT assets, tasks now focus on
building automation, baking golden virtual machine images,
and creating more transparency through monitoring.
Re-skill should be the first choice in any organizational
transformation because as I like to say “the best hire is the
person you didn’t lose.” Existing staff is familiar with the
organization and the business domain, something a new
person might still have to acquire over several months. Also,
the hot job market for IT talent has hurt the signal-to-noise
ratio for recruiting, meaning you have to sift through more
candidates to find good ones.

Replace

Not every person can be re-skilled. In some cases, the jump is


simply too big while in others the person isn’t willing to start
from scratch. Although it might be easy to peg this on the
individual, for folks who have excelled in a particular task for


several decades having the intellectual rug pulled out from
under their feet can be traumatizing.
Interestingly, being replaced isn’t always bad in the long run.
So-called “out of date” skill sets often end up being in high
demand due to dwindling supply. For example, many COBOL
programmers make a very good living!
Naturally, replacing also implies the organization finding and
onboarding a new person with the desired skill set, which
isn’t always easy. More on that below.

Retire

Some roles aren’t needed any more and hence don’t need to
be replaced. For example, many tasks associated with data
center facilities, such as planning, procurement, and instal-
lation of hardware are likely no longer required in a cloud
operating model. Folks concerned with design and plan-
ning might move towards evaluating and procuring cloud
providers’ services, but this function largely disappears once
an organization completes the move to the cloud and shuts
down its physical data centers. In many IT organizations such
tasks were outsourced to begin with, softening the impact on
internal staff.

In real life, just like on the technical side of cloud migration, you will have
a mix of “migration paths” for different groups of people. You’ll be able
to retain and re-skill the majority of staff and might have to replace or
retire a few. Choosing will very much depend on your organization and
the individuals. However, having a defined and labeled list of options will
help you make a meaningful selection.

You Already Have the People


Categorizing people is a whole different game than categorizing applica-
tions - this isn’t about shedding the “legacy”! When looking for new talent
your first major mistake might be to only look outside the organization.
You’re bound to have folks who showed only mediocre performance but
who were simply bored and would do a stellar job once your organization
adjusts its operating model.
In a famous interview¹ Adrian Cockcroft, now VP of Cloud Architecture
Strategy at AWS, responded to a CIO lamenting that his organization
doesn’t have the same skills as Netflix with “well, we hired them from
you!” He went on to explain that “What we found, over and over again, was
that people were producing a fraction of what they could produce at most
companies.” So, have a close look to see whether your people are being
held back by the system.

In my talks and workshops I routinely show a picture


of Ferraris stuck in a traffic jam (or some sort of cruise)
and ask the audience how fast these cars can go. Most
people correctly estimate them to have a top speed north
of 300km/h. However, how fast are they going? Maybe 30.
Now, the critical question is would putting more horse-
power in the cars make them go faster? The answer is easy:
no. Sadly, that’s what many organizations are attempting
to do by claiming they “need better people”.

You might also encounter the inverse: folks who were very good at their
existing task but are unwilling to embrace a new way of working. It’s
similar to the warning label on financial instruments: “past performance
is no guarantee of future results.”

Training is More Than Teaching


One technique that can be a better indicator than past performance
is training. Besides teaching new techniques, a well-run training that
challenges students as opposed to just having them copy-paste code
snippets (many of us have sadly seen those) can easily spot who latches on
to new tools and concepts and who doesn't. I therefore recommend debriefing
with the instructor after any training event to get input into staffing
decisions. Training staff also gives everyone a fair shot, ensuring that you
don’t have a blind spot for already existing talent.
¹https://dzone.com/articles/tech-talk-simplifying-the-future-with-adrian-cockc

Training can also come in the form of working side-by-side with an expert.
This can be an external or a well-qualified internal person. Just like before,
you’d want to use this feedback channel to know who your best candidates
are.
Some organizations feel that having to (re-)train their people is a major
investment that puts them at a disadvantage compared to their “digital”
competitors who seem to only employ people who know the cloud,
embrace DevOps, and never lose sight of business value. However, this is a
classic “the grass is greener” fallacy: the large tech leaders invest an enor-
mous amount into training and enablement, ranging from onboarding
bootcamps and weekly tech talks to brown bags and self-paced learning
modules. They are happy to make this investment because they know the
ROI is high.

Top Athletes Don't Compete in the Swamp
Many organizations are looking to bring in some super stars to handle the
transition. Naturally, they are looking for top athletes, the best in their
field. The problem is, as I described in a blog post², that the friction that's
likely to exist in the organization equates to everyone wading knee-
deep in the mud. In such an environment a superstar sprinter won’t be
moving a whole lot faster than the rest of the folks. Instead, they will
grow frustrated and leave soon.

In my team we often reminded ourselves that for each task
we have two goals: first, to accomplish the task, and second, to
improve the way it's done in the future.

So, you need to find a way to dry up the mud before you can utilize top
talent. However, once again you’re caught in a chicken-and-egg situation:
draining the swamp also requires talent. You might be fortunate enough
to find an athlete who is willing to also help remove friction. Such a
person might see changing the running surface as an even more interesting
challenge than just running.
²https://www.linkedin.com/pulse/drain-swamp-before-looking-top-talent-gregor-hohpe/

More likely, you’ll need to fix some of the worst friction points before
you go out to hire. You might want to start with those that are likely to
be faced first by new hires to at least soften the blow. Common friction
points early in the onboarding process include inadequate IT equipment
such as laptops that don’t allow software installs and take 15 minutes to
boot due to device management (aka “corporate spyware”). Unresponsive
or inflexible IT support is another common complaint.

I was once forced to take a docking station along with my


laptop because all equipment comes in a single, indivis-
ible bundle. When returning equipment I ran into issues
because some of the pieces I didn’t want in the first place
were tucked away in some corner.

HR onboarding processes also run the risk of turning off athletes before
they even enter the race. One famous figure in agile software development
was once dropped from the recruiting process by a major internet company
because he told them to just “Google his name” instead of him submitting
a resume. I guess they weren’t ready for him.

Organizational Anti-corruption Layer


You can’t boil the ocean, so you also don’t have to dry up all the mud at
once. For example, you can obtain an exception for your team to use non-
standard hardware. However, you must dam off the part that’s dry from
the rest of the swamp lest it get flooded again.
I call this approach the Organizational Anti-corruption Layer, borrowed
from the common design pattern³ for working with legacy systems.
In organizations, this pattern can for example be applied to headcount
allocation:

When working in a large organization I had all team


members officially report to me so that shifting resources
between sub-teams would not affect the official org chart
and spared us from lengthy approval processes.

You can build similar isolation layers for example for budget allocation.
³https://domainlanguage.com/ddd/reference/

Up Your Assets
One constant concern among organizations looking to attract desirable
technical skills is their low (perceived) employer attractiveness. How
would a traditional insurance company / bank / manufacturing business
be able to compete for talent against Google, Amazon, and Facebook?
The good news is that, although many don't like to hear it, you likely
don’t need the same kind of engineer that the FAANGs (Facebook, Apple,
Amazon, Netflix, Google) employ. That doesn’t imply that your engineers
are less smart - they just don’t need to build websites that scale to 1 billion
users or develop globally distributed transactional databases. They need to
be able to code applications that delight business users and transform the
insurance / banking / manufacturing industry. They’ll have tremendous
tools at their fingertips, thanks to cloud, and hence can focus on delivering
value to your business.

In a previous role, we were quite successful in hiring qualified


candidates because we didn’t require them to speak the
local language (German). Additionally, many of the “digital
brand-name” competitors had very few actual engineering
positions as they had opened sales and consulting offices
which required extensive travel.

Second, large and successful organizations typically have a lot more to


offer to candidates than they believe. At a large and traditional insurance
company, we conducted an exercise to identify what we can offer to
desirable technical candidates and were surprised by how long the list
became: besides simple items like a steady work week and paid overtime (not
every software engineer prefers to sleep under their desk), we could offer
challenging projects on a global scale, access to and coaching from top
executives, assignments abroad, co-presenting at conferences with senior
technologists, and much more. It’s likely that your organization also has
a lot more to offer than it realizes, so it’s a worthwhile exercise to make
such a list.

Re-label?

In many organizations I have observed a “fifth R” - the re-labeling of roles.


For example, all project managers underwent Scrum training and were
swiftly renamed Scrum Masters or product owners. Alternatively,
existing teams are re-labeled as “tribes” without giving them more end-
to-end autonomy. With existing structures and incentives remaining as
they are, such maneuvers accomplish close to nothing besides spending
time and money. Organizations don’t transform by sticking new labels on
existing structures. So, don’t fall into the trap of the fifth R.
9. Don’t Hire a Digital
Hitman
What ends in mayhem in the movies is unlikely to end well in IT.

When looking to transform, traditional organizations often look to bring


in a key person who has worked for so-called digital companies. The idea
is simple: this person knows what the target picture looks like and will
lead the organization on the path to enlightenment. As a newcomer, this
person can also ruffle some feathers without tarnishing the image of the
existing management. He or she is the digital hit(wo)man who will do the
dirty work for us. What could possibly go wrong?

Movie Recipes
It’s well known that Hollywood movies follow a few, proven recipes.
A common theme starts with a person leading a seemingly normal if
perhaps ambitious life. To achieve their ambitious goal this person engages
someone from the underworld to do a bit of dirty work for them. To preserve
the character's moral standing, at least initially, the person will justify their
actions based on a pressing need or a vague principle of the greater good. The
counterpart from the underworld often comes in the form of a hitman, who
is tasked with scaring away, kidnapping, or disposing of a person who
stands in the way of a grand plan. Alas, things don’t go as planned and so
the story unfolds.
A movie following this recipe fairly closely is The Accountant with Ben
Affleck. Here, the CEO of a major robotics company hired a hitman
to dispose of his sister and his CFO who might have gotten wind of
his financial manipulations. Independently, the company also hired an
accountant to investigate some irregularities with their bookkeeping. As
it turns out, the accountant brings a lot more skills to the table than just
adding numbers. After a lot of action and countless empty shells, things…
(spoiler alert!) don’t end well for the instigator.

The Digital Hit(wo)man


Back in the world of IT and digital transformation, we can quickly see the
analogy: a business leader is under pressure from the board to change the
organization to become faster and “more digital”; there is some amount of
handiwork involved, such as re-skilling, re-orgs, or de-layering. But it's in
the name of the greater good, i.e., saving the company. Wouldn't it be easiest
to recruit some sort of digital hit(wo)man, who can clean the place up
before disappearing as quickly as they came?
The natural candidate wouldn’t be as good looking as Ben Affleck and
wouldn’t bring a trailer full of firearms. To be precise, the movie accoun-
tant is also the one who chases the hitman. For that part you better watch
the movie, though. Digital IT hit(wo)men would come with credentials
such as having held a major role at a brand-name digital company, having
led a corporate innovation center, or having run a start-up incubator. He
or she would be full of energy and promise, and would have one or two
magic tricks up their sleeve.

Crossing Over Into the Unknown


Sadly, though, hiring the IT hit(wo)man is more likely to end up as in the
movie than in a successful transformation. The key challenge in hiring
someone from a different world (under or not) is that you won’t know the
world you’re crossing over into. For example, you wouldn’t know how to
pick a good candidate because the whole premise is that you need someone
who has a very different background and brings completely different
skills. You also won’t understand the unwritten rules of engagement in
the other world.

When hiring someone from a different world, you’re not


qualified to make the hiring decision.

While as an IT leader you may not have to worry about someone else
outbidding you to turn your own hitman against you, you may still hit a
few snags, such as:

• The transformation expert you are hiring may feel that it’s entirely
reasonable to quit after half a year as that's how things work
where they’re from.
• They may simply ignore all corporate rules - after all they are here
to challenge and change things, right? But perhaps that shouldn’t
include blowing the budget half way through the year and storming
into the CEO’s office every time they have what they think is a good
idea.
• They may also feel that it’s OK to publicly speak about the work
they’re doing, including all of your organization’s shortcomings.

As a result, some of the stories involving a digital hit(wo)man don’t fare


much better than in the movies. While luckily the body count generally
stays low, there’s carnage in other forms: large sums of money spent, many
expectations missed, results not achieved, egos bruised, and employees
disgruntled.

Asking Trust Fund Kids for Investment Advice
A digital hit(wo)man has usually worked in a digital environment, so they
understand well what it looks like and how it works. What most of them
didn’t have to do, though, is build such an environment from scratch or,
more difficult yet, transform an existing organization into a “digital” one.
Therefore, they might not be able to lay out a transformation path for your
organization because they have never seen the starting point. That likely
ends up being a frustrating experience both for them and for you.

Folks coming from a digital company know the target


picture but they don’t know the path, especially from your
starting point.

Asking someone who grew up in the Googles of the world how to become
like them is a bit like asking trust fund kids for investment advice: they
may tell you to invest a small portion of your assets into a building
downtown and convert it into luxury condos while keeping the penthouse
for yourself. This may in fact be sound advice, but tough to implement for
someone who doesn’t have $50 million as starting capital.
The IT equivalents of trust fund investment advice can take many forms,
most of which backfire:

• The new, shiny (and expensive) innovation center may not actually
feed anything back into the core business. This could be because
the business isn’t ready for it or because the innovation center has
become completely detached from the business.
• The new work model may be shunned by employees or it might
violate existing labor contract agreements and hence can’t be
implemented.
• The staff may lack the skills to support a shiny new “digital”
platform that was proposed.

So, perhaps instead of a hit(wo)man you need an organizational horse


whisperer. That’s a less exciting movie but organizational transformations
can also be slow and draining.

Who to Look For?


Transforming a complex organization without any outside help is going
to be difficult - you won’t even know much about what you are aiming
for. It’s like reading travel brochures to prepare for living abroad. So how
do you go about finding a good person to help you? As usual, there isn’t
an easy recipe, but my experience has shown that you can improve the
odds by looking for a few specific traits:

Stamina
First of all, don’t expect a digital hitman who carries silver bullets
to instantly kill your legacy zombies. Look for someone who has
shown the stamina to see a transformation through from beginning
to end. In many cases this would have taken several years.
Enterprise Experience
You can’t just copy-paste one company’s culture onto another - you
need to plot a careful path. So, your candidate may be extremely
frustrated with your existing platforms and processes and leave
quickly. Look for someone who has had some exposure to enterprise
systems and environments.

Teacher
Having worked for a “digital” company doesn’t automatically qual-
ify a person as a transformation agent. Yes, they’ll have some
experience working in the kind of organization that you’re inspired
by (or compete against). However, that doesn’t mean they know
how to take you from where you are to where you want to be. Think
about it this way: not every native speaker of a specific language is
a good teacher of that language.
Engine Room
Knowing how a digital company acts doesn’t mean the person
knows the kind of technology platform that’s needed underneath.
Yes, “move to the cloud” can't be all wrong, but it doesn't tell
you how to get there, in which order, and with which expectations.
Therefore, find someone who can ride the architect elevator at least
a few floors down.
Vendor Neutral
Smart advice is relatively easy to come by compared to actual
implementation. Look for someone who has spent time inside
a transforming organization as opposed to looking on from the
outside. Likewise watch out for candidates who are looking to drive
the transformation through product investment alone. If the person
spews out a lot of product names you might be looking at a sales
person in disguise. Once hired, such a person will cozy up to
vendors selling you transformation snake oil. This is not the person
you want to lead your transformation.

What Do They Look For?


Naturally, qualified candidates who can lead an organization through a
transformation will have plenty of opportunities. And while they will
likely look for a commensurate compensation package, they usually can’t
be bought with money alone (see “Money Can’t Buy Love” in 37 Things).
These folks will look for a challenge but not for a suicide mission -
your transformation isn’t Mission Impossible and the hero won’t come
out alive. So they’d want to be assured that the organization is willing
to make difficult changes and that top-management support is in place.
Realistically, the candidates will interview your organization as much as
you will interview them.

It Sure Ain’t Easy, But It’s Doable


If you feel that these suggestions will narrow down the set of available
candidates even further, you are probably right. Finding a good person
to drive the transformation is hard - unicorns don’t exist. That’s perhaps
the first important lesson for any transformation: you've got to put wishful
thinking aside and brace for harsh reality.
10. Enterprise Architecture
in the Cloud
Keeping the head in the cloud but the feet on the ground.

For several years I ran Enterprise Architecture at a major financial services


firm. When joining the organization I had little idea what enterprise
architecture actually entailed, which in hindsight might have been an
asset because the organization wanted someone with a Silicon Valley
background to bring a fresh take. The results are well known: I concluded
that it’s time to redefine the architect’s role as the connecting element
between the IT engine room and the business penthouse, a concept which
became known as The Software Architect Elevator¹.
Of course I could not have done any of this without a stellar team. Michele
was one architect who did great work in the engine room but regularly
popped his head into the penthouse to help chart the IT strategy. He’s now
running his own architecture team supporting a cloud journey. Here are
some of his insights.

By Michele Danieli

Enterprises turn to cloud to reap substantial benefits in agility, security


and operational stability. But the way the cloud journey starts can vary
greatly. In many organizations, so-called shadow IT bypasses corporate
restrictions by deploying in the cloud; others require extreme scale that’s
only available in the cloud; some see a unique opportunity for reinventing
the IT landscape with no ties to the past; or perhaps a new IT strategy
foresees the orderly migration of resources to the cloud. Each starting
point also impacts the role the Enterprise Architecture team plays during
the cloud journey.
¹Hohpe, The Software Architect Elevator, 2020, O’Reilly

Enterprise Architecture

Most IT departments in large enterprises include a team labeled “Enter-


prise Architecture” (EA), typically as part of a central IT governance team.
Although interpretations vary widely, enterprise architect teams bring
value by providing the glue between business and IT architecture (see
Enterprise Architect or Architect in the Enterprise?²). As such a connecting
element, they translate market needs into technical components and ca-
pabilities. Bridging the two halves across a whole enterprise requires both
breadth of expertise and the ability to cut through enormous complexity.

Cloud Enterprise Architecture


Like many IT functions, EA is also affected by the organization’s move to
the cloud. Interestingly, the starting point of the cloud initiative impacts
the way sponsors and project teams see the new role of EA:

• Business units seeing cloud as a way to overcome excessive friction


inside the central IT department will stay away from EA, except
perhaps for guidance on connectivity or security constraints.
• Software and product engineering teams will believe they can figure
it all out without much help from EA.
• Infrastructure and operations teams will seek support from EA to
map applications onto the newly designed infrastructure.
• If cloud is a new enterprise strategy, EA will be asked to develop a
comprehensive view of the current infrastructure, application, and
process inventory to be used as input to strategic planning.

In any scenario, EA departments in the cloud-enabled enterprise should


focus their guidance on the why and what of what they are trying to
achieve and leave the how of the implementation details to other teams.
This means enterprise architects need to depart from prescriptive roles as
they often exist in the form of design authorities and other governance bodies.
Instead, they should act as cloud advocates in the boardroom and sparring
partners in the engine room.
²Hohpe, The Software Architect Elevator, 2020, O’Reilly

Enterprise architects in the cloud enterprise should depart


from prescriptive roles to become cloud advocates in the
boardroom and sparring partners in the engine room.

Filling this mandate translates into three concrete tasks that are material
to the success of the cloud journey:

Inform business leadership


Business leadership gets bombarded with buzzwords, usually closely
followed by unsubstantiated promises. EA teams can help them
separate hype from reality to define a cohesive strategy.
Link business, organization, and IT
There are many paths to the cloud. EA teams can help navigate
cloud decisions with a holistic architecture view, including the
connection to the business and the organization.
Establish guidelines and enable adoption
EA can foster knowledge sharing, translating individual experience
into common technical guidelines. Also, EA can develop running
services that make following the guidelines the easy path.

Let’s look at each task in detail.

Inform business leadership

Although cloud can do amazing things, it also attracts a


lot of hype. Business and IT management therefore needs
a solid understanding of how cloud will affect their enter-
prise. While providers and external consultants deliver a
fair amount of information, enterprise architects provide a
unique perspective on the drivers for moving to the cloud.
They need to consider strategic aspects like lock-in or the
changing economics of operating in a cloud model. This
strategic dialog calibrates the expectations regarding direc-
tion, value, and costs over the long term. Without this dialog,
business is bound to be disappointed by results achieved from
the move to the cloud.

Link business, organization and IT

Developing a cloud enterprise is a lot more than using cloud


technologies. It requires understanding how cloud will im-
pact the organization as much as how the existing landscape
of applications and IT services must be transformed. Cloud
providers support transformation with services like the AWS
Cloud Adoption Framework³ or Pivotal Labs⁴. Although they
originate from different perspectives, these programs have
one common aspect: they recognize the importance of a holis-
tic view across business, people, and IT. Enterprise architects
complement the vendor expertise with insider knowledge.
EA helps the organization identify the different aspects of
moving to cloud and plotting viable adoption paths. They
contribute to a decision framework for hybrid and multi
cloud selection, complementing developers’ application and
technology views with a broader perspective.
Cloud transformation will require change, and change bears
cost and risks. If such change is linked to specific business
benefits, gaining business support will be easier. For example,
automation and containerization can translate into portabil-
ity, enabling a growth strategy in new regions with lower
entry cost and faster time to market; increased elasticity can
translate into the ability to better absorb workload fluctuation (e.g.
seasonal patterns), improving business metrics like turnover
and customer satisfaction. It is the role of EA to make such
connections.

Establish guidelines and enable adoption

With cloud as a central element of the strategy, enterprise


architecture must help operationalize this vision. Applica-
tions in a typical enterprise portfolio aren’t equally portable
to the cloud. Enterprise architects should therefore work with
cloud engineers and developers to establish the criteria and
guidelines that can be used to assess an application’s cloud
readiness, using frameworks like FROSST.
³https://aws.amazon.com/professional-services/CAF
⁴https://tanzu.vmware.com/labs

In my current role, EA is defining our application portfo-


lio’s cloud readiness, including application structure, man-
ufacturing and run processes, technologies, and delivery
model.

Instead of restricting technology decisions with rigid stan-


dards, enterprise architecture should transform existing needs
into mechanisms and running services. Engineers consume
services, not abstract principles.

At my current employer, the cloud security community


translated security principles into building blocks, released
in the enterprise code repository.

Lastly, modeling the actual impact of lock-in at the time of


service design identifies risks and benefits better than citing
abstract principles such as loose coupling.

The Enterprise IT Cast of Characters


How can EA perform these important but challenging tasks within the IT
organization? Understanding the organization helps EA find the appro-
priate positioning and engagement model.
Enterprise IT has to tackle the complexity of concurrently managing
business, technical, operational, financial, and compliance aspects across
several layers. Enterprises split this complexity across specific roles and
processes, each carrying slightly different perspectives:

• Product Owner - cares about high team productivity to meet


business needs.
• SW Developer - needs technology options that maximize produc-
tivity.
• Cloud Engineer - cares about fast provisioning of resources and
scalable, secure, operable infrastructure.

• Infrastructure Manager - focuses on infrastructure, operations, costs,


and security.
• Service Manager - ensures stability and operability (SLAs) for
specific IT services.
• Enterprise Architect - looks at long term strategy and compliance
with operating models.

These roles and their needs existed before shifting to a cloud operating
model. However, the move to the cloud changes the way these players
interact. In traditional IT, long term investment in vendor technologies,
long procurement lead times, and mature but constrained technology
options (both in infrastructure and software) limited the pace of change
and innovation. Hence much time was spent carefully selecting software
and managing the technology lifecycle. In this context, EA was needed to
lay out a multi-year roadmap and ensure convergence from the present to
the defined future state.

Virtuous Cycles
Cloud gives enterprises more options that can be adopted at a faster
pace and with a much lower investment. Cloud also elevates the roles
of developers, operations engineers, and product owners because it turns
your organization sideways. Traditional organization charts don’t reflect
this dynamic well. I visualize this new organizational model as a circle
connecting three vertices: Product Owners, Developers and Engineers.
These individuals are guided by an external circle of stakeholders: Busi-
ness, Infrastructure Managers, Cloud Providers and Service Managers.

The Enterprise IT Cast of Characters

The faster the inner loop spins, the more value it delivers to the business
and the more confidence is built into the team: product owners and devel-
opers working in shorter sprints; developers pushing to production more
frequently; engineers setting up and running more stable environments
faster.
Any disturbance will be seen as reducing output and is therefore un-
welcome. To the parties in the loop, enterprise-wide IT standards that
constrain technical choices are considered friction. To make matters
worse, such standards are often based on a legacy operating model that
doesn’t suit the way modern teams like to work. For example, existing
standards for build tool chains might not support the level of automation
engineers desire, affecting the time it takes the team to deliver features.
While the inner loop spins fast on the product cycle, the outer loop
explores strategic opportunities, often enabled by new technologies such
as machine learning, cognitive computing, and robotics. Although this
loop moves more slowly than the inner loop, it also speeds up in a modern
enterprise. As a result, product ideas flow more quickly from external to
internal circles who return continuous feedback. Instead of defining multi-
year plans, teams now probe and explore, consuming IT services provided
by cloud providers.
Modern enterprise architects need to fit into these loops to avoid becoming
purely ceremonial roles. Does it mean they are giving way to Product
Owners, Developers, Engineers, and Cloud Architects? Not necessarily.

Bringing Value to the Cloud Enterprise


Enterprise architects should position themselves as the interface between
the inner loop of product delivery and the outer loop of stakeholders. Here,
they can transform the learnings from the field into a systematic view that
enables the cloud enterprise to thrive.

Enterprise Architects as Connecting Element

In this role, cloud enterprise architects connect different levels of abstrac-


tion and ride the elevator from the executive board down to the engine
room many times.

When a central IT group adopted cloud for specific appli-


cations like machine learning but retained core business
systems on premises, projects started to explore the cloud
on their own, leading to fragmented decisions and lack
of experience sharing. Our EA team set out to translate
these early experiences into shared capabilities by creating
a compelling vision (going as far as creating an internal
brand name and logo) and driving implementation with the
infrastructure manager.

Connecting the circles allows EA to move from conceptualization to


action. In some cases the team can even act as a technical product owner,
stimulating engineering teams to work and think differently. For such a
model to succeed, EA teams need to be pragmatic, descriptive instead of


prescriptive. Concrete activities for such modern EA teams can include:

• Convincing business leadership of the need for a longer architec-


ture planning horizon
• Working across teams to build an environment that enables cloud
adoption, for example via shared services
• Supporting engineers in building services that benefit the widest
audience

To succeed in this role of flywheel for the cloud transformation, enterprise


architects need to not only develop specific skills in cloud architecture and
technologies but also reposition themselves as an enabler rather than a
governing function.
In conclusion, enterprise architecture’s role in the cloud enterprise is not
to design the cloud infrastructure (cloud architects and engineers will
do that) nor to select cloud technologies (developers and engineers will
do that) nor to model applications for available cloud services (solution
architects and developers will do that). EA teams detail where and why the
business will benefit from cloud, taking a broad perspective on business,
applications and technology.

Enterprise architects do not design the cloud infrastructure


nor do they select cloud technologies. They detail where
and why the business will benefit from cloud.

While the role isn’t fundamentally different from the past, the wheel
spins much faster and thus requires a more pragmatic and non-dogmatic
approach to strategy. It also requires leaving the ivory tower behind to
work closely with developers, engineers and infrastructure managers in
the engine room.
Part III: Moving to the
Cloud

Understanding that cloud is a different animal from traditional IT procure-


ment is a good precondition for laying out a cloud strategy. Now it’s time
to tackle the shift of on-premises resources to a cloud operating model.

In With the New…


The center of such a strategy will be the migration of existing on-premises
applications to the cloud. However, simply moving your IT assets from
your premises to the cloud is more likely to yield you another data
center than a cloud transformation. You therefore need to shed existing
assumptions - leaving things behind is a critical element of any cloud
migration. Some of the things to be left behind are the very assumptions
that made IT big and powerful, such as operating servers and packaged
applications. Lastly, clear and measurable objectives are a good start for
any successful migration, something that should be common sense but
that all too often gets lost in the complexity and politics of enterprise
IT.

Too Much Buzz Makes Deaf

Widespread “advice” and buzzwords can muddy the path to cloud enlight-
enment a good bit. My favorite one is the notion of wanting to become a
“cloud native” organization. As someone who’s moved around the world
a good bit, I feel obliged to point out that “being native” is rather the
opposite of “migrating”. Hence, the term doesn’t just feel unnecessarily


exclusionary, but also constitutes an oxymoron. Lastly, in many countries’
history, the natives didn’t fare terribly well during migrations.

Measuring Progress
So, it’s better to put all that vocabulary aside and focus on making a
better IT for our business. Your goal shouldn’t be to achieve a certain
label (I'll give you any label you like for a modest fee…), but to improve on
concrete core metrics that are relevant to the business and the customer.
Many organizations track their cloud migrations by the number of servers
or applications migrated. Such metrics are IT-internal proxy metrics and
should make way for outcome-oriented metrics such as uptime or release
frequency.

Knowing Your Path


Cloud migration frameworks often provide a list of choices, e.g. whether
to re-architect applications or just lift-and-shift them as is. Most of these
publications assume, though, that this migration is a one-step process: you
decide the path for each application and off they go. I have never seen
things take place as easily as that. Applications are interdependent and
migrations often take multiple passes of preparing, shifting, optimizing,
and re-architecting workloads. So, a migration strategy is more than
dividing applications into buckets. Each application (or set of applications)
will take a path consisting of multiple steps.

Plotting a Path
Many resources that help you with the mechanics of a cloud migration
are available from cloud service providers and third parties, including
planning aids, frameworks, and conversion tools. For example, AWS⁵,
Azure⁶, and Google Cloud⁷ each provide an elaborate Cloud Adoption
Framework.
⁵https://aws.amazon.com/professional-services/CAF/
⁶https://azure.microsoft.com/en-us/cloud-adoption-framework/
⁷https://cloud.google.com/adoption-framework

This chapter isn’t looking to replicate advice that’s already presented in a


well-structured format. Rather, it’s looking to fill in the gaps by alerting
you to common pitfalls and elaborating on the parts that don’t fit into a
structured framework:

• Be clear why you're going to the cloud in the first place.


• Remember that No one wants a server.
• Realize that You shouldn’t run software you didn’t build.
• Avoid building an enterprise non-cloud.
• Dig out your high school course notes to Apply Pythagoras to Cloud
Migration.
11. Why Exactly Are You
Going to the Cloud
Again?
When reading a map, it’s best to know where you want to go.

Cloud adoption has become synonymous with modernizing IT to the point


where asking why one is doing it will be seen as ignorance at best or heresy
at worst. Alas, there are many paths to the cloud, and none are universally
better than others. The objective, therefore, is to select the one that best
suits your organization. Knowing how you want to go to the cloud requires
you to first become clear on why you're going in the first place.

Many Good Reasons to Come to Cloud


Luckily, the list of common motivations for IT to move to the cloud isn’t
infinitely long, as the following compilation shows. You might find a
few unexpected entries. One all-too-common motivation is intentionally
omitted: FOMO, the fear of missing out.

Cost

If you want to get something funded in an IT organization that’s seen as a


cost center, you present a business case. That business case demonstrates a
return on the investment through cost reduction. So, it’s not a big surprise
that cost is a common but also somewhat overused driver for all things IT,
including cloud. That's not all wrong - we all like to save money - but, as
so often, the truth isn't quite that simple.
Cloud providers have favorable economies of scale but nevertheless invest
heavily in global network infrastructure, security, data center facilities,
and modern technologies such as machine learning. Hence, one should be
cautious about expecting drastic cost reductions simply from lifting and shifting
existing workloads to the cloud. Savings come, for example, from using
managed services and the cloud’s elastic “pay as you grow” pricing model
that allows suspending compute resources that aren’t utilized. However,
realizing these savings requires transparency into resource utilization and
high levels of deployment automation - cloud savings have to be earned.

The cloud’s major savings potential lies in its elastic pricing


model. Realizing it requires transparency and automation,
though.

Organizations looking to reduce cost should migrate workloads that have


uneven usage profiles and can thus benefit from the cloud’s elasticity.
Analytics solutions have yielded some of the biggest savings I have seen -
running interactive queries on a cloud provider managed data warehouse
can be two or three orders of magnitude cheaper than rigging up a full data
warehouse yourself when it’s only sporadically used. Ironically, migrating
that same warehouse to the cloud won’t help you much as you’re still
paying for underutilized hardware. It’s the fully managed service and
associated pricing model that makes the difference.
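
A quick back-of-the-envelope calculation illustrates why the difference can be so large. All numbers below are made up for illustration and are not actual provider prices:

# Back-of-the-envelope comparison: always-on warehouse vs. pay-per-query.
# All prices are illustrative assumptions, not actual provider pricing.

HOURS_PER_MONTH = 730

# Option A: a self-managed warehouse cluster running 24/7
cluster_cost_per_hour = 10.0            # assumed
always_on = cluster_cost_per_hour * HOURS_PER_MONTH

# Option B: a fully managed, pay-per-query service used sporadically
queries_per_month = 200                 # assumed: occasional analyst queries
cost_per_query = 0.05                   # assumed
pay_per_use = queries_per_month * cost_per_query

print(f"Always-on cluster: ${always_on:,.0f} per month")   # $7,300
print(f"Pay per query:     ${pay_per_use:,.0f} per month") # $10
print(f"Ratio: roughly {always_on / pay_per_use:,.0f}x")   # about 730x

The point isn't the specific numbers but the shape of the comparison: with sporadic usage, a pay-per-use model scales with actual consumption, while the always-on cluster bills you for idle time.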
Another common migration vector is data backup and archiving. On-
premises backup and archiving systems tend to incur substantial operational
costs for data that's accessed very infrequently. Such data can be stored
at lower cost in the commercial cloud. Most archival services (e.g. AWS Glacier)
charge for retrieval of your data, which is a relatively rare event.

Uptime

No one likes to pay for IT that isn’t running, making uptime another
core metric for the CIO. Cloud can help here as well: a global footprint
allows applications to run across a global network of regions, each
of which features multiple availability zones. Cloud providers also main-
tain massive network connections to the internet backbone, affording
them improved resilience against denial-of-service (DDoS) attacks. Set up
correctly, cloud applications can achieve 99.9% availability or higher,
which is often difficult or expensive to realize on premises. Additionally,
applications can be deployed across multiple clouds to further increase
availability.

Organizations moving to the cloud to increase uptime should look at fully


managed solutions that routinely come with 99.9% SLAs. Load balancing
existing solutions across multiple instances can yield even better uptimes
as long as the workloads are architected to do so. Run-time platforms with
built-in resilience, such as container orchestration or serverless platforms,
are good destinations for modern applications.
Alas, cloud isn’t a magic button that makes brittle applications suddenly
available 24/7 with zero downtime. Even if the server keeps running,
your application might be failing. Hence, adjustments to application
architecture and the deployment model might be needed to achieve high
levels of availability.
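
To get a feel for what these percentages mean, here's a small sketch that converts an availability target into allowed downtime and shows how redundancy across instances compounds - under the simplifying assumption that instance failures are independent, which real deployments only approximate:

# What an availability figure means in downtime per month, and how
# (idealized) redundancy compounds. Assumes instance failures are
# independent - real-world deployments only approximate this.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def downtime_minutes(availability):
    return (1 - availability) * MINUTES_PER_MONTH

single = 0.999                       # one instance at 99.9% availability
redundant = 1 - (1 - single) ** 2    # two independent instances behind a load balancer

print(f"99.9% single instance: ~{downtime_minutes(single):.0f} min/month downtime")    # ~43 min
print(f"two redundant copies:  ~{downtime_minutes(redundant):.2f} min/month downtime") # ~0.04 min

The caveat from above still applies: the math only holds if the application itself tolerates the failure of an individual instance.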

Scalability

Assuring uptime under heavy load requires scalability. Cloud touts instant
availability of compute and storage resources, allowing applications to
scale out to remain performant even under increased load. Many cloud
providers have demonstrated that their infrastructures can easily with-
stand tens of thousands of requests per second, more than enough for most
enterprises. Conveniently, the cloud model makes capacity management
the cloud provider’s responsibility, freeing your IT from having to main-
tain unused hardware inventory. You just provision new servers when
you need them.
Organizations looking to benefit from cloud for scale might want to
have a look at serverless compute that manages scaling of instances
transparently inside the platform. However, you can also achieve massive
scale with regular virtual machine instances, e.g. for large batch or number
crunching jobs.
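
To make the serverless option tangible, consider the sketch below, written for AWS Lambda’s Python runtime (the event shape is an assumption that depends on the trigger). The platform runs as many copies of such a function as incoming load requires, without you sizing, provisioning, or patching a single server:

# Sketch: a minimal serverless function (AWS Lambda, Python runtime).
# Scaling, capacity, and patching are the platform's problem, not yours.
import json

def handler(event, context):
    # "event" carries the request payload; its shape depends on the trigger
    # (HTTP gateway, queue, schedule, ...) and is assumed to be a dict here.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }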

Performance

Many end users access applications from the internet, for example from
a browser or a mobile application. Hosting applications with a cloud
provider can provide shorter paths and thus a better user experience
compared to applications hosted on premises. Global SaaS applications
like Google’s Gmail are proof of the cloud’s performance capabilities.
Performance is also a customer-relevant metric, unlike operational cost
reduction, which is rarely passed on to the customer.

One enterprise ran a promotion for customers who download
and install a mobile app. Following common IT guidelines,
they hosted the app image on premises, meaning
all downloads took place via their corporate network. The
campaign itself was a huge success, sadly causing the
corporate network to be instantly overloaded. They would
have been much better off hosting that file in the cloud.

Organizations looking to improve application performance might benefit
from highly scalable and low-latency data stores or in-memory data
stores. For applications that can scale out, provisioning and load balancing
across multiple nodes is also a viable strategy. Many applications will
already use Cloud CDNs (Content Delivery Networks) to reduce website
load times.

Velocity

Speed isn’t just relevant once the application is running but also on
the path there, i.e. during software delivery. Modern cloud tools can
greatly assist in accelerating software delivery. Fully managed build tool
chains and automated deployment tools allow your teams to focus on
feature development while the cloud takes care of compiling, testing, and
deploying software. Once you realize that you shouldn’t run software you
didn’t build, you are going to focus on building software and the velocity
with which you can do it.

Writing this book involved automated builds: a GitHub
workflow triggered a new PDF generation each time I
committed or merged a change into the main branch.

Most cloud providers (and hosted source control providers like GitHub or
GitLab) offer fully managed CI/CD pipelines that rebuild and test your
software whenever source code changes. Serverless compute also vastly
simplifies deployment.

Security

Security used to be a reason for enterprises not to move to the cloud.
However, attack vectors have shifted to exploit hardware-level vulnerabilities
based on microprocessor design (Spectre¹ and Meltdown²) or compromised
supply chains. Such attacks render traditional data center
“perimeter” defenses ineffective and position large-scale cloud providers
as the few civilian organizations with the necessary resources to provide
adequate protection. Several cloud providers further increase security
with custom-developed hardware, for example the AWS Nitro System³
or Google’s Titan Chip⁴. Both are based on the notion of an in-house
hardware root of trust that can detect manipulation of hardware or
firmware.
Cloud providers’ strong security posture also benefits from a sophisticated
operating model with frequent and automated updates. Lastly, they pro-
vide numerous security components such as software-defined networks,
identity-aware gateways, firewalls, WAFs, and many more, which can be
used to protect applications deployed to the cloud.
There likely isn’t a particular set of products that automatically increases
security when migrating to the cloud. However, it’s safe to say that when
used appropriately, operating in the cloud might well be more secure than
operating on premises.

Insight

Cloud isn’t all about computing resources. Cloud providers offer services
that are impractical or impossible to implement at the scale of corporate
data centers. For example, large-scale machine learning systems or pre-
trained models inherently live in the cloud. So, gaining better analytic
capabilities or better insights, for example into customer behavior or
manufacturing efficiency, can be a good motivator for going to the cloud.
Customers looking for improved insight will benefit from open-source
machine learning frameworks such as TensorFlow, pre-trained models
such as Amazon Rekognition⁵ or Google Vision AI⁶, or data warehouse
tools like Google BigQuery or AWS Redshift.
¹https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)
²https://en.wikipedia.org/wiki/Meltdown_(security_vulnerability)
³https://aws.amazon.com/ec2/nitro/
⁴https://cloud.google.com/blog/products/gcp/titan-in-depth-security-in-plaintext
⁵https://aws.amazon.com/rekognition/
⁶https://cloud.google.com/vision
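
To illustrate how little code such pre-trained models require, here is a sketch using boto3 and Amazon Rekognition; the bucket and image names are placeholders, and the other providers’ vision APIs work along similar lines:

# Sketch: label an image with a pre-trained model (Amazon Rekognition).
# No training data, no GPUs, and no model hosting on your side.
import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "example-images", "Name": "factory-floor.jpg"}},
    MaxLabels=10,
    MinConfidence=80,
)
for label in response["Labels"]:
    print(f"{label['Name']}: {label['Confidence']:.1f}%")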

Transparency

Classic IT often has limited transparency into its operational environments.
Simple questions like how many servers are provisioned, which applications
run on them, or what their patch status is are often difficult and
time-consuming to answer. Migrating to the cloud can dramatically increase
transparency thanks to uniform automation and monitoring.
Organizations looking to increase transparency should use automation for
infrastructure deployment and configuration. They should also consider
a careful setup of billing and account hierarchies.
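
Because every resource sits behind an API, a question like “how many servers do we have, and from which machine images do they run?” turns into a few lines of code instead of a stale spreadsheet. A sketch, again assuming AWS and boto3:

# Sketch: build a live server inventory straight from the cloud API.
import boto3
from collections import Counter

ec2 = boto3.client("ec2")
by_state = Counter()
by_image = Counter()

for page in ec2.get_paginator("describe_instances").paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            by_state[instance["State"]["Name"]] += 1
            by_image[instance["ImageId"]] += 1  # rough proxy for OS image and patch level

print("Instances by state:", dict(by_state))
print("Instances by machine image:", dict(by_image))
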
It’s easy to see that there are many good reasons to want to go to
the cloud. It’s also apparent that different motivations lead to different
adoption paths. Choosing multiple reasons is totally acceptable and in fact
encouraged, especially for large IT organizations.

Priorities
Despite all the great things that cloud provides, it can’t solve all problems
all at once. Even with cloud, if a strategy sounds too good to be true, it
probably is. So having a catalog of goals to choose from allows a balanced
discussion of which problems to address first. Perhaps you want to focus
on uptime and transparency first and tackle cost later, knowing well that
savings have to be earned. Equally important to setting the initial focus is
defining what comes later, or what’s out of scope altogether.

Setting Clear Expectations


Clearly communicating goals is as important as setting them. An IT team
that’s aiming for uptime improvements while management was promised
cost savings is headed for trouble. Again, working off a baseline list and
showing what’s in and what’s out (or comes later) can be a great help here.

Measuring Real Progress


Once you’re clear on what you’re after, you can track progress in more
meaningful ways than how many applications were migrated. IT metrics
like counting migrated applications are a dangerous proxy metric for your
real objectives that can lure you into believing that you’re making progress
when you’re not actually providing value to the business.

The number of applications migrated is a dangerous proxy
metric for cloud success because your end customers won’t
care. They will, however, care about application performance,
uptime, or security.

Your strategic objective should be to improve a meaningful business
metric as opposed to internal IT metrics.

Cloud With Training Wheels


Interestingly, many of the benefits listed above, scalability perhaps aside,
don’t actually hinge on moving your applications to the public cloud.
Adopting a cloud operating model combined with popular open source
tools can achieve a large subset of the presented benefits on premises in
your own data center. The downside is that you’ll have to build, or at least
assemble, much of it yourself. It could be a useful testing ground, though,
for whether your organization is able to adjust to the new operating model.
Think of it as “cloud with training wheels”.
12. No One Wants a Server
Cloud is not an infrastructure topic.

When traditional enterprises embark on a cloud journey, they usually
create a project because that’s how things get done in IT. When they look
for a project owner, the infrastructure team appears to be the obvious
choice. After all, cloud is all about servers and networks, right? Not
necessarily - running cloud from the infrastructure side could be your
first major mistake. I’ll explain why.

You Build It, They Run It


Most IT organizations separate teams by infrastructure operations and
application delivery. Often, these respective sub-organizations are labeled
“change” for application delivery and “run” for operations, which include
infrastructure. So, you build (change) a piece of software and then hand it
over to another team to run it.
It isn’t difficult to see that such a setup provides the respective teams with
conflicting objectives: the operations team is incentivized to minimize
change while the “change” team is rewarded for delivering new features.
Often, the resulting applications are “handed” (rather, thrown) over to the
operations team, letting them deal with any production issues whereas
the “change” team is essentially off the hook. Naturally, operations teams
learn quickly to be very cautious in accepting things coming their way,
which in turn stands in the way of rapid feature delivery. And so the
unhappy cycle continues.

Servers & Storage = Infrastructure


When traditional IT organizations first look at cloud, they mostly look
at things like servers (in the form of virtual machines), storage, and
networking. Recalling that AWS’ Simple Storage Service (S3) and Elastic
Compute Cloud (EC2) were the first modern-era cloud services offered
back in 2006, a focus on these elements appears natural.
In enterprise parlance, servers and storage translate into “infrastructure”.
Therefore, when looking for a “home” for their cloud migration effort,
it seems natural to assign it to the infrastructure and operations team, i.e.
the “run” side of the house. After all, they should be familiar with the type
of services that the cloud offers. They are also often the ones who have
the best overview of current data center size and application inventory,
although “best” doesn’t mean “accurate” in most organizations.

Driving the cloud journey from the infrastructure group
could be your first major mistake.

Unfortunately, as simple as this decision might be, in many cases it could
be the first major mistake along the cloud journey.

Serving Servers
To understand why running a cloud migration out of the infrastructure
side of the house warrants caution, let’s look at what a
server actually does. In my typical unfiltered way, I define a server as an
expensive piece of hardware that consumes a lot of energy and is obsolete
in three years.

A server is a piece of expensive hardware that consumes a
lot of energy and is obsolete in three years.

Based on that definition, it’s easy to see that most business and IT users
wouldn’t be too keen on buying servers, whether it’s on premises or in the
cloud. Instead of buying servers, people want to run and use applications.
It’s just that traditional IT services have primed us that before we can do
so, we first have to order a server.

Time is Money
Being able to programmatically provision compute resources is certainly
useful. However, focusing purely on hardware provisioning times and
cost misses the real driver behind cloud computing. Let’s be honest,
cloud computing didn’t exactly invent the data center. What it did was
revolutionize the way we procure hardware with elastic pricing and
instant provisioning. Not taking advantage of these disruptive changes
is like going to a gourmet restaurant and ordering French fries - you’d be
eating but entirely missing the point.
Being able to provision a server in a few seconds is certainly nice, but it
doesn’t do much for the business unless it translates into business value.
Server provisioning time often makes up a small fraction of the time
in which a normal code change can go into production (or into a test
environment). Reducing the overall time customers have to wait for a new
feature to be released, the time to value, is a critical metric for businesses
operating in Economies of Speed. Velocity is the currency of the digital
world. And it needs much more than just servers.

Increasing the velocity with which you can deploy software
reduces the time to value for the business and the hardware
cost. It’s a win-win.

Rapid deployment is also a key factor for elastic scaling: to scale out across
more servers, running software has to be deployed to those additional
servers. If that takes a long time, you won’t be able to handle sudden load
spikes. That means that you have to keep already configured servers in
stand-by, which drives up your cost. Time is money, also for software
deployment.

The Application-centric Cloud


Organizations that look at the cloud from an infrastructure point of view,
seeing servers, storage, and network, will thus miss out on the key benefits
of cloud. Instead, organizations should take an application-centric view on
their cloud strategy. You might think that this is a somewhat narrow view
because a large portion of what IT runs wasn’t actually built in house.
Park that thought for a minute or, more precisely, until the next chapter.
An application-centric approach to cloud will look quite different from
an infrastructure-oriented approach. It focuses on speeding up software
delivery, primarily by reducing friction. Reduced friction and increased
delivery speed come from automation, such as a fully automated tool
chain, infrastructure-as-code, and cloud-ready applications that scale
horizontally. Coupled with modern techniques like continuous integration
(CI), continuous delivery (CD), and lean development, such tooling can
transform the way an enterprise thinks about software delivery. For more
insight into why automation is so important, see Automation Isn’t about
Efficiency.
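
Infrastructure-as-code is one of the biggest friction removers: the environment an application needs is described in a reviewable, versionable program instead of a ticket. The sketch below uses Pulumi’s Python SDK purely as an illustration - the resource names and the AMI ID are placeholders, and tools such as Terraform or AWS CDK follow the same idea:

# Sketch: infrastructure-as-code with Pulumi's Python SDK (illustrative only).
# Running "pulumi up" creates or updates these resources; the code is the ticket.
import pulumi
import pulumi_aws as aws

# An object store bucket for build artifacts.
artifacts = aws.s3.Bucket("build-artifacts", versioning={"enabled": True})

# A small VM for a stateless front end; the AMI ID is a placeholder.
web = aws.ec2.Instance(
    "web-frontend",
    instance_type="t3.micro",
    ami="ami-0123456789abcdef0",
    tags={"environment": "dev"},
)

pulumi.export("artifact_bucket", artifacts.id)
pulumi.export("web_public_ip", web.public_ip)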

Looking Sideways
Naturally, cloud providers realize this shift and have come a long way
since the days when they only offered servers and storage. Building an
application on EC2 and S3 alone, while technically totally viable, feels
a little bit like building a web application in HTML 4.0 - functional,
but outdated. A minority of Amazon’s 100+ cloud services these days is
purely related to infrastructure. You’ll find databases of all forms, con-
tent delivery, analytics, machine learning, container platforms, serverless
frameworks, and much more.
Although there are still minor differences in VM-level cloud services
between the providers, the rapid commoditization of infrastructure has
the cloud service providers swiftly moving up the stack from IaaS to mid-
dleware and higher-level run-times. “Moving up the stack” is something
that IT has been doing for a good while, e.g. when moving from physical
machines to virtual ones in the early 2000s. However, once this process
reaches from IaaS (Infrastructure as a Service) and PaaS (Platform as a
Service) to FaaS (Functions as a Service), aka “Serverless”, there isn’t much
more room to move up any further.

After moving up, look sideways!

Instead, IT and cloud providers start to look “sideways” at the software
development lifecycle and tool chain: getting new features into production
quickly is the core competency for competing in the digital world that’s
based on rapid build-measure-learn cycles. Tool chains are an applica-
tion’s first derivative¹ - they determine an application’s rate of change.
Modern tool chains are quite sophisticated machines. A common strategy
is to “shift left” aspects like quality and security. Rather than being tacked
on at the end of a software project lifecycle, where they tend to get
squeezed and it’s often too late to make significant improvements anyway,
these concerns become part of the continuous delivery tool chain. This
way, any deviations can be detected automatically and early in the process
because tests are executed on every code change or every deployment.
Following this principle, modern tool chains thus include steps like:

• Code-level unit tests


• Static code analysis for vulnerabilities (sometimes referred to as
SAST - Static Application Security Testing)
• Dependency analysis, e.g. detecting use of outdated or compro-
mised libraries
• License scans to detect license “pollution” from third party libraries
• Tracking of code quality metrics
• Penetration testing

Gone are the days where delivery tool chains were cobbled together one-
offs. Modern tool chains run in containers, are easily deployed, scalable,
restartable, and transparent.
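
What such a pipeline step can look like, in a deliberately simplified sketch: the same scripted checks run on every push, so deviations surface with each change rather than at the end of the project. The tool choices here (pytest for unit tests, bandit for static security analysis, pip-audit for dependency checks) are common Python examples, not a prescribed stack:

# Sketch: a "shift left" quality gate a CI job could run on every commit.
# Tool names and the "src" directory are illustrative assumptions.
import subprocess
import sys

CHECKS = [
    ("unit tests", ["pytest", "--quiet"]),
    ("static security analysis (SAST)", ["bandit", "-r", "src"]),
    ("dependency vulnerability scan", ["pip-audit"]),
]

def main() -> int:
    for name, command in CHECKS:
        print(f"Running {name} ...")
        if subprocess.run(command).returncode != 0:
            print(f"FAILED: {name} - blocking this change early.")
            return 1
    print("All checks passed.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
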
¹Hohpe, The Software Architect Elevator, 2020, O’Reilly

The shift to looking sideways is also incorporated in modern serverless
products like AWS Lambda or Google Cloud Functions, which blend
runtime and deployment. These platforms not only abstract away in-
frastructure management, they also improve software deployment: you
essentially deploy source code or artifacts directly from the command line,
passing any run-time configuration that is needed.

Don’t Build Yet Another Data Center


Driving a cloud migration from the infrastructure point-of-view has
another grave disadvantage: you’re likely to transport your current way
of managing infrastructure into the cloud. That happens because the
infrastructure and operations teams are built around these processes. They
know that these processes work, so why not keep them to minimize risk?
Naturally, these teams will also have a certain sense of self-preservation,
so they’ll be happy to keep their existing processes that emphasize control
and manual check points over end-to-end automation.

Carrying your existing processes into the cloud will get you
another data center, not a cloud.

Moving to the cloud with your old way of working is the cardinal sin of
cloud migration: you’ll get yet another data center, probably not what you
were looking for! Last time I checked, no CIO needed more data centers.
In the worst case, you need to face the possibility that you’re not getting
a cloud at all, but an Enterprise Non-Cloud.

“It runs in the cloud” Doesn’t Make the Mark
Knowing that the real value lies in quickly deploying and scaling appli-
cations also implies that a statement like “app XYZ can run in the cloud”
doesn’t carry much meaning. Most applications running on an operating
system from this century can be run on a server in the cloud somehow.

A cloud strategy’s objective should therefore not be limited to getting
existing applications to run in the cloud somehow. Rather, applications
should measurably benefit from running in the cloud, for example by
scaling horizontally or becoming resilient thanks to a globally distributed
deployment. Hence, my snarky retort to such a claim is to state that my
car can fly - off a cliff. Don’t hurl your applications off a cliff!
13. Don’t Run Software You
Didn’t Build
Running other people’s software is actually a bad deal.

The answer to common concerns about cloud often lies in taking a step back
and looking at the problem from a different angle. For example, several
modern cloud platforms utilize containers for deployment automation
and orchestration. When discussing container technology with customers,
a frequent response is: “many of my enterprise applications, especially
commercial off-the-shelf software, don’t run in containers, or at least not
without the vendor’s support.” Fair point! The answer to this conundrum,
interestingly, lies in asking a much more fundamental question.

Enterprise IT = Running Someone Else’s Software
Enterprise IT grew up running other people’s software. That’s because
enterprise IT favors buying over building software, and in most cases
rightly so: there’s little to be gained from writing your own payroll or
accounting software. Also, to address a broad market (and to cater to
organizations’ fear of code - see 37 Things), enterprise packages are often
highly configurable, reducing the need for custom development.
Over time, running other people’s software has thus become the funda-
mental premise of IT. IT’s main responsibility consisted of procuring soft-
ware and hardware, installing the software on the hardware, configuring
and integrating, and then operating it, the latter often being outsourced.
Although you might think that this setup lets others do the heavy lifting
while IT “merely integrates”, it isn’t nearly as great as it looks on
paper.
First, it has led to a state where IT is quite overloaded with countless
applications, sometimes resembling an image like this:

Does your IT feel a bit overloaded?

It’s impressive how much load your IT truck can handle, but nevertheless
it’s time to slim down.

The Unfortunate IT Sandwich


Worse yet, because traditional IT often also outsources operations, it ends
up in a somewhat awkward place, “sandwiched” between commercial
software vendors on one side and the third-party operations provider on
the other. And to make it ever more fun (for those watching, not those in
the middle of it) the business would beat down on IT why they couldn’t
have the desired features and why the system is down again, squeezing
the sandwich from both sides (not shown: blood, sweat, and tears running
out from IT):

The IT Sandwich getting a squeeze

Stuck in the middle, IT would shell out money for licenses and hardware
assets, but lack control over either the software features or its own
infrastructure. Instead of a sandwich, an image of a vice might also come
to your mind. Aren’t we glad cloud came and showed us that this isn’t a
particularly clever setup?

Running Others’ Software is a Bad Deal!


Looking at it this way, it’s not difficult to see that running software built
by others is actually a pretty bad deal:

You pay for hardware

The only things that IT owns are expensive assets, which
depreciate quickly. To make it worse, software vendors’
sizing requirements are often conservative because it’s easier
for them to have you buy more hardware than to explain
performance challenges afterwards.

Installation is cumbersome

Although many folks want to make us believe that the benefit
of packaged software is easy and safe installation and configuration,
the reality often looks different. When we faced
installation issues once, the vendor’s response was that it’s
due to our environment. Isn’t getting other people’s software
to run in our environment the very purpose of installation?

If something breaks, you’re guilty until proven innocent

Yup, it’s like that porcelain shop: you break, you pay. In
this case, you pay for support. And if something breaks you
have to convince external support that it’s not due to your
environment. All the while business is beating down on you
why stuff isn’t running as promised.

You can’t make changes when you need them

And despite all this effort, the worst part is that you don’t
have control on either end. If you need a new feature in the
software you are running, you’ll have to wait until the next
release - that is, if your feature is on the feature list. If not,
you may be waiting a looooong time. Likewise, operations
contracts are typically fixed in scope for many years. No
Kubernetes for you.

Running software you didn’t build isn’t actually all that great. It’s amazing
how enterprise IT has become so used to this model that it accepts it as
normal and doesn’t even question it.

Enterprise IT has become so used to running other people’s
software that it doesn’t realize what a bad deal it is.

Luckily, the wake-up call came with cloud and the need to question our
basic assumptions.

SaaS - Software as a Service


The world of cloud has brought us new options to rethink our operational
model. A Software-as-a-Service (SaaS) model hands over most of the
responsibilities to the software vendor: the vendor runs the applications
for you; and applies patches; and upgrades; and scales; and replaces faulty
hardware (unless they run in the cloud, which they likely should). Now
that sounds a lot better!

Salesforce, with its CRM, was really the first major company to lead us on that
virtuous route back around 2005, even before the modern cloud providers
started offering us servers in an IaaS model. Of course it’s by far no
longer the only company following this model: you can handle your
communications and documentation needs in the cloud thanks to Google
G Suite¹ (which started in 2006 as “Google Apps for Your Domain”), HR
and performance management with SAP SuccessFactors², ERP with SAP
Cloud ERP³, service management with ServiceNow⁴, and integrate it
all with Mulesoft CloudHub⁵ (note: these are just examples; I have no
affiliation with these companies and the list of SaaS providers is nearly
endless).

Anything as a Service
Service-oriented models aren’t limited to software delivery. These days
you can lease an electric power generator by the kWh produced or a train
engine by the passenger mile run. Just like SaaS, such models place more
responsibility on the provider. That’s because the provider deals with
operational issues without any opportunity of pointing the finger at the
consumer:

SaaS: Feedback leads to better quality

If the train engine doesn’t run, in the traditional model it’d be the
customer’s (i.e. the train company’s) problem. In many cases the customer
would even pay extra for service and repairs. In a service model, the
manufacturer doesn’t get to charge the customer in case of operational
issues. Instead, the manufacturer takes the hit because the customer
won’t pay for the passenger kilometers not achieved. This feedback loop
motivates the provider to correct any issue quickly. The same is (or at least
should be) true for software as a service: if you pay per transaction, you
won’t pay for software that can’t process transactions.
¹https://gsuite.google.com
²https://www.successfactors.com
³https://www.sap.com/sea/products/erp/erp-cloud.html
⁴https://www.servicenow.com/
⁵https://www.mulesoft.com/platform/saas/cloudhub-ipaas-cloud-based-integration

SaaS is like applying DevOps to packaged software: it places
the end-to-end responsibility with those people who can
best solve a problem.

Naturally, you are happier with running software (or a running train), but
the feedback loop established through pricing by generated value leads to
a higher quality product and more reliable operations. The same principle
is at work in a DevOps model for software development: engineers
will invest more into building quality software if they have to deal with
any production issues themselves.

But what about…


When considering SaaS, enterprise IT tends to pull from its long repertoire
of “buts”, triggered by a major shift in operating model:

Security
As described earlier, the days when running applications on your
premises meant they were more secure are over. Attack vectors
like compromised hardware or large-scale denial-of-service attacks
by nation-state attackers make it nearly impossible to protect IT
assets without the scale and operational automation of large cloud
providers. Plus, on-premises often isn’t actually on your premises -
it’s just another data center run by an outsourcing provider.
Latency
Latency isn’t just a matter of distance, but also a matter of the num-
ber of hops, and the size of the pipes (saturation drives up latency
or brings a service down altogether). Most of your customers and
partners will access your services from the internet, so often the
cloud, and therefore the SaaS offering, is closer to them than your
ISP and your data center.

Control
Control is when reality matches what you want it to be. In many
enterprises, manual processes severely limit control: someone files
a request, which gets approved, followed by a manual action.
Who guarantees the action matches what was approved? A jump
server and session recording? Automation and APIs give you tighter
control in most cases.
Cost
Although cloud doesn’t come with a magic cost reduction button,
it does give you better cost transparency and more levers to control
cost. Especially hidden and indirect costs go down dramatically
with SaaS solutions.
Integration
One could write whole books about integration - oh, wait⁶! Natu-
rally, a significant part of installing software isn’t just about getting
it to run, but also to connect it to other systems. A SaaS model
doesn’t make this work go away completely, but the availability
of APIs and pre-built interfaces (e.g. Salesforce and GSuite⁷) makes
it a good bit easier. From that perspective, it’s also no surprise that
Salesforce acquired Mulesoft, one of the leading integration-as-a-
service vendors.

What about the software you do build?


Naturally, there’s going to be some software that you do want to build, for
example because it provides a competitive advantage. For that software,
delivery velocity, availability, and scalability are key criteria. That’s
exactly why you should look for an application-centric cloud!

Strategy = Setting the vector


Not all your on-premises software will disappear tomorrow. And not all
other software will run in containers or on a serverless platform. But
strategy is a direction, not current reality. The key is that every time
you see yourself running software you didn’t build, you ask yourself why
you’re carrying someone else’s weight.
⁶https://www.enterpriseintegrationpatterns.com
⁷https://www.salesforce.com/campaign/google/
14. Don’t Build an
Enterprise Non-Cloud!
Be careful not to throw the cloud baby out with the enterprise
bathwater.

Many enterprises who moved to the cloud have found that not all their
expectations have been met, or at least not as quickly as they would
have liked. Although an unclear strategy or inflated expectations can be
culprits, in many cases the problem lies closer to home. The migration
journey deprived the enterprise of exactly those great properties that the
cloud was going to bring them.

This fancy looking sports car is unlikely to meet your expectations

Cloud for Enterprise


When enterprises move to a commercial cloud provider, they don’t just
grab a credit card, sign up, and deploy away. They have to abide by
existing policies and regulations, need to ensure spend discipline, and
often have special data encryption and residency requirements. Therefore,
almost every IT department has a cloud transformation program under-
way that attempts to marry existing ways of working with the operating
model of the cloud. Now, because the amazing thing about cloud is that it
rethinks the way IT is done, we can imagine that this translation process
isn’t trivial.

Enterprises don’t just grab a credit card, sign up for a cloud
provider, and deploy away.

When I work with large organizations on their cloud strategy, several
recurring themes come up:

• Onboarding Process
• Hybrid Cloud
• Virtual Private Cloud
• Legacy Applications
• Cost Recovery

Each of them makes good sense:

Onboarding

Enterprises have special requirements for cloud accounts that differ from
start-ups and consumers:

• They utilize central billing accounts to gain cost transparency
instead of people randomly using credit cards
• They need to allocate cloud charges to specific cost centers, e.g.
through billing accounts
• They negotiate discounts based on overall purchasing power or
“commits”, stated intents to use a certain volume of cloud resources
• They may limit and manage the number of cloud accounts being
shared in the organization
• They may require approvals from people whose spending authority
is sufficiently high

Most of these steps are necessary to connect the cloud model to existing
procurement and billing processes, something that enterprises can’t just
abandon overnight. However, they typically lead to a semi-manual sign-
up process for project teams to “get to the cloud”. Likely, someone has to
approve the request, link to a project budget, and define spend limits. Also,
some enterprises have restrictions on which cloud providers can be used,
sometimes depending on the kind of workload.
Cloud developers might need to conduct additional steps, such as config-
uring firewalls so that they are able to access cloud services from within
the corporate network. Many enterprises will require developer machines
to be registered with device management and be subjected to endpoint
security scans (aka “corporate spyware”).
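
Some of these guardrails can themselves be automated rather than turned into manual approval steps. As a sketch - assuming AWS with boto3, and with the account ID, budget amount, and e-mail address as placeholders - a monthly cost budget with an alert can be created programmatically for each project team:

# Sketch: a programmatic spend guardrail instead of a manual approval gate.
# Account ID, budget name, amount, and e-mail address are placeholders.
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",
    Budget={
        "BudgetName": "team-checkout-monthly",
        "BudgetLimit": {"Amount": "5000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,  # alert at 80% of the monthly limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "team-checkout@example.com"}
            ],
        }
    ],
)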

Hybrid Network

For enterprises, Hybrid Cloud is a reality because not all applications can
be migrated overnight. This’ll mean that applications running in the cloud
communicate with those on premises, usually over a cloud interconnect,
which connects the virtual private cloud (VPC) with the existing
on-premises network, making the VPC look like an extension of that
network.

Virtual Private Cloud

Enterprises aren’t going to want all their applications to face the internet
and many also want to be able to choose IP address ranges and connect
servers with on-premises services. Many enterprises are also not too keen
to share servers with their cloud tenant neighbors. Others yet are limited
to physical servers by existing licensing agreements. Most cloud providers
can accommodate this request, for example with dedicated instances¹ or
dedicated hosts (e.g. AWS² or Azure³).

Legacy or Monolithic Applications

The majority of applications in the enterprise portfolio are going to be
third party commercial software. Applications that are built in house
often are architected as single instances (so-called “monoliths”). These
applications cannot easily scale out across multiple server instances. Re-
architecting such applications is either costly or, in case of commercial
applications, not possible.

Cost Recovery

Lastly, preparing the enterprise for commercial cloud, or the commercial
cloud for enterprise, isn’t free. This cost is typically borne by the central IT
group so that it can be amortized across the whole enterprise. Most central
IT departments are cost centers that need to recover their cost, meaning
any expenditure has to be charged back to business divisions, who are
IT’s internal customers. It’s often difficult to allocate these costs on a per-
service or per-instance basis, so IT often adds an “overhead” charge to the
existing cloud charges, which appears reasonable.
¹https://aws.amazon.com/ec2/pricing/dedicated-instances/
²https://aws.amazon.com/ec2/dedicated-hosts/
³https://azure.microsoft.com/en-us/services/virtual-machines/dedicated-host/
There may be additional fixed costs levied per business unit or per project
team, such as common infrastructure, aforementioned VPNs, jump hosts,
firewalls, and much more. As a result, internal customers pay a base fee
on top of the cloud service fee.

Remembering NIST
The US Department of Commerce’s National Institute of Standards and
Technology (NIST) published a very useful definition of cloud computing
in 2011 (PDF download⁴). It used to be cited quite a bit, but I haven’t seen
it mentioned very much lately - perhaps everyone knows by now what
cloud is and the ones who don’t are too embarrassed to ask. The document
defines five major capabilities for cloud (edited for brevity):

On-demand self-service
A consumer can unilaterally provision computing capabilities, such
as server time and network storage, as needed automatically with-
out requiring human interaction.
Broad network access
Capabilities are available over the network and accessed through
standard mechanisms.
Resource pooling
The provider’s computing resources are pooled to serve multiple
consumers using a multi-tenant model, with different physical and
virtual resources dynamically assigned and reassigned according to demand.
Rapid elasticity
Capabilities can be elastically provisioned and released to scale
rapidly outward and inward with demand.
Measured service
Cloud systems automatically control and optimize resource use by
leveraging a metering capability (typically pay-per-use).

So, after going back to the fundamental definition of what a cloud is, you
might start to feel that something doesn’t 100% line up. And you’re spot
on!
⁴https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-145.pdf

The Enterprise Non-Cloud


Putting the enterprise “features” from above next to the NIST capabilities,
you realize that the two largely contradict each other:

• Lengthy sign-up processes contradict on-demand self-service because
they routinely require manual approvals and software installs -
corporate IT processes send their regards.
• Your corporate network isn’t going to be quite as broad as the
internet, and firewalls and loads of other restrictions make network
access far from universal.
• Dedicated instances aren’t as widely pooled and have poorer economies
of scale. Your network interconnect is also dedicated.
• Traditional applications don’t benefit from rapid elasticity as they
don’t scale out and deployment often isn’t automated.
• A high baseline cost charged from corporate IT makes the cloud a lot
less “measured” and often burdens small projects with prohibitive
fixed costs.

The Enterprise Non-cloud

That’s bad news as it means that despite all good intentions your enter-
prise didn’t get a cloud! It got yet another good ole corporate data center,
which is surely not what it was looking for.

Many “enterprise clouds” no longer fulfill the fundamental
capabilities of a cloud.

What Now?
So, how do you make sure your enterprise cloud remains deserving of the
label? While there’s no three-step-recipe, a few considerations can help:

Calibrate Expectations
Realization is the first step to improvement. So, being aware of these traps
is a first major step towards not falling into them, or at least not without
knowing. Also, it behooves us to temper rosy visions of cost savings and
digital transformation. Moving all your old junk to a new house means
you’ll be living with the same junk just in a fancier environment. Likewise,
carrying your enterprise baggage into the cloud won’t transform anything.

Bring cloud to you, not vice versa


Cloud isn’t a classic IT procurement but a fundamental change in IT
operating model. Therefore, you should be cautious not to transport your
existing operating model to the cloud as that’ll lead to the results cited
above. Instead, you need to bring some elements of the cloud operating
model to your environment. For example, you can replace cumbersome
manual processes with automation and self-service so that they benefit
both on-premises systems and those running in the cloud.

Measurable Goals
A cloud migration without clear measurable goals risks getting off track as
IT departments become lost in all the shiny new technology toys. Instead,
be clear why you’re going to the cloud: to save cost, to improve uptime,
to launch new products faster, to secure your data, to scale more easily,
to sound more modern - there are many reasons for going to the cloud.
Prioritizing and measuring progress is more likely to keep you on track.

Segmentation
Enterprise IT loves harmonization but one cloud size won’t fit all applica-
tions. Some applications don’t need to undergo all firewall-compartment-
peering-configuration-review-and-approval steps. Perhaps some applica-
tions, for example simple ones not holding customer data, can just go
straight to the cloud, just as long as billing isn’t on Jamie’s credit card.

More traps
Cloud migrations navigate in treacherous waters. Many enterprises are
falling into the trap of wanting to avoid lock-in at all cost and look to
achieve that by not using cloud-provider-managed services because most
of them are proprietary. This means no DynamoDB, Athena, AWS Secrets
Manager, SQS, BigQuery, Spanner, Rekognition, Cloud AutoML, and so
on. Now, you may still have a cloud, but one that predates the NIST
definition from 2011. Also not the place where you want to be. If you
embrace cloud, you should embrace managed services.

Happy Clouds
When enterprises embark on a cloud journey, they often focus on the great
new things they will get. But equally important is leaving some of your
enterprise baggage behind.
15. Cloud Migration per
Pythagoras
Time to dig out your geometry books.

Choosing the right path to the cloud isn’t always easy. Just moving all
the old junk into the cloud is more likely to yield you another data center
than a cloud transformation whereas re-architecting every last application
will take a long time and not be cost effective. So, you’ll need to think
a bit more carefully about what to do with your application portfolio.
Decision models help you make conscious decisions and sometimes even
third-grade geometry lessons can help you understand the trade-offs.

Moving up or moving out


I have long felt that organizations looking to move their workloads to
the cloud can benefit from better decision models. When people read
about models, they often imagine something quite abstract and academic.
Good models, however, are rather the opposite. They should be simple but
evocative so that they both help us make better decisions and allow us
to communicate those decisions in an intuitive way.
One of my more popular blog posts, co-authored with Bryan Stiekes, titled
Moving Up or Out¹, presented such a model. The model plots application
migration options on a simple two-dimensional plane.
¹https://cloud.google.com/blog/topics/perspectives/enterprise-it-can-move-up-or-out-or-both

Moving applications up or out

The plane’s horizontal axis represents moving “out” from on premises
to the cloud, represented by a vector to the right. Alternatively, you
can modernize applications to operate further “up” the stack, further
away from servers and hardware details, for example by utilizing PaaS,
serverless, or SaaS services.
So, a completely horizontal arrow implies Lift-and-shift, moving appli-
cations as they are without any modernization. A vertical arrow would
imply a complete re-architecting of an application, for example to run on
a serverless framework, while remaining in place in an on premises data
center.
Combinations aren’t just allowed but also encouraged, so the blog post
advises architects to visually plot the direction they suggest for each
application or class of applications on such a matrix: out, up, or a
combination thereof. It’s a good example of a simple decision tool that
allows architects to make conscious decisions and to communicate them
broadly in the organization.

Not All IT is Binary

While the model plots simple axes for “up” / “modernize” and “out” /
“migrate”, each direction is made up from multiple facets and gradations:

Up

Moving an application “up the stack” can occur along multiple dimen-
sions:

Platform: IaaS ⇒ PaaS ⇒ FaaS


The metaphor of “moving up” stems from the notion of the IT
Stack that’s rooted in hardware (such as servers, storage, and
networks) and is then increasingly abstracted initially through VM-
level virtualization and then via containers (OS-level virtualization)
and ultimately serverless platforms.
Sometimes, this progression is also labeled as IaaS (Infrastructure-
as-a-Service, i.e. VMs), PaaS (Platform-as-a-Service, i.e. Container
Orchestration²), and FaaS (Functions-as-a-Service, i.e. Serverless).
Higher levels of abstraction usually increase portability and reduce
operational toil. While this nomenclature focuses on the run-time
platform itself, applications that take advantage of the respective
abstraction level will also look different.
Structure: Monolith ⇒ Microservices
Another way to show how “high” an application is on the vertical
axis can be expressed by the application’s deployment and run-time
architecture. Traditional applications are often “monoliths”, single
pieces of software. Even though that model has served us well for
several decades because it’s easy to manage and all pieces are in one
artifact, monoliths make independent deployment and independent
scaling difficult. The push for more delivery velocity gave way
to the idea of breaking applications into individual, independently
deployable and scalable services.
Deployment: Manual ⇒ Automated
A third way of moving up is to replace manual provisioning,
deployment, and configuration with automated methods, which
give you fast and repeatable deployments.

One of the nicest, ideology-free summaries I have seen on moving up the
stack is by former colleague Tom Grey in his blog post 5 principles for
Cloud-native Architecture³ (The title receives a passing score thanks to
“architecture” compensating for the term “native”).
²There is a lot of debate about how similar or different PaaS and container orchestration
platforms are, seemingly fueled by product marketing considerations. Let’s stay clear of this
mud fight and consider them at roughly the same level of abstraction even though they’re
not entirely synonymous.

Out

Moving out is a bit simpler as it indicates moving applications from
running on premises to running in the cloud. Gradations on this axis also
exist because single applications may run in a hybrid model where some
components, e.g. the front-end, are running in the cloud whereas others,
such as a legacy back-end, still run on premises. Hence, “moving out” isn’t
black or white. You are welcome to fine tune the model for your use by
adding “hybrid” in the middle of the horizontal axis.

Migration Triangles
Although the title plays off the tough career management principle used
by some consulting firms (“if you can’t keep moving up, you have to
get out”), in our case up-and-out or out-and-up is definitely allowed and
actually encouraged.
Using the up-or-out plane to plot the migration path makes it apparent that
migrations usually aren’t a one-shot type of endeavor. So, rather than just
putting applications into a simple bucket like “re-host” or “re-architect”,
the visual model invites you to plot a multi-step path for your applications,
highlighting that a cloud journey is indeed a journey and not a single
decision point.
³https://cloud.google.com/blog/products/application-development/5-principles-for-cloud-native-architecture-what-it-is-and-how-to-master-it

Plotting Migration Paths

In the example in the diagram above, an application might undergo a
sensible five-step process:

1. A small amount of modernizing is required to allow a portion of the
application, for example the front-end, to move to the cloud. This
could involve developing new or protecting existing APIs, setting
up an API gateway, etc.
2. Subsequently, a portion of the application can be migrated while
others, e.g. a legacy back-end, stay on premises.
3. Further modernization, e.g. porting the back-end from an existing
proprietary application server that isn’t compatible with cloud
licensing to an open source alternative such as Tomcat, prepares
the remainder of the application for migration.
4. The remainder of the application is migrated.
5. Now fully in the cloud, the application can be further modernized
to use managed database services, cloud monitoring, etc. to increase
stability and reduce operational overhead.

Why not do everything at once? Time to value! By moving a piece first
you can build critical skills and benefit from scalability thanks to Content
Delivery Networks and protection from Cloud WAFs (Web Application Firewalls)
immediately. If you want to be more ambitious, you would plot the value
generated at each step along the line, perhaps as an uplifting counterpart to
Charles Joseph Minard’s map of Napoleon’s March⁴, which is frequently
lauded by Ed Tufte⁵ as one of the best statistical graphics ever drawn.

Remember Pythagoras?
Once you plot different sequences of migration options on sheets of paper,
you might feel like you’re back in trigonometry class. And that thought is
actually not as far fetched as it might seem. One of the most basic formulas
you might remember from way back when is Pythagoras’ Theorem. He
concluded some millennia ago that for a right triangle (one that has one
right angle), the sum of the squares of the sides equals the square of the
hypotenuse: c² = a² + b².

Pythagoras’ Theorem

Without reaching too deep into high school math, this definition has a few
simple consequences⁶:

1. Each side is shorter than the hypotenuse


2. The hypotenuse is shorter than the sum of the sides

Let’s see how these properties help you plan your cloud migration.
⁴https://en.wikipedia.org/wiki/Charles_Joseph_Minard
⁵https://www.edwardtufte.com/tufte/posters
⁶These properties apply to all triangles, but in our up-or-out plane we deal with right
triangles, so it’s nice to be able to fall back on Pythagoras.

Cloud Migration Trigonometry


Just like in geometry, the sides of the triangle are shorter, meaning that just
moving out to the cloud horizontally across is going to be faster (shorter
line “b”) than trying to re-architect applications while migrating them (the
hypotenuse, labeled “c”):

Each side is shorter than the hypotenuse

However, migrating first and then re-architecting makes for a longer path
than taking the hypotenuse. Hence it might be considered less “efficient”
- spelled with air quotes here as efficiency can be a dangerous goal also.
The same holds true for re-architecting first and then migrating out - in
either case the sum of the sides is larger than the hypotenuse:

The hypotenuse is shorter than the sum of the sides

However, if generating value is your key metric and you factor in the cost
of delay, this path might still be the most beneficial for your business.
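
Putting hypothetical numbers on the triangle makes the trade-off tangible. Suppose lifting-and-shifting an application is 6 units of effort and re-architecting it is 4 units; the short sketch below compares the direct diagonal with the two-step path:

# Sketch: comparing migration paths with made-up effort numbers.
import math

move_out = 6.0  # effort to lift-and-shift ("out")
move_up = 4.0   # effort to re-architect ("up")

diagonal = math.hypot(move_out, move_up)  # migrate and modernize in one combined step
two_steps = move_out + move_up            # migrate first, modernize later

print(f"Diagonal (combined step): {diagonal:.1f} units")  # ~7.2
print(f"Out first, then up:       {two_steps:.1f} units")  # 10.0

The two-step path costs more in total, but it delivers its first value - the application running in the cloud - after 6 units instead of 7.2, which is exactly the cost-of-delay argument made above.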

What about all the R’s?


Many models for choosing application migration paths are based on a
series of words starting with “R”, such as rehost - replatform - rearchitect
- retire. Interestingly, multiple versions exist, at least two with five (by
AWS from 2016⁹). Not all lists use the same vocabulary, and Gartner
also apparently updated the terms in 2018 (Rearchitect became Refactor),
adding to the confusion.
Here’s a brief attempt at a summary, yielding 10 terms grouped into 8
concepts, indicating where the terms appear (A = AWS, G = Gartner, M =
Microsoft):
Rehost [A/G/M]
Moving the whole VM from on premises to the cloud, essentially
changing the physical “host” but nothing else.
⁷https://www.gartner.com/en/documents/1485116/migrating-applications-to-the-cloud-rehost-refactor-revi
⁸https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/digital-estate/5-rs-of-rationalization
⁹https://aws.amazon.com/blogs/enterprise-strategy/6-strategies-for-migrating-applications-to-the-cloud/

Replatform [A]
Shifting the application as is, but changing the operating system or
middleware (the platform the application sits on), usually motivated
by licensing restrictions or outdated operating systems not available
in the cloud.
Repurchase/Replace [A/G/M]
Replacing the application with another commercially available (or
open source) one, preferably in a SaaS model.
Refactor / Rearchitect [A/G/M]
Completely redoing the application’s deployment architecture, e.g.
by breaking a monolith into microservices, replacing one-off com-
ponents with cloud services, or extracting state from services into
a NoSQL data store.
Revise [G]
Updating legacy applications to prepare them for a subsequent
rehost.
Rebuild [G/M]
Perhaps the more drastic version of refactor, implying that the
application is rebuilt from scratch. Just like with a house, there is
likely a gray area between remodeling and rebuilding.
Retire [A]
Getting rid of the application. This can happen because sweeping
the application portfolio to prepare for a cloud migration can reveal
unused or orphaned applications.
Retain [A]
Keeping the application where it is, as it is, on premises.

Or, if you prefer to pivot this matrix, here is the list of “R”s I could extract
from each of the above on-line resources:

• Gartner 2010: Rehost, Revise, Rearchitect, Rebuild, Replace


• Gartner 2018: Rehost, Refactor, Revise, Rebuild, Replace
• AWS 2016: Rehost, Replatform, Repurchase, Refactor / Rearchitect,
Retire, Retain
• Microsoft 2019: Rehost, Refactor, Rearchitect, Rebuild, Replace

So, while the “R”s are undeniably catchy, the semantics behind them are
fuzzy. It’ll be difficult to lock down the semantics in any case, but the
primary danger is folks walking out of a meeting mumbling “replatform”
to themselves but meaning different things. In that case, you need another
“R”: reconcile.

Decision Models
Sketching paths on a simple chart admittedly also has fuzzy semantics
(“up” can mean rearchitecting completely or perhaps just refactoring or
automating) but the fact that the model is abstract conveys this fact
more clearly than a set of seemingly precise words. Sketches also make
communicating your strategy to a wide range of stakeholders easier than
reciting lengthy variations of verbs starting with “R”.

The Decision Is Up To You


Models like these sketches exist to help you make better decisions and
communicate them more easily and more broadly. They are particularly
need to move out quickly, e.g. in order to vacate an existing data center by
the time the contract is up, picking the short side of the triangle to move
out first will be best, even though you are ultimately taking a longer path.
If you have time and resources, following the hypotenuse is actually less
effort than making two hops. However, keep in mind that you will likely
make several Optimization Rounds to reduce cost.
So, get your rulers, pencils, and grid paper and draw some cloud migra-
tions!
Part IV: Architecting
the Cloud

IT discussions are routinely dominated by product names and buzzwords.
“To implement <buzzword>, we are evaluating <product A> and <product
B>” is a common part of IT discussions, only outdone by the second
version, “To become <buzzword>, we are evaluating <product C> and
<product D>.” I’ll leave you to assign the words agile, lean, digital,
anti-fragile, zero trust, DevOps, IaC, and cloud-native to the respective
sentences.
A successful cloud migration depends on a lot more than just buzzwords.
Selecting the right vendor and the right services might help, but putting
those pieces together in a meaningful way that supports the business
objectives is what cloud strategy and cloud architecture are all about.

I frequently compare being an architect to being a restaurant’s
star chef: picking good ingredients is useful, but how
it’s put together is what earns the restaurant its reputation.
And, as anyone who has tried to re-create their favorite
restaurant dish can attest, there’s usually a lot more in-
volved than is apparent from the end product.

Cloud Architecture
Unfortunately, many vendor certifications fuel the notion that cloud archi-
tecture is mostly about selecting services and memorizing the respective
features. It feels to me a bit like becoming a certified Lego artist by
managing to recite all colors and shapes of Lego bricks (do they make
a blue 1x7?).
So, let’s look under the covers and take a true architect’s point of view.
Yes, this will include popular concepts like multi-hybrid and hybrid-
multi-cloud, but perhaps not in the way it’s described in the marketing
brochures. There is no “best” architecture, just the one that’s most suit-
able for your situation and your objectives. Hence, defining your cloud
architecture requires a fair amount of thinking - something that you
definitely should not outsource¹⁰. That thinking is helped by decision
models presented in this chapter.

The Architect Elevator Approach


The Architect Elevator¹¹ defines a role model of an architect who can con-
nect the business strategy in the corporate penthouse with the technical
reality in the engine room. Instead of simply promising benefits, when
such an architect looks at a collection of vendor products, they reverse
engineer the key assumptions, constraints, and decisions behind those
offerings. They will then map that insight to the enterprise’s context and
balance the trade-offs of putting these products together into a concrete
solution.
Classic IT is built on the assumption that technical implementation
decisions are derived from a clear definition of business needs, making
architecture a one-way street. Cloud turns this assumption, like many
others, on its head, favoring high-level decision makers who understand
the ramifications of technical choices made in the engine room. After
all, those technical decisions are the critical enablers for the enterprise’s
ability to innovate and compete in the market. Hence, it is the elevator
architect’s role not only to make better decisions but also to communicate
them transparently to upper management. Clear decision models that skip
jargon and buzzwords can be an extremely useful tool in this context.
¹⁰https://architectelevator.com/strategy/dont-outsource-thinking/
¹¹http://architectelevator.com

Decision Models
One could fill a whole book on cloud architecture, and several people have
done so (I was lucky enough to write the foreword for Cloud Computing
Patterns¹²). Cloud service providers have added more architecture guidance (e.g. Azure maintains a nice site with cloud architecture patterns¹³). This part is not looking to repeat what's already been written. Rather, it invites you to pick a novel point of view that shuns the buzzwords and focuses on meaningful decisions. To help you become a better informed and
more disciplined decision maker, several decision models and frameworks
guide you through major decision points along your cloud journey:

• There are many flavors of Multi-cloud and you should choose


carefully which one is best for you.
• Hybrid Cloud requires you to separate your workloads into cloud
and on-premises. Knowing your options helps you choose the best
path.
• Architects like to look under the covers, so here’s how different
vendors architect their hybrid cloud solutions.
• Many architects see their main job as battling lock-in. But life’s not
that simple: don’t get locked up into avoiding lock-in!
• Cloud changes many past assumptions that drove popular architecture styles. Hence, we may see the end of multi-tenancy.
• Architects concern themselves with non-functional requirements,
also known as “ilities”. Cloud brings us a new -ility: Disposability,
and in an environmentally conscious fashion.

¹²Fehling, Leymann, Retter, Schupeck, Arbiter, Cloud Computing Patterns, Springer 2014
¹³https://docs.microsoft.com/en-us/azure/architecture/patterns/
16. Multi-cloud: You’ve Got
Options
But options don’t come for free.

While most enterprises are still busily migrating existing applications to


the cloud or perhaps building new cloud-ready applications, the product
marketing departments haven’t been sitting idle, making it difficult to
escape slogans like multi-hybrid-cloud computing. Or perhaps it was
hybrid-multi? I am not sure myself.
Are enterprises already falling behind before even finishing their migration? Should they skip going to one cloud and strive straight for multi-
cloud nirvana? Do people actually mean the same thing when they talk
about multi-cloud?
It’s time to bust the buzzword and bring things back to earth and
to business value. We'll find that there are quite a few architectural considerations behind it that help us make better decisions.

Multi-Hybrid Split
The core premise of a multi-hybrid cloud approach sounds appealing
enough: your workloads can move from your premises to the cloud and
back and even between different clouds whenever needed; and all that
ostensibly with not much more than the push of a button. Architects are
born skeptics and thus inclined (and paid) to take a look under the covers
to better understand the constraints, costs, and benefits of such solutions.
The first step in dissecting the buzzwords is to split the multi-hybrid
combo-buzzword into two, separating hybrid from multi. Each has different driving forces behind it, so it's best to try simple definitions:

• Hybrid: splitting workload(s) across the cloud and on-premises.


Generally, these workloads pertain to a single application, meaning
the piece in the cloud and the piece remaining on-premises interact
to do something useful.
• Multi: running workloads with more than one cloud provider.

Hybrid and Multi Cloud

As simple as these may seem, there tends to be a reasonable amount


of confusion around the terms. Some folks want us to think that multi
and hybrid are in fact very similar (“on premises is just another cloud”)
whereas others (including myself) cite the very different constraints of
operating on premises versus the public cloud.
One major difference is that hybrid cloud is a given for most enterprises,
at least during the transition, whereas a multi-cloud strategy is an explicit architecture choice you make. Many enterprises are very successful
running on a single cloud. They reduce cost along the way, for example
by minimizing the skill sets they need and getting volume discounts.
Therefore, it’s good to understand what multi choices you have and the
trade-offs that are involved. So, it’s time for another evocative decision
framework that filters out the noise and amplifies the decision signals.

Multi-Hybrid = Hybrid Multi?


Some vendors, for example Google Cloud, consider on-premises to be
part of multi-cloud¹: “A multi-cloud setup might also include private
computing environments.” While technically the two are related (“on-
prem is just another data center”), as architects we understand that on-
premises operates under a completely different set of assumptions. So, for
our discussion we treat multi and hybrid as two different avenues.
¹https://cloud.google.com/solutions/hybrid-and-multi-cloud-patterns-and-practices

Multi-cloud options
Taking an architecture point of view on multi cloud means identifying
common approaches, the motivations behind each, and the architectural
trade-offs involved. To get there, it’s best to first take a step back from
the technical platform and examine common scenarios and the value
they yield. I have observed (and participated in) quite a few different
setups that would fall under the general label of “multi-cloud”. Based on
my experience they can be broken down into the following five distinct
scenarios:

Multi-cloud Architecture Styles

1. Arbitrary: Workloads are in more than one cloud but for no particular reason.
2. Segmented: Different clouds are used for different purposes.
3. Choice: Projects (or business units) have a choice of cloud provider.
4. Parallel: Single applications are deployed to multiple clouds, for
example to achieve higher levels of availability.
5. Portable: Workloads can be moved between clouds at will.

A higher number in this list isn’t necessarily better - each option has its
advantages and limitations. Rather, it’s about finding the approach that
best suits your needs and making a conscious choice. The biggest mistake
could be choosing an option that provides capabilities that aren’t needed.
Because each option has a cost, as we will soon see.

Multi-cloud architecture isn’t a simple one-size fits all


decision. The most common mistake is choosing an option
that’s more complex than what’s needed for the business
to succeed.

Breaking down multi cloud into multiple flavors that each identify the
drivers and benefits is also a nice example of how elevator architects see
shades of gray where many others only see black or white. Referring
to five distinct options and elaborating them with simple vocabulary
enables an in-depth conversation that largely avoids technical jargon and
is therefore suitable to a wide audience. That’s what the Architect Elevator
is all about.

Multi-Cloud Scenarios
Let’s look at each of the five ways of doing multi-cloud individually, with
a particularly keen eye on the key capabilities it brings and what things
we should watch out for. At the end we’ll summarize what we learned in
a decision table.

Arbitrary

Arbitrary: Some stuff in any cloud



If enterprise IT has taught us one thing, it's likely that reality rarely lives up
to the slide decks. Applying this line of reasoning (and the usual dosage
of cynicism) to multi-cloud, we find that a huge percentage of enterprise
multi-cloud isn’t the result of divine architectural foresight but simply
poor governance and excessive vendor influence.
This flavor of multi-cloud means that you have some workloads running
in the orange cloud, some others in the light blue cloud, and a few more
under the rainbow. You don’t have much of an idea why things are in one
cloud or the other. In many cases enterprises started with orange, then
received a huge service credit from light blue thanks to existing license
agreements, and finally some of the cool kids loved the rainbow stuff so
much that they ignored the corporate standard. And if you look carefully,
you may see some red peeking in due to existing relationships.
Strategy isn’t exactly the word to be used for this type of multi-cloud
setup. It’s not all bad, though: at least you are deploying something to the
cloud! That’s a good thing because before you can steer you first have
to move². So, at least you’re moving. You’re gathering experience and
building skills with multiple technology platforms, which you can use to
settle on the provider that best meets your need. So, while arbitrary isn’t
a viable target picture, it’s a common starting point.

Segmented

Segmented: Different clouds for different needs

Segmenting workloads across different clouds is also common, and a good


step ahead: you deploy specific types of workload to specific clouds.
²https://www.linkedin.com/pulse/before-you-can-steer-first-have-move-gregor-
hohpe/

Companies often follow this scenario to benefit from different vendors’


strengths for different kinds of workloads. Different licensing models may
also lead you to favor different vendors for different workloads. A
common incarnation of the segmented scenario is to have most workloads
in orange, Windows-related workloads on light blue, and ML/analytics on
rainbow.
You may decide to segregate cloud providers by several factors:

• Type of workload (legacy vs. modern)


• Type of data (confidential vs. open)
• Type of product (compute vs. data analytics vs. collaboration soft-
ware)

When pursuing this approach, it's helpful to understand the seams between your applications so you don't incur excessive egress charges because half your application ends up on one side and the other half on the other. Also, keep in mind that vendor capabilities are rapidly shifting,
especially in segments like machine learning or artificial intelligence.
Snapshot comparisons therefore aren’t particularly meaningful and may
unwittingly lead you into this scenario just to find out six months later
that your preferred vendor is offering comparable functionality.
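To get a feel for the egress-charge concern mentioned above, here is a tiny back-of-the-envelope calculation. The traffic volume and per-gigabyte rate are made-up, illustrative numbers; actual pricing varies by provider, region, and volume tier.

```python
# Back-of-the-envelope egress cost if an application's halves end up in two
# different clouds (illustrative numbers, not a quote).
monthly_cross_traffic_gb = 5_000      # data flowing between the two halves
egress_rate_usd_per_gb = 0.09         # ballpark internet egress list price

monthly_cost = monthly_cross_traffic_gb * egress_rate_usd_per_gb
print(f"~${monthly_cost:,.0f} per month just for crossing the seam")  # ~$450
```

Even modest traffic across a badly chosen seam adds a recurring bill that a better split would have avoided.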

Vendor pushes can make you slip from segmented back into
arbitrary.

Also, I have observed enterprises slipping from segmented back into


arbitrary as a result of sales teams using a foothold to grow their slice
of the pie. You might initially use cloud vendor X for a specific type of
service feature, such as pre-trained machine learning. Sure enough their
(pre-)sales folks will try to convince teams to use their other services as
well - that’s their job, after all. It’s therefore important to exercise decision
discipline and stay away from resume-driven architecture. If you don’t,
you end up in situations like running 95% of your workload on ECS in
Singapore and a few percent on AppEngine in Tokyo - a real example,
which duplicates skill sets needed for the same use case.

Choice

Choice: Freedom to pick

Many might not consider the first two examples as true multi cloud. What
they are looking for (and pitching) is being able to deploy your workloads
freely across cloud providers, thus minimizing lock-in (or the perception
thereof), usually by means of building abstraction layers or governance
frameworks. Architects look behind this ploy by considering the finality
of the cloud decision. For example, should you be able to change your mind
after your initial choice and, if so, how easy do you expect the switch to
be?
The least complex and most common case is giving your developers an
initial choice of cloud provider but not expecting them to keep changing
their mind. This choice scenario is common in large organizations with
shared IT service units. Central IT is generally expected to support a wide
range of business units and their respective IT preferences. Freedom of
choice might also result from the desire to remain neutral, such as in public
sector, or a regulatory guideline to avoid placing “all eggs in one basket”,
for example in financial services or similar critical services.
A choice setup typically has central IT manage the commercial relationships with the cloud providers. Some IT departments also develop
a common tool set to create cloud provider account instances to assure
central spend tracking and corporate governance.
The advantage of this setup is that projects are free to use proprietary
cloud services, such as managed databases, if it matches their preferred
trade-off between avoiding lock-in and minimizing operational overhead.
As a result, business units get an unencumbered, dare I say native, cloud
experience. Hence, this setup makes a good initial step for multi-cloud.

Parallel

Parallel: A single app deployed multiple times

While the previous option gives you a choice among cloud service
providers, you are still bound by the service level of a single provider.
Many enterprises are looking to deploy critical applications across multiple clouds in their pursuit of higher levels of availability than they could achieve with a single provider.
Being able to deploy the same application in parallel to multiple clouds requires a certain degree of decoupling from the cloud provider's proprietary
features. This can be achieved in a number of ways, for example:

• Managing cloud-specific functions such as identity management,


deployment automation, or monitoring separately for each cloud,
isolating them from the core application code through interfaces or
pluggable modules.
• Maintaining two branches for those components of your application
that are cloud provider specific and wrap them behind a common
interface. For example, you could have a common interface for
block data storage.
• Using open source components because they will generally run
on any cloud. While this works relatively well for pure compute
(hosted Kubernetes is available on most clouds), it may reduce your
ability to take advantage of other fully managed services, such as
data stores or monitoring. Because managed services are one of the
key benefits of moving to the cloud in the first place, this is an option that needs careful consideration.
• Utilizing a multi-cloud abstraction framework, so you can develop once and deploy to any cloud without having to deal with any cloud specifics. However, such an abstraction layer might prevent you from benefiting from a particular cloud's unique offering, potentially weakening your solution or increasing cost.

While absorbing differences inside your code base might sound kludgy,
it’s what ORM frameworks have been successfully doing for relational
databases for over a decade.
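To make the common-interface idea from the list above concrete, here is a minimal sketch of a blob storage interface with two provider-specific implementations. The class names and bucket handling are hypothetical, and the SDK calls reflect the commonly used boto3 and google-cloud-storage APIs; treat it as a sketch, not a complete storage layer (error handling, pagination, and metadata are omitted).

```python
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """Hypothetical interface that isolates cloud-specific storage details."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class S3BlobStore(BlobStore):
    def __init__(self, bucket: str):
        import boto3  # imported lazily so the module loads without the SDK installed
        self._s3 = boto3.client("s3")
        self._bucket = bucket

    def put(self, key: str, data: bytes) -> None:
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()


class GcsBlobStore(BlobStore):
    def __init__(self, bucket: str):
        from google.cloud import storage
        self._bucket = storage.Client().bucket(bucket)

    def put(self, key: str, data: bytes) -> None:
        self._bucket.blob(key).upload_from_string(data)

    def get(self, key: str) -> bytes:
        return self._bucket.blob(key).download_as_bytes()


# The application depends only on BlobStore; which implementation gets wired in
# is a deployment-time decision, e.g. driven by configuration.
```

The cloud-specific branches stay small and swappable, while the bulk of the application remains identical across both deployments.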
The critical aspect to watch out for is complexity, which can easily undo
the anticipated uptime gain. Additional layers of abstraction and more
tooling also increase the chance of a misconfiguration, which causes
unplanned downtime. I have seen vendors suggesting designs that deploy
across each vendor’s three availability zones, plus a disaster recovery
environment in each, times three cloud providers. With each component
occupying 3 * 2 * 3 = 18 nodes, I’d be skeptical whether this amount of
machinery really gives you higher availability than using 9 nodes (one
per zone and per cloud provider).
Second, seeking harmonization across both deployments may not be
what’s actually desired. The higher the degree of commonality across
clouds, the higher the chance that you’ll deploy a broken application
or encounter issues on both clouds, undoing the resilience benefit. The
extreme examples are space probes or similar systems that require ultra-high reliability: they use two separate development teams to avoid any form of commonality.

Higher degrees of harmonization across providers increase


the chance of a common error, undoing potential increases
in system uptime.

So, when designing for availability, keep in mind that the cloud provider’s
platform isn't the only outage scenario - human error and application software issues (bugs or run-time issues such as memory leaks, overflowing
queues etc.) can be a bigger contributor to outages.
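To see why independence matters more than sheer redundancy, here is a back-of-the-envelope calculation with made-up availability figures; it's an illustration of how a shared failure mode dominates the math, not a sizing formula.

```python
# Illustrative availability math for a parallel deployment across two clouds.
single = 0.999                               # 99.9% per cloud (~8.8 h/year down)

# Truly independent deployments: an outage requires both to fail at once.
independent = 1 - (1 - single) ** 2          # 0.999999 ("six nines")

# A shared failure mode, e.g. the same bad release pushed to both clouds,
# that alone causes 0.05% downtime caps what the redundancy can buy.
shared_failure = 0.0005
with_common_cause = (1 - shared_failure) * independent   # roughly 0.9995

print(f"independent: {independent:.6f}, with shared failure: {with_common_cause:.6f}")
```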

Portable

Portable: Shifting at will

The perceived pinnacle of multi-cloud is free portability across clouds,


meaning you can deploy your workloads anywhere and also move them as
you please. The advantages are easy to grasp: you can avoid vendor lock-
in, which for example gives you negotiation power. You can also move
applications based on resource needs. For example, you may run normal
operations in one cloud and burst excessive traffic into another.
The core mechanisms that enable this capability are high levels of automation and abstraction away from cloud services. While for parallel
deployments you could get away with a semi-manual setup or deployment
process, full portability requires you to be able to shift the workload any
time, so everything better be fully automated.
Multi-cloud abstraction frameworks promise this capability. However,
nothing is ever free, so the cost comes in the form of complexity, lock-in to
a specific framework, restriction to specific application architectures (e.g.,
containers) and platform underutilization (see Lock-in).
Also, most such abstractions generally don’t take care of your data: if you
shift your compute nodes across providers willy-nilly, how are you going
to keep your data in sync? And if you manage to overcome this hurdle,
egress data costs may come to nip you in the rear. So, although this option
looks great on paper (or PowerPoint), it involves significant trade-offs.

Chasing Shiny Objects Makes You Blind


When chasing shiny objects, we can easily fall into the trap of thinking
that the shinier, the better. Those with enterprise battle scars know all
too well that polishing objects to become ever more shiny comes at a
cost. Dollar cost is the apparent concern, but you also need to factor in
additional complexity, having to manage multiple vendors, finding the
right skill set, and long-term viability (will we ditch all this container stuff
and go serverless?). Those factors can’t be solved with money.
It's therefore paramount to understand and clearly communicate your primary objective. Are you looking at multi-cloud so you can better negotiate
with vendors, to increase your availability, or to support deploying in
regions where only one provider or the other may have a data center?
Remember that "avoiding lock-in" is only a meta-goal, which, while
architecturally desirable, needs to be justified by a tangible benefit.

Multi-cloud ≠ Uniform Cloud


When advising enterprise IT departments on a multi-cloud strategy, I routinely recommend that they not attempt to build a uniform cloud experience
across each provider. Each cloud provider has specific strengths, in their
product offering but also in their product strategy and corporate culture.
Attempting to make all clouds look the same does not actually benefit your
internal customers but incurs a heavy burden, for example because they
won’t be able to use an inexpensive managed service from cloud provider
X. Or they might be working with internal staff or an external vendor
that’s familiar with the original cloud but not with the abstraction layer
woven over it. I call this the Esperanto effect: yes, it’d be nice if we all
spoke one universal language. However, that means we all have to learn
yet one more language and many of us speak English already.

Choosing Wisely
The following summarizes the multi-cloud choices, their main drivers, and the side-effects to watch out for:

• Arbitrary. Key capability: deploying applications to the cloud. Key mechanism: cloud skills. Considerations: lack of governance; network traffic cost.
• Segmented. Key capability: clear guidance on cloud usage. Key mechanism: governance. Consideration: drifting back to "Arbitrary".
• Choice. Key capability: supporting project needs and preferences; reduced lock-in. Key mechanism: a common framework for provisioning, billing, and governance. Considerations: yet another layer of abstraction; lack of guidance.
• Parallel. Key capability: higher availability than a single cloud. Key mechanisms: automation, abstraction, load balancing. Considerations: complexity; under-utilization of cloud services.
• Portable. Key capability: shifting workloads as you please. Key mechanisms: full automation, abstraction, data portability. Considerations: complexity; lock-in into multi-cloud frameworks; underutilization.
As expected: TANSTAAFL - there ain't no such thing as a free lunch.
Architecture is the business of trade-offs. Therefore, it’s important to
break down the options, give them meaningful names, and understand
their implications. Armed with these tools, you can happily chart your
course to hybrid-multi-cloud enlightenment.
17. Hybrid Cloud: Slicing the
Elephant
Enterprises can’t avoid hybrid cloud. But there are multiple ways
to get there.

Hybrid cloud is a reality for enterprises. A sound cloud strategy goes


beyond the buzzword, though, and makes a conscious choice of what to
keep on premises and what to move into the cloud. Again, it turns out
that simple but evocative decision models help Elevator Architects make
better, and more conscious decisions on their journey to the cloud.

Hybrid is a Reality. Multi is an option.


Unlike multi cloud, hybrid cloud architectures are a given in any enterprise cloud scenario, at least as a transitional state. The reason is simple:
despite magic things like AWS Snowmobile¹ you’re not going to find all
your workloads magically removed from your data center one day - some
applications will already be in the cloud while others are still on premises.

“No CIO will wake up one morning to find all of their


workloads in the cloud.”

And once you split workloads across cloud and on premises, more likely
than not, those two pieces will need to interact, as captured in our
definition from the previous chapter:

Hybrid cloud architectures split workloads across the cloud


and on-premises environments. Generally, these workloads
interact to do something useful.
¹https://aws.amazon.com/snowmobile/

Accordingly, the core of a hybrid cloud strategy isn’t when but how,
specifically how to slice, meaning which workloads should move out into
the cloud and which ones remain on premises. That decision calls for some
architectural thinking, aided by taking the Architect Elevator a few floors
down.

Two isolated environments don’t make


a hybrid

First, we want to sharpen our view of what makes a “hybrid”. Having some
workloads in the cloud while others remain on your premises sadly means
that you have added yet another computing environment to your existing
portfolio - not a great result.

I haven’t met a CIO who’s looking for more complexity. So,


don't make your cloud yet another data center.

Therefore, unified management across the environments is an essential element of calling something hybrid. Consider the analogy of a hybrid
car:

The difficult part in making a hybrid car wasn’t sticking a battery and
an electric motor into a petrol-powered car. Getting the two systems
to work seamlessly and harmoniously was the critical innovation.
That’s what made the Toyota Prius a true hybrid and such a success.

Because the same is true for hybrid clouds, we’ll augment our definition:

Hybrid cloud architectures split workloads across the cloud


and on-premises environments, which are managed in a
unified way. Generally, these workloads interact to do
something useful.

Hybrid Cloud Unifies Management

With this clarification, let’s come back to cloud strategy and see what
options we have for slicing workloads.

Hybrid splits - 31 flavors?


The key architectural decision of where to split your workloads across on
premises and the cloud is related to the notion of finding seams in an IT
system, first described (to my knowledge) by Michael Feathers in his classic
book Working Effectively with Legacy Code². Seams are those places where
a split doesn’t cross too many dependencies. A good seam minimizes
rework, such as introducing new APIs, and avoids run-time problems,
such as degraded performance.
I don’t think there are quite 31 ways to split workloads. Still, cataloging
them is useful in several ways:

• Improved Transparency: A common vocabulary gives decision


transparency because it’s easy to communicate which option(s) you
chose and why.
• More Choice: The list allows you to check whether you might
have overlooked some options. Especially in environments where
regulatory concerns and data residency requirements still limit
some cloud migrations, a list of options allows you to see whether
you exhausted all possible options.
²Feathers, Working Effectively With Legacy Code, 2004, Pearson

• Lower Risk: Lastly, the list helps you consider applicability, pros,
and cons of each option consistently. It can also tell you what to watch out for when selecting a specific option.

Therefore, our decision model once again helps us make better decisions
and communicate them such that everyone understands which route we
take and why.

Ways to Slice the Cloud Elephant


I have seen enterprises split their application portfolios in at least eight
common ways, divided into three categories of applications, data, and
operational considerations:

Eight Ways to Split a Hybrid Cloud

This list's main purpose isn't so much to find new options that no one has ever thought of - you might have seen more. Rather, it's a collection of well-known options that makes for a useful sanity check. Let's have a
look at each approach in more detail.

Tier: Front vs. Back

Separating by tier

Moving customer- or ecosystem-facing components, aka "front-end systems" / "systems of engagement", to the cloud while keeping back-end systems on premises is a common strategy. While you won't see me celebrating
two-speed architectures (see Architects Live in the First Derivative in
The Software Architect Elevator³), multiple factors make this approach a
natural fit:

• Front-end components are exposed to the internet, so routing traffic


straight from the internet to the cloud reduces latency and eases
traffic congestion on your corporate network.
• The wild traffic swings commonly experienced by front-end com-
ponents make elastic scaling particularly valuable.
• Front-end systems are more likely to be developed using modern
tool chains and architectures, which make them well suited for the
cloud.
• Front-end systems are less likely to store personally identifiable
or confidential information because that data is usually passed
through to back-ends. In environments where enterprise policies
restrict such data from being stored in the cloud, front-end systems
are good candidates for moving to the cloud.

Besides these natural advantages, architecture wouldn’t be interesting if


there weren’t some things to watch out for:

• Even if the front-end takes advantage of elastic scaling, most back-ends won't be able to absorb the corresponding load spikes. So, unless you employ caching or queuing or can move computationally expensive tasks to the cloud, you may not improve your application's end-to-end scalability.
³Hohpe, The Software Architect Elevator, 2020, O'Reilly
• Similarly, if your application suffers from reliability issues, moving
just one half to the cloud is unlikely to make those go away.
• Many front-ends are “chatty” when communicating with back-
end systems or specifically databases because they were built for
environments where the two sit very close to each other. One front-
end request can issue dozens or sometimes hundreds of requests to
the back-end. Splitting such a chatty channel between cloud and
on-premises is likely to increase end-user latency.

Although the excitement of “two speed IT” wore off, this approach can
still be a reasonable intermediary step. Just don’t mistake it for a long-
term strategy.
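To put a rough number on the chattiness concern from the list above, consider a simple back-of-the-envelope estimate. All figures are made up; the point is that per-call latency gets multiplied by the number of calls crossing the new seam.

```python
# Rough latency estimate for a chatty front-end split from its back-end.
backend_calls_per_request = 80   # back-end calls triggered by one user request
same_dc_roundtrip_ms = 1         # both tiers in the same data center
cross_site_roundtrip_ms = 20     # front-end in the cloud, back-end on premises

before = backend_calls_per_request * same_dc_roundtrip_ms    # ~80 ms
after = backend_calls_per_request * cross_site_roundtrip_ms  # ~1,600 ms

print(f"added latency per user request: {after - before} ms")
```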

Generation: New vs. Old

Separating by generation

Closely related is the split by new vs. old. As explained above, newer
systems are more likely to feel at home in the cloud because they are
generally designed to be scalable and independently deployable. They’re
also usually smaller and hence benefit more from lean cloud platforms
such as serverless.
Again, several reasons make this type of split compelling:

• Modern components are more likely to use modern tool chains and
architectures, such as microservices, continuous integration (CI),
automated tests and deployment, etc. Hence they can take better
advantage of cloud platforms.

• Modern components often run in containers and can thus utilize


higher-level cloud services like managed container orchestration.
• Modern components are usually better understood and tested,
reducing the migration risk.

Again, there are a few things to consider:

• Splitting by new vs old may not align well with the existing systems
architecture, meaning it isn’t hitting a good seam. For example,
splitting highly cohesive components across data center boundaries
will likely result in poor performance or increased traffic cost.
• Unless you are in the process of replacing all components over time,
this strategy doesn’t lead to a “100% in the cloud” outcome.
• “New” is a rather fuzzy term, so you’d want to precisely define the
split for your workloads.

Following this approach, you’d want your cloud to be application-centric.


Overall this is a good approach if you are building new systems -
ultimately new will replace old.

Criticality: Non-critical vs. Critical

Separating by criticality

Many enterprises prefer to first dip their toe into the cloud water before
going all out (or “all in” as the providers would see it). They are likely
to try the cloud with a few simple applications that can accommodate
the initial learning curve and the inevitable mistakes. “Failing fast” and
learning along the way makes sense:

• Skill set availability is one of the main inhibitors of moving to


the cloud - stuff is never as easy as shown in the vendor demos.
Therefore, starting small and getting rapid feedback builds much
needed skills in a low-risk environment.
• Smaller applications benefit from the cloud self-service approach
because it reduces or eliminates the fixed overhead common in on-
premises IT.
• Small applications also give you timely feedback and allow you to
calibrate your expectations.

While moving non-critical applications out to the cloud is a good start, it


also has some limitations:

• Cloud providers generally offer better uptime and security than


on-premises environments, so you’ll likely gain more by moving
critical workloads.
• Simple applications typically don’t have the same security, up-time,
and scalability requirements as subsequent and more substantive
workloads. You therefore might underestimate the effort of a full
migration.
• While you can learn a lot from moving simple applications, the
initial direct ROI of the migration might be low. It’s good to set
the expectations accordingly.

Overall it's a good "dip your toe into the water" tactic that certainly beats
a “let’s wait until all this settles” approach.

Lifecycle: Development vs. Production

Separating by lifecycle

There are many ways to slice the proverbial elephant (eggplant for
vegetarians like me). Rather than just considering splits across run-
time components, a whole different approach is to split by application
lifecycle. For example, you can run your tool chain, test, and staging
environments in the cloud while keeping production on premises, perhaps because regulatory constraints require you to do so.
Shifting build and test environments into the cloud is a good idea for
several reasons:

• Build and test environments rarely contain real customer data, so


most concerns around data privacy and locality don’t apply.
• Development, test, and build environments are temporary in nature,
so being able to set them up when needed and tearing them back
down leverages the elasticity of the cloud and can cut infrastructure
costs by a factor of 3 or more: the core working hours make up less
than a third of a week’s 168 hours. Functional test environments
may even have shorter duty cycles, depending on how quickly they
can spin up.
• Many build systems are failure tolerant, so you can shave off even
more dollars by using preemptible⁴ / spot⁵ compute instances.
• Build tool chains are generally closely monitored by well-skilled
development teams who can quickly correct minor issues.

You might have guessed that architecture is the business of trade-offs, so


once again we have a few things to watch out for:

• Splitting your software lifecycle across cloud and on-premises


results in a test environment that’s different from production. This
is risky as you might not detect performance bottlenecks or subtle
differences that can cause bugs to surface only in production.
• If your build chain generates large artifacts you may face delays or
egress charges when deploying those to your premises.
• Your tool chain can also be an attack vector for someone to inject
malware or Trojans. You therefore have to protect your cloud tool
chain well.
⁴https://cloud.google.com/preemptible-vms/
⁵https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html

This option may be useful for development teams that are restricted from
running production workloads in the cloud but still want to utilize the
cloud.
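The factor-of-3 estimate above is simple duty-cycle arithmetic. Here it is spelled out with an assumed working pattern; plug in your own hours (and add spot or preemptible discounts on top) to see where you land.

```python
# Duty-cycle estimate for tearing down dev/test environments outside work hours.
hours_per_week = 7 * 24          # 168 hours in a week
environment_up_hours = 5 * 10    # up 10 hours a day, 5 days a week (assumed)

always_on = hours_per_week       # relative cost: 1 unit per instance-hour
on_demand = environment_up_hours

print(f"savings factor: {always_on / on_demand:.1f}x")   # ~3.4x before discounts
```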

Data Classification: Non-sensitive vs. Sensitive

Separating by data classification

The compute aspect of hybrid cloud is comparatively simple because


code can easily be re-deployed into different environments. Data is a
different story: data largely resides in one place and needs to be migrated
or synchronized if it’s meant to go somewhere else. If you move your
compute and keep the data on premises, you’ll likely incur a heavy
performance penalty. Therefore, it’s the data that often prevents the move
to the cloud.
As data classification can be the hurdle for moving apps to the cloud,
a natural split would be to move non-sensitive data to the cloud while
keeping sensitive data on premises. Doing so has some distinct advantages:

• It complies with common internal regulations related to data residency as sensitive data will remain on your premises and isn't replicated into other regions.
• It limits your exposure in case of a possible cloud data breach. While
keeping data on premises won’t guarantee that no breach occurs,
ironically the reputational damage of a data breach on premises can
be lower than a breach that's happening in the cloud. In the latter case, the organization can be accused of having given their customers' data into someone else's hands, which can be perceived as careless.
Oddly enough, incompetence (or sheer bad luck) on premises is
more easily shrugged away because “that’s the way everyone does
it.”

However, the approach is rooted in assumptions that don’t necessarily


hold true for today’s computing environments:

• Protecting on-premises environments from sophisticated attacks


and low-level exploits has become a difficult proposition. For example, CPU-level exploits like Spectre and Meltdown were corrected
by some cloud providers before they were announced - something
that couldn’t be done on premises.
• Malicious data access and exploits don’t always access data directly
but often go through many “hops”. Having some systems in the
cloud while your sensitive data remains on premises doesn’t auto-
matically mean your data is secure. For example, one of the biggest
data breaches, the Equifax Data Breach⁶, which affected up to 145 million customer records, involved data stored on premises.

Hence, while this approach may be suitable for getting started in the face


of regulatory or policy constraints, use it with caution as a long-term
strategy.

Data Freshness: Back-up vs. Operational

Separating by data freshness

Not all your data is accessed by applications all the time. In fact, a vast
majority of enterprise data is “cold”, meaning it’s rarely accessed. Cloud
providers offer appealing options for data that’s rarely accessed, such as
historical records or back-ups. Amazon’s Glacier⁷ was one of the earliest
offerings to specifically target that use case. Other providers have special
⁶https://en.wikipedia.org/wiki/Equifax#May%E2%80%93July_2017_data_breach
⁷https://aws.amazon.com/glacier/

archive tiers for their storage service, e.g. Azure Storage Archive Tier⁸ and
GCP Coldline Cloud Storage⁹.
Archiving data in the cloud makes good sense:

• Backing up and restoring data usually occurs separate from regular


operations, so it allows you to take advantage of the cloud without
having to migrate or re-architect applications.
• For older applications, data storage cost can make up a significant
percentage of the overall operational costs, so storing it in the cloud
can give you instant cost savings.
• For archiving purposes, having data in a location separate from your
usual data center increases the resilience against disaster scenarios,
such as a data center fire.

Alas, this approach also has limitations:

• Data recovery costs can be high, so you’d only want to move data
that you rarely need.
• The data you’d want to back up likely contains customer or other
proprietary data, so you might need to encrypt or otherwise protect
that data, which can interfere with optimizations such as data de-
duplication.

Still, using cloud backup is a good way for enterprises to quickly start
benefiting from the cloud.
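As a concrete illustration, here is a minimal sketch of pushing an (already encrypted) backup file into an S3 archive storage class using boto3. The bucket, key scheme, and file name are placeholders; the other providers' archive tiers mentioned above work along the same lines, and lifecycle rules are a common alternative to setting the storage class per object.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder names: adapt the bucket, key scheme, and file to your setup.
with open("backup-2020-08-01.tar.gz.enc", "rb") as backup_file:
    s3.put_object(
        Bucket="corp-offsite-backups",
        Key="erp/2020/08/backup-2020-08-01.tar.gz.enc",
        Body=backup_file,
        StorageClass="DEEP_ARCHIVE",    # cheap to store, slow and costly to restore
        ServerSideEncryption="AES256",  # encrypt at rest, on top of client-side encryption
    )
```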

Operational State: Disaster vs. BAU

Separating by operational state

⁸https://azure.microsoft.com/en-us/services/storage/archive/
⁹https://cloud.google.com/storage/docs/storage-classes

One can also split workloads by the nature of the system's operational
state. The most apparent division in operational state is whether things are
going well (“business as usual”) or whether the main compute resources
are unavailable (“disaster”). While you or your regulator may have a
strong preference for running systems on premises, during a data center
outage you might find a system running in the cloud preferable over none
running at all. This reasoning can apply, for example, to critical systems
like payment infrastructure.
This approach to slicing workload is slightly different from the others as
the same workload would be able to run in both environments, but under
different circumstances:

• No workloads run in the cloud under normal operational conditions, meeting potential regulatory requirements.
• Cloud usage is temporary, making good use of the cloud’s elastic
billing approach. The cost of the cloud environment is low during
normal operations.

However, life isn’t quite that simple. There are drawbacks as well:

• To run emergency operations from the cloud, you need to have your
data synchronized into a location that’s both accessible from the
cloud and unaffected by the outage scenario that you are planning
for. More likely than not, that location would be the cloud, at which
point you should consider running the system from the cloud in any
case.
• Using the cloud just for emergency situations deprives you of the
benefits of operating in the cloud in the vast majority of the cases.
• You might not be able to spool up the cloud environment on a
moment’s notice. In that case you’d need to have the environment
on standby, which will rack up charges with little benefit.

So, this approach may be suitable mainly for compute tasks that don’t
require a lot of data. For example, for an on-premises e-commerce site
that faces an outage you could allow customers to place orders by picking
from a (public) catalog and storing orders until the main system comes
back up. It’ll likely beat a cute HTTP 500 page.
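A minimal sketch of that order-parking idea, assuming a managed cloud queue (SQS is used here purely as an example; the queue URL and order format are placeholders): orders are captured while the main system is down and replayed once it is back.

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/pending-orders"  # placeholder


def capture_order(order: dict) -> None:
    """Called while the on-premises system is unavailable."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(order))


def drain_orders(process) -> None:
    """Called once the main system is back: replay the parked orders."""
    while True:
        response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        messages = response.get("Messages", [])
        if not messages:
            break
        for message in messages:
            process(json.loads(message["Body"]))
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```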

Workload Demand: Burst vs. Normal Operations

Separating by workload demand

Last, but not least, we may not have to conjure an outright emergency
to occasionally shift some workloads into the cloud while keeping them on premises under normal circumstances. An approach that used to be frequently cited is bursting into the cloud, meaning you keep a fixed capacity on premises and temporarily extend into the cloud when additional
capacity is needed.
Again, you gain some desirable benefits:

• The cloud’s elastic billing model is ideal for this case as short bursts
are going to be relatively cheap.
• You retain the comfort of operating on premises for most of the time
and are able to utilize your already paid for hardware.

But there are reasons that people don’t talk about this option quite as much
anymore:

• Again, you need access to data from the cloud instances. That
means you either replicate the data to the cloud, which is likely
what you tried to avoid in the first place. Alternatively, your cloud
instances need to operate on data that’s kept on premises, which
likely implies major latency penalties and can incur significant data
access cost.
• You need to have an application architecture that can run on
premises and in the cloud simultaneously under heavy load. That’s
not a small feat and means significant engineering effort for a rare
case.

Rather than bursting regular workloads into the cloud, this option works
best for dedicated compute-intensive tasks, such as the simulations regularly run in financial institutions. It's also a great fit for one-time compute tasks where you rig up machinery in the cloud just once. For example, this worked well for the New York Times digitizing 6 million photos¹⁰ or for calculating pi to 31 trillion digits¹¹. To be fair, though, these cases had no on-premises equivalent, so they likely wouldn't qualify as a hybrid setup.

Putting it into Practice


For most enterprises hybrid cloud is inevitable, so you better have a good
plan for how to make the best use of it. Because there are quite a few
options to slice your workloads across the cloud and your on-premises
compute environments, cataloging these options helps you make well-
thought-out and clearly communicated decisions. It also keeps you from
wanting to proclaim that “one size fits all” and will lead you to a more
nuanced strategy.
Happy slicing and dicing!
¹⁰https://www.nytimes.com/2018/11/10/reader-center/past-tense-photos-history-
morgue.html
¹¹https://www.theverge.com/2019/3/14/18265358/pi-calculation-record-31-trillion-
google
18. The Cloud - Now On
Your Premises
From the buzzword down the rabbit hole into a new universe.

We just saw how a hybrid cloud setup provides many ways to split
workloads across cloud and on premises. Equally interesting is how
cloud providers make hybrid cloud happen in the first place, i.e. how
they enable a cloud operating model on your premises. Because different
vendors follow different philosophies and choose different approaches,
it’s the architects’ job to look behind the slogans and understand the
assumptions, decisions, and the ramifications that are baked into the
respective approaches. “Doors closing. Elevator going down…”

Bringing the Cloud to Your Premises


A cloud operating model brings many great benefits. But since they’re
mostly available in commercial cloud environments, what to do about
the applications that remain on premises, for example because of regu-
latory constraints, corporate policies, or specific hardware dependencies?
Declaring everything that remains on premises as legacy would be foolish.
True, there are inherent limitations on premises, such as the rate at which
you can scale (physical hardware won’t spawn) and the overall scale you
can reach (you’re not Google). Still, on-premises applications deserve to
enjoy modern architectures and automated deployments. “Bringing the
cloud to you” was even the slogan of a now re-vamped hybrid cloud
platform.

Hybrid vs. Cloud On Premises


Similar to multi cloud, the first choice an enterprise can make regarding
their on-premises environment is how much uniformity across cloud and
on premises they want to achieve. Initially, there are three fundamental
choices:

1. Largely leave the on-premises environment as is, which for most


enterprises means a virtualized IaaS environment based on VMware
or Microsoft Hyper-V.
2. Achieve cloud-like capabilities on premises, but operate separately
from the cloud.
3. Consider on-premises an extension of the cloud and manage both
environments uniformly. This management unit is often called
“control plane”, a term borrowed from network architecture.

Hybrid cloud ambitions

(The astute reader will have noticed that the diagrams are rotated by 90
degrees from the ones in the previous chapter, now showing cloud and on
premises side-by-side. This is the artistic license I took to make showing
the layers easier.)
The first option isn’t as horrible as it might sound. You could still use
modern software approaches like CI/CD, automated testing, and frequent
releases. In many cases, option #2 is a significant step forward, assuming
you are clear on what you expect to gain from moving to the cloud. Lastly,
if you want to achieve a true hybrid cloud setup, you don’t just enable
cloud capabilities on premises, but also want uniform management and
deployment across the environments.
Which one is the correct target picture for your enterprise depends largely
on how you will split applications. For example, if applications run either
in the cloud or on premises, but are not split half-and-half, then uniform management
may not be as relevant for you. Once again, you must keep in mind
that not every available feature translates into an actual benefit for your
organization.

Why On Premises is Different


Architecture lives in context. Before thinking about the implementation of
hybrid cloud frameworks, you should therefore consider how on premises
is a different ballpark from commercial cloud. While it’s all made from
racks and servers, cables, and network switches, power supplies and air
conditioning, you’ll also find quite a few differences:

Scale
Most on premises data centers are two or more orders of magnitude
smaller than so-called hyperscaler data centers. The smaller scale
impacts economics but also capacity management, i.e. how much
hardware can be on stand-by for instant application scaling. For
example, running one-off number crunching jobs on hundreds or
thousands of VMs likely won't work on premises, and if it did, it'd
be highly uneconomical.
Custom Hardware
Virtually all cloud providers use custom hardware for their data
centers, from custom network switches and custom security mod-
ules to custom data center designs. They can afford to do so based
on their scale. In comparison, on premises data centers have to make
do with commercial off-the-shelf hardware.
Diversity
On premises environments have usually grown over time, often
decades, and therefore look like they contain everything including
the proverbial kitchen sink. Cloud data centers are neatly built and
highly homogeneous, making them much easier to manage.
Bandwidth
Large-scale cloud providers have enormous bandwidth and super-
low latency straight from the internet backbone. Corporate data
centers often have thin pipes. Network traffic also typically needs
to make many hops: 10, 20, or 30 milliseconds isn’t uncommon just
to reach a corporate server. By that time, Google’s search engine
has already returned the results.
Geographic reach
Cloud data centers aren’t just bigger, they’re also spread around the
globe and supported by cloud endpoints that give users low-latency
access. Corporate data centers are typically confined to one or two
locations.

Physical access
In your own data center you can usually “touch” the hardware and
conduct manual configurations or replace faulty hardware. Cloud
data centers are highly restricted and usually do not permit access
by customers.
Skills
An on-premises operations team typically has different skill sets
than teams working for a cloud service provider. This is often the
case because on-premises teams are more experienced in traditional
operational approaches. Also, the operational tasks might be outsourced altogether to a third party, leaving very little skill in house.

In summary, despite many similarities on the surface, on-premises environments operate in a substantially different context and with different constraints than commercial cloud environments. Hence, we need to be cautious not to adopt an abstraction that's appealing at first but ultimately turns out to be an illusion. IT isn't a magic show, not even cloud
computing.

Hybrid Implementation Strategies


Given the different constraints in the cloud and on premises, it’s no
surprise that providers approach the problem of bringing the cloud to your
premises in different ways and sometimes in combinations thereof. That’s
a good thing because it lets you choose the approach that best matches
your strategy.
The primary approaches I have observed are:

1. Layering a uniform run-time and management layer across both


environments.
2. Replicating the cloud environment on your premises, with either a
hardware or software solution.
3. Replicating your existing on-premises virtualization environment
in the cloud.
4. Building an interface layer on existing on premises virtualization
tools to make it compatible with the cloud management layer, aka
control plane.

Each of these approaches deserves a closer look. Naturally, architects are


going to be most interested in the decisions and trade-offs that sit beneath
each particular path.

1. Define a Shared Abstraction Layer


When an architect looks for a uniform solution, the first thought
that comes to mind is to add an abstraction layer. Following the old
saying that “all problems in computer science can be solved with another
level of indirection”, attributed to David Wheeler¹, this layer absorbs the
differences between the underlying implementations (we're glossing over
the difference between indirection and abstraction for a moment). Great
examples are SQL to abstract database internals and the Java Virtual
Machine (JVM) to abstract processor architectures, which ironically are
less diverse nowadays than when that layer was created back in 1994.
So, it’s natural to follow the same logic for clouds, as illustrated in the
following diagram:

Layering a common abstraction across cloud and on premises

With cloud and on-premises data centers containing quite different hardware infrastructures, the abstraction layer absorbs such differences so that
applications “see” the same kind of virtual infrastructure. Once you have
such an abstraction in place, a great benefit is that you can layer additional
higher-level services on top, which will “inherit” the abstraction and
also work across cloud and on-premises, giving this approach a definite
architectural elegance.
¹https://en.wikipedia.org/wiki/David_Wheeler_(computer_scientist)

By far the most popular example of this approach is Kubernetes, the open
source project that has grown from being a container orchestrator into what looks
like a universal deployment/management/operations platform for any
computing infrastructure, sort of like the JVM but focused on operational
aspects as opposed to a language run-time.
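As a small illustration of what that uniformity looks like in practice, the sketch below applies the same manifest to an on-premises and a cloud-hosted cluster via kubeconfig contexts. The context names and the manifest file are placeholders.

```python
import subprocess

# One manifest, two clusters: the abstraction makes both look the same to the app.
MANIFEST = "orders-service.yaml"          # placeholder deployment manifest

for context in ("onprem-cluster", "managed-cloud-cluster"):   # placeholder contexts
    subprocess.run(
        ["kubectl", "--context", context, "apply", "-f", MANIFEST],
        check=True,
    )
```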

Advantages

Naturally, this approach has several advantages:

• The abstraction layer simplifies application deployment across environments because they look identical to the application.
• You might be able to pull the abstraction not only over on-premises
but also across multiple cloud providers, yielding a multi-hybrid
solution.
• You can build new abstractions on top of this base abstraction. For
example, this is the approach Knative follows in the Kubernetes
ecosystem.
• A common abstraction can enable unique features like fail-over or
resiliency across environments, (mostly) transparent to the application.
• You only need to learn one tool for all environments, reducing the
learning curve and improving skills portability.

Considerations

Of course, what looks good in theory will also face real-life limitations:

• A shared abstraction can become a lowest common denominator,


meaning features available in one platform won’t be reflected in the
abstraction, thus leading to costly underutilization. An abstraction
is, for example, unlikely to support cloud services like machine
learning that are impractical to implement on premises due to the
required hardware scale. The effect is exacerbated if the abstraction
spans multiple cloud providers, which may or may not provide
certain features. Also, the layer may face a time lag when new
features are released in the cloud platform, meaning you won’t have
access to the latest features.

• The new abstraction layer adds another moving part, increasing


overall system complexity and potentially reducing reliability. It
also requires developers to learn a new layer in addition to the cloud services they usually already know. Admittedly, once
they learn this layer, they can work in any cloud to the extent that
the abstraction holds.
• The abstraction runs the risk of conveying a model that doesn’t
match the underlying reality, resulting in a dangerous illusion. A
well-known example is the Remote Procedure Call², which pretends
that a remote call across the network is just like a local method call.
A fantastic abstraction that, unfortunately, doesn’t hold³ because
it ignores the inherent properties of remote communication such
as latency and new failure modes. Likewise, a cloud abstraction on
premises might ignore the limited scalability.
• Different implementations of the abstraction might have subtle differences. Applications that depend on those are no longer portable.
For example, despite the SQL abstraction, very few applications can
be ported from one relational database to another without changes.
• Implementation details, especially those where it is difficult to
define a common denominator, often leak through the abstraction.
For example, many cross-cloud automation tools define a common
automation syntax but can't abstract away platform-specific details, such as the options available to configure a virtual machine.
Raising the abstraction higher (so you don’t need to deal with VMs)
can overcome some of these differences but in return increases
complexity.
• Most failure scenarios break the abstraction⁴. If something goes
wrong underneath, let’s say your on premises network environment
is flooded or faces other issues, the abstraction can’t shield you from
having to dive into that environment to diagnose and remedy the
problem.
• The abstraction likely doesn’t cover all your applications’ needs.
For example, Kubernetes primarily focuses on compute workloads.
If you use a managed database service such as DynamoDB, which
might be a great convenience for your application, this dependency
resides outside the common abstraction layer.
²https://en.wikipedia.org/wiki/Remote_procedure_call
³https://www.enterpriseintegrationpatterns.com/patterns/messaging/
EncapsulatedSynchronousIntegration.html
⁴https://architectelevator.com/architecture/failure-doesnt-respect-abstraction/

• The abstraction only works for applications that respect it. For
example, Kubernetes assumes your application is designed to op-
erate in a container environment, which may be sub-optimal or not
possible for traditional applications. If you end up having multiple
abstractions for different types of applications, things get a lot more
complicated.
• Using such an abstraction layer likely locks you into a specific prod-
uct. So, while you can perhaps change cloud providers more easily,
changing to a different abstraction or removing it altogether will be
very difficult. For most enterprise users, VMware comes to mind.
An open source abstraction reduces this concern somewhat, but
certainly doesn’t eliminate it. You are still tied to the abstraction’s
APIs and you won’t be able to just fork it and operate it on your
own.
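
To make the SQL point from the list above concrete, here is a minimal sketch. The orders table is made up, but the dialect differences are real: even fetching the first ten rows of a table is phrased differently across popular relational databases.

```python
# Illustrative only: the "same" query, per database dialect.
postgres_or_mysql = "SELECT * FROM orders ORDER BY id LIMIT 10"
sql_server = "SELECT TOP 10 * FROM orders ORDER BY id"
oracle_pre_12c = (
    "SELECT * FROM (SELECT * FROM orders ORDER BY id) WHERE ROWNUM <= 10"
)
```

If a standard that has been around for decades still leaks such differences, a much broader abstraction spanning entire cloud platforms is unlikely to fare better.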

A common abstraction resembles Esperanto. Intellectually appealing but unlikely to gain broad traction because it’s an additional burden for folks already speaking their native language.

I label this approach the Esperanto⁵ ideal. It’s surely intellectually appeal-
ing - wouldn’t it be nice if we all spoke the same language? However, just
like Esperanto, broad adoption is unlikely because it requires each person
to learn yet another language and they won’t be able to express themselves
as well as in their native language. And sufficiently many people speak
English, anyway.

2. Copy Cloud to On-Premises


If you feel like your cloud is good as it is and you don’t want to place
a whole new abstraction layer on top of it, a natural idea is to have the
vendor bring the cloud to your corporate data center:
⁵https://en.wikipedia.org/wiki/Esperanto

Replicating the cloud on premises

Once you have a carbon copy of your cloud on premises, you can shift
applications back and forth or perhaps even split them across cloud and
on-premises without much additional complexity. You can then connect
the on-premises cloud to the original cloud’s control plane and manage it
just like any other cloud region. That’s pretty handy.
Different flavors of this option vary in how the cloud will be set up on
your premises. Some solutions are fully integrated as a hardware/software
solution while others are software stacks to be deployed on third party
hardware.
The best-known products following this approach are AWS Outposts⁶,
which delivers custom hardware to your premises, and Azure Stack⁷,
a software solution that’s bundled with certified third party vendors’
hardware. In both cases, you essentially receive a shipment of a rack to
be installed in your data center. Once connected, you have a cloud on
premises! At least that’s what the brochures claim.

Advantages

Again, this approach has some nice benefits:

• You can get a true-to-the-original experience on premises. For example, if you know AWS or Azure well, you don’t need to relearn a whole new environment just to operate on premises.
• Because vendor hardware is deployed on your premises, you benefit from the same custom hardware that’s available only to hyperscalers. For example, servers may include custom security modules or the rack might have an optimized power supply that powers all servers, rendering it a rack-size blade⁸.
⁶https://aws.amazon.com/outposts/
⁷https://azure.microsoft.com/en-in/overview/azure-stack/
• You can get a fully certified and optimized stack that assures that
what runs in the cloud will also run on premises. In contrast, the
abstraction layer approach runs the risk of a disconnect between
the abstraction and the underlying infrastructure.
• You don’t need to do a lot - the cloud comes to you in a fully inte-
grated package that essentially only needs power and networking.

Considerations

And, of course, there are a few things to keep in mind:

• Not all cloud services can meaningfully scale down to an on-premises setup that’s orders of magnitude smaller than a commercial cloud data center. On premises you might have a physical capacity of a handful of racks, which won’t work for services like Bigtable or Redshift. So, the services offered won’t be exactly identical in the respective environments.
• Shipping custom hardware might make it difficult to update on-
premises systems, especially given the rapid pace of evolution of
cloud services. Also, most enterprises will have stringent validation
processes before they allow any hardware, especially a new kind,
into their data centers.
• To achieve unified management, the on premises appliances will
connect back (“phone home”) to the cloud provider. Hence, this so-
lution doesn’t provide a strict isolation between the cloud provider
and your on premises systems. Given that isolation is a key driver
for keeping applications and data on premises in the first place, this
aspect can be a show stopper for some enterprises. You’re relying
on the internal segregation between the cloud appliance’s control
plane and the user plane where your applications and data reside.
• Bringing a vendor’s cloud to your premises means that you in-
crease your commitment, and hence the potential lock-in, with that
vendor. This might be OK for you if you’ve already considered
the Return on Lock-in, but it’s still something to factor in. To
counterbalance, you could place a common abstraction layer on top
⁸https://en.wikipedia.org/wiki/Blade_server

of the environments, for example by running managed Kubernetes in your on-premises cloud environment. But that’ll invite some of the limitations from #1 back in.
• Just like the other approaches, bringing a cloud appliance on premises
can’t overcome some of the inherent limitations of the on premises
environments, such as limited hardware and network capacity.

3. Copy On-Premises to Cloud


If one can bring the cloud to your premises, one can also consider bringing
your on premises systems to the cloud. While such an approach might
seem unusual at first, it does make sense if you consider that many
traditional applications won’t easily run in the “regular” cloud. Therefore,
replicating your on premises environment in the cloud eases migration,
for example in those cases where you need to vacate your existing data
center before an upcoming contract renewal.

Replicating on premises in the cloud

A popular example of this approach is VMware Cloud on AWS, which allows you to migrate VMs straight from vCenter and, for example, supports operating systems that aren’t offered in EC2.
Some might debate whether this approach really counts as “hybrid cloud”
or whether it’s closer to “hybrid virtualization” because you’re not really
operating in a cloud-oriented operating model. Instead you might be
bringing the old way of working into the cloud data center. Still, as an
intermediate step or if you’re close to shutting down an on-premises data
center, this setup can be a viable option.

4. Make On-Premises Look Like Cloud


Lastly, rather than bringing a whole new abstraction or even custom
hardware into your data center, why not take what you have and give it
a cloud-compatible API? Then, you could interact with your on premises
environment the same way you do with the cloud without tossing every-
thing you have on premises.

Making on premises look like cloud

Naturally, this approach makes most sense if you’re heavily invested into an existing on-premises infrastructure that you’re not prepared to
let go anytime soon. The most common starting point for this approach
is VMware’s ESX virtualization, which is widely used in enterprises.
VMware is planning to support Kubernetes on this platform via Project
Pacific⁹, which embeds Kubernetes into vSphere’s control plane to manage
VMs and containers alike.

Questions to Ask
The good news is that different vendors have packaged the presented
solutions into product offerings. Understanding the relevant decisions
behind each approach and the embedded trade-offs allows you to select
the right solution for your needs. Based on the discussions above, here’s a list of questions you might want to ask vendors offering products based on each of the four approaches:
⁹https://www.vmware.com/products/vsphere/projectpacific.html

1. Define a Shared Abstraction Layer

• To what extent does the abstraction address non-compute services, such as databases or machine learning?
• How do you expect to keep up with several major cloud providers’
pace of innovation?
• What underlying infrastructure is needed on premises to support
the abstraction?
• Why do you believe that a common abstraction works well across
two very different environments?
• How would I minimize lock-in into your abstraction layer? Which
parts aren’t open source?

2. Copy Cloud to On-Premises

• What service coverage does the on-premises solution provide compared to the cloud?
• What are your environmental requirements, e.g. power and cooling
density?
• How can we monitor the traffic back to your control plane to make
sure our data doesn’t leak out?
• To what extent will the solution run while being detached from the
control plane?
• What are the legal and liability implications of bringing your
solution to our data center?

3. Copy On-Premises to Cloud

• How well is the solution integrated into your cloud?


• What’s the minimum sizing at which this solution is viable?
• What’s the best path forward for migration, i.e. how much easier is
it to migrate from this solution as opposed to from on premises?

4. Make On-Premises Look Like Cloud

• How does your solution compare to the ones mentioned in #1?


• Won’t adding yet another layer further increase complexity?
• How are you keeping pace with the rapid evolution of the open
source frameworks you are embedding?

Asking such questions might not make you new friends on the vendor
side, but it’ll surely help you make a more informed decision.

Additional Considerations
Deploying and operating workloads aren’t the only elements for a hybrid
cloud. At a minimum, the following dimensions also come into play:

Identity and Access Management

A key element of cloud is to manage, enable, and restrict access to compute resources and their configurations. If you
are looking to unify management across on-premises and
cloud environments, you’ll also want to integrate identity
and access management, meaning developers and adminis-
trators use the same user name and authentication methods
and receive similar privileges.

Monitoring

If you have applications that spread across on premises and the cloud, you’d want to have consistent monitoring across
environments. Otherwise, when you encounter an opera-
tional issue you might have to chase down the source of the
problem in different monitoring tools, which may make it
difficult to correlate events across the whole environment.

Deployment

Ideally, you’d be able to deploy applications to the cloud and on premises in a uniform way. If you have the same run-time
environment available in both environments, harmonized de-
ployment is greatly simplified. Otherwise, a uniform contin-
uous build and integration (CI) pipeline coupled with several
deployment options allows you to minimize the diversity.

Data Synchronization

Lastly, being able to manage workloads uniformly isn’t sufficient as all your applications also need access to data. For new applications, you might be able to abstract data access via common APIs, but most commercial or legacy applications won’t easily support such an approach. A variety of tools promise to simplify data access across cloud and on-premises environments. For example, the AWS Storage Gateway¹⁰ allows you to access cloud storage from on premises while solutions like NetApp SnapMirror enable data replication.

Plotting a Path
So, once again, there’s a lot more behind a simple buzzword and the
associated products. As each vendor tends to tell the story from their point
of view, which is to be expected, it’s up to us architects to “zoom out” to see
the bigger picture without omitting relevant technical decisions. Making
the trade-offs inherent in the different approaches visible to decision
makers across the whole organization is the architect’s responsibility.
Because otherwise there’ll be surprises later on and one thing we have
learned in IT is that there are no good surprises.
¹⁰https://aws.amazon.com/storagegateway/features
19. Don’t Get Locked-up
Into Avoiding Lock-in.
Options have a price¹.

An architect’s main job is often seen as avoiding being locked into a specific technology or solution. That’s not a huge surprise: wouldn’t it be
nice if you could easily migrate from one database to another or from one
cloud provider to another in case the product no longer meets your needs
or another vendor offers you favorable commercial terms? However, lock-
in isn’t a true-or-false situation. And while reducing lock-in surely is
desirable, it also comes at a cost. Let’s develop some decision models that
will keep your cloud strategy from getting locked up into trying to avoid
lock-in.

Architecture Creates Options


One of an architect’s major objectives is to create options². Options allow
you to postpone decisions into the future, at known parameters. For
example, the option to scale out an application allows you to add compute
capacity at any time. Applications built without this option require
hardware sizing upfront. Options are valuable because they allow you to
defer a decision until more information (e.g. the application workload)
is available as opposed to having to make a best guess upfront. Options
hence make a system change-tolerant: when needed, you can exercise the
option without having to rebuild the system. Lock-in does the opposite:
once decided, it makes switching from the selected solution to another
difficult or expensive.
Many architects may therefore consider lock-in their archenemy during
their noble quest to protect IT’s freedom of choice in the face of uncertainty.
¹This article is an adaption of my original article posted to Martin Fowler’s Web Site:
https://martinfowler.com/articles/oss-lockin.html
²https://architectelevator.com/architecture/architecture-options/

Sadly, architecture isn’t that simple - it’s a business of trade-offs. Hence, experienced architects have a more nuanced view on lock-in:

• First, architects realize that lock-in isn’t binary: you’re always going
to be locked into some things to some degree - and that’s OK.
• Second, you can avoid lock-in by creating options, but those options
have a cost, not only in money, but also in complexity or, worse yet,
new lock-in.
• Lastly, architects need to peek behind common associations, for
example open source software supposedly making you free of lock-
in.

At the same time, there’s hardly an IT department that hasn’t been held
captive by vendors who knew that migrating off their solutions is difficult.
Therefore, wanting to minimize lock-in is an understandable desire. No
one wants to get fooled twice. So, let’s see how to best balance lock-in
when defining a cloud strategy.

Cloud Please, but with Lock-in on the Side!
Cloud folks talk a lot about lock-in these days: all the conveniences that
cloud platforms bring us appear to be coming at the expense of being
locked into a specific platform. A cure is said to be readily available in
form of hybrid-multi-cloud abstractions. Such solutions might give us
more options but they also bring additional complexity and constraints.
Deploying applications used to be fairly simple. Nowadays things seem to
be more complicated with a multitude of choices. For modern applications,
deploying in containers is one obvious route, unless you’re perhaps aiming
straight for serverless. Containers sound good, but should you use AWS’
Elastic Container Service (ECS) to run them? It integrates well with the
platform but it’s a proprietary service, specific to Amazon’s cloud. So, you
may prefer Kubernetes. It’s open source and runs on most environments,
including on premises. Problem solved? Not quite - now you are tied to
Kubernetes - think of all those precious YAML files you’ll be writing! So
you traded one lock-in for another, didn’t you? And what if you use a managed Kubernetes service such as Amazon’s Elastic Kubernetes Service (EKS), Azure’s Kubernetes Service (AKS), or Google’s Kubernetes Engine (GKE)? Although doing so is generally a good idea, it may also tie you to a specific version of Kubernetes and proprietary extensions.
If you need your software to run on premises, you could also opt for AWS
Outposts³ and run your ECS there, so you do have the option. But that
again is based on proprietary hardware, which could be an advantage or
a liability. Or over on the Kubernetes route, you could deploy GKE on
your premises. That in turn is installed on top of VMware, which is also
proprietary and hence locks you in. However, more likely than not, you’re
already locked into that one, so does it really make a difference? Or you
could try to make all these concerns go away with Google’s freshly minted
Anthos, which is largely built from open-source components, including
GKE. Nevertheless, it’s a proprietary offering: you can move applications
to different clouds - as long as you keep using Anthos. Now that’s the very
definition of lock-in, isn’t it?
Taking a whole different angle, if you neatly separate your deployment
automation from your application run-time, doesn’t that make it much
easier to switch infrastructures, mitigating the effect of all that lock-in?
Hey, there are even cross-platform infrastructure-as-code tools. Aren’t
those supposed to help make some of these concerns go away altogether?
It looks like avoiding cloud lock-in isn’t quite so easy and might even
get you locked up into trying to escape from it. Therefore, we need an
architecture point of view so our cloud strategy doesn’t get derailed by
unqualified fear of lock-in.

Shades of Lock-in
As we have seen, lock-in isn’t a binary property that has you either locked
in or not. Actually, lock-in comes in more flavors than you might have
expected:

Vendor Lock-in
This is the kind that IT folks generally mean when they mention
“lock-in”. It describes the difficulty of switching from one vendor
to a competitor. For example, if migrating from Siebel CRM to
Salesforce CRM or from an IBM DB2 database to an Oracle one will
³https://aws.amazon.com/outposts/

cost you an arm and a leg, you are “locked in”. This type of lock-in
is common as vendors generally (more or less visibly) benefit from
it. It can also come in the form of commercial arrangements, such
as long-term licensing and support agreements that earned you a
discount off the license fees back then.
Product Lock-in
Related, although nevertheless different, is being locked into a
product. Open source products may avoid the vendor lock-in, but
they don’t remove product lock-in: if you are using Kubernetes or
Cassandra, you are certainly locked into a specific product’s APIs,
configurations, and features. If you work in a professional (and
especially enterprise) environment, you will also need commercial
support, which will again lock you into a vendor contract - see
above. Heavy customization, integration points, and proprietary
extensions are forms of product lock-in: they make it difficult to
switch to another product, even if it’s open source.
Version lock-in
You may even be locked into a specific product version. Version
upgrades can be costly if they break existing customizations and
extensions you have built. Some version upgrades might even
require you to rewrite your application - AngularJS vs. Angular
2 comes to mind. You feel this lock-in particularly badly when a
vendor decides to deprecate your version or discontinues the whole
product line, forcing you to choose between being out of support or
doing a major overhaul.
Architecture lock-in
You can also be locked into a specific kind of architecture. Sticking
with the Kubernetes example, you are likely structuring your appli-
cation into services along domain context boundaries, having them
communicate via APIs. If you wanted to migrate to a serverless
platform, you’d want your services finer-grained, externalize state
management, and have them connected via events, among other
changes. These aren’t minor adjustments, but a major overhaul of
your application architecture. That’s going to cost you, so once
again you’re locked in.
Platform lock-in
A special flavor of product lock-in is being locked into a platform,
such as our cloud platforms. Such platforms not only run your
applications, but they also hold your user accounts and associated
access rights, security policies, network segmentations, and trained


machine learning models. Also, these platforms usually hold your
data, leading to data gravity. All of these make switching platforms
harder.
Skills lock-in
Not all lock-in is on the technical side. As your developers are
becoming familiar with a certain product or architecture, you’ll
have skills lock-in: it’ll take you time to re-train (or hire) developers
for a different product or technology. With skills availability being
one of the major constraints in today’s IT environment, this type of
lock-in is very real.
Legal lock-in
You might be locked into a specific solution for legal reasons, such
as compliance. For example, you might not be able to migrate your
data to another cloud provider’s data center if it’s located outside
your country. Your software provider’s license might also not allow
you to move your systems to the cloud even though they’d run
perfectly fine. Lock-in comes in many flavors.
Mental Lock-in
The most subtle and most dangerous type of lock-in is the one
that affects your thinking. After working with a certain set of
vendors and architectures you’ll unconsciously absorb assumptions
into your decision making. These assumptions can lead you to reject
viable alternative options. For example, you might reject scale-out
architectures as inefficient because they don’t scale linearly (you
don’t get twice the performance when doubling the hardware).
While technically accurate, this way of thinking ignores the fact
that in the internet age, scalability trumps efficiency.

So, next time someone speaks about lock-in, you can ask them which of
these eight dimensions they are referring to. While most of our computers
work in a binary system, architecture does not.

Accepted Lock-in
While we typically aren’t looking for more lock-in, we might be inclined,
and well advised, to accept some level of lock-in. Adrian Cockcroft of AWS
is famous for starting his talks by asking who in the audience is worried
about lock-in. While looking at the many raised hands, he follows up by asking who is married. Not many hands go down as the expression on people’s faces suddenly turns more contemplative.
Apparently, there are some benefits to being locked in, at least if the
lock-in is mutual. And back in technology, many of my friends are fairly
happily locked into an Apple iOS ecosystem because they appreciate the
seamless integration across services and devices.

You will likely accept some lock-in in return for the benefits
you receive.

Back in corporate IT some components might lock you into a specific product or vendor, but in return give you a unique feature that provides a
tangible benefit for your business. While we generally prefer less lock-in,
this trade-off might well be acceptable. For example, you may happily and
productively use a product like Google Cloud Spanner or AWS Bare Metal
Instances. The same holds for other native cloud services that are better
integrated than many cross-cloud alternatives. If a migration is unlikely,
you might be happy to make that trade-off.
On the commercial side, being a heavy user of one vendor’s products
might allow you to negotiate favorable terms. Or as a major customer you
might be able to exert influence over the vendor’s product roadmap, which
in some sense, and somewhat ironically, reduces the severity of your lock-
in: being able to steer the products you’re using gives you fewer reasons to
switch. We often describe such a mutual lock-in as a “partnership”, which
is generally considered a good thing.
In all these situations, you realize that you are locked in and have consciously decided to be so because the trade-off yields a positive "ROL" - your Return On Lock-in. It actually demonstrates a more balanced architecture
strategy than just proclaiming that “lock-in is evil”. Architecture isn’t a
Hollywood movie where good battles evil. Rather, it’s an engaging murder
mystery with many twists and turns.

The Cost of Reducing Lock-in


While lock-in isn’t the root of all evil, reducing it is still generally
desirable. However, doing so comes at a cost - there are no free lunches in
architecture (even those "free" vendor lunches come at a price). Although we often think of cost in terms of dollars, the cost of reducing lock-in also comes in a bigger variety of flavors than we might have thought:

Effort
This is the additional work to be done in terms of person-hours. If
you opt to deploy in containers on top of Kubernetes in order to
reduce cloud provider lock-in, this item would include the effort to
learn new concepts, write Docker files, and get the spaces lined up
in your YAML files.
Expense
This is the additional cash expense, e.g. for product or support
licenses, to hire external providers, or to attend KubeCon. It can
also include operational costs if wanting to be portable requires you to operate your own service instead of utilizing a cloud-provider
managed one. I have seen cost differences of several orders of
magnitude, for example for secrets management.
Underutilization
This cost is indirect, but can be significant. Reduced lock-in often
comes in the form of an abstraction layer across multiple products
or platforms. However, such a layer may not provide feature parity
with the underlying platforms, either because it’s lagging behind or
because it won’t fit the abstraction. In a bad case, the layer may be
the lowest common denominator across all supported products and
may prevent you from using vendor-specific features. This in turn
means you get less utility out of the software you use and pay for. If
it’s an important platform feature that the abstraction layer doesn’t
support, the cost can be high.
Complexity
Complexity is another often under-estimated cost in IT, but one of
the most severe ones. Adding an abstraction layer increases com-
plexity as there’s yet another component with new concepts and
new constraints. IT routinely suffers from excessive complexity,
which drives up cost and reduces velocity due to new learning
curves. It can also hurt system availability because complex systems
are harder to diagnose and prone to systemic failures. While you
can absorb monetary cost with bigger budgets, reducing excessive
complexity is very hard - it often leads to even more complexity.
It’s so much easier to keep adding stuff than taking it away.

New Lock-ins
Lastly, avoiding one lock-in often comes at the expense of another
one. For example, you may opt to avoid AWS CloudFormation⁴
and instead use Hashicorp’s Terraform⁵ or Pulumi⁶, both of which
are great products and support multiple cloud providers. However,
now you are tied to another product from an additional vendor and
need to figure out whether that’s OK for you. Similarly, most multi-
cloud frameworks promise to let you move workloads across cloud
providers but lock you into the framework. You’ll have to decide
which form of lock-in concerns you more.

The Real Enemies: Complexity and Underutilization
So, the cost of reducing lock-in isn’t just very real but comes in many
forms. Out of this list, underutilization and complexity should concern ar-
chitects the most because they are both severe and most often overlooked.
Both stem from the fact that a reduction in lock-in is often achieved by
adding another layer on top of existing tools and platforms.

Complexity and underutilization can be the biggest but least obvious price you pay for reducing lock-in.

For example, not being able to take advantage of DynamoDB or Cloud Spanner because you want your solution to be portable across clouds bears
a real, concrete cost of reduced features or value delivered. You’re paying
with underutilization: a feature or product is available, but you’re not
using it.
The cost of not being able to use a feature is real and incurred right
now, whereas the potential return in reduced switching cost is deferred
and hypothetical. It’s realized only when a full migration is needed. If
your product fails in the market because avoiding a specific service led to
poor performance or a delayed launch, the option of a later migration
⁴https://aws.amazon.com/cloudformation/
⁵https://www.terraform.io/
⁶https://www.pulumi.com/

will be worthless because your product won’t be around. That’s why ThoughtWorks’ widely respected Technology Radar⁷ designated Generic
Cloud Usage as Hold already in late 2018. Reducing lock-in is a delicate
decision that’s certainly not binary.

You bear the cost of the option, such as increased complexity or underutilization, today. The payoff in the form of flexibility gained comes only in the future and is limited to specific scenarios.

The cost in terms of complexity is incurred due to the addition of yet another layer of abstraction, such as JDBC connectors, container orches-
tration, common APIs, etc. While all useful tools, such layers add more
moving parts, increasing the overall system complexity. A quick look at
the Cloud Native Computing Foundation’s product landscape⁸ illustrates
this complexity vividly. The complexity increases the learning effort for
new team members, makes debugging more difficult, and increases the
chance of systemic errors.
Not all abstractions are troublesome, though. Some layers of abstraction
can greatly simplify things: most developers are happy to read from byte
streams instead of individual disk blocks. Such abstractions can work for
narrow interfaces, but rarely across broad platforms, and especially those
from competing vendors - the industry standard graveyard is littered with
such attempts - remember SCA⁹? And, while abstraction may shield you
from some complexity during happy times, it’s well known that Failure
Doesn’t Respect Abstraction¹⁰.

Optimal Lock-in
With both being locked-in and avoiding it carrying a cost, it’s the archi-
tect’s job to balance the two and perhaps find the sweet spot. We might
even say that this sums up the essence of the architect’s job: dismantle the
buzzwords, identify the real drivers, translate them into an economic risk
model, and then find the best spot for the given business context.
⁷https://www.thoughtworks.com/radar/techniques/generic-cloud-usage
⁸https://landscape.cncf.io/
⁹https://en.wikipedia.org/wiki/Service_Component_Architecture
¹⁰https://architectelevator.com/architecture/failure-doesnt-respect-abstraction/

It’s the architect’s job to dismantle the buzzwords, identify the real drivers, translate them into an economic / risk
model, and then find the best spot for the given business
context.

Sometimes you can find ways to reduce lock-in at low (overall) cost, for
example by using an Object-relational Mapping (ORM) framework like
Hibernate¹¹ that increases developer productivity and reduces database
vendor lock-in at the same time. Those opportunities don’t require a lot
of deliberation and should just be done. However, other decisions might
deserve a second, or even third look.

The total cost of lock-in

To find the optimal degree of lock-in, you need to sum up two cost
elements (see graph):

1. The expected cost of lock-in, measured by the chance of needing to switch multiplied by the cost in case it happens (blue curve going
down)
2. The upfront investment required to reduce lock-in (dashed green
line)

In the world of financial options, the first element is considered the option’s strike price, i.e. the price to pay when you exercise the option you
¹¹https://hibernate.org/orm/

hold. The second will be the price of the option itself, i.e. how much you
pay for acquiring the option. The fact that gentlemen Black and Scholes
received a Nobel prize in Economics¹² for option pricing tells us that this
is not a simple equation to solve. However, you can imagine that the price
of the option increases with a reduced strike price - the easier you want
the later switch to be, the more you need to prepare right now.
For us the relevant curve is the red line (labeled “total”), which tracks the
sum of both costs. Without any complex formulas we can easily see that
minimizing switching cost isn’t the most economical choice as it’ll drive
up your up-front investment for diminishing returns.
Let’s take a simple example. Many architects favor not being locked into
a specific cloud provider. How much would it cost to migrate your ap-
plication from the current provider to another one, for example having to
re-code automation scripts, change code that accesses proprietary services,
etc? $100,000? That’s a reasonable chunk of money for a small-to-medium
application. Next, you’ll need to consider the likelihood of needing to
switch clouds over the application’s lifespan. Maybe 5%, or perhaps even
lower? Now, how much would it cost you to bring that switching cost
down to near zero? Perhaps a lot more than the $5,000 ($100,000 x 5%)
you are expected to save, making it a poor economic choice. Could you
amortize the upfront investment across many applications? Perhaps, but
a significant portion of the costs would also multiply.
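
To make this model tangible, here’s a small back-of-the-envelope calculation based on the numbers above. The upfront investment figures are pure assumptions for illustration:

```python
# Total cost = upfront investment to reduce lock-in
#            + (remaining switching cost) x (probability of switching).
# All figures are assumptions, not benchmarks.
switching_cost = 100_000   # cost to migrate the application later
p_switch = 0.05            # likelihood of ever having to switch clouds

do_nothing = 0 + switching_cost * p_switch          # 5,000
full_portability = 30_000 + 0 * p_switch            # 30,000 invested upfront
modest_prep = 0 + (switching_cost / 2) * p_switch   # 2,500: generally useful
                                                    # practices halve the cost
print(do_nothing, full_portability, modest_prep)
```

Driving the switching cost to zero costs far more than it can ever save; that’s the shape of the "total" curve in the graph above.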

Minimizing switching cost won’t be the most economical choice for most enterprises. It’s like being over-insured.

A well-greased CI/CD pipeline and automated testing will drastically reduce your switching cost, perhaps by half. Now the expected cost of
switching is only $2,500. Note that building the CI/CD pipeline doesn’t
equate to buying an option - it’s a mechanism that helps you right here
and now to release higher quality software at lower cost every time.
Therefore, minimizing the switching cost isn’t the goal - it’s about
minimizing total cost. Architects who focus on switching cost alone will over-invest in reduced lock-in. It’s the equivalent of being over-insured:
paying a huge premium to bring the deductible down to zero may give you
¹²https://en.wikipedia.org/wiki/Black%E2%80%93Scholes_model

peace of mind, but it’s not the most economical, and therefore, rational,
choice.

Open Source and Lock-in


Lastly, a few thoughts on the common notion that using open source
software eliminates lock-in. From the many forms of lock-in above we
can see that open source software primarily addresses a single one, that of
vendor lock-in. Most other forms of lock-in remain. You’ll still be facing
architecture, product, skills, and mental lock-in, for example. That doesn’t
mean open source is not worth considering, but it means that it isn’t the
easy way out of lock-in that it’s sometimes portrayed to be.
In the context of cloud, open source is often positioned as a natural path
to multi cloud because most open source frameworks can be operated
on all commercial clouds as well as on premises. But here we need to
be careful whether open source also means that you have to operate
the software yourself as opposed to subscribing to a fully managed, but
perhaps proprietary, service. After all, the transformational aspect of cloud
isn’t to have a database, but being able to provision one at a moment’s
notice, never having to deal with operations, and paying only per usage.
There are efforts to offer fully-managed open source platforms across
cloud providers, but those are again proprietary offerings. There are no
magic answers.
The argument that open source avoids lock-in rests on the notion that
“in the worst case you can pack up your software and leave your
cloud provider”. While this argument might hold true for the software’s
functionality (most vendors include proprietary “enterprise” extensions,
though), it doesn’t cover operational aspects. Architects know best that
the non-functional aspects or “-ilities” are a critical element of enterprise
IT. Therefore, reverting from a managed platform (e.g. GKE/EKS/AKS for
Kubernetes) back to operating the upstream open source version yourself
will eliminate virtually all operational advantages that came with the
proprietary control planes. In many cases, this can amount to losing
a significant portion of the overall benefit, not even considering the
enormous effort involved.
Another common point regarding open source is that in principle ev-
eryone can contribute to or influence the product, much in contrast to
commercial offerings that put you at the mercy of the vendor’s product
roadmap (which makes it important to understand the vendor’s product
philosophy - see The IT World is Flat in 37 Things¹³). Although this is
a significant benefit for community-driven projects, many “cloud-scale”
open source projects see roughly half of the contributions from a single
commercial vendor, who is often also the instigator of the project and
a heavy commercial supporter of the affiliated foundation. Having the
development heavily influenced by that vendor’s strategy might work
well for you if it aligns with yours - or not, if it doesn’t. So, it’s up to
you to decide whether you’re happily married / locked-in to that project.
Although many open source products operate across multiple cloud
platforms, most can’t shield us from the specifics of each cloud platform.
This effect is most apparent in automation tools, such as Terraform, which
give us common language syntax but can’t shield us from the many
vendor-specific constructs, i.e. what options are available to configure a
VM. Implementation details thus “leak” through the common open source
frame. As a result, these tools reduce the switching cost from one cloud
to another, for example because you don’t have to learn a new language
syntax, but certainly don’t bring it down to zero. In return you might pay
a price in feature lag between a specific platform and the abstraction layer.
Solving that equation is the architect’s job.
In summary, open source is a great asset for corporate IT. Without it,
we wouldn’t have Linux operating systems, the MySQL database, the
Apache web server, the Hadoop distributed processing framework, and
many other modern software amenities. There are many good reasons to
be using open source software, but it’s the architect’s job to look behind
blanket statements and simplified associations, for example related to
lock-in.

Maneuvering Lock-in
Consciously managing lock-in is a key part of a cloud architect’s role.
Trying to drive potential switching costs down to zero is just as foolish
as completely ignoring the topic. Some simple techniques help us reduce
lock-in for little cost while others just lock us into the next product or
architectural layer. Open standards, especially interface standards such as
open APIs, do reduce lock-in because they limit the blast radius of change.
¹³https://architectelevator.com/book

Lock-in has a bad reputation in enterprise IT, and generally rightly so. Most
any enterprise has been held hostage by a product or vendor in the past.
Cloud platforms are different in several aspects:

1. Most cloud platform vendors aren’t your typical enterprise tool ven-
dors (the one that might have been, has transformed substantially).
2. An application’s dependence on a specific cloud platform isn’t
trivial, but generally less when compared with complete enterprise
products like ERP or CRM software.
3. A large portion of the software deployed in the cloud will be
custom-developed applications that are likely to have a modular
structure and automated build and test cycles. This makes a poten-
tial platform switch much more digestible than trying to relocate
your Salesforce APEX code or your SAP ABAP scripts.
4. In the past a switch between vendors was often necessitated by
severe feature disparity or price gouging. In the first decade of cloud
computing, providers have repeatedly reduced prices and released
dozens of new features each year, making such a switch less likely.

So, it’s worth taking an architectural point of view on lock-in. Rather than
getting locked up in it, architects need to develop skills for picking locks.
20. The End of
Multi-tenancy?
Cloud makes us revisit past architecture assumptions.

Most architecture decisions aim to balance conflicting forces and constraints. Often we consider such architecture decisions the "long-lasting"
decisions that are difficult to reverse. New technology or a new operating
model, such as cloud, can influence those constraints, though, making it
worthwhile to revisit past architecture decisions. Multi-tenant architec-
tures make for an excellent case study.

Multi-tenancy
Multi-tenant systems have been a common pattern in enterprise software
for several decades. According to Wikipedia¹:

Multi-tenant systems are designed to provide every tenant a dedicated share of a single system instance.

The key benefit of such a multi tenant system is the speed with which
a new share or (logical) customer instance can be created. Because all
tenants share a single system instance, no new software instances have
to be deployed when a new customer arrives. Also, operations are more
efficient because common services can be shared across tenants.
That’s where this architectural style derives its name - an apartment
building accommodates multiple tenants so you don’t need to build a new
house for each new tenant. Rather, a single building efficiently houses
many tenants, aided by common infrastructure like hallways and shared
plumbing.
¹https://en.wikipedia.org/wiki/Multitenancy

No Software!
In recent IT history, the company whose success was most directly based
on a multi-tenant architecture was Salesforce. Salesforce employed a
Software-as-a-Service model for its CRM (Customer Relationship Man-
agement) software well before this term gained popularity for cloud
computing platforms. Salesforce operated the software, including updates,
scaling, etc., freeing customers from having to install software on their
premises.
The model, aided by an effective “no software” marketing campaign,
was extremely successful. It capitalized on enterprises’ experience that
software takes a long time to install and causes all sorts of problems
along the way. Instead, Salesforce could sign up new customers almost
instantaneously. Often the software was sold directly to the business
under the (somewhat accurate) claim that it can be done without the IT
department. “Somewhat”, because integrations with existing systems are
a significant aspect of a CRM deployment.

No Software (to install)

From an architecture point of view, Salesforce ran a large database segregated by customer. The application logic would make sure all func-
tions are executed on the current customer’s data set. When a new
customer purchased the CRM suite, Salesforce essentially only set up a
new logical “tenant” in their system, identified by a so-called OrgID field
in the database. Every so often, more hardware needed to be added to
accommodate the additional data and computational demands, but that
was independent from the actual customer sign-up and could be done in
advance.

Tenant Challenges
While multi tenant systems have apparent advantages for customers, they
pose several architectural challenges. To make a single system appear like
a set of individual systems, data, configuration, as well as non-functional
characteristics, such as performance, have to be isolated between the
tenants - people want thick walls in their apartments. This isolation
introduces complexity into the application and the database schema, for
example by having to propagate the tenant’s identity through the code
base and database structure for each operation. Any mistake in this
architecture would run the risk of exposing one tenant’s data to another
- a highly undesired scenario. Imagine finding your neighbor’s groceries
in your fridge, or worse.
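
The following minimal sketch shows what propagating the tenant’s identity looks like in practice. It uses an in-memory SQLite database and a made-up opportunities table (not Salesforce’s actual schema); note how a single forgotten WHERE clause would expose another tenant’s data:

```python
import sqlite3

# Shared-schema multi-tenancy: one table holds every tenant's rows,
# distinguished only by a tenant identifier column.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE opportunities (org_id TEXT, name TEXT, amount REAL)")
db.executemany(
    "INSERT INTO opportunities VALUES (?, ?, ?)",
    [("org_a", "Renewal", 12000.0), ("org_b", "New logo", 55000.0)],
)

def opportunities_for(conn, org_id):
    # Every data access must carry the tenant id; omitting the WHERE
    # clause would leak rows across tenants.
    return conn.execute(
        "SELECT name, amount FROM opportunities WHERE org_id = ?", (org_id,)
    ).fetchall()

print(opportunities_for(db, "org_a"))   # only tenant A's rows
```
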
Also, scaling a single system for a large number of tenants can quickly
hit limits. So, despite its great properties, multi-tenancy, like most other
architectural decisions, is a matter of trade-offs. You have to manage a
more complex system for the benefit of adding customers quickly and
easily. Past systems architects chose the optimum balance between the two
and succeeded in the market. Now, cloud computing platforms challenge
some of the base assumptions and might lead us to a new optimum. Oddly,
the path to insight leads past waterfowl…

Duck Typing
The object-oriented world has a notion of duck typing², which indicates
that instead of classifying an object by a single, global type hierarchy,
it’s classified by its observable behavior. So, instead of defining a duck as
being a waterfowl, which is a bird, which in turn is part of the animal
kingdom, you define a duck by its behavior: if it quacks like a duck and
walks like a duck, it’s considered a duck.
²https://en.wikipedia.org/wiki/Duck_typing
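
In code, duck typing looks roughly like the following Python sketch (the class names are made up):

```python
class Duck:
    def quack(self): return "Quack!"
    def walk(self): return "waddles"

class RobotDuck:  # unrelated to Duck in any type hierarchy
    def quack(self): return "QUACK.EXE"
    def walk(self): return "rolls"

def observe(animal):
    # No isinstance() check against a global hierarchy: anything that
    # can quack and walk is treated as a duck.
    return f"It {animal.walk()} and says {animal.quack()}"

print(observe(Duck()))
print(observe(RobotDuck()))
```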

Typing via Hierarchy (left) vs. Duck Typing (Right)

In software design this means that the internal make-up or type hierarchy
of a class takes a back seat to the object’s observable behavior: if an object
has a method to get the current time, perhaps it’s a Clock object, regardless
of its parent class. Duck typing is concerned with whether an object can
be used for a particular purpose and is less interested in an overarching
ontology. Hence it allows more degrees of freedom in system composition.

Duck Architecture
It’s possible to apply the same notion of typing based on behavior to large-
scale system architecture, giving us a fresh take on architecture concepts
like multi-tenancy. Traditionally, multi-tenancy is considered a structural
property: it describes how the system was constructed, e.g. by segmenting
a single database and configuration for each user, i.e. tenant.
Following the “duck” logic, though, multi-tenancy becomes an observable
property. Much like quacking like a duck, a multi-tenant system’s key
property is its ability to create new and isolated user instances virtually
instantaneously. How that capability is achieved is secondary - that’s up
to the implementer.

The sole purpose of a system’s structure is to exhibit a desirable set of observable behaviors.

The “duck architecture” supports the notion that the sole purpose of a sys-
tem’s structure is to exhibit a desirable set of observable behaviors. Your
customers won’t appreciate the beauty of your application architecture
when that application is slow, buggy, or mostly unavailable.

Revisiting Constraints
The goal of a system’s structure is to provide desirable properties. How-
ever, the structure is also influenced by a set of constraints - not everything
can be implemented. Therefore, the essence of architecture is to implement
the desirable properties within the given set of constraints:

Constraints and Desirable Properties Shape Architecture

Now, certain constraints that have shaped architectural decisions in the past may have changed or disappeared. The result is that past architectural
decisions can and should be revisited.
For multi-tenant systems, the key constraint driving the design is that
deploying a new system instance is a complex and time consuming
process. Hence, if you want to give independent users near-instant access
to their own logical instance, you need to build a system that’s based on a
single existing instance that’s logically segmented into seemingly isolated
instances. Such a system doesn’t have to deploy a new instance for each
new customer and can therefore set up new customers easily.
A secondary constraint is that having many separate system instances,
each with its own hardware infrastructure, is rather inefficient. Each
instance carries the overhead of an operating system, storage, monitoring
etc., which would be duplicated many times, wasting compute and storage
resources.

Cloud Removes Constraints


We already learned that cloud computing isn’t just an infrastructure topic.
It’s as much about software delivery and automation as it is about run-
time infrastructure. And this very aspect of automation is the one that
allows us to reconsider multi-tenancy as a structural system property.
Thanks to cloud automation, deploying and configuring new system
instances and databases has become much easier and quicker than in
the past. A system can therefore exhibit the properties of a multi-tenant
system without being architected like one. The system would simply
create actual new instances for each tenant instead of hosting all tenants
in a single system. It can thus avoid the complexity of a multi-tenant ar-
chitecture but retain the benefits. So, it behaves like a multi-tenant system
(remember the duck?) without incorporating the structural property.
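
As a hedged sketch of what that might look like, consider onboarding a new customer by provisioning a dedicated instance through automation rather than adding a row to a shared tenant table. The example assumes a Kubernetes cluster, kubectl on the path, and an application manifest app.yaml - all purely illustrative:

```python
import subprocess

def onboard_customer(customer_id: str) -> None:
    """Provision a dedicated, isolated application instance for one tenant."""
    namespace = f"tenant-{customer_id}"
    # One namespace - and one full application instance - per tenant,
    # created by automation instead of a shared multi-tenant code path.
    subprocess.run(["kubectl", "create", "namespace", namespace], check=True)
    subprocess.run(["kubectl", "apply", "-f", "app.yaml", "-n", namespace],
                   check=True)

onboard_customer("acme")   # hypothetical new customer
```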

Multi-Tenancy vs. Multi-Single-Tenancy

Additionally, container and lightweight VM technologies significantly reduce the overhead associated with each individual instance because they
don’t duplicate the complete operating system each time.
But what about the operational complexity of managing all these in-
stances? Container orchestrators like Kubernetes are designed to help
operate and manage large numbers of container instances, including
automated restarts, workload management, traffic routing etc. It’s not
uncommon to run several thousand container instances on a large Ku-
bernetes Cluster.
In combination, this means that cloud automation and containers allow a
traditional system with a multi-tenant architecture to be replaced with
what we can call a multi-single-tenant system. This system exhibits
comparable properties but follows a much simpler application architecture
by relying on a more powerful run-time infrastructure.

A multi-single-tenant system exhibits the behavior of a multi-tenant system without the complexities of building
one. It simplifies the application architecture thanks to a
more powerful run-time infrastructure.

Under the covers, a cloud provider ultimately implements a typical multi-tenant design, for example by isolating clients sharing the same
physical server. Luckily that complexity has been absorbed by the cloud
provider so there’s no reason for us to reinvent it for each application,
vastly simplifying our application logic. While in many cases we shift
responsibilities up the stack, in this case we’re lucky enough to be able
to push them down.

Efficient Single Family Homes


To stretch the building metaphor from the beginning a bit, cloud com-
puting allows us to move away from apartment buildings into easily
replicated and efficient single-family homes: they are easier to build and
each tenant gets its own building. Luckily we can do this quite efficiently
in cloud computing as compared to the real world, so we don’t have to
worry about urban sprawl.
21. The New -ility:
Disposability
When it comes to servers, we don’t recycle.

Tossing servers may cost you. Or not.

Architects are said to focus on the so-called "ilities", a system’s non-functional requirements. Classic examples include availability, scalability,
maintainability, and portability. Cloud and automation add a new -ility,
disposability.

Speeding up is more than going faster


As discussed before, cloud isn’t about replacing your existing infrastruc-
ture as much as speeding up software delivery to successfully compete
in Economies of Speed. One of the key mechanisms to achieve speed is automation - things that are automated can be done faster than by hand.
Getting things done faster is definitely useful - I haven’t heard a business
complain that their IT delivers too quickly.
However, if things can be done a lot faster, you can do more than just be
quicker. You can start to work in a fundamentally different model, as long
as you also change the way you think about things. Handling servers is a
great example.
Software development doesn’t have a lot of tangible assets - it’s all bits
and bytes. Servers are likely the most visible assets in IT - they’re big
metal boxes with many blinking lights that cost a good chunk of money.
So, it’s natural that we want to treat them well and use them for as long
as possible.

Servers aren’t assets; they’re a liability.

Following our insight that cloud is a fundamental lifestyle change, we can rethink that assumption and come to realize that servers aren’t assets -
they’re a liability. In fact, no one wants one because they’re expensive,
consume a lot of energy and depreciate quickly. If a server is a liability,
we should minimize the number we have. For example, if we don’t need
a server anymore, or need a different configuration, we should just throw
it away.

Longevity considered harmful


Generally, we consider recycling to be a good thing because it preserves re-
sources. I am personally a big fan of repairing old things, being particularly
fond of having been able to repair my venerable Bang & Olufsen Beolab
4000 speakers (a shorted FET muted the speaker) and my 1992 Thurlby
lab power supply (a potentiometer went out and cut the current limiter to
zero).
So, it’s no surprise that we apply the same thought to servers: we use them
over and over again, thanks to a little patching and new software being
installed. Such behavior leads to the well-known notion of a “pet server”,
which is pampered like a favorite pet, possibly even has a cute name, and
is expected to live a long and happy life.
Actually, in my opinion the “pet” metaphor doesn’t quite hold. I have seen
many servers that had a bit of space left being (ab-)used as mail or DHCP
server, development machine, experimentation ground, or whatever else
came to developers’ minds. So, some of those servers, while enjoying a
long life, might resemble a pack mule more than a cute bunny. And, in
case there’s any doubt, Urban Dictionary confirms¹ that mule skinners
and bosses alike take good care of their pack mules.
Automation allows us to reconsider this tendency to prolong a server’s
life. If setting up a new server is fast and pain-free, and tearing one down
means we don’t have to pay for it anymore, then we might not have to
hold on to our favorite server until death do us part. Instead, we are happy
to throw (virtual) servers away and provision new ones as needed.
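
As a minimal sketch of that mindset, the snippet below uses AWS’s boto3 SDK; the region, AMI ID, and instance type are placeholders, not recommendations:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")   # placeholder region

# Provision a fresh, consistently configured server from a (hypothetical)
# golden image...
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder image id
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]

# ...and dispose of it when it's no longer needed, instead of patching
# and pampering it for years.
ec2.terminate_instances(InstanceIds=[instance_id])
```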

Disposing servers is sometimes also referred to as the "cattle" model, suggesting that sick animals be shot instead
of being pampered like a pet. As a devout vegetarian, I am
not so fond of metaphors that involve killing, so I prefer to
stick with servers and suggest the term disposability.

Automation and related practices such as software-defined infrastructure allow us to trade longevity in favor of disposability - a measure of how
easily we can throw something away and recreate it, in our case servers.

Throwing away servers for good


Changing our attitude to favor disposability gives us several benefits, most
importantly consistency, transparency, and less stress.

Consistency
If servers are long-lived elements that are individually and manually
maintained, they’re bound to be unique. Each server will have slightly
different software and versions thereof installed, which is a great way of
¹https://www.urbandictionary.com/define.php?term=pack%20mule
building a test laboratory. It’s not that great for production environments,
where security and rapid troubleshooting are important. Unique servers
are sometimes referred to as snowflakes, playing off the fact that each one
of them is unique.
Disposing and recreating servers vastly increases the chances that servers
are in a consistent state based on common automation scripts or so-called
golden images. It works best if developers and operations staff refrain
from manually updating servers, except perhaps in emergency situations.
In IT jargon, not updating servers makes them immutable and avoids
configuration drift.
Disposing and re-creating your systems from scripts has another massive
advantage: it ensures that you are always in a well-defined, clean state.
Whereas a developer who “just needed to make that small fix” is annoying,
a malware infection is far more serious. So, disposability makes your
environment not only more consistent, but also consistently clean and
therefore safe.

Transparency
One of the main hindrances in traditional IT is the astonishing lack of
transparency. Answering simple questions like how many servers are
provisioned, who owns them, what applications run on them, or what
patch level they are on, typically requires a mini-project that’s guaranteed
to yield a somewhat inaccurate or at least out-of-date answer.
Of course, enterprise software vendors stand by to happily address any
of IT’s ailments big or small, including this one. So, we are sure to find
tools like a CMDB (Configuration Management Database) or other system
management tools in most enterprises. These tools track the inventory
of servers including their vital attributes, often by scanning the estate of
deployed systems. The base assumption behind these tools is that server
provisioning occurs somehow through a convoluted and intransparent
process. Therefore, transparency has to be created in retrospect through
additional, and expensive, tooling. It’s a little bit like selling you a magic
hoverboard that allows you to levitate back out of the corner you just
painted yourself into. Cool, for sure, but not really necessary had you just
planned ahead a little.

Many IT tools are built under the assumption that servers are provisioned somehow and transparency therefore has
to be created in retrospect.

The principle of disposability allows us to turn this approach around. If all servers are easily disposed of and re-created from scripts and
configuration files, which are checked into a source control system, then as
long as you minimize manual configuration drift you’ll inherently know
how each server is configured simply by looking at the scripts that created
them. That’s a different and much simpler way to gain transparency a
priori as opposed to trying to reverse engineer reality in retrospect.
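
To make this a bit more tangible, here's a minimal sketch of what a disposable, script-defined server can look like. It uses Python and the AWS boto3 library purely as an illustration; the image ID, instance type, and tags are hypothetical, and a real setup would more likely rely on a dedicated infrastructure-as-code tool.

```python
# Illustrative sketch only: the server's entire definition lives in this
# version-controlled file, so "what is running?" is answered by reading it
# (and its git history) rather than by querying a CMDB after the fact.
# Assumes AWS credentials are configured; all values are hypothetical.
import boto3

SERVER_SPEC = {
    "ImageId": "ami-0123456789abcdef0",  # hypothetical golden image
    "InstanceType": "t3.micro",
    "TagSpecifications": [{
        "ResourceType": "instance",
        "Tags": [{"Key": "owner", "Value": "team-checkout"},
                 {"Key": "app", "Value": "order-service"}],
    }],
}

ec2 = boto3.client("ec2")

def recreate(old_instance_id=None):
    """Dispose of the old instance, if any, and provision a fresh one from the spec."""
    if old_instance_id:
        ec2.terminate_instances(InstanceIds=[old_instance_id])
    response = ec2.run_instances(MinCount=1, MaxCount=1, **SERVER_SPEC)
    return response["Instances"][0]["InstanceId"]
```

Because the specification is plain text under version control, the commit log doubles as the change history that a CMDB would otherwise try to reconstruct.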

Transparency via Tooling (left) vs. Inherent Transparency (right)

Isn’t creating such scripts labor-intensive? Surely, the first ones will have
an additional learning curve, but the more you minimize diversity, the
more you can reuse the same scripts and thus win twice. Once you factor
in the cost of procuring, installing, and operating a CMDB or similar tools,
you’re bound to have the budget to develop many more scripts than the
tool vendors would want you to believe.

Less Stress
Lastly, if you want to know whether a team embraced disposability as one
of their guiding principles, there’s a simple test. Ask them what happens
if one of their servers dies (not assuming any data loss, but a loss of all
software and configuration installed on it). The fewer tissue packs are
needed (or expletives yelled) in response to such a scenario, the closer
the team is to accepting disposability as a new -ility.
If their systems are disposable, one of them keeling over isn’t anything
unexpected or dramatic - it becomes routine. Therefore, disposability is
also a stress reducer for the teams. And with all the demands on software
delivery, any reduction in stress is highly welcome.

Better life with less recycling (in IT only!)


As a result, life with disposable resources is not only less stressful; it
also makes for a better operational model. And if we're worried about
environmental impact, disposing servers in IT is 100% environmentally
acceptable - it's all just bits and bytes²! So, go on and toss some servers!

²In reality, each IT operation also has an environmental cost due to energy consumption
and associated cooling cost. In the big scheme of IT operations, server provisioning is likely
a negligible fraction.
Part V: Building (for)
the Cloud

The cloud is a platform on which you deploy applications. Unlike traditional setups where application and infrastructure were fairly isolated and
often managed by different teams, cloud applications and the associated
tooling interact closely with their environment. For example, platforms
that offer resilient operations usually require applications to be automatically deployable. Similarly, serverless platforms expect applications to
externalize their state and be short-lived. So, when we talk about cloud,
we should also talk about building applications.

Application Complexity Increases


Although cloud gives applications amazing capabilities like resilience,
auto-scaling, auto-healing, and updates without downtime, it has also
made application delivery more complex. Listening to modern application developers speak about blue-green deploys, NoOps, NewOps, post-
DevOps, FinOps, DevSecOps, YAML indentation, Kubernetes operators,
service meshes, HATEOAS, microservices, microkernels, split brains, or
declarative vs. procedural IaC might make you feel like application
delivery was invaded by aliens speaking a new intergalactic language.
Many of these mechanisms have a viable purpose and represent major
progress in the way we build and deliver software. Still, the tools that bring
us such great capabilities have also caused a jargon proliferation like we
haven’t seen since the days when database column names were limited to
six characters. Explaining these many tools and techniques with intuitive
models instead of jargon will help us understand cloud’s implications on
application design and delivery.

Removing Constraints Impacts Architecture
The environment’s constraints shape the structure of applications. For
example, if deploying software is difficult, you’d lean towards deploying
a large piece of software once. Likewise, if communications are very slow
and intransparent, you might prefer to keep all parts together to avoid
remote communication. Cloud, in conjunction with modern software stacks,
has reduced or eliminated many such constraints and hence enabled new
kinds of software application architecture to emerge. Microservices, likely
the most well-known example, became popular because lower run-time
overhead and advances in software deployment made dividing applications
into smaller components practical. Understanding such architectural
shifts helps lay out a path for application evolution on the cloud.

Platforms Expand and Contract


Platforms to improve application delivery have existed for quite a while.
For example, PaaS (Platform as a Service) products simplified application
deployment with pre-fabricated build packs that included common dependencies. However, most of these platforms were designed as “black
boxes” that didn’t easily support replacing individual components. After
plateauing for a while, the pace of innovation picked up again, this time
favoring loose collections of tools, such as the Kubernetes ecosystem.
Shifting towards sets of tools allows components to evolve independently
but usually leaves the end user with the complexity of assembling all the
bits and pieces into a working whole.

I have seen projects where the build and deployment system became more complex than the application itself.

Over time, as approaches stabilize, we can expect platforms to again become more prescriptive, or “opinionated” in modern IT vernacular, and
hence better integrated. Anticipating such platform cycles can help us
make better IT investment decisions.

Applications for the Cloud


Many existing resources describe how applications should be built for the
cloud. This part isn’t intended to be an application development guide
but rather looks at those aspects of application development and delivery
that directly relate to cloud platforms:

• An application-centric cloud looks very different from an infrastructure-centric one. We could say it’s more flowery.
• You can’t be in the cloud without containers, it seems. But what’s
really packed up inside that container metaphor?
• Serverless isn’t really server-less, but perhaps it’s worry-less?
• Many frameworks and buzzwords want to tell us what makes an
application suitable to cloud. But it can also be simple and punchy
like FROSST.
• Things break, even in the cloud. So, let’s stay calm and operate on.
22. The Application-centric Cloud
Sketching out the modern application ecosystem.

Because no one wants a server, the cloud should make it easier to build
useful applications. Applications are the part of IT that’s closest to the
customer and hence the essence of a so-called digital transformation.
Application delivery allows organizations to innovate and differentiate.
That’s why rapid feature releases, application security, and reliability
should be key elements of any cloud journey. However, application
ecosystems have also become more complex. Models have helped us make
better decisions when architecting the cloud, so let’s look at a model for
an application-centric view of the cloud.

Applications Differentiate
Once you realize that running software you didn’t build is a bad deal,
you should make it a better deal to build software. Custom software is
what gives organizations a competitive edge, delights their customers,
and is the anchor of innovation. That’s what’s behind the shift from IT
as service provider to IT as competitive differentiator. So, it behooves
everyone in an IT organization to better understand the software delivery
lifecycle and its implications. Discussing software applications typically
starts by separating software delivery, i.e. when software is being made,
from operations, when software is being executed and available to end
users.

The Four-Leaf Clover


Developing a strategy for an application-centric cloud needs more than
just building and running software. And once again, the connection
between the pieces is at least as interesting as the pieces themselves. To make building an application-centric cloud meaningful but also intuitive,
we can follow a model which I dubbed the “four-leaf clover”:

A four leaf clover represents the application ecosystem

The clover leaf appropriately places the application in the center and
represents the essential ecosystem of supporting mechanisms as the four
leaves. Going counter-clockwise, the model contains the following leaves:

Delivery Pipeline
The build, deployment, and configuration mechanisms compile and
package code into deployable units and distribute them to the run-
time platform. A well-oiled delivery pipeline speeds up software
delivery because it constitutes a software system’s first derivative.
The oil comes in the form of automation, which enables approaches like
continuous integration (CI) and continuous delivery (CD).
Run-time Platform
Running software needs an underlying environment that manages
aspects like distributing workloads across available compute resources, triggering restarts on failure, and routing network traffic to
the right places. This is the land of virtualization, VMs, containers,
and serverless run-times.
Monitoring / Operations
Once you have applications running, especially those consisting of
many individual components, you’re going to need a good idea
of what your application is doing. For example, is it operating as
expected, are there unexpected load spikes, is the infrastructure out
of memory, or is CPU consumption significantly increased from the
last release? Ideally, monitoring can detect symptoms before they
become major problems and cause customer-visible downtime. This
is the realm of log processing, time series databases, and dashboard
visualizations (a small example of such telemetry follows this list).
Communication
Last but certainly not least, applications and services don’t live in
isolation. They’ll need to communicate with related services, with
other applications, and often with the outside world via public
APIs. Such communication should take place securely (protecting
both messages and participants), dynamically (e.g. to test a new
version of a service), and transparently (so you have an idea which
component depends on which others). Communication is handled
via API gateways, proxies, and service meshes.
The Application
Of course, the center of the clover leaf holds the application (or the
service). There’s a good bit to say about applications’ structure and
behavior that makes them best suited for the cloud. That’s why this
book includes a whole chapter on making applications good citizens
for the cloud: FROSST.
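
Tying this back to the Monitoring / Operations leaf, here's a minimal sketch of the kind of structured telemetry an application can emit so that log processing and dashboards have something to work with. The field names are made up for illustration; a real setup would use whatever logging or metrics library its platform provides.

```python
# Illustrative sketch: one structured (JSON) log line per request lets a log
# pipeline or time series database aggregate latencies and error rates.
# The field names are hypothetical, not a standard schema.
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("order-service")

def handle_request(order_id):
    start = time.monotonic()
    status = "ok"
    try:
        pass  # the actual business logic would go here
    except Exception:
        status = "error"
        raise
    finally:
        log.info(json.dumps({
            "event": "request_handled",
            "order_id": order_id,
            "status": status,
            "latency_ms": round((time.monotonic() - start) * 1000, 1),
        }))
```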

The simple model shows that there’s more to a software ecosystem than
just a build pipeline and run-time infrastructure. An application-centric
cloud would want to include all four elements of the clover leaf model.

Good Models Reveal Themselves


The clover leaf shares one property with most good models: it is intuitive
at first sight but reveals itself upon closer inspection, much like a Rothko
painting (I wish it’d also fetch similar prices). Upon closer inspection you’ll
find that the leaves aren’t positioned arbitrarily. Rather, the elements on
the leaves run along two defined axes. The horizontal axis represents the
application lifecycle from left to right. Hence we find the build pipeline
on the left because that’s how software is “born”. The run-time sits in
the middle, and monitoring on the right because it takes place after the
software is up-and-running.

Good models are intuitive at first sight but reveal themselves upon closer inspection like a Rothko painting.

The vertical axis represents the software “stack” where the lower layers
indicate the run-time “underneath” the application. The communications
channels that connect multiple applications sit on top. So, we find that
the positioning of the leaves has meaning. However, the model still makes
sense without paying attention to this finer detail.
Models like this one that can be viewed at different levels of abstraction
are the ones architects should strive for because they are suited to
communicating with a broad audience across levels of abstraction - the
essence of the Architect Elevator.

Diversity vs. Harmonization


Infrastructure architecture is typically driven by the quest for standardization - IT looks to limit the number of different types of servers, storage,
middleware, etc. Much of it is motivated by the pursuit of economies
of scale - less diversity means larger purchases and bigger discounts
in a traditional IT setting. Cloud’s linear pricing makes this approach
largely obsolete - 10 servers cost you exactly 10 times as much as one
server (modest enterprise discounts aside). Harmonization also means
complexity reduction, which is a worthwhile goal in any IT environment,
pre- or post-cloud.
So, it’s no surprise that when building an environment that places applications into the center of attention, the topic of standardization also creeps
up. What to standardize and what to leave up to developers’ preference
can be a hotly debated issue. After all, applications are meant to be diverse
and allow for experimentation and innovation. But do you really need to
use a different programming language and database for each service?

Some Silicon Valley companies known for giving developers substantial leeway in many regards, including their
outfit, are surprisingly strict when it comes to source code.
For example, they strictly enforce an 80-character limit per
line of code (after many months of debate, Java code was
granted an exception to stretch to a dizzying 120 charac-
ters).

Many companies are looking to narrow the number of development editors, operating systems, or hardware to be used. The motivation comes
again from aiming to reduce management complexity (if something
breaks, it’s easier to diagnose if all is the same), utilizing economies of
scale (carried over from the old IT days), or security (diversity widens the
attack surface).
However, equally loud are the screams from software developers feeling
crippled by needless standards that deprive them of their favorite tool. I
have seen developers set up completely separate networks so they can use
their favorite on-line drawing tool. The unfortunate side effect was that
no one else was able to see the documentation containing said drawings.
As so often, the right approach probably lies somewhere in the middle of
the two extremes. However, what’s a fair or reasonable split?

Standards Have Value and Cost


Just like providing options has a value and a cost, the same holds true
for standards. Standards can help reduce complexity, which is often an
inhibitor of delivery velocity. Standards can also limit choice, meaning
the best tool for the job might not be available - that’s a cost. Limiting
choice doesn’t necessarily mean limiting creativity, though (see A4 Paper
Doesn’t Stifle Creativity¹).
Architects understand that standards are not a black-or-white decision
and look to balance cost and benefit. Standardizing those elements where
harmonization provides a measurable value and has a justifiable cost
should be highest on the list. In the context of application delivery,
a common source code repository boosts reuse and thus reduces development effort and cost. Hence, standardizing on one repository has
measurable value and speeds up software delivery. Likewise, common
systems monitoring can help diagnose issues more easily than having
to consult many different systems and manually trying to correlate data
across them.

Standardize those elements that resemble a connecting element, i.e. where harmonization improves reuse or interoperability.

I hence tend to favor standardizing those elements that resemble a connecting element. For example, developers should have a free choice of
¹Hohpe, The Software Architect Elevator, 2020, O’Reilly
code editor or IDE as that’s an individual decision. Diversity in these items
may annoy standards committees but carries little real cost. Programming
languages are somewhat of a hybrid. They are usually a team’s choice,
so some coordination is helpful. Services written in different languages
can interact well through standard APIs, reducing the cost of diversity.
However, using many different languages might impair skills reuse across
teams, increasing the cost. Lastly, elements like monitoring or version
control connect many different parts of the organization and thus are more
valuable to standardize.
A section on standards wouldn’t be complete without highlighting that
standardization spans many levels of abstraction. Whereas much of
traditional IT concerns itself with standardizing products, additional
levels of standards can have a bigger impact on the organization. For
example, relational databases can be standardized at the product level,
the data model or schema level, consistent field naming conventions,
and more. The following illustration compares a standardized database
product with diverse schema (left) with diverse products but harmonized
schema (right).

Standardizing at different levels

While the left side applies traditional IT thinking to reduce operational
complexity, the right side is more relevant for the integration of different
data sources. The left side focuses on the bottom line whereas the right side
looks at the top line - integration across systems is often a driver of business
value and a better user experience. Once again we find that seemingly
simple decisions have more nuances than might be initially apparent.
Teasing out these subtleties is one of an architect’s key value contributions.

Growing Leaves
Building an application platform for cloud, for example by means of an
engineering productivity team, takes time. A common question, therefore,
is in which order to grow the leaves on the clover. Although there is
no universal answer, a common guideline I give is that you need to
grow them somewhat evenly. That’s the way nature does it - you won’t
find a clover that first grows one leaf to full size, then two, then three.
Instead each leaf grows little by little. The same holds true for application
delivery platforms. Having a stellar run-time platform but absolutely no
monitoring likely isn’t a great setup.

Growing the application ecosystem little by little

If I were pressed for more specific advice, building the platform in counter-
clockwise circles around the clover leaf is most natural. First, you’d want
to have software delivery (left) worked out. If you can’t deploy software,
everything else becomes unimportant. If you initially deploy on basic
VMs, that’s perhaps OK, but soon you’d want to enhance your run-time to
become automated, resilient, and easily scalable (bottom). For that you’ll
need better monitoring (right). Communications (top) is also important,
but can probably wait until you’ve established solid operational capabilities
for the other parts (Martin Fowler’s famous article reminds us that you
must be this tall to use microservices²). From there, keep iterating to
continue growing the four petals.

Pushing the Model


Once you have a good model, you can try to stretch it to reveal yet more
detail without negatively impacting the initial simplicity. It won’t always
work but because it’s hard to know without trying, it’s always worth a
try. Looking back at the clover after some time I was wondering whether
²https://martinfowler.com/bliki/MicroservicePrerequisites.html
it can be augmented to also explain and position some of the common
buzzwords that float around modern software delivery.
I felt that IaC (Infrastructure as Code), which is equally useful and over-
hyped, can be summed up as using the build pipeline to not only deploy
software artifacts but to also instantiate the necessary infrastructure
components. In the clover model it could therefore be depicted as a new
petal on the bottom left, between Delivery Pipeline and Run-time Platform.
Come to think of it, you could also place GitOps here, another frequent
buzzword (you know to be cautious when the top search result describes
something as a “paradigm”). GitOps roughly describes the useful practice
of checking your infrastructure deployment and configuration scripts
(used in a broad sense here, including declarative specifications) into your
source code repository. Infrastructure changes can then be tracked via git
logs and approved via pull requests. So, GitOps also sits at the intersection
of delivery pipeline and run-time and hence could occupy the same petal.
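
To make the IaC and GitOps petals a little more concrete, here's a minimal sketch using Pulumi's Python SDK - just one of several IaC tools (Terraform, CloudFormation, and others fill the same petal). The resource names and tags are hypothetical.

```python
# Illustrative IaC sketch (Pulumi's Python SDK): the bucket and its
# configuration are declared in code, checked into the same repository as the
# application, and created or updated by the delivery pipeline. Reviewing
# infrastructure changes then happens via pull requests - the GitOps idea.
import pulumi
import pulumi_aws as aws

assets = aws.s3.Bucket(
    "order-service-assets",  # hypothetical resource name
    tags={"owner": "team-checkout", "app": "order-service"},
)

pulumi.export("assets_bucket", assets.id)
```
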
Stretching the model further, one could position SRE (Site Reliability
Engineering) as a new petal at the bottom right between Monitoring
and Run-time Platform. SRE brings a software engineering mindset to
infrastructure and operations, suggesting practices like automation and
balancing delivery velocity and reliability via so-called Error Budgets.
If we feel lucky with our model, we should try to find matching elements
for the top half also. The word that’s been buzzing around communica-
tions quite a bit is “service mesh”. Again, it’s not easy to find a precise
definition - the Wikipedia page is flagged as having issues. Erring on the
side of simplicity, a service mesh aims to provide better control (such as
dynamic traffic routing) and better transparency (such as latencies and
dependencies) of service-to-service communications. Following this logic
we could place it on the top right.
This leaves us one petal short, the one on the top left. Since I have been
working in integration for decades, the closest technique that comes to
my mind is IDL - an Interface Description Language³. While for some
this term would trigger nightmares of CORBA (Common Object Request
Broker Architecture), Google’s Protocol Buffers and tools like Swagger
have made IDL cool again.
³https://en.wikipedia.org/wiki/Interface_description_language

Adding buzzword petals to the four leaf clover

Is this model better than the original? You be the judge - it’s certainly
richer and conveys more in a single picture. But showing more isn’t always
better - the model has become more noisy. You could perhaps show the
two versions to different audiences - developers might appreciate the small
petals whereas executives might not.
Defusing buzzwords and positioning them in a logical context is always
helpful. Finding simple models is a great way to accomplish that. Once
they’re done, such models might seem obvious. However, that’s one of
their critical quality attributes: you want your good ideas to look obvious
in hindsight.
23. What Do Containers
Contain?
Metaphors help us reason about complex systems.

Container technology has become an indispensable staple of cloud architecture. Much like it’s hard to imagine shipping and logistics without the
shipping container, Docker containers have become virtually synonymous
with being cloud native (and perhaps some smuggled in cloud immi-
grants…). Now that containers are being increasingly associated with
multi-cloud approaches, it’s worth having a closer look behind the scenes
of containers’ runaway success. And we’ll unpack the container metaphor
along the way.

Containers Package and Run


When IT folks mention the word container, they are likely referring to
Docker containers, or more recently container images compliant with the
Open Container Initiative¹ (OCI). Despite all the buzz around containers
and associated tools, there’s still occasional confusion about the exact role
containers play in the software lifecycle. The debate isn’t helped by most
“architecture diagrams” of Docker showing run-time components, such as
docker host and the client API, as opposed to capabilities. Let’s try to do
better.
In my view, containers combine three important mechanisms that redefine
how we deploy and operate applications:
¹https://www.opencontainers.org/

A Functional Architecture of Docker Containers

Artifact Packaging
Containers package application artifacts based on a Dockerfile, a
specification of the installation steps needed to run the application. The Dockerfile would for example include the installation of
libraries that the application depends on. From this specification,
Docker builds an image file that contains all required dependencies.
Image Inheritance
A particularly clever feature that Docker included in the generation
of the container image files is the notion of a base image. Let’s
say you want to package a simple Node application into a Docker
image. You’d likely need the run-time, libraries, npm and a few other
items installed for your application to run. Instead of having to
figure out exactly which files are needed, you can simply reference
a Node base image that already contains all the necessary files. All
you need to do is add a single line to your Dockerfile, e.g. FROM
node:boron.
Lightweight Virtualization
Lastly, when talking about containers, folks will often refer to
cgroups², a Linux mechanism for process isolation. cgroups support light-weight virtualization by allowing each container to behave as if it had its own system environment, such as file system
²https://en.wikipedia.org/wiki/Cgroups
paths or network ports. For example, on an operating system instance
(such as a VM) multiple web servers, each listening to port 80, can
run inside containers and have their logical ports mapped to distinct
host ports.

The term container thus combines three important core concepts. When
folks mention the word “container”, they might well refer to different
aspects, leading to confusion. It also helps us understand why containers
revolutionized the way modern applications are deployed and executed:
the concept of modern containers spans all the way from building to
deploying and running software.

Container Benefits
My Architect Elevator Workshops³ include an exercise that asks partici-
pants to develop a narrative to explain the benefits of using containers
to senior executives. Each group pretends to be talking to a different
stakeholder, including the CIO, the CFO, the CMO (Chief Marketing
Officer), and the CEO. They each get two minutes to deliver their pitch.
The exercise highlights the importance of avoiding jargon and building a
ramp⁴ for the audience. It also highlights the power of metaphors. Relating
to something known, such as a shipping container, can make it much
easier for the audience to understand and remember the benefits of Docker
containers.
Having run this exercise many times, we distilled a few nice analogies that
utilize the shipping container metaphor to explain the great things about
Docker containers:

Containers Are Enclosed

If you load your goods into a container and seal the doors,
nothing goes missing while en-route (as long as the container
doesn’t go overboard, which does occasionally happen).
The same is true for Docker containers. Instead of the usual
semi-manual deployment that is bound to be missing a
³https://architectelevator.com/workshops
⁴https://architectelevator.com/book
library or configuration file, deploying a container is predictable and repeatable. Once the Dockerfile is well defined,
everything is inside the container and nothing falls out.

Containers Are Uniform

While you could package goods in large boxes or crates to avoid stuff getting lost, shipping containers are unique in that
their dimensions and exterior features, such as the hook-up
points, are highly standardized. Containers are likely one of the
most successful examples of standardization, perhaps outside
of A4 paper⁵. I tend to highlight this important property by
reminding people that:

Containers look the same from the outside, whether there’s bags of sand inside or diamonds.

Thanks to this uniformity, tooling and “platforms” can be highly standardized: you only need one type of crane, one
kind of truck, and one type of ship as the “platform” for
the containers. This brings both flexibility and efficiency.
The fact that an incoming container can be loaded onto
any truck or another vessel makes operations more efficient
than waiting for specialized transport. But the uniformity
also brings critical economies of scale for investment into
specialized tools, such as highly optimized cranes.
Uniformity in Docker containers brings the same advantages:
you only need one set of tools to deal with a wide variety of
workloads. This allows you better economies of scale to develop sophisticated tooling, such as container orchestration,
that’d be hard to justify as a one-off.
⁵Hohpe, The Software Architect Elevator, 2020, O’Reilly

Containers Stack Tightly

Shipping containers are designed to stack neatly so that more freight can be loaded onto a single ship. Large container
ships hold upwards of 20,000 Twenty-foot Equivalent Units
(TEU), which means 20,000 of the “smaller” 20 ft containers
or proportionally fewer of the “larger” 40 ft containers. Those
ships stack containers 16 units high, which equates to about a
13 story building. The regular, rectangular shape of shipping
containers means that very little cargo space is wasted on
these ships. The resulting packing density makes container
shipping one of the most efficient modes of transport in the
world.
Docker containers also stack more tightly because, unlike
VMs, multiple containers can share a single operating system
instance. The reduced overhead makes deployment of small
services, like we often use them in microservices architectures, significantly more efficient. So, packing density is a
benefit for both physical and logical containers.

Containers Load Fast

On my business trips to Singapore the Asia Square Westin hotel often afforded me a clear view of Singapore’s container port. Seeing a giant container ship being unloaded and
reloaded within just a few hours is truly impressive. Highly
specialized and efficient tooling, made possible through the
standardized container format, minimizes expensive wait
times for vessels. Container loading is also sped up because
the packaging process (i.e. placing goods into the container)
is handled separately from the loading / deployment process
of the container. It thus minimizes the potential bottleneck of
ship loading - there are only so many cranes that can service
a vessel at the same time.
Docker containers also have shorter deployment times because they don’t require a whole new virtual machine and
operating system instance to be deployed. As a result most
IT containers spin up in less than a minute as opposed to
roughly 5-10 minutes for a complete VM. Part of that speed
gain is attributable to the smaller size, but also to the same
strategy as with physical containers: all little application
details and dependencies are loaded into the container image
first, making deployment more efficient.
Rapid deployment isn’t just a matter of speed, but also
increases resilience at lower cost than traditional stand-by
servers.

So, it turns out that the shipping container analogy holds pretty well and
helps us explain why containers are such a big deal for cloud computing.
The analogies can be summarized briefly: containers are enclosed (nothing goes missing in transit), uniform (one set of tools handles all workloads), stack tightly (better packing density), and load fast (quick, repeatable deployment).

Not Everything Ships in Containers


Now that we have done quite well with relating the benefits of IT
containers to shipping containers, let’s stretch the metaphor a little bit.
Despite the undisputed advantages of container ships, not all goods are
transported in containers. Liquids, gas, and bulk material are notable
examples that are transported via oil tankers, LPG vessels, or bulk vessels,
which actually outnumber container vessels⁶. Other materials, especially
perishable ones, are also better transported by aircraft, which do use
specialized pallets and containers but generally of much smaller size and
weight.
So, if our analogy holds, perhaps not all applications are best off running
in containers. For large applications, so-called monoliths, the efficiency
⁶https://www.statista.com/statistics/264024/number-of-merchant-ships-worldwide-
by-type/
advantage is smaller because the application size is bigger in comparison
to the OS image overhead. And, deployment can be automated with
a variety of tools, such as Ansible, Chef, or Puppet. Now, of course
we’re told that such applications are bad and should all be replaced with
numerous microservices. Architects, however, know that “architecture
isn’t a Hollywood movie; it’s not a battle between good and evil.” Hence,
one size doesn’t necessarily fit all and your IT won’t come to a standstill
if not every last application runs in containers.
Also, although containers, and even container orchestration with k3s⁷,
can run on a Raspberry Pi⁸, embedded systems may not need to pack
many applications onto a single device⁹. Similarly, performance-critical
applications, especially those heavy on I/O, might not want to pay the
overhead that containers introduce, e.g. to virtualize network connections.
And some of them, especially databases, have their own multi-tenant
mechanisms, again reducing the OS overhead and the need for rapid
deployment. You can pack quite a few logical databases onto a single
database server that runs on a pair of heavy-duty VMs or even physical
servers.

Thin Walls
Shipping containers provide pretty good isolation from one another - you
don’t care much what else is shipping along with your container. However,
the isolation isn’t quite as strong as placing goods on separate vessels. For
example, pictures circulating on the internet show a toppling container
pulling the whole stack with it overboard. The same is true for compute
containers: a hypervisor provides stronger isolation than containers.
Some applications might not be happy with this lower level of isolation.
It might help to remember that one of the largest container use cases plus
the most popular container orchestrator originated from an environment
where all applications were built from a single source code repository and
highly standardized libraries, vastly reducing the need for isolation from
“bad” tenants. This might not be the environment that you’re operating in,
⁷https://k3s.io/
⁸https://www.raspberrypi.org/products/
⁹Late-model Raspberry Pi’s have compute power similar to Linux computers from just
a decade ago, so they might not be a perfect example for embedded systems that operate
under power and memory constraints.
so once again you shouldn’t copy-paste others’ approaches but first seek
to understand the trade-offs behind them.
It’s OK if not everything runs in a container - it doesn’t mean they’re
useless. Likewise, although many things can be made to run in a container,
that doesn’t mean that it’s the best approach for all use cases.

Shipping Containers Don’t Re-Spawn


Alas, physical analogies don’t necessarily translate to the ephemeral world
of bits and bytes. For example, analogies between building architecture and
IT architecture are generally to be taken with a grain of salt. Also, as we
just saw, disposability is a great feature in the IT world, but not in the
physical one. Bits just recycle better than plastic.

Docker containers have a few tricks up their sleeves that shipping containers can’t match.

In my workshops, we stumbled upon a specific break in the analogy when one team tried to stretch the metaphor to explain the concept of
disposability. Because Docker containers can be easily deployed, multiple
running instances of one container image are straightforward to create.
After trying to weave a storyline that if your shipping container goes
overboard, a new instance can be spawned, the team had to admit that
it had likely overstretched the metaphor. This wasn’t just fun but also
educational because you don’t know the limits of a metaphor before you
test them.

Border Control
If containers are opaque, meaning that you can’t easily look inside, the
question arises whether containers are the perfect vehicle for smuggling
prohibited goods in and out of a country¹⁰. Customs offices deal with this
in several ways:
¹⁰Recent world news taught us that a musical instrument case might be a suitable vehicle
to smuggle yourself out of a country.

• Each container must have a signed off manifest. I’d expect that the
freight’s origin and the shipper’s reputation are a consideration for
import clearance.
• Customs may randomly select your container for inspection to
detect any discrepancies between manifest and actual contents.
• Most ports have powerful X-Ray machines that can have a good
look inside a container.

We should also not blindly trust IT containers, especially base containers that you inherit from. If your Dockerfile references a compromised base
image, chances are that your container image will also be compromised.
Modern container tool chains therefore scan container images
before deploying them. Container Security Scanning has a firm place
on ThoughtWorks’ Technology Radar¹¹. Companies will also maintain
private container registries that will (or at least should) only contain
containers with a proven provenance.
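
As a minimal sketch of what such a gate can look like - real pipelines would rather rely on image signing and a proper vulnerability scanner - the script below refuses to deploy an image whose digest doesn't appear on a reviewed allow-list. The image name and file path are hypothetical.

```python
# Illustrative sketch of a provenance gate: only images whose registry digest
# appears in a reviewed allow-list (itself kept under version control) get
# deployed. Real setups would use image signing and security scanning instead.
import json
import subprocess
import sys

def image_digest(image):
    # RepoDigests is populated once the image has been pushed to or pulled
    # from a registry.
    out = subprocess.check_output(
        ["docker", "inspect", "--format", "{{json .RepoDigests}}", image])
    return json.loads(out)[0]

def main(image, allowlist_path="approved-images.json"):
    with open(allowlist_path) as f:
        approved = set(json.load(f))
    digest = image_digest(image)
    if digest not in approved:
        sys.exit(f"Refusing to deploy {image}: {digest} is not on the allow-list")
    print(f"{image} cleared for deployment")

if __name__ == "__main__":
    main(sys.argv[1])
```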

Containers Are For Developers


Hopefully, examining analogies between shipping containers and Docker
containers has helped shed some light on the benefits of using containers.
One additional aspect played an important role, though, in Docker’s surge
as a container format, despite the underlying run-time technology having
been available for several years and VMware having deployed powerful
virtualization tools to virtually (pun intended) every enterprise.
And that difference is as important as it is subtle: Docker is a tool for
developers, not for operations. Docker was designed, developed, and
marketed to developers. It features text-based definition files, which are
easy to manage with version control, instead of the shiny but cumbersome
GUI tools perennially peddled by enterprise operations vendors.

Docker was designed and built for developers, not operations staff. It liberated developers by supporting a DevOps
way of working detached from traditional IT operations.

Also, Docker was open, open source actually, something that can’t be said
of many traditional operations tools. Lastly, it provided a clean interface
¹¹https://www.thoughtworks.com/radar/techniques/container-security-scanning
to, or rather separation from, operations. As long as a developer could
obtain a virtual machine from the operations team, they could install Docker
and spawn new containers at will without ever having to submit
another service request.

Beware of RDA - Resume-Driven Architecture!
As much as Docker deserved to become a staple of modern delivery, the
popularity of tools isn’t always a function of merit. Even open source
tools can enjoy seven-figure marketing budgets and effectively eliminate
the competition before setting up their own trademarks, certifications,
conferences, and paid vendor memberships. Tying the notion of a modern
enterprise to deploying a particular tool is a great marketing story but
shouldn’t impress architects.
Popular tools drive the demand for developers possessing relevant expe-
rience. In a dynamic market, that demand quickly translates into higher
salaries and job security. Therefore, some developers might favor a specific
tool because it’s good for their resume - and their pay grade. It’s been
asserted that adding “Kubernetes” to your resume can earn you a 20% pay
hike.
This tendency leads you to an unusual and undesired type of architecture:
Resume-Driven Architecture. It’s probably not the one you’re after! So,
“everyone else does it” shouldn’t be a sufficient argument for technology
adoption, even if the technology looks useful and has wide adoption.
Architects look beyond the buzz and aim for concrete value for their
organization. Being able to do that is also the most attractive item you
can have on your resume.
24. Serverless = Worry Less?
If no one wants a server, let’s go serverless!

As an industry, we generally earn mediocre marks for naming things. Whether microservices are really “micro” as opposed to “mini” or “nano”
has caused some amount of debate. Meanwhile, is it really useful to define
something by what it isn’t, as NoSQL has done? Lastly, you have
probably noticed by now that I’m not a fan of calling things native. But
the winning prize for awkward naming likely goes to the term serverless,
which describes a runtime environment that surely relies on servers. Time
to put the buzzword aside and have a look.

Architects Look For Defining Qualities


Dissecting buzzwords is one of enterprise architects’ most important tasks.
Buzzwords are a slippery slope towards a strategy that’s guided by wishful
thinking. As an architect, the question I like to ask any vendor touting a
popular buzzword is: “could you please enumerate the defining qualities
for that label?” Based on these qualities I should be able to determine
whether a thing I am looking at is worthy of the label or not. The vendor’s
initial response is invariably “That’s a good question.” Surely we can do
better than that.

Server-less = Less Servers?


Tracking down the origin of the term leads us to an article by Ken Fromm
from 2012¹. Here we learn that:

Their worldview is increasingly around tasks and process flows, not applications and servers — and their units of
measure for compute cycles are in seconds and minutes, not
hours. In short, their thinking is becoming serverless.
¹https://read.acloud.guru/why-the-future-of-software-and-apps-is-serverless-
reprinted-from-10-15-2012-b92ea572b2ef

AWS, whose Lambda service placed serverless architectures into the
mainstream, makes the explanation² quite simple, giving support to the
moniker server-worry-less:

Serverless computing allows you to build and run applications and services without thinking about servers.

Looking for a bit more depth, Mike Roberts’s authoritative article on serverless architectures³ from 2018 is a great starting point. He distinguishes
serverless third party services from FaaS - Function as a Service, which
allows you to deploy your own “serverless” applications. Focusing on the
latter, those are applications…

…where server-side logic is run in stateless compute containers that are event-triggered, ephemeral, and fully managed
by a third party.

That’s a lot of important stuff in one sentence, so let’s unpack this a bit:

Stateless
I split the first property into two parts. In a curious analogy to
being “server-less”, statelessness is also a matter of perspective.
As I mused some 15 years ago⁴, most every modern compute
architecture outside of an analog computer holds state (and the
analog capacitors could even be considered holding state). So, one
needs to consider context. In this case, being stateless refers to
application-level persistent state, such as an operational data store.
Serverless applications externalize such state instead of relying on
the local file system or long-lived in-memory state.
Compute Containers
Another loaded term - when used in this context, the term containers appears to imply a self-contained deployment and run-time unit,
not necessarily a Docker container. Having self-contained units is
important because ephemeral applications (see below) need to be
automatically re-incarnated. Having all code and dependencies in
some sort of container makes that easier.
²https://aws.amazon.com/lambda/faqs/
³https://martinfowler.com/articles/serverless.html
⁴https://www.enterpriseintegrationpatterns.com/ramblings/20_statelessness.html
Event-triggered
If serverless applications are deployed as stateless container units,
how are they invoked? The answer is: per event. Such an event
can be triggered explicitly from a front-end, or implicitly from a
message queue, a data store change, or similar sources. Serverless
application modules also communicate with each other by publish-
ing and consuming events, which makes them loosely coupled and
easily re-composable.
Ephemeral
One of those fancy IT words, ephemeral means that things live only
for a short period of time. Another, more blunt way of describing
this property is that things are easily disposable. So, serverless
applications are expected to do one thing and finish up because
they might be disposed of right after. Placing this constraint is a
classic example of application programmers accepting higher
complexity (or more responsibility) to enable a more powerful
run-time infrastructure.
Managed by Third Party
This aspect is likely the main contributor to serverless’ much-debated name. Although “managed” is a rather “soft” term (I
manage to get out of bed in the morning…), in this context it likely
implies that all infrastructure-related aspects are handled by the
platform and not the developer. This would presumably include
provisioning, scaling, patching, dealing with hardware failure, and
so on.
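
Pulling these properties together, here's a minimal sketch of what such a function can look like as an AWS Lambda written in Python: it is invoked per event, keeps no state between invocations, and externalizes its data to a managed store. The table and field names are hypothetical.

```python
# Illustrative sketch of a "serverless" function: event-triggered, stateless
# between invocations, ephemeral, and with persistent state externalized to a
# managed data store (DynamoDB here). Table and field names are hypothetical.
import json
import boto3

table = boto3.resource("dynamodb").Table("orders")  # reused across invocations

def handler(event, context):
    # The event might originate from an API gateway, a queue, or a storage trigger.
    order = json.loads(event["body"])
    table.put_item(Item={"order_id": order["id"], "status": "received"})
    return {"statusCode": 202, "body": json.dumps({"order_id": order["id"]})}
```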

But There’s More


Although Mike’s carefully worded sentence is astonishingly comprehensive, its focus is on applications running on a serverless (or FaaS) platform.
The platform itself also needs to possess a number of defining qualities to
deserve the coveted label. In fact, Mike wrote another article just on the
topic of Defining Serverless⁵. So let’s see what else is commonly assumed
when a platform provider labels a service as serverless:

Auto-scales
A serverless platform is expected to scale based on demand, giving
⁵https://blog.symphonia.io/posts/2017-06-22_defining-serverless-part-1
substance to the term server-worry-less, meaning you don’t have
to worry about servers; the platform does that. This property sets
serverless platforms apart from, for example, container orchestration,
where the application operations team typically has to provision
nodes, i.e. virtual machines managed by the orchestration.
Scales to Zero
Although we are most interested in scaling up, scaling down can
be equally important and sometimes more difficult to achieve. If
we follow the cloud’s promise of elastic pricing, we shouldn’t have
to pay anything if we don’t incur any processing, for example
because our site isn’t taking any traffic. The reality, though, is
that traditional applications keep occupying memory and usually
demand at least one virtual machine under any circumstance, so
zero traffic doesn’t become zero cost. Truly paying zero dollars is
unlikely to ever be true - all but the most trivial services will store
operational data that keeps incurring charges. Nevertheless, idling
serverless applications will cost you significantly less than with
traditional deployment on VMs.
Packaged Deployment
This one is a bit more fuzzy. It appears that serverless platforms
make it easy to deploy code, at least for simple cases. Best examples
are likely Node applications deployed to AWS Lambda or Google
Cloud Functions with just a simple command line.
Precise Billing
Cost based on precise usage⁶ is also one of Mike’s properties. AWS
Lambda bills in 100 ms increments of usage, making sporadically
used applications incredibly cheap to run.
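
A quick back-of-the-envelope calculation shows why such fine-grained billing matters; the prices below are purely illustrative placeholders, not anyone's current price list.

```python
# Illustrative back-of-the-envelope cost of a sporadically used function.
# All prices are placeholders - check your provider's price list.
PRICE_PER_GB_SECOND = 0.0000167       # illustrative
PRICE_PER_MILLION_REQUESTS = 0.20     # illustrative

requests_per_month = 50_000
avg_duration_s = 0.2                  # 200 ms per invocation
memory_gb = 0.128                     # 128 MB

compute_cost = requests_per_month * avg_duration_s * memory_gb * PRICE_PER_GB_SECOND
request_cost = requests_per_month / 1_000_000 * PRICE_PER_MILLION_REQUESTS

print(f"monthly cost: ${compute_cost + request_cost:.2f}")  # roughly 3 cents
```

Compare that to keeping even a small virtual machine running all month and the appeal for sporadically used workloads becomes obvious.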

Is it a Game Changer?
So now that we have a better idea of what types of systems deserve
to be called “serverless”, it’s time to have a look at why it’s such a
big deal. On one hand, serverless is the natural evolution of handing
over larger portions of the non-differentiating toil to the cloud providers.
While on a virtual machine you have to deal with hardening images,
installing packages, and deploying applications, containers and container
⁶https://blog.symphonia.io/posts/2017-06-26_defining-serverless-part-3
orchestration took things one step up by allowing you to deploy pre-
packaged units to a pool of nodes (servers). However, you still manage
cluster sizes, scaling units and all sorts of other machinery that, unless you
believe YAML is the nirvana for DevOps disciples, ends up being rather
cumbersome.
Serverless makes all this go away by handling deployment (and recycling)
of instances, node pool sizes, and everything under the covers, really
allowing you to focus on application functionality as opposed to infrastructure.
Additionally, just like cloud, serverless shifts the business model in a
subtle but important way. As Ben Wilson, my former colleague and
now Group Product Manager at Google Cloud, and I laid out in a
blog post titled Rethinking commercial software delivery with serverless⁷,
serverless incurs cost per execution of a single fine-grained application
function. You can therefore compare the value (and profit) generated
by that function to the cost of executing that function at a level of
granularity never before thought possible. Ben aptly called this capability
“transparent, transaction-level economics.” Now that’s a game changer.

Building Serverless Platforms


Architects tend to be those folks who routinely took their childhood toys
apart to see how they worked. So, as architects we are naturally curious
how the magic of serverless is implemented under the hood. As I revealed
in 37 Things, for me architecture is rooted in the non-trivial decisions that
were made. So, when our favorite cloud providers built their serverless
platforms, what critical decisions would they make? Let’s play serverless
platform architect for a minute.

Layering vs. Black Box

A serverless platform could be a “black box” (“opaque” would be a nicer term) or built by layering additional capabilities on top of another base
platform. Interestingly, we find examples of both approaches, even within
a single cloud provider.
⁷https://cloud.google.com/blog/topics/perspectives/rethinking-commercial-software-
delivery-with-cloud-spanner-and-serverless

Both Google App Engine, a serverless platform that existed well before
the term was coined, and AWS Lambda, AWS’s category-defining service,
follow the opaque model. You deploy your code to the platform from the
command line or via a ZIP file (for Lambda) without having to know what
the underlying packaging and deployment mechanism is.
Google Cloud Run follows the opposite approach by layering serverless
capabilities on top of a container platform. One critical capability being
added is scale-to-zero, which is emulated using Knative⁸ (talk about
another not-so-stellar-name…) to route traffic to a special Activator⁹ that
can scale up container pods when traffic comes in.
From an architectural point of view, both approaches have merit but
reflect different perspectives. When considering a product line architecture, layering is an elegant approach as it allows you to develop a common
base platform both for serverless and traditional (serverful? servermore?)
container platforms. However, when considering the user’s point of view,
an opaque platform provides a cleaner overall abstraction and gives you
more freedom to evolve the lower layers without exposing this complexity
to the user.

Language Constraints

A second key decision is what languages and libraries to support. Closely related is the decision whether to constrain the developer to better assure
the platform’s capabilities. For example, most platforms will limit the
ability to run native code or the spawning of additional threads. One of
the earliest serverless platforms, Google App Engine, placed numerous
restrictions on the application, allowing only Java and limiting library use.
Together with the lack of ancillary services like managed SQL databases,
these limitations likely contributed to the slow uptake when compared
with AWS Lambda.
Many of the modern FaaS platforms began with a limited selection of
languages, usually starting from Node, but now support a wide range.
For example, in 2020 AWS Lambda supported Java, Golang, PowerShell,
Node.js, C#, Python, and Ruby. Container-based serverless platforms
aren’t really limited in the choice of language as long as it’s practical to
include the runtime in a container.
⁸https://knative.dev/docs/serving/configuring-autoscaling/
⁹https://github.com/knative/serving/tree/26b0e59b7c1238abf6284d17f8606b82f3cd25f1/
pkg/activator

Salesforce’s Apex language is another important, but sometimes omitted serverless platform (as in FaaS). Salesforce decided to impose constraints
in favor of better platform management, going as far as defining its
own programming language. I am generally skeptical about business
applications defining custom languages because not only is language
design really hard but most of them also lack in tooling and community
support. In Salesforce’s case creating a new language might have been
a plausible avenue because their platform provides many higher-level
services and wasn’t intended to be running generic applications.

Platform Integration

An additional decision is how much a serverless platform should be integrated with the cloud platform, for example, monitoring, identity
and access management, or automation. A tighter integration benefits
developers but might make the platform less portable. In the current
market we see AWS Lambda being fairly well integrated, for example
with CloudWatch, IAM, and CloudFormation. Google Cloud Run largely
relies on the Kubernetes ecosystem and associated products bundled into
Anthos, which is becoming something like a “cloud on a cloud”.

Is Serverless the New Normal?


Many cloud strategies lay out a path from virtual machines to containers
and ultimately serverless. I take issue with such a progression for multiple
reasons. First, your strategy should be a lot more than following an
industry trend - it needs to define the critical decisions and trade-offs you
are making. Second, just like containers, not all applications are best off
running in a serverless model. So, it's much more about choosing the most
suitable platform for each kind of application than blindly following the
platform's evolution. Otherwise, you don't need an architect.
25. Keep Your FROSST
Cloud doesn’t love all applications equally much.

Cloud is an amazing platform but it doesn't love all applications equally
much. I've seen old-school monoliths that require special storage
hardware because trivial business actions trigger 200 database queries,
and where snapshotting the database is the only cure for a batch process
that regularly corrupts customer data. Such applications can be made to run
in the cloud but honestly won’t benefit a whole lot. On the other end of
the spectrum we have 12-factor, cloud-native, stateless, ephemeral, post-
DevOps, microservices applications that were awaiting the cloud before
it was even conceived.
After seeing someone claim their application was 66% cloud ready because
it met 8 of the 12 factors¹ commonly cited for modern applications, my
team set out to craft a better definition of what makes applications well
suited for the cloud. My colleague Jean-Francois Landreau authored this
nice description that he labeled “FROSST”. I consider it a perfect piece of
pragmatic, vendor-neutral architecture advice and am happy to include it
in this book.

By Jean-Francois Landreau

Our architecture team regularly authored compact papers that shared the
essence of our IT strategy with a broad internal audience of architects,
CIOs and IT managers. Each paper, or Tech Brief, tackled a specific
technology trend, elaborated the impact for our specific environment,
and concluded with actionable guidance. I was assigned to write such a
Tech Brief to give development teams and managers directions for moving
applications to our new cloud platform. The paper was intended both as
useful guidance and as a guard rail to avoid someone
¹https://12factor.net/

deploying the mother of all brittle, latency-sensitive monoliths on our
platform just to highlight that the platform didn’t produce an instant
miracle.
Titled Five Characteristics of Cloud-ready Applications, the Tech Brief was
widely read by colleagues across the globe and referenced in several senior
management slide decks. Encouraged by this positive feedback, I evolved
the five characteristics into the moniker “FROSST” - vaguely inspired by
the chilly Munich winters.

Cloud Applications should be FROSST


I wasn’t looking for an all encompassing but difficult to remember set
of attributes (tell me those 12 factors again…) but a description that’s
meaningful, intuitive, and easy to remember. I therefore chose simple
words and intentionally avoided buzzwords or jargon. Lastly, I wanted the
characteristics to support each other so they can make a cohesive whole
as opposed to a “laundry list”.
I ended up with FROSST - an acronym that helps me remember the key
characteristics of applications that look to be good friends with the cloud.
Those applications should be:

• Frugal
• Relocatable
• Observable
• Seamlessly updatable
• internally Secured
• failure Tolerant

Let’s look at each one, discuss why it’s relevant and how it relates to the
other characteristics.

Frugal

Frugality means that an application should be conservative in the amount
of CPU and memory it uses. A small deployment footprint, such as a
Docker image, is also a sign of frugality. Being frugal doesn’t mean being
cheap - it doesn’t imply premature optimization via incomprehensible
code that ekes out every last bit of memory and every CPU cycle (we
leave that to microcontrollers). It mainly means to not be unnecessarily
wasteful, for example by using an n² algorithm when a simple n or n log(n)
version is available.
Being prudent with memory or CPU consumption will help you stay in
budget when your application meets substantial success and you have
to scale horizontally. Whether your application or service consumes 128
or 512 MB RAM might not matter for a handful of instances, but if
you’re running hundreds or thousands, it makes a difference. Frugality
in application start-up time supports the application’s relocatability and
seamless updatability thanks to quick distribution and launch times.
Frugality can be achieved in many ways, for example by using efficient
library implementations for common problems instead of clunky one-
offs. It can also mean not lugging around huge libraries when only a
tiny fraction is required. Lastly, it could mean storing large in-memory
data sets in an efficient format like Protocol Buffers if the data is accessed
infrequently enough to justify the extra CPU cycles for decoding (letting the
bit twiddling happen under the covers).

Relocatable

Relocatable applications expect to be moved from one data center to
another or from one server to the next without requiring a full-on
migration that has a project manager executing countless tasks to make
sure that everything still works. The cloud moves applications frequently
and expects applications to be prepared for such events.
Cloud applications need to relocate for different reasons. First, to scale
an application horizontally, new instances need to be deployed. It’s not
strictly a relocation, but the relocatability characteristic allows additional
instances to run on different servers. The second occasion is when either
the application or the underlying infrastructure fails, necessitating a quick
redeployment. AWS’ auto-scaling groups and Kubernetes’ replication con-
trollers are typical examples. To make these relocation mechanisms work,
observability is needed. Additionally, relocatability also comes in handy
during the update of the application when a new version is deployed,
perhaps in parallel to the existing version. Lastly, relocatability can pay off
when deploying applications on so-called spot (or preemptible) instances,
discounted cloud instances that can be reclaimed at any point in time.

Externalizing state, implementing an init function to collect environment
details, or not relying on single instances (singletons) help make an
application relocatable. The application may still need singletons; however,
a singleton will now be a role, assigned via an election protocol, rather
than a single physical instance.
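As a small illustration (not part of the original Tech Brief), a relocatable service collects everything instance-specific from its environment at start-up and keeps state outside the instance; the environment variable names below are hypothetical:

```python
import os

# Hypothetical init: read all environment details at start-up so that a
# freshly started copy behaves identically on any server or data center.
def init_from_environment() -> dict:
    return {
        "region": os.environ.get("REGION", "unknown"),
        "state_store_url": os.environ["STATE_STORE_URL"],  # state lives outside the instance
        "role": os.environ.get("ROLE", "worker"),           # roles are assigned, not hard-wired
    }
```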

Observable

Observability means that there is no need to plug a debugger into an
application to analyze a problem. While attaching a debugger was never
a good approach, horizontally scaled and frequently relocated cloud
applications make it highly impractical.
The observability characteristic supports several scenarios. It enables
relocatability by helping frameworks detect or, better yet, predict failure.
It also increases failure tolerance when used as part of a circuit breaker².
Finally, it’s an essential part of platform or application operations that
need to meet specific service level agreements. For example, observability
allows you to build dashboards measuring these SLAs.
The simplest implementation of observability is by exposing endpoints
that report the application’s readiness and liveness, for example to be used
by a load balancer or orchestrator. Exposing startup parameters, connected
dependencies, the error rate of incoming requests, etc. can be very helpful
for distributed tracing and troubleshooting. Logs are also a very useful
form of observability, particularly when trying to debug a widely
distributed system. A classic and immensely useful implementation of
observability is Google’s varz³ coupled with Borgmon, replicated in the
open source world via Prometheus⁴.
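To make this concrete, a minimal illustrative version of such endpoints, using only the Python standard library, could look like the sketch below; real applications would typically rely on their web framework or a metrics library instead. The endpoint names follow common convention but are otherwise arbitrary:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_dependencies() -> bool:
    return True  # placeholder for real checks, e.g. database reachable

class HealthHandler(BaseHTTPRequestHandler):
    # /livez: the process is up; /readyz: it is able to serve traffic
    def do_GET(self):
        if self.path == "/livez":
            self._respond(200, "alive")
        elif self.path == "/readyz":
            ok = check_dependencies()
            self._respond(200 if ok else 503, "ready" if ok else "not ready")
        else:
            self._respond(404, "not found")

    def _respond(self, code: int, body: str):
        self.send_response(code)
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("", 8080), HealthHandler).serve_forever()
```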

Seamlessly Updatable

On the internet there are no nights, no weekends, and no low-usage win-
dows for scheduled downtime. To be honest, performing upgrades at night
or over the weekend wasn’t much fun, anyway. To make matters worse,
often the development teams were not present, leaving the operations
team to trigger a deployment they couldn’t fully qualify. It’s a lot better
²https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
³https://landing.google.com/sre/sre-book/chapters/practical-alerting
⁴https://prometheus.io/

to deploy during business hours with the stakeholders directly available
thanks to seamlessly updatable applications.
As outlined in the previous sections, frugality, relocatability and observ-
ability contribute to this characteristic. Automation and tooling, for ex-
ample to perform blue-green deployments, complete the implementation. This
characteristic is therefore not fully intrinsic to the application, but relies on
additional external tooling. A developer who has access to an environment
that supports seamless updates (like Kubernetes or auto-scaling groups)
has no excuse for needing downtime or requiring simultaneous updates
of dependent components.
Making application service APIs backward compatible makes updates
seamless. New features should be activated via feature toggles to separate
component deployment from feature release. Also, applications should
be able to exist simultaneously in different versions to support canary
releases.
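As a minimal sketch (illustrative; the flag and function names are hypothetical), a feature toggle separates deploying new code from releasing it; many real systems read such flags from a dedicated toggle service rather than from an environment variable:

```python
import os

def legacy_checkout_flow(cart):
    return {"flow": "legacy", "items": cart}

def new_checkout_flow(cart):
    return {"flow": "new", "items": cart}

# The new code path ships dark; turning it on is a configuration decision,
# independent of deploying the component.
def new_checkout_enabled() -> bool:
    return os.environ.get("FEATURE_NEW_CHECKOUT", "false").lower() == "true"

def checkout(cart):
    if new_checkout_enabled():
        return new_checkout_flow(cart)   # released independently of deployment
    return legacy_checkout_flow(cart)    # remains available for instant rollback
```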

Internally Secured

Traditionally, security was managed at the perimeter of the data center.
The associated belief that everything running inside the perimeter was in
a trusted environment meant that applications were often poorly secured.
Companies realized that this was not enough and started to implement
defense in depth, which extends security considerations into the application.
This approach is mandatory in the cloud, particularly for regulated
companies.
All other characteristics presented so far contribute to application security.
For example, redeploying the application every day from a verified source
can enhance security because it could overcome a Trojan horse or another
attack that compromised the application’s integrity. Close observation of
the application can enable the early detection of intrusions or attacks.
However, on top of this the application must be internally secured: it
cannot count on the run-time platform being a fully secured environment,
be it on-premises or on the public cloud. For example, do not expect an
API gateway to always protect your web service. Although it can clearly
do a major part of the job, a web service application has to authenticate
both the API Gateway and the end-user at the origin of the call to prevent
unauthorized access. “Trust no one” is a good guideline for applications
that can be deployed into a variety of environments.

Developers should test applications at a minimum against the OWASP Top
Ten Web Application Security Risks⁵. If there is no embedded authoriza-
tion or authentication mechanism, there is probably something missing.
Testing that the values of the parameters are within the expected range
at the beginning of a function is a good practice that shouldn't be
reserved for software running mission-critical systems on airplanes or
rockets.
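As a trivial, illustrative example of such defensive validation at a function boundary (the parameter choices are made up):

```python
def transfer(amount_cents: int, currency: str) -> None:
    # Validate inputs at the boundary instead of trusting the caller,
    # even if an upstream API gateway is expected to have checked them.
    if not isinstance(amount_cents, int) or amount_cents <= 0:
        raise ValueError(f"invalid amount: {amount_cents!r}")
    if currency not in {"EUR", "USD", "GBP"}:
        raise ValueError(f"unsupported currency: {currency!r}")
    # ... proceed with the actual business logic
```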

Failure Tolerant

Things will break. But the show must go on. Some of the largest E-
commerce providers are said to rarely be in a “100% functioning state”, but
their sites are very rarely down. A user may not be able to see the reviews
for a product but is still able to order it. The secret: a composable website
with each component accepting failure from dependent services without
propagating it to the customer.
Failure tolerance is supported by the application’s frugality, relocatability
and observability characteristics, but also has to be a conscious design
decision for its internal workings. Failure tolerance supports seamless
updates because during start-up dependent components might not be
available. If your application doesn’t tolerate such a state and refuses to
start up, an explicit start-up sequence has to be defined, which makes the
complete startup not only brittle but also slow.
On the cloud, it’s a bad practice to believe that the network always works
or that dependent services are always responding. The application should
have a degraded mode to cope with such situations and a mechanism
to recover automatically when the dependencies are available again. For
example, an E-commerce site could simply show popular items when
personalized recommendations are temporarily unavailable.
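A sketch of such a degraded mode might look as follows (illustrative; the service URL, timeout, and fallback list are hypothetical):

```python
import urllib.request

POPULAR_ITEMS = ["item-1", "item-2", "item-3"]  # static, always-available fallback

def recommendations_for(user_id: str) -> list:
    # Ask the personalization service, but never let its failure
    # propagate to the customer-facing page.
    try:
        url = f"http://recommender.internal/users/{user_id}"
        with urllib.request.urlopen(url, timeout=0.2) as resp:
            return resp.read().decode().split(",")
    except Exception:
        return POPULAR_ITEMS  # degraded but functional
```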

When to use FROSST


I hinted that FROSST is a necessary set of characteristics and highlighted
how they sustain each other. I’m not yet able to demonstrate that they are
sufficient and complete. However, I used them on numerous occasions in
my work:
⁵https://owasp.org/www-project-top-ten/

• Let's say you have to establish a set of architecture principles to
guide the development of a platform, perhaps running on a public
cloud. Based on your specific context, your principles can be guided
by the FROSST catalog.
• Next, you’ll need to establish a list of non-functional requirements
(NFR) for this platform. You can now use FROSST as a checklist to
ensure that the NFRs support these characteristics.
• Likely, development teams will ask you for guidelines for their
applications. Instead of building a list of NFRs and architecture
principles for them to follow, you can point them directly to
FROSST.
• Lastly, working for a big company, you will likely evaluate third
party software solutions. You may use FROSST as a prism through
which you evaluate their solutions regarding compatibility with
your cloud strategy. Using a common framework keeps your
evaluation consistent and unencumbered by sales pitches.

FROSST can be used on many kinds of runtime platforms, be it vir-
tual machines or containers, on-premises or on public clouds. These
characteristics are still valid even when you move to a higher level of
abstraction, such as Functions as a Service, where a good portion of the
FROSST implementation is done by the infrastructure framework. Most
importantly, you likely already remember the six characteristics.
26. Keep Calm And Operate
On
What kills us makes us stronger.

No one likes to pay for systems that aren’t running. Hence it’s not
surprising that system uptime is a popular item on a CIO’s set of goals.
That doesn’t mean it’s particularly popular with IT staff, though. The
team generally tasked with ensuring uptime (and blamed in case
of failure) is IT operations. That team’s main objective is to assure that
systems are up-and-running and functioning properly and securely. So it’s
no surprise that this team isn’t especially fond of outages. Oddly enough,
cloud also challenges this way of working in several ways.

Avoiding Failure
The natural and obvious way to assure uptime is to build systems that
don’t break. Such systems are built by buying high quality hardware,
meticulous testing of applications (sometimes using formal methods),
providing ample hardware resources in case applications do leak (or need)
memory. Essentially, these approaches are based on upfront planning,
verification, and prediction to avoid later failure.
IT likes systems that don’t break because they don’t need constant
attention and no one is woken up in the middle of the night to deal
with outages. The reliability of such systems is generally measured by
the number of outages encountered in a given time period. The time that
such a system can operate without incident is described by the system’s
MTBF – Mean Time Between Failure, usually measured in hours.
High-end critical hardware, such as network switches, sports an MTBF
of 50,000 hours or more, which translates into a whopping 6 years of
continuous operation. Of course, there’s always a nuance. For one, it’s
a mean figure assuming some (perhaps normal) distribution. So, if your
switch goes out in a year, you can’t get your money back - you were
simply on the near end of the probability distribution - someone else might
get 100,000 hours in your place. Second, many MTBF hours, especially
astronomical ones like 200,000 hours, are theoretical values calculated
based on component MTBF and placement. Those are design MTBF and
don’t account for oddball real-life failure scenarios - real life always has
a surprise or two in store for us, such as someone stumbling over a cable
or a bug shorting a circuit¹. Lastly, in many cases a high MTBF doesn’t
actually imply that nothing breaks but rather that the device will keep
working even if a component fails. More on that later.

Robustness
Systems that don’t break are called robust. Robust systems resist distur-
bance. They are like a castle with very thick walls. If someone decides to
disturb your dwelling by throwing rocks, those rocks will simply bounce
off. Therefore, robust systems are convenient as you can largely ignore the
disturbance. Unless, of course, someone shows up with a trebuchet and a
very large stone. Also, living in a castle with very thick walls may be all
cool and retro, but it's often not the most comfortable option.

Recovering from Failure


While maximizing MTBF is a worthwhile endeavor, ultimately the time
of “F” - as in failure - comes (somehow a picture of Sesame Street pops up
in my head). That’s when another metric comes into play - MTTR, Mean
Time To Repair, i.e., how long it takes for the respective system to be
back up and running.
A system’s availability, the most indicative measure of a system’s oper-
ational quality, naturally derives from both metrics. If you do actually
get the 50,000 hours out of your switch (remember, it’s just a statistical
measure) and its failure puts you out of business for a day, i.e. 24 hours, you still got
99.95% (1 - 24/50000) availability, not bad. If you want to achieve 99.99%,
you’d need to reduce the relative downtime by a factor of five, either by
increasing the MTBF or by decreasing the MTTR. It’s easy to see that
stretching MTBF has limits. How do you build something with an MTBF
of 250,000 hours (28.5 years) and actually test for that? Somewhere, some
¹https://www.computerhistory.org/tdih/september/9/

time, something will break. So there's a point of diminishing returns for
increasing MTBF. Hence you should also focus on reducing MTTR.

Increasing MTBF is useful, but to reach high uptime guarantees,
minimizing MTTR is equally important.
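The arithmetic is simple enough to spell out (a sketch using the switch example from above):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    # Fraction of time the system is up: uptime / (uptime + repair time)
    return mtbf_hours / (mtbf_hours + mttr_hours)

print(availability(50_000, 24))   # ~0.9995 - the 99.95% from above
print(availability(250_000, 24))  # ~0.9999 - via a (hard to verify) 5x MTBF
print(availability(50_000, 5))    # ~0.9999 - via a roughly 5x faster repair
```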

This example also shows that simply counting the number of outages is
a poor proxy metric (see 37 Things) - an outage that causes 30 seconds
of downtime should not be counted the same as one that lasts 24 hours.
The redeeming property of traditional IT is that it might not even detect
the 30-second outage and hence doesn’t count it. However, poor detection
shouldn’t be the defining element!
An essential characteristic of robust systems is that they don’t break
very often. This means that repairs are rare, and therefore often not well
practiced. You know you have robust systems when you have IT all up in
arms during an outage, running around like carrots with their heads cut
off (I am a vegetarian, you remember). People who drive a German luxury
car will know an analogous experience: the car rarely breaks, but when it does,
oh my, will it cost you! The ideal state, of course, is that if a failure occurs
nothing really bad happens. Such systems are resilient.

Resilience
Resilient systems absorb disturbance. They focus on fast recovery because
it’s known that robustness can only go so far. Rather than rely on up-
front planning, such systems bet on redundancy and automation to detect
and remedy failures quickly. Rather than a castle with thick walls, such
systems are more like an expansive estate - if the big rock launched by
the trebuchet annihilates one building, you move to another one. Or, you
build from wood and put a new roof up in a day. In any case you have also
limited the blast radius of the failure - a smashed or burnt down building
won’t destroy the whole estate.
In IT, robustness is usually implemented at the infrastructure level,
whereas resilience generally spans both infrastructure and applications.
For example, applications that are broken down and deployed in small
units, perhaps in containers, can exhibit resilience by using orchestration
that automatically spawns new instances in case underlying hardware
fails.
Infrastructure often has some amount of redundancy, and hence some
resilience, built in. For example, redundant power supplies and network
interfaces are standard on most modern hardware. However, these mech-
anisms cannot protect against application or systemic failure. Hence, true
resilience requires collaboration from the application layer.

Antifragile Systems
Resilient systems absorb failure, so we should be able to test their
resilience by simulating or even actually triggering a failure, without caus-
ing user-visible outages. That way we can make sure that the resilience
actually works.

A major financial services provider once lost a large piece
of converged infrastructure - almost 10,000 users were un-
able to access the core business application. The cause was
identified as a power supply failure that wasn’t absorbed
by the redundant power supply, as it was designed to do.
Had they tested this scenario during planned downtime,
they would have found the problem and likely avoided this
expensive and embarrassing outage.

So, when dealing with resilient systems, causing disturbance can actually
make the system stronger. That’s what Nassim Taleb calls Antifragile
Systems² - these aren’t systems that are not fragile (that’d be robust) but
systems that actually benefit from disturbance.
While at first thought it might be difficult to imagine a system that
welcomes disturbance, it turns out that many highly resilient systems
fall into this category. The most amazing system, and one close to
home, deserves the label “antifragile”: our human body (apologies to all
animal readers). For example, immunizations work by injecting a small
disturbance into our system to make it stronger.
²Nassim Taleb: Antifragile: Things That Gain From Disorder, Random House, 2012

What do firemen do when there's no fire and all the trucks
have been polished to a mirror perfect shine? They train!
And they train by setting actual fires so the first time they
feel the heat isn’t when they arrive at your house.

Antifragile systems require an additional level to be part of the scheme:
the systemic level.

Different models require a different scope

The systemic level includes ancillary systems, such as monitoring, but
also processes and staff behavior. In fact, most critical professions where
accidents are rare but costly rely on proverbial “fire drills” where teams
actively inject failure into a system to test its resilience but also to hone
their response. They are antifragile.

Systems Controlling Systems


Thinking about the application you are managing as a complex system (see
“Every Systems is Perfect” in The Software Architect Elevator³) highlights
the difference between the three approaches:
A robust application is a very simple system. It has a system state that is
impacted by disturbance. If your server fails, the system will be unavail-
able. Resilient applications form a more powerful system that consists of
a self-correcting (sometimes called goal-seeking) feedback loop. If there
is a discrepancy between a desired and actual system state (e.g. a failing
server), a correcting action is invoked, which returns the system to the
desired state. The correction, for example failing over to another server or
scaling out to additional instances, is usually neither instant nor perfect but
does a good job at absorbing minor disturbances.
³Hohpe, The Software Architect Elevator, 2020, O’Reilly

An outer system improves the inner system

Antifragile systems add a second, outer system to this setup. The outer
system injects disturbances, observes the system behavior, and improves
the correcting action mechanism.

Antifragile organizations are those that embrace a systems
thinking attitude toward their operations.

Such compensating mechanisms are similar to many control circuits in
real life. For example, most ovens have a thermostat that controls the
heating element, thus compensating for heat loss due to limited insulation
or someone opening the door (the disturbance). That’s the inner system.
During development of the oven, the engineers will calibrate the heater
and the thermostat to ensure a smooth operation, for example enabling
the oven to reach the desired temperature quickly without causing heat
spikes. That’s the outer loop.

Chaos Engineering
When the fire brigade trains, they usually do so in what I refer to as the
“pre-production” environment. They have special training grounds with
charred up buses and aircraft fuselage mock-ups that they set on fire to
inject disturbance and make the overall system, which heavily depends
on the preparedness of the brigade, more resilient.

In software, we deal with bits and bytes that we can dispose of at will.
And because we don’t actually set things on fire, it’s feasible for us to
test in production. That way we know that the resilient behavior in pre-
production actually translates into production, where it matters most.
Naturally, one may wonder how smart it is to inject disturbance into
a production system. It turns out that in a resilient system, routinely
injecting small disturbances builds confidence and utilizes the antifragile
properties of the overall system (including people and processes). That
insight is the basis for Chaos Engineering⁴, described as the discipline of
experimenting on a system in order to build confidence in the system’s
capability to withstand turbulent conditions in production.

Chaos Monkey: Governance Through
Deliberate Failure
One of the better-known tools associated with chaos engineering is the
Chaos Monkey⁵, a process that randomly terminates virtual machine
instances and containers. While traditional organizations would feel that
this is the last thing they want to run in their data center, it is actually
a governance mechanism with an effectiveness that other governance
bodies can only dream of. The Chaos Monkey assures that all software
that is released is resilient against a single-VM outage. Non-compliant
applications fall victim to the monkey test and are shut down. Now that’s
governance!

The Chaos Monkey is the ultimate governance tool, shut-
ting down non-compliant applications.
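Conceptually, the core of such a tool is tiny. The sketch below is purely illustrative - it is not Netflix's implementation and assumes instances have opted in via a hypothetical `chaos=enabled` tag - and uses the AWS SDK for Python to terminate one random running instance:

```python
import random
import boto3

def terminate_random_instance():
    ec2 = boto3.client("ec2")
    # Only consider running instances that explicitly opted in to chaos testing
    reservations = ec2.describe_instances(
        Filters=[{"Name": "tag:chaos", "Values": ["enabled"]},
                 {"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if instances:
        victim = random.choice(instances)
        ec2.terminate_instances(InstanceIds=[victim])  # resilient systems shrug this off
        return victim
```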

A gentler version of governance through deliberate failure is systems
that periodically return errors even when they are operating fine. Doing
so forces their clients to perform error checking, which they might oth-
erwise neglect when interacting with a highly available system. A well-
known example is Google’s Chubby Service⁶, which sported an amazing
⁴http://principlesofchaos.org/
⁵https://github.com/Netflix/chaosmonkey
⁶https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/
45855.pdf

99.99958% availability. Due to this "excess" availability, the service's Site
Reliability Engineers started forcing periodic outages.

Diversity Breeds Resilience


Much of IT concerns itself with reducing diversity. Diversity drives up
operational complexity (multiple ways of essentially doing the same
thing), reduces negotiation power (enterprise sales still largely functions
by economies of scale), and can compromise security (the more stuff you
have, the more likely one of them is vulnerable).
There is one area, though, where diversity is desired: highly resilient
systems. That’s so because diversity limits the blast radius of software
failure. If one piece of software suffers from a defect and all components rely
on the same software, this defect can propagate to redundant components
and thus compromise the anticipated resilience. The extreme cases are
space probes and such, which are sometimes built with fully redundant
systems that perform the same function but share no implementation
details. In such scenarios, even the teams aren’t allowed to speak to each
other to maintain complete diversity of ideas.

From Fragile to Antifragile


While everyone would dream of running only antifragile systems, systems
that don’t die from disturbance but only get stronger, it’s a long path for
most organizations. The following table summarizes the individual stages
and implications.

It’s easy to see that although antifragility is an amazing thing and a


worthwhile goal, it’s a full-on lifestyle change. Also, it can only be
achieved by working across infrastructure, middleware, application layer,
and established processes, something that often runs counter to existing
organizational barriers. It’s therefore best attempted on specific projects
where one team is in charge of end-to-end application management.

Don’t try to be antifragile before you are sure that you are
highly resilient!

Likewise, one must be careful to not get lured into wanting to be antifragile
before all your systems are highly resilient. Injecting disturbance into a
fragile system doesn’t make it stronger, it simply wreaks havoc.

Resilience Do’s and Don’ts

DON’T…

• Make availability the sole responsibility of infra/ops teams.
• Just count outages, considering them “bad luck”.
• Rely on failover mechanisms that weren’t explicitly tested.
• Inject disturbance unless you have resilient systems.

DO…

• Progress from robustness (MTBF) to also consider resilience (MTTR).
• Use monitoring and automation to help achieve resilience.
• Link organizational layers to achieve antifragility.
• Conduct regular “fire drills” to make sure you know how to handle
an outage.
Part VI: Embracing the
Cloud

By now you have migrated existing applications and built new ones so
that they take advantage of the cloud computing platform. But you’re not
quite done yet. A cloud journey isn't a one-time shot but an ongoing cycle
of learning and optimization.

The Journey Continues


Your initial migration has likely yielded promising benefits, but there’s
surely some optimization work left to do. Also, as your cloud consumption
increases, your financial management might take notice and have a
different view on the savings you achieved. Lastly, with new powers come
new responsibilities, so the cloud might bring you a few surprises.
Embracing cloud permeates all parts of the organization, whether it’s IT,
business, finance, or HR because it influences core business processes,
financial management, and hiring / re-skilling.
To round off this book, the last part discusses aspects to consider as your
organization fully embraces the cloud as a new lifestyle:

• Cloud savings don’t arrive magically; they have to be earned


• You may find that migrating to the cloud increased your run budget.
That’s likely a good thing!
• Traditionally, we think of automation as being about efficiency.
That’s not so in software and cloud.
• Small items do add up, also in the cloud. Beware the Supermarket
Effect!
27. Cloud Savings Have To
Be Earned
There ain’t no such thing as a free lunch. Not even in the cloud.

Many cloud initiatives are motivated, and justified, by cost savings. Cost is
always a good argument, particularly in traditional organizations, which
still view IT as a cost center. And although lower cost is always welcome,
once again things aren’t that simple. Time to take a closer look at reducing
your infrastructure bill.

How Much Cheaper is the Cloud?


When migrating traditional, monolithic, non-elastic applications to the
cloud, many organizations are surprised to find that the hardware cost
isn’t that much lower than on premises. Well, the cloud providers still
have to pay for the hardware and electricity just like you do. They also
invest in significant infrastructure around it.
Admittedly, they have better economies of scale, but hardware mark-ups
aren’t that rich and, depending what OS you run, licenses also have to be
paid. As of early 2020, running an AWS EC2 m5.xlarge instance with 4
vCPU and 16GB of RAM, comparable to a decent laptop, with Windows
will cost you about 40 cents per hour, or $288 per month. To that, you need
to add the cost for storage, networking, egress traffic, etc, so watch for the
Supermarket Effect! Including a second server as stand-by or development
server, data back-up, and a few bells and whistles, $10,000 a year isn't
a stretch by any means, no matter which provider you choose. Larger
instances, as are typical in classic IT, can easily hit a multiple of that.
But wasn’t cloud meant to be cheap?

Server Sizing
I am somewhat amused that when discussing cloud migrations, IT depart-
ments talk so much about cost and saving money. The amusement stems
from the fact that most on-premises servers are massively over-sized and
thus waste huge amounts of money. So, if you want to talk about saving
money, perhaps look at your server sizing first!
I estimate that existing IT hardware is typically 2-5x larger than actually
needed. The reasons are easy to understand. Because provisioning a server
takes a long time, teams rather play it safe and order a bigger one just in
case. Also, you can't order capacity at will, so you need to size for peak
load, which may occur only a few days a year. And, let's be honest, in
most organizations spending a few extra dollars on a server gets you in
less trouble than a poorly performing application.

In most IT organizations overspending on a server gets you
in less trouble than a poorly performing application.

Server sizing tends to be particularly generous when it's done by a
software vendor. After all, they’re spending your money, not theirs. I
have seen relatively simple systems with a proposed run-time footprint of
many dozen virtual machines simply because the vendor recommended it.
Many vendors calculate license fees based on hardware size, so the vendor
actually earns more by oversizing your servers. And the teams have little
incentive to trim hardware cost once the budget is approved. On the
contrary, how much hardware your application “needs” has become an
odd sort of bragging right in corporate IT.

I tend to joke that when big IT needs a car, they buy a Rolls
Royce. You know, just in case, and because it scored better
on the feature comparison. When they forget to fill the tank
and the car breaks down, their solution is to buy a second
Rolls-Royce. Redundancy, you know!

Even though “hardware is cheap” it isn’t so cheap that having a second


look at your sizing isn’t worth it. In many cases, you can easily shave off
$100k or more off your hardware bill per year. I’d be inclined to add that
to my list of accomplishments for the next performance review.

Earn Your Savings


Migrating your bloated servers to the cloud may reduce your run cost
a little bit thanks to better economies of scale and lower operational
overhead. Still, migration itself isn’t free either and many organizations
are somewhat disappointed that migrating to the cloud didn’t magically
knock 30, 40, or even 50% off their operations budget.
The reality is that savings don’t just appear but must be earned, as
illustrated in the following diagram:

Cloud savings have to be earned.

If you’re after savings, you must complete at least two more important
steps after the actual cloud migration:

Optimization through Transparency


Simply lifting and shifting existing systems to the cloud might provide
some savings, but in order to significantly reduce your cloud bill, you
need to optimize your infrastructure. Now you might think that no one has
money to give away, so your infrastructure must have been optimized at
some point or another. As we so frequently find, cloud changes the way we
can do things in subtle but important ways: it gives us more transparency.
Transparency is the key ingredient into being able to optimize and turns
out to be one of the most under-appreciated benefits of moving to the
cloud. Typical IT organizations have relatively little insight into what’s
happening in their infrastructure, such as which applications run on which
servers, how heavily the servers are utilized, and what run cost to attribute
to each application. Cloud gives you significantly more transparency into
your workloads and your spend. Most clouds even give you automated
cost-savings reports if your hardware is heavily underutilized.

Transparency is the most under-appreciated benefit of
moving to the cloud. It’s the critical enabler for cost reduc-
tion.

The nice thing about cloud is that, unlike with your on-premises server
that's already been purchased, you have many ways to correct oversized or
unneeded hardware. Thanks to being designed for the first derivative, the formula
for cloud spend is quite simple: you are charged by the size of hardware
or storage multiplied by the unit cost and by the length of time you use it.
You can imagine your total cost as the volume of a cube:

The Cost Cube

For the more mathematically inclined, the formula reads like this:

Cost[$] = size [units] * time [hours] * unit cost [$/unit/hour]

Since the unit cost is largely fixed by the provider (more on that later),
you have two main levers for saving: resource size and lifetime.
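A quick, illustrative calculation shows how much leverage these two factors provide; the numbers are made up but representative:

```python
def monthly_cost(units: float, hours: float, unit_cost_per_hour: float) -> float:
    # The cost cube: size x time x unit cost
    return units * hours * unit_cost_per_hour

always_on    = monthly_cost(units=4, hours=730, unit_cost_per_hour=0.10)  # oversized, 24x7
right_sized  = monthly_cost(units=2, hours=730, unit_cost_per_hour=0.10)  # half the size
office_hours = monthly_cost(units=2, hours=200, unit_cost_per_hour=0.10)  # also stopped off-hours

print(always_on, right_sized, office_hours)  # 292.0, 146.0, 40.0
```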

OPTIMIZING SIZE

Optimizing size is a natural first step when looking to reduce
cost. Servers that are mostly idling around are good candi-
dates. Also, cloud providers offer many combinations of CPU
and RAM so that applications that are particularly compute
or memory intensive can be deployed on virtual machines
that more closely match their needs. Some cloud providers
offer custom machine sizes so that you can pick your favorite
combination of memory and CPU.

OPTIMIZING TIME

Reducing the amount of time a server is running also presents
ample opportunity for cost savings. A typical development
and test environment is utilized about 40-50 hours a week,
which is roughly one quarter of the week’s 168 hours. Stop-
ping or downscaling these environments when they’re not
needed can cut your cost dramatically. Winding down and
re-provisioning servers also assures that your systems are
disposable.

When looking to reduce a server's size or runtime, one limiting element
is load peaks or intermittent loads. What if the idling server is heavily
utilized once a week or once a month? How much you can save is largely
determined by the time and complexity of resizing the system or bringing
it back on-line. This consideration leads us to the second major cost
savings vector: automation.

Resilience through Automation


When organizations look to optimize spend, I usually ask them which
server is the most expensive one. After a lot of speculation, I let them in
on a little secret:

The most expensive server is the one that's not doing
anything. Even at a 30% discount.

The sad part is that if you negotiate a 30% discount from your vendor, that
server is still your most expensive one. Now if I want to really get the
executives’ attention, particularly CFOs, I tell them to imagine that half
the servers in their data center aren’t doing anything. Most of them won’t
be amused because they'd consider it a giant waste of money and dearly
hope that my exercise is a hypothetical one, until I let them in on the fact
that this is true for most data centers.
Why are half the servers in a data center not doing anything? They’re
there in case another server, which is doing something, breaks. “Warm
standby” they’re called. Warm standby servers take over in a failure
scenario to minimize system downtime. The poor economics of warm
stand-by servers are easily explained:

A single server can conservatively achieve around 98% of
uptime. Adding a redundant server can increase that to 99
or 99.5%, but at double the hardware cost! So you bought
yourself 1 percent of uptime for a 100% increase in hardware
cost.

The way out of this conundrum is automation. If your software provi-
sioning is fully automated, meaning you can spool up a new instance in
mere minutes or even seconds as opposed to days or hours, then you may
not need that redundant hardware - you simply deploy a new instance
when the primary instance fails. Let’s say you have a 99.5% Service Level
Objective (SLO), which allows you 30 days * 24 h * 0.5% = 3.6h of downtime
a month. If it takes you 3 minutes to deploy and start a new server to take
over from the old one, you could do that 72 times a month without missing
your SLO. Try doing that by hand.
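Spelled out, the same arithmetic looks like this (a sketch; a real error budget would also account for the time it takes to detect the failure):

```python
slo = 0.995                     # 99.5% availability objective
hours_per_month = 30 * 24
downtime_budget_h = hours_per_month * (1 - slo)      # ~3.6 hours per month

recovery_minutes = 3            # automated redeploy and start-up
outages_allowed = downtime_budget_h * 60 / recovery_minutes

print(downtime_budget_h, outages_allowed)  # ~3.6 hours -> ~72 outages a month
```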
The big shift here is from a robust architecture, one that’s designed not to
break, to a resilient architecture, one that can absorb failure and remain
available - see Keep Calm And Operate On. Shifting from robustness and
redundancy to resilience and automation is one of several ways in which cloud
can defy existing contradictions, such as providing better uptime at lower
cost (see my Google Cloud blog post on Connecting the Dots¹).

Changing Cloud Providers


You might be surprised that changing providers wasn’t on the list of major
savings vehicles. After all, we do see numerous press releases of dramatic
¹https://blog.google/products/google-cloud/how-the-cloud-operating-model-meets-
enterprise-cio-needs/

cost savings from migrations in one direction or another. When looking
at these reports, more often than not there are two critical factors at play:

ARCHITECTURE CHANGES

The case study often involves a drastic shift in architecture
that allowed optimizing usage. Classic examples are data
warehouse analytics that are used sporadically but require
lots of compute power. Running a constant estate of servers
for such usage can indeed cost you 10 or 100 times more
than a managed pay-per-use data analytics service. Many on-
premises Hadoop environments have gone this route as long
as the data can be reasonably transferred to and stored in the
cloud. Interestingly, the choice of cloud provider is a distant
second factor to the change in model.

PRICING MODEL DIFFERENCES

Different cloud providers have different pricing models for
some of their services. Again, these are most noticeable in
fully managed services like data warehouses. Some providers
charge for such a service per query (by compute power used
and amount of data processed), others for the compute infras-
tructure (by size and time). Most charge you for storage on
top of the compute charges. To make things more interesting,
in some cases compute power and storage size are interlinked,
so as your data set grows, you’ll need more compute nodes.
Which model will be cheaper? The answer is a resounding
“it depends!”. If you run a lot of queries in a short time
interval, e.g. for your month-end analysis, being charged for
compute/storage likely comes out better because you pay for
the nodes only for a limited time and can utilize them heavily
before tearing them down (shown on the left side of the
diagram below). However, if you run queries intermittently
over time, paying per query likely comes out cheaper (shown
on the right) because you would otherwise have to keep the
infrastructure in place the whole time - setting it up for a
single query would be highly impractical.

Batched usage (left) vs. sporadic usage (right) favor different pricing models
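
To see how the two models compare, consider a deliberately simplified calculation (illustrative price points only; real pricing has more dimensions such as storage and data transfer):

```python
per_query_cost = 5.00          # USD per query in a pay-per-query model (hypothetical)
cluster_cost_per_hour = 20.00  # USD per hour for a provisioned cluster (hypothetical)

def pay_per_query(queries: int) -> float:
    return queries * per_query_cost

def provisioned(hours_running: float) -> float:
    return hours_running * cluster_cost_per_hour

# Batched usage: 400 queries crammed into a 10-hour month-end window
print(pay_per_query(400), provisioned(10))    # 2000.0 vs 200.0 - the cluster wins

# Sporadic usage: 30 queries spread across the month, cluster stays up 730 hours
print(pay_per_query(30), provisioned(730))    # 150.0 vs 14600.0 - pay-per-query wins
```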

Additionally, if your data is largely static, you can save a lot in the pay-
per-query model by creating interim results tables that hold aggregated
results. Once again it shows that knowing your needs is the most impor-
tant ingredient into reducing cost.
Sounds complicated? Yes, it can be. But such is the double-edged sword of
cost control (and transparency): you have much more control over pricing
but need to work to realize those savings. Once again, there is no magic
bullet. The good news is that in our example, both models likely will result
in a bill that’s much lower than setting up a traditional data warehouse
that’s idle 99% of the time.

Knowing your needs and usage patterns is the most impor-
tant ingredient into reducing cost.

When comparing prices, don’t expect huge differences in pricing between


providers - open market competition is doing its job just fine. You’ll gain if
a particular model fits your situation and usage pattern particularly well.
Always take a critical look at the case studies that claim “moving from
provider X to Y reduced our cost by 90%.” Most of the time there’s an
architecture change involved or the use case was a particularly poor fit
for the pricing model. It’s a bit like “I traded my Mercedes for a BMW and
also decided to ride my bike to work. Now I use 80% less gas.” Architects
look for causality².

When reviewing case studies related to cost savings, make
sure to look for causality. Often it’s not just the product
change that led to the savings.
²https://architectelevator.com/architecture/architects-causality

Many teams will start to compare prices by comparing unit prices for basic
services such as compute nodes. This is likely a sign that you’re stuck in
a traditional data-center oriented frame of mind. In cloud, your biggest
levers are sizing and resource lifetime. Data often has the longest lifespan,
so there it makes sense to look at cost per unit.

Comparing unit cost charts for compute nodes can be a sign
that you’re looking at cloud cost from an outdated data-
center mindset.

Luckily, history has shown that the competition between the “big three”
generally works in your favor: most cloud providers have reduced prices
many times.

Doing Nothing
Cynics might note that some cloud providers are rumored to be incurring
significant losses, meaning that you in fact get more than you pay for,
at least for as long as the investors keep bankrolling the operation. There’s
another, slightly less cynical cost savings strategy that most are too shy
to mention: doing nothing (after migrating to the cloud, that is). Not only
does compute power become cheaper all the time, but the cloud providers
also deal with depreciation and hardware refresh cycles for you, letting
you enjoy a continuous negative price inflation. So, you’ll save by doing
nothing - perhaps some things in the cloud are a kind of magic.

Premature Optimization
Premature optimization is widely considered the root of much evil in
computer science. So, when is the time mature to optimize cost? In the end,
it’s a simple economic equation: how much savings can you realize per
engineer hour invested? If your overall cloud run rate is $1000 / month it’s
unlikely that you’ll be getting a huge return on investing many engineer
hours into optimization. If it’s $100,000 / month, the situation is likely
different. In any case, it’s good to establish a culture that makes cost a
first-class citizen.

Optimizing Globally
Modern frameworks can unexpectedly consume compute power and drive
up cost, for example by requiring five nodes for leader election, a few more
as control plane nodes, another two as logging aggregators, an inbound
and an outbound proxy, etc. Suddenly, your simple demo app consumes
10 VMs and your monthly bill runs into the hundreds of dollars.
Such frameworks give you productivity gains in return, and servers are
cheaper than developers. Higher productivity might also give you an edge
in the market, which will outweigh most potential savings. So, once again
it’s about optimizing globally, not locally. Transparency is your best friend
in this exercise.

Cost is More than Dollars and Cents


So, once again, there are many factors to be considered, even when it
comes to cost. It also means that moving to the cloud is a good occasion to spend more time
with your CFO. More on that in the next chapter.

Savings Do’s and Don’ts

DON’T…

• Justify your cloud migration based on cost alone.
• Expect miracle cost reductions just by lifting and shifting to the
cloud.
• Just compare unit prices for VMs.
• Attribute cost savings to a change in product or vendor alone.
• Over-obsess with cost. Place your resources where the return is
highest.

DO…

• Understand pricing models and how they fit your usage patterns.
• Use automation to reduce cost.
• Use transparency to understand and manage cost.
• Spend time understanding your own usage patterns.
• Read price reduction case studies carefully to see what impacted the
price and whether that applies to you.
28. It’s Time To Increase
Your “Run” Budget
Cloud challenges budgeting misconceptions.

An IT organization’s health is often measured by what percentage of the


available budget is spent on operations (“run”) vs. projects (“change”).
Most CIOs look to minimize spending on “keeping the lights on” because
that’d mean more money is available to deliver projects (and value) to the
business. Also, the change budget is closely tied to financial models that
allow deferring some of the cost that’s been incurred. As you might have
guessed, a cloud operating model that blurs the line between application
delivery and operations can put a wrinkle into that strategy.

Run vs. Change


Much of internal IT is managed as a cost center. It’s no surprise then that
money spent is a key performance indicator for CIOs. While a reduction
in overall spend is generally welcome, a second major metric is comparing
the money spent on operations, often referred to as “run” (as in running the
business), with the money spent on project implementations, sometimes
referred to as “change” or “build” budget.

Classic IT looks to minimize the run budget


It’s Time To Increase Your “Run” Budget 257

A CIO who burns too much money just “keeping the lights on” will have
a difficult time meeting the demand from the business, which needs projects
to be completed. Therefore, almost every CIO’s agenda includes reducing
the run budget in favor of the project budget. This pressure only builds in
times when the business has to compete with "digital disruptors", which
places additional demands on IT to deliver more value to the business.
Some analysts have therefore started to rephrase “Run vs. Change” as Run-
Grow-Transform¹ to highlight that change also incurs expense.
There are many ways to reduce operational costs, including optimizing
server sizing, decommissioning data centers, and of course moving to the
cloud. Somewhat ironically, though, the move to the cloud shifts capital
investment towards consumption-based billing, which is traditionally
associated with run costs. It can also highlight operational costs that were
buried in project budgets.
A real story about building an internal cloud platform and recovering its
cost illustrates some of these challenges.

Billing for Shared Platforms


Around 2014 my team and I started to build a private cloud platform for
our employer, a major global financial service company. As chief architect
I had set out to solve what I considered the biggest problem in our IT
organization at the time: it took us far too long to deploy software. Being
convinced that if you can’t deploy software, most other problems become
secondary, we set out to correct this state of affairs. We hence deployed a
private cloud platform that included a fully automated software delivery
tool chain, an application run-time, and application monitoring.
Being enormously proud of what we built and finding strong demand
inside the organization, we were surprised to hit a snag with our charging
model. As is common in enterprise IT, we intended to recover the money
we spent on hardware, licenses, and project cost by charging our internal
customers in the business units. We therefore had to define a charging
model, i.e., how we would determine how much each of our customers
would have to pay. Having built an elastic platform that allows customers
to increase and decrease their usage at will, billing based on consumed
¹https://www.gartner.com/smarterwithgartner/align-it-functions-with-business-
strategy-using-the-run-grow-transform-model/
It’s Time To Increase Your “Run” Budget 258

capacity seemed natural. As main memory was the limiting element in


our infrastructure (we had 3 TB, but had to serve many memory hungry
applications), we decided to charge by memory allocated per month. So,
if you’d allocate 1 GB for your application, you’d pay twice as much as
if you could make do with 512MB. We also charged for the containerized
build tool chain by the capacity it used and the usage of our on-premises
GitHub repository by seat.
While usage-based billing has become the “new normal” these days,
charging for a tool chain and other application development services by
usage was fairly novel back then. But, despite our economies of scale
allowing us to provide such services at a significantly lower cost than each
project developing a one-off implementation, our pricing model turned
out to be a hurdle for many projects.

False Assumptions About Cost


As I discussed in 37 Things, organizations operate based on existing
assumptions that are rarely documented anywhere. When you develop
a new solution that breaks these assumptions, you may find unexpected
resistance. That’s what had happened to us. Specifically, we stumbled
upon two assumptions regarding cost:

Invisible Cost = No Cost


Cost accounting generally breaks spend down by line item.
A project budget is split into many line items like personnel
cost, software licenses, hardware, team dinners, consulting
services and many more. The dangerous assumption built
into this model is that any cost that doesn't explicitly show
up as a line item must be zero.
So, where would the money a project spent on building oper-
ational tools such as a one-off CI/CD pipeline show up? It’d
generally be rolled up into the same line items as the software
development cost, i.e. personnel cost, hardware etc. However,
that line item is assumed to be delivering functionality that
provides value to the business. Hence, the time and money
spent on setting up project tooling isn’t really visible - once
It’s Time To Increase Your “Run” Budget 259

a project was approved based on a positive business case, no


one would look too deeply inside to see how many devel-
oper hours were spent on building delivery infrastructure as
opposed to delivering business functionality.
Because a project cost breakdown by features vs. building and
maintaining tooling is uncommon, projects were rarely asked
about the cost of the latter. In contrast, our shared platform,
which provided tooling as a service, billed projects explicitly
for this expense every month. Usage of the CI/CD pipeline,
source repository, etc. now showed up quite visibly as “in-
ternal IT services” on the project budget. Some of the project
managers were surprised to see that their project spent $500
a month on delivery infrastructure notwithstanding the fact
that the 10 or 20 person days they would have spent building
and maintaining it would have cost them a multiple. So,
these project managers had never worried about the cost
of software delivery tooling because it had never shown up
anywhere.

Lowering costs, but making them visible in the process, can
be perceived as “costing more” by your internal customers.

Another version of this fallacy was developers claiming that
running their own source code repository would cost them
nothing. They used the open source version, so they had no
license cost. When asked who maintains the installation, they
pointed out that “an engineer does it in her spare time.” At
that point you should become slightly suspicious. Probing
further, we found that the software ran on a “spare” server
that obviously cost nothing. Backups weren’t really done, so
storage cost was also “pretty much zero”. The real answer
is that of course all these items have associated costs; they
are just assumed to not exist because they don’t show up
anywhere. When your software engineer spends half a day
with a Jenkins version upgrade or an operational issue, you’re
simply missing half a person day from delivering value to the
business.
It’s Time To Increase Your “Run” Budget 260

Recurring Cost = Operations


The second assumption that we uncovered was that traditional organizations
associate any recurring cost with operations. So, if they pay a monthly
fee for version control and the build tool chain, they lump that into
their operational cost as opposed to software development cost, where it
would wind up if they had built it themselves. When these items are rolled
up, ultimately to the CIO, it would suggest that the project spend on
operations increased - an undesirable trend. Because there was no
distinction between recurring costs that are needed to operate the
software and recurring costs for software development, project managers
preferred to build their own tooling as opposed to using a shared service,
despite the clearly better economics of the latter, especially if you
account for the developers' opportunity cost.

Cloud Cost Blurs the Line


Enterprises moving to the cloud can encounter similar challenges as the
scope of cloud platforms is rapidly expanding to include not just
infrastructure services, such as virtual machines, but also software
development tooling, such as AWS CodeBuild and CodeDeploy, Google Cloud
Build and Cloud Source Repositories, and many others. Just like our shared
software delivery platform, they can appear to increase total "run" cost
because you now see a line item for cost that might have been buried in
project budgets up to now. Rolled up, the consumption-based billing model
might suggest an increase in the portion of the IT budget that is spent on
operations.

Spending a larger portion of your IT budget on items that would
traditionally be considered "operational" is to be expected with the
transition to a cloud operating model and is a good thing.

A "serverless" model, which makes software deployment seamless, totally
blurs the line between deployment tools and application run-time, showing
that the strict separation between build and run might no longer make much
sense in modern IT organizations. The same is true for a DevOps mindset
that favors "you build it, you run it", meaning project teams are also
responsible for operations. However, as we saw, such a way of working can
run up against existing accounting practices.

A Mini Tour of Accounting


Most software developers and architects don't bother looking into the
details of project budgeting and financials. However, when you are
building a shared platform or are supporting a major cloud transformation,
having a basic understanding of financial management can help. The key
constructs that might help you better understand budgeting decisions are
depreciation and capitalization.
Depreciation allocates a capital expense over a period of time. A
classic example is a manufacturing tool, which might cost a significant
amount of money but lasts a decade. Although the tool investment is a
business expense, and hence tax deductible, most tax authorities won’t
allow the business to claim the full price of the machine in the year it
was purchased. Rather, the business can claim a fraction of the cost each
year over the life span of the tool to better associate the expense with the
ongoing profit generated from the upfront investment. This admittedly
simplified example also explains why a business’s profit is distinct from
its cash flow. You might have spent more than you earned to buy the
machine, meaning your cash flow is negative, but you’re posting a profit
because the machine cost is deducted in fractions over many years.
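To make the mechanics concrete, here is a minimal sketch of straight-line
depreciation, the simplest such scheme; the $100,000 machine and the
10-year useful life are illustrative numbers, not taken from any specific
tax code:

```python
# Minimal sketch of straight-line depreciation (illustrative numbers).
# A $100,000 machine with a 10-year useful life is expensed at $10,000 per
# year, which is why profit and cash flow diverge in the purchase year.

def straight_line_depreciation(cost, useful_life_years):
    """Return the expense recognized in each year of the asset's life."""
    return [cost / useful_life_years] * useful_life_years

machine_cost = 100_000
schedule = straight_line_depreciation(machine_cost, 10)

cash_out_year_1 = machine_cost   # cash flow impact: the full price, upfront
expense_year_1 = schedule[0]     # profit impact: only one slice per year
print(f"Year 1 cash out: ${cash_out_year_1:,}, year 1 expense: ${expense_year_1:,.0f}")
```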
IT also makes investments that provide value to the business for an
extended period of time, much like buying a manufacturing tool: it builds
business applications. Because many of these applications carry a large
upfront investment but can be in operation for decades, it appears logical
that the cost of developing them should be allocated proportionally over
the asset’s useful life. Such expenses are called capitalized expenses².
This capitalization mechanism helps explain project managers' love for
project costs. Despite spending the cash for developers right now, only a
fraction of that cost is allocated to the current fiscal year. The rest is
spread over future years, making the cost appear much smaller than the
actual spend. Anything that can be bundled into the project cost benefits
from this effect. So, the half day that the software developer spent fixing
the CI/CD pipeline might show up as only $100 in the current fiscal year
because the remaining $400 is pushed to subsequent years. A shared service
will have a difficult time competing with that accounting magic!
²https://www.investopedia.com/terms/c/capitalization.asp
The astute reader might have noticed that pushing expenses into the future
makes you look better in the current year, but burdens future years with
cost from prior investments. That’s absolutely correct - I have seen IT
organizations where past capitalization ate most of the current year’s
budget - the corporate equivalent of having lived off your credit card
for too long. However, project managers often have short planning and
reward horizons, so this isn’t as disconcerting to them as it might be for
you.

Cloud Accounting
The apparent conflict between a novel way of utilizing IT resources
and existing accounting methods has not escaped the attention of bodies
like the IFRS Foundation³, which publishes the International Financial
Reporting Standards, similar to GAAP in the US. While the IFRS doesn't
yet publish rules specific to cloud computing, it does include language
on intangible assets that a customer controls, which can apply to cloud
resources. The guidance is, for example, that implementation costs related
to such intangible assets can be capitalized. EY published an in-depth
paper on the topic⁴ in July 2020. If you think reading financial guidance
is a stretch for a cloud architect, consider all the executives who have to
listen to IT pitching Kubernetes to them.

Spending as Success Metric


The cost accounting for service consumption hints at an interesting
phenomenon. If you apply a new operational model to the traditional way of
thinking, it might actually appear worse than the old model. As a result,
people stuck in the old way of thinking will often fail to see the new
model's huge advantages.
³https://www.ifrs.org/about-us/who-we-are/
⁴https://assets.ey.com/content/dam/ey-sites/ey-com/en_gl/topics/ifrs/ey-applying-ifrs-cloud-computing-costs-july-2020.pdf
However, a new model can also completely change the business's view on
IT spending. Traditionally, hardly any business would be able to determine
how much operating a specific software feature costs them. Hence, they’d
be unable to assess whether the feature is profitable, i.e. whether the IT
spending stands in any relation to the business value generated.
Cloud services, especially fine-grained serverless runtimes, change all this.
Serverless platforms allow you to track the operational cost associated
with each IT element down to individual functions, giving you fine-
grained transparency and control over the Return-on-Investment (ROI)
of your IT assets. For example, you can see how much it costs you to sign
up a customer, to hold a potentially abandoned shopping cart, to send a
reminder e-mail, or to process a payment. Mapping these unit costs against
the business value they generate allows you to determine the profitability
of individual functions. If you conclude that customer sign-ups are
profitable per unit, you should be excited to see IT spend on sign-ups
increase!
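As a back-of-the-envelope illustration, such a per-function view could be
assembled from the provider's cost reports; the function names, costs, and
values below are all hypothetical:

```python
# Hypothetical per-function unit economics: compare what a business event
# costs in serverless charges against the value it generates.

functions = {
    # name: (cloud cost per invocation in $, business value per invocation in $)
    "sign_up_customer":     (0.0004, 0.25),
    "abandoned_cart_timer": (0.0001, 0.00),
    "send_reminder_email":  (0.0002, 0.03),
    "process_payment":      (0.0010, 0.80),
}

for name, (unit_cost, unit_value) in functions.items():
    margin = unit_value - unit_cost
    verdict = "profitable - more spend is good news" if margin > 0 else "review"
    print(f"{name:22s} cost ${unit_cost:.4f}  value ${unit_value:.2f}  -> {verdict}")
```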

With fine-grained cost transparency you might be excited to see IT
spending on profitable functions go up.

Fine-grained financial transparency is another example of the great
benefits that await those organizations that are able to adjust their
operating model to match the cloud's capabilities. It can even turn your
view of IT spending upside down.

Changing the Model


The shared platform example illustrates how IT organizations that are
stuck in an old way of thinking will be less likely to reap the benefits
of cloud platforms. Not only do their existing structures, beliefs, and
vocabulary stand in the way of adoption, but they will also miss the major
benefits of moving to a cloud operating model.
In the case of "change" vs. "run", once IT realizes that it shouldn't run
software that it didn't build in the first place, the obsession with
reducing run cost might just go away someday.
It’s Time To Increase Your “Run” Budget 264

Budget Do’s and Don’ts

DON’T…

• Assume cost that doesn't show up explicitly in a line item is zero.
• Blame a shared services team or the cloud provider for making these
costs visible.
• Hide tooling and project setup costs in the project budget.
• Measure a new operational model in an old cost model.

DO…

• Break out the project cost incurred for building tooling.
• Utilize shared platforms to avoid duplicating tooling infrastructure.
• Anticipate push-back from projects against a usage-based pricing model.
• Use the consumption-based pricing model of the cloud to compare marginal
value against marginal cost.
29. Automation isn’t About
Efficiency
Speeding up is more than going faster.

Most corporate IT was born and raised automating the business:
IT took something that was done by pen and paper and automated it. What
started mostly with mainframes and spreadsheets (VisiCalc¹ turned 40 in
2019) soon became the backbone of the modern enterprise. IT automation
paid for itself largely by saving manual labor, much like automation of
a car assembly line. So when we speak about automation these days, it’s
only natural to assume that we’re looking to further increase efficiency.
However, cloud also challenges this assumption.

Industrializing Software Delivery


Within IT, software development is a relatively costly and largely manual
process. It’s no surprise then that IT management’s quest for efficiency
through automation also came across the idea of streamlining its own
software development processes. Much of that effort originally focused on
the specification and coding parts of software delivery, i.e. the translation
from ideas into code. Many frameworks and methodologies looking to
industrialize this aspect of software development, such as CASE tools,
4GL², and executable UML, arrived and disappeared again over the course
of the 1990s and 2000s.

During the mid-1990s, I delivered significant business systems in
PowerBuilder, a fourth-generation language (4GL) that combined UI design
and coding in a single environment. Occasionally I still get spam mail
recruiting PowerBuilder developers.
¹https://en.wikipedia.org/wiki/VisiCalc
²https://en.wikipedia.org/wiki/Fourth-generation_programming_language
Automation isn’t About Efficiency 266

Somehow, though, trying to automate the specification and coding steps of
software delivery never quite yielded the results that the tools' creators
were after. Collectively, we could have saved a lot of effort if we had
taken Jack Reeves' article What is Software Design³ to heart. In that
article, published in 1992, Reeves elaborated that coding is actually the
design of software whereas compiling and deploying software is the
manufacturing aspect. So, if you're looking to industrialize software
manufacturing, you should automate testing, compiling, and deployment, as
opposed to trying to industrialize coding. About a quarter century later
that's finally being done. Some things take time, even in software.

DevOps: The Shoemaker’s Children get


New Shoes
Ironically, while IT grew big automating the business, it didn’t pay much
attention to automating itself for about 50 years. As a result, software
builds and deployments, the “manufacturing” of software, ended up being
more art than science: pulling a patch in at the last minute, copying
countless files, and changing magic configurations.
The results were predictably unpredictable: every software deployment
became a high-wire act, and those who were so inclined said a little
prayer each time they embarked on it. A forgotten file, a missed
configuration, the wrong version of something or other - that was the
daily life of deployment "engineering" for a long time. What was missing
was an engineering approach to this set of tasks. If any more proof were
needed: e-mails congratulating the team on a successful software release
might be well intended, but they only serve as proof of an awfully low
success rate.

Corporate IT celebrating each major product release of its core system to
much applause from upper management is a clear sign of missing build and
deployment automation.

A slew of recent software innovations has set out to change this:
Continuous Integration (CI) automates software test and build, Continuous
Delivery (CD) automates deployment, and DevOps automates the whole thing,
including operational aspects such as infrastructure configuration,
scaling, resilience, etc. In combination, these tools and techniques take
the friction out of software delivery. They also allow teams to embrace a
whole new way of working.
³https://wiki.c2.com/?WhatIsSoftwareDesign

The New Value of Automation


With all this automation in place, some IT staff, especially in
operations, fear that it's now their turn to be made redundant through
technology. This fear is further stoked by terms like "NoOps", seemingly
implying that in a fully automated cloud operating model ops isn't
required anymore.
While there may be a tad of Schadenfreude coming from the business side,
which has lived with being made redundant through technology for decades,
reducing manual labor isn't actually the main driver behind this
automation. Automation in software delivery has a different set of goals:

Speed

Speed is the currency of the digital economy because it enables rapid and
inexpensive innovation. Automation makes you faster and thus helps the
business compete against disruptors who can often roll out a feature not
10% faster but 10x or even 100x faster.

Repeatability

Going fast is great, but not if it comes at the expense of quality.
Automation not only speeds things up, it also eliminates the main error
source in software deployment: humans. That's why you should Never Send a
Human to Do a Machine's Job (see 37 Things⁴). Automation takes the error
margin out of a process and makes it repeatable.
⁴https://leanpub.com/37things
Automation isn’t About Efficiency 268

Confidence

Repeatability breeds confidence. If your development team is afraid of the
next software release, it won't be fast: fear causes hesitation and
hesitation slows you down. If deployment followed the same automated
process the last 100 times, the team's confidence in the next release is
high.

Resilience

It’s easily forgotten that a new feature release isn’t the only
time you deploy software. In case of a hardware failure, you
need to deploy software quickly to a new set of machines.
With highly automated software delivery, this becomes a
matter of seconds or minutes - often before the first incident
call can be set up. Deployment automation increases uptime
and resilience.

Transparency

Automated processes (in IT deployment or otherwise) give you better
insight into what happened when, for what reason, and how long it took.
This level of transparency allows you, for example, to find bottlenecks or
common failure points.

Reduced Machine Cost

When using IaaS, i.e. deploying software on virtual machines, higher
levels of automation allow you to take better advantage of preemptible or
spot instances - instances that are available at discounted prices but
that can be reclaimed by the provider on short notice. If such a reclaim
happens, automation allows you to re-deploy your workloads on new
instances or to snapshot the state of the current work and resume it later
(a minimal sketch of detecting such a reclaim follows this list). Such
instances are often available at discounts of 60 to 80% below the
on-demand price.

Continuous improvement / refinement

Having a repeatable and transparent process is the base condition for
continuous improvement: you can see what works well and what doesn't.
Often you can improve automated processes without having to re-train
people performing manual tasks, which could cause additional quality
issues due to re-learning.

Compliance

Traditional governance is based on defining manual processes and periodic
checks. It goes without saying that what governance prescribes and what
actually happens tend to differ a good bit. Automation enables guaranteed
compliance - once the automation script is defined in accordance with the
governance, you're guaranteed to have compliance every time.
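As referenced under Reduced Machine Cost above, a minimal sketch of
detecting a spot reclaim might poll the interruption notice that AWS EC2
exposes through instance metadata; the checkpoint routine is a
placeholder, the sketch assumes IMDSv1-style metadata access, and other
providers offer similar signals:

```python
# Sketch: poll EC2's instance metadata for a spot interruption notice and
# trigger a (placeholder) checkpoint/drain routine so work can resume on a
# replacement instance. Assumes IMDSv1-style access; IMDSv2 requires a token.
import json
import time
import urllib.error
import urllib.request

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_notice():
    """Return the interruption notice as a dict, or None if none is scheduled."""
    try:
        with urllib.request.urlopen(METADATA_URL, timeout=1) as response:
            return json.load(response)
    except urllib.error.URLError:   # 404 (or unreachable metadata) = no notice
        return None

def checkpoint_and_drain():
    # Placeholder: snapshot in-flight work, deregister from the load balancer, etc.
    print("Checkpointing state and draining connections...")

while True:
    notice = interruption_notice()
    if notice:
        print(f"Spot interruption '{notice.get('action')}' scheduled for {notice.get('time')}")
        checkpoint_and_drain()
        break
    time.sleep(5)
```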

Cloud Ops
Because moving to the cloud places a significant part of system operations
in the hands of the cloud provider, especially with fully managed SaaS
services, there's often a connotation that cloud requires little to no
operational effort, occasionally reflected in the moniker "NoOps".
Although cloud computing brings a new operational model that's based on
higher levels of automation and on outsourcing IT operations, today's push
to automate IT infrastructure isn't driven by the desire to eliminate jobs
and increase efficiency. Operations teams therefore should embrace, not
fear, automation of software delivery.

Cloud ops isn’t “NoOps” - cloud-scale applications have


enormous operational demands.

I am not particularly fond of the "NoOps" label - I regularly remind
people that cloud-scale applications have higher operational demands than
many traditional applications. For example, modern applications operate
around the globe, scale instantly, receive frequent updates without
visible downtime, and can self-heal if something goes wrong. These are all
operational concerns! Cloud also brings other new types of operational
concerns, such as cost management. So, I have no worry that ops might run
out of valuable things to do. If anything, the value proposition of
well-run IT operations goes up in an environment that demands high rates
of change and instant global scalability.

Speed = Resilience
Once you speed up your software delivery and operations through
automation, it does more than just make you go faster: it allows you to
work in an entirely different fashion. For example, many enterprises are
stuck in the apparent conflict between cost and system availability: the
traditional way to make systems more available was to add more hardware to
better distribute load and to allow failover to a standby system in case
of a critical failure.
Once you have a high degree of automation and can deploy new applications
within minutes on an elastic cloud infrastructure, you often don't need
that extra hardware sitting around - after all, the most expensive server
is the one that's not doing anything. Instead, you quickly provision new
hardware and deploy software when the need arises. This way you achieve
increased resilience and system availability at a lower cost, thanks to
automation!
Automation isn’t About Efficiency 271

Automation Do’s and Don’ts

DON’T…

• Celebrate every release. Releases should be normal.
• Equate automation with efficiency.
• See moving faster as the only benefit of speeding things up.
• Fear that cloud means not needing operations.

DO…

• Apply software to solve the problem of delivering software, not the
design of software.
• Automate to improve availability, compliance, and quality.
• Understand that automation boosts confidence and reduces stress.
30. Beware the Supermarket Effect!
I went to get milk and paid 50 bucks.

Have you ever been to the supermarket when you really only needed milk,
but then picked up a few small items along the way, only to find that your
bill at checkout is $50? To make matters worse, because you really didn't
buy anything for more than $1.99, you're not sure what to take out of the
basket because it won't make a big difference anyway. If you have, then
you're emotionally prepared for what could happen to you on your journey
to the cloud.

Computing for Mere Pennies


When enterprises look at vendors' pricing sheets, they are used to
counting zeros - enterprise software and hardware isn't cheap. When the
same enterprises look at cloud vendors' pricing sheets, they also count
zeros. But this time it's not the ones before the decimal point, but the
ones after it! An API request costs something like $0.00002 while an hour
of compute capacity on a small server can be had for a dime or less. So,
cloud computing should just be a matter of pennies, right? Not quite…
a month has 720 hours and many systems make millions or sometimes billions
of API calls. And although we routinely talk about hyperscalers and their
massive fleets of hardware, the average enterprise still uses a lot more
compute capacity than many folks might imagine. With IT budgets in the
billions of dollars, we shouldn't be surprised to see cloud usage in the
tens of millions for large enterprises.
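A quick back-of-the-envelope calculation shows how those pennies compound;
the prices mirror the ballpark figures above, while the fleet size of 500
small VMs is an illustrative assumption:

```python
# Back-of-the-envelope math (illustrative prices): tiny unit prices still add
# up once you multiply by hours per month and by API call volume.

hours_per_month = 720
small_vm_per_hour = 0.10           # "a dime or less"
api_call_price = 0.00002           # $0.00002 per request
monthly_api_calls = 1_000_000_000  # a billion calls

vm_cost = hours_per_month * small_vm_per_hour   # $72 per instance per month
api_cost = monthly_api_calls * api_call_price   # $20,000 per month
fleet_cost = 500 * vm_cost                      # a modest 500-VM estate: $36,000

print(f"One small VM: ${vm_cost:,.2f}/month")
print(f"1B API calls: ${api_cost:,.2f}/month")
print(f"500 small VMs: ${fleet_cost:,.2f}/month")
```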

The Supermarket Effect


Enterprise IT is used to spending money in large chunks, similar to having
dinner at a restaurant: meals are $20 or, if you're at a fancy place, $40.
Even though it isn't cheap, the pricing model makes it relatively easy to
guess how much the total bill will be (depending on your locale you might
get tricked with a mandatory gratuity, state and local tax, health care
cost recovery charges, resort charge, credit card fee, and all kinds of
other creations that share the sole goal of extracting more money from
your wallet).
Cloud computing is more like going to the grocery store: you'll have many
more line items, each at a relatively low price. So, if you exercise good
discipline, it can be significantly cheaper. However, it's also more
difficult to predict the total cost.

I routinely play the game of "guess the bill" at the supermarket register
and although I am getting better, I usually end up below the actual
amount. Ah, yes, I did grab that $4.99 cheese and a second bottle of
beer…

So, when enterprises first embark on the cloud journey they need to get
used to a new pricing model that includes more dimensions of charges,
such as compute, storage, reserved capacity, network, API calls, etc. They
also need to get used to seemingly minuscule amounts adding up into non-
trivial cost. Or, as the Germans say “Kleinvieh macht auch Mist” - even
small animals produce manure.

Cost Control
The cloud operating model gives developers previously unheard-of levels of
control over IT spend. Developers incur cost both via their applications
(for example, a chatty application makes more API calls) and via machine
sizing and provisioning. Cloud providers generally charge identical
amounts for production and development machine instances, so test and
build farms plus individual developer machines quickly add up, despite
them likely being much smaller instances.
Although most enterprises put billing alerts and limits in place, with
that power over spending also comes additional responsibility. Being a
responsible developer starts with shutting down unused instances,
selecting conservative machine sizes, purging old data, and considering
more cost-efficient alternatives. Done well, developers' control over cost
leads to a tighter and therefore better feedback cycle. Done poorly, it
can lead to runaway spending.

Cost Out of Control


Despite proper planning and careful execution, most enterprises seriously
moving to the cloud are nevertheless likely to experience unpleasant
surprises on their billing statements. Such surprises can range from minor
annoyances to tens of thousands of dollars of extra charges. The reasons
might be considered obvious in 20/20 hindsight, but I have seen them
happen to many organizations, including those well versed in cloud
computing.
In my experience, a handful of scenarios trigger the majority of these
unwelcome spikes in charges:

Self-inflicted Load Spikes

Many managed cloud services feature auto-scaling options. Generally,
auto-scaling is a great feature that enables your on-line system to handle
sudden load spikes without manual intervention. However, it's not quite as
great if you accidentally cause a massive load spike that then hits your
cloud bill fairly hard. Although the natural question is why anyone would
do that, it's a relatively common occurrence. Whereas load tests are a
classic, albeit planned, scenario, there's another common culprit of
unintentional spikes: batch data loads.

A fairly sophisticated organization that I worked with employed a very
nice message-oriented architecture (my soft spot) to propagate, transform,
and augment data between a data source and a cloud-managed data store. One
day, though, a complete re-sync was required and conducted via a batch
load, which pumped the whole dataset into the serverless pipeline and the
data store. Both scaled so flawlessly that the job was done before long.
The serverless cost per transaction was moderate, but the substantial load
spike in records per second triggered additional capacity units to be
instantiated by the managed database service. It had racked up a
five-figure bill by the time it was detected.

So, when performing actions that differ from the regular load patterns,
it's important to understand the impact they might have on the various
dimensions of the pricing model. It might be helpful to either throttle
down bulk loads or to observe, and quickly undo, any auto-scaling after
the load is finished.

Infinite Loops

Another classic cost anti-pattern is serverless or other auto-scaling
solutions that have accidental feedback loops, meaning one request
triggers another request, which in turn triggers the original request
again. Such a loop will cause the system to quickly escalate - on your
dime. Ironically, fast-scaling systems, such as serverless compute, are
more susceptible to such occurrences. It's the cloud equivalent of a stack
overflow.
The problem of uncontrolled feedback loops itself isn't new. Loosely
coupled and event-driven systems like serverless applications are highly
configurable but also prone to unforeseen system effects, such as
escalations. Additional composition-level monitoring and alerting is
therefore important. In a blog post from 2007¹, I describe building simple
validators that warn you of unintended system structures such as cycles in
the component dependency graph.
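The validators from that post aren't reproduced here, but the core idea
can be sketched as a simple cycle check over a declared dependency graph;
the component names below are made up:

```python
# Minimal validator sketch: detect cycles in a component dependency graph
# before they turn into runaway feedback loops (component names made up).

def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None."""
    visiting, visited = set(), set()

    def dfs(node, path):
        visiting.add(node)
        path.append(node)
        for neighbor in graph.get(node, []):
            if neighbor in visiting:                    # back edge -> cycle
                return path[path.index(neighbor):] + [neighbor]
            if neighbor not in visited:
                cycle = dfs(neighbor, path)
                if cycle:
                    return cycle
        visiting.discard(node)
        visited.add(node)
        path.pop()
        return None

    for node in list(graph):
        if node not in visited:
            cycle = dfs(node, [])
            if cycle:
                return cycle
    return None

# "order_events" triggers "billing", which ends up triggering "order_events" again.
dependencies = {
    "order_events": ["billing"],
    "billing": ["notifications"],
    "notifications": ["order_events"],
}
print(find_cycle(dependencies))  # ['order_events', 'billing', 'notifications', 'order_events']
```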

Orphans

The next category is less dramatic, but also sneaks up on you. Developers
tend to try a lot of things out. In the process they create new instances
or load data sets into one of the many data stores. Occasionally, these
instances are subsequently forgotten unless they reach a volume where a
billing alert hints that something is awry.

I have heard many stories of developers spinning up sizable test
infrastructure just before they head on vacation. Returning to work having
to explain to your boss why your infrastructure kept working so hard while
you were away might unnecessarily add to the usual post-vacation Case of
the Mondays².
¹https://www.enterpriseintegrationpatterns.com/ramblings/48_validation.html
²https://www.moviequotedb.com/movies/office-space.html

With infrastructure being software-defined these days, it looks like some
of the hard-learned lessons from continuous integration can also apply to
cloud infrastructure. For example, you might institute a "no major
provisioning on Friday" rule - a Google image search for "don't deploy on
Friday" yields many humorous memes about the origin of this advice. Also,
when rigging up "temporary" infrastructure, think about the deletion right
when you (hopefully) script the creation. Or, better yet, schedule the
deletion right away.
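One way to schedule the deletion is to tag temporary resources with an
expiry and let a scheduled job reap them. The following sketch uses boto3
against EC2; the expires-at tag convention is an assumption rather than a
provider feature, and pagination and error handling are omitted:

```python
# Sketch: terminate EC2 instances whose (made-up) "expires-at" tag lies in the
# past. Run it on a schedule so "temporary" infrastructure cleans itself up.
from datetime import datetime, timezone

import boto3

ec2 = boto3.client("ec2")

def expired_instance_ids(now=None):
    """Return IDs of instances whose expires-at tag is in the past."""
    now = now or datetime.now(timezone.utc)
    expired = []
    response = ec2.describe_instances(
        Filters=[{"Name": "tag-key", "Values": ["expires-at"]},
                 {"Name": "instance-state-name", "Values": ["running", "stopped"]}]
    )
    for reservation in response["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            expiry = datetime.fromisoformat(tags["expires-at"])  # e.g. "2020-08-31T18:00:00+00:00"
            if expiry <= now:
                expired.append(instance["InstanceId"])
    return expired

doomed = expired_instance_ids()
if doomed:
    ec2.terminate_instances(InstanceIds=doomed)
    print(f"Terminated: {doomed}")
```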

Shutting Down May Cost You

One additional line item is unlikely to hit your bill hard, but will
nevertheless surprise you. Some cloud pricing schemes include a dependent
service for free when deploying a certain type of service. For example,
some cloud providers include an elastic IP address with a virtual machine
instance. However, if you stop that instance, you will need to pay for the
elastic IP. Some folks who provisioned many very small instances are
surprised to find that after stopping them, the cost remains almost the
same due to the elastic IP charges.
In another example, many cloud providers will charge you egress bandwidth
cost if you connect from one server to another via its external IP
address, even if both servers are located in the same region. So,
accidentally looking up a server's public IP address instead of the
internal one will cost you.
It pays to have a closer look at the pricing fine print, but it'd be hard
to believe that you can predict all situations. Running a cost optimizer
is likely a good idea for any sizable deployment.

Being Prepared
While sudden line items in your bill aren't pleasant, it might bring some
relief to know that you're not the only one facing this problem. And in
this case it's more than just "misery loves company": a whole slew of
tools and techniques has been developed to help address this common
problem. Just blaming the cloud providers isn't quite fair, as your
unintentional usage actually did incur infrastructure cost for them.

Monitoring cost should be a standard element of any cloud operation, just
like monitoring for traffic and memory usage.

As is so often the case, the list of treatments isn't a huge surprise.
Having the discipline to actually apply them is likely the most important
message:

Set Billing Alerts
All cloud providers offer different levels of spending alerts and limits
(a minimal sketch of setting one up follows this list). You would not want
to set them too low - the main purpose of the cloud is to be able to scale
up when needed. But as development teams gain major influence over the
cost of their deployments, cost should become a standard element of
monitoring. Unfortunately, many enterprise cloud deployments detach
developers from billing and payments, or don't provide the necessary
detail, highlighting the dangers of building an Enterprise Non-cloud.
Run Regular Scans
Many cloud providers and third parties also offer regular scans of
your infrastructure to detect underutilized or abandoned resources.
Using them won’t do miracles but is good hygiene for your new IT
lifestyle. Cloud gives you much more transparency, so you should
use it.
Monitor Higher-level System Structures
For each automated feedback loop that you employ, for example
auto-scaling, you’re going to want another (automated or manual)
control mechanism that monitors the functioning of the automated
loop. That way you can detect infinite loops and other anomalies
caused by automation.
Manage a New Form of Error Budget
An additional approach is to budget for these kinds of mishaps
up-front - a special kind of “error budget” as practiced by Site
Reliability Engineering (SRE). You want to encourage developers to
build dynamic solutions that auto-scale, so the chances that you’re
going to see some surprises on your bill are high enough that it’s
wise to just allocate that money upfront. That way you’ll have less
stress when it actually occurs. Think of it as a special version of
Etsy’s Three-armed Sweater Award³. If you think setting money
³https://bits.blogs.nytimes.com/2012/07/18/one-on-one-chad-dickerson-ceo-of-etsy
Beware the Supermarket Effect! 278

aside for learning from mistakes is odd, be advised that there’s a


name for it: tuition.
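As mentioned under Set Billing Alerts, here is a minimal sketch of such an
alert using AWS Budgets via boto3; the account ID, budget amount, and
e-mail address are placeholders, and other cloud providers offer
comparable budget APIs:

```python
# Sketch: create a monthly cost budget with an 80%-of-limit e-mail alert using
# AWS Budgets (account ID, limit, and address are placeholders).
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",                      # placeholder account ID
    Budget={
        "BudgetName": "team-a-monthly-cloud-spend",
        "BudgetLimit": {"Amount": "5000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",          # alert on actual, not forecasted, spend
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80.0,                     # percent of the budget limit
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "team-a@example.com"}],
    }],
)
```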

Know Your Biggest Problem


While spend monitoring should be an integral part of cloud operations,
it’s likely not going to be your #1 problem for a while. Hence, overreacting
also isn’t a good idea. After all, the cloud is there to give developers more
freedom and more control, so trying to lock down spending is the quickest
way to arrive at an Enterprise Non-cloud.
Monitoring spending also ties up resources, so how much you might want
to invest in spend control largely depends on how much you’re spending
in the first place. While it’s wise to establish a responsible spending culture
from day 1 (or dollar 1), making a concerted effort likely only pays off once
you have a six-figure run-rate.

Checking Out
You might have observed how supermarkets place (usually overpriced)
items appealing to kids near the cashier’s desk, so that the kids can toss
something into the shopping cart without the parents noticing. Some IT
managers might feel they have had a similar experience. Cloud providers
routinely dangle shiny objects in front of developers, who are always eager
to try something new. And, look, they are free of charge for the initial trial!
I have previously likened developers to screaming children who chant "Mommy,
Mommy, I want my Kubernetes operators - all my friends have them!"
So, you might have to manage some expectations. On the upside, I give
the cloud providers a lot of credit for devising transparent and elastic
pricing models. Naturally, elasticity goes both ways, but if you manage
your spend wisely, you can achieve significant savings even if you find
the occasional chocolate bar on your receipt.

Cost Management Do’s and Don’ts

DON’T…

• Get fooled by unit prices of $0.0001.
• Invest too much time into cost optimization until the amount justifies
it.
• Rig up compute resources just before heading for vacation.
• Get sucked in by free trials. Use them to actually try things instead
of overcommitting.

DO…

• Consider cost management a routine operational task.
• Expect to be hit by unexpected cost due to mishaps.
• Consider setting aside a "tuition fund" that pays for the expensive
learning experiences.
• Check the system state, especially scale, after any unusual load
pattern.
• Consider deploying a cost management and optimization system.
Author Biography

Gregor Hohpe is an enterprise strategist with AWS. He advises CTOs and
technology leaders in the transformation of both their organization and
technology platform. Riding the Architect Elevator from the engine room
to the penthouse, he connects the corporate strategy with the technical
implementation and vice versa.
Gregor served as Smart Nation Fellow to the Singapore government, as
technical director at Google Cloud, and as Chief Architect at Allianz SE,
where he deployed the first private cloud software delivery platform. He
has experienced most every angle of the technology business ranging
from start-up to professional services and corporate IT to internet-scale
engineering.

Other Titles By This Author

The Software Architect Elevator, O’Reilly, 2020


Enterprise Integration Patterns, Addison-Wesley, 2003 (with Bobby Woolf)

Michele Danieli is the Head of Architecture Practice at Allianz Technology
Global Lines, leading globally distributed architecture teams building
platforms. He started his career in the engine room and sees architecture
and engineers as best friends. A good diagram and a mind map are his
essential tools and code is not a foe.

Jean-Francois Landreau leads the infrastructure team at Allianz Direct.
When SRE and DevOps shifted the collective excitement from software
development toward operations, he decided to follow along. He is a strong
believer that you can't make enlightened enterprise decisions if you are
too far away from the engine room.
