A complete beginner's guide to DevOps best practices

Matt Heusser, Excelon Development

You can't just 'do' DevOps and hope to get it right. Expert Matthew Heusser takes us through all the steps
required to make DevOps work for your company -- and make your life easier.

FROM THE ESSENTIAL GUIDE:


A DevOps primer: Start, improve and extend your DevOps teams

The term DevOps implies programmers working with operators and testers to automate things. What exactly to
automate, and where to start, can be overwhelming.

Today we take a look at how to get started with DevOps as an activity, along with actual technologies and
processes to adopt. The DevOps best practices path we suggest is based on improving feedback at every step:
faster builds, faster deploys, faster recovery, faster server builds and faster notification of errors.

Most of the examples below assume software designed for a web browser. People building Windows and native
mobile applications are probably not going to pursue continuous delivery, but they might still get value by
building, testing and deploying more often than the team does now.

Here are some DevOps best practices to keep in mind.

It all starts with mean time to recovery


The traditional measure for improvement was mean time between failures, or MTBF. Trying to fail less often is
a fine approach in general. Yet something happens when we try to increase uptime from, say, 99.9% to 99.99%.
Each extra "nine" adds only one-tenth as much uptime as the one before it, while achieving that extra nine can easily double the
project cost. At some point, the price increase for just one extra nine is not worth the investment.
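To make the diminishing return concrete, here is a minimal sketch of the arithmetic. The function name `allowed_downtime_minutes` and the 365-day year are assumptions for illustration; they do not come from any particular tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes per year a service may be down at the given uptime percentage."""
    return MINUTES_PER_YEAR * (100.0 - uptime_percent) / 100.0


# Each extra nine cuts the allowed downtime to one-tenth of the previous level.
for uptime in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(uptime)
    print(f"{uptime}% uptime allows about {minutes:.0f} minutes of downtime per year")
```

Under these assumptions, going from 99.9% to 99.99% buys back roughly eight hours of uptime per year, which is the amount to weigh against the extra project cost.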

DevOps teams tend to look at the price of uptime differently. Instead of trying to fail less often, they try to
recover more quickly. The algebra runs something like this:

Risk Exposure equals (number of users exposed to problem) multiplied by (how terrible the problem is).

The number of users exposed grows with the time a problem stays live, so if the team can identify and fix the problem in one-tenth the
time, they can have five times as many defects escape to production and still have less risk exposure. Better yet,
the team could apply this idea to every step of the development process -- finding bugs in requirements and
code quickly -- so they can be fixed cheaply.
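The trade-off can be checked with a few lines of Python. This is a sketch under stated assumptions: the number of users exposed is taken to grow linearly with time, and the function name, the users-per-hour rate, the severity score and the defect counts are all hypothetical.

```python
def risk_exposure(defects: int, hours_exposed: float,
                  users_per_hour: float, severity: float) -> float:
    """Risk exposure = (users exposed) x (how terrible the problem is), summed over defects.

    Assumption: users exposed = users_per_hour * hours_exposed (linear in time).
    """
    users_exposed = users_per_hour * hours_exposed
    return defects * users_exposed * severity


# Hypothetical baseline: 2 escaped defects, each live for 10 hours.
baseline = risk_exposure(defects=2, hours_exposed=10.0, users_per_hour=100.0, severity=1.0)

# Five times as many defects, each fixed in one-tenth the time.
faster = risk_exposure(defects=10, hours_exposed=1.0, users_per_hour=100.0, severity=1.0)

print(baseline, faster, faster / baseline)
```

With these made-up numbers, the faster team carries half the exposure of the baseline even though five times as many defects escaped.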

The pieces of a better mean time to recovery (MTTR) are typically the build, deploy, notice and notify, and fix
processes. Teams can be more or less advanced at each of these; one e-commerce project I worked on recently
could perform a new build in about 15 minutes, but took about four hours to roll a change to production. The
length of the entire loop from build to production was about five hours.

Build and verify a build


How long does it take to create a build, deploy it to a staging environment, check it for problems and mark it
ready for production? Running the build server is the easy part; getting the build to move to staging automatically can be a
problem. Many teams have legacy systems where a change in one place could have unforeseen consequences,
so they have a "regression test process" to find problems. With an automated regression check, it might take an
hour to bless a build; some human processes take weeks or months. In many cases, the team can see a massive
improvement by writing better code -- so there are fewer errors -- while switching to a more effective method of
human testing, such as sampling adjusted for risk.

Deploy to production

Once the deploy decision is made, how long does it take to actually get into production? Some legacy systems
may have hard requirements for this -- systems need to be turned off, files need to be copied by FTP and
coordination needs to happen on multiple machines. Yet even those steps as they exist can be scripted,
automated and done by the technical staff at the push of a button, instead of requiring a ticket and a hand-off.
Note that this might not be the best approach. Instead, the team might start with some percentage of the
deploy, or, perhaps, develop a new architecture to make deploys more seamless.
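Scripting the existing steps, as described above, can be as simple as running them in a fixed order and stopping at the first failure. The sketch below is only an illustration: `run_deploy` and the three step names are hypothetical, and the placeholder actions just record what happened where a real script would stop services or copy files.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]  # (step name, action to run)


def run_deploy(steps: List[Step]) -> List[str]:
    """Run each named step in order; stop and report at the first failure."""
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as error:
            raise RuntimeError(f"deploy stopped at step '{name}'") from error
        completed.append(name)
    return completed


# Placeholder actions: each one only appends to a log list.
log: List[str] = []
steps: List[Step] = [
    ("stop services", lambda: log.append("stopped")),
    ("copy files", lambda: log.append("copied")),
    ("start services", lambda: log.append("started")),
]

completed = run_deploy(steps)
print(completed)
```

Keeping the steps as a plain list makes the order explicit and lets the team replace one manual step at a time.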

Notice and notify


Once a bug escapes to production, how long does it take to be found? Again, this is something to measure by
looking at the last handful of serious bugs, when the builds were deployed and when the bugs were reported in a
way that could be fixed. "There is a problem with payments for some customers," for example, is not actionable
feedback.

Using DevOps best practices, take a hard look at those bugs, and you might find some common elements. For
example, the bugs might involve long page-load delays or HTTP 500 or 404 errors on the server. Most teams
pursuing DevOps try to add real-time monitoring -- through dashboards -- of elements that are leading
indicators of problems. Email alerts of problems, monitoring server health and "report problems" links on the
website are all ways of getting notice of problems as soon as possible.
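As a rough illustration of a leading indicator, the sketch below computes the share of recent responses that were 404 or 5xx errors and flags when it crosses a threshold. The function names, the sample status codes and the 5% threshold are assumptions; a real team would pick its own signals and limits.

```python
from typing import Iterable


def error_rate(status_codes: Iterable[int]) -> float:
    """Fraction of responses that are 404 or any 5xx status."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    errors = sum(1 for code in codes if code == 404 or 500 <= code <= 599)
    return errors / len(codes)


def should_alert(status_codes: Iterable[int], threshold: float = 0.05) -> bool:
    """True when the error rate in the window exceeds the assumed threshold."""
    return error_rate(status_codes) > threshold


# Hypothetical window of the ten most recent responses.
recent = [200, 200, 500, 200, 404, 200, 200, 200, 200, 200]
print(error_rate(recent), should_alert(recent))
```

The same check could feed a dashboard or an email alert; the point is that the signal arrives before a customer has to report the problem.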

The fix process


Once the bug is found, how long does it take to get a fix ready to build? This is often a human process. Some
group needs to meet to promote the bug to serious; then another person can take on the fix -- assuming
they are allowed to add the bug to this sprint. In classic Scrum, the bug would be added to the next sprint's backlog,
unintentionally adding two weeks or more to the fix process. That delay might be needed to make sure
the team is not overwhelmed. In that case, "Why are so many bugs coming through that we need to triage them
and delay them?" sounds like a reasonable question to ask.

To improve MTTR, take a look at the elements of the loop and find the element that can improve the most with
the least effort. Then go after it, followed by the other DevOps best practices.
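One way to compare the elements of the loop is to estimate, for each, how much time could be saved per day of effort. The sketch below assumes that simple ratio as the ranking rule. The 15-minute build and four-hour deploy echo the e-commerce example earlier; every other number, and the `LoopElement` class itself, is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoopElement:
    name: str
    current_minutes: float  # how long the step takes today
    target_minutes: float   # what the team believes is achievable
    effort_days: float      # estimated effort to get there

    @property
    def minutes_saved_per_effort_day(self) -> float:
        return (self.current_minutes - self.target_minutes) / self.effort_days


elements = [
    LoopElement("build", current_minutes=15, target_minutes=10, effort_days=5),
    LoopElement("deploy", current_minutes=240, target_minutes=30, effort_days=10),
    LoopElement("notice and notify", current_minutes=60, target_minutes=15, effort_days=3),
]

# Pick the element with the largest saving per day of effort.
best = max(elements, key=lambda element: element.minutes_saved_per_effort_day)
print(best.name)
```

The estimates will be rough, but writing them down side by side usually makes the first target obvious.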

Before we leave you, a word of caution.

Don't give up on mean time between failures

It's tempting to "buy and install" DevOps: just plug in automated builds, use virtualized servers that can deploy
on command, add a dash of monitoring and call it done. This approach can reduce recovery time and
will encourage teams to do a great many deploys quickly.

Except those deploys will require a lot of fixes. The fixes will require fixes. At some point, using the mantra
"move fast and break things," it is possible to actually break all the things. The stability of the system decays,
and, eventually, each new change seems to only make things worse.

In order to deploy often, we have to have high first-time quality -- that is, we want our software to be in good
shape not only before it goes out, but at each step along the way. In order to keep up with the pace of
accelerated delivery, applications need to have fewer bugs. That would mean the testers get to spend less time in
bug tracking systems and retesting. Programmers need to get examples that are detailed enough that they will
build the right thing and make fewer mistakes. All of this means a reasonably high MTBF is a prerequisite to
DevOps best practices, or, at least, something to develop in parallel. The time to shift is when you've hit a wall,
when adding another level of reliability will cause an explosion of cost while making delivery slower.

Sometimes, the opportunity to increase reliability is there, but the team does not know how, because they lack
the skills. Many modern development techniques like test-driven development or exploratory testing require
skills development. Others, like a component architecture or continuous integration, require an investment of
time and probably money before they will show benefits.

Getting started

To really utilize DevOps best practices, take a hard look at a handful of recent failures in production. Figure out
both how far apart they are and how long they took to get fixed. Ask your team what the next step is in
improvement (longer time between failures or shorter time to recovery), what the bottleneck in the process is,
and where the team could have the most improvement for the least effort. The solutions that
involve developers, testers and operations working together to automate the flow of the work and enable
self-service -- getting rid of steps where you have to ask someone to move a file or check something by hand --
are the DevOps steps.
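The two measurements can be pulled from a short incident list. This sketch makes some simplifying assumptions: each incident is recorded as a pair of timestamps (failure started, fix live), time between failures is measured from the start of one failure to the start of the next, and the three incidents shown are invented for the example.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Incident = Tuple[datetime, datetime]  # (failure started, fix live in production)


def mean_time_to_recovery(incidents: List[Incident]) -> timedelta:
    """Average time from the start of a failure to the fix going live."""
    total = sum((fixed - started for started, fixed in incidents), timedelta())
    return total / len(incidents)


def mean_time_between_failures(incidents: List[Incident]) -> timedelta:
    """Average gap between consecutive failure start times (needs 2+ incidents)."""
    starts = sorted(started for started, _ in incidents)
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical incident history.
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 14, 0)),
    (datetime(2024, 1, 11, 9, 0), datetime(2024, 1, 11, 12, 0)),
    (datetime(2024, 1, 31, 9, 0), datetime(2024, 1, 31, 10, 0)),
]

print("MTTR:", mean_time_to_recovery(incidents))
print("MTBF:", mean_time_between_failures(incidents))
```

A handful of incidents is enough to start the conversation about which number the team should work on next.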

Next, figure out what it will take to get there. Come up with epics for each release, such as automating the build and
deploy process, including creating virtual servers on demand. Each epic should have clear results, such as
"lower the time to deploy a build to staging from three hours to under 15 minutes." Break the epics into stories
and present the epics to management.

If you are management, the DevOps best practices process is a little easier. Discuss what is possible with the
technical team in broad strokes, communicate the vision, come up with the epics collaboratively and then ask
the team how they are going to accomplish the objective of each epic. Along the way, you might search for
tools or training or support, and that's okay. Just start with the "what" and then ask "how" in order to roll out DevOps
step by step. The tricky part will be deciding how to fund the project when the busy business of production is
calling your name. One way to do that is to dedicate some percentage of effort; one team I worked with
dedicated 20% of the work effort to small projects that were important, yet were being drowned out by major
company initiatives. Pick your percentage, be prepared to defend it and get going.

http://searchsoftwarequality.techtarget.com/answer/A-complete-beginners-guide-to-DevOps-best-practices
