Deployment panic: Blind procedures, blind CIO and an example
of what the IT industry can do about Global Warming.

A way to reduce risk in dealing with highly complex repetitive
processes is to define and document a process in a written procedure.
Procedures are a set of action steps and rules that specify what to do
and not to do in specific situations. A procedure is a kind of “idiot
guide” to what can be a fairly complex task.
It is commonplace in Information Technology to materialize,
encapsulate, codify and instantiate procedures in software applications.
Much business software is a dutiful enforcer of procedures.
Procedures also distinguish successful and appropriate behavior from
what is not. The incident I will be describing and discussing has
to do with a procedure that is executed by humans.
For me, the following story illustrates how procedures influence
people’s judgment and, if you don’t pay attention, limit learning and
adaptation to a new reality.

Imagine an IT department 20 people strong that supports a set of
web-based applications used daily by hundreds of thousands of individual
visitors. The department is headed by a Vice President with a few
direct reports: a technical director, a few project managers, a web
infrastructure manager and a database administration group manager.
The managerial style of the Vice President is best described in his own
words: “This is a monarchy and an absolute at that”.

Imagine these twenty-some people housed in a big room, part of a
very large Cold War-era concrete building, each with an individual desk
separated from the others by half-height partitions. The usual silence
in the place is punctuated by the clicking sound of keyboard keys.
Direct conversations occur rarely and when they do, it is in hushed
tones. Interactions, or attempts at interaction, occur via email or
instant messaging and in formal meetings in the few available meeting
rooms.

The code generated by the department goes through unit and QA
testing before undergoing test deployment in a Staging Environment.
Typically the test deployment to Staging precedes the deployment of
the new code to the Production Environment by a few days.
Since a significant amount of income is generated by these web-based
applications, the Project Management Group developed a procedure for
every Project Manager to follow for the whole deployment process.
The deployment procedure specifies the frequency of deployment
preparation meetings and their expected attendees. Deployment
preparation meetings are organized and facilitated by the Project
Manager assigned to the project and attended by the Technical
Director and key representatives of all the groups actively participating
in a deployment.
New code deployment typically requires the active participation and
cooperation of:
- The software developers who coded the new applications,
- One or two database administrators to perform production
database backups, deploy the new database objects and migrate
associated data,
- Members of the Quality Assurance department who participated
in the integration testing of the new application,
- A member of the web infrastructure group who controls the web
and application servers running the new application code,
- The Product Manager and/or the Product Management Director
who “owns” the application being deployed.

The deployment procedure specifies that for “larger deployments”, at a
minimum, one representative of each group needs to be on-site for
the duration of the deployment.
More critical deployments are usually those requiring significant
database update work, which requires blocking public access to all
websites accessing the database tables that are being updated. Typically,
more critical deployments are scheduled to take place over a weekend
to minimize website traffic disruptions.
Because thousands of websites would be offline for the duration of the
database update, this deployment fit the definition of a critical
deployment. However, this deployment was a chance to push out
minor enhancements and bug fixes that had no visibility to external
customers. In discussions with the Technical Director and with the
developers working on the project, this deployment was generally
considered to be a minor deployment.

The target date for this deployment was set for the Saturday of Easter
weekend.

During one of the preparation meetings, the Database Administrator
assigned to the project announced that on deployment day, since the
deployment was set for Easter weekend and since he (the Dba
assigned to the project) lived two hours away from the office, he would
work from home and perform all database update tasks remotely.
At the following preparation meeting, members of the QA group
announced that they would perform their QA validation tests remotely
as well. The lead developer and the web-system administrator then
added that they would deploy the code remotely and then drive to the
office.

The Technical Director took part in each deployment preparation
meeting where those announcements were made and, although he was
not happy about the turn of events, did not object to anyone’s decision
to work remotely. Similarly, the CIO was informed by the Project
Manager that the deployment date conflicted with a major
Christian holiday and did not identify a replacement weekend date.

On deployment day, only the Technical Director, the QA Manager, the
Product Manager and Project Manager were actually on site; all Chiefs
– no Indians.

The process started with the web-system administrator blocking public
access to websites affected by the database objects update. At the end
of the task, as specified in the deployment procedure, the web-system
administrator informed the whole group via email.
The Database Administrator followed with the update of the database
objects. At the end of the task, the Dba emailed the whole group to
report the successful completion of his task.
That was the signal expected by the web-system administrator to move
the new code to the production server.
Similarly, as soon as the new application code was moved to the
production server, the web-system administrator sent an email
informing the whole group of the successful completion of his task.
That was the signal expected by the lead developer to do a high-level
test to check the availability of the new application.
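
The handoff chain described above can be sketched in a few lines of Python. This is only an illustration of the ordering and of the email "signal" each role waits for; the `Step` class, the `run_handoffs` function and the wording of each task are hypothetical and are not taken from the actual written procedure. In particular, the final email after the lead developer's test is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    owner: str  # role that performs the task
    task: str   # paraphrased description of the task


# Order taken from the story; task wording is paraphrased (hypothetical).
STEPS = [
    Step("web-system administrator", "block public access to affected websites"),
    Step("database administrator", "update the database objects"),
    Step("web-system administrator", "move the new code to the production server"),
    Step("lead developer", "run a high-level availability test"),
]


def run_handoffs(steps):
    """Return the group emails sent, one per completed step.

    Steps run strictly in order: each completion email is the signal
    the next owner waits for before starting.
    """
    emails = []
    total = len(steps)
    for number, step in enumerate(steps, start=1):
        # The real task would be performed here, before the email goes out.
        emails.append(f"[{number}/{total}] {step.owner}: completed '{step.task}'")
    return emails


if __name__ == "__main__":
    for line in run_handoffs(STEPS):
        print(line)
```

The point of the sketch is that the chain is strictly serial: every role depends on an email from the previous one, so a problem at any step stalls everything after it.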

At the very moment when internal users and QA testers expected to be
able to see their test websites come up, nothing showed up but an error
screen. When the problem was first announced by the lead developer
to the whole deployment group, the time was 7:47 am.

Initially, the lead developer diagnoses two separate potential issues:
1. One issue related to a third-party software malfunction on the
web-server.
2. The other issue related to database access security.

This is the moment when all available brains work feverishly on tracing
obvious symptoms of trouble to specific possible causes.
This is the raison d’être of the Technical Director; this is why he woke
up early that morning, why he showed up to work that Saturday.
All this effort expended for the occasion to serve, to help resolve
another technical puzzle. And no one was on site.
Everyone who could do something about this obstacle to the
completion of this deployment was off-site.

Time went by slowly.
Three minutes.
More time went by…
Five minutes.
Ten minutes.
Fifteen minutes.

At 8:04 am, in a private email to his development partner on this
project, the lead developer announces that he has discovered a bug in
existing code, and that this bug had led him to mistakenly attribute
issue #1 to a malfunction of third-party web-server software.

Around 8:06 am, the lead developer realizes that the database access
issue is the remaining issue to resolve to complete the deployment. He
decides to call the Database Administrator at his home, where nobody
answers the phone.

Twenty minutes have gone by since the discovery of the incident.

At 8:09 am, the lead developer announces via email to the whole
group that the database access security issue was the only cause of
the problem.

At 8:12 am, the QA Manager, via email, asks the Dba for an estimated
time to resolution. The email is copied to the whole group.
No response from the Dba.

At 8:46 am, the lead developer informs the whole group that all the
issues have been resolved and that the QA Group can perform their
validation checks.

The rest of the deployment unfolded without additional surprise or
incident. As promised, the lead developer and the web-system
administrator drove to the office and waited for the completion of the
QA validation test, at the end of which everyone went home at the
expected time.
You may wonder what happened during the 37 minutes of email
silence to the whole deployment group?

Here are the private emails that did not make it to the whole deployment
group's distribution list:
At 8:10 am, in a private email to the lead developer, the other
developer panics and suggests calling another Dba.
At 8:11 am, in a private email to his development partner, the lead
developer starts panicking himself:
“Crap, I think you are right, I called him at home no answer!!!”
At 8:16 am, in a private email to the lead developer, the development
partner confirms that the database access security is the remaining
obstacle.
At 8:16 am, in a private email, the lead developer reports that he is on
the phone with the Dba, working on resolving the database access
issue.

For 10 minutes, the lead developer had been trying to reach the Dba at
home.
Once the two were working together on the phone, it took them half
an hour to resolve the issue.
Consider that those two individuals were using phone and email to
resolve the issue while linked to the network servers via VPN. They
were separated from each other by 40 to 50 miles.
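
The intervals quoted in this account can be checked against the timestamps given in the story. The following is a minimal Python sketch of that arithmetic; the constant names and the `minutes_between` helper are hypothetical, and "around 8:06 am" is treated as exactly 8:06 for the purpose of the calculation.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes between two same-day clock times given as 'HH:MM'."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps as reported in the story (all am, same Saturday).
PROBLEM_ANNOUNCED = "07:47"  # first group email about the error screen
FIRST_CALL_TO_DBA = "08:06"  # "around 8:06 am" (assumed exact here)
LAST_GROUP_UPDATE = "08:09"  # lead developer's last group email before the fix
ON_PHONE_WITH_DBA = "08:16"  # the two start working together
RESOLVED = "08:46"           # group told that all issues are resolved

if __name__ == "__main__":
    print("reaching the Dba:", minutes_between(FIRST_CALL_TO_DBA, ON_PHONE_WITH_DBA))
    print("fixing the issue:", minutes_between(ON_PHONE_WITH_DBA, RESOLVED))
    print("gap in group updates:", minutes_between(LAST_GROUP_UPDATE, RESOLVED))
    print("total incident:", minutes_between(PROBLEM_ANNOUNCED, RESOLVED))
```

Run as written, the four figures are 10, 30, 37 and 59 minutes: ten minutes to reach the Dba, half an hour to fix the issue, 37 minutes between the lead developer's two emails to the group, and just under an hour from the first error report to resolution.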

OK, so would you consider this to be a successful software application
deployment?

What lessons can we learn from this short story?
And what does this story have to do with Global Warming?
Could the Dba and the Lead Developer have resolved the issue faster
while both being offsite?
If so, how?
Could the Dba and the Lead Developer have resolved the issue faster
while both being onsite?
So, yes, if as the CIO you consider this deployment to be critical, then
somebody, whether the Technical Director or the Project Manager, was at
fault for having allowed most members of the deployment team to work
offsite.

If you are of the point of view that only the end result matters, then
whether the main actors in the deployment were in the same room or
remote may not matter to you.
We live in a world where medical images are routinely read in India,
where software engineers in Russia develop graphic engines for
American software companies, where websites for large organizations
are routinely hosted in multiple facilities around the planet and where
we need to reduce our hydrocarbon consumption while substantively
reducing our production of greenhouse gases.
In the context of our global economy and the complexity of intertwined
interests and relationships spanning the planet, how do we solve very
complex technical issues efficiently while not being in each other’s
immediate proximity?