Deployment panic: Blind procedures, blind CIO and an example of what the IT industry can do about Global Warming

A way to reduce risk in dealing with highly complex, repetitive processes is to define and document the process in a written procedure. A procedure is a set of action steps and rules that specify what to do and what not to do in specific situations. It is a kind of “idiot’s guide” to what can be a fairly complex task. It is commonplace in Information Technology to encapsulate and codify procedures in software applications; much business software is a dutiful enforcer of procedures. Procedures also distinguish successful and appropriate behavior from behavior that is not. The incident I will describe and discuss involves a procedure that is executed by humans. For me, the following story illustrates how procedures influence people’s judgment and, if you don’t pay attention, limit learning and adaptation to a new reality.

Imagine an IT department 20 people strong that supports a set of web-based applications used daily by hundreds of thousands of individual visitors. The department is headed by a Vice President with a few direct reports: a technical director, a few project managers, a web infrastructure manager and a database administration group manager. The managerial style of the Vice President is best described in his own words: “This is a monarchy and an absolute at that”.

Imagine these twenty-some people housed in one big room, part of a very large Cold War-era concrete building, each at an individual desk separated from the others by half-height partitions. The usual silence of the place is punctuated only by the clicking of keyboard keys. Direct conversations are rare and, when they do occur, are conducted in hushed tones. Interactions, or attempts at interaction, take place via email or instant messaging and in formal meetings in the few available meeting rooms.

The code produced by the department goes through unit and QA testing before undergoing a test deployment in a Staging Environment. Typically, the test deployment to Staging precedes the deployment of the new code to the Production Environment by a few days. Since a significant amount of income is generated by these web-based applications, the Project Management Group developed a procedure for every Project Manager to follow throughout the deployment process.

The deployment procedure specifies the frequency of deployment preparation meetings and who is expected to attend them. These meetings are organized and facilitated by the Project Manager assigned to the project and attended by the Technical Director and key representatives of all the groups actively participating in the deployment. A new code deployment typically requires the active participation and cooperation of:

- The software developers who coded the new applications,
- One or two database administrators to perform production database backups, deploy the new database objects and migrate associated data,
- Members of the Quality Assurance department who participated in the integration testing of the new application,
- A member of the web infrastructure group who controls the web and application servers running the new application code,
- The Product Manager and/or the Product Management Director who “owns” the application being deployed.

The deployment procedure specifies that for “larger deployments”, at a minimum, one representative of each group needs to be on-site for the duration of the deployment. More critical deployments are usually those requiring significant database update work, which in turn requires blocking public access to all websites that use the database tables being updated. Typically, more critical deployments are scheduled over a weekend to minimize disruption to website traffic.

Because thousands of websites would be offline for the duration of the database update, this deployment fit the definition of a critical deployment. However, it was merely a chance to push out minor enhancements and bug fixes that had no visibility to external customers. In discussions with the Technical Director and with the developers working on the project, it was generally considered a minor deployment. The target date was set for the Saturday of Easter weekend.

During one of the preparation meetings, the Database Administrator (DBA) assigned to the project announced that, since the deployment was set for Easter weekend and since he lives two hours away from the office, he would work from home on deployment day and perform all database update tasks remotely.

At the following preparation meeting, members of the QA group announced that they would perform their QA validation tests remotely as well. The lead developer and the web-system administrator added that they would deploy the code remotely and then drive to the office. The Technical Director took part in each deployment preparation meeting where those announcements were made and, although he was not happy about the turn of events, did not object to anyone’s decision to work remotely. Similarly, the CIO, informed by the Project Manager that the deployment date conflicted with a major Christian holiday, did not identify a replacement weekend date.

On deployment day, only the Technical Director, the QA Manager, the Product Manager and the Project Manager were actually on site: all managers, and none of the people doing the hands-on work.

The process started with the web-system administrator blocking public access to the websites affected by the database object update. At the end of that task, as specified in the deployment procedure, the web-system administrator informed the whole group via email. The DBA followed with the update of the database objects and, at the end of his task, emailed the whole group to report its successful completion. That is the signal the web-system administrator waits for before moving the new code to the production server. Similarly, as soon as the new application code has been moved to the production server, the web-system administrator sends an email informing the whole group that his task completed successfully. That, in turn, is the signal the lead developer waits for before running a high-level test to check the availability of the new application.

At the very moment when internal users and QA testers expected to see their test websites come up, nothing showed up but an error screen. The lead developer first announced the problem to the whole deployment group at 7:47 am.

Initially, the lead developer diagnosed two separate potential issues:

1. One issue related to a third-party software malfunction on the web server.
2. The other issue related to database access security.

This is the moment when all available brains work feverishly on tracing the obvious symptoms of trouble to specific possible causes.

This is the raison d’être of the Technical Director. This is why he woke up early that morning, why he showed up at work that Saturday: all this effort expended for the occasion to serve, to help resolve another technical puzzle. And no one was on site. Everyone who could do something about this obstacle to completing the deployment was off-site.

Time went by slowly. Three minutes. More time went by. Five minutes, ten minutes, fifteen minutes.

At 8:04 am, in a private email to his development partner on the project, the lead developer announced that he had discovered a bug in existing code that had led him to attribute issue #1, in error, to a malfunction of the third-party web-server software. Around 8:06 am, the lead developer realized that the database access issue was the only issue left to resolve to complete the deployment. He decided to call the Database Administrator at home, where nobody answered the phone. Nearly twenty minutes had gone by since the discovery of the incident.

At 8:09 am, the lead developer announced via email to the whole group that the database access security issue was the only cause of the problem. At 8:12 am, the QA Manager, in an email copied to the whole group, asked the DBA for an estimated time to resolution. There was no response from the DBA. At 8:46 am, the lead developer informed the whole group that all the issues had been resolved and that the QA group could perform its validation checks.

The rest of the deployment unfolded without additional surprise or incident. As promised, the lead developer and the web-system administrator drove to the office and waited for the completion of the QA validation tests, at the end of which everyone went home at the expected time.

You may wonder what happened during the 37 minutes, from 8:09 to 8:46 am, in which the lead developer sent no update to the whole deployment group. Here are the private emails that did not make it to the whole deployment group’s distribution list:

- At 8:10 am, in a private email to the lead developer, the other developer panicked and suggested calling another DBA.
- At 8:11 am, in a private email to his development partner, the lead developer started panicking himself: “Crap, I think you are right, I called him at home no answer!!!”
- At 8:16 am, in a private email to the lead developer, the development partner confirmed that database access security was the remaining obstacle.
- At 8:16 am, in a private email, the lead developer reported that he was on the phone with the DBA, working on the resolution of the database access issue.

For 10 minutes, the lead developer had been trying to reach the DBA at home. Once the two were working together on the phone, it took them half an hour to resolve the issue. Consider that these two individuals were using phone and email to resolve the issue while connected to the network servers via VPN. They were separated from each other by 40 to 50 miles.

OK, so would you consider this to be a successful software application deployment? What lessons can we learn from this short story? And what does this story have to do with Global Warming?

Could the DBA and the lead developer have resolved the issue faster while both being off-site? If so, how? Could they have resolved it faster had both been on-site?

So, yes, if as the CIO you consider this deployment to be critical, then the Technical Director, the Project Manager, somebody, was at fault for allowing most members of the deployment team to work off-site. If you are of the view that only the end result matters, then whether the main actors in the deployment were in the same room or remote may not matter to you.
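The durations quoted in this account can be cross-checked with a little arithmetic. What follows is a minimal Python sketch, not anything the deployment team used: the timestamps are taken from the story itself, while the dictionary keys and the helper names `minutes_between` and `summarize` are labels invented here for illustration. The 8:06 am entry is approximate, since the story only says "around 8:06 am".

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes elapsed between two same-day "HH:MM" times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps from the story; the key names are hypothetical shorthand.
TIMELINE = {
    "problem_announced": "07:47",       # first email to the whole group
    "first_call_to_dba": "08:06",       # approximate ("around 8:06 am")
    "last_group_update": "08:09",       # lead developer's email to the group
    "dba_on_phone": "08:16",            # lead developer and DBA start working
    "resolved": "08:46",                # resolution announced to the group
}


def summarize(t: dict) -> dict:
    """Compute the intervals discussed in the text, in minutes."""
    return {
        "whole_incident": minutes_between(t["problem_announced"], t["resolved"]),
        "before_first_call": minutes_between(t["problem_announced"], t["first_call_to_dba"]),
        "reaching_dba": minutes_between(t["first_call_to_dba"], t["dba_on_phone"]),
        "joint_resolution": minutes_between(t["dba_on_phone"], t["resolved"]),
        "group_email_silence": minutes_between(t["last_group_update"], t["resolved"]),
    }


if __name__ == "__main__":
    for label, minutes in summarize(TIMELINE).items():
        print(f"{label}: {minutes} min")
```

Under these assumptions the incident lasted 59 minutes from announcement to resolution: 19 minutes before the first call, 10 minutes trying to reach the DBA, and 30 minutes of joint work on the phone. The 37 minutes of silence toward the group overlap the last two intervals.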

We live in a world where medical images are routinely read in India, where software engineers in Russia develop graphics engines for American software companies, where websites for large organizations are routinely hosted in multiple facilities around the planet, and where we need to reduce our hydrocarbon consumption while substantially reducing our production of greenhouse gases. In the context of our global economy and the complexity of intertwined interests and relationships spanning the planet, how do we solve very complex technical issues efficiently while not being in each other’s immediate proximity?