The OpenVMS Approach To Software Engineering
Writing great code that survives hypes, cutbacks and takeovers
Author’s Note
This article, though never published, was written well before the deal between HP and VMS
Software, Inc. was announced. Please keep that in mind if its conclusion seems a bit gloomy to you.
At a tradeshow a few years ago, I was approached by a gentleman I shall, for obvious reasons,
henceforth refer to as "the Beard". The Beard had been informed that I dabbled in code writing a
little, and after the barest socially necessary show of interest in what I did for a living, he
proceeded to tell me about the new software development methods his company was developing.
This was one company that was taking the hype of cloud computing to a whole new ethereal level!
Their software model consisted of loose functions written in many different languages executing on
many different systems and architectures.
The Beard introduced me to his boss, whom I shall refer to as "the Professor", because the fact that
he was one was the first thing I was told about him. This Professor explained that the beauty of their
scheme was that the interfaces between various functions were more or less loosely defined.
Requests would be made (of the cloud) to provide a function that would do such-and-so, and each
system connected to the cloud could provide an answer. This was presented as a huge advantage in
development agility and ease of management. Finally, no more worries about component versions
and interface definitions!
My mind was sufficiently boggled at the time that the only thing I could think of asking was, "Well,
but how do you design such a system, and how do you debug it?" The brilliant answer: "That's
the beauty of it! You don't need to spend time designing your software anymore. It will design itself!
You only have to test the individual functions, and you don't need any more debugging than that!" At
this point, I got worried and explained a little bit about the kind of projects I had been involved in [1],
the mission critical nature of these, the fact that sometimes people's lives or the fate of a nation
depended on large systems, and how I spent most of my time intricately crafting and designing things
before setting out to code any of it. To no avail. "Great, so that means you can really see the
potential of this! Imagine the time you could save if you didn't have to do this." This was my cue to
start looking for a polite way to end the conversation and find the nearest exit.
This is, of course, an extreme example, almost to the point of being ridiculous, but I do believe it to
be symptomatic of the current emphasis on cheap, fast software development. So, as a contrast, and
hopefully as an eye-opener to those developers who have never had the uplifting experience of
being offered a glance at a truly magnificently engineered piece of code, I decided to write an article
on the software engineering practices employed by the engineers who wrote VMS.
Where practical (the scale of the efforts and the size of the development teams I’m part of are
usually much smaller), I try to work and code in accordance with these practices.
Some of these practices come straight from the Digital Software Engineering Manual; many others
were developed by the VMS engineering team. Please read this article as a tribute to those
engineers, many of whom have very kindly offered suggestions for this article.
[1] I have worked on the design of systems in such diverse fields as healthcare, meteorology, nuclear safeguards
and government inspection processes. Many of these systems are mission critical for financial reasons, some for
safety reasons.
The Team
Start with a small team of grownups
You should start out with a small team of people who know what they’re doing and don’t need a lot
of hand-holding. The VMS team started with just 3 people working out the initial architecture, and
the entire VMS V1 team consisted of 24 people.
Experienced developers will be able to avoid many subtle yet costly mistakes. A smaller team also
means less time spent on interpersonal communication [2] and overhead.
The mentor-apprentice relationship is not only meant to transmit specific design information to new
engineers, but also to instill into them the overall engineering culture of creating correct and reliable
code by understanding it.
Take ownership
Each bit of code in a project should have an owner. The notion of individual ownership provides
control and support by the person who understood the code the best. It also provides accountability
by holding people responsible for solving problems in code they have produced.
[2] Don't get me wrong here; interpersonal communication is of vital importance to a team; precisely for this
reason it becomes increasingly difficult to maintain coherency within a team as it grows in size.
[3] Make sure that each team member is aware of who the experts are in each area.
Customers depend on your work. It is very satisfying to know that a hospital, bank, stock exchange,
nuclear facility or wafer lab is relying on the quality of your efforts to accomplish their goals. On the
other hand, it is humbling and scary to know that when a problem is discovered, you are
potentially responsible for some serious business consequences, or worse.
The Design
Start with a design
Quality cannot be added later. Your design should be well thought through from day one. Central
features (such as security and auditing) should be built into the design, not added in version 2 as an
afterthought.
A good design process works “top down”. Start with requirements, then a functional specification,
then a high level design, then work down into the details. A good well-structured design meets the
requirements and anticipates future requirements.
Rapid prototyping methods were occasionally used in VMS, but only after there was a high level
design. After the high level module design was in place, “quick and dirty” implementations of
selected components would be written that performed their basic function but lacked required
features, refinements, or performance. This allowed a functioning framework to be set up quickly, in
which multiple engineers could then replace the breadboard components with the real ones as they
were developed.
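The breadboard idea might be sketched as follows; this is an invented Python illustration, not anything from VMS itself. Because callers depend only on a stable interface, the quick-and-dirty component can later be replaced by the real one without touching them.

```python
# Hypothetical sketch of a "breadboard" component behind a stable interface.

class Store:
    """The agreed interface; the high-level design fixes this first."""
    def put(self, key, value):
        raise NotImplementedError
    def get(self, key):
        raise NotImplementedError

class BreadboardStore(Store):
    """Quick and dirty: an in-memory dict. No persistence, no locking,
    no performance work -- just enough for other engineers to build on."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data[key]

def caller(store):
    # Callers are written against Store only, so swapping in the real
    # implementation later requires no changes here.
    store.put("greeting", "hello")
    return store.get("greeting")
```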
Keep it simple
Many software projects have failed because the design team allowed the project to become too
complex. A small team and schedule pressure forces the developers to focus on the essentials and
not digress into “nice to have” features. At the same time, the team must maintain the discipline to
not take short cuts in the design process.
Make it modular
Modularity is the inevitable result of a well thought out design. It allows components to be changed
or replaced without affecting the rest of the system.
Modularity allows the overall design to be factored into understandable components. When you're
dealing with something as complex as an operating system, the only way to make it reliable is to
build it so it can be understood. You cannot test your way to reliability with any non-trivial piece of
software.
Hardware support – CPU’s, systems and I/O devices – is one of the big modularity success stories in
VMS. The results speak for themselves. The rest of the system is equally modular, and many major
components were substantially enhanced or replaced outright during the life of the system.
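The shape of that kind of modularity might be sketched like this (an illustrative Python sketch with invented names and device strings, not VMS's actual driver interface): device-specific behavior sits behind a uniform dispatch interface, so supporting a new device type means adding an entry, not changing the rest of the system.

```python
# Hypothetical sketch: hardware support isolated behind a dispatch table.

class DiskDriver:
    def read(self, block):
        return f"disk block {block}"

class TapeDriver:
    def read(self, block):
        return f"tape record {block}"

# Adding a new device type is a new entry here; the I/O layer is untouched.
DRIVERS = {"DKA0": DiskDriver(), "MKA0": TapeDriver()}

def io_read(device, block):
    # The I/O layer depends only on the driver interface,
    # never on device-specific details.
    return DRIVERS[device].read(block)
```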
Stick to the design
Deviating from the design during implementation will impact your ability to maintain your product
and put out new releases. Problems that would have been easy to pinpoint if the design was
followed properly now become a search in a maze of spaghetti code.
Documenting the design is the key to being able to maintain it. Even more important than
documenting how something works is to document why it is constructed the way it is. It is always
possible to reverse-engineer the how from the code (although it may be tedious). But discovering the
why after the fact is often impossible, and without it, it is often not possible to really understand a
piece of code. All code should be documented reasonably well inside. Again, focus on why the code is
done the way it is.
Interfaces between major modules need to be well defined, and stable. If you feel you need to
change one of these interfaces, think carefully. Your change may break the interface for lots of other
modules that depend on it. Carefully check the design documentation to see if you've missed
another way to get what you need out of the interface. See if there's a way you can do without it. If you
still need to change the interface, change the design first and run it by those responsible for the
other modules using this interface. [4]
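One compatible way to evolve a stable interface is sketched below in Python, purely for illustration (the function and table are invented, not anything from VMS): a new capability is added as an optional parameter with a backward-compatible default, so every existing caller of the old signature keeps working unchanged.

```python
# Hypothetical sketch: extending an interface without breaking callers.
# The new case_blind option defaults to the old behavior, so modules
# written against the original lookup(name) signature are unaffected.

def lookup(name, *, case_blind=False):
    table = {"SYSTEM": 1, "FIELD": 2}
    key = name.upper() if case_blind else name
    return table.get(key)

# Old callers: lookup("SYSTEM") still works exactly as before.
# New callers: lookup("system", case_blind=True) opts into the new behavior.
```

If even a change like this touches a documented interface, the design document should still be updated first, as described above.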
A very good idea is to validate inputs passed across interfaces. If you type your internal data
structures, and have your interface type-check all structures passed to it, you can catch many errors
early on. [5] One of the components in VMS that does this is the file system. All structures handed to it
are type-checked. As a result, the file system detected many of the pool corruptors in the early life of
the system.
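The idea of type-checking structures at an interface boundary can be sketched as follows; this is a loose Python illustration with invented field names and type codes, not the VMS file system's actual data structures. The point is that a corrupted or mistyped structure is rejected close to the source of the error rather than quietly corrupting state later.

```python
# Hedged sketch: validate typed structures passed across an interface.
# FCB_TYPE and the field names are invented for illustration.

FCB_TYPE = 0x0A   # type code expected for a "file control block"

def make_fcb(name):
    # Producers stamp each structure with its type code.
    return {"type": FCB_TYPE, "size": 3, "name": name}

def open_file(fcb):
    # The interface type-checks every structure before using it,
    # catching corruptors and caller mistakes early.
    if not isinstance(fcb, dict) or fcb.get("type") != FCB_TYPE:
        raise TypeError("open_file: not a valid FCB")
    return f"opened {fcb['name']}"
```

A caller handing in a structure without the right type code gets an immediate, diagnosable error instead of a delayed corruption.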
Despite one's best efforts, system components sometimes wear out due to code rot or just
significant changes in scale or requirements. Maintaining a modular design will allow you to rewrite
components completely without major impact on other areas of the system. Several components in
VMS were rewritten in this way, among them are scheduling (rewritten a few times) and memory
management.
Testing
Create meaningful tests
As I said before, you cannot test your way to reliability. That doesn’t mean you shouldn’t test! Test a
lot, and make sure you don’t hit just the mainline code, but especially error paths. Create errors on
purpose to test them. As an example, for VMS clusters, there are test mechanisms that introduce
faults in the cluster protocol layers, and these are exercised while the cluster is operating under
heavy load.
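The principle of deliberately injecting faults to exercise error paths can be sketched like this; it is a toy Python illustration (the function names and the retry logic are invented, not the actual VMS cluster test mechanism):

```python
# Hypothetical sketch: a fault-injection hook in a message-sending layer.
import random

def send(message, transmit, fault_rate=0.0):
    # fault_rate is 0.0 in production; a test harness raises it to
    # simulate lost messages while the system runs under load.
    if random.random() < fault_rate:
        return False          # injected fault: message "lost"
    transmit(message)
    return True

def send_reliably(message, transmit, retries=5, fault_rate=0.0):
    # The error path under test: retry until the message gets through.
    for _ in range(retries):
        if send(message, transmit, fault_rate):
            return True
    return False
```

With the fault rate forced up, the retry and failure paths run constantly under test instead of lying dormant until a real outage.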
[4] I'm making the assumption here that your documentation keeps track of which modules use which interfaces. It
does, doesn't it?
[5] Strongly typed programming languages, like C++, Ruby, or Haskell, can help prevent many of these kinds of
problems at compile or run time, but those benefits fly out the window when mixed with weakly typed (like
Basic) or untyped (like assembly) languages.
VMS (the complete operating system, utilities, associated products, etc.) is built from scratch weekly.
The Quality Test and Verification group installs every week's builds of the operating system and runs
it on a large number of servers. Those servers are used to perform specific regression tests as well as
to accomplish the day-to-day work of the many people in the group (email, notes, setting up test
scripts, ad-hoc testing, etc.). [6]
You may need to get creative with this. If you’re developing an operating system or development
environment it’s relatively easy to use them to do your daily work. If you’re developing business
software, use it to run your business. If you’re developing an airline reservation system, and you’re
not an airline, you might build an adapted version of your software for meeting room reservations.
Try to use your own software on a daily basis. [7]
Procedures
Maintain a proper workflow
A major pitfall in software development is to take shortcuts, especially when the stakes are high and
customers are clamoring at the gates for a solution.
However, the bigger the change, the more important it is to follow the proper steps. A typical,
rigorous workflow for changes, such as the one used by VMS engineering, would look like this:
[6] Targeted tests are often written to test how the programmer expects the software to be used. Deviations
from the programmer's intended use are hard to take into account in deliberate testing, but occur naturally in
everyday use.
[7] Don't stretch this too far, of course. If you're developing control software for nuclear power plants, it would
be silly to adapt it to run your coffee machine. However, if you've developed an underlying framework suitable
for a variety of measurement and control applications, you might consider using the framework for building
automation tasks in your office. You'll know you've made a mistake if it suddenly starts freezing after you've
checked in your code.
- Documentation written by the documentation team if required by the change;
- Documentation review before the release goes out.
Quick and dirty fixes for a specific customer are fine, but should be temporary. For general
distribution and inclusion in the main code stream, a long-term fix should be developed afterwards
that follows established procedures. Sticking to these procedures slows down the development
process in the short run but leads to a higher standard of quality, which will pay for itself in the long
term.
All significant functional changes and additions are discussed at length between two or more
developers. The goal is to identify design weaknesses sooner rather than later, find simpler or faster
ways to meet the requirements, and spread the technical knowledge to multiple people.
Most code changes are reviewed by at least one engineer other than the author of the change. All
sorts of potential problems are caught by code reviews, from significant (though often subtle) design
or implementation errors to trivial mistakes that were not caught in testing. For large changes,
formal code review meetings are called. Performing code reviews slows things down, no doubt about
that, but in the long run it is important.
A final code review is often performed using the documentation that the engineer doing the check-in
has prepared, which allows another engineer to verify that the code being checked in matches
expectations and that the documentation is sensible.
A "release manager" has authority to approve or reject proposed changes, based on schedule
factors, levels of technical risk, etc.
A team of "builders" is in ultimate control of the master source code streams. All check-ins follow a
two-stage process: the engineer performing the check-in queues the request to have the check-in be
done. Then a member of the build team reviews the queued check-in requests, and releases the
check-ins from the queue to actually be performed, thus modifying the master source code
database [8]. A build team member might delay finalizing a check-in until some known issue elsewhere
in the system is resolved (to avoid adding confusion), or might even reject a check-in if the
preparation is incomplete or sufficiently non-standard.
It's always more fun to create something new than to maintain the existing stuff. Be prepared to stop
forward development in favor of fixing bugs.
Design the software such that if it fails, you might be able to capture enough details to diagnose the
failure. The system crash dump and the process dump features are excellent examples. There is also
logging code sprinkled through a number of VMS components, which in some cases can be activated
by the customer to gather additional information [9].
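Customer-activatable logging of this kind might be sketched as follows; the sketch uses Python's standard logging module, and the component and function names are invented for illustration. The key property is that the logging is cheap when off and can be switched on at run time in the field, without rebuilding or restarting anything.

```python
# Hypothetical sketch: diagnostic logging a customer can enable on site.
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("component")

def enable_diagnostics():
    # Flipped at run time, e.g. from an operator command, to gather
    # extra detail while chasing a problem in the field.
    log.setLevel(logging.DEBUG)

def process(request):
    # Lazy %-formatting keeps this line nearly free when logging is off.
    log.debug("processing request %r", request)
    return request.upper()
```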
Conclusions
A lot of the recommendations in this article can be summed up as “maintain a high standard of work
at all times,” and a lot of that can be traced back to maintaining a proper engineering culture. A
culture like that does not grow overnight, it must be cultivated. It must also be maintained, because
although a well-functioning team strengthens itself and builds up a remarkable resilience, too many
poor management decisions may eventually lead to even the strongest team breaking down.
Acknowledgements
A special thanks to those who provided me with their thoughts on engineering. Many of your
contributions made it into this article almost verbatim; my work was mostly editorial in nature. My
apologies for keeping you waiting for this article for so long.
[8] Depending on the source code repository tool you use, this can be achieved in a number of different ways.
The use of entirely separate developer and master repositories may be called for. I personally find Mercurial,
which has separate repositories by design, well suited to this type of workflow.
[9] e.g., the various SDA extensions such as EXC and FLT