
Source Control HOWTO

by Eric Sink (http://www.ericsink.com/)


Original tutorial source: http://www.ericsink.com/scm/source_control.html

I am writing a series of articles explaining how to do source control and the best practices thereof. See below
for links to the individual chapters in this series. The Introduction explains my motivations and goals for
writing this series.

Please note: This is a work in progress. I plan to be adding new chapters over time, and I may also be revising
the existing chapters as I go along.

Printer-friendly version: Sorry folks, but I currently do not have this material available in a form which is
more suitable for paper. I am planning to eventually publish this material as a book. When that happens, a
link will appear here.

Chapter 0: Introduction (scm_intro.html)

Our universities don't teach people how to do source control. Our employers don't teach people how
to do source control. SCM tool vendors don't teach people how to do source control. We need some
materials that explain how source control is done. My goal for this series of articles is to create a
comprehensive guide to help meet this need.

Chapter 1: Basics (scm_basics.html)

Our discussion of source control must begin by defining the basic terms and describing the basic
operations.

Chapter 2: Checkins (scm_checkins.html)

In this chapter, I will explore the various situations wherein a repository is modified, starting with
the simplest case of a single developer making a change to a single file.

Chapter 3: File Merge (scm_file_merge.html)

Many software teams have discovered that the tradeoff here is worth the trouble. Concurrent
development can bring substantial gains in the productivity of a team. The extra effort to deal with
merge situations is usually a small price to pay.

Chapter 4: Repositories (scm_repositories.html)

A file system is two-dimensional: its space is defined by directories and files. In contrast, a
repository is three-dimensional: it exists in a continuum defined by directories, files and time. An
SCM repository contains every version of your source code that has ever existed. The additional
dimension creates some rather interesting challenges in the architecture of a repository and the
decisions about how it manages data.

Chapter 5: Working Folders (scm_working_folders.html)

The repository is the official archive of our work. We treat our repository with great respect. In
contrast, we treat our working folder with very little regard. It exists for the purpose of being
abused. Our working folder starts out worthless, nothing more than a copy of the repository. If it is
destroyed, we have lost nothing, so we run risky experiments which endanger its life.

Chapter 6: History (scm_history.html)

There is nothing endearing about a development team that can't find something when they need it.
A good SCM tool must do more than just keep every version of everything. It must also provide
ways of searching and viewing and sorting and organizing and finding all that stuff.

Chapter 7: Branches (scm_branches.html)

Nelly has a friend who has a cousin with a neighbor who knows somebody whose life completely fell
apart after they tried using the branch and merge features of their source control tool. So Nelly
refuses to use branching at all.

Chapter 8: Merge Branches (scm_merge_branches.html)

Successfully using the branching and merging features of your source control tool is first a matter of
attitude on the part of the developer. No matter how much help the source control tool provides, it is
not as smart as you are. You are responsible for doing the merge. Think of the tool as a tool, not as a
consultant.

Chapter 9: Source Control Integration with IDEs (scm_ide_integration.html)

Just as a spice rack belongs near the stove, source control should always be available where the
developer is working.

Here's a list of chapters I am thinking about writing:

A chapter on integration with bug-tracking and automated builds
A chapter on common mistakes people make when using source control
A chapter on remote access (client/server, binary deltas, security, ...)
A chapter on importing from one source control tool to another
A chapter on cross-platform issues
A chapter on writing custom tools which access a source control server
A chapter on miscellaneous stuff that doesn't fit anywhere else (share, pin, cloak, shadow folders, email notifications, browser-based clients, keyword expansion, ...)

Chapter 0: Introduction
This is part of an online book called Source Control HOWTO (source_control.html) , a best practices guide
on source control, version control, and configuration management.

What is source control?

Sometimes we call it "version control". Sometimes we call it "SCM", which stands for either "software
configuration management" or "source code management". Sometimes we call it "source control". I use all
these terms interchangeably and make no distinction between them (for now anyway -- configuration
management actually carries more advanced connotations I'll discuss later).

By any of these names, source control is an important practice for any software development team. The most
basic element in software development is our source code. A source control tool offers a system for managing
this source code.

There are many source control tools, and they are all different. However, regardless of which tool you use, it is
likely that your source control tool provides some or all of the following basic features:

It provides a place to store your source code.
It provides a historical record of what you have done over time.
It can provide a way for developers to work on separate tasks in parallel, merging their efforts later.
It can provide a way for developers to work together without getting in each others' way.

HOWTO

My goal for this series of articles is to help people learn how to do source control. I work for SourceGear, a
developer tools ISV. We sell an SCM tool called Vault (http://www.sourcegear.com/vault/) . Through the
experience of selling and supporting this product, I have learned something rather surprising:

Nobody is teaching people how to do source control.


Our universities often don't teach people how to do source control. We graduate with Computer Science
degrees. We know more than we'll ever need to know about discrete math, artificial intelligence and the design
of virtual memory systems. But many of us enter the workforce with no knowledge of how to use any of the
basic tools of software development, including bug-tracking, unit testing, code coverage, source control, or
even IDEs.

Our employers don't teach people how to do source control. In fact, many employers provide their developers with no training at all.

SCM tool vendors don't teach people how to do source control. We provide documentation on our products,
but the help and the manuals usually amount to simple explanations of the program's menus and dialogs. We
sort of assume that our customers come to us with a basic background.

Here at SourceGear, our product is positioned specifically as a replacement for SourceSafe. We assume that
everyone who buys Vault already knows how to use SourceSafe. However, experience is teaching us that this
assumption is often untrue. One of the most common questions received by our support team is from users
asking for a solid explanation of the basics of source control.

We need some materials that explain how source control is done. My goal for this series of articles is to create a comprehensive guide to help meet this need.

Best Practice: Use source control

Some surveys indicate that 70% of software teams do not use any kind of source control tool. I cannot imagine how they cope.

Throughout this series of articles, I will be sprinkling Best Practices that will appear in sidebar boxes like this one. These boxes will contain pithy and practical tips for developers and managers using SCM tools.

Somewhat tool-specific

Ideally, a series of articles on the techniques of source control would be tool-neutral, applicable to any of the available SCM tools. It simply makes sense to teach the basic skills without teaching the specifics of any single tool. We learn the basic skills of writing before we learn to use a word processor.

However, in the case of SCM tools, this tool-agnostic approach is somewhat difficult to achieve. Unlike
writing, source control is simply not done without the assistance of specialized tools. With no tools at all, the
methods of source control are not practical.

Complicating matters further is the fact that not all source control tools are alike. There are at least dozens of
SCM tools available, but there is no standard set of features or even a standard terminology. The word
"checkout" has different meanings for CVS and SourceSafe. The word "branch" has very different semantics
for Subversion and PVCS.

So I will keep the tool-neutral ideal in mind as I write, but my articles will often be somewhat tool-specific.
Vault is the tool I know best, since I have played a big part in its design and coding. Furthermore, I freely
acknowledge that I have a business incentive to talk about my own product. Although I will often mention
other SCM tools, the articles in this series will use the terminology of Vault.

The world's most incomplete list of SCM tools

Several SCM tools that I mention in this series are listed below, with hyperlinks for more information.

Vault (http://www.sourcegear.com/vault/). Our product. 'Nuff said.
SourceSafe (http://msdn.microsoft.com/vstudio/previous/ssafe/). Microsoft. Old. Loved. Hated.
Subversion (http://subversion.tigris.org/). Open source. New. Neato.
CVS (https://www.cvshome.org/). Open source. Old. Reliable. Dusty.
Perforce (http://www.perforce.com/). Commercial. A competitor of SourceGear, but one that I admire.
BitKeeper (http://www.bitkeeper.com/). Commercial. Uses a distributed repository architecture, so I won't be talking about this one much.
Arch (http://www.gnu.org/software/gnu-arch/). Open source. Distributed repository architecture. Again, I spend most of my words here on tools with a centralized server.

This is a very incomplete list. There are many SCM tools, and I am not interested in trying to produce and maintain an accurate listing of them all.

Audience

I am writing about source control for programmers and web developers.

When we apply some of the concepts of source control to the world of traditional documents, the result is
called "document management". I'm not writing about any of those usage scenarios.

When we apply some of the concepts of source control to the world of graphic design, the result is called "asset
management". I'm not writing about any of those usage scenarios.

My audience here is the group of people who deal primarily with source code files or HTML files.

Warnings about my writing style

First of all, let me say a thing or two about political correctness. Through these articles, I will occasionally find
the need for gender-specific pronouns. In such situations, I generally try to use the male and female variants
of the words with approximately equal frequency.

Second of all, please accept my apologies if my dry sense of humor ever becomes a distraction from the
material. I am writing about source control and trying to make it interesting. That's like writing about sex and
trying to make it boring, so please cut me some slack if I try to make you chuckle along the way.

Looking Ahead
Source control is a large topic, so there is much to be said. I plan for the chapters of this series to be sorted
very roughly from the very basic to the very advanced. In the next chapter, I'll start by defining the most
fundamental terminology of source control.

Chapter 1: Basics
This is part of an online book called Source Control HOWTO (source_control.html) , a best practices guide
on source control, version control, and configuration management.

A tale of two trees

Our discussion of source control must begin by defining the basic terms and describing the basic operations.
Let's start by defining two important terms: repository and working folder.

An SCM tool provides a place to store your source code. We call this place a repository. The repository exists
on a server machine and is shared by everyone on your team.

Each individual developer does her work in a working folder, which is located on a desktop machine and
accessed using a client.

Each of these things is basically a hierarchy of folders. A specific file in the repository is described by its path,
just like we describe a specific file on the file system of your local machine. In Vault and SourceSafe, a
repository path starts with a dollar sign. For example, the path for a file might look like this:

$/trunk/src/myLibrary/hello.cs

The workflow of a developer is an infinite loop which looks something like this:

Copy the contents of the repository into a working folder.
Make changes to the code in the working folder.
Update the repository to incorporate those changes.
Repeat.

I've omitted certain details like staff meetings and vacations, but this loop essentially describes the life of a
developer who is working with an SCM tool. The repository is the official place where all completed work is
stored. A task is not considered to be completed until the repository contains the result of that task.

Let's imagine for a moment what life would be like without this distinction between working folder and
repository. In a single-person team, the situation could be described as tolerable. However, for any plurality
of developers, things can get very messy.

I've seen people try it. They store their code on a file server. Everyone uses Windows file sharing and edits the
source files in place. When somebody wants to edit main.cpp, they shout across the hall and ask if anybody
else is using that file. Their Ethernet is saturated most of the time because the developers are actually
compiling on their network drives. When we sell our source control tool to someone in this situation, I feel like
an ER doctor. I go home that night with a feeling of true contentment, because I know that I have saved a life.

With an SCM tool, working on a multi-person team is much simpler. Each developer has a working folder which is a private workspace. He can make changes to his working folder without adversely affecting the rest of the team.

Best Practice: Don't break the tree

The benefit of working folders is mostly lost if the contents of the repository become "broken". At all times, the contents of the repository should be in a state which allows everyone on the team to continue to work. If a developer checks in some code which won't build or won't pass the test suite, the entire team grinds to a halt.

Many teams have some sort of a social penalty which is applied to developers who break the tree. I'm not talking about anything severe, just a little incentive to remind developers to be careful. For example, require the guilty party to put a dollar in a glass jar. (Use the money to take the team to go see a movie after the product is shipped.) Another idea is to require the guilty developer to make the coffee every morning. The point is to make the developer feel embarrassed, but not punished.

Terminology note: Not all SCM tools use the exact terms I am using here. Many systems use the word "directory" instead of "folder". Some SCM tools, including SourceSafe, use the word "database" instead of "repository". In the context of Vault, these two words have a different meaning. Vault allows multiple repositories to exist within a single SQL database. For this reason, I use the word "database" only when I am referring to the SQL database.

In and Out

The repository exists on a server machine which is far away from the desktop machine containing the working folder where the developer does her work. The word "far" in the previous sentence is intended to
mean anything from a few centimeters to thousands of kilometers. The physical distance doesn't really
matter. The SCM tool provides the ability to communicate between the client and the server over TCP/IP,
whether the network is a local Ethernet or an Internet connection to another continent.

Because of this separation between working folder and repository, the most frequently used features of an SCM
tool are the ones which help us move things back and forth between them. Let's define some terms:

Add: A repository starts out completely empty, so we need to "Add" things to it. Using the "Add Files"
command in Vault you can specify files or folders on your desktop machine which will be added to the
repository.

Get: When we copy things from the repository to the working folder, we call that operation "Get". Note
that this operation is usually used when retrieving files that we do not intend to edit. The files in the
working folder will be read-only.

Checkout: When we want to retrieve files for the purpose of modifying them, we call that operation
"Checkout". Those files will be marked writable in our working folder. The SCM server will keep a record
of our intent.

Checkin: When we send changes back to the repository, we call that operation "Checkin". Our working
files will be marked back to read-only and the SCM server will update the repository to contain new
versions of the changed files.

Note that these definitions are merely starting points. The descriptions above correspond to the behavior of
SourceSafe and Vault (with its default settings). However, we will see later that other tools (such as CVS) work
somewhat differently, and Vault can optionally be configured in a mode which matches the behavior of CVS.
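To make these operations concrete, here is a toy sketch in Python of the client side of the checkout-edit-checkin behavior just described. The server object and its methods are hypothetical placeholders for illustration, not the API of Vault or any other tool.

    import os
    import stat

    class ToyScmClient:
        """Toy client-side model of Add/Get/Checkout/Checkin.
        Illustration only; the server methods are hypothetical."""

        def __init__(self, server, working_folder):
            self.server = server                  # hypothetical server proxy
            self.working_folder = working_folder

        def _local(self, repo_path):
            # Map a repository path like $/src/hello.cs onto the working folder
            return os.path.join(self.working_folder, repo_path.lstrip("$/"))

        def add(self, repo_path, local_file):
            # Add: put a file from the desktop machine into the repository
            with open(local_file) as f:
                self.server.add_file(repo_path, f.read())

        def get(self, repo_path):
            # Get: retrieve the latest version, read-only (no intent to edit)
            local = self._local(repo_path)
            if os.path.exists(local):
                os.chmod(local, stat.S_IREAD | stat.S_IWRITE)  # allow overwrite
            with open(local, "w") as f:
                f.write(self.server.fetch_latest(repo_path))
            os.chmod(local, stat.S_IREAD)

        def checkout(self, repo_path):
            # Checkout: record our intent to edit on the server, then make
            # the working file writable
            self.server.record_checkout(repo_path)
            self.get(repo_path)
            os.chmod(self._local(repo_path), stat.S_IREAD | stat.S_IWRITE)

        def checkin(self, repo_path, comment):
            # Checkin: send the changed file back, release the checkout on
            # the server, and return the working file to read-only
            with open(self._local(repo_path)) as f:
                self.server.store_new_version(repo_path, f.read(), comment)
            os.chmod(self._local(repo_path), stat.S_IREAD)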

Terminology note: Some SCM tools use these words a bit differently. Vault and SourceSafe use the word
"checkout" as a command which specifically communicates the intent to edit a file. For CVS, the "checkout"
command is used to retrieve files from the repository regardless of whether the user intends to edit the files or
not. Some SCM tools use the word "commit" instead of the word "checkin". Actually, Vault uses either of
these terms, for reasons that will be explained in a later chapter.

H.G. Wells would be proud

Your repository is more than just an archive of the current version of your code. Actually, it is an archive of
every version of your code. Your repository contains history. It contains every version of every file that has
ever been checked in to the repository. For this reason, I like to think of a source control tool as a time
machine.

The ability to travel back in time can be extremely useful for a software project. Suppose we need the ability to
retrieve a copy of our source code exactly as it looked on April 28th, 2002. An SCM tool makes this kind of
thing easy to do.

An even more common case is the situation where a piece of code looks goofy and nobody can figure out why.
It's handy to be able to look back at the history and understand when and why a certain change happened.

Over time, the complete history of a repository can become large and overwhelming, so SCM tools provide
ways to cope. For example, Vault provides a History Explorer which allows the history entries to be queried
and searched and sorted.

Perhaps more importantly, most SCM tools provide a feature called a "label" or a "tag". A label is basically a
way to mark a specific instant in the history of the repository with a meaningful name. The label makes it easy
to later retrieve a snapshot of exactly what the repository contained at that instant.
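Conceptually, a label is little more than a named pointer into the repository's timeline. Here is a minimal sketch of the idea in Python; the names are hypothetical, and real tools store labels more cleverly than this:

    labels = {}   # label name -> repository version number

    def apply_label(name, version):
        # Mark a specific instant in the repository's history with a name
        labels[name] = version

    def get_snapshot(repo, name):
        # Retrieve the tree exactly as it was when the label was applied;
        # repo.get_tree is a hypothetical stand-in for the real operation
        return repo.get_tree(version=labels[name])

    apply_label("release-2.0", 4512)   # hypothetical version number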

Looking Ahead

This chapter merely scratches the surface of what an SCM tool can provide, making brief mention of two
primary benefits:

Working folders provide developers with a private workspace which is distinct from the main repository.
Repository history provides a complete archive of every change and why it was made.

In the next chapter, I'll be going into much greater detail on the topic of checkins.

Chapter 2: Checkins
This is part of an online book called Source Control HOWTO (source_control.html) , a best practices guide
on source control, version control, and configuration management.

In this chapter, I will explore the various situations wherein a repository is modified, starting with the simplest
case of a single developer making a change to a single file.

Editing a single file

Consider the simple situation where a developer needs to make a change to one source file. This case is
obviously rather simple:

1. Checkout the file
2. Edit the working file as needed
3. Checkin the file

I won't talk much about step 2 here, as it doesn't really involve the SCM tool directly. Editing the file usually
involves the use of some other tools, like an integrated development environment (IDE).

But I do want to explore steps 1 and 3 in greater detail.

Step 1: Checkout

Checking out a file has two basic effects:

On the server, the SCM tool will remember the fact that you have the file checked out so that others may
be informed.
On your client, the SCM tool will prepare your working file for editing by changing it to be writable.

The server side of checkout

File checkouts are a way of communicating your intentions to others. When you have a file checked out, other
users can be aware and avoid making changes to that file until you are done with it. The checkout status of a
file is usually displayed somewhere in the user interface of the SCM client application. For example, in the
following screendump from Vault, users can see that I have checked out libsgdcore.cpp:
This screendump also hints at the fact there are actually two kinds of checkouts. The issue here is the question of whether two people can checkout a file at the same time. The answer varies across SCM tools. Some SCM tools can be configured to behave either way.

Sometimes the SCM tool will allow multiple people to checkout a file at the same time. SourceSafe and Vault both offer this capability as an option. When this "multiple checkouts" feature is used, things can get a bit more complicated. I'll talk more about this later.

If the SCM tool prevents anyone else from checking out a file which I have checked out, then my checkout is "exclusive" and may be described as a "lock". In the screendump above, the user interface is indicating that I have an exclusive lock on libsgdcore.cpp. Vault will allow no one else to checkout this file.

Best Practice: Use checkouts and locks carefully

It is best to use checkouts and locks only when you need them. A checkout discourages others from modifying a file, and a lock prevents them from doing so. You should therefore be careful to use these features only when you actually need them.

Don't checkout files just because you think you might need to edit them.
Don't checkout whole folders. Checkout the specific files you need.
Don't checkout hundreds or thousands of files at one time.
Don't hold exclusive locks any longer than necessary.
Don't go on vacation while holding exclusive locks on files.
The client side of checkout

On the client side, the effect of a checkout is quite simple: If necessary, the latest version of the file is retrieved
from the server. The working file is then made writable, if it was not in that state already.

All of the files in a working folder are made read-only when the SCM tool retrieves them from the repository. A
file is not made writable until it is checked out. This prevents the developer from accidentally editing a file.

Undoing a checkout
Normally, a checkout ends when a checkin happens. However, sometimes we checkout a file and subsequently
decide that we did not need to do so. When this happens, we "undo the checkout". Most SCM tools have a
command which offers this functionality. On the server side, the command will remove the checkout and
release any exclusive lock that was being held. On the client side, Vault offers the user three choices for how
the working file should be treated:

Revert: Put the working file back in the state it was in when I checked it out. Any changes I made
while I had the file checked out will be lost.
Leave: Leave the working file alone. This option will effectively leave the file in a state which we call
"Renegade". It is a bad idea to edit a file without checking it out. When I do so, Vault notices my
transgression and chastises me by letting me know that the file is "Renegade".
Delete: Delete the working file.

I usually prefer to work with "Revert" as my option for how the Undo Check Out command behaves.

Step 3: Checkin

After the file is checked out, the developer proceeds to make her changes. She edits the file and verifies that her change is correct. Having completed all this, she is ready to submit her changes to the repository. Doing so will make her change permanent and official. Submitting her changes to the repository is the operation we call "checkin".

The process of a checkin isn't terribly complicated:

1. The new version of the file is sent to the SCM server where it is stored.
2. The version number of the file in the repository is incremented by one.
3. The file is no longer considered to be checked out or locked.
4. The working file on the client side is made read-only again.

One issue does deserve special mention. Most SCM tools ask the user to enter a comment when making a checkin. This comment will be stored in the repository forever along with the changes being submitted. The comment provides a place for the developer to explain what was changed and why the change was made.

Best Practice: Explain your checkins completely

Every SCM tool provides a way to associate a comment when checking changes into the repository. This comment is important. If we consistently use good checkin comments, our repository's history contains not only every change we have ever made, but it also contains an explanation of why those changes happened. These kinds of records can be invaluable later as we forget things.

I believe developers should be encouraged to enter checkin comments which are as long as necessary to explain what is going on. Don't just type "minor change". Tell us what the minor change was. Don't just tell us "fixed bug 1234". Tell us what bug 1234 is and tell us a little bit about the changes that were necessary to fix it.

The following screendump shows the checkin dialog box from Vault:
Checkins are additive

It is reassuring to remember one fundamental axiom of source control: Nothing is ever destroyed. Let us
suppose that we are editing a file which is currently at version 4. When we checkin our changes, our new
version of the file becomes version 5. Clients will be notified that the latest version is now 5. Clients that are
still holding version 4 in their working folder will be warned that the file is now "Old".

But version 4 is still there. If we ask the server for the latest version, we will get 5. But if we specifically ask for version 4, or for any previous version, we can still get it.

Each checkin adds to the history of our repository. We never subtract anything from that history.
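To see why this works, here is a minimal sketch of the additive history implied by the checkin steps above. Again, the names are hypothetical and this is not how any real tool stores its data:

    class ToyRepositoryFile:
        """Toy model of one file's history on the server. Checkins only
        append; no version is ever destroyed."""

        def __init__(self):
            self.history = []   # history[n - 1] holds the content of version n

        def checkin(self, content, comment):
            # Store the new content; the version number increments simply
            # because the history list grew by one entry
            self.history.append((content, comment))
            return len(self.history)          # the new version number

        def get(self, version=None):
            # Ask for the latest version, or specifically ask for any
            # previous version -- it is still there
            if version is None:
                version = len(self.history)   # default to the latest
            content, _comment = self.history[version - 1]
            return content

With this model, checking in version 5 does not touch version 4 at all; a client that asks for version 4 still gets exactly what it held before.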

Other kinds of checkins

We will informally use the word "checkin" to refer to any change which is made to the repository. It is
common for a developer to say, "I made some checkins this afternoon to fix that bug", using the word
"checkin" to include any of the following types of changes to the repository:

Create a new folder
Add a file
Rename a file or folder
Delete a file or folder
Move a file or folder

It may seem odd to refer to these operations using the word "checkin", because there is no corresponding
"checkout" step. However, this looseness is typical of the way people use the word "checkin", so you'll get used
to it.

I will take this opportunity to say a few things about how these operations behave. If we conceptually think of a
folder as a list of files and subfolders, each of these operations is actually a modification of a folder. When we
create a folder inside folder A, then we are modifying folder A to include a new subfolder in its list. When we
rename a file or folder, the parent folder is being modified.
Just as the version number of a file is incremented when we modify it, these folder-level changes cause the
version number of a folder to be incremented. If we ask for the previous version of a folder, we can still
retrieve it just the way it was before. The renamed file will be back to the old name. The deleted file will
reappear exactly where it was before.

It may bother you to realize that the "delete" command in your SCM tool doesn't actually delete anything.
However, you'll get used to it.

Atomic transactions

I've been talking mostly about the simple case of making a change to a single source code file. However, most
programming tasks require us to make multiple repository changes. Perhaps we need to edit more than one
file to accomplish our task. Perhaps our task requires more than just file modifications, but also folder-level
changes like the addition of new files or the renaming of a file.

When faced with a complex task that requires several different operations, we would like to be able to submit
all the related changes together in a single checkin operation. Although tools like SourceSafe and CVS do not
offer this capability, some source control systems (like Vault and Subversion) do include support for "atomic
transactions".

The concept is similar to the behavior of atomic transactions in a SQL database. The Vault server guarantees that all operations within a transaction will stay together. Either they will all succeed, or they will all fail. It is impossible for the repository to end up in a state with only half of the operations done. The integrity of the repository is assured.

Best Practice: Group your checkins logically

I recommend that each transaction you check into the repository should correspond to one task. A "task" might be a bug fix or a feature. Include all of the repository changes which were necessary to complete that task, and nothing else. Avoid fixing multiple bugs in a single checkin transaction.

To ensure that a transaction can contain all kinds of operations, Vault supports the notion of a pending change set. Essentially, the Vault client keeps a running list of changes you have made which are waiting to be
sent to the server. When you invoke the Delete command, not only will it not actually delete anything, but it
doesn't even send the command to the server. It merely adds the Delete operation to the pending change set,
so that it can be sent later as part of a group.

In the following screen dump, my pending change set contains three operations. I have modified
libsgdcore.cpp. I have renamed libsgdcore.h to headerfile.h. And I have deleted libsgdcore_diff_file.c.
Note that these operations have not actually happened yet. They won't happen unless I submit them to the
server, at which time they will take place as a single atomic transaction.

Vault persists the pending change set between sessions. If I shutdown my Vault client and turn off my
computer, next time I launch the Vault client the pending change set will contain the same items it does now.
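Here is a minimal sketch of the pending change set idea. The names are hypothetical rather than Vault's actual API; the point is that operations accumulate on the client and travel to the server as one all-or-nothing group:

    class PendingChangeSet:
        """Toy client-side pending change set; the names are hypothetical.
        Operations queue up locally and are submitted as one atomic
        transaction."""

        def __init__(self, server):
            self.server = server
            self.operations = []   # queued operations, not yet sent anywhere

        def modify(self, repo_path):
            self.operations.append(("modify", repo_path))

        def rename(self, old_path, new_path):
            self.operations.append(("rename", old_path, new_path))

        def delete(self, repo_path):
            # Nothing is deleted yet, and nothing is sent to the server;
            # the operation merely joins the pending list
            self.operations.append(("delete", repo_path))

        def commit(self, comment):
            # The server applies every queued operation in one transaction:
            # either all of them succeed, or none of them do
            self.server.apply_transaction(self.operations, comment)
            self.operations = []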

The Church of "Edit-Merge-Commit"

Up until now, I have explained everything about checkouts and checkins in a very "matter of fact" fashion. I
have claimed that working files are always read-only until they are checked out, and I have claimed that files
are always checked out before they are checked in. I have made broad generalizations and I have explained
things in terms that sound very absolute.

I lied.

In reality, there are two very distinct doctrines for how this basic interaction with an SCM tool can work. I
have been describing the doctrine I call "checkout-edit-checkin". Reviewing the simple case when a developer
needs to modify a single file, the practice of this faith involves the following steps:

1. Checkout the file
2. Edit the working file as needed
3. Checkin the file

Followers of the "checkout-edit-checkin" doctrine are effectively submitting to live according to the following
rules:

Files in the working folder are read-only unless they are checked out.
Developers must always checkout a file before editing it. Therefore, the entire team always knows who
is editing which files.
Checkouts are made with exclusive locks, so only one developer can checkout a file at one time.

This approach is the default behavior for SourceSafe and for Vault. However, CVS doesn't work this way at
all. CVS uses the doctrine I call "edit-merge-commit". Practicers of this religion will perform the following
steps to modify a single file:

1. Edit the working file as needed
2. Merge any recent changes from the server into the working file
3. Commit the file to the repository

The edit-merge-commit doctrine is a liberal denomination which preaches a message of freedom from
structure. Its followers live by these rules:

Files in the working folder are always writable.
Nobody uses checkouts at all, so nobody knows who is editing which files.
When a developer commits his changes, he is responsible for ensuring that his changes were made
against the latest version in the repository.

As I said, this is the approach which is supported by CVS. Vault supports edit-merge-commit as an option. In
fact, when this option is turned on, we informally say that Vault is running in "CVS mode".

Each of these approaches corresponds to a different style of managing concurrent development on a team.
People tend to have very strong feelings about which style they prefer. The religious flame war between these
two churches can get very intense.

Holy Wars

The "checkout-edit-checkin" doctrine is obviously more traditional and conservative. When applied strictly, it
is impossible for two people to modify a given file at the same time, thus avoiding the necessity of merging two
versions of a file into one.

The "edit-merge-commit" teaches a lifestyle which is riskier. The risk is that the merge step may be tedious or
cause problems. However, the acceptance of this risk rewards us with a concurrent development style which
causes developers to trip over each other a lot less often.

Still, these risks are real, and we will not flippantly disregard them. A detailed discussion of file merging
appears in the next chapter. For now I will simply mention that most SCM tools include features that can
safely do a three-way merge automatically. Not all developers are willing to trust this feature, but many do.

So, when using the "edit-merge-commit" approach, the merge must happen, and we are left with two choices:

Attempt the automerge. (can be scary)
Merge the files by hand. (can be tedious)

Developers who prefer "checkout-edit-checkin" often find both of these choices to be unacceptable.

I will confess that I am a disciple of the edit-merge-commit religion. People who use edit-merge-commit often say that they cannot imagine going back to what life was like before. I agree.

It is so very convenient to never be required to checkout a file. All the files in my working folder are always writable. If I want to start working on a bugfix or a feature, I simply open a text editor and begin making my changes.

This benefit is especially useful when I am disconnected from the server. When people ask me about the best way to use Vault while "offline", I tell them to consider using edit-merge-commit. Since I don't have to contact the server to checkout a file, I can simply proceed with my changes. The only time I need the server is when it comes time to merge and commit.

Best Practice: Get the best of both worlds

Here at SourceGear we are quite proud of the fact that Vault allows each developer to choose their own concurrent development style. Developers who prefer "checkout-edit-checkin" can work that way. Developers who prefer "edit-merge-commit" can use that approach, and they still have exclusive locks available to them for those times when they are needed. As far as I know, Vault is the only product that offers this flexibility.

I apologize for this completely shameless plug. I won't do it very often.

As I said, automerge is amazingly safe in practice. Thousands of teams use it every day without incident. I
have been actively using edit-merge-commit as my development style for over five years, and I cannot
remember a situation where automerge produced an incorrect file. Experience has made me a believer.

Looking Ahead

In the next chapter, I will be talking in greater detail about the process of merging two modified versions of a
file.

Chapter 3: File Merge
This is part of an online book called Source Control HOWTO (source_control.html) , a best practices guide
on source control, version control, and configuration management.

How did we get ourselves into this mess?

There are several reasons why we may need to merge two modified versions of a file:

When using "edit-merge-commit" (sometimes called "optimistic locking"), it is possible for two
developers to edit the same file at the same time.
Even if we use "checkout-edit-checkin", we may allow multiple checkouts, resulting once again in the
possibility of two developers editing the same file.
When merging between branches, we may have a situation where the file has been modified in both
branches.

In other words, this mess only happens when people are working in parallel. If we serialize the efforts of our
team by never branching and never allowing two people to work on a module at the same time, we can avoid
ever facing the need to merge two versions of a file.

However, we want our developers to work concurrently. Think of your team as a multithreaded piece of
software, each developer running in its own thread. The key to high performance in a multithreaded system is
to maximize concurrency. Our goal is to never have a thread which is blocked on some other thread.

So we embrace concurrent development, but the threading metaphor continues to apply. Multithreaded
programming can sometimes be a little bit messy, and the same can be said of a multithreaded software
team. There is a certain amount of overhead involved in things like synchronization and context switching.
This overhead is inevitable. If your team is allowing concurrent development to happen, it will periodically
face a situation where two versions of a file need to be merged into one.

In rare cases, the situation can be properly resolved by simply choosing one version of the file over the other.
However, most of the time, we actually need to merge the two versions to create a new version.

What do we do about it?

Let's carefully state the problem as follows: We have two versions of a file, each of which was derived from the
same common ancestor. We sometimes call this common ancestor the "original" file. Each of the other
versions is merely the result of someone applying a set of changes to the original. What we want to create is a
new version of the file which is conceptually equivalent to starting with the original and applying both sets of
changes. We call this process "merging".

The difficulty of doing this merge varies greatly for different types of files. How would we perform a merge of
two Excel spreadsheets? Two PNG images? Two files which have digital signatures? In the general case, the
only way to merge two modified versions of a file is to have a very smart person carefully construct a new copy
of the file which properly incorporates the correct elements from each of the other two.

However, in software and web development there is a special case which is very common. As luck would have
it, most source code files are plain text files with an average of less than 80 characters per line. Merging files
of this kind is vastly simpler than the general case. Many SCM tools contain special features to assist with this
sort of a merge. In fact, in a majority of these cases, the two files can be automatically merged without
requiring the manual effort of a developer.

An example

Let's call our two developers Jane and Joe. Both of them have retrieved version 4 of the same file and both of
them are working on making changes to it.

One of these developers will checkin before the other one. Let's assume it is Jane who gets there first. When
Jane tries to checkin her changes, nothing unusual will happen. The current version of the file is 4, and that
was the version she had when she started making her changes. In other words, version 4 was her baseline for
these changes. Since her baseline matches the current version, there is no merge necessary. Her changes are
checked in, and a new version of the file is created in the repository. After her checkin, the current version of the
file is now 5.

The responsibility for merging is going to fall upon Joe. When he tries to checkin his changes, the SCM tool
will protest. His baseline version is 4, but the current version in the repository is now 5. If Joe is allowed to
checkin his version of the file, the changes made by Jane in version 5 will be lost. Therefore, Joe will not be
allowed to checkin this file until he convinces the SCM tool that he has merged Jane's version 5 changes into
his working copy of the file.

Vault reports this situation by setting the status on this file to be "Needs Merge", as shown in the screen dump
below:

In order to resolve this situation, Joe effectively needs to do a three-way comparison between the following three versions of the file:

Version 4 (the baseline from which he and Jane both started)
Version 5 (Jane's version)
Joe's working file (containing his own changes)

Best Practice: Keep the repository in sight

This example happens to involve the need to merge only a single checkin. Since Joe's baseline is 4 and the current repository version is 5, Joe is only 1 version out of date. If the repository version were 25 instead of 5, then Joe would be 21 versions out of date instead of just 1, but the technique is the same. No matter how old his baseline is, Joe still needs to retrieve the latest version and do a three-way merge. However, the older his baseline, the more likely he is to encounter conflicts in the merge.

Keep in touch with the repository. Update your working folder as often as you can without interrupting your own work. Commit your work to the repository as often as you can without breaking the build. It isn't wise to let the distance between your working folder and the repository grow too large.

Version 4 is the common ancestor for both Joe's version and Jane's version of the file. By running a diff between version 4 and version 5, Joe can see exactly what changes Jane made. He can use this information to apply those changes to his own version of the file. Once he has done so, he can credibly claim that his version is a merge of his changes and Jane's.

Strictly speaking, Joe is responsible for whatever changes Jane made, regardless of how difficult the merge may be. He must perform the changes to his file that Jane would have made if she had started with his
file instead of with version 4. In theory, this could be very difficult:
What happens if Jane changed some of the same lines that Joe changed, but in different ways?
What happens if Jane's changes are functionally incompatible with Joe's?
What happens if Jane made a change to a C# function which Joe has deleted?
What happens if Jane changed 80 percent of the lines in the file?
What happens if Jane and Joe each changed 80 percent of the lines in the file, but each did so for
entirely different reasons?
What happens if Jane's intent was not clear and she cannot be reached to ask questions?

All of these situations are possible, and all of them are Joe's responsibility. He must incorporate Jane's
changes into his file before he can checkin a version 6.

In certain rare situations, Joe may examine Jane's changes and realize that his version needs nothing from
Jane's version 5. Maybe Jane's change simply isn't relevant anymore. In these cases, the merge isn't needed,
and Joe can simply declare the merge to be resolved without actually doing anything. This decision remains
subject to Joe's judgment.

However, most of the time it will be necessary for the merge to actually happen. In these cases, Joe has the
following options:

Attempt to automerge
Use a visual merge tool
Redo one set of changes by hand

Each of these will be explained further in the sections below.

Attempt to automerge

As I mentioned above, a surprising number of cases can be easily handled automatically. Most source control
tools include the ability to attempt an automatic merge. The algorithm uses all three of the involved versions
of the file and attempts to safely produce a merged version.
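To show the flavor of such an algorithm, here is a simplified line-based three-way merge in Python. It is a sketch of the general technique, not the algorithm of any particular tool:

    import difflib

    def _edits(base, derived):
        # The changes (i1, i2, new_lines) that turn base[i1:i2] into new_lines
        sm = difflib.SequenceMatcher(None, base, derived)
        return [(i1, i2, derived[j1:j2])
                for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

    def three_way_merge(base, ours, theirs):
        """Merge two sets of changes made against a common ancestor.
        Returns (merged_lines, had_conflict); if had_conflict is True,
        discard the result and merge by hand or with a visual tool."""
        edits = sorted(_edits(base, ours) + _edits(base, theirs))
        merged, pos, prev_end, had_conflict = [], 0, -1, False
        for i1, i2, new_lines in edits:
            if i1 < prev_end:            # both sides touched these base lines
                had_conflict = True      # a real tool would emit conflict markers
                continue
            merged.extend(base[pos:i1])  # copy the unchanged region
            merged.extend(new_lines)     # apply one side's change
            pos, prev_end = i2, i2
        merged.extend(base[pos:])
        return merged, had_conflict

    base = ["// Global\n", "int x = 1;\n"]
    jane = ["// Worldwide\n", "int x = 1;\n"]   # Jane edits the comment
    joe  = ["// Global\n", "int x = 2;\n"]      # Joe edits the code
    print(three_way_merge(base, jane, joe))
    # (['// Worldwide\n', 'int x = 2;\n'], False) -- both changes, no conflict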

The reason that automerge is so safe in practice is that the algorithm is extremely conservative. Automerge will refuse to produce a merged version if Joe's changes and Jane's changes appear to be in conflict. In the most obvious case, if Joe and Jane both modified the same line, automerge will detect this "conflict" and refuse to proceed. In other cases, automerge may fail with conflicts if two changes are too close to each other.

Best Practice: Only use "automerge on get"

It is widely accepted that SCM tools should only attempt automerge on the "get" of a file. In other words, when Joe realizes that he must merge in the changes Jane made between version 4 and version 5, he will tell his SCM client application to "get" version 5 and attempt to automatically merge it into his working file. CVS, Subversion and Vault all function in this manner.

Unfortunately, SourceSafe attempts to "automerge on checkin". This is just a really bad idea. When Joe tries to checkin his changes, SourceSafe attempts the automerge. If it believes that it has succeeded, then his changes are checked in and version 6 is created. However, it is possible that Joe never examined version 6, or even compiled it. The repository now contains a file which has never existed in the working folder of any developer on earth. Its contents have never been seen by human eyes, and it has never been run through a compiler. Automerge is safe, but it's not that safe.

It is much better to "automerge on get". This way, the developer can (and should) examine the file after the automerge has happened. This simple change makes it easier to trust automerge. Instead of trying to do the developer's job, automerge simply becomes a tool which the developer can use to get his job done faster.

Use a visual merge tool

In cases where automerge cannot automatically resolve conflicts, we can use a visual merge tool to make the job easier. These tools provide a visual display which shows all three files and highlights exactly what has changed. This makes it much easier for the developer to perform the merge, since she can zero in on the conflicts very quickly.

There are several excellent visual merge tools available, including Guiffy (http://www.guiffy.com/) and Araxis Merge (http://www.araxis.com/). The following screen dump is from "SourceGear DiffMerge", the visual merge tool which is included with Vault. (Please note that sometimes I have to reduce the size of screen dumps to make them fit. In those cases, you can click on the image to see it at full resolution).
(screendumps/scm_diffmerge_1.gif)

This picture is typical of other three-way visual merge applications. The left pane shows Jane's version of the
file. The right pane shows Joe's version. The center pane shows the original file, the common ancestor from
which they both started to make changes. As you can see, Jane and Joe have each inserted a one-line
comment. By right-clicking on each change, the developer can choose whether to apply that change to the
middle pane. In this example, the two changes don't conflict. There is no reason that the resulting file cannot
incorporate both changes.

The following picture shows an example of changes which are conflicting.

(screendumps/scm_diffmerge_2.gif)

Both Jane and Joe have tried to change the wording of this comment. In the original file, the word used in the
comment was "Global". Jane decided to change this word to "Worldwide", but Joe has changed it to the word
"Rampant". These two changes are conflicting, as indicated by the yellow background color being used to
display them. Automerge cannot automatically handle cases like these. Only a human being can decide which
change to keep.

The visual merge tool makes it easy to handle this situation. I can decide which change I want to keep and
apply it to the center pane.
A visual merge tool can make file merging a lot easier by quickly showing the developer exactly what has
changed and allowing him to specify which changes should be applied to get the final merged result.

However, as useful as these kinds of tools can be, they're not magic.

Redo one set of changes by hand

Some situations are so complicated that a visual merge tool just isn't very helpful. In the worst case scenario,
Joe might have to manually redo one set of changes.

This situation recently happened here at SourceGear. We currently have Vault development happening in two
separate branches:

When we shipped version 2.0, we created a branch for maintenance of the 2.0 release. This is the tree
where we develop minor bug fix releases like 2.0.1.
Our "trunk" is the place where active development of the next major release is taking place.

Obviously we want any bug fixes that happen in the 2.0 branch to also happen in the trunk so that they can be
included in our upcoming 2.1 release. We use Vault's "Merge Branches" command to migrate changes from
one place to the other.

I will talk more about branching and merging in a later chapter. For now, suffice it to say that the merging of
branches can create exactly the same kind of three-way merge situation that we've been discussing in this
chapter.

In this case, we ended up with a very difficult merge in the sections of code that deal with logins.

In the 2.0 branch, we implemented a fix to prevent dictionary attacks on passwords. We considered this
a bug fix, since it is related to the security of our product. In concept this change was simple. We
simply block login for any account which is seeing too many failed login attempts. However,
implementing this mini-feature required a surprising number of lines to be changed.
In the trunk, we added the ability for Vault to authenticate logins against Active Directory.

In other words, we made substantial changes to the login code in both these branches. When it came time to
merge, the DiffMerge was extremely colorful.

In this case, it was actually simpler to just start with the trunk version and reimplement the dictionary attack
code. This may seem crazy, but it's actually not that bad. Redoing the changes takes a lot less time than
coding the feature the first time. We could still copy and paste code from the 2.0 version.

Getting back to the primary example, Joe has a choice to make. His current working file already contains his
own set of changes. He could therefore choose to redo Jane's change starting with his current working file.
The problem here is that he might not really know how. He might have no idea what Jane's approach was.
Jane's office might be 10,000 miles away. Jane might have written a lousy comment explaining her checkin.

As an alternative, Joe could set aside his working file, start with the latest repository version and redo his own
changes.

Bottom line: If a merge gets this bad, it takes some time and care to resolve it properly. Luckily, this situation
doesn't happen very often.

Verifying the merge

Regardless of which of the above methods is used to complete the merge, it is highly recommended for Joe to
verify the correctness of his work. Obviously he should check that the entire source tree still compiles. If a test
suite is available, he should build and verify that the tests still pass.

After Joe has completed the merge and verified it, he can declare the merge to be "resolved", after which the
SCM tool will allow him to checkin the file. In the case of Vault, this is done by using the Resolve Merge Status
command, which explicitly tells the Vault client application that the merge is completed. At this time, Vault
would change the baseline version number from 4 to 5, indicating that as far as anyone knows, Joe made his
changes by starting with version 5 of the file, not with version 4.
Since his baseline version now matches the current version of the file, the Vault server will now allow Joe to do
his checkin.

Worth the trouble

I hope I have not scared you away from concurrent development by explaining the gory details of merging files. In fact, my goal is quite the opposite.

Remember that easily-resolved merges are the most common case. Automerge handles a large percentage of situations with no problems at all. A large percentage of the remaining cases can be easily handled with a visual merge tool. The difficult situations are rare, and can still be handled easily by a developer who is patient and careful.

Many software teams have discovered that the tradeoff here is worth the trouble. Concurrent development can bring substantial gains in the productivity of a team. The extra effort to deal with merge situations is usually a small price to pay.

Best Practice: Give concurrent development a try

Many teams avoid all forms of concurrent development. Their entire team uses "checkout-edit-checkin" with exclusive locks, and they never branch.

For some small teams, this approach works just fine. However, the larger your team, the more frequently a developer becomes "blocked" by having to wait for someone else.

Modern source control systems are designed to make concurrent development easy. Give them a try.

Looking Ahead

In the next chapter I will be discussing the concept of a repository in a lot more detail.

Chapter 4: Repositories
This is part of an online book called Source Control HOWTO (source_control.html) , a best practices guide
on source control, version control, and configuration management.

Cars and clocks

In previous chapters I have mentioned the concept of a repository, but I haven't said much further about it. In
this chapter, I want to provide a lot more detail. Please bear with me as I spend a little time talking about how
an SCM tool works "under the hood". I am doing this because an SCM tool is more like a car than a clock.

An SCM tool is not like a clock. Clock users have no need to know how a clock works inside. We just
want to know what time it is. Those who understand the inner workings of a clock cannot tell time any
more skillfully than the rest of us.
An SCM tool is more like a car. Lots of people do use cars without knowing how they work. However,
people who really understand cars tend to get better performance out of them.

Rest assured that this book is still a "HOWTO". My goal here remains to create a practical explanation of
how to do source control. However, I believe that you can use an SCM tool more effectively if you know a little
bit about what's happening inside.

Repository = File System * Time

A repository is the official place where you store all your source code. It keeps track of all your files, as well as
the layout of the directories in which they are stored. It resides on a server where it can be shared by all the
members of your team.

But there has to be more. If the definition in the previous paragraph were the whole story, then an SCM
repository would be no more than a network file system. A repository is much more than that. A repository
contains history.

A file system is two-dimensional: its space is defined by directories and files. In contrast, a repository is three-
dimensional: it exists in a continuum defined by directories, files and time. An SCM repository contains every
version of your source code that has ever existed. The additional dimension creates some rather interesting
challenges in the architecture of a repository and the decisions about how it manages data.

How do we store all those old versions of everything?

As a first guess, let's not be terribly clever. We need to store every version of the source tree. Why not just
keep a complete copy of the entire tree for every change that has happened?

We obviously use Vault as the SCM tool for our own development of Vault. We began development of Vault in
the fall of 2001. In the summer of 2002, we started "dogfooding". On October 25th, 2002, we abandoned our
repository history and started a fresh repository for the core components of Vault. Since that day, this tree has
been modified 4,686 times.

This repository contains approximately 40 MB of source code. If we chose to store the entire tree for every
change, those 4,686 copies of the source tree would consume approximately 183 GB, without compression. At
today's prices for disk space, this option is worth considering.

However, this particular repository is just not very large. We have several others as well, but the sum total of
all the code we have ever written still doesn't qualify as "large". Many of our Vault customers have trees which
are a lot bigger.

As an example, consider the source tree for OpenOffice.org. This tree is approximately 634 MB. Based on
their claim of 270 developers and the fact that their repository is almost four years old, I'm going to
conservatively estimate that they have made perhaps 20,000 checkins. So, if we used the dumb approach of
storing a full copy of their tree for every change, we'll need around 12 TB of disk space. That's 12 terabytes
(http://dictionary.reference.com/search?q=terabytes) .
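To make the arithmetic concrete, here is a small C# sketch of the "keep a full copy of everything" math, using only the numbers given above:

using System;

class StorageEstimate
{
    // Naive cost of keeping a full copy of the tree for every checkin:
    // tree size times number of checkins, with no compression or sharing.
    static double FullCopyGigabytes(double treeSizeMB, int checkins)
    {
        return treeSizeMB * checkins / 1024.0;
    }

    static void Main()
    {
        // The Vault repository described above: ~40 MB, 4,686 checkins.
        Console.WriteLine("Vault core: {0:F0} GB",
            FullCopyGigabytes(40, 4686));            // ~183 GB

        // The OpenOffice.org estimate: ~634 MB, ~20,000 checkins.
        Console.WriteLine("OpenOffice.org: {0:F1} TB",
            FullCopyGigabytes(634, 20000) / 1024.0); // ~12 TB
    }
}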

At this point, the argument that "disk space is cheap" starts to break down. The disk space for 12 TB of data is
cheaper than it has ever been in the history of the planet. But this is mission critical data. We have to consider
things like performance and backups and RAID and administration. The cost of storing 12 TB of ultra-
important data is more than just the cost of the actual disk platters.

So we actually do have an incentive to store this information a bit more efficiently. Fortunately, there is an
obvious reason why this is going to be easy to do. We observe that tree N is often not terribly different from
tree N-1. By definition, each version of the tree is derived from its predecessor. A checkin might be as simple
as a one-line fix to a single file. All of the other files are unchanged, so we don't really need to store another
copy of them.

So, we don't want to store the full contents of the tree for every single change. Instead, we want a way to store a
tree represented as a set of changes to another tree. We call this a "delta".

Delta direction

As we decide to store our repositories using deltas, we must be concerned about performance. Retrieving a
tree which is in a deltified representation requires more effort than retrieving one which is stored in full. For
example, let's suppose that version 1 of the tree is stored in full, but every subsequent revision is represented as
a delta from its predecessor. This means that in order to retrieve version 4,686, we must first retrieve version
1 and then apply 4,685 deltas. Obviously, this approach would mean that retrieving some versions will be
faster than others. When using this approach we say that we are using "forward deltas", because each delta
expresses the set of changes from one version to the next.

We observe that not all versions of the tree are equally likely to be retrieved. For example, version 83 of the
Vault tree is not special in any way. It is likely that we have not retrieved that version in over a year. I suspect
that we will never retrieve it again. However, we retrieve the latest version of the tree many times per day. In
fact, as a broad generalization, we can say that at any given moment, the most recent version of the tree is
probably the most likely one to be needed.

The simplistic use of forward deltas delivers its worst performance for the most common case. Not good.

Another idea is to use "reverse deltas". In this approach, we store the most recent tree in full. Every other tree
N is represented as a set of differences from tree N+1. This approach delivers its best performance for the
most common case, but it can still take an awfully long time to retrieve older trees.

Some SCM tools use some sort of a compromise design. In one approach, instead of storing just one full tree
and representing every other tree as a delta, we sprinkle a few more full trees along the way. For example,
suppose that we store a full tree for every 10th version. This approach uses more disk space, but the SCM
server never has to apply more than 9 deltas to retrieve any tree.
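The delta-counting arithmetic for these three schemes can be sketched in a few lines of C#. This is just the math from the paragraphs above; no real tool is obligated to work exactly this way:

class DeltaMath
{
    // Forward deltas: a full copy at version 1, one delta per later version.
    static int ForwardDeltas(int v) { return v - 1; }

    // Reverse deltas: a full copy of the latest version, deltas going backward.
    static int ReverseDeltas(int v, int latest) { return latest - v; }

    // Compromise: a full tree every 'interval' versions (1, 11, 21, ... for
    // interval 10), so no retrieval ever applies more than interval - 1 deltas.
    static int KeyframeDeltas(int v, int interval) { return (v - 1) % interval; }
}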

What is a delta?

I've been throwing around this concept of deltas, but I haven't stopped to describe them.

A tree is a hierarchy of folders and files. A delta is the difference between two trees. In theory, those two trees
do not need to be related. However, in practice, the only reason we calculate the difference between them is
because one of them is derived from the other. Some developer started with tree N and made one or more
changes, resulting in tree N+1.

We can think of the delta as a set of changes. In fact, many SCM tools use the term "changeset" for exactly this
purpose. A changeset is merely a list of the changes which express the difference between two trees.

For example, let's suppose that Wilbur starts with tree N and makes the following changes:

1. He deletes $/top/subfolder/foo.c because it is no longer needed.
2. He edits $/top/subfolder/Makefile to remove foo.c from the list of filenames
3. He edits $/top/bar.c to remove all the calls to the functions in foo.c
4. He renames $/top/hello.c and gives it the new name hola.c
5. He adds a new file called feature_creep.c to $/top/
6. He edits $/top/Makefile to add feature_creep.c to the list of filenames
7. He moves $/top/subfolder/readme.txt into $/top

At this point, he commits all of these changes to the repository as a single transaction. When the SCM server
stores this delta, it must remember all of these changes.

For changeset item 1 above, the delete of foo.c is easily represented. We simply remember that foo.c existed in
tree N but does not exist in tree N+1.

For changeset item 4, the rename of hello.c is a bit more complex. To handle renames, we need each object in
the repository to have an identifier which never changes, even when the name or location of the item changes.

For changeset item 7, the move of readme.txt is another example of why repositories need IDs for each item.
If we simply remember every item by its path, we cannot remember the occasions when that path changes.

Changeset item 5 is going to be a lot bulkier than some of the other items here. For this item we need to
remember that tree N+1 has a file called feature_creep.c which was never present in tree N. However, a full
representation of this changeset item needs to contain the entire contents of that file.

Changeset items 2, 3 and 6 represent situations where a file which already existed has been modified in some
way. We could handle these items the same way as item 5, by storing the entire contents of the new version of
the file. However, we will be happier if we can do deltas at the file level just as we are doing deltas at the tree
level.
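For illustration only, a hypothetical in-memory representation of such a changeset might look like the following C#. Every name here is made up, not taken from any particular SCM tool, but it shows why each item carries a permanent ID: that is how the rename and the move (items 4 and 7 above) can be remembered even though the path changes.

using System.Collections.Generic;

enum ChangeKind { Add, Edit, Delete, Rename, Move }

class Change
{
    public ChangeKind Kind;
    public long ItemId;      // never changes for the life of the item
    public string OldPath;   // e.g. "$/top/hello.c"
    public string NewPath;   // e.g. "$/top/hola.c" for the rename
    public byte[] Payload;   // full contents for an Add, a file delta for an Edit
}

// Wilbur's checkin becomes a List<Change> with seven entries, committed
// to the repository as a single transaction.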

File deltas

A file delta merely expresses the difference between two files. Once again, the reason we calculate a file delta
is because we believe it will be smaller than the file itself, usually because one of the files is derived from the
other.

For text files, a well-known approach to the file delta problem is to compare line-by-line and output a list of lines which have been modified, inserted or deleted. This is the same kind of result produced by the Unix 'diff' command. The bad news is that this approach only works for text files. The good news is that software developers and web developers have a lot of text files.

CVS and Perforce use this approach for repository storage. Text files are deltified using a line-oriented diff.
Binary files are not deltified at all, although Perforce does reduce the penalty somewhat by compressing them.

Subversion and Vault are examples of tools which use binary file deltas for repository storage. Vault uses a file
delta algorithm called VCDiff, as described in RFC 3284 (http://www.faqs.org/rfcs/rfc3284.html) . This
algorithm is byte-oriented, not line-oriented. It outputs a list of byte ranges which have been changed. This
means it can handle any kind of file, binary or text. As an ancillary benefit, the VCDiff algorithm compresses
the data at the same time.

Binary deltas are a critical feature for some SCM tool users, especially in situations where the binary files are
large. Consider the case where a user checks out a 10 MB file, changes a few bytes, and checks it back in. In
CVS, the size of the repository will increase by 10 MB. In Subversion and Vault, the repository will only grow
by a small amount.

Deltas and diffs are different

Please note that I make a distinction between the terms "delta" and "diff".

A "delta" is the difference between two versions. If we have one full file and a delta, then we can
construct the other full file. A delta is used primarily because it is smaller than the full file, not because
it is useful for a human being to read. The purpose of a delta is efficiency. When deltas are done at the
level of bytes instead of textual lines, that efficiency becomes available to all kinds of files, not just text
files.
A "diff" is the human-readable difference between two versions of a text file. It is usually line-oriented,
but really cool visual diff tools can also highlight the specific characters on a line which differ. The
purpose of a diff is to show a developer exactly what has changed between two versions of a file. Diffs
are really useful for text files, because human beings tend to read text files. Most human beings don't
read binary files, and human-readable diffs of binary files are similarly uninteresting.

As mentioned above, some SCM tools use binary deltas for repository storage or to improve performance over
slow network lines. However, those tools also support textual diffs. Deltas and diffs serve two distinct
purposes, both of which are important. It is merely coincidence that some SCM tools use textual diffs as their
repository deltas.

The evolution of source control technology

At this point I should admit that I have presented a somewhat idealized view of the world. Not all SCM tools
work the way I have described. In fact, I have presented things exactly backwards, discussing tree-wide deltas
before file deltas. That is not the way the history of the world unfolded.

Prehistoric ancestors of modern programmers had to live with extremely primitive tools. Early version control systems like RCS only handled file deltas. There was no way for the system to remember folder-level operations like adding, renaming or deleting files.

Over time, the design of SCM tools matured. CVS is probably the most popular source control tool in the
world today. It was originally developed as a set of wrappers around RCS which essentially provided support
for some folder-level operations. Although CVS still has some important limitations, it was a big step forward.

Today, several modern source control systems are designed around the notion of tree-wide deltas. By
accurately remembering every possible operation which can happen to a repository, these tools provide a truly
complete history of a project.

What can be stored in a repository?

People sometimes ask us what kind of things can be stored in a repository. In general, the answer is: "Any file". It is true that I am focusing on tools which are designed for software developers and web developers. However, those tools don't really care what kind of file you store inside them. Vault doesn't care. Perforce, Subversion and CVS don't care. Any of these tools will gratefully accept any file you want to store.

If you will be storing a lot of binary files, it is helpful to know how your SCM tool handles them. A tool which uses binary deltas in the repository may be a better choice. If all of your files are binary, you may want to explore other solutions. Tools like Vault and Subversion were designed for programmers. These products contain features designed specifically for use with source code, including diff and automerge. You can use these systems to store all of your Excel spreadsheets, but they are probably not the best tool for the job. Consider exploring "document management" systems instead.

Best Practice: Checkin all the canonical stuff, and nothing else

Although you can store anything you want in a repository, that doesn't mean you should. The best practice here is to store everything which is necessary to do a build, and nothing else. I call this "the canonical stuff".

To put this another way, I recommend that you do not store any file which is automatically generated. Checkin your hand-edited source code. Don't checkin EXEs and DLLs. If you use a code generation tool, checkin the input file, not the generated code file. If you generate your product documentation in several different formats, checkin the original format, the one that you manually edit.

If you have two files, one of which is automatically generated from the other, then you just don't need to checkin both of them. You would in effect be managing two expressions of the same thing. If one of them gets out of sync with the other, then you have a problem.

How is the repository itself stored?

We need to descend through one more layer of abstraction before we turn our attention back to more practical
matters. So far I have been talking about how things are stored and managed within a repository, but I have
not broached the subject of how the repository itself is stored.

A repository must store every version of every file. It must remember the hierarchy of files and folders for
every version of the tree. It must remember metadata, information about every file and folder. It must
remember checkin comments, explanations provided by the developer for each checkin. For large trees and
trees with very many revisions, this can be a lot of data that needs to be managed efficiently and reliably.
There are several different ways of approaching the problem.

RCS kept one archive file for every file being managed. If your file was called "foo.c" then the archive file was
called "foo.c,v". Usually these archive files were kept in a subdirectory of the working directory, just one level
down. RCS files were plain text; you could just look at them with any editor. Inside the file you would find a
bunch of metadata and a full copy of the latest version of the file, plus a series of line-oriented file deltas, one
for each previous version. (Please forgive me for speaking of RCS in the past tense. Despite all the fond
memories, that particular phase of my life is over.)

CVS uses a similar design, albeit with a lot more capabilities. A CVS repository is distinct, completely separate
from the working directory, but it still uses ",v" files just like RCS. The directory structure of a CVS repository
contains some additional metadata.

When managing larger and larger source trees, it becomes clear that the storage challenges of a repository are
exactly the same as the storage challenges of a database. For this reason, many SCM tools use an actual
database as the backend data store. Subversion uses Berkeley DB. Vault uses SQL Server 2000. The benefit
of this approach is enormous, especially for SCM tools which support atomic transactions. Microsoft has
invested lots of time and money to ensure that SQL Server is a safe place to store important information. Data
corruption simply doesn't happen. All of the ultra-tricky details of transactions are handled by the underlying
database.

Perforce uses somewhat of a hybrid approach, storing all of the metadata in a database but keeping all of the
actual file contents in RCS files. This approach trades some safety for speed. Since Perforce manages its own
archive files, it has to take responsibility for all the strange things that threaten to corrupt them. On the other
hand, writing a file is a bit faster than writing a blob into a SQL database. Perforce has the reputation of being
one of the fastest SCM tools.

Managing repositories

Creating a source control repository is kind of a special event. It's a little bit like adopting a cat. People often get a cat without realizing the animal is going to be around for 10-20 years. Your repository may have similar longevity, or even longer.

Best Practice: Use separate repositories for things which are truly separate

Most SCM tools offer the ability to have multiple distinct repositories. Vault can even host multiple repositories on the same Vault server. People often ask us when this capability should be used.

In general, you should store related items in the same repository. Start a separate repository only in situations where the contents of the two are completely unrelated. In a small ISV, it may be quite logical to have only one repository which contains every project.

Shortly after SourceGear was founded in 1997, we created a SourceSafe repository. Over seven years later, that repository is still in use, almost every day. (Along with a whole bunch of legacy projects, it contains the source code for SourceOffSite. We never migrated that project to Vault because we wanted the SourceOffSite developers to continue eating their own dogfood.)

That repository is well over a gigabyte in size (which is actually rather small, but then SourceGear has never
been a very big company). It contains thousands of files, thousands of checkins, and has been backed up
thousands of times.

Treat your repository well and it will serve you well:

- Obviously you should do regular backups. That repository contains everything your fussy and expensive programmers have ever created. Don't risk losing it.
- Just for fun, take an hour this week and check your backup to see if it actually works. It's shocking how many people are doing daily backups that cannot actually be restored when they are needed.
- Put your repository on a reliable server. If your repository goes down, your entire team is blocked from doing work. Disk drives like to fail, so use RAID. Power supplies like to fail, so get a server with redundant power supplies. The electrical grid likes to fail, so get a good Uninterruptible Power Supply (UPS).
- Be conservative in the way your SCM server machine is managed. Don't put anything on that machine that doesn't need to be there. Don't feel the need to install every single Service Pack on the day it gets released. I've been shocked how many times one of our servers went south simply because we installed a service pack or hotfix from Windows Update. Obviously I want our machines to be kept current with the latest security fixes, but I've been burned too many times not to be cautious. Install those patches on some other machine before you put them on critical servers.
- Keep your SCM server inside a firewall. If you need to allow your developers to access the repository from home, carefully poke a hole, but leave everything else as tight as you can. Make sure your developers are using some sort of bulk encryption. Vault uses SSL. Tools like Perforce, CVS and Subversion can be tunneled through ssh or something similar.

This brief list of tips is hardly a complete guide for administrators. I am merely trying to describe the level of
care and caution which should be used for your SCM repository.

Undo

As I have mentioned, one of the best things about source control is that it contains your entire history. Every
version of everything is stored. Nothing is ever deleted.

However, sometimes this benefit can be a real pain. What if I made a mistake and checked in something that should not be checked in? My history contains something I would rather forget. I want to pretend that it never happened. Isn't there some way to really delete something from a repository?

In general, the recommended way to fix a problem is to checkin a new version which fixes it. Try not to worry
about the fact that your repository contains a full history of the error. Your mistakes are a part of your
past. Accept them and move on with your life.

However, most SCM tools do provide one or more ways of dealing with this situation. First, there is a
command I call "rollback". This command is essentially an "undo" for revisions of a file. For example, let's
say that a certain file is at version 7 and we want to go back to version 6. In Vault, we select version 6 and
choose the Rollback command.

To be fair, I should admit that the rollback command is not always destructive. In some SCM tools, the
rollback feature really does make version 7 disappear forever. Vault's rollback is non-destructive. It simply
creates a version 8 which is identical to version 6. The designers of Vault are fanatical purists, or at the very
least, one of them is.
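In sketch form, a non-destructive rollback is nothing more than a new checkin whose contents come from the old version. The two repository calls below are hypothetical stubs, not Vault's actual API:

using System;

class RollbackSketch
{
    static byte[] GetFileContents(string path, int version)
        { throw new NotImplementedException(); } // fetch historical contents
    static void CommitNewVersion(string path, byte[] contents, string comment)
        { throw new NotImplementedException(); } // checkin as a new version

    static void Rollback(string path, int goodVersion)
    {
        // e.g. the file is at version 7 and we want to go back to version 6:
        byte[] old = GetFileContents(path, goodVersion);
        CommitNewVersion(path, old, "Rollback to version " + goodVersion);
        // The file is now at version 8, identical to version 6. Versions 6
        // and 7 both remain in the history.
    }
}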

As a concession to those who are less fanatical, Vault does support a way to truly destroy things in a
repository. We call this feature "obliterate". I believe Subversion and Perforce use the same term. The
obliterate command is the only way to delete something and make it truly gone forever.

In my original spec for Vault, I had decided that we would not implement any form of destructive delete. We eventually decided to compromise and implement this command, but I really wanted to discourage its use. SourceSafe makes it far too easy to rewrite history and pretend that something never happened. In the Delete dialog box, SourceSafe includes a checkbox called "Destroy Permanently". This is an atrocious design decision, roughly equivalent to leaving a sledgehammer next to the server machine so that people can bash the hard disks with it every once in a while. This checkbox is almost irresistible. It simply begs to be checked, even though it is very rarely the right thing to do.

Best Practice: Never obliterate anything that was real work

The purist in me wants to recommend that nothing should ever be obliterated. However, my pragmatist side prevails. There are situations where obliterate is not sinful.

However, obliterate should never be used to delete actual work. Don't obliterate a file simply because you discovered it to be a bad idea. Don't obliterate a file simply because you don't need it anymore. Obliterate is for situations where something in the repository should never have been there at all. For example, if you accidentally checkin a gigabyte of MP3s alongside your C++ include files, obliterate is a justifiable choice.

When we first designed the obliterate command for Vault, I wanted its user interface to somehow make the user feel guilty. I argued that the obliterate dialog box should include a photograph of a 75-year-old Catholic nun scowling and holding a yardstick.

The rest of the team agreed that we should discourage people from using this command, but in the end, we
settled on a less graphical approach. In Vault, the obliterate command is available only in the Admin client,
not the regular client people use every day. In effect, we made the obliterate command available, but
inconvenient. People who really need to obliterate can find the command and get it done. Everyone else has to
think twice before they try to rewrite history and pretend something never happened.

Kimchi again?

Recently when I asked my fifth grade daughter what she had learned in school, she proudly informed me that
"everyone in Korea eats kimchi at every meal, every day". In the world of a ten-year-old, things are simpler.
Rules don't have exceptions. Generalizations always apply.

This is how we learn. We understand the basic rules first and see the finer points later. First we learn that
memory leaks are impossible in the CLR. Later, when our app consumes all available RAM, we learn more.

My habit as I write these chapters is to first present the basics in a "matter of fact" fashion, rarely acknowledging that there are exceptions to my broad generalizations. I did this during the chapter on checkins, failing to mention the "edit-merge-commit" approach until I had thoroughly explored "checkout-edit-checkin".

In this chapter, I have written everything from the perspective of just one specific architecture. SCM tools like
Vault, Perforce, CVS and Subversion are based on the concept of a centralized server which hosts a single
repository. Each client has a working folder. All clients contact the same server.

I confess that not all SCM tools work this way. Tools like BitKeeper (http://www.bitkeeper.com/) and Arch
(http://www.gnu.org/software/gnu-arch/) are based on the concept of distributed repositories. Instead of one
repository, there can be several, or even many. Things can be retrieved or committed to any repository at any
time. The repositories are synchronized by migrating changesets from one repository to another. This results
in a merge situation which is not altogether different from merging branches.

From the perspective of this SCM geek, distributed repositories are an attractive concept. Admittedly, they are
advanced and complex, requiring a bit more of a learning curve on the part of the end user. But for the power
user, this paradigm for source control is very cool.

Having no experience in the implementation of these systems, I will not be explaining their behavior in any
detail. Suffice it to say that this approach is similar in some ways, but very different in others. This series of
articles will continue to focus on the more mainstream architecture for source control.

Looking ahead

In this chapter, I discussed the details of repositories. In the next chapter, I'll go back over to the client side
and dive into the details of working folders.
Chapter 5: Working Folders
This is part of an online book called Source Control HOWTO (source_control.html) , a best practices guide
on source control, version control, and configuration management.

The joy of indifference

CVS calls it a sandbox. Subversion calls it a working directory. Vault calls it a working folder. By any of these
names, a working folder is a directory hierarchy on the developer's client machine. It contains a copy of the
contents of a repository folder. The very basic workflow of using source control involves three steps:

1. Update the working folder so that it exactly matches the latest contents of the repository.
2. Make some changes to the working folder.
3. Checkin (or commit) those changes to the repository.

The repository is the official archive of our work. We treat our repository with great respect. We are extremely
careful about what gets checked in. We buy backup disks and RAID arrays and air conditioners and whatever
it takes to make sure our precious repository is always comfortable and happy.

In contrast, we treat our working folder with very little regard. It exists for the purpose of being abused. Our working folder starts out worthless, nothing more than a copy of the repository. If it is destroyed, we have lost nothing, so we run risky experiments which endanger its life. We attempt code changes which we are not sure will ever work. Sometimes the contents of our working folder won't even compile, much less pass the test suite. Sometimes our code changes turn out to be a Really Bad Idea, so we simply discard the entire working folder and get a new one.

Best Practice: Don't let your working folder become too valuable

Checkin your work to the repository as often as you can without breaking the build.

But if our code changes turn out to be useful, things change in a very big way. Our working folder suddenly
has value. In fact, it is quite precious. The only copy of our most recent efforts is sitting on a crappy,
laptop-grade hard disk which gets physically moved four times a day and never gets backed up. The stress of
this situation is almost intolerable. We want to get those changes checked in to the repository as quickly as
possible.

Once we do, we breathe a sigh of relief. Our working folder has once again become worthless, as it should be.

Hidden state information

Once again I need to spend some time explaining grungy details of how SCM tools work. I don't want to repeat
the analogy I used in the last chapter, so the following line of "code" should suffice:

Response.Write(previousChapter.Section["Cars and Clocks"]);

Let's suppose I have a brand new working folder. In other words, I started with nothing at all and I retrieved the latest versions from the repository. At this moment, my new working folder is completely in sync with the contents of the repository. But that condition is not likely to last for long. I will be making changes to some of the files in my working folder, so it will be "newer" than the repository. Other developers may be checking in their changes to the repository, thus making my working folder "out of date". My working folder is going to be new and old at the same time. Things are going to get confusing. The SCM tool is responsible for keeping track of everything. In fact, it must keep track of the state of each file individually.

Best Practice: Use non-working folders when you are not working

SCM tools need this "hidden state information" so they can efficiently keep track of things as you make changes to your working folder. However, sometimes you want to retrieve files from the repository with no plan of making changes to them. For example, if you are retrieving files to make a source tarball, or for the purpose of doing an automated build, you don't really need the hidden state information at all.

Your SCM tool probably has a way to retrieve things "plain", without writing the hidden state information anywhere. I call this a "non-working folder". In Vault, this is done automatically whenever you retrieve files to a destination which is not configured as the working folder, although I sometimes wish we had made this functionality a completely separate command.

For housekeeping purposes, the SCM tool usually keeps a bit of extra information on the client side. When a file is retrieved, the SCM client stores its contents in the corresponding working file, but it also records certain information for later. Examples:

- Your SCM tool may record the timestamp on the working file, so that it can later detect if you have modified it.
- It may record the version number of the repository file that was retrieved, so that it may later know the starting point from which you began to make your changes.
- It may even tuck away a complete copy of the file that was retrieved, so that it can show you a diff without accessing the server.

I call this information "hidden state information". Its exact location depends on which SCM tool you are
using. Subversion hides it in invisible subdirectories in your working directory. Vault can work similarly, but
by default it stores hidden state information in the current user's "Application Data" directory.
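As a sketch, the per-file hidden state information might boil down to a record like the following. The names are hypothetical, not Vault's actual format:

using System;

class BaselineInfo
{
    public string RepositoryPath;  // e.g. "$/top/foo.cpp"
    public int BaselineVersion;    // the version that was retrieved, e.g. 7
    public DateTime Retrieved;     // timestamp, for detecting modified files
    public string BaselineCopy;    // path to a pristine copy of the retrieved
                                   // file, used for local diffs and undo
}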

Working file states

Because of the changes happening on both the client and the server, a working file can be in one of several
possible states. SCM tools typically have some way of displaying the state of each file to the user. Vault shows
file states in the main window. CVS shows them in response to the 'cvs status' command.

The table below shows the possible states for a working file. The column on the left shows my particular name
for each of these states, which through no coincidence is the name that Vault uses. The column on the far right
shows the name shown by the 'cvs status' command. However, the terminology doesn't really matter. One
way or another, your SCM tool is probably keeping track of all these things and can tell you the state of any file
in your working folder hierarchy.

Refresh

In order to keep all this file status information current, the SCM client must have ways of staying up to date
with everything that is happening. Whenever something changes in the working folders or in the repository,
the SCM client wants to know.

Changes in the working folders on the client side are relatively easy. The SCM client can quickly scan files in
the working folders to determine what has changed. On some operating systems, the client can register to be
notified of changes to any file.
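On Windows, for example, a client could register for such notifications with the .NET FileSystemWatcher class instead of rescanning the whole working folder. A minimal sketch, with a hypothetical path and status labels borrowed from this chapter:

using System;
using System.IO;

class WatcherSketch
{
    static void Main()
    {
        FileSystemWatcher watcher = new FileSystemWatcher(@"C:\work\myproject");
        watcher.IncludeSubdirectories = true;
        watcher.Changed += (s, e) => Console.WriteLine("Edited:  " + e.FullPath);
        watcher.Deleted += (s, e) => Console.WriteLine("Missing: " + e.FullPath);
        watcher.Renamed += (s, e) =>
            Console.WriteLine("Renamed: " + e.OldFullPath + " -> " + e.FullPath);
        watcher.EnableRaisingEvents = true;
        Console.ReadLine(); // keep watching until Enter is pressed
    }
}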

Notification of changes on the server can be a bit trickier. The Vault client periodically queries the server to
ask for the latest version of the repository tree structure. Most of the time, the server will simply respond that
"nothing has changed". However, when something has in fact changed, the client receives a list of things
which have changed since the last time that client asked for the tree structure.

For example, let's assume Laura retrieves the tree structure and is informed that foo.cpp is at version 7.
Later, Wilbur checks in a change to foo.cpp and creates version 8. The next time Laura's Vault client
performs a refresh, it will ask the server if there is anything new. The server will send down a list, informing
her client that foo.cpp is now at version 8. The actual bits for foo.cpp will not be sent until Laura specifically
asks for them. For now, we just want the client to have enough information so that it can inform Laura that
her copy of foo.cpp is now "Old".
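The polling logic can be sketched like so. Every type and call here is a hypothetical stand-in, not the actual Vault protocol; the point is only that the client remembers the last revision it has seen and asks for what is new:

using System;
using System.Collections.Generic;

class TreeChange { public string Path; public int Revision; }

class RefreshSketch
{
    int lastKnownRevision;
    Dictionary<string, string> localStatus = new Dictionary<string, string>();

    // Stub: "send me everything that changed after revision N"
    List<TreeChange> GetTreeChangesSince(int revision)
        { throw new NotImplementedException(); }

    void Refresh()
    {
        foreach (TreeChange c in GetTreeChangesSince(lastKnownRevision))
        {
            // e.g. "foo.cpp is now at version 8": mark the local copy "Old".
            // The actual bits are not downloaded until the user asks for them.
            localStatus[c.Path] = "Old";
            lastKnownRevision = Math.Max(lastKnownRevision, c.Revision);
        }
    }
}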

Operations that involve a working folder

OK, let's go back to speaking a bit more about practical matters. In terms of actual usage, most interaction with your SCM tool happens in and around your working folder. The following operations are the basic things I can do to a working folder:

- Make changes
- Review changes
- Undo changes
- Update the working folder
- Commit changes

In the following sections, I will cover each of these operations in a bit more detail.

Make the changes

The primary thing you do to a working folder is make changes to it.

In an idealized world, it would be really nice if the SCM tool didn't have to be involved at all. The developer
would simply work, making all kinds of changes to the working folder while the SCM tool eavesdrops, keeping
an accurate list of every change that has been made.

Unfortunately, this perfect world isn't quite available. Most operations on a working folder cannot be
automatically detected by the SCM client. They must be explicitly indicated by the user. Examples:

- It would be unwise for the SCM client to notice that a file is "Missing" and automatically assume it should be deleted from the repository.
- Automatically inferring an "Add" operation is similarly unsafe. We don't want our SCM tool automatically adding any file which happens to show up in our working folder.
- Rename and move operations also cannot be reliably divined by mere observation of the result. If I rename foo.cpp to bar.cpp, how can my SCM client know what really happened? As far as it can tell, I might have deleted foo.cpp and added bar.cpp as a new file.

All of these so-called "folder-level" operations require the user to explicitly give a command to the SCM tool.
The resulting operation is added to the pending change set, which is the list of all changes that are waiting to
be committed to the repository.

However, it just so happens that in the most common case, our "eavesdropping" ideal is available. Developers
who use the edit-merge-commit model typically do not issue any explicit command telling the SCM tool of
their intention to edit a file. The files in their working folder are left in a writable state, so they simply open
their text editor or their IDE and begin making changes. At the appropriate time, the SCM tool will notice the
change and add that file to the pending change set.

Users who prefer "checkout-edit-checkin" actually have a somewhat more consistent rule for their work. The
SCM tool must be explicitly informed of all changes to the working folder. All files in their working folder are
usually marked read-only. The SCM tool's Checkout command not only informs the server of the checkout
request, but it also flips the bit on the working file to make it writable.
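On Windows, flipping that bit is a one-liner with the .NET File class. A sketch of the two local operations a checkout-edit-checkin client performs:

using System.IO;

class ReadOnlyBit
{
    // Called after the server grants the checkout.
    static void MakeWritable(string path)
    {
        File.SetAttributes(path, File.GetAttributes(path) & ~FileAttributes.ReadOnly);
    }

    // Called after a successful checkin (or an undo checkout).
    static void MakeReadOnly(string path)
    {
        File.SetAttributes(path, File.GetAttributes(path) | FileAttributes.ReadOnly);
    }
}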

Review changes

One of the most important features provided by a working folder is the ability to review all of the changes I
have made. For SCM tools that do keep track of a pending change set (Vault, Perforce, Subversion), this is the
place to start. The following screen dump shows the pending change set pane from the Vault client, which is
showing me that I have currently made two changes in my working folder:

(screendumps/scm_pending_5.gif)

The pending change set view shows all kinds of changes, including adds, deletes, renames, moves, and
modified files. It is helpful to keep an eye on the pending change set as I work, verifying that I have not
forgotten anything.

However, for the case of a modified file, this visual display only shows me which files have changed. To really
review my changes, I need to actually look inside the modified files. For this, I invoke a diff tool. The following
screen dump is from a popular Windows diff tool called Beyond Compare
(http://www.scootersoftware.com/) :

This picture is fairly typical of the visual diff tool genre, showing both files side-by-side and highlighting the
parts that are different. There are quite a few tools like this. The following screen dump is from the visual diff
tool which is provided with Vault:

The left panel shows version 21 of sgdmgui_props.cpp, which is the current version in the repository. The right panel shows my working file. The colored regions show exactly what has changed:

- On line 33 I changed the type of this function from long to short.
- At line 35 I inserted a one-line comment.

Best Practice: Run diff just before you checkin, every time

Never checkin your changes without giving them a quick review in some sort of a diff tool.
Note that SourceGear's diff tool shows inserted lines by drawing lines in the center gap to indicate exactly
where the insertion occurs. In contrast, Beyond Compare is showing a dead region on the left side across from
the inserted line on the right. This particular issue is a matter of personal preference. The latter approach
does have the benefit that identical lines are always across from each other.

Both of these tools do a nice job on the modification to line 33, showing exactly which part of the line was
changed. Most of the recent visual diff tools support this ability to highlight intraline differences.

Visual diff tools are indispensable. They give me a way to quickly review exactly what has changed. I strongly
recommend you make a habit of reviewing all of your changes just before you checkin. You can catch a lot of
silly mistakes by taking the time to be sure that your changes look the way you think they look.

Undo changes

Sometimes I make changes which I simply don't intend to keep. Perhaps I tried to fix a bug and discovered
that my fix introduced five new bugs that are worse than the one I started with. Or perhaps I just changed my
mind. In any case, a very nice feature of a working folder is the ability to undo.

In the case of a folder-level operation, perhaps the Undo command should actually be called "Nevermind".
After all, the operation is pending. It hasn't happened yet. I'm not really saying that I want to Undo something which has already happened. Rather, I am just saying that I no longer want to do something that I previously said I wanted to do.

For example, if I tell the Vault client to delete a file, the file isn't really deleted until I commit that change to
the repository. In the meantime, it is merely waiting around in my pending change set. If I then tell the Vault
client to Undo this operation, the only thing that actually has to happen is to remove it from my pending
change set.

In the case of a modified file, the Undo command simply overwrites the working file with the "baseline" version, the one that I last retrieved. Since Vault has been keeping a copy of this baseline version, it merely needs to copy this baseline file from its place in the hidden state information over the working file.

Best Practice: Be careful with undo

When you tell your SCM client to undo the changes you have made to a file, those changes will be lost. If your working folder has become valuable, be careful with it.
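In sketch form, undoing a modified file is just restoring the pristine baseline copy kept in the hidden state information. The paths here are hypothetical:

using System.IO;

class UndoSketch
{
    static void UndoChanges(string workingFile, string baselineCopy)
    {
        // Overwrite the working file with the version last retrieved.
        File.Copy(baselineCopy, workingFile, true);
    }
}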

For users who use the checkout-edit-checkin style of development, closely related here is the need to undo a
checkout. This is essentially similar to undoing the changes in a file, but involves the extra step of informing
the server that I no longer want the file to be checked out.

Digression: Your skillet is not a working folder

Source control tools have been a daily part of my life for well over a decade. I can't imagine doing software
development without them. In fact, I have developed habits that occasionally threaten my mental health.
Things would be so much easier if the concept of a working folder were available in other areas of life:

"Hmmm. I can't remember which of these pool chemicals I have already done. Luckily, I can just diff
against the version of the pool water from an hour ago and see exactly what changes I have made."
"Boy am I glad I remembered to set the read-only bit on my front lawn to remind me that I'm not
supposed to cut the grass until a week after the fertilizer was applied."
"No worries -- if I accidentally put too much pepper on this chicken, I can just revert to the latest version
in the repository."

Unfortunately, SCM tools are unique. When I make a mistake in my woodshop, I can't undo it. Only in
software development do I have the luxury of a working folder. It's a place where I can work without
constantly worrying about making a mistake. It's a place where I can work without having to be too careful.
It's a place where I can experiment with ideas that may not work out. I wish I had working folders everywhere.

Update the working folder


Ten milliseconds after I retrieve a fresh working folder, it might be out of date. An SCM repository is a busy
hub of activity. New stuff arrives regularly as team members finish tasks and checkin their work.

I don't like to let my working folder get too far behind the current state of the repository. SCM tools typically
allow the user to invoke a diff tool to compare two repository versions of a file. When I am working on a
feature, I periodically like to review the recent changes in the repository. Unless those changes look likely to
disrupt my own work, I usually proceed to retrieve the latest versions of things so that my working folder stays
up to date.

In CVS, the command to update a working folder is [rather conveniently] called 'update'. In Vault, this
operation is done with the Get Latest Version command. The screen dump below is the corresponding dialog
box:

I want to update my working folder to contain all of the changes available on the server, so I have invoked the Get Latest Version operation starting at the very top folder of my repository. The Recursive checkbox in the dialog above indicates that this operation will recursively apply to every subfolder.

Best Practice: Don't get too far behind

Update your working folder as often as you can.

Note that this dialog box gives me a few choices for how I may want to handle situations where a change has
happened on both the client and the server. Let us suppose for a moment that I am not using exclusive
checkouts and that somebody else has also modified sgdmgui_props.cpp. In this case, I have three choices
available when I want to update my working folder:

- Overwrite my working file. The effect here is similar to an Undo. My changes will be lost. Use with care.
- Attempt automatic merge. The Vault client will attempt to construct a file which contains my changes and the changes which were made on the server. If the automerge succeeds, my working file will end up in the "Edited" status. If the automerge fails, the status of my working file will be "Needs Merge", and the Vault client will nag and pester me until I resolve the situation.
- Do not overwrite/Merge later. This option leaves my working file untouched. However, the status of the file will change to "Needs Merge". Vault will not allow me to checkin my changes until I affirm that I have done the right thing and merged in the changes from the repository.

Note also that the "Prompt for modified files" checkbox allows me to specify that I want the Vault client to
allow me to choose between these options for every file that ends up in this situation.

As you can see, the Get Latest Version dialog box includes a few other options which I won't describe in detail
here. Other SCM tools have similar abilities, although the user interface may be very different. In any case,
it's a good idea to update your working folder as often as you can.
Commit changes

In most situations, I eventually decide that my changes are Good and should be sent back to the repository so
they can become a permanent part of the history of my project. In Vault, Subversion and CVS, the command
is called Commit. The following screen dump shows the Commit dialog box from Vault:

Note that the listbox at the top contains all of the items in my pending change set. In this particular example, I
only have two changes, but this listbox typically has a scrollbar and contains lots of items. I can review all of
the operations and choose exactly which ones I want to commit to the repository. It is possible that I may want
to checkin only some of my currently pending changes. (Perforce has a nifty solution to this problem. The
user can have multiple pending change sets, so that changes can be logically grouped together even as they are
waiting to be checked in.)

The "Change Set Comment" textbox offers a place for me to type an explanation of what I changed and why I
did it. Please note that this textbox has a scrollbar, encouraging you to type as much text as necessary to give a
full explanation of the problem. In my opinion, checkin comments are more important than the comments in
the actual code.

When I click OK, all of the selected items will be sent to the server to be committed to the repository. Since
Vault supports atomic checkin transactions, I know that my changes will succeed or fail as a united group. It
is not possible for the repository to end up in a state where only some of these changes made it.

#region CARS_AND_CLOCKS

Remember the discussion in chapter 4 about binary file deltas? This same technology is also used for checkin
operations. When Vault sends a modified version of a file up to the server, it actually sends only the bytes
which have changed, using the same VCDiff format which is used to make repository storage more efficient.

This is possible because the client has kept a copy of the baseline file in the hidden state information. The
Vault client simply runs the VCDiff algorithm to construct the difference between this baseline file and the
current working file. So in the case of my running example, the Vault client will send three pieces of
information:

- The binary delta. Since the pending change set pane shows that my working file is 40 bytes larger than the baseline where I started, the binary delta is going to be somewhere in the vicinity of 40 bytes long, perhaps with a few extra bytes for overhead.
- The fact that this binary delta was computed against version 21 of the file. Since version 21 is known and exists on both the client and the server, the SCM server can simply apply the binary delta to its own copy of version 21 to reconstruct an exact copy of the contents of my working file.
- The CRC checksum of the original working file. When the server reconstructs its copy of the working file, the CRC will be compared to ensure that nothing was corrupted during transit. The file that is stored in the repository will be exactly the same as the working file. No corruption, no surprises.

Whenever possible, Vault uses binary file deltas "over the wire" in both directions, from client to server as well
as from server to client. In this example, the entire file is only 3,762 bytes, so the savings in network
bandwidth isn't all that significant. However, for larger files, the increase in network performance for offsite
users can be quite dramatic.
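Putting those three pieces together, the payload for one modified file might be sketched like this. ComputeVCDiff and Crc32 are hypothetical stubs standing in for real implementations of RFC 3284 and CRC-32:

using System;

class CheckinPayload
{
    public byte[] Delta;        // baseline -> working file; ~40 bytes here
    public int BaselineVersion; // 21: the version the delta was computed against
    public uint WorkingCrc;     // checksum of the full working file

    static byte[] ComputeVCDiff(byte[] source, byte[] target)
        { throw new NotImplementedException(); }
    static uint Crc32(byte[] data)
        { throw new NotImplementedException(); }

    static CheckinPayload Build(byte[] baseline, byte[] working, int baselineVersion)
    {
        // The server applies Delta to its own copy of the baseline version,
        // then verifies the result against WorkingCrc before storing it.
        return new CheckinPayload
        {
            Delta = ComputeVCDiff(baseline, working),
            BaselineVersion = baselineVersion,
            WorkingCrc = Crc32(working)
        };
    }
}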

This capability of using binary file deltas between client and server is supported by some other SCM tools as
well, including (I believe) Subversion and Perforce.

#endregion

When the checkin has completed successfully, if I am working in "checkout-edit-checkin" mode, the SCM tool
will flip the read-only bit on my working files to prevent me from accidentally making changes without
informing the server of my intentions.

Having completed my checkin, the cycle is complete. My working folder is once again worthless, since my
changes are a permanent part of the repository. I am ready to start again on my next development task.

Looking ahead

In the next chapter, it's time to start talking about some of the more advanced stuff. I'll start with an overview
of labels and history.
Chapter 6: History
This is part of an online book called Source Control HOWTO (source_control.html) , a best practices guide
on source control, version control, and configuration management.

Confronting your past

You may now be tired of hearing me say it, but I will say it again: Your repository contains every version of
everything which has ever been checked in to the repository. This is a Good Thing. We sleep better at night
because we know that our efforts are always additive, never subtractive. Nothing is ever lost. As the team
regularly checks in more stuff, the complete historical record is preserved, just in case we ever need it.

But this feature is also a Bad Thing. It turns out that keeping absolutely everything isn't all that useful if you
can't find anything later.

My woodshop is a painfully vivid illustration of this problem. I have a habit of never throwing anything away.
When I build a piece of furniture, I save every scrap of wood, telling myself that I might need it someday. I
keep every screw, nail, bolt or nut, just in case I ever need it. But I don't organize these things very well. So
when the time comes that I need something, I usually can't find it. I'm not necessarily proud of this
confession, but my workshop stands as an expression of who I am. Those who love me sometimes find my
habits to be endearing.

But there is nothing endearing about a development team that can't find something when they need it. A good
SCM tool must do more than just keep every version of everything. It must also provide ways of searching and
viewing and sorting and organizing and finding all that stuff.

In the rest of this chapter, I will discuss several mechanisms that SCM tools provide to help make the historical
data more useful.

Labels

Perhaps the most important feature for dealing with old versions is the notion of a "label". In CVS, this feature
is called a "tag". By either name, the concept is the same -- labels offer the ability to associate a name with a
specific version of something in the repository. A label assigns a meaningful symbolic name to a snapshot of
your code so you can later find that snapshot more easily.

This is not altogether different from the descriptive and memorable names we use for variables and constants
in our code. Which of the following two lines of code is easier to understand?

if (errorcode == ERR_FILE_NOT_FOUND)

if (e == -43)

Similarly, which of the following is a more intuitive description of a specific version of your code?

LAST_VERSION_BEFORE_COREY_FOULED_EVERYTHING_UP

378

We create (or "apply") a label by specifying a few things:

1. The string for the name of the label. This should be something descriptive that you can either remember or recognize later. Don't be afraid to put enough information in the name of the label. Note that CVS has strict rules for the syntax of a tag name (must start with a letter, no spaces, almost no punctuation allowed). I still follow that tradition even though Vault is more liberal.
2. The folder to which the label will be applied. (You can apply a label or tag to a single file if you want, but why? Like most source control operations, labels are most useful when applied recursively to a whole folder.)
3. Which versions of everything should be included in the snapshot. Often this is implicitly understood to be the latest version, but your SCM tool will almost certainly allow you to label something in the past. If it won't, take it out back and shoot it.
4. A comment explaining the label. This is optional, and not all SCM tools support it (CVS doesn't), but a comment can be handy when you want to explain more than might be appropriate to say in the name of the label. This is particularly handy if your team has strict rules for the syntax of label names (V1.3.2.1426.prod) which prevent you from putting in other information you need.

For example, in the following screen dump from Vault, I am labeling version 155 of the folder $/src/sgd/libsgdcore:

It is worth clarifying here that labels play a slightly different role in some SCM tools. In Subversion or Vault,
folders have version numbers. Using the example from my screen dump above, the folder $/src/sgd/libsgdcore is at version 155. Each of the various files inside that folder has its own version number, but every
time one of those files changes, the version number of the folder is increased by one as well. So the version
number of a folder is a little bit like a label because it maps to a specific snapshot of the contents of the folder.

However, CVS doesn't work this way. There is no folder version number which can be mapped to a specific
snapshot of the contents of that folder. For this reason, tags are all the more important in CVS, since there is
no other way to easily mark specific versions of multiple items as a snapshot.

When to use a label

Labels are cheap. They don't consume a lot of resources. Your SCM tool won't slow down if you use lots of
them. Having more labels does not increase your responsibilities. So you can use them as often as you like.
The following situations are examples of when you might want to use a label:

When you make a release

A release is the most obvious time to apply a label. When you release a version of your application to
customers, it can be very important to later know exactly which version of the code was released.

When something is about to change

Sometimes it is necessary to make a change which is widespread or fundamental. Before destabilizing your
code, you may want to apply a label so you can easily find the version just before things started getting
messed up.

When you do an automated build

Some automated build systems apply a label every time a build is done. The usual approach is to first apply
the label and then do a "get by label" operation to retrieve the code to be used for the build. Using one of
these tools can result in an awful lot of labels, but I still like the idea. It eliminates the guesswork of trying
to figure out exactly which code was in the build.

When you move some changes from one place to another


Labels are handy ways to mark the sync points between two branches or two copies of the same tree. For
example, suppose your company has two groups with separate source control systems. Group A has a
library called SuperDuperNeatoUtilityLib. Group B uses this library as well, but they keep their own copy
in their own source control repository. Every so often, they log in to Group A's repository and see if
there are any bug fixes they want to migrate into their own copy. By applying a label to Group A's
repository, they can more easily remember the latest point at which their two versions were in sync.

Once you have a label, the question is what you can do with it. The truth is that some labels never get used. That's okay. Like I said, they're cheap.

Best Practice: Use labels often

Labels are very lightweight. Don't hesitate to use them as often as you want.

But many labels do get used. The "get by label" operation is the most common way that a label comes in handy. By specifying a label as the version you want to retrieve, you can get a copy of every file exactly as it was when the label was created.

It's also very handy to diff against a label. For example, in the following screendump from Vault, I am asking
to see all the differences between the contents of my working folder and the contents of the label named "Build
3.0.0.2752". (This label was applied by our automated build system when it made build 2752.)

Admonishments on the evils of "Label Promotion"

Sometimes after you apply a label you realize that you want to make a small change. As an example, consider
the following scenario: One week ago, you finalized the code for the 4.0 release of your product. You applied a
label to the tree, and your team has proceeded with development on a few post-4.0 tasks.

But now Bob (one of your QA guys) comes crawling into your office. His clothes are torn and his face is
covered with soot. While gasping for air he informs you that he has found a potential showstopper bug in the
4.0 release candidate. Apparently if you are running your app on the Elbonian version of Windows NT 3.5
with the time zone set to Pacific Standard Time and you enter a page margin size of 57 inches while printing a
42 page document on a Sunday morning before 9am, the whole machine locks up. In fact, if you don't quickly
kill the app, the computer will soon burst into flame.

As Bob finishes explaining the situation, a developer walks in and announces that he has already found the fix
for this bug, and it affects only one line of code in FOO.CPP. Should he make the fix and generate a new
release candidate?

After scolding Bob for not being more diligent in finding this bug sooner, you begrudgingly decide that the
severity of this bug does indeed make it a showstopper for the 4.0 release. But how to proceed? The label for
the 4.0 build has already been applied. You want a new release candidate which contains exactly the contents
of the 4.0 label plus this one-line change. None of the other stuff which has been checked in during the past
week should be included.

I'm sure it was this very situation which prompted Microsoft to implement a feature in SourceSafe 6.0 called
"label promotion". The idea is that a minor change to a label can be made after it was originally created.
Returning to our example, let's suppose that the 4.0 label contained version 6 of FOO.CPP. So now we would
make the one-line change and check it in, resulting in version 7 of that file. Then we "promote" version 7 of the
file to be included in the 4.0 label, instead of version 6.

Best Practice: Avoid using label promotion

Your repository should contain an accurate reflection of what really happened. Don't use label
promotion. If you must, do at least try to feel guilty about it.

Personally I think "label promotion" is a terrible name for this feature. In fact, I think label
promotion is a terrible feature. I am doctrinally opposed to any SCM feature which allows the user to
alter the historical record. The history of the repository should be a complete record of what really
happened. If we use label promotion in this situation, there will be no record of the fact that the
original 4.0 release candidate actually contained version 6 of that file. In situations where label
promotion seems necessary, a fanatical purist like me would just create a new branch, which is a topic I
will discuss in the next chapter.
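
In Subversion, for example, where a label is just a cheap copy into a tags folder, the purist's approach
is a single command away. A sketch, with hypothetical server URL and folder layout:

    # branch from the exact contents of the 4.0 label...
    svn copy http://server/repo/tags/4.0 http://server/repo/branches/4.0.1 \
        -m "create branch for the 4.0.1 showstopper fix"
    # ...then check out the new branch, make the one-line fix to FOO.CPP, and commit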

However, even though I dislike this feature for philosophical reasons, customers really want it. Here at
SourceGear, I tell people that "the customer is not always right, but the customer is always the customer". So
in order to remain true to our goal of making Vault a painless transition from SourceSafe, we implemented
label promotion. But that doesn't mean I have to be happy about it.

History

Another important feature is the ability to view and browse historical versions of the repository. In its simplest
form, this can be just a list of changes with the following information about each change:

What was changed
When the change was made
Who did it
Why (the comment entered at checkin time)

But without a way of filtering and sorting this information, using history is like trying to take a drink from a
fire hose. Fortunately, most SCM tools provide plenty of flexibility in helping you see the data you need.

In CVS, history is obtained using the 'cvs log' command. In the Vault GUI client, we use the History Explorer.
In either case, the first way to filter history is to decide where to invoke the command. Requesting the full
history from the root folder of a repository is like the aforementioned fire hose. Instead, invoke the command
on a subfolder or even on a file. In this way, you will only see the changes which have been made to the item
you selected.
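
For example, both of these commands scope history to a single subfolder (the path is hypothetical):

    cvs log src/libvault/   # CVS: history for just this subfolder
    svn log src/libvault/   # Subversion: same idea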

Most SCM tools provide other ways of filtering history information as well:

Show only changes made during a specific range of dates
Show only changes made by a specific user
Show only changes made to files of a certain extension
Show only changes where the checkin comment contains specific words

The following screendump from Vault shows all the changes I made to one of the Vault libraries during
October 2004:
(screendumps/scm_hist_1.png)
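
Command-line tools offer similar filters. Subversion, for example, accepts a date range directly, and
the resulting output can then be searched for a particular user or comment keyword (the path is
hypothetical):

    svn log -r {2004-10-01}:{2004-10-31} libvault/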

Best Practice: Do as I say, not as I do

It is while using the history features of an SCM tool that we notice what a lousy job our developers do
on their checkin comments. Please, make your checkin comments as complete as possible. The screen dump
above contains an example of checkin comments written by a slacker who was in too much of a hurry.

Sometimes the history features of your SCM tool are used merely to figure out what happened in the past,
but often we need to dig even deeper. Perhaps we want to retrieve ("get") an old version? Perhaps we
want to diff against an old version, or diff two old versions against each other? We may want to apply a
label to a version that happened in the past. We may even want to use an old version as the starting
point for a new branch. Good SCM tools make all of these things easy to do.
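
In Subversion, for instance, each of these questions maps to a short command. The revision numbers, file
names and URL below are made up for illustration:

    svn cat -r 828 panel.py          # retrieve ("get") an old version of a file
    svn diff -r 820:828 panel.py     # diff two old versions against each other
    svn copy -r 828 http://server/repo/trunk \
        http://server/repo/branches/rework \
        -m "new branch starting from an old version"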

A word about changesets and history

For tools like Subversion and Vault which support atomic transactions and changesets, history can be slightly
different. Because changesets are a grouping of individual changes, history is no longer just a flat list of
individual changes, but rather, can now be viewed as a hierarchy which is two levels deep.

To ease the transition for SourceSafe users, Vault allows history to be viewed either way. You can ask Vault's
History Explorer to display individual changes. Or, you can ask to see a list of changesets, each of which can
be expanded to see the individual changes contained inside it. Personally, I prefer the changeset-oriented
view. I like the mindset of thinking about the history of my repository in terms of groups of related changes.

Blame

Vault has a feature which can produce an HTML view of a file with each line annotated with information about
the last person who changed that line. We call this feature "Blame". For example, the following screen dump
shows the Blame output for the source code to the Vault command line client:
This poor function has had all kinds of people stomping through it. I was the last person to change line 828,
which I apparently did in revision 106 of the file. However, line 829 was last modified by Jeff, and line 830
belongs to Dan.

By now the reason for the silly-sounding name of this feature should be obvious. If I find a bug on line
832, the Blame feature makes it easy for me to see that it must be Dan's fault!

Best Practice: Don't actually use the blame feature to be harsh with people about their mistakes

Even though this Best Practice box is more about team management than source control, I don't feel like
I'm straying too far off topic to offer the following tidbit: Tim Krauskopf, an early mentor of mine,
said many wise things to me, including the following piece of management advice which I have never
forgotten: "Spend more time on credit than on blame, and don't spend very much time on either one."

Note that we here at SourceGear take absolutely no credit or blame for the name of this command. We took
our inspiration for this feature from the blame feature found in the CVS world, popularized by the
Bonsai (http://www.mozilla.org/projects/bonsai/) tool from the Mozilla project. The following screen
dump shows this CVS Blame feature in action using the Bonsai installation on www.abisource.com
(http://www.abisource.com) . I was delighted to discover that the AbiWord layout engine actually still
contains some of my code:
Whether you like the name or not, the Blame feature can be awfully handy sometimes.
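
On the command line, the equivalent operations look like this (the file name is just an example):

    cvs annotate VaultCmdLineClient.cs   # CVS's politely named version of this feature
    svn blame VaultCmdLineClient.cs      # Subversion kept the blunt name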

Looking ahead

In the next chapter, we'll start talking about branches.


Chapter 7: Branches
This is part of an online book called Source Control HOWTO (source_control.html) , a best practices guide
on source control, version control, and configuration management.

What is a branch?

A branch is what happens when your development team needs to work on two distinct copies of a project at the
same time. This is best explained by citing a common example:

Suppose your development team has just finished and released version 1.0 of UltraHello, your new flagship
product, developed with the hope of capturing a share of the rapidly growing market for "Hello World"
applications.

But now that 1.0 is out the door, you have a new problem you have never faced before. For the last two years,
everybody on your team has been 100% focused on this release. Everybody has been working in the same tree
of source code. You have had only one "line of development", but now you have two:

Development of 2.0. You have all kinds of new features which just didn't make it into 1.0, including
"multilingual Hello", DirectX support for animated Hellos, and of course, the ability to read email
(http://www.catb.org/~esr/jargon/html/Z/Zawinskis-Law.html) .
Maintenance of 1.0. Now that real customers are using UltraHello, they will probably find at least
one bug your testing didn't catch. For bug fixes or other minor improvements requested by customers,
it is quite possible that you will need to release a version 1.0.1.

It is important for these two lines of development to remain distinct. If you release a version 1.0.1, you don't
want it to contain a half-completed implementation of a 2.0 feature. So what you need here is two distinct
source trees so your team can work on both lines of development without interfering with each other.

The most obvious way to solve this problem would simply be to make a copy of your entire source control
repository. Then you can use one repository for 1.0 maintenance and the other repository for 2.0
development. I know people who do it this way, but it's definitely not a perfect solution.

The two-repository approach becomes disappointing in situations where you want to apply a change to both
trees. For example, every time we fix a bug in the 1.0 maintenance tree, we probably also want to apply that
same bug fix to the 2.0 development tree. Do we really want to have to do this manually? If the bug fix is a
simple change, like fixing the incorrect spelling of the word "Hello", then it won't take a programmer very long
to make the change twice. But some bug fixes are more involved, requiring changes to multiple files. It would
be nice if our source control tool would help. A primary goal for any source control tool should be to help
software teams be more concurrent: everybody busy, all at the same time, without getting in each other's way.

To address this very type of problem, source control tools support a feature which is usually called
"branching". This terminology arises from the tendency of computer scientists to use the language of a
physical tree every time hierarchy is involved. In this particular situation, the metaphor breaks down very
quickly, but we keep the name anyhow.

A somewhat better metaphor happens when we envision a nature path which forks into two directions. Before
the fork, there was one path. Now there are two, but they share a common history. When you use the
branching feature of your source control tool, it creates a fork in the path of your development progress. You
now have two trees, but the source control has not forgotten the fact that these two trees used to be one. For
this reason, the SCM tool can help make it easier to take code changes from one fork and apply those changes
to the other. We call this operation "merging branches", a term which highlights why the physical tree
metaphor fails. The two forks of a nature path can merge back into one, but two branches of an oak tree just
don't do that. I'll talk a lot more about merging branches in the next chapter.

At this point I should take a step back and admit that my example of doing 1.0 maintenance and 2.0 features is
very simplistic. Real life examples are sometimes far more complicated, involving multiple branches, active
development in each branch, and the need to easily migrate changes between any two of them. Branching and
merging is perhaps the most complex operation offered by a source control tool, and there is much to say
about it. I'll begin with some "cars and clocks (scm_repositories.html) " stuff and talk about how branching
works "under the hood".

Two branching models

Best Practice: Organize your branches

The "folder" model of branching usually requires you to have one extra level of hierarchy in your
repository tree. Keep your main development in a folder named $/trunk. Then create another folder called
$/branches. Each time you create a branch off of the trunk, put it in $/branches.

First of all, let's acknowledge that there are [at least] two popular models for branching. In the first
approach, a branch is like a parallel universe. The hierarchy of files and folders in the repository is
sort of like the regular universe. For each branch, there is another universe which contains the same
hierarchy of files and folders, but with different contents.

In order to retrieve a file, you specify not just a path but the name of the universe, er, branch, from which you
want the file retrieved. If you don't specify a branch, then the file will be retrieved from the "default branch".
This is the approach used by CVS and PVCS.

In the other branching model, a branch is just another folder, located in the same repository hierarchy as
everything else. When you create a branch of a folder, it shows up as another folder. With this approach, a
repository path is sufficient to describe a location.

Personally, I prefer the "folder" style of branching over the "parallel universe" style of branching, so my
writing will generally come from this perspective. This is the approach used by most modern source control
tools, including Vault, Subversion (they call it "copy (http://subversion.tigris.org/) "), Perforce (they call it
"Inter-File Branching (http://www.perforce.com/perforce/branch.html) ") and Visual Studio Team
System (looks like they call it branching in "path space (http://blogs.msdn.com/team_foundation/archive
/2005/02/23/379179.aspx) ").

Under the hood

Good source control tools are clever about how they manage the underlying storage issues of branching. For
example, let us suppose that the source code tree for UltraHello is stored in $/projects/Hello/trunk. This
folder contains everything necessary to do a complete build of the shipping product, so there are quite a few
subfolders and several hundred files in there.

Now that you need to go forward with 1.0 maintenance and 2.0 development simultaneously, it is time to
create a branch. So you create a folder called $/projects/Hello/branches. Inside there, you create a branch
called 1.0.

At the moment right after the branch, the following two folders are exactly the same:

$/projects/Hello/trunk

$/projects/Hello/branches/1.0

It appears that the source control tool has made an exact copy of everything in your source tree, but actually it
hasn't. The repository database on disk has barely increased in size. Instead of duplicating the contents of
every file, it has merely pointed the branch at the same contents as the trunk.

As you make changes in one or both of these folders, they diverge, but they continue to share a common
history.
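
In Subversion, for example, which uses the folder model, creating this branch is a single, nearly
instantaneous server-side copy (the repository URL is hypothetical):

    svn copy http://server/repo/projects/Hello/trunk \
             http://server/repo/projects/Hello/branches/1.0 \
             -m "create 1.0 maintenance branch"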

The Pitiful Lives of Nelly and Eddie

In order to use your source control tool most effectively, you need to develop just the right amount of fear of
branching. This delicate balance seems to be very difficult to find. Most people either have too much fear or
not enough.

Nelly is an example of a person who has too much fear of branching. Nelly has a friend who has a cousin with
a neighbor who knows somebody whose life completely fell apart after they tried using the branch and merge
features of their source control tool. So Nelly refuses to use branching at all. In fact, she wrote a 45-page
policy document which requires her development team to never use branching, because after all, "it's not
safe".

So Nelly's development team goes to great lengths to avoid using branching, but eventually they reach a point
where they need to do concurrent development. When this happens, they do anything they can to solve the
problem, as long as it doesn't involve the word "branch". They fork a copy of their tree and begin working with
two completely separate repositories. When they need to make a change to both repositories, they simply
make the change by hand, twice.

Obviously these people are still branching, but they keep Nelly happy by never using "the b word". These
folks are happy, and we should probably just leave them alone, but the whole situation is kind of sad.
Their source control tool has features which were specifically designed to make their lives easier.

Best Practice: Don't be afraid of branches

If you're doing parallel development, let your source control tool help. That's what it was designed to
do.

At the other end of the spectrum is Eddie, who uses branching far too often. Eddie started out just like Nelly,
afraid of branching because he didn't understand it. But to his credit, Eddie overcame his fear and learned
how powerful branching and merging can be.

And then he went off the deep end.

After he tried branching and had a good first experience with it, Eddie now uses it all the time. He sometimes
branches multiple times per week. Every time he makes a code change, he creates a private branch.

Eddie arrives on Monday morning and discovers that he has been assigned bug 7136 (In the Elbonian version,
the main window is too narrow because the Elbonian language requires 9 words to say "Hello World".) So
Eddie sits down at his desk and begins the process of fixing this bug. The first thing he does is create a branch
called "bug_7136". He makes his code change there in his "private branch" and checks it in. Then, after
verifying that everything is working okay, he uses the Merge Branches feature to migrate all changes from the
trunk into his private branch, just to make sure his code change is compatible with the very latest stuff. Then
he runs his test suite again. Then he notices that the repository has changed yet again, then he does this loop
once more. Finally, he uses Merge Branches to apply his code fixes to the trunk. Then he grabs a copy of the
trunk code, builds it and runs the test suite to verify that he didn't accidentally break anything. When at last he
is satisfied that his code change is proper, he marks bug 7136 as complete. By now it is Friday afternoon at
4:00pm, and there's no point in starting anything new at this point, so he just decides to go home.

Eddie never checks anything into the main trunk. He only checks stuff into his private branch, and then
merges changes into the trunk. His care and attention to detail are admirable, but he's spending far more
time using his source control tool than working on his code.

Let's not even think about what the kids would be like if Eddie and Nelly were to get married.

Dev--Test--Prod

Once you have established the proper level of comfort with the branching features of your source control tool, the
next question is how to use those features effectively.

One popular methodology for SCM is often called "code promotion". The basic idea here is that your code
moves through three stages, "dev" (stuff that is in active development), "test" (stuff that is being tested) and
"prod" (stuff that is ready for production release):

As code gets written by programmers, it is placed in the dev tree. This tree is "basically unstable".
Programmers are only allowed to check code into dev.
When the programmers decide they are done with the code, they "promote" it from dev to
test. Programmers are not allowed to check code directly into the test tree. The only way to get code
into test is to promote it. By promoting code to test, the programmers are handing the code over to the
QA team for testing.
When the testers decide the code meets their standards, they promote it from test to prod. Code can
only be part of a release when it has been promoted to prod.
For a variety of reasons, I personally don't like working this way, but there's nothing wrong with it. Lots of
people use this code promotion model effectively, especially in larger companies where the roles of
programmer and tester are very clearly separated.

I understand that PVCS has specific feature support for "promotion groups", although I've never used this
product personally. With other source control tools, the code promotion model can be easily implemented
using three branches, one for dev, one for test, and one for prod. The Merge Branches feature is used to
promote code from one level to the next.
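
With a folder-style tool such as Subversion, a promotion from dev to test is just a merge into a working
copy of the test branch, followed by a commit. A sketch, with hypothetical URL and revision numbers:

    # in a working copy of the test branch
    svn merge -r 100:150 http://server/repo/dev .
    svn commit -m "promote dev changes r100:150 to test"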

Eric's Preferred Branching Practice

Best Practice: Keep a "basically unstable" trunk

Do your active development in the trunk, the stability of which increases as you approach release. After
you ship, create a maintenance branch and always keep it very stable.

Here at SourceGear our main development tree is called the "trunk". In our repository it is rooted at
$/trunk and it contains all the source code and documentation for our entire product.

Most new code is checked into the trunk. In general, our developers try to never "break the tree".
Anyone who checks in code which causes the trunk builds to fail will be the recipient of heaping
helpings of trash talk and teasing until he gets it fixed. The trunk should always build, and as much as
possible, the resulting build should always work.

Nonetheless, the trunk is the place where active development of new features is happening. The trunk could be
described as "basically unstable", a philosophy of branching which is explained in Essential CVS
(http://www.amazon.com/exec/obidos/ASIN/0596004591/sawdust08-20) , a fine book on CVS by O'Reilly.
In our situation, the stability of the trunk build fluctuates over the months during our development cycle.

During the early and middle parts of a development cycle, the trunk is often not very stable at all. As we
approach alpha, beta and final release, things settle down and the trunk gets more and more stable. Not long
before release, the trunk becomes almost sacred. Every code change gets reviewed carefully to ensure that we
don't regress backwards.

At the moment of release, a branch gets created. This branch becomes our maintenance tree for that release.
Our current maintenance branch is called "3.0", since that's the current major version number of our
product. When we need to do a bug fix or patch release, it is done in the maintenance branch. Each time we
do a release out of the maintenance branch (like 3.0.2), we apply a label.

After the maintenance branch is created, the trunk once again becomes "basically unstable". Developers start
adding the risky code changes we didn't want to include in the release. New feature work begins. The cycle
starts over and repeats itself.
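
In Subversion terms, this rhythm amounts to a couple of cheap copies per release cycle. A sketch, with a
hypothetical repository URL:

    # at the moment of release, create the maintenance branch
    svn copy http://server/repo/trunk http://server/repo/branches/3.0 \
        -m "create 3.0 maintenance branch"
    # each release out of the maintenance branch gets a label
    svn copy http://server/repo/branches/3.0 http://server/repo/tags/3.0.2 \
        -m "label the 3.0.2 release"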

When to branch? Part 1: Principles

Best Practice: Don't create a branch unless you are willing to take care of it

A branch is like a puppy.

Your decisions about when to branch should be guided by one basic principle: When you create a branch,
you have to take care of it. There are responsibilities involved.

In most cases, you will eventually have to perform one or more merge operations. Yes, the SCM tool will
make that merge easy, but you still have to do it.
If a merge is never necessary, then you probably have the responsibility of maintaining the branch
forever.
If you create a branch with the intention of never merging to or from it, and never making changes to
it, then you should not be creating a branch. Use a label instead.

Be afraid of branches, but not so afraid that you never use the feature. Don't branch on a whim, but do branch
when you need to branch.

When to branch? Part 2: Scenarios


There are some situations where branching is NOT the recommended way to go:

Simple changes. As I mentioned above in my "Eddie" scenario, don't branch for every bug fix or
feature.
Customer-specific versions. There are exceptions to this rule, but in general, you should not branch
simply for the sake of doing a custom version for a specific customer. Find a way to build the
customizability into your app.

And there are some situations where branching is the best practice:

Maintenance and development. The classic example, and the one I used above in my story about
UltraHello. Maintaining version N while developing version N+1 is the perfect example of a time to use
branching.
Subteam. Sometimes a subset of your team needs to work on something experimental that will take
several weeks. When they finish, their work will be folded into the main tree, but in the meantime, they
need a separate place to work.
Code promotion. If you want to use the dev-test-prod methodology I mentioned above, use a branch
to model each of the three levels of code promotion.

When to branch? Part 3: Pithy Analogy

A branch is like a working folder for multiple people.

A working folder facilitates parallel development by allowing each person to have their own private
place to work.
When multiple people need a private place to work together, they need a branch.

Looking Ahead

In the next chapter I will delve into the topic of merging branches.
Chapter 8: Merge Branches
This is part of an online book called Source Control HOWTO (source_control.html) , a best practices guide
on source control, version control, and configuration management.

What is "merge branches"?

Many users find the word "merge" to be confusing, since it seems to imply that we start out with two things and
end up with only one. I'm not going to start trying to invent new vocabulary. Instead, let's just try to be clear
about what we mean when we speak about merging branches. I define "merge branches" like this:

To "merge branches" is to take some changes which were done to one branch and apply them to another
branch.

Sounds easy, doesn't it? In practice, merging branches often is easy. But the edge cases can be really tricky.

Consider an example. Let's say that Joe has made a bunch of changes in $/branch and we want to apply those
changes to $/trunk. At some point in the past, $/branch and $/trunk were the same, but they have since
diverged. Joe has been making changes to $/branch while the rest of the team has continued making changes
to $/trunk. Now it is time to bring Joe back into the team. We want to take all the changes Joe made to
$/branch, no matter what those changes were, and we want to apply those changes to $/trunk, no matter what
changes have been made to $/trunk during Joe's exile.

The central question about merge branches is the matter of how much help the source control tool can
provide. Let's imagine that our SCM tool provided us with a slider control:

If we drag this slider all the way to the left, the source control tool does all the work, requiring no help at all
from Joe. Speaking as a source control vendor, this is the ideal scenario that we strive for. Most of us don't
make it. However, here at SourceGear we made the decision to build our source control product on the .NET
Framework, which luckily has full support for the kind of technology needed to implement this. The code
snippet below was pasted from our implementation of the Merge Branches feature in Vault:

public void MergeBranches(Folder origin, Folder target)
{
    ArrayList changes = GetSelectedChanges(origin);

    DeveloperIntention di =
        System.Magic.FigureOutWhatDeveloperWasTryingToDo(changes);

    di.Apply(target);
}

Boy do I feel sorry for all those other source control vendors trying to implement Merge Branches without the
DeveloperIntention class! And to think that so many people believe the .NET Framework is too large. Sheesh!

OK, I lied. (Stop trying to add a reference to the System.Magic DLL. It doesn't exist.) The actual truth
is that this slider can never be dragged all the way to the left.

If we drag the slider all the way to the right, we get a situation which is actually closer to reality.
Joe does all the work and the source control tool is no help at all. In essence, Joe sits down with
$/trunk and simply re-does the work he did in $/branch. The context is different, so the changes he
makes this time may be very different from what he did before. But Joe is smart, and he can figure out
The Right Thing to do.

Best Practice: Take responsibility for the merge

Successfully using the branching and merging features of your source control tool is first a matter of
attitude on the part of the developer. No matter how much help the source control tool provides, it is
not as smart as you are. You are responsible for doing the merge. Think of the tool as a tool, not as a
consultant.

In practice, we find ourselves somewhere between these two extremes. The source control tool cannot do
magic, but it can usually help make the merge easier.

Since the developer must still take responsibility for the merge, things will go more smoothly if she
understands what's really going on. So let's talk about how merge branches works. First I need to define a bit
of terminology.

For the remainder of this chapter I will be using the words "origin" and "target" to refer to the two branches
involved in a merge branches operation. The origin is the folder which contains the changes. The target is the
folder to which we want those changes to be applied.

Note that my definition of merge branches is a one-way operation. We apply changes from the origin to the
target. In my example above, $/branch is the origin and $/trunk is the target. That said, there is nothing
which prevents me from switching things around and applying changes in the opposite direction, with $/trunk as
the origin and $/branch as the target, but that would simply be a separate merge branches operation.

Conceptually, a merge branches operation has four steps:

1. Developer selects changes in the origin
2. Source control tool applies some changes automatically to the target
3. Developer reviews the results and resolves any conflicts
4. Commit

Each of these steps is described a bit more in the following sections.

1. Selecting changes in the origin

When you begin a merge branches operation, you know which changes from the origin you want to be applied
over in the target. Most of the time you want to be very specific about which changes from the origin are to be
merged. This is usually evident in the conversation which preceded the merge:

"Dan asked me to merge all the bug fixes from 3.0.5 into the main trunk."
"Jeff said we need to merge the fix for bug 7620 from the trunk into the maintenance tree."
"Ian's experimental rewrite of feature X is ready to be merged into the trunk."

One way or another, you need to tell your source control tool which changes are involved in the merge. The
interface for this operation can vary significantly depending on which tool you are using. The screen shot
below is the point where the Merge Branches Wizard in Vault is asking me to specify which changes should be
merged. I'm selecting everything back to the last build label:
(screendumps/scm_mb_choose.png)

2. Applying changes automatically to the target

After selecting the changes to be applied, it's time to try and make those changes happen in the target. It is
important here to mention that merging branches requires us to consider every kind of change, not just the
common case of edited files. We need to deal with renames, moves, deletes, additions, and whatever else the
source control tool can handle.

I won't spell out every single case. Suffice it to say that each operation should be applied to the target in the
way that Makes Sense. This won't succeed in every situation, but when it does, it is usually safe. Examples:

If a file was edited in the origin and a file with the same relative path exists in the target, try to make the
same edit to the target file. Use the automerge algorithm I mentioned in chapter 3
(scm_file_merge.html) . If automerge fails, signal a conflict and ask the user what to do.
If a file was renamed in the origin, try doing the same rename in the target. Here again, if the rename
isn't possible, signal a conflict and ask the user what to do. For example, the target file may have been
deleted.
If a file was added in the origin, add it to the target. If doing so would cause a name clash, signal a
conflict and ask the user what to do.
What happens if an edited file in the origin has been moved in the target to a different subfolder?
Should we try to apply the edit? I'd say yes. If the automerge succeeds, there's a good chance it is safe.

Bottom line, a source control tool should do all the operations which seem certain to be safe. And even then,
the user needs a chance to review everything before the merge is committed to the repository.

Let's consider a simple example from Subversion. I created a folder called trunk, added a few files, and then
branched it. Then I made three changes to the trunk:

Deleted __init__.py
Modified panel.py
Added a file called anydbm.py

Then I asked Subversion to merge all changes between versions 2 and 4 of my trunk into my branch:
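
The command looks something like this, run from inside a working copy of the branch (the repository URL
here is hypothetical):

    svn merge -r 2:4 file:///home/eric/repos/trunk .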
Subversion correctly detected all three of my changes and applied them to my working copy of the branch.

3. Developer review

The next step in a merge branches operation is a review by the developer. The developer is ultimately
responsible, and is the only one smart enough to declare that the merge is correct. So we need to make
sure that the developer gives final approval before we commit the results of our merge to the
repository.

Best Practice: Review the merge before you commit

After your source control tool has done whatever it can do, it's your turn to finish the job. Any
conflicts need to be resolved. Make sure the code still builds. Run the unit tests to make sure
everything still works. Use a diff tool to review the changes.

Merging branches should always take place in a working folder. Your source control tool should give you
a chance to do these checks before you commit the final results of a merge branches operation.

This is the developer's opportunity to take care of anything which could not be done automatically by
the source control tool in step 2. For example, suppose the tree contains a file which is in a binary
format that cannot be automatically merged, and that this file has been modified in both the origin and
the target. In this case, the developer will need to construct a version of this file which correctly
incorporates both changed versions.

4. Commit

The very last step of a merge branches operation is to commit the results to the repository. Simplistically, this
is a commit like any other. Ideally, it is more. The difference is whether or not the source control tool
supports "merge history".

The benefits of merge history

Merge history contains special historical information about all merge branch operations. Each time you
use the merge branches feature, it remembers what happened. This allows us to handle two cases with a bit
more finesse:

Repeated merge.

Frequently you want to merge from the same origin to the same target multiple times. Let's suppose you
have a sub-team working in a private branch. Every few weeks you want to merge from the branch into the
trunk. When it comes time to select the changes to be merged over, you only want to select the changes
that haven't already been merged before. Wouldn't it be nice if the source control tool would just
remember this for you?

Merge history makes this automatic. Without merge history, the workaround is simply to use a label to
mark the point of your last merge.
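
With a tool that lacks merge history, such as Subversion 1.1, the bookkeeping falls to you. A sketch,
with hypothetical revision numbers and URL:

    # first merge: the branch was created at r100, so bring over everything since then
    svn merge -r 100:150 http://server/repo/branches/subteam .
    svn commit -m "merged subteam branch r100:150 into trunk"
    # weeks later, start the next merge from where the last one left off
    svn merge -r 150:200 http://server/repo/branches/subteam .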

Merge in both directions.

A similar case happens when you have two branches and you sometimes want to merge back and forth in
both directions. For example:

1. Create a branch
2. Do some work in both the branch and the trunk
3. Merge some changes from the branch to the trunk
4. Do some more work
5. Merge some changes from the trunk to the branch

At step 5, when it comes time to select changes to be merged, you want the changes from step 3 to be
ignored. There is no need to merge those changes from the trunk to the branch because the branch is
where those changes came from in the first place! A source control tool with a smart implementation of
merge history will know this.

Not all source control tools support merge history. A tool without merge history can still merge branches. It
simply requires the developer to be more involved, to do more thinking.

In fact, I'll have to admit that at the time of this writing, my own favorite tool falls into this category. We're
planning some major improvements to the merge branches feature for Vault 4.0, but as of version 3.x, Vault
does not support merge history. Subversion doesn't either, as of version 1.1. Perforce is reported to have a
good implementation of merge history, so we could say that its "slider" rests a bit further to the left.

Summary

I don't want this chapter to be a step-by-step guide to using any one particular source control tool, so I'm going
to keep this discussion fairly high-level. Each tool implements the merging of branches a little differently.

For some additional information, I suggest you look at Version Control with Subversion
(http://svnbook.red-bean.com/) , a book from O'Reilly. It is obviously Subversion-specific, but it contains a
discussion of branching and merging which I think is pretty good.

The one thing all these tools have in common is the need for the developer to think. Take the time to
understand exactly how the branching and merging features work in your source control tool.
Chapter 9: Source Control Integration with IDEs
This is part of an online book called Source Control HOWTO (source_control.html) , a best practices guide
on source control, version control, and configuration management.

Background: What is an IDE?


The various applications used by software developers are traditionally called "tools". When we speak of
"developer tools", we're talking about the essential items that programmers use every day, like compilers and
text editors and syntax checkers and Mountain Dew. Just as a master carpenter uses her tools to build a
house, developers use their tools to build software applications.

In the old days, each developer would assemble their own collection of their favorite tools. Back around 1991,
my preferred toolset looked something like this:

gcc (for compiling source code)
gdb (for debugging)
make (for managing builds)
rcs (for managing versions)
emacs (for editing source code)
vi (for editing the emacs makefile)

Fifteen years later, most developers would consider this approach to be strange. Today, everything is
"integrated". Instead of selecting one of each kind of tool, we select an Integrated Development Environment
(IDE), an application which collects all the necessary tools together in one place. To continue the metaphor,
we would say that the focus today is not on the individual tools, but rather, on the workshop in which those
tools are used.

This trend is hardly new. Ever since Borland released Turbo Pascal (http://en.wikipedia.org
/wiki/Turbo_Pascal) in 1983, IDEs have become more popular every year. In the last ten years, many IDE
products have disappeared as the industry has consolidated. Today, it is only a small exaggeration to say that
there are just two IDEs left: Visual Studio and Eclipse.

But despite the industry consolidation, the trend is clear. Developers want their tools to be very well
integrated together. Most recently, Microsoft's Visual Studio Team System (http://msdn.microsoft.com
/vstudio/teamsystem/default.aspx) takes this trend to a higher level than we have previously seen.
Mainstream IDEs in the past have provided base operations such as editing, compiling, building and
documentation. Now Visual Studio also has unit tests, visual modeling, code generators, and work item
tracking. Furthermore, the IDE isn't just for coders anymore. Every task performed by every person involved
in the software development process is moving into the IDE.

Benefits of source control integration with IDEs


Source control is one of the development tools which has been commonly integrated into IDEs for quite some
time. The fit is very natural. Here at SourceGear, our source control product has two main client
applications:

1. A standalone client application which is specifically designed to talk with the source control server.
2. A client plugin which adds source control features into Visual Studio.

Unsurprisingly, the IDE client is very popular with our users. Many of our users would never think about
using source control without IDE integration.

Why does version control work so nicely inside an IDE? Because it makes the three most common operations
a lot easier:

Checkout

When using the checkout-edit-checkin model, files must be checked out before they are edited. With
source control integrated into an IDE, this task can be quite automatic. Specifically, when you begin to
edit a file, the IDE will notice that you do not have it checked out yet and check the file out for you.
Effectively, this means developers never need to remember to checkout a file.

Add

A common and frustrating mistake is to add a new file to a project but forget to place it under source
control. So when I am done with my coding task, I check in my changes to the existing files, but the
newly added file never makes it into the repository. The build is broken.

When using source control integration with an IDE, this mistake is basically impossible to make. Most
IDEs today support the notion of a "project", a list of all files which are considered part of the build
process. When used with source control, the IDE decides what files to place under source control
because it knows every file that is part of the project. The act of adding a file to the project also adds it
to source control.

Checkin

IDEs excel at nagging developers. The user interface of an IDE has special places to nag the developer
about compiler errors and unsaved files and even unfixed bugs. Similarly, visual indicators in the IDE
can be used to remind the developer that he has not yet checked in his changes.

When source control is integrated into an IDE, developers don't have to think about it very much. They don't
have to try to remember to Checkout, Add or Checkin because the IDE is either performing those actions
automatically or reminding them to do it.

Bigger benefits
Once you integrate source control into an IDE, you open the possibility for cool features that go beyond the
basics. For example, source control integration can be incredibly helpful when used with refactoring. When I
use the refactoring features of Eclipse to rename a Java class, it is obviously nice that Eclipse figures out all the
changes that need to be made. It's even nicer that Eclipse automatically handles all the necessary source
control operations. It even performs the name change of the source file.

For another example, here is a screen shot of a Blame (scm_history.html) feature integrated into Eclipse:

(screendumps/scm_eclipse_blame.png)

The user story for this feature goes like this: The developer is coding and she encounters something that
deserves to be on The Daily WTF (http://thedailywtf.com/) . She wants to immediately know who is
responsible, so she right-clicks on the offensive line and selects the Blame feature. The source control plugin
queries the repository for history and determines who made the change. The task was simpler because the
Blame feature is conveniently located in the place where it is most likely to be needed.

Tradeoffs and Problems


For source control, IDE integration is great in theory, but it has not always been so great in practice. The
tradeoffs of having your IDE do source control for you are the same as the tradeoffs of having your IDE do
anything else. It's easier, but you have less control over the process.

Before I continue, I need to make a confession:

I personally have never used source control integration with an IDE. Heck, for a long time I didn't use
IDEs at all. I'm a control freak. It's not enough for me to know what's going on under the hood.
Sometimes I prefer to just do everything myself. I don't like project-based build systems where I add a
few files and the IDE magically builds my app. I like make-based build systems where I can control exactly
where everything is and where the build targets are placed.

Except for a brief and passionate affair with Think C (http://en.wikipedia.org/wiki/THINK_C) during
the late eighties, I didn't really start using IDE project files until Visual Studio .NET. Today, I am
gradually becoming more and more of an IDE user, but I still prefer to do all source control operations
using a standalone GUI client. Eventually, that will change, and my transformation to IDE user will be
complete.

Anyway, for the sake of completeness, I will explain the tradeoffs I see with using source control integration
with IDEs. This should be taken as information, not as an argument against the feature. IDE integration is
the most natural way to use source control on a daily basis.

The first observation is that IDE clients have fewer features than standalone clients. The IDE is great for basic
source control operations, but it is definitely not the natural place to perform all source control operations.
Some things, such as branching, don't fit very well at all. However, this is a minor point which merely
illustrates that an IDE client cannot be the only user interface for accessing a source control repository. If this
were the only problem, it would not be a problem. This is the sort of tradeoff that I would consciously accept.

The real problem with source control integration for IDEs is that it just doesn't work very well. For this sad
state of affairs, I put most of the blame on MSSCCI.

MSSCCI
It's pronounced "misskee", and it stands for Microsoft Source Code Control Interface. MSSCCI is the API
which defines the interaction between Microsoft Visual Studio and source control tools.

A source control tool which wants to support integration with Visual Studio must implement this API.
Basically, it's a DLL which defines a number of documented entry points. When configured properly, the IDE
makes calls into the DLL to perform source control operations as needed or as requested by the user.

Originally, Microsoft's developer tools were the only host environments for MSSCCI. Today, MSSCCI is used
by lots of other IDEs as well. It has become sort of a de facto standard. Source control vendors implemented
MSSCCI plugins so that their products could be used within Microsoft IDEs. In turn, vendors of other IDEs
implemented MSSCCI hosting so that their products could be used with the already-available source control
plugins.

The ubiquity of MSSCCI is very unfortunate. MSSCCI was designed to be a bridge between SourceSafe and
early versions of Microsoft Visual Studio. It served this purpose just fine, but now the API is being used for lots
of other version control tools besides SourceSafe and lots of other IDEs besides Visual Studio. It is being used
in ways that it was never designed to be used, resulting in lots of frustration.

The top three problems with MSSCCI are:

1. Poor performance. SourceSafe has no support for networking, but the architecture of most modern
version control tools involves a client and a server with TCP in between. To get excellent performance
from a client-server application, careful attention must be paid to the way the networking is done.
Things like threading and blocking and buffering are very important. Unfortunately, MSSCCI makes
this rather difficult.
2. No Edit-Merge-Commit. SourceSafe is basically built around the Checkout-Edit-Checkin approach,
so that's how MSSCCI works. Building a satisfactory MSSCCI plugin for the Edit-Merge-Commit
paradigm is very difficult.
3. No Atomic transactions. SourceSafe has no support for atomic transactions, so MSSCCI and Visual
Studio were not designed to use them. This means that sometimes modern version control tools like
Vault can't group things together properly at commit time.

On top of all this, all the world's MSSCCI hosts tend to implement their side of the API a little differently. If
you implement a MSSCCI plugin and get everything working with Visual Studio 2003, you have approximately
zero chance of it working well with Visual Basic 6, FoxPro or Visual Interdev. After you code in all the special
hacks to get things compatible with these fringe Microsoft environments, your plugin still has no real chance of
working with third party products like MultiEdit (http://www.multiedit.com/) . Every IDE requires some
different tweaks and quirky behavior to make it work. By the time you get your plugin working with some of
these other IDEs, your regression testing shows that it doesn't work with Visual Studio 2003 anymore.
Lather. Rinse. Repeat.

Most developers who work with MSSCCI eventually turn to recreational pharmaceuticals in a futile effort to
cope.

A brighter future
Luckily, MSSCCI is fading away. Earlier in this article I flippantly joked that Visual Studio and Eclipse were
the only IDEs left in the world. This is of course an exaggeration, but the fact remains that these two products
have the lion's share, so we can take some comfort in their dominance when we think about the prevalence of
MSSCCI in the future:

Eclipse does not use MSSCCI. It has its own source control integration APIs.
Visual Studio 2005 introduced a new and greatly improved API for source control integration.

So, the two dominant IDEs today inspire us to dream of a MSSCCI-free world. The planet will certainly be a
nicer place to live when MSSCCI is a distant memory.

Here at SourceGear, the various problems with MSSCCI have caused us to hold a cautious and reserved stance
toward IDE integration. Most of our customers really would prefer an IDE client, so we give them one. But we
consider our standalone GUI client to be the primary UI because it is faster and more full-featured. And
internally, most of us on the Vault team use the standalone GUI client for our everyday work.

But our posture is changing dramatically. We are currently working on an Eclipse plugin as well as a
completely new plugin for the new source control API in Visual Studio 2005. Sometime in early 2007, we will
be ready to consider our IDE clients to be primary, with our other client applications to be available for less
common operations. What do I mean when I say "primary"? Well, among other things, I mean that the IDE
clients will be the way we use our own product. Including me. :-)

It's not yet terribly impressive to look at, but here's a screen shot of our new Visual Studio 2005 client:
(screendumps/scm_vsip_client.png)

Final thoughts
The direction of this industry right now is toward more and more integration. This is a very good thing. We're
going to see many new improvements. Users will be happier. Just as a spice rack belongs near the stove,
source control should always be available where the developer is working.
