I am writing a series of articles explaining how to do source control and the best practices thereof. See below
for links to the individual chapters in this series. The Introduction explains my motivations and goals for
writing this series.
Please note: This is a work in progress. I plan to add new chapters over time, and I may also revise the
existing chapters as I go along.
Printer-friendly version: Sorry folks, but I currently do not have this material available in a form which is
more suitable for paper. I am planning to eventually publish this material as a book. When that happens, a
link will appear here.
Our universities don't teach people how to do source control. Our employers don't teach people how
to do source control. SCM tool vendors don't teach people how to do source control. We need some
materials that explain how source control is done. My goal for this series of articles is to create a
comprehensive guide to help meet this need.
Our discussion of source control must begin by defining the basic terms and describing the basic
operations.
In this chapter, I will explore the various situations wherein a repository is modified, starting with
the simplest case of a single developer making a change to a single file.
Many software teams have discovered that the tradeoff here is worth the trouble. Concurrent
development can bring substantial gains in the productivity of a team. The extra effort to deal with
merge situations is usually a small price to pay.
A file system is two-dimensional: its space is defined by directories and files. In contrast, a
repository is three-dimensional: it exists in a continuum defined by directories, files and time. An
SCM repository contains every version of your source code that has ever existed. The additional
dimension creates some rather interesting challenges in the architecture of a repository and the
decisions about how it manages data.
The repository is the official archive of our work. We treat our repository with great respect. In
contrast, we treat our working folder with very little regard. It exists for the purpose of being
abused. Our working folder starts out worthless, nothing more than a copy of the repository. If it is
destroyed, we have lost nothing, so we run risky experiments which endanger its life.
There is nothing endearing about a development team that can't find something when they need it.
A good SCM tool must do more than just keep every version of everything. It must also provide
ways of searching and viewing and sorting and organizing and finding all that stuff.
Chapter 7: Branches (scm_branches.html)
Nelly has a friend who has a cousin with a neighbor who knows somebody whose life completely fell
apart after they tried using the branch and merge features of their source control tool. So Nelly
refuses to use branching at all.
Successfully using the branching and merging features of your source control tool is first a matter of
attitude on the part of the developer. No matter how much help the source control tool provides, it is
not as smart as you are. You are responsible for doing the merge. Think of the tool as a tool, not as a
consultant.
Just as a spice rack belongs near the stove, source control should always be available where the
developer is working.
Sometimes we call it "version control". Sometimes we call it "SCM", which stands for either "software
configuration management" or "source code management". Sometimes we call it "source control". I use all
these terms interchangeably and make no distinction between them (for now anyway -- configuration
management actually carries more advanced connotations I'll discuss later).
By any of these names, source control is an important practice for any software development team. The most
basic element in software development is our source code. A source control tool offers a system for managing
this source code.
There are many source control tools, and they are all different. However, regardless of which tool you use, it is
likely that your source control tool provides some or all of the following basic features:
Introduction
My goal for this series of articles is to help people learn how to do source control. I work for SourceGear, a
developer tools ISV. We sell an SCM tool called Vault (http://www.sourcegear.com/vault/). Through the
experience of selling and supporting this product, I have learned something rather surprising:
Our employers don't teach people how to do source control. In fact, many employers provide their developers
with no training at all.
SCM tool vendors don't teach people how to do source control. We provide documentation on our products,
but the help and the manuals usually amount to simple explanations of the program's menus and dialogs. We
sort of assume that our customers come to us with a basic background.
Here at SourceGear, our product is positioned specifically as a replacement for SourceSafe. We assume that
everyone who buys Vault already knows how to use SourceSafe. However, experience is teaching us that this
assumption is often untrue. One of the most common questions received by our support team is from users
asking for a solid explanation of the basics of source control.
Ideally, a guide like this would be entirely tool-agnostic. However, in the case of SCM tools, this tool-agnostic
approach is somewhat difficult to achieve. Unlike writing, source control is simply not done without the
assistance of specialized tools. With no tools at all, the methods of source control are not practical.
Complicating matters further is the fact that not all source control tools are alike. There are at least dozens of
SCM tools available, but there is no standard set of features or even a standard terminology. The word
"checkout" has different meanings for CVS and SourceSafe. The word "branch" has very different semantics
for Subversion and PVCS.
So I will keep the tool-neutral ideal in mind as I write, but my articles will often be somewhat tool-specific.
Vault is the tool I know best, since I have played a big part in its design and coding. Furthermore, I freely
acknowledge that I have a business incentive to talk about my own product. Although I will often mention
other SCM tools, the articles in this series will use the terminology of Vault.
Several SCM tools that I mention in this series are listed below, with hyperlinks for more information.
This is a very incomplete list. There are many SCM tools, and I am not interested in trying to produce and
maintain an accurate listing of them all.
Audience
When we apply some of the concepts of source control to the world of traditional documents, the result is
called "document management". I'm not writing about any of those usage scenarios.
When we apply some of the concepts of source control to the world of graphic design, the result is called "asset
management". I'm not writing about any of those usage scenarios.
My audience here is the group of people who deal primarily with source code files or HTML files.
First of all, let me say a thing or two about political correctness. Through these articles, I will occasionally find
the need for gender-specific pronouns. In such situations, I generally try to use the male and female variants
of the words with approximately equal frequency.
Second of all, please accept my apologies if my dry sense of humor ever becomes a distraction from the
material. I am writing about source control and trying to make it interesting. That's like writing about sex and
trying to make it boring, so please cut me some slack if I try to make you chuckle along the way.
Looking Ahead
Source control is a large topic, so there is much to be said. I plan for the chapters of this series to be sorted
very roughly from the very basic to the very advanced. In the next chapter, I'll start by defining the most
fundamental terminology of source control.
Chapter 1: Basics
This is part of an online book called Source Control HOWTO (source_control.html), a best practices guide
on source control, version control, and configuration management.
Our discussion of source control must begin by defining the basic terms and describing the basic operations.
Let's start by defining two important terms: repository and working folder.
An SCM tool provides a place to store your source code. We call this place a repository. The repository exists
on a server machine and is shared by everyone on your team.
Each individual developer does her work in a working folder, which is located on a desktop machine and
accessed using a client.
Each of these things is basically a hierarchy of folders. A specific file in the repository is described by its path,
just like we describe a specific file on the file system of your local machine. In Vault and SourceSafe, a
repository path starts with a dollar sign. For example, the path for a file might look like this:
$/trunk/src/myLibrary/hello.cs
The workflow of a developer is an infinite loop which looks something like this:

1. Get the latest code from the repository into the working folder.
2. Work on a task, editing the files in the working folder.
3. Checkin the completed work to the repository.
4. Go back to step 1.

I've omitted certain details like staff meetings and vacations, but this loop essentially describes the life of a
developer who is working with an SCM tool. The repository is the official place where all completed work is
stored. A task is not considered to be completed until the repository contains the result of that task.
Let's imagine for a moment what life would be like without this distinction between working folder and
repository. In a single-person team, the situation could be described as tolerable. However, for any plurality
of developers, things can get very messy.
I've seen people try it. They store their code on a file server. Everyone uses Windows file sharing and edits the
source files in place. When somebody wants to edit main.cpp, they shout across the hall and ask if anybody
else is using that file. Their Ethernet is saturated most of the time because the developers are actually
compiling on their network drives. When we sell our source control tool to someone in this situation, I feel like
an ER doctor. I go home that night with a feeling of true contentment, because I know that I have saved a life.
Because of this separation between working folder and repository, the most frequently used features of an SCM
tool are the ones which help us move things back and forth between them. Let's define some terms:
Add: A repository starts out completely empty, so we need to "Add" things to it. Using the "Add Files"
command in Vault you can specify files or folders on your desktop machine which will be added to the
repository.
Get: When we copy things from the repository to the working folder, we call that operation "Get". Note
that this operation is usually used when retrieving files that we do not intend to edit. The files in the
working folder will be read-only.
Checkout: When we want to retrieve files for the purpose of modifying them, we call that operation
"Checkout". Those files will be marked writable in our working folder. The SCM server will keep a record
of our intent.
Checkin: When we send changes back to the repository, we call that operation "Checkin". Our working
files will be marked back to read-only and the SCM server will update the repository to contain new
versions of the changed files.
Note that these definitions are merely starting points. The descriptions above correspond to the behavior of
SourceSafe and Vault (with its default settings). However, we will see later that other tools (such as CVS) work
somewhat differently, and Vault can optionally be configured in a mode which matches the behavior of CVS.
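To make these four operations concrete, here is a toy sketch in Python. Everything in it is invented for illustration (the class, the method names, the storage scheme); a real SCM tool is enormously more sophisticated, but the basic semantics of Add, Get, Checkout and Checkin look roughly like this:

```python
# A toy model of the four basic operations: Add, Get, Checkout, Checkin.
# Purely illustrative; all names here are invented for this sketch.

class ToyRepository:
    def __init__(self):
        self.files = {}      # repository path -> list of versions
        self.checkouts = {}  # repository path -> user holding the checkout

    def add(self, path, content):
        """Add: put a brand-new file into the repository as version 1."""
        self.files[path] = [content]

    def get(self, path):
        """Get: retrieve the latest version, for reading only."""
        return self.files[path][-1]

    def checkout(self, path, user):
        """Checkout: record the intent to edit (exclusive, in this toy)."""
        if path in self.checkouts:
            raise RuntimeError(f"already checked out by {self.checkouts[path]}")
        self.checkouts[path] = user
        return self.files[path][-1]

    def checkin(self, path, user, new_content):
        """Checkin: store a new version and release the checkout."""
        assert self.checkouts.get(path) == user, "must checkout before checkin"
        self.files[path].append(new_content)
        del self.checkouts[path]

repo = ToyRepository()
repo.add("$/trunk/src/myLibrary/hello.cs", "// version 1")
text = repo.checkout("$/trunk/src/myLibrary/hello.cs", "laura")
repo.checkin("$/trunk/src/myLibrary/hello.cs", "laura", text + "\n// a change")
```

Note that this sketch models the checkout-edit-checkin style with exclusive checkouts; as discussed below, other tools and configurations behave differently.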
Terminology note: Some SCM tools use these words a bit differently. Vault and SourceSafe use the word
"checkout" as a command which specifically communicates the intent to edit a file. For CVS, the "checkout"
command is used to retrieve files from the repository regardless of whether the user intends to edit the files or
not. Some SCM tools use the word "commit" instead of the word "checkin". Actually, Vault uses either of
these terms, for reasons that will be explained in a later chapter.
Your repository is more than just an archive of the current version of your code. Actually, it is an archive of
every version of your code. Your repository contains history. It contains every version of every file that has
ever been checked in to the repository. For this reason, I like to think of a source control tool as a time
machine.
The ability to travel back in time can be extremely useful for a software project. Suppose we need the ability to
retrieve a copy of our source code exactly as it looked on April 28th, 2002. An SCM tool makes this kind of
thing easy to do.
An even more common case is the situation where a piece of code looks goofy and nobody can figure out why.
It's handy to be able to look back at the history and understand when and why a certain change happened.
Over time, the complete history of a repository can become large and overwhelming, so SCM tools provide
ways to cope. For example, Vault provides a History Explorer which allows the history entries to be queried
and searched and sorted.
Perhaps more importantly, most SCM tools provide a feature called a "label" or a "tag". A label is basically a
way to mark a specific instant in the history of the repository with a meaningful name. The label makes it easy
to later retrieve a snapshot of exactly what the repository contained at that instant.
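Conceptually, a label is nothing more than a meaningful name attached to one instant in repository history. The following Python sketch (invented names, deliberately naive storage) shows the idea: once "release-2.0" is recorded against a repository version, the exact snapshot at that instant can always be retrieved by name:

```python
# A label is just a meaningful name for one instant in repository history.
# Hypothetical sketch; a real SCM tool stores history far more efficiently.

history = [
    {"hello.cs": "v1"},                   # repository version 1
    {"hello.cs": "v2", "util.cs": "v1"},  # repository version 2
    {"hello.cs": "v3", "util.cs": "v1"},  # repository version 3
]

labels = {}

def apply_label(name, version):
    labels[name] = version  # mark this instant with a meaningful name

def get_snapshot(name):
    return history[labels[name] - 1]  # versions are 1-based

apply_label("release-2.0", 2)
snapshot = get_snapshot("release-2.0")  # the tree exactly as it was then
```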
Looking Ahead
This chapter merely scratches the surface of what an SCM tool can provide, making brief mention of two
primary benefits:
Working folders provide developers with a private workspace which is distinct from the main repository.
Repository history provides a complete archive of every change and why it was made.
In the next chapter, I'll be going into much greater detail on the topic of checkins.
Chapter 2: Checkins
This is part of an online book called Source Control HOWTO (source_control.html), a best practices guide
on source control, version control, and configuration management.
In this chapter, I will explore the various situations wherein a repository is modified, starting with the simplest
case of a single developer making a change to a single file.
Consider the simple situation where a developer needs to make a change to one source file. This case is
obviously rather simple:

1. Checkout the file.
2. Edit the working file.
3. Checkin the file.
I won't talk much about step 2 here, as it doesn't really involve the SCM tool directly. Editing the file usually
involves the use of some other tools, like an integrated development environment (IDE).
Step 1: Checkout
On the server, the SCM tool will remember the fact that you have the file checked out so that others may
be informed.
On your client, the SCM tool will prepare your working file for editing by changing it to be writable.
File checkouts are a way of communicating your intentions to others. When you have a file checked out, other
users can be aware and avoid making changes to that file until you are done with it. The checkout status of a
file is usually displayed somewhere in the user interface of the SCM client application. For example, in the
following screendump from Vault, users can see that I have checked out libsgdcore.cpp:
This screendump also hints at the fact that there are actually two kinds of checkouts. The issue here is the
question of whether two people can checkout a file at the same time. The answer varies across SCM tools.
Some SCM tools can be configured to behave either way.

Sometimes the SCM tool will allow multiple people to checkout a file at the same time. SourceSafe and Vault
both offer this capability as an option. When this "multiple checkouts" feature is used, things can get a bit
more complicated. I'll talk more about this later.

If the SCM tool prevents anyone else from checking out a file which I have checked out, then my checkout is
"exclusive" and may be described as a "lock". In the screendump above, the user interface is indicating that I
have an exclusive lock on libsgdcore.cpp. Vault will allow no one else to checkout this file.

Best Practice: Use checkouts and locks carefully

It is best to use checkouts and locks only when you need them. A checkout discourages others from modifying
a file, and a lock prevents them from doing so. You should therefore be careful to use these features only when
you actually need them.

Don't checkout files just because you think you might need to edit them.

Don't checkout whole folders. Checkout the specific files you need.

Don't checkout hundreds or thousands of files at one time.

Don't hold exclusive locks any longer than necessary.

Don't go on vacation while holding exclusive locks on files.
The client side of checkout
On the client side, the effect of a checkout is quite simple: If necessary, the latest version of the file is retrieved
from the server. The working file is then made writable, if it was not in that state already.
All of the files in a working folder are made read-only when the SCM tool retrieves them from the repository. A
file is not made writable until it is checked out. This prevents the developer from accidentally editing a file.
Undoing a checkout
Normally, a checkout ends when a checkin happens. However, sometimes we checkout a file and subsequently
decide that we did not need to do so. When this happens, we "undo the checkout". Most SCM tools have a
command which offers this functionality. On the server side, the command will remove the checkout and
release any exclusive lock that was being held. On the client side, Vault offers the user three choices for how
the working file should be treated:
Revert: Put the working file back in the state it was in when I checked it out. Any changes I made
while I had the file checked out will be lost.
Leave: Leave the working file alone. This option will effectively leave the file in a state which we call
"Renegade". It is a bad idea to edit a file without checking it out. When I do so, Vault notices my
transgression and chastises me by letting me know that the file is "Renegade".
Delete: Delete the working file.
I usually prefer to work with "Revert" as my option for how the Undo Check Out command behaves.
Step 3: Checkin
The following screendump shows the checkin dialog box from Vault:
Checkins are additive
It is reassuring to remember one fundamental axiom of source control: Nothing is ever destroyed. Let us
suppose that we are editing a file which is currently at version 4. When we checkin our changes, our new
version of the file becomes version 5. Clients will be notified that the latest version is now 5. Clients that are
still holding version 4 in their working folder will be warned that the file is now "Old".
But version 4 is still there. If we ask the server for the latest version, we will get 5. But if we specifically ask for
version 4, and for any previous version, we can still get it.
Each checkin adds to the history of our repository. We never subtract anything from that history.
We will informally use the word "checkin" to refer to any change which is made to the repository. It is
common for a developer to say, "I made some checkins this afternoon to fix that bug", using the word
"checkin" to include any of the following types of changes to the repository:

Add a file
Create a new folder
Rename a file or folder
Delete a file or folder
Move a file or folder
It may seem odd to refer to these operations using the word "checkin", because there is no corresponding
"checkout" step. However, this looseness is typical of the way people use the word "checkin", so you'll get used
to it.
I will take this opportunity to say a few things about how these operations behave. If we conceptually think of a
folder as a list of files and subfolders, each of these operations is actually a modification of a folder. When we
create a folder inside folder A, then we are modifying folder A to include a new subfolder in its list. When we
rename a file or folder, the parent folder is being modified.
Just as the version number of a file is incremented when we modify it, these folder-level changes cause the
version number of a folder to be incremented. If we ask for the previous version of a folder, we can still
retrieve it just the way it was before. The renamed file will be back to the old name. The deleted file will
reappear exactly where it was before.
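The idea that folder-level operations create new versions of the folder itself can be sketched like this (a toy model, not any real tool's storage format; the filenames are invented):

```python
# A folder, conceptually: a versioned list of the names it contains.
# Each delete or rename creates a new folder version; old versions survive.

folder_versions = [["hello.cs", "util.cs"]]  # version 1 of the folder

def modify_folder(change):
    new = change(list(folder_versions[-1]))  # work on a copy of the latest
    folder_versions.append(new)              # the old version is kept

# Delete util.cs: folder version 2 no longer lists it.
modify_folder(lambda names: [n for n in names if n != "util.cs"])

# Rename hello.cs: folder version 3.
modify_folder(lambda names: ["goodbye.cs" if n == "hello.cs" else n
                             for n in names])

# The latest folder version reflects both changes, yet version 1
# still lists the deleted file, exactly where it was before.
```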
It may bother you to realize that the "delete" command in your SCM tool doesn't actually delete anything.
However, you'll get used to it.
Atomic transactions
I've been talking mostly about the simple case of making a change to a single source code file. However, most
programming tasks require us to make multiple repository changes. Perhaps we need to edit more than one
file to accomplish our task. Perhaps our task requires more than just file modifications, but also folder-level
changes like the addition of new files or the renaming of a file.
When faced with a complex task that requires several different operations, we would like to be able to submit
all the related changes together in a single checkin operation. Although tools like SourceSafe and CVS do not
offer this capability, some source control systems (like Vault and Subversion) do include support for "atomic
transactions".
In the following screen dump, my pending change set contains three operations. I have modified
libsgdcore.cpp. I have renamed libsgdcore.h to headerfile.h. And I have deleted libsgdcore_diff_file.c.
Note that these operations have not actually happened yet. They won't happen unless I submit them to the
server, at which time they will take place as a single atomic transaction.
Vault persists the pending change set between sessions. If I shutdown my Vault client and turn off my
computer, next time I launch the Vault client the pending change set will contain the same items it does now.
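The all-or-nothing nature of an atomic checkin can be sketched as follows. This is only a conceptual model (real atomic transactions involve server-side journaling and locking): every operation in the pending change set is applied to a scratch copy, and the result is published in a single step, so a failure partway through leaves the repository untouched:

```python
# A toy atomic checkin: either every operation in the change set is
# applied, or none of them are.  Illustrative only.

import copy

repository = {"libsgdcore.cpp": "old body", "libsgdcore.h": "header text"}

def commit(repo, change_set):
    staged = copy.deepcopy(repo)  # apply all changes to a scratch copy
    for op in change_set:
        op(staged)                # any failure raises before we publish
    repo.clear()
    repo.update(staged)           # publish everything in one step

# A pending change set: modify one file and rename another.
pending = [
    lambda r: r.__setitem__("libsgdcore.cpp", "new body"),
    lambda r: r.__setitem__("headerfile.h", r.pop("libsgdcore.h")),
]
commit(repository, pending)

# A failing change set leaves the repository completely untouched.
try:
    commit(repository, [lambda r: r.pop("no_such_file")])
except KeyError:
    pass
```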
Up until now, I have explained everything about checkouts and checkins in a very "matter of fact" fashion. I
have claimed that working files are always read-only until they are checked out, and I have claimed that files
are always checked out before they are checked in. I have made broad generalizations and I have explained
things in terms that sound very absolute.
I lied.
In reality, there are two very distinct doctrines for how this basic interaction with an SCM tool can work. I
have been describing the doctrine I call "checkout-edit-checkin". Reviewing the simple case where a developer
needs to modify a single file, the practice of this faith involves the following steps:

1. Checkout the file.
2. Edit the working file.
3. Checkin the file.
Followers of the "checkout-edit-checkin" doctrine are effectively submitting to live according to the following
rules:
Files in the working folder are read-only unless they are checked out.
Developers must always checkout a file before editing it. Therefore, the entire team always knows who
is editing which files.
Checkouts are made with exclusive locks, so only one developer can checkout a file at one time.
This approach is the default behavior for SourceSafe and for Vault. However, CVS doesn't work this way at
all. CVS uses the doctrine I call "edit-merge-commit". Practicers of this religion will perform the following
steps to modify a single file:

1. Edit the working file.
2. Merge in any changes which others have checked in since the working file was retrieved.
3. Commit the merged result to the repository.
The edit-merge-commit doctrine is a liberal denomination which preaches a message of freedom from
structure. Its followers live by these rules:

Files in the working folder are always writable.
Developers edit files whenever they wish, without informing anyone beforehand.
Before committing, a developer must merge in any changes which others have checked in.
As I said, this is the approach which is supported by CVS. Vault supports edit-merge-commit as an option. In
fact, when this option is turned on, we informally say that Vault is running in "CVS mode".
Each of these approaches corresponds to a different style of managing concurrent development on a team.
People tend to have very strong feelings about which style they prefer. The religious flame war between these
two churches can get very intense.
Holy Wars
The "checkout-edit-checkin" doctrine is obviously more traditional and conservative. When applied strictly, it
is impossible for two people to modify a given file at the same time, thus avoiding the necessity of merging two
versions of a file into one.
The "edit-merge-commit" doctrine teaches a lifestyle which is riskier. The risk is that the merge step may be tedious or
cause problems. However, the acceptance of this risk rewards us with a concurrent development style which
causes developers to trip over each other a lot less often.
Still, these risks are real, and we will not flippantly disregard them. A detailed discussion of file merging
appears in the next chapter. For now I will simply mention that most SCM tools include features that can
safely do a three-way merge automatically. Not all developers are willing to trust this feature, but many do.
So, when using the "edit-merge-commit" approach, the merge must happen, and we are left with two choices:

Trust the automatic merge to combine the two sets of changes.
Carefully verify every merged file by hand.

Developers who prefer "checkout-edit-checkin" often find both of these choices to be unacceptable.
As I said, automerge is amazingly safe in practice. Thousands of teams use it every day without incident. I
have been actively using edit-merge-commit as my development style for over five years, and I cannot
remember a situation where automerge produced an incorrect file. Experience has made me a believer.
Looking Ahead
In the next chapter, I will be talking in greater detail about the process of merging two modified versions of a
file.
Chapter 3: File Merge
This is part of an online book called Source Control HOWTO (source_control.html), a best practices guide
on source control, version control, and configuration management.
There are several reasons why we may need to merge two modified versions of a file:
When using "edit-merge-commit" (sometimes called "optimistic locking"), it is possible for two
developers to edit the same file at the same time.
Even if we use "checkout-edit-checkin", we may allow multiple checkouts, resulting once again in the
possibility of two developers editing the same file.
When merging between branches, we may have a situation where the file has been modified in both
branches.
In other words, this mess only happens when people are working in parallel. If we serialize the efforts of our
team by never branching and never allowing two people to work on a module at the same time, we can avoid
ever facing the need to merge two versions of a file.
However, we want our developers to work concurrently. Think of your team as a multithreaded piece of
software, each developer running in its own thread. The key to high performance in a multithreaded system is
to maximize concurrency. Our goal is to never have a thread which is blocked on some other thread.
So we embrace concurrent development, but the threading metaphor continues to apply. Multithreaded
programming can sometimes be a little bit messy, and the same can be said of a multithreaded software
team. There is a certain amount of overhead involved in things like synchronization and context switching.
This overhead is inevitable. If your team is allowing concurrent development to happen, it will periodically
face a situation where two versions of a file need to be merged into one.
In rare cases, the situation can be properly resolved by simply choosing one version of the file over the other.
However, most of the time, we actually need to merge the two versions to create a new version.
Let's carefully state the problem as follows: We have two versions of a file, each of which was derived from the
same common ancestor. We sometimes call this common ancestor the "original" file. Each of the other
versions is merely the result of someone applying a set of changes to the original. What we want to create is a
new version of the file which is conceptually equivalent to starting with the original and applying both sets of
changes. We call this process "merging".
The difficulty of doing this merge varies greatly for different types of files. How would we perform a merge of
two Excel spreadsheets? Two PNG images? Two files which have digital signatures? In the general case, the
only way to merge two modified versions of a file is to have a very smart person carefully construct a new copy
of the file which properly incorporates the correct elements from each of the other two.
However, in software and web development there is a special case which is very common. As luck would have
it, most source code files are plain text files with an average of less than 80 characters per line. Merging files
of this kind is vastly simpler than the general case. Many SCM tools contain special features to assist with this
sort of a merge. In fact, in a majority of these cases, the two files can be automatically merged without
requiring the manual effort of a developer.
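The rule such an automerge follows can be sketched in a few lines of Python. This is a drastic simplification: it handles only in-place line edits (no insertions or deletions), but it shows the essential decision automerge makes for each line — take whichever side changed it, and give up when both sides changed it differently:

```python
# A drastically simplified three-way merge.  Real merge engines handle
# insertions, deletions and moved text; this sketch does not.

def merge_line(base, jane, joe):
    if jane == joe:
        return jane   # both sides agree (perhaps neither changed it)
    if jane == base:
        return joe    # only Joe changed this line: take his version
    if joe == base:
        return jane   # only Jane changed this line: take hers
    raise ValueError("conflict: a human must decide")

def three_way_merge(base, jane, joe):
    return [merge_line(b, ja, jo) for b, ja, jo in zip(base, jane, joe)]

base = ["// Global sales report", "total = compute()", "show(total)"]
jane = ["// Worldwide sales report", "total = compute()", "show(total)"]
joe  = ["// Global sales report", "total = compute()", "show(total, log)"]

merged = three_way_merge(base, jane, joe)
# Jane's comment edit and Joe's last-line edit both survive the merge.
```

When both developers change the same line to different things, this sketch raises an error, which is exactly the situation where a real tool flags a conflict and asks a human to decide.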
An example
Let's call our two developers Jane and Joe. Both of them have retrieved version 4 of the same file and both of
them are working on making changes to it.
One of these developers will checkin before the other one. Let's assume it is Jane who gets there first. When
Jane tries to checkin her changes, nothing unusual will happen. The current version of the file is 4, and that
was the version she had when she started making her changes. In other words, version 4 was her baseline for
these changes. Since her baseline matches the current version, there is no merge necessary. Her changes are
checked in, and a version of the file is created in the repository. After her checkin, the current version of the
file is now 5.
The responsibility for merging is going to fall upon Joe. When he tries to checkin his changes, the SCM tool
will protest. His baseline version is 4, but the current version in the repository is now 5. If Joe is allowed to
checkin his version of the file, the changes made by Jane in version 5 will be lost. Therefore, Joe will not be
allowed to checkin this file until he convinces the SCM tool that he has merged Jane's version 5 changes into
his working copy of the file.
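The server-side rule that puts Joe in this state is simple to sketch (an illustrative toy, not Vault's actual implementation): a checkin is rejected whenever the client's baseline is older than the repository's current version.

```python
# The rule behind "Needs Merge": a checkin is rejected when the
# client's baseline version is not the repository's current version.

current_version = 5   # Jane's checkin already produced version 5

def try_checkin(baseline):
    global current_version
    if baseline != current_version:
        return "Needs Merge"        # reject: the client must merge first
    current_version += 1
    return f"checked in as version {current_version}"

joe_first_try = try_checkin(baseline=4)    # rejected
joe_after_merge = try_checkin(baseline=5)  # accepted, creating version 6
```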
Vault reports this situation by setting the status on this file to be "Needs Merge", as shown in the screen dump
below:

Jane's changes and Joe's changes may lie in entirely different parts of the file. They may have modified the
very same lines in different ways. Jane's changes may even have made Joe's changes unnecessary. All of these
situations are possible, and all of them are Joe's responsibility. He must incorporate Jane's changes into his
file before he can checkin a version 6.
In certain rare situations, Joe may examine Jane's changes and realize that his version needs nothing from
Jane's version 5. Maybe Jane's change simply isn't relevant anymore. In these cases, the merge isn't needed,
and Joe can simply declare the merge to be resolved without actually doing anything. This decision remains
subject to Joe's judgment.
However, most of the time it will be necessary for the merge to actually happen. In these cases, Joe has the
following options:
Attempt to automerge
Use a visual merge tool
Redo one set of changes by hand
Attempt to automerge
As I mentioned above, a surprising number of cases can be easily handled automatically. Most source control
tools include the ability to attempt an automatic merge. The algorithm uses all three of the involved versions
of the file and attempts to safely produce a merged version.
This picture is typical of three-way visual merge applications. The left pane shows Jane's version of the
file. The right pane shows Joe's version. The center pane shows the original file, the common ancestor from
which they both started to make changes. As you can see, Jane and Joe have each inserted a one-line
comment. By right-clicking on each change, the developer can choose whether to apply that change to the
middle pane. In this example, the two changes don't conflict. There is no reason that the resulting file cannot
incorporate both changes.
(Screen dump: screendumps/scm_diffmerge_2.gif)
Both Jane and Joe have tried to change the wording of this comment. In the original file, the word used in the
comment was "Global". Jane decided to change this word to "Worldwide", but Joe has changed it to the word
"Rampant". These two changes are conflicting, as indicated by the yellow background color being used to
display them. Automerge cannot automatically handle cases like these. Only a human being can decide which
change to keep.
The visual merge tool makes it easy to handle this situation. I can decide which change I want to keep and
apply it to the center pane.
A visual merge tool can make file merging a lot easier by quickly showing the developer exactly what has
changed and allowing him to specify which changes should be applied to get the final merged result.
However, as useful as these kinds of tools can be, they're not magic.
Some situations are so complicated that a visual merge tool just isn't very helpful. In the worst case scenario,
Joe might have to manually redo one set of changes.
This situation recently happened here at SourceGear. We currently have Vault development happening in two
separate branches:
When we shipped version 2.0, we created a branch for maintenance of the 2.0 release. This is the tree
where we develop minor bug fix releases like 2.0.1.
Our "trunk" is the place where active development of the next major release is taking place.
Obviously we want any bug fixes that happen in the 2.0 branch to also happen in the trunk so that they can be
included in our upcoming 2.1 release. We use Vault's "Merge Branches" command to migrate changes from
one place to the other.
I will talk more about branching and merging in a later chapter. For now, suffice it to say that the merging of
branches can create exactly the same kind of three-way merge situation that we've been discussing in this
chapter.
In this case, we ended up with a very difficult merge in the sections of code that deal with logins.
In the 2.0 branch, we implemented a fix to prevent dictionary attacks on passwords. We considered this
a bug fix, since it is related to the security of our product. In concept this change was simple. We
simply block login for any account which is seeing too many failed login attempts. However,
implementing this mini-feature required a surprising number of lines to be changed.
In the trunk, we added the ability for Vault to authenticate logins against Active Directory.
In other words, we made substantial changes to the login code in both these branches. When it came time to
merge, the DiffMerge was extremely colorful.
In this case, it was actually simpler to just start with the trunk version and reimplement the dictionary attack
code. This may seem crazy, but it's actually not that bad. Redoing the changes takes a lot less time than
coding the feature the first time. We could still copy and paste code from the 2.0 version.
Getting back to the primary example, Joe has a choice to make. His current working file already contains his
own set of changes. He could therefore choose to redo Jane's change starting with his current working file.
The problem here is that he might not really know how. He might have no idea what Jane's approach was.
Jane's office might be 10,000 miles away. Jane might have written a lousy comment explaining her checkin.
As an alternative, Joe could set aside his working file, start with the latest repository version and redo his own
changes.
Bottom line: If a merge gets this bad, it takes some time and care to resolve it properly. Luckily, this situation
doesn't happen very often.
Regardless of which of the above methods is used to complete the merge, it is highly recommended that Joe
verify the correctness of his work. Obviously he should check that the entire source tree still compiles. If a test
suite is available, he should run it and verify that the tests still pass.
After Joe has completed the merge and verified it, he can declare the merge to be "resolved", after which the
SCM tool will allow him to checkin the file. In the case of Vault, this is done by using the Resolve Merge Status
command, which explicitly tells the Vault client application that the merge is completed. At this time, Vault
would change the baseline version number from 4 to 5, indicating that as far as anyone knows, Joe made his
changes by starting with version 5 of the file, not with version 4.
Since his baseline version now matches the current version of the file, the Vault server will now allow Joe to do
his checkin.
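The server's rule here is simple enough to express in a line of code. This toy check is my own illustration of the idea, not Vault's actual logic:

```python
def checkin_allowed(baseline_version, repository_version):
    """A checkin is accepted only if the client started from the current head."""
    return baseline_version == repository_version

print(checkin_allowed(4, 5))   # False: Joe started from version 4, head is 5
# Resolve Merge Status bumps Joe's recorded baseline from 4 to 5...
print(checkin_allowed(5, 5))   # True: ...so the server now accepts his checkin
```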
Remember that easily-resolved merges are the most common case. Automerge handles a large percentage of
situations with no problems at all. A large percentage of the remaining cases can be easily handled with a
visual merge tool. The difficult situations are rare, and can still be handled easily by a developer who is
patient and careful.
Many software teams have discovered that the tradeoff here is worth the trouble. Concurrent development
can bring substantial gains in the productivity of a team. The extra effort to deal with merge situations is
usually a small price to pay.
Many teams avoid all forms of concurrent development. Their entire team uses "checkout-edit-checkin" with
exclusive locks, and they never branch. For some small teams, this approach works just fine. However, the
larger your team, the more frequently a developer becomes "blocked" by having to wait for someone else.
Modern source control systems are designed to make concurrent development easy. Give them a try.
Looking Ahead
In the next chapter I will be discussing the concept of a repository in a lot more detail.
Chapter 4: Repositories
This is part of an online book called Source Control HOWTO (source_control.html) , a best practices guide
on source control, version control, and configuration management.
In previous chapters I have mentioned the concept of a repository, but I haven't said much further about it. In
this chapter, I want to provide a lot more detail. Please bear with me as I spend a little time talking about how
an SCM tool works "under the hood". I am doing this because an SCM tool is more like a car than a clock.
An SCM tool is not like a clock. Clock users have no need to know how a clock works inside. We just
want to know what time it is. Those who understand the inner workings of a clock cannot tell time any
more skillfully than the rest of us.
An SCM tool is more like a car. Lots of people do use cars without knowing how they work. However,
people who really understand cars tend to get better performance out of them.
Rest assured that this book is still a "HOWTO". My goal here remains to create a practical explanation of
how to do source control. However, I believe that you can use an SCM tool more effectively if you know a little
bit about what's happening inside.
A repository is the official place where you store all your source code. It keeps track of all your files, as well as
the layout of the directories in which they are stored. It resides on a server where it can be shared by all the
members of your team.
But there has to be more. If the definition in the previous paragraph were the whole story, then an SCM
repository would be no more than a network file system. A repository is much more than that. A repository
contains history.
A file system is two-dimensional: its space is defined by directories and files. In contrast, a repository is three-
dimensional: it exists in a continuum defined by directories, files and time. An SCM repository contains every
version of your source code that has ever existed. The additional dimension creates some rather interesting
challenges in the architecture of a repository and the decisions about how it manages data.
As a first guess, let's not be terribly clever. We need to store every version of the source tree. Why not just
keep a complete copy of the entire tree for every change that has happened?
We obviously use Vault as the SCM tool for our own development of Vault. We began development of Vault in
the fall of 2001. In the summer of 2002, we started "dogfooding". On October 25th, 2002, we abandoned our
repository history and started a fresh repository for the core components of Vault. Since that day, this tree has
been modified 4,686 times.
This repository contains approximately 40 MB of source code. If we chose to store the entire tree for every
change, those 4,686 copies of the source tree would consume approximately 183 GB, without compression. At
today's prices for disk space, this option is worth considering.
However, this particular repository is just not very large. We have several others as well, but the sum total of
all the code we have ever written still doesn't qualify as "large". Many of our Vault customers have trees which
are a lot bigger.
As an example, consider the source tree for OpenOffice.org. This tree is approximately 634 MB. Based on
their claim of 270 developers and the fact that their repository is almost four years old, I'm going to
conservatively estimate that they have made perhaps 20,000 checkins. So, if we used the dumb approach of
storing a full copy of their tree for every change, we'd need around 12 TB of disk space. That's 12 terabytes
(http://dictionary.reference.com/search?q=terabytes) .
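Just to show my work, the arithmetic behind both estimates:

```python
GB = 1024           # megabytes per gigabyte
TB = 1024 * 1024    # megabytes per terabyte

sourcegear_mb = 4686 * 40      # checkins times tree size, in MB
openoffice_mb = 20000 * 634

print(round(sourcegear_mb / GB))       # 183 (GB)
print(round(openoffice_mb / TB, 1))    # 12.1 (TB)
```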
At this point, the argument that "disk space is cheap" starts to break down. The disk space for 12 TB of data is
cheaper than it has ever been in the history of the planet. But this is mission critical data. We have to consider
things like performance and backups and RAID and administration. The cost of storing 12 TB of ultra-
important data is more than just the cost of the actual disk platters.
So we actually do have an incentive to store this information a bit more efficiently. Fortunately, there is an
obvious reason why this is going to be easy to do. We observe that tree N is often not terribly different from
tree N-1. By definition, each version of the tree is derived from its predecessor. A checkin might be as simple
as a one-line fix to a single file. All of the other files are unchanged, so we don't really need to store another
copy of them.
So, we don't want to store the full contents of the tree for every single change. Instead, we want a way to store a
tree represented as a set of changes to another tree. We call this a "delta".
Delta direction
As we decide to store our repositories using deltas, we must be concerned about performance. Retrieving a
tree which is in a deltified representation requires more effort than retrieving one which is stored in full. For
example, let's suppose that version 1 of the tree is stored in full, but every subsequent revision is represented as
a delta from its predecessor. This means that in order to retrieve version 4,686, we must first retrieve version
1 and then apply 4,685 deltas. Obviously, this approach would mean that retrieving some versions will be
faster than others. When using this approach we say that we are using "forward deltas", because each delta
expresses the set of changes from one version to the next.
We observe that not all versions of the tree are equally likely to be retrieved. For example, version 83 of the
Vault tree is not special in any way. It is likely that we have not retrieved that version in over a year. I suspect
that we will never retrieve it again. However, we retrieve the latest version of the tree many times per day. In
fact, as a broad generalization, we can say that at any given moment, the most recent version of the tree is
probably the most likely one to be needed.
The simplistic use of forward deltas delivers its worst performance for the most common case. Not good.
Another idea is to use "reverse deltas". In this approach, we store the most recent tree in full. Every other tree
N is represented as a set of differences from tree N+1. This approach delivers its best performance for the
most common case, but it can still take an awfully long time to retrieve older trees.
Some SCM tools use some sort of a compromise design. In one approach, instead of storing just one full tree
and representing every other tree as a delta, we sprinkle a few more full trees along the way. For example,
suppose that we store a full tree for every 10th version. This approach uses more disk space, but the SCM
server never has to apply more than 9 deltas to retrieve any tree.
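To illustrate the reverse delta idea, here is a toy version store in Python. It works on lists of lines rather than real files, and it is my own sketch, not how Vault or any other tool actually stores trees:

```python
import difflib

def make_delta(src, dst):
    """Instructions that rebuild dst from src: (start, end, replacement)."""
    sm = difflib.SequenceMatcher(None, src, dst)
    return [(i1, i2, dst[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

def apply_delta(src, delta):
    out, pos = [], 0
    for start, end, repl in delta:
        out.extend(src[pos:start])
        out.extend(repl)
        pos = end
    out.extend(src[pos:])
    return out

class ReverseDeltaStore:
    """Newest version kept in full; older versions stored as reverse deltas."""
    def __init__(self, first_version):
        self.latest = list(first_version)
        self.deltas = []  # deltas[k] rebuilds version k+1 from version k+2

    def checkin(self, new_version):
        new_version = list(new_version)
        # reverse delta: rebuilds the *old* latest from the new latest
        self.deltas.append(make_delta(new_version, self.latest))
        self.latest = new_version

    def get(self, version):
        """Versions are 1-based; the newest is the cheapest to retrieve."""
        cur = self.latest
        for k in range(len(self.deltas) - 1, version - 2, -1):
            cur = apply_delta(cur, self.deltas[k])
        return cur
```

Note that `get` for the latest version applies zero deltas, while older versions cost one delta apiece; the "full tree every 10th version" compromise simply caps that walk.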
What is a delta?
I've been throwing around this concept of deltas, but I haven't stopped to describe them.
A tree is a hierarchy of folders and files. A delta is the difference between two trees. In theory, those two trees
do not need to be related. However, in practice, the only reason we calculate the difference between them is
because one of them is derived from the other. Some developer started with tree N and made one or more
changes, resulting in tree N+1.
We can think of the delta as a set of changes. In fact, many SCM tools use the term "changeset" for exactly this
purpose. A changeset is merely a list of the changes which express the difference between two trees.
For example, let's suppose that Wilbur starts with tree N and makes the following changes:
1. He deletes foo.c.
2. He modifies an existing file.
3. He modifies another existing file.
4. He renames hello.c.
5. He adds a new file called feature_creep.c.
6. He modifies a third existing file.
7. He moves readme.txt into a different folder.
At this point, he commits all of these changes to the repository as a single transaction. When the SCM server
stores this delta, it must remember all of these changes.
For changeset item 1 above, the delete of foo.c is easily represented. We simply remember that foo.c existed in
tree N but does not exist in tree N+1.
For changeset item 4, the rename of hello.c is a bit more complex. To handle renames, we need each object in
the repository to have an identifier which never changes, even when the name or location of the item changes.
For changeset item 7, the move of readme.txt is another example of why repositories need IDs for each item.
If we simply remember every item by its path, we cannot remember the occasions when that path changes.
Changeset item 5 is going to be a lot bulkier than some of the other items here. For this item we need to
remember that tree N+1 has a file called feature_creep.c which was never present in tree N. However, a full
representation of this changeset item needs to contain the entire contents of that file.
Changeset items 2, 3 and 6 represent situations where a file which already existed has been modified in some
way. We could handle these items the same way as item 5, by storing the entire contents of the new version of
the file. However, we will be happier if we can do deltas at the file level just as we are doing deltas at the tree
level.
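A tiny sketch of why item IDs matter. The paths and the rename target below are hypothetical; the point is that the changeset refers to IDs, which survive renames and moves:

```python
from dataclasses import dataclass

@dataclass
class Item:
    item_id: int   # permanent; never changes across renames or moves
    path: str      # current location; may change

# a fragment of tree N (paths are made up for illustration)
tree = {1: Item(1, "src/hello.c"), 2: Item(2, "readme.txt")}

# changeset entries name each item by ID, not by path
changeset = [
    ("rename", 1, "src/hello_new.c"),   # hypothetical new name
    ("move",   2, "docs/readme.txt"),
]
for op, item_id, new_path in changeset:
    tree[item_id].path = new_path

# the ID still identifies the same object, so its history stays connected
print(tree[1])   # Item(item_id=1, path='src/hello_new.c')
```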
File deltas
A file delta merely expresses the difference between two files. Once again, the reason we calculate a file delta
is because we believe it will be smaller than the file itself, usually because one of the files is derived from the
other.
For text files, a well-known approach to the file delta problem is to compare line-by-line and output a list of
lines which have been inserted, deleted or changed. This is the same kind of output produced by
the Unix 'diff' command. The bad news is that this approach only works for text files. The good news is that
software developers and web developers have a lot of text files.
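Python's difflib can produce exactly this kind of line-oriented output, in the same unified format as the Unix 'diff' command (the file contents here are invented for illustration):

```python
import difflib

old = ["int total = 0;\n", "return total;\n"]
new = ["int total = 0;\n", "total += 1;\n", "return total;\n"]

# prints a unified diff: a '+' line for the insertion,
# context lines prefixed with a space
print("".join(difflib.unified_diff(old, new,
                                   fromfile="a/foo.c", tofile="b/foo.c")))
```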
CVS and Perforce use this approach for repository storage. Text files are deltified using a line-oriented diff.
Binary files are not deltified at all, although Perforce does reduce the penalty somewhat by compressing them.
Subversion and Vault are examples of tools which use binary file deltas for repository storage. Vault uses a file
delta algorithm called VCDiff, as described in RFC 3284 (http://www.faqs.org/rfcs/rfc3284.html) . This
algorithm is byte-oriented, not line-oriented. It outputs a list of byte ranges which have been changed. This
means it can handle any kind of file, binary or text. As an ancillary benefit, the VCDiff algorithm compresses
the data at the same time.
Binary deltas are a critical feature for some SCM tool users, especially in situations where the binary files are
large. Consider the case where a user checks out a 10 MB file, changes a few bytes, and checks it back in. In
CVS, the size of the repository will increase by 10 MB. In Subversion and Vault, the repository will only grow
by a small amount.
Please note that I make a distinction between the terms "delta" and "diff".
A "delta" is the difference between two versions. If we have one full file and a delta, then we can
construct the other full file. A delta is used primarily because it is smaller than the full file, not because
it is useful for a human being to read. The purpose of a delta is efficiency. When deltas are done at the
level of bytes instead of textual lines, that efficiency becomes available to all kinds of files, not just text
files.
A "diff" is the human-readable difference between two versions of a text file. It is usually line-oriented,
but really cool visual diff tools can also highlight the specific characters on a line which differ. The
purpose of a diff is to show a developer exactly what has changed between two versions of a file. Diffs
are really useful for text files, because human beings tend to read text files. Most human beings don't
read binary files, and human-readable diffs of binary files are similarly uninteresting.
As mentioned above, some SCM tools use binary deltas for repository storage or to improve performance over
slow network lines. However, those tools also support textual diffs. Deltas and diffs serve two distinct
purposes, both of which are important. It is merely coincidence that some SCM tools use textual diffs as their
repository deltas.
At this point I should admit that I have presented a somewhat idealized view of the world. Not all SCM tools
work the way I have described. In fact, I have presented things exactly backwards, discussing tree-wide deltas
before file deltas. That is not the way the history of the world unfolded.
Prehistoric ancestors of modern programmers had to live with extremely primitive tools. Early version control
systems like RCS only handled file deltas. There was no way for the system to remember folder-level
operations like adding, renaming or deleting files.
Over time, the design of SCM tools matured. CVS is probably the most popular source control tool in the
world today. It was originally developed as a set of wrappers around RCS which essentially provided support
for some folder-level operations. Although CVS still has some important limitations, it was a big step forward.
Today, several modern source control systems are designed around the notion of tree-wide deltas. By
accurately remembering every possible operation which can happen to a repository, these tools provide a truly
complete history of a project.
We need to descend through one more layer of abstraction before we turn our attention back to more practical
matters. So far I have been talking about how things are stored and managed within a repository, but I have
not broached the subject of how the repository itself is stored.
A repository must store every version of every file. It must remember the hierarchy of files and folders for
every version of the tree. It must remember metadata, information about every file and folder. It must
remember checkin comments, explanations provided by the developer for each checkin. For large trees and
trees with very many revisions, this can be a lot of data that needs to be managed efficiently and reliably.
There are several different ways of approaching the problem.
RCS kept one archive file for every file being managed. If your file was called "foo.c" then the archive file was
called "foo.c,v". Usually these archive files were kept in a subdirectory of the working directory, just one level
down. RCS files were plain text; you could just look at them with any editor. Inside the file you would find a
bunch of metadata and a full copy of the latest version of the file, plus a series of line-oriented file deltas, one
for each previous version. (Please forgive me for speaking of RCS in the past tense. Despite all the fond
memories, that particular phase of my life is over.)
CVS uses a similar design, albeit with a lot more capabilities. A CVS repository is distinct, completely separate
from the working directory, but it still uses ",v" files just like RCS. The directory structure of a CVS repository
contains some additional metadata.
When managing larger and larger source trees, it becomes clear that the storage challenges of a repository are
exactly the same as the storage challenges of a database. For this reason, many SCM tools use an actual
database as the backend data store. Subversion uses Berkeley DB. Vault uses SQL Server 2000. The benefit
of this approach is enormous, especially for SCM tools which support atomic transactions. Microsoft has
invested lots of time and money to ensure that SQL Server is a safe place to store important information. Data
corruption simply doesn't happen. All of the ultra-tricky details of transactions are handled by the underlying
database.
Perforce uses somewhat of a hybrid approach, storing all of the metadata in a database but keeping all of the
actual file contents in RCS files. This approach trades some safety for speed. Since Perforce manages its own
archive files, it has to take responsibility for all the strange things that threaten to corrupt them. On the other
hand, writing a file is a bit faster than writing a blob into a SQL database. Perforce has the reputation of being
one of the fastest SCM tools.
Managing repositories
Our own repository here at SourceGear is well over a gigabyte in size (which is actually rather small, but then SourceGear has never
been a very big company). It contains thousands of files, thousands of checkins, and has been backed up
thousands of times.
Obviously you should do regular backups. That repository contains everything your fussy and expensive
programmers have ever created. Don't risk losing it.
Just for fun, take an hour this week and check your backup to see if it actually works. It's shocking how
many people are doing daily backups that cannot actually be restored when they are needed.
Put your repository on a reliable server. If your repository goes down, your entire team is blocked from
doing work. Disk drives like to fail, so use RAID. Power supplies like to fail, so get a server with
redundant power supplies. The electrical grid likes to fail, so get a good Uninterruptible Power Supply
(UPS).
Be conservative in the way your SCM server machine is managed. Don't put anything on that machine
that doesn't need to be there. Don't feel the need to install every single Service Pack on the day it gets
released. I've been shocked how many times one of our servers went south simply because we installed
a service pack or hotfix from Windows Update. Obviously I want our machines to be kept current with
the latest security fixes, but I've been burned too many times not to be cautious. Install those patches on
some other machine before you put them on critical servers.
Keep your SCM server inside a firewall. If you need to allow your developers to access the repository
from home, carefully poke a hole, but leave everything else as tight as you can. Make sure your
developers are using some sort of bulk encryption. Vault uses SSL. Tools like Perforce, CVS and
Subversion can be tunneled through ssh or something similar.
This brief list of tips is hardly a complete guide for administrators. I am merely trying to describe the level of
care and caution which should be used for your SCM repository.
Undo
As I have mentioned, one of the best things about source control is that it contains your entire history. Every
version of everything is stored. Nothing is ever deleted.
However, sometimes this benefit can be a real pain. What if I made a mistake and checked in something that
should not be checked in? My history contains something I would rather forget. I want to pretend that it never
happened. Isn't there some way to really delete from a repository?
In general, the recommended way to fix a problem is to checkin a new version which fixes it. Try not to worry
about the fact that your repository contains a full history of the error. Your mistakes are a part of your
past. Accept them and move on with your life.
However, most SCM tools do provide one or more ways of dealing with this situation. First, there is a
command I call "rollback". This command is essentially an "undo" for revisions of a file. For example, let's
say that a certain file is at version 7 and we want to go back to version 6. In Vault, we select version 6 and
choose the Rollback command.
To be fair, I should admit that the rollback command is not always destructive. In some SCM tools, the
rollback feature really does make version 7 disappear forever. Vault's rollback is non-destructive. It simply
creates a version 8 which is identical to version 6. The designers of Vault are fanatical purists, or at the very
least, one of them is.
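Vault-style non-destructive rollback is easy to sketch: nothing is deleted, and a new version is appended which happens to match an old one. The "contents" strings here are placeholders:

```python
history = [f"contents of version {n}" for n in range(1, 8)]  # versions 1..7

def rollback(history, to_version):
    """Append a copy of an old version as a brand new version."""
    history.append(history[to_version - 1])

rollback(history, 6)
# version 8 now equals version 6, and version 7 is still in the history
print(len(history), history[-1] == history[5])   # 8 True
```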
As a concession to those who are less fanatical, Vault does support a way to truly destroy things in a
repository. We call this feature "obliterate". I believe Subversion and Perforce use the same term. The
obliterate command is the only way to delete something and make it truly gone forever.
We agreed that we should discourage people from using this command without making it impossible. In
Vault, the obliterate command is available only in the Admin client,
not the regular client people use every day. In effect, we made the obliterate command available, but
inconvenient. People who really need to obliterate can find the command and get it done. Everyone else has to
think twice before they try to rewrite history and pretend something never happened.
Kimchi again?
Recently when I asked my fifth grade daughter what she had learned in school, she proudly informed me that
"everyone in Korea eats kimchi at every meal, every day". In the world of a ten-year-old, things are simpler.
Rules don't have exceptions. Generalizations always apply.
This is how we learn. We understand the basic rules first and see the finer points later. First we learn that
memory leaks are impossible in the CLR. Later, when our app consumes all available RAM, we learn more.
My habit as I write these chapters is to first present the basics in a "matter of fact" fashion, rarely
acknowledging that there are exceptions to my broad generalizations. I did this during the chapter on
checkins, failing to mention the "edit-merge-commit" until I had thoroughly explored "checkout-
edit-checkin".
In this chapter, I have written everything from the perspective of just one specific architecture. SCM tools like
Vault, Perforce, CVS and Subversion are based on the concept of a centralized server which hosts a single
repository. Each client has a working folder. All clients contact the same server.
I confess that not all SCM tools work this way. Tools like BitKeeper (http://www.bitkeeper.com/) and Arch
(http://www.gnu.org/software/gnu-arch/) are based on the concept of distributed repositories. Instead of one
repository, there can be several, or even many. Things can be retrieved or committed to any repository at any
time. The repositories are synchronized by migrating changesets from one repository to another. This results
in a merge situation which is not altogether different from merging branches.
From the perspective of this SCM geek, distributed repositories are an attractive concept. Admittedly, they are
advanced and complex, requiring a bit more of a learning curve on the part of the end user. But for the power
user, this paradigm for source control is very cool.
Having no experience in the implementation of these systems, I will not be explaining their behavior in any
detail. Suffice it to say that this approach is similar in some ways, but very different in others. This series of
articles will continue to focus on the more mainstream architecture for source control.
Looking ahead
In this chapter, I discussed the details of repositories. In the next chapter, I'll go back over to the client side
and dive into the details of working folders.
Chapter 5: Working Folders
CVS calls it a sandbox. Subversion calls it a working directory. Vault calls it a working folder. By any of these
names, a working folder is a directory hierarchy on the developer's client machine. It contains a copy of the
contents of a repository folder. The very basic workflow of using source control involves three steps:
1. Update the working folder so that it exactly matches the latest contents of the repository.
2. Make some changes to the working folder.
3. Checkin (or commit) those changes to the repository.
The repository is the official archive of our work. We treat our repository with great respect. We are extremely
careful about what gets checked in. We buy backup disks and RAID arrays and air conditioners and whatever
it takes to make sure our precious repository is always comfortable and happy.
Normally, our working folder is worthless, since it contains nothing that is not already stored safely in the
repository. But if our code changes turn out to be useful, things change in a very big way. Our working folder suddenly
has value. In fact, it is quite precious. The only copy of our most recent efforts is sitting on a crappy,
laptop-grade hard disk which gets physically moved four times a day and never gets backed up. The stress of
this situation is almost intolerable. We want to get those changes checked in to the repository as quickly as
possible.
Once we do, we breathe a sigh of relief. Our working folder has once again become worthless, as it should be.
Once again I need to spend some time explaining grungy details of how SCM tools work. I don't want to repeat
the analogy I used in the last chapter, so suffice it to say that the same caveat applies here: knowing a bit
about the internals will make you a better user. When your SCM tool retrieves a file into your working folder,
it typically records some extra information as well:
Your SCM tool may record the timestamp on the working file, so that it can later detect if you have
modified it.
It may record the version number of the repository file that was retrieved, so that it may later know the
starting point from which you began to make your changes.
It may even tuck away a complete copy of the file that was retrieved, so that it can show you a diff
without accessing the server.
I call this information "hidden state information". Its exact location depends on which SCM tool you are
using. Subversion hides it in invisible subdirectories in your working directory. Vault can work similarly, but
by default it stores hidden state information in the current user's "Application Data" directory.
Because of the changes happening on both the client and the server, a working file can be in one of several
possible states. SCM tools typically have some way of displaying the state of each file to the user. Vault shows
file states in the main window. CVS shows them in response to the 'cvs status' command.
The table below shows the possible states for a working file. The column on the left shows my particular name
for each of these states, which through no coincidence is the name that Vault uses. The column on the far right
shows the name shown by the 'cvs status' command. However, the terminology doesn't really matter. One
way or another, your SCM tool is probably keeping track of all these things and can tell you the state of any file
in your working folder hierarchy.
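As a sketch, here is how a client might combine its hidden state information into a file status. The state names are the ones Vault uses; the decision logic is my own simplification:

```python
def working_file_status(local_mtime, recorded_mtime, baseline, repo_head):
    edited = local_mtime != recorded_mtime   # the working file was touched
    old = baseline < repo_head               # the repository moved on
    if edited and old:
        return "Needs Merge"
    if edited:
        return "Edited"
    if old:
        return "Old"
    return "Up to date"

print(working_file_status(local_mtime=2, recorded_mtime=1,
                          baseline=4, repo_head=5))   # Needs Merge
```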
Refresh
In order to keep all this file status information current, the SCM client must have ways of staying up to date
with everything that is happening. Whenever something changes in the working folders or in the repository,
the SCM client wants to know.
Changes in the working folders on the client side are relatively easy. The SCM client can quickly scan files in
the working folders to determine what has changed. On some operating systems, the client can register to be
notified of changes to any file.
Notification of changes on the server can be a bit trickier. The Vault client periodically queries the server to
ask for the latest version of the repository tree structure. Most of the time, the server will simply respond that
"nothing has changed". However, when something has in fact changed, the client receives a list of things
which have changed since the last time that client asked for the tree structure.
For example, let's assume Laura retrieves the tree structure and is informed that foo.cpp is at version 7.
Later, Wilbur checks in a change to foo.cpp and creates version 8. The next time Laura's Vault client
performs a refresh, it will ask the server if there is anything new. The server will send down a list, informing
her client that foo.cpp is now at version 8. The actual bits for foo.cpp will not be sent until Laura specifically
asks for them. For now, we just want the client to have enough information so that it can inform Laura that
her copy of foo.cpp is now "Old".
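The refresh conversation described above can be sketched in a few lines. This is a toy model with invented names and structures, not Vault's actual wire protocol:

```python
# Toy sketch of the refresh handshake described above. The names and
# data structures are invented for illustration; a real SCM client
# tracks much more than a {path: version} mapping.

def refresh(client_tree, server_tree):
    """Return a list of (path, new_version) pairs the client is missing."""
    changes = []
    for path, server_version in server_tree.items():
        client_version = client_tree.get(path)
        if client_version != server_version:
            # Only the *fact* of the change travels now; the actual
            # bits are fetched later, when the user asks for them.
            changes.append((path, server_version))
            client_tree[path] = server_version
    return changes

# Laura's client believes foo.cpp is at version 7 ...
laura = {"foo.cpp": 7, "bar.cpp": 3}
# ... but Wilbur's checkin has moved it to version 8 on the server.
server = {"foo.cpp": 8, "bar.cpp": 3}

print(refresh(laura, server))  # [('foo.cpp', 8)] -> foo.cpp is now "Old"
```

On the next refresh nothing has changed, so the server's answer is effectively "nothing has changed": an empty list.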
OK, let's get back to practical matters. In terms of actual usage, most interaction with your SCM tool
happens in and around your working folder. There are a handful of basic operations I can perform on a
working folder, and in the following sections, I will cover each of them in a bit more detail.
In an idealized world, it would be really nice if the SCM tool didn't have to be involved at all. The developer
would simply work, making all kinds of changes to the working folder while the SCM tool eavesdrops, keeping
an accurate list of every change that has been made.
Unfortunately, this perfect world isn't quite available. Most operations on a working folder cannot be
automatically detected by the SCM client. They must be explicitly indicated by the user. Examples:
It would be unwise for the SCM client to notice that a file is "Missing" and automatically assume it
should be deleted from the repository.
Automatically inferring an "Add" operation is similarly unsafe. We don't want our SCM tool
automatically adding any file which happens to show up in our working folder.
Rename and move operations also cannot be reliably divined by mere observation of the result. If I
rename foo.cpp to bar.cpp, how can my SCM client know what really happened? As far as it can tell, I
might have deleted foo.cpp and added bar.cpp as a new file.
All of these so-called "folder-level" operations require the user to explicitly give a command to the SCM tool.
The resulting operation is added to the pending change set, which is the list of all changes that are waiting to
be committed to the repository.
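A pending change set is, at its heart, just a list of recorded operations. The following sketch (all names invented) shows why folder-level operations have to be recorded explicitly: looking only at the file system, a delete next to an add is indistinguishable from a rename:

```python
# Minimal sketch of a pending change set: a list of operations that
# are waiting to be committed. Names are invented for illustration.

pending = []

def record(op, path, extra=None):
    """Explicitly record a folder-level operation in the pending change set."""
    pending.append({"op": op, "path": path, "extra": extra})

# The user must *tell* the tool what happened. If the tool only saw
# that foo.cpp vanished and bar.cpp appeared, it could not know
# whether this was a delete plus an add, or a rename.
record("delete", "foo.cpp")
record("add", "bar.cpp")
record("rename", "old.h", extra="new.h")

for change in pending:
    print(change["op"], change["path"])
```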
However, it just so happens that in the most common case, our "eavesdropping" ideal is available. Developers
who use the edit-merge-commit model typically do not issue any explicit command telling the SCM tool of
their intention to edit a file. The files in their working folder are left in a writable state, so they simply open
their text editor or their IDE and begin making changes. At the appropriate time, the SCM tool will notice the
change and add that file to the pending change set.
Users who prefer "checkout-edit-checkin" actually have a somewhat more consistent rule for their work. The
SCM tool must be explicitly informed of all changes to the working folder. All files in their working folder are
usually marked read-only. The SCM tool's Checkout command not only informs the server of the checkout
request, but it also flips the bit on the working file to make it writable.
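Flipping that bit is ordinary file-system work. This sketch (hypothetical helper name, standard library only) does it the way a checkout-edit-checkin client would, leaving out the part where the server is informed:

```python
# Sketch of the checkout-edit-checkin convention: working files are
# kept read-only until the tool "flips the bit" at checkout time.
import os
import stat
import tempfile

def set_writable(path, writable):
    """Flip the read-only bit on a working file."""
    mode = stat.S_IREAD | (stat.S_IWRITE if writable else 0)
    os.chmod(path, mode)

path = os.path.join(tempfile.mkdtemp(), "foo.cpp")
with open(path, "w") as f:
    f.write("int main() { return 0; }\n")

set_writable(path, False)   # after Get Latest, the file is read-only
set_writable(path, True)    # Checkout makes it writable again

print(bool(os.stat(path).st_mode & stat.S_IWRITE))  # True
```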
Review changes
One of the most important features provided by a working folder is the ability to review all of the changes I
have made. For SCM tools that do keep track of a pending change set (Vault, Perforce, Subversion), this is the
place to start. The following screen dump shows the pending change set pane from the Vault client, which is
showing me that I have currently made two changes in my working folder:
(screendumps/scm_pending_5.gif)
The pending change set view shows all kinds of changes, including adds, deletes, renames, moves, and
modified files. It is helpful to keep an eye on the pending change set as I work, verifying that I have not
forgotten anything.
However, for the case of a modified file, this visual display only shows me which files have changed. To really
review my changes, I need to actually look inside the modified files. For this, I invoke a diff tool. The following
screen dump is from a popular Windows diff tool called Beyond Compare
(http://www.scootersoftware.com/) :
This picture is fairly typical of the visual diff tool genre, showing both files side-by-side and highlighting the
parts that are different. There are quite a few tools like this. The following screen dump is from the visual diff
tool which is provided with Vault:
Both of these tools do a nice job on the modification to line 33, showing exactly which part of the line was
changed. Most of the recent visual diff tools support this ability to highlight intraline differences.
Visual diff tools are indispensable. They give me a way to quickly review exactly what has changed. I strongly
recommend you make a habit of reviewing all of your changes just before you checkin. You can catch a lot of
silly mistakes by taking the time to be sure that your changes look the way you think they look.
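Even without a visual diff tool, this review step can be approximated with a few lines of script. Here is a sketch using Python's standard difflib module; the file contents are invented:

```python
import difflib

old = ["#include <stdio.h>\n", "int main(void)\n", "{\n",
       '    printf("hello\\n");\n', "    return 0;\n", "}\n"]
new = ["#include <stdio.h>\n", "int main(void)\n", "{\n",
       '    printf("hello, world\\n");\n', "    return 0;\n", "}\n"]

# A unified diff shows exactly which lines changed, much like a
# visual diff tool does, minus the side-by-side intraline highlighting.
for line in difflib.unified_diff(old, new, "foo.c (baseline)", "foo.c (working)"):
    print(line, end="")
```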
Undo changes
Sometimes I make changes which I simply don't intend to keep. Perhaps I tried to fix a bug and discovered
that my fix introduced five new bugs that are worse than the one I started with. Or perhaps I just changed my
mind. In any case, a very nice feature of a working folder is the ability to undo.
In the case of a folder-level operation, perhaps the Undo command should actually be called "Nevermind".
After all, the operation is pending. It hasn't happened yet. I'm not really asking to undo something which
has already happened. Rather, I am just saying that I no longer want to do something that I previously said
I wanted to do.
For example, if I tell the Vault client to delete a file, the file isn't really deleted until I commit that change to
the repository. In the meantime, it is merely waiting around in my pending change set. If I then tell the Vault
client to Undo this operation, the only thing that actually has to happen is to remove it from my pending
change set.
For users of the checkout-edit-checkin style of development, a closely related need is undoing a checkout.
This is essentially the same as undoing the changes in a file, but involves the extra step of informing the
server that I no longer want the file to be checked out.
Source control tools have been a daily part of my life for well over a decade. I can't imagine doing software
development without them. In fact, I have developed habits that occasionally threaten my mental health.
Things would be so much easier if the concept of a working folder were available in other areas of life:
"Hmmm. I can't remember which of these pool chemicals I have already done. Luckily, I can just diff
against the version of the pool water from an hour ago and see exactly what changes I have made."
"Boy am I glad I remembered to set the read-only bit on my front lawn to remind me that I'm not
supposed to cut the grass until a week after the fertilizer was applied."
"No worries -- if I accidentally put too much pepper on this chicken, I can just revert to the latest version
in the repository."
Unfortunately, SCM tools are unique. When I make a mistake in my woodshop, I can't undo it. Only in
software development do I have the luxury of a working folder. It's a place where I can work without
constantly worrying about making a mistake. It's a place where I can work without having to be too careful.
It's a place where I can experiment with ideas that may not work out. I wish I had working folders everywhere.
I don't like to let my working folder get too far behind the current state of the repository. SCM tools typically
allow the user to invoke a diff tool to compare two repository versions of a file. When I am working on a
feature, I periodically like to review the recent changes in the repository. Unless those changes look likely to
disrupt my own work, I usually proceed to retrieve the latest versions of things so that my working folder stays
up to date.
In CVS, the command to update a working folder is [rather conveniently] called 'update'. In Vault, this
operation is done with the Get Latest Version command. The screen dump below is the corresponding dialog
box:
Note that this dialog box gives me a few choices for how I may want to handle situations where a change has
happened on both the client and the server. Let us suppose for a moment that I am not using exclusive
checkouts and that somebody else has also modified sgdmgui_props.cpp. In this case, I have three choices
available when I want to update my working folder:
Overwrite my working file. The effect here is similar to an Undo. My changes will be lost. Use with
care.
Attempt automatic merge. The Vault client will attempt to construct a file which contains my
changes and the changes which were made on the server. If the automerge succeeds, my working file
will end up in the "Edited" status. If the automerge fails, the status of my working file will be "Needs
Merge", and the Vault client will nag and pester me until I resolve the situation.
Do not overwrite/Merge later. This option leaves my working file untouched. However, the status
of the file will change to "Needs Merge". Vault will not allow me to checkin my changes until I affirm
that I have done the right thing and merged in the changes from the repository.
Note also that the "Prompt for modified files" checkbox lets me tell the Vault client to prompt me to
choose between these options for every file that ends up in this situation.
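The decision behind these three options can be modeled as a tiny function. The state names follow Vault's terminology from earlier in the chapter; the logic is my own simplification:

```python
def update_file(local_modified, remote_changed, option):
    """Simplified model of the Get Latest Version choices described above.

    option is one of "overwrite", "automerge", or "merge_later".
    """
    if not remote_changed:
        # nothing new on the server; my local status is unchanged
        return "Edited" if local_modified else "Unchanged"
    if not local_modified:
        return "Unchanged"      # safe to just fetch the new version
    if option == "overwrite":
        return "Unchanged"      # my changes are lost -- use with care
    if option == "automerge":
        # assume the automatic merge succeeded; on failure the file
        # would land in "Needs Merge" instead
        return "Edited"
    return "Needs Merge"        # merge later: working file left untouched

print(update_file(True, True, "merge_later"))   # Needs Merge
```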
As you can see, the Get Latest Version dialog box includes a few other options which I won't describe in detail
here. Other SCM tools have similar abilities, although the user interface may be very different. In any case,
it's a good idea to update your working folder as often as you can.
Commit changes
In most situations, I eventually decide that my changes are Good and should be sent back to the repository so
they can become a permanent part of the history of my project. In Vault, Subversion and CVS, the command
is called Commit. The following screen dump shows the Commit dialog box from Vault:
Note that the listbox at the top contains all of the items in my pending change set. In this particular example, I
only have two changes, but this listbox typically has a scrollbar and contains lots of items. I can review all of
the operations and choose exactly which ones I want to commit to the repository. It is possible that I may want
to checkin only some of my currently pending changes. (Perforce has a nifty solution to this problem. The
user can have multiple pending change sets, so that changes can be logically grouped together even as they are
waiting to be checked in.)
The "Change Set Comment" textbox offers a place for me to type an explanation of what I changed and why I
did it. Please note that this textbox has a scrollbar, encouraging you to type as much text as necessary to give a
full explanation of the problem. In my opinion, checkin comments are more important than the comments in
the actual code.
When I click OK, all of the selected items will be sent to the server to be committed to the repository. Since
Vault supports atomic checkin transactions, I know that my changes will succeed or fail as a unit. It
is not possible for the repository to end up in a state where only some of these changes made it.
Remember the discussion in chapter 4 about binary file deltas? This same technology is also used for checkin
operations. When Vault sends a modified version of a file up to the server, it actually sends only the bytes
which have changed, using the same VCDiff format which is used to make repository storage more efficient.
This is possible because the client has kept a copy of the baseline file in its hidden state information. The
Vault client simply runs the VCDiff algorithm to compute the difference between this baseline file and the
current working file. So in the case of my running example, the Vault client will send three pieces of
information:
The binary delta. Since the pending change set pane shows that my working file is 40 bytes larger than
the baseline where I started, the binary delta is going to be somewhere in the vicinity of 40 bytes long,
perhaps with a few extra bytes for overhead.
The fact that this binary delta was computed against version 21 of the file. Since version 21 is known
and exists on both the client and the server, the SCM server can simply apply the binary delta to its own
copy of version 21 to reconstruct an exact copy of the contents of my working file.
The CRC checksum of the original working file. When the server reconstructs its copy of the working
file, the CRC will be compared to ensure that nothing was corrupted during transit. The file that is
stored in the repository will be exactly the same as the working file. No corruption, no surprises.
Whenever possible, Vault uses binary file deltas "over the wire" in both directions, from client to server as well
as from server to client. In this example, the entire file is only 3,762 bytes, so the savings in network
bandwidth isn't all that significant. However, for larger files, the increase in network performance for offsite
users can be quite dramatic.
This capability of using binary file deltas between client and server is supported by some other SCM tools as
well, including (I believe) Subversion and Perforce.
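The delta-plus-checksum idea can be illustrated with a toy sketch. Real VCDiff handles arbitrary edits; here I cheat by making the change a pure append, so the "delta" is just the new tail, and I use a CRC from Python's zlib in place of the real integrity check:

```python
import zlib

baseline = b"int x = 1;\n" * 100          # the hidden baseline copy (version 21)
working = baseline + b"int y = 2;\n"      # my edited working file

# Stand-in for a real VCDiff delta: since this change is a pure
# append, the delta is just the new tail of the file.
delta = working[len(baseline):]
crc = zlib.crc32(working)

# Server side: apply the delta to its own copy of the baseline ...
reconstructed = baseline + delta
# ... then compare checksums to ensure nothing was corrupted in transit.
assert zlib.crc32(reconstructed) == crc

print(len(delta), "bytes sent instead of", len(working))
```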
When the checkin has completed successfully, if I am working in "checkout-edit-checkin" mode, the SCM tool
will flip the read-only bit on my working files to prevent me from accidentally making changes without
informing the server of my intentions.
Having completed my checkin, the cycle is complete. My working folder is once again worthless, since my
changes are now a permanent part of the repository. I am ready to begin my next development task.
Looking ahead
In the next chapter, it's time to start talking about some of the more advanced stuff. I'll start with an overview
of labels and history.
Chapter 6: History
This is part of an online book called Source Control HOWTO (source_control.html) , a best practices guide
on source control, version control, and configuration management.
You may now be tired of hearing me say it, but I will say it again: Your repository contains every version of
everything which has ever been checked in to the repository. This is a Good Thing. We sleep better at night
because we know that our efforts are always additive, never subtractive. Nothing is ever lost. As the team
regularly checks in more stuff, the complete historical record is preserved, just in case we ever need it.
But this feature is also a Bad Thing. It turns out that keeping absolutely everything isn't all that useful if you
can't find anything later.
My woodshop is a painfully vivid illustration of this problem. I have a habit of never throwing anything away.
When I build a piece of furniture, I save every scrap of wood, telling myself that I might need it someday. I
keep every screw, nail, bolt or nut, just in case I ever need it. But I don't organize these things very well. So
when the time comes that I need something, I usually can't find it. I'm not necessarily proud of this
confession, but my workshop stands as an expression of who I am. Those who love me sometimes find my
habits to be endearing.
But there is nothing endearing about a development team that can't find something when they need it. A good
SCM tool must do more than just keep every version of everything. It must also provide ways of searching and
viewing and sorting and organizing and finding all that stuff.
In the rest of this chapter, I will discuss several mechanisms that SCM tools provide to help make the historical
data more useful.
Labels
Perhaps the most important feature for dealing with old versions is the notion of a "label". In CVS, this feature
is called a "tag". By either name, the concept is the same -- labels offer the ability to associate a name with a
specific version of something in the repository. A label assigns a meaningful symbolic name to a snapshot of
your code so you can later find that snapshot more easily.
This is not altogether different from the descriptive and memorable names we use for variables and constants
in our code. Which of the following two lines of code is easier to understand?
if (errorcode == ERR_FILE_NOT_FOUND)
if (e == -43)
Similarly, which of the following is a more intuitive description of a specific version of your code?
LAST_VERSION_BEFORE_COREY_FOULED_EVERYTHING_UP
378
Applying a label typically involves four pieces of information:
1. The string for the name of the label. This should be something descriptive that you can either
remember or recognize later. Don't be afraid to put enough information in the name of the label. Note
that CVS has strict rules for the syntax of a tag name (must start with a letter, no spaces, almost no
punctuation allowed). I still follow that tradition even though Vault is more liberal.
2. The folder to which the label will be applied. (You can apply a label or tag to a single file if you want, but
why? Like most source control operations, labels are most useful when applied recursively to a whole
folder.)
3. Which versions of everything should be included in the snapshot. Often this is implicitly understood to
be the latest version, but your SCM tool will almost certainly allow you to label something in the past. If
it won't, take it out back and shoot it.
4. A comment explaining the label. This is optional, and not all SCM tools support it (CVS doesn't), but a
comment can be handy when you want to explain more than might be appropriate to say in the name of
the label. This is particularly handy if your team has strict rules for the syntax of label names
(V1.3.2.1426.prod) which prevent you from putting in other information you need.
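Conceptually, a label is nothing more than a symbolic name attached to a frozen mapping from paths to version numbers. A minimal sketch, with invented data:

```python
# A label is conceptually a name attached to a frozen snapshot:
# a mapping from each path to the version included in the snapshot.
labels = {}

def apply_label(name, tree_versions, comment=""):
    """Freeze the given {path: version} mapping under a symbolic name."""
    labels[name] = {"snapshot": dict(tree_versions), "comment": comment}

tree = {"foo.cpp": 6, "bar.cpp": 14, "hello.h": 3}
apply_label("V4.0", tree, comment="4.0 release candidate")

# Later edits move the tree forward, but the label still resolves
# to the old versions -- that is the whole point.
tree["foo.cpp"] = 7
print(labels["V4.0"]["snapshot"]["foo.cpp"])  # 6
```

This also hints at why labels are cheap: applying one copies a small table of version numbers, not the contents of any file.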
For example, in the following screen dump from Vault, I am labeling version 155 of the folder $/src/sgd
/libsgdcore:
It is worth clarifying here that labels play a slightly different role in some SCM tools. In Subversion or Vault,
folders have version numbers. Using the example from my screen dump above, the folder $/src/sgd
/libsgdcore is at version 155. Each of the various files inside that folder has its own version number, but every
time one of those files changes, the version number of the folder is increased by one as well. So the version
number of a folder is a little bit like a label because it maps to a specific snapshot of the contents of the folder.
However, CVS doesn't work this way. There is no folder version number which can be mapped to a specific
snapshot of the contents of that folder. For this reason, tags are all the more important in CVS, since there is
no other way to easily mark specific versions of multiple items as a snapshot.
Labels are cheap. They don't consume a lot of resources. Your SCM tool won't slow down if you use lots of
them. Having more labels does not increase your responsibilities. So you can use them as often as you like.
The following situations are examples of when you might want to use a label:
A release is the most obvious time to apply a label. When you release a version of your application to
customers, it can be very important to later know exactly which version of the code was released.
Sometimes it is necessary to make a change which is widespread or fundamental. Before destabilizing your
code, you may want to apply a label so you can easily find the version just before things started getting
messed up.
Some automated build systems apply a label every time a build is done. The usual approach is to first apply
the label and then do a "get by label" operation to retrieve the code to be used for the build. Using one of
these tools can result in an awful lot of labels, but I still like the idea. It eliminates the guesswork of trying
to figure out exactly which code was in the build.
It's also very handy to diff against a label. For example, in the following screendump from Vault, I am asking
to see all the differences between the contents of my working folder and the contents of the label named "Build
3.0.0.2752". (This label was applied by our automated build system when it made build 2752.)
Sometimes after you apply a label you realize that you want to make a small change. As an example, consider
the following scenario: One week ago, you finalized the code for the 4.0 release of your product. You applied a
label to the tree, and your team has proceeded with development on a few post-4.0 tasks.
But now Bob (one of your QA guys) comes crawling into your office. His clothes are torn and his face is
covered with soot. While gasping for air he informs you that he has found a potential showstopper bug in the
4.0 release candidate. Apparently if you are running your app on the Elbonian version of Windows NT 3.5
with the time zone set to Pacific Standard Time and you enter a page margin size of 57 inches while printing a
42 page document on a Sunday morning before 9am, the whole machine locks up. In fact, if you don't quickly
kill the app, the computer will soon burst into flame.
As Bob finishes explaining the situation, a developer walks in and announces that he has already found the fix
for this bug, and it affects only one line of code in FOO.CPP. Should he make the fix and generate a new
release candidate?
After scolding Bob for not being more diligent in finding this bug sooner, you begrudgingly decide that the
severity of this bug does indeed make it a showstopper for the 4.0 release. But how to proceed? The label for
the 4.0 build has already been applied. You want a new release candidate which contains exactly the contents
of the 4.0 label plus this one-line change. None of the other stuff which has been checkin in during the past
week should be included.
I'm sure it was this very situation which prompted Microsoft to implement a feature in SourceSafe 6.0 called
"label promotion". The idea is that a minor change to a label can be made after it was originally created.
Returning to our example, let's suppose that the 4.0 label contained version 6 of FOO.CPP. So now we would
make the one-line change and check it in, resulting in version 7 of that file. Then we "promote" version 7 of the
file to be included in the 4.0 label, instead of version 6.
Even though I dislike this feature for philosophical reasons, customers really want it. Here at
SourceGear, I tell people that "the customer is not always right, but the customer is always the customer". So
in order to remain true to our goal of making Vault a painless transition from SourceSafe, we implemented
label promotion. But that doesn't mean I have to be happy about it.
History
Another important feature is the ability to view and browse historical versions of the repository. In its simplest
form, this can be just a list of changes with some basic information about each one.
But without a way of filtering and sorting this information, using history is like trying to take a drink from a
fire hose. Fortunately, most SCM tools provide plenty of flexibility in helping you see the data you need.
In CVS, history is obtained using the 'cvs log' command. In the Vault GUI client, we use the History Explorer.
In either case, the first way to filter history is to decide where to invoke the command. Requesting the full
history from the root folder of a repository is like the aforementioned fire hose. Instead, invoke the command
on a subfolder or even on a file. In this way, you will only see the changes which have been made to the item
you selected.
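Filtering history by subtree, and optionally by date, amounts to a simple query over the flat list of changes. The records here are invented:

```python
from datetime import date

# A toy history log: flat records, one per change (invented data).
history = [
    {"path": "src/sgd/libsgdcore/foo.cpp", "version": 8,
     "user": "wilbur", "date": date(2004, 10, 5)},
    {"path": "src/sgd/libsgdcore/bar.cpp", "version": 3,
     "user": "laura", "date": date(2004, 9, 28)},
    {"path": "src/web/index.html", "version": 12,
     "user": "jeff", "date": date(2004, 10, 12)},
]

def history_for(folder, since=None):
    """Restrict history to one subtree, optionally to recent changes."""
    return [h for h in history
            if h["path"].startswith(folder)
            and (since is None or h["date"] >= since)]

# Like invoking 'cvs log' on a subfolder instead of the root:
for h in history_for("src/sgd/", since=date(2004, 10, 1)):
    print(h["path"], h["version"], h["user"])
```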
Most SCM tools provide other ways of filtering history information as well:
The following screendump from Vault shows all the changes I made to one of the Vault libraries during
October 2004:
(screendumps/scm_hist_1.png)
For tools like Subversion and Vault which support atomic transactions and changesets, history can be slightly
different. Because changesets are a grouping of individual changes, history is no longer just a flat list of
individual changes, but rather can now be viewed as a hierarchy which is two levels deep.
To ease the transition for SourceSafe users, Vault allows history to be viewed either way. You can ask Vault's
History Explorer to display individual changes. Or, you can ask to see a list of changesets, each of which can
be expanded to see the individual changes contained inside it. Personally, I prefer the changeset-oriented
view. I like the mindset of thinking about the history of my repository in terms of groups of related changes.
Blame
Vault has a feature which can produce an HTML view of a file with each line annotated with information about
the last person who changed that line. We call this feature "Blame". For example, the following screen dump
shows the Blame output for the source code to the Vault command line client:
This poor function has had all kinds of people stomping through it. I was the last person to change line 828,
which I apparently did in revision 106 of the file. However, line 829 was last modified by Jeff, and line 830
belongs to Dan.
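A naive version of blame can be computed by walking the versions of a file in order and recording who last touched each line. This sketch uses Python's difflib and invented data; real blame implementations are considerably more careful:

```python
import difflib

# Each entry is (user who checked in this version, file contents).
versions = [
    ("eric", ["a\n", "b\n", "c\n"]),
    ("jeff", ["a\n", "b2\n", "c\n"]),
    ("dan",  ["a\n", "b2\n", "c\n", "d\n"]),
]

def blame(versions):
    """Return (last_author, line) for each line of the final version."""
    authors, lines = [], []
    for user, text in versions:
        sm = difflib.SequenceMatcher(a=lines, b=text)
        new_authors = []
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == "equal":
                new_authors.extend(authors[i1:i2])  # unchanged: keep old author
            else:
                new_authors.extend([user] * (j2 - j1))  # new/changed lines
        authors, lines = new_authors, text
    return list(zip(authors, lines))

for user, line in blame(versions):
    print(user, line, end="")
```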
Note that we here at SourceGear take absolutely no credit or blame for the name of this command. We
took our inspiration for this feature from the blame feature found in the CVS world, popularized by the
Bonsai (http://www.mozilla.org/projects/bonsai/) tool from the Mozilla project. The following screen
dump shows this CVS Blame feature in action using the Bonsai installation on www.abisource.com
(http://www.abisource.com) . I was delighted to discover that the AbiWord layout engine actually
still contains some of my code:
Even though this Best Practice box is more about team management than source control, I don't feel like
I'm straying too far off topic to offer the following tidbit: Tim Krauskopf, an early mentor of mine, said
many wise things to me, including the following piece of management advice which I have never forgotten:
"Spend more time on credit than on blame, and don't spend very much time on either one."
Whether you like the name or not, the Blame feature can be awfully handy sometimes.
Looking ahead
In the next chapter, I will talk about branching and merging.
Chapter 7: Branches
What is a branch?
A branch is what happens when your development team needs to work on two distinct copies of a project at the
same time. This is best explained by citing a common example:
Suppose your development team has just finished and released version 1.0 of UltraHello, your new flagship
product, developed with the hope of capturing a share of the rapidly growing market for "Hello World"
applications.
But now that 1.0 is out the door, you have a new problem you have never faced before. For the last two years,
everybody on your team has been 100% focused on this release. Everybody has been working in the same tree
of source code. You have had only one "line of development", but now you have two:
Development of 2.0. You have all kinds of new features which just didn't make it into 1.0, including
"multilingual Hello", DirectX support for animated Hellos, and of course, the ability to read email
(http://www.catb.org/~esr/jargon/html/Z/Zawinskis-Law.html) .
Maintenance of 1.0. Now that real customers are using UltraHello, they will probably find at least
one bug your testing didn't catch. For bug fixes or other minor improvements requested by customers,
it is quite possible that you will need to release a version 1.0.1.
It is important for these two lines of development to remain distinct. If you release a version 1.0.1, you don't
want it to contain a half-completed implementation of a 2.0 feature. So what you need here is two distinct
source trees so your team can work on both lines of development without interfering with each other.
The most obvious way to solve this problem would simply be to make a copy of your entire source control
repository. Then you can use one repository for 1.0 maintenance and the other repository for 2.0
development. I know people who do it this way, but it's definitely not a perfect solution.
The two-repository approach becomes disappointing in situations where you want to apply a change to both
trees. For example, every time we fix a bug in the 1.0 maintenance tree, we probably also want to apply that
same bug fix to the 2.0 development tree. Do we really want to have to do this manually? If the bug fix is a
simple change, like fixing the incorrect spelling of the word "Hello", then it won't take a programmer very long
to make the change twice. But some bug fixes are more involved, requiring changes to multiple files. It would
be nice if our source control tool would help. A primary goal for any source control tool should be to help
software teams be more concurrent: everybody busy, all at the same time, without getting in each other's way.
To address this very type of problem, source control tools support a feature which is usually called
"branching". This terminology arises from the tendency of computer scientists to use the language of a
physical tree every time hierarchy is involved. In this particular situation, the metaphor breaks down very
quickly, but we keep the name anyhow.
A somewhat better metaphor is a nature path which forks into two directions. Before
the fork, there was one path. Now there are two, but they share a common history. When you use the
branching feature of your source control tool, it creates a fork in the path of your development progress. You
now have two trees, but the source control tool has not forgotten the fact that these two trees used to be one. For
this reason, the SCM tool can help make it easier to take code changes from one fork and apply those changes
to the other. We call this operation "merging branches", a term which highlights why the physical tree
metaphor fails. The two forks of a nature path can merge back into one, but two branches of an oak tree just
don't do that. I'll talk a lot more about merging branches in the next chapter.
At this point I should take a step back and admit that my example of doing 1.0 maintenance and 2.0 features is
very simplistic. Real life examples are sometimes far more complicated, involving multiple branches, active
development in each branch, and the need to easily migrate changes between any two of them. Branching and
merging is perhaps the most complex operation offered by a source control tool, and there is much to say
about it. I'll begin with some "cars and clocks (scm_repositories.html) " stuff and talk about how branching
works "under the hood".
SCM tools present branches in one of two ways. In the first model, a branch is a sort of parallel universe.
In order to retrieve a file, you specify not just a path but the name of the universe, er, branch, from which you
want the file retrieved. If you don't specify a branch, then the file will be retrieved from the "default branch".
This is the approach used by CVS and PVCS.
In the other branching model, a branch is just another folder, located in the same repository hierarchy as
everything else. When you create a branch of a folder, it shows up as another folder. With this approach, a
repository path is sufficient to describe a location.
Personally, I prefer the "folder" style of branching over the "parallel universe" style of branching, so my
writing will generally come from this perspective. This is the approach used by most modern source control
tools, including Vault, Subversion (they call it "copy (http://subversion.tigris.org/) "), Perforce (they call it
"Inter-File Branching (http://www.perforce.com/perforce/branch.html) ") and Visual Studio Team
System (looks like they call it branching in "path space (http://blogs.msdn.com/team_foundation/archive
/2005/02/23/379179.aspx) ").
Good source control tools are clever about how they manage the underlying storage issues of branching. For
example, let us suppose that the source code tree for UltraHello is stored in $/projects/Hello/trunk. This
folder contains everything necessary to do a complete build of the shipping product, so there are quite a few
subfolders and several hundred files in there.
Now that you need to go forward with 1.0 maintenance and 2.0 development simultaneously, it is time to
create a branch. So you create a folder called $/projects/Hello/branches. Inside there, you create a branch
called 1.0.
At the moment right after the branch, the following two folders are exactly the same:
$/projects/Hello/trunk
$/projects/Hello/branches/1.0
It appears that the source control tool has made an exact copy of everything in your source tree, but actually it
hasn't. The repository database on disk has barely increased in size. Instead of duplicating the contents of
every file, it has merely pointed the branch at the same contents as the trunk.
As you make changes in one or both of these folders, they diverge, but they continue to share a common
history.
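To see why a branch costs almost nothing on disk, here is a minimal Python sketch (not the code of any real tool) of the pointer-sharing idea: each file entry points at a blob in a shared content store, and creating a branch copies only the pointers, never the contents.

```python
# Sketch: a branch is a new table of pointers into a shared content
# store, so branching duplicates no file contents.

content_store = {}   # blob id -> file contents

def store(contents):
    blob_id = hash(contents)
    content_store[blob_id] = contents
    return blob_id

trunk = {
    "src/hello.c": store("int main(void) { return 0; }"),
    "README": store("UltraHello 1.0"),
}

# Branching: copy the pointer table, not the contents.
branch_1_0 = dict(trunk)

assert branch_1_0 == trunk       # identical right after the branch
assert len(content_store) == 2   # nothing was duplicated

# The branches diverge as changes are made...
branch_1_0["README"] = store("UltraHello 1.0.1")

# ...but unchanged files still share the same stored contents.
assert branch_1_0["src/hello.c"] == trunk["src/hello.c"]
```

Real tools typically store deltas rather than whole blobs, but the principle is the same: the branch starts out as a cheap set of references.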
In order to use your source control tool most effectively, you need to develop just the right amount of fear of
branching. This delicate balance seems to be very difficult to find. Most people either have too much fear or
not enough.
Nelly is an example of a person who has too much fear of branching. Nelly has a friend who has a cousin with
a neighbor who knows somebody whose life completely fell apart after they tried using the branch and merge
features of their source control tool. So Nelly refuses to use branching at all. In fact, she wrote a 45-page
policy document which requires her development team to never use branching, because after all, "it's not
safe".
So Nelly's development team goes to great lengths to avoid using branching, but eventually they reach a point
where they need to do concurrent development. When this happens, they do anything they can to solve the
problem, as long as it doesn't involve the word "branch". They fork a copy of their tree and begin working with
two completely separate repositories. When they need to make a change to both repositories, they simply
make the change by hand, twice.
At the other end of the spectrum is Eddie, who uses branching far too often. Eddie started out just like Nelly,
afraid of branching because he didn't understand it. But to his credit, Eddie overcame his fear and learned
how powerful branching and merging can be.
After he tried branching and had a good first experience with it, Eddie now uses it all the time. He sometimes
branches multiple times per week. Every time he makes a code change, he creates a private branch.
Eddie arrives on Monday morning and discovers that he has been assigned bug 7136 (In the Elbonian version,
the main window is too narrow because the Elbonian language requires 9 words to say "Hello World".) So
Eddie sits down at his desk and begins the process of fixing this bug. The first thing he does is create a branch
called "bug_7136". He makes his code change there in his "private branch" and checks it in. Then, after
verifying that everything is working okay, he uses the Merge Branches feature to migrate all changes from the
trunk into his private branch, just to make sure his code change is compatible with the very latest stuff. Then
he runs his test suite again. Then he notices that the repository has changed yet again, then he does this loop
once more. Finally, he uses Merge Branches to apply his code fixes to the trunk. Then he grabs a copy of the
trunk code, builds it and runs the test suite to verify that he didn't accidentally break anything. When at last he
is satisfied that his code change is proper, he marks bug 7136 as complete. By now it is Friday afternoon at
4:00pm, and there's no point in starting anything new, so he just decides to go home.
Eddie never checks anything into the main trunk. He only checks stuff into his private branch, and then
merges changes into the trunk. His care and attention to detail are admirable, but he's spending far more
time using his source control tool than working on his code.
Let's not even think about what the kids would be like if Eddie and Nelly were to get married.
Dev--Test--Prod
Once you have established the proper level of comfort with the branching features of your source control tool,
the next question is how to use those features effectively.
One popular methodology for SCM is often called "code promotion". The basic idea here is that your code
moves through three stages, "dev" (stuff that is in active development), "test" (stuff that is being tested) and
"prod" (stuff that is ready for production release):
As code gets written by programmers, it is placed in the dev tree. This tree is "basically unstable".
Programmers are only allowed to check code into dev.
When the programmers decide they are done with the code, they "promote" it from dev to
test. Programmers are not allowed to check code directly into the test tree. The only way to get code
into test is to promote it. By promoting code to test, the programmers are handing the code over to the
QA team for testing.
When the testers decide the code meets their standards, they promote it from test to prod. Code can
only be part of a release when it has been promoted to prod.
For a variety of reasons, I personally don't like working this way, but there's nothing wrong with it. Lots of
people use this code promotion model effectively, especially in larger companies where the roles of
programmer and tester are very clearly separated.
I understand that PVCS has specific feature support for "promotion groups", although I've never used this
product personally. With other source control tools, the code promotion model can be easily implemented
using three branches, one for dev, one for test, and one for prod. The Merge Branches feature is used to
promote code from one level to the next.
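As a rough illustration, the promotion model can be sketched in a few lines of Python, with each level as a branch and promotion as a merge of whatever changes the next level doesn't have yet. The function names and change descriptions here are invented for the example.

```python
# Sketch: dev-test-prod modeled as three branches, where "promotion"
# is a merge of changesets from one level to the next.
branches = {"dev": [], "test": [], "prod": []}

def checkin(change):
    # Programmers are only allowed to check code into dev.
    branches["dev"].append(change)

def promote(src, dst):
    # Apply every change from src that dst doesn't already have.
    for change in branches[src]:
        if change not in branches[dst]:
            branches[dst].append(change)

checkin("fix bug 7136")
checkin("widen Elbonian main window")
promote("dev", "test")    # hand the code over to QA
promote("test", "prod")   # QA signs off; code is ready for release
assert branches["prod"] == branches["dev"]
```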
In our shop, by contrast, the trunk is the place where active development of new features is happening. The
trunk could be described as "basically unstable", a philosophy of branching which is explained in Essential CVS
(http://www.amazon.com/exec/obidos/ASIN/0596004591/sawdust08-20) , a fine book on CVS by O'Reilly.
In our situation, the stability of the trunk build fluctuates over the months during our development cycle.
During the early and middle parts of a development cycle, the trunk is often not very stable at all. As we
approach alpha, beta and final release, things settle down and the trunk gets more and more stable. Not long
before release, the trunk becomes almost sacred. Every code change gets reviewed carefully to ensure that we
don't regress.
At the moment of release, a branch gets created. This branch becomes our maintenance tree for that release.
Our current maintenance branch is called "3.0", since that's the current major version number of our
product. When we need to do a bug fix or patch release, it is done in the maintenance branch. Each time we
do a release out of the maintenance branch (like 3.0.2), we apply a label.
After the maintenance branch is created, the trunk once again becomes "basically unstable". Developers start
adding the risky code changes we didn't want to include in the release. New feature work begins. The cycle
starts over and repeats itself.
Be afraid of branches, but not so afraid that you never use the feature. Don't branch on a whim, but do branch
when you need to branch.
Here are some situations where branching is usually not the right approach:
Simple changes. As I mentioned above in my "Eddie" scenario, don't branch for every bug fix or
feature.
Customer-specific versions. There are exceptions to this rule, but in general, you should not branch
simply for the sake of doing a custom version for a specific customer. Find a way to build the
customizability into your app.
And there are some situations where branching is the best practice:
Maintenance and development. The classic example, and the one I used above in my story about
UltraHello. Maintaining version N while developing version N+1 is the perfect example of a time to use
branching.
Subteam. Sometimes a subset of your team needs to work on something experimental that will take
several weeks. When they finish, their work will be folded into the main tree, but in the meantime, they
need a separate place to work.
Code promotion. If you want to use the dev-test-prod methodology I mentioned above, use a branch
to model each of the three levels of code promotion.
Looking Ahead
In the next chapter I will delve into the topic of merging branches.
Chapter 8: Merge Branches
This is part of an online book called Source Control HOWTO (source_control.html) , a best practices guide
on source control, version control, and configuration management.
Many users find the word "merge" to be confusing, since it seems to imply that we start out with two things and
end up with only one. I'm not going to try to invent new vocabulary. Instead, let's just try to be clear
about what we mean when we speak about merging branches. I define "merge branches" like this:
To "merge branches" is to take some changes which were done to one branch and apply them to another
branch.
Sounds easy, doesn't it? In practice, merging branches often is easy. But the edge cases can be really tricky.
Consider an example. Let's say that Joe has made a bunch of changes in $/branch and we want to apply those
changes to $/trunk. At some point in the past, $/branch and $/trunk were the same, but they have since
diverged. Joe has been making changes to $/branch while the rest of the team has continued making changes
to $/trunk. Now it is time to bring Joe back into the team. We want to take all the changes Joe made to
$/branch, no matter what those changes were, and we want to apply those changes to $/trunk, no matter what
changes have been made to $/trunk during Joe's exile.
The central question about merge branches is the matter of how much help the source control tool can
provide. Let's imagine that our SCM tool provided us with a slider control:
If we drag this slider all the way to the left, the source control tool does all the work, requiring no help at all
from Joe. Speaking as a source control vendor, this is the ideal scenario that we strive for. Most of us don't
make it. However, here at SourceGear we made the decision to build our source control product on the .NET
Framework, which luckily has full support for the kind of technology needed to implement this. The code
snippet below was pasted from our implementation of the Merge Branches feature in Vault:
DeveloperIntention di =
    System.Magic.FigureOutWhatDeveloperWasTryingToDo(changes);
di.Apply(target);
Boy do I feel sorry for all those other source control vendors trying to implement Merge Branches without the
DeveloperIntention class! And to think that so many people believe the .NET Framework is too large. Sheesh!
If we drag the slider all the way to the right, Joe does all the work and gets no help at all from the tool. In
practice, we find ourselves somewhere between these two extremes. The source control tool cannot do
magic, but it can usually help make the merge easier.
Since the developer must still take responsibility for the merge, things will go more smoothly if she
understands what's really going on. So let's talk about how merge branches works. First I need to define a bit
of terminology.
For the remainder of this chapter I will be using the words "origin" and "target" to refer to the two branches
involved in a merge branches operation. The origin is the folder which contains the changes. The target is the
folder to which we want those changes to be applied.
Note that my definition of merge branches is a one-way operation. We apply changes from the origin to the
target. In my example above, $/branch is the origin and $/trunk is the target. That said, there is nothing
which prevents me switching things around and applying changes in the opposite direction, with $/trunk as
the origin and $/branch as the target, but that would simply be a separate merge branches operation.
When you begin a merge branches operation, you know which changes from the origin you want to be applied
over in the target. Most of the time you want to be very specific about which changes from the origin are to be
merged. This is usually evident in the conversation which preceded the merge:
"Dan asked me to merge all the bug fixes from 3.0.5 into the main trunk."
"Jeff said we need to merge the fix for bug 7620 from the trunk into the maintenance tree."
"Ian's experimental rewrite of feature X is ready to be merged into the trunk."
One way or another, you need to tell your source control tool which changes are involved in the merge. The
interface for this operation can vary significantly depending on which tool you are using. The screen shot
below is the point where the Merge Branches Wizard in Vault is asking me to specify which changes should be
merged. I'm selecting everything back to the last build label:
(screendumps/scm_mb_choose.png)
After selecting the changes to be applied, it's time to try and make those changes happen in the target. It is
important here to mention that merging branches requires us to consider every kind of change, not just the
common case of edited files. We need to deal with renames, moves, deletes, additions, and whatever else the
source control tool can handle.
I won't spell out every single case. Suffice it to say that each operation should be applied to the target in the
way that Makes Sense. This won't succeed in every situation, but when it does, it is usually safe. Examples:
If a file was edited in the origin and a file with the same relative path exists in the target, try to make the
same edit to the target file. Use the automerge algorithm I mentioned in chapter 3
(scm_file_merge.html) . If automerge fails, signal a conflict and ask the user what to do.
If a file was renamed in the origin, try doing the same rename in the target. Here again, if the rename
isn't possible, signal a conflict and ask the user what to do. For example, the target file may have been
deleted.
If a file was added in the origin, add it to the target. If doing so would cause a name clash, signal a
conflict and ask the user what to do.
What happens if an edited file in the origin has been moved in the target to a different subfolder?
Should we try to apply the edit? I'd say yes. If the automerge succeeds, there's a good chance it is safe.
Bottom line, a source control tool should do all the operations which seem certain to be safe. And even then,
the user needs a chance to review everything before the merge is committed to the repository.
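A Python sketch of this policy might look like the following: each operation that seems safe is applied to the target, and everything else is recorded as a conflict for the user to resolve. The automerge step is reduced to a simple overwrite here, purely for illustration.

```python
# Sketch: apply one branch's changes to a target folder, doing only the
# operations that seem safe and signaling a conflict otherwise.
def apply_changes(changes, target):
    conflicts = []
    for kind, path, payload in changes:
        if kind == "edit":
            if path in target:
                target[path] = payload          # stand-in for an automerge
            else:
                conflicts.append((kind, path))  # target file is gone: ask the user
        elif kind == "rename":
            if path in target:
                target[payload] = target.pop(path)
            else:
                conflicts.append((kind, path))
        elif kind == "add":
            if path not in target:
                target[path] = payload
            else:
                conflicts.append((kind, path))  # name clash: ask the user
    return conflicts

target = {"panel.py": "old panel code", "util.py": "helpers"}
changes = [
    ("edit", "panel.py", "widened panel code"),
    ("rename", "util.py", "helpers.py"),
    ("add", "anydbm.py", "dbm wrapper"),
    ("edit", "__init__.py", "..."),   # deleted in the target: conflict
]
conflicts = apply_changes(changes, target)
assert conflicts == [("edit", "__init__.py")]
```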
Let's consider a simple example from Subversion. I created a folder called trunk, added a few files, and then
branched it. Then I made three changes to the trunk:
Deleted __init__.py
Modified panel.py
Added a file called anydbm.py
Then I asked Subversion to merge all changes between revisions 2 and 4 of my trunk into my branch:
Subversion correctly detected all three of my changes and applied them to my working copy of the branch.
3. Developer review
4. Commit
The very last step of a merge branches operation is to commit the results to the repository. Simplistically, this
is a commit like any other. Ideally, it is more. The difference is whether or not the source control tool
supports "merge history".
Merge history contains special historical information about all merge branch operations. Each time you
use the merge branches feature, it remembers what happened. This allows us to handle two cases with a bit
more finesse:
Repeated merge.
Frequently you want to merge from the same origin to the same target multiple times. Let's suppose you
have a sub-team working in a private branch. Every few weeks you want to merge from the branch into the
trunk. When it comes time to select the changes to be merged over, you only want to select the changes
that haven't already been merged before. Wouldn't it be nice if the source control tool would just
remember this for you?
Merge history makes this automatic and convenient. Without it, the workaround is simply to use a label to
mark the point of your last merge.
A similar case happens when you have two branches and you sometimes want to merge back and forth in
both directions. For example:
1. Create a branch
2. Do some work in both the branch and the trunk
3. Merge some changes from the branch to the trunk
4. Do some more work
5. Merge some changes from the trunk to the branch
At step 5, when it comes time to select changes to be merged, you want the changes from step 3 to be
ignored. There is no need to merge those changes from the trunk to the branch because the branch is
where those changes came from in the first place! A source control tool with a smart implementation of
merge history will know this.
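Here is a simplified Python sketch of what merge history might track. Real implementations are far more subtle, but it shows how remembering past merges handles both the repeated merge and the back-and-forth case.

```python
# Sketch: merge history remembers which changesets have already been
# merged into which branch, so repeated and bidirectional merges skip them.
merge_history = set()   # (changeset id, target branch) pairs

def merge_into(target, changesets, origin_of):
    """Select and apply the changesets the target doesn't already have."""
    applied = []
    for c in changesets:
        if (c, target) in merge_history:
            continue   # repeated merge: already applied before
        if origin_of[c] == target:
            continue   # the change came from the target in the first place
        merge_history.add((c, target))
        applied.append(c)
    return applied

origin_of = {"c1": "branch", "c2": "trunk", "c3": "branch"}

# Step 3: merging branch -> trunk picks up the branch's own work.
assert merge_into("trunk", ["c1"], origin_of) == ["c1"]

# Step 5: merging trunk -> branch skips c1, which the branch produced itself.
assert merge_into("branch", ["c1", "c2"], origin_of) == ["c2"]

# Repeating step 3 finds nothing new to do.
assert merge_into("trunk", ["c1"], origin_of) == []
```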
Not all source control tools support merge history. A tool without merge history can still merge branches. It
simply requires the developer to be more involved, to do more thinking.
In fact, I'll have to admit that at the time of this writing, my own favorite tool falls into this category. We're
planning some major improvements to the merge branches feature for Vault 4.0, but as of version 3.x, Vault
does not support merge history. Subversion doesn't either, as of version 1.1. Perforce is reported to have a
good implementation of merge history, so we could say that its "slider" rests a bit further to the left.
Summary
I didn't want this chapter to be a step-by-step guide to using any one particular source control tool, so I have
kept this discussion fairly high-level. Each tool implements the merging of branches a little differently.
For some additional information, I suggest you look at Version Control with Subversion
(http://svnbook.red-bean.com/) , a book from O'Reilly. It is obviously Subversion-specific, but it contains a
discussion of branching and merging which I think is pretty good.
The one thing all these tools have in common is the need for the developer to think. Take the time to
understand exactly how the branching and merging features work in your source control tool.
Chapter 9: Source Control Integration with IDEs
This is part of an online book called Source Control HOWTO (source_control.html) , a best practices guide
on source control, version control, and configuration management.
In the old days, each developer would assemble their own collection of their favorite tools. Back around 1991,
my preferred toolset looked something like this:
Fifteen years later, most developers would consider this approach to be strange. Today, everything is
"integrated". Instead of selecting one of each kind of tool, we select an Integrated Development Environment
(IDE), an application which collects all the necessary tools together in one place. To continue the metaphor,
we would say that the focus today is not on the individual tools, but rather, on the workshop in which those
tools are used.
This trend is hardly new. Ever since Borland released Turbo Pascal (http://en.wikipedia.org
/wiki/Turbo_Pascal) in 1983, IDEs have become more popular every year. In the last ten years, many IDE
products have disappeared as the industry has consolidated. Today, it is only a small exaggeration to say that
there are just two IDEs left: Visual Studio and Eclipse.
But despite the industry consolidation, the trend is clear. Developers want their tools to be very well
integrated together. Most recently, Microsoft's Visual Studio Team System (http://msdn.microsoft.com
/vstudio/teamsystem/default.aspx) takes this trend to a higher level than we have previously seen.
Mainstream IDEs in the past have provided base operations such as editing, compiling, building and
documentation. Now Visual Studio also has unit tests, visual modeling, code generators, and work item
tracking. Furthermore, the IDE isn't just for coders anymore. Every task performed by every person involved
in the software development process is moving into the IDE.
Here at SourceGear we offer two client applications for our source control tool:
1. A standalone client application which is specifically designed to talk with the source control server.
2. A client plugin which adds source control features into Visual Studio.
Unsurprisingly, the IDE client is very popular with our users. Many of our users would never think about
using source control without IDE integration.
Why does version control work so nicely inside an IDE? Because it makes the three most common operations
a lot easier:
Checkout
When using the checkout-edit-checkin model, files must be checked out before they are edited. With
source control integrated into an IDE, this task can be quite automatic. Specifically, when you begin to
edit a file, the IDE will notice that you do not have it checked out yet and check the file out for you.
Effectively, this means developers never need to remember to checkout a file.
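The auto-checkout behavior can be sketched like this in Python. The classes are hypothetical, not any real IDE's plugin API; the point is simply that the first keystroke in a file triggers the checkout on the developer's behalf.

```python
# Sketch: an IDE that checks a file out automatically on first edit.
class SourceControl:
    def __init__(self):
        self.checked_out = set()

    def checkout(self, path):
        self.checked_out.add(path)

class Editor:
    def __init__(self, scc):
        self.scc = scc
        self.buffers = {}

    def type_into(self, path, text):
        # The first keystroke in a file not yet checked out
        # triggers a checkout behind the scenes.
        if path not in self.scc.checked_out:
            self.scc.checkout(path)
        self.buffers[path] = self.buffers.get(path, "") + text

scc = SourceControl()
ed = Editor(scc)
ed.type_into("panel.py", "x = 1")
assert "panel.py" in scc.checked_out   # the developer never ran Checkout
```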
Add
A common and frustrating mistake is to add a new file to a project but forget to place it under source
control. So when I am done with my coding task, I checkin my changes to the existing files, but the
newly added file never makes it into the repository. The build is broken.
When using source control integration with an IDE, this mistake is basically impossible to make. Most
IDEs today support the notion of a "project", a list of all files which are considered part of the build
process. When used with source control, the IDE decides what files to place under source control
because it knows every file that is part of the project. The act of adding a file to the project also adds it
to source control.
Checkin
IDEs excel at nagging developers. The user interface of an IDE has special places to nag the developer
about compiler errors and unsaved files and even unfixed bugs. Similarly, visual indicators in the IDE
can be used to remind the developer that he has not yet checked in his changes.
When source control is integrated into an IDE, developers don't have to think about it very much. They don't
have to try to remember to Checkout, Add or Checkin because the IDE is either performing those actions
automatically or reminding them to do it.
Bigger benefits
Once you integrate source control into an IDE, you open the possibility for cool features that go beyond the
basics. For example, source control integration can be incredibly helpful when used with refactoring. When I
use the refactoring features of Eclipse to rename a Java class, it is obviously nice that Eclipse figures out all the
changes that need to be made. It's even nicer that Eclipse automatically handles all the necessary source
control operations. It even performs the name change of the source file.
For another example, here is a screen shot of a Blame (scm_history.html) feature integrated into Eclipse:
(screendumps/scm_eclipse_blame.png)
The user story for this feature goes like this: The developer is coding and she encounters something that
deserves to be on The Daily WTF (http://thedailywtf.com/) . She wants to immediately know who is
responsible, so she right-clicks on the offensive line and selects the Blame feature. The source control plugin
queries the repository for history and determines who made the change. The task was simpler because the
Blame feature is conveniently located in the place where it is most likely to be needed.
I personally have never used source control integration with an IDE. Heck, for a long time I didn't use
IDEs at all. I'm a control freak. It's not enough for me to know what's going on under the hood.
Sometimes I prefer to just do everything myself. I don't like project-based build systems where I add a
few files and the IDE magically builds my app. I like make-based build systems where I can control exactly
where everything is and where the build targets are placed.
Except for a brief and passionate affair with Think C (http://en.wikipedia.org/wiki/THINK_C) during
the late eighties, I didn't really start using IDE project files until Visual Studio .NET. Today, I am
gradually becoming more and more of an IDE user, but I still prefer to do all source control operations
using a standalone GUI client. Eventually, that will change, and my transformation to IDE user will be
complete.
Anyway, for the sake of completeness, I will explain the tradeoffs I see with using source control integration
with IDEs. This should be taken as information, not as an argument against the feature. IDE integration is
the most natural way to use source control on a daily basis.
The first observation is that IDE clients have fewer features than standalone clients. The IDE is great for basic
source control operations, but it is definitely not the natural place to perform all source control operations.
Some things, such as branching, don't fit very well at all. However, this is a minor point which merely
illustrates that an IDE client cannot be the only user interface for accessing a source control repository. If this
were the only problem, it would not be a problem. This is the sort of tradeoff that I would consciously accept.
The real problem with source control integration for IDEs is that it just doesn't work very well. For this sad
state of affairs, I put most of the blame on MSSCCI.
MSSCCI
It's pronounced "misskee", and it stands for Microsoft Source Code Control Interface. MSSCCI is the API
which defines the interaction between Microsoft Visual Studio and source control tools.
A source control tool which wants to support integration with Visual Studio must implement this API.
Basically, it's a DLL which defines a number of documented entry points. When configured properly, the IDE
makes calls into the DLL to perform source control operations as needed or as requested by the user.
Originally, Microsoft's developer tools were the only host environments for MSSCCI. Today, MSSCCI is used
by lots of other IDEs as well. It has become sort of a de facto standard. Source control vendors implemented
MSSCCI plugins so that their products could be used within Microsoft IDEs. In turn, vendors of other IDEs
implemented MSSCCI hosting so that their products could be used with the already-available source control
plugins.
The ubiquity of MSSCCI is very unfortunate. MSSCCI was designed to be a bridge between SourceSafe and
early versions of Microsoft Visual Studio. It served this purpose just fine, but now the API is being used for lots
of other version control tools besides SourceSafe and lots of other IDEs besides Visual Studio. It is being used
in ways that it was never designed to be used, resulting in lots of frustration. Three problems stand out:
1. Poor performance. SourceSafe has no support for networking, but the architecture of most modern
version control tools involves a client and a server with TCP in between. To get excellent performance
from a client-server application, careful attention must be paid to the way the networking is done.
Things like threading and blocking and buffering are very important. Unfortunately, MSSCCI makes
this rather difficult.
2. No Edit-Merge-Commit. SourceSafe is basically built around the Checkout-Edit-Checkin approach,
so that's how MSSCCI works. Building a satisfactory MSSCCI plugin for the Edit-Merge-Commit
paradigm is very difficult.
3. No Atomic transactions. SourceSafe has no support for atomic transactions, so MSSCCI and Visual
Studio were not designed to use them. This means that sometimes modern version control tools like
Vault can't group things together properly at commit time.
On top of all this, all the world's MSSCCI hosts tend to implement their side of the API a little differently. If
you implement a MSSCCI plugin and get everything working with Visual Studio 2003, you have approximately
zero chance of it working well with Visual Basic 6, FoxPro or Visual Interdev. After you code in all the special
hacks to get things compatible with these fringe Microsoft environments, your plugin still has no real chance of
working with third party products like MultiEdit (http://www.multiedit.com/) . Every IDE requires some
different tweaks and quirky behavior to make it work. By the time you get your plugin working with some of
these other IDEs, your regression testing shows that it doesn't work with Visual Studio 2003 anymore.
Lather. Rinse. Repeat.
Most developers who work with MSSCCI eventually turn to recreational pharmaceuticals in a futile effort to
cope.
A brighter future
Luckily, MSSCCI is fading away. Earlier in this article I flippantly joked that Visual Studio and Eclipse were
the only IDEs left in the world. This is of course an exaggeration, but the fact remains that these two products
have the lion's share, so we can take some comfort in their dominance when we think about the prevalence of
MSSCCI in the future:
Eclipse does not use MSSCCI. It has its own source control integration APIs.
Visual Studio 2005 introduced a new and greatly improved API for source control integration.
So, the two dominant IDEs today inspire us to dream of a MSSCCI-free world. The planet will certainly be a
nicer place to live when MSSCCI is a distant memory.
Here at SourceGear, the various problems with MSSCCI have caused us to hold a cautious and reserved stance
toward IDE integration. Most of our customers really would prefer an IDE client, so we give them one. But we
consider our standalone GUI client to be the primary UI because it is faster and more full-featured. And
internally, most of us on the Vault team use the standalone GUI client for our everyday work.
But our posture is changing dramatically. We are currently working on an Eclipse plugin as well as a
completely new plugin for the new source control API in Visual Studio 2005. Sometime in early 2007, we will
be ready to consider our IDE clients to be primary, with our other client applications to be available for less
common operations. What do I mean when I say "primary"? Well, among other things, I mean that the IDE
clients will be the way we use our own product. Including me. :-)
It's not yet terribly impressive to look at, but here's a screen shot of our new Visual Studio 2005 client:
(screendumps/scm_vsip_client.png)
Final thoughts
The direction of this industry right now is toward more and more integration. This is a very good thing. We're
going to see many new improvements. Users will be happier. Just as a spice rack belongs near the stove,
source control should always be available where the developer is working.