Rescuing A Failed Domain Controller: Disaster Recovery in Action
By Louis Nel
I recently had to sort out an issue with a failed mirror set (i.e., RAID 1) on a Windows Server 2003 domain controller (DC). No problem, I thought. Well, not quite. The mirror had to be deleted, taking everything on both drives with it. Restoring Active Directory (AD) from backup failed. To make a bad situation worse, the DC held all the Flexible Single Master Operations (FSMO) roles in this (single) domain. Transferring the roles failed; seizing them was problematic. Disaster recovery? Indeed! This article will show you how to get such a DC, and the whole domain, back from the brink. As you'll see, a disaster recovery plan is about more than generalities.
Disaster scenario
You have your disaster recovery plan all neatly set out. Then disaster strikes: A Windows Server 2003 domain controller goes down. Okay, not a train smash; you've got up-to-date backups. But restoring Active Directory via backup fails. Now what? Well, you can still reinstall Server 2003 and restore user data from backup. (The latter works; you've checked.)
There's only one problem: This server was the holder of all the FSMO roles. So you're starting to sweat a little, but not too profusely. You know about transferring FSMO roles to another domain controller. But what if that fails? Yes, you can try seizing them. At this stage, you're looking at the stuff disasters are made of, because now your whole domain teeters on the brink. (I'll explain why in a moment.)
Admittedly, this is a very particular (and very unfortunate) scenario. But then, the nature of a disaster is its unpredictability. And there are a couple of general lessons to be learned from this specific incident. Here's what I did and what I learned along the way.
In this situation, the failed mirror could not be rebuilt in a nondestructive way (I won't go into the whys and wherefores here), making loss of all data on both drives inevitable. I tried restoring AD from backup. It failed, presumably because the backup software that was used (an old version for NT) didn't back up the system state data. Trying to restore with Server 2003's own backup utility (ntbackup.exe) didn't work either. It didn't recognize the backup format of the legacy software.
Figure A: The attempt at seizing the roles resulted in these errors. (Note: the domain name, DC/server name, and CN name have been edited out for security reasons.)
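For reference, role seizure on Windows Server 2003 is done from ntdsutil's roles menu. The Python sketch below only assembles the command sequence so it can be written into a runbook; it does not run anything. The helper name build_seize_commands, the role keys, and the server name dc2.example.com are hypothetical, and ntdsutil typically asks for confirmation before each seizure.

```python
# Sketch only: assembles the ntdsutil menu commands for seizing FSMO roles.
# Function name, role keys, and server name are illustrative assumptions.

# "seize" commands as worded in the Windows Server 2003 ntdsutil roles menu.
SEIZE_COMMANDS = {
    "schema": "seize schema master",
    "naming": "seize domain naming master",
    "rid": "seize RID master",
    "pdc": "seize PDC",
    "infrastructure": "seize infrastructure master",
}


def build_seize_commands(target_dc, roles):
    """Return the ordered ntdsutil commands to seize `roles` onto `target_dc`."""
    unknown = [r for r in roles if r not in SEIZE_COMMANDS]
    if unknown:
        raise ValueError(f"unknown role keys: {unknown}")
    commands = [
        "ntdsutil",
        "roles",                            # enter the fsmo maintenance menu
        "connections",                      # enter the server connections menu
        f"connect to server {target_dc}",   # the surviving DC that takes the roles
        "quit",                             # back to fsmo maintenance
    ]
    commands += [SEIZE_COMMANDS[r] for r in roles]
    commands += ["quit", "quit"]            # leave fsmo maintenance, then ntdsutil
    return commands


if __name__ == "__main__":
    for line in build_seize_commands("dc2.example.com", list(SEIZE_COMMANDS)):
        print(line)
```

Keeping the sequence as data rather than typing it from memory makes it easy to paste into disaster recovery documentation and to review in a test lab first.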
Copyright 2006 CNET Networks, Inc. All rights reserved. For more downloads and a free TechRepublic membership, please visit http://techrepublic.com.com/2001-6240-0.html
If you have more than one domain, you won't, with immediate effect, be able to move security principals from one domain to another. You also won't be able to add new users, groups, and computers to the domain. You won't experience the latter problem immediately, as each DC in the domain holds a pool of relative identifiers (RIDs), 500 by default, that it received from the RID master. But once that pool is used up, you're dead in the water. Now you're faced with the prospect of rebuilding the whole domain.
Reinstall Windows Server 2003 on the failed machine, make it a DC (run DCPromo), and install and restore whatever other services there were on the machine, like DHCP, WINS, DNS, and IIS. When you're finished, start replicating. Now you're ready to restore your data.
First, clean up
Before you reinstall Windows Server 2003 on the failed machine and make it a DC, there's an important job to do: a metadata cleanup. This entails removing the dead DC from AD (more technically speaking, removing its ntdsDSA object). You have to be a member of the Enterprise Admins group to perform this task.
A word of caution: Be absolutely sure this is the route you want to take before you do the metadata cleanup. There's no turning back (at least none that I'm aware of).
How you perform the cleanup depends on whether you want to give your new DC the same name as the old (failed) one. I suggest retaining the old name, as it simplifies matters a lot (for example, with shares). However, if you always wanted to rename that DC, now is the time.
Let's start with the steps to follow if you want to give the new DC the same name. In this case, you'll have to remove the old DC's ntdsDSA object. The commands differ slightly depending on whether the DC in question has Service Pack 1 (SP1) installed. If SP1 is installed, metadata cleanup also removes File Replication Service (FRS) connections and, as part of the process, tries to transfer or seize any operations master roles that the retired DC holds.
If SP1 is installed, type remove selected server ServerName. (See Figure B.)
If SP1 is not installed and you're using the version of Ntdsutil.exe that's included with Windows Server 2003 with no service pack, connect to the existing domain controller (in our case, the one in the same site as the failed DC) on which you want to remove the failed DC's ntdsDSA object. To do this, type connections at the metadata cleanup prompt and press [Enter].
Type connect to server <servername>, where <servername> is the DC that will be used to clean the metadata, and press [Enter]. It can be any working DC in the same domain, but we'll use one in the same site. Figure C shows this step on a DC that does not have SP1 installed.
Type list domains and press [Enter]. All domains in the forest will be listed.
Type select site <number> (the number of the site in which the DC was a member) and press [Enter].
Type select server <number>, where <number> is that of the DC to be removed, and press [Enter].
Type quit and press [Enter] until you're back at the command prompt.
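The steps above follow the ntdsutil metadata cleanup menus. As a hedged sketch, the Python helper below lays out one plausible full command sequence for the no-service-pack case, so the order can be checked before a crisis. The function name, the example server name, and the numeric indexes are hypothetical; the real numbers come from the output of list domains, list sites, and list servers in site, so the session itself has to be run interactively.

```python
# Sketch only: lists the ntdsutil commands for a metadata cleanup on a DC
# running Windows Server 2003 without a service pack. Names are illustrative.

def build_cleanup_commands(working_dc, domain_no, site_no, server_no):
    """Return the ordered ntdsutil commands that remove a failed DC's metadata.

    The three numbers are the indexes ntdsutil prints for the domain, site,
    and server; they are only known once the matching "list" command has run.
    """
    return [
        "ntdsutil",
        "metadata cleanup",
        "connections",
        f"connect to server {working_dc}",  # any working DC in the domain
        "quit",                             # back to metadata cleanup
        "select operation target",
        "list domains",
        f"select domain {domain_no}",
        "list sites",
        f"select site {site_no}",
        "list servers in site",
        f"select server {server_no}",       # the failed DC
        "quit",                             # back to metadata cleanup
        "remove selected server",           # ntdsutil prompts before removing
        "quit",
        "quit",
    ]


if __name__ == "__main__":
    # Placeholder values; substitute the numbers ntdsutil actually lists.
    for line in build_cleanup_commands("dc2.example.com", 0, 0, 1):
        print(line)
```

On a DC with SP1, the article's shorter form (remove selected server ServerName at the metadata cleanup prompt) replaces the connection and selection steps.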
Figure B: Starting the metadata cleanup process using ntdsutil on a DC with SP1 installed
Figure C: Starting the metadata cleanup process using ntdsutil on a DC without SP1 installed
If you're going to take the plunge and give the DC a new name, you'll have to remove the failed server from the Active Directory Sites and Services and Active Directory Users and Computers snap-ins. Note: Don't do this if the new DC will have the same name as the failed one.
Lessons
Here are some things you should know, check, and do before disaster strikes:
This might seem pretty obvious (but how many of us do it?): Plan for what-if (worst-case) scenarios. That's what's meant by "disaster," right? Don't bargain on anything (backups working, etc.).
Outline procedures to recover from disasters like these. Put a fair amount of detail in your disaster recovery documentation. You need more than generalities. Have the procedures for tasks like seizing FSMO roles set out clearly as part of your disaster recovery plan. It will speed up recovery considerably in case of a crisis. Even better, test your procedures in the calm environment of a test lab.
Regularly check that you have what it takes to recover from a disaster. For instance, how up-to-date is the backup of your system state data? When it comes to system state data, age matters. If your system state backup is older than the tombstone lifetime, you're in for trouble. The default tombstone lifetime is 60 days. (A tombstone keeps tabs on objects deleted but not yet completely removed from AD.) To prevent inconsistencies in AD, you're prevented from restoring data older than the tombstone lifetime.
Prepare to speed up recovery (and take pressure off yourself) by making separate backups of DNS and DHCP and all server drivers.
Ensure that your disaster recovery procedure is set out clearly and systematically, listing the steps to follow and the order in which things should be done.
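The tombstone rule in the checklist above lends itself to an automated check. The following Python sketch compares the age of a system state backup with the tombstone lifetime; the function name and the example dates are made up for illustration, and the 60-day default is simply the figure quoted in this article, so confirm the value configured in your own forest.

```python
from datetime import date

# Default quoted in this article; treat it as an assumption and verify it.
DEFAULT_TOMBSTONE_DAYS = 60


def backup_is_restorable(backup_date, today, tombstone_days=DEFAULT_TOMBSTONE_DAYS):
    """Return True if the backup is not older than the tombstone lifetime."""
    age_days = (today - backup_date).days
    if age_days < 0:
        raise ValueError("backup_date is in the future")
    return age_days <= tombstone_days


if __name__ == "__main__":
    # Hypothetical dates: a backup taken on 1 April, checked on 12 June.
    print(backup_is_restorable(date(2006, 4, 1), date(2006, 6, 12)))
```

Run on a schedule against the timestamp of the latest system state backup, a check like this flags a stale backup long before anyone needs to restore from it.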
Potential pitfalls
Install the relevant service pack(s) and critical updates immediately after reinstallation. Remember to check shares and permissions. I also had to restore mapped drives. Also, remember to set up the time service again if you had to follow the recovery route described above. And just to add to the fun: If you apply Server 2003's SP1, you might run into a problem with the time server service not starting. You'll find the solution here.
Additional resources
"Familiarize yourself with Active Directory's five FSMO roles" (TechRepublic article)
"Mastering the Active Directory Schema" (TechRepublic download)
"Managing OUs, Users, and Groups in Active Directory" (TechRepublic download)
Version history
Version: 1.0 Published: June 12, 2006