
Best ways to mitigate database corruption/downtime
Greetings,
I'm not a DBA, but I fill in lightly when our DBA is out. Along those lines, we have a new DBA who I feel has an intermediate level of Microsoft SQL expertise.
We ran into a major issue recently where we rebooted one of our major SQL servers and there was some issue or corruption between the machine and SQL, so the database wouldn't come up at all and just stayed in recovery without moving forward. This was a single-instance server: it wasn't in a cluster and didn't have an availability group. What we had available to recover from was backups from Veeam and SQL-level .bak full backups (with transaction log backups). Long story short, the only way around this (that we knew of) to get it back up and running was to build a new server completely, copy over the .bak files to restore from, and then restore the transaction logs as well. That took a very long time, and we were down for a good chunk of the day.
So my question is: what do you experts think is the best way to mitigate issues like this when they come up? We are exploring options like the cloud (we have a presence in AWS now), clustering (we don't have any of our SQL servers in clusters right now), mirroring (we do have one server in a mirror right now), and availability groups (we don't do this now and I am not familiar with them at all).
Management is pushing for cloud options. They want the niceness of someone else managing the back-end stuff, and they also want to avoid buying more storage (we have a lot of TB-sized databases). Plus we have a small team with only one DBA.
Are there other options I am missing as well? I know there are a lot of variables to consider, but in general, what are the better ways to prevent corruption and minimize downtime if your databases go south completely? Thanks much.
level 1
LZ_OtHaFA
6 points·6 months ago
You should be taking your daily backups, restoring them to an offline server, and performing the necessary DBCC checks on each DB to ensure the DBs are not corrupted to begin with. This will help prevent a catastrophic incident like the one you experienced. If you can automate this process, you will be that much ahead of your next catastrophe. Do this in conjunction with paying attention to your SQL logs on a regular basis, looking for interesting messages that may hint at corruption.
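Just to sketch the verification step (the database name, logical file names, and paths below are made up, so adjust them for your environment), the offline-server check looks roughly like this:

    -- Restore last night's full backup onto the offline/test server under a new name
    RESTORE DATABASE SalesDB_Verify
    FROM DISK = N'\\backupshare\SalesDB\SalesDB_full.bak'
    WITH MOVE N'SalesDB'     TO N'D:\Data\SalesDB_Verify.mdf',
         MOVE N'SalesDB_log' TO N'L:\Log\SalesDB_Verify.ldf',
         REPLACE, RECOVERY;

    -- Full integrity check; any errors mean the backup (or the source DB) is suspect
    DBCC CHECKDB (SalesDB_Verify) WITH NO_INFOMSGS, ALL_ERRORMSGS;

    -- Drop the verification copy once the check comes back clean
    DROP DATABASE SalesDB_Verify;

Automate that as a scheduled job and alert on any CHECKDB errors, and you're verifying both the backup file and the database in one pass.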
Cloud works well to fix the problem in the future; you can spin up a new instance with relative ease. There is a learning curve before you jump into it, though, especially for a 1-DBA shop, so look into training options to help get your DBA up to speed.

level 2
[deleted]
2 points·6 months ago
We do have a daily full backup of the databases and then transaction log backups every 15 minutes. Yeah, those saved us this time, but it took quite a long time to pull down those backup files from the backup repository. Also, that was pretty much our last option, we believe, because a machine-level backup from Veeam still presented the corruption.
So that was a bit nerve-wracking, given that was our final option on that particular database. I think your DBCC option is a good one. I don't believe we are doing that today, but that would give us peace of mind knowing that it checked out fully.
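For reference, our schedule amounts to roughly this in plain T-SQL (the names and paths here are just placeholders; the real thing runs as scheduled jobs):

    -- Nightly full backup, with CHECKSUM so damaged pages get flagged at backup time
    BACKUP DATABASE SalesDB
    TO DISK = N'\\backupshare\SalesDB\SalesDB_full.bak'
    WITH COMPRESSION, CHECKSUM, INIT;

    -- Transaction log backup, run every 15 minutes
    BACKUP LOG SalesDB
    TO DISK = N'\\backupshare\SalesDB\SalesDB_log_093000.trn'
    WITH COMPRESSION, CHECKSUM;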

level 3
LZ_OtHaFA
3 points·6 months ago
Yep. I have worked for my company for over 3 years and I still cannot get them to dedicate the resources to make this happen. Sad, really; it should cost peanuts in the grand scheme of things.
This guy is a good resource: Brent Ozar https://www.brentozar.com/
Good luck!

level 4
[deleted]
1 point·6 months ago
Thank you. Much appreciated.

level 1
IntentionalTexan
2 points·6 months ago
Hybrid Azure/On-Prem with availability group. What kind of license do you have?

level 2
[deleted]
1 point·6 months ago
Right now we have just on-prem per-core SQL licensing. We have sufficient licenses, though, to cover our cores plus more. But... we did talk about Azure some as well. We are invested in AWS right now, but from what I know or believe, Azure does better with Microsoft SQL in the cloud. So we are open to that option as well. It would just take us time to spin up an environment to get going with it.
If we went this route, would both our on-prem server and the Azure one have to be active or "hot"? I would imagine they would, to make sure they have the most up-to-date records and such on them, but I'm just not familiar with SQL availability groups.

level 3
IntentionalTexan
3 points·6 months ago
Data is written to both instances of the DB. You can choose synchronous or asynchronous commit. You want a really solid internet connection. I think you need SA (Software Assurance) with your per-core licenses, which I don't believe you can purchase after the fact. SQL Server Standard or Enterprise? Azure is better for SQL because you can run an instance natively; in AWS you have to have an OS.
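Not exact syntax for your setup, just to show where that choice lives (the AG and replica names are placeholders): commit mode is set per replica, so you could run the on-prem replica synchronous and the Azure replica asynchronous, roughly like this:

    -- On-prem replica: synchronous commit (commits wait for the log to harden on this replica)
    ALTER AVAILABILITY GROUP [ProdAG]
    MODIFY REPLICA ON N'ONPREM-SQL01'
    WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

    -- Cloud replica: asynchronous commit with manual failover, so WAN latency doesn't slow commits
    ALTER AVAILABILITY GROUP [ProdAG]
    MODIFY REPLICA ON N'AZURE-SQL01'
    WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT, FAILOVER_MODE = MANUAL);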

level 1
Euroranger
2 points·6 months ago
Mitigate a corrupted backup, or have a better solution for recovering from a lost backup? Corruption happens for a number of reasons, but you can offset the effects by mirroring the database. You could implement temporal tables along with that mirroring, and perhaps increase the frequency of your backups so that, in the event of a corrupt .bak file, you aren't having to reproduce so much data/transaction history when you rebuild.
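If temporal tables are new to you, the gist (table and column names below are just an example, and it needs SQL Server 2016 or later) is that SQL Server keeps every prior row version in a history table, so you can query the data as of a point in time even if current rows get damaged or lost:

    -- System-versioned (temporal) table: SQL Server maintains the history table automatically
    CREATE TABLE dbo.Orders
    (
        OrderID    int           NOT NULL PRIMARY KEY CLUSTERED,
        CustomerID int           NOT NULL,
        Amount     decimal(18,2) NOT NULL,
        ValidFrom  datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
        ValidTo    datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
        PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.OrdersHistory));

    -- Read the table as it looked at a given moment
    SELECT * FROM dbo.Orders
    FOR SYSTEM_TIME AS OF '2019-06-01T08:00:00';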

level 2
[deleted]
1 point·6 months ago
I was thinking of mitigating a corrupted backup. Mirroring has saved us on the one database in the past, so it is a viable option that we know. It's just the doubling of our databases (size-wise) that is a bit of an issue. But if it is "the" option, then we can push for it. Thanks.

level 1
BussReplyMail
2 points·6 months ago
OK, so first off, unless it was the master database that wouldn't come up (and if it was, you wouldn't have gotten into SQL at all), you shouldn't have had to rebuild on a new server.
A couple of possibilities, then we'll get to suggestions to mitigate the potential of this happening in the future.
- Is / was this a BIG (several hundred GB) database? If so, that will impact how long it takes a DB to finish crash recovery.
- Is it possible that, prior to the crash, there was a large, long-running transaction happening? Again, this would impact how long it would take the DB to recover (SQL would have to roll back the transaction by replaying the transaction log).
Now, to suggestions to mitigate the problem.
- Regularly run DBCC CHECKDB against the database. Look at using the Ola Hallengren maintenance script for this (I use it myself); see the sketch after this list.
- It sounds like your backups work, so plan ahead and regularly test restoring said backups to another server. How regularly depends on your resources.
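For what it's worth, once the Ola Hallengren MaintenanceSolution.sql is installed, the integrity check is just a stored procedure call you can schedule (the parameter values shown are the common ones; check his docs for current options):

    -- CHECKDB against all user databases, logging results to the CommandLog table
    EXECUTE dbo.DatabaseIntegrityCheck
        @Databases = 'USER_DATABASES',
        @CheckCommands = 'CHECKDB',
        @LogToTable = 'Y';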
Now, presuming the database was actually stuck in recovery and wasn't going to come out, that it was actually corrupt, there are ways to force a DB in that state offline; then you can drop it / delete the files on disk. Once you've done that, you could start restoring (on the same server) from your backups. Likely, this would've reduced your downtime.
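Something along these lines (again, the database and file names are placeholders, and test it before you ever need it in anger):

    -- Force the stuck database offline and get rid of it
    ALTER DATABASE SalesDB SET OFFLINE WITH ROLLBACK IMMEDIATE;
    DROP DATABASE SalesDB;   -- or delete the mdf/ldf files from disk if the drop won't go through

    -- Restore the last full backup without recovery, then roll the log backups forward
    RESTORE DATABASE SalesDB
    FROM DISK = N'\\backupshare\SalesDB\SalesDB_full.bak'
    WITH NORECOVERY, REPLACE;

    RESTORE LOG SalesDB
    FROM DISK = N'\\backupshare\SalesDB\SalesDB_log_1.trn'
    WITH NORECOVERY;

    -- ...apply the remaining log backups in order, then bring it online
    RESTORE DATABASE SalesDB WITH RECOVERY;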
OK, finally noticed the part about TB-sized databases. That does complicate matters for you; you might need to look at setting up maintenance windows to run DBCC, as it could take a LONG time. Which, presuming the DB you had a problem with is that big, is also a distinctly likely reason for it appearing to be "stuck" in recovery (I've got a DB that's a couple hundred GB and it takes a few minutes to come up from a server restart).
(Source: I'm a DBA overseeing ~8 SQL instances and quite a few databases on those instances.)

level 2
[deleted]
1 point·6 months ago
Wow, thanks for the suggestions. Much appreciated. Yeah, these were several-hundred-GB databases, so they are very large. We need to tame those down a bit, and that is one thing we are going to try to do.
It appears that running DBCC is a common thing coming up from various people, so it is good to know we need to focus on that. Thanks for the script recommendation.
Yeah, we didn't know how to offline or detach the database when this was happening. I know the DBA tried a few different things, and it might have made things worse. I believe things were stuck rolling back, if I heard correctly, so that might have prevented us from restoring a known good backup in its place. It's good to know that there are ways to formally do this. I will make sure to keep this in our toolbox if it ever happens again. Thanks again.

level 1
Cal1gula
Database Administrator · 1 point·6 months ago
What did you do to cause the corruption? You should always do a standard reboot on a database
server. Never forced.
But not having corruption to begin with is easier than fixing it.

level 2
[deleted]
2 points·6 months ago
I stopped the SQL service and shut down the server in Windows normally. I thought that should be clean enough.

level 3
Cal1gula
Database Administrator · 2 points·6 months ago
Should be fine. Were there any odd errors in the Windows or SQL logs? I was at a company where they rebooted SQL Server weekly (don't ask...) for years and never had any problems with a standard Windows shutdown. Running DBCC CHECKDB is a good practice, but it would also probably be useful to have an idea of how the corruption got there to begin with.

level 1
RobinShanab
1 point·6 months ago
Backups are key to dealing with corruption. If you have a good, recent backup, it is almost always a better option than trying to repair the corruption. To mitigate database corruption/downtime, don't do the following:
- Don't Panic
- Don't Detach the Database
- Don't Restart SQL Services
- Don't Reboot the Machine
- Don't Start by Trying to Repair the Corruption
Diagnosing database corruption:
- SQL Server error logs
- Crash dumps
- DBCC CHECKDB
- Windows event logs
  - Check the Windows System event log for any system-level, driver, or disk-related errors.
- msdb.dbo.suspect_pages (see the query sketch after this list)
- Understanding the root cause
  - If the RCA is a bad disk, then repairing the corruption won't help, as the problem will simply happen again once that sector is used.
  - You need to know how widespread the corruption is.
  - This prevents redundant, wasteful, or damaging actions.
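A quick triage query for the suspect-pages check (the table really is msdb.dbo.suspect_pages; the rest is just one way to read it), plus a lighter physical-only check that is often feasible on very large databases when a full CHECKDB won't fit in the maintenance window:

    -- Any pages SQL Server has already flagged as damaged (checksum, torn-page, or I/O errors)
    SELECT DB_NAME(sp.database_id) AS database_name,
           sp.file_id, sp.page_id, sp.event_type, sp.error_count, sp.last_update_date
    FROM msdb.dbo.suspect_pages AS sp
    ORDER BY sp.last_update_date DESC;

    -- Checksum/physical-structure check only; much faster than a full CHECKDB on TB-sized DBs
    DBCC CHECKDB (SalesDB) WITH PHYSICAL_ONLY, NO_INFOMSGS;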
You can also check the following article written by a Microsoft SQL MVP: https://social.technet.microsoft.com/wiki/contents/articles/51634.sql-server-troubleshooting-how-to-detect-and-speedily-fix-sql-database-corruption.aspx

level 2
[deleted]
1 point·6 months ago
I think that's where things went wrong. Yeah, the panic set in and the DBA started trying all of those things you mentioned.
Thanks for the steps needed to help fix those issues. We will definitely keep them in our stock for next time. Along with those are the backups, as you mentioned. I want to make sure we have at least two separate good backups for our databases. I think that is key as well, like you pointed out.
