I'm borrowing Calvin's previous title and tag as it fits so well.
Client calls in and says that the application is losing connection to the database and only a reboot of their server fixes it. I login remotely and find multiple entries in the Windows event log regarding SQL anywhere unable to read from file, Bad Block on disk, and LSI Controller error, most of which correspond to unplanned shutdown events.
I promptly make a backup image using the Sybase Central GUI, copy it to my computer and call both my point of contact and their IT guy.
A week or so later, same point of contact at the client emails me about an unrelated application question. I answer it and ask if anything has been done to the server -- "No everything has been fine since we talked to you."
September 08 2011:
Same client calls in this morning. The database service will not start.
This time, while onsite, he unplugged the drive from the RAID controller and plugged it directly into the motherboard and ran chkdsk again, which came back clean again, started the server in this configuration and left. After 4 hours the server crashed again and the drive now reports failures during chkdsk. The drive is mountable and bootable, so most of the data was copied to a new drive, but one of the files that fails to copy with a read error is of course the database file.
So after evaluating all of this, knowing I have a good backup from the 18th, I try to start the daily backup image files created by an event each night. All of these files come back with "This file is not a valid database file."
I ended up starting the 08-20-2011 copy he had made manually as it was the latest available that would start.
So my questions:
This is the text of the backup event running on a database starting as a service using SQLA 126.96.36.19924 -- dest is a string set as the backup location + what day it is
BACKUP DATABASE DIRECTORY dest TRANSACTION LOG TRUNCATE;
You didn't mention if you ran DBVALID on your backup on the 18th of August. If you did not, that would have been your first error - after seeing messages in the Windows event log related to disk I/O issues, it would have been prudent to not only take the backup but also validate its contents.
Point 2 - unless your performance requirements are high, I would enable CRC checking (page checksums) on all database pages so that (even) transient disk failures that occur on a page read from disk will cause an assertion. You should realize that most OS's and file systems can go with unreported file system and page corruption; see articles here and here.
Point 3 - You don't say how often you test your recovery strategy. It's fine to have a backup; it's another thing altogether to make sure that you can use it.
Point 4 - if you have all of the (untruncated) transaction logs, you can take an older (hopefully valid) backup and then apply all of the logs in order to get to the same state.
Point 5 - chkdsk doesn't check everything. Actually it checks very little. It does nothing to ensure that individual pages in a file are not corrupt from media failure.
Make copies of everything before you start any resolution.
Can you make the file valid? Possibly, but likely only with the help of technical support. To get the data out yourself, you'll need to be able to get to the point that a server can start the database. Once you get there, you can try to unload as much of the data that you can; it may be, for example, that other corruption has occurred with indexes, so the data pages themselves are fine.
Regarding (1), I can't say too much. If the backups do not have an off-site (or at least separate medium) location, I would suggest putting one in place.
For (2), there are a set of bytes in the database file that can mention a database is valid. These, internally, are only modified for certain events (e.g. upgrade db). If these are not reading correctly due to physical corruption then the possibility of (3) may not even be a desirable option.
Lastly, for (4), you can run a validation on the backup after is has been completed. Keep in mind with my suggestion for (1) a valid backup will only remain valid if the medium it is stored upon does not have a failure.
As for recovery options, if you have all of the transaction logs available (from the last successful backup), you may try to translate them with dbtran and apply them to the database. You can also apply just the logs to the database, but I would recommend dbtran first (as the logs themselves may be corrupt given the failures you have already noted). And, be sure to make a copy of your backup before applying the dbtran or transaction logs!
Hope this helps!
answered 08 Sep '11, 15:25
One option when you run into the "This file is not a valid database file." is to use dbtran to convert the transaction log into a SQL command file which you could apply to your latest working database. But if your backup of the transaction log is 0 kB I guess your out of luck:-(
Best regards Ove Halseth
answered 09 Sep '11, 05:16