Database Recovery: What Would You Do If...

Title isn't that great, so feel free to change.

Basically, I will describe an issue we had this week. I want to get responses from everyone on what you would have done in the scenario. I won't post what we did at first, but will later on.

Scenario:
A 35GB SQL Anywhere 9 database with 1 remote site that receives all data, and around 190 remote sites that receive limited data. Two assertion failures in one week. One, which was an index issue. The second, a little more severe - an invalid page.

If anyone would like more details, let me know.

Additional Information:

SQL Remote
The backup was from two days prior as the one from the night before was lost due to the failure.
9.0.2.3527
Assertion failed: 201501 (9.0.2.3527) Page for requested record not a table page or record not present on page
Only the consolidated
Replication runs as a service on the server, and by a user triggering it on the laptops, so any number of sites could have replicated at any time.

asked 26 Feb '10, 17:59

Calvin Allen
1.5k●23●26●38
accept rate: 25%

edited 20 Sep '11, 17:37

Volker Barth
40.2k●361●549●822

Question 1: Are you using SQL Remote or MobiLink to synchronize with your remote sites? Question 2: Do you have a recent backup at the central site?

(26 Feb '10, 20:38) Chris Kleisath

Its always useful to know the exact version and build number and in the case of assertion(s) the exact assertion number and message. Without this information, it is impossible to make any suggestions.

In addition, are the assertions seen on the consolidated, a single remote, or multiple locations?

(26 Feb '10, 20:40) Mark Culp

And more questions: If your consolidated is corrupt (so you might need a backup), when have your remotes replicated the last time? IMHO, the question who many remotes would be out-of-sync if a backup at the cons must be restored is one of the most important ones...

(28 Feb '10, 17:42) Volker Barth

I think I have answered all the questions so far.

(01 Mar '10, 00:39) Calvin Allen

Retag remark: I just think that this is very related to SQL Remote (or replication in general) so I added that tag (and had to remove "backup-and-recovery"). Restoring a not-replicated database is so much easier...

(02 Mar '10, 10:40) Volker Barth

@Calvin: I wanted to change the title, and I wanted to change the tag what-would-you-do-if... but at 100+ views you can't argue with success! Now I think the title and the tag are brilliant!

(04 Mar '10, 12:30) Breck Carter

More comments hidden

showing 5 of 6 show all flat view

4 Answers:

active answers oldest answers newest answers popular answers

Parts of what I would have done (Should say, what I have done during one SQL Remote consolidated crash some years ago...) include:

Try to catch all sent messages that have not been read by remotes (i.e. prevent remotes from reading further messages, maybe be removing access to the file link for remotes). That way you may reduce the possible number of remotes that might need to get re-extracted. And that may be the most important goal...
Test if the dabase file is startable without log (all tests done on copies, of course!), i.e. does dbsrvX mydb.db -f work?
Test if the log is valid by running DBTRAN against the log
If both are corrupt, a backup is inevitable (unless the SA support may give inside-help like "correcting" log files which is not possible for us mere users).
Otherwise, you may be able to restore the database to its current state by using a backuped db file and apply all logs (including the current one).
If the current log is unusable, try to find out by using DBTRAN if there are any transactions from remotes that may be not part of the last valid database you can restore, and apply them manually to reduce the amount of possibly lost data. (Or try to read from a consildated SQL Remote -v log output if that is available).

As stated, we had a similar crash some years ago (as was realized later by tech support, based on a bug in the server code), resulting in a database file thas was usable and a log that was valid by itseld (DBTRAN with no problems) but both didn't fit together. With help of Tech Support, we could keep using the database file and "declared the current log" as a online-backup log and started a new log. Adapting the corrent log offsets was the key here. So we had no need to rebuild the database and had only to re-extract a few remotes, and no data loss at all. (And we would always been able to restore from a nightly backup, so it was no very painful experience.) But the procedure and tests took quite a while.

Our resume was an optimized backup-/restore plan (I guess as everybody does after such a crash...) with the following points:

"Differential log backups" all 30 minutes to 2 different foldes (locally and on a net share). This is done by events that test the copied log by use of the DbTools API (DBTranslateLog()).
One nightly database backup with full validation (as before)
If the full or the log backup fails, our system stops its replication immediately in order to prevent damage (i.e. possible out-of-sync states) on remotes

So in case of a server crash (which happened for the very same bug again some weeks later...), we are able to do a very quick restore by going back to the full backup and the latest valid log. Only the remotes woh might have replicated in the last half an hour might need to get re-extraced.

IMHO, bringing up a system fast is usually as important as to find out what went wrong:)

permanent link

answered 02 Mar '10, 10:37

Volker Barth
40.2k●361●549●822
accept rate: 34%

That last point, bringing up a system fast sometimes means breaking replication and re-extracting. Finding out what went wrong is often a very long process, and there is no guarantee you will EVER find out what really went wrong... and if you do, it often turns out to be a unique one-time likely-to-never-be-repeated affair... so the whole search for what went wrong was pointless. Personally, my track record for Finding Out Why Replicaton Broke (a slightly different issue that being discussed here) is effectively zero.

(02 Mar '10, 12:42) Breck Carter

...which is one of the main reasons I like MobiLink, when the consolidated breaks you fix it and carry on synchronizing, no steenking transaction log involvement. No re-extracting 1,000 remotes. No having to explain to management or a client why you have absolutely no idea why it broke, and never will, because you're too busy re-extracting.

(02 Mar '10, 12:45) Breck Carter

@Breck: With a good backup/restore plan one ought to be able to bring up a system fast without having to re-extract much remotes (of course this assumes your remotes do not replicate every few minutes). I think "breaking replication" should only occur to those remotes that have replicated after the point (i.e. the (consolidated's log offset) to which one can recover, and I would always prefer to re-extract some remotes over rebuilding the whole system.

(02 Mar '10, 13:46) Volker Barth

@Breck: And I absolutely agree w.r.t. the questionable value of "finding out what broke the system". It's worth the attempt, but that should be done after the system is up and running again. IMHO, one should not have to decide between "Do I try to recover or do I try to find the error condition?" - I guess when a system is down, typically that's not the situation where most folks are in the best condition for a complex replication analysis:)

(02 Mar '10, 13:50) Volker Barth

And another addition: In our particular case, the fastest solution would have been: Start the broken 8.0.3.x database with dbsrv9 and stop it immediately. That would have "healed" the problem as the server bug was fixed in 9.x, and the database/log would have run with 8.0.3 just fine again. - I just found out that afterwards:( - So another tool in such situation might always be "Try with the latest EBF / the latest version that can load that database without upgrading"...

(02 Mar '10, 13:54) Volker Barth

@Volker: ...as long as you back up the failing database first. Can't repeat that too many times. Recovery efforts have a long track record of making a bad situation worse, and you always want to be able to "try again".

(04 Mar '10, 12:32) Breck Carter

More comments hidden

showing 5 of 6 show all flat view

Assumption: The consolidated dbsrv9.exe is down, having just crashed on the assertion.

First thing: Shut off dbremote.exe at the consolidated end, and stop it from starting again for the time being.

Second thing: Make a full file copy of the crashed consolidated *.db and *.log. That way, no matter how much WORSE things get during attempts to get things going again, you can get the consolidated database back to the point it was right after the crash.

Third thing: Follow the steps in this V9 Help topic (starting at step 3, since 1 and 2 have been covered by the Assumption and Second Thing above):

    SQL Anywhere Server - SQL Usage 
       » Remote Data and Bulk Operations 
          » Importing and exporting data 
             » Rebuilding databases 
                » Rebuild databases involved in synchronization or replication

Assumption: Everything is OK now <g>

( I know, I lose points for all the Assumptions :)

permanent link

answered 01 Mar '10, 08:43

Breck Carter
32.5k●541●726●1050
accept rate: 20%

Well, I wish I had known about those options before I implemented. Oh well, now I do!

(01 Mar '10, 23:39) Calvin Allen

OK, now you have to tell us what you did do :)

(02 Mar '10, 12:47) Breck Carter

flat view

Heres what I did...

As soon as the database went down, make sure the replication service was stopped and wouldnt start until I wanted it to.

Find the cause of the problem. I was able to get a backup by restarting the database and triggering the event. Once this was finished, I worked on the backup to test, which my first step was running dbvalid against it. I found it was a heavy populated table (in our world, at least). Naturally, as soon as it tried to validate that table, the backup db crashed.

Now, I know what table it is, but range of data is it? I followed the Sybase technical document to figure that out (took forever).

Once I knew exactly where the problem was, I unloaded every bit of data I could. What I couldn't get from this database, I was able to get from the one site that replicates all data.

So. Now I know where the problem is, and I have the data to rebuild.

What next?

Remove said table from all publications.
Drop all constraints/triggers/etc. from table
Rename table

Create a new table
Repopulate with data extracted from above.
Add all constraints/triggers/etc.

Create new publication with just this table, subscribe all the sites, and simply 'start' them (Not Synchronize).

permanent link

answered 02 Mar '10, 13:05

Calvin Allen
1.5k●23●26●38
accept rate: 25%

flat view

It's not clear to me that you couldn't have recovered from your last valid backup by applying the backed up transaction logs. If the log files that cover the time from your last valid backup to the point of failure don't exist, then your approach sounds fairly reasonable. Since you don't seem to have had to re-extract your users, I expect you had all of the necessary log files to recover from.

There are a few things that could have been done "before the fact" to both better protect your database and to ensure that you wouldn't risk breaking replication. The most important option that I would always recommend using at the conslidated site is the dbremote -u switch to only replicated backed up transactions. Using this switch prevents the consolidated sending out any messages for operations that occured after the last backup. This ensures 2 things:

You will always have a backup to recover from
Your remotes will always be able to resend any 'new' operations because they won't delete/truncate their own log files until they get a confirmation message back from the consolidated.

If you combine the dbremote -u switch with scheduled incremental transaction log backups on the consolidated as Volker discussed, then you will still keep your replication latency low while astronomically increasing the recoverability of your replicated system.

The other configuration choices to consider include transaction log mirroring and high availabilty configuration on the consolidated. Granted that high availability wasn't available in version 9, you would still have the option of using a transaction log mirror.

permanent link

answered 12 Mar '10, 15:06

Rob Waywell
31●1●1●6
accept rate: 0%

Rob, while I see the beneft in using "-u", there are situations where this doesn't work well. For example, in our setup the consolidated does some kind of complex calculation/reporting based on newly sent-in data and sends the results back to the remotes. So we have to run dbremote on the cons twice in a short cycle (say, 1-2 minutes after an incoming message which triggers the calculation). In that situation, it's simple not useful to do a backup before each dbremote run. Instead we do log backups every 30 minutes. - So I think "always using -u" is a too general recommendation:)

(12 Mar '10, 15:19) Volker Barth

@Volker - MobiLink would be a preferred architecture for your scenario since it provides for that backend processing to happen within the single synchronization session. I expect your system pre-dates MobiLink.

By choosing not to run with the -u you are creating a failure point that can force you to re-extract remotes. If you have a few hundred or a few thousand remotes, that risk likely isn't acceptable.

In other words, the recommendation to always run -u on the consolidated is still the right technical recommendation. Folks may choose not to use it for other business reasons.

(12 Mar '10, 16:10) Rob Waywell

@Rob: I agree with your general recommendation. In our particular case, MobiLink might really fit better - when the system was set up, MobiLink was already available (and we have used it to synchronize the ASA cons with a MS SQL database) but we were (and are) more experienced with SQL Remote. In our case, the omission of -u bears the actual risk that a few remotes (usually 0-10) might have to be rebuild which is tolerable for us, particularly as we can always rebuild their changes as we log all incoming statements. But as you state, this is certainly a non-standard situation.

(12 Mar '10, 17:41) Volker Barth

flat view

Your answer

toggle preview

community wiki:

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

*italic* or _italic_
**bold** or __bold__
link:[text](http://url.com/ "title")
image?![alt text](/path/img.jpg "title")
numbered list: 1. Foo 2. Bar
to add a line break simply add two spaces to where you would like the new line to be.
basic HTML tags are also supported

learn more about Markdown

Question tags:

asa-9 ×108
sql-remote ×103
backup ×84
recovery ×29
what-would-you-do-if ×2

question asked: 26 Feb '10, 17:59

question was seen: 4,230 times

last updated: 19 Feb, 00:03

SAP SQL Anywhere

Database Recovery: What Would You Do If...

Follow this question

Related questions