Title isn't that great, so feel free to change.
Basically, I will describe an issue we had this week. I want to get responses from everyone on what you would have done in the scenario. I won't post what we did at first, but will later on.
If anyone would like more details, let me know.
Parts of what I would have done (I should say, what I *have* done during one SQL Remote consolidated crash some years ago...) include:
As stated, we had a similar crash some years ago (as tech support realized later, it was based on a bug in the server code), resulting in a database file that was usable and a log that was valid by itself (DBTRAN ran with no problems), but the two didn't fit together. With help from Tech Support, we could keep using the database file: we declared the current log an online-backup log and started a new log. Adapting the current log offsets was the key here. So we had no need to rebuild the database, only had to re-extract a few remotes, and had no data loss at all. (And we would always have been able to restore from a nightly backup, so it was not a very painful experience.) But the procedure and tests took quite a while.
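For reference, the "start a new log" step is done with the dblog utility. The sketch below only prints the command it would run (the file names are made up, and the exact offset switches were the part Tech Support had to guide us through, so I deliberately leave them as a comment rather than guess):

```shell
# Dry-run sketch: point the database at a fresh transaction log after
# declaring the old one a backup log. Printed instead of executed so the
# sketch is safe to run anywhere.
DB="cons.db"          # illustrative file names
NEWLOG="cons_new.log"
# dblog -t sets the transaction log name the database expects; the log
# offset adjustments use dblog's offset switches -- check your version's
# docs, since getting those wrong is exactly what Tech Support helps with.
DBLOG_CMD="dblog -t $NEWLOG $DB"
echo "$DBLOG_CMD"
```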
Our conclusion was an optimized backup/restore plan (I guess as everybody comes up with after such a crash...) with the following points:
So in case of a server crash (which happened, for the very same bug, again some weeks later...), we are able to do a very quick restore by going back to the full backup and the latest valid log. Only the remotes who might have replicated in the last half hour might need to get re-extracted.
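The half-hourly live log backup implied above can be taken with dbbackup; -t backs up the transaction log only and -r renames and restarts the live log. The connection string and paths below are illustrative, and the command is printed rather than executed:

```shell
# Dry-run sketch of an incremental (log-only) online backup:
#   -t : back up the transaction log only
#   -r : rename and restart the live transaction log
BACKUP_CMD='dbbackup -c "uid=dba;pwd=sql" -t -r /backup/logs'
echo "$BACKUP_CMD"
```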
IMHO, bringing up a system fast is usually as important as finding out what went wrong :)
answered 02 Mar '10, 10:37
Assumption: The consolidated dbsrv9.exe is down, having just crashed on the assertion.
First thing: Shut off dbremote.exe at the consolidated end, and stop it from starting again for the time being.
Second thing: Make a full file copy of the crashed consolidated *.db and *.log. That way, no matter how much WORSE things get during attempts to get things going again, you can get the consolidated database back to the point it was right after the crash.
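That file copy can be scripted; the sketch below (file and folder names are made up) copies every *.db and *.log into a timestamped safety folder, demonstrated here on dummy files:

```shell
#!/bin/sh
# Copy all *.db and *.log files into a timestamped safety folder so that
# later recovery attempts can never make things worse than the crash did.
snapshot_db() {
    src="$1"
    dest="$src/snapshot_$(date +%Y%m%d_%H%M%S)"
    mkdir -p "$dest"
    cp "$src"/*.db "$src"/*.log "$dest"/ && echo "$dest"
}

# Demo on dummy files; a real run would point at the consolidated's folder.
demo=$(mktemp -d)
touch "$demo/cons.db" "$demo/cons.log"
snap=$(snapshot_db "$demo")
ls "$snap"
```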
Third thing: Follow the steps in this V9 Help topic (starting at step 3, since 1 and 2 have been covered by the Assumption and Second Thing above):
SQL Anywhere Server - SQL Usage » Remote Data and Bulk Operations » Importing and exporting data » Rebuilding databases » Rebuild databases involved in synchronization or replication
Assumption: Everything is OK now <g>
( I know, I lose points for all the Assumptions :)
answered 01 Mar '10, 08:43
Here's what I did...
As soon as the database went down, I made sure the replication service was stopped and wouldn't start again until I wanted it to.
Then, find the cause of the problem. I was able to get a backup by restarting the database and triggering the event. Once this was finished, I worked on the backup to test; my first step was running dbvalid against it. I found the problem was in a heavily populated table (in our world, at least). Naturally, as soon as it tried to validate that table, the backup db crashed.
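For anyone following along, that validation step is just dbvalid pointed at the backup copy. The connection parameters below are illustrative, and the command is printed rather than executed:

```shell
# Dry-run sketch: always validate the backup copy, never the production
# file, since validating a corrupt table can crash the engine (as it did here).
VALID_CMD='dbvalid -c "uid=dba;pwd=sql;dbf=copy_of_cons.db"'
echo "$VALID_CMD"
```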
Now I knew what table it was, but what range of data was it? I followed the Sybase technical document to figure that out (it took forever).
Once I knew exactly where the problem was, I unloaded every bit of data I could. What I couldn't get from this database, I was able to get from the one site that replicates all data.
So. Now I know where the problem is, and I have the data to rebuild.
Remove said table from all publications.
Create a new table.
Create new publication with just this table, subscribe all the sites, and simply 'start' them (Not Synchronize).
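The last three steps above would look roughly like the following SQL Remote statements (table, publication, and user names are made up; shown as a printed here-document so the sketch is runnable anywhere):

```shell
# The SQL Remote statements (illustrative names) for the republish steps,
# printed rather than executed:
SQL_FIX=$(cat <<'EOF'
ALTER PUBLICATION main_pub DELETE TABLE big_table;       -- remove table from publication
CREATE PUBLICATION big_table_pub (TABLE big_table_new);  -- new publication, just this table
CREATE SUBSCRIPTION TO big_table_pub FOR remote_user1;
START SUBSCRIPTION TO big_table_pub FOR remote_user1;    -- START, not SYNCHRONIZE
EOF
)
printf '%s\n' "$SQL_FIX"
```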
answered 02 Mar '10, 13:05
It's not clear to me that you couldn't have recovered from your last valid backup by applying the backed up transaction logs. If the log files that cover the time from your last valid backup to the point of failure don't exist, then your approach sounds fairly reasonable. Since you don't seem to have had to re-extract your users, I expect you had all of the necessary log files to recover from.
There are a few things that could have been done "before the fact" to both better protect your database and to ensure that you wouldn't risk breaking replication. The most important option that I would always recommend using at the consolidated site is the dbremote -u switch, to replicate only backed-up transactions. Using this switch prevents the consolidated from sending out any messages for operations that occurred after the last backup. This ensures 2 things:
If you combine the dbremote -u switch with scheduled incremental transaction log backups on the consolidated as Volker discussed, then you will still keep your replication latency low while astronomically increasing the recoverability of your replicated system.
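Put together, the combination might look like this on the consolidated: a cron-style schedule for the log-only backup plus dbremote started with -u. Connection strings and paths are illustrative, and both lines are printed rather than executed:

```shell
# Dry-run sketch: half-hourly log-only backups (-t) with log rename/restart
# (-r), and dbremote limited to backed-up transactions (-u).
CRON_LINE='*/30 * * * * dbbackup -c "uid=dba;pwd=sql" -t -r /backup/logs'
REMOTE_CMD='dbremote -u -c "uid=dba;pwd=sql"'
echo "$CRON_LINE"
echo "$REMOTE_CMD"
```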
The other configuration choices to consider include transaction log mirroring and a high availability configuration on the consolidated. Granted that high availability wasn't available in version 9, you would still have the option of using a transaction log mirror.
answered 12 Mar '10, 15:06