The title isn't that great, so feel free to change it. Basically, I will describe an issue we had this week. I want to get responses from everyone on what you would have done in this scenario. I won't post what we did at first, but will later on.

Scenario:

If anyone would like more details, let me know.

Additional Information:
asked 26 Feb '10, 17:59 Calvin Allen
Parts of what I would have done (I should say: what I actually did during a SQL Remote consolidated crash some years ago...) include:
As stated, we had a similar crash some years ago (caused, as Tech Support later realized, by a bug in the server code), resulting in a database file that was usable and a log that was valid by itself (DBTRAN ran with no problems), but the two didn't fit together. With the help of Tech Support, we could keep using the database file: we "declared" the current log to be an online-backup log and started a new log. Adapting the current log offsets was the key here. So we had no need to rebuild the database, only had to re-extract a few remotes, and had no data loss at all. (And we would always have been able to restore from a nightly backup, so it was not a very painful experience.) But the procedure and tests took quite a while. Our takeaway was an optimized backup/restore plan (as everybody builds after such a crash, I guess...) with the following points:
So in case of a server crash (which happened again some weeks later, for the very same bug...), we are able to do a very quick restore by going back to the full backup and the latest valid log. Only the remotes that might have replicated in the last half hour might need to be re-extracted. IMHO, bringing up a system fast is usually as important as finding out what went wrong :)

answered 02 Mar '10, 10:37 Volker Barth
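A restore along those lines might be done with the standard SQL Anywhere tools. The following is only a sketch under assumptions: paths, connection credentials, and the specific renamed-log file names (which follow the usual YYMMDDxx.log pattern) are made up for illustration, not taken from the thread.

```shell
# Start from the most recent full image backup (hypothetical paths)
cp /backup/full/cons.db /restore/cons.db

# Apply each backed-up transaction log in chronological order;
# "-a" runs the server in recovery mode against one log file,
# applies it, and then shuts down
dbsrv9 /restore/cons.db -a /backup/logs/100226AA.log
dbsrv9 /restore/cons.db -a /backup/logs/100226AB.log

# Apply the latest valid live log last, then start the server normally
dbsrv9 /restore/cons.db -a /restore/cons.log
dbsrv9 /restore/cons.db
```

The ordering matters: each log must be applied in the sequence it was generated, or the server will reject it based on the log offsets.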
That last point, bringing up a system fast, sometimes means breaking replication and re-extracting. Finding out what went wrong is often a very long process, and there is no guarantee you will EVER find out what really went wrong... and if you do, it often turns out to be a unique, one-time, likely-never-to-be-repeated affair... so the whole search for what went wrong was pointless. Personally, my track record for Finding Out Why Replication Broke (a slightly different issue than the one being discussed here) is effectively zero.
...which is one of the main reasons I like MobiLink: when the consolidated breaks, you fix it and carry on synchronizing, no steenking transaction log involvement. No re-extracting 1,000 remotes. No having to explain to management or a client why you have absolutely no idea why it broke, and never will, because you're too busy re-extracting.

@Breck: With a good backup/restore plan one ought to be able to bring up a system fast without having to re-extract many remotes (of course this assumes your remotes do not replicate every few minutes). I think "breaking replication" should only affect those remotes that have replicated after the point (i.e. the consolidated's log offset) to which one can recover, and I would always prefer to re-extract some remotes over rebuilding the whole system.

@Breck: And I absolutely agree w.r.t. the questionable value of "finding out what broke the system". It's worth the attempt, but that should be done after the system is up and running again. IMHO, one should not have to decide between "Do I try to recover, or do I try to find the error condition?" - I guess when a system is down, that's typically not the situation in which most folks are in the best condition for a complex replication analysis :)

And another addition: In our particular case, the fastest solution would have been to start the broken 8.0.3.x database with dbsrv9 and stop it immediately. That would have "healed" the problem, as the server bug was fixed in 9.x, and the database/log would have run with 8.0.3 just fine again. - I only found that out afterwards :( - So another tool in such a situation might always be "Try with the latest EBF / the latest version that can load that database without upgrading"...

@Volker: ...as long as you back up the failing database first. Can't repeat that too many times. Recovery efforts have a long track record of making a bad situation worse, and you always want to be able to "try again".
Assumption: The consolidated dbsrv9.exe is down, having just crashed on the assertion.

First thing: Shut off dbremote.exe at the consolidated end, and stop it from starting again for the time being.

Second thing: Make a full file copy of the crashed consolidated *.db and *.log. That way, no matter how much WORSE things get during attempts to get things going again, you can get the consolidated database back to the point it was at right after the crash.

Third thing: Follow the steps in this V9 Help topic (starting at step 3, since 1 and 2 have been covered by the Assumption and the Second Thing above): SQL Anywhere Server - SQL Usage » Remote Data and Bulk Operations » Importing and exporting data » Rebuilding databases » Rebuild databases involved in synchronization or replication

Assumption: Everything is OK now <g> (I know, I lose points for all the Assumptions :)

answered 01 Mar '10, 08:43 Breck Carter

Well, I wish I had known about those options before I implemented. Oh well, now I do!

OK, now you have to tell us what you did do :)
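Breck's "second thing" (snapshot the crashed files before touching anything) could look like the sketch below. The file names and the snapshot directory are assumptions for illustration; stopping the SQL Remote task/service beforehand is site-specific and not shown.

```shell
# Snapshot the crashed consolidated files before any recovery attempt
# (file names and snapshot directory are hypothetical).
mkdir -p ./crash-snapshot
cp cons.db cons.log ./crash-snapshot/

# Verify the copies are byte-identical before proceeding;
# cmp exits non-zero if the files differ
cmp cons.db ./crash-snapshot/cons.db
cmp cons.log ./crash-snapshot/cons.log
```

The verification step is cheap insurance: if a later recovery attempt makes things worse, you want certainty that the snapshot you fall back to is a faithful copy.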
Here's what I did...

As soon as the database went down, I made sure the replication service was stopped and wouldn't start until I wanted it to. Next: find the cause of the problem. I was able to get a backup by restarting the database and triggering the backup event. Once this was finished, I worked on the backup copy for testing; my first step was running dbvalid against it. I found it was a heavily populated table (in our world, at least). Naturally, as soon as dbvalid tried to validate that table, the backup db crashed. Now I knew which table it was, but what range of data was it? I followed the Sybase technical document to figure that out (took forever). Once I knew exactly where the problem was, I unloaded every bit of data I could. What I couldn't get from this database, I was able to get from the one site that replicates all data.

So: now I know where the problem is, and I have the data to rebuild. What next?

- Remove said table from all publications.
- Create a new table.
- Create a new publication with just this table, subscribe all the sites, and simply START them (not SYNCHRONIZE).

answered 02 Mar '10, 13:05 Calvin Allen
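The republish-and-restart steps at the end could be expressed as SQL Remote statements run through dbisql. This is a sketch: the publication, table, and subscriber names below are made up for illustration, and "***" stands in for the real password.

```shell
# Hypothetical names throughout; run once against the consolidated
dbisql -c "ENG=cons;UID=dba;PWD=***" "ALTER PUBLICATION main_pub DROP TABLE big_table"
dbisql -c "ENG=cons;UID=dba;PWD=***" "CREATE PUBLICATION big_table_pub (TABLE big_table)"

# Subscribe each remote user and START (not SYNCHRONIZE) the
# subscription, since the remotes already hold the reloaded data
dbisql -c "ENG=cons;UID=dba;PWD=***" "CREATE SUBSCRIPTION TO big_table_pub FOR site001"
dbisql -c "ENG=cons;UID=dba;PWD=***" "START SUBSCRIPTION TO big_table_pub FOR site001"
```

START SUBSCRIPTION is the key choice here: it begins logging changes for the subscriber without sending the existing rows, which is exactly what you want when the remote copies are already current.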
It's not clear to me that you couldn't have recovered from your last valid backup by applying the backed-up transaction logs. If the log files that cover the time from your last valid backup to the point of failure don't exist, then your approach sounds fairly reasonable. Since you don't seem to have had to re-extract your users, I expect you had all of the necessary log files to recover from. There are a few things that could have been done "before the fact" to both better protect your database and to ensure that you wouldn't risk breaking replication. The most important option that I would always recommend using at the consolidated site is the dbremote -u switch, to replicate only backed-up transactions. Using this switch prevents the consolidated from sending out any messages for operations that occurred after the last backup. This ensures 2 things:
If you combine the dbremote -u switch with scheduled incremental transaction log backups on the consolidated, as Volker discussed, then you will still keep your replication latency low while astronomically increasing the recoverability of your replicated system. The other configuration choices to consider include transaction log mirroring and a high availability configuration on the consolidated. Granted that high availability wasn't available in version 9, you would still have the option of using a transaction log mirror.

answered 12 Mar '10, 15:06 Rob Waywell

Rob, while I see the benefit in using "-u", there are situations where this doesn't work well. For example, in our setup the consolidated does some kind of complex calculation/reporting based on newly sent-in data and sends the results back to the remotes. So we have to run dbremote on the cons twice in a short cycle (say, 1-2 minutes after an incoming message, which triggers the calculation). In that situation, it's simply not useful to do a backup before each dbremote run. Instead we do log backups every 30 minutes. - So I think "always use -u" is too general a recommendation :)
@Volker - MobiLink would be the preferred architecture for your scenario, since it provides for that backend processing to happen within a single synchronization session. I expect your system pre-dates MobiLink. By choosing not to run with -u you are creating a failure point that can force you to re-extract remotes. If you have a few hundred or a few thousand remotes, that risk likely isn't acceptable. In other words, the recommendation to always run -u on the consolidated is still the right technical recommendation. Folks may choose not to use it for other business reasons.

@Rob: I agree with your general recommendation. In our particular case, MobiLink might really fit better - when the system was set up, MobiLink was already available (and we have used it to synchronize the ASA cons with a MS SQL database), but we were (and are) more experienced with SQL Remote. In our case, the omission of -u bears the actual risk that a few remotes (usually 0-10) might have to be rebuilt, which is tolerable for us, particularly as we can always re-apply their changes, because we log all incoming statements. But as you state, this is certainly a non-standard situation.
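In command-line terms, the -u setup being discussed might look like the sketch below. The connection strings, paths, and the 30-minute schedule are illustrative assumptions; in practice the backup would typically run from a scheduled database event or an OS scheduler.

```shell
# Every 30 minutes: incremental transaction log backup only (-t),
# renaming the live log and starting a fresh one (-r); -y replaces
# existing files without prompting (hypothetical paths/credentials)
dbbackup -c "ENG=cons;UID=dba;PWD=***" -t -r -y /backup/logs

# Run SQL Remote so it only sends operations from backed-up
# (renamed) logs: a restore to the last backup then can never
# leave the consolidated behind what the remotes already received
dbremote -c "ENG=cons;UID=dba;PWD=***" -u
```

This is the trade-off the thread describes: replication latency is bounded by the backup interval, but in exchange a restore of the consolidated cannot strand remotes that already consumed messages.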
Question 1: Are you using SQL Remote or MobiLink to synchronize with your remote sites? Question 2: Do you have a recent backup at the central site?
It's always useful to know the exact version and build number and, in the case of assertion(s), the exact assertion number and message. Without this information, it is impossible to make any suggestions.
In addition, are the assertions seen on the consolidated, a single remote, or multiple locations?
And more questions: If your consolidated is corrupt (so you might need a backup), when did your remotes last replicate? IMHO, the question of how many remotes would be out of sync if a backup at the cons had to be restored is one of the most important ones...
I think I have answered all the questions so far.
Retag remark: I just think that this is very much related to SQL Remote (or replication in general), so I added that tag (and had to remove "backup-and-recovery"). Restoring a non-replicated database is so much easier...
@Calvin: I wanted to change the title, and I wanted to change the tag what-would-you-do-if... but at 100+ views you can't argue with success! Now I think the title and the tag are brilliant!