Any suggestions that anybody might have on this situation we have run into with replication would be greatly appreciated.
I had been told months ago that one of our end users who had been having problems with replication didn't need a new extraction, and that his replication was running A.O.K.!!
The mistake I made was that I took his word on that. Today I get a call from that user telling me that a job he had asked for access to was not showing up in his database. I did a little digging and determined that his replication has not run successfully since Oct 18, 2010.
I was stunned. Either way, it's my job to fix this. The problem we have now is that his database has not communicated with the consolidated db in so long that I need to make sure any critical data he changed or added gets moved over to the consolidated database. I translated his log file for his user only, and was able to see the work he had done. However, before I go dissecting this data one entry at a time, I'm wondering if there is something else I might be able to do to either get past this problem or brute-force it.
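For reference, the translation step was done with the dbtran utility, something like the sketch below. The flags are from memory and may differ in older versions, so verify them with dbtran -? first; in particular, the -u user filter is my recollection, not something I re-checked in the v6 docs, and the file names are placeholders.

  REM Translate the remote's transaction log into SQL, limited to one user.
  REM remote.log and remote_user.sql are placeholder file names; -u (limit
  REM output to operations by the listed user) should be verified with
  REM dbtran -? on your version.
  dbtran -u his_user_name remote.log remote_user.sql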
What's happening is that every time he runs replication, the following error shows up in his replication log:
E. 06/01 19:47:07. Missing message from "consolidated_pub" (578-02749223586-0)
I. 06/01 19:47:14. Scanning logs starting at offset 0060076439
I. 06/01 19:47:14. Processing transactions from active transaction log
I. 06/01 19:47:15. Sending message to "consolidated_pub" (0-0060076439-0)
I. 06/01 19:47:15. Execution completed
No matter what, it's the same "missing message" message every time. When replication is run at the server, it creates 1,200+ files in his replication mailbox. When he runs replication, those messages immediately get deleted, and a log file gets generated with a boatload of
I. 06/01 19:46:54. Received message from "consolidated_pub" (578-02749480715-0)
repeated 1,200+ times.
Our replication environment is really simple. Our end users VPN into the network, and their replication files are copied to a network share for their user. Normally when we lose files, it's because the VPN connection was lost or they decided to close the replication window while it was in the middle of replicating their data. Usually the fix is to run replication on the end user's machine, follow that up with the server, then go back and forth one or two more times, and presto chango, replication is back in sync.
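For anyone unfamiliar with that dance, one such "round trip" is just a pair of Message Agent runs, one at the remote and one at the consolidated, roughly like this (the connection parameters and file names below are placeholders, not our real ones):

  REM Run the Message Agent at the remote, then at the consolidated db,
  REM so each side sends its outstanding messages and applies what it
  REM receives. -o just captures the session output to a log file.
  dbremote -c "uid=dba;pwd=sql;dbf=remote.db" -o remote_repl.log
  dbremote -c "uid=dba;pwd=sql;dbf=cons.db" -o cons_repl.log
  REM Repeat the pair until the two sides are back in sync.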
The problem is that I ran that scenario about eight times this evening, and nothing changed. However, given the large number of replication files it is creating, I wouldn't be surprised if it takes so many replication "round trips" that I could be doing this for a week solid.
For the record: in looking through the log files, it does look like the log offset changed about 5 months ago, then again about 2 months ago. So it did change a couple of times, although the last data that replicated successfully was back on 10/18/10.
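(Side note for anyone reading along: the offsets I'm referring to can be checked with the dblog utility, which reports a database's transaction log settings when run without any modifying options. That's from memory, so double-check against the v6 docs before pointing it at production:)

  REM Display transaction log information (including offsets) for the
  REM remote database file; run this while the database server is shut down.
  dblog remote.db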
If anybody has any suggestions on what we might be able to do "Reg... Anything???" :) I would really appreciate the help. This is still our production database running in ASA 6.0.4 Build 3799, for what it's worth.
Sorry for the length of the question. TIA for any suggestions!
I wanted to follow up on this thread to let everybody know the outcome.
After corresponding with Reg privately about the log files we had and what our offsets were set to for the user in question on the consolidated side, we were able to fix replication for this user.
After making the changes that Reg recommended as an "EXTREMELY UNSUPPORTED FEATURE" (which involved calling a procedure that sets offset IDs on the rebuild of a database), I started running replication at 3:40pm on Thursday 7/14. I had cleared out his replication directory, so when replication ran on the server, it generated 1,391 files for his database. I figured this would take a few hours to complete, but oh no!! Replication ran until 1:52pm on Friday 7/15, so it took just over 22 hours to replicate information dating back to 10/18/2010.
Not surprising, since the amount of information being replicated was quite large.
Once replication was completed, I started doing small round trip changes between his database and the consolidated database. Everything replicated without a problem.
Since we still don't know what caused this to happen, I think my next move is to create a new database for the end user in question. That way, if there is anything buggy floating around in his current database, at least the new database will clear that up.
Replication has been running since last Friday (7/15), and there have been zero errors since then.
Sorry I can't go into more detail, but due to the unsupported nature of the fix, I promised Reg that I wouldn't let the cat out of the bag. Just know that if you run into this same problem, you should be able to contact Reg and have him get you the information you would need to implement this type of fix.
answered 19 Jul '11, 15:46
So, the consolidated database indicates that it has received a confirmation message from the _7577925458804 user saying that it has successfully applied all operations up to and including log offset 2749272670. However, the _7577925458804 remote database indicates that it has only confirmed offsets up to 2749223586. The remote database is continuously asking for a resend of messages from offset 2749223586 in the consolidated database, but the consolidated believes the remote has already applied everything up to offset 2749272670, so it sends from that offset when it gets the resend request. There are a few ways this could happen, such as an incorrect recovery at the remote site, or possibly a copy of the remote database replicating a change up to the consolidated database. It's impossible at this point to deduce what caused the initial problem.
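If you want to compare what each side believes, SQL Remote tracks the per-user send and confirm offsets in the SYSREMOTEUSER system table. Here is a rough sketch of the kind of query I mean; I'm quoting the column names (log_sent, confirm_sent, log_received, confirm_received) from later documentation, so verify they exist in your v6 catalog before relying on this:

  -- Show the replication offsets tracked for one remote user.
  -- Run it on both the consolidated and the remote and compare the values.
  -- Column names taken from later-version docs; confirm them in v6.
  SELECT u.user_name,
         r.log_sent,
         r.confirm_sent,
         r.log_received,
         r.confirm_received
    FROM SYS.SYSREMOTEUSER r
    JOIN SYS.SYSUSERPERM u ON u.user_id = r.user_id
   WHERE u.user_name = 'the_remote_user';  -- placeholder user name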
If you still have the offline logs that include offsets 2749223586 through 2749272670, I might be able to get you fixed up without a re-extract, but I'm not 100% sure my secret process will work back in v604.
Please post back and let me know if you have the offline logs with the offsets listed above. If you have set the delete_old_logs database option, it's likely that dbremote has already deleted the logs, unfortunately. I'll check in later tonight from home.
answered 03 Jun '11, 16:29
Of course I won't get in the way when Reg is going to solve the problem, and surely he will :)
From the few details I can see here, I would suspect that the cons has received a confirmation for a sent message (possibly up to offset 578-02749480715-0) from the remote. Therefore, the cons won't send any older messages to the remote.
Now, however, the remote seems to be sure it has only received offset 578-02749223586-0 from the cons, and asks for messages starting at that offset. The newer messages the cons sends are therefore rejected, and a missing message is reported.
We have often noticed such behaviour (which cannot be solved by SQL Remote's regular message system) when remote users restore their database from a backup or re-install the original database state, and by doing so accidentally put the remote into a "previous state". It is then out of sync, apparently.
We usually do a re-extract then. If there has been relevant remote data entry, then getting the translog and applying all the outstanding work to the cons is something we have done a few times (and with very much care)...
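For reference, the re-extract itself is done with the dbxtract utility against the cons, roughly like this (the credentials, paths, and user name are placeholders; check dbxtract -? for the exact syntax of your version):

  REM Re-extract the remote user's database from the consolidated db.
  REM c:\extract is the output directory for the reload files, and
  REM the_remote_user is the subscriber being extracted.
  dbxtract -c "uid=dba;pwd=sql;dbf=cons.db" -v c:\extract the_remote_user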
And now I'm interested in what you and Reg will discover :)