Replication WAY Out Of Sync Suggestions

Any suggestions that anybody might have on this situation we have ran into with replication would be greatly appreciated.

I had been told months ago that one of our end users that had been having problems with replication didn't need a new extraction, and that his replication was running A.O.K.!!

My mistake I made was that I took his word on that. Today I get a call from that user telling me that a job that he had asked for access to was not showing up in his database. I did a little digging and determined that his replication has not successfully ran since Oct 18, 2010.

I was stunned. Either way, it's my job to fix this. The problem we have now is that he has a database that has not communicated with the consolidated db in so long, I need to make sure that any critical data that he changed or added, gets moved over to the consolidated database. I translated his log file for his user only, and was able to see the work he had done. However, before I go trying to dissect this data, one entry at a time, I'm wondering if there is something else I might be able to do to either get passed this problem, or brute force it.

What's happening, is that every time he runs replication, the following error shows up in his replication log.

E. 06/01 19:47:07. Missing message from "consolidated_pub" (578-02749223586-0)
I. 06/01 19:47:14. Scanning logs starting at offset 0060076439
I. 06/01 19:47:14. Processing transactions from active transaction log
I. 06/01 19:47:15. Sending message to "consolidated_pub" (0-0060076439-0)
I. 06/01 19:47:15. Execution completed

No matter what, it's the same missing message "message" every time. When replication gets ran at the server, it creates about 1200+ files in his replication mailbox. When he runs replication, those messages immediately get deleted, and a log file gets generated with boat load of

I. 06/01 19:46:54. Received message from "consolidated_pub" (578-02749480715-0)

duplicate that x 1200+.

Our replication environment is really simple. Our end users VPN into the network and their replication files are copied to a network share for their user. Normally when we lose files it's because the VPN connection was lost, or they decided to just close the replication window while it was in the middle of replicating their data. Usually the fix is to run replication on the end users machine, follow that up with the server, then go back and forth one or two more times and presto chango, replication is back in sync.

The problem is that I ran that scenario about eight times this evening, and nothing changed. However, I wouldn't be surprised if because of the large number of replication files it is creating, it may take so many replication "round trips", that I could be doing this for a week solid.

For the record. In looking through the log files, it does look like the log offset changed about 5 months ago, then again about 2 months ago. So it did change a couple times. Although the last successful data that was replicated was back in 10/18/10.

If anybody has any suggestions on what we might be able to do "Reg... Anything???" :) I would really appreicate the help. This is still our production database running in ASA6.0.4 Build 3799 for what it's worth.

Sorry for the lenght of the question. TIA for any suggestions!

Jeff Gibson
Intercept Solutions - Sybase SQL Anywhere OEM Partner
Nashville, TN

dbremote replication best-practices

asked 02 Jun '11, 02:52

Jeff Gibson
2.0k●46●66●76
accept rate: 20%

edited 15 Mar '13, 21:36

Mark Culp
25.0k●10●142●298

Jeff, do you have access to both the consolidate and remote databases to do some debugging, or just one side of the equation?

(02 Jun '11, 09:32) Reg Domaratzki

Replies hidden

I have both machines here until tomorrow. Trying to get this fixed before he leaves to go out to the job site. That's when he uses his replication database. Any ideas would be great Reg!!

(02 Jun '11, 11:57) Jeff Gibson

We first need to figure out why the error is occurring, and why a resend doesn't fix it. Here's what we should do.

1) Connect to the remote database and save the contents of the sys.sysremoteusers view to a file named sru_r1_b.txt

2) Run dbremote on the remote database (no verbosity needed) and save the output file as dbr_r1.txt

3) Connect to the remote database and save the contents of the sys.sysremoteusers view to a file named sru_r1_a.txt

4) Connect to the consolidated database and save the contents of the sys.sysremoteusers view to a file named sru_c2_b.txt

5) Run dbremote on the consolidated database (no verbosity needed) and save the output file as dbr_c2.txt

6) Connect to the remote database and save the contents of the sys.sysremoteusers view to a file named sru_c2_a.txt

7) Connect to the remote database and save the contents of the sys.sysremoteusers view to a file named sru_r3_b.txt

8) Run dbremote on the remote database (no verbosity needed) and save the output file as dbr_r3.txt

9) Connect to the remote database and save the contents of the sys.sysremoteusers view to a file named sru_r3_a.txt

With these three runs of SQL Remote (and the associated SYS.SYSREMOTEUSERS outputs from before and after the runs), we'll be able to explain why replication isn't fixing itself like it should. I can't guarantee a solution to the problem, but will at a minimum be able to tell you why you're stuck in the current state.

Reg

(02 Jun '11, 14:07) Reg Domaratzki

Replies hidden

Thanks Reg. I'll let you know something ASAP!!

(02 Jun '11, 14:39) Jeff Gibson

Reg.

On #6. Did you mean to say "Connect to the consolidated database and save the contents of the sys.sysremoteusers view to a file named sru_c2_a.txt"?

You said remote twice between #6 and #7.

Thanks.

(02 Jun '11, 15:03) Jeff Gibson

Yes, that is what I meant. Sorry about that.

(02 Jun '11, 15:26) Reg Domaratzki

Replies hidden

Files have been emailed to your Reg. Let me know if you don't get them. Thanks!!

(02 Jun '11, 17:14) Jeff Gibson

showing 3 of 7 show all flat view

3 Answers:

active answers oldest answers newest answers popular answers

I wanted to follow up on this thread to let everybody know the outcome.

After corresponding with Reg privately on the log files that we had, what our offsets were set to for the user in question, on the consolidated side, we were successfully able to fix replication for this user.

After making the changes that Reg recommended as a "EXTREMELY UNSUPPORTED FEATURE" (which involved calling a procedure that sets offset ID's on the rebuild of a database), I started running replication at 3:40pm on Thursday 7/14. I had cleared out his replication directory, so when replication ran on the server, it generated 1,391 files for his database. I figured this would take a few hours to complete, but oh no!! Replication ran until 1:52pm on Friday 7/15. So it took just over 22 hours to replication information that dated back to 10/18/2010.

Not surprised since the amount of information being replicated was quite large.

Once replication was completed, I started doing small round trip changes between his database and the consolidated database. Everything replicated without a problem.

Due to the fact that we still don't know what caused this to happen, I think my next move is to create a new database for the end user in question. That way, in case there is anything buggy floating around in this database, at least we can clear that up with the new database.

Replication has been running since last Friday (7/15), and there have been zero errors since then.

Sorry I can't go into more detail, but due to the unsupported nature of the fix, I promised Reg that I wouldn't let the cat out of the bag. Just know that if you run into this same problem, you should be able to contact Reg and have him get the information you would need to implement this type of a fix.

Thanks Everyone!!

Jeff Gibson
Intercept Solutios - Sybase SQL Anywhere OEM Partner
Nashville, TN

permanent link

answered 19 Jul '11, 15:46

Jeff Gibson
2.0k●46●66●76
accept rate: 20%

Hm, shouldn't we add the new "secret-solution" tag to this question? That would surely gain lots of attraction...

"Reg-saved-my-day" would be a useful tag, too, of course:)

(20 Jul '11, 12:46) Volker Barth

Replies hidden

THAT!! Is a great idea!! :)

It would be the first tag everybody would go to.

(20 Jul '11, 16:29) Jeff Gibson

Yes - it would be the "Breck test" whether visitors of this site do take notice of tags at all...

Cf. this comment:)

(20 Jul '11, 16:39) Volker Barth

flat view

So, the consolidated database indicates that it has received a confirmation message from the _7577925458804 user indicating that it has successfully applied all operations up to and including log offset 2749272670. However the _7577925458804 remote database indicates that it has only sent a confirmation of offsets 2749223586. The remote database is continuously asking for a resent of messages from offset 2749223586 in the consolidated database, but the consolidate believes the remote has already applied up to offset 2749272670, so it sends from that offset when it gets the resend request. There are a few ways this could happen, such as an incorrect recovery at the remote site, or possibly a copy of the remote database replicated a change up to the consolidated database. It's impossible at this point to deduce what caused the initial problem.

If you still have the offline logs that include offsets 2749223586 through 2749272670, I might be able to get you fixed up without a re-extract, but I'm not 100% sure my secret process will work back in v604.

Please post back and let me know if you have the offline logs with the offsets listed above. If you have set the delete_old_logs database option, it's likely that dbremote has already deleted the log unfortunately. I'll check in later tonight from home.

permanent link

answered 03 Jun '11, 16:29

Reg Domaratzki
7.9k●3●43●119
accept rate: 36%

I can tell you that our end users never touch the database to recover it. We actually went back and looked and his database was created back in June of 2010, so that's when he started using it and running replication.

Currently my log files for the consolidated database go back to April 29th of 2010.

How can you tell what log offsets are in play with each of the log files?

I have a copy of his replicating database, and have removed it from his machine for this week coming up. So there's no chance he will add additional data to it while he is out.

Any suggestions would be greatly appreciated Reg!!

(03 Jun '11, 18:37) Jeff Gibson

Replies hidden

You can use the DBLOG utility against each transaction log (both the old offline logs and the current one) to show the starting and ending offsets of each.

Cf. this answer how to do so - it was not the best answer to the particular question but its shows the usage of DBLOG.

(03 Jun '11, 18:54) Volker Barth

OK Reg... Here is the info you requested...

From our list of renamed log files that are being held for replication...

10101804.LOG
768KB
10/18/2010 - 4:59PM
Log Start: 02748708092
Log End: 02749436868

The only thing I find about this file is that the timestamp on it is 4:59PM. Normally replication takes about 5 minutes to run. Replication runs at the top of every hour, so that means that the replication task had to have been fired manually before the normal scheduled task was fired. Not sure if that would even mean a thing, but didn't know if it was possible that it tripped on itself between these two replication runs running back to back like that.

Thanks for any info Reg!!

(06 Jun '11, 18:41) Jeff Gibson

I broke out the floppy disks and installed v604 to make sure this worked back in the days. It's a dangerous and unsupported thing I'm about to suggest you do, so it's my policy not to tell you exactly what you need to do, but to provide you with enough information that you should be able to figure it out on your own.

First, run dbunload -n (no data) against the consolidated database to generate a reload.sql file for the consolidated database. Search in this file for

GRANT REMOTE TO "_7577925458804" TYPE ...

It's the line after this one which is of interest, and it's a stored procedure call. Next, open the file %ASANY6%scriptssybprocs.sql and look for definition of this stored procedure in the file. This is a stored procedure that is used during a rebuild of a replicating system, but it can be used on an active system as well if you need to get yourself out of a situation such as the one you are describing. Use at your own risk.

Reg

(07 Jun '11, 10:28) Reg Domaratzki

Replies hidden

Reg, just out of curiosity - and as we sometimes have similar situations with remotes that are unintentionally re-installed over "fresher" ones (w.r.t. replication):

I guess I'm right to assume that using this proc would force the cons to agree on the remote's view on when the last replication has taken place between them. Consequently, the message exchange should succeed after a few cycles.

However, would it not also mean any data the cons has sent the remote between both log offsets (the one the remote knows to have confirmed and the one the cons originally knows to have been confirmed by the remote) will be lost in the remote?

If that's true than I feel very sure that re-extraction is still the way to go for almost all such circumstances, and would treat Jeff's case as a huge exception to the rule...

(I hope my question is neither to vague nor to clear for the desired "I'm just giving you a hint"-policy...)

(07 Jun '11, 11:04) Volker Barth

I would only suggest using this method in this situation where you expect (and can prove) that the messages were applied by a "different" remote database.

The log operations at the consolidated database between offsets 2749223586 through 2749272670 were sent, applied and then confirmed by the _7577925458804 remote database, but we can see from the SYSREMOTE user table for the existing _7577925458804 database that it wasn't this one. It's an acceptable risk to force SQL Remote to resend the operations in this particular case since we can see that they weren't applied by this database, and the database that did apply them is nowhere to be found.

I am 100% in agreement with you that a re-extract is the best way to solve the problem, and that Jeff's case is a huge exception. I go out of my way not to post the full solution because it is NOT a solution to a majority of SQL Remote problems. Over-riding SQL Remote's guaranteed message delivery system is not a good idea 99.99% of the time.

(07 Jun '11, 11:28) Reg Domaratzki

Replies hidden

Reg, thanks for the clarification. - I'm sure we haven't had 10,000 cases to re-extract so far, and consequently the "always re-extract" is still the statistically correct solution :)

(07 Jun '11, 11:34) Volker Barth

Hey Reg (or Volker). I have a question on the above statement. You said that messages were applied by a "different" remote database. By different remote database, I assume you mean a database that had been extracted for the same user?

I only ask because my impression was that you could take a remote database that was in a replicating environment, and move that database (and log file) all over the place, and it will still replicate to the location specified in its database.

My impression is that if I took a database that had been extracted for User B, and put it on User A's machine, you're still going to replicate info for User B.

Would these assumptions be correct?

I've learned over time to never say never, but the chance of this user doing anything that would have replaced his database with another copy of his database is almost zero since they don't keep backups of their databases. We've always followed the re-extract model to put a new database together for a user.

Can you tell me if somebody were to have a problem with their OS, and went back to the previously known good configuration, could this possibly replace their database???

Thanks!!

(08 Jun '11, 01:29) Jeff Gibson

showing 3 of 8 show all flat view

Of course I won't get in the way when Reg is gonna solve the problem, and sure he will:)

From the little details I can see here, I would suspect that the cons has received a confirmation for a sent message (possibly unto offset 578-02749480715-0) from the remote. Therefore, the cons won't send any older messages to the remote.

Now, however, the remote seems to be sure it has only received offset 578-02749223586-0 from the cons, and askes for messages starting with that offset. The newer messages the cons sends are therefore rejected, and a missing message is reported.

We have often noticed such behaviour (which cannot be solved by SQL Remote's regular message system) when remote users restore their database from a backup or re-installed the original database state, and by this accidently put the remote into a "previous state". It's then out of sync, apparently.

We usually do a re-extract then. If there has been relevant remote data entry, then getting the translog and apply all outstanding work into the cons is something we have done a few times (and with very much care)...

And now I'm interested what you and Reg will discover:)

permanent link

answered 02 Jun '11, 18:40

Volker Barth
40.5k●365●556●827
accept rate: 34%

edited 20 Jul '11, 12:42

Good points Volker...

The only thing that I know that has happened with this database for sure, was that the motherboard went bad on the previous machine this user was using. However, that happened back in February of this year. The database started acting up a LONG time before that. When the machine went bad we reinstalled the application and then copied the remote database and log file from the old hard drive over to the new hard drive.

Then our IT guy watched replication run and saw that the files in the replication directory were being delivered and consumed, and that was all he checked.

I will be curious to see what Reg finds.

My impression has always been, that with guaranteed delivery, a replicating database could practically make its way out of anything as long as you ran enough round trips on replication for it to catch up.

Although I do know that if you start messing with log files, backups without saving the log file, etc., you can end up in a heap of trouble.

If Reg emails me the results privately I'll make sure and post them here.

(03 Jun '11, 00:49) Jeff Gibson

flat view

Your answer

toggle preview

community wiki:

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

*italic* or _italic_
**bold** or __bold__
link:[text](http://url.com/ "title")
image?![alt text](/path/img.jpg "title")
numbered list: 1. Foo 2. Bar
to add a line break simply add two spaces to where you would like the new line to be.
basic HTML tags are also supported

learn more about Markdown

Question tags:

dbremote ×78
replication ×60
best-practices ×23

question asked: 02 Jun '11, 02:52

question was seen: 4,986 times

last updated: 15 Mar '13, 21:36

SAP SQL Anywhere

Replication WAY Out Of Sync Suggestions

Follow this question

Related questions