At a client site we are running a SQL Anywhere 11.0.1.2811 server with one replica. The consolidated DB works fine, no problems. The remote DB is located on an iSCSI volume which had some problems and lost its connection a few times. Normally dbsrv11 and dbremote recovered fine, but now we are in a state with constant resends and "missing message" messages.

What happened on the remote, in chronological order:

I. 07/13 12:19:01. Finished checkpoint of "remotedb" (remotedb.db) at Fri Jul 13 2012 12:19
I. 2012-07-13 13:06:40. Received message from "consolidate" (8-0118836638175-0118836727389-0)
I. 2012-07-13 13:06:40. Applying message from "consolidate" (8-0118836638175-0118836727389-0)
I. 2012-07-13 13:06:40. Sending message to "consolidate" (34-0105371212077-0105371212077-0)
E. 07/13 13:06:40. Fatal error:  A disk write to file "/data/nlpd/nlpdbhulst.log" failed with error code: (30)
E. 2012-07-13 13:06:42. SQL statement failed: (-85) Communication error
E. 2012-07-13 13:06:42. Communication error (-85)
E. 2012-07-13 13:06:42. SQL statement failed: (-101) Not connected to a database
E. 2012-07-13 13:06:42. SQL statement failed: (-85) Communication error
E. 2012-07-13 13:06:42. Communication error (-85)
I. 07/13 15:42:49. Starting database "remotedb" (/data/remotedb.db) at Fri Jul 13 2012 15:42
I. 07/13 15:42:49. Database recovery in progress
I. 07/13 15:42:49.     Last checkpoint at Fri Jul 13 2012 12:19
I. 07/13 15:42:49.     Checkpoint log...
I. 07/13 15:43:09.     Transaction log: remotedb.log...
I. 07/13 15:43:58.     Rollback log...
I. 07/13 15:43:58.     Checkpointing...
E. 2012-07-13 15:46:08. Missing message from "consolidate" (8-0118836638175-0118836727389-0)

The consolidated then resends messages, but always starting only at 9-0118836727389-0118837735535-0. The remote then again complains about the missing 0118836638175-0118836727389 message, the loop restarts, and the resend counter increments.

As far as I can see, the problem is that the message 8-0118836638175-0118836727389-0 had been applied at 13:06:40, and the confirmation had also been sent back to the consolidated at the same time. Then the DB crashed (because of the iSCSI connection loss) and rolled back to the last checkpoint, done at 12:19, which did not include the last message received at 13:06.

Looking at the remote stats I see this:

On the consolidated, for the remote user:

Sendlog-offset: 119494704576
Sendconfirmation-offset: 118836727389

On the remote, for the consolidated (remote) user:

Receivelog-offset: 118836638175
Receiveconfirmation-offset: 118836638175
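
For reference, these offsets can be read straight out of the catalog. A sketch, assuming the SQL Anywhere SYSREMOTEUSER column names (log_sent/confirm_sent on the sending side, log_received/confirm_received on the receiving side) map onto the offsets shown above:

```sql
-- On the consolidated: what has been sent to, and confirmed by, each remote user.
SELECT user_id, log_sent, confirm_sent, resend_count
  FROM SYS.SYSREMOTEUSER;

-- On the remote: what has been received from, and confirmed to, the consolidated.
SELECT user_id, log_received, confirm_received, rereceive_count
  FROM SYS.SYSREMOTEUSER;
```

If the column mapping holds, the mismatch here is that confirm_sent on the consolidated (118836727389) is ahead of log_received on the remote (118836638175).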

The replicated DB is around 53 GB, so a re-extraction (which would be the simplest, cleanest solution) takes a very long time, and then sending the extracted DB over a slow WAN link takes even more.

I'm now very tempted to change the sendconfirmation-offset from 118836727389 back to 118836638175 on the consolidated database for the remote user...

Is there a chance of getting it working again by modifying that offset?
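
For the record, the change I have in mind would look something like the following, run against the consolidated database. This is a sketch only: it assumes confirm_sent is the column behind the send-confirmation offset, that 'remotedb' stands in for the actual remote user name (not stated here), and that the server permits a DBA to update this system table. Back up both databases and stop dbremote before touching it.

```sql
-- Sketch: roll the consolidated's send-confirmation offset for this remote
-- user back to the offset the remote actually holds, so dbremote resends
-- starting from message 8 again.  'remotedb' is a placeholder user name.
UPDATE SYS.SYSREMOTEUSER
   SET confirm_sent = 118836638175   -- the remote's receive offset
 WHERE user_id = ( SELECT user_id
                     FROM SYS.SYSUSERPERM
                    WHERE user_name = 'remotedb' );
COMMIT;
```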

asked 23 Jul '12, 11:16 by ASchild


OK, after updating the send confirmation offset back to the lower value, the sync now works fine again.

answered 24 Jul '12, 03:25 by ASchild

Sorry for the late response, I've been out of the office for a little bit.

While I agree with your analysis, and that hacking the SYSREMOTEUSER table fixes your immediate issue, I'm surprised that the recovery of the database after the disk full situation resulted in this scenario. A disk full situation should not result in you needing to hack the SYSREMOTEUSER table.

You had stated :

As far as I can see, the problem is that the message 8-0118836638175-0118836727389-0 had been applied at 13:06:40, and the confirmation had also been sent back to the consolidated at the same time. Then the DB crashed (because of the iSCSI connection loss) and rolled back to the last checkpoint, done at 12:19, which did not include the last message received at 13:06.

While the database engine did recover to the last checkpoint at 12:19, after reaching that checkpoint it then went on to apply the operations in remotedb.log made since the checkpoint (visible in the engine log below, from 15:43:09 to 15:43:58). So any transactions committed after the checkpoint, including the updates to the SYSREMOTEUSER table, should also have been re-applied to the database.

I. 07/13 15:42:49. Database recovery in progress
I. 07/13 15:42:49.     Last checkpoint at Fri Jul 13 2012 12:19
I. 07/13 15:42:49.     Checkpoint log...
I. 07/13 15:43:09.     Transaction log: remotedb.log...
I. 07/13 15:43:58.     Rollback log...
I. 07/13 15:43:58.     Checkpointing...

Unfortunately, without a copy of the database prior to the automatic recovery, it will be impossible to see why the recovery didn't do what we expected.

(25 Jul '12, 09:22) Reg Domaratzki

It was not a disk full condition, but rather a communication problem between the server and the iSCSI volume where the .db and .log files were stored.

But thanks for your answer.

What puzzled me was the fact that the remote did ask for the missing messages, but the consolidated only sent what IT thought was missing. Apparently the log offset is not sent in the resend request (or it is ignored by the consolidated).

(30 Jul '12, 02:48) ASchild

When there is a discrepancy like this one, the consolidated database assumes it is correct and ignores the offset sent by the remote.

(30 Jul '12, 09:16) Reg Domaratzki


last updated: 30 Jul '12, 09:16