A large replication got rolled back

Windoze SQL Anywhere 9.something.3951

So I have a large distributed database system...

It's kinda legacy at this point and every few years or so I have a large purge script that will delete multiple years worth of transactions.

We apply the script to the consolidated database and then the delete statements replicate out via dbremote running on a scheduled task.

A month later in production I find that although replication seems to be in synch it's as if the deletes never occurred on the remote database. So then I go back to the log and all I see is a roll back the end of a bajillion deletes.

Any ideas? I realize I'm out of luck being that the Syabse is EOL, and we are trying to move to 16 here shortly... But just kind of curious if anyone has seen this before?

Translating the log with a -g shows me:
CONNECT-1013-017394470350-"10"-2014-04-17 22:48
----BEGIN TRANSACTION-1013-017394470360
--BEGIN TRANSACTION
--UPDATE-1013-017394470406
--UPDATE DBA.Customer
-- SET Changed='2014-04-17 16:09:22.5'
-- WHERE SiteCode='0010'
-- AND CustomerCode='023428'
--UPDATE-1013-017394470593
--UPDATE DBA.Customer
-- SET Active=0,
-- ChangedBy='10',
-- Changed='2014-04-17 16:09:22.6'
-- WHERE SiteCode='0010'
-- AND CustomerCode='027762'
Line 973149
SNIP
Lines upon lines of deletes and marking records inactive...
SNIP
Line 23206783
----CONNECT-1024-017709568341-"10"-2014-04-17 23:33
----BEGIN TRANSACTION-1024-017709568351
--BEGIN TRANSACTION
--COMMIT-1024-017709568358
--COMMIT WORK
--ROLLBACK-1024-017709568359
--ROLLBACK WORK
--ROLLBACK-1013-017709568366
--ROLLBACK WORK

replication rollback

asked 16 May '14, 10:25

harncw
106●4●5●9
accept rate: 25%

edited 16 May '14, 10:27

Just to understand: When you applied the script at the consolidated, it was committed successfully?

FWIW, we usually try to split huge UPDATEs/DELETEs into smaller transactions by adding an explicit COMMIT every n rows, say by a loop with UPDATE TOP ..., cf. this FAQ on preventing a huge rollback log...

(16 May '14, 11:01) Volker Barth

Replies hidden

A ROLLBACK is posted to the transaction log when the connection disconnects. In the translated log, if you go to line 23206783 and search UP for "-1013-" do you ever see the transaction committing?

(16 May '14, 11:13) Reg Domaratzki

Replies hidden

Comment Text Removed

Yes it looks "fine" at the consolidated, it was committed there and at several (all but 2 out of 30) of the remotes. Yes you are right I should commit more frequently. I knew that too.. but figured that it worked as it was so why mess the simplicity of a single delete statement. But thanks for pointing it out :)

(16 May '14, 11:24) harncw

There is a commit earlier in the replication but it is before the "big honking deletes" start.

(16 May '14, 11:28) harncw

You should find that SQL Remote is resending those messages (with that big Delete transaction) over and over again until they succeeed. For these 2 remotes you will probably see the send/confirmed offsets are falling behind the other 'good' sites. You do want to confim that much first. SQL Remote is not going to skip big transacitons like that without some other major issue coming into play.

The reason for the rollback is likely an error that dbremote -v -o should be logging for you. Nothing there?

(20 May '14, 11:04) Nick Elson S...

Replies hidden

The offsets are not falling behind.
http://globalsoftware-inc.com/harncw/offsets.jpg

Unfortunately this customer likes to run their dbremote as a scheduled task with the log files turned off because they will sometimes see "cannot open log file" errors. (No I don't agree with this method but it's hard to stop bad habits.)

(20 May '14, 11:41) harncw

Hm, IMHO, the difference between log_sent and confirm_sent by that marked remote does tell it has not applied all sent messages from the cons. However, I can't tell whether this is expected behaviour here as that will depend on the frequency of cons and remote to apply messages...

(20 May '14, 11:56) Volker Barth

You are correct in that that the log sent and confirm sent are not synchronized 100% as they replicate every hour or so. Importantly though they are more recent than the large delete that occurred about a month ago.

(20 May '14, 15:21) harncw

@Nick: Are you sure about that?

AFAIK, failing statements like DELETEs that fail because they would violate RI constraints or ALTER statements on non-existing database objects do not lead to rejected messages - the according statements are just skipped (i.e. the according transaction is rolled back) but they do not prevent other statements in the message message to be applied.

At least that's my experience:)

I should add that it was always logged (and understandable) why these statements failed so that's a different situation than in the current discussion...

(21 May '14, 03:55) Volker Barth

> I'm out of luck being that the Syabse is EOL

You are NOT out of luck... technical support is still available for SQL Anywhere 9. If you are willing to pay for support, and are willing to be politely-but-firmly persistent enough to get past the [cough] unwillingness to help and "you must upgrade" exhortations, you CAN get help. In the very very very very very unlikely event the problem is caused by a bug in SQL Anywhere 9, it won't be fixed, that's what "EOL" means.

In your case, going the pick-up-the-phone route is probably better than this forum.

(21 May '14, 09:56) Breck Carter

Last time I tried to get help out of support, I left a few messages (and they were polite!) and never heard back from the department(yes I swear on my mother, they never called back)...

I will rip up the remote and re-extract it already did the little one, it's just risky for us to do the larger one because of the amount of activity and complexity of the entire application system.

(21 May '14, 11:38) harncw

Volker... It is true that some kinds of errors (fkey violations for example) can be skipped ... but given that "harncw" indicates they have cascade deletes that would not seem to apply here ...

Harncw ... Since the progresses (confirmed and sent\or\received) are progressing then this would seem to be something that was either skipped (as in ignored) or something that was not sent (as in not subscribed to) ...

Since you are not capturing the logging of this we don't know if there is even any errors involved or some other factor.

There may be no errors detected if (for some reason) those deletes did not match the subscriptions for those 2 remotes. You may have something in the way publications and subscriptions have been set up that could make this possible. And alterations of those could also cause gaps and overlaps in what gets sent.

If there are errors being detected here (at the 2 remotes in question) then there are ways to handle and/or ignor those errors; and doing so could easily lead to this situation. Many of these are spelled out in this article .

For instance you can have your before triggers cause operations to get ignored by raising the SQLE_REMOTE_STATEMENT_FAILED signal. Or you can have all errors be 'handled'/'reported'/logged' by a procedured identified in your replication_error

No formal analysis of the facts here (since facts are MIA without logging) but hopefully something here will be of some assistance.

(26 May '14, 11:39) Nick Elson S...

More comments hidden

showing 4 of 12 show all flat view

One Answer:

active answers oldest answers newest answers popular answers

I have seen this happen before due to foreign keys. For whatever reason, a remote had a record that didn't exist in the consolidated, so when replication tried to commit the deletions, a foreign key error occurred on the remote. Might be something to look for.

permanent link

answered 16 May '14, 11:39

lstavinoha
71●2●2●8
accept rate: 0%

..though IIRC, that should not rollback the whole huge DELETE but only the operations for those rows that would cause RI violations... - just because it's not the "statement text" itself that is replicated but the resulting operations (i.e. on a row-level), usually applied in grouped transactions (DBREMOTE -g n).

(16 May '14, 12:14) Volker Barth

Replies hidden

Yes, the -g n does affect it. I'm not sure about all cases, but with -g 1, basically no grouping occurs, and they can all go.

(16 May '14, 12:48) lstavinoha

I would think SQL Remote would try to re-do failed multi-row operations (i.e. grouped transactions) by applying them one-by-one.

(Hopefully I don't mix this up with MobiLink's feature to re-apply failed multi-row uploads as single-row uploads...)

(16 May '14, 13:01) Volker Barth

Thanks for your input, in the case of this database all the relevant child tables are set to cascade delete.

(16 May '14, 13:16) harncw

> Hopefully I don't mix this up with MobiLink's feature

Another way to say that is "Hopefully I'm not being like Breck"... it's the reason I've become careful (i.e. reluctant) when answering SQL Remote questions :)

(21 May '14, 10:00) Breck Carter

flat view

Your answer

toggle preview

community wiki:

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

*italic* or _italic_
**bold** or __bold__
link:[text](http://url.com/ "title")
image?![alt text](/path/img.jpg "title")
numbered list: 1. Foo 2. Bar
to add a line break simply add two spaces to where you would like the new line to be.
basic HTML tags are also supported

learn more about Markdown

Question tags:

replication ×60
rollback ×9

question asked: 16 May '14, 10:25

question was seen: 2,337 times

last updated: 26 May '14, 11:39

SAP SQL Anywhere

A large replication got rolled back

Follow this question

Related questions