We have a utility for automating backups & DR servers for our customers. As well as doing things like making sure that live backup keeps running and regularly applying log files to DR databases, it can invoke client-side backups (using dbbackup.exe). This usually works fine, even with 200-300GB databases.

However, we have been getting quite a few cases where, although the backup completes and reports success, any attempt to use the database (e.g. apply logs, verify etc.) results in a variety of assertion errors. On checking, the original database is fine, as is the disk system of the DR machine. We have submitted a few of these databases to SAP support, who have said that basically the database file is corrupt. It would seem likely, therefore, that some of the database pages are getting corrupted by network errors during the backup process.

Is there any form of built-in error checking, checksum validation etc. that goes on during a database backup? Obviously network errors shouldn't be happening, but I'm surprised that the issue isn't being detected during the backup process. Or is something quite different going on? (As mentioned above, a whole variety of assertion errors can come up - but all seem to relate to a corrupt database file - the log file always seems to be OK when checked.)

Our log of events (with a few sensitive things replaced with XXXX):

20:46:58 Backing up Primary Database to E:\XXXXX\IQX.db, Replacing Standby
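For context, the client-side backup step is invoked roughly like this - a sketch only, with the server name, host, credentials and target folder being placeholders rather than our real values:

    rem Client-side full backup of database and transaction log to the DR machine (no truncation)
    dbbackup -y -c "ENG=LiveServer;DBN=IQX;UID=DBA;PWD=***;LINKS=tcpip(HOST=live-host)" E:\XXXXX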
It seems we were being affected by one or more issues fixed before build 2471 (thanks for your email, Mark Culp!). Certainly we are now not getting repeats.

We did then get hit by the apparently new problem:

================(Build #2524 - Engineering Case #809863)================
Under exceptionally rare circumstances, the server may loop infinitely during an index backward scan. This has been fixed.

This manifested itself for us by a statement like

    select max(column) from table where anothercolumn = 'something'

just never returning, while

    select min(column) from table where anothercolumn = 'something'

was fine. However, that was fixed in build 2546 before we got a chance to finish reporting it - so all seems good now.

Glad to hear that your issues were fixed!
(28 Sep '17, 22:19)
Mark Culp
I have no clue, but do these backups validate successfully?
That would be a moot point, since dbsrv16.exe is failing to start the backup database (or a copy thereof), and it would have to run successfully before dbvalid.exe could be started.
Justin, I apologize in advance that I cannot give you a precise answer, but here are my thoughts:
By default, SA has global checksums enabled for all database pages (see dbinit, option -s).
According to the documentation, the checksum is calculated just before the page is written to disk.
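If it helps, a quick sanity check along those lines (a sketch only - the connection parameters are placeholders, and I am quoting the property name and the dbvalid switch from memory, so please double-check them against the docs):

    rem Check whether the running database was created with checksums enabled
    dbisql -c "ENG=LiveServer;UID=DBA;PWD=***" "SELECT DB_PROPERTY('Checksum')"

    rem Explicitly validate the page checksums of a (copy of the) database file
    dbvalid -s -c "UID=DBA;PWD=***;DBF=E:\backup\IQX.db"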
If you can reproduce the issue while doing the backup over the network, I'd suggest backing up the data on the same host where the DB is running. I know this is not the recommended practice, but it will help us exclude the network as the "broken" part.
Again, if the network quality is somehow to blame, then every archive (7z/zip/whatever) you simply copy from your DB server to the destination machine should end up corrupted as well.
From Breck's old replies (I agree with him), there is a chance that something is wrong with your RAM. It is worth testing the RAM using one of the utilities from this page. I am sure you have already done this twice on both machines.
I'd like to add that your SA server is quite old (2.5 years old). The latest version is SQL Anywhere 16.0 SP70 Build 2471.
A short summary: verify the network (do whatever tests you want, e.g. transfer large files over the network and calculate CRCs before and after transferring; do the backup on the DB server side), verify the RAM (both servers, thoroughly), verify the HDDs' SMART status and surface (just in case), and upgrade the SA server.
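For the network test, something as simple as this would already tell you a lot (standard Windows tooling; the file, path and share names are just examples):

    rem On the live server: hash a large test file, then copy it to the DR machine over the network
    certutil -hashfile D:\test\bigfile.dat SHA256
    copy D:\test\bigfile.dat \\dr-machine\backups\bigfile.dat

    rem On the DR machine: hash the received copy and compare the two values
    certutil -hashfile E:\backups\bigfile.dat SHA256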
16.0.0.2076 is rather [cough] immature. Even Foxhound 4 requires 2127 to run its own database, and that build was carefully chosen as "the earliest build that can tie its own laces" :)
@Justin: Do not read this rant until after your problem is solved... it will just annoy you :)... read Vlad's comment first.
@EveryoneElse: There is no point in taking backups if you don't validate them afterwards. To put it another way, if you're not willing to spend the time and expense to validate your backups, don't bother creating the backups in the first place... history shows they will not work.
For large databases the "time and expense" of validating backups is very real, since in most cases you have to create a second (throwaway) copy of the backup for the validation process.
The phrase "in most cases" means "you want to preserve the ability to apply incremental log backups after restoring the backup". In order to validate the backup, dbsrv16.exe must first be run normally and that process makes it impossible to apply a future incremental log backup or multiple differential log backups... hence the "second (throwaway) copy of the backup".
You might ask, "Why can't I just start the backup in read-only mode, to run the validation without making a second copy?"
Answer: dbsrv16.exe -r prevents the automatic recovery process from running, and that is needed in order to run the validation process... assuming you want to prove the backup can actually be used in a restore.
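In other words, the throwaway-copy dance looks something like this (a sketch only - every file name and credential below is invented, and the switches are quoted from memory):

    rem 1. Make a scratch copy so the real backup files stay pristine for future log applications
    copy E:\backup\IQX.db  E:\scratch\IQX.db
    copy E:\backup\IQX.log E:\scratch\IQX.log

    rem 2. Start the copy normally so automatic recovery runs, apply its log, then let it shut down
    dbsrv16 -o E:\scratch\console.txt E:\scratch\IQX.db -a E:\scratch\IQX.log

    rem 3. Validate the recovered copy (auto-started via the DBF connection parameter)
    dbvalid -c "UID=DBA;PWD=***;DBF=E:\scratch\IQX.db"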
Hm, I wasn't sure whether that error was displayed while transaction logs were already being applied (as -a is used), so it might behave differently if the database file were started alone or with the original log file.
There are some fixes for assertions and backups between 16.0.0.2076 and 2471, e.g.
Not that I would claim any of those necessarily applies here.
FWIW £119.99 buys a Seagate Expansion 4 TB USB 3.0 Portable 2.5 inch External Hard Drive which is perfect for those "throwaway copies" :)
FWIW, v17 has introduced "dbvalid -im v" to allow in-memory validation without modifying the database and log file.
@Justin: Not helpful for you currently, I know:(
No, that's what we would try to do next in the process, but the log has to be applied first to bring the database to a consistent state. We stop at this point if there is an error, as any subsequent operation would of course fail.
To be more specific, dbvalid -im v is not for validating the active database, it is specifically for validating a backup copy that is auto-started by the connection from dbvalid.exe.
So yes, this might make the "throwaway copy" unnecessary if (a) the backup is a matching full backup of both database and log file or (b) the backup is an old database backup file and fresh incremental log backup file... not sure about (c) the backup being validated is an old backup file plus a dozen or so differential log backups.
I have come to believe that (a) and (b) are good and (c) pretty much sucks; it is 2017 and disk space is cheap, so yes, dbvalid -im v is a good thing!
=====
-im mode
Defines the in-memory mode to use if the database server is auto-started (by setting the -im database server option on the StartLine parameter of the connection string).
v or validation (default) In-memory validation. Use this mode to validate a backup copy of the database without modifying it or its transaction log files.
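So in practice, validating a backup copy in place should boil down to something like this (v17 tools; the path and credentials are placeholders):

    rem In-memory validation of a backup copy; neither the .db nor the .log files are modified
    dbvalid -im v -c "UID=DBA;PWD=***;DBF=E:\backup\IQX.db"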
> what we would try to do next in the process
OK, so definitely do not read my rant below :)
Indeed... your point was not moot.
Thanks for the "trigger warning" but not required. That's what this process is all about!
Our normal process (abbreviated) is:
0 Validate existing DR (disaster recovery) database in read-only mode - if OK, move to safe copies folder
1 Back up database & log to DR machine as new DR database (no truncation)
2 Apply log (from backup) to DR db (what failed in this case)
3 Validate DR DB in read-only mode
4 Apply copies of "Live" log backup on a regular basis (e.g. every 20 mins) until we do step 0 again!
This usually works fine and means we have regular off-line validation and a constantly available DR server, with no downtime on the live server for the process. I've missed out the bit where we truncate the log on the live server so it doesn't get too big.
If the live server goes down, we just apply the latest version of the "live backup" log and we are up-to-date.
All the steps can send out email, SMS, SNMP etc. notifications.
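Stripped down to bare commands, one cycle looks roughly like this - a sketch only, with every path, server name and credential invented and the exact switches quoted from memory (our utility wraps all of this with monitoring and notifications):

    rem 0. Validate the existing DR database (started read-only), then move it to the safe copies folder
    dbvalid -c "UID=DBA;PWD=***;DBF=E:\DR\IQX.db;DBS=-r"
    move E:\DR\IQX.* E:\DR\safe\

    rem 1. Back up database & log from the live server to the DR machine (no truncation)
    dbbackup -y -c "ENG=LiveServer;UID=DBA;PWD=***;LINKS=tcpip(HOST=live-host)" E:\DR

    rem 2. Apply the backed-up log to the new DR database (the step that failed here)
    dbsrv16 E:\DR\IQX.db -a E:\DR\IQX.log

    rem 3. Validate the DR database, read-only as in step 0
    dbvalid -c "UID=DBA;PWD=***;DBF=E:\DR\IQX.db;DBS=-r"

    rem 4. Every ~20 minutes, apply a copy of the live-backup log, until step 0 comes round again
    dbsrv16 E:\DR\IQX.db -a E:\DR\live_copy\IQX.log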
Your integration of live backup (dbbackup -l) into the regular backup and validation process is intriguing. I assume you run it locally, rather than using live backup to facilitate an offsite recovery server. I'm guessing this is possible with live backup but not HA; it's the first time I've heard of live backup having a significant technical advantage over HA.
...or am I out to lunch? :)
It runs on the DR server. Our process, that runs as a service on the DR machine(s), monitors it and ensures it restarts etc if connection is lost, if the live or DR servers are rebooted etc.
Our clients prefer this approach to HA because of the control and visibility. The DR copy can also be duplicated for reporting, warehousing etc.
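Concretely, the live backup is just dbbackup -l running continuously on the DR machine against the live server, something like this (names and paths are placeholders):

    rem Runs for as long as the connection holds; maintains a near-real-time copy of the live server's transaction log
    dbbackup -l E:\DR\live\IQX.log -c "ENG=LiveServer;DBN=IQX;UID=DBA;PWD=***;LINKS=tcpip(HOST=live-host)"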
Hm, so the same DR server making the sometimes corrupt client-side backup does also do the live backup?
(I'm just wondering whether possible network problems might corrupt the latter, too...)
FWIW live backup and HA can coexist, and with HA you can run reports against the secondary, or a third read-only scale-out... something to keep in mind, since HA in V16+ isn't Your Father's HA any more :)
HA would deal with the DR side, but we would still need to back it up to validate it.
The log files have always been fine. It appears that the backup db file is corrupted before we try to start it.
Thanks Vlad for your pointers. I really appreciate your input.
5 I certainly agree about using the latest version, though I haven't seen any changes that specifically address this sort of issue. We are in the process of pushing out a newer build. All the problems we have seen so far have been on 2076.
1 We have checksums on, but one of the things I'm trying to find out is whether the backup process (as opposed to the engine) uses them.
2 We have done that in every case and the live database has always been OK. It is only the backed-up database that has the issue.
3 Occasionally we have been able to get an error when copying a really large file (similar in size to the databases, e.g. 200GB - 300GB). When this happens we always know about it immediately - Windows reports a failure in copying - but there may also be an element of hidden re-trying going on, and we don't know if dbbackup.exe does the same. The odd thing is that in all these cases dbbackup.exe -l has been running for months (on dozens of different servers and networks), copying thousands of gigabytes of data without a single error. Does it have an error detection / correction method that the full backup doesn't, I wonder?
4 We have checked in a couple of cases, without anything showing up. However, it seems unlikely that a lot of different customers would all have servers with faulty RAM that only ever manifested itself as this particular problem.