Hello,

I have a SQL Anywhere 17 High Availability server system running under Ubuntu 16.04. The system has been running for various testing purposes for many months now with no apparent problems. However, in the last few days a very serious problem has developed.

A C++ application that uses ODBC is attempting to transfer a 0.5 GB file as a BLOB as part of an INSERT into a database. We've never had a problem before with this particular piece of code, but two things are new. First, both the server system and the client application are now running on Amazon Web Services EC2 instances. Second, this particular file is much larger than anything we have inserted before.

With respect to the first point, we have been running this same insert code on AWS for many months with smaller files and have never had a problem, so while I don't know where the problem lies, there is reason to think it is not related to AWS directly. However, this is the first time we've tried to insert a file of this magnitude, and I'm wondering whether we are running into some sort of server limitation, or whether there are timing issues causing a timeout to occur.

Since both the client and the server system run within the same AWS VPC (Virtual Private Cloud), the transfer between them should be very fast, as it is all internal to an AWS data center. So it does not seem that we're having any actual data transfer delays, but perhaps the client is pausing for too long as it repeatedly reads a new section of the file to transfer to the server. I don't have any apparent way of determining whether this is happening.

I realize that the information I have provided here is inadequate for anyone to say what the problem might be, so my question is really about how I might determine what is going wrong. One surprising aspect is that I cannot find anything in the log files to provide even the slightest clue as to why this is happening; in fact, error logs are not even being created. The crash is so serious that it doesn't just bring down the servers: I have to reboot the Ubuntu virtual machine to regain any control, so obviously something very, very bad is taking place.

Can someone suggest any log files, testing or debugging strategies or even other monitoring applications that would help me identify what is causing this very significant server crash?

Thank you.

asked 13 Nov '18, 17:47

AlK

Actually, I can add a little more information. It appears that it is always the mirror server that crashes. The console log for the mirror server always shows:

Cache size adjusted to 499052K
Cache size adjusted to 529480K
/home/wizard/SQL Anywhere HA Scripts/02) Launch HA.sh: line 22:  2603 Killed

This appears to be a reference to the script that launches the mirror server. So I suppose all this tells us is that it is the mirror server that crashed; there is no additional information as to why.

Thank you.

(13 Nov '18, 19:42) AlK

No idea at all, but have you checked whether dbsupport reveals anything helpful?

What mirror mode are you using? Inserting a 0.5 GB BLOB with a synchronous mode seems like a heavy load (but note, I'm totally inexperienced with such HA setups)...

Does it also happen with LOAD TABLE?

(14 Nov '18, 03:55) Volker Barth

> /home/wizard/SQL Anywhere HA Scripts/02) Launch HA.sh: line 22:

Have you inspected that script, as it exists on the computer where the mirror server is running? ...what is on line 22?

(14 Nov '18, 06:19) Breck Carter

Can you run a separate test of a big INSERT on an HA database setup that is not running on AWS? (e.g., on your workstation).

You may be running up against some kind of AWS limitation, where AWS thinks there's a memory leak or hack attack or some other nefarious deed and it kills the offending process (this is a wild-a** guess :)

(14 Nov '18, 06:23) Breck Carter

How much cache does your setup usually need? A limit on the upper bound of the cache size might help, since huge amounts of cache might not help performance much for this rare operation, and it might be the sudden unbounded memory growth that is offending AWS (another wild-a** guess :)

(14 Nov '18, 06:33) Breck Carter

or you have scripts that kill the process when its memory consumption goes up!

(14 Nov '18, 09:06) Vlad

Thanks to Breck and Vlad. Here are responses:

  • I should have mentioned in my post that yes, I am able to run the exact same INSERT from a local VMware VM running under Windows 7 against an HA system configured the same way, running under Ubuntu in another VM on a LAN. This works with no problem at all. Since the SQL Anywhere servers and Ubuntu are configured the same way in both environments, this suggests to me that the issue is not one of server configuration, though of course I don't know that for sure.

  • Again, until I know exactly what the problem is I can't be sure, but I'm doubtful that this is AWS somehow reaching into my Ubuntu instance running in their cloud and killing a targeted process. I don't think they pay attention to anything that's running inside an instance like that.

  • I'm not sure how to answer the question about how much cache the setup usually needs. I have not explicitly set the cache size for either the database servers or for AWS.

  • I do not have any scripts (at least none that I have written!) that pay attention to memory consumption.

Do you have any suggestions about tools that might help me monitor more closely what's going on with the server? At this point all I know is that it stops running (or at least the mirror server stops running).
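As a starting point, a rough shell-level sketch of what could be watched while the insert runs (the 5-second interval and the dbsrv17 process name are assumptions to adapt; this just records overall memory plus the server process's size):

    # Watch free memory and swap, refreshed every 5 seconds
    watch -n 5 free -m

    # Or log memory plus the dbsrv17 process's size to a file for later review
    while true; do
        date >> mem.log
        free -m >> mem.log
        ps -C dbsrv17 -o pid,rss,vsz,cmd >> mem.log
        sleep 5
    done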

Thanks for your help.

(14 Nov '18, 13:36) AlK

> I'm doubtful that this is AWS somehow reaching into my Ubuntu instance running in their cloud and killing a targeted process. I don't think they pay attention to anything that running inside an instance like that.

...and I am equally sure that AWS does pay very close attention to exactly what is going on inside each and every process running on their boxes, for billing and security purposes. It only makes sense in this day and age of evil actors... would you feel safe using AWS if they didn't watch all processes carefully?

The fact your process crashes without diagnostics might actually be a clue, like messages that say "login failed" rather than giving detailed reasons.

Another clue is the word "Killed" in the message, like it was an overt act rather than a failure.

Two clues... :)...
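If that "Killed" means the Linux out-of-memory killer stepped in, the kernel normally logs it, and a SIGKILL gives the process no chance to write its own diagnostics. A rough sketch of where to look on Ubuntu 16.04 (log paths are the usual defaults):

    # Out-of-memory kills show up in the kernel ring buffer and logs
    dmesg -T | grep -iE 'out of memory|killed process'
    grep -i 'killed process' /var/log/kern.log /var/log/syslog

    # The same kernel messages are also available via the systemd journal
    journalctl -k | grep -iE 'oom|killed process'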

(14 Nov '18, 14:14) Breck Carter

If it fails on AWS, but not on your premises, doesn't that point directly to "AWS" as the reason?

I suggest running tests on AWS with files of varying sizes, to see if there is an AWS-imposed limit: if 3G works then try 4G, else try 2G, and so on, working your way up and down in a kind of "binary search" for the limit.
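If the file content doesn't matter for the test, dummy files are enough for the size sweep; a rough sketch (names and sizes are arbitrary, and dd is there in case sparse files behave differently in your pipeline):

    # Sparse test files of assorted sizes, created instantly
    truncate -s 256M test_256M.bin
    truncate -s 1G   test_1G.bin

    # Or a file with real on-disk data
    dd if=/dev/urandom of=test_512M.bin bs=1M count=512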

(14 Nov '18, 14:19) Breck Carter

> I have not explicitly set the cache size

My suggestion is to set a -ch limit, which might stop the failures by forcing SQL Anywhere to use the temporary file instead of mimicking a memory leak :)
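Something along these lines on the mirror server's command line (the server name, the 800M figure and the trailing options are placeholders; pick a -ch value that fits the instance's RAM):

    # Let the cache grow dynamically, but never beyond 800 MB
    dbsrv17 -n my_mirror -ch 800M ... /path/to/mydb.db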

(14 Nov '18, 14:21) Breck Carter

Hi Volker,

Thanks for the suggestion of dbsupport. I looked in the Diagnostics directory and there are no files so it looks like the crash was not recorded.

I am using Synchronous mode because that's what we need. Thanks for asking about that; I'll keep it in mind as a possible factor.

It would be very difficult to test with LOAD TABLE because the client machine is locked down and is only running our application. So while not impossible, it would take a great deal of work to execute that command on our client machine but I will keep that in mind as well.

Any other suggestions?

(14 Nov '18, 18:51) AlK

Hi Breck,

That line is the "fi" of the following part of the bash script that launches the mirror server:

    if [ 'Y' == ${LAUNCH_EXTRA} ]
    then
        dbsrv17 -n ...
    else
        dbsrv17 -n ...
    fi

The script runs the first dbsrv17, so when the process is killed the error message ends up pointing at the end of the if block that started the server.

Any ideas?

(14 Nov '18, 19:11) AlK

Hi Breck,

Yes, I've done exactly that numerous times. I can run our client app on one system on our LAN against the three HA database servers running under Ubuntu, configured the same way as they are under AWS, and the transfer of this very large file completes with no problems at all. This fact makes me suspect that the AWS environment is playing some sort of role here. For example, if AWS is introducing some sort of data-transfer delay, I wonder whether that could be triggering a timeout in the server. But without any information about why the server is crashing, it's very hard to take this problem to AWS.

Can you suggest anything?

(14 Nov '18, 19:16) AlK

Hi Breck,

We have tested with smaller files and worked our way up. We find that a 156 MB (and smaller) file transfers with no problem (after repeated tests), but the next jump up has been to the actual 0.5 GB file, and that consistently fails. I definitely want to take this problem to AWS but, as you already know, the circumstances of the server crash are about as vague as they can get. That's why I'm hoping somebody in this forum can suggest some way for me to capture some detail about why the server is crashing. Are you aware of any way of determining whether Ubuntu has issued a "kill process" or something like that? Would a "kill process" explain why SQL Anywhere is not (as far as I can tell) writing anything into a log?

(14 Nov '18, 19:24) AlK

Hello Breck,

The delay in my response was caused by my research and experimentation with -ca, and I believe that has led me to discover the problem.

It looks like we did not have sufficient RAM to support running the servers with no upper limit on the cache size. Since I now know (or at least believe I know) what the problem is, I have opened a new discussion (https://sqlanywhere-forum.sap.com/questions/32873/how-do-i-determine-the-necessary-ram-or-cache-size-to-support-a-high-availability-system-under-ubuntu) related to sizing the cache. If you can respond to that new discussion I would very much appreciate your contribution. However, before I close this discussion I hope you might be able to answer two questions related to what I discovered:

1) As you noted yourself earlier, the only clue as to the cause of the crash was the "2603 Killed" line in the mirror server's console log. You pointed out that this suggested that "something else" actually killed the process rather than the server crashing on its own. Given that it looks like we ran out of RAM, would you conclude that it was actually Ubuntu that killed the server process once it detected that RAM had been depleted?

2) Once I doubled the RAM I was able to try the insert of the 0.5 GB BLOB without any maximum cache size limitation, and it ran to completion successfully. As the insert was taking place I was monitoring available RAM and it was dropping rather quickly: we started with 1.1 GB and it dropped down to 52 MB. I was surprised that after the insert completed successfully I didn't see the available RAM grow back to the starting point of 1.1 GB. I interpret that to mean that the server gave up some of the RAM it was using but held onto a significant piece of the RAM that had been used during the insert. From your understanding of the cache, does that behavior make sense?

If you can address these two questions that will help me to learn more about how the cache works. Again, thanks very much for your help on this discussion and if you can contribute to my new discussion that would be great!

Thanks again.

(16 Nov '18, 18:24) AlK

Note, Breck suggested option -ch, not -ca...

The former sets a maximum cache size for the default dynamic cache sizing, whereas the latter enforces a static cache size, so they work very differently.
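In command-line terms the difference looks roughly like this (server name, size and the trailing options are placeholders; as far as I know 0 is the only value -ca accepts):

    # -ca 0 disables dynamic cache resizing, so the cache stays at the initial size set with -c
    dbsrv17 -n my_mirror -c 512M -ca 0 ...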

(17 Nov '18, 09:18) Volker Barth

> would you conclude that it was actually Ubuntu that killed the server process since Ubuntu detected that the RAM had been depleted?

I don't know.

> From your understanding of the cache does that behavior make sense?

I don't know. I don't trust any RAM statistics reported by any operating system, especially in a super-secret multi-level awesomely-complex virtual-reality environment like AWS. I do (mostly) trust SQL Anywhere's own statistics about memory usage... the numbers you get from PROPERTY(), DB_PROPERTY() and CONNECTION_PROPERTY() calls because they reflect what the server is actually seeing and using... and they are documented.
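For example, something along these lines polls the server's own view of its cache from the shell (connection parameters are placeholders; as I recall these properties report kilobytes):

    # Ask the server how big its cache is now, at its peak, and at most (values in KB)
    dbisql -nogui -c "UID=dba;PWD=sql;Server=my_mirror" \
      "SELECT PROPERTY('CurrentCacheSize') AS current_kb,
              PROPERTY('PeakCacheSize')    AS peak_kb,
              PROPERTY('MaxCacheSize')     AS max_kb"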

> learn more about how the cache works

Good luck... let us know what you find out! Personally, I try to learn what the problems look like, what the symptoms of RAM starvation are, and not "how the cache works" because (a) it is secret and (b) it changes.

"Add RAM" is the ultimate Dead Chicken, as in "wave a dead chicken over the keyboard to see if that fixes things."

Specifically, "Add RAM until your problems go away".

(17 Nov '18, 10:15) Breck Carter