We have a version 12 database that has been running for a long time. A couple of days ago we shut down and restarted it. Everything went pretty smoothly, but after about 15 minutes everything stopped working, and we noticed that there were two instances of dbsrv12 running, neither of them responding to requests. We start the database with the -oe flag to make sure we see any fatal errors, but that doesn't help here because the first server never completely dies. The only way to rectify the problem is to kill both dbsrv12 processes and restart the engine.

We are using the Solaris SPARC version of the database engine on a Sun V445 with 4 processors. Is this some type of auto-recovery behaviour? It has happened a couple of times since we restarted, but it seems to have subsided. We also start the database in quiet mode, so we don't have a console to look at. How can we see what is causing this situation?

I should note that we are still on 12.0.0. The last time the database did this, I got the latest EBF and installed it, so if and when it happens again it will come up with the latest version of SQL Anywhere. Thanks!
It is hard to know what is really happening, but I have a few guesses:
Some additional information may help:
Regarding #3 and #4: we have seen a fork() on Solaris take a long time in some situations, since this OS clones all memory pages on fork. We have made some effort to alleviate this, by telling the system not to clone pages on fork.
Regarding the questions: Solaris has the behaviour that every page in memory must have a backing store page (i.e. a swap page) assigned to it when the page is first used. We have seen some problems with this due to the way the database cache is managed: we preallocate the memory space for the cache (and this system call succeeds), but the OS does not allocate a backing store page for each cache page at that point (which is the behaviour we want).

The problem comes when we first 'touch' a cache page. At that time the OS will attempt to allocate a backing store page for it, and if the swap space is configured too small (for the overall workload on the computer), the backing store allocation will fail and the server will SEGV. :-( We have not discovered a way around this. :-( The best answer is to suggest that Solaris computers be properly configured with sufficient swap space.

If this is happening, then the SEGV may explain the behaviour you are seeing: a fork/core dump taking a long time? (just a wild guess)
answered 04 Apr '13, 08:52