Hey everyone, we're struggling with our MobiLink server performance and are looking for advice on which combination of parameters would best fit our use case. Any advice would be warmly welcomed.
Infrastructure:
We have two MobiLink servers (version 16), each with 12 cores and 80 GB of RAM. Both sit behind reverse proxies and load balancers, and the whole setup relies on cookie-based sessions. This part is working fine.
Requirements:
Current key configuration items:
We've looked at the documentation and have already tuned the server a bit over the last few years, but we're still not happy with the robustness of synchronization. We regularly get unexpected crashes, synchronization errors, and failures to read bytes. These problems appear at random and seem to be pure luck. We're looking for experiences, KPIs, or any ideas that could help us find the right set of parameters to keep our servers pumping data.
Note: simulating this in a "controlled" environment is nearly impossible.
First off, you said you're getting a lot of crashes. The software should never crash. If you get in touch with tech support, they can work with you to help us fix that.
Memory
I. 2017-05-15 08:58:45. <main> PERIODIC: TRACKED_MEMORY: 12 258 748 765
I. 2017-05-15 08:58:45. <main> PERIODIC: MEMORY_USED: 23 155 286 016
I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_SWAPPED_IN: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_SWAPPED_OUT: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_LOCKED_MAX: 2 874 283
This looks OK given your large hosts. Pages swapped in and out are 0, so you're staying in memory, which should always be the goal. TRACKED_MEMORY is the amount of memory the server thinks it has allocated and MEMORY_USED is the amount of memory the OS has given the process. It's surprising that they're so far apart. Do you have a Java or .NET VM configured to use about 10 GB of RAM? Most of what the server has allocated is storing row data, which is perfectly normal.
Volume
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ROWS_UPLOADED: 5468845
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ROWS_DOWNLOADED: 3605732328
I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_BYTES_WRITTEN: 142799040880
I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_BYTES_READ: 4105727260
~230K bytes down per sync
~5.8K rows down per sync
There's nothing to be especially worried about here. You have essentially no upload and your download is not particularly large.
Syncs
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_UPLOAD_CONNS_IN_USE: 1
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_NON_BLOCKING_ACK: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_CONNECT_FOR_ACK: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_GET_DB_WORKER_FOR_ACK: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_WAIT_FOR_DNLD_ACK: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_SEND_DNLD: 20
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_END_SYNC: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_FETCH_DNLD: 87
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_PREP_FOR_DNLD: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_APPLY_UPLOAD: 1
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_BEGIN_SYNC: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_AUTH_USER: 5
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_CONNECT: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_RECVING_UPLOAD: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_SYNC_REQUEST: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_WAITING_CONS: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: LONGEST_SYNC: 2878580
I. 2017-05-15 08:58:45. <main> PERIODIC: LONGEST_DB_WAIT: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_CONNECTED_SYNCS: 113
I. 2017-05-15 08:58:45. <main> PERIODIC: ML_NUM_CONNECTED_CLIENTS: 258
Most of your syncs are either fetching the download from the consolidated or sending it. Given the relative size of your downloads this isn't surprising. What's worrying here is LONGEST_SYNC; this is in milliseconds, so your oldest sync has been running for roughly 48 minutes and is still going. Given just the ppv output we can't know where that sync is actually stuck. You'll have to connect the MobiLink Profiler to see what the individual syncs are doing.
Errors
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_WARNINGS: 862621
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ERRORS: 15902
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_FAILED_SYNCS: 232
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_SUCCESS_SYNCS: 616417
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ROLLBACKS: 221
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_COMMITS: 3076355
I. 2017-05-15 08:58:45. <main> PERIODIC: INTERRUPTED_TCP_CONNECTION: 2283016
You have 232 failed syncs and 221 rollbacks. I'm pretty sure every rollback will cause a failed sync, which means your failures are mostly happening while the server is doing database work. There are dozens of errors for each failed sync, so I would guess you have a Java or .NET VM printing big stack traces when these syncs fail.
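For reference, the per-sync figures above come from simple division of the cumulative PERIODIC counters (assuming they count since server start). A back-of-the-envelope check in Python, purely as an illustration:

# Rough cross-check of the per-sync averages quoted above,
# using the PERIODIC counter values posted in this thread.
tcp_bytes_written = 142_799_040_880   # TCP_BYTES_WRITTEN
num_rows_downloaded = 3_605_732_328   # NUM_ROWS_DOWNLOADED
num_success_syncs = 616_417           # NUM_SUCCESS_SYNCS

print(f"{tcp_bytes_written / num_success_syncs:,.0f} bytes down per sync")   # ~231,660 (~230K)
print(f"{num_rows_downloaded / num_success_syncs:,.0f} rows down per sync")  # ~5,850 (~5.8K)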
Thank you Bill. Some of the metrics were somewhat cryptic and your explanations have cleared things up.
(16 May '17, 05:43)
yan
How should we react when we see the value of CMD_PROCESSOR_STAGE_LEN? What value would be acceptable?
(16 May '17, 12:15)
yan
The rollbacks I'm talking about are done by the ML server against the consolidated database. The server runs the scripts in a sequence of transactions; if everything succeeds then it commits, but if anything goes wrong it rolls back the transaction and fails the sync. Given the ppv output we can't see which specific part of the sync is failing. The various STAGE_LEN values list the sizes of the event queues for some thread pools in the server. They're basically debug values for me and are probably not going to be useful for customers. The only time they might be useful is if STREAM_STAGE_LEN were much larger than the number of syncs (i.e. 20-30 times larger); that would indicate the pool controlled with -wn is overworked and bumping up -wn would be good.
(16 May '17, 16:02)
Bill Somers
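As a rough way to watch for the condition Bill describes, here is a small Python sketch. It is not an official MobiLink tool; it assumes the -ppv log format shown in this thread (one "PERIODIC: NAME: value" entry per line) and uses the 20x figure above as the threshold.

# check_stream_stage.py -- illustrative sketch only.
# Scan an mlsrv log written with -ppv and flag points where
# STREAM_STAGE_LEN dwarfs the number of connected syncs.
import re
import sys

PERIODIC = re.compile(r"PERIODIC:\s+(STREAM_STAGE_LEN|NUM_CONNECTED_SYNCS):\s+([\d\s]+)")

def main(path, factor=20):
    latest = {}
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            m = PERIODIC.search(line)
            if not m:
                continue
            # Remember the most recently reported value of each counter.
            latest[m.group(1)] = int(m.group(2).replace(" ", ""))
            if "STREAM_STAGE_LEN" in latest and "NUM_CONNECTED_SYNCS" in latest:
                stage = latest["STREAM_STAGE_LEN"]
                syncs = latest["NUM_CONNECTED_SYNCS"]
                if stage > factor * max(syncs, 1):
                    print(f"{line.strip()}  <- STREAM_STAGE_LEN={stage}, "
                          f"syncs={syncs}: consider raising -wn")

if __name__ == "__main__":
    main(sys.argv[1])

In the snapshot posted below, STREAM_STAGE_LEN is 0 against 113 connected syncs, so by this rule of thumb -wn does not look overworked.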
Can you add -ppv 60 and -vm to your ML server command line and get us the logs? To start we'd only need a few minutes of log from when the server is having problems. -ppv 60 will print a bunch of performance metrics to the log every 60 seconds and -vm will print how long each part of the sync took whenever a sync finishes.
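For illustration only: the connection string, log file name, and other options below are placeholders standing in for whatever is already on your mlsrv16 command line, and only the last two switches are the additions being asked for. The invocation would look something like:
mlsrv16 -c "DSN=consolidated;UID=ml_user;PWD=***" -o mlsrv.log -ppv 60 -vm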
Here's an extract of the -ppv output on the server; it has been running for the last 4 days now. Any input?
I. 2017-05-15 08:58:45. <main> PERIODIC: DB_WORKER_CURRENT: 100
I. 2017-05-15 08:58:45. <main> PERIODIC: DB_WORKER_MAX: 100
I. 2017-05-15 08:58:45. <main> PERIODIC: INTERRUPTED_TCP_CONNECTION: 2283016
I. 2017-05-15 08:58:45. <main> PERIODIC: PRIMARY_IS_KNOWN: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: OE_STAGE_LEN: 0
...
I. 2017-05-15 08:58:45. <main> PERIODIC: TRACKED_MEMORY: 12258748765
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_UPLOAD_CONNS_IN_USE: 1
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_NON_BLOCKING_ACK: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_CONNECT_FOR_ACK: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_GET_DB_WORKER_FOR_ACK: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_WAIT_FOR_DNLD_ACK: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_SEND_DNLD: 20
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_END_SYNC: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_FETCH_DNLD: 87
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_PREP_FOR_DNLD: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_APPLY_UPLOAD: 1
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_BEGIN_SYNC: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_AUTH_USER: 5
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_CONNECT: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_RECVING_UPLOAD: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_SYNC_REQUEST: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_WAITING_CONS: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_CONNECTED_LISTENERS: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_CONNECTED_MONITORS: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_CONNECTED_FILE_XFERS: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_CONNECTED_PINGS: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_CONNECTED_SYNCS: 113
I. 2017-05-15 08:58:45. <main> PERIODIC: SERVER_IS_PRIMARY: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: MEMORY_USED: 23155286016
I. 2017-05-15 08:58:45. <main> PERIODIC: LONGEST_SYNC: 2878580
I. 2017-05-15 08:58:45. <main> PERIODIC: LONGEST_DB_WAIT: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: FREE_DISK_SPACE: 146226270208
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ROWS_UPLOADED: 5468845
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ROWS_DOWNLOADED: 3605732328
I. 2017-05-15 08:58:45. <main> PERIODIC: CMD_PROCESSOR_STAGE_LEN: 93
I. 2017-05-15 08:58:45. <main> PERIODIC: HEARTBEAT_STAGE_LEN: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: STREAM_STAGE_LEN: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: RAW_TCP_STAGE_LEN: 1035
I. 2017-05-15 08:58:45. <main> PERIODIC: DB_CONNECTIONS: 93
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_WARNINGS: 862621
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ERRORS: 15902
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_FAILED_SYNCS: 232
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_SUCCESS_SYNCS: 616417
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ROLLBACKS: 221
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_COMMITS: 3076355
I. 2017-05-15 08:58:45. <main> PERIODIC: CPU_USAGE: 636853240766
I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_IN_STREAMSTACK: 4294965889
I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_SWAPPED_IN: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_SWAPPED_OUT: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: ML_NUM_CONNECTED_CLIENTS: 258
I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_BYTES_WRITTEN: 142799040880
I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_BYTES_READ: 4105727260
I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_CONNECTIONS_REJECTED: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_CONNECTIONS_CLOSED: 658817
I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_CONNECTIONS_OPENED: 659074
I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_LOCKED_MAX: 2874283
I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_LOCKED: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_USED: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_CONNECTIONS: 257
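Since a single snapshot after 4 days of uptime hides when the failures actually happen, it may also help to diff successive -ppv snapshots and see how much the error counters move per interval. The Python sketch below is only an illustration, not an official tool; it assumes the log format shown above (one PERIODIC line per counter) and that the counters are cumulative since server start.

# ppv_deltas.py -- illustrative only: print how much selected PERIODIC
# counters increase between successive -ppv snapshots in an mlsrv log.
import re
import sys

PERIODIC = re.compile(r"PERIODIC:\s+(\w+):\s+([\d\s]+)$")
WATCH = ["NUM_FAILED_SYNCS", "NUM_ERRORS", "NUM_SUCCESS_SYNCS",
         "INTERRUPTED_TCP_CONNECTION"]

def snapshots(lines):
    """Group consecutive PERIODIC lines into {counter: value} snapshots."""
    snap = {}
    for line in lines:
        m = PERIODIC.search(line)
        if not m:
            continue
        name, value = m.group(1), int(m.group(2).replace(" ", ""))
        if name in snap:        # counter seen again -> a new snapshot begins
            yield snap
            snap = {}
        snap[name] = value
    if snap:
        yield snap

def main(path):
    prev = None
    with open(path, encoding="utf-8", errors="replace") as log:
        for snap in snapshots(log):
            if prev is not None:
                deltas = {k: snap.get(k, 0) - prev.get(k, 0) for k in WATCH}
                print(deltas)
            prev = snap

if __name__ == "__main__":
    main(sys.argv[1])

Plotting those deltas against the time of day (or against deployment and load-balancer events) might show whether the failed syncs cluster rather than being "pure luck".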
To confirm: You are primarily concerned about reliability and robustness, more than response time and throughput, correct?
Indeed, reliability and robustness are critical for us.