Hey everyone,

We're struggling with our Mobilink server performance and are looking for information about which combination of which parameter(s) would be the best for our use case. Any advice would be warmly welcomed.

Infrastructure : We have two Mobilink servers (running with version 16) with 12 cores, 80 GB RAM. Both are communicating with reverse proxies and load balancers. The whole is supported with cookie-based sessions. This is working fine.

  • CPU and RAM are both normal. We've tuned this several time.

Requirements :

  • We have 10K clients trying to synchronize each 5 minutes
  • In reality, we see that there are max. 250 synchronization sessions that can be running at the same time
  • Each synchronization do not have lots of data (in term of packet size) but lots of rows (up to 200K rows changing in one sync session).
  • The central database isn't overloaded with the actual configuration. It keeps responding in a fair time (this is an Oracle database by the way).
  • Our clients are running Ultralite or SQLAnywhere and most of the time connections are slow. Worst-case : GPRS, best-case : Ethernet

Actual key configuration items :

  • Block size : 128KB
  • Workerthreads : min = 50, max = 300
  • ODBC driver, 512 KB array and 4919 rows fetch-size
  • Network workers : 8
  • Max network connections : 2048

We've looked at documentation and have already tuned the server a bit in the last years but we're still not happy with the robustness of a synchronization. Most of the time, we get unexpected crashes, synchronisation errors, problems to read bytes. Those problems come randomly and are pure-luck based.

We're looking for experiences, KPI, or any idea that could help us find the perfect set of parameters to get our servers pumping data.

Note : simulation in a "controlled" environment is nearly impossible.

asked 11 May, 08:22

yan's gravatar image

yan
802511
accept rate: 0%

edited 11 May, 08:22

Can you add -ppv 60 and -vm to your ML server command line and get us the logs? To start we'd only need a few minutes of log from when the server is having problems. -ppv 60 will print a bunch of performance metrics to the log every 60 seconds and -vm will print how long each part of the sync took whenever a sync finishes.

(11 May, 09:14) Bill Somers
Replies hidden

I. 2017-05-15 08:58:45. <main> PERIODIC: DB_WORKER_CURRENT: 100 I. 2017-05-15 08:58:45. <main> PERIODIC: DB_WORKER_MAX: 100 I. 2017-05-15 08:58:45. <main> PERIODIC: INTERRUPTED_TCP_CONNECTION: 2283016 I. 2017-05-15 08:58:45. <main> PERIODIC: PRIMARY_IS_KNOWN: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: OE_STAGE_LEN: 0 ... I. 2017-05-15 08:58:45. <main> PERIODIC: TRACKED_MEMORY: 12258748765 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_UPLOAD_CONNS_IN_USE: 1 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_NON_BLOCKING_ACK: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_CONNECT_FOR_ACK: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_GET_DB_WORKER_FOR_ACK: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_WAIT_FOR_DNLD_ACK: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_SEND_DNLD: 20 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_END_SYNC: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_FETCH_DNLD: 87 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_PREP_FOR_DNLD: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_APPLY_UPLOAD: 1 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_BEGIN_SYNC: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_AUTH_USER: 5 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_CONNECT: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_RECVING_UPLOAD: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_SYNC_REQUEST: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_WAITING_CONS: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_CONNECTED_LISTENERS: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_CONNECTED_MONITORS: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_CONNECTED_FILE_XFERS: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_CONNECTED_PINGS: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_CONNECTED_SYNCS: 113 I. 2017-05-15 08:58:45. <main> PERIODIC: SERVER_IS_PRIMARY: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: MEMORY_USED: 23155286016 I. 2017-05-15 08:58:45. <main> PERIODIC: LONGEST_SYNC: 2878580 I. 2017-05-15 08:58:45. <main> PERIODIC: LONGEST_DB_WAIT: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: FREE_DISK_SPACE: 146226270208 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ROWS_UPLOADED: 5468845 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ROWS_DOWNLOADED: 3605732328 I. 2017-05-15 08:58:45. <main> PERIODIC: CMD_PROCESSOR_STAGE_LEN: 93 I. 2017-05-15 08:58:45. <main> PERIODIC: HEARTBEAT_STAGE_LEN: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: STREAM_STAGE_LEN: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: RAW_TCP_STAGE_LEN: 1035 I. 2017-05-15 08:58:45. <main> PERIODIC: DB_CONNECTIONS: 93 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_WARNINGS: 862621 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ERRORS: 15902 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_FAILED_SYNCS: 232 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_SUCCESS_SYNCS: 616417 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ROLLBACKS: 221 I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_COMMITS: 3076355 I. 2017-05-15 08:58:45. <main> PERIODIC: CPU_USAGE: 636853240766 I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_IN_STREAMSTACK: 4294965889 I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_SWAPPED_IN: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_SWAPPED_OUT: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: ML_NUM_CONNECTED_CLIENTS: 258 I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_BYTES_WRITTEN: 142799040880 I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_BYTES_READ: 4105727260 I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_CONNECTIONS_REJECTED: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_CONNECTIONS_CLOSED: 658817 I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_CONNECTIONS_OPENED: 659074 I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_LOCKED_MAX: 2874283 I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_LOCKED: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_USED: 0 I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_CONNECTIONS: 257

(15 May, 05:15) yan

Here's an extract of a -ppv output on the server. It has been running since last 4 days. Any inputs?

(15 May, 05:16) yan

To confirm: You are primarily concerned about reliability and robustness, more than response time and throughput, correct?

(15 May, 05:51) Breck Carter
Replies hidden

Indeed reliability and robustness is critical for us.

(15 May, 08:52) yan

First off, you said you're getting a lot of crashes. The software shouldn't ever crash. If you get in touch with tech support, they can help you help us fix that.

Memory

I. 2017-05-15 08:58:45. <main> PERIODIC: TRACKED_MEMORY: 12 258 748 765
I. 2017-05-15 08:58:45. <main> PERIODIC: MEMORY_USED: 23 155 286 016
I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_SWAPPED_IN: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_SWAPPED_OUT: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_LOCKED_MAX: 2 874 283

This looks ok given your large hosts. Pages swapped in and out are 0, so you're staying in memory, which should always be the goal. TRACKED___MEMORY is the amount of memory the server thinks it has allocated and MEMORY_USED is the amount of memory the OS has given the process. It's surprising that they're so far apart. Do you have a Java or .NET VM configured to use about 10GB of RAM? Most of what the server has allocated is storing row data, which is perfectly normal.

Volume

I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ROWS_UPLOADED: 5468845
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ROWS_DOWNLOADED: 3605732328
I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_BYTES_WRITTEN: 142799040880
I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_BYTES_READ: 4105727260

~230K bytes down per sync

~5.8K rows down per sync

There's nothing to be especially worried about here. You have essentially no upload and your download is not particularly large.

Syncs

I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_UPLOAD_CONNS_IN_USE: 1
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_NON_BLOCKING_ACK: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_CONNECT_FOR_ACK: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_GET_DB_WORKER_FOR_ACK: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_WAIT_FOR_DNLD_ACK: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_SEND_DNLD: 20
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_END_SYNC: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_FETCH_DNLD: 87
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_PREP_FOR_DNLD: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_APPLY_UPLOAD: 1
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_BEGIN_SYNC: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_AUTH_USER: 5
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_CONNECT: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_RECVING_UPLOAD: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_SYNC_REQUEST: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_WAITING_CONS: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: LONGEST_SYNC: 2878580
I. 2017-05-15 08:58:45. <main> PERIODIC: LONGEST_DB_WAIT: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_CONNECTED_SYNCS: 113
I. 2017-05-15 08:58:45. <main> PERIODIC: ML_NUM_CONNECTED_CLIENTS: 258

Most of your syncs are either fetching the download from the consolidated, or sending it. Given the relative size of your downloads this isn't surprising. What's worrying here is the LONGEST_SYNC; this is milliseconds, so your oldest sync has taken about 45 minutes and is still running. Given just the ppv output we can't know where that sync is actually stuck. You'll have to connect a ML Profiler to see what the individual syncs are doing.

Errors

I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_WARNINGS: 862621
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ERRORS: 15902
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_FAILED_SYNCS: 232
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_SUCCESS_SYNCS: 616417
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ROLLBACKS: 221
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_COMMITS: 3076355
I. 2017-05-15 08:58:45. <main> PERIODIC: INTERRUPTED_TCP_CONNECTION: 2283016

You have 232 failed syncs and 221 rollbacks. I'm pretty sure every rollback will cause a failed sync, which means your failures are mostly happening while the server is doing database work. There are dozens of errors for each failed sync, so I would guess you have a Java or .NET VM printing big stack traces when these syncs are failing.
INTERRUPTED_TCP_CONNECTION is really high, which means our self-recovering HTTP gear is kicking in. It's happening a few times per sync which seems like a lot. This might indicate a flaky network, or it might be a side effect of how liveness works with HTTP. You could try increasing the liveness timeout on the client, or setting the buffer_size parameter in the server's -x switch to something smaller the default 64000, like 32000. This should be a lower priority; it's probably not causing any real problems.

Summary

  1. Things look more or less OK. I don't think there's a magic switch you can add that will fix everything.
  2. The longest sync is taking way too long. Attach a ML Profiler to see what's wrong
  3. The syncs are failing with a lot of errors while doing database work. You should look into those errors
  4. There's a lot of interrupted HTTP requests, which indicates a flaky or slow network, but our fault-tolerant HTTP gear seems to be dealing with it. You might be able to tweak things to make this go away, but it probably won't make things any better.
permanent link

answered 15 May, 11:30

Bill%20Somers's gravatar image

Bill Somers
1.1k819
accept rate: 39%

Thank you Bill. Some of the metrics were somehow cryptic and your explanations have enlighted some things.

  1. We've identified some clients who were WAY behind in the sync process and are currently fixing it.
  2. When you talk about rollbacks, you're talking about rollbacks in the Upload phase or in the download phase (e.g. client having an error and rollbacking on its side)?
  3. Indeed, our clients can have a pretty poor connection (mobile network) depending on their situation. This isn't a big issue as they can recover as soon as they're in a more stable area.
(16 May, 05:43) yan
Replies hidden

How should we react when we see the value of CMD_PROCESSOR_STAGE_LEN ? What value would be acceptable ?

(16 May, 12:15) yan

The rollbacks I'm talking about are done by the ML server against the consolidated database. The server runs the scripts in a sequence of transactions; if everything succeeds then it commits, but if anything goes wrong it will rollback the transaction and fail the sync. Given the ppv output we can't see what specific part of the sync is failing.

The various STAGE_LEN values list the size of the event queues for some thread pools in the server. They're basically debug values for me and are probably not going to be useful for customers. The only time they might be useful is if STREAM_STAGE_LEN was much larger than the number of syncs (i.e. 20-30 times larger); that would indicate the pool controlled with -wn is overworked and bumiping up -wn would be good.

(16 May, 16:02) Bill Somers
Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Question tags:

×248

question asked: 11 May, 08:22

question was seen: 246 times

last updated: 16 May, 16:02