Increase number of unicorn processes per worker
It seems that under high load HTTP queueing time increases sharply, and this translates into an increase in overall transaction time.
Proposal
I propose that we try increasing the number of unicorn processes in the fleet and monitor how that behaves for a bit.
My reasoning is as follows:
- If we add more worker hosts, we will add pressure to the NFS mounts, because more hosts will be accessing them.
- Each worker host is already oversized (we have 55G of RAM on each host, and many, many processes).
- In general, processes are waiting on the filesystem to respond, so few of them are free to serve requests, which creates contention.
- Each worker is already keeping 45G of filesystem buffers warm; more processes would simply reuse that cache.
So I would like to try adding 5 more processes and check whether the latency spikes go away. If they do not, it means that we are actually adding more stress to the NFS hosts and we should go back to where we were.
What should we monitor
- Worker host load - Load should be about the same, or a bit higher, as more processes will be competing for CPU time.
- Worker memory usage - I expect more memory to be owned by processes.
- Spikes in latency - Spikes should disappear.
- p99 of transaction execution time - We should see timings without spikes, staying at the same level (not going up).
- p99 of HTTP queueing time - We should see the spikes disappear.
- NFS fleet IOWait and general IOPS - We should not see an increase there.
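The expectations above can be turned into a simple before/after check. The following is only a sketch: the `Snapshot` fields, the `violations` helper, and the 10% tolerance are hypothetical names and values chosen for illustration, not part of any existing monitoring setup. Host load and memory usage are left out because both are expected to rise.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Metrics for one observation window (hypothetical field names)."""
    latency_spikes: int         # number of latency spikes seen in the window
    p99_transaction_ms: float   # p99 of transaction execution time
    p99_queueing_ms: float      # p99 of HTTP queueing time
    nfs_iowait_pct: float       # NFS fleet IOWait
    nfs_iops: float             # NFS fleet IOPS


def violations(before: Snapshot, after: Snapshot, tolerance: float = 0.10) -> list:
    """Return the expectations that failed; an empty list means keep the change."""
    def grew(old: float, new: float) -> bool:
        # "Increase" means growth beyond the tolerance (an assumed threshold).
        return new > old * (1 + tolerance)

    found = []
    if after.latency_spikes > 0:
        found.append("latency spikes still present")
    if grew(before.p99_transaction_ms, after.p99_transaction_ms):
        found.append("p99 transaction time went up")
    if grew(before.p99_queueing_ms, after.p99_queueing_ms):
        found.append("p99 HTTP queueing time went up")
    if grew(before.nfs_iowait_pct, after.nfs_iowait_pct):
        found.append("NFS IOWait increased")
    if grew(before.nfs_iops, after.nfs_iops):
        found.append("NFS IOPS increased")
    return found
```

Any NFS-related entry in the result would be the signal to go back to the previous process count.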
Next actions
- Increase unicorn processes by 25%
- Monitor for a couple of days.
- Decide if the experiment is valid and consider repeating.
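The proposal mentions both "5 more processes" and "25%". The two match only if each host currently runs 20 unicorn processes, which is an assumption here since the current count is not stated. A minimal sketch of the arithmetic, using a hypothetical helper name:

```python
import math


def target_process_count(current: int, increase_pct: float) -> int:
    """Process count after raising `current` by `increase_pct` percent, rounded up."""
    return current + math.ceil(current * increase_pct / 100)


# Assumed baseline of 20 processes per host (not stated in the proposal).
print(target_process_count(20, 25))  # 25, i.e. 5 more processes
```

With a different baseline the two figures diverge, so it is worth agreeing on which one drives the change.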