ocserv with 5000 concurrent client connections results in a high rate of connection failures (99% of connections fail)
In this state, the ocserv-sm socket queue depth sits at 128 / 128 entries (as reported by `ss -ax`), suggesting that ocserv-sm is unable to keep up with requests. The server recovers once the clients time out.
Client logs show timeouts while waiting for TLS responses from the server, suggesting that SM is still working on requests from clients that have already timed out.
The code already has an option to unconditionally sleep after the accept, but this stalls subsequent processing by the ocserv-main process, resulting in slower connect times (it blocks the worker process from completing establishment of the connection). If the sleep time is too short, the server still gets flooded.
Here is what the preferred behavior would be:
- Only apply the mitigation when the SM process is busy.
- Delay accepting new clients when SM is backed up.
- Continue processing existing clients while waiting for SM to recover.
Here are the proposed changes:
- Add a timer that can be used to delay processing of an accept.
- On an accept, arm the timer.
- If another accept arrives before the timer expires, check the SM queue depth.
- If the SM queue is backlogged, add the accept to a pending accept queue; otherwise accept and re-arm the timer.
- When the timer fires, process any pending accepts.
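The steps above can be sketched as a small piece of decision logic, independent of the event loop. This is only an illustration, not existing ocserv code: every name here (`accept_throttle`, `throttle_on_connection`, `throttle_on_timer`) is hypothetical, time is passed in as plain seconds, and the caller is assumed to supply the current SM queue depth. It also assumes that when the timer fires while SM is still backlogged, the timer is simply re-armed and the pending accepts stay queued, which the list above does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: none of these names exist in ocserv. */
enum accept_action { ACCEPT_NOW, ACCEPT_DEFER };

struct accept_throttle {
    double timer_period;   /* seconds the timer stays armed after an accept */
    double timer_expiry;   /* absolute expiry time; negative means never armed */
    unsigned sm_threshold; /* SM queue depth above which SM counts as backlogged */
    unsigned pending;      /* number of accepts waiting for the timer */
};

void throttle_init(struct accept_throttle *t, double period, unsigned threshold)
{
    assert(t != NULL && period > 0.0);
    t->timer_period = period;
    t->timer_expiry = -1.0;
    t->sm_threshold = threshold;
    t->pending = 0;
}

static bool timer_armed(const struct accept_throttle *t, double now)
{
    return t->timer_expiry >= 0.0 && now < t->timer_expiry;
}

/* Called when the listening socket becomes readable. */
enum accept_action throttle_on_connection(struct accept_throttle *t,
                                          double now, unsigned sm_queue_depth)
{
    /* Only defer when connections arrive faster than the timer period
     * AND the SM queue is over the threshold. */
    if (timer_armed(t, now) && sm_queue_depth > t->sm_threshold) {
        t->pending++;
        return ACCEPT_DEFER;
    }
    /* Otherwise accept immediately and (re-)arm the timer. */
    t->timer_expiry = now + t->timer_period;
    return ACCEPT_NOW;
}

/* Called when the timer fires; returns how many pending accepts to process. */
unsigned throttle_on_timer(struct accept_throttle *t,
                           double now, unsigned sm_queue_depth)
{
    if (t->pending == 0) {
        t->timer_expiry = -1.0; /* nothing waiting: leave the timer unarmed */
        return 0;
    }
    if (sm_queue_depth > t->sm_threshold) {
        /* Assumption: SM still backlogged, so keep waiting another period. */
        t->timer_expiry = now + t->timer_period;
        return 0;
    }
    unsigned ready = t->pending;
    t->pending = 0;
    t->timer_expiry = now + t->timer_period; /* accepts happen now: re-arm */
    return ready;
}
```

In a real server the "pending accept queue" could simply be the kernel listen backlog: deferring would mean not calling `accept()` until the timer callback runs, so existing clients continue to be serviced by the event loop in the meantime.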
This gives the following behaviors:
- TCP connections arrive with (interval > timer period) -> No change in behavior.
- TCP connections arrive with (interval < timer period) and (SM queue < threshold) -> No change in behavior.
- TCP connections arrive with (interval < timer period) and (SM queue > threshold) -> Accept is delayed until queue recovers.
Thoughts?