2018-05-14 site degradation, increased load on the api
Summary
On 2018-05-14 we saw increased load on the api fleet, resulting in slow pipelines and severe degradation of api operations.
Timeline
- 10:50 - api limit lowered from 6 to 4 https://gitlab.com/gitlab-com/infrastructure/issues/4195#note_72948654
- 10:57 - 10.8rc8 deployment finished
- 15:00 - 6 new api servers added to the fleet increasing the fleet size from 14 to 20.
- 15:25 - rolled back the api limit setting so it is now back to 6
- 16:00 - gdpr enabled on gitlab.com
- 2018-05-16 11:21 - @yorickpeterse notices with `show pools` on pgbouncer that we are running dangerously close to the max connection limit of 300.
  So what I'm currently thinking is this:
  - We reduce the fleet size back to normal
  - We somewhat increase the number of database connections Unicorn can use, from 100 to e.g. 120
- 2018-05-16 11:45 - api nodes 15-20 set to drain, then maint https://performance.gitlab.net/dashboard/db/haproxy-status?orgId=1&var-env=prd&var-backend=api&var-server=All&from=now-5m&to=now
- 2018-05-16 12:12 - changed the `default_pool_size` on 10.66.4.101 from 100 to 120
- 2018-05-16 14:14 - increased api limit from 6 to 9
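
The pgbouncer headroom check from the 2018-05-16 11:21 entry can be sketched roughly as follows. This is only an illustration, not the tooling used during the incident. The helper names are hypothetical and the sample numbers are invented. The rows mimic a subset of the columns that `SHOW POOLS` prints, and the sketch assumes the 300 limit applies to server-side connections.

```python
# Hypothetical sketch: estimate how close pgbouncer is to a connection limit.
# Each row mimics a subset of `SHOW POOLS` columns. The numbers are made up.

MAX_CONNECTIONS = 300  # limit quoted in the timeline; assumed to be server-side


def server_connections(pools):
    """Sum active, idle and used server connections across all pools."""
    return sum(p["sv_active"] + p["sv_idle"] + p["sv_used"] for p in pools)


def headroom(pools, limit=MAX_CONNECTIONS):
    """Return (free connections, fraction of the limit in use)."""
    used = server_connections(pools)
    return limit - used, used / limit


# Invented sample data, not values recorded during the incident.
sample_pools = [
    {"database": "example_db", "user": "example_user",
     "sv_active": 180, "sv_idle": 60, "sv_used": 30},
    {"database": "pgbouncer", "user": "pgbouncer",
     "sv_active": 0, "sv_idle": 0, "sv_used": 0},
]

free, utilisation = headroom(sample_pools)
print(f"{free} connections free, {utilisation:.0%} of the limit in use")
```

If the limit in question is a client-side one instead, the same shape works with the `cl_active` and `cl_waiting` columns summed in place of the `sv_*` ones.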
Edited by John Jarvis