Design preventative measures for memory leaks // was: Soak Testing using GPT
Summary
Gitaly had two outages related to memory leaks fairly recently (gitlab-com/gl-infra/production#16187 and https://gitlab.com/gitlab-org/gitaly/-/issues/4732).
This issue is to investigate and design a way to detect memory leaks before they cause user-visible outages.
Possible way forward
Today GPT is primarily focused on measuring the performance of GitLab, using API response times as the metric.
Soak testing, on the other hand, focuses on the reliability of the system over an extended period of time. This type of testing may uncover failures such as memory leaks, insufficient storage quotas, or other categories of bugs that may not be obvious during shorter tests.
We should consider adding some soak tests to GPT to verify the stability of GitLab over an extended period of time.
k6 provides some guidance on how to structure soak tests (soak-testing-in-k6), so perhaps we can use that as a baseline for where to start:
- I don't believe this would require any fundamental changes to GPT; it would just mean adding a new category of test with extended runtime parameters
- We could use a select number of existing tests as a starting point
- We may have to extend monitoring and failure conditions so that a run fails if certain CPU/RAM/storage quotas are exceeded
- This would also require defining what the acceptance criteria are
- We would need to be conscious of the cost of running tests like this. Running a large volume of tests over an extended period of time may incur significant costs from our cloud providers, so we may need to balance that against the value this type of test would provide
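
To make the "acceptance criteria" point more concrete, here is a minimal sketch of how a soak run could be judged from memory samples. It is not part of GPT or k6. All names (`evaluate_soak_run`, `SoakVerdict`, the threshold parameters) are hypothetical, and it assumes we can already export `(seconds since start, resident memory in bytes)` samples for a Gitaly process from whatever monitoring we have. It fits a least-squares line through the samples taken after a warm-up period and fails the run if memory grows faster than an allowed rate or exceeds a peak quota. The thresholds in the example are placeholders, not proposed values.

```python
from dataclasses import dataclass
from typing import List, Tuple

MIB = 1024 * 1024

# One sample is (seconds since test start, resident memory in bytes).
Sample = Tuple[float, float]


@dataclass
class SoakVerdict:
    passed: bool
    growth_bytes_per_hour: float
    peak_bytes: float
    reason: str


def least_squares_slope(samples: List[Sample]) -> float:
    """Return the ordinary least-squares slope in bytes per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("samples must span more than one timestamp")
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return num / denom


def evaluate_soak_run(
    samples: List[Sample],
    max_growth_bytes_per_hour: float,
    max_peak_bytes: float,
    warmup_seconds: float = 0.0,
) -> SoakVerdict:
    """Judge a soak run against a growth-rate limit and a peak quota.

    Samples taken before `warmup_seconds` are ignored so that cache
    warm-up is not mistaken for a leak (an assumption of this sketch).
    """
    steady = [(t, m) for t, m in samples if t >= warmup_seconds]
    growth = least_squares_slope(steady) * 3600.0
    peak = max(m for _, m in steady)
    if peak > max_peak_bytes:
        return SoakVerdict(False, growth, peak, "peak memory exceeded quota")
    if growth > max_growth_bytes_per_hour:
        return SoakVerdict(False, growth, peak, "memory grew faster than allowed")
    return SoakVerdict(True, growth, peak, "within limits")


if __name__ == "__main__":
    # Synthetic data: one sample per minute for 4 hours, growing 10 MiB per hour.
    leaky = [(60.0 * i, 500 * MIB + (10 * MIB / 60) * i) for i in range(241)]
    print(evaluate_soak_run(leaky, max_growth_bytes_per_hour=5 * MIB,
                            max_peak_bytes=2048 * MIB))
```

Judging on the growth trend as well as on an absolute quota means a slow leak can be flagged within a run of a few hours, before the process gets anywhere near its memory limit. That could help keep soak runs shorter and cheaper.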