Discuss best way for PMs to prioritize performance and availability work
Based on https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7173 and some of the work that followed from it, it appears we need a better way to prioritize performance and availability work.
Examples include:
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7173 - scaling of mirror repos caused problems for the overall service. To my knowledge, we don't limit the number of repos a customer can replicate.
- https://gitlab.com/gitlab-org/gitlab-ce/issues/64035 - JUnit test artifacts were not checked for size (should we limit customers here?)
- https://gitlab.com/gitlab-org/gitlab-ce/issues/64176 - our handling of error cases in caching was poor
- https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/30422 - we did not handle the failures gracefully
One idea is to put some focus on Chaos Monkey-style testing, where we simulate failures and then diagnose and reproduce the resulting problems. This type of work would need to be prioritized for the product.
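To make the failure-simulation idea concrete, here is a minimal Python sketch of what such a test could look like for the caching case. It is not taken from the GitLab codebase. `FlakyCache`, `CacheUnavailable`, `read_through`, and `run_experiment` are hypothetical names invented for illustration, and the in-memory dict stands in for a real cache.

```python
import random


class CacheUnavailable(Exception):
    """Raised by the fault injector to mimic a cache outage (hypothetical)."""


class FlakyCache:
    """In-memory stand-in for a cache that fails a configurable share of calls."""

    def __init__(self, failure_rate, seed=0):
        self._data = {}
        self._failure_rate = failure_rate
        # Seeded generator so a failing run can be replayed exactly.
        self._rng = random.Random(seed)

    def _maybe_fail(self):
        if self._rng.random() < self._failure_rate:
            raise CacheUnavailable("injected cache failure")

    def get(self, key):
        self._maybe_fail()
        return self._data.get(key)

    def set(self, key, value):
        self._maybe_fail()
        self._data[key] = value


def read_through(cache, key, load):
    """Return the value for key, falling back to the loader if the cache fails."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CacheUnavailable:
        return load(key)  # cache is down: serve from the source of truth
    value = load(key)
    try:
        cache.set(key, value)
    except CacheUnavailable:
        pass  # a failed cache write must not fail the request
    return value


def run_experiment(failure_rate, requests=1000):
    """Count requests that still return the correct answer under injected faults."""
    cache = FlakyCache(failure_rate)
    ok = 0
    for i in range(requests):
        key = f"project-{i % 10}"
        if read_through(cache, key, lambda k: k.upper()) == key.upper():
            ok += 1
    return ok
```

Calling `run_experiment(0.5)` fails roughly half of all cache calls and reports how many requests were still served correctly. A drop below the request count would point at an error path that is not handled gracefully, and the fixed seed lets the same run be repeated while diagnosing it.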
Assigned to Scott unless he feels this should be driven from engineering.
Edited by Christopher Lefelhocz