Discuss the best way for PMs to prioritize performance and availability work

Based on https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7173 and some of the follow-up work, it appears we need a better way to prioritize performance and availability work.

Examples include:

https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7173 - scaling of mirror repos led to problems with the overall service. To my knowledge, we don't limit the number of repos a customer can replicate.
https://gitlab.com/gitlab-org/gitlab-ce/issues/64035 - JUnit test artifacts had no size limit (should we limit customers here?)
https://gitlab.com/gitlab-org/gitlab-ce/issues/64176 - we handled error cases in caching poorly
https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/30422 - we did not handle failures gracefully

One idea is to put some focus on Chaos Monkey-style testing, where we deliberately simulate failures and then diagnose and reproduce the problems they expose. This type of work would need to be prioritized on the product side.
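
To make the idea concrete, here is a minimal sketch of what a failure-injection test could look like for the caching case above. It is illustrative only: `FlakyCache`, `fetch_with_fallback`, and `run_experiment` are hypothetical names, not existing GitLab code, and the real work would target our actual cache and service dependencies.

```python
import random


class FlakyCache:
    """Hypothetical in-memory cache that fails on a configurable fraction of calls."""

    def __init__(self, failure_rate, seed=0):
        self._data = {}
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so a failing run can be reproduced

    def _maybe_fail(self):
        # Inject a failure with probability `failure_rate`.
        if self._rng.random() < self._failure_rate:
            raise ConnectionError("injected cache failure")

    def get(self, key):
        self._maybe_fail()
        return self._data.get(key)

    def set(self, key, value):
        self._maybe_fail()
        self._data[key] = value


def fetch_with_fallback(cache, key, load):
    """Read through the cache; fall back to the source of truth on cache errors."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except ConnectionError:
        return load(key)  # cache is down: serve directly from the source
    value = load(key)
    try:
        cache.set(key, value)
    except ConnectionError:
        pass  # a failed cache write should not fail the request
    return value


def run_experiment(failure_rate, requests=1000):
    """Return how many requests failed or returned a wrong value under injected faults."""
    cache = FlakyCache(failure_rate)

    def load(key):
        return f"value-for-{key}"

    errors = 0
    for i in range(requests):
        key = i % 10
        try:
            ok = fetch_with_fallback(cache, key, load) == load(key)
        except ConnectionError:
            ok = False
        if not ok:
            errors += 1
    return errors
```

If the fallback path is correct, `run_experiment` should report zero failed requests at any injected failure rate. A non-zero count, together with the fixed seed, gives a reproducible case to diagnose.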

Assigned to Scott unless he feels this should be driven from engineering.

Edited Jul 08, 2019 by Christopher Lefelhocz