Testing gitlab-sshd in staging
Problem Statement
Twice now we've attempted to roll out gitlab-sshd and have been forced to roll back. The first attempt was rolled back after we saw extremely high memory consumption that led to failed Pods. On the second attempt, only canary was taking a small fraction of traffic, but the number of Context cancelled errors was abnormally high.
It's clear at this point that testing within gitlab-shell alone has been insufficient. While it is widely known that staging differs from production in various ways, we should be able to discover issues like these ahead of our next attempted rollout.
Milestones
- Link to existing testing strategies that were utilized for review
- Discuss if the existing testing that has been performed is sufficient
- Discuss additional testing strategies that can be utilized to showcase that we've covered all potential failure scenarios
- Document/Create issues for actionable items