During an outage, establish an incident leader to avoid multiple people executing the same operations
I suggest we find a way to quickly establish a leader during outage incidents. It has been a while since I was last on call during a GitLab.com outage, but on that occasion at least two of us were actively running DB commands. That wasn't much of a concern for those particular commands, but this type of behavior could lead to unexpected results.
For example, what if one person issues a reboot while another is still gathering data or trying to resolve the problem without a restart?
The incident leader could delegate particular tasks to others, for example tweeting a status update or doing X investigative work. However, no action that changes, restarts, or removes something should be taken without the leader's OK.
How to establish a leader?
The logical choice is the responding on-call agent. However, there could be cases where someone more qualified is available and online. The simplest way would be to say so in chat: "I will lead the outage". If a Production Engineer is online and wishes to handle the outage instead, they can simply reply: "Thanks, X. I'm online and can handle this if you'd like."
@northrup @pcarranza Thoughts?