SSH outage after RC3 deploy
We were getting reports from many users that SSH connections were failing with:
ssh_exchange_identification: Connection closed by remote host
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
From the unicorn logs, I saw that a number of machines had stuck unicorn processes, preventing the new unicorn workers from binding to port 8080. For example:
root@git-06.sv.prd.gitlab.com:/var/log/gitlab/unicorn# ps -ef | grep 9931
git 9931 1 77 18:21 ? 00:49:04 unicorn worker[18] -D -E production -c /var/opt/gitlab/gitlab-rails/etc/unicorn.rb /opt/gitlab/embedded/service/gitlab-rails/config.ru
root 11559 11135 0 19:24 pts/0 00:00:00 grep 9931
Once this PID was forcibly killed, unicorn started up again.
This may be related to https://gitlab.com/gitlab-com/infrastructure/issues/3548.
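For reference, a minimal shell sketch of the manual remediation: it lists whatever is still bound to the unicorn port and force-kills it so the new master can start. Port 8080 comes from the report above; the use of lsof, and the assumption that only unicorn should ever hold this port on these hosts, are mine.

```sh
#!/bin/sh
# Find processes still listening on the unicorn port and force-kill them,
# mirroring the manual fix applied during the incident.
# Assumes lsof is installed; 8080 is the port from the ps output above.

PORT=8080

# -t prints PIDs only; -sTCP:LISTEN restricts the match to listening sockets
for pid in $(lsof -t -i "TCP:${PORT}" -sTCP:LISTEN); do
  echo "Force-killing stuck process ${pid} holding port ${PORT}:"
  ps -fp "${pid}"
  kill -9 "${pid}"
done
```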
While this was happening, our SSH log showed a high number of com.jcraft.jsch.JSchException errors, which appear to come from JSch, the Java SSH implementation (http://www.jcraft.com/jsch/); see https://log.gitlap.com/goto/bc267de46c8db484857a09af844264d0.
Here is what I suspect happened:
- Unicorn was HUP'ed
- Due to https://gitlab.com/gitlab-com/infrastructure/issues/3548#note_54374207, some unicorn processes did not terminate properly, holding on to port 8080
- Unicorn tried to start up and failed
- SSH connections would fail because unicorn was not up (a quick check for this is sketched after this list)
- Java clients kept reconnecting, saturating the SSH connections
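To confirm the "unicorn was not up" step on an affected host, a probe along these lines is enough. This is only a sketch: it assumes the Omnibus default of unicorn listening on 127.0.0.1:8080 and that curl is available.

```sh
#!/bin/sh
# Probe the local unicorn port directly. Per the sequence above, if nothing
# answers here, git-over-SSH fails as well, because the SSH authorization
# checks go through the same Rails application.
# Assumption: unicorn listens on 127.0.0.1:8080 (Omnibus default).

PORT=8080

if curl -s -o /dev/null --max-time 5 "http://127.0.0.1:${PORT}/"; then
  echo "OK: unicorn is answering on port ${PORT}"
else
  echo "FAIL: nothing answering on port ${PORT} -- SSH pushes/clones will fail"
  exit 1
fi
```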
Some action items I think we should consider:
- Figure out how to mitigate the Prometheus crash (e.g. for now, turn off Prometheus metrics in the application before the HUP)
- Add alerting around failure to start up unicorn (e.g. from mtail looking at /var/log/gitlab/unicorn/unicorn_stderr.log); a rough sketch of such a check follows this list
- Add alerting on SSH/clone failures
- Investigate the SSH retries in the HAProxy logs and see if we need to throttle them
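Until a proper mtail/Prometheus rule is in place, the unicorn startup alerting item could be covered by a cron-style check along these lines. This is only a sketch: the log path comes from the item above, while the error strings it greps for (Errno::EADDRINUSE / "Address already in use") are my assumption about what a failed bind on port 8080 writes to that log.

```sh
#!/bin/sh
# Simple check for unicorn startup/bind failures in unicorn_stderr.log,
# intended to be run from cron or a monitoring wrapper.

LOG=/var/log/gitlab/unicorn/unicorn_stderr.log

# Count bind failures in the most recent part of the log (assumed error strings).
failures=$(tail -n 1000 "$LOG" | grep -cE 'EADDRINUSE|Address already in use')

if [ "$failures" -gt 0 ]; then
  echo "CRITICAL: ${failures} unicorn bind failure(s) seen in ${LOG}"
  exit 2   # non-zero exit so the wrapper can alert on it
fi

echo "OK: no recent unicorn bind failures"
```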
