SSH outage after RC3 deploy
We were getting reports from many users that SSH connections were failing with:
ssh_exchange_identification: Connection closed by remote host
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
From the unicorn logs, I saw that a number of machines had stuck unicorn processes, preventing the new unicorn workers from binding to port 8080. For example:
root@git-06.sv.prd.gitlab.com:/var/log/gitlab/unicorn# ps -ef | grep 9931
git 9931 1 77 18:21 ? 00:49:04 unicorn worker[18] -D -E production -c /var/opt/gitlab/gitlab-rails/etc/unicorn.rb /opt/gitlab/embedded/service/gitlab-rails/config.ru
root 11559 11135 0 19:24 pts/0 00:00:00 grep 9931
Once this PID was forcibly killed, unicorn started up again.
This may be related to https://gitlab.com/gitlab-com/infrastructure/issues/3548.
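For reference, a minimal shell sketch of the manual remediation: it lists whatever is still bound to the unicorn port and force-kills it so the new master can start. Port 8080 comes from the report above; the use of lsof, and the assumption that only unicorn should ever hold this port on these hosts, are mine.

```sh
#!/bin/sh
# Find processes still listening on the unicorn port and force-kill them,
# mirroring the manual fix applied during the incident.
# Assumes lsof is installed; 8080 is the port from the ps output above.

PORT=8080

# -t prints PIDs only; -sTCP:LISTEN restricts the match to listening sockets
for pid in $(lsof -t -i "TCP:${PORT}" -sTCP:LISTEN); do
  echo "Force-killing stuck process ${pid} holding port ${PORT}:"
  ps -fp "${pid}"
  kill -9 "${pid}"
done
```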
While this was happening, our SSH log showed a high number of com.jcraft.jsch.JSchException errors, which appear to come from JSch, the Java SSH implementation (http://www.jcraft.com/jsch/); see https://log.gitlap.com/goto/bc267de46c8db484857a09af844264d0.
Here is what I suspect happened:
- Unicorn was HUP'ed
- Due to https://gitlab.com/gitlab-com/infrastructure/issues/3548#note_54374207, some unicorn processes did not terminate properly, holding on to port 8080
- Unicorn tried to start up and failed
- SSH connections would fail because unicorn was not up (a quick check for this is sketched after this list)
- Java clients kept reconnecting, saturating the SSH connections
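To confirm the "unicorn was not up" step on an affected host, a probe along these lines is enough. This is only a sketch: it assumes the Omnibus default of unicorn listening on 127.0.0.1:8080 and that curl is available.

```sh
#!/bin/sh
# Probe the local unicorn port directly. Per the sequence above, if nothing
# answers here, git-over-SSH fails as well, because the SSH authorization
# checks go through the same Rails application.
# Assumption: unicorn listens on 127.0.0.1:8080 (Omnibus default).

PORT=8080

if curl -s -o /dev/null --max-time 5 "http://127.0.0.1:${PORT}/"; then
  echo "OK: unicorn is answering on port ${PORT}"
else
  echo "FAIL: nothing answering on port ${PORT} -- SSH pushes/clones will fail"
  exit 1
fi
```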
Some action items I think we should consider:
- Figure out how to mitigate the Prometheus crash (e.g. for now, turn off Prometheus metrics in the application before the HUP)
- Add alerting around failure to start up unicorn (e.g. from mtail looking at /var/log/gitlab/unicorn/unicorn_stderr.log); a rough sketch of such a check follows this list
- Add alerting on SSH/clone failures
- Investigate the SSH retries in the HAProxy logs and see if we need to throttle them
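Until a proper mtail/Prometheus rule is in place, the unicorn startup alerting item could be covered by a cron-style check along these lines. This is only a sketch: the log path comes from the item above, while the error strings it greps for (Errno::EADDRINUSE / "Address already in use") are my assumption about what a failed bind on port 8080 writes to that log.

```sh
#!/bin/sh
# Simple check for unicorn startup/bind failures in unicorn_stderr.log,
# intended to be run from cron or a monitoring wrapper.

LOG=/var/log/gitlab/unicorn/unicorn_stderr.log

# Count bind failures in the most recent part of the log (assumed error strings).
failures=$(tail -n 1000 "$LOG" | grep -cE 'EADDRINUSE|Address already in use')

if [ "$failures" -gt 0 ]; then
  echo "CRITICAL: ${failures} unicorn bind failure(s) seen in ${LOG}"
  exit 2   # non-zero exit so the wrapper can alert on it
fi

echo "OK: no recent unicorn bind failures"
```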
