SSH outage after RC3 deploy
We were getting reports from many users that SSH connections were failing:
```
ssh_exchange_identification: Connection closed by remote host
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
```
From the unicorn logs, I saw that a number of machines had stuck unicorn processes, preventing the new unicorn workers from binding to port 8080. For example:
```
root@git-06.sv.prd.gitlab.com:/var/log/gitlab/unicorn# ps -ef | grep 9931
git 9931 1 77 18:21 ? 00:49:04 unicorn worker[18] -D -E production -c /var/opt/gitlab/gitlab-rails/etc/unicorn.rb /opt/gitlab/embedded/service/gitlab-rails/config.ru
root 11559 11135 0 19:24 pts/0 00:00:00 grep 9931
```
Once this stuck process was forcibly killed, unicorn started up again.
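For future responders, here is a quick way to confirm which process is still holding the port and to clear it. This is a minimal sketch; it assumes `lsof` is available on the host:
```
# Show the process currently bound to unicorn's port (8080 in our config)
lsof -nP -iTCP:8080 -sTCP:LISTEN

# If it is a stuck unicorn worker, kill it forcibly so unicorn can bind again
# (9931 is the stuck worker from the ps output above)
kill -9 9931

# Verify unicorn came back up ([u] keeps grep from matching itself)
ps -ef | grep '[u]nicorn'
```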
This may be related to https://gitlab.com/gitlab-com/infrastructure/issues/3548.
While this was happening, we saw a high number of exceptions in our SSH log with `com.jcraft.jsch.JSchException`, which appears to come from JSch, a Java SSH implementation (http://www.jcraft.com/jsch/); see https://log.gitlap.com/goto/bc267de46c8db484857a09af844264d0.

What I suspect happened:
1. Unicorn was HUP'ed
2. Due to https://gitlab.com/gitlab-com/infrastructure/issues/3548#note_54374207, some unicorn processes did not terminate properly and held on to port 8080 (see the `ps` sketch after this list)
3. Unicorn tried to start up again and failed because the port was still in use
4. SSH connections failed because gitlab-shell's access checks go through the unicorn API, which was down
5. Java clients kept reconnecting, saturating the SSH connections
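To make steps 2 and 3 concrete: surviving old workers can be spotted from process start times after a reload. A rough sketch, not a tested procedure:
```
# List unicorn processes with their start times; workers that predate the
# reload and are still running are candidates for being stuck on the port
ps -eo pid,lstart,etime,args | grep '[u]nicorn'
```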
Some action items I think we should consider:
1. Figure out how to mitigate the Prometheus crash (e.g. for now, turn off Prometheus metrics in the application before the HUP)
2. Add alerting around failures to start unicorn (e.g. from mtail watching `/var/log/gitlab/unicorn/unicorn_stderr.log`; a rough check sketch follows this list)
3. Add alerting for SSH/clone failures
4. Investigate the SSH retries in the HAProxy logs and see whether we need to throttle them (a log-analysis sketch also follows below)
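For item 2, until we have a proper mtail program, something along these lines could back a simple check. This is a sketch only; the `Errno::EADDRINUSE` string is an assumption about how a failed bind surfaces in `unicorn_stderr.log`:
```
#!/bin/sh
# Sketch of a monitoring check: flag recent unicorn bind failures.
# Assumption: a failed bind shows up as Errno::EADDRINUSE in the log.
LOG=/var/log/gitlab/unicorn/unicorn_stderr.log

if tail -n 500 "$LOG" | grep -q 'Errno::EADDRINUSE'; then
  echo "CRITICAL: unicorn failed to bind its port"
  exit 2
fi
echo "OK: no recent unicorn bind failures"
exit 0
```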
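For item 4, a first pass could simply count SSH connections per client IP to see whether a few clients dominate. A rough sketch; the log path, the frontend name `ssh`, and the field position all depend on our HAProxy log format, so treat them as assumptions:
```
# Count SSH frontend connections per client IP; field 6 (client ip:port)
# is an assumption based on HAProxy's default syslog line layout
grep ' ssh ' /var/log/haproxy.log \
  | awk '{print $6}' | cut -d: -f1 \
  | sort | uniq -c | sort -rn | head -20
```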