2020-05-30: dev.gitlab.org is down
/label incident IncidentActive
Summary
dev.gitlab.org is down
A bad deploy to dev.gitlab.org has taken it completely down. The code causing the issue is gitlab-org/gitlab!32991 (merged).
Stack trace from puma, which was failing to start:
2020-05-30_04:38:15.48128 {"timestamp":"2020-05-30T04:38:15.481Z","pid":110919,"message":"! Unable to load application: LoadError: No such file to load -- json-schema.rb"}
2020-05-30_04:38:15.48138 bundler: failed to load command: puma (/opt/gitlab/embedded/bin/puma)
2020-05-30_04:38:15.48145 LoadError: No such file to load -- json-schema.rb
2020-05-30_04:38:15.48145 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:324:in `require'
2020-05-30_04:38:15.48145 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:324:in `block in require'
2020-05-30_04:38:15.48146 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:291:in `load_dependency'
2020-05-30_04:38:15.48146 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:324:in `require'
2020-05-30_04:38:15.48146 /opt/gitlab/embedded/service/gitlab-rails/app/validators/json_schema_validator.rb:3:in `<top (required)>'
2020-05-30_04:38:15.48146 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:324:in `require'
2020-05-30_04:38:15.48147 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:324:in `block in require'
2020-05-30_04:38:15.48147 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:291:in `load_dependency'
2020-05-30_04:38:15.48148 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:324:in `require'
2020-05-30_04:38:15.48148 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:411:in `block in require_or_load'
2020-05-30_04:38:15.48149 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:40:in `block in load_interlock'
2020-05-30_04:38:15.48149 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies/interlock.rb:14:in `block in loading'
2020-05-30_04:38:15.48149 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/concurrency/share_lock.rb:151:in `exclusive'
2020-05-30_04:38:15.48149 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies/interlock.rb:13:in `loading'
2020-05-30_04:38:15.48150 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:40:in `load_interlock'
2020-05-30_04:38:15.48150 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:389:in `require_or_load'
2020-05-30_04:38:15.48151 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:544:in `load_missing_constant'
2020-05-30_04:38:15.48151 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:214:in `const_missing'
2020-05-30_04:38:15.48152 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:581:in `load_missing_constant'
2020-05-30_04:38:15.48152 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:214:in `const_missing'
2020-05-30_04:38:15.48152 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:581:in `load_missing_constant'
2020-05-30_04:38:15.48153 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:214:in `const_missing'
2020-05-30_04:38:15.48153 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activemodel-6.0.3/lib/active_model/validations/validates.rb:119:in `const_get'
2020-05-30_04:38:15.48153 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activemodel-6.0.3/lib/active_model/validations/validates.rb:119:in `block in validates'
2020-05-30_04:38:15.48154 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activemodel-6.0.3/lib/active_model/validations/validates.rb:114:in `each'
2020-05-30_04:38:15.48154 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activemodel-6.0.3/lib/active_model/validations/validates.rb:114:in `validates'
2020-05-30_04:38:15.48155 /opt/gitlab/embedded/service/gitlab-rails/app/models/ci/build_report_result.rb:13:in `<class:BuildReportResult>'
2020-05-30_04:38:15.48155 /opt/gitlab/embedded/service/gitlab-rails/app/models/ci/build_report_result.rb:4:in `<module:Ci>'
2020-05-30_04:38:15.48156 /opt/gitlab/embedded/service/gitlab-rails/app/models/ci/build_report_result.rb:3:in `<top (required)>'
2020-05-30_04:38:15.48156 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:324:in `require'
2020-05-30_04:38:15.48156 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:324:in `block in require'
2020-05-30_04:38:15.48156 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:291:in `load_dependency'
2020-05-30_04:38:15.48157 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:324:in `require'
2020-05-30_04:38:15.48158 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:411:in `block in require_or_load'
2020-05-30_04:38:15.48158 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:40:in `block in load_interlock'
2020-05-30_04:38:15.48158 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies/interlock.rb:14:in `block in loading'
2020-05-30_04:38:15.48159 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/concurrency/share_lock.rb:151:in `exclusive'
2020-05-30_04:38:15.48159 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies/interlock.rb:13:in `loading'
2020-05-30_04:38:15.48159 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:40:in `load_interlock'
2020-05-30_04:38:15.48160 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:389:in `require_or_load'
2020-05-30_04:38:15.48160 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:367:in `depend_on'
2020-05-30_04:38:15.48161 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activesupport-6.0.3/lib/active_support/dependencies.rb:280:in `require_dependency'
2020-05-30_04:38:15.48161 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/railties-6.0.3/lib/rails/engine.rb:481:in `block (2 levels) in eager_load!'
2020-05-30_04:38:15.48161 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/railties-6.0.3/lib/rails/engine.rb:480:in `each'
2020-05-30_04:38:15.48162 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/railties-6.0.3/lib/rails/engine.rb:480:in `block in eager_load!'
2020-05-30_04:38:15.48162 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/railties-6.0.3/lib/rails/engine.rb:477:in `each'
2020-05-30_04:38:15.48162 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/railties-6.0.3/lib/rails/engine.rb:477:in `eager_load!'
2020-05-30_04:38:15.48163 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/railties-6.0.3/lib/rails/application.rb:509:in `eager_load!'
2020-05-30_04:38:15.48163 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/railties-6.0.3/lib/rails/engine.rb:356:in `eager_load!'
2020-05-30_04:38:15.48163 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/railties-6.0.3/lib/rails/application/finisher.rb:123:in `each'
2020-05-30_04:38:15.48164 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/railties-6.0.3/lib/rails/application/finisher.rb:123:in `block in <module:Finisher>'
2020-05-30_04:38:15.48164 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/railties-6.0.3/lib/rails/initializable.rb:32:in `instance_exec'
2020-05-30_04:38:15.48165 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/railties-6.0.3/lib/rails/initializable.rb:32:in `run'
2020-05-30_04:38:15.48165 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/railties-6.0.3/lib/rails/initializable.rb:61:in `block in run_initializers'
2020-05-30_04:38:15.48165 /opt/gitlab/embedded/lib/ruby/2.6.0/tsort.rb:228:in `block in tsort_each'
2020-05-30_04:38:15.48166 /opt/gitlab/embedded/lib/ruby/2.6.0/tsort.rb:350:in `block (2 levels) in each_strongly_connected_component'
2020-05-30_04:38:15.48166 /opt/gitlab/embedded/lib/ruby/2.6.0/tsort.rb:431:in `each_strongly_connected_component_from'
2020-05-30_04:38:15.48166 /opt/gitlab/embedded/lib/ruby/2.6.0/tsort.rb:349:in `block in each_strongly_connected_component'
2020-05-30_04:38:15.48167 /opt/gitlab/embedded/lib/ruby/2.6.0/tsort.rb:347:in `each'
2020-05-30_04:38:15.48167 /opt/gitlab/embedded/lib/ruby/2.6.0/tsort.rb:347:in `call'
2020-05-30_04:38:15.48168 /opt/gitlab/embedded/lib/ruby/2.6.0/tsort.rb:347:in `each_strongly_connected_component'
2020-05-30_04:38:15.48168 /opt/gitlab/embedded/lib/ruby/2.6.0/tsort.rb:226:in `tsort_each'
2020-05-30_04:38:15.48169 /opt/gitlab/embedded/lib/ruby/2.6.0/tsort.rb:205:in `tsort_each'
2020-05-30_04:38:15.48169 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/railties-6.0.3/lib/rails/initializable.rb:60:in `run_initializers'
2020-05-30_04:38:15.48169 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/railties-6.0.3/lib/rails/application.rb:363:in `initialize!'
2020-05-30_04:38:15.48169 /opt/gitlab/embedded/service/gitlab-rails/config/environment.rb:5:in `<top (required)>'
2020-05-30_04:38:15.48170 /opt/gitlab/embedded/service/gitlab-rails/config.ru:18:in `require'
2020-05-30_04:38:15.48170 /opt/gitlab/embedded/service/gitlab-rails/config.ru:18:in `block in <main>'
2020-05-30_04:38:15.48170 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/rack-2.0.9/lib/rack/builder.rb:55:in `instance_eval'
2020-05-30_04:38:15.48171 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/rack-2.0.9/lib/rack/builder.rb:55:in `initialize'
2020-05-30_04:38:15.48171 /opt/gitlab/embedded/service/gitlab-rails/config.ru:in `new'
2020-05-30_04:38:15.48171 /opt/gitlab/embedded/service/gitlab-rails/config.ru:in `<main>'
2020-05-30_04:38:15.48172 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/rack-2.0.9/lib/rack/builder.rb:49:in `eval'
2020-05-30_04:38:15.48172 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/rack-2.0.9/lib/rack/builder.rb:49:in `new_from_string'
2020-05-30_04:38:15.48173 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/rack-2.0.9/lib/rack/builder.rb:40:in `parse_file'
2020-05-30_04:38:15.48173 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/gitlab-puma-4.3.3.gitlab.2/lib/puma/configuration.rb:321:in `load_rackup'
2020-05-30_04:38:15.48173 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/gitlab-puma-4.3.3.gitlab.2/lib/puma/configuration.rb:246:in `app'
2020-05-30_04:38:15.48173 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/gitlab-puma-4.3.3.gitlab.2/lib/puma/runner.rb:155:in `load_and_bind'
2020-05-30_04:38:15.48174 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/gitlab-puma-4.3.3.gitlab.2/lib/puma/cluster.rb:413:in `run'
2020-05-30_04:38:15.48174 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/gitlab-puma-4.3.3.gitlab.2/lib/puma/launcher.rb:172:in `run'
2020-05-30_04:38:15.48174 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/gitlab-puma-4.3.3.gitlab.2/lib/puma/cli.rb:80:in `run'
2020-05-30_04:38:15.48175 /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/gitlab-puma-4.3.3.gitlab.2/bin/puma:10:in `<top (required)>'
2020-05-30_04:38:15.48176 /opt/gitlab/embedded/bin/puma:23:in `load'
2020-05-30_04:38:15.48176 /opt/gitlab/embedded/bin/puma:23:in `<top (required)>'
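Read from the bottom up, the trace shows puma loading config.ru, Rails eager-loading app/models/ci/build_report_result.rb, the validates call there autoloading app/validators/json_schema_validator.rb, and that file's require failing because the library is not installed. The following is a minimal sketch of a check that would surface this class of failure before a restart. The helper name missing_libraries and the library names passed to it are hypothetical and not part of GitLab; a real check would also need to run inside the application's own bundle.

```ruby
# Hypothetical smoke check (not part of GitLab): report which libraries
# cannot be required in the current Ruby environment.
def missing_libraries(names)
  names.reject do |name|
    begin
      require name
      true               # loaded (or already loaded): not missing
    rescue LoadError
      false              # same error class that stopped puma in the trace
    end
  end
end

# 'json' ships with Ruby; the second name is deliberately bogus.
puts missing_libraries(['json', 'no-such-library-for-this-sketch']).inspect
```

An empty result means every listed library could be loaded; any name returned would cause the same LoadError at boot once eager loading reaches a file that requires it.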
Timeline
All times UTC.
2020-05-30
- 03:42 - Pagerduty Alert https://gitlab.pagerduty.com/incidents/P9W2KHJ fires, @ggillies acknowledges and starts investigating
- 03:42 - 04:50 - @ggillies tries multiple things to resurrect dev.gitlab.org: restarts, looking at logs, removing stale log files, and tracing through different logs to find the final error above. @ggillies also attempts an apt reinstall of the gitlab-ce package. @ggillies then starts digging through the GitLab codebase to determine whether the failing code is new or old. If it is new, that code is the likely culprit. If it is old, something else might have changed.
- 04:52 - @ggillies declares an incident in Slack using the /incident declare command. @AnthonySandoval is also paged
- 05:00 - @stanhu comments on this incident review outlining the quickest steps to get back up and running. By the time @ggillies noticed this comment, he had already started a downgrade of the gitlab-ce package from 13.0.3+rnightly.156439.9c4bb94a-0 to 13.0.2+rnightly.156266.c827b288-0. @stanhu's fix was applied and fixed the current release; the apt downgrade then completed (after a database backup) and also fixed the issue.
- 05:07 - Pagerduty Alert https://gitlab.pagerduty.com/incidents/P9W2KHJ is resolved
- 05:12 - @ggillies runs apt-mark hold gitlab-ce to stop any more upgrades/deployments to dev.gitlab.org until a proper fix is in place
Incident Review
Summary
For a period of 90 minutes on 2020-05-30, dev.gitlab.org was completely inaccessible. dev.gitlab.org is a single Azure VM that is set to automatically update once a day to a newer version of the GitLab Omnibus package from the apt repo at https://packages.gitlab.com/gitlab/nightly-builds/ubuntu/. In this case, the package update itself went fine, but the release introduced an issue that prevented puma from starting. This took dev.gitlab.org entirely offline.
- Service(s) affected: dev.gitlab.org
- Team attribution: sre-coreinfra
- Minutes downtime or degradation: 90
Metrics
The only dashboard in dashboards.gitlab.net that covers dev.gitlab.org
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
Internal customers only
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
They were unable to log into dev.gitlab.org to do work, or to log in to Sentry (which uses dev.gitlab.org for authentication)
- How many customers were affected?
Unsure of precise number
- If a precise customer impact number is unknown, what is the estimated potential impact?
All internal customers (all GitLab staff)
Incident Response Analysis
- How was the event detected?
Pagerduty alerts saying dev.gitlab.org was experiencing errors
- How could detection time be improved?
We could have more comprehensive monitoring around dev.gitlab.org in order to provide better insight into what problems it might be experiencing. Also, the only alert for dev.gitlab.org fires after 10 minutes of issues. If we want detection time to be improved, lowering this would be a good start.
- How did we reach the point where we knew how to mitigate the impact?
After manually searching through log files and performing basic troubleshooting tasks (restarting, downgrading releases), it was finally identified that the puma component of the installation was failing. This was after tracing through log files of other components (workhorse, rails) first. After this, investigation needed to be done to determine what had changed and why this error was recently introduced. This eventually led to finding the cron job that updates the gitlab-ce package, leading through to disabling the auto update process, and rolling back to a release that didn't contain the issue.
- How could time to mitigation be improved?
  - More comprehensive Prometheus alerts and monitoring for dev.gitlab.org, in particular alerts for the GitLab components themselves
  - Updated runbook documentation about dev.gitlab.org: how it works, its update mechanism, and how to troubleshoot it
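To make the effect of the 10-minute alert delay mentioned above concrete, here is a simplified model of an alert that fires only after failures have been sustained for a hold period. This is a sketch under stated assumptions, not how the actual dev.gitlab.org alert is evaluated: the function name seconds_until_alert, the 60-second probe interval, and the 2-minute alternative are all hypothetical.

```ruby
# Hypothetical model of a "fire after hold_seconds of sustained failure" alert.
# samples: array of booleans (true = probe succeeded), one per interval_seconds.
# Returns the time in seconds (from the first sample) at which the alert
# fires, or nil if failures are never sustained for long enough.
def seconds_until_alert(samples, interval_seconds:, hold_seconds:)
  failing_since = nil
  samples.each_with_index do |ok, i|
    now = i * interval_seconds
    if ok
      failing_since = nil            # any success resets the hold timer
    else
      failing_since ||= now          # remember when the failure streak began
      return now if now - failing_since >= hold_seconds
    end
  end
  nil
end

outage = [true] + [false] * 20       # one good probe, then a sustained outage
puts seconds_until_alert(outage, interval_seconds: 60, hold_seconds: 600)
puts seconds_until_alert(outage, interval_seconds: 60, hold_seconds: 120)
```

In this model the outage begins at the 60-second mark, so a 10-minute hold pages at 660 seconds while a 2-minute hold pages at 180 seconds. The trade-off of a shorter hold is more pages for brief blips that would otherwise reset the timer.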
Post Incident Analysis
- How was the root cause diagnosed?
In addition to the steps noted above for determining the mitigation path, once mitigation was in place, work was done to confirm that the scheduled upgrade had introduced the issue. After the puma stack trace was identified, the error was traced through the GitLab git repo to determine which commit and which MR introduced it.
- How could time to diagnosis be improved?
  - More comprehensive Prometheus alerts and monitoring for dev.gitlab.org, in particular alerts for the GitLab components themselves
  - Updated runbook documentation about dev.gitlab.org: how it works, its update mechanism, and how to troubleshoot it
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
Not that I am aware of
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
This change caused the issue: gitlab-org/gitlab!32991 (merged)
5 Whys
- dev.gitlab.org was down for approximately 90 minutes, during which GitLab staff were unable to log into dev.gitlab.org or sentry.gitlab.net. Why?
- Puma was crashing on startup. Why?
- The change gitlab-org/gitlab!32991 (merged) introduced a regression and made its way to dev.gitlab.org. Why did it make it there?
- dev.gitlab.org automatically updates to the latest "nightly" version of GitLab once every day
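One way to picture how an unattended daily upgrade could avoid leaving the service down is to follow it with a health check and an automatic rollback. The sketch below is illustrative only and is not something this review proposes or that exists on dev.gitlab.org: the function guarded_upgrade and its three injected steps are hypothetical, and the lambdas stand in for whatever real upgrade, health probe, and downgrade commands would be used.

```ruby
# Hypothetical upgrade guard: all three steps are injected callables so the
# sketch stays independent of any real package manager or health endpoint.
def guarded_upgrade(upgrade:, healthy:, rollback:, attempts: 3, wait_seconds: 0)
  upgrade.call
  attempts.times do
    return :upgraded if healthy.call   # service came back: keep the new version
    sleep wait_seconds                 # give the service time before re-checking
  end
  rollback.call                        # never became healthy: restore old version
  :rolled_back
end

log = []
result = guarded_upgrade(
  upgrade:  -> { log << 'upgrade' },
  healthy:  -> { false },              # simulate puma failing to start
  rollback: -> { log << 'rollback' }
)
puts result
puts log.inspect
```

With the health check always failing, the guard performs the upgrade, exhausts its attempts, and then rolls back. Injecting the steps keeps the control flow testable without touching a real host.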
Lessons Learned
- dev.gitlab.org is considered production level, but we have very little of the supporting documentation, alerting, or HA infrastructure that a production-level service needs.
- There appears to be a disconnect between how important we consider dev.gitlab.org (production) and the release cadence and release path it goes through. A production-level service would be running releases that have gone through some level of QA, and it is unclear whether this is the case for GitLab nightlies.
- dev.gitlab.org is still running in Azure
- dev.gitlab.org outages also seem to cause a number of other alerts and pieces in our infrastructure to fail, including chef runs