2020-07-03: Triggered #22158: Firing 1 - customers.gitlab.com is not responding correctly for 2
Summary
On 2020-07-03 at 02:11 UTC we received an alert that customers.gitlab.com was down. This was indeed the case.
Timeline
All times UTC.
2020-07-03
- 02:11 - https://gitlab.pagerduty.com/incidents/P4K0D7T and https://gitlab.pagerduty.com/incidents/P0Z7WT1 alerts fired in PagerDuty, acknowledged by @ggillies
- 02:15 - @ggillies logs in to the VM running customers.gitlab.com and diagnoses from the nginx logs that the Ruby service behind it is experiencing issues
- 02:17 - @ggillies finds that the Ruby service is crash looping with the following error:
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/active_support.rb:60:in `block in load_missing_constant': uninitialized constant Salesforce::RecordNotFound (NameError)
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/active_support.rb:16:in `allow_bootsnap_retry'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/active_support.rb:59:in `load_missing_constant'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/app/services/salesforce/create_quote_amendment_service.rb:9:in `<class:CreateQuoteAmendmentService>'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/app/services/salesforce/create_quote_amendment_service.rb:4:in `<module:Salesforce>'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/app/services/salesforce/create_quote_amendment_service.rb:3:in `<main>'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:22:in `require'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:22:in `block in require_with_bootsnap_lfi'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/loaded_features_index.rb:92:in `register'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:21:in `require_with_bootsnap_lfi'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:30:in `require'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/activesupport-5.2.4.3/lib/active_support/dependencies/interlock.rb:14:in `block in loading'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/activesupport-5.2.4.3/lib/active_support/concurrency/share_lock.rb:151:in `exclusive'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/activesupport-5.2.4.3/lib/active_support/dependencies/interlock.rb:13:in `loading'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/active_support.rb:48:in `block in require_or_load'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/active_support.rb:16:in `allow_bootsnap_retry'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/active_support.rb:47:in `require_or_load'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/active_support.rb:85:in `depend_on'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/engine.rb:478:in `block (2 levels) in eager_load!'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/engine.rb:477:in `each'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/engine.rb:477:in `block in eager_load!'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/engine.rb:475:in `each'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/engine.rb:475:in `eager_load!'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/engine.rb:356:in `eager_load!'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/application/finisher.rb:69:in `each'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/application/finisher.rb:69:in `block in <module:Finisher>'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/initializable.rb:32:in `instance_exec'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/initializable.rb:32:in `run'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/initializable.rb:61:in `block in run_initializers'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /usr/local/lib/ruby/2.6.0/tsort.rb:228:in `block in tsort_each'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /usr/local/lib/ruby/2.6.0/tsort.rb:350:in `block (2 levels) in each_strongly_connected_component'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /usr/local/lib/ruby/2.6.0/tsort.rb:431:in `each_strongly_connected_component_from'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /usr/local/lib/ruby/2.6.0/tsort.rb:349:in `block in each_strongly_connected_component'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /usr/local/lib/ruby/2.6.0/tsort.rb:347:in `each'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: from /usr/local/lib/ruby/2.6.0/tsort.rb:347:in `call'
- 02:27 - @ggillies ran git log in the directory where the code resides and saw a very recent commit referencing Salesforce. This led him to the MR where the problem appears to have been introduced: https://gitlab.com/gitlab-org/customers-gitlab-com/-/merge_requests/1496
- 02:29 - @ggillies manually ran https://gitlab.com/gitlab-org/customers-gitlab-com/-/merge_requests/1496 && systemctl restart customers-gitlab-com to roll back the problematic change and get the service live again
- 02:31 - Alerts cleared
- 02:33 - @ggillies declares incident in Slack using the /incident declare command.
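The backtrace in the 02:17 entry shows the NameError being raised from the class body of CreateQuoteAmendmentService (line 9 of create_quote_amendment_service.rb) while Rails was eager loading the application at boot. That would explain why the service crash looped instead of failing only on Salesforce requests. The following is a minimal plain-Ruby sketch of that failure mode, not the code from the MR: the HANDLED_ERRORS line is a hypothetical stand-in for whatever the real line 9 contains, and no Rails autoloading is involved.

```ruby
# Sketch only: the real service code is not shown in this report.
# Salesforce stands in for the application's namespace; RecordNotFound is
# deliberately left undefined to mirror the error in the backtrace.
module Salesforce
end

load_error = nil

begin
  # A class body is executed as soon as the file is loaded, so a reference
  # to a missing constant fails at load time, not when a method is called.
  module Salesforce
    class CreateQuoteAmendmentService
      # Hypothetical line: any class-level reference to the constant would do.
      HANDLED_ERRORS = [Salesforce::RecordNotFound].freeze
    end
  end
rescue NameError => e
  load_error = e
end

puts "#{load_error.class}: #{load_error.message}"
```

With eager loading enabled, as is typical for a production Rails environment, every file under app/ is loaded during boot, so an error of this kind stops the process before it can serve any request.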
Incident Review
Summary
- Service(s) affected:
- Team attribution:
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
5 Whys
Lessons Learned
Corrective Actions
Guidelines
Edited by Graeme Gillies