2020-07-03: Triggered #22158: Firing 1 - customers.gitlab.com is not responding correctly for 2

Summary

2020-07-03 - Triggered #22158: Firing 1 - customers.gitlab.com is not responding correctly for 2

On 2020-07-03 at 02:11 UTC we received an alert that customers.gitlab.com was down. This was indeed the case.

Timeline

All times UTC.

2020-07-03

  • 02:11 - https://gitlab.pagerduty.com/incidents/P4K0D7T and https://gitlab.pagerduty.com/incidents/P0Z7WT1 alerts fired in PagerDuty, acknowledged by @ggillies
  • 02:15 - @ggillies logs in to the VM running customers and diagnoses from the nginx logs that the Ruby service behind it is experiencing issues
  • 02:17 - @ggillies finds that the Ruby service is crash looping with the following error:
Jul 03 02:25:56 customers.gitlab.com ruby[70744]: /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/active_support.rb:60:in `block in load_missing_constant': uninitialized constant Salesforce::RecordNotFound (NameError)
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/active_support.rb:16:in `allow_bootsnap_retry'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/active_support.rb:59:in `load_missing_constant'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/app/services/salesforce/create_quote_amendment_service.rb:9:in `<class:CreateQuoteAmendmentService>'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/app/services/salesforce/create_quote_amendment_service.rb:4:in `<module:Salesforce>'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/app/services/salesforce/create_quote_amendment_service.rb:3:in `<main>'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:22:in `require'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:22:in `block in require_with_bootsnap_lfi'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/loaded_features_index.rb:92:in `register'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:21:in `require_with_bootsnap_lfi'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:30:in `require'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/activesupport-5.2.4.3/lib/active_support/dependencies/interlock.rb:14:in `block in loading'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/activesupport-5.2.4.3/lib/active_support/concurrency/share_lock.rb:151:in `exclusive'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/activesupport-5.2.4.3/lib/active_support/dependencies/interlock.rb:13:in `loading'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/active_support.rb:48:in `block in require_or_load'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/active_support.rb:16:in `allow_bootsnap_retry'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/active_support.rb:47:in `require_or_load'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/bootsnap-1.4.5/lib/bootsnap/load_path_cache/core_ext/active_support.rb:85:in `depend_on'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/engine.rb:478:in `block (2 levels) in eager_load!'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/engine.rb:477:in `each'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/engine.rb:477:in `block in eager_load!'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/engine.rb:475:in `each'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/engine.rb:475:in `eager_load!'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/engine.rb:356:in `eager_load!'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/application/finisher.rb:69:in `each'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/application/finisher.rb:69:in `block in <module:Finisher>'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/initializable.rb:32:in `instance_exec'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/initializable.rb:32:in `run'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /home/gitlab-customers/customers-gitlab-com/vendor/bundle/ruby/2.6.0/gems/railties-5.2.4.3/lib/rails/initializable.rb:61:in `block in run_initializers'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /usr/local/lib/ruby/2.6.0/tsort.rb:228:in `block in tsort_each'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /usr/local/lib/ruby/2.6.0/tsort.rb:350:in `block (2 levels) in each_strongly_connected_component'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /usr/local/lib/ruby/2.6.0/tsort.rb:431:in `each_strongly_connected_component_from'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /usr/local/lib/ruby/2.6.0/tsort.rb:349:in `block in each_strongly_connected_component'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /usr/local/lib/ruby/2.6.0/tsort.rb:347:in `each'
Jul 03 02:25:56 customers.gitlab.com ruby[70744]:         from /usr/local/lib/ruby/2.6.0/tsort.rb:347:in `call'
  • 02:27 - @ggillies ran git log in the directory where the code resides and saw a very recent commit referencing Salesforce. This led him to the MR where the problem appears to have been introduced: https://gitlab.com/gitlab-org/customers-gitlab-com/-/merge_requests/1496
  • 02:29 - @ggillies manually ran https://gitlab.com/gitlab-org/customers-gitlab-com/-/merge_requests/1496 && systemctl restart customers-gitlab-com to roll back the problematic change and bring the service back up
  • 02:31 - Alerts cleared
  • 02:33 - @ggillies declares incident in Slack using /incident declare command.
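
The trace in the 02:17 entry shows the NameError being raised from the class body of create_quote_amendment_service.rb while Rails was eager loading the application, which would explain why the process died at boot rather than on a specific request. The following is a minimal sketch of that failure mode in plain Ruby, not the actual application code: only the names Salesforce, CreateQuoteAmendmentService and RecordNotFound come from the trace, while the HANDLED_ERRORS constant, the try_load helper and the use of eval to stand in for loading the file are assumptions made for illustration.

```ruby
# Stand-in for the contents of the service file. The real line 9 is not
# shown in the report; HANDLED_ERRORS is a hypothetical example of a
# class-body statement that references the missing constant.
SERVICE_SOURCE = <<~RUBY
  module Salesforce
    class CreateQuoteAmendmentService
      HANDLED_ERRORS = [Salesforce::RecordNotFound].freeze
    end
  end
RUBY

# The namespace exists, but RecordNotFound has not been defined yet.
module Salesforce
end

# Hypothetical helper: evaluates the source the way loading the file would,
# and reports either success or the NameError message.
def try_load(source)
  eval(source, TOPLEVEL_BINDING)
  :loaded
rescue NameError => e
  e.message
end

# Class bodies run at load time, so the missing constant fails immediately.
first_attempt = try_load(SERVICE_SOURCE)
puts "first attempt:  #{first_attempt}"

# Once the constant exists, the same source loads cleanly.
module Salesforce
  RecordNotFound = Class.new(StandardError)
end

second_attempt = try_load(SERVICE_SOURCE)
puts "second attempt: #{second_attempt}"
```

In an environment without eager loading the same file would typically only be evaluated when the class is first referenced, so an error like this could go unnoticed until the application boots with eager loading enabled.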

Incident Review

Summary

  1. Service(s) affected:
  2. Team attribution:
  3. Minutes downtime or degradation:

Metrics

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  3. How many customers were affected?
  4. If a precise customer impact number is unknown, what is the estimated potential impact?

Incident Response Analysis

  1. How was the event detected?
  2. How could detection time be improved?
  3. How did we reach the point where we knew how to mitigate the impact?
  4. How could time to mitigation be improved?

Post Incident Analysis

  1. How was the root cause diagnosed?
  2. How could time to diagnosis be improved?
  3. Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
  4. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?

5 Whys

Lessons Learned

Corrective Actions

Guidelines

  • Blameless RCA Guideline
Edited Jul 03, 2020 by Graeme Gillies