runbooks

Run books for the stressed on call

GitLab On Call Run Books

The aim of this project is to provide a quick guide on what to do when an emergency arrives.

CRITICAL

  • Spend one minute creating an issue for the outage. Don't forget the outage label, as specified in the handbook.

What to do when

Replication fails

Chef/Knife

CI

CephFS

Alerting and monitoring

Outdated

How do I

Deploy

Work with the fleet and the rails app

Work with storage

Mangle front end load balancers

Work with Chef

Work with CI Infrastructure

Work with Infrastructure Providers (VMs)

Manually ban an IP or netblock

Debug and monitor

General guidelines in an emergency

  • Confirm that it is actually an emergency. Challenge this: are we losing data? Is GitLab.com not working?
  • Tweet in a reassuring but informative way to let people know what's going on
  • Join the #alerts channel
  • Organize
    • Establish who is taking point on the emergency issue in the #alerts channel: "I'm taking point" and pin the message for the duration of the emergency.
    • Open a hangout if it will save time: https://plus.google.com/hangouts/_/gitlab.com?authuser=1
    • Share the link in the #alerts channel
  • If the point person needs someone to do something, give a direct command: @someone: please run this command
  • Be sure to be in sync - if you are going to reboot a service, say so: I'm bouncing server X
  • If you have conflicting information, stop and think, bounce ideas, escalate
  • Fix first, ask questions later.
  • Gather information when the outage is done - logs, samples of graphs, anything that could help figure out what happened
  • Open an issue and put the monitoring label on it, even if you close the issue immediately. See the handbook

Guidelines

Other Servers and Services

Rules for adding runbooks

  • Make it quick - add links for checks
  • Don't make me think - write clear guidelines, write expectations
  • Recommended structure
    • Symptoms - how can I quickly tell that this is what is going on
    • Pre-checks - how can I be 100% sure
    • Resolution - what do I have to do to fix it
    • Post-checks - how can I be 100% sure that it is solved
    • Rollback - optional, how can I undo my fix
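The recommended structure above can be checked mechanically before a new runbook is merged. What follows is only a minimal sketch, not part of this project's tooling: the function name `missing_sections` is hypothetical, and it assumes each section starts with a heading line that holds only the section name (optionally with markdown `#` markers or a trailing colon).

```python
import re

# Section names taken from the recommended structure above.
# Rollback is optional, so it is not checked.
REQUIRED = ["Symptoms", "Pre-checks", "Resolution", "Post-checks"]


def missing_sections(text):
    """Return the required section names that have no heading line in text."""
    found = set()
    for line in text.splitlines():
        # Assumption: a heading is a line holding only the section name,
        # optionally preceded by '#' markers and followed by ':'.
        name = re.sub(r"^#+\s*", "", line.strip()).rstrip(":").strip().lower()
        found.add(name)
    return [s for s in REQUIRED if s.lower() not in found]


if __name__ == "__main__":
    draft = "# Symptoms\n...\n# Pre-checks\n...\n# Resolution\n...\n"
    print(missing_sections(draft))
```

An empty result means every required section is present. Anything listed is a section the stressed on-call person would have to go without.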

But always remember!

Don't Panic