runbooks

Run books for the stressed on call

GitLab On Call Run Books

The aim of this project is to provide a quick guide on what to do when an emergency arises

CRITICAL

  • Spend one minute creating an issue for the outage, and don't forget the outage label, as specified in the handbook.

What to do when

Replication fails

Chef/Knife

CI

CephFS

Alerting and monitoring

Outdated

How do I

Deploy

Work with the fleet and the rails app

Work with the Database

Work with storage

Mangle front end load balancers

Work with Chef

Work with CI Infrastructure

Work with Infrastructure Providers (VMs)

Manually ban an IP or netblock

Manage Marvin, our infra bot

Debug and monitor

General guidelines in an emergency

  • Confirm that it is actually an emergency. Challenge this: are we losing data? Is GitLab.com not working?
  • Tweet in a reassuring but informative way to let people know what's going on
  • Join the #infrastructure channel
  • Define a point person or incident owner. This is the person who will gather all the data and coordinate the efforts.
  • Organize:
    • Establish who is the point person on the incident in the #infrastructure channel: "@here I'm taking point" and pin the message for the duration of the emergency.
    • Start a war room using Zoom if it will save time
    • Share the link in the #infrastructure channel
    • If the point person needs someone to do something, give a direct command: @someone: please run this command
  • Be sure to be in sync - if you are going to reboot a service, say so: I'm bouncing server X
  • If you have conflicting information, stop and think, bounce ideas, escalate
  • Gather information when the incident is done - logs, samples of graphs, whatever could help figure out what happened
  • If we lack monitoring or alerting, open an issue and label it as monitoring, even if you close the issue immediately. See the handbook.

Guidelines

Other Servers and Services

Rules for adding runbooks

  • Make it quick - add links for checks
  • Don't make me think - write clear guidelines, write expectations
  • Recommended structure
    • Symptoms - how can I quickly tell that this is what is going on
    • Pre-checks - how can I be 100% sure
    • Resolution - what do I have to do to fix it
    • Post-checks - how can I be 100% sure that it is solved
    • Rollback - optional, how can I undo my fix
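The recommended structure above can be checked mechanically before a new runbook is merged. The following is a minimal sketch, not part of this project: the function name `missing_sections` is hypothetical, and it assumes each section is written as its own heading line (optionally prefixed with markdown `#` marks) named exactly as in the list above. Rollback is treated as optional, as the list states.

```python
from typing import List

# Section names taken from the recommended structure; Rollback is optional
# and is therefore not checked.
REQUIRED_SECTIONS = ["Symptoms", "Pre-checks", "Resolution", "Post-checks"]


def missing_sections(runbook_text: str) -> List[str]:
    """Return the required section names that have no heading line.

    Assumption: a heading is a line that, once leading '#' marks and
    surrounding whitespace are stripped, equals the section name
    (compared case-insensitively).
    """
    found = set()
    for line in runbook_text.splitlines():
        title = line.strip().lstrip("#").strip().lower()
        found.add(title)
    return [name for name in REQUIRED_SECTIONS if name.lower() not in found]


if __name__ == "__main__":
    # Hypothetical draft that only covers two of the four required sections.
    draft = "# Replication fails\n\n## Symptoms\nlag alert\n\n## Resolution\nrestart\n"
    print(missing_sections(draft))  # ['Pre-checks', 'Post-checks']
```

A check like this only confirms that the headings exist; whether the guidelines under them are clear still needs a human review.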

But always remember!

Don't Panic