runbooks

Run books for the stressed on call

GitLab On Call Run Books

The aim of this project is to provide a quick guide to what to do when an emergency arises.

CRITICAL

  • Spend one minute to create an issue for the outage. Don't forget the outage label, as specified in the handbook.
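Creating the outage issue can be scripted so that it takes seconds rather than a minute. The following is a minimal sketch, assuming the issue lives on GitLab.com and is opened through the REST API v4 issues endpoint; the function name, the project path, and the token are hypothetical placeholders, not values from this repository. It only builds the request, so nothing is sent until you pass it to `urllib.request.urlopen`.

```python
import json
import urllib.parse
import urllib.request

GITLAB_API = "https://gitlab.com/api/v4"  # assumption: GitLab.com REST API v4


def build_outage_issue_request(project_path, title, token, description=""):
    """Build (but do not send) a POST request that opens an outage issue.

    project_path and token are placeholders supplied by the caller.
    """
    # The endpoint accepts a URL-encoded project path in place of a numeric ID.
    project = urllib.parse.quote(project_path, safe="")
    payload = {
        "title": title,
        "description": description,
        "labels": "outage",  # label name taken from the checklist above
    }
    return urllib.request.Request(
        url=f"{GITLAB_API}/projects/{project}/issues",
        data=json.dumps(payload).encode("utf-8"),
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )
```

To actually file the issue, call `urllib.request.urlopen(build_outage_issue_request(...))` with a personal access token that is allowed to create issues in the target project.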

What to do when

Replication fails

Chef/Knife

CI

CephFS

PostgreSQL

Alerting and monitoring

Outdated

How do I

On Call

Deploy

Work with the fleet and the rails app

Restore Backups

Work with storage

Mangle front end load balancers

Work with Chef

Work with CI Infrastructure

Work with Infrastructure Providers (VMs)

Manually ban an IP or netblock

Dealing with Spam

Manage Marvin, our infra bot

Elasticsearch

Debug and monitor

General guidelines in an emergency

  • Confirm that it is actually an emergency. Challenge this: are we losing data? Is GitLab.com not working?
  • Tweet in a reassuring but informative way to let people know what's going on.
  • Join the #infrastructure channel
  • Define a point person or incident owner. This is the person who will gather all the data and coordinate the efforts.
  • Organize:
    • Establish who is the point person on the incident in the #infrastructure channel: "@here I'm taking point" and pin the message for the duration of the emergency.
    • Start a war room using Zoom if it will save time
    • Share the link in the #infrastructure channel
    • If the point person needs someone to do something, give a direct command: @SOMEONE: please run this command
  • Stay in sync: if you are going to reboot a service, say so: "I'm bouncing server X"
  • If you have conflicting information, stop and think, bounce ideas, escalate
  • Gather information when the incident is done: logs, samples of graphs, anything that could help figure out what happened
  • If we lack monitoring or alerting, open an issue and label it as monitoring, even if you close the issue immediately. See the handbook.

Guidelines

Other Servers and Services

Adding runbooks rules

  • Make it quick - add links for checks
  • Don't make me think - write clear guidelines, write expectations
  • Recommended structure
    • Symptoms - how can I quickly tell that this is what is going on
    • Pre-checks - how can I be 100% sure
    • Resolution - what do I have to do to fix it
    • Post-checks - how can I be 100% sure that it is solved
    • Rollback - optional, how can I undo my fix
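The recommended structure above can be checked mechanically before a new runbook is merged. This is a rough sketch, not part of this repository's tooling: it assumes each section appears as a heading line containing only the section name (optionally prefixed with Markdown `#` marks), and the function name `missing_sections` is made up for illustration.

```python
REQUIRED_SECTIONS = ["Symptoms", "Pre-checks", "Resolution", "Post-checks"]
OPTIONAL_SECTIONS = ["Rollback"]  # listed for reference; never reported as missing


def missing_sections(runbook_text):
    """Return the required section names that have no heading in the text."""
    headings = set()
    for line in runbook_text.splitlines():
        # Drop surrounding whitespace and leading '#' marks, then compare
        # case-insensitively against the section names.
        headings.add(line.strip().lstrip("#").strip().lower())
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]


draft = """# Replication fails

## Symptoms
Lag alert fires.

## Resolution
Restart the replica.
"""

print(missing_sections(draft))  # ['Pre-checks', 'Post-checks']
```

A check like this could run in CI against every file under the troubleshooting directory, failing the pipeline when a runbook omits a required section.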

But always remember!

Don't Panic