Commit 2c0562c3 authored by Alex Ives's avatar Alex Ives Committed by Irina Bronipolsky

Add help decision tree and incident escalation process

parent b38284a6
@@ -73,9 +73,16 @@ While each team has a distinct focus area, several responsibilities are shared a

## Requesting Help

### Support Escalations
For a complete guide to getting help with database issues — including emergencies, support escalations, and identifying the responsible team — see [Getting Help with Database Issues](/handbook/engineering/data-engineering/database-excellence/help/).

### Incident Escalation

Database incident escalations use [incident.io](https://app.incident.io/gitlab/on-call/schedules/01JXJ7MN4T14008GQKWYYNT6E8) for on-call routing.

* **Scope**: GitLab.com S1 and S2 production incidents raised by the Incident Manager On Call, Engineer On Call, and Security teams. GitLab Dedicated support is consultative. Self-managed support is discretionary and evaluated case-by-case.
* **Escalation**: Use `/inc escalate` in the incident Slack channel. For non-urgent issues, use the [triage rotation](#triage-rotations) or post in `#s_database_excellence`.
* **Response**: Best effort, local timezone, weekday coverage only (24/5). The on-call engineer joins as a subject matter expert in a consultative capacity.
* **Process details**: See the [full escalation process](/handbook/engineering/data-engineering/database-excellence/help/#step-4-escalate-to-database-excellence) for responding procedures and shadowing instructions.

### Reliability Requests

---
title: Identifying Database Issues
description: "A basic guide to identifying the DRI for a database issue"
---

This guide is intended for folks who need to determine the DRI team when examining a database issue.

## Migrations

The easiest way is to use `git`. From the [GitLab repository](https://gitlab.com/gitlab-org/gitlab), run:

```sh
git log --first-parent {path/to/migration.rb}
```

The path `path/to/migration.rb` can be found in the backtrace when the migration fails. Migration code files
start with a date-time stamp and live in [db/migrate/](https://gitlab.com/gitlab-org/gitlab/-/tree/master/db/migrate) or
[db/post_migrate/](https://gitlab.com/gitlab-org/gitlab/-/tree/master/db/post_migrate). Alternatively, if you can find the date-time stamp,
for example `20240113071052`, anywhere in the log output from the customer, it will uniquely match a migration
filename in one of these locations.
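For example, you can pull the timestamp out of the log output and use it to locate the file (a sketch; the log line shown here is hypothetical):

```sh
# Extract the 14-digit migration timestamp from a (hypothetical) customer log line
ts=$(echo "Migration failed: 20240113071052_add_widgets_table.rb" | grep -oE '[0-9]{14}')

# Then find the matching migration file and the MR that introduced it:
#   git log --first-parent -- db/migrate/${ts}_*.rb db/post_migrate/${ts}_*.rb
echo "$ts"
```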

That should give you an output that includes a link to the merge request where the migration was added.

If that doesn't give a clear answer, you can look at the tables involved in the migration and take a guess at the team. See [Finding a team by table name](#finding-a-team-by-table-name).

## Queries

It's a bit more complicated to determine query sources, because they come from a lot of places.

If the issue is related to a Rails controller, Sidekiq worker, API endpoint, or background migration, determine the feature category using our [feature categorization guide](https://docs.gitlab.com/ee/development/feature_categorization/), then [reach out to the team listed in the lookup](#getting-a-team-from-a-feature-category).

If you don't have a source that includes a feature category, you'll need to make a guess based on the table name in the query. You can follow [Finding a team by table name](#finding-a-team-by-table-name).
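As a starting point for that guess, a rough heuristic (illustrative only, not a SQL parser; the query shown is hypothetical) is to pull out the names that follow `FROM` or `JOIN` in the query:

```sh
# Rough heuristic: list the table names that follow FROM or JOIN
# in a query captured from the logs (hypothetical query below).
query='SELECT * FROM merge_requests JOIN users ON users.id = merge_requests.author_id'
echo "$query" | grep -oiE '(FROM|JOIN) +[a-z_]+' | awk '{print $2}'
```

Each table name printed can then be looked up in the database dictionary as described below.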

## Finding a team by table name

Each database table has a documentation file that can be used to determine a corresponding group.

1. Look for the corresponding file named `{table_name}.yml` in [the database dictionary](https://gitlab.com/gitlab-org/gitlab/-/tree/master/db/docs)
1. In the file, find the list of related `feature_categories`
1. Using the feature category, [reach out to the team listed in the lookup](#getting-a-team-from-a-feature-category)
1. If there is more than one category, pick one from the list and start with that team
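For illustration, a dictionary entry looks roughly like this (abridged; the table and values here are hypothetical, and the real files carry additional fields):

```yaml
# db/docs/widgets.yml (abridged, illustrative)
table_name: widgets
classes:
- Widget
feature_categories:
- team_planning
description: Stores widgets shown on planning boards
```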

## Getting a team from a feature category

If you have a feature category, the best way to determine the team to contact is by using the [feature category lookup](/handbook/product/categories/lookup/).
@@ -48,7 +48,7 @@ Systems or services explicitly not owned by us:

## How to contact the team

1. Outages and customer emergencies: please use the [escalation process](/handbook/engineering/data-engineering/database-excellence/help/#step-4-escalate-to-database-excellence)
1. Customer issues: please open a RFH issue [here](https://gitlab.com/gitlab-com/request-for-help/-/issues/new?related_item_id=undefined&type=ISSUE&description_template=SupportRequestTemplate-DatabaseOperations)
1. Project work and other requests: please open an issue [in the team issue tracker](https://gitlab.com/gitlab-com/gl-infra/data-access/dbo/dbo-issue-tracker/-/issues/new?related_item_id=undefined&type=ISSUE&description_template=New_DBO_Project)

---
title: DBO Escalation Process
summary: This page outlines the DBO team escalation process and guidelines for developing the rotation schedule for handling infrastructure incident escalations.
---

{{% alert title="Note" color="danger" %}}
We are using [incident.io](https://app.incident.io/gitlab/on-call/schedules/01JXJ7MN4T14008GQKWYYNT6E8) for escalations.
{{% /alert %}}

## About This Page

This page outlines the DBO team's incident escalation policy.

## Shortcuts

* [DBO incident.io schedule](https://app.incident.io/gitlab/on-call/schedules/01JXJ7MN4T14008GQKWYYNT6E8)
* Slack integration: **/inc escalate** in the incident channel (for urgent reach-outs, ALWAYS use incident.io escalations)
* Slack handles: `@dbre` or `@dbo-oncall` (Non-urgent)
* Slack channels: #g_database_operations
* `group::database operations`
* [Production Incidents](https://app.incident.io/gitlab/incidents)

## SLO and Expectations

* **_DBO RESPONSE IS ON A BEST-EFFORT BASIS_**

* **_LOCAL TIMEZONE, WEEKDAY COVERAGE ONLY_**

* **_S1 / S2 INCIDENTS ONLY_**

  * NB1: Due to limited staffing (for example, only one person in the AMER timezone), there will be times during the business day, across multiple timezones, when no one is able to respond. We understand the criticality of responding to S1/S2 incidents and will make every effort to ensure our responses are adequate and timely, but given current staffing levels we are not adhering to a hard SLO at this point. Given this situation, it is also expected that schedules change on an ad-hoc basis.

  * NB2: DBRE will join incidents as a subject matter expert in a consultative capacity and there should be no expectation that the DBRE is solely responsible for a resolution of the escalation. There may be times where DBRE needs to escalate to other subject matter experts, such as the [Database Framework (DBF) team](../../../data-engineering/database-excellence/database-frameworks/), in order to make headway on the incident at hand.

## Escalation Process

### Scope and Qualifiers

1. **GitLab.com** S1 and S2 production incidents raised by the **Incident Manager On Call**, **Engineer On Call** and **Security** teams.

   * NB1: **GitLab Dedicated** support is consultative at this point. The DBO team is not currently equipped to support Dedicated databases (it lacks the necessary access and training). This may change in the future; check back here for updates on this topic.

   * NB2: **Self-Managed** support is discretionary and will be evaluated on a case-by-case basis.

   * NB3: This process is **NOT** a path to reach the DBO team for non-urgent issues. For non-urgent issues, please create a [Request for Help](https://gitlab.com/gitlab-com/request-for-help#ops-section) (RFH) issue using this [issue template](https://gitlab.com/gitlab-com/request-for-help/-/issues/new?issuable_template=SupportRequestTemplate-DatabaseOperations).

   * NB4: The DBO engineer on shift is responsible for coordinating warm handoffs during shift changes, especially when there is an ongoing, active incident.

### Escalation

1. The EOC/IM, Development, or Security pages the DBO on-call via `/inc escalate`.
1. DBO responds by acknowledging the page and joining the incident channel and Zoom.
1. DBO triages the issue and works towards a solution.
1. If necessary, DBO reaches out to additional domain experts.

   * NB1: If DBO support does not respond, the escalation path defined within incident.io ensues.

## Resources

### Responding Guidelines

When responding to an incident, use the procedure below as a guideline to assist both yourself and those requesting your assistance:

1. Join the incident Zoom - this can be found bookmarked in the relevant incident Slack channel
1. Join the appropriate incident Slack channel for all text-based communications - normally this is `#inc-<INCIDENT NUMBER>`
1. Work with the EOC to determine if a known code path is problematic
   * Should this be in your domain, continue working with the EOC to troubleshoot the problem
   * Should this be something you are unfamiliar with, attempt to determine code ownership by team - knowing this enables us to bring an engineer from that team into the incident
1. Work with the Incident Manager to ensure that the incident issue is assigned to the appropriate Engineering Manager - if applicable

### Shadowing An Incident Triage Session

Feel free to participate in any incident triaging call if you would like to have a few rehearsals of how it usually works. Simply watch out for active incidents in [#incidents-dotcom](https://gitlab.slack.com/archives/C08FMPK1DDF) and join the Situation Room Zoom call (link can be found in the channel) for synchronous troubleshooting. There is a [nice blog post](https://about.gitlab.com/blog/2020/04/13/lm-sre-shadow/) about the shadowing experience.

### Replaying Previous Incidents

Situation Room recordings from previous incidents are available in this [Google Drive folder](https://drive.google.com/drive/u/1/folders/1wtGTU10-sybbCv1LiHIj2AFEbxizlcks) (internal).

### Shadowing A Whole Shift

To get an idea of what's expected of an on-call DBO and how often incidents occur, it can be helpful to shadow another shift. To do this, identify and contact the DBO on-call to let them know you'll be shadowing. During the shift, keep an eye on [#incidents-dotcom](https://gitlab.slack.com/archives/C08FMPK1DDF) for incidents.

### Tips & Tricks of Troubleshooting

1. [How to Investigate a 500 error using Sentry and Kibana](https://www.youtube.com/watch?v=o02t3V3vHMs&feature=youtu.be).
1. [Walkthrough of GitLab.com's SLO Framework](https://www.youtube.com/watch?v=QULzN7QrAjY).
1. [Scalability documentation](https://gitlab.com/gitlab-org/gitlab/merge_requests/18976).
1. [Use Grafana and Kibana to look at PostgreSQL data to find the root cause](https://youtu.be/XxXhCsuXWFQ).
   * Related incident: [Postgres transactions timing out; sidekiq queues below apdex score; and overdue pull mirror jobs](https://gitlab.com/gitlab-com/gl-infra/production/issues/1433).
1. [Use Grafana and Prometheus to troubleshoot API slowdown](https://www.youtube.com/watch?v=DtP4ZcuXT_8).
   * Related incident: [2019-11-27 Increased latency on API fleet](https://gitlab.com/gitlab-com/gl-infra/production/issues/1419).
1. [Let's make 500s more fun](https://youtu.be/6ERO4XsYDn0?list=PL05JrBw4t0KodGBz0XUYdYaAYyYs-6ZK7).

### Tools for Engineers

1. Training videos of available tools
   1. [Visualization Tools Playlist](https://www.youtube.com/playlist?list=PL05JrBw4t0KrDIsPQ68htUUbvCgt9JeQj).
   1. [Monitoring Tools Playlist](https://www.youtube.com/playlist?list=PL05JrBw4t0KpQMEbnXjeQUA22SZtz7J0e).
   1. [How to create Kibana visualizations for checking performance](https://www.youtube.com/watch?v=5oF2rJPAZ-M&feature=youtu.be).
1. Dashboards examples, more are available with the dropdown list at upper-left corner of any dashboard below
   1. [Saturation Component Alert](https://dashboards.gitlab.net/d/alerts-saturation_component/alerts-saturation-component-alert?orgId=1).
   1. [Service Platform Metrics](https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1&var-type=ci-runners&from=now-6h&to=now).
   1. [SLAs](https://dashboards.gitlab.net/d/general-slas/general-slas?orgId=1).
   1. [Web Overview](https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1).