Commit eba3a534 authored by Alex M's avatar Alex M

Adds multi-page more complex runbook for application 2

parent 607dfbaf
---
title: "01. Runbook Master Template"
menuTitle: "Template: Runbook"
---
{{% notice note %}}
A default template to use with runbooks, to help with consistency and as inspiration for what to include
{{% /notice %}}
---
## Confirm the Problem
### Dashboards
#### Dashboard X
Dashboard X shows blah blah and can be found [here](www.some-dashboard1-link.com).
#### Dashboard Y
Dashboard Y shows blah blah and can be found [here](www.some-dashboard2-link.com).
### Diagnostic Commands
#### Command X
```sh
some command to find X
```
#### Command Y
```sh
some command to validate Y is running
```
---
## Resolution Options
### Option X
Try updating X with the following commands:
```sh
some command to update X
```
### Option Y
e.g. If all other options have failed, put up the holding page and prepare for [DR](www.some-dr-plan-link.com).
---
## Notes
While diagnosing and fixing the incident/issue, you may want to make notes. We recommend doing this in your service or team Slack channel so the notes can easily be reviewed the next day for follow-up actions.
---
title: "Application #2"
chapter: true
---
{{% notice tip %}}
This example shows a more complex service with nested groups and so on - reflecting a service with many components that necessitates a bit more structure.
<br /><br />
This app is also business critical, so includes a bit more detail on the processes involved there.
{{% /notice %}}
---
# Service: Browse
## Team Details
- Team Name: *your-team-name*
- Team Slack Channel: [your-team-channel](#the-slack-channel)
- Service Slack Channel: [your-alerts-channel](#the-slack-channel)
- Service Catalogue: [service-catalogue-link](#the-service-catalogue)
- Team Location: UK _or maybe the building/floor_
- Product Owner: PO's name
- Delivery Lead: DL's name
- Group Email: if applicable
## What is Application-2?
It is a UI service that mainly serves Category pages and Product pages. It also hosts some of the components that we use across the website via ESI. See the [General Information](general-information/) section for more information on what the app does.
## Live Issue Decision Process
Follow this decision flow in order to manage the triggered incident. Acknowledge the alert and then first check the following key support channels to see if an incident is already in progress:
1. [#topic-major-incident-comms](#the-slack-channel)
2. [#topic-ops-bridge](#the-slack-channel)
3. [#team-digital-platform](#the-slack-channel)
**If a major incident hasn't already been triggered, and you think one is required, trigger it before doing anything else.**
**Trigger it immediately using the 'run a play' button on the PagerDuty Incident.**
![decision process](./images/incident-decision-process.png)
## Useful Support Links
See [investigating issues](investigating-issues/) for details on how to debug live issues with this service.
| System | URL | Additional URL |
| --- | --- | --- |
| Grafana Dashboards | [Link to 4GS dashboard](#link-goes-here) | [Link to custom dashboard](#link-goes-here) |
| Kibana Logging | [Link to saved queries](#link-goes-here) | [Link to visualisations](#link-goes-here) |
| New Relic Dashboards | [Link to Client Dashboard](#link-goes-here) | [Link to Server Dashboard](#link-goes-here) |
| PagerDuty | [Link to Service's open alerts](#link-goes-here) | [Link to Rota](#link-goes-here) |
| Monetate | [Link to active Experiences](#link-goes-here) | |
| Code | [Link to Gitlab Group](#link-goes-here) | |
| ServiceNow | [Link to Group's open INCs](#link-goes-here) | [Link to Group's CHGs](#link-goes-here) |
| Service Catalogue | [Link to Service on the SC](#link-goes-here) | |
---
## Runbook Sections
{{% children %}}
---
title: "Browse - Alerts"
chapter: true
---
## Browse - Alerts
The alerts configured for Browse are the standard alerts provided by default by the platform.
More information about monitoring and alerts can be found [here](https://link/to/more/docs).
## Standard Alerts
These alerts are the predefined ones for any service on PLATFORM_NAME. If they are not required, or if you wish to vary the alerting criteria necessary to trigger them, you can edit the [service definition base yaml](https://link/to/config/in/git) and create some **microservice_alert_overrides** for the relevant microservice(s).
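As a sketch, an override in the base yaml might look like this. The exact schema is platform-specific; every field name below is an assumption, not the real PLATFORM_NAME schema - check the service definition docs for the true shape:

```yaml
# Hypothetical shape only - verify against the real service definition schema.
microservice_alert_overrides:
  prod-browse:
    5xx-errors:
      enabled: true
      threshold: 0.05      # e.g. alert when more than 5% of responses are 5xx
      window: 5m
    zero-traffic:
      enabled: false       # e.g. disabled while traffic is paused in Monetate
```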
### slo-browse.prod-browse.5xx-errors
See the dedicated [5xx Alert page](/application-2/alerts/slo-browse.prod-browse.5xx-errors/)
### slo-browse.prod-browse.service-availability
**Summary:** Service pods are not available
**Customer Impact:** Errors
**Actions:**
- Use Grafana to confirm when traffic stopped and consider any linked events such as releases that may have caused this.
- Check for other alerts in PagerDuty that may be linked - the issue may be out of our control. In particular, look at Platform and AN_OTHER_DEPENDENCY, as well as #topic-major-incidents in Slack.
- A recent release may have caused an issue - consider a rollback to the previous version.
### slo-browse.prod-browse.pod-oom-kills
**Summary:** The service's pods have restarted as they have run out of memory - the number of pods and resources should be re-evaluated.
**Customer Impact:** Errors, slow responses
**Actions:**
- Analyse the Kubernetes resources to understand the reason for the restarts.
- Look in Grafana dashboard for memory profile changes to understand how rapidly the memory is spiking.
- Look in Kibana for further information, errors in particular.
- Consider either scaling up number of pods or increasing memory for pods.
- A release may have significantly degraded performance and spinning up more pods may fix it temporarily. Consider a rollback to the previous release.
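The triage steps above can be sketched with kubectl. The namespace, pod, and deployment names are assumptions, and the commands are echoed rather than executed so the sketch is safe to copy; drop the `RUN=echo` line to run them for real:

```sh
# Dry-run guard: every command below is printed, not executed.
RUN=echo
NS=prod-browse                                      # assumed namespace
$RUN kubectl -n "$NS" get pods                      # spot pods with a high RESTARTS count
$RUN kubectl -n "$NS" describe pod browse-pod-xyz   # look for 'Last State: OOMKilled'
$RUN kubectl -n "$NS" top pods                      # current memory usage per pod
# Temporary mitigation - scale out (or raise the memory limit in the deployment):
$RUN kubectl -n "$NS" scale deployment browse --replicas=6
```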
### slo-browse.prod-browse.zero-traffic
**Summary:** Service pods are not handling traffic.
**Customer Impact:** Our service will not be returning product pages.
**Actions:**
- Use Grafana to confirm when traffic stopped and consider any linked events such as releases that may have caused this. This may be a release by DEPENDENCY_X, or another of our dependencies.
- Look in Kibana for further correlating information, errors in particular.
- Check also for other alerts in PagerDuty that may be linked - the issue may be out of our control. In particular, look at Platform, as well as #topic-major-incidents in Slack for non PagerDuty services - this would include _LIST_OF_OTHER_SERVICE_DEPENDENCIES_ and Router services that sit above the product service.
- A recent release may have caused an issue - consider a rollback to the previous version.
**Note:** If traffic is expected to be zero, e.g. because the % split to the NEW_SERVICE has been paused in Monetate, silence the alert for a day or two and arrange a team meeting in the team calendar before the silence is due to expire, to discuss whether the silence should be extended.
---
title: "slo-browse-newrelic-alerts"
chapter: true
---
# PagerDuty Alert: Browse xxx
## Summary: The rules configured in NewRelic detected an anomaly
## Customer Impact: Customers have problems with their journey to buy something
## Actions:
### Is there a bigger issue?
Join and check the following channels:
- [#topic-major-incident-comms](https://link/to/channel)
- [#topic-ops-bridge](https://link/to/channel)
- [#team-digital-platform](https://link/to/channel)
If there is a major incident (MI) in progress, read the channels to gain context on what is happening. Participate and add relevant notes, e.g. `Browse received a NewRelic alert at xx:xx`
If there is no MI, keep investigating the Browse issue.
You will need to determine if Browse is the cause of the problem or suffering symptoms because of a dependency issue.
### NewRelic False Alert?
NewRelic Synthetics can occasionally suffer transient connection problems, so check the Script Log tab to see the exception. If you conclude it is a false alert, resolve the PagerDuty alert with a note explaining why.
False alerts can be triggered by NewRelic for the following reasons:
- Synthetics issues
- The Synthetic script is failing because it gets no response from the Insights API
- NewRelic systems may be faulty. Check the latest [NewRelic status](https://status.newrelic.com/)
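The status check can be scripted: status.newrelic.com is hosted on Statuspage, which exposes a standard JSON status endpoint. The command is echoed as a dry run so the sketch has no network side effects; drop `RUN=echo` to run it:

```sh
# Dry-run guard: the command is printed, not executed.
RUN=echo
# Statuspage sites expose /api/v2/status.json; its 'status.description'
# field reads e.g. "All Systems Operational" when NewRelic is healthy.
$RUN curl -s https://status.newrelic.com/api/v2/status.json
```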
### Checking Browse
#### Rule in/out our application
Check the [4 golden signals dashboard](https://link/to/platform/dashboard)
Check for the following:
- Has there been a sustained drop in traffic and availability?
- Has latency risen?
- Is CPU usage growing?
- Is memory saturation growing?
- Has the number of replicas spiked up or down? Has it remained there?
Check the [NewRelic Synthetic Monitors page](https://link/to/newrelic) to find more details about the alert you just saw.
---
#### Alert | Browse | Basket | GetBasket Error Limit Breached | P3
- Go through the basket API [logs in Kibana](https://link/to/logs)
- Search for the errors
- Alternatively, check the error stacktrace in NewRelic using [this link](https://link/to/newrelic)
- Report the error in #team-x slack channel if you see timeout errors
#### Alert | Browse | Product | GetProducts API Error Limit breached | P3
- Go through the Product API [logs in Kibana](https://link/to/logs)
- Search for the errors
- Report the errors in the #team-x Slack channel if you see timeout errors for Product API calls
- Report the errors in #team-y if the errors originate from rvi calls - see [logs here](https://link/to/specific/logs)
- Alternatively, check the error stacktrace in NewRelic using [this link](https://link/to/newrelic)
#### Alert | Browse | BFF | Timeout Errors Limit Breached | P3
- Search for "Socket closed" errors in Kibana using [this link](https://link/to/log/query)
- Report the issue in the #team-x channel during office hours
- Raise an incident in ITIL_TOOL to the assignment group ANOTHER_TEAM. That will trigger a PagerDuty call to the Website Support team
#### Alert | Browse | BISN | Stock Notifications Error threshold Breached | P3
- Go through the [API logs](https://link/to/logs/query) in Kibana
- Search for the errors
- Alternatively, check the error stacktrace in NewRelic using [this link](https://link/to/newrelic)
- Report the error in the #team-z Slack channel if you see any timeout errors
#### Alert | Browse | Drop in Transaction Limit Breached | P3
**In PagerDuty** this will be reported as:
`[browse] Error [Runbook] (Alert |Browse|Drop in Transaction Limit Breached | P3 violated Browse | Drop in Transaction Limit Breached | P3)`
Orders drop can be impacted by the UI if there is a problem displaying the Add to Basket button or by poor response times in DEPENDENCY_A.
- Go to the [Ops Monitoring Dashboard](https://link/to/newrelic) in NewRelic and see how severe the drop is.
- In Golden Signals, select a bigger range (e.g. 2 days) to get a sense of whether something out of the ordinary happened, for example a Live Load Test that created a big number of orders.
- Search for the errors in Kibana dashboards. Check [our runbook](/application-2/) for the right URLs.
- Report the error in the #topic-x Slack channel if the alert is not clear within 10 minutes or so.
---
title: "slo-browse.prod-browse.5xx-errors"
chapter: true
---
# PagerDuty Error: slo-browse.prod-browse.5xx-errors
## Summary: The Service is generating 5xx errors
## Customer Impact: Errors
## Actions:
### Is there a bigger issue?
Join and check the following channels:
- [#topic-major-incident-comms](https://link/to/slack)
- [#topic-ops-bridge](https://link/to/slack)
- [#team-digital-platform](https://link/to/slack)
If there is a major incident (MI) in progress, read the channels to gain context on what is happening. Participate and add relevant notes, e.g.
`Browse received a 5xx alert at xx:xx`
If there is no MI, keep investigating the Browse issue.
You will need to determine if Browse is the cause of the problem or suffering symptoms because of a dependency issue.
### Checking Browse
#### Rule in/out our application
Check the [4 golden signals dashboard](https://link/to/dashboard)
Check for the following:
- Has there been a sustained drop in traffic and availability?
- Has latency risen?
- Is CPU usage growing?
- Is memory saturation growing?
- Has the number of replicas spiked up or down? Has it remained there?
---
#### Is this a bot?
**Diagnosis:**
Check the past 1 hour activity through [DOWNSTREAM_DEPENDENCY_B](https://link/to/logs)
Look for high frequency patterns of the same client IP address
_Normal activity_
![Normal activity](./alert-5xx-bot-normal-activity.png)
_Unusual activity_
![Unusual activity](./alert-5xx-bot-unusual-activity.png)
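The pattern check above can be sketched with standard shell tools. The example is self-contained (the sample log lines are made up); in practice, run the pipeline over lines exported from DOWNSTREAM_DEPENDENCY_B, assuming the client IP is the first field:

```sh
# Inline sample so the sketch runs anywhere - replace with your exported log.
cat > /tmp/access-sample.log <<'EOF'
203.0.113.9 GET /product/1
203.0.113.9 GET /product/2
203.0.113.9 GET /product/3
198.51.100.4 GET /category/7
EOF
# Requests per client IP, busiest first - a single IP dominating suggests a bot.
awk '{print $1}' /tmp/access-sample.log | sort | uniq -c | sort -rn
```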
**Remedial Action:**
- Inform [#team-x](https://link/to/slack/channel)
- If there is an incident in progress, inform in the incident channel
---
#### Is UPSTREAM_DEPENDENCY_A functional?
**Diagnosis:**
Check the past 1 hour activity through [API Cache](https://link/to/specific/log/query)
_Normal activity_
Low number of hits
![Normal activity](./alert-5xx-cache-normal-activity.png)
_Unusual activity_
High number of hits clustered together
![Unusual activity](./alert-5xx-cache-unusual-activity.png)
**Remedial Action:**
- Alert Ops Bridge in [#team-a](https://link/to/slack)
---
#### Have we performed a duff release?
**Diagnosis:** When was our last release?
**Remedial Action:**
If our neighbours aren’t breaking, consider rollback.
You will need to find the previous pipeline run to production and know how to execute it.
---
title: "Dependencies"
chapter: true
---
## Dependencies
### Browse is Dependent on
| Application | Detail | Runbook (will inc. how to contact) | Golden Signals Dashboard |
| --------------------- | ------------------------------ | ------------------------------------------------ | ------------------------------ |
| upstream dependency 1 | something about what this does | [link to their runbook](https://link/to/runbook) | [link to their 4 Golden Signal Dashboard](https://link/to/grafana/) |
| upstream dependency 2 | something about what this does | [link to their runbook](https://link/to/runbook) | [link to their 4 Golden Signal Dashboard](https://link/to/grafana/) |
| upstream dependency 3 | something about what this does | [link to their runbook](https://link/to/runbook) | [link to their 4 Golden Signal Dashboard](https://link/to/grafana/) |
| upstream dependency 4 | something about what this does | [link to their runbook](https://link/to/runbook) | [link to their 4 Golden Signal Dashboard](https://link/to/grafana/) |
### Services Dependent on Browse
| Service | Detail | Runbook |
| ----------------------- | ---------------------------- | ------------------------------------------------ |
| downstream dependency A | something about what it does | [link to their runbook](https://link/to/runbook) |
| downstream dependency B | something about what it does | [link to their runbook](https://link/to/runbook) |
| downstream dependency C | something about what it does | [link to their runbook](https://link/to/runbook) |
---
title: "General Information"
chapter: true
---
## What is Application-2?
Browse is a front-end application that provides most of the critical browse journeys on WEBSITE_X.
The application is now end of life and pages are moving out of it into a micro frontend architecture. See the following [google-sheet](https://link/to/gsheet) for more details on the current state of affairs.
## Application Architecture
> Include a link to an image showing the architecture here
---
title: "Investigating Issues"
chapter: true
---
## How to Investigate Live Issues
The majority of Browse issues have one of the following root causes:
1. An upstream dependency is generating errors to us.
2. The platform or Google Cloud Services are having issues and causing a platform level outage.
3. Client-side issues, e.g. errors generated by Monetate client-side tests.
To find the root cause, use the following debugging process:
* **Get visibility**
* **Narrow the focus**
* **Repeat until you spot the error!**
### First step - Read the Alert
Answer the following by reading the alert:
1. **What triggered the alert?** Is it a client-side error from NewRelic monitoring? Or a server-side error from the platform? Or are you seeing both?
2. **What does the error message say?** Read it and understand what it is trying to alert on - e.g. low traffic, 40x/50x errors?
### Detailed Steps - Client Side
1. If the error alert is triggered by NewRelic, it is a Real User Monitoring Alert for customers/bots in the browser. Drop into the [NewRelic alert](https://link/to/newrelic) to understand the errors being thrown client side in the browser.
2. Investigate the error being thrown and debug the issue. Look for browser type, headers and try and replicate the error.
3. Monetate issues can be spotted by problem tags in the source HTML being wrapped in Monetate tags.
4. You can disable experiences in live to see whether that removes the error.
### Detailed Steps - Server Side
1. Start with [Grafana Golden Signals Dashboard](https://link/to/grafana) to get a general view of server application status:
* Are we receiving traffic? Is traffic spiking up?
* What errors is the app encountering?
* How stressed are the pods - what is happening with CPU/Memory/Pod count?
2. Then drop into the Kibana [DOWNSTREAM_X](https://link/to/logs) and [our own](https://link/to/logs) logs for debugging the alert.
3. Identify the source of the errors by filtering the logs by status `log.wstatus`. Is the error message related to a specific dependency, e.g. the Product API or Basket?
4. If the error is from an upstream API, check the [dependencies](../dependencies/_index.md) page for details on that service's dashboard and Slack channel, and check whether that service is having issues.
5. If the error is only occurring in Browse, the next action is to look at infrastructure and code health to understand if code needs to be reverted. See [release and rollback](./release-and-rollback.md) for how to roll back code.
6. Report findings back into any incident channel, or to the team channel, as you learn more, so that everyone working on the incident can see new information.
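Step 3 can be sketched offline with standard tools, assuming you have exported matching log lines as JSON. The field name `log.wstatus` is from this runbook, but the surrounding log shape and the `dep` field are assumptions:

```sh
# Inline sample export - replace with lines pulled from Kibana.
cat > /tmp/browse-logs.jsonl <<'EOF'
{"log.wstatus":200,"dep":"product-api"}
{"log.wstatus":503,"dep":"basket-api"}
{"log.wstatus":504,"dep":"basket-api"}
EOF
# Keep only 5xx lines and count them per dependency.
awk -F'"dep":"' '/"log.wstatus":5[0-9][0-9]/ {split($2, a, "\""); print a[1]}' \
  /tmp/browse-logs.jsonl | sort | uniq -c
```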
---
title: "Release and Rollback"
chapter: true
---
## How to rollback
- Navigate to browse pipelines page using [this link](https://link/to/ci/pipeline)
- Trigger the prod stage retry button on the previously deployed prod pipeline
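If the UI is unavailable, the same rollback can be driven through the GitLab API (`/pipelines` and `/jobs/:id/retry` are real GitLab v4 endpoints, but the host, project ID, branch name, and job ID below are assumptions). The commands are echoed as a dry run; drop the `RUN=echo` line to execute them:

```sh
RUN=echo                                  # dry-run guard: prints, doesn't call
API="https://gitlab.example.com/api/v4"   # assumed GitLab host
PROJECT=1234                              # assumed Browse project ID
# 1. Find the previous successful pipeline on the production branch.
$RUN curl -H "PRIVATE-TOKEN: \$GITLAB_TOKEN" \
  "$API/projects/$PROJECT/pipelines?ref=main&status=success&per_page=2"
# 2. Retry the prod deploy job of that pipeline (take JOB_ID from its jobs list).
$RUN curl -X POST -H "PRIVATE-TOKEN: \$GITLAB_TOKEN" \
  "$API/projects/$PROJECT/jobs/JOB_ID/retry"
```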