Commit 0d964d88 authored by Alex M

Tidying up some links

parent eba3a534
@@ -28,7 +28,7 @@ I've deliberately left much of the alert detail in place to showcase some of the
## Service Information
- This stateless microservice returns html by transforming raw json data for the requested content page. The service sits behind Service X and has two downstream dependencies:
- This stateless microservice returns HTML by transforming raw JSON data for the requested page. The service sits behind Service X and has two downstream dependencies:
1. Content SaaS - A Content Management System (CMS) containing editorial data, e.g. pages, templates, components, etc.
2. A.N.Other Service Z - containing product data
@@ -68,7 +68,7 @@ I've deliberately left much of the alert detail in place to showcase some of the
## Active Alerts
[Link to key dashboard showing currently active alerts in Prometheus AlertManager for this service](https://link/goes/here)
[Dashboard of Active Alerts](https://link/goes/here) - currently active alerts in Prometheus AlertManager for this service.
## Service Monitoring
@@ -102,7 +102,7 @@ These sub-headings are linked to directly from the Alerts that go into the team'
### content-renderer microservice has all pods down
- **Name:** `services.content.ENV_NAME-content-renderer.all-pods-unavailable`
- **Name:** `services.application-1.ENV_NAME-content-renderer.all-pods-unavailable`
- **Description:** Triggers when all Kubernetes pods are unavailable for more than 5 minutes. Even with all pods down, if a previous pod had served the resource (e.g. a page) successfully, the response will continue to be served from the downstream cache for the next 24 hours
- **Action:** The Kubernetes cluster should automatically replace the unavailable pods. If symptoms persist, check deployment events with `kubectl describe deployment ENV_NAME-content-renderer` to see why Kubernetes has not been able to bring the pods up (see the sketch after this alert). If symptoms still persist, contact #team-x to investigate further
- **Impact:** Medium
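A minimal sketch of the checks described in the action above, assuming `kubectl` access to the cluster; the label selector and `<namespace>` are placeholders to replace with your own values:

```sh
# List the pods backing the deployment and their current state
kubectl get pods -n <namespace> -l app=ENV_NAME-content-renderer

# Inspect the deployment's conditions and recent events to see why pods are not coming up
kubectl describe deployment ENV_NAME-content-renderer -n <namespace>

# Recent namespace events (newest last) often show image pull or scheduling failures
kubectl get events -n <namespace> --sort-by=.metadata.creationTimestamp | tail -n 20
```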
@@ -116,23 +116,23 @@ These sub-headings are linked to directly from the Alerts that go into the team'
**Note:** The downstream caching service has been tested running with only one or two pods without significantly impacting customer experience. Pages returned by the service are cached with a 10 minute TTL. Even after the cache TTL has expired, it will continue to serve the cached pages for the next 24 hours if it cannot successfully fetch a fresh copy from the content renderer.
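If you want to sanity-check the caching behaviour described in the note above from outside the cluster, a header check is usually enough. The URL and the exact cache headers exposed are assumptions here and depend on the downstream cache:

```sh
# Fetch only the response headers for a content page (URL is a placeholder)
# and look for cache-related headers such as Cache-Control, Age or X-Cache.
curl -sI https://your-site.example/some/content/page | grep -iE 'cache-control|age|x-cache'
```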
### 5xx errors for ENV_NAME-content-renderer in content
### 5xx errors for ENV_NAME-content-renderer in application-1
- **Name:** `slo-content.ENV_NAME-content-renderer.5xx-errors`
- **Name:** `slo-application-1.ENV_NAME-content-renderer.5xx-errors`
- **Description:** This is an SLO alert; it triggers when the percentage of 5xx errors has been over 0.05% of total requests for more than 2 minutes.
- **Action:** If symptoms persist, contact #team-x to investigate further
- **Impact:** High
### Prod zero traffic for content-renderer in content
### Prod zero traffic for content-renderer in application-1
- **Name:** `slo-content.ENV_NAME-content-renderer.zero-traffic`
- **Name:** `slo-application-1.ENV_NAME-content-renderer.zero-traffic`
- **Description:** This is an SLO alert; it triggers when the service has not received any traffic for over 60 minutes. No action is needed if this is due to low-traffic hours, e.g. 00:00 to 06:00.
- **Action:** If symptoms persist, contact #team-x to investigate further
- **Impact:** Low
### Content renderer service: Latency increase in response times
- **Name:** `services.content.ENV_NAME-content-renderer.high-latency`
- **Name:** `services.application-1.ENV_NAME-content-renderer.high-latency`
- **Description:** Triggers when response time latency increases.
- **Action:** Keep an eye on usage, since the increase could be due to a heavy traffic spike or system warm-up. If symptoms persist, contact #team-x to investigate further
- **Impact:** Low
@@ -147,14 +147,14 @@ These alerts represent custom ones that this team have created to warn on issues
### Content renderer service receiving 5xx errors from _third-party-thing_
- **Name:** `services.content.ENV_NAME-content-renderer.caas-5xx-errors`
- **Name:** `services.application-1.ENV_NAME-content-renderer.caas-5xx-errors`
- **Alert:** Triggers when the service receives 5xx errors from our SaaS provider. This error indicates that something is wrong with XXXX. Contact Team Y on #their-slack-channel to investigate further. Until it recovers, the service will keep serving content from its cache for the next 24 hours
- **Action:** If symptoms persist, contact #their-slack-channel to investigate further
- **Impact:** Medium
### Content renderer service receiving 5xx errors from Product API
- **Name:** `services.content.ENV_NAME-content-renderer.product-variants-5xx-errors`
- **Name:** `services.application-1.ENV_NAME-content-renderer.product-variants-5xx-errors`
- **Alert:** Triggers when the service receives 5xx errors from the Product API. This error indicates that something is wrong with Service Z. Until the API recovers, Service A will keep serving content from its cache for the next 24 hours
- **Action:** If symptoms persist, contact #team-z to investigate further (see the sketch below)
- **Impact:** Medium
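For either of the upstream 5xx alerts above, a quick way to confirm which dependency is failing is to look at the renderer's own logs. This is only a sketch: the label selector, namespace and status pattern are assumptions, so adjust them to match the real log format:

```sh
# Tail recent logs from the content renderer pods and pull out upstream 5xx responses
kubectl logs -n <namespace> -l app=ENV_NAME-content-renderer --since=15m --tail=500 \
  | grep -iE 'status[^0-9]*5[0-9]{2}' \
  | tail -n 20
```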
@@ -11,7 +11,7 @@ This app is also business critical, so includes a bit more detail on the process
---
# Service: Browse
# Service: Application-2
## Team Details
@@ -34,13 +34,13 @@ Follow this decision flow in order to manage the triggered incident. Acknowledge
1. [#topic-major-incident-comms](#the-slack-channel)
2. [#topic-ops-bridge](#the-slack-channel)
3. [#team-digital-platform](#the-slack-channel)
3. [#team-platform](#the-slack-channel)
**If a major incident hasn't already been triggered, and you think it is required, trigger it before doing anything else.**
**Trigger it immediately using the 'run a play' button on the PagerDuty Incident.**
![decision process](./images/incident-decision-process.png)
> Diagram of incident decision process was here - removed
## Useful Support Links
---
title: "Browse - Alerts"
title: "Alerts"
chapter: true
---
## Browse - Alerts
## Alerts
The alerts configured to Browse are the standard alerts provided by default by the platform.
The alerts configured for Application-2 are the standard alerts provided by default by the platform.
More information about monitoring and alerts can be found [here](https://link/to/more/docs)
@@ -13,11 +13,15 @@ More information about monitoring and alerts can be found [here](https://link/to
These alerts are the predefined ones for any service on PLATFORM_NAME. If they are not required, or if you wish to vary the alerting criteria necessary to trigger them, you can edit the [service definition base yaml](https://link/to/config/in/git) and create some **microservice_alert_overrides** for the relevant microservice(s).
### slo-browse.prod-browse.5xx-errors
---
### slo-application-2.prod-browse.5xx-errors
Check new [5xx Alert page](/application-2/alerts/slo-browse.prod-browse.5xx-errors/)
Check new [5xx Alert page](/application-2/alerts/slo-application-2.prod-browse.5xx-errors/)
---
### slo-browse.prod-browse.service-availability
### slo-application-2.prod-browse.service-availability
**Summary:** Service pods are not available
@@ -29,7 +33,9 @@ Check new [5xx Alert page](/application-2/alerts/slo-browse.prod-browse.5xx-erro
- Check for other alerts in PagerDuty that may be linked - the issue may be out of our control. In particular, look at Platform and AN_OTHER_DEPENDENCY, as well as #topic-major-incidents in Slack.
- A recent release may have caused an issue - consider a rollback to the previous version (see the sketch below).
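A minimal sketch of how you might check pod availability and, if needed, roll back; the deployment name `prod-browse` is taken from the alert names above and the namespace is a placeholder:

```sh
# Desired vs. ready replicas for the deployment
kubectl get deployment prod-browse -n <namespace>

# Why are pods not ready? Check their status, restarts and recent events
kubectl get pods -n <namespace> -l app=prod-browse
kubectl describe deployment prod-browse -n <namespace> | tail -n 30

# If a recent release looks like the cause, rolling back is one option
kubectl rollout history deployment/prod-browse -n <namespace>
kubectl rollout undo deployment/prod-browse -n <namespace>   # reverts to the previous revision
```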
### slo-browse.prod-browse.pod-oom-kills
---
### slo-application-2.prod-browse.pod-oom-kills
**Summary:** The service's pods have restarted as they have run out of memory - the number of pods and resources should be re-evaluated.
@@ -43,7 +49,9 @@ Check new [5xx Alert page](/application-2/alerts/slo-browse.prod-browse.5xx-erro
- Consider either scaling up the number of pods or increasing the memory allocated to each pod.
- A release may have significantly degraded performance, and spinning up more pods may fix it temporarily. Consider a rollback to the previous release (see the sketch below).
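A rough sketch of how to confirm the OOM kills and compare usage against limits before scaling, with placeholder names throughout:

```sh
# Which pods were OOM-killed? The last termination reason shows OOMKilled when memory ran out
kubectl get pods -n <namespace> -l app=prod-browse \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'

# Compare current memory usage against the configured limits (requires metrics-server)
kubectl top pods -n <namespace> -l app=prod-browse

# Temporary mitigation while memory settings are re-evaluated (replica count is just an example)
kubectl scale deployment/prod-browse -n <namespace> --replicas=6
```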
### slo-browse.prod-browse.zero-traffic
---
### slo-application-2.prod-browse.zero-traffic
**Summary:** Service pods are not handling traffic.
---
title: "slo-browse-newrelic-alerts"
title: "slo-application-2-newrelic-alerts"
chapter: true
---
# PagerDuty Alert: Browse xxx
# PagerDuty Alert: Application-2 xxxx
## Summary: The rules configured in NewRelic detected an anomaly
**Summary:** The rules configured in NewRelic detected an anomaly
## Customer Impact: customer has problems with their journey to buy something
**Customer Impact:** customer has problems with their journey to buy something
## Actions:
**Actions:**
### Is there a bigger issue?
## Is there a bigger issue?
Join and check the following channels:
- [#topic-major-incident-comms](https://link/to/channel)
- [#topic-ops-bridge](https://link/to/channel)
- [#team-digital-platform](https://link/to/channel)
- [#team-platform](https://link/to/channel)
Read and gain and understand what is happening, so you have context, if there is a major incident(MI) in progress. Participate and add relevant notes. e.g. `Browse received a NewRelic alert at xx:xx`
Read through and understand what is happening so that you have context if there is a major incident (MI) in progress. Participate and add relevant notes, e.g. `Application-2 received a NewRelic alert at xx:xx`
If there is no MI, keep investigating the Browse issue.
If there is no MI, keep investigating the Application-2 issue.
You will need to determine if Browse is the cause of the problem or suffering symptoms because of a dependency issue.
You will need to determine if Application-2 is the cause of the problem or suffering symptoms because of a dependency issue.
### NewRelic False Alert?
## NewRelic False Alert?
Sometimes NewRelic Synthetics can have transient connection problems, so check the Script Log tab to see the exception. If you conclude it is a false alert, resolve the PagerDuty alert with a note on it.
@@ -35,9 +35,9 @@ False alerts can be triggered from new relic due to below reasons
- Synthetic script is failing due to not getting a response from the Insights API
- NewRelic systems may be faulty. Check the latest [NewRelic status](https://status.newrelic.com/)
### Checking Browse
## Checking Application-2
#### Rule in/out our application
### Rule in/out our application
Check the [4 golden signals dashboard](https://link/to/platform/dashboard)
@@ -53,14 +53,14 @@ Check the [NewRelic Synthetic Monitors page](https://link/to/newrelic) to find m
---
#### Alert | Browse | Basket | GetBasket Error Limit Breached | P3
**Alert** | Application-2 | Basket | GetBasket Error Limit Breached | P3
- Go through the basket API [logs in Kibana](https://link/to/logs)
- Search for the errors
- Alternatively, we can check the error stacktrace in NewRelic using [this link](https://link/to/newrelic)
- Report the error in the #team-x Slack channel if you see timeout errors
#### Alert | Browse | Product | GetProducts API Error Limit breached | P3
**Alert** | Application-2 | Product | GetProducts API Error Limit breached | P3
- Go through the Product API [logs in Kibana](https://link/to/logs)
- Search for the errors
@@ -68,24 +68,24 @@ Check the [NewRelic Synthetic Monitors page](https://link/to/newrelic) to find m
- Report the errors in #team-y in case the errors originate from rvi calls - see [logs here](https://link/to/specific/logs)
- Alternatively, we can check the error stacktrace in NewRelic using [this link](https://link/to/newrelic)
#### Alert | Browse | BFF | Timeout Errors Limit Breached | P3
**Alert** | Application-2 | BFF | Timeout Errors Limit Breached | P3
- Search for "Socket closed" errors in Kibana using [this link](https://link/to/log/query)
- Report the issue in the #team-x channel during office hours
- Raise an incident in ITIL_TOOL to the assignment group ANOTHER_TEAM. That will trigger a PagerDuty call to the Website support team
#### Alert | Browse | BISN | Stock Notifications Error threshold Breached | P3
**Alert** | Application-2 | BISN | Stock Notifications Error threshold Breached | P3
- Go through the [API logs](https://link/to/logs/query) in Kibana
- Search for the errors
- Alternatively, we can check the error stacktrace in NewRelic using [this link](https://link/to/newrelic)
- Report the error in the #team-z Slack channel if you see any timeout errors
#### Alert | Browse | Drop in Transaction Limit Breached | P3
**Alert** | Application-2 | Drop in Transaction Limit Breached | P3
**In PagerDuty** this will be reported as:
`[browse] Error [Runbook] (Alert |Browse|Drop in Transaction Limit Breached | P3 violated Browse | Drop in Transaction Limit Breached | P3)`
`[application-2] Error [Runbook] (Alert | Application-2 | Drop in Transaction Limit Breached | P3 violated Application-2 | Drop in Transaction Limit Breached | P3)`
A drop in orders can be caused by the UI if there is a problem displaying the Add to Basket button, or by poor response times in DEPENDENCY_A.
---
title: "slo-browse.prod-browse.5xx-errors"
title: "slo-application-2.prod-browse.5xx-errors"
chapter: true
---
# PagerDuty Error: slo-browse.prod-browse.5xx-errors
# PagerDuty: slo-application-2.prod-browse.5xx-errors
## Summary: The Service is generating 5xx errors
**Summary:** The Service is generating 5xx errors
## Customer Impact: Errors
**Customer Impact:** Errors
## Actions:
**Actions:**
### Is there a bigger issue?
## Is there a bigger issue?
Join and check the following channels:
- [#topic-major-incident-comms](https://link/to/slack)
- [#topic-ops-bridge](https://link/to/slack)
- [#team-digital-platform](https://link/to/slack)
- [#team-platform](https://link/to/slack)
Read through and understand what is happening so that you have context if there is a major incident (MI) in progress. Participate and add relevant notes, e.g.
`Browse received a 5xx alert at xx:xx`
`Application-2 received a 5xx alert at xx:xx`
If there is no MI, keep investigating the Browse issue.
If there is no MI, keep investigating the Application-2 issue.
You will need to determine if Browse is the cause of the problem or suffering symptoms because of a dependency issue.
You will need to determine if Application-2 is the cause of the problem or suffering symptoms because of a dependency issue.
### Checking Browse
## Checking Application-2
#### Rule in/out our application
### Rule in/out our application
Check the [4 golden signals dashboard](https://link/to/dashboard)
@@ -42,7 +42,7 @@ Is there:
---
#### Is this a bot?
### Is this a bot?
**Diagnosis:**
@@ -65,7 +65,7 @@ _Unusual activity_
---
#### Is UPSTREAM_DEPEDENCY_A functional?
### Is UPSTREAM_DEPENDENCY_A functional?
**Diagnosis:**
@@ -89,7 +89,7 @@ High number of hits clustered together
---
#### Have we performed a duff release?
### Have we performed a duff release?
**Diagnosis:** When was our last release?
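One way to answer this from the cluster itself, assuming the app runs as a standard Kubernetes Deployment (names below are placeholders):

```sh
# Revision history of the Deployment; the change-cause column is only populated if releases set it
kubectl rollout history deployment/prod-browse -n <namespace>

# When did the current pods start, and which image are they running?
kubectl get pods -n <namespace> -l app=prod-browse \
  -o custom-columns=NAME:.metadata.name,STARTED:.status.startTime,IMAGE:.spec.containers[0].image
```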
@@ -5,7 +5,7 @@ chapter: true
## Dependencies
### Browse is Dependent on
### Application-2 is Dependent on
| Application | Detail | Runbook (incl. how to contact them) | Golden Signals Dashboard |
| --------------------- | ------------------------------ | ------------------------------------------------ | ------------------------------ |
@@ -14,7 +14,7 @@ chapter: true
| upstream dependency 3 | something about what this does | [link to their runbook](https://link/to/runbook) | [link to their 4 Golden Signal Dashboard](https://link/to/grafana/) |
| upstream dependency 4 | something about what this does | [link to their runbook](https://link/to/runbook) | [link to their 4 Golden Signal Dashboard](https://link/to/grafana/) |
## Services Dependent on Browse
## Services Dependent on Application-2
| Service | Detail | Runbook |
| ----------------------- | ---------------------------- | ------------------------------------------------ |
@@ -5,7 +5,7 @@ chapter: true
## What is Application-2?
Browse is a front end application that has provides most of the critical browse journeys to WEBSITE_X
Application-2 is a front-end application that provides most of the critical browse journeys on WEBSITE_X.
The application is now end-of-life and pages are moving out of it into a micro frontend architecture. See the following [google-sheet](https://link/to/gsheet) for more details on the current state of affairs.
---
title: "Ops Meeting"
---
Previous on-call to present & next on-call to scribe
1. [Scribe: Create new meeting notes](https://link/to/folder/of/notes)
2. [Update rotations](https://link/to/sheet)
3. [Update pagerduty schedule](https://link/to/pagerduty/schedule)
4. Update [#team-x](https://link/to/channel) channel message with who is on call
5. Review Service Dashboard
- [7 day](https://link/to/dashboard)
- [30 day](https://link/to/dashboard)
6. Review UI Dashboard
- [7 day](https://link/to/dashboard)
- [30 day](https://link/to/dashboard)
7. Review Detailed Dashboard
- [7 day](https://link/to/dashboard)
- [30 day](https://link/to/dashboard)
8. [PagerDuty Incident Review](https://link/to/pagerduty/incidents)
9. [ITIL Ticket Review](https://link/to/delightful/itil/incident/list)
10. [Deployment Review](https://link/to/delightful/itil/chg/list)
11. [Review previous meeting notes](https://link/to/previous/notes)
@@ -5,7 +5,7 @@ chapter: true
## How to Investigate Live Issues
The majority of browse issues are one of the following root causes:
The majority of Application-2 issues are one of the following root causes:
1. An upstream dependency is generating errors to us.
2. The platform or Google Cloud Services are having issues and causing a platform level outage.
@@ -41,5 +41,5 @@ Answer the following by reading the alert:
2. Then drop into Kibana for the [DOWNSTREAM_X](https://link/to/logs) and [our own](https://link/to/logs) logs to debug the alert.
3. Identify the source of the errors by filtering the logs by status `log.wstatus`. Is the error message related to a specific dependency, e.g. the product API or basket?
4. If the error is for an upstream API, check the [dependencies](../dependencies/_index.md) page for details of its dashboard and service channel, and check whether that service is having issues.
5. If the error is only occurring in browse, the next action is to look a infrastructure and code health to understand if code needs to be reverted. See [release and rollback](./release-and-rollback.md) on how to roll back code.
5. If the error is only occurring in Application-2, the next action is to look at infrastructure and code health to understand whether code needs to be reverted. See [release and rollback](./release-and-rollback.md) for how to roll back code; there is also a quick sketch after this list.
6. Report findings back into any incident channel, or to the team channel, as you find out more details so that everyone working on the incident can see any new information.
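If step 5 points at our own code, it helps to confirm exactly what is deployed before reverting. A minimal sketch, assuming a standard Kubernetes Deployment and placeholder names:

```sh
# Which image/tag is currently live for the service?
kubectl get deployment prod-browse -n <namespace> \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

# Compare against earlier revisions before deciding to roll back
kubectl rollout history deployment/prod-browse -n <namespace>
```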
---
title: "Anticipated failure scenarios"
---
Last Updated: DD MMM 2021
This document gives information on things that could go wrong with the Application-2 offering.
## General
This section covers scenarios which can occur to many of the aspects in the system. Please see [Upstream & Downstream Dependencies](/dependencies/) for details on the structure of the system. We define upstream/downstream this way: https://reflectoring.io/upstream-downstream/
### k8s/pods
Both the API and UI server run in pods within k8s. We could see issues related to a pod failing, or more general k8s-level problems. In that case the problem is unlikely to be limited to us, as we make use of the microservice template, and we should reach out to the platform team for support.
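A quick triage sketch for pod-level trouble before escalating to the platform team; the label selector and namespace are placeholders:

```sh
# Overall pod health for the service: restarts, pending pods, crash loops
kubectl get pods -n <namespace> -l app=<deployment-name> -o wide

# Recent warning events in the namespace often show whether the issue is ours or platform-wide
kubectl get events -n <namespace> --field-selector type=Warning \
  --sort-by=.metadata.creationTimestamp | tail -n 20
```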
### Networking related
We do not connect directly to our consumers; rather, they connect to us through a variety of layers.
```txt
+--------------------+
| |
| Consumer |
| |
+----------+---------+
|
|
|
v
+----------+---------+
| |
| Edge Thing |
| |
+----------+---------+
|
|
|
v
+----------+---------+
| |
| GCP LB |
| |
+----------+---------+
|
|
|
v
+----------+---------+
| |
| NGINX |
| |
+----------+---------+
|
|
v
+----------+---------------+
| |
| Pods |
| |
| +--------------------+ |
| | | |
| | Traefik | |
| | | |
| +----------+---------+ |
| | |
| | |
| v |
| +----------+---------+ |
| | | |
| | API / UI Server | |
| | | |
| +--------------------+ |
| |
+--------------------------+
```
When a request comes in, it hits Edge Thing, which provides routing and DDoS protection. If we suspect an issue there, we can reach out to them in #team-x.
It then goes to the Platform-provided load balancers, which use the [Google Load Balancer](https://cloud.google.com/load-balancing) + NGINX Ingress Controller to direct traffic to our pods. Each pod has a Traefik sidecar which filters requests to the API or UI Server. Both load balancer information and Traefik can be checked; this is documented in [Diagnosing Network Issues](/networkdiags/).
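To quickly look at the last two hops of the chain above (NGINX ingress and the Traefik sidecar), something like the following can help. Resource, container and path names here are assumptions; the authoritative steps are in [Diagnosing Network Issues](/networkdiags/):

```sh
# Is the ingress resource for the service present and pointing at the expected backend?
kubectl get ingress -n <namespace>

# Logs from the Traefik sidecar in one of the service pods (container name is a placeholder)
kubectl logs -n <namespace> <pod-name> -c traefik --since=15m | tail -n 50

# Hit the service from inside the cluster to bypass the edge and load balancer
# (the /health path is an assumption - use whatever endpoint the service actually exposes)
kubectl run curl-test -n <namespace> --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s -o /dev/null -w '%{http_code}\n' http://<service-name>/health
```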
### Resource starvation
Should a pod run out of CPU or memory, it can stop responding to requests. The best option is usually to bounce the boxes, which you can easily do with the `k` script in our [Ops Repo](https://link/to/repo).
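The runbook doesn't spell out what the `k` script does, so treat the following as a hedged plain-kubectl equivalent of "bouncing the boxes", not a description of the script itself:

```sh
# Check whether CPU or memory is actually exhausted (requires metrics-server)
kubectl top pods -n <namespace> -l app=<deployment-name>

# Restart the pods by rolling the Deployment - Kubernetes replaces them one by one
kubectl rollout restart deployment/<deployment-name> -n <namespace>
kubectl rollout status deployment/<deployment-name> -n <namespace>
```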
## API
> It then goes on to describe various scenarios relating to dependencies, data issues, auth-n/auth-z failures, things specific to Mobile Apps or Email, and so on