Commit 75382299 authored by Marcel Amirault

Update broken links and anchors in infra docs

Find and fix many broken links and anchors that
have built up over months/years.
parent e95fe333
Pipeline #173523825 skipped
@@ -176,7 +176,7 @@ In order to address this, over the past few months, we've formalized our change
If you're interested in finding out more about the approach we've taken to these two vital disciplines, they're published in our handbook:
- [GitLab.com's Change Management Process](/handbook/engineering/infrastructure/change-management/)
- [GitLab.com's Incident Management Process](/handbook/engineering/infrastructure/team/reliability/incident-management/)
- [GitLab.com's Incident Management Process](/handbook/engineering/infrastructure/incident-management/)
### Reason #6: Application improvement
@@ -20,7 +20,7 @@ One of the more basic functions of the Prometheus query language is real-time ag
There are four key reasons why anomaly detection is important to GitLab:
1. **Diagnosing incidents**: We can figure out which services are performing outside their normal bounds quickly and reduce the average time it takes to [detect an incident (MTTD)](/handbook/engineering/infrastructure/team/reliability/incident-management/), bringing about a faster resolution.
1. **Diagnosing incidents**: We can figure out which services are performing outside their normal bounds quickly and reduce the average time it takes to [detect an incident (MTTD)](/handbook/engineering/infrastructure/incident-management/), bringing about a faster resolution.
2. **Detecting application performance regressions**: For example, if an N+1 query regression is introduced and leads to one service calling another at a very high rate, we can quickly track the issue down and resolve it.
3. **Identifying and resolving abuse**: GitLab offers free computing (GitLab CI/CD) and hosting (GitLab Pages), and there is a small subset of users who might take advantage.
4. **Security**: Anomaly detection is essential to spotting unusual trends in GitLab time series data.
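The detection idea behind these four use cases can be sketched in a few lines: a sample is anomalous when it falls outside a band of a few standard deviations around its recent history. This is a minimal illustration of that sigma-band approach, not GitLab's actual Prometheus implementation; the `is_anomalous` helper and the sample values are hypothetical.

```python
# Minimal sigma-band anomaly detection sketch (hypothetical helper,
# not GitLab's production code).
from statistics import mean, stdev

def is_anomalous(history, value, sigmas=3.0):
    """Flag `value` if it lies outside mean +/- sigmas * stddev of `history`."""
    mu = mean(history)
    sigma = stdev(history)
    return abs(value - mu) > sigmas * sigma

# A service's recent request rate, then a sudden spike:
normal = [100, 102, 98, 101, 99, 103, 97, 100]
print(is_anomalous(normal, 101))  # within normal bounds -> False
print(is_anomalous(normal, 250))  # far outside bounds   -> True
```

In PromQL terms this corresponds to comparing a metric against its `avg_over_time` plus a multiple of `stddev_over_time`; the Python version just makes the arithmetic explicit.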
@@ -113,7 +113,7 @@ GitLab KPIs are the most important indicators of company performance, and the mo
1. [Hires vs. plan](/handbook/hiring/performance_indicators/#hires-vs-plan) > 0.9 [📊](https://app.periscopedata.com/app/gitlab/482006/People-KPIs?widget=6888845&udv=820057)
1. [12 month team member retention](/handbook/people-group/people-success-performance-indicators/#team-member-retention-rolling-12-months) > 84% [🔗](https://app.periscopedata.com/app/gitlab/482006/People-KPIs?widget=6251791&udv=904340) (lagging)
1. [Merge Requests Rate](/handbook/engineering/development/performance-indicators/#mr-rate) > 10 [📊](https://app.periscopedata.com/app/gitlab/504639/Development-KPIs?widget=6946736) (lagging)
1. [GitLab.com Availability](/handbook/engineering/infrastructure/performance-indicators/#gitlab-com-availability) > 99.95% [🔗](https://dashboards.gitlab.com/d/GTp20b1Zk/public-dashboard-splashscreen?orgId=1&from=now-30d&to=now) (lagging)
1. [GitLab.com Availability](/handbook/engineering/infrastructure/performance-indicators/#gitlabcom-availability) > 99.95% [🔗](https://dashboards.gitlab.com/d/GTp20b1Zk/public-dashboard-splashscreen?orgId=1&from=now-30d&to=now) (lagging)
1. [SMAU](/handbook/product/metrics/#stage-monthly-active-users-smau) [🚧](https://gitlab.com/gitlab-data/analytics/-/issues/3840) (lagging)
1. [Support Satisfaction](/handbook/support/performance-indicators/#support-satisfaction-ssat) [📊](https://app.periscopedata.com/app/gitlab/463858/Engineering-KPIs?widget=5992548) (lagging)
1. [Runway](/handbook/finance/corporate-finance-performance-indicators/index.html#runway) > 12 [🔗](https://app.periscopedata.com/app/gitlab/483606/Finance-KPIs?widget=6880820&udv=0) (lagging)
@@ -31,7 +31,7 @@ There are a number of tools we use to plot and manage career development:
* On the last 1:1 meeting of the quarter, discuss and record a summary of the progress made for the quarter and update the EFW accordingly.
* **1:1s**: Mentor on the focus areas in weekly 1:1s at a frequency that feels right for the tasks. There is no hard rule to discuss the career development items every week, but try to do so regularly and avoid a wide-open, ad-hoc conversation.
No strict rule governs how much progress should be made in a given time period. However, we should strive to set targets to progress to the [next level](/handbook/total-rewards/compensation/#explanation) on at least a quarterly basis.
No strict rule governs how much progress should be made in a given time period. However, we should strive to set targets to progress to the [next level](/handbook/total-rewards/compensation/compensation-calculator/#introduction) on at least a quarterly basis.
Actions to make changes to a GitLab team-member's level can be taken during the [360 Feedback](/handbook/people-group/360-feedback/), and the data collected throughout this workflow should be useful at that time.
@@ -89,7 +89,7 @@ These definitions imply several on-call rotations for the different roles.
1. _Be inquisitive_. _Be vigilant_. If you notice that something doesn't seem right, investigate further.
2. After the incident is resolved, the EOC should start on performing an [incident review](/handbook/engineering/infrastructure/incident-review) (RCA) and [assign themselves](#incident-review-issue-creation-and-ownership) as the initial owner. Feel free to take a breather first, but do not end your work day without starting the RCA.
#### Guidlines on Security Incidents
#### Guidelines on Security Incidents
At times, we have a security incident where we may need to take action to block a certain URL path or part of the application. This list is meant to help the Security Engineer On-Call and EOC decide when to engage help and post to status.io.
@@ -4,7 +4,7 @@ title: "Production Architecture"
---
Our GitLab.com core infrastructure is primarily hosted in Google Cloud Platform's (GCP) `us-east1` region (see [Regions and Zones](https://cloud.google.com/compute/docs/regions-zones/))—and we use GCP iconography in our diagrams to represent GCP resources. We do have dependencies on other cloud providers for separate functions. Some of the dependencies are legacy fragments from our migration from Azure, and others are deliberate to separate concerns in the event of cloud provider service disruption. We're currently working to implement a [Disaster Recovery](/handbook/engineering/infrastructure/library/disaster-recovery/) solution that redesigns our failure scenarios across multi-zone, multi-region, and multi-cloud architectures.
Our GitLab.com core infrastructure is primarily hosted in Google Cloud Platform's (GCP) `us-east1` region (see [Regions and Zones](https://cloud.google.com/compute/docs/regions-zones/))—and we use GCP iconography in our diagrams to represent GCP resources. We do have dependencies on other cloud providers for separate functions. Some of the dependencies are legacy fragments from our migration from Azure, and others are deliberate to separate concerns in the event of cloud provider service disruption. We're currently working to implement a [Disaster Recovery](https://gitlab.com/gitlab-com/gl-infra/readiness/-/blob/master/library/disaster-recovery/index.md) solution that redesigns our failure scenarios across multi-zone, multi-region, and multi-cloud architectures.
This document does not cover servers that are not integral to the public facing operations of GitLab.com.
@@ -92,7 +92,7 @@ Functional queues track team workloads (`infrastructure`, `security`, etc) and a
The `production` queue tracks events in production, namely:
* [changes](/handbook/engineering/infrastructure/change-management/)
* [incidents](/handbook/engineering/infrastructure/team/reliability/incident-management/)
* [incidents](/handbook/engineering/infrastructure/incident-management/)
* deltas (exceptions) -- handbook write-up still needed
Over time, we will implement hooks into our automation to *automagically* inject change audit data into the `production` queue.
@@ -137,7 +137,7 @@ Type labels are very important. They define what kind of issue this is. Every is
| Label | Description |
|--------------------|-------------------------------------------------------------------------------------------------------------------------|
| `~Change` | Represents a change to infrastructure; see [Change](/handbook/engineering/infrastructure/change-management/) for details |
| `~Incident` | Represents an incident in infrastructure; see [Incident](/handbook/engineering/infrastructure/team/reliability/incident-management/) for details |
| `~Incident` | Represents an incident in infrastructure; see [Incident](/handbook/engineering/infrastructure/incident-management/) for details |
| `~Database` | Label for database-related problems |
| `~Security` | Label for security-related problems |
@@ -172,7 +172,7 @@ The list may not be up to date. If something is missing, please add it.
# Zendesk
Every SRE should register for a “Light Agent” account in Zendesk. Oftentimes, incidents are generated from customer reports, and it’s useful to see their submission and the back-and-forth with support. You can also leave internal notes for support engineers so that they can gather more information for troubleshooting purposes. See ['Light Agent' Zendesk accounts available for all GitLab staff](/handbook/support/internal-support/#light-agent-zendesk-accounts-available-for-all-gitlab-staff)
Every SRE should register for a “Light Agent” account in Zendesk. Oftentimes, incidents are generated from customer reports, and it’s useful to see their submission and the back-and-forth with support. You can also leave internal notes for support engineers so that they can gather more information for troubleshooting purposes. See ['Light Agent' Zendesk accounts available for all GitLab staff](/handbook/support/internal-support/#viewing-support-tickets)
## PTO Ninja
@@ -47,7 +47,7 @@ application.
## Indicators
The Infrastructure Department is concerned with the [availability](/handbook/engineering/infrastructure/performance-indicators/#gitlab-com-availability)
The Infrastructure Department is concerned with the [availability](/handbook/engineering/infrastructure/performance-indicators/#gitlabcom-availability)
and [performance](/handbook/engineering/infrastructure/performance-indicators/#gitlab-com-performance) of GitLab.com.
GitLab.com's service level availability is visible on the [SLA Dashboard](https://gitlab.com/gitlab-com/dashboards-gitlab-com/-/environments/1790496/metrics?dashboard=.gitlab%2Fdashboards%2Fsla-dashboard.yml&duration_seconds=2592000),
@@ -46,4 +46,4 @@ Critical vulnerabilities and exploits found during an engagement that are curren
### Emergencies
All red team testing and engagements are first carefully planned in advance and will not intentionally impact production systems or customers. However, accidents do happen and in the case of a production system being impacted during an engagement, either directly or indirectly, we escalate the incident by notifying the red team manager. We also engage our security operations team using our [incident response guide](./sec-incident-response.html) at the earliest reasonable time giving full disclosure of what caused the issue. Depending on the circumstances, the infrastructure team may need to be made aware through an [incident report](/handbook/engineering/infrastructure/team/reliability/incident-management/) to negate or reduce the impact to our customers. Proper [root cause analysis](/handbook/engineering/root-cause-analysis/) is recorded following resolution of the incident.
\ No newline at end of file
All red team testing and engagements are first carefully planned in advance and will not intentionally impact production systems or customers. However, accidents do happen and in the case of a production system being impacted during an engagement, either directly or indirectly, we escalate the incident by notifying the red team manager. We also engage our security operations team using our [incident response guide](./sec-incident-response.html) at the earliest reasonable time giving full disclosure of what caused the issue. Depending on the circumstances, the infrastructure team may need to be made aware through an [incident report](/handbook/engineering/infrastructure/incident-management/) to negate or reduce the impact to our customers. Proper [root cause analysis](/handbook/engineering/root-cause-analysis/) is recorded following resolution of the incident.
\ No newline at end of file
@@ -56,17 +56,16 @@ Non-public information relating to this security control as well as links to the
### Policy Reference
* [GitLab Business Continuity Plan in Handbook](https://about.gitlab.com/handbook/business-ops/gitlab-business-continuity-plan.html)
* [GitLab Disaster Recovery](https://about.gitlab.com/handbook/engineering/infrastructure/library/disaster-recovery/)
* [GitLab Disaster Recovery](https://gitlab.com/gitlab-com/gl-infra/readiness/-/blob/master/library/disaster-recovery/index.md)
* [GitLab Reference Architectures](https://about.gitlab.com/solutions/reference-architectures/)
* [GitLab Infra Epic for Geo](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1)
* [GitLab Handbook listing of DR for Databases](https://about.gitlab.com/handbook/engineering/infrastructure/database/disaster_recovery.html)
* [NIST Guidance on Business Continuity](https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-34r1.pdf)
* [PCI DSS v3.2.1 - Business Continuity Plan](https://www.pcisecuritystandards.org/documents/PCI_DSS_v3-2-1.pdf?agreement=true&time=1551196697261#page=113)
* [Geo and Disaster Recovery](/handbook/engineering/development/enablement/geo/)
* [GitLab DR Design](/handbook/engineering/infrastructure/library/disaster-recovery/#design)
* [GitLab DR Design](https://gitlab.com/gitlab-com/gl-infra/readiness/-/blob/master/library/disaster-recovery/index.md#design)
* [GitLab DR for Databases](/handbook/engineering/infrastructure/database/disaster_recovery.html)
## Framework Mapping
* ISO
* A.17.1.1
@@ -58,7 +58,7 @@ Examples of evidence an auditor might request to satisfy this control:
* Database backup recovery testing is [implemented](https://gitlab.com/gitlab-com/gl-infra/gitlab-restore/postgres-gprd/blob/master/README.md) in the form of an automated pipeline.
* Snapshot restoration has been [manually verified](https://gitlab.com/gitlab-com/migration/issues/560) and handbook updates reflecting such testing and verification merged.
* GCP snapshot procedures can be found [here](https://gitlab.com/gitlab-com/runbooks/blob/master/howto/gcp-snapshots.md). A manual verification of this process was done [here](https://gitlab.com/gitlab-com/migration/issues/560).
* Incident management is documented in the [GitLab Handbook](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/incident-management/). The documentation contains the following components:
* Incident management is documented in the [GitLab Handbook](https://about.gitlab.com/handbook/engineering/infrastructure/incident-management/). The documentation contains the following components:
* [Incident runbooks](https://gitlab.com/gitlab-com/runbooks/tree/master/incidents) are available and maintained in the `runbooks` project.
* Database-specific runbooks are located [here](https://gitlab.com/gitlab-com/runbooks/blob/master/incidents/database.md).
* [Additional runbooks](https://gitlab.com/gitlab-com/runbooks) not specific to incident management are also available.
@@ -43,16 +43,16 @@ Examples of evidence an auditor might request to satisfy this control:
* [Security issue triage process](https://about.gitlab.com/handbook/engineering/security/#issue-triage)
* [Security severity labelling](https://about.gitlab.com/handbook/engineering/security/#severity-and-priority-labels-on-security-issues)
* [Major Incident Response Workflow](https://about.gitlab.com/handbook/engineering/security/secops-oncall.html#major-incident-response-workflow)
* [Additional documentation](https://about.gitlab.com/handbook/security/#panic-email) on using the `panic` email and a [procedure for the security team's response](https://about.gitlab.com/handbook/security/#checklist-for-when-panic-is-triggered) to those alerts
* [Additional documentation](https://about.gitlab.com/handbook/security/#using-the-panic-email-address) on using the `panic` email and a [procedure for the security team's response](https://about.gitlab.com/handbook/security/#checklist-for-when-panic-is-triggered) to those alerts
3. Key incident response systems:
* [Incident management documented in the GitLab Handbook](https://about.gitlab.com/handbook/engineering/infrastructure/incident-management/)
4. Incident coordination and communication strategy:
* [S1 and S2 Incidents](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/incident-management/#s1-and-s2-incidents). Information about our most critical incident severities.
* [Incident Steps](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/incident-management/#incident-steps). Defines the steps involved with handling an incident.
* [Communication](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/incident-management/#communication). Describes communication procedures during an incident.
* [CMOC and IMOC checklist](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/incident-management/#cmoc-and-imoc-checklist). A checklist of actions for the CMOC and IMOC response roles to perform.
* [S1 and S2 Incidents](https://about.gitlab.com/handbook/engineering/infrastructure/incident-management/#severity). Information about our most critical incident severities.
* [Incident Steps](https://about.gitlab.com/handbook/engineering/infrastructure/incident-management/#incident-workflow). Defines the steps involved with handling an incident.
* [Communication](https://about.gitlab.com/handbook/engineering/infrastructure/incident-management/#communication). Describes communication procedures during an incident.
* [CMOC and IMOC checklist](https://about.gitlab.com/handbook/engineering/infrastructure/incident-management/#roles-and-responsibilities). A checklist of actions for the CMOC and IMOC response roles to perform.
* [Incident runbooks](https://gitlab.com/gitlab-com/runbooks/tree/master/incidents) are available and maintained in the `runbooks` project
* [Database-specific runbooks](https://gitlab.com/gitlab-com/runbooks/blob/master/incidents/database.md)
* [Additional runbooks](https://gitlab.com/gitlab-com/runbooks)
......@@ -63,7 +63,7 @@ Examples of evidence an auditor might request to satisfy this control:
6. Support team contact information
* [Incident Management Support](https://about.gitlab.com/handbook/support/incident-management/)
* [On-Call Runbooks](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/incident-management/#on-call-runbooks).
* [On-Call Runbooks](https://about.gitlab.com/handbook/engineering/infrastructure/incident-management/#runbooks).
7. Notification to relevant management in the event of a security breach
* [Security Incident Comms Plan](https://gitlab.com/gitlab-com/gl-security/secops/operations/issues/205)
@@ -42,7 +42,7 @@ Non-public information relating to this security control as well as links to the
### Policy Reference
* [Security Incident Response Guide](/handbook/engineering/security/sec-incident-response.html)
* [Incident Management](/handbook/engineering/infrastructure/team/reliability/incident-management/)
* [Incident Management](/handbook/engineering/infrastructure/incident-management/)
## Framework Mapping
@@ -47,7 +47,7 @@ Examples of evidence an auditor might request to satisfy this control:
### Policy Reference
* [Security Incident Communication Plan](/handbook/engineering/security/security-incident-communication-plan.html)
* [Incident Management - Communication](/handbook/engineering/infrastructure/team/reliability/incident-management/#communication)
* [Incident Management - Communication](/handbook/engineering/infrastructure/incident-management/#communication)
## Framework Mapping
@@ -51,7 +51,7 @@ Non-public information relating to this security control as well as links to the
- [Infrastructure Department](https://about.gitlab.com/handbook/engineering/infrastructure/)
- [Backup Policies](https://about.gitlab.com/handbook/engineering/infrastructure/production/#backups) and [Backup Recovery Testing](https://gitlab.com/gitlab-com/gl-infra/gitlab-restore/postgres-gprd/blob/master/README.md)
- [Change Management](https://about.gitlab.com/handbook/engineering/infrastructure/change-management/)
- [Disaster Recovery](https://about.gitlab.com/handbook/engineering/infrastructure/library/disaster-recovery/) and [Disaster Recovery - Databases](https://about.gitlab.com/handbook/engineering/infrastructure/database/disaster_recovery.html)
- [Disaster Recovery](https://gitlab.com/gitlab-com/gl-infra/readiness/-/blob/master/library/disaster-recovery/index.md) and [Disaster Recovery - Databases](https://about.gitlab.com/handbook/engineering/infrastructure/database/disaster_recovery.html)
- [Incident Management](https://about.gitlab.com/handbook/engineering/infrastructure/incident-management/)
- [Production Architecture Page](https://about.gitlab.com/handbook/engineering/infrastructure/production/architecture/)
- [Quality Department](https://about.gitlab.com/handbook/engineering/quality/)
@@ -45,7 +45,7 @@ Examples of evidence an auditor might request to satisfy this control:
### Policy Reference
* [Incident Management](/handbook/engineering/infrastructure/team/reliability/incident-management/)
* [Incident Management](/handbook/engineering/infrastructure/incident-management/)
* [Security Incident Response Guide](/handbook/engineering/security/sec-incident-response.html)
* [DELKE](https://gitlab.com/gitlab-com/gl-security/secops/detection/delke)
@@ -48,7 +48,7 @@ Examples of evidence an auditor might request to satisfy this control:
### Policy Reference
* [Security Incident Response Guide](/handbook/engineering/security/sec-incident-response.html)
* [Incident Management - Security Incidents](/handbook/engineering/infrastructure/team/reliability/incident-management/#security-incidents)
* [Incident Management - Security Incidents](/handbook/engineering/infrastructure/incident-management/#security-incidents)
## Framework Mapping
@@ -20,7 +20,7 @@ title: "Twitter response workflow"
| [@MovingToGitLab](https://twitter.com/MovingToGitLab) | Tweetdeck | Respond to mentions and questions |
- The [@GitLabStatus](https://twitter.com/GitLabStatus) account should only be used to give updates on the availability of [GitLab.com](https://gitlab.com) and to follow up on users reporting that [GitLab.com](https://gitlab.com) is unavailable or responding to a previous availability update on [@GitLabStatus](https://twitter.com/GitLabStatus).
- Only the infrastructure team should be posting updates on [@GitLabStatus](https://twitter.com/GitLabStatus). There is a [defined process](/handbook/engineering/infrastructure/team/reliability/incident-management/) for this describing who should do this, how and what channels should be alerted.
- Only the infrastructure team should be posting updates on [@GitLabStatus](https://twitter.com/GitLabStatus). There is a [defined process](/handbook/engineering/infrastructure/incident-management/) for this describing who should do this, how and what channels should be alerted.
- When a tweet mentions more than one handle described above, always reply from the main [@GitLab handle](https://twitter.com/GitLab), unless it's about GitLab availability status
- If a wrong handle is used in a response, take note and respond from the correct one in the follow-up (if there is one)