Commit d951b096 authored by Hercules Merscher

Updating Observability team references

parent 5ecff212
1 merge request: !12072 Updating Observability team references
Showing 21 additions and 48 deletions
@@ -27,7 +27,7 @@ Long-term, we think of including Tamland in self-managed installations and think
### Background: Capacity planning for GitLab.com
[Tamland](https://gitlab.com/gitlab-com/gl-infra/tamland) is an infrastructure resource forecasting project owned by the [Scalability::Observability](../../../infrastructure/team/scalability/#scalabilityobservability) group.
[Tamland](https://gitlab.com/gitlab-com/gl-infra/tamland) is an infrastructure resource forecasting project owned by the [Observability team](/handbook/engineering/infrastructure/team/observability/).
It implements [capacity planning](../../../infrastructure/capacity-planning/) for GitLab.com, which is a [controlled activity covered by SOC 2](https://gitlab.com/gitlab-com/gl-security/security-assurance/security-compliance-commercial-and-dedicated/observation-management/-/issues/604).
As of today, it is used exclusively for GitLab.com to predict upcoming SLO violations across hundreds of monitored infrastructure components.
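Tamland's actual models are more involved, but the core idea (extrapolate a saturation time series forward and check when it would cross its hard limit) can be sketched as follows; the metric, numbers, and 0.9 threshold are illustrative assumptions, not Tamland's real configuration:

```python
# Rough illustration of trend-based capacity forecasting; not Tamland's actual model.
# The saturation series and the 0.9 hard limit below are made-up values.
import numpy as np

days = np.arange(30)                                # 30 days of daily observations
saturation = 0.55 + 0.004 * days                    # e.g. a slowly growing disk utilisation ratio

slope, intercept = np.polyfit(days, saturation, 1)  # simple linear trend
hard_slo = 0.90                                     # hypothetical saturation limit

if slope > 0:
    days_to_violation = (hard_slo - (slope * days[-1] + intercept)) / slope
    print(f"Projected to exceed the hard SLO in ~{days_to_violation:.0f} days")
else:
    print("No SLO violation projected for this resource")
```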
@@ -333,7 +333,7 @@ See [this issue](https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/
#### Observability
In general, the lifecycle of observability components for cells will be owned by the `Scalability:Observability` team.
In general, the lifecycle of observability components for cells will be owned by the [Observability team](/handbook/engineering/infrastructure/team/observability/).
By default, each Dedicated tenant is provisioned with a fully functional Prometheus/Grafana stack. Cells will reuse this stack, with the intention of aggregating metrics so that queries can be run over multiple cells. More information can be found [here](https://gitlab-com.gitlab.io/gl-infra/gitlab-dedicated/team/engineering/observability/metrics.html).
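As a hedged sketch of what querying over multiple cells could mean at its simplest, the snippet below fans a PromQL query out to each cell's Prometheus HTTP API and sums the results client-side. The endpoints and metric name are hypothetical, and this is not the agreed Cells design:

```python
# Hypothetical client-side aggregation across per-cell Prometheus instances.
# Cell endpoints and the metric are invented for illustration.
import requests

cell_prometheus_urls = [
    "https://prometheus.cell-1.example.internal",
    "https://prometheus.cell-2.example.internal",
]
query = 'sum(rate(http_requests_total{job="webservice"}[5m]))'

total_rps = 0.0
for base_url in cell_prometheus_urls:
    resp = requests.get(f"{base_url}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    for sample in resp.json()["data"]["result"]:
        total_rps += float(sample["value"][1])  # instant vector value is [timestamp, "<value>"]

print(f"Aggregate request rate across cells: {total_rps:.1f} req/s")
```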
@@ -128,7 +128,7 @@ This requires both external and internal LBs for all front-end-services.
### Monitoring
- Dependencies: [Eliminate X% Chef dependencies in Infra by moving infra away from Chef](zonal.md#eliminate-x-chef-dependencies-in-infra-by-moving-infra-away-from-chef) (migrate Prometheus infra to Kubernetes)
- Teams: Scalability:Observability, Ops, Foundations
- Teams: Observability, Ops, Foundations
Set up an alternate ops Kubernetes cluster in a different region that is scaled down to zero replicas.
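To illustrate what bringing a scaled-to-zero standby stack back up could involve, here is a minimal sketch using the Kubernetes Python client; the kube context, namespace, and deployment names are assumptions, not the actual ops setup:

```python
# Hypothetical failover step: scale monitoring workloads in the standby ops cluster
# from zero replicas back up. Context, namespace, and deployment names are assumptions.
from kubernetes import client, config

config.load_kube_config(context="ops-dr")      # hypothetical kube context
apps = client.AppsV1Api()

for deployment in ("prometheus", "grafana"):    # hypothetical deployment names
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace="monitoring",
        body={"spec": {"replicas": 2}},
    )
```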
@@ -108,7 +108,7 @@ We estimate that using an image can reduce our recovery time by about 15 minutes
### Eliminate X% Chef dependencies in Infra by moving infra away from Chef
- Dependencies: None
- Teams: Ops, Scalability:Observability, Scalability:Practices
- Teams: Ops, Observability
Gitaly, Postgres, CI runner managers, HAProxy, Bastion, CustomersDot, Deploy, DB Lab, Prometheus, Redis, SD Exporter, and Console servers are managed by Chef.
To help improve the speed of recoveries, we can move this infrastructure into Kubernetes or Ansible for configuration management.
@@ -467,7 +467,7 @@ An engineer might be assigned as a DRI to look into this.
The DRI is neither expected to determine a root cause nor propose a solution on their own.
The DRI should instead reach out to [the Scalability:Projections team](/handbook/engineering/infrastructure/team/scalability/projections/) for support.
The DRI should instead reach out to the [Observability team](/handbook/engineering/infrastructure/team/observability/) for support.
## Async Issue Updates
@@ -201,7 +201,7 @@ Error budget events are attributed to stage groups via feature categorization. T
Updates to feature categories only change how future events are mapped to stage groups. Previously reported events will not be retroactively updated.
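For orientation, the arithmetic behind an error budget can be sketched as below. The 99.95% target and 28-day window are assumptions for illustration; the authoritative definitions live on the error budgets handbook page:

```python
# Back-of-the-envelope error budget arithmetic; the target and window are assumed values.
WINDOW_DAYS = 28
TARGET = 0.9995  # assumed availability target for a stage group

window_minutes = WINDOW_DAYS * 24 * 60          # 40,320 minutes in the window
budget_minutes = window_minutes * (1 - TARGET)  # ~20.2 minutes of tolerated failure

spent_minutes = 12.5                            # hypothetical failures attributed to a stage group
print(f"Budget {budget_minutes:.1f} min, spent {spent_minutes} min, "
      f"remaining {budget_minutes - spent_minutes:.1f} min")
```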
The [Scalability:Projections team](/handbook/engineering/infrastructure/team/scalability/projections/) owns keeping the mappings up to date when feature categories are changed in the website repository. When the categories are changed in `stages.yml`, a scheduled pipeline creates an issue ([example issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2084)) on the [build board](https://gitlab.com/gitlab-com/gl-infra/scalability/-/boards/1697160). The issue contains the pipeline link and instructions to follow in the description. The categories need to be synced to two places:
The [Observability team](/handbook/engineering/infrastructure/team/observability/) owns keeping the mappings up to date when feature categories are changed in the website repository. When the categories are changed in `stages.yml`, a scheduled pipeline creates an issue ([example issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2084)) on the [build board](https://gitlab.com/gitlab-com/gl-infra/scalability/-/boards/1697160). The issue contains the pipeline link and instructions to follow in the description. The categories need to be synced to two places:
1. The [Rails application](https://docs.gitlab.com/ee/development/feature_categorization/#updating-configfeature_categoriesyml).
1. The [Runbooks repository](https://gitlab.com/gitlab-com/runbooks/-/blob/master/services/stage-group-mapping.jsonnet).
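A hedged sketch of how that sync could be sanity-checked locally is shown below; the file paths and the assumed `stages.yml` structure are illustrative, and this is not an official tool:

```python
# Illustrative consistency check between the Rails feature categories and stages.yml.
# Paths and the stages.yml layout are assumptions about the two repositories.
import yaml

with open("gitlab/config/feature_categories.yml") as f:
    rails_categories = set(yaml.safe_load(f))   # assumed to be a flat list of category names

with open("www-gitlab-com/data/stages.yml") as f:
    stages = yaml.safe_load(f)

stage_categories = set()
for stage in stages["stages"].values():
    for group in stage.get("groups", {}).values():
        stage_categories.update(group.get("categories", []))

missing = rails_categories - stage_categories
if missing:
    print("Categories in the Rails app but not in stages.yml:", sorted(missing))
```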
@@ -42,7 +42,7 @@ A Gitaly dashboard could be either auto-generated or manually drafted. We use Js
A standardized dashboard should have a top-level section containing environment filters, node filters, and useful annotations such as feature flag activities, deployments, etc. Some dashboards have an interlinked system that connects Grafana and Kibana with a single click.
Such dashboards usually include two parts. The second half contains panels of custom metrics collected from Gitaly. The first half is more complicated. It contains GitLab-wide indicators telling if Gitaly is "healthy" and node-level resource metrics. The aggregation and calculation are sophisticated. In summary, those dashboards tell us if Gitaly performs well according to predefined [thresholds](https://gitlab.com/gitlab-com/runbooks/-/blob/master/metrics-catalog/services/gitaly.jsonnet). We can contact the [Scalability:Observability Team](../../../../infrastructure/team/scalability/observability/) for any questions.
Such dashboards usually include two parts. The second half contains panels of custom metrics collected from Gitaly. The first half is more complicated. It contains GitLab-wide indicators telling if Gitaly is "healthy" and node-level resource metrics. The aggregation and calculation are sophisticated. In summary, those dashboards tell us if Gitaly performs well according to predefined [thresholds](https://gitlab.com/gitlab-com/runbooks/-/blob/master/metrics-catalog/services/gitaly.jsonnet). We can contact the [Observability team](/handbook/engineering/infrastructure/team/observability/) for any questions.
![Gitaly Debug Indicators](/images/engineering/infrastructure-platforms/data-access/gitaly/gitaly-debug-indicators.png)
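To make "performs well according to predefined thresholds" concrete, a minimal reading of such a check is sketched below; the apdex and error-ratio numbers are placeholders, not the values from `gitaly.jsonnet`:

```python
# Minimal sketch of an SLO-style health check; thresholds are placeholder values.
APDEX_SLO = 0.999         # hypothetical apdex target
ERROR_RATIO_SLO = 0.0005  # hypothetical maximum error ratio

def gitaly_healthy(apdex: float, error_ratio: float) -> bool:
    """Treat the service as healthy only if both indicators meet their SLOs."""
    return apdex >= APDEX_SLO and error_ratio <= ERROR_RATIO_SLO

print(gitaly_healthy(apdex=0.9995, error_ratio=0.0002))  # True
```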
@@ -243,4 +243,4 @@ Gitaly team is responsible for maintaining reasonable serving capacity for gitla
We get alerts from Tamland if capacity runs low; see [this issue comment](https://gitlab.com/gitlab-com/gl-infra/capacity-planning-trackers/gitlab-com/-/issues/1666#note_1786916965).
[Capacity planning](../../../../infrastructure/team/scalability/observability/capacity_planning/) documentation explains how this works in general.
[Capacity planning](/handbook/engineering/infrastructure/team/observability/capacity_planning/) documentation explains how this works in general.
@@ -82,14 +82,14 @@ Therefore, the recommended practice when including Tamland data is:
Capacity planning is a shared activity and dependent on input from many stakeholders:
1. The [Scalability:Observability team](/handbook/engineering/infrastructure/team/scalability/observability) is the owner of the Capacity Planning process overall: the team oversees the entire process, implements technical improvements to our forecasting capabilities, and helps guide teams to act on associated capacity warnings.
1. The [Observability team](/handbook/engineering/infrastructure/team/observability/) is the owner of the Capacity Planning process overall: the team oversees the entire process, implements technical improvements to our forecasting capabilities, and helps guide teams to act on associated capacity warnings.
2. Each service we monitor is associated with a **Service Owner**, who is identified as the [DRI](/handbook/people-group/directly-responsible-individuals/) to act on capacity warnings and provide input in terms of domain knowledge.
#### Scalability:Observability
#### Observability
1. Tamland analyzes metrics data on a daily basis and creates capacity warning issues if it predicts that a resource will exceed its SLO within the forecast horizon.
1. On a weekly basis, an engineer from the team reviews all open issues in the [Capacity Planning](https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues) tracker following the [process described on the Scalability:Observability team page](/handbook/engineering/infrastructure/team/scalability/observability/)
1. On a weekly basis, an engineer from the team reviews all open issues in the [Capacity Planning](https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues) tracker following the [process described on the Observability team page](/handbook/engineering/infrastructure/team/observability/)
1. Assign legitimate forecasts to the respective Service Owner to review and act on them (see below).
2. Select the most crucial saturation points to report in the [GitLab SaaS Availability](/handbook/engineering/#saas-availability-weekly-standup) meeting based on the impact they would have when fully saturated and how difficult the mitigation might be. To indicate issues like this, we apply the `~"SaaS Weekly"` label when we do the weekly triage.
3. Review forecasts with inaccurate model fit or otherwise obscure predictions, and work on improving their quality. Those issues should be labeled with `~capacity-planning::tune model` and not get assigned to the Service Owner directly. Since these model tunings highly benefit from domain insight, the Scalability engineer involves Service Owners to get more information.
@@ -111,11 +111,11 @@ While many forecasts provide a clear and reliable outlook, not all forecasts wil
The Service Owner will note down their findings on the issue and get the appropriate actions going to remediate and prevent the saturation event. While the Service Owner is the DRI for the capacity warning, the [Infradev Process](/handbook/engineering/workflow/#infradev) and the [SaaS Availability weekly standup](/handbook/engineering/#saas-availability-weekly-standup) assist with the prioritization of these capacity alerts.
The Service Owner can also decide to change the Service Level Objective, the metric definition or any other forecasting parameters that are used to generate capacity warnings. Please see the related [documentation](https://gitlab.com/gitlab-com/runbooks/-/blob/master/libsonnet/saturation-monitoring/README.md) for further information. The [Scalability:Observability team](/handbook/engineering/infrastructure/team/scalability/observability/) is available to assist, but the work should be owned by the [DRI](/handbook/people-group/directly-responsible-individuals/) and their team.
The Service Owner can also decide to change the Service Level Objective, the metric definition or any other forecasting parameters that are used to generate capacity warnings. Please see the related [documentation](https://gitlab.com/gitlab-com/runbooks/-/blob/master/libsonnet/saturation-monitoring/README.md) for further information. The [Observability team](/handbook/engineering/infrastructure/team/observability/) is available to assist, but the work should be owned by the [DRI](/handbook/people-group/directly-responsible-individuals/) and their team.
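Purely as an illustration of the kinds of forecasting parameters a Service Owner might adjust, a hypothetical override could look like the following. The field names and values are invented for this sketch; the real schema is defined in the saturation-monitoring library linked above:

```python
# Invented example of tunable capacity-planning parameters for one saturation signal.
# None of these field names are guaranteed to match the real runbooks schema.
saturation_override = {
    "component": "example_disk_space",    # hypothetical saturation component
    "slo": {"soft": 0.85, "hard": 0.90},  # warning vs. violation thresholds
    "forecast_horizon_days": 90,          # how far ahead the forecast should look
    "capacity_planning": {
        "strategy": "exclude",            # e.g. stop forecasting a retired resource
        "reason": "storage migration in progress, see linked issue",
    },
}
```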
If the issue does not require investigation, it is important to follow up and improve the quality of the forecast or the process to raise the signal-to-noise ratio for capacity planning. This can include feeding external knowledge into the forecasting model or considering changes in automation to prevent this capacity warning from being raised too early. The Service Owner is expected to get in touch with Scalability:Observability to consider and work on potential improvements.
If the issue does not require investigation, it is important to follow up and improve the quality of the forecast or the process to raise the signal-to-noise ratio for capacity planning. This can include feeding external knowledge into the forecasting model or considering changes in automation to prevent this capacity warning from being raised too early. The Service Owner is expected to get in touch with Observability to consider and work on potential improvements.
At any time, the Scalability:Observability team can be consulted and is ready to assist with questions about forecasting or to help identify the underlying reasons for a capacity warning.
At any time, the Observability team can be consulted and is ready to assist with questions about forecasting or to help identify the underlying reasons for a capacity warning.
#### Due Dates
@@ -217,13 +217,13 @@ In addition to precision, we also define a KPI *rated* to indicate the ratio of
The following details the team-level agreements and responsibilities regarding capacity planning for GitLab Dedicated.
While capacity planning for GitLab.com is a shared activity, capacity planning for GitLab Dedicated implements a more differentiated responsibility model.
### Stakeholders: Scalability:Observability team and Dedicated teams
### Stakeholders: Observability team and Dedicated teams
1. The Dedicated team is responsible for defining the saturation metrics Tamland monitors and for configuring tenants for capacity planning.
1. The Dedicated team runs Tamland inside tenant environments and produces saturation forecasting data.
1. The [Scalability:Observability team](/handbook/engineering/infrastructure/team/scalability/observability) owns the reporting side of capacity planning and makes sure reports and warnings are available.
1. The [Observability team](/handbook/engineering/infrastructure/team/observability) owns the reporting side of capacity planning and makes sure reports and warnings are available.
1. The Dedicated team is responsible for triaging and responding to the forecasts and warnings generated, and applying any insights to Dedicated tenant environments.
1. The [Scalability:Observability team](/handbook/engineering/infrastructure/team/scalability/observability) implements new features and fixes for Tamland to aid the capacity planning process for GitLab Dedicated.
1. The [Observability team](/handbook/engineering/infrastructure/team/observability) implements new features and fixes for Tamland to aid the capacity planning process for GitLab Dedicated.
### Defining saturation metrics and tenants
@@ -672,7 +672,7 @@ as a high priority task that is second only to active incidents:
1. Work on the Tamland
[manifest](https://gitlab.com/gitlab-com/runbooks/-/blob/master/reference-architectures/get-hybrid/config/tamland/manifest.json)
to exclude or tweak the specific saturation signal.
- The [Scalability:Observability](/handbook/engineering/infrastructure/team/scalability/observability/) team
- The [Observability team](/handbook/engineering/infrastructure/team/observability/)
can offer advice on the finer details of the Tamland configuration.
1. Check that Tamland is [running](https://gitlab.com/gitlab-com/gl-infra/capacity-planning-trackers/gitlab-dedicated/-/pipeline_schedules).
The pipeline should run successfully every day.
@@ -36,10 +36,10 @@ The following gives an overview of our scope and ownership.
1. [Monitoring fundamentals](https://gitlab.com/gitlab-com/runbooks/blob/e00eeb59937a9043c5db04314a35acb05c4e9288/docs/monitoring/README.md#L1)
1. Metrics stack
1. Logging stack
1. [Error budgets](/handbook/engineering/infrastructure/team/scalability/observability/error_budgets/)
1. [Error budgets](/handbook/engineering/infrastructure/team/observability/error_budgets/)
1. Ownership of concept and implementation
1. Delivery of monthly error budget report
1. [Capacity planning](/handbook/engineering/infrastructure/team/scalability/observability/capacity_planning/)
1. [Capacity planning](/handbook/engineering/infrastructure/team/observability/capacity_planning/)
1. [Triage rotation for .com](/handbook/engineering/infrastructure/capacity-planning/#gitlabcom-capacity-planning)
1. [Operational aspects for GitLab Dedicated capacity planning](https://docs.gitlab.com/ee/architecture/blueprints/capacity_planning/)
1. Developing [Tamland](https://gitlab.com/gitlab-com/gl-infra/tamland), the forecasting tool
@@ -121,7 +121,7 @@ Between these different signals, we have a relatively (im)precise view into the
The team is responsible for provisioning access to the services listed below, as per the [tech_stack.yml](https://gitlab.com/gitlab-com/www-gitlab-com/-/blob/master/data/tech_stack.yml) file.
1. **Kibana** is accessed through Okta. Team members need to be in either of the following Okta groups: `gl-engineering` (entire Engineering department); `okta-kibana-users`. The latter group is used to manage access for team members outside of Engineering on an ad-hoc basis ([context](https://gitlab.com/gitlab-com/business-technology/change-management/-/issues/958)). Team members should be (de)provisioned through an Access Request ([example](https://gitlab.com/gitlab-com/team-member-epics/access-requests/-/issues/28421)). If the access request is approved, the provisioner should add the user to [this group](https://groups.google.com/a/gitlab.com/g/okta-kibana-users), which will then automatically sync to its namesake group in Okta.
1. **Elastic Cloud** is for administrative access to our Elastic stack. The login screen is available [here](https://cloud.elastic.co/) and access is through Google SSO. Team members should be (de)provisioned through an Access Request ([example](https://gitlab.com/gitlab-com/team-member-epics/access-requests/-/issues/28457)). If approved, the provisioner can add/remove members on the [membership page](https://cloud.elastic.co/account/members) with appropriate permissions to the instances they require access to.
1. **Grafana** is accessed through Okta. The login screen is available [here](https://dashboards.gitlab.net). Any GitLab team member can access Grafana. Provisioning and deprovisioning are handled through Okta.
## How we work
@@ -160,7 +160,6 @@ We unassign ourselves from issues we are not actively working on or planning to
The Observability team's [issue boards](https://gitlab.com/gitlab-com/gl-infra/observability/team/-/boards/) track the progress of ongoing work.
We use [issue boards](https://gitlab.com/gitlab-com/gl-infra/observability/team/-/boards/) to track the progress of planned and ongoing work.
Refer to the Scalability group [issue boards section](/handbook/engineering/infrastructure/team/scalability/#issue-boards) for more details.
| **Planning** | **Building**|
|--------------|-------------|
@@ -22,32 +22,6 @@ title: "Scalability Group"
1. [Scalability Issues by Team](https://gitlab.com/groups/gitlab-com/gl-infra/-/boards/5797977?label_name[]=group%3A%3Ascalability)
1. [Scalability Issues by Team Member](https://gitlab.com/groups/gitlab-com/gl-infra/-/boards/5798021?label_name[]=group%3A%3Ascalability)
## Teams
The Scalability group is currently formed of two teams:
* [Scalability:Observability](observability/) and
* [Scalability:Practices](practices/).
{{< team-by-manager-slug "rachel-nienaber" >}}
### Scalability:Observability
The [Observability team](observability/) focuses on observability, forecasting & projection systems that enable development engineering to predict
system growth for their areas of responsibility.
The following people are members of the [Scalability:Observability team](observability/):
{{< team-by-manager-slug "liam-m" >}}
### Scalability:Practices
The [Practices team](practices/) focuses on tools and frameworks that enable the stage groups to support their features on our production systems.
The following people are members of the [Scalability:Practices team](practices/):
{{< team-by-manager-slug "kwanyangu" >}}
## Mission
The **Scalability group** is responsible for GitLab at scale, working on the highest priority scaling items related to our SaaS platforms.