Commit c1e4baf7 authored by Andrew Newdigate, committed by John Jarvis

Update all runbooks to point at dashboards.gitlab.net

parent c7e8753c
......@@ -10,7 +10,7 @@ groups:
annotations:
description: ProcessCommitWorker sidekiq jobs are piling up for the last 10
minutes, this may be under control, but I'm just letting you know that this
is going on, check http://performance.gitlab.net/dashboard/db/sidekiq-stats.
is going on, check http://dashboards.gitlab.net/dashboard/db/sidekiq-stats.
Note that this alert is only for ProcessCommitWorker
runbook: troubleshooting/large-sidekiq-queue.md
title: 'Large amount of ProcessCommitWorker queued jobs: {{$value}}'
......@@ -14,7 +14,7 @@ groups:
description: "Hey <!subteam^S940BK2TV|cicdops>! The number of pending builds for projects with shared runners is
increasing and will be too high in 1h ({{$value}}). This may suggest problems
with auto-scaling provider or Runner stability. You should check Runner's
logs. Check http://performance.gitlab.net/dashboard/db/ci."
logs. Check http://dashboards.gitlab.net/dashboard/db/ci."
- alert: CICDTooManyPendingJobsPerNamespace
expr: max(ci_pending_builds{has_minutes="yes",namespace!="",namespace!="9970",shared_runners="yes"}) by (namespace) > 350
......@@ -25,7 +25,7 @@ groups:
annotations:
title: 'Number of pending jobs per namespace too high: {{$value}}'
description: 'Hey <!subteam^S940BK2TV|cicdops>! Number of pending jobs for namespace {{$labels.namespace}} is too high: {{$value}}.
Check https://performance.gitlab.net/dashboard/db/ci?panelId=33&fullscreen'
Check https://dashboards.gitlab.net/dashboard/db/ci?panelId=33&fullscreen'
runbook: troubleshooting/ci_pending_builds.md#2-verify-graphs-and-potential-outcomes-out-of-the-graphs-as-described-in-ci-graphsci_graphsmd
- alert: CICDTooManyRunningJobsPerNamespaceOnSharedRunners
......@@ -37,7 +37,7 @@ groups:
annotations:
title: 'Number of running jobs per namespace too high: {{$value}}'
description: 'Hey <!subteam^S940BK2TV|cicdops>! Number of running jobs for namespace {{$labels.namespace}} running on regular Shared Runners is too high: {{$value}}.
Check https://performance.gitlab.net/dashboard/db/ci?panelId=60&fullscreen'
Check https://dashboards.gitlab.net/dashboard/db/ci?panelId=60&fullscreen'
runbook: troubleshooting/ci_pending_builds.md#2-verify-graphs-and-potential-outcomes-out-of-the-graphs-as-described-in-ci-graphsci_graphsmd
- alert: CICDTooManyRunningJobsPerNamespaceOnSharedRunnersGitLabOrg
......@@ -49,7 +49,7 @@ groups:
annotations:
title: 'Number of running jobs per namespace too high: {{$value}}'
description: 'Hey <!subteam^S940BK2TV|cicdops>! Number of running jobs for namespace {{$labels.namespace}} running on gitlab-org Shared Runners is too high: {{$value}}.
Check https://performance.gitlab.net/dashboard/db/ci?panelId=60&fullscreen'
Check https://dashboards.gitlab.net/dashboard/db/ci?panelId=60&fullscreen'
runbook: troubleshooting/ci_pending_builds.md#2-verify-graphs-and-potential-outcomes-out-of-the-graphs-as-described-in-ci-graphsci_graphsmd
- alert: CICDNoJobsOnSharedRunners
......@@ -62,7 +62,7 @@ groups:
title: 'Number of builds running on shared runners is too low: {{$value}}'
description: "Hey <!subteam^S940BK2TV|cicdops>! Number of builds running on shared runners for the last 5 minutes
is 0. This may suggest problems with auto-scaling provider or Runner stability.
You should check Runner's logs. Check http://performance.gitlab.net/dashboard/db/ci."
You should check Runner's logs. Check http://dashboards.gitlab.net/dashboard/db/ci."
- alert: CICDTooManyJobsOnSharedRunners
expr: sum(gitlab_runner_jobs{job="shared-runners"}) > 500
......@@ -74,7 +74,7 @@ groups:
title: Number of jobs running on shared runners is over 500 for the last 15
minutes
description: 'Hey <!subteam^S940BK2TV|cicdops>! This may suggest problems with our autoscaled machines fleet OR
abusive usage of Runners. Check https://performance.gitlab.net/dashboard/db/ci
abusive usage of Runners. Check https://dashboards.gitlab.net/dashboard/db/ci
and https://log.gitlap.com/app/kibana#/dashboard/5d3921f0-79e0-11e7-a8e2-f91bfad41e34'
- alert: CICDRunnersManagerDown
......@@ -101,7 +101,7 @@ groups:
title: 'Machine creation rate for runners is too high: {{$value | printf "%.2f" }}'
description: 'Hey <!subteam^S940BK2TV|cicdops>! Machine creation rate for the last 1 minute is at least {{$value}}
times greater than the machines idle rate. This may be a symptom of problems with
the auto-scaling provider. Check http://performance.gitlab.net/dashboard/db/ci.'
the auto-scaling provider. Check http://dashboards.gitlab.net/dashboard/db/ci.'
runbook: troubleshooting/ci_graphs.md#runners-manager-auto-scaling
- alert: CICDRunnersCacheDown
......@@ -153,7 +153,7 @@ groups:
annotations:
title: Number of used file descriptors on {{ $labels.instance }} is too high
description: 'Hey <!subteam^S940BK2TV|cicdops>! {{ $labels.instance }} is using more than 80% of available FDs
for 10 minutes. This may affect Runner''s stability. Please look at https://performance.gitlab.net/dashboard/db/ci
for 10 minutes. This may affect Runner''s stability. Please look at https://dashboards.gitlab.net/dashboard/db/ci
for more data.'
- alert: CICDDegradatedCIConsulPrometheusCluster
......@@ -180,7 +180,7 @@ groups:
description: |
Hey <!subteam^S940BK2TV|cicdops>! Quota usage of {{ $labels.quota }} is at the level of {{ $value }} for more than 5 minutes.
Quota limit breach is coming!
See https://performance.gitlab.net/dashboard/db/ci-autoscaling-providers
See https://dashboards.gitlab.net/dashboard/db/ci-autoscaling-providers
- alert: CICDGCPQuotaCriticalUsage
expr: |
......@@ -196,7 +196,7 @@ groups:
description: |
Hey <!subteam^S940BK2TV|cicdops>! Quota usage of {{ $labels.quota }} is at the level of {{ $value }} for more than 5 minutes.
There is less than 5% left before reaching the quota limit!
See https://performance.gitlab.net/dashboard/db/ci-autoscaling-providers
See https://dashboards.gitlab.net/dashboard/db/ci-autoscaling-providers
- alert: CICDNamespaceWithConstantNumberOfLongRunningRepeatedJobs
expr: |
......
......@@ -201,8 +201,8 @@ groups:
query 'sum(pg_slow_queries_total)' }} slow queries, {{ query
'sum(pg_blocked_queries_total)' }} blocked queries, and {{ query
'sum(pg_locks_count{datname=\'gitlabhq_production\'})' }} locks.
Check http://performance.gitlab.net/dashboard/db/postgres-stats and
http://performance.gitlab.net/dashboard/db/postgres-queries to get
Check http://dashboards.gitlab.net/dashboard/db/postgres-stats and
http://dashboards.gitlab.net/dashboard/db/postgres-queries to get
more data."
runbook: "troubleshooting/postgres.md#load"
title: 'High load in database {{ $labels.fqdn }}: {{$value}}'
......
......@@ -8,7 +8,7 @@ groups:
severity: critical
annotations:
description: Sidekiq jobs are piling up for the last minute, this may be under
control, but I'm just letting you know that this is going on, check http://performance.gitlab.net/dashboard/db/sidekiq-stats.
control, but I'm just letting you know that this is going on, check http://dashboards.gitlab.net/dashboard/db/sidekiq-stats.
Note that ProcessCommitWorker is excluded from job count.
runbook: troubleshooting/large-sidekiq-queue.md
title: 'Large amount of Sidekiq Queued jobs: {{$value}}'
......@@ -29,7 +29,7 @@ groups:
severity: critical
annotations:
description: There have been over 2000 queued Sidekiq jobs for the last hour.
Check http://performance.gitlab.net/dashboard/db/sidekiq-stats. Note that
Check http://dashboards.gitlab.net/dashboard/db/sidekiq-stats. Note that
ProcessCommitWorker is excluded from job count.
runbook: troubleshooting/large-sidekiq-queue.md
title: 'Large amount of Sidekiq Queued jobs: {{$value}}'
......@@ -101,7 +101,7 @@
"attachment_template": {
"title": "Project: {{key}}",
"title_link": "https://log.gitlab.net/app/kibana#/visualize/create?type=line&indexPattern=AWM6iM-v1NBBQZg_CdXQ&_g=(filters:!(('$state':(store:globalState),meta:(alias:!n,disabled:!f,index:AWM6iM-v1NBBQZg_CdXQ,key:json.grpc.request.repoPath.keyword,negate:!f,type:phrase,value:Jayugioh%2FJaCards177x254.git),query:(match:(json.grpc.request.repoPath.keyword:(query:{{key}},type:phrase)))),('$state':(store:globalState),exists:(field:json.grpc.code),meta:(alias:!n,disabled:!f,index:AWM6iM-v1NBBQZg_CdXQ,key:json.grpc.code,negate:!f,type:exists,value:exists))),refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-4h,mode:quick,to:now))&_a=(filters:!(),linked:!f,query:(match_all:()),uiState:(),vis:(aggs:!((enabled:!t,id:'1',params:(),schema:metric,type:count),(enabled:!t,id:'2',params:(customInterval:'2h',extended_bounds:(),field:'@timestamp',interval:auto,min_doc_count:1),schema:segment,type:date_histogram),(enabled:!t,id:'3',params:(field:json.grpc.time_ms),schema:metric,type:sum)),listeners:(),params:(addLegend:!t,addTimeMarker:!f,addTooltip:!t,categoryAxes:!((id:CategoryAxis-1,labels:(show:!t,truncate:100),position:bottom,scale:(type:linear),show:!t,style:(),title:(text:'@timestamp+per+5+minutes'),type:category)),grid:(categoryLines:!f,style:(color:%23eee)),legendPosition:right,seriesParams:!((data:(id:'1',label:Count),drawLinesBetweenPoints:!t,mode:normal,show:true,showCircles:!t,type:line,valueAxis:ValueAxis-1),(data:(id:'3',label:'Sum+of+json.grpc.time_ms'),drawLinesBetweenPoints:!t,mode:normal,show:!t,showCircles:!t,type:line,valueAxis:ValueAxis-1)),times:!(),type:line,valueAxes:!((id:ValueAxis-1,labels:(filter:!f,rotate:0,show:!t,truncate:100),name:LeftAxis-1,position:left,scale:(mode:normal,type:linear),show:!t,style:(),title:(text:Count),type:value))),title:'New+Visualization',type:line))",
"text": "File Server: <https://performance.gitlab.net/d/000000204/gitaly-nfs-metrics-per-host?orgId=1&var-fqdn={{fqdn}}&from=now-1h&to=now|{{fqdn}}>\nAverage Gitaly Wall time: {{wall_time_ms_per_second}}ms/second\nAverage rate: invocations per second {{invocation_rate_per_second}}ops/sec"
"text": "File Server: <https://dashboards.gitlab.net/d/000000204/gitaly-nfs-metrics-per-host?orgId=1&var-fqdn={{fqdn}}&from=now-1h&to=now|{{fqdn}}>\nAverage Gitaly Wall time: {{wall_time_ms_per_second}}ms/second\nAverage rate: invocations per second {{invocation_rate_per_second}}ops/sec"
}
}
}
......
......@@ -14,7 +14,7 @@ First check [the on-call log](https://docs.google.com/document/d/1nWDqjzBwzYecn9
Start by checking how many alerts are in flight right now. To do this (a command-line sketch follows this list):
- go to the [fleet overview dashboard](https://performance.gitlab.net/dashboard/db/fleet-overview) and check the number of Active Alerts, it should be 0. If it is not 0
- go to the [fleet overview dashboard](https://dashboards.gitlab.net/dashboard/db/fleet-overview) and check the number of Active Alerts, it should be 0. If it is not 0
- go to the alerts dashboard and check what is [being triggered](https://prometheus.gitlab.com/alerts); each alert here should point you to the right runbook to fix it.
- if they don't, you have more work to do.
- be sure to create an issue, particularly to declare toil so we can work on it and suppress it.
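If Grafana is slow or unreachable, a rough way to get the same information from the command line is to hit the Prometheus HTTP API directly. This is only a sketch, not part of the official procedure; it assumes the API at prometheus.gitlab.com is reachable from your machine and exposes the standard `/api/v1/alerts` endpoint.

```shell
# Count currently firing alerts straight from the Prometheus API (assumed reachable).
curl -s https://prometheus.gitlab.com/api/v1/alerts \
  | jq '[.data.alerts[] | select(.state == "firing")] | length'

# List alert names and severities so they can be matched against runbooks.
curl -s https://prometheus.gitlab.com/api/v1/alerts \
  | jq -r '.data.alerts[] | "\(.labels.alertname)\t\(.labels.severity // "unknown")"' \
  | sort | uniq -c | sort -rn
```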
......@@ -27,7 +27,7 @@ Go to your chef repo and run `knife status`, if you see hosts that are red it me
Check how many targets are not scraped at the moment. To do this (a query sketch follows this list):
- go to the [fleet overview dashboard](https://performance.gitlab.net/dashboard/db/fleet-overview) and check the number of Targets down. It should be 0. If it is not 0
- go to the [fleet overview dashboard](https://dashboards.gitlab.net/dashboard/db/fleet-overview) and check the number of Targets down. It should be 0. If it is not 0
- go to the [targets down list](https://prometheus.gitlab.com/consoles/up.html) and check which targets are down.
- try to figure out why there are scraping problems and try to fix them. Note that sometimes there can be temporary scraping problems because of exporter errors.
- be sure to create an issue, particularly to declare toil so we can work on it and suppress it.
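As with the alert count, down targets can be cross-checked from a shell when the console page is unavailable. This is a hedged sketch that assumes the Prometheus query API is reachable on the same host as above.

```shell
# List scrape targets that are currently down (up == 0), with their job and instance labels.
curl -sG https://prometheus.gitlab.com/api/v1/query \
  --data-urlencode 'query=up == 0' \
  | jq -r '.data.result[].metric | "\(.job)\t\(.instance)"' \
  | sort
```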
......@@ -227,9 +227,9 @@ To upgrade runners on managers you need to:
$ knife ssh -aipaddress 'roles:gitlab-runner-prm' -- gitlab-runner --version
```
You can also check the [uptime](https://performance.gitlab.net/dashboard/db/ci?refresh=5m&orgId=1&panelId=18&fullscreen)
and [version](https://performance.gitlab.net/dashboard/db/ci?refresh=5m&orgId=1&panelId=12&fullscreen) on
CI dashboard at https://performance.gitlab.net/. Notice that the version table shows versions existing for last 1
You can also check the [uptime](https://dashboards.gitlab.net/dashboard/db/ci?refresh=5m&orgId=1&panelId=18&fullscreen)
and [version](https://dashboards.gitlab.net/dashboard/db/ci?refresh=5m&orgId=1&panelId=12&fullscreen) on
CI dashboard at https://dashboards.gitlab.net/. Notice that the version table shows versions existing for last 1
minute, so if you check it immediately after upgrading the Runner you may see it twice - with the old and the new version.
After a minute the old entry should disappear.
......
......@@ -100,7 +100,7 @@ groups:
annotations:
description: 'Gitaly has been using more than 50% of total available CPU on
{{$labels.fqdn}} for the past minute. This may affect the stability of the
NFS server. Visit this dashboard: https://performance.gitlab.net/dashboard/db/gitaly-nfs-metrics-per-host?refresh=30s&orgId=1&var-fqdn={{$labels.fqdn}}&from=now-1h&to=now'
NFS server. Visit this dashboard: https://dashboards.gitlab.net/dashboard/db/gitaly-nfs-metrics-per-host?refresh=30s&orgId=1&var-fqdn={{$labels.fqdn}}&from=now-1h&to=now'
runbook: troubleshooting/gitaly-high-cpu.md
title: 'Gitaly: High CPU usage on {{ $labels.fqdn }}'
- alert: GitalyVersionMismatch
......@@ -113,7 +113,7 @@ groups:
annotations:
description: During a deployment, two distinct versions of Gitaly may be running
alongside one another, but this should not be the case for more than 30m.
Visit https://performance.gitlab.net/dashboard/db/gitaly-version-tracker?orgId=1&var-environment=prd
Visit https://dashboards.gitlab.net/dashboard/db/gitaly-version-tracker?orgId=1&var-environment=prd
for details of versions deployed across the fleet.
runbook: troubleshooting/gitaly-version-mismatch.md
title: 'Gitaly: two versions of Gitaly have been running alongside one another
......@@ -128,7 +128,7 @@ groups:
annotations:
description: Three or more versions of Gitaly are currently running alongside
one another in production. This should never occur and indicates serious deployment
failures. Visit https://performance.gitlab.net/dashboard/db/gitaly-version-tracker?orgId=1&var-environment=prd
failures. Visit https://dashboards.gitlab.net/dashboard/db/gitaly-version-tracker?orgId=1&var-environment=prd
for details of versions deployed across the fleet.
runbook: troubleshooting/gitaly-version-mismatch.md
title: 'Gitaly: multiple versions of Gitaly are currently running in production'
......@@ -148,7 +148,7 @@ groups:
description: >
The {{$labels.grpc_code}} error rate on {{ $labels.grpc_method }} is outside normal
values over a 12 hour period (95% confidence).
dashboard: "https://performance.gitlab.net/dashboard/db/gitaly-feature-status?var-method={{ $labels.grpc_method }}&var-environment=prd"
dashboard: "https://dashboards.gitlab.net/dashboard/db/gitaly-feature-status?var-method={{ $labels.grpc_method }}&var-environment=prd"
runbook: troubleshooting/gitaly-error-rate.md
title: 'Gitaly: Error rate on {{ $labels.grpc_method }} is unusually high compared with a 12 hour average'
- alert: GitalyLatencyOutlier
......@@ -162,7 +162,7 @@ groups:
severity: warn
annotations:
description: The error rate on the {{ $labels.grpc_method }} endpoint is outside
normal values over a 12 hour period (95% confidence). Check https://performance.gitlab.net/dashboard/db/gitaly-feature-status?from=now-1h&to=now&orgId=1&var-method={{
normal values over a 12 hour period (95% confidence). Check https://dashboards.gitlab.net/dashboard/db/gitaly-feature-status?from=now-1h&to=now&orgId=1&var-method={{
$labels.grpc_method }}&var-tier=stor&var-type=gitaly&var-environment=prd&refresh=5m
runbook: troubleshooting/gitaly-error-rate.md
title: 'Gitaly: Latency on the Gitaly {{ $labels.grpc_method }} is unusually
......
......@@ -9,9 +9,9 @@
## Check dashboards
* Check the [timings dashboard](https://performance.gitlab.net/dashboard/db/gitlab-com-git-timings) to
* Check the [timings dashboard](https://dashboards.gitlab.net/dashboard/db/gitlab-com-git-timings) to
see if the problem is specific to particular nfs shard or is the same across all storage nodes.
* Check the [host dashboard](https://performance.gitlab.net/dashboard/db/host-stats) if there appears to
* Check the [host dashboard](https://dashboards.gitlab.net/dashboard/db/host-stats) if there appears to
be a problem on a specific storage node.
## Verify the blackbox exporter is working properly
......
## CI graphs
When you go to https://performance.gitlab.net/dashboard/db/ci you will see a number of graphs.
When you go to https://dashboards.gitlab.net/dashboard/db/ci you will see a number of graphs.
This document tries to explain what you see and what each of the values indicates.
......
......@@ -25,7 +25,7 @@ To understand what can be wrong, you need to find a cause.
3. Verify long polling behavior (we are not aware of any problems with it as of now),
4. Verify workhorse queueing: [Workhorse queueing graphs](ci_graphs.md#workhorse-queueing).
If you see a large number of requests ending up in the queue it may indicate that the CI API is degraded.
Verify the performance of `builds/register` endpoint: https://performance.gitlab.net/dashboard/db/grape-endpoints?var-action=Grape%23POST%20%2Fbuilds%2Fregister&var-database=Production,
Verify the performance of `builds/register` endpoint: https://dashboards.gitlab.net/dashboard/db/grape-endpoints?var-action=Grape%23POST%20%2Fbuilds%2Fregister&var-database=Production,
5. Verify runners uptime. If runners uptime is varying, it most likely indicates that the Runners Manager is dying because of a crash. This will show up in the runners manager logs: `grep panic /var/log/messages`.
## 3. Verify if we have [the high DO Token Rate Limit usage](ci_runner_manager_do_limits.md)
......
......@@ -60,6 +60,6 @@ stage graph].
For jobs that were in the `archive_cache` or `restore_cache` stage at that moment, the current cache operation may be interrupted and may fail, but this should not fail the whole job (it will just be slower if the cache was not restored).
[cache server connections graph]:https://performance.gitlab.net/dashboard/db/ci?orgId=1&refresh=5m&from=now-24h&to=now&panelId=56&fullscreen
[jobs by runner's stage graph]:https://performance.gitlab.net/dashboard/db/ci?refresh=5m&orgId=1&from=now-24h&to=now&panelId=6&fullscreen
[cache server connections graph]:https://dashboards.gitlab.net/dashboard/db/ci?orgId=1&refresh=5m&from=now-24h&to=now&panelId=56&fullscreen
[jobs by runner's stage graph]:https://dashboards.gitlab.net/dashboard/db/ci?refresh=5m&orgId=1&from=now-24h&to=now&panelId=6&fullscreen
......@@ -37,4 +37,4 @@ sudo service elasticsearch restart
* There are only `log-es(2|3|4).gitlap.com` nodes.
[ELK performance dashboard]: https://performance.gitlab.net/dashboard/db/elk-stats?orgId=1
[ELK performance dashboard]: https://dashboards.gitlab.net/dashboard/db/elk-stats?orgId=1
......@@ -21,7 +21,7 @@ First, check out if the host you're working on is one of the following:
### Well known hosts
#### performance.gitlab.net
#### dashboards.gitlab.net
This alert triggers on `/var/lib/influxdb/data`, and `influxdb` is likely to be the culprit. Apparently there is a file handle leak somewhere and this happens regularly.
......
......@@ -17,7 +17,7 @@
- Check [Sentry](https://sentry.gitlap.com/gitlab/gitaly-production/) for unusual errors
- Check [Kibana](https://log.gitlap.com/goto/5347dee91b984026567bfa48f30c38fb) for increased error rates
- Check the Gitaly service logs on the affected host (a quick sketch follows this list)
- Check [Grafana dashboards](https://performance.gitlab.net/dashboard/db/gitaly-nfs-metrics-per-host?orgId=1) to check for a cause of this outage
- Check [Grafana dashboards](https://dashboards.gitlab.net/dashboard/db/gitaly-nfs-metrics-per-host?orgId=1) to check for a cause of this outage
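For the log check in the list above, a minimal sketch on an Omnibus-managed node, assuming standard `gitlab-ctl` service management and the default Gitaly log location:

```shell
# Confirm the Gitaly service state under gitlab-ctl.
sudo gitlab-ctl status gitaly

# Follow the Gitaly log for errors around the time of the alert.
sudo gitlab-ctl tail gitaly

# ...or read the raw log file directly.
sudo less /var/log/gitlab/gitaly/current
```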
## 3. Ensure that the Gitaly server process is running
......
......@@ -10,7 +10,7 @@
## 1. Ensure that the same version of Gitaly is running across the entire fleet
- Visit the **[Gitaly Version Tracker grafana dashboard](https://performance.gitlab.net/dashboard/db/gitaly-version-tracker?orgId=1)**.
- Visit the **[Gitaly Version Tracker grafana dashboard](https://dashboards.gitlab.net/dashboard/db/gitaly-version-tracker?orgId=1)**.
- Ensure that the entire fleet is running the **same major and minor versions** of Gitaly. The build time tag on the version should be ignored until [gitlab-org/gitaly#388](https://gitlab.com/gitlab-org/gitaly/issues/388) is resolved.
- The only time that the fleet should be running mixed versions of Gitaly is during the deployment process
- During a deploy, it is important that the storage tier (NFS servers) are upgraded **before** the front-end tier
......@@ -19,7 +19,7 @@
## 2. Identify the problematic instance
- Go to https://performance.gitlab.net/dashboard/db/gitaly?panelId=2&fullscreen and
- Go to https://dashboards.gitlab.net/dashboard/db/gitaly?panelId=2&fullscreen and
identify the instance with a high error rate.
- ssh into that instance and check the log for its Gitaly server for post-mortem:
......@@ -29,7 +29,7 @@ sudo less /var/log/gitlab/gitaly/current
## 3. Disable the Gitaly operation causing trouble
- Go to https://performance.gitlab.net/dashboard/db/gitaly-features?orgId=1 and identify the feature with a high error rate.
- Go to https://dashboards.gitlab.net/dashboard/db/gitaly-features?orgId=1 and identify the feature with a high error rate.
- Disable the relevant feature flag by running `!feature-set <flag_name> false`
on Slack's #production channel. The mapping of flag names to gRPC calls is as follows:
......
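If ChatOps is unavailable, a hedged fallback is to toggle the flag from a node with the Rails environment instead; `gitaly_some_feature` below is a placeholder for the actual flag name taken from the mapping referenced above.

```shell
# Disable a Gitaly feature flag via the Rails runner (assumes shell access to a node with gitlab-rails).
sudo gitlab-rails runner 'Feature.disable(:gitaly_some_feature)'

# Verify the flag is now off.
sudo gitlab-rails runner 'puts Feature.enabled?(:gitaly_some_feature)'
```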
......@@ -11,7 +11,7 @@
## Possible checks
- Open the [**Gitaly NFS Metrics per Host** dashboard](https://performance.gitlab.net/dashboard/db/gitaly-nfs-metrics-per-host?refresh=30s&orgId=1&var-fqdn=nfs-file-08.stor.gitlab.com&from=now-1h&to=now) making sure to select the correct host,
- Open the [**Gitaly NFS Metrics per Host** dashboard](https://dashboards.gitlab.net/dashboard/db/gitaly-nfs-metrics-per-host?refresh=30s&orgId=1&var-fqdn=nfs-file-08.stor.gitlab.com&from=now-1h&to=now) making sure to select the correct host,
and check the metrics
- Log into the NFS server through a shell
- Use `uptime` and `iotop` to check the current load values on the box
......
......@@ -14,7 +14,7 @@ This runbook will be deprecated in favor of the [gitaly pprof runbook](https://g
## 1. Check the triage dashboard to assess the impact
- Visit the **[Triage Dashboard](https://performance.gitlab.net/dashboard/db/triage-overview)**.
- Visit the **[Triage Dashboard](https://dashboards.gitlab.net/dashboard/db/triage-overview)**.
- Check the **Gitaly p95 latency** graph and identify the offending server or servers.
- Check if there has been any impact on the **p95 Latency per Type** graph
- Check the `web` latency to assess whether we are impacting site performance and users (a query sketch follows this list).
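If Grafana itself is degraded, the p95 latency can be approximated with a direct PromQL query. This sketch assumes Gitaly exposes the standard go-grpc-prometheus handling-time histogram (`grpc_server_handling_seconds_bucket`) and that the scrape job is labelled `gitaly`; adjust the metric and label names if those assumptions do not hold.

```shell
# Approximate Gitaly p95 latency per host over the last 5 minutes (assumed metric and label names).
curl -sG https://prometheus.gitlab.com/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le, fqdn) (rate(grpc_server_handling_seconds_bucket{job="gitaly"}[5m])))' \
  | jq -r '.data.result[] | "\(.metric.fqdn)\t\(.value[1])s"'
```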
......
......@@ -12,7 +12,7 @@
Several versions of Gitaly are running in production concurrently.
Visit the [Gitaly Version Tracker](https://performance.gitlab.net/dashboard/db/gitaly-version-tracker?orgId=1&var-environment=prd)
Visit the [Gitaly Version Tracker](https://dashboards.gitlab.net/dashboard/db/gitaly-version-tracker?orgId=1&var-environment=prd)
dashboard to find out which versions are running on each host.
If a deployment is currently being carried out there may be two versions running alongside
......
## Steps to check
1. Check the network: try to open [GitLab.com](https://gitlab.com). If it is OK from your side, then it may be only a network failure (see the sketch after this list).
1. Check the [fleet overview](http://performance.gitlab.net/dashboard/db/fleet-overview).
1. Check the [database load](http://performance.gitlab.net/dashboard/db/postgres-stats).
1. Check the [fleet overview](http://dashboards.gitlab.net/dashboard/db/fleet-overview).
1. Check the [database load](http://dashboards.gitlab.net/dashboard/db/postgres-stats).
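A trivial first check from your own machine, before digging into dashboards (just a sketch probing the public site):

```shell
# Quick reachability and latency probe of GitLab.com from your workstation.
curl -sS -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' https://gitlab.com/
```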
......@@ -9,6 +9,6 @@
* Message in prometheus-alerts _Increased Error Rate Across Fleet_
## Troubleshoot
- Check the [triage overview](https://performance.gitlab.net/dashboard/db/triage-overview) dashboard for 5xx errors by backend.
- Check the [triage overview](https://dashboards.gitlab.net/dashboard/db/triage-overview) dashboard for 5xx errors by backend.
- Check [Sentry](https://sentry.gitlap.com/gitlab/gitlabcom/) for new 500 errors.
- If the problem persists send a channel wide notification in `#development`.
......@@ -14,7 +14,7 @@ queue to the next not actually doing any job at all.
## Symptoms
Open the [Sidekiq dashboard](http://performance.gitlab.net/dashboard/db/sidekiq-stats)
Open the [Sidekiq dashboard](http://dashboards.gitlab.net/dashboard/db/sidekiq-stats)
and check the Sidekiq Queue Size gauge. If it is over 5k it should be red, which
means that we should at least be keeping an eye on it.
In particular, take a look at Sidekiq Enqueued Jobs to spot a trend; if the trend
......
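When the dashboard itself is empty or suspect, the same queue numbers can be pulled from Sidekiq directly. A sketch, assuming shell access to a node where `gitlab-rails` is available:

```shell
# Print overall Sidekiq counters using the Sidekiq API inside the GitLab Rails environment.
sudo gitlab-rails runner 'stats = Sidekiq::Stats.new; puts "enqueued=#{stats.enqueued} retries=#{stats.retry_size} scheduled=#{stats.scheduled_size}"'

# Per-queue sizes, largest first.
sudo gitlab-rails runner 'Sidekiq::Queue.all.sort_by(&:size).reverse.each { |q| puts "#{q.name}: #{q.size}" }'
```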
......@@ -34,11 +34,11 @@
## Dashboards
* https://performance.gitlab.net/dashboard/db/postgres-stats
* https://dashboards.gitlab.net/dashboard/db/postgres-stats
* https://performance.gitlab.net/dashboard/db/postgres-tuple-statistics
* https://dashboards.gitlab.net/dashboard/db/postgres-tuple-statistics
* https://performance.gitlab.net/dashboard/db/postgres-queries
* https://dashboards.gitlab.net/dashboard/db/postgres-queries
## Availability
......
......@@ -34,11 +34,11 @@ Replication lag indicates that the Redis secondaries are struggling to keep up w
Replication lag is measured in bytes in the replication stream.
https://performance.gitlab.net/dashboard/db/andrew-redis?panelId=13&fullscreen&orgId=1
https://dashboards.gitlab.net/dashboard/db/andrew-redis?panelId=13&fullscreen&orgId=1
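The raw numbers behind that panel can also be read straight from Redis. A sketch assuming direct access to the Redis primary and replicas; the host names and password variable are placeholders:

```shell
# On the primary: master_repl_offset is the tip of the replication stream.
redis-cli -h REDIS_PRIMARY_HOST -a "$REDIS_PASSWORD" INFO replication | grep master_repl_offset

# On a replica: slave_repl_offset lags behind the primary's offset; the difference is the lag in bytes.
redis-cli -h REDIS_REPLICA_HOST -a "$REDIS_PASSWORD" INFO replication | grep -E 'slave_repl_offset|master_link_status'
```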
#### Redis Replication Events
* Check the Redis Replication Events dashboard to see if Redis is frequently failing over. This may indicate replication issues. https://performance.gitlab.net/dashboard/db/andrew-redis?panelId=14&fullscreen&orgId=1
* Check the Redis Replication Events dashboard to see if Redis is frequently failing over. This may indicate replication issues. https://dashboards.gitlab.net/dashboard/db/andrew-redis?panelId=14&fullscreen&orgId=1
### Redis Sentinel
......
Sidekiq stats are collected by [gitlab-monitor](https://gitlab.com/gitlab-org/gitlab-monitor/blob/fdad76bdff3698111744c4bfbc129c57d99355b7/lib/gitlab_monitor/sidekiq.rb) by talking to Redis, and scraped by Prometheus.
If you see no stats in the [Sidekiq dashboard](http://performance.gitlab.net/dashboard/db/sidekiq-stats) then something could be wrong with these three components.
If you see no stats in the [Sidekiq dashboard](http://dashboards.gitlab.net/dashboard/db/sidekiq-stats) then something could be wrong with these three components.
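To rule out the exporter side, you can hit gitlab-monitor's metrics endpoint on the host where it runs. This is a sketch that assumes the default exporter port of 9168; verify the port in the exporter's configuration if that assumption is wrong.

```shell
# Check that the Sidekiq probe of gitlab-monitor is producing metrics at all.
curl -s http://localhost:9168/metrics | grep -i '^sidekiq' | head
```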
## Symptoms
......
......@@ -15,7 +15,7 @@
* increase of git http processes
* ![Sample of high count of git http processes](../img/high-http-git-processes.png)
* Relevant Dashboard where to find all these graphs
* http://performance.gitlab.net/dashboard/db/fleet-overview
* http://dashboards.gitlab.net/dashboard/db/fleet-overview
## Possible checks
......