Upsert missing or inactive services records for projects with available Prometheus application installed on shared k8s cluster
Queries
Plans are retrieved from staging.gitlab.com
due to lack of access to production:
/app/lib/chatops/database/read_only_connection.rb:42:in `async_exec': ERROR: cannot execute SELECT in a read-only transaction (PG::ReadOnlySqlTransaction)
https://ops.gitlab.net/gitlab-com/chatops/-/jobs/938561
Create missing records for GitLab instances, with k8s cluster shared on group level
WITH created_records AS (
INSERT INTO services (project_id, "active","properties","type","template","push_events","issues_events","merge_requests_events","tag_push_events","note_events","category","default","wiki_page_events","pipeline_events","confidential_issues_events","commit_events","job_events","confidential_note_events", created_at, updated_at)
SELECT "projects"."id", true,'{}','PrometheusService',false,true,true,true,true,true,'monitoring',false,true,true,true,true,true,true, TIMEZONE('UTC', NOW()) as created_at, TIMEZONE('UTC', NOW()) as updated_at FROM "projects" LEFT JOIN services ON services.project_id = projects.id AND services.project_id BETWEEN 1 AND 10000
AND services.type = 'PrometheusService' INNER JOIN cluster_groups ON cluster_groups.group_id = projects.namespace_id INNER JOIN clusters_applications_prometheus ON clusters_applications_prometheus.cluster_id = cluster_groups.cluster_id
AND clusters_applications_prometheus.status IN (3, 5) WHERE (projects.id BETWEEN 1 AND 10000 AND services.id IS NULL)
RETURNING *
)
SELECT COUNT(*) as number_of_created_records
FROM created_records
plan: https://explain.depesz.com/s/5pLM Summary:
Time: 4.637 s
- planning: 4.576 ms
- execution: 4.632 s
- I/O read: 4.550 s
- I/O write: 0.000 ms
Shared buffers:
- hits: 1075 (~8.40 MiB) from the buffer pool
- reads: 3987 (~31.10 MiB) from the OS file cache, including disk I/O
- dirtied: 50 (~400.00 KiB)
- writes: 0
with index:
plan: https://explain.depesz.com/s/WgOq
Summary:
Time: 4.306 s
- planning: 5.356 ms
- execution: 4.301 s
- I/O read: 4.222 s
- I/O write: 0.000 s
Shared buffers:
- hits: 11524 (~90.00 MiB) from the buffer pool
- reads: 3989 (~31.20 MiB) from the OS file cache, including disk I/O
- dirtied: 47 (~376.00 KiB)
- writes: 0
Update Prometheus services records to active for GitLab instances, with k8s cluster shared on group level
WITH updated_records AS (
UPDATE services SET active = TRUE
WHERE services.project_id BETWEEN 1 AND 10000 AND services.properties = '{}' AND services.type = 'PrometheusService'
AND services.active = FALSE AND EXISTS (
SELECT 1 FROM "projects" INNER JOIN cluster_groups ON cluster_groups.group_id = projects.namespace_id INNER JOIN
clusters_applications_prometheus ON clusters_applications_prometheus.cluster_id = cluster_groups.cluster_id
AND clusters_applications_prometheus.status IN (3, 5) WHERE (projects.id BETWEEN 1 AND 10000)
)
RETURNING *
)
SELECT COUNT(*) as number_of_updated_records
FROM updated_records
plan: https://explain.depesz.com/s/eiLvf Summary:
Time: 5.090 s
- planning: 3.820 ms
- execution: 5.086 s
- I/O read: 4.970 s
- I/O write: 0.000 ms
Shared buffers:
- hits: 11524 (~90.00 MiB) from the buffer pool
- reads: 3987 (~31.10 MiB) from the OS file cache, including disk I/O
- dirtied: 47 (~376.00 KiB)
- writes: 0
with index:
plan: https://explain.depesz.com/s/lEcG
Summary:
Time: 28.259 ms
- planning: 5.396 ms
- execution: 22.863 ms
- I/O read: 0.000 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 15509 (~121.20 MiB) from the buffer pool
- reads: 0 from the OS file cache, including disk I/O
- dirtied: 0
- writes: 0
- writes: 0
Check if instance level cluster is available for given GitLab instance
SELECT 1 FROM "clusters" INNER JOIN clusters_applications_prometheus ON clusters_applications_prometheus.cluster_id = clusters.id AND clusters_applications_prometheus.status IN (3, 5) WHERE "clusters"."cluster_type" = 1
plan: https://explain.depesz.com/s/fee8 Summary:
Time: 362.272 ms
- planning: 0.341 ms
- execution: 361.931 ms
- I/O read: 335.734 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 7303 (~57.10 MiB) from the buffer pool
- reads: 904 (~7.10 MiB) from the OS file cache, including disk I/O
- dirtied: 20 (~160.00 KiB)
- writes: 0
with index:
plan: https://explain.depesz.com/s/BJot
Summary
Time: 635.776 ms
- planning: 0.298 ms
- execution: 635.478 ms
- I/O read: 613.851 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 8577 (~67.00 MiB) from the buffer pool
- reads: 908 (~7.10 MiB) from the OS file cache, including disk I/O
- dirtied: 18 (~144.00 KiB)
- writes: 0
Create missing records for GitLab instances, with k8s cluster shared on instance level
WITH created_records AS (
INSERT INTO services (project_id, "active","properties","type","template","push_events","issues_events","merge_requests_events","tag_push_events","note_events","category","default","wiki_page_events","pipeline_events","confidential_issues_events","commit_events","job_events","confidential_note_events", created_at, updated_at)
SELECT "projects"."id", true,'{}','PrometheusService',false,true,true,true,true,true,'monitoring',false,true,true,true,true,true,true, TIMEZONE('UTC', NOW()) as created_at, TIMEZONE('UTC', NOW()) as updated_at FROM "projects" LEFT JOIN services ON services.project_id = projects.id AND services.project_id BETWEEN 1 AND 10000
AND services.type = 'PrometheusService' WHERE (projects.id BETWEEN 1 AND 10000 AND services.id IS NULL)
RETURNING *
)
SELECT COUNT(*) as number_of_created_records
FROM created_records
plan: https://explain.depesz.com/s/dSUz Summary:
Time: 5.195 s
- planning: 1.602 ms
- execution: 5.193 s
- I/O read: 369.044 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 102002 (~796.90 MiB) from the buffer pool
- reads: 607 (~4.70 MiB) from the OS file cache, including disk I/O
- dirtied: 304 (~2.40 MiB)
- writes: 0
whit index:
plan: https://explain.depesz.com/s/GPqU
Summary
Time: 4.697 s
- planning: 0.985 ms
- execution: 4.696 s
- I/O read: 378.316 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 117088 (~914.80 MiB) from the buffer pool
- reads: 1509 (~11.80 MiB) from the OS file cache, including disk I/O
- dirtied: 289 (~2.30 MiB)
- writes: 0
Update Prometheus services records to active for GitLab instances, with k8s cluster shared on instance level
WITH updated_records AS (
UPDATE services SET active = TRUE
WHERE services.project_id BETWEEN 1 AND 10000 AND services.properties = '{}' AND services.type = 'PrometheusService'
AND services.active = FALSE AND 1 = 1
RETURNING *
)
SELECT COUNT(*) as number_of_updated_records
FROM updated_records
plan: https://explain.depesz.com/s/e5wf Summary:
Time: 13.651 ms
- planning: 0.364 ms
- execution: 13.287 ms
- I/O read: 8.689 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 3 (~24.00 KiB) from the buffer pool
- reads: 295 (~2.30 MiB) from the OS file cache, including disk I/O
- dirtied: 0
- writes: 0
with index:
plan: https://explain.depesz.com/s/xUke Summary
Time: 1.304 ms
- planning: 0.442 ms
- execution: 0.862 ms
- I/O read: 0.624 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 0 from the buffer pool
- reads: 2 (~16.00 KiB) from the OS file cache, including disk I/O
- dirtied: 0
- writes: 0
gitlab.com
Database State:
gitlab.com database: There is 6_496_601 rows in namespaces table
/chatops run explain select id from namespaces
EXPLAIN output
Index Only Scan using namespaces_pkey on namespaces (cost=0.43..124451.56 rows=6491743 width=4)
(actual time=0.077..1708.589 rows=6496601 loops=1)
There is 3_543_406 rows in services table
/chatops run explain select 1 from services
EXPLAIN output
Index Only Scan using index_services_on_project_id on services (cost=0.43..130977.60 rows=3543406 width=4)
(actual time=0.251..4525.741 rows=3549176 loops=1)
There is 17_435 rows in services table of type PrometheusService
/chatops run explain select 1 from services where type = ‘PrometheusService’
EXPLAIN output
Index Only Scan using index_services_on_type on services (cost=0.56..1091.62 rows=16961 width=4)
(actual time=2.877..156.291 rows=17435 loops=1)
There is 12_341_591 rows in projects table
/chatops run explain select id from projects
EXPLAIN output
Index Only Scan using projects_pkey on projects (cost=0.43..300956.89 rows=12342061 width=4)
(actual time=0.100..6313.379 rows=12341591 loops=1)
From previous approach !19956 (merged)
Affected rows on gitlab.com 5621 (updated: 22 and inserted: 5599)
What does this MR do?
This is another approach to fix state of services relation. Due to a bug, projects which was connected to group or instance level shared Kubernetes clusters, hadn't have created corresponding services entries when Prometheus application was installed on this clusters.
Previous attempts: !19956 (merged) and !22974 (merged) was reverted due to number of performance issues. First of all there is important difference between gitlab.com and staging.gitlab.com. Staging has instance level shared cluster, which means that almost all of projects there are affected. While gitlab.com has only group level shared clusters, which gives roughly around 6000 affected rows. Due to this difference approaches tailored for gitlab.com, flooded sidekiq with excessive number of background jobs https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9160 Another thing to remember is that background job has minimum lease time of 2 minutes, so work should be scheduled in bigger chunks in order to avoid idle waiting for minimum lease to expire (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9064#note_279857215).
This implementation is inspired by !19956 (comment 287626170) however since there is no uniq constraint on services table except for primary key, using ON CONFLICT
was not possible. We may still decide to add such constraint, but I tried to find a way to avoid that, as it is not a cheap operation.
Screenshots
Does this MR meet the acceptance criteria?
Conformity
-
Changelog entry -
Documentation (if required) -
Code review guidelines -
Merge request performance guidelines -
Style guides -
Database guides -
Separation of EE specific content
Availability and Testing
-
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process. -
Tested in all supported browsers -
Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
-
Label as security and @ mention @gitlab-com/gl-security/appsec
-
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods -
Security reports checked/validated by a reviewer from the AppSec team
Reports #14857 (closed)