Enable `GITLAB_ENABLE_QUERY_ANALYZERS` and `ENABLE_CROSS_DATABASE_MODIFICATION_DETECTION` environment variables and 0.01% rollout of `detect_cross_database_modification` feature flag on Rails nodes
Production Change
Change Summary
~"group::sharding" has introduced tooling to detect "cross-database modifications (transactions)" in an effort to prepare for decomposing the CI database. This tooling will analyze a small percentage of queries to determine whether they violate constraints we expect to hold with decomposed databases. For each sampled request or job, the analyzers parse every SQL statement in Ruby and execute some logic, which may add CPU cost and latency to Rails requests or Sidekiq jobs. We therefore intend to enable this for only 0.01% of requests (or jobs), as sampling should be sufficient to detect problems while minimizing user impact.
NOTE: This new analyzer is expected to detect violations, send `CrossDatabaseModificationAcrossUnsupportedTablesError` exceptions to Sentry, and log them. The exceptions aren't actually raised in production, so there should be no user impact or 500s, but we should see some increase in Sentry exceptions, specifically for `CrossDatabaseModificationAcrossUnsupportedTablesError` only.
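The idea of detecting a cross-database modification and reporting it without raising can be sketched roughly as follows. This is a toy illustration, not the analyzer from the merge requests: the `TransactionTracker` class, the regex, and the small table-to-schema mapping are all assumptions made for the example, and the real tooling parses SQL properly rather than with a regex.

```ruby
require "set"

# Illustrative table-to-schema mapping (assumption for this sketch);
# the real mapping is maintained in GitLab and is much larger.
TABLE_SCHEMAS = {
  "projects" => :gitlab_main,
  "users" => :gitlab_main,
  "ci_builds" => :gitlab_ci,
  "ci_pipelines" => :gitlab_ci
}.freeze

# Naive pattern for the table touched by a write statement. A stand-in for
# real SQL parsing; it only handles simple INSERT/UPDATE/DELETE statements.
WRITE_PATTERN = /\A\s*(?:INSERT\s+INTO|UPDATE|DELETE\s+FROM)\s+"?(\w+)"?/i

# Hypothetical tracker for the statements of a single transaction.
class TransactionTracker
  attr_reader :reported

  def initialize
    @schemas = Set.new
    @reported = []
  end

  # Observe one SQL statement. If writes now span more than one schema,
  # record a report (a stand-in for Sentry plus logging) instead of raising.
  def observe(sql)
    table = sql[WRITE_PATTERN, 1]
    return unless table

    schema = TABLE_SCHEMAS[table]
    return unless schema

    @schemas << schema
    return if @schemas.size < 2

    @reported << "Cross-database modification across #{@schemas.to_a.sort.join(', ')}: #{sql}"
  end
end

tracker = TransactionTracker.new
tracker.observe('SELECT * FROM projects')                            # read, ignored
tracker.observe('UPDATE "projects" SET name = \'a\' WHERE id = 1')   # first schema
tracker.observe('INSERT INTO ci_builds (project_id) VALUES (1)')     # second schema
puts tracker.reported
```

Because violations are only recorded, the calling code continues normally, which mirrors the expectation above that users see no errors while Sentry sees new exceptions.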
We want to:
- Set the environment variable `GITLAB_ENABLE_QUERY_ANALYZERS=true`, which enables an ActiveRecord query subscriber introduced in gitlab-org/gitlab!73827 (merged)
- Set `ENABLE_CROSS_DATABASE_MODIFICATION_DETECTION=true`, which enables a QueryAnalyzer that detects these so-called "cross-database modifications", introduced in gitlab-org/gitlab!74177 (merged)
- Enable a 0.01% feature flag rollout for `query_analyzer_gitlab_schema_metrics` so that 1 in 10,000 requests or Sidekiq jobs will parse the SQL and detect cross-joins, introduced in gitlab-org/gitlab!73839 (merged)
- Enable a 0.01% feature flag rollout for `detect_cross_database_modification` so that 1 in 10,000 requests or Sidekiq jobs will parse the SQL and detect violations, introduced in gitlab-org/gitlab!74177 (merged)
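To make the 0.01% figure concrete, here is a minimal sketch of how a percentage-of-time gate behaves. It is an assumption-laden illustration, not the implementation behind `Feature.enable_percentage_of_time`: the `sampled?` helper is hypothetical, and it assumes the percentage is expressed in percent, so that 0.01 corresponds to roughly 1 in 10,000.

```ruby
# Hypothetical helper: decides whether a single request or job is sampled.
# `percentage` is in percent, so 0.01 means 0.01% (about 1 in 10,000).
def sampled?(percentage, rng: Random)
  rng.rand * 100 < percentage
end

# Simulate one million requests with a seeded generator so runs are repeatable.
rng = Random.new(42)
total = 1_000_000
hits = total.times.count { sampled?(0.01, rng: rng) }

puts "sampled #{hits} of #{total} (expected about #{(total * 0.01 / 100).round})"
```

At this rate only around 100 of every million requests or jobs pay the SQL-parsing cost, which is why the added CPU and latency should be hard to notice in aggregate.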
Change Details
- Services Impacted - rails-web rails-api rails-sidekiq
- Change Technician - DRI for the execution of this change
- Change Reviewer - @ayufan
- Time tracking - 60
- Downtime Component - None
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 60
- Set label ~"change::in-progress" on this issue
- Perform all the change steps on staging
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 120
- Make configuration changes to rails deployments to set the following environment variables
  - `GITLAB_ENABLE_QUERY_ANALYZERS=true`
  - `ENABLE_CROSS_DATABASE_MODIFICATION_DETECTION=true`
  - gitlab-com/gl-infra/k8s-workloads/gitlab-com!1390 (merged)
  - https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1049
- Deploy the changes to rails nodes
- Wait 1 hour when doing this on production. These environment variables should not have any noticeable impact, as they do very little until the feature flags are enabled
- Enable the feature flags `detect_cross_database_modification` and `query_analyzer_gitlab_schema_metrics` for 0.01% of the time (from the rails console):
  - `Feature.enable_percentage_of_time(:query_analyzer_gitlab_schema_metrics, 0.01)`
  - `Feature.enable_percentage_of_time(:detect_cross_database_modification, 0.01)`
- We will start observing a `gitlab_database_decomposition_gitlab_schemas_used` Prometheus metric indicating that queries are properly parsed
- We hope to start seeing new `CrossDatabaseModificationAcrossUnsupportedTablesError` exceptions being sent to Sentry as part of this change. These are not raised in production, just logged and sent to Sentry so we can analyze the data. It may take some time to see these, so we don't necessarily need to wait for this before closing this issue out.
  - None seen yet but still closing as we don't need to wait. Check https://sentry.gitlab.net/gitlab/gitlabcom/?query=is%3Aunresolved+CrossDatabaseModificationAcrossUnsupportedTablesError&sort=new later
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 0
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 10
- If we've enabled the feature flags and seen increased error rates or latency, then disable the feature flags:
  - `/chatops run feature remove query_analyzer_gitlab_schema_metrics`
  - `/chatops run feature remove detect_cross_database_modification`
- If we're seeing increased errors, latency, or other problems even without the feature flags enabled, then disable the environment variables by reverting the deployment that set them to true
Monitoring
Key metrics to observe
NOTE: This new analyzer is expected to detect violations, send `CrossDatabaseModificationAcrossUnsupportedTablesError` exceptions to Sentry, and log them. The exceptions aren't actually raised in production, so there should be no user impact or 500s, but we should see some increase in Sentry exceptions, specifically for `CrossDatabaseModificationAcrossUnsupportedTablesError` only. Seeing these `CrossDatabaseModificationAcrossUnsupportedTablesError` exceptions should not be cause for concern.
This change has a wide-reaching impact, as the analyzer can analyze SQL queries on any rails node, so any major issues would most likely show up in the overview dashboards.
- Metric: Observed SQL queries
- Location: https://thanos-query.ops.gitlab.net/graph?g0.expr=gitlab_database_decomposition_gitlab_schemas_used&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- What changes to this metric should prompt a rollback: the lack of metrics after 5 minutes
- Metric: Web overview
- Location: https://dashboards.gitlab.net/d/api-main/web-overview?orgId=1
- What changes to this metric should prompt a rollback: Error rates, latency, CPU saturation
- Metric: API overview
- Location: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1
- What changes to this metric should prompt a rollback: Error rates, latency, CPU saturation
- Metric: Web overview
- Location: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1
- What changes to this metric should prompt a rollback: Error rates, latency, CPU saturation
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgement.)
- There are currently no active incidents.