Skip to content

GitLab Next

  • Projects
  • Groups
  • Snippets
  • Help
    • Loading...
  • Help
    • Help
    • Support
    • Community forum
    • Submit feedback
    • Contribute to GitLab
  • Sign in / Register
GitLab
GitLab
  • Project overview
    • Project overview
    • Details
    • Activity
    • Releases
  • Repository
    • Repository
    • Files
    • Commits
    • Branches
    • Tags
    • Contributors
    • Graph
    • Compare
    • Locked Files
  • Issues 34,895
    • Issues 34,895
    • List
    • Boards
    • Labels
    • Service Desk
    • Milestones
    • Iterations
  • Merge Requests 1,229
    • Merge Requests 1,229
  • Requirements
    • Requirements
    • List
  • CI / CD
    • CI / CD
    • Pipelines
    • Jobs
    • Schedules
    • Test Cases
  • Security & Compliance
    • Security & Compliance
    • Dependency List
    • License Compliance
  • Operations
    • Operations
    • Metrics
    • Incidents
    • Environments
  • Packages & Registries
    • Packages & Registries
    • Container Registry
  • Analytics
    • Analytics
    • CI / CD
    • Code Review
    • Insights
    • Issue
    • Repository
    • Value Stream
  • Snippets
    • Snippets
  • Members
    • Members
  • Collapse sidebar
  • Activity
  • Graph
  • Create a new issue
  • Jobs
  • Commits
  • Issue Boards
  • GitLab.org
  • GitLabGitLab
  • Issues
  • #4489

Closed
Open
Opened Jan 06, 2018 by Stan Hu@stanhuOwner

Geo: Using PostgreSQL FDW appears to increase cancelled statements

Summary and next steps

Work remaining on this issue: Add troubleshooting docs about the error and tell people how to configure hot_standby_feedback. See comment here: #4489 (comment 243363712)

If we have long running queries on the secondary we are likely going to encounter the error described in the issue. In order to avoid this, one can set hot_standby_feedback on but this may impact the primary adversely in some scenarios.

We are already implementing tuned queries to avoid this issue. To resolve and close this issue here we should also add troubleshooting docs about the error and tell users how to configure hot_standby_feedback to resolve the issue.

Customers can then also inform us of these issues and we can continue and tune the relevant queries.


Since we enabled PostgreSQL FDW on the Geo testbed, we have started to see quite a number of errors, such as in https://sentry.gitlap.com/gitlab/geo1/issues/117142/ :

PG::TRSerializationFailure: ERROR: canceling statement due to conflict with recovery DETAIL: User query might have needed to see row versions that must be removed. CONTEXT: Remote SQL command: SELECT id FROM public.uploads : SELECT COUNT(*) FROM "gitlab_secondary"."uploads" INNER JOIN file_registry ON file_registry.file_id = gitlab_secondary.uploads.id WHERE "file_registry"."file_type" IN ('attachment', 'avatar', 'file', 'namespace_file', 'personal_file') AND "file_registry"."success" = 't'

There are a number of good resources that explain why this happens:

  • https://stackoverflow.com/questions/14592436/postgresql-error-canceling-statement-due-to-conflict-with-recovery
  • http://www.postgresql.org/docs/current/static/hot-standby.html#HOT-STANDBY-CONFLICT

We may want to consider enabling hot_standby_feedback:

Remedial possibilities exist if the number of standby-query cancellations is found to be unacceptable. The first option is to set the parameter hot_standby_feedback, which prevents VACUUM from removing recently-dead rows and so cleanup conflicts do not occur. If you do this, you should note that this will delay cleanup of dead rows on the primary, which may result in undesirable table bloat. However, the cleanup situation will be no worse than if the standby queries were running directly on the primary server, and you are still getting the benefit of off-loading execution onto the standby.

Customer encounters

https://gitlab.zendesk.com/agent/tickets/131543 @ 2019-09-24

2019-09-24_10:18:12.28799 ERROR: canceling statement due to conflict with recovery 
2019-09-24_10:18:12.28800 DETAIL: User query might have needed to see row versions that must be removed. 
2019-09-24_10:18:12.28800 CONTEXT: Remote SQL command: SELECT lfs_object_id, project_id FROM public.lfs_objects_projects 
2019-09-24_10:18:12.28800 STATEMENT: SELECT "gitlab_secondary"."lfs_objects"."id" FROM "gitlab_secondary"."lfs_objects" INNER JOIN "gitlab_secondary"."lfs_objects_projects" ON "gitlab_secondary"."lfs_objects_projects"."lfs_object_id" = "gitlab_secondary"."lfs_objects"."id" INNER JOIN "gitlab_secondary"."projects" ON "gitlab_secondary"."projects"."id" = "gitlab_secondary"."lfs_objects_projects"."project_id" LEFT OUTER JOIN "file_registry" ON "gitlab_secondary"."lfs_objects"."id" = "file_registry"."file_id" WHERE "gitlab_secondary"."projects"."id" IN (SELECT "gitlab_secondary"."projects"."id" FROM "gitlab_secondary"."projects" WHERE "gitlab_secondary"."projects"."namespace_id" IN (WITH RECURSIVE "base_and_descendants" AS (SELECT "gitlab_secondary"."geo_node_namespace_links"."namespace_id" AS id FROM "gitlab_secondary"."geo_node_namespace_links" WHERE "gitlab_secondary"."geo_node_namespace_links"."geo_node_id" = 9 
2019-09-24_10:18:12.28801 UNION 
2019-09-24_10:18:12.28801 SELECT "gitlab_secondary"."namespaces"."id" FROM "gitlab_secondary"."namespaces", "base_and_descendants" WHERE "gitlab_secondary"."namespaces"."parent_id" = "base_and_descendants"."id") SELECT "id" FROM "base_and_descendants" AS "gitlab_secondary_namespaces")) AND ("gitlab_secondary"."lfs_objects"."file_store" = 1 OR "gitlab_secondary"."lfs_objects"."file_store" IS NULL) AND "file_registry"."id" IS NULL AND 1=1 LIMIT 1000

Error happened immediately after executing the query.

Clarification by Database team

From #10286 (comment 221826697):

There's a trade-off between hot_standby_feedback and max_standby_streaming_delay: hot_standby_feedback basically makes the upstream/primary aware of running queries on the standby and causes it to refrain from VACUUM activity that would otherwise cause these queries to get cancelled on the standby ("conflict with recovery"). Enabling it and running long queries on the standby increases the likelihood of bloat (because it cannot clean up eagerly).

Now max_standby_streaming_delay on the other hand allows queries on the standby to actually complete (until this timeout setting is reached) even if this means delaying the recovery process. In practice, this means that the replication lag is likely to increase on the standby, because recovery is being held up by ongoing queries. The expected maximum replication lag is roughly the max_standby_..._delay setting.

That said, for the main GitLab.com cluster, we have hot_standby_feedback enabled.

It's likely the better choice over max_standby_streaming_delay, too, as it is harder to reason about replication delay and to get the timeout right. I think enabling hot_standby_feedback should solve the trouble with conflicting queries. Keep in mind though I'm not at all a Geo expert, so the above is just general PostgreSQL.

Putting it in extremes: enabling hot_standby_feedback without running any queries on the standby isn't going to have any impact on the upstream. So the effect of this can only be seen with queries running on the standby and causing the primary to defer vacuum activity. The overall impact of this depends on frequency of queries on the standby and their duration (and even what data they consume). The more queries run, the longer they take and the more data they consume, the bigger the impact on the primary.

Edited Dec 18, 2019 by Rachel Nienaber
Assignee
Assign to
12.7
Milestone
12.7 (Past due)
Assign milestone
Time tracking
None
Due date
None
Reference: gitlab-org/gitlab#4489