Geo: Using PostgreSQL FDW appears to increase cancelled statements
Summary and next steps
Work remaining on this issue: Add troubleshooting docs about the error and tell people how to configure hot_standby_feedback
. See comment here: #4489 (comment 243363712)
If we have long running queries on the secondary we are likely going to encounter the error described in the issue. In order to avoid this, one can set hot_standby_feedback on
but this may impact the primary adversely in some scenarios.
We are already implementing tuned queries to avoid this issue. To resolve and close this issue here we should also
add troubleshooting docs about the error and tell users how to configure hot_standby_feedback
to resolve the issue.
Customers can then also inform us of these issues and we can continue and tune the relevant queries.
Since we enabled PostgreSQL FDW on the Geo testbed, we have started to see quite a number of errors, such as in https://sentry.gitlap.com/gitlab/geo1/issues/117142/ :
PG::TRSerializationFailure: ERROR: canceling statement due to conflict with recovery DETAIL: User query might have needed to see row versions that must be removed. CONTEXT: Remote SQL command: SELECT id FROM public.uploads : SELECT COUNT(*) FROM "gitlab_secondary"."uploads" INNER JOIN file_registry ON file_registry.file_id = gitlab_secondary.uploads.id WHERE "file_registry"."file_type" IN ('attachment', 'avatar', 'file', 'namespace_file', 'personal_file') AND "file_registry"."success" = 't'
There are a number of good resources that explain why this happens:
- https://stackoverflow.com/questions/14592436/postgresql-error-canceling-statement-due-to-conflict-with-recovery
- http://www.postgresql.org/docs/current/static/hot-standby.html#HOT-STANDBY-CONFLICT
We may want to consider enabling hot_standby_feedback
:
Remedial possibilities exist if the number of standby-query cancellations is found to be unacceptable. The first option is to set the parameter hot_standby_feedback, which prevents VACUUM from removing recently-dead rows and so cleanup conflicts do not occur. If you do this, you should note that this will delay cleanup of dead rows on the primary, which may result in undesirable table bloat. However, the cleanup situation will be no worse than if the standby queries were running directly on the primary server, and you are still getting the benefit of off-loading execution onto the standby.
Customer encounters
https://gitlab.zendesk.com/agent/tickets/131543 @ 2019-09-24
2019-09-24_10:18:12.28799 ERROR: canceling statement due to conflict with recovery
2019-09-24_10:18:12.28800 DETAIL: User query might have needed to see row versions that must be removed.
2019-09-24_10:18:12.28800 CONTEXT: Remote SQL command: SELECT lfs_object_id, project_id FROM public.lfs_objects_projects
2019-09-24_10:18:12.28800 STATEMENT: SELECT "gitlab_secondary"."lfs_objects"."id" FROM "gitlab_secondary"."lfs_objects" INNER JOIN "gitlab_secondary"."lfs_objects_projects" ON "gitlab_secondary"."lfs_objects_projects"."lfs_object_id" = "gitlab_secondary"."lfs_objects"."id" INNER JOIN "gitlab_secondary"."projects" ON "gitlab_secondary"."projects"."id" = "gitlab_secondary"."lfs_objects_projects"."project_id" LEFT OUTER JOIN "file_registry" ON "gitlab_secondary"."lfs_objects"."id" = "file_registry"."file_id" WHERE "gitlab_secondary"."projects"."id" IN (SELECT "gitlab_secondary"."projects"."id" FROM "gitlab_secondary"."projects" WHERE "gitlab_secondary"."projects"."namespace_id" IN (WITH RECURSIVE "base_and_descendants" AS (SELECT "gitlab_secondary"."geo_node_namespace_links"."namespace_id" AS id FROM "gitlab_secondary"."geo_node_namespace_links" WHERE "gitlab_secondary"."geo_node_namespace_links"."geo_node_id" = 9
2019-09-24_10:18:12.28801 UNION
2019-09-24_10:18:12.28801 SELECT "gitlab_secondary"."namespaces"."id" FROM "gitlab_secondary"."namespaces", "base_and_descendants" WHERE "gitlab_secondary"."namespaces"."parent_id" = "base_and_descendants"."id") SELECT "id" FROM "base_and_descendants" AS "gitlab_secondary_namespaces")) AND ("gitlab_secondary"."lfs_objects"."file_store" = 1 OR "gitlab_secondary"."lfs_objects"."file_store" IS NULL) AND "file_registry"."id" IS NULL AND 1=1 LIMIT 1000
Error happened immediately after executing the query.
Clarification by Database team
From #10286 (comment 221826697):
There's a trade-off between
hot_standby_feedback
andmax_standby_streaming_delay
:hot_standby_feedback
basically makes the upstream/primary aware of running queries on the standby and causes it to refrain from VACUUM activity that would otherwise cause these queries to get cancelled on the standby ("conflict with recovery"). Enabling it and running long queries on the standby increases the likelihood of bloat (because it cannot clean up eagerly).Now
max_standby_streaming_delay
on the other hand allows queries on the standby to actually complete (until this timeout setting is reached) even if this means delaying the recovery process. In practice, this means that the replication lag is likely to increase on the standby, because recovery is being held up by ongoing queries. The expected maximum replication lag is roughly themax_standby_..._delay
setting.That said, for the main GitLab.com cluster, we have
hot_standby_feedback
enabled.It's likely the better choice over
max_standby_streaming_delay
, too, as it is harder to reason about replication delay and to get the timeout right. I think enablinghot_standby_feedback
should solve the trouble with conflicting queries. Keep in mind though I'm not at all a Geo expert, so the above is just general PostgreSQL.Putting it in extremes: enabling
hot_standby_feedback
without running any queries on the standby isn't going to have any impact on the upstream. So the effect of this can only be seen with queries running on the standby and causing the primary to defer vacuum activity. The overall impact of this depends on frequency of queries on the standby and their duration (and even what data they consume). The more queries run, the longer they take and the more data they consume, the bigger the impact on the primary.