Geo: Decide how to avoid hiccups while refreshing foreign schema

Problem to solve

Refreshing foreign schema while it is in use can cause hiccups. This problem is exacerbated by implementing automatic refresh in omnibus-gitlab#3474 (closed)

Intended users

Further details

Idea 1: Pause usage of individual foreign tables

  1. try to hold a lock on a single foreign table for X amount of time (1s or less, using something like statement timeouts?)
  2. if we can get the lock, drop and recreate that table
  3. if we can't, assign to a queue of tables that requires extra care

Idea 2: Surgical foreign schema update

From omnibus-gitlab#3474 (comment 212602755):

The FDW refresh code was done that way with a drop and reimport because it was the easiest way of doing it... perhaps we should take a step back and figure out how to refresh, when upgrading, only the tables with some change.

I can propose a simple algorithm (implementation will not be trivial, but the idea is):

  1. dump a list of tables on both sides, recreate first the missing ones
  2. loop over the previous list of tables, and see if anyone needs to be recreated (because they differ). drop the foreign table and use the import limiting to that single table.

https://www.postgresql.org/docs/10/sql-importforeignschema.html

We can also try to do use current algorithm first, with a sql statement timeout, if it fails we switch to the 'slow version'.

In this proposal, many foreign schema updates touch tables that don't matter to Geo, therefore they will not cause a hiccup, as opposed to dropping the entire schema.

Idea 3: Pause usage of tracking database

From omnibus-gitlab#3474 (comment 212602755):

We can also try to implement a semaphore for the use of the tracking database (perhaps on the log cursor loop), so we could "pause" the use of the database, to perform a FDW refresh, and unpause it at the end (use TTL and exclusive lease for that).

This gives us the ability to decide exactly what should happen during a refresh. Though this may be a significant interruption in usage, which isn't overly helpful by itself.

Permissions and Security

Documentation

Testing

What does success look like, and how can we measure that?

What is the type of buyer?

Links / references

Edited Sep 11, 2019 by Michael Kozono
Assignee Loading
Time tracking Loading