
Save details of placeholder contributions to table during import

About

In #443554 (closed) we are introducing a single table to contain all details of placeholder user contributions.

We will record 1 row of data per imported record that is associated with a user, for every importer.

Problem

Our table design has the benefit of being simple, but writing each row to the table individually is naïve and is likely to cause scalability problems under load.

Failures to write this data to PostgreSQL will lead to a kind of data loss for customers.

We should err on the side of "over-engineering".

We can use exclusive lease locks, but these don't guarantee that data is written; they only protect the resource. Under high contention the lock raises a FailedToObtainLockError, which can lead to data loss if the worker continues to fail after retrying. An example is web_hook_logs, which is written to once per webhook that is triggered. Because of table contention problems #352245 (closed) we added an exclusive lease lock !80976 (merged). We would then sometimes experience periods of many FailedToObtainLockError errors #352245 (comment 856625412). In the case of web_hook_logs we realised that many writes that led to FailedToObtainLockError could be discarded !81770 (merged). But when saving user contributions, we need every row to be written.
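For illustration, a minimal Ruby sketch of the locked-write pattern described above, assuming GitLab's Gitlab::ExclusiveLeaseHelpers concern. The PlaceholderContributionWriter class, the lock key, the ImportSourceUserPlaceholderReference model name, and the timings are all hypothetical:

```ruby
# Sketch only: serialises writers with an exclusive lease, as described above.
class PlaceholderContributionWriter
  include Gitlab::ExclusiveLeaseHelpers

  LOCK_TTL = 10.seconds

  def save!(import_source_user, contribution_attributes)
    # The lock only protects the resource; it does not guarantee the row is
    # ever written. Under high contention `in_lock` raises
    # Gitlab::ExclusiveLeaseHelpers::FailedToObtainLockError, and if the worker
    # exhausts its retries the contribution is lost.
    in_lock("import/placeholder_contributions/#{import_source_user.id}",
            ttl: LOCK_TTL, retries: 5, sleep_sec: 1) do
      ImportSourceUserPlaceholderReference.create!(contribution_attributes)
    end
  end
end
```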

Proposal?

sequenceDiagram
    participant Importer
    participant Redis
    participant Worker
    participant PostgreSQL
    Importer->>+Redis: Sends each contribution {.... }
    Importer->>+Worker: Queues after each stage
    Worker->>Redis: Pops batches of records
    Worker->>PostgreSQL: Loads batches of records to table
    Importer->>+Redis: Waits until empty set to finish import
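For illustration, a rough Ruby sketch of the flow in the diagram, assuming one Redis list per import accessed through Gitlab::Redis::SharedState, and Redis >= 6.2 (for RPOP with a count). The module, key names, payload shape, batch size, and the ImportSourceUserPlaceholderReference model are assumptions:

```ruby
# Sketch only: buffer contributions in Redis, then bulk-load into PostgreSQL.
module PlaceholderContributions
  BATCH_SIZE = 1_000
  CACHE_TTL = 1.day

  # Importer side: push each contribution to Redis instead of writing a row.
  def self.push(import_id, attributes)
    Gitlab::Redis::SharedState.with do |redis|
      key = "import/#{import_id}/placeholder_contributions"
      redis.lpush(key, Gitlab::Json.dump(attributes))
      redis.expire(key, CACHE_TTL) # data survives ~1 day for recovery/retries
    end
  end

  # Worker side: pop a batch and load it into PostgreSQL in a single insert.
  # Returns the number of rows loaded (0 when the list is drained).
  def self.load_batch(import_id)
    Gitlab::Redis::SharedState.with do |redis|
      key = "import/#{import_id}/placeholder_contributions"
      batch = redis.rpop(key, BATCH_SIZE) || []
      next 0 if batch.empty?

      rows = batch.map { |json| Gitlab::Json.parse(json) }
      # One multi-row INSERT instead of BATCH_SIZE individual inserts.
      ImportSourceUserPlaceholderReference.insert_all(rows)
      rows.size
    end
  end
end
```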

Benefits would be:

  • Writing to Redis is fast and cheap compared to writing to PostgreSQL.
  • Data in Redis gives us some recovery from problems during loading to PostgreSQL, as the data would remain available for up to 1 day after it was last touched, giving us time to retry loading it.
  • We can control contention by limiting the number of workers that process the Redis queue at a time (see the worker sketch after this list).
  • With a batch size of 1,000, we would make 1,000x fewer insert requests to PostgreSQL.
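One way to limit the number of concurrent workers could be Sidekiq job deduplication, as in the hypothetical sketch below, which reuses the PlaceholderContributions module from the earlier sketch. The class name and settings are assumptions, not a committed design:

```ruby
# Sketch only: at most one of these runs per import at a time because of
# deduplication, which caps contention on the PostgreSQL table.
class LoadPlaceholderContributionsWorker
  include ApplicationWorker

  feature_category :importers
  data_consistency :delayed
  idempotent!
  deduplicate :until_executed

  def perform(import_id)
    # Drain the Redis list for this import, one batch per iteration.
    loop do
      break if PlaceholderContributions.load_batch(import_id) == 0
    end
  end
end
```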

An alternative to "Queues after each stage" could be an idempotent cron worker that runs every minute and works through the full queue. Or perhaps, while the importer "Waits until empty set to finish import", it could also queue idempotent workers, for good measure, to help ensure everything gets processed.
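If the cron alternative were chosen, it might look something like the following sketch. The sweep worker and the pending-imports Redis set it reads are assumptions, as is the worker it enqueues:

```ruby
# Sketch only: an idempotent cron worker, scheduled every minute, that sweeps
# every import with pending contributions and enqueues a batch loader for each.
class SweepPlaceholderContributionsWorker
  include ApplicationWorker
  include CronjobQueue

  feature_category :importers
  idempotent!

  def perform
    Gitlab::Redis::SharedState.with do |redis|
      # Assumes importers also register their id in a set of pending imports.
      redis.smembers('import/placeholder_contributions/pending').each do |import_id|
        LoadPlaceholderContributionsWorker.perform_async(import_id)
      end
    end
  end
end
```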
