Update database docs with a caution about the LIMIT-only, loop-based batching strategy
Summary
While reviewing this MR, it was pointed out that the docs currently recommend a loop-based batching strategy. This strategy works well for a relatively small number of rows (typically a few thousand), but performance degrades in later loops because each iteration re-scans index entries that still point to dead tuples and must visibility-check them. We should update the documentation to include this information.
To better understand why, see the discussion below:
@lma-git: Hi team! Question on batching over a table using LIMIT-only loops. I noticed that we have documented this loop-based batching strategy: "The strategy leverages the fact that after updating or deleting records in the database, the exact same query will return different records." In practice it does work, but I'm not sure if it should be recommended for performance reasons. 🤔 My thinking is that subsequent loops will inevitably slow down because of Postgres' behaviour with index scans and visibility maps.

Say in the first loop we delete 1000 records. The index isn't updated right away because we're not doing a vacuum/reindex between query executions. In the second loop, Postgres has to scan the index again, including the entries that still point to the now-deleted 1000 rows. When it encounters those entries, it has to do a visibility check on the underlying rows (which requires heap lookups). Over time, with each subsequent loop, it has to check an increasing number of dead tuples.
❓ Is this process a concern? Or am I missing something (e.g. caching) that PostgreSQL does that would make this LIMIT-only batching efficient?
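For reference, here's a minimal sketch of the LIMIT-only loop pattern being discussed. The table, column, and batch size are hypothetical, and since PostgreSQL's `DELETE` has no `LIMIT` clause, the batch is selected in a subquery:

```sql
-- Hypothetical schema: delete processed rows from an "events" table in
-- batches of 1,000. The same statement is re-run until it affects 0 rows;
-- it "advances" only because previously matched rows are now deleted.
DELETE FROM events
WHERE id IN (
    SELECT id
    FROM events
    WHERE processed = true
    LIMIT 1000
);
-- Each re-run scans the index from the start. Entries pointing to rows
-- deleted by earlier batches (dead tuples) still have to be visited and
-- visibility-checked via heap lookups until VACUUM cleans them up, so
-- later iterations do progressively more work.
```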
@NikolayS: This is a good concern, it makes total sense. To mitigate, we need to run VACUUM from time to time.
Or move away from the stateless approach to a stateful one: remember the boundary of the last processed batch (via RETURNING) and use it in the next batch via WHERE. In other words, do the same thing we do in keyset pagination. In this case, we don't care about dead tuples at all.
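A sketch of that stateful variant, using the same hypothetical schema as above: the RETURNING clause yields the ids of the batch just deleted, and the next batch starts strictly after the largest one, so the index scan never revisits ranges that now contain dead tuples.

```sql
-- Batch 1: delete the first 1,000 matching rows in id order and remember
-- the highest id returned (say it was 100042).
DELETE FROM events
WHERE id IN (
    SELECT id
    FROM events
    WHERE processed = true
    ORDER BY id
    LIMIT 1000
)
RETURNING id;

-- Batch 2: resume after the remembered boundary, keyset-pagination style.
-- Dead tuples left behind by batch 1 all have id <= 100042, so this scan
-- skips them entirely.
DELETE FROM events
WHERE id IN (
    SELECT id
    FROM events
    WHERE processed = true
      AND id > 100042  -- boundary carried over from the previous batch
    ORDER BY id
    LIMIT 1000
)
RETURNING id;
```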
@krasio: Yes, these are legit concerns. This approach works well when we don't expect to delete that many records and the operation doesn't run for too long.
@lma-git: Actually... 🤔 what's considered "that many" records, @Kras? Could the number be low enough that we may as well not batch over it at all, since it would only take a few loops to get through?
@krasio: As usual, it depends. If we have to pick a number, I'd say a few thousand. We definitely don't want to update/delete that many rows with a single query, and using the loop-based approach won't bloat the index enough to cause negative side effects.