
Handle project create/update/delete actions with bulk-incremental indexer

Summary

In !24298 (comment 291135616), we added a bulk-incremental indexer for most Elasticsearch-indexable resources in GitLab. However, projects are excluded from its scope: on create, update, and delete, we continue to schedule ElasticIndexerWorker jobs instead.

This is because ElasticIndexerWorker does considerable additional work for projects on creation and deletion (update is fine as-is, but separating it out was too much work).
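
For illustration, a minimal sketch of the split described above. This is an assumption about how the wiring looks, not a verified excerpt; the callback shape, `es_id`, and the worker argument order are all illustrative:

```ruby
# Illustrative sketch only: projects remain on the per-record worker path.
class Project < ApplicationRecord
  after_commit on: [:create, :update] do
    # Projects are still excluded from the bulk-incremental path
    ElasticIndexerWorker.perform_async(:index, self.class.to_s, id, es_id)
  end

  after_commit on: :destroy do
    # Delete also goes through the per-record worker, which performs the
    # extra cleanup described under "Delete" below
    ElasticIndexerWorker.perform_async(:delete, self.class.to_s, id, es_id)
  end
end
```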

Improvements

Create

We need to refactor these operations so that they are handled correctly by the bulk-incremental indexer.

On creation, the main missing piece appears to be that the project wiki is not correctly indexed - although this seems to be a problem for the ElasticIndexerWorker approach too: #207491 (closed)

We should refactor ElasticIndexerWorker so there is no reference to ElasticCommitIndexerWorker in it.

ElasticIndexerWorker also schedules an initial bulk import for each of the project's indexable associations. This already seems to be handled correctly by the bulk-incremental indexer (as the importer creates each issue, its on-create callbacks run and schedule an indexing operation), but this should be verified.
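
For reference, the on-create path for indexable associations presumably looks something like the sketch below. The concern, service, and settings names (`ApplicationVersionedSearch`, `Elastic::ProcessBookkeepingService`, `elasticsearch_indexing?`) are assumptions used for illustration:

```ruby
# Sketch of the on-create path for indexable associations (issues, notes,
# merge requests, ...); names are assumptions, not a verified excerpt.
module Elastic
  module ApplicationVersionedSearch
    extend ActiveSupport::Concern

    included do
      # Runs for every record the importer creates, so each record is queued
      # for the bulk-incremental indexer without a per-record worker job.
      after_commit on: :create do
        ::Elastic::ProcessBookkeepingService.track!(self) if Gitlab::CurrentSettings.elasticsearch_indexing?
      end
    end
  end
end
```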

IndexRecordService is also used for backfill, so we can't remove this code entirely - but we need to ensure that when a project is imported while Elasticsearch is turned on, everything in the project is indexed correctly without an ElasticIndexerWorker being scheduled.

Update

No action needed

Delete

When a project is deleted, we rely on database foreign key constraints to remove many records, so Elasticsearch hooks are not fired. Additionally, commit and blob documents are guaranteed to be left behind without further action.

To handle this, ElasticIndexerWorker runs a "delete by query" command - it also manually deletes the IndexStatus row, although I'm not sure that's necessary.

The bulk-incremental indexer doesn't handle this at all right now, and would leave behind orphaned records.

I think the best approach here might be to create a dedicated "erase project from Elasticsearch" worker and schedule it separately at project-delete time, rather than trying to fit this special-casing into the bulk-incremental flow.
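
A rough sketch of what such a worker could look like, assuming a delete-by-query over the project's documents. The class name, index helper calls, field names, and query shape are illustrative only and would need to match the real mappings:

```ruby
# Sketch of a dedicated deletion worker; names and query shape are assumptions.
class ElasticProjectEraseWorker
  include ApplicationWorker

  def perform(project_id, es_id)
    # Remove the project document and every child document (issues, notes,
    # commits, blobs, ...) that routes to it, since no database hook will
    # clean these up after the project row is gone.
    Project.__elasticsearch__.client.delete_by_query(
      index: Project.__elasticsearch__.index_name,
      routing: es_id,
      body: {
        query: {
          bool: {
            should: [
              { term: { _id: es_id } },
              { term: { project_id: project_id } }
            ]
          }
        }
      }
    )

    # Whether the IndexStatus row still needs to be deleted here is an open
    # question (see above); it may already be removed by a foreign key cascade.
  end
end
```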

Risks

Involved components

Optional: Missing test coverage

We lack tests that import a project with Elasticsearch enabled and ensure that all records that should be searchable are searchable. This would be a great time to add them.
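
Something along these lines, as a hedged RSpec outline. The `:elastic` tag, `ensure_elasticsearch_index!`, `elastic_search`, and factory traits are assumptions, and creating records directly here only stands in for a real importer run, which the actual test should drive:

```ruby
# Hedged outline of the missing import coverage; helper names are assumptions.
RSpec.describe 'project import with Elasticsearch enabled', :elastic do
  it 'makes every imported record searchable' do
    project = create(:project, :repository, :wiki_repo)
    create(:issue, project: project, title: 'findable imported issue')

    ensure_elasticsearch_index! # flush pending bulk-incremental operations

    results = Issue.elastic_search('findable imported issue', options: { project_ids: [project.id] })
    expect(results.total_count).to eq(1)
  end
end
```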

We also lack tests that run a ProjectDestroyWorker for a fully-indexed project and ensure that all the documents for that project (but no other documents) are removed.
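
A hedged outline for that coverage as well; the worker argument shape and the search helpers are assumptions:

```ruby
# Hedged outline of the missing destroy coverage; helper names are assumptions.
RSpec.describe ProjectDestroyWorker, :elastic do
  it 'removes the project documents and leaves other projects untouched' do
    project = create(:project, :repository, name: 'doomed-project')
    other = create(:project, :repository, name: 'surviving-project')
    ensure_elasticsearch_index!

    described_class.new.perform(project.id, project.owner.id, {})
    ensure_elasticsearch_index!

    expect(Project.elastic_search('doomed-project').total_count).to eq(0)
    expect(Project.elastic_search('surviving-project').total_count).to eq(1)
  end
end
```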
