Proposal: use consistent hashing and assign a specific index to Advanced Search namespaces and projects
Problem
We've been rolling out search curation, which creates new indices based on our Elasticsearch sizing guidelines and "rolls over" the previous write index to be read-only. The problem with this approach is that we don't keep track of which index a document is stored in, so on every write operation we write to the write index and then send a delete request to every rolled-over index. There will potentially be hundreds of rolled-over indices, so the cost of our indexing operations will grow linearly over time as we grow.
A related problem is that when we roll over an index and create a new one, the new index is completely empty yet preallocates resources as if it were full. These resources are wasted until the new write index grows to be within a certain size.
Proposal
Use consistent hashing based on root_namespace.id and project.id (where applicable) to build an index naming convention that spreads the storage of documents across multiple indices. When a record that is going to be indexed in Advanced Search is created in the database, we would use our consistent hashing algorithm to assign a specific index to that record based on root_namespace.id or root_namespace.id + project.id.
If we used this hashing strategy, we would be able to scale the number of indices we have for a document type up or down and only have to reindex a small subset of data instead of the entire cluster.
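A minimal sketch of what the assignment could look like (hypothetical helper names; CRC32 stands in for whatever hash function we settle on, and 360 is the upper partition bound discussed below):

```ruby
require 'zlib'

# Fixed upper bound on partitions per dimension (360, per the proposal).
PARTITION_SPACE = 360

# Hash an id into a stable slot in the full partition space.
# CRC32 is a stand-in for whatever hash function we choose.
def partition_slot(id)
  Zlib.crc32(id.to_s) % PARTITION_SPACE
end

# Project the stable slot onto the number of partitions currently provisioned,
# so growing the partition count only moves a subset of slots.
def assigned_partition(id, active_partitions)
  (partition_slot(id) * active_partitions / PARTITION_SPACE) + 1
end

# Build the index name for a record from its root namespace and project ids.
def index_name(type, namespace_id, project_id, namespace_partitions: 1, project_partitions: 1)
  format('gitlab-production-%s-%03d-%03d',
         type,
         assigned_partition(namespace_id, namespace_partitions),
         assigned_partition(project_id, project_partitions))
end
```

With a single partition in each dimension, every record lands in gitlab-production-notes-001-001; raising the active partition count splits the slot range, so only records in the upper slots move.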
For example, for notes we would hash by both root namespace and project, which would look like gitlab-production-notes-{namespace_partition_number}-{project_partition_number}. At the beginning stages of rolling this out, we'd only have one index for notes: gitlab-production-notes-001-001. If that index becomes too large, we'd add another index by increasing the number of project subpartitions (gitlab-production-notes-001-002) and reindex the affected projects (approximately half) from gitlab-production-notes-001-001 to gitlab-production-notes-001-002. Scaling by namespace partition is possible as well: gitlab-production-notes-001-001 => gitlab-production-notes-002-001. Scaling down would also be possible, using the reverse process.
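Picking the reindex subset for the scale-up step above could be sketched as follows (self-contained, with hypothetical names; CRC32 stands in for the real hash function):

```ruby
require 'zlib'

PARTITION_SPACE = 360

# Stable slot for a project id in the full partition space.
def project_slot(project_id)
  Zlib.crc32(project_id.to_s) % PARTITION_SPACE
end

# Subpartition number (1-based) for a given number of active subpartitions.
def subpartition(project_id, active)
  (project_slot(project_id) * active / PARTITION_SPACE) + 1
end

# Projects whose documents must move when the subpartition count for this
# namespace partition grows from `from` to `to`.
def projects_to_reindex(project_ids, from:, to:)
  project_ids.select { |id| subpartition(id, from) != subpartition(id, to) }
end
```

Going from one subpartition to two selects roughly half the projects; the rest stay in gitlab-production-notes-001-001 untouched.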
Assuming we use an upper limit of 360 possible partitions (aligning with the 360 degrees in a circle), we'd have 360 * 360 = 129,600 open slots for creating indices.
Each of these index names would be an alias so we can continue to support zero-downtime reindexing. For example, gitlab-production-notes-001-001 would be an alias that points to an index with a timestamp suffix like gitlab-production-notes-001-001-20220213-1159.
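A sketch of how that alias layer could be maintained (illustrative only; the request body follows the shape of the Elasticsearch update-aliases API, and the helper names are hypothetical):

```ruby
# Build the name of a concrete, timestamped index behind a partition alias.
def timestamped_index(alias_name, created_at: Time.now.utc)
  "#{alias_name}-#{created_at.strftime('%Y%m%d-%H%M')}"
end

# Build the _aliases request body that atomically points the partition alias
# at a fresh concrete index; the alias can be detached from the old index in
# the same atomic call during zero-downtime reindexing.
def alias_swap_body(alias_name, new_index, old_index: nil)
  actions = [{ add: { index: new_index, alias: alias_name } }]
  actions << { remove: { index: old_index, alias: alias_name } } if old_index
  { actions: actions }
end
```

Searches and writes always address the stable alias name, never the timestamped concrete index.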
Benefits
- Hashing the root namespace id into the name would allow us to easily do group searches that only query a subset of indices, e.g. gitlab-production-notes-001-*
- Hashing the project id into the name would remove any need for routing by project id and take away the problem of hot shards, e.g. gitlab-production-notes-001-001
- Legacy documents that have parent/child relationship joins would still work because they would be placed in the same index based on project id.
- New indices will never be completely empty because they will have data moved into them when they are created.
- We could support pausing indexing at the namespace or project level
- We could support scaling down in a graceful way
- Because an IndexAssignment record would be attached to any project or namespace we index in Advanced Search, we could override the assigned index if needed (due to a particular namespace or project being abnormally large).
- Because an IndexAssignment record would be attached to any project or namespace we index in Advanced Search, we would not need to blast out deletes to every other index, and processing speeds would remain constant.
- We could reuse any service objects containing the logic proposed here and apply them to zoekt sharding.
- We could rework IndexCurator to automatically provision new indices, pause indexing for affected projects/namespaces, and launch reindexing based on our sizing guidelines.
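The IndexAssignment idea in the bullets above could look roughly like this (a sketch with hypothetical class and method names, not the POC's actual code):

```ruby
# Sketch of an IndexAssignment registry. Each indexed project (or namespace)
# carries exactly one assignment, so writes and deletes target a single index,
# and abnormally large containers can be pinned to a dedicated index.
IndexAssignment = Struct.new(:container_id, :index_name)

class AssignmentRegistry
  def initialize
    @assignments = {}
  end

  # Record (or override) the index assigned to a project/namespace.
  def assign(container_id, index_name)
    @assignments[container_id] = IndexAssignment.new(container_id, index_name)
  end

  # Writes and deletes consult the single assigned index -- no fan-out deletes.
  def index_for(container_id)
    @assignments.fetch(container_id).index_name
  end
end
```

Because lookups go through the assignment, deletes no longer need to be broadcast to every rolled-over index, and an oversized project can simply be reassigned.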
Technical Considerations
- The size limit for a project is 10GB. Taking the Elasticsearch compression rate into account, the maximum size of a project repository in Elasticsearch will be roughly 5GB.
Proof of Concept
A draft POC is here: !111811 (closed)