#12555 (closed) This is when we add the table es_indexes(name, is_active, progress). On every data update we would update every active index. While the most recent index is still being built, we could search the next-oldest active one. The idea is for at most two indexes to be active at any time: the current one and the one being migrated to.
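A minimal sketch of how the es_indexes bookkeeping could drive reads and writes. The `EsIndex` class and both helper functions are hypothetical, `progress` is assumed to be a percentage where 100 means fully built, and the rows are assumed to be ordered oldest to newest:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EsIndex:
    """One row of the proposed es_indexes(name, is_active, progress) table."""
    name: str
    is_active: bool
    progress: int  # assumption: percent built, 100 means complete


def indexes_to_write(indexes: List[EsIndex]) -> List[str]:
    """Every active index receives every data update."""
    return [index.name for index in indexes if index.is_active]


def index_to_search(indexes: List[EsIndex]) -> Optional[str]:
    """Pick the newest active index that has finished building.

    Assumes `indexes` is ordered oldest to newest, so walking it in
    reverse falls back to the next-oldest index while the newest one
    is still being populated.
    """
    for index in reversed(indexes):
        if index.is_active and index.progress >= 100:
            return index.name
    return None


rows = [
    EsIndex(name="gitlab-old", is_active=True, progress=100),
    EsIndex(name="gitlab-new", is_active=True, progress=40),
]
print(indexes_to_write(rows))  # both indexes get updates
print(index_to_search(rows))   # searches stay on the old index for now
```

Once the new index reaches 100, the same lookup starts returning it, and the old row can be marked inactive.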
We can also consider taking advantage of index aliases to specify which index we use for searches, but we still need to send updates to all active indexes
Document how the zero-downtime indexing operation works
We can also consider taking advantage of index aliases to specify which index we use for searches, but we still need to send updates to all active indexes
Sending updates to all active indexes is not possible via Elasticsearch itself. We need to write a layer between our own code and Elasticsearch that knows which indexes need writing to. It's also not as simple as sending the same operation to multiple indexes: how should Elasticsearch know what to do with an update operation if the original document doesn't exist in the new index yet?
I've split this work into another issue: #11299 (closed), but at the same time I'm not sure we'll decide to support this at all. It is very complex, and what it gives us is the certainty that if the new index fails to be populated properly, we'd still have an old index that's up to date. Without it, the old index would not be up to date.
This would also mean that if the indexing operation takes a long time (months?), the old index would be stale for that long, never returning new data. Considering that we do have customers whose indexing operation takes that long, we probably want to do this, or at least find a way to change which index we read from on a per-project/group basis.
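To make the missing-document problem concrete, here is a rough sketch of such a layer. It is not an existing implementation: `MultiIndexWriter` and `load_full_document` are hypothetical names, and plain dictionaries stand in for the Elasticsearch indexes. It assumes one possible answer to the question above, namely that when a partial update targets an index that does not hold the document yet, the layer indexes the full, current document from the source of truth instead:

```python
from typing import Callable, Dict

Document = Dict[str, object]
# Stand-in for an Elasticsearch index: document id -> document body.
FakeIndex = Dict[str, Document]


class MultiIndexWriter:
    """Hypothetical layer that fans one update out to all active indexes."""

    def __init__(
        self,
        indexes: Dict[str, FakeIndex],
        load_full_document: Callable[[str], Document],
    ) -> None:
        self.indexes = indexes
        # Assumed callback returning the current full document
        # (for example, rebuilt from the database).
        self.load_full_document = load_full_document

    def update(self, doc_id: str, changes: Document) -> None:
        for store in self.indexes.values():
            if doc_id in store:
                # The document exists here: apply the partial update.
                store[doc_id].update(changes)
            else:
                # The document has not been migrated yet: a partial update
                # would fail or leave a fragment, so write the whole document.
                store[doc_id] = dict(self.load_full_document(doc_id))


database = {"42": {"title": "New title", "body": "Unchanged body"}}
old_index: FakeIndex = {"42": {"title": "Old title", "body": "Unchanged body"}}
new_index: FakeIndex = {}

writer = MultiIndexWriter(
    {"old": old_index, "new": new_index},
    load_full_document=lambda doc_id: database[doc_id],
)
writer.update("42", {"title": "New title"})
print(old_index["42"] == new_index["42"])
```

The trade-off of this choice is an extra read from the source of truth for every document the new index has not seen yet.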
@mdelaossa From the description, this issue seems to be a superset of #11299 (closed), or is dependent on it.
And reading #11299 (closed) it seemed to say it may be considered optional.
If "concurrent writing to multiple indexes" is optional, what's the work left here?
If "concurrent writing to multiple indexes" is required, I think the combined weight of both issues should be 5, as it is going into the unknown and there are many design decisions to be made.
This issue should be worked on after https://gitlab.com/gitlab-org/gitlab-ee/issues/328 - that one decouples the schema, but doesn't really set things up for zero-downtime (doesn't set up the strategies we can use to switch the schema without a restart!)
We can also consider taking advantage of index aliases to specify which index we use for searches, but we still need to send updates to all active indexes
This seems to indicate it is dependent on #11299 (closed). Oh, or do you mean that at this stage, "updates to all active indexes" doesn't have to be a library-level solution, but can be an application-level solution?
Like I say here, writing to multiple indexes should not be considered a part of this issue :)
What you'll want to make sure you do here, though, is use index aliases so you can easily change the active "read" and "write" indexes.
The idea would be to have the current index be called something like 'gitlab_prod_DATEOFSCHEMA', and create both a read and a write alias called 'gitlab'. Then, when the schema needs to be updated, you can tell Elasticsearch "set NEW_INDEX as a write alias called 'gitlab'" and keep the same read alias. This would cause new writes to go to the new index and reads to go to the old index, allowing you to still return results from the old index until the new one finishes being written to.
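A sketch of what that alias switch could look like as a request body for Elasticsearch's `_aliases` endpoint, which applies its `actions` list atomically. The helper name and the index names are hypothetical, and the example assumes two distinct alias names (`gitlab-read` and `gitlab-write`) rather than a single 'gitlab' name, since that keeps the read and write targets separate:

```python
import json
from typing import Dict, List


def build_write_alias_swap(
    old_index: str, new_index: str, write_alias: str
) -> Dict[str, List[dict]]:
    """Build a POST /_aliases body that repoints only the write alias.

    The read alias is deliberately left untouched, so searches keep
    hitting the old index while the new one is being populated.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": write_alias}},
            {"add": {"index": new_index, "alias": write_alias}},
        ]
    }


body = build_write_alias_swap(
    old_index="gitlab_prod_old_schema",  # hypothetical index names
    new_index="gitlab_prod_new_schema",
    write_alias="gitlab-write",          # hypothetical alias name
)
print(json.dumps(body, indent=2))
```

When the new index has caught up, the same kind of request with the read alias name would move searches over as well.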
@phikai @lulalala I guess technically yes, we can do zero-downtime indexing with Step 2.
But at the moment the ES index name is hardcoded to be gitlab-$ENV-$VERSION (e.g. gitlab-production-v12p1), so to support zero-downtime reindexing on the same version we need to add a text field for the index name (in addition to the "friendly name"). Otherwise it's only possible if the second index is in another ES cluster.
We discussed this on the sync call with @mishunov and @mvanremmerden and decided not to do that for now, though we could definitely add it as a follow-up. We weren't sure yet how this should work UX-wise. We should probably do something similar to https://gitlab.com/projects/new, where the project slug is pre-filled based on the project name but can be manually changed. For ES indices, though, we would pre-fill the name based on the mapping version, for which we won't have a field until Step 3 ;-)
@phikai this also caused problems in testing since we can't easily create more than one index 🤦
For now we've changed it to append random hex characters at the end, so we get index names like gitlab-production-v12p1-edc2ce8c.
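As an illustration only, names of that shape could be generated like this. The function name is hypothetical, and the use of `secrets.token_hex` is an assumption about how the suffix might be produced, not a description of the actual change:

```python
import secrets


def build_index_name(environment: str, version: str) -> str:
    """Return a name of the form gitlab-$ENV-$VERSION-<8 hex characters>."""
    suffix = secrets.token_hex(4)  # 4 random bytes rendered as 8 hex characters
    return f"gitlab-{environment}-{version}-{suffix}"


# Two calls for the same environment and version yield different names
# (barring a random collision), so a second index can live in the same cluster.
print(build_index_name("production", "v12p1"))
print(build_index_name("production", "v12p1"))
```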
We actually might want to keep this even when we add the version dropdown, so we don't need a name field at all on the frontend. (/cc @mishunov @mvanremmerden)
@changzhengliu typically we leave the issue open until we've delivered the feature, even if there will be other issues that actually deliver it. That's because this is a more canonical issue that has salesforce and zendesk links and other important meta information for understanding priority.