Improve Advanced search indexing process observability
Over the last few weeks, a number of customers have reported that documents are not found.
We traced the symptom to the file missing from the index, but we could not find the root cause.
To better understand what is going on in the indexing pipeline, we need to improve the tools we use to observe it.
The goal
- Have a clear and accessible view of the current state of the Elasticsearch index on GitLab.com.
Our current tools
- Queues monitoring at: https://dashboards.gitlab.net/d/search-main/search-overview?orgId=1
- Kibana logs at: https://log.gprd.gitlab.net/goto/0b251809f6cba725889c6305c91959b2
Improvements
Increase the retention window for certain logs where there might be a root cause of failure.
Is there a way to set the retention window for logs that match a certain predicate, in this case NOT json.job_status: is one of done, start?
See https://log.gprd.gitlab.net/goto/8510ea36e45f14343adb4443c48e5c79 for more information.
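To make the predicate concrete, here is a minimal sketch of the same filter in Python. It assumes each log entry is a nested dictionary with a `json.job_status` field; the function name and the entry shape are hypothetical and only illustrate which entries would qualify for longer retention. Like a negated filter, it also selects entries where the field is absent.

```python
# Hypothetical sketch: the entry shape and function name are assumptions,
# not part of the existing logging pipeline.
ROUTINE_STATUSES = {"done", "start"}


def needs_extended_retention(log_entry: dict) -> bool:
    """Mirror `NOT json.job_status: is one of done, start`.

    Entries without a job_status are selected too, since a negated
    filter also matches documents where the field is missing.
    """
    status = log_entry.get("json", {}).get("job_status")
    return status not in ROUTINE_STATUSES


entries = [
    {"json": {"job_status": "start"}},
    {"json": {"job_status": "done"}},
    {"json": {"job_status": "fail"}},
]
# Only the non-routine entries would be kept for longer.
to_keep = [e for e in entries if needs_extended_retention(e)]
```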
Improve the tracing around the state management of the IndexStatus
The current logic around the state management for the Elasticsearch index is as follows:
- Each project has an IndexStatus entry in the DB that holds the latest commit SHA it successfully indexed.
- Whenever the project repository has been mutated, we do some Git logic to figure out if we either need to 1) index incrementally on top of the current index; or 2) re-index everything from scratch.
- We then serialize all the outstanding documents into a queue.
- A worker picks up a batch, deserializes it, then sends a bulk update to the index.
- If the worker finishes without reporting any error, we update the IndexStatus entry with the SHA.
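The steps above can be sketched roughly as follows. This is an in-memory illustration and not the actual implementation: the `is_ancestor` callback stands in for the unspecified Git logic (assumed here to be an ancestry check), the queue is a plain `deque`, and `bulk_update` is a hypothetical stand-in for the bulk request to the index.

```python
import json
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class IndexStatus:
    project_id: int
    last_commit: Optional[str] = None  # latest successfully indexed SHA


def choose_strategy(status: IndexStatus,
                    is_ancestor: Callable[[str, str], bool],
                    new_head: str) -> str:
    # Assumption: indexing is incremental only when the last indexed
    # commit is an ancestor of the new head; otherwise start over.
    if status.last_commit is not None and is_ancestor(status.last_commit, new_head):
        return "incremental"
    return "full"


def enqueue(queue: deque, documents: list) -> None:
    # Serialize every outstanding document into the queue.
    for doc in documents:
        queue.append(json.dumps(doc))


def run_worker(queue: deque, bulk_update: Callable[[list], None],
               status: IndexStatus, new_head: str, batch_size: int = 100) -> bool:
    while queue:
        size = min(batch_size, len(queue))
        batch = [json.loads(queue[i]) for i in range(size)]
        try:
            bulk_update(batch)
        except Exception:
            # The batch stays queued and IndexStatus is left untouched.
            return False
        for _ in range(size):
            queue.popleft()
    # Only reached when every batch succeeded.
    status.last_commit = new_head
    return True
```

In this sketch a failed bulk update leaves both the queue and the stored SHA unchanged, so the same work can be retried later.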
This process is meant to guarantee a few things:
- We only index incrementally when we are sure the changes are incremental;
- We update the IndexStatus atomically, if and only if the indexing has completed successfully;
- The process is idempotent.
However, with the current logging it is hard to verify if all of these guarantees are upheld.
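As an illustration of what better tracing could enable, the sketch below checks one of these guarantees against a stream of structured events. The event names (`bulk_succeeded`, `status_updated`) and the fields (`project_id`, `sha`) are hypothetical; no such events exist in the current logs.

```python
# Hypothetical sketch: event names and fields are assumptions used only
# to show how structured trace events could be checked after the fact.
def status_updates_without_success(events: list) -> list:
    """Return status_updated events that were not preceded by a
    bulk_succeeded event for the same project and SHA."""
    succeeded = set()
    violations = []
    for event in events:
        key = (event["project_id"], event["sha"])
        if event["event"] == "bulk_succeeded":
            succeeded.add(key)
        elif event["event"] == "status_updated" and key not in succeeded:
            violations.append(event)
    return violations


trace = [
    {"event": "bulk_succeeded", "project_id": 1, "sha": "abc"},
    {"event": "status_updated", "project_id": 1, "sha": "abc"},
    {"event": "status_updated", "project_id": 2, "sha": "def"},
]
violations = status_updates_without_success(trace)
```

Here only the update for project 2 is flagged, because no successful bulk update was recorded for it.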
Improve the Elasticsearch integration with a health check dashboard:
I think it would make sense to add a simple ES dashboard in the Project's admin area, where one can see some metrics about the Elasticsearch index for this specific project:
- Number of actual documents by type in the index for this project
- Number of expected documents by type in the index for this project
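A minimal sketch of how the two metrics could be compared is shown below. It assumes both counts are already available as dictionaries keyed by document type; the type names and the function name are hypothetical, and how the counts are collected is left open.

```python
# Hypothetical sketch: document type names and the input shape are
# assumptions; the counts would come from the DB and from the index.
def count_mismatches(expected: dict, actual: dict) -> dict:
    """Return, per document type, the expected and actual counts
    wherever they differ."""
    mismatches = {}
    for doc_type in sorted(set(expected) | set(actual)):
        want = expected.get(doc_type, 0)
        have = actual.get(doc_type, 0)
        if want != have:
            mismatches[doc_type] = {
                "expected": want,
                "actual": have,
                "delta": want - have,  # positive means documents are missing
            }
    return mismatches


report = count_mismatches(
    expected={"blob": 120, "issue": 40},
    actual={"blob": 117, "issue": 40},
)
```

A non-empty report for a project would be a direct signal that documents are missing from the index, which is the symptom customers reported.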