# Spike: Investigate the most viable method to improve the success rate of Elasticsearch end-to-end tests
### What is the GitLab engineering productivity problem to solve?
In a discussion with @DylanGriffith (excerpts included below) we uncovered a problem with testing Elasticsearch in an end-to-end capacity, namely: how long should we wait before declaring a test a failure?

Stated another way, many of Elasticsearch's processes are asynchronous and happen opaquely, without a way to provide feedback to a potential consumer such as a test sequence. Examples of these async processes are:
- Indexing of records
- Propagation of settings changes (per-thread memory caching of settings changes in Redis, #36663)
Excerpts from conversation:
Dylan Griffith: The way I understand how the code works, if you enable Elasticsearch in the admin UI you should never see "scope not supported without elasticsearch", since I think there is nothing asynchronous about that part of things. I don't know where to look to debug that off the top of my head.
Well actually now after saying that I think I may have an idea.
I think the problem may be #36663
This cache is very bad for Elasticsearch. We're skipping it in some but not all contexts. If it is indeed always 60s and we have no way to expire the cache from the outside of the application then your only hope may be to always wait 60s after enabling the setting
😢 This cache is actually used for any application wide settings (most of the admin settings) and I wonder as such if we're running into the same problem elsewhere in our QA specs and if they have solutions for that.
Erick Banks: I am seeing a lot of retries in the tests that have a max retry of 10 and a `wait_interval` of 10; they find what they are looking for after 6 or 7 retries, which lines up with a 60-second delay between changing something and finding the change in a test
Dylan Griffith: This only fixes the 3rd error you are talking about. As for the 1st error the problem is more layered and problematic since the indexing to Elasticsearch is async but even further Elasticsearch itself is async with regards to when it refreshes the shards to make data available for search. The only reliable way here would be to wait for all queues to be drained (check queue data in Redis somehow....) and then manually trigger a refresh to the Elasticsearch API. This will guarantee things are done. We do this in our unit tests (by actually triggering the sidekiq workers synchronously) to make them reliable. But we don't have any way to trigger this from outside the application.
So right now the only reliable way I can think is to find the internal representation of Sidekiq queues in Redis and read that from the tests to confirm the jobs are done. It may be brittle in future if we change sidekiq queues around though.
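The "manually trigger a refresh" step Dylan mentions maps onto Elasticsearch's `_refresh` API. A minimal Ruby sketch of what a test harness could call, assuming a reachable Elasticsearch URL (the base URL and index name here are placeholders, not GitLab's real configuration):

```ruby
require 'net/http'
require 'uri'

# Build the refresh endpoint URI for a given index ('_all' refreshes every index).
def refresh_uri(base_url, index = '_all')
  URI("#{base_url}/#{index}/_refresh")
end

# Ask Elasticsearch to make all indexed documents visible to search.
# Returns true when the request succeeded.
def refresh_index!(base_url, index = '_all')
  Net::HTTP.post(refresh_uri(base_url, index), '').is_a?(Net::HTTPSuccess)
end
```

On its own this only helps once indexing jobs have finished; it would need to be combined with some check that the Sidekiq queues have drained.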
Erick Banks: Is there any prior art I can look at that reads those queues?
Dylan Griffith: Not to my knowledge. Like I said our unit tests do it differently because they just invoke the indexing jobs directly. I'm not sure if there is any code in GitLab that is looking at sidekiq Redis queues.
Might need to reverse engineer by looking at what's in Redis using `KEYS *`, or read up on what the queues look like: https://github.com/mperham/sidekiq
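For reference, Sidekiq stores each queue as a Redis list under the key `queue:<name>`, so a test could poll queue lengths directly, as Dylan suggests. A rough sketch; the Redis client is injected (any object answering `#llen`), and the queue names a caller would pass are assumptions about GitLab's setup:

```ruby
# Sidekiq keeps each queue as a Redis list under "queue:<name>";
# the list length is the number of jobs still waiting.
def sidekiq_queue_key(queue_name)
  "queue:#{queue_name}"
end

# Poll until the given queues are empty, or raise after `timeout` seconds.
# `redis` is any object answering #llen (e.g. a connected Redis client).
def wait_for_empty_queues(redis, queue_names, timeout: 60, interval: 1)
  deadline = Time.now + timeout
  loop do
    pending = queue_names.sum { |name| redis.llen(sidekiq_queue_key(name)) }
    return true if pending.zero?
    raise "timed out with #{pending} jobs still queued" if Time.now > deadline
    sleep interval
  end
end
```

As noted above, this would be brittle if the queue layout changes in a future Sidekiq or GitLab version.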
Erick Banks: Sounds like a job for Future Me (also willing to pair with anyone who wants to help out here)
Thanks for the input, Dylan!
Dylan Griffith: Another way I could think would be to make queries directly to Elasticsearch to see if the data is there. But the more I think about it all of these patterns are going to involve checking something periodically with an eventual timeout and so maybe it's no better than what you have today...
Erick Banks: yeah, that's pretty much what we do now
which ends up becoming "look longer"/Wait Longer
Dylan Griffith: Yeah so maybe it doesn't really help making the tests more reliable and would only change the error messages.
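For context, the "check something periodically with an eventual timeout" pattern the tests already use boils down to a helper like the following (a generic sketch, not the actual QA framework code; the sleep function is injectable so the helper itself can be tested without waiting):

```ruby
# Re-check a condition every `wait_interval` seconds, up to `max_attempts` times.
# Returns true as soon as the block yields truthy, false if it never does.
def retry_until(max_attempts: 10, wait_interval: 10, sleeper: ->(s) { sleep s })
  max_attempts.times do |attempt|
    return true if yield(attempt)
    sleeper.call(wait_interval)
  end
  false
end
```

With `max_attempts: 10` and `wait_interval: 10` this gives up after roughly 100 seconds, which matches the observation above that changes become visible around the 6th or 7th retry.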
### Problem identification checklist

- [ ] The root cause of the problem is identified.
- [ ] The surface of the problem is as small as possible.
### What are the potential solutions?
- Write tests that just wait an indeterminate amount of time, i.e. the "Wait Longer" approach
- Engineer some way to introspect the state of the Elasticsearch indexing/settings propagation, to verify the records are ready to be searched before attempting the verification step of a test.
  - For example, reverse engineering the Redis keys, and/or looking at Sidekiq queues and/or logs.
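To illustrate the introspection option, a test could also query Elasticsearch directly and count hits before running its verification step, as Dylan suggested. A hypothetical sketch (the index name and query a caller passes are placeholders, not GitLab's real schema), with the caveat from the conversation that this still amounts to periodic checking with a timeout:

```ruby
require 'net/http'
require 'json'
require 'uri'

# Build a query-string search URI against Elasticsearch.
def search_uri(base_url, index, query)
  URI("#{base_url}/#{index}/_search?q=#{URI.encode_www_form_component(query)}")
end

# Return the reported hit total, or nil if the request failed.
# Note: in Elasticsearch 7+, 'total' is an object like {"value" => n}.
def hit_count(base_url, index, query)
  response = Net::HTTP.get_response(search_uri(base_url, index, query))
  return nil unless response.is_a?(Net::HTTPSuccess)
  JSON.parse(response.body).dig('hits', 'total')
end
```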
- [ ] All potential solutions are listed.
- [ ] A solution has been chosen for the first iteration: PUT THE CHOSEN SOLUTION HERE
### Verify that the solution has improved the situation
- [ ] The solution improved the situation.
  - If yes, check this box and close the issue. Well done! 🎉
  - Otherwise, create a new "Productivity Improvement" issue. You can re-use the description from this issue, but obviously another solution should be chosen this time.