New black-box test approach for Elasticsearch in the Nightly pipeline
Summary
The way we currently end-to-end test Elasticsearch (e.g. https://gitlab.com/gitlab-org/gitlab/-/blob/master/qa/qa/specs/features/ee/api/enablement/elasticsearch/elasticsearch_api_spec.rb) follows the same pattern we use to test other front-end-facing features. I will make the case here that this is not the appropriate way to black-box test Elasticsearch functionality.
As described in #224766, the problem with Elasticsearch testing is that a number of asynchronous processes need to complete before a test can run successfully, among them:
- settings propagation
- cache clearing
- record indexing
- Sidekiq worker queues
This has led to issues like https://gitlab.com/gitlab-org/quality/team-tasks/-/issues/395, which request that logs be made available to test executors so they can verify certain preconditions are met on the server before executing the verification steps of end-to-end tests. This, however, is incompatible with the philosophy behind black-box testing: the test is meant to emulate the experience of an end user, and end users typically cannot introspect the state of the server or examine log files.
As an alternative, we (the Quality department) have opted to insert hard waits and retry loops into various steps of the tests, trying to wait until Elasticsearch is ready to accept queries. I posit that this harms test efficiency, usability, and the customer experience: nested wait times extend a test's execution to lengths that are inefficient to run and that also reflect an unpleasant user experience.
Note: this idea was first discussed here #220246 (comment 374576598)
Improvements
A path forward: to simplify the tests and ensure a quality user experience, I'm proposing we do Elasticsearch end-to-end testing in a different way (sketched in code after this list):
- Start a container with Elasticsearch installed (and turned on: #209122)
- Add a record to GitLab (using the API would be fastest)
- Start a loop that terminates after time X
- Search for the record while the elapsed time is less than X
- If the record is found within that time, PASS; otherwise FAIL
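A minimal sketch of that loop in Ruby, assuming a GitLab instance reachable at GITLAB_URL, a PRIVATE_TOKEN with API scope, and the standard `/api/v4/search` endpoint. The search term and the threshold value are illustrative placeholders until we settle on "X":

```ruby
require 'net/http'
require 'json'
require 'uri'

GITLAB_URL    = ENV.fetch('GITLAB_URL', 'http://localhost:3000')
PRIVATE_TOKEN = ENV.fetch('PRIVATE_TOKEN')
SEARCH_TERM   = 'unique-e2e-record-title' # the record created earlier via the API
THRESHOLD     = 60                        # "X": seconds a user would reasonably wait

# One flat query against the search API: true if the record is searchable yet.
def record_searchable?(term)
  uri = URI("#{GITLAB_URL}/api/v4/search?scope=issues&search=#{URI.encode_www_form_component(term)}")
  request = Net::HTTP::Get.new(uri)
  request['PRIVATE-TOKEN'] = PRIVATE_TOKEN
  response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  response.is_a?(Net::HTTPSuccess) && JSON.parse(response.body).any? { |r| r['title'].include?(term) }
end

deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + THRESHOLD
found = false
until found || Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
  found = record_searchable?(SEARCH_TERM)
  sleep 1 unless found # a single flat polling interval; no nested waits
end

puts found ? 'PASS: record searchable within threshold' : 'FAIL: threshold exceeded'
```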
That's it. No nested retry loops. No introspection of the server logs. Just keep searching until we find the record or exceed the time threshold "X", where X is the amount of time a user expects to wait, on average, for a record to become searchable. If we fail to retrieve the record in that time, something has gone wrong, and from the user's perspective it doesn't really matter where. If some critical piece of tech failed to start, an earlier unit or acceptance test should have caught that.
What this kind of test will tell us is whether our aggregate asynchronous delays add up to more than the reasonable wait time X; if they do, we're taking too long somewhere and should examine why. That is what this testing will alert us to, and we can stop inserting arbitrary wait times into the existing tests whenever Elasticsearch isn't responding appropriately.
To consider for future iterations: we could implement multiple time thresholds that indicate the severity of the problem, much like https://about.gitlab.com/handbook/engineering/quality/issue-triage/#severity
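For illustration only, a hypothetical mapping of elapsed search time to a severity label could look like the following; the cutoff values are placeholders, not agreed-upon numbers:

```ruby
# Graded thresholds: the time until the record became searchable maps to a
# severity label, mirroring the triage severity scale. Cutoffs are in seconds.
SEVERITY_THRESHOLDS = {
  severity1: 300, # slower than 2 minutes but within 5: outage-level regression
  severity2: 120,
  severity3: 60,
  severity4: 30
}.freeze

def severity_for(elapsed_seconds)
  label, _cutoff = SEVERITY_THRESHOLDS.sort_by { |_, cutoff| cutoff }
                                      .find { |_, cutoff| elapsed_seconds <= cutoff }
  label || :severity1 # anything slower than the largest cutoff is most severe
end

severity_for(45)  # => :severity3
severity_for(400) # => :severity1
```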
Risks
Since this kind of test can run for up to X (plus setup time), it may be more prudent to run it less frequently, as in the nightly pipeline. There is also the potential for this kind of test to break because of flakiness or broken infrastructure rather than a problem in Elasticsearch.
Running tests less frequently gives us poorer time resolution into when things break.
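One possible way to restrict the spec to a less frequent run is plain RSpec tag filtering; the `:nightly` tag name and spec description here are assumptions for illustration, not existing conventions in the QA suite:

```ruby
# Tag the spec so it only runs when the nightly pipeline opts in to the tag.
RSpec.describe 'Elasticsearch black-box search', :nightly do
  it 'finds a newly created record within the agreed threshold' do
    # create the record via the API, then poll as in the sketch above
  end
end

# The nightly pipeline would then invoke only tagged specs:
#   bundle exec rspec --tag nightly
```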
@JohnMcGuire we'll need a value for "X" here that I'm hoping you can help provide.
Involved components
- Elasticsearch
- Elasticsearch API
Optional: Intended side effects
Optional: Missing test coverage
There may be areas of unit or integration testing we could improve to ensure all the components of Elasticsearch are working properly before running this kind of end-to-end test. I will need to audit our lower-level Elasticsearch tests to know for sure.