To be consistent with how we run feature flag tests against Staging Canary today
To catch any potential issues during the Production Canary stage, where there is still time to remediate before they reach the majority of customers in Production Main
These tests currently serve as non-blocking functional tests that would have already run in Staging Canary, as well as blocking tests at the MR level in e2e:test-on-cng and e2e:test-on-gdk
Due to the unique deployment process for .com, we don't yet have test coverage for mixed deployment issues in earlier environments (MRs or master). Though these issues are now rare, there are still advantages to having some coverage in this area (see examples and discussions here).
Since we already have mixed deployment coverage for Staging Canary and Staging Main, and the above analysis found that no E2E tests in Production have caught a mixed deployment issue, as an interim approach we will be removing this test from Production Main while still keeping the tests for Staging Main.
To help keep team members informed on the current state of live environment pipelines
Pipelines Being Kept
| Pipeline | Reasoning |
| --- | --- |
| Production Canary smoke tests (after deployments - blocking) | Detecting errors here acts as a last line of defense. While some customers might encounter the error, in case of an incident at this stage, tooling is available to quickly mitigate the problem (disable the canary environment), and as part of the incident procedures, deployments to our main environments are blocked to prevent the bug from being deployed to Staging Main and Production Main, the latter being the environment that deals with the rest of the customer traffic. |
| Production Canary smoke tests (after feature flag changes - non-blocking) | While we do already run tests after feature flag changes in Staging, there can be cases where a feature flag is enabled for Production and not Staging (note: there is a feature-flag-consistency-check script in ChatOps to prevent this, but it can be overridden). In this case, we want to ensure functionality can still be validated. Because these tests are triggered immediately after a feature flag change, they can also help quickly determine the root cause of an issue and the corresponding DRI. |
End State (TBD)
Flowcharts showing what the end state will look like once this issue is complete
```mermaid
graph TD
  B[Deploy to Staging Canary]
  B --> C[Run QA Tests]
  C --> D[QA Full<br><br><i>non-blocking</i>]
  C --> E[QA Smoke Main<br><br><i>blocking</i>]
  C --> F[QA Smoke Canary<br><br><i>blocking</i>]
  D --> G[Deploy to Production Canary]
  E --> G
  F --> G
  G --> H[Run QA Tests]
  H --> I[QA Smoke Canary<br><br><i>blocking</i>]
  I --> J[Deploy to Staging Main]
  J --> K[Deploy to Production Main]
```
```mermaid
graph TD
  A[Feature Flag set to true or 100% on Production]
  A --> B[QA Smoke Canary<br><br><i>non-blocking</i>]
```
Frequently Asked Questions
This section is to help summarize the reasoning for the above decisions as well as capture previous discussion points
Are we gathering data on Production and Canary issues caught by E2E tests first? No, at least not a full, in-depth analysis. My original proposal did include analyzing past Production and Canary incidents and issues before implementation to help inform a decision. However, after discussions in this thread, we decided to instead move forward with pipeline reduction based on the current knowledge shared by SETs and Delivery group team members, then analyze impact and track possible issues afterwards (see the monitoring section below). This is in alignment with the Development Analytics group's current approach of taking action and then observing, as with the introduction of all blocking tests in the CNG pipeline, removing Omnibus runs from MR pipelines, removing the reliable and blocking suites, etc.
Why are we removing pipelines and not reducing them (ex: to daily or scheduled runs)? Reducing the Production and Canary pipelines to a scheduled or daily run, for example, would not provide value. Such tests would not block deployments and would likely not run close enough to the cause of failure to actually catch or prevent bugs from being released. A less frequently running pipeline would just add additional overhead, since SETs would likely be investigating an issue that is already known. Instead, we are looking at where we can remove redundancy or pipelines that do not provide value. Please see these previous related comments (1, 2).
Before moving forward with the following task in this issue:
Remove Production smoke tests from the deployer pipeline
I wanted to get a better understanding from the Delivery team on how running a set of Production (main) smoke tests brings value in regards to mixed deployment issues. Do you recall any recent examples of a mixed deployment issue these tests helped catch and prevent?
My assumption is that even if the test is failing in Production at that point, customers are already experiencing the issue, and running such a test would not really provide much value here. However, I also lack context on what exactly a mixed deployment issue involves.
We are also exploring potentially removing Staging smoke tests, if we are already running smoke tests for Staging Canary. Do you have an example of what kind of mixed deployment issues Staging (main) smoke tests would help prevent?
Happy to share this question / discussion in a Slack channel as well. Are there any you'd recommend for involving Delivery team members? Thank you!
I wanted to get a better understanding from the Delivery team on how running a set of Production (main) smoke tests brings value in regards to mixed deployment issues.
Seems repetitive when you put it that way. That said, environments are not 100% in alignment, both in terms of configuration and scale. I think there's merit, but it's arguable. Ideally testing captures issues in the Staging environment prior to any Production deployment. That said, there are differences in these environments, so repeating these tests, while it may not be 100% necessary, can be used to signal that something is wrong.
Do you recall any recent examples of a mixed deployment issue these tests helped catch and prevent?
...context on what exactly a mixed deployment issue involves.
We deploy to two stages, Canary and Main. Deploys themselves are rolled out in various chunks, so for each stage it takes time for a deployment to complete as we asynchronously deploy to Gitaly, followed by Sidekiq, and later our web services (generalized format). Our large infrastructure takes a lot of time to deploy to each of these chunks, so one of the goals is to ensure that no one snuck in a change that prevents X+1 code from being compatible with X code. If this were to occur on a constant basis, we'd potentially be throwing a large number of errors until a deployment is complete. For the main stage, this takes right around an hour. An hour of degraded time negatively impacts overall user experience and is negatively counted towards various SLOs for customers and attributed service teams. This has a far deeper reach into ensuring we capture stuff like this for zero-downtime upgrades for self-managed customers as well.
Today
```mermaid
graph TD
  B[Deploy to Staging Canary]
  B --> C[Run QA Tests]
  C --> D[QA Full]
  C --> E[QA Smoke Main]
  C --> F[QA Smoke Canary]
  D --> G[Deploy to Production Canary]
  E --> G
  F --> G
  G --> H[Run QA Tests]
  H --> I[QA Full]
  H --> J[QA Smoke Main]
  H --> K[QA Smoke Canary]
  I --> L[Deploy to Staging Main]
  J --> L
  K --> L
  L --> M[Deploy to Production Main]
```
I'm curious what your vision looks like with the proposal in this issue.
Seems repetitive when you put it that way. That said, environments are not 100% in alignment, both in terms of configuration and scale. I think there's merit, but it's arguable. Ideally testing captures issues in the Staging environment prior to any Production deployment. That said, there are differences in these environments, so repeating these tests, while it may not be 100% necessary, can be used to signal that something is wrong.
Agreed, and we do plan to keep the Staging Canary and Production Canary smoke tests running after each deployment at this time. I was mainly wondering if we can afford to remove the Staging Main or Production Main smoke tests from the deployment pipeline, if they are currently providing too much overhead vs. value. As I understood it, the purpose of running these in addition to the Canary tests was to catch mixed deployment issues.
This may not paint the full picture, but it gives some good insight
Thanks for this as well. While doing some digging, I also found links to relevant info on the original implementation of this mixed deployment test setup:
An hour of degraded time negatively impacts overall user experience and is negatively counted towards various SLOs for customers and attributed service teams. This has a far deeper reach into ensuring we capture stuff like this for zero-downtime upgrades for self-managed customers as well.
That's understandable, and I can see where an E2E test could still provide value if it can catch an issue like this more quickly. Based on what I saw in the issue list you provided, though, it didn't look like these tests were helping to catch mixed deployment issues; they were mainly found by monitoring or by users first.
So while the E2E tests could catch a mixed deployment issue in the future, I do wonder if the costs are currently outweighing the benefits. These redundant pipelines are placing a large overhead on SETs during on-call, often produce too much noise due to other environmental issues, and accrue infrastructure costs to run the tests as well, to name a few things.
Here is what I was imagining:
```mermaid
graph TD
  B[Deploy to Staging Canary]
  B --> C[Run QA Tests]
  C --> D[QA Full Canary*]
  C --> F[QA Smoke Canary]
  D --> G[Deploy to Production Canary]
  F --> G
  G --> H[QA Smoke Canary]
  H --> I[Deploy to Staging Main]
  I --> J[Deploy to Production Main]
```
We're still having discussions around what to do with the Staging tests in gitlab-org&16167 (comment 2279507776), but QA Full Canary* could represent only running the subset of E2E tests that can only run against Staging (non-smoke tests that can't run earlier in MRs/master). Also, we could still consider keeping the QA Smoke Main for Staging to be extra cautious with mixed deployment issues, but I don't see the value of keeping it for Production since it is too late at that point
@vburton thank you for spending the time doing this research. This is not necessarily eye-opening to me as a person who has sat close to the infrastructure side. I do see and understand your points. I also agree we've got some repetitiveness happening. There's a lot of what I would probably consider "peace of mind" attributions made when implementing the current QA testing in our pipelines.
I don't disagree with your proposal, though I would like other persons, at least @mayra-cabrera and hopefully @nolith to chime in with their opinions on this overall discussion before we proceed to start making changes.
Followup Questions/Comments
...but QA Full Canary* could represent only running the subset of E2E tests...
We need to work on naming. Full would imply all, yet I see the word subset. Thus I don't actually know what we're testing here. Likely a nitpick, I just want to understand what we mean when we say we run both a Full test and a Smoke test. That way if incidents do occur and as a result, QA needs improvement, we all understand which test does what and what area of QA needs to be modified.
Today, all three QA test runs in Staging are blocking, the two smoke tests in Production are blocking, while the Full test suite in Production is not. Aside from dropping the Staging Main Smoke Test, we're currently in alignment with what we are dropping in tandem with what is being proposed to drop entirely. Would the expectation be that QA Full Canary and QA Smoke Canary (both Staging and Production) continue to be blockers for the deployments? I would imagine yes, seeking confirmation.
Do we have any idea how often QA Smoke Main test failures have resulted in us quarantining any tests or halting deployments because of a failed test? I believe the smoke tests mostly fail in tandem, or we've caught something in the QA Smoke Canary test. Before we drop the Main test, it'd be good to have any data we have on what action items we've taken as a result in the past. This would boost confidence for us wanting to drop this test.
I don't disagree with your proposal, though I would like other persons, at least @mayra-cabrera and hopefully @nolith to chime in with their opinions on this overall discussion before we proceed to start making changes.
We need to work on naming. Full would imply all, yet I see the word subset. Thus I don't actually know what we're testing here. Likely a nitpick, I just want to understand what we mean when we say we run both a Full test and a Smoke test. That way if incidents do occur and as a result, QA needs improvement, we all understand which test does what and what area of QA needs to be modified.
This is a great point, and I see where this causes confusion. In the current state, for our live environments, I understand "smoke" tests as the subset of tests that are reliable enough to block deployments, while "full" would be all other tests that are not. But we should have documentation to clarify this.
The definitions for these suites have also changed over time (ex: we used to have the reliable test suite, which was eventually just combined into smoke), and can sometimes go against industry standards. For example, in https://gitlab.com/gitlab-org/quality/quality-engineering/team-tasks/-/issues/2439, there had been discussions to try and move even more tests into the smoke suite, but that also defeats the purpose of what a smoke test suite should represent: a quick-running set of tests covering only the most critical functionality (ex: can a user log in).
Today, all three QA test runs in Staging are blocking, the two smoke tests in Production are blocking, while the Full test suite in Production is not. Aside from dropping the Staging Main Smoke Test, we're currently in alignment with what we are dropping in tandem with what is being proposed to drop entirely. Would the expectation be that QA Full Canary and QA Smoke Canary (both Staging and Production) continue to be blockers for the deployments? I would imagine yes, seeking confirmation.
I don't believe any of the "full" test suites against our live environments are blocking, including QA Full Canary for Staging (ex: qa:full:gstg-cny). Yes, the intention is QA Smoke Canary (both Staging and Production) would continue to be blocking.
However, as part of this issue, I wanted to propose removing Production QA Full Canary from deployments, since these tests are less critical in the sense that they are 1. not blocking and 2. should have already run in earlier environments
My apologies, QA Full Canary* for Staging was just a quick note and will likely be renamed to something more understandable. It would basically replace the current non-blocking QA Full Canary for Staging (qa:full:gstg-cny) mentioned above, but filtered to an even smaller subset of the full suite: just the tests that are only compatible with Staging today. Those kinds of tests can't run in MRs or master (yet) and usually test third-party integrations (ex: AI gateway, CustomersDot, etc.), use a .com-only feature, etc.
Hope this makes a bit more sense:
Updated chart:
```mermaid
graph TD
  B[Deploy to Staging Canary]
  B --> C[Run QA Tests]
  C --> D[QA Full Staging Canary*<br><br>Non-smoke tests only compatible with stg env<br><br><i>non-blocking</i>]
  C --> F[QA Smoke Staging Canary<br><br><i>blocking</i>]
  D --> G[Deploy to Production Canary]
  F --> G
  G --> H[QA Smoke Production Canary<br><br><i>blocking</i>]
  H --> I[Deploy to Staging Main]
  I --> J[Deploy to Production Main]
```
Do we have any idea how often QA Smoke Main test failures have resulted in us quarantining any tests or halting deployments because of a failed test? I believe the smoke tests mostly fail in tandem, or we've caught something in the QA Smoke Canary test. Before we drop the Main test, it'd be good to have any data we have on what action items we've taken as a result in the past. This would boost confidence for us wanting to drop this test.
@vburton. Thanks for the thorough analysis, I've added my notes below. It's a long comment, but summarized: In the current deployment state, I'm concerned about removing the QA main pipelines from the staging and production canaries because we won't have automated checks to verify the backward compatibility of deployment packages.
My assumption is that even if the test is failing in Production at that point, customers are already experiencing the issue, and running such a test would not really provide much value here.
I'd like to add more color to this assumption; I'll use Skarbek's diagram to do so:
```mermaid
graph TD
  B[Deploy to Staging Canary]
  B --> C[Run QA Tests]
  C --> D[QA Full non-blocking]
  C --> E[QA Smoke Main]
  C --> F[QA Smoke Canary]
  D --> G[Deploy to Production Canary]
  E --> G
  F --> G
  G --> H[Run QA Tests]
  H --> I[QA Full non-blocking]
  H --> J[QA Smoke Main]
  H --> K[QA Smoke Canary]
  I --> L[Deploy to Staging Main]
  J --> L
  K --> L
  L --> M[Deploy to Production Main]
```
The staging-canary environment is internal only; that is, it doesn't serve customer traffic. Production-canary is a different story: this environment serves traffic to a small subset of customers (I believe it is 5%). Detecting errors (mixed deployment or otherwise) on staging-canary is great (albeit detecting them at the merge request level is better) because customers are not impacted.
Detecting errors at production-canary acts as a last line of defense. While some customers might encounter the error, in case of an incident at this stage, tooling is available to quickly mitigate the problem (disable the canary environment), and as part of the incident procedures, deployments to our main environments are blocked to prevent the bug from being deployed to staging-main and production-main, the latter being the environment that deals with the rest of the customer traffic.
I was mainly wondering if we can afford to remove the Staging Main or Production Main smoke tests from the deployment pipeline, if they are currently providing too much overhead vs. value. As I understood it, the purpose of running these in addition to the Canary tests was to catch mixed deployment issues
As is, I'm not certain that we can afford to remove the main smoke tests executed on staging-canary and production-canary. For additional context, given the manual cadence of the production promotion, not every package that is deployed to the staging-canary and production-canary environments is guaranteed to reach the staging and production main environments. As a result, the compatibility of every package must be tested against the package that is currently deployed to staging-canary and production-canary.
So while the E2E tests could catch a mixed deployment issue in the future, I do wonder if the costs are currently outweighing the benefits. These redundant pipelines are placing a large overhead on SETs during on-call, often produce too much noise due to other environmental issues, and accrue infrastructure costs to run the tests as well,
I agree it has been a while since we have seen a mixed deployment issue; in part this could be thanks to the ongoing efforts in engineering to minimize backward compatibility issues (gitlab-org/gitlab#352455). Still, I find myself a bit wary about removing the main QA pipelines; if we were to do so, how would we detect mixed-deployment errors automatically?
Also, we could still consider keeping the QA Smoke Main for Staging to be extra cautious with mixed deployment issues, but I don't see the value of keeping it for Production since it is too late at that point
To reiterate, catching QA errors on production-canary is not too late because those can be quickly mitigated, have only reached a subset of users, and haven't reached the production main environment.
I don't believe any of the "full" test suites against our live environments are blocking
However, as part of this issue, I wanted to propose removing Production QA Full Canary from deployments, since these tests are less critical in the sense that they are 1. not blocking and 2. should have already run in earlier environments
Yes, this is correct. The full test suites are triggered in a fire-and-forget approach in the staging-canary and production-canary stages and their outcome doesn't block deployments. Because of their no-op impact, removing them from the deployment path is acceptable from the Delivery side.
Thanks so much for adding your input here @mayra-cabrera
Detecting errors at production-canary acts as a last line of defense. While some customers might encounter the error, in case of an incident at this stage, tooling is available to quickly mitigate the problem (disable the canary environment), and as part of the incident procedures, deployments to our main environments are blocked to prevent the bug from being deployed to staging-main and production-main, the latter being the environment that deals with the rest of the customer traffic.
Yes, I completely agree, and just to clarify, I am not proposing we remove the production-canary smoke tests - this was mainly to discuss if we could remove the production-main and staging-main smoke tests at this point
I appreciate the examples and context given here in this discussion. After some additional consideration and having a clearer picture now, I can also support keeping these main smoke tests.
To try and capture a simplified real-world example:
A database column could be removed in a regular migration as part of the same deployment that contains app code changes removing references to this column (this is not advised, but bear with it for the sake of the example)
Deployment to Canary executes the migration and also deploys a new package, say Package B, containing those app changes
Production, which is still running the previous package (Package A), has app code referencing this column. Because the database is shared between Canary and Production, and this column has already been removed, the Production tests could fail in this case (a rough sketch of this scenario follows below)
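To make the failure mode concrete, here is a minimal, hedged sketch of the scenario above using in-memory SQLite. The table and column names are invented, and this only illustrates the shared-database/version-skew problem, not GitLab's actual schema or migration tooling:

```python
# Hedged sketch only: invented table/column names, in-memory SQLite standing in
# for the database shared by Canary and Main. Not GitLab's actual schema/migrations.
import sqlite3

db = sqlite3.connect(":memory:")  # stands in for the shared database
db.execute("CREATE TABLE projects (id INTEGER PRIMARY KEY, legacy_setting TEXT)")
db.execute("INSERT INTO projects VALUES (1, 'enabled')")

def package_a_read(conn):
    # "Package A" app code (still running on Main) references the old column.
    return conn.execute("SELECT legacy_setting FROM projects WHERE id = 1").fetchone()

print(package_a_read(db))  # works before the Canary deployment

# Canary deployment (Package B) runs its migration: drop the legacy column
# (emulated here by rebuilding the table, since older SQLite lacks DROP COLUMN).
db.executescript("""
    CREATE TABLE projects_new (id INTEGER PRIMARY KEY);
    INSERT INTO projects_new SELECT id FROM projects;
    DROP TABLE projects;
    ALTER TABLE projects_new RENAME TO projects;
""")

try:
    package_a_read(db)  # Main, still on Package A, now errors -> a Main smoke test would fail
except sqlite3.OperationalError as err:
    print(f"Package A request fails after the Canary migration: {err}")
```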
Given this scenario, the main smoke tests can help in the following ways:
Could identify the root cause more quickly by being closer to the point of failure (ex: immediately after the deployment), potentially reducing downtime or missed SLOs
As mentioned above, not every Canary package may be promoted to Production. So (if I understand correctly), even though Package B has the fix, it may not be applied and we still need to ensure any newer packages will continue to be compatible with Production
Captures potential issues that could occur for upgrading self-managed customers or Cells infrastructure - this is probably not an area we want to lose more visibility on, given the majority of customers are self-managed and the ongoing work with Cells
@skarbek @mayra-cabrera Is my understanding of this example correct? If so, I'd like to add a summary of our decisions and reasoning to the issue description as well
Perhaps there may be a way to help catch some of these mixed deployment issues earlier, and at that point, it would be safer to then re-evaluate the need for these main smoke tests. I do see we have some epics such as gitlab-com/gl-infra/software-delivery/framework&3 and gitlab-org&12457 that could be related, but it is outside my area of expertise. Maybe @niskhakova can help share insight on those efforts and if this has been considered as part of that?
Perhaps there may be a way to help catch some of these mixed deployment issues earlier, and at that point, it would be safer to then re-evaluate the need for these main smoke tests. I do see we have some epics such as gitlab-com/gl-infra/software-delivery/framework&3 and gitlab-org&12457 that could be related, but it is outside my area of expertise. Maybe @niskhakova can help share insight on those efforts and if this has been considered as part of that?
The above epics are focusing on the "default" deployment - when a customer has a single GitLab version and they're upgrading to a new one. The way .com is deployed is very unique - mixed versions, manual control over when migrations run (PDM), as well as custom Chart deployments (please correct me if anything listed is wrong) => upgrade testing for this mixed deployment setup would be best built and suited specifically for the .com use case, knowing all the details.
As you mentioned above, the majority of existing mixed-release issues were not caught by E2E. I think the path forward would be to track this as a separate epic for upgrade testing for .com mixed deployment - identifying what should be tested (since the deployed package was already tested with GitLab QA) and not using GitLab QA. I believe it's a similar issue that Cells have. cc @ksvoboda @lsogunle
Analyse previous incidents and test gaps
Identify what kind of tool or script would help to catch mentioned issues
Implement the tool (and expanding monitoring/alerts?)
Sunset mixed E2E
With regards to the topic of the issue, I think that in the current state, until there is dedicated test tooling for mixed deployments, relying on the GitLab QA mixed deployment trigger for Canary/Main is the only last resort there is. That said, it's been a while since a mixed deployment E2E test caught an issue, so the tool is not efficient.
But perhaps it would be helpful to review whether the full suite against Production Canary can be removed, as the same full suite runs against Staging main (technically even bigger, since it has admin tests as well) and it's not blocking. For additional verification, we would need to analyse whether there was a case when the full suite failed on Production Canary but not on Staging Canary.
Thank you all for the detailed analysis; I agree with the sentiment here. E2E tests are not built to catch mixed deployment issues. GitLab QA is currently being used as a catch-all for any kind of failure. This approach is not ideal and should be revisited.
Given we are running the full suite in Staging and Staging Canary, and the past review shows that no mixed deployment issues have been caught in Production or Production Canary, we should dial down on running tests in Production. As an interim approach, we can run smoke in production and production canary on a lesser frequency.
I think the path forward would be to track this as a separate epic for upgrade testing for .com mixed deployment - identifying what should be tested (since the deployed package was already tested with GitLab QA) and not using GitLab QA. I believe it's a similar issue that Cells have. cc @ksvoboda @lsogunle
Analyse previous incidents and test gaps
Identify what kind of tool or script would help to catch mentioned issues
Implement the tool (and expanding monitoring/alerts?)
Sunset mixed E2E
I agree with @niskhakova's assessment here. Given this could potentially be part of the upgrade workflow, let us (cc @ksvoboda) know how we can help, as I see areas of opportunity between the two teams.
The above epics are focusing on the "default" deployment - when a customer has a single GitLab version and they're upgrading to a new one. The way .com is deployed is very unique
But perhaps it would be helpful to review whether the full suite against Production Canary can be removed, as the same full suite runs against Staging main (technically even bigger, since it has admin tests as well) and it's not blocking. For additional verification, we would need to analyse whether there was a case when the full suite failed on Production Canary but not on Staging Canary.
Agreed, as part of this issue, the current plan has been to remove the Production Canary full runs. Also to note, we were originally planning to do an in-depth analysis beforehand. However, there have been discussions since then, and we decided to plan on removal first, then observe the impact. I've updated the issue description to have more details on the reasoning behind decisions, frequently asked questions, etc. - hopefully it's a bit clearer
As an interim approach, we can run smoke in production and production canary on a lesser frequency.
Thank you @vincywilson for your input. This was discussed before as well, but I don't believe reducing the frequency of these tests will prove to be valuable (ex: through scheduled runs or daily, rather than after every deployment). I've added the current reasoning for this under frequently asked questions as well. I'm open to other insights though on how less frequent pipelines would still be valuable.
Speaking of interim approaches, I was also wondering whether we could still move forward with removing the Production Main smoke tests, but continue to keep the Staging Main smoke tests. That way, we still have mixed deployment coverage across Staging Canary / Staging Main. While I do understand the concern that Staging and Production are not in 100% alignment, as we've seen in the above analysis, the only mixed deployment issue caught by E2E tests was caught in Staging Canary anyway.
This also does not need to be a permanent solution, and we can always go back and change if needed. Monitoring is also included as part of this issue to analyze impact and keep this issue open for teams to add any issues noticed.
I was also wondering whether we could still move forward with removing the Production Main smoke tests, but continue to keep the Staging Main smoke tests. That way, we still have mixed deployment coverage across Staging Canary / Staging Main. While I do understand the concern that Staging and Production are not in 100% alignment, as we've seen in the above analysis, the only mixed deployment issue caught by E2E tests was caught in Staging Canary anyway.
@vburton - I am in favor of this. Thank you for the details above.
Agreed, as part of this issue, the current plan has been to remove the Production Canary full runs. Also to note, we were originally planning to do an in-depth analysis beforehand. However, there have been discussions since then, and we decided to plan on removal first, then observe the impact. I've updated the issue description to have more details on the reasoning behind decisions, frequently asked questions, etc. - hopefully it's a bit clearer
Thanks for the additional context. Sounds good
Speaking of interim approaches, I was also wondering whether we could still move forward with removing the Production Main smoke tests, but continue to keep the Staging Main smoke tests. That way, we still have mixed deployment coverage across Staging Canary / Staging Main. While I do understand the concern that Staging and Production are not in 100% alignment, as we've seen in the above analysis, the only mixed deployment issue caught by E2E tests was caught in Staging Canary anyway.
I think it makes sense, especially since the existing E2E tests have rarely caught mixed deployment issues in the past, and none after the Staging stage. And as you mentioned, the decision can always be reverted.
Looks like the general impression is that mixed-deployment testing isn't effective in finding backward compatibility issues either on staging or production. If we are not able to rely on quality pipelines, what tooling could be used to detect the mixed deployment issues automatically on the deployment path? Do we have signals, metrics, or error rates available that help us increase the confidence in the packages we're deploying and releasing to our customers?
Having a clear "what is next" or "how the mixed-deployment testing strategy will be replaced" outlined would increase confidence about removing the safety net of the gprd-cny smoke-main tests
that Staging and Production are not in 100% alignment, as we've seen in the above analysis, the only mixed deployment issue caught by E2E tests was caught in Staging Canary anyway.
Relying on the E2E staging environment tests is a bit fragile due to the disparities between staging and production: application settings, feature flags, and configurations may differ in these environments, not to mention the database, which is vastly different between staging and production.
Having a clear "what is next" or "how the mixed-deployment testing strategy will be replaced" outlined would increase confidence about removing the safety net of the gprd-cny smoke-main tests
@mayra-cabrera this is a good question and is outside my area of expertise / the Test Governance team's current scope at the moment, so I would need to see who from the team may be able to help look into this. @niskhakova @ksvoboda @kkolpakova @vincywilson do you know which team could help create and start looking into the epic mentioned above?
I think the path forward would be to track this as a separate epic for upgrade testing for .com mixed deployment - identifying what should be tested (since the deployed package was already tested with GitLab QA) and not using GitLab QA. I believe it's a similar issue that Cells have. cc @ksvoboda @lsogunle
Analyse previous incidents and test gaps
Identify what kind of tool or script would help to catch mentioned issues
Implement the tool (and expanding monitoring/alerts?)
Sunset mixed E2E
While I agree this gap should be addressed, I don't think this is necessarily a blocker for removing the Production Main smoke tests. If we did keep these tests, we would still continue to have this same gap we do today, just with added cost in terms of infrastructure and increased load for on-call team members.
Agreed that it's important to address the mixed deployment upgrade test gap and, in general, investigate what metrics and frameworks should be used. I'm not sure which team should handle this; my understanding was that Framework is focused on common upgrade path testing without mixed scenarios. I'll defer to @lsogunle.
Noting this thread for @nduff, @reprazent and @stejacks-gitlab. We had a conversation in Delivery on this topic, and metrics (logs) based checking is a possibility that would help replace the tests we are talking about above. I had seen https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1430, and wanted to note that it would be useful for the logging solution to let us monitor an important set of URLs and their Apdex or error rate in the window after a deploy. The cardinality because of URL paths would be too high, so a logging-based monitor is likely the better tool. Retention would not need to be any longer than what you are all talking about already.
@dawsmith I'm not sure I fully grasp what you're asking, having just read this thread. Can I get a little more detail on what you're looking to do?
It sounds like you want to have a way to get the apdex and error rate of some subset of URLs before and after a deploy. We already have that data in the metrics stack (almost certainly) and we keep data there for a year.
Are you looking for a way to alert on this, or for a way to record it, or some other answer? I'm not sure logs is the right solution, but I'm also pretty sure I don't understand the problem. :)
Thanks for asking @stejacks-gitlab, I guess I could have been more clear.
Given this thread is talking about eliminating some stages of testing in the deploy process, we were wondering about other safeguards.
If there were a deploy where mixed deployment was a problem, we would expect error rates to increase or Apdex to change on some subset of URLs. I think we would need to be able to categorize those in the future into ones we think are significant - say, for example, things related to MRs, issues, etc. I thought we did have some of these in metrics, but wasn't sure where you all were in the efforts to reduce cardinality.
To @stejacks-gitlab's point, we do have endpoint_id already in several metrics. The cardinality of this label already gives me nightmares, so if this helps solve the problem, it would be nice not to introduce another one.
It's also possible to use a label_replace to group them by URL and remove the methods if that is also desirable. For example. The example uses requests_total but we also have it available in errors_total IIRC (as well as a few others).
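For illustration only (this is not the example referenced above), here is a hedged sketch of the kind of check being discussed: comparing the per-endpoint error ratio in the window after a deploy against the window before it, via the standard Prometheus/Mimir HTTP query API. The metric and label names (requests_total, errors_total, endpoint_id) are taken from this thread and may not match the real schema; the URL is a placeholder.

```python
# Hedged sketch, not the actual query used in production monitoring.
import requests

PROM_URL = "https://mimir.example.com/prometheus/api/v1/query"  # placeholder endpoint

def error_ratio_by_endpoint(window: str, offset: str = "") -> dict:
    """Error ratio per endpoint_id over `window`, optionally offset into the past."""
    off = f" offset {offset}" if offset else ""
    query = (
        f"sum by (endpoint_id) (rate(errors_total[{window}]{off})) / "
        f"sum by (endpoint_id) (rate(requests_total[{window}]{off}))"
    )
    resp = requests.get(PROM_URL, params={"query": query}, timeout=30)
    resp.raise_for_status()
    return {
        r["metric"].get("endpoint_id", "unknown"): float(r["value"][1])
        for r in resp.json()["data"]["result"]
    }

# Compare the 30m window after a deploy with the 30m window before it.
after = error_ratio_by_endpoint("30m")
before = error_ratio_by_endpoint("30m", offset="30m")
for endpoint, ratio in sorted(after.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{endpoint}: error ratio {ratio:.4f} (change {ratio - before.get(endpoint, 0.0):+.4f})")
```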
Thanks for the comments @nduff and @stejacks-gitlab - I was mainly trying to point out future needs and, hopefully, ways the logging tooling could reduce cardinality in Mimir.
I’ll look at what we have in endpoint_id, but was mainly trying to give an example for you to use in evaluation.
Thanks for looking into this @dawsmith, is there an issue/epic that further tracks these monitoring improvements that we can include as a reference in this issue for visibility (would that be https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1430 you mentioned above?)
@dawsmith This makes sense, and I understand where you're coming from. We'll keep this in mind over all, but in the short term, I do highly recommend using metrics for this to start with. :)
Valerie Burton marked the checklist item Target Production Canary instead of Production for tests triggered by feature flag changes (ChatOps) as completed
I would also like to start a separate thread here to actually further challenge keeping the following:
Production Canary smoke tests (after feature flag changes - non-blocking)
There is tooling already in place via feature-flag-inconsistency-check to ensure feature flags are enabled in Staging first, which would then trigger the full suite of E2E tests against Staging Canary.
At this point, the functionality should have already been tested in the previous environment, and these are not being used to validate unique scenarios like mixed deployment issues, etc.
While this check can be overridden, it does require confirmation with the @sre-oncall. I do wonder how common this scenario is, and if it's really worth keeping these feature flag tests just to cover the odd chance this flag wasn't enabled in Staging beforehand.
Another idea is to possibly build the capability into ChatOps to only run these tests against Production Canary if the --ignore-feature-flag-consistency-check option was also included in the command
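A minimal sketch of that idea, assuming a hypothetical ChatOps command shape (only the --ignore-feature-flag-consistency-check option comes from this thread; everything else is invented for illustration):

```python
# Hedged sketch only: the command format below is invented; only the
# --ignore-feature-flag-consistency-check option is taken from this discussion.
def should_trigger_prod_canary_smoke(command_args: list) -> bool:
    """Trigger the non-blocking Production Canary smoke run only when the flag
    change bypassed the Staging-first consistency check."""
    return "--ignore-feature-flag-consistency-check" in command_args

# Example: a hypothetical flag change that skipped the consistency check.
args = ["feature", "set", "some_flag", "true", "--production",
        "--ignore-feature-flag-consistency-check"]
if should_trigger_prod_canary_smoke(args):
    print("Trigger QA Smoke Canary against Production Canary (non-blocking)")
else:
    print("Skip: the flag change followed the Staging-first path")
```

The intent is just that the extra smoke run would be reserved for the riskier path where the Staging-first check was bypassed, rather than running after every Production flag change.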
Valerie Burton marked the checklist item Target Production Canary instead of Production Main for tests triggered by feature flag changes as completed
Unassigning myself as I will be transitioning to a different role. I have added an update regarding the current progress and next steps in this comment