To be consistent with how we run feature flag tests against Staging Canary today
To catch any potential issues during the Production Canary stage, where there is still time to remediate before they reach the majority of customers in Production Main
These tests currently serve as non-blocking functional tests that would have already run in Staging Canary, as well as blocking tests at the MR level in e2e:test-on-cng and e2e:test-on-gdk
Due to the unique deployment process for .com, we don't yet have test coverage for mixed deployment issues in earlier environments (MRs or master). Though these issues are now rare, there are still advantages to having some coverage in this area (see examples and discussions here).
Since we already have mixed deployment coverage for Staging Canary and Staging Main, and the above analysis found that no E2E tests in Production have caught a mixed deployment issue, as an interim approach we will be removing this test from Production Main while still keeping the tests for Staging Main.
To help keep team members informed on the current state of live environment pipelines
Pipelines Being Kept
| Pipeline | Reasoning |
| --- | --- |
| Production Canary smoke tests (after deployments - blocking) | Detecting errors here acts as a last line of defense. While some customers might encounter the error, in case of an incident at this stage, tooling is available to quickly mitigate the problem (disable the canary environment), and as part of the incident procedures, deployments to our main environments are blocked to prevent the bug from being deployed to Staging Main and Production Main, the latter being the environment that deals with the rest of the customer traffic. |
| Production Canary smoke tests (after feature flag changes - non-blocking) | While we do already run tests after feature flag changes in Staging, there can be cases where a feature flag is enabled for Production and not Staging (note: there is a feature-flag-consistency-check script in ChatOps to prevent this, but it can be overridden). In this case, we want to ensure functionality can still be validated. Because these tests are triggered immediately after a feature flag change, they can also help quickly determine the root cause of an issue and the corresponding DRI. |
End State (TBD)
Flowcharts showing what the end state will look like once this issue is complete
```mermaid
graph TD
  B[Deploy to Staging Canary]
  B --> C[Run QA Tests]
  C --> D[QA Full<br><br><i>non-blocking</i>]
  C --> E[QA Smoke Main<br><br><i>blocking</i>]
  C --> F[QA Smoke Canary<br><br><i>blocking</i>]
  D --> G[Deploy to Production Canary]
  E --> G
  F --> G
  G --> H[Run QA Tests]
  H --> I[QA Smoke Canary<br><br><i>blocking</i>]
  I --> J[Deploy to Staging Main]
  J --> K[Deploy to Production Main]
```
```mermaid
graph TD
  A[Feature Flag set to true or 100% on Production]
  A --> B[QA Smoke Canary<br><br><i>non-blocking</i>]
```
Frequently Asked Questions
This section is to help summarize the reasoning for the above decisions as well as capture previous discussion points
Are we gathering data on Production and Canary issues caught by E2E tests first? No, at least not a full, in-depth analysis. My original proposal did include analyzing past Production and Canary incidents and issues before implementation to help inform a decision. However, after discussions in this thread, we decided to instead move forward with pipeline reduction based on the current knowledge shared by SETs and Delivery group team members, then analyze impact and track possible issues afterwards (see the monitoring section below). This is in alignment with the Development Analytics group's current approach of taking action and then observing, as with the introduction of all blocking tests in the CNG pipeline, removing Omnibus runs from MR pipelines, removing the reliable and blocking suites, etc.
Why are we removing pipelines and not reducing them (ex: to daily or scheduled runs)? Reducing the Production and Canary pipelines to a scheduled or daily run, for example, would not provide value. Such tests would not block deployments and would likely not run close enough to the cause of failure to actually catch or prevent bugs from being released. A less frequently running pipeline would just add additional overhead, since SETs would likely be investigating an issue that is already known. Instead, we are looking at where we can remove redundancy or pipelines that do not provide value. Please see these previous related comments (1, 2).
Before moving forward with the following task in this issue:
Remove Production smoke tests from the deployer pipeline
I wanted to get a better understanding from the Delivery team on how running a set of Production (main) smoke tests brings value in regards to mixed deployment issues. Do you recall any recent examples of a mixed deployment issue these tests helped catch and prevent?
My assumption is that even if the test is failing in Production at that point, customers are already experiencing the issue, and running such a test would not really provide much value here. However, I also lack context on what exactly a mixed deployment issue involves.
We are also exploring potentially removing Staging smoke tests, if we are already running smoke tests for Staging Canary. Do you have an example of what kind of mixed deployment issues Staging (main) smoke tests would help prevent?
Happy to share this question / discussion in a Slack channel as well. Are there any you'd recommend for involving Delivery team members? Thank you!
I wanted to get a better understanding from the Delivery team on how running a set of Production (main) smoke tests brings value in regards to mixed deployment issues.
Seems repetitive when you put it that way. That said, environments are not 100% in alignment, both in terms of configuration and scale. I think there's merit, but it's arguable. Ideally testing captures issues in the Staging environment prior to any Production deployment. That said, there are differences in these environments, so repeating these tests, while it may not be 100% necessary, can be used to signal that something is wrong.
Do you recall any recent examples of a mixed deployment issue these tests helped catch and prevent?
...context on what exactly a mixed deployment issue involves.
We deploy to two stages, Canary and Main. Deploys themselves are rolled out in various chunks, so for each stage it takes time for a deployment to complete as we asynchronously deploy to Gitaly, followed by Sidekiq, and later our web services (generalized format). Our large infrastructure takes a lot of time to deploy to each of these chunks, so one of the goals is to ensure that no one snuck in a change that prevents X+1 code from being compatible with X code. If this were to occur on a constant basis, we'd potentially be throwing a large number of errors until a deployment is complete. For the main stage, this takes right around an hour. An hour of degraded time negatively impacts overall user experience and is negatively counted towards various SLOs for customers and attributed service teams. This has a far deeper reach into ensuring we capture stuff like this for zero-downtime upgrades for self-managed customers as well.
Today
```mermaid
graph TD
  B[Deploy to Staging Canary]
  B --> C[Run QA Tests]
  C --> D[QA Full]
  C --> E[QA Smoke Main]
  C --> F[QA Smoke Canary]
  D --> G[Deploy to Production Canary]
  E --> G
  F --> G
  G --> H[Run QA Tests]
  H --> I[QA Full]
  H --> J[QA Smoke Main]
  H --> K[QA Smoke Canary]
  I --> L[Deploy to Staging Main]
  J --> L
  K --> L
  L --> M[Deploy to Production Main]
```
I'm curious what your vision looks like with the proposal in this issue.
Seems repetitive when you put it that way. That said, environments are not 100% in alignment, both in terms of configuration and scale. I think there's merit, but it's arguable. Ideally testing captures issues in the Staging environment prior to any Production deployment. That said, there are differences in these environments, so repeating these tests, while it may not be 100% necessary, can be used to signal that something is wrong.
Agreed, and we do plan to keep the Staging Canary and Production Canary smoke tests running after each deployment at this time. I was mainly wondering if we can afford to remove the Staging Main or Production Main smoke tests from the deployment pipeline, if they are currently providing too much overhead vs. value. As I understood it, the purpose of running these in addition to the Canary tests was to catch mixed deployment issues.
This may not paint the full picture, but it gives some good insight
Thanks for this as well. While doing some digging, I also found links to relevant info on the original implementation of this mixed deployment test setup:
An hour of degraded time negatively impacts overall user experience and is negatively counted towards various SLOs for customers and attributed service teams. This has a far deeper reach into ensuring we capture stuff like this for zero-downtime upgrades for self-managed customers as well.
That's understandable, and I can see where an E2E test could still provide value if it can catch an issue like this more quickly. Based on what I saw in the issue list you provided, though, it didn't look like these tests were helping to catch mixed deployment issues; they were mainly found by monitoring or by users first.
So while the E2E tests could catch a mixed deployment issue in the future, I do wonder if the costs are currently outweighing the benefits. These redundant pipelines are placing a large overhead on SETs during on-call, often produce too much noise due to other environmental issues, and accrue infrastructure costs to run the tests as well, to name a few things.
Here is what I was imagining:
```mermaid
graph TD
  B[Deploy to Staging Canary]
  B --> C[Run QA Tests]
  C --> D[QA Full Canary*]
  C --> F[QA Smoke Canary]
  D --> G[Deploy to Production Canary]
  F --> G
  G --> H[QA Smoke Canary]
  H --> I[Deploy to Staging Main]
  I --> J[Deploy to Production Main]
```
We're still having discussions around what to do with the Staging tests in gitlab-org&16167 (comment 2279507776), but QA Full Canary* could represent only running the subset of E2E tests that can only run against Staging (non-smoke tests that can't run earlier in MRs/master). Also, we could still consider keeping the QA Smoke Main for Staging to be extra cautious with mixed deployment issues, but I don't see the value of keeping it for Production since it is too late at that point
@vburton thank you for spending the time doing this research. This is not necessarily eye-opening to me as a person who has sat close to the infrastructure side. I do see and understand your points. I also agree we've got some repetitiveness happening. There's a lot of what I would probably consider "peace of mind" attributions made when implementing the current QA testing in our pipelines.
I don't disagree with your proposal, though I would like other persons, at least @mayra-cabrera and hopefully @nolith to chime in with their opinions on this overall discussion before we proceed to start making changes.
Followup Questions/Comments
...but QA Full Canary* could represent only running the subset of E2E tests...
We need to work on naming. Full would imply all, yet I see the word subset. Thus I don't actually know what we're testing here. Likely a nitpick, I just want to understand what we mean when we say we run both a Full test and a Smoke test. That way if incidents do occur and as a result, QA needs improvement, we all understand which test does what and what area of QA needs to be modified.
Today, all three QA test runs in Staging are blocking, the two smoke tests in Production are blocking, while the Full test suite in Production is not. Aside from dropping the Staging Main Smoke Test, we're currently in alignment with what we are dropping in tandem with what is being proposed to drop entirely. Would the expectation be that QA Full Canary and QA Smoke Canary (both Staging and Production) continue to be blockers for the deployments? I would imagine yes, seeking confirmation.
Do we have any idea how often QA Smoke Main test failures have resulted in us quarantining any tests or halting deployments because of a failed test? I believe the smoke tests mostly fail in tandem, or we've caught something in the QA Smoke Canary test. Before we drop the Main test, it'd be good to have any data we have on what action items we've taken as a result in the past. This would boost confidence for us wanting to drop this test.
I don't disagree with your proposal, though I would like other persons, at least @mayra-cabrera and hopefully @nolith to chime in with their opinions on this overall discussion before we proceed to start making changes.
We need to work on naming. Full would imply all, yet I see the word subset. Thus I don't actually know what we're testing here. Likely a nitpick, I just want to understand what we mean when we say we run both a Full test and a Smoke test. That way if incidents do occur and as a result, QA needs improvement, we all understand which test does what and what area of QA needs to be modified.
This is a great point, and I see where this causes confusion. In the current state, for our live environments, I understand "smoke" tests as the subset of tests that are reliable enough to block deployments, while "full" would be all other tests that are not. But we should have documentation to clarify this.
The definitions for these suites have also changed over time (ex: we used to have the reliable test suite, which was eventually just combined into smoke), and can sometimes go against industry standards. For example, in https://gitlab.com/gitlab-org/quality/quality-engineering/team-tasks/-/issues/2439, there had been discussions to try and move even more tests into the smoke suite, but that also defeats the purpose of what a smoke test suite should represent: a quick-running set of tests covering only the most critical functionality (ex: can a user log in).
Today, all three QA test runs in Staging are blocking, the two smoke tests in Production are blocking, while the Full test suite in Production is not. Aside from dropping the Staging Main Smoke Test, we're currently in alignment with what we are dropping in tandem with what is being proposed to drop entirely. Would the expectation be that QA Full Canary and QA Smoke Canary (both Staging and Production) continue to be blockers for the deployments? I would imagine yes, seeking confirmation.
I don't believe any of the "full" test suites against our live environments are blocking, including QA Full Canary for Staging (ex: qa:full:gstg-cny). Yes, the intention is QA Smoke Canary (both Staging and Production) would continue to be blocking.
However, as part of this issue, I wanted to propose removing Production QA Full Canary from deployments, since these tests are less critical in the sense that they are 1. not blocking and 2. should have already run in earlier environments
My apologies, QA Full Canary* for Staging was just a quick note and will likely be renamed to something more understandable. It would basically replace the current non-blocking QA Full Canary for Staging (qa:full:gstg-cny) mentioned above, but filtered to an even smaller subset of the full suite: just the tests that are only compatible with Staging today. Those kinds of tests can't run in MRs or master (yet) and usually test third-party integrations (ex: AI gateway, CustomersDot, etc.), use a .com-only feature, etc.
Hope this makes a bit more sense:
Updated chart:
```mermaid
graph TD
  B[Deploy to Staging Canary]
  B --> C[Run QA Tests]
  C --> D[QA Full Staging Canary*<br><br>Non-smoke tests only compatible with stg env<br><br><i>non-blocking</i>]
  C --> F[QA Smoke Staging Canary<br><br><i>blocking</i>]
  D --> G[Deploy to Production Canary]
  F --> G
  G --> H[QA Smoke Production Canary<br><br><i>blocking</i>]
  H --> I[Deploy to Staging Main]
  I --> J[Deploy to Production Main]
```
Do we have any idea how often QA Smoke Main test failures have resulted in us quarantining any tests or halting deployments because of a failed test? I believe the smoke tests mostly fail in tandem, or we've caught something in the QA Smoke Canary test. Before we drop the Main test, it'd be good to have any data we have on what action items we've taken as a result in the past. This would boost confidence for us wanting to drop this test.
@vburton. Thanks for the thorough analysis, I've added my notes below. It's a long comment, but summarized: In the current deployment state, I'm concerned about removing the QA main pipelines from the staging and production canaries because we won't have automated checks to verify the backward compatibility of deployment packages.
My assumption is that even if the test is failing in Production at that point, customers are already experiencing the issue, and running such a test would not really provide much value here.
I'd like to add more color to this assumption; I'll use Skarbek's diagram to do so:
```mermaid
graph TD
  B[Deploy to Staging Canary]
  B --> C[Run QA Tests]
  C --> D[QA Full non-blocking]
  C --> E[QA Smoke Main]
  C --> F[QA Smoke Canary]
  D --> G[Deploy to Production Canary]
  E --> G
  F --> G
  G --> H[Run QA Tests]
  H --> I[QA Full non-blocking]
  H --> J[QA Smoke Main]
  H --> K[QA Smoke Canary]
  I --> L[Deploy to Staging Main]
  J --> L
  K --> L
  L --> M[Deploy to Production Main]
```
The staging-canary environment is internal only; that is, it doesn't serve customer traffic. Production-canary is a different story: this environment serves traffic to a small subset of customers (I believe it is 5%). Detecting errors (mixed deployment or otherwise) on staging-canary is great (albeit detecting them at the merge request level is better) because customers are not impacted.
Detecting errors at production-canary acts as a last line of defense. While some customers might encounter the error, in case of an incident at this stage, tooling is available to quickly mitigate the problem (disable the canary environment), and as part of the incident procedures, deployments to our main environments are blocked to prevent the bug from being deployed to staging-main and production-main, the latter being the environment that deals with the rest of the customer traffic.
I was mainly wondering if we can afford to remove the Staging Main or Production Main smoke tests from the deployment pipeline, if they are currently providing too much overhead vs. value. As I understood it, the purpose of running these in addition to the Canary tests was to catch mixed deployment issues
As is, I'm not certain that we can afford to remove the main smoke tests executed on staging-canary and production-canary. For additional context, given the manual cadence of the production promotion, not every package that is deployed to the staging-canary and production-canary environments is guaranteed to reach the staging and production main environments. As a result, the compatibility of every package must be tested against the package that is currently deployed to staging-canary and production-canary.
So while the E2E tests could catch a mixed deployment issue in the future, I do wonder if the costs are currently outweighing the benefits. These redundant pipelines are placing a large overhead on SETs during on-call, often produce too much noise due to other environmental issues, and accrue infrastructure costs to run the tests as well,
I agree it has been a while since we have seen a mixed deployment issue; in part this could be thanks to the ongoing efforts in engineering to minimize backward compatibility issues (gitlab-org/gitlab#352455). Still, I find myself a bit wary about removing the main QA pipelines; if we were to do so, how would we detect mixed-deployment errors automatically?
Also, we could still consider keeping the QA Smoke Main for Staging to be extra cautious with mixed deployment issues, but I don't see the value of keeping it for Production since it is too late at that point
To reiterate, catching QA errors on production-canary is not too late because those can be quickly mitigated, have only reached a subset of users, and haven't reached the production main environment.
I don't believe any of the "full" test suites against our live environments are blocking
However, as part of this issue, I wanted to propose removing Production QA Full Canary from deployments, since these tests are less critical in the sense that they are 1. not blocking and 2. should have already run in earlier environments
Yes, this is correct. The full test suites are triggered in a fire-and-forget approach in the staging-canary and production-canary stages and their outcome doesn't block deployments. Because of their no-op impact, removing them from the deployment path is acceptable from the Delivery side.
Thanks so much for adding your input here @mayra-cabrera
Detecting errors at production-canary acts as a last line of defense. While some customers might encounter the error, in case of an incident at this stage, tooling is available to quickly mitigate the problem (disable the canary environment), and as part of the incident procedures, deployments to our main environments are blocked to prevent the bug from being deployed to staging-main and production-main, the latter being the environment that deals with the rest of the customer traffic.
Yes, I completely agree, and just to clarify, I am not proposing we remove the production-canary smoke tests - this was mainly to discuss if we could remove the production-main and staging-main smoke tests at this point
I appreciate the examples and context given here in this discussion. After some additional consideration and having a clearer picture now, I can also support keeping these main smoke tests.
To try and capture a simplified real-world example:
A database column could be removed in a regular migration as part of the same deployment that contains app code changes removing references to this column (this is not advised, but bear with it for the sake of the example)
Deployment to Canary executes the migration and also deploys a new package, say Package B, containing those app changes
Production, which is still running the previous package (Package A), has app code referencing this column. Because the database is shared between Canary and Production, and this column has already been removed, the Production tests could fail in this case (a rough sketch of this scenario follows below)
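To make the failure mode concrete, here is a minimal, hedged sketch of the scenario above using in-memory SQLite. The table and column names are invented, and this only illustrates the shared-database/version-skew problem, not GitLab's actual schema or migration tooling:

```python
# Hedged sketch only: invented table/column names, in-memory SQLite standing in
# for the database shared by Canary and Main. Not GitLab's actual schema/migrations.
import sqlite3

db = sqlite3.connect(":memory:")  # stands in for the shared database
db.execute("CREATE TABLE projects (id INTEGER PRIMARY KEY, legacy_setting TEXT)")
db.execute("INSERT INTO projects VALUES (1, 'enabled')")

def package_a_read(conn):
    # "Package A" app code (still running on Main) references the old column.
    return conn.execute("SELECT legacy_setting FROM projects WHERE id = 1").fetchone()

print(package_a_read(db))  # works before the Canary deployment

# Canary deployment (Package B) runs its migration: drop the legacy column
# (emulated here by rebuilding the table, since older SQLite lacks DROP COLUMN).
db.executescript("""
    CREATE TABLE projects_new (id INTEGER PRIMARY KEY);
    INSERT INTO projects_new SELECT id FROM projects;
    DROP TABLE projects;
    ALTER TABLE projects_new RENAME TO projects;
""")

try:
    package_a_read(db)  # Main, still on Package A, now errors -> a Main smoke test would fail
except sqlite3.OperationalError as err:
    print(f"Package A request fails after the Canary migration: {err}")
```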
Given this scenario, the main smoke tests can help in the following ways:
Could identify the root cause more quickly by being closer to the point of failure (ex: immediately after the deployment), potentially reducing downtime or missed SLOs
As mentioned above, not every Canary package may be promoted to Production. So (if I understand correctly), even though Package B has the fix, it may not be applied and we still need to ensure any newer packages will continue to be compatible with Production
Captures potential issues that could occur for upgrading self-managed customers or Cells infrastructure - this is probably not an area we want to lose more visibility on, given the majority of customers are self-managed and the ongoing work with Cells
@skarbek @mayra-cabrera Is my understanding of this example correct? If so, I'd like to add a summary of our decisions and reasoning to the issue description as well
Perhaps there may be a way to help catch some of these mixed deployment issues earlier, and at that point, it would be safer to then re-evaluate the need for these main smoke tests. I do see we have some epics such as gitlab-com/gl-infra/software-delivery/framework&3 and gitlab-org&12457 that could be related, but it is outside my area of expertise. Maybe @niskhakova can help share insight on those efforts and if this has been considered as part of that?
Perhaps there may be a way to help catch some of these mixed deployment issues earlier, and at that point, it would be safer to then re-evaluate the need for these main smoke tests. I do see we have some epics such as gitlab-com/gl-infra/software-delivery/framework&3 and gitlab-org&12457 that could be related, but it is outside my area of expertise. Maybe @niskhakova can help share insight on those efforts and if this has been considered as part of that?
The above epics are focusing on the "default" deployment - when a customer has a single GitLab version and they're upgrading to a new one. The way .com is deployed is very unique - mixed versions, manual control over when migrations run (PDM), as well as custom Chart deployments (please correct me if anything listed is wrong) => upgrade testing for this mixed deployment setup would be best built and suited specifically for the .com use case, knowing all the details.
As you mentioned above, the majority of existing mixed-release issues were not caught by E2E. I think the path forward would be to track this as a separate epic for upgrade testing for .com mixed deployment - identifying what should be tested (since the deployed package was already tested with GitLab QA) and not using GitLab QA. I believe it's a similar issue that Cells have. cc @ksvoboda @lsogunle
Analyse previous incidents and test gaps
Identify what kind of tool or script would help to catch mentioned issues
Implement the tool (and expanding monitoring/alerts?)
Sunset mixed E2E
With regards to the topic of the issue, I think that in the current state, until there is dedicated test tooling for mixed deployments, relying on the GitLab QA mixed deployment trigger for Canary/Main is the only last resort there is. That said, it's been a while since a mixed deployment E2E test caught an issue, so the tool is not efficient.
But perhaps it would be helpful to review whether the full suite against Production Canary can be removed, as the same full suite runs against Staging main (technically even bigger, since it has admin tests as well) and it's not blocking. For additional verification, we would need to analyse whether there was a case when the full suite failed on Production Canary but not on Staging Canary.
Thank you all for the detailed analysis; I agree with the sentiment here. E2E tests are not built to catch mixed deployment issues. GitLab QA is currently being used as a catch-all for any kind of failure. This approach is not ideal and should be revisited.
Given we are running the full suite in Staging and Staging Canary, and the past review shows that no mixed deployment issues have been caught in Production or Production Canary, we should dial down on running tests in Production. As an interim approach, we can run smoke in production and production canary on a lesser frequency.
I think the path forward would be to track this as a separate epic for upgrade testing for .com mixed deployment - identifying what should be tested (since the deployed package was already tested with GitLab QA) and not using GitLab QA. I believe it's a similar issue that Cells have. cc @ksvoboda @lsogunle
Analyse previous incidents and test gaps
Identify what kind of tool or script would help to catch mentioned issues
Implement the tool (and expanding monitoring/alerts?)
Sunset mixed E2E
I agree with @niskhakova's assessment here. Given this could potentially be part of the upgrade workflow, let us (cc @ksvoboda) know how we can help, as I see areas of opportunity between the two teams.
The above epics are focusing on the "default" deployment - when a customer has a single GitLab version and they're upgrading to a new one. The way .com is deployed is very unique
But perhaps it would be helpful to review whether the full suite against Production Canary can be removed, as the same full suite runs against Staging main (technically even bigger, since it has admin tests as well) and it's not blocking. For additional verification, we would need to analyse whether there was a case when the full suite failed on Production Canary but not on Staging Canary.
Agreed, as part of this issue, the current plan has been to remove the Production Canary full runs. Also to note, we were originally planning to do an in-depth analysis beforehand. However, there have been discussions since then, and we decided to plan on removal first, then observe the impact. I've updated the issue description to have more details on the reasoning behind decisions, frequently asked questions, etc. - hopefully it's a bit clearer
As an interim approach, we can run smoke in production and production canary on a lesser frequency.
Thank you @vincywilson for your input. This was discussed before as well, but I don't believe reducing the frequency of these tests will prove to be valuable (ex: through scheduled runs or daily, rather than after every deployment). I've added the current reasoning for this under frequently asked questions as well. I'm open to other insights though on how less frequent pipelines would still be valuable.
Speaking of interim approaches, I was also wondering whether we could still move forward with removing the Production Main smoke tests, but continue to keep the Staging Main smoke tests. That way, we still have mixed deployment coverage across Staging Canary / Staging Main. While I do understand the concern that Staging and Production are not in 100% alignment, as we've seen in the above analysis, the only mixed deployment issue caught by E2E tests was caught in Staging Canary anyway.
This also does not need to be a permanent solution, and we can always go back and change if needed. Monitoring is also included as part of this issue to analyze impact and keep this issue open for teams to add any issues noticed.
I was also wondering whether we could still move forward with removing the Production Main smoke tests, but continue to keep the Staging Main smoke tests. That way, we still have mixed deployment coverage across Staging Canary / Staging Main. While I do understand the concern that Staging and Production are not in 100% alignment, as we've seen in the above analysis, the only mixed deployment issue caught by E2E tests was caught in Staging Canary anyway.
@vburton - I am in favor of this. Thank you for the details above.
Agreed, as part of this issue, the current plan has been to remove the Production Canary full runs. Also to note, we were originally planning to do an in-depth analysis beforehand. However, there have been discussions since then, and we decided to plan on removal first, then observe the impact. I've updated the issue description to have more details on the reasoning behind decisions, frequently asked questions, etc. - hopefully it's a bit clearer
Thanks for the additional context. Sounds good
Speaking of interim approaches, I was also wondering whether we could still move forward with removing the Production Main smoke tests, but continue to keep the Staging Main smoke tests. That way, we still have mixed deployment coverage across Staging Canary / Staging Main. While I do understand the concern that Staging and Production are not in 100% alignment, as we've seen in the above analysis, the only mixed deployment issue caught by E2E tests was caught in Staging Canary anyway.
I think it makes sense, especially since the existing E2E tests have rarely caught mixed deployment issues in the past, and none after the Staging stage. And as you mentioned, the decision can always be reverted.
Looks like the general impression is that mixed-deployment testing isn't effective in finding backward compatibility issues either on staging or production. If we are not able to rely on quality pipelines, what tooling could be used to detect the mixed deployment issues automatically on the deployment path? Do we have signals, metrics, or error rates available that help us increase the confidence in the packages we're deploying and releasing to our customers?
Having a clear "what is next" or "how the mixed-deployment testing strategy will be replaced" outlined would increase confidence about removing the safety net of the gprd-cny smoke-main tests
that Staging and Production are not in 100% alignment, as we've seen in the above analysis, the only mixed deployment issue caught by E2E tests was caught in Staging Canary anyway.
Relying on the E2E staging environment tests is a bit fragile due to the disparities between staging and production: application settings, feature flags, and configurations may differ in these environments, not to mention the database, which is vastly different between staging and production.
Having a clear "what is next" or "how the mixed-deployment testing strategy will be replaced" outlined would increase confidence about removing the safety net of the gprd-cny smoke-main tests
@mayra-cabrera this is a good question and is outside my area of expertise / the Test Governance team's current scope at the moment, so I would need to see who from the team may be able to help look into this. @niskhakova @ksvoboda @kkolpakova @vincywilson do you know which team could help create and start looking into the epic mentioned above?
I think the path forward would be to track this as a separate epic for upgrade testing for .com mixed deployment - identifying what should be tested (since the deployed package was already tested with GitLab QA) and not using GitLab QA. I believe it's a similar issue that Cells have. cc @ksvoboda @lsogunle
Analyse previous incidents and test gaps
Identify what kind of tool or script would help to catch mentioned issues
Implement the tool (and expanding monitoring/alerts?)
Sunset mixed E2E
While I agree this gap should be addressed, I don't think this is necessarily a blocker for removing the Production Main smoke tests. If we did keep these tests, we would still continue to have this same gap we do today, just with added cost in terms of infrastructure and increased load for on-call team members.
Agreed that it's important to address the mixed deployment upgrade test gap and, in general, investigate what metrics and frameworks should be used. I'm not sure which team should handle this; my understanding was that Framework is focused on common upgrade path testing without mixed scenarios. I'll defer to @lsogunle.
Noting this thread for @nduff, @reprazent and @stejacks-gitlab. We had a conversation in Delivery on this topic, and metrics (logs) based checking is a possibility that would help replace the tests we are talking about above. I had seen https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1430, and wanted to note that it would be useful for the logging solution to let us monitor an important set of URLs and their Apdex or error rate in the window after a deploy. The cardinality because of URL paths would be too high, so a logging-based monitor is likely the better tool. Retention would not need to be any longer than what you are all talking about already.
@dawsmith I'm not sure I fully grasp what you're asking, having just read this thread. Can I get a little more detail on what you're looking to do?
It sounds like you want to have a way to get the apdex and error rate of some subset of URLs before and after a deploy. We already have that data in the metrics stack (almost certainly) and we keep data there for a year.
Are you looking for a way to alert on this, or for a way to record it, or some other answer? I'm not sure logs is the right solution, but I'm also pretty sure I don't understand the problem. :)
Thanks for asking @stejacks-gitlab, I guess I could have been more clear.
Given this thread is talking about eliminating some stages of testing in the deploy process, we were wondering about other safeguards.
If there were a deploy where mixed deployment was a problem, we would expect error rates to increase or Apdex to change on some subset of URLs. I think we would need to be able to categorize those in the future into ones we think are significant - say, for example, things related to MRs, issues, etc. I thought we did have some of these in metrics, but wasn't sure where you all were in the efforts to reduce cardinality.
To @stejacks-gitlab's point, we do have endpoint_id already in several metrics. The cardinality of this label already gives me nightmares, so if this helps solve the problem, it would be nice not to introduce another one.
It's also possible to use a label_replace to group them by URL and remove the methods if that is also desirable. For example. The example uses requests_total but we also have it available in errors_total IIRC (as well as a few others).
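For illustration only (this is not the example referenced above), here is a hedged sketch of the kind of check being discussed: comparing the per-endpoint error ratio in the window after a deploy against the window before it, via the standard Prometheus/Mimir HTTP query API. The metric and label names (requests_total, errors_total, endpoint_id) are taken from this thread and may not match the real schema; the URL is a placeholder.

```python
# Hedged sketch, not the actual query used in production monitoring.
import requests

PROM_URL = "https://mimir.example.com/prometheus/api/v1/query"  # placeholder endpoint

def error_ratio_by_endpoint(window: str, offset: str = "") -> dict:
    """Error ratio per endpoint_id over `window`, optionally offset into the past."""
    off = f" offset {offset}" if offset else ""
    query = (
        f"sum by (endpoint_id) (rate(errors_total[{window}]{off})) / "
        f"sum by (endpoint_id) (rate(requests_total[{window}]{off}))"
    )
    resp = requests.get(PROM_URL, params={"query": query}, timeout=30)
    resp.raise_for_status()
    return {
        r["metric"].get("endpoint_id", "unknown"): float(r["value"][1])
        for r in resp.json()["data"]["result"]
    }

# Compare the 30m window after a deploy with the 30m window before it.
after = error_ratio_by_endpoint("30m")
before = error_ratio_by_endpoint("30m", offset="30m")
for endpoint, ratio in sorted(after.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{endpoint}: error ratio {ratio:.4f} (change {ratio - before.get(endpoint, 0.0):+.4f})")
```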
Thanks for the comments @nduff and @stejacks-gitlab - I was mainly trying to point out future needs and, hopefully, ways the logging tooling could reduce cardinality in Mimir.
I’ll look at what we have in endpoint_id, but was mainly trying to give an example for you to use in evaluation.
Thanks for looking into this @dawsmith, is there an issue/epic that further tracks these monitoring improvements that we can include as a reference in this issue for visibility (would that be https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1430 you mentioned above?)
@dawsmith This makes sense, and I understand where you're coming from. We'll keep this in mind over all, but in the short term, I do highly recommend using metrics for this to start with. :)
Valerie Burton marked the checklist item Target Production Canary instead of Production for tests triggered by feature flag changes (ChatOps) as completed
I would also like to start a separate thread here to actually further challenge keeping the following:
Production Canary smoke tests (after feature flag changes - non-blocking)
There is tooling already in place via feature-flag-inconsistency-check to ensure feature flags are enabled in Staging first, which would then trigger the full suite of E2E tests against Staging Canary.
At this point, the functionality should have already been tested in the previous environment, and these are not being used to validate unique scenarios like mixed deployment issues, etc.
While this check can be overridden, it does require confirmation with the @sre-oncall. I do wonder how common this scenario is, and if it's really worth keeping these feature flag tests just to cover the odd chance this flag wasn't enabled in Staging beforehand.
Another idea is to possibly build the capability into ChatOps to only run these tests against Production Canary if the --ignore-feature-flag-consistency-check option was also included in the command
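A minimal sketch of that idea, assuming a hypothetical ChatOps command shape (only the --ignore-feature-flag-consistency-check option comes from this thread; everything else is invented for illustration):

```python
# Hedged sketch only: the command format below is invented; only the
# --ignore-feature-flag-consistency-check option is taken from this discussion.
def should_trigger_prod_canary_smoke(command_args: list) -> bool:
    """Trigger the non-blocking Production Canary smoke run only when the flag
    change bypassed the Staging-first consistency check."""
    return "--ignore-feature-flag-consistency-check" in command_args

# Example: a hypothetical flag change that skipped the consistency check.
args = ["feature", "set", "some_flag", "true", "--production",
        "--ignore-feature-flag-consistency-check"]
if should_trigger_prod_canary_smoke(args):
    print("Trigger QA Smoke Canary against Production Canary (non-blocking)")
else:
    print("Skip: the flag change followed the Staging-first path")
```

The intent is just that the extra smoke run would be reserved for the riskier path where the Staging-first check was bypassed, rather than running after every Production flag change.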
Valerie Burton marked the checklist item Target Production Canary instead of Production Main for tests triggered by feature flag changes as completed
Unassigning myself as I will be transitioning to a different role. I have added an update regarding the current progress and next steps in this comment