Following a call with @ernstvn and @andrewn, we agreed that we need a way to create staging environments for testing new features. This is meant to be an intermediate step towards #1504 (closed), which remains the ultimate long-term goal.
The way we can achieve this in the short to mid term is to leverage Terraform for the creation of the nodes. This is already part of an existing effort to better align staging with the production environment, and the configuration is already coded.
To make this scalable to a number of different staging environments we will need to define a naming schema for the Chef roles, e.g. staging-<name>-*.
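For illustration only, a role following that schema might look like the sketch below; the role, recipe, and environment names (staging-myfeature, gitlab-base, etc.) are hypothetical and not taken from our actual chef-repo.

```ruby
# roles/staging-myfeature-web.rb -- hypothetical example of the
# staging-<name>-* naming schema ("myfeature" is a placeholder)
name 'staging-myfeature-web'
description 'Web node for the "myfeature" staging environment'
run_list(
  'role[gitlab-base]',        # assumed shared base role
  'recipe[gitlab::frontend]'  # assumed cookbook/recipe names
)
```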
@andrewn also asked how we would deal with monitoring, since it's a fundamental component for gathering data about the changes. Would we use a global InfluxDB/Prometheus server? Should we use one for all staging environments, or create one for each?
Last but not least, the database. We'll probably need to use a copy of the staging data set. If we use the same staging database, the schema will get stale very quickly. Also, this could be a good opportunity to test migrations in a controlled way. @northrup has a plan for that.
> Would we use a global InfluxDB/Prometheus server? Should we use one for all staging environments, or create one for each?
Assuming these staging environments have their own network, using a global InfluxDB is not ideal because you have to start messing with Azure firewalls. You also need to open a separate UDP port for every database, create the same continuous queries, etc. This will very quickly overload an already overloaded InfluxDB. As such I recommend giving each environment its own InfluxDB + Grafana + Prometheus setup.
I assume we also need to populate the NFS servers for some realistic testing; presumably this can be done with a clone of gitlab-org, same as on staging. Can the database be a combination of real entries relating to gitlab-org and mumbo-jumbo data, per the point above?
Regarding "define a schema for the Chef roles": at this point we can start using environments (the way Chef was meant to be used).
The idea is that every server (regardless of environment) is installed identically; only environment-specific variables (e.g. package version, NFS servers, threads per server, etc.) are set in environments, not in roles (see the Chef environments documentation for more information).
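To make that concrete, here is a minimal sketch of such an environment file; the environment name, attribute keys, and values are all assumptions for illustration, not taken from the actual cookbooks.

```ruby
# environments/staging-myfeature.rb -- hypothetical example
name 'staging-myfeature'
description 'Ephemeral staging environment for the "myfeature" work'

# Only environment-specific variables live here; roles and run lists
# stay identical across environments.
default_attributes(
  'gitlab' => {
    'package_version' => '9.3.0-ee',                             # assumed attribute/value
    'nfs_servers'     => ['nfs-01.staging-myfeature.internal'],  # assumed attribute/value
    'unicorn_workers' => 4                                       # assumed attribute/value
  }
)
```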
@omame another question / comment here... since #1504 (closed) is the ultimate goal, it would be helpful to see whether this work, if slightly recast or rescoped, would make bigger strides towards that goal. In other words, if working on this advances that, great. But if they have nothing to do with each other, and working on this really just delays #1504 (closed), then we need to reconsider.
Build a dedicated staging Prometheus server that is connected to Consul so it can discover and scrape these staging instances.
Add black-box monitoring to this Prometheus server so that when a staging instance is created we automatically start probing it to detect performance regressions.
We should keep one staging instance running the current production image to use as a baseline; this way we can directly compare for regressions in a single dashboard.
Adding new URLs to this black-box monitoring has to be an extremely simple process, so that developers can just add new URLs to probe before deploying a review app (see the configuration sketch after this list).
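A minimal sketch of what that Prometheus configuration could look like, assuming a Consul agent reachable at consul.service.consul:8500, services tagged staging, and a blackbox exporter at blackbox-exporter:9115; all names, tags, and URLs here are illustrative, not the real setup.

```yaml
scrape_configs:
  # Discover staging nodes through Consul and scrape their exporters.
  - job_name: 'staging-nodes'
    consul_sd_configs:
      - server: 'consul.service.consul:8500'   # assumed Consul address
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: '.*,staging,.*'                 # keep only services tagged "staging"
        action: keep

  # Black-box probes against the URLs developers want checked.
  - job_name: 'staging-blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - 'https://staging-myfeature.example.com/'   # illustrative URL to probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'blackbox-exporter:9115'          # assumed blackbox exporter address
```

With a layout like this, adding a URL is just appending one line to the targets list (or dropping a file into a file_sd_configs directory), which keeps the process simple enough for developers.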
This is already sketched out in that issue. We are simply not acting on it yet because we need to get the first part done, so that we have the right building blocks to get to:
Build a container orchestration infrastructure for front-end staging, probably using Kubernetes for this.
It would be better if we could use a managed service to reduce complexity, but this depends on where we are being hosted.
Use the images previously built with Packer, but this time as containers, allowing fast deployments in staging and also allowing us to start building a release process around a container orchestration tool (a deployment sketch follows below).
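As a rough illustration of that last step, a front-end staging instance could be deployed from such a container image roughly as follows; the image reference, labels, and port are assumptions, not an agreed design.

```yaml
# Hypothetical Deployment for one front-end staging instance, assuming the
# Packer-built image has also been published as a container image.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: staging-myfeature-web
  labels:
    app: gitlab-web
    environment: staging-myfeature
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gitlab-web
      environment: staging-myfeature
  template:
    metadata:
      labels:
        app: gitlab-web
        environment: staging-myfeature
    spec:
      containers:
        - name: gitlab-web
          image: registry.example.com/gitlab-web:staging-myfeature  # illustrative image reference
          ports:
            - containerPort: 8080                                   # assumed application port
```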
I'm wondering now if you had actually read that issue.
@ernstvn maintaining multiple Chef roles will be a manual thing, so I don't think we can really add automation to this issue.
Also, by taking this shortcut we would need to implement in Chef all the logic that we can get from Consul and Kubernetes, which will be much, much more work than leveraging the right tools.
The canary deployment issue has a clear path of steps we need to take to get to this point; we just need to follow it. Taking shortcuts will pull us away from the goal and will add a lot of toil instead of removing it.
> I'm wondering now if you had actually read that issue.
Yes, but it is not at all times fully in the forefront of my mind :-) Great if we can use this issue / effort to address part of the scope of the meta issue. In any case, we need to start chopping that meta issue down to very concrete items with tentative ETAs.
This issue does not appear to have an issue weight set.
As a general guideline, use a weight of 1 for an access request issue or a simple configuration update, and use this as a multiplier for setting the weight. If you are unsure about what weight to set, it is better to add a generous estimate and change it later. If the weight on this issue is 8 or larger, it might be a good idea to consider splitting it up into smaller pieces.
Closing this issue, since the Ephemeral environments project accomplishes this goal. This is being tracked in that epic and doesn't need to be tracked in two places.