Following a call with @ernstvn and @andrewn, we agreed that we need a way to create staging environments for testing new features. This is meant to be an intermediate step towards #1504 (closed), which remains the ultimate long-term goal.
The way we can achieve this in the short to mid term is to leverage Terraform for the creation of the nodes. This is already part of an existing effort to better align staging with the production environment, and the configuration is already coded.
To make this scalable to a number of different staging environments we will need to define a naming schema for the Chef roles, e.g. staging-<name>-*.
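For illustration only, a role following that schema might look like the sketch below; the role, recipe, and environment names (staging-myfeature, gitlab-base, etc.) are hypothetical and not taken from our actual chef-repo.

```ruby
# roles/staging-myfeature-web.rb -- hypothetical example of the
# staging-<name>-* naming schema ("myfeature" is a placeholder)
name 'staging-myfeature-web'
description 'Web node for the "myfeature" staging environment'
run_list(
  'role[gitlab-base]',        # assumed shared base role
  'recipe[gitlab::frontend]'  # assumed cookbook/recipe names
)
```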
@andrewn also asked how we would deal with monitoring, since it's a fundamental component for gathering data about the changes. Would we use a global InfluxDB/Prometheus server? Should we use one for all staging environments, or create one for each?
Last but not least, the database. We'll probably need to use a copy of the staging data set. If we use the same staging database, the schema will get stale very quickly. Also, this could be a good opportunity to test migrations in a controlled way. @northrup has a plan for that.
> Would we use a global InfluxDB/Prometheus server? Should we use one for all staging environments, or create one for each?
Assuming these staging environments have their own network, using a global InfluxDB is not ideal because you have to start messing with Azure firewalls. You also need to open a separate UDP port for every database, create the same continuous queries, etc. This will very quickly overload an already overloaded InfluxDB. As such I recommend giving each environment its own InfluxDB + Grafana + Prometheus setup.
I assume we also need to populate the NFS servers for some realistic testing; presumably this can be done with a clone of gitlab-org, same as on staging. Can the database be a combination of real entries relating to gitlab-org and mumbo-jumbo data, per the point above?
Regarding "define a schema for the Chef roles": at this point we can start using environments (the way Chef was meant to be used).
The idea is that every server (regardless of environment) is installed identically; only environment-specific variables (e.g. package version, NFS servers, threads per server, etc.) are set in environments, not in roles (see the Chef environments documentation for more information).
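To make that concrete, here is a minimal sketch of such an environment file; the environment name, attribute keys, and values are all assumptions for illustration, not taken from the actual cookbooks.

```ruby
# environments/staging-myfeature.rb -- hypothetical example
name 'staging-myfeature'
description 'Ephemeral staging environment for the "myfeature" work'

# Only environment-specific variables live here; roles and run lists
# stay identical across environments.
default_attributes(
  'gitlab' => {
    'package_version' => '9.3.0-ee',                             # assumed attribute/value
    'nfs_servers'     => ['nfs-01.staging-myfeature.internal'],  # assumed attribute/value
    'unicorn_workers' => 4                                       # assumed attribute/value
  }
)
```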
@omame another question / comment here... since #1504 (closed) is the ultimate goal, it would be helpful to see whether this work, if slightly recast or rescoped, would make bigger strides towards that goal. In other words, if working on this advances that, great. But if they have nothing to do with each other, and working on this really just delays #1504 (closed), then we need to reconsider.
Build a dedicated staging Prometheus server that is connected to Consul so it can discover and scrape these staging instances.
Add black-box monitoring to this Prometheus server so that when a staging instance is created we automatically start probing it to detect performance regressions.
We should keep one staging instance running the current production image to use as a baseline; this way we can directly compare for regressions in a single dashboard.
Adding new URLs to this black-box monitoring has to be an extremely simple process, so that developers can just add new URLs to probe before deploying a review app (see the configuration sketch after this list).
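A minimal sketch of what that Prometheus configuration could look like, assuming a Consul agent reachable at consul.service.consul:8500, services tagged staging, and a blackbox exporter at blackbox-exporter:9115; all names, tags, and URLs here are illustrative, not the real setup.

```yaml
scrape_configs:
  # Discover staging nodes through Consul and scrape their exporters.
  - job_name: 'staging-nodes'
    consul_sd_configs:
      - server: 'consul.service.consul:8500'   # assumed Consul address
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: '.*,staging,.*'                 # keep only services tagged "staging"
        action: keep

  # Black-box probes against the URLs developers want checked.
  - job_name: 'staging-blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - 'https://staging-myfeature.example.com/'   # illustrative URL to probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'blackbox-exporter:9115'          # assumed blackbox exporter address
```

With a layout like this, adding a URL is just appending one line to the targets list (or dropping a file into a file_sd_configs directory), which keeps the process simple enough for developers.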
This is already sketched out in that issue. We are simply not acting on it yet because we need to get the first part done, so that we have the right building blocks to get to:
Build a container orchestration infrastructure for front-end staging, probably using Kubernetes for this.
It would be better if we could use a managed service to reduce complexity, but this depends on where we are being hosted.
Use the images previously built with Packer, but this time as containers, allowing fast deployments in staging and also allowing us to start building a release process around a container orchestration tool (a deployment sketch follows below).
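As a rough illustration of that last step, a front-end staging instance could be deployed from such a container image roughly as follows; the image reference, labels, and port are assumptions, not an agreed design.

```yaml
# Hypothetical Deployment for one front-end staging instance, assuming the
# Packer-built image has also been published as a container image.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: staging-myfeature-web
  labels:
    app: gitlab-web
    environment: staging-myfeature
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gitlab-web
      environment: staging-myfeature
  template:
    metadata:
      labels:
        app: gitlab-web
        environment: staging-myfeature
    spec:
      containers:
        - name: gitlab-web
          image: registry.example.com/gitlab-web:staging-myfeature  # illustrative image reference
          ports:
            - containerPort: 8080                                   # assumed application port
```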
I'm wondering now if you had actually read that issue.
@ernstvn maintaining multiple Chef roles will be a manual thing, so I don't think we can really add automation to this issue.
Also, by taking this shortcut we would need to implement in Chef all the logic that we can get from Consul and Kubernetes, which will be much, much more work than leveraging the right tools.
The canary deployment issue has a clear path of steps we need to take to get to this point; we just need to follow it. Taking shortcuts will pull us away from the goal and will add a lot of toil instead of removing it.
> I'm wondering now if you had actually read that issue.
Yes, but it is not at all times fully in the forefront of my mind :-) Great if we can use this issue / effort to address part of the scope of the meta issue. In any case, we need to start chopping that meta issue down to very concrete items with tentative ETAs.
This issue does not appear to have an issue weight set.
As a general guideline, use a weight of 1 for an access request issue or a simple configuration update, and use this as a multiplier for setting the weight. If you are unsure about what weight to set, it is better to add a generous estimate and change it later. If the weight on this issue is 8 or larger, it might be a good idea to consider splitting it up into smaller pieces.
Closing this issue, since the Ephemeral environments project accomplishes this goal. This is being tracked in that epic and doesn't need to be tracked in two places.