We need a tool that can execute a Rails migration (or multiple) from a merge request using the staging database, gather some statistics (timings, etc), then revert the changes. This would allow endbosses to easily assess the impact of database migrations. This tool would take an MR URL as input, SSH into a staging worker, download the migrations (and only the migrations), and run them using a local Rails process. To revert the changes we'd wrap the migration in a transaction that rolls back automatically (this can be done by raising ActiveRecord::Rollback).
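A minimal sketch of that rollback wrapper, assuming PostgreSQL (where DDL is transactional) and a hypothetical `run_downloaded_migrations` helper standing in for however the MR's migrations end up being invoked:

```ruby
# Sketch only: run the MR's migrations inside a transaction, then roll
# everything back so the staging data and schema are left untouched.
ActiveRecord::Base.transaction do
  run_downloaded_migrations # hypothetical helper, not an existing method

  # Raising ActiveRecord::Rollback aborts the surrounding transaction without
  # propagating an error, undoing the changes (DDL included, on PostgreSQL).
  raise ActiveRecord::Rollback
end
```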
To prevent migrations from getting in each other's way we'd have to ensure that only one such process can run at a given time, regardless of the worker that it's running on. We can just use Gitlab::ExclusiveLease for this (make sure to release it once the process is done).
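A sketch of that locking, assuming the `try_obtain`/`cancel` interface of `Gitlab::ExclusiveLease` (Redis-backed, so the lock is shared across workers); `run_migration_test` is a hypothetical stand-in for the wrapper above:

```ruby
LEASE_KEY = 'staging-migration-test'

lease = Gitlab::ExclusiveLease.new(LEASE_KEY, timeout: 1.hour)
uuid = lease.try_obtain

abort 'Another migration test is already running' unless uuid

begin
  run_migration_test # hypothetical
ensure
  # Release the lease even if the test blows up, so we don't block the next run.
  Gitlab::ExclusiveLease.cancel(LEASE_KEY, uuid)
end
```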
The data we'd gather should be (if possible):
The time it took to run the migration
The output of the migration
If possible we should also gather the number of affected rows and what not, but I don't think there's an easy way of getting this.
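As a rough sketch, the timing and output could be captured along these lines (again assuming the hypothetical `run_downloaded_migrations` helper; migrations print their progress to `$stdout`):

```ruby
require 'benchmark'
require 'stringio'

captured = StringIO.new
original_stdout = $stdout

begin
  # Temporarily swap $stdout so the migration output ends up in a string.
  $stdout = captured
  runtime = Benchmark.realtime { run_downloaded_migrations } # hypothetical
ensure
  $stdout = original_stdout
end

puts "Runtime: #{runtime.round(2)} seconds"
puts captured.string
```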
Once this script is in place we should hook this up to Marvin. The idea is that an endboss can run something like:
@marvin: check migrations for https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/915
Marvin would then call the script and simply dump the output in the channel. This removes the need for giving all endbosses SSH access to our staging workers. It also ensures that if the results are fishy there are multiple people noticing it (= everybody in the channel).
We cannot hook this up to CI. Doing so would allow naughty "contributors" to run arbitrary Ruby code using our staging database. This has a 100% guarantee of being exploited in the worst ways imaginable.
@yorickpeterse I have to say that I love this idea and that it makes a lot of sense. Once we manage to convince @briann that we are not that crazy we should start working on this.
I envision this rails application as the origin of a lot of really interesting things to do with Marvin.
cc/ @northrup because we will need to play with Marvin and add commands to it.
You can use SSH to have Marvin do it; welcome to the world of DevOps and ChatOps, it's not your father's landscape... BUT you have Marvin securely fetch the credentials he needs from Vault each time to perform the action.
The other option is, like @pcarranza said, that the Cog Go Relay and function bundle is on the DB host and registered to Marvin, so that invocation of the command happens via a secure channel and can only execute on that host.
The important thing to me is that the credentials stored in Marvin (or Marvin via Vault) can't be used to do any more than provide an MR. Everything else is done on the staging host. Then at least if someone compromises the credentials they'll have to also create an MR to get anything executed.
With the Go Relay are you actually passing code via the relay?
@briann The idea is that we won't let every user run some command. Instead we will only let trusted GitLab employees invoke Marvin commands. Of course this requires that the employee reviewed the migrations, but we can automate this by e.g. showing the source and requiring the user to acknowledge they verified it. To further secure things we could run the script in some sort of cgroup that restricts outgoing network access and whatnot.
@yorickpeterse Oh no, I'm not talking about the user. I'm saying if someone compromises a host running Marvin they might get immediate shell access to staging. If we're giving Marvin the ability to run shell commands on all hosts anyways it's a moot point.
Folks - welcome to the real world. The point of having ChatOps is that you have built and empowered a bot to perform functions on behalf of the user calling for the function in chat. The bot is to operate with the weight and authority of the user that has permissions to call it. Everything will be designed and implemented with sanity, but the days of "don't give an automated process control" are gone.
@pcarranza Technically we already have a Rails app there since we deploy our Omnibus package there. It is however a somewhat outdated version as we only update it whenever we update the DB.
@yorickpeterse the other reason why I see the Rails application as a simple thing is that we don't need to worry sooooo much about granting SSH access to the bot: it could just make a call to an endpoint and that would be it.
It will take some time and conversation to get to a position where we are comfortable with granting this level of access. Not that we will not do it, since it seems to be the way to go, but we need strict controls for when things go wrong.
The plan @pcarranza and I discussed today is as follows:
We'll have some sort of script that:
Downloads the migrations from a merge request
Concatenates these into a single file
Appends some Ruby code to this file to run and measure the migrations
Runs the resulting script using sudo gitlab-rails runner path/to/that/script.rb (see the sketch after this list)
Removes the script from the server
This script is hooked up to Marvin
An extra command is executed by Marvin to roll back the database after running the above script
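A minimal sketch of the last two script steps above (running the assembled file through the Rails runner, then removing it from the server); the path and the exact sudo invocation are assumptions:

```ruby
require 'open3'

script_path = '/tmp/migration_test.rb' # assumed location of the assembled script

begin
  output, status = Open3.capture2e('sudo', 'gitlab-rails', 'runner', script_path)

  puts output
  abort 'The migration test failed' unless status.success?
ensure
  # Always remove the generated script from the server afterwards.
  File.delete(script_path) if File.exist?(script_path)
end
```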
The migration testing script will live outside of GitLab CE/EE since it's GitLab.com specific. This also means we can develop it without being constrained by the GitLab CE/EE release cycle.
We could leverage this quick staging creation to run this; could you both agree on how this could happen? I think that we are really close to getting there.
@ernstvn Having this would make it a lot easier for developers to test their changes. Having said that, it's less crucial than getting snapshots in place, better HA/failover, etc.
I didn't read the entire discussion here, but is this also related to canary deployments, mentioned by @ernstvn, and our strategy of running two versions of the software (the old one and the canary) with the same database? I have some insights about that; I'm not sure where I should write them up / discuss them.
Also removed the availability label and added the performance label since this is more about being able to test how fast migrations are once we have a test environment. Building the test environment is now being addressed in https://gitlab.com/gitlab-com/infrastructure/issues/1815 , so that the scope of this issue is reduced to the automated performance testing.
@ernstvn I think that the availability label matters here because one of the sources of downtime is migrations that misbehave. We should be testing them to ensure that they will not affect general availability at deploy time.
@ernstvn Not sure what the right label would be, but having this would make it a lot easier for developers to test their migrations on staging. The sooner we have this the better.
Does the tool require cogs / Marvin, or is there some tooling that can be done without that? If so, can you update the body of the issue? If it does require the postgres bundle / some cogs development, please mention so here and in https://gitlab.com/gitlab-org/gitlab-ce/issues/34311#note_36471304 .
@ernstvn It doesn't require cogs/Marvin per se; it does however require something that can respond to a command (be it a slash command or something else) and run a script on one of our servers. Considering this is something so specific to GitLab.com I don't really see the point of shipping this in GitLab itself.
Considering this is something so specific to GitLab.com I don't really see the point of shipping this in GitLab itself.
I don't see why it would be specific to GitLab.com; I would imagine it can be a generic tool that is beneficial to anyone deploying an app from their own GitLab instance?
I don't see why it would be specific to GitLab.com; I would imagine it can be a generic tool that is beneficial to anyone deploying an app from their own GitLab instance?
In the issue I propose that the script operate based on merge request URLs for GitLab CE/EE, not that it be able to run arbitrary SQL queries. Testing the database migrations in specific merge requests is not useful for those not working on GitLab.com.
No, this issue is solely about providing a script that lets you run database migrations in an MR against a staging environment, then record the output and time spent in the migration process. This isn't a fully fledged load testing tool.
OK thanks for clarifying this is limited to database migrations @yorickpeterse . Wouldn't it be a generic tool if you just replaced
I give the script a URL to an EE or CE merge requests
with
I give the script a URL to a merge request on my repo (for app running on my infra)
?
Regardless of whether it is generic or just for GitLab.com, do we have any sense of how much effort this is to build and which team would be most appropriate for it? \cc @victorwu re: angle on slash commands / ChatOps and \cc @markpundsack for more AutoDevOps tooling.
@ernstvn Technically there's no limitation in terms of what MR you'd run; I just don't see anybody but us ever using this. Further, GitLab has a concept of post-deployment migrations not seen elsewhere. I most certainly don't want to build some kind of generic tool that supports all the things users could possibly dream of. I just want something dumb and simple so I don't have to ask developers to run their migrations on staging (or worse: do it for them). I fear that by turning this into a GitLab feature we end up delaying this for many more months, with very little to gain.
Regardless of whether it is generic or just for GitLab.com, do we have any sense of how much effort this is to build and which team would be most appropriate for it?
Using Marvin this would probably be a few days of coding, testing and some fine tuning. When using slash commands we need to start making APIs and all that, which complicates things to a point where it can take weeks to complete (since you need to release it as part of GitLab, etc).
Using Marvin this would probably be a few days of coding, testing and some fine tuning
Right, but per https://gitlab.com/gitlab-org/gitlab-ce/issues/34311 and issues that preceded it, we've decided to focus those "few days of coding, testing, and fine tuning" on GitLab ChatOps instead of Marvin. That said, since ChatOps does not exist yet, it will indeed take a few months. Let me see if this is a case we should consider an exception for.
@yorickpeterse , we should not do this through cogs since we're aiming to build GitLab ChatOps and eventually deprecate cogs. Is there something you can imagine doing that neither runs through cogs nor requires waiting on GitLab ChatOps?
Nothing that is better than just SSH'ing into a host and doing the whole procedure manually, which is exactly what I want to stop doing.
As per point 8 in https://gitlab.com/gitlab-com/infrastructure/issues/1161 we want to remove production access from release managers, and by extension, anyone else that is not directly responsible for production (on-call heroes and such).
Staging data is considered production data simply because it's a restore of the production database. This means that we need to drive toward removing SSH access, not granting it.
Given these constraints, I think that this issue needs to be blocked until we have a viable ChatOps platform.
I can time-box an investigation into sanitizing the database; if there are any pointers other than the referenced issue, send them my way. I think we should also do a quick audit with @briann when the testbed is up. The real problem we have, though, is how the heck we would go about sanitizing repo data, since we also want to include repos for Geo. Maybe we restrict ourselves to public repos? That will of course limit the amount of data considerably, I assume.
@jarv the problem with the sanitization is that it has to evolve along with the model.
Regarding repos, we just can't sanitize them due to the immutability of Git objects. Additionally, I think that in Geo we should be testing every kind of repo: public, private, and internal. We need to be sure that they all work.
We're going around in circles here, which is getting rather frustrating. Having a separate staging environment doesn't solve the problem. Our problem isn't that we don't have a database to test our changes on; the problem is that it's too difficult to perform those tests. This leads to developers not performing their tests, or not knowing how to do them properly. The only way we can solve that is by automating it, and the easiest way of doing that is by having a script hooked up to Slack.
I realise that we deprecated cogs, but it feels like we're making things far more complex than necessary because of a feature we may end up shipping 6 months from now.
To show how simple this script can be, I hacked this together in an hour or two, most of which was spent trying to remember how Ruby's Open3 API works. The script is as follows:
```ruby
# frozen_string_literal: true

require 'gitlab'
require 'open3'

def blank?(value)
  value.nil? || (value.respond_to?(:empty?) && value.empty?)
end

def migration_version(path)
  path.split('_', 1)[0].to_i
end

def sort_migrations(migrations)
  migrations.sort do |left, right|
    lbase = File.basename(left.path)
    rbase = File.basename(right.path)

    migration_version(lbase) <=> migration_version(rbase)
  end
end

class Migration
  attr_reader :path, :diff

  def initialize(path, diff)
    @path = path
    @diff = diff
  end
end

Gitlab.configure do |config|
  config.endpoint = 'https://gitlab.com/api/v4'
  config.private_token = 'hunter2'
end

projects = {
  'CE' => 13083,
  'EE' => 278964
}

project = ARGV[0]
mr_id = ARGV[1]
project_id = projects[project]

abort 'You must specify a project for which to test an MR' if blank?(project)
abort 'You must specify the MR ID to test' if blank?(mr_id)

unless project_id
  abort "The value #{project} is not a valid project, valid projects are: " \
    "#{projects.keys.join(', ')}"
end

latest_version = Gitlab.client
  .get("/projects/#{project_id}/merge_requests/#{mr_id}/versions")
  .first

unless latest_version
  abort 'The specified MR does not have any diff versions available'
end

version_id = latest_version.id

version_info = Gitlab.client
  .get("/projects/#{project_id}/merge_requests/#{mr_id}/versions/#{version_id}")

pre_deploy = []
post_deploy = []

version_info.diffs.each do |diff|
  path = diff['new_path']

  if path.start_with?('db/migrate')
    pre_deploy << Migration.new(path, diff['diff'])
  elsif path.start_with?('db/post_migrate')
    post_deploy << Migration.new(path, diff['diff'])
  end
end

if pre_deploy.empty? && post_deploy.empty?
  abort 'The specified MR does not have any migrations'
end

migrations = sort_migrations(pre_deploy) + sort_migrations(post_deploy)

migrations.each do |migration|
  output, status = Open3.capture2e('patch -N -p1', stdin_data: migration.diff)

  unless status.success?
    abort "Failed to install migration #{migration.path}: #{output}"
  end
end

output, status = Open3.capture2e('bundle exec rake db:migrate')

message =
  if status.success?
    'The database migrations were applied successfully:'
  else
    'A problem occurred when applying the database migrations:'
  end

puts "#{message}\n\n#{output.strip}"
```
You can then use it like so:
```
ruby /tmp/test_migration.rb CE 12463
```
Now, in the case of this MR I already have the migrations applied, so it won't end up doing anything. On a real instance, however, it will run the migrations and then output the results to STDOUT.
There are some important things to take into account here though:
This script also runs post-deployment migrations (as it should). This means that we can't run it on regular staging, as the running code may not expect a post-deployment migration to have removed something it depends on.
In case of errors I just write to STDOUT and then exit(1). This might have to be changed for whatever invokes this script from Slack.
We need an actual API token instead of hunter2; this means we should set it somehow via Chef or an environment variable (see the sketch below).
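For example, the hardcoded token in the script's `Gitlab.configure` block could be replaced with an environment lookup; `GITLAB_API_TOKEN` is an assumed variable name, not something that exists today:

```ruby
Gitlab.configure do |config|
  config.endpoint = 'https://gitlab.com/api/v4'
  # GITLAB_API_TOKEN is an assumed name; Chef would write it into the environment.
  config.private_token = ENV.fetch('GITLAB_API_TOKEN')
end
```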
To deploy this we would also need some infrastructure work because of the above. I'm thinking of the following flow:
You run something like @marvin test-migration CE 12463 or /test-migration CE 12463 in Slack
We spin up a new Azure instance with the same configuration as our production primary, using the latest backup to fill it up with data
Whatever image we use should have the above script, or we wget it from some place (whatever works best). The instance also needs to have GitLab EE configured to connect to localhost
We run the above script, passing the details from Slack as arguments
Output gets sent back to Slack, preferably mentioning the user who kicked off the process
In this particular setup we don't have to care about reverting migrations or using some kind of queue as every user gets their own instance.
@yorickpeterse assuming that we do convince ourselves that this is worth doing with cogs before GitLab ChatOps can be delivered... how far away are we / how difficult is it to execute on steps 2 and beyond?
2. We spin up a new Azure instance with the same configuration as our production primary, using the latest backup to fill it up with data
Or is this a question for @pcarranza and team? \cc @sitschner w.r.t. gathering info on the tradeoffs we make in the cogs/ChatOps transition.
We can go ahead and collect the amount of time required vs the benefit. I don't think we should push to allow this in cogs. Every time we do, it just means they will have to code this twice.
I'd rather we use these as examples of what we need from GitLab ChatOps and hopefully get a higher priority on that project.
Scratch that, let's do what was decided yesterday in the chatops meeting.
In fact, let's turn this into a feature: @mydigitalself @JobV @victorwu this is one of those ChatOps scripts that we need, could you build this from product?
There is already a lot of discussion on this issue to provide you with the requirements, so you can simplify away. From production: it has to be extremely secure as it will have access to the production database (BTW, as of today it's an 800GB database already) and any developer has to have access to this.
@marin I believe this is all handled with the Delivery team's current set of tools. If I'm wrong and this issue is still relevant, please reopen it. Thanks.