It seems though that some users managed to create a state with a .tfstate suffix, e.g. terraform.tfstate. However, the .tfstate was just dropped and the state was actually being stored as terraform only.
Now, with the aforementioned bug being fixed the .tfstate is no longer dropped and the state name correctly becomes terraform.tfstate.
The problem for those users who called the API with terraform.tfstate stored as terraform is that their state cannot be found now (because terraform.tfstate doesn't exist for them, but terraform does), resulting in a 404 error.
Affected Users
Affected users are most likely only those who disabled state locking and put a dot in their state names -> see fix.
The reason for this is that when locking, the state name is not the last part of the URL and nothing gets stripped (.../terraform/state/:name/lock) and this resulted in a 404 - it's the first action to happen when locking is configured.
However, when it's not configured the user is able to create a state even when they provide a . - but the actual state is just named with the part before the dot, because the part after the dot is stripped away. (probably by some NGINX configuration).
Reported:
In a confidential issue (contains private customer details - link for GitLab team members)
Attempt to use an existing terraform state file, named in GitLab without .tfstate, but add .tfstate to the filename.
What is the current bug behavior?
Requests for terraform state files with .tfstate appended return a 404 if they do not exist.
What is the expected correct behavior?
The current behavior can be argued to be correct and the affected users are "victims" of a buggy behavior which has now changed to being correct.
Fix for affected users
Affected users can just remove the .tfstate suffix to access their state and continue their operation.
If a user wants the .tfstate suffix, they can migrate the state from the state without the suffix to a new state with the suffix and continue using that one as it's now properly supported.
Timo Furrerchanged title from terraform state objects referred to with .tfstate extension start returning 404 error to Terraform state objects referred to with .tfstate extension start returning {++}404{++} error
changed title from terraform state objects referred to with .tfstate extension start returning 404 error to Terraform state objects referred to with .tfstate extension start returning {++}404{++} error
@bprescott_@timofurrer how tedious/complicated is for user to rename their states to one with .tfstate ? if the procedure is rather painless I would simply document this case and nothing more.
@bprescott_ and @nagyv-gitlab what do you two think? I think we can add a documentation snippet for this or simply redirect the few affected to this issue.
I assume the cleaning up of the .tfstate filenames rather than 404ing them was an accident, but we know about it now.
It's not uncommon to see a particular pattern adopted company wide - folks want to start using Terraform, and so they base it on the work the the early adopters did. So, I'm concerned we might see customers run into this 'wholesale' over the next months and years as self-managed customers upgrade.
The ticket I have reports that terraform now:
attempts to rebuild our entire infrastructure stack
I remain curious about exactly how this 404 gets handled, and whether we can rely on the 404 being surfaced as a failed job.
The implication of this statement is that .. having found no state file, as a minimum terraform plan pushes on and tries to build the whole environment. Does terraform behave that way when no .tfstate file is found? Create one and populate it from scratch?
If it's likely that jobs will push ahead and Terraform will deploy a lot of extra infrastructure, this is going to cause more pain.
The implication of this statement is that .. having found no state file, as a minimum terraform plan pushes on and tries to build the whole environment. Does terraform behave that way when no .tfstate file is found? Create one and populate it from scratch?
Terraform won't find the old state because it is now looking for one including the dot + suffix and just treat it as "new resources".
If it's likely that jobs will push ahead and Terraform will deploy a lot of extra infrastructure, this is going to cause more pain.
Since it tries to deploy the "same" infrastructure it'll most likely fail early on - it could still cause pain though ...
When considering a rollback of !105674 (merged) we have to be aware that we will re-break it for people who now created states with dots in it (which there are some) - so it's most likely not a viable option.
Timo Furrerchanged the descriptionCompare with previous version
changed the description
Timo Furrerchanged the descriptionCompare with previous version
changed the description
Timo Furrerchanged title from Terraform state{- objects referred to with .tfstate extension-} start returning 404 error to Terraform state{+s with .+} start returning 404 error if locking disabled
changed title from Terraform state{- objects referred to with .tfstate extension-} start returning 404 error to Terraform state{+s with .+} start returning 404 error if locking disabled
Timo Furrerchanged title from Terraform states with . start returning 404 error if locking disabled to Terraform states with . start returning 404 error if locking disabled{+ and it was created previously without the . because it was being stripped+}
changed title from Terraform states with . start returning 404 error if locking disabled to Terraform states with . start returning 404 error if locking disabled{+ and it was created previously without the . because it was being stripped+}
Another idea we've brainstormed was to create a fallback mechanism. So, if a foo.bar state name is being looked up for a certain project, we could attempt to fallback to foo if it exists.
@Alexand another idea on top of that would be to put it behind a feature flag so on self-managed the owners could decide how to handle this (basically they could opt out of the fallback even for %15.7) and for %16.0 we could remove the fallback (haven't thought this through yet, just an idea ).
@Alexand@timofurrer@nmezzopera After reading through the discussion, I think the best idea would be to go with the feature flag. On GitLab.com, we can keep the current setup and work with support to minimize the adverse effects. For self-managed, the changes were not released yet. So, we can document them and have them rolled out in %16.0.
@nagyv-gitlab Thanks for taking a look into this! I also strongly recommend the FF approach and not working on the fallback (that can cause more edge cases due to collision)
@andlovu@anna_vovchenko we could also consider adding a banner somewhere in the UI to warn users (maybe only the ones with the . in their name that they need to update their TF_NAME variable)
Add a small note explaining this change/fix and the feature flag on the 15.7 release post? Are we still on time? If this works, it would be nice to have it pointing to the troubleshooting docs (we can work on it today).
Add a FF note on the Terraform States version notes. Something like:
> In 15.7 we've added support use dots in locked terraform states. This is disabled by default and enabled only for GitLab.com. Self-manage users can enable the `allow_dots_on_tf_state_names` feature flag to make use of it. See [also troubleshooting for more details](https://docs.gitlab.com/ee/user/infrastructure/iac/troubleshooting.html)
Nico also mentioned possibly putting a UX warning on the UI. Possibly the terraform states page. I think this has lower priority, as it could probably go on a next milestone?
@nmezzopera, @andlovu, what do you think about adding a warning non-dismissible alert in case any state name contains a period and the feature flag is not enabled?
@anna_vovchenko, @andlovu and @nagyv-gitlab I am wondering if we should warn them more explicitly that having the . in the name will lead to an issue when we remove the feature flag (or the admin enables it)
@anna_vovchenko I think the idea is good, but we'd need to show it more apparent. I expect most users never to open the Terraform state list page. Thus, the notice would go unnoticed.
Could we show it on MRs or pipelines that touched a state file with a period?
Could we show it on MRs or pipelines that touched a state file with a period?
Detecting the usage of unallowed periods is not trivial. It might be best to just a leave a message for everyone without detecting. Here is why.
Before periods were allowed, the periods were not stored in the state name. A state called foo.bar would be stored as foo. So this scenario represented in the image above is not really possible. It would be rather showing a state called test_main, instead of test_main.tf.
So, detecting that someone is trying to use states with periods is tricky, as it would have to happen by reading their script files.
Possibly looking for a regex in files with something like /TF_STATE_NAME\=[\.]. This might be too complicated and slow processing, so not worth it.
Although, the backend can detect when a project tries to create a TF state name with a period. Then before truncating, we could store this information and communicate to the FE. Or, we could try to use rails flash messages. But I'm not sure the flash messages is possible, as these endpoints are being called by terraform client, not by the GitLab frontend. So maybe the only way, is to store information about the project which attempted this. But this all seem kinda hacky.
Or maybe an email is a good idea, after the backend detects the attempt?
We now have 2 situations that could be warned about:
On GitLab.com (ff is enabled) and self-managed which chose to enabled it manually
Periods are already allowed, as the FF is enabled.
If they were trying to use periods before with -lock=false, then they'll probably see a deploy failure on their terraform job. As a new state will be created with the period, that will diverge from the previous state that got the period truncated.
For these cases, the best would probably be to point them to the new troubleshooting section.
Self-managed with ff disabled
The best would probably be to point them to the deprecations page for 15.7.
Terraform has been successfully initialized!╷│ Error: Error acquiring the state lock│ │ Error message: Unexpected HTTP response code 404│ │ Terraform acquires a state lock to protect the state from being written│ by multiple users at the same time. Please resolve the issue above and try│ again. For most commands, you can disable locking with the "-lock=false"│ flag, but this is not recommended.╵
No state got created in the database.
added -lock=false to all of my gitlab-terraform commands. So same as above but replaced the commands:
...script:-gitlab-terraform init -lock=false-gitlab-terraform plan -lock=false-gitlab-terraform apply -lock=false
I was then able to create my terraform state. I can confirm that the state name got stripped, and we only got the left side of the dot. So foo in this case.
I'm not sure what stripped it out. But I strongly believe it was grape itself, because when overriding grape requirements, they are not stripped. See also grape can't handle dots issue.
This will allow us to decouple feature deployment from feature activation on GitLab.com and thus announce a more accurate time period when breaking changes go live.
We currently don't have the information if this feature can be deployed behind a feature flag. Please add that information (if there is a feature flag and how is it called) here.