Use optimistic locking when updating Terraform state (!116069) · Merge requests · GitLab.org / GitLab

Tiger Watson requested to merge 398117-reduce-terraform-locking into master Mar 29, 2023

What does this MR do and why?

Swaps from pessimistic locking to optimistic locking when accessing Terraform state.

There are three parts to this change:

1. Adding a lock version column to `terraform_states` to support Rails optimistic locking (see docs linked above). Whenever a record is updated, Rails increments the lock version automatically. If an update is attempted but the lock version has changed since the record was loaded, the update must not proceed and a `StaleObjectError` is raised.
2. Adding `touch: true` to the `belongs_to` association from `terraform_state_versions` to `terraform_states`. The `updated_at` of the parent `terraform_state` record is now updated whenever a new version is created, which is beneficial for two reasons:
   - UIs that show the "last updated at" of a Terraform state now show the correct timestamp, and, more importantly,
   - this update can be used to trigger the optimistic locking flow when updating a Terraform state. Previously this was not possible, as creating a child `terraform_state_version` did not modify the parent record in any way.
3. Enabling GitLab's `OptimisticLocking` wrapper whenever a Terraform state record is accessed. The wrapper rescues the `StaleObjectError` raised when a record has conflicting updates, and retries the update after reloading the record.
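
The version check and the reload-and-retry step can be modelled without Rails. The following is a minimal plain-Ruby sketch, not the code in this MR: `StateTable`, `Row`, `serial`, and `with_optimistic_retry` are hypothetical names, and the in-memory table stands in for the `terraform_states` table. In Rails the equivalent check is performed by ActiveRecord against the lock version column.

```ruby
# Plain-Ruby model of optimistic locking; no Rails required.
# All class, field, and method names here are hypothetical.

class StaleObjectError < StandardError; end

# Stands in for a database table keyed by id.
class StateTable
  Row = Struct.new(:id, :serial, :lock_version)

  def initialize
    @rows = {}
    @mutex = Mutex.new # models the atomicity of a single UPDATE statement
  end

  def insert(id)
    @mutex.synchronize { @rows[id] = Row.new(id, 0, 0) }
  end

  # Returns a copy, as loading a record from the database would.
  def find(id)
    @mutex.synchronize { @rows.fetch(id).dup }
  end

  # Mimics "UPDATE ... WHERE id = ? AND lock_version = ?": the write only
  # succeeds if nobody else updated the row since it was loaded.
  def update(record)
    @mutex.synchronize do
      current = @rows.fetch(record.id)
      if current.lock_version != record.lock_version
        raise StaleObjectError, "row #{record.id} changed since it was loaded"
      end

      @rows[record.id] = Row.new(record.id, record.serial, record.lock_version + 1)
    end
  end
end

# Reload-and-retry wrapper: on a stale write, load a fresh copy and try again.
# Returns the number of retries that were needed.
def with_optimistic_retry(table, id, max_retries: 3)
  retries = 0
  begin
    record = table.find(id)
    yield record
    table.update(record)
    retries
  rescue StaleObjectError
    retries += 1
    raise if retries > max_retries
    retry
  end
end

table = StateTable.new
table.insert(1)

first = table.find(1)
second = table.find(1)

first.serial = 1
table.update(first) # succeeds; lock_version becomes 1

second.serial = 2
begin
  table.update(second) # stale: this copy was loaded at lock_version 0
rescue StaleObjectError => e
  puts "conflict: #{e.message}"
end

retries = with_optimistic_retry(table, 1) { |record| record.serial += 1 }
puts "retries=#{retries} row=#{table.find(1).to_a.inspect}"
```

Note that readers never block each other in this model: `find` only takes a snapshot, and conflicts are detected at write time.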

There are two main benefits to locking this way:

1. An exclusive lock is not required for read-only or no-op actions, for example fetching an existing state without modifying it, or attempting to modify a state without permission. This should greatly increase the throughput of these endpoints, as these actions (on a single state) can now be served concurrently.
2. The record is not locked, and therefore no database transaction is open, while the Terraform state is pushed to object storage. I'm not aware of any existing problems related to this, but short transactions that don't depend on external services are a good idea in general.
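
To illustrate the second point, here is a hedged plain-Ruby sketch in which the slow step runs without any lock held and only a brief versioned commit is serialised. `VersionedCounter`, `slow_upload`, and `push_state` are hypothetical names, the `sleep` stands in for the push to object storage, and the ordering of upload and commit is an assumption made for the example rather than a description of the actual implementation.

```ruby
# Hypothetical sketch: slow work happens outside the short versioned commit.

class StaleObjectError < StandardError; end

class VersionedCounter
  def initialize
    @value = 0
    @lock_version = 0
    @mutex = Mutex.new # models the atomicity of a single UPDATE statement
  end

  # Reading takes a snapshot; nothing stays locked afterwards.
  def load
    @mutex.synchronize { [@value, @lock_version] }
  end

  # Short compare-and-set: commit only if the version is unchanged.
  def commit(new_value, expected_version)
    @mutex.synchronize do
      raise StaleObjectError if @lock_version != expected_version

      @value = new_value
      @lock_version += 1
    end
  end
end

def slow_upload(payload)
  sleep(0.01) # pretend to push the state file to object storage
  payload
end

# Returns how many times the commit had to be retried.
def push_state(counter)
  retries = 0
  slow_upload("state-file") # no lock or transaction is held here
  begin
    value, version = counter.load
    counter.commit(value + 1, version) # brief critical section
  rescue StaleObjectError
    retries += 1
    retry
  end
  retries
end

counter = VersionedCounter.new
threads = 8.times.map { Thread.new { push_state(counter) } }
total_retries = threads.sum(&:value) # Thread#value joins each thread
value, version = counter.load
puts "value=#{value} lock_version=#{version} retries=#{total_retries}"
```

Every push is eventually committed exactly once, so the final value equals the number of threads; how many retries occur depends on thread scheduling.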

How to set up and validate locally

Basic test

Install Terraform with `brew install terraform`.

Create a basic Terraform project with the following main.tf:

terraform {
  backend "http" {
  }
}

resource "local_file" "test" {
  count    = 10
  content  = timestamp()
  filename = "${path.module}/${count.index}.txt"
}

Initialise a Terraform state:

terraform init \
 -backend-config="address=http://127.0.0.1:3000/api/v4/projects/49/terraform/state/example-state" \
 -backend-config="lock_address=http://127.0.0.1:3000/api/v4/projects/49/terraform/state/example-state/lock" \
 -backend-config="unlock_address=http://127.0.0.1:3000/api/v4/projects/49/terraform/state/example-state/lock" \
 -backend-config="username=root" \
 -backend-config="password=$GITLAB_ACCESS_TOKEN" \
 -backend-config="lock_method=POST" \
 -backend-config="unlock_method=DELETE" \
 -backend-config="retry_wait_min=5"

Apply the changes:
```
terraform apply --auto-approve
```
Observe that 10 files are created, each containing a timestamp.
In the Rails console, read the contents of the state to verify it was persisted correctly:
```
> JSON.parse(Terraform::State.last.latest_version.file.read)
```

Stress test

Same as the above, but execute the apply in a loop from multiple terminals:
```
while true; do terraform apply --auto-approve; done
```
You should see the terminals take turns actually applying changes, while the others return "HTTP remote state already locked" errors (meaning another operation is in progress).
In your gitlab/log/service_measurement.log, you should start to see locking conflict messages such as:
```
... "message":"Optimistic Lock released with retries","name":"Terraform state: 443","retries":1 ...
```
This means a conflict was rescued and retried.
However, no errors should be surfaced to the API in gitlab/log/development.log, and Terraform shouldn't return any errors (except for the "already locked" error, which is expected).
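
During a longer stress run it can help to total the retries per state. The sketch below is one way to do that in plain Ruby, assuming each log line is a single JSON object carrying the `message`, `name`, and `retries` keys seen in the excerpt above. `retries_by_name` is a hypothetical helper, and the sample lines are made up for illustration rather than copied from a real log.

```ruby
require "json"

# Assumption: one JSON object per log line, with the "message", "name" and
# "retries" keys seen in the excerpt; any other keys are ignored.
RETRY_MESSAGE = "Optimistic Lock released with retries"

# Sums the "retries" field per "name" for optimistic-lock retry entries.
def retries_by_name(lines)
  totals = Hash.new(0)
  lines.each do |line|
    begin
      entry = JSON.parse(line)
    rescue JSON::ParserError
      next # skip lines that are not valid JSON
    end
    next unless entry.is_a?(Hash) && entry["message"] == RETRY_MESSAGE

    totals[entry["name"]] += entry["retries"].to_i
  end
  totals
end

# Hypothetical sample lines, shaped like the excerpt above.
sample = [
  '{"message":"Optimistic Lock released with retries","name":"Terraform state: 443","retries":1}',
  '{"message":"Optimistic Lock released with retries","name":"Terraform state: 443","retries":2}',
  '{"message":"something else","name":"other"}',
  "not json"
]
p retries_by_name(sample)
```

Against a real log, the same helper could be fed `File.foreach("gitlab/log/service_measurement.log")` instead of the sample array.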

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Related to #398117

Edited Mar 30, 2023 by Tiger Watson
