Fix DNS cleanup failures

Boros Gábor requested to merge guruprasad/fix-cleanup-job into master

Created by: lgp171188

The DNS cleanup operation has been failing for the past few days, and we have been unable to reproduce the failure because of a flaw in the current cleanup code. This PR fixes the problem in the DNS cleanup code. It also prevents failures from being suppressed on future runs by running the DNS cleanup before all the other cleanup operations, since it relies on the hashes from those operations. Finally, it fixes a bug in the Gandi library code where an exception that occurs for an unknown reason is suppressed.
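
A minimal sketch of the ordering idea, assuming the cleanup steps can be modelled as plain callables. The names `run_cleanup`, `collect_hashes`, `dns_cleanup` and `other_cleanups` are hypothetical and are not taken from this repository. The sketch only illustrates that a failing DNS step aborts the run before any of the data it depends on is deleted.

```python
from typing import Callable, Iterable, List


def run_cleanup(
    collect_hashes: Callable[[], List[str]],
    dns_cleanup: Callable[[List[str]], None],
    other_cleanups: Iterable[Callable[[], None]],
) -> None:
    """Run the DNS cleanup before the cleanups that delete its input data."""
    # Hypothetical step: gather the hashes that identify stale CI resources.
    hashes = collect_hashes()

    # DNS cleanup goes first. If it raises, the exception propagates and
    # nothing below runs, so the same hashes are still available next time.
    dns_cleanup(hashes)

    # Only after the DNS cleanup succeeds are the other resources removed.
    for cleanup in other_cleanups:
        cleanup()
```

With this ordering, a rerun after a DNS failure sees the same input as the failed run, which is what makes the failure reproducible.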

Testing instructions:

  • Rerun an instance of the failed cleanup job with SSH.
  • Log in to the container where the tests are run.
  • Rerun an instance of the build-and-test job and cancel it once the Ansible provisioning of the VM starts in the integration tests group 2 job.
  • Check out this branch inside the CircleCI job container.
  • Activate the venv virtualenv environment.
  • Run make test.integration_cleanup. It will fail with an exception about being unable to connect to Redis.
  • Repeat the previous step a few times to confirm that the error is not suppressed on those runs and the DNS cleanup operation is attempted with the same data.
  • Apply the patch below locally in the CircleCI environment. It sets the age limit for resources to be cleaned up to 0 hours, i.e., every resource becomes eligible for cleanup, so use it with care to ensure that it doesn't interfere with other active CI runs. It also disables the lock around the DNS operations, because the current circle.yml in the master branch, on which the cleanup job runs, doesn't set up Redis for the cleanup job.
diff --git a/cleanup_utils/integration_cleanup.py b/cleanup_utils/integration_cleanup.py
index b0e0519..a57b4bb 100755
--- a/cleanup_utils/integration_cleanup.py
+++ b/cleanup_utils/integration_cleanup.py
@@ -40,7 +40,7 @@ from cleanup_utils.openstack_cleanup import OpenStackCleanupInstance
 # Constants ###################################################################

 # Default age at which things should be cleaned up, in hours
-DEFAULT_AGE_LIMIT = 8
+DEFAULT_AGE_LIMIT = 0
 DEFAULT_CUTOFF_TIME = (
     datetime.utcnow().replace(tzinfo=UTC) - timedelta(hours=DEFAULT_AGE_LIMIT)
 )
diff --git a/instance/gandi.py b/instance/gandi.py
index b1757d7..ee3d9b2 100644
--- a/instance/gandi.py
+++ b/instance/gandi.py
@@ -106,17 +106,16 @@ class GandiV5API:
         Encapsulate logic that is common to high-level DNS operations: grab the global lock, do the operation,
         and retry the whole procedure multiple times if necessary.
         """
-        with cache.lock('gandi_set_dns_record'):  # Only do one DNS update at a time
-            for i in range(1, attempts + 1):
-                try:
-                    logger.info('%s (attempt %d out %d)', log_msg, i, attempts)
-                    result = callback()
-                    break
-                except Exception:  # pylint: disable=broad-except
-                    if i == attempts:
-                        raise
-                    time.sleep(retry_delay)
-                    retry_delay *= 2
+        for i in range(1, attempts + 1):
+            try:
+                logger.info('%s (attempt %d out %d)', log_msg, i, attempts)
+                result = callback()
+                break
+            except Exception:  # pylint: disable=broad-except
+                if i == attempts:
+                    raise
+                time.sleep(retry_delay)
+                retry_delay *= 2
         return result

     def add_dns_record(self, record):
  • Run make test.integration_cleanup and verify that the DNS records corresponding to the cancelled CI job are cleaned up without any errors.
  • Verify that the CI checks for this PR pass.
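
To illustrate why setting DEFAULT_AGE_LIMIT to 0 makes every resource eligible, here is a small sketch of the cutoff arithmetic. It mirrors the `now - timedelta(hours=...)` expression in the patch, but the helper names `cutoff_time` and `is_eligible` and the strict `<` comparison are assumptions, not code from this repository.

```python
from datetime import datetime, timedelta, timezone


def cutoff_time(age_limit_hours: int, now: datetime) -> datetime:
    """Return the timestamp before which resources count as stale."""
    return now - timedelta(hours=age_limit_hours)


def is_eligible(created_at: datetime, age_limit_hours: int, now: datetime) -> bool:
    """Assumed rule: a resource is cleaned up if it predates the cutoff."""
    return created_at < cutoff_time(age_limit_hours, now)


# A fixed "now" keeps the example deterministic.
now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
created = now - timedelta(minutes=30)  # resource created 30 minutes ago

print(is_eligible(created, 8, now))  # False: newer than the 8-hour cutoff
print(is_eligible(created, 0, now))  # True: the cutoff is "now" itself
```

With a limit of 0 the cutoff equals the current time, so anything created earlier qualifies, including resources belonging to CI runs that are still in progress.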
