feat: new DLB health check endpoint

What does this MR do?

This MR adds a new health check endpoint that exposes the health of the DB load balancer (DLB) and its replicas. The endpoint is located at /debug/health/db and returns information in the following JSON format:

{
  "overall_status": "healthy" | "unhealthy",
  "primary": {
    "address": "string",
    "status": "online" | "unknown" | "unreachable",
    "last_pinged_at": [timestamp]
  },
  "replicas": [
    {
      "address": "string",
      "status": "online" | "quarantined" | "unknown" | "unreachable",
      "quarantined_at": [timestamp],
      "last_pinged_at": [timestamp]
    },
    ...
  ]
}

Behind the endpoint, I have created a new DBStatusChecker, which pings the replicas and updates the status served by the API endpoint. To reduce the load on the database and its replicas, DBStatusChecker also takes over the regular DB health check (previously checks.DBChecker). All pings run asynchronously, so API calls never block waiting for a ping to return.

I've added some unit tests:

  • to check the DB health check works as expected
  • to test the HTTP handler (status code and headers)
  • to test the format of the response payload
  • to check the status is returned as expected by the (synchronous) getStatus method
  • to test for race conditions in the concurrency logic

Related to #1602 (closed)

QA steps

Follow the guide here to set up load balancing with GDK.

I used the following registry config:

version: 0.1
storage:
  filesystem:
    rootdirectory: bin/storage
database:
  enabled: true
  host: /home/gdk/gitlab-development-kit/postgresql/
  port: 5432
  user: gdk
  password: gitlab
  dbname: registry
  sslmode: disable
  loadbalancing:
    enabled: true
    hosts:
      - /home/gdk/gitlab-development-kit/postgresql-replica/
redis:
  cache:
    enabled: true
    addr: /home/gdk/gitlab-development-kit/redis/redis.socket
http:
  addr: :5000
  debug:
    addr: :5001
health:
  database:
    enabled: true

Once the registry is running, visit localhost:5001/debug/health/db. It might take a minute for the status of the replicas to initially update.

Author checklist

  • Assign one of the conventional-commit prefixes to the MR.
    • fix: Indicates a bug fix, triggers a patch release.
    • feat: Signals the introduction of a new feature, triggers a minor release.
    • perf: Focuses on performance improvements that don't introduce new features or fix bugs, triggers a patch release.
    • docs: Updates or changes to documentation. Does not trigger a release.
    • style: Changes that do not affect the code's functionality. Does not trigger a release.
    • refactor: Modifications to the code that do not fix bugs or add features but improve code structure or readability. Does not trigger a release.
    • test: Changes related to adding or modifying tests. Does not trigger a release.
    • chore: Routine tasks that don't affect the application, such as updating build processes, package manager configs, etc. Does not trigger a release.
    • build: Changes that affect the build system or external dependencies. May trigger a release.
    • ci: Modifications to continuous integration configuration files and scripts. Does not trigger a release.
    • revert: Reverts a previous commit. It could result in a patch, minor, or major release.
  • Feature flags
    • This change does not require a feature flag
    • Added feature flag: ( Add the Feature flag tracking issue link here )
  • Unit-tests
    • Unit-tests are not required
    • I added unit tests
  • Documentation:
  • Database changes, including schema/background migrations:
    • Change does not introduce database changes
    • MR includes DB changes
      • Do not include code that depends on the schema migrations in the same commit. Split the MR into two or more.
      • Do not include code that depends on background migrations in the same release.
      • Manually run up and down migrations in a postgres.ai production database clone and post a screenshot of the result here.
      • If adding new schema migrations make sure the REGISTRY_SELF_MANAGED_RELEASE_VERSION CI variable in migrate.yml is pointing to the latest GitLab self-managed released registry version. Find the correct registry version here. Make sure to select the branch of the latest GitLab release.
      • If adding new queries, extract a query plan from postgres.ai and post the link here. If changing existing queries, also extract a query plan for the current version for comparison.
        • I do not have access to postgres.ai and have made a comment on this MR asking for these to be run on my behalf.
      • If adding new background migration, follow the guide for performance testing new background migrations and add a report/summary to the MR with your analysis.
  • Ensured this change is safe to deploy to individual stages in the same environment (cny -> prod). State-related changes can be troublesome due to having parts of the fleet processing (possibly related) requests in different ways.
  • If the change contains a breaking change, apply the breaking change label.
  • If the change is considered high risk, apply the label high-risk-change
  • Changes cannot be rolled back
    • Change can be safely rolled back
    • Change can't be safely rolled back
      • Apply the label cannot-rollback.
      • Add a section to the MR description that includes the following details:
        • The reasoning behind why a release containing the presented MR can not be rolled back (e.g. schema migrations or changes to the FS structure)
        • Detailed steps to revert/disable a feature introduced by the same change where a migration cannot be rolled back. (note: ideally MRs containing schema migrations should not contain feature changes.)
        • Ensure this MR does not add code that depends on these changes that cannot be rolled back.

Documentation/resources

Code review guidelines

Go Style guidelines

Reviewer checklist

  • Ensure the commit and MR title are still accurate.
  • If the change contains a breaking change, verify the breaking change label.
  • If the change is considered high risk, verify the label high-risk-change
  • Identify if the change can be rolled back safely. (note: all other reasons for not being able to rollback will be sufficiently captured by major version changes).
Edited by SAhmed
