Closed
Opened Nov 07, 2016 by Pablo Carranza [GitLab]@pcarranza-gitlab

Add /health endpoint to track application readiness

Description

We have already seen a case in production where more than half the fleet loses its connection to the database. This has been non-trivial to troubleshoot while the service is degraded. In fact, we have pending issues open in infrastructure to prevent this from happening again, or at least to simplify troubleshooting this kind of situation.

Proposal

Add a /health endpoint to the application and monitor its output with Prometheus to get a clear, immediate view of how the application is performing.

This endpoint should not be affected by the multiprocessing limitations of the Prometheus Ruby client, because it would report information gathered at request time, right here and right now.

In this endpoint we should perform a set of checks to report the status of the application, for example:

  • a DB ping (including latency)
  • a Redis ping
  • probe FS access in all the possible shards
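The checks listed above could be sketched roughly as follows. This is a minimal Python sketch, not GitLab's implementation: `run_check`, `run_all`, `make_fs_probe`, `ok_probe` and `failing_probe` are hypothetical names introduced for illustration, and real probes would issue an actual DB ping, Redis ping, and one filesystem access per shard.

```python
import os
import time
from typing import Callable, Dict


def run_check(probe: Callable[[], None]) -> Dict[str, object]:
    """Run one probe and record success, latency in ms, and any error text."""
    start = time.monotonic()
    try:
        probe()
        ok, error = True, None
    except Exception as exc:  # any exception marks this check as failed
        ok, error = False, str(exc)
    latency_ms = (time.monotonic() - start) * 1000.0
    return {"ok": ok, "latency_ms": latency_ms, "error": error}


def run_all(probes: Dict[str, Callable[[], None]]) -> Dict[str, Dict[str, object]]:
    """Run every named probe; one failing probe does not stop the others."""
    return {name: run_check(probe) for name, probe in probes.items()}


def make_fs_probe(path: str) -> Callable[[], None]:
    """Build a probe that fails if the given shard path cannot be stat'ed."""
    def probe() -> None:
        os.stat(path)
    return probe


# Hypothetical stand-ins for a DB ping and a Redis ping.
def ok_probe() -> None:
    pass


def failing_probe() -> None:
    raise ConnectionError("connection refused")
```

Calling `run_all({"db": ok_probe, "redis": failing_probe, "shard_default": make_fs_probe("/some/shard")})` returns one entry per check, so a single broken dependency is visible without hiding the state of the others. The shard path here is a placeholder.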

The reasoning behind this endpoint is that it would simplify troubleshooting the status of each worker, removing the need to check logs to understand how each worker is doing.

This would also make it possible to dynamically take traffic away from a given worker that is not ready to take load, by reporting a state like temporarily unavailable (503). The body of the reply could also explain why the service is not available.
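To make the 503 idea concrete, here is a minimal sketch using Python's standard `http.server`. It is an illustration under stated assumptions, not the proposed implementation: `build_response`, `collect_results`, `HealthHandler`, `serve`, the JSON layout, and the port are all hypothetical, and `collect_results` returns canned data where a real worker would run its checks.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple


def build_response(results: Dict[str, Dict[str, object]]) -> Tuple[int, str]:
    """Map check results to an HTTP status and a JSON body explaining them."""
    healthy = all(r["ok"] for r in results.values())
    status = 200 if healthy else 503  # 503: temporarily unavailable
    body = json.dumps(
        {"status": "ok" if healthy else "unavailable", "checks": results},
        sort_keys=True,
    )
    return status, body


def collect_results() -> Dict[str, Dict[str, object]]:
    """Hypothetical placeholder: a real worker would run its checks here."""
    return {
        "db": {"ok": True, "error": None},
        "redis": {"ok": False, "error": "connection refused"},
    }


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = build_response(collect_results())
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the sketch server; the port is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

A load balancer that polls /health could stop routing to a worker as soon as it sees a 503, while an operator reading the body can tell which dependency failed.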

I think this would be extremely easy to implement and would provide a really clean way for the application to report what state it is in, removing the need to reverse engineer its state whenever we are experiencing an outage.

Bonus points

With this we could open the door to Prometheus monitoring GitLab, and start preparing the environment for a future deployment in a Kubernetes cluster with autoscaling.

Links / references

  • Issue about our outage when Azure LBs decided to drop 75% of internal traffic
  • Issue about adding a DB ping in infra
  • Kelsey Hightower - healthz: Stop reverse engineering applications and start monitoring from the inside (a MUST WATCH)

cc/ @stanhu @DouweM @smcgivern @rspeicher

Milestone: 9.1
Labels: Deliverable, Monitor [DEPRECATED], Plan [DEPRECATED], Platform [DEPRECATED], availability, docs-missing
Reference: gitlab-org/gitlab-foss#24240