Geo: Improve Observability Part 1
## Problem to solve

Geo performs a complex set of operations to reliably replicate and verify data from the primary site to secondary sites. As customers set up and run Geo, they find it challenging to understand what is happening under the hood. At any given moment, Geo might be backfilling data, replicating data, verifying replicated data, or trying to recover from replication failures.

Customers find the following challenging:

* Is Geo working?
  * The health indicators for both the site and the data types are green, but they don't see progress.
  * Geo schedules and paces out work so as not to overwhelm the primary site with replication operations. However, customers find it challenging to understand whether Geo is making progress or is stuck.
* When errors occur, it is difficult to decipher what the errors are and which objects are failing to sync.
  * We've made strides to improve this experience with [Link to secondary replication views from component list](https://gitlab.com/gitlab-org/gitlab/-/issues/362306) by linking the top-level dashboard to the detailed replication views page.
  * On the replication view, it's difficult to see how many errors there are and what exactly each error is. Is the error transient (Geo will retry) or terminal?
* What to do about the errors?
  * Once the failing objects are identified, it's difficult to decipher what error is causing them to fail.
  * We already know of common error conditions and their resolutions, but this experience is somewhat disjointed.
  * Is the error intermittent/transient or persistent?
  * Will Geo re-attempt the operation?
  * How many times has it retried?
  * When will Geo give up retrying?

In a number of cases, the cause is simply that the primary site's database has accumulated stale entries, i.e. objects have been removed from storage without the associated database record being deleted. The secondary site therefore attempts to replicate a nonexistent object from the primary. This is frequently seen with customers who are beginning their journey with Geo, leading to a poor first-time experience.
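
To make the stale-entry condition concrete, here is a minimal sketch in Python. It is not Geo's implementation: `TrackedObject`, `find_stale_entries`, and the in-memory `stored_paths` set are hypothetical names chosen for illustration, and the storage lookup is assumed to be a simple existence check.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class TrackedObject:
    """Hypothetical shape of a primary-site database record pointing at a stored object."""
    record_id: int
    storage_path: str


def find_stale_entries(
    records: Iterable[TrackedObject],
    exists_in_storage: Callable[[str], bool],
) -> List[TrackedObject]:
    """Return the records whose underlying object is missing from storage."""
    return [r for r in records if not exists_in_storage(r.storage_path)]


# In-memory stand-in for object storage (an assumption for this example).
stored_paths = {"uploads/1.png", "uploads/3.png"}
records = [
    TrackedObject(1, "uploads/1.png"),
    TrackedObject(2, "uploads/2.png"),  # removed from storage, record left behind
    TrackedObject(3, "uploads/3.png"),
]

stale = find_stale_entries(records, stored_paths.__contains__)
print([r.record_id for r in stale])  # prints [2]
```

A secondary site asked to replicate record 2 would fail on every attempt, because there is nothing on the primary to copy. That is why such failures look persistent rather than transient.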

The site status indicator does not currently elaborate on the reason a site is 'unhealthy'. Customers also attempt to correlate the site status indicator with the replication status, which leads to some misconceptions, e.g. that if the site is 'healthy', replication must be working. It is also possible for a site to be in an 'unhealthy' state while replication continues to make progress.

Ultimately we want to empower Sydney to:

* Be confident after first setup of a Geo site that everything is working and replication is in progress.
* Troubleshoot and resolve replication/verification issues for themselves.

This will give Sydney a higher level of confidence not only in Geo but also in their own ability to troubleshoot issues.

## Intended users

* [Sydney (Systems Administrator)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#sidney-systems-administrator)

## Proposal

Create an end-to-end connected journey for each Geo replicated data type that encompasses the following:

* Which objects are being tracked, replicated and verified.
* Where they are in the replication/validation process (pending, successful, failed, retry, etc.).
* Where successful: when the last successful operation was completed. Are there periodic re-checks to ensure integrity? If so, when will they be performed?
* Where it fails: what the failure is, and whether the failure is terminal or a re-attempt is scheduled. If a re-attempt is scheduled, when it will take place.
* If a retry is due, when it will be attempted and what the failure was on the last attempt.

The journey will start with the main Geo -\> Sites page and provide an interconnected series of actions that allow Sydney to identify and drill down into any persistent issues, identify the cause, and determine next steps.

The site status indicator should be revisited to identify exactly what we want to communicate to Sydney with this status and how best to communicate it.
If a site is unhealthy, the indicator should show details of the failure to assist Sydney with troubleshooting.

We want to explore whether a true dashboard showing the status of all Geo sites would be useful to Sydney: a single screen (without scrolling) where they can see the health of all sites at a glance. This should be considered for a future evolution.

## Documentation

We want to document in detail the end-to-end journey, and the associated Geo pages on it, from the Geo admin dashboard page through to the replica status view.

## Testing

Testing should ensure that the UI reflects the correct states and that up-to-date, accurate information is available from the backend.

## What does success look like, and how can we measure that?

A reduction in the number of support cases where customers have stale entries for objects in their primary site's database, with customers successfully guided to identify and resolve these issues themselves.

Using data coming back from usage ping, define a metric for long-term persistent errors. Our goal is to see a drop in such errors over time across our customer base.

## What is the type of buyer?

* Premium
* Ultimate

## Links / references

A good discussion organised by @zcuddy on this topic can be found [here](https://gitlab.com/gitlab-org/gitlab/-/issues/363637).

## Iterations

- [Enrich existing interface](https://gitlab.com/groups/gitlab-org/-/epics/16553)

  **Why:** This will provide immediate visibility into errors for the systems administrator without needing to resort to the Rails console. Since this is anticipated to be a relatively small effort, we will tackle it first.

  - [Enhance replicable details information](https://gitlab.com/gitlab-org/gitlab/-/issues/388169)
  - https://gitlab.com/groups/gitlab-org/-/epics/16585+
  - [Geo: Add sync failure messages to Replication Details view](https://gitlab.com/gitlab-org/gitlab/-/issues/499652)
- [Improve primary verification experience](https://gitlab.com/groups/gitlab-org/-/epics/16554)

  **Why:** This will provide immediate visibility into verification errors on the primary site without needing to resort to the Rails console. We've noticed an increasing need for this visibility among new Geo customers. This will also help improve the first-time experience.

  - [Geo: Detailed primary verification status view](https://gitlab.com/gitlab-org/gitlab/-/issues/506910)

## Exit Criteria

* All issues in iterations 1-2 are completed (Enrich Existing Interface, Improve Primary Verification Experience).

Remaining iterations have been moved to this [epic](https://gitlab.com/groups/gitlab-org/-/epics/19856). Revisiting the Admin Dashboard is optional, depending on UX availability. Adding metrics and the other issues are nice-to-haves, not mandatory for the exit criteria.

### Participants

- @zcuddy
- @c_fons

<!--triage-serverless v3 PLEASE DO NOT REMOVE THIS SECTION-->
_This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes.
Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc._
<!--triage-serverless v3 PLEASE DO NOT REMOVE THIS SECTION-->

<!-- STATUS NOTE START -->

## Status 2026-02-10

The ~backend work this week focused on fixing a bug where the checksum button incorrectly showed errors on Geo primaries without secondaries configured, which is common during early Dedicated migrations. On the ~frontend, we've been improving the UI by hiding checksum controls for non-verifiable models and adding the ability to force primary checksums directly from the secondary replication view.

:clock1: **total hours spent this week by all contributors**: 30

:tada: **achievements**:

- https://gitlab.com/gitlab-org/gitlab/-/work_items/577986+
- https://gitlab.com/gitlab-org/gitlab/-/issues/588124+

:issue-blocked: **blockers**:

- none

:arrow_forward: **next**:

- Backport https://gitlab.com/gitlab-org/gitlab/-/issues/588124+ to 18.8
- Continue on the UI improvements with
  - https://gitlab.com/gitlab-org/gitlab/-/issues/587189
  - https://gitlab.com/gitlab-org/gitlab/-/issues/523714

_Copied from https://gitlab.com/groups/gitlab-org/-/epics/8240#note_3070856058_

<!-- STATUS NOTE END -->
epic