In the Geo Sites Admin UI, we display the sync status for each data component. For large, active GitLab instances, it's possible that, even without any sync failures, the sync status progress bar regularly stays at 99% and never reaches 100% because there are always new changes in the replication queue.
Why is this a problem?
Admins want to know at a glance that replication is healthy. Showing 99% when things are working as intended is confusing.
Additionally, customers in highly regulated industries who undergo audits of their disaster recovery processes need to prove to auditors that data replication meets their target RPO/RTO. Without a clear understanding of how Geo's async replication works, an auditor may see a report of 99% and assume that replication is not working as intended.
Proposals
It's a tricky problem, as 99% is technically correct, but it sounds like it affects customers in highly regulated industries.
@nhxnguyen I'm just thinking out loud here, but maybe we could give them an option to exclude the new changes in the queue for a certain time (aligning with their RPO/RTO)?
Yes, that may be an idea! We could also think about the following:
We make it possible for a customer to define an RPO target in the UI
We measure the replication lag for each data type
We display a new "RPO status" that would be green if the replication lag is below the target, yellow if it exceeds the target by up to 10% or so, and red if it is falling further behind. This may give another indication that things are OK (see the sketch below).
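To make this concrete, here is a minimal sketch of how such a status could be derived. The function name, the 10% yellow band, and the lag/target inputs are illustrative assumptions, not an agreed design.

```ruby
# Minimal sketch: classify replication lag against an RPO target.
# The 10% tolerance band for "yellow" follows the proposal above.
def rpo_status(lag_seconds, target_seconds, tolerance: 0.10)
  return :green if lag_seconds <= target_seconds
  return :yellow if lag_seconds <= target_seconds * (1 + tolerance)

  :red
end

rpo_status(240, 300) # => :green  (lag within a 5-minute target)
rpo_status(320, 300) # => :yellow (less than 10% over target)
rpo_status(600, 300) # => :red    (falling behind)
```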
We make it possible for a customer to define an RPO target in the UI
Yes, I really like the idea of letting the customer define a RPO target. It would be good if we could define a sensible default through testing on our reference architectures. I see a potential downside in that customers may set targets that Geo inherently has a hard time achieving.
@fzimmer @nhxnguyen If I understand correctly, if users want to set an RPO target, it's one value per customer, not different values for each secondary site. In that case, I presume we could add a field for it in the Geo settings menu.
I see a potential downside in that customers may set targets that Geo inherently has a hard time achieving.
We could put some tips in the UI (hint text) and also in the documentation - the suggested target from our side if that makes sense.
If I understand correctly, if users want to set an RPO target, it's one value per customer, not different values for each secondary site.
@sunjungp I think customers generally have one DR site. However, I suppose you could want to set different values for different sites. It may be best to allow more flexibility and let someone set it per site from the start. Curious what others think.
I think customers generally have one DR site. However, I suppose you could want to set different values for different sites. It may be best to allow more flexibility and let someone set it per site from the start. Curious what others think.
The metrics will already be per secondary site, so I think a setting per site makes sense.
I see a potential downside in that customers may set targets that Geo inherently has a hard time achieving.
Agreed. Related to this, an RPO target sounds like an "alert threshold", while a "target" might be expected to adjust the behavior of synchronization. The underlying metric would already be available in Prometheus. And alerting can be set up, though I've never done it. But if we do add it to the UI, WDYT about a name like "alert threshold"?
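As a rough illustration of how alerting could consume such a metric, a sketch against the Prometheus HTTP API might look like the following. The metric name geo_db_replication_lag_seconds, the Prometheus host, and the 300-second threshold are all assumptions to adapt.

```ruby
# Rough sketch: read a replication-lag metric from the Prometheus HTTP API and
# compare it to an alert threshold. Metric name, host, and threshold are
# assumptions; adjust them to whatever metric(s) Geo actually exposes.
require 'net/http'
require 'json'
require 'uri'

THRESHOLD_SECONDS = 300

uri = URI('http://prometheus.example.com:9090/api/v1/query')
uri.query = URI.encode_www_form(query: 'geo_db_replication_lag_seconds')

response = JSON.parse(Net::HTTP.get(uri))
response.dig('data', 'result').to_a.each do |sample|
  lag = sample['value'].last.to_f # Prometheus returns [timestamp, "value"]
  status = lag > THRESHOLD_SECONDS ? 'ALERT' : 'ok'
  puts "#{sample.dig('metric', 'instance')}: lag=#{lag}s (#{status})"
end
```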
Some more thoughts:
Each data type (and event type) can have different amounts of latency. E.g. if many 4GB package files are added at once, it could take a long time for a secondary to sync them all even though everything else is at 100%. An admin might only care about RPO for Git repos and LFS objects. Depending on how we implement the metric(s), it might be doable to define an RPO per data type or event type, which could be valuable.
There may be some difficulties in implementing latency metrics. I have some open questions.
Future thought: Maybe we could add a "fully verified" RPO later
Maybe we should do a spike to add RPO related metric(s)? Perhaps we'll learn things that can beneficially influence the UI/UX or the amount of effort.
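As one possible starting point for such a spike, here is a hypothetical Rails console sketch for approximating per-data-type lag from the tracking database. The registry classes and the pending/failed scopes are assumptions about the SSF models and would need to be checked against the actual code.

```ruby
# Hypothetical sketch: approximate per-data-type replication lag as the age of
# the oldest registry row that has not finished syncing. Registry classes and
# the `pending`/`failed` scopes are assumptions about the SSF models.
{
  'LFS objects'   => Geo::LfsObjectRegistry,
  'Package files' => Geo::PackageFileRegistry
}.each do |label, registry|
  oldest = [registry.pending, registry.failed]
             .filter_map { |scope| scope.minimum(:created_at) }
             .min
  lag_seconds = oldest ? (Time.current - oldest).round : 0
  puts "#{label}: ~#{lag_seconds}s behind"
end
```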
I agree that metrics should be per secondary site.
Each data type (and event type) can have different amounts of latency. E.g. if many 4GB package files are added at once, it could take a long time for a secondary to sync them all even though everything else is at 100%. An admin might only care about RPO for Git repos and LFS objects. Depending on how we implement the metric(s), it might be doable to define an RPO per data type or event type, which could be valuable.
I think per data type would be great, starting with git data as a first iteration would make sense to me.
Future thought: Maybe we could add a "fully verified" RPO later
That would be great.
I think of the RPO as "How much data is acceptable to lose?" - this may very well depend on the data type. For example, Pages are very likely less critical than Git data in almost all circumstances.
I see a potential downside in that customers may set targets that Geo inherently has a hard time achieving.
We should think carefully about how we phrase this. We can also set a minimum ourselves - for example, 1 minute or 5 minutes after we benchmark - and be transparent that, given how Geo works, any value below this won't be achievable.
Each data type (and event type) can have different amounts of latency. E.g. if many 4GB package files are added at once, it could take a long time for a secondary to sync them all even though everything else is at 100%. An admin might only care about RPO for Git repos and LFS objects. Depending on how we implement the metric(s), it might be doable to define an RPO per data type or event type, which could be valuable.
@mkozono I didn't consider that they might need to set different values per data type. Thanks for explaining it.
I think of the RPO as "How much data is acceptable to lose?" - this may very well depend on the data type. For example, Pages are very likely less critical than Git data in almost all circumstances.
Thanks everyone for the input. I also like the idea of a long-term vision to set this target/threshold by data type.
Maybe we should do a spike to add RPO related metric(s)? Perhaps we'll learn things that can beneficially influence the UI/UX or the amount of effort.
@mkozono Agreed. My first thought is to try with repos. But from your comment in #197147 (comment 597401617), it seems it would be best to wait until projects/wikis are migrated to SSF before trying this spike. WDYT about starting with something already in SSF (like LFS files) so we aren't blocked on SSF migration?
Ideas for the terminology
What if we just called the setting Recovery Point Objective/RPO? "Objective" already denotes a desired/targeted recovery point and this feels like the most MECEFU. In the setting description, we could clarify that it is an "alert threshold" and does not affect any behavior or guarantee you'll meet your RPO.
@mkozono Agreed. My first thought is to try with repos. But from your comment in #197147 (comment 597401617), it seems it would be best to wait until projects/wikis are migrated to SSF before trying this spike. WDYT about starting with something already in SSF (like LFS files) so we aren't blocked on SSF migration?
How about snippet repos? They're already handled by SSF, mutable data types will be harder, and solving this for snippets means it will then be solved for project repos too.
What if we just called the setting Recovery Point Objective/RPO? "Objective" already denotes a desired/targeted recovery point and this feels like the most MECEFU. In the setting description, we could clarify that it is an "alert threshold" and does not affect any behavior or guarantee you'll meet your RPO.
Sounds good to me
I also like the idea of a long-term vision to set this target/threshold by data type.
question: Since the first iteration will only be relevant to one data type, should we go straight to one setting per data type?
tangential thought: I'm not sure how per-data-type settings should look. E.g.:
What if we just called the setting Recovery Point Objective/RPO? "Objective" already denotes a desired/targeted recovery point and this feels like the most MECEFU. In the setting description, we could clarify that it is an "alert threshold" and does not affect any behavior or guarantee you'll meet your RPO.
Maybe we can group by data type in settings?
I like the idea of a spike using snippet repositories as well @mkozono
Hi Mike @mkozono .. (@sranasinghe is out of the office until the end of the month, so I took the liberty... )
We're working with this customer (internal links: ticket, SFDC, confidential ping-back below) on a DR test, and in discussion about incomplete synchronization, they asked whether it's possible to see which elements weren't fully synchronized after DR has been invoked.
Would the Geo status still show in the admin area once a secondary is promoted to primary?
I understand that once it's a primary, Rails stops using the tracking database. Is this a key part of how Rails tracks synchronization? (The process of promoting a site also involves removing the tracking database.)
If there isn't this capability at present, then I take it a new feature request is needed.
@bprescott_ The promoted site will no longer show the tracking DB information (yes, that's where the sync/verification info is) in the status. It should be possible to implement some way to look at it (as long as you haven't deleted it yet), but it seems unusual enough that I would suggest some kind of Troubleshooting section with a script rather than a feature, to reduce the weight enough for it to be prioritized.
In this case we might also want an issue to e.g. "Make it ok to keep the old tracking DB around for a while". Maybe that just means writing strong caveats in that step to "wait until you've fully confirmed the promoted site is good and you don't need the old sync/verification tracking data anymore"?
Thinking it through, for this data to be useful, I assume you'd need data from both the Rails database and the tracking database - project name and namespace/path for example. I take it that's not duplicated in the tracking database.
But more significantly, once promoted, the site's tracking database would only be accessible via SQL.
It should be possible to implement some way to look at it (as long as you haven't deleted it yet), but it seems unusual enough that I would suggest some kind of Troubleshooting section with a script
So .. how about a feature request for ..
Rails console code as an initial iteration that should be run on the secondary before promoting it.
Output that details the sync state in the secondary for reference later (a rough sketch follows).
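Something like the following hypothetical Rails console sketch, run on the secondary during the downtime window, could produce that output. The registry classes, the synced/failed scopes, and the last_sync_failure column are assumptions and would need checking against the running version.

```ruby
# Hypothetical sketch: log a per-data-type summary of sync state, plus a capped
# list of failed items, before promoting the secondary. Class names, scopes,
# and columns are assumptions to verify against the running GitLab version.
[Geo::LfsObjectRegistry, Geo::SnippetRepositoryRegistry].each do |registry|
  puts "#{registry.name}: total=#{registry.count} " \
       "synced=#{registry.synced.count} failed=#{registry.failed.count}"

  # Cap the detail output: dumping every row could take a long time and
  # produce a lot of data during the downtime window.
  registry.failed.limit(100).each do |row|
    puts "  failed registry id=#{row.id}: #{row.last_sync_failure}"
  end
end
```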
@bprescott_ This might be a lot of data, and could take a very long time during the downtime window.
Do we want all data types?
What is enough identifying information for each data type? We'll need to specify exactly the fields we want to output.
This is kind of a request for a partial backup of the DBs in log form.
The general recommendation should be to perform a full backup prior to a major operation like a failover. At a minimum, it's a good idea to back up the DB and the Geo Tracking DB. It'd be safest to take the backup during the downtime, but taking DB backups just prior to downtime would still be much better than nothing, to allow for forensics later.
It looks like the failover documentation does not specifically suggest performing a backup.
WDYT about mentioning this backup stuff in the failover docs, instead of logging the data?
The general recommendation should be to perform a full backup prior to a major operation like a failover.
This would be for a real disaster situation.
In situations where the failover is planned, the primary should be quiesced and/or placed in maintenance mode, and the sync should reach 100% before failing over.
I'll consult the customer on what would be needed.
In omnibus-gitlab#8650, I suggest that we remove the step to delete the tracking DB data during promotion (converting a secondary to a primary), and replace it with a step during demotion (converting a primary to a secondary) to "clear the tracking DB if it has stale data". This would preserve the tracking DB data until the last possible moment. The only thing remaining for this thread is to back up or log related data from the main DB.