In the Geo Sites Admin UI, we display the sync status for each data component. For large, active GitLab instances, it's possible that, even without any sync failures, the sync status progress bar regularly stays at 99% and never reaches 100% because there are always new changes in the replication queue.
Why is this a problem?
Admins want to know at a glance that replication is healthy. Showing 99% when things are working as intended is confusing.
Additionally, customers in highly regulated industries who undergo audits of their disaster recovery processes need to prove to auditors that data replication meets their target RPO/RTO. Without a clear understanding of how Geo's async replication works, an auditor may see a report of 99% and assume that replication is not working as intended.
Proposals
It's a tricky problem, as 99% is technically correct, but it sounds like it affects customers in highly regulated industries.
@nhxnguyen I'm just thinking out loud here, but maybe we could give them an option to exclude the new changes in the queue for a certain time (aligning with their RPO/RTO)?
Yes, that may be an idea! We could also think about the following:
We make it possible for a customer to define an RPO target in the UI
We measure the replication lag for each data type
We display a new "RPO status" that would be green if the replication lag is below the target, yellow if it exceeds the target by up to 10% or so, and red if it is falling further behind. This may give another indication that things are OK (see the sketch below).
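To make this concrete, here is a minimal sketch of how such a status could be derived. The function name, the 10% yellow band, and the lag/target inputs are illustrative assumptions, not an agreed design.

```ruby
# Minimal sketch: classify replication lag against an RPO target.
# The 10% tolerance band for "yellow" follows the proposal above.
def rpo_status(lag_seconds, target_seconds, tolerance: 0.10)
  return :green if lag_seconds <= target_seconds
  return :yellow if lag_seconds <= target_seconds * (1 + tolerance)

  :red
end

rpo_status(240, 300) # => :green  (lag within a 5-minute target)
rpo_status(320, 300) # => :yellow (less than 10% over target)
rpo_status(600, 300) # => :red    (falling behind)
```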
We make it possible for a customer to define an RPO target in the UI
Yes, I really like the idea of letting the customer define a RPO target. It would be good if we could define a sensible default through testing on our reference architectures. I see a potential downside in that customers may set targets that Geo inherently has a hard time achieving.
@fzimmer @nhxnguyen If I understand correctly, if users want to set an RPO target, it's one value per customer, not different values for each secondary site. In that case, I presume we could add a field for it in the Geo settings menu.
I see a potential downside in that customers may set targets that Geo inherently has a hard time achieving.
We could put some tips in the UI (hint text) and also in the documentation - the suggested target from our side if that makes sense.
If I understand correctly, if users want to set an RPO target, it's one value per customer, not different values for each secondary site.
@sunjungp I think customers generally have one DR site. However, I suppose you could want to set different values for different sites. It may be best to allow more flexibility and let someone set it per site from the start. Curious what others think.
I think customers generally have one DR site. However, I suppose you could want to set different values for different sites. It may be best to allow more flexibility and let someone set it per site from the start. Curious what others think.
The metrics will already be per secondary site, so I think a setting per site makes sense.
I see a potential downside in that customers may set targets that Geo inherently has a hard time achieving.
Agreed. Related to this, an RPO target sounds like an "alert threshold", while a "target" might be expected to adjust the behavior of synchronization. The underlying metric would already be available in Prometheus. And alerting can be set up, though I've never done it. But if we do add it to the UI, WDYT about a name like "alert threshold"?
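As a rough illustration of how alerting could consume such a metric, a sketch against the Prometheus HTTP API might look like the following. The metric name geo_db_replication_lag_seconds, the Prometheus host, and the 300-second threshold are all assumptions to adapt.

```ruby
# Rough sketch: read a replication-lag metric from the Prometheus HTTP API and
# compare it to an alert threshold. Metric name, host, and threshold are
# assumptions; adjust them to whatever metric(s) Geo actually exposes.
require 'net/http'
require 'json'
require 'uri'

THRESHOLD_SECONDS = 300

uri = URI('http://prometheus.example.com:9090/api/v1/query')
uri.query = URI.encode_www_form(query: 'geo_db_replication_lag_seconds')

response = JSON.parse(Net::HTTP.get(uri))
response.dig('data', 'result').to_a.each do |sample|
  lag = sample['value'].last.to_f # Prometheus returns [timestamp, "value"]
  status = lag > THRESHOLD_SECONDS ? 'ALERT' : 'ok'
  puts "#{sample.dig('metric', 'instance')}: lag=#{lag}s (#{status})"
end
```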
Some more thoughts:
Each data type (and event type) can have different amounts of latency. E.g. if many 4GB package files are added at once, it could take a long time for a secondary to sync them all even though everything else is at 100%. An admin might only care about RPO for Git repos and LFS objects. Depending on how we implement the metric(s), it might be doable to define an RPO per data type or event type, which could be valuable.
There may be some difficulties in implementing latency metrics. I have some open questions.
Future thought: Maybe we could add a "fully verified" RPO later
Maybe we should do a spike to add RPO related metric(s)? Perhaps we'll learn things that can beneficially influence the UI/UX or the amount of effort.
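As one possible starting point for such a spike, here is a hypothetical Rails console sketch for approximating per-data-type lag from the tracking database. The registry classes and the pending/failed scopes are assumptions about the SSF models and would need to be checked against the actual code.

```ruby
# Hypothetical sketch: approximate per-data-type replication lag as the age of
# the oldest registry row that has not finished syncing. Registry classes and
# the `pending`/`failed` scopes are assumptions about the SSF models.
{
  'LFS objects'   => Geo::LfsObjectRegistry,
  'Package files' => Geo::PackageFileRegistry
}.each do |label, registry|
  oldest = [registry.pending, registry.failed]
             .filter_map { |scope| scope.minimum(:created_at) }
             .min
  lag_seconds = oldest ? (Time.current - oldest).round : 0
  puts "#{label}: ~#{lag_seconds}s behind"
end
```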
I agree that metrics should be per secondary site.
Each data type (and event type) can have different amounts of latency. E.g. if many 4GB package files are added at once, it could take a long time for a secondary to sync them all even though everything else is at 100%. An admin might only care about RPO for Git repos and LFS objects. Depending on how we implement the metric(s), it might be doable to define an RPO per data type or event type, which could be valuable.
I think per data type would be great, starting with git data as a first iteration would make sense to me.
Future thought: Maybe we could add a "fully verified" RPO later
That would be great.
I think of the RPO as "How much data is acceptable to lose?" - this may very well depend on the data type. For example, Pages are very likely less critical than Git data in almost all circumstances.
I see a potential downside in that customers may set targets that Geo inherently has a hard time achieving.
We should think carefully about how we phrase this. We can also set a minimum ourselves - for example, 1 minute or 5 minutes after we benchmark - and be transparent that, given how Geo works, any value below this won't be achievable.
Each data type (and event type) can have different amounts of latency. E.g. if many 4GB package files are added at once, it could take a long time for a secondary to sync them all even though everything else is at 100%. An admin might only care about RPO for Git repos and LFS objects. Depending on how we implement the metric(s), it might be doable to define an RPO per data type or event type, which could be valuable.
@mkozono I didn't consider that they might need to set different values per data type. Thanks for explaining it.
I think of the RPO as "How much data is acceptable to lose?" - this may very well depend on the data type. For example, Pages are very likely less critical than Git data in almost all circumstances.
Thanks everyone for the input. I also like the idea of a long-term vision to set this target/threshold by data type.
Maybe we should do a spike to add RPO related metric(s)? Perhaps we'll learn things that can beneficially influence the UI/UX or the amount of effort.
@mkozono Agreed. My first thought is to try with repos. But from your comment in #197147 (comment 597401617), it seems it would be best to wait until projects/wikis are migrated to SSF before trying this spike. WDYT about starting with something already in SSF (like LFS files) so we aren't blocked on SSF migration?
Ideas for the terminology
What if we just called the setting Recovery Point Objective/RPO? "Objective" already denotes a desired/targeted recovery point and this feels like the most MECEFU. In the setting description, we could clarify that it is an "alert threshold" and does not affect any behavior or guarantee you'll meet your RPO.
@mkozono Agreed. My first thought is to try with repos. But from your comment in #197147 (comment 597401617), it seems it would be best to wait until projects/wikis are migrated to SSF before trying this spike. WDYT about starting with something already in SSF (like LFS files) so we aren't blocked on SSF migration?
How about snippet repos? They're already handled by SSF, mutable data types will be harder, and solving this for snippets means it will then be solved for project repos too.
What if we just called the setting Recovery Point Objective/RPO? "Objective" already denotes a desired/targeted recovery point and this feels like the most MECEFU. In the setting description, we could clarify that it is an "alert threshold" and does not affect any behavior or guarantee you'll meet your RPO.
Sounds good to me
I also like the idea of a long-term vision to set this target/threshold by data type.
question: Since the first iteration will only be relevant to one data type, should we go straight to one setting per data type?
tangential thought: I'm not sure how per-data-type settings should look. E.g.:
What if we just called the setting Recovery Point Objective/RPO? "Objective" already denotes a desired/targeted recovery point and this feels like the most MECEFU. In the setting description, we could clarify that it is an "alert threshold" and does not affect any behavior or guarantee you'll meet your RPO.
Maybe we can group by data type in settings?
I like the idea of a spike using snippet repositories as well @mkozono
Hi Mike @mkozono .. (@sranasinghe is out of the office until the end of the month, so I took the liberty... )
We're working with this customer (internal links: ticket, SFDC, confidential ping-back below) on a DR test, and in discussion about incomplete synchronization, they asked whether it's possible to see which elements weren't fully synchronized after DR has been invoked.
Would the Geo status still show in the admin area once a secondary is promoted to primary?
I understand that once it's a primary, Rails stops using the tracking database. Is this a key part of how Rails tracks synchronization? (The process of promoting a site also involves removing the tracking database.)
If there isn't this capability at present, then I take it a new feature request is needed.
@bprescott_ The promoted site will no longer show the tracking DB information (yes, that's where the sync/verification info is) in the status. It should be possible to implement some way to look at it (as long as you haven't deleted it yet), but it seems unusual enough that I would suggest some kind of Troubleshooting section with a script rather than a feature, to reduce the weight enough for it to be prioritized.
In this case we might also want an issue to e.g. "Make it ok to keep the old tracking DB around for a while". Maybe that just means writing strong caveats in that step to "wait until you've fully confirmed the promoted site is good and you don't need the old sync/verification tracking data anymore"?
Thinking it through, for this data to be useful, I assume you'd need data from both the Rails database and the tracking database - project name and namespace/path for example. I take it that's not duplicated in the tracking database.
But more significantly, once promoted, the site's tracking database would only be accessible via SQL.
It should be possible to implement some way to look at it (as long as you haven't deleted it yet), but it seems unusual enough that I would suggest some kind of Troubleshooting section with a script
So .. how about a feature request for ..
Rails console code as an initial iteration that should be run on the secondary before promoting it.
Output that details the sync state in the secondary for reference later (a rough sketch follows).
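Something like the following hypothetical Rails console sketch, run on the secondary during the downtime window, could produce that output. The registry classes, the synced/failed scopes, and the last_sync_failure column are assumptions and would need checking against the running version.

```ruby
# Hypothetical sketch: log a per-data-type summary of sync state, plus a capped
# list of failed items, before promoting the secondary. Class names, scopes,
# and columns are assumptions to verify against the running GitLab version.
[Geo::LfsObjectRegistry, Geo::SnippetRepositoryRegistry].each do |registry|
  puts "#{registry.name}: total=#{registry.count} " \
       "synced=#{registry.synced.count} failed=#{registry.failed.count}"

  # Cap the detail output: dumping every row could take a long time and
  # produce a lot of data during the downtime window.
  registry.failed.limit(100).each do |row|
    puts "  failed registry id=#{row.id}: #{row.last_sync_failure}"
  end
end
```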
@bprescott_ This might be a lot of data, and could take a very long time during the downtime window.
Do we want all data types?
What is enough identifying information for each data type? We'll need to specify exactly the fields we want to output.
This is kind of a request for a partial backup of the DBs in log form.
The general recommendation should be to perform a full backup prior to a major operation like a failover. At a minimum, it's a good idea to back up the DB and the Geo Tracking DB. It'd be safest to take the backup during the downtime, but taking DB backups just prior to downtime would still be much better than nothing, to allow for forensics later.
It looks like the failover documentation does not specifically suggest performing a backup.
WDYT about mentioning this backup stuff in the failover docs, instead of logging the data?
The general recommendation should be to perform a full backup prior to a major operation like a failover.
This would be for a real disaster situation.
In situations where the failover is planned, the primary should be quiesced and/or placed in maintenance mode, and the sync should reach 100% before failing over.
I'll consult the customer on what would be needed.
In omnibus-gitlab#8650, I suggest that we remove the step to delete the tracking DB data during promotion (converting a secondary to a primary), and replace it with a step during demotion (converting a primary to a secondary) to "clear the tracking DB if it has stale data". This would preserve the tracking DB data until the last possible moment. The only thing remaining for this thread is to back up or log related data from the main DB.