A typical MR that includes a metric, such as !162017 (merged), normally includes that metric with at least two time frames: 7d and 28d.
We also suggest using both time frames.
At the same time, the definitions of the metric for the two time frames are identical apart from two lines: key_path and time_frame itself.
This represents significant code duplication, which makes these MRs harder to review and the code harder to maintain. There's even a risk that the two metrics, which should represent the same count over different time frames, start deviating when adjustments are made.
Desired Outcome
Only one metric definition should be necessary to define a metric across multiple time_frames.
Potential Solution
Create a new RedisHLL internal_events metric definition that includes the time_frame attribute set as an array, for example: time_frame: ['7d', '28d'] (see the sketch after this list)
Modify the RedisHLL instrumentation class [or create a new instrumentation class] so that it can handle the time_frame attribute as an array, generating a hash metric with a structure like:
{ weekly: 12, monthly: 134 }
Make it so that the api/v4/usage_data/metric_definitions endpoint returns two metric definitions generated from just one metric file: a metric definition with the xx.weekly key_path and a metric definition with the xx.monthly key_path.
Test those changes with @rbacovic to make sure that this does not cause problems further down in the data pipeline. This should be possible to achieve by using a feature flag and/or the staging environment (Radovan will need to be contacted so that we understand how exactly we can test this change).
Modify the Metric Dictionary so that it displays the metrics separately:
add an include_path param to api/v4/usage_data/metric_definitions and make it so that, when passed, the returned metrics include their file paths
modify the Metric Dictionary so that it uses this endpoint instead of traversing the GitLab repository's files
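As a rough sketch of how these pieces could fit together (the key_path and description below are hypothetical; only attributes already used elsewhere in this issue are assumed):

```yaml
# Hypothetical combined definition file -- illustrative values only
key_path: counts.users_performing_some_action
description: Count of unique users performing some action
product_group: optimize
value_type: number
status: active
data_category: optional
time_frame: ['7d', '28d']   # one file covering both time frames
```

The api/v4/usage_data/metric_definitions endpoint would then return two definitions generated from this single file (a weekly one with time_frame: 7d and a monthly one with time_frame: 28d), while the instrumentation layer could report both counts together as a hash like { weekly: 12, monthly: 134 }.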
How to Verify
Setting up a new metric with both 7d and 28d creates only a single metric definition.
@michold, @j_lar we seem to be generally in agreement that we should go down this route, however I'm wondering how we should pursue this further? Do we need to create a spike or can we directly break it down and work on it?
@bastirehm I think it may be reasonable to start with a spike. If all works well, we could then discuss the proposal with the Data team and implement the changes from the spike.
Overall, the task doesn't sound impossible, but it would make sense to verify it with a prototype before we get the Data team to commit to the changes.
I really like the idea of letting time_frame be a dimension of the metric definition. If we do that, I think we should allow a metric definition to have no time_frame and let the default value be 7d, 28d for unique metrics and 7d, 28d, all for total count metrics.
@j_lar: I'd be fine with that, with the exception that I'd remove all from total_count metrics.
We know that total counts can be unreliable for event-based metrics due to Redis issues. They also pose limitations for future evolution if we always need to keep a count of all events. To me it would seem better to limit those to migration use only, and by default also use 7d and 28d. For new metrics, a total count can be inferred by adding up all 7d counts.
Thanks for the ping. If we absolutely needed to do this, yes, we could add logic to the data models to handle this pattern, however it would be a pretty heavy lift. Probably a KR-level commitment.
In the current pattern this would just be broken out into two metrics, and our system is designed very well to handle that. So unless the benefit from this change is massive, I would not recommend it. The cost on our side is very high, for ultimately the same result in our reports.
In this case, I'm ultimately a data consumer, but all of our Service Ping reporting and analysis is currently based on the metrics_path/key_path. The Data team would have to find a way to meaningfully flatten the new structure of the data so that we could keep existing reporting. If that cannot be done, we would need to refactor pretty much all of our existing Service Ping-related dashboards, and I'm not sure how we would stitch together the new structure and historical reporting.
Hmm, would it be possible to make this task simpler on your side if we didn't migrate any of the existing metrics? There already are some metrics that support JSON structures, for example geo_node_usage and licensee. In this case, what we're thinking about is creating another style of JSON map metric, including both the weekly and the monthly counts in a single JSON. Would it be possible to achieve that without having you commit to greater changes?
@mdrussell Hmm, as of now, the only way I could think of for testing it would be to work on this spike issue and create a proof of concept. We could hide it behind a feature flag and then open it for a test coordinated with your team.
yes, that would require changes on our side. Not as substantial as the changes you showed above, but still pretty disruptive.
I'm also wondering what the needed change here would be on your side, @mdrussell. key_paths can already have an arbitrary level of nesting in the form of a dot between folders, so changing the naming for future metrics from 'counts.issues_monthly' to 'counts.issues.monthly' shouldn't make any difference, should it?
To also add some more context here: The duplication of all the metadata (group, milestone) etc. across the weekly/monthly versions of a metric creates a significant overhead and was brought up multiple times by devs instrumenting metrics as a big downside.
It also creates a big risk that the two files get out of sync in terms of the metadata; the biggest such risk is an aggregate metric counting different events in its weekly vs. its monthly version.
Hmm, as of now, the only way I could think of for testing it would be to work on this spike issue and create a proof of concept. We could hide it behind a feature flag and then open it for a test coordinated with your team.
Got it, @rbacovic would be the person to collaborate with here.
To also add some more context here: The duplication of all the metadata (group, milestone) etc. across the weekly/monthly versions of a metric creates a significant overhead and was brought up multiple times by devs instrumenting metrics as a big downside.
Understood, I can see how that would be annoying overhead.
I'm also wondering what the needed change here would be on your side, @mdrussell. key_paths can already have an arbitrary level of nesting in the form of a dot between folders, so changing the naming for future metrics from 'counts.issues_monthly' to 'counts.issues.monthly' shouldn't make any difference, should it?
If the change was applied retroactively to current metrics, then that could be very disruptive because the path would effectively change in the example you gave. However, if it is only applied to new metrics, then that is fine. The path for new metrics is arbitrary, but we don't want existing paths to change.
The bigger concern for me is the potential to have multiple metrics in the same yml file, and figuring out how to flatten that data such that the right tags are applied to the right metrics. Currently, that goes against the design. I think it is technically possible, but I would like a better way to figure out the difficulty of implementation.
@mdrussell I think we can in theory just keep that logic internal to us. To my knowledge the metric definitions are requested via an API and not directly read from the files, so we can still let the API return two versions of a metric from the same metric file.
The only question is whether this could complicate things for our users, but I think it could be worth a try.
Test those changes with @rbacovic to make sure that this does not cause problems further down in the data pipeline. This should be possible to achieve by using a feature flag and/or the staging environment (Radovan will need to be contacted so that we understand how exactly we can test this change).
Correct, looks like I am the person in charge of this issue on the Data team side :)
Thanks for the ping. If we absolutely needed to do this, yes, we could add logic to the data models to handle this pattern, however it would be a pretty heavy lift. Probably a KR-level commitment.
Yes, I agree this is a heavy lift, especially as we combined Redis and instance_sql_metrics into one JSON file for easier processing by AEs (in the PREP and PROD layers).
In the current pattern this would just be broken out into two metrics, and our system is designed very well to handle that. So unless the benefit from this change is massive, I would not recommend it. The cost on our side is very high, for ultimately the same result in our reports.
This is a critical point, as we do the complex logic on the extraction side (combining instance_sql and Redis metrics), and everything else downstream is pretty smooth. Here is a brief overview of the process:
In case you change the structure of the Redis endpoint (and JSON file), we need to align it with the Combine json file activity in the diagram; here is the code we should alter: usage_ping. Everything else can stay the same. That is the ideal situation, of course - and I suggest, like @mdrussell, treating this as an OKR-level commitment on our side.
What I see as a potential issue here is producing the same structure as we have now (to avoid downstream impact).
The bigger concern for me is the potential to have multiple metrics in the same yml file, and figuring out how to flatten that data such that the right tags are applied to the right metrics. Currently, that goes against the design. I think it is technically possible, but I would like a better way to figure out the difficulty of implementation.
In other words, if we can implement the solution like this (adding a wrapper before the Combine json file step), we are fine.
---
title: Desired state - Metrics handling
---
graph TD
  subgraph Data[Data team]
    subgraph pipeline[Pipeline]
      get_redis_data[Get Redis data]
      get_redis_data --Keep the old structure--> RedisWrapper[Wrapper]
      RedisWrapper --> combine
    end
  end
Correct, we should look at the piece of code structuring the Redis JSON file, usage_ping - if we can fit the new structure into the current one, then we are good.
Testing: if you can provide a test endpoint with the new structure, we can estimate the Data team's work and effort.
Sounds good @bastirehm, if the data in Snowflake will still flow in the same way then there would be no issue!
The real victory for us would be to fit the new structure into the current one so that no change is required downstream.
Hi @rbacovic! I got a working proof of concept for this issue: !167201 (merged). The way it works is that even though we add a single yaml definition file, it returns two separate definitions [monthly & weekly] in the api/v4/usage_data/metric_definitions/ endpoint. The Service Ping itself also gets generated with both of these metrics added.
I wonder how we could test whether it gets processed correctly by the data pipeline. Here are some ideas:
Maybe there's some way of testing a local api/v4/usage_data/metric_definitions/ endpoint with the data pipeline?
Maybe it would be possible to do that if we deployed it, eg. to staging?
We could also try to get it all the way to production and hide it behind some feature flag
I got a working proof of concept for this issue: !167201 (merged). The way it works is that even though we add a single yaml definition file, it returns two separate definitions [monthly & weekly] in the api/v4/usage_data/metric_definitions/ endpoint.
The Service Ping itself also gets generated with both of these metrics added.
This is used in one pipeline, and yes, agree that we should test it.
Maybe there's some way of testing a local api/v4/usage_data/metric_definitions/ endpoint with the data pipeline?
This is not a problem on our side, as we can quickly create a test environment and just change the API URLs (and perhaps the token), so we can test it without any complication. The main points are to:
see how we should combine metrics in the service ping (as the format is slightly different)
assess the downstream impact on pipelines using api/v4/usage_data/metric_definitions/
Maybe it would be possible to do that if we deployed it, eg. to staging?
Yes, any option is OK for us; what we need from you is a bit of context and the URLs/token.
We could also try to get it all the way to production and hide it behind some feature flag
If you can provide an alternative URL or a feature flag so that production keeps working normally, that is also good for me (a bit riskier option, but it is up to you).
On our side, we need to assess the impact on the pipelines and allocate the work under an OKR.
On 2024-10-08, we had a team sync (agenda, recording). During the show & tell section, I showed this issue. We decided that in the Metric Dictionary we would prefer to have the metrics divided per time_frame. I have updated this issue's potential solution to include that information.
Thanks @michold for the comprehensive walkthrough of the changes.
@mdrussell - what is important for you and the team to check:
In the endpoint https://gitlab.com/api/v4/usage_data/metric_definitions, metrics with multiple time_frame values (i.e. [7d, 28d]) will be exposed as 2 metrics with the suffix _weekly or _monthly
In the codebase (https://gitlab.com/gitlab-org/gitlab/-/tree/master/config/metrics) it will be represented as one metric with the attribute time_frame: ['7d', '28d']
I can't find any usage of option 2 in our repo, and option 1 is in use in the following pipelines:
@juwong - what do you think, will this change affect our saas_usage_ping pipeline, as the URL https://gitlab.com/api/v4/usage_data/metric_definitions is used in it?
@juwong - what do you think, will this change affect our saas_usage_ping pipeline, as the URL https://gitlab.com/api/v4/usage_data/metric_definitions is used in it?
I compared the metric_definitions output of the existing vs. the new endpoint. I'm confused, though: the existing endpoint seems to already return key_paths that end with the suffix _weekly or _monthly. As such, I'm not seeing what the difference is here?
For example in existing endpoint:
- data_category: optional
  key_path: analytics_unique_visits.analytics_unique_visits_for_any_target_monthly
  description: Unique visitors to any analytics feature by month
  product_group: optimize
  value_type: number
  status: active
  time_frame: 28d
In the new endpoint nothing has changed:
- data_category: optional
  key_path: analytics_unique_visits.analytics_unique_visits_for_any_target_monthly
  description: Unique visitors to any analytics feature by month
  product_group: optimize
  value_type: number
  status: active
  time_frame: 28d
Hey @rbacovic, the dependency on gitlab_data_yaml is immense. This is the source of dim_ping_metric, which is used in basically all of our product usage reporting including xMAU and Health Scoring.
I compared the metric_definitions output of the existing vs. the new endpoint. I'm confused, though: the existing endpoint seems to already return key_paths that end with the suffix _weekly or _monthly. As such, I'm not seeing what the difference is here?
Do not be confused (I was in the same situation), as only new metrics (still not released) will use the new structure. Here is the context in the comment with examples.
Hey @rbacovic, the dependency on gitlab_data_yaml is immense. This is the source of dim_ping_metric, which is used in basically all of our product usage reporting including xMAU and Health Scoring.
Yes @mdrussell, this can have a drastic impact on the models, as many models are joined with dim_ping_metric. In that case, I suggest creating a new OKR to shape this one properly and mitigate the risk.
Support from the Analytics Instrumentation team is needed here to fully understand the impact.
@juwong For more context: old metrics will have the same structure as they had before, and new metrics are made to mimic the old metrics' structure. The difference is that before, a weekly and a monthly metric would have been split into two files inside the repository (example: 1, 2), but after adding this change, they would be merged into a single file in the repository (example). However, in the https://gitlab.com/api/v4/usage_data/metric_definitions payload, we will still return them as two separate metrics. In other words, this change is made to affect the codebase metric definitions, but not the api/v4/usage_data/metric_definitions endpoint.
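To make the mapping concrete, here is a minimal sketch with a hypothetical key_path (the real examples are the ones linked above):

```yaml
# Single definition file in the repository (one metric, both time frames)
key_path: counts.users_performing_some_action
time_frame: ['7d', '28d']

# What api/v4/usage_data/metric_definitions would keep returning (old structure)
- key_path: counts.users_performing_some_action_weekly
  time_frame: 7d
- key_path: counts.users_performing_some_action_monthly
  time_frame: 28d
```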
@rbacovic @mdrussell Considering that the change is invisible (as we're using old metrics' structure) at the level of the https://gitlab.com/api/v4/usage_data/metric_definitions endpoint, can it cause any downstream issues with gitlab_data_yaml?
Considering that the change is invisible (as we're using old metrics' structure) at the level of the https://gitlab.com/api/v4/usage_data/metric_definitions endpoint, can it cause any downstream issues with gitlab_data_yaml?
Thanks @michold - as of now, let's see what @mdrussell says, as he is more aware of the usage in downstream models.
For now we have no metrics with the suffix [_monthly, _weekly].
In case the https://gitlab.com/api/v4/usage_data/metric_definitions/ URL contains metrics with the suffix, it will be reflected in the dim_ping_metric model and we will not be able to join metrics with the suffix.
From my point of view, this will have a bigger impact on the downstream model.
Can we open an OKR for this exploration and work out the solution and effort needed to align with the new suffixed metrics? We need to do a deep dive on the impact, as dim_ping_metric is an SSOT for the metrics and any change can harm other models.
Thanks for the information @michold, based on our sync call with @rbacovic, we should be good to proceed! Essentially, this change will reduce the amount of code needed to create 7d and 28d metrics off the same base calculation. However, these metrics will still appear as two different metrics in the Metrics Dictionary, our gitlab_data_yaml pipeline will still work the same way, and the metrics delivered in the payload will still function the same as they would today.
(Pulling this out of the thread below as I don't want to muddle the more important technical conversation)
but after adding this change, they would be merged into a single file in the repository (example). However, in the https://gitlab.com/api/v4/usage_data/metric_definitions payload, we will still return them as two separate metrics
@michold Does the metrics dictionary depend on the repository or https://gitlab.com/api/v4/usage_data/metric_definitions? For context: We have several dashboards that generate a link to the metrics dictionary based on the metric name, so we would have to update those to remove the _weekly or _monthly to ensure that new metrics appear in the search results. Also, more generally, we would need to educate end users on searching for specific metrics (i.e., not including _weekly or _monthly in the name).
Current status: MR with the PoC metrics merged. We should wait until Monday to make sure the metrics made it safely through the automated Service Ping process.
After that, we can request review on the MR officially supporting this new approach.