There are many cases where it would be desirable to add additional properties to an event besides just the label (string), property (string) and value(number). To give a few examples:
AI Gateway is passing more properties as part of code suggestions context - v1.0.0, If we want to migrate it to InternalEvents, we must allow more than 3 additional Properties.
IDE extensions are passing more properties as part of [code suggestions context - v3.0.0]
Existing snowplow events uses extra attribute from standard context to pass an additional information.
New events also sometimes would ideally require more than 3 attributes and limiting them to less requires combining the attribute.
This is good opportunity for Analytics Instrumentation to unify all these use cases into InternalEvents.
Desired Outcome
We should allow additional properties apart from label (string), property (string) and value(number). These can be custom property names or predefined names.
Proposed Solution
Use the extra field in Standard context since it is already parsed out to send the additional fields. (Outcome of #462378 (comment 1912619617))
While we are using extra behind the scenes we'll still need to make sure that devs use the additional power correctly, we should still establish guidelines and validations as pointed out by @syasonik above:
generator should call out usage of label, property, value first and clearly communicate that these should be used before reaching for further, named properties.
Any additional named property needs to be defined in the event definition and have a description
Validations should create a warning if label, property are unused when someone defines a custom property in their event definition.
Example event definition:
---description: A code suggestion was requestedaction: suggestion_requestedidentifiers:- user- namespaceadditional_properties: label: description: id of the suggestion for cross-referencing events property: description: The LLM model being used in the suggestion value: description: The amount of tokens input into the LLM lang: description: The language a suggestion was requested inproduct_section: dev...
I think we had a conversation around more than 3 properties already when we reintroduced label etc for Internal Events. Back then we decided to stay with 3 for now, but it's becoming apparent again that we definitely need more, otherwise we'll never be able to migrate all existing events.
From my point of view the MVC here would be to start using the extra field of the standard context again which at least is already mirrored into gsc_extra in common_mart.mart_behavior_structured_event, so it wouldn't require any additional changes in the warehouse for now, even though this field is not the best queryable. Wdyt?
I'm also wondering a bit how Snowplow deals with additional fields in their own DBT models and warehouses since self-describing-events have long been their recommended approach, and those only have one standard property, so I assume that they already must have some way of making those events efficiently queryable.
From my perspective, I am fine with passing the additional information in the extra field. However, it may be better to just implement new fields in the standard context... even though it wouldn't strictly necessitate more work in the warehouse right now, adding the information to extra would make it a bit harder for analysts to query and parse, so we would end up creating a new field in dbt anyways.
I agree that an ideal solution would be to add any new fields to the standard context if possible. Due to the size of data sets that we are dealing with now and the limitations of Tableau with regards to working with large complex data sets, any additional information added to gsc_extra would need to be modeled in DBT. I think that allowing for additional properties to be passed in gsc_extra would reduce efficiency for all downstream teams - Analytics Engineers, Functional Analysts & stakeholders that need information.
@mdrussell@eneuberger
I agree with @nbelokolodov that adding fields to standard context wouldn't be scalable. We're talking about properties here that are specific to one or at most a few different events, so they're by definition not standard, and there's potentially 100s of them.
@nbelokolodov I think there are essentially 4 possibilities, which I tried to put into ascending order of assumed difficulty to implement:
reuse the extra field
use a single custom_context which allows for additional undefined properties and pull that into a separate column (e.g. additional_properties_context)
use a separate custom_context for every event that needs more properties.
switch to using self_described_events which essentially allows to add a single custom context for events
compare to new properties in extra how does handling new contexts look like? (for example p1 from the description)
@mdrussell & @cbraza have been involved in modeling the code suggestions context (that is a project I was not involved in), so I would let them speak to p1 with more knowledge on what that might look like if we allow all teams to create new contexts.
Handling the Code Suggestions context required many additions to existing models + a new lineage of models just for events with the Code Suggestions context. If we were anticipating heavily-used fields in extra, I would imagine that we would need to do something similar to parse them out and surface them for reporting and analysis.
In terms of the 4 possibilities that Basti surfaced, I would definitely want @mdrussell to weigh in.
@nbelokolodov@bastirehm Is it possible for us to work through a tangible example event or 2 so that we'd know what we we would be working with? Personally I think some of this is hard to visualize in the abstract.
If we were anticipating heavily-used fields in extra...
@cbraza, @eneuberger, @mdrussell: What I think is important to note is that we'd always encourage people to use property, label and value and not create the expectation that we'd parse any information out beyond that.
It's a lot more realistic to tell people that they can't expect a good querying experience for anything that goes beyond these metrics, than telling them they can't send that information at all.
I'll work on providing some examples to all the outlined possibilities.
use a single custom_context which allows for additional undefined properties and pull that into a separate column (e.g. additional_properties_context)
use a separate custom_context for every event that needs more properties.
switch to using self_described_events which essentially allows to add a single custom context for events
IMO these are in order of what would be most easily queryable in the warehouse. The bottom line is that, since engineers are going to be sending different properties for every event type, we're going to have to parse something. Adding more contexts that are either completely custom or can basically hold anything just creates more joins without creating much logical consistency.
If there is a field that becomes common for analysis, we can parse it out of the extra field in dbt and make it easier for querying.
And we'd tell them in the docs that lang is queriable through gsc_extra. While that's slightly more complicated to remember it will allow us to switch to a different solution at a later point more easily.
Sounds good @bastirehm, thanks for those examples. If leveraging the extra field is preferable for all parties, then I think that's what we should move forward with. @cbraza@eneuberger we can always parse out important/recurring values from that field into a new column.
@gitlab-org/analytics-section/analytics-instrumentation/engineers I documented that we're using extra in the description, do you think it would be fine just treating those similar to label, property etc in the methods, e.g. make the following a valid method call:
the same way project/user/namespace are separate from additional properties because they serve a different function and have different recommended use-cases, so do the contents of extra
filtering by extra as a whole should be decently performant, which is also the main criteria for label/property/value (ex - if only one additional field is needed, querying for WHERE extra = { lang: 'Ruby' } seems helpful)
some other thoughts on usage:
the internal events generator could easily integrate extra into the existing additional properties flow without adding new screens if we include it in additional_properties; the internal consistency and simplicity seems desirable
by prompting users to actively call out that they intend to use extra, they're more likely to understand how the data will be interpreted & stored down the line
we could prompt for descriptions from each key in extra from the generator & validate that the event definition must include a corresponding description in dev/test at runtime
we could add simple validations asserting that extra should not be used if none of label/property/value is defined (to protect engineers from misusing the field where one of the other properties would serve better)
if we really felt sassy, we could raise in dev/test when extra is provided with key/value pairs that could have been sent as label/prop/val and those fields are "unused" [ie, lack a description] -- caveat being if the varying types are provided to a key in extra we could raise on false positives
the very simple versions of this validation are probably safer than raising -> schema validation & disabling options from the generator; but better to disallow more things now and loosen up later than to not account for unexpected/undesirable data
[Edit to add]: A la the value_json_schema key, requiring a json-schema definition for each usage of extra does seem like an elegant option for each of validation, documentation, and scalability. If there's a lot of reuse of any specific extra schemas, we'd know the fields were likely candidates to potentially covert to their own columns/tables.
@syasonik Thanks for your input and I can totally see your reasoning, to make the usage of extra more clear to our users.
However, I see the usage of extra just as a temporary measure:
Using label etc. is only relevant for Snowplow. It doesn't matter for Redis(HLL) calculated metrics or Product Analytics.
Even for Snowplow the main reason using extra that it's easiest to use in the Warehouse without additional work from the data team (see #462378 (comment 1914917882)).
The long running goal would be freely named additional_properties without any restrictions to certain names.
The working group already made a decision to validate using solely Product Analytics in the long run and export data from there. Due to Snowplow's license changes it's likely that we're going to migrate off of it within the next years regardless of the outcome of that.
Putting extra that clearly into the API to me feels like redoing a mistake from the past and putting an implementation detail into our API, which could again force us to migrate a lot of API calls, to move properties out of extra, as soon as we find a way to get rid of extra.
I like your idea of validations if someone uses an extra field where they should've used one of the existing keys, but that's something we could also do without putting everything into the extra container, couldn't we? Any additional_property would need to be added to the event definition and we could check when a property is defined and label/property are not used yet.
Couldn't we handle the user communication about the importance of using label etc. that way and through the generator?
Using an extra object inside additional properties is good for readability because it shows the developer's intent to use custom properties. We may call it custom as well.
move properties out of extra, as soon as we find a way to get rid of extra.
Yes, If we find out other ways to pass custom properties, then it should be fine without extra object.
So, I do not have any strong opinion on this. IMO This is about what kind of interface we want for internal events in the GitLab codebase and Cloud connector backends.
@bastirehm I am also in favor of @syasonik's point of view. I feel like treating extra keys as "same class citizens" with label could create confusion. For example, developers could expect them to provide the same database performance as label [+1 to Sarah's validation suggestion]. Similarly, the Product team could be confused when trying to retrieve the data from the warehouse, cause for example, a call like:
As for now, we do not know when we'll be migrating from Snowplow [and whether the new system will provide different handling of parameters], I think it would be safer to keep extra separated.
We understand the added complexity in querying for extra right now.
We assume that using extra is only a temporary solution. extra is a limitation of the current system based on structured Snowplow Events in the data warehouse. We already have a decision to move towards using Product Analytics directly in the long run which does not have this limitation.
Our other endpoint for internal events (Redis) does not have this limitation either.
While it might make querying more understandable, using extra is still leaking implementation details that will initially confuse users of internal events
As a result, we're going to treat the usage of extra as an implementation detail and go with using a structure like the following :
While we are using extra behind the scenes we'll still need to make sure that devs use the additional power correctly, we should still establish guidelines and validations as pointed out by @syasonik above:
generator should call out usage of label, property, value first and clearly communicate that these should be used before reaching for further, named properties.
Any additional named property needs to be defined in the event definition and have a description
Validations should create a warning if label, property are unused when someone defines a custom property in their event definition.
At the same time I added #465764 to evaluate whether we can already right now streamline event sending more which would allow for something like @syasonik suggested in !142839 (comment 1748527407) where we just use label etc as external_key
@ankit.panchal I added this issue as a blocker for the front end one since we're updating all schema here
@bastirehm within this issue could we collaborate with a team who would like to implement these additional properties to make sure everything goes smooth, especially down to DW?
Good point, I'll look for an appropriate event to implement and verify @nbelokolodov. Should we potentially split out work on the Generator since that'll depend on both Frontend and Backend being implemented?