Document the deprecation of `ApplicationExperiment#publish_to_database` method
Some of this topic was raised in #345932 (closed) as well (because it adds a lot of confusion), but this issue documents why the `publish_to_database` method is deprecated and raises awareness of some performance concerns with its continued use.
- On lines 13-14 we constrain the experiment context to a single object, which means that you can't really use the library the way it was designed -- with both a project and a namespace, for instance.
- These lines also force an undocumented order, so if an experiment context included both a namespace and a project, only one would be logged. Not only does this apply a limit that doesn't exist elsewhere, the ordering is somewhat arbitrary and its importance is hard to document. Instead, all of the useful objects in the experiment context should be logged.
- We see frequent use of "root" namespaces, which can be an inefficient query if it's not already being made within that area of code. Instead, we can use the namespace we have a reference to, and the root namespace can be calculated in reporting, where that query doesn't impact user performance.
- Line 14 uses a method in `ExperimentSubject`, a model and table that are only used here, and both exist exclusively for generating reporting around the SaaS product. I question our need for having tables and models in the open source project that are only used for tracking experiments on gitlab.com.
- On lines 16-17 we lose clarity on the assigned variant, because `ExperimentSubject` uses an enum to indicate experiment inclusion/exclusion -- so instead of recording which variant was assigned, we only log that a variant was assigned. That feels incomplete and hard to document, and it also makes it hard to generate reports when there are multiple variants.
- Diving further into `Experiment#add_subject`, we see some other database inefficiencies with the `find_or_create_by!(name: name)` call. This triggers a read operation, and if the record doesn't exist, it triggers a write operation. A better way to approach this would be to assume that the record for the experiment exists (created in a migration, potentially?) or to simply add an `experiment_name` column to the `experiment_subjects` table. The `experiments` table appears to only include `name` after all.
- Additional logic in `Experiment#record_subject_and_variant!` does something similar to the previous point, with an `experiment_subjects.find_or_initialize_by` call. This is also one read and possibly one write operation. Based on the `changed?` check below it, we sometimes don't bother saving the record, so we really only load it to see if we should save it. A better approach would be to have a database constraint, simply try to write to the database, and fail gracefully if the record can't be created. Benchmarking this would be interesting.
In summary, there are at best two database reads every time an experiment is run/published, which is often. These records aren't read to be used; they're only read to determine whether we should write them, so at worst we perform four database operations where a single one would do. That single operation could be writing a record to a table and failing if the record violates a constraint -- which is ultimately what gitlab-experiment is trying to do by tracking events using Snowplow. Those events end up as database rows eventually, but not in the main product, and they're handled asynchronously, avoiding any of the overhead of having models and tables in the main product.
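To make the operation counting above concrete, here is a pure-Ruby sketch -- no ActiveRecord; the `FakeTable` class and its operation counter are invented for illustration -- contrasting the read-then-maybe-write pattern of `find_or_create_by!` with a single constrained write:

```ruby
# A toy "table" with a unique index on :name, counting database operations.
class FakeTable
  DuplicateKey = Class.new(StandardError)

  attr_reader :operations

  def initialize
    @rows = {}
    @operations = 0
  end

  def select(name) # one read operation
    @operations += 1
    @rows[name]
  end

  def insert(name) # one write operation; "unique index" enforced here
    @operations += 1
    raise DuplicateKey, name if @rows.key?(name)

    @rows[name] = { name: name }
  end

  # find_or_create_by!-style: always a read, sometimes also a write.
  def find_or_create_by(name)
    select(name) || insert(name)
  end

  # Constraint-based: always exactly one write; duplicates fail gracefully.
  def create_or_ignore(name)
    insert(name)
  rescue DuplicateKey
    nil
  end
end

a = FakeTable.new
2.times { a.find_or_create_by("exp") } # read + write, then read
puts a.operations                      # => 3

b = FakeTable.new
2.times { b.create_or_ignore("exp") }  # write, then failed write
puts b.operations                      # => 2
```

In Postgres terms, the constraint-based path corresponds to `INSERT ... ON CONFLICT DO NOTHING` (exposed in Rails via `insert_all`), which is the single-operation shape this issue is advocating.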
Since we know that the `experiments` and `experiment_subjects` tables are imported into Snowflake and used for reporting -- as are the Snowplow events -- we know we get the same overall result, but without the performance concerns. There are additional benefits, in that the Snowplow events aren't constrained by some of the details outlined above. The Snowplow events also properly use pseudonymized IDs, while the `experiment_subjects` table currently does not, which carries GDPR violation concerns.
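For context on the pseudonymization point: a common approach is to derive a stable but non-reversible identifier with a keyed hash, so reports can still join on the ID without exposing the raw value. A minimal sketch, assuming an HMAC-based scheme -- the `pseudonymize` helper and key are hypothetical, not GitLab's actual implementation:

```ruby
require "openssl"

# Derive a stable pseudonymous ID from a raw ID using a secret key (HMAC).
# The same input always maps to the same output, so reporting can join on
# it, but the raw ID can't be recovered without the key.
def pseudonymize(raw_id, secret_key)
  OpenSSL::HMAC.hexdigest("SHA256", secret_key, raw_id.to_s)
end

id = pseudonymize(42, "a-secret-key")
puts id                                       # 64-character hex digest
puts pseudonymize(42, "a-secret-key") == id   # deterministic: true
```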
In deprecating and trying to remove the `publish_to_database` concept, we minimize the experimentation footprint, where possible, as it relates to paying self-managed customers who don't benefit from experimentation. The deprecation of `publish_to_database` addresses several of these performance concerns and is intended to help educate about more performant and secure ways to report on experimentation.
This is also driven by the hope that we might eventually remove the `experiments` and `experiment_subjects` tables and models; since we know they're imported into Snowflake, we needn't keep them for historical reasons.