Improve analytics AI prompt to support full CubeJS schema
What does this MR do and why?
Improves our proof of concept prompt for natural language querying to support (almost) everything in our CubeJS schema (including Sessions and ReturningUsers).
Also adds a couple of extra descriptions to dimensions to help the LLM understand our different URL dimensions (and what a "page" is: Which are the most popular pages?
), as well as understand "agentName" === "browser" (How many unique users does each browser have?
).
Also adds an instruction to never return null
values in the query, as I was consistently getting nulls back in some scenarios, causing the request to CubeJS to fail.
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Screenshots or screen recordings
No visual differences.
How to set up and validate locally
Apply setup steps for backend MR this one depends on:
- Enable AI features locally: https://docs.gitlab.com/ee/development/ai_features/#test-ai-features-locally
- Make sure you've enabled Vertex access and configured it with your GDK. This action does not use Anthropic's API, only Vertex.
- Enable the
:generate_cube_query
flag locally. NEW
Then:
- Follow these instructions to setup Product Analytics in GDK.
- Onboard
Product Analytics
by setting up the feature at Project > Analytics > Analytics dashboards. - Visit the visualization designer at Project > Analytics > Analytics dashboards > Visualization designer.
- Enter prompts in the new UI which target the new parts of our schema
Examples:
Count of sessions grouped weekly
Before:
{
"measures": [
"TrackedEvents.pageViewsCount"
],
"dimensions": [
"TrackedEvents.derivedTstamp"
],
"timeDimensions": [
{
"dimension": "TrackedEvents.derivedTstamp",
"granularity": "week",
"dateRange": "last week"
}
],
"filters": [
{
"member": "TrackedEvents.eventName",
"operator": "equals",
"values": [
"page_view"
]
}
],
"limit": 10000
}
(Model does not know about Sessions, has made a (reasonable!) guess that page views count means sessions).
After:
{
"measures": [
"Sessions.count"
],
"dimensions": [
"Sessions.startAt"
],
"timeDimensions": [
{
"dimension": "Sessions.startAt",
"granularity": "week",
"dateRange": "last week"
}
],
"limit": 10000
}
(Correctly uses Sessions.count now it knows about it)
Which are the most popular pages?
Before:
{
"measures": [
"TrackedEvents.pageViewsCount"
],
"dimensions": [
"TrackedEvents.pageUrlhosts"
],
"filters": [
{
"member": "TrackedEvents.eventName",
"operator": "equals",
"values": [
"page_view"
]
}
],
"limit": 10,
"order": {
"TrackedEvents.pageViewsCount": "desc"
}
}
(Model does not understand our different URL dimensions. It uses pageUrlhosts
to mean "pages", which is incorrect)
After:
{
"measures": [
"TrackedEvents.pageViewsCount"
],
"dimensions": [
"TrackedEvents.pageUrlpath"
],
"filters": [
{
"member": "TrackedEvents.eventName",
"operator": "equals",
"values": [
"page_view"
]
}
],
"limit": 10,
"order": {
"TrackedEvents.pageViewsCount": "desc"
}
}
(Model has been instructed that page path === "a page" and now correctly shows list of top pages)
How many unique users does each browser have?
Before:
{
"measures": [
"TrackedEvents.uniqueUsersCount"
],
"dimensions": [
"TrackedEvents.agentName"
],
"limit": 10000
}
(Unable to reproduce again now I am back adding in examples. But I was seeing the model use agentName, agentVersion, and osName when asking about "browser")
After:
{
"measures": [
"TrackedEvents.uniqueUsersCount"
],
"dimensions": [
"TrackedEvents.agentName"
],
"limit": 10000
}
(Consistently using agentName
to answer to "browser" questions)
How many returning users do we have?
Before:
{
"measures": [
"TrackedEvents.uniqueUsersCount"
],
"limit": 10000,
"order": {}
}
(Model does not know about ReturningUsers schema at all. Makes an (incorrect) guess with uniqueUsersCount)
After:
{
"measures": [
"ReturningUsers.allSessionsCount"
],
"segments": [
"ReturningUsers.returningUsers"
],
"limit": 10000
}
(Model correctly using ReturningUsers.allSessionsCount. Interesting to see here it has applied a segment, which it has not been instructed about or told that exists at all
Example of the null values no longer being returned:
Prompt: show all events per week by name
Before:
(Randomly adding null
to some arrays. Well, not always random - it seemed to always add a null for this particular prompt)
After:
(nulls no longer added)
Related to #441557 (closed)