Improve analytics AI prompt to support full CubeJS schema (!144821) · Merge requests · GitLab.org / GitLab

Elwyn Benson requested to merge 441557-improve-prompt-to-support-full-cubejs-schema into master Feb 14, 2024

What does this MR do and why?

Improves our proof of concept prompt for natural language querying to support (almost) everything in our CubeJS schema (including Sessions and ReturningUsers).

Also adds a couple of extra descriptions to dimensions to help the LLM understand our different URL dimensions (and what a "page" is: Which are the most popular pages?), as well as understand "agentName" === "browser" (How many unique users does each browser have?).

Also adds an instruction to never return null values in the query, as I was consistently getting nulls back in some scenarios, causing the request to CubeJS to fail.

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Screenshots or screen recordings

No visual differences.

How to set up and validate locally

Apply setup steps for backend MR this one depends on:

Enable AI features locally: https://docs.gitlab.com/ee/development/ai_features/#test-ai-features-locally
Make sure you've enabled Vertex access and configured it with your GDK. This action does not use Anthropic's API, only Vertex.
Enable the :generate_cube_query flag locally. NEW

Then:

Follow these instructions to setup Product Analytics in GDK.
Onboard Product Analytics by setting up the feature at Project > Analytics > Analytics dashboards.
Visit the visualization designer at Project > Analytics > Analytics dashboards > Visualization designer.
Enter prompts in the new UI which target the new parts of our schema

Examples:

Count of sessions grouped weekly

Before:

{
    "measures": [
        "TrackedEvents.pageViewsCount"
    ],
    "dimensions": [
        "TrackedEvents.derivedTstamp"
    ],
    "timeDimensions": [
        {
            "dimension": "TrackedEvents.derivedTstamp",
            "granularity": "week",
            "dateRange": "last week"
        }
    ],
    "filters": [
        {
            "member": "TrackedEvents.eventName",
            "operator": "equals",
            "values": [
                "page_view"
            ]
        }
    ],
    "limit": 10000
}

(Model does not know about Sessions, has made a (reasonable!) guess that page views count means sessions).

After:

{
    "measures": [
        "Sessions.count"
    ],
    "dimensions": [
        "Sessions.startAt"
    ],
    "timeDimensions": [
        {
            "dimension": "Sessions.startAt",
            "granularity": "week",
            "dateRange": "last week"
        }
    ],
    "limit": 10000
}

(Correctly uses Sessions.count now it knows about it)

Which are the most popular pages?

Before:

{
    "measures": [
        "TrackedEvents.pageViewsCount"
    ],
    "dimensions": [
        "TrackedEvents.pageUrlhosts"
    ],
    "filters": [
        {
            "member": "TrackedEvents.eventName",
            "operator": "equals",
            "values": [
                "page_view"
            ]
        }
    ],
    "limit": 10,
    "order": {
        "TrackedEvents.pageViewsCount": "desc"
    }
}

(Model does not understand our different URL dimensions. It uses pageUrlhosts to mean "pages", which is incorrect)

After:

{
    "measures": [
        "TrackedEvents.pageViewsCount"
    ],
    "dimensions": [
        "TrackedEvents.pageUrlpath"
    ],
    "filters": [
        {
            "member": "TrackedEvents.eventName",
            "operator": "equals",
            "values": [
                "page_view"
            ]
        }
    ],
    "limit": 10,
    "order": {
        "TrackedEvents.pageViewsCount": "desc"
    }
}

(Model has been instructed that page path === "a page" and now correctly shows list of top pages)

How many unique users does each browser have?

Before:

{
    "measures": [
        "TrackedEvents.uniqueUsersCount"
    ],
    "dimensions": [
        "TrackedEvents.agentName"
    ],
    "limit": 10000
}

(Unable to reproduce again now I am back adding in examples. But I was seeing the model use agentName, agentVersion, and osName when asking about "browser")

After:

{
    "measures": [
        "TrackedEvents.uniqueUsersCount"
    ],
    "dimensions": [
        "TrackedEvents.agentName"
    ],
    "limit": 10000
}

(Consistently using agentName to answer to "browser" questions)

How many returning users do we have?

Before:

{
    "measures": [
        "TrackedEvents.uniqueUsersCount"
    ],
    "limit": 10000,
    "order": {}
}

(Model does not know about ReturningUsers schema at all. Makes an (incorrect) guess with uniqueUsersCount)

After:

{
    "measures": [
        "ReturningUsers.allSessionsCount"
    ],
    "segments": [
        "ReturningUsers.returningUsers"
    ],
    "limit": 10000
}

(Model correctly using ReturningUsers.allSessionsCount. Interesting to see here it has applied a segment, which it has not been instructed about or told that exists at all 🤔 The use of segment seems correct here 👻)

Example of the null values no longer being returned:

Prompt: show all events per week by name

Before:

(Randomly adding null to some arrays. Well, not always random - it seemed to always add a null for this particular prompt)

After:

(nulls no longer added)

Related to #441557 (closed)

Edited Feb 15, 2024 by Elwyn Benson

Improve analytics AI prompt to support full CubeJS schema

What does this MR do and why?

MR acceptance checklist

Screenshots or screen recordings

How to set up and validate locally

Merge request reports