fix(search): limit search page size to 4

Shinya Maeda requested to merge add-search-page-limit-parm into main

What does this merge request do and why?

This MR sets the default page size of search results to return the top-k 4 results. Currently, this API returns 50 results, which is the default page size. https://gitlab.slack.com/archives/C06D5C70MD2/p1717389900122399?thread_ts=1717140113.672329&cid=C06D5C70MD2

Background: we previously returned the top-k 4 results when we were using PgVector.

How to set up and validate locally

Visit the OpenAPI playground at http://localhost:5052/docs and try out the search API:


Request example (no page_size; defaults to 4 results)
curl -X 'POST' \
  'http://localhost:5052/v1/search/gitlab-docs' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "type": "string",
  "metadata": {
    "source": "string",
    "version": "string"
  },
  "payload": {
    "query": "string"
  }
}'
Response example
{
  "response": {
    "results": [
      {
        "id": "9c290d11eb9ca409ea08666e21599872",
        "content": "# Strings and the Text data type\n\nWhen adding new columns to store strings or other textual information:\n\n1. We always use the `text` data type instead of the `string` data type.\n1. `text` columns should always have a limit set, either by using the `create_table` with\n   the `#text ... limit: 100` helper (see below) when creating a table, or by using the `add_text_limit`\n   when altering an existing table.\n\nThe standard Rails `text` column type cannot be defined with a limit, but we extend `create_table` to\nadd a `limit: 255` option. Outside of `create_table`, `add_text_limit` can be used to add a [check constraint](https://www.postgresql.org/docs/11/ddl-constraints.html)\nto an already existing column.\n\n## Background information\n\nThe reason we always want to use `text` instead of `string` is that `string` columns have the\ndisadvantage that if you want to update their limit, you have to run an `ALTER TABLE ...` command.\n\nWhile a limit is added, the `ALTER TABLE ...` command requires an `EXCLUSIVE LOCK` on the table, which\nis held throughout the process of updating the column and while validating all existing records, a\nprocess that can take a while for large tables.\n\nOn the other hand, texts are [more or less equivalent to strings](https://www.depesz.com/2010/03/02/charx-vs-varcharx-vs-varchar-vs-text/) in PostgreSQL,\nwhile having the additional advantage that adding a limit on an existing column or updating their",
        "metadata": {
          "title": "Strings and the Text data type",
          "source_type": "doc",
          "md5sum": "14084c5231fd0bee2eef9dfcff6e0e7303ffddeaf9397ab9f3da3424997fb19f",
          "source_url": "https://gitlab.com/help/development/database/strings_and_the_text_data_type",
          "source": "development/database/strings_and_the_text_data_type.md"
        }
      },
      {
        "id": "450e041ffee9ddafdd69e7861a2c688a",
        "content": "# Storing SHA1 Hashes As Binary\n\nStoring SHA1 hashes as strings is not very space efficient. A SHA1 as a string\nrequires at least 40 bytes, an additional byte to store the encoding, and\nperhaps more space depending on the internals of PostgreSQL.\n\nOn the other hand, if one were to store a SHA1 as binary one would only need 20\nbytes for the actual SHA1, and 1 or 4 bytes of additional space (again depending\non database internals). This means that in the best case scenario we can reduce\nthe space usage by 50%.\n\nTo make this easier to work with you can include the concern `ShaAttribute` into\na model and define a SHA attribute using the `sha_attribute` class method. For\nexample:\n\n```ruby\nclass Commit < ActiveRecord::Base\n  include ShaAttribute\n\n  sha_attribute :sha\nend\n```\n\nThis allows you to use the value of the `sha` attribute as if it were a string,\nwhile storing it as binary. This means that you can do something like this,\nwithout having to worry about converting data to the right binary format:\n\n```ruby\ncommit = Commit.find_by(sha: '88c60307bd1f215095834f09a1a5cb18701ac8ad')\ncommit.sha = '971604de4cfa324d91c41650fabc129420c8d1cc'\ncommit.save\n```\n\nThere is however one requirement: the column used to store the SHA has _must_ be\na binary type. For Rails this means you need to use the `:binary` type instead\nof `:text` or `:string`.",
        "metadata": {
          "title": "Storing SHA1 Hashes As Binary",
          "source_type": "doc",
          "md5sum": "a6b178e6af5264cba0d45793d6a56213c427abafc93edfccfe3797c3c51d4bc8",
          "source": "development/database/sha1_as_binary.md",
          "source_url": "https://gitlab.com/help/development/database/sha1_as_binary"
        }
      },
      {
        "id": "526c037f52a219cd1ba8f8d92004fc36",
        "content": "| `environment_tiers` | array of strings | no       | The [tiers of the environments](../../ci/environments/index.md#deployment-tier-of-environments). Default is `production`. |\n| `interval`          | string           | no       | The bucketing interval. One of `all`, `monthly` or `daily`. Default is `daily`. |\n| `start_date`        | string           | no       | Date range to start from. ISO 8601 Date format, for example `2021-03-01`. Default is 3 months ago. |\n\nExample request:\n\n```shell\ncurl --header \"PRIVATE-TOKEN: <your_access_token>\" \"https://gitlab.example.com/api/v4/groups/1/dora/metrics?metric=deployment_frequency\"\n```\n\nExample response:\n\n```json\n[\n  { \"date\": \"2021-03-01\", \"value\": 3 },\n  { \"date\": \"2021-03-02\", \"value\": 6 },\n  { \"date\": \"2021-03-03\", \"value\": 0 },\n  { \"date\": \"2021-03-04\", \"value\": 0 },\n  { \"date\": \"2021-03-05\", \"value\": 0 },\n  { \"date\": \"2021-03-06\", \"value\": 0 },\n  { \"date\": \"2021-03-07\", \"value\": 0 },\n  { \"date\": \"2021-03-08\", \"value\": 4 }\n]\n```\n\n## The `value` field\n\nFor both the project and group-level endpoints above, the `value` field in the\nAPI response has a different meaning depending on the provided `metric` query\nparameter:\n\n| `metric` query parameter   | Description of `value` in response |\n|:---------------------------|:-----------------------------------|",
        "metadata": {
          "title": "DevOps Research and Assessment (DORA) key metrics API",
          "source_type": "doc",
          "md5sum": "fda1b564e54e874ce1688196fed43f45e0f315fde241e2e913c96e0ae27c950b",
          "source_url": "https://gitlab.com/help/api/dora/metrics",
          "source": "api/dora/metrics.md"
        }
      },
      {
        "id": "befdbee311020d12a274f5ebb6497475",
        "content": "`before: String`, `after: String`, `first: Int`, and `last: Int`.\n\n###### Arguments\n\n| Name | Type | Description |\n| ---- | ---- | ----------- |\n| <a id=\"mergerequestpipelinesref\"></a>`ref` | [`String`](#string) | Filter pipelines by the ref they are run for. |\n| <a id=\"mergerequestpipelinesscope\"></a>`scope` | [`PipelineScopeEnum`](#pipelinescopeenum) | Filter pipelines by scope. |\n| <a id=\"mergerequestpipelinessha\"></a>`sha` | [`String`](#string) | Filter pipelines by the sha of the commit they are run for. |\n| <a id=\"mergerequestpipelinessource\"></a>`source` | [`String`](#string) | Filter pipelines by their source. |\n| <a id=\"mergerequestpipelinesstatus\"></a>`status` | [`PipelineStatusEnum`](#pipelinestatusenum) | Filter pipelines by their status. |\n| <a id=\"mergerequestpipelinesupdatedafter\"></a>`updatedAfter` | [`Time`](#time) | Pipelines updated after this date. |\n| <a id=\"mergerequestpipelinesupdatedbefore\"></a>`updatedBefore` | [`Time`](#time) | Pipelines updated before this date. |\n| <a id=\"mergerequestpipelinesusername\"></a>`username` | [`String`](#string) | Filter pipelines by the user that triggered the pipeline. |\n\n##### `MergeRequest.reference`\n\nInternal reference of the merge request. Returned in shortened format by default.\n\nReturns [`String!`](#string).\n\n###### Arguments\n\n| Name | Type | Description |\n| ---- | ---- | ----------- |",
        "metadata": {
          "source_type": "doc",
          "title": "GraphQL API resources",
          "md5sum": "f5592ccdaa60f1ff8f43a7a4927c8c210ae87b0cc7c98dbbc7429a0014f92d64",
          "source": "api/graphql/reference/index.md",
          "source_url": "https://gitlab.com/help/api/graphql/reference/index"
        }
      }
    ]
  },
  "metadata": {
    "provider": "vertex-ai",
    "timestamp": 1717476383
  }
}
Request example (page_size: 1)
curl -X 'POST' \
  'http://localhost:5052/v1/search/gitlab-docs' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "type": "string",
  "metadata": {
    "source": "string",
    "version": "string"
  },
  "payload": {
    "query": "string",
    "page_size": 1
  }
}'
Response example
{
  "response": {
    "results": [
      {
        "id": "9c290d11eb9ca409ea08666e21599872",
        "content": "# Strings and the Text data type\n\nWhen adding new columns to store strings or other textual information:\n\n1. We always use the `text` data type instead of the `string` data type.\n1. `text` columns should always have a limit set, either by using the `create_table` with\n   the `#text ... limit: 100` helper (see below) when creating a table, or by using the `add_text_limit`\n   when altering an existing table.\n\nThe standard Rails `text` column type cannot be defined with a limit, but we extend `create_table` to\nadd a `limit: 255` option. Outside of `create_table`, `add_text_limit` can be used to add a [check constraint](https://www.postgresql.org/docs/11/ddl-constraints.html)\nto an already existing column.\n\n## Background information\n\nThe reason we always want to use `text` instead of `string` is that `string` columns have the\ndisadvantage that if you want to update their limit, you have to run an `ALTER TABLE ...` command.\n\nWhile a limit is added, the `ALTER TABLE ...` command requires an `EXCLUSIVE LOCK` on the table, which\nis held throughout the process of updating the column and while validating all existing records, a\nprocess that can take a while for large tables.\n\nOn the other hand, texts are [more or less equivalent to strings](https://www.depesz.com/2010/03/02/charx-vs-varcharx-vs-varchar-vs-text/) in PostgreSQL,\nwhile having the additional advantage that adding a limit on an existing column or updating their",
        "metadata": {
          "title": "Strings and the Text data type",
          "source_type": "doc",
          "md5sum": "14084c5231fd0bee2eef9dfcff6e0e7303ffddeaf9397ab9f3da3424997fb19f",
          "source_url": "https://gitlab.com/help/development/database/strings_and_the_text_data_type",
          "source": "development/database/strings_and_the_text_data_type.md"
        }
      }
    ]
  },
  "metadata": {
    "provider": "vertex-ai",
    "timestamp": 1717476668
  }
}
Request example (page_size: 21, exceeding the maximum)
curl -X 'POST' \
  'http://localhost:5052/v1/search/gitlab-docs' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "type": "string",
  "metadata": {
    "source": "string",
    "version": "string"
  },
  "payload": {
    "query": "string",
    "page_size": 21
  }
}'
Response example
{
  "detail": [
    {
      "type": "less_than_equal",
      "loc": [
        "body",
        "payload",
        "page_size"
      ],
      "msg": "Input should be less than or equal to 20",
      "input": 21,
      "ctx": {
        "le": 20
      },
      "url": "https://errors.pydantic.dev/2.7/v/less_than_equal"
    }
  ]
}
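The behavior shown in the three examples above (default of 4, caller override allowed, hard cap of 20 enforced by Pydantic per the error response) can be sketched in plain Python. This is an illustrative sketch, not the actual service code; the constant names are assumptions:

```python
# Illustrative sketch of the page_size rules this MR introduces.
# The real service enforces these via a Pydantic field constraint
# (hence the "less_than_equal" error above); this mirrors that logic.
DEFAULT_PAGE_SIZE = 4   # new default set by this MR (was 50)
MAX_PAGE_SIZE = 20      # requests above this are rejected with a 422

def resolve_page_size(page_size=None):
    """Return the effective page size for a search request."""
    if page_size is None:
        return DEFAULT_PAGE_SIZE
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError(
            f"Input should be less than or equal to {MAX_PAGE_SIZE}"
        )
    return page_size
```

In Pydantic this would typically be a single field declaration such as `page_size: int = Field(default=4, ge=1, le=20)`.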

Merge request checklist

  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.
Edited by Shinya Maeda
