fix(search): limit search page size to 4
What does this merge request do and why?
This MR sets the page size of search results to 4, so the API returns the top 4 (top-k) results by default. Currently, this API returns 50 results, which is the default page size. Slack thread: https://gitlab.slack.com/archives/C06D5C70MD2/p1717389900122399?thread_ts=1717140113.672329&cid=C06D5C70MD2
Background: When we were using PgVector, we returned the top 4 results.
How to set up and validate locally
Visit the OpenAPI playground at http://localhost:5052/docs and try out the search API:
Request example (no page_size, so the default of 4 applies)
curl -X 'POST' \
'http://localhost:5052/v1/search/gitlab-docs' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"type": "string",
"metadata": {
"source": "string",
"version": "string"
},
"payload": {
"query": "string"
}
}'
Response example
{
"response": {
"results": [
{
"id": "9c290d11eb9ca409ea08666e21599872",
"content": "# Strings and the Text data type\n\nWhen adding new columns to store strings or other textual information:\n\n1. We always use the `text` data type instead of the `string` data type.\n1. `text` columns should always have a limit set, either by using the `create_table` with\n the `#text ... limit: 100` helper (see below) when creating a table, or by using the `add_text_limit`\n when altering an existing table.\n\nThe standard Rails `text` column type cannot be defined with a limit, but we extend `create_table` to\nadd a `limit: 255` option. Outside of `create_table`, `add_text_limit` can be used to add a [check constraint](https://www.postgresql.org/docs/11/ddl-constraints.html)\nto an already existing column.\n\n## Background information\n\nThe reason we always want to use `text` instead of `string` is that `string` columns have the\ndisadvantage that if you want to update their limit, you have to run an `ALTER TABLE ...` command.\n\nWhile a limit is added, the `ALTER TABLE ...` command requires an `EXCLUSIVE LOCK` on the table, which\nis held throughout the process of updating the column and while validating all existing records, a\nprocess that can take a while for large tables.\n\nOn the other hand, texts are [more or less equivalent to strings](https://www.depesz.com/2010/03/02/charx-vs-varcharx-vs-varchar-vs-text/) in PostgreSQL,\nwhile having the additional advantage that adding a limit on an existing column or updating their",
"metadata": {
"title": "Strings and the Text data type",
"source_type": "doc",
"md5sum": "14084c5231fd0bee2eef9dfcff6e0e7303ffddeaf9397ab9f3da3424997fb19f",
"source_url": "https://gitlab.com/help/development/database/strings_and_the_text_data_type",
"source": "development/database/strings_and_the_text_data_type.md"
}
},
{
"id": "450e041ffee9ddafdd69e7861a2c688a",
"content": "# Storing SHA1 Hashes As Binary\n\nStoring SHA1 hashes as strings is not very space efficient. A SHA1 as a string\nrequires at least 40 bytes, an additional byte to store the encoding, and\nperhaps more space depending on the internals of PostgreSQL.\n\nOn the other hand, if one were to store a SHA1 as binary one would only need 20\nbytes for the actual SHA1, and 1 or 4 bytes of additional space (again depending\non database internals). This means that in the best case scenario we can reduce\nthe space usage by 50%.\n\nTo make this easier to work with you can include the concern `ShaAttribute` into\na model and define a SHA attribute using the `sha_attribute` class method. For\nexample:\n\n```ruby\nclass Commit < ActiveRecord::Base\n include ShaAttribute\n\n sha_attribute :sha\nend\n```\n\nThis allows you to use the value of the `sha` attribute as if it were a string,\nwhile storing it as binary. This means that you can do something like this,\nwithout having to worry about converting data to the right binary format:\n\n```ruby\ncommit = Commit.find_by(sha: '88c60307bd1f215095834f09a1a5cb18701ac8ad')\ncommit.sha = '971604de4cfa324d91c41650fabc129420c8d1cc'\ncommit.save\n```\n\nThere is however one requirement: the column used to store the SHA has _must_ be\na binary type. For Rails this means you need to use the `:binary` type instead\nof `:text` or `:string`.",
"metadata": {
"title": "Storing SHA1 Hashes As Binary",
"source_type": "doc",
"md5sum": "a6b178e6af5264cba0d45793d6a56213c427abafc93edfccfe3797c3c51d4bc8",
"source": "development/database/sha1_as_binary.md",
"source_url": "https://gitlab.com/help/development/database/sha1_as_binary"
}
},
{
"id": "526c037f52a219cd1ba8f8d92004fc36",
"content": "| `environment_tiers` | array of strings | no | The [tiers of the environments](../../ci/environments/index.md#deployment-tier-of-environments). Default is `production`. |\n| `interval` | string | no | The bucketing interval. One of `all`, `monthly` or `daily`. Default is `daily`. |\n| `start_date` | string | no | Date range to start from. ISO 8601 Date format, for example `2021-03-01`. Default is 3 months ago. |\n\nExample request:\n\n```shell\ncurl --header \"PRIVATE-TOKEN: <your_access_token>\" \"https://gitlab.example.com/api/v4/groups/1/dora/metrics?metric=deployment_frequency\"\n```\n\nExample response:\n\n```json\n[\n { \"date\": \"2021-03-01\", \"value\": 3 },\n { \"date\": \"2021-03-02\", \"value\": 6 },\n { \"date\": \"2021-03-03\", \"value\": 0 },\n { \"date\": \"2021-03-04\", \"value\": 0 },\n { \"date\": \"2021-03-05\", \"value\": 0 },\n { \"date\": \"2021-03-06\", \"value\": 0 },\n { \"date\": \"2021-03-07\", \"value\": 0 },\n { \"date\": \"2021-03-08\", \"value\": 4 }\n]\n```\n\n## The `value` field\n\nFor both the project and group-level endpoints above, the `value` field in the\nAPI response has a different meaning depending on the provided `metric` query\nparameter:\n\n| `metric` query parameter | Description of `value` in response |\n|:---------------------------|:-----------------------------------|",
"metadata": {
"title": "DevOps Research and Assessment (DORA) key metrics API",
"source_type": "doc",
"md5sum": "fda1b564e54e874ce1688196fed43f45e0f315fde241e2e913c96e0ae27c950b",
"source_url": "https://gitlab.com/help/api/dora/metrics",
"source": "api/dora/metrics.md"
}
},
{
"id": "befdbee311020d12a274f5ebb6497475",
"content": "`before: String`, `after: String`, `first: Int`, and `last: Int`.\n\n###### Arguments\n\n| Name | Type | Description |\n| ---- | ---- | ----------- |\n| <a id=\"mergerequestpipelinesref\"></a>`ref` | [`String`](#string) | Filter pipelines by the ref they are run for. |\n| <a id=\"mergerequestpipelinesscope\"></a>`scope` | [`PipelineScopeEnum`](#pipelinescopeenum) | Filter pipelines by scope. |\n| <a id=\"mergerequestpipelinessha\"></a>`sha` | [`String`](#string) | Filter pipelines by the sha of the commit they are run for. |\n| <a id=\"mergerequestpipelinessource\"></a>`source` | [`String`](#string) | Filter pipelines by their source. |\n| <a id=\"mergerequestpipelinesstatus\"></a>`status` | [`PipelineStatusEnum`](#pipelinestatusenum) | Filter pipelines by their status. |\n| <a id=\"mergerequestpipelinesupdatedafter\"></a>`updatedAfter` | [`Time`](#time) | Pipelines updated after this date. |\n| <a id=\"mergerequestpipelinesupdatedbefore\"></a>`updatedBefore` | [`Time`](#time) | Pipelines updated before this date. |\n| <a id=\"mergerequestpipelinesusername\"></a>`username` | [`String`](#string) | Filter pipelines by the user that triggered the pipeline. |\n\n##### `MergeRequest.reference`\n\nInternal reference of the merge request. Returned in shortened format by default.\n\nReturns [`String!`](#string).\n\n###### Arguments\n\n| Name | Type | Description |\n| ---- | ---- | ----------- |",
"metadata": {
"source_type": "doc",
"title": "GraphQL API resources",
"md5sum": "f5592ccdaa60f1ff8f43a7a4927c8c210ae87b0cc7c98dbbc7429a0014f92d64",
"source": "api/graphql/reference/index.md",
"source_url": "https://gitlab.com/help/api/graphql/reference/index"
}
}
]
},
"metadata": {
"provider": "vertex-ai",
"timestamp": 1717476383
}
}
Request example (page_size: 1)
curl -X 'POST' \
'http://localhost:5052/v1/search/gitlab-docs' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"type": "string",
"metadata": {
"source": "string",
"version": "string"
},
"payload": {
"query": "string",
"page_size": 1
}
}'
Response example
{
"response": {
"results": [
{
"id": "9c290d11eb9ca409ea08666e21599872",
"content": "# Strings and the Text data type\n\nWhen adding new columns to store strings or other textual information:\n\n1. We always use the `text` data type instead of the `string` data type.\n1. `text` columns should always have a limit set, either by using the `create_table` with\n the `#text ... limit: 100` helper (see below) when creating a table, or by using the `add_text_limit`\n when altering an existing table.\n\nThe standard Rails `text` column type cannot be defined with a limit, but we extend `create_table` to\nadd a `limit: 255` option. Outside of `create_table`, `add_text_limit` can be used to add a [check constraint](https://www.postgresql.org/docs/11/ddl-constraints.html)\nto an already existing column.\n\n## Background information\n\nThe reason we always want to use `text` instead of `string` is that `string` columns have the\ndisadvantage that if you want to update their limit, you have to run an `ALTER TABLE ...` command.\n\nWhile a limit is added, the `ALTER TABLE ...` command requires an `EXCLUSIVE LOCK` on the table, which\nis held throughout the process of updating the column and while validating all existing records, a\nprocess that can take a while for large tables.\n\nOn the other hand, texts are [more or less equivalent to strings](https://www.depesz.com/2010/03/02/charx-vs-varcharx-vs-varchar-vs-text/) in PostgreSQL,\nwhile having the additional advantage that adding a limit on an existing column or updating their",
"metadata": {
"title": "Strings and the Text data type",
"source_type": "doc",
"md5sum": "14084c5231fd0bee2eef9dfcff6e0e7303ffddeaf9397ab9f3da3424997fb19f",
"source_url": "https://gitlab.com/help/development/database/strings_and_the_text_data_type",
"source": "development/database/strings_and_the_text_data_type.md"
}
}
]
},
"metadata": {
"provider": "vertex-ai",
"timestamp": 1717476668
}
}
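The curl requests above can also be built from Python. The following is a minimal sketch using only the standard library; the helper name `build_search_request` is hypothetical and not part of this MR, while the URL, headers, and body fields are copied from the examples above. It only constructs the request and does not send it.

```python
import json
from urllib.request import Request

# Endpoint taken from the curl examples in this description.
SEARCH_URL = "http://localhost:5052/v1/search/gitlab-docs"


def build_search_request(query, page_size=None):
    """Build (but do not send) a POST request for the search API.

    Hypothetical helper for local validation. When page_size is None the
    field is omitted, so the server-side default applies.
    """
    payload = {"query": query}
    if page_size is not None:
        payload["page_size"] = page_size

    # Placeholder values mirror the "string" placeholders in the curl examples.
    body = {
        "type": "string",
        "metadata": {"source": "string", "version": "string"},
        "payload": payload,
    }
    return Request(
        SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

With the local server running, passing the result to `urllib.request.urlopen` should send the same request as the curl commands; omitting `page_size` exercises the default, and passing a value exercises the override.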
Request example (page_size: 21, above the maximum of 20)
curl -X 'POST' \
'http://localhost:5052/v1/search/gitlab-docs' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"type": "string",
"metadata": {
"source": "string",
"version": "string"
},
"payload": {
"query": "string",
"page_size": 21
}
}'
Response example
{
"detail": [
{
"type": "less_than_equal",
"loc": [
"body",
"payload",
"page_size"
],
"msg": "Input should be less than or equal to 20",
"input": 21,
"ctx": {
"le": 20
},
"url": "https://errors.pydantic.dev/2.7/v/less_than_equal"
}
]
}
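The three responses above are consistent with a Pydantic field that defaults to 4 and is capped at 20. The following is a sketch of such a field, assuming Pydantic v2 (the error URL above points to 2.7); the class name `SearchPayload` and the helper `describe_page_size` are hypothetical and may not match the actual code in this MR. No lower bound is declared because the examples above do not show one.

```python
from pydantic import BaseModel, Field, ValidationError

DEFAULT_PAGE_SIZE = 4  # default stated in this MR description
MAX_PAGE_SIZE = 20     # taken from "le": 20 in the error response above


class SearchPayload(BaseModel):
    """Hypothetical model for the "payload" object of the search request."""

    query: str
    # le=... produces a "less_than_equal" error when the limit is exceeded.
    page_size: int = Field(default=DEFAULT_PAGE_SIZE, le=MAX_PAGE_SIZE)


def describe_page_size(raw: dict) -> str:
    """Return the effective page size, or the validation error type."""
    try:
        return f"page_size={SearchPayload(**raw).page_size}"
    except ValidationError as exc:
        return exc.errors()[0]["type"]
```

Called with the three payloads from the examples above, this sketch yields the default of 4 when `page_size` is omitted, the requested value for `page_size: 1`, and a `less_than_equal` error for `page_size: 21`.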
Merge request checklist
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.
Edited by Shinya Maeda