
Resolve "Implement Postgres full-text search"

Peter Evans requested to merge 1706-implement-postgres-full-text-search into develop

What is in this MR

This merge request introduces Postgres full-text search in the backend.

From a user's perspective, performance improves noticeably in tables with >100,000 rows, with dramatic improvements for self-hosters with >250,000 rows.

Old search

In the old search, now called "compat", Baserow would iterate over all fields in a table and perform a `LIKE '%term%'` against the output of each field type. In smaller tables this was adequate, but as more rows and fields were added, performance would degrade.

This search behaviour can still be accessed by passing `search_mode=compat` to the `list_database_table_grid_view_rows` endpoint.
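As a minimal sketch of what the compat approach boils down to (using an in-memory SQLite table with made-up column names, not Baserow's actual schema), each searchable field contributes one `LIKE` clause, so every search is a full scan per field:

```python
import sqlite3

# Hypothetical table standing in for a Baserow user table;
# the column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rows (id INTEGER PRIMARY KEY, name TEXT, notes TEXT)")
conn.executemany(
    "INSERT INTO rows (name, notes) VALUES (?, ?)",
    [("William Brown", "met in Bristol"), ("Ada", "no notes")],
)

def compat_search(term):
    # One LIKE per searchable field, OR'd together -- a full table
    # scan per field, which is why it degrades on large tables.
    like = f"%{term}%"
    cur = conn.execute(
        "SELECT id FROM rows WHERE name LIKE ? OR notes LIKE ?",
        (like, like),
    )
    return [r[0] for r in cur.fetchall()]

print(compat_search("liam"))  # infix matches work in compat mode
```

Note that compat happily matches `"liam"` in the middle of `"William"`, which is exactly the behaviour that changes under full-text search below.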

New search

In the new search, each searchable field has its tsvector column updated at specific times: when rows are created/updated/deleted, when fields are created/updated/deleted, or when related tables change. When we perform a search, we query each tsvector column.
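Roughly, the mechanism looks like the following Postgres sketch (the table and column names here are hypothetical, not Baserow's actual generated names):

```sql
-- Hypothetical user table "database_table_42" with a text field "field_1".
ALTER TABLE database_table_42 ADD COLUMN tsv_field_1 tsvector;
CREATE INDEX tsv_field_1_idx ON database_table_42 USING gin (tsv_field_1);

-- Refreshed when rows/fields change or related tables change:
UPDATE database_table_42
SET tsv_field_1 = to_tsvector('simple', coalesce(field_1, ''));

-- A search then hits the GIN index instead of scanning every raw value:
SELECT id
FROM database_table_42
WHERE tsv_field_1 @@ to_tsquery('simple', 'bro:*');
```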

Compatibility

Great care has gone into preventing as many breaking changes as possible:

  • Due to technical constraints, full-text search cannot find a token within a word; it only matches when the token matches the start of a word. For example:

    • "will" or "bro" will match "William Brown"
    • "liam" or "row" will not match "William Brown"
  • To speed up both search modes, we changed the `id__contains=search_value` filter to an exact `id=int(search_value)` lookup. Previously, searching for 1 returned all rows whose ids start with 1; now you get only row 1, in both modes. We think this makes more sense for users and performs better, since it can use the PK column's index.

  • A bug affecting both old and new search modes has been fixed: the backend's human-readable output for a date time field in 12-hour mode had no space between the minutes and AM/PM (12:00PM), while the frontend did (12:00 PM). We've fixed the backend to match the frontend, which means exports will now be slightly different. Also, if someone converts a 12-hour date time field to text, does this upgrade, and then converts that field back to date time (with the exact same 12-hour setting), it will not auto-convert. Given how unlikely the above is, @nigel_baserow thinks this change is worth it to bring the backend and frontend into sync. In the future, the text -> date converter should try many different date formats instead of just the one it is configured to use.
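The prefix-only matching described above can be sketched with a toy tokenizer (this is purely an illustration; Postgres's actual lexing also handles stemming, punctuation, and dictionaries):

```python
def tokenize(text):
    # Crude stand-in for Postgres's tsvector lexing.
    return [token.lower() for token in text.split()]

def prefix_match(term, text):
    # Full-text search only matches the *start* of a word,
    # never a substring in the middle of one.
    return any(tok.startswith(term.lower()) for tok in tokenize(text))

print(prefix_match("bro", "William Brown"))   # matches: "bro" starts "Brown"
print(prefix_match("liam", "William Brown"))  # no match: "liam" is mid-word
```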

Things to be aware of

  • This introduces a tsvector (and its GIN index) per field, which increases the required disk space. In large tables, this overhead can get quite large.
  • When this MR is deployed, existing tables will gain a tsvector per field (and the column will subsequently be populated) the first time someone visits one of their views. If the table is large, there will be a window where full-text search won't be usable until the update is complete. If this turns out to be a problem, a follow-up MR can add a more advanced fallback mechanism to the old search; however, the old search on a big table will take ages to run anyway, so we don't think it's worth it for now.
  • There are a handful of cases where a full table re-index of some tsvector columns is necessary, e.g. when fields are updated or created. We'll need to investigate tweaks in the near future to limit when a full table re-index is required; for now we keep it simple and re-index a field's TSV whenever it changes. While the `UPDATE tsv_field_XYZ = new values` is running, full-text search will still see the old values.
  • To ensure that the backend's search doesn't outperform the frontend's, we've had to 'dumb down' the backend search behaviour; otherwise, when a search is performed or a row change is sent over WebSockets, we would have too many issues highlighting matching rows.
    • For example: the current search essentially searches for "contains", so searching for "Peter Evans" in "Peter Evans is British" will match. To support this behaviour in full-text, the token operator is `<->` (followed by), but Postgres offers many others which we can't currently use (e.g. `peter & evans` - find the tokens "peter" and "evans" anywhere in the same row).
  • The only field type that isn't searchable is the link row field type. We will still create TSV columns for link row fields (which will be empty), so all we need to do to make them searchable later is add some SQL to generate their TSVs (not hard at all).
  • If you find any field types you would like to have different search behaviour, each field type's search is completely customizable. For example, for date fields we could fill the TSVs with 20 different date formats, allowing users to search in multiple formats and still match the same date. Or we could apply slightly different regexes to the TSV search data per field type if we wish; it's completely customizable.
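The "followed by" behaviour the backend now emulates can be sketched as a toy phrase matcher (illustrative only; real tsquery parsing and lexing are more involved):

```python
def phrase_match(query, text):
    # Mimics a tsquery like 'peter <-> evans': the query tokens must
    # appear consecutively in the text, each matching the start of a word.
    q = [t.lower() for t in query.split()]
    words = [w.lower() for w in text.split()]
    for i in range(len(words) - len(q) + 1):
        if all(words[i + j].startswith(q[j]) for j in range(len(q))):
            return True
    return False

print(phrase_match("Peter Evans", "Peter Evans is British"))  # True
```

An `&`-style query (tokens anywhere in the row, in any order) would match more rows than this, which is exactly why it can't be used while the frontend expects "contains" semantics.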

Tests performed on this MR

  1. Added a full e2e test suite for many field types, comparing many cell value searches to ensure the backend and frontend agree.
  2. Tested that every template installed on develop can be migrated and searched.
  3. Tested all combinations of duplicating/snapshotting a table with many different types of link row fields.
  4. Ensured TSV columns are never included in normal row select/update/insert statements.
  5. Ensured the backend job that updates row TSVs is indexed properly and runs in under 0.005 seconds for a table with 5 million+ rows when a row is edited.

How to test this MR

Note: the frontend uses an env var to decide which search mode to default to. If you check out develop, do some testing, and then switch to this branch, you need to restart the frontend Nuxt service so the new env var appears in the `$env` property as expected; otherwise, the frontend will default to compat until you restart it.

Also note: searching date time fields without a timezone set still doesn't work as you'd expect. The backend has no idea what the user's timezone is, so you will be searching the UTC date time in the backend. This was already a problem with compat and is still a problem for full-text search.
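The timezone caveat can be seen with a quick sketch (the times and offset below are made up for illustration):

```python
from datetime import datetime, timezone, timedelta

# A row stored at 23:30 UTC; a user in UTC+2 sees it as 01:30 the next day.
stored_utc = datetime(2023, 5, 1, 23, 30, tzinfo=timezone.utc)
user_tz = timezone(timedelta(hours=2))
shown_to_user = stored_utc.astimezone(user_tz)

print(stored_utc.strftime("%Y-%m-%d %H:%M"))     # what the backend searches against
print(shown_to_user.strftime("%Y-%m-%d %H:%M"))  # what the user sees in the grid
```

A user searching for the date or hour they see on screen will miss the row whenever the UTC representation differs, in either search mode.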

  • Open Baserow.io in production, which still uses the "old" search.
  • Open this MR and try to replicate the exact table that is in production, ensuring that as many field types as possible are used.
  • Query for the same term in both old/new and compare the results.

Merge Request Checklist

  • changelog.md has been updated if required.
  • New/updated Premium/Enterprise features are separated correctly in the premium or enterprise folder
  • The latest Chrome and Firefox have been used to test any new frontend features
  • Documentation has been updated
  • Quality Standards are met
  • Performance: tables are still fast with 100k+ rows, 100+ field tables
  • The redoc API pages have been updated for any REST API changes
  • Our custom API docs are updated for changes to endpoints accessed via api tokens
  • The UI/UX has been updated following UI Style Guide

Closes #1706 (closed)

Edited by Nigel Gott
