
Speed up Advanced global search regex for file path segments

What does this MR do?

From gitlab-com/gl-infra/production#2318 (comment 367588644) we can see that the old regex pattern is capable of catastrophic backtracking, which we believe may be the cause of gitlab-com/gl-infra/production#2318 (closed).

This MR replaces the concept of trying to find sections of paths with the simpler idea of just matching common characters used in file paths, which should hopefully be sufficient for most cases. It's not as sophisticated as the approach that finds the sections between slashes ending in a word boundary, but it didn't seem easy to refactor that logic without the backtracking.
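To illustrate the idea of "just matching common characters", here is a minimal Python sketch. The pattern `[\w./-]+` is a hypothetical stand-in and is not the pattern used in this MR; it only shows that a single character class with one quantifier has no nested repetition for the regex engine to backtrack into, and that paths containing other characters get split.

```python
import re

# Hypothetical pattern (NOT the one from this MR): one character class,
# one quantifier, no nested groups, so there is no nested repetition
# that could trigger catastrophic backtracking.
PATH_CHARS = re.compile(r"[\w./-]+")


def path_like_tokens(text: str) -> list[str]:
    """Return every run of characters commonly found in file paths."""
    return PATH_CHARS.findall(text)


line = 'require "lib/gitlab/elastic/helper.rb" # see docs/my file@v2.md'
tokens = path_like_tokens(line)

# The ordinary path survives as a single token ...
print("lib/gitlab/elastic/helper.rb" in tokens)
# ... but a path containing a space or "@" is split into pieces,
# which is the trade-off this simpler approach accepts.
print("docs/my file@v2.md" in tokens)
```

In this sketch the first check passes and the second does not, because the space and `@` are outside the assumed character class.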

We are also adding the `remove_duplicates` filter. The new pattern overlaps heavily with other patterns, so without this filter we would have many wasteful duplicate tokens in the index. There doesn't seem to be any real cost to adding it: it only removes duplicate tokens at the identical position, so it seems sensible in all cases.
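The behaviour described above can be sketched in a few lines of Python. This is an emulation for illustration only, not Elasticsearch code; the `Token` type and the sample tokens are hypothetical.

```python
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    position: int


def remove_duplicates(tokens: list[Token]) -> list[Token]:
    """Drop a token if the same term was already seen at the same position."""
    seen: set[Token] = set()
    unique: list[Token] = []
    for token in tokens:
        if token not in seen:  # compares both term and position
            seen.add(token)
            unique.append(token)
    return unique


# Two overlapping patterns both emit "helper.rb" at position 0.
stream = [
    Token("helper.rb", 0),
    Token("helper", 0),
    Token("helper.rb", 0),  # duplicate: same term, same position
    Token("helper.rb", 5),  # kept: same term, different position
]
deduped = remove_duplicates(stream)
print(deduped)
```

Only the third token is dropped here. The later occurrence at position 5 stays, which is why the filter should not affect matching.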

We also remove some out-of-date docs for edgeNgram, which was removed in !32771 (merged).

Since this new regex picks up words more widely, it accidentally fixes all cases in #223044 (closed), #223045 (closed) and &3621 (comment 364151246). Tests have been added to prove this fixes all those cases. All of those tests failed before the regex pattern change.

The analysis below shows the new index actually takes up less space than before. This storage saving is not a free lunch, however, as some file paths that contain other special characters will be missed. If some of those characters are very common we can add them, but file paths can contain basically anything, so trying to detect what a file path looks like at arbitrary places in code is a pretty perilous task anyway. This also seems like a pretty low-risk change, and we can always update the pattern again if we get feedback with examples we've missed. Given that this passes all tests, including the new ones, it seems like a good idea to go with this.

Sadly I wasn't able to reproduce any high CPU on the cluster during any of this testing, so we'll just have to hope this resolves the production incident. Worst case scenario, we make some file path searches a little worse but still fix a bunch of other issues, and we'll at least learn from the cases we missed.

Analysis

Reindexing

Original settings: 622s (`2020-06-25 03:06:18 UTC - 2020-06-25 03:16:40 UTC`)
{
  "completed": true,
  "task": {
    "node": "7-B_MooyR3-ypG0MA-Shuw",
    "id": 32570335,
    "type": "transport",
    "action": "indices:data/write/reindex",
    "status": {
      "total": 140218,
      "updated": 0,
      "created": 140218,
      "deleted": 0,
      "batches": 144,
      "version_conflicts": 0,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      },
      "throttled_millis": 0,
      "requests_per_second": -1,
      "throttled_until_millis": 0,
      "slices": [
        {
          "slice_id": 0,
          "total": 23214,
          "updated": 0,
          "created": 23214,
          "deleted": 0,
          "batches": 24,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        },
        {
          "slice_id": 1,
          "total": 62073,
          "updated": 0,
          "created": 62073,
          "deleted": 0,
          "batches": 63,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        },
        {
          "slice_id": 2,
          "total": 12177,
          "updated": 0,
          "created": 12177,
          "deleted": 0,
          "batches": 13,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        },
        {
          "slice_id": 3,
          "total": 27713,
          "updated": 0,
          "created": 27713,
          "deleted": 0,
          "batches": 28,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        },
        {
          "slice_id": 4,
          "total": 15041,
          "updated": 0,
          "created": 15041,
          "deleted": 0,
          "batches": 16,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        }
      ]
    },
    "description": "reindex from [gitlab-production-202006040000] to [dgriffith-test-index][_doc]",
    "start_time_in_millis": 1593054378520,
    "running_time_in_nanos": 622074076434,
    "cancellable": true,
    "headers": {}
  },
  "response": {
    "took": 622070,
    "timed_out": false,
    "total": 140218,
    "updated": 0,
    "created": 140218,
    "deleted": 0,
    "batches": 144,
    "version_conflicts": 0,
    "noops": 0,
    "retries": {
      "bulk": 0,
      "search": 0
    },
    "throttled": "0s",
    "throttled_millis": 0,
    "requests_per_second": -1,
    "throttled_until": "0s",
    "throttled_until_millis": 0,
    "slices": [
      {
        "slice_id": 0,
        "total": 23214,
        "updated": 0,
        "created": 23214,
        "deleted": 0,
        "batches": 24,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      },
      {
        "slice_id": 1,
        "total": 62073,
        "updated": 0,
        "created": 62073,
        "deleted": 0,
        "batches": 63,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      },
      {
        "slice_id": 2,
        "total": 12177,
        "updated": 0,
        "created": 12177,
        "deleted": 0,
        "batches": 13,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      },
      {
        "slice_id": 3,
        "total": 27713,
        "updated": 0,
        "created": 27713,
        "deleted": 0,
        "batches": 28,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      },
      {
        "slice_id": 4,
        "total": 15041,
        "updated": 0,
        "created": 15041,
        "deleted": 0,
        "batches": 16,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      }
    ],
    "failures": []
  }
}
`remove_duplicates` only: 682s (`2020-06-25 03:20:40 UTC - 2020-06-25 03:32:02 UTC`)
{
  "completed": true,
  "task": {
    "node": "v04C3DTvRe6Tu4FTI2PZsQ",
    "id": 42764888,
    "type": "transport",
    "action": "indices:data/write/reindex",
    "status": {
      "total": 140218,
      "updated": 0,
      "created": 140218,
      "deleted": 0,
      "batches": 144,
      "version_conflicts": 0,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      },
      "throttled_millis": 0,
      "requests_per_second": -1,
      "throttled_until_millis": 0,
      "slices": [
        {
          "slice_id": 0,
          "total": 23214,
          "updated": 0,
          "created": 23214,
          "deleted": 0,
          "batches": 24,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        },
        {
          "slice_id": 1,
          "total": 62073,
          "updated": 0,
          "created": 62073,
          "deleted": 0,
          "batches": 63,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        },
        {
          "slice_id": 2,
          "total": 12177,
          "updated": 0,
          "created": 12177,
          "deleted": 0,
          "batches": 13,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        },
        {
          "slice_id": 3,
          "total": 27713,
          "updated": 0,
          "created": 27713,
          "deleted": 0,
          "batches": 28,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        },
        {
          "slice_id": 4,
          "total": 15041,
          "updated": 0,
          "created": 15041,
          "deleted": 0,
          "batches": 16,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        }
      ]
    },
    "description": "reindex from [dgriffith-test-index] to [dgriffith-test-index-remove-duplicates][_doc]",
    "start_time_in_millis": 1593055240665,
    "running_time_in_nanos": 682500328758,
    "cancellable": true,
    "headers": {}
  },
  "response": {
    "took": 682498,
    "timed_out": false,
    "total": 140218,
    "updated": 0,
    "created": 140218,
    "deleted": 0,
    "batches": 144,
    "version_conflicts": 0,
    "noops": 0,
    "retries": {
      "bulk": 0,
      "search": 0
    },
    "throttled": "0s",
    "throttled_millis": 0,
    "requests_per_second": -1,
    "throttled_until": "0s",
    "throttled_until_millis": 0,
    "slices": [
      {
        "slice_id": 0,
        "total": 23214,
        "updated": 0,
        "created": 23214,
        "deleted": 0,
        "batches": 24,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      },
      {
        "slice_id": 1,
        "total": 62073,
        "updated": 0,
        "created": 62073,
        "deleted": 0,
        "batches": 63,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      },
      {
        "slice_id": 2,
        "total": 12177,
        "updated": 0,
        "created": 12177,
        "deleted": 0,
        "batches": 13,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      },
      {
        "slice_id": 3,
        "total": 27713,
        "updated": 0,
        "created": 27713,
        "deleted": 0,
        "batches": 28,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      },
      {
        "slice_id": 4,
        "total": 15041,
        "updated": 0,
        "created": 15041,
        "deleted": 0,
        "batches": 16,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      }
    ],
    "failures": []
  }
}
`remove_duplicates` + new regex: 652s (`2020-06-25 03:35:09 UTC - 2020-06-25 03:46:01 UTC`)
{
  "completed": true,
  "task": {
    "node": "7-B_MooyR3-ypG0MA-Shuw",
    "id": 32585973,
    "type": "transport",
    "action": "indices:data/write/reindex",
    "status": {
      "total": 140218,
      "updated": 0,
      "created": 140218,
      "deleted": 0,
      "batches": 144,
      "version_conflicts": 0,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      },
      "throttled_millis": 0,
      "requests_per_second": -1,
      "throttled_until_millis": 0,
      "slices": [
        {
          "slice_id": 0,
          "total": 23214,
          "updated": 0,
          "created": 23214,
          "deleted": 0,
          "batches": 24,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        },
        {
          "slice_id": 1,
          "total": 62073,
          "updated": 0,
          "created": 62073,
          "deleted": 0,
          "batches": 63,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        },
        {
          "slice_id": 2,
          "total": 12177,
          "updated": 0,
          "created": 12177,
          "deleted": 0,
          "batches": 13,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        },
        {
          "slice_id": 3,
          "total": 27713,
          "updated": 0,
          "created": 27713,
          "deleted": 0,
          "batches": 28,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        },
        {
          "slice_id": 4,
          "total": 15041,
          "updated": 0,
          "created": 15041,
          "deleted": 0,
          "batches": 16,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
            "bulk": 0,
            "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1,
          "throttled_until_millis": 0
        }
      ]
    },
    "description": "reindex from [dgriffith-test-index] to [dgriffith-test-reindex][_doc]",
    "start_time_in_millis": 1593056109303,
    "running_time_in_nanos": 652640615919,
    "cancellable": true,
    "headers": {}
  },
  "response": {
    "took": 652637,
    "timed_out": false,
    "total": 140218,
    "updated": 0,
    "created": 140218,
    "deleted": 0,
    "batches": 144,
    "version_conflicts": 0,
    "noops": 0,
    "retries": {
      "bulk": 0,
      "search": 0
    },
    "throttled": "0s",
    "throttled_millis": 0,
    "requests_per_second": -1,
    "throttled_until": "0s",
    "throttled_until_millis": 0,
    "slices": [
      {
        "slice_id": 0,
        "total": 23214,
        "updated": 0,
        "created": 23214,
        "deleted": 0,
        "batches": 24,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      },
      {
        "slice_id": 1,
        "total": 62073,
        "updated": 0,
        "created": 62073,
        "deleted": 0,
        "batches": 63,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      },
      {
        "slice_id": 2,
        "total": 12177,
        "updated": 0,
        "created": 12177,
        "deleted": 0,
        "batches": 13,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      },
      {
        "slice_id": 3,
        "total": 27713,
        "updated": 0,
        "created": 27713,
        "deleted": 0,
        "batches": 28,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      },
      {
        "slice_id": 4,
        "total": 15041,
        "updated": 0,
        "created": 15041,
        "deleted": 0,
        "batches": 16,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled": "0s",
        "throttled_millis": 0,
        "requests_per_second": -1,
        "throttled_until": "0s",
        "throttled_until_millis": 0
      }
    ],
    "failures": []
  }
}

Stats

| | Original Index | `remove_duplicates` | `remove_duplicates` + new regex |
|---|---|---|---|
| Time taken to reindex | 622 s | 682 s | 652 s |
| Size | 2296 MB | 2208 MB | 2143 MB |
| `indexing.index_time_in_millis` | 1158320 | 1270673 | 1209598 |
| `segments.memory_in_bytes` | 467473 | 463986 | 442289 |
| `segments.terms_memory_in_bytes` | 348402 | 345976 | 326358 |
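To make the stats easier to compare, the relative change against the original index can be worked out with a short Python sketch. The numbers are copied from the stats above; the `pct_change` helper and the dictionary layout are just for illustration.

```python
# (original, remove_duplicates, remove_duplicates + new regex)
stats = {
    "reindex_seconds": (622, 682, 652),
    "size_mb": (2296, 2208, 2143),
    "indexing.index_time_in_millis": (1158320, 1270673, 1209598),
    "segments.memory_in_bytes": (467473, 463986, 442289),
    "segments.terms_memory_in_bytes": (348402, 345976, 326358),
}


def pct_change(base: float, new: float) -> float:
    """Relative change of `new` against `base`, in percent."""
    return (new - base) / base * 100


for name, (original, dedup_only, dedup_and_regex) in stats.items():
    print(
        f"{name}: "
        f"remove_duplicates {pct_change(original, dedup_only):+.1f}%, "
        f"remove_duplicates + new regex {pct_change(original, dedup_and_regex):+.1f}%"
    )
```

With these figures the combined change comes out at roughly 6.7% less index size for roughly 4.8% more reindex time, measured on a single run of each configuration.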

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team

#224459 (closed)
