Clean up HTML-encoded entities in UserDetail fields before validate_sanitizable_user_details feature flag rollout

Problem Statement / Context

As part of !208943 (Refactor UserDetail to use the Sanitizable concern), we are introducing stricter validation for user profile fields. See discussion thread for context.

The legacy sanitize_attrs method used Sanitize.clean(), which encoded ampersands (&) as HTML entities (&amp;), so existing rows can contain pre-escaped values. Once the validate_sanitizable_user_details feature flag is enabled, these rows will fail validation because the Sanitizable concern rejects pre-escaped HTML entities.

Proposal

Create a batched background migration (BBM) to clean up HTML-encoded entities in UserDetail fields before the feature flag rollout.


Affected Data

| Category | Affected Rows | Fields |
|----------|---------------|--------|
| Common HTML entities (&amp;, &lt;, &gt;, &quot;, &#39;) | ~1,250 | linkedin (734), twitter (225), website_url (286), github (5) |
| Numeric HTML entities (&#38;, &#x26;, etc.) | 1 | website_url |
| Path traversal patterns (../) | 12 | Various |
| Double-encoded URLs (%25) | 15 | Various |

Proposed Solution

The BBM should process: linkedin, twitter, website_url, github

For HTML entity cleanup (safe to fix):

  • &amp; → &, &quot; → ", &#39; → '
  • &#38;, &#x26; → & (numeric entities)

For potentially malicious content (strip entirely):

  • &lt;, &gt; and their numeric variants → remove

Not addressed (let validation reject):

  • Path traversal patterns (../) - 12 users
  • Double-encoded URLs (%25) - 15 users
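
The rules above can be sketched as a single per-value transformation. This is a minimal, framework-free illustration and not the code from !208943: the method name `clean_user_detail_value` and all constants are hypothetical, and accepting `&apos;` alongside `&#39;` is an assumption. Entities are replaced in one pass, so a double-escaped value such as `&amp;lt;` is decoded only one level. Path traversal and `%25` patterns are left untouched for validation to reject.

```ruby
# Hypothetical names throughout; a sketch of the proposed rules only.

# Matches the named entities of interest plus any decimal or hex numeric entity.
ENTITY_PATTERN = /&(?:(amp|quot|apos|lt|gt)|#x([0-9a-f]+)|#([0-9]+));/i

# Named entities that are safe to decode. lt/gt are absent on purpose:
# they are stripped rather than decoded.
SAFE_NAMED = { 'amp' => '&', 'quot' => '"', 'apos' => "'" }.freeze

# Code points of & " ' (safe to decode) and of < > (stripped).
SAFE_CODEPOINTS = { 38 => '&', 34 => '"', 39 => "'" }.freeze
STRIPPED_CODEPOINTS = [60, 62].freeze

def clean_user_detail_value(value)
  return value if value.nil?

  value.gsub(ENTITY_PATTERN) do |entity|
    name, hex, dec = Regexp.last_match.captures

    if name
      # Named entity: decode the safe ones, drop &lt; and &gt;.
      SAFE_NAMED.fetch(name.downcase, '')
    else
      codepoint = hex ? hex.to_i(16) : dec.to_i
      if STRIPPED_CODEPOINTS.include?(codepoint)
        ''
      else
        # Any other numeric entity is left as-is for validation to judge.
        SAFE_CODEPOINTS.fetch(codepoint, entity)
      end
    end
  end
end
```

In an actual BBM this transformation would run per sub-batch over linkedin, twitter, website_url, and github, writing back only the values that changed.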

Implementation Plan

  1. Merge !208943 (feature flag disabled by default)
  2. Create and merge this BBM
  3. Wait for BBM to complete on production
  4. Roll out feature flag incrementally
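
Between steps 3 and 4 it may be worth confirming that no encoded values remain before the flag is enabled. The sketch below shows one way such a check could look on plain hashes. It is an assumption-laden illustration: `rows_still_encoded`, `CHECKED_FIELDS`, and `LEFTOVER_ENTITY` are hypothetical names, the rows stand in for UserDetail records, and the pattern is deliberately loose (it flags any numeric entity, including ones a cleanup might intentionally leave alone).

```ruby
# Hypothetical verification helper; not part of !208943 or of GitLab's codebase.
CHECKED_FIELDS = %i[linkedin twitter website_url github].freeze

# Loose detector: common named entities or any numeric-looking entity.
LEFTOVER_ENTITY = /&(?:amp|quot|apos|lt|gt|#x?[0-9a-f]+);/i

# Returns the user_ids of rows that still contain an HTML entity
# in any of the checked fields.
def rows_still_encoded(rows)
  rows.filter_map do |row|
    encoded = CHECKED_FIELDS.any? { |field| row[field].to_s.match?(LEFTOVER_ENTITY) }
    row[:user_id] if encoded
  end
end

sample = [
  { user_id: 1, linkedin: 'a&amp;b' },
  { user_id: 2, website_url: 'https://example.com/?a=1&b=2' },
  { user_id: 3, github: nil, twitter: 'x&#x26;y' }
]
leftover_ids = rows_still_encoded(sample)
```

An empty result would suggest the cleanup covered the entity cases. Rows with path traversal or `%25` patterns are not detected here, since the proposal leaves those for validation to reject.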

Checklist

  • Create batched background migration to unescape HTML entities
  • Add migration specs
  • Add post-deployment migration to enqueue the BBM
  • Coordinate with !208943 for feature flag rollout timing