Investigate 500 errors on member-heavy pages for self-managed instances

Summary

A self-managed customer is seeing 500 errors on pages that list large numbers of members (3,000+). The request runs until it times out and then returns the error. Both project member lists and group member lists are affected. In addition, a GraphQL API query for one specific project fails consistently, which impacts permission auditing workflows.

Similar to GitLab.com issue #459041 (closed), but affecting self-managed instances.

Example areas:

  • The project members page (/project_members)
  • The group members page (/groups/{group-path}/members)

Current Behavior

  • Visiting the page in a browser returns a 500 error after the request times out
  • Logs show a rack-timeout error with a 60-second limit (from puma.stderr):

source=rack-timeout timeout=60000ms service=60001ms state=timed_out at=error

  • REST API requests work but are extremely slow

  • API performance comparison:

    • REST API takes 12-16 hours (sometimes 24+) to complete permission auditing
    • GraphQL API completes in ~2.5 hours for most projects
    • One specific project consistently fails GraphQL queries after several seconds
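
To confirm which requests hit the limit, the rack-timeout log line shown above can be split into its key=value fields. The following is a minimal sketch that assumes every entry follows the same format as that line. The helper names `parse_rack_timeout` and `is_timed_out` are hypothetical and not part of GitLab or rack-timeout.

```python
import re

# Matches key=value pairs such as "timeout=60000ms" or "state=timed_out".
PAIR = re.compile(r"(\w[\w-]*)=(\S+)")


def parse_rack_timeout(line):
    """Return the key=value fields of a rack-timeout log line as a dict."""
    fields = dict(PAIR.findall(line))
    # Convert the millisecond fields to integers where they are present.
    for key in ("timeout", "service"):
        value = fields.get(key, "")
        if value.endswith("ms") and value[:-2].isdigit():
            fields[key] = int(value[:-2])
    return fields


def is_timed_out(fields):
    """True when the entry records a request that was aborted on timeout."""
    return fields.get("state") == "timed_out"


line = "source=rack-timeout timeout=60000ms service=60001ms state=timed_out at=error"
fields = parse_rack_timeout(line)
print(fields["timeout"], fields["service"], is_timed_out(fields))
```

For the sample line, the service time (60001 ms) is just past the configured timeout (60000 ms), which is consistent with the request being cut off at the limit rather than failing earlier.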

Expected Behavior

  • Pages should load successfully with proper pagination
  • Response time should be within acceptable limits, well under the 60-second rack-timeout threshold
  • No timeout errors should occur
  • GraphQL API queries should complete successfully for all projects

Workarounds

  • REST API access works but is extremely slow
  • GraphQL API provides significantly better performance for most projects
  • No known frontend workarounds at this time
  • No current workaround for the failing project-specific GraphQL query
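
For reproducing the GraphQL behavior, an audit script might page through project members with a cursor, roughly as sketched below. This is an illustration under assumptions, not the customer's actual query: the query shape is based on the `projectMembers` connection in GitLab's public GraphQL schema and should be checked against the instance's version, and the function names, page size, and URL are hypothetical. It is not known whether smaller pages avoid the failure on the affected project.

```python
import json
import urllib.request

# Assumed query shape (Project.projectMembers); verify the field names
# against the GraphQL schema of the GitLab version in use.
QUERY = """
query($fullPath: ID!, $first: Int!, $after: String) {
  project(fullPath: $fullPath) {
    projectMembers(first: $first, after: $after) {
      nodes { user { username } accessLevel { stringValue } }
      pageInfo { endCursor hasNextPage }
    }
  }
}
"""


def post_graphql(url, token, variables):
    """Send one GraphQL request and return the decoded JSON body."""
    body = json.dumps({"query": QUERY, "variables": variables}).encode()
    req = urllib.request.Request(url, data=body, headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    })
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def iter_members(fetch, full_path, page_size=50):
    """Yield member nodes page by page, following endCursor until done."""
    after = None
    while True:
        result = fetch({"fullPath": full_path, "first": page_size, "after": after})
        if result.get("errors"):
            # Report the cursor so the failing page can be retried in isolation.
            raise RuntimeError(f"GraphQL errors at cursor {after!r}: {result['errors']}")
        project = (result.get("data") or {}).get("project")
        if project is None:
            raise RuntimeError(f"project {full_path!r} not found or not visible")
        conn = project["projectMembers"]
        yield from conn["nodes"]
        if not conn["pageInfo"]["hasNextPage"]:
            return
        after = conn["pageInfo"]["endCursor"]


# Example wiring (hypothetical host and token, not executed here):
# fetch = lambda v: post_graphql("https://gitlab.example.com/api/graphql", TOKEN, v)
# for member in iter_members(fetch, "group/project"):
#     print(member)
```

Passing `fetch` in as a parameter keeps the paging logic separate from the HTTP call, so the loop can be exercised against canned responses, and the cursor in the error message shows how far a failing query got before it broke.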

Impact

  • Affects administrative functionality
  • Prevents users from viewing member lists in large projects/groups
  • Blocks effective user management for organizations with large teams
  • Disrupts permission auditing workflows
  • One project completely inaccessible via GraphQL API, blocking automated auditing