
Modify UserFinder cache strategy when user has private email

Summary

We made changes in Shorten cache time for private emails in GitHub... (!70293 - merged) to resolve GitHub Importer - Failed email lookup is cached... (#296706 - closed). Specifically, we reduced the timeout period for caching the fact that a user doesn't have a public email set on GitHub to 15 minutes. However, this approach can cause more requests to the GitHub API, so the rate limit is reached more often.

Context

When importing resources from GitHub, in most cases we use API endpoints that return 100 resources at a time. For example, to import pull requests, we use an endpoint that returns 100 pull requests per request, and the same applies to issues, comments, and so on. Since GitHub's rate limit is 5,000 requests per hour and we can read 100 resources per request, we can read around 5K * 100 = 500K resources per hour.
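For illustration, a minimal sketch of such a paginated read using the octokit gem (the repository name and token handling are placeholders, not the importer's actual code):

require 'octokit'

client = Octokit::Client.new(access_token: ENV['GITHUB_TOKEN'])

# One API request, up to 100 pull requests.
pulls = client.pull_requests('octocat/Hello-World', state: 'all', per_page: 100)

puts "Fetched #{pulls.size} pull requests with one request"
puts "Requests left this hour: #{client.rate_limit.remaining}"
puts "More pages? #{!client.last_response.rels[:next].nil?}"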

In addition to requesting the resource information, we must also associate users with each resource, which requires an extra API call per user to be mapped. Since user information doesn't change often, we cache the user's data so there's no need to request it again.
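The pattern is roughly the following sketch; the fetch_user_id_from_github helper and the Redis wiring are illustrative, not the importer's actual code:

require 'redis'

CACHE_KEY = 'github-import/user-finder/id-for-email/%s'

def id_for_github_email(redis, email)
  key = format(CACHE_KEY, email)
  if (cached = redis.get(key))
    return cached.empty? ? nil : cached.to_i # empty string = cached "not found"
  end

  user_id = fetch_user_id_from_github(email) # one extra API call per user
  # Found users stay cached for 24 hours; !70293 shortened the
  # "no public email" timeout from 24 hours to 15 minutes.
  ttl = user_id ? 24 * 60 * 60 : 15 * 60
  redis.set(key, user_id.to_s, ex: ttl)
  user_id
end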

However, in Shorten cache time for private emails in GitHub... (!70293 - merged), we reduced the cache timeout for users without a configured public email from 24 hours to 15 minutes, which is too short. When the API rate limit is reached, we enqueue the workers to run after the rate limit resets, which usually takes about 1 hour, so by the time the workers execute, the cache entries for those users no longer exist and the requests for those users are made again.

So, due to the short timeout period, when importing a GitHub repository whose users don't have a public email configured, the number of imported resources per hour decreases significantly: it may drop from approximately 500K to around 5K in the worst-case scenario.

Possible solutions

1. Scope the cache using the Project ID

To address GitHub Importer - Failed email lookup is cached... (#296706 - closed), we could create the user cache per imported project instead of reducing the cache timeout period. Basically, update the cache keys to use the structure:

FROM

ID_FOR_EMAIL_CACHE_KEY = 'github-import/user-finder/id-for-email/%s'

TO

ID_FOR_EMAIL_CACHE_KEY = 'github-import/user-finder/%{project}/id-for-email/%{email}'
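With the new structure, the key would be built with Ruby's named format placeholders, for example (the project ID and email here are illustrative):

key = format(ID_FOR_EMAIL_CACHE_KEY, project: project.id, email: 'jane@example.com')
# => "github-import/user-finder/42/id-for-email/jane@example.com"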

This solution is simple, and one project's import doesn't interfere with another's.

The downside is that the cache would no longer be shared across imports as it currently is. It also consumes more Redis memory.

2. Use GitHub conditional requests

Use GitHub conditional requests, which don't count against the API rate limit, to determine whether we should update the cache.

This solution is a little more complicated, but it lets us keep sharing the cache across project imports.
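As a sketch of the mechanics, using raw Net::HTTP rather than the importer's actual client (stored_etag is assumed to come from a previous lookup, and cached_user from the shared cache):

require 'net/http'
require 'json'

uri = URI('https://api.github.com/users/octocat')
request = Net::HTTP::Get.new(uri)
request['Authorization'] = "Bearer #{ENV['GITHUB_TOKEN']}"
request['If-None-Match'] = stored_etag # ETag saved alongside the cached user

response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
  http.request(request)
end

case response
when Net::HTTPNotModified # 304: cached data still valid, no rate-limit cost
  user = cached_user
when Net::HTTPSuccess     # 200: record changed; refresh the cache and ETag
  user = JSON.parse(response.body)
  new_etag = response['etag']
end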

Recommendation

We will combine both solutions described above:

  • Preferring the shared Redis cache for speed, but
  • Using an ETag-based lookup for each user once per import, writing to the shared cache only if the user record has changed (see the sketch after this list)
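Put together, the per-import flow could look like this sketch; every helper name here (shared_cache_read, conditional_fetch_user, validated_in_import?, and so on) is hypothetical:

def user_for(email)
  cached = shared_cache_read(email) # shared across all imports (fast path)
  return cached[:user] if cached && validated_in_import?(email)

  if cached
    # At most one conditional request per user per import; a 304 reply
    # does not count against the rate limit.
    status, user, etag = conditional_fetch_user(email, cached[:etag])
    if status == 200 # record changed: rewrite the shared entry
      shared_cache_write(email, user, etag)
      cached = { user: user, etag: etag }
    end
    mark_validated_in_import(email)
    return cached[:user]
  end

  nil # cache miss: fall back to a regular (rate-limited) lookup
end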

Documentation

Docs updates needed:

We should document that adding a public email to the GitHub account speeds up the import and that not adding it slows the import down (see also here). /cc @eread for attention.
