Fix scraping of GitHub Markdown attachments

What does this MR do and why?

As of May 9, 2023, GitHub changed how images are stored: https://github.blog/changelog/2023-05-09-more-secure-private-attachments/

This change breaks our scraper, which looks for https://user-images.githubusercontent.com/xx.jpeg in the issue Markdown. Images uploaded after May 9 instead use URLs of the form https://github.com/MaxPIsa/testrepo/assets/142635249/625a76d0xxx

Therefore, our scraper needs to support both types of links so that we can download the attachments and upload them to our own GitLab servers.
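To illustrate matching both link types, here is a minimal Python sketch. It is not the code in this MR: the pattern and the `find_attachment_urls` name are hypothetical, and the exact shape of the new asset path (numeric ID followed by a hex-and-dash identifier) is an assumption based on the example URLs in this description.

```python
import re
from typing import List

# Assumed URL shapes, taken from the examples in this description:
#   legacy: https://user-images.githubusercontent.com/<path>
#   new:    https://github.com/<owner>/<repo>/assets/<number>/<id>
ATTACHMENT_URL = re.compile(
    r"https://user-images\.githubusercontent\.com/[^\s)\"'>]+"
    r"|https://github\.com/[^/\s]+/[^/\s]+/assets/\d+/[0-9a-fA-F-]+"
)


def find_attachment_urls(markdown: str) -> List[str]:
    """Return every legacy or new-style attachment URL, in document order."""
    return ATTACHMENT_URL.findall(markdown)
```

Keeping both formats in one alternation means a single pass over the Markdown returns the links in the order they appear, whichever format each one uses.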

In addition, this MR adds support for private repositories that use the newer image Markdown (github.com/xxx/assets/xxx). Copying Markdown images from private repositories was never supported before.

Notes:

The new GitHub structure is a bit tricky, because the URL is simply a redirect. For example:

https://github.com/MaxPIsa/publicImagesRepo/assets/142635249/5ff826ef-1ddd-4c43-a3e2-94414b42fc00 redirects to github-production-user-asset-6210df.s3.amazonaws.com/xxx

This is fine for public repositories, but private ones add an extra layer of authentication on the /assets endpoint. The redirect then includes an Amazon auth header of its own before the resource is returned.
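The two-step flow described above could be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the implementation in this MR: `download_asset` and the injected `fetch` callable are hypothetical, sending the token as a `Bearer` Authorization header to the /assets endpoint is an assumption, and the sketch assumes that whatever Amazon credentials are needed arrive with the redirect itself, so the GitHub token is not forwarded to the storage host.

```python
from typing import Callable, Dict, Mapping, Optional, Tuple

# Hypothetical transport: takes (url, headers) and returns
# (status, response_headers, body) WITHOUT following redirects.
Response = Tuple[int, Mapping[str, str], bytes]
Fetch = Callable[[str, Mapping[str, str]], Response]

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def download_asset(url: str, fetch: Fetch, token: Optional[str] = None) -> bytes:
    """Fetch an attachment, following at most one redirect by hand."""
    headers: Dict[str, str] = {}
    if token:
        # Assumption: private repositories need the token on /assets.
        headers["Authorization"] = f"Bearer {token}"

    status, resp_headers, body = fetch(url, headers)

    if status in REDIRECT_STATUSES:
        # Second hop: do not leak the GitHub token to the storage host.
        status, resp_headers, body = fetch(resp_headers["Location"], {})

    if status != 200:
        raise RuntimeError(f"attachment download failed with status {status}")
    return body
```

Following the redirect manually, rather than letting an HTTP client do it, keeps control over which headers reach each host.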

Markdown images in private repositories with attachments uploaded before May 9 are not supported yet. See the follow-up issue for details.

Screenshots or screen recordings

Screen Recording 2023-09-01 at 2.39.37 PM.mov

How to set up and validate locally

  1. Clone the branch
  2. On GitHub, create an issue/MR and drag and drop some images
    1. This will automatically use the new format
    2. To test the old format, open the image you attached and copy its URL directly into hand-written Markdown
  3. Do the same thing on a private repo
  4. Trigger the import from GitHub feature

MR acceptance checklist

This checklist encourages us to confirm that any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #422604 (closed)

Edited by Max Fan