Research: Should we allow search engines to index older versions of the site?
When using Google as the search backend for docs.gitlab.com, we cannot return results for older versions of the docs, because we currently do not allow Google to crawl that older content.
We block crawlers in a few spots:
- robots.txt: https://gitlab.com/gitlab-org/gitlab-docs/-/blob/main/content/robots.txt.erb#L18
- `noindex` meta tag: https://gitlab.com/gitlab-org/gitlab-docs/-/blob/main/layouts/head.html#L15
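
For reference, the blocking works roughly like the snippet below. The paths and version number are illustrative only; the real rules live in the two files linked above.

```
# robots.txt (generated from robots.txt.erb): keep crawlers out of
# versioned paths. "14.10" is an example version, not the actual rule.
User-agent: *
Disallow: /archives/
Disallow: /14.10/
```

head.html also emits a `noindex` meta tag, which tells search engines not to add a crawled page to their index at all.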
Having duplicate content can negatively impact SEO, and we would not want an older version of a page to appear in a regular google.com search above the latest version.
However, we may be able to allow crawling of older versions without causing search problems. Let's research:
- https://developers.google.com/search/blog/2009/12/handling-legitimate-cross-domain
- https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls
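
The second link covers `rel="canonical"`, which is the most likely mechanism for allowing this safely: older versions stay crawlable, but every archived page declares the latest copy as the one Google should index and rank. A rough sketch, using a made-up page path:

```html
<!-- Emitted in the <head> of an archived page, for example
     docs.gitlab.com/14.10/ee/user/project/issues/ (illustrative path):
     tell search engines the latest copy is the preferred version. -->
<link rel="canonical" href="https://docs.gitlab.com/ee/user/project/issues/">
```

With canonicals in place, duplicate-content penalties and "old page outranks new page" problems should largely go away, though Google treats the tag as a hint rather than a directive.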
Open questions:
- How long would it take for Google to crawl a release? There would likely be a lag between deploying the release and its content being searchable. How would we handle this with version filters?
- How do we prevent crawling in non-GitLab-hosted environments (e.g., a self-hosted docs site)? See the sketch below.
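
For the last question, one option is to emit the `noindex` tag in head.html whenever the site is not being built for docs.gitlab.com. This sketch assumes a hypothetical `base_url` config key; the real gitlab-docs config may expose the hostname differently:

```erb
<%# head.html sketch: only allow indexing on the canonical docs.gitlab.com build.
    `@config[:base_url]` is a hypothetical key, not necessarily what gitlab-docs defines. %>
<% if @config[:base_url] != 'https://docs.gitlab.com' %>
  <meta name="robots" content="noindex">
<% end %>
```

Self-hosted or review-app builds would then keep the current blanket `noindex` behavior automatically.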
Related: #1132