Do not normalize canonical URLs
We implemented canonical URLs in !211 (merged) a long time ago (see layouts/canonical_urls.html) so that versioned docs don't show up as duplicate URLs in Google searches. However, it turns out they don't work on versioned docs.
They are broken by the find and replace that the post-build script runs: for the versioned docs, it rewrites every docs.gitlab.com URL, including the canonical ones.
We need to exclude the canonical URL occurrences from sed:
If you are doing a substitution with sed but need to exclude a specific line or pattern, that can be accomplished by prefixing an exclusion and using !.
The exclusion must come before the sed substitution it guards, so we repeat it in each -e expression.
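To isolate the mechanism, here is a minimal sketch on made-up input (the `keep`/`swap` lines and the `demo_exclude` name are hypothetical, not taken from the script). The address `/^keep/` selects matching lines, and the trailing `!` inverts the selection so the substitution applies only to the other lines.

```shell
# Hypothetical input: two lines that both contain "foo".
# /^keep/ selects lines starting with "keep"; the "!" negates the address,
# so s/foo/bar/ runs only on lines that do NOT start with "keep".
demo_exclude() {
  printf '%s\n' 'keep: foo' 'swap: foo' | sed -e '/^keep/! s/foo/bar/'
}

demo_exclude
# prints:
#   keep: foo
#   swap: bar
```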
Related issues
Closes #1568 (closed).
Test sed in an example
Replace any occurrences of ="https://docs.gitlab.com/ee/ with ="/15.10/ee/ (and likewise for omnibus), but skip the lines that contain rel="canonical or property="og:url.
sed -e '/\(rel="canonical\|property="og:url\)/! s#="https://docs.gitlab.com/ee/#="/15.10/ee/#g' -e '/\(rel="canonical\|property="og:url\)/! s#="https://docs.gitlab.com/omnibus/#="/15.10/omnibus/#g' <<EOF
<link rel="canonical" href="https://docs.gitlab.com/ee/api/index.html" />
<meta property="og:url" content="https://docs.gitlab.com/ee/api/index.html" />
href="https://docs.gitlab.com/ee/ci/yaml/includes.html"
href="https://docs.gitlab.com/omnibus/architecture/"
EOF
You should see the following:
<link rel="canonical" href="https://docs.gitlab.com/ee/api/index.html" />
<meta property="og:url" content="https://docs.gitlab.com/ee/api/index.html" />
href="/15.10/ee/ci/yaml/includes.html"
href="/15.10/omnibus/architecture/"
Test in the docs site
- Check out the branch.
- Remove public and rebuild the site. We need to build the production site to populate the canonical URLs:
    make clean && NANOC_ENV=production make compile
- Copy public to a version:
    cp -a public 15.10
- Run the script:
    scripts/normalize-links.sh . 15.10
- Open a few HTML files that contain external links and verify that:
  - The following lines exist:
      <link rel="canonical" href="https://docs.gitlab.com/......" />
      <meta property="og:url" content="https://docs.gitlab.com/....." />
  - There are no occurrences of href="https://docs.gitlab.com in the rest of the file.
  For example, run:
    diff 15.10/ee/raketasks/backup_gitlab.html public/ee/raketasks/backup_gitlab.html
  public should contain links to https://docs.gitlab.com/omnibus/, and in 15.10 those should have been replaced by /15.10/omnibus/. The canonical URLs should be identical in both and not shown in the diff, so the following should print nothing:
    diff 15.10/ee/raketasks/backup_gitlab.html public/ee/raketasks/backup_gitlab.html | grep canonical
    diff 15.10/ee/raketasks/backup_gitlab.html public/ee/raketasks/backup_gitlab.html | grep og:url
  Test another file; it should yield similar results:
    diff 15.10/omnibus/settings/backups.html public/omnibus/settings/backups.html
    diff 15.10/omnibus/settings/backups.html public/omnibus/settings/backups.html | grep canonical
    diff 15.10/omnibus/settings/backups.html public/omnibus/settings/backups.html | grep og:url
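The manual checks above could also be scripted. The following is only a sketch: `check_page` is a hypothetical helper that does not exist in the docs repository, and it assumes the attribute order shown in the sed example (`rel` before `href`, `property` before `content`).

```shell
# check_page FILE
# Hypothetical helper: succeeds when FILE keeps its absolute canonical and
# og:url tags and has no other href pointing at https://docs.gitlab.com.
check_page() {
  file=$1

  # Both tags must still carry the absolute docs.gitlab.com URL.
  grep -q 'rel="canonical" href="https://docs.gitlab.com/' "$file" || return 1
  grep -q 'property="og:url" content="https://docs.gitlab.com/' "$file" || return 1

  # Drop the two excluded tags, then look for leftover absolute links.
  if grep -v -e 'rel="canonical"' -e 'property="og:url"' "$file" |
     grep -q 'href="https://docs.gitlab.com'; then
    return 1
  fi
  return 0
}
```

For example, `check_page 15.10/ee/raketasks/backup_gitlab.html && echo OK` would print OK only when the page passes both checks.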