Do not normalize canonical URLs

Achilleas Pipinellis requested to merge axil-fix-canonical-urls-sed into main

We implemented canonical URLs in !211 (merged) a long time ago (see layouts/canonical_urls.html) so that versioned docs don't produce duplicate URLs in Google search results. However, it turns out they don't work on versioned docs.

This is broken by the find and replace that the post-build script runs: in the versioned docs, all canonical URLs get replaced as well.
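
To illustrate the failure, here is a minimal sketch that assumes the script's substitution has the same shape as the one in the sed test further down, minus the exclusion. The version 15.10 and the sample line are illustrative only:

```shell
# Assumed shape of the unguarded substitution: it also rewrites the canonical link.
broken=$(printf '%s\n' '<link rel="canonical" href="https://docs.gitlab.com/ee/api/index.html" />' \
  | sed -e 's#="https://docs.gitlab.com/ee/#="/15.10/ee/#g')
printf '%s\n' "$broken"
```

The canonical href now points at /15.10/ee/..., which defeats its purpose.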

We need to exclude the canonical URL occurrences from sed:

If you are doing a substitution with sed but need to exclude a specific line or pattern, that can be accomplished by prefixing an exclusion and using !.

The exclusion must come before the sed substitution it guards, so we add it to each -e expression.
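
A minimal sketch of the `!` negation on its own, using made-up input (the `keep` and `swap` markers and the foo/bar strings are purely illustrative):

```shell
# Lines matching /keep/ are skipped; the substitution runs on every other line.
result=$(printf '%s\n' 'keep: foo' 'swap: foo' | sed -e '/keep/! s/foo/bar/')
printf '%s\n' "$result"
```

Only the second line is changed to `swap: bar`; the first is printed untouched.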

Related issues

#1584 (closed)

Closes #1568 (closed).

Test sed in an example

Replace every occurrence of ="https://docs.gitlab.com/ee/ with ="/15.10/ee/ (and likewise for omnibus), but skip lines that contain rel="canonical or property="og:url.

sed -e '/\(rel="canonical\|property="og:url\)/! s#="https://docs.gitlab.com/ee/#="/15.10/ee/#g' -e '/\(rel="canonical\|property="og:url\)/! s#="https://docs.gitlab.com/omnibus/#="/15.10/omnibus/#g' <<EOF
<link rel="canonical" href="https://docs.gitlab.com/ee/api/index.html" />
<meta property="og:url" content="https://docs.gitlab.com/ee/api/index.html" />
href="https://docs.gitlab.com/ee/ci/yaml/includes.html"
href="https://docs.gitlab.com/omnibus/architecture/"
EOF

You should see the following:

<link rel="canonical" href="https://docs.gitlab.com/ee/api/index.html" />
<meta property="og:url" content="https://docs.gitlab.com/ee/api/index.html" />
href="/15.10/ee/ci/yaml/includes.html"
href="/15.10/omnibus/architecture/"
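
The `\|` alternation in the command above is a GNU sed extension to basic regular expressions. As a hedged alternative, not something this merge request uses, the same exclusion could be written with extended regular expressions through `-E`. The sketch below is reduced to a single expression and two input lines:

```shell
# Same exclusion, written with -E so that ( | ) need no backslashes.
output=$(sed -E -e '/(rel="canonical|property="og:url)/! s#="https://docs.gitlab.com/ee/#="/15.10/ee/#g' <<'EOF'
<link rel="canonical" href="https://docs.gitlab.com/ee/api/index.html" />
href="https://docs.gitlab.com/ee/ci/yaml/includes.html"
EOF
)
printf '%s\n' "$output"
```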

Test in the docs site

  1. Check out the branch.

  2. Remove public and rebuild the site. We need to build the production site to populate the canonical URLs:

    make clean && NANOC_ENV=production make compile
  3. Copy public to a version:

    cp -a public 15.10
  4. Run the script:

    scripts/normalize-links.sh . 15.10
  5. Open a few HTML files that contain external links and verify that:

    • The following lines exist:

      <link rel="canonical" href="https://docs.gitlab.com/......" />
      <meta property="og:url" content="https://docs.gitlab.com/....." />
    • There are no occurrences of href="https://docs.gitlab.com in the rest of the file.

    For example, run:

    diff 15.10/ee/raketasks/backup_gitlab.html public/ee/raketasks/backup_gitlab.html

    public should contain links to https://docs.gitlab.com/omnibus/, and in 15.10 those should have been replaced by /15.10/omnibus/. The canonical URLs should be identical in both and should not appear in the diff, so the following commands should print nothing:

    diff 15.10/ee/raketasks/backup_gitlab.html public/ee/raketasks/backup_gitlab.html | grep canonical
    diff 15.10/ee/raketasks/backup_gitlab.html public/ee/raketasks/backup_gitlab.html | grep og:url

    Test another file; it should yield similar results:

    diff 15.10/omnibus/settings/backups.html public/omnibus/settings/backups.html
    diff 15.10/omnibus/settings/backups.html public/omnibus/settings/backups.html | grep canonical
    diff 15.10/omnibus/settings/backups.html public/omnibus/settings/backups.html | grep og:url
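
The manual checks in step 5 could be wrapped in a small helper. This is only a sketch: `check_page` is a hypothetical function, not part of the docs repository, and it assumes the canonical and og:url tags each sit on their own line.

```shell
# Hypothetical helper: compare one page of the versioned copy with the original build.
check_page() {
  versioned=$1
  original=$2
  if [ ! -f "$versioned" ] || [ ! -f "$original" ]; then
    echo "FAIL: missing file"
    return 1
  fi
  # The canonical and og:url lines must be identical, so they must not show up in the diff.
  if diff "$versioned" "$original" | grep -E -q 'rel="canonical"|property="og:url"'; then
    echo "FAIL: canonical or og:url changed in $versioned"
    return 1
  fi
  # Outside those two tags, no absolute docs.gitlab.com href should remain.
  if grep -v -E 'rel="canonical"|property="og:url"' "$versioned" \
      | grep -q 'href="https://docs.gitlab.com'; then
    echo "FAIL: unreplaced link in $versioned"
    return 1
  fi
  echo "OK: $versioned"
}
```

For example, `check_page 15.10/ee/raketasks/backup_gitlab.html public/ee/raketasks/backup_gitlab.html` should print an OK line once the script has run.
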
Edited by Achilleas Pipinellis