Scrape canonical URLs for link topics

Canonical URLs seem to be desired by most users as they do not contain tracking parameters or other identifiable information in the topic's link. Most content providers have a canonical URL based on my research using DuckDuckGo's news aggregator.

example 1:

original link from DuckDuckGo's news feed includes tracking parameters:

https://www.marketwatch.com/story/tesla-stock-rises-after-car-maker-reports-wider-quarterly-loss-but-beats-on-revenue-2018-08-01?siteid=yhoof2&yptr=yahoo

the page's source contains:

<link href="https://www.marketwatch.com/story/tesla-stock-soars-after-most-valuable-apology-of-all-time-but-profitability-questions-remain-2018-08-02" id="canonical-link" rel="canonical">

The scraper used in #171 (closed) could be expanded to include changing the submitted topic's link to the canonical URL, if a link with rel="canonical" is discovered.

example 2:

A mobile link is posted - Another use case for this is when a user posts the mobile version of page. Most sites currently put the onus on the user to post the correct link, but this is easy automated:

page https://en.m.wikipedia.org/wiki/Jet_engine

contains:

<link rel="canonical" href="https://en.wikipedia.org/wiki/Jet_engine"/>

Example 3:

AMP links, especially ones to google’s CDN cache seem to be universally despised. The AMP standard requires a rel=“canonical” link tag as well. So these would also be converted.

Edited Sep 12, 2018 by Adams T

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information