Spam prevention: Option to prevent URL following for untrusted users (lower rating for search engine crawlers)
### Release notes

### Problem to solve
Spam is a hard battle to fight. I have experienced that myself, and one of the countermeasures was captchas, but they were not enough. We migrated a forum to Discourse some years ago and learned that its spam prevention system uses Akismet to check posts from new users (trust level 0).
This system cannot actively prevent spam though. One bot pattern is to post URLs to gain a better search engine ranking. You can control this behaviour with the `rel="nofollow"` attribute on HTML anchor tags; links without it are followed by default. Discourse uses this method to allow followable links only for trusted users.
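For illustration, the rendered HTML would differ like this (example URLs are made up):

```html
<!-- Link posted by an untrusted user: crawlers ignore it for ranking -->
<a href="https://example.com/spam" rel="nofollow">example.com/spam</a>

<!-- Link posted by a trusted member: no attribute, followed by default -->
<a href="https://example.com/project">example.com/project</a>
```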
The idea originates from a Hacker News discussion: https://news.ycombinator.com/item?id=24924626

My "old" thoughts, which grew too big with a trust level and gamification system, are here: #14156 (comment 258252735) cc @heather @JohnathanHunt @sytses
### Intended users

### User experience goal

Decrease the number of spam bots registering and creating issues, as their content URLs are no longer followed (and therefore not indexed by crawlers).

### Proposal
Add `rel="nofollow"` to all URLs posted by users who are
- not in the group / organization
- not at least reporter/developer
This makes search engine crawlers ignore the URL and leaves the relation unindexed. Bots will learn their ineffectiveness and pick other targets.

This applies to any Markdown content that is rendered as a URL:
- Descriptions in Issues/MRs
- Comments
- Wiki
- Snippets
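A minimal sketch of the proposed filter, in Python for illustration only: GitLab's real rendering pipeline is Ruby-based with proper HTML filters, and the role names, membership check, and regex approach here are all assumptions.

```python
import re

# Assumed role names; the proposal requires at least reporter/developer access.
TRUSTED_ROLES = {"reporter", "developer", "maintainer", "owner"}


def is_trusted(user_role: str, is_member: bool) -> bool:
    """A user keeps followable links only if they are a member of the
    group/organization AND hold at least reporter access."""
    return is_member and user_role in TRUSTED_ROLES


def add_nofollow(html: str) -> str:
    """Insert rel="nofollow" into every <a> tag that lacks a rel attribute.
    A regex is good enough for a sketch; a real implementation should use
    an HTML sanitization filter instead."""
    def patch(match: re.Match) -> str:
        tag = match.group(0)
        if "rel=" in tag:
            return tag
        return tag[:-1] + ' rel="nofollow">'

    return re.sub(r"<a\b[^>]*>", patch, html)


def render_user_content(html: str, user_role: str, is_member: bool) -> str:
    """Post-process rendered Markdown output depending on the author."""
    if is_trusted(user_role, is_member):
        return html
    return add_nofollow(html)
```

Usage: `render_user_content('<a href="https://spam.example">x</a>', "guest", False)` returns the same link with `rel="nofollow"` inserted, while content from a member with developer access passes through unchanged.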
### Further details

This may require changes to our Markdown rendering engine.

A performance decrease is possible; the actual impact is hard to determine ahead of time for large-scale instances. I therefore recommend making this an instance-level option for administrators, disabled by default.
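As a tiny illustration of the default-off instance setting (the setting name and dictionary shape are invented for this sketch, not GitLab's actual application-settings schema):

```python
# Hypothetical instance-level toggle; "nofollow_untrusted_links" is an
# invented name. Disabled by default, so existing instances are unaffected.
DEFAULTS = {"nofollow_untrusted_links": False}


def nofollow_untrusted_links_enabled(instance_settings: dict) -> bool:
    """Return whether an administrator has enabled the option."""
    merged = {**DEFAULTS, **instance_settings}
    return merged["nofollow_untrusted_links"]
```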
### Permissions and Security

### Documentation

Settings and descriptions for security: https://docs.gitlab.com/ee/security/README.html

A troubleshooting note may be needed as well.
### Availability & Testing
- Performance impact
### What does success look like, and how can we measure that?
Fewer spam reports from public self-hosted instances, a positive impact on GitLab.com when enabled, and bots learning that their URLs are no longer followed.
### What is the type of buyer?
This should be a Core feature available for everyone.
### Is this a cross-stage feature?

Manage / Access for enabling the setting, and the URL renderer in the backend. It may also touch the editor (snippets) and the wiki.