Improve SEO for GitLab and public repositories
Description
It's important for our public projects to be easy to find in search engines, to promote both ourselves and GitLab. Currently GitLab offers very poor support for indexing repositories and owner pages. For example, searching for "f-droid" on Google returns f-droid.org, GitHub, and other pages, but gitlab.com is not even in the top 10 of my search results.
Links / references
https://ma.ttias.be/technical-guide-seo/
1) Force a single domain
This is a somewhat unusual setup: when someone goes to https://gitlab.com we redirect them to https://about.gitlab.com, and anyone trying to reach www.gitlab.com is likewise redirected to about.gitlab.com. Doing a Fetch as Google of the home page shows this is an area of concern, because we are redirecting with a 302, which signifies that a page has been moved temporarily. That's not what is happening here, so these should be 301 (permanent) redirects.
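For reference, a minimal sketch of what a permanent redirect could look like if nginx terminates these requests (our actual stack may handle this in a load balancer or the application layer instead; certificate paths are placeholders):

```nginx
server {
    listen 443 ssl;
    server_name www.gitlab.com;

    ssl_certificate     /etc/ssl/gitlab.pem;   # placeholder path
    ssl_certificate_key /etc/ssl/gitlab.key;   # placeholder path

    # 301 tells search engines the move is permanent, so ranking
    # signals transfer to the target URL; a 302 does not guarantee that.
    return 301 https://about.gitlab.com$request_uri;
}
```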
2) Prefer HTTPS over HTTP
We are forcing HTTPS on the site. No action needs to be taken.
3) Optimize for speed
I also checked the load speed of pages on https://gitlab.com for potential issues. I tested the following pages:
- https://gitlab.com/connorshea
- https://gitlab.com/fdroid/fdroidclient
- https://gitlab.com/gitlab-org/gitlab-ce
- https://gitlab.com/gitlab-org/gitlab-ce/issues
It appears there is some work that can be done to speed up server response, but across the board our biggest issue is how we deal with images. A good amount could be gained by compressing them.
- Create issue for GitLab CE to have the front-end team look into speeding up pages within the GitLab product: https://gitlab.com/gitlab-org/gitlab-ce/issues/25063
4) META tags: title & description
The title and meta description are pulled from the project name and description. This is how GitHub does it, and it seems to make the most sense. No action needs to be taken.
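For illustration, the head of a project page ends up looking roughly like this (values here are invented from the project name and description; this is not the literal markup GitLab emits):

```html
<!-- Illustrative only; built from the project name and description. -->
<title>fdroid / fdroidclient · GitLab</title>
<meta name="description" content="Android client for the F-Droid app repository.">
```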
5) Semantic page markup
We have multiple H1 tags on some pages. It's not a huge problem, and at this point I'd probably not worry about changing it, since there are much bigger issues we can focus on for a better return on the time spent.
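For context, the conventional structure is a single h1 per page with subheadings nested under it, along these lines:

```html
<!-- One <h1> per page, with nested subheadings below it. -->
<h1>gitlab-org / gitlab-ce</h1>
<h2>Repository</h2>
<h2>Issues</h2>
```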
6) Eliminate duplicate content
We do not have canonical tags on pages, and we should add them. Google's documentation explains how to implement canonical tags for websites.
- Create issue for implementing canonical tags for gitlab.com: https://gitlab.com/gitlab-org/gitlab-ce/issues/25065
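A canonical tag is a single link element in the page head pointing at the preferred URL for the content; every URL variant of a page would declare the same target, e.g.:

```html
<!-- Any URL variant of this page declares the same canonical URL. -->
<link rel="canonical" href="https://gitlab.com/gitlab-org/gitlab-ce/issues">
```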
7) Clean URL structures
Our URL structure is clean and makes sense. It lays out the structure of the website clearly and uses words in the URLs instead of numeric IDs. No action needs to be taken.
8) A responsive layout
We have a responsive layout that looks great on mobile. No action needs to be taken.
9) Robots.txt
Our robots.txt blocks a number of pages, but none that we would want indexed by search engines. There are no pages inadvertently blocked. No action needs to be taken.
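For reference, the directives in a robots.txt follow this pattern (a generic illustration of the syntax, not the literal contents of our file):

```text
# Generic example of robots.txt directives; not our actual file.
User-agent: *
Disallow: /search
Disallow: /autocomplete/
```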
10) Sitemap.xml
There is no sitemap for gitlab.com, and given how profile and project paths are named, I don't know exactly how one would work. GitHub also does not have a sitemap.xml file; however, sitemaps can be submitted through Google Webmaster Tools. This would be a great thing for us to do to make sure all public repos are being crawled and indexed by Google.
- Create an XML sitemap and automate its creation using Screaming Frog or another tool
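The sitemap protocol itself is simple: an XML file listing the URLs we want crawled, which can be generated offline and submitted to Google. A minimal sketch (the URL and change frequency are just examples):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per public project, group, and user page. -->
  <url>
    <loc>https://gitlab.com/gitlab-org/gitlab-ce</loc>
    <changefreq>daily</changefreq>
  </url>
</urlset>
```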
11) Google's Webmaster Tools
Our site has been set up on Google Webmaster Tools/Search Console. No action needs to be taken.
12) Twitter Cards
We have the ability for Twitter cards to be set up, and project pages such as GitLab CE and F-droid already render them.
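The tags behind such a card look roughly like this (approximate shape with invented values, not the literal markup we emit):

```html
<!-- Approximate shape of the Twitter card tags on a project page. -->
<meta name="twitter:card" content="summary">
<meta name="twitter:title" content="GitLab.org / GitLab Community Edition">
<meta name="twitter:description" content="GitLab Community Edition (CE) is an open source project.">
<meta name="twitter:image" content="https://gitlab.com/uploads/.../avatar.png">
```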
No action needs to be taken.
13) Facebook's Open Graph
We have Open Graph tags on project pages as well.
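They follow the standard Open Graph shape, roughly like this (approximate, with invented values):

```html
<!-- Approximate shape of the Open Graph tags on a project page. -->
<meta property="og:type" content="object">
<meta property="og:title" content="GitLab.org / GitLab Community Edition">
<meta property="og:description" content="GitLab Community Edition (CE) is an open source project.">
<meta property="og:url" content="https://gitlab.com/gitlab-org/gitlab-ce">
```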
No action needs to be taken.
14) HTML meta tags for indexing
We aren't using any HTML meta tags to guide indexing. We could do this, but I don't think this would add much value. No action needs to be taken for now.
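For reference, if we ever did want per-page indexing control, it would be a hypothetical one-line addition to the page head:

```html
<!-- Hypothetical: tell crawlers to skip a page entirely. -->
<meta name="robots" content="noindex, nofollow">
```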
15) Crawl your site for 404's
Because we allow people to create their own content on our site, we can't really control or fix broken links within that content. This isn't a huge factor in Google's algorithm. We also have around 6 million pages indexed in Google, which would be very resource intensive to crawl for 404 pages.
16) Crawl your site for anything but HTTP 200's
Once again, we have 6 million pages indexed, and it would be very resource intensive to crawl the entire site for HTTP status codes that aren't 200. This might be something we do at a later time.
17) Geolocation of your IP address
I don't know the details of our hosting setup; we would need to speak with the infrastructure team about this. The guide recommends the following for a situation like ours, where we are targeting multiple regions:
If you're targeting multiple regions, consider a couple of extra technical features:
* A CDN like CloudFlare, that helps you host your site across the globe by mirroring it
* Set up your own proxy in multiple locations, and redirect your .NL domains to a proxy in the Netherlands, a .BE domain to your proxy in Belgium, etc.
18) Site maintenance: HTTP 503
After a conversation on Slack, it seems that we do not send HTTP 503 responses during maintenance, but 200 status codes.
- Create/update issue about sending 503 status codes during maintenance
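A minimal sketch of what maintenance-mode 503s could look like, assuming nginx fronts the site (our actual infrastructure may differ; names and paths here are placeholders):

```nginx
server {
    listen 80;
    server_name gitlab.com;

    # Serve the maintenance page with a 503 so search engines know
    # the outage is temporary and keep the existing index intact.
    error_page 503 /maintenance.html;

    location = /maintenance.html {
        root /var/www/maintenance;   # placeholder path
        internal;
    }

    location / {
        # Hint to crawlers when to retry (in seconds).
        add_header Retry-After 3600 always;
        return 503;
    }
}
```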