Incorrect HEAD request handling
Hi, I maintain repology.org, a service which aggregates information on F/OSS software packages across a lot of repositories. The service includes a link checker that is run on links embedded in packages (such as upstream URLs) to report broken links to package maintainers and upstreams. Notably, it performs HEAD requests and uses the repology-linkchecker/1 (+https://repology.org/docs/bots) UA (and in case you're worried, it strictly adheres to a 1/3 RPS limit per host).
It doesn't work correctly on lib.rs, producing a redirect loop.
I believe the problem is caused by a combination of the HEAD request and the fact that is_bot() is triggered. Under these conditions, the /crates/{crate} endpoint responds with 303 instead of the expected 200, and that starts a loop. Here's how it can be reproduced:
% curl --user-agent 'repology-linkchecker/1 (+https://repology.org/docs/bots)' -s -I https://lib.rs/crates/avif-decode
HTTP/2 303
date: Wed, 28 May 2025 19:59:01 GMT
content-length: 0
server: cloudflare
cf-ray: 947054830c44ca81-HAM
x-powered-by: actix-web lib.rs/2.1.15
location: /keywords/crates
strict-transport-security: max-age=31536000; includeSubDomains
content-security-policy: upgrade-insecure-requests
content-security-policy: default-src 'none';script-src 'self';connect-src 'self';img-src 'self' img.shields.io img.gs;style-src 'self' 'unsafe-inline';form-action 'self';frame-ancestors 'none';font-src 'self';
x-frame-options: SAMEORIGIN
x-content-type-options: nosniff
cf-cache-status: DYNAMIC
alt-svc: h3=":443"; ma=86400
server-timing: cfL4;desc="?proto=TCP&rtt=38426&min_rtt=38334&rtt_var=10947&sent=7&recv=8&lost=0&retrans=0&sent_bytes=3389&recv_bytes=867&delivery_rate=104690&cwnd=253&unsent_bytes=0&cid=e07e9a53a24b53de&ts=210&x=0"
The whole redirect chain looks like this:
% curl --user-agent 'repology-linkchecker/1 (+https://repology.org/docs/bots)' -s -I 'https://lib.rs/crates/avif-decode' | egrep 'HTTP|location'
HTTP/2 303
location: /keywords/crates
% curl --user-agent 'repology-linkchecker/1 (+https://repology.org/docs/bots)' -s -I 'https://lib.rs/keywords/crates' | egrep 'HTTP|location'
HTTP/2 307
location: /search?q=keywords&f=1
% curl --user-agent 'repology-linkchecker/1 (+https://repology.org/docs/bots)' -s -I 'https://lib.rs/search?q=keywords&f=1' | egrep 'HTTP|location'
HTTP/2 308
location: /crates/search
% curl --user-agent 'repology-linkchecker/1 (+https://repology.org/docs/bots)' -s -I 'https://lib.rs/crates/search' | egrep 'HTTP|location'
HTTP/2 303
location: /keywords/crates
It looks like actix-web doesn't handle HEAD requests automatically, so instead of getting into handle_crate we end up in default_handler, which starts doing weird things.
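For illustration, here's a minimal sketch of what I suspect is going on; this is not lib.rs's actual code, and the handler bodies and redirect target are made up. A route registered only for GET doesn't pass the method guard for a HEAD request, so the request falls through to the app-level default service:

```rust
use actix_web::{web, App, HttpResponse, HttpServer, Responder};

// Registered for GET only; a HEAD request does not pass the GET method guard.
async fn handle_crate(path: web::Path<String>) -> impl Responder {
    HttpResponse::Ok().body(format!("crate page for {}", path.into_inner()))
}

// App-level fallback; in this sketch it redirects, so a HEAD request that
// fails to match the GET guard gets a 303 instead of the expected 200.
async fn default_handler() -> impl Responder {
    HttpResponse::SeeOther()
        .insert_header(("location", "/keywords/crates"))
        .finish()
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| {
        App::new()
            .route("/crates/{crate}", web::get().to(handle_crate))
            .default_service(web::route().to(default_handler))
    })
    .bind(("127.0.0.1", 8080))?
    .run()
    .await
}
```

With this setup, `curl -I http://127.0.0.1:8080/crates/foo` gets the 303 while a plain GET gets the 200, which matches the behaviour I'm seeing.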
I guess the correct fix would be to add a middleware which forwards HEAD requests to GET and discards the response body. An alternative would be to just respond with 405 to HEAD requests, but that's far less desirable.
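A rough sketch of such a middleware, assuming a recent actix-web with middleware::from_fn (older versions have the same helper in actix-web-lab); the head_as_get name is made up:

```rust
use actix_web::{
    body::{BoxBody, MessageBody},
    dev::{ServiceRequest, ServiceResponse},
    http::Method,
    middleware::Next,
    Error,
};

// Rewrite HEAD to GET before routing so the request reaches the same handler
// a GET would, then discard the response body while keeping status and headers.
async fn head_as_get(
    mut req: ServiceRequest,
    next: Next<impl MessageBody + 'static>,
) -> Result<ServiceResponse<BoxBody>, Error> {
    let was_head = req.method() == Method::HEAD;
    if was_head {
        req.head_mut().method = Method::GET;
    }
    let res = next.call(req).await?;
    if was_head {
        // Keep the status line and headers, drop the payload.
        Ok(res.map_body(|_, _| BoxBody::new(())))
    } else {
        Ok(res.map_into_boxed_body())
    }
}
```

It would then be registered on the App with `.wrap(actix_web::middleware::from_fn(head_as_get))`, so the method rewrite happens before routing.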