Incorrect HEAD request handling
Hi, I maintain repology.org, a service which aggregates information on F/OSS software packages across a lot of repositories. The service includes a link checker that is run on links embedded in packages (such as upstream URLs) to report broken links to package maintainers and upstreams. Notably, it performs HEAD requests and uses the repology-linkchecker/1 (+https://repology.org/docs/bots) UA (and in case you're worried, it strictly adheres to a 1/3 RPS limit per host).
It doesn't work correctly on lib.rs, producing a redirect loop.
I believe the problem is caused by a combination of the HEAD request and the fact that is_bot() is triggered. Under these conditions, the /crates/{crate} endpoint responds with 303 instead of the expected 200, and that starts a loop. Here's how it can be reproduced:
% curl --user-agent 'repology-linkchecker/1 (+https://repology.org/docs/bots)' -s -I https://lib.rs/crates/avif-decode
HTTP/2 303
date: Wed, 28 May 2025 19:59:01 GMT
content-length: 0
server: cloudflare
cf-ray: 947054830c44ca81-HAM
x-powered-by: actix-web lib.rs/2.1.15
location: /keywords/crates
strict-transport-security: max-age=31536000; includeSubDomains
content-security-policy: upgrade-insecure-requests
content-security-policy: default-src 'none';script-src 'self';connect-src 'self';img-src 'self' img.shields.io img.gs;style-src 'self' 'unsafe-inline';form-action 'self';frame-ancestors 'none';font-src 'self';
x-frame-options: SAMEORIGIN
x-content-type-options: nosniff
cf-cache-status: DYNAMIC
alt-svc: h3=":443"; ma=86400
server-timing: cfL4;desc="?proto=TCP&rtt=38426&min_rtt=38334&rtt_var=10947&sent=7&recv=8&lost=0&retrans=0&sent_bytes=3389&recv_bytes=867&delivery_rate=104690&cwnd=253&unsent_bytes=0&cid=e07e9a53a24b53de&ts=210&x=0"
The whole redirect chain looks like this:
% curl --user-agent 'repology-linkchecker/1 (+https://repology.org/docs/bots)' -s -I 'https://lib.rs/crates/avif-decode' | egrep 'HTTP|location'
HTTP/2 303
location: /keywords/crates
% curl --user-agent 'repology-linkchecker/1 (+https://repology.org/docs/bots)' -s -I 'https://lib.rs/keywords/crates' | egrep 'HTTP|location'
HTTP/2 307
location: /search?q=keywords&f=1
% curl --user-agent 'repology-linkchecker/1 (+https://repology.org/docs/bots)' -s -I 'https://lib.rs/search?q=keywords&f=1' | egrep 'HTTP|location'
HTTP/2 308
location: /crates/search
% curl --user-agent 'repology-linkchecker/1 (+https://repology.org/docs/bots)' -s -I 'https://lib.rs/crates/search' | egrep 'HTTP|location'
HTTP/2 303
location: /keywords/crates
It looks like actix-web doesn't handle HEAD requests automatically, so instead of getting into handle_crate we end up in default_handler, which starts doing weird things.
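For illustration, here's a minimal sketch of what I suspect is going on; this is not lib.rs's actual code, and the handler bodies and redirect target are made up. A route registered only for GET doesn't pass the method guard for a HEAD request, so the request falls through to the app-level default service:

```rust
use actix_web::{web, App, HttpResponse, HttpServer, Responder};

// Registered for GET only; a HEAD request does not pass the GET method guard.
async fn handle_crate(path: web::Path<String>) -> impl Responder {
    HttpResponse::Ok().body(format!("crate page for {}", path.into_inner()))
}

// App-level fallback; in this sketch it redirects, so a HEAD request that
// fails to match the GET guard gets a 303 instead of the expected 200.
async fn default_handler() -> impl Responder {
    HttpResponse::SeeOther()
        .insert_header(("location", "/keywords/crates"))
        .finish()
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| {
        App::new()
            .route("/crates/{crate}", web::get().to(handle_crate))
            .default_service(web::route().to(default_handler))
    })
    .bind(("127.0.0.1", 8080))?
    .run()
    .await
}
```

With this setup, `curl -I http://127.0.0.1:8080/crates/foo` gets the 303 while a plain GET gets the 200, which matches the behaviour I'm seeing.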
I guess the correct fix would be to add a middleware which forwards HEAD requests to GET and discards the response body. An alternative would be to just respond with 405 to HEAD requests, but that's far less desirable.
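A rough sketch of such a middleware, assuming a recent actix-web with middleware::from_fn (older versions have the same helper in actix-web-lab); the head_as_get name is made up:

```rust
use actix_web::{
    body::{BoxBody, MessageBody},
    dev::{ServiceRequest, ServiceResponse},
    http::Method,
    middleware::Next,
    Error,
};

// Rewrite HEAD to GET before routing so the request reaches the same handler
// a GET would, then discard the response body while keeping status and headers.
async fn head_as_get(
    mut req: ServiceRequest,
    next: Next<impl MessageBody + 'static>,
) -> Result<ServiceResponse<BoxBody>, Error> {
    let was_head = req.method() == Method::HEAD;
    if was_head {
        req.head_mut().method = Method::GET;
    }
    let res = next.call(req).await?;
    if was_head {
        // Keep the status line and headers, drop the payload.
        Ok(res.map_body(|_, _| BoxBody::new(())))
    } else {
        Ok(res.map_into_boxed_body())
    }
}
```

It would then be registered on the App with `.wrap(actix_web::middleware::from_fn(head_as_get))`, so the method rewrite happens before routing.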