Revisit DNS lookup limits
For the longest time I have wanted to revisit the DNS lookup limits implementation in viaspf. After a close reading of RFC 7208 and a deep dive into its genesis, I have concluded that our interpretation of the ambiguous language in section 4.6.4 must be updated.
Let’s use this opportunity to document our updated understanding here.
Relevant are the first five paragraphs of section 4.6.4.
Paragraph 1 documents the ‘overall’, total limit of certain terms in a record. This limit is tracked per SPF query. The mechanisms include, a, mx, ptr, and exists and the modifier redirect all increment this count, also when they occur recursively in an included record, and the count must not exceed 10. Importantly for what comes next, what’s limited here is the number of specific SPF terms, not the number of specific types of DNS lookups.
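To make the per-query term count concrete, here is a minimal sketch in Rust. The names Term, LookupBudget, and LimitExceeded are made up for illustration and are not viaspf's actual types; the sketch only assumes what paragraph 1 says, namely that six kinds of terms consume the budget and that the budget is 10.

```rust
/// Hypothetical, simplified term kinds (not viaspf's actual types).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Term {
    All,
    Include,
    A,
    Mx,
    Ptr,
    Ip4,
    Ip6,
    Exists,
    Redirect,
    Exp,
}

/// The overall limit from RFC 7208, section 4.6.4, paragraph 1.
pub const MAX_LOOKUP_TERMS: usize = 10;

impl Term {
    /// True for the terms that increment the per-query count.
    pub fn counts_toward_limit(self) -> bool {
        matches!(
            self,
            Term::Include | Term::A | Term::Mx | Term::Ptr | Term::Exists | Term::Redirect
        )
    }
}

/// Error returned when an eleventh counted term is encountered.
#[derive(Debug, PartialEq, Eq)]
pub struct LimitExceeded;

/// Counter shared by the whole SPF query, including included records.
#[derive(Debug, Default)]
pub struct LookupBudget {
    used: usize,
}

impl LookupBudget {
    /// Records one term; fails once more than 10 counted terms are seen.
    pub fn record(&mut self, term: Term) -> Result<(), LimitExceeded> {
        if !term.counts_toward_limit() {
            return Ok(());
        }
        if self.used >= MAX_LOOKUP_TERMS {
            return Err(LimitExceeded);
        }
        self.used += 1;
        Ok(())
    }

    /// Number of counted terms recorded so far.
    pub fn used(&self) -> usize {
        self.used
    }
}
```

Note that the counter is keyed on terms, not on DNS queries: nothing in this sketch needs to know how many queries a term ends up issuing.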
Paragraphs 2 and 3 add further requirements for mechanisms mx and ptr, respectively.
Paragraph 2 starts with two crucial sentences. It first states:
When evaluating the "mx" mechanism, the number of "MX" resource records queried is included in the overall limit of 10 mechanisms/modifiers that cause DNS lookups as described above.
The meaning of this is far from obvious, at least to me. What it means is that each mx mechanism (cost 1 per the first paragraph) includes in this count the one corresponding MX lookup. So one mx mechanism means one MX lookup, and together they increment the global count by 1. An example: the fragment mx:a.com mx:b.com has two mx mechanisms, which gives a lookup count of 2. The corresponding MX resource record lookups, i.e. those for a.com and b.com, are included in that count. So the total is 2, not 4. The number of MX names returned by the queries is not relevant for this count.
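The example can be checked with a small Rust sketch. The functions mx_lookup_cost and is_mx_mechanism are hypothetical helpers written for this writeup, not part of viaspf, and the parsing is deliberately simplistic: it assumes a whitespace-separated fragment and only recognises the mx mechanism.

```rust
/// Returns true if `term` is an `mx` mechanism, with or without a
/// qualifier, a domain, or a CIDR length. Simplified for illustration.
fn is_mx_mechanism(term: &str) -> bool {
    let qualifiers: &[char] = &['+', '-', '~', '?'];
    // Drop at most one leading qualifier.
    let unqualified = term.strip_prefix(qualifiers).unwrap_or(term);
    // The mechanism name ends at the first ':' or '/'.
    let name = unqualified
        .split(|c: char| c == ':' || c == '/')
        .next()
        .unwrap_or("");
    name.eq_ignore_ascii_case("mx")
}

/// Contribution of the `mx` mechanisms in `fragment` to the global count.
/// Each mechanism costs 1, and its MX lookup is included in that 1.
/// The number of MX names the lookups return plays no role here.
pub fn mx_lookup_cost(fragment: &str) -> usize {
    fragment
        .split_whitespace()
        .filter(|term| is_mx_mechanism(term))
        .count()
}
```

With this sketch, mx_lookup_cost("mx:a.com mx:b.com") evaluates to 2, whether a.com publishes one MX name or twenty.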
Previously, we interpreted this sentence to mean that each of the MX names returned from the MX lookup would count against the global lookup limit. This is the obvious interpretation, but it is not the intended one. For example, the Wikipedia entry for SPF currently uses this interpretation. You will also often find this interpretation in various places online.
The paragraph goes on:
In addition to that limit, the evaluation of each "MX" record MUST NOT result in querying more than 10 address records -- either "A" or "AAAA" resource records. If this limit is exceeded, the "mx" mechanism MUST produce a "permerror" result.
The one MX lookup done above returns a number of MX names. For each of those an A or AAAA lookup is done. If more than 10 such lookups are done (so more than 10 MX names were returned in the MX lookup), it’s a permerror.
I scrolled through the public email conversations from 2013 that led to RFC 7208. For example, this conversation is helpful. The RFC author is asked: ‘As written, it seems to say that you can do up to 10 MX lookups, and each of those can result in up to 10 A lookups, for a total of 110 DNS lookups. Is that what you mean?’ The author responds: ‘Yes. This is the basis for the infamous 111 DNS lookups.’ An even clearer statement can be found in this older message from the RFC author. In the context of this writeup, this exchange almost qualifies as funny.
I cannot recommend this experience of going back in time and reading the exchanges on the IETF mailing lists. You can see how a lot of misunderstanding and talking past one another finally give birth to the unclear RFC text that we have today. You had to be there!
In RFC 4408, predecessor of RFC 7208, the lookup limits are described more clearly. RFC 4408, section 10.1:
SPF implementations MUST limit the number of mechanisms and modifiers that do DNS lookups to at most 10 per SPF check, including any lookups caused by the use of the "include" mechanism or the "redirect" modifier. If this number is exceeded during a check, a PermError MUST be returned. […]
When evaluating the "mx" and "ptr" mechanisms, or the %{p} macro, there MUST be a limit of no more than 10 MX or PTR RRs looked up and checked.
And section 5.4:
To prevent Denial of Service (DoS) attacks, more than 10 MX names MUST NOT be looked up during the evaluation of an "mx" mechanism (see Section 10).
Though note that there is a small difference here between these two RFCs: RFC 4408 simply says that no more than 10 MX names must be looked up, whereas RFC 7208 requires a permerror result in this case. This difference is not listed in RFC 7208, appendix B. So – apart from the detail just mentioned – the lookup limits implementation requirements should be unchanged from RFC 4408.
Finally, note that in our implementation we check the per-MX limit progressively, not immediately after the MX lookup. The first 10 MX names are checked as usual; only when evaluation reaches an eleventh name does it fail.
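A progressive check of this kind could look roughly like the following Rust sketch. It is not viaspf's code: evaluate_mx, MxOutcome, and the lookup_addrs callback are hypothetical, and the sketch assumes that a match found among the first ten names ends the evaluation before the limit comes into play.

```rust
use std::net::IpAddr;

/// Per-mechanism limit on address lookups (RFC 7208, section 4.6.4, paragraph 2).
pub const MAX_MX_ADDRESS_LOOKUPS: usize = 10;

/// Hypothetical outcome of evaluating a single `mx` mechanism.
#[derive(Debug, PartialEq, Eq)]
pub enum MxOutcome {
    Match,
    NoMatch,
    PermError,
}

/// Walks the MX names in order and resolves each via `lookup_addrs`
/// (an assumed callback standing in for the A/AAAA query).
/// The limit is only enforced when an eleventh name is about to be queried.
pub fn evaluate_mx<F>(mx_names: &[&str], client_ip: IpAddr, mut lookup_addrs: F) -> MxOutcome
where
    F: FnMut(&str) -> Vec<IpAddr>,
{
    for (i, &name) in mx_names.iter().enumerate() {
        if i >= MAX_MX_ADDRESS_LOOKUPS {
            // Ten names were already checked without a match.
            return MxOutcome::PermError;
        }
        if lookup_addrs(name).contains(&client_ip) {
            return MxOutcome::Match;
        }
    }
    MxOutcome::NoMatch
}
```

Under these assumptions, a domain with eleven MX names produces a permerror only if none of the first ten names resolves to the client's address. Checking the limit right after the MX lookup would instead fail such a domain unconditionally.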
Paragraph 3 can be interpreted in a manner analogous to paragraph 2, for the ptr mechanism and the p macro. The additional limit (‘In addition ...’) in this case causes truncation rather than a permerror result.
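For contrast with the mx case, truncation can be sketched in a few lines of Rust. The function names_to_validate is a hypothetical helper, not viaspf's API; it only assumes that names beyond the first 10 returned by the PTR lookup are ignored.

```rust
/// Per-mechanism limit on address lookups for `ptr` and the `p` macro.
pub const MAX_PTR_ADDRESS_LOOKUPS: usize = 10;

/// Returns the PTR names that will actually be validated.
/// Names beyond the first 10 are silently dropped; no error is raised.
pub fn names_to_validate<'a, 'b>(ptr_names: &'a [&'b str]) -> &'a [&'b str] {
    let n = ptr_names.len().min(MAX_PTR_ADDRESS_LOOKUPS);
    &ptr_names[..n]
}
```

The return type carries no error case, which is the whole difference from mx: exceeding the limit shortens the work list instead of failing the evaluation.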
Paragraph 4 gives the motivation for the difference in the handling of the additional limit for mx versus ptr.
Paragraph 5 highlights once again that the additional limits described in paragraphs 2 and 3 are per mx mechanism and per ptr mechanism/p macro. Again, each mx mechanism is allowed to query for the addresses of up to 10 MX names.
Previously, given that we interpreted paragraphs 2 and 3 differently, we needed to make sense of the per-mechanism clause in this paragraph differently. So we limited the number of addresses checked, i.e. the number of addresses compared with the client’s IP address. This was mistaken: there is no limit on the number of addresses returned and checked for any mechanism.
All right then. This was a lot of text for a relatively banal (and for some, obvious?) outcome, but I think it’s good to have it all written down for once here. I understand that an RFC should be taken at its word, but in this case reading a bit of background helps choose the right interpretation. The confusion is not just mine; see for example the listing in this message.
The updated interpretation in viaspf is also what is used in established libraries like Perl’s Mail::SPF and Python’s pyspf.
For users, the effect of this change is that they will encounter the lookup limits less frequently. The new interpretation is much more lenient. Given the many complaints about the low limit of 10 in SPF, this goes in the right direction. (Even those involved with authoring the SPF RFC now say ‘if we were designing SPF now we would probably make the limit larger’.)