Improvements to retry policies and their configuration
The current implementation of retry policies and their configuration has the following shortcomings:
- The conditions for performing a retry (i.e. the exception handlers) are hard-coded.
- The `TimeoutRead` exception is currently never being retried, though it is not clear why that is so. Of course, a `TimeoutRead` may arise for any type of query, indicating that `cql-io` did not receive a response within the configured `responseTimeout`, hence retrying on a `TimeoutRead` is not "safe" for non-idempotent queries, but that also holds true for some other exceptions that are currently being retried, like a `WriteTimeout` response or `IOException`s.
- By default, no retry policy is configured at all, despite there being some error cases that can always be safely retried with a high chance of success (see the `DefaultRetryPolicy` in the DataStax Java driver).
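To make the first two points more concrete, here is a minimal, self-contained sketch of the situation as I understand it. The type and constructor names are only illustrative stand-ins and are not meant to match the actual `cql-io` internals:

```haskell
{-# LANGUAGE LambdaCase #-}
module CurrentRetryModel where

import Control.Exception (IOException)

-- Illustrative stand-ins for the kinds of errors the driver can surface;
-- these do not mirror the real cql / cql-io type definitions.
data DriverError
  = WriteTimeoutResponse        -- server replied with a WriteTimeout error
  | TimeoutRead                 -- no response within 'responseTimeout'
  | SocketError IOException     -- an IOException from the network layer
  | OtherError String
  deriving Show

-- The retry condition is effectively baked in: users can tune how often and
-- with what delay a retry happens, but not *which* errors are retried.
hardCodedRetryable :: DriverError -> Bool
hardCodedRetryable = \case
  WriteTimeoutResponse -> True   -- retried, although not safe for non-idempotent queries
  SocketError _        -> True   -- retried, although not safe for non-idempotent queries
  TimeoutRead          -> False  -- never retried, for reasons that are unclear
  OtherError _         -> False
```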
I suggest the following changes to address these points:
- Make the exception handlers that determine whether a retry should be performed configurable as part of the `RetrySettings`. The currently hard-coded list of handlers is then provided as part of e.g. `eagerRetries :: RetrySettings` (see the sketch after this list). It should be made clear in the documentation that these settings may result in possibly unwanted duplicate writes for non-idempotent queries, something that is currently not well documented. These settings (i.e. `eagerRetries`) should not be the default, comparable to the current behaviour of not performing any retries at all by default.
- The `TimeoutRead` exception is added to the list of (eager) retry handlers mentioned in the first point (i.e. as part of `eagerRetries`).
- A new default `RetrySettings` value is provided that is always safe and configured by default. It should retry immediately, once, in the following cases (also sketched below):
(a) A `ReadTimeout` response where apparently enough replicas replied but unfortunately the coordinator requested the data from a replica that failed (as indicated by `rFailureNumAck`, `rFailureNumRequired` and `rFailureDataPresent` in the error response). A retry in that case, ideally to the same node, is not only safe but also has a high probability of success, since the coordinator can request the data from another node. This type of transient error occurs e.g. during normal maintenance of a cluster, such as a (rolling) restart, where a single node is temporarily unavailable.
(b) A `WriteTimeout` from a batch query indicating that writing the batch log failed. A retry in that case is always safe and has a good chance of success.
(c) A `HostError` or any other kind of error that may be thrown by `cql-io` prior to actually performing a request and is hence safe to retry, provided such a retry has any chance of success at all, of course.
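Below is a rough sketch of how the proposed changes could fit together. All names (`retryHandlers`, `retryAttempts`, `eagerRetries`, `defRetries`, the error constructors and their fields) are hypothetical and only meant to illustrate the shape of the API, not to prescribe it:

```haskell
{-# LANGUAGE LambdaCase #-}
module ProposedRetrySettings where

-- Illustrative error type; the ReadTimeout fields mimic the error response
-- fields mentioned above (rFailureNumAck etc.) but are not taken literally
-- from cql / cql-io.
data DriverError
  = ReadTimeout
      { rFailureNumAck      :: Int   -- replicas that acknowledged the read
      , rFailureNumRequired :: Int   -- replicas required by the consistency level
      , rFailureDataPresent :: Bool  -- whether the data replica responded
      }
  | WriteTimeout WriteType
  | TimeoutRead                      -- no response within 'responseTimeout'
  | HostError String                 -- raised before a request is ever sent
  | OtherError String
  deriving Show

data WriteType = BatchLog | SimpleWrite | BatchWrite | CounterWrite
  deriving (Eq, Show)

-- Proposed extension: the conditions for retrying become part of the settings.
data RetrySettings = RetrySettings
  { retryHandlers :: [DriverError -> Bool]  -- retry if any handler matches
  , retryAttempts :: Int                    -- how many immediate retries
  }

shouldRetry :: RetrySettings -> DriverError -> Bool
shouldRetry s e = any ($ e) (retryHandlers s)

-- Roughly the currently hard-coded conditions (abbreviated here) plus
-- 'TimeoutRead'; opt-in only, and documented as possibly causing duplicate
-- writes for non-idempotent queries.
eagerRetries :: RetrySettings
eagerRetries = RetrySettings
  { retryHandlers =
      [ \case
          WriteTimeout _ -> True   -- already retried today
          TimeoutRead    -> True   -- newly added (second point above)
          _              -> False
      ]
  , retryAttempts = 1
  }

-- Always-safe default, enabled out of the box: exactly cases (a), (b) and (c).
defRetries :: RetrySettings
defRetries = RetrySettings
  { retryHandlers =
      [ \case
          -- (a) enough replicas acknowledged, but the data replica failed to reply
          ReadTimeout ack req dataPresent -> ack >= req && not dataPresent
          -- (b) writing the batch log timed out; the batch itself was not applied
          WriteTimeout wt                 -> wt == BatchLog
          -- (c) errors raised before a request was sent are always safe to retry
          HostError _                     -> True
          _                               -> False
      ]
  , retryAttempts = 1
  }

-- Example decisions under the default settings:
--   shouldRetry defRetries (ReadTimeout 2 2 False)  ~> True   (case (a))
--   shouldRetry defRetries (WriteTimeout BatchLog)  ~> True   (case (b))
--   shouldRetry defRetries TimeoutRead              ~> False  (only 'eagerRetries' retries this)
```

Whether the handlers are best modelled as plain predicates, as a dedicated sum type of retry conditions, or folded into the existing `RetrySettings` fields is an open design question.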
Essentially, cases (a) and (b) are precisely those that are retried by the `DefaultRetryPolicy` in the Java driver.
See also the hard-coded rules and the notes on retries and idempotence in the DataStax documentation. I think the reason that the default retry policy does not retry every read timeout, despite that being safe, is that in any case other than (a) it is unclear whether the retry actually has a good chance of success. If it doesn't, immediate retries may just exacerbate the problem(s) and unnecessarily delay further processing on the client side. Personally, I think a single immediate retry on any read timeout could be justified by default, but I would probably stick to the more conservative behaviour of the Java driver.
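For comparison, the less conservative variant (a single immediate retry on any read timeout) would only be a small change on top of the sketch after the proposal list, reusing the same hypothetical types and names:

```haskell
-- Retry once on *any* read timeout, not only when case (a) applies.
anyReadTimeoutOnce :: RetrySettings
anyReadTimeoutOnce = defRetries
  { retryHandlers = isReadTimeout : retryHandlers defRetries }
  where
    isReadTimeout = \case
      ReadTimeout {} -> True
      _              -> False
```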
This is just a high-level overview of some improvements related to retries that I took note of during earlier work on related parts of the code. More details may emerge once the implementation begins.