Backoff & Retry Policies
When jobs fail, the underlying errors are often transient, and the job may succeed when retried. For example, network connectivity with an external service may have been interrupted, or storage on a server may have been at capacity.
Zizq applies exponential backoff to jobs that fail. Errors are captured on the
job’s error list and the job is scheduled to run again at a later time, with
the delay increasing after each successive failure until a configured retry
limit is reached. Once the retry limit is reached, the job is not rescheduled
and is instead marked dead.
The logic applied to jobs when determining how to handle retries is known as the backoff policy. Clients can explicitly specify their own policies on a per-job basis, but the server otherwise applies its default policy, which can be configured when starting the server.
Backoff Policy Structure
There are two logical parts to the backoff policy:
- The retry limit (maximum number of permitted retries).
- The exponential backoff formula itself.
The retry limit is self-explanatory. If the limit is set to 3, for example,
the job may fail a first, second and third time and be retried after each
failure, but on the fourth failure the job will not be retried.
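The counting rule can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the server's implementation; the function name `outcome_after_failure` is hypothetical.

```python
def outcome_after_failure(failure_count: int, retry_limit: int) -> str:
    """Decide what happens to a job that has just failed.

    failure_count includes the failure that just occurred, so the number
    of retries already used is failure_count - 1.
    """
    retries_used = failure_count - 1
    return "retry" if retries_used < retry_limit else "dead"


# With a retry limit of 3, the first three failures are retried and the
# fourth failure marks the job dead.
for failure_count in range(1, 5):
    print(failure_count, outcome_after_failure(failure_count, retry_limit=3))
```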
The formula for exponential backoff requires further explanation. The formula is:
t = B + (a^E) + (a * rand(0 to J))
Where:
- t is the delay to apply before retrying
- B is the base delay applied to all retries
- a is the number of previous attempts
- E is the backoff exponent (optionally fractional)
- J is a random jitter used to spread retries
The variables B, E and J are configurable.
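As a rough illustration, the formula can be written as a small Python function. This sketch assumes all durations are expressed in seconds and that the random term is drawn uniformly between 0 and J; the function names are hypothetical and not part of Zizq.

```python
import random
from typing import Optional, Tuple


def backoff_delay(attempts: int, base: float, exponent: float, jitter: float,
                  rng: Optional[random.Random] = None) -> float:
    """t = B + a^E + a * rand(0 to J), with every duration in seconds (assumed)."""
    rng = rng if rng is not None else random.Random()
    return base + attempts ** exponent + attempts * rng.uniform(0, jitter)


def backoff_bounds(attempts: int, base: float, exponent: float,
                   jitter: float) -> Tuple[float, float]:
    """Smallest and largest delay the formula can produce for a given attempt count."""
    lowest = base + attempts ** exponent
    return lowest, lowest + attempts * jitter


# Illustrative values: B = 15 s, E = 4, J = 30 s, after 3 previous attempts.
print(backoff_bounds(3, base=15.0, exponent=4.0, jitter=30.0))  # (96.0, 186.0)
```

Because the jitter term is multiplied by a, the spread between the earliest and latest possible retry widens as the number of attempts grows.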
Zizq Defaults
The default backoff policy uses a retry limit of 25 and uses the following
parameters for the backoff formula:
B = 15s
E = 4
J = 30s
This gives roughly 3 weeks of total retry time before the job is eventually
moved to the dead list.
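To get a feel for where a figure of this size comes from, the individual delays can be summed across all retries. The sketch below assumes durations in seconds and, by default, that a runs from 1 up to the retry limit. Whether the server starts counting at 0 or 1 is an assumption here, so the hypothetical `first_attempt` parameter lets you try both.

```python
SECONDS_PER_DAY = 86_400


def total_retry_window(retry_limit, base, exponent, jitter, first_attempt=1):
    """Return (shortest, longest) total seconds spent waiting across all retries.

    shortest assumes the random term is always 0; longest assumes it is always J.
    """
    attempts = range(first_attempt, first_attempt + retry_limit)
    shortest = sum(base + a ** exponent for a in attempts)
    longest = shortest + sum(a * jitter for a in attempts)
    return shortest, longest


# Defaults quoted above: retry limit 25, B = 15 s, E = 4, J = 30 s.
low, high = total_retry_window(retry_limit=25, base=15, exponent=4, jitter=30)
print(f"{low / SECONDS_PER_DAY:.1f} to {high / SECONDS_PER_DAY:.1f} days")
```

The a^E term dominates the total, so the exponent and the retry limit matter far more than the base delay or the jitter.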
Adjusting the Backoff Curve
You can adjust the inputs in the chart below to see how changing these parameters affects the backoff curve. The defaults are a reasonable starting point. The chart shows two lines because of the random jitter, which is designed to prevent clusters of failed jobs from all retrying at the same time. The lines mark the earliest and latest possible retry times, and an actual retry could occur anywhere within the band between them.
Configuration Options
The defaults can be configured by using the following command line arguments and environment variables.
- --default-retry-limit, ZIZQ_DEFAULT_RETRY_LIMIT
- --default-backoff-base, ZIZQ_DEFAULT_BACKOFF_BASE
- --default-backoff-exponent, ZIZQ_DEFAULT_BACKOFF_EXPONENT
- --default-backoff-jitter, ZIZQ_DEFAULT_BACKOFF_JITTER
Values for --default-backoff-base and --default-backoff-jitter are provided
either in raw milliseconds or with an explicit unit, such as 12.5s.
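The following sketch shows one way such a value might be interpreted when preparing configuration in your own tooling. It is not the server's parser. It recognises only bare numbers (taken as milliseconds) and the `s` suffix, because those are the only two forms shown here; the function name and the unit table are hypothetical.

```python
import re

# Assumed unit table: only "s" is confirmed by the example value 12.5s.
_UNIT_TO_MS = {"s": 1000.0}

_DURATION = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]*)\s*$")


def parse_duration_ms(value: str) -> float:
    """Convert a duration string to milliseconds."""
    match = _DURATION.match(value)
    if match is None:
        raise ValueError(f"not a duration: {value!r}")
    number, unit = match.groups()
    if unit == "":
        # No unit given: the value is already in raw milliseconds.
        return float(number)
    if unit not in _UNIT_TO_MS:
        raise ValueError(f"unit not handled by this sketch: {unit!r}")
    return float(number) * _UNIT_TO_MS[unit]


print(parse_duration_ms("12.5s"))  # 12500.0
print(parse_duration_ms("15000"))  # 15000.0
```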
Note
When any of --default-backoff-base, --default-backoff-exponent or --default-backoff-jitter is provided, all three must be provided, as together they form a single formula.
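A deployment script could check this rule before starting the server. The sketch below inspects the environment variable names listed above and assumes the all-or-nothing rule applies to them in the same way as to the command line arguments; the helper name `missing_backoff_vars` is hypothetical.

```python
import os

BACKOFF_VARS = (
    "ZIZQ_DEFAULT_BACKOFF_BASE",
    "ZIZQ_DEFAULT_BACKOFF_EXPONENT",
    "ZIZQ_DEFAULT_BACKOFF_JITTER",
)


def missing_backoff_vars(env) -> list:
    """Return the backoff variables that still need to be set.

    The result is empty when none or all three are present, which are
    the two valid configurations.
    """
    provided = [name for name in BACKOFF_VARS if name in env]
    if not provided:
        return []
    return [name for name in BACKOFF_VARS if name not in env]


missing = missing_backoff_vars(os.environ)
if missing:
    print("Incomplete backoff configuration, also set:", ", ".join(missing))
```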