Backoff & Retry Policies

When jobs fail, the underlying errors are often transient, and the job may succeed when retried. For example, network connectivity with an external service may have been interrupted, or storage on a server may have been at capacity.

Zizq applies exponential backoff to jobs that fail. Each error is recorded on the job's error list, and the job is rescheduled to run again later, with the delay growing after each successive failure. Once the configured retry limit is reached, the job is no longer rescheduled and is marked dead instead.
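The failure lifecycle described above can be sketched as a single handler. This is a minimal illustration, not Zizq's implementation: the `Job` record, `handle_failure` and the `delay_for` callable (which stands in for the backoff formula) are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Job:
    # Hypothetical in-memory job record; not Zizq's actual data model.
    id: str
    failures: int = 0
    errors: List[str] = field(default_factory=list)
    run_at: Optional[float] = None
    dead: bool = False


def handle_failure(job: Job, error: str, now: float, retry_limit: int,
                   delay_for: Callable[[int], float]) -> Job:
    """Record a failure, then either reschedule the job or mark it dead."""
    job.errors.append(error)  # every error is kept on the job's error list
    job.failures += 1
    if job.failures <= retry_limit:
        # Retries remain: schedule the next run after a delay that the
        # (hypothetical) delay_for callable derives from the failure count.
        job.run_at = now + delay_for(job.failures)
    else:
        # Retry limit exhausted: do not reschedule, mark the job dead.
        job.run_at = None
        job.dead = True
    return job
```

With `retry_limit=2`, the first two failures reschedule the job and the third marks it dead.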

The rules that determine how a failed job is retried are known as the backoff policy. Clients can specify their own policy on a per-job basis; otherwise the server applies its default policy, which can be configured when the server is started.

Backoff Policy Structure

There are two logical parts to the backoff policy:

  1. The retry limit (maximum number of permitted retries).
  2. The exponential backoff formula itself.

The retry limit is self-explanatory. If the limit is set to 3, for example, the job may fail and be retried three times, but after its fourth failure it is not retried again.
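The retry-limit example can be checked with a one-line rule. This sketch assumes failures are counted from 1; `will_retry` is a hypothetical helper, not a Zizq API.

```python
def will_retry(failure_number: int, retry_limit: int) -> bool:
    """True if the job is rescheduled after its failure_number-th failure
    (counted from 1), False if it is marked dead instead."""
    return failure_number <= retry_limit


# Walk through the example with a retry limit of 3.
for n in range(1, 5):
    outcome = "retry" if will_retry(n, retry_limit=3) else "dead"
    print(f"failure {n}: {outcome}")
```

Failures 1 to 3 print `retry` and failure 4 prints `dead`.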

The exponential backoff formula needs more explanation. It is defined as:

t = B + (a^E) + (a * rand(0 to J))

Where:

  • t is the delay to apply before retrying
  • B is the base delay applied to all retries
  • a is the number of previous attempts
  • E is the backoff exponent (optionally fractional)
  • J is the maximum random jitter (scaled by a), used to spread retries out

The variables B, E and J are configurable.
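Below is a minimal Python sketch of the formula. It assumes B and J are in seconds and that a^E is also read as seconds, since the units of a^E are not stated here. `backoff_delay` and its parameters are hypothetical names and are not part of Zizq.

```python
import random
from typing import Optional


def backoff_delay(a: int, base: float, exponent: float, jitter: float,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before the next retry: t = B + a^E + a * rand(0 to J).

    a        -- number of previous attempts
    base     -- B, base delay in seconds (assumed unit)
    exponent -- E, may be fractional
    jitter   -- J, upper bound of the random jitter in seconds (assumed unit)
    """
    rng = rng or random.Random()
    # uniform(0, J) draws the random jitter; multiplying by a widens the
    # spread as the number of attempts grows.
    return base + a ** exponent + a * rng.uniform(0, jitter)


# With the defaults (B=15, E=4, J=30) and a=3, the delay lies in [96, 186] s.
print(backoff_delay(3, base=15, exponent=4, jitter=30, rng=random.Random(42)))
```

Passing a seeded `random.Random` makes the jitter reproducible, which is useful when testing a policy.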

Zizq Defaults

The default backoff policy uses a retry limit of 25 and the following parameters for the backoff formula:

B = 15s
E = 4
J = 30s

This gives roughly 3 weeks of total retry time before the job is eventually moved to the dead list.
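You can check the "roughly 3 weeks" figure by summing the delays over all 25 retries. This sketch assumes a = 0 for the first retry, so the retries use a = 0 to 24, and that a^E is counted in seconds. If a instead ran from 1 to 25, the minimum would come to about 25 days. `total_retry_window` is a hypothetical helper.

```python
from typing import Tuple


def total_retry_window(retry_limit: int, base: float, exponent: float,
                       jitter: float) -> Tuple[float, float]:
    """(minimum, maximum) total seconds spent waiting across all retries.

    Assumes the retries use a = 0, 1, ..., retry_limit - 1.
    """
    # Minimum: rand(0 to J) returns 0 on every retry.
    low = sum(base + a ** exponent for a in range(retry_limit))
    # Maximum: rand(0 to J) returns J on every retry.
    high = sum(base + a ** exponent + a * jitter for a in range(retry_limit))
    return low, high


low, high = total_retry_window(25, base=15, exponent=4, jitter=30)
print(f"{low / 86400:.1f} to {high / 86400:.1f} days")  # 20.4 to 20.5 days
```

Under this assumption the total is about 2.9 weeks. Nearly all of it comes from the last few a^E terms, so the jitter barely changes the total.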

Adjusting the Backoff Curve

Changing B, E and J changes the shape of the backoff curve, and the defaults are a reasonable starting point. Because of the random jitter, the delay for a given attempt is a band rather than a single value. Its lower edge is the delay with no jitter and its upper edge is the delay with the maximum jitter, and an actual retry can occur anywhere within that band. The jitter exists to stop a cluster of failed jobs from all retrying at the same moment.
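Both edges of the band follow directly from the formula, with rand(0 to J) at 0 and at J. This minimal sketch again assumes all values are in seconds; `delay_band` is a hypothetical helper.

```python
from typing import Tuple


def delay_band(a: int, base: float, exponent: float,
               jitter: float) -> Tuple[float, float]:
    """Lower and upper edge of the retry delay for attempt count a."""
    lower = base + a ** exponent  # rand(0 to J) returned 0
    upper = lower + a * jitter    # rand(0 to J) returned J
    return lower, upper


# The first five bands under the defaults B=15s, E=4, J=30s.
for a in range(1, 6):
    lo, hi = delay_band(a, base=15, exponent=4, jitter=30)
    print(f"a={a}: {lo:.0f}s to {hi:.0f}s")
```

The width of the band grows linearly with a while a^E grows much faster, so the jitter's share of the delay shrinks as attempts accumulate. For example, a=3 gives 96s to 186s and a=5 gives 640s to 790s.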

Configuration Options

The defaults can be configured with the following command-line arguments or their equivalent environment variables:

  • --default-retry-limit, ZIZQ_DEFAULT_RETRY_LIMIT
  • --default-backoff-base, ZIZQ_DEFAULT_BACKOFF_BASE
  • --default-backoff-exponent, ZIZQ_DEFAULT_BACKOFF_EXPONENT
  • --default-backoff-jitter, ZIZQ_DEFAULT_BACKOFF_JITTER

Values for --default-backoff-base and --default-backoff-jitter are given either in raw milliseconds or with an explicit unit, such as 12.5s.
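A parser that accepts both forms might look like the sketch below. Only the `s` unit appears in this documentation. The `ms`, `m` and `h` units are assumptions added for illustration, and `parse_duration_ms` is a hypothetical name, not Zizq's parser.

```python
import re

# Hypothetical unit table: "s" is documented; "ms", "m" and "h" are assumed.
_UNIT_MS = {"ms": 1.0, "s": 1000.0, "m": 60_000.0, "h": 3_600_000.0}
# A number, optionally fractional, followed by an optional unit suffix.
_DURATION = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(ms|s|m|h)?\s*$")


def parse_duration_ms(value: str) -> float:
    """Parse '15000' (raw milliseconds) or '12.5s' (explicit unit) into ms."""
    match = _DURATION.match(value)
    if match is None:
        raise ValueError(f"invalid duration: {value!r}")
    number, unit = match.groups()
    # A missing unit means the value is already in milliseconds.
    return float(number) * _UNIT_MS[unit or "ms"]


print(parse_duration_ms("12.5s"))  # 12500.0
print(parse_duration_ms("15000"))  # 15000.0
```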

Note

If any of --default-backoff-base, --default-backoff-exponent or --default-backoff-jitter is provided, all three must be provided, because together they define a single formula.
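The all-or-nothing rule can be expressed as a small validation step. The note above is stated for the command-line flags. Applying the same rule to the environment variables is an assumption here, and `check_backoff_settings` is a hypothetical helper. The retry limit is independent of the formula, so it is left out.

```python
from typing import Mapping, Optional

# The three settings that together define the backoff formula.
FORMULA_KEYS = (
    "ZIZQ_DEFAULT_BACKOFF_BASE",
    "ZIZQ_DEFAULT_BACKOFF_EXPONENT",
    "ZIZQ_DEFAULT_BACKOFF_JITTER",
)


def check_backoff_settings(settings: Mapping[str, Optional[str]]) -> None:
    """Raise ValueError if only some of the three formula settings are set."""
    provided = [k for k in FORMULA_KEYS if settings.get(k) is not None]
    # Either none (use the built-in defaults) or all three are acceptable.
    if provided and len(provided) != len(FORMULA_KEYS):
        missing = [k for k in FORMULA_KEYS if k not in provided]
        raise ValueError("backoff formula is incomplete; missing: "
                         + ", ".join(missing))
```

Calling `check_backoff_settings(os.environ)` at startup would reject a configuration that sets, for example, only ZIZQ_DEFAULT_BACKOFF_BASE.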