Description
I believe there is an interrupt-ordering issue in DirectRetryingExecutor that can cause interrupt behaviour
to surface before the original interrupt-related attempt failure is recorded on the RetryingFuture.
In gax-java/gax/src/main/java/com/google/api/gax/retrying/DirectRetryingExecutor.java, the catch block for:
InterruptedException
InterruptedIOException
ClosedByInterruptException
currently restores the thread interrupt status before calling retryingFuture.setAttemptFuture(...).
Current code:
} catch (InterruptedException | InterruptedIOException | ClosedByInterruptException e) {
Thread.currentThread().interrupt();
retryingFuture.setAttemptFuture(ApiFutures.immediateFailedFuture(e));
}
That ordering appears to allow downstream interrupt handling to observe the thread's interrupt flag too early.
In a local repro against com.google.api:gax:2.71.0, the original interrupt-related failure is not preserved
cleanly enough for retry evaluation to proceed as expected.
Environment details
- Artifact:
com.google.api:gax
- Affected version:
2.71.0
- Consuming library where we observed this in production:
com.google.cloud:google-cloud-bigquery:2.55.3
- OS: Darwin 25.4.0 arm64
- Java: Temurin / OpenJDK 21.0.4
Steps to reproduce
- Configure a
RetryAlgorithm that retries when prevThrowable instanceof InterruptedIOException.
- Create a
DirectRetryingExecutor.
- Use a callable that:
- throws
InterruptedIOException("transient") on the first call
- returns
"SUCCESS" on the second call
- Submit the callable through the
RetryingFuture and wait for completion.
Code example
A minimal regression test can be expressed with this shape:
- retry logic is configured to retry ONLY
InterruptedIOException
- first attempt throws
InterruptedIOException("transient")
- second attempt succeeds
For example:
AtomicInteger callCount = new AtomicInteger();
AtomicReference<String> result = new AtomicReference<>();
AtomicReference<Throwable> failure = new AtomicReference<>();
RetryAlgorithm<String> retryAlgorithm =
new RetryAlgorithm<>(
new BasicResultRetryAlgorithm<String>() {
@Override
public boolean shouldRetry(Throwable prevThrowable, String prevResponse) {
return prevThrowable instanceof InterruptedIOException;
}
},
new ExponentialRetryAlgorithm(
FailingCallable.FAST_RETRY_SETTINGS,
CurrentMillisClock.getDefaultClock()));
DirectRetryingExecutor<String> executor =
new DirectRetryingExecutor<String>(retryAlgorithm) {
@Override
protected void sleep(java.time.Duration delay) {
// no-op to keep the repro focused on submit() ordering
}
};
Thread worker =
new Thread(
() -> {
RetryingFuture<String> future =
executor.createFuture(
() -> {
if (callCount.getAndIncrement() == 0) {
throw new InterruptedIOException("transient");
}
return "SUCCESS";
});
future.setAttemptFuture(executor.submit(future));
// Clear the caller-thread interrupt flag before reading the final
// future outcome. The fix still restores the interrupt flag, just later.
Thread.interrupted();
try {
result.set(future.get());
} catch (ExecutionException e) {
failure.set(e.getCause());
} catch (Throwable t) {
failure.set(t);
}
});
worker.start();
worker.join();
assertNull(failure.get());
assertEquals("SUCCESS", result.get());
assertEquals(2, callCount.get());
Expected behaviour
- the original interrupt-related failure is recorded first
- retry evaluation sees the original
InterruptedIOException
- the retry executor performs the second attempt
- the future completes successfully with
"SUCCESS"
Actual behaviour
Before reordering the catch block:
submit() catches InterruptedIOException
- it restores the interrupt flag before
setAttemptFuture(...)
- downstream future code can observe the interrupt flag too early
- the future terminates with
InterruptedException instead of cleanly preserving the original
InterruptedIOException for retry evaluation
- the regression test fails with:
expected: <null> but was: <java.lang.InterruptedException>
Additional context
In a consuming application using google-cloud-bigquery:2.55.3, we observed a production stack trace with:
java.lang.InterruptedException: null
at com.google.common.util.concurrent.AbstractFutureState.blockingGet(AbstractFutureState.java:231)
at com.google.common.util.concurrent.Platform.get(Platform.java:54)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:253)
at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:66)
at com.google.api.gax.retrying.BasicRetryingFuture.setAttemptFuture(BasicRetryingFuture.java:98)
at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:106)
at com.google.cloud.bigquery.BigQueryRetryHelper.run(BigQueryRetryHelper.java:108)
...
From the library source, this appears consistent with the following sequence:
DirectRetryingExecutor catches an interrupt-related exception
- it calls
Thread.currentThread().interrupt()
- it then calls
setAttemptFuture(...)
- downstream future code observes the interrupt flag and throws
InterruptedException
Proposed fix
Reorder the catch block so the failed attempt is recorded before the interrupt flag is restored:
} catch (InterruptedException | InterruptedIOException | ClosedByInterruptException e) {
retryingFuture.setAttemptFuture(ApiFutures.immediateFailedFuture(e));
Thread.currentThread().interrupt();
}
Using the test case above:
- before the reorder: the regression test fails with
InterruptedException
- after the reorder: the test passes
- a full local
gax module test run also passes
Description
I believe there is an interrupt-ordering issue in
DirectRetryingExecutorthat can cause interrupt behaviourto surface before the original interrupt-related attempt failure is recorded on the
RetryingFuture.In
gax-java/gax/src/main/java/com/google/api/gax/retrying/DirectRetryingExecutor.java, the catch block for:InterruptedExceptionInterruptedIOExceptionClosedByInterruptExceptioncurrently restores the thread interrupt status before calling
retryingFuture.setAttemptFuture(...).Current code:
That ordering appears to allow downstream interrupt handling to observe the thread's interrupt flag too early.
In a local repro against
com.google.api:gax:2.71.0, the original interrupt-related failure is not preservedcleanly enough for retry evaluation to proceed as expected.
Environment details
com.google.api:gax2.71.0com.google.cloud:google-cloud-bigquery:2.55.3Steps to reproduce
RetryAlgorithmthat retries whenprevThrowable instanceof InterruptedIOException.DirectRetryingExecutor.InterruptedIOException("transient")on the first call"SUCCESS"on the second callRetryingFutureand wait for completion.Code example
A minimal regression test can be expressed with this shape:
InterruptedIOExceptionInterruptedIOException("transient")For example:
Expected behaviour
InterruptedIOException"SUCCESS"Actual behaviour
Before reordering the catch block:
submit()catchesInterruptedIOExceptionsetAttemptFuture(...)InterruptedExceptioninstead of cleanly preserving the originalInterruptedIOExceptionfor retry evaluationAdditional context
In a consuming application using
google-cloud-bigquery:2.55.3, we observed a production stack trace with:From the library source, this appears consistent with the following sequence:
DirectRetryingExecutorcatches an interrupt-related exceptionThread.currentThread().interrupt()setAttemptFuture(...)InterruptedExceptionProposed fix
Reorder the catch block so the failed attempt is recorded before the interrupt flag is restored:
Using the test case above:
InterruptedExceptiongaxmodule test run also passes