Skip to content

[sdk-platform-java] DirectRetryingExecutor re-interrupts the thread before recording interrupt-related attempt failure #12860

@walter-chung

Description

@walter-chung

Description

I believe there is an interrupt-ordering issue in DirectRetryingExecutor that can cause interrupt behaviour
to surface before the original interrupt-related attempt failure is recorded on the RetryingFuture.

In gax-java/gax/src/main/java/com/google/api/gax/retrying/DirectRetryingExecutor.java, the catch block for:

  • InterruptedException
  • InterruptedIOException
  • ClosedByInterruptException

currently restores the thread interrupt status before calling retryingFuture.setAttemptFuture(...).

Current code:

} catch (InterruptedException | InterruptedIOException | ClosedByInterruptException e) {
  Thread.currentThread().interrupt();
  retryingFuture.setAttemptFuture(ApiFutures.immediateFailedFuture(e));
}

That ordering appears to allow downstream interrupt handling to observe the thread's interrupt flag too early.
In a local repro against com.google.api:gax:2.71.0, the original interrupt-related failure is not preserved
cleanly enough for retry evaluation to proceed as expected.

Environment details

  • Artifact: com.google.api:gax
  • Affected version: 2.71.0
  • Consuming library where we observed this in production: com.google.cloud:google-cloud-bigquery:2.55.3
  • OS: Darwin 25.4.0 arm64
  • Java: Temurin / OpenJDK 21.0.4

Steps to reproduce

  1. Configure a RetryAlgorithm that retries when prevThrowable instanceof InterruptedIOException.
  2. Create a DirectRetryingExecutor.
  3. Use a callable that:
    • throws InterruptedIOException("transient") on the first call
    • returns "SUCCESS" on the second call
  4. Submit the callable through the RetryingFuture and wait for completion.

Code example

A minimal regression test can be expressed with this shape:

  • retry logic is configured to retry ONLY InterruptedIOException
  • first attempt throws InterruptedIOException("transient")
  • second attempt succeeds

For example:

AtomicInteger callCount = new AtomicInteger();
AtomicReference<String> result = new AtomicReference<>();
AtomicReference<Throwable> failure = new AtomicReference<>();

RetryAlgorithm<String> retryAlgorithm =
    new RetryAlgorithm<>(
        new BasicResultRetryAlgorithm<String>() {
          @Override
          public boolean shouldRetry(Throwable prevThrowable, String prevResponse) {
            return prevThrowable instanceof InterruptedIOException;
          }
        },
        new ExponentialRetryAlgorithm(
            FailingCallable.FAST_RETRY_SETTINGS,
            CurrentMillisClock.getDefaultClock()));

DirectRetryingExecutor<String> executor =
    new DirectRetryingExecutor<String>(retryAlgorithm) {
      @Override
      protected void sleep(java.time.Duration delay) {
        // no-op to keep the repro focused on submit() ordering
      }
    };

Thread worker =
    new Thread(
        () -> {
          RetryingFuture<String> future =
              executor.createFuture(
                  () -> {
                    if (callCount.getAndIncrement() == 0) {
                      throw new InterruptedIOException("transient");
                    }
                    return "SUCCESS";
                  });

          future.setAttemptFuture(executor.submit(future));

          // Clear the caller-thread interrupt flag before reading the final
          // future outcome. The fix still restores the interrupt flag, just later.
          Thread.interrupted();

          try {
            result.set(future.get());
          } catch (ExecutionException e) {
            failure.set(e.getCause());
          } catch (Throwable t) {
            failure.set(t);
          }
        });

worker.start();
worker.join();

assertNull(failure.get());
assertEquals("SUCCESS", result.get());
assertEquals(2, callCount.get());

Expected behaviour

  • the original interrupt-related failure is recorded first
  • retry evaluation sees the original InterruptedIOException
  • the retry executor performs the second attempt
  • the future completes successfully with "SUCCESS"

Actual behaviour

Before reordering the catch block:

  • submit() catches InterruptedIOException
  • it restores the interrupt flag before setAttemptFuture(...)
  • downstream future code can observe the interrupt flag too early
  • the future terminates with InterruptedException instead of cleanly preserving the original
    InterruptedIOException for retry evaluation
  • the regression test fails with:
expected: <null> but was: <java.lang.InterruptedException>

Additional context

In a consuming application using google-cloud-bigquery:2.55.3, we observed a production stack trace with:

java.lang.InterruptedException: null
  at com.google.common.util.concurrent.AbstractFutureState.blockingGet(AbstractFutureState.java:231)
  at com.google.common.util.concurrent.Platform.get(Platform.java:54)
  at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:253)
  at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:66)
  at com.google.api.gax.retrying.BasicRetryingFuture.setAttemptFuture(BasicRetryingFuture.java:98)
  at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:106)
  at com.google.cloud.bigquery.BigQueryRetryHelper.run(BigQueryRetryHelper.java:108)
  ...

From the library source, this appears consistent with the following sequence:

  1. DirectRetryingExecutor catches an interrupt-related exception
  2. it calls Thread.currentThread().interrupt()
  3. it then calls setAttemptFuture(...)
  4. downstream future code observes the interrupt flag and throws InterruptedException

Proposed fix

Reorder the catch block so the failed attempt is recorded before the interrupt flag is restored:

} catch (InterruptedException | InterruptedIOException | ClosedByInterruptException e) {
  retryingFuture.setAttemptFuture(ApiFutures.immediateFailedFuture(e));
  Thread.currentThread().interrupt();
}

Using the test case above:

  • before the reorder: the regression test fails with InterruptedException
  • after the reorder: the test passes
  • a full local gax module test run also passes

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions