Skip to content

Fix TestIntegrations/ClientIdleConnection flakiness#65659

Open
Joerger wants to merge 2 commits intomasterfrom
joerger/fix-flaky-test-TestIntegrations_ClientIdleConnection
Open

Fix TestIntegrations/ClientIdleConnection flakiness#65659
Joerger wants to merge 2 commits intomasterfrom
joerger/fix-flaky-test-TestIntegrations_ClientIdleConnection

Conversation

@Joerger
Copy link
Copy Markdown
Contributor

@Joerger Joerger commented Apr 10, 2026

The issue with this test was that it was actually measuring throughput rather than idleness. If the session could not handle 30 echo commands in the span of 6 seconds, you'd get an error like "9" is not greater than "30" signaling that only 9 of the 30 echo commands were processed in time.

Since client idleness is measured by both client writes and reads, we don't need to write again until the current read is complete, so we can just send/receive keepalives instead. We also use sh -c 'echo __READY__; exec cat' instead of echo txlxport | sed 's/x/e/g' so that we can track readiness and simplify the command. exec cat just drops into an echo session where stdin echo is suppressed so we naturally only see stdout without the need for a sed trick.

Updates #33252

@Joerger Joerger added no-changelog Indicates that a PR does not require a changelog entry no-test-plan Bypasses the test plan validation bot labels Apr 10, 2026
@Joerger Joerger force-pushed the joerger/fix-flaky-test-TestIntegrations_ClientIdleConnection branch from 9d5d546 to 5afca7b Compare April 10, 2026 16:11
Copy link
Copy Markdown
Contributor

@rosstimothy rosstimothy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the flakiness of this test is 100% a result of the test. There is an underlying race between the node and the proxy client idle timeout machinery that can kill the session before it even starts.

@Joerger
Copy link
Copy Markdown
Contributor Author

Joerger commented Apr 10, 2026

I don't think the flakiness of this test is 100% a result of the test. There is an underlying race between the node and the proxy client idle timeout machinery that can kill the session before it even starts.

I agree there is an idle timeout race. Both the proxy and the SSH agent can start enforcing idle timeout before the user has a usable session. I'm not sure whether this is really a bug or just a strict enforcement of idle timeout based on the user's connection which can perform poorly with very small timeout values, which we should only see in tests, not production. Still, it's worth tracking and seeing if we can improve it safely.

That said, that race isn not the root cause of the flakes in #33252. The failures there are consistently test level assertion failures like:

  • "9" is not greater than "30" - caused by session ending before it could handle 30 echo commands from the client
  • An error is expected but got nil. - caused by err := waitForError(sessionErr, time.Second*15) timeout

Those point to the brittle throughput reliance of the original test, which my PR fixes directly by replacing the output counting logic with a deterministic readiness/activity check. I'm happy to investigate the idle timeout race during session establishment and open a separate issue, but I don't think it's related to the test flakes we're seeing.

FWIW I was able to reproduce a flake due to the session idle timeout race by using a very small idle timeout, like 1s with stress -p 10, or by using stress -p 20 or higher with the 4s idle timeout. However, you'll see the integration test itself flake often at just stress -p 15 due to the teleport instance creation timing out:

 fixture.go:93: Unexpected response from CreateEx: 
        ERROR REPORT:
        Original Error: trace.aggregate context deadline exceeded
        Stack Trace:
        	/Users/bjoerger/src/gravitational/teleport/lib/backend/lock.go:240 github.com/gravitational/teleport/lib/backend.RunWhileLocked
        	/Users/bjoerger/src/gravitational/teleport/lib/auth/recordingencryption/manager.go:680 github.com/gravitational/teleport/lib/auth/recordingencryption.(*Manager).modifySessionRecordingConfig
        	/Users/bjoerger/src/gravitational/teleport/lib/auth/recordingencryption/manager.go:127 github.com/gravitational/teleport/lib/auth/recordingencryption.(*Manager).CreateSessionRecordingConfig
        	/Users/bjoerger/src/gravitational/teleport/lib/auth/init.go:1253 github.com/gravitational/teleport/lib/auth.initializeSessionRecordingConfig
        	/Users/bjoerger/src/gravitational/teleport/lib/auth/init.go:617 github.com/gravitational/teleport/lib/auth.initCluster.func3
        	/Users/bjoerger/code/go/pkg/mod/golang.org/x/sync@v0.20.0/errgroup/errgroup.go:93 golang.org/x/sync/errgroup.(*Group).Go.func1
        	/Users/bjoerger/code/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.9.darwin-arm64/src/runtime/asm_arm64.s:1268 runtime.goexit
        User Message: context deadline exceeded
--- FAIL: TestIntegrations (2.97s)
    --- FAIL: TestIntegrations/ClientIdleConnection (2.97s)

At stress -p 20 I'm seeing this teleport instance failure 5 times for every idle timeout race. I don't think our integration tests are built to handle that type of stress, and since we aren't seeing this class of failures in CI I think it's likely not a problem for now. Using stress -p 10 -count 1000 for integration tests seems to be a much more accurate benchmark to use.

@Joerger Joerger requested a review from rosstimothy April 10, 2026 20:27
@rosstimothy
Copy link
Copy Markdown
Contributor

rosstimothy commented Apr 10, 2026

@Joerger I agree the test could be better. But we shouldn't close the issue and stop there. The goal of addressing flaky tests and pressure washing in general is to address the behavioral problems in Teleport that manifest as flaky tests. This PR may reduce flakes but I'm not confident they will prevent future flakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-changelog Indicates that a PR does not require a changelog entry no-test-plan Bypasses the test plan validation bot size/sm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants