Fix `TestIntegrations/ClientIdleConnection` flakiness by Joerger · Pull Request #65659 · gravitational/teleport

Joerger · 2026-04-10T16:09:19Z

The issue with this test was that it was actually measuring throughput rather than idleness. If the session could not handle 30 echo commands in the span of 6 seconds, you'd get an error like "9" is not greater than "30" signaling that only 9 of the 30 echo commands were processed in time.

Since client idleness is measured by both client writes and reads, we don't need to write again until the current read is complete, so we can just send/receive keepalives instead. We also use sh -c 'echo __READY__; exec cat' instead of echo txlxport | sed 's/x/e/g' so that we can track readiness and simplify the command. exec cat just drops into an echo session where stdin echo is suppressed so we naturally only see stdout without the need for a sed trick.

Updates #33252

…ly to remove reliance on high throughput.

rosstimothy

I don't think the flakiness of this test is 100% a result of the test. There is an underlying race between the node and the proxy client idle timeout machinery that can kill the session before it even starts.

Joerger · 2026-04-10T20:26:26Z

I don't think the flakiness of this test is 100% a result of the test. There is an underlying race between the node and the proxy client idle timeout machinery that can kill the session before it even starts.

I agree there is an idle timeout race. Both the proxy and the SSH agent can start enforcing idle timeout before the user has a usable session. I'm not sure whether this is really a bug or just a strict enforcement of idle timeout based on the user's connection which can perform poorly with very small timeout values, which we should only see in tests, not production. Still, it's worth tracking and seeing if we can improve it safely.

That said, that race isn not the root cause of the flakes in #33252. The failures there are consistently test level assertion failures like:

"9" is not greater than "30" - caused by session ending before it could handle 30 echo commands from the client
An error is expected but got nil. - caused by err := waitForError(sessionErr, time.Second*15) timeout

Those point to the brittle throughput reliance of the original test, which my PR fixes directly by replacing the output counting logic with a deterministic readiness/activity check. I'm happy to investigate the idle timeout race during session establishment and open a separate issue, but I don't think it's related to the test flakes we're seeing.

FWIW I was able to reproduce a flake due to the session idle timeout race by using a very small idle timeout, like 1s with stress -p 10, or by using stress -p 20 or higher with the 4s idle timeout. However, you'll see the integration test itself flake often at just stress -p 15 due to the teleport instance creation timing out:

 fixture.go:93: Unexpected response from CreateEx: 
        ERROR REPORT:
        Original Error: trace.aggregate context deadline exceeded
        Stack Trace:
        	/Users/bjoerger/src/gravitational/teleport/lib/backend/lock.go:240 github.com/gravitational/teleport/lib/backend.RunWhileLocked
        	/Users/bjoerger/src/gravitational/teleport/lib/auth/recordingencryption/manager.go:680 github.com/gravitational/teleport/lib/auth/recordingencryption.(*Manager).modifySessionRecordingConfig
        	/Users/bjoerger/src/gravitational/teleport/lib/auth/recordingencryption/manager.go:127 github.com/gravitational/teleport/lib/auth/recordingencryption.(*Manager).CreateSessionRecordingConfig
        	/Users/bjoerger/src/gravitational/teleport/lib/auth/init.go:1253 github.com/gravitational/teleport/lib/auth.initializeSessionRecordingConfig
        	/Users/bjoerger/src/gravitational/teleport/lib/auth/init.go:617 github.com/gravitational/teleport/lib/auth.initCluster.func3
        	/Users/bjoerger/code/go/pkg/mod/golang.org/x/sync@v0.20.0/errgroup/errgroup.go:93 golang.org/x/sync/errgroup.(*Group).Go.func1
        	/Users/bjoerger/code/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.25.9.darwin-arm64/src/runtime/asm_arm64.s:1268 runtime.goexit
        User Message: context deadline exceeded
--- FAIL: TestIntegrations (2.97s)
    --- FAIL: TestIntegrations/ClientIdleConnection (2.97s)

At stress -p 20 I'm seeing this teleport instance failure 5 times for every idle timeout race. I don't think our integration tests are built to handle that type of stress, and since we aren't seeing this class of failures in CI I think it's likely not a problem for now. Using stress -p 10 -count 1000 for integration tests seems to be a much more accurate benchmark to use.

rosstimothy · 2026-04-10T20:30:27Z

@Joerger I agree the test could be better. But we shouldn't close the issue and stop there. The goal of addressing flaky tests and pressure washing in general is to address the behavioral problems in Teleport that manifest as flaky tests. This PR may reduce flakes but I'm not confident they will prevent future flakes.

Refactor client idle timeout test to send/received input asynchronous…

b689732

…ly to remove reliance on high throughput.

Joerger added no-changelog Indicates that a PR does not require a changelog entry no-test-plan Bypasses the test plan validation bot labels Apr 10, 2026

github-actions bot requested review from juliaogris and rosstimothy April 10, 2026 16:09

github-actions bot added the size/sm label Apr 10, 2026

Use exec cat to fix terminal issues in the test.

5afca7b

Joerger force-pushed the joerger/fix-flaky-test-TestIntegrations_ClientIdleConnection branch from 9d5d546 to 5afca7b Compare April 10, 2026 16:11

rosstimothy requested changes Apr 10, 2026

View reviewed changes

Joerger requested a review from rosstimothy April 10, 2026 20:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix `TestIntegrations/ClientIdleConnection` flakiness#65659

Fix `TestIntegrations/ClientIdleConnection` flakiness#65659
Joerger wants to merge 2 commits intomasterfrom
joerger/fix-flaky-test-TestIntegrations_ClientIdleConnection

Joerger commented Apr 10, 2026 •

edited

Loading

Uh oh!

rosstimothy left a comment

Uh oh!

Joerger commented Apr 10, 2026 •

edited

Loading

Uh oh!

rosstimothy commented Apr 10, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Joerger commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rosstimothy left a comment

Choose a reason for hiding this comment

Uh oh!

Joerger commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rosstimothy commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Joerger commented Apr 10, 2026 •

edited

Loading

Joerger commented Apr 10, 2026 •

edited

Loading

rosstimothy commented Apr 10, 2026 •

edited

Loading