Fix TestIntegrations/ClientIdleConnection flakiness#65659
Fix TestIntegrations/ClientIdleConnection flakiness#65659
TestIntegrations/ClientIdleConnection flakiness#65659Conversation
…ly to remove reliance on high throughput.
9d5d546 to
5afca7b
Compare
rosstimothy
left a comment
There was a problem hiding this comment.
I don't think the flakiness of this test is 100% a result of the test. There is an underlying race between the node and the proxy client idle timeout machinery that can kill the session before it even starts.
I agree there is an idle timeout race. Both the proxy and the SSH agent can start enforcing idle timeout before the user has a usable session. I'm not sure whether this is really a bug or just a strict enforcement of idle timeout based on the user's connection which can perform poorly with very small timeout values, which we should only see in tests, not production. Still, it's worth tracking and seeing if we can improve it safely. That said, that race isn not the root cause of the flakes in #33252. The failures there are consistently test level assertion failures like:
Those point to the brittle throughput reliance of the original test, which my PR fixes directly by replacing the output counting logic with a deterministic readiness/activity check. I'm happy to investigate the idle timeout race during session establishment and open a separate issue, but I don't think it's related to the test flakes we're seeing. FWIW I was able to reproduce a flake due to the session idle timeout race by using a very small idle timeout, like 1s with At |
|
@Joerger I agree the test could be better. But we shouldn't close the issue and stop there. The goal of addressing flaky tests and pressure washing in general is to address the behavioral problems in Teleport that manifest as flaky tests. This PR may reduce flakes but I'm not confident they will prevent future flakes. |
The issue with this test was that it was actually measuring throughput rather than idleness. If the session could not handle 30 echo commands in the span of 6 seconds, you'd get an error like
"9" is not greater than "30"signaling that only 9 of the 30 echo commands were processed in time.Since client idleness is measured by both client writes and reads, we don't need to write again until the current read is complete, so we can just send/receive keepalives instead. We also use
sh -c 'echo __READY__; exec cat'instead ofecho txlxport | sed 's/x/e/g'so that we can track readiness and simplify the command.exec catjust drops into an echo session where stdin echo is suppressed so we naturally only see stdout without the need for asedtrick.Updates #33252