fix(cassandra): auto-recover session after Cassandra restart #2997

Open
dpol1 wants to merge 2 commits into apache:master from dpol1:fix/2740-cassandra-reconnect

@dpol1 dpol1 commented Apr 18, 2026

Purpose of the PR

closes #2740

HugeGraphServer stops responding after Cassandra is restarted and never
recovers without a full server restart.

Root cause: CassandraSessionPool builds the Datastax Cluster without a
ReconnectionPolicy, CassandraSession.execute(...) calls the driver once
with no retry, and thread-local sessions are never probed for liveness.
Once Cassandra goes down, transient NoHostAvailableException /
OperationTimedOutException errors surface to the user and the pool stays
dead even after Cassandra comes back online.
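
The missing liveness probe described above could be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ProbeableSession` and `SessionHolder` are hypothetical names, and `ping()` stands in for whatever cheap query the real session would run (the PR mentions `SELECT now() FROM system.local`).

```java
import java.util.function.Supplier;

// Hypothetical minimal session abstraction; the real code wraps the Datastax Session.
interface ProbeableSession {
    boolean isClosed();
    void ping();   // cheap health-check query; throws RuntimeException if unreachable
    void close();
}

// Hypothetical holder for thread-local sessions that probes before reuse.
class SessionHolder {
    private final Supplier<ProbeableSession> opener;
    private final ThreadLocal<ProbeableSession> current = new ThreadLocal<>();

    SessionHolder(Supplier<ProbeableSession> opener) {
        this.opener = opener;
    }

    // Returns the cached session if it is open and answers the probe,
    // otherwise closes it and opens a fresh one.
    ProbeableSession get() {
        ProbeableSession session = this.current.get();
        if (session != null && !session.isClosed()) {
            try {
                session.ping();
                return session;
            } catch (RuntimeException e) {
                session.close(); // stale session: drop it and reopen below
            }
        }
        session = this.opener.get();
        this.current.set(session);
        return session;
    }

    public static void main(String[] args) {
        boolean[] healthy = {true};
        SessionHolder holder = new SessionHolder(() -> new ProbeableSession() {
            private boolean closed = false;
            public boolean isClosed() { return this.closed; }
            public void ping() {
                if (!healthy[0]) throw new IllegalStateException("node down");
            }
            public void close() { this.closed = true; }
        });
        ProbeableSession first = holder.get();
        healthy[0] = false;                    // simulate a Cassandra restart
        ProbeableSession second = holder.get();
        System.out.println("reopened: " + (first != second));
    }
}
```

Probing on every access adds a round trip per query, so a real implementation would probably probe only after a failure or after an idle period.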

Main Changes

  • Register ExponentialReconnectionPolicy(baseDelay, maxDelay) on the
    Cluster builder so the Datastax driver keeps retrying downed nodes in
    the background.

  • Wrap every Session.execute(...) in executeWithRetry(Statement) with
    exponential backoff on transient connectivity failures.

  • Implement reconnectIfNeeded() / reset() on CassandraSession so the
    pool reopens closed sessions and issues a lightweight health-check
    (SELECT now() FROM system.local) before subsequent queries.

  • Add four tunables in CassandraOptions (defaults preserve previous
    behavior for healthy clusters):

    Option                            Default    Meaning
    cassandra.reconnect_base_delay    1000 ms    Initial backoff for driver reconnection policy
    cassandra.reconnect_max_delay     60000 ms   Cap for reconnection backoff
    cassandra.reconnect_max_retries   10         Per-query retries on transient errors (0 disables)
    cassandra.reconnect_interval      5000 ms    Base interval for per-query exponential backoff

  • Add unit tests covering defaults, overrides, disabling retries and option keys.
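
To make the per-query retry behavior concrete, here is a minimal, self-contained sketch of retry with exponential backoff. It is not the PR's `executeWithRetry(Statement)`: `TransientFailure` is a hypothetical stand-in for driver errors such as `NoHostAvailableException`, the query is modeled as a plain `Supplier`, and capping the per-query delay at a maximum is an assumption of this sketch. The `maxRetries` and `intervalMs` parameters play the roles of `cassandra.reconnect_max_retries` and `cassandra.reconnect_interval`.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for transient driver errors such as
// NoHostAvailableException or OperationTimedOutException.
class TransientFailure extends RuntimeException {
    TransientFailure(String message) {
        super(message);
    }
}

class RetrySketch {

    // Delay before retry number `attempt` (0-based): intervalMs * 2^attempt,
    // capped at maxDelayMs. The cap is an assumption of this sketch.
    static long backoffDelay(long intervalMs, long maxDelayMs, int attempt) {
        int shift = Math.min(attempt, 30); // bound the shift for typical intervals
        return Math.min(intervalMs * (1L << shift), maxDelayMs);
    }

    // Runs the query, retrying up to maxRetries times on TransientFailure.
    // With maxRetries = 0 the query is attempted exactly once.
    static <T> T executeWithRetry(Supplier<T> query, int maxRetries,
                                  long intervalMs, long maxDelayMs) {
        for (int attempt = 0; ; attempt++) {
            try {
                return query.get();
            } catch (TransientFailure e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted: surface the last error
                }
                try {
                    Thread.sleep(backoffDelay(intervalMs, maxDelayMs, attempt));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = executeWithRetry(() -> {
            if (calls[0]++ < 2) {
                throw new TransientFailure("no host available");
            }
            return "ok";
        }, 10, 10L, 100L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

With the defaults from the table (5000 ms interval, 10 retries), an uncapped doubling schedule would wait a very long time in total, which is why this sketch applies a cap; whether the PR does the same is not stated in the description.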

Verifying these changes

  • Need tests and can be verified as follows:
    • mvn -pl hugegraph-server/hugegraph-test -am test -Dtest=CassandraTest — 13/13 pass

Does this PR potentially affect the following parts?

  • Modify configurations

Documentation Status

  • Doc - TODO

@dosubot (bot) added the size:L, bug, and store labels on Apr 18, 2026
  - Register ExponentialReconnectionPolicy on the Cluster builder so the
    Datastax driver keeps retrying downed nodes in the background.
  - Wrap every Session.execute() in executeWithRetry() with exponential
    backoff on transient connectivity failures.
  - Implement reconnectIfNeeded()/reset() so the pool reopens closed
    sessions and issues a lightweight health-check (SELECT now() FROM
    system.local) before subsequent queries.
  - Add tunable options: cassandra.reconnect_base_delay,
    cassandra.reconnect_max_delay, cassandra.reconnect_max_retries,
    cassandra.reconnect_interval.
  - Add unit tests covering defaults, overrides, disabling retries and
    option keys.

  Fixes apache#2740
@dpol1 dpol1 force-pushed the fix/2740-cassandra-reconnect branch from 97de8e9 to fc3d291 Compare April 18, 2026 17:37
@imbajin
Member

imbajin commented Apr 18, 2026

⚠️ commitAsync() bypasses retry — still calls this.session.executeAsync(s) directly

The PR wraps execute() and commit() with executeWithRetry, but commitAsync() (line 177 in the base file) still calls this.session.executeAsync(s) directly. If a Cassandra restart happens during an async batch commit, the same connectivity failure will surface without any retry.

Consider wrapping the async path as well, or at minimum adding a TODO/comment explaining why async commits are deliberately left un-retried (e.g., if retry semantics for async batches are too complex for this PR).
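
One possible shape for retrying the async path is sketched below. It is only an illustration under stated assumptions: it uses `CompletableFuture` from the JDK rather than the driver's own future type, `TransientAsyncFailure` and `retryAsync` are hypothetical names, and the backoff is a plain doubling of the base interval.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for transient connectivity errors on the async path.
class TransientAsyncFailure extends RuntimeException {
    TransientAsyncFailure(String message) {
        super(message);
    }
}

class AsyncRetrySketch {

    // Starts the operation and re-schedules it with doubling delays while it
    // fails with a transient error and retries remain.
    static <T> CompletableFuture<T> retryAsync(Supplier<CompletableFuture<T>> op,
                                               int maxRetries, long intervalMs,
                                               ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(op, 0, maxRetries, intervalMs, scheduler, result);
        return result;
    }

    private static <T> void attempt(Supplier<CompletableFuture<T>> op, int n,
                                    int maxRetries, long intervalMs,
                                    ScheduledExecutorService scheduler,
                                    CompletableFuture<T> result) {
        CompletableFuture<T> future;
        try {
            future = op.get();
        } catch (RuntimeException e) {
            // Treat a synchronous failure to submit like a failed future.
            future = new CompletableFuture<>();
            future.completeExceptionally(e);
        }
        future.whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
                return;
            }
            // Dependent stages wrap failures in CompletionException; unwrap it.
            Throwable cause = (error instanceof CompletionException
                               && error.getCause() != null)
                              ? error.getCause() : error;
            if (!(cause instanceof TransientAsyncFailure) || n >= maxRetries) {
                result.completeExceptionally(cause);
                return;
            }
            long delayMs = intervalMs << Math.min(n, 20);
            scheduler.schedule(
                () -> attempt(op, n + 1, maxRetries, intervalMs, scheduler, result),
                delayMs, TimeUnit.MILLISECONDS);
        });
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger calls = new AtomicInteger();
        CompletableFuture<String> done = retryAsync(() -> {
            CompletableFuture<String> cf = new CompletableFuture<>();
            if (calls.incrementAndGet() < 3) {
                cf.completeExceptionally(new TransientAsyncFailure("no host available"));
            } else {
                cf.complete("committed");
            }
            return cf;
        }, 5, 10L, scheduler);
        System.out.println(done.join() + " after " + calls.get() + " attempts");
        scheduler.shutdown();
    }
}
```

Re-submitting a write is only safe when the statements are idempotent, which is one reason async batch retries may deserve separate treatment.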

@dpol1
Author

dpol1 commented Apr 20, 2026

Thanks @imbajin for the feedback, changed!

@dpol1 dpol1 requested a review from imbajin April 20, 2026 14:49
Labels

bug Something isn't working size:L This PR changes 100-499 lines, ignoring generated files. store Store module

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

[Bug] Hugegraph isn't responding after Cassandra restarted.

2 participants