
fix(replication): add proper handling for pool restarts during replication #915

Open
meskill wants to merge 4 commits into main from fix/replication-pool-shutdown

Conversation

@meskill (Contributor) commented Apr 20, 2026

No description provided.

Comment thread pgdog/src/backend/pool/pool_impl.rs Outdated
port = addr.port,
database = %addr.database_name,
user = %addr.user,
"pool offline: connection pool shut down"
@levkk (Collaborator) commented Apr 20, 2026

You don't want to log that. This "shutdown" isn't actually a shutdown; it just destroys this object, and we replace it with a brand-new one atomically. Logging it could make people think we are shutting down, causing panic :D

@meskill (Contributor, Author)

The intention was to add more logs to help trace similar issues.

I can drop it or change the severity/message.

@levkk (Collaborator)

Yup makes sense. You can set this to trace level for sure.

@meskill (Contributor, Author)

changed the logs

// Validate all tables support replication before committing to
// what can be a multi-hour copy. A table with no primary key or
// unique replica-identity index cannot be replicated correctly.
for tables in self.tables.values() {
@levkk (Collaborator)

This is great, we should have done it a long time ago.

@meskill (Contributor, Author)

Updated this to show a more relevant error and to collect multiple errors.
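As a rough sketch of the "collect multiple errors" idea (the type and function names below are hypothetical, not pgdog's actual API): validate every table up front and report all failures at once, so the user can fix them in one pass instead of discovering them one aborted copy at a time.

```rust
// Hypothetical sketch: validate all tables before committing to what can
// be a multi-hour copy, collecting every failure rather than stopping at
// the first one.

#[derive(Debug, PartialEq)]
enum ValidationError {
    NoIdentityColumns { table: String },
}

struct TableInfo {
    name: String,
    has_identity_column: bool,
}

/// Returns Ok(()) if every table can be replicated, otherwise the full
/// list of offending tables.
fn validate_tables(tables: &[TableInfo]) -> Result<(), Vec<ValidationError>> {
    let errors: Vec<ValidationError> = tables
        .iter()
        .filter(|t| !t.has_identity_column)
        .map(|t| ValidationError::NoIdentityColumns {
            table: t.name.clone(),
        })
        .collect();
    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}

fn main() {
    let tables = vec![
        TableInfo { name: "users".into(), has_identity_column: true },
        TableInfo { name: "events".into(), has_identity_column: false },
        TableInfo { name: "logs".into(), has_identity_column: false },
    ];
    match validate_tables(&tables) {
        Ok(()) => println!("all tables replicable"),
        Err(errors) => println!("{} tables rejected: {:?}", errors.len(), errors),
    }
}
```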

/// Check that the table supports replication.
///
/// Requires at least one column with a replica identity flag. Tables with
/// REPLICA IDENTITY FULL or NOTHING have no identity columns and fail here
@levkk (Collaborator)

I would double check that. If this is true, we need to have a special query to detect REPLICA IDENTITY FULL and use that as the key.

@meskill (Contributor, Author)

It is; the added tests validate that. I'll investigate whether we can actually detect this case.

@meskill (Contributor, Author)

So, this will require custom handling in both validation and query generation. Let me do this as a separate PR to keep it focused.
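For reference, a sketch of how such detection could work: Postgres records each table's replica identity as the single-character `relreplident` column in `pg_class` (standard catalog behavior); the surrounding Rust names are hypothetical, not pgdog's.

```rust
// Postgres stores each table's replica identity in pg_class.relreplident:
//   'd' = default (the primary key, if any)
//   'n' = nothing
//   'f' = full (the whole row serves as the identity)
//   'i' = a specific index chosen via REPLICA IDENTITY USING INDEX
// A catalog query along these lines could feed the validation step:
const REPLICA_IDENTITY_QUERY: &str =
    "SELECT relreplident FROM pg_class WHERE oid = $1::regclass";

#[derive(Debug, PartialEq)]
enum ReplicaIdentity {
    Default,
    Nothing,
    Full,
    Index,
}

fn parse_replica_identity(code: char) -> Option<ReplicaIdentity> {
    match code {
        'd' => Some(ReplicaIdentity::Default),
        'n' => Some(ReplicaIdentity::Nothing),
        'f' => Some(ReplicaIdentity::Full),
        'i' => Some(ReplicaIdentity::Index),
        _ => None,
    }
}

fn main() {
    println!("query: {REPLICA_IDENTITY_QUERY}");
    // REPLICA IDENTITY FULL tables would need the whole row as the key,
    // which is the special case discussed above.
    assert_eq!(parse_replica_identity('f'), Some(ReplicaIdentity::Full));
}
```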

/// the copy. Instead, only the cluster reference inside the existing
/// publisher is updated so that subsequent pool.get() calls target the
/// live pool rather than a stale, potentially-offline one.
pub(crate) async fn refresh_before_replicate(&mut self) -> Result<(), Error> {
@levkk (Collaborator)

I think we have code to do this already somewhere. If not, we should re-use this function wherever we run these 3 statements.

@meskill (Contributor, Author)

Yes, there was the usual refresh, but it was resetting the publication as well. I refactored this part and added tests, so now there is a single refresh method.
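A minimal sketch of the shape described above, under assumed names (`Publisher`, `Cluster`, and the publication string are illustrative, not pgdog's actual types): the refresh swaps in the live cluster reference while leaving the publication untouched.

```rust
// Hypothetical sketch: update only the cluster reference inside an
// existing publisher, so subsequent pool.get() calls target the live
// pool rather than a stale, potentially-offline one. The publication
// deliberately survives the refresh.

#[derive(Clone, Debug, PartialEq)]
struct Cluster {
    generation: u64,
}

struct Publisher {
    cluster: Cluster,
    publication: String,
}

impl Publisher {
    /// Replace only the cluster reference; the publication state is
    /// preserved, which is what a mid-replication refresh requires.
    fn refresh(&mut self, live: Cluster) {
        self.cluster = live;
    }
}

fn main() {
    let mut publisher = Publisher {
        cluster: Cluster { generation: 1 },
        publication: "pgdog_pub".into(),
    };
    publisher.refresh(Cluster { generation: 2 });
    println!(
        "{} now on cluster generation {}",
        publisher.publication, publisher.cluster.generation
    );
}
```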


codecov Bot commented Apr 20, 2026

@levkk levkk linked an issue Apr 21, 2026 that may be closed by this pull request
@meskill meskill force-pushed the fix/replication-pool-shutdown branch from a67a82f to 70adcdc on April 23, 2026 13:44
@meskill meskill marked this pull request as ready for review April 23, 2026 15:23
Comment thread pgdog/src/backend/databases.rs Outdated
let databases = from_config(&new_config);

info!(
"reloading pools from config file: {}",
@levkk (Collaborator)

Suggested change:
-    "reloading pools from config file: {}",
+    "reloading configuration",

This will reload users.toml too. Also, the load above already logs this info.

use crate::backend::{databases::reload_from_existing, Error};

pub(crate) fn schema_changed() -> Result<(), Error> {
debug!("schema change detected: reloading pools to refresh schema cache");
@levkk (Collaborator)

Suggested change:
-    debug!("schema change detected: reloading pools to refresh schema cache");
+    debug!("schema change detected, refreshing schema cache");

-- that have no primary key and no `REPLICA IDENTITY USING INDEX`. Without a PK
-- (or unique index promoted via REPLICA IDENTITY USING INDEX), no column carries
-- the replica identity flag, Table::valid() fails with NoIdentityColumns, and the
-- table is rejected before the copy starts. See docs/FIX_ISSUE_914.md, Fix 2.
@levkk (Collaborator)

I wonder how it worked before... maybe because we never used upserts? But we do actually check that each table has a replica identity, so I'm not sure.

@meskill (Contributor, Author)

I think it's because the previous test didn't generate any replication messages to the new database, so it never hit the path where we call table.valid(); it only did a COPY of the data.

@levkk (Collaborator) left a comment

LGTM, minor logging changes only



Development

Successfully merging this pull request may close these issues.

[Resharding] pool is shut down

2 participants