
fix(replication): add proper handling for pool restarts during replication #915

Open
meskill wants to merge 4 commits into main from fix/replication-pool-shutdown

Conversation

@meskill (Contributor) commented Apr 20, 2026

No description provided.

Comment thread pgdog/src/backend/pool/pool_impl.rs Outdated
port = addr.port,
database = %addr.database_name,
user = %addr.user,
"pool offline: connection pool shut down"
@levkk (Collaborator) commented Apr 20, 2026

You don't want to log that. This "shutdown" isn't actually a shutdown; it just destroys this object, and we replace it with a brand-new one atomically. Logging it could make people think we are shutting down, causing panic :D

@meskill (Contributor, Author)

The intention was to add more logs to help trace similar issues.

I can drop it or change the severity/message.

@levkk (Collaborator)

Yup makes sense. You can set this to trace level for sure.

@meskill (Contributor, Author)

changed the logs

// Validate all tables support replication before committing to
// what can be a multi-hour copy. A table with no primary key or
// unique replica-identity index cannot be replicated correctly.
for tables in self.tables.values() {
@levkk (Collaborator)

This is great, we should have done it a long time ago.

@meskill (Contributor, Author)

Updated this to show a more relevant error and to collect multiple errors.
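As a rough sketch of the "collect multiple errors" idea (the type and function names below are hypothetical, not pgdog's actual API): validate every table up front and report all failures at once, so the user can fix them in one pass instead of discovering them one aborted copy at a time.

```rust
// Hypothetical sketch: validate all tables before committing to what can
// be a multi-hour copy, collecting every failure rather than stopping at
// the first one.

#[derive(Debug, PartialEq)]
enum ValidationError {
    NoIdentityColumns { table: String },
}

struct TableInfo {
    name: String,
    has_identity_column: bool,
}

/// Returns Ok(()) if every table can be replicated, otherwise the full
/// list of offending tables.
fn validate_tables(tables: &[TableInfo]) -> Result<(), Vec<ValidationError>> {
    let errors: Vec<ValidationError> = tables
        .iter()
        .filter(|t| !t.has_identity_column)
        .map(|t| ValidationError::NoIdentityColumns {
            table: t.name.clone(),
        })
        .collect();
    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}

fn main() {
    let tables = vec![
        TableInfo { name: "users".into(), has_identity_column: true },
        TableInfo { name: "events".into(), has_identity_column: false },
        TableInfo { name: "logs".into(), has_identity_column: false },
    ];
    match validate_tables(&tables) {
        Ok(()) => println!("all tables replicable"),
        Err(errors) => println!("{} tables rejected: {:?}", errors.len(), errors),
    }
}
```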

/// Check that the table supports replication.
///
/// Requires at least one column with a replica identity flag. Tables with
/// REPLICA IDENTITY FULL or NOTHING have no identity columns and fail here
@levkk (Collaborator)

I would double check that. If this is true, we need to have a special query to detect REPLICA IDENTITY FULL and use that as the key.

@meskill (Contributor, Author)

It is; the added tests validate that. I'll investigate whether we can actually detect this case.

@meskill (Contributor, Author)

So, this will require custom handling in both validation and query generation. Let me do this as a separate PR to keep it focused.
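For reference, a sketch of how such detection could work: Postgres records each table's replica identity as the single-character `relreplident` column in `pg_class` (standard catalog behavior); the surrounding Rust names are hypothetical, not pgdog's.

```rust
// Postgres stores each table's replica identity in pg_class.relreplident:
//   'd' = default (the primary key, if any)
//   'n' = nothing
//   'f' = full (the whole row serves as the identity)
//   'i' = a specific index chosen via REPLICA IDENTITY USING INDEX
// A catalog query along these lines could feed the validation step:
const REPLICA_IDENTITY_QUERY: &str =
    "SELECT relreplident FROM pg_class WHERE oid = $1::regclass";

#[derive(Debug, PartialEq)]
enum ReplicaIdentity {
    Default,
    Nothing,
    Full,
    Index,
}

fn parse_replica_identity(code: char) -> Option<ReplicaIdentity> {
    match code {
        'd' => Some(ReplicaIdentity::Default),
        'n' => Some(ReplicaIdentity::Nothing),
        'f' => Some(ReplicaIdentity::Full),
        'i' => Some(ReplicaIdentity::Index),
        _ => None,
    }
}

fn main() {
    println!("query: {REPLICA_IDENTITY_QUERY}");
    // REPLICA IDENTITY FULL tables would need the whole row as the key,
    // which is the special case discussed above.
    assert_eq!(parse_replica_identity('f'), Some(ReplicaIdentity::Full));
}
```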

/// the copy. Instead, only the cluster reference inside the existing
/// publisher is updated so that subsequent pool.get() calls target the
/// live pool rather than a stale, potentially-offline one.
pub(crate) async fn refresh_before_replicate(&mut self) -> Result<(), Error> {
@levkk (Collaborator)

I think we have code to do this already somewhere. If not, we should re-use this function wherever we run these 3 statements.

@meskill (Contributor, Author)

Yes, there was the usual refresh, but it was resetting the publication as well. I refactored this part and added tests, so now there is a single refresh method.
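A minimal sketch of the shape described above, under assumed names (`Publisher`, `Cluster`, and the publication string are illustrative, not pgdog's actual types): the refresh swaps in the live cluster reference while leaving the publication untouched.

```rust
// Hypothetical sketch: update only the cluster reference inside an
// existing publisher, so subsequent pool.get() calls target the live
// pool rather than a stale, potentially-offline one. The publication
// deliberately survives the refresh.

#[derive(Clone, Debug, PartialEq)]
struct Cluster {
    generation: u64,
}

struct Publisher {
    cluster: Cluster,
    publication: String,
}

impl Publisher {
    /// Replace only the cluster reference; the publication state is
    /// preserved, which is what a mid-replication refresh requires.
    fn refresh(&mut self, live: Cluster) {
        self.cluster = live;
    }
}

fn main() {
    let mut publisher = Publisher {
        cluster: Cluster { generation: 1 },
        publication: "pgdog_pub".into(),
    };
    publisher.refresh(Cluster { generation: 2 });
    println!(
        "{} now on cluster generation {}",
        publisher.publication, publisher.cluster.generation
    );
}
```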


codecov Bot commented Apr 20, 2026

@levkk levkk linked an issue Apr 21, 2026 that may be closed by this pull request
@meskill meskill force-pushed the fix/replication-pool-shutdown branch from a67a82f to 70adcdc on April 23, 2026 13:44
@meskill meskill marked this pull request as ready for review April 23, 2026 15:23
Comment thread pgdog/src/backend/databases.rs Outdated
let databases = from_config(&new_config);

info!(
"reloading pools from config file: {}",
@levkk (Collaborator)

Suggested change:
-    "reloading pools from config file: {}",
+    "reloading configuration",

This will reload users.toml too. Also, the load above already logs this info.

use crate::backend::{databases::reload_from_existing, Error};

pub(crate) fn schema_changed() -> Result<(), Error> {
debug!("schema change detected: reloading pools to refresh schema cache");
@levkk (Collaborator)

Suggested change:
-    debug!("schema change detected: reloading pools to refresh schema cache");
+    debug!("schema change detected, refreshing schema cache");

-- that have no primary key and no `REPLICA IDENTITY USING INDEX`. Without a PK
-- (or unique index promoted via REPLICA IDENTITY USING INDEX), no column carries
-- the replica identity flag, Table::valid() fails with NoIdentityColumns, and the
-- table is rejected before the copy starts. See docs/FIX_ISSUE_914.md, Fix 2.
@levkk (Collaborator)

I wonder how it worked before... maybe because we never used upserts? But we do actually check that each table has a replica identity, so I'm not sure.

@meskill (Contributor, Author)

I think it's because the previous test didn't generate any replication messages to the new database, so it never hit the path where we call table.valid(); it only did a COPY of the data.

@levkk (Collaborator) left a comment

LGTM, minor logging changes only



Development

Successfully merging this pull request may close these issues.

[Resharding] pool is shut down

2 participants