fix(copy_data): add retries for copy_data command (#916)
Conversation
```diff
@@ -106,22 +146,19 @@ impl ParallelSyncManager {
         self.permit.available_permits() / self.replicas.len(),
     );

-    let mut replicas_iter = self.replicas.iter();
-    // Loop through replicas, one at a time.
-    // This works around Rust iterators not having a "rewind" function.
-    let replica = loop {
-        if let Some(replica) = replicas_iter.next() {
-            break replica;
-        } else {
-            replicas_iter = self.replicas.iter();
-        }
-    };
+    // cycle() is the idiomatic "rewind": it restarts the iterator from the
+    // beginning once exhausted, giving round-robin distribution across replicas.
+    let mut replicas_iter = self.replicas.iter().cycle();
```
@levkk could you please check this? Did I get the original intention right?
```rust
tokio::time::sleep(backoff).await;
```

```rust
if let Err(trunc_err) = self.table.truncate_destination(&self.dest).await {
```
I don't think we should do this, for two reasons:

1. `COPY` is atomic and transactional: if it fails, none of the rows will be saved in the table, so it will already be empty when we retry.
2. `TRUNCATE` is a scary command to run, for now. We should add a bunch more tests and conditions that prevent it from accidentally being called on the source DB. For now, let's have the user truncate manually if this retry logic fails for some reason.

There is a chance of a race condition where the table copy completes and we then get an error somewhere below, e.g. while running `COMMIT`, but the odds of that are slim. We should definitely account for this (and truncate), but only after we implement a few "this is definitely the destination" checks. I'll write up a separate issue for this.
Yes, dropped the truncate and left a comment for the future.
```rust
    Ok((0, 0))
}

async fn flush(&mut self) -> Result<(usize, usize), Error> {
```
`send_one` is actually not free! The `ParallelConnection` gave us, I think, a 30-40% copy-speed boost. Have you benchmarked this? Also curious what made you refactor this one.
Moved to a separate PR to verify and test it: #920
This reverts commit 5560276.
```rust
impl ShardMonitor {
    async fn spawn(&self) {
        if self.shard.comms().lsn_check_interval == Duration::MAX {
```
We test this code somewhere in CI, right? Just double checking. This is the replica/primary promoter, so we need to be sure it's tested before tweaking it. At the very least, test it locally (with `role = "auto"`) to make sure it still works. I think this change is fine, though it's probably a no-op, since the code below won't take any action if the loop controlled by `lsn_check_interval` never provides data.
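A minimal sketch of the `Duration::MAX` sentinel pattern the snippet above appears to use (assumed semantics: `MAX` means the check loop is disabled; the function name is illustrative):

```rust
use std::time::Duration;

// Duration::MAX as an "off switch": a loop that would otherwise tick at
// this interval simply never runs.
fn lsn_checks_enabled(interval: Duration) -> bool {
    interval != Duration::MAX
}

fn main() {
    assert!(!lsn_checks_enabled(Duration::MAX));
    assert!(lsn_checks_enabled(Duration::from_secs(5)));
    println!("ok");
}
```

Using `Option<Duration>` would make the "disabled" state explicit in the type, but a sentinel avoids threading `Option` through config parsing.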
```rust
/// Prevents accumulated counts from a discarded attempt inflating totals
/// and throughput calculations across retries.
pub(crate) fn reset(&self) {
    if let Some(mut state) = TableCopies::get().get_mut(self) {
```
Not a bad idea to track retries, but we can add that as a follow-up.
|
The mirror test is flaky. The rest looks good to me!
fixes #897