12 changes: 12 additions & 0 deletions docs/configuration/pgdog.toml/general.md
@@ -457,6 +457,18 @@ Available options:

`text` format is required when migrating from `INTEGER` to `BIGINT` primary keys during resharding.

### `resharding_copy_retry_max_attempts`

How many times PgDog retries a single table copy after a failure (e.g., a shard goes down mid-copy). The delay starts at [`resharding_copy_retry_min_delay`](#resharding_copy_retry_min_delay) and doubles each attempt, capped at 32× that value.

Default: **`5`**

### `resharding_copy_retry_min_delay`

Starting delay between retry attempts, in milliseconds. Doubles on each attempt, capped at 32× this value (`32_000` ms at the default).

Default: **`1_000`** (1s)
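
For example, to allow more attempts with a shorter starting delay (the values below are hypothetical, shown only to illustrate how the two settings work together):

```toml
[general]
resharding_copy_retry_max_attempts = 8  # retry each table copy up to 8 times
resharding_copy_retry_min_delay = 500   # first delay 500ms; caps at 16s (32 × 500ms)
```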

### `reload_schema_on_ddl`

!!! warning
2 changes: 1 addition & 1 deletion docs/features/sharding/resharding/.pages
@@ -2,5 +2,5 @@ nav:
- 'index.md'
- 'databases.md'
- 'schema.md'
- 'hash.md'
- 'move.md'
- 'cutover.md'
2 changes: 1 addition & 1 deletion docs/features/sharding/resharding/cutover.md
@@ -8,7 +8,7 @@ icon: material/set-right
This is a new and experimental feature. Please make sure to test it before deploying to production
and report any issues you find.

Traffic cutover involves moving application traffic (read and write queries) to the destination database. This happens when the source and destination databases are in sync: all source data is copied, and resharded with [logical replication](hash.md).
Traffic cutover involves moving application traffic (read and write queries) to the destination database. This happens when the source and destination databases are in sync: all source data is copied, and resharded with [logical replication](move.md).

## Performing the cutover

2 changes: 1 addition & 1 deletion docs/features/sharding/resharding/databases.md
@@ -8,7 +8,7 @@ PgDog's strategy for resharding Postgres databases is to create a new, independe

## Requirements

New databases should be **empty**: don't migrate your [table definitions](schema.md) or [data](hash.md). These will be taken care of automatically by PgDog.
New databases should be **empty**: don't migrate your [table definitions](schema.md) or [data](move.md). These will be taken care of automatically by PgDog.

### Database users

4 changes: 2 additions & 2 deletions docs/features/sharding/resharding/index.md
@@ -24,7 +24,7 @@ The resharding process is composed of four independent operations. The first one
|-|-|
| [Create new cluster](databases.md) | Create a new set of empty databases that will be used for storing data in the new, sharded cluster. |
| [Schema synchronization](schema.md) | Replicate table and index definitions to the new shards, making sure the new cluster has the same schema as the old one. |
| [Move & reshard data](hash.md) | Copy data using logical replication, while redistributing rows in-flight between new shards. |
| [Move & reshard data](move.md) | Copy data using logical replication, while redistributing rows in-flight between new shards. |
| [Cutover traffic](cutover.md) | Make the new cluster service both reads and writes from the application, without taking downtime. |

While each step can be executed separately by the operator, PgDog provides an [admin database](../../../administration/index.md) command to perform online resharding and traffic cutover steps in a completely automated fashion:
@@ -50,5 +50,5 @@ The `<source>` and `<destination>` parameters accept the name of the source and

{{ next_steps_links([
("Schema sync", "schema.md", "Synchronize table, index and other schema entities between the source and destination databases."),
("Move data", "hash.md", "Redistribute data between shards using the configured sharding function. This happens without downtime and keeps the shards up-to-date with the source database until traffic cutover."),
("Move data", "move.md", "Redistribute data between shards using the configured sharding function. This happens without downtime and keeps the shards up-to-date with the source database until traffic cutover."),
]) }}
@@ -121,9 +121,9 @@ PgDog will distribute the table copy load evenly between all replicas in the con
To prevent the resharding process from impacting production queries, you can create a separate set of replicas just for resharding.

Managed clouds (e.g., AWS RDS) make this especially easy, but the new replicas need a warm-up period to fetch data from the backup snapshot before they can read at the full speed of their storage volumes.

To make sure dedicated replicas are not used for read queries in production, you can configure PgDog to use them for resharding only:

```toml
[[databases]]
name = "prod"
@@ -215,6 +215,60 @@ FROM pg_replication_slots;
```
The replication delay between the two database clusters is measured in bytes. When that number reaches zero, the two databases are byte-for-byte identical, and traffic can be [cut over](cutover.md) to the destination database.

## Troubleshooting

### Shard failure during copy

If a shard goes down mid-copy, PgDog retries that table with exponential backoff: up to **5 attempts** by default, starting at a **1 second** delay and doubling each time.

The two cases behave differently:

- **Destination shard down**: PgDog opens a fresh connection on the next attempt.
- **Source shard down**: the `TEMPORARY` replication slot for that table is re-created from scratch. No cleanup is needed, since `TEMPORARY` slots vanish when the connection closes (unlike the [permanent slot](#replication-slot) created at the start of the overall copy).

To change the defaults:

```toml
[general]
resharding_copy_retry_max_attempts = 5 # per-table retry attempts
resharding_copy_retry_min_delay = 1000 # base backoff in ms; doubles each attempt, max 32×
```
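
As an illustration only (this is a sketch of the schedule described above, not PgDog's actual code), the resulting delays in milliseconds look like this:

```python
def backoff_delays(min_delay_ms: int, max_attempts: int) -> list[int]:
    """Delay doubles on every attempt, capped at 32x the starting delay."""
    cap = 32 * min_delay_ms
    return [min(min_delay_ms * 2 ** attempt, cap) for attempt in range(max_attempts)]

# With the defaults (1000 ms starting delay, 5 attempts):
print(backoff_delays(1000, 5))  # [1000, 2000, 4000, 8000, 16000]
```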

### Rows remaining in destination after failure

Most drops are clean: table copies run inside an implicit transaction, so Postgres rolls back on disconnect and the destination is left empty.

The awkward case: the connection drops *after* `COPY` commits on some or all shards but before PgDog records it as done. The rows survive, and the next attempt hits primary key violations immediately.

PgDog catches this and tells you exactly what to run:

```
data sync for "public"."orders" failed with rows remaining in destination;
truncate manually before retrying: TRUNCATE "public"."orders_new";
```

Run the `TRUNCATE` statement on the destination, copying the table name verbatim from the error message, then re-run `COPY_DATA`.

### Binary format mismatch

Different major Postgres versions can produce incompatible binary `COPY` data. PgDog surfaces this as a `BinaryFormatMismatch` error. Switch to text:

```toml
[general]
resharding_copy_format = "text"
```

See [Integer primary keys](#integer-primary-keys) for the other common reason to use text format.

### Replication slot not dropped after abort

Aborting `COPY_DATA` without restarting it leaves the **permanent** replication slot sitting on the source. Postgres won't recycle WAL until it's gone. Drop it manually:

```postgresql
SELECT pg_drop_replication_slot('slot_name');
```
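
If the slot name isn't handy, you can also list leftover slots directly on the source with a generic catalog query (not PgDog-specific; check the names before dropping anything):

```postgresql
SELECT slot_name,
       slot_type,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes
FROM pg_replication_slots
WHERE NOT active;
```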

The slot name appears in the `COPY_DATA` startup log and in `SHOW REPLICATION_SLOTS`.

## Next steps

{{ next_steps_links([
6 changes: 3 additions & 3 deletions docs/features/sharding/resharding/schema.md
@@ -12,7 +12,7 @@ The schema synchronization process is composed of 4 distinct steps, all of which
| Phase | Description |
|-|-|
| [Pre-data](#pre-data-phase) | Create identical tables on all shards along with the primary key constraint (and index). Secondary indexes are _not_ created yet. |
| [Post-data](#post-data-phase) | Create secondary indexes on all tables and shards. This is done after [moving data](hash.md), as a separate step, because it's considerably faster to create indexes on whole tables than while inserting individual rows. |
| [Post-data](#post-data-phase) | Create secondary indexes on all tables and shards. This is done after [moving data](move.md), as a separate step, because it's considerably faster to create indexes on whole tables than while inserting individual rows. |
| [Cutover](#cutover) | This step is executed during traffic cutover, while application queries are blocked from executing on the database. |
| Post-cutover | This step makes sure the rollback database cluster can handle reverse logical replication. |

@@ -110,7 +110,7 @@ This will make sure all tables and schemas in your database are copied and resha

## Post-data phase

The post-data phase is performed after the [data copy](hash.md) is complete and tables have been synchronized with logical replication. Its job is to create all secondary indexes (e.g., `CREATE INDEX`).
The post-data phase is performed after the [data copy](move.md) is complete and tables have been synchronized with logical replication. Its job is to create all secondary indexes (e.g., `CREATE INDEX`).

This step is performed after copying data because it makes the copy process considerably faster: Postgres doesn't need to update several indexes while writing rows into the tables.

@@ -189,5 +189,5 @@ pg_dump_path = "/path/to/pg_dump"
## Next steps

{{ next_steps_links([
("Move data", "hash.md", "Redistribute data between shards using the configured sharding function. This happens without downtime and keeps the shards up-to-date with the source database until traffic cutover."),
("Move data", "move.md", "Redistribute data between shards using the configured sharding function. This happens without downtime and keeps the shards up-to-date with the source database until traffic cutover."),
]) }}
2 changes: 1 addition & 1 deletion docs/roadmap.md
@@ -89,7 +89,7 @@ Support for sorting rows in [cross-shard](features/sharding/cross-shard-queries/

| Feature | Status | Notes |
|-|-|-|
| [Data sync](features/sharding/resharding/hash.md) | :material-wrench: | Sync table data with logical replication. |
| [Data sync](features/sharding/resharding/move.md) | :material-wrench: | Sync table data with logical replication. |
| [Schema sync](features/sharding/resharding/schema.md) | :material-wrench: | Sync table, index and constraint definitions. |
| Online rebalancing | :material-calendar-check: | Not automated yet, requires manual orchestration. |
