
fix: move patroni reload operation #362

Open
jason-lynch wants to merge 1 commit into main from fix/PLAT-548/move-patroni-reload-operation

Conversation

@jason-lynch
Member

Summary

Patroni has an activity loop that they refer to as a 'run cycle'. Each run cycle is loop_wait seconds long. When a cycle takes less than loop_wait seconds, which is the typical case, Patroni will sleep until that time has elapsed. If the system is degraded and the cycle takes longer than loop_wait, it'll sleep for 1ms and log a warning.

When we signal Patroni to reload its configuration, it will perform the reload at the beginning of the next run cycle. This means that under non-degraded conditions, it can take up to loop_wait seconds to reload its configuration.

This commit fixes a bug where we were not waiting for the reload to complete. It worked most of the time by luck: we signal Patroni to reload in the UnitResource, so in many cases the timing worked out such that we reached InstanceResource.Create before the new config took effect, and the config would then take effect before we evaluated whether a restart was needed.

Rather than doing the wait in the InstanceResource, which would happen on every update, I've moved the reload operation to each orchestrator's PatroniConfig resource. I've also added a sleep operation here to ensure that loop_wait seconds have elapsed before we proceed. This way, we can assume that the new configuration has taken effect in the InstanceResource lifecycle methods.

This additional sleep does slow down some operations, especially in Swarm databases, because there the Patroni Config resource contains the entire instance spec, so there are plenty of conditions where there is a resource diff even though there is no Patroni config change. We'll improve this when we adopt the new common.PatroniConfigGenerator in the swarm.patroni_config resource.

Testing

This mainly affects the TestPortChange E2E on systemd:

# in terminal one
make dev-lima-run

# in terminal two
use-dev-lima
make test-e2e E2E_RUN=TestPortChange

PLAT-548

@codacy-production

codacy-production Bot commented Apr 22, 2026

Up to standards ✅

🟢 Issues: 0 new issues
🟢 Metrics: Duplication 3


@coderabbitai

coderabbitai Bot commented Apr 22, 2026

Warning

Rate limit exceeded

@jason-lynch has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 48 minutes and 32 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 48 minutes and 32 seconds.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fc41104f-63f3-41c5-98e3-2ad3754fea39

📥 Commits

Reviewing files that changed from the base of the PR and between f2f6428 and acbdeeb.

📒 Files selected for processing (10)
  • lima/roles/stop_dbs/tasks/main.yaml
  • server/internal/database/instance_resource.go
  • server/internal/database/utils.go
  • server/internal/orchestrator/common/patroni_config_generator.go
  • server/internal/orchestrator/swarm/patroni_config.go
  • server/internal/orchestrator/systemd/client.go
  • server/internal/orchestrator/systemd/patroni_config.go
  • server/internal/orchestrator/systemd/unit.go
  • server/internal/patroni/config.go
  • server/internal/utils/utils.go
📝 Walkthrough

Refactored Patroni service management to implement a reload-and-wait pattern across systemd and Docker Swarm orchestration backends, replacing direct restart operations. Added new utility functions, a Patroni loop wait constant, refactored database instance resource operations, and introduced a system log maintenance task.

Changes

  • Patroni Reload Infrastructure (server/internal/patroni/config.go, server/internal/orchestrator/systemd/client.go, server/internal/orchestrator/systemd/patroni_config.go, server/internal/orchestrator/swarm/patroni_config.go, server/internal/orchestrator/common/patroni_config_generator.go): Added the DefaultLoopWaitSeconds constant; replaced RestartUnit with ReloadUnit and ReloadOrRestartUnit methods in systemd; implemented signalReload helpers in both the systemd and swarm backends to trigger service reloads after config updates; changed hardcoded LoopWait values to use the new constant.
  • Database Instance Resource Refactoring (server/internal/database/instance_resource.go, server/internal/database/utils.go): Restructured the Create and Update flows to update connection info upfront, wait for the Patroni running state, and delegate remaining initialization; simplified initializeInstance to only fetch the primary instance ID; added documentation for WaitForPatroniRunning polling behavior.
  • Systemd Unit Management (server/internal/orchestrator/systemd/unit.go): Replaced the RestartUnit call with ReloadOrRestartUnit in the unit creation post-enable flow.
  • Utility Functions & Infrastructure (server/internal/utils/utils.go, lima/roles/stop_dbs/tasks/main.yaml): Added the context-aware SleepContext utility function; added an Ansible task to flush and rotate systemd journals during stop_dbs role execution.

Poem

🐰 Reload with grace, no more harsh restart,
Configuration flows from a peaceful heart,
Journal entries flushed, utilities aligned,
Instance resources gracefully redesigned! 🌟

🚥 Pre-merge checks: ✅ 4 passed, ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)
  • Title check ✅: the title clearly and concisely summarizes the main change: moving the Patroni reload operation to a different location in the codebase.
  • Description check ✅: the description includes Summary, Changes (implicitly detailed in the summary), a Testing section with a specific E2E test and commands, and an issue reference (PLAT-548), meeting most template requirements.
  • Linked Issues check ✅: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅: skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lima/roles/stop_dbs/tasks/main.yaml`:
- Around line 41-43: The task "Clear journalctl" currently runs
ansible.builtin.command with "journalctl --flush --rotate --vacuum-time=1s",
which will destructively purge the host journal; change this so it is not run by
default: remove or guard the task with an explicit opt-in variable (e.g., add
when: vacuum_system_journal | default(false) or require a tag like
"vacuum_journal") and/or replace the vacuum-time flag with a less destructive,
targeted operation (e.g., vacuum by --unit or --vacuum-files and/or a
configurable --vacuum-time variable). Update the task that invokes journalctl to
reference the new variable (vacuum_system_journal or vacuum_time) and ensure
changed_when remains correct so the action is only performed when explicitly
enabled.

In `@server/internal/orchestrator/systemd/unit.go`:
- Around line 98-100: Update the error message produced after calling
client.ReloadOrRestartUnit so it reflects the actual operation name: change the
fmt.Errorf call in the ReloadOrRestartUnit error path to use "failed to reload
or restart unit '%s': %w" and pass the correct identifier (use r.Name rather
than path if r.Name was supplied to ReloadOrRestartUnit); the symbols to edit
are the ReloadOrRestartUnit call, the fmt.Errorf invocation, and the r.Name/path
argument used in that fmt.Errorf.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fedeb80b-50fa-4eac-8e1d-5a6e04d917e3

📥 Commits

Reviewing files that changed from the base of the PR and between 6fac000 and f2f6428.

📒 Files selected for processing (10)
  • lima/roles/stop_dbs/tasks/main.yaml
  • server/internal/database/instance_resource.go
  • server/internal/database/utils.go
  • server/internal/orchestrator/common/patroni_config_generator.go
  • server/internal/orchestrator/swarm/patroni_config.go
  • server/internal/orchestrator/systemd/client.go
  • server/internal/orchestrator/systemd/patroni_config.go
  • server/internal/orchestrator/systemd/unit.go
  • server/internal/patroni/config.go
  • server/internal/utils/utils.go

@jason-lynch force-pushed the fix/PLAT-548/move-patroni-reload-operation branch from f2f6428 to acbdeeb on April 22, 2026 at 14:11
Contributor

@tsivaprasad tsivaprasad left a comment


Except for a minor change, everything looks good.

Verification:

2026/04/24 00:21:07 initializing cluster
2026/04/24 00:21:21 cluster initialized
=== RUN   TestPortChange
=== PAUSE TestPortChange
=== CONT  TestPortChange
    port_change_test.go:23: [TestPortChange] creating database
    port_change_test.go:56: [TestPortChange] updating database to change ports
    port_change_test.go:85: [TestPortChange] validating that the database is usable
    port_change_test.go:108: [TestPortChange] waiting for replication
    database_test.go:357: [TestPortChange] waiting for replication to catch up on all nodes
    port_change_test.go:112: [TestPortChange] validating that replication is functioning
    fixture_test.go:200: cleaning up database 68cdaaeb-b419-4a30-8784-c58062e4bc35
--- PASS: TestPortChange (110.39s)
PASS
ok  	github.com/pgEdge/control-plane/e2e	124.320s

DONE 1 tests in 124.321s

2026/04/24 00:29:01 initializing cluster
2026/04/24 00:29:01 cluster initialized
=== RUN   TestUpdateAddNode12
=== PAUSE TestUpdateAddNode12
=== CONT  TestUpdateAddNode12
    db_update_add_node_test.go:34: Creating 1 node database with 1 primary per node and no replicas
    db_update_add_node_test.go:42: Verifying state of node n1
    db_update_add_node_test.go:43: Verifying state of primary
    db_update_add_node_test.go:43: Inserting test data on node n1 primary
    db_update_add_node_test.go:51: Adding new node with zero down time enabled
    db_update_add_node_test.go:53: nodes updated to total number 2
    db_update_add_node_test.go:56: Inserting test data on node n2 primary
    database_test.go:357: [TestUpdateAddNode12] waiting for replication to catch up on all nodes
    db_update_add_node_test.go:269: Verifying data created on peer node n2 is present on n1
    db_update_add_node_test.go:269: Verifying data created on peer node n1 is present on n2
    fixture_test.go:200: cleaning up database 849bf860-43da-42a3-9917-71ef1f7e81ee
--- PASS: TestUpdateAddNode12 (91.81s)
PASS
ok  	github.com/pgEdge/control-plane/e2e	92.335s

DONE 1 tests in 92.336s


func SleepContext(ctx context.Context, duration time.Duration) error {
	select {
	case <-time.After(duration):
Contributor


Looks like it will keep sleeping even if there is a cancellation.
Do you think we can change it like below?

func SleepContext(ctx context.Context, duration time.Duration) error {
	t := time.NewTimer(duration)
	defer t.Stop()
	select {
	case <-t.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
