
fix: move patroni reload operation #362

Open
jason-lynch wants to merge 1 commit into main from fix/PLAT-548/move-patroni-reload-operation

Conversation

@jason-lynch
Member

Summary

Patroni has an activity loop that they refer to as a 'run cycle'. Each run cycle is loop_wait seconds long. When a cycle takes less than loop_wait seconds, which is the typical case, Patroni will sleep until that time has elapsed. If the system is degraded and the cycle takes longer than loop_wait, it'll sleep for 1ms and log a warning.

When we signal Patroni to reload its configuration, it will perform the reload at the beginning of the next run cycle. This means that under non-degraded conditions, it can take up to loop_wait seconds to reload its configuration.

This commit fixes a bug where we were not waiting for the reload to complete. It worked most of the time by luck: we signal Patroni to reload in the UnitResource, so in many cases the timing worked out such that we reached InstanceResource.Create before the new config took effect, and the config would then take effect before we evaluated whether a restart was needed.

Rather than doing the wait in the InstanceResource, which would happen on every update, I've moved the reload operation to each orchestrator's PatroniConfig resource. I've also added a sleep operation here to ensure that loop_wait seconds have elapsed before we proceed. This way, we can assume that the new configuration has taken effect in the InstanceResource lifecycle methods.

This additional sleep does slow down some operations, especially in Swarm databases, because there the Patroni Config resource contains the entire instance spec, so there are plenty of conditions where there is a resource diff even though there is no Patroni config change. We'll improve this when we adopt the new common.PatroniConfigGenerator in the swarm.patroni_config resource.

Testing

This mainly affects the TestPortChange E2E on systemd:

# in terminal one
make dev-lima-run

# in terminal two
use-dev-lima
make test-e2e E2E_RUN=TestPortChange

PLAT-548

@codacy-production

codacy-production Bot commented Apr 22, 2026

Up to standards ✅

🟢 Issues: 0 new issues
🟢 Metrics: Duplication 3


@coderabbitai

coderabbitai Bot commented Apr 22, 2026

Warning

Rate limit exceeded

@jason-lynch has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 48 minutes and 32 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 48 minutes and 32 seconds.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fc41104f-63f3-41c5-98e3-2ad3754fea39

📥 Commits

Reviewing files that changed from the base of the PR and between f2f6428 and acbdeeb.

📒 Files selected for processing (10)
  • lima/roles/stop_dbs/tasks/main.yaml
  • server/internal/database/instance_resource.go
  • server/internal/database/utils.go
  • server/internal/orchestrator/common/patroni_config_generator.go
  • server/internal/orchestrator/swarm/patroni_config.go
  • server/internal/orchestrator/systemd/client.go
  • server/internal/orchestrator/systemd/patroni_config.go
  • server/internal/orchestrator/systemd/unit.go
  • server/internal/patroni/config.go
  • server/internal/utils/utils.go
📝 Walkthrough

Refactored Patroni service management to implement a reload-and-wait pattern across systemd and Docker Swarm orchestration backends, replacing direct restart operations. Added new utility functions, a Patroni loop wait constant, refactored database instance resource operations, and introduced a system log maintenance task.

Changes

  • Patroni Reload Infrastructure (server/internal/patroni/config.go, server/internal/orchestrator/systemd/client.go, server/internal/orchestrator/systemd/patroni_config.go, server/internal/orchestrator/swarm/patroni_config.go, server/internal/orchestrator/common/patroni_config_generator.go): Added the DefaultLoopWaitSeconds constant; replaced RestartUnit with ReloadUnit and ReloadOrRestartUnit methods in systemd; implemented signalReload helpers in both the systemd and swarm backends to trigger service reloads after config updates; changed hardcoded LoopWait values to use the new constant.
  • Database Instance Resource Refactoring (server/internal/database/instance_resource.go, server/internal/database/utils.go): Restructured the Create and Update flows to update connection info upfront, wait for the Patroni running state, and delegate remaining initialization; simplified initializeInstance to only fetch the primary instance ID; added documentation for WaitForPatroniRunning polling behavior.
  • Systemd Unit Management (server/internal/orchestrator/systemd/unit.go): Replaced the RestartUnit call with ReloadOrRestartUnit in the unit creation post-enable flow.
  • Utility Functions & Infrastructure (server/internal/utils/utils.go, lima/roles/stop_dbs/tasks/main.yaml): Added the context-aware SleepContext utility function; added an Ansible task to flush and rotate systemd journals during stop_dbs role execution.

Poem

🐰 Reload with grace, no more harsh restart,
Configuration flows from a peaceful heart,
Journal entries flushed, utilities aligned,
Instance resources gracefully redesigned! 🌟

🚥 Pre-merge checks: ✅ 4 passed, ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)
  • Title check ✅: the title clearly and concisely summarizes the main change: moving the Patroni reload operation to a different location in the codebase.
  • Description check ✅: the description includes Summary, Changes (implicitly detailed in the summary), a Testing section with a specific E2E test and commands, and an issue reference (PLAT-548), meeting most template requirements.
  • Linked Issues check ✅: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅: skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lima/roles/stop_dbs/tasks/main.yaml`:
- Around line 41-43: The task "Clear journalctl" currently runs
ansible.builtin.command with "journalctl --flush --rotate --vacuum-time=1s",
which will destructively purge the host journal; change this so it is not run by
default: remove or guard the task with an explicit opt-in variable (e.g., add
when: vacuum_system_journal | default(false) or require a tag like
"vacuum_journal") and/or replace the vacuum-time flag with a less destructive,
targeted operation (e.g., vacuum by --unit or --vacuum-files and/or a
configurable --vacuum-time variable). Update the task that invokes journalctl to
reference the new variable (vacuum_system_journal or vacuum_time) and ensure
changed_when remains correct so the action is only performed when explicitly
enabled.

In `@server/internal/orchestrator/systemd/unit.go`:
- Around line 98-100: Update the error message produced after calling
client.ReloadOrRestartUnit so it reflects the actual operation name: change the
fmt.Errorf call in the ReloadOrRestartUnit error path to use "failed to reload
or restart unit '%s': %w" and pass the correct identifier (use r.Name rather
than path if r.Name was supplied to ReloadOrRestartUnit); the symbols to edit
are the ReloadOrRestartUnit call, the fmt.Errorf invocation, and the r.Name/path
argument used in that fmt.Errorf.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fedeb80b-50fa-4eac-8e1d-5a6e04d917e3

📥 Commits

Reviewing files that changed from the base of the PR and between 6fac000 and f2f6428.

📒 Files selected for processing (10)
  • lima/roles/stop_dbs/tasks/main.yaml
  • server/internal/database/instance_resource.go
  • server/internal/database/utils.go
  • server/internal/orchestrator/common/patroni_config_generator.go
  • server/internal/orchestrator/swarm/patroni_config.go
  • server/internal/orchestrator/systemd/client.go
  • server/internal/orchestrator/systemd/patroni_config.go
  • server/internal/orchestrator/systemd/unit.go
  • server/internal/patroni/config.go
  • server/internal/utils/utils.go

@jason-lynch force-pushed the fix/PLAT-548/move-patroni-reload-operation branch from f2f6428 to acbdeeb on April 22, 2026 at 14:11
Contributor

@tsivaprasad tsivaprasad left a comment


Except for a minor change, everything looks good.

Verification:

2026/04/24 00:21:07 initializing cluster
2026/04/24 00:21:21 cluster initialized
=== RUN   TestPortChange
=== PAUSE TestPortChange
=== CONT  TestPortChange
    port_change_test.go:23: [TestPortChange] creating database
    port_change_test.go:56: [TestPortChange] updating database to change ports
    port_change_test.go:85: [TestPortChange] validating that the database is usable
    port_change_test.go:108: [TestPortChange] waiting for replication
    database_test.go:357: [TestPortChange] waiting for replication to catch up on all nodes
    port_change_test.go:112: [TestPortChange] validating that replication is functioning
    fixture_test.go:200: cleaning up database 68cdaaeb-b419-4a30-8784-c58062e4bc35
--- PASS: TestPortChange (110.39s)
PASS
ok  	github.com/pgEdge/control-plane/e2e	124.320s

DONE 1 tests in 124.321s

2026/04/24 00:29:01 initializing cluster
2026/04/24 00:29:01 cluster initialized
=== RUN   TestUpdateAddNode12
=== PAUSE TestUpdateAddNode12
=== CONT  TestUpdateAddNode12
    db_update_add_node_test.go:34: Creating 1 node database with 1 primary per node and no replicas
    db_update_add_node_test.go:42: Verifying state of node n1
    db_update_add_node_test.go:43: Verifying state of primary
    db_update_add_node_test.go:43: Inserting test data on node n1 primary
    db_update_add_node_test.go:51: Adding new node with zero down time enabled
    db_update_add_node_test.go:53: nodes updated to total number 2
    db_update_add_node_test.go:56: Inserting test data on node n2 primary
    database_test.go:357: [TestUpdateAddNode12] waiting for replication to catch up on all nodes
    db_update_add_node_test.go:269: Verifying data created on peer node n2 is present on n1
    db_update_add_node_test.go:269: Verifying data created on peer node n1 is present on n2
    fixture_test.go:200: cleaning up database 849bf860-43da-42a3-9917-71ef1f7e81ee
--- PASS: TestUpdateAddNode12 (91.81s)
PASS
ok  	github.com/pgEdge/control-plane/e2e	92.335s

DONE 1 tests in 92.336s


func SleepContext(ctx context.Context, duration time.Duration) error {
	select {
	case <-time.After(duration):
Contributor


Looks like it will keep sleeping even if there is a cancellation.
Do you think we can change it like below?

func SleepContext(ctx context.Context, duration time.Duration) error {
	t := time.NewTimer(duration)
	defer t.Stop()
	select {
	case <-t.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
