
Cluster sync adj in p&a flavour #1814

Merged

M4KIF merged 2 commits into feature/database-controllers from jkoterba/feature/cluster-sync-adj on Apr 21, 2026

Conversation

@M4KIF

@M4KIF M4KIF commented Apr 2, 2026

Description

Rewrote the sync logic

  • moved each component's checks into separate objects, all implementing the Condition interface, which we iterate over to check health.

Key Changes

  • extracted the health check behind an interface
  • implemented it for the provisioner, pooler, configmap, and secret
  • added a short unit test
  • added constants and a state check via bitmask

Testing and Verification

The local integration suite (make test) and the unit tests (go test in pkg/postgres/cluster/core) pass.

Related Issues

Jira tickets, GitHub issues, Support tickets...

PR Checklist

  • [not sure] Code changes adhere to the project's coding standards.
  • Relevant unit and integration tests are included.
  • Documentation has been updated accordingly.
  • All tests pass locally.
  • The PR description follows the project's guidelines.

@github-actions
Contributor

github-actions Bot commented Apr 2, 2026

CLA Assistant Lite bot:
Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contribution License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment with the exact sentence copied from below.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request

@M4KIF M4KIF changed the base branch from main to feature/database-controllers April 2, 2026 14:35
Comment thread pkg/postgresql/cluster/business/core/cluster.go Outdated
Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
// return ctrl.Result{}, nil
}

type clusterReadynessCheck interface {
Collaborator

sounds like a great thing to add to the specific port capabilities :-)

Author

Then the interface here would be implemented by the ports. A clean and nice idea.

pgcConstants "github.com/splunk/splunk-operator/pkg/postgresql/cluster/business/core/types/constants"
)

type Provisioner interface {
Collaborator

I think our secondary ports should reflect that we create a cluster and a database, and we should map our interfaces around that.

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
*/

// basically a sync logic
state := pgcConstants.EmptyState
Collaborator

I think the idea here was to decouple our status check from the cnpg status. At the same time we also check health after every stage and move forward only if we are OK; if it is still in progress we requeue or raise an error. Here we have code we can use to check where we are with status iteratively, but I don't see yet how it solves our core problem.

Author

We build our state here. After checking all ports for readiness/not dying, that is the moment for us to decide what happened and how it happened.
We don't set our state as state == cnpgVariable mapped to ours; we decide what we want to do with the fact that the configmaps, secrets, provisioner, etc. are ready.
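The iterate-and-aggregate shape described above might be sketched like this. A minimal sketch with illustrative names and bit values; the PR's actual interface is clusterReadynessCheck and its constants live in pgcConstants:

```go
package main

import "fmt"

// State is an aggregate bit set of component readiness.
type State uint32

const (
	EmptyState       State = 0
	ProvisionerReady State = 1 << iota // each flag gets its own bit
	PoolerReady
)

// Condition is the per-component health contract every check implements.
type Condition interface {
	Condition() (State, error)
}

type provisionerCheck struct{}

func (provisionerCheck) Condition() (State, error) { return ProvisionerReady, nil }

type poolerCheck struct{}

func (poolerCheck) Condition() (State, error) { return PoolerReady, nil }

func main() {
	state := EmptyState
	// Iterate every check and OR its contribution into the aggregate,
	// then decide on the combined picture rather than any single status.
	for _, c := range []Condition{provisionerCheck{}, poolerCheck{}} {
		s, err := c.Condition()
		if err != nil {
			fmt.Println("health check failed:", err)
			return
		}
		state |= s
	}
	fmt.Printf("aggregate state bits: %b\n", state)
}
```

The core then branches on the aggregate (e.g. all bits set means the cluster is ready) instead of mapping any single cnpg variable directly.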

cnpgv1 "github.com/cloudnative-pg/cloudnative-pg/api/v1"
enterprisev4 "github.com/splunk/splunk-operator/api/v4"
clustercore "github.com/splunk/splunk-operator/pkg/postgresql/cluster/core"
clustercore "github.com/splunk/splunk-operator/pkg/postgresql/cluster/business/core"
Collaborator

TBH the business string is redundant here. Core itself is already a domain.

Author

I agree it's redundant; here it's a tradeoff for verbosity and segregation of components, and a service pattern at once, i.e. service/ is the primary port (the reconciler we provide) implementation, core/ is the core, and ports/ are the contracts the core needs to work. They can grow large, hence the whole separate dir for ports.

cnpgCluster *cnpgv1.Cluster
}

func (c *provisionerHealthCheck) Condition() (pgcConstants.State, error) {
Collaborator

I think all of these condition checks should be part of our ports. Also, how do you want to map condition to phase?

Author

I agree, they were placed here as what I need; solving it like you say is what I'm hoping for: for the provisioner/cluster etc. ports to include an interface for checking their state.

Then the adapter would essentially map the dependency state to our abstraction of its state, i.e. we have cluster ready, provisioning, failover, cupcake, coffee, etc.
Mapping condition to phase would be the job of the facade, i.e. cluster.go. That would be the whole operational brain behind it: lots of individual pieces funneled into our business decisions.

Author

Something like a stateMapper, or another object that specialises in deciding what phase/condition we're in, could also be born. The bitmask could be used for covering phase 1, i.e. state = FinaliserNotAdded && !ClusterProvisioning etc.
Phase 2 -> state == Finaliser && ClusterReady etc.

}

func (p *poolerHealthCheck) Condition() (pgcConstants.State, error) {
return pgcConstants.PoolerReady, nil
Collaborator

How do we want to check the actual condition the component has in its status?

Author

WDYM? It's kind of the job of the adapter to verify and guarantee that the state is current. I.e. if we place this as a method of a port and implement it via adapters, we won't actually work on the real state of the component in our core, only on our understanding of it.

Author

Currently it would just be a copy-paste of what we do inside cluster.go, i.e. the resource obj. of the Pooler, k8s.Get(obj, ...) and similar, as there is no abstraction currently.

)

const (
ComponentsReady = PoolerReady | ProvisionerReady | SecretReady | ConfigMapReady
Collaborator

@mploski mploski Apr 7, 2026

What are we trying to achieve here? Is it a bitmask? Since you use iota, we end up having just sequential integers.
It would probably be simpler to just use a struct keeping the state like this:

 type ClusterState struct {
     Provisioner ComponentPhase
     Pooler      ComponentPhase
     ConfigMap   ComponentPhase
     Secret      ComponentPhase
 }

Author

It's a bitmask, and it does the same job as keeping a struct with an additional field.
It kind of solves the case of having to create new types for each component; you just add additional states to the state machine, i.e. values in the "enum". The iota usage is an enum in Go: https://yourbasic.org/golang/iota/
It's the first usage in this file, hence it's basically an enum from 0.
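One caveat worth noting: a bare iota yields consecutive integers, which do not behave as independent bits; a bitmask enum needs power-of-two values via 1 << iota. A minimal sketch with illustrative names, not the PR's actual constants:

```go
package main

import "fmt"

type State uint32

const (
	EmptyState State = 0
	// 1 << iota gives each flag its own bit (2, 4, 8, 16);
	// a bare iota would produce 1, 2, 3, 4 and the AND checks would misfire.
	PoolerReady State = 1 << iota
	ProvisionerReady
	SecretReady
	ConfigMapReady
)

const ComponentsReady = PoolerReady | ProvisionerReady | SecretReady | ConfigMapReady

// AllReady reports whether every required bit is set in s.
func AllReady(s State) bool { return s&ComponentsReady == ComponentsReady }

func main() {
	state := EmptyState
	state |= PoolerReady | ProvisionerReady | SecretReady | ConfigMapReady
	fmt.Println(AllReady(state)) // every bit set

	state &^= SecretReady // clearing (or never setting) any bit fails the check
	fmt.Println(AllReady(state))
}
```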

Collaborator

@mploski mploski Apr 9, 2026

And It kinda solves the case of having to create new types just for each component.

We already create types for a new component for many reasons, so what problem does it really solve? I agree it is a smart way of doing this, but nevertheless, if a new component arises you need to add it to the consts and extend the types. I feel like we trade Go readability for a really small C-like optimisation, especially with this model of bitwise comparison later:

state |= componentHealth.State
if state&pgcConstants.ComponentsReady == pgcConstants.ComponentsReady

Also, if we build state incrementally, it means that the LAST successful state in the state machine is the final success. With this in mind we don't need to check every other component state afterwards.

Can we do this more simply, so that when I wake up at 5am I can easily understand the code?

Author

I agree that this state check after the iterative health check passes is redundant here.
Taking into consideration potential future work, which could include splitting files,
I could expand the *healthCheck types to return a *(component)StateDto instead of relying on generic state bits.

Author

Because later, in the very near future, it seems the project could follow in these footsteps of having some separation into phases and their crucial elements, like we discussed at the p&a ideas brainstorm.

rc.emitPoolerReadyTransition(postgresCluster, poolerOldConditions)
}

if state&pgcConstants.ComponentsReady != 0 {
Collaborator

@mploski mploski Apr 7, 2026

Not sure this logic isn't broken. What happens if we set ProvisionerReady but later in the stage we set failed for something? Due to the way we set state and components, this would pass. I think it is because iota increments by 1, not by powers of 2. I think we should not rely on bitmasking here, but on a simple struct with a state for every stage; if all are good, we are good.

Author

With unsetting the bits at any point, this condition starts failing. Likewise, if the bits are never set, the values don't AND, so if we mark ProvisionerReady and then set the mask for ConfigMapFailed, it won't fire.
And to prove this logic works, there would be tests making sure misfires aren't possible.

@M4KIF M4KIF force-pushed the jkoterba/feature/cluster-sync-adj branch from b0b2f12 to ad7aaec on April 8, 2026 08:02
@M4KIF M4KIF marked this pull request as ready for review April 8, 2026 08:03
Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
oldPhase = *postgresCluster.Status.Phase
// Aggregate component readiness from iterative health checks.
state := pgcConstants.EmptyState
conditions := []clusterReadynessCheck{
Collaborator

So after every phase that is not immediate, like cluster creation, we should also incorporate a readiness state check. I think we discussed that we don't really need to check at the end, assuming we check intermediary statuses per phase?

Author

I agree with that, but isn't the scope then == refactor the reconciler?
I've tried to stick to changing the sync logic and doing the groundwork for more changes in coming tickets and the potential p&a rework.

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
for _, check := range conditions {
componentHealth, err := check.Condition(ctx)
if err != nil {
if statusErr := updateStatus(componentHealth.Condition, metav1.ConditionFalse, componentHealth.Reason, componentHealth.Message, componentHealth.Phase); statusErr != nil {
Collaborator

If we run this at the end of reconciliation, it seems some of the code is dead; i.e. if we are here, we cannot have an orphaned configmap or secret. If we do, it should be discovered during this phase and requeue/error.

@M4KIF M4KIF requested review from limak9182 and mploski April 9, 2026 06:13
Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
}
return ctrl.Result{}, statusErr
}
logger.Error(err, "Component health check reported issues",
Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
return componentHealth.Result, err
}

if isPendingState(componentHealth.State) {
Collaborator

If we run this code on every phase separately, here we should requeue

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
if postgresCluster.Status.Phase != nil {
newPhase = *postgresCluster.Status.Phase

if state&pgcConstants.ComponentsReady == pgcConstants.ComponentsReady {
Collaborator

This could potentially be a method on State so it reads naturally at 4am, i.e.:

 func (s State) HasAll(required State) bool {
     return s&required == required
 }

 if state.HasAll(pgcConstants.ComponentsReady) {}

Author

After changing the state idea, it currently isn't used that much.

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
return &provisionerHealthCheck{cluster: cluster, cnpgCluster: cnpgCluster}
}

func (c *provisionerHealthCheck) Condition(_ context.Context) (StateInformationDto, error) {
Collaborator

I like providing an interface like that! One doubt I have is about the purity and responsibility of those condition methods: they check conditions but also fetch k8s objects. IMO k8s objects should be fetched when the reconciler kicks off and then propagated.

Author

I've proposed another approach, which is quite similar to the current one TBH.
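The purity idea under discussion could look like this: the reconciler fetches k8s objects once, and the condition methods become pure functions over that snapshot. A sketch only; ClusterSnapshot and the phases are illustrative names, not the PR's types:

```go
package main

import "fmt"

// ClusterSnapshot is an illustrative, pre-fetched view of the world;
// in the real controller the reconciler would populate it from k8s.
type ClusterSnapshot struct {
	CNPGHealthy  bool
	PoolerActive bool
}

type Phase string

const (
	Ready        Phase = "Ready"
	Provisioning Phase = "Provisioning"
)

// provisionerCondition is pure: it only inspects the snapshot, performs
// no I/O, and is therefore trivial to unit-test.
func provisionerCondition(s ClusterSnapshot) Phase {
	if !s.CNPGHealthy {
		return Provisioning
	}
	return Ready
}

func main() {
	fmt.Println(provisionerCondition(ClusterSnapshot{CNPGHealthy: true}))
}
```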

@mploski
Collaborator

mploski commented Apr 9, 2026

I like the direction we are heading in!
A few concerns to discuss:

  1. Should we aggregate the status check at the end, or should we check per phase, so that once we pass everything in reconcile we know we are good? Considering we build state incrementally and sequentially, this makes sense IMO.
  2. Function purity.
  3. Bitmask usage in that form: if 1 holds, we might not need it. If we do need it, maybe we can wrap it in helper functions/refactor the approach so we understand the code better in the morning when we have a P0.
  4. I believe we should start by drawing the current state machine and how we set a phase and conditions. Did you do that?

@M4KIF
Author

M4KIF commented Apr 10, 2026

I like the direction we are heading in! A few concerns to discuss:

  1. Should we aggregate the status check at the end, or should we check per phase, so that once we pass everything in reconcile we know we are good? Considering we build state incrementally and sequentially, this makes sense IMO.
  2. Function purity.
  3. Bitmask usage in that form: if 1 holds, we might not need it. If we do need it, maybe we can wrap it in helper functions/refactor the approach so we understand the code better in the morning when we have a P0.
  4. I believe we should start by drawing the current state machine and how we set a phase and conditions. Did you do that?

  1. I see and understand that; the per-phase check seems reasonable. Double checking is "majtki na majtki" (wearing two pairs of underpants, i.e. belt and braces).
  2. Yup, a health check that does business logic.
  3. Simplified that a bit.
  4. I didn't draw the current state machine.

Collaborator

@vivekr-splunk vivekr-splunk left a comment

I found one blocking regression in the status/event flow.

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
logger.Error(err, "Failed to sync status")

logger.Info("Reconciliation complete")
if err := updateStatus(clusterReady, metav1.ConditionTrue, reasonCNPGClusterHealthy, msgAllComponentsReady, readyClusterPhase); err != nil {
Collaborator

This refactor appears to drop cluster phase transition events entirely. Before the change, the controller captured the old/new phase around the final status sync and called rc.emitClusterPhaseTransition(...); that helper still exists in events.go, but I no longer see it invoked on the happy path or on degraded transitions. From the product point of view, that means users lose the ClusterReady / ClusterDegraded event stream even though status still changes, which looks like a behavior regression rather than just an internal refactor.

Author

I agree, I missed this when attempting the rewrite. Thank you for your input. I split the logic, focused only on the buzzwords/keywords, and totally went dark on the observability/event part.

@M4KIF M4KIF requested review from mploski and vivekr-splunk April 13, 2026 11:28
PostgresClusterFinalizerName string = "postgresclusters.enterprise.splunk.com/finalizer"

// cluster phases
readyClusterPhase reconcileClusterPhases = "Ready"
Collaborator

@mploski mploski Apr 14, 2026

I see that the condition types SecretReady, ManagedRolesReady, and ConfigMapReady have not been added.

At the moment, we only define clusterReady and poolerReady. The functions secretModel.Converge, configMapModel.Converge, and the provisioner all update clusterReady, which means the status does not reflect the more granular state described in the ticket.

According to the ticket, the expected status snapshot should be:

SecretReady = True (SuperUserSecretReady)
ManagedRolesReady = True (ManagedRolesReconciled)
ConfigMapReady = True (ConfigMapReconciled)

The reason values also need adjustment. Currently, secretModel.Converge uses:

reasonClusterBuildSucceeded for the success case, and
reasonUserSecretFailed for the "ref not yet written" pending case.

Both are inherited from the old clusterReady vocabulary. However, the ticket specifies that the reasons should be SuperUserSecretReady and SuperUserSecretFailed, so these should be updated accordingly.

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
if statusErr := updateStatus(poolerReady, metav1.ConditionFalse, reasonPoolerCreating,
"Connection poolers are being provisioned", provisioningClusterPhase); statusErr != nil {
logger.Error(statusErr, "Failed to update status")
if rolesErr := reconcileManagedRoles(ctx, p.client, p.cluster, p.cnpgCluster); rolesErr != nil {
Collaborator

ManagedRoles (Step 3) is currently handled inside provisionerModel.Actuate, but it would be clearer if it were implemented as a separate pipeline step.

Right now:

There is no ManagedRolesReady condition.
There is no distinct Step 3 in the pipeline.
Observers cannot distinguish between a failure in provisioning CNPG and a failure during role reconciliation.

Separating this into its own step would make the pipeline state clearer and improve observability of failures and it is also one of the acceptance criteria :-)

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
"requeueAfter", result.RequeueAfter)
return result, nil
}
return result, nil
Collaborator

@mploski mploski Apr 14, 2026

Thing to fix once we split phases per step.

manageComponentHealth never writes Condition=True on the success path.

The function has three branches:

error → writes False
pending state → writes False
ready → writes nothing

As a result, the pipeline only ever stamps conditions as False. The only place where Condition=True is written during reconciliation is the final call.

This happens only after all steps succeed, and it updates clusterReady only.

This breaks the audit-trail invariant described in the ticket. After a fully healthy reconcile, conditions like SecretReady, ManagedRolesReady, and ConfigMapReady never appear in etcd.

If CNPG enters a switchover during the next reconcile, the provisioner returns early with ClusterReady=False, and those provisioning conditions are still absent, not True. As a result, an operator running kubectl describe cannot distinguish between:

“the secret was provisioned successfully and is still healthy”, and
“the secret provisioning step has never run.”

p.health.Message = fmt.Sprintf("Failed to check RW pooler existence: %v", err)
p.health.Phase = failedClusterPhase
p.health.Result = ctrl.Result{}
return p.health, err
Collaborator

Previously we were setting:

  cluster.Status.ConnectionPoolerStatus = &ConnectionPoolerStatus{Enabled: true}

when everything was fine; currently we miss it. We use it in database.go to check whether we should use the connection pooler. Without it we will always use the direct cnpg endpoint.

Collaborator

We should guard this logic with unit/integration tests against future regressions in the contract.

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated

// Build desired CNPG Cluster spec.
desiredSpec := buildCNPGClusterSpec(mergedConfig, postgresSecretName)
provisionerComponent := components[1].(*provisionerModel)
Collaborator

This seems fragile if we change the order.

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated

func (c *configMapModel) Name() string { return pgcConstants.ComponentConfigMap }

func (c *configMapModel) EvaluatePrerequisites(_ context.Context) (prerequisiteDecision, error) {
Collaborator

I think this gate can never actually fire: if CNPG is not healthy, provisionerModel.Converge returns a pending/failed state → phase() returns a non-zero result → the first loop returns early.

Also, this should set its own condition, not clusterReady.

Author

Yes, that's the case under usual conditions. It's a double-caution safety, so it can be deleted if the system is trusted enough. My bet would be to leave it for now.
Condition altered.

Collaborator

OK, can you share the case in which it fires? My understanding is that this is dead code; what am I missing? If this is dead code, we simply don't need it.

Author

It's mostly dead code and shouldn't fire. So yeah, deleted.

* PC-09 ignores no-op updates
*/

func containsEvent(events []string, recorder *record.FakeRecorder, eventType string, event string) bool {
Collaborator

Not sure if I read it right, but the helper reads exactly one event from the channel per call and returns on the first read regardless of whether it matches. If the target event is the second event in the channel, the Eventually loop never finds it.

Author

The Eventually loop calls it repeatedly until it finds the event or times out.

Collaborator

@mploski mploski Apr 15, 2026

Why not do this in one run, i.e. pick all events and iterate over them? In the current situation received can also be inaccurate, i.e. recorder.Events != received if you return true early. Plus, if we take one event, why the for loop? :-)

Collaborator

Ah, I just realised that received will not be accurate, as we copy inside the function by appending. We should pass a pointer if you want to update this value.

Author

I agree about the slice pointer. I hadn't understood before, but yeah, the fake recorder is just a channel to which we write strings.

Author

Eventually + a for loop with a single exit seems like overkill, as it's an await for a genuinely async scenario.
But we write to the channel without async, so a for loop with contains is enough.
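A drain-then-scan helper along the lines discussed might look like this. record.FakeRecorder (from k8s.io/client-go/tools/record) exposes its events as a buffered chan string; the sketch below works on such a channel directly, so it is an illustration of the pattern, not the PR's actual helper:

```go
package main

import (
	"fmt"
	"strings"
)

// drainEvents performs non-blocking receives until the channel is empty,
// collecting every event buffered so far in one pass.
func drainEvents(events chan string) []string {
	var received []string
	for {
		select {
		case e := <-events:
			received = append(received, e)
		default:
			return received
		}
	}
}

// containsEvent scans the drained slice, so a match is found even when
// the target event was not the first one written to the recorder.
func containsEvent(events []string, substr string) bool {
	for _, e := range events {
		if strings.Contains(e, substr) {
			return true
		}
	}
	return false
}

func main() {
	ch := make(chan string, 8)
	ch <- "Normal PoolerReady poolers are ready"
	ch <- "Normal ClusterReady all components ready"
	events := drainEvents(ch)
	fmt.Println(containsEvent(events, "ClusterReady")) // second event is still found
}
```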

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
newConfigMapModel(c, rc.Scheme, rc, updateComponentHealthStatus, runtimeView, postgresCluster, postgresSecretName),
}

for _, component := range runtimeComponents {
Collaborator

Why are we not iterating over the full list of components at line 180?

Author

Because the managedRoles, Pooler, and ConfigMap models need some information from previous stages, and injecting it needs to happen in between.
The later phases should utilise ports in later iterations of the source to get the information they need to function.

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated

components := []componentModel{
secretComponent,
provisionerComponent,
Collaborator

I'd rather name it clusterComponent, not provisionerComponent, for better code readability. Provisioner is the notion of an external postgres provider that can do many things, including cluster, roles, database, pooler, etc.

Author

yeah, a good idea tbh

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
Result ctrl.Result
}

type componentModel interface {
Collaborator

@mploski mploski Apr 15, 2026

I wonder if we can't merge these interfaces into one, i.e.:

 type component interface {
     Actuate(ctx context.Context) error
     Converge(ctx context.Context) (componentHealth, error)
     EvaluatePrerequisites(ctx context.Context) (prerequisiteDecision, error)
     Namer() string
 }

They seem logically bound, and I don't see a use case for using these interfaces in separation, i.e. we cannot have a component that doesn't have a function to actuate, converge, etc. We can always split it later if it grows too much or we have a different use case.

Author

I agree, I tend to go overboard with verbosity. A component containing the above methods is bound, logical, and consistent.

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
}

if len(status.Pending) > 0 {
m.health.State = pgcConstants.Failed
Collaborator

This should be the Provisioning state, and it is also not an error.

Author

ack

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
}

s.health.State = pgcConstants.Ready
s.health.Reason = reasonClusterBuildSucceeded
Collaborator

this should probably be reasonSuperUserSecretReady, not reasonClusterBuildSucceeded

Author

yeah, as we saw in the kubectl describe. Thanks! adjusted

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
if postgresCluster.Status.Phase != nil {
oldPhase = *postgresCluster.Status.Phase
c.health.State = pgcConstants.Ready
c.health.Reason = reasonClusterBuildSucceeded
Collaborator

this should probably be the reasonConfigMapReady reason

Author

ack

}

cluster.Status.ConnectionPoolerStatus = &enterprisev4.ConnectionPoolerStatus{Enabled: true}
rwDesired, rwScheduled := poolerInstanceCount(rwPooler)
Collaborator

I see you removed the poolerInstanceCount usage, but did we delete the function?

Author

It was left hanging in the unit tests. Deleted it.

@mploski
Collaborator

mploski commented Apr 15, 2026

@M4KIF I did a review and additionally passed the PR through an LLM. A few points came up:

A. managedRolesModel.EvaluatePrerequisites uses State=Failed for a blocked gate

func (m *managedRolesModel) EvaluatePrerequisites(_ context.Context) (prerequisiteDecision, error) {
if m.runtime == nil || !m.runtime.IsHealthy() {
return prerequisiteDecision{
Allowed: false,
Health: componentHealth{
State: pgcConstants.Failed, // ← wrong
Condition: managedRolesReady,
Reason: reasonManagedRolesFailed,
Phase: failedClusterPhase,
Result: ctrl.Result{RequeueAfter: retryDelay},

The gate blocks because CNPG isn't healthy yet — that's a transient wait, not a failure. But because the phase() driver calls component.Converge(ctx) when a
gate blocks, and Converge also checks !m.runtime.IsHealthy() and returns State=Failed with err != nil, this path returns a non-nil error from phase() and
propagates up as a hard failure. The controller will log an error and return from the reconcile with an error, which triggers exponential backoff rather than
the RequeueAfter: retryDelay specified in the health. The state should be Provisioning with a requeue, matching the pattern every other blocked gate uses.

B. managedRolesModel.Converge duplicates the prerequisite check and returns an error for a wait state

func (m *managedRolesModel) Converge(ctx context.Context) (health componentHealth, err error) {
...
if m.runtime == nil || !m.runtime.IsHealthy() {
m.health.State = pgcConstants.Failed
...
return m.health, fmt.Errorf("managed roles blocked until CNPG cluster is healthy")
}

This is the same condition as the prerequisite gate — structurally unreachable when called normally (gate would have blocked), but reached in the blocked-gate
path where phase() calls Converge directly when !gate.Allowed. The err != nil causes phase() to return an error rather than health.Result. This results in
exponential backoff for a completely normal "wait for CNPG to be healthy" state.

Compare how the provisioner handles this: its blocked gate path sets State=Pending, err=nil so the driver returns the requeue result cleanly.

C. configMapModel.EvaluatePrerequisites is structurally unreachable but still has a blocking problem

The gate is now dead code (structurally can't fire since the provisioner gates the first loop). However it uses configMapsReady (correct condition type). The dead code concern remains — if the pipeline is reorganised, the gate will incorrectly return early with State=Provisioning and then the blocked gate path in phase() calls Converge, which also checks !IsHealthy() and returns State=Provisioning, err=nil, safely causing a requeue.

@M4KIF
Author

M4KIF commented Apr 15, 2026

CLA signed — now checking Code of Conduct status...

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
p.events.emitWarning(p.cluster, EventPoolerReconcileFailed, fmt.Sprintf("Failed to reconcile connection pooler: %v", err))
return err
}
p.events.emitNormal(p.cluster, EventPoolerCreationStarted, "Connection poolers created, waiting for readiness")

correct me if I'm wrong, but is this event emitted on every reconcile? Normal events should be emitted only once, in this case only when the pooler is actually created.

Author

yes, in this iteration it hasn't been guarded from firing every time.
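One possible guard for this, sketched with hypothetical names (`onceEmitter` and `emitNormalOnce` are illustrative, not the project's actual event helpers): remember which (object, reason) pairs were already announced, so a Normal event fires only the first time a reconcile reaches that state.

```go
package main

import "fmt"

// onceEmitter is a hypothetical wrapper around event emission: it records
// which (object, reason) pairs were already announced, so a Normal event
// fires only on the first reconcile that reaches the state, not every pass.
type onceEmitter struct {
	seen map[string]bool
}

func newOnceEmitter() *onceEmitter {
	return &onceEmitter{seen: make(map[string]bool)}
}

// emitNormalOnce reports whether the event was actually emitted.
func (e *onceEmitter) emitNormalOnce(objKey, reason, msg string) bool {
	k := objKey + "/" + reason
	if e.seen[k] {
		return false // already announced on a previous reconcile
	}
	e.seen[k] = true
	fmt.Printf("Normal %s: %s\n", reason, msg)
	return true
}

func main() {
	e := newOnceEmitter()
	// Simulate three reconcile loops hitting the same code path;
	// only the first one emits.
	for i := 0; i < 3; i++ {
		e.emitNormalOnce("default/pg1", "PoolerCreationStarted",
			"Connection poolers created, waiting for readiness")
	}
}
```

In a real controller the in-memory map would reset on restart, so a production version would more likely key off a persisted condition transition instead.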

c.events.emitWarning(c.cluster, EventConfigMapReconcileFailed, fmt.Sprintf("Failed to reconcile ConfigMap: %v", err))
return err
}
if op == controllerutil.OperationResultCreated {

we can use a tagged switch on op, like this:

switch op {
case controllerutil.OperationResultCreated:
	c.events.emitNormal(c.cluster, EventConfigMapReady, fmt.Sprintf("ConfigMap %s created", desiredCM.Name))
case controllerutil.OperationResultUpdated:
	c.events.emitNormal(c.cluster, EventConfigMapReady, fmt.Sprintf("ConfigMap %s updated", desiredCM.Name))
}

Author

this block got changed after this comment, but your idea is good. Thanks.

configKeyClusterROEndpoint,
configKeyDefaultClusterPort,
configKeySuperUserSecretRef,
}

We have configKeyClusterREndpoint in types.go, which looks unused.

configKeyClusterROEndpoint string = "CLUSTER_RO_ENDPOINT"
configKeyClusterREndpoint string = "CLUSTER_R_ENDPOINT"
configKeyDefaultClusterPort string = "DEFAULT_CLUSTER_PORT"
configKeySuperUserName string = "SUPER_USER_NAME"

Looks like this is not used anymore with your changes?

Author

again, a big thing. Reverted to using them, no magic values left. Thank you!

Comment on lines +65 to +66
configKeyPoolerRWEndpoint string = "CLUSTER_POOLER_RW_ENDPOINT"
configKeyPoolerROEndpoint string = "CLUSTER_POOLER_RO_ENDPOINT"

Isn't it unused now?

Author

great catch, it's true. And it should be used, as I was using magic values instead of those two constants. Placed them back into the logic.

Comment thread pkg/postgresql/cluster/core/types.go Outdated
Comment on lines 100 to 103
reasonClusterBuildFailed conditionReasons = "ClusterBuildFailed"
reasonClusterBuildSucceeded conditionReasons = "ClusterBuildSucceeded"
reasonClusterGetFailed conditionReasons = "ClusterGetFailed"
reasonClusterPatchFailed conditionReasons = "ClusterPatchFailed"

These look unused, same for reasonCNPGClusterNotHealthy conditionReasons = "CNPGClusterNotHealthy"

Comment thread pkg/postgresql/cluster/core/types.go Outdated
msgFmtCNPGClusterPhase statusMessage = "CNPG cluster phase: %s"

// status messages — aggregate and component readiness checks
msgAllComponentsReady statusMessage = "All components are ready"

This one is no longer used, I guess?

Comment on lines +676 to +686
defer func() {
statusErr := writeComponentStatus(m.updateStatus, m.health)
if statusErr != nil {
if err != nil {
err = errors.Join(err, statusErr)
} else {
err = statusErr
}
}
health = m.health
}()

Are we using defer() instead of explicit error handling?

Author

wdym error handling?
The defer here is to supply the status write at every possible branch exit of the Converge function. It's more of a decorator pattern, i.e. the method does what it wants, and we want to ensure the status write happens no matter what.

syncManagedRolesStatusFromCNPG(m.cluster, m.runtime.Cluster())
status := m.cluster.Status.ManagedRolesStatus
if status == nil {
m.health.State = pgcConstants.Failed

Failed + RequeueAfter is the contradiction here, I guess. syncManagedRolesStatusFromCNPG just ran above this and always sets a non-nil ManagedRolesStatus — so this nil branch can only be hit if syncManagedRolesStatusFromCNPG itself received a nil cnpgCluster (guarded by the early return inside it). In practice it's a transient "CNPG hasn't populated its status yet" case, which should be Provisioning, not Failed, WDYT?

Author

So in theory it's a dead case, as it would have to have cnpgCluster == nil, and just above it there is such a guard condition (see the attached screenshot).
But, as a double precaution, I will change the Failed to Provisioning, as it indeed seems more transient.

Author

You know what, I would leave it at Failed, as we seem guarded well enough against CNPG not being healthy, etc.
Having a generic failure on the nil check seems right to me.
If we ever execute this code (status == nil) because CNPG isn't healthy, that's a side effect and a signal that something was missed by us.

Collaborator

I agree with Dmytro here. This branch is dead, because status cannot be nil: postgresCluster cannot be nil, as that is why we are called in the first place :-), and cnpgCluster cannot be nil since we passed the clusterModel.

Even if we constructed a hypothetical where nil was reached, the branch sets m.health and returns an error — but the deferred writeComponentStatus would then call c.Status().Update(ctx, cluster) on a nil cluster, which would panic. It's not defensive code IMO, but it is a bit misleading, implying a real production scenario.

@M4KIF M4KIF force-pushed the jkoterba/feature/cluster-sync-adj branch from abb92db to 59699db Compare April 17, 2026 15:23
logging changed, pureness fix attempt

removed redundant sync at the end

incremental state building with limited redundancy

logging align

cluster unit adj

merge adjustments

event emission placed back, fortified with tests

align with requirements on state building

review and rebase changes

merge alignment

review changes
@DmytroPI-dev

@M4KIF, reasonCNPGClusterNotHealthy conditionReasons = "CNPGClusterNotHealthy" (line 124 in types.go) can be removed

@M4KIF M4KIF force-pushed the jkoterba/feature/cluster-sync-adj branch from 59699db to 675ce20 Compare April 17, 2026 15:41
Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
Comment on lines +1053 to +1062
_, err = controllerutil.CreateOrUpdate(ctx, c.client, cm, func() error {
cm.Data = desiredCM.Data
cm.Annotations = desiredCM.Annotations
cm.Labels = desiredCM.Labels
if !metav1.IsControlledBy(cm, c.cluster) {
if setErr := ctrl.SetControllerReference(c.cluster, cm, c.scheme); setErr != nil {
return fmt.Errorf("setting controller reference: %w", setErr)
}
}
return nil

The previous version emitted EventConfigMapReady on create and update. That's now gone — a configmap drift correction or first-time creation happens with no event. Converge only emits when the key check passes on the next reconcile. The gap means ops has no signal that a configmap was actually written.

Author

I suggest fixing this with a sort of "contracts as tests" PR, to have the behavior locked in. It seems final, yet it constantly gets missed by me as I don't have hands-on results available via tests.
It would be TDD, i.e. propose tests with expected events per n reconciles / n behaviours (drift correction) / unhealthy transitions etc. -> align the code one final time -> make the tests pass and lock in the contract.
Right now I'm constantly hitting and missing while shifting code with different intentions.

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
p.health.Message = msgPoolersReady
p.health.Phase = readyClusterPhase
p.health.Result = ctrl.Result{}
p.events.emitPoolerReadyTransition(p.cluster, p.cluster.Status.Conditions)

p.cluster.Status.Conditions is read before the deferred writeComponentStatus has persisted poolerReady=True. The transition detection inside emitPoolerReadyTransition may fire against a stale slice. Capture oldConditions before the ready assignment if the intent is to detect the False→True edge.
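The False→True edge detection described above can be sketched with a stand-in condition type (`condition` and `emitOnReadyTransition` here are illustrative stand-ins for `metav1.Condition` and the real `emitPoolerReadyTransition`):

```go
package main

import "fmt"

// condition is a simplified stand-in for metav1.Condition.
type condition struct {
	Type   string
	Status string
}

// findStatus returns the status of the named condition, or "Unknown".
func findStatus(conds []condition, t string) string {
	for _, c := range conds {
		if c.Type == t {
			return c.Status
		}
	}
	return "Unknown"
}

// emitOnReadyTransition fires only on the False→True edge. The caller must
// snapshot oldConds BEFORE writing the new ready state; otherwise the
// comparison runs against an already-updated slice and never sees the edge.
func emitOnReadyTransition(oldConds []condition, nowReady bool, emit func(string)) {
	wasReady := findStatus(oldConds, "poolerReady") == "True"
	if !wasReady && nowReady {
		emit("poolers became ready")
	}
}

func main() {
	// Snapshot taken before the ready assignment: the transition is visible.
	old := []condition{{Type: "poolerReady", Status: "False"}}
	emitOnReadyTransition(old, true, func(m string) { fmt.Println("event:", m) })
}
```

The key design point is the ordering: capture the old conditions, then mutate, then compare; mutating first erases the edge.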

componentLogger.Info("Component actuation completed", "step", "actuate")

health, err := component.Converge(ctx)
if err != nil && isTransientError(err) {
Collaborator

TransientError is not necessary anymore since we handle it at the Reconcile level, at least for IsConflict? Also, IsConflict shouldn't be treated the same way as TooManyRequests: the first can be retried automatically, the second should go with backoff. We can clear it up in the next PR, though.

Author

so, collecting from the review in general ->

  • tests for event emission, to lock them in so source changes can no longer cause side effects
  • final alignment of event emission
  • roll back isTransient to a simple isConflict (?)
  • p.cluster.Status.Conditions is read before the deferred writeComponentStatus -> rethink those few LOC so they can't cause an undefined state in k8s (?)

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
if ownerRefErr != nil {
logger.Error(ownerRefErr, "Failed to check owner reference on Secret")
return ctrl.Result{}, fmt.Errorf("failed to check owner reference on secret: %w", ownerRefErr)
if err := component.Actuate(ctx); err != nil {
Collaborator

One potential gap: what if the Actuate logic fails? For example, let's assume CNPG is healthy and reconcileManagedRoles fails. Then we return early here with an error, but the failure state you set in health is never persisted. Is that on purpose?

Author

then writeComponentStatus doesn't fire, and as far as I understand, the k8s state is left "hanging" without an update, just with a plain error requeue, right?

e.g.
func (m *managedRolesModel) Actuate(ctx context.Context) error {
	if rolesErr := reconcileManagedRoles(ctx, m.client, m.cluster, m.runtime.Cluster()); rolesErr != nil {
		m.events.emitWarning(m.cluster, EventManagedRolesFailed, fmt.Sprintf("Failed to reconcile managed roles: %v", rolesErr))
		m.health.State = pgcConstants.Failed
		m.health.Reason = reasonManagedRolesFailed
So it would be best to update the status every time, before the reconcile requeue, right?

Collaborator

If we are raising the error, I believe we should do this, yes.

health = c.health
}()

if c.runtime == nil || !c.runtime.IsHealthy() {
Collaborator

@mploski mploski Apr 20, 2026

I'm a bit confused by this check, as it again seems to be dead code, and in general about the Converge responsibility. IMO the health check for CNPG/cluster is done elsewhere; we should not duplicate it in subsequent steps that depend on a previous one. Otherwise we mix the responsibilities of the prerequisite check and Converge: Converge should be responsible only for its own resource's state, prerequisites for dependencies. But there is an architectural tension here, i.e. we would need to stop running Converge when the gate is not allowed and probably write status directly via writeComponentStatus(updateStatus, gate.Health), which also breaks the boundary of who can write. We need to pick one explicitly; currently the split is blurred, so we need to make sure we know what we are doing for future us.

On second thought, I even started questioning whether we need EvaluatePrerequisites everywhere. What problem does it actually solve that the pipeline structure doesn't? The pipeline loop ordering is the prerequisite mechanism to a great extent.

Author

I agree, the boundary seems blurred right now, and the checks inside the models aren't exactly their responsibility, to be fair.
In general, I would like to adjust it towards "phase locked until healthy, 250% sure", i.e. if we proceed further, then it is a given that we don't have to worry about the cluster having an unhealthy state three components later. That's a smell, to be fair.
I wouldn't necessarily want to do this in this PR, as it will become even fatter and even more side-effect packed. It lacks precision right now. But if it needs to be this way, I can.

Namespace: "default",
},
Data: map[string]string{
"CLUSTER_RW_ENDPOINT": "pg1-rw.default",
Collaborator

I guess we are missing CLUSTER_R_ENDPOINT?

Author

You're right, nice catch. The fix went back into the code but not into the tests. Pushing this right now.


configKeyClusterREndpoint string = "CLUSTER_R_ENDPOINT" - in correspondence with this, the IDE says it is not used.

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
Collaborator

@mploski mploski Apr 20, 2026

We should probably not override health here, otherwise we lose the failure path from Actuate. This seems correct in clusterModel, but it's overridden everywhere else. Shouldn't our tests catch it?

Author

yup, a very serious error tbf. I'll write a test for such failure cases and align.

@M4KIF M4KIF force-pushed the jkoterba/feature/cluster-sync-adj branch from db8eda5 to 95fcb2d Compare April 21, 2026 07:26
@M4KIF M4KIF merged commit 7f19586 into feature/database-controllers Apr 21, 2026
12 of 15 checks passed
@M4KIF M4KIF deleted the jkoterba/feature/cluster-sync-adj branch April 21, 2026 10:45
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 21, 2026