Skip to content

fix: Task250604UpdateFolderInodes hangs on large DBs - N+1 queries + ALTER TABLE deadlock (#35392)#35393

Open
dsilvam wants to merge 11 commits intomainfrom
issue-35392-fix-task250604-n-plus-one-and-alter-table-deadlock
Open

fix: Task250604UpdateFolderInodes hangs on large DBs - N+1 queries + ALTER TABLE deadlock (#35392)#35393
dsilvam wants to merge 11 commits intomainfrom
issue-35392-fix-task250604-n-plus-one-and-alter-table-deadlock

Conversation

@dsilvam
Copy link
Copy Markdown
Member

@dsilvam dsilvam commented Apr 20, 2026

Summary

Fixes #35392Task250604UpdateFolderInodes hung indefinitely (20+ hours reported for customers db) when upgrading from 24.12.27 to current main.

Two root causes fixed:

Bug 1 — N×M query storm in FixTask00090RecreateMissingFoldersInParentPath

  • executeFix() issued one SELECT COUNT(1) per path segment per distinct parent_path in the identifier table
  • For large DBs (100K+ distinct paths × avg depth 3) = 300K+ individual DB queries under startup load
  • Fix: pre-load all folder identifier keys into a HashSet<String> (1 query). Replace DB lookups with O(1) Set.contains(). Newly created folders are added to the cache immediately to avoid re-creation.

Bug 2 — ALTER TABLE deadlock in fixFolderIds()

  • fixFolderIds() runs ALTER TABLE folder DROP/ADD CONSTRAINT ... DEFERRABLE (needs exclusive lock on folder)
  • executeFix() leaves a Hibernate thread-local transaction open (idle in transaction) via HibernateUtil.save() in createFixAudit() — confirmed via pg_stat_activity
  • That idle transaction holds locks blocking the exclusive lock forever, exhausting the 2-connection pool
  • Fix: call HibernateUtil.commitTransaction() + DbConnectionFactory.closeSilently() between executeFix() and fixFolderIds() to release locks before DDL runs

Test plan

  • Deployed against caliber-staging DB (before: 54 folders with inode ≠ identifier, SYSTEM_FOLDER broken)
  • After upgrade: 0 broken folders, all 188 folders have inode = identifier, SYSTEM_FOLDER correct
  • Completed in seconds (vs 20+ hours before)
  • Run Task250604UpdateFolderInodesTest integration tests

Files Changed

  • dotCMS/src/main/java/com/dotmarketing/fixtask/tasks/FixTask00090RecreateMissingFoldersInParentPath.java
  • dotCMS/src/main/java/com/dotmarketing/startup/runonce/Task250604UpdateFolderInodes.java

🤖 Generated with Claude Code

Two root causes fixed:

1. FixTask00090RecreateMissingFoldersInParentPath: replaced N×M per-segment
   SELECT COUNT(1) queries with a single upfront load of all folder identifier
   keys into a HashSet. isFolderIdentifierMissing() now uses O(1) Set.contains()
   lookups instead of individual DB round-trips. Newly created folders are added
   to the cache to prevent duplicate creation within the same run.

2. Task250604UpdateFolderInodes: executeFix() leaves a Hibernate thread-local
   transaction open (idle in transaction) via HibernateUtil.save() in
   createFixAudit(). This held locks that blocked fixFolderIds() ALTER TABLE
   from acquiring the exclusive lock on folder, exhausting the connection pool.
   Fixed by calling HibernateUtil.commitTransaction() + DbConnectionFactory
   .closeSilently() between the two steps.

Verified against caliber-staging: 54 broken folders fixed in seconds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 20, 2026

Claude finished @dsilvam's task in 5m 13s —— View job


Tasks

  • Load dotCMS conventions
  • Get Java diff
  • Run specialized sub-agent reviews (Security, Database, Java Standards, REST API)
  • Consolidate findings
  • Post review comment

Review posted. Two issues found — see the 🔍 dotCMS Backend Review comment for details.

@github-actions github-actions Bot added the Area : Backend PR changes Java/Maven backend code label Apr 20, 2026
Daniel Silva and others added 2 commits April 20, 2026 15:08
…#35392)

Three tests covering the N×M → O(1) cache fix:
- creates missing intermediate folders for deep parent paths
- idempotency: running twice produces no duplicates
- skips folders that already exist in the identifier table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…MainSuite3a (#35392)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 20, 2026

Pull Request Unsafe to Rollback!!!

  • Category: C-1 — Structural Data Model Change
  • Risk Level: 🔴 CRITICAL
  • Why it's unsafe: This PR fixes Task250604UpdateFolderInodes from hanging indefinitely — it was previously blocking at the ALTER TABLE step, never completing the migration. By resolving the N×M query storm (HashSet cache) and the idle-in-transaction deadlock (HibernateUtil.commitTransaction() + DbConnectionFactory.closeSilently()), this PR enables the task to run to completion. That completion performs folder identifier unification — setting folder.inode = folder.identifier for every folder record and updating all path references. This is precisely the migration cited as the canonical C-1 example in the rollback-unsafe categories reference: "Folder identifier unification: folder.inode was set to equal folder.identifier, and path references were updated. N-1 uses inode semantics for folder resolution — after migration, all folder path lookups return wrong results." Before this PR, the task hung and the data transformation never executed. After this PR, any upgrade will run the migration to completion; a rollback to N-1 thereafter leaves N-1 querying a database where all folder.inode values have been rewritten to match folder.identifier. N-1's folder resolution will return wrong results for every folder path lookup.
  • Code that makes it unsafe:
    • dotCMS/src/main/java/com/dotmarketing/startup/runonce/Task250604UpdateFolderInodes.java — the newly added lines:
      HibernateUtil.commitTransaction();
      DbConnectionFactory.closeSilently();
      These unblock the subsequent APILocator.getFolderAPI().fixFolderIds() call, which was previously unreachable due to the deadlock. fixFolderIds() is the method that rewrites folder.inode to equal folder.identifier across all folder rows.
  • Alternative (if possible): This migration is inherently one-way. To reduce rollback risk: (a) before running fixFolderIds(), persist the original inode values to a backup column so N-1 can be patched to read from it; or (b) treat this release as a rollback-unsafe deployment and ensure the rollback window has expired (N-2 is out of rotation) before the migration is applied.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 20, 2026

🔍 dotCMS Backend Review

[🟠 High] dotCMS/src/main/java/com/dotmarketing/fixtask/tasks/FixTask00090RecreateMissingFoldersInParentPath.java:125,137

isFolderIdentifierMissing() dereferences folderKeyCache unconditionally (NPE if null), yet recreateMissingFolders() guards the write with if (folderKeyCache != null). The asymmetry means a null cache survives the write guard but crashes on the read — the null guard is misleading dead code that implies null is a valid input, while the unconditional dereference guarantees an NPE. The @VisibleForTesting signature makes both methods callable from tests with null.

// line 125 — guarded write (dead code: cache is never null in production)
if (folderKeyCache != null) {
    folderKeyCache.add(folderKey(...));
}
// line 137 — unconditional dereference (NPE if null)
return !folderKeyCache.contains(folderKey(...));

💡 Remove the null guard on line 125 and keep both paths unconditional — matching the actual contract. If null must be supported, add a symmetric guard in isFolderIdentifierMissing too.


[🟡 Medium] dotCMS/src/main/java/com/dotmarketing/fixtask/tasks/FixTask00090RecreateMissingFoldersInParentPath.java:127,138

loadExistingFolderKeys() delegates case-folding to PostgreSQL's lower() (locale-aware, DB collation), but the cache-update path inside recreateMissingFolders() uses Java's String.toLowerCase() without a Locale argument (uses JVM default locale). On deployments with a non-English/non-US JVM locale, the two lower() implementations can produce different results for folder names with accented characters, causing the in-memory cache to disagree with the DB snapshot and potentially triggering duplicate folder creation attempts.

// loadExistingFolderKeys: PostgreSQL lower() via DB
keys.add(folderKey(rs.getString(1), rs.getString(2), rs.getString(3)));
// recreateMissingFolders: Java .toLowerCase() with JVM default locale
folderKeyCache.add(folderKey(folder.hostId, folder.parentPath.toLowerCase(), folder.name.toLowerCase()));

💡 Replace .toLowerCase() with .toLowerCase(Locale.ROOT) at lines 127 and 138 (and in isFolderIdentifierMissing at line 137) to align with PostgreSQL's lower() behavior for ASCII-range folder names.


Next steps

  • 🟠 Fix locally and push — the null-guard removal requires your judgment on the null contract
  • 🟡 You can ask me to handle the Locale.ROOT fix: @claude fix the toLowerCase Locale issue in FixTask00090RecreateMissingFoldersInParentPath.java
  • Every new push triggers a fresh review automatically

Daniel Silva and others added 3 commits April 21, 2026 08:02
- Replace manual setAutoCommit/commit/rollback in createFolder() with
  LocalTransaction.wrap() to avoid mutating the ThreadLocal connection's
  autoCommit state
- Materialize outer ResultSet into a List before processing so the cursor
  is fully closed before any createFolder() call runs
- Add WHERE asset_type <> 'folder' to outer scan to reduce scan scope
- Use null-byte (\0) separator in folderKey() to eliminate pipe-character
  collision risk

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…them (#35392)

The backward-compat private overloads (no-Set delegates) bypassed Mockito's
proxy, causing unit test stubs on public methods to never be triggered.

Collapse to a single @VisibleForTesting protected implementation per method
that takes Set<String> as a required parameter. Move cache update from
createFolder() into recreateMissingFolders() so createFolder stays spy-able
with no Set parameter. Update unit tests to pass new HashSet<>() / any()
for the Set argument.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…update (#35392)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eck trigger (#35392)

The DB trigger blocks inserting any identifier whose parent folder doesn't
exist yet, so tests can't manufacture broken state by direct INSERT.

New approach: create the folder via API (valid state), insert the test
contentlet inside it (now valid), delete the folder rows to simulate data
corruption, then verify executeFix() recreates the missing identifier.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Daniel Silva and others added 2 commits April 21, 2026 11:28
…egration tests (#35392)

The INSERT trigger blocks creating identifiers with non-existent parent
paths, and the DELETE trigger blocks removing folders that have children.
Together they prevent manufacturing broken state via INSERT+DELETE.

Fix: insert the contentlet at root (always valid), then UPDATE its
parent_path to the non-existent folder path — UPDATE does not fire the
parent path check trigger, achieving the needed corrupt state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…traints

Replace the two tests that tried to manufacture corrupt DB state (broken
by identifier_parent_path_check / folder_identifier_check triggers) with
tests that call recreateMissingFoldersInParentPath() directly with an
empty Set<String> cache. An empty cache simulates "folder identifier is
missing" in memory without needing illegal SQL, keeping the tests valid
while covering the real fix-task write path.

Also add missing HashSet/Set imports.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mitTransaction

Wrap HibernateUtil.commitTransaction() in try-finally so the ThreadLocal
connection is always released even if commit throws. Without this, a
DotHibernateException from commitTransaction would leave the connection
open with held locks, causing fixFolderIds() ALTER TABLE to deadlock —
the exact bug this PR fixes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

fix: Task250604UpdateFolderInodes hangs indefinitely on large databases (N+1 queries + ALTER TABLE deadlock)

3 participants