Add skipDuplicates() scope guard to createDocuments#852

Open
premtsd-code wants to merge 27 commits into main from csv-import-upsert-v2

Conversation

Contributor

@premtsd-code premtsd-code commented Apr 8, 2026

Summary

Adds a Database::skipDuplicates() scope guard so callers can wrap a createDocuments() call and have duplicate-by-id rows silently skipped instead of throwing DuplicateException. Used by CSV import / re-import flows and database migration workers where re-running the same batch should be idempotent rather than fatal.

The shape follows the existing silent() / skipRelationships() pattern — instance-level scope guard, not a flag parameter — so callers compose naturally:

$database->skipDuplicates(fn () => $database->createDocuments($collection, $batch));

Design

Duplicate handling happens at the adapter layer via dialect-specific no-op inserts. Database::createDocuments sets a scope-guarded flag on the adapter before each chunk's INSERT; the adapter emits the right SQL (or Mongo equivalent) and returns the batch. No pre-filter, no intra-batch dedup — adapter-level row-level dedup is sufficient and avoids redundant queries.
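In code, the guard is just a save-set-restore around the callback. A minimal sketch, assuming the same property/method pairing as silent() (the isSkippingDuplicates() accessor is hypothetical, added only for the demo; this is the shape, not the exact diff):

```php
<?php

// Illustrative scope guard: a property and a method sharing one name,
// following the silent() / skipRelationships() pattern described above.
class Adapter
{
    protected bool $skipDuplicates = false;

    public function skipDuplicates(callable $callback): mixed
    {
        $previous = $this->skipDuplicates; // save for nesting
        $this->skipDuplicates = true;
        try {
            return $callback();
        } finally {
            $this->skipDuplicates = $previous; // restore even if $callback throws
        }
    }

    // Hypothetical accessor, only for demonstrating the scope below.
    public function isSkippingDuplicates(): bool
    {
        return $this->skipDuplicates;
    }
}

$adapter = new Adapter();
$inside = $adapter->skipDuplicates(fn () => $adapter->isSkippingDuplicates());
// $inside is true; outside the closure the flag is back to false.
```

The try/finally restore is what makes nested scopes safe: an inner skipDuplicates() call cannot leak the flag past its own closure, which is the behavior the nested-scope test exercises.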

Per-adapter implementation

Adapter            Strategy
MariaDB / MySQL    INSERT IGNORE INTO
Postgres           INSERT INTO ... ON CONFLICT (...) DO NOTHING
SQLite             INSERT OR IGNORE INTO
Mongo              upsert + $setOnInsert, bypassing the transaction wrap to avoid txn-abort-on-duplicate
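The SQL rows of this table map onto the dialect hooks named in the review walkthrough (getInsertKeyword() / getInsertSuffix()). A rough sketch of how those hooks could divide the work; the hooks are protected in the real adapters but public here for the demo, and the constructor and ON CONFLICT target are illustrative, not the exact diff:

```php
<?php

// Base SQL adapter: MariaDB/MySQL put duplicate handling in the keyword itself.
class SQL
{
    public function __construct(protected bool $skipDuplicates = false)
    {
    }

    public function getInsertKeyword(): string
    {
        return $this->skipDuplicates ? 'INSERT IGNORE INTO' : 'INSERT INTO';
    }

    // Trailing clause appended after VALUES; empty by default.
    public function getInsertSuffix(string $table): string
    {
        return '';
    }
}

// Postgres has no INSERT IGNORE, so the suffix carries the skip semantics.
class Postgres extends SQL
{
    public function getInsertKeyword(): string
    {
        return 'INSERT INTO';
    }

    public function getInsertSuffix(string $table): string
    {
        // Conflict target shown for the simple (non-shared-table) case.
        return $this->skipDuplicates ? 'ON CONFLICT (_uid) DO NOTHING' : '';
    }
}

// SQLite uses its own keyword form.
class SQLite extends SQL
{
    public function getInsertKeyword(): string
    {
        return $this->skipDuplicates ? 'INSERT OR IGNORE INTO' : 'INSERT INTO';
    }
}
```

The shared bulk-INSERT builder then only has to concatenate keyword, table, VALUES, and suffix; each dialect's difference stays inside its own hook.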

Hot-path cost

One property read + one boolean branch per chunk. Callers that don't use skipDuplicates pay no closure allocation and no scope-guard setup — the feature cost is paid only when the feature is actually used. This addresses review comment #1 (don't penalize the common case with an $enable flag pattern).

Mirror dual-write correctness

Mirror::createDocuments pre-filters against source to identify which input ids already exist, then, after the source insert, forwards only the genuinely-new docs to destination. Pre-existing ids are silently no-op'd by the adapter's INSERT IGNORE: they represent no user write, so nothing propagates to destination.

This replaces an earlier captureOnNext-based approach flagged by Greptile (#3084293974): that approach wrapped $onNext to capture "what source inserted", but SQL adapters return the full input batch from createDocuments regardless of rowCount(), causing $onNext to fire for every doc — including skipped duplicates — so Mirror forwarded would-be values to destination, diverging from source. Pre-filter-then-insert matches the idempotent-write semantics: a skipped duplicate is not a user write, so it does not mirror.

Cost: one extra SELECT against source per batch when skipDuplicates is active and Mirror has an upgraded destination. Zero cost on the non-skip path. Skip and non-skip paths are unified into a single method body with branches only at the three points where behavior actually differs (source call, forward filter, destination call).
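The pre-filter-then-forward rule can be shown with a runnable toy in which plain arrays stand in for the source and destination stores (the dup/Original/WouldBe names echo the Mirror test described under Tests; everything else is illustrative):

```php
<?php

$source      = ['dup' => 'Original']; // id => value already present at source
$destination = [];

$input = [
    ['id' => 'dup',   'value' => 'WouldBe'], // duplicate: adapter no-ops it
    ['id' => 'fresh', 'value' => 'New'],     // genuinely new
];

// 1. Pre-filter: which input ids already exist at source? (one SELECT per batch)
$existing = \array_intersect_key(
    \array_flip(\array_column($input, 'id')),
    $source
);

// 2. Source insert with duplicates skipped (INSERT IGNORE semantics).
foreach ($input as $doc) {
    $source[$doc['id']] ??= $doc['value'];
}

// 3. Forward only genuinely-new docs: a skipped duplicate is not a user write,
//    so it must not reach destination.
foreach ($input as $doc) {
    if (!isset($existing[$doc['id']])) {
        $destination[$doc['id']] = $doc['value'];
    }
}

// source keeps dup => Original and gains fresh => New;
// destination receives only fresh => New, so dup never diverges.
```

This is exactly why the captureOnNext approach failed: step 3 must be driven by the pre-filter, not by what the adapter returns, because SQL adapters return the full input batch regardless of rowCount().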

upsertDocumentsWithIncrease tenant grouping (regression fix)

Restored per-tenant grouping for the existing-doc lookup in sharedTables + tenantPerDocument mode. An earlier review round (#2 — inline find() instead of a helper) was correct for single-tenant use cases but ran under the session tenant context (null for platform workers), silently missing rows from other tenants. This broke appwrite's StatsUsage worker, which accumulates stats across many projects and flushes via $dbForLogs->setTenant(null)->setTenantPerDocument(true) + upsertDocumentsWithIncrease() — every input doc looked "new" and went down the create path instead of the update path.

The fix: branch on tenantPerDocument mode and run one find() per unique tenant wrapped in withTenant($tenant, ...). K queries total (K = unique tenants in the batch). Single-tenant callers (the common case, including CSV import and database migration) still hit the fast one-query batched path.

Helper: tenantKey(Document) — used only within upsertDocumentsWithIncrease to keep the in-memory lookup map and the intra-batch dedup check consistent. Returns "<tenant>:<id>" in tenantPerDocument mode, just <id> otherwise. Without it, a legitimate cross-tenant batch like [{id: x, tenant: 1}, {id: x, tenant: 2}] would falsely fail with DuplicateException.
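A runnable toy of the composite key; the array shape and the thrown exception are illustrative, but the cross-tenant batch is the exact example from the paragraph above:

```php
<?php

// Illustrative stand-in for the tenantKey(Document) helper: "<tenant>:<id>"
// in tenantPerDocument mode, plain "<id>" otherwise.
function tenantKey(array $doc, bool $tenantPerDocument): string
{
    return $tenantPerDocument
        ? $doc['tenant'] . ':' . $doc['id']
        : $doc['id'];
}

// A legitimate cross-tenant batch: the same id under two different tenants.
$batch = [
    ['id' => 'x', 'tenant' => 1],
    ['id' => 'x', 'tenant' => 2],
];

// Intra-batch dedup keyed by the composite key: no false DuplicateException.
$seen = [];
foreach ($batch as $doc) {
    $key = tenantKey($doc, true);
    if (isset($seen[$key])) {
        throw new RuntimeException('Duplicate in batch: ' . $key);
    }
    $seen[$key] = true;
}
// $seen holds '1:x' and '2:x'; keyed by id alone, the loop would have thrown.
```

Using the same key function for both the existing-doc lookup map and this dedup check is what keeps the two views of "same document" consistent.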

Tests

End-to-end coverage across all adapter variants (MariaDB / MySQL / Postgres / SQLite / MongoDB × Pool / Mirror / SharedTables):

  • testCreateDocumentsIgnoreDuplicates — mix of existing + new docs, non-existing inserted, existing no-op'd
  • testCreateDocumentsIgnoreAllDuplicates — every doc already exists, nothing newly inserted
  • testCreateDocumentsSkipDuplicatesEmptyBatch — empty batch is a no-op
  • testCreateDocumentsSkipDuplicatesNestedScope — scope guard saves & restores through nested calls
  • testCreateDocumentsSkipDuplicatesLargeBatch — 300 docs / 50 pre-existing, exercises the chunk loop
  • testCreateDocumentsSkipDuplicatesSecondCallSkipsAll — idempotent re-invocation preserves first-write values
  • testCreateDocumentsSkipDuplicatesRelationships — relationship children aren't duplicated on re-run
  • testCreateDocumentsSkipDuplicatesDoesNotDivergeDestination (Mirror) — source-seeded dup=Original with a dup=WouldBe caller input, asserts destination.dup stays empty and destination.fresh gets the newly-inserted row (Greptile's exact assertion from #3084293974)

Test plan

  • skipDuplicates tests pass on every adapter variant
  • Full Mirror suite: 635 tests / 11,035 assertions green
  • Greptile divergence assertion holds
  • composer analyze (PHPStan) clean at the project's configured level
  • No regressions in existing testCreateDocuments* and testUpsert* suites

Summary by CodeRabbit

  • New Features

    • Scoped "skip duplicates" mode to ignore duplicate records during batch creates without overwriting existing data.
  • Refactor

    • Duplicate-skipping integrated across adapters, transaction paths, pooling and mirror flows; batch lookups and tenant-aware identity handling improved for large batches and upserts.
  • Tests

    • Added comprehensive end-to-end tests: nested scopes, large/empty batches, mirror backfill, relationship propagation, and multiple-scope behavior.

Contributor

coderabbitai bot commented Apr 8, 2026

📝 Walkthrough


Adds a scoped skipDuplicates(callable $callback) mode across Database, Pool, Mirror, and the adapters. It changes create/insert flows to avoid overwriting existing documents, using adapter-specific INSERT IGNORE / ON CONFLICT clauses or conditional upserts, and adds e2e tests covering duplicate-skip semantics.

Changes

Cohort / File(s) Summary
Core adapter & Database state
src/Database/Adapter.php, src/Database/Database.php
Introduce protected skipDuplicates flag and public skipDuplicates(callable $callback): mixed. Database adds tenantKey(...), batches existing-document lookups for upserts, and conditionally routes create flows through adapter skip scope.
Pool delegation
src/Database/Adapter/Pool.php
Pool delegates adapter calls through the adapter's skipDuplicates(...) wrapper when pool skip flag is set (both pinned and borrowed adapters); transactional pinned-adapter withTransaction is also wrapped when skipping is enabled.
Mongo adapter
src/Database/Adapter/Mongo.php
withTransaction short-circuits when skipping duplicates. createDocuments() gets a skip-duplicates path using conditional upserts ($setOnInsert with _uid/_tenant filters) instead of insertMany.
SQL / Postgres / SQLite adapters
src/Database/Adapter/SQL.php, src/Database/Adapter/Postgres.php, src/Database/Adapter/SQLite.php
Added protected hooks getInsertKeyword(), getInsertSuffix(string $table), getInsertPermissionsSuffix(); SQL generation now emits INSERT IGNORE / ON CONFLICT DO NOTHING when skip mode is enabled and uses these hooks in bulk INSERTs.
Mirror behavior
src/Database/Mirror.php
Mirror write paths now invoke source/destination skipDuplicates(...) when enabled and branch destination forwarding accordingly to avoid overwriting source-existing docs.
Tests (e2e)
tests/e2e/Adapter/Scopes/DocumentTests.php, tests/e2e/Adapter/MirrorTest.php
Added comprehensive e2e tests covering duplicate-skipping: mixed/all-duplicate/empty batches, nested scopes, large chunked batches, repeated calls, relationships, and mirrored backfill scenarios.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Database
    participant Adapter
    participant Storage as DB

    rect rgba(100,180,120,0.5)
    Note over Client,Storage: Flow with skipDuplicates enabled
    Client->>Database: createDocuments(docs) within skipDuplicates scope
    Database->>Database: compute tenantKey, dedupe ids, chunk
    Database->>Adapter: find(existing ids) [chunked]
    Adapter->>DB: SELECT / find by _uid/_id
    DB-->>Adapter: existing records
    Adapter-->>Database: existing records
    Database->>Adapter: adapter->skipDuplicates(fn => createDocuments(filtered))
    Adapter->>DB: INSERT IGNORE / ON CONFLICT DO NOTHING or conditional upserts (Mongo)
    DB-->>Adapter: success (duplicates skipped)
    Adapter-->>Database: inserted result
    Database->>Client: return inserted docs/count
    end
sequenceDiagram
    participant Client
    participant Database
    participant Adapter
    participant Storage as DB

    rect rgba(200,100,150,0.5)
    Note over Client,Storage: Regular flow without skipDuplicates
    Client->>Database: createDocuments(docs)
    Database->>Adapter: createDocuments(docs)
    Adapter->>DB: INSERT INTO ...
    DB-->>Adapter: may error on duplicates
    Adapter-->>Database: propagate exception or normal result
    Database-->>Client: result or exception
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Feat mongo tmp #647: Overlaps Mongo adapter changes (withTransaction/createDocuments) and upsert/skip-duplicate behavior.

Suggested reviewers

  • ArnabChatterjee20k
  • abnegate
  • fogelito

Poem

🐇 I hopped through rows and checked each id,
I nudged the duplicates softly to the side,
Scoped my skip so nested hops restore,
Fresh carrots stay, old roots keep core,
A tiny rabbit guard — no overwrites tonight.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage ⚠️ Warning: docstring coverage is 47.06%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Description Check ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
Title Check ✅ Passed: the title 'Add skipDuplicates() scope guard to createDocuments' clearly and concisely summarizes the primary change: introducing a new skipDuplicates scope guard method for handling duplicate documents during creation.




greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR adds a skipDuplicates() scope guard to createDocuments, following the existing silent()/skipRelationships() pattern, and wires it through all adapters (SQL via INSERT IGNORE/ON CONFLICT DO NOTHING, MongoDB via upsert+$setOnInsert), Pool delegation, and Mirror backfill. The per-tenant batch fetch for upsertDocumentsWithIncrease is a useful correctness fix for the StatsUsage worker.

  • P1 — MANY_TO_MANY junction DuplicateException: createDocumentRelationships runs for every document in the input batch (including duplicates) before the adapter INSERT. For RELATION_MANY_TO_MANY, relateDocuments unconditionally calls $this->createDocument($junction, ...) with no guard for an existing junction record. Database::createDocument (singular) has no skipDuplicates awareness, so replaying a batch where the parent and its MANY_TO_MANY children are already linked will throw DuplicateException. The test suite covers only RELATION_ONE_TO_MANY.

Confidence Score: 4/5

Safe to merge for non-relational and ONE_TO_MANY use cases; MANY_TO_MANY with skipDuplicates will throw until fixed.

One confirmed P1 runtime bug: MANY_TO_MANY junction createDocument is not guarded by skipDuplicates, causing DuplicateException on re-run. All other prior concerns have been addressed or acknowledged. Score of 4 reflects a single blocking code path that needs a fix and a corresponding test.

src/Database/Database.php (relateDocuments junction creation) and tests/e2e/Adapter/Scopes/DocumentTests.php (missing MANY_TO_MANY coverage)

Important Files Changed

Filename Overview
src/Database/Adapter.php Adds $skipDuplicates flag and nestable scope-guard method — clean implementation that saves/restores previous state correctly.
src/Database/Adapter/Mongo.php Adds upsert+$setOnInsert path for skipDuplicates; several correctness concerns addressed in prior review rounds; transaction bypass and empty-setOnInsert guard are reasonable.
src/Database/Adapter/Pool.php Correctly propagates skipDuplicates flag through delegate() and withTransaction(); nesting is idempotent and harmless.
src/Database/Adapter/Postgres.php Postgres-specific ON CONFLICT DO NOTHING suffix for documents and permissions tables — correct conflict targets for shared and non-shared table modes.
src/Database/Adapter/SQL.php INSERT IGNORE/ON CONFLICT SQL generation via dialect hooks; several edge cases (tenantPerDocument, reconciliation) flagged in prior rounds remain acknowledged trade-offs.
src/Database/Adapter/SQLite.php Overrides getInsertKeyword() with INSERT OR IGNORE INTO — minimal, correct.
src/Database/Database.php Core skipDuplicates scope guard, adapter delegation, and upsertDocumentsWithIncrease tenant-grouping fix look solid; MANY_TO_MANY junction creation is not guarded by skipDuplicates.
src/Database/Mirror.php Forwards all input to destination with skipDuplicates — intentional backfill semantics documented in comments; destination may get values that differ from source for rows absent at destination.
tests/e2e/Adapter/MirrorTest.php Adds backfill-destination test verifying skipDuplicates behaviour in mirror mode; clearly documents intentional source/destination divergence semantics.
tests/e2e/Adapter/Scopes/DocumentTests.php Comprehensive skipDuplicates end-to-end tests across empty batch, nested scope, large batch, and ONE_TO_MANY relationships; MANY_TO_MANY junction case not covered.

Comments Outside Diff (1)

  1. src/Database/Database.php, line 5720-5722 (link)

    P1 skipDuplicates not propagated to MANY_TO_MANY junction creation

    createDocumentRelationships is called for every document — including those that are about to be silently skipped — before the adapter INSERT. For MANY_TO_MANY relationships, relateDocuments (line 6000) always calls $this->createDocument($junction, ...) unconditionally, with no check for an existing junction record. Database::createDocument (singular) has no skipDuplicates awareness, so it will throw DuplicateException whenever the parent is a duplicate that already has MANY_TO_MANY links to existing children.

    The test testCreateDocumentsSkipDuplicatesRelationships only exercises RELATION_ONE_TO_MANY (which stores the link in the child's twoWayKey column via updateDocument, not a junction insert), so this path is not covered. A caller replaying a batch of documents with MANY_TO_MANY children that are already linked will get an unexpected DuplicateException.

    The fix is to guard the junction createDocument by either:

    1. Wrapping this call in $this->skipDuplicates(fn () => ...) when the outer scope has $this->skipDuplicates = true, or
    2. Pre-checking whether the junction record exists before inserting it.
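Option 2 might look roughly like this inside relateDocuments. Variable names ($junction, $key, $twoWayKey, $documentId, $relatedId) and the findOne()/silent() calls follow the library's existing style; this is a sketch of the idea, not a tested patch:

```php
// Pre-check: only insert the junction row when the link does not exist yet.
$existing = $this->silent(fn () => $this->findOne($junction, [
    Query::equal($key, [$documentId]),
    Query::equal($twoWayKey, [$relatedId]),
]));

if ($existing->isEmpty()) {
    $this->createDocument($junction, new Document([
        '$id' => ID::unique(),
        $key => $documentId,
        $twoWayKey => $relatedId,
    ]));
}
```

The trade-off versus option 1 (wrapping the call in skipDuplicates) is one extra read per link, in exchange for not widening the skip scope around an unrelated write path.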

Reviews (30). Last reviewed commit: "Address Jake's follow-up review: tighten..."

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
tests/e2e/Adapter/Scopes/DocumentTests.php (1)

7815-7818: Prefer __FUNCTION__ for collection naming consistency.

Using __FUNCTION__ here (as in adjacent tests) improves isolation and avoids accidental name reuse across reruns.

♻️ Suggested tweak
-        $col = 'createDocsIgnoreIntraBatch';
+        $col = __FUNCTION__;
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Database/Adapter/Mongo.php`:
- Around line 1501-1524: The current dedupe prefetch uses only _uid and a single
tenant filter, causing cross-tenant collisions for shared tables; update the
logic that builds existingUids (and the prefetch loop around
$this->client->find/$this->client->getMore) to be tenant-aware by using a
composite key (e.g., "$tenant:$uid") when populating $existingUids, include the
tenant in the $filters returned by $this->getTenantFilters($collection->getId())
per chunk, and prefetch in per-tenant chunks using array_chunk($uidValues,
max(1, $this->maxQueryValues)) (use the chunk size when calling getMore instead
of count($uidValues)); then when filtering $records check the composite key
($record['_tenant'] . ':' . ($record['_uid'] ?? '')) against $existingUids so
duplicate detection respects tenant boundaries.
- Around line 1504-1516: The prefetch block calls $this->client->find() and
$this->client->getMore() without exception mapping; wrap each driver call in a
try/catch that catches \Throwable and rethrows the normalized exception via
processException($e) (same pattern used elsewhere in this class), i.e. surround
the $this->client->find(...) call and the $this->client->getMore(...) call with
try { ... } catch (\Throwable $e) { throw $this->processException($e); } so any
raw driver errors from the $response/$more handling are mapped consistently.

In `@src/Database/Database.php`:
- Around line 7210-7255: The upsertDocumentsWithIncrease() flow builds
$existingDocs keyed as tenant:id when $upsertTenantPerDocument is true, but
later duplicate-input detection still uses only $document->getId(), causing
false DuplicateException across tenants; update the duplicate-input check and
any intermediate maps (e.g., $seenIds/$inputKeys) to use the same composite
lookup key (use the existing $lookupKey logic: $tenant . ':' .
$document->getId() when $upsertTenantPerDocument is true) so the tenant-aware
key is reused consistently across the whole upsert path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c9f1924e-3ed0-43a8-8859-6e6c299aadb5

📥 Commits

Reviewing files that changed from the base of the PR and between 72ee161 and b9c7f4d.

📒 Files selected for processing (9)
  • src/Database/Adapter.php
  • src/Database/Adapter/Mongo.php
  • src/Database/Adapter/Pool.php
  • src/Database/Adapter/Postgres.php
  • src/Database/Adapter/SQL.php
  • src/Database/Adapter/SQLite.php
  • src/Database/Database.php
  • src/Database/Mirror.php
  • tests/e2e/Adapter/Scopes/DocumentTests.php

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
src/Database/Database.php (2)

5682-5726: ⚠️ Potential issue | 🟠 Major

Don’t let the outer prefetch snapshot suppress writes.

The adapters already do the authoritative duplicate filtering inside the transaction/session, but this code still drops $chunk entries from a pre-transaction find(). If a matching row is deleted after that snapshot but before the insert transaction starts, ignore: true will incorrectly skip a document that should now insert, and $modified / $onNext will be low.

Also applies to: 5805-5815


5788-5790: ⚠️ Potential issue | 🟠 Major

Use the same tenant-aware key for deferred relationship writes.

In tenant-per-document mode, two inserted documents can legitimately share the same $id. This map is keyed only by $id, so the later payload overwrites the earlier one, and unset() after the first insert prevents the second tenant’s relationship write entirely.

♻️ Proposed fix
                 if (!empty($relationshipData)) {
-                    $deferredRelationships[$document->getId()] = $relationshipData;
+                    $relationshipKey = $tenantPerDocument
+                        ? $document->getTenant() . ':' . $document->getId()
+                        : $document->getId();
+                    $deferredRelationships[$relationshipKey] = $relationshipData;
                 }
@@
                 foreach ($batch as $insertedDoc) {
-                    $docId = $insertedDoc->getId();
-                    if (\array_key_exists($docId, $deferredRelationships)) {
+                    $relationshipKey = $tenantPerDocument
+                        ? $insertedDoc->getTenant() . ':' . $insertedDoc->getId()
+                        : $insertedDoc->getId();
+                    if (\array_key_exists($relationshipKey, $deferredRelationships)) {
                         $relDoc = clone $insertedDoc;
-                        foreach ($deferredRelationships[$docId] as $key => $value) {
+                        foreach ($deferredRelationships[$relationshipKey] as $key => $value) {
                             $relDoc->setAttribute($key, $value);
                         }
                         $this->silent(fn () => $this->createDocumentRelationships($collection, $relDoc));
-                        unset($deferredRelationships[$docId]);
+                        unset($deferredRelationships[$relationshipKey]);
                     }
                 }

Also applies to: 5823-5832

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Database/Adapter/SQL.php`:
- Around line 2482-2527: The pre-filter SELECT that checks for existing _uid
rows (the block building $sql, preparing via $this->getPDO()->prepare, binding
values and calling $stmt->execute()/fetchAll()) can throw a raw PDOException
because it runs outside the method's main try/catch; wrap the entire
prepare/execute/fetch block in the adapter's processException(...) wrapper so
exceptions are mapped consistently (i.e., call $this->processException(...)
around the closure that prepares, binds, executes and fetches, returning $rows),
keeping the logic inside createDocuments() and referencing the existing $sql,
$binds, $stmt and $rows variables.
- Around line 2665-2713: The reconciliation currently keys timestamps only by
_uid and uses a single :_vfy_tenant bind when $this->sharedTables is true, which
corrupts mixed-tenant batches; update the verification to be tenant-aware by
binding the tenant per document and keying by a composite tenant+uid.
Concretely, in the block that builds $verifyPlaceholders/$verifyBinds and
$verifySql (the code using $verifyPlaceholders, $verifyBinds, $verifySql and
$verifyStmt), include the tenant value for each document (use the document's
tenant when tenant-per-document applies), change the WHERE to match pairs (e.g.
use row-value tuples or explicit (_uid = :uX AND _tenant = :tX) clauses), and
then build $actualTimestamps keyed by "$tenant|$uid" (and compare against
$expectedTimestamps keyed the same way) so the subsequent loop that populates
$insertedDocs/$insertedUids only matches rows with the correct tenant+uid.
- Around line 2650-2652: Replace the use of "INSERT IGNORE" (returned by
getInsertKeyword) with a plain "INSERT" and remove any IGNORE-specific behavior
in getInsertSuffix so that constraint violations are not globally suppressed;
instead let the DB raise errors and update the insert/execute path that catches
SQL exceptions to inspect the database error/constraint name (e.g., index or key
name such as _uid) and map errors to DuplicateException only when the violated
constraint corresponds to the _uid duplicate, otherwise map to UniqueException
(or other specific exceptions) for PRIMARY/UNIQUE/NOT NULL/truncation
violations—use the existing functions getInsertKeyword and getInsertSuffix to
locate changes and adjust the exception-mapping in the code that performs the
insert and error handling.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d2ab610b-bb64-4dcb-979c-f3880d9e99df

📥 Commits

Reviewing files that changed from the base of the PR and between 20846d3 and 242c9a8.

📒 Files selected for processing (3)
  • src/Database/Adapter/Mongo.php
  • src/Database/Adapter/SQL.php
  • src/Database/Database.php
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/Database/Adapter/Mongo.php

Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (3)
src/Database/Adapter/SQL.php (3)

2482-2527: ⚠️ Potential issue | 🟠 Major

Wrap the ignore-mode pre-filter in processException() handling.

This query runs before the method’s main try/catch, so SQL or connection failures here still escape as raw PDOException while the rest of createDocuments() is adapter-mapped.


2782-2785: ⚠️ Potential issue | 🟠 Major

Don’t use INSERT IGNORE as the base implementation for ignore=true.

For adapters that inherit this default, ignore=true stops meaning “skip duplicate document IDs” and starts suppressing a broader class of insert failures on both the document and permission insert paths. The base hook should stay plain INSERT INTO, and adapter-specific duplicate handling should be expressed with targeted conflict clauses or constraint-aware exception mapping.

Does MariaDB/MySQL `INSERT IGNORE` suppress constraint violations beyond duplicate-key conflicts, such as PRIMARY KEY, other UNIQUE, NOT NULL, truncation, or conversion errors?

Based on learnings, in src/Database/Adapter/MariaDB.php only duplicate _uid violations should throw DuplicateException; other unique or PRIMARY collisions should still throw UniqueException.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Adapter/SQL.php` around lines 2782 - 2785, The base method
getInsertKeyword in class SQL should not return 'INSERT IGNORE INTO' for
ignore=true; change it to always return the plain 'INSERT INTO' and move any
ignore semantics into adapter-specific code (e.g., Adapter/MariaDB.php) by using
targeted conflict clauses or by mapping constraint/SQLSTATE errors to the
correct exceptions; specifically, in MariaDB adapter ensure only duplicate
violations on the document _uid map to DuplicateException while other
UNIQUE/PRIMARY/constraint failures map to UniqueException (and do not rely on
INSERT IGNORE in SQL::getInsertKeyword).
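The reviewer's suggested shape — a base hook that stays plain `INSERT INTO`, with duplicate handling expressed only in adapters that document the trade-off — can be sketched language-agnostically (Python here; the class and method names mirror the discussion but are illustrative, not the project's PHP API):

```python
class SqlAdapter:
    """Base adapter: never widens ignore semantics beyond plain INSERT."""
    skip_duplicates = False

    def insert_keyword(self) -> str:
        # Base stays plain; adapters that inherit this default never
        # silently suppress NOT NULL, truncation, or conversion errors.
        return "INSERT INTO"

class MariaDBAdapter(SqlAdapter):
    def insert_keyword(self) -> str:
        # MariaDB/MySQL-specific: IGNORE also downgrades some non-duplicate
        # errors to warnings, so it belongs only in the adapter that owns
        # that trade-off, paired with constraint-aware exception mapping.
        return "INSERT IGNORE INTO" if self.skip_duplicates else "INSERT INTO"

class PostgresAdapter(SqlAdapter):
    def insert_suffix(self) -> str:
        # Postgres expresses the same intent with a targeted conflict
        # clause scoped to the duplicate-id constraint only.
        return " ON CONFLICT (_uid) DO NOTHING" if self.skip_duplicates else ""
```

The point of the split is that `ON CONFLICT (_uid)` names the exact constraint, while a blanket `IGNORE` does not.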

2664-2728: ⚠️ Potential issue | 🔴 Critical

Keep expectedTimestamps tenant-aware too.

$actualTimestamps is keyed by tenant:uid now, but $expectedTimestamps is still keyed only by _uid. In tenantPerDocument mode, two documents with the same _uid in different tenants can overwrite each other here, causing an inserted document to be dropped from $documents and miss its permission insert.

💡 Proposed fix
-                $expectedTimestamps = [];
-                foreach ($documents as $doc) {
-                    $expectedTimestamps[$doc->getId()] = $doc->getCreatedAt();
-                }
+                $expectedTimestamps = [];
+                foreach ($documents as $doc) {
+                    $key = ($this->sharedTables && $this->tenantPerDocument)
+                        ? $doc->getTenant() . ':' . $doc->getId()
+                        : $doc->getId();
+                    $expectedTimestamps[$key] = $doc->getCreatedAt();
+                }
@@
-                    if (isset($actualTimestamps[$key]) && $actualTimestamps[$key] === $expectedTimestamps[$doc->getId()]) {
+                    if (isset($actualTimestamps[$key]) && $actualTimestamps[$key] === $expectedTimestamps[$key]) {
                         $insertedDocs[] = $doc;
                     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Adapter/SQL.php` around lines 2664 - 2728, The
expectedTimestamps map is not tenant-aware while actualTimestamps is, so when
sharedTables && tenantPerDocument you must key expectedTimestamps the same way
as actualTimestamps (tenant + ':' + uid) during the initial foreach over
$documents and also use that key when comparing later (instead of
$expectedTimestamps[$doc->getId()]); update the loop that builds
$expectedTimestamps and the comparison that checks equality to construct/look up
keys as ($doc->getTenant() . ':' . $doc->getId()) when $this->sharedTables &&
$this->tenantPerDocument so entries don't collide across tenants.
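The composite-key fix the comment describes can be sketched as a small helper (Python; field names are illustrative stand-ins for the PHP accessors):

```python
def timestamp_key(doc: dict, shared_tables: bool, tenant_per_document: bool) -> str:
    """Key used for BOTH the expected and actual timestamp maps.

    When tenants share tables and each document carries its own tenant,
    the same _uid can legally appear once per tenant, so the key must
    include the tenant or entries collide across tenants.
    """
    if shared_tables and tenant_per_document:
        return f"{doc['tenant']}:{doc['uid']}"
    return doc["uid"]

def build_expected(docs, shared_tables, tenant_per_document):
    expected = {}
    for doc in docs:
        key = timestamp_key(doc, shared_tables, tenant_per_document)
        expected[key] = doc["created_at"]
    return expected
```

With plain `_uid` keys, two same-id documents in different tenants overwrite each other in the map — exactly the collision the finding flags.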

ℹ️ Review info
Run ID: 892638bd-ac20-414b-9059-7f7e20295028

📥 Commits

Reviewing files that changed from the base of the PR and between 242c9a8 and 455da70.

📒 Files selected for processing (2)
  • src/Database/Adapter/SQL.php
  • src/Database/Database.php
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/Database/Database.php

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Database/Adapter/SQL.php`:
- Around line 2676-2682: The verification query is binding composite keys from
$expectedTimestamps into $verifyBinds/placeholders and then comparing them
against the _uid column, but when tenantPerDocument is enabled _uid was set to
the raw document ID ($attributes['_uid'] = $document->getId()), so the WHERE
_uid IN (...) must use the document IDs not the composite keys. Fix by
extracting the document ID portion from each composite key before creating $ph
and adding to $verifyBinds (or if keys are already plain IDs, leave unchanged);
update the loop that builds $verifyPlaceholders/$verifyBinds (the foreach over
array_values(array_unique(array_keys($expectedTimestamps)))) to derive $uid =
<documentIdFromComposite($uid)> and bind that instead so the verification query
matches rows by actual _uid values.

ℹ️ Review info
Run ID: 43af96a2-d479-4354-86c9-7d68abad7b56

📥 Commits

Reviewing files that changed from the base of the PR and between 455da70 and deedfda.

📒 Files selected for processing (1)
  • src/Database/Adapter/SQL.php

Adapters:
- MariaDB/MySQL: INSERT IGNORE INTO
- PostgreSQL: ON CONFLICT DO NOTHING
- SQLite: INSERT OR IGNORE INTO
- MongoDB: session-scoped pre-filter before insertMany

Database.php:
- Intra-batch dedup by ID (tenant-aware, first occurrence wins)
- Pre-fetch existing IDs to skip known duplicates
- Deferred relationship creation for ignore mode (no orphans)
- Race-condition reconciliation via _createdAt timestamp verification

upsertDocuments:
- Batch-fetch existing docs with find() instead of per-row getDocument()
- Tenant-aware composite keys for seenIds duplicate check

All paths are tenant-per-document aware with null-safe array_filter.
- testCreateDocumentsIgnoreDuplicates: mixed batch with onNext assertions
- testCreateDocumentsIgnoreIntraBatchDuplicates: first-wins, no ACL drift
- testCreateDocumentsIgnoreAllDuplicates: zero inserts, empty onNext
@premtsd-code premtsd-code force-pushed the csv-import-upsert-v2 branch from f2b08dc to bee42d4 Compare April 9, 2026 10:28
…x _uid mapping

- Pass ignoreDuplicates option to insertMany for race-condition safety
- Extract _uid from raw array before replaceChars (excluded from transformation)
- Map inserted records back to original Document objects by _uid
- Add skipDuplicates(callable) scope guard on Database, following
  existing pattern (skipRelationships, skipValidation, etc.)
- Remove bool $ignore parameter from createDocuments signature
- Mirror propagates skipDuplicates state to source and destination
- Update tests to use $database->skipDuplicates(function() { ... })
- Add $skipDuplicates property + skipDuplicates() scope guard to Adapter
- Remove bool $ignore parameter from all adapter createDocuments signatures
  (Adapter, Pool, Mongo, SQL)
- Remove bool $ignore from SQL helper methods (getInsertKeyword,
  getInsertSuffix, getInsertPermissionsSuffix) and Postgres/SQLite overrides
- Pool delegate() and withTransaction() propagate skipDuplicates state to
  pooled/pinned adapter via the scope guard
- Database::createDocuments() wraps adapter call in adapter->skipDuplicates()
  when the flag is set, drops the local $ignore variable
- skipDuplicates() takes optional $enable so Pool/Mirror/Database can
  forward state in a single call instead of branching
- Database: tenantKey(Document) helper replaces ~8 inline ternaries
- Database: fetchExistingByIds() helper replaces the tenant-grouped
  batched find() in createDocuments and upsertDocuments pre-fetch
- Database: drop dead guard on non-defer relationship path
- Mongo: findExistingUids() helper replaces the duplicated cursor walk
  in the skipDuplicates pre-filter (tenant and non-tenant variants)
- SQL: buildUidTenantLookup() helper replaces the duplicated
  placeholder-and-binds builder shared by pre-filter and reconciliation
When a batch has only null-ID documents but non-null tenants, the
previous code emitted WHERE _uid IN () AND _tenant IN (...) — invalid
SQL that the !empty(binds) call-site guard failed to catch because the
tenant binds kept the binds array non-empty.

Return 'where' => '1=0' with empty binds when there are no UIDs, so
the query matches nothing and leaves no dangling placeholders.
- testCreateDocumentsSkipDuplicatesEmptyBatch: no-op with empty array
- testCreateDocumentsSkipDuplicatesNestedScope: scope guard save/restore
  through nested skipDuplicates() calls
- testCreateDocumentsSkipDuplicatesLargeBatch: 300-doc batch with 50
  pre-existing, exercising the chunk loop
- testCreateDocumentsSkipDuplicatesSecondCallSkipsAll: second identical
  batch should preserve First values, not overwrite with Second
- testCreateDocumentsSkipDuplicatesRelationships: skipDuplicates with a
  one-to-many relationship — verifies deferred relationship logic does
  not create orphan child rows for ignored parents
INSERT IGNORE / ON CONFLICT DO NOTHING / INSERT OR IGNORE handle
duplicates at the DB level, and the existing reconciliation block
identifies actually-inserted rows via the _createdAt timestamp. The
pre-filter was a wire-data optimization that ran a SELECT before the
INSERT, then trimmed the batch — but the same SELECT (with timestamp
comparison) already happens in the reconciliation path on demand.

Net: ~45 lines removed, single code path for skipDuplicates instead
of pre-filter + fallback. Race-condition handling is unchanged
(reconciliation path covers it).
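The reconciliation idea — `INSERT IGNORE` gives no per-row report, so actually-inserted rows are identified afterwards by comparing each row's stored `_createdAt` against the value this batch tried to write — can be sketched as follows (Python; data shapes are illustrative):

```python
def reconcile_inserted(batch, stored_created_at):
    """Return the documents from `batch` that this call actually inserted.

    `stored_created_at` maps uid -> the _createdAt now persisted for that
    row. A pre-existing row keeps its original _createdAt, while a freshly
    inserted row carries the timestamp this batch assigned, so equality
    identifies our inserts without per-row reporting from the database.
    """
    expected = {doc["uid"]: doc["created_at"] for doc in batch}
    return [doc for doc in batch
            if stored_created_at.get(doc["uid"]) == expected[doc["uid"]]]
```

This assumes batch timestamps are distinct from any pre-existing row's `_createdAt`, which holds when the batch stamps its own creation time.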
Previously the Database orchestrator carried a pre-fetch + chunk-filter
pass to compensate for Mongo's plain upsert() returning matched +
upserted combined ('n') — without that compensation, $modified would
over-count the number of pre-existing docs in the batch.

This pushes the responsibility into the adapter, where it belongs:
- composer bump to utopia-php/mongo dev-upsert-return-upserted-count
  (PR #37, adds Client::upsertWithCounts() returning matched/modified/
  upserted[] separately)
- Mongo adapter calls upsertWithCounts() and maps the response's
  upserted[].index entries back to the originating Document objects,
  returning ONLY the actually-inserted docs (race-safe)
- Database::createDocuments drops $preExistingIds pre-fetch and the
  chunk filter; the adapter is now responsible for accurate counts

Net effect on Database.php: ~17 lines removed. Net effect on Mongo.php:
~80 lines net deletion (replaced the inline pre-filter with a single
upsertWithCounts call + index mapping).
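The index mapping described above — Mongo-style bulk results report each upserted entry with the position of its originating request document, and matched (pre-existing) documents produce no entry — can be sketched like this (Python; the response shape follows MongoDB's bulk-result convention, the function name is illustrative):

```python
def map_upserted(batch, upserted):
    """Map a bulk result's upserted entries back to the input documents.

    `upserted` is a list of {"index": i, "_id": ...} entries, where `index`
    is the position of the originating document in the request batch. The
    returned list therefore contains ONLY the actually-inserted documents,
    which is what makes the adapter's counts race-safe.
    """
    return [batch[entry["index"]] for entry in upserted]
```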
 * @param bool $enable
 * @return T
 */
public function skipDuplicates(callable $callback, bool $enable = true): mixed
Member


Let's remove the $enable flag and check if skipDuplicates instead, otherwise we add the overhead of calling an extra callable when it's false
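The flag-free scope-guard shape the review asks for — set the flag only for the callback's duration, restore it afterwards even on exceptions or nesting, and let callers branch instead of passing `$enable` — looks like this in a minimal Python sketch (names illustrative, not the project's PHP API):

```python
class Database:
    def __init__(self):
        self._skip_duplicates = False

    def skip_duplicates(self, callback):
        """Scope guard: save/restore semantics survive exceptions and
        nested calls, matching the silent()/skipRelationships() pattern."""
        previous = self._skip_duplicates
        self._skip_duplicates = True
        try:
            return callback()
        finally:
            self._skip_duplicates = previous

    def create_documents(self, docs):
        # Caller-side branch: the common path reads one boolean and pays
        # no closure allocation or extra callable invocation.
        mode = "ignore" if self._skip_duplicates else "strict"
        return (mode, list(docs))
```

With `$enable = true` in the signature, every plain `createDocuments()` call would still allocate and invoke a closure just to pass `false`; branching at the call site keeps the hot path flat.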

 * @param bool $idsOnly When true, uses Query::select(['$id']) for a lighter fetch
 * @return array<string, Document>
 */
private function fetchExistingByIds(Document $collection, array $documents, bool $idsOnly = false): array
Member


We can use find

Comment on lines +5751 to +5776
// skipDuplicates: dedupe intra-batch (first wins) then drop already-existing rows.
// Adapter INSERT IGNORE / ON CONFLICT / upsert is the race-safety net.
if ($this->skipDuplicates) {
    $seenIds = [];
    $deduplicated = [];
    foreach ($documents as $document) {
        if ($document->getId() !== '') {
            $dedupeKey = $this->tenantKey($document);
            if (isset($seenIds[$dedupeKey])) {
                continue;
            }
            $seenIds[$dedupeKey] = true;
        }
        $deduplicated[] = $document;
    }
    $documents = $deduplicated;

    $preExistingIds = $this->fetchExistingByIds($collection, $documents, idsOnly: true);
    if (!empty($preExistingIds)) {
        $documents = \array_values(\array_filter(
            $documents,
            fn (Document $doc) => !isset($preExistingIds[$this->tenantKey($doc)])
        ));
    }
}

Member


I don't think we need to do this if using insert ignore

Contributor Author


@abnegate The child collection gets orphan rows when a batch has multiple rows with the same parent but different children: the parent insert is ignored, but the child inserts are not.

if ($this->resolveRelationships) {
    $document = $this->silent(fn () => $this->createDocumentRelationships($collection, $document));
}

Comment on lines +607 to +613
$insertedFromSource = [];
$captureOnNext = function (Document $doc) use (&$insertedFromSource, $onNext): void {
    $insertedFromSource[] = $doc;
    if ($onNext !== null) {
        $onNext($doc);
    }
};
Member


We can never capture like this internally, if there were many docs, this loop could easily OOM

Comment on lines +2614 to +2640
/**
 * Returns the INSERT keyword, optionally with IGNORE for duplicate handling.
 * Override in adapter subclasses for DB-specific syntax.
 */
protected function getInsertKeyword(): string
{
    return $this->skipDuplicates ? 'INSERT IGNORE INTO' : 'INSERT INTO';
}

/**
 * Returns a suffix appended after VALUES clause for duplicate handling.
 * Override in adapter subclasses (e.g., Postgres uses ON CONFLICT DO NOTHING).
 */
protected function getInsertSuffix(string $table): string
{
    return '';
}

/**
 * Returns a suffix for the permissions INSERT statement when ignoring duplicates.
 * Override in adapter subclasses for DB-specific syntax.
 */
protected function getInsertPermissionsSuffix(): string
{
    return '';
}

Member


Let's group logically with other getter methods

#1 Drop $enable flag on skipDuplicates() scope guard
The $enable param made every non-skipDuplicates createDocuments call pay
for a closure allocation + extra function call. Branch at the call site
instead so the cost only applies when the flag is actually set.

  - Adapter::skipDuplicates(callable, bool) → skipDuplicates(callable)
  - Database::skipDuplicates(callable, bool) → skipDuplicates(callable)
  - Database::createDocuments, Mirror::createDocuments, Pool::delegate,
    Pool::withTransaction now branch inline.

#2 Drop fetchExistingByIds helper, inline find()
The helper's per-tenant grouping defended a hypothetical multi-tenant
batching scenario that no caller exercises (relationships are intra-
tenant, callers always batch per tenant). Existing patterns in the same
file (refetchDocuments, relationship loading) just call find() directly.
Match that idiom and drop ~70 lines.

#4 Mirror: only capture inserted docs in skipDuplicates mode
The captureOnNext accumulator paid the cost (closure + per-doc array
push) on every createDocuments call, including the common non-skip path.
Branch at the entry of Mirror::createDocuments so the capture only
happens when skipDuplicates is set; the non-skip path passes through
to source/destination unchanged.

#5 Move getInsertKeyword/Suffix/PermissionsSuffix to getters cluster
Were sitting next to createDocuments(); moved to the getSupport*
cluster around line 1030 where other adapter-capability shims live.

Not addressed:
- #2 partial: the existing patterns (refetchDocuments etc.) don't handle
  tenant-per-document multi-tenant batches either, so this is consistent.
- #3 (drop the pre-filter): rejected. createDocumentRelationships runs
  in the encoding loop BEFORE the adapter's INSERT IGNORE no-ops the
  parent, so dropping the pre-filter would deterministically duplicate
  child rows on every CSV re-import of a collection with relationships
  (not a race window — every call). The relationships test verifies
  this. Reverting would require reintroducing the deferred-relationships
  scaffolding we just removed, and the adapter still couldn't tell us
  which parents were actually inserted (SQL INSERT IGNORE has no per-row
  reporting). Pre-filter stays.
Per Jake's #4 review comment, the captureOnNext accumulator could grow
unbounded with the full inserted set. Replace it with a bounded buffer
that flushes to the destination every $batchSize docs as the source
streams them through onNext, plus a tail flush after the source returns.

Memory peak in skipDuplicates mode is now O($batchSize) (~1000 docs)
regardless of how many docs the source inserts. Same destination batching
efficiency — destination still gets multi-doc createDocuments calls,
just one per chunk instead of one giant call at the end.

Eligibility checks (SOURCE_ONLY_COLLECTIONS, upgrade status) moved
upfront so the flush closure can capture the decision and short-circuit
cleanly. Non-skip path is unchanged: direct passthrough, no buffer.

The flush-on-fill is load-bearing for the memory bound — see the
in-method comment.
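The bounded-buffer forwarding described in this commit can be sketched generically (Python; `flush` stands in for the destination `createDocuments` call):

```python
def forward_bounded(source_stream, flush, batch_size=1000):
    """Forward docs from a streaming source to `flush` in bounded chunks.

    Peak memory stays O(batch_size) regardless of stream length: the
    buffer flushes whenever it fills (this flush-on-fill is what enforces
    the bound), plus one tail flush for the remainder after the source
    finishes streaming.
    """
    buffer = []
    for doc in source_stream:
        buffer.append(doc)
        if len(buffer) >= batch_size:
            flush(list(buffer))  # copy out, then reuse the buffer
            buffer.clear()
    if buffer:
        flush(buffer)  # tail flush for the final partial chunk
```

The destination still receives multi-document batches, so batching efficiency is preserved; only the single giant end-of-stream call is gone.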
…erDocument

When inlining find() per Jake's review, the per-tenant grouping the
deleted fetchExistingByIds helper used to do was lost. In tenantPerDocument
mode with a multi-tenant batch, the inlined find() runs under the session
tenant context (often null for platform workers) and silently misses rows
belonging to other tenants — every input doc looks "new" and goes down the
create path instead of the update path.

This actively breaks appwrite's StatsUsage worker, which accumulates stats
across many projects (each tagged with its own $tenant) and flushes them
via $dbForLogs->setTenant(null)->setTenantPerDocument(true) followed by
upsertDocumentsWithIncrease(...). With the old per-doc getDocument loop,
each lookup ran under withTenant($doc->getTenant()) and was correct. The
batch find() inlining lost that scoping.

Fix per Greptile's suggestion: branch on tenantPerDocument mode and run
one find() per unique tenant value under withTenant, merging results into
the same lookup hash. K queries (K = unique tenants in the batch) instead
of N (per-doc) or 1 (broken). Common single-tenant case still hits the
fast batched path.

Applied at both call sites:
- Database::createDocuments skipDuplicates pre-filter
- Database::upsertDocumentsWithIncrease existing-docs lookup

Fixes greptile flag at PR #852 discussion r3077481595.
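The K-queries-per-unique-tenant lookup described above can be sketched as follows (Python; `find_for_tenant` stands in for a tenant-scoped `withTenant(...)->find(...)`, and the composite-key shape is illustrative):

```python
def fetch_existing_by_tenant(docs, find_for_tenant):
    """Run one lookup per unique tenant instead of one global query.

    A single unscoped query under the session tenant (often null for
    platform workers) silently misses other tenants' rows; a per-doc
    lookup costs N queries. Grouping by tenant costs K queries
    (K = unique tenants in the batch) and merges results into one hash
    keyed 'tenant:id'. The single-tenant case stays one batched query.
    """
    by_tenant = {}
    for doc in docs:
        by_tenant.setdefault(doc["tenant"], []).append(doc["id"])
    existing = {}
    for tenant, ids in by_tenant.items():
        for found_id in find_for_tenant(tenant, ids):
            existing[f"{tenant}:{found_id}"] = True
    return existing
```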
Per Jake's review feedback, the pre-filter in Database::createDocuments
was redundant with the adapter's INSERT IGNORE / ON CONFLICT DO NOTHING
/ upsert + $setOnInsert path. The adapter already handles duplicates
atomically; the upfront SELECT duplicated that work.

The semantic is now row-level dedup:
- pre-existing rows are silently no-op'd by the adapter
- new rows go through
- retries are idempotent
- source data is never overwritten

Tests updated to match:
- 5 skipDuplicates tests adjusted to expect full-chunk counts, with
  data-level assertions unchanged (pre-existing values preserved, new
  rows inserted).
- testCreateDocumentsSkipDuplicatesRelationships encodes the retry
  scenario: missing nested child gets created via relateDocuments'
  idempotent path when the parent already exists.
- testCreateDocumentsSkipDuplicatesDoesNotDivergeDestination relaxed
  to source-authoritative semantics only. Destination may transiently
  hold would-be values during the migration window; backfill reconciles.

All 105 skipDuplicates tests across 11 adapter variants pass, full
Mirror suite (636 tests) passes, PHPStan level 7 clean.
Follow-up to Jake's #5 review comment — the fix was only applied to
SQL.php. Propagate the same grouping to the other adapters and fix one
cosmetic byproduct from the Mirror refactor.

- Postgres.php: move getInsertKeyword/Suffix/PermissionsSuffix from
  between createDocuments() and getUpsertStatement() down to the
  getSupportFor* cluster (matches SQL.php:1036 layout).
- SQLite.php: move getInsertKeyword override from the top of the class
  to after the getSupportFor* cluster.
- Adapter.php: skipDuplicates() method was inserted between protected
  property declarations, breaking the property cluster. Move the method
  down next to withTransaction() — the other callable-taking scope
  guard. $skipDuplicates property stays with its peers.
- Mirror.php: revert the incidental fn () => collapse on the non-skip
  destination createDocuments call back to the original two-line form,
  minimizing unrelated diff noise.

Pure relocation / style. No behavior change.
Addresses Greptile #3084293974. The captureOnNext-based forwarding
was structurally broken post-pre-filter-removal: SQL adapters return
the full input batch from createDocuments regardless of how many rows
INSERT IGNORE actually inserted, so onNext fired for every document
including skipped duplicates. captureOnNext buffered all of them and
flushed the full buffer (carrying the caller's would-be values) to
destination, diverging destination from source whenever source had a
pre-existing row the caller tried to overwrite.

Fix: instead of trying to distinguish inserted vs skipped at insert
time, re-fetch source's authoritative state via find() after the
adapter write settles, then forward that to destination. Race-free
regardless of concurrent writers — destination always receives a
faithful snapshot of whatever source holds, because the query runs
after source's write has resolved.

Cost: one extra SELECT per batch against source when skipDuplicates
is active AND Mirror has an upgraded destination. Zero cost on the
non-skip path. Memory remains O(batch_size).

Net ~47 lines smaller than the previous capture-based approach:
- captureOnNext closure removed
- flushBuffer closure removed
- bounded-buffer machinery removed
- shouldMirror upfront-snapshot removed (inlined as eligibility check)

Test: testCreateDocumentsSkipDuplicatesDoesNotDivergeDestination now
asserts destination holds source's 'Original' value rather than the
caller's 'WouldBe' input, per Greptile's suggested assertion.
…p paths

Addresses Greptile #3084293974. The captureOnNext-based forwarding
was structurally broken: SQL adapters return the full input batch from
createDocuments regardless of what INSERT IGNORE actually inserted, so
onNext fired for every doc including skipped duplicates. captureOnNext
then forwarded the caller's would-be values to destination, diverging
destination from source whenever source had a pre-existing row the
caller tried to re-insert.

Fix: pre-filter against source before the insert to identify which
input ids already exist, then forward only the genuinely-new docs to
destination. A source-skipped duplicate is a no-op (no user write
happened), so nothing should propagate. This matches Greptile's
semantics: destination.dup.isEmpty() after skipping a duplicate.

While the logic was touched, the skip and non-skip paths (which shared
eligibility check, upgrade check, clone loop, destination write, and
after-filter loop) are unified into a single flow with $this->
skipDuplicates branches only at the points where behavior actually
differs:

  - source call: wrapped in source->skipDuplicates when active
  - $toForward: filter out pre-existing ids when active
  - destination call: wrapped in destination->skipDuplicates when active

Non-skip hot path pays zero new closure allocations; the pre-filter
block is gated behind if ($this->skipDuplicates) and short-circuits
to an empty $existingIds map.

Net: -46 lines in createDocuments (~170 → ~95), same test coverage,
Greptile's assertion now holds.
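The forwarding rule this commit lands on — a source-skipped duplicate represents no user write, so only ids absent from the source before the insert propagate to the destination — reduces to a simple filter (Python sketch; field names illustrative):

```python
def forward_after_prefilter(batch, existing_ids):
    """Mirror forwarding under skipDuplicates: documents whose id already
    existed in the source are no-ops there, so they must not carry the
    caller's would-be values to the destination. Only genuinely-new
    documents propagate."""
    return [doc for doc in batch if doc["id"] not in existing_ids]
```

This is what makes the divergence test hold: after a skipped duplicate, the destination receives nothing for that id, so it never picks up the caller's rejected values.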
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/Database/Mirror.php (1)

604-656: ⚠️ Potential issue | 🟠 Major

Pre-filter lookup is not tenant-aware for tenantPerDocument mode.

When tenantPerDocument is enabled, documents with the same $id but different tenants are distinct. The current implementation:

  1. Calls find() with all IDs at once without grouping by tenant (line 616)
  2. Keys $existingIds by $id alone (line 622)
  3. Filters $toForward by $id alone (line 650)

If the batch contains documents for multiple tenants, this incorrectly excludes new documents whose ID exists for a different tenant, or fails to detect true duplicates for the correct tenant.

Use the same tenant-aware pattern as Database::upsertDocumentsWithIncrease(): when tenantPerDocument is enabled, group documents by tenant and call find() per-tenant using withTenant(), then use composite keys ($tenant:$id) for tracking existing IDs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Mirror.php` around lines 604 - 656, The pre-filter that builds
$existingIds and filters $toForward must be made tenant-aware when
tenantPerDocument is enabled: instead of calling $this->source->find(...) once
for all IDs and keying by $id, group the $documents by tenant (use the same
pattern as Database::upsertDocumentsWithIncrease()), and for each tenant call
$this->source->withTenant($tenant)->silent(fn() => ->find(...)) to fetch only
that tenant's existing IDs; store keys in $existingIds as composite
"$tenant:$id" and update the filtering lambda for $toForward (and the earlier
array_map/array_filter that builds $ids) to use the same composite key so only
documents with matching tenant+id are considered duplicates.
♻️ Duplicate comments (1)
src/Database/Database.php (1)

5728-5732: ⚠️ Potential issue | 🔴 Critical

skipDuplicates() still leaves relationship side effects behind.

By the time execution reaches Line 5728, every input document has already gone through createDocumentRelationships(). If the adapter then ignores a duplicate parent row, those child/junction writes are still committed. That can orphan rows or mutate relationships on an existing document even though the parent create was skipped. Relationship writes need to be driven from the successfully inserted $batch, not from the pre-insert input list.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Database.php` around lines 5728 - 5732, The current flow runs
createDocumentRelationships() before calling the adapter and then uses
adapter->skipDuplicates() which can ignore parent inserts while relationship
writes have already been applied; change the flow so the adapter call (the
$insert closure -> $this->withTransaction(fn () =>
$this->adapter->createDocuments($collection, $chunk))) executes first to produce
the actual inserted $batch (ensure adapter->skipDuplicates(...) returns the real
inserted batch when used), and only after $batch is obtained call
createDocumentRelationships() and any child/junction writes against $batch (not
the original input). Adjust skipDuplicates() usage so it returns the inserted
documents batch and ensure relationship-writing code references $batch instead
of the pre-insert input list.
🧹 Nitpick comments (2)
src/Database/Adapter/Postgres.php (1)

2358-2358: Silence the unused override parameter warning.

$table is intentionally unused here. If the parent signature must stay as-is, rename it to $_table or add a local suppression so PHPMD stops flagging this override.

♻️ Small cleanup
-    protected function getInsertSuffix(string $table): string
+    protected function getInsertSuffix(string $_table): string
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Adapter/Postgres.php` at line 2358, The override of
getInsertSuffix(string $table) triggers an unused-parameter warning; rename the
unused parameter to $_table (or prefix it with an underscore) in the
getInsertSuffix method signature or add a local PHPMD suppression comment inside
that method so the linter stops flagging it; update the method signature in the
Postgres class (getInsertSuffix) and keep the parent signature unchanged.
tests/e2e/Adapter/Scopes/DocumentTests.php (1)

7930-7933: Clarify whether these assertions validate “processed” vs “created” semantics.

At Line 7930, Line 7984, Line 8110, and Line 8155, the tests assert counts and onNext emissions include duplicate IDs. That encodes a “processed inputs” contract, not “newly created rows.” Please make this explicit (e.g., rename to $processedCount / add a short comment) or add before/after count assertions to separately validate created-row deltas.

Also applies to: 7984-7986, 8110-8112, 8155-8157
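A short sketch of the suggested before/after delta check (PHPUnit style; the collection name, batch variable, and the `count()` helper call are illustrative):

```php
// Capture the row count before the duplicate-tolerant insert.
$beforeRows = $database->count('books');

$processed = $database->skipDuplicates(
    fn () => $database->createDocuments('books', $batchWithDuplicates)
);

$afterRows = $database->count('books');

// "Processed inputs" contract: duplicate ids still flow through onNext.
$this->assertCount(\count($batchWithDuplicates), $processed);

// "Created rows" contract: only genuinely new ids change the table.
$this->assertSame($expectedNewCreations, $afterRows - $beforeRows);
```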

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/Adapter/Scopes/DocumentTests.php` around lines 7930 - 7933, The
assertions using $count and $emittedIds currently validate “processed inputs”
semantics (they expect duplicate IDs for repeated processing) rather than newly
created rows; update the test to make this explicit by renaming variables (e.g.,
$processedCount, $processedEmittedIds) or adding a short comment above the block
clarifying “processed inputs may include duplicates,” and if you also need to
assert created-row deltas add before/after row-count checks (capture $beforeRows
= ... then run operation then assert $afterRows - $beforeRows equals expected
new creations) around the assertions that reference $count and $emittedIds (the
same pattern appears for the other occurrences currently asserting duplicates).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Database/Database.php`:
- Around line 7153-7184: The pre-read ID queries in the branches that build
$idsByTenant/$docIds must be chunked to respect the Query::equal('$id', ...)
limit controlled by $this->maxQueryValues; update the tenant loop (where you
build $idsByTenant and call $this->find(... Query::equal('$id', $tenantIds)
...)) and the else branch (where $docIds is used in $this->find(...
Query::equal('$id', $docIds) ...)) to split each array with array_chunk(...,
max(1, $this->maxQueryValues)) and call find for each chunk (using the same
authorization->skip/withTenant/silent wrappers), then merge all returned
documents into $existingDocs (use the same key format $tenant . ':' .
$doc->getId() in the tenant branch) or $existing as before so downstream code
sees the combined results. Ensure you preserve unique IDs before chunking and
maintain limit counts per chunk.
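The chunked tenant-branch pre-read described above might look like this (a sketch assuming the wrapper shapes named in the comment; the `$existingDocs` accumulation follows the stated `$tenant . ':' . $doc->getId()` key format):

```php
$existingDocs = [];
foreach ($idsByTenant as $tenant => $tenantIds) {
    $tenantIds = \array_values(\array_unique($tenantIds));
    // Chunk by the validator ceiling so Query::equal never exceeds it.
    foreach (\array_chunk($tenantIds, \max(1, $this->maxQueryValues)) as $chunk) {
        $found = $this->authorization->skip(fn () => $this->withTenant(
            $tenant,
            fn () => $this->silent(fn () => $this->find($collection->getId(), [
                Query::equal('$id', $chunk),
                Query::limit(\count($chunk)),
            ]))
        ));
        foreach ($found as $doc) {
            $existingDocs[$tenant . ':' . $doc->getId()] = $doc;
        }
    }
}
```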

In `@src/Database/Mirror.php`:
- Around line 614-624: The prefetch call in Mirror.php uses Query::equal('$id',
$ids) with an unchunked $ids array which can exceed DocumentsValidator limits;
change the logic that builds $existing by chunking $ids with array_chunk($ids,
max(1, $this->maxQueryValues)) (same pattern used in createDocuments() and
upsertDocumentsWithIncrease()), call $this->source->silent(fn() =>
$this->source->find($collection, [Query::equal('$id', $chunk),
Query::limit(count($chunk))])) for each chunk, and merge results into
$existingIds[$doc->getId()] = true to preserve current behavior. Ensure you
reference $this->maxQueryValues and use Query::equal('$id', $chunk) and
Query::limit(count($chunk)) per chunk.

---

Outside diff comments:
In `@src/Database/Mirror.php`:
- Around line 604-656: The pre-filter that builds $existingIds and filters
$toForward must be made tenant-aware when tenantPerDocument is enabled: instead
of calling $this->source->find(...) once for all IDs and keying by $id, group
the $documents by tenant (use the same pattern as
Database::upsertDocumentsWithIncrease()), and for each tenant call
$this->source->withTenant($tenant)->silent(fn() => ->find(...)) to fetch only
that tenant's existing IDs; store keys in $existingIds as composite
"$tenant:$id" and update the filtering lambda for $toForward (and the earlier
array_map/array_filter that builds $ids) to use the same composite key so only
documents with matching tenant+id are considered duplicates.
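Sketch of the tenant-aware pre-filter with composite keys (the grouping variable and filter shape are illustrative; the "$tenant:$id" key format is the one proposed above):

```php
// Group input ids by tenant, then look up existence per tenant.
$existingIds = [];
foreach ($idsByTenant as $tenant => $ids) {
    $found = $this->source->withTenant($tenant, fn () => $this->source->silent(
        fn () => $this->source->find($collection, [
            Query::equal('$id', $ids),
            Query::limit(\count($ids)),
        ])
    ));
    foreach ($found as $doc) {
        $existingIds[$tenant . ':' . $doc->getId()] = true;
    }
}

// Only tenant+id matches count as duplicates when filtering the forward set.
$toForward = \array_filter($documents, fn (Document $doc) =>
    !isset($existingIds[$doc->getTenant() . ':' . $doc->getId()]));
```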

---

Duplicate comments:
In `@src/Database/Database.php`:
- Around line 5728-5732: The current flow runs createDocumentRelationships()
before calling the adapter and then uses adapter->skipDuplicates() which can
ignore parent inserts while relationship writes have already been applied;
change the flow so the adapter call (the $insert closure ->
$this->withTransaction(fn () => $this->adapter->createDocuments($collection,
$chunk))) executes first to produce the actual inserted $batch (ensure
adapter->skipDuplicates(...) returns the real inserted batch when used), and
only after $batch is obtained call createDocumentRelationships() and any
child/junction writes against $batch (not the original input). Adjust
skipDuplicates() usage so it returns the inserted documents batch and ensure
relationship-writing code references $batch instead of the pre-insert input
list.

---

Nitpick comments:
In `@src/Database/Adapter/Postgres.php`:
- Line 2358: The override of getInsertSuffix(string $table) triggers an
unused-parameter warning; rename the unused parameter to $_table (or prefix it
with an underscore) in the getInsertSuffix method signature or add a local PHPMD
suppression comment inside that method so the linter stops flagging it; update
the method signature in the Postgres class (getInsertSuffix) and keep the parent
signature unchanged.

In `@tests/e2e/Adapter/Scopes/DocumentTests.php`:
- Around line 7930-7933: The assertions using $count and $emittedIds currently
validate “processed inputs” semantics (they expect duplicate IDs for repeated
processing) rather than newly created rows; update the test to make this
explicit by renaming variables (e.g., $processedCount, $processedEmittedIds) or
adding a short comment above the block clarifying “processed inputs may include
duplicates,” and if you also need to assert created-row deltas add before/after
row-count checks (capture $beforeRows = ... then run operation then assert
$afterRows - $beforeRows equals expected new creations) around the assertions
that reference $count and $emittedIds (the same pattern appears for the other
occurrences currently asserting duplicates).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 06d46ba1-a7eb-478b-ad90-a27f372a0327

📥 Commits

Reviewing files that changed from the base of the PR and between 455da70 and 934ec04.

📒 Files selected for processing (10)
  • src/Database/Adapter.php
  • src/Database/Adapter/Mongo.php
  • src/Database/Adapter/Pool.php
  • src/Database/Adapter/Postgres.php
  • src/Database/Adapter/SQL.php
  • src/Database/Adapter/SQLite.php
  • src/Database/Database.php
  • src/Database/Mirror.php
  • tests/e2e/Adapter/MirrorTest.php
  • tests/e2e/Adapter/Scopes/DocumentTests.php
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/Database/Adapter.php
  • src/Database/Adapter/SQLite.php
  • src/Database/Adapter/Mongo.php

Comment thread src/Database/Database.php Outdated
Comment thread src/Database/Mirror.php Outdated
Addresses CodeRabbit #3084994641 and #3084994657 (+ Greptile #3084466265).
Three call sites were passing unchunked id arrays into Query::equal('$id',
...), which throws QueryException from DocumentsValidator once the array
exceeds 5000 values.

Mirror::createDocuments (skipDuplicates pre-filter): defensive — Mirror's
public API accepts arbitrary-size input, and while current callers pass
batches below the limit, the code shouldn't throw on a theoretical
direct caller with >5000 input ids.

Database::upsertDocumentsWithIncrease (both tenantPerDocument and single-
tenant branches): real regression. The batched pre-read was introduced
in ae929db to restore per-tenant grouping for StatsUsage. It replaced
a per-doc getDocument() loop on main (which is safe from validator
limits because each call queries one id). Passing the full input array
to Query::equal now breaks at >5000 input docs — a path that previously
worked slowly on main.

All three sites use self::RELATION_QUERY_CHUNK_SIZE (5000) and
Query::limit(PHP_INT_MAX), matching the existing pattern used by the
five populateDocumentsRelationshipsBatch chunking sites. Accumulator
shape is adapted to our map-only use case: we build the existence set /
lookup map directly inside the chunk loop instead of carrying a flat
list we don't need.
Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
src/Database/Database.php (2)

7160-7187: ⚠️ Potential issue | 🟠 Major

Respect maxQueryValues when chunking the pre-read lookups.

These find() calls still validate Query::equal('$id', $chunk) against $this->maxQueryValues. If a caller lowers that limit below 5000, chunking by RELATION_QUERY_CHUNK_SIZE will still fail before the upsert starts.

✂️ Suggested change
-                foreach (\array_chunk($tenantIds, self::RELATION_QUERY_CHUNK_SIZE) as $chunk) {
+                foreach (\array_chunk($tenantIds, \max(1, $this->maxQueryValues)) as $chunk) {
@@
-                foreach (\array_chunk($docIds, self::RELATION_QUERY_CHUNK_SIZE) as $chunk) {
+                foreach (\array_chunk($docIds, \max(1, $this->maxQueryValues)) as $chunk) {

Based on learnings, both createDocuments() and upsertDocumentsWithIncrease() prefetch existing document IDs in chunks using array_chunk(..., max(1, $this->maxQueryValues)) to satisfy DocumentsValidator limits on Query::equal('$id', ...).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Database.php` around lines 7160 - 7187, The pre-read find()
calls chunk IDs using self::RELATION_QUERY_CHUNK_SIZE but Query::equal('$id',
$chunk) is validated against $this->maxQueryValues, so change the chunking in
both loops (the tenant branch that iterates over $idsByTenant and the else
branch that iterates over $docIds) to use array_chunk(..., max(1,
$this->maxQueryValues)) instead of self::RELATION_QUERY_CHUNK_SIZE so the chunk
size never exceeds the DocumentsValidator limit when calling find().

5728-5732: ⚠️ Potential issue | 🔴 Critical

skipDuplicates() still lets skipped rows mutate relationship state.

By the time this guard runs, every input document has already gone through createDocumentRelationships() at Lines 5720-5722. A row ignored here can still create/update children or junction rows, and duplicate child/junction inserts triggered during that earlier relationship pass still go through the normal createDocument() path. The insert result needs to be authoritative for relationship work, or those writes need to be deferred until after you know which rows were actually inserted.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Database.php` around lines 5728 - 5732, Summary:
duplicate-skipping allows skipped rows to still mutate relationships because
createDocumentRelationships runs before the adapter decides which rows are
inserted. Fix: change the control flow so relationship work is driven by the
authoritative insert result — either defer createDocumentRelationships until
after $batch is computed or make adapter->skipDuplicates return per-row
insertion outcomes and use that to gate relationship writes; specifically update
the sequence around the $insert closure, withTransaction,
adapter->skipDuplicates and createDocumentRelationships (and any calls to
createDocument) so relationship/junction creation only runs for rows confirmed
inserted by createDocuments (or is skipped per the adapter-provided insertion
map).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/Database/Database.php`:
- Around line 7160-7187: The pre-read find() calls chunk IDs using
self::RELATION_QUERY_CHUNK_SIZE but Query::equal('$id', $chunk) is validated
against $this->maxQueryValues, so change the chunking in both loops (the tenant
branch that iterates over $idsByTenant and the else branch that iterates over
$docIds) to use array_chunk(..., max(1, $this->maxQueryValues)) instead of
self::RELATION_QUERY_CHUNK_SIZE so the chunk size never exceeds the
DocumentsValidator limit when calling find().
- Around line 5728-5732: Summary: duplicate-skipping allows skipped rows to
still mutate relationships because createDocumentRelationships runs before the
adapter decides which rows are inserted. Fix: change the control flow so
relationship work is driven by the authoritative insert result — either defer
createDocumentRelationships until after $batch is computed or make
adapter->skipDuplicates return per-row insertion outcomes and use that to gate
relationship writes; specifically update the sequence around the $insert
closure, withTransaction, adapter->skipDuplicates and
createDocumentRelationships (and any calls to createDocument) so
relationship/junction creation only runs for rows confirmed inserted by
createDocuments (or is skipped per the adapter-provided insertion map).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c7a16723-e265-42e8-8fd4-3a629176caf0

📥 Commits

Reviewing files that changed from the base of the PR and between 934ec04 and fa0e373.

📒 Files selected for processing (2)
  • src/Database/Database.php
  • src/Database/Mirror.php
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/Database/Mirror.php

…SIZE

CodeRabbit #3084994641 + #3084994657 follow-up: the earlier chunking
fix used self::RELATION_QUERY_CHUNK_SIZE (5000), which matches the
existing populateDocumentsRelationshipsBatch pattern but doesn't
respect the user-configurable validator ceiling.

Query::equal('$id', $chunk) is validated by DocumentsValidator
against $this->maxQueryValues, not RELATION_QUERY_CHUNK_SIZE. A
caller doing $db->setMaxQueryValues(100) followed by any of the
three chunked pre-reads would still throw QueryException because
chunks of 5000 exceed the configured 100.

Switch to array_chunk(..., max(1, $this->maxQueryValues)) at all
three sites so chunk size always respects the actual validator
limit. In the default case (maxQueryValues = 5000) this produces
identical chunks to the old code; under customization it correctly
scales down.

Note: the 5 existing relationship-population chunking sites still
use RELATION_QUERY_CHUNK_SIZE and have the same latent bug under
customized maxQueryValues. Leaving that for a follow-up since
fixing the pattern globally is out of scope for this PR.
premtsd-code added a commit to utopia-php/migration that referenced this pull request Apr 15, 2026
CSV / JSON / appwrite-to-appwrite imports that re-run on the same
batch (e.g. user re-uploads the same file, or a worker retries a
failed chunk) currently throw DuplicateException and abort the whole
batch. Wrap the row-buffer flush in the new skipDuplicates() scope
guard so duplicate-by-id rows are silently no-op'd at the adapter
layer (INSERT IGNORE / ON CONFLICT DO NOTHING / $setOnInsert), letting
the rest of the batch proceed.

The existing skipRelationshipsExistCheck() wrapper is preserved
(FK-target guard); skipDuplicates composes around it.

Feature-branch note: depends on utopia-php/database's skipDuplicates()
scope guard from PR utopia-php/database#852. composer.json is
temporarily pinned to dev-csv-import-upsert-v2 with a 5.99.0 alias so
composer can resolve the 5.* constraint transitively. Must be reset
to the proper release version (^5.X.Y) once PR #852 merges and
utopia-php/database ships.
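The composed guards in the flush path might look like this (a sketch; the buffer variable and the surrounding flush context are illustrative, while the two guard names come from the commit message):

```php
// skipDuplicates composes around the existing FK-target guard:
// duplicate-by-id rows no-op at the adapter, and relationship
// existence checks stay skipped as before.
$database->skipDuplicates(
    fn () => $database->skipRelationshipsExistCheck(
        fn () => $database->createDocuments($collectionId, $rowBuffer)
    )
);
```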
premtsd-code added a commit to appwrite/appwrite that referenced this pull request Apr 15, 2026
Exposes two new optional boolean params on the three migration
creation endpoints so CSV / JSON / appwrite-to-appwrite imports can
choose how to handle rows whose IDs already exist at the destination.

Endpoints updated (app/controllers/api/migrations.php):
- POST /v1/migrations/appwrite
- POST /v1/migrations/csv/imports
- POST /v1/migrations/json/imports

Parameter semantics:
- overwrite=true  -> destination uses upsertDocuments instead of
                     createDocuments; existing rows are replaced
                     with imported values
- skip=true       -> destination wraps createDocuments in
                     skipDuplicates; existing rows are preserved
                     unchanged, duplicate-id rows silently no-op
- both false      -> default; fails fast on DuplicateException
                     (original behavior, unchanged)
- both true       -> overwrite wins (upsert subsumes skip)

Both params are stored in the migration Document's options array
(matches the existing pattern for destination behavior config like
path, size, delimiter, bucketId, etc.) and read back in the worker's
processDestination() to instantiate DestinationAppwrite with the
new constructor params.

Feature-branch note: depends on utopia-php/migration#feat/skip-duplicates
(DestinationAppwrite constructor params) which in turn depends on
utopia-php/database#852 (skipDuplicates scope guard). composer.json is
temporarily pinned to dev-feat/skip-duplicates and
dev-csv-import-upsert-v2 respectively; both must be reset to proper
release versions once the upstream PRs merge.
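Under the stated semantics, the parameter resolution in the worker's processDestination() reduces to a three-way branch (a sketch; the variable names are illustrative):

```php
if ($overwrite) {
    // overwrite=true (wins when both flags are set): replace existing rows.
    $database->upsertDocuments($collectionId, $documents);
} elseif ($skip) {
    // skip=true: existing rows preserved, duplicate-id rows silently no-op.
    $database->skipDuplicates(
        fn () => $database->createDocuments($collectionId, $documents)
    );
} else {
    // Default: fail fast on DuplicateException (original behavior).
    $database->createDocuments($collectionId, $documents);
}
```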
Comment thread src/Database/Database.php Outdated
$found = $this->authorization->skip(fn () => $this->withTenant($tenant, fn () => $this->silent(
    fn () => $this->find($collection->getId(), [
        Query::equal('$id', $chunk),
        Query::limit(PHP_INT_MAX),
Member


If the chunk size is always maxQueryValues, might as well use that here too

Comment thread src/Database/Database.php Outdated
    ])
)));
foreach ($found as $doc) {
    $existingDocs[$tenant . ':' . $doc->getId()] = $doc;
Member


We can use the tenantKey function here

Comment thread src/Database/Database.php Outdated
$existing = $this->authorization->skip(fn () => $this->silent(
    fn () => $this->find($collection->getId(), [
        Query::equal('$id', $chunk),
        Query::limit(PHP_INT_MAX),
Member


Same here

Comment thread src/Database/Mirror.php Outdated
Comment on lines +604 to +606
// In skipDuplicates mode, identify which input ids already exist on source.
// These will be silently no-oped by the adapter's INSERT IGNORE and must
// not propagate to destination — a skipped duplicate is not a user write.
Member


Same as earlier - we should still propagate duplicates to the destination; they might not be duplicates there

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
src/Database/Mirror.php (1)

622-656: Destination forwarding behavior change is intentional — consider simplifying the branch.

The updated comment correctly captures the design decision addressed in prior review feedback: forward every source-written document to the destination and let it decide per-row via its own INSERT IGNORE/ON CONFLICT path when skipDuplicates is active. This avoids the earlier pre-filter SELECT that could cause source/destination divergence when a row exists on source but not destination (or vice versa during an in-flight upgrade).

Minor readability nit: the two branches only differ by the outer skipDuplicates(...) wrapper, so this can collapse to a single composed closure to reduce duplication.

♻️ Optional refactor
-            if ($this->skipDuplicates) {
-                $this->destination->skipDuplicates(
-                    fn () => $this->destination->withPreserveDates(
-                        fn () => $this->destination->createDocuments($collection, $clones, $batchSize)
-                    )
-                );
-            } else {
-                $this->destination->withPreserveDates(
-                    fn () =>
-                    $this->destination->createDocuments(
-                        $collection,
-                        $clones,
-                        $batchSize,
-                    )
-                );
-            }
+            $write = fn () => $this->destination->withPreserveDates(
+                fn () => $this->destination->createDocuments($collection, $clones, $batchSize)
+            );
+
+            $this->skipDuplicates
+                ? $this->destination->skipDuplicates($write)
+                : $write();

One thing worth double-checking: in the non-skipDuplicates branch, if a row was written to source but already exists on destination (e.g., a partially-backfilled upgrade), the destination createDocuments will throw DuplicateException and be swallowed by the surrounding catch (\Throwable) on Line 668, which will abort the remaining $clones in the batch and skip all subsequent afterCreateDocument filter callbacks. Please confirm that is acceptable, or consider whether the destination write should always propagate the source's intent here.
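If the answer is that a pre-existing destination row must not abort the batch, one possible per-row shape is the following (illustrative only; it trades the batch insert for per-document writes):

```php
foreach ($clones as $clone) {
    try {
        $this->destination->withPreserveDates(
            fn () => $this->destination->createDocument($collection, $clone)
        );
    } catch (DuplicateException $e) {
        // Row already exists on destination (e.g. a partially backfilled
        // upgrade): skip this clone and keep writing the rest of the batch.
        continue;
    }
}
```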

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Mirror.php` around lines 622 - 656, The forwarding branch can be
simplified by composing a single closure that runs
$this->destination->withPreserveDates(fn()=>
$this->destination->createDocuments($collection, $clones, $batchSize)) and then
conditionally wrapping that closure with $this->destination->skipDuplicates(...)
when $this->skipDuplicates is true, removing the duplicated createDocuments
calls; update the block that builds $clones and the
writeFilters->beforeCreateDocument usage accordingly. Also verify the
error-handling behavior around createDocuments: confirm whether allowing a
DuplicateException (thrown by createDocuments when not using skipDuplicates) to
be caught by the surrounding catch and thereby abort remaining $clones and skip
subsequent writeFilters->afterCreateDocument calls is acceptable, and if not,
change the error handling so duplicate-row errors don’t abort the whole batch
(e.g., let them propagate differently or handle per-row).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Database/Database.php`:
- Around line 7153-7193: The unconditional prefetch removed the no-auth
"skipPreRead" behavior used by upsertDocumentsWithIncrease(), causing
caller-supplied $permissions to be persisted incorrectly; restore the original
branch: detect $skipPreRead = !empty($attribute) &&
!$this->authorization->getStatus() in upsertDocumentsWithIncrease(), and when
$skipPreRead is true skip the pre-read of existing documents (do not hydrate
$old/$existingDocs), ensure creates get default permissions [] and updates
preserve existing permissions (i.e. ignore incoming $permissions), and apply the
same conditional (no prefetch when skipPreRead) to the other similar block
referenced (around the later section 7213-7314) so both code paths behave
consistently.
- Around line 5728-5732: The relationship writes (createDocumentRelationships /
any createDocument calls) must be wrapped by the same skipDuplicates guard as
the batch insert so duplicates are skipped atomically; move or refactor so the
closure passed to adapter->skipDuplicates (or the $insert closure) contains both
the relationship expansion and the adapter->createDocuments call (or call
createDocumentRelationships inside the withTransaction closure) so that
adapter->skipDuplicates can opt out/skip all child/junction writes, not just
createDocuments. Ensure you reference and update the existing $insert closure,
createDocumentRelationships, createDocument, and adapter->skipDuplicates usage
so relationship writes occur inside the guarded/transactional block.

---

Nitpick comments:
In `@src/Database/Mirror.php`:
- Around line 622-656: The forwarding branch can be simplified by composing a
single closure that runs $this->destination->withPreserveDates(fn()=>
$this->destination->createDocuments($collection, $clones, $batchSize)) and then
conditionally wrapping that closure with $this->destination->skipDuplicates(...)
when $this->skipDuplicates is true, removing the duplicated createDocuments
calls; update the block that builds $clones and the
writeFilters->beforeCreateDocument usage accordingly. Also verify the
error-handling behavior around createDocuments: confirm whether allowing a
DuplicateException (thrown by createDocuments when not using skipDuplicates) to
be caught by the surrounding catch and thereby abort remaining $clones and skip
subsequent writeFilters->afterCreateDocument calls is acceptable, and if not,
change the error handling so duplicate-row errors don’t abort the whole batch
(e.g., let them propagate differently or handle per-row).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 32ae3c16-14a5-4d26-9d26-7bdefe39ae73

📥 Commits

Reviewing files that changed from the base of the PR and between fa0e373 and fbe5117.

📒 Files selected for processing (3)
  • src/Database/Database.php
  • src/Database/Mirror.php
  • tests/e2e/Adapter/MirrorTest.php
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/e2e/Adapter/MirrorTest.php

Comment thread src/Database/Database.php
Comment on lines +5728 to +5732
$insert = fn () => $this->withTransaction(fn () => $this->adapter->createDocuments($collection, $chunk));
// Set adapter flag before withTransaction so Mongo can opt out of a real txn.
$batch = $this->skipDuplicates
? $this->adapter->skipDuplicates($insert)
: $insert();
Contributor


⚠️ Potential issue | 🔴 Critical

Guard the relationship writes, not just the final batch insert.

skipDuplicates() only wraps adapter->createDocuments(), but createDocumentRelationships() already ran at Line 5721. If one of these parents is skipped as a duplicate, any child/junction writes done during relationship expansion are still applied, and many-to-many junction inserts can still throw DuplicateException because createDocument() does not honor this guard. That breaks the advertised “silently skip duplicate rows” behavior and can leave orphan relationship data.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Database.php` around lines 5728 - 5732, The relationship writes
(createDocumentRelationships / any createDocument calls) must be wrapped by the
same skipDuplicates guard as the batch insert so duplicates are skipped
atomically; move or refactor so the closure passed to adapter->skipDuplicates
(or the $insert closure) contains both the relationship expansion and the
adapter->createDocuments call (or call createDocumentRelationships inside the
withTransaction closure) so that adapter->skipDuplicates can opt out/skip all
child/junction writes, not just createDocuments. Ensure you reference and update
the existing $insert closure, createDocumentRelationships, createDocument, and
adapter->skipDuplicates usage so relationship writes occur inside the
guarded/transactional block.

Comment thread src/Database/Database.php
Comment on lines +7153 to +7193
if ($this->getSharedTables() && $this->getTenantPerDocument()) {
    $idsByTenant = [];
    foreach ($documents as $doc) {
        if ($doc->getId() !== '') {
            $idsByTenant[$doc->getTenant()][] = $doc->getId();
        }
    }
    foreach ($idsByTenant as $tenant => $tenantIds) {
        $tenantIds = \array_values(\array_unique($tenantIds));
        foreach (\array_chunk($tenantIds, \max(1, $this->maxQueryValues)) as $chunk) {
            $found = $this->authorization->skip(fn () => $this->withTenant($tenant, fn () => $this->silent(
                fn () => $this->find($collection->getId(), [
                    Query::equal('$id', $chunk),
                    Query::limit($this->maxQueryValues),
                ])
            )));
            foreach ($found as $doc) {
                $existingDocs[$this->tenantKey($doc)] = $doc;
            }
        }
    }
} else {
    $docIds = \array_values(\array_unique(\array_filter(
        \array_map(fn (Document $doc) => $doc->getId(), $documents),
        fn ($id) => $id !== ''
    )));

    if (!empty($docIds)) {
        foreach (\array_chunk($docIds, \max(1, $this->maxQueryValues)) as $chunk) {
            $existing = $this->authorization->skip(fn () => $this->silent(
                fn () => $this->find($collection->getId(), [
                    Query::equal('$id', $chunk),
                    Query::limit($this->maxQueryValues),
                ])
            ));
            foreach ($existing as $doc) {
                $existingDocs[$this->tenantKey($doc)] = $doc;
            }
        }
    }
}
Contributor

@coderabbitai coderabbitai bot Apr 17, 2026


⚠️ Potential issue | 🟠 Major

Restore the no-auth skipPreRead branch for increment upserts.

This unconditional prefetch changes the !empty($attribute) && !$this->authorization->getStatus() path. In that mode, incoming $permissions are supposed to be ignored; with $old always hydrated, the current skipPermissionsUpdate logic can now persist caller-supplied permissions on creates/updates. That is a behavior regression in upsertDocumentsWithIncrease().

Based on learnings, Database::upsertDocumentsWithIncrease() uses $skipPreRead = !empty($attribute) && !$this->authorization->getStatus(), and in that path incoming $permissions must be ignored while creates default to [] and updates preserve existing permissions.

Also applies to: 7213-7314

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Database/Database.php` around lines 7153 - 7193, The unconditional
prefetch removed the no-auth "skipPreRead" behavior used by
upsertDocumentsWithIncrease(), causing caller-supplied $permissions to be
persisted incorrectly; restore the original branch: detect $skipPreRead =
!empty($attribute) && !$this->authorization->getStatus() in
upsertDocumentsWithIncrease(), and when $skipPreRead is true skip the pre-read
of existing documents (do not hydrate $old/$existingDocs), ensure creates get
default permissions [] and updates preserve existing permissions (i.e. ignore
incoming $permissions), and apply the same conditional (no prefetch when
skipPreRead) to the other similar block referenced (around the later section
7213-7314) so both code paths behave consistently.

Contributor Author


skipPreRead is not merged into the main branch yet

Contributor


@premtsd-code Thanks for the clarification — acknowledged. Since skipPreRead hasn’t landed in main yet, my earlier regression note tied to that path doesn’t apply to this PR. The batch pre-read you added in upsertDocumentsWithIncrease() is fine as-is.

If you’d like, I can open a follow-up issue to re-check permission semantics once skipPreRead merges (so we ensure “updates preserve existing permissions; creates default to []; ignore incoming $permissions” in the no-auth + increment case). Happy to proceed if you want that tracked.


🧠 Learnings used
Learnt from: premtsd-code
Repo: utopia-php/database PR: 850
File: src/Database/Database.php:0-0
Timestamp: 2026-04-08T16:07:15.177Z
Learning: In src/Database/Database.php, both createDocuments() and upsertDocumentsWithIncrease() prefetch existing document IDs in chunks using array_chunk(..., max(1, $this->maxQueryValues)) for tenant-per-document and non-tenant modes to satisfy DocumentsValidator limits on Query::equal('$id', ...).

Learnt from: premtsd-code
Repo: utopia-php/database PR: 830
File: src/Database/Database.php:7086-7092
Timestamp: 2026-03-08T22:55:28.782Z
Learning: In src/Database/Database.php, Database::upsertDocumentsWithIncrease() sets $skipPreRead = !empty($attribute) && !$this->authorization->getStatus(), so the pre-read-skip path only runs when authorization is disabled. In this path, explicit $permissions on incoming documents are intentionally ignored; updates preserve existing permissions, and creates default to empty permissions ([]), since no auth layer enforces them. Stated by maintainer premtsd-code on PR `#830`.

Learnt from: fogelito
Repo: utopia-php/database PR: 733
File: src/Database/Adapter/MariaDB.php:1801-1806
Timestamp: 2025-10-16T09:37:33.531Z
Learning: In the MariaDB adapter (src/Database/Adapter/MariaDB.php), only duplicate `_uid` violations should throw `DuplicateException`. All other unique constraint violations, including `PRIMARY` key collisions on the internal `_id` field, should throw `UniqueException`. This is the intended design to distinguish between user-facing document duplicates and internal/user-defined unique constraint violations.

Learnt from: abnegate
Repo: utopia-php/database PR: 721
File: tests/e2e/Adapter/Scopes/DocumentTests.php:6418-6439
Timestamp: 2025-10-03T02:04:17.803Z
Learning: In tests/e2e/Adapter/Scopes/DocumentTests::testSchemalessDocumentInvalidInteralAttributeValidation (PHP), when the adapter reports getSupportForAttributes() === false (schemaless), the test should not expect exceptions from createDocuments for “invalid” internal attributes; remove try/catch and ensure the test passes without exceptions, keeping at least one assertion.
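The chunked-prefetch pattern from the first learning above can be sketched in isolation. This is a minimal illustration, not the real `Database.php` code: `chunkIds()` and `$maxQueryValues` are stand-ins mirroring the `array_chunk(..., max(1, $this->maxQueryValues))` pattern used so each `Query::equal('$id', ...)` stays under the DocumentsValidator value limit.

```php
<?php

/**
 * Hypothetical sketch: split a batch of document IDs into chunks no
 * larger than the query-value limit, guarding against a zero limit.
 */
function chunkIds(array $ids, int $maxQueryValues): array
{
    return array_chunk($ids, max(1, $maxQueryValues));
}

// 5 ids with a limit of 2 become 3 chunks: ['a','b'], ['c','d'], ['e'],
// i.e. three Query::equal('$id', ...) prefetch queries instead of one.
$chunks = chunkIds(['a', 'b', 'c', 'd', 'e'], 2);
```

Each chunk then drives one existence-check query, so neither tenant-per-document nor non-tenant mode ever exceeds the validator's per-query value count.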
