You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Both work for a single layer of supplements. This issue is for the next step up: a network of parquet files where later files can supplement AND override earlier ones, resolved either at build time (baked into a new flat file) or at query time (composed on the fly).
Saved here for future reference. No immediate action needed.
Base row wins unless the named columns are NULL; overlay fills in. Current enrichment pattern.
upsert
Overlay row replaces base row on key match. Can also add new rows that don't exist in base.
delete
Overlay acts as a deny-list — rows with matching keys are removed from the composition. Useful for withdrawals.
Resolution strategies
Build-time composition (recommended default):
A script reads layers/manifest.json, applies overlays in listed order, writes a flattened isamples_YYYYMM_wide.parquet
/current/wide.parquet redirects to the flattened file
Consumers see one parquet, no awareness of the overlay chain
Cheap at query time, expensive at build time (fine — builds are monthly)
Query-time composition (escape hatch):
DuckDB-WASM reads the manifest, performs the UNION/JOIN/ANTI JOIN per query
Useful for "hot fix between compactions" and for dev environments
Every browser query re-pays the composition cost
Best used briefly — overlay chain should be short-lived
Tradeoffs / constraints
Precedence must be explicit. Overlays can touch overlapping columns or rows. Without a declared order + mode, determinism breaks.
Overlay stacks grow. Deep chains → slow composition. Need a compaction cycle that bakes overlays into a new base periodically (monthly? on a trigger?).
Browser consumption limits sophistication. DuckDB-WASM + HTTP range requests work fine for a handful of overlays; they're not a query engine. Don't try to emulate a full lakehouse.
R2 cost scales with overlay count. Each file = separate HTTP round trips when query-time composing. Keep overlays coarse-grained.
What this is NOT
Not Apache Iceberg or Delta Lake. Both handle this class of problem natively (transaction logs, time travel, schema evolution) but require writer tooling we don't have and runtime support that's awkward for browser consumption. Overkill for iSamples scale.
Not urgent. The two current patterns (/current/ alias + single-layer sidecar) cover everything in flight. File this as a design marker for when a second pattern user appears.
Open questions (for when someone picks this up)
What's the canonical place for layers/manifest.json? Alongside the base (data.isamples.org/layers/wide_manifest.json) or under current/?
Should overlay files use the same schema as base, or only carry the join key + changed columns? (Trade: storage vs schema fidelity.)
Compaction trigger: on schedule, on overlay-count threshold, or on-demand via a workflow dispatch?
Deletion semantics — do we need "tombstone" overlays for withdrawn samples, or is a full rebuild good enough when withdrawals happen?
Does the enrichment script become part of pqg/ (the proper home for build-time transforms) or stay under scripts/ in this repo?
PQG schema should have landed the media-related columns (media_license, media_is_public, media_source_url, media_harvested_at) before overlays start carrying them.
Context
We now have two useful patterns for the iSamples parquet family:
/current/<flavor>.parquetredirects to a dated build (PRs Mirror Eric's OC PQG to R2 with immutable cache + drift-check script #132, Add /current/ alias layer + OC-thumbnail enrichment script #133, Tutorials query /current/wide.parquet instead of dated file #134) — stable URL, immutable dated target.Both work for a single layer of supplements. This issue is for the next step up: a network of parquet files where later files can supplement AND override earlier ones, resolved either at build time (baked into a new flat file) or at query time (composed on the fly).
Saved here for future reference. No immediate action needed.
Use cases we'd unlock
dev/layer that overridescurrent/for early users opting in via a URL param, without touching production.Design sketch
A small
layers/manifest.jsonserved alongside the base file, ordering overlays with explicit precedence:{ "primary": "https://data.isamples.org/isamples_202604_wide.parquet", "overlays": [ { "url": ".../harvested/oc_thumbs_20251107.parquet", "key": "pid", "mode": "left_join_coalesce", "columns": ["thumbnail_url", "media_license"] }, { "url": ".../corrections/wide_fixups_20260420.parquet", "key": "pid", "mode": "upsert", "columns": "*" }, { "url": ".../harvested/smithsonian_thumbs_20260501.parquet", "key": "pid", "mode": "left_join_coalesce", "columns": ["thumbnail_url"] } ] }Modes
modeleft_join_coalesceupsertdeleteResolution strategies
Build-time composition (recommended default):
layers/manifest.json, applies overlays in listed order, writes a flattenedisamples_YYYYMM_wide.parquet/current/wide.parquetredirects to the flattened fileQuery-time composition (escape hatch):
Tradeoffs / constraints
What this is NOT
/current/alias + single-layer sidecar) cover everything in flight. File this as a design marker for when a second pattern user appears.Open questions (for when someone picks this up)
layers/manifest.json? Alongside the base (data.isamples.org/layers/wide_manifest.json) or undercurrent/?pqg/(the proper home for build-time transforms) or stay underscripts/in this repo?Prerequisites if/when we implement
media_license,media_is_public,media_source_url,media_harvested_at) before overlays start carrying them.Related
/current/alias + OC mirror