Data pipeline: thumbnail_url column is empty across all 6.7M samples #131

## Problem

The wide parquet (`isamples_202601_wide.parquet`) exposes a `thumbnail_url` column, but it is **NULL or empty for every `MaterialSampleRecord` row**.

```sql
SELECT n AS source,
       COUNT(*) AS total,
       COUNT(*) FILTER (WHERE thumbnail_url IS NOT NULL AND thumbnail_url <> '') AS with_thumb
FROM read_parquet('https://data.isamples.org/isamples_202601_wide.parquet')
WHERE otype = 'MaterialSampleRecord'
GROUP BY 1;
```

| source | total | with_thumb |
|---|---:|---:|
| OPENCONTEXT | 1,064,831 | 0 |
| SESAR | 4,688,386 | 0 |
| SMITHSONIAN | 322,161 | 0 |
| GEOME | 605,554 | 0 |

## Impact

- No image-rich UX possible in the browser explorers without per-source API calls (slow, rate-limited)
- Sample cards in the Interactive Explorer currently can't show a preview
- Discovery features like "find beautiful samples near X" (see #130) would require querying OpenContext, iDigBio, Smithsonian, or SESAR individually for each candidate
- The column's presence advertises a capability that doesn't exist, which is confusing for PQG consumers building tools on top of the parquet

## Upstream availability (spot-checks)

- **OpenContext**: rich media at `/subjects/<uuid>.json` → `oc-gen:has-obs[*]` → linked media records. Poggio Civitate alone has 209,234 media items.
- **Smithsonian NMNH**: iDigBio indexes species with `hasMedia:true`. For example, an iDigBio search for `Paracirrhites arcatus` returns ≥5 media records.
- **SESAR**: rarely has images (geological samples).
- **GEOME**: sparse photos via FIMS.

So the raw data exists, at least for OpenContext and Smithsonian; it's the PQG pipeline that isn't populating the column.

## Proposal

Decide whether the pipeline should populate `thumbnail_url`:

### Option 1: Populate it

Extend the narrow → wide conversion (or the source→PQG ingester) to extract a thumbnail URL per source:
- OpenContext: first `oc-gen:has-obs.*.link-media[0]` with `dc-terms:format` starting with `image/`
- Smithsonian: iDigBio's `dwc:associatedMedia` → resolve via iDigBio's media URL pattern
- SESAR / GEOME: skip (likely NULL)
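
A rough sketch of what the per-source extraction could look like, to make the OpenContext rule above concrete. This is not the ingester's actual code: the function names are made up for illustration, the record is assumed to be the already-parsed `/subjects/<uuid>.json` with `oc-gen:has-obs` as a list of observations each carrying a `link-media` list, and `URL_KEY` is a guess at where the media URL lives. The Smithsonian branch is left out because the iDigBio media URL pattern still needs to be pinned down.

```python
from typing import Any, Optional

# Key names are taken from the path quoted in this issue.
OBS_KEY = "oc-gen:has-obs"
MEDIA_KEY = "link-media"
FORMAT_KEY = "dc-terms:format"
# ASSUMPTION: hypothetical key holding the media URL; verify against real records.
URL_KEY = "id"


def first_image_url(record: dict[str, Any]) -> Optional[str]:
    """Return the URL of the first linked media item whose format is image/*."""
    for obs in record.get(OBS_KEY) or []:
        for media in obs.get(MEDIA_KEY) or []:
            fmt = str(media.get(FORMAT_KEY) or "")
            if fmt.startswith("image/") and media.get(URL_KEY):
                return media[URL_KEY]
    return None


def thumbnail_for(source: str, record: dict[str, Any]) -> Optional[str]:
    """Pick a thumbnail URL per source; None becomes NULL in the parquet."""
    if source == "OPENCONTEXT":
        return first_image_url(record)
    # SMITHSONIAN: needs iDigBio resolution (not sketched here).
    # SESAR / GEOME: skipped, as proposed above.
    return None


# Synthetic record for illustration only (not real OpenContext data).
example = {
    "oc-gen:has-obs": [
        {
            "link-media": [
                {"id": "https://example.org/media/doc.pdf",
                 "dc-terms:format": "application/pdf"},
                {"id": "https://example.org/media/photo.jpg",
                 "dc-terms:format": "image/jpeg"},
            ]
        }
    ]
}

print(thumbnail_for("OPENCONTEXT", example))
# → https://example.org/media/photo.jpg
```

Returning `None` for sources without a rule keeps the column nullable, so consumers can keep using the `IS NOT NULL AND <> ''` check from the query in the Problem section.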

Estimated effort: ~2 days, modest ongoing maintenance.

### Option 2: Remove the column

If no one is going to populate it, drop it from the schema so consumers don't have false expectations. Estimated effort: trivial.

### Option 3: Leave as-is, document the gap

Update `how-to-use.qmd` or the PQG spec to note the column exists but is empty pending future work.

## Related

- Discovered while investigating #130 ("Showcase samples: make the 4 front-page images locate themselves on the globe"); the showcase hunt hit this wall hard.
