Data pipeline: thumbnail_url column is empty across all 6.7M samples #131

@rdhyee

Problem

The wide parquet (`isamples_202601_wide.parquet`) exposes a `thumbnail_url` column, but it is NULL / empty for every single row.

```sql
SELECT n AS source,
       COUNT(*) AS total,
       COUNT(*) FILTER (WHERE thumbnail_url IS NOT NULL AND thumbnail_url <> '') AS with_thumb
FROM read_parquet('https://data.isamples.org/isamples_202601_wide.parquet')
WHERE otype = 'MaterialSampleRecord'
GROUP BY 1
```

| source | total | with_thumb |
| --- | ---: | ---: |
| OPENCONTEXT | 1,064,831 | 0 |
| SESAR | 4,688,386 | 0 |
| SMITHSONIAN | 322,161 | 0 |
| GEOME | 605,554 | 0 |

Impact

  • No image-rich UX possible in the browser explorers without per-source API calls (slow, rate-limited)
  • Sample cards in the Interactive Explorer currently can't show a preview
  • Discovery features like "find beautiful samples near X" (see #130, "Showcase samples: make the 4 front-page images locate themselves on the globe") would require hitting OpenContext / iDigBio / Smithsonian / SESAR individually for each candidate
  • The column's presence advertises a capability that doesn't exist — confusing for PQG consumers building tools on top of the parquet

Upstream availability (spot-checks)

  • OpenContext: rich media at `/subjects/.json` → `oc-gen:has-obs[*]` → linked media records. Poggio Civitate alone has 209,234 media items.
  • Smithsonian NMNH: iDigBio indexes species with `hasMedia:true`. E.g. `Paracirrhites arcatus` has ≥5 media records per iDigBio search.
  • SESAR: rarely has images (geological samples).
  • GEOME: sparse photos via FIMS.

So the raw data exists — it's the PQG pipeline that isn't populating the column.

Proposal

Decide whether the pipeline should populate `thumbnail_url`:

Option 1: Populate it

Extend the narrow → wide conversion (or the source → PQG ingester) to extract a thumbnail URL per source:

  • OpenContext: first `oc-gen:has-obs.*.link-media[0]` with `dc-terms:format` starting with `image/`
  • Smithsonian: iDigBio's `dwc:associatedMedia` → resolve via iDigBio's media URL pattern
  • SESAR / GEOME: skip (likely NULL)
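The OpenContext rule above could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names `first_image_url` and `thumbnail_for` are hypothetical, the nesting of `link-media` inside each `oc-gen:has-obs` entry is read directly off the path in the bullet, and the use of an `id` field as the media URL is an assumption that would need checking against real OpenContext JSON. The Smithsonian / iDigBio resolution is not sketched because the media URL pattern is not given here.

```python
from typing import Any, Optional


def first_image_url(record: dict[str, Any]) -> Optional[str]:
    """Return the URL of the first observation's leading media item that is an image.

    Follows the rule `oc-gen:has-obs.*.link-media[0]` with a
    `dc-terms:format` starting with `image/`. Returns None if nothing matches.
    """
    for obs in record.get("oc-gen:has-obs", []):
        media_items = obs.get("link-media", [])
        if not media_items:
            continue
        media = media_items[0]  # only the first media item per observation
        fmt = media.get("dc-terms:format", "")
        if isinstance(fmt, str) and fmt.startswith("image/"):
            # ASSUMPTION: the media URL lives under "id"; verify against real records.
            return media.get("id")
    return None


def thumbnail_for(source: str, record: dict[str, Any]) -> Optional[str]:
    """Hypothetical per-source dispatcher for the wide-table thumbnail_url value."""
    if source == "OPENCONTEXT":
        return first_image_url(record)
    # SESAR / GEOME: skipped (likely NULL). SMITHSONIAN: not sketched here.
    return None


if __name__ == "__main__":
    # Made-up record for illustration only; not real OpenContext data.
    example = {
        "oc-gen:has-obs": [
            {"link-media": [{"id": "https://example.org/doc.pdf",
                             "dc-terms:format": "application/pdf"}]},
            {"link-media": [{"id": "https://example.org/photo.jpg",
                             "dc-terms:format": "image/jpeg"}]},
        ]
    }
    print(thumbnail_for("OPENCONTEXT", example))
```

Returning `None` for sources without a rule keeps the column NULL rather than empty-string, so the `IS NOT NULL` check in the query above stays a meaningful coverage measure.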

Estimated effort: ~2 days, modest ongoing maintenance.

Option 2: Remove the column

If no one is going to populate it, drop it from the schema so consumers don't have false expectations. Estimated effort: trivial.

Option 3: Leave as-is, document the gap

Update `how-to-use.qmd` or the PQG spec to note the column exists but is empty pending future work.
