Add CQL filter passthrough to OGC waterdata getters #238
thodson-usgs merged 14 commits into DOI-USGS:main from
Conversation
Every `get_*` function that targets an OGC collection (`continuous`, `daily`, `field_measurements`, `monitoring_locations`, `time_series_metadata`, `latest_continuous`, `latest_daily`, `channel`) now accepts `filter` and `filter_lang` kwargs that are forwarded as the OGC `filter` / `filter-lang` query parameters. This unlocks server-side expressions that aren't expressible via the other kwargs. The motivating use case is pulling one-shot windows of continuous data around many field-measurement timestamps in a single request via OR'd BETWEEN clauses, instead of N round-trips.

Caveats documented in each docstring and NEWS.md:
- The server currently accepts `cql-text` (default) and `cql-json`; `cql2-text` / `cql2-json` are not yet supported.
- Long filters can exceed the URI length limit. A `UserWarning` is emitted above 5000 characters and the practical cap is around 75 OR-clauses before the server returns HTTP 414.

Includes unit tests covering the filter / filter-lang URL construction for all OGC services and the long-filter warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
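The OR'd-BETWEEN pattern described above can be sketched as follows. The `filter` kwarg on `get_continuous` (shown in the comment) is the one this PR adds; the station, parameter code, timestamps, window width, and `time` property name are illustrative assumptions, not values from the PR.

```python
# Sketch: build a cql-text filter that ORs a BETWEEN window around each
# field-measurement timestamp, so one request replaces N round-trips.
from datetime import datetime, timedelta

timestamps = [  # hypothetical field-measurement times
    datetime(2023, 6, 1, 14, 30),
    datetime(2023, 6, 15, 9, 0),
    datetime(2023, 7, 2, 11, 45),
]

def between_clause(ts, pad=timedelta(minutes=15)):
    """One BETWEEN clause covering a window around a timestamp."""
    lo = (ts - pad).strftime("%Y-%m-%dT%H:%M:%SZ")
    hi = (ts + pad).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"time BETWEEN '{lo}' AND '{hi}'"

cql = " OR ".join(between_clause(ts) for ts in timestamps)

# Hypothetical usage with the new kwarg (filter_lang defaults to cql-text):
# df, md = dataretrieval.waterdata.get_continuous(
#     monitoring_location_id="USGS-02238500",
#     parameter_code="00060",
#     filter=cql,
# )
```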
A CQL `filter` made up of a top-level `OR` chain can exceed the server's URI length limit. Rather than asking the caller to handle that themselves, split the expression along its top-level OR boundaries into chunks that each fit under a conservative budget (`_CQL_FILTER_CHUNK_LEN`), issue one request per chunk, and concatenate the results (deduplicated by the service's output id). Splitting is paren- and quote-aware so `OR` inside sub-expressions or string literals is preserved. When the expression has no top-level OR — or any single clause already exceeds the budget — the filter is sent as-is (the server decides) rather than being mangled.

Drops the 5000-character `UserWarning` added in the previous commit: chunking handles the common case transparently, and the docstring caveat about `HTTP 414` / `~75 OR-clauses` is no longer needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
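The greedy packing step can be sketched as below. This assumes the clauses have already been split at top-level OR boundaries; the function name and budget value are illustrative, not the library's private `_chunk_cql_or` API.

```python
# Sketch: pack pre-split clauses into " OR "-joined chunks that each stay
# under a byte budget. A single clause larger than the budget is emitted
# on its own rather than being mangled (the server then decides).
def pack_or_chunks(clauses, budget=5000):
    chunks, current = [], []
    for clause in clauses:
        candidate = " OR ".join(current + [clause])
        if current and len(candidate) > budget:
            # adding this clause would overflow: flush the current chunk
            chunks.append(" OR ".join(current))
            current = [clause]
        else:
            current.append(clause)
    if current:
        chunks.append(" OR ".join(current))
    return chunks
```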
- Introduce a `FILTER_LANG = Literal["cql-text", "cql-json"]` type alias alongside the existing `SERVICES` / `PROFILES` Literals in `waterdata/types.py`, export it from the package, and use it in all eight OGC getter signatures.
- Simplify the fan-out dispatch in `get_ogc_data`: one ternary picks the chunk list, and the single-chunk fast path is expressed as the early branch of an `if len(frames) == 1`.
- Drop the tautological `test_default_chunk_budget_is_conservative` — the integration test already asserts each sub-request URL stays under the budget.
- Extract `OGC_CONTINUOUS_URL` in the test file and strip a handful of WHAT-narrating comments in both the implementation and tests. The `filter_lang`/`filter-lang` mapping comment stays because the WHY (hyphens invalid in Python identifiers) isn't obvious.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the single-paragraph filter announcement with a broader round-up covering the post-release additions: `get_channel`, `get_stats_por` / `get_stats_date_range`, `get_reference_table` and its query-parameter passthrough, the `py.typed` marker, `pandas` 3.x support, and the removal of the `waterwatch` module and several defunct NWIS stubs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR exposes OGC API CQL filtering to the waterdata OGC collection getters by adding filter / filter_lang kwargs, forwarding them to the filter / filter-lang query parameters, and implementing transparent request fan-out for oversized top-level OR filter chains to avoid URI-length (414) failures.
Changes:
- Add `filter` and `filter_lang` kwargs to OGC-based getters and document their behavior.
- Translate `filter_lang` → `filter-lang` in request construction and implement OR-chain splitting/chunking for long filters.
- Add unit + integration-style mocked tests covering passthrough, hyphenation, and chunking behavior; update NEWS.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| dataretrieval/waterdata/api.py | Extends OGC getter signatures + docstrings to accept filter / filter_lang. |
| dataretrieval/waterdata/utils.py | Implements filter_lang URL key translation, OR splitting/chunking, and multi-request fan-out in get_ogc_data. |
| dataretrieval/waterdata/types.py | Adds FILTER_LANG type alias for supported filter languages. |
| dataretrieval/waterdata/__init__.py | Re-exports FILTER_LANG. |
| tests/waterdata_utils_test.py | Adds tests for filter passthrough, hyphenation, splitter/chunker semantics, and fan-out behavior. |
| NEWS.md | Announces new filter passthrough + chunking capability. |
- Dedupe on pre-rename feature `id` (always present at that stage) instead of `output_id`, which is the post-rename name and may not be on every OGC service's response.
- Aggregate elapsed time across chunk responses so the returned metadata's query_time reflects the whole operation rather than just the last chunk.
- Drop the redundant `continuous_id` from the fan-out test's mock properties so the assertion exercises the real `id`-based dedup path, and add a separate test that forces cross-chunk duplicate feature ids to prove they collapse to a single row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
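The cross-chunk combine step this commit describes can be sketched with pandas: concatenate per-chunk frames, dedup on the raw feature `id` column (before any service-specific rename), and sum elapsed times. The frame contents and timings here are made-up illustrations.

```python
# Sketch: combine per-chunk results, dedup by pre-rename "id", and
# aggregate elapsed time across all chunk responses.
import pandas as pd
from datetime import timedelta

chunk1 = pd.DataFrame({"id": ["f1", "f2"], "value": [1.0, 2.0]})
chunk2 = pd.DataFrame({"id": ["f2", "f3"], "value": [2.0, 3.0]})  # "f2" overlaps

combined = (
    pd.concat([chunk1, chunk2], ignore_index=True)
    .drop_duplicates(subset="id")
    .reset_index(drop=True)
)

# query_time should reflect the whole operation, not just one chunk
chunk_elapsed = [timedelta(seconds=0.4), timedelta(seconds=0.6)]
elapsed = sum(chunk_elapsed, timedelta())
```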
- `_chunk_cql_or` splits on the literal substring `" OR "` and is only
quote-aware for single quotes (CQL-text). Applying it to CQL-JSON
would corrupt JSON string values or produce nonsense sub-requests.
Gate chunking to `filter_lang in {None, "cql-text"}` and pass other
languages through as a single request.
- Replace the `requests_mock`-based fan-out/dedup tests with lighter
`mock.patch` stubs of `_construct_api_requests` / `_walk_pages`,
which also removes the py<3.10 skip (the tests no longer touch any
HTTP or py3.10-only paths). Strengthen the fan-out assertion to
`sent_filters == expected_chunks`.
- Add `test_cql_json_filter_is_not_chunked` to pin the new guard.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
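The language gate described above can be sketched as a small dispatch. The function name is illustrative; only the `{None, "cql-text"}` condition comes from the commit message.

```python
# Sketch: only cql-text (or the default None) is safe to chunk; the
# splitter is text- and single-quote-aware and would corrupt a cql-json
# document, so other languages go through as a single request.
def plan_requests(cql, filter_lang, chunker):
    if filter_lang in (None, "cql-text"):
        return chunker(cql)  # may fan out into several sub-requests
    return [cql]             # cql-json etc.: send as-is
```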
Follow the existing ``time`` date-range example with two CQL-text ``filter`` examples: a two-interval OR expression (the common "pull several disjoint windows in one call" case), and a longer programmatically-built chain that shows the pattern used when pairing many discrete-measurement timestamps with surrounding instantaneous data (which is what the client's transparent chunking is there to support). Both examples were verified against the live Water Data OGC API on USGS-02238500 (00060). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull the state machine out into ``_iter_or_boundaries``, a generator that yields ``(start, end)`` spans of each top-level ``OR`` separator, and reduce ``_split_top_level_or`` to a short slice loop over those spans. Behaviour is unchanged (all 26 existing tests pass); the win is readability — each function now has one job instead of three, and the producer/consumer split mirrors how ``re.finditer`` / ``tokenize`` are structured elsewhere in the stdlib. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
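The producer/consumer split can be sketched as follows: a generator yields `(start, end)` spans of each top-level `" OR "` separator (skipping ORs inside parens or single-quoted strings), and the splitter is a short slice loop over those spans. Names mirror the commit message but the bodies are illustrative, not the library's private implementation.

```python
# Sketch: paren- and quote-aware scan for top-level " OR " separators.
def iter_or_boundaries(expr):
    depth, in_quote, i = 0, False, 0
    while i < len(expr):
        ch = expr[i]
        if ch == "'":
            in_quote = not in_quote       # naive toggle; handles '' too
        elif not in_quote:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif depth == 0 and expr[i:i + 4] == " OR ":
                yield (i, i + 4)
                i += 4
                continue
        i += 1

def split_top_level_or(expr):
    # Slice the expression between the separator spans.
    clauses, start = [], 0
    for sep_start, sep_end in iter_or_boundaries(expr):
        clauses.append(expr[start:sep_start])
        start = sep_end
    clauses.append(expr[start:])
    return clauses
```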
The previous 5 KB raw-filter budget was a static approximation.
Empirically the Water Data API returns HTTP 414 at ~8,200 bytes of
total URL, matching nginx's default 8 KB large_client_header_buffers.
The raw-filter budget leaves unknown headroom that varies with:
- URL encoding (a uniform time-interval filter inflates ~1.4x; heavy
special-char content inflates more)
- the URL space consumed by other query params
Expose ``_WATERDATA_URL_BYTE_LIMIT = 8000`` with a comment describing
what the limit represents, and add ``_effective_filter_budget`` which
probes each request's non-filter URL cost and converts the remaining
URL budget back to raw CQL bytes via the filter's own encoding ratio.
``get_ogc_data`` now uses that per-request budget instead of the fixed
constant.
Verified live: a 34 KB OR-chain that previously split into 8 chunks now
packs into 7, with every produced URL staying at ~7.9 KB (well under
the 8 KB limit and below the 8.2 KB observed 414 cliff).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
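The budget conversion can be sketched as below. The 8000-byte limit and the pass-through guard for an oversized non-filter URL follow the commit messages; the function name, the probe input (here just a pre-measured non-filter URL length), and the whole-filter ratio are simplifications of the described per-request probe.

```python
# Sketch: convert the remaining URL byte budget into a raw-CQL byte
# budget via the filter's own percent-encoding inflation ratio.
from urllib.parse import quote

URL_BYTE_LIMIT = 8000  # observed HTTP 414 cliff is ~8,200 total URL bytes

def effective_filter_budget(cql, non_filter_url_len):
    available = URL_BYTE_LIMIT - non_filter_url_len
    if available <= 0:
        # Nothing could fit: return a budget larger than the filter so
        # it passes through unchanged and the server returns one clear
        # 414 instead of N guaranteed-to-fail sub-requests.
        return len(cql) + 1
    ratio = len(quote(cql)) / len(cql)  # e.g. each space -> "%20" (3x)
    return int(available / ratio)
```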
The whole-filter ratio is an average; a chunk that happens to contain only the heavier-encoding clauses (e.g. heavy clauses clustered at one end of the filter) can exceed the average ratio and push the full URL a few bytes past _WATERDATA_URL_BYTE_LIMIT. The overflow was invisible in practice — the 8,000 declared budget vs 8,200 observed 414 cliff gave enough headroom — but the computed budget was technically being violated, and a more adversarial clause mix could grow the overflow. Compute the encoding ratio from the heaviest-encoding clause instead of the whole filter. Adds one extra chunk on adversarial inputs (8 instead of 7 for 100 heavy + 400 light) in exchange for every chunk provably staying under the declared URL limit. Verified live: the adversarial clustered-heavy filter now produces 8 chunks with max URL 7806 bytes, all returning 200 OK. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR exposes OGC API CQL filtering through the waterdata OGC collection getters by adding filter / filter_lang kwargs, forwarding them to the underlying request builder, and implementing transparent request fan-out for oversized top-level OR filters to avoid HTTP 414 URI-length failures.
Changes:
- Add `filter` and `filter_lang` kwargs to the OGC-backed `get_*` functions and re-export `FILTER_LANG`.
- Translate `filter_lang` → `filter-lang` in OGC request construction and add CQL top-level-OR splitting + chunking utilities.
- In `get_ogc_data`, split oversized top-level `OR` filters into multiple requests, concatenate results, deduplicate by feature `id`, and aggregate elapsed time across chunks; add unit tests covering all of the above.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| dataretrieval/waterdata/utils.py | Implements filter-lang hyphenation, CQL OR-splitting/chunking, request fan-out/concat/dedup, and URL-budget computation. |
| dataretrieval/waterdata/api.py | Adds filter / filter_lang kwargs + docstrings to OGC collection getters. |
| dataretrieval/waterdata/types.py | Introduces the FILTER_LANG = Literal["cql-text", "cql-json"] type alias. |
| dataretrieval/waterdata/__init__.py | Re-exports FILTER_LANG from waterdata. |
| tests/waterdata_utils_test.py | Adds extensive unit tests for passthrough, hyphenation, splitting/chunking, fan-out behavior, and cross-chunk dedup. |
| NEWS.md | Documents the new filter passthrough + chunking behavior. |
``_get_resp_data`` returns a plain ``pd.DataFrame()`` when a response contains no features, regardless of whether geopandas is enabled. If that empty frame lands first in ``pd.concat([empty, geodf, ...])``, concat can downgrade the result back to a plain DataFrame — silently dropping geometry and CRS when later chunks would have provided them. Drop the empties before concatenation. They contribute no rows either way, so discarding them is safe and keeps the GeoDataFrame type intact whenever any chunk returned one. When every chunk is empty, fall through with a plain ``pd.DataFrame()`` — same behavior as today. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
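The guard described above can be sketched with plain pandas. The geometry/CRS downgrade itself concerns geopandas (a zero-row plain `DataFrame` first in `pd.concat` can demote a later `GeoDataFrame`); this runnable sketch only demonstrates the drop-empties-then-concat pattern, with made-up frame contents.

```python
# Sketch: drop zero-row frames before concat so an empty plain DataFrame
# can never sit first in the list and downgrade the result type.
import pandas as pd

frames = [
    pd.DataFrame(),                               # chunk with no features
    pd.DataFrame({"id": ["f1"], "value": [1.0]}),
    pd.DataFrame({"id": ["f2"], "value": [2.0]}),
]

non_empty = [f for f in frames if not f.empty]
result = (
    pd.concat(non_empty, ignore_index=True)
    if non_empty
    else pd.DataFrame()  # every chunk empty: same behavior as before
)
```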
If ``_effective_filter_budget`` probes a request whose non-filter URL is already at or past ``_WATERDATA_URL_BYTE_LIMIT``, no chunk we could produce would fit — yet the previous ``max(100, int(available/ratio))`` floored the budget to 100 raw bytes, which ``_chunk_cql_or`` happily used to pack single-clause chunks. For a filter with N clauses that meant N guaranteed-414 sub-requests instead of one clear failure. Detect ``available_url_bytes <= 0`` and return a budget larger than the filter itself; ``_chunk_cql_or``'s first short-circuit then passes the expression through unchanged. The server returns one 414, which surfaces the problem directly to the caller. Also add a regression test for the CQL doubled-quote escape (``''``): the scanner's naive toggle-on-quote logic already handles this case correctly — the two quotes are adjacent so there's no content between them to misclassify — but lock the behavior in so a refactor can't regress it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rame handling

Addresses feedback on the companion Python PR (DOI-USGS/dataretrieval-python#238):
- Skip chunking when `filter_lang` is not `cql-text`. The splitter is text- and single-quote-aware and would corrupt cql-json. Non-cql-text filters are now forwarded as-is.
- Budget each chunk against the server's URL byte limit (`.WATERDATA_URL_BYTE_LIMIT = 8000`, matching the observed HTTP 414 cliff of ~8,200 bytes) rather than a fixed raw filter length. `effective_filter_budget` probes the non-filter URL, subtracts, and converts back to raw CQL bytes using the max per-clause encoding ratio (with the `" OR "` joiner included — in R's percent-encoding the joiner inflates 2x, heavier than typical clause ratios, and the previous clause-only max let chunks overflow the URL cap).
- When the non-filter URL already exceeds the byte limit, return a budget larger than the filter so it passes through unchanged — one clear 414 is better feedback than N failing sub-requests.
- Move filter chunking out of the recursive `get_ogc_data` path and into the post-transform branch, so the probe sees the real request args. Collect raw frames, drop empty ones before `rbind` (a plain empty frame first would downgrade a later sf result and drop geometry/CRS), and dedup on the pre-rename feature `id`.
- Add regression tests for doubled single-quote CQL escape, the URL byte budget guarantee, and non-cql-text pass-through.
- Document CQL filter usage with two examples on `read_waterdata_continuous`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three small cleanups from a /simplify pass; no behaviour change on
the chunked path:
- ``_effective_filter_budget`` now has a fast path for filters whose
encoded length already fits under the URL limit with a 1 KB
headroom for everything-but-the-filter. Skips the throwaway
``_construct_api_requests`` probe + the splitter + the encoding-
ratio loop on every short-filter call, which is the common case.
- ``get_ogc_data`` now collects chunk responses into a single
``responses`` list instead of carrying ``first_response``,
``total_elapsed``, and a branching accumulator through the loop.
Elapsed-time aggregation moves to one line after the loop.
- ``tests/waterdata_utils_test.py`` factors the repeated
``SimpleNamespace(url=..., elapsed=..., headers={})`` mocks into
``_fake_prepared_request()`` / ``_fake_response()`` helpers; 5
copy-paste sites collapse to one-line calls. Bumped the budget-
shrinks-with-URL-params test to a filter large enough to go past
the new short-circuit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull the chunked fan-out, frame combining, and metadata aggregation
out of ``get_ogc_data`` into four private helpers so the top-level
function reads as a short recipe rather than a 70-line procedure.
Behaviour is unchanged (all 32 PR-related tests still pass); each
helper docstring captures the non-obvious *why* of its phase:
- ``_plan_filter_chunks`` decide how to fan out
- ``_fetch_chunks`` one request per chunk, pure I/O loop
- ``_combine_chunk_frames`` concat, drop empties to preserve
GeoDataFrame type, dedup by feature id
- ``_aggregate_response_metadata`` first response + summed elapsed
The top-of-``get_ogc_data`` arg normalization stays inline — it's
short and has a subtle ordering requirement (capture ``properties``
before the id-switch) that extraction would hide.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the helper organization in the merged Python PR (DOI-USGS/dataretrieval-python#238) so the per-language implementations stay easy to read alongside each other. The single-vs-fanned distinction is now expressed once, in `plan_filter_chunks`, which always returns a list of "chunk overrides" -- `list(NULL)` for "send `args` as-is", or a list of chunked cql-text expressions otherwise. `fetch_chunks` issues one request per entry and returns the per-chunk frames plus the first sub-request (for the `request` attribute). `combine_chunk_frames` handles the empty-frame and dedup-by-`id` cases. `get_ogc_data` is now a linear pipeline:

    chunks <- plan_filter_chunks(args)
    fetched <- fetch_chunks(args, chunks)
    return_list <- combine_chunk_frames(fetched$frames)
    req <- fetched$req
    ... post-processing ...

Behavior unchanged: same chunk sizing (URL-byte-budget aware), same cql-text-only guard, same empty-frame and id-dedup handling. The only observable difference is that the `request` attribute now points at the first sub-request instead of the last (matching Python's choice of representative metadata), which is a debugging-only change for the chunked path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add ``filter`` and ``filter_lang`` kwargs to the OGC ``waterdata`` getters (``get_continuous``, ``get_daily``, ``get_monitoring_locations``, ``get_time_series_metadata``, ``get_latest_continuous``, ``get_latest_daily``, ``get_field_measurements``, ``get_channel``, and the other collections built on the same plumbing). The value is forwarded verbatim to the server's OGC API ``filter`` query parameter, letting callers use server-side CQL expressions that aren't expressible via the other kwargs — most commonly, OR'ing several disjoint time ranges into a single request.

- A new ``FILTER_LANG = Literal["cql-text", "cql-json"]`` type alias is added in ``waterdata.types`` and re-exported from the package. The server accepts ``cql-text`` (default) and ``cql-json`` today; the ``cql2-*`` dialects are not yet supported.
- A long top-level ``OR`` chain is transparently split into multiple sub-requests that each fit under the server's URI length limit, and the results are concatenated. Filters without a top-level ``OR`` are sent as a single request unchanged.
- ``get_continuous`` gains docstring examples showing both the simple two-window case and the programmatically-built many-window case that exercises the auto-chunking path.
- NEWS.md gains a v1.1.0 highlights block covering this change along with the other user-visible additions since release.
- ``tests/waterdata_utils_test.py`` grows coverage for filter forwarding, the OR-chunking paths, and error handling.
Summary
Every `get_*` function that targets an OGC collection (`continuous`, `daily`, `field_measurements`, `monitoring_locations`, `time_series_metadata`, `latest_continuous`, `latest_daily`, `channel`) now accepts `filter` and `filter_lang` kwargs that are forwarded as the OGC `filter` / `filter-lang` query parameters. The Python kwarg `filter_lang` is translated to the hyphenated `filter-lang` URL parameter that the service actually accepts, and is typed as a `Literal["cql-text", "cql-json"]` alias (`FILTER_LANG`) living alongside the other `waterdata` Literals.

When a `filter` is a top-level `OR` chain that exceeds a conservative URI-length budget (5 KB), the library transparently splits it into multiple sub-requests and concatenates the results. This keeps the common multi-interval use case out of the caller's way — they don't need to know about the server's 414 boundary.

Motivation

The OGC `time` parameter accepts a single instant, a single bounded interval, or a half-bounded interval — it does not accept a list of intervals. For workflows that need to pull short windows of continuous data around many field-measurement timestamps (e.g., pairing discrete discharge measurements with the index velocity at the time of each measurement), the existing client requires one HTTP round-trip per window.

The waterdata OGC API already supports a `filter` query parameter with CQL OR-expressions, but this is not currently exposed through the Python client's signatures. This PR threads the passthrough through.

Large OR chains are handled for the caller:
In a sample end-to-end workflow (17 measurements over 6 months on USGS-07374525), collapsing the per-measurement loop into one chunked CQL filter call dropped the request time from ~9 s to ~1 s.

Chunking behavior
- Top-level `OR` chains are split. The splitter is paren- and quote-aware, so `OR` inside sub-expressions like `(A OR B)` or string literals like `'foo OR bar'` is preserved.
- When the expression has no top-level `OR`, or any single clause already exceeds the budget, the filter is sent as-is (server decides) rather than being mangled.
- Results are deduplicated on the feature `id` (pre-rename — the rename to `continuous_id` / `daily_id` / etc. happens later) so overlapping user-supplied OR clauses combine losslessly across chunks.
- `BaseMetadata.url` is the first chunk's URL; `query_time` is the sum of elapsed time across all chunk requests, so callers that log `md.query_time` see the total operation cost rather than just the first chunk.
- The budget (`_CQL_FILTER_CHUNK_LEN = 5000`) is private and conservative; the continuous endpoint has been observed to return HTTP 414 around ~7 KB of filter text.

Caveats (documented in the docstrings)
- The server currently accepts `cql-text` (default) and `cql-json`; `cql2-text` / `cql2-json` return `400 Invalid filter language`.

Changes
- `dataretrieval/waterdata/api.py` — adds `filter: str | None` and `filter_lang: FILTER_LANG | None` kwargs (with docstrings) to 8 OGC functions.
- `dataretrieval/waterdata/types.py` — adds `FILTER_LANG = Literal["cql-text", "cql-json"]` alongside the existing `SERVICES` / `PROFILES` Literals.
- `dataretrieval/waterdata/__init__.py` — re-exports `FILTER_LANG`.
- `dataretrieval/waterdata/utils.py`:
  - Translates the `filter_lang` → `filter-lang` URL key in `_construct_api_requests`.
  - Adds the `_split_top_level_or` and `_chunk_cql_or` helpers.
  - `get_ogc_data` fans a long `filter` into per-chunk sub-requests, concatenates the results, dedups on the pre-rename feature `id`, and aggregates elapsed time across chunks so the returned metadata reflects the full operation.
- `tests/waterdata_utils_test.py` — adds mocked unit tests for the passthrough, hyphenation, splitter/chunker semantics, an end-to-end `requests_mock` test that verifies a long OR-filter actually triggers multiple HTTP calls, and a separate test that forces cross-chunk duplicate feature ids to assert dedup collapses them.
- `NEWS.md` — short announcement.

Test plan
- `ruff check .` and `ruff format --check .` pass.
- `pytest tests/waterdata_utils_test.py` — 25/25 pass (4 pre-existing + 21 new).
- Full suite (`pytest tests/ --deselect tests/waterdata_test.py --deselect tests/waterservices_test.py --ignore tests/nldi_test.py`) — 90/90 pass.
- Verified live against `USGS-0737452572255`: a 200-clause / ~14 KB OR filter fans into 3 sub-requests and returns 600 concatenated rows in ~1.5 s.

Marked as draft for a first review pass. Happy to split into smaller PRs if preferred (e.g.,
`get_continuous` only, with other endpoints as follow-ups).

🤖 Generated with Claude Code