
feat: add operational counters for sampling, requirements, and tools (#467)#883

Open
ajbozarth wants to merge 2 commits into generative-computing:main from ajbozarth:feat/operational-counters-467

Conversation

@ajbozarth
Contributor

@ajbozarth ajbozarth commented Apr 17, 2026

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

Adds six new OpenTelemetry counters as part of the telemetry epic (#443),
giving operators visibility into retry behaviour, validation failure rates,
and tool call health alongside the existing LLM-level metrics:

  • mellea.sampling.attempts / mellea.sampling.successes / mellea.sampling.failures — tagged by strategy class name
  • mellea.requirement.checks / mellea.requirement.failures — tagged by requirement class name and failure reason
  • mellea.tool.calls — tagged by tool name and status ("success" / "failure")

Follows the established plugin pattern: lazy-init counter globals + record_*
helpers in metrics.py, consumed by three new Plugin classes in
metrics_plugins.py (SamplingMetricsPlugin, RequirementMetricsPlugin,
ToolMetricsPlugin) that hook into existing hook events in FIRE_AND_FORGET
mode. No metric calls in business-logic files.

Also extends SamplingIterationPayload and SamplingLoopEndPayload with a
strategy_name field (backward-compatible, defaults to "") so plugins can
tag counters with the strategy class name.
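The backward-compatible payload extension could be sketched as follows; the `strategy_name` field and its `""` default come from the PR text, while the other fields are hypothetical stand-ins for the payload's existing contents.

```python
# Sketch of the payload extension. Only strategy_name and its default
# are from the PR description; the other fields are placeholders.
from dataclasses import dataclass


@dataclass
class SamplingIterationPayload:
    # Hypothetical pre-existing fields, shown for context only.
    iteration: int = 0
    success: bool = False
    # New in this PR: defaults to "" so existing call sites that do not
    # pass it keep working (backward-compatible).
    strategy_name: str = ""
```

Plugins can then read `payload.strategy_name` and fall back gracefully when an older emitter leaves it empty.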

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and GitHub automation pass (a maintainer will kick off the GitHub automation when the rest of the PR is populated)

Attribution

  • AI coding assistants used

…enerative-computing#467)

Adds six new OpenTelemetry counters giving operators visibility into
retry behaviour, validation failure rates, and tool call health:
mellea.sampling.attempts/successes/failures, mellea.requirement.checks/failures,
and mellea.tool.calls.

Follows the established lazy-init globals + record_* helpers + Plugin hooks
pattern. Extends SamplingIterationPayload and SamplingLoopEndPayload with a
strategy_name field so plugins can tag counters by strategy class.

Assisted-by: Claude Code
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth ajbozarth self-assigned this Apr 17, 2026
@github-actions github-actions bot added the enhancement New feature or request label Apr 17, 2026
@github-actions
Contributor

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@ajbozarth ajbozarth changed the title feat: add operational counters for sampling, requirements, and tools … feat: add operational counters for sampling, requirements, and tools (#467) Apr 17, 2026
@ajbozarth ajbozarth marked this pull request as ready for review April 17, 2026 23:03
@ajbozarth ajbozarth requested a review from a team as a code owner April 17, 2026 23:03
PeriodicExportingMetricReader background threads (60 s default) would
fire after pytest closed stdout once the suite crossed the 60 s mark,
causing "I/O operation on closed file" and OTLP UNAVAILABLE errors.

Assisted-by: Claude Code
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth
Contributor Author

This is parallel work to #882, which adds the other remaining metrics telemetry; whichever PR merges second will need a manual merge to resolve the overlapping diffs.

@akihikokuroda
Member

This looks good to me as the implementation of issue #467. I do wonder how useful these counters are, though: they are grouped only by strategy and requirement, and may not give much insight without finer-grained groupings.

Member

@akihikokuroda akihikokuroda left a comment


LGTM. I have a general comment about the issue that this PR implements.

@ajbozarth
Contributor Author

> This looks good to me as the implementation of the issue #467. I wonder how these counters are useful. They are grouped by strategies and requirements. They may not give useful insight without some more fine groupings.

@jakelorocco @psschwei I actually wondered about this while implementing. I thought the idea was interesting and valuable, but my work did not create an example of how it could be used (which maybe I should add?)

@psschwei
Member

>> This looks good to me as the implementation of the issue #467. I wonder how these counters are useful. They are grouped by strategies and requirements. They may not give useful insight without some more fine groupings.
>
> @jakelorocco @psschwei I actually wondered about this while implementing. I thought the idea was interesting and valuable, but my work did not create an example of how it could be used (which maybe I should add?)

Off the top of my head, it could be useful to know if a particular requirement is hard to satisfy for a particular model (for example, maybe base granite doesn't do well on quantum and we need to substitute in the granite-qiskit model).

I'm not sure we necessarily need an example though (I'm a little worried about having too many examples to maintain, xref #811). Not opposed either.

Member

@psschwei psschwei left a comment


Found one issue with metric cardinality — each unique combination of attribute values creates a separate time series in your metrics backend.

Step 1: A requirement validation fails. The ValidationResult object has a .reason field containing a human-readable explanation of why it failed.

Step 2: The new RequirementMetricsPlugin (added in this PR) fires on the validation_post_check hook. It grabs reason from the result:

  reason = getattr(result, "reason", None) or "unknown"
  record_requirement_failure(req_name, reason)

Step 3: record_requirement_failure (also added in this PR) passes that reason string directly as a counter attribute:

  _get_requirement_failures_counter().add(
      1, {"requirement": requirement, "reason": reason}
  )

Step 4: Here's the problem. For some requirement types — specifically LLM-as-a-Judge (mellea/core/requirement.py:187, pre-existing code) — the reason is set to the full LLM output text. That's a different string nearly every time. Each unique reason string creates a new time series in the metrics backend.

So if you run 1,000 LLM-as-a-Judge validations that fail, you get ~1,000 distinct time series for mellea.requirement.failures, each with a count of 1. Metrics backends like Prometheus have cardinality limits and will start dropping data or degrading performance.

What you'd want instead: Either drop reason from the metric attributes (put it in traces/logs instead), or map it to a small bounded set of categories like "llm_judge", "execution_error", "timeout", etc.
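The bounded-category suggestion could be implemented along these lines; the category names echo the ones suggested above, but the helper, its name, and the `reason_category` attribute are illustrative assumptions, not existing Mellea code.

```python
# Illustrative sketch: collapse free-form failure reasons into a small
# fixed set of categories before using them as metric attributes, so
# cardinality stays bounded no matter what the LLM output text contains.
_KNOWN_CATEGORIES = {"llm_judge", "execution_error", "timeout"}


def categorize_reason(result) -> str:
    """Map a ValidationResult-like object to a bounded reason category.

    The hypothetical reason_category attribute would be set by the
    requirement type; anything unrecognized collapses into "other".
    """
    category = getattr(result, "reason_category", None)
    if category in _KNOWN_CATEGORIES:
        return category
    return "other"  # one time series for all free-form reasons
```

The full free-form `reason` string can still be recorded on the active span or in logs, where high cardinality is not a problem.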


Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add counters for operational metrics specific to Mellea's sampling and validation loops

3 participants