Java AI Agent Evaluation & Testing Library — JUnit 5-native, local-first, framework-agnostic evaluation for AI agents.
AgentEval is a library (not a framework) for evaluating the quality of Java-based AI agents. It integrates directly into your existing JUnit 5 test suite and supports any AI framework — Spring AI, LangChain4j, LangGraph4j, MCP, or custom.
Key principles:
- JUnit 5-native — evaluations are standard test methods
- Local-first — no cloud, no SaaS, no data leaves the machine
- Framework-agnostic — optional integrations, zero forced dependencies
- LLM-as-judge — 7 pluggable judge providers with multi-model consensus
Maven:

```xml
<dependency>
    <groupId>org.byteveda.agenteval</groupId>
    <artifactId>agenteval-junit5</artifactId>
    <version>0.1.0-SNAPSHOT</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>org.byteveda.agenteval</groupId>
    <artifactId>agenteval-metrics</artifactId>
    <version>0.1.0-SNAPSHOT</version>
    <scope>test</scope>
</dependency>
```

Gradle (Kotlin DSL):

```kotlin
testImplementation("org.byteveda.agenteval:agenteval-junit5:0.1.0-SNAPSHOT")
testImplementation("org.byteveda.agenteval:agenteval-metrics:0.1.0-SNAPSHOT")
```

A minimal evaluation test:

```java
import java.util.List;

import org.byteveda.agenteval.core.model.AgentTestCase;
import org.byteveda.agenteval.junit5.AgentAssertions;
import org.byteveda.agenteval.junit5.annotation.AgentTest;
import org.byteveda.agenteval.junit5.annotation.Metric;
import org.byteveda.agenteval.metrics.response.AnswerRelevancyMetric;
import org.byteveda.agenteval.metrics.response.FaithfulnessMetric;

class MyAgentEvalTest {

    @AgentTest
    @Metric(value = AnswerRelevancyMetric.class, threshold = 0.7)
    @Metric(value = FaithfulnessMetric.class, threshold = 0.8)
    void testRefundPolicy() {
        // myAgent, doc1, doc2: your agent under test and its retrieved documents
        var testCase = AgentTestCase.builder()
                .input("What is our refund policy?")
                .actualOutput(myAgent.ask("What is our refund policy?"))
                .retrievalContext(List.of(doc1, doc2))
                .build();

        AgentAssertions.assertThat(testCase).passesAllMetrics();
    }
}
```

Run only evaluation tests:

```bash
mvn test -Dgroups=eval
```

AgentEval ships 23 built-in metrics across 4 categories.
Response quality:

| Metric | Description |
|---|---|
| AnswerRelevancyMetric | Is the output relevant to the input question? |
| FaithfulnessMetric | Are claims in the output grounded in the retrieval context? |
| HallucinationMetric | Does the output contain fabricated information? |
| CorrectnessMetric | G-Eval: flexible correctness against custom criteria |
| SemanticSimilarityMetric | Embedding-based cosine similarity to the expected output |
| CoherenceMetric | Is the output logically coherent and well-structured? |
| ConcisenessMetric | Is the output appropriately concise? |
| ToxicityMetric | Does the output contain harmful content? |
| BiasMetric | Does the output exhibit gender, race, or other biases? |
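
Embedding-based scoring such as SemanticSimilarityMetric's ultimately reduces to cosine similarity between two vectors. A minimal self-contained sketch of that computation (the class and vectors here are illustrative, not part of the library):

```java
class CosineSketch {
    // Cosine similarity: dot(a, b) / (|a| * |b|).
    // For the non-negative embeddings typical here, the result lies in [0, 1],
    // which is what a similarity metric compares against its threshold.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

A score of 1.0 means the actual and expected outputs embed to the same direction; orthogonal embeddings score 0.0.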
Retrieval (RAG):

| Metric | Description |
|---|---|
| ContextualRelevancyMetric | Is the retrieved context relevant to the query? |
| ContextualPrecisionMetric | How precise is the retrieval (signal-to-noise ratio)? |
| ContextualRecallMetric | How much of the ground-truth context was retrieved? |
Agentic execution:

| Metric | Description |
|---|---|
| TaskCompletionMetric | Did the agent complete the assigned task? |
| ToolSelectionAccuracyMetric | Did the agent call the correct tools? |
| ToolArgumentCorrectnessMetric | Were the tool arguments correct? |
| ToolResultUtilizationMetric | Did the agent effectively use tool results? |
| PlanQualityMetric | Was the agent's plan coherent and executable? |
| PlanAdherenceMetric | Did the agent follow its stated plan? |
| RetrievalCompletenessMetric | Did the agent retrieve all necessary information? |
| StepLevelErrorLocalizationMetric | Can the first error step in the trajectory be identified? |
| TrajectoryOptimalityMetric | Was the agent's execution path efficient? |
Conversational:

| Metric | Description |
|---|---|
| ConversationCoherenceMetric | Is the multi-turn conversation coherent? |
| ContextRetentionMetric | Does the agent retain context across turns? |
| TopicDriftDetectionMetric | Does the conversation stay on topic? |
| ConversationResolutionMetric | Was the user's goal ultimately resolved? |
All metrics implement EvalMetric and return EvalScore (value 0.0–1.0, threshold, pass/fail, reason).
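
To make that contract concrete, here is a hypothetical stand-in for the score object (the record name and fields are assumptions for illustration, not the library's actual API):

```java
// Illustrative stand-in for the EvalScore contract described above:
// a value in [0, 1], a threshold, a derived pass/fail, and a reason.
record EvalScoreSketch(double value, double threshold, String reason) {
    EvalScoreSketch {
        if (value < 0.0 || value > 1.0)
            throw new IllegalArgumentException("value must be in [0, 1]");
    }

    // A metric passes when its score meets or exceeds its threshold.
    boolean passed() {
        return value >= threshold;
    }
}
```

A real EvalMetric implementation produces such a score from a test case; the record exists only to show the pass/fail semantics.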
LLM-as-judge metrics require a configured judge provider.
| Provider | Class |
|---|---|
| OpenAI | JudgeModels.openai() |
| Anthropic | JudgeModels.anthropic() |
| Google Gemini | JudgeModels.google() |
| Azure OpenAI | JudgeModels.azure() |
| Amazon Bedrock | JudgeModels.bedrock() |
| Ollama (local) | JudgeModels.ollama() |
| Custom HTTP | JudgeModels.custom() (OpenAI-compatible: vLLM, LiteLLM, LocalAI) |
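
Because the judge is pluggable, it can be stubbed out entirely for fast offline tests. A deterministic fake along these lines (the function-shaped judge interface is a simplification for illustration, not the library's real signature):

```java
import java.util.function.Function;

// Simplified stand-in for a judge provider: prompt in, score out.
// The library's real judge interface will differ; this only shows
// why a pluggable provider helps testing without network calls.
interface FakeJudge extends Function<String, Double> {}

class Judges {
    // Deterministic offline judge: scores with a trivial keyword
    // heuristic instead of calling an LLM.
    static FakeJudge keywordJudge(String mustContain) {
        return prompt -> prompt.contains(mustContain) ? 1.0 : 0.0;
    }
}
```

Swapping such a fake in for CI keeps evaluation tests deterministic and free.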
Environment variables:

```bash
AGENTEVAL_JUDGE_PROVIDER=openai
AGENTEVAL_JUDGE_MODEL=gpt-4o
OPENAI_API_KEY=sk-...
```

Programmatic:

```java
var config = AgentEvalConfig.builder()
    .judgeModel(JudgeModels.openai("gpt-4o", System.getenv("OPENAI_API_KEY")))
    .build();
```

YAML (agenteval.yaml):

```yaml
judge:
  provider: anthropic
  model: claude-3-5-sonnet-20241022
```

Multi-model consensus:

```java
var judge = MultiModelJudge.builder()
    .addJudge(JudgeModels.openai(), 0.5)
    .addJudge(JudgeModels.anthropic(), 0.5)
    .strategy(ConsensusStrategy.WEIGHTED_AVERAGE)
    .build();
```

Load test cases from JSON, CSV, or JSONL files:
```java
@AgentTest
@DatasetSource(path = "src/test/resources/qa-dataset.json")
@Metric(value = AnswerRelevancyMetric.class, threshold = 0.7)
void testDataset(AgentTestCase testCase) {
    testCase.setActualOutput(agent.ask(testCase.getInput()));
}
```

Generate synthetic datasets:

```java
var generator = new SyntheticDatasetGenerator(judgeModel);
var dataset = generator.fromDocuments(documents, 20);     // 20 cases from docs
var adversarial = generator.adversarial(baseDataset, 10); // adversarial variants
```

AgentEval supports multiple report formats:
| Reporter | Output |
|---|---|
| ConsoleReporter | Colored terminal table |
| JunitXmlReporter | Standard JUnit XML (CI/CD compatible) |
| JsonReporter | Machine-readable JSON |
| HtmlReporter | Single-file, self-contained HTML |
Lock in baseline scores and detect regressions:
```java
var store = new SnapshotStore(Path.of("src/test/snapshots"));
var reporter = new SnapshotReporter(store, SnapshotConfig.defaults());
reporter.report(result); // fails if the score drops below the baseline
```

Or compare two result sets programmatically:

```java
var comparison = RegressionComparison.compare(baseline, current);
var report = RegressionReport.from(comparison);
```

Compare multiple agent variants side-by-side:
```java
var result = Benchmark.run(
    BenchmarkVariant.of("gpt-4o",
        testCase -> testCase.setActualOutput(gpt4oAgent.ask(testCase.getInput()))),
    BenchmarkVariant.of("claude-3-5",
        testCase -> testCase.setActualOutput(claudeAgent.ask(testCase.getInput()))),
    List.of(new AnswerRelevancyMetric(), new FaithfulnessMetric()),
    dataset
);
BenchmarkReporter.print(result);
```

Optional modules for automatic capture with popular frameworks:
| Module | Artifact |
|---|---|
| Spring AI | agenteval-spring-ai |
| LangChain4j | agenteval-langchain4j |
| LangGraph4j | agenteval-langgraph4j |
| MCP Java SDK | agenteval-mcp |
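
Each integration is an ordinary test-scoped dependency; for example, the Spring AI module in Maven (version shown matches the snapshot used above):

```xml
<dependency>
    <groupId>org.byteveda.agenteval</groupId>
    <artifactId>agenteval-spring-ai</artifactId>
    <version>0.1.0-SNAPSHOT</version>
    <scope>test</scope>
</dependency>
```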
Gradle plugin:

```kotlin
plugins {
    id("org.byteveda.agenteval.gradle-plugin") version "0.1.0-SNAPSHOT"
}

agenteval {
    reportFormat = "html"
    threshold = 0.7
}
```

GitHub Actions:

```yaml
- uses: agenteval/agenteval@v1
  with:
    report-format: markdown
    comment-on-pr: true
```

Adversarial evaluation with 20 built-in attack templates:
```java
var suite = RedTeamSuite.builder()
    .addAttacks(AttackTemplateLibrary.promptInjection())
    .addAttacks(AttackTemplateLibrary.jailbreak())
    .agent(myAgent)
    .evaluator(new AttackEvaluator(judgeModel))
    .build();
suite.run();
```

Module overview:

agenteval-core/ — Test case model, metric interfaces, scoring engine, config
agenteval-metrics/ — 23 built-in metric implementations
agenteval-judge/ — LLM-as-judge engine, 7 provider integrations, multi-model consensus
agenteval-embeddings/ — Embedding model integrations (OpenAI, custom HTTP)
agenteval-junit5/ — JUnit 5 extension, @AgentTest, @Metric, @DatasetSource annotations
agenteval-datasets/ — JSON/CSV/JSONL loading, synthetic generation, golden set versioning
agenteval-reporting/ — Console, JUnit XML, JSON, HTML, snapshot, benchmark, regression reporters
agenteval-spring-ai/ — Spring AI auto-capture (optional)
agenteval-langchain4j/ — LangChain4j auto-capture (optional)
agenteval-langgraph4j/ — LangGraph4j graph execution capture (optional)
agenteval-mcp/ — MCP Java SDK tool call capture (optional)
agenteval-redteam/ — Adversarial testing, 20 attack templates
agenteval-contracts/ — Contract testing, behavioral invariant verification
agenteval-statistics/ — Statistical rigor: confidence intervals, significance tests
agenteval-chaos/ — Chaos engineering, agent resilience testing
agenteval-replay/ — Deterministic record & replay for $0 regression tests
agenteval-mutation/ — Prompt mutation testing, eval quality verification
agenteval-fingerprint/ — Agent capability profiling across 8 dimensions
agenteval-maven-plugin/ — Maven build integration
agenteval-gradle-plugin/ — Gradle build integration
agenteval-github-actions/ — GitHub Actions composite action
agenteval-intellij/ — IntelliJ IDEA tool window plugin
```bash
mvn clean install              # Build all modules
mvn test                       # Run all tests
mvn test -Dgroups=eval         # Run only evaluation tests
mvn test -DexcludeGroups=eval  # Skip evaluation tests (fast build)
mvn test -pl agenteval-core    # Test specific module
```

See INSTALL.md for full instructions on building from source, including cross-platform install scripts.
Requirements:
- Java 21+
- Maven 3.9+ or Gradle 8.5+
Apache License 2.0 — see LICENSE.
