Java AI Agent Evaluation & Testing Library — JUnit 5-native, local-first, framework-agnostic evaluation for AI agents.
AgentEval is a library (not a framework) for evaluating the quality of Java-based AI agents. It integrates directly into your existing JUnit 5 test suite and supports any AI framework — Spring AI, LangChain4j, LangGraph4j, MCP, or custom.
Key principles:
- JUnit 5-native — evaluations are standard test methods
- Local-first — no cloud, no SaaS, no data leaves the machine
- Framework-agnostic — optional integrations, zero forced dependencies
- LLM-as-judge — 7 pluggable judge providers with multi-model consensus
Maven:

```xml
<dependency>
    <groupId>org.byteveda.agenteval</groupId>
    <artifactId>agenteval-junit5</artifactId>
    <version>0.1.0-SNAPSHOT</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>org.byteveda.agenteval</groupId>
    <artifactId>agenteval-metrics</artifactId>
    <version>0.1.0-SNAPSHOT</version>
    <scope>test</scope>
</dependency>
```

Gradle (Kotlin DSL):

```kotlin
testImplementation("org.byteveda.agenteval:agenteval-junit5:0.1.0-SNAPSHOT")
testImplementation("org.byteveda.agenteval:agenteval-metrics:0.1.0-SNAPSHOT")
```

A minimal evaluation test:

```java
import java.util.List;

import org.byteveda.agenteval.core.model.AgentTestCase;
import org.byteveda.agenteval.junit5.AgentAssertions;
import org.byteveda.agenteval.junit5.annotation.AgentTest;
import org.byteveda.agenteval.junit5.annotation.Metric;
import org.byteveda.agenteval.metrics.response.AnswerRelevancyMetric;
import org.byteveda.agenteval.metrics.response.FaithfulnessMetric;

class MyAgentEvalTest {

    @AgentTest
    @Metric(value = AnswerRelevancyMetric.class, threshold = 0.7)
    @Metric(value = FaithfulnessMetric.class, threshold = 0.8)
    void testRefundPolicy() {
        // myAgent, doc1, doc2: your agent under test and its retrieved documents
        var testCase = AgentTestCase.builder()
                .input("What is our refund policy?")
                .actualOutput(myAgent.ask("What is our refund policy?"))
                .retrievalContext(List.of(doc1, doc2))
                .build();

        AgentAssertions.assertThat(testCase).passesAllMetrics();
    }
}
```

Run only evaluation tests:

```bash
mvn test -Dgroups=eval
```

AgentEval ships 23 built-in metrics across 4 categories.
Response quality:

| Metric | Description |
|---|---|
| AnswerRelevancyMetric | Is the output relevant to the input question? |
| FaithfulnessMetric | Are claims in the output grounded in the retrieval context? |
| HallucinationMetric | Does the output contain fabricated information? |
| CorrectnessMetric | G-Eval: flexible correctness against custom criteria |
| SemanticSimilarityMetric | Embedding-based cosine similarity to the expected output |
| CoherenceMetric | Is the output logically coherent and well-structured? |
| ConcisenessMetric | Is the output appropriately concise? |
| ToxicityMetric | Does the output contain harmful content? |
| BiasMetric | Does the output exhibit gender, race, or other biases? |
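
Embedding-based scoring such as SemanticSimilarityMetric's ultimately reduces to cosine similarity between two vectors. A minimal self-contained sketch of that computation (the class and vectors here are illustrative, not part of the library):

```java
class CosineSketch {
    // Cosine similarity: dot(a, b) / (|a| * |b|).
    // For the non-negative embeddings typical here, the result lies in [0, 1],
    // which is what a similarity metric compares against its threshold.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

A score of 1.0 means the actual and expected outputs embed to the same direction; orthogonal embeddings score 0.0.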
Retrieval (RAG):

| Metric | Description |
|---|---|
| ContextualRelevancyMetric | Is the retrieved context relevant to the query? |
| ContextualPrecisionMetric | How precise is the retrieval (signal-to-noise ratio)? |
| ContextualRecallMetric | How much of the ground-truth context was retrieved? |
Agentic execution:

| Metric | Description |
|---|---|
| TaskCompletionMetric | Did the agent complete the assigned task? |
| ToolSelectionAccuracyMetric | Did the agent call the correct tools? |
| ToolArgumentCorrectnessMetric | Were the tool arguments correct? |
| ToolResultUtilizationMetric | Did the agent effectively use tool results? |
| PlanQualityMetric | Was the agent's plan coherent and executable? |
| PlanAdherenceMetric | Did the agent follow its stated plan? |
| RetrievalCompletenessMetric | Did the agent retrieve all necessary information? |
| StepLevelErrorLocalizationMetric | Can the first error step in the trajectory be identified? |
| TrajectoryOptimalityMetric | Was the agent's execution path efficient? |
Conversational:

| Metric | Description |
|---|---|
| ConversationCoherenceMetric | Is the multi-turn conversation coherent? |
| ContextRetentionMetric | Does the agent retain context across turns? |
| TopicDriftDetectionMetric | Does the conversation stay on topic? |
| ConversationResolutionMetric | Was the user's goal ultimately resolved? |
All metrics implement EvalMetric and return EvalScore (value 0.0–1.0, threshold, pass/fail, reason).
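
To make that contract concrete, here is a hypothetical stand-in for the score object (the record name and fields are assumptions for illustration, not the library's actual API):

```java
// Illustrative stand-in for the EvalScore contract described above:
// a value in [0, 1], a threshold, a derived pass/fail, and a reason.
record EvalScoreSketch(double value, double threshold, String reason) {
    EvalScoreSketch {
        if (value < 0.0 || value > 1.0)
            throw new IllegalArgumentException("value must be in [0, 1]");
    }

    // A metric passes when its score meets or exceeds its threshold.
    boolean passed() {
        return value >= threshold;
    }
}
```

A real EvalMetric implementation produces such a score from a test case; the record exists only to show the pass/fail semantics.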
LLM-as-judge metrics require a configured judge provider.
| Provider | Class |
|---|---|
| OpenAI | JudgeModels.openai() |
| Anthropic | JudgeModels.anthropic() |
| Google Gemini | JudgeModels.google() |
| Azure OpenAI | JudgeModels.azure() |
| Amazon Bedrock | JudgeModels.bedrock() |
| Ollama (local) | JudgeModels.ollama() |
| Custom HTTP | JudgeModels.custom() (OpenAI-compatible: vLLM, LiteLLM, LocalAI) |
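
Because the judge is pluggable, it can be stubbed out entirely for fast offline tests. A deterministic fake along these lines (the function-shaped judge interface is a simplification for illustration, not the library's real signature):

```java
import java.util.function.Function;

// Simplified stand-in for a judge provider: prompt in, score out.
// The library's real judge interface will differ; this only shows
// why a pluggable provider helps testing without network calls.
interface FakeJudge extends Function<String, Double> {}

class Judges {
    // Deterministic offline judge: scores with a trivial keyword
    // heuristic instead of calling an LLM.
    static FakeJudge keywordJudge(String mustContain) {
        return prompt -> prompt.contains(mustContain) ? 1.0 : 0.0;
    }
}
```

Swapping such a fake in for CI keeps evaluation tests deterministic and free.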
Environment variables:

```bash
AGENTEVAL_JUDGE_PROVIDER=openai
AGENTEVAL_JUDGE_MODEL=gpt-4o
OPENAI_API_KEY=sk-...
```

Programmatic:

```java
var config = AgentEvalConfig.builder()
    .judgeModel(JudgeModels.openai("gpt-4o", System.getenv("OPENAI_API_KEY")))
    .build();
```

YAML (agenteval.yaml):

```yaml
judge:
  provider: anthropic
  model: claude-3-5-sonnet-20241022
```

Multi-model consensus:

```java
var judge = MultiModelJudge.builder()
    .addJudge(JudgeModels.openai(), 0.5)
    .addJudge(JudgeModels.anthropic(), 0.5)
    .strategy(ConsensusStrategy.WEIGHTED_AVERAGE)
    .build();
```

Load test cases from JSON, CSV, or JSONL files:
```java
@AgentTest
@DatasetSource(path = "src/test/resources/qa-dataset.json")
@Metric(value = AnswerRelevancyMetric.class, threshold = 0.7)
void testDataset(AgentTestCase testCase) {
    testCase.setActualOutput(agent.ask(testCase.getInput()));
}
```

Generate synthetic datasets:

```java
var generator = new SyntheticDatasetGenerator(judgeModel);
var dataset = generator.fromDocuments(documents, 20);     // 20 cases from docs
var adversarial = generator.adversarial(baseDataset, 10); // adversarial variants
```

AgentEval supports multiple report formats:
| Reporter | Output |
|---|---|
| ConsoleReporter | Colored terminal table |
| JunitXmlReporter | Standard JUnit XML (CI/CD compatible) |
| JsonReporter | Machine-readable JSON |
| HtmlReporter | Single-file, self-contained HTML |
Lock in baseline scores and detect regressions:
```java
var store = new SnapshotStore(Path.of("src/test/snapshots"));
var reporter = new SnapshotReporter(store, SnapshotConfig.defaults());
reporter.report(result); // fails if the score drops below the baseline
```

Or compare two result sets programmatically:

```java
var comparison = RegressionComparison.compare(baseline, current);
var report = RegressionReport.from(comparison);
```

Compare multiple agent variants side-by-side:
```java
var result = Benchmark.run(
    BenchmarkVariant.of("gpt-4o",
        testCase -> testCase.setActualOutput(gpt4oAgent.ask(testCase.getInput()))),
    BenchmarkVariant.of("claude-3-5",
        testCase -> testCase.setActualOutput(claudeAgent.ask(testCase.getInput()))),
    List.of(new AnswerRelevancyMetric(), new FaithfulnessMetric()),
    dataset
);
BenchmarkReporter.print(result);
```

Optional modules for automatic capture with popular frameworks:
| Module | Artifact |
|---|---|
| Spring AI | agenteval-spring-ai |
| LangChain4j | agenteval-langchain4j |
| LangGraph4j | agenteval-langgraph4j |
| MCP Java SDK | agenteval-mcp |
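
Each integration is an ordinary test-scoped dependency; for example, the Spring AI module in Maven (version shown matches the snapshot used above):

```xml
<dependency>
    <groupId>org.byteveda.agenteval</groupId>
    <artifactId>agenteval-spring-ai</artifactId>
    <version>0.1.0-SNAPSHOT</version>
    <scope>test</scope>
</dependency>
```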
Gradle plugin:

```kotlin
plugins {
    id("org.byteveda.agenteval.gradle-plugin") version "0.1.0-SNAPSHOT"
}

agenteval {
    reportFormat = "html"
    threshold = 0.7
}
```

GitHub Actions:

```yaml
- uses: agenteval/agenteval@v1
  with:
    report-format: markdown
    comment-on-pr: true
```

Adversarial evaluation with 20 built-in attack templates:
```java
var suite = RedTeamSuite.builder()
    .addAttacks(AttackTemplateLibrary.promptInjection())
    .addAttacks(AttackTemplateLibrary.jailbreak())
    .agent(myAgent)
    .evaluator(new AttackEvaluator(judgeModel))
    .build();
suite.run();
```

Module overview:

agenteval-core/ — Test case model, metric interfaces, scoring engine, config
agenteval-metrics/ — 23 built-in metric implementations
agenteval-judge/ — LLM-as-judge engine, 7 provider integrations, multi-model consensus
agenteval-embeddings/ — Embedding model integrations (OpenAI, custom HTTP)
agenteval-junit5/ — JUnit 5 extension, @AgentTest, @Metric, @DatasetSource annotations
agenteval-datasets/ — JSON/CSV/JSONL loading, synthetic generation, golden set versioning
agenteval-reporting/ — Console, JUnit XML, JSON, HTML, snapshot, benchmark, regression reporters
agenteval-spring-ai/ — Spring AI auto-capture (optional)
agenteval-langchain4j/ — LangChain4j auto-capture (optional)
agenteval-langgraph4j/ — LangGraph4j graph execution capture (optional)
agenteval-mcp/ — MCP Java SDK tool call capture (optional)
agenteval-redteam/ — Adversarial testing, 20 attack templates
agenteval-contracts/ — Contract testing, behavioral invariant verification
agenteval-statistics/ — Statistical rigor: confidence intervals, significance tests
agenteval-chaos/ — Chaos engineering, agent resilience testing
agenteval-replay/ — Deterministic record & replay for $0 regression tests
agenteval-mutation/ — Prompt mutation testing, eval quality verification
agenteval-fingerprint/ — Agent capability profiling across 8 dimensions
agenteval-maven-plugin/ — Maven build integration
agenteval-gradle-plugin/ — Gradle build integration
agenteval-github-actions/ — GitHub Actions composite action
agenteval-intellij/ — IntelliJ IDEA tool window plugin
```bash
mvn clean install              # Build all modules
mvn test                       # Run all tests
mvn test -Dgroups=eval         # Run only evaluation tests
mvn test -DexcludeGroups=eval  # Skip evaluation tests (fast build)
mvn test -pl agenteval-core    # Test specific module
```

See INSTALL.md for full instructions on building from source, including cross-platform install scripts.
Requirements:
- Java 21+
- Maven 3.9+ or Gradle 8.5+
Apache License 2.0 — see LICENSE.
