ScholarQuest: A Benchmark for Agentic Paper Search

A new arXiv preprint introduces a taxonomy-guided benchmark for agentic academic paper search built from more than 1,000 computer-science topics and four research intents. The authors report agentic methods beat single-shot retrieval while leaving large gaps in recall.

A preprint posted to arXiv on June 18, 2026 presents ScholarQuest, a benchmark for evaluating language-model agents that search through academic literature. The paper, by Tingyue Pan, Mingyue Cheng, Daoyu Wang, Yitong Zhou, Jie Ouyang, Qi Liu, and Enhong Chen, is filed under the information-retrieval and artificial-intelligence categories. Its subject is a task the authors describe as a core step in scientific research: finding the right papers. LLM-based search agents are increasingly applied to that task through iterative, intent-driven exploration rather than a single query.

The authors' stated motivation is a gap in evaluation. They write that existing benchmarks are not sufficient for systematically assessing how these agents behave in realistic open-literature settings, where an agent must navigate a large and unstructured body of work rather than retrieve from a curated, closed pool. ScholarQuest is offered as a response to that gap, and the paper describes both how the benchmark is constructed and what current systems score on it.

"Benchmarking results show that agentic methods outperform single-shot retrieval baselines, yet the best-performing agent only achieves 0.314 Recall@100 and 0.355 Recall@All, indicating substantial room for improvement." (arXiv abstract, ScholarQuest, arXiv:2606.20235)

How the benchmark is built

According to the abstract, ScholarQuest is constructed from over 1,000 computer-science topics and four representative research intents. The authors name those intents as method-oriented, setting-anchored, comparison-based, and scope-controlled queries. Each describes a different way a researcher might frame a literature search: looking for work that uses a particular method, work pinned to a particular experimental setting, work that compares approaches, or work bounded to a defined scope. By spanning the four, the benchmark is designed to test agents against more than one style of search rather than a single query pattern.

The paper also describes two infrastructure components meant to support reproducible evaluation. The first is what the authors call scalable answer construction, the procedure by which the benchmark's ground-truth answers are assembled. The second is a shared retrieval backend named ScholarBase, which the abstract states provides a common substrate so that different agents are evaluated against the same underlying retrieval environment. The stated purpose of pairing the topics and intents with a shared backend is reproducibility: holding the retrieval environment fixed isolates the agent's behavior as the thing under measurement.

What the agents scored

The headline empirical finding the authors report is a comparison between agentic methods and single-shot retrieval baselines. The abstract states that agentic methods outperform the single-shot baselines, meaning that iterating, refining, and exploring across multiple steps produced better retrieval than issuing one query and taking the result. That ordering is the direction the agentic-search literature would expect, and the paper reports it as the benchmark's first-order result.
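The difference between the two regimes can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the corpus, the keyword-matching `search` function, and the fixed list of reformulated queries are invented here and do not describe the agents, baselines, or backend evaluated in the paper. In a real agent the follow-up queries would be generated by the model from earlier results rather than supplied up front.

```python
from typing import Dict, List, Set

# Toy corpus: paper IDs and texts are invented for illustration.
CORPUS: Dict[str, str] = {
    "p1": "graph neural networks for molecule property prediction",
    "p2": "message passing networks for chemistry",
    "p3": "molecule generation with diffusion models",
    "p4": "image classification with convolutional networks",
}


def search(query: str) -> List[str]:
    """Hypothetical retriever: IDs of papers whose text contains every query word."""
    words = set(query.split())
    return [pid for pid, text in CORPUS.items() if words <= set(text.split())]


def single_shot(query: str) -> Set[str]:
    """Issue one query and take whatever comes back."""
    return set(search(query))


def iterative(queries: List[str]) -> Set[str]:
    """Issue several reformulated queries and pool the results."""
    found: Set[str] = set()
    for q in queries:
        found.update(search(q))
    return found


one = single_shot("graph neural networks")
many = iterative(["graph neural networks", "message passing", "molecule"])
print(sorted(one), sorted(many))
```

In this toy setup the single query finds only the paper that shares its exact wording, while the pooled reformulations also reach papers that describe related work in different terms.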

The authors are explicit, however, about the absolute level the best system reached. The best-performing agent, they report, achieved 0.314 Recall@100 and 0.355 Recall@All, figures the abstract characterizes as "indicating substantial room for improvement." Recall@100 measures the fraction of relevant papers an agent surfaced within its top 100 results, and Recall@All measures the fraction recovered across the agent's full output. A best score of 0.314 on the former means that even the strongest agent in the evaluation surfaced under a third of the relevant papers within its top 100, on the authors' construction of the task. The paper presents these numbers as the current ceiling on the benchmark rather than as a solved result.
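As a rough illustration of the two metrics as defined above, the following minimal sketch computes recall over a ranked list with and without a cutoff. The function name `recall_at_k` and the paper IDs are hypothetical, and the paper's own evaluation code may differ in details such as deduplication or tie handling.

```python
from typing import Optional, Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: Optional[int] = None) -> float:
    """Fraction of relevant papers found in the first k results (all results if k is None)."""
    if not relevant:
        raise ValueError("relevant set must be non-empty")
    considered = ranked if k is None else ranked[:k]
    return len(set(considered) & relevant) / len(relevant)


# Toy data: paper IDs are made up for illustration.
relevant = {"p1", "p2", "p3", "p4"}
ranked = ["p9", "p1", "p7", "p3", "p8", "p2"]

recall_at_3 = recall_at_k(ranked, relevant, k=3)  # only p1 appears in the top 3
recall_all = recall_at_k(ranked, relevant)        # p1, p3, p2 appear in the full list
print(recall_at_3, recall_all)  # 0.25 0.75
```

With a cutoff of 100 instead of 3, the same function would give a Recall@100-style figure, and the uncapped call corresponds to Recall@All.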

What the paper says it measures beyond recall

The abstract indicates that ScholarQuest is intended to produce more than a single recall figure. The authors write that the benchmark supports analyses of search efficiency, intent-level robustness, and failure cases, and they describe these as multi-dimensional evaluation signals for academic paper-search agents. Search efficiency concerns how much work an agent expends to reach its results; intent-level robustness concerns whether an agent performs consistently across the four query intents rather than excelling on one and faltering on another; and failure-case analysis concerns the specific situations where agents break down. The authors present these as part of what the benchmark is designed to expose, alongside the aggregate recall scores.
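One simple way to picture an intent-level robustness check is to average per-query recall within each intent and look at the spread between the best and worst intent. The sketch below does that under stated assumptions: the recall values are made up, and the spread measure is an illustrative choice rather than a metric the abstract defines.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def recall_by_intent(results: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average per-query recall within each intent."""
    buckets = defaultdict(list)
    for intent, recall in results:
        buckets[intent].append(recall)
    return {intent: mean(values) for intent, values in buckets.items()}


# Made-up per-query recall values, for illustration only.
results = [
    ("method-oriented", 0.5), ("method-oriented", 0.25),
    ("setting-anchored", 0.25), ("setting-anchored", 0.25),
    ("comparison-based", 0.5), ("comparison-based", 0.0),
    ("scope-controlled", 0.125), ("scope-controlled", 0.375),
]

per_intent = recall_by_intent(results)
# Gap between the strongest and weakest intent; smaller means more consistent.
spread = max(per_intent.values()) - min(per_intent.values())
print(per_intent, spread)
```

An agent with a high aggregate recall but a large spread would be excelling on some intents and faltering on others, which is the pattern this kind of breakdown is meant to expose.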

The authors also tie the benchmark's design back to their stated concern with realism. The abstract describes ScholarQuest as targeting "open literature environments," the setting in which an agent searches across a broad and unbounded body of work rather than a small curated collection assembled in advance. That framing is the contrast the paper draws against existing benchmarks, which it says are insufficient for systematically evaluating agentic search under such conditions. The authors emphasize three elements working together: more than 1,000 topics give the benchmark breadth, the taxonomy and the four intents give it organization, and ScholarBase gives every agent the same retrieval substrate to work against. The paper presents that combination as what distinguishes its evaluation from a single-query or closed-pool setup.

The reported recall figures should be read against that construction. Because the ground truth in ScholarQuest is built to span a large topic space and four distinct intents, a low aggregate recall does not, on the authors' framing, indicate that the agents failed at a narrow task; it indicates how much of a broadly defined relevant set the best agent recovered. The authors report the gap as the benchmark's point: a best Recall@100 of 0.314 is the signal they cite for substantial headroom, and the intent-level and failure-case analyses are the tools the paper offers for understanding where that headroom lies.

Taken together, the abstract describes a benchmark with a defined construction, a shared retrieval backend, a reported ordering between agentic and single-shot methods, and a stated best-case recall that the authors themselves frame as leaving substantial room for improvement. The preprint is a single-version arXiv posting (v1) dated June 18, 2026, and the figures cited here are drawn from its abstract. Details of the agents evaluated, the construction of ScholarBase, the full set of metrics, and the per-intent breakdowns would appear in the body of the paper. Readers can consult the arXiv abstract page (2606.20235) for the authors' own framing and for the full text as the preprint is updated.
