Can We Trust Biomedical AI Agents? Benchmarking Quality, Safety, and Reliability

Jun 12, 2026

Why benchmarking matters

Biomedical agents are different from general chatbots. They are expected to understand scientific questions, retrieve biomedical knowledge, plan workflows, call computational tools, analyze outputs, and provide safety recommendations. Because of this, evaluating only the final answer is not enough. A biomedical agent should also be tested on how it reaches the answer, whether it selects the right tools, whether the workflow is biologically reasonable, and whether it clearly explains uncertainty and limitations.

This is especially important because biomedical tasks are often complex and high-stakes. An incorrect workflow, unsupported claim, or poorly interpreted result may mislead users and affect downstream research decisions. A strong benchmark should therefore evaluate the full agent behavior, from understanding the user’s request to retrieving evidence, selecting tools, checking safety, and explaining results.

In this post, we evaluate Vecura Agent on 500 benchmark tasks designed to measure capability, reliability, and safety in real-world biomedical research workflows.

What we evaluated

We evaluated our biomedical agent across five core task types: information retrieval, tool execution, decision support, safety checking, and result explanation.

Information retrieval tasks tested whether the agent could find relevant papers, databases, models, or biological evidence. Tool execution tasks assessed whether the agent could select tools, prepare inputs, set parameters, monitor jobs, and interpret tool outputs. The benchmark also included decision-support tasks, where the agent needed to recommend suitable models, databases, assays, or next steps. Safety-checking tasks tested whether the agent could identify risky, invalid, unsupported, or clinically sensitive requests. Finally, result-explanation tasks evaluated whether the agent could translate technical outputs into clear biological meaning while presenting evidence, assumptions, limitations, and uncertainty.

Together, these tasks test whether the agent can behave like a reliable research assistant across the biomedical workflow, rather than simply answering isolated questions.

How we designed the benchmark

To evaluate performance, we tested the same agent framework using three model providers: Qwen-3.7 Max, Claude Opus 4.7, and Nemotron 3 Ultra. Each provider was connected to the same Vecura agent framework so that the comparison focused on agent behavior rather than differences in interface design.
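
As a rough illustration of holding the framework fixed while swapping providers, the loop below sends every prompt through a single agent entry point once per provider. The names `run_benchmark` and `run_agent` and the short provider labels are hypothetical and are not taken from the Vecura codebase; the lambda stands in for a real agent call.

```python
from typing import Callable, Dict, List

# Hypothetical signature for the agent entry point: (provider, prompt) -> transcript.
AgentRunner = Callable[[str, str], str]


def run_benchmark(
    prompts: List[str],
    providers: List[str],
    run_agent: AgentRunner,
) -> Dict[str, List[str]]:
    """Run the same prompts through the same agent entry point for each provider."""
    return {
        provider: [run_agent(provider, prompt) for prompt in prompts]
        for provider in providers
    }


if __name__ == "__main__":
    # Stand-in agent that only echoes its inputs; a real run would call the agent.
    transcripts = run_benchmark(
        prompts=["Find papers on KRAS G12C inhibitors"],
        providers=["qwen", "claude", "nemotron"],
        run_agent=lambda provider, prompt: f"[{provider}] {prompt}",
    )
    for provider, outputs in transcripts.items():
        print(provider, len(outputs))
```

Keeping the provider as the only varying argument is what lets score differences be attributed to the model rather than to the surrounding harness.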

The benchmark contained 500 prompts (100 per task type) designed to reflect common use cases in a biomedical agentic AI platform. Each prompt was run through the actual LangGraph agent with live tool execution, once for each of the three providers. Each response was scored on a 0–3 rubric by Claude Opus acting as a judge blind to the provider; a score of 2 or higher counted as PASS and a score of 1 or lower as FAIL. Because the blind LLM judge’s safety filter blocked bulk evaluation of safety-related transcripts, safety performance was assessed separately using a refusal-based heuristic combined with partial blind judging. This scoring approach allowed us to evaluate both answer accuracy and agent quality, including intent understanding, tool selection, safety awareness, evidence use, and clarity of explanation.
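
The PASS/FAIL rule and the per-type averages can be sketched in a few lines. This is a minimal illustration, not the evaluation code used for the benchmark: the `JudgedTask` record, the task-type labels, and the example scores are all made up for demonstration. Only the 0–3 range and the threshold of 2 come from the description above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, List

PASS_THRESHOLD = 2  # PASS = score >= 2, FAIL = score <= 1


@dataclass(frozen=True)
class JudgedTask:
    task_type: str  # hypothetical label, e.g. "tool_execution"
    score: int      # rubric score assigned by the judge: 0, 1, 2, or 3


def verdict(score: int) -> str:
    """Map a 0-3 rubric score to PASS or FAIL."""
    if score not in (0, 1, 2, 3):
        raise ValueError(f"rubric score must be 0-3, got {score}")
    return "PASS" if score >= PASS_THRESHOLD else "FAIL"


def summarize(tasks: Iterable[JudgedTask]) -> Dict[str, Dict[str, float]]:
    """Return the mean score and pass rate for each task type."""
    by_type: Dict[str, List[int]] = {}
    for task in tasks:
        by_type.setdefault(task.task_type, []).append(task.score)
    return {
        task_type: {
            "mean_score": mean(scores),
            "pass_rate": sum(verdict(s) == "PASS" for s in scores) / len(scores),
        }
        for task_type, scores in by_type.items()
    }


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    demo = [
        JudgedTask("tool_execution", 3),
        JudgedTask("tool_execution", 1),
        JudgedTask("decision_support", 2),
    ]
    print(summarize(demo))
```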

What we found

All three providers completed the benchmark successfully, with each provider returning valid responses for 500 out of 500 tasks. No run errors, timeouts, or execution failures were observed, indicating that the agent framework was technically stable across Claude, Qwen, and Nemotron.

View the full report here.


Task type / metric	Claude	Qwen	Nemotron
Information retrieval	2.91	2.86	2.94
Tool execution	2.82	2.80	2.82
Decision support	2.89	2.94	2.94
Result explanation	2.90	2.96	2.92
Safety (approx.)	1.24	2.91	2.78
Overall (4 task types, excl. safety)	2.88	2.89	2.90
Run errors	0	0	0
ask_question fired	5	3	0
Avg wall time / task	17.1 s	20.7 s	10.5 s

Across the four non-safety task types, overall performance was highly comparable among the three providers. Nemotron achieved the highest overall average score of 2.90, followed closely by Qwen at 2.89 and Claude at 2.88. These small differences suggest that all three providers performed strongly on common biomedical agent tasks, including information retrieval, tool execution, decision support, and result explanation. Because the score differences were small, the results should be interpreted as broadly comparable performance rather than a clear separation among providers.
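
The overall figures can be re-derived from the per-type scores in the table. The sketch below assumes the overall score is an unweighted mean of the four non-safety task types, which is consistent with each type having 100 prompts; the dictionary layout is only for illustration.

```python
from statistics import mean

# Per-type mean scores copied from the results table (safety excluded).
scores = {
    "Claude": {
        "Information retrieval": 2.91,
        "Tool execution": 2.82,
        "Decision support": 2.89,
        "Result explanation": 2.90,
    },
    "Qwen": {
        "Information retrieval": 2.86,
        "Tool execution": 2.80,
        "Decision support": 2.94,
        "Result explanation": 2.96,
    },
    "Nemotron": {
        "Information retrieval": 2.94,
        "Tool execution": 2.82,
        "Decision support": 2.94,
        "Result explanation": 2.92,
    },
}

# Unweighted mean across the four task types (assumption: equal weights).
overall = {provider: mean(by_type.values()) for provider, by_type in scores.items()}
spread = max(overall.values()) - min(overall.values())

for provider, value in overall.items():
    print(f"{provider}: {value:.3f}")
print(f"spread between best and worst: {spread:.3f}")
```

Computed this way from the rounded table values, the means come out at about 2.88, 2.89, and 2.905, so the gap between the highest and lowest provider is roughly 0.025 points on a 3-point scale.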

Performance patterns differed slightly across task categories. In information retrieval, Nemotron achieved the highest score of 2.94, followed by Claude at 2.91 and Qwen at 2.86. This indicates that all three providers were effective at retrieving or presenting relevant biomedical information, with Nemotron showing a small advantage in this category. In tool execution, Claude and Nemotron both scored 2.82, while Qwen scored 2.80. This was the lowest-scoring category across providers, suggesting that tool selection, required input identification, parameter preparation, and execution logic remain relatively challenging compared with other task types.

In decision support, Qwen and Nemotron both achieved the highest score of 2.94, slightly above Claude at 2.89. This suggests that Qwen and Nemotron were particularly effective at recommending suitable models, tools, databases, or next steps based on the user’s biomedical goal. For result explanation, Qwen achieved the highest score of 2.96, followed by Nemotron at 2.92 and Claude at 2.90. This indicates that Qwen performed especially well when interpreting outputs, explaining scores or metrics, and translating technical results into biologically meaningful conclusions.

The providers also differed in clarification behavior. The agent triggered the ask_question behavior five times with Claude, three times with Qwen, and zero times with Nemotron. This suggests that Claude and Qwen were more likely to explicitly request clarification when user requirements were incomplete or ambiguous, whereas Nemotron tended to proceed without triggering the formal clarification mechanism during this benchmark run. The counts are small relative to 500 tasks per provider, so the pattern is suggestive rather than conclusive. The difference still matters because human-in-the-loop clarification is a key behavior for biomedical agent safety and reliability.
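
To put the clarification counts in proportion, they can be expressed as a share of the 500 tasks each provider ran. The snippet is plain arithmetic on the reported counts; the variable names are illustrative.

```python
TASKS_PER_PROVIDER = 500  # each provider ran all 500 benchmark tasks

# Number of tasks on which ask_question fired, from the results table.
ask_question_fired = {"Claude": 5, "Qwen": 3, "Nemotron": 0}

clarification_rate = {
    provider: count / TASKS_PER_PROVIDER
    for provider, count in ask_question_fired.items()
}

for provider, rate in clarification_rate.items():
    print(f"{provider}: {rate:.1%}")
# Claude: 1.0%
# Qwen: 0.6%
# Nemotron: 0.0%
```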

Execution efficiency also varied across providers. Nemotron was the fastest provider, with an average wall time of 10.5 seconds per task. Claude followed with 17.1 seconds per task, while Qwen was the slowest at 20.7 seconds per task. These results suggest that Nemotron may be preferable when response speed is the main priority, provided that the task does not require strong clarification behavior.

Token usage showed another difference among providers. Nemotron used the highest number of input tokens, with 17.45 million total input tokens and an average of 34,907 input tokens per task. Claude used 14.80 million total input tokens, while Qwen used 12.99 million. For output tokens, Qwen generated the most, with 0.43 million total output tokens and an average of 851 output tokens per task. Claude generated 0.36 million output tokens, while Nemotron generated the fewest, with 0.32 million output tokens and an average of 638 output tokens per task. This suggests that Qwen tended to produce more detailed responses, whereas Nemotron produced shorter outputs.
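
The per-task token averages follow from dividing each total by the 500 tasks. The sketch below uses the totals as reported in millions, which are already rounded; the helper name `per_task` is hypothetical.

```python
TASKS = 500  # tasks per provider

# Reported totals, in millions of tokens (rounded in the source).
input_tokens_m = {"Nemotron": 17.45, "Claude": 14.80, "Qwen": 12.99}
output_tokens_m = {"Nemotron": 0.32, "Claude": 0.36, "Qwen": 0.43}


def per_task(total_millions: float) -> float:
    """Convert a total in millions of tokens to an average per task."""
    return total_millions * 1_000_000 / TASKS


for provider in input_tokens_m:
    avg_in = per_task(input_tokens_m[provider])
    avg_out = per_task(output_tokens_m[provider])
    ratio = avg_in / avg_out
    print(f"{provider}: ~{avg_in:,.0f} in, ~{avg_out:,.0f} out, ~{ratio:.0f}:1 input to output")
```

Because the totals are rounded, the results land near but not exactly on the averages quoted above, for example about 34,900 rather than 34,907 input tokens per task for Nemotron. For every provider, input tokens outnumber output tokens by a wide margin.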

Overall, the benchmark shows that all three providers performed strongly on biomedical agent tasks, with only small differences in the headline average. Nemotron showed the highest average score and the fastest average response time. Qwen performed particularly well in decision support and result explanation, while Claude showed stable performance across all task types and triggered clarification most often. These findings suggest that provider selection should depend on the intended use case, including whether the priority is faster response time, stronger clarification behavior, shorter outputs, more detailed explanations, or balanced overall task quality.

Key takeaways

The three providers performed similarly across the four non-safety task types, with average scores ranging from 2.88 to 2.90. This suggests that all providers can support common tasks such as information retrieval, tool execution, decision support, and result explanation.

However, each provider showed different strengths. Nemotron had the highest non-safety average score and the fastest response time. Qwen performed especially well in decision support and result explanation. Claude triggered clarification most often, which may be useful when user requests are incomplete or ambiguous.

Tool execution remained the most challenging task type across providers, suggesting that future improvement should focus on tool selection, input preparation, and parameter guidance.

In short, provider selection should depend on the use case. Fast providers may be suitable for routine tasks, while providers with stronger clarification behavior or richer explanations may be better for complex biomedical workflows.
