CoSineVerifier: Tool-Augmented Answer Verification for Computation-Oriented Scientific Questions.
📖 Introduction
We present CoSineVerifier-Tool, a compact, tool-augmented verifier for Computation-Oriented ScIeNtific scEnario answer verification. It equips LLM reasoning with external tools (e.g., a Python interpreter) to accurately verify answers in computation-oriented scenarios such as algebraic equivalence and physical-constant alignment. We also release the CoSineVerifier series: efficient labeling verifiers with performance comparable to CoSineVerifier-Tool. Our methods achieve state-of-the-art results on VerifyBench and SCI-VerifyBench, and show clear improvements over other verification methods on RLVR tasks.
🌟 Key Features
Tool-augmented verification for computation-oriented scientific scenarios:
CoSineVerifier-Tool evaluates multi-step reasoning in math and science where correctness depends on precise intermediate calculations. It also normalizes answers to a common dimension/unit for consistent numeric checks. It integrates:
- 🧮 Python Interpreter: executes snippets to validate algebra/calculus steps, algorithmic logic, and data operations.
- 🔬 Scientific Unit Converter: verifies and normalizes unit conversions (e.g., km/h → m/s; joule → calorie).
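As a simplified, dependency-free sketch of what unit normalization involves (the demo installs pint for real unit handling; the hand-coded SI factors below are standard conversions and serve only as illustration):

```python
import math

# Illustrative unit normalization for numeric answer checks. This is a
# toy stand-in for the verifier's unit-converter tool, not its actual code.
FACTORS_TO_SI = {
    "km/h": 1.0 / 3.6,  # -> m/s
    "m/s": 1.0,
    "cal": 4.184,       # thermochemical calorie -> J
    "J": 1.0,
}

def to_si(value: float, unit: str) -> float:
    """Convert a value in `unit` to its SI-base equivalent."""
    return value * FACTORS_TO_SI[unit]

def same_quantity(v1, u1, v2, u2, rel_tol=1e-6):
    """Check whether two (value, unit) pairs denote the same quantity."""
    return math.isclose(to_si(v1, u1), to_si(v2, u2), rel_tol=rel_tol)

print(same_quantity(36.0, "km/h", 10.0, "m/s"))  # True
print(same_quantity(1.0, "cal", 4.184, "J"))     # True
```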
Concise, low-latency reasoning:
A 4B-parameter CoSineVerifier-Tool that emits ≤ 100 output tokens enables real-time evaluation and large-scale batch processing.
General applicability across tasks and domains:
Accurately assesses equivalence across math, physics, chemistry, biology, and logical reasoning; supports short-answer and multiple-choice formats; and handles both brief answers and long-form responses.
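For the short-answer and multiple-choice formats, a toy sketch of answer normalization (illustrative only; `normalize_answer` is a hypothetical helper, not CoSineVerifier's actual logic) might look like:

```python
import re

def normalize_answer(ans: str) -> str:
    """Toy normalizer: map '(B)', 'b.', 'B)' to a canonical option letter;
    fall back to lowercased, stripped text for short answers."""
    ans = ans.strip().lower()
    m = re.fullmatch(r"\(?([a-e])\)?[.)]?", ans)
    return m.group(1).upper() if m else ans

print(normalize_answer("(B)"))      # B
print(normalize_answer(" Paris "))  # paris
```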
🧪 Quick Start
This repository provides a minimal working demo for conveniently experimenting with the verifier.
⚙️ Installation
```shell
pip install torch transformers pint
```
🚀 Run the Demo
You can run the verifier with a local or Hugging Face model checkpoint:
```shell
python demo.py --model /path/to/the/verifier
```
The script will:
- Load the model and tokenizer.
- Run several evaluation cases (e.g., numerical equivalence, physics derivations, symbolic math).
- Print intermediate reasoning, any tool calls, and the final judgement for each case.
📌 Note
- Use the exact system and user prompt templates; do not change whitespace, and preserve line breaks exactly.
- The final judgement is parsed from the token `\boxed{Correct}` or `\boxed{Incorrect}` in the model's reply.
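A minimal sketch of parsing the verdict from a reply (illustrative; the demo script's actual parsing may differ):

```python
import re

def parse_verdict(reply: str):
    """Return 'Correct' or 'Incorrect' parsed from \\boxed{...}, else None."""
    m = re.search(r"\\boxed\{(Correct|Incorrect)\}", reply)
    return m.group(1) if m else None

print(parse_verdict(r"The answers agree. \boxed{Correct}"))  # Correct
print(parse_verdict("no verdict emitted"))                   # None
```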
🧩 Example
Hamilton–Jacobi Projectile Motion Verification
User:
{
"Question":"A particle m undergoes projectile motion with an initial velocity v₀ and a horizontal angle of elevation α. Solve using the Hamilton–Jacobi equation.",
"Ground_Truth":"['y = x tan α - (g x²) / (2 v₀² cos² α)']",
"Model_Prediction":"y = x tan α - (g x² tan² α) / (2 v₀²) - (g x²) / (2 v₀²)"
}
Assistant:
<tool_call>
{
"name":"python_interpreter",
"arguments":{
"code":"```python
def main():
    import math
    a, g, v0, x = 1.0, 9.8, 10.0, 1.0
    y_candidate = x * math.tan(a) - (g * x**2 * math.tan(a)**2) / (2 * v0**2) - (g * x**2) / (2 * v0**2)
    y_gold = x * math.tan(a) - (g * x**2) / (2 * v0**2 * math.cos(a)**2)
    print(y_candidate, y_gold)

if __name__ == '__main__':
    main()```"
}
}
</tool_call>
User:
<tool_response>
{
"name":"python_interpreter",
"status":"Success",
"run_result":{
"stdout":"1.3895573024349792 1.3895573024349792",
"stderr":"",
"exit_success":true,
"return_code":0,
"status":"Finished"
}
}
</tool_response>
Assistant:
Gold: \(y=x\tan\alpha-\frac{g x^2}{2v_0^2\cos^2\alpha}\).
Using \(\sec^2\alpha=1+\tan^2\alpha\): \(\frac{g x^2}{2v_0^2\cos^2\alpha}=\frac{g x^2}{2v_0^2}+\frac{g x^2\tan^2\alpha}{2v_0^2}\).
Hence \(y=x\tan\alpha-\frac{g x^2}{2v_0^2}-\frac{g x^2\tan^2\alpha}{2v_0^2}\equiv\texttt{pred}\).
Final Verdict: \boxed{Correct}
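The tool call above can also be reproduced as a standalone random-sampling check (a mismatch at any point disproves equivalence; agreement at many points is strong but not conclusive evidence; `numerically_equivalent` is an illustrative helper, not the verifier's internal tool):

```python
import math
import random

def numerically_equivalent(f, g, n_samples=100, rel_tol=1e-9):
    """Compare two candidate closed forms at random sample points."""
    rng = random.Random(0)
    for _ in range(n_samples):
        x = rng.uniform(0.1, 2.0)
        a = rng.uniform(0.1, 1.4)  # stay away from cos(a) = 0
        if not math.isclose(f(x, a), g(x, a), rel_tol=rel_tol):
            return False
    return True

G, V0 = 9.8, 10.0
pred = lambda x, a: x*math.tan(a) - (G*x**2*math.tan(a)**2)/(2*V0**2) - (G*x**2)/(2*V0**2)
gold = lambda x, a: x*math.tan(a) - (G*x**2)/(2*V0**2*math.cos(a)**2)
print(numerically_equivalent(pred, gold))  # True
```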
📊 Performance
We conduct a systematic evaluation on VerifyBench and SCI-VerifyBench for CoSineVerifier-Tool-4B and CoSineVerifier-32B, trained from Qwen3-4B-Instruct-2507 and Qwen3-32B, respectively. We report accuracy as mean@3 and efficiency as average output tokens per verdict on these benchmarks.
| Model | VerifyBench | VerifyBench (Hard) | SCI-VerifyBench | Avg Tokens |
|---|---|---|---|---|
| CoT Verifier | | | | |
| o3 | 96.1 | 88.7 | 87.5 | 206.7 |
| GPT-oss-20B | 92.2 | 84.7 | 85.0 | 221.0 |
| Qwen3-4B | 92.6 | 80.3 | 82.0 | 1156.7 |
| Qwen3-8B | 93.7 | 83.6 | 83.9 | 926.6 |
| Qwen3-32B | 94.7 | 85.2 | 83.5 | 798.8 |
| Qwen3-4B-Instruct-2507 | 94.7 | 84.1 | 82.4 | 869.7 |
| Qwen3-235B-A22B-2507 | 94.4 | 87.7 | 82.6 | 1885.3 |
| CoSineVerifier-Tool-4B | 96.6 | 91.9 | 89.7 | 95.3 |
| Labeling Verifier | | | | |
| XVerify-8B-I | 92.5 | 83.3 | 78.1 | 1.0 |
| CompassVerifier-7B | 93.5 | 85.2 | 85.7 | 1.0 |
| CompassVerifier-32B | 96.3 | 88.9 | 85.3 | 1.0 |
| CoSineVerifier-4B | 95.0 | 84.6 | 86.2 | 3.0 |
| CoSineVerifier-32B | 94.2 | 89.7 | 86.6 | 3.0 |
We further evaluate answer-verification methods in an RLVR setting to demonstrate the efficacy of the CoSineVerifier series. Using the on-policy GRPO algorithm, we train Qwen3-4B-Instruct-2507 on 42K competition-math problems drawn from DAPO-Math-17k, OpenR1-Math-220k, and DeepScaleR-Preview. Each experiment trains for 3 epochs with a batch size of 128 and rollout_num = 8.
We compare CoSineVerifier-Tool-4B and CoSineVerifier-32B against Math-Verify, CompassVerifier, and XVerify, reporting mean@32 accuracy on AIME 2024 and AIME 2025 in Figure 1.
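As a sketch of how binary verifier verdicts can feed GRPO (the group-normalized advantage below is the standard GRPO formulation; `group_advantages` and the sample verdicts are illustrative, not our training code):

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each rollout's reward by the mean
    and standard deviation of its group (rollout_num rewards per prompt)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# rollout_num = 8 completions for one prompt; reward is 1.0 iff the
# verifier's verdict is Correct.
verdicts = ["Correct", "Incorrect", "Correct", "Incorrect",
            "Incorrect", "Incorrect", "Correct", "Incorrect"]
rewards = [1.0 if v == "Correct" else 0.0 for v in verdicts]
print(group_advantages(rewards))
```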
🤔 Limitations
Despite its high accuracy, CoSineVerifier-Tool-4B invokes external tools in only ~10% of cases and still struggles on the hardest verification items. We plan to improve its capability by increasing tool-use coverage and developing a unified external-tool suite. The current CoSineVerifier labeling models output three tokens rather than one; we will reduce this to a single token in the next version.
✍️ Citation
If you find our verifiers useful or want to use them in your projects, please kindly cite this Hugging Face project.
```bibtex
@article{nbg_team_2025_cosineverifier,
  title  = {CoSineVerifier: Tool-Augmented Answer Verification for Computation-Oriented Scientific Questions},
  author = {{Nanbeige Team}},
  year   = {2025},
}
```
📮 Contact
If you have any questions, please raise an issue or contact us at nanbeige@126.com.