CoSineVerifier: Tool-Augmented Answer Verification for Computation-Oriented Scientific Questions.
📖 Introduction
We present CoSineVerifier-Tool, a compact, tool-augmented verifier for Computation-Oriented ScIeNtific scEnario answer verification. It equips LLM reasoning with external tools (e.g., a Python interpreter) to accurately verify answers in computation-oriented scenarios such as algebraic equivalence and physical-constant alignment. We also release the CoSineVerifier series: efficient labeling verifiers with performance comparable to CoSineVerifier-Tool. Our methods achieve state-of-the-art results on VerifyBench and SCI-VerifyBench, and show clear improvements over other verification methods on RLVR tasks.
🌟 Key Features
Tool-augmented verification for computation-oriented scientific scenarios:
CoSineVerifier-Tool evaluates multi-step reasoning in math and science where correctness depends on precise intermediate calculations. It also normalizes answers to a common dimension/unit for consistent numeric checks. It integrates:
- 🧮 Python Interpreter: executes snippets to validate algebra/calculus steps, algorithmic logic, and data operations.
- 🔬 Scientific Unit Converter: verifies and normalizes unit conversions (e.g., km/h → m/s; joule → calorie).
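As a simplified, dependency-free sketch of what unit normalization involves (the demo installs pint for real unit handling; the hand-coded SI factors below are standard conversions and serve only as illustration):

```python
import math

# Illustrative unit normalization for numeric answer checks. This is a
# toy stand-in for the verifier's unit-converter tool, not its actual code.
FACTORS_TO_SI = {
    "km/h": 1.0 / 3.6,  # -> m/s
    "m/s": 1.0,
    "cal": 4.184,       # thermochemical calorie -> J
    "J": 1.0,
}

def to_si(value: float, unit: str) -> float:
    """Convert a value in `unit` to its SI-base equivalent."""
    return value * FACTORS_TO_SI[unit]

def same_quantity(v1, u1, v2, u2, rel_tol=1e-6):
    """Check whether two (value, unit) pairs denote the same quantity."""
    return math.isclose(to_si(v1, u1), to_si(v2, u2), rel_tol=rel_tol)

print(same_quantity(36.0, "km/h", 10.0, "m/s"))  # True
print(same_quantity(1.0, "cal", 4.184, "J"))     # True
```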
Concise, low-latency reasoning:
A 4B-parameter CoSineVerifier-Tool that emits ≤ 100 output tokens enables real-time evaluation and large-scale batch processing.
General applicability across tasks and domains:
Accurately assesses equivalence across math, physics, chemistry, biology, and logical reasoning; supports short-answer and multiple-choice formats; and handles both brief answers and long-form responses.
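For the short-answer and multiple-choice formats, a toy sketch of answer normalization (illustrative only; `normalize_answer` is a hypothetical helper, not CoSineVerifier's actual logic) might look like:

```python
import re

def normalize_answer(ans: str) -> str:
    """Toy normalizer: map '(B)', 'b.', 'B)' to a canonical option letter;
    fall back to lowercased, stripped text for short answers."""
    ans = ans.strip().lower()
    m = re.fullmatch(r"\(?([a-e])\)?[.)]?", ans)
    return m.group(1).upper() if m else ans

print(normalize_answer("(B)"))      # B
print(normalize_answer(" Paris "))  # paris
```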
🧪 Quick Start
This repository provides a minimal working demo for conveniently experimenting with the verifier.
⚙️ Installation
```shell
pip install torch transformers pint
```
🚀 Run the Demo
You can run the verifier with a local or Hugging Face model checkpoint:
```shell
python demo.py --model /path/to/the/verifier
```
The script will:
- Load the model and tokenizer.
- Run several evaluation cases (e.g., numerical equivalence, physics derivations, symbolic math).
- Print intermediate reasoning, any tool calls, and the final judgement for each case.
📌 Note
- Use the exact system and user prompt templates; do not change whitespace, and preserve line breaks exactly.
- The final judgement is parsed from the token `\boxed{Correct}` or `\boxed{Incorrect}` in the model's reply.
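A minimal sketch of parsing the verdict from a reply (illustrative; the demo script's actual parsing may differ):

```python
import re

def parse_verdict(reply: str):
    """Return 'Correct' or 'Incorrect' parsed from \\boxed{...}, else None."""
    m = re.search(r"\\boxed\{(Correct|Incorrect)\}", reply)
    return m.group(1) if m else None

print(parse_verdict(r"The answers agree. \boxed{Correct}"))  # Correct
print(parse_verdict("no verdict emitted"))                   # None
```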
🧩 Example
Hamilton–Jacobi Projectile Motion Verification
User:
{
"Question":"A particle m undergoes projectile motion with an initial velocity v₀ and a horizontal angle of elevation α. Solve using the Hamilton–Jacobi equation.",
"Ground_Truth":"['y = x tan α - (g x²) / (2 v₀² cos² α)']",
"Model_Prediction":"y = x tan α - (g x² tan² α) / (2 v₀²) - (g x²) / (2 v₀²)"
}
Assistant:
<tool_call>
{
"name":"python_interpreter",
"arguments":{
"code":"```python
def main():
    import math
    a, g, v0, x = 1.0, 9.8, 10.0, 1.0
    y_candidate = x * math.tan(a) - (g * x**2 * math.tan(a)**2) / (2 * v0**2) - (g * x**2) / (2 * v0**2)
    y_gold = x * math.tan(a) - (g * x**2) / (2 * v0**2 * math.cos(a)**2)
    print(y_candidate, y_gold)

if __name__ == '__main__':
    main()```"
}
}
</tool_call>
User:
<tool_response>
{
"name":"python_interpreter",
"status":"Success",
"run_result":{
"stdout":"1.3895573024349792 1.3895573024349792",
"stderr":"",
"exit_success":true,
"return_code":0,
"status":"Finished"
}
}
</tool_response>
Assistant:
Gold: \(y=x\tan\alpha-\frac{g x^2}{2v_0^2\cos^2\alpha}\).
Using \(\sec^2\alpha=1+\tan^2\alpha\): \(\frac{g x^2}{2v_0^2\cos^2\alpha}=\frac{g x^2}{2v_0^2}+\frac{g x^2\tan^2\alpha}{2v_0^2}\).
Hence \(y=x\tan\alpha-\frac{g x^2}{2v_0^2}-\frac{g x^2\tan^2\alpha}{2v_0^2}\equiv\texttt{pred}\).
Final Verdict: \boxed{Correct}
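The tool call above can also be reproduced as a standalone random-sampling check (a mismatch at any point disproves equivalence; agreement at many points is strong but not conclusive evidence; `numerically_equivalent` is an illustrative helper, not the verifier's internal tool):

```python
import math
import random

def numerically_equivalent(f, g, n_samples=100, rel_tol=1e-9):
    """Compare two candidate closed forms at random sample points."""
    rng = random.Random(0)
    for _ in range(n_samples):
        x = rng.uniform(0.1, 2.0)
        a = rng.uniform(0.1, 1.4)  # stay away from cos(a) = 0
        if not math.isclose(f(x, a), g(x, a), rel_tol=rel_tol):
            return False
    return True

G, V0 = 9.8, 10.0
pred = lambda x, a: x*math.tan(a) - (G*x**2*math.tan(a)**2)/(2*V0**2) - (G*x**2)/(2*V0**2)
gold = lambda x, a: x*math.tan(a) - (G*x**2)/(2*V0**2*math.cos(a)**2)
print(numerically_equivalent(pred, gold))  # True
```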
📊 Performance
We conduct a systematic evaluation on VerifyBench and SCI-VerifyBench for CoSineVerifier-Tool-4B and CoSineVerifier-32B, trained from Qwen3-4B-Instruct-2507 and Qwen3-32B, respectively. We report accuracy as mean@3 and efficiency as average output tokens per verdict on these benchmarks.
| Model | VerifyBench | VerifyBench (Hard) | SCI-VerifyBench | Avg Tokens |
|---|---|---|---|---|
| CoT Verifier | | | | |
| o3 | 96.1 | 88.7 | 87.5 | 206.7 |
| GPT-oss-20B | 92.2 | 84.7 | 85.0 | 221.0 |
| Qwen3-4B | 92.6 | 80.3 | 82.0 | 1156.7 |
| Qwen3-8B | 93.7 | 83.6 | 83.9 | 926.6 |
| Qwen3-32B | 94.7 | 85.2 | 83.5 | 798.8 |
| Qwen3-4B-Instruct-2507 | 94.7 | 84.1 | 82.4 | 869.7 |
| Qwen3-235B-A22B-2507 | 94.4 | 87.7 | 82.6 | 1885.3 |
| CoSineVerifier-Tool-4B | 96.6 | 91.9 | 89.7 | 95.3 |
| Labeling Verifier | | | | |
| XVerify-8B-I | 92.5 | 83.3 | 78.1 | 1.0 |
| CompassVerifier-7B | 93.5 | 85.2 | 85.7 | 1.0 |
| CompassVerifier-32B | 96.3 | 88.9 | 85.3 | 1.0 |
| CoSineVerifier-4B | 95.0 | 84.6 | 86.2 | 3.0 |
| CoSineVerifier-32B | 94.2 | 89.7 | 86.6 | 3.0 |
We further evaluate answer-verification methods in an RLVR setting to demonstrate the efficacy of the CoSineVerifier series. Using the on-policy GRPO algorithm, we train Qwen3-4B-Instruct-2507 on 42K competition-math problems drawn from DAPO-Math-17k, OpenR1-Math-220k, and DeepScaleR-Preview. Each experiment trains for 3 epochs with a batch size of 128 and rollout_num = 8.
We compare CoSineVerifier-Tool-4B and CoSineVerifier-32B against Math-Verify, CompassVerifier, and XVerify, reporting mean@32 accuracy on AIME 2024 and AIME 2025 in Figure 1.
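As a sketch of how binary verifier verdicts can feed GRPO (the group-normalized advantage below is the standard GRPO formulation; `group_advantages` and the sample verdicts are illustrative, not our training code):

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each rollout's reward by the mean
    and standard deviation of its group (rollout_num rewards per prompt)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# rollout_num = 8 completions for one prompt; reward is 1.0 iff the
# verifier's verdict is Correct.
verdicts = ["Correct", "Incorrect", "Correct", "Incorrect",
            "Incorrect", "Incorrect", "Correct", "Incorrect"]
rewards = [1.0 if v == "Correct" else 0.0 for v in verdicts]
print(group_advantages(rewards))
```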
🤔 Limitations
Despite its high accuracy, CoSineVerifier-Tool-4B invokes external tools in only ~10% of cases and still struggles on the hardest verification items. We plan to improve its capability by increasing tool-use coverage and developing a unified external-tool suite. The current CoSineVerifier labeling models output three tokens rather than one; we will reduce this to a single token in the next version.
✍️ Citation
If you find our verifiers useful or want to use them in your projects, please kindly cite this Hugging Face project.
```bibtex
@article{nbg_team_2025_cosineverifier,
  title  = {CoSineVerifier: Tool-Augmented Answer Verification for Computation-Oriented Scientific Questions},
  author = {{Nanbeige Team}},
  year   = {2025},
}
```
📮 Contact
If you have any questions, please raise an issue or contact us at nanbeige@126.com.