
CoSineVerifier: Tool-Augmented Answer Verification for Computation-Oriented Scientific Questions

📖 Introduction

We present CoSineVerifier-Tool, a compact, tool-augmented verifier for Computation-Oriented ScIeNtific scEnario answer verification. It equips LLM reasoning with external tools (e.g., a Python interpreter) to accurately verify answers in computation-oriented scenarios such as algebraic equivalence and physical-constant alignment. We also release the CoSineVerifier series: efficient single-token labeling verifiers with performance comparable to CoSineVerifier-Tool. Our methods achieve state-of-the-art results on VerifyBench and SCI-VerifyBench, and show clear improvements over other verification methods on RLVR tasks.

Figure 1: RLVR performance with different verification methods.

🌟 Key Features:

  • Tool-augmented verification for computation-oriented scientific scenarios:

    CoSineVerifier-Tool evaluates multi-step reasoning in math and science where correctness depends on precise intermediate calculations. It also normalizes answers to a common dimension/unit for consistent numeric checks. It integrates:

    • 🧮 Python Interpreter: executes snippets to validate algebra/calculus steps, algorithmic logic, and data operations.
    • 🔬 Scientific Unit Converter: verifies and normalizes unit conversions (e.g., km/h → m/s; joule → calorie); a minimal sketch of this normalization follows this list.
  • Concise, low-latency reasoning:

    A 4B-parameter CoSineVerifier-Tool with ≤100 output tokens enables real-time evaluation and large-scale batch processing.

  • General applicability across tasks and domains:

    Accurately assesses equivalence across math, physics, chemistry, biology, and logical reasoning; supports short-answer and multiple-choice formats; and handles both brief answers and long-form responses.
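
As a rough illustration of the unit-normalization step described in the feature list, the sketch below uses the pint library (installed in the Quick Start section) to reduce two quantities to common base units before a numeric comparison. The function name and tolerance are illustrative, not part of the released API.

```python
import pint

ureg = pint.UnitRegistry()

def quantities_equal(pred: str, gold: str, rel_tol: float = 1e-6) -> bool:
    """Illustrative check: reduce both quantities to SI base units, then compare."""
    p = ureg.Quantity(pred).to_base_units()
    g = ureg.Quantity(gold).to_base_units()
    if p.dimensionality != g.dimensionality:  # e.g., a speed can never equal an energy
        return False
    return abs(p.magnitude - g.magnitude) <= rel_tol * abs(g.magnitude)

print(quantities_equal("36 km/h", "10 m/s"))         # True: both reduce to 10 m/s
print(quantities_equal("1 calorie", "4.184 joule"))  # True: pint's thermochemical calorie
```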

🧪 Quick Start

This repository provides a minimal working demo so you can conveniently experiment with the verifier.

⚙️ Installation

pip install torch transformers pint

🚀 Run the Demo

You can run the verifier with a local or Hugging Face model checkpoint:

python demo.py --model /path/to/the/verifier

The script will:

  1. Load the model and tokenizer.
  2. Run several evaluation cases (e.g., numerical equivalence, physics derivations, symbolic math).
  3. Print intermediate reasoning, any tool calls, and the final judgement for each case.
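
If you prefer to call the checkpoint directly rather than through demo.py, below is a minimal sketch using the standard transformers chat API. The message contents are placeholders, since the exact prompt templates ship with this repository (see the note that follows), and the generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/the/verifier"  # local path or Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder content: substitute the exact system/user templates from this repo.
messages = [
    {"role": "system", "content": "<system prompt template>"},
    {"role": "user", "content": '{"Question": "...", "Ground_Truth": "...", "Model_Prediction": "..."}'},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```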

😊 Note

  • Use the exact system and user prompt templates. Do not change whitespace, and preserve line breaks exactly.
  • The final judgement is parsed from the token \boxed{Correct} or \boxed{Incorrect} in the model's reply.
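
For reference, a small sketch of that parsing step (the regex is ours, not code from the official demo):

```python
import re

def parse_verdict(reply: str) -> str | None:
    """Extract 'Correct' or 'Incorrect' from a \\boxed{...} verdict, else None."""
    match = re.search(r"\\boxed\{(Correct|Incorrect)\}", reply)
    return match.group(1) if match else None

print(parse_verdict(r"Final Verdict: \boxed{Correct}"))  # Correct
```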

🧩 Example

Hamilton–Jacobi Projectile Motion Verification


User:
{
  "Question":"A particle m undergoes projectile motion with an initial velocity vโ‚€ and a horizontal angle of elevation ฮฑ. Solve using the Hamiltonโ€“Jacobi equation.",
  "Ground_Truth":"['y = x tan ฮฑ - (g xยฒ) / (2 vโ‚€ยฒ cosยฒ ฮฑ)']",
  "Model_Prediction":"y = x tan ฮฑ - (g xยฒ tanยฒ ฮฑ) / (2 vโ‚€ยฒ) - (g xยฒ) / (2 vโ‚€ยฒ)"
}
  

Assistant:
<tool_call>
{
  "name":"python_interpreter",
  "arguments":{
    "code":"```python
      def main():
          import math
          a,g,v0,x=1.0,9.8,10.0,1.0
          y_candidate=x*math.tan(a)-(g*x**2*math.tan(a)**2)/(2*v0**2)-(g*x**2)/(2*v0**2)
          y_gold=x*math.tan(a)-(g*x**2)/(2*v0**2*(math.cos(a)**2))
          print(y_candidate,y_gold)
      if __name__=='__main__':
          main()```"
  }
}
</tool_call>

User:
<tool_response>
{
  "name":"python_interpreter",
  "status":"Success",
  "run_result":{
    "stdout":"1.3895573024349792 1.3895573024349792",
    "stderr":"",
    "exit_success":true,
    "return_code":0,
    "status":"Finished"
  }
}
</tool_response>

Assistant:
Gold: \(y=x\tan\alpha-\frac{g x^2}{2v_0^2\cos^2\alpha}\). 
Using \(\sec^2\alpha=1+\tan^2\alpha\): \(\frac{g x^2}{2v_0^2\cos^2\alpha}=\frac{g x^2}{2v_0^2}+\frac{g x^2\tan^2\alpha}{2v_0^2}\). 
Hence \(y=x\tan\alpha-\frac{g x^2}{2v_0^2}-\frac{g x^2\tan^2\alpha}{2v_0^2}\equiv\texttt{pred}\).
Final Verdict: \boxed{Correct}
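
Beyond the single-point numeric check above, the equivalence also holds symbolically; here is a sketch with sympy (an extra dependency, not used by the demo):

```python
import sympy as sp

x, g, v0, alpha = sp.symbols("x g v0 alpha", positive=True)
gold = x*sp.tan(alpha) - g*x**2 / (2*v0**2*sp.cos(alpha)**2)
pred = (x*sp.tan(alpha) - g*x**2*sp.tan(alpha)**2/(2*v0**2)
        - g*x**2/(2*v0**2))

# sec^2(alpha) = 1 + tan^2(alpha), so the difference collapses to zero.
print(sp.simplify(gold - pred))  # 0
```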
  

๐Ÿ” Performance

We conduct a systematic evaluation on VerifyBench and SCI-VerifyBench for CoSineVerifier-Tool-4B and CoSineVerifier-32B, trained from Qwen3-4B-Instruct-2507 and Qwen3-32B, respectively. We report accuracy as mean@3 and efficiency as average output tokens per verdict on these benchmarks.

| Model | VerifyBench | VerifyBench (Hard) | SCI-VerifyBench | Avg Tokens |
|---|---|---|---|---|
| **CoT Verifier** | | | | |
| o3 | 96.1 | 88.7 | 87.5 | 206.7 |
| GPT-oss-20B | 92.2 | 84.7 | 85.0 | 221 |
| Qwen3-4B | 92.6 | 80.3 | 82.0 | 1156.7 |
| Qwen3-8B | 93.7 | 83.6 | 83.9 | 926.6 |
| Qwen3-32B | 94.7 | 85.2 | 83.5 | 798.8 |
| Qwen3-4B-Instruct-2507 | 94.7 | 84.1 | 82.4 | 869.7 |
| Qwen3-235B-A22B-2507 | 94.4 | 87.7 | 82.6 | 1885.3 |
| CoSineVerifier-Tool-4B | 96.6 | 91.9 | 89.7 | 95.3 |
| **Labeling Verifier** | | | | |
| XVerify-8B-I | 92.5 | 83.3 | 78.1 | 1.0 |
| CompassVerifier-7B | 93.5 | 85.2 | 85.7 | 1.0 |
| CompassVerifier-32B | 96.3 | 88.9 | 85.3 | 1.0 |
| CoSineVerifier-4B | 95.0 | 84.6 | 86.2 | 3.0 |
| CoSineVerifier-32B | 94.2 | 89.7 | 86.6 | 3.0 |

We further evaluate answer-verification methods in an RLVR setting to demonstrate the efficacy of the CoSineVerifier series. Using an on-policy GRPO algorithm, we train Qwen3-4B-Instruct-2507 on competition-math problems, with 42K training examples drawn from DAPO-Math-17k, OpenR1-Math-220k, and DeepScaleR-Preview. Each experiment trains for 3 epochs with a batch size of 128 and rollout_num = 8.
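
In this setting the verifier supplies the reward signal: each sampled rollout is scored 1 if the verifier judges its answer correct and 0 otherwise. A minimal sketch follows; the `verify` callable stands in for whichever verifier is under comparison, and the names are ours, not a released API.

```python
from typing import Callable

def make_reward_fn(verify: Callable[[str, str, str], bool]) -> Callable[[str, str, str], float]:
    """Wrap an answer verifier as the binary reward used during GRPO training."""
    def reward_fn(question: str, ground_truth: str, rollout: str) -> float:
        # 1.0 if the verifier accepts the rollout's final answer, else 0.0.
        return 1.0 if verify(question, ground_truth, rollout) else 0.0
    return reward_fn
```

With rollout_num = 8, each prompt's eight sampled responses are scored this way before the group-relative advantages are computed.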

We compare CoSineVerifier-Tool-4B and CoSineVerifier-32B against Math-Verify, CompassVerifier, and XVerify, reporting mean@32 accuracy on AIME 2024 and AIME 2025, as shown in Figure 1.

🤖 Limitations

Despite its high accuracy, CoSineVerifier-Tool-4B invokes external tools in only ~10% of cases and still struggles on the hardest verification items. We will improve its capability by increasing tool-use coverage and developing a unified external-tool suite. The current CoSineVerifier labeling models output three tokens rather than one; we will reduce this to a single token in the next version.

โœ๏ธ Citation

If you find our verifiers useful or want to use them in your projects, please kindly cite this Hugging Face project.

@article{nbg_team_2025_cosineverifier,
  title   = {CoSineVerifier: Tool-Augmented Answer Verification for Computation-Oriented Scientific Questions},
  author  = {{Nanbeige Team}},
  year    = {2025},
}

📮 Contact

If you have any questions, please raise an issue or contact us at nanbeige@126.com.
