Upload index.html

index.html (+4 −4)

@@ -148,8 +148,8 @@
 
       <!-- Figure 1 -->
       <div class="content has-text-centered">
-        <img src="./static/images/figure_1.png" alt="Reward-guided calibration accelerates binary search" style="
-        <p class="has-text-justified" style="
+        <img src="./static/images/figure_1.png" alt="Reward-guided calibration accelerates binary search" style="width: 100%; height: auto; margin: 20px 0; display: block;">
+        <p class="has-text-justified" style="margin: 0;">
         <strong>Figure 1. Reward-guided calibration accelerates binary search.</strong>
         Left: Increasing the per-step noisy reward (inverse-distance signal + noise) lowers the average number of search steps versus vanilla binary search.
         Right: An example run in which reward guidance converges early while vanilla search keeps oscillating.

@@ -184,8 +184,8 @@
 
       <!-- Figure 2 -->
       <div class="content has-text-centered">
-        <img src="./static/images/figure_2.png" alt="Test-time calibration framework and MATH-500 results" style="
-        <p class="has-text-justified" style="
+        <img src="./static/images/figure_2.png" alt="Test-time calibration framework and MATH-500 results" style="width: 100%; height: auto; margin: 20px 0; display: block;">
+        <p class="has-text-justified" style="margin: 0;">
         <strong>Figure 2. (a) Test-time calibration framework.</strong> With a rollout budget N = N<sub>1</sub> + N<sub>2</sub>, the model first explores by generating and scoring N<sub>1</sub> candidate responses. It then learns calibration parameters (δ, T) from the high-scoring responses and uses them to adjust the logits for the remaining N<sub>2</sub> generations. The final answer is selected from all N candidates.
         <strong>(b) MATH-500 Results.</strong> CarBoN improves weighted Best-of-N accuracy across four models. For all models, calibrated accuracy at N=64 (orange dashed line) matches or exceeds uncalibrated accuracy at N=256, corresponding to up to a 4× reduction in rollout budget. Notably, with Qwen2.5-Math-1.5B-Instruct at N=64, CarBoN surpasses GPT-4o (red dashed line), whereas uncalibrated Best-of-N does not even at N=256.
         </p>
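The Figure 2 caption describes a two-phase budget split: explore with N<sub>1</sub> uncalibrated rollouts, learn (δ, T) from the high-scoring ones, then spend the remaining N<sub>2</sub> rollouts on the calibrated distribution and pick the final answer by weighted Best-of-N. As a rough, hypothetical sketch of that data flow (not CarBoN's actual fitting procedure — the caption does not specify the objective for (δ, T), so the shift rule and the fixed temperature below are stand-ins, and `sample_answer`/`carbon_sketch` are invented names), a toy version over a discrete answer set might look like:

```python
import math
import random

def sample_answer(logits, delta, temp, rng):
    # Sample one answer index from softmax((logits + delta) / temp).
    z = [(l + d) / temp for l, d in zip(logits, delta)]
    m = max(z)
    weights = [math.exp(v - m) for v in z]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def carbon_sketch(logits, reward, N=64, N1=16, seed=0):
    # Phase 1 (explore): draw N1 samples from the uncalibrated model.
    rng = random.Random(seed)
    k = len(logits)
    no_shift = [0.0] * k
    candidates = [sample_answer(logits, no_shift, 1.0, rng) for _ in range(N1)]
    # "Learn" (delta, T) from the scored exploration samples.
    # Stand-in rule: shift each answer's logit by twice the best reward
    # observed on it, and sharpen with a fixed lower temperature; the real
    # method fits these parameters, this only illustrates the data flow.
    delta = [0.0] * k
    for a in candidates:
        delta[a] = max(delta[a], 2.0 * reward(a))
    T = 0.5
    # Phase 2 (exploit): draw the remaining N2 = N - N1 calibrated samples.
    candidates += [sample_answer(logits, delta, T, rng) for _ in range(N - N1)]
    # Weighted Best-of-N: return the answer with the largest total reward.
    totals = {}
    for a in candidates:
        totals[a] = totals.get(a, 0.0) + reward(a)
    return max(totals, key=totals.get)
```

For example, with three candidate answers whose logits slightly favor a wrong one and a 0/1 verifier reward on answer 1, the calibrated second phase concentrates sampling on the rewarded answer, so the weighted vote recovers it from a small budget.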