Update strings

- pages/about.py +2 -1
- src/strings.py +9 -2
pages/about.py CHANGED

@@ -8,11 +8,12 @@ ABOUT_LEADERBOARD = """
 
 ### Resources
 - **Documentation**: [Official docs](https://autogluon.github.io/fev/latest/)
+- **Publication**: ["fev-bench: A Realistic Benchmark for Time Series Forecasting"](https://arxiv.org/abs/2509.26468)
 - **Source Code**: [GitHub repository](https://github.com/autogluon/fev)
 - **Issues & Questions**: [GitHub Issues](https://github.com/autogluon/fev/issues)
 
 ### Submit Your Model
-Ready to add your model to the leaderboard? Follow this [tutorial](https://autogluon.github.io/fev/latest/tutorials/
+Ready to add your model to the leaderboard? Follow this [tutorial](https://autogluon.github.io/fev/latest/tutorials/05-add-your-model/) to evaluate your model with fev and contribute your results.
 """
 st.set_page_config(layout="wide", page_title="About FEV", page_icon=":material/info:")
 st.markdown(ABOUT_LEADERBOARD)
src/strings.py CHANGED

@@ -14,9 +14,13 @@ Model names are colored by type: <span style='color: {COLORS["dl_text"]}; font-w
 
 The full matrix $E_{{rj}}$ with the error of each model $j$ on task $r$ is available at the bottom of the page.
 
-* **Avg. win rate (%)**: Fraction of all possible model pairs and tasks where this model achieves lower error than the competing model. For model $j$, defined as $W_j = \\frac{{1}}{{R(M-1)}} \\sum_{{r=1}}^{{R}} \\sum_{{k \\neq j}} (\\mathbf{{1}}(E_{{rj}} < E_{{rk}}) + 0.5 \\cdot \\mathbf{{1}}(E_{{rj}} = E_{{rk}}))$ where $R$ is number of tasks, $M$ is number of models. Ties count as half-wins.
+* **Avg. win rate (%)**: Fraction of all possible model pairs and tasks where this model achieves lower error than the competing model. For model $j$, defined as $W_j = \\frac{{1}}{{R(M-1)}} \\sum_{{r=1}}^{{R}} \\sum_{{k \\neq j}} (\\mathbf{{1}}(E_{{rj}} < E_{{rk}}) + 0.5 \\cdot \\mathbf{{1}}(E_{{rj}} = E_{{rk}}))$ where $R$ is number of tasks, $M$ is number of models. Ties count as half-wins.
 
-
+Ranges from 0% (worst) to 100% (best). Higher values are better. This value changes as new models are added to the benchmark.
+
+* **Skill score (%)**: Measures how much the model reduces forecasting error compared to the Seasonal Naive baseline. Computed as $S_j = 100 \\times (1 - \\sqrt[R]{{\\prod_{{r=1}}^{{R}} E_{{rj}}/E_{{r\\beta}}}})$, where $E_{{r\\beta}}$ is baseline error on task $r$. Relative errors are clipped between 0.01 and 100 before aggregation to avoid extreme outliers. Positive values indicate better-than-baseline performance, negative values indicate worse-than-baseline performance.
+
+Higher values are better. This value does not change as new models are added to the benchmark.
 
 * **Median runtime (s)**: Median end-to-end time (training + prediction across all evaluation windows) in seconds. Note that inference times depend on hardware, batch sizes, and implementation details, so these serve as a rough guide rather than definitive performance benchmarks.
 
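To make the win-rate definition added above concrete, here is a minimal sketch of how $W_j$ could be computed from an error matrix. This is an illustration, not fev's actual implementation; the `avg_win_rate` helper, the matrix `E`, and the sample values are all hypothetical.

```python
import numpy as np

def avg_win_rate(E: np.ndarray) -> np.ndarray:
    """Average win rate per model for an error matrix E of shape (R, M),
    where E[r, j] is the error of model j on task r.

    W_j = 1 / (R * (M - 1)) * sum over tasks r and opponents k != j of
    1(E_rj < E_rk) + 0.5 * 1(E_rj == E_rk); ties count as half-wins.
    """
    R, M = E.shape
    wins = np.zeros(M)
    for j in range(M):
        for k in range(M):
            if k != j:
                # Wins against opponent k across all tasks, ties as half-wins.
                wins[j] += np.sum(E[:, j] < E[:, k]) + 0.5 * np.sum(E[:, j] == E[:, k])
    return wins / (R * (M - 1))

# Hypothetical 3-task x 3-model error matrix.
E = np.array([
    [0.8, 1.0, 1.2],
    [0.5, 0.5, 0.9],
    [1.1, 0.7, 0.7],
])
print(avg_win_rate(E) * 100)  # [58.33..., 66.66..., 25.0], win rates in %
```

Because every model is scored against every other model in the pool, adding a new model shifts the win rates of existing entries, which is exactly why the added text notes that this value changes as the benchmark grows.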
@@ -57,6 +61,9 @@ CITATION_FEV = """
 title={{fev-bench}: A Realistic Benchmark for Time Series Forecasting},
 author={Shchur, Oleksandr and Ansari, Abdul Fatir and Turkmen, Caner and Stella, Lorenzo and Erickson, Nick and Guerron, Pablo and Bohlke-Schneider, Michael and Wang, Yuyang},
 year={2025},
+eprint={2509.26468},
+archivePrefix={arXiv},
+primaryClass={cs.LG}
 }
 ```
 """
|