WebOrganizer
/

FormatClassifier

@@ -8,14 +8,14 @@ base_model:
 ---
 # WebOrganizer/FormatClassifier
-[[Paper](ARXIV_TBD)] [[Website](https://weborganizer.allenai.org)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]
 The FormatClassifier organizes web content into 24 categories based on the URL and text contents of web pages.
 The model is a [gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) with 140M parameters fine-tuned on the following training data:
 1. [WebOrganizer/FormatAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
 2. [WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)
-##### All Domain Classifiers
 - [WebOrganizer/FormatClassifier](https://huggingface.co/WebOrganizer/FormatClassifier) *← you are here!*
 - [WebOrganizer/FormatClassifier-NoURL](https://huggingface.co/WebOrganizer/FormatClassifier-NoURL)
 - [WebOrganizer/TopicClassifier](https://huggingface.co/WebOrganizer/TopicClassifier)
@@ -80,7 +80,7 @@ You can convert the `logits` of the model with a softmax to obtain a probability
 The full definitions of the categories can be found in the [taxonomy config](https://github.com/CodeCreator/WebOrganizer/blob/main/define_domains/taxonomies/formats.yaml).
-##### Efficient Inference
 We recommend that you use the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory efficient attention. This __requires installing `xformers`__ (see more [here](https://huggingface.co/Alibaba-NLP/new-impl#recommendation-enable-unpadding-and-acceleration-with-xformers)) and loading the model like:
 ```python
 AutoModelForSequenceClassification.from_pretrained(

 ---
 # WebOrganizer/FormatClassifier
+[[Paper](https://arxiv.org/abs/2502.10341)] [[Website](https://weborganizer.allenai.org)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]
 The FormatClassifier organizes web content into 24 categories based on the URL and text contents of web pages.
 The model is a [gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) with 140M parameters fine-tuned on the following training data:
 1. [WebOrganizer/FormatAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
 2. [WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)
+#### All Domain Classifiers
 - [WebOrganizer/FormatClassifier](https://huggingface.co/WebOrganizer/FormatClassifier) *← you are here!*
 - [WebOrganizer/FormatClassifier-NoURL](https://huggingface.co/WebOrganizer/FormatClassifier-NoURL)
 - [WebOrganizer/TopicClassifier](https://huggingface.co/WebOrganizer/TopicClassifier)
 The full definitions of the categories can be found in the [taxonomy config](https://github.com/CodeCreator/WebOrganizer/blob/main/define_domains/taxonomies/formats.yaml).
+#### Efficient Inference
 We recommend that you use the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory efficient attention. This __requires installing `xformers`__ (see more [here](https://huggingface.co/Alibaba-NLP/new-impl#recommendation-enable-unpadding-and-acceleration-with-xformers)) and loading the model like:
 ```python
 AutoModelForSequenceClassification.from_pretrained(