Image classification using fine-tuned ViT for historical document sorting

Goal: solve the task of sorting archive page images (for their further content-based processing)

Scope: image processing, training and evaluation of the ViT model, input file/directory processing, output of the top-N class 🏷️ (category) predictions, summarizing the predictions into a tabular format, and HF 😊 hub support for the model.
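
For instance, the fine-tuned model can be tried via the standard transformers image-classification pipeline; a minimal sketch ("page.png" is a placeholder input path, and top_k controls how many category guesses are returned):

```python
from transformers import pipeline

# Minimal sketch: load the fine-tuned checkpoint from the HF hub and
# classify one page image, keeping the top-3 category guesses.
classifier = pipeline("image-classification", model="ufal/vit-historical-page")
for pred in classifier("page.png", top_k=3):  # "page.png" is a hypothetical input
    print(pred["label"], round(pred["score"], 4))
```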

Versions 🏁

There are currently several versions of the model available for download; all of them share the same set of categories but differ in their data annotations. The latest, v5.3, is considered the default and can be found in the main branch of the HF 😊 hub ^1 🔗.

| Version | Base | Pages | PDFs | Description |
|---|---|---|---|---|
| v2.0 | vit-base-patch16-224 | 10073 | 3896 | annotations with mistakes, more heterogeneous data |
| v2.1 | vit-base-patch16-224 | 11940 | 5002 | main: more diverse pages in each category, fewer annotation mistakes |
| v2.2 | vit-base-patch16-224 | 15855 | 5730 | same data as v2.1 plus some restored pages from v2.0 |
| v3.2 | vit-base-patch16-384 | 15855 | 5730 | same data as v2.2, but a slightly larger model base with higher resolution |
| v5.2 | vit-large-patch16-384 | 15855 | 5730 | same data as v2.2, but the largest model base with higher resolution |
| v1.2 | efficientnetv2_s.in21k | 15855 | 5730 | same data as v2.2, but the smallest model base (CNN) |
| v4.2 | efficientnetv2_l.in21k_ft_in1k | 15855 | 5730 | same data as v2.2, a CNN base smaller than the largest; may be more accurate |
| v2.3 | vit-base-patch16-224 | 38625 | 37328 | data from the new annotation phase, more single-page documents used, transformer model |
| v3.3 | vit-base-patch16-384 | 38625 | 37328 | same data as v2.3, but a slightly larger model base with higher resolution |
| v5.3 | vit-large-patch16-384 | 38625 | 37328 | same data as v2.3, but the largest model base with higher resolution |
| v1.3 | efficientnetv2_m.in21k_ft_in1k | 38625 | 37328 | same data as v2.3, but the smallest model base (CNN) |
| v4.3 | regnety_160.swag_ft_in1k | 38625 | 37328 | same data as v2.3, a CNN base bigger than the smallest; may be more accurate |
Base model characteristics:

| Base model | Parameters (M) | Resolution (px) | Revision |
|---|---|---|---|
| efficientnetv2_s.in21k | 48 | 300 | v1.X |
| efficientnetv2_m.in21k_ft_in1k | 54 | 384 | v1.3 |
| vit-base-patch16-224 | 87 | 224 | v2.X |
| vit-base-patch16-384 | 87 | 384 | v3.X |
| regnety_160.swag_ft_in1k | 84 | 224 | v4.3 |
| vit-large-patch16-384 | 305 | 384 | v5.X |
| regnety_640.seer | 281 | 384 | v6.3 |
Best results per base model (max_cat is the per-category page cap applied to the training data):

| Base model | Revision | max_cat | Best_Prec (%) | Best_Acc (%) | Fold | Note |
|---|---|---|---|---|---|---|
| google/vit-base-patch16-224 | v2.3 | 14,000 | 98.79 | 98.79 | 5 | OK & Small |
| google/vit-base-patch16-384 | v3.3 | 14,000 | 98.92 | 98.92 | 2 | Good & Small |
| google/vit-large-patch16-384 | v5.3 | 14,000 | 99.12 | 99.12 | 2 | Best & Large |
| microsoft/dit-base-finetuned-rvlcdip | v9.3 | 14,000 | 98.71 | 98.72 | 3 | |
| microsoft/dit-large-finetuned-rvlcdip | v10.3 | 14,000 | 98.66 | 98.66 | 3 | |
| microsoft/dit-large | v11.3 | 14,000 | 98.53 | 98.53 | 2 | |
| timm/regnety_120.sw_in12k_ft_in1k | v12.3 | 14,000 | 98.29 | 98.29 | 3 | |
| timm/regnety_160.swag_ft_in1k | v4.3 | 14,000 | 99.17 | 99.16 | 1 | Best & Small |
| timm/regnety_640.seer | v6.3 | 14,000 | 98.79 | 98.79 | 5 | OK & Large |
| timm/tf_efficientnetv2_l.in21k_ft_in1k | v8.3 | 14,000 | 98.62 | 98.62 | 5 | |
| timm/tf_efficientnetv2_m.in21k_ft_in1k | v1.3 | 14,000 | 98.83 | 98.83 | 1 | Good & Small |
| timm/tf_efficientnetv2_s.in21k | v7.3 | 14,000 | 97.90 | 97.87 | 1 | |
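
Since the versions are published as revisions of the same HF hub repository (with v5.3 in main), a specific ViT revision can presumably be pinned via the revision argument of from_pretrained. A sketch, assuming the revision tags match the version names above (the CNN-based versions use timm bases and need the corresponding loaders instead):

```python
from transformers import ViTForImageClassification, ViTImageProcessor

# Sketch: pin a specific model revision from the hub (here v3.3; omit
# the revision argument to get the default v5.3 from the main branch).
model = ViTForImageClassification.from_pretrained(
    "ufal/vit-historical-page", revision="v3.3")
processor = ViTImageProcessor.from_pretrained(
    "ufal/vit-historical-page", revision="v3.3")
```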

Model description 📇

(Figure: architecture.png)

🔲 Fine-tuned model repository: vit-historical-page ^1 🔗

🔳 Base model repositories:

  • Google's vit-base-patch16-224, vit-base-patch16-384, and vit-large-patch16-384 ^2 ^6 ^7 🔗
  • timm's regnety_160.swag_ft_in1k, efficientnetv2_s.in21k, efficientnetv2_m.in21k_ft_in1k, and efficientnetv2_l.in21k_ft_in1k ^11 ^8 ^12 ^9 🔗

Data 📜

The dataset is provided under a Public Domain license and consists of 48,499 PNG images of pages from 37,328 archival documents. The source image files and their annotations can be found in the LINDAT repository ^10 🔗.

Manual ✍️ annotation was performed beforehand and took some time ⌛. The categories 🪧 tabulated below were formed from different sources of archival documents originating in the 1920–2020 span.

| Category | Dataset 0 | Dataset 1 | Dataset 2 | Dataset 3 |
|---|---|---|---|---|
| DRAW | 1090 (9.1%) | 1368 (8.8%) | 1472 (9.3%) | 2709 (5.6%) |
| DRAW_L | 1091 (9.1%) | 1383 (8.9%) | 1402 (8.8%) | 2921 (6.0%) |
| LINE_HW | 1055 (8.8%) | 1113 (7.2%) | 1115 (7.0%) | 2514 (5.2%) |
| LINE_P | 1092 (9.1%) | 1540 (9.9%) | 1580 (10.0%) | 2439 (5.0%) |
| LINE_T | 1098 (9.2%) | 1664 (10.7%) | 1668 (10.5%) | 9883 (20.4%) |
| PHOTO | 1081 (9.1%) | 1632 (10.5%) | 1730 (10.9%) | 2691 (5.5%) |
| PHOTO_L | 1087 (9.1%) | 1087 (7.0%) | 1088 (6.9%) | 2830 (5.8%) |
| TEXT | 1091 (9.1%) | 1587 (10.3%) | 1592 (10.0%) | 14227 (29.3%) |
| TEXT_HW | 1091 (9.1%) | 1092 (7.1%) | 1092 (6.9%) | 2008 (4.1%) |
| TEXT_P | 1083 (9.1%) | 1540 (9.9%) | 1633 (10.3%) | 2312 (4.8%) |
| TEXT_T | 1081 (9.1%) | 1476 (9.5%) | 1482 (9.3%) | 3965 (8.2%) |
| Unique PDFs | 5001 | 5694 | 5729 | 37328 |
| Total pages | 11,940 | 15,482 | 15,854 | 48,499 |

The table above shows the category distribution for the different model versions. The last column (Dataset 3) corresponds to the data of the latest vX.3 models, which actually used only 14,000 pages of the TEXT category; the other columns cover all used samples, split 80% for training 💪 and 10% each for the development and test 🏆 sets. The early model versions used 90% of the data for training 💪 and the remaining 10% as both the development and test 🏆 set, due to the lack of annotated (manually classified) pages.

The disproportion of the categories 🪧 in both training and evaluation data is NOT intentional; it reflects the nature of the source data.
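
For illustration only, a stratified 80/10/10 split that preserves such an imbalanced category distribution could look as follows (a sketch, not the authors' actual split procedure; the page paths and labels below are placeholders):

```python
from sklearn.model_selection import train_test_split

# Hypothetical inputs: one image path and one category label per page.
pages = [f"page_{i}.png" for i in range(100)]
labels = ["TEXT"] * 50 + ["DRAW"] * 50

# First carve off 20%, then split it half-and-half into dev and test,
# stratifying by label to keep the category proportions in every subset.
train_x, rest_x, train_y, rest_y = train_test_split(
    pages, labels, test_size=0.2, stratify=labels, random_state=42)
dev_x, test_x, dev_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=42)
```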

Training set sizes:

  • v2.0: 8,950 images
  • v2.1: 10,745 images
  • v2.X: 14,565 images
  • vX.3: 38,625 images

Plus, the evaluation (test) set sizes:

  • 1,586 images (taken from the v2.2 annotations)
  • 4,823 images (for the vX.3 models)

Categories 🏷️

| Label | Description |
|---|---|
| DRAW 📈 | drawings, maps, paintings, schematics, or graphics, potentially containing some text labels or captions |
| DRAW_L 📈📏 | drawings, etc., presented within a table-like layout or including a legend formatted as a table |
| LINE_HW ✍️📏 | handwritten text organized in a tabular or form-like structure |
| LINE_P 📏 | printed text organized in a tabular or form-like structure |
| LINE_T 📏 | machine-typed text organized in a tabular or form-like structure |
| PHOTO 🌄 | photographs or photographic cutouts, potentially with text captions |
| PHOTO_L 🌄📏 | photos presented within a table-like layout or accompanied by tabular annotations |
| TEXT 📰 | mixtures of printed, handwritten, and/or typed text, potentially with minor graphical elements |
| TEXT_HW ✍️📄 | only handwritten text in paragraph or block form (non-tabular) |
| TEXT_P 📄 | only printed text in paragraph or block form (non-tabular) |
| TEXT_T 📄 | only machine-typed text in paragraph or block form (non-tabular) |
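
The label set shipped with the checkpoint should match this table; one way to check, assuming the labels sit in the standard id2label mapping of the model config:

```python
from transformers import AutoConfig

# Print the category labels stored in the fine-tuned model's config.
config = AutoConfig.from_pretrained("ufal/vit-historical-page")
print(config.id2label)  # expected: the 11 categories tabulated above
```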

(Figure: dataset_timeline.png)

Data preprocessing

During training, the following transforms were each applied randomly with a 50% chance (they are composed into a single pipeline in the sketch after the list):

  • transforms.ColorJitter(brightness=0.5)
  • transforms.ColorJitter(contrast=0.5)
  • transforms.ColorJitter(saturation=0.5)
  • transforms.ColorJitter(hue=0.5)
  • transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
  • transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
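
A minimal composed sketch of this augmentation pipeline, assuming each transform is wrapped in RandomApply so that it fires with probability 0.5:

```python
import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

# Each augmentation is applied independently with a 50% chance, as listed above.
train_transforms = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```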

Training Hyperparameters

  • eval_strategy="epoch"
  • save_strategy="epoch"
  • learning_rate=5e-5
  • per_device_train_batch_size=8
  • per_device_eval_batch_size=8
  • num_train_epochs=3
  • warmup_ratio=0.1
  • logging_steps=10
  • load_best_model_at_end=True
  • metric_for_best_model="accuracy"
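
These map directly onto transformers TrainingArguments; a sketch (output_dir is a hypothetical path, and older transformers releases spell eval_strategy as evaluation_strategy):

```python
from transformers import TrainingArguments

# Training configuration matching the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="vit-historical-page",  # hypothetical output path
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```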

Results 📊

| Revision | Top-1 (%) | Top-3 (%) |
|---|---|---|
| v1.2 | 97.73 | 99.87 |
| v2.2 | 97.54 | 99.94 |
| v3.2 | 96.49 | 99.94 |
| v4.2 | 97.73 | 99.87 |
| v5.2 | 97.86 | 99.87 |
| v1.3 | 96.81 | 99.78 |
| v2.3 | 98.79 | 99.96 |
| v3.3 | 98.92 | 99.98 |
| v4.3 | 98.92 | 100.0 |
| v5.3 | 99.12 | 99.94 |
| v6.3 | 98.79 | 99.94 |

Per-revision evaluation-set accuracy (Top-1), each illustrated with a TOP-1 confusion matrix:

  • v2.2: 97.54%
  • v3.2: 96.49%
  • v5.2: 97.73%
  • v1.2: 97.73%
  • v4.2: 97.86%
  • v1.3: 98.83%
  • v2.3: 98.79%
  • v3.3: 98.92%
  • v4.3: 98.16%
  • v5.3: 99.12%
  • v6.3: 98.79%

Result tables

Table columns

  • FILE - name of the input file
  • PAGE - page number within the file
  • CLASS-N - label of the TOP-N category 🏷️ guess
  • SCORE-N - score of the TOP-N category 🏷️ guess
  • TRUE - actual category 🏷️ label
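
A table of this shape could be assembled from pipeline outputs roughly as follows (a sketch; the file paths are placeholders, single-page inputs are assumed for the PAGE column, and TRUE is omitted since it is only known for annotated data):

```python
import pandas as pd
from transformers import pipeline

# Sketch: summarize the top-3 predictions per page into the tabular format above.
classifier = pipeline("image-classification", model="ufal/vit-historical-page")
files = ["page_1.png", "page_2.png"]  # placeholder input paths

rows = []
for f in files:
    row = {"FILE": f, "PAGE": 1}  # single-page inputs assumed
    for n, pred in enumerate(classifier(f, top_k=3), start=1):
        row[f"CLASS-{n}"] = pred["label"]
        row[f"SCORE-{n}"] = round(pred["score"], 4)
    rows.append(row)

pd.DataFrame(rows).to_csv("result.csv", index=False)  # hypothetical output file
```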

Contacts 📧

For support, write to 📧 lutsai.k@gmail.com

Official repository: UFAL ^3

Acknowledgements 🙏

  • Developed by UFAL ^5 👥
  • Funded by ATRIUM ^4 💰
  • Shared by ATRIUM ^4 & UFAL ^5
  • Model type:
    • fine-tuned ViT with 224x224 ^2 🔗 or 384x384 ^6 ^7 🔗 input resolution
    • fine-tuned EffNetV2 with 300x300 ^8 🔗 or 384x384 ^9 🔗 input resolution

©️ 2022 UFAL & ATRIUM
