Image classification using a fine-tuned ViT for sorting historical documents
Goal: sort archive page images by category for their further content-based processing.
Scope: image processing; training and evaluation of the ViT model; input file/directory handling; output of the top-N predicted classes (categories); summarizing predictions into a tabular format; HF hub support for the model.
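The top-N prediction step mentioned above can be sketched as follows: given raw per-class scores (logits) from the model, softmax them and return the N most probable labels. This is a minimal illustration; the label names and score values below are hypothetical.

```python
import math

def top_n(scores, labels, n=3):
    """Return the n highest-probability (label, score) pairs from raw logits."""
    # Softmax over the logits, then rank labels by descending probability.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda p: p[1], reverse=True)
    return ranked[:n]

labels = ["DRAW", "PHOTO", "TEXT"]
print(top_n([0.2, 2.5, 1.1], labels, n=2))
```
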
Versions

There are currently several versions of the model available for download. All of them share the same set of categories but were trained on different data annotations. The latest, v5.3, is considered the default and can be found in the main branch of the HF hub ^1.
| Version | Base | Pages | PDFs | Description |
|---|---|---|---|---|
| v2.0 | vit-base-patch16-224 | 10073 | 3896 | annotations with mistakes, more heterogeneous data |
| v2.1 | vit-base-patch16-224 | 11940 | 5002 | main: more diverse pages in each category, fewer annotation mistakes |
| v2.2 | vit-base-patch16-224 | 15855 | 5730 | same data as v2.1 plus some restored pages from v2.0 |
| v3.2 | vit-base-patch16-384 | 15855 | 5730 | same data as v2.2, but a slightly larger model base with higher resolution |
| v5.2 | vit-large-patch16-384 | 15855 | 5730 | same data as v2.2, but the largest model base with higher resolution |
| v1.2 | efficientnetv2_s.in21k | 15855 | 5730 | same data as v2.2, but the smallest model base (CNN) |
| v4.2 | efficientnetv2_l.in21k_ft_in1k | 15855 | 5730 | same data as v2.2, a CNN base smaller than the largest; may be more accurate |
| v2.3 | vit-base-patch16-224 | 38625 | 37328 | new annotation-phase data, more single-page documents used, transformer model |
| v3.3 | vit-base-patch16-384 | 38625 | 37328 | same data as v2.3, but a slightly larger model base with higher resolution |
| v5.3 | vit-large-patch16-384 | 38625 | 37328 | same data as v2.3, but the largest model base with higher resolution |
| v1.3 | efficientnetv2_m.in21k_ft_in1k | 38625 | 37328 | same data as v2.3, but the smallest model base (CNN) |
| v4.3 | regnety_160.swag_ft_in1k | 38625 | 37328 | same data as v2.3, a CNN base bigger than the smallest; may be more accurate |
| Base Model | Parameters (M) | Resolution (px) | Revision |
|---|---|---|---|
| efficientnetv2_s.in21k | 48 | 300 | v1.X |
| efficientnetv2_m.in21k_ft_in1k | 54 | 384 | v1.3 |
| vit-base-patch16-224 | 87 | 224 | v2.X |
| vit-base-patch16-384 | 87 | 384 | v3.X |
| regnety_160.swag_ft_in1k | 84 | 224 | v4.3 |
| vit-large-patch16-384 | 305 | 384 | v5.X |
| regnety_640.seer | 281 | 384 | v6.3 |
| Base Model | Revision | max_cat | Best_Prec (%) | Best_Acc (%) | Fold | Note |
|---|---|---|---|---|---|---|
| google/vit-base-patch16-224 | v2.3 | 14,000 | 98.79 | 98.79 | 5 | OK & Small |
| google/vit-base-patch16-384 | v3.3 | 14,000 | 98.92 | 98.92 | 2 | Good & Small |
| google/vit-large-patch16-384 | v5.3 | 14,000 | 99.12 | 99.12 | 2 | Best & Large |
| microsoft/dit-base-finetuned-rvlcdip | v9.3 | 14,000 | 98.71 | 98.72 | 3 | |
| microsoft/dit-large-finetuned-rvlcdip | v10.3 | 14,000 | 98.66 | 98.66 | 3 | |
| microsoft/dit-large | v11.3 | 14,000 | 98.53 | 98.53 | 2 | |
| timm/regnety_120.sw_in12k_ft_in1k | v12.3 | 14,000 | 98.29 | 98.29 | 3 | |
| timm/regnety_160.swag_ft_in1k | v4.3 | 14,000 | 99.17 | 99.16 | 1 | Best & Small |
| timm/regnety_640.seer | v6.3 | 14,000 | 98.79 | 98.79 | 5 | OK & Large |
| timm/tf_efficientnetv2_l.in21k_ft_in1k | v8.3 | 14,000 | 98.62 | 98.62 | 5 | |
| timm/tf_efficientnetv2_m.in21k_ft_in1k | v1.3 | 14,000 | 98.83 | 98.83 | 1 | Good & Small |
| timm/tf_efficientnetv2_s.in21k | v7.3 | 14,000 | 97.90 | 97.87 | 1 | |
Model description

Fine-tuned model repository: vit-historical-page ^1

Base model repositories:
- Google's vit-base-patch16-224, vit-base-patch16-384, and vit-large-patch16-384 ^2 ^6 ^7
- timm's regnety_160.swag_ft_in1k, efficientnetv2_s.in21k, efficientnetv2_m.in21k_ft_in1k, and efficientnetv2_l.in21k_ft_in1k ^11 ^8 ^12 ^9
Data

The dataset is provided under a Public Domain license and consists of 48,499 PNG images of pages from 37,328 archival documents. The source image files and their annotations can be found in the LINDAT repository ^10.

Manual annotation was performed beforehand. The categories tabulated below were formed from archival documents of different sources, originating from the years 1920-2020.
| Category | Dataset 0 | Dataset 1 | Dataset 2 | Dataset 3 |
|---|---|---|---|---|
| DRAW | 1090 (9.1%) | 1368 (8.8%) | 1472 (9.3%) | 2709 (5.6%) |
| DRAW_L | 1091 (9.1%) | 1383 (8.9%) | 1402 (8.8%) | 2921 (6.0%) |
| LINE_HW | 1055 (8.8%) | 1113 (7.2%) | 1115 (7.0%) | 2514 (5.2%) |
| LINE_P | 1092 (9.1%) | 1540 (9.9%) | 1580 (10.0%) | 2439 (5.0%) |
| LINE_T | 1098 (9.2%) | 1664 (10.7%) | 1668 (10.5%) | 9883 (20.4%) |
| PHOTO | 1081 (9.1%) | 1632 (10.5%) | 1730 (10.9%) | 2691 (5.5%) |
| PHOTO_L | 1087 (9.1%) | 1087 (7.0%) | 1088 (6.9%) | 2830 (5.8%) |
| TEXT | 1091 (9.1%) | 1587 (10.3%) | 1592 (10.0%) | 14227 (29.3%) |
| TEXT_HW | 1091 (9.1%) | 1092 (7.1%) | 1092 (6.9%) | 2008 (4.1%) |
| TEXT_P | 1083 (9.1%) | 1540 (9.9%) | 1633 (10.3%) | 2312 (4.8%) |
| TEXT_T | 1081 (9.1%) | 1476 (9.5%) | 1482 (9.3%) | 3965 (8.2%) |
| Unique PDFs | 5001 | 5694 | 5729 | 37328 |
| Total Pages | 11,940 | 15,482 | 15,854 | 48,499 |
The table above shows the category distribution for the different model versions. The last column
(Dataset 3) corresponds to the data of the latest vX.3 models, which actually used 14,000 pages of
the TEXT category; the other columns cover all the samples used, specifically 80% for training
and 10% each for the development and test sets. Early model versions used 90% of the data for training
and the remaining 10% as a combined development and test set, due to the lack of annotated (manually
classified) pages.

The disproportion of categories in both the training and evaluation data is NOT intentional; it reflects the nature of the source data.
Training set sizes:
- v2.0: 8950 images
- v2.1: 10745 images
- v2.X: 14565 images
- vX.3: 38625 images

Evaluation set sizes:
- 1586 images (taken from the v2.2 annotations)
- 4823 images (for the vX.3 models)
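The 80/10/10 train/dev/test split described above can be sketched as a deterministic shuffle-and-slice. This is only an illustration under assumed conventions (the seed value and file names are hypothetical, not the authors' actual split procedure):

```python
import random

def split_dataset(paths, seed=42):
    """Shuffle deterministically and split into 80% train, 10% dev, 10% test."""
    items = sorted(paths)        # stable order before shuffling
    rng = random.Random(seed)    # fixed seed keeps the split reproducible
    rng.shuffle(items)
    n = len(items)
    n_train = int(n * 0.8)
    n_dev = int(n * 0.1)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test

train, dev, test = split_dataset([f"page_{i:05d}.png" for i in range(100)])
print(len(train), len(dev), len(test))  # 80 10 10
```
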
Categories

| Label | Description |
|---|---|
| DRAW | drawings, maps, paintings, schematics, or graphics, potentially containing some text labels or captions |
| DRAW_L | drawings, etc., presented within a table-like layout or including a legend formatted as a table |
| LINE_HW | handwritten text organized in a tabular or form-like structure |
| LINE_P | printed text organized in a tabular or form-like structure |
| LINE_T | machine-typed text organized in a tabular or form-like structure |
| PHOTO | photographs or photographic cutouts, potentially with text captions |
| PHOTO_L | photos presented within a table-like layout or accompanied by tabular annotations |
| TEXT | mixtures of printed, handwritten, and/or typed text, potentially with minor graphical elements |
| TEXT_HW | only handwritten text in paragraph or block form (non-tabular) |
| TEXT_P | only printed text in paragraph or block form (non-tabular) |
| TEXT_T | only machine-typed text in paragraph or block form (non-tabular) |
Data preprocessing

During training, each of the following transforms was applied randomly with a 50% chance:
- transforms.ColorJitter(brightness=0.5)
- transforms.ColorJitter(contrast=0.5)
- transforms.ColorJitter(saturation=0.5)
- transforms.ColorJitter(hue=0.5)
- transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
- transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
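Assembled into a pipeline, the augmentations above look roughly like this. The sketch below uses plain Pillow rather than torchvision (ColorJitter is approximated with ImageEnhance, and hue jitter is omitted because plain PIL has no direct equivalent); the composition order and per-transform probability are assumptions based on the list above:

```python
import random
from PIL import Image, ImageEnhance, ImageFilter

def augment(img, p=0.5, rng=random):
    """Apply each augmentation independently with probability p."""
    if rng.random() < p:  # brightness jitter, factor in [0.5, 1.5]
        img = ImageEnhance.Brightness(img).enhance(rng.uniform(0.5, 1.5))
    if rng.random() < p:  # contrast jitter
        img = ImageEnhance.Contrast(img).enhance(rng.uniform(0.5, 1.5))
    if rng.random() < p:  # saturation ("color" in PIL) jitter
        img = ImageEnhance.Color(img).enhance(rng.uniform(0.5, 1.5))
    if rng.random() < p:  # sharpness jitter
        img = ImageEnhance.Sharpness(img).enhance(rng.uniform(0.5, 1.5))
    if rng.random() < p:  # mild Gaussian blur
        img = img.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0, 2)))
    return img

page = Image.new("RGB", (224, 224), "white")
out = augment(page)
print(out.size, out.mode)
```
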
Training Hyperparameters
- eval_strategy "epoch"
- save_strategy "epoch"
- learning_rate 5e-5
- per_device_train_batch_size 8
- per_device_eval_batch_size 8
- num_train_epochs 3
- warmup_ratio 0.1
- logging_steps 10
- load_best_model_at_end True
- metric_for_best_model "accuracy"
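The hyperparameters above map onto keyword arguments of Hugging Face's `TrainingArguments`. The dict below is a sketch that only mirrors the list, not the authors' actual training script:

```python
# Keyword arguments mirroring the hyperparameter list above; in a real run
# they would be passed as transformers.TrainingArguments(**training_kwargs).
training_kwargs = dict(
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
print(training_kwargs["learning_rate"])  # 5e-05
```
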
Results

| Revision | Top-1 | Top-3 |
|---|---|---|
| v1.2 | 97.73 | 99.87 |
| v2.2 | 97.54 | 99.94 |
| v3.2 | 96.49 | 99.94 |
| v4.2 | 97.73 | 99.87 |
| v5.2 | 97.86 | 99.87 |
| v1.3 | 96.81 | 99.78 |
| v2.3 | 98.79 | 99.96 |
| v3.3 | 98.92 | 99.98 |
| v4.3 | 98.92 | 100.0 |
| v5.3 | 99.12 | 99.94 |
| v6.3 | 98.79 | 99.94 |
- v2.2 evaluation set accuracy (Top-1): 97.54%
- v3.2 evaluation set accuracy (Top-1): 96.49%
- v5.2 evaluation set accuracy (Top-1): 97.73%
- v1.2 evaluation set accuracy (Top-1): 97.73%
- v4.2 evaluation set accuracy (Top-1): 97.86%
- v1.3 evaluation set accuracy (Top-1): 98.83%
- v2.3 evaluation set accuracy (Top-1): 98.79%
- v3.3 evaluation set accuracy (Top-1): 98.92%
- v4.3 evaluation set accuracy (Top-1): 98.16%
- v5.3 evaluation set accuracy (Top-1): 99.12%
- v6.3 evaluation set accuracy (Top-1): 98.79%
Result tables

- v2.2 manually checked evaluation dataset results (TOP-1): model_TOP-3_EVAL.csv
- v2.2 manually checked evaluation dataset results (TOP-3): model_TOP-3_EVAL.csv
- v3.2 manually checked evaluation dataset results (TOP-1): model_TOP-3_EVAL.csv
- v3.2 manually checked evaluation dataset results (TOP-3): model_TOP-3_EVAL.csv
- v5.2 manually checked evaluation dataset results (TOP-1): model_TOP-3_EVAL.csv
- v5.2 manually checked evaluation dataset results (TOP-3): model_TOP-3_EVAL.csv
- v1.2 manually checked evaluation dataset results (TOP-1): model_TOP-3_EVAL.csv
- v1.2 manually checked evaluation dataset results (TOP-3): model_TOP-3_EVAL.csv
- v4.2 manually checked evaluation dataset results (TOP-1): model_TOP-3_EVAL.csv
- v4.2 manually checked evaluation dataset results (TOP-3): model_TOP-3_EVAL.csv
- v1.3 manually checked evaluation dataset (TOP-1): model_TOP-1_EVAL.csv
- v2.3 manually checked evaluation dataset (TOP-1): model_TOP-1_EVAL.csv
- v3.3 manually checked evaluation dataset (TOP-1): model_TOP-1_EVAL.csv
- v4.3 manually checked evaluation dataset (TOP-1): model_TOP-1_EVAL.csv
- v5.3 manually checked evaluation dataset (TOP-1): model_TOP-1_EVAL.csv
- v6.3 manually checked evaluation dataset (TOP-1): model_TOP-1_EVAL.csv
Table columns

- FILE - name of the file
- PAGE - page number
- CLASS-N - label of the TOP-N category guess
- SCORE-N - score of the TOP-N category guess
- TRUE - actual category label
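Given such a table, Top-1 accuracy can be recomputed by comparing the CLASS-1 and TRUE columns. A minimal sketch using the standard-library csv module; the inline sample rows are hypothetical:

```python
import csv
import io

def top1_accuracy(csv_text):
    """Fraction of rows where the TOP-1 guess (CLASS-1) matches the TRUE label."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    hits = sum(1 for r in rows if r["CLASS-1"] == r["TRUE"])
    return hits / len(rows)

sample = """FILE,PAGE,CLASS-1,SCORE-1,TRUE
doc_a.pdf,1,TEXT,0.99,TEXT
doc_a.pdf,2,DRAW,0.87,DRAW
doc_b.pdf,1,PHOTO,0.55,PHOTO_L
doc_b.pdf,2,LINE_T,0.91,LINE_T
"""
print(top1_accuracy(sample))  # 0.75 (3 of 4 rows match)
```
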
Contacts

For support, write to lutsai.k@gmail.com

Official repository: UFAL ^3
Acknowledgements

© 2022 UFAL & ATRIUM