Image classification using fine-tuned ViT for historical document sorting

Goal: solve the task of sorting archive page images (for their further content-based processing)

Scope: image processing, training and evaluation of the ViT model, input file/directory processing, output of the top-N class 🏷️ (category) predictions, summarizing the predictions into a tabular format, and HF 😊 hub support for the model.
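
For instance, the fine-tuned model can be tried via the standard transformers image-classification pipeline; a minimal sketch ("page.png" is a placeholder input path, and top_k controls how many category guesses are returned):

```python
from transformers import pipeline

# Minimal sketch: load the fine-tuned checkpoint from the HF hub and
# classify one page image, keeping the top-3 category guesses.
classifier = pipeline("image-classification", model="ufal/vit-historical-page")
for pred in classifier("page.png", top_k=3):  # "page.png" is a hypothetical input
    print(pred["label"], round(pred["score"], 4))
```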

Versions 🏁

There are currently several versions of the model available for download; all of them share the same set of categories but differ in their data annotations. The latest, v5.3, is considered the default and can be found in the main branch of the HF 😊 hub ^1 🔗.

| Version | Base | Pages | PDFs | Description |
|---|---|---|---|---|
| v2.0 | vit-base-patch16-224 | 10073 | 3896 | annotations with mistakes, more heterogeneous data |
| v2.1 | vit-base-patch16-224 | 11940 | 5002 | main: more diverse pages in each category, fewer annotation mistakes |
| v2.2 | vit-base-patch16-224 | 15855 | 5730 | same data as v2.1 plus some restored pages from v2.0 |
| v3.2 | vit-base-patch16-384 | 15855 | 5730 | same data as v2.2, but a slightly larger model base with higher resolution |
| v5.2 | vit-large-patch16-384 | 15855 | 5730 | same data as v2.2, but the largest model base with higher resolution |
| v1.2 | efficientnetv2_s.in21k | 15855 | 5730 | same data as v2.2, but the smallest model base (CNN) |
| v4.2 | efficientnetv2_l.in21k_ft_in1k | 15855 | 5730 | same data as v2.2, a CNN base smaller than the largest; may be more accurate |
| v2.3 | vit-base-patch16-224 | 38625 | 37328 | data from the new annotation phase, more single-page documents used, transformer model |
| v3.3 | vit-base-patch16-384 | 38625 | 37328 | same data as v2.3, but a slightly larger model base with higher resolution |
| v5.3 | vit-large-patch16-384 | 38625 | 37328 | same data as v2.3, but the largest model base with higher resolution |
| v1.3 | efficientnetv2_m.in21k_ft_in1k | 38625 | 37328 | same data as v2.3, but the smallest model base (CNN) |
| v4.3 | regnety_160.swag_ft_in1k | 38625 | 37328 | same data as v2.3, a CNN base bigger than the smallest; may be more accurate |
Base model characteristics:

| Base model | Parameters (M) | Resolution (px) | Revision |
|---|---|---|---|
| efficientnetv2_s.in21k | 48 | 300 | v1.X |
| efficientnetv2_m.in21k_ft_in1k | 54 | 384 | v1.3 |
| vit-base-patch16-224 | 87 | 224 | v2.X |
| vit-base-patch16-384 | 87 | 384 | v3.X |
| regnety_160.swag_ft_in1k | 84 | 224 | v4.3 |
| vit-large-patch16-384 | 305 | 384 | v5.X |
| regnety_640.seer | 281 | 384 | v6.3 |
Best results per base model (max_cat is the per-category page cap applied to the training data):

| Base model | Revision | max_cat | Best_Prec (%) | Best_Acc (%) | Fold | Note |
|---|---|---|---|---|---|---|
| google/vit-base-patch16-224 | v2.3 | 14,000 | 98.79 | 98.79 | 5 | OK & Small |
| google/vit-base-patch16-384 | v3.3 | 14,000 | 98.92 | 98.92 | 2 | Good & Small |
| google/vit-large-patch16-384 | v5.3 | 14,000 | 99.12 | 99.12 | 2 | Best & Large |
| microsoft/dit-base-finetuned-rvlcdip | v9.3 | 14,000 | 98.71 | 98.72 | 3 | |
| microsoft/dit-large-finetuned-rvlcdip | v10.3 | 14,000 | 98.66 | 98.66 | 3 | |
| microsoft/dit-large | v11.3 | 14,000 | 98.53 | 98.53 | 2 | |
| timm/regnety_120.sw_in12k_ft_in1k | v12.3 | 14,000 | 98.29 | 98.29 | 3 | |
| timm/regnety_160.swag_ft_in1k | v4.3 | 14,000 | 99.17 | 99.16 | 1 | Best & Small |
| timm/regnety_640.seer | v6.3 | 14,000 | 98.79 | 98.79 | 5 | OK & Large |
| timm/tf_efficientnetv2_l.in21k_ft_in1k | v8.3 | 14,000 | 98.62 | 98.62 | 5 | |
| timm/tf_efficientnetv2_m.in21k_ft_in1k | v1.3 | 14,000 | 98.83 | 98.83 | 1 | Good & Small |
| timm/tf_efficientnetv2_s.in21k | v7.3 | 14,000 | 97.90 | 97.87 | 1 | |
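
Since the versions are published as revisions of the same HF hub repository (with v5.3 in main), a specific ViT revision can presumably be pinned via the revision argument of from_pretrained. A sketch, assuming the revision tags match the version names above (the CNN-based versions use timm bases and need the corresponding loaders instead):

```python
from transformers import ViTForImageClassification, ViTImageProcessor

# Sketch: pin a specific model revision from the hub (here v3.3; omit
# the revision argument to get the default v5.3 from the main branch).
model = ViTForImageClassification.from_pretrained(
    "ufal/vit-historical-page", revision="v3.3")
processor = ViTImageProcessor.from_pretrained(
    "ufal/vit-historical-page", revision="v3.3")
```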

Model description 📇

(Figure: architecture.png)

🔲 Fine-tuned model repository: vit-historical-page ^1 🔗

🔳 Base model repositories:

  • Google's vit-base-patch16-224, vit-base-patch16-384, and vit-large-patch16-384 ^2 ^6 ^7 🔗
  • timm's regnety_160.swag_ft_in1k, efficientnetv2_s.in21k, efficientnetv2_m.in21k_ft_in1k, and efficientnetv2_l.in21k_ft_in1k ^11 ^8 ^12 ^9 🔗

Data 📜

The dataset is provided under a Public Domain license and consists of 48,499 PNG images of pages from 37,328 archival documents. The source image files and their annotations can be found in the LINDAT repository ^10 🔗.

Manual ✍️ annotation was performed beforehand and took some time ⌛. The categories 🪧 tabulated below were formed from different sources of archival documents originating in the 1920–2020 span.

| Category | Dataset 0 | Dataset 1 | Dataset 2 | Dataset 3 |
|---|---|---|---|---|
| DRAW | 1090 (9.1%) | 1368 (8.8%) | 1472 (9.3%) | 2709 (5.6%) |
| DRAW_L | 1091 (9.1%) | 1383 (8.9%) | 1402 (8.8%) | 2921 (6.0%) |
| LINE_HW | 1055 (8.8%) | 1113 (7.2%) | 1115 (7.0%) | 2514 (5.2%) |
| LINE_P | 1092 (9.1%) | 1540 (9.9%) | 1580 (10.0%) | 2439 (5.0%) |
| LINE_T | 1098 (9.2%) | 1664 (10.7%) | 1668 (10.5%) | 9883 (20.4%) |
| PHOTO | 1081 (9.1%) | 1632 (10.5%) | 1730 (10.9%) | 2691 (5.5%) |
| PHOTO_L | 1087 (9.1%) | 1087 (7.0%) | 1088 (6.9%) | 2830 (5.8%) |
| TEXT | 1091 (9.1%) | 1587 (10.3%) | 1592 (10.0%) | 14227 (29.3%) |
| TEXT_HW | 1091 (9.1%) | 1092 (7.1%) | 1092 (6.9%) | 2008 (4.1%) |
| TEXT_P | 1083 (9.1%) | 1540 (9.9%) | 1633 (10.3%) | 2312 (4.8%) |
| TEXT_T | 1081 (9.1%) | 1476 (9.5%) | 1482 (9.3%) | 3965 (8.2%) |
| Unique PDFs | 5001 | 5694 | 5729 | 37328 |
| Total pages | 11,940 | 15,482 | 15,854 | 48,499 |

The table above shows the category distribution for the different model versions. The last column (Dataset 3) corresponds to the data of the latest vX.3 models, which actually used only 14,000 pages of the TEXT category; the other columns cover all used samples, split 80% for training 💪 and 10% each for the development and test 🏆 sets. The early model versions used 90% of the data for training 💪 and the remaining 10% as both the development and test 🏆 set, due to the lack of annotated (manually classified) pages.

The disproportion of the categories 🪧 in both training and evaluation data is NOT intentional; it reflects the nature of the source data.
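
For illustration only, a stratified 80/10/10 split that preserves such an imbalanced category distribution could look as follows (a sketch, not the authors' actual split procedure; the page paths and labels below are placeholders):

```python
from sklearn.model_selection import train_test_split

# Hypothetical inputs: one image path and one category label per page.
pages = [f"page_{i}.png" for i in range(100)]
labels = ["TEXT"] * 50 + ["DRAW"] * 50

# First carve off 20%, then split it half-and-half into dev and test,
# stratifying by label to keep the category proportions in every subset.
train_x, rest_x, train_y, rest_y = train_test_split(
    pages, labels, test_size=0.2, stratify=labels, random_state=42)
dev_x, test_x, dev_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=42)
```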

Training set sizes:

  • v2.0: 8,950 images
  • v2.1: 10,745 images
  • v2.X: 14,565 images
  • vX.3: 38,625 images

Plus, the evaluation (test) set sizes:

  • 1,586 images (taken from the v2.2 annotations)
  • 4,823 images (for the vX.3 models)

Categories 🏷️

| Label | Description |
|---|---|
| DRAW 📈 | drawings, maps, paintings, schematics, or graphics, potentially containing some text labels or captions |
| DRAW_L 📈📏 | drawings, etc., presented within a table-like layout or including a legend formatted as a table |
| LINE_HW ✍️📏 | handwritten text organized in a tabular or form-like structure |
| LINE_P 📏 | printed text organized in a tabular or form-like structure |
| LINE_T 📏 | machine-typed text organized in a tabular or form-like structure |
| PHOTO 🌄 | photographs or photographic cutouts, potentially with text captions |
| PHOTO_L 🌄📏 | photos presented within a table-like layout or accompanied by tabular annotations |
| TEXT 📰 | mixtures of printed, handwritten, and/or typed text, potentially with minor graphical elements |
| TEXT_HW ✍️📄 | only handwritten text in paragraph or block form (non-tabular) |
| TEXT_P 📄 | only printed text in paragraph or block form (non-tabular) |
| TEXT_T 📄 | only machine-typed text in paragraph or block form (non-tabular) |
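
The label set shipped with the checkpoint should match this table; one way to check, assuming the labels sit in the standard id2label mapping of the model config:

```python
from transformers import AutoConfig

# Print the category labels stored in the fine-tuned model's config.
config = AutoConfig.from_pretrained("ufal/vit-historical-page")
print(config.id2label)  # expected: the 11 categories tabulated above
```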

(Figure: dataset_timeline.png)

Data preprocessing

During training, the following transforms were each applied randomly with a 50% chance (they are composed into a single pipeline in the sketch after the list):

  • transforms.ColorJitter(brightness=0.5)
  • transforms.ColorJitter(contrast=0.5)
  • transforms.ColorJitter(saturation=0.5)
  • transforms.ColorJitter(hue=0.5)
  • transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
  • transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
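
A minimal composed sketch of this augmentation pipeline, assuming each transform is wrapped in RandomApply so that it fires with probability 0.5:

```python
import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

# Each augmentation is applied independently with a 50% chance, as listed above.
train_transforms = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```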

Training Hyperparameters

  • eval_strategy="epoch"
  • save_strategy="epoch"
  • learning_rate=5e-5
  • per_device_train_batch_size=8
  • per_device_eval_batch_size=8
  • num_train_epochs=3
  • warmup_ratio=0.1
  • logging_steps=10
  • load_best_model_at_end=True
  • metric_for_best_model="accuracy"
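
These map directly onto transformers TrainingArguments; a sketch (output_dir is a hypothetical path, and older transformers releases spell eval_strategy as evaluation_strategy):

```python
from transformers import TrainingArguments

# Training configuration matching the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="vit-historical-page",  # hypothetical output path
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```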

Results 📊

| Revision | Top-1 (%) | Top-3 (%) |
|---|---|---|
| v1.2 | 97.73 | 99.87 |
| v2.2 | 97.54 | 99.94 |
| v3.2 | 96.49 | 99.94 |
| v4.2 | 97.73 | 99.87 |
| v5.2 | 97.86 | 99.87 |
| v1.3 | 96.81 | 99.78 |
| v2.3 | 98.79 | 99.96 |
| v3.3 | 98.92 | 99.98 |
| v4.3 | 98.92 | 100.0 |
| v5.3 | 99.12 | 99.94 |
| v6.3 | 98.79 | 99.94 |

Per-revision evaluation-set accuracy (Top-1), each illustrated with a TOP-1 confusion matrix:

  • v2.2: 97.54%
  • v3.2: 96.49%
  • v5.2: 97.73%
  • v1.2: 97.73%
  • v4.2: 97.86%
  • v1.3: 98.83%
  • v2.3: 98.79%
  • v3.3: 98.92%
  • v4.3: 98.16%
  • v5.3: 99.12%
  • v6.3: 98.79%

Result tables

Table columns

  • FILE - name of the input file
  • PAGE - page number within the file
  • CLASS-N - label of the TOP-N category 🏷️ guess
  • SCORE-N - score of the TOP-N category 🏷️ guess
  • TRUE - actual category 🏷️ label
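
A table of this shape could be assembled from pipeline outputs roughly as follows (a sketch; the file paths are placeholders, single-page inputs are assumed for the PAGE column, and TRUE is omitted since it is only known for annotated data):

```python
import pandas as pd
from transformers import pipeline

# Sketch: summarize the top-3 predictions per page into the tabular format above.
classifier = pipeline("image-classification", model="ufal/vit-historical-page")
files = ["page_1.png", "page_2.png"]  # placeholder input paths

rows = []
for f in files:
    row = {"FILE": f, "PAGE": 1}  # single-page inputs assumed
    for n, pred in enumerate(classifier(f, top_k=3), start=1):
        row[f"CLASS-{n}"] = pred["label"]
        row[f"SCORE-{n}"] = round(pred["score"], 4)
    rows.append(row)

pd.DataFrame(rows).to_csv("result.csv", index=False)  # hypothetical output file
```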

Contacts 📧

For support, write to 📧 lutsai.k@gmail.com

Official repository: UFAL ^3

Acknowledgements 🙏

  • Developed by UFAL ^5 👥
  • Funded by ATRIUM ^4 💰
  • Shared by ATRIUM ^4 & UFAL ^5
  • Model type:
    • fine-tuned ViT with 224x224 ^2 🔗 or 384x384 ^6 ^7 🔗 input resolution
    • fine-tuned EffNetV2 with 300x300 ^8 🔗 or 384x384 ^9 🔗 input resolution

©️ 2022 UFAL & ATRIUM
