---
language: pl
tags:
  - T5
  - translation
  - summarization
  - question answering
  - reading comprehension
datasets:
  - ccnet
  - nkjp
  - wikipedia
  - open subtitles
  - free readings
license: cc-by-4.0
---
# plT5 Small
plT5 models are T5-based language models trained on Polish corpora. The models were optimized for the original T5 denoising objective.
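The T5 denoising objective (span corruption) replaces random spans of the input with sentinel tokens and trains the model to reconstruct the missing spans in order. A minimal sketch of one input/target pair is shown below; the Polish sentence and the masked spans are illustrative, not taken from the training data:

```python
# A minimal sketch of the T5 span-corruption (denoising) objective.
# Random spans are replaced with sentinel tokens (<extra_id_0>,
# <extra_id_1>, ...) and the model learns to generate them back.

original = "Wszystkie koty lubią spać na kanapie."

# Corrupted input: the spans "koty" and "na kanapie" are masked out.
model_input = "Wszystkie <extra_id_0> lubią spać <extra_id_1>."

# Target: each sentinel is followed by the span it replaced,
# terminated by a final sentinel.
model_target = "<extra_id_0> koty <extra_id_1> na kanapie <extra_id_2>"
```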
## Corpus
plT5 was trained on six different corpora available for the Polish language:
| Corpus | Tokens | Documents | 
|---|---|---|
| CCNet Middle | 3243M | 7.9M | 
| CCNet Head | 2641M | 7.0M | 
| National Corpus of Polish | 1357M | 3.9M | 
| Open Subtitles | 1056M | 1.1M | 
| Wikipedia | 260M | 1.4M | 
| Wolne Lektury | 41M | 5.5k | 
## Tokenizer
The training dataset was tokenized into subwords using a sentencepiece unigram model with a vocabulary size of 50k tokens.
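As a quick sanity check, the tokenizer can be loaded from the Hugging Face Hub and applied to a sample sentence. The sentence below is illustrative; the exact subword split depends on the learned unigram vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-small")

# Split an illustrative Polish sentence into subword pieces.
tokens = tokenizer.tokenize("Wpłynąłem na suchego przestwór oceanu.")
print(tokens)                # list of subword strings
print(tokenizer.vocab_size)  # roughly 50k, per the model card
```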
## Usage
Example code:

```python
from transformers import AutoTokenizer, AutoModel

# Load the pretrained tokenizer and the bare encoder-decoder
# (AutoModel returns the T5 model without a language-modeling head).
tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-small")
model = AutoModel.from_pretrained("allegro/plt5-small")
```
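Since `AutoModel` loads the model without a language-modeling head, generation requires `T5ForConditionalGeneration` instead. The sketch below prompts the checkpoint with a sentinel token, mirroring the pretraining format; the prompt is illustrative, and because the base model was only pretrained on denoising, completions will be rough without fine-tuning:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-small")
model = T5ForConditionalGeneration.from_pretrained("allegro/plt5-small")

# Ask the model to fill in the masked span (illustrative prompt).
text = "Stolicą Polski jest <extra_id_0>."
input_ids = tokenizer(text, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=10)

# Keep special tokens in the output to see the sentinel structure.
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```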
## License
CC BY 4.0
## Citation
If you use this model, please cite the following paper:
```bibtex
@article{chrabrowa2022evaluation,
  title={Evaluation of Transfer Learning for Polish with a Text-to-Text Model},
  author={Chrabrowa, Aleksandra and Dragan, {\L}ukasz and Grzegorczyk, Karol and Kajtoch, Dariusz and Koszowski, Miko{\l}aj and Mroczkowski, Robert and Rybak, Piotr},
  journal={arXiv preprint arXiv:2205.08808},
  year={2022}
}
```
## Authors
The model was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.
You can contact us at: klejbenchmark@allegro.pl

