Problem with the `model_max_length` attribute
#3 opened by h4c5
The `model_max_length` attribute of the `camembert/camembert-base` tokenizer is set to `VERY_LARGE_INTEGER`:
```python
import torch
from transformers import CamembertModel, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base")
print(tokenizer.model_max_length)
# 1000000000000000019884624838656
```
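Until this is fixed, a workaround is to pass the limit explicitly when loading; a minimal sketch, assuming 512 is the correct limit (the value registered for `camembert-base`):

```python
from transformers import CamembertTokenizer

# Assumption: 512 is the real input limit, matching the camembert-base entry.
tokenizer = CamembertTokenizer.from_pretrained(
    "camembert/camembert-base", model_max_length=512
)
print(tokenizer.model_max_length)
# 512
```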
This is probably because the model name in `max_model_input_sizes` is `camembert-base` instead of `camembert/camembert-base` (see the pretrained tokenizer initialization):
```python
print(tokenizer.max_model_input_sizes)
# {'camembert-base': 512}
```
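For comparison, loading the tokenizer under the short name matches the `max_model_input_sizes` key, so the limit is picked up correctly (a sketch, assuming the legacy `camembert-base` identifier still resolves):

```python
from transformers import CamembertTokenizer

# The short name matches the max_model_input_sizes key, so 512 is applied.
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
print(tokenizer.model_max_length)
# 512
```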
As a result, the example given in the model card does not work with long sequences:
```python
tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base")
camembert = CamembertModel.from_pretrained("camembert/camembert-base")

tokenized_sentence = tokenizer.tokenize("J'aime le camembert !" * 100)
encoded_sentence = tokenizer.encode(tokenized_sentence)
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
embeddings, _ = camembert(encoded_sentence)
# RuntimeError: The expanded size of the tensor (802) must match the existing size (514)
# at non-singleton dimension 1. Target sizes: [1, 802]. Tensor sizes: [1, 514]
```
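With `model_max_length` set explicitly, truncation can then cap the input at the model's real limit and the example runs; a sketch of the corrected call (the printed shape assumes the 768-dimensional hidden size of `camembert-base`):

```python
import torch
from transformers import CamembertModel, CamembertTokenizer

# Assumption: 512 is the real limit; pass it since the checkpoint name
# does not match the max_model_input_sizes key.
tokenizer = CamembertTokenizer.from_pretrained(
    "camembert/camembert-base", model_max_length=512
)
camembert = CamembertModel.from_pretrained("camembert/camembert-base")

# Truncate to 512 tokens so the input fits the model's 514 position embeddings.
encoded = tokenizer("J'aime le camembert !" * 100,
                    truncation=True, return_tensors="pt")
outputs = camembert(**encoded)
print(outputs.last_hidden_state.shape)
# torch.Size([1, 512, 768])
```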