Problem with the `model_max_length` attribute
#3 opened by h4c5
The `model_max_length` attribute of the `camembert/camembert-base` tokenizer is set to `VERY_LARGE_INTEGER`:
```python
import torch
from transformers import CamembertModel, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base")
print(tokenizer.model_max_length)
# 1000000000000000019884624838656
```
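Until this is fixed, a workaround is to pass the limit explicitly when loading; a minimal sketch, assuming 512 is the correct limit (the value registered for `camembert-base`):

```python
from transformers import CamembertTokenizer

# Assumption: 512 is the real input limit, matching the camembert-base entry.
tokenizer = CamembertTokenizer.from_pretrained(
    "camembert/camembert-base", model_max_length=512
)
print(tokenizer.model_max_length)
# 512
```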
This is probably because the model name in `max_model_input_sizes` is `camembert-base` instead of `camembert/camembert-base` (see the pretrained tokenizer initialization):
```python
print(tokenizer.max_model_input_sizes)
# {'camembert-base': 512}
```
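For comparison, loading the tokenizer under the short name matches the `max_model_input_sizes` key, so the limit is picked up correctly (a sketch, assuming the legacy `camembert-base` identifier still resolves):

```python
from transformers import CamembertTokenizer

# The short name matches the max_model_input_sizes key, so 512 is applied.
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
print(tokenizer.model_max_length)
# 512
```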
As a result, the example given in the model card does not work with long sequences:
```python
tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base")
camembert = CamembertModel.from_pretrained("camembert/camembert-base")

tokenized_sentence = tokenizer.tokenize("J'aime le camembert !" * 100)
encoded_sentence = tokenizer.encode(tokenized_sentence)
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
embeddings, _ = camembert(encoded_sentence)
# RuntimeError: The expanded size of the tensor (802) must match the existing size (514)
# at non-singleton dimension 1. Target sizes: [1, 802]. Tensor sizes: [1, 514]
```
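With `model_max_length` set explicitly, truncation can then cap the input at the model's real limit and the example runs; a sketch of the corrected call (the printed shape assumes the 768-dimensional hidden size of `camembert-base`):

```python
import torch
from transformers import CamembertModel, CamembertTokenizer

# Assumption: 512 is the real limit; pass it since the checkpoint name
# does not match the max_model_input_sizes key.
tokenizer = CamembertTokenizer.from_pretrained(
    "camembert/camembert-base", model_max_length=512
)
camembert = CamembertModel.from_pretrained("camembert/camembert-base")

# Truncate to 512 tokens so the input fits the model's 514 position embeddings.
encoded = tokenizer("J'aime le camembert !" * 100,
                    truncation=True, return_tensors="pt")
outputs = camembert(**encoded)
print(outputs.last_hidden_state.shape)
# torch.Size([1, 512, 768])
```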