I am running Python code of the following form:
from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer
from tokenizers.models import BPE
vocab = {"a": 5, "b": 6, "ab": 7}
merges = [("a","b")]
backend_of_backend_of_backend = BPE(vocab=vocab, merges=merges, dropout=None)
backend_of_backend = Tokenizer(model=backend_of_backend_of_backend)
backend = PreTrainedTokenizerFast(tokenizer_object=backend_of_backend)

The line BPE(vocab=vocab, merges=merges, dropout=None) has nothing to do with serialisation. Yet, when I run it, an unwanted print
The OrderedVocab you are attempting to save contains holes for indices [0, 1, 2, 3, 4], your vocabulary could be corrupted!
appears in my console, which seems to come from
tokenizers/tokenizers/src/models/mod.rs
Lines 53 to 56 in f7db48f
if !holes.is_empty() {
    warn!("The OrderedVocab you are attempting to save contains holes for indices {holes:?}, your vocabulary could be corrupted!");
    println!("The OrderedVocab you are attempting to save contains holes for indices {holes:?}, your vocabulary could be corrupted!");
}
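The indices in the message are consistent with the example vocabulary starting at id 5. Here is a minimal sketch of what the check appears to compute, assuming a "hole" is any id from 0 up to the largest id that has no token assigned. `find_holes` is a hypothetical helper written for illustration, not part of tokenizers:

```python
vocab = {"a": 5, "b": 6, "ab": 7}


def find_holes(vocab):
    """Return every id in 0..max(id) that is not assigned to any token."""
    used = set(vocab.values())
    if not used:
        return []
    return [i for i in range(max(used) + 1) if i not in used]


print(find_holes(vocab))  # [0, 1, 2, 3, 4]
```

Under that assumption, a vocabulary numbered contiguously from 0, such as {"a": 0, "b": 1, "ab": 2}, would have no holes.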
Not only is the print wrong (I am not trying to save anything), but it also cannot be suppressed by redirecting stdout and stderr in Python.
println! does not belong in low-level code, so at the very least, we need a way to disable it. Beyond that, what is this print even for, given that it talks about saving when we are loading a tokenizer?
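A likely reason the Python-level redirect has no effect: contextlib.redirect_stdout only rebinds the sys.stdout object, while output from native code goes to the process's file descriptor 1 directly. As a stopgap, here is a minimal sketch of redirecting at the file-descriptor level. It assumes the message really is written to fd 1, and `redirect_fd1` is a hypothetical helper, not something tokenizers or transformers provides:

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def redirect_fd1(target_path=os.devnull):
    """Temporarily point OS-level file descriptor 1 (stdout) at target_path."""
    sys.stdout.flush()                 # push out anything Python has buffered
    saved_fd = os.dup(1)               # keep a handle on the real stdout
    target_fd = os.open(target_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.dup2(target_fd, 1)          # fd 1 now refers to target_path
        yield
    finally:
        sys.stdout.flush()             # flush output written inside the block
        os.dup2(saved_fd, 1)           # restore the real stdout
        os.close(saved_fd)
        os.close(target_fd)
```

Wrapping the BPE(...) call in `with redirect_fd1():` should then hide the message, at the cost of hiding everything else written to stdout inside the block. This only works around the symptom and is no substitute for removing the println!.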