Wrong and unsuppressable print when instantiating BPE #1913

@bauwenst

Description

I am running Python code that is of the form

from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer
from tokenizers.models import BPE

vocab = {"a": 5, "b": 6, "ab": 7}
merges = [("a","b")]

backend_of_backend_of_backend = BPE(vocab=vocab, merges=merges, dropout=None)
backend_of_backend            = Tokenizer(model=backend_of_backend_of_backend)
backend                       = PreTrainedTokenizerFast(tokenizer_object=backend_of_backend)

The line BPE(vocab=vocab, merges=merges, dropout=None) has nothing to do with serialisation. Yet, when I run it, an unwanted print

The OrderedVocab you are attempting to save contains holes for indices [0, 1, 2, 3, 4], your vocabulary could be corrupted!

appears in my console, which seems to come from

if !holes.is_empty() {
    warn!("The OrderedVocab you are attempting to save contains holes for indices {holes:?}, your vocabulary could be corrupted!");
    println!("The OrderedVocab you are attempting to save contains holes for indices {holes:?}, your vocabulary could be corrupted!");
}
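
For reference, the indices listed in the message match the ids that the example vocabulary never uses. The sketch below is not the library's code; it assumes that a "hole" means any id between 0 and the largest id in the vocabulary that has no token mapped to it, and `find_holes` is a hypothetical helper name.

```python
def find_holes(vocab):
    """Return every id in 0..max(id) that no token maps to.

    Assumption: this mirrors what the warning calls a "hole"; it is an
    illustration, not the actual check in the tokenizers source.
    """
    if not vocab:
        return []
    used = set(vocab.values())
    return [i for i in range(max(used) + 1) if i not in used]


vocab = {"a": 5, "b": 6, "ab": 7}
print(find_holes(vocab))  # [0, 1, 2, 3, 4]

# A vocabulary whose ids start at 0 and are contiguous has no holes.
dense = {"a": 0, "b": 1, "ab": 2}
print(find_holes(dense))  # []
```

Under that assumption, the ids 5, 6 and 7 leave exactly the indices [0, 1, 2, 3, 4] unused, which is what the message reports.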

Not only is the message wrong (I am not trying to save anything), it also cannot be suppressed by redirecting stdout and stderr in Python.
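
A likely reason is that `contextlib.redirect_stdout` only swaps the Python object `sys.stdout`, while output produced by native code goes to the process's file descriptor 1 without passing through that object. A possible workaround, sketched below, is to redirect the file descriptor itself. `redirect_fd` is a hypothetical helper, not part of `tokenizers` or `transformers`, and I have only reasoned about it for POSIX-like systems.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def redirect_fd(fd=1, target_path=os.devnull):
    """Temporarily point OS-level file descriptor `fd` at `target_path`.

    Hypothetical helper: unlike contextlib.redirect_stdout, this also
    affects writes that bypass sys.stdout, such as prints from native code.
    """
    # Flush Python's own buffers so earlier output is not swallowed.
    sys.stdout.flush()
    sys.stderr.flush()
    saved = os.dup(fd)  # keep a copy of the original descriptor
    target = os.open(target_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.dup2(target, fd)  # fd now writes to target_path
        yield
    finally:
        os.dup2(saved, fd)  # restore the original descriptor
        os.close(saved)
        os.close(target)


# Intended use with the snippet above (not executed here):
# with redirect_fd(1):
#     model = BPE(vocab=vocab, merges=merges, dropout=None)
```

This silences everything written to that descriptor inside the block, so it is a blunt instrument and does not replace a proper way to turn the message off in the library.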

println! does not belong in low-level library code, so at the very least we need a way to disable it. Beyond that, what is this message even for, given that it talks about saving when we are loading a tokenizer?
