| Title: | Fast Text Tokenization |
|---|---|
| Description: | Interfaces with the 'Hugging Face' tokenizers library to provide implementations of today's most used tokenizers such as the 'Byte-Pair Encoding' algorithm <https://huggingface.co/docs/tokenizers/index>. It's extremely fast for both training new vocabularies and tokenizing texts. |
| Authors: | Tomasz Kalinowski [ctb, cre], Daniel Falbel [aut], Regouby Christophe [ctb], Posit [cph] |
| Maintainer: | Tomasz Kalinowski <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.2.9000 |
| Built: | 2026-05-19 14:20:56 UTC |
| Source: | https://github.com/mlverse/tok |
This decoder is to be used with the pre_tokenizer_byte_level.
tok_decoder -> tok_decoder_byte_level
tok_decoder_byte_level$new(): Initializes a byte-level decoder.
tok_decoder_byte_level$new()
tok_decoder_byte_level$clone(): The objects of this class are cloneable with this method.
tok_decoder_byte_level$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other decoders:
tok_decoder
Represents the output of a tokenizer.
An encoding object containing encoding information such as attention masks and token ids.
.encoding: The underlying implementation pointer.
ids: The IDs are the main input to a language model. They are the token indices, the numerical representations that a language model understands.
attention_mask: The attention mask used as input for transformer models.
tok_encoding$new(): Initializes an encoding object (not to be used directly).
tok_encoding$new(encoding)
encoding: An encoding implementation object.
tok_encoding$clone(): The objects of this class are cloneable with this method.
tok_encoding$clone(deep = FALSE)
deep: Whether to make a deep clone.
withr::with_envvar(c(HUGGINGFACE_HUB_CACHE = tempdir()), {
  try({
    tok <- tokenizer$from_pretrained("gpt2")
    encoding <- tok$encode("Hello world")
    encoding
  })
})
BPE model
tok_model -> tok_model_bpe
tok_model_bpe$new(): Initializes a BPE model, an implementation of the BPE (Byte-Pair Encoding) algorithm.
tok_model_bpe$new(vocab = NULL, merges = NULL, cache_capacity = NULL, dropout = NULL, unk_token = NULL, continuing_subword_prefix = NULL, end_of_word_suffix = NULL, fuse_unk = NULL, byte_fallback = FALSE)
vocab: A named integer vector of string keys and their corresponding ids. Default: NULL.
merges: A list of pairs of tokens ([character, character]). Default: NULL.
cache_capacity: The number of words that the BPE cache can contain. The cache speeds up the process by storing merge operation results. Default: NULL.
dropout: A float between 0 and 1 representing the BPE dropout to use. Default: NULL.
unk_token: The unknown token to be used by the model. Default: NULL.
continuing_subword_prefix: The prefix to attach to subword units that don’t represent the beginning of a word. Default: NULL.
end_of_word_suffix: The suffix to attach to subword units that represent the end of a word. Default: NULL.
fuse_unk: Whether to fuse any subsequent unknown tokens into a single one. Default: NULL.
byte_fallback: Whether to use the SentencePiece (spm) byte-fallback trick. Default: FALSE.
tok_model_bpe$clone(): The objects of this class are cloneable with this method.
tok_model_bpe$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other model:
model_unigram,
model_wordpiece,
tok_model
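To make the vocab and merges arguments more concrete, the following base R sketch applies an ordered list of merges to a single word. It only illustrates the general BPE idea and does not call the package; the helper name apply_bpe_merges and the toy merges are hypothetical, and details such as caching, dropout and unknown-token handling are left out.

```r
# Simplified illustration of applying BPE merges to one word.
# `merges` is an ordered list of character pairs, highest priority first.
apply_bpe_merges <- function(word, merges) {
  # Start from single characters.
  symbols <- strsplit(word, "", fixed = TRUE)[[1]]
  for (pair in merges) {
    i <- 1
    while (i < length(symbols)) {
      if (symbols[i] == pair[1] && symbols[i + 1] == pair[2]) {
        # Replace the two adjacent symbols with their concatenation.
        merged <- paste0(pair[1], pair[2])
        symbols <- append(symbols[-c(i, i + 1)], merged, after = i - 1)
      } else {
        i <- i + 1
      }
    }
  }
  symbols
}

# Hypothetical merges, in the order a trainer might have learned them.
toy_merges <- list(c("l", "o"), c("lo", "w"), c("e", "r"))
apply_bpe_merges("lower", toy_merges)
```

Each resulting symbol would then be looked up in vocab to obtain its id.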
An implementation of the Unigram algorithm
tok_model -> tok_model_unigram
tok_model_unigram$new(): Constructor for the Unigram model.
tok_model_unigram$new(vocab = NULL, unk_id = NULL, byte_fallback = FALSE)
vocab: A dictionary of string keys and their corresponding relative scores. Default: NULL.
unk_id: The unknown token id to be used by the model. Default: NULL.
byte_fallback: Whether to use the byte-fallback trick. Default: FALSE.
tok_model_unigram$clone(): The objects of this class are cloneable with this method.
tok_model_unigram$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other model:
model_bpe,
model_wordpiece,
tok_model
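As a rough illustration of how a vocabulary of scored pieces can drive segmentation, the base R sketch below picks the highest-scoring split of a word with dynamic programming (a Viterbi-style search). It does not call the package; the helper name unigram_segment and the toy scores are hypothetical, and treating the scores as log-probabilities that add up along a segmentation is an assumption of this sketch.

```r
# Viterbi-style search for the best-scoring segmentation of one word.
# `scores` is a named numeric vector: piece -> score (assumed log-probability).
unigram_segment <- function(word, scores) {
  chars <- strsplit(word, "", fixed = TRUE)[[1]]
  n <- length(chars)
  best_score <- c(0, rep(-Inf, n))  # best_score[k + 1]: best score of the first k chars
  best_start <- integer(n + 1)      # start index of the last piece in that best split
  for (end in seq_len(n)) {
    for (start in seq_len(end)) {
      piece <- paste(chars[start:end], collapse = "")
      if (piece %in% names(scores)) {
        candidate <- best_score[start] + scores[[piece]]
        if (candidate > best_score[end + 1]) {
          best_score[end + 1] <- candidate
          best_start[end + 1] <- start
        }
      }
    }
  }
  # Fail loudly when the word cannot be covered by known pieces.
  stopifnot(is.finite(best_score[n + 1]))
  # Walk back from the end of the word to recover the pieces.
  pieces <- character(0)
  end <- n
  while (end > 0) {
    start <- best_start[end + 1]
    pieces <- c(paste(chars[start:end], collapse = ""), pieces)
    end <- start - 1
  }
  pieces
}

# Hypothetical scores for a tiny vocabulary.
toy_scores <- c(h = -3, u = -3, g = -3, hu = -2, ug = -2.5, hug = -6)
unigram_segment("hug", toy_scores)
```

With these toy scores, splitting into "hu" and "g" (total -5) beats the single piece "hug" (-6) and every other split.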
An implementation of the WordPiece algorithm
tok_model -> tok_model_wordpiece
tok_model_wordpiece$new(): Constructor for the WordPiece model.
tok_model_wordpiece$new(vocab = NULL, unk_token = NULL, max_input_chars_per_word = NULL)
vocab: A dictionary of string keys and their corresponding ids. Default: NULL.
unk_token: The unknown token to be used by the model. Default: NULL.
max_input_chars_per_word: The maximum number of characters to allow in a single word. Default: NULL.
tok_model_wordpiece$clone(): The objects of this class are cloneable with this method.
tok_model_wordpiece$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other model:
model_bpe,
model_unigram,
tok_model
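The roles of unk_token and max_input_chars_per_word are easier to see in a small example. The base R sketch below implements a greedy longest-match-first split, which is how WordPiece is commonly described; it does not call the package. The helper name wordpiece_split, the toy vocabulary, the "[UNK]" token and the limit of 100 characters are assumptions made for illustration, and "##" is used as the continuation prefix.

```r
# Greedy longest-match-first split of a single word (illustrative only).
# `vocab` is a character vector of known tokens.
wordpiece_split <- function(word, vocab, unk_token = "[UNK]", prefix = "##",
                            max_input_chars_per_word = 100) {
  chars <- strsplit(word, "", fixed = TRUE)[[1]]
  n <- length(chars)
  # Overly long words are mapped to the unknown token as a whole.
  if (n > max_input_chars_per_word) return(unk_token)
  pieces <- character(0)
  start <- 1
  while (start <= n) {
    end <- n
    match <- NULL
    # Try the longest remaining substring first, then shrink it.
    while (end >= start) {
      candidate <- paste(chars[start:end], collapse = "")
      if (start > 1) candidate <- paste0(prefix, candidate)
      if (candidate %in% vocab) {
        match <- candidate
        break
      }
      end <- end - 1
    }
    # If no piece fits, the whole word becomes the unknown token.
    if (is.null(match)) return(unk_token)
    pieces <- c(pieces, match)
    start <- end + 1
  }
  pieces
}

toy_vocab <- c("un", "##aff", "##able", "[UNK]")
wordpiece_split("unaffable", toy_vocab)
wordpiece_split("xyz", toy_vocab)
```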
NFC normalizer
tok_normalizer -> tok_normalizer_nfc
tok_normalizer_nfc$new(): Initializes the NFC normalizer.
tok_normalizer_nfc$new()
tok_normalizer_nfc$clone(): The objects of this class are cloneable with this method.
tok_normalizer_nfc$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other normalizers:
normalizer_nfkc,
tok_normalizer
NFKC normalizer
tok_normalizer -> tok_normalizer_nfkc
tok_normalizer_nfkc$new(): Initializes the NFKC normalizer.
tok_normalizer_nfkc$new()
tok_normalizer_nfkc$clone(): The objects of this class are cloneable with this method.
tok_normalizer_nfkc$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other normalizers:
normalizer_nfc,
tok_normalizer
Generic class for pre-tokenizers
.pre_tokenizer: Internal pointer to the pre-tokenizer object.
tok_pre_tokenizer$new(): Initializes a pre-tokenizer.
tok_pre_tokenizer$new(pre_tokenizer)
pre_tokenizer: A raw pointer to a pre-tokenizer.
tok_pre_tokenizer$clone(): The objects of this class are cloneable with this method.
tok_pre_tokenizer$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other pre_tokenizer:
pre_tokenizer_byte_level,
pre_tokenizer_whitespace
This pre-tokenizer takes care of replacing all bytes of the given string with a corresponding representation, as well as splitting into words.
tok_pre_tokenizer -> tok_pre_tokenizer_byte_level
tok_pre_tokenizer_byte_level$new(): Initializes the byte-level pre-tokenizer.
tok_pre_tokenizer_byte_level$new(add_prefix_space = TRUE, use_regex = TRUE)
add_prefix_space: Whether to add a space to the first word.
use_regex: Set this to FALSE to prevent this pre-tokenizer from using the GPT-2 specific regex for splitting on whitespace.
tok_pre_tokenizer_byte_level$clone(): The objects of this class are cloneable with this method.
tok_pre_tokenizer_byte_level$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other pre_tokenizer:
pre_tokenizer,
pre_tokenizer_whitespace
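The idea of replacing every byte with a corresponding representation can be sketched in base R. The example below builds the byte-to-character table used by GPT-2 style byte-level tokenizers: bytes that already correspond to a visible character keep their code point, and all other bytes are shifted to code points starting at 256. That this is exactly the table used internally is an assumption here, and the helper names byte_level_table and to_byte_level are hypothetical; the sketch does not call the package and does not perform the word splitting step.

```r
# Build a 256-entry lookup: entry b + 1 is the character standing in for byte b.
byte_level_table <- function() {
  # Bytes that map to a visible character keep their own code point.
  visible <- c(33:126, 161:172, 174:255)
  # Every other byte (controls, space, etc.) is shifted to 256 and above.
  hidden <- setdiff(0:255, visible)
  code_points <- integer(256)
  code_points[visible + 1] <- visible
  code_points[hidden + 1] <- 256L + seq_along(hidden) - 1L
  vapply(code_points, intToUtf8, character(1))
}

# Replace each UTF-8 byte of `text` with its stand-in character.
to_byte_level <- function(text) {
  lookup <- byte_level_table()
  bytes <- as.integer(charToRaw(enc2utf8(text)))
  paste(lookup[bytes + 1], collapse = "")
}

to_byte_level("Hello world")
```

Under this mapping the space byte (32) becomes code point 288, which is why byte-level vocabularies typically show a marker character in place of leading spaces.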
This pre-tokenizer simply splits using the following regex: \w+|[^\w\s]+
tok_pre_tokenizer -> tok_pre_tokenizer_whitespace
tok_pre_tokenizer_whitespace$new(): Initializes the whitespace pre-tokenizer.
tok_pre_tokenizer_whitespace$new()
tok_pre_tokenizer_whitespace$clone(): The objects of this class are cloneable with this method.
tok_pre_tokenizer_whitespace$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other pre_tokenizer:
pre_tokenizer,
pre_tokenizer_byte_level
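The effect of the regex \w+|[^\w\s]+ can be previewed with base R alone. The sketch below is only an approximation of the pre-tokenizer: it uses R's PCRE engine, where \w covers ASCII word characters by default, so results may differ for non-ASCII text. The helper name split_whitespace is hypothetical.

```r
# Extract runs of word characters, or runs of characters that are
# neither word characters nor whitespace (for example punctuation).
split_whitespace <- function(text) {
  regmatches(text, gregexpr("\\w+|[^\\w\\s]+", text, perl = TRUE))[[1]]
}

split_whitespace("Hello, world!")
```

Whitespace itself never appears in the output; it only acts as a separator, while punctuation is kept as separate pieces.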
This post-processor takes care of trimming the offsets. By default, the ByteLevel BPE might include whitespaces in the produced tokens. If you don’t want the offsets to include these whitespaces, then this PostProcessor must be used.
tok_processor -> tok_processor_byte_level
tok_processor_byte_level$new(): Initializes the byte-level post-processor.
tok_processor_byte_level$new(trim_offsets = TRUE)
trim_offsets: Whether to trim the whitespace from the produced offsets.
tok_processor_byte_level$clone(): The objects of this class are cloneable with this method.
tok_processor_byte_level$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other processors:
tok_processor
Generic class for decoders
.decoder: The raw pointer to the decoder.
tok_decoder$new(): Initializes a decoder.
tok_decoder$new(decoder)
decoder: A raw decoder pointer.
tok_decoder$clone(): The objects of this class are cloneable with this method.
tok_decoder$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other decoders:
decoder_byte_level
Generic class for tokenization models
.model: Stores the pointer to the model (internal).
tok_model$new(): Initializes a generic abstract tokenizer model.
tok_model$new(model)
model: Pointer to a tokenization model.
tok_model$clone(): The objects of this class are cloneable with this method.
tok_model$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other model:
model_bpe,
model_unigram,
model_wordpiece
Generic class for normalizers
.normalizer: Internal pointer to the normalizer object.
tok_normalizer$new(): Initializes a normalizer.
tok_normalizer$new(normalizer)
normalizer: A raw pointer to a normalizer.
tok_normalizer$clone(): The objects of this class are cloneable with this method.
tok_normalizer$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other normalizers:
normalizer_nfc,
normalizer_nfkc
Generic class for processors
.processor: Internal pointer to the processor object.
tok_processor$new(): Initializes a processor.
tok_processor$new(processor)
processor: A raw pointer to a processor.
tok_processor$clone(): The objects of this class are cloneable with this method.
tok_processor$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other processors:
processor_byte_level
Generic training class
.trainer: A pointer to a raw trainer.
tok_trainer$new(): Initializes a generic trainer from a raw trainer.
tok_trainer$new(trainer)
trainer: A raw trainer (internal).
tok_trainer$clone(): The objects of this class are cloneable with this method.
tok_trainer$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other trainer:
trainer_bpe,
trainer_unigram,
trainer_wordpiece
A Tokenizer works as a pipeline. It processes some raw text as input and outputs an encoding.
A tokenizer that can be used for encoding character strings or decoding integers.
.tokenizer: (unsafe usage) Lower-level pointer to the tokenizer.
pre_tokenizer: Instance of the pre-tokenizer.
normalizer: Gets the normalizer instance.
post_processor: Gets the post-processor used by the tokenizer.
decoder: Gets and sets the decoder.
padding: Gets the padding configuration.
truncation: Gets the truncation configuration.
tok_tokenizer$new(): Initializes a tokenizer.
tok_tokenizer$new(tokenizer)
tokenizer: Will be cloned to initialize a new tokenizer.
tok_tokenizer$encode(): Encodes the given sequence and pair. This method can process raw text sequences as well as already pre-tokenized sequences.
tok_tokenizer$encode(sequence, pair = NULL, is_pretokenized = FALSE, add_special_tokens = TRUE)
sequence: The main input sequence we want to encode. This sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument.
pair: An optional input sequence. The expected format is the same as for sequence.
is_pretokenized: Whether the input is already pre-tokenized.
add_special_tokens: Whether to add the special tokens.
tok_tokenizer$decode(): Decodes the given list of ids back to a string.
tok_tokenizer$decode(ids, skip_special_tokens = TRUE)
ids: The list of ids that we want to decode.
skip_special_tokens: Whether the special tokens should be removed from the decoded string.
tok_tokenizer$encode_batch(): Encodes a batch of sequences. Returns a list of encodings.
tok_tokenizer$encode_batch(input, is_pretokenized = FALSE, add_special_tokens = TRUE)
input: A list of single sequences or pair sequences to encode. Each sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument.
is_pretokenized: Whether the input is already pre-tokenized.
add_special_tokens: Whether to add the special tokens.
tok_tokenizer$decode_batch(): Decodes a batch of ids back to their corresponding strings.
tok_tokenizer$decode_batch(sequences, skip_special_tokens = TRUE)
sequences: The batch of sequences we want to decode.
skip_special_tokens: Whether the special tokens should be removed from the decoded strings.
tok_tokenizer$from_file(): Creates a tokenizer from the path of a serialized tokenizer. This is a static method and should be called instead of $new when initializing the tokenizer.
tok_tokenizer$from_file(path)
path: Path to a tokenizer.json file.
tok_tokenizer$from_pretrained(): Instantiates a new tokenizer from an existing file on the Hugging Face Hub.
tok_tokenizer$from_pretrained(identifier, revision = "main", auth_token = NULL)
identifier: The identifier of a model on the Hugging Face Hub that contains a tokenizer.json file.
revision: A branch or commit id.
auth_token: An optional auth token used to access private repositories on the Hugging Face Hub.
tok_tokenizer$train(): Trains the tokenizer using the given files. Reads the files line by line, while keeping all the whitespace, even new lines.
tok_tokenizer$train(files, trainer)
files: A character vector of file paths.
trainer: An instance of a trainer object, specific to that tokenizer type.
tok_tokenizer$train_from_memory(): Trains the tokenizer on a character vector of texts.
tok_tokenizer$train_from_memory(texts, trainer)
texts: A character vector of texts.
trainer: An instance of a trainer object, specific to that tokenizer type.
tok_tokenizer$save(): Saves the tokenizer to a JSON file.
tok_tokenizer$save(path, pretty = TRUE)
path: A path to a file in which to save the serialized tokenizer.
pretty: Whether the JSON file should be pretty formatted.
tok_tokenizer$enable_padding(): Enables padding for the tokenizer.
tok_tokenizer$enable_padding(direction = "right", pad_id = 0L, pad_type_id = 0L, pad_token = "[PAD]", length = NULL, pad_to_multiple_of = NULL)
direction: (str, optional, defaults to "right") The direction in which to pad. Can be either 'right' or 'left'.
pad_id: (int, defaults to 0) The id to be used when padding.
pad_type_id: (int, defaults to 0) The type id to be used when padding.
pad_token: (str, defaults to '[PAD]') The pad token to be used when padding.
length: (int, optional) If specified, the length at which to pad. If not specified, we pad using the size of the longest sequence in a batch.
pad_to_multiple_of: (int, optional) If specified, the padding length should always snap to the next multiple of the given value. For example, if we were going to pad to a length of 250 but pad_to_multiple_of = 8, then we will pad to 256.
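The interaction between length, pad_to_multiple_of and the attention mask can be worked through in base R. The sketch below pads a batch of id vectors on the right and builds a matching mask of ones (real tokens) and zeros (padding). It does not call the package; the helper name pad_batch and its target_length argument (standing in for length) are hypothetical, and left padding and type ids are omitted.

```r
# Right-pad a list of integer id vectors and build attention masks.
pad_batch <- function(batch, pad_id = 0L, target_length = NULL,
                      pad_to_multiple_of = NULL) {
  # Without an explicit length, use the longest sequence in the batch.
  target <- if (is.null(target_length)) max(lengths(batch)) else target_length
  # Snap the target up to the next multiple, e.g. 250 -> 256 for 8.
  if (!is.null(pad_to_multiple_of)) {
    target <- ceiling(target / pad_to_multiple_of) * pad_to_multiple_of
  }
  lapply(batch, function(ids) {
    # Sequences already at or above the target are left unchanged.
    n_pad <- max(target - length(ids), 0)
    list(
      ids = c(ids, rep(pad_id, n_pad)),
      attention_mask = c(rep(1L, length(ids)), rep(0L, n_pad))
    )
  })
}

padded <- pad_batch(list(c(5L, 6L, 7L), c(8L)), pad_to_multiple_of = 8)
padded[[2]]$ids
padded[[2]]$attention_mask
```

Here the longest sequence has 3 ids, so both sequences are padded to 8, the next multiple of 8.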
tok_tokenizer$no_padding(): Disables padding.
tok_tokenizer$no_padding()
tok_tokenizer$enable_truncation(): Enables truncation on the tokenizer.
tok_tokenizer$enable_truncation(max_length, stride = 0, strategy = "longest_first", direction = "right")
max_length: The maximum length at which to truncate.
stride: The length of the previous first sequence to be included in the overflowing sequence. Default: 0.
strategy: The strategy used for truncation. Can be one of: "longest_first", "only_first", or "only_second". Default: "longest_first".
direction: The truncation direction. Default: "right".
tok_tokenizer$no_truncation(): Disables truncation.
tok_tokenizer$no_truncation()
tok_tokenizer$get_vocab_size(): Gets the vocabulary size.
tok_tokenizer$get_vocab_size(with_added_tokens = TRUE)
with_added_tokens: Whether to count added tokens.
tok_tokenizer$clone(): The objects of this class are cloneable with this method.
tok_tokenizer$clone(deep = FALSE)
deep: Whether to make a deep clone.
withr::with_envvar(c(HUGGINGFACE_HUB_CACHE = tempdir()), {
  try({
    tok <- tokenizer$from_pretrained("gpt2")
    tok$encode("Hello world")$ids
  })
})
BPE trainer
tok_trainer -> tok_trainer_bpe
tok_trainer_bpe$new(): Constructor for the BPE trainer.
tok_trainer_bpe$new(vocab_size = NULL, min_frequency = NULL, show_progress = NULL, special_tokens = NULL, limit_alphabet = NULL, initial_alphabet = NULL, continuing_subword_prefix = NULL, end_of_word_suffix = NULL, max_token_length = NULL)
vocab_size: The size of the final vocabulary, including all tokens and alphabet. Default: NULL.
min_frequency: The minimum frequency a pair should have in order to be merged. Default: NULL.
show_progress: Whether to show progress bars while training. Default: TRUE.
special_tokens: A list of special tokens the model should be aware of. Default: NULL.
limit_alphabet: The maximum number of different characters to keep in the alphabet. Default: NULL.
initial_alphabet: A list of characters to include in the initial alphabet, even if not seen in the training dataset. Default: NULL.
continuing_subword_prefix: A prefix to be used for every subword that is not a beginning-of-word. Default: NULL.
end_of_word_suffix: A suffix to be used for every subword that is an end-of-word. Default: NULL.
max_token_length: Prevents creating tokens longer than the specified size. Default: NULL.
tok_trainer_bpe$clone(): The objects of this class are cloneable with this method.
tok_trainer_bpe$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other trainer:
tok_trainer,
trainer_unigram,
trainer_wordpiece
Unigram tokenizer trainer
tok_trainer -> tok_trainer_unigram
tok_trainer_unigram$new(): Constructor for the Unigram tokenizer trainer.
tok_trainer_unigram$new(vocab_size = 8000, show_progress = TRUE, special_tokens = NULL, shrinking_factor = 0.75, unk_token = NULL, max_piece_length = 16, n_sub_iterations = 2)
vocab_size: The size of the final vocabulary, including all tokens and alphabet.
show_progress: Whether to show progress bars while training.
special_tokens: A list of special tokens the model should be aware of.
shrinking_factor: The shrinking factor used at each step of training to prune the vocabulary.
unk_token: The token used for out-of-vocabulary tokens.
max_piece_length: The maximum length of a given token.
n_sub_iterations: The number of iterations of the EM algorithm to perform before pruning the vocabulary.
initial_alphabet: A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.
tok_trainer_unigram$clone(): The objects of this class are cloneable with this method.
tok_trainer_unigram$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other trainer:
tok_trainer,
trainer_bpe,
trainer_wordpiece
WordPiece tokenizer trainer
tok_trainer -> tok_trainer_wordpiece
tok_trainer_wordpiece$new(): Constructor for the WordPiece tokenizer trainer.
tok_trainer_wordpiece$new(vocab_size = 30000, min_frequency = 0, show_progress = FALSE, special_tokens = NULL, limit_alphabet = NULL, initial_alphabet = NULL, continuing_subword_prefix = "##", end_of_word_suffix = NULL)
vocab_size: The size of the final vocabulary, including all tokens and alphabet. Default: 30000.
min_frequency: The minimum frequency a pair should have in order to be merged. Default: 0.
show_progress: Whether to show progress bars while training. Default: FALSE.
special_tokens: A list of special tokens the model should be aware of. Default: NULL.
limit_alphabet: The maximum number of different characters to keep in the alphabet. Default: NULL.
initial_alphabet: A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept. Default: NULL.
continuing_subword_prefix: A prefix to be used for every subword that is not a beginning-of-word. Default: "##".
end_of_word_suffix: A suffix to be used for every subword that is an end-of-word. Default: NULL.
tok_trainer_wordpiece$clone(): The objects of this class are cloneable with this method.
tok_trainer_wordpiece$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other trainer:
tok_trainer,
trainer_bpe,
trainer_unigram