| Title: | Fast Text Tokenization |
|---|---|
| Description: | Interfaces with the 'Hugging Face' tokenizers library to provide implementations of today's most used tokenizers such as the 'Byte-Pair Encoding' algorithm <https://huggingface.co/docs/tokenizers/index>. It's extremely fast for both training new vocabularies and tokenizing texts. |
| Authors: | Daniel Falbel [aut, cre], Regouby Christophe [ctb], Posit [cph] |
| Maintainer: | Daniel Falbel <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.5 |
| Built: | 2024-11-22 06:08:49 UTC |
| Source: | https://github.com/mlverse/tok |
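A minimal quick-start sketch using methods documented below; "gpt2" is only an example identifier, and from_pretrained() downloads files from the Hugging Face Hub:
library(tok)
# load a pre-trained tokenizer from the Hugging Face Hub
tk <- tokenizer$from_pretrained("gpt2")
# encode a string into token ids, then decode the ids back into text
enc <- tk$encode("Hello world")
enc$ids
tk$decode(enc$ids)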
Byte level decoder
This decoder is to be used with the pre_tokenizer_byte_level.
tok::tok_decoder -> tok_decoder_byte_level
new()
Initializes a byte level decoder
decoder_byte_level$new()
clone()
The objects of this class are cloneable with this method.
decoder_byte_level$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other decoders:
tok_decoder
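A short sketch of pairing this decoder with a byte-level pipeline; it assumes the decoder field of a tokenizer accepts a decoder_byte_level instance (the field is documented as settable in the tokenizer class below):
library(tok)
# example model that already uses a byte-level pre-tokenizer
tk <- tokenizer$from_pretrained("gpt2")
# assumption: the decoder field accepts a decoder_byte_level instance
tk$decoder <- decoder_byte_level$new()
tk$decode(tk$encode("Hello world")$ids)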
Represents the output of a tokenizer.
An encoding object containing encoding information such as attention masks and token ids.
.encoding
The underlying implementation pointer.
ids
The IDs are the main input to a Language Model. They are the token indices, the numerical representations that a LM understands.
attention_mask
The attention mask used as input for transformers models.
new()
Initializes an encoding object (not to be used directly)
encoding$new(encoding)
encoding
an encoding implementation object
clone()
The objects of this class are cloneable with this method.
encoding$clone(deep = FALSE)
deep
Whether to make a deep clone.
withr::with_envvar(c(HUGGINGFACE_HUB_CACHE = tempdir()), {
  try({
    tok <- tokenizer$from_pretrained("gpt2")
    encoding <- tok$encode("Hello world")
    encoding
  })
})
BPE model
tok::tok_model -> tok_model_bpe
new()
Initializes a BPE model, an implementation of the BPE (Byte-Pair Encoding) algorithm.
model_bpe$new(vocab = NULL, merges = NULL, cache_capacity = NULL, dropout = NULL, unk_token = NULL, continuing_subword_prefix = NULL, end_of_word_suffix = NULL, fuse_unk = NULL, byte_fallback = FALSE)
vocab
A named integer vector of string keys and their corresponding ids. Default: NULL.
merges
A list of pairs of tokens ([character, character]). Default: NULL.
cache_capacity
The number of words that the BPE cache can contain. The cache speeds up the process by storing merge operation results. Default: NULL.
dropout
A float between 0 and 1 representing the BPE dropout to use. Default: NULL.
unk_token
The unknown token to be used by the model. Default: NULL.
continuing_subword_prefix
The prefix to attach to subword units that don't represent the beginning of a word. Default: NULL.
end_of_word_suffix
The suffix to attach to subword units that represent the end of a word. Default: NULL.
fuse_unk
Whether to fuse any subsequent unknown tokens into a single one. Default: NULL.
byte_fallback
Whether to use the spm byte-fallback trick. Default: FALSE.
clone()
The objects of this class are cloneable with this method.
model_bpe$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other model: model_unigram, model_wordpiece, tok_model
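A training sketch for a fresh BPE tokenizer. Two assumptions are flagged in the comments: that tokenizer$new() accepts a tok_model instance (mirroring the upstream tokenizers API) and that the pre_tokenizer field can be assigned:
library(tok)
# assumption: tokenizer$new() accepts a model object, mirroring the upstream API
tk <- tokenizer$new(model_bpe$new(unk_token = "[UNK]"))
# assumption: the pre_tokenizer field is assignable
tk$pre_tokenizer <- pre_tokenizer_whitespace$new()
trainer <- trainer_bpe$new(vocab_size = 100, special_tokens = c("[UNK]"))
# train on an in-memory character vector
tk$train_from_memory(c("hello world", "hello tokenizers from R"), trainer)
tk$encode("hello world")$ids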
Unigram model
An implementation of the Unigram algorithm.
tok::tok_model -> tok_model_unigram
new()
Constructor for Unigram Model
model_unigram$new(vocab = NULL, unk_id = NULL, byte_fallback = FALSE)
vocab
A dictionary of string keys and their corresponding relative score. Default: NULL.
unk_id
The unknown token id to be used by the model. Default: NULL.
byte_fallback
Whether to use the byte-fallback trick. Default: FALSE.
clone()
The objects of this class are cloneable with this method.
model_unigram$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other model: model_bpe, model_wordpiece, tok_model
WordPiece model
An implementation of the WordPiece algorithm.
tok::tok_model -> tok_model_wordpiece
new()
Constructor for the WordPiece model
model_wordpiece$new(vocab = NULL, unk_token = NULL, max_input_chars_per_word = NULL)
vocab
A dictionary of string keys and their corresponding ids. Default: NULL.
unk_token
The unknown token to be used by the model. Default: NULL.
max_input_chars_per_word
The maximum number of characters to allow in a single word. Default: NULL.
clone()
The objects of this class are cloneable with this method.
model_wordpiece$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other model: model_bpe, model_unigram, tok_model
NFC normalizer
tok::tok_normalizer -> tok_normalizer_nfc
new()
Initializes the NFC normalizer
normalizer_nfc$new()
clone()
The objects of this class are cloneable with this method.
normalizer_nfc$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other normalizers: normalizer_nfkc, tok_normalizer
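A brief sketch of attaching the normalizer to a tokenizer. The tokenizer class documents normalizer as a getter, so the assignment below is an assumption borrowed from the upstream library:
library(tok)
tk <- tokenizer$from_pretrained("gpt2")  # example identifier
# assumption: the normalizer field can be assigned, as in the upstream library
tk$normalizer <- normalizer_nfc$new()
# input text is normalized to NFC before tokenization
tk$encode("Caf\u00e9")$ids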
NFKC normalizer
tok::tok_normalizer -> tok_normalizer_nfkc
new()
Initializes the NFKC normalizer
normalizer_nfkc$new()
clone()
The objects of this class are cloneable with this method.
normalizer_nfkc$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other normalizers: normalizer_nfc, tok_normalizer
Generic class for pre-tokenizers
.pre_tokenizer
Internal pointer to tokenizer object
new()
Initializes a pre-tokenizer
pre_tokenizer$new(pre_tokenizer)
pre_tokenizer
a raw pointer to a pre-tokenizer
clone()
The objects of this class are cloneable with this method.
pre_tokenizer$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other pre_tokenizer: pre_tokenizer_byte_level, pre_tokenizer_whitespace
Byte level pre-tokenizer
This pre-tokenizer takes care of replacing all bytes of the given string with a corresponding representation, as well as splitting into words.
tok::tok_pre_tokenizer -> tok_pre_tokenizer_byte_level
new()
Initializes the byte-level pre-tokenizer
pre_tokenizer_byte_level$new(add_prefix_space = TRUE, use_regex = TRUE)
add_prefix_space
Whether to add a space to the first word
use_regex
Set this to FALSE to prevent this pre-tokenizer from using the GPT2-specific regexp for splitting on whitespace.
clone()
The objects of this class are cloneable with this method.
pre_tokenizer_byte_level$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other pre_tokenizer: pre_tokenizer, pre_tokenizer_whitespace
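A constructor sketch. Attaching the pre-tokenizer to a tokenizer relies on the same assumptions noted elsewhere on this page (tokenizer$new() accepting a model, assignable pre_tokenizer field):
library(tok)
# byte-level pre-tokenizer without the leading-space trick
pre <- pre_tokenizer_byte_level$new(add_prefix_space = FALSE, use_regex = TRUE)
# assumption: tokenizer$new() accepts a model and pre_tokenizer is assignable
tk <- tokenizer$new(model_bpe$new())
tk$pre_tokenizer <- pre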
Whitespace pre-tokenizer
This pre-tokenizer simply splits using the following regex: \w+|[^\w\s]+
tok::tok_pre_tokenizer -> tok_pre_tokenizer_whitespace
new()
Initializes the whitespace pre-tokenizer
pre_tokenizer_whitespace$new()
clone()
The objects of this class are cloneable with this method.
pre_tokenizer_whitespace$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other pre_tokenizer: pre_tokenizer, pre_tokenizer_byte_level
Byte Level post-processor
This post-processor takes care of trimming the offsets. By default, the ByteLevel BPE might include whitespaces in the produced tokens. If you don't want the offsets to include these whitespaces, then this post-processor must be used.
tok::tok_processor -> tok_processor_byte_level
new()
Initializes the byte level post processor
processor_byte_level$new(trim_offsets = TRUE)
trim_offsets
Whether to trim the whitespaces from the produced offsets.
clone()
The objects of this class are cloneable with this method.
processor_byte_level$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other processors:
tok_processor
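A small sketch. The tokenizer class documents post_processor as a getter, so the assignment is an assumption borrowed from the upstream API:
library(tok)
tk <- tokenizer$from_pretrained("gpt2")  # example identifier
# assumption: the post_processor field can be assigned
tk$post_processor <- processor_byte_level$new(trim_offsets = TRUE)
tk$encode("Hello world")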
Generic class for decoders
.decoder
The raw pointer to the decoder
new()
Initializes a decoder
tok_decoder$new(decoder)
decoder
a raw decoder pointer
clone()
The objects of this class are cloneable with this method.
tok_decoder$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other decoders:
decoder_byte_level
Generic class for tokenization models
.model
Stores the pointer to the model (internal).
new()
Initializes a generic abstract tokenizer model
tok_model$new(model)
model
Pointer to a tokenization model
clone()
The objects of this class are cloneable with this method.
tok_model$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other model: model_bpe, model_unigram, model_wordpiece
Generic class for normalizers
.normalizer
Internal pointer to normalizer object
new()
Initializes a normalizer
tok_normalizer$new(normalizer)
normalizer
a raw pointer to a normalizer
clone()
The objects of this class are cloneable with this method.
tok_normalizer$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other normalizers: normalizer_nfc, normalizer_nfkc
Generic class for processors
.processor
Internal pointer to processor object
new()
Initializes a processor
tok_processor$new(processor)
processor
a raw pointer to a processor
clone()
The objects of this class are cloneable with this method.
tok_processor$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other processors:
processor_byte_level
Generic training class
.trainer
a pointer to a raw trainer
new()
Initializes a generic trainer from a raw trainer
tok_trainer$new(trainer)
trainer
raw trainer (internal)
clone()
The objects of this class are cloneable with this method.
tok_trainer$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other trainer: trainer_bpe, trainer_unigram, trainer_wordpiece
Tokenizer
A tokenizer works as a pipeline: it processes raw text as input and outputs an encoding. It can be used for encoding character strings or decoding integer ids back to text.
.tokenizer
(unsafe usage) Lower level pointer to tokenizer
pre_tokenizer
instance of the pre-tokenizer
normalizer
Gets the normalizer instance
post_processor
Gets the post processor used by tokenizer
decoder
Gets and sets the decoder
padding
Gets padding configuration
truncation
Gets truncation configuration
new()
Initializes a tokenizer
tokenizer$new(tokenizer)
tokenizer
Will be cloned to initialize a new tokenizer
encode()
Encode the given sequence and pair. This method can process raw text sequences as well as already pre-tokenized sequences.
tokenizer$encode(sequence, pair = NULL, is_pretokenized = FALSE, add_special_tokens = TRUE)
sequence
The main input sequence we want to encode. This sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument
pair
An optional input sequence. The expected format is the same as for sequence.
is_pretokenized
Whether the input is already pre-tokenized
add_special_tokens
Whether to add the special tokens
decode()
Decode the given list of ids back to a string
tokenizer$decode(ids, skip_special_tokens = TRUE)
ids
The list of ids that we want to decode
skip_special_tokens
Whether the special tokens should be removed from the decoded string
encode_batch()
Encodes a batch of sequences. Returns a list of encodings.
tokenizer$encode_batch(input, is_pretokenized = FALSE, add_special_tokens = TRUE)
input
A list of single sequences or pair sequences to encode. Each sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument.
is_pretokenized
Whether the input is already pre-tokenized
add_special_tokens
Whether to add the special tokens
decode_batch()
Decode a batch of ids back to their corresponding string
tokenizer$decode_batch(sequences, skip_special_tokens = TRUE)
sequences
The batch of sequences we want to decode
skip_special_tokens
Whether the special tokens should be removed from the decoded strings
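A batch round trip using encode_batch() and decode_batch(); "gpt2" is an example identifier:
library(tok)
tk <- tokenizer$from_pretrained("gpt2")
# encode several sequences at once; the result is a list of encoding objects
encs <- tk$encode_batch(list("Hello world", "Fast text tokenization"))
ids <- lapply(encs, function(e) e$ids)
# decode the whole batch back to strings
tk$decode_batch(ids)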
from_file()
Creates a tokenizer from the path of a serialized tokenizer. This is a static method and should be called instead of $new when initializing the tokenizer.
tokenizer$from_file(path)
path
Path to tokenizer.json file
from_pretrained()
Instantiate a new Tokenizer from an existing file on the Hugging Face Hub.
tokenizer$from_pretrained(identifier, revision = "main", auth_token = NULL)
identifier
The identifier of a model on the Hugging Face Hub that contains a tokenizer.json file
revision
A branch or commit id
auth_token
An optional auth token used to access private repositories on the Hugging Face Hub
train()
Train the Tokenizer using the given files. Reads the files line by line, while keeping all the whitespace, even new lines.
tokenizer$train(files, trainer)
files
character vector of file paths.
trainer
an instance of a trainer object, specific to that tokenizer type.
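A file-based training sketch; as elsewhere on this page, it assumes tokenizer$new() accepts a model instance and that pre_tokenizer is assignable:
library(tok)
# write a tiny corpus to a temporary text file
corpus <- tempfile(fileext = ".txt")
writeLines(c("hello world", "hello tokenizers from R"), corpus)
tk <- tokenizer$new(model_bpe$new(unk_token = "[UNK]"))   # assumption: accepts a tok_model
tk$pre_tokenizer <- pre_tokenizer_whitespace$new()        # assumption: field is assignable
tk$train(corpus, trainer_bpe$new(vocab_size = 100, special_tokens = c("[UNK]")))
tk$get_vocab_size()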
train_from_memory()
Train the tokenizer on a character vector of texts
tokenizer$train_from_memory(texts, trainer)
texts
a character vector of texts.
trainer
an instance of a trainer object, specific to that tokenizer type.
save()
Saves the tokenizer to a json file
tokenizer$save(path, pretty = TRUE)
path
A path to a file in which to save the serialized tokenizer.
pretty
Whether the JSON file should be pretty formatted.
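A save and reload round trip using save() and the static from_file() method; "gpt2" is an example identifier:
library(tok)
tk <- tokenizer$from_pretrained("gpt2")
path <- tempfile(fileext = ".json")
tk$save(path, pretty = TRUE)
# from_file() is static: call it on the class, not on an instance
tk2 <- tokenizer$from_file(path)
tk2$encode("Hello world")$ids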
enable_padding()
Enables padding for the tokenizer
tokenizer$enable_padding(direction = "right", pad_id = 0L, pad_type_id = 0L, pad_token = "[PAD]", length = NULL, pad_to_multiple_of = NULL)
direction
(str, optional, defaults to 'right') The direction in which to pad. Can be either 'right' or 'left'.
pad_id
(int, defaults to 0) The id to be used when padding.
pad_type_id
(int, defaults to 0) The type id to be used when padding.
pad_token
(str, defaults to '[PAD]') The pad token to be used when padding.
length
(int, optional) If specified, the length at which to pad. If not specified, we pad using the size of the longest sequence in a batch.
pad_to_multiple_of
(int, optional) If specified, the padding length should always snap to the next multiple of the given value. For example, if we were going to pad with a length of 250 but pad_to_multiple_of = 8, then we will pad to 256.
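A padding sketch: once padding is enabled, batch encodings are padded up to the longest sequence (or to the next multiple of pad_to_multiple_of), which is visible in the attention_mask. The pad token and id below are illustrative and are not part of the GPT-2 vocabulary:
library(tok)
tk <- tokenizer$from_pretrained("gpt2")  # example identifier
tk$enable_padding(pad_id = 0L, pad_token = "[PAD]", pad_to_multiple_of = 8)
encs <- tk$encode_batch(list("Hello world", "A much longer sentence to pad against"))
lapply(encs, function(e) e$attention_mask)
tk$no_padding()  # switch padding off again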
no_padding()
Disables padding
tokenizer$no_padding()
enable_truncation()
Enables truncation on the tokenizer
tokenizer$enable_truncation(max_length, stride = 0, strategy = "longest_first", direction = "right")
max_length
The maximum length at which to truncate.
stride
The length of the previous first sequence to be included in the overflowing sequence. Default: 0.
strategy
The strategy used for truncation. Can be one of: "longest_first", "only_first", or "only_second". Default: "longest_first".
direction
The truncation direction. Default: "right".
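A truncation sketch using the parameters above; "gpt2" is an example identifier:
library(tok)
tk <- tokenizer$from_pretrained("gpt2")
# keep at most 4 tokens per sequence, truncating on the right
tk$enable_truncation(max_length = 4, strategy = "longest_first", direction = "right")
length(tk$encode("a fairly long sentence that will be truncated")$ids)
tk$no_truncation()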
no_truncation()
Disables truncation
tokenizer$no_truncation()
get_vocab_size()
Gets the vocabulary size
tokenizer$get_vocab_size(with_added_tokens = TRUE)
with_added_tokens
Whether to count added tokens
clone()
The objects of this class are cloneable with this method.
tokenizer$clone(deep = FALSE)
deep
Whether to make a deep clone.
withr::with_envvar(c(HUGGINGFACE_HUB_CACHE = tempdir()), {
  try({
    tok <- tokenizer$from_pretrained("gpt2")
    tok$encode("Hello world")$ids
  })
})
BPE trainer
tok::tok_trainer -> tok_trainer_bpe
new()
Constructor for the BPE trainer
trainer_bpe$new(vocab_size = NULL, min_frequency = NULL, show_progress = NULL, special_tokens = NULL, limit_alphabet = NULL, initial_alphabet = NULL, continuing_subword_prefix = NULL, end_of_word_suffix = NULL, max_token_length = NULL)
vocab_size
The size of the final vocabulary, including all tokens and alphabet. Default: NULL.
min_frequency
The minimum frequency a pair should have in order to be merged. Default: NULL.
show_progress
Whether to show progress bars while training. Default: TRUE.
special_tokens
A list of special tokens the model should be aware of. Default: NULL.
limit_alphabet
The maximum number of different characters to keep in the alphabet. Default: NULL.
initial_alphabet
A list of characters to include in the initial alphabet, even if not seen in the training dataset. Default: NULL.
continuing_subword_prefix
A prefix to be used for every subword that is not a beginning-of-word. Default: NULL.
end_of_word_suffix
A suffix to be used for every subword that is an end-of-word. Default: NULL.
max_token_length
Prevents creating tokens longer than the specified size. Default: NULL.
clone()
The objects of this class are cloneable with this method.
trainer_bpe$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other trainer: tok_trainer, trainer_unigram, trainer_wordpiece
Unigram tokenizer trainer
tok::tok_trainer -> tok_trainer_unigram
new()
Constructor for the Unigram tokenizer
trainer_unigram$new(vocab_size = 8000, show_progress = TRUE, special_tokens = NULL, shrinking_factor = 0.75, unk_token = NULL, max_piece_length = 16, n_sub_iterations = 2)
vocab_size
The size of the final vocabulary, including all tokens and alphabet.
show_progress
Whether to show progress bars while training.
special_tokens
A list of special tokens the model should be aware of.
shrinking_factor
The shrinking factor used at each step of training to prune the vocabulary.
unk_token
The token used for out-of-vocabulary tokens.
max_piece_length
The maximum length of a given token.
n_sub_iterations
The number of iterations of the EM algorithm to perform before pruning the vocabulary.
initial_alphabet
A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.
clone()
The objects of this class are cloneable with this method.
trainer_unigram$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other trainer: tok_trainer, trainer_bpe, trainer_wordpiece
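A Unigram training sketch, under the same assumptions as the BPE example (tokenizer$new() accepting a model, assignable pre_tokenizer field):
library(tok)
tk <- tokenizer$new(model_unigram$new())                 # assumption: accepts a tok_model
tk$pre_tokenizer <- pre_tokenizer_whitespace$new()       # assumption: field is assignable
trainer <- trainer_unigram$new(vocab_size = 100, special_tokens = c("<unk>"), unk_token = "<unk>")
tk$train_from_memory(c("hello world", "hello unigram training"), trainer)
tk$encode("hello world")$ids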
WordPiece tokenizer trainer
tok::tok_trainer -> tok_trainer_wordpiece
new()
Constructor for the WordPiece tokenizer trainer
trainer_wordpiece$new(vocab_size = 30000, min_frequency = 0, show_progress = FALSE, special_tokens = NULL, limit_alphabet = NULL, initial_alphabet = NULL, continuing_subword_prefix = "##", end_of_word_suffix = NULL)
vocab_size
The size of the final vocabulary, including all tokens and alphabet. Default: 30000.
min_frequency
The minimum frequency a pair should have in order to be merged. Default: 0.
show_progress
Whether to show progress bars while training. Default: FALSE.
special_tokens
A list of special tokens the model should be aware of. Default: NULL.
limit_alphabet
The maximum number of different characters to keep in the alphabet. Default: NULL.
initial_alphabet
A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept. Default: NULL.
continuing_subword_prefix
A prefix to be used for every subword that is not a beginning-of-word. Default: "##".
end_of_word_suffix
A suffix to be used for every subword that is an end-of-word. Default: NULL.
clone()
The objects of this class are cloneable with this method.
trainer_wordpiece$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other trainer: tok_trainer, trainer_bpe, trainer_unigram
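A WordPiece training sketch under the same assumptions:
library(tok)
tk <- tokenizer$new(model_wordpiece$new(unk_token = "[UNK]"))   # assumption: accepts a tok_model
tk$pre_tokenizer <- pre_tokenizer_whitespace$new()              # assumption: field is assignable
trainer <- trainer_wordpiece$new(vocab_size = 100, special_tokens = c("[UNK]"))
tk$train_from_memory(c("hello world", "hello wordpiece training"), trainer)
tk$encode("hello world")$ids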