Package: tok 0.1.5

Daniel Falbel

tok: Fast Text Tokenization

Interfaces with the 'Hugging Face' tokenizers library to provide implementations of today's most used tokenizers such as the 'Byte-Pair Encoding' algorithm <https://huggingface.co/docs/tokenizers/index>. It's extremely fast for both training new vocabularies and tokenizing texts.

Authors: Daniel Falbel [aut, cre], Regouby Christophe [ctb], Posit [cph]

tok_0.1.5.tar.gz
tok_0.1.5.zip (r-4.5), tok_0.1.5.zip (r-4.4), tok_0.1.5.zip (r-4.3)
tok_0.1.5.tgz (r-4.4-x86_64), tok_0.1.5.tgz (r-4.4-arm64), tok_0.1.5.tgz (r-4.3-x86_64), tok_0.1.5.tgz (r-4.3-arm64)
tok_0.1.5.tar.gz (r-4.5-noble), tok_0.1.5.tar.gz (r-4.4-noble)
tok.pdf | tok.html
tok/json (API)
NEWS

# Install 'tok' in R:

install.packages('tok', repos = c('https://mlverse.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/mlverse/tok/issues

On CRAN:

6.10 score, 42 stars, 1 package, 25 scripts, 57 downloads, 20 exports, 2 dependencies

Last updated 30 days ago from: ff883e2dba. Checks: OK: 9. Indexed: yes.

Target	Result	Date
Doc / Vignettes	OK	Nov 22 2024
R-4.5-win-x86_64	OK	Nov 22 2024
R-4.5-linux-x86_64	OK	Nov 22 2024
R-4.4-win-x86_64	OK	Nov 22 2024
R-4.4-mac-x86_64	OK	Nov 22 2024
R-4.4-mac-aarch64	OK	Nov 22 2024
R-4.3-win-x86_64	OK	Nov 22 2024
R-4.3-mac-x86_64	OK	Nov 22 2024
R-4.3-mac-aarch64	OK	Nov 22 2024

Exports: decoder_byte_level encoding model_bpe model_unigram model_wordpiece normalizer_nfc normalizer_nfkc pre_tokenizer pre_tokenizer_byte_level pre_tokenizer_whitespace processor_byte_level tok_decoder tok_model tok_normalizer tok_processor tok_trainer tokenizer trainer_bpe trainer_unigram trainer_wordpiece

Dependencies: cli R6

Readme and manuals

Help Manual

Help page	Topics
tok: Fast Text Tokenization	tok-package tok
Byte level decoder	decoder_byte_level
Encoding	encoding
BPE model	model_bpe
An implementation of the Unigram algorithm	model_unigram
An implementation of the WordPiece algorithm	model_wordpiece
NFC normalizer	normalizer_nfc
NFKC normalizer	normalizer_nfkc
Generic class for tokenizers	pre_tokenizer
Byte level pre tokenizer	pre_tokenizer_byte_level
This pre-tokenizer simply splits using the following regex: \w+|[^\w\s]+	pre_tokenizer_whitespace
Byte Level post processor	processor_byte_level
Generic class for decoders	tok_decoder
Generic class for tokenization models	tok_model
Generic class for normalizers	tok_normalizer
Generic class for processors	tok_processor
Generic training class	tok_trainer
Tokenizer	tokenizer
BPE trainer	trainer_bpe
Unigram tokenizer trainer	trainer_unigram
WordPiece tokenizer trainer	trainer_wordpiece
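
The regex given for pre_tokenizer_whitespace in the table above (\w+|[^\w\s]+, with the pipe unescaped) can be tried on its own to see what kind of split it describes. The sketch below uses base R only and does not call tok. The helper name split_whitespace is hypothetical, and the sketch assumes that R's PCRE engine treats \w and \s closely enough to the underlying library's regex engine for plain ASCII input. Results for non-ASCII text may differ.

```r
# Base R only; this does not call the tok package.
# Pattern copied from the pre_tokenizer_whitespace help title:
# runs of word characters, or runs of characters that are
# neither word characters nor whitespace.
whitespace_pattern <- "\\w+|[^\\w\\s]+"

# Hypothetical helper (not part of tok): return every piece of
# `text` that the pattern matches, in order of appearance.
split_whitespace <- function(text) {
  regmatches(text, gregexpr(whitespace_pattern, text, perl = TRUE))[[1]]
}

pieces <- split_whitespace("Hello, world! It's 2024.")
print(pieces)
#> [1] "Hello" ","     "world" "!"     "It"    "'"     "s"     "2024"  "."
```

Whitespace is dropped, punctuation is separated from adjacent words, and consecutive punctuation characters stay together as one piece.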