Package: tok 0.1.5
tok: Fast Text Tokenization
Interfaces with the 'Hugging Face' tokenizers library to provide implementations of today's most widely used tokenizers, such as the 'Byte-Pair Encoding' algorithm <https://huggingface.co/docs/tokenizers/index>. It is extremely fast at both training new vocabularies and tokenizing texts.
Authors:
tok_0.1.5.tar.gz
tok_0.1.5.zip (r-4.5), tok_0.1.5.zip (r-4.4), tok_0.1.5.zip (r-4.3)
tok_0.1.5.tgz (r-4.4-x86_64), tok_0.1.5.tgz (r-4.4-arm64), tok_0.1.5.tgz (r-4.3-x86_64), tok_0.1.5.tgz (r-4.3-arm64)
tok_0.1.5.tar.gz (r-4.5-noble), tok_0.1.5.tar.gz (r-4.4-noble)
tok.pdf | tok.html
tok/json (API)
NEWS
# Install 'tok' in R:
install.packages('tok', repos = c('https://mlverse.r-universe.dev', 'https://cloud.r-project.org'))
Bug tracker: https://github.com/mlverse/tok/issues
Last updated 30 days ago from: ff883e2dba. Checks: OK: 9. Indexed: yes.
Target | Result | Date |
---|---|---|
Doc / Vignettes | OK | Nov 22 2024 |
R-4.5-win-x86_64 | OK | Nov 22 2024 |
R-4.5-linux-x86_64 | OK | Nov 22 2024 |
R-4.4-win-x86_64 | OK | Nov 22 2024 |
R-4.4-mac-x86_64 | OK | Nov 22 2024 |
R-4.4-mac-aarch64 | OK | Nov 22 2024 |
R-4.3-win-x86_64 | OK | Nov 22 2024 |
R-4.3-mac-x86_64 | OK | Nov 22 2024 |
R-4.3-mac-aarch64 | OK | Nov 22 2024 |
Exports: decoder_byte_level, encoding, model_bpe, model_unigram, model_wordpiece, normalizer_nfc, normalizer_nfkc, pre_tokenizer, pre_tokenizer_byte_level, pre_tokenizer_whitespace, processor_byte_level, tok_decoder, tok_model, tok_normalizer, tok_processor, tok_trainer, tokenizer, trainer_bpe, trainer_unigram, trainer_wordpiece
Readme and manuals
Help Manual
Help page | Topics |
---|---|
tok: Fast Text Tokenization | tok-package tok |
Byte level decoder | decoder_byte_level |
Encoding | encoding |
BPE model | model_bpe |
An implementation of the Unigram algorithm | model_unigram |
An implementation of the WordPiece algorithm | model_wordpiece |
NFC normalizer | normalizer_nfc |
NFKC normalizer | normalizer_nfkc |
Generic class for tokenizers | pre_tokenizer |
Byte level pre tokenizer | pre_tokenizer_byte_level |
This pre-tokenizer simply splits using the following regex: \w+|[^\w\s]+ | pre_tokenizer_whitespace |
Byte Level post processor | processor_byte_level |
Generic class for decoders | tok_decoder |
Generic class for tokenization models | tok_model |
Generic class for normalizers | tok_normalizer |
Generic class for processors | tok_processor |
Generic training class | tok_trainer |
Tokenizer | tokenizer |
BPE trainer | trainer_bpe |
Unigram tokenizer trainer | trainer_unigram |
WordPiece tokenizer trainer | trainer_wordpiece |
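
The help title for pre_tokenizer_whitespace in the table above gives its splitting regex, \w+|[^\w\s]+. The sketch below shows what that pattern matches, using base R only. It does not call 'tok'; split_whitespace_regex is a hypothetical helper name introduced for illustration. The package's own pre-tokenizer runs in the underlying tokenizers library, so its behavior may differ on edge cases such as non-ASCII text.

```r
# Hypothetical helper (not part of 'tok'): applies the regex quoted in the
# pre_tokenizer_whitespace help title with base R's regex functions.
split_whitespace_regex <- function(text) {
  # Runs of word characters, or runs of characters that are neither
  # word characters nor whitespace.
  pattern <- "\\w+|[^\\w\\s]+"
  # perl = TRUE so that \w and \s are understood inside the bracket expression.
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

pieces <- split_whitespace_regex("Hello, world! It's fast.")
print(pieces)
```

For this input the pieces are "Hello", ",", "world", "!", "It", "'", "s", "fast" and ".": whitespace is dropped, and punctuation is separated from the words next to it.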