Package: tok 0.2.2.9000
tok: Fast Text Tokenization
Interfaces with the 'Hugging Face' tokenizers library to provide implementations of today's most used tokenizers such as the 'Byte-Pair Encoding' algorithm <https://huggingface.co/docs/tokenizers/index>. It's extremely fast for both training new vocabularies and tokenizing texts.
Authors:
tok_0.2.2.9000.tar.gz
tok_0.2.2.9000.zip (r-4.7), tok_0.2.2.9000.zip (r-4.6), tok_0.2.2.9000.zip (r-4.5)
tok_0.2.2.9000.tgz (r-4.6-x86_64), tok_0.2.2.9000.tgz (r-4.6-arm64), tok_0.2.2.9000.tgz (r-4.5-x86_64), tok_0.2.2.9000.tgz (r-4.5-arm64)
tok_0.2.2.9000.tar.gz (r-4.7-arm64), tok_0.2.2.9000.tar.gz (r-4.7-x86_64), tok_0.2.2.9000.tar.gz (r-4.6-arm64), tok_0.2.2.9000.tar.gz (r-4.6-x86_64)
manual.pdf | manual.html
card.svg | card.png
tok/json (API)
NEWS
    # Install 'tok' in R:
    install.packages('tok', repos = c('https://mlverse.r-universe.dev', 'https://cloud.r-project.org'))
Bug tracker: https://github.com/mlverse/tok/issues
Last updated from: f925ad65e3. Checks: 12 OK, 1 FAIL. Indexed: yes.
| Target | Result | Time | Files | Syslog |
|---|---|---|---|---|
| linux-devel-arm64 | OK | 267 | | |
| linux-devel-x86_64 | OK | 253 | | |
| source / vignettes | OK | 339 | | |
| linux-release-arm64 | OK | 256 | | |
| linux-release-x86_64 | OK | 290 | | |
| macos-release-arm64 | OK | 209 | | |
| macos-release-x86_64 | OK | 411 | | |
| macos-oldrel-arm64 | OK | 176 | | |
| macos-oldrel-x86_64 | OK | 413 | | |
| windows-devel | OK | 365 | | |
| windows-release | OK | 294 | | |
| windows-oldrel | OK | 295 | | |
| wasm-release | FAIL | 226 | | |
Exports: decoder_byte_level, encoding, model_bpe, model_unigram, model_wordpiece, normalizer_nfc, normalizer_nfkc, pre_tokenizer, pre_tokenizer_byte_level, pre_tokenizer_whitespace, processor_byte_level, tok_decoder, tok_model, tok_normalizer, tok_processor, tok_trainer, tokenizer, trainer_bpe, trainer_unigram, trainer_wordpiece
Readme and manuals
Help Manual
| Help page | Topics |
|---|---|
| Byte level decoder | decoder_byte_level |
| Encoding | encoding |
| BPE model | model_bpe |
| An implementation of the Unigram algorithm | model_unigram |
| An implementation of the WordPiece algorithm | model_wordpiece |
| NFC normalizer | normalizer_nfc |
| NFKC normalizer | normalizer_nfkc |
| Generic class for tokenizers | pre_tokenizer |
| Byte level pre tokenizer | pre_tokenizer_byte_level |
| This pre-tokenizer simply splits using the following regex: `\w+\|[^\w\s]+` | pre_tokenizer_whitespace |
| Byte Level post processor | processor_byte_level |
| Generic class for decoders | tok_decoder |
| Generic class for tokenization models | tok_model |
| Generic class for normalizers | tok_normalizer |
| Generic class for processors | tok_processor |
| Generic training class | tok_trainer |
| Tokenizer | tokenizer |
| BPE trainer | trainer_bpe |
| Unigram tokenizer trainer | trainer_unigram |
| WordPiece tokenizer trainer | trainer_wordpiece |
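
The help index above quotes the regex used by `pre_tokenizer_whitespace`: `\w+|[^\w\s]+`. As a rough illustration of what that pattern does, the sketch below applies it with base R only. It does not call tok; `whitespace_split` is a hypothetical helper name, and the result approximates the package's behavior only for plain ASCII input (the underlying library's handling of Unicode word characters may differ).

```r
# Hypothetical helper (not part of tok): split text with the regex quoted
# in the help index entry for pre_tokenizer_whitespace.
whitespace_split <- function(text) {
  # Runs of word characters, or runs of characters that are neither
  # word characters nor whitespace (i.e. punctuation and symbols).
  pattern <- "\\w+|[^\\w\\s]+"
  # perl = TRUE selects PCRE, which supports \w and \s inside a
  # bracketed character class.
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

tokens <- whitespace_split("Hello, world! It's 2024.")
print(tokens)
```

Whitespace is dropped, each punctuation run becomes its own piece, and the apostrophe splits `It's` into three pieces. A tokenization model such as BPE would then operate on pieces like these.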
