Package Info

python-tokenizers


Provides an implementation of today's most used tokenizers



Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

  • Train new vocabularies and tokenize using today's most used tokenizers.
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking: it is always possible to recover the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncates, pads, and adds the special tokens your model needs (sketched below).
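
The features above map directly onto the library's Python API. Below is a minimal sketch assuming the upstream tokenizers bindings that this package provides; the training corpus (corpus.txt) and the size limits are placeholder values:

    # A minimal sketch of the features above, using the upstream `tokenizers`
    # Python API. The corpus file name and settings are placeholders.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # Train a new BPE vocabulary from raw text files.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]"])
    tokenizer.train(files=["corpus.txt"], trainer=trainer)

    # Pre-processing: truncate, pad, and use the special tokens the model needs.
    tokenizer.enable_truncation(max_length=512)
    tokenizer.enable_padding(pad_token="[PAD]", pad_id=tokenizer.token_to_id("[PAD]"))

    # Tokenize; offsets track the alignment back to the original sentence.
    encoding = tokenizer.encode("Hello, y'all! How are you?")
    print(encoding.tokens)   # e.g. ['Hello', ',', 'y', "'", 'all', '!', ...]
    print(encoding.offsets)  # (start, end) character spans in the input text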

License: Apache-2.0
URL: https://github.com/huggingface/tokenizers

Categories

Unspecified

Releases

Package Version:     0.21.1-bp160.1.11
Update ID:           GA Release
Released:            2025-03-19
Package Hub Version: 16.0
Platforms:           x86-64
Subpackages:         python313-tokenizers
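
After installing the python313-tokenizers subpackage, a quick sanity check (a sketch, assuming the standard module layout) is to import the bindings and confirm the packaged version:

    # Confirm the installed binding matches the release listed above.
    import tokenizers
    print(tokenizers.__version__)  # expected: 0.21.1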