v2.0.0 🤗tokenizers integration and TokSequence

@Natooz Natooz released this 04 Mar 10:07

TL;DR

This major update brings:

  • The integration of the Hugging Face 🤗tokenizers library as the Byte Pair Encoding (BPE) backend. BPE is now 30 to 50 times faster, for both training and encoding! 🙌
  • A new TokSequence object to represent tokens! This object holds tokens as tokens (strings), ids (integers to pass to models), Events and bytes (used internally for BPE).
  • Many internal changes, including renamed methods and variables, that require you to update some of your code (details below).

Changes

  • a9b82e4 Vocabulary class is being replaced by a dictionary. Other (protected) dictionaries are also added for token <--> id <--> byte conversions;
  • a9b82e4 New special_tokens constructor argument for all tokenizers, in place of the previous pad, mask, sos_eos and sep arguments. It is a list of tokens (str) for more versatility. By default, special tokens are ["PAD", "BOS", "EOS", "MASK"];
  • a9b82e4 __getitem__ now handles both ids (int) and tokens (str), with multi-vocab;
  • 36bf0f6 Some methods of MIDITokenizer meant to be used internally are now protected;
  • a2db7b9 New training method with 🤗tokenizers BPE model;
  • 9befb8d TokSequence object, used as in and out object for midi_to_tokens and tokens_to_midi methods, thanks to the _in_as_seq and _out_as_complete_seq decorators;
  • 9befb8d complete_sequence method, which automatically fills in the uninitialized attributes of a TokSequence (ids, tokens);
  • 9befb8d tokens_to_events renamed _ids_to_tokens, and new recursive id / token / byte conversion methods;
  • 9befb8d Tokens are now saved and loaded with the ids key (previously tokens);
  • cddd29c Tokenization files moved to a dedicated tokenizations module;
  • cddd29c decompose_bpe method renamed decode_bpe;
  • d520128 tokenize_dataset now allows applying BPE afterwards.
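
To illustrate the special_tokens change listed above, here is a minimal sketch of how the previous flags could be mapped to the new list. The helper legacy_flags_to_special_tokens is hypothetical and not part of MidiTok. It assumes the old pad, sos_eos, mask and sep arguments were boolean flags, and the "SEP" token name is an assumption, as only "PAD", "BOS", "EOS" and "MASK" are named in these notes.

```python
def legacy_flags_to_special_tokens(pad=True, sos_eos=False, mask=False, sep=False):
    """Build a special_tokens list from v1-style flags (hypothetical helper)."""
    special_tokens = []
    if pad:
        special_tokens.append("PAD")
    if sos_eos:
        # A single flag previously covered both sequence boundary tokens.
        special_tokens += ["BOS", "EOS"]
    if mask:
        special_tokens.append("MASK")
    if sep:
        # Assumed token name: "SEP" is not listed in the release notes.
        special_tokens.append("SEP")
    return special_tokens
```

The resulting list would then be passed as the special_tokens argument when creating a tokenizer.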

Compatibility

Tokens and tokenizers from v1.4.3 and before are compatible: this update does not change anything in the tokenizations themselves.
However, you will need to adapt your files to load them, and update some of your code to account for the changes:

  • Tokens are now saved and loaded with the ids key (previously tokens). To adapt your previously saved tokens, open them with json and rewrite them with the ids key instead;
  • midi_to_tokens (also called with tokenizer(midi)) now outputs a list of TokSequences, each holding tokens as tokens (str) and their ids (int). It previously returned token ids. You can now get them by accessing the .ids attribute, as tokseq.ids;
  • Vocabulary class deleted. You can still access the vocabulary with tokenizer.vocab, but it is now a dictionary. The methods of the Vocabulary class are now directly integrated in MIDITokenizer;
  • For all tokenizers, the pad, mask, sos_eos and sep constructor arguments need to be replaced with the new special_tokens argument;
  • decompose_bpe method renamed decode_bpe.
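
The first point above (rewriting saved tokens with the ids key) can be sketched with the standard json module. This is a minimal sketch, not an official migration script: the helper name migrate_token_file is hypothetical, and it assumes each saved file is a JSON object with a top-level tokens key. Any other keys in the file are left as they are.

```python
import json
from pathlib import Path


def migrate_token_file(path):
    """Rename the top-level "tokens" key to "ids" in one saved JSON file.

    Returns True if the file was rewritten, False if it was left untouched.
    """
    path = Path(path)
    with path.open() as f:
        data = json.load(f)

    # Skip files already migrated or without the assumed layout.
    if not isinstance(data, dict) or "tokens" not in data or "ids" in data:
        return False

    data["ids"] = data.pop("tokens")
    with path.open("w") as f:
        json.dump(data, f)
    return True
```

To convert a whole directory, the function can be called on every file returned by Path(your_tokens_dir).glob("**/*.json"). It is safer to try it on a copy of the data first.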

Bug reports

Big changes can bring hidden bugs. We carefully checked that all methods pass the previous tests, and assessed the robustness of the new methods. If you nevertheless encounter a bug, please report it by opening an issue, and we will do our best to fix it as quickly as possible.