
v1.3.0 Special tokens update 🛠

@Natooz Natooz released this 03 Nov 16:02

Highlight

Version 1.3.0 changes the way the vocabulary, and by extension the tokenizers, handle the special tokens PAD, SOS, EOS and MASK, and brings a cleaner way to instantiate these classes.
It may introduce incompatibilities with data and models used with previous MidiTok versions.

Changes

  • b9218bf The Vocabulary class now takes a pad argument specifying whether to include the special padding token. It defaults to True, as networks are commonly trained with batches of unequal sequence lengths.
  • b9218bf Vocabulary class: the event_to_token argument of the constructor is renamed to events and must be given as a list of events.
  • b9218bf Vocabulary class: when adding a token to the vocabulary, its index is now set automatically. The index argument is removed, as it could cause issues or confusion when mapping indices to models.
  • b9218bf The Event class now takes the value argument as its second positional argument.
  • b9218bf Fixed a bug when learning BPE where files_lim was higher than the actual number of files.
  • f9cb109 For all tokenizers, a new constructor argument pad specifies whether to use the padding token, and the sos_eos_tokens argument is renamed to sos_eos.
  • f9cb109 When creating a Vocabulary, the SOS and EOS tokens are now registered before the MASK token, so that the order matches that of the special-token arguments in tokenizer constructors, and because SOS and EOS are more commonly used in symbolic music applications.
  • 84db19d The dummy StructuredEncoding, MuMIDIEncoding, OctupleEncoding and OctupleMonoEncoding classes have been removed from __init__.py. These classes from early versions had no record of being used. The other dummy classes (REMI, MIDILike and CPWord) remain.
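The vocabulary changes above can be sketched with a minimal, self-contained toy class. This is not MidiTok's actual implementation, only an illustration of the described semantics: events is a list, pad defaults to True, indices are assigned automatically, and SOS/EOS are registered before MASK.

```python
class Vocabulary:
    """Toy sketch of the v1.3.0 Vocabulary behavior (not MidiTok's real class)."""

    def __init__(self, events=None, pad=True, sos_eos=False, mask=False):
        self._event_to_token = {}
        self._token_to_event = {}
        # Special tokens are registered first; SOS/EOS now come before MASK
        if pad:
            self._add("PAD_None")
        if sos_eos:
            self._add("SOS_None")
            self._add("EOS_None")
        if mask:
            self._add("MASK_None")
        for event in (events or []):
            self._add(str(event))

    def _add(self, event):
        # The index is assigned automatically; no manual `index` argument
        index = len(self._token_to_event)
        self._token_to_event[index] = event
        self._event_to_token[event] = index

    def event_to_token(self, event):
        return self._event_to_token[event]


vocab = Vocabulary(events=["Bar_None", "Pitch_60"], pad=True, sos_eos=True, mask=True)
# The Bar token lands after the four special tokens (PAD, SOS, EOS, MASK)
print(vocab.event_to_token("Bar_None"))
```

Under these assumptions, PAD gets index 0, SOS and EOS indices 1 and 2, MASK index 3, and regular events follow.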

Compatibility

  • You might need to update your code when creating your tokenizer to handle the new pad argument.
  • Data tokenized with REMI, and models trained on it, will be incompatible with v1.3.0 if you used special tokens: the BAR token was previously at index 1, and is now added after the special tokens.
  • If you created a custom tokenizer inheriting from MIDITokenizer, make sure to update its calls to super().__init__ with the new pad argument and the renamed sos_eos argument (example for MIDILike: f9cb109).
  • If you used both SOS/EOS and MASK special tokens, their order (indices) is now swapped, as SOS/EOS are registered before MASK. As these tokens are not used during tokenization, your previously tokenized datasets remain compatible, unless you intentionally inserted SOS/EOS/MASK tokens. Trained models will however be incompatible, as the indices are swapped. To use v1.3.0 with a previously trained model, you can manually remap the model's predictions for these tokens.
  • No incompatibilities outside of these cases
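The SOS/EOS and MASK swap can be handled with a small remapping table. The helper below is hypothetical (not part of MidiTok), and it assumes all four special tokens are enabled with PAD first, giving a pre-v1.3.0 order of PAD, MASK, SOS, EOS and a v1.3.0 order of PAD, SOS, EOS, MASK; adjust the table if your configuration differs.

```python
# Hypothetical index translation from a model trained before v1.3.0.
# Assumed old order: PAD=0, MASK=1, SOS=2, EOS=3
# Assumed new order: PAD=0, SOS=1, EOS=2, MASK=3
OLD_TO_NEW = {0: 0, 1: 3, 2: 1, 3: 2}


def remap_predictions(token_ids):
    """Translate pre-v1.3.0 special-token indices to their v1.3.0 values.

    Regular tokens (indices >= 4 here) are left unchanged.
    """
    return [OLD_TO_NEW.get(t, t) for t in token_ids]


# An old model's output starting with SOS (2) and ending with EOS (3)
print(remap_predictions([2, 5, 9, 3]))  # -> [1, 5, 9, 2]
```

The same table, inverted, would let you feed v1.3.0-tokenized data to an old model.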

Please reach out if you have any issues or questions! 🙌