
Releases: Natooz/MidiTok

v2.0.0 🤗tokenizers integration and TokSequence

04 Mar 10:07

TL;DR

This major update brings:

  • The integration of the Hugging Face 🤗tokenizers library as the Byte Pair Encoding (BPE) backend. BPE is now 30 to 50 times faster, for both training and encoding! 🙌
  • A new TokSequence object to represent tokens! This object holds tokens as tokens (strings), ids (integers to pass to models), Events, and bytes (used internally for BPE).
  • Many internal changes, with methods and variables renamed, that will require you to update some of your code (details below).

Changes

  • a9b82e4 The Vocabulary class is replaced by a dictionary. Other (protected) dictionaries are also added for token <--> id <--> byte conversions;
  • a9b82e4 New special_tokens constructor argument for all tokenizers, in place of the previous pad, mask, sos_eos and sep arguments. It is a list of tokens (str) for more versatility. By default, special tokens are ["PAD", "BOS", "EOS", "MASK"] (see the sketch after this list);
  • a9b82e4 __getitem__ now handles both ids (int) and tokens (str), with multi-vocab;
  • 36bf0f6 Some methods of MIDITokenizer meant to be used internally are now protected;
  • a2db7b9 New training method with 🤗tokenizers BPE model;
  • 9befb8d TokSequence object, used as the input and output object of the midi_to_tokens and tokens_to_midi methods, thanks to the _in_as_seq and _out_as_complete_seq decorators;
  • 9befb8d complete_sequence method, which automatically fills in the uninitialized attributes of a TokSequence (ids, tokens);
  • 9befb8d tokens_to_events renamed _ids_to_tokens, and new recursive id / token / byte conversion methods;
  • 9befb8d Tokens are now saved and loaded with the ids key (previously tokens);
  • cddd29c Tokenization files moved to a dedicated tokenizations module;
  • cddd29c decompose_bpe method renamed decode_bpe;
  • d520128 tokenize_dataset can now apply BPE afterwards.
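
As a quick illustration, here is a minimal sketch of the new interface (other constructor parameters are left at their defaults; the file path and the "PAD_None" token string are placeholders):

```python
from miditok import REMI
from miditoolkit import MidiFile

# The new special_tokens argument replaces the previous
# pad / mask / sos_eos / sep constructor arguments.
tokenizer = REMI(special_tokens=["PAD", "BOS", "EOS", "MASK"])

# __getitem__ now handles both ids (int) and tokens (str);
# the exact token string below is illustrative.
print(tokenizer[0])
print(tokenizer["PAD_None"])

# Tokenizing a MIDI now returns TokSequence objects (one per track here).
midi = MidiFile("path/to/file.mid")  # placeholder path
seq = tokenizer(midi)[0]
print(seq.ids)     # integer ids, to feed to a model
print(seq.tokens)  # the same tokens as strings
```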

Compatibility

Tokens and tokenizers from v1.4.3 and before are compatible: this update does not change the tokenizations themselves.
However, you will need to adapt your saved files to load them, and update some of your code to account for the changes:

  • Tokens are now saved and loaded with the ids key (previously tokens). To adapt your previously saved tokens, open them with json and rewrite them with the ids key instead (see the sketch after this list);
  • midi_to_tokens (also called with tokenizer(midi)) now outputs a list of TokSequences, each holding tokens as tokens (str) and their ids (int). It previously returned token ids. You can now get them by accessing the .ids attribute, as tokseq.ids;
  • The Vocabulary class is deleted. You can still access the vocabulary with tokenizer.vocab, but it is now a dictionary. The methods of the Vocabulary class are now directly integrated into MIDITokenizer;
  • For all tokenizers, the pad, mask, sos_eos and sep constructor arguments need to be replaced with the new special_tokens argument;
  • decompose_bpe method renamed decode_bpe.
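
For the first point, here is a minimal sketch of how previously saved token files could be adapted (the dataset_tokens directory and the file layout are assumptions; adapt them to your own files):

```python
import json
from pathlib import Path

# Rename the "tokens" key to "ids" in previously saved token files,
# so that they can be loaded with v2.0.0.
for file_path in Path("dataset_tokens").glob("**/*.json"):
    with open(file_path) as json_file:
        data = json.load(json_file)
    if "tokens" in data:
        data["ids"] = data.pop("tokens")
    with open(file_path, "w") as json_file:
        json.dump(data, json_file)
```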

Bug reports

As with all big changes, hidden bugs can slip through. We carefully tested that all methods pass the previous tests, while assessing the robustness of the new methods. Despite these efforts, if you encounter any bugs, please report them by opening an issue, and we will do our best to solve them as quickly as possible.

v1.4.3 BPE fix & Documentation

22 Feb 10:19
5d261fd

Changes

  • 77f7c53 @dinhviettoanle (#24) Fixed a bug that skipped token repetitions with BPE
  • New documentation: miditok.readthedocs.io. We finally have a proper documentation website! 🙌 With it come many improvements and fixes in the docstrings.
  • 201c9b7 Legacy REMIEncoding, MIDILikeEncoding and CPWordEncoding classes removed.
  • e92a414 token_types_errors of the MIDITokenizer class now handles basic / common error cases
  • Minor code improvements
  • 1486204 Use of dataclasses. This means that Python 3.6 (and earlier) is no longer compatible. Python 3.6 was compatible, but not officially supported (tested), up to v1.4.2.

v1.4.2 SEP token & data augmentation offset combinations argument

26 Jan 14:47

Changes

  • f6225a1 Added the option to have a SEP special token, which can be used to train models on tasks such as next sequence prediction
  • bb24512 Data augmentation can now receive the all_offset_combinations argument, which will perform augmentation with all the combinations of offsets. With the offsets $\left( x_1 , x_2 , x_3 \right)$, it will perform a total of $\prod_i x_i$ combinations ( $\prod_i (x_i \times 2)$ if going up and down). This is disabled by default to save you from hundreds of augmentations 🤓 (and it is not chained with tokenize_midi_dataset): by default, augmentations are done on the original input only. See the sketch after this list.
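
To make the count concrete, here is a small sketch of the formula above, with purely illustrative offset values:

```python
from math import prod

offsets = (2, 3, 2)  # hypothetical (pitch, velocity, duration) offsets

one_direction = prod(offsets)               # 2 * 3 * 2 = 12 combinations
up_and_down = prod(x * 2 for x in offsets)  # 4 * 6 * 4 = 96 combinations

print(one_direction, up_and_down)
```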

v1.4.1 Bugfix tokenize_midi_dataset

20 Jan 10:49

Changes

  • 0e9131d Bugfix in the tokenize_midi_dataset method when directly performing data augmentation, which was not indented as it should have been

v1.4.0 Data augmentation and optimization

13 Jan 16:46

This pretty big update brings data augmentation, some bug fixes and optimizations allowing you to write more elegant code.

Changes

  • 8f201e0 308fb27 Data augmentation methods! 🙌 They can be applied to both MIDIs and tokens, to augment data by shifting the pitch, velocity and duration values.
  • 1d8e903 You can perform data augmentation while tokenizing a dataset (tokenize_midi_dataset method) with the data_augment_offsets argument. This will be done at the token level, as it is faster than augmenting MIDI objects.
  • 0634ade BPE is now implemented in the main tokenizer class! This means all tokenizers can benefit from it in a much prettier way!
  • 0634ade bpe method renamed to learn_bpe; it now returns metrics (also shown in the progress bar during learning) on the number of token combinations and the sequence length reduction
  • 7b8c977 Backward compatibility when loading tokenizer config files with BPE from older versions
  • 3cea9aa @nturusin Example notebook of GPT2 Hugging Face music transformer: fixes in training
  • 65afa6b The tokens_to_midi and save_tokens methods can now receive tokens as Tensors and numpy arrays. PyTorch, TensorFlow and Jax (numpy) tensors are supported. The convert_tokens_tensors_to_list decorator will convert them to lists; you can use it on your own methods.
  • aab64aa The __call__ magic method now automatically routes to midi_to_tokens or tokens_to_midi depending on what you give it. You can now use tokenizers more elegantly, as tokenizer(midi_obj) or tokenizer(generated_tokens) (see the sketch after this list).
  • e90b20a Bugfix in Structured, which could cause an infinite while loop with illegal token type successions
  • 947af8c Big refactor of MuMIDI, which now has fixed vocab / type indices. It is easier to handle and use (thanks @gonzaloarca).
  • 947af8c CPWord "Ignore" tokens are all renamed Ignore_None by convention, making operations easier in data augmentation and other methods.
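
As an illustration of the new __call__ routing, here is a minimal sketch (the file paths are placeholders):

```python
from miditok import REMI
from miditoolkit import MidiFile

tokenizer = REMI()
midi = MidiFile("path/to/file.mid")

# Given a MIDI object, __call__ routes to midi_to_tokens...
tokens = tokenizer(midi)

# ...and given tokens (lists, or PyTorch / TensorFlow / Jax tensors, converted
# to lists by the convert_tokens_tensors_to_list decorator), it routes to
# tokens_to_midi.
regenerated = tokenizer(tokens)
regenerated.dump("path/to/regenerated.mid")
```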

Compatibility

  • code using BPE will have to be updated: remove bpe(tokenizer) and just declare tokenizers normally, and rename calls to the bpe method to learn_bpe (see the sketch below)
  • MuMIDI tokens and tokenizers will be incompatible with v1.4.0
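
Here is a rough before/after sketch of the BPE change (the learn_bpe arguments are omitted on purpose; check the docstring for the exact signature):

```python
from miditok import REMI

# Before v1.4.0 (removed): the tokenizer had to be wrapped, e.g. bpe(tokenizer).
# From v1.4.0: declare the tokenizer normally, BPE lives in the base class.
tokenizer = REMI()

# The learning method is renamed bpe -> learn_bpe, and now returns metrics
# on the token combinations and the sequence length reduction:
# metrics = tokenizer.learn_bpe(...)
```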

v1.3.3 Minor bugfixes

19 Dec 18:20

Changes

  • 4f4e49e Magic method len bugfix with multi-vocab tokenizers; len is now also a property
  • 925c7ae & 5b4f410 Bugfix of token type initialization when loading a tokenizer from a params file
  • c873456 Removed hyphens from token type names, for better visibility. By convention, token types are all written in CamelCase.
  • 5e51e84 New multi_voc property
  • b3b0cc7 tokenize_dataset: the progress bar now shows the saving directory name

Compatibility

  • All good 🙌

v1.3.2 Bugfix

23 Nov 16:16

Changes

  • @Fansesi - f92f4aa Corrects a bug when using tokenize_dataset with out_dir as a non-Path object (issue #18)
  • 2724062 Bugfix when using files_lim with bpe

Compatibility

  • All good 🙌

v1.3.1 unique_track parameter & minor fixes / changes

09 Nov 13:01

Highlights

This version uniformly cleans up how save_params is called, and brings related minor fixes and new features.

Changes

  • 3c4adf8 Tokenizers now take a unique_track argument at creation. This parameter specifies whether the tokenizer represents and handles music as a single track, or stream of tokens. This is the case for Octuple and MuMIDI, and probably most representations that natively support multitrack music. If True, the tokens will be saved in json files as a single track. This parameter can then help when loading tokenized datasets (see the sketch after this list).
  • 3c4adf8 save_params method: out_dir argument renamed to out_path
  • 3c4adf8 save_params method: out_path can now specify the full path and name of the config file saved
  • 3c4adf8 fixes in save_params method for MuMIDI
  • 3c4adf8 The current version number is fixed (was 1.2.9 instead of 1.3.0 for v1.3.0)
  • 4be897b bpe method (learning BPE vocabulary) now has a print_seq_len_variation argument, to optionally print the mean sequence length before and after BPE, and the variation in % (default: True)
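
Here is a small sketch of the renamed save_params argument (the path is a placeholder; unique_track is mainly relevant for custom tokenizers inheriting MIDITokenizer):

```python
from miditok import REMI

# Custom tokenizers inheriting MIDITokenizer can pass unique_track=True to
# super().__init__() if they represent music as a single stream of tokens
# (as Octuple and MuMIDI do natively).
tokenizer = REMI()

# out_dir has been renamed out_path, and can now include the file name itself.
tokenizer.save_params(out_path="config/remi_params.json")
```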

Compatibility

  • You might need to update your code when:
    • creating a tokenizer, to handle the new unique_track argument.
    • saving a tokenizer's config, to handle the out_dir argument renamed to out_path
  • For datasets tokenized with BPE, you will need to change the token_to_event key to vocab in the associated tokenizer configuration file

v1.3.0 Special tokens update 🛠

03 Nov 16:02

Highlight

Version 1.3.0 changes the way the vocabulary, and by extension tokenizers, handle special tokens: PAD, SOS, EOS and MASK. It brings a cleaner way to instantiate these classes.
It might bring incompatibilities with data and models used with previous MidiTok versions.

Changes

  • b9218bf The Vocabulary class now takes a pad argument specifying whether to include a special padding token. This option is True by default, as it is common to train networks with batches of unequal sequence lengths.
  • b9218bf Vocabulary class: the event_to_token argument of the constructor is renamed events and has to be given as a list of events.
  • b9218bf Vocabulary class: when adding a token to the vocabulary, the index is automatically set. The index argument is removed as it could cause issues / confusion when mapping indexes with models.
  • b9218bf The Event class now takes the value argument in second position
  • b9218bf Fix when learning BPE if files_lim was higher than the number of files itself
  • f9cb109 For all tokenizers, a new constructor argument pad specifies whether to use a padding token, and the sos_eos_tokens argument is renamed to sos_eos (see the sketch after this list)
  • f9cb109 When creating a Vocabulary, the SOS and EOS tokens are now registered before the MASK token. This change was made so that the order matches the order of the special token arguments in tokenizer constructors, and because the SOS and EOS tokens are more commonly used in symbolic music applications.
  • 84db19d The dummy StructuredEncoding, MuMIDIEncoding, OctupleEncoding and OctupleMonoEncoding classes removed from __init__.py. These classes from early versions had no record of being used. Other dummy classes (REMI, MIDILike and CPWord) remain.
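
Here is a minimal sketch of the renamed constructor arguments (other tokenizer parameters are left at their defaults; the commented Vocabulary line is purely illustrative):

```python
from miditok import MIDILike

# pad adds a padding token and is True by default;
# the former sos_eos_tokens argument is now named sos_eos.
tokenizer = MIDILike(pad=True, sos_eos=True)

# The Vocabulary class now takes its tokens as a list of events and a pad flag, e.g.:
# vocab = Vocabulary(events=[...], pad=True)
```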

Compatibility

  • You might need to update your code when creating your tokenizer to handle the new pad argument.
  • Data tokenized with REMI, and models trained with, will be incompatible with v1.3.0 if you used special tokens. The BAR token was previously at index 1, and is now added after special tokens.
  • If you created a custom tokenizer inheriting MIDITokenizer, make sure to update the call to super().__init__ with the new pad arg and the renamed sos_eos arg (example for MIDILike: f9cb109)
  • If you used both SOS/EOS and MASK special tokens, their order (indexes) is now swapped, as SOS/EOS are now registered before MASK. As these tokens are not used during tokenization, your previously tokenized datasets remain compatible, unless you intentionally inserted SOS/EOS/MASK tokens. Trained models will however be incompatible, as the indices are swapped. If you want to use v1.3.0 with a previously trained model, you can manually invert the predictions of these tokens.
  • No incompatibilities outside of these cases

Please reach out if you have any issue / question! 🙌

v1.2.9 BPE speed boost & small improvements

06 Oct 15:06

Changes

  • 212a943 BPE: Speed boost in apply_bpe method, about 1.5 times faster 🚀
  • 4b8ccb9 BPE: tokens_to_events method is no longer in place
  • be3e244 save_tokens method now takes **kwargs arguments to save additional information in json files (see the sketch after this list)
  • b690cab Fix when computing the max_tick attribute of a MIDI that has tracks with no notes
  • f1855b6 The MidiTok package version is now saved with the tokenizer parameters. This allows you to keep track of the version used.
  • Lint and coverage improvements ✨
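
Here is a small sketch of save_tokens with extra keyword arguments (the paths and the extra key names are arbitrary examples):

```python
from miditok import REMI
from miditoolkit import MidiFile

tokenizer = REMI()
midi = MidiFile("path/to/file.mid")
tokens = tokenizer.midi_to_tokens(midi)

# Extra keyword arguments are stored as additional entries in the json file.
tokenizer.save_tokens(tokens, "path/to/file_tokens.json",
                      original_midi="file.mid", any_other_key=42)
```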

Compatibility

  • If you explicitly used tokens_to_events, you might need to adapt your code, as it is no longer in place.