
Releases: Natooz/MidiTok

v3.0.4 PerTok tokenizer and Attribute Controls

15 Sep 10:42
7ea77d4

This release introduces the PerTok tokenizer by Lemonaide AI, attribute control tokens, and minor fixes.

Highlights

PerTok: Performance Tokenizer

(associated paper to be released)

Developed by Julian Lenz (@JLenzy) at Lemonaide AI to capture expressive timing in symbolic scores while maintaining competitively low sequence lengths. It achieves this by dividing time differences into macro and micro categories: coarse shifts are represented with the usual TimeShift tokens, while subtle deviations from the quantized beat are represented with a new MicroTime token type.
Furthermore, PerTok lets you encode an arbitrary number of note subdivisions by allowing multiple, overlapping values within the beat_res parameter of the TokenizerConfig; see the sketch below.
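
A minimal sketch of what this could look like; the beat_res values are illustrative, and the microtiming-related parameter names are assumptions inferred from this description rather than confirmed API:

```python
from pathlib import Path

from miditok import PerTok, TokenizerConfig

# Overlapping (start_beat, end_beat) ranges let several subdivisions
# coexist over the same span, e.g. sixteenths (4 per beat) alongside
# triplets (3 per beat) below.
config = TokenizerConfig(
    beat_res={(0, 4): 4, (0, 8): 3},  # illustrative overlapping values
    use_microtiming=True,             # assumed flag enabling MicroTime tokens
    num_microtiming_bins=30,          # assumed number of MicroTime values
)
tokenizer = PerTok(config)
tokens = tokenizer(Path("path/to/file.mid"))  # tokenizers are callable on files
```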

The micro timing tokens will be extended to all tokenizers in a future update.

Attribute Control tokens

Attribute controls are additional tokens used during training so that models can be controlled at inference time, by prompting the model to predict music with specific features.
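
As a purely conceptual illustration (the token names below are made up for this sketch, not MidiTok's actual vocabulary): control tokens describing a feature, such as a track's note density, are inserted into the training sequences, so that at inference they can be placed in the prompt to steer generation.

```python
# Hypothetical token names, for illustration only.
track_tokens = ["Program_0", "Pitch_60", "Velocity_95", "Duration_1.0.8"]
control_tokens = ["ACTrackNoteDensity_8", "ACTrackPolyphony_2"]  # hypothetical

# During training the controls are seen alongside the music they describe;
# at inference, feeding them first prompts the model for those features.
conditioned_sequence = control_tokens + track_tokens
print(conditioned_sequence)
```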

What's Changed

  • updates to Example_HuggingFace_Mistral_Transformer.ipynb by @briane412 in #164
  • _model_name is now a protected property by @Natooz in #165
  • Fixing docs for tokenizer training by @Natooz in #167
  • Default continuing_subword_prefix when splitting token sequences by @Natooz in #168
  • small bug fix in MIDI pretokenization by @shenranwang in #170
  • adding no_preprocess_score argument when tokenizing by @Natooz in #172
  • TokSequence summable, concatenate_track_sequences arg for MMM by @Natooz in #173
  • Docs update by @Natooz in #175
  • Fixing split methods for empty files (no tracks and/or no notes) by @Natooz in #177
  • Logo now with white outer stroke by @Natooz in #180
  • Attribute controls feature by @helloWorld199 in #181
  • better distinction between one_token_stream and config.one_token_stream_for_programs by @Natooz in #182
  • making sure MMM token sequences are not concatenated when splitting them per bar/beat in tokenizer_training_iterator.py by @Natooz in #183
  • rST Documentation fixes by @scottclowe in #184
  • Bump actions/stale from 5.1.1 to 9.0.0 by @dependabot in #185
  • Bump actions/download-artifact from 3 to 4 by @dependabot in #186
  • Bump codecov/codecov-action from 3.1.0 to 4.5.0 by @dependabot in #187
  • Bump actions/upload-artifact from 3 to 4 by @dependabot in #188
  • Fixing bugs caused by changes from symusic v0.5.0 by @Natooz in #192
  • use_velocities and use_duration configuration parameters by @Natooz in #193
  • collator now handles decoder input ids (seq2seq models) by @Natooz in #194
  • PerTok Tokenizer by @JLenzy in #191


Full Changelog: v3.0.3...v3.0.4

v3.0.3 Training with WordPiece and Unigram + abc files support

25 Apr 12:50
365a5b6

Highlights

  • Support for abc files, which can be loaded and dumped with symusic similarly to MIDI files;
  • The tokenizers can now also be trained with the WordPiece and Unigram algorithms!
  • Tokenizer training and token ids encoding can now be performed "bar-wise" or "beat-wise", meaning the tokenizer can learn new tokens from successions of base tokens strictly within bars or beats. This is set by the encode_ids_split attribute of the tokenizer config (see the sketch after this list);
  • symusic v0.4.3 or higher is now required to comply with the usage of the clip method;
  • Better handling of file loading errors in DatasetMIDI and DataCollator;
  • Introducing a new filter_dataset method to clean a dataset of MIDI/abc files before using it;
  • MMM tokenizer has been cleaned up, and is now fully modular: it now works on top of other tokenizations (REMI, TSD and MIDILike) to allow more flexibility and interoperability;
  • TokSequence objects can now be sliced and concatenated (e.g. seq3 = seq1[:50] + seq2[50:]);
  • TokSequence objects tokenized from a tokenizer can now be split into per-bar or per-beat subsequences;
  • minor fixes, code improvements and cleaning.
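
A small sketch of training with one of the new algorithms and bar-wise ids encoding; the argument names follow our recollection of the v3.0.3 docs and should be double-checked:

```python
from pathlib import Path

from miditok import REMI, TokenizerConfig

# "bar" restricts learned tokens to successions of base tokens within bars;
# the accepted string values here are an assumption.
config = TokenizerConfig(encode_ids_split="bar")
tokenizer = REMI(config)

files_paths = list(Path("dataset").glob("**/*.mid"))
tokenizer.train(vocab_size=30000, model="Unigram", files_paths=files_paths)

tokens = tokenizer(files_paths[0])  # ids are now encoded with the trained model
```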

Methods renaming

A few methods and properties were previously named after "bpe" and "midi". To align with the more general usage of these methods (support for several file formats and training algorithms), they have been renamed with more idiomatic and accurate names.

Methods renamed with a deprecation warning:
  • midi_to_tokens --> encode;
  • tokens_to_midi --> decode;
  • learn_bpe --> train;
  • apply_bpe --> encode_token_ids;
  • decode_bpe --> decode_token_ids;
  • ids_bpe_encoded --> are_ids_encoded;
  • vocab_bpe --> vocab_model;
  • tokenize_midi_dataset --> tokenize_dataset.
Methods renamed without deprecation warning (fewer usages, reducing code clutter):
  • MIDITokenizer --> MusicTokenizer;
  • augment_midi --> augment_score;
  • augment_midi_dataset --> augment_dataset;
  • augment_midi_multiple_offsets --> augment_score_multiple_offsets;
  • split_midis_for_training --> split_files_for_training;
  • split_midi_per_note_density --> split_score_per_note_density;
  • get_midi_programs --> get_score_programs;
  • merge_midis --> merge_scores;
  • get_midi_ticks_per_beat --> get_score_ticks_per_beat;
  • split_midi_per_ticks --> split_score_per_ticks;
  • split_midi_per_beats --> split_score_per_beats;
  • split_midi_per_tracks --> split_score_per_tracks;
  • concat_midis --> concat_scores.
Protected internal methods (no deprecation warning, advanced usage):
  • MIDITokenizer._tokens_to_midi --> MusicTokenizer._tokens_to_score;
  • MIDITokenizer._midi_to_tokens --> MusicTokenizer._score_to_tokens;
  • MIDITokenizer._create_midi_events --> MusicTokenizer._create_global_events.

There are no other compatibility issues besides these renamings.
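
In practice the migration is mostly mechanical, for example:

```python
from pathlib import Path

from miditok import REMI

tokenizer = REMI()

# Before v3.0.3:
# tokens = tokenizer.midi_to_tokens(Path("path/to/file.mid"))
# midi = tokenizer.tokens_to_midi(tokens)

# From v3.0.3 (the old names still work but emit a deprecation warning):
tokens = tokenizer.encode(Path("path/to/file.mid"))
score = tokenizer.decode(tokens)  # now returns a symusic.Score
```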

Full Changelog: v3.0.2...v3.0.3

v3.0.2 New data loading and preprocessing methods

24 Mar 14:38

TL;DR

This version introduces a new DatasetMIDI class to use when training PyTorch models. It builds on the formerly named DatasetTok class, adding a pre-tokenizing option and better handling of BOS and EOS tokens.
A new miditok.pytorch_data.split_midis_for_training method dynamically chunks MIDIs into smaller parts that yield approximately the desired token sequence length, based on the note densities of their bars. These chunks can be used to train a model while maximizing the overall amount of data used.
A few new utility methods have been created for these features, e.g. to split, concatenate or merge symusic.Score objects; see the sketch below.
Thanks @Kinyugo for the discussions and tests that guided the development of the features! (#147)
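
A minimal sketch of how these pieces can fit together; the paths and parameter values are illustrative, and the argument names should be checked against the docs:

```python
from pathlib import Path

from torch.utils.data import DataLoader
from miditok import REMI
from miditok.pytorch_data import DatasetMIDI, DataCollator, split_midis_for_training

tokenizer = REMI()
midi_paths = list(Path("dataset").glob("**/*.mid"))

# Chunk the files so that each part yields roughly max_seq_len tokens,
# based on the note densities of their bars.
chunk_paths = split_midis_for_training(
    files_paths=midi_paths,
    tokenizer=tokenizer,
    save_dir=Path("dataset_chunks"),
    max_seq_len=1024,
)

dataset = DatasetMIDI(
    chunk_paths,
    tokenizer,
    max_seq_len=1024,
    bos_token_id=tokenizer["BOS_None"],
    eos_token_id=tokenizer["EOS_None"],
)
collator = DataCollator(pad_token_id=tokenizer.pad_token_id)
dataloader = DataLoader(dataset, batch_size=16, collate_fn=collator)
```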

The update also brings a few minor fixes, and the docs have a new theme!

What's Changed

  • Fix token_paths to files_paths, and config to model_config by @sunsetsobserver in #145
  • Fix issues in Octuple with multiple different-beat time signatures by @ilya16 in #146
  • Pitch interval decoding: discarding notes outside the tokenizer pitch range by @Natooz in #149
  • Fixing save_pretrained to comply with huggingface_hub v0.21 by @Natooz in #150
  • ability to overwrite _create_durations_tuples in init by @JLenzy in #153
  • Refactor of PyTorch data loading classes and methods by @Natooz and @Kinyugo in #148
  • The docs have a new theme: furo!


Full Changelog: v3.0.1...v3.0.2

v3.0.1 PitchDrum and minor fixes

02 Feb 08:55
37e28ed

What's Changed

  • use_pitchdrum_tokens option to use dedicated PitchDrum tokens for drum tracks
  • Fixing time signature preprocessing (time division mismatch) in #132 (#131 @EterDelta)
  • Fixing data augmentation example and considering all midi extensions in #136 (#135 @oiabtt)
  • decoding: automatically decoding BPE before completing the tokens, in #138 (#137 @oiabtt)
  • load_tokens now returns a TokSequence, in #139 (#137 @oiabtt)
  • convert chord maps back to tuples from list when loading tokenizer from a saved configuration by @shenranwang in #141
  • MIDITokenizer.from_pretrained can now be used similarly to AutoTokenizer in the Hugging Face transformers library, in #142 (discussed in #127 @oiabtt); see the sketch below
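
For instance, loading a tokenizer from the Hub without knowing its class beforehand could look like this ("user/miditok-tokenizer" is a placeholder repo id):

```python
from miditok import MIDITokenizer, REMI, TokenizerConfig

# Auto-loading from the Hub, similarly to transformers' AutoTokenizer:
tokenizer = MIDITokenizer.from_pretrained("user/miditok-tokenizer")

# Enabling the new PitchDrum tokens for drum tracks:
config = TokenizerConfig(use_pitchdrum_tokens=True)
drum_tokenizer = REMI(config)
```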


Full Changelog: v3.0.0...v3.0.1

v3.0.0 Switch to symusic - performance boost

17 Jan 20:04
3f2c372

Switch to symusic

This major version marks the switch from the miditoolkit MIDI reading/writing library to symusic, and a large optimisation of the MIDI preprocessing steps.

Symusic is a MIDI reading/writing library written in C++ with Python bindings, offering unmatched speeds, up to 500 times faster than native Python libraries. It is based on minimidi. Both libraries are created and maintained by @Yikai-Liao and @lzqlzzq, who have done amazing work, which is still ongoing as many useful features are on the roadmap! 🫶

Tokenizers from previous versions are compatible with this new version, but there might be some timing variations if you compare how MIDIs are tokenized and how tokens are decoded.

Performance boost

These changes result in much faster MIDI loading/writing and tokenization! The overall tokenization (loading a MIDI and tokenizing it) is between 5 and 12 times faster, depending on the tokenizer and data. You can find other benchmarks here.

This huge speed gain makes it possible to drop the previously recommended step of pre-tokenizing MIDI files as JSON tokens, and to directly tokenize MIDIs on the fly while training/using a model! We updated the usage examples in the docs accordingly; the code is now simpler.
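
For example, tokenizing on the fly is now as simple as (the file path is a placeholder):

```python
from miditok import REMI
from symusic import Score

tokenizer = REMI()
score = Score("path/to/file.mid")  # fast C++ MIDI parsing via symusic
tokens = tokenizer(score)          # tokenize directly, no JSON cache needed
```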

Other major changes

  • When using time signatures, time tokens are now computed in ticks per beat, as opposed to ticks per quarter note as done previously. This change aligns with the definition of time and duration tokens, which until now did not follow the MIDI norm for note values other than the quarter note (#124); a worked example follows this list;
  • Adding new ruff rules and their fixes to comply, increasing the code quality in #115;
  • MidiTok still supports miditoolkit.MidiFile objects, but they will be converted on the fly to symusic.Score objects and a deprecation warning will be thrown;
  • The token-level data augmentation methods have been removed in favour of data augmentation operating directly on MIDIs, which is much faster, simplifies the process, and now handles durations;
  • The docs are fixed;
  • The tokenization test workflows have been unified and considerably simplified, leading to more robust test assertions. We also increased the number of test cases and configurations while decreasing the test time.
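
A worked example of the ticks-per-beat change, under the MIDI-norm interpretation that the beat corresponds to the time signature's denominator value:

```python
# With a time division of 480 ticks per quarter note:
ticks_per_quarter = 480

# In 6/8, the beat is an eighth note, so:
numerator, denominator = 6, 8
ticks_per_beat = ticks_per_quarter * 4 // denominator  # 240 ticks

# A time/duration token worth one beat now spans 240 ticks in 6/8,
# whereas it was previously computed as 480 ticks (one quarter note)
# regardless of the time signature.
print(ticks_per_beat)  # 240
```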

Other minor changes

  • Setting special token values in TokenizerConfig in #114
  • Update README.md by @kalyani2003 in #120
  • Readthedocs preview action for PRs in #125


Full Changelog: v2.1.8...v3.0.0

v2.1.8 Pitch Intervals & minor fixes

28 Nov 13:35
89c4678

This new version brings an additional token type: pitch intervals. It allows representing the pitch intervals of simultaneous and successive notes. You can read more details about how it works in the docs.
We greatly improved the tests and CI workflow, and fixed a few minor bugs along the way.
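
Enabling it could look like the following; use_pitch_intervals is the configuration flag, while max_pitch_interval is our recollection of the companion parameter and may differ:

```python
from miditok import REMI, TokenizerConfig

config = TokenizerConfig(
    use_pitch_intervals=True,  # enables the new interval tokens
    max_pitch_interval=16,     # assumed: largest interval represented
)
tokenizer = REMI(config)
```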

This new version also drops support for Python 3.7, and now requires Python 3.8 and newer. You can read more about the decision and how to make it retro-compatible in the docs.

We encourage you to update to the latest miditoolkit version, which also features some fixes and improvements. The most notable is a cleanup of the dependencies and compatibility with recent numpy versions!

Full Changelog: v2.1.7...v2.1.8

v2.1.7 Hugging Face Hub integration

25 Oct 06:58

This release brings the integration of the Hugging Face Hub, along with a few important fixes and improvements!

What's Changed

  • #87 Hugging Face Hub integration! You can now push MidiTok tokenizers to and load them from the Hugging Face Hub, using the from_pretrained and push_to_hub methods as you would for your models (see the sketch after this list)! Special thanks to @Wauplin and @julien-c for the help and support! 🤗🤗
  • #80 (#78 @leleogere) Adding a func_to_get_labels argument to DatasetTok, allowing it to retrieve labels when loading data;
  • #81 (#74 @Chunyuan-Li) Fixing multi-stream decoding with several identical programs + fixes with the encoding / decoding of time signatures for Bar-based tokenizers;
  • #84 (#77 @VDT5702) Fix in detect_chords when checking whether to use unknown chords;
  • #82 (#79 @leleogere) tokenize_midi_dataset now reproduces the file tree of the source files. This fixes files with the same name being overwritten, as could happen with the previous method. You can also specify whether to overwrite files in the destination directory or not.
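
A quick sketch of the round trip ("user/MidiTok-REMI" is a placeholder repo id):

```python
from miditok import REMI

tokenizer = REMI()
# Push the tokenizer (its configuration and vocabulary) to the Hub:
tokenizer.push_to_hub("user/MidiTok-REMI")

# ...and load it back anywhere:
tokenizer = REMI.from_pretrained("user/MidiTok-REMI")
```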

Full Changelog: v2.1.6...v2.1.7

v2.1.6 Program Changes and fixes

28 Sep 19:46

Changelog

  • #72 (#71) adding the program_change config option, which inserts a Program token whenever an event comes from a different track than the previous one, mimicking MIDI ProgramChange messages. If this parameter is disabled (the default), a Program token precedes each note's tokens (as done in previous versions); see the sketch after this list;
  • #72 MIDILike decoding optimized;
  • #72 deduplicating overlapping pitch bends during preprocessing;
  • #72 tokenize_check_equals test method and more test cases;
  • #75 and #76 (#73 and #74 by @Chunyuan-Li) Fixing time signature encoding/decoding workflows for Bar/Position-based tokenizers (REMI, CPWord, Octuple, MMM);
  • #76 Octuple is now tested with time signatures disabled: as TimeSig tokens are only carried with notes, Octuple cannot accurately represent time signatures. If a time signature change occurs and the following bars contain no notes, time will be shifted by one or more bars, depending on the previous time signature's numerator and the time gap between the last and current notes. We do not recommend using Octuple with MIDIs containing several time signature changes (at least numerator changes);
  • #76 MMM tokenization workflow speedup.
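
A sketch of enabling the option; the release calls it program_change, while current versions spell the parameter program_changes, which we use here:

```python
from miditok import TSD, TokenizerConfig

config = TokenizerConfig(
    use_programs=True,
    program_changes=True,  # Program tokens only on track changes,
                           # mimicking MIDI ProgramChange messages
)
tokenizer = TSD(config)
```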

v2.1.5 Successive TimeShifts / Rests

31 Aug 06:51
3949268

Changelog

  • #69 bacea19 sorting notes in all cases when tokenizing, as MIDIs can contain unsorted notes;
  • #70 (#68) New one_token_stream_for_programs parameter, allowing all tracks of a MIDI to be treated as a single stream of tokens (adding Program tokens before Pitch/NoteOn...). This option is enabled by default and corresponds to the default behaviour of previous versions. Disabling it allows having Program tokens in the vocabulary (config.use_programs enabled) while converting each track independently;
  • #70 (#68) TimeShift and Rest tokens can now be created successively during tokenization, which happens when the largest TimeShift / Rest value of the tokenizer isn't sufficient;
  • #70 (#68) Rests are now represented in the same format as TimeShifts, and the config.rest_range parameter has been renamed beat_res_rest for simplicity and flexibility. The default value is {(0, 1): 8, (1, 2): 4, (2, 12): 2}; see the sketch below.
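
For instance, with rests enabled (the values shown are the new defaults quoted above):

```python
from miditok import MIDILike, TokenizerConfig

config = TokenizerConfig(
    use_rests=True,
    # Same (start_beat, end_beat) -> resolution format as beat_res:
    beat_res_rest={(0, 1): 8, (1, 2): 4, (2, 12): 2},
)
tokenizer = MIDILike(config)
```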

Full Changelog: v2.1.4...v2.1.5

Thanks to @caenopy for reporting the bugs fixed here.

Compatibility

  • tokenizers of previous versions with the rest_range parameter will be converted to the new beat_res_rest format.

v2.1.4 Sustain pedal and pitch bend support

25 Aug 11:47

Changelog

  • @ilya16 2e1978f Fix in the save_tokens method, reading kwargs from the saved JSON file;
  • #67 Adding sustain pedal and pitch bend tokens to the REMI, TSD and MIDILike tokenizers; see the sketch below.
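
Enabling them is a matter of two configuration flags (a minimal sketch):

```python
from miditok import REMI, TokenizerConfig

config = TokenizerConfig(
    use_sustain_pedals=True,  # adds sustain-pedal tokens
    use_pitch_bends=True,     # adds PitchBend tokens
)
tokenizer = REMI(config)
```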

Compatibility

  • MMM now adds additional tokens in the same order as other tokenizers, meaning previously saved MMM tokenizers with these tokens may need to be converted.