
v3.0.2 New data loading and preprocessing methods

Released by @Natooz on 24 Mar 14:38

TL;DR

This new version introduces a new DatasetMIDI class to use when training PyTorch models. It builds on the class previously named DatasetTok, adding a pre-tokenizing option and better handling of BOS and EOS tokens.
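As a rough sketch of how this can be used (the `pre_tokenize`, `bos_token_id` and `eos_token_id` parameters and the pairing with `DataCollator` shown here are assumptions about the refactored API and may differ slightly):

```python
from pathlib import Path

from miditok import REMI
from miditok.pytorch_data import DataCollator, DatasetMIDI
from torch.utils.data import DataLoader

tokenizer = REMI()  # any MidiTok tokenizer
midi_paths = list(Path("data", "midis").glob("**/*.mid"))

# Build the dataset, pre-tokenizing all files once and adding BOS/EOS ids
# (parameter names follow the refactored API and may differ slightly)
dataset = DatasetMIDI(
    files_paths=midi_paths,
    tokenizer=tokenizer,
    max_seq_len=1024,
    bos_token_id=tokenizer["BOS_None"],
    eos_token_id=tokenizer["EOS_None"],
    pre_tokenize=True,
)
collator = DataCollator(pad_token_id=tokenizer["PAD_None"])
data_loader = DataLoader(dataset, batch_size=16, collate_fn=collator)
```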
A new miditok.pytorch_data.split_midis_for_training method dynamically chunks MIDIs into smaller parts whose lengths approximate the desired token sequence length, based on the note densities of their bars. These chunks can be used to train a model while maximizing the overall amount of data used.
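A minimal sketch of the intended workflow (the `save_dir` and `max_seq_len` parameters are assumptions about the exact signature):

```python
from pathlib import Path

from miditok import REMI
from miditok.pytorch_data import DatasetMIDI, split_midis_for_training

tokenizer = REMI()
midi_paths = list(Path("data", "midis").glob("**/*.mid"))

# Split each MIDI into chunks of roughly 1024 tokens (estimated from the
# note densities of its bars), save them to disk, then train on the chunks
chunk_paths = split_midis_for_training(
    files_paths=midi_paths,
    tokenizer=tokenizer,
    save_dir=Path("data", "midi_chunks"),
    max_seq_len=1024,
)
dataset = DatasetMIDI(chunk_paths, tokenizer, max_seq_len=1024)
```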
A few new utility methods have been created for these features, e.g. to split, concatenate or merge symusic.Score objects.
Thanks @Kinyugo for the discussions and tests that guided the development of these features! (#147)

The update also brings a few minor fixes, and the docs have a new theme!

What's Changed

  • Fix token_paths to files_paths, and config to model_config by @sunsetsobserver in #145
  • Fix issues in Octuple with multiple different-beat time signatures by @ilya16 in #146
  • Pitch interval decoding: discarding notes outside the tokenizer pitch range by @Natooz in #149
  • Fixing save_pretrained to comply with huggingface_hub v0.21 by @Natooz in #150
  • Ability to overwrite _create_durations_tuples in init by @JLenzy in #153
  • Refactor of PyTorch data loading classes and methods by @Natooz and @Kinyugo in #148
  • The docs have a new theme, now using furo!

New Contributors

Full Changelog: v3.0.1...v3.0.2