From cbccbb4fbf33d62a2446ad4140c46063ec156292 Mon Sep 17 00:00:00 2001
From: Nathan Fradet <56734983+Natooz@users.noreply.github.com>
Date: Thu, 3 Nov 2022 15:24:17 +0100
Subject: [PATCH] readme update: special tokens paragraph

---
 README.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/README.md b/README.md
index 459bec7b..b73b1947 100644
--- a/README.md
+++ b/README.md
@@ -220,6 +220,14 @@ Every encoding strategy share some common parameters around which the tokenizers
 
 Check [constants.py](miditok/constants.py) to see how these parameters are constructed.
 
+### Special tokens
+
+When creating a tokenizer, you can choose to include special tokens in its vocabulary by setting the following arguments:
+
+* `pad` (default `True`) --> `PAD_None`: a padding token, used when training a model with batches of sequences of unequal lengths. The padding token will be at index 0 of the vocabulary.
+* `sos_eos` (default `False`) --> `SOS_None` and `EOS_None`: "Start Of Sequence" and "End Of Sequence" tokens, placed respectively at the beginning and end of a token sequence during training. At inference, the EOS token indicates when to stop the generation.
+* `mask` (default `False`) --> `MASK_None`: a masking token, used when pre-training a (bidirectional) model with a self-supervised objective such as [BERT](https://arxiv.org/abs/1810.04805).
+
 ### Additional tokens
 
 MidiTok offers the possibility to insert additional tokens in the encodings.
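
A minimal usage sketch of the special-token arguments described in the added paragraph, assuming a MidiTok tokenizer class such as `REMI` accepts `pad`, `sos_eos`, and `mask` as keyword arguments and provides defaults for its other constructor parameters:

```python
# Sketch (assumptions: the REMI class and its other constructor defaults).
from miditok import REMI

tokenizer = REMI(
    pad=True,      # PAD_None, placed at index 0 of the vocabulary
    sos_eos=True,  # SOS_None and EOS_None, for sequence start/end during training
    mask=True,     # MASK_None, e.g. for BERT-style self-supervised pre-training
)
```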