From cbccbb4fbf33d62a2446ad4140c46063ec156292 Mon Sep 17 00:00:00 2001
From: Nathan Fradet <56734983+Natooz@users.noreply.github.com>
Date: Thu, 3 Nov 2022 15:24:17 +0100
Subject: [PATCH] readme update: special tokens paragraph

---
 README.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/README.md b/README.md
index 459bec7b..b73b1947 100644
--- a/README.md
+++ b/README.md
@@ -220,6 +220,14 @@ Every encoding strategy share some common parameters around which the tokenizers
 
 Check [constants.py](miditok/constants.py) to see how these parameters are constructed.
 
+### Special tokens
+
+When creating a tokenizer, you can choose to include special tokens in its vocabulary by setting the following arguments:
+
+* `pad` (default `True`) --> `PAD_None`: a padding token, used when training a model with batches of sequences of unequal lengths. The padding token will be at index 0 of the vocabulary.
+* `sos_eos` (default `False`) --> `SOS_None` and `EOS_None`: "Start Of Sequence" and "End Of Sequence" tokens, placed respectively at the beginning and end of a token sequence during training. At inference, the EOS token indicates when to stop the generation.
+* `mask` (default `False`) --> `MASK_None`: a masking token, used when pre-training a (bidirectional) model with a self-supervised objective such as [BERT](https://arxiv.org/abs/1810.04805).
+
 ### Additional tokens
 
 MidiTok offers the possibility to insert additional tokens in the encodings.
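
A minimal usage sketch of the special-token arguments described in the added paragraph, assuming a MidiTok tokenizer class such as `REMI` accepts `pad`, `sos_eos`, and `mask` as keyword arguments and provides defaults for its other constructor parameters:

```python
# Sketch (assumptions: the REMI class and its other constructor defaults).
from miditok import REMI

tokenizer = REMI(
    pad=True,      # PAD_None, placed at index 0 of the vocabulary
    sos_eos=True,  # SOS_None and EOS_None, for sequence start/end during training
    mask=True,     # MASK_None, e.g. for BERT-style self-supervised pre-training
)
```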