
Use MMI not CTC model for alignment #203

Open · danpovey wants to merge 2 commits into master

Conversation

@danpovey (Contributor) commented Jun 2, 2021

No description provided.

@danpovey (Contributor, Author) commented Jun 2, 2021

Below are some notes I made about the results. There is a modest improvement of around 0.3% absolute on test-other from using the MMI model rather than the CTC model for alignment.

  `mmiali` experiment, branch=mmiali.  Use the MMI TDNN+LSTM model, not the CTC model, for alignment; this requires
  retraining the MMI TDNN+LSTM model with subsampling-factor=4 to avoid a mismatch.

 The baseline for what's below (which was trained with mmi_att_transformer_train.py with --world-size=2 and
 --full-libri=False) can be taken to be: 6.82%, 18.00%, 5.78%, 15.46%, taken from
 /ceph-dan/snowfall/egs/librispeech/asr/simple_v1/exp-conformer-noam-mmi-att-musan-sa-vgg-rework
 (the checked-in result with the vgg frontend in RESULTS.md is with 1 job, not 2).

 2021-06-02 10:49:26,220 INFO [common.py:380] [test-clean] %WER 6.81% [3583 / 52576, 496 ins, 284 del, 2803 sub ]
 2021-06-02 10:51:41,617 INFO [common.py:380] [test-other] %WER 17.64% [9234 / 52343, 1024 ins, 848 del, 7362 sub ]
[with 4-gram LM rescoring]:
 2021-06-02 12:11:52,226 INFO [common.py:391] [test-clean] %WER 5.72% [3009 / 52576, 566 ins, 158 del, 2285 sub ]
 2021-06-02 12:18:23,522 INFO [common.py:391] [test-other] %WER 15.18% [7946 / 52343, 1176 ins, 538 del, 6232 sub ]


  # Below is the model from exp-lstm-adam-mmi-bigram-musan-dist-s4/epoch-9.pt:
  this expt (with subsampling-factor=4):
     2021-06-01 21:08:15,043 INFO [mmi_bigram_decode.py:261] %WER 10.66% [5604 / 52576, 718 ins, 587 del, 4299 sub ]
  baseline (with subsampling-factor=3):
     2021-06-01 12:06:43,106 INFO [mmi_bigram_decode.py:261] %WER 10.38% [5455 / 52576, 713 ins, 510 del, 4232 sub ]

x = nnet_output.abs().sum().item()
# x - x is NaN (and hence != 0) if x is NaN or +/-inf, so this catches a
# non-finite network output.
if x - x != 0:
    print("Warning: reverting nnet output since it seems to be nan.")
    nnet_output = nnet_output_orig
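An equivalent, more explicit way to write the same test (just an alternative phrasing, not what this PR uses; nnet_output and nnet_output_orig are the tensors from the hunk above):

    import torch

    # torch.isfinite is False for NaN and +/-inf, matching the x - x trick.
    if not torch.isfinite(nnet_output).all():
        print("Warning: reverting nnet output since it seems to be nan.")
        nnet_output = nnet_output_orig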
@danpovey (Contributor, Author) commented:

@GNroy perhaps this is related to the error you had? I found that sometimes I'd get NaNs in the forward pass of the alignment model. In addition to making this change, I commented out ali_model.eval(), because I suspected the problem had to do with test-mode batchnorm, but I might be wrong; I need to test this. It might also relate to float16 usage (or a combination of the two).

@GNroy commented:

Thanks!
Actually, I resolved my issue.
The NaNs were produced by the encoder part (not by the loss or the softmax, as I thought before).
It was fixed with some hyperparameter re-tuning; in particular, setting eps=1e-3 for the optimizer helped.
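For illustration, a minimal sketch of that fix (the optimizer type, learning rate, and placeholder model are assumptions; the thread only mentions eps=1e-3):

    import torch

    model = torch.nn.Linear(10, 10)  # placeholder model, for illustration only
    # Adam divides each update by sqrt(v_t) + eps; with the default eps=1e-8,
    # a tiny second-moment estimate can yield huge, NaN-prone steps.
    # eps=1e-3 bounds the effective step size much more tightly.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-3)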

@danpovey (Contributor, Author) commented Jun 2, 2021

Results after training with 1 job only (and with ali_model.eval() uncommented, which I doubt matters) were:

2021-06-02 19:53:43,152 INFO [common.py:391] [test-clean] %WER 6.85% [3604 / 52576, 530 ins, 278 del, 2796 sub ]
 2021-06-02 19:56:17,121 INFO [common.py:391] [test-other] %WER 17.57% [9195 / 52343, 1081 ins, 787 del, 7327 sub ]
and with LM rescoring:
 2021-06-02 19:55:17,350 INFO [common.py:391] [test-clean] %WER 5.83% [3065 / 52576, 612 ins, 158 del, 2295 sub ]
 2021-06-02 20:02:43,266 INFO [common.py:391] [test-other] %WER 15.30% [8006 / 52343, 1268 ins, 488 del, 6250 sub ]

vs. the checked-in results from @zhu-han which were:

# average over last 5 epochs (LM rescoring with whole lattice)
2021-05-02 00:36:42,886 INFO [common.py:381] [test-clean] %WER 5.55% [2916 / 52576, 548 ins, 172 del, 2196 sub ]
2021-05-02 00:47:15,544 INFO [common.py:381] [test-other] %WER 15.32% [8021 / 52343, 1270 ins, 501 del, 6250 sub ]

# average over last 5 epochs
2021-05-01 23:35:17,891 INFO [common.py:381] [test-clean] %WER 6.65% [3494 / 52576, 457 ins, 293 del, 2744 sub ]
2021-05-01 23:37:23,141 INFO [common.py:381] [test-other] %WER 17.68% [9252 / 52343, 1020 ins, 858 del, 7374 sub ]

... so according to this, it does not really make a difference which model we use for alignment.

@pzelasko (Collaborator) commented Jun 2, 2021

Would it make sense to use a pure TDNN/TDNNF/CNN model for alignments? I was investigating alignments from the conformer recently and my feeling was that they weren't perfect (even though the test-clean WER is ~4%) -- i.e., they seem a bit warped/shifted sometimes, but not in a consistent way. I think the self-attention layers allow the model to "cheat" to some extent on the alignments; I don't know if the same happens with RNNs, but I doubt it would happen with local-context models. Unfortunately, I don't have any means to provide a more objective evaluation than showing a screenshot (look closely at the boundaries with silences).

@danpovey (Contributor, Author) commented Jun 2, 2021

That's interesting -- how did you obtain that plot?
I think it may be hard to prevent the conformer model from doing this kind of thing with the current alignment method, since the alignment model is only present early in training and is not really a constraint.

I am thinking it might be possible, though, if we had a model that was good for alignment, to save 'constraints' derived from it, similar to what we do in Kaldi's LF-MMI training. That is: get (say) the one-best path from it, save it as a tensor of int32_t (e.g. as a .pt file indexed by utterance-id), and load that when training; then extend the boundaries of the phones by a couple of frames and treat the result as a mask on the nnet output, masking all (non-blank) phones that the alignment does not allow by adding a negative number to them (see the sketch below). The only catch is that this will tend to interact with data augmentation and batching; it might be a little complicated to pass that information through those transforms.
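A minimal sketch of that masking idea, assuming a per-frame 1-best phone alignment; the function names, the penalty value, and the blank id are hypothetical, not snowfall's API:

    import torch
    import torch.nn.functional as F

    def extend_boundaries(ali: torch.Tensor, num_phones: int, context: int = 2) -> torch.Tensor:
        # ali: per-frame 1-best phone ids, shape (T,).  Returns a bool mask of
        # allowed phones, shape (T, num_phones), with each phone's span
        # widened by `context` frames on either side.
        T = ali.shape[0]
        allowed = torch.zeros(T, num_phones, dtype=torch.bool)
        allowed[torch.arange(T), ali.long()] = True
        # Dilate along time with a width-(2*context+1) max-pool so the phone
        # boundaries are not over-constrained.
        dilated = F.max_pool1d(allowed.t().float().unsqueeze(0),
                               kernel_size=2 * context + 1, stride=1, padding=context)
        return dilated.squeeze(0).t().bool()

    def apply_alignment_mask(nnet_output: torch.Tensor, allowed: torch.Tensor,
                             blank_id: int = 0, penalty: float = -20.0) -> torch.Tensor:
        # nnet_output: log-probs, shape (T, num_phones).  Adds a negative
        # number to every non-blank phone the alignment does not allow.
        disallowed = ~allowed
        disallowed[:, blank_id] = False  # never penalize blank
        return nnet_output + penalty * disallowed.to(nnet_output.dtype)

The alignments themselves could be stored as described above, e.g. torch.save({utt_id: ali}, 'ali.pt'), and loaded per utterance at training time.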

@pzelasko (Collaborator) commented Jun 2, 2021

I'll submit a PR with the code that allows computing alignments and visualizing them later.

As to data augmentation of alignments, we could extend most transforms to handle it -- I'm pretty sure we can still do speed perturbation, noise mixing, and specaug masks (but probably not the time warping). We don't have reverb in Lhotse yet, but it is probably straightforward as well. Batching is possible too, but I think the alignments would need to be a part of Lhotse rather than external to it, so we can process them together with everything else in the dataloader. (Speed perturbation of an alignment, for example, is just a rescaling of frame indices; see the sketch below.)
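A minimal sketch of one such transform, speed perturbation of a per-frame alignment; the function name and the frame-mapping convention are assumptions, not Lhotse's API:

    import torch

    def perturb_alignment_speed(ali: torch.Tensor, factor: float) -> torch.Tensor:
        # ali: per-frame phone ids, shape (T,).  Speeding the audio up by
        # `factor` shrinks it to about T/factor frames; frame t of the
        # perturbed utterance maps to frame round(t * factor) of the original.
        new_T = int(round(ali.shape[0] / factor))
        src = (torch.arange(new_T, dtype=torch.float32) * factor).round().long()
        return ali[src.clamp(max=ali.shape[0] - 1)]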

@danpovey (Contributor, Author) commented Jun 2, 2021 via email

@pzelasko (Collaborator) commented:

Regarding this: it's actually weird that the choice of CTC vs. MMI alimdl would not make a difference. Some time ago I looked at both CTC and MMI posteriors, and they are quite different -- the CTC posteriors are spiky and the MMI posteriors are not (i.e., MMI tends to recognize repeated phone ids, whereas CTC tends to recognize one phone id followed by blank). Given the way the alimdl's posteriors are added to the main model's posteriors, I'd think that would matter.
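For context, a hedged sketch of what "adding the alimdl's posteriors" might look like early in training; the scale, the decay schedule, and all names here are assumptions rather than the actual snowfall code:

    import torch

    def blend_with_ali_model(nnet_output: torch.Tensor,
                             ali_model_output: torch.Tensor,
                             batch_idx: int,
                             warmup_batches: int = 4000,
                             max_scale: float = 0.5) -> torch.Tensor:
        # Both inputs: log-posteriors of shape (N, T, num_phones).  The
        # alignment model's output is mixed in with a weight that decays to
        # zero over the first `warmup_batches` updates, so it guides the
        # main model only early in training.
        scale = max_scale * max(0.0, 1.0 - batch_idx / warmup_batches)
        return nnet_output + scale * ali_model_output

Whether the alignment model is spiky (CTC) or smooth (MMI) would change which frames get boosted under a scheme like this.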

@danpovey (Contributor, Author) commented Jul 13, 2021 via email
