# Self-Training with Kaldi HMM Models
This folder contains recipes for self-training on pseudo phone transcripts and
decoding into phones or words with [kaldi](https://github.com/kaldi-asr/kaldi).

To start, download and install Kaldi following its instructions, and place this
folder in `path/to/kaldi/egs`.

## Training
Assuming the following has been prepared:
- `w2v_dir`: contains features `{train,valid}.{npy,lengths}`, real transcripts `{train,valid}.${label}`, and dict `dict.${label}.txt`
- `lab_dir`: contains pseudo labels `{train,valid}.txt`
- `arpa_lm`: Arpa-format n-gram phone LM for decoding
- `arpa_lm_bin`: Arpa-format n-gram phone LM, loaded with KenLM, for unsupervised model selection

Set these variables in `train.sh`, as well as `out_dir`, the output directory,
and then run it.
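
For example, the variables at the top of `train.sh` might be set as follows (all paths are placeholders; adjust them to your setup):
```
# Inside train.sh -- placeholder values:
w2v_dir=/path/to/wav2vec_feats    # {train,valid}.{npy,lengths}, {train,valid}.${label}, dict.${label}.txt
lab_dir=/path/to/pseudo_labels    # {train,valid}.txt
arpa_lm=/path/to/phone_lm.arpa
arpa_lm_bin=/path/to/phone_lm.bin
out_dir=/path/to/out
```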

The output will be:
```
==== WER w.r.t. real transcript (select based on unsupervised metric)
INFO:root:./out/exp/mono/decode_valid/scoring/14.0.0.tra.txt: score 0.9178 wer 28.71% lm_ppl 24.4500 gt_wer 25.57%
INFO:root:./out/exp/tri1/decode_valid/scoring/17.1.0.tra.txt: score 0.9257 wer 26.99% lm_ppl 30.8494 gt_wer 21.90%
INFO:root:./out/exp/tri2b/decode_valid/scoring/8.0.0.tra.txt: score 0.7506 wer 23.15% lm_ppl 25.5944 gt_wer 15.78%
```
where `wer` is the word error rate with respect to the pseudo labels, `gt_wer`
the word error rate with respect to the ground-truth labels, `lm_ppl` the
language model perplexity of the HMM-predicted transcripts, and `score` the
unsupervised metric for model selection. We select the model and LM parameter
with the lowest score; in the example above, that is `tri2b` with `8.0.0`.
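
This selection can be scripted. A minimal sketch, assuming the selection output above was saved to a file named `train.log` (an assumed name):
```
# Print the decode with the lowest unsupervised score (3rd field of each line).
grep 'score ' train.log | sort -g -k3 | head -n1
```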


## Decoding into Phones
In `decode_phone.sh`, set `out_dir` to the same directory used in `train.sh`,
and set `dec_exp` and `dec_lmparam` to the selected model and LM parameter
(e.g. `tri2b` and `8.0.0` in the example above). `dec_script` must match
`dec_exp`: for mono/tri1/tri2b, use `decode.sh`; for tri3b, use
`decode_fmllr.sh`.
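
For the example above, the settings would look like this (`out_dir` is a placeholder):
```
# Inside decode_phone.sh:
out_dir=/path/to/out    # same as in train.sh
dec_exp=tri2b
dec_lmparam=8.0.0
dec_script=decode.sh    # use decode_fmllr.sh for tri3b
```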

The output will be saved at `out_dir/dec_data`.


## Decoding into Words
`decode_word_step1.sh` prepares WFSTs for word decoding. Besides the variables
mentioned above, set the following (an example follows the list):
- `wrd_arpa_lm`: Arpa-format n-gram word LM for decoding
- `wrd_arpa_lm_bin`: Arpa-format n-gram word LM for unsupervised model selection
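
For example (placeholder paths):
```
# Inside decode_word_step1.sh:
wrd_arpa_lm=/path/to/word_lm.arpa
wrd_arpa_lm_bin=/path/to/word_lm.bin
```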

`decode_word_step1.sh` decodes the `train` and `valid` splits into words and runs
unsupervised model selection on the `valid` split. The output looks like:
```
INFO:root:./out/exp/tri2b/decodeword_valid/scoring/17.0.0.tra.txt: score 1.8693 wer 24.97% lm_ppl 1785.5333 gt_wer 31.45%
```

After determining the LM parameter (`17.0.0` in the example above), set it in
`decode_word_step2.sh` and run it. The output will be saved at
`out_dir/dec_data_word`.
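
A minimal sketch of that change, assuming the script exposes the parameter through a `dec_lmparam` variable as `decode_phone.sh` does (an assumption; check the script):
```
# Inside decode_word_step2.sh -- the LM parameter selected above:
dec_lmparam=17.0.0
```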