
Generate batch of LMs #2249

Closed · wants to merge 10 commits

Conversation

@wasertech (Collaborator) commented Jul 4, 2022

@HarikalarKutusu had made a copy of data/lm/generate_lm.py to create multiple LMs with a single command.

Unfortunately, that implementation was rather lacking, so I made the following changes:

  • added concurrency (see the sketch after this list)
  • removed the code copied from the original script; generate_lm_batch.py now wraps generate_lm.py instead
  • formatted the code with black
  • added the missing dependencies to the Docker image
  • added a CI test (run-ci-lm-gen.sh) to the workflows/build-and-test.yml pipeline
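
A minimal sketch of the wrapping-plus-concurrency idea, with hypothetical function names (not the exact code of generate_lm_batch.py):

from concurrent.futures import ProcessPoolExecutor, as_completed

def generate_one_lm(arpa_order, top_k, arpa_prune):
    # Placeholder: the real generate_lm_batch.py calls into the functions of
    # data/lm/generate_lm.py here instead of duplicating their code.
    ...

def generate_batch(param_combinations, jobs):
    # Run each (arpa_order, top_k, arpa_prune) combination in its own process.
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(generate_one_lm, *combo) for combo in param_combinations]
        for future in as_completed(futures):
            future.result()  # re-raise any error from a failed run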

With these changes, you can now do the following:

python data/lm/generate_lm_batch.py \
    --input_txt /mnt/extracted/sources_lm.txt \
    --output_dir /mnt/lm/ \
    --top_k_list 30000-50000 \
    --arpa_order_list "2-3" \
    --max_arpa_memory "85%" \
    --arpa_prune_list "0|0|2-0|0|3" \
    --binary_a_bits 255 \
    --binary_q_bits 8 \
    --binary_type trie \
    --kenlm_bins /code/kenlm/build/bin/ \
    -j 12

This will test all possible combinations of:

{
    'top_k': [30000, 50000],
    'arpa_order': [2, 3],
    'arpa_prune': ["0|0|2", "0|0|3"]
}
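
The '-'-separated CLI values map onto those lists roughly like this (parse_values is a hypothetical helper name, not necessarily the one used in the script):

import itertools

def parse_values(value, cast=str):
    return [cast(v) for v in value.split("-")]

top_k_list = parse_values("30000-50000", int)   # [30000, 50000]
arpa_order_list = parse_values("2-3", int)      # [2, 3]
arpa_prune_list = parse_values("0|0|2-0|0|3")   # ["0|0|2", "0|0|3"]

# Per the description above, every (arpa_order, top_k, arpa_prune) tuple is tried.
combinations = list(itertools.product(arpa_order_list, top_k_list, arpa_prune_list))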

The created scorers will be stored in {--output_dir}/{arpa_order}-{top_k}-{arpa_prune}/.

# ./data/lm/4-30000-0|0|1
drwxr-xr-x root root  .
.rw-r--r-- root root lm.binary
.rw-r--r-- root root vocab-30000.txt
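
The per-run directory name presumably just joins the three parameters; a sketch, assuming it is built roughly like this:

import os

def combo_output_dir(output_dir, arpa_order, top_k, arpa_prune):
    # e.g. combo_output_dir("./data/lm", 4, 30000, "0|0|1") -> "./data/lm/4-30000-0|0|1"
    return os.path.join(output_dir, f"{arpa_order}-{top_k}-{arpa_prune}")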

Needs libboost-program-options-dev and libboost-thread-dev installed, or lmplz crashes with:

libboost_program_options.so.1.71.0: cannot open shared object file: No such file or directory
libboost_thread.so.1.71.0:  cannot open shared object file: No such file or directory

@wasertech (Collaborator, Author) commented:

Better!

@wasertech (Collaborator, Author) commented Jul 4, 2022

The output is a little messy since the runs execute simultaneously, so we need to report everything nicely at the end.
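
One way to do that (an illustration of the idea, not the PR's actual code) is to have each run return its parameters and duration, then print a single summary once every future has completed:

import time

def timed_run(run_fn, arpa_order, top_k, arpa_prune):
    start = time.perf_counter()
    run_fn(arpa_order, top_k, arpa_prune)
    return {"arpa_order": arpa_order, "top_k": top_k,
            "arpa_prune": arpa_prune, "seconds": time.perf_counter() - start}

def print_summary(results):
    print("-" * 64)
    for result in results:
        print(f"arpa_order={result['arpa_order']} top_k={result['top_k']} "
              f"arpa_prune='{result['arpa_prune']}' took {result['seconds']:.2f} seconds")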

@wasertech (Collaborator, Author) commented Jul 4, 2022

root@c53e06a85b12:/code# ./bin/run-ci-lm-gen-batch.sh 
sources_lm_filepath=./data/smoke_test/vocab.txt
+ python data/lm/generate_lm_batch.py --input_txt ./data/smoke_test/vocab.txt --output_dir ./data/lm --top_k_list 30000 --arpa_order_list 4 --max_arpa_memory 85% --arpa_prune_list 0|0|2 --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --kenlm_bins /code/kenlm/build/bin/ -j 1

Converting to lowercase and counting word occurrences ...
| |#                                                                                                                                                             | 500 Elapsed Time: 0:00:00

Saving top 30000 words ...

Calculating word statistics ...
  Your text file has 13343 words in total
  It has 2559 unique words
  Your top-30000 words are 100.0000 percent of all words
  Your most common word "the" occurred 687 times
  The least common word in your top-k is "ultraconservative" with 1 times
  The first word with 2 occurrences is "mens" at place 1146

Creating ARPA file ...
=== 1/5 Counting and sorting n-grams ===
Reading /code/data/lm/4-30000-0|0|2/lower.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 13343 types 2562
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:30744 2:14627018752 3:27425658880 4:43881058304
Statistics:
1 2562 D1=0.651407 D2=1.09117 D3+=1.64993
2 9399 D1=0.831861 D2=1.21647 D3+=1.44108
3 148/12347 D1=0.937292 D2=1.53845 D3+=1.55801
4 21/12584 D1=0.967272 D2=1.7362 D3+=3
Memory estimate for binary LM:
type     kB
probing 289 assuming -p 1.5
probing 355 assuming -r models -p 1.5
trie    156 without quantization
trie    107 assuming -q 8 -b 8 quantization 
trie    148 assuming -a 22 array pointer compression
trie     99 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:30744 2:150384 3:2960 4:504
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:30744 2:150384 3:2960 4:504
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz      VmPeak:84649108 kB      VmRSS:6756 kB   RSSMax:16794516 kB      user:0.940238   sys:4.20439     CPU:5.14465     real:5.14232

Filtering ARPA file using vocabulary of top-k words ...
Reading ./data/lm/4-30000-0|0|2/lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************

Building lm.binary ...
Reading ./data/lm/4-30000-0|0|2/lm_filtered.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Quantizing
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
----------------------------------------------------------------
2022-07-04 13:32 RUNNING 1/1 FOR arpa_order=4 top_k=30000 arpa_prune='0|0|2'
LM generation 1 took: 5.443297207000796 seconds
----------------------------------------------------------------
INFO:root:Took 5.445083366999825 seconds to generate 1 language model.

@wasertech wasertech marked this pull request as ready for review July 4, 2022 13:18
@wasertech (Collaborator, Author) commented Jul 5, 2022

I'll close this PR as I've just merged it with #2211 inside #2253, which makes available_cpu_count() available STT-wide as coqui_stt_training.util.cpu.available_count().
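
For reference, a minimal usage sketch once #2253 is in (the variable name is illustrative):

from coqui_stt_training.util.cpu import available_count

# e.g. default the -j/--jobs flag of generate_lm_batch.py to the number of usable CPUs
default_jobs = available_count()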

@wasertech wasertech closed this Jul 5, 2022