AMI hotfix #4739

Draft: wants to merge 12 commits into master
31 changes: 9 additions & 22 deletions egs/ami/s5b/local/ami_download.sh
@@ -51,13 +51,17 @@ mkdir -p $wdir/log

#download waves

cat local/split_train.orig local/split_eval.orig local/split_dev.orig > $wdir/ami_meet_ids.flist

wgetfile=$wdir/wget_$mic.sh

# TODO fix this with Pawel, files don't exist anymore,
manifest="wget --continue -O $adir/MANIFEST.TXT http://groups.inf.ed.ac.uk/ami/download/temp/amiBuild-04237-Sun-Jun-15-2014.manifest.txt"
license="wget --continue -O $adir/LICENCE.TXT http://groups.inf.ed.ac.uk/ami/download/temp/Creative-Commons-Attribution-NonCommercial-ShareAlike-2.5.txt"
cp local/MANIFEST.TXT $adir/MANIFEST.TXT
manifest=$adir/MANIFEST.TXT
#manifest="wget --continue -O $adir/MANIFEST.TXT https://groups.inf.ed.ac.uk/ami/download/temp/amiBuild-1372-Thu-Apr-28-2022.manifest.txt"
license="wget --continue -O $adir/LICENSE.TXT http://groups.inf.ed.ac.uk/ami/corpus/license.shtml"

# Parse the manifest file, and separate recordings into train, dev, and eval sets
# python3 local/split_manifest.py $adir/MANIFEST.TXT

cat local/split_train.orig local/split_eval.orig local/split_dev.orig > $wdir/ami_meet_ids.flist

echo "#!/usr/bin/env bash" > $wgetfile
echo $manifest >> $wgetfile
@@ -85,23 +89,6 @@ echo "Downloading audio files for $mic scenario."
echo "Look at $wdir/log/download_ami_$mic.log for progress"
$wgetfile &> $wdir/log/download_ami_$mic.log

# Do rough check if #wavs is as expected, it will fail anyway in data prep stage if it isn't,
if [ "$mic" == "ihm" ]; then
num_files=$(find $adir -iname *Headset* | wc -l)
if [ $num_files -ne 687 ]; then
echo "Warning: Found $num_files headset wavs but expected 687. Check $wdir/log/download_ami_$mic.log for details."
exit 1;
fi
else
num_files=$(find $adir -iname *Array1* | wc -l)
if [[ $num_files -lt 1352 && "$mic" == "mdm" ]]; then
echo "Warning: Found $num_files distant Array1 waves but expected 1352 for mdm. Check $wdir/log/download_ami_$mic.log for details."
exit 1;
elif [[ $num_files -lt 169 && "$mic" == "sdm" ]]; then
echo "Warning: Found $num_files distant Array1 waves but expected 169 for sdm. Check $wdir/log/download_ami_$mic.log for details."
exit 1;
fi
fi

echo "Downloads of AMI corpus completed successfully. License can be found under $adir/LICENSE.TXT"
exit 0;
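Since the patch swaps the remote manifest download for a local copy and drops the hard-coded wav-count check, a manifest-driven count could serve as a replacement sanity check. A minimal sketch, assuming the URL layout matches the `AMICorpusMirror` prefix used by `split_manifest.py` (the helper name and sample data are made up here):

```python
import re

# Hypothetical helper: count amicorpus wav URLs listed in a MANIFEST.TXT body,
# so downloads can be checked against the manifest instead of a magic number.
URL_RE = re.compile(
    r'https://groups\.inf\.ed\.ac\.uk/ami/AMICorpusMirror//amicorpus/\S+\.wav')

def count_manifest_wavs(manifest_text):
    return len(URL_RE.findall(manifest_text))

sample = (
    "\thttps://groups.inf.ed.ac.uk/ami/AMICorpusMirror//amicorpus/"
    "ES2011a/audio/ES2011a.Mix-Headset.wav\n"
    "\thttps://groups.inf.ed.ac.uk/ami/AMICorpusMirror//amicorpus/"
    "ES2011b/audio/ES2011b.Mix-Headset.wav\n"
)
print(count_manifest_wavs(sample))  # 2
```

Comparing this count against `find $adir -iname "*.wav" | wc -l` would flag incomplete downloads without pinning an expected total into the script.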
Empty file added egs/ami/s5c/README.txt
Empty file.
4 changes: 2 additions & 2 deletions egs/ami/s5c/cmd.sh
Expand Up @@ -10,6 +10,6 @@
# conf/queue.conf in http://kaldi-asr.org/doc/queue.html for more information,
# or search for the string 'default_config' in utils/queue.pl or utils/slurm.pl.

export train_cmd="queue.pl --mem 4G"
export decode_cmd="queue.pl --mem 4G"
export train_cmd="run.pl --mem 4G"
export decode_cmd="run.pl --mem 4G"

279 changes: 279 additions & 0 deletions egs/ami/s5c/local/MANIFEST.TXT

Large diffs are not rendered by default.

13 changes: 6 additions & 7 deletions egs/ami/s5c/local/nnet3/xvector/tuning/run_xvector_1a.sh
@@ -54,19 +54,18 @@ num_pdfs=$(awk '{print $2}' $data/utt2spk | sort | uniq -c | wc -l)
if [ $stage -le 6 ]; then
echo "$0: Getting neural network training egs";
# dump egs.
if [[ $(hostname -f) == *.clsp.jhu.edu ]] && [ ! -d $egs_dir/storage ]; then
utils/create_split_dir.pl \
/export/b{03,04,05,06}/$USER/kaldi-data/egs/callhome_diarization/v2/xvector-$(date +'%m_%d_%H_%M')/$egs_dir/storage $egs_dir/storage
fi
# frames-per-iter was originally 1000000000
# frames-per-iter-diagnostic was originally 500000
# num-repeats was originally 1
sid/nnet3/xvector/get_egs.sh --cmd "$train_cmd" \
--nj 8 \
--stage 0 \
--frames-per-iter 1000000000 \
--frames-per-iter-diagnostic 500000 \
--frames-per-iter 100000 \
--frames-per-iter-diagnostic 10000 \
--min-frames-per-chunk 200 \
--max-frames-per-chunk 400 \
--num-diagnostic-archives 3 \
--num-repeats 40 \
--num-repeats 10 \
"$data" $egs_dir
fi

4 changes: 3 additions & 1 deletion egs/ami/s5c/local/prepare_data.py
@@ -20,14 +20,16 @@

def find_audios(wav_path, file_list):
# Get all wav file names from audio directory
command = 'find %s -name "*Mix-Headset.wav"' % (wav_path)
command = 'find %s -name "*.wav"' % (wav_path)
wavs = subprocess.check_output(command, shell=True).decode('utf-8').splitlines()
keys = [ os.path.splitext(os.path.basename(wav))[0] for wav in wavs ]
data = {'key': keys, 'file_path': wavs}
df_wav = pd.DataFrame(data)

# Filter list to keep only those in annotations (for the specific data split)
file_names_str = "|".join(file_list)
print(file_names_str)
print(df_wav)
df_wav = df_wav.loc[df_wav['key'].str.contains(file_names_str)].sort_values('key')
return df_wav
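The filtering step above is what keeps the broadened `*.wav` glob narrowed to each split's meetings: the joined file list acts as a regex alternation inside `str.contains`. A small sketch with made-up keys and paths:

```python
import pandas as pd

# Hypothetical split list and wav table mirroring find_audios() internals.
file_list = ['ES2011a', 'ES2011b']
df_wav = pd.DataFrame({
    'key': ['ES2011a.Mix-Headset', 'ES2011b.Array1-01', 'EN2002a.Mix-Headset'],
    'file_path': ['a.wav', 'b.wav', 'c.wav'],
})

# "ES2011a|ES2011b" is treated as a regex, so any key containing either
# meeting ID survives the filter.
pattern = "|".join(file_list)
filtered = df_wav.loc[df_wav['key'].str.contains(pattern)].sort_values('key')
print(list(filtered['key']))  # ['ES2011a.Mix-Headset', 'ES2011b.Array1-01']
```

One caveat of this substring-regex approach: a meeting ID that is a prefix of another (or that contains regex metacharacters) could over-match, so anchored patterns may be safer for other corpora.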

8 changes: 8 additions & 0 deletions egs/ami/s5c/local/split_dev.orig
@@ -0,0 +1,8 @@
ES2011a
ES2011b
ES2011c
ES2011d
IB4001
IB4002
IB4003
IB4004
8 changes: 8 additions & 0 deletions egs/ami/s5c/local/split_eval.orig
@@ -0,0 +1,8 @@
ES2004a
ES2004b
ES2004c
ES2004d
EN2002a
EN2002b
EN2002c
EN2002d
69 changes: 69 additions & 0 deletions egs/ami/s5c/local/split_manifest.py
@@ -0,0 +1,69 @@
import os
import sys

def unique(m):
unique_list = []

for i in m:
if i not in unique_list:
unique_list.append(i)

return unique_list

# Load the MANIFEST file and save off the audio recording file names
file = sys.argv[1]
prefix = '\thttps://groups.inf.ed.ac.uk/ami/AMICorpusMirror//amicorpus/'
m = []

with open(file) as f:
for line in f:
#splits = line.split('/')
#print(splits)
if line.startswith(prefix):
splits = line.split('/')
#print(splits)
m.append(splits[7])
m = unique(m)
print("Got the audio files from MANIFEST.TXT")
#print(m)

# Separate files and save off into train, dev, and eval partitions
N = len(m)

#train = m[:round(N*.5)]
#dev = m[round(N*.5)+1:round(N*.8)]
#ev = m[round(N*.8)+1:]

#train = train[:12]
#dev = dev[:10]
#ev = ev[:10]

train = m[:8]
dev = m[9:15]
ev = m[16:20]

print("Train set: "+str(train))
print("Dev set: "+str(dev))
print("Eval set: "+str(ev))

if os.path.exists('local/split_train.orig'):
os.remove('local/split_train.orig')
if os.path.exists('local/split_dev.orig'):
os.remove('local/split_dev.orig')
if os.path.exists('local/split_eval.orig'):
os.remove('local/split_eval.orig')

with open('local/split_train.orig', 'a') as train_file:
for d in train:
train_file.write(d)
train_file.write("\n")

with open('local/split_dev.orig', 'a') as dev_file:
for d in dev:
dev_file.write(d)
dev_file.write("\n")

with open('local/split_eval.orig', 'a') as eval_file:
for d in ev:
eval_file.write(d)
eval_file.write("\n")
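For reference, the `unique()` helper in this script is equivalent to an order-preserving dedup via `dict.fromkeys` (insertion order is guaranteed in Python 3.7+); a minimal sketch:

```python
# Order-preserving deduplication, equivalent to the loop-based unique()
# helper in split_manifest.py.
def unique(m):
    return list(dict.fromkeys(m))

meetings = ['ES2011a', 'ES2011a', 'IB4001', 'ES2011a', 'IB4001']
print(unique(meetings))  # ['ES2011a', 'IB4001']
```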
27 changes: 27 additions & 0 deletions egs/ami/s5c/local/split_train.orig
@@ -0,0 +1,27 @@
IS1000a
IS1000b
IS1000c
IS1000d
IS1001a
IS1001b
IS1001c
IS1001d
IS1002b
IS1002c
IS1002d
ES2002a
ES2002b
ES2002c
ES2002d
ES2003a
ES2003b
ES2003c
ES2003d
ES2005a
ES2005b
ES2005c
ES2005d
ES2006a
ES2006b
ES2006c
ES2006d
18 changes: 9 additions & 9 deletions egs/ami/s5c/run.sh
@@ -20,12 +20,14 @@
set -euo pipefail
mfccdir=`pwd`/mfcc

stage=0
stage=7
overlap_stage=0
diarizer_stage=0
nj=50
nj=10
decode_nj=15

export mic=ihm

model_dir=exp/xvector_nnet_1a

train_set=train
@@ -37,11 +39,6 @@ diarizer_type=spectral # must be one of (ahc, spectral, vbx)

# Path where AMI gets downloaded (or where locally available):
AMI_DIR=$PWD/wav_db # Default,
case $(hostname -d) in
fit.vutbr.cz) AMI_DIR=/mnt/matylda5/iveselyk/KALDI_AMI_WAV ;; # BUT,
clsp.jhu.edu) AMI_DIR=/export/corpora5/amicorpus ;; # JHU,
cstr.ed.ac.uk) AMI_DIR= ;; # Edinburgh,
esac

# Download AMI corpus, You need around 130GB of free space to get whole data
if [ $stage -le 1 ]; then
@@ -87,13 +84,16 @@ if [ $stage -le 3 ]; then
steps/make_mfcc.sh --mfcc-config conf/mfcc_hires.conf --nj $nj --cmd "$train_cmd" data/$dataset
steps/compute_cmvn_stats.sh data/$dataset
utils/fix_data_dir.sh data/$dataset
echo "FEATURES COMPLETE FOR DATASET"
done
fi

if [ $stage -le 4 ]; then
echo "$0: preparing AMI training data to train the PLDA model"
local/nnet3/xvector/prepare_feats.sh --nj $nj --cmd "$train_cmd" \
data/train data/plda_train exp/plda_train_cmn
#local/nnet3/xvector/prepare_feats.sh --nj $nj --cmd "$train_cmd" \
# data/train data/plda_train exp/plda_train_cmn
local/nnet3/xvector/run_xvector.sh --stage $stage --train-stage -1 \
--data data/plda_train
fi

if [ $stage -le 5 ]; then
1 change: 1 addition & 0 deletions egs/ami/s5c/sid
38 changes: 38 additions & 0 deletions egs/ami/s5c_apt2141/README.txt
@@ -0,0 +1,38 @@
A. Alexander Thornton (apt2141)

B. May 8, 2022

C. Project Title: Speaker Diarization: Deep Speech Embeddings for Time Delay Neural Networks (TDNN)

D. Project Summary:

Abstract—The fundamental problem of Speaker Diarization can be simplified as "who spoke when". At its essence, Speaker Diarization can be reduced to the traditional Speaker Identification problem, but expanded to N interleaving speakers through time. This work improves upon the existing Speaker Diarization project in Kaldi, which was incomplete prior to my efforts.

Index Terms—speaker identification, diarization, time delay neural networks, time series learning

E. All tools are included with the code here. Build Kaldi, and you can just run run.sh.

F. Only use run.sh; the stage is set to 7 for decoding.

G. Run the code with this simple command:

./run.sh

All environment variables are defined inside

Sample output will appear at the bottom, with the test accuracy

H. The data used was built from a MANIFEST file downloaded here:

https://groups.inf.ed.ac.uk/ami/download/temp/amiBuild-1372-Thu-Apr-28-2022.manifest.txt

It's important to know that these files change daily, so this one might already be gone.



15 changes: 15 additions & 0 deletions egs/ami/s5c_apt2141/cmd.sh
@@ -0,0 +1,15 @@
# you can change cmd.sh depending on what type of queue you are using.
# If you have no queueing system and want to run on a local machine, you
# can change all instances 'queue.pl' to run.pl (but be careful and run
# commands one by one: most recipes will exhaust the memory on your
# machine). queue.pl works with GridEngine (qsub). slurm.pl works
# with slurm. Different queues are configured differently, with different
# queue names and different ways of specifying things like memory;
# to account for these differences you can create and edit the file
# conf/queue.conf to match your queue's configuration. Search for
# conf/queue.conf in http://kaldi-asr.org/doc/queue.html for more information,
# or search for the string 'default_config' in utils/queue.pl or utils/slurm.pl.

export train_cmd="run.pl --mem 4G"
export decode_cmd="run.pl --mem 4G"

3 changes: 3 additions & 0 deletions egs/ami/s5c_apt2141/conf/decode.conf
@@ -0,0 +1,3 @@
beam=11.0 # beam for decoding. Was 13.0 in the scripts.
first_beam=8.0 # beam for 1st-pass decoding in SAT.

2 changes: 2 additions & 0 deletions egs/ami/s5c_apt2141/conf/mfcc.conf
@@ -0,0 +1,2 @@
--use-energy=false # only non-default option.
--sample-frequency=16000
10 changes: 10 additions & 0 deletions egs/ami/s5c_apt2141/conf/mfcc_hires.conf
@@ -0,0 +1,10 @@
# config for high-resolution MFCC features, intended for neural network training
# Note: we keep all cepstra, so it has the same info as filterbank features,
# but MFCC is more easily compressible (because less correlated) which is why
# we prefer this method.
--use-energy=false # use average of log energy, not energy.
--num-mel-bins=40 # similar to Google's setup.
--num-ceps=40 # there is no dimensionality reduction.
--low-freq=20 # low cutoff frequency for mel bins... this is high-bandwidth data, so
# there might be some information at the low end.
--high-freq=-400 # high cutoff frequency, relative to the Nyquist of 8000 (=7600)
3 changes: 3 additions & 0 deletions egs/ami/s5c_apt2141/conf/online_cmvn.conf
@@ -0,0 +1,3 @@
# configuration file for apply-cmvn-online, used in the script ../local/run_online_decoding.sh
--norm-means=true
--norm-vars=false
1 change: 1 addition & 0 deletions egs/ami/s5c_apt2141/conf/pitch.conf
@@ -0,0 +1 @@
--sample-frequency=16000
1 change: 1 addition & 0 deletions egs/ami/s5c_apt2141/diarization