Releases: lhotse-speech/lhotse
Himalayan Salt Bath
This is mostly a bug, documentation, and installation fix release.
New features
Add dynamic LRU cache for audio and feature I/O (#443)
Lhotse now caches up to 512 audio/feature arrays read from disk, to speed up some common usage patterns.
This behavior can be disabled with `lhotse.set_caching_enabled(False)`.
Recipe improvements
- Update timit recipe by @luomingshuang in #446
- Fixes to babel recipe by @jtrmal in #447
- Fixed bug in preparing AMI supervisions by @desh2608 in #449
General improvements
- Add multichannel export to Kaldi by @jtrmal in #441
- Fix documentation builds by @pzelasko in #448
- fix typo of np.stack by @glynpu in #450
- Rich exception info for I/O in Recording, MonoCut and MixedCut by @pzelasko in #454
- Fix SpecAugment docs by @pzelasko in #463
- Fix misleading manifest format in the comments by @stachu86 in #460
- Ensure the users who installed PyTorch also install torchaudio themselves by @pzelasko in #466
- Add documentation and tests for Kaldi feature extraction layers by @pzelasko in #467
Spontaneous Combustion
Corpora
New
Improved
- Add and update timit recipe by @luomingshuang in #423
- Prepare segment level supervisions for AMI recipe by @desh2608 in #429
- Add ASR data prep for CH Am EN by @jtrmal in #431
New features
Faster Kaldi-compatible feature extraction on CPU and CUDA using kaldifeat
- Add Lhotse wrappers for kaldifeat-based feature extractors by @pzelasko in #424
- Method for batch feature extraction from CutSet that supports CUDA extractors by @pzelasko in #422
Others
- CLI manipulation: `subset` cmd adds a `cut_ids` argument by @oplatek in #425
- Function to plot alignments in a Cut by @pzelasko in #436
General improvements
- Support for PyTorch 1.10.0 and 1.8.2 by @pzelasko in #427
- Adding durations without FP precision issues by @pzelasko in #428
- Fix docstrings by @jtrmal in #430
- Add the auto-generated files/dirs to the ignore list by @jtrmal in #432
- [Kaldi-related] Mapping underscores for utt-ids + num-jobs option for import by @pzelasko in #435
- Set the channel info correctly by @jtrmal in #437
- Make the CallHome ASR utterances follow a specific format for QOL by @jtrmal in #442
Happy Yodeling
Corpora
- HiFiTTS data prep (#410)
- CMU Indic data prep (#411)
- ADEPT data prep (#413)
- Reduced memory usage in CommonVoice data prep (#414)
- fixes in Switchboard recipe (#419)
New features
- Saving and restoring the sampler's state for resuming training exactly where it left off (#417, #418)
General improvements
Spacious Glacier
Corpora
- CommonVoice data preparation recipe (#400)
Breaking changes
- Removed `len()` from CutSamplers (#392)
General improvements
- Options to preserve cut IDs in cut operations and transforms (#405)
- Faster `import lhotse` and CLI start-up (#403)
- Versioning improvements (#401)
- Fixes in `cuts.subset()` (#395, #398, thanks @janvainer)
- Count cuts discarded with `.filter()` in the sampler report (#391)
- Option for `ZipSampler` to return tuples of CutSets (#390)
Thin Ribbon of Snow
Breaking changes
- Lhotse `CutSampler` classes now return mini-batch `CutSet`s instead of lists of string cut IDs (Lhotse Dataset classes are adjusted correspondingly) (#345)
- Cut refactoring: `Cut` is now an abstract base class for all cut types; what was previously called `Cut` is now called `MonoCut` (#328)
- CLI: `lhotse obtain` is now `lhotse download` (#329)
Corpora
- TIMIT (#324 thanks @luomingshuang)
- Fisher English (#374 thanks @videodanchik)
- Fisher Spanish (#376 thanks @videodanchik)
- yesno (#380 thanks @csukuangfj)
- improvements to GigaSpeech recipe (#329 #334 #337 #381 thanks @jimbozhang)
- including word alignments in LibriSpeech recipe (#379)
New features
CutSampler improvements (PyTorch data API)
- `ZipSampler` for batches constructed from different cut sources (#344, #347, #363, thanks for fixes @janvainer)
- `drop_last` option and `get_report()` method for cut samplers (#357)
- `find_pessimistic_batches` utility to help fail fast with GPU OOM (#358)
- Streaming variant of shuffling for lazy CutSets in samplers (#359)
- A bucketing method with equal cumulative bucket duration for `BucketingSampler` (#365)
- Approximate proportional sampling in `BucketingSampler` (#372)
I/O improvements
- chunked OPUS file reads (#339)
- chunked sphere file reads (#367 thanks @videodanchik)
- Faster `OnTheFlyFeatures` (padding audio instead of features) (#352)
- `ChunkedLilcomHdf5Writer` (and reader) for efficient chunked reads of lilcom-compressed arrays (#334)
- A global cache for re-using smart_open connection sessions (improves performance for repeated smart_open calls, e.g. to S3) (#335, thanks @oplatek)
Data augmentation
- tempo perturbation (#375 thanks @janvainer)
- volume perturbation (#382 thanks @videodanchik)
Others
- `CutSet.trim_to_supervisions` has new arguments for including actual acoustic context next to the supervisions (#330, #331)
- `SupervisionSegment` is now mutable (and all Lhotse manifests will remain mutable) (#333)
- `.shuffle()` method for Lhotse `*Set` classes (#341)
- `lhotse fix` CLI (#360)
- `lhotse install-sph2pipe` for handling LDC corpora compressed with `shorten` (auto-registers sph2pipe, so no further actions are needed) (#370)
General improvements
- refreshed docs (#327 #328 #330)
- improvements to downloading corpora (#340)
- experimental dataloader that allows two levels of parallelism (#343, might be abandoned for other alternatives)
- auto-detection of compatible torchaudio version for pytorch (#348)
- improvements to Kaldi data dir import/export (#351 #354)
- Fixed cut ordering in `CutSet.subset(cut_ids=...)` (#353)
- Improvements to storing cuts as recordings (#355)
- Refactored the `lhotse.dataset.sampling` file into a directory module (#366)
- Improvements to CLI (#369, #371, thanks @songmeixu)
- improvements to setup (#377 #383 thanks @songmeixu)
- Colab notebook with ESPnet + Lhotse example (#384)
- improvements to Lhotse versioning (#385)
Melting Away
New corpora
- GigaSpeech (#283, thanks @jimbozhang)
- Dihard 3 (#287, thanks @desh2608)
- GALE Arabic and Mandarin (#296, thanks @desh2608)
- CMU and CSLU Kids (#297, thanks @desh2608)
- MTedX (#301, thanks @m-wiesner)
- LibriTTS (#306)
New features
- Reading huge manifests lazily with Apache Arrow (documentation and examples are coming) (#286, #288, #289, #290, #292, #294)
- Sequential JSONL writer storing manifests on disk as they are created (#302)
- Support for alignments in `SupervisionSegment` (#304, #310, #313, thanks @desh2608)
- PyTorch Kaldi-compatible feature extractors that support GPU, batching, and autograd (#307, thanks @jesus-villalba)
- Reading, writing, and uploading features to URLs (e.g. S3 or GCP) (#312)
- Store waveforms of cuts as audio recordings to disk (#316, thanks @entn-at)
- Support for importing Kaldi's feats.scp and reading features directly from scp/ark (#318)
General improvements
- Add multi-threaded processing of AIShell data (#259, thanks @pingfengluo)
- tracking dev versions (#291, thanks @oplatek)
- Explicitly set UTF-8 encoding when reading README.md in setup.py (#293, thanks @entn-at)
- Auto-add link to source code in docs (#295)
- `cut.resample()` (#299)
- Fixing flaky tests (#300)
- fix AMI CLI mode (#303, thanks @desh2608)
- handle zero energy error in audio mixing (#305)
- update Kaldi related docs (#308)
- add a missing SpecAugment parameter (#309)
- fixing edge cases for audio transforms (#311)
- Add `drop_last` option in `*Set.split()` (#315)
- Support h5py file modes in feature writers (#317)
- Don't use Kaldi's reco2dur, and fix some errors in bin/lhotse (#318, thanks @shanguanma)
- Fix a bug in the cut's number of samples (#322, thanks @dophist)
- Use whitespace in Kaldi field-splitting (#323, thanks @dophist)
Thawing Potion
New corpora
- CMU Arctic (#225)
- L2 Arctic (#227, #251)
- VCTK (#228, #253, #254)
- CallHome English (#278)
- CallHome Egyptian (#208)
- Multilingual Librispeech (#282 -- can be quite slow but we plan on improving the speed in further releases)
Features
PyTorch Dataset API
- Lhotse's samplers are now fully deterministic, have `len()` that returns the number of batches, and return a consistent number of batches in all distributed workers (#213, #222, #223, #224, #255, #267, thanks @janvainer)
- On-the-fly feature extraction in PyTorch datasets (#229)
- visualisations of ASR batches with multiple transforms applied (#234)
Features and transforms
- Add `LibrosaFbank` consistent with various TTS applications (#252, thanks @janvainer)
- SpecAugment (#246)
- option to pad cuts from left/right/both directions (#216)
- Randomized smoothing augmentation (#272, #273, #274)
- Randomized extra padding (#281)
I/O and serialization
- [experimental] Downloading audio from HTTP/S3/GCP/Azure URLs upon request (#233)
- Use HDF5 as the default storage backend for features (#237)
- Add JSONL support (#262)
- Support for auto-magically determined serializers (CLI + Python API) (#264)
Removed features
- Removed WavAugment support (use torchaudio.sox_effects instead) (#232)
General improvements
- Add tolerance to validate_recordings_and_supervisions (#208, thanks @janvainer)
- Fix incorrect truncation in cut mixing for data augmentation (#214)
- Return lengths from feature and token collations (#211, thanks @janvainer)
- Refactor Standardize to GlobalMVN (#230, thanks @janvainer)
- Fix rare error in randomized Recording's resampling test (#239)
- Fix concatenate cuts omitting the longest cut when duration_factor > 1 (#240)
- Fix CutMix not adding enough noise in long cuts (#241)
- Add max_cuts keyword to global stats computation in GlobalMVN (#245)
- Improved error message for mixing audio (#248)
- Add a check for matching sampling rates when mixing cuts (#247)
- Fix - make VCTK CLI discoverable (#250)
- Fix trim_to_supervisions and CLI (#249)
- Fix find segments float rounding issues (#265)
- Update the examples for LibriSpeech (full) and AMI (#268, thanks @jimbozhang)
- Fix test for sphere files (#269, thanks @csukuangfj)
- More informative error message for incorrect channels in load_audio() (#270)
0.5 - Ice Melt
New features:
Major overhaul of support for PyTorch Dataset API (#194 #197 #202)
Lhotse now implements a number of PyTorch datasets and samplers. The core features are:
- Familiar API (map-style datasets and cut samplers that work with standard `DataLoader`)
- Dynamic batch size, chosen based on constraints such as `max_frames`
- Bucketing or cut concatenation as strategies for avoiding too much padding
- Optional noise padding (using the `CutMix` transform)
- Our samplers work with DDP training out-of-the-box (no need for `DistributedSampler`)
- More details available at: https://lhotse.readthedocs.io/en/latest/datasets.html
Example code:
```python
from torch.utils.data import DataLoader
from lhotse import CutSet
from lhotse.dataset import SpeechRecognitionDataset, SingleCutSampler

cuts = CutSet(...)
dset = SpeechRecognitionDataset(cuts)
sampler = SingleCutSampler(cuts, max_frames=50000)
# The Dataset performs batching by itself, so we have to indicate that
# to the DataLoader with batch_size=None
dloader = DataLoader(dset, sampler=sampler, batch_size=None, num_workers=1)
for batch in dloader:
    ...  # process data
```
Lazy (on-the-fly) resampling on Recording/RecordingSet (#185)
The resampling is performed at the moment of reading the audio samples from disk. It automatically adjusts the duration/num_samples in the data manifest.
```python
recording = recording.resample(22050)
recording_set = recording_set.resample(8000)
```
New corpora:
General improvements:
- `CutSet.subset()` got `first` and `last` arguments (like Kaldi's `subset_data_dir.sh`) and a CLI mode (#188)
- `CutSet.from_manifest()` creates deterministic Cut IDs by default (#186)
- Padding cuts with arbitrary user-specified values (now also works with custom feature extractors) (#187)
- Improved code coverage measurements (now excludes test code and recipe code) (#191 #192)
- Improved support for sampling rates other than 8k and 16k (#190 #195)
- Documentation build fixes (#196)
- Fixes in NSC recipe (#199)
- Fixes in ASR dataset validation (#204)
0.4 - Passing the North Glacier
New features:
- Lazy time-domain speed perturbation of Recording/Cut that also adjusts supervision segments (#167)
```python
cuts_sp = cuts.perturb_speed(0.9)
```
- Manifest validation (`lhotse.validate()`) (#175)
```python
lhotse.validate(cuts)
```
- Parallel feature extraction API lifting (#176)
```python
# As simple as:
cuts = cuts.compute_and_store_features(lhotse.Fbank(), 'path/to/feats', num_jobs=20)
```
- Support for using HDF5 storage with parallel feature extraction (#176)
```python
# Modify the above with:
cuts = cuts.compute_and_store_features(
    lhotse.Fbank(), 'path/to/feats',
    num_jobs=20, storage_type=lhotse.LilcomHdf5Storage,
)
```
- `CutSet` mixing for noise data augmentation (#180)
```python
# Can be performed after feature extraction for dynamic feature-domain mixing!
cuts = cuts.mix(noise_cuts, snr=[10, 30], mix_prob=0.5)
```
- On-the-fly noise data augmentation for K2 ASR (#180)
New corpora:
General improvements:
- LibriSpeech recipe API lifting and major preparation speedup (#163)
- Stop using deprecated torchaudio.info (#164)
- CutSet `map()` and `modify_ids()` methods (#165)
- Parallelism: Executor concept documentation (#152)
- Single/multi channel audio/features collation methods for a batch of Cuts (#173)
- Cache data manifests for Mobvoi (#168, thanks @freewym)
- High-level workflow illustrations in docs (#178)