Releases: lhotse-speech/lhotse

Himalayan Salt Bath

12 Nov 01:52

This is mostly a bug, documentation, and installation fix release.

New features

Add dynamic LRU cache for audio and feature I/O (#443)

Lhotse now caches up to 512 audio/feature arrays read from disk, which speeds up some common usage patterns.
This behavior can be disabled with lhotse.set_caching_enabled(False).
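
To illustrate the idea of a bounded LRU cache with an on/off switch, here is a minimal sketch built on Python's functools.lru_cache. It is not Lhotse's implementation: _read_from_disk, read and the local set_caching_enabled are hypothetical stand-ins written for this example, and only the 512-entry limit is taken from the note above.

```python
from functools import lru_cache

_caching_enabled = True  # hypothetical module-level switch
read_count = 0           # counts simulated disk reads, for demonstration only


def _read_from_disk(path: str) -> bytes:
    """Stand-in for an expensive audio/feature read."""
    global read_count
    read_count += 1
    return f"data for {path}".encode()


# Keep at most 512 most-recently-used results, mirroring the limit above.
_cached_read = lru_cache(maxsize=512)(_read_from_disk)


def set_caching_enabled(enabled: bool) -> None:
    """Toggle caching; drop cached entries when turning it off."""
    global _caching_enabled
    _caching_enabled = enabled
    if not enabled:
        _cached_read.cache_clear()


def read(path: str) -> bytes:
    """Read through the cache when enabled, otherwise read directly."""
    return _cached_read(path) if _caching_enabled else _read_from_disk(path)


read("a.wav")
read("a.wav")  # served from the cache, no second disk read
reads_with_cache = read_count

set_caching_enabled(False)
read("a.wav")
read("a.wav")  # both calls go to disk
reads_without_cache = read_count - reads_with_cache
```

Disabling the cache trades speed for memory, which may matter when the same arrays are rarely read twice.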

Recipe improvements

General improvements

  • add export multichannel to kaldi by @jtrmal in #441
  • Fix documentation builds by @pzelasko in #448
  • fix typo of np.stack by @glynpu in #450
  • Rich exception info for I/O in Recording, MonoCut and MixedCut by @pzelasko in #454
  • Fix SpecAugment docs by @pzelasko in #463
  • Fix misleading manifest format in the comments by @stachu86 in #460
  • Ensure the users who installed PyTorch also install torchaudio themselves by @pzelasko in #466
  • Add documentation and tests for Kaldi feature extraction layers by @pzelasko in #467

New Contributors

Spontaneous Combustion

03 Nov 12:53

Corpora

New

Improved

New features

Faster Kaldi-compatible feature extraction on CPU and CUDA using kaldifeat

  • Add Lhotse wrappers for kaldifeat-based feature extractors by @pzelasko in #424
  • Method for batch feature extraction from CutSet that supports CUDA extractors by @pzelasko in #422

Others

  • CLI manipulation subset cmd adds cut_ids arguments by @oplatek in #425
  • Function to plot alignments in a Cut by @pzelasko in #436

General improvements

  • Support for PyTorch 1.10.0 and 1.8.2 by @pzelasko in #427
  • Adding durations without FP precision issues by @pzelasko in #428
  • fix doc strings by @jtrmal in #430
  • add the auto-generated files/dirs into ignore list by @jtrmal in #432
  • [Kaldi-related] Mapping underscores for utt-ids + num-jobs option for import by @pzelasko in #435
  • set the channel info correctly by @jtrmal in #437
  • make the CallHome ASR utterances follow a specific format for QOL by @jtrmal in #442
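
As background for the "adding durations without FP precision issues" item (#428), the sketch below shows why naive float addition can drift and one way to avoid it: summing in decimal arithmetic and snapping the result to a whole number of samples. The function name add_durations_exact and its signature are assumptions made for this example, not necessarily what the PR implements.

```python
from decimal import ROUND_HALF_UP, Decimal


def add_durations_exact(*durations: float, sampling_rate: int) -> float:
    """Sum durations (seconds) and round to the nearest audio sample."""
    # Decimal(str(d)) keeps the value as written, e.g. 0.1 stays exactly 0.1.
    total = sum(Decimal(str(d)) for d in durations)
    # Snap the total to an integer number of samples.
    num_samples = (total * sampling_rate).quantize(Decimal(1), rounding=ROUND_HALF_UP)
    return float(num_samples / sampling_rate)


naive = 0.1 + 0.2  # binary floating point accumulates a tiny error
exact = add_durations_exact(0.1, 0.2, sampling_rate=16000)
```

Rounding to the sample grid keeps manifest durations consistent with the number of samples actually stored in the audio.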

Happy Yodeling

14 Oct 21:22
db9f42b

Corpora

  • HiFiTTS data prep (#410)
  • CMU Indic data prep (#411)
  • ADEPT data prep (#413)
  • Reduced memory usage in CommonVoice data prep (#414)
  • fixes in Switchboard recipe (#419)

New features

  • Saving and restoring the sampler's state for resuming training exactly where it left off (#417, #418)
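
The idea of resuming exactly where training left off can be sketched with a toy sampler that records its seed, epoch and position. ResumableSampler is a hypothetical class written for this example; the state_dict()/load_state_dict() method names follow the usual PyTorch convention and are an assumption here rather than a statement about Lhotse's API.

```python
import random
from typing import Any, Dict, Iterator, List


class ResumableSampler:
    """Toy sampler that yields item IDs in a seeded, shuffled order."""

    def __init__(self, ids: List[str], seed: int = 0) -> None:
        self.ids = list(ids)
        self.seed = seed
        self.epoch = 0
        self.position = 0  # number of items already yielded in this epoch

    def state_dict(self) -> Dict[str, Any]:
        return {"seed": self.seed, "epoch": self.epoch, "position": self.position}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self.seed = state["seed"]
        self.epoch = state["epoch"]
        self.position = state["position"]

    def __iter__(self) -> Iterator[str]:
        # The order depends only on (seed, epoch), so it can be rebuilt later.
        order = list(self.ids)
        random.Random(self.seed + self.epoch).shuffle(order)
        while self.position < len(order):
            item = order[self.position]
            self.position += 1
            yield item
        self.epoch += 1
        self.position = 0


ids = [f"cut-{i}" for i in range(6)]
reference = list(ResumableSampler(ids, seed=42))  # uninterrupted run

sampler = ResumableSampler(ids, seed=42)
it = iter(sampler)
seen = [next(it), next(it)]       # "training" stops after two items
checkpoint = sampler.state_dict()

restored = ResumableSampler(ids)  # fresh object, e.g. after a restart
restored.load_state_dict(checkpoint)
resumed = seen + list(restored)   # continues with the third item
```

Because the shuffled order is a pure function of seed and epoch, storing three integers is enough to reproduce the rest of the epoch.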

General improvements

  • CutSet.describe() doesn't need pandas anymore (#412)
  • Adding torchaudio sox_io audio reading backend and making it default (#414)
  • Support for Python 3.9 (#415)
  • Adopted black for code style (#420)

Spacious Glacier

27 Sep 17:18

Corpora

  • CommonVoice data preparation recipe (#400)

Breaking changes

  • removed len() attribute from CutSamplers (#392)

General improvements

  • Options to preserve cut IDs in cut operations and transforms (#405)
  • faster import lhotse and CLI start-up (#403)
  • versioning improvements (#401)
  • fixes in cuts.subset() (#395 #398, thanks @janvainer)
  • Count cuts discarded with .filter() into the sampler report (#391)
  • Option for ZipSampler to return tuples of CutSets (#390)

Thin Ribbon of Snow

26 Aug 01:37

Breaking changes

  • Lhotse CutSampler classes now return mini-batch CutSets instead of a list of string cut IDs (Lhotse Dataset classes are adjusted correspondingly) (#345)
  • Cut refactoring (Cut is now an abstract base class for all cut types; what was previously called Cut is now called MonoCut) (#328)
  • CLI: lhotse obtain is now lhotse download (#329)

Corpora

New features

CutSampler improvements (PyTorch data API)

  • ZipSampler for batches constructed from different cut sources (#344 #347 #363 thanks for fixes @janvainer)
  • drop_last option and get_report() method for cut samplers (#357)
  • find_pessimistic_batches utility to help fail fast with GPU OOM (#358)
  • streaming variant of shuffling for lazy CutSets in samplers (#359)
  • a bucketing method with equal cumulative bucket duration for BucketingSampler (#365)
  • approximate proportional sampling in BucketingSampler (#372)
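
For intuition about bucketing with equal cumulative bucket duration (#365), the sketch below sorts cut durations and opens a new bucket whenever the current one has collected its share of the total. split_equal_duration is a hypothetical helper written for this example; the actual BucketingSampler logic may differ in details such as tie handling.

```python
from typing import List


def split_equal_duration(durations: List[float], num_buckets: int) -> List[List[float]]:
    """Sort durations and cut them into buckets of roughly equal total duration."""
    ordered = sorted(durations)
    target = sum(ordered) / num_buckets  # duration each bucket should hold
    buckets: List[List[float]] = [[]]
    running = 0.0
    for d in ordered:
        # Open a new bucket once the current one reached its share,
        # as long as there are buckets left to open.
        if running >= target and len(buckets) < num_buckets:
            buckets.append([])
            running = 0.0
        buckets[-1].append(d)
        running += d
    return buckets


durations = [1.0] * 8 + [4.0] * 2
buckets = split_equal_duration(durations, num_buckets=2)
bucket_totals = [sum(b) for b in buckets]
bucket_sizes = [len(b) for b in buckets]
```

Compared with buckets that hold an equal number of cuts, this puts many short cuts in one bucket and few long cuts in another, so each bucket covers a similar amount of audio.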

I/O improvements

  • chunked OPUS file reads (#339)
  • chunked sphere file reads (#367 thanks @videodanchik)
  • faster OnTheFlyFeatures (padding audio instead of features) (#352)
  • ChunkedLilcomHdf5Writer (and reader) for efficient chunk reads of lilcom-compressed arrays (#334)
  • a global cache for re-using smart_open connection sessions (improves performance for repeated smart_open calls e.g., to S3) (#335, thanks @oplatek)

Data augmentation

Others

  • CutSet.trim_to_supervision has new arguments for including actual acoustic context next to the supervisions (#330 #331)
  • SupervisionSegment is now mutable (and all Lhotse manifests will remain mutable) (#333)
  • .shuffle() method for Lhotse *Set classes (#341)
  • lhotse fix CLI (#360)
  • lhotse install-sph2pipe for handling LDC corpora compressed with shorten (auto-registers sph2pipe so no further actions are needed) (#370)

General improvements

  • refreshed docs (#327 #328 #330)
  • improvements to downloading corpora (#340)
  • experimental dataloader that allows two levels of parallelism (#343, might be abandoned for other alternatives)
  • auto-detection of compatible torchaudio version for pytorch (#348)
  • improvements to Kaldi data dir import/export (#351 #354)
  • fixed cut ordering in CutSet.subset(cut_ids=...) (#353)
  • improvements to storing cuts as recordings (#355)
  • refactored lhotse.dataset.sampling file into a directory module (#366)
  • improvements to CLI (#369 #371 thanks @songmeixu)
  • improvements to setup (#377 #383 thanks @songmeixu)
  • Colab notebook with ESPnet + Lhotse example (#384)
  • improvements to Lhotse versioning (#385)

Melting Away

30 Jun 22:39

New corpora

New features

  • Reading huge manifests lazily with Apache Arrow (documentation and examples are coming) (#286, #288, #289, #290, #292, #294)
  • Sequential JSONL writer storing manifests on disk as they are created (#302)
  • Support for alignments in SupervisionSegment (#304, #310, #313, thanks @desh2608)
  • PyTorch Kaldi-compatible feature extractors that support GPU, batching and autograd (#307, thanks @jesus-villalba)
  • Reading, writing, and uploading features to URLs (e.g. S3 or GCP) (#312)
  • Store waveforms of cuts as audio recordings to disk (#316, thanks @entn-at)
  • Support for importing Kaldi's feats.scp and reading features directly from scp/ark (#318)
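
The sequential JSONL writer idea (#302) can be sketched with the standard library: each manifest item is written as one JSON line as soon as it is created, so nothing has to be held in memory, and reading back can be lazy. JsonlWriter and read_jsonl are hypothetical names used only in this example.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterator


class JsonlWriter:
    """Context manager that appends one JSON object per line."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.file = None

    def __enter__(self) -> "JsonlWriter":
        self.file = open(self.path, "w", encoding="utf-8")
        return self

    def __exit__(self, *exc_info: Any) -> None:
        self.file.close()

    def write(self, item: Dict[str, Any]) -> None:
        self.file.write(json.dumps(item) + "\n")
        self.file.flush()  # the item is on disk before the next one is built


def read_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Lazily yield items one line at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


path = os.path.join(tempfile.mkdtemp(), "cuts.jsonl")
with JsonlWriter(path) as writer:
    for i in range(3):
        writer.write({"id": f"cut-{i}", "duration": 1.5 * i})

loaded = list(read_jsonl(path))
```

If the process is interrupted, the lines written so far remain a valid, partially complete manifest.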

General improvements

  • add multi-threading to AIShell data processing (#259, thanks @pingfengluo)
  • tracking dev versions (#291, thanks @oplatek)
  • Explicitly set UTF-8 encoding when reading README.md in setup.py (#293, thanks @entn-at)
  • Auto-add link to source code in docs (#295)
  • cut.resample() (#299)
  • fixing flaky tests (#300)
  • fix AMI CLI mode (#303, thanks @desh2608)
  • handle zero energy error in audio mixing (#305)
  • update Kaldi related docs (#308)
  • add a missing SpecAugment parameter (#309)
  • fixing edge cases for audio transforms (#311)
  • Add drop_last option in *Set.split() (#315)
  • Support h5py file modes in feature writers (#317)
  • stop using Kaldi reco2dur and fix some errors in bin/lhotse (#318, thanks @shanguanma)
  • Fix cut num of samples bug (#322, thanks @dophist)
  • use whitespace in kaldi field-splitting (#323, thanks @dophist)

Thawing Potion

26 Apr 22:27

New corpora

  • CMU Arctic (#225)
  • L2 Arctic (#227, #251)
  • VCTK (#228, #253, #254)
  • CallHome English (#278)
  • CallHome Egyptian (#208)
  • Multilingual Librispeech (#282 -- can be quite slow but we plan on improving the speed in further releases)

Features

PyTorch Dataset API

  • Lhotse's samplers are now fully deterministic, have len() that returns the number of batches, and return a consistent number of batches in all distributed workers. (#213, #222, #223, #224, #255, #267, thanks @janvainer)
  • On-the-fly feature extraction in PyTorch datasets (#229)
  • visualisations of ASR batches with multiple transforms applied (#234)
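
One way to obtain deterministic sampling with the same number of items in every distributed worker is sketched below: every rank shuffles with the same seed and epoch, drops the remainder that does not divide evenly, and takes every world_size-th item. rank_shard is a hypothetical helper written for this example; Lhotse's samplers may balance workers differently.

```python
import random
from typing import List


def rank_shard(ids: List[int], rank: int, world_size: int, seed: int, epoch: int) -> List[int]:
    """Return the items assigned to one worker for a given epoch."""
    order = list(ids)
    # Same (seed, epoch) on every rank gives the same global order.
    random.Random(seed + epoch).shuffle(order)
    # Drop the tail so that all ranks receive equally many items.
    usable = len(order) - len(order) % world_size
    return order[rank:usable:world_size]


ids = list(range(10))
shards = [rank_shard(ids, rank, world_size=3, seed=0, epoch=0) for rank in range(3)]
repeat = [rank_shard(ids, rank, world_size=3, seed=0, epoch=0) for rank in range(3)]
```

Equal shard sizes matter in DDP training because a worker that runs out of batches early can stall the others at the next gradient synchronization.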

Features and transforms

  • Add LibrosaFbank consistent with various TTS applications (#252, thanks @janvainer)
  • SpecAugment (#246)
  • option to pad cuts from left/right/both directions (#216)
  • Randomized smoothing augmentation (#272, #273, #274)
  • Randomized extra padding (#281)

I/O and serialization

  • [experimental] Downloading audio from HTTP/S3/GCP/Azure URLs upon request (#233)
  • Use HDF5 as the default storage backend for features (#237)
  • Add JSONL support (#262)
  • Support for auto-magically determined serializers (CLI + Python API) (#264)

Removed features

  • Removed WavAugment support (use torchaudio.sox_effects instead) (#232)

General improvements

  • Add tolerance to validate_recordings_and_supervisions (#208, thanks @janvainer)
  • Fix incorrect truncation in cut mixing for data augmentation (#214)
  • Return lengths from feature and token collations (#211, thanks @janvainer)
  • Refactor Standardize to GlobalMVN (#230, thanks @janvainer)
  • Fix rare error in randomized Recording's resampling test (#239)
  • Fix concatenate cuts omitting the longest cut when duration_factor > 1 (#240)
  • Fix CutMix not adding enough noise in long cuts (#241)
  • Add max_cuts keyword to global stats computation in GlobalMVN (#245)
  • Improved error message for mixing audio (#248)
  • Add a check for matching sampling rates when mixing cuts (#247)
  • Fix - make VCTK CLI discoverable (#250)
  • Fix trim_to_supervisions and CLI (#249)
  • Fix find segments float rounding issues (#265)
  • Update the examples of LibriSpeech (full) and AMI (#268, thanks @jimbozhang)
  • Fix test for sphere files (#269, thanks @csukuangfj)
  • More informative error message for incorrect channels in load_audio() (#270)

0.5 - Ice Melt

27 Feb 04:18

New features:

Major overhaul of support for PyTorch Dataset API (#194 #197 #202)

Lhotse now implements a number of PyTorch datasets and samplers. The core features are:

  • familiar API (map-style datasets and cut samplers that work with standard DataLoader)
  • dynamic batch size, chosen based on constraints such as max_frames
  • bucketing or cut concatenation as strategies for avoiding too much padding
  • optional noise padding (using CutMix transform)
  • our samplers work with DDP training out-of-the-box (no need for DistributedSampler)
  • More details available at: https://lhotse.readthedocs.io/en/latest/datasets.html

Example code:

from torch.utils.data import DataLoader
from lhotse import CutSet
from lhotse.dataset import SpeechRecognitionDataset, SingleCutSampler

cuts = CutSet(...)
dset = SpeechRecognitionDataset(cuts)
# The sampler decides which cuts form each mini-batch (at most 50000 frames)
sampler = SingleCutSampler(cuts, max_frames=50000)
# Dataset performs batching by itself, so we have to indicate that
# to the DataLoader with batch_size=None
dloader = DataLoader(dset, sampler=sampler, batch_size=None, num_workers=1)
for batch in dloader:
    ...  # process data
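
To make the "dynamic batch size" idea concrete, here is a standalone sketch of how a max_frames constraint can turn a stream of cuts into variable-sized batches. It uses plain (id, num_frames) tuples instead of Lhotse objects, and dynamic_batches is a hypothetical function written for this example, not the sampler's actual code.

```python
from typing import Iterable, Iterator, List, Tuple


def dynamic_batches(cuts: Iterable[Tuple[str, int]], max_frames: int) -> Iterator[List[str]]:
    """Group cut IDs so that each batch holds at most max_frames frames."""
    batch: List[str] = []
    frames = 0
    for cut_id, num_frames in cuts:
        # Close the current batch if adding this cut would exceed the limit.
        if batch and frames + num_frames > max_frames:
            yield batch
            batch, frames = [], 0
        batch.append(cut_id)
        frames += num_frames
    if batch:
        yield batch


cuts = [("a", 300), ("b", 500), ("c", 400), ("d", 900), ("e", 100)]
batches = list(dynamic_batches(cuts, max_frames=1000))
```

Batches of short cuts end up with more items than batches of long cuts, which keeps the amount of data per step roughly constant.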

Lazy (on-the-fly) resampling on Recording/RecordingSet (#185)

The resampling is performed at the moment of reading the audio samples from disk. It automatically adjusts the duration/num_samples in the data manifest.

recording = recording.resample(22050)
recording_set = recording_set.resample(8000)
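
The manifest adjustment can be pictured as simple arithmetic: the duration stays the same while the number of samples is rescaled to the new sampling rate. The sketch below uses a hypothetical RecordingInfo dataclass (not Lhotse's Recording) and assumes rounding to the nearest sample.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RecordingInfo:
    """Minimal stand-in for the audio-related fields of a recording manifest."""

    sampling_rate: int
    num_samples: int

    @property
    def duration(self) -> float:
        return self.num_samples / self.sampling_rate

    def resample(self, sampling_rate: int) -> "RecordingInfo":
        # No audio is touched here; only the manifest fields are updated.
        new_num_samples = round(self.duration * sampling_rate)
        return replace(self, sampling_rate=sampling_rate, num_samples=new_num_samples)


original = RecordingInfo(sampling_rate=16000, num_samples=160000)  # 10 seconds
resampled = original.resample(22050)
```

The original object is left unchanged, matching the usage above where the result of resample() is assigned back to a variable.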

New corpora:

  • AMI recipe extension to all microphone settings and official scenarios (#154 - kudos to @desh2608)

General improvements:

  • CutSet.subset() got first and last arguments (like Kaldi's subset_data_dir.sh) and a CLI mode (#188)
  • CutSet.from_manifest() creates deterministic Cut IDs by default (#186)
  • Padding cuts with arbitrary user specified values (now also works with custom feature extractors) (#187)
  • Improved code coverage measurements (now excludes test code and recipe code) (#191 #192)
  • Improved support for sampling rates other than 8k and 16k (#190 #195)
  • Documentation build fixes (#196)
  • Fixes in NSC recipe (#199)
  • Fixes in ASR dataset validation (#204)

0.4 - Passing the North Glacier

12 Jan 17:33

New features:

  • Lazy time-domain speed perturbation of Recording/Cut that also adjusts supervision segments (#167)
    cuts_sp = cuts.perturb_speed(0.9)
  • Manifest validation (lhotse.validate()) (#175)
    lhotse.validate(cuts)
  • Parallel feature extraction API lifting (#176)
    # As simple as:
    cuts = cuts.compute_and_store_features(lhotse.Fbank(), 'path/to/feats', num_jobs=20)
  • Support for using HDF5 storage with parallel feature extraction (#176)
    # Modify the above with:
    cuts = cuts.compute_and_store_features(lhotse.Fbank(), 'path/to/feats', num_jobs=20, storage_type=lhotse.LilcomHdf5Storage)
  • CutSet mixing for noise data augmentation (#180)
    # Can be performed after feature extraction for dynamic feature-domain mixing!
    cuts = cuts.mix(noise_cuts, snr=[10, 30], mix_prob=0.5)
  • On-the-fly noise data augmentation for K2 ASR (#180)
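
Why speed perturbation has to adjust supervision segments can be seen from the timing arithmetic: playing audio at a given speed factor divides every timestamp by that factor. The sketch below uses a hypothetical Segment dataclass and ignores rounding to whole samples, which a real implementation would likely need.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Segment:
    """Minimal stand-in for a supervision segment (times in seconds)."""

    start: float
    duration: float

    def perturb_speed(self, factor: float) -> "Segment":
        # factor < 1 slows the audio down, so the segment starts later and lasts longer.
        return replace(self, start=self.start / factor, duration=self.duration / factor)


seg = Segment(start=1.0, duration=4.0)
slow = seg.perturb_speed(0.5)  # half speed: every timestamp doubles
fast = seg.perturb_speed(2.0)  # double speed: every timestamp halves
```

Without this adjustment, transcripts and alignments would point at the wrong region of the perturbed audio.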

New corpora:

General improvements:

  • LibriSpeech recipe API lifting and major preparation speedup (#163)
  • Stop using deprecated torchaudio.info (#164)
  • CutSet map() and modify_ids() methods (#165)
  • Parallelism: Executor concept documentation (#152)
  • Single/multi channel audio/features collation methods for a batch of Cuts (#173)
  • Cache data manifests for Mobvoi (#168, thanks @freewym)
  • High-level workflow illustrations in docs (#178)

0.3 - "Oh, the weather outside is frightful"

09 Dec 16:27
a45a0c2

New features:

  • CutSet.subset and CutSet.filter_supervisions (#145, thanks @janvainer)
  • An official Colab notebook (#156)
  • Python 3.6 support (#158)
  • Support for feature normalization aka CMVN (#159, #160)
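
For readers unfamiliar with CMVN (cepstral mean and variance normalization), the sketch below normalizes each feature dimension to zero mean and unit variance using only the standard library. The cmvn function and its eps parameter are assumptions made for this example and do not reflect the interface added in #159 and #160.

```python
from statistics import fmean, pstdev
from typing import List


def cmvn(features: List[List[float]], eps: float = 1e-10) -> List[List[float]]:
    """Normalize a (frames x dims) feature matrix per dimension."""
    columns = list(zip(*features))  # one tuple per feature dimension
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    # eps guards against division by zero for constant dimensions.
    return [
        [(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
        for frame in features
    ]


feats = [[1.0, 10.0], [3.0, 30.0]]  # 2 frames, 2 feature dimensions
normalized = cmvn(feats)
```

After normalization both dimensions share the same scale even though the second one started out ten times larger.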

New corpora:

  • National Speech Corpus (Singaporean English) (#148)
  • IARPA BABEL (25 languages) (#157)

Bugfixes:

  • populate recording_id for Cut when using Cut.compute_and_store_features (#147, thanks @freewym)

Other:

  • Set default duration limit factor to 1 for K2 Iterable Dataset (#148)
  • Fix for MixedCut plots (#156)