salmon 1.3.0

rob-p released this 04 Jul 05:47

salmon 1.3.0 Release notes

Happy 4th of July ( 🇺🇸 🎆 )

Bug fixes & improvements

🎁 Improvements

Fragments that best-map to decoys are now written in the output SAM file if the --writeMappings option is provided. To make filtering of decoy and non-decoy alignments easier, all alignments now include a tag in their SAM record. Alignments to a valid (non-decoy) target are tagged with XT:A:T, and those to decoys are tagged with XT:A:D. The conditions for a decoy mapping to be written to the file are as follows:
1. There is no valid mapping to a non-decoy target. That is, all mappings to valid (non-decoy) targets must have alignment score < decoyThreshold * bestDecoyScore.
2. Only best-scoring decoy alignments are written to file. Thus, if there are sub-optimal decoy alignments that are still better than alignments to valid targets, they will not appear in the output SAM file.
3. If decoy alignments are written (condition 1 is satisfied), then all equally-best decoy alignments are written to file (i.e. a decoy fragment can still multi-map).
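The three conditions above can be sketched as a small decision function. This is only an illustration of the stated rules, not salmon's actual code: the function name, the `(name, score)` tuples, and the `decoy_threshold` value of 1.0 used in the example are all assumptions made for the sketch.

```python
from typing import List, Tuple

Alignment = Tuple[str, float]  # (reference name, alignment score); hypothetical layout


def decoys_to_write(valid: List[Alignment],
                    decoys: List[Alignment],
                    decoy_threshold: float) -> List[Alignment]:
    """Return the decoy alignments that would be written for one fragment."""
    if not decoys:
        return []
    best_decoy = max(score for _, score in decoys)
    # Condition 1: every valid-target alignment must score below
    # decoy_threshold * best_decoy; otherwise a valid mapping exists.
    if any(score >= decoy_threshold * best_decoy for _, score in valid):
        return []
    # Conditions 2 and 3: keep only best-scoring decoys, including all ties.
    return [(name, score) for name, score in decoys if score == best_decoy]


# Illustrative values only (decoy_threshold=1.0 is an assumed setting).
written = decoys_to_write(
    valid=[("tx1", 50.0)],
    decoys=[("d1", 90.0), ("d2", 90.0), ("d3", 70.0)],
    decoy_threshold=1.0,
)
```

In this example `d3` scores higher than the valid alignment to `tx1`, but it is sub-optimal among the decoys and so is not written, which matches condition 2.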
In the SAM file produced with the --writeMappings option, the header lines now include tags to designate each reference sequence as being a decoy or not. Sequence lines (@SQ lines) that correspond to valid targets contain the tag DS:T, while those corresponding to decoys contain the tag DS:D. Note: In alignment-based mode, salmon will not process SAM/BAM files with decoy entries (to avoid usage errors, since decoy alignment is not intended for quantification). So if, for some reason, you are using a salmon-generated SAM file containing decoy sequences and alignment records, you must remove them before quantifying using alignment-based mode (i.e. remove all headers with DS:D and all alignment records with XT:A:D). Details about how to perform that transformation can be found here.
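As a rough illustration of the decoy-removal step described above, the following sketch filters a plain-text SAM stream using only the DS:D and XT:A:D tags. It is not the procedure linked above, just a minimal example that assumes uncompressed, tab-delimited SAM text (not BAM); the function name `strip_decoys` is hypothetical.

```python
from typing import Iterable, Iterator


def strip_decoys(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield SAM lines with decoy @SQ headers and decoy alignment records removed."""
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        # Header: drop @SQ lines that carry the decoy tag DS:D.
        if fields[0] == "@SQ" and "DS:D" in fields[1:]:
            continue
        # Alignment record: optional tags start after the 11 mandatory fields.
        if not line.startswith("@") and "XT:A:D" in fields[11:]:
            continue
        yield line


# Small in-memory example; names and sequences are made up.
example = [
    "@HD\tVN:1.0\n",
    "@SQ\tSN:tx1\tLN:100\tDS:T\n",
    "@SQ\tSN:decoy1\tLN:1000\tDS:D\n",
    "r1\t0\ttx1\t1\t255\t4M\t*\t0\t0\tACGT\t*\tXT:A:T\n",
    "r2\t0\tdecoy1\t1\t255\t4M\t*\t0\t0\tACGT\t*\tXT:A:D\n",
]
kept = list(strip_decoys(example))
```

To apply it to a file, pass an open file handle as `sam_lines` and write the yielded lines to a new file.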
This release considerably improves speed when aligning poor-quality reads. This is possible due to upstream changes in pufferfish implemented by @mohsenzakeri. The aligner can now exit early if it becomes clear at any point during alignment that a valid score cannot be obtained. This reduces the computation spent evaluating poor alignments that will not pass subsequent filtering (addresses #527 and #537).
Homopolymer seeds are now skipped during mapping and alignment. In pathological datasets, these seeds could cause unnecessarily slow mapping without any improvement to the actual mapping rate (i.e. they could generate many poor mappings that would fail alignment). This change can speed up mapping in such datasets (addresses #527 and #537).
Three new filtering flags have been added to improve both sensitivity and speed. They determine how mappings are filtered at different stages. The previous behavior (that of salmon v1.0.0 through 1.2.1) can be obtained by setting --preMergeChainSubThresh 1.0, --postMergeChainSubThresh x, --orphanChainSubThresh x where x is (1.0 - --consensusSlack); by default this corresponds to x = 0.65.
- --preMergeChainSubThresh : The threshold of sub-optimal chains, compared to the best chain on a given target, that will be retained and passed to the next phase of mapping. Specifically, if the best chain for a read (or read-end in paired-end mode) to target t has score X_t, then all chains for this read with score >= X_t * preMergeChainSubThresh will be retained and passed to subsequent mapping phases. This value must be in the range [0, 1]. Its default value is 0.75 for paired-end data and 1.0 for single-end data.
- --postMergeChainSubThresh : The threshold of sub-optimal chain pairs, compared to the best chain pair on a given target, that will be retained and passed to the next phase of mapping. This is different than preMergeChainSubThresh, because this is applied to pairs of chains (from the ends of paired-end reads) after merging (i.e. after checking concordancy constraints etc.). Specifically, if the best chain pair to target t has score X_t, then all chain pairs for this read pair with score >= X_t * postMergeChainSubThresh will be retained and passed to subsequent mapping phases. This value must be in the range [0, 1]. The default value for this parameter is 0.9. Note: This option is only meaningful for paired-end libraries, and is ignored for single-end libraries.
- --orphanChainSubThresh : This threshold sets a global sub-optimality threshold for chains corresponding to orphan mappings. That is, if the merging procedure results in no concordant mappings then only orphan mappings with a chain score >= orphanChainSubThresh * bestChainScore will be retained and passed to subsequent mapping phases. This value must be in the range [0, 1]. Unlike the --preMergeChainSubThresh and --postMergeChainSubThresh options, this threshold is global with respect to all orphan chains (not simply per-target). From that perspective, you can view it as overriding the value of --consensusSlack in the case of orphan mappings. Note: This option is only meaningful for paired-end libraries, and is ignored for single-end libraries.
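The difference between the per-target thresholds and the global orphan threshold can be sketched as follows. This is a simplified illustration of the rules stated above, not salmon's implementation; the function names and the `(target, score)` chain representation are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Chain = Tuple[str, float]  # (target name, chain score); hypothetical layout


def filter_per_target(chains: List[Chain], sub_thresh: float) -> List[Chain]:
    """Keep chains scoring >= sub_thresh * (best chain score on the same target)."""
    best: Dict[str, float] = {}
    for target, score in chains:
        best[target] = max(score, best.get(target, score))
    return [(t, s) for t, s in chains if s >= sub_thresh * best[t]]


def filter_orphans(chains: List[Chain], orphan_thresh: float) -> List[Chain]:
    """Keep orphan chains scoring >= orphan_thresh * (best score over all targets)."""
    if not chains:
        return []
    best_overall = max(s for _, s in chains)
    return [(t, s) for t, s in chains if s >= orphan_thresh * best_overall]


chains = [("t1", 100.0), ("t1", 80.0), ("t1", 70.0), ("t2", 40.0)]
per_target = filter_per_target(chains, 0.75)  # t2's chain survives: best on its own target
orphans = filter_orphans(chains, 0.65)        # t2's chain is dropped: 40 < 0.65 * 100
```

The chain on `t2` shows the distinction: it is the best chain on its own target, so a per-target threshold keeps it, while the global orphan threshold compares it against the best chain overall and discards it.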
The default --mismatchSeedSkip was changed from 5 to 3.
Updated the required LibGFF dependency to v2.0.0. If you already have this installed on your system, you can pass the hint to the location to cmake using -DLIB_GFF_PATH or -DGFF_ROOT.
Added the standard "CellRanger" tags, CB:Z and UR:Z, to the alignment records reported by alevin when the user passes the --writeMappings flag.
Moved from (deprecated) tbb::atomic<double> to std::atomic<double> throughout the codebase, including accounting for the lack of a compare_and_swap method on the latter.
Changed the default gap-open penalty to 6 (from 4). This makes any gap less preferred compared to a mismatch. Note: How to properly set the default scoring scheme, as well as how to set an ideal alignment quality threshold (i.e. what is the lowest quality alignment one should allow), is not a straightforward question. This change in default accords with our belief that gaps should be penalized more in typical data. However, the ideal settings for such parameters are certainly worthy of more in-depth study, and we are looking into both empirical and theoretical mechanisms for determining how these parameters can best be set. To obtain the old (pre 1.3.0) scoring scheme, simply pass --go 4 on the command line. You can also experiment with even more stringent gap penalties by increasing --go for gap open (current default 6) and --ge for gap extend (current default 2).
Changed warning message color from yellow to magenta to make it readable on both light and dark backgrounds (addresses #541).
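To make the effect of the new default concrete, here is a small worked comparison of gap costs under the old and new settings. The cost formula is an assumption: the sketch charges `gap_open + length * gap_extend`, whereas some aligners charge `gap_open + (length - 1) * gap_extend`. Under either convention the new default adds 2 to the cost of every gap.

```python
def gap_cost(length: int, gap_open: int, gap_extend: int) -> int:
    """Affine gap penalty, assuming cost = gap_open + length * gap_extend."""
    return gap_open + length * gap_extend


# Compare the pre-1.3.0 default (--go 4) with the new default (--go 6),
# both with the gap-extend default of 2 stated above.
for k in (1, 2, 3):
    old = gap_cost(k, gap_open=4, gap_extend=2)
    new = gap_cost(k, gap_open=6, gap_extend=2)
    print(f"gap length {k}: old cost {old}, new cost {new}")
```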
Emojis in release notes 😃.

🐛 Bug fixes

Improved selective-alignment speed in a pathological case involving isolated homopolymer MEM chains. Thanks to @red-plant for raising the issue (with reproducible data) in #527.
Custom barcode lengths for the --citeseq mode were disabled. This has been fixed in #531, and the --citeseq single-cell protocol can now be used along with --end --barcodeLength --umiLength triplets. Thanks @rfarouni for reporting this.
The variance estimates reported by the --numCellBootstraps option in alevin were not corrected for bias. This has been corrected to report unbiased estimates by multiplying the variance matrix by n/(n-1).
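The bias correction mentioned above is the standard rescaling from a divide-by-n variance to a divide-by-(n-1) variance. A minimal sketch on a single list of values (alevin applies the factor to a variance matrix; the scalar helper and sample values here are only illustrative):

```python
import statistics


def unbiased_from_biased(biased_var: float, n: int) -> float:
    """Rescale a divide-by-n variance to the unbiased divide-by-(n-1) estimate."""
    if n < 2:
        raise ValueError("need at least two samples for an unbiased estimate")
    return biased_var * n / (n - 1)


samples = [2.0, 4.0, 4.0, 6.0]            # made-up bootstrap values
biased = statistics.pvariance(samples)     # population variance, divides by n
corrected = unbiased_from_biased(biased, len(samples))
reference = statistics.variance(samples)   # sample variance, divides by n - 1
```

Here `corrected` agrees with `reference` up to floating-point rounding, which is what the n/(n-1) factor is meant to achieve.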
Fixed linking order issue that could, on rare custom compiles of salmon, cause memory to be allocated by TBB and freed by jemalloc (resulting in a segfault). Thanks to @mathog and davidtgoldblatt for helping to track down and resolve this one!
Fixed a regression that could cause an overhanging read in a read pair not to be marked as a dovetail when it should have been. This could result in assignment preference for transcripts where the dovetailing read overhangs the transcript start.
Fixed a bug that could occur in certain cases of between-MEM alignment where an alignment score that was too high could be attributed to a mapping. This could occur when there were overlapping MEMs in the chain on the reference (a bit uncommon), and when the size of the overlap was different on the read and reference. This bug has been fixed by properly adjusting the score in all cases.
The dynamic and asynchronous update of the fragment length distribution could cause fluctuations in fragment-level conditional probabilities within the set of alignments for a given fragment. This could lead to an unexpected result where sequence-duplicate transcripts were inferred to have unequal abundance. The current release addresses this behavior by employing a fragment length distribution cache to ensure there are no fluctuations in conditional fragment length probabilities among the set of alignments for a given fragment. Note: This behavior is expected only to have affected atypical salmon usage, as duplicate transcripts are collapsed / discarded by default during indexing.
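The idea behind the cache can be illustrated with a toy model. This is not salmon's implementation: the class, the function, and the in-loop `observe` call (which stands in for an update made concurrently by another thread) are all assumptions made for the sketch.

```python
from typing import Dict, List


class FragLengthDist:
    """Toy empirical fragment length distribution that changes as data arrives."""

    def __init__(self) -> None:
        self.counts: Dict[int, int] = {}
        self.total = 0

    def observe(self, length: int) -> None:
        self.counts[length] = self.counts.get(length, 0) + 1
        self.total += 1

    def prob(self, length: int) -> float:
        return self.counts.get(length, 0) / self.total if self.total else 0.0

    def snapshot(self) -> "FragLengthDist":
        snap = FragLengthDist()
        snap.counts = dict(self.counts)
        snap.total = self.total
        return snap


def alignment_probs(live: FragLengthDist, implied_lengths: List[int],
                    use_cache: bool) -> List[float]:
    """Score each alignment of one fragment by its implied fragment length."""
    dist = live.snapshot() if use_cache else live
    probs = []
    for length in implied_lengths:
        probs.append(dist.prob(length))
        live.observe(length)  # simulates an asynchronous update mid-fragment
    return probs


def fresh() -> FragLengthDist:
    d = FragLengthDist()
    d.observe(200)
    d.observe(300)
    return d


# Two alignments (e.g. to duplicate transcripts) imply the same fragment length.
uncached = alignment_probs(fresh(), [200, 200], use_cache=False)
cached = alignment_probs(fresh(), [200, 200], use_cache=True)
```

Without the snapshot, the two identical alignments receive different probabilities because the distribution changed between evaluations; with the snapshot they are scored identically.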
