out of memory "Parsing alignment file" #118

gjordaopiedade · 2024-03-28T09:36:43Z

Hi there! :)

I am planning to use your amazing tool for read mapping and taxonomic characterization of some metagenomes. The issue is that I do not have easy access to a high-memory node. I have played with the DIAMOND settings to reduce the memory requirements, but I think the job fails when it tries to parse the alignment file out.RAT.unclassified_unmapped.alignment.diamond, which is often larger than the available memory.

Perhaps there is something I can do to solve this issue?

Great job!
Cheers,
Gonçalo

Please let me know if you need any extra info!

out.RAT.log

[2024-03-27 17:24:03] Homology search with DIAMOND is starting. Please be patient. Do not forget to cite DIAMOND when using CAT or BAT in your publication.
query: None
database: /projects/0/prjs0815/databases/CAT/20231120_CAT_nr/db/2023-11-21_CAT.dmnd
mode: fast
top: 11
no-self-hits: False
number of cores: 64
block-size (billions of letters): 8.0
index-chunks: 8
tmpdir: ./
compress: 0
blast flavour: blastx
[2024-03-27 19:30:14] Homology search done! File ./out.RAT.unclassified_unmapped.alignment.diamond created.
[2024-03-27 19:30:14] Loading file /projects/0/prjs0815/databases/CAT/20231120_CAT_nr/tax/nodes.dmp.
[2024-03-27 19:30:17] Parsing alignment file ./out.RAT.unclassified_unmapped.alignment.diamond.

slurm error file:

[M::mem_process_seqs] Processed 3887138 reads in 354.034 CPU sec, 6.475 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 64 /projects/0/prjs0815/Projects/Rota-biotic/157633/METAG/157633_2210063V4210222_METAG/RAT_CAT/scaffolds_s.min500.fasta /projects/0/prjs0815/Projects/Rota-biotic/157633/METAG/157633_2210063V4210222_METAG/concat/157633_2210063V4210222_METAG.1.fq.gz /projects/0/prjs0815/Projects/Rota-biotic/157633/METAG/157633_2210063V4210222_METAG/concat/157633_2210063V4210222_METAG.2.fq.gz
[main] Real time: 231.080 sec; CPU: 7431.596 sec
[bam_sort_core] merging from 0 files and 64 in-memory blocks...
/var/spool/slurm/slurmd/job5710205/slurm_script: line 82: 3489247 Killed CAT_pack reads --mode mcr -n 64 --index_chunks 8 --block_size 8 --bin_suffix fa -b bin/ -c $contigs -1 $R1 -2 $R2 -d $db -t $tax
[2024-03-27 19:38:36] ERROR: input file ./out.RAT.complete.abundance.txt does not exist.
[2024-03-27 19:38:36] ERROR: input file ./out.RAT.contig.abundance.txt does not exist.
slurmstepd: error: Detected 1 oom_kill event in StepId=5710205.batch. Some of the step tasks have been OOM Killed.

ls CAT_output:

$ ls -lh
total 114G
drwxr-s--- 2 mendelab prjs0815 4.0K Mar 27 15:58 bin
-rw-r----- 1 mendelab prjs0815 1.6K Mar 27 17:11 out.RAT.BAT.bin2classification.txt
-rw-r----- 1 mendelab prjs0815 4.0K Mar 27 19:38 out.RAT.BAT.bin2classification.txt.with_names.txt
-rw-r----- 1 mendelab prjs0815 1.7K Mar 27 17:11 out.RAT.BAT.log
-rw-r----- 1 mendelab prjs0815 1.1M Mar 27 17:11 out.RAT.BAT.ORF2LCA.txt
-rw-r----- 1 mendelab prjs0815 1.1G Mar 27 16:56 out.RAT.CAT.alignment.diamond
-rw-r----- 1 mendelab prjs0815 866K Mar 27 17:05 out.RAT.CAT.contig2classification.txt
-rw-r----- 1 mendelab prjs0815 2.8K Mar 27 17:05 out.RAT.CAT.log
-rw-r----- 1 mendelab prjs0815 1.5M Mar 27 17:05 out.RAT.CAT.ORF2LCA.txt
-rw-r----- 1 mendelab prjs0815 12M Mar 27 16:10 out.RAT.CAT.predicted_proteins.faa
-rw-r----- 1 mendelab prjs0815 8.5M Mar 27 16:10 out.RAT.CAT.predicted_proteins.gff
-rw-r----- 1 mendelab prjs0815 3.1K Mar 27 19:30 out.RAT.log
-rw-r----- 1 mendelab prjs0815 2.6G Mar 27 16:05 out.RAT.scaffolds_s.min500.fasta.1576_210222_METAG.1.fq.gz.bwamem.sorted
-rw-r----- 1 mendelab prjs0815 109G Mar 27 19:30 out.RAT.unclassified_unmapped.alignment.diamond
-rw-r----- 1 mendelab prjs0815 425M Mar 27 17:24 out.RAT.unclassified_unmapped.fasta
-rw-r----- 1 mendelab prjs0815 27M Mar 27 15:58 scaffolds_s.min500.fasta
-rw-r----- 1 mendelab prjs0815 11K Mar 27 15:59 scaffolds_s.min500.fasta.amb
-rw-r----- 1 mendelab prjs0815 673K Mar 27 15:59 scaffolds_s.min500.fasta.ann
-rw-r----- 1 mendelab prjs0815 26M Mar 27 15:59 scaffolds_s.min500.fasta.bwt
-rw-r----- 1 mendelab prjs0815 6.5M Mar 27 15:59 scaffolds_s.min500.fasta.pac
-rw-r----- 1 mendelab prjs0815 13M Mar 27 15:59 scaffolds_s.min500.fasta.sa

thauptfeld · 2024-04-03T08:37:28Z

Dear Goncalo,

thanks for bringing this up. Unfortunately, especially when there are a lot of unmapped/unclassified reads, the DIAMOND file gets really big and parsing it becomes memory-intensive. I think the biggest factor is that even for the unmapped reads, RAT integrates all hits within 10% of the top bit score, which is a lot of information when there are potentially millions of reads. So one thing you could do is shrink the DIAMOND file by only using hits within 5% or 1% of the top bit score: if you already have the contig2classification and bin2classification files and supply them to the RAT run, then just setting -r to 5 or 1 should work.
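To illustrate what "hits within r% of the top bit score" means, here is a minimal Python sketch that streams an alignment file and keeps only those hits, holding one read's hits in memory at a time. It is not CAT/RAT code, and several things are assumptions: that the file is in DIAMOND's default 12-column tabular format with the bit score in the last column, that all hits of a read are on consecutive lines, and that the cutoff is computed as top bit score × (1 − r/100). The names filter_top_hits and filter_file are made up for this example, and whether RAT can reuse a file filtered this way is not confirmed in this thread.

```python
import itertools
from typing import Iterable, Iterator

# Assumption: default 12-column tabular output, bit score in the last column.
BITSCORE_COL = 11


def filter_top_hits(lines: Iterable[str], r: float) -> Iterator[str]:
    """Yield hits whose bit score is within r% of their query's top bit score."""
    keep_fraction = 1.0 - r / 100.0
    rows = (line for line in lines if line.strip())  # skip blank lines
    # Assumption: all hits of one query are on consecutive lines.
    by_query = itertools.groupby(rows, key=lambda line: line.split("\t", 1)[0])
    for _query, group in by_query:
        hits = list(group)  # only one query's hits are held in memory
        scores = [float(h.rstrip("\n").split("\t")[BITSCORE_COL]) for h in hits]
        threshold = max(scores) * keep_fraction
        for hit, score in zip(hits, scores):
            if score >= threshold:
                yield hit


def filter_file(src: str, dst: str, r: float) -> None:
    """Stream src to dst, keeping only hits within r% of each query's top bit score."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(filter_top_hits(fin, r))
```

For example, filter_file("out.RAT.unclassified_unmapped.alignment.diamond", "filtered.diamond", 5) would write a smaller file containing only hits within 5% of each read's best bit score. Because it never loads the whole file, the memory use depends on the number of hits per read, not on the file size.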

I will leave this open! Maybe we can find a way to make this step a bit less memory-intensive, but it is hard, because with a couple of million reads it is inevitably going to take a lot of RAM.

Hope this helps!
Cheers,
Tina

gjordaopiedade · 2024-04-03T11:16:05Z

Hi Tina,
Thanks for getting back on this issue.
I have moved forward by turning the read annotation off: CAT_pack reads --mode mc
The problem is probably due to the assembly being filtered at 500 bp. Yet adding the short contigs or the unmapped read annotation would make it too computationally intensive for the resources we have available.
Thank you for the suggestion! I will keep it in mind for the future.
Cheers,
Gonçalo
