-
Notifications
You must be signed in to change notification settings - Fork 3
/
manual.txt
152 lines (112 loc) · 8.97 KB
/
manual.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
====================================================================================================
======= CORA v1.1.5b by Deniz Yorukoglu ([email protected], http://cora.csail.mit.edu/) ========
====================================================================================================
Usage: cora <command> [options]
Command: coraIndex Create a homology table for CORA.
faiGenerate Generate .fai file for reference.
mapperIndex Index the reference genome for coarse mapper.
readFileGen Automatically generate CORA's read file list input.
search Run CORA's compressive read alignment pipeline.
To print the manual for each command, run './cora <command>' without any options.
====================================================================================================
Usage: cora coraIndex [options] <Reference> <ExactHom> <InexactHom>
====================================================================================================
Notes: Reference should be in fasta or multi-fasta format and indexed
You can use 'cora faiGenerate' or 'samtools faidx' to index the reference
ExactHom and InexactHom are file names to be used in the search pipeline of CORA
Important Options:
-K [INT 33-64] k-mer length [no default]
-H [INT 1-3] Hamming-distance threshold per k-mer for inexact homology table. [2]
Performance Options:
-p [INT 1-10] Number of physical file splits for computing the homology tables [10]
Higher values substantially reduce memory use, sacrificing some run time.
-t [INT 1-24] Number of parallel threads to use for computing the inexact homology table [8]
====================================================================================================
Usage: cora faiGenerate <Reference>
====================================================================================================
Notes: This command is a replacement for 'samtools faidx' in case samtools is not installed
Reference should be in fasta or multi-fasta format
====================================================================================================
Usage: cora mapperIndex [options] <Reference>
====================================================================================================
Important Options:
--Map Off-the shelf mapper to be used for coarse mapping (stages 2-3) [default BWA]
current valid options: BWA, BWA_MEM, BOWTIE, BOWTIE_2, MRSFAST, MRSFAST_ULTRA
--Exec Name of the executable path for mapper [default bwa]
Can be full path (e.g. /home/folder/bwa) or just executable if installed (e.g. bwa)
--Index Full path for the index to be constructed if [default is same as <Reference>]
Miscellaneous Options:
--opt Additional options to be passed to the mapper for indexing
Warning: some of the non-default indexing options may not be compatible with CORA
====================================================================================================
Usage: To generate SINGLE-end read file list:
cora readFileGen [option] <FileName> -S <FASTQ-A> <FASTQ-B> <...>
To generate PAIRED-end read file list:
cora readFileGen [option] <FileName> -P <FASTQ-A1> <FASTQ-A2> <FASTQ-B1> <FASTQ-B2> <...>
====================================================================================================
Notes: <FileName> is the output read file list name for CORA, which stores read dataset info
FASTQ files are input read datasets. Paired and single-end read datasets cannot be mixed
Options : --ReadComp [STRING] Compression format of the input reads (GZIP or OFF) [default OFF]
OFF -> input datasets are in plain FASTQ format
GZIP -> input datasets are compressed with gzip
====================================================================================================
Usage: cora search[options] <Read_File_List> <Reference> <ExactHom> <InexactHom>
====================================================================================================
Notes: <Reference> should be in fasta or multi-fasta format and indexed.
<ExactHom> and <InexactHom> are files generated in the coraIndex run
<Read_File_List> can be generated using readFileGen command or manually.
Important Options:
-C Determines which stages of CORA pipeline will be executed. [default 1111]
Other valid options are: 1000, 1100, 1110, 0100, 0110, 0111, 0010, 0011, 0001
These indicate 4 flags for running following 4 stages:
1) Compressing reads into unique k-mers.
2) Coarse mapping k-mers to the reference using an off-the-shelf aligner.
3) Converting coarse mapping results to read links.
4) Traverse homology index for generating final mappings.
--Mode Mapping mode: ALL, BEST, BEST_SENSITIVE, BEST_FAST, STRATUM, or UNIQUE [ALL].
--Map Off-the shelf mapper to be used for coarse mapping [default BWA]
Can be BWA, BWA_MEM, BOWTIE, BOWTIE_2, MRSFAST, MRSFAST_ULTRA, MANUAL
--Exec Name of the executable file to be run for coarse mapping
Either full path (e.g. /home/folder/bwa) or executable if installed (e.g. bwa)
-R Read input mode, SINGLE for single-end or PAIRED for paired-end reads [PAIRED]
--MinI [INT] for PAIRED (paired-end) mode, min insert length (|TLEN| in SAM) [150]
--MaxI [INT] for PAIRED (paired-end) mode, max insert length (|TLEN| in SAM) [650]
-O Output SAM file to print the final mapping output [default = CORA_Output.sam]
-L [INT] Read length for CORA mapping (per read end) [no default].
-K K-mer compression mode, FULL, HALF or THREEWAY [default = HALF].
FULL, HALF and THREEWAY denote 1, 2, or 3 k-mers per read-end, respectively.
K-mer length in coraIndex stage should concordant with -K (see below):
FULL -> index k-mer should be same as read length
HALF -> k-mer length should be floor(read_length/2)
THREEWAY -> k-mer length should be floor(read_length/3)
--Metric Distance metric to be used for mapping (HAMMING or EDIT) [default HAMMING]
HAMMING -> distance is substitution only
EDIT -> (Levenshtein) distance allows indels (requires --Map BWA or BOWTIE_2)
-E [INT] Distance threshold per k-mer for inexact homology table. (1 to 3) [default 2]
Use of Hamming or Edit distance is determined by --Metric option.
Should be the same as -H used in coraIndex stage.
Performance Options:
--memoi [INT] Activate memoization for k-mers appearing at least as many times as INT [20].
Must be larger than 1. Lower values save runtime during Stage 4 using more memory.
--fs [INT or AUTO] Number of physical file splits for Input FASTQ file for compression
Valid parameters are either AUTO or an integer between 1 to 144 [default is AUTO]
Higher values reduce memory use during Stage 1 sacrificing some run time.
AUTO mode determines the number of FASTQ splits as: 1 + genome_size / 10M.
--cm [STRING] Determines if the reference should be used for compressing reads or not.
Options: WITHREF or NOREF [default WITHREF]
Paralelization Options:
--coarseP [INT] The number of parallel threads to be used for coarse mapping. [default is 1]
Capped at maximum number of threads allowed by coarse-mapper
Miscellaneous Options:
--RG ["STRING"] All read group data for the read datasets being mapped [default NONE]
This string will appear in the header and IDs will be attached to each mapping line
The whole string should be double quoted, the first identifier should be a unique ID
Multiple read datasets's RG data should be comma-separated in the order of Read_File_List
For tab-delimiting identifiers within the command line you can use Ctrl+V -> Ctrl+I
The following is an example --RG value for 3 single-end or paired-end read datasets
"ID:xx1 CN:yya DS:za zb DT:ta,ID:xx2 CN:yya,ID:xx3 CN:yyb DS:zc zd"
--TempDir [STRING] The directory to be used for temp CORA files. [__temporary_CORA_files].
This enables running two CORA jobs in the same folder with different TempDir.
--ReadComp [STRING] Compression format of the input reads (GZIP or OFF) [default OFF]
OFF means that the input datasets are in plain FASTQ format
--MaxMapCount [INT] Maximum number of mapping to print in ALL mapping mode [default is infinite]