leif fxclone <max_mismatch> <start_base> <num_base> <out_unique.fx> <out_all.fx> <in.fx>
Ex: leif fxclone 3 5 60 step3.fx step3_clones.fx step2.fx
Parameter | Description |
max_mismatch | Maximum number of base mismatches allowed for a read to be considered identical to another read. Should be low, such as 0-3. |
start_base | Position of the first base in a read which is compared to determine if reads are identical. The first few bases of a read tend to be of lower quality, so skipping the first ~5 bases is recommended. |
num_base | Number of bases which will be compared to determine if reads are identical. 60 is recommended. Bases near the end of reads tend to be of lower quality, so excluding them increases the probability of detecting duplicate read pairs. |
out_unique.fx | Output FX file in which only a single read pair from each clonal group is output. |
out_all.fx | Output FX file in which all read pairs are output, including clonal group numbers. This file is used for debugging purposes. |
in.fx | Input FX file (optionally gzipped). |
The fxclone command reads an FX file and discards read pairs which appear to be duplicates (clones created during library preparation). Duplicate read pairs may also occur by chance, as two DNA fragments from the original specimen can start and end at exactly the same loci. If clones are not created during library preparation (for example, when using PCR-free library prep kits), this command should not be run, as it would remove these genuine duplicate fragments, somewhat biasing results.
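The comparison described by max_mismatch, start_base and num_base can be illustrated with a short Python sketch. This is not Leif's implementation: the function names are hypothetical, start_base is assumed to be 0-based, the mismatch budget is assumed to apply to each read of the pair separately, and the quadratic grouping loop is only suitable for small inputs.

```python
def window(read, start_base, num_base):
    """Return the slice of a read that is compared (start_base assumed 0-based)."""
    return read[start_base:start_base + num_base]


def count_mismatches(a, b):
    """Count differing positions; a length difference counts as mismatches."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def is_clone(pair_a, pair_b, max_mismatch, start_base, num_base):
    """Pairs are (read1, read2) tuples. Assumption: each read gets its own budget."""
    for read_a, read_b in zip(pair_a, pair_b):
        wa = window(read_a, start_base, num_base)
        wb = window(read_b, start_base, num_base)
        if count_mismatches(wa, wb) > max_mismatch:
            return False
    return True


def unique_pairs(pairs, max_mismatch, start_base, num_base):
    """Keep the first read pair seen from each clonal group (greedy, O(n^2))."""
    kept = []
    for pair in pairs:
        if not any(is_clone(pair, k, max_mismatch, start_base, num_base) for k in kept):
            kept.append(pair)
    return kept
```

In this sketch, two pairs that differ only in their first five bases are treated as clones when start_base is 5, which mirrors the reason given above for skipping the low-quality start of each read.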
leif fxoverlap <max_overlap_bases> <homology_percent> <out_no_overlap.fx> <out_overlap.fx> <in.fx>
Ex: leif fxoverlap 10 90 step3.fx step3_overlap.fx step2.fx
Parameter | Description |
max_overlap_bases | Maximum number of overlapping bases (10 recommended). |
homology_percent | Minimum percentage of aligning bases required to consider that the reads of a pair match. |
out_no_overlap.fx | Output FX file in which only non-overlapping read pairs are output. |
out_overlap.fx | Output FX file in which all overlapping read pairs are output. This file is used for debugging purposes. |
in.fx | Input FX file (optionally gzipped). |
The fxoverlap command reads an FX file, and discards read pairs in which both reads appear to be significantly overlapping. When reads are significantly overlapping, there is sometimes insufficient information in these reads to properly classify them. When reads are long (>=100 nt), eliminating partially overlapping reads is usually not necessary or recommended, since even partially overlapping reads are long enough to produce a specific alignment.
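One way to picture the overlap test is the following Python sketch. It is an illustration under stated assumptions, not Leif's algorithm: the second read is assumed to be stored as output by the sequencer (so it is reverse complemented before comparison), the overlap is assumed to be the longest suffix of read 1 that matches a prefix of the reverse-complemented read 2 at homology_percent or better, and all function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T, N."""
    return seq.translate(_COMPLEMENT)[::-1]


def overlap_length(read1, read2, homology_percent):
    """Longest suffix of read1 matching a prefix of rc(read2) at the given homology."""
    mate = reverse_complement(read2)
    for n in range(min(len(read1), len(mate)), 0, -1):
        tail, head = read1[-n:], mate[:n]
        matches = sum(a == b for a, b in zip(tail, head))
        if 100.0 * matches / n >= homology_percent:
            return n
    return 0


def keep_pair(read1, read2, max_overlap_bases, homology_percent):
    """True when the pair would go to the non-overlapping output in this sketch."""
    return overlap_length(read1, read2, homology_percent) <= max_overlap_bases
```

Very short overlaps match by chance quite often, which is one plausible reason for tolerating a few overlapping bases (max_overlap_bases) rather than requiring none.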
leif blastx0 <max_evalue> <consensus_evalue_factor> <out.csv> <in.blastx> <taxid.git>
Ex: leif qblowhom single 60 prostate7_60.fa prostate7_step4.qb
blastx -query sample_60.fa -db nr -outfmt "7 stitle qcovs evalue pident" > sample_60_nr.blastx
leif blastx0 0.05 10 sample_blastx_nr_60.csv sample_60_nr.blastx taxid.git
Parameter | Description |
max_evalue | Maximum evalue for taxonomic information to be reported. 0.05 to 0.001 recommended, as alignments with higher evalues are very likely to occur by chance alone. |
consensus_evalue_factor | The taxonomic node reported is a consensus (e.g. ancestor node) of all alignments with an evalue up to this "factor" worse than the best evalue. Minimum value is 1, recommended range is 10 to 1000. |
out.csv | Output file in which each group's taxonomic consensus is reported. |
in.blastx | Output of the NCBI blastx command with -outfmt "7 stitle qcovs evalue pident". |
taxid.git | Path of taxid.git file produced during setup by the taxid command. |
The blastx0 command reads the output of a blastx run (outfmt must be set as in the example above) and reports the consensus taxonomic node for each group in CSV format.
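The interaction between max_evalue and consensus_evalue_factor can be sketched in Python as follows. This is a hedged illustration only: the representation of a hit as an (evalue, lineage) tuple, the root-to-leaf lineage lists and the function name are assumptions, and it is also assumed that hits above max_evalue never contribute to the consensus.

```python
def consensus_lineage(hits, max_evalue, consensus_evalue_factor):
    """hits: list of (evalue, lineage), lineage being a root-to-leaf list of taxon names.

    Returns the deepest lineage prefix shared by all hits whose evalue is at most
    consensus_evalue_factor times the best evalue, or None if nothing qualifies.
    """
    if consensus_evalue_factor < 1:
        raise ValueError("consensus_evalue_factor must be at least 1")
    if not hits:
        return None
    best = min(evalue for evalue, _ in hits)
    if best > max_evalue:
        return None  # even the best alignment is too likely to be a chance hit
    cutoff = best * consensus_evalue_factor
    selected = [lineage for evalue, lineage in hits
                if evalue <= cutoff and evalue <= max_evalue]
    consensus = []
    for level in zip(*selected):  # walk down from the root, one rank at a time
        if len(set(level)) != 1:
            break
        consensus.append(level[0])
    return consensus
```

With a factor of 1 only the best hit is considered, so the most specific node is reported; larger factors admit more hits and tend to push the consensus toward an ancestor node.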
leif qbstrip <out.qb> <in.qb> <max_group_size> <taxid.git>
Ex: leif qbstrip step5.qb step4.qb 500 taxid.git
Parameter | Description |
out.qb | Output QB file. |
in.qb | Input QB file. |
max_group_size | Integer which specifies the maximum number of read pairs in a "contig-like group" for it to be written to the output file. Taxonomic information for "contig-like groups" which exceed this threshold is printed to screen to allow the user to manually verify that important groups are not discarded. |
taxid.git | Path of taxid.git file produced during setup by the taxid command. |
The qbstrip command reads a QB file and discards large "contig-like groups", outputting only small groups which contain at most max_group_size read pairs. Taxonomic information about each discarded group is displayed. This command is typically used to discard highly variable regions of the human genome.
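The size filter itself is simple; a minimal Python sketch follows. The dictionary of groups and the function name are hypothetical stand-ins for the contents of a QB file, whose actual layout is not described here.

```python
def strip_large_groups(groups, max_group_size):
    """groups: dict mapping a group id to its list of read pairs.

    Returns (kept, discarded). A group is kept when it holds at most
    max_group_size read pairs; discarded groups are returned so that they
    can be reported to the user for manual review.
    """
    kept, discarded = {}, {}
    for group_id, read_pairs in groups.items():
        target = kept if len(read_pairs) <= max_group_size else discarded
        target[group_id] = read_pairs
    return kept, discarded
```

Note that a group containing exactly max_group_size read pairs is kept in this sketch, matching the "at most" wording above.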
leif qblowhom [single|dual] <pct> {<out.qb>|<out.fq>|<out.fa>|<out.fx>} <in.qb>
Ex: leif qblowhom 90 out.fa step4.qb
Parameter | Description |
[single|dual] | Alignment results from either single or dual alignment are used. Must be set to either single or dual, or omitted. Case sensitive. Defaults to single. |
pct | Low homology cutoff. "Contig-like groups" with a consensus homology lower than this number will be output. Range 0 to 101. 101 keeps all "contig-like groups". |
{<out.qb>|<out.fq>|<out.fa>|<out.fx>} | Output file. Four formats are supported; the format is automatically determined based on the file extension. |
in.qb | Input QB file. |
The qblowhom command reads a QB file and outputs "contig-like groups" with a consensus homology lower than pct in one of four file formats. The read pairs output are sorted in order of increasing consensus homology percentage. This command can be used to check that Low homology reads don't match anything in the NCBI BLAST databases, by outputting a .fa (FASTA) file and aligning it with NCBI blastn for a second opinion. When the output format is .fx or .fq, the second read is reverse complemented (i.e. restored to the same format as output by the sequencer) before being output.
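The cutoff and ordering can be sketched in Python as below. The (group id, consensus homology) tuples and the function name are assumptions made for illustration; the sketch only shows why a pct of 101 keeps every group when homology ranges from 0 to 100.

```python
def low_homology_groups(groups, pct):
    """groups: list of (group_id, consensus_homology_percent) tuples.

    Keeps groups whose consensus homology is strictly lower than pct and
    returns them sorted by increasing consensus homology.
    """
    selected = [group for group in groups if group[1] < pct]
    return sorted(selected, key=lambda group: group[1])
```

Because the comparison is strict, a group at exactly 100% homology is dropped with pct set to 100 but kept with pct set to 101.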
leif qb2f [ungroup] <out.fa/out.fx> <in.qb>
Ex: leif qb2f ungroup prostate7_step08.fx prostate7_step07.qb
Parameter | Description |
ungroup | When specified, the group label (ex: @000001420_0_1) is removed from the read description line. |
out.fa/out.fx | Output FASTA file or output FX file. |
in.qb | Input QB file. |
The qb2f command reads a QB file and removes alignment information, outputting spots in the simpler FASTA or FX format. This allows further alignments to be performed after running the qbmajority command. When the ungroup parameter is specified, the group label is stripped and fxgroup can be run again.
Ex: leif qbhistory step4.qb 0 50 60 70 80 90
Parameter | Description |
in.qb | Input QB file. |
pct? | Homology percentage to report. Typically 0, 50, 60, 70, 80, and 90 are used. Range is 0 to 100. |
The qbhistory command reads a QB file, and outputs the number of groups which match well with Genbank entries at various points in time determined by the gi_age field in the qblast_settings.txt file. This command reports how many Low homology reads would have resulted if older Genbank releases had been used.
leif fx2fx [<step>] <out.fx> [<low_quality_char>] <in.fx> [<wildin0.fd> [...]]
Ex: leif fx2fx step2.fx step1.fx human.fd ebv.fd phage.fd
Parameter | Description |
step | Optional parameter which indicates the base interval between substrings considered in the alignment process. Range is 1 to 32; default is 1. A lower number will result in a more complete, but slower alignment. |
out.fx | Path to output FX file. |
low_quality_char | Optional parameter which allows low quality spots to be discarded. When specified, spots which contain too many bases called with a quality score equal or below low_quality_char are discarded. If omitted, quality is not checked. |
in.fx | Path to input FX file (optionally gzipped). |
wildin?.fd | Path to one or more filter dictionaries against which read pairs will be aligned. Read pairs aligning to any filter dictionary sequence will be discarded. Wildcards '*' '?' can be used to include multiple files. |
The fx2fx command reads an FX file, discards read pairs which align to any specified filter dictionary, and writes the remaining read pairs to a text file in the proprietary FX format.
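The trade-off controlled by the step parameter can be illustrated with a Python sketch. Several assumptions are made here and are not confirmed by this manual: alignment is modelled as an exact lookup of fixed-length substrings in a set, the substring length defaults to 32 bases (the record size used by filter dictionaries), and a single hit in either read is enough to discard the pair. All names are hypothetical.

```python
K = 32  # assumed substring length, matching the 32-base filter dictionary records


def substrings(read, step, k=K):
    """Substrings of length k starting every `step` bases along the read."""
    return [read[i:i + k] for i in range(0, len(read) - k + 1, step)]


def aligns_to_filter(read, filter_kmers, step=1, k=K):
    """True if any sampled substring is present in the filter dictionary set."""
    return any(s in filter_kmers for s in substrings(read, step, k))


def filter_pairs(pairs, filter_kmers, step=1, k=K):
    """Keep only read pairs in which neither read hits the filter dictionary."""
    return [pair for pair in pairs
            if not any(aligns_to_filter(read, filter_kmers, step, k) for read in pair)]
```

A larger step looks up fewer substrings per read, which is faster but can miss a match that a step of 1 would find, consistent with the description of the step parameter above.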
leif fx2fa <out.fa> <in.fx>
Ex: leif fx2fa prostate7_step13.fa prostate7_step12.fx
Parameter | Description |
out.fa | Output FASTA file. |
in.fx | Input FX file. |
The fx2fa command reads an FX file and simply writes it out in FASTA format. It is typically used to prepare an FX file for NCBI blastn alignment.
leif fastq2fd <low_quality_char> <out.fd> <wildin.fastq.gz> [filter0.fd [filter1.fd [...]]]
Ex: leif fastq2fd "4" step1.fq
Parameter | Description |
low_quality_char | Quality scores in a 32 base substring must all be higher than this value for the substring to be included in the output. |
out.fd | Output file which contains sorted filter dictionary sequences in binary format. Sequences are written in 64 bit little-endian records with 32 bases coded as 0=A;1=C;2=G;3=T. The most significant bit pair contains the leftmost (5') base. |
wildin.fastq.gz | Path to input FASTQ files which will be converted into a filter dictionary. These files can be gzipped. Wildcards '*' '?' can be used to specify multiple files. |
filter?.fd | Optional filter dictionaries used to prevent redundant 32 base substrings from being output. For example, if a filter dictionary is being created to catch unmapped regions of the human genome, a reference human genome can be specified here to prevent redundant filter dictionary entries from being output. |
The fastq2fd command reads FASTQ files and converts them into filter dictionary format (FD). Filter dictionaries can be used to align reads using the fastq2fx and fx2fx commands.
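The 64 bit record layout described for out.fd can be reproduced with a few lines of Python. The encoding (two bits per base, 5' base in the most significant bit pair, little-endian record) follows the parameter description; the function names, the handling of non-ACGT bases and the plain character comparison of quality scores are assumptions of this sketch.

```python
import struct

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_32mer(seq):
    """Pack 32 bases into a 64 bit integer, leftmost (5') base in the top bit pair."""
    if len(seq) != 32:
        raise ValueError("exactly 32 bases are required")
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def to_record(seq):
    """Serialize a 32 base substring as one 8 byte little-endian record."""
    return struct.pack("<Q", encode_32mer(seq))


def high_quality_32mers(seq, qual, low_quality_char):
    """32 base substrings whose quality characters are all above low_quality_char.

    Assumption: substrings containing bases other than A, C, G, T are skipped.
    """
    result = []
    for i in range(len(seq) - 31):
        bases, scores = seq[i:i + 32], qual[i:i + 32]
        if all(q > low_quality_char for q in scores) and all(b in CODE for b in bases):
            result.append(bases)
    return result
```

Since the first base occupies the most significant bits, sorting the integer values orders the sequences lexicographically under A < C < G < T, which fits the statement that the dictionary is stored sorted.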
leif fasta2fq <out0.fq> <out1.fq> <num_read_pair> <random_seed> <read_len> <gap> <wildin.fa.gz>
Ex: leif fasta2fq ebv_0.fq ebv_1.fq 100 1 150 300 ebv.fa.gz
Parameter | Description |
out0.fq | Output file containing paired-end reads (forward strand) in FASTQ format. |
out1.fq | Output file containing paired-end reads (reverse strand) in FASTQ format. |
num_read_pair | Total number of read pairs to output. |
random_seed | Integer random seed used to sample the read pairs from the sequences in the FASTA input file. |
read_len | Length of read to sample in bases. |
gap | Gap in bases between the forward and reverse reads. |
wildin.fa.gz | Path to input FASTA files which will be sampled and converted into a FASTQ file pair. These files can be gzipped. Wildcards '*' '?' can be used to specify multiple files. |
The fasta2fq command reads FASTA files and randomly samples read pairs from the FASTA sequences, simulating shotgun sequencing. This command can be used to test Leif by building synthetic sequencing results from known genomes.
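The sampling described above can be approximated by the following Python sketch. It is an illustration rather than Leif's implementation: it samples from a single sequence, assumes the gap is the number of bases between the end of the forward read and the start of the reverse read, reverse complements the second read, and writes a constant placeholder quality character. All names are hypothetical.

```python
import random

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T, N."""
    return seq.translate(_COMPLEMENT)[::-1]


def sample_read_pairs(sequence, num_read_pair, random_seed, read_len, gap):
    """Sample (forward, reverse) read pairs reproducibly from one sequence."""
    fragment_len = 2 * read_len + gap
    if len(sequence) < fragment_len:
        raise ValueError("sequence is shorter than one fragment")
    rng = random.Random(random_seed)  # same seed, same read pairs
    pairs = []
    for _ in range(num_read_pair):
        start = rng.randrange(len(sequence) - fragment_len + 1)
        fragment = sequence[start:start + fragment_len]
        forward = fragment[:read_len]
        reverse = reverse_complement(fragment[-read_len:])
        pairs.append((forward, reverse))
    return pairs


def fastq_record(name, seq, quality_char="I"):
    """Format one FASTQ record with a constant (placeholder) quality string."""
    return f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n"
```

Fixing the seed makes a synthetic data set reproducible, which is useful when the same simulated reads need to be compared across runs.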
leif mpdd [mhdd_path] <num_core> [<executable> [<param0> [<param1> [...]]]]
Ex: leif mpdd 4 fastq-dump --gzip --split-files SRR385759.sra
leif mpdd 4 fastq-dump --gzip --split-files SRR385767.sra
leif mpdd 4 fastq-dump --gzip --split-files SRR385773.sra
leif mpdd 4 fastq-dump --gzip --split-files SRR741366.sra
leif mpdd 4 fastq-dump --gzip --split-files SRR768303.sra
leif mpdd 4 fastq-dump --gzip --split-files SRR768309.sra
leif mpdd 4
Parameter | Description |
mhdd_path | Full path where the executable must be run, including the drive letter. Multiple drive letters can be specified to run the tasks on multiple disks (for example: wxyz:\temp). If omitted, the executable is run in the current directory. Since many Leif commands require a large amount of temporary disk space and bandwidth, it may be optimal for performance reasons to dispatch parallel Leif commands on different disk drives (if available). |
num_core | Number of processor cores on which the executables will be dispatched. The number 0 has a special meaning: the mpdd command automatically detects the number of cores on the system and dispatches executables on all cores. When running a batch of commands in parallel, the num_core parameter must be identical for all mpdd commands in a parallel dispatch group. |
executable | Any executable and its parameters. When no executable is specified, the mpdd command enters a special mode in which it waits for all currently running executables to complete before exiting. This ensures that all the data written by previous executables is available for later executables (which may be dependent on this data). |
The mpdd command dispatches executables on multiple cores to reduce the run time of scripts. This command is only used to optimize performance, and has no impact on logical execution.
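The dispatch-then-wait pattern can be mimicked in Python as a rough analogy. This sketch is not how mpdd works internally: it runs all commands from a single script using a thread pool, whereas mpdd is invoked once per command. The function name is hypothetical; only standard library calls are used.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def dispatch(commands, num_core):
    """Run commands with at most num_core running at once; 0 means all cores.

    Leaving the with-block waits for every command, which plays the role of
    the final mpdd call that is given no executable.
    """
    workers = num_core if num_core > 0 else (os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(subprocess.run, command) for command in commands]
        return [future.result().returncode for future in futures]


if __name__ == "__main__":
    # Two trivial child processes stand in for long-running jobs.
    jobs = [[sys.executable, "-c", "print('job %d')" % n] for n in range(2)]
    print(dispatch(jobs, 0))
```

Commands that depend on each other's output should not be placed in the same batch, for the same reason the wait mode exists.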
leif lock <executable> [<param0> [<param1> [...]]]
Ex: leif lock wget ftp.ncbi.nlm.nih.gov/blast/db/FASTA/human_genomic.gz
Parameter | Description |
executable | Any executable and its parameters. Leif commands which automatically serialize their disk accesses must not be used within lock (fasta2fd, fastq2fd, fastq2fx, fx2fx, fxgroup). |
The lock command ensures that the child command has exclusive hard disk access. It is used to avoid fragmentation when writing large files to disk. For example, if multiple jobs are running in parallel on the same disk drive, it is usually better to write one large file at a time to avoid fragmentation. This command is only used to optimize performance, and has no impact on logical execution.
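The idea of serializing disk-heavy jobs can be sketched in Python with a lock file. How Leif actually implements lock is not documented here, so this is only an analogy: the lock file path, the polling interval and the function names are all assumptions.

```python
import os
import subprocess
import tempfile
import time
from contextlib import contextmanager

# Hypothetical lock file shared by all cooperating jobs on one machine.
DEFAULT_LOCK = os.path.join(tempfile.gettempdir(), "disk_sketch.lock")


@contextmanager
def exclusive_lock(lock_path=DEFAULT_LOCK, poll_seconds=0.05):
    """Hold an exclusive lock by atomically creating a lock file."""
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(poll_seconds)  # another job holds the lock; wait and retry
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def run_locked(command, lock_path=DEFAULT_LOCK):
    """Run a child command while holding the lock; returns its exit code."""
    with exclusive_lock(lock_path):
        return subprocess.run(command).returncode
```

Jobs wrapped this way run one at a time even when launched in parallel, so each large file is written without competing writers on the same drive.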