Basic Commands

leif fastq2fx [<step>] <out.fx> <low_quality_char> <wildin_read0.fastq.gz> <wildin_read1.fastq.gz> [<wildin0.fd> [...]]

  Ex: leif fastq2fx step1.fx "#" SRR768303_1.fastq.gz SRR768303_2.fastq.gz human.fd ebv.fd phage.fd

Parameter Description
step Optional parameter which indicates the base interval between substrings considered in the alignment process. Range is 1 to 32; default is 32. A lower number will result in a more complete, but slower alignment.
out.fx Output file in FX format. 
low_quality_char Single character indicating the low quality cutoff. For example, "%" means that bases with the five lowest quality scores (i.e., "!" through "%" inclusive) will be deemed uncalled (N), and their quality score will be lowered to "!" to propagate this information to downstream commands.
wildin_read0.fastq.gz Path to FASTQ file (optionally gzipped) which contains forward strand reads. Wildcards '*' '?' can be used to include multiple files.
wildin_read1.fastq.gz Path to FASTQ file (optionally gzipped) which contains reverse strand reads. Wildcards '*' '?' can be used to include multiple files.
wildin?.fd Path to one or more filter dictionaries against which read pairs will be aligned. Read pairs aligning to any filter dictionary sequence will be discarded. Wildcards '*' '?' can be used to include multiple files.

The fastq2fx command reads paired FASTQ files (the first file(s) contains reads from the forward strand and the second file(s) contains corresponding reads from the reverse strand), discards low quality read pairs, discards read pairs which align to any specified filter dictionary and writes remaining read pairs in a proprietary FX format file.
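
The low_quality_char rule described above can be sketched in a few lines. This is an illustration only, not the tool's actual code: the function name mask_low_quality is hypothetical, and it assumes quality characters are compared by their ASCII order (as in Phred+33 FASTQ files). How fastq2fx then decides to discard a whole read pair is not covered here.

```python
from __future__ import annotations


def mask_low_quality(seq: str, qual: str, low_quality_char: str) -> tuple[str, str]:
    """Mask bases whose quality character is at or below the cutoff.

    Hypothetical helper: masked bases become "N" and their quality
    becomes "!", mirroring the behaviour described for low_quality_char.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    if len(low_quality_char) != 1:
        raise ValueError("low_quality_char must be a single character")

    masked_seq = []
    masked_qual = []
    for base, q in zip(seq, qual):
        if q <= low_quality_char:  # ASCII comparison, cutoff inclusive
            masked_seq.append("N")
            masked_qual.append("!")
        else:
            masked_seq.append(base)
            masked_qual.append(q)
    return "".join(masked_seq), "".join(masked_qual)
```

For instance, with a cutoff of "%", a base with quality "#" or "%" is masked, while a base with quality "&" is kept.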


leif fxgroup <out.fx> <in.fx>

  Ex: leif fxgroup step2.fx step1.fx

Parameter Description
out.fx Output FX file.
in.fx Input FX file (optionally gzipped).

The fxgroup command reads an FX file, discards read pairs which contain low entropy (repetitive) sequences, and assembles remaining read pairs into "contig-like groups" if they share a high entropy 32 base substring. The group information is placed at the beginning of the first FX entry line, as follows:

@AAAAAAAAA_B_C@F

  Ex: @000000011_3546_02@step1.fx

Field Description
AAAAAAAAA "Contig-like group" number. Group numbers are sorted by size: the largest group is numbered 000000000.
B Read pair number within the "contig-like group". The first read pair in a group is numbered 0.
C Total number of read pairs within this "contig-like group".
F Name of the file from which the read pair originated.
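
When post-processing FX files with custom scripts, the group tag can be split into its fields along these lines. This is a minimal sketch: the function and field names are hypothetical, the tag is assumed to have already been isolated from the rest of the FX entry line (whose layout is not described here), and the field order simply follows the table above.

```python
import re
from typing import NamedTuple


class GroupTag(NamedTuple):
    group_number: int   # field AAAAAAAAA
    pair_number: int    # field B
    pair_total: int     # field C
    source_file: str    # field F


# Assumed layout: @<digits>_<digits>_<digits>@<file name>
_TAG_RE = re.compile(r"^@(\d+)_(\d+)_(\d+)@(.+)$")


def parse_group_tag(tag: str) -> GroupTag:
    """Split a tag of the form @AAAAAAAAA_B_C@F into its four fields."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a group tag: {tag!r}")
    a, b, c, f = match.groups()
    return GroupTag(int(a), int(b), int(c), f)
```

Leading zeros are dropped when the numeric fields are converted to integers, so group 000000011 is returned as 11.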

Important note: large groups (>100 read pairs) may combine two unrelated genomes (or genome regions) if chimeric read pairs are present. For example, during the Illumina library prep ligation step, inserts which have not been A-tailed may be ligated with completely unrelated inserts. In RNA-Seq libraries, chimeric inserts usually combine a portion of the human ribosome with a portion of a high transcript count messenger RNA: this means large groups may well contain both ribosomal sequences and another sequence. A similar problem occurs if Illumina adaptors are present in the reads due to very short insert lengths. If microbial reads end up merged with human reads during the fxgroup step, they will likely be classified as human and thus be missed, which is very bad! If chimeric reads are common, it is very important to eliminate substantially all human reads before running fxgroup.


leif fxsample <count> <random_seed> [<group_number>/<total_read_pair_limit>] <out.fx> <in.fx>

  Ex: leif fxsample 0 1 step3.fx step2.fx

Parameter Description
count Number of read pairs to output from each "contig-like group". The value 0 has a special meaning: the number of read pairs output per group is then computed as 1+ceil(log2(number_of_read_pairs_in_group)).
random_seed Integer used as the random seed for the sample. Number 0 has a special meaning which disables randomization, selecting the first reads encountered within a group.
group_number Optional parameter which selects reads from only a single "contig-like group", discarding reads from all other groups.
total_read_pair_limit Optional parameter which limits the total number of read pairs output by randomly discarding some small "contig-like groups". This parameter is expressed as a negative integer to distinguish it from group_number. The number of read pairs in remaining groups will be adjusted so that total read pair counts remain valid.
out.fx Output FX file.
in.fx Input FX file (optionally gzipped). Read pairs must be annotated with group information using fxgroup for this command to work.

The fxsample command reads an FX file and samples read pairs for alignment with qblast. This step is optional, but it is recommended because aligning many reads from the same "contig-like group" does not provide any relevant information, and slows down the alignment process. If too many read pairs are output (in such a way that qblast is either too slow or runs out of memory), it is possible to restrict the total number of read pairs output by using the total_read_pair_limit parameter.
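
To get a feel for how many read pairs are kept per group when count is 0, the formula 1+ceil(log2(number_of_read_pairs_in_group)) can be evaluated as below. The function name is hypothetical; the sketch uses integer arithmetic in place of a floating point logarithm, which gives the same result for every positive group size.

```python
def auto_sample_count(group_size: int) -> int:
    """Return 1 + ceil(log2(group_size)) for a group of group_size read pairs."""
    if group_size < 1:
        raise ValueError("a group contains at least one read pair")
    # For n >= 1, ceil(log2(n)) equals the bit length of n - 1.
    return 1 + (group_size - 1).bit_length()


if __name__ == "__main__":
    for n in (1, 2, 8, 9, 1000, 1_000_000):
        print(n, auto_sample_count(n))
```

The count grows very slowly with group size: a group of 1,000 read pairs contributes 11 read pairs and a group of 1,000,000 contributes 21.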


leif qblast <qblast_settings.txt> <taxid.git> <wildin.fa.gz> <wildin0.fx> [<wildin1.fx> [...]]

  Ex: leif qblast qblast_settings.txt taxid.git blast_*.fa.gz step3.fx

Parameter Description
qblast_settings.txt Path of settings file which configures the alignment parameters (see table below for file format details).
taxid.git Path of taxid.git file produced during setup by the taxid command.
wildin.fa.gz Path of FASTA files that will be aligned to. Wildcards '*' '?' can be used to include multiple files. These files are typically the largest NCBI BLAST databases: "nt" "human_genomic" "other_genomic" and "wgs".
wildin?.fx Path of FX files that will be aligned (optionally gzipped). Wildcards '*' '?' can be used to include multiple files. The alignments will be output in QB files with the same path/name as these input files, except the suffix will be changed from ".fx" to ".qb".
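
A wrapper script that needs to locate the QB output for a given FX input could derive the name as follows. This is a small sketch with a hypothetical function name; it covers only plain ".fx" paths, since the naming used for gzipped inputs is not stated here.

```python
def qb_output_path(fx_path: str) -> str:
    """Replace a trailing ".fx" with ".qb", keeping the rest of the path."""
    suffix = ".fx"
    if not fx_path.endswith(suffix):
        raise ValueError(f"expected a path ending in {suffix!r}: {fx_path!r}")
    return fx_path[: -len(suffix)] + ".qb"
```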

The qblast command aligns reads from a set of FX files to sequences in a set of FASTA files, outputting the result in QB text files. The alignment settings are provided in a text file typically called qblast_settings.txt.

Field Description
num_core Number of processor cores that will be used during alignment. When omitted, the default value is the number of cores in the computer. Range is 1 to 1024.
word_length Number of consecutive matching bases necessary to seed alignment. Default value is 15. Range is 7 to 28. Low values may result in significantly longer run times; high values may result in missed alignments.
dust Ignores low entropy seeds when set. Default value is 0. Range is 0 to 1. In rare cases, low entropy seeds may result in significantly longer run times, so these can be ignored by setting this field to 1; however, setting this field to 1 may result in missed alignments.
dual_align_pct Percent homology required to seed dual alignment. Default value is 95. Range is 75 to 101. 101 disables dual alignment. Low values may result in significantly longer run times; high values may result in missed alignments.
num_genus Number of best matching genera reported. Default value is 4. Range is 1 to 32.
num_species Number of best matching species reported. Default value is 4. Range is 1 to 32.
num_consensus Number of consensus percentages reported. Default value is 12. Range is 1 to 51.
score_gi_age List of integer "gi" cutoffs used to report historical alignment scores. See example below.
score_taxid List of integer "taxid" whose best alignment score will be saved and reported separately. Up to 32 can be specified.
ignore FASTA entries matching this string or integer "taxid" are ignored (not aligned to).
cat? FASTA entries matching this string or integer "taxid" are aligned in a specific category, which is reported separately in the QB output files. Ten categories can be used (cat0 to cat9). The others keyword can be used to catch all FASTA entries which have not been directed to a specific category. For example, alignments can be reported in a separate category for each kingdom (eukaryotes, bacteria, archaea, fungi, viruses).

Example qblast_settings.txt file:
word_length    = 15;
dust           = 0;
dual_align_pct = 98;
num_genus      =  4;
num_species    =  4;
num_consensus  = 12;
score_taxid= 9443, // Primates
            10376, // EBV
            10841; // Microviridae (to catch phage)
score_gi_age=   4097607, // 1999
                6456709, // 2000
               11344946, // 2001
               17148316, // 2002
               27263209, // 2003
               33317580, // 2004
               45686251, // 2005
               77696184, // 2006
               93213430, // 2007
              164566300, // 2008
              219883666, // 2009
              283551139, // 2010
              317183295, // 2011
              371843436, // 2012
              440384719; // 2013

ignore="|AHJH01"; // Exclude Hammondia hammondi contaminated with Bradyrhizobium.
ignore="|AGTT01"; // Exclude Pantholops hodgsonii contaminated with Bradyrhizobium.
ignore="|KE11"; // Exclude Pantholops hodgsonii contaminated with Bradyrhizobium.
ignore="|AUYS01"; // Exclude Melampsora pinitorqua contaminated with Bradyrhizobium.
ignore="|ABPJ01"; // Exclude Mchenga conophoros contaminated with Bradyrhizobium.
ignore="|AK276546.1"; // Exclude Gryllus bimaculatus contaminated with E coli.
ignore="|BADR02"; // Exclude Clonorchis sinensis contaminated with E coli.
ignore="|CBMN01"; // Exclude Hordeum pubiflorum contaminated with Propionibacterium acnes.
ignore="|AAHY01"; // Exclude Mus musculus contaminated with E coli.
ignore="|CAJW01"; // Exclude Hordeum vulgare contaminated with E coli.
ignore="|CAJX01"; // Exclude Hordeum vulgare contaminated with Ralstonia pickettii.
ignore="|CAWI01"; // Exclude Adineta vaga contaminated with E coli.
ignore="|NZ_AJHE02"; // Retracted.
ignore="|CACX01"; // Exclude Strongyloides ratti contaminated with E coli.
ignore="|CH003510.1"; // Exclude Homo sapiens contaminated with E coli.
ignore="|AHIO01"; // Exclude Plutella xylostella contaminated with Salmonella enterica.

ignore=81077; // Exclude artificial sequences.
ignore=12908; // Exclude unclassified sequences.
cat0= others; // Prokaryotes and viruses.
cat1= 2759; // Eukaryotes
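
For checking a settings file before a long run, the syntax used in the example above can be read with a short script. This is a sketch inferred from the example only, not from a formal grammar: statements are assumed to end with ";", comments to start with "//", list items to be separated by ",", and a repeated key (such as ignore) to accumulate values. Quoted strings containing ";", "," or "//" are not handled. The function name is hypothetical.

```python
from __future__ import annotations


def parse_qblast_settings(text: str) -> dict[str, list[int | str]]:
    """Parse 'key = value[, value...];' statements into a dict of lists."""
    # Drop everything after "//" on each line (comments).
    body = "\n".join(line.split("//", 1)[0] for line in text.splitlines())

    settings: dict[str, list[int | str]] = {}
    for statement in body.split(";"):
        statement = statement.strip()
        if not statement:
            continue
        key, sep, raw_values = statement.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in statement: {statement!r}")
        values: list[int | str] = []
        for item in raw_values.split(","):
            item = item.strip()
            if not item:
                continue
            if len(item) >= 2 and item[0] == '"' and item[-1] == '"':
                values.append(item[1:-1])   # quoted string
            elif item.isdigit():
                values.append(int(item))    # integer such as a taxid or gi
            else:
                values.append(item)         # bare keyword such as others
        settings.setdefault(key.strip(), []).extend(values)
    return settings
```

Every key maps to a list, so single-valued fields such as word_length come back as a one-element list.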


leif qbmajority [ungroup] <out0> [<out1> [...]] <out_others.qb> <in.qb> <taxid.git> {Consensus|Genus|Species|Taxid} <taxid0> [<taxid1> [...]]

  Ex: leif qbmajority single 70 50 primate_ebv_phage.qb step4.qb step3.qb taxid.git Taxid 9443 10376 10841

Parameter Description
ungroup When specified, spots are considered individually rather than as "contig-like groups". This can be useful if a "contig-like group" merges human and non-human spots due to highly conserved sequences (such as ribosomal genes).
out0, out1, ... Each "contig-like group" is assessed against one or more criteria sets of the form {single|dual} <match_pct> <majority_pct> <out.qb>. As soon as a "contig-like group" matches a criteria set, it is output to the specified file and the assessment stops. If a "contig-like group" does not match any specified criteria set, it is output to the out_others.qb file. The criteria are described in detail in the next table.
out_others.qb "Contig-like groups" which were not output in any of the "out?" criteria files are output here.
in.qb Input file which will be analyzed.
taxid.git Path of taxid.git file produced during setup by the taxid command.
{Consensus|Genus|Species|Taxid} Homology set used. Must be set to one of Consensus Genus Species Taxid. Case sensitive.
taxid0, taxid1, ... List of matching integer "taxids" which the criteria described below are searching for. At least one taxid must be specified.


Criteria Description
{single|dual} Alignment results from either single or dual alignment are used. Must be set to one of single dual. Case sensitive.
match_pct Minimum homology percentage to consider that a read matches the specified taxid. Range 50 to 100.
majority_pct Minimum percentage of reads within a "contig-like group" that must match the specified taxid for the "contig-like group" to be considered as meeting the criteria. If this test passes, all reads from the "contig-like group" are output to the out.qb file for these criteria.
out.qb "Contig-like groups" which match the criteria are output to this file.

The qbmajority command reads a QB file and separates "contig-like groups" that meet certain taxid criteria into different files. This command can be used to eliminate "contig-like groups" which are of little interest, or isolate a set of "contig-like groups" that are of high interest.
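
The first-match routing described by the criteria table can be pictured as follows. This is a sketch of the decision logic only, with hypothetical names and data structures: each read is represented by its homology percentage against the requested taxids (already taken from the single or dual alignment results, as chosen), and both thresholds are assumed to be inclusive minimums.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Criterion:
    match_pct: float      # minimum homology for a read to count as a match
    majority_pct: float   # minimum share of matching reads in the group
    out_path: str         # file receiving groups that meet the criterion


def route_group(read_pcts: list[float], criteria: list[Criterion],
                out_others: str) -> str:
    """Return the output file for one group; the first criterion met wins."""
    if not read_pcts:
        return out_others
    for criterion in criteria:
        matching = sum(1 for pct in read_pcts if pct >= criterion.match_pct)
        if 100.0 * matching / len(read_pcts) >= criterion.majority_pct:
            return criterion.out_path
    return out_others
```

With a single criterion of match_pct 70 and majority_pct 50, a group whose reads score 95, 80, 60 and 40 is routed to the criterion file, because two of its four reads (50%) reach 70.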


leif qbconsensus [single|dual] [spot|group] [max|species|genus] <match_pct> <margin_pct> <out.csv> <wildin.qb> <taxid.git>

  Ex: leif qbconsensus 90 5 step4.csv step4.qb taxid.git

Parameter Description
[single|dual] Alignment results from either single or dual alignment are used. Must be set to one of single dual, or omitted. Case sensitive. Defaults to single.
[spot|group] Alignment results are reported either in units of read pairs (spot) or "contig-like groups" (group). Must be set to one of spot group, or omitted. Case sensitive. Defaults to spot.
[max|species|genus] Taxonomic resolution reported. Must be set to one of max species genus, or omitted. Case sensitive. Defaults to max.
match_pct Minimum match percentage to report a taxonomic name rather than "Low homology". Integer range 50 to 100. 90 is recommended.
margin_pct Minimum specificity margin for a taxonomic node to be reported (rather than the nearest enclosing parent taxonomic name). Integer range 0 to 50. 5 is recommended.
out.csv Output file reporting the taxonomic consensus for all input files in CSV format. The CSV format is compatible with most spreadsheet programs.
wildin.qb Input QB files whose reads are to be analysed for consensus taxonomy.
taxid.git Path of taxid.git file produced during setup by the taxid command.

The qbconsensus command reads QB files and reports how many read pairs or "contig-like groups" belong to a specific taxonomic node. The results are output in CSV format, which can be read by most spreadsheet programs for further analysis.
