Leif™‎ > ‎

Workflow and Tutorial

An overview of the Leif Microbiome Analyzer workflow is shown in Figure 1. A clinical specimen is collected, and its DNA is extracted. This DNA is then sequenced using a high-throughput paired-end sequencer, such as the Illumina HiSeq 2500 (for those unfamiliar with this technology, here is a YouTube video which describes it, and here is a list of centers which provide sequencing services for a fee; as of 2013, sequencing a DNA sample using this technology costs between 500$ and 2500$; for example Omega Bioservices offers turnkey DNA extraction and Illumina Sequencing for less than 1000$/sample). While most DNA reads produced by the sequencer are of human origin, a small fraction of reads originate from microbes present in the clinical specimen, such as bacteria, fungi, protists and viruses. Leif aligns these reads to all known sequences, quantitatively reporting which microbes are present in the clinical specimen. Additionally, reads which do not match with any known sequences are reported as novel species candidates. A detailed technical description of Leif was published in PLoS ONE: Laurence et al 2014.

Figure 1: Leif workflow. DNA is extracted from a clinical specimen, then millions of DNA fragments are sequenced and analyzed by Leif.

Tutorial based on a sequencing run from the 1000 Genomes Project
A sequencing run from the 1000 Genomes Project labeled SRR768303 is used here as an example. It consists of DNA extracted from Epstein-Barr virus (EBV) immortalized B lymphocytes from a Utah male of European descent, sequenced using the paired-end method by a single Illumina HiSeq 2000 lane. This yielded 69790785 read pairs, each read pair consisting of two 101 base reads. To clearly illustrate each step in the workflow, two example read pairs were followed: the first read pair is unambiguously an EBV sequence (Table 1); the second read pair has very low homology to all sequences in the NCBI BLAST database released in September 2013, which means it is a novel species candidate likely originating from laboratory contamination (Table 2). Rows which corresponds to each step are hyperlinked in the following paragraphs.

DNA is extracted from cells in a clinical specimen and sequenced, producing around a hundred million read pairs of 100 to 150 bases each, separated by a gap of up to 700 bases (usually described as 2x100 or 2x150). Each read pair is split and saved using the four line FASTQ format in text files, where corresponding read pairs appear in the same order in both files (Row 1 and Row 2). The DNA sequence can be found on the second line of each FASTQ entry, and the base quality can be found on the fourth line. An overview of the bioinformatics operations performed by Leif is shown in Figure 2. and described in more detail in the following paragraphs.

Figure 2: Leif bioinformatics operations. Read pairs aligning with human sequences are discarded. Remaining read pairs are assembled into groups if their sequences overlap. A few read pairs are sampled from each group and aligned to all known sequences.

The first bioinformatics step applied to read pairs is performed by the fastq2fx command (Row 3) which eliminates low quality read pairs, eliminates read pairs aligned to human sequences, and writes remaining read pairs to a FX file. The main purpose of this step is to reduce the dataset size by discarding human reads so that downstream steps run quickly and require less storage. The proprietary FX file format is very similar to the FASTQ format, except read pairs are stored together in a single file, which is more convenient (Row 4).

The second bioinformatics step is done by the fxgroup command (Row 5), which assembles read pairs into “contig-like groups” if they share an identical high entropy 32 base substring. Read pairs that contain too few high entropy substrings are discarded. This step can be described as a crude implementation of de novo genome assembly. The group assignment is added to the first line of each FX entry (Row 6). For example, “@000000011_02_3546@step1.fx” was added to first line of the EBV read pair, meaning that it is read pair 2 of 3546 in “contig-like group” number 11.

The third bioinformatics step is done by the fxsample command (Row 7), which randomly selects a few read pairs from each “contig-like group” for alignment. Typically, no more than a few hundred thousand read pairs remain after this step.

The fourth bioinformatics step is done by the qblast command (Row 9), which aligns sampled read pairs with a very large number of FASTA sequences—typically all sequences in the NCBI BLAST databases. This is the slowest step since qblast must align read pairs against approximately a terabyte of FASTA sequences (6 to 24 hour runtimes are typical). The qblast command outputs a substantial amount of alignment information for each read pair in QB files which can be further analyzed manually or using scripts (Row 10).

Alignment information produced by qblast
Since the B lymphocytes whose DNA was sequenced in run SRR768303 were immortalized using EBV, it is normal that some of the read pairs match with the EBV genome, as is the case for the first example read pair (Table 1). The alignment results for this read pair are shown in Row 10, which is reproduced below. These results are described in detail in the following paragraphs.

Table 3: Row 10 reproduced from Table 1.
Row Paired-end read
10 **** Group 11 (read pair 2 of 3546) ****
@000000011_02_3546@step1.fx@SRR768303.47548 HWI-ST1154:205:D1M9PACXX:7:1101:2354:9785 length=101

167132 -> 167319 "nt.fa.gz>gi|330330|gb|M80517.1|HS4B958RAJ Epstein-Barr virus, artifactual joi" 
Taxid[  99%"Human herpesvirus 4(10376)"]
Genus[  99%"Lymphocryptovirus(10375)"]
Species[  99%"Human herpesvirus 4(10376)"  91%"Pan troglodytes lymphocryptovirus 1(212727)" ]
Consensus[  99%"Human herpesvirus 4(10376)"  91%"Lymphocryptovirus(10375)"]
gi_age[  99%  99%  99%  99%  99%  99%  99%  99%  99%  99%  99%  99%  99%  99%  99%  99%]
||||||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


These two paired reads originated from a 188 base DNA fragment, as indicated by the alignment interval (167132 → 167319) and the gap size of minus 14 (101 x 2 – 14 = 188) which was reported for the best match, which in this case was to the reference sequence "nt.fa.gz>gi|330330|gb|M80517.1|HS4B958RAJ Epstein-Barr virus, artifactual joi". There is a single base mismatch with this reference sequence which had the highest possible quality score (‘J’), indicating that the base was likely mutated in this strain, or perhaps not called correctly by the sequencer.

The Taxid set reports the best match to taxonomic nodes of particular interest which were explicitly requested in the qblast_settings.txt file. In this example, taxids for primate (taxid = 9443), EBV (taxid = 10376) and coliphage (closest taxid = 10841) were requested, but only the EBV taxid was reported since this read pair produced no alignments with a homology ≥50% to primate or coliphage sequences.

The Genus and Species sets list up to four genera/species which have the highest homology to this read pair. In this case, only a single genus matched well with the read pair (“Lymphocryptovirus” with 99% homology), but other lymphocryptovirus species matched relatively well also: “Human herpesvirus 4” (a synonym of EBV) with 99% homology, and a closely related virus which infects chimpanzees with 91% homology. Since EBV is known to be present in the sample, the match with the chimpanzee virus occurred due to genetic similarities between these viruses, and not because there was any chimpanzee virus in the sample.

Highly conserved sequences will produce an excellent match to many more than four species or genera, thus the Genus and Species sets may not be fully representative of all matches. The Consensus set addresses this issue by finding a taxonomic node which encompasses all matching species for a given homology percentage. For example, if both Staphylococcus aureus (taxid = 1280) and Enterococcus faecalis (taxid = 1351) match equally with 98% homology, then the consensus taxonomic name reported will be Bacilli (taxid = 91061). The Consensus set is used by the qbconsensus command to output a taxonomic summary of read pairs at the end of the workflow.

The gi_age set lists the highest homology percentage found using all Genbank entries which have a “gi” number lower than certain cutoff values specified in the qblast_settings.txt file. “gi” numbers are assigned sequentially to new Genbank entries, thus can be used to reconstruct the Genbank database at any given point in time. The cutoffs are usually chosen to be the first “gi” number assigned in a given calendar year (e.g. from 1998 to 2013). In this case, the best matching sequence was present in Genbank in 1998. The gi_age set is used by the qbhistory command to output a summary of how homology between read pairs and Genbank sequences increased over time.

Analysis of alignment results 
Most read pairs matching with human DNA were discarded by the fastq2fx command which aligned read pairs to sequences from the NCBI blast “human_genomic” database using a simple algorithm. However, read pairs from some highly variable regions of the human genome may not have been caught by this filter, so they need to be discarded based on qblast alignment results. The qbmajority command (Row 11) is used to extract “contig-like groups” which contain a majority (≥50%) of decent homology (≥70%) primate reads (taxid = 9443). Since the B lymphocytes from run SRR768303 were immortalized with EBV (taxid = 10376), this step is also used to eliminate “contig-like groups” matching EBV. Finally, a substantial number of coliphage (closest taxid = 10841) reads are present in some high-throughput sequencing results and are also filtered out here.

The last bioinformatics step is the interpretation of alignment results, for which there is no universally optimal method. Read pairs with an excellent homology to known species indicate that such species are present in the sample. Read pairs which do not align well with any sequence in NCBI BLAST databases are deemed novel species candidates and are labeled "Low homology" by Leif. The methodology best suited to assess the medical significance of novel species candidates is application specific (one such method is described in Figure 3).

Figure 3: Simple method which can used to screen novel sequences for an association with disease using custom PCR primers and case/control specimens.

The qbconsensus command can be used to tally the most accurate taxonomic name that can be attributed to each “contig-like group” (Row 14). A minimum homology threshold of 90% was specified to exclude low homology alignments: below this threshold, read pairs will be tallied under the taxonomic name “Low homology”. A specificity margin of 5% was used so that the most specific taxonomic name shared by all alignments within 5% homology of the best alignment was reported (Figure 4). For example, if the two best matches are to Staphylococcus aureus (98% homology) and Enterococcus faecalis (94% homology), the taxonomic name reported will be Bacilli; by contrast, if Enterococcus faecalis homology was 93% instead of 94%, then the reported taxonomic name would be Staphylococcus aureus.

Figure 3
Figure 4: Graphical representation of number of read pairs classified using “genus or less specific” resolution, output by the qbconsensus command based on SRR768303 step4.qb file. The number of read pairs is shown adjacent to the taxonomic name. Given that the clinical sample is meant to be pure cultured human B lymphocytes, all of these read pairs appear to originate from contamination which occurred during sample preparation. Bradyrhizobium bacterial contamination is very common in sequencing runs from the 1000 Genomes Project.

The qbhistory command can be used to report the number of “contig-like groups” with a median homology above certain thresholds using the Genbank database released each calendar year (Row 15). It shows a reduction in the number of unknown sequences over time due to new sequences being added to Genbank (Figure 5).

Figure 5: Number of “contig-like groups” in SRR768303 step4.qb file that have a median homology greater or equal to 0%, 50%, 60%, 70%, 80%, 90% to any sequence in yearly Genbank releases.

Low homology “contig-like groups” can be extracted using the qblowhom command (Row 16) and aligned using NCBI blastn to confirm that they are truly unknown sequences, and not the result of missed alignments by the qblast command (Table 4). blastn can be run on NCBI servers using the latest release of Genbank, which ensures that low homology reads are not due to an outdated NCBI BLAST database, as occurred for groups 2882 and 2739 in Table 4. The NCBI blastn tool can be run using the web interface or from the command line (Row 19).

Table 4Result of NCBI blastn on all read pairs from SRR768303 with best alignment homology <90%.
Group # Best qblast match Best NCBI blastn match Comment 
Genus Identity Genus Query cover E value Identity
1908 Zebrafish 52% Pteronotus 27% 4.6 93%  
Rhizobium 50% Zobellia 27% 1.6 82%  
1859 Ophiostoma 53% Salmo 95% 8e-10 78%  
Ornithorhynchus 54% Salmo 95% 1e-07 79%  
Ornithorhynchus 57% Salmo 97% 3e-09 78%  
Ophiostoma 55% Salmo 98% 2e-11 80%  
2882 Corynebacterium 63% Corynebacterium 100% 3e-39 98%  
Amazona 54% Corynebacterium 99% 3e-40 99%  
2848 Acidaminococcus 58% Waterwater 99% 1e-07 75%  
Cellulophaga 67% Anopheles 46% 7e-04 87%  
1664 Chryseobacterium 57% Chryseobacterium 90% 3e-15 84%  
Chryseobacterium 71% Chryseobacterium 98% 5e-19 84%  
2063 Verrucomicrobium 71% Verrucomicrobium 79% 9e-09 81%  
Sulfolobus 57% Ornithorhynchus 24% 0.38 100%  
2651 Allochromatium 54% Apalone 29% 0.38 93%  
Microbacterium 73% Microbacterium 96% 1e-07 75%  
2157 Ladona 56% Acyrthosiphon 100% 2e-18 84%  
Acyrthosiphon 74% Acyrthosiphon 100% 7e-17 83%  
2757 Bradyrhizobium 80% Bradyrhizobium 77% 3e-27 98%  
Bradyrhizobium 77% Bradyrhizobium 72% 3e-27 100%  
1706 Aedes 66% Aedes 100% 1e-33 92%  
Aedes 80% Aedes 100% 1e-33 92%  
3291 Elizabethkingia 83% Elizabethkingia 95% 5e-19 84%  
Riemerella 81% Riemerella 99% 2e-17 82%  
3067 Acidovorax 87% Acidovorax 96% 1e-26 90%  
Allomyces 50% Kaistia 34% 0.11 92%  
2739 Prevotella 88% Prevotella 100% 2e-42 100%