FROGS: overview

Find, Rapidly, Otus with Galaxy Solution

The following tabs show comparisons between FROGS, MOTHUR, UPARSE and QIIME. Tests were run on 2000 large in silico sequence datasets and three real datasets.

Tests with simulated data

To take into account possible biases introduced by the choice of the amplified region, we produced, in silico, datasets with the V3V4 and V4 hypervariable regions of the bacterial 16S gene. We generated 25 sets of species manually extracted from UTAX (Edgar, 2015), called Simulated Data From UTAX (SDFU), and 25 others from the SILVA (v123) databank (Quast et al, 2013), called Simulated Data From SILVA (SDFS).

We tested FROGS, UPARSE, MOTHUR and QIIME on SDFU following their own guidelines, that is, each with its own affiliation method on the UTAX databank. These pipelines are called UPARSE_SOP, MOTHUR_SOP and QIIME_SOP (SOP = Standard Operating Procedure). However, the synthetic SDFU communities are not very diverse because UTAX is smaller than SILVA. Therefore, we also ran the 4 pipelines on SDFS. Here we followed the guidelines of each pipeline, except for the affiliation step, where we used the FROGS affiliation tools, because formatting the SILVA database as required by the affiliation step of UPARSE was too complex to implement. These pipelines are called UPARSE_MA, QIIME_MA and MOTHUR_MA (MA = MultiAffiliation of FROGS). QIIME's SOP does not include a chimera removal step. Nevertheless, to obtain a fair comparison, notably in terms of erroneous OTUs, we applied this step before the clustering step.

Grinder (v 0.5.3) (Angly et al, 2012) was used to simulate the PCR amplification of full-length (V3V4 and V4) sequences from reference databases. We generated amplicons by
- filtering out sequences with ambiguous nucleotides,
- keeping only bacterial species with a non-ambiguous taxonomic affiliation and, for sequences from SILVA, a pintail score >50,
- keeping only sequences with a match for the forward (TACGGRAGGCAGCAG) and reverse (TAGGATTAGATACCCTGGTA) primers in the V3V4 region and for the forward (GTGCCAGCMGCCGCGGTAA) primer in the V4 region, and
- maximizing the phylogenetic diversity of the amplicons in the full length 16S phylogenetic tree.
This results in 25 increasingly complex nested databases.
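
The primer-based filtering above can be illustrated with a minimal Python sketch. It is not the code used to build the datasets: the function names are hypothetical, the primers are assumed to be searched on the same strand exactly as written, and no mismatches are tolerated. Degenerate IUPAC bases such as R and M are expanded into regular-expression character classes.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a (possibly degenerate) primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def has_ambiguous_bases(sequence):
    """Return True if the sequence contains anything other than A, C, G or T."""
    return re.search(r"[^ACGT]", sequence.upper()) is not None


def extract_amplicon(sequence, forward, reverse):
    """Return the region spanning forward to reverse primer (inclusive), or None.

    Assumption: both primers are given in the orientation of the sequence
    strand, and only exact matches (after IUPAC expansion) are accepted.
    """
    sequence = sequence.upper()
    fwd = primer_regex(forward).search(sequence)
    if fwd is None:
        return None
    rev = primer_regex(reverse).search(sequence, fwd.end())
    if rev is None:
        return None
    return sequence[fwd.start():rev.end()]


# Toy sequence (invented for illustration) carrying both V3V4 primers.
toy = "AAAA" + "TACGGGAGGCAGCAG" + "CCCC" + "TAGGATTAGATACCCTGGTA" + "GG"
amplicon = extract_amplicon(toy, "TACGGRAGGCAGCAG", "TAGGATTAGATACCCTGGTA")
```

A sequence would be kept only if `has_ambiguous_bases` returns False and `extract_amplicon` returns a region rather than None.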

Grinder requires both error and abundance profiles to generate sequences. We used the following error parameters:
- the error rate increases linearly from 0.301% to 0.303% per base along the read,
- 98.6% of errors are SNPs and
- 1.4% are indels.
Those parameters were calibrated by mapping reads from a single-strain MiSeq sequencing run (Schirmer et al, 2015) to the strain's known sequence, to mimic typical MiSeq error profiles, and are consistent with other reported values. We used the default chimera n-mer distribution:
- 89% bimeras,
- 11% trimeras and
- 0.3% quadrimeras,
corresponding to the average values published in Quince et al, 2011.
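
As a rough illustration of the linear error model described above, the following Python sketch interpolates the per-base error rate along a read and derives the expected number of errors per read. The read length of 250 bases is a hypothetical value chosen for the example (the text does not state one), and the function names are illustrative.

```python
def error_rate(position, read_length, start=0.00301, end=0.00303):
    """Per-base error probability at a 0-based position, interpolated linearly
    from `start` (first base) to `end` (last base)."""
    if read_length == 1:
        return start
    return start + (end - start) * position / (read_length - 1)


def expected_errors(read_length):
    """Expected number of errors in one read under the linear model."""
    return sum(error_rate(i, read_length) for i in range(read_length))


# Hypothetical read length, for illustration only.
read_length = 250
total = expected_errors(read_length)

# Split according to the stated proportions: 98.6% SNPs, 1.4% indels.
expected_snps = 0.986 * total
expected_indels = 0.014 * total
```

Because the rate is linear, the expected number of errors equals the read length times the mean of the two end-point rates (0.302%), so under this assumed read length fewer than one error is expected per read on average.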

The fraction of chimeras increased with the reference database size to reflect increasing sequence similarities:
- 5% for 20 taxa,
- 10% for 100 and 200 taxa and
- 20% for 500 and 1 000 taxa.
Chimera breakpoints were distributed uniformly along the amplicon.

We considered two different abundance profiles:
- uniform and
- power law.
For the power law abundance profile, parameters were calibrated to set the expected max/min abundance ratio to
- 100 (20 taxa),
- 1 000 (100 and 200 taxa) or
- 10 000 (500 and 1 000 taxa).
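
One way to see how a power law parameter relates to a target max/min ratio is the deterministic sketch below. It assumes that the abundance of the taxon of rank r is proportional to r^(-alpha), so that the ratio between the most and least abundant of n taxa is n^alpha. This is an assumption made for illustration and not necessarily the exact calibration procedure used with Grinder; the function name is hypothetical.

```python
import math


def power_law_abundances(n_taxa, max_min_ratio):
    """Relative abundances following rank ** -alpha, with alpha chosen so that
    the first-rank / last-rank abundance ratio equals max_min_ratio."""
    alpha = math.log(max_min_ratio) / math.log(n_taxa)
    weights = [rank ** -alpha for rank in range(1, n_taxa + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# 20 taxa with a target max/min ratio of 100, as in the first case above.
abundances = power_law_abundances(20, 100)
ratio = abundances[0] / abundances[-1]
```
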
For each combination of database size (20/100/200/500/1 000), abundance profile (uniform, power law) and amplicon region (V3V4/V4), we generated 5 communities with different compositions (cf. figure above). We then simulated 10 samples of 100 000 reads each from each community.
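
The size of this design can be checked with a short enumeration. The sketch below only restates the numbers given above; the variable names are illustrative.

```python
from itertools import product

sizes = [20, 100, 200, 500, 1000]
profiles = ["uniform", "power law"]
regions = ["V3V4", "V4"]
communities_per_combination = 5
samples_per_community = 10
reads_per_sample = 100_000

# Every (size, profile, region) combination of the design.
combinations = list(product(sizes, profiles, regions))

n_communities = len(combinations) * communities_per_combination
n_samples = n_communities * samples_per_community
n_reads = n_samples * reads_per_sample
```

This gives 20 combinations, 100 communities, 1 000 samples and 10^8 reads per reference databank. Assuming the same design is applied to both UTAX and SILVA, the totals double to 2 000 samples and 2 × 10^8 reads.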

Finally, we used cutadapt (v1.7.1) to trim primers from the generated reads. Trimmed sequences were not preprocessed with quality filters but instead used as such in downstream analyses. These 2 × 10^8 sequences were processed with FROGS, MOTHUR, UPARSE and QIIME, each following its own guidelines, to compare the performance of these four solutions.

Tests with real data

The first real dataset is available to the community through BEI Resources. It is a synthetic mock community of 20 known bacteria, 1 yeast and 1 archaeon, from genera commonly found on or within the human body. Genomic DNA from each organism was mixed, based on qPCR measurements of the 16S rRNA gene, and two mock mixtures were formulated to contain:

- 100 000 16S copies per organism per aliquot (even mock community) and
- 1 000 to 1 000 000 16S copies per organism per aliquot (staggered mock community).

The second one is a pool of 4 marine bacteria that was sequenced 10 times independently over 2 years. Data are available here. This toy community is composed of Porphyrobacter sanguineus (2%), Bhargavaea cecembensis (34%), Pseudoalteromonas (53%) and Erythrobacter aquimaris (12%). The raw data from each of the 10 sequenced samples were submitted to the Sequence Read Archive (SRA) of the NCBI under Project ID SRP113288.

The 16S rRNA copy number was determined by qPCR using a Realplex Mastercycler (Eppendorf, Montesson, France); assays were carried out in triplicate for each sample using 96-well real-time PCR plates (Eppendorf). The qPCR was performed in 25 µL containing 12.5 µL Master Mix (Invitrogen, Eugene, USA), using primers BAC3388 and BAC805R (250 nM each), the TaqMan probe BAC516F (100 nM) and a DNA template ranging from 10 ng to 100 ng, as previously described. The real-time PCR thermocycling was set as follows: 95°C for 20 sec, then 40 cycles of 95°C for 15 sec and 60°C for 1 min. A negative control without DNA template was subjected to the same procedure to exclude any possible contamination. One standard curve was generated for each assay by using 10-fold dilutions of pEX-A plasmids (Eurofins MWG Operon) containing the targeted gene sequence. Three different dilutions of each sample were amplified, and the initial concentrations were calculated from reactions displaying no PCR inhibition.

The third one is another mock community of 67 species (see Caporaso et al. 2011 and Bokulich et al. 2013 for details). This real dataset consists of three experiments in which that synthetic community, made of 67 species with an even distribution, was sequenced three times independently: mock 6 (862 199 R1 reads), mock 7 (294 922 R1 reads) and mock 8 (327 268 R1 reads). Mock 6, 7 and 8 are biological replicates of the mock community, and the three samples of each mock are technical replicates. These mocks were used in Bokulich et al. 2013 and Bokulich et al. 2015. They were initially sequenced on a GAIIx (single end, 100 bp), then re-sequenced on a HiSeq (2 x 101 bp, therefore not overlapping for V4). These data are available here: mock 6, mock 7, mock 8.

Tabs description

Simulated from utax

Datasets: bacterial genera picked across the diversity of the UTAX databank
Commands: bioinformatics commands used to generate the tests and assessment results
Results: raw results on synthetic data picked from the UTAX database. Data were obtained with the SOP of each tool, using the affiliation step of each pipeline, and with the FROGS affiliation step with MultiAffiliation.

Simulated from silva

Datasets: bacterial species picked from the SILVA databank (larger than UTAX)

Summary: for each number of bacteria (20/100/200/500/1000), the table shows the distribution of the different species per taxonomic rank.
Distribution in databank: SILVA is represented by a diagram whose outer circle represents the Family rank. Families in red were picked and are present in the respective sample. The "All" button shows the coverage of bacterial diversity.

Commands: bioinformatics commands used to generate the tests and assessment results
Results: raw results on synthetic data picked from the SILVA database. Data were obtained with the SOP of each tool, without the affiliation step of each pipeline. This step was replaced by the FROGS affiliation step with MultiAffiliation.

Metrics

The four metrics used to compare the results of FROGS, UPARSE, QIIME and MOTHUR are:
- the divergence rate: the Bray-Curtis distance between expected and observed abundances at a given taxonomic level,
- the number of false-negative taxa (FN): the number of expected taxa that were not recovered by the method,
- the number of false-positive taxa (FP): the number of recovered taxa that were not expected, and
- the number of supernumerary OTUs: the number of additional OTUs with the same origin as the first expected OTUs.
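
The first three metrics can be sketched in a few lines of Python. This is a minimal illustration, not the assessment scripts listed in the Commands tabs: it assumes that expected and observed compositions are given as dictionaries mapping a taxon name to its relative abundance at one taxonomic level, and the function names and toy values are hypothetical. The divergence is returned as a fraction between 0 and 1.

```python
def bray_curtis(expected, observed):
    """Bray-Curtis distance between two abundance dictionaries (taxon -> abundance)."""
    taxa = set(expected) | set(observed)
    difference = sum(abs(expected.get(t, 0.0) - observed.get(t, 0.0)) for t in taxa)
    total = sum(expected.get(t, 0.0) + observed.get(t, 0.0) for t in taxa)
    return difference / total if total else 0.0


def present(abundances):
    """Set of taxa with a strictly positive abundance."""
    return {taxon for taxon, value in abundances.items() if value > 0}


def false_negatives(expected, observed):
    """Number of expected taxa that were not recovered."""
    return len(present(expected) - present(observed))


def false_positives(expected, observed):
    """Number of recovered taxa that were not expected."""
    return len(present(observed) - present(expected))


# Toy compositions invented for illustration.
expected = {"A": 0.5, "B": 0.3, "C": 0.2}
observed = {"A": 0.6, "B": 0.3, "D": 0.1}

divergence = bray_curtis(expected, observed)   # approximately 0.2
fn = false_negatives(expected, observed)       # taxon C is missing
fp = false_positives(expected, observed)       # taxon D is unexpected
```

Counting supernumerary OTUs additionally requires knowing the reference sequence each OTU originates from, which is not modelled in this sketch.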
