FROGS_3 Remove chimera


Context

Chimeras are artifact sequences formed by two or more biological sequences incorrectly joined together. This often occurs during PCR reactions using mixed templates (e.g., uncultured environmental samples). Incomplete extensions during PCR allow subsequent PCR cycles to use a partially extended strand to bind to the template of a different, but similar, sequence. This partially extended strand then acts as a primer to extend and form a chimeric sequence. Once created, the chimeric sequence is then further amplified in subsequent cycles. The end result is a PCR artifact that does not represent a sequence that exists in nature.
This phenomenon is particularly common in amplicon sequencing, where closely related sequences are amplified.

Source: EZBioCloud

:warning: The chimera rate can range from 5% to 45% of the sequences (Haas et al., 2011), typically for 16S data.

How it works

This tool removes chimeric sequences by sample.

De novo detection: In this algorithm, a chimera-free reference database is built automatically for each NGS dataset. The reference database starts empty, and reads are considered in order of decreasing abundance. If a sequence is classified as chimeric, it is discarded; otherwise, it is added to the reference database, which therefore grows as processing proceeds. Candidate parents (the PCR templates, strains A and B in the previous figure) must be more abundant than the query sequence, on the assumption that a chimera has undergone fewer rounds of amplification and will therefore be less abundant than its parents (Edgar et al., 2011).
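The greedy, abundance-ordered loop described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the VSEARCH implementation: the function names (`is_toy_chimera`, `denovo_filter`) and the `min_segment` parameter are hypothetical, and the chimera test is a deliberately naive exact prefix/suffix match, whereas the real algorithm scores alignments against candidate parents.

```python
from typing import Dict, List, Tuple


def is_toy_chimera(query: str, parents: List[str], min_segment: int = 4) -> bool:
    """Toy test (assumption): the query is chimeric if it is exactly the head of
    one parent joined to the tail of a different parent."""
    for cut in range(min_segment, len(query) - min_segment + 1):
        head, tail = query[:cut], query[cut:]
        left = [p for p in parents if p.startswith(head)]
        right = [p for p in parents if p.endswith(tail)]
        # Two different parents must explain the two sides of the breakpoint.
        if any(a != b for a in left for b in right):
            return True
    return False


def denovo_filter(abundances: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Process sequences by decreasing abundance, growing the reference database."""
    reference: List[Tuple[str, int]] = []  # starts empty, grows with each non-chimera
    chimeras: List[str] = []
    ordered = sorted(abundances.items(), key=lambda kv: (-kv[1], kv[0]))
    for seq, count in ordered:
        # Candidate parents must be strictly more abundant than the query.
        parents = [ref for ref, ref_count in reference if ref_count > count]
        if is_toy_chimera(seq, parents):
            chimeras.append(seq)  # discarded, never becomes a reference
        else:
            reference.append((seq, count))
    return [seq for seq, _ in reference], chimeras


kept, removed = denovo_filter({
    "AAAAAAAACCCCCCCC": 100,  # parent A
    "GGGGGGGGTTTTTTTT": 80,   # parent B
    "AAAAAAAATTTTTTTT": 5,    # head of A joined to tail of B
    "ACGTACGTACGTACGT": 3,    # rare but genuine sequence
})
print(kept)
print(removed)
```

Note that low abundance alone is not enough to be discarded: the rare genuine sequence is kept because no pair of more abundant parents explains it.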

Chimera detection is performed with VSEARCH, combined with an in-house cross-validation strategy. Detection is run sample by sample, and a cluster is removed only if it is identified as chimeric in every sample where it is present.
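The cross-validation rule can be expressed as a simple set comparison. The sketch below is one reading of that rule, not the FROGS source code; the function name `cross_validate` and the two input dictionaries are hypothetical data structures chosen for illustration.

```python
from typing import Dict, Set


def cross_validate(
    presence: Dict[str, Set[str]],
    flagged: Dict[str, Set[str]],
) -> Set[str]:
    """Return the clusters flagged as chimeric in every sample where they occur.

    presence: cluster ID -> samples in which the cluster has a non-zero count
    flagged:  cluster ID -> samples in which per-sample detection called it chimeric
    """
    to_remove: Set[str] = set()
    for cluster, samples in presence.items():
        calls = flagged.get(cluster, set())
        # Remove only if all samples containing the cluster agree it is chimeric.
        if samples and samples <= calls:
            to_remove.add(cluster)
    return to_remove


presence = {"c1": {"s1", "s2"}, "c2": {"s1", "s2"}, "c3": {"s2"}}
flagged = {"c1": {"s1", "s2"}, "c2": {"s1"}}
print(cross_validate(presence, flagged))  # only c1 is flagged everywhere it occurs
```

Under this rule a cluster called chimeric in one sample but not in another (`c2` here) is kept, which makes the removal conservative.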

:bulb: See the VSEARCH documentation for more details about VSEARCH chimera removal.

Command line

:package: v4.1.0

remove_chimera.py --help
usage: remove_chimera.py [-h] [-p NB_CPUS] [--debug] [-v] -f INPUT_FASTA
                         [-b INPUT_BIOM | -c INPUT_COUNT] [-n NON_CHIMERA]
                         [-a OUT_ABUNDANCE] [--summary SUMMARY] [-l LOG_FILE]

Removes PCR chimera.

optional arguments:
  -h, --help            show this help message and exit
  -p NB_CPUS, --nb-cpus NB_CPUS
                        The maximum number of CPUs used. [Default: 1]
  --debug               Keep temporary files to debug program.
  -v, --version         show program's version number and exit

Inputs:
  -f INPUT_FASTA, --input-fasta INPUT_FASTA
                        The cluster sequences (format: FASTA).
  -b INPUT_BIOM, --input-biom INPUT_BIOM
                        The abundance file for clusters by sample (format:
                        BIOM).
  -c INPUT_COUNT, --input-count INPUT_COUNT
                        The counts file for clusters by sample (format: TSV).

Outputs:
  -n NON_CHIMERA, --non-chimera NON_CHIMERA
                        sequences file without chimera (format: FASTA).
                        [Default: remove_chimera.fasta]
  -a OUT_ABUNDANCE, --out-abundance OUT_ABUNDANCE
                        Abundance file without chimera (format: BIOM or TSV).
                        [Default: remove_chimera_abundance.biom or
                        remove_chimera_abundance.tsv]
  --summary SUMMARY     The HTML file containing the graphs. [Default:
                        remove_chimera.html]
  -l LOG_FILE, --log-file LOG_FILE
                        This output file will contain several informations on
                        executed commands.

Example of command line:

remove_chimera.py --input-biom clustering.biom --input-fasta clustering.fasta --non-chimera remove_chimera.fasta --out-abundance remove_chimera.biom --summary remove_chimera.html
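When the tool is driven from a Python pipeline, the same command can be assembled as an argument list. This is a minimal sketch that uses only the options listed in the help output above; the helper name `build_remove_chimera_cmd` is hypothetical, and actually running the command assumes that `remove_chimera.py` is available on the `PATH`.

```python
import shlex
from typing import List


def build_remove_chimera_cmd(
    fasta: str,
    biom: str,
    out_fasta: str = "remove_chimera.fasta",
    out_abundance: str = "remove_chimera.biom",
    summary: str = "remove_chimera.html",
    nb_cpus: int = 1,
) -> List[str]:
    """Assemble the remove_chimera.py call as a list suitable for subprocess.run."""
    return [
        "remove_chimera.py",
        "--input-fasta", fasta,
        "--input-biom", biom,
        "--non-chimera", out_fasta,
        "--out-abundance", out_abundance,
        "--summary", summary,
        "--nb-cpus", str(nb_cpus),
    ]


cmd = build_remove_chimera_cmd("clustering.fasta", "clustering.biom", nb_cpus=4)
print(shlex.join(cmd))
# To execute it (requires FROGS to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when file names contain spaces.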

Galaxy

Sequences file: the input file in FASTA format (contains the OTU sequences)
Abundance type: BIOM or TSV (BIOM if you follow the FROGS guidelines, TSV in some other cases)
Abundance file: the abundance file, in the format you chose just above

Outputs

HTML report

The HTML file summarizes important information about the chimera removal process.

:question: How many clusters/sequences are kept after the process?

In this example, 5,945 OTUs are kept and 13,968 OTUs have been detected as chimeras and removed. These 5,945 OTUs correspond to 558,062 sequences kept, i.e. 97.5% of the total information.
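The figures quoted for this example can be cross-checked with a little arithmetic. The snippet below uses only the numbers reported above; the total number of sequences is not given in the report excerpt, so it is back-calculated from the 97.5% figure and is therefore approximate.

```python
kept_clusters = 5_945
removed_clusters = 13_968
kept_sequences = 558_062
kept_sequence_fraction = 0.975  # as reported in this example

total_clusters = kept_clusters + removed_clusters  # 19,913
cluster_kept_pct = 100 * kept_clusters / total_clusters

# Approximation: derived from the rounded 97.5% value, not read from the report.
approx_total_sequences = round(kept_sequences / kept_sequence_fraction)

print(f"Clusters kept: {kept_clusters} of {total_clusters} ({cluster_kept_pct:.1f}%)")
print(f"Sequences kept: {kept_sequences} of about {approx_total_sequences}")
```

Only about 30% of the clusters survive, yet they carry 97.5% of the sequences, which is consistent with chimeras being mostly low-abundance clusters.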

BIOM file

The BIOM file contains information about the clusters remaining after the chimera removal process.

FASTA file

The FASTA file contains sequences of the non-chimeric clusters.




A work by FROGS team