Remove chimera

What it does

Chimeras are artifact sequences formed when two or more biological sequences are incorrectly joined together. This often occurs during PCR with mixed templates (e.g., uncultured environmental samples). When extension is incomplete, a partially extended strand can bind, in a subsequent cycle, to the template of a different but similar sequence. The partially extended strand then acts as a primer and is extended into a chimeric sequence, which is further amplified in the following cycles. The end result is a PCR artifact that does not represent any sequence that exists in nature.

Source: EZBioCloud

:warning: Chimeras can account for up to 40% of the sequences (typically in 16S data).

How it works

De novo detection: In this algorithm, a chimera-free reference database is built automatically for each NGS dataset. Initially, the reference database is empty. NGS reads are then considered in order of decreasing abundance. If a sequence is classified as chimeric, it is discarded; otherwise, it is added to the reference database (so the reference database grows). Candidate parents (PCR templates, strains A and B in the previous figure) are required to be more abundant than the query sequence, on the assumption that a chimera has undergone fewer rounds of amplification and will therefore be less abundant than its parents (Edgar et al., 2011). UCHIME provides this algorithm.
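
The abundance-ordered loop described above can be sketched as follows. This is a minimal illustration, not the UCHIME or vsearch implementation: the function names are hypothetical, and the chimera test is a toy that only recognises an exact prefix of one parent joined to an exact suffix of another, whereas real tools score alignments.

```python
from typing import Dict, List, Tuple


def is_toy_chimera(query: str, parents: List[str]) -> bool:
    """Toy test (assumption): query = prefix of one parent + suffix of another."""
    if query in parents:
        return False
    for cut in range(1, len(query)):
        left, right = query[:cut], query[cut:]
        left_parents = [p for p in parents if p.startswith(left)]
        right_parents = [p for p in parents if p.endswith(right)]
        # Two *different* parents must explain the two halves.
        if any(a != b for a in left_parents for b in right_parents):
            return True
    return False


def denovo_filter(abundances: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Return (kept sequences, discarded chimeras) from sequence -> abundance."""
    reference: List[Tuple[str, int]] = []  # starts empty, grows as we go
    chimeras: List[str] = []
    # Consider sequences in order of decreasing abundance.
    ordered = sorted(abundances.items(), key=lambda kv: (-kv[1], kv[0]))
    for seq, count in ordered:
        # Candidate parents must be more abundant than the query.
        candidates = [ref for ref, ref_count in reference if ref_count > count]
        if is_toy_chimera(seq, candidates):
            chimeras.append(seq)
        else:
            reference.append((seq, count))
    return [seq for seq, _ in reference], chimeras


example = {
    "AAAAAAAAAA": 100,  # parent A
    "CCCCCCCCCC": 80,   # parent B
    "AAAAACCCCC": 5,    # A/B chimera, less abundant than both parents
    "GGGGGGGGGG": 3,    # unrelated rare sequence
}
kept, discarded = denovo_filter(example)
print("kept:", kept)
print("discarded:", discarded)
```

In this toy data set the rare but genuine sequence is kept, because low abundance alone is not enough: two more abundant parents must also explain it.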

Chimera detection is performed with vsearch [1], a free alternative to USEARCH (UCHIME).
Detection is run sample by sample; a cross-validation step then removes only the sequences identified as chimeric in every sample where they are present.
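
The cross-validation rule can be illustrated with a short sketch. The data structures and the function name `cross_validate` are assumptions made for this example; they do not reflect how remove_chimera.py stores its per-sample results.

```python
from typing import Dict, Set


def cross_validate(
    presence: Dict[str, Set[str]],
    flagged: Dict[str, Set[str]],
) -> Set[str]:
    """Return the clusters to remove.

    presence: sample -> clusters observed in that sample
    flagged:  sample -> clusters called chimeric in that sample
    A cluster is removed only if it is flagged in every sample where it occurs.
    """
    all_clusters: Set[str] = set()
    for clusters in presence.values():
        all_clusters |= clusters

    removed: Set[str] = set()
    for cluster in all_clusters:
        samples_with = [s for s, clusters in presence.items() if cluster in clusters]
        if all(cluster in flagged.get(s, set()) for s in samples_with):
            removed.add(cluster)
    return removed


presence = {"s1": {"c1", "c2", "c3"}, "s2": {"c2", "c3"}}
flagged = {"s1": {"c2", "c3"}, "s2": {"c3"}}
to_remove = cross_validate(presence, flagged)
print(sorted(to_remove))
```

Here `c2` is flagged in `s1` but not in `s2`, so it is kept; only `c3`, flagged in both samples where it occurs, is removed.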

:bulb: See here for more details about vsearch chimera removal.

Parameters

Smart parameters have been defined based on our experience to make it easier for you (with --toto 0.2 and --ffor 477).

Command line

:package: v3.2.3

usage: remove_chimera.py [-h] [-p NB_CPUS] [--debug] [-v] -f INPUT_FASTA
                         [-b INPUT_BIOM | -c INPUT_COUNT] [-n NON_CHIMERA]
                         [-a OUT_ABUNDANCE] [--summary SUMMARY] [-l LOG_FILE]

Removes PCR chimera.

optional arguments:
  -h, --help            show this help message and exit
  -p NB_CPUS, --nb-cpus NB_CPUS
                        The maximum number of CPUs used. [Default: 1]
  --debug               Keep temporary files to debug program.
  -v, --version         show program's version number and exit

Inputs:
  -f INPUT_FASTA, --input-fasta INPUT_FASTA
                        The cluster sequences (format: FASTA).
  -b INPUT_BIOM, --input-biom INPUT_BIOM
                        The abundance file for clusters by sample (format:
                        BIOM).
  -c INPUT_COUNT, --input-count INPUT_COUNT
                        The counts file for clusters by sample (format: TSV).

Outputs:
  -n NON_CHIMERA, --non-chimera NON_CHIMERA
                        sequences file without chimera (format: FASTA).
                        [Default: remove_chimera.fasta]
  -a OUT_ABUNDANCE, --out-abundance OUT_ABUNDANCE
                        Abundance file without chimera (format: BIOM or TSV).
                        [Default: remove_chimera_abundance.biom or
                        remove_chimera_abundance.tsv]
  --summary SUMMARY     The HTML file containing the graphs. [Default:
                        remove_chimera.html]
  -l LOG_FILE, --log-file LOG_FILE
                        This output file will contain several informations on
                        executed commands.

Example of command line:

remove_chimera.py --input-fasta input.fasta \
                  --input-biom input.biom \
                  --non-chimera output.fasta \
                  --nb-cpus 1 \
                  --log-file output.log \
                  --out-abundance output.biom \
                  --summary output.html
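
When the tool is called from a script, the same command can be assembled programmatically. The helper below is a hypothetical sketch: `build_remove_chimera_cmd` is not part of FROGS, it only uses options listed in the usage message above, and it treats `--input-biom` and `--input-count` as alternatives, as the usage line indicates.

```python
from typing import List, Optional


def build_remove_chimera_cmd(
    input_fasta: str,
    input_biom: Optional[str] = None,
    input_count: Optional[str] = None,
    non_chimera: str = "remove_chimera.fasta",
    nb_cpus: int = 1,
) -> List[str]:
    """Build an argument list for remove_chimera.py (illustrative helper)."""
    if input_biom is not None and input_count is not None:
        raise ValueError("--input-biom and --input-count are alternatives")
    cmd = ["remove_chimera.py", "--input-fasta", input_fasta]
    if input_biom is not None:
        cmd += ["--input-biom", input_biom]
    if input_count is not None:
        cmd += ["--input-count", input_count]
    cmd += ["--non-chimera", non_chimera, "--nb-cpus", str(nb_cpus)]
    return cmd


cmd = build_remove_chimera_cmd("input.fasta", input_count="input.tsv")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where remove_chimera.py is on the PATH.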

Galaxy

Sequences file: the input file in FASTA format (contains the OTU sequences)
Abundance type: BIOM or TSV (BIOM if you follow the FROGS guidelines, TSV in some other cases)
Abundance file: the abundance file, in the format you chose just above

Outputs

HTML report

The HTML output is produced by the --summary parameter on the command line.
With Galaxy, you obtain this output:

The HTML file summarizes important information about the chimera removal process.

:question: How many OTUs/sequences are kept after the process?

Figure 1: Remove summary graphic

In this example, 5,945 OTUs are kept and 13,968 OTUs have been detected as chimeras and removed. These 5,945 OTUs correspond to 558,062 sequences kept, i.e. 97.5% of the total sequences.
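
The figures in this example can be cross-checked with a few lines of arithmetic. The total and removed sequence counts below are not given in the report excerpt; they are derived from the rounded 97.5% value and are therefore only approximate.

```python
kept_otus = 5_945
removed_otus = 13_968
kept_sequences = 558_062
kept_sequence_fraction = 0.975  # rounded value quoted in the example

total_otus = kept_otus + removed_otus
otu_kept_pct = 100 * kept_otus / total_otus

# Derived from a rounded percentage, so treat these as estimates.
approx_total_sequences = kept_sequences / kept_sequence_fraction
approx_removed_sequences = approx_total_sequences - kept_sequences

print(f"{total_otus} input OTUs, {otu_kept_pct:.1f}% kept")
print(f"about {approx_total_sequences:,.0f} input sequences, "
      f"about {approx_removed_sequences:,.0f} removed")
```

Roughly 70% of the OTUs are removed, yet they carry only about 2.5% of the sequences, which is consistent with chimeras being low-abundance artifacts.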

BIOM file

The BIOM output is produced by --out-abundance (on the command line) or on Galaxy:

The BIOM file contains information about the OTUs after the chimera removal process.
See here for more information about the BIOM format (which is not human-readable).

FASTA file

The FASTA file contains the sequences of the non-chimeric OTUs. It is produced by --non-chimera (on the command line) or on Galaxy:

Example for Cluster_1:

>Cluster_1 reference=otu_00162 position=1..300
GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAACGAACGGATAAAGAGCTTGCTCTTTTGAAGTTAGTGGCGGACGGGTGAGTAACACGTGGGTAACCTGCCTCACAGCTGGGGATAACATCGAGAAATCGATGCTAATACCGAATGTGCTGAACATCATAAGATGTTCAAGTGAAAGACGGTTTCGGCTGTCACTGTGAGATGGACCCGCGCTGGATTAGCTAGTTGGTAAGGTAATGGCTTACCAAGGCGACGATCCATAGCCGACCTGAGAGGGTGATCGGCCACATTGGGACTGAGACACGGCCCAAACTCCTACGGGAGGCAGCAGTAGGGAATCTTCGGCAATGGACGAAAGTCTGACCGAGCAACGCCGCGTGAGCGAAGAAGGCCTTCGGGTCGTAAAGCTCTGTTGTTAGAGAAGAACATGGGTGAGAGTAACTGTTCACCCCTTGACGGTATCTAACCAGAAAGCCACGGCTAACTACGTG
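
To inspect such a file in a script, a minimal FASTA reader is enough. The sketch below assumes that headers follow the pattern of the example above (an identifier followed by space-separated `key=value` fields); the function name `parse_fasta` is illustrative, and the sequence in the sample is shortened.

```python
from typing import Dict, List


def parse_fasta(text: str) -> List[Dict[str, object]]:
    """Parse FASTA text into records with id, attributes and sequence."""
    records: List[Dict[str, object]] = []
    header = None
    chunks: List[str] = []

    def flush() -> None:
        fields = header.split()
        attributes = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
        records.append(
            {"id": fields[0], "attributes": attributes, "sequence": "".join(chunks)}
        )

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                flush()
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        flush()
    return records


sample = ">Cluster_1 reference=otu_00162 position=1..300\nGACGAACG\nCTGG\n"
records = parse_fasta(sample)
print(records[0]["id"], records[0]["attributes"], len(records[0]["sequence"]))
```

Counting the records returned for the output file gives the number of non-chimeric OTUs, which should match the figure shown in the HTML report.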



A work by FROGS team

 


  1. Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: A versatile open source tool for metagenomics. PeerJ. 2016;4:e2584.