Remove chimera
What it does
Chimeras are artifact sequences formed by two or more biological sequences incorrectly joined together. This often occurs during PCR reactions using mixed templates (i.e., uncultured environmental samples). Incomplete extensions during PCR allow subsequent PCR cycles to use a partially extended strand to bind to the template of a different, but similar, sequence. This partially extended strand then acts as a primer to extend and form a chimeric sequence. Once created, the chimeric sequence is then further amplified in subsequent cycles. The end result is a PCR artifact that does not represent a sequence that exists in nature.
Source: EZBioCloud
The chimera rate can reach 40% of the sequences (typically for 16S data).
How it does it
De novo detection: In this algorithm, a chimera-free reference database is automatically generated for each NGS dataset. Initially, the reference database is empty. Then, NGS reads are considered in order of decreasing abundance. If a sequence is classified as chimeric, it is discarded; otherwise, it is added to the reference database (so the reference database grows). Candidate parents (the PCR templates, strains A and B in the previous figure) are required to be more abundant than the query sequence, on the assumption that a chimera has undergone fewer rounds of amplification and will therefore be less abundant than its parents (Edgar et al., 2011). This algorithm is implemented in UCHIME.
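The abundance-ordered procedure described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the UCHIME or vsearch implementation: the function names (is_toy_chimera, denovo_filter) are hypothetical, and the chimera test is a deliberately naive stand-in (an exact prefix of one parent joined to an exact suffix of another) where the real tools use alignment-based scoring.

```python
from typing import Dict, List, Tuple


def is_toy_chimera(query: str, parents: List[str]) -> bool:
    """Naive stand-in for chimera scoring (assumption, not UCHIME's model).

    The query is called chimeric if it equals a prefix of one parent
    followed by the suffix of a different parent of the same length.
    """
    for a in parents:
        for b in parents:
            if a == b or len(a) != len(query) or len(b) != len(query):
                continue
            if query in (a, b):
                continue
            for cut in range(1, len(query)):
                if query[:cut] == a[:cut] and query[cut:] == b[cut:]:
                    return True
    return False


def denovo_filter(abundances: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Return (kept sequences, discarded chimeras) for one dataset."""
    reference: List[Tuple[str, int]] = []  # starts empty, grows as we go
    chimeras: List[str] = []
    # Consider sequences in order of decreasing abundance.
    ordered = sorted(abundances.items(), key=lambda kv: (-kv[1], kv[0]))
    for seq, count in ordered:
        # Candidate parents must be strictly more abundant than the query.
        candidates = [ref for ref, ref_count in reference if ref_count > count]
        if is_toy_chimera(seq, candidates):
            chimeras.append(seq)
        else:
            reference.append((seq, count))
    return [seq for seq, _ in reference], chimeras


kept, discarded = denovo_filter(
    {"AAAAAAAA": 100, "CCCCCCCC": 80, "AAAACCCC": 5, "GGGGGGGG": 3}
)
print("kept:", kept)
print("discarded:", discarded)
```

In this toy input, the low-abundance sequence made of half of each abundant sequence is the only one discarded; the others are added to the growing reference.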
The chimera detection is performed with vsearch [1], a free alternative to USEARCH (UCHIME).
The chimera detection is performed sample by sample. A cross-validation is then performed so that a sequence is removed only if it is identified as a chimera in every sample where it is present.
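The cross-validation rule can be expressed as a small set operation. The sketch below is an assumption about how the rule reads, not the code of remove_chimera.py: the function name cross_validate and the two input dictionaries (per-sample counts and per-sample chimera calls) are hypothetical.

```python
from typing import Dict, Set


def cross_validate(
    counts: Dict[str, Dict[str, int]],
    flagged: Dict[str, Set[str]],
) -> Set[str]:
    """Return the clusters to remove.

    counts:  cluster -> {sample -> count}
    flagged: sample  -> set of clusters called chimeric in that sample
    A cluster is removed only if it is flagged in every sample where
    its count is greater than zero.
    """
    removed: Set[str] = set()
    for cluster, per_sample in counts.items():
        present = {sample for sample, n in per_sample.items() if n > 0}
        if present and all(cluster in flagged.get(s, set()) for s in present):
            removed.add(cluster)
    return removed


counts = {
    "Cluster_1": {"s1": 10, "s2": 5},
    "Cluster_2": {"s1": 3, "s2": 0},
    "Cluster_3": {"s1": 2, "s2": 4},
}
flagged = {"s1": {"Cluster_2", "Cluster_3"}, "s2": set()}
print(sorted(cross_validate(counts, flagged)))
```

Here Cluster_3 is flagged in only one of the two samples where it occurs, so it is kept; Cluster_2 occurs in a single sample and is flagged there, so it is removed.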
See here for more details about vsearch chimera removal.
Parameters
Smart parameters have been defined based on our experience to make it easier for you (with --toto 0.2 and --ffor 477).
Command line
v3.2.3
usage: remove_chimera.py [-h] [-p NB_CPUS] [--debug] [-v] -f INPUT_FASTA
[-b INPUT_BIOM | -c INPUT_COUNT] [-n NON_CHIMERA]
[-a OUT_ABUNDANCE] [--summary SUMMARY] [-l LOG_FILE]
Removes PCR chimera.
optional arguments:
-h, --help show this help message and exit
-p NB_CPUS, --nb-cpus NB_CPUS
The maximum number of CPUs used. [Default: 1]
--debug Keep temporary files to debug program.
-v, --version show program's version number and exit
Inputs:
-f INPUT_FASTA, --input-fasta INPUT_FASTA
The cluster sequences (format: FASTA).
-b INPUT_BIOM, --input-biom INPUT_BIOM
The abundance file for clusters by sample (format:
BIOM).
-c INPUT_COUNT, --input-count INPUT_COUNT
The counts file for clusters by sample (format: TSV).
Outputs:
-n NON_CHIMERA, --non-chimera NON_CHIMERA
sequences file without chimera (format: FASTA).
[Default: remove_chimera.fasta]
-a OUT_ABUNDANCE, --out-abundance OUT_ABUNDANCE
Abundance file without chimera (format: BIOM or TSV).
[Default: remove_chimera_abundance.biom or
remove_chimera_abundance.tsv]
--summary SUMMARY The HTML file containing the graphs. [Default:
remove_chimera.html]
-l LOG_FILE, --log-file LOG_FILE
This output file will contain several informations on
executed commands.
Example of command line:
remove_chimera.py --input-fasta input.fasta \
--input-biom input.biom \
--non-chimera output.fasta \
--nb-cpus 1 \
--log-file output.log \
--out-abundance output.biom \
--summary output.html
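When the tool is called from a script, the same command can be assembled programmatically. The helper below is a hypothetical convenience wrapper (build_remove_chimera_cmd is not part of FROGS); it only uses options listed in the usage text above and, following the usage line, refuses --input-biom and --input-count together.

```python
import shlex
from typing import List, Optional


def build_remove_chimera_cmd(
    input_fasta: str,
    input_biom: Optional[str] = None,
    input_count: Optional[str] = None,
    non_chimera: str = "remove_chimera.fasta",
    nb_cpus: int = 1,
) -> List[str]:
    """Build the argument list for remove_chimera.py (hypothetical helper)."""
    # The usage line shows -b and -c as mutually exclusive.
    if input_biom is not None and input_count is not None:
        raise ValueError("Use either input_biom or input_count, not both.")
    cmd = ["remove_chimera.py", "--input-fasta", input_fasta]
    if input_biom is not None:
        cmd += ["--input-biom", input_biom]
    if input_count is not None:
        cmd += ["--input-count", input_count]
    cmd += ["--non-chimera", non_chimera, "--nb-cpus", str(nb_cpus)]
    return cmd


cmd = build_remove_chimera_cmd(
    "input.fasta", input_biom="input.biom", non_chimera="output.fasta"
)
# Print the command; pass `cmd` to subprocess.run(cmd, check=True) to run it.
print(shlex.join(cmd))
```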
Galaxy
Sequences file: the input file in FASTA format (contains the OTU sequences)
Abundance type: BIOM or TSV (BIOM if you follow the FROGS guidelines, TSV in some other cases)
Abundance file: the file in the format you chose just before
Outputs
HTML report
The HTML output is obtained with the --summary parameter on the command line.
With Galaxy, you obtain this output:
The HTML file summarizes important information about the chimera removal process.
How many OTUs/sequences are kept after the process?
Figure 1: Remove summary graphic
In this example, 5,945 OTUs are kept and 13,968 OTUs have been detected as chimeras and removed. These 5,945 OTUs correspond to 558,062 sequences kept, i.e. 97.5% of the total information.
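The figures in this example show why both counts matter: most OTUs are removed, yet most sequences are kept. The short calculation below uses only the numbers quoted above; the total number of input sequences is not given in the report text, so the value derived from the rounded 97.5% is an approximation.

```python
kept_otus = 5_945
removed_otus = 13_968
kept_sequences = 558_062
kept_sequence_share = 0.975  # rounded value quoted in the report

total_otus = kept_otus + removed_otus
kept_otu_pct = 100 * kept_otus / total_otus
removed_otu_pct = 100 * removed_otus / total_otus

# Approximate only: back-computed from a rounded percentage.
approx_total_sequences = round(kept_sequences / kept_sequence_share)

print(f"OTUs before removal: {total_otus}")
print(f"OTUs kept: {kept_otu_pct:.1f}%, removed: {removed_otu_pct:.1f}%")
print(f"Approximate sequences before removal: {approx_total_sequences}")
```

About 70% of the OTUs are discarded while only about 2.5% of the sequences are lost, which is consistent with chimeras being much less abundant than their parents.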
BIOM file
The BIOM output is obtained with --out-abundance (on the command line) or on Galaxy:
The BIOM file contains information about the OTUs after the chimera removal process.
See here for more information about the BIOM format (not human-readable).
FASTA file
The FASTA file contains the sequences of the non-chimeric OTUs. It is obtained with --non-chimera (on the command line) or on Galaxy:
Example for Cluster_1:
>Cluster_1 reference=otu_00162 position=1..300
GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAACGAACGGATAAAGAGCTTGCTCTTTTGAAGTTAGTGGCGGACGGGTGAGTAACACGTGGGTAACCTGCCTCACAGCTGGGGATAACATCGAGAAATCGATGCTAATACCGAATGTGCTGAACATCATAAGATGTTCAAGTGAAAGACGGTTTCGGCTGTCACTGTGAGATGGACCCGCGCTGGATTAGCTAGTTGGTAAGGTAATGGCTTACCAAGGCGACGATCCATAGCCGACCTGAGAGGGTGATCGGCCACATTGGGACTGAGACACGGCCCAAACTCCTACGGGAGGCAGCAGTAGGGAATCTTCGGCAATGGACGAAAGTCTGACCGAGCAACGCCGCGTGAGCGAAGAAGGCCTTCGGGTCGTAAAGCTCTGTTGTTAGAGAAGAACATGGGTGAGAGTAACTGTTCACCCCTTGACGGTATCTAACCAGAAAGCCACGGCTAACTACGTG
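A header such as the one above can be split into the cluster identifier and its key=value attributes with a few lines of Python. This is a minimal sketch: parse_header is a hypothetical helper, and the meaning of the reference and position attributes is not described on this page, so they are kept as plain strings.

```python
from typing import Dict, Tuple


def parse_header(line: str) -> Tuple[str, Dict[str, str]]:
    """Split a FASTA header into (cluster id, {attribute: value})."""
    fields = line.strip().lstrip(">").split()
    cluster_id = fields[0]
    # Keep only fields of the form key=value; values stay as strings.
    attributes = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
    return cluster_id, attributes


cluster_id, attributes = parse_header(
    ">Cluster_1 reference=otu_00162 position=1..300"
)
print(cluster_id, attributes)
```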
A work by FROGS team
[1] Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: A versatile open source tool for metagenomics. PeerJ. 2016;4:e2584.