FROGSFUNC_1_placeseqs_copynumber
Context
PICRUSt2 is a software tool for predicting functional abundances based only on marker gene sequences. It is integrated into the FROGS suite as the FROGSFUNC tools, which are split into 3 steps:
- FROGSFUNC_1_placeseqs_copynumber: Places the ASVs into a reference phylogenetic tree and predicts the copy numbers of the marker gene (16S, ITS or 18S).
- FROGSFUNC_2_functions: Predicts the copy number of each function in each ASV, then calculates function abundances in each sample and ASV abundances adjusted for marker copy number.
- FROGSFUNC_3_pathways: Calculates pathway abundances in each sample.
These predictions can be useful for generating hypotheses, but should always be interpreted cautiously, especially when focusing on a single function or on predictions for a single ASV.
PICRUSt2 supports 3 markers only: 16S, ITS and 18S. If you used another one (rpob, 23S, coi, ef1, etc.), you cannot use these 3 tools.
What it does
FROGSFUNC_1_placeseqs_copynumber is the first step of PICRUSt2. It inserts your study sequences into a reference tree. By default, this reference tree is based on 20,000 16S sequences from genomes in the Integrated Microbial Genomes database. Specifically, the tool:
- Aligns your study sequences with a multiple-sequence alignment of reference 16S, ITS or 18S sequences with HMMER.
- Finds the most likely placements of your study sequences in the reference tree with EPA-ng or SEPP.
- Produces a tree file in which each study sequence is added as a new tip at its most likely placement, with GAPPA.
- Predicts the marker copy number based solely on the sequences of marker genes with PICRUSt2. The available marker genes are 16S, ITS and 18S.
Two input files are required for the FROGSFUNC_1_placeseqs_copynumber analysis:
- a FASTA file of ASV sequences (for example, from the FROGS_4 Cluster filters step)
- a BIOM file of ASV abundances with taxonomic affiliation information (for example, from the FROGS_5 taxonomic affiliation step)
[Figure: Placement of sequences of interest in the PICRUSt2 reference tree]
[Figure: Prediction of the copy numbers of the marker gene]
Command line
v4.1.0
usage: frogsfunc_placeseqs.py [-h] [-v] [--debug] -i INPUT_FASTA -b INPUT_BIOM
[-r REF_DIR] [-p {epa-ng,sepp}]
[--min-align MIN_ALIGN]
[--input-marker-table INPUT_MARKER_TABLE]
[--hsp-method {mp,emp_prob,pic,scp,subtree_average}]
[-o OUTPUT_TREE] [-e EXCLUDED] [-s OUTPUT_FASTA]
[-m OUTPUT_BIOM] [-c CLOSESTS_REF] [-l LOG_FILE]
[-t SUMMARY] [-om OUTPUT_MARKER]
place studies sequences (i.e. ASVs) into a reference tree.
optional arguments:
-h, --help show this help message and exit
-v, --version show programs version number and exit
--debug Keep temporary files to debug program.
Inputs:
-i INPUT_FASTA, --input-fasta INPUT_FASTA
Input fasta file of unaligned studies sequences.
-b INPUT_BIOM, --input-biom INPUT_BIOM
Input biom file of unaligned studies sequences.
-r REF_DIR, --ref-dir REF_DIR
If marker studied is not 16S, this is the directory
containing reference sequence files (for ITS, see:
$PICRUST2_PATH/default_files/fungi/fungi_ITS
-p {epa-ng,sepp}, --placement-tool {epa-ng,sepp}
Tool to place sequences into reference tree. Note that
epa-ng is more sensitiv but very memory and computing
power intensive. Warning : sepp is not usable for ITS
and 18S analysis [Default: epa-ng]
--min-align MIN_ALIGN
Proportion of the total length of an input query
sequence that must align with reference sequences. Any
sequences with lengths below this value after making
an alignment with reference sequences will be excluded
from the placement and all subsequent steps. (default:
0.8).
--input-marker-table INPUT_MARKER_TABLE
The input marker table describing directly observed
traits (e.g. sequenced genomes) in tab-delimited
format. (ex
$PICRUSt2_PATH/default_files/fungi/ITS_counts.txt.gz).
Required.
--hsp-method {mp,emp_prob,pic,scp,subtree_average}
HSP method to use. mp: predict discrete traits using
max parsimony. emp_prob: predict discrete traits based
on empirical state probabilities across tips.
subtree_average: predict continuous traits using
subtree averaging. pic: predict continuous traits with
phylogentic independent contrast. scp: reconstruct
continuous traits using squared-change parsimony
(default: mp).
Outputs:
-o OUTPUT_TREE, --output-tree OUTPUT_TREE
Reference tree output with insert sequences (format:
newick). [Default: frogsfunc_placeseqs_tree.nwk]
-e EXCLUDED, --excluded EXCLUDED
List of sequences not inserted in the tree. [Default:
frogsfunc_placeseqs_excluded.txt]
-s OUTPUT_FASTA, --output-fasta OUTPUT_FASTA
Fasta file without non insert sequences. (format:
FASTA). [Default: frogsfunc_placeseqs.fasta]
-m OUTPUT_BIOM, --output-biom OUTPUT_BIOM
Biom file without non insert sequences. (format: BIOM)
[Default: frogsfunc_placeseqs.biom]
-c CLOSESTS_REF, --closests-ref CLOSESTS_REF
Informations about Clusters (i.e OTUs) and PICRUSt2
closest reference from cluster sequences
(identifiants, taxonomies, phylogenetic distance from
reference, nucleotidics sequences). [Default:
frogsfunc_placeseqs_closests_ref_sequences.txt]
-l LOG_FILE, --log-file LOG_FILE
List of commands executed.
-t SUMMARY, --summary SUMMARY
Path to store resulting html file. [Default:
frogsfunc_placeseqs_summary.html]
-om OUTPUT_MARKER, --output-marker OUTPUT_MARKER
Output table of predicted marker gene copy numbers per
studied sequence in input tree. If the extension ".gz"
is added the table will automatically be
gzipped.[Default: frogsfunc_marker.tsv]
Example of command line:
./frogsfunc_placeseqs.py \
--input-fasta data/frogsfunc.fasta \
--input-biom data/frogsfunc.biom \
--placement-tool sepp \
--output-tree frogsfunc_placeseqs_tree.nwk \
--excluded frogsfunc_placeseqs_excluded.txt \
--output-fasta frogsfunc_placeseqs.fasta \
--output-biom frogsfunc_placeseqs.biom \
--closests-ref frogsfunc_placeseqs_closests_ref_sequences.txt \
--output-marker frogsfunc_marker.tsv \
--summary frogsfunc_placeseqs_summary.html
For ITS or 18S analysis, you must specify the path to the PICRUSt2 reference files directory. Example for ITS:
./frogsfunc_placeseqs.py \
--input-fasta data/its.fasta \
--input-biom data/its.biom \
--ref-dir $PICRUST2_PATH/default_files/fungi/fungi_ITS
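When the tool is driven from a script rather than typed by hand, the two command lines above can be assembled programmatically. The following is a minimal sketch, not part of FROGS: the helper name `build_placeseqs_command` is hypothetical, it only emits options that appear in the usage text, and it assumes that a reference directory is passed only for ITS or 18S data.

```python
import os
import shlex


def build_placeseqs_command(input_fasta, input_biom, placement_tool="epa-ng",
                            min_align=0.8, ref_dir=None,
                            script="frogsfunc_placeseqs.py"):
    """Build the argument list for one frogsfunc_placeseqs.py run.

    Hypothetical helper: only options listed in the usage text are emitted.
    """
    if placement_tool not in ("epa-ng", "sepp"):
        raise ValueError("placement_tool must be 'epa-ng' or 'sepp'")
    if not 0 <= min_align <= 1:
        raise ValueError("min_align is a proportion between 0 and 1")
    # Assumption: a reference directory is only given for ITS or 18S data,
    # and the usage text warns that sepp is not usable for those markers.
    if ref_dir is not None and placement_tool == "sepp":
        raise ValueError("sepp is not usable for ITS and 18S analysis")

    argv = [script,
            "--input-fasta", input_fasta,
            "--input-biom", input_biom,
            "--placement-tool", placement_tool,
            "--min-align", str(min_align)]
    if ref_dir is not None:
        argv += ["--ref-dir", ref_dir]
    return argv


# ITS example mirroring the command above; the variable is expanded only if
# PICRUST2_PATH is set in the environment.
its_ref = os.path.expandvars("$PICRUST2_PATH/default_files/fungi/fungi_ITS")
argv = build_placeseqs_command("data/its.fasta", "data/its.biom", ref_dir=its_ref)
print(shlex.join(argv))
```

The returned list can be handed to `subprocess.run(argv, check=True)` to launch the run. Building a list instead of a single string avoids shell quoting problems with file paths.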
Galaxy
Sequences file: The ASV FASTA sequence file.
biom file: The ASV BIOM file. Taxonomic affiliation must be performed beforehand (BIOM file from the FROGS_5 taxonomic_affiliation tool).
taxonomy marker: Only 16S, ITS and 18S are available.
If your ASVs are based on another marker, you cannot use this tool.
placement tool: EPA-ng and SEPP are placement tools for inserting sequences into the PICRUSt2 reference tree. SEPP is a low-memory alternative to EPA-ng, so if the tool crashes with EPA-ng, try again with SEPP.
minimum alignment length: Proportion of the total length of an input sequence that must align with the reference sequences. Sequences below this proportion are excluded.
Outputs
HTML report
The HTML report indicates which ASVs are placed in the phylogenetic tree and which are not. Note that PICRUSt2 uses its own reference tree to affiliate ASVs with reference sequences. For each ASV, the report gives the closest PICRUSt2 reference sequence and compares it with the original FROGS taxonomy. Clicking on the sequence ID gives you more information about it from the JGI database.
How many ASVs/sequences are kept after the process?
The pie charts show the proportion of ASVs and the proportion of total sequences that are excluded from the following steps.
An ASV is excluded if the proportion of its sequence length that aligns with the reference sequences is below the "minimum alignment length" threshold (--min-align).
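The exclusion rule can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not the PICRUSt2 implementation: the function names are hypothetical, the aligned length of each ASV is taken as already known, and the ASV names and lengths are invented for illustration.

```python
def is_excluded(aligned_length, total_length, min_align=0.8):
    """Return True when the aligned proportion is below the threshold."""
    if total_length <= 0:
        raise ValueError("total_length must be positive")
    return aligned_length / total_length < min_align


def split_by_alignment(asvs, min_align=0.8):
    """Split ASVs into kept and excluded names.

    asvs maps an ASV name to (aligned_length, total_length).
    """
    kept, excluded = [], []
    for name, (aligned_length, total_length) in asvs.items():
        if is_excluded(aligned_length, total_length, min_align):
            excluded.append(name)
        else:
            kept.append(name)
    return kept, excluded


# Invented example: Cluster_2 aligns over about 59% of its length only.
example = {"Cluster_1": (250, 253), "Cluster_2": (150, 253)}
kept, excluded = split_by_alignment(example)
print("kept:", kept)
print("excluded:", excluded)
```

In this sketch an ASV whose aligned proportion is exactly equal to the threshold is kept, since the rule only excludes values below it.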
Where are my ASVs inserted in the phylogenetic reference tree?
PICRUSt2 predicted abundances are based on the reference genome sequences closest to each ASV in the phylogenetic tree. To compare the taxonomic affiliations made by FROGS and by PICRUSt2, the following table is produced:
- ASV: ASV name.
- Nb sequences: Abundance of the ASV (number of sequences).
- FROGS taxonomy: Taxonomic affiliation made by FROGS (FROGS_5 taxonomic affiliation).
- PICRUSt2 closest ID (JGI): Identifier (JGI) of the reference sequence closest to the ASV inserted in the reference tree (see the explanatory illustration at the bottom of this page).
- PICRUSt2 closest reference name: Genome Name / Sample Name from the PICRUSt2 reference tree.
- PICRUSt2 closest taxonomy: Taxonomy (JGI) of the reference sequence closest to the ASV inserted in the reference tree, in the following format: Kingdom;Phylum;Class;Order;Family;Genus;Species
- NSTI: The Nearest Sequenced Taxon Index (NSTI) is the phylogenetic distance between the ASV and the nearest sequenced reference genome. This metric can be used to identify ASVs that are highly distant from all reference sequences (the predictions for these sequences are less reliable). The higher the NSTI score, the less reliable the affiliation. ASVs with an NSTI value higher than 2 are typically either from uncharacterized phyla or off-target sequences.
- NSTI confidence: Indicates, based on the NSTI score, how much confidence you can place in the PICRUSt2 affiliation. Four levels are given:
  - 0 < Good < 0.5
  - 0.5 <= Medium < 1
  - 1 <= Bad < 2
  - To exclude >= 2
  By default, PICRUSt2 sets the NSTI threshold to 2. Some studies have shown that this threshold is permissive. It is therefore important to check whether the PICRUSt2 and FROGS taxonomies are similar, so that you can choose a more stringent threshold afterwards if needed.
  For example, an NSTI lower than 0.5 with “species” as the lowest common taxonomic rank between FROGS and PICRUSt2 should give a good prediction.
- Lowest same taxonomic rank between FROGS and PICRUSt2: Comparison between the FROGS and PICRUSt2 taxonomic affiliations, reported as the lowest taxonomic rank they have in common.
- Comment:
Search for “up to species” to obtain less ambiguous references.
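The NSTI confidence levels and the comparison of the FROGS and PICRUSt2 taxonomies described above can be sketched as follows. This is an illustration only, not the FROGS code: the function names are hypothetical, an NSTI of exactly 0 is assumed to count as "Good", and ranks are compared as plain strings, which may differ from how FROGS normalizes taxon names.

```python
RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def nsti_confidence(nsti):
    """Map an NSTI score to one of the four confidence levels of the report."""
    if nsti < 0:
        raise ValueError("NSTI is a distance and cannot be negative")
    if nsti < 0.5:
        return "Good"        # assumption: NSTI == 0 is also treated as Good
    if nsti < 1:
        return "Medium"
    if nsti < 2:
        return "Bad"
    return "To exclude"


def lowest_common_rank(frogs_taxonomy, picrust2_taxonomy):
    """Return the deepest rank at which both taxonomies agree, or None.

    Both arguments use the Kingdom;Phylum;...;Species format.
    """
    frogs = [taxon.strip() for taxon in frogs_taxonomy.split(";")]
    picrust2 = [taxon.strip() for taxon in picrust2_taxonomy.split(";")]
    common = None
    for rank, frogs_taxon, picrust2_taxon in zip(RANKS, frogs, picrust2):
        if frogs_taxon != picrust2_taxon:
            break  # stop at the first rank where the affiliations diverge
        common = rank
    return common


# Invented example: the two affiliations agree down to the genus.
frogs = "Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus mitis"
picrust2 = "Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus oralis"
print(nsti_confidence(0.3), lowest_common_rank(frogs, picrust2))
```

Reading the two values together mirrors the advice above: a low NSTI combined with a deep common rank suggests a more trustworthy prediction than either value alone.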
How to evaluate the NSTI?
These graphs help you set the “NSTI cut-off” parameter of the next tool.
The first graph shows the number of ASVs and sequences kept according to the NSTI threshold. It is a decision-support graphic to help you choose the NSTI threshold that you will be asked to set in the next tool, FROGSFUNC_2_functions.
The goal is to find a compromise between the guidance provided by the PICRUSt2 authors and the amount of information that you would like to keep.
A good practice is to choose an NSTI threshold that is as low as possible while still retaining a good number of sequences, i.e. one that ensures that the taxonomies derived from FROGS and PICRUSt2 do not diverge too much.
On the graph above, we could keep only the information before the plateau: from that point on, the more ASVs we keep, the more the accuracy degrades. Here, that gives NSTI = 0.44.
However, this depends strongly on the dataset and on your needs.
The second graph plots the BLAST percentages of identity and coverage against the closest PICRUSt2 sequence (y-axis) versus the NSTI score (x-axis). The ASVs with the best predictions are therefore located at the top left of the graph.
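The curve of kept ASVs and sequences against the NSTI threshold can be recomputed from the NSTI and abundance of each ASV. The sketch below is a hedged illustration: the function names and the example values are invented, and it assumes that an ASV is kept when its NSTI is lower than or equal to the threshold.

```python
def kept_at_threshold(asvs, threshold):
    """Count the ASVs and sequences kept at one NSTI threshold.

    asvs is a list of (nsti, nb_sequences) pairs.
    Assumption: an ASV is kept when its NSTI is <= threshold.
    """
    kept = [nb_sequences for nsti, nb_sequences in asvs if nsti <= threshold]
    return len(kept), sum(kept)


def retention_curve(asvs, thresholds):
    """Return (threshold, kept ASVs, kept sequences) for each threshold."""
    return [(t, *kept_at_threshold(asvs, t)) for t in thresholds]


# Invented dataset of five ASVs: (NSTI, number of sequences).
example = [(0.02, 1200), (0.30, 800), (0.44, 150), (1.20, 40), (2.50, 10)]
for threshold, nb_asvs, nb_sequences in retention_curve(example, [0.1, 0.44, 1, 2]):
    print(threshold, nb_asvs, nb_sequences)
```

In this invented dataset, raising the threshold from 0.44 to 2 adds only one ASV and 40 sequences, which is the kind of plateau the text suggests looking for.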
Tree file
Combination of the reference phylogenetic tree with your inserted sequences.
Excluded file
List of ASV names removed by the process.
FASTA file
The FASTA file without the sequences excluded by the process.
Closest reference sequence table
Information on the sequences from the PICRUSt2 reference tree that are the closest neighbours of your studied sequences.
BIOM file
The BIOM file without the sequences excluded by the process.
Copy number marker file
The output table of predicted marker gene copy numbers for each ASV placed in the reference tree.
A work by FROGS team