FROGSFUNC_2_functions
Context
PICRUSt2 is a software for predicting functional abundances based only on marker gene sequences. This tool is integrated inside FROGS suite as FROGSFUNC tools. They are split into 4 steps :
- FROGSFUNC_1_placeseqs_copynumber : Places the ASVs into a reference phylogenetic tree and predicts the copy numbers of the marker gene (16S, ITS or 18S).
- FROGSFUNC_2_functions: Predicts number of function copy number in each ASV and calculates functions abundances in each sample and ASV abundances according to marker copy number.
- FROGSFUNC_3_pathways : Calculates pathway abundances in each sample.
This data can be useful for generating hypotheses, but should always be interpreted cautiously especially when focused on a single function or predictions for a single ASV.
PICRUSt2 are based on 3 markers only, 16S, ITS and 18S. If you used another one (rpob, 23S, coi, ef1 etc.), you cannot used these 3 tools.
What it does
FROGSFUNC_2_functions is the second step of PICRUSt2. It predicts number of function copy number in each ASV and calculates functions abundances in each sample and ASV abundances according to marker copy number.
There are three steps performed at this stage:
- It runs hidden-state prediction (hsp) to predict function abundances with castor-R of each ASVs placed in the PICRUSt2 reference phylogenetic tree (FROGSFUNC_1_placeseqs_copynumber outputs).
- The read depth per ASV is divided by the predicted marker (16S/ITS/18S) copy numbers. This is performed to help control for variation in marker copy numbers across organisms, which can result in interpretation issues. For instance, imagine an organism with five identical copies of the 16S gene that is at the same absolute abundance as an organism with one 16S gene. The ASV corresponding to the first organism would erroneously be inferred to be at higher relative abundance simply because this organism had more copies of the 16S gene.
- The ASV read depths per sample (after normalizing by marker (16S/ITS/18S) copy number) are multiplied by the predicted function copy numbers per ASV.
FROGSFUNC_2_functions tool workflow summaryCommand line
v4.1.0
usage: frogsfunc_functions.py [-h] [-v] [--debug] [-p NB_CPUS] [--strat-out]
-b INPUT_BIOM -i INPUT_FASTA -t INPUT_TREE -m
INPUT_MARKER --marker-type {16S,ITS,18S}
[-f FUNCTIONS]
[--input-function-table INPUT_FUNCTION_TABLE]
[--hsp-method {mp,emp_prob,pic,scp,subtree_average}]
[--max-nsti MAX_NSTI]
[--min-blast-ident MIN_BLAST_IDENT]
[--min-blast-cov MIN_BLAST_COV]
[--min-reads INT] [--min-samples INT]
[--output-function-abund OUTPUT_FUNCTION_ABUND]
[--output-otu-norm OUTPUT_OTU_NORM]
[--output-weighted OUTPUT_WEIGHTED]
[--output-contrib OUTPUT_CONTRIB]
[--output-biom OUTPUT_BIOM]
[--output-fasta OUTPUT_FASTA]
[-e OUTPUT_EXCLUDED] [-l LOG_FILE] [-s SUMMARY]
Per-sample functional profiles prediction.
optional arguments:
-h, --help show this help message and exit
-v, --version show programs version number and exit
--debug Keep temporary files to debug program.
-p NB_CPUS, --nb-cpus NB_CPUS
The maximum number of CPUs used. [Default: 1]
--strat-out If activated, a new table is built. It will contain
the abundances of each function of each ASV in each
sample.
Inputs:
-b INPUT_BIOM, --input-biom INPUT_BIOM
frogsfunc_placeseqs Biom output file
(frogsfunc_placeseqs.biom).
-i INPUT_FASTA, --input-fasta INPUT_FASTA
frogsfunc_placeseqs Fasta output file
(frogsfunc_placeseqs.fasta).
-t INPUT_TREE, --input-tree INPUT_TREE
frogsfunc_placeseqs output tree in newick format
containing both studied sequences (i.e. ASVs) and
reference sequences.
-m INPUT_MARKER, --input-marker INPUT_MARKER
Table of predicted marker gene copy numbers
(frogsfunc_placeseqs output : frogsfunc_marker.tsv).
--marker-type {16S,ITS,18S}
Marker gene to be analyzed.
--hsp-method {mp,emp_prob,pic,scp,subtree_average}
HSP method to use. mp: predict discrete traits using
max parsimony. emp_prob: predict discrete traits based
on empirical state probabilities across tips.
subtree_average: predict continuous traits using
subtree averaging. pic: predict continuous traits with
phylogentic independent contrast. scp: reconstruct
continuous traits using squared-change parsimony
(default: mp).
--max-nsti MAX_NSTI Sequences with NSTI values above this value will be
excluded (default: 2).
--min-blast-ident MIN_BLAST_IDENT
Sequences with blast percentage identity against the
PICRUSt2 closest ref above this value will be excluded
(between 0 and 1)
--min-blast-cov MIN_BLAST_COV
Sequences with blast percentage coverage against the
PICRUSt2 closest ref above this value will be excluded
(between 0 and 1)
--min-reads INT Minimum number of reads across all samples for each
input ASV. ASVs below this cut-off will be counted as
part of the "RARE" category in the stratified output.
If you choose 1, none ASV will be grouped in “RARE”
category.(default: 1).
--min-samples INT Minimum number of samples that an ASV needs to be
identfied within. ASVs below this cut-off will be
counted as part of the "RARE" category in the
stratified output. If you choose 1, none ASV will be
grouped in “RARE” category. (default: 1).
16S :
-f FUNCTIONS, --functions FUNCTIONS
Specifies which function databases should be used
(EC). Available indices : 'EC', 'KO', 'COG', 'PFAM',
'TIGRFAM', 'PHENO'. EC is used by default because
necessary for frogsfunc_pathways. At least EC or KO is
required. To run the command with several functions,
separate the functions with commas (ex: -i EC,PFAM).
ITS and 18S :
--input-function-table INPUT_FUNCTION_TABLE
The path to input functions table describing directly
observed functions, in tab-delimited format.(ex $PICRU
St2_PATH/default_files/fungi/ec_ITS_counts.txt.gz).
Required.
Outputs:
--output-function-abund OUTPUT_FUNCTION_ABUND
Output file for function prediction abundances.
(default: frogsfunc_functions_unstrat.tsv).
--output-otu-norm OUTPUT_OTU_NORM
Output file with asv abundances normalized by marker
copies number. (default:
frogsfunc_functions_marker_norm.tsv).
--output-weighted OUTPUT_WEIGHTED
Output file with the mean of nsti value per sample
(format: TSV). [Default:
frogsfunc_functions_weighted_nsti.tsv]
--output-contrib OUTPUT_CONTRIB
Stratified output that reports asv contributions to
community-wide function abundances (ex
pred_function_otu_contrib.tsv).
--output-biom OUTPUT_BIOM
Biom file without excluded ASVs (NSTI, blast perc
identity or blast perc coverage thresholds). (format:
BIOM) [Default: frogsfunc_function.biom]
--output-fasta OUTPUT_FASTA
Fasta file without excluded ASVs (NSTI, blast perc
identity or blast perc coverage thresholds). (format:
FASTA). [Default: frogsfunc_function.fasta]
-e OUTPUT_EXCLUDED, --output-excluded OUTPUT_EXCLUDED
List of ASVs with NSTI values above NSTI threshold (
--max_NSTI NSTI ).[Default:
frogsfunc_functions_excluded.txt]
-l LOG_FILE, --log-file LOG_FILE
List of commands executed.
-s SUMMARY, --summary SUMMARY
Path to store resulting html file. [Default:
frogsfunc_functions_summary.html]
Example of command line:
./frogsfunc_functions.py \
--strat-out \
--marker-type 16S \
--input-biom frogsfunc_placeseqs.biom \
--input-fasta frogsfunc_placeseqs.fasta \
--input-marker frogsfunc_marker.tsv \
--input-tree frogsfunc_placeseqs_tree.nwk \
--output-function-abund frogsfunc_functions_unstrat.tsv \
--output-otu-norm frogsfunc_functions_marker_norm.tsv \
--output-weighted frogsfunc_functions_weighted_nsti.tsv \
--output-excluded frogsfunc_functions_excluded.txt \
--output-contrib frogsfunc_functions_strat.tsv \
--output-fasta frogsfunc_function.fasta \
--output-biom frogsfunc_function.biom \
--summary frogsfunc_functions_summary.html
Please note that requesting the stratified output files implies a longer process time.
Galaxy
Which default pre-calculated count table to use ?
- For 16S rRNA gene you can choose between: ‘EC’, ‘KO’, ‘PFAM’, ‘COG’, ‘TIGRFAM’, and/or ‘PHENO’. You must select at least ‘EC’ or ‘KO’ because the information from Metacyc (EC) or KEGG (KO) are required.
- For ITS and 18S markers, ‘EC’ is only available.
Nearest Sequenced Taxon Index (NSTI) is the phylogenetic distance between the ASV and the nearest sequenced reference genome. This metric can be used to identify ASVs that are highly distant from all reference sequences (the predictions for these sequences are less reliable!). The higher the NSTI score, the less the affiliations are relevant. Any ASVs with a NSTI value higher than 2 are typically either from uncharacterized phyla or off-target sequences.
-
Identity alignment cut-off:
All sequences with a identity percentage of alignment against the PICRUSt2 closest reference sequence is lower than this value will be excluded (between 0 and 1).
-
Coverage alignment cut-off:
All sequences with a coverage percentage of alignment against the PICRUSt2 closest reference sequence is lower than this value will be excluded (between 0 and 1).
-
HSP method: Hidden-state prediction method to use.
Outputs
Fasta file:
Sequence file without excluded ASVs (NSTI, blast perc identity or blast perc coverage thresholds).
ASV abundance Biom file - one per chosen target function database (EC, KO, PFAM, COG, TIGRFAM,PHENO):
ASV abundance data i a biom file without excluded ASVs (NSTI, %identity or %coverage thresholds alignment).
Function abundance file - “unstratified”:
It is the function abundance predictions of metagenome, per sample. (for exemple: frogsfunc_functions_unstrat_EC.tsv for EC database)
From this table of abundance it is quite possible to make statistical analyses to understand the information.
Table column description:
- classification: the hierarchy classification of the gene function.
- db_link: the url on the link accession ID (observation_name) of the function.
- observation_name: Accession identifier
- observation_sum: Total abundance of functions across all samples.
- last columns: Abundances of these functions in each samples.
ASV normalized abundance table:
Table with normalized abundances per marker copy number from FROGSFUNC_1 step. (functions_marker_norm.tsv)
Weighted NSTI file:
Output file with the mean of NSTI value per sample. (FROGSFUNC_2 functions_weighted_nsti.tsv)
Excluded sequences file:
Information about removed sequences that have a NSTI value aboved the NSTI threshold chosen in this step.
- Cluster: ASV name id.
- FROGS_taxonomy
- PICRUSt2_taxonomy
- exclusion_paramater: The paramater(s) that excluded the ASVs.
- value_parameter: The values associated with the paramater(s).
Copy number marker file - one per chosen target function database (EC, KO, PFAM, COG, TIGRFAM,PHENO):
Output table of predicted function copy numbers per ASV. There are as many tables as chosen target function database (EC, KO, PFAM, COG, TIGRFAM,PHENO)
HTML report
The HTML file summarizes information about function abundances within each sample.
How many ASVs/sequences are kept after the process?
ASVs are excluded if the associated NSTI is above the threshold, or if the alignment values are below the thresholds.Number of different taxonomic ranks before (green) and after (orange) application of the filters. What is the distribution of gene/function abundances in the samples ?
Weighted NSTI : Mean of NSTI values of ASVs present in the sample, normalized by their abundances.
Nb gene retrieved : Number of genes/functions present in the sample.
Display global distribution button allows to view the distribution of gene function abundances across all samples.
To view this distribution only on some samples, you check the boxes of the samples (first column of the table above), and click on the “Show distribution” button at the bottom of the table.
Gene functions distribution for selected samples The innermost circle represents the highest hierarchical level of gene families according to Metacyc or Kegg databases. The more we go outwards, the more the hierarchical level becomes precise until indicating the identifier of the gene family.
For exemple :
Oxidoreductases > Acting on the CH-OH group of donors > With NAD+ or NADP+ as acceptor > EC:1.1.1.1
Function abundances table - stratified (optional - command line only).
optional and only for command line - not available on galaxy version
N.B.: In this above example, the first N lines of the file correspond to the N ASVs in the sample SC1703-104TTGCCC-B6TMLL001R, and so on for each sample.
Please note that requesting the stratified output files implies a very longer process time. And, this file is very large, there are as many lines as there are samples x ASVs x pathways.
- sample: sample names
- function: accession ID from function database
- taxon: ASVs names
- taxon_abun: sequence number of ASVs in the sample divided by number of marker copy number.
Ex: in the table above, ASV1 (cluster1) has only 1 16S gene and has 233 sequences in sample *SC1703-104TTGCCC-B6TMLL001R* so 233/1 = 233 is the taxon_abun of this ASV in this sample.
- taxon_rel_abun: This is the same as the “taxon_abun” column, but in terms of relative abundance (so that the sum of all ASV abundances per sample is 100).
- genome_function_count: Predicted copy number of this function per ASV.
Ex: in genome of ASV named cluster_11 there are 2 copies of EC.1.1.1.1 function
- taxon_function_abun: Multiplication of "taxon_abun" column by "genome_function_count" column.
- taxon_rel_function_abun: Multiplication of "taxon_rel_abun" column by "genome_function_count" column.
- norm_taxon_function_contrib: This is the same as the “taxon_rel_function_abun” column, but in terms of relative abundance in the sample (so that the sum of all number of this column equals 1).
FROGSFUNC_2_functions
FROGSFUNC_2_functions
Context
PICRUSt2 is a software for predicting functional abundances based only on marker gene sequences. This tool is integrated inside FROGS suite as FROGSFUNC tools. They are split into 4 steps :
This data can be useful for generating hypotheses, but should always be interpreted cautiously especially when focused on a single function or predictions for a single ASV.
PICRUSt2 are based on 3 markers only, 16S, ITS and 18S. If you used another one (rpob, 23S, coi, ef1 etc.), you cannot used these 3 tools.
What it does
FROGSFUNC_2_functions is the second step of PICRUSt2. It predicts number of function copy number in each ASV and calculates functions abundances in each sample and ASV abundances according to marker copy number.
There are three steps performed at this stage:
Command line
v4.1.0
usage: frogsfunc_functions.py [-h] [-v] [--debug] [-p NB_CPUS] [--strat-out] -b INPUT_BIOM -i INPUT_FASTA -t INPUT_TREE -m INPUT_MARKER --marker-type {16S,ITS,18S} [-f FUNCTIONS] [--input-function-table INPUT_FUNCTION_TABLE] [--hsp-method {mp,emp_prob,pic,scp,subtree_average}] [--max-nsti MAX_NSTI] [--min-blast-ident MIN_BLAST_IDENT] [--min-blast-cov MIN_BLAST_COV] [--min-reads INT] [--min-samples INT] [--output-function-abund OUTPUT_FUNCTION_ABUND] [--output-otu-norm OUTPUT_OTU_NORM] [--output-weighted OUTPUT_WEIGHTED] [--output-contrib OUTPUT_CONTRIB] [--output-biom OUTPUT_BIOM] [--output-fasta OUTPUT_FASTA] [-e OUTPUT_EXCLUDED] [-l LOG_FILE] [-s SUMMARY] Per-sample functional profiles prediction. optional arguments: -h, --help show this help message and exit -v, --version show programs version number and exit --debug Keep temporary files to debug program. -p NB_CPUS, --nb-cpus NB_CPUS The maximum number of CPUs used. [Default: 1] --strat-out If activated, a new table is built. It will contain the abundances of each function of each ASV in each sample. Inputs: -b INPUT_BIOM, --input-biom INPUT_BIOM frogsfunc_placeseqs Biom output file (frogsfunc_placeseqs.biom). -i INPUT_FASTA, --input-fasta INPUT_FASTA frogsfunc_placeseqs Fasta output file (frogsfunc_placeseqs.fasta). -t INPUT_TREE, --input-tree INPUT_TREE frogsfunc_placeseqs output tree in newick format containing both studied sequences (i.e. ASVs) and reference sequences. -m INPUT_MARKER, --input-marker INPUT_MARKER Table of predicted marker gene copy numbers (frogsfunc_placeseqs output : frogsfunc_marker.tsv). --marker-type {16S,ITS,18S} Marker gene to be analyzed. --hsp-method {mp,emp_prob,pic,scp,subtree_average} HSP method to use. mp: predict discrete traits using max parsimony. emp_prob: predict discrete traits based on empirical state probabilities across tips. subtree_average: predict continuous traits using subtree averaging. pic: predict continuous traits with phylogentic independent contrast. scp: reconstruct continuous traits using squared-change parsimony (default: mp). --max-nsti MAX_NSTI Sequences with NSTI values above this value will be excluded (default: 2). --min-blast-ident MIN_BLAST_IDENT Sequences with blast percentage identity against the PICRUSt2 closest ref above this value will be excluded (between 0 and 1) --min-blast-cov MIN_BLAST_COV Sequences with blast percentage coverage against the PICRUSt2 closest ref above this value will be excluded (between 0 and 1) --min-reads INT Minimum number of reads across all samples for each input ASV. ASVs below this cut-off will be counted as part of the "RARE" category in the stratified output. If you choose 1, none ASV will be grouped in “RARE” category.(default: 1). --min-samples INT Minimum number of samples that an ASV needs to be identfied within. ASVs below this cut-off will be counted as part of the "RARE" category in the stratified output. If you choose 1, none ASV will be grouped in “RARE” category. (default: 1). 16S : -f FUNCTIONS, --functions FUNCTIONS Specifies which function databases should be used (EC). Available indices : 'EC', 'KO', 'COG', 'PFAM', 'TIGRFAM', 'PHENO'. EC is used by default because necessary for frogsfunc_pathways. At least EC or KO is required. To run the command with several functions, separate the functions with commas (ex: -i EC,PFAM). ITS and 18S : --input-function-table INPUT_FUNCTION_TABLE The path to input functions table describing directly observed functions, in tab-delimited format.(ex $PICRU St2_PATH/default_files/fungi/ec_ITS_counts.txt.gz). Required. Outputs: --output-function-abund OUTPUT_FUNCTION_ABUND Output file for function prediction abundances. (default: frogsfunc_functions_unstrat.tsv). --output-otu-norm OUTPUT_OTU_NORM Output file with asv abundances normalized by marker copies number. (default: frogsfunc_functions_marker_norm.tsv). --output-weighted OUTPUT_WEIGHTED Output file with the mean of nsti value per sample (format: TSV). [Default: frogsfunc_functions_weighted_nsti.tsv] --output-contrib OUTPUT_CONTRIB Stratified output that reports asv contributions to community-wide function abundances (ex pred_function_otu_contrib.tsv). --output-biom OUTPUT_BIOM Biom file without excluded ASVs (NSTI, blast perc identity or blast perc coverage thresholds). (format: BIOM) [Default: frogsfunc_function.biom] --output-fasta OUTPUT_FASTA Fasta file without excluded ASVs (NSTI, blast perc identity or blast perc coverage thresholds). (format: FASTA). [Default: frogsfunc_function.fasta] -e OUTPUT_EXCLUDED, --output-excluded OUTPUT_EXCLUDED List of ASVs with NSTI values above NSTI threshold ( --max_NSTI NSTI ).[Default: frogsfunc_functions_excluded.txt] -l LOG_FILE, --log-file LOG_FILE List of commands executed. -s SUMMARY, --summary SUMMARY Path to store resulting html file. [Default: frogsfunc_functions_summary.html]
Example of command line:
Please note that requesting the stratified output files implies a longer process time.
Galaxy
Thanks to the previous prediction of the copy numbers of the marker gene (16S, ITS or 18S) in FROGSFUNC_1, FROGSFUNC_2 can normalize the ASV abundances table.
Prediction of the functions abundances, using different databases:
EC : https://enzyme.expasy.org/
KO : https://www.genome.jp/kegg/ko.html
PFAM : http://pfam.xfam.org/
COG : https://www.ncbi.nlm.nih.gov/research/cog-project/
TIGRFAM : https://tigrfams.jcvi.org/cgi-bin/index.cgi
PHENO : https://phenodb.org/
Which default pre-calculated count table to use ?
Nearest Sequenced Taxon Index (NSTI) is the phylogenetic distance between the ASV and the nearest sequenced reference genome. This metric can be used to identify ASVs that are highly distant from all reference sequences (the predictions for these sequences are less reliable!). The higher the NSTI score, the less the affiliations are relevant. Any ASVs with a NSTI value higher than 2 are typically either from uncharacterized phyla or off-target sequences.
Identity alignment cut-off:
All sequences with a identity percentage of alignment against the PICRUSt2 closest reference sequence is lower than this value will be excluded (between 0 and 1).
Coverage alignment cut-off:
All sequences with a coverage percentage of alignment against the PICRUSt2 closest reference sequence is lower than this value will be excluded (between 0 and 1).
HSP method: Hidden-state prediction method to use.
Outputs
Fasta file:
Sequence file without excluded ASVs (NSTI, blast perc identity or blast perc coverage thresholds).
ASV abundance Biom file - one per chosen target function database (EC, KO, PFAM, COG, TIGRFAM,PHENO):
ASV abundance data i a biom file without excluded ASVs (NSTI, %identity or %coverage thresholds alignment).
Function abundance file - “unstratified”:
It is the function abundance predictions of metagenome, per sample. (for exemple: frogsfunc_functions_unstrat_EC.tsv for EC database)
From this table of abundance it is quite possible to make statistical analyses to understand the information.
Table column description:
ASV normalized abundance table:
Table with normalized abundances per marker copy number from FROGSFUNC_1 step. (functions_marker_norm.tsv)
Weighted NSTI file:
Output file with the mean of NSTI value per sample. (FROGSFUNC_2 functions_weighted_nsti.tsv)
Excluded sequences file:
Information about removed sequences that have a NSTI value aboved the NSTI threshold chosen in this step.
Copy number marker file - one per chosen target function database (EC, KO, PFAM, COG, TIGRFAM,PHENO):
Output table of predicted function copy numbers per ASV. There are as many tables as chosen target function database (EC, KO, PFAM, COG, TIGRFAM,PHENO)
HTML report
The HTML file summarizes information about function abundances within each sample.
How many ASVs/sequences are kept after the process?
What is the distribution of gene/function abundances in the samples ?
Weighted NSTI : Mean of NSTI values of ASVs present in the sample, normalized by their abundances.
Nb gene retrieved : Number of genes/functions present in the sample.
Display global distribution button allows to view the distribution of gene function abundances across all samples.
To view this distribution only on some samples, you check the boxes of the samples (first column of the table above), and click on the “Show distribution” button at the bottom of the table.
The innermost circle represents the highest hierarchical level of gene families according to Metacyc or Kegg databases. The more we go outwards, the more the hierarchical level becomes precise until indicating the identifier of the gene family.
Function abundances table - stratified (optional - command line only).
optional and only for command line - not available on galaxy version
N.B.: In this above example, the first N lines of the file correspond to the N ASVs in the sample SC1703-104TTGCCC-B6TMLL001R, and so on for each sample.
Please note that requesting the stratified output files implies a very longer process time. And, this file is very large, there are as many lines as there are samples x ASVs x pathways.
A work by FROGS team