FROGSFUNC_2_functions

tutorial

tool

FROGSFUNC_2_functions

Context

PICRUSt2 is a software for predicting functional abundances based only on marker gene sequences. This tool is integrated inside FROGS suite as FROGSFUNC tools. They are split into 4 steps :

FROGSFUNC_1_placeseqs_copynumber : Places the ASVs into a reference phylogenetic tree and predicts the copy numbers of the marker gene (16S, ITS or 18S).
FROGSFUNC_2_functions: Predicts number of function copy number in each ASV and calculates functions abundances in each sample and ASV abundances according to marker copy number.
FROGSFUNC_3_pathways : Calculates pathway abundances in each sample.

This data can be useful for generating hypotheses, but should always be interpreted cautiously especially when focused on a single function or predictions for a single ASV.

PICRUSt2 are based on 3 markers only, 16S, ITS and 18S. If you used another one (rpob, 23S, coi, ef1 etc.), you cannot used these 3 tools.

What it does

FROGSFUNC_2_functions is the second step of PICRUSt2. It predicts number of function copy number in each ASV and calculates functions abundances in each sample and ASV abundances according to marker copy number.

There are three steps performed at this stage:

It runs hidden-state prediction (hsp) to predict function abundances with castor-R of each ASVs placed in the PICRUSt2 reference phylogenetic tree (FROGSFUNC_1_placeseqs_copynumber outputs).
The read depth per ASV is divided by the predicted marker (16S/ITS/18S) copy numbers. This is performed to help control for variation in marker copy numbers across organisms, which can result in interpretation issues. For instance, imagine an organism with five identical copies of the 16S gene that is at the same absolute abundance as an organism with one 16S gene. The ASV corresponding to the first organism would erroneously be inferred to be at higher relative abundance simply because this organism had more copies of the 16S gene.
The ASV read depths per sample (after normalizing by marker (16S/ITS/18S) copy number) are multiplied by the predicted function copy numbers per ASV.

FROGSFUNC_2_functions tool workflow summary

Command line

v4.1.0

usage: frogsfunc_functions.py [-h] [-v] [--debug] [-p NB_CPUS] [--strat-out]
                              -b INPUT_BIOM -i INPUT_FASTA -t INPUT_TREE -m
                              INPUT_MARKER --marker-type {16S,ITS,18S}
                              [-f FUNCTIONS]
                              [--input-function-table INPUT_FUNCTION_TABLE]
                              [--hsp-method {mp,emp_prob,pic,scp,subtree_average}]
                              [--max-nsti MAX_NSTI]
                              [--min-blast-ident MIN_BLAST_IDENT]
                              [--min-blast-cov MIN_BLAST_COV]
                              [--min-reads INT] [--min-samples INT]
                              [--output-function-abund OUTPUT_FUNCTION_ABUND]
                              [--output-otu-norm OUTPUT_OTU_NORM]
                              [--output-weighted OUTPUT_WEIGHTED]
                              [--output-contrib OUTPUT_CONTRIB]
                              [--output-biom OUTPUT_BIOM]
                              [--output-fasta OUTPUT_FASTA]
                              [-e OUTPUT_EXCLUDED] [-l LOG_FILE] [-s SUMMARY]

Per-sample functional profiles prediction.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show programs version number and exit
  --debug               Keep temporary files to debug program.
  -p NB_CPUS, --nb-cpus NB_CPUS
                        The maximum number of CPUs used. [Default: 1]
  --strat-out           If activated, a new table is built. It will contain
                        the abundances of each function of each ASV in each
                        sample.

Inputs:
  -b INPUT_BIOM, --input-biom INPUT_BIOM
                        frogsfunc_placeseqs Biom output file
                        (frogsfunc_placeseqs.biom).
  -i INPUT_FASTA, --input-fasta INPUT_FASTA
                        frogsfunc_placeseqs Fasta output file
                        (frogsfunc_placeseqs.fasta).
  -t INPUT_TREE, --input-tree INPUT_TREE
                        frogsfunc_placeseqs output tree in newick format
                        containing both studied sequences (i.e. ASVs) and
                        reference sequences.
  -m INPUT_MARKER, --input-marker INPUT_MARKER
                        Table of predicted marker gene copy numbers
                        (frogsfunc_placeseqs output : frogsfunc_marker.tsv).
  --marker-type {16S,ITS,18S}
                        Marker gene to be analyzed.
  --hsp-method {mp,emp_prob,pic,scp,subtree_average}
                        HSP method to use. mp: predict discrete traits using
                        max parsimony. emp_prob: predict discrete traits based
                        on empirical state probabilities across tips.
                        subtree_average: predict continuous traits using
                        subtree averaging. pic: predict continuous traits with
                        phylogentic independent contrast. scp: reconstruct
                        continuous traits using squared-change parsimony
                        (default: mp).
  --max-nsti MAX_NSTI   Sequences with NSTI values above this value will be
                        excluded (default: 2).
  --min-blast-ident MIN_BLAST_IDENT
                        Sequences with blast percentage identity against the
                        PICRUSt2 closest ref above this value will be excluded
                        (between 0 and 1)
  --min-blast-cov MIN_BLAST_COV
                        Sequences with blast percentage coverage against the
                        PICRUSt2 closest ref above this value will be excluded
                        (between 0 and 1)
  --min-reads INT       Minimum number of reads across all samples for each
                        input ASV. ASVs below this cut-off will be counted as
                        part of the "RARE" category in the stratified output.
                        If you choose 1, none ASV will be grouped in “RARE”
                        category.(default: 1).
  --min-samples INT     Minimum number of samples that an ASV needs to be
                        identfied within. ASVs below this cut-off will be
                        counted as part of the "RARE" category in the
                        stratified output. If you choose 1, none ASV will be
                        grouped in “RARE” category. (default: 1).

16S :
  -f FUNCTIONS, --functions FUNCTIONS
                        Specifies which function databases should be used
                        (EC). Available indices : 'EC', 'KO', 'COG', 'PFAM',
                        'TIGRFAM', 'PHENO'. EC is used by default because
                        necessary for frogsfunc_pathways. At least EC or KO is
                        required. To run the command with several functions,
                        separate the functions with commas (ex: -i EC,PFAM).

ITS and 18S :
  --input-function-table INPUT_FUNCTION_TABLE
                        The path to input functions table describing directly
                        observed functions, in tab-delimited format.(ex $PICRU
                        St2_PATH/default_files/fungi/ec_ITS_counts.txt.gz).
                        Required.

Outputs:
  --output-function-abund OUTPUT_FUNCTION_ABUND
                        Output file for function prediction abundances.
                        (default: frogsfunc_functions_unstrat.tsv).
  --output-otu-norm OUTPUT_OTU_NORM
                        Output file with asv abundances normalized by marker
                        copies number. (default:
                        frogsfunc_functions_marker_norm.tsv).
  --output-weighted OUTPUT_WEIGHTED
                        Output file with the mean of nsti value per sample
                        (format: TSV). [Default:
                        frogsfunc_functions_weighted_nsti.tsv]
  --output-contrib OUTPUT_CONTRIB
                        Stratified output that reports asv contributions to
                        community-wide function abundances (ex
                        pred_function_otu_contrib.tsv).
  --output-biom OUTPUT_BIOM
                        Biom file without excluded ASVs (NSTI, blast perc
                        identity or blast perc coverage thresholds). (format:
                        BIOM) [Default: frogsfunc_function.biom]
  --output-fasta OUTPUT_FASTA
                        Fasta file without excluded ASVs (NSTI, blast perc
                        identity or blast perc coverage thresholds). (format:
                        FASTA). [Default: frogsfunc_function.fasta]
  -e OUTPUT_EXCLUDED, --output-excluded OUTPUT_EXCLUDED
                        List of ASVs with NSTI values above NSTI threshold (
                        --max_NSTI NSTI ).[Default:
                        frogsfunc_functions_excluded.txt]
  -l LOG_FILE, --log-file LOG_FILE
                        List of commands executed.
  -s SUMMARY, --summary SUMMARY
                        Path to store resulting html file. [Default:
                        frogsfunc_functions_summary.html]

Example of command line:

./frogsfunc_functions.py \
    --strat-out \
    --marker-type 16S \
    --input-biom frogsfunc_placeseqs.biom \
    --input-fasta frogsfunc_placeseqs.fasta \
    --input-marker frogsfunc_marker.tsv \
    --input-tree frogsfunc_placeseqs_tree.nwk \
    --output-function-abund frogsfunc_functions_unstrat.tsv \
    --output-otu-norm frogsfunc_functions_marker_norm.tsv \
    --output-weighted frogsfunc_functions_weighted_nsti.tsv \
    --output-excluded frogsfunc_functions_excluded.txt \
    --output-contrib frogsfunc_functions_strat.tsv \
    --output-fasta frogsfunc_function.fasta \
    --output-biom frogsfunc_function.biom \
    --summary frogsfunc_functions_summary.html

Please note that requesting the stratified output files implies a longer process time.

Galaxy

Thanks to the previous prediction of the copy numbers of the marker gene (16S, ITS or 18S) in FROGSFUNC_1, FROGSFUNC_2 can normalize the ASV abundances table.
Prediction of the functions abundances, using different databases:

EC : https://enzyme.expasy.org/
KO : https://www.genome.jp/kegg/ko.html
PFAM : http://pfam.xfam.org/
COG : https://www.ncbi.nlm.nih.gov/research/cog-project/
TIGRFAM : https://tigrfams.jcvi.org/cgi-bin/index.cgi
PHENO : https://phenodb.org/

Which default pre-calculated count table to use ?

For 16S rRNA gene you can choose between: ‘EC’, ‘KO’, ‘PFAM’, ‘COG’, ‘TIGRFAM’, and/or ‘PHENO’. You must select at least ‘EC’ or ‘KO’ because the information from Metacyc (EC) or KEGG (KO) are required.
For ITS and 18S markers, ‘EC’ is only available.

NSTI cut-off:

Nearest Sequenced Taxon Index (NSTI) is the phylogenetic distance between the ASV and the nearest sequenced reference genome. This metric can be used to identify ASVs that are highly distant from all reference sequences (the predictions for these sequences are less reliable!). The higher the NSTI score, the less the affiliations are relevant. Any ASVs with a NSTI value higher than 2 are typically either from uncharacterized phyla or off-target sequences.

Identity alignment cut-off:

All sequences with a identity percentage of alignment against the PICRUSt2 closest reference sequence is lower than this value will be excluded (between 0 and 1).
Coverage alignment cut-off:

All sequences with a coverage percentage of alignment against the PICRUSt2 closest reference sequence is lower than this value will be excluded (between 0 and 1).
HSP method: Hidden-state prediction method to use.

Outputs

Fasta file:

Sequence file without excluded ASVs (NSTI, blast perc identity or blast perc coverage thresholds).

ASV abundance Biom file - one per chosen target function database (EC, KO, PFAM, COG, TIGRFAM,PHENO):

ASV abundance data i a biom file without excluded ASVs (NSTI, %identity or %coverage thresholds alignment).

Function abundance file - “unstratified”:

It is the function abundance predictions of metagenome, per sample. (for exemple: frogsfunc_functions_unstrat_EC.tsv for EC database)

From this table of abundance it is quite possible to make statistical analyses to understand the information.

Table column description:

- classification: the hierarchy classification of the gene function.
- db_link: the url on the link accession ID (observation_name) of the function.
- observation_name: Accession identifier
- observation_sum: Total abundance of functions across all samples.
- last columns: Abundances of these functions in each samples.

ASV normalized abundance table:

Table with normalized abundances per marker copy number from FROGSFUNC_1 step. (functions_marker_norm.tsv)

Weighted NSTI file:

Output file with the mean of NSTI value per sample. (FROGSFUNC_2 functions_weighted_nsti.tsv)

Excluded sequences file:

Information about removed sequences that have a NSTI value aboved the NSTI threshold chosen in this step.

- Cluster: ASV name id.
- FROGS_taxonomy
- PICRUSt2_taxonomy
- exclusion_paramater: The paramater(s) that excluded the ASVs.
- value_parameter: The values associated with the paramater(s).

Copy number marker file - one per chosen target function database (EC, KO, PFAM, COG, TIGRFAM,PHENO):

Output table of predicted function copy numbers per ASV. There are as many tables as chosen target function database (EC, KO, PFAM, COG, TIGRFAM,PHENO)

HTML report

The HTML file summarizes information about function abundances within each sample.

How many ASVs/sequences are kept after the process?

ASVs are excluded if the associated NSTI is above the threshold, or if the alignment values are below the thresholds.

Number of different taxonomic ranks before (green) and after (orange) application of the filters.

What is the distribution of gene/function abundances in the samples ?

Weighted NSTI : Mean of NSTI values of ASVs present in the sample, normalized by their abundances.

Nb gene retrieved : Number of genes/functions present in the sample.

Display global distribution button allows to view the distribution of gene function abundances across all samples.

To view this distribution only on some samples, you check the boxes of the samples (first column of the table above), and click on the “Show distribution” button at the bottom of the table.

Gene functions distribution for selected samples

The innermost circle represents the highest hierarchical level of gene families according to Metacyc or Kegg databases. The more we go outwards, the more the hierarchical level becomes precise until indicating the identifier of the gene family.

For exemple :
Oxidoreductases > Acting on the CH-OH group of donors > With NAD+ or NADP+ as acceptor > EC:1.1.1.1

Function abundances table - stratified (optional - command line only).

optional and only for command line - not available on galaxy version

N.B.: In this above example, the first N lines of the file correspond to the N ASVs in the sample SC1703-104TTGCCC-B6TMLL001R, and so on for each sample.

Please note that requesting the stratified output files implies a very longer process time. And, this file is very large, there are as many lines as there are samples x ASVs x pathways.

- sample: sample names
- function: accession ID from function database
- taxon: ASVs names
- taxon_abun: sequence number of ASVs in the sample divided by number of marker copy number. 
  Ex: in the table above, ASV1 (cluster1) has only 1 16S gene and has 233 sequences in sample *SC1703-104TTGCCC-B6TMLL001R* so 233/1 = 233 is the taxon_abun of this ASV in this sample.

- taxon_rel_abun: This is the same as the “taxon_abun” column, but in terms of relative abundance (so that the sum of all ASV abundances per sample is 100).
- genome_function_count: Predicted copy number of this function per ASV. 
  Ex: in genome of ASV named cluster_11 there are 2 copies of EC.1.1.1.1 function

- taxon_function_abun: Multiplication of "taxon_abun" column by "genome_function_count" column.
- taxon_rel_function_abun: Multiplication of "taxon_rel_abun" column by "genome_function_count" column.
- norm_taxon_function_contrib: This is the same as the “taxon_rel_function_abun” column, but in terms of relative abundance in the sample (so that the sum of all number of this column equals 1).

A work by FROGS team