FROGSFUNC_1_placeseqs_copynumber

Context

PICRUSt2 is software that predicts functional abundances from marker gene sequences alone. It is integrated into the FROGS suite as the FROGSFUNC tools, which are split into 3 steps:

FROGSFUNC_1_placeseqs_copynumber: Places the ASVs into a reference phylogenetic tree and predicts the copy number of the marker gene (16S, ITS or 18S).
FROGSFUNC_2_functions: Predicts the copy number of each function in each ASV, then calculates function abundances in each sample and ASV abundances normalized by marker copy number.
FROGSFUNC_3_pathways: Calculates pathway abundances in each sample.

These predictions can be useful for generating hypotheses, but they should always be interpreted cautiously, especially when focusing on a single function or on predictions for a single ASV.

PICRUSt2 supports 3 markers only: 16S, ITS and 18S. If you used another marker (rpoB, 23S, COI, EF1, etc.), you cannot use these 3 tools.

What it does

FROGSFUNC_1_placeseqs_copynumber is the first step of PICRUSt2. It inserts your study sequences into a reference tree. By default, this reference tree is based on 20,000 16S sequences from genomes in the Integrated Microbial Genomes database. Specifically, this step:

Aligns your study sequences with a multiple-sequence alignment of reference 16S, ITS or 18S sequences using HMMER.
Finds the most likely placements of your study sequences in the reference tree using EPA-ng or SEPP.
Produces a tree file with the most likely placement of each sequence as a new tip, using GAPPA.
Predicts marker copy number based solely on the marker gene sequences, using PICRUSt2. The available marker genes are 16S, ITS and 18S.

Two input files are required for FROGSFUNC_1_placeseqs_copynumber analysis:

A FASTA file of ASV sequences (for example, from the FROGS_4 Cluster filters step).
A BIOM file of ASV abundances with taxonomic affiliation information (for example, from the FROGS_5 taxonomic affiliation step).

Placement of sequences of interest in PICRUSt2 reference tree

Prediction of the copy numbers of the marker gene

Command line

v4.1.0

usage: frogsfunc_placeseqs.py [-h] [-v] [--debug] -i INPUT_FASTA -b INPUT_BIOM
                              [-r REF_DIR] [-p {epa-ng,sepp}]
                              [--min-align MIN_ALIGN]
                              [--input-marker-table INPUT_MARKER_TABLE]
                              [--hsp-method {mp,emp_prob,pic,scp,subtree_average}]
                              [-o OUTPUT_TREE] [-e EXCLUDED] [-s OUTPUT_FASTA]
                              [-m OUTPUT_BIOM] [-c CLOSESTS_REF] [-l LOG_FILE]
                              [-t SUMMARY] [-om OUTPUT_MARKER]

place studies sequences (i.e. ASVs) into a reference tree.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show programs version number and exit
  --debug               Keep temporary files to debug program.

Inputs:
  -i INPUT_FASTA, --input-fasta INPUT_FASTA
                        Input fasta file of unaligned studies sequences.
  -b INPUT_BIOM, --input-biom INPUT_BIOM
                        Input biom file of unaligned studies sequences.
  -r REF_DIR, --ref-dir REF_DIR
                        If marker studied is not 16S, this is the directory
                        containing reference sequence files (for ITS, see:
                        $PICRUST2_PATH/default_files/fungi/fungi_ITS
  -p {epa-ng,sepp}, --placement-tool {epa-ng,sepp}
                        Tool to place sequences into reference tree. Note that
                        epa-ng is more sensitiv but very memory and computing
                        power intensive. Warning : sepp is not usable for ITS
                        and 18S analysis [Default: epa-ng]
  --min-align MIN_ALIGN
                        Proportion of the total length of an input query
                        sequence that must align with reference sequences. Any
                        sequences with lengths below this value after making
                        an alignment with reference sequences will be excluded
                        from the placement and all subsequent steps. (default:
                        0.8).
  --input-marker-table INPUT_MARKER_TABLE
                        The input marker table describing directly observed
                        traits (e.g. sequenced genomes) in tab-delimited
                        format. (ex
                        $PICRUSt2_PATH/default_files/fungi/ITS_counts.txt.gz).
                        Required.
  --hsp-method {mp,emp_prob,pic,scp,subtree_average}
                        HSP method to use. mp: predict discrete traits using
                        max parsimony. emp_prob: predict discrete traits based
                        on empirical state probabilities across tips.
                        subtree_average: predict continuous traits using
                        subtree averaging. pic: predict continuous traits with
                        phylogentic independent contrast. scp: reconstruct
                        continuous traits using squared-change parsimony
                        (default: mp).

Outputs:
  -o OUTPUT_TREE, --output-tree OUTPUT_TREE
                        Reference tree output with insert sequences (format:
                        newick). [Default: frogsfunc_placeseqs_tree.nwk]
  -e EXCLUDED, --excluded EXCLUDED
                        List of sequences not inserted in the tree. [Default:
                        frogsfunc_placeseqs_excluded.txt]
  -s OUTPUT_FASTA, --output-fasta OUTPUT_FASTA
                        Fasta file without non insert sequences. (format:
                        FASTA). [Default: frogsfunc_placeseqs.fasta]
  -m OUTPUT_BIOM, --output-biom OUTPUT_BIOM
                        Biom file without non insert sequences. (format: BIOM)
                        [Default: frogsfunc_placeseqs.biom]
  -c CLOSESTS_REF, --closests-ref CLOSESTS_REF
                        Informations about Clusters (i.e OTUs) and PICRUSt2
                        closest reference from cluster sequences
                        (identifiants, taxonomies, phylogenetic distance from
                        reference, nucleotidics sequences). [Default:
                        frogsfunc_placeseqs_closests_ref_sequences.txt]
  -l LOG_FILE, --log-file LOG_FILE
                        List of commands executed.
  -t SUMMARY, --summary SUMMARY
                        Path to store resulting html file. [Default:
                        frogsfunc_placeseqs_summary.html]
  -om OUTPUT_MARKER, --output-marker OUTPUT_MARKER
                        Output table of predicted marker gene copy numbers per
                        studied sequence in input tree. If the extension ".gz"
                        is added the table will automatically be
                        gzipped.[Default: frogsfunc_marker.tsv]

Example of command line:

./frogsfunc_placeseqs.py \
    --input-fasta data/frogsfunc.fasta \
    --input-biom data/frogsfunc.biom \
    --placement-tool sepp \
    --output-tree frogsfunc_placeseqs_tree.nwk \
    --excluded frogsfunc_placeseqs_excluded.txt \
    --output-fasta frogsfunc_placeseqs.fasta \
    --output-biom frogsfunc_placeseqs.biom \
    --closests-ref frogsfunc_placeseqs_closests_ref_sequences.txt \
    --output-marker frogsfunc_marker.tsv \
    --summary frogsfunc_placeseqs_summary.html
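
If you launch the tool from a Python pipeline, the command above can be assembled programmatically. The following is a minimal sketch, not part of FROGS: the helper name `build_placeseqs_command` and its checks are hypothetical, and it only uses options listed in the help above. The rule that rejects SEPP when a reference directory is given is an assumption derived from the help warning that SEPP is not usable for ITS and 18S.

```python
import shlex


def build_placeseqs_command(input_fasta, input_biom, placement_tool="epa-ng",
                            min_align=0.8, ref_dir=None,
                            script="./frogsfunc_placeseqs.py"):
    """Return the argument list for frogsfunc_placeseqs.py (hypothetical helper)."""
    if placement_tool not in ("epa-ng", "sepp"):
        raise ValueError("placement_tool must be 'epa-ng' or 'sepp'")
    if not 0 < min_align <= 1:
        raise ValueError("min_align is a proportion between 0 and 1")
    if ref_dir is not None and placement_tool == "sepp":
        # Assumption: --ref-dir is only given for ITS/18S, where sepp is not usable.
        raise ValueError("sepp is not usable for ITS and 18S analysis")
    cmd = [script,
           "--input-fasta", input_fasta,
           "--input-biom", input_biom,
           "--placement-tool", placement_tool,
           "--min-align", str(min_align)]
    if ref_dir is not None:
        cmd += ["--ref-dir", ref_dir]
    return cmd


cmd = build_placeseqs_command("data/frogsfunc.fasta", "data/frogsfunc.biom",
                              placement_tool="sepp")
# Print a copy-pastable shell command; run it with subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when file paths contain spaces.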

For ITS or 18S analysis, you must specify the path to the PICRUSt2 reference files directory. Example for ITS:

./frogsfunc_placeseqs.py \
    --input-fasta data/its.fasta \
    --input-biom data/its.biom \
    --ref-dir $PICRUST2_PATH/default_files/fungi/fungi_ITS

Galaxy

Sequences file: The ASV FASTA sequence file.
biom file: The ASV BIOM file. Taxonomic affiliation must already have been performed (BIOM file from the FROGS_5 taxonomic_affiliation tool).
taxonomy marker: Only 16S, ITS and 18S are available.
If your ASVs are based on another marker, you cannot use this tool.

placement tool: EPA-ng and SEPP are placement tools that insert sequences into the PICRUSt2 reference tree. SEPP is a low-memory alternative to EPA-ng, so if the tool crashes with EPA-ng, try again with SEPP.
minimum alignment length: Proportion of the total length of an input sequence that must align with reference sequences. Sequences below this proportion are excluded.

Outputs

HTML report

The HTML report shows which ASVs are placed in the phylogenetic tree and which are not. Note that PICRUSt2 uses its own reference tree to affiliate ASVs from reference sequences. For each ASV, the report indicates the closest PICRUSt2 reference sequence and compares its taxonomy with the original FROGS taxonomy. Clicking on the sequence ID gives you more information about it in the JGI database.

How many ASVs/sequences are kept after the process?

The pie charts show the proportion of ASVs excluded and the proportion of total sequences excluded from the following steps.
An ASV is excluded if the proportion of its sequence that aligns with the reference sequences is below the "minimum alignment length" threshold (--min-align).
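
The exclusion rule can be illustrated with a small sketch. This is not the PICRUSt2 implementation: it assumes that the aligned proportion is the number of non-gap characters in the aligned query divided by the original query length, and the ASV names and sequences are made up for the example.

```python
def aligned_fraction(aligned_query, query_length):
    """Proportion of the original query found in alignment columns.

    Assumption: gaps are written as '-' or '.' in the aligned sequence.
    """
    aligned_positions = sum(1 for char in aligned_query if char not in "-.")
    return aligned_positions / query_length


def split_by_min_align(alignments, lengths, min_align=0.8):
    """Split ASV names into (kept, excluded) according to the threshold."""
    kept, excluded = [], []
    for name, aligned in alignments.items():
        fraction = aligned_fraction(aligned, lengths[name])
        # Sequences strictly below the threshold are excluded.
        (kept if fraction >= min_align else excluded).append(name)
    return kept, excluded


# Hypothetical data: Cluster_1 aligns over its full length, Cluster_2 over 20%.
alignments = {"Cluster_1": "ACGT--ACGTAC", "Cluster_2": "AC----------"}
lengths = {"Cluster_1": 10, "Cluster_2": 10}
kept, excluded = split_by_min_align(alignments, lengths)
print("kept:", kept)
print("excluded:", excluded)
```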

Where are my ASVs inserted in the phylogenetic reference tree?

PICRUSt2 predictions are based on the reference genome sequences closest to each ASV in the phylogenetic tree. To compare the taxonomic affiliations made by FROGS and by PICRUSt2, the following table is produced:

ASV: ASV name.
Nb sequences: Sequence abundance of the ASV.
FROGS taxonomy: Taxonomic affiliation made by FROGS (FROGS_5 taxonomic affiliation).
PICRUSt2 closest ID (JGI): Identifier (JGI) of the reference sequence closest to the ASV inserted in the reference tree (see the explanatory illustration at the bottom of this page).
PICRUSt2 closest reference name: Genome Name / Sample Name from the PICRUSt2 reference tree.
PICRUSt2 closest taxonomy: Taxonomy (JGI) of the reference sequence closest to the ASV inserted in the reference tree, in the following format: Kingdom;Phylum;Class;Order;Family;Genus;Species
NSTI: The Nearest Sequenced Taxon Index (NSTI) is the phylogenetic distance between the ASV and the nearest sequenced reference genome. This metric can be used to identify ASVs that are highly distant from all reference sequences (the predictions for these sequences are less reliable). The higher the NSTI score, the less reliable the affiliation. ASVs with an NSTI value higher than 2 are typically either from uncharacterized phyla or off-target sequences.
NSTI confidence: Based on the NSTI score, we indicate how much confidence you can place in the PICRUSt2 affiliation. Four levels are given:
- 0 < Good < 0.5
- 0.5 <= Medium < 1
- 1 <= Bad < 2
- To exclude >= 2
PICRUSt2 sets the NSTI threshold to 2 by default. Some studies have shown that this threshold is permissive. It is therefore important to check whether the PICRUSt2 and FROGS taxonomies are similar, so that you can choose a more stringent threshold afterwards if needed.

For example, an NSTI lower than 0.5, with “species” as the lowest common taxonomic rank between FROGS and PICRUSt2, should give a good prediction.

Lowest same taxonomic rank between FROGS and PICRUSt2: Lowest taxonomic rank shared by the FROGS and PICRUSt2 affiliations.
Comment:
- identical taxonomy: if the FROGS and PICRUSt2 taxonomic affiliations are identical.
- identical sequence: if the ASV sequence is strictly the same as the reference sequence.
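
The four NSTI confidence levels listed in the table description can be written as a small function. This is a sketch for reading the report, not code from FROGS. The function name is hypothetical, and treating an NSTI of exactly 0 as "Good" is an assumption, since the table states a strict lower bound.

```python
def nsti_confidence(nsti):
    """Map an NSTI score to the confidence level shown in the report."""
    if nsti < 0:
        raise ValueError("NSTI is a distance and cannot be negative")
    if nsti < 0.5:
        return "Good"        # assumption: an NSTI of exactly 0 is also "Good"
    if nsti < 1:
        return "Medium"
    if nsti < 2:
        return "Bad"
    return "To exclude"


for score in (0.03, 0.5, 1.7, 2.0):
    print(score, nsti_confidence(score))
```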

Search for « up to species » to obtain less ambiguous references.

How to evaluate the NSTI?

These graphs help you set the “NSTI cut-off” parameter of the next tool.

The graph shows the number of ASVs and sequences kept according to the NSTI threshold. It is a decision-support graphic to help you choose the NSTI threshold, which you will be asked to set in the next tool, FROGSFUNC_2_functions.

The aim is to find a compromise between the guidance provided by the PICRUSt2 authors and the amount of usable information that you would like to keep.

A good practice is to choose an NSTI threshold that is as low as possible while still retaining a good number of sequences, i.e. while ensuring that the taxonomies derived from FROGS and PICRUSt2 do not diverge too much.

On the graph above, we could keep only the information before the plateau: beyond this point, the more ASVs we keep, the more we degrade the accuracy. Here, that gives NSTI = 0.44.
However, this depends strongly on the dataset and on your needs.
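
The counts behind such a graph can be reproduced from the "Nb sequences" and "NSTI" columns of the table. The sketch below uses made-up values and assumes that an ASV is kept when its NSTI is lower than or equal to the threshold. Check the FROGSFUNC_2_functions documentation for the exact rule.

```python
def kept_by_threshold(asvs, thresholds):
    """Count ASVs and sequences kept at each NSTI threshold.

    asvs: list of (nsti, nb_sequences) pairs, one per ASV.
    Returns a list of (threshold, nb_asvs_kept, nb_sequences_kept).
    """
    rows = []
    for threshold in thresholds:
        # Assumption: an ASV is kept when its NSTI does not exceed the threshold.
        kept = [(nsti, count) for nsti, count in asvs if nsti <= threshold]
        rows.append((threshold, len(kept), sum(count for _, count in kept)))
    return rows


# Hypothetical ASVs: (NSTI, number of sequences).
asvs = [(0.02, 1500), (0.10, 800), (0.44, 300), (0.90, 40), (2.5, 5)]
rows = kept_by_threshold(asvs, thresholds=[0.1, 0.5, 1, 2])
for threshold, nb_asvs, nb_sequences in rows:
    print(f"NSTI <= {threshold}: {nb_asvs} ASVs, {nb_sequences} sequences")
```

In this made-up dataset, raising the threshold from 0.5 to 1 adds only 40 sequences, which is the kind of plateau described above.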

The graph plots the BLAST percentages of identity and coverage against the closest PICRUSt2 sequence (ordinate) versus the NSTI score (abscissa). The ASVs with the best predictions are therefore located at the top left of the graph.

Tree file

Combination of the reference phylogenetic tree with your inserted sequences.

Excluded file

List of ASV names removed by the process.

FASTA file

The FASTA file without the sequences excluded by the process.
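
To illustrate how the excluded file relates to this FASTA file, here is a minimal sketch that removes records from FASTA text given a list of ASV names. It is not the FROGS code. It assumes that the sequence ID is the first word of each header line, and the names and sequences are invented.

```python
def filter_fasta(fasta_text, excluded_names):
    """Return the FASTA records whose ID is not in excluded_names."""
    excluded = set(excluded_names)
    kept_lines = []
    keep = False  # nothing is kept before the first header line
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            keep = seq_id not in excluded
        if keep:
            kept_lines.append(line)
    return "\n".join(kept_lines)


# Hypothetical input: three ASVs, one of which was not inserted in the tree.
fasta = ">Cluster_1\nACGTACGT\nACGT\n>Cluster_2\nTTTT\n>Cluster_3\nGGGG"
filtered = filter_fasta(fasta, ["Cluster_2"])
print(filtered)
```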

Closest reference sequence table

Information on the sequences from the PICRUSt2 reference tree that are the closest neighbours of your studied sequences.

BIOM file

The BIOM file without the sequences excluded by the process.

Copy number marker file

Output table of predicted marker gene copy numbers for each ASV placed in the reference tree.

A work by FROGS team