FROGSFUNC_1_placeseqs_copynumber


Context

PICRUSt2 is a software tool that predicts functional abundances from marker gene sequences alone. It is integrated into the FROGS suite as the FROGSFUNC tools, which are split into 3 steps:

  1. FROGSFUNC_1_placeseqs_copynumber: places the ASVs into a reference phylogenetic tree and predicts the copy number of the marker gene (16S, ITS or 18S).
  2. FROGSFUNC_2_functions: predicts the copy number of each function in each ASV, then calculates function abundances in each sample and ASV abundances normalized by marker copy number.
  3. FROGSFUNC_3_pathways: calculates pathway abundances in each sample.

These predictions can be useful for generating hypotheses, but they should always be interpreted cautiously, especially when focusing on a single function or on the predictions for a single ASV.

:warning: PICRUSt2 supports only 3 markers: 16S, ITS and 18S. If you used another marker (rpob, 23S, coi, ef1, etc.), you cannot use these 3 tools.

What it does

FROGSFUNC_1_placeseqs_copynumber is the first step of PICRUSt2. It inserts your study sequences into a reference tree. By default, this reference tree is based on 20,000 16S sequences from genomes in the Integrated Microbial Genomes database. The tool performs two operations:

  1. Placement of the sequences of interest in the PICRUSt2 reference tree.
  2. Prediction of the copy number of the marker gene.

:warning: 2 input files are required for the FROGSFUNC_1_placeseqs_copynumber analysis: the ASV FASTA file and the ASV BIOM file.

Command line

:package: v4.1.0

usage: frogsfunc_placeseqs.py [-h] [-v] [--debug] -i INPUT_FASTA -b INPUT_BIOM
                              [-r REF_DIR] [-p {epa-ng,sepp}]
                              [--min-align MIN_ALIGN]
                              [--input-marker-table INPUT_MARKER_TABLE]
                              [--hsp-method {mp,emp_prob,pic,scp,subtree_average}]
                              [-o OUTPUT_TREE] [-e EXCLUDED] [-s OUTPUT_FASTA]
                              [-m OUTPUT_BIOM] [-c CLOSESTS_REF] [-l LOG_FILE]
                              [-t SUMMARY] [-om OUTPUT_MARKER]

place studies sequences (i.e. ASVs) into a reference tree.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show programs version number and exit
  --debug               Keep temporary files to debug program.

Inputs:
  -i INPUT_FASTA, --input-fasta INPUT_FASTA
                        Input fasta file of unaligned studies sequences.
  -b INPUT_BIOM, --input-biom INPUT_BIOM
                        Input biom file of unaligned studies sequences.
  -r REF_DIR, --ref-dir REF_DIR
                        If marker studied is not 16S, this is the directory
                        containing reference sequence files (for ITS, see:
                        $PICRUST2_PATH/default_files/fungi/fungi_ITS
  -p {epa-ng,sepp}, --placement-tool {epa-ng,sepp}
                        Tool to place sequences into reference tree. Note that
                        epa-ng is more sensitiv but very memory and computing
                        power intensive. Warning : sepp is not usable for ITS
                        and 18S analysis [Default: epa-ng]
  --min-align MIN_ALIGN
                        Proportion of the total length of an input query
                        sequence that must align with reference sequences. Any
                        sequences with lengths below this value after making
                        an alignment with reference sequences will be excluded
                        from the placement and all subsequent steps. (default:
                        0.8).
  --input-marker-table INPUT_MARKER_TABLE
                        The input marker table describing directly observed
                        traits (e.g. sequenced genomes) in tab-delimited
                        format. (ex
                        $PICRUSt2_PATH/default_files/fungi/ITS_counts.txt.gz).
                        Required.
  --hsp-method {mp,emp_prob,pic,scp,subtree_average}
                        HSP method to use. mp: predict discrete traits using
                        max parsimony. emp_prob: predict discrete traits based
                        on empirical state probabilities across tips.
                        subtree_average: predict continuous traits using
                        subtree averaging. pic: predict continuous traits with
                        phylogentic independent contrast. scp: reconstruct
                        continuous traits using squared-change parsimony
                        (default: mp).

Outputs:
  -o OUTPUT_TREE, --output-tree OUTPUT_TREE
                        Reference tree output with insert sequences (format:
                        newick). [Default: frogsfunc_placeseqs_tree.nwk]
  -e EXCLUDED, --excluded EXCLUDED
                        List of sequences not inserted in the tree. [Default:
                        frogsfunc_placeseqs_excluded.txt]
  -s OUTPUT_FASTA, --output-fasta OUTPUT_FASTA
                        Fasta file without non insert sequences. (format:
                        FASTA). [Default: frogsfunc_placeseqs.fasta]
  -m OUTPUT_BIOM, --output-biom OUTPUT_BIOM
                        Biom file without non insert sequences. (format: BIOM)
                        [Default: frogsfunc_placeseqs.biom]
  -c CLOSESTS_REF, --closests-ref CLOSESTS_REF
                        Informations about Clusters (i.e OTUs) and PICRUSt2
                        closest reference from cluster sequences
                        (identifiants, taxonomies, phylogenetic distance from
                        reference, nucleotidics sequences). [Default:
                        frogsfunc_placeseqs_closests_ref_sequences.txt]
  -l LOG_FILE, --log-file LOG_FILE
                        List of commands executed.
  -t SUMMARY, --summary SUMMARY
                        Path to store resulting html file. [Default:
                        frogsfunc_placeseqs_summary.html]
  -om OUTPUT_MARKER, --output-marker OUTPUT_MARKER
                        Output table of predicted marker gene copy numbers per
                        studied sequence in input tree. If the extension ".gz"
                        is added the table will automatically be
                        gzipped.[Default: frogsfunc_marker.tsv]

Example of command line:

./frogsfunc_placeseqs.py \
    --input-fasta data/frogsfunc.fasta \
    --input-biom data/frogsfunc.biom \
    --placement-tool sepp \
    --output-tree frogsfunc_placeseqs_tree.nwk \
    --excluded frogsfunc_placeseqs_excluded.txt \
    --output-fasta frogsfunc_placeseqs.fasta \
    --output-biom frogsfunc_placeseqs.biom \
    --closests-ref frogsfunc_placeseqs_closests_ref_sequences.txt \
    --output-marker frogsfunc_marker.tsv \
    --summary frogsfunc_placeseqs_summary.html

:warning: For ITS or 18S analysis, you must specify the path to the PICRUSt2 reference files directory. Example for ITS:

./frogsfunc_placeseqs.py \
    --input-fasta data/its.fasta \
    --input-biom data/its.biom \
    --ref-dir $PICRUST2_PATH/default_files/fungi/fungi_ITS
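
When the tool is launched from a script, the constraints stated in the help text (sepp is not usable for ITS and 18S, and a reference directory is needed when the marker is not 16S) can be checked before running it. The following is a minimal sketch and not part of FROGS: the helper `build_placeseqs_command` and its `marker` parameter are hypothetical names, and only the command-line flags come from the usage shown above.

```python
import os


def build_placeseqs_command(fasta, biom, marker="16S", placement_tool="epa-ng",
                            ref_dir=None, min_align=0.8):
    """Return the argument list for frogsfunc_placeseqs.py (hypothetical helper)."""
    if marker not in ("16S", "ITS", "18S"):
        raise ValueError("PICRUSt2 only supports 16S, ITS and 18S")
    if placement_tool not in ("epa-ng", "sepp"):
        raise ValueError("placement tool must be 'epa-ng' or 'sepp'")
    if marker != "16S":
        # The help text states that sepp is not usable for ITS and 18S.
        if placement_tool == "sepp":
            raise ValueError("sepp is not usable for ITS and 18S analysis")
        # A reference directory must be given when the marker is not 16S.
        if ref_dir is None:
            raise ValueError("--ref-dir is required for ITS and 18S")
    cmd = ["./frogsfunc_placeseqs.py",
           "--input-fasta", fasta,
           "--input-biom", biom,
           "--placement-tool", placement_tool,
           "--min-align", str(min_align)]
    if ref_dir is not None:
        cmd += ["--ref-dir", ref_dir]
    return cmd


# Environment variables are not expanded when a command is passed as a list,
# so the reference path is expanded explicitly here.
its_ref_dir = os.path.expandvars("$PICRUST2_PATH/default_files/fungi/fungi_ITS")
its_cmd = build_placeseqs_command("data/its.fasta", "data/its.biom",
                                  marker="ITS", ref_dir=its_ref_dir)
print(" ".join(its_cmd))
```

The resulting list can be handed to `subprocess.run(its_cmd, check=True)` once FROGS is installed.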

Galaxy

Sequences file: The ASV FASTA sequence file.
biom file: The ASV BIOM file. Taxonomic affiliation must have been performed beforehand (BIOM file from the FROGS_5 taxonomic_affiliation tool).
taxonomy marker: Only 16S, ITS and 18S are available.
:warning: If your ASVs are based on another marker, you cannot use this tool.

placement tool: EPA-ng and SEPP are placement tools that insert sequences into the PICRUSt2 reference tree. SEPP is a low-memory alternative to EPA-ng, so if the tool crashes with EPA-ng, try again with SEPP.
minimum alignment length: Proportion of the total length of an input sequence that must align with the reference sequences. Sequences below this threshold are excluded.
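
The effect of the minimum alignment length can be illustrated with a small sketch. It is not the PICRUSt2 implementation: the function name `split_by_alignment` and the two input dictionaries are hypothetical, and the aligned lengths are assumed to be already known for each ASV.

```python
def split_by_alignment(aligned_lengths, total_lengths, min_align=0.8):
    """Split ASV names into (kept, excluded) according to the aligned proportion.

    aligned_lengths: ASV name -> number of positions aligned with the references
    total_lengths:   ASV name -> total length of the input sequence
    """
    kept, excluded = [], []
    for asv, total in total_lengths.items():
        proportion = aligned_lengths.get(asv, 0) / total
        # Only sequences below the threshold are excluded.
        if proportion >= min_align:
            kept.append(asv)
        else:
            excluded.append(asv)
    return kept, excluded


total_lengths = {"Cluster_1": 400, "Cluster_2": 400, "Cluster_3": 400}
aligned_lengths = {"Cluster_1": 380, "Cluster_2": 320, "Cluster_3": 300}
kept, excluded = split_by_alignment(aligned_lengths, total_lengths)
print(kept, excluded)
```

In this toy example, Cluster_3 aligns over 75% of its length and is therefore excluded with the default threshold of 0.8.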

Outputs

HTML report

The HTML report shows which ASVs are placed in the phylogenetic tree and which are not. Note that PICRUSt2 uses its own reference tree to affiliate ASVs to reference sequences. For each ASV, the report indicates the closest PICRUSt2 reference sequence and compares its taxonomy to the original FROGS taxonomy. Clicking on a sequence ID gives you more information about it in the JGI database.

:question: How many ASVs/sequences are kept after the process?

The pie charts show the proportion of ASVs and the proportion of total sequences that are excluded from the following steps.
ASVs are excluded if the proportion of the input sequence that aligns against the reference sequences is lower than the "minimum alignment length" threshold (--min-align).
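
The two proportions shown in the pie charts are not the same quantity: one counts ASVs, the other counts the sequences they represent. The following sketch shows the difference, assuming the ASV abundances and the list of excluded ASVs are available as plain Python objects (the function name and the data are hypothetical).

```python
def excluded_proportions(abundances, excluded):
    """Return (proportion of ASVs excluded, proportion of sequences excluded).

    abundances: ASV name -> total number of sequences over all samples
    excluded:   names of the ASVs that were not inserted in the tree
    """
    total_asvs = len(abundances)
    total_sequences = sum(abundances.values())
    excluded_sequences = sum(abundances[asv] for asv in excluded)
    return len(excluded) / total_asvs, excluded_sequences / total_sequences


abundances = {"Cluster_1": 700, "Cluster_2": 250, "Cluster_3": 50}
asv_part, sequence_part = excluded_proportions(abundances, ["Cluster_3"])
print(f"{asv_part:.1%} of ASVs, {sequence_part:.1%} of sequences excluded")
```

Here one ASV out of three is excluded, but it carries only 50 of the 1000 sequences, so the impact on the total abundance is small.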


:question: Where are my ASVs inserted in the reference phylogenetic tree?

:warning: PICRUSt2 predicted abundances are based on the reference genome sequences closest to your ASVs in the phylogenetic tree. To compare the taxonomic affiliations performed by FROGS and by PICRUSt2, the following table is produced:

:bulb: For example, an NSTI lower than 0.5, with “species” as the lowest common taxonomic rank between FROGS and PICRUSt2, should produce a good prediction.

:bulb: Search for « up to species » to obtain the least ambiguous references.
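
The lowest common taxonomic rank can be understood as the deepest rank down to which both taxonomies agree. The sketch below is an illustration only: the rank names, the list-based taxonomy format and the helper names are assumptions, and the 0.5 limit is the example value given above, not a rule.

```python
RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def lowest_common_rank(frogs_taxonomy, picrust2_taxonomy):
    """Return the deepest rank shared by both taxonomies, or None if none is shared."""
    common = None
    for rank, frogs_taxon, picrust2_taxon in zip(RANKS, frogs_taxonomy, picrust2_taxonomy):
        if frogs_taxon != picrust2_taxon:
            break
        common = rank
    return common


def looks_reliable(nsti, common_rank, max_nsti=0.5):
    """Apply the example criterion: NSTI below the limit and agreement down to species."""
    return nsti < max_nsti and common_rank == "Species"


frogs = ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
         "Enterobacteriaceae", "Escherichia", "Escherichia coli"]
picrust2 = ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
            "Enterobacteriaceae", "Escherichia", "Escherichia fergusonii"]
rank = lowest_common_rank(frogs, picrust2)
print(rank, looks_reliable(0.03, rank))
```

In this example the two affiliations diverge at the species level, so the lowest common rank is the genus and the example criterion is not met despite the low NSTI.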


:question: How to evaluate the NSTI?

These graphs help you set the “NSTI cut-off” parameter of the next tool.

The graph shows the number of ASVs and sequences kept according to the NSTI threshold. It is a decision-support graphic that helps you choose the NSTI threshold, which you will be asked to set in the next tool, FROGSFUNC_2_functions.
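
The values behind such a curve can be recomputed by counting, for each candidate threshold, the ASVs and sequences whose NSTI does not exceed it. This is a minimal sketch with made-up data: the function name is hypothetical, and treating an NSTI equal to the threshold as kept is an assumption.

```python
def kept_by_threshold(asvs, thresholds):
    """Return one (threshold, kept ASVs, kept sequences) tuple per threshold.

    asvs: list of (NSTI, abundance) pairs, one per ASV
    """
    rows = []
    for threshold in thresholds:
        # Assumption: an ASV is kept when its NSTI is at most the threshold.
        kept = [abundance for nsti, abundance in asvs if nsti <= threshold]
        rows.append((threshold, len(kept), sum(kept)))
    return rows


asvs = [(0.02, 500), (0.10, 300), (0.44, 150), (1.80, 50)]
curve = kept_by_threshold(asvs, [0.1, 0.5, 2.0])
for threshold, n_asvs, n_sequences in curve:
    print(threshold, n_asvs, n_sequences)
```

Raising the threshold from 0.5 to 2.0 in this toy dataset recovers only one more ASV and 50 more sequences, which is the kind of plateau to look for on the graph.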

The aim is to find a compromise between the guidance provided by the PICRUSt2 authors and the amount of usable information that you would like to keep.

A good practice is to choose an NSTI threshold that is as low as possible while still retaining a good number of sequences, so that the taxonomies derived from FROGS and PICRUSt2 do not diverge too much.

On the graph above, we could keep only the information before the plateau: beyond that point, the more ASVs we keep, the more accuracy degrades. Here, that gives NSTI = 0.44.
However, this depends strongly on the dataset and on your needs.

The graph depicts the BLAST percentages of identity and coverage against the closest PICRUSt2 sequence (ordinate) as a function of the NSTI score (abscissa). Thus, the ASVs with the best predictions are located at the top left of the graph.

Tree file

Combination of the reference phylogenetic tree with your inserted sequences.

Excluded file

List of ASV names removed by the process.

FASTA file

The FASTA file without the sequences excluded by the process.

Closest reference sequence table

Information on the sequences from the PICRUSt2 reference tree that are the closest neighbours of your studied sequences.

BIOM file

The BIOM file without the sequences excluded by the process.

Copy number marker file

The output table of predicted marker gene copy numbers for each ASV placed in the reference tree.




A work by FROGS team