Reference

Here you can find format specifications and an automatically generated reference for ProPhyle’s CLI.

Formats

Trees

Trees are stored in Newick format 1 with NHX annotations. Such trees can be easily created and modified using the ete3 Python package.
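As a rough illustration of what NHX annotations look like, the sketch below extracts the `[&&NHX:key=value:...]` blocks from a Newick string using only the standard library. The example tree and its tag names (`taxid`, `rank`) are made up for illustration and are not necessarily the tags ProPhyle uses; for real work, a dedicated tree library such as ete3 is the better choice.

```python
import re

# Matches one NHX comment block such as "[&&NHX:key=value:key2=value2]".
NHX_BLOCK = re.compile(r"\[&&NHX:([^\]]*)\]")


def nhx_annotations(newick):
    """Return one dict of key/value tags per NHX block, in order of appearance."""
    result = []
    for body in NHX_BLOCK.findall(newick):
        tags = {}
        for field in body.split(":"):
            if "=" in field:
                key, value = field.split("=", 1)
                tags[key] = value
        result.append(tags)
    return result


# Hypothetical two-leaf tree; the tag names are illustrative only.
tree = "(A:1[&&NHX:taxid=10:rank=species],B:1[&&NHX:taxid=20])root[&&NHX:taxid=1];"
annotations = nhx_annotations(tree)
```

This simple scan assumes that tag values contain neither `:` nor `]`.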

Classification output

Both SAM and Kraken output formats are supported.

Analysis output

  • kraken-report format:
    1. Percentage of reads covered by the clade rooted at this taxon
    2. Number of reads covered by the clade rooted at this taxon
    3. Number of reads assigned directly to this taxon
    4. A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply '-'.
    5. NCBI taxonomy ID
    6. Indented scientific name
  • MetaPhlAn2 format:

    1. Clades, ranging from taxonomic kingdoms (Bacteria, Archaea, etc.) through species. Each clade name is prefixed to indicate its taxonomic level: Kingdom: k__, Phylum: p__, Class: c__, Order: o__, Family: f__, Genus: g__, Species: s__. Since sequence-based profiling is relative and does not provide absolute cellular abundance measures, clades are hierarchically summed. Each level sums to 100%; that is, the sum of all kingdom-level clades is 100%, the sum of all genus-level clades (including unclassified) is also 100%, and so forth. OTU equivalents can be extracted by using only the species-level s__ clades from this file (again, making sure to include clades unclassified at this level).

  • Custom Centrifuge format:

    #name                                                           taxID   taxRank    kmerCount   numReads   numUniqueReads   abundance
    Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis 36870   leaf       703004      5981.37    5964             0
    
  1. name of a genome, or the name corresponding to a taxonomic ID (the second column) at a rank higher than the strain (e.g., Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis).
  2. taxonomic ID (e.g., 36870).
  3. taxonomic rank (e.g., leaf).
  4. number of k-mers propagated up to the node (e.g., 703004).
  5. number of reads classified to this node including multi-classified reads (divided by the number of assignments, e.g., 5981.37).
  6. number of reads uniquely classified to this genomic sequence (e.g., 5964).
  7. abundance; not used yet (e.g., 0).
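The seven columns of the custom Centrifuge format can be read into a typed record as sketched below. This is only a minimal example under two assumptions that the specification above does not state: columns are separated by whitespace as in the sample, and the rank is a single token (as in `leaf`). The names `CentrifugeRow` and `parse_centrifuge_line` are introduced here for illustration only.

```python
from typing import NamedTuple


class CentrifugeRow(NamedTuple):
    name: str
    tax_id: int
    tax_rank: str
    kmer_count: int
    num_reads: float
    num_unique_reads: int
    abundance: float


def parse_centrifuge_line(line):
    """Parse one data line; the name may contain spaces, so split from the right."""
    # Assumption: the last six columns never contain whitespace.
    name, tax_id, rank, kmers, reads, unique, abundance = line.rstrip().rsplit(None, 6)
    return CentrifugeRow(
        name.strip(), int(tax_id), rank, int(kmers),
        float(reads), int(unique), float(abundance),
    )


sample = (
    "Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis "
    "36870   leaf       703004      5981.37    5964             0"
)
row = parse_centrifuge_line(sample)
```

Header lines starting with `#` would need to be skipped before calling the parser.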

Main program’s reference

prophyle (list of subcommands)

$ prophyle  -h

usage: prophyle.py [-h] [-v]  ...

Program: prophyle (phylogeny-based metagenomic classification)
Version: 0.2.0.0
Authors: Karel Brinda <kbrinda@hsph.harvard.edu>, Kamil Salikhov <kamil.salikhov@univ-mlv.fr>,
                 Simone Pignotti <pignottisimone@gmail.com>, Gregory Kucherov <gregory.kucherov@univ-mlv.fr>

Usage:   prophyle <command> [options]

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit

subcommands:

    download     download a genomic database
    index        build index
    classify     classify reads
    analyze      analyze results
    compress     compress a ProPhyle index (experimental)
    decompress   decompress a compressed ProPhyle index (experimental)
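The subcommands listed above are typically chained. The following sketch only builds the corresponding command lines as argument lists and does not execute anything. All paths are hypothetical placeholders, and the sketch assumes that the output of `prophyle classify` has been saved to the file passed as `classified`.

```python
import shlex


def pipeline_commands(library, tree, index_dir, reads, classified, out_prefix, k=31):
    """Return the four ProPhyle invocations as argv lists (nothing is executed)."""
    return [
        ["prophyle", "download", library],
        ["prophyle", "index", "-k", str(k), tree, index_dir],
        ["prophyle", "classify", index_dir, reads],
        ["prophyle", "analyze", index_dir, out_prefix, classified],
    ]


# Hypothetical file names, for illustration only.
commands = pipeline_commands(
    library="bacteria",
    tree="bacteria.nw",
    index_dir="my_index",
    reads="reads.fq",
    classified="classified.sam",
    out_prefix="results",
)
printable = [shlex.join(cmd) for cmd in commands]
```

Each list can be passed to `subprocess.run` once the placeholder paths are replaced with real ones.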

prophyle download

$ prophyle download -h

usage: prophyle.py download [-h] [-d DIR] [-l STR] [-F]
                            <library> [<library> ...]

positional arguments:
  <library>   genomic library ['bacteria', 'viruses', 'plasmids', 'hmp',
              'all']

optional arguments:
  -h, --help  show this help message and exit
  -d DIR      directory for the tree and the sequences [~/prophyle]
  -l STR      log file
  -F          rewrite library files if they already exist

prophyle index

$ prophyle index -h

usage: prophyle.py index [-h] [-g DIR] [-j INT] [-k INT] [-l STR] [-s FLOAT]
                         [-F] [-M] [-P] [-K] [-T] [-A]
                         <tree.nw> [<tree.nw> ...] <index.dir>

positional arguments:
  <tree.nw>    phylogenetic tree (in Newick/NHX)
  <index.dir>  index directory (will be created)

optional arguments:
  -h, --help   show this help message and exit
  -g DIR       directory with the library sequences [dir. of the first tree]
  -j INT       number of threads [auto (4)]
  -k INT       k-mer length [31]
  -l STR       log file [<index.dir>/log.txt]
  -s FLOAT     rate of sampling of the tree [no sampling]
  -F           rewrite index files if they already exist
  -M           mask repeats/low complexity regions (using DustMasker)
  -P           do not add prefixes to node names when multiple trees are used
  -K           skip k-LCP construction (then restarted search only)
  -T           keep temporary files from k-mer propagation
  -A           autocomplete tree (names of internal nodes and FASTA paths)

prophyle classify

$ prophyle classify -h

usage: prophyle.py classify [-h] [-k INT] [-K] [-m {h1,c1}] [-f {kraken,sam}]
                            [-l STR] [-A] [-L] [-P] [-C]
                            <index.dir> <reads1.fq> [<reads2.fq>]

positional arguments:
  <index.dir>      index directory
  <reads1.fq>      first file with reads in FASTA/FASTQ (- for standard input)
  <reads2.fq>      second file with reads in FASTA/FASTQ

optional arguments:
  -h, --help       show this help message and exit
  -k INT           k-mer length [detect automatically from the index]
  -K               use restarted search for matching rather than rolling
                   window (slower, but k-LCP is not needed)
  -m {h1,c1}       measure: h1=hit count, c1=coverage [h1]
  -f {kraken,sam}  output format [sam]
  -l STR           log file
  -A               annotate assignments (using tax. information from NHX)
  -L               use LCA when tie (multiple hits with the same score)
  -P               incorporate sequences and qualities into SAM records
  -C               use C++ impl. of the assignment algorithm (experimental)
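The idea behind the `-L` option (use the lowest common ancestor when several nodes tie for the best score) can be sketched as follows. This is an illustrative toy version, not ProPhyle's implementation: the tree is given as a hypothetical child-to-parent dictionary, and the scores are made-up hit counts.

```python
def lowest_common_ancestor(parent, nodes):
    """Return the deepest node that is an ancestor of (or equal to) every node."""
    def path_to_root(node):
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path

    first = path_to_root(nodes[0])
    common = set(first)
    for node in nodes[1:]:
        common &= set(path_to_root(node))
    # The first shared node on the path from the node to the root is the lowest one.
    return next(n for n in first if n in common)


def assign(scores, parent):
    """Pick the best-scoring node; on a tie, fall back to the LCA of the tied nodes."""
    best = max(scores.values())
    tied = sorted(n for n, s in scores.items() if s == best)
    return tied[0] if len(tied) == 1 else lowest_common_ancestor(parent, tied)


# Hypothetical tree: root -> (AB -> (A, B), C)
parent = {"A": "AB", "B": "AB", "AB": "root", "C": "root"}
```

For example, a tie between A and B resolves to their common parent AB, whereas a tie between A and C resolves to the root.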

prophyle analyze

$ prophyle analyze -h

usage: prophyle.py analyze [-h] [-s ['w', 'u', 'wl', 'ul']]
                           [-f ['sam', 'bam', 'cram', 'uncompressed_bam', 'kraken', 'histo']]
                           {index_dir, tree.nw} <out.pref> <classified.bam>
                           [<classified.bam> ...]

positional arguments:
  {index_dir, tree.nw}     index directory or phylogenetic tree
  <out.pref>               output prefix
  <classified.bam>         classified reads (use '-' for stdin)

optional arguments:
  -h, --help               show this help message and exit
  -s ['w', 'u', 'wl', 'ul']
                           statistics to use for the computation of
                           histograms: w (default) => weighted assignments; u
                           => unique assignments, non-weighted; wl => weighted
                           assignments, propagated to leaves; ul => unique
                           assignments, propagated to leaves.
  -f ['sam', 'bam', 'cram', 'uncompressed_bam', 'kraken', 'histo']
                           Input format of assignments [auto]

prophyle compress

$ prophyle compress -h

usage: prophyle.py compress [-h] <index.dir> [<archive.tar.gz>]

positional arguments:
  <index.dir>       index directory
  <archive.tar.gz>  output archive [<index.dir>.tar.gz]

optional arguments:
  -h, --help        show this help message and exit

prophyle decompress

$ prophyle decompress -h

usage: prophyle.py decompress [-h] [-K] archive.tar.gz [output.dir]

positional arguments:
  archive.tar.gz  output archive
  output.dir      output directory [./]

optional arguments:
  -h, --help      show this help message and exit
  -K              skip k-LCP construction

Other programs’ reference

prophyle_ncbi_tree

$ prophyle_ncbi_tree.py -h

usage: prophyle_ncbi_tree.py [-h] [-l log_file] [-r red_factor] [-u root]
                             <library> <library_dir> <output_file> <taxid_map>

Program: prophyle_ncbi_tree Build a taxonomic tree in the New Hampshire newick
format #1 for NCBI sequences

positional arguments:
  <library>      directory with the library sequences (e.g. bacteria, viruses
                 etc.)
  <library_dir>  library path (parent of library, e.g. main ProPhyle
                 directory)
  <output_file>  output file
  <taxid_map>    tab separated accession number to taxid map

optional arguments:
  -h, --help     show this help message and exit
  -l log_file    log file [stderr]
  -r red_factor  build reduced tree (one sequence every n)
  -u root        root of the tree (e.g. Bacteria); will exclude sequences
                 which are not its descendants

prophyle_assembler

$ prophyle_assembler -h


Program:  prophyle_assembler (greedy assembler for ProPhyle)
Contact:  Karel Brinda <karel.brinda@gmail.com>

Usage:    prophyle_assembler [options]

Examples: prophyle_assembler -k 15 -i f1.fa -i f2.fa -x fx.fa
             - compute intersection of f1 and f2
          prophyle_assembler -k 15 -i f1.fa -i f2.fa -x fx.fa -o g1.fa -o g2.fa
             - compute intersection of f1 and f2, and subtract it from them
          prophyle_assembler -k 15 -i f1.fa -o g1.fa
             - re-assemble f1 to g1

Command-line parameters:
 -k INT   K-mer size.
 -i FILE  Input FASTA file (can be used multiple times).
 -o FILE  Output FASTA file (if used, must be used as many times as -i).
 -x FILE  Compute intersection, subtract it, save it.
 -s FILE  Output file with k-mer statistics.
 -S       Silent mode.

Note that '-' can be used for standard input/output.
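The intersection and subtraction in the examples above operate on k-mer sets. The sketch below shows the set logic only, as a simplified model: it does not re-assemble k-mers into contigs and it ignores reverse complements, both of which the real assembler may handle differently. The function names are introduced here for illustration.

```python
def kmers(sequence, k):
    """Return the set of all k-mers (substrings of length k) in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def intersect_and_subtract(seq_a, seq_b, k):
    """Return (shared k-mers, k-mers left only in A, k-mers left only in B)."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    shared = a & b
    return shared, a - shared, b - shared


shared, only_a, only_b = intersect_and_subtract("ACGTT", "CGTTA", 3)
```

Here the two toy sequences share the 3-mers CGT and GTT, leaving ACG specific to the first sequence and TTA specific to the second.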

prophyle_index (list of subcommands)

$ prophyle_index -h


Program: prophyle_index (alignment of k-mers)
Contact: Kamil Salikhov <kamil.salikhov@univ-mlv.fr>

Usage:   prophyle_index command [options]

Command: build     construct index
         query     query reads against index

prophyle_index build

$ prophyle_index build -h


Usage:   prophyle_index build <prefix>

Options: -k INT    length of k-mer
         -s        construct k-LCP and SA in parallel
         -i        sampling distance for SA

prophyle_index query

$ prophyle_index query -h


Usage:   prophyle_index query [options] <prefix> <in.fq>

Options: -k INT    length of k-mer
         -u        use k-LCP for querying
         -v        output set of chromosomes for every k-mer
         -p        do not check whether k-mer is on border of two contigs, and show such k-mers in output
         -b        print sequences and base qualities
         -l STR    log file name to output statistics
         -t INT    number of threads [1]

prophyle_assignment

$ prophyle_assignment.py -h

usage: prophyle_assignment.py [-h] [-f {kraken,sam}] [-m {h1,c1}] [-A] [-L]
                              [-X] [-D]
                              <tree.nhx> <k> <assignments.txt>

Implementation of assignment algorithm

positional arguments:
  <tree.nhx>         phylogenetic tree (Newick/NHX)
  <k>                k-mer length
  <assignments.txt>  assignments in generalized Kraken format

optional arguments:
  -h, --help         show this help message and exit
  -f {kraken,sam}    format of output [sam]
  -m {h1,c1}         measure: h1=hitnumber, c1=coverage [h1]
  -A                 annotate assignments
  -L                 use LCA when tie (multiple hits with the same score)
  -X                 replace k-mer matches by their LCA
  -D                 do not translate blocks from node to tax IDs

prophyle_analyze

$ prophyle_analyze.py -h

usage: prophyle_analyze.py [-h] [-s ['w', 'u', 'wl', 'ul']]
                           [-f ['sam', 'bam', 'cram', 'uncompressed_bam', 'kraken', 'histo']]
                           {index_dir, tree.nw} <out_prefix> <input_fn>
                           [<input_fn> ...]

Program: prophyle_analyze.py

Analyze results of ProPhyle's classification.
Stats:
w: weighted assignments
u: unique assignments (ignore multiple assignments)
wl: weighted assignments, propagated to leaves
ul: unique assignments, propagated to leaves

positional arguments:
  {index_dir, tree.nw}  Index directory or phylogenetic tree
  <out_prefix>          Prefix for output files (the complete file names will
                        be <out_prefix>_rawhits.tsv for the raw hit counts
                        table and <out_prefix>_otu.tsv for the otu table)
  <input_fn>            ProPhyle output files whose format is chosen with the
                        -f option. Use '-' for stdin or multiple files with
                        the same format (one per sample)

optional arguments:
  -h, --help            show this help message and exit
  -s ['w', 'u', 'wl', 'ul']
                        Statistics to use for the computation of histograms: w
                        (default) => weighted assignments; u => unique
                        assignments, non-weighted; wl => weighted assignments,
                        propagated to leaves; ul => unique assignments,
                        propagated to leaves.
  -f ['sam', 'bam', 'cram', 'uncompressed_bam', 'kraken', 'histo']
                        Input format of assignments [auto]. If 'histo' is
                        selected the program expects hit count histograms
                        (*_rawhits.tsv) previously computed using prophyle
                        analyze, it merges them and compute OTU table from the
                        result (assignment files are not required)
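The difference between the `w` and `u` statistics can be illustrated with a small sketch. It rests on an assumed interpretation: under `w`, a read assigned to n nodes contributes 1/n to each of them, and under `u`, only reads with exactly one assignment are counted. The leaf-propagated variants (`wl`, `ul`) are not covered, and the function name and input layout are made up for this example.

```python
from collections import Counter


def hit_counts(assignments, stat="w"):
    """Count reads per node: 'w' splits each read evenly, 'u' keeps single-hit reads only."""
    counts = Counter()
    for nodes in assignments:
        if stat == "w":
            for node in nodes:
                counts[node] += 1 / len(nodes)
        elif stat == "u":
            if len(nodes) == 1:
                counts[nodes[0]] += 1
        else:
            raise ValueError("unsupported statistic: " + stat)
    return dict(counts)


# Each inner list holds the nodes one read was assigned to (toy data).
reads = [["A"], ["A", "B"], ["A"], ["C"]]
weighted = hit_counts(reads, "w")
unique = hit_counts(reads, "u")
```

With this toy input, the read shared between A and B adds 0.5 to each under `w` and is dropped entirely under `u`.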