Using TaxoniumTools

Installing taxoniumtools

Taxoniumtools is available from PyPI. You can install it with pip.

pip install taxoniumtools

The usher_to_taxonium utility will then be available for use.
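If you want to confirm from Python that the command-line tool is visible after installation, a minimal sketch is shown below. It uses only the standard library; the helper name `usher_to_taxonium_path` is made up for this illustration and is not part of taxoniumtools.

```python
import shutil


def usher_to_taxonium_path():
    """Return the full path to the usher_to_taxonium executable, or None if it is not on PATH."""
    # shutil.which searches the directories listed in the PATH environment variable.
    return shutil.which("usher_to_taxonium")


path = usher_to_taxonium_path()
if path is None:
    print("usher_to_taxonium was not found on PATH")
else:
    print(f"usher_to_taxonium found at {path}")
```

If the tool is not found, a likely cause is that pip installed into a different Python environment from the one currently active.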

Using usher_to_taxonium from taxoniumtools

Example

First get some files:

wget https://github.com/theosanderson/taxonium/raw/master/taxoniumtools/test_data/tfci.meta.tsv.gz
wget https://raw.githubusercontent.com/theosanderson/taxonium/master/taxoniumtools/test_data/hu1.gb
wget https://github.com/theosanderson/taxonium/raw/master/taxoniumtools/test_data/tfci.pb

Then convert from UShER pb format to Taxonium jsonl format:

usher_to_taxonium --input tfci.pb --output tfci-taxonium.jsonl.gz --metadata tfci.meta.tsv.gz --genbank hu1.gb \
--columns genbank_accession,country,date,pangolin_lineage

You can then open the resulting tfci-taxonium.jsonl.gz file at taxonium.org.

Note

Right now Taxoniumtools is limited in the types of genome annotations it can support. For SARS-CoV-2 we recommend using the exact modified .gb file from the example above, which splits ORF1ab into ORF1a and ORF1b to avoid the need to model ribosome slippage.

Note

Some people ask what the “L” in JSONL is for. JSONL means “JSON Lines”: each line of the file is a separate JSON object. In the Taxonium JSONL format, the very first line contains metadata about the tree as a whole, and each subsequent line contains information about a single node. It’s important to use the “jsonl” extension instead of “json”, as otherwise the interface may try to parse your tree as a Nextstrain JSON file.
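To make the line-per-object layout concrete, here is a minimal sketch that reads a gzipped JSONL file, parses the first line as the tree-level metadata, and counts the remaining node lines. The function name `read_taxonium_jsonl` is hypothetical, and the sketch deliberately makes no assumptions about which keys appear in the header or node objects.

```python
import gzip
import json


def read_taxonium_jsonl(path):
    """Return (header, node_count) for a gzipped JSONL file.

    The first line is parsed as the tree-level metadata object; every
    remaining non-empty line is counted as one node.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        header = json.loads(handle.readline())
        node_count = sum(1 for line in handle if line.strip())
    return header, node_count


# Example usage, assuming the file produced by the conversion command above exists:
# header, node_count = read_taxonium_jsonl("tfci-taxonium.jsonl.gz")
# print(type(header), node_count)
```

Because each node sits on its own line, the file can be streamed line by line without loading the whole tree into memory at once. For an uncompressed .jsonl file, the built-in `open` can be used in place of `gzip.open`.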

usher_to_taxonium

This tool converts a UShER protobuf file into a Taxonium file. At its simplest it takes just the -i and -o parameters, which specify the input and output files. For more complete results you can add metadata, a reference genome, or even create a time tree.

Convert a Usher pb to Taxonium jsonl format

usage: usher_to_taxonium [-h] -i INPUT -o OUTPUT [-m METADATA] [-g GENBANK]
                         [-c COLUMNS] [-C]
                         [--chronumental_steps CHRONUMENTAL_STEPS]
                         [--chronumental_date_output CHRONUMENTAL_DATE_OUTPUT]
                         [--chronumental_tree_output CHRONUMENTAL_TREE_OUTPUT]
                         [--chronumental_reference_node CHRONUMENTAL_REFERENCE_NODE]
                         [-j CONFIG_JSON] [-t TITLE]
                         [--overlay_html OVERLAY_HTML] [--remove_after_pipe]
                         [--clade_types CLADE_TYPES] [--name_internal_nodes]
                         [--shear] [--shear_threshold SHEAR_THRESHOLD]
                         [--only_variable_sites] [--key_column KEY_COLUMN]

Named Arguments

-i, --input

File path to input Usher protobuf file (.pb)

-o, --output

File path for output Taxonium jsonl file

-m, --metadata

File path for input metadata file (CSV/TSV)

-g, --genbank

File path for GenBank file containing reference genome (N.B. currently only one chromosome is supported, and no compound features)

-c, --columns

Column names to include in the metadata, separated by commas, e.g. pangolin_lineage,country

-C, --chronumental

Runs Chronumental to build a time tree. The metadata TSV must include a date column.

Default: False

--chronumental_steps

Number of steps to run Chronumental for

--chronumental_date_output

Optional output file for the chronumental date table if you want to keep it (a table mapping nodes to their inferred dates).

--chronumental_tree_output

Optional output file for the chronumental time tree file in nwk format.

--chronumental_reference_node

A reference node to be used for Chronumental. This should be from early in the outbreak and have a well-defined date. If not set, the oldest sample will be automatically picked by Chronumental.

-j, --config_json

A JSON file to use as a config file containing things such as search parameters

-t, --title

A title for the tree. This will be shown at the top of the window as “[Title] - powered by Taxonium”

--overlay_html

A file containing HTML to put in the About box when this tree is loaded. This could contain information about who built the tree and what data you used.

--remove_after_pipe

If set, we will remove anything after a pipe (|) in each node’s name, after joining to metadata

Default: False

--clade_types

Optionally specify clade types provided in the UShER file, comma separated - e.g. ‘nextstrain,pango’. Order must match that used in the UShER pb file. If you haven’t specifically annotated clades in your protobuf, don’t use this

--name_internal_nodes

If set, we will name internal nodes node_xxx

Default: False

--shear

If set, we will ‘shear’ the tree. This will iterate over all nodes. If a particular sub-branch makes up fewer than e.g. 1/1000 of the total descendants, then in most cases it represents a sequencing error. (But it also could represent recombinants, or a real, unfit branch.) We remove these to simplify the interpretation of the tree.

Default: False

--shear_threshold

Threshold for shearing. The default is 1000, meaning branches will be removed if they make up fewer than 1/1000 of nodes. Has no effect unless --shear is set.

Default: 1000

--only_variable_sites

Only store information about the root sequence at a particular position if there is variation at that position somewhere in the tree. This helps to speed up the loading of larger genomes such as MPXV.

Default: False

--key_column

The column in the metadata file whose values match the node names in the tree

Default: “strain”
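To illustrate the idea behind the --shear and --shear_threshold options described above, here is a toy sketch. It is not the code used by usher_to_taxonium: the nested-dict tree representation, the function names, and the exact comparison (each child's tip count against its parent's total tip count) are assumptions made for illustration only.

```python
def count_tips(node):
    """Number of leaf descendants of a node (a leaf counts as one tip)."""
    children = node.get("children", [])
    if not children:
        return 1
    return sum(count_tips(child) for child in children)


def shear(node, threshold=1000):
    """Drop, in place, child branches holding fewer than 1/threshold of this node's tips."""
    children = node.get("children", [])
    if not children:
        return node
    total = count_tips(node)
    # Keep a child only if its share of the parent's tips is at least 1/threshold.
    kept = [child for child in children if count_tips(child) * threshold >= total]
    node["children"] = [shear(child, threshold) for child in kept]
    return node


# A tiny tree: one clade with ten tips and one stray single-tip branch.
clade_a = {"name": "a", "children": [{"name": f"a{i}"} for i in range(5)]}
clade_b = {"name": "b", "children": [{"name": f"b{i}"} for i in range(5)]}
big = {"name": "big", "children": [clade_a, clade_b]}
root = {"name": "root", "children": [big, {"name": "stray"}]}

shear(root, threshold=5)
print([child["name"] for child in root["children"]])
```

With a threshold of 5 the stray branch holds 1 of 11 tips and is removed, while the larger clades are kept. This sketch recounts tips at every level and uses recursion, so it is only suitable for small examples, and it does not handle the corner case where a node has more children than the threshold.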

Using the Chronumental parameters above (-C/--chronumental and the related --chronumental_* options), you can have usher_to_taxonium launch Chronumental and create a time tree, which will be packaged into your output file.
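As a sketch of how such an invocation might be assembled from a script, the snippet below builds an argument list using only flags from the usage message above. The helper name `build_time_tree_command`, the output filename, and the step count of 100 are illustrative assumptions, not recommended values. It also assumes Chronumental is available in the same environment.

```python
import shlex


def build_time_tree_command(input_pb, output_jsonl, metadata, steps=None, reference_node=None):
    """Assemble an usher_to_taxonium argument list with Chronumental enabled."""
    command = [
        "usher_to_taxonium",
        "--input", input_pb,
        "--output", output_jsonl,
        "--metadata", metadata,  # must include a date column for Chronumental
        "--chronumental",
    ]
    if steps is not None:
        command += ["--chronumental_steps", str(steps)]
    if reference_node is not None:
        command += ["--chronumental_reference_node", reference_node]
    return command


# Hypothetical file names based on the example files used earlier.
command = build_time_tree_command(
    "tfci.pb", "tfci-timetree.jsonl.gz", "tfci.meta.tsv.gz", steps=100
)
print(shlex.join(command))
```

The printed command can be pasted into a shell, or the list can be passed to `subprocess.run(command, check=True)` to run the conversion directly.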