Using TaxoniumTools
Installing taxoniumtools
Taxoniumtools is available from PyPI. You can install it with pip.
pip install taxoniumtools
The usher_to_taxonium and newick_to_taxonium utilities will then be available for use.
Using usher_to_taxonium
Example
First get some files:
wget https://github.com/theosanderson/taxonium/raw/master/taxoniumtools/test_data/tfci.meta.tsv.gz
wget https://raw.githubusercontent.com/theosanderson/taxonium/master/taxoniumtools/test_data/hu1.gb
wget https://github.com/theosanderson/taxonium/raw/master/taxoniumtools/test_data/tfci.pb
Then convert from UShER pb format to Taxonium jsonl format:
usher_to_taxonium --input tfci.pb --output tfci-taxonium.jsonl.gz --metadata tfci.meta.tsv.gz --genbank hu1.gb \
--columns genbank_accession,country,date,pangolin_lineage
You can then open that tfci-taxonium.jsonl.gz file at taxonium.org.
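Before converting, it can help to confirm that every name passed to --columns appears in the metadata header. The sketch below does this on a small stand-in file with made-up rows (demo.meta.tsv.gz is a hypothetical name, not one of the downloaded files); pointing it at tfci.meta.tsv.gz should work the same way.

```shell
# Stand-in metadata file with made-up rows; swap in tfci.meta.tsv.gz to check the real one.
printf 'strain\tcountry\tdate\nsample_1\tUK\t2021-01-05\n' | gzip > demo.meta.tsv.gz

# One column name per line, taken from the header row.
available=$(gzip -dc demo.meta.tsv.gz | head -n 1 | tr '\t' '\n')

# Collect any requested column that the header lacks.
missing=""
for col in country date pangolin_lineage; do
  printf '%s\n' "$available" | grep -qxF "$col" || missing="$missing $col"
done
echo "missing columns:${missing:- none}"
```

In this stand-in, pangolin_lineage is deliberately absent from the header, so it is the one column reported as missing.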
Note
For SARS-CoV-2 we recommend using the exact modified .gb file we use in the example, which splits ORF1ab into ORF1a and ORF1b.
Note
Some people ask what the “L” in JSONL stands for. JSONL means “JSON Lines”: each line of the file is a separate JSON object. In the Taxonium JSONL format, the first line contains metadata about the tree as a whole, and each subsequent line contains information about a single node. It is important to use the “jsonl” extension rather than “json”, as otherwise the interface may try to parse your tree as a Nextstrain JSON file.
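To make the “JSON Lines” layout concrete, the sketch below builds a tiny stand-in file and separates the first, tree-wide line from the per-node lines. The keys in it are invented for illustration and are not the real Taxonium schema; the same gzip and head pattern can be used to peek at a real tfci-taxonium.jsonl.gz.

```shell
# Build a tiny stand-in JSONL file; the keys here are hypothetical,
# not the actual Taxonium schema.
printf '%s\n' \
  '{"title": "demo tree"}' \
  '{"name": "node_1"}' \
  '{"name": "node_2"}' | gzip > demo.jsonl.gz

# Line 1 holds the tree-wide metadata.
header=$(gzip -dc demo.jsonl.gz | head -n 1)

# Every later line describes a single node.
node_count=$(gzip -dc demo.jsonl.gz | tail -n +2 | wc -l | tr -d ' ')

echo "header: $header"
echo "node lines: $node_count"
```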
usher_to_taxonium
This tool will convert an UShER protobuf file into a Taxonium file. At its simplest it just takes the -i and -o parameters, describing the input and output files. But for the most complete results you can add metadata, a reference genome, or even create a time tree.
Convert a Usher pb to Taxonium jsonl format
usage: usher_to_taxonium [-h] -i INPUT -o OUTPUT [-m METADATA] [-g GENBANK]
[-c COLUMNS] [-C]
[--chronumental_steps CHRONUMENTAL_STEPS]
[--chronumental_date_output CHRONUMENTAL_DATE_OUTPUT]
[--chronumental_tree_output CHRONUMENTAL_TREE_OUTPUT]
[--chronumental_reference_node CHRONUMENTAL_REFERENCE_NODE]
[--chronumental_add_inferred_date CHRONUMENTAL_ADD_INFERRED_DATE]
[-j CONFIG_JSON] [-t TITLE]
[--overlay_html OVERLAY_HTML] [--remove_after_pipe]
[--clade_types CLADE_TYPES] [--name_internal_nodes]
[--shear] [--shear_threshold SHEAR_THRESHOLD]
[--only_variable_sites] [--key_column KEY_COLUMN]
Named Arguments
- -i, --input
File path to input Usher protobuf file (.pb / .pb.gz)
- -o, --output
File path for output Taxonium jsonl file (.jsonl / .jsonl.gz)
- -m, --metadata
File path for input metadata file (CSV/TSV)
- -g, --genbank
File path for GenBank file containing reference genome (N.B. currently only one chromosome is supported)
- -c, --columns
Column names to include in the metadata, separated by commas, e.g. pangolin_lineage,country
- -C, --chronumental
Runs Chronumental to build a time tree. The metadata TSV must include a date column.
Default: False
- --chronumental_steps
Number of steps to run Chronumental for
- --chronumental_date_output
Optional output file for the chronumental date table if you want to keep it (a table mapping nodes to their inferred dates).
- --chronumental_tree_output
Optional output file for the chronumental time tree file in nwk format.
- --chronumental_reference_node
A reference node to be used for Chronumental. This should be from early in the outbreak and have a well-defined date. If not set, the oldest sample will be automatically picked by Chronumental.
- --chronumental_add_inferred_date
A new metadata-column-like name to be added for display with the value of Chronumental’s inferred date for each sample.
- -j, --config_json
A JSON file to use as a config file containing things such as search parameters
- -t, --title
A title for the tree. This will be shown at the top of the window as “[Title] - powered by Taxonium”
- --overlay_html
A file containing HTML to put in the About box when this tree is loaded. This could contain information about who built the tree and what data you used.
- --remove_after_pipe
If set, we will remove anything after a pipe (|) in each node’s name, after joining to metadata
Default: False
- --clade_types
Optionally specify clade types provided in the UShER file, comma separated - e.g. ‘nextstrain,pango’. Order must match that used in the UShER pb file. If you haven’t specifically annotated clades in your protobuf, don’t use this
- --name_internal_nodes
If set, we will name internal nodes node_xxx
Default: False
- --shear
If set, we will ‘shear’ the tree. This will iterate over all nodes. If a particular sub-branch makes up fewer than e.g. 1/1000 of the total descendants, then in most cases it represents a sequencing error. (But it also could represent recombinants, or a real, unfit branch.) We remove these to simplify the interpretation of the tree.
Default: False
- --shear_threshold
Threshold for shearing. The default is 1000, meaning branches will be removed if they make up fewer than 1/1000 of nodes. Has no effect unless --shear is set.
Default: 1000
- --only_variable_sites
Only store information about the root sequence at a particular position if there is variation at that position somewhere in the tree. This helps to speed up the loading of larger genomes such as MPXV.
Default: False
- --key_column
The column in the metadata file whose values match the names in the tree
Default: “strain”
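Two of the options above are easier to picture with concrete values. The sketch below is not the tool's code: it assumes that --remove_after_pipe cuts at the first pipe, and that the shear test compares a sub-branch against the descendants of the node being examined. All names and numbers are hypothetical.

```shell
# --remove_after_pipe: keep only the part of a node name before the first "|".
name='England/ABC-123/2021|OK000001.1|2021-06-01'   # hypothetical node name
short_name=${name%%|*}
echo "$short_name"

# --shear with --shear_threshold 1000: is a sub-branch below 1/1000 of the total?
shear_threshold=1000      # the documented default
total_descendants=50000   # hypothetical: descendants under the node being examined
branch_descendants=12     # hypothetical: descendants in one of its sub-branches

# "fewer than 1/threshold of the total", written with integer arithmetic
if [ $((branch_descendants * shear_threshold)) -lt "$total_descendants" ]; then
  verdict=removed
else
  verdict=kept
fi
echo "sub-branch would be $verdict"
```

With these numbers the sub-branch holds 12 of 50000 descendants, which is below 1/1000 (50), so it falls on the "removed" side of the rule as described.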
Using the parameters above you can trigger usher_to_taxonium to launch Chronumental and create a time tree which will be packaged into your tree.
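As a sketch of such a time-tree run, the command below reuses the files from the earlier example and adds the Chronumental flags documented above. The output file name, the step count of 100, and the inferred_date column name are illustrative choices rather than recommended values, and the snippet assumes Chronumental is available in your environment.

```shell
# Arguments for a time-tree build. File names come from the earlier example;
# the step count and the "inferred_date" column name are illustrative choices.
set -- \
  --input tfci.pb \
  --output tfci-timetree.jsonl.gz \
  --metadata tfci.meta.tsv.gz \
  --genbank hu1.gb \
  --columns genbank_accession,country,date,pangolin_lineage \
  --chronumental \
  --chronumental_steps 100 \
  --chronumental_add_inferred_date inferred_date

# Run only when the tool and input are present; otherwise just print the command.
if command -v usher_to_taxonium >/dev/null 2>&1 && [ -f tfci.pb ]; then
  usher_to_taxonium "$@"
else
  echo "usher_to_taxonium $*"
fi
```

The guard simply keeps the snippet harmless on a machine where the tool or the input file is missing. The metadata file includes a date column, which the --chronumental option requires.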
newick_to_taxonium
If you don’t need genotype data encoded in the final tree (e.g. some taxonomies, or non-genome-based trees) you can skip UShER and use newick_to_taxonium.
Convert a Newick file to Taxonium jsonl format
usage: newick_to_taxonium [-h] -i INPUT -o OUTPUT [-m METADATA] [-c COLUMNS]
[-j CONFIG_JSON] [-t TITLE]
[--overlay_html OVERLAY_HTML]
[--key_column KEY_COLUMN]
Named Arguments
- -i, --input
File path to input Newick file
- -o, --output
File path for output Taxonium jsonl file
- -m, --metadata
File path for input metadata file (CSV/TSV)
- -c, --columns
Column names to include in the metadata, separated by commas, e.g. pangolin_lineage,country
- -j, --config_json
A JSON file to use as a config file containing things such as search parameters
- -t, --title
A title for the tree. This will be shown at the top of the window as “[Title] - powered by Taxonium”
- --overlay_html
A file containing HTML to put in the About box when this tree is loaded. This could contain information about who built the tree and what data you used.
- --key_column
The column in the metadata file whose values match the names in the tree
Default: “strain”
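A minimal end-to-end sketch for newick_to_taxonium follows. It writes a three-leaf Newick tree and a matching metadata table (all made-up data with hypothetical file names), then passes them using only the flags listed above. The strain column holds the leaf names, matching the default --key_column.

```shell
# A three-leaf Newick tree and a matching metadata table (made-up data).
printf '%s\n' '((A:1,B:1):1,C:2);' > demo.nwk
printf 'strain\tcountry\n' > demo.meta.tsv
printf 'A\tUK\nB\tFrance\nC\tKenya\n' >> demo.meta.tsv

# --key_column strain is the default; it is spelled out here for clarity.
set -- \
  --input demo.nwk \
  --output demo.jsonl.gz \
  --metadata demo.meta.tsv \
  --columns country \
  --key_column strain

# Run only when the tool is installed; otherwise just print the command.
if command -v newick_to_taxonium >/dev/null 2>&1; then
  newick_to_taxonium "$@"
else
  echo "newick_to_taxonium $*"
fi
```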