By the end of this tutorial, students will:
The Genome Taxonomy Database (GTDB) is a comprehensive resource that provides a standardized, phylogenetically consistent taxonomy for bacterial and archaeal genomes. It unifies and updates the classification of microbial genomes, addressing the inconsistencies present in traditional taxonomies. GTDB helps researchers better understand the evolutionary relationships among prokaryotes and refine classifications using genome-based data.
GTDB-Tk (Genome Taxonomy Database Toolkit) is a software tool designed to classify genomes against the GTDB. It aligns genome data, identifies marker genes, and places genomes within a reference tree, offering reliable taxonomic classifications.
It relies on sets of marker genes (bac120 for bacteria, ar53 for archaea) that are essential for taxonomic placement.
Credit: Phil Hugenholtz and the GTDB team have significantly impacted microbial taxonomy, providing researchers worldwide with valuable resources and tools.
For the conceptual background on GTDB, watch this video.
Installation Command:
mamba create -n gtdbtk-2.4.0 -c conda-forge -c bioconda gtdbtk=2.4.0
Activate the Environment:
conda activate gtdbtk-2.4.0
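After activating the environment, it may be worth confirming that the gtdbtk executable is actually on your PATH before going further. The following is a minimal sketch; require_tool is a hypothetical helper name introduced here for illustration and is not part of GTDB-Tk or conda.

```shell
# Sketch: report whether a command is available on PATH.
# require_tool is a hypothetical helper, not part of GTDB-Tk.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $(command -v "$1")"
    else
        echo "not found on PATH: $1" >&2
        return 1
    fi
}

# Usage (after `conda activate gtdbtk-2.4.0`):
#   require_tool gtdbtk
```

If the tool is not found, the environment is probably not active in the current shell.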
GTDB-Tk requires external reference data (~110 GB). For this course, the data has already been downloaded and stored here:
export GTDBTK_DATA_PATH=/home/jovyan/shared-team/gtdbtk_data/release220
Ensure that GTDBTK_DATA_PATH is correctly set to avoid data path errors.
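One way to catch a missing or mistyped data path early is to test the variable before running any GTDB-Tk step. The sketch below only checks that the variable is set and points to an existing directory; check_gtdbtk_data_path is a hypothetical helper name, and the check says nothing about whether the reference data inside the directory is complete.

```shell
# Sketch: verify that GTDBTK_DATA_PATH is set and points to a directory.
# check_gtdbtk_data_path is a hypothetical helper, not part of GTDB-Tk.
check_gtdbtk_data_path() {
    if [ -z "${GTDBTK_DATA_PATH:-}" ]; then
        echo "GTDBTK_DATA_PATH is not set" >&2
        return 1
    fi
    if [ ! -d "$GTDBTK_DATA_PATH" ]; then
        echo "GTDBTK_DATA_PATH is not a directory: $GTDBTK_DATA_PATH" >&2
        return 1
    fi
    echo "GTDBTK_DATA_PATH OK: $GTDBTK_DATA_PATH"
}

# Usage: run the quick check first, then GTDB-Tk's own installation check.
#   check_gtdbtk_data_path && gtdbtk check_install
```

The gtdbtk check_install command performs a more thorough verification of the reference data, so the helper above is only a fast first test.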
Note: GTDB-Tk is rather brittle, so do not be surprised if you encounter problems. In such cases, use Google or ChatGPT to find solutions or workarounds.
Run the identify command:
gtdbtk identify --genome_dir ./. --out_dir identify --extension fasta --cpus 16
Expected Output: You should see progress logs indicating the identification of marker genes, which will look like:
[2024-11-13 08:49:44] INFO: GTDB-Tk v2.1.1
[2024-11-13 08:49:44] INFO: Identifying markers in 3 genomes with 16 threads.
[2024-11-13 08:49:45] TASK: Running Prodigal V2.6.3 to identify genes.
Results:
The identified marker genes and intermediate files will be stored in the identify directory:
ls /tmp/gtdbtk/identify/identify/intermediate_results/marker_genes/maxbin_bins.001/
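If the number of genomes reported in the identify log is not what you expect, a common cause is a mismatch between the --extension value and the actual file names. The sketch below counts the files in a directory that end in a given extension, so the result can be compared with the log line. count_genomes is a hypothetical helper name, and the sketch assumes that only files directly inside the genome directory are considered.

```shell
# Sketch: count files in a directory whose names end in ".<extension>".
# count_genomes is a hypothetical helper, not part of GTDB-Tk.
# Assumption: only files directly inside the directory matter (no recursion).
count_genomes() {
    genome_dir=$1
    genome_ext=$2
    find "$genome_dir" -maxdepth 1 -type f -name "*.${genome_ext}" | wc -l | tr -d ' '
}

# Usage, matching the identify command in this tutorial:
#   count_genomes . fasta
```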
Run the align command:
gtdbtk align --identify_dir identify --out_dir align --cpus 16
Expected Output: You will see logs related to the alignment of identified markers and concatenation of sequences:
[2024-11-13 09:27:37] INFO: Aligning markers in 3 genomes with 16 CPUs.
[2024-11-13 09:31:20] INFO: Masked bacterial alignment from 41,084 to 5,035 AAs.
Results:
Alignment results, including gtdbtk.bac120.msa.fasta.gz, will be saved in the align directory.
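A quick sanity check on the alignment is to count its sequences and measure the length of the first one, which can then be compared with the masked length reported in the log. The sketch below works on any gzipped FASTA file. The helper names are hypothetical, the path to the file is passed in by the caller, and the sequence count may include reference genomes as well as your own.

```shell
# Sketch: basic statistics for a gzipped FASTA alignment.
# Both helper names are hypothetical, not part of GTDB-Tk.

# Count the FASTA records (header lines starting with ">").
count_msa_sequences() {
    gzip -dc "$1" | grep -c '^>'
}

# Print the sequence length of the first record, summed over wrapped lines.
msa_first_length() {
    gzip -dc "$1" | awk '
        /^>/   { n++; next }            # count headers, skip them
        n == 1 { len += length($0) }    # only lines of the first record
        END    { print len + 0 }
    '
}

# Usage (adjust the path to where the file sits in your align directory):
#   count_msa_sequences align/gtdbtk.bac120.msa.fasta.gz
#   msa_first_length align/gtdbtk.bac120.msa.fasta.gz
```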
Run the classify command:
gtdbtk classify --genome_dir ./. --align_dir align --out_dir classify -x fasta --cpus 16 --skip_ani_screen
Expected Output: Logs showing the placement of genomes into the reference tree and classification output:
[2024-11-13 09:40:00] INFO: Classifying genomes based on the aligned data.
Results:
The main output files are gtdbtk.bac120.summary.tsv and gtdbtk.ar53.summary.tsv, detailing genome classifications.
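The summary files are wide tab-separated tables, so it can help to pull out only the genome name and its assigned taxonomy. The sketch below looks the columns up by header name rather than by position. It assumes the header contains columns named user_genome and classification. print_classifications is a hypothetical helper name.

```shell
# Sketch: print genome name and classification from a summary TSV.
# print_classifications is a hypothetical helper, not part of GTDB-Tk.
# Assumption: the header row has "user_genome" and "classification" columns.
print_classifications() {
    awk -F '\t' '
        NR == 1 {
            # Map each header name to its column index.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("user_genome" in col) || !("classification" in col)) {
                print "expected columns not found" > "/dev/stderr"
                exit 1
            }
            next
        }
        { print $(col["user_genome"]) "\t" $(col["classification"]) }
    ' "$1"
}

# Usage (adjust the path to your classify output directory):
#   print_classifications classify/gtdbtk.bac120.summary.tsv
```

Looking columns up by name keeps the helper working even if the column order differs between GTDB-Tk versions.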
Discussion Points: