By the end of this session, you will be able to:
kraken2
tool to classify metagenomic sequences and generate detailed taxonomic reports.multiqc
to obtain an integrated report.kraken2
, krona
, and related tools, and apply solutions to resolve them.Let’s start by cleaning up and configuring Conda
conda clean --all
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
To save time we are going to use the seqanalysis
environment we created earlier in the course.
conda activate seqanalysis
conda install multiqc metaphlan bowtie2 samtools
Let’s create and use a directory in the shared-teams directory for these analyses (where yourname
is replaced with your name!):
cd /home/jovyan/shared-team/2024_training/2024_metagenomics
mkdir metagenomics_2024_yourname
cd metagenomics_2024_yourname
To provide the conceptual background to metagenomics, students should watch this online TouTube Powerpoint presentation.
We will use human faecal metagenome samples from a paper from my research group on critically ill patients on an intensive care unit (ICU) in Birmingham.
Here’s a table, integrating clinical information and sequence data on the patient samples.
Patient | Age | Sex | Clinical Features | ICU Day | Sample | Run | AvgSpotLen | Bases | Bytes | SOFA score | Antibiotics |
---|---|---|---|---|---|---|---|---|---|---|---|
Patient 02 | 64 | F | Subarachnoid haemorrhage | 1 | 13-6929524 | SRR8926116 | 233 | 270,551,026 | 136,892,477 | 8 | |
Patient 02 | 64 | F | Subarachnoid haemorrhage | 10 | 13-6929534 | SRR8926119 | 229 | 286,911,856 | 144,866,803 | 3 | |
Patient 04 | 75 | M | Aortic aneurysm repair | 10 | 13-6929537 | SRR8926181 | 225 | 184,808,950 | 93,747,180 | 13 | Flucoxacillin; Rofampicin; Erythromycin |
Patient 04 | 75 | M | Aortic aneurysm repair | 14 | 13-6929548 | SRR8926187 | 213 | 114,050,219 | 57,808,179 | 11 | Flucoxacillin; Rofampicin; Meropenem |
Patient 29 | 80 | M | Subcapsular haematoma; liver cancer | 15 | 13-6929598 | SRR8926127 | 249 | 361,066,223 | 159,940,956 | 1 | Meropenem |
Patient 35 | 49 | M | Lung transplant | 3 | 13-6929625 | SRR8926193 | 276 | 135,189,099 | 64,243,237 | 11 | |
Patient 36 | 30 | M | Multiple trauma | 11 | 13-6929614 | SRR8926233 | 254 | 153,261,668 | 70,498,579 | 3 | Co-Trimoxazole; Valganciclovir; Tazobactam |
The SOFA (Sequential Organ Failure Assessment) score is used to assess the extent of a patient’s organ function or rate of failure. It helps predict outcomes and indicates the severity of illness, especially in intensive care units (ICU). The higher the score, the greater the likelihood of organ dysfunction and mortality risk. Here’s a quick interpretation:
Based on scores for these patients:
Overall, we have a range of scores from low (1) to very concerning (13), indicating variability in patient conditions from relatively stable to high risk.
If you are interested in a dataset from a paper, it will be usually deposited to the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI). The data are organised into BioProjects, BioSamples, SRA experiments and Runs. Take a look at the BioProjects and other components for the ICU study: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA533528
Explore the site’s hierarchy of BioProjects, BioSamples, Sequence Read Experiments (SRX) and Sequence Read Runs (SRR). Review the NCBI metadata associated with each of the samples and runs we are going to be working on, using the links in the table above. Get Google Maps to show you where the samples were collected. Do you trust what it is telling you?
You can select a run and take a look at NCBI’s phylogenetic analyses, including a Krona plot (click on Show Krona View to see it).
Equipped with accession numbers you could use command line tools likefastq-dump
from the SRA Toolkit to retrieve FASTQ files.
But to save time I have downloaded the files, renamed them with the metadata and made them available in
shared-team/2024_training/2024_metagenomics/sequences/
I have also added some indexed files from the human genome. Copy that directory and files across into your current directory.
cp -r /home/jovyan/shared-team/2024_training/2024_metagenomics/sequences .
We would normally be running fastqc and trimmomatic over such sequences, but we are going to miss out those steps in the interest of time, particulaly as they probably were performed before the sequences were uploaded.
What we want to do now is run kraken2
(with the pluspf_8gb
database) and krona
and then multiqc
over all the trimmed sequence files
But kraken2
won’t accept wild cards, so you cannot just issue a single command like this:
kraken *.fastq | krona *report.txt | multiqc
You have several options:
#!/bin/bash
# Create output directories if they don't exist
mkdir -p kraken_on_reads
mkdir -p krona_on_reads
mkdir -p multiqc_kraken_on_reads
# Loop through each R1 file in the directory
for r1_file in *_R1.fastq.gz; do
# Extract the base name by removing the suffix
base=$(basename "$r1_file" _R1.fastq.gz)
# Define the corresponding R2 file
r2_file="${base}_R2.fastq.gz"
# Check if the corresponding R2 file exists
if [ -f "$r2_file" ]; then
echo "Processing $base with Kraken2..."
# Run Kraken2 and output to the kraken_on_reads directory
kraken2 --threads 8 \
--db /home/jovyan/shared-public/db/kraken2/pluspf_8gb/latest \
--output kraken_on_reads/${base}_reads_hits.txt \
--report kraken_on_reads/${base}_reads_report.txt \
--use-names \
"$r1_file" "$r2_file"
echo "Kraken2 completed for $base."
# Run Krona to generate a taxonomy report and output to the krona_on_reads directory
echo "Generating Krona report for $base..."
ktImportTaxonomy -t 5 -m 3 kraken_on_reads/${base}_reads_report.txt -o krona_on_reads/${base}_krona_report.html
echo "Krona report generated for $base."
else
echo "Warning: No matching R2 file for $r1_file. Skipping."
fi
done
# Run MultiQC over the kraken_on_reads directory and save the result in the multiqc_kraken_on_reads directory
echo "Running MultiQC on Kraken2 reports..."
multiqc kraken_on_reads -o multiqc_kraken_on_reads
echo "MultiQC report generated in multiqc_kraken_on_reads."
What is the script doing? Spend five minutes trying to work out for yourself, then ask ChatGPT for a detailed explanation. Why was the pluspf_8gb database chosen for kraken2 and when might other databases be appropriate?
Look at the individual Krona plots you have created and look at the MultiQC Kraken report. Remember to press the Trust HTML
button each time.