By the end of this session, you will be able to:
MetaPhlAn (Metagenomic Phylogenetic Analysis) is a computational tool designed for profiling the composition of microbial communities from metagenomic sequencing data. It uses a unique database of clade-specific marker genes to identify and quantify taxa within a sample, providing a high-resolution taxonomic profile. This makes it particularly valuable for studies exploring the diversity and abundance of microbial communities in various environments, including the human microbiome.
Let’s run MetaPhlAn over our samples. But MetaPhlAn output files are scarcely more user-friendly than Kraken output files.
Here’s what the first few lines of a MetaPhlAn output file look like:
--input_type #/shared/team/conda/mpallen.mmb-dtp/seqanalysis/bin/MetaPhlAn
relative_abundance #clade_name
100.0 Bacteria
99.697 Bacteria Proteobacteria
0.15144 Bacteria Bacteroidetes
0.1238 Bacteria Actinobacteria
0.02776 Bacteria Firmicutes
99.697 Bacteria Proteobacteria Gammaproteobacteria
0.15144 Bacteria Bacteroidetes Bacteroidia
0.1238 Bacteria Actinobacteria Actinobacteria
0.02776 Bacteria Firmicutes Clostridia
99.697 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales
0.15144 Bacteria Bacteroidetes Bacteroidia Bacteroidales
0.08413 Bacteria Actinobacteria Actinobacteria Micrococcales
0.02776 Bacteria Firmicutes Clostridia Clostridiales
0.02485 Bacteria Actinobacteria Actinobacteria Propionibacteriales
0.01482 Bacteria Actinobacteria Actinobacteria Bifidobacteriales
94.39562 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Yersiniaceae
5.29717 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Erwiniaceae
0.15144 Bacteria Bacteroidetes Bacteroidia Bacteroidales Bacteroidaceae
0.08413 Bacteria Actinobacteria Actinobacteria Micrococcales Microbacteriaceae
0.02485 Bacteria Actinobacteria Actinobacteria Propionibacteriales Propionibacteriaceae
0.01482 Bacteria Actinobacteria Actinobacteria Bifidobacteriales Bifidobacteriaceae
0.01392 Bacteria Firmicutes Clostridia Clostridiales Ruminococcaceae
0.01096 Bacteria Firmicutes Clostridia Clostridiales Peptostreptococcaceae
0.00421 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae
0.00288 Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae
94.39562 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Yersiniaceae Serratia
5.29717 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Erwiniaceae Erwinia
0.15144 Bacteria Bacteroidetes Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
0.08413 Bacteria Actinobacteria Actinobacteria Micrococcales Microbacteriaceae Curtobacterium
0.02485 Bacteria Actinobacteria Actinobacteria Propionibacteriales Propionibacteriaceae Cutibacterium
0.01482 Bacteria Actinobacteria Actinobacteria Bifidobacteriales Bifidobacteriaceae Bifidobacterium
0.01392 Bacteria Firmicutes Clostridia Clostridiales Ruminococcaceae Faecalibacterium
0.01096 Bacteria Firmicutes Clostridia Clostridiales Peptostreptococcaceae Paeniclostridium
0.00421 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Enterobacter
0.00213 Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae Fusicatenibacter
0.00075 Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae Lachnospiraceae_unclassified
94.39562 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Yersiniaceae Serratia Serratia_fonticola
It would be great if we could look at them as Krona plots, but Krona won’t take MetaPhlAn output files as input. With a little text manipulation, though, we can make it work, so that we can run MetaPhlAn and create Krona files in one go. With help from ChatGPT, I have written a script that does this.
Create a file called metakron.sh using touch, then paste in the shell script below, grant it execute permissions and run it:
#!/bin/bash
# Run MetaPhlAn on paired-end *_R1.fastq.gz and *_R2.fastq.gz files in the
# current directory and generate a Krona plot for each sample

# Check that ktImportText is available before doing any work
if ! command -v ktImportText &> /dev/null; then
    echo "Error: ktImportText command not found. Please install KronaTools."
    exit 1
fi

# Create the MetaPhlAn output directory if it doesn't exist
mkdir -p MetaPhlAn

# Loop over the R1 files directly rather than parsing the output of ls
for r1 in *_R1.fastq.gz; do
    sample="${r1%_R1.fastq.gz}"
    if [ -f "${sample}_R1.fastq.gz" ] && [ -f "${sample}_R2.fastq.gz" ]; then
        # Generate MetaPhlAn output filenames in the MetaPhlAn directory
        MetaPhlAn_output="MetaPhlAn/${sample}_MetaPhlAn_out.txt"
        bowtie2_output="MetaPhlAn/${sample}_bowtie2_out.bowtie2"

        # Run MetaPhlAn with paired-end files (the executable name is lowercase)
        echo "Running MetaPhlAn on ${sample} fastq.gz files..."
        metaphlan "${sample}_R1.fastq.gz,${sample}_R2.fastq.gz" --input_type fastq \
            -o "$MetaPhlAn_output" --bowtie2out "$bowtie2_output"

        # Check if MetaPhlAn output was generated
        if [ ! -f "$MetaPhlAn_output" ]; then
            echo "Error: MetaPhlAn output not generated for ${sample}."
            continue
        fi

        # Prepare the Krona input file in the MetaPhlAn directory
        krona_input="MetaPhlAn/${sample}_krona_input.txt"
        echo "Processing $MetaPhlAn_output for Krona..."
        awk '{
            # Skip the "#" header lines and any line with fewer than three columns
            if ($0 !~ /^#/ && NF >= 3) {
                # Column 3 (relative abundance) becomes the first output column
                output_col1 = $3;
                # Column 1 (clade name) becomes the second output column
                output_col2 = $1;
                # Remove rank prefixes: a lowercase letter followed by a double underscore
                gsub(/[a-z]__/, "", output_col2);
                # Convert | to tabs so that each rank gets its own column
                gsub(/\|/, "\t", output_col2);
                # Print the transformed line
                print output_col1 "\t" output_col2;
            }
        }' "$MetaPhlAn_output" > "$krona_input"

        # Generate the Krona plot and report whether it worked
        krona_output="MetaPhlAn/${sample}_krona.html"
        echo "Generating Krona plot for $krona_input..."
        if ktImportText "$krona_input" -o "$krona_output"; then
            echo "Krona plot generated: $krona_output"
        else
            echo "Error generating Krona plot for $krona_input."
        fi
    else
        echo "Paired-end files not found for sample: ${sample}"
    fi
done
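The create, permit and run steps described before the script can be sketched as follows. This assumes you are working in the directory that holds the FASTQ files and that you paste the script in with a text editor such as nano (an assumption; any editor will do).

```shell
touch metakron.sh       # create an empty file
# Open metakron.sh in a text editor (for example: nano metakron.sh),
# paste in the script and save it before carrying on.
chmod +x metakron.sh    # grant execute permission
ls -l metakron.sh       # the permissions column should now contain x
./metakron.sh           # run it from the directory containing the FASTQ files
```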
What is the script doing? Spend five minutes trying to work out for yourself, then ask ChatGPT for a detailed explanation.
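One way to test your reading of the awk step is to run it on a tiny hand-made file. The sketch below assumes the usual tab-separated MetaPhlAn layout (clade name, NCBI taxonomy ID, relative abundance). It mirrors the script's transformation while skipping `#` header lines. The data lines reuse two values from the listing above, and the `toy_` file names are hypothetical.

```shell
# Hypothetical toy profile: one header line and two data lines
printf '#clade_name\tNCBI_tax_id\trelative_abundance\n' > toy_profile.txt
printf 'k__Bacteria\t2\t100.0\n' >> toy_profile.txt
printf 'k__Bacteria|p__Proteobacteria\t2|1224\t99.697\n' >> toy_profile.txt

# Abundance first, then the lineage with rank prefixes removed
# and each | replaced by a tab, which is the layout ktImportText reads
awk '!/^#/ && NF >= 3 {
    lineage = $1
    gsub(/[a-z]__/, "", lineage)
    gsub(/\|/, "\t", lineage)
    print $3 "\t" lineage
}' toy_profile.txt > toy_krona_input.txt

cat toy_krona_input.txt
```

Each output line starts with a number (the size of the wedge) followed by one tab-separated column per taxonomic rank.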
Compare and contrast the Krona plots produced for each sample by Kraken and MetaPhlAn. Write a brief text-based report on this.
In metagenomic analysis, removing host DNA is an essential step to ensure that the data focuses on microbial content rather than the host’s genomic material. This is particularly important when studying samples from organisms like humans, where the host DNA can often overwhelm microbial signals. By filtering out host sequences, we improve the accuracy of downstream analyses and ensure that computational resources are used effectively for microbial data processing.
In this command:
mkdir -p patient29_bowtie2_output
bowtie2 -x hg37dec_v0.1 -1 patient29_R1.fastq.gz -2 patient29_R2.fastq.gz \
  --threads 8 --un-conc-gz patient29_bowtie2_output/patient29_nonhuman.fastq.gz \
  -S /dev/null
We use bowtie2 to map paired-end reads to the human reference genome index (hg37dec_v0.1).
The --un-conc-gz flag writes out read pairs that do not align concordantly to the host genome, essentially isolating non-human sequences. (The plain --un-conc flag does the same but writes uncompressed files, whatever extension you give them.)
The output (patient29_nonhuman.fastq.1.gz and patient29_nonhuman.fastq.2.gz, because bowtie2 adds .1 and .2 before the final extension to separate the two mates) contains microbial or other non-host DNA, which can be further analyzed.
Removing host DNA is crucial for accurate metagenomic profiling because:
In our samples, we don’t expect a significant amount of host DNA because they were already screened for this before being uploaded to GenBank, so we are running this step on a single sample, primarily for demonstration purposes. The command maps the reads to the human genome, keeps any reads that fail to align as non-human data, and reports how many reads mapped to the human genome. If you want to try it out on any of the other samples, feel free to do so.
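bowtie2 prints its alignment summary to standard error. A complementary check is to count the reads that ended up in the non-human files. The sketch below does this on a toy gzip-compressed FASTQ file; the file name and the two reads are invented for illustration, so substitute your own output files.

```shell
# Toy stand-in for one of the non-human output files (two invented reads)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > toy_nonhuman.fastq.gz

# A FASTQ record is four lines long, so reads = lines / 4
# (for an uncompressed file, replace "gzip -dc" with "cat")
lines=$(gzip -dc toy_nonhuman.fastq.gz | wc -l)
echo "reads: $((lines / 4))"
```

Comparing this count with the number of reads in the original file tells you what fraction of the sample was kept as non-human.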
NB: Host DNA can be removed using many other approaches such as BBMap, BWA, HISAT2, Minimap2, FastQ Screen, KneadData, DeconSeq, BMTagger, SAMtools, and Kraken2, depending on the data type and resources available.
Sometimes what we see in the output of metagenomic analyses does not reflect what is actually out there in the real world, particularly when we are dealing with samples that are likely to have a low biomass. This is a persistent problem. Just a few weeks ago, New Scientist claimed that the brain is teeming with life, when in fact there is no brain microbiome at all.
Here’s the landmark Salter et al. paper that anyone working on microbiomes should read:
Let’s take a look at a dataset from that paper: a series of ten-fold serial dilutions of Salmonella bongori DNA extracted using the MP BIO kit, accessed via the SRA’s analysis of each run. Use Krona to explore each sample.
What Happens as the Dilution Increases?
As the serial dilution series progresses, the amount of Salmonella bongori DNA in each sample decreases by an order of magnitude. What effect does this have on the results?
Here is a summary of the discussion from the Salter et al. paper:
Widespread Kit Contamination: Contaminating DNA from extraction kits and laboratory reagents poses a significant concern for both 16S rRNA sequencing and shotgun metagenomics projects. This contamination can skew the results, particularly in low-biomass samples where true DNA signals are minimal.
Nature of Contaminants: Commonly reported contaminants are soil- or water-dwelling bacteria, often associated with nitrogen fixation. This may be due to the use of nitrogen in ultrapure water storage tanks. Such taxa can misleadingly appear in low-biomass samples as genuine findings.
Impact on Research: High-profile cases have highlighted how contamination has led to erroneous conclusions, such as false associations in studies of novel viruses and ancient DNA. The variability in microbial content when analyzed by different laboratories further illustrates this issue.
Underappreciated Problem: Despite documented cases of contamination affecting results, many in the microbiota research community do not report measures to quantify initial DNA, use negative controls, or describe contaminant identification methods. This oversight can lead to misinterpretation of data, especially when unexpected taxa, like plant-associated bacteria, are reported as core components in human studies.
Control Measures and Recommendations: Including negative controls and using in silico methods to identify contaminants can help mitigate these issues. Advanced bioinformatics strategies, like comparing sample data with negative controls and using strain-specific markers, can differentiate between genuine taxa and contaminants.
Decontamination Methods: Techniques like gamma or UV radiation, DNase treatment, and chemical intercalators have been tested with varying success. However, these methods can affect reagent performance, making their use inconsistent. The study recommends careful use of negative controls processed identically to samples and sequenced together to detect and exclude contaminant DNA from analysis effectively.
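The point about low-biomass samples in the summary above can be made concrete with a back-of-envelope calculation. The sketch below assumes a fixed amount of contaminant DNA in every sample and a target that is diluted ten-fold at each step. All quantities are invented for illustration and are not taken from the paper.

```shell
# Hypothetical numbers: 10 units of contaminant DNA in every sample,
# and 100000 units of target DNA before dilution
awk 'BEGIN {
    contaminant = 10
    target = 100000
    for (d = 0; d <= 5; d++) {
        share = 100 * contaminant / (contaminant + target)
        printf "dilution 10^-%d\tcontaminant share %.2f%%\n", d, share
        target = target / 10
    }
}' > toy_dilution.tsv

cat toy_dilution.tsv
```

With these invented values the contaminant share climbs from about 0.01% in the undiluted sample to about 91% after five ten-fold dilutions, even though the absolute amount of contaminant never changes.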
Here’s a rogues’ gallery of common contaminants from the kitome:
Phylum | List of Constituent Contaminant Genera |
---|---|
Pseudomonadota | |
Alphaproteobacteria | Afipia, Aquabacterium, Asticcacaulis, Aurantimonas, Beijerinckia, Bosea, Bradyrhizobium, Brevundimonas, Caulobacter, Craurococcus, Devosia, Hoeflea, Mesorhizobium, Methylobacterium, Novosphingobium, Ochrobactrum, Paracoccus, Pedomicrobium, Phyllobacterium, Rhizobium, Roseomonas, Sphingobium, Sphingomonas, Sphingopyxis |
Betaproteobacteria | Acidovorax, Azoarcus, Azospira, Burkholderia, Comamonas, Cupriavidus, Curvibacter, Delftia, Duganella, Herbaspirillum, Janthinobacterium, Kingella, Leptothrix, Limnobacter, Massilia, Methylophilus, Methyloversatilis, Oxalobacter, Pelomonas, Polaromonas, Ralstonia, Schlegelella, Sulfuritalea, Undibacterium, Variovorax |
Gammaproteobacteria | Acinetobacter, Enhydrobacter, Enterobacter, Escherichia, Nevskia, Pseudomonas, Pseudoxanthomonas, Psychrobacter, Stenotrophomonas, Xanthomonas |
Actinomycetota | Aeromicrobium, Arthrobacter, Beutenbergia, Brevibacterium, Corynebacterium, Curtobacterium, Dietzia, Geodermatophilus, Janibacter, Kocuria, Microbacterium, Micrococcus, Microlunatus, Patulibacter, Propionibacterium, Rhodococcus, Tsukamurella |
Bacillota | Abiotrophia, Bacillus, Brevibacillus, Brochothrix, Facklamia, Paenibacillus, Streptococcus |
Bacteroidota | Chryseobacterium, Dyadobacter, Flavobacterium, Hydrotalea, Niastella, Olivibacter, Pedobacter, Wautersiella |
Deinococcota | Deinococcus |
Acidobacteriota | Predominantly unclassified Acidobacteria Gp2 organisms |
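A list like this can be used as a quick in silico screen. The sketch below flags any genus in a results table that also appears in a file of known kitome genera. The file names and the abundance values are hypothetical, and only four genera from the table are included to keep the example short. A hit is a prompt to check your negative controls, not proof of contamination.

```shell
# A few genera from the table above, one per line (extend as needed)
printf '%s\n' Bradyrhizobium Ralstonia Sphingomonas Methylobacterium > kitome_genera.txt

# Hypothetical genus-level results for one sample: genus <TAB> relative abundance
printf 'Salmonella\t62.1\nRalstonia\t21.4\nBradyrhizobium\t9.8\nBacteroides\t6.7\n' > toy_genera.tsv

# -F treats patterns as fixed strings, -w matches whole words only,
# -f reads the patterns from a file
grep -F -w -f kitome_genera.txt toy_genera.tsv > flagged.tsv

cat flagged.tsv
```

In this toy example Ralstonia and Bradyrhizobium are flagged, while Salmonella and Bacteroides pass through.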
And it’s not just kitome that is the problem!
QIB’s very own Falk Hildebrand published this paper a few years ago:
Here are the main points from the paper:
Detection of Bacterial Signals: Amplicon sequencing detected bacterial signals in human and murine samples; however, the estimated bacterial biomass was extremely low in all samples.
Sources of Bacterial Signals: Stringent reanalyses indicated that the detected bacterial signals were due to a combination of exogenous DNA contamination (54.8%) and false positive amplification of host DNA (34.2%), known as off-target amplicons.
Off-Target Amplification as a Confounder: Off-target amplification was identified as a significant issue in scenarios with low bacterial content and high host DNA. This phenomenon occurs when segments of the host genome (e.g., human or mouse DNA) are mistakenly amplified and misclassified as bacterial DNA.
Impact on 16S rRNA Gene Sequencing: The study found that in most amplicon sequencing pipelines, off-target amplicons were clustered and erroneously assigned to bacterial taxa. This problem was especially prevalent in tissue types thought to be sterile, such as brain tissue.
Independent Evidence of Off-Target Amplification: Off-target amplicons were not isolated to this study alone but were also observed in independent brain 16S rRNA gene sequencing data, emphasizing the need for careful interpretation in similar studies.
Conclusions and Recommendations: Researchers must closely scrutinize taxonomic signals from extremely low biomass samples to rule out off-target amplification. This involves matching sequences explicitly against host genomes to ensure that misclassified host DNA does not lead to false conclusions. The study concluded that, with rigorous analysis, there was no evidence of a brain microbiome or bacterial infection in Parkinson’s Disease (PD) brains.
And our tuppenceworth!
See this letter that New Scientist published from us: https://www.newscientist.com/letter/mg26435131-700-we-say-there-is-no-brain-microbiome/
in response to this rubbish: https://www.newscientist.com/article/mg26335104-500-the-brain-has-its-own-microbiome-heres-what-it-means-for-your-health/
See you tomorrow.
Stretch targets if you have time and energy: