MetaPhlAn and the unreal microbiome

Objectives

By the end of this session, you will be able to:

Understand the Role of MetaPhlAn in Metagenomics:
- Explain the key features of MetaPhlAn and its importance in profiling microbial communities.
- Compare MetaPhlAn with other profiling tools like Kraken2 and Bracken.
Run and Analyze MetaPhlAn Output:
- Execute MetaPhlAn on paired-end sequencing data to generate taxonomic profiles.
- Interpret the output format of MetaPhlAn and understand the information it provides.
Transform and Visualize MetaPhlAn Data:
- Use custom scripts to manipulate MetaPhlAn output for compatibility with visualization tools such as Krona.
- Create and analyze interactive Krona plots to better understand the taxonomic composition of samples.
Assess the Significance of Removing Host DNA:
- Discuss the reasons for host DNA removal in metagenomic studies and perform a demonstration using Bowtie2.
- Identify the potential implications of host contamination on microbial analysis results.
Critically Evaluate Metagenomic Data:
- Understand challenges such as low biomass samples and the impact of contamination (e.g., reagent contaminants).
- Recognize common pitfalls in metagenomic profiling and the importance of controls in sequencing experiments.
Compare Results from Multiple Tools:
- Analyze and compare the taxonomic profiles generated by MetaPhlAn and Kraken2.
- Evaluate discrepancies between different profiling tools and assess their impact on research conclusions.
Apply Knowledge to Case Studies:
- Review and interpret results from dilution series experiments (e.g., Salmonella bongori case) to understand the effect of sample dilution on detection.
- Identify common contaminant genera in sequencing data and understand the implications of these findings on study results.
Discuss Best Practices in Metagenomics:
- Summarize strategies to minimize contamination and avoid false positives in metagenomic research.
- Share insights and recommendations for rigorous analysis, including the use of negative controls and proper data interpretation.

MetaPhlAn

MetaPhlAn (Metagenomic Phylogenetic Analysis) is a computational tool designed for profiling the composition of microbial communities from metagenomic sequencing data. It uses a unique database of clade-specific marker genes to identify and quantify taxa within a sample, providing a high-resolution taxonomic profile. This makes it particularly valuable for studies exploring the diversity and abundance of microbial communities in various environments, including the human microbiome.

Key Features

High Specificity and Sensitivity: MetaPhlAn focuses on using unique clade-specific markers to ensure accurate taxonomic classification, minimizing false positives common with broader marker sets.
Broad Taxonomic Coverage: The tool can identify microorganisms across bacteria, archaea, viruses, and eukaryotes, making it versatile for metagenomic analysis.
Output Format: MetaPhlAn outputs a tab-delimited file showing relative abundances of identified taxa, which can be used for downstream analysis or visualization.
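
To make the output format concrete, here is a minimal sketch that pulls the species-level rows out of a profile. It assumes the column layout of recent MetaPhlAn releases (clade_name, NCBI_tax_id, relative_abundance) and pipe-delimited lineages with rank prefixes such as k__ and s__. The file name toy_profile.txt, the shortened lineages, the NA tax IDs and the abundance values are all made up for illustration.

```shell
#!/bin/bash
# Build a toy profile. Lineages are shortened and all values are placeholders.
profile="toy_profile.txt"
{
    printf '#clade_name\tNCBI_tax_id\trelative_abundance\n'
    printf 'k__Bacteria\tNA\t100.0\n'
    printf 'k__Bacteria|p__Proteobacteria\tNA\t60.0\n'
    printf 'k__Bacteria|p__Proteobacteria|s__Serratia_fonticola\tNA\t60.0\n'
    printf 'k__Bacteria|p__Bacteroidetes\tNA\t40.0\n'
    printf 'k__Bacteria|p__Bacteroidetes|s__Bacteroides_sp\tNA\t40.0\n'
} > "$profile"

# Keep rows whose last rank is a species and print "species<TAB>abundance".
species_table=$(awk -F'\t' '
    /^#/ { next }                        # skip header and comment lines
    $1 ~ /[|]s__[^|]+$/ {                # lineage ends at the species rank
        n = split($1, ranks, "[|]")      # lineage is pipe-delimited
        sub(/^s__/, "", ranks[n])        # drop the rank prefix
        print ranks[n] "\t" $3
    }
' "$profile")
echo "$species_table"
```

On a real profile, point the awk filter at your own output file instead of toy_profile.txt.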

Installation and Documentation

Official GitHub Repository: MetaPhlAn GitHub
Installation Guide: Detailed instructions for installing MetaPhlAn can be found on its installation page.
Documentation: Comprehensive user documentation is available at the MetaPhlAn wiki.

--input_type	#/shared/team/conda/mpallen.mmb-dtp/seqanalysis/bin/MetaPhlAn
relative_abundance	#clade_name
1.0	Bacteria
0.697	Bacteria	Proteobacteria
0.15144	Bacteria	Bacteroidetes
0.1238	Bacteria	Actinobacteria
0.02776	Bacteria	Firmicutes
0.697	Bacteria	Proteobacteria	Gammaproteobacteria
0.15144	Bacteria	Bacteroidetes	Bacteroidia
0.1238	Bacteria	Actinobacteria	Actinobacteria
0.02776	Bacteria	Firmicutes	Clostridia
0.697	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales
0.15144	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales
0.08413	Bacteria	Actinobacteria	Actinobacteria	Micrococcales
0.02776	Bacteria	Firmicutes	Clostridia	Clostridiales
0.02485	Bacteria	Actinobacteria	Actinobacteria	Propionibacteriales
0.01482	Bacteria	Actinobacteria	Actinobacteria	Bifidobacteriales
0.39562	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Yersiniaceae
0.29717	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Erwiniaceae
0.15144	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales	Bacteroidaceae
0.08413	Bacteria	Actinobacteria	Actinobacteria	Micrococcales	Microbacteriaceae
0.02485	Bacteria	Actinobacteria	Actinobacteria	Propionibacteriales	Propionibacteriaceae
0.01482	Bacteria	Actinobacteria	Actinobacteria	Bifidobacteriales	Bifidobacteriaceae
0.01392	Bacteria	Firmicutes	Clostridia	Clostridiales	Ruminococcaceae
0.01096	Bacteria	Firmicutes	Clostridia	Clostridiales	Peptostreptococcaceae
0.00421	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Enterobacteriaceae
0.00288	Bacteria	Firmicutes	Clostridia	Clostridiales	Lachnospiraceae
0.39562	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Yersiniaceae	Serratia
0.29717	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Erwiniaceae	Erwinia
0.15144	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales	Bacteroidaceae	Bacteroides
0.08413	Bacteria	Actinobacteria	Actinobacteria	Micrococcales	Microbacteriaceae	Curtobacterium
0.02485	Bacteria	Actinobacteria	Actinobacteria	Propionibacteriales	Propionibacteriaceae	Cutibacterium
0.01482	Bacteria	Actinobacteria	Actinobacteria	Bifidobacteriales	Bifidobacteriaceae	Bifidobacterium
0.01392	Bacteria	Firmicutes	Clostridia	Clostridiales	Ruminococcaceae	Faecalibacterium
0.01096	Bacteria	Firmicutes	Clostridia	Clostridiales	Peptostreptococcaceae	Paeniclostridium
0.00421	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Enterobacteriaceae	Enterobacter
0.00213	Bacteria	Firmicutes	Clostridia	Clostridiales	Lachnospiraceae	Fusicatenibacter
0.00075	Bacteria	Firmicutes	Clostridia	Clostridiales	Lachnospiraceae	Lachnospiraceae_unclassified
0.39562	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Yersiniaceae	Serratia	Serratia_fonticola

It would be useful to view these profiles as Krona plots, but Krona won’t take MetaPhlAn output files as input. With a little text manipulation we can make it work, so that a single script runs MetaPhlAn and creates the Krona files. With help from ChatGPT, I have written a script that does this.

Create a file called metakron.sh using touch, paste in the shell script below, grant it execute permissions, and then run it:

#!/bin/bash

# Create the MetaPhlAn output directory if it doesn't exist
mkdir -p MetaPhlAn

# Run MetaPhlAn on paired-end *_R1.fastq.gz and *_R2.fastq.gz files in the current directory and generate Krona plots
for r1 in *_R1.fastq.gz; do
    # Derive the sample name by stripping the read-1 suffix
    sample="${r1%_R1.fastq.gz}"
    if [ -f "${sample}_R1.fastq.gz" ] && [ -f "${sample}_R2.fastq.gz" ]; then
        # Generate MetaPhlAn output filenames in the MetaPhlAn directory
        MetaPhlAn_output="MetaPhlAn/${sample}_MetaPhlAn_out.txt"
        bowtie2_output="MetaPhlAn/${sample}_bowtie2_out.bowtie2"

        # Run MetaPhlAn with paired-end files
        echo "Running MetaPhlAn on ${sample} fastq.gz files..."
        metaphlan "${sample}_R1.fastq.gz,${sample}_R2.fastq.gz" --input_type fastq -o "$MetaPhlAn_output" --bowtie2out "$bowtie2_output"

        # Check if MetaPhlAn output was generated
        if [ ! -f "$MetaPhlAn_output" ]; then
            echo "Error: MetaPhlAn output not generated for ${sample}."
            continue
        fi

        # Prepare the Krona input file in the MetaPhlAn directory
        krona_input="MetaPhlAn/${sample}_krona_input.txt"

        echo "Processing $MetaPhlAn_output for Krona..."
        awk '
            # Skip the "#" header lines; keep rows with at least three columns
            !/^#/ && NF >= 3 {
                # MetaPhlAn column 1 is the lineage, column 3 the relative abundance
                lineage = $1;
                abundance = $3;

                # Strip rank prefixes such as k__, p__ and s__ from the lineage
                gsub(/[a-z]__/, "", lineage);

                # Krona expects tab-separated ranks, so convert | to tabs
                gsub(/\|/, "\t", lineage);

                # Krona text format: magnitude first, then the lineage
                print abundance "\t" lineage;
            }
        ' "$MetaPhlAn_output" > "$krona_input"

        # Check if ktImportText is available
        if ! command -v ktImportText &> /dev/null; then
            echo "Error: ktImportText command not found. Please install KronaTools."
            exit 1
        fi

        # Generate the Krona plot in the MetaPhlAn directory
        krona_output="MetaPhlAn/${sample}_krona.html"
        echo "Generating Krona plot for $krona_input..."
        ktImportText "$krona_input" -o "$krona_output"

        # Confirm if the Krona output was generated
        if [ $? -eq 0 ]; then
            echo "Krona plot generated: $krona_output"
        else
            echo "Error generating Krona plot for $krona_input."
        fi
    else
        echo "Paired-end files not found for sample: ${sample}"
    fi
done
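
The steps for creating and running the script could look like this. It is only a sketch: it assumes you are working in the directory that holds the paired fastq.gz files, and the editor step is left as a comment because any editor will do.

```shell
#!/bin/bash
script="metakron.sh"

touch "$script"        # create an empty file if it does not exist yet
# Paste the script into the file with your editor of choice, e.g. nano "$script"
chmod +x "$script"     # grant execute permission
ls -l "$script"        # the mode string should now contain x

# Then run it from the directory containing the fastq.gz files:
# ./metakron.sh
```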

What is the script doing? Spend five minutes trying to work it out for yourself, then ask ChatGPT for a detailed explanation.

Compare and contrast the Krona plots that Kraken2 and MetaPhlAn produce for each sample. Write a brief text-based report on your findings.

Removing Host DNA

In metagenomic analysis, removing host DNA is an essential step to ensure that the data focuses on microbial content rather than the host’s genomic material. This is particularly important when studying samples from organisms like humans, where the host DNA can often overwhelm microbial signals. By filtering out host sequences, we improve the accuracy of downstream analyses and ensure that computational resources are used effectively for microbial data processing.

In this command:

mkdir -p patient29_bowtie2_output
bowtie2 -x hg37dec_v0.1 -1 patient29_R1.fastq.gz -2 patient29_R2.fastq.gz \
--threads 8 --un-conc-gz patient29_bowtie2_output/patient29_nonhuman.fastq.gz \
-S /dev/null

We use bowtie2 to map paired-end reads to the human reference genome index (hg37dec_v0.1).

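The --un-conc option (or its gzip-compressing variant, --un-conc-gz) writes out read pairs that fail to align concordantly to the host genome, essentially isolating the non-human sequences. Sending the alignments themselves to /dev/null with -S discards them, since only the unaligned reads are needed.
bowtie2 writes one output file per mate, adding .1 and .2 before the final dot of the path given (here patient29_nonhuman.fastq.1.gz and patient29_nonhuman.fastq.2.gz). These files contain microbial or other non-host DNA, which can be analyzed further.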
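
To see how many reads the filter removed, it helps to count reads before and after. The sketch below assumes gzip-compressed FASTQ files with four lines per read, like the input patient29_R1.fastq.gz. The count_reads function is a hypothetical helper rather than part of bowtie2, and the toy file with made-up sequences exists only to make the example self-contained.

```shell
#!/bin/bash
# count_reads is a hypothetical helper, not a bowtie2 command.
count_reads() {
    local lines
    lines=$(gzip -cd "$1" | wc -l)   # decompress to stdout and count lines
    echo $(( lines / 4 ))            # a FASTQ record spans four lines
}

# Toy gzipped FASTQ holding two reads, used only to exercise the helper.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip > toy_reads.fastq.gz

n_reads=$(count_reads toy_reads.fastq.gz)
echo "toy_reads.fastq.gz: ${n_reads} reads"
```

Running count_reads on an input file and on the matching filtered output gives the two numbers to compare.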

Why Would We Do This?

Removing host DNA is crucial for accurate metagenomic profiling because:

Noise Reduction: Host DNA can constitute a significant portion of a metagenomic sample, especially in clinical or human-associated samples, masking the microbial content.
Resource Efficiency: Removing unnecessary host sequences before analysis helps save computational resources and improves the speed and accuracy of subsequent processing steps.
Ethical and Privacy Considerations: For human samples, excluding host DNA can also mitigate privacy concerns related to handling personal genetic data.

Practical Note on Our Samples

In our samples, we don’t expect a significant amount of host DNA because they were already screened for it before being uploaded to GenBank, so we are running this step on a single sample, primarily for demonstration purposes. The command maps the reads to the human genome, keeps any read pairs that fail to align as non-human data, and reports how many reads mapped. If you want to try it out on any of the other samples, feel free to do so.

NB: Host DNA can be removed using many other approaches such as BBMap, BWA, HISAT2, Minimap2, FastQ Screen, KneadData, DeconSeq, BMTagger, SAMtools, and Kraken2, depending on the data type and resources available.

Coffee time

The unreal microbiome

Sometimes what we see in the output of metagenomic analyses does not reflect what is actually out there in the real world, particularly when we are dealing with samples that are likely to have a low biomass. This is a persistent problem. Just a few weeks ago, New Scientist claimed that the brain is teeming with life, when in fact there is no brain microbiome at all.

Here’s the landmark Salter et al. paper that anyone working on microbiomes should read:

Reagent and laboratory contamination can critically impact sequence-based microbiome analyses

Let’s take a look at a dataset from that paper: a series of ten-fold serial dilutions of Salmonella bongori DNA extracted using the MP BIO kit, accessed via the SRA’s analysis of each run. Use Krona to explore each sample.

What Happens as the Dilution Increases?

As the serial dilution series progresses, the amount of Salmonella bongori DNA in each sample decreases by an order of magnitude. What effect does this have on the results?

How to avoid kit contamination?

Here is a summary of the discussion from the Salter et al. paper:

Widespread Kit Contamination: Contaminating DNA from extraction kits and laboratory reagents poses a significant concern for both 16S rRNA sequencing and shotgun metagenomics projects. This contamination can skew the results, particularly in low-biomass samples where true DNA signals are minimal.
Nature of Contaminants: Commonly reported contaminants are soil- or water-dwelling bacteria, often associated with nitrogen fixation. This may be due to the use of nitrogen in ultrapure water storage tanks. Such taxa can misleadingly appear in low-biomass samples as genuine findings.
Impact on Research: High-profile cases have highlighted how contamination has led to erroneous conclusions, such as false associations in studies of novel viruses and ancient DNA. The variability in microbial content when analyzed by different laboratories further illustrates this issue.
Underappreciated Problem: Despite documented cases of contamination affecting results, many in the microbiota research community do not report measures to quantify initial DNA, use negative controls, or describe contaminant identification methods. This oversight can lead to misinterpretation of data, especially when unexpected taxa, like plant-associated bacteria, are reported as core components in human studies.
Control Measures and Recommendations: Including negative controls and using in silico methods to identify contaminants can help mitigate these issues. Advanced bioinformatics strategies, like comparing sample data with negative controls and using strain-specific markers, can differentiate between genuine taxa and contaminants.
Decontamination Methods: Techniques like gamma or UV radiation, DNase treatment, and chemical intercalators have been tested with varying success. However, these methods can affect reagent performance, making their use inconsistent. The study recommends careful use of negative controls processed identically to samples and sequenced together to detect and exclude contaminant DNA from analysis effectively.
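
As a rough illustration of comparing a sample with its negative control, the sketch below flags genera that appear in both. The file names and abundance values are invented, and the two-column layout (genus, relative abundance) is an assumption. Real contaminant identification would also weigh abundances and use replicate controls.

```shell
#!/bin/bash
# Toy genus-level tables: "genus<TAB>relative abundance" (made-up values).
printf 'Salmonella\t40\nRalstonia\t35\nBradyrhizobium\t25\n' > toy_sample.tsv
printf 'Ralstonia\t60\nBradyrhizobium\t40\n' > toy_negative_control.tsv

# First file (NR == FNR) loads the control; the second pass flags shared genera.
# Output columns: genus, abundance in sample, abundance in control.
flagged=$(awk -F'\t' '
    NR == FNR { control[$1] = $2; next }
    $1 in control { print $1 "\t" $2 "\t" control[$1] }
' toy_negative_control.tsv toy_sample.tsv)
echo "$flagged"
```

Genera that show up only in the sample (Salmonella in this toy case) are not printed, so the output is the shortlist to inspect more closely.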

Rogue’s gallery

Here’s a rogue’s gallery of common contaminants from the kitome:

Phylum	List of Constituent Contaminant Genera
Pseudomonadota
Alphaproteobacteria	Afipia, Aquabacterium, Asticcacaulis, Aurantimonas, Beijerinckia, Bosea, Bradyrhizobium, Brevundimonas, Caulobacter, Craurococcus, Devosia, Hoeflea, Mesorhizobium, Methylobacterium, Novosphingobium, Ochrobactrum, Paracoccus, Pedomicrobium, Phyllobacterium, Rhizobium, Roseomonas, Sphingobium, Sphingomonas, Sphingopyxis
Betaproteobacteria	Acidovorax, Azoarcus, Azospira, Burkholderia, Comamonas, Cupriavidus, Curvibacter, Delftia, Duganella, Herbaspirillum, Janthinobacterium, Kingella, Leptothrix, Limnobacter, Massilia, Methylophilus, Methyloversatilis, Oxalobacter, Pelomonas, Polaromonas, Ralstonia, Schlegelella, Sulfuritalea, Undibacterium, Variovorax
Gammaproteobacteria	Acinetobacter, Enhydrobacter, Enterobacter, Escherichia, Nevskia, Pseudomonas, Pseudoxanthomonas, Psychrobacter, Stenotrophomonas, Xanthomonas
Actinomycetota	Aeromicrobium, Arthrobacter, Beutenbergia, Brevibacterium, Corynebacterium, Curtobacterium, Dietzia, Geodermatophilus, Janibacter, Kocuria, Microbacterium, Micrococcus, Microlunatus, Patulibacter, Propionibacterium, Rhodococcus, Tsukamurella
Bacillota	Abiotrophia, Bacillus, Brevibacillus, Brochothrix, Facklamia, Paenibacillus, Streptococcus
Bacteroidota	Chryseobacterium, Dyadobacter, Flavobacterium, Hydrotalea, Niastella, Olivibacter, Pedobacter, Wautersiella
Deinococcota	Deinococcus
Acidobacteriota	Predominantly unclassified Acidobacteria Gp2 organisms

Did you see any of these in the Salmonella bongori dilution series?
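
One way to check is to screen a list of detected genera against the table above. This sketch uses only a small subset of the contaminant genera and a made-up list of detected genera; both file names are hypothetical.

```shell
#!/bin/bash
# A few genera taken from the contaminant table (subset, for illustration).
printf '%s\n' Bradyrhizobium Methylobacterium Ralstonia Sphingomonas > toy_contaminants.txt

# Made-up list of genera detected in a sample, one per line.
printf '%s\n' Salmonella Ralstonia Bacteroides Sphingomonas > toy_detected.txt

# -F: fixed strings, -x: match whole lines, -f: read patterns from a file.
suspects=$(grep -xFf toy_contaminants.txt toy_detected.txt)
echo "$suspects"
```

Matching whole lines avoids partial hits, for example a short genus name matching inside a longer one.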

Falk to the rescue

And it’s not just kitome that is the problem!

QIB’s very own Falk Hildebrand published this paper a few years ago:

Much ado about nothing? Off-target amplification can lead to false-positive bacterial brain microbiome detection in healthy and Parkinson’s disease individuals

Here are the main points from the paper:

Detection of Bacterial Signals: Amplicon sequencing detected bacterial signals in human and murine samples; however, the estimated bacterial biomass was extremely low in all samples.
Sources of Bacterial Signals: Stringent reanalyses indicated that the detected bacterial signals were due to a combination of exogenous DNA contamination (54.8%) and false positive amplification of host DNA (34.2%), known as off-target amplicons.
Off-Target Amplification as a Confounder: Off-target amplification was identified as a significant issue in scenarios with low bacterial content and high host DNA. This phenomenon occurs when segments of the host genome (e.g., human or mouse DNA) are mistakenly amplified and misclassified as bacterial DNA.
Impact on 16S rRNA Gene Sequencing: The study found that in most amplicon sequencing pipelines, off-target amplicons were clustered and erroneously assigned to bacterial taxa. This problem was especially prevalent in tissue types thought to be sterile, such as brain tissue.
Independent Evidence of Off-Target Amplification: Off-target amplicons were not isolated to this study alone but were also observed in independent brain 16S rRNA gene sequencing data, emphasizing the need for careful interpretation in similar studies.
Conclusions and Recommendations: Researchers must closely scrutinize taxonomic signals from extremely low biomass samples to rule out off-target amplification. This involves matching sequences explicitly against host genomes to ensure that misclassified host DNA does not lead to false conclusions. The study concluded that, with rigorous analysis, there was no evidence of a brain microbiome or bacterial infection in Parkinson’s Disease (PD) brains.

And our tuppenceworth!

See this letter that New Scientist published from us: https://www.newscientist.com/letter/mg26435131-700-we-say-there-is-no-brain-microbiome/

in response to this rubbish: https://www.newscientist.com/article/mg26335104-500-the-brain-has-its-own-microbiome-heres-what-it-means-for-your-health/

That’s all for today!

See you tomorrow.

Stretch targets if you have time and energy:

Read the Salter et al. paper and Falk’s paper carefully. Use Google Scholar to see who has cited them.
Put together arguments on how we know there is no brain microbiome.
Try running Bracken over your Kraken results.
Try installing and running other profiling tools over the data, including mOTUs and Centrifuge.
