MetaPhlAn and the unreal microbiome


By the end of this session, you will be able to:

  1. Understand the Role of MetaPhlAn in Metagenomics:
    • Explain the key features of MetaPhlAn and its importance in profiling microbial communities.
    • Compare MetaPhlAn with other profiling tools like Kraken2 and Bracken.
  2. Run and Analyze MetaPhlAn Output:
    • Execute MetaPhlAn on paired-end sequencing data to generate taxonomic profiles.
    • Interpret the output format of MetaPhlAn and understand the information it provides.
  3. Transform and Visualize MetaPhlAn Data:
    • Use custom scripts to manipulate MetaPhlAn output for compatibility with visualization tools such as Krona.
    • Create and analyze interactive Krona plots to better understand the taxonomic composition of samples.
  4. Assess the Significance of Removing Host DNA:
    • Discuss the reasons for host DNA removal in metagenomic studies and perform a demonstration using Bowtie2.
    • Identify the potential implications of host contamination on microbial analysis results.
  5. Critically Evaluate Metagenomic Data:
    • Understand challenges such as low biomass samples and the impact of contamination (e.g., reagent contaminants).
    • Recognize common pitfalls in metagenomic profiling and the importance of controls in sequencing experiments.
  6. Compare Results from Multiple Tools:
    • Analyze and compare the taxonomic profiles generated by MetaPhlAn and Kraken2.
    • Evaluate discrepancies between different profiling tools and assess their impact on research conclusions.
  7. Apply Knowledge to Case Studies:
    • Review and interpret results from dilution series experiments (e.g., Salmonella bongori case) to understand the effect of sample dilution on detection.
    • Identify common contaminant genera in sequencing data and understand the implications of these findings on study results.
  8. Discuss Best Practices in Metagenomics:
    • Summarize strategies to minimize contamination and avoid false positives in metagenomic research.
    • Share insights and recommendations for rigorous analysis, including the use of negative controls and proper data interpretation.


MetaPhlAn (Metagenomic Phylogenetic Analysis) is a computational tool designed for profiling the composition of microbial communities from metagenomic sequencing data. It uses a unique database of clade-specific marker genes to identify and quantify taxa within a sample, providing a high-resolution taxonomic profile. This makes it particularly valuable for studies exploring the diversity and abundance of microbial communities in various environments, including the human microbiome.

Key Features

  • High Specificity and Sensitivity: MetaPhlAn focuses on using unique clade-specific markers to ensure accurate taxonomic classification, minimizing false positives common with broader marker sets.
  • Broad Taxonomic Coverage: The tool can identify microorganisms across bacteria, archaea, viruses, and eukaryotes, making it versatile for metagenomic analysis.
  • Output Format: MetaPhlAn outputs a tab-delimited file showing relative abundances of identified taxa, which can be used for downstream analysis or visualization.

Installation and Documentation

  • Official GitHub Repository: MetaPhlAn GitHub
  • Installation Guide: Detailed instructions for installing MetaPhlAn can be found on its installation page.
  • Documentation: Comprehensive user documentation is available at the MetaPhlAn wiki.

Further Reading

Let’s run MetaPhlAn over our samples. But MetaPhlAn output files are scarcely more user-friendly than Kraken output files.

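Here’s what the first few lines of a MetaPhlAn profile look like once the rank prefixes have been stripped and each lineage has been split into tab-separated columns (the first two lines are leftovers from the file header):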

--input_type	#/shared/team/conda/mpallen.mmb-dtp/seqanalysis/bin/MetaPhlAn
relative_abundance	#clade_name
100.0	Bacteria
99.697	Bacteria	Proteobacteria
0.15144	Bacteria	Bacteroidetes
0.1238	Bacteria	Actinobacteria
0.02776	Bacteria	Firmicutes
99.697	Bacteria	Proteobacteria	Gammaproteobacteria
0.15144	Bacteria	Bacteroidetes	Bacteroidia
0.1238	Bacteria	Actinobacteria	Actinobacteria
0.02776	Bacteria	Firmicutes	Clostridia
99.697	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales
0.15144	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales
0.08413	Bacteria	Actinobacteria	Actinobacteria	Micrococcales
0.02776	Bacteria	Firmicutes	Clostridia	Clostridiales
0.02485	Bacteria	Actinobacteria	Actinobacteria	Propionibacteriales
0.01482	Bacteria	Actinobacteria	Actinobacteria	Bifidobacteriales
94.39562	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Yersiniaceae
5.29717	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Erwiniaceae
0.15144	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales	Bacteroidaceae
0.08413	Bacteria	Actinobacteria	Actinobacteria	Micrococcales	Microbacteriaceae
0.02485	Bacteria	Actinobacteria	Actinobacteria	Propionibacteriales	Propionibacteriaceae
0.01482	Bacteria	Actinobacteria	Actinobacteria	Bifidobacteriales	Bifidobacteriaceae
0.01392	Bacteria	Firmicutes	Clostridia	Clostridiales	Ruminococcaceae
0.01096	Bacteria	Firmicutes	Clostridia	Clostridiales	Peptostreptococcaceae
0.00421	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Enterobacteriaceae
0.00288	Bacteria	Firmicutes	Clostridia	Clostridiales	Lachnospiraceae
94.39562	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Yersiniaceae	Serratia
5.29717	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Erwiniaceae	Erwinia
0.15144	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales	Bacteroidaceae	Bacteroides
0.08413	Bacteria	Actinobacteria	Actinobacteria	Micrococcales	Microbacteriaceae	Curtobacterium
0.02485	Bacteria	Actinobacteria	Actinobacteria	Propionibacteriales	Propionibacteriaceae	Cutibacterium
0.01482	Bacteria	Actinobacteria	Actinobacteria	Bifidobacteriales	Bifidobacteriaceae	Bifidobacterium
0.01392	Bacteria	Firmicutes	Clostridia	Clostridiales	Ruminococcaceae	Faecalibacterium
0.01096	Bacteria	Firmicutes	Clostridia	Clostridiales	Peptostreptococcaceae	Paeniclostridium
0.00421	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Enterobacteriaceae	Enterobacter
0.00213	Bacteria	Firmicutes	Clostridia	Clostridiales	Lachnospiraceae	Fusicatenibacter
0.00075	Bacteria	Firmicutes	Clostridia	Clostridiales	Lachnospiraceae	Lachnospiraceae_unclassified
94.39562	Bacteria	Proteobacteria	Gammaproteobacteria	Enterobacterales	Yersiniaceae	Serratia	Serratia_fonticola
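Because each taxonomic rank adds one tab-separated column, a table laid out like the listing above can be sliced by rank simply by counting fields. The sketch below assumes exactly that layout (abundance in column 1, then one column per rank, so genus-level rows have seven fields). The rows are copied from the listing; the temporary file and the variable names are made up for illustration.

```shell
# Build a small demo table from rows in the listing above.
demo=$(mktemp)
{
  printf '99.697\tBacteria\tProteobacteria\n'
  printf '5.29717\tBacteria\tProteobacteria\tGammaproteobacteria\tEnterobacterales\tErwiniaceae\tErwinia\n'
  printf '0.15144\tBacteria\tBacteroidetes\tBacteroidia\tBacteroidales\tBacteroidaceae\tBacteroides\n'
  printf '94.39562\tBacteria\tProteobacteria\tGammaproteobacteria\tEnterobacterales\tYersiniaceae\tSerratia\n'
  printf '94.39562\tBacteria\tProteobacteria\tGammaproteobacteria\tEnterobacterales\tYersiniaceae\n'
} > "$demo"

# Keep rows with exactly 7 fields (abundance + 6 ranks = genus level),
# print genus and abundance, then sort by abundance, highest first.
genus_table=$(awk -F'\t' 'NF == 7 { print $7 "\t" $1 }' "$demo" \
    | sort -t "$(printf '\t')" -k2,2nr)
printf '%s\n' "$genus_table"
rm -f "$demo"
```

Changing `NF == 7` to another field count picks a different rank, for example `NF == 3` for phylum-level rows.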

It would be great if we could look at them as Krona plots, but Krona won’t take MetaPhlAn output files as input. With a little text manipulation, though, we can make it work, so that we can run MetaPhlAn and create Krona files. With help from ChatGPT, I have written a script that does this.

Create a file using touch, paste in the shell script below, grant it execute permissions and then run it:


#!/bin/bash
# Run MetaPhlAn on every paired-end sample in the current directory
# and turn each profile into an interactive Krona plot.

# Check if ktImportText is available before doing any work
if ! command -v ktImportText &> /dev/null; then
    echo "Error: ktImportText command not found. Please install KronaTools."
    exit 1
fi

# Create the MetaPhlAn output directory if it doesn't exist
mkdir -p MetaPhlAn

# Loop over paired-end *_R1.fastq.gz and *_R2.fastq.gz files in the current directory
for r1 in *_R1.fastq.gz; do
    # Strip the suffix to get the sample name
    sample="${r1%_R1.fastq.gz}"

    if [ -f "${sample}_R1.fastq.gz" ] && [ -f "${sample}_R2.fastq.gz" ]; then
        # Output filenames in the MetaPhlAn directory
        # (these file names are illustrative; change them to suit)
        MetaPhlAn_output="MetaPhlAn/${sample}_profile.txt"
        bowtie2_output="MetaPhlAn/${sample}.bowtie2.bz2"
        krona_input="MetaPhlAn/${sample}_krona.txt"
        krona_output="MetaPhlAn/${sample}_krona.html"

        # Run MetaPhlAn with paired-end files (the executable is lowercase)
        echo "Running MetaPhlAn on ${sample} fastq.gz files..."
        metaphlan "${sample}_R1.fastq.gz,${sample}_R2.fastq.gz" --input_type fastq -o "$MetaPhlAn_output" --bowtie2out "$bowtie2_output"

        # Check if MetaPhlAn output was generated; skip this sample if not
        if [ ! -f "$MetaPhlAn_output" ]; then
            echo "Error: MetaPhlAn output not generated for ${sample}."
            continue
        fi

        # Prepare the Krona input file in the MetaPhlAn directory
        echo "Processing $MetaPhlAn_output for Krona..."
        awk '{
            if (NR > 1 && NF >= 3) {
                # Copy the third column to the first column in the output
                output_col1 = $3;

                # Copy the first column to the second column in the output
                output_col2 = $1;

                # Remove rank prefixes: any lowercase letter followed by double underscores
                gsub(/[a-z]__/, "", output_col1);
                gsub(/[a-z]__/, "", output_col2);

                # Convert | to tabs
                gsub(/\|/, "\t", output_col1);
                gsub(/\|/, "\t", output_col2);

                # Print the transformed line
                print output_col1 "\t" output_col2;
            }
        }' "$MetaPhlAn_output" > "$krona_input"

        # Generate the Krona plot and confirm whether it worked
        echo "Generating Krona plot for $krona_input..."
        if ktImportText "$krona_input" -o "$krona_output"; then
            echo "Krona plot generated: $krona_output"
        else
            echo "Error generating Krona plot for $krona_input."
        fi
    else
        echo "Paired-end files not found for sample: ${sample}"
    fi
done

What is the script doing? Spend five minutes trying to work it out for yourself, then ask ChatGPT for a detailed explanation.

Compare and contrast the Krona plots for each sample produced by Kraken and MetaPhlAn. Write a brief text-based report on this.
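A good way in is to push a single line through the core of the awk step and compare input with output. The line below is hand-written in the layout the script appears to expect (clade name, NCBI taxonomy ID, relative abundance, tab-separated). Treat that layout as an assumption and check it against your own MetaPhlAn output.

```shell
# One illustrative input line: clade_name <TAB> NCBI_tax_id <TAB> relative_abundance
line=$(printf 'k__Bacteria|p__Proteobacteria\t2|1224\t99.697')

# Same core steps as the script: take the abundance ($3) and the lineage ($1),
# strip rank prefixes such as k__ and p__, and turn each | into a tab.
krona_line=$(printf '%s\n' "$line" | awk '{
    abundance = $3
    lineage = $1
    gsub(/[a-z]__/, "", lineage)
    gsub(/\|/, "\t", lineage)
    print abundance "\t" lineage
}')
printf '%s\n' "$krona_line"
```

The result has the abundance first and then one column per rank, which is the text layout that ktImportText reads.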

Removing Host DNA

In metagenomic analysis, removing host DNA is an essential step to ensure that the data focuses on microbial content rather than the host’s genomic material. This is particularly important when studying samples from organisms like humans, where the host DNA can often overwhelm microbial signals. By filtering out host sequences, we improve the accuracy of downstream analyses and ensure that computational resources are used effectively for microbial data processing.

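The following command uses bowtie2 to map the paired-end reads from one sample to the human reference genome index (hg37dec_v0.1):

mkdir -p patient29_bowtie2_output
bowtie2 -x hg37dec_v0.1 -1 patient29_R1.fastq.gz -2 patient29_R2.fastq.gz \
--threads 8 --un-conc-gz patient29_bowtie2_output/patient29_nonhuman.fastq.gz \
-S /dev/null

In this command:

  • The --un-conc-gz flag writes out, gzip-compressed, the read pairs that do not align concordantly to the host genome, essentially isolating non-human sequences. (Plain --un-conc does the same but writes uncompressed files, whatever the file name says.)
  • bowtie2 writes one file per mate, adding .1 and .2 before the final extension of the name you give (here patient29_nonhuman.fastq.1.gz and patient29_nonhuman.fastq.2.gz). These files contain microbial or other non-host DNA, which can be further analyzed.
  • -S /dev/null discards the SAM alignments, because we only want the reads that did not align.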

Why Would We Do This?

Removing host DNA is crucial for accurate metagenomic profiling because:

  1. Noise Reduction: Host DNA can constitute a significant portion of a metagenomic sample, especially in clinical or human-associated samples, masking the microbial content.
  2. Resource Efficiency: Removing unnecessary host sequences before analysis helps save computational resources and improves the speed and accuracy of subsequent processing steps.
  3. Ethical and Privacy Considerations: For human samples, excluding host DNA can also mitigate privacy concerns related to handling personal genetic data.

Practical Note on Our Samples

In our samples, we don’t expect a significant amount of host DNA because they were already screened for this before uploading to GenBank, so we are running this step on a single sample primarily for demonstration purposes. The command maps the reads; any read pairs that fail to align to the human genome are kept as non-human data, and bowtie2 reports how many reads mapped to the human genome. If you want to try it out on any of the other samples, feel free to do so.

NB: Host DNA can be removed using many other approaches such as BBMap, BWA, HISAT2, Minimap2, FastQ Screen, KneadData, DeconSeq, BMTagger, SAMtools, and Kraken2, depending on the data type and resources available.
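Whichever tool you use, a quick sanity check is to count reads before and after host removal. The sketch below assumes standard four-line FASTQ records; the function name count_reads and the tiny demo file are made up for illustration.

```shell
# Count reads in a gzipped FASTQ file (4 lines per read).
count_reads() {
    gunzip -c "$1" | awk 'END { print NR / 4 }'
}

# Tiny made-up FASTQ with two reads, only to show the function working.
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' \
    | gzip -c > "$demo_dir/demo.fastq.gz"

n_reads=$(count_reads "$demo_dir/demo.fastq.gz")
echo "demo.fastq.gz: $n_reads reads"
rm -rf "$demo_dir"
```

Running count_reads on an original R1 file and on the corresponding non-human mate 1 file shows how many read pairs the host filter removed.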

Coffee time

The unreal microbiome

Sometimes what we see in the output of metagenomic analyses does not reflect what is actually out there in the real world, particularly when we are dealing with samples that are likely to have a low biomass. This is a persistent problem. Just a few weeks ago, New Scientist claimed that the brain is teeming with life, when in fact there is no brain microbiome at all.

Here’s the landmark Salter et al. paper that anyone working on microbiomes should read:

Let’s take a look at a dataset from that paper: a series of ten-fold serial dilutions of Salmonella bongori DNA extracted using the MP BIO kit, accessed via the SRA’s analysis of each run. Use Krona to explore each sample.

What Happens as the Dilution Increases?

As the serial dilution series progresses, the amount of Salmonella bongori DNA in each sample decreases by an order of magnitude. What effect does this have on the results?

How to avoid kit contamination?

Here is a summary of the discussion from the Salter et al. paper:

  1. Widespread Kit Contamination: Contaminating DNA from extraction kits and laboratory reagents poses a significant concern for both 16S rRNA sequencing and shotgun metagenomics projects. This contamination can skew the results, particularly in low-biomass samples where true DNA signals are minimal.

  2. Nature of Contaminants: Commonly reported contaminants are soil- or water-dwelling bacteria, often associated with nitrogen fixation. This may be due to the use of nitrogen in ultrapure water storage tanks. Such taxa can misleadingly appear in low-biomass samples as genuine findings.

  3. Impact on Research: High-profile cases have highlighted how contamination has led to erroneous conclusions, such as false associations in studies of novel viruses and ancient DNA. The variability in microbial content when analyzed by different laboratories further illustrates this issue.

  4. Underappreciated Problem: Despite documented cases of contamination affecting results, many in the microbiota research community do not report measures to quantify initial DNA, use negative controls, or describe contaminant identification methods. This oversight can lead to misinterpretation of data, especially when unexpected taxa, like plant-associated bacteria, are reported as core components in human studies.

  5. Control Measures and Recommendations: Including negative controls and using in silico methods to identify contaminants can help mitigate these issues. Advanced bioinformatics strategies, like comparing sample data with negative controls and using strain-specific markers, can differentiate between genuine taxa and contaminants.

  6. Decontamination Methods: Techniques like gamma or UV radiation, DNase treatment, and chemical intercalators have been tested with varying success. However, these methods can affect reagent performance, making their use inconsistent. The study recommends careful use of negative controls processed identically to samples and sequenced together to detect and exclude contaminant DNA from analysis effectively.
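Point 1 can be made concrete with a toy calculation: if the target DNA drops ten-fold at each dilution step while the contaminant DNA introduced by the kit stays constant, the contaminant share of the total rises steeply. Both starting amounts below are hypothetical and in arbitrary units; they are not taken from the Salter et al. data.

```shell
# Toy model: target DNA falls ten-fold per step, kit contamination is constant.
dilution_table=$(awk 'BEGIN {
    target = 100000      # hypothetical starting amount of target DNA
    contaminant = 10     # hypothetical constant amount of kit contaminant DNA
    for (step = 0; step <= 5; step++) {
        # Contaminant share of total DNA, as a percentage
        printf "%d\t%.2f\n", step, 100 * contaminant / (target + contaminant)
        target = target / 10
    }
}')
printf 'step\tcontaminant_percent\n%s\n' "$dilution_table"
```

With these made-up numbers the contaminant share climbs from about 0.01% of the DNA at step 0 to about 91% at step 5, which is why contaminant taxa can dominate the most dilute samples.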

Here’s a rogues’ gallery of common contaminants from the kitome:

Phylum	List of Constituent Contaminant Genera
Alphaproteobacteria	Afipia, Aquabacterium, Asticcacaulis, Aurantimonas, Beijerinckia, Bosea, Bradyrhizobium, Brevundimonas, Caulobacter, Craurococcus, Devosia, Hoeflea, Mesorhizobium, Methylobacterium, Novosphingobium, Ochrobactrum, Paracoccus, Pedomicrobium, Phyllobacterium, Rhizobium, Roseomonas, Sphingobium, Sphingomonas, Sphingopyxis
Betaproteobacteria	Acidovorax, Azoarcus, Azospira, Burkholderia, Comamonas, Cupriavidus, Curvibacter, Delftia, Duganella, Herbaspirillum, Janthinobacterium, Kingella, Leptothrix, Limnobacter, Massilia, Methylophilus, Methyloversatilis, Oxalobacter, Pelomonas, Polaromonas, Ralstonia, Schlegelella, Sulfuritalea, Undibacterium, Variovorax
Gammaproteobacteria	Acinetobacter, Enhydrobacter, Enterobacter, Escherichia, Nevskia, Pseudomonas, Pseudoxanthomonas, Psychrobacter, Stenotrophomonas, Xanthomonas
Actinomycetota	Aeromicrobium, Arthrobacter, Beutenbergia, Brevibacterium, Corynebacterium, Curtobacterium, Dietzia, Geodermatophilus, Janibacter, Kocuria, Microbacterium, Micrococcus, Microlunatus, Patulibacter, Propionibacterium, Rhodococcus, Tsukamurella
Bacillota	Abiotrophia, Bacillus, Brevibacillus, Brochothrix, Facklamia, Paenibacillus, Streptococcus
Bacteroidota	Chryseobacterium, Dyadobacter, Flavobacterium, Hydrotalea, Niastella, Olivibacter, Pedobacter, Wautersiella
Deinococcota	Deinococcus
Acidobacteriota	Predominantly unclassified Acidobacteria Gp2 organisms

  • Did you see any of these in the Salmonella bongori dilution series?
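One way to answer that systematically is to cross-check the genera in a profile against the contaminant list. The sketch below uses a handful of genera from the table above and the genera from the example MetaPhlAn listing earlier in this session; the file names are made up for illustration, and a real check would use the full list.

```shell
work=$(mktemp -d)

# A few genera from the contaminant table (not the full list).
printf '%s\n' Acinetobacter Curtobacterium Enterobacter Methylobacterium \
    Pseudomonas Ralstonia Sphingomonas > "$work/contaminants.txt"

# Genera reported in the example MetaPhlAn listing.
printf '%s\n' Serratia Erwinia Bacteroides Curtobacterium Cutibacterium \
    Bifidobacterium Faecalibacterium Paeniclostridium Enterobacter \
    Fusicatenibacter > "$work/sample_genera.txt"

# -F: fixed strings, -x: match whole lines, -f: read patterns from a file.
flagged=$(grep -Fxf "$work/contaminants.txt" "$work/sample_genera.txt")
printf '%s\n' "$flagged"
rm -rf "$work"
```

A genus that turns up on the list is a candidate contaminant rather than a proven one; negative controls processed alongside the samples are still needed to decide.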

Falk to the rescue

Image of Falk Hildebrand

And it’s not just the kitome that is the problem!

QIB’s very own Falk Hildebrand published this paper a few years ago:

Here are the main points from the paper:

  • Detection of Bacterial Signals: Amplicon sequencing detected bacterial signals in human and murine samples; however, the estimated bacterial biomass was extremely low in all samples.

  • Sources of Bacterial Signals: Stringent reanalyses indicated that the detected bacterial signals were due to a combination of exogenous DNA contamination (54.8%) and false positive amplification of host DNA (34.2%), known as off-target amplicons.

  • Off-Target Amplification as a Confounder: Off-target amplification was identified as a significant issue in scenarios with low bacterial content and high host DNA. This phenomenon occurs when segments of the host genome (e.g., human or mouse DNA) are mistakenly amplified and misclassified as bacterial DNA.

  • Impact on 16S rRNA Gene Sequencing: The study found that in most amplicon sequencing pipelines, off-target amplicons were clustered and erroneously assigned to bacterial taxa. This problem was especially prevalent in tissue types thought to be sterile, such as brain tissue.

  • Independent Evidence of Off-Target Amplification: Off-target amplicons were not isolated to this study alone but were also observed in independent brain 16S rRNA gene sequencing data, emphasizing the need for careful interpretation in similar studies.

  • Conclusions and Recommendations: Researchers must closely scrutinize taxonomic signals from extremely low biomass samples to rule out off-target amplification. This involves matching sequences explicitly against host genomes to ensure that misclassified host DNA does not lead to false conclusions. The study concluded that, with rigorous analysis, there was no evidence of a brain microbiome or bacterial infection in Parkinson’s Disease (PD) brains.

And our tuppenceworth!

See this letter that New Scientist published from us:

in response to this rubbish:

That’s all for today!

See you tomorrow.

Stretch targets if you have time and energy:

  • Read the Salter et al. paper and Falk’s paper carefully. Use Google Scholar to see who has cited them.
  • Put together arguments on how we know there is no brain microbiome.
  • Try running Bracken over your Kraken results.
  • Try installing and running other profiling tools over the data, including mOTUs and Centrifuge.