MEGAHIT is a fast, memory-efficient assembler for large and complex metagenomic datasets. It is designed for short reads from high-throughput sequencing technologies such as Illumina, and it handles samples containing mixtures of organisms, such as environmental or clinical metagenomes. Its low memory footprint and high speed make it particularly well suited to large-scale studies of diverse microbial communities.
To save time and resources, in this session you will each work on a single sample. Decide between yourselves which student takes which sample. Run the assembly on the sample assigned to you, working within your own directory in shared-teams.
Start in your sequences directory and pick the command for your sample.
megahit -t 16 -1 patient02_day01_R1.fastq.gz -2 patient02_day01_R2.fastq.gz -o megahit_out_patient02_day01
megahit -t 16 -1 patient02_day10_R1.fastq.gz -2 patient02_day10_R2.fastq.gz -o megahit_out_patient02_day10
megahit -t 16 -1 patient04_day10_R1.fastq.gz -2 patient04_day10_R2.fastq.gz -o megahit_out_patient04_day10
megahit -t 16 -1 patient04_day14_R1.fastq.gz -2 patient04_day14_R2.fastq.gz -o megahit_out_patient04_day14
megahit -t 16 -1 patient29_R1.fastq.gz -2 patient29_R2.fastq.gz -o megahit_out_patient29
megahit -t 16 -1 patient35_R1.fastq.gz -2 patient35_R2.fastq.gz -o megahit_out_patient35
megahit -t 16 -1 patient36_R1.fastq.gz -2 patient36_R2.fastq.gz -o megahit_out_patient36
megahit
The command that runs the MEGAHIT assembler.
-t 16
The number of threads to use during the assembly. Using 16 threads speeds up the computation by spreading the work across more CPU cores.
-1 patient02_day01_R1.fastq.gz -2 patient02_day01_R2.fastq.gz
The forward (R1) and reverse (R2) read files. Supplying them with -1 and -2 tells MEGAHIT the reads are paired, so the assembler can use paired-end information to improve assembly quality. (MEGAHIT also has a --12 option, but that is for interleaved files, where forward and reverse reads alternate within a single file.)
-o megahit_out_patient02_day01
The output directory where MEGAHIT will store the assembly results. It will contain the assembled contigs (final.contigs.fa) along with logs and intermediate files generated during the assembly.
This command assembles the paired-end reads from patient02_day01 using 16 CPU threads and stores the results in the specified output directory.
While that is running, let’s watch this video that explains how de Bruijn graphs are used for assembly.
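To make the idea concrete, here is a minimal, hypothetical sketch (plain Python, not assembler code) of how reads are decomposed into overlapping k-mers and linked into a de Bruijn graph. MEGAHIT's real implementation uses succinct de Bruijn graphs (the "SdBG" you will see in the log below) over several k values and is far more sophisticated.

# Toy de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.
# Real assemblers also handle sequencing errors, reverse complements,
# and coverage; this only illustrates the core data structure.
from collections import defaultdict

def de_bruijn(reads, k):
    graph = defaultdict(list)  # (k-1)-mer -> successor (k-1)-mers
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]  # made-up example reads
for node, successors in sorted(de_bruijn(reads, 4).items()):
    print(node, "->", ",".join(successors))

Walking unambiguous paths through this graph is, in essence, how contigs are produced.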
Then let's consider what the output of MEGAHIT is telling us.
2024-11-11 11:20:22 - MEGAHIT v1.2.9
2024-11-11 11:20:22 - Using megahit_core with POPCNT and BMI2 support
2024-11-11 11:20:22 - Convert reads to binary library
2024-11-11 11:20:24 - b'INFO sequence/io/sequence_lib.cpp : 75 - Lib 0 (/shared/team/2024_training/2024_metagenomics/metagenomics_2024_mark/sequences/patient02_day01_R1.fastq.gz): interleaved, 1158686 reads, 151 max length'
2024-11-11 11:20:27 - b'INFO sequence/io/sequence_lib.cpp : 75 - Lib 1 (/shared/team/2024_training/2024_metagenomics/metagenomics_2024_mark/sequences/patient02_day01_R2.fastq.gz): interleaved, 1158686 reads, 151 max length'
2024-11-11 11:20:27 - b'INFO utils/utils.h : 152 - Real: 4.9905\tuser: 2.3200\tsys: 0.3648\tmaxrss: 101708'
2024-11-11 11:20:27 - k-max reset to: 141
2024-11-11 11:20:27 - Start assembly. Number of CPU threads 16
2024-11-11 11:20:27 - k list: 21,29,39,59,79,99,119,141
2024-11-11 11:20:27 - Memory used: 665431953408
2024-11-11 11:20:27 - Extract solid (k+1)-mers for k = 21
2024-11-11 11:20:44 - Build graph for k = 21
2024-11-11 11:21:05 - Assemble contigs from SdBG for k = 21
2024-11-11 11:21:59 - Local assembly for k = 21
2024-11-11 11:22:43 - Extract iterative edges from k = 21 to 29
2024-11-11 11:22:48 - Build graph for k = 29
2024-11-11 11:23:01 - Assemble contigs from SdBG for k = 29
2024-11-11 11:23:46 - Local assembly for k = 29
2024-11-11 11:24:38 - Extract iterative edges from k = 29 to 39
2024-11-11 11:24:41 - Build graph for k = 39
2024-11-11 11:24:52 - Assemble contigs from SdBG for k = 39
2024-11-11 11:25:31 - Local assembly for k = 39
2024-11-11 11:26:46 - Extract iterative edges from k = 39 to 59
2024-11-11 11:26:49 - Build graph for k = 59
2024-11-11 11:26:56 - Assemble contigs from SdBG for k = 59
2024-11-11 11:27:21 - Local assembly for k = 59
2024-11-11 11:28:33 - Extract iterative edges from k = 59 to 79
2024-11-11 11:28:34 - Build graph for k = 79
2024-11-11 11:28:39 - Assemble contigs from SdBG for k = 79
2024-11-11 11:28:56 - Local assembly for k = 79
2024-11-11 11:30:05 - Extract iterative edges from k = 79 to 99
2024-11-11 11:30:06 - Build graph for k = 99
2024-11-11 11:30:10 - Assemble contigs from SdBG for k = 99
2024-11-11 11:30:24 - Local assembly for k = 99
2024-11-11 11:31:28 - Extract iterative edges from k = 99 to 119
2024-11-11 11:31:29 - Build graph for k = 119
2024-11-11 11:31:33 - Assemble contigs from SdBG for k = 119
2024-11-11 11:31:45 - Local assembly for k = 119
2024-11-11 11:32:48 - Extract iterative edges from k = 119 to 141
2024-11-11 11:32:49 - Build graph for k = 141
2024-11-11 11:32:52 - Assemble contigs from SdBG for k = 141
2024-11-11 11:33:02 - Merging to output final contigs
2024-11-11 11:33:02 - 32989 contigs, total 21701806 bp, min 270 bp, max 18460 bp, avg 657 bp, N50 723 bp
2024-11-11 11:33:02 - ALL DONE. Time elapsed: 760.569404 seconds
This MEGAHIT output shows the stages of the assembly: the reads are first converted to a binary library, then succinct de Bruijn graphs (the "SdBG" in the log) are built and contigs assembled iteratively over increasing k-mer sizes (the k list 21, 29, ..., 141). Smaller k values help recover low-coverage regions, while larger k values resolve repeats. The final lines summarise the result: 32,989 contigs totalling about 21.7 Mbp, with a largest contig of 18,460 bp and an N50 of 723 bp.
Let's take a look at the quality of the assembly using QUAST.
quast.py -o quast_out_patient02_day01 -t 4 -f megahit_out_patient02_day01/final.contigs.fa
quast.py -o quast_out_patient02_day10 -t 4 -f megahit_out_patient02_day10/final.contigs.fa
quast.py -o quast_out_patient04_day10 -t 4 -f megahit_out_patient04_day10/final.contigs.fa
quast.py -o quast_out_patient04_day14 -t 4 -f megahit_out_patient04_day14/final.contigs.fa
quast.py -o quast_out_patient29 -t 4 -f megahit_out_patient29/final.contigs.fa
quast.py -o quast_out_patient35 -t 4 -f megahit_out_patient35/final.contigs.fa
quast.py -o quast_out_patient36 -t 4 -f megahit_out_patient36/final.contigs.fa
QUAST (Quality Assessment Tool for Genome Assemblies) provides detailed metrics that help evaluate the quality of genome assemblies. Have a look around at all the output files, including the HTML files. Remember to Trust the HTML!
Here's how to interpret the key outputs and what constitutes a good metagenomic assembly:
Number of Contigs: The total number of contiguous sequences produced in the assembly. A lower number of contigs generally indicates a more contiguous assembly; a very high number suggests fragmentation and may point to issues such as insufficient read coverage or complex repeat regions.
Total Assembly Length: The combined length of all contigs. For a single-genome assembly this should ideally be close to the expected genome size; in metagenomics, variability is expected because multiple species are present.
N50 Value: A measure of contiguity, defined as the length of the contig at which half of the total assembly length is reached when contigs are sorted from longest to shortest. A higher N50 means longer contigs, which suggests better assembly quality, though what counts as "good" depends on the complexity of the community and the read quality (the sketch after this list shows how N50 is computed).
Largest Contig: The length of the longest contig in the assembly. A longer largest contig can be indicative of more successful resolution of genomic structures.
GC Content: Shows the proportion of guanine and cytosine bases. Consistent GC content across contigs can indicate a homogeneous assembly, while wide variations may point to contamination or different species present.
Misassemblies: The count and type of detected misassemblies. A high number implies structural errors, which matters especially for metagenomic data because it determines how accurately the assembly represents the true genetic makeup of the community. Note that QUAST can only detect misassemblies when a reference genome is supplied with -r; without one, as in our commands above, this metric is not reported.
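Several of these metrics are simple to compute yourself. Here is a minimal, hypothetical script (assembly_stats.py is our own name, not part of QUAST) that derives the basic statistics directly from an assembly FASTA such as final.contigs.fa:

# A minimal re-implementation of QUAST's basic assembly statistics.
import sys

def contig_lengths_and_gc(path):
    lengths, gc, total, seq_len = [], 0, 0, 0
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):   # a header starts a new contig
                if seq_len:
                    lengths.append(seq_len)
                seq_len = 0
            else:                      # a sequence line
                s = line.strip().upper()
                seq_len += len(s)
                total += len(s)
                gc += s.count("G") + s.count("C")
    if seq_len:                        # do not forget the last contig
        lengths.append(seq_len)
    return lengths, gc / total

def n50(lengths):
    # Sort descending; N50 is the contig length at which the running
    # total first reaches half of the total assembly length.
    half, running = sum(lengths) / 2, 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length

lengths, gc_frac = contig_lengths_and_gc(sys.argv[1])
print("contigs:", len(lengths))
print("total length:", sum(lengths))
print("largest contig:", max(lengths))
print("N50:", n50(lengths))
print("GC %:", round(100 * gc_frac, 2))

Run it as, for example, python assembly_stats.py megahit_out_patient02_day01/final.contigs.fa and compare its output with the QUAST report.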
For the relatively simple metagenomic input files we have here, ranging from 30 to 80 MB and likely representing moderate microbial diversity at lower coverage, here are some expected QUAST figures and interpretations:
These figures provide a baseline for what you might expect from simple metagenomes with modest input sizes. For more complex or larger metagenomic datasets, these metrics will vary, often requiring greater resources and more advanced assembly strategies.
Genomes in the sample can be recreated with a process called binning. This process allows separate analysis of genomes contained in the metagenome so long as there are enough reads to reconstruct each genome. In other words, it will only work for the most abundant species. Genomes reconstructed from metagenomic assemblies are called MAGs (Metagenome-Assembled Genomes). In this process, the assembled contigs from the metagenome are assigned to different bins (FASTA files containing contigs). Ideally, each bin corresponds to only one original genome (a MAG), but it is an error-prone process.
There are various approaches to binning, based on characteristics of the contigs such as their GC content, their tetranucleotide composition, or their coverage (abundance).
MaxBin is a binning algorithm that assigns contigs to bins according to their coverage levels and their tetranucleotide frequencies.
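To see what "tetranucleotide frequencies" are in practice, here is a minimal, hypothetical sketch computing them for a single sequence. MaxBin's actual model combines these composition profiles with coverage using an Expectation-Maximization algorithm, but the underlying signal is simply that contigs from the same genome tend to have similar 4-mer profiles.

# Tetranucleotide frequencies: the fraction of each 4-mer in a sequence.
from collections import Counter

def tetra_freqs(seq):
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

contig = "ATGCGATCGATCGGCTAGCTAGCATCGAT"  # a made-up example sequence
for kmer, freq in sorted(tetra_freqs(contig).items()):
    print(kmer, round(freq, 3))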
Let us bin the samples we just assembled. Before we can run MaxBin, we have to calculate the coverage depth of each contig using Bowtie2 and samtools, in a process known as backmapping.
Backmapping involves aligning raw reads back to the assembled contigs to determine how well the reads map to different regions of the assembly. This process helps in calculating the coverage depth of each contig, which is directly proportional to the abundance of that contig in the sample.
How it Works
Here's the workflow, step by step, using patient02_day01 as the example:
mkdir -p maxbin_out_patient02_day01
bowtie2-build megahit_out_patient02_day01/final.contigs.fa maxbin_out_patient02_day01/contigs_index
bowtie2 -x maxbin_out_patient02_day01/contigs_index -1 patient02_day01_R1.fastq.gz -2 patient02_day01_R2.fastq.gz -S maxbin_out_patient02_day01/mapped_reads.sam
samtools view -S -b maxbin_out_patient02_day01/mapped_reads.sam > maxbin_out_patient02_day01/mapped_reads.bam
samtools sort maxbin_out_patient02_day01/mapped_reads.bam -o maxbin_out_patient02_day01/mapped_reads_sorted.bam
samtools depth maxbin_out_patient02_day01/mapped_reads_sorted.bam | awk '{sum[$1] += $3; count[$1]++} END {for (contig in sum) print contig "\t" sum[contig]/count[contig]}' > maxbin_out_patient02_day01/abundance_data.txt
run_MaxBin.pl -contig megahit_out_patient02_day01/final.contigs.fa -abund maxbin_out_patient02_day01/abundance_data.txt -out maxbin_out_patient02_day01/maxbin_bins -thread 48
This series of commands performs a workflow for mapping reads and generating abundance data to use with MaxBin for metagenomic binning. Here’s a brief explanation:
mkdir -p maxbin_out_patient02_day01
Creates an output directory (maxbin_out_patient02_day01) to store all subsequent results.
bowtie2-build megahit_out_patient02_day01/final.contigs.fa maxbin_out_patient02_day01/contigs_index
Builds a Bowtie2 index from the contigs produced by the MEGAHIT assembly (final.contigs.fa). This index is necessary for mapping reads to the contigs.
bowtie2 -x maxbin_out_patient02_day01/contigs_index -1 patient02_day01_R1.fastq.gz -2 patient02_day01_R2.fastq.gz -S maxbin_out_patient02_day01/mapped_reads.sam
Maps the paired-end reads (R1 and R2) back to the indexed contigs, producing a SAM file (mapped_reads.sam) containing the alignment information. This will take some time!
samtools view -S -b maxbin_out_patient02_day01/mapped_reads.sam > maxbin_out_patient02_day01/mapped_reads.bam
Converts the SAM file to BAM format, a more compact, binary version of the same alignment data.
samtools sort maxbin_out_patient02_day01/mapped_reads.bam -o maxbin_out_patient02_day01/mapped_reads_sorted.bam
Sorts the BAM file by coordinate to prepare it for downstream analysis.
samtools depth maxbin_out_patient02_day01/mapped_reads_sorted.bam | awk '{sum[$1] += $3; count[$1]++} END {for (contig in sum) print contig "\t" sum[contig]/count[contig]}' > maxbin_out_patient02_day01/abundance_data.txt
Calculates the per-position coverage depth of each contig and uses awk to average it, writing one tab-separated contig/mean-depth line per contig to abundance_data.txt, the abundance file MaxBin needs. Note that samtools depth skips zero-coverage positions by default, so this is the mean depth over covered positions. (A Python equivalent of the awk step is sketched after this list.)
run_MaxBin.pl -contig megahit_out_patient02_day01/final.contigs.fa -abund maxbin_out_patient02_day01/abundance_data.txt -out maxbin_out_patient02_day01/maxbin_bins -thread 48
Runs MaxBin on the contigs and abundance data to bin the contigs into candidate genomes, using 48 threads for faster processing.
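If the awk one-liner is hard to read, here is an equivalent, hypothetical sketch in Python (mean_depth.py is our own name) that performs the same averaging over the samtools depth output:

# Reads "contig<TAB>position<TAB>depth" lines (samtools depth output)
# from stdin and prints the mean depth per contig, tab-separated,
# in the format MaxBin's -abund option expects.
import sys
from collections import defaultdict

totals = defaultdict(int)   # contig -> sum of per-position depths
counts = defaultdict(int)   # contig -> number of covered positions

for line in sys.stdin:
    contig, _position, depth = line.split("\t")
    totals[contig] += int(depth)   # int() tolerates the trailing newline
    counts[contig] += 1

for contig in totals:
    print(f"{contig}\t{totals[contig] / counts[contig]}")

You would use it in place of the awk step, e.g. samtools depth maxbin_out_patient02_day01/mapped_reads_sorted.bam | python mean_depth.py > maxbin_out_patient02_day01/abundance_data.txt.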
Run this workflow over your assembly.
Here’s a separate version of the workflow for each of the samples. Copy and run the appropriate set of commands for your sample to complete the workflow.
patient02_day01
mkdir -p maxbin_out_patient02_day01
bowtie2-build megahit_out_patient02_day01/final.contigs.fa maxbin_out_patient02_day01/contigs_index
bowtie2 -x maxbin_out_patient02_day01/contigs_index -1 patient02_day01_R1.fastq.gz -2 patient02_day01_R2.fastq.gz -S maxbin_out_patient02_day01/mapped_reads.sam
samtools view -S -b maxbin_out_patient02_day01/mapped_reads.sam > maxbin_out_patient02_day01/mapped_reads.bam
samtools sort maxbin_out_patient02_day01/mapped_reads.bam -o maxbin_out_patient02_day01/mapped_reads_sorted.bam
samtools depth maxbin_out_patient02_day01/mapped_reads_sorted.bam | awk '{sum[$1] += $3; count[$1]++} END {for (contig in sum) print contig "\t" sum[contig]/count[contig]}' > maxbin_out_patient02_day01/abundance_data.txt
run_MaxBin.pl -contig megahit_out_patient02_day01/final.contigs.fa -abund maxbin_out_patient02_day01/abundance_data.txt -out maxbin_out_patient02_day01/maxbin_out_patient02_day01.bin -thread 48
patient02_day10
mkdir -p maxbin_out_patient02_day10
bowtie2-build megahit_out_patient02_day10/final.contigs.fa maxbin_out_patient02_day10/contigs_index
bowtie2 -x maxbin_out_patient02_day10/contigs_index -1 patient02_day10_R1.fastq.gz -2 patient02_day10_R2.fastq.gz -S maxbin_out_patient02_day10/mapped_reads.sam
samtools view -S -b maxbin_out_patient02_day10/mapped_reads.sam > maxbin_out_patient02_day10/mapped_reads.bam
samtools sort maxbin_out_patient02_day10/mapped_reads.bam -o maxbin_out_patient02_day10/mapped_reads_sorted.bam
samtools depth maxbin_out_patient02_day10/mapped_reads_sorted.bam | awk '{sum[$1] += $3; count[$1]++} END {for (contig in sum) print contig "\t" sum[contig]/count[contig]}' > maxbin_out_patient02_day10/abundance_data.txt
run_MaxBin.pl -contig megahit_out_patient02_day10/final.contigs.fa -abund maxbin_out_patient02_day10/abundance_data.txt -out maxbin_out_patient02_day10/maxbin_out_patient02_day10.bin -thread 48
patient04_day10
mkdir -p maxbin_out_patient04_day10
bowtie2-build megahit_out_patient04_day10/final.contigs.fa maxbin_out_patient04_day10/contigs_index
bowtie2 -x maxbin_out_patient04_day10/contigs_index -1 patient04_day10_R1.fastq.gz -2 patient04_day10_R2.fastq.gz -S maxbin_out_patient04_day10/mapped_reads.sam
samtools view -S -b maxbin_out_patient04_day10/mapped_reads.sam > maxbin_out_patient04_day10/mapped_reads.bam
samtools sort maxbin_out_patient04_day10/mapped_reads.bam -o maxbin_out_patient04_day10/mapped_reads_sorted.bam
samtools depth maxbin_out_patient04_day10/mapped_reads_sorted.bam | awk '{sum[$1] += $3; count[$1]++} END {for (contig in sum) print contig "\t" sum[contig]/count[contig]}' > maxbin_out_patient04_day10/abundance_data.txt
run_MaxBin.pl -contig megahit_out_patient04_day10/final.contigs.fa -abund maxbin_out_patient04_day10/abundance_data.txt -out maxbin_out_patient04_day10/maxbin_out_patient04_day10.bin -thread 48
patient04_day14
mkdir -p maxbin_out_patient04_day14
bowtie2-build megahit_out_patient04_day14/final.contigs.fa maxbin_out_patient04_day14/contigs_index
bowtie2 -x maxbin_out_patient04_day14/contigs_index -1 patient04_day14_R1.fastq.gz -2 patient04_day14_R2.fastq.gz -S maxbin_out_patient04_day14/mapped_reads.sam
samtools view -S -b maxbin_out_patient04_day14/mapped_reads.sam > maxbin_out_patient04_day14/mapped_reads.bam
samtools sort maxbin_out_patient04_day14/mapped_reads.bam -o maxbin_out_patient04_day14/mapped_reads_sorted.bam
samtools depth maxbin_out_patient04_day14/mapped_reads_sorted.bam | awk '{sum[$1] += $3; count[$1]++} END {for (contig in sum) print contig "\t" sum[contig]/count[contig]}' > maxbin_out_patient04_day14/abundance_data.txt
run_MaxBin.pl -contig megahit_out_patient04_day14/final.contigs.fa -abund maxbin_out_patient04_day14/abundance_data.txt -out maxbin_out_patient04_day14/maxbin_out_patient04_day14.bin -thread 48
patient29
mkdir -p maxbin_out_patient29
bowtie2-build megahit_out_patient29/final.contigs.fa maxbin_out_patient29/contigs_index
bowtie2 -x maxbin_out_patient29/contigs_index -1 patient29_R1.fastq.gz -2 patient29_R2.fastq.gz -S maxbin_out_patient29/mapped_reads.sam
samtools view -S -b maxbin_out_patient29/mapped_reads.sam > maxbin_out_patient29/mapped_reads.bam
samtools sort maxbin_out_patient29/mapped_reads.bam -o maxbin_out_patient29/mapped_reads_sorted.bam
samtools depth maxbin_out_patient29/mapped_reads_sorted.bam | awk '{sum[$1] += $3; count[$1]++} END {for (contig in sum) print contig "\t" sum[contig]/count[contig]}' > maxbin_out_patient29/abundance_data.txt
run_MaxBin.pl -contig megahit_out_patient29/final.contigs.fa -abund maxbin_out_patient29/abundance_data.txt -out maxbin_out_patient29/maxbin_out_patient29.bin -thread 48
patient35
mkdir -p maxbin_out_patient35
bowtie2-build megahit_out_patient35/final.contigs.fa maxbin_out_patient35/contigs_index
bowtie2 -x maxbin_out_patient35/contigs_index -1 patient35_R1.fastq.gz -2 patient35_R2.fastq.gz -S maxbin_out_patient35/mapped_reads.sam
samtools view -S -b maxbin_out_patient35/mapped_reads.sam > maxbin_out_patient35/mapped_reads.bam
samtools sort maxbin_out_patient35/mapped_reads.bam -o maxbin_out_patient35/mapped_reads_sorted.bam
samtools depth maxbin_out_patient35/mapped_reads_sorted.bam | awk '{sum[$1] += $3; count[$1]++} END {for (contig in sum) print contig "\t" sum[contig]/count[contig]}' > maxbin_out_patient35/abundance_data.txt
run_MaxBin.pl -contig megahit_out_patient35/final.contigs.fa -abund maxbin_out_patient35/abundance_data.txt -out maxbin_out_patient35/maxbin_out_patient35.bin -thread 48
patient36
mkdir -p maxbin_out_patient36
bowtie2-build megahit_out_patient36/final.contigs.fa maxbin_out_patient36/contigs_index
bowtie2 -x maxbin_out_patient36/contigs_index -1 patient36_R1.fastq.gz -2 patient36_R2.fastq.gz -S maxbin_out_patient36/mapped_reads.sam
samtools view -S -b maxbin_out_patient36/mapped_reads.sam > maxbin_out_patient36/mapped_reads.bam
samtools sort maxbin_out_patient36/mapped_reads.bam -o maxbin_out_patient36/mapped_reads_sorted.bam
samtools depth maxbin_out_patient36/mapped_reads_sorted.bam | awk '{sum[$1] += $3; count[$1]++} END {for (contig in sum) print contig "\t" sum[contig]/count[contig]}' > maxbin_out_patient36/abundance_data.txt
run_MaxBin.pl -contig megahit_out_patient36/final.contigs.fa -abund maxbin_out_patient36/abundance_data.txt -out maxbin_out_patient36/maxbin_out_patient36.bin -thread 48
Inside your maxbin_out_patient** directory you will find your bins, labelled maxbin_out_patient**.bin.001.fasta, maxbin_out_patient**.bin.002.fasta, and so on.
Take a look at the other MaxBin output files (the file prefixes will differ for your sample):
maxbin_bins.log: the log of the MaxBin run
maxbin_bins.marker: information on the single-copy marker genes MaxBin uses to estimate completeness
maxbin_bins.marker_of_each_bin.tar.gz: the marker genes identified in each bin
maxbin_bins.noclass: contigs that could not be assigned to any bin
maxbin_bins.tooshort: contigs below the minimum length threshold (1000 bp by default)
maxbin_bins.summary: a per-bin summary, including estimated completeness, genome size, and GC content