By the end of this session, you will be able to:
Open each bin and copy and paste the first dozen or so sequences into a BLAST window
Select “1” as the value in the “Max matches in query range” box to restrict results to one hit per input sequence. Use the “Download All” option on the second line and select “text” to get a handy summary of results.
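If you would rather not count sequences by hand, the "first dozen or so" can be pulled out on the command line. The function below is a minimal sketch: the name first_n_fasta and the output file name are hypothetical, and it assumes an ordinary FASTA file in which every record starts with a ">" header line.

```shell
# first_n_fasta FILE [N]: print the first N records (default 12) of a FASTA file.
# Hypothetical helper for this session; not part of MaxBin or BLAST.
first_n_fasta() {
    awk -v n="${2:-12}" '
        /^>/ { if (++count > n) exit }   # stop at the header of record n+1
        { print }
    ' "$1"
}

# Example (the bin file name is an assumption based on MaxBin naming):
# first_n_fasta maxbin_bins.001.fasta 12 > bin001_first12.fasta
```

The resulting file can then be opened and pasted into the BLAST window in one go.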
Now decompress the .marker_of_each_bin.tar.gz file, e.g.
tar -xzvf maxbin_out_patient02_day01.marker_of_each_bin.tar.gz
You will now get files called something like maxbin_out_patient02_day01.001.marker.fasta.
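Before pasting a whole marker file into BLASTP, it can help to know how many protein sequences each file holds. This is a small sketch with a hypothetical function name, count_fasta_records. It simply counts ">" header lines.

```shell
# count_fasta_records FILE...: print "file<TAB>number of sequences" for each FASTA file.
# Hypothetical helper; it only counts header lines and does not validate the sequences.
count_fasta_records() {
    for f in "$@"; do
        printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
    done
}

# Example, run in the directory where the archive was unpacked:
# count_fasta_records *.marker.fasta
```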
Open each file and do a BLASTP search with all the marker protein sequences in it. As before, select “1” as the value in the “Max matches in query range” box to restrict results to one hit per input sequence.
Use the “Download All” option on the second line and select “text” to get a handy summary of results.
CheckM is a bioinformatics tool used for assessing the quality of metagenome-assembled genomes (MAGs) or genomic bins by evaluating their completeness and contamination. It employs a lineage-specific workflow that leverages a database of single-copy marker genes to estimate these metrics.
Completeness is determined by checking the presence of essential marker genes that are expected to be found in single copies within a complete genome. A higher proportion of these marker genes indicates a more complete genome.
Contamination, on the other hand, is assessed by identifying instances where single-copy marker genes appear multiple times, suggesting the presence of mixed genomic material or fragments from different organisms.
Low contamination (typically below 5%) and high completeness (often above 90% for high-quality MAGs) are desirable indicators of a well-assembled, reliable genome. CheckM’s results help researchers determine the quality of their genomic bins and guide decisions on further analysis or refinement.
The completeness and contamination estimates from CheckM are often combined into a single quality score. This score balances the two desirable traits of high completeness and low contamination, and lets researchers judge the reliability of a MAG at a glance.
The quality of a MAG can be evaluated using a formula that incorporates both completeness and contamination. A commonly used approach is:
\[ \text{Quality Score} = \text{Completeness} - (\text{Contamination} \times \text{Penalty Factor}) \]
This metric allows researchers to quickly compare bins and decide which ones are worth focusing on for downstream analysis, as it takes into account both the presence of essential genes and the purity of the genomic bin. High-quality MAGs typically have a high completeness (e.g., ≥90%) and low contamination (e.g., ≤5%), leading to a strong quality score that indicates confidence in the assembly’s reliability.
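As a minimal sketch of this calculation, the shell function below applies the formula with awk. The name quality_score is hypothetical, and the default penalty factor of 5 is an assumption: it is a value often used for MAGs, but the text above does not fix one, so pass your own as the third argument if you prefer.

```shell
# quality_score COMPLETENESS CONTAMINATION [PENALTY]
# Prints completeness - penalty * contamination, to two decimal places.
# The default penalty of 5 is an assumption, not something CheckM itself reports.
quality_score() {
    awk -v comp="$1" -v cont="$2" -v pen="${3:-5}" \
        'BEGIN { printf "%.2f\n", comp - pen * cont }'
}

# Example with made-up values: 95% complete, 2% contaminated
# quality_score 95 2      # prints 85.00
```

A heavily contaminated bin can end up with a negative score, which is a quick signal that it needs refinement before any downstream analysis.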
Run CheckM over all the bins in a MaxBin output directory:
checkm lineage_wf ./. checkm_output/ -t 16 -x fasta
The output will look something like this:
[2024-11-12 08:14:42] INFO: CheckM v1.2.3
[2024-11-12 08:14:43] INFO: checkm lineage_wf ./. checkm_output/ -t 16 -x fasta
[2024-11-12 08:14:43] INFO: CheckM data: /shared/team/conda/mpallen.mmb-dtp/seqanalysis/checkm_data
[2024-11-12 08:14:43] INFO: [CheckM - tree] Placing bins in reference genome tree.
[2024-11-12 08:14:43] WARNING: Expected all files to contain sequences in nucleotide space.
[2024-11-12 08:14:43] WARNING: File ././maxbin_bins.001.marker.fasta appears to contain amino acids sequences.
[2024-11-12 08:14:43] WARNING: Expected all files to contain sequences in nucleotide space.
[2024-11-12 08:14:43] WARNING: File ././maxbin_bins.002.marker.fasta appears to contain amino acids sequences.
[2024-11-12 08:14:43] WARNING: Expected all files to contain sequences in nucleotide space.
[2024-11-12 08:14:43] WARNING: File ././maxbin_bins.003.marker.fasta appears to contain amino acids sequences.
[2024-11-12 08:14:43] INFO: Identifying marker genes in 7 bins with 16 threads:
Finished processing 7 of 7 (100.00%) bins.
[2024-11-12 08:14:51] INFO: Saving HMM info to file.
[2024-11-12 08:14:51] INFO: Calculating genome statistics for 7 bins with 16 threads:
Finished processing 7 of 7 (100.00%) bins.
[2024-11-12 08:14:51] INFO: Extracting marker genes to align.
[2024-11-12 08:14:51] INFO: Parsing HMM hits to marker genes:
Finished parsing hits for 7 of 7 (100.00%) bins.
[2024-11-12 08:14:51] INFO: Extracting 43 HMMs with 16 threads:
Finished extracting 43 of 43 (100.00%) HMMs.
[2024-11-12 08:14:51] INFO: Aligning 43 marker genes with 16 threads:
Finished aligning 43 of 43 (100.00%) marker genes.
[2024-11-12 08:14:52] INFO: Reading marker alignment files.
[2024-11-12 08:14:52] INFO: Concatenating alignments.
[2024-11-12 08:14:52] INFO: Placing 7 bins into the genome tree with pplacer (be patient).
[2024-11-12 08:18:33] INFO: { Current stage: 0:03:50.463 || Total: 0:03:50.463 }
[2024-11-12 08:18:33] INFO: [CheckM - lineage_set] Inferring lineage-specific marker sets.
[2024-11-12 08:18:33] INFO: Reading HMM info from file.
[2024-11-12 08:18:33] INFO: Parsing HMM hits to marker genes:
Finished parsing hits for 7 of 7 (100.00%) bins.
[2024-11-12 08:18:33] INFO: Determining marker sets for each genome bin.
Finished processing 10 of 10 (100.00%) bins (current: maxbin_bins.001).
[2024-11-12 08:18:34] INFO: Marker set written to: checkm_output/lineage.ms
[2024-11-12 08:18:34] INFO: { Current stage: 0:00:01.467 || Total: 0:03:51.931 }
[2024-11-12 08:18:34] INFO: [CheckM - analyze] Identifying marker genes in bins.
[2024-11-12 08:18:34] WARNING: Expected all files to contain sequences in nucleotide space.
[2024-11-12 08:18:34] WARNING: File ././maxbin_bins.001.marker.fasta appears to contain amino acids sequences.
[2024-11-12 08:18:34] WARNING: Expected all files to contain sequences in nucleotide space.
[2024-11-12 08:18:34] WARNING: File ././maxbin_bins.002.marker.fasta appears to contain amino acids sequences.
[2024-11-12 08:18:35] WARNING: Expected all files to contain sequences in nucleotide space.
[2024-11-12 08:18:35] WARNING: File ././maxbin_bins.003.marker.fasta appears to contain amino acids sequences.
[2024-11-12 08:18:35] INFO: Identifying marker genes in 7 bins with 16 threads:
Finished processing 7 of 7 (100.00%) bins.
[2024-11-12 08:20:39] INFO: Saving HMM info to file.
[2024-11-12 08:20:39] INFO: { Current stage: 0:02:04.894 || Total: 0:05:56.826 }
[2024-11-12 08:20:39] INFO: Parsing HMM hits to marker genes:
Finished parsing hits for 7 of 7 (100.00%) bins.
[2024-11-12 08:20:40] INFO: Aligning marker genes with multiple hits in a single bin:
Finished processing 7 of 7 (100.00%) bins.
[2024-11-12 08:20:47] INFO: { Current stage: 0:00:07.710 || Total: 0:06:04.536 }
[2024-11-12 08:20:47] INFO: Calculating genome statistics for 7 bins with 16 threads:
Finished processing 7 of 7 (100.00%) bins.
[2024-11-12 08:20:47] INFO: { Current stage: 0:00:00.300 || Total: 0:06:04.836 }
[2024-11-12 08:20:47] INFO: [CheckM - qa] Tabulating genome statistics.
[2024-11-12 08:20:47] INFO: Calculating AAI between multi-copy marker genes.
[2024-11-12 08:20:48] INFO: Reading HMM info from file.
[2024-11-12 08:20:48] INFO: Parsing HMM hits to marker genes:
Finished parsing hits for 7 of 7 (100.00%) bins.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bin Id            Marker lineage                 # genomes   # markers   # marker sets   0     1     2     3    4    5+   Completeness   Contamination   Strain heterogeneity
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
maxbin_bins.002   k__Bacteria (UID203)           5449        104         58              21    56    25    1    1    0    76.61          16.60           0.00
maxbin_bins.003   g__Klebsiella (UID5140)        31          1312        336             601   574   119   15   3    0    54.41          9.71            2.31
maxbin_bins.001   f__Lachnospiraceae (UID1256)   33          333         171             192   129   12    0    0    0    41.31          2.38            16.67
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[2024-11-12 08:20:49] INFO: { Current stage: 0:00:01.745 || Total: 0:06:06.582 }
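The repeated WARNING lines in this log arise because -x fasta matches every file ending in .fasta, including the protein *.marker.fasta files unpacked earlier into the same directory. One way to avoid this, sketched below, is to copy only the genome bins into a separate input directory before running CheckM. The function name stage_bins and the directory name checkm_input are hypothetical.

```shell
# stage_bins SRC_DIR DEST_DIR: copy genome bins (*.fasta) into DEST_DIR,
# skipping the protein marker files (*.marker.fasta).
# Hypothetical helper; assumes bins and marker files share one directory.
stage_bins() {
    mkdir -p "$2" || return 1
    for f in "$1"/*.fasta; do
        [ -e "$f" ] || continue            # the glob matched nothing
        case "$f" in
            *.marker.fasta) continue ;;    # amino acid sequences, not CheckM input
        esac
        cp "$f" "$2/"
    done
}

# Example, reusing the options from the command shown earlier:
# stage_bins . checkm_input
# checkm lineage_wf checkm_input checkm_output/ -t 16 -x fasta
```

Run this way, CheckM should only see nucleotide bins, so the marker files no longer appear as spurious "bins" in the results.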
The CheckM output log above includes several processing stages, warnings, and a summary table of results. Here is a breakdown of what CheckM is doing and what each part of the output means:
[2024-11-12 08:14:42] INFO: CheckM v1.2.3: Indicates the version of CheckM being used.
[2024-11-12 08:14:43] INFO: checkm lineage_wf ./. checkm_output/ -t 16 -x fasta: The lineage_wf command is being run to evaluate genome bins. The parameters specify the input directory (./.), the output directory (checkm_output/), the use of 16 CPU threads (-t 16), and that the input files have the .fasta extension (-x fasta).
[2024-11-12 08:14:43] WARNING: Expected all files to contain sequences in nucleotide space: Indicates that some input files do not contain nucleotide sequences as expected but may contain amino acid sequences instead.
WARNING: File ... appears to contain amino acids sequences (maxbin_bins.001.marker.fasta, etc.): Warns that specific marker files appear to contain amino acid sequences, which means that non-nucleotide data were used as input. Here the cause is that -x fasta also matches the protein marker files (*.marker.fasta) sitting in the same directory as the bins.
[2024-11-12 08:14:43] INFO: [CheckM - tree] Placing bins in reference genome tree: CheckM places the bins in a reference genome tree to identify their taxonomic lineages.
[2024-11-12 08:14:43] INFO: Identifying marker genes in 7 bins: CheckM uses Hidden Markov Models (HMMs) to find marker genes in the bins, which are crucial for determining completeness and contamination.
[2024-11-12 08:14:51] INFO: Aligning 43 marker genes with 16 threads: Aligns marker genes to refine genome bin quality analysis.
[2024-11-12 08:20:47] INFO: Calculating genome statistics for 7 bins: Calculates completeness and contamination based on the presence and redundancy of single-copy marker genes.
The columns of the summary table are:
Bin Id: The name of each bin being analysed.
Marker lineage: The predicted taxonomic lineage based on marker genes.
# genomes, # markers, # marker sets: Counts indicating how many genomes, total markers, and marker sets were considered for analysis.
0, 1, 2, 3, 4, 5+: The number of markers found that many times in the bin (0 means missing, 1 is the expected single copy, 2 or more suggests contamination).
Completeness: The percentage of expected single-copy marker genes found in the bin. A higher value indicates a more complete genome.
Contamination: The percentage of redundant single-copy marker genes, suggesting contamination. A lower value is better.
Strain heterogeneity: Indicates variation in marker genes that may suggest strain-level diversity within the bin. Values close to 0.00 are ideal, as they indicate minimal heterogeneity.
Interpreting these results:
The warnings about the marker.fasta files suggest there is an issue with the input files. Ensure that the inputs are nucleotide sequences, as CheckM is designed to work with DNA, not protein sequences.
maxbin_bins.002 has reasonable completeness (76.61%) but high contamination (16.60%), suggesting it may represent a mixed bin or contain sequences from different organisms.
maxbin_bins.003 has moderate completeness (54.41%) and contamination (9.71%).
The bins labelled as .marker (not shown in the table above) have 0.00% completeness and contamination, indicating that they do not contain enough data for meaningful analysis or might be erroneous.
maxbin_bins.001 shows low completeness (41.31%) but low contamination (2.38%), indicating it is a partial bin with little contamination but limited coverage.
CheckM produces several output files during the analysis of genome bins to assess their completeness and contamination. These files provide detailed insights into the quality and characteristics of the bins. Here is an overview of the typical CheckM output files:
lineage.ms: The lineage-specific marker sets inferred for each bin (the log above reports "Marker set written to: checkm_output/lineage.ms").
bin_stats_ext.tsv
storage/tree
markers directory
checkm.log
storage directory
Summary reports (*.txt or custom report files): Depending on the checkm qa command or additional commands run, you might have custom-generated summary reports.
So what do you conclude from your CheckM results? How do these results measure up against what you found with BLAST?
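To compare bins against the thresholds discussed earlier (at least 90% completeness and at most 5% contamination), the summary table can be reduced to a few columns. The sketch below assumes you saved the CheckM screen output to a text file. The file name checkm_summary.txt and the function name checkm_flags are hypothetical. It reads the last three columns from the right-hand end of each row, because the marker lineage column contains a space.

```shell
# checkm_flags FILE: print bin, completeness, contamination and a threshold flag
# for every row of a saved CheckM summary table.
# Hypothetical helper; the 90/5 cut-offs are the ones quoted in this session.
checkm_flags() {
    awk '
        /^\[/ || /Bin Id/ || NF < 14 { next }   # skip log lines, header and rules
        {
            comp = $(NF - 2) + 0                 # third column from the right
            cont = $(NF - 1) + 0                 # second column from the right
            flag = (comp >= 90 && cont <= 5) ? "high-quality" : "below-threshold"
            printf "%s\t%s\t%s\t%s\n", $1, $(NF - 2), $(NF - 1), flag
        }
    ' "$1"
}

# Example:
# checkm_flags checkm_summary.txt
```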
So, have we got credible MAGs? One additional line of attack is to run Kraken on all our bins, with a command something like this:
kraken2 --threads 32 --db /home/jovyan/shared-public/db/kraken2/pluspf_8gb/latest --output bins.001_kraken_hits.txt --report bins.001_kraken_report.txt --use-names maxbin_bins.001.fasta
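To cover every bin rather than just one, the same command can be generated in a loop. The sketch below only prints the commands so that you can check them before anything runs. The function name kraken_commands is hypothetical, and the output names are derived from each bin's file name, which differs slightly from the bins.001_... names used above. It assumes file names without spaces.

```shell
# kraken_commands DB FILE...: print one kraken2 command per bin, skipping marker files.
# Hypothetical helper; the flags are the ones used in the single-bin command above.
kraken_commands() {
    db=$1
    shift
    for f in "$@"; do
        case "$f" in
            *.marker.fasta) continue ;;    # protein marker files are not bins
        esac
        base=$(basename "$f" .fasta)
        printf 'kraken2 --threads 32 --db %s --output %s_kraken_hits.txt --report %s_kraken_report.txt --use-names %s\n' \
            "$db" "$base" "$base" "$f"
    done
}

# Review the printed commands first, then pipe them to sh to run them:
# kraken_commands /home/jovyan/shared-public/db/kraken2/pluspf_8gb/latest maxbin_bins.*.fasta
# kraken_commands /home/jovyan/shared-public/db/kraken2/pluspf_8gb/latest maxbin_bins.*.fasta | sh
```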
What do you conclude now?
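One way to approach this question is to list the species that account for most of the contigs in each bin. The sketch below assumes the standard six-column, tab-separated Kraken 2 report (percentage, clade count, direct count, rank code, taxid, name). The function name top_species is hypothetical.

```shell
# top_species REPORT [N]: print the N species (default 5) with the most assigned
# sequences in a Kraken 2 report, as "percent<TAB>count<TAB>name".
# Hypothetical helper; assumes the default six-column report layout.
top_species() {
    awk -F '\t' '
        $4 == "S" {                        # rank code S marks species-level rows
            name = $6
            sub(/^ +/, "", name)           # names are indented to show tree depth
            printf "%s\t%s\t%s\n", $1 + 0, $2, name
        }
    ' "$1" | sort -t "$(printf '\t')" -k2,2nr | head -n "${2:-5}"
}

# Example:
# top_species bins.001_kraken_report.txt 3
```

A bin dominated by a single species is more credible as a MAG than one whose contigs are spread across several unrelated taxa.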
How does what we are seeing in MAGs measure up with what we know from profiling the metagenomes?
Take a quick look again at the NCBI’s Krona plot on your sample.
That’s all for today. Tomorrow, we will analyse some of the MAGs in terms of phylogeny and function.