Venue: QIB Board Room, 9.30 am. Tutor: Mark Pallen
Reminder: Please fill in the feedback form before you leave on Friday!
Today you are going to analyse the sequence reads you created earlier this week.
BUT to start with, we will have a walkthrough of the various steps in the analysis, working on sequences created by students last year. Then you will work in two or three groups to analyse all the sequences that you created for yourselves.
The steps in the workflow ahead of us include using:
- FASTQC and Trimmomatic to evaluate and improve the quality of short reads
- NanoPlot and Porechop to evaluate and improve the quality of long reads
- Kraken2, Krona and blast for taxonomic profiling of reads
- SPAdes to assemble short reads
- Flye to assemble long reads
- Unicycler to perform hybrid assembly

Open a Terminal on your Notebook server.
Sometimes Conda can run into issues with outdated metadata. It might make life easier if you clear the Conda cache and tell Conda which channels to use by default:
conda clean --all
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
Let’s get started by setting up a specific conda environment for this session and installing all the software we are going to need. If we include the names of software packages after the environment name, everything gets done in one go, which usually works better than installing the packages one at a time.
conda create -y -q -n seqanalysis fastqc trimmomatic kraken2 krona nanoplot porechop seqtk spades flye
conda activate seqanalysis
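If you want a quick sanity check that the tools are now on your PATH, most of them will report a version (a minimal check; the exact version strings will vary):
fastqc --version
kraken2 --version
flye --version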
Let’s create and use a directory for these example analyses:
mkdir seqanalysis_example
cd seqanalysis_example
One of the key features of the Jupyter Notebook servers run by CLIMB-BIG-DATA is access to shared directories. There are two of these: shared-team and shared-public.
You can take a look at them using ls or via the graphical file browser.
ls ../shared-public
ls ../shared-team
The shared-public directory is used to share key resources that lots of people want to use but which are tricky to download or install, including large libraries.
The shared-team directory is used for material that you want to share with your whole team. When you are using CLIMB-BIG-DATA for your own research, this means your research group. But in this case, we are using it to access training material and analyses carried out by the MMB DTP cohorts as part of their bioinformatics training.
Let’s grab sequences from two of the samples created by last year’s cohort.
We can grab FASTQ files from a sample containing short reads like this:
cp /home/jovyan/shared-team/year1_training/week2/sequence-data/DTP-2-3_S11_L001_R*_001.fastq.gz .
We can grab a FASTQ file containing long reads like this:
cp /home/jovyan/shared-team/year1_training/week2/long_reads/DTP_1_1_Nano.fastq.gz .
Let’s tidy things up by keeping short and long reads in different directories.
mkdir short_reads
mv *R* short_reads
mkdir long_reads
mv *Nano* long_reads
Let’s run trimmomatic on the short reads.
Move into the short_reads directory:
cd short_reads
As before, let’s copy across the adapters file
cp /home/jovyan/shared-team/conda/mpallen.mmb-dtp/seqanalysis/share/trimmomatic/adapters/NexteraPE-PE.fa .
We are using my version of the adapters file, which is associated with my version of the trimmomatic package. Feel free to explore /home/jovyan/shared-team/conda/ followed by your name to see your own directories there.
Now run trimmomatic:
trimmomatic PE -threads 4 DTP-2-3_S11_L001_R1_001.fastq.gz DTP-2-3_S11_L001_R2_001.fastq.gz DTP-2-3_S11_L001_R1_paired.fastq.gz DTP-2-3_S11_L001_R1_unpaired.fastq.gz DTP-2-3_S11_L001_R2_paired.fastq.gz DTP-2-3_S11_L001_R2_unpaired.fastq.gz ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:25
And then let’s run fastqc over all the fastq files.
fastqc *fastq*
Then look at what has happened to the file sizes:
ls -alSh
And if you work through the HTML files, you can see that the sequences were not too bad to start with, but improved a bit after trimming.
You will be doing this on all the files you created yourselves later today.
If we want to learn quickly what organisms are represented by the sequences in a run, we can use various approaches to taxonomic profiling. We will now have a brief PowerPoint presentation on this topic.
Let’s engage in some housekeeping by making directories for the QC we just did and for the kraken analyses. We will use kraken on the trimmed reads.
mkdir QC kraken
cp *_paired.fastq.gz kraken
mv *.* QC
Pre-computed Kraken2 databases are available on the /shared-public file system within CLIMB-BIG-DATA. These databases are downloaded from Ben Langmead’s publicly available Kraken2 indexes page. These datasets are updated monthly and we will keep the latest versions available.
The /shared-public area is designed to store frequently used, important databases for the microbial genomics community. We are just getting started building this resource, so please contact us with suggestions for other databases you would like to see here.
We can take a look at the databases that are available and their sizes:
du -h -d1 /home/jovyan/shared-public/db/kraken2
16G /shared-public/db/kraken2/pluspfp_8gb
152G /shared-public/db/kraken2/pluspf
16G /shared-public/db/kraken2/standard_8gb
743G /shared-public/db/kraken2/nt
341G /shared-public/db/kraken2/pluspfp
30G /shared-public/db/kraken2/pluspf_16gb
31G /shared-public/db/kraken2/pluspfp_16gb
16G /shared-public/db/kraken2/pluspf_8gb
142G /shared-public/db/kraken2/standard
11G /shared-public/db/kraken2/eupathDB46
20G /shared-public/db/kraken2/minusb
30G /shared-public/db/kraken2/standard_16gb
1.3G /shared-public/db/kraken2/viral
1.6T /shared-public/db/kraken2
The du command in Unix/Linux systems stands for “disk usage”. It is used to estimate the space used by files and directories. The options used here, -h and -d1, modify its behaviour as follows:
-h (human-readable): This option makes the output easier for humans to read by displaying the sizes in kilobytes (K), megabytes (M), or gigabytes (G), instead of raw byte counts.
-d1 (depth 1): This option restricts the depth of directories shown to 1 level. It means that it will only show the sizes of the top-level directories and files within the specified directory, without descending into subdirectories deeper than 1 level.
We can run kraken2 directly within this JupyterHub notebook, which is running in a container. A standard container has 8 CPU cores and 64 GB of memory. kraken2 doesn’t run well unless the database fits into memory, so for now we will use one of the smaller databases, such as k2_standard_8gb, which contains archaeal, bacterial, viral, plasmid, human and UniVec_Core sequences from RefSeq, subsampled down to an 8 GB database. This will be fast, but we trade off sensitivity and specificity against the bigger databases.
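You can check how much memory your container actually has with the standard Linux command:
free -h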
cd kraken
kraken2 --threads 8 --db /home/jovyan/shared-public/db/kraken2/standard_8gb/latest --paired --output DTP-2-3_reads_hits.txt --report DTP-2-3_reads_report.txt --use-names DTP-2-3_S11_L001_R1_paired.fastq.gz DTP-2-3_S11_L001_R2_paired.fastq.gz
The DTP-2-3_reads_report.txt file gives a supposedly human-readable output from Kraken2.
cat DTP-2-3_reads_report.txt
But most people would agree it is not all that human-readable!
The output from the Kraken2 report contains several columns:
- the percentage of reads assigned to the clade rooted at this taxon
- the number of reads assigned to that clade
- the number of reads assigned directly to this taxon
- a rank code (e.g. D for domain, P for phylum, G for genus, S for species)
- the NCBI taxonomy ID
- the scientific name
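For example, here is one way (a hypothetical one-liner, not part of the original workflow) to pull out the ten most abundant species-level assignments from the report with standard Unix tools:
# keep species-level rows (rank code S), sort by clade read count, show the top ten
awk -F'\t' '$4 == "S"' DTP-2-3_reads_report.txt | sort -t$'\t' -k2,2rn | head -10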
An alternative is to load a graphical display of the kraken results.
To do that we need to use krona, which requires us to run ktUpdateTaxonomy.sh before it can be used.
ktUpdateTaxonomy.sh
Now let’s run krona. Here, -t 5 points Krona at the taxonomy-ID column of the Kraken report and -m 3 at the column of reads assigned directly to each taxon:
ktImportTaxonomy -t 5 -m 3 DTP-2-3_reads_report.txt -o DTP-2-3_reads_KronaReport.html
Take a look in the graphical file viewer and you will see a file called DTP-2-3_reads_KronaReport.html. Click on it and it will open a window. Click on the “Trust HTML” option at the top left. Then navigate up and down the krona plot.
Navigating a krona display is quite intuitive: click on a wedge to zoom in on that taxon, and click the centre of the plot (or use the controls at the top left) to zoom back out.
What do we conclude about the taxonomic profile of the organisms that went into this sample?
Let’s try another way of finding out what is in the file, using the program blast.
Although we could install and interact with this very powerful program at the command line, it is tricky doing searches of the entire nt database this way, so we will use the web interface at NCBI instead.
But we cannot throw hundreds of thousands of reads at NCBI BLAST. So, first let’s extract the first hundred sequences from our FASTQ file and convert them into FASTA format using a program called seqtk. (Each FASTQ record occupies four lines, so the first 400 lines of the file hold the first 100 reads.)
mkdir ../blast
cp DTP-2-3_S11_L001_R1_paired.fastq.gz ../blast
cd ../blast
gunzip *gz
head -400 DTP-2-3_S11_L001_R1_paired.fastq > seqhead.fastq
seqtk seq -a seqhead.fastq > seqhead.fasta
Now cat the contents of seqhead.fasta or open it with the file viewer.
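cat seqhead.fasta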
Copy the contents and paste them into the blastn web page.
In the Organism box, type ‘Bacteria’ and, when it autofills, select ‘Bacteria (taxid:2)’.
Tick the ‘Show results in a new window’ option.
What do you see? What do you conclude? What is good and bad about this approach?
Let’s assemble the short reads using SPAdes.
mkdir /home/jovyan/seqanalysis_example/short_reads/spades
cd /home/jovyan/seqanalysis_example/short_reads/spades
cp /home/jovyan/seqanalysis_example/short_reads/QC/DTP-2-3_S11_L001_R*_paired.fastq.gz .
spades.py -1 DTP-2-3_S11_L001_R1_paired.fastq.gz -2 DTP-2-3_S11_L001_R2_paired.fastq.gz -o spades_out --threads 8 --memory 16
SPAdes (St. Petersburg genome assembler) is a popular tool used for assembling genomes, particularly for microbial and metagenomic datasets. When running SPAdes, the output folder contains several important files and directories that you might want to familiarise yourselves with. Key files include:
- contigs.fasta: one of the most important output files, containing the assembled contigs (continuous sequences). These are the primary assembled pieces of the genome.
- scaffolds.fasta: scaffolds are contigs that have been connected based on paired-end reads, though they may contain gaps.
- K-mer directories: these subfolders (e.g., K21, K33) contain intermediate assembly files generated at different k-mer sizes during the assembly process, part of SPAdes’ multi-kmer approach to improving assembly accuracy.
- assembly_graph.fastg: this file represents the assembly graph, which shows the relationships between contigs and is useful for understanding the assembly process, especially in complex regions of the genome.
- misc folder: contains log files and other data that help diagnose issues or track the parameters used during the assembly.

How can we judge the quality of the assembly quickly and easily?
Let’s just grab the headers for the contigs:
grep ">" spades_out/contigs.fasta
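If you want a couple of quick numbers without any extra software, standard Unix tools will do. A minimal sketch, using the seqtk package we installed earlier:
# count the contigs
grep -c ">" spades_out/contigs.fasta
# seqtk comp prints one line per sequence; column 2 is its length
seqtk comp spades_out/contigs.fasta | awk '{ total += $2 } END { print NR " contigs, " total " bp in total" }'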
There are programs like Quast that can give a more sophisticated view of assembly stats, but that wouldn’t install for me earlier in the week, so with help from ChatGPT I created a Python script that does some basic analysis. Let’s run that over our contigs.
/home/jovyan/shared-team/seqanalysis_2024/qc_assembly.py spades_out/contigs.fasta
Let’s break down the results from this assembly QC report to understand what each value means. One key metric is the N50: half of the total assembly length is contained in contigs of the N50 length or longer.
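If you want to cross-check the script’s N50 for yourself, here is a minimal sketch using seqtk and awk (an illustration, not part of the original workflow):
# sort contig lengths in descending order, then accumulate until half the total is reached
seqtk comp spades_out/contigs.fasta | awk '{ print $2 }' | sort -rn | awk '{ s += $1; l[NR] = $1 } END { t = s; s = 0; for (i = 1; i <= NR; i++) { s += l[i]; if (s >= t/2) { print "N50: " l[i]; exit } } }'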
Now let’s try running kraken over the assembly
kraken2 --threads 8 --db /home/jovyan/shared-public/db/kraken2/standard_8gb/latest --output kraken_assembly.hits.txt --report kraken_assembly.report.txt --use-names spades_out/contigs.fasta
What has happened to the results compared to what we saw with the reads?
Now we must turn our attention to the long reads.
cd /home/jovyan/seqanalysis_example/long_reads
We will use NanoPlot and Porechop.
While somewhat old-fashioned, these remain useful tools for learning the basics of Nanopore data processing and quality control. They help researchers understand important steps like adapter trimming (Porechop) and quality-metrics visualisation (NanoPlot). However, both tools have largely been superseded by more modern developments integrated into Guppy and MinKNOW:
- Guppy, Oxford Nanopore’s basecaller, handles both basecalling and quality control more efficiently.
- MinKNOW now includes real-time adapter filtering, reducing the need for external tools like Porechop.
Despite these advancements, learning these older tools provides foundational knowledge that is helpful for understanding how data processing has evolved and gives flexibility for handling legacy datasets or specialized workflows.
Let’s make a directory for the NanoPlot results and then run it
mkdir nanoplot_out
NanoPlot --fastq DTP_1_1_Nano.fastq.gz --plots kde dot --N50 -o nanoplot_out
Note that the command NanoPlot is mixed-case and case-sensitive.
Take a look through the output files, starting with NanoStats.txt and NanoPlot-report.html
Let’s now trim the reads using porechop
porechop -i DTP_1_1_Nano.fastq.gz -o DTP_1_1_Nano_chopped.fastq.gz
(If you find yourself in a fresh terminal at this point, remember to reactivate the seqanalysis conda environment and cd back to /home/jovyan/seqanalysis_example/long_reads.)
Once it has finished, let’s look at what we have
ls -alh
How much difference did PoreChop make to file sizes?
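One way to answer that is to count the reads before and after trimming (a hypothetical check; zcat decompresses to stdout and each FASTQ read occupies four lines):
echo "before: $(( $(zcat DTP_1_1_Nano.fastq.gz | wc -l) / 4 )) reads"
echo "after: $(( $(zcat DTP_1_1_Nano_chopped.fastq.gz | wc -l) / 4 )) reads"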
Let’s run kraken over these long reads.
kraken2 --threads 8 --db /home/jovyan/shared-public/db/kraken2/standard_8gb/latest --output nano.hits.txt --report nano_reads_report.txt --use-names DTP_1_1_Nano.fastq.gz
And then krona
ktImportTaxonomy -t 5 -m 3 nano_reads_report.txt -o nano_reads_KronaReport.html
What do we see now? Compare and contrast the view you get when looking at the nano_reads_report.txt file and the nano_reads_KronaReport.html file.
The requirements for optimal assembly of long reads are slightly different from those for short reads.
We will use the long-read assembler flye.
Why Flye is Preferred for Long Reads
Flye builds a repeat graph directly from raw, error-prone long reads, so it tolerates the high error rates of Nanopore data and can often resolve repeats that defeat short-read assemblers, frequently yielding complete or near-complete genomes.
Let’s make a directory to work in. Then copy the chopped nanopore reads across.
mkdir /home/jovyan/seqanalysis_example/long_reads/flye
cp /home/jovyan/seqanalysis_example/long_reads/DTP_1_1_Nano_chopped.fastq.gz /home/jovyan/seqanalysis_example/long_reads/flye
cd /home/jovyan/seqanalysis_example/long_reads/flye
Now let’s run flye
flye --nano-raw DTP_1_1_Nano_chopped.fastq.gz --genome-size 1.5m --out-dir flye_output --threads 8
Take a look inside the flye_output directory.
Summary of Flye Output
The key files are assembly.fasta (the final assembly), assembly_info.txt (a per-contig table of length, coverage and circularity), assembly_graph.gfa (the assembly graph) and flye.log.
Let’s just grab the headers for the contigs:
grep ">" flye_output/assembly.fasta
Let’s run the Python script again.
/home/jovyan/shared-team/seqanalysis_2024/qc_assembly.py flye_output/assembly.fasta
Now let’s run kraken and krona over the assembly.
kraken2 --threads 8 --db /home/jovyan/shared-public/db/kraken2/standard_8gb/latest --output nano_assembly_kraken_hits.txt --report nano_assembly_kraken_report.txt --use-names /home/jovyan/seqanalysis_example/long_reads/flye/flye_output/assembly.fasta
ktImportTaxonomy -t 5 -m 3 nano_assembly_kraken_report.txt -o nano_assembly_KronaReport.html
And what do we see in the krona display now?
The sequences you yourselves created should be available in /home/jovyan/shared-team/seqanalysis_2024
Your mission is to analyse all of those sequences, performing QC, taxonomic profiling and assembly using the approaches outlined above. Take care to maintain a tidy file and directory structure as you go.
You are required to write a report as a PDF document covering each sequence, detailing the QC of the reads, what’s in the samples and how well the assembly worked. Include images from image files or screen dumps if helpful. Include a short tabulated summary at the start of the report.
You should work in your pre-existing teams of two and produce three reports, one per team. Upload your reports into /home/jovyan/shared-team/seqanalysis_2024/reports
Stretch target: if you really race through all of this, then where short- and long-read sequences are available from the same sample, you could install unicycler and see whether hybrid assembly gives better results than assembling the short and long reads alone.
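A minimal sketch of how that might look (the read file names here are placeholders for whichever sample you choose):
conda install -y -n seqanalysis unicycler
# hybrid assembly: -1/-2 take the paired short reads, -l the long reads
unicycler -1 sample_R1_paired.fastq.gz -2 sample_R2_paired.fastq.gz -l sample_Nano_chopped.fastq.gz -o unicycler_out --threads 8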
Important: Please fill in the feedback form before you leave on Friday!