Credit: this material has been adapted from the Data Carpentry lesson Data Wrangling and Processing for Genomics. The adaptations principally include making sure commands work on the CLIMB-BIG-DATA Notebook servers. All Carpentries instructional material, and therefore all material here, is made available under the Creative Commons Attribution license. See https://github.com/datacarpentry/wrangling-genomics/blob/main/LICENSE.md.
The data we are going to use is part of a long-term evolution experiment led by Richard Lenski.
The experiment was designed to assess adaptation in E. coli. A population was propagated for more than 40,000 generations in a glucose-limited minimal medium (in most conditions glucose is the best carbon source for E. coli, providing faster growth than other sugars). This medium was supplemented with citrate, which E. coli cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points revealed that a spontaneous citrate-using variant (Cit+) appeared between 31,000 and 31,500 generations, causing an increase in population size and diversity. In addition, this experiment showed hypermutability in certain regions. Hypermutability is important and can help accelerate adaptation to novel environments, but can also be selected against in well-adapted populations.
To see a timeline of the experiment to date, check out this figure, and this paper Blount et al. 2008: Historical contingency and the evolution of a key innovation in an experimental population of Escherichia coli.
When working with high-throughput sequencing data, the raw reads you get off of the sequencer will need to pass through a number of different tools in order to generate your final desired output. The execution of this set of tools in a specified order is commonly referred to as a workflow or a pipeline.
An example of the workflow we will be using for our variant calling analysis is provided below with a brief description of each step.
These workflows in bioinformatics adopt a plug-and-play approach in that the output of one tool can be easily used as input to another tool without any extensive configuration. Having standards for data formats is what makes this feasible. Standards ensure that data is stored in a way that is generally accepted and agreed upon within the community. The tools that are used to analyze data at different stages of the workflow are therefore built under the assumption that the data will be provided in a specific format.
During sequencing, the nucleotide bases in a DNA or RNA sample (library) are determined by the sequencer. For each fragment in the library, a sequence is generated, also called a read, which is simply a succession of nucleotides. The sequence of a read is represented by a string of letters, where each letter represents a nucleotide base. The most common nucleotide bases are adenine (A), cytosine (C), guanine (G), and thymine (T). In RNA, thymine is replaced by uracil (U).
Sequencing instruments may also give ambiguous signals, which are represented by the letters R, Y, S, W, K, M, B, D, H, V, and N. This convention was set by the International Union of Pure and Applied Chemistry (IUPAC) in 1985. These letters represent the following combinations of nucleotides:
IUPAC nucleotide code | Base |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
T (or U) | Thymine (or Uracil) |
R | A or G |
Y | C or T |
S | G or C |
W | A or T |
K | G or T |
M | A or C |
B | C or G or T |
D | A or G or T |
H | A or C or T |
V | A or C or G |
N | any base |
. or - | gap |
N means the base could not be determined. This is different from a gap, which means the base is not present in the sequence.
Read Johnson 2010 for more details.
Quick task: Convert the following sequences to all possible combinations
ATGRCTAYCGTG
ATGNATC-GTG
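If you want a quick way to check answers like these at the shell, bash brace expansion can enumerate combinations for you. Below is a minimal sketch using a made-up sequence, ACRT (not one of the task sequences), where R can be A or G; listing both options inside braces makes bash print every combination:
$ echo AC{A,G}T
ACAT ACGT
A sequence with several ambiguity codes just needs several brace groups, and the number of combinations multiplies accordingly.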
While we talk about file “formats” for sequencing data, these are not the same as the file formats used by commercial programs such as MS Word, or other binary file formats. They are simply text files with a particular structure, which means you can open and read them as plain text even if the file extension is .fastq, .fasta and so on.
Here are some common genomics sequence file formats, some of which we have already worked with or will soon be working with.
Format | Description |
---|---|
FASTA | A text-based format for representing nucleotide or protein sequences. Each sequence is represented by a header line starting with ‘>’, followed by the sequence data. |
FASTQ | A format for representing both nucleotide sequences and their corresponding quality scores. Each record contains a sequence and quality scores in a readable text format. |
SAM (Sequence Alignment/Map) | A tab-delimited text format for storing sequence alignment data, often used for mapping short reads to a reference genome. |
BAM (Binary Alignment/Map) | A binary version of the SAM format, which is more compact and efficient for large datasets. Used for storing sequence alignment data. |
VCF (Variant Call Format) | A text-based format for representing genetic variations, including single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variants. |
BCF (Binary Call Format) | A binary version of the VCF format, which is more compact and efficient for large datasets. |
BED (Browser Extensible Data) | A text-based format for representing genomic intervals, such as regions of interest, gene annotations, and functional elements. |
GFF/GTF (General Feature Format/General Transfer Format) | Text-based formats for representing genomic features, including genes, exons, and other structural elements. GFF is often used in older annotations, while GTF is commonly used in more recent annotations. |
FAST5 | A format used by Oxford Nanopore Technologies (ONT) for storing raw sequencing data produced by their nanopore sequencing platforms. |
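To make the first row of this table concrete, here is a minimal, made-up FASTA record (both the header and the sequence are invented for illustration); the FASTQ format is examined in detail later in this lesson:
>example_seq_1 an invented 20 bp fragment
ATGGCTAGCTAACTGACCGT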
Often, the first step in a bioinformatic workflow is getting the data you want to work with onto a computer where you can work with it. If you have outsourced sequencing of your data, the sequencing center will usually provide you with a link that you can use to download your data. Today we will be working with publicly available sequencing data.
We are studying a population of Escherichia coli (designated Ara-3), which was propagated for more than 50,000 generations in a glucose-limited minimal medium. We will be working with three samples from this experiment: one from 5,000 generations, one from 15,000 generations, and one from 50,000 generations. The population changed substantially during the course of the experiment, and we will be exploring how with our variant calling workflow.
But before we download the data, let’s spend a minute explaining shotgun sequencing.
Shotgun sequencing is a method used to determine the DNA sequence of an organism’s genome. In this approach, the entire genome is broken into small, random fragments, which are then sequenced individually. The term “shotgun” comes from the randomness of the fragmentation, similar to how shotgun pellets scatter. After sequencing, these small fragments are computationally pieced together by finding overlapping regions between the fragments. This method is highly efficient for large genomes because it allows parallel sequencing of different regions of the genome, leading to faster results compared to other sequencing methods.
Paired-end data is a specific type of data produced in sequencing where both ends of a DNA fragment are sequenced. In paired-end sequencing, DNA fragments are sequenced from both directions, producing two reads: one from each end. The distance between these two reads is often known, which provides valuable information for assembling genomes, especially when dealing with repetitive regions or complex genomes. Paired-end sequencing improves accuracy in genome assembly and makes it easier to align sequences to reference genomes, especially when working with longer DNA fragments.
The data are paired-end, so we will download two files for each sample. We will use the European Nucleotide Archive to get our data. The ENA “provides a comprehensive record of the world’s nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation.” The ENA also provides sequencing data in the fastq format, an important format for sequencing reads that we will be learning about today.
To download the data, run the commands below. But before you do anything, make sure you are not in the base conda environment, but in the bioinfo_fun environment.
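If you are unsure which environment is currently active, the following conda commands are a minimal sketch for checking and switching (the environment name bioinfo_fun is the one used earlier in this course):
$ conda env list          # the active environment is marked with an asterisk
$ conda activate bioinfo_fun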
Here we are using the -p option for mkdir. This option allows mkdir to create the new directory even if one of the parent directories does not already exist. It also suppresses errors if the directory already exists, without overwriting that directory.
mkdir -p ~/dc_workshop/data/untrimmed_fastq/
cd ~/dc_workshop/data/untrimmed_fastq
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_2.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_2.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_2.fastq.gz
The data comes in a compressed format, which is why there is a .gz at the end of the file names. This makes it faster to transfer and allows it to take up less space on our computer. Let’s unzip one of the files so that we can look at the fastq format.
$ gunzip SRR2584863_1.fastq.gz
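As an aside, you do not always need to decompress a file just to look inside it. zcat streams the decompressed contents to standard output, so you can pipe it into head; a minimal sketch using one of the files that is still gzipped:
$ zcat SRR2584863_2.fastq.gz | head -n 4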
We will now assess the quality of the sequence reads contained in our fastq files.
(Figure: the quality control step of the variant calling workflow.)
Although it looks complicated (and it is), we can understand the fastq format with a little decoding. Some rules about the format include…
Line | Description |
---|---|
1 | Always begins with ‘@’ and then information about the read |
2 | The actual DNA sequence |
3 | Always begins with a ‘+’ and sometimes contains the same information as line 1 |
4 | Has a string of characters which represent the quality scores; must have same number of characters as line 2 |
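Because every record occupies exactly four lines, you can count the reads in a FASTQ file by dividing its line count by four. A minimal sketch, using the file we just unzipped:
$ echo $(( $(wc -l < SRR2584863_1.fastq) / 4 ))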
We can view the first complete read in one of the files in our dataset by using head to look at the first four lines.
$ head -n 4 SRR2584863_1.fastq
@SRR2584863.1 HWI-ST957:244:H73TDADXX:1:1101:4712:2181/1
TTCACATCCTGACCATTCAGTTGAGCAAAATAGTTCTTCAGTGCCTGTTTAACCGAGTCACGCAGGGGTTTTTGGGTTACCTGATCCTGAGAGTTAACGGTAGAAACGGTCAGTACGTCAGAATTTACGCGTTGTTCGAACATAGTTCTG
+
CCCFFFFFGHHHHJIJJJJIJJJIIJJJJIIIJJGFIIIJEDDFEGGJIFHHJIJJDECCGGEGIIJFHFFFACD:BBBDDACCCCAA@@CA@C>C3>@5(8&>C:9?8+89<4(:83825C(:A#########################
Line 4 shows the quality for each nucleotide in the read. Quality is interpreted as the probability of an incorrect base call (e.g. 1 in 10) or, equivalently, the base call accuracy (e.g. 90%). To make it possible to line up each individual nucleotide with its quality score, the numerical score is converted into a code where each individual character represents the numerical quality score for an individual nucleotide. For example, in the line above, the quality score line is:
CCCFFFFFGHHHHJIJJJJIJJJIIJJJJIIIJJGFIIIJEDDFEGGJIFHHJIJJDECCGGEGIIJFHFFFACD:BBBDDACCCCAA@@CA@C>C3>@5(8&>C:9?8+89<4(:83825C(:A#########################
The numerical value assigned to each of these characters depends on the sequencing platform that generated the reads. The sequencing machine used to generate our data uses the standard Sanger quality PHRED score encoding (used by Illumina from version 1.8 onwards). Each character is assigned a quality score between 0 and 41 as shown in the chart below.
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
                   |         |         |         |         |
Quality score:    01........11........21........31........41
Each quality score represents the probability that the corresponding nucleotide call is incorrect. This quality score is logarithmically based, so a quality score of 10 reflects a base call accuracy of 90%, but a quality score of 20 reflects a base call accuracy of 99%. These probability values are the results from the base calling algorithm and depend on how much signal was captured for the base incorporation.
Looking back at our read:
@SRR2584863.1 HWI-ST957:244:H73TDADXX:1:1101:4712:2181/1
TTCACATCCTGACCATTCAGTTGAGCAAAATAGTTCTTCAGTGCCTGTTTAACCGAGTCACGCAGGGGTTTTTGGGTTACCTGATCCTGAGAGTTAACGGTAGAAACGGTCAGTACGTCAGAATTTACGCGTTGTTCGAACATAGTTCTG
+
CCCFFFFFGHHHHJIJJJJIJJJIIJJJJIIIJJGFIIIJEDDFEGGJIFHHJIJJDECCGGEGIIJFHFFFACD:BBBDDACCCCAA@@CA@C>C3>@5(8&>C:9?8+89<4(:83825C(:A#########################
we can now see that there is a range of quality scores, but that the end of the sequence is very poor (# = a quality score of 2).
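Quality characters are just printable ASCII: for this Sanger/Illumina 1.8+ encoding, the quality score is the character’s ASCII code minus 33, and the error probability is 10^(-Q/10). The following is a minimal bash sketch that decodes a hand-picked handful of characters from the quality line above:
$ qual='CCCF#'
$ for ((i=0; i<${#qual}; i++))
do
c=${qual:i:1}
printf '%s is Q%d\n' "$c" $(( $(printf '%d' "'$c") - 33 ))
done
C is Q34
C is Q34
C is Q34
F is Q37
# is Q2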
Quick question: What is the last read in the SRR2584863_1.fastq file? How confident are you in this read?
At this point, let’s install the program fastqc
conda install fastqc
Let’s check it has installed correctly.
$ fastqc -h
FastQC - A high throughput sequence QC analysis tool
SYNOPSIS
fastqc seqfile1 seqfile2 .. seqfileN
fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
[-c contaminant file] seqfile1 .. seqfileN
DESCRIPTION
FastQC reads a set of sequence files and produces from each one a quality
control report consisting of a number of different modules, each one of
which will help to identify a different potential type of problem in your
data.
If no files to process are specified on the command line then the program
will start as an interactive graphical application. If files are provided
on the command line then the program will run with no user interaction
required. In this mode it is suitable for inclusion into a standardised
analysis pipeline.
The options for the program as as follows:
-h --help Print this help file and exit
-v --version Print the version of the program and exit
-o --outdir Create all output files in the specified output directory.
Please note that this directory must exist as the program
will not create it. If this option is not set then the
output file for each sequence file is created in the same
directory as the sequence file which was processed.
--casava Files come from raw casava output. Files in the same sample
group (differing only by the group number) will be analysed
as a set rather than individually. Sequences with the filter
flag set in the header will be excluded from the analysis.
Files must have the same names given to them by casava
(including being gzipped and ending with .gz) otherwise they
will not be grouped together correctly.
--nano Files come from naopore sequences and are in fast5 format. In
this mode you can pass in directories to process and the program
will take in all fast5 files within those directories and produce
a single output file from the sequences found in all files.
--nofilter If running with --casava then don't remove read flagged by
casava as poor quality when performing the QC analysis.
--extract If set then the zipped output file will be uncompressed in
the same directory after it has been created. By default
this option will be set if fastqc is run in non-interactive
mode.
-j --java Provides the full path to the java binary you want to use to
launch fastqc. If not supplied then java is assumed to be in
your path.
--noextract Do not uncompress the output file after creating it. You
should set this option if you do not wish to uncompress
the output when running in non-interactive mode.
--nogroup Disable grouping of bases for reads >50bp. All reports will
show data for every base in the read. WARNING: Using this
option will cause fastqc to crash and burn if you use it on
really long reads, and your plots may end up a ridiculous size.
You have been warned!
-f --format Bypasses the normal sequence file format detection and
forces the program to use the specified format. Valid
formats are bam,sam,bam_mapped,sam_mapped and fastq
-t --threads Specifies the number of files which can be processed
simultaneously. Each thread will be allocated 250MB of
memory so you shouldn't run more threads than your
available memory will cope with, and not more than
6 threads on a 32 bit machine
-c Specifies a non-default file which contains the list of
--contaminants contaminants to screen overrepresented sequences against.
The file must contain sets of named contaminants in the
form name[tab]sequence. Lines prefixed with a hash will
be ignored.
-a Specifies a non-default file which contains the list of
--adapters adapter sequences which will be explicity searched against
the library. The file must contain sets of named adapters
in the form name[tab]sequence. Lines prefixed with a hash
will be ignored.
-l Specifies a non-default file which contains a set of criteria
--limits which will be used to determine the warn/error limits for the
various modules. This file can also be used to selectively
remove some modules from the output all together. The format
needs to mirror the default limits.txt file found in the
Configuration folder.
-k --kmers Specifies the length of Kmer to look for in the Kmer content
module. Specified Kmer length must be between 2 and 10. Default
length is 7 if not specified.
-q --quiet Supress all progress messages on stdout and only report errors.
-d --dir Selects a directory to be used for temporary files written when
generating report images. Defaults to system temp directory if
not specified.
BUGS
Any bugs in fastqc should be reported either to simon.andrews@babraham.ac.uk
or in www.bioinformatics.babraham.ac.uk/bugzilla/
In real life, you will not be assessing the quality of your reads by visually inspecting your FASTQ files. Rather, you will be using a software program to assess read quality and filter out poor quality reads. We will first use a program called FastQC to visualize the quality of our reads. Later in our workflow, we will use another program to filter out poor quality reads.
FastQC has a number of features which can give you a quick impression of any problems your data may have, so you can take these issues into consideration before moving forward with your analyses. Rather than looking at quality scores for each individual read, FastQC looks at quality collectively across all reads within a sample. The image below shows one FastQC-generated plot that indicates a very high quality sample:
(Figure: example FastQC per base sequence quality plot for a very high quality sample.)
The x-axis displays the base position in the read, and the y-axis shows quality scores. In this example, the sample contains reads that are 40 bp long. This is much shorter than the reads we are working with in our workflow. For each position, there is a box-and-whisker plot showing the distribution of quality scores for all reads at that position. The horizontal red line indicates the median quality score and the yellow box shows the 1st to 3rd quartile range. This means that 50% of reads have a quality score that falls within the range of the yellow box at that position. The whiskers show the absolute range, which covers the lowest (0th quartile) to highest (4th quartile) values.
For each position in this sample, the quality values do not drop much lower than 32. This is a high quality score. The plot background is also color-coded to identify good (green), acceptable (yellow), and bad (red) quality scores.
Now let’s take a look at a quality plot on the other end of the spectrum.
(Figure: example FastQC per base sequence quality plot for a poor quality sample.)
Here, we see positions within the read in which the boxes span a much wider range. Also, quality scores drop quite low into the “bad” range, particularly on the tail end of the reads. The FastQC tool produces several other diagnostic plots to assess sample quality, in addition to the one plotted above.
We will now assess the quality of the reads that we downloaded. First, make sure you are still in the untrimmed_fastq directory.
$ cd ~/dc_workshop/data/untrimmed_fastq/
Quick question: How big are the files? (Hint: look at the options for the ls command to see how to show file sizes.) (PS: there are six FASTQ files, ranging from 124M (124 MB) to 545M.)
FastQC can accept multiple file names as input, and can run on both zipped and unzipped files, so we can use the *.fastq* wildcard to run FastQC on all of the FASTQ files in this directory.
$ fastqc *.fastq*
You will see an automatically updating output message telling you the progress of the analysis. It will start like this:
Started analysis of SRR2584863_1.fastq
Approx 5% complete for SRR2584863_1.fastq
Approx 10% complete for SRR2584863_1.fastq
Approx 15% complete for SRR2584863_1.fastq
Approx 20% complete for SRR2584863_1.fastq
Approx 25% complete for SRR2584863_1.fastq
Approx 30% complete for SRR2584863_1.fastq
Approx 35% complete for SRR2584863_1.fastq
Approx 40% complete for SRR2584863_1.fastq
Approx 45% complete for SRR2584863_1.fastq
In total, it should take about five minutes for FastQC to run on all six of our FASTQ files. When the analysis completes, your prompt will return. So your screen will look something like this:
Approx 80% complete for SRR2589044_2.fastq.gz
Approx 85% complete for SRR2589044_2.fastq.gz
Approx 90% complete for SRR2589044_2.fastq.gz
Approx 95% complete for SRR2589044_2.fastq.gz
Analysis complete for SRR2589044_2.fastq.gz
$
The FastQC program has created several new files within our data/untrimmed_fastq/ directory.
$ ls
SRR2584863_1.fastq SRR2584866_1_fastqc.html SRR2589044_1_fastqc.html
SRR2584863_1_fastqc.html SRR2584866_1_fastqc.zip SRR2589044_1_fastqc.zip
SRR2584863_1_fastqc.zip SRR2584866_1.fastq.gz SRR2589044_1.fastq.gz
SRR2584863_2_fastqc.html SRR2584866_2_fastqc.html SRR2589044_2_fastqc.html
SRR2584863_2_fastqc.zip SRR2584866_2_fastqc.zip SRR2589044_2_fastqc.zip
SRR2584863_2.fastq.gz SRR2584866_2.fastq.gz SRR2589044_2.fastq.gz
For each input FASTQ file, FastQC has created a .zip file and a .html file. The .zip file extension indicates that this is actually a compressed set of multiple output files. We will be working with these output files soon. The .html file is a stable webpage displaying the summary report for each of our samples.
We want to keep our data files and our results files separate, so we will move these output files into a new directory within our results/ directory.
$ mkdir -p ~/dc_workshop/results/fastqc_untrimmed_reads
$ mv *.zip ~/dc_workshop/results/fastqc_untrimmed_reads/
$ mv *.html ~/dc_workshop/results/fastqc_untrimmed_reads/
Now we can navigate into this results directory and do some closer inspection of our output files.
$ cd ~/dc_workshop/results/fastqc_untrimmed_reads/
Use the Notebooks graphical user interface in the left-hand column to navigate to ~/dc_workshop/results/fastqc_untrimmed_reads. Click on the first HTML file in the list; it will open in a new window. Work through that FastQC report, focusing on the “Per base sequence quality” graph. Then work through all the other reports, again focusing on the “Per base sequence quality” graphs.
Quick questions
We have now looked at quite a few “Per base sequence quality” FastQC graphs, but there are nine other graphs that we have not talked about! The FastQC documentation provides a brief overview of interpretations for each of these plots; see here for more information.
Now that we have looked at our HTML reports to get a feel for the data, let’s look more closely at the other output files. Go back to the tab in your terminal and make sure you are in our results subdirectory.
$ cd ~/dc_workshop/results/fastqc_untrimmed_reads/
$ ls
SRR2584863_1_fastqc.html SRR2584866_1_fastqc.html SRR2589044_1_fastqc.html
SRR2584863_1_fastqc.zip SRR2584866_1_fastqc.zip SRR2589044_1_fastqc.zip
SRR2584863_2_fastqc.html SRR2584866_2_fastqc.html SRR2589044_2_fastqc.html
SRR2584863_2_fastqc.zip SRR2584866_2_fastqc.zip SRR2589044_2_fastqc.zip
Our .zip files are compressed files. They each contain multiple different types of output files for a single input FASTQ file. To view the contents of a .zip file, we can use the program unzip to decompress these files. Let’s try doing them all at once using a wildcard.
$ unzip *.zip
Archive: SRR2584863_1_fastqc.zip
caution: filename not matched: SRR2584863_2_fastqc.zip
caution: filename not matched: SRR2584866_1_fastqc.zip
caution: filename not matched: SRR2584866_2_fastqc.zip
caution: filename not matched: SRR2589044_1_fastqc.zip
caution: filename not matched: SRR2589044_2_fastqc.zip
This did not work. We unzipped the first file and then got a warning message for each of the other .zip files. This is because unzip expects to get only one zip file as input. We could go through and unzip each file one at a time, but this is very time-consuming and error-prone. Someday you may have 500 files to unzip!
A more efficient way is to use a for loop, like we learned earlier in the week, to iterate through all of our .zip files. Let’s see what that looks like and then we will discuss what we are doing with each line of our loop.
$ for filename in *.zip
do
unzip $filename
done
In this example, the input is six filenames (one filename for each of our .zip files). Each time the loop iterates, it will assign a file name to the variable filename and run the unzip command. The first time through the loop, $filename is SRR2584863_1_fastqc.zip, so the interpreter runs the command unzip on SRR2584863_1_fastqc.zip. For the second iteration, $filename becomes SRR2584863_2_fastqc.zip, and the shell runs unzip on SRR2584863_2_fastqc.zip. It then repeats this process for the four other .zip files in our directory.
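A useful habit before running any loop like this is a dry run: swap the real command for echo, so the loop only prints what it would have done. A minimal sketch:
$ for filename in *.zip
do
echo $filename
done
SRR2584863_1_fastqc.zip
SRR2584863_2_fastqc.zip
SRR2584866_1_fastqc.zip
SRR2584866_2_fastqc.zip
SRR2589044_1_fastqc.zip
SRR2589044_2_fastqc.zip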
When we run our for loop, you will see output that starts like this:
Archive: SRR2589044_2_fastqc.zip
creating: SRR2589044_2_fastqc/
creating: SRR2589044_2_fastqc/Icons/
creating: SRR2589044_2_fastqc/Images/
inflating: SRR2589044_2_fastqc/Icons/fastqc_icon.png
inflating: SRR2589044_2_fastqc/Icons/warning.png
inflating: SRR2589044_2_fastqc/Icons/error.png
inflating: SRR2589044_2_fastqc/Icons/tick.png
inflating: SRR2589044_2_fastqc/summary.txt
inflating: SRR2589044_2_fastqc/Images/per_base_quality.png
inflating: SRR2589044_2_fastqc/Images/per_tile_quality.png
inflating: SRR2589044_2_fastqc/Images/per_sequence_quality.png
inflating: SRR2589044_2_fastqc/Images/per_base_sequence_content.png
inflating: SRR2589044_2_fastqc/Images/per_sequence_gc_content.png
inflating: SRR2589044_2_fastqc/Images/per_base_n_content.png
inflating: SRR2589044_2_fastqc/Images/sequence_length_distribution.png
inflating: SRR2589044_2_fastqc/Images/duplication_levels.png
inflating: SRR2589044_2_fastqc/Images/adapter_content.png
inflating: SRR2589044_2_fastqc/fastqc_report.html
inflating: SRR2589044_2_fastqc/fastqc_data.txt
inflating: SRR2589044_2_fastqc/fastqc.fo
The unzip program is decompressing the .zip files and creating a new directory (with subdirectories) for each of our samples, to store all of the different output that is produced by FastQC. There are a lot of files here. The one we are going to focus on is the summary.txt file.
If you list the files in our directory now you will see:
SRR2584863_1_fastqc SRR2584866_1_fastqc SRR2589044_1_fastqc
SRR2584863_1_fastqc.html SRR2584866_1_fastqc.html SRR2589044_1_fastqc.html
SRR2584863_1_fastqc.zip SRR2584866_1_fastqc.zip SRR2589044_1_fastqc.zip
SRR2584863_2_fastqc SRR2584866_2_fastqc SRR2589044_2_fastqc
SRR2584863_2_fastqc.html SRR2584866_2_fastqc.html SRR2589044_2_fastqc.html
SRR2584863_2_fastqc.zip SRR2584866_2_fastqc.zip SRR2589044_2_fastqc.zip
The .html files and the .zip files are still present, but now we also have a new directory for each of our samples. We can confirm that these are directories if we use the -F flag for ls.
$ ls -F
SRR2584863_1_fastqc/ SRR2584866_1_fastqc/ SRR2589044_1_fastqc/
SRR2584863_1_fastqc.html SRR2584866_1_fastqc.html SRR2589044_1_fastqc.html
SRR2584863_1_fastqc.zip SRR2584866_1_fastqc.zip SRR2589044_1_fastqc.zip
SRR2584863_2_fastqc/ SRR2584866_2_fastqc/ SRR2589044_2_fastqc/
SRR2584863_2_fastqc.html SRR2584866_2_fastqc.html SRR2589044_2_fastqc.html
SRR2584863_2_fastqc.zip SRR2584866_2_fastqc.zip SRR2589044_2_fastqc.zip
Let’s see what files are present within one of these output directories.
$ ls -F SRR2584863_1_fastqc/
fastqc_data.txt fastqc.fo fastqc_report.html Icons/ Images/ summary.txt
Use less to preview the summary.txt file for this sample.
$ less SRR2584863_1_fastqc/summary.txt
PASS Basic Statistics SRR2584863_1.fastq
PASS Per base sequence quality SRR2584863_1.fastq
PASS Per tile sequence quality SRR2584863_1.fastq
PASS Per sequence quality scores SRR2584863_1.fastq
WARN Per base sequence content SRR2584863_1.fastq
WARN Per sequence GC content SRR2584863_1.fastq
PASS Per base N content SRR2584863_1.fastq
PASS Sequence Length Distribution SRR2584863_1.fastq
PASS Sequence Duplication Levels SRR2584863_1.fastq
PASS Overrepresented sequences SRR2584863_1.fastq
WARN Adapter Content SRR2584863_1.fastq
The summary file gives us a list of tests that FastQC ran, and tells us whether this sample passed, failed, or is borderline (WARN). Remember, to quit from less you must type q.
We can make a record of the results we obtained for all of our samples by concatenating all of our summary.txt files into a single file using the cat command. We will call this file fastqc_summaries.txt and save it in ~/dc_workshop/.
$ cat */summary.txt > ~/dc_workshop/fastqc_summaries.txt
Quick questions: Which samples failed at least one of FastQC’s quality tests? What test(s) did those samples fail? Can you use grep to find out? (PS: you can get the list of all failed tests using grep with the term FAIL.)
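Here is a minimal sketch of that grep hint (the path assumes you created fastqc_summaries.txt in the previous step):
$ grep FAIL ~/dc_workshop/fastqc_summaries.txt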