Things covered here:
Here we’re going to touch upon 6 essential commands that are absolutely worth having in our toolkit!
Let’s change back into our starting unix_intro directory:
cd ~/unix_intro
We’ll mostly be working with a file here called “gene_annotations.tsv”, which is a tab-delimited table holding gene annotation information. Let’s change into our working directory for this page and explore it a little on the command line.
cd gene_annotations/
ls
The head command is used to show a file from the beginning:
head gene_annotations.tsv
By default head prints out the first 10 lines, but we can change this behaviour using the -n flag by specifying the number or lines to print.
head -n 1 gene_annotations.tsv
head -n 5 gene_annotations.tsv
The tail command is used to show a file from the end and like with head you can also specify the number of lines to print with the -n flag.
tail gene_annotations.tsv
tail -n 1 gene_annotations.tsv
tail -n 5 gene_annotations.tsv
head and tail are powerful common commands and they can even be used to skip lines. For example specifying -n +2 with tail will print the entire file starting from the second row - this might come in handy if we ever need to skip a header row.
# Skip the first line (header)
tail -n +2 gene_annotations.tsv
As we can see the bed file contains 4 columns (or fields):
chromosome specifies the chromosome where the gene is located,start and end define the range of bases where the gene is located within that chromosome,gene_ID holds some descriptive information about the genes, like gene IDs and functional annotations.the less command allows us to browse contents of text files interactively, one page at a time.
less gene_annotations.tsv
Now that we have opened the file in less, we can scroll up and down using the ↑ and ↓ arrow keys respectively, or we can “turn” a page by pressing the Space bar, Enter or the f key. We can “turn” the page back by pressing the b key. If are ready to quit less, we can do so using the q button.
There are other powerful things less can help us with which is beyond the scope of this lesson, but feel free to have a read about these using the less --help option.
Now that we know something about the file we’re working with, let’s get to some exciting commands!
cut is a command that’s great for manipulating columns. The required arguments for cut are which columns we want, and from which file. Here’s how we can use cut to pull out just the chromosome column (column 1):
cut -f 1 gene_annotations.tsv
That is printing out all of the lines to the screen though, let’s pipe it into head for now to keep things manageable while we’re working on it (remember we can bring up a previous command by pressing up):
cut -f 1 gene_annotations.tsv | head
Here we are specifying which column we want with the -f parameter (for field). We can specify multiple individual columns if we separate them with a comma:
cut -f 1,3 gene_annotations.tsv | head
And a range of columns if we join them with a dash:
cut -f 1-3 gene_annotations.tsv | head
Just like in spreadsheet programs, cut thinks about where columns start and stop based on a delimiter, like a comma or a tab. The default setting is a tab, so we didn’t need to change it for that file. But if we try using it on a comma separated values file (a csv), things don’t work well:
cut -f 1-3 gene_annotations.csv | head
Unless we tell cut the delimiter is a comma, which we can do with the -d parameter:
cut -d "," -f 1-3 gene_annotations.csv | head
QUICK PRACTICE! From our tab-delimited file, "gene_annotations.tsv"!, try to make a new file that has just 2 columns: the chromosome and the gene_ID columns (remember the > redirector).
Name the new file "chrom_and_IDs.tsv".Solution
cut -f 1,4 gene_annotations.tsv | head cut -f 1,4 gene_annotations.tsv > chrom_and_IDs.tsv head chrom_and_IDs.tsvAnd to make sure it holds all 199 lines and not just the first 10!
wc -l chrom_and_IDs.tsv
grep (global regular expression) is a search tool. It looks through text files for strings (sequences of characters). In its default usage, grep will look for whatever string of characters you give it (1st positional argument), in whichever file you specify (2nd positional argument), and then print out the lines that contain what you searched for. Let’s try it:
grep "73177" gene_annotations.tsv
The output is one gene on chromosome 1, encoding for a gene called BONSAI:
chr1 27516572 27517882 AT1G73177 ID=AT1G73177;Name=AT1G73177;full_name=BONSAI;computational_description=bonsai;locus=505006219;curator_summary=The
If there are multiple lines that match, grep will print them all:
grep "145" gene_annotations.tsv
chr1 166589 168088 AT1G01453 ID=AT1G01453;Name=AT1G01453;Dbxref=TAIR:4515102487;locus_type=protein_coding
chr1 14543277 14544528 AT1G38790 ID=AT1G38790;Name=AT1G38790;Dbxref=TAIR:2824121;locus_type=protein_coding
chr4 13413973 13417145 AT4G26590 ID=AT4G26590;Name=AT4G26590;full_name=oligopeptide
chr4 16799635 16799966 AT4G09145 ID=AT4G09145;Name=AT4G09145;locus_type=long_noncoding_rna
chr5 14598607 14599988 AT5G36960 ID=AT5G36960;Name=AT5G36960;Dbxref=TAIR:2167474;locus_type=protein_coding
In this case some of the matches were in the gene IDs and some were in the locations.
If what we are looking for is not in the file, we will just get our prompt back with nothing printed out:
grep "mygene" gene_annotations.tsv
For the moment, let’s pretend we’re interested in a gene predicted to encode for DEMETER, (a DNA glycosylase protein, responsible for DNA hypomethylation) which has the gene ID AT5G04560. With grep we can check if it is in our annotations file:
grep "AT5G04560" gene_annotations.tsv
chr5 1309098 1318520 AT5G04560 ID=AT5G04560;Name=AT5G04560;full_name=DEMETER;computational_description=HhH-GPD
Note:
grepis case sensitive by default, but we can make it case insensitive with the-iflag. Try searching for DEMETER with and without this flag and see what you get:grep demeter gene_annotations.tsv grep -i demeter gene_annotations.tsv
QUICK PRACTICE! Using a combination ofgrepandcut, how could we print out the gene IDs of only the genes located on chromosome 2?Solution
grep "chr2" gene_annotations.tsv | cut -f 4Being able to stick together these individual pieces like this becomes a really valuable toolset to have at our disposal!
We’re just scratching the surface of what grep can do, but one flag in addition to -i worth mentioning is the -c flag. This tells grep to just report how many lines matched, instead of printing them to the screen:
grep -c "chr2" gene_annotations.tsv
awk is even more expansive than any of the others we’ve seen, but like the others, just being familiar with its basic command-line usage can be powerful. awk is useful for doing things like filtering based on columns and doing calculations.
The syntax of awk can also take a little getting used to. In awk, we specify columns with a $ followed by the column number. For example, $1 is column 1 (chromosome), $2 is start position, etc.
Let’s start with a simple example recreating cut -f 1: printing just the first column (chromosome):
awk '{print $1}' gene_annotations.tsv | head
Just like with cut, we can also print multiple columns:
awk '{print $1,$4}' gene_annotations.tsv | head
One of the most powerful uses of awk is filtering based on conditions. Let’s say we only want genes from chromosome 3 (recreating grep chr3 gene_annotations.tsv):
awk '$1 == "chr3"' gene_annotations.tsv | head
We can also filter based on numeric comparisons. Let’s find genes that start after position 1000000:
awk '$2 > 1000000' gene_annotations.tsv | head
awk can also perform calculations on the fly. Let’s create a new column that calculates the length of each gene (end position minus start position) and save it to a new file:
awk 'NR > 1 {print $1,$2,$3,$4,$5,$3-$2}' gene_annotations.tsv > gene_annotations_and_lengths.tsv
The NR > 1 part skips the first line (the header), so we don’t try to do maths on text!
We can combine filtering and calculations. Let’s find genes longer than 5000 base pairs:
awk '$5 > 5000' gene_annotations_and_lengths.tsv | head
Again, awk can seem pretty tricky, especially at first, but forunately we don’t need to remember how to do these things, just that they can be done. And then we can look it up when we need it 🙂
QUICK PRACTICE! How many protein coding genes in our list are larger than 3000bp?
Hint: Try to usegrepandawkpiped together.
Solution
grep "protein_coding" gene_annotations_and_lengths.tsv | awk '$6 > 3000' | wc -l
As mentioned, this page is just a first introduction to some great commands that are worth having in our toolkit. Each of them has much more functionality that we can dig into further as needed!
| Command | Function (base-usage) |
|---|---|
head |
prints the first n lines (10 by default, specify with -n) |
tail |
prints the last n lines (10 by default, specify with -n) |
less |
browse file contents interactively (scroll with arrows, exit with q) |
cut |
extracts columns from delimited files |
grep |
searches for text patterns and returns matching lines |
awk |
filters rows and columns, performs calculations |
This training course was adapted from the Happy Belly Bioinformatics Unix Course.