Useful shell commands

Things covered here:

Here we’re going to touch upon 6 essential commands that are absolutely worth having in our toolkit!

Let’s change back into our starting unix_intro directory:

cd ~/unix_intro

We’ll mostly be working with a file here called “gene_annotations.tsv”, which is a tab-delimited table holding gene annotation information. Let’s change into our working directory for this page and explore it a little on the command line.

cd gene_annotations/
ls

head and tail

The head command is used to show a file from the beginning:

head gene_annotations.tsv

By default head prints out the first 10 lines, but we can change this behaviour using the -n flag by specifying the number or lines to print.

head -n 1 gene_annotations.tsv
head -n 5 gene_annotations.tsv

The tail command is used to show a file from the end and like with head you can also specify the number of lines to print with the -n flag.

tail gene_annotations.tsv
tail -n 1 gene_annotations.tsv
tail -n 5 gene_annotations.tsv

head and tail are powerful common commands and they can even be used to skip lines. For example specifying -n +2 with tail will print the entire file starting from the second row - this might come in handy if we ever need to skip a header row.

# Skip the first line (header)
tail -n +2 gene_annotations.tsv

As we can see the bed file contains 4 columns (or fields):

chromosome specifies the chromosome where the gene is located,
start and end define the range of bases where the gene is located within that chromosome,
gene_ID holds some descriptive information about the genes, like gene IDs and functional annotations.

less

the less command allows us to browse contents of text files interactively, one page at a time.

less gene_annotations.tsv

Now that we have opened the file in less, we can scroll up and down using the ↑ and ↓ arrow keys respectively, or we can “turn” a page by pressing the Space bar, Enter or the f key. We can “turn” the page back by pressing the b key. If are ready to quit less, we can do so using the q button.

There are other powerful things less can help us with which is beyond the scope of this lesson, but feel free to have a read about these using the less --help option.

Now that we know something about the file we’re working with, let’s get to some exciting commands!

cut

cut is a command that’s great for manipulating columns. The required arguments for cut are which columns we want, and from which file. Here’s how we can use cut to pull out just the chromosome column (column 1):

cut -f 1 gene_annotations.tsv

That is printing out all of the lines to the screen though, let’s pipe it into head for now to keep things manageable while we’re working on it (remember we can bring up a previous command by pressing up):

cut -f 1 gene_annotations.tsv | head

Here we are specifying which column we want with the -f parameter (for field). We can specify multiple individual columns if we separate them with a comma:

cut -f 1,3 gene_annotations.tsv | head

And a range of columns if we join them with a dash:

cut -f 1-3 gene_annotations.tsv | head

Just like in spreadsheet programs, cut thinks about where columns start and stop based on a delimiter, like a comma or a tab. The default setting is a tab, so we didn’t need to change it for that file. But if we try using it on a comma separated values file (a csv), things don’t work well:

cut -f 1-3 gene_annotations.csv | head

Unless we tell cut the delimiter is a comma, which we can do with the -d parameter:

cut -d "," -f 1-3 gene_annotations.csv | head

QUICK PRACTICE! From our tab-delimited file, "gene_annotations.tsv"!, try to make a new file that has just 2 columns: the chromosome and the gene_ID columns (remember the > redirector).
Name the new file "chrom_and_IDs.tsv".
Solution
cut -f 1,4 gene_annotations.tsv | head
cut -f 1,4 gene_annotations.tsv > chrom_and_IDs.tsv
head chrom_and_IDs.tsv
And to make sure it holds all 199 lines and not just the first 10!
wc -l chrom_and_IDs.tsv

grep

grep (global regular expression) is a search tool. It looks through text files for strings (sequences of characters). In its default usage, grep will look for whatever string of characters you give it (1st positional argument), in whichever file you specify (2nd positional argument), and then print out the lines that contain what you searched for. Let’s try it:

grep "73177" gene_annotations.tsv

The output is one gene on chromosome 1, encoding for a gene called BONSAI:

chr1	27516572	27517882	AT1G73177	ID=AT1G73177;Name=AT1G73177;full_name=BONSAI;computational_description=bonsai;locus=505006219;curator_summary=The

If there are multiple lines that match, grep will print them all:

grep "145" gene_annotations.tsv

chr1	166589	168088	AT1G01453	ID=AT1G01453;Name=AT1G01453;Dbxref=TAIR:4515102487;locus_type=protein_coding
chr1	14543277	14544528	AT1G38790	ID=AT1G38790;Name=AT1G38790;Dbxref=TAIR:2824121;locus_type=protein_coding
chr4	13413973	13417145	AT4G26590	ID=AT4G26590;Name=AT4G26590;full_name=oligopeptide
chr4	16799635	16799966	AT4G09145	ID=AT4G09145;Name=AT4G09145;locus_type=long_noncoding_rna
chr5	14598607	14599988	AT5G36960	ID=AT5G36960;Name=AT5G36960;Dbxref=TAIR:2167474;locus_type=protein_coding

In this case some of the matches were in the gene IDs and some were in the locations.

If what we are looking for is not in the file, we will just get our prompt back with nothing printed out:

grep "mygene" gene_annotations.tsv

For the moment, let’s pretend we’re interested in a gene predicted to encode for DEMETER, (a DNA glycosylase protein, responsible for DNA hypomethylation) which has the gene ID AT5G04560. With grep we can check if it is in our annotations file:

grep "AT5G04560" gene_annotations.tsv

chr5	1309098	1318520	AT5G04560	ID=AT5G04560;Name=AT5G04560;full_name=DEMETER;computational_description=HhH-GPD

Note: grep is case sensitive by default, but we can make it case insensitive with the -i flag. Try searching for DEMETER with and without this flag and see what you get:
grep demeter gene_annotations.tsv
grep -i demeter gene_annotations.tsv

QUICK PRACTICE! Using a combination of grep and cut, how could we print out the gene IDs of only the genes located on chromosome 2?
Solution
grep "chr2" gene_annotations.tsv | cut -f 4
Being able to stick together these individual pieces like this becomes a really valuable toolset to have at our disposal!

We’re just scratching the surface of what grep can do, but one flag in addition to -i worth mentioning is the -c flag. This tells grep to just report how many lines matched, instead of printing them to the screen:

grep -c "chr2" gene_annotations.tsv

Optional: awk

awk is even more expansive than any of the others we’ve seen, but like the others, just being familiar with its basic command-line usage can be powerful. awk is useful for doing things like filtering based on columns and doing calculations.

The syntax of awk can also take a little getting used to. In awk, we specify columns with a $ followed by the column number. For example, $1 is column 1 (chromosome), $2 is start position, etc.

Let’s start with a simple example recreating cut -f 1: printing just the first column (chromosome):

awk '{print $1}' gene_annotations.tsv | head

Just like with cut, we can also print multiple columns:

awk '{print $1,$4}' gene_annotations.tsv | head

One of the most powerful uses of awk is filtering based on conditions. Let’s say we only want genes from chromosome 3 (recreating grep chr3 gene_annotations.tsv):

awk '$1 == "chr3"' gene_annotations.tsv | head

We can also filter based on numeric comparisons. Let’s find genes that start after position 1000000:

awk '$2 > 1000000' gene_annotations.tsv | head

awk can also perform calculations on the fly. Let’s create a new column that calculates the length of each gene (end position minus start position) and save it to a new file:

awk 'NR > 1 {print $1,$2,$3,$4,$5,$3-$2}' gene_annotations.tsv > gene_annotations_and_lengths.tsv

The NR > 1 part skips the first line (the header), so we don’t try to do maths on text!

We can combine filtering and calculations. Let’s find genes longer than 5000 base pairs:

awk '$5 > 5000' gene_annotations_and_lengths.tsv | head

Again, awk can seem pretty tricky, especially at first, but forunately we don’t need to remember how to do these things, just that they can be done. And then we can look it up when we need it 🙂

QUICK PRACTICE! How many protein coding genes in our list are larger than 3000bp?
Hint: Try to use grep and awk piped together.
Solution
grep "protein_coding" gene_annotations_and_lengths.tsv | awk '$6 > 3000' | wc -l

Summary

As mentioned, this page is just a first introduction to some great commands that are worth having in our toolkit. Each of them has much more functionality that we can dig into further as needed!

Commands introduced:

Command	Function (base-usage)
`head`	prints the first n lines (10 by default, specify with `-n`)
`tail`	prints the last n lines (10 by default, specify with `-n`)
`less`	browse file contents interactively (scroll with arrows, exit with `q`)
`cut`	extracts columns from delimited files
`grep`	searches for text patterns and returns matching lines
`awk`	filters rows and columns, performs calculations

Acknowledgements

This training course was adapted from the Happy Belly Bioinformatics Unix Course.

Previous submodule:

Monday - Session 2

Next submodule:

Tuesday - Session 1