Unix tools commands for Counting Sequences

Unix as a comfortable and useful tools in bioinformatics uses basic commands to do processes in bioinformatics and immunoinformatics. Every command line computational biologists has a suite of specific utility tools classified in their $PATH (root) for doing core operations on common data files, for example counting the number of sequences in a .fasta file. Sometimes however, you end up on someone else’s server where those tools are no longer available. It’s in this situation when a good working knowledge of the standard Unix tools can be valuable. Here’s a short list of the most useful ones for chaining together:

cat - show
grep - filter
sed - modify
wc - count
cut - extract columns
sort - sort
uniq - identify duplicates
head - extract start
tail - extract end
expr - arithmetic

For example, the classic example is to count the number of sequences in a .fasta file:

% grep ‘>’ in.fasta | wc -l

Each sequence has a “>” character for its ID line, the grep selects out all the ID lines, which the number of is equal to the number of sequences. The wc command counts its input, and -l means to count lines, rather than characters or words.

Most of us would be using modern GNU/Unix systems, where the grep command has a -c option which counts matching lines rather than printing them out one by one, removing the need for wc:

% grep -c ‘>’ in.fasta

Please ensure you use quote characters around the “>”, otherwise the shell will think you want to send the output of grep to in.fasta, which will result in in.fasta being truncated to a zero length file. Not ideal.

More commonly today is the .fastq file, used for storing millions of short reads from high-throughput sequencing machines. An example of one entry is shown below:


At first glance, the simplest way to count entries in a .fastq file would be to just extend what we did with .fasta files, but use the “@” character for the ID line matching.

% grep -c ‘@’ in.fastq

Absolutly, this strategy doesn’t always work, as “@” is also a valid symbol in the encoded quality string! dont worries, let’s use the “+” character on the second ID line. Arrggh, it’s also a valid quality symbol! WTF? At this stage you start muttering expletives about moron file format designers, but then calm down when you realise it could have been much worse ie. XML.

Technically, the sequence and quality parts of a .fastq entry can span multiple lines, just like you see 60 column wrapped .fasta files. However, most vendors stick to 4 lines per entry, putting the sequence and quality strings on a single line each. This means we can count entries by dividing the number of lines in the file by four:

% LINES=cat in.fasta | wc -l
% READS=expr $LINES / 4
% echo $READS

The above is traditional Bourne shell, but most people are using modern-ish shells like BASH, where this can be written more concisely as:

% expr $(cat in.fasta | wc -l) / 4

Leave a Reply

Your email address will not be published. Required fields are marked *