Sorting FASTQ files by sequence length using Unix tools


The FASTQ file format is used to store short sequence reads, most commonly those from Illumina instruments. The reads are typically between 36 and 250 bp long, and typical data sets contain between 1 million and 1 billion reads. Each entry in a FASTQ file contains three pieces of information: the sequence ID, the DNA sequence, and a quality string. I recently read through the documentation for the khmer package, and noticed a comment stating that digital normalization would work better if the FASTQ reads were in order from longest to shortest.
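For illustration, a single (made-up) 36 bp FASTQ record looks like this: an "@" header line with the read ID, the sequence, a "+" separator line (which may optionally repeat the ID), and a quality string of the same length as the sequence:

@read1
ACGTACGTACGTACGTACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII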

I googled for some FASTQ sorting information. I found some posts on sorting by the sequence itself to remove duplicates (Justin’s blog), and some posts on sorting on ID to help with finding displaced pairs (BioStar), but nothing on length sorting. There were some posts related to sorting FASTA files (Joe Fass), but that code read the whole file into RAM and didn’t use a Schwartzian transform.

Some solutions
My first attempt was to use the paste trick I blogged about previously, where in.fq contains about 10M clipped reads with lengths between 24 and 100 bp:

cat in.fq \
| paste - - - - \
| perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;' \
| sort -n \
| cut -f2- \
| tr "\t" "\n" \
> out.fq

This turns each 4-line FASTQ entry into a single tab-separated line, adds a column with the length of each read, sorts numerically on that length column, removes the length column, and converts it back into a FASTQ file. It works faster than I thought it would: about 2 minutes for a 10M read file. The whole process is I/O bound, so the faster your disks the better. The sort usually uses /tmp, so if that is on a different disk unit (“spindle” for my more mature readers) to your input and/or output files, you’ll get better performance. There are various chunk-size and buffer parameters you can change on sort, but they probably don’t make much difference.
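As a rough sketch of that tuning (assuming GNU sort: the -T, -S and --parallel options are coreutils extensions, and /fast/scratch is just a placeholder for a temporary directory on a different spindle), the sort stage of the pipeline above could be adjusted like so:

cat in.fq \
| paste - - - - \
| perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;' \
| sort -n -T /fast/scratch -S 4G --parallel=4 \
| cut -f2- \
| tr "\t" "\n" \
> out.fq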

Unix sort breaks the input into chunks, sorts each of those, then merges all the chunks into the output (a merge sort). Another possibility, which avoids comparison-based sorting altogether, is to read through the input and put reads into separate files (bins) based on their read length (a simplistic radix sort). We can do this because there is only a small, finite number of possible read lengths. The final output simply concatenates all the bin files in order of increasing length. Here is an implementation in Perl, which is about 2x faster on average:

#!/usr/bin/env perl

use strict;
use warnings;
use Fatal;
use File::Temp qw(tempdir);

# one temporary bin file per read length; removed automatically on exit
my $dir = tempdir(CLEANUP=>1);
my @line;
my @fh;

while ( ! eof() ) {
  # read one 4-line FASTQ record
  $line[$_] = scalar(<>) for (0..3);
  # sequence length, minus 1 for the trailing newline
  my $len = length($line[1]) - 1;
  # open a bin file the first time we see this length
  if ( not defined $fh[$len] ) {
    open $fh[$len], '>', "$dir/$len";
  }
  print {$fh[$len]} @line;
}

# emit the bins in order of increasing read length
for my $len (0 .. $#fh) {
  next unless defined $fh[$len];
  close $fh[$len];   # flush buffered writes before cat reads the file
  system("cat", "$dir/$len");
}
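A hedged usage sketch (the script name sort-fastq-by-len.pl is just a placeholder; the awk/sort line only checks that the sequence-line lengths in the output are non-decreasing):

chmod +x sort-fastq-by-len.pl
./sort-fastq-by-len.pl in.fq > out.fq

# sanity check: read lengths in the output should be in non-decreasing order
awk 'NR % 4 == 2 { print length($0) }' out.fq | sort -n -c && echo "sorted by length"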

Results
Here are some timing measurements across three files:

10 million reads
  Bash:  real 1m42.407s   user 1m57.863s   sys 0m52.479s
  Perl:  real 0m51.698s   user 0m35.626s   sys 0m9.897s

20 million reads
  Bash:  real 3m58.520s   user 4m2.155s    sys 1m58.167s
  Perl:  real 1m39.824s   user 1m12.765s   sys 0m20.409s

30 million reads
  Bash:  real 5m53.346s   user 6m16.736s   sys 2m51.483s
  Perl:  real 2m23.200s   user 1m46.815s   sys 0m30.754s

At face value, the shell version (“Bash”) appears to behave in an O(N log N) fashion, whereas the Perl implementation looks roughly linear, O(N): tripling the input grew the Bash wall-clock time about 3.5x (1m42s to 5m53s) but the Perl time only about 2.8x (0m52s to 2m23s). Linear behaviour is somewhat expected for the binning approach, given it reads all the data, writes it all out to the bins, then reads and writes it all once more when concatenating. More testing and thinking is needed.

Conclusion
Sorting huge FASTQ files is feasible. If applications like diginorm become mainstream, and sorting clipped FASTQ files makes them work better, then this is handy to know. Nothing groundbreaking here, and I may have repeated other people’s work, but I hope it is useful to somebody.
