Today's Bio Data Technology post is about how to compress FASTQ sequence data by splitting it into homogeneous streams. I took a FASTQ file with 3.5M reads from a paired-end Illumina 100bp run; it was about 883 MB in size. From previous experience, GZIP compresses such a file to about 1/4 of its original size, and BZIP2 to about 1/5. (All file sizes listed in this post are in KB.)
883252 R1.fastq
233296 R1.fastq.gz
182056 R1.fastq.bz2
I then split the read file into 3 separate files: (1) the ID line, with the mandatory '@' removed, (2) the sequence line, uppercased for consistency, and (3) the quality line, unchanged. I ignored the 3rd line of each FASTQ entry (the '+' line), as it is redundant. This knocked about 1% off the total size.
189588 id.txt
341756 seq.txt
341756 qual.txt
873100 TOTAL
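The split described above can be sketched in a few lines of Python. This is only an illustration, not the script used for the numbers in this post: the function name `split_fastq` is made up, and it assumes every record is exactly four lines (no wrapped sequence lines).

```python
import io


def split_fastq(handle, id_out, seq_out, qual_out):
    """Split a FASTQ stream into ID, sequence and quality streams.

    Assumes strict 4-line records. Returns the number of records written.
    """
    count = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of input
        seq = handle.readline()
        plus = handle.readline()  # redundant 3rd line, dropped
        qual = handle.readline()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record {count + 1}")
        id_out.write(header[1:].rstrip("\r\n") + "\n")    # strip the '@'
        seq_out.write(seq.rstrip("\r\n").upper() + "\n")  # uppercase bases
        qual_out.write(qual.rstrip("\r\n") + "\n")        # unchanged
        count += 1
    return count


# Tiny in-memory demonstration with two made-up reads.
demo = io.StringIO("@read1/1\nacgtn\n+\nIIIII\n@read2/1\nGGCCA\n+read2/1\n#####\n")
ids, seqs, quals = io.StringIO(), io.StringIO(), io.StringIO()
n = split_fastq(demo, ids, seqs, quals)
print(n, repr(ids.getvalue()), repr(seqs.getvalue()), repr(quals.getvalue()))
```

For real files, the three output handles would simply be `open("id.txt", "w")`, `open("seq.txt", "w")` and `open("qual.txt", "w")`.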
Next, I compressed each of the three streams (ID, sequence, quality) independently with GZIP. The idea is that a dictionary-based compression scheme like GZIP should work better on homogeneous data streams than on the same data interleaved in one stream. This does improve things, by about 11%, but the result is still not as good as BZIP2 without de-interleaving.
20608 id.txt.gz
84096 qual.txt.gz
102040 seq.txt.gz
206744 TOTAL (was 233296 combined)
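To try the same comparison on a small scale, the sketch below compresses a set of reads once as an interleaved FASTQ stream and once as three separate streams, using Python's standard `gzip` and `bz2` modules. The helper names (`compare_sizes`, `toy_records`) and the synthetic read IDs are hypothetical, and random toy reads will not reproduce the ratios seen on real Illumina data.

```python
import bz2
import gzip
import random

# Both functions take bytes and return compressed bytes.
CODECS = {"gzip": gzip.compress, "bzip2": bz2.compress}


def compare_sizes(records, codec="gzip"):
    """Return (interleaved_size, split_size) in bytes for (id, seq, qual) tuples."""
    compress = CODECS[codec]
    interleaved = "".join(
        f"@{rid}\n{seq}\n+\n{qual}\n" for rid, seq, qual in records
    ).encode("ascii")
    # zip(*records) regroups the tuples into three columns: ids, seqs, quals.
    streams = ["\n".join(column).encode("ascii") + b"\n" for column in zip(*records)]
    return len(compress(interleaved)), sum(len(compress(s)) for s in streams)


def toy_records(n=2000, read_len=100, seed=0):
    """Generate made-up reads; the ID layout is only loosely Illumina-like."""
    rng = random.Random(seed)
    records = []
    for i in range(n):
        rid = f"RUN1:1:{1101 + i // 1000}:{rng.randint(1000, 20000)}:{rng.randint(1000, 20000)}/1"
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("#5?DFHIJ") for _ in range(read_len))
        records.append((rid, seq, qual))
    return records


records = toy_records()
for codec in CODECS:
    combined, split = compare_sizes(records, codec)
    print(f"{codec}: interleaved={combined} bytes, split streams={split} bytes")
```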
If we use BZIP2 to compress the de-interleaved streams instead, it does only about 3% better than when it was a single stream. This is testament to BZIP2's ability to cope with heterogeneous data streams better than GZIP.
16560 id.txt.bz2
66812 qual.txt.bz2
93564 seq.txt.bz2
176936 TOTAL (was 182056 combined)
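The savings can be checked by summing the per-stream sizes listed above and comparing each sum with the size of the single compressed file. The helper name `saving` is just for illustration.

```python
def saving(split_sizes, combined):
    """Return (total of split sizes, percent saved relative to combined)."""
    total = sum(split_sizes)
    return total, 100.0 * (combined - total) / combined


# Per-stream sizes (KB) for id, qual and seq, as listed in this post.
gzip_total, gzip_pct = saving([20608, 84096, 102040], 233296)
bzip2_total, bzip2_pct = saving([16560, 66812, 93564], 182056)

print(f"gzip : {gzip_total} vs 233296 -> {gzip_pct:.1f}% smaller")
print(f"bzip2: {bzip2_total} vs 182056 -> {bzip2_pct:.1f}% smaller")
# gzip : 206744 vs 233296 -> 11.4% smaller
# bzip2: 176936 vs 182056 -> 2.8% smaller
```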
In summary, we've re-learnt that BZIP2 compresses better than GZIP, and that both adapt quite well to the three interleaved data types in a FASTQ file.