One of the important problems in bioinformatics is preparing data for downstream processing. Current sequencing technology begins by breaking up long pieces of DNA into many shorter pieces of DNA. The resulting set of DNA is called a “library” and the short pieces are called “fragments”. Each fragment in the library is then sequenced individually and in parallel. There are two ways of sequencing a fragment: from just one end, or from both ends. If only one end is sequenced, you get a single read. If your technology can sequence both ends, you get a “pair” of reads for each fragment. These “paired-end” reads are standard practice on Illumina instruments like the GAIIx, HiSeq and MiSeq.
Now, for single-end reads, you need to make sure your read length (L) is shorter than your fragment length (F), otherwise the sequencer will run out of DNA to read! Typical Illumina fragment libraries would use F ~ 450bp but this is variable. For paired-end reads, you want to make sure that F is long enough to fit two reads. This means you need F to be at least 2L. As L=100 or 150bp these days for most people, using F~450bp is fine; there is still a safety margin of F − 2L unsequenced bases in the middle.
However, some things have changed in the Illumina ecosystem this year. Firstly, read lengths are now moving to >150bp on the HiSeq (and have already been on the GAIIx), and to >250bp on the MiSeq, with possibilities of longer ones coming soon! This means that the standard library size F~450bp has become too small, and paired-end reads will overlap. Secondly, the new enzymatic Nextera library preparation system produces a much wider spread of F sizes than the previous TruSeq system. With Nextera, we see F ranging from 100bp to 900bp in the same library. So some read pairs will overlap, and others won’t. It’s starting to get messy.
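To make the arithmetic concrete, here is a small Python sketch that works out whether a pair leaves a gap or overlaps for a given F and L. The function name `classify_pair` and the category labels are my own for illustration, and it assumes both reads in a pair have the same length L.

```python
def classify_pair(fragment_len, read_len):
    """Return (category, bases) for one paired-end fragment.

    The inner distance is F - 2L. If it is positive, that many bases in the
    middle of the fragment are never sequenced. If it is negative, the two
    reads overlap by that many bases.
    """
    if fragment_len < read_len:
        # The fragment is shorter than a single read, so each read runs
        # off the end of the fragment (into adaptor sequence).
        return ("read-through", read_len - fragment_len)
    inner = fragment_len - 2 * read_len
    if inner >= 0:
        return ("gap", inner)
    return ("overlap", -inner)


if __name__ == "__main__":
    # F and L values taken from the text above.
    for f, l in [(450, 100), (450, 150), (450, 250), (100, 150), (900, 250)]:
        print(f, l, classify_pair(f, l))
```

With F=450, reads of L=100 or L=150 leave a gap, but L=250 gives a 50bp overlap. A Nextera-style library spanning 100bp to 900bp will land in all three categories at once.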
The whole point of paired-end reads is to get the benefit of longer reads without actually being able to sequence reads that long. A paired-end read (two reads of length L) from a fragment of length F is a bit like a single read of length F, except that a bunch of bases in the middle of it are unknown, and how many of them there are is only roughly known (libraries are only nominally of length F, so each fragment will vary). This gives the reads a longer context, which particularly helps in de novo assembly and in aligning more reads unambiguously to a reference genome. However, many software tools will get confused if you give them overlapping pairs. If we could overlap them and turn them into longer single-end reads, many tools would produce better results, and faster.
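To show the basic idea of overlapping a pair, here is a deliberately naive Python sketch. It is NOT how FLASH, PEAR or any of the tools below actually work (they score candidate overlaps and use the base qualities). The names `merge_pair`, `min_overlap` and `max_mismatch_frac` are made up for this example, and it assumes read 2 is given as sequenced, i.e. from the opposite strand.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_frac=0.1):
    """Merge an overlapping pair into one longer read, or return None.

    read2 is reverse-complemented so both reads are on the same strand,
    then the end of read1 is compared with the start of read2. Candidate
    overlaps are tried from longest to shortest and the first one with
    few enough mismatches is accepted.
    """
    r2 = reverse_complement(read2)
    for olen in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        tail = read1[-olen:]
        head = r2[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * olen:
            # Keep read1's bases in the overlap; a real tool would pick
            # the higher-quality base and rescore the qualities.
            return read1 + r2[olen:]
    return None


if __name__ == "__main__":
    fragment = "ACGTTGCAAGGCTTAACCGGATCGTACGTA"      # F = 30
    read1 = fragment[:20]                            # L = 20, forward
    read2 = reverse_complement(fragment[10:])        # L = 20, reverse
    print(merge_pair(read1, read2) == fragment)
```

Here 2L − F = 10, so the two reads share 10 bases and the merged read reconstructs the whole 30bp fragment. Pairs with no acceptable overlap return `None` and should be kept as ordinary paired reads.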
Here is a list of tools which can do the overlapping procedure. I am NOT going to review them all here. I’ve used one tool (FLASH) to overlap some MiSeq 2×150 PE reads, and then assembled them using Velvet, and the merged reads produced a “better” assembly than with the paired reads. But that’s it. I write this post to inform people of the problem, and to collate all the tools in one place to save others effort. Enjoy!
PEAR (Paired-End Read Merger) http://sco.h-its.org/exelixis/web/software/pear/doc.html (* this is what I use)
COPE (Connecting Overlapping Paired End reads) http://sourceforge.net/projects/coperead/
SeqPrep https://github.com/jstjohn/SeqPrep
FLASH (Fast Length Adjustment of Short Reads to Improve Genome Assemblies) http://www.cbcb.umd.edu/software/flash
fastq-join (part of ea-utils) http://code.google.com/p/ea-utils/wiki/FastqJoin
PANDAseq https://github.com/neufeld/pandaseq
stitch (now defunct, merged into PANDAseq) https://github.com/audy/stitch
mergePairs.py http://code.google.com/p/standardized-velvet-assembly-report/source/browse/trunk/mergePairs.py
Features to look for
Keeps original IDs in merged reads
Outputs the un-overlapped paired reads
Ability to strip adaptors first
Rescores the Phred qualities across the overlapped region
Parameters to control the overlap sensitivity
Handle .gz and .bz2 compressed files
Multi-threading support
Written in C/C++ (faster compiled) rather than Python/Perl (slower)