How to Use the UNIX paste Command with NGS Sequence Data

The UNIX shell and its command-line tools are well suited to computational work in bioinformatics. One particularly useful idiom for NGS data uses the paste command to handle multi-line FASTQ records:

paste - - - - < in.fq | filter | tr "\t" "\n" > out.fq

The command has a simple structure, yet it is very useful when analysing large NGS data sets. The paste manual page describes its behaviour as follows:

"Write lines consisting of the sequentially corresponding lines from each FILE, separated by TABs, to standard output. With no FILE, or when FILE is -, read standard input."

So what’s happening here? In Unix, the “-” character conventionally means “use STDIN instead of a filename”. Here paste is given four filenames, each of which is the same stdin filehandle, so each one consumes the next line in turn. Lines 1..4 of in.fq are therefore put onto one line (tab-separated), lines 5..8 onto the next line, and so on. This assumes every record occupies exactly four lines. The stream now has each four-line FASTQ entry on a single line, which makes it much more amenable to Unix line-based manipulation, represented by filter in the example (a placeholder for whatever command you need, not a real program). Once that’s done, we put the data back into the standard 4-line FASTQ format, which is as simple as converting the tabs “\t” back to newlines “\n” with the tr command.
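To make the idea concrete, here is a minimal sketch with a real command standing in for filter. The file names demo.fq and demo_filtered.fq, the two reads, and the 5-base length cut-off are all made up for illustration; the awk length test is just one possible filter.

```shell
# Build a tiny two-record FASTQ file (made-up reads, for illustration only).
printf '%s\n' '@read1' 'ACGTACGT' '+' 'IIIIIIII' \
              '@read2' 'ACG' '+' 'III' > demo.fq

# Each "-" takes the next line from stdin, so every record becomes one
# tab-separated line: ID, sequence, "+" line, quality. Prints two lines here.
paste - - - - < demo.fq

# Stand-in for "filter": keep records whose sequence (column 2) has at
# least 5 bases, then restore the 4-line layout.
paste - - - - < demo.fq \
  | awk -F '\t' 'length($2) >= 5' \
  | tr "\t" "\n" > demo_filtered.fq
```

With this input only read1 should survive, because the read2 sequence is 3 bases long.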

Example 1: FASTQ to FASTA

A common task is converting FASTQ to FASTA, and we don’t always have our favourite tool or script to do this when we aren’t on our own servers:

paste - - - - < in.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > out.fa

paste converts the input FASTQ into a 4-column file
cut command extracts out just column 1 (the ID) and column 2 (the sequence)
sed replaces the FASTQ ID prefix "@" with the FASTA ID prefix ">"
tr converts the 2 columns back into 2 lines

Because the shell command above is a pipeline of four commands (paste, cut, sed, tr), the operating system runs them concurrently as separate processes, each consuming the previous one's output as it is produced. On a multi-core machine this can make the whole pipeline faster, assuming your disk I/O can keep up.
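A small worked run of the FASTQ-to-FASTA pipeline, as a sketch on made-up data (the file names demo2.fq and demo2.fa are hypothetical). The second read's quality string deliberately starts with "@", which is a valid quality character. Because sed only sees the joined one-line records, 's/^@/>/' touches the ID and leaves that quality line alone, which a naive line-by-line sed on the raw FASTQ would not guarantee.

```shell
# Two made-up reads; the quality string of r2 starts with "@" on purpose.
printf '%s\n' '@r1' 'ACGT' '+' 'IIII' \
              '@r2' 'GGCC' '+' '@III' > demo2.fq

# Join records, keep ID and sequence, swap the leading "@" for ">",
# then split the two columns back onto separate lines.
paste - - - - < demo2.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > demo2.fa

cat demo2.fa
```

For this input, demo2.fa contains four lines: >r1, ACGT, >r2, GGCC.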

Example 2: Removing redundant FASTQ ID in line 3

The third line in the FASTQ format is somewhat redundant: it is usually a duplicate of the first line, except with “+” instead of “@” to denote that a quality string is coming next rather than an ID. Most parsers ignore it and happily accept a blank ID after the “+”, which saves a fair chunk of disk space. If you have legacy files with the redundant IDs and want to convert them, here’s how we can do it with our new paste trick:

paste -d ' ' - - - - | sed 's/ +[^ ]*/ +/' | tr " " "\n"

paste converts the input FASTQ into a 4-column file, but using SPACE instead of TAB as the separator character
sed finds and replaces the "+DUPE_ID" line with just a "+"
tr converts the 4 columns back into 4 lines
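One caveat with SPACE as the separator: if the read headers themselves contain a space (as Illumina headers of the form "@id 1:N:0:1" do), the final tr would split them across lines. A possible variant, offered only as a sketch, keeps TAB as the separator and uses awk to overwrite column 3. The file names demo3.fq and demo3_slim.fq and the read are made up for illustration.

```shell
# One made-up read whose header contains a space and whose "+" line repeats it.
printf '%s\n' '@r1 1:N:0:1' 'ACGT' '+r1 1:N:0:1' 'IIII' > demo3.fq

# Join on TAB (the paste default), replace the whole third column with a
# bare "+", and split back into four lines. Spaces in the header are untouched.
paste - - - - < demo3.fq \
  | awk -F '\t' -v OFS='\t' '{ $3 = "+"; print }' \
  | tr "\t" "\n" > demo3_slim.fq
```

This variant assumes no field contains a literal TAB, which holds for typical FASTQ files but is worth checking on unusual data.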

Learning this pattern can point the way to other uses of UNIX tools for analysing large data sets.
