$ head -n12 RNAseq/mouse_adult_wt_smallRNAseq_76nt_strand_ox.fastq
The basic question is to trim/clip the part after the adaptor sequence (e.g. TCGTATGCCGTCTTCTGCTTG in above example) in the FASTQ reads. It's easy if there is no mismatch. For case with mismatch, here is few solutions I got from google:
- FAR (http://sourceforge.net/apps/mediawiki/theflexibleadap/index.php?title=Main_Page), which uses Needleman-Wunsch alignment (global alignment) to align the adaptor sequence(s) to the read sequences and find the first best alignment.
far -s stdin.fastq -t stdout.fq -f fastq-sanger -as TCGTATGCCGTCTTCTGCTTG --cut-off 5 --min-overlap 15 --phred-pre-trim 20 --min-readlength 20 --trim-end right --adaptive-overlap yes --max-uncalled 30 --nr-threads 8 --log-level ALL
- Scythe only checks for 3'-end contaminants, up to the adapter's length into the 3'-end. For reads with contamination in any position, the program TagDust (http://genome.gsc.riken.jp/osc/english/dataresource/) is recommended. Scythe has the advantages of allowing fuzzier matching and being base quality-aware, while TagDust has the advantages of very fast matching (but allowing few mismatches, and not considering quality) and FDR. TagDust also removes contaminated reads entirely, while Scythe trims off contaminants.
- trimLRPatterns() function in BioConductor. Here is an example.
- Another very useful toolkit is fastx_toolkit, which includes a set of tools for handling fasta and fastq format. fastx_clipper is for clipping adaptor,
but from what I checked it does not allow mismatch (e.g. N is counted as 'match' in fastx_clipper, but in FAR it's mismatch). Running the following code on above reads only clips the last read. The first one should also be clipped. Don't know why. (Just get reply from Gordon that fastx_clipper does allow mismatch and also count 'N' as mismatch. Just the first two sequences have high error rate (=number_of_mismatch/adaptor_alignment_length, e.g. 33% in this case), so it's not counted as an adaptor.
A possible pipeline would run FASTQ reads through Scythe, then TagDust, then a quality-based trimmer, and finally through a read quality statistics program such as qrqc (http://bioconductor.org/packages/devel/bioc/html/qrqc.html) or FASTqc (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/).
$ head -n12 RNAseq/mouse_adult_wt_smallRNAseq_76nt_strand_ox.fastq | fastx_clipper -a TCGTATGCCGTCTTCTGCTTG -c -n -i -