Friday, August 03, 2012

Option settings for BLAST

BLAST is a classic method for finding homologous sequences. In one of my recent projects, I needed to find the orthologs of Manduca sexta (a moth) proteins in related species. Since I wanted orthologs for all of Manduca's proteins, which is a long list, a straightforward way was to run BLAST locally.

First, download and install BLAST. I used ncbi-blast-2.2.26+-src.tar.gz from NCBI.

Next, download a BLAST database from the list NCBI provides. I used nr.*tar.gz (the non-redundant protein sequence database). If you want a smaller, more specific one, you can choose, for example, pdbaa.*tar.gz.
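A database such as nr is split across several archives, and all of them have to be unpacked into the same directory before blastall can use it. Here is a minimal sketch of that step, assuming the archives are already downloaded; the function name unpack_blast_db and its three arguments are made up for this example and are not part of BLAST.

```shell
# Hypothetical helper (not part of BLAST): unpack every archive that matches
# <prefix>.*tar.gz from a source directory into one database directory.
unpack_blast_db() {
    prefix="$1"    # database name, e.g. nr or pdbaa
    src_dir="$2"   # directory holding the downloaded archives
    dest_dir="$3"  # directory the database files are unpacked into
    mkdir -p "$dest_dir" || return 1
    for archive in "$src_dir"/"$prefix".*tar.gz; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$archive" ] || continue
        tar -xzf "$archive" -C "$dest_dir" || return 1
    done
}
```

With the archives in the current directory, something like `unpack_blast_db nr . ~/nearline/blast/nr` would produce the layout that the -d option in my final command points at.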

Once you have installed the program and the database, you can run the executable called "blastall" in the bin folder. Here are its options:

blastall 2.2.26   arguments:

  -p  Program Name [String]
  -d  Database [String]
    default = nr
  -i  Query File [File In]
    default = stdin
## alignment
  -F  Filter query sequence (DUST with blastn, SEG with others) [String]
    default = T
  -Q  Query Genetic code to use [Integer]
    default = 1
  -D  DB Genetic code (for tblast[nx] only) [Integer]
    default = 1
  -W  Word size, default if zero (blastn 11, megablast 28, all others 3) [Integer]
    default = 0
  -z  Effective length of the database (use zero for the real size) [Real]
    default = 0
  -Y  Effective length of the search space (use zero for the real size) [Real]
    default = 0
  -S  Query strands to search against database (for blast[nx], and tblastx)
       3 is both, 1 is top, 2 is bottom [Integer]
    default = 3
  -M  Matrix [String]
    default = BLOSUM62
## scoring 
  -G  Cost to open a gap (-1 invokes default behavior) [Integer]
    default = -1
  -E  Cost to extend a gap (-1 invokes default behavior) [Integer]
    default = -1
  -X  X dropoff value for gapped alignment (in bits) (zero invokes default behavior)
      blastn 30, megablast 20, tblastx 0, all others 15 [Integer]
    default = 0
  -q  Penalty for a nucleotide mismatch (blastn only) [Integer]
    default = -3
  -r  Reward for a nucleotide match (blastn only) [Integer]
    default = 1
  -f  Threshold for extending hits, default if zero
      blastp 11, blastn 0, blastx 12, tblastn 13
      tblastx 13, megablast 0 [Real]
    default = 0
  -g  Perform gapped alignment (not available with tblastx) [T/F]
    default = T
  -y  X dropoff value for ungapped extensions in bits (0.0 invokes default behavior)
      blastn 20, megablast 10, all others 7 [Real]
    default = 0.0
  -Z  X dropoff value for final gapped alignment in bits (0.0 invokes default behavior)
      blastn/megablast 100, tblastx 0, all others 25 [Integer]
    default = 0
  -w  Frame shift penalty (OOF algorithm for blastx) [Integer]
    default = 0
  -t  Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments. (0 invokes default behavior; a negative value disables linking.) [Integer]
    default = 0
  -C  Use composition-based score adjustments for blastp or tblastn:
      As first character:
      D or d: default (equivalent to T)
      0 or F or f: no composition-based statistics
      1: Composition-based statistics as in NAR 29:2994-3005, 2001
      2 or T or t: Composition-based score adjustments as in Bioinformatics 21:902-911,
          2005, conditioned on sequence properties
      3: Composition-based score adjustment as in Bioinformatics 21:902-911,
          2005, unconditionally
      For programs other than tblastn, must either be absent or be D, F or 0.
      As second character, if first character is equivalent to 1, 2, or 3:
      U or u: unified p-value combining alignment p-value and compositional p-value in round 1 only
    default = D
## display
  -I  Show GI's in deflines [T/F]
    default = F
  -v  Number of database sequences to show one-line descriptions for (V) [Integer]
    default = 500
  -b  Number of database sequence to show alignments for (B) [Integer]
    default = 250
  -K  Number of best hits from a region to keep. Off by default.
If used a value of 100 is recommended.  Very high values of -v or -b is also suggested [Integer]
    default = 0
  -P  0 for multiple hit, 1 for single hit (does not apply to blastn) [Integer]
    default = 0
## output
  -O  SeqAlign file [File Out]  Optional
  -J  Believe the query defline [T/F]
    default = F
  -T  Produce HTML output [T/F]
    default = F
  -e  Expectation value (E) [Real]
    default = 10.0
  -m  alignment view options:
      0 = pairwise,
      1 = query-anchored showing identities,
      2 = query-anchored no identities,
      3 = flat query-anchored, show identities,
      4 = flat query-anchored, no identities,
      5 = query-anchored no identities and blunt ends,
      6 = flat query-anchored, no identities and blunt ends,
      7 = XML Blast output,
      8 = tabular,
      9 = tabular with comment lines,
      10 = ASN, text,
      11 = ASN, binary [Integer]
    default = 0
    range from 0 to 11
  -o  BLAST report Output File [File Out]  Optional
    default = stdout
## performance
  -a  Number of processors to use [Integer]
    default = 1
  -l  Restrict search of database to list of GI's [String]  Optional
  -U  Use lower case filtering of FASTA sequence [T/F]  Optional
  -R  PSI-TBLASTN checkpoint file [File In]  Optional
  -n  MegaBlast search [T/F]
    default = F
  -L  Location on query sequence [String]  Optional
  -A  Multiple Hits window size, default if zero (blastn/megablast 0, all others 40) [Integer]
    default = 0
  -B  Number of concatenated queries, for blastn and tblastn [Integer]  Optional
    default = 0
  -V  Force use of the legacy BLAST engine [T/F]  Optional
    default = F
  -s  Compute locally optimal Smith-Waterman alignments (This option is only
      available for gapped tblastn.) [T/F]
    default = F

I highlighted the important ones in red.

Here is the command I finally used to blast protein sequences:

blastall -p blastp -d ~/nearline/blast/nr/nr -i ~/nearline/genomes/Manduca_sexta/Jun2012/Msex05162011.genome.except-5small-scaf.maker-2.25.proteins.OGS_June2012.fasta -o Msex05162011.genome.except-5small-scaf.maker-2.25.proteins.OGS_June2012.fasta.blastp.top15.xml -e 1 -b 15 -m 7 -a 8
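The command above is easier to read with one option per line. The sketch below rebuilds the same call inside a small wrapper; the function name blastp_search and the DRY_RUN variable are my own inventions for illustration, while the blastall flags and values are exactly the ones used above. With DRY_RUN=1 it only prints the command, which is handy for checking paths before starting a search that can run for a long time.

```shell
# Hypothetical wrapper around the blastall call above.
# Usage: blastp_search <query.fasta> <database> <output.xml>
blastp_search() {
    query="$1"   # FASTA file with the query proteins
    db="$2"      # database path without file extension, e.g. .../nr/nr
    out="$3"     # report file to write
    set -- blastall
    set -- "$@" -p blastp    # protein queries against a protein database
    set -- "$@" -d "$db"     # database to search
    set -- "$@" -i "$query"  # query file
    set -- "$@" -o "$out"    # report output file
    set -- "$@" -e 1         # E-value cutoff, stricter than the default of 10
    set -- "$@" -b 15        # alignments for at most 15 database sequences
    set -- "$@" -m 7         # XML output
    set -- "$@" -a 8         # use 8 processors
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"            # print the command instead of running it
    else
        "$@"
    fi
}
```

For example, `DRY_RUN=1 blastp_search proteins.fasta ~/nearline/blast/nr/nr proteins.blastp.top15.xml` prints the full blastall line without running it (proteins.fasta is a placeholder name).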
