The RJPrimers command-line pipeline (v1.0)

User's guide

 

 

Introduction

 

RJPrimers is a high-throughput software tool that identifies unique repeat junctions and designs TE junction based primers for high-throughput marker development. The tool identifies potentially unique repeat junctions using BLAST against fully annotated repeat databases and a repeat junction finding algorithm, and then designs TE based primers using Primer3 and BatchPrimer3. Five primer design strategies for TE based PCR markers are implemented: repeat junction marker (RJM), repeat junction-junction marker (RJJM), insertion-site junction based polymorphism (ISBP), retrotransposon-based insertion polymorphism (RBIP), and inter-retrotransposon amplified polymorphism (IRAP). Both a web-based server and a command-line pipeline are available to meet different requirements.

 

RJPrimers takes sequences in FASTA format as input and generates several pages of primer design results, including an HTML table page and a tab-delimited text file listing all designed primers and primer properties. A detailed primer view page is available for each sequence with successfully designed primers.

 

The command-line pipeline of RJPrimers can process large amounts of sequence data without the memory and network speed limits of a web-based server, and it allows users to employ their own repeat databases. All result files are saved in a separate directory for each execution. Parameter values must be set in the pipeline program before execution.

 

 

The command-line software package

The software package includes the following files:

(1)               RJPrimers_pipeline1.0.pl

(2)               RJFinder.jar: a Java program to find repeat junctions.

(3)               Primer.pm

(4)               Primer3Output.pm

(5)               PrimerPair.pm

(6)               QuickSort.pm

(7)               primer3_core: an executable binary file for primer design, built for the Linux operating system. Because “primer3_core” is platform dependent, the included file might not work on your computer. If it does not, download the Primer3 source code from http://primer3.wiki.sourceforge.net/?title=Primer3_Wiki&printable=yes, compile it on your own operating system, and use the resulting “primer3_core” executable.

(8)               Repeat databases: all repeat databases are stored in the “repeat_libs” directory.

(9)               test.fasta: a test sequence file.

(10)           RJPrimers_user_guide.pdf, RJPrimers_user_guide.html, RJPrimers_user_guide.txt: user’s guide of the RJPrimers pipeline program.

 

All the files are packed in a file “RJPrimers_pipeline.tar.gz”.

 

One third-party software package, NCBI BLAST, is also required by this pipeline. Download it from http://www.ncbi.nlm.nih.gov/BLAST/download.shtml, install it, and set the correct path to the executable files in the BLAST package.

 

Installation

 

  1. Unpack the pipeline software package using the following commands. A directory named “RJPrimers_pipeline” containing the above files will be created:

 

     gunzip RJPrimers_pipeline.tar.gz

     tar xvf RJPrimers_pipeline.tar

 

  2. Download the Primer3 source code from http://primer3.wiki.sourceforge.net/?title=Primer3_Wiki&printable=yes and compile it on your own operating system. Copy the compiled executable file “primer3_core” to the directory “RJPrimers_pipeline”. If your operating system is Linux, the included “primer3_core” may already work, in which case you do not need to download and compile the source code. Test it first to see whether it runs.

 

  3. If the NCBI BLAST software package is not installed on your machine, download it from http://www.ncbi.nlm.nih.gov/BLAST/download.shtml and install it. After installation, set your PATH to point to the “bin” directory of the BLAST software. The bin directory should contain two executable files, formatdb and blastall, which are used by the pipeline program.

 

For example, on a Linux system, you may add the following lines to the “.bashrc” file:

 

PATH=$PATH:/usr/local/blast2.0/bin

export PATH

 

This assumes that the NCBI BLAST software is installed in /usr/local/blast2.0/. Then type the following command to activate the settings:

           

source .bashrc
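To confirm that the PATH setting took effect, you can check that both BLAST executables are found. The following is a minimal sketch; the function name check_tools is illustrative only and is not part of the RJPrimers package.

```shell
# Report whether each named program can be found on the current PATH.
# check_tools is a hypothetical helper name, not part of RJPrimers.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# The two BLAST executables used by the pipeline.
check_tools formatdb blastall || echo "Check the PATH setting in .bashrc"
```

If either program is reported as missing, re-check the directory given in the PATH line above.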

 

  4. Perl and BioPerl are required. If BioPerl is not installed on your computer, it can be downloaded from http://www.bioperl.org/wiki/Main_Page.
  5. The following Perl packages must be installed before running the pipeline:

Bio::Graphics, Bio::SearchIO, Bio::SeqFeature::Generic, GD::Graph::bars, GD::Graph::colour, GD::Text
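One way to see which of these packages are already installed is to ask Perl to load each one. This is a sketch only; the function name check_perl_modules is an assumption made for illustration.

```shell
# Try to load each Perl module and report the result.
# check_perl_modules is a hypothetical helper name, not part of RJPrimers.
check_perl_modules() {
    for module in "$@"; do
        # "perl -MName -e 1" exits non-zero if the module cannot be loaded.
        if perl -M"$module" -e 1 >/dev/null 2>&1; then
            echo "ok:      $module"
        else
            echo "missing: $module"
        fi
    done
}

check_perl_modules Bio::Graphics Bio::SearchIO Bio::SeqFeature::Generic \
    GD::Graph::bars GD::Graph::colour GD::Text
```

Any module reported as missing needs to be installed before the pipeline is run.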

 

Pipeline input

 

The pipeline requires at least one sequence file in FASTA format. For a large number of sequences, the BLAST search for sequences with repeat junctions takes time. To save time, it is useful to preprocess the sequences by running a BLAST search against a repeat database and filtering out non-repetitive sequences. The input can be any short DNA sequences, such as BAC end sequences, shotgun sequences, and next-generation sequences (Roche 454 reads or contigs).
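The preprocessing step could be done along the following lines. This is a sketch under stated assumptions, not a procedure shipped with RJPrimers: the file names reads.fasta, hit_ids.txt and reads_with_repeats.fasta are placeholders, the helper name keep_hits is hypothetical, and the BLAST step relies on blastall tabular output (-m 8), whose first column is the query ID.

```shell
# keep_hits IDS_FILE FASTA_FILE   (hypothetical helper name)
# Print only the FASTA records whose ID (first word after ">") is in IDS_FILE.
keep_hits() {
    awk 'NR == FNR { keep[$1] = 1; next }               # first file: wanted IDs
         /^>/      { id = substr($1, 2); show = (id in keep) }
         show' "$1" "$2"
}

# Step 1 (not run here): list the IDs of sequences with at least one repeat hit.
#   blastall -p blastn -d repeat_libs/TREP.fasta -i reads.fasta -e 1e-5 -m 8 |
#       cut -f 1 | sort -u > hit_ids.txt
# Step 2: keep only those sequences.
#   keep_hits hit_ids.txt reads.fasta > reads_with_repeats.fasta
```

The filtered file can then be passed to the pipeline with the -s option.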

 

Repeat databases

 

The pipeline provides 17 repeat databases located in the “repeat_libs” directory. You may use your own repeat database, but you must first convert it to the format that RJPrimers requires:

(1)    FASTA format;

(2)    Header line definition:

>Sequence_id Repeat class;Order;Super family|data source

 

For example,

>CACTA_15 DNA transposon;TIR;CACTA|Maize TEDB

CACTACAGGAATTCTACTAATCCCATCGGCCAGGGTAATTCCCGTCGGCCAGAGCGAAAGCCGATGGGAA

TAGACTAATTCCCGTCGGCCACCAAATAGCCGACGGGCATTATATTAATTCCCGTCGTCCCCATCTCAAG

CCCACAGGGATTAACTAATCCCCGTCAGCCGTGTGTTCTGGCCAACGGGAATGAGTTAATTCCCGTCGGC

 

>RLG_wyly_AC198779-6150 Retrotransposon;LTR;Gypsy|Maize TEDB

TGTCAGCTCCTCGACACAGCACACACAGGAGCAAGCGGGAGACGACGCGGTTCAAGCGGACACAGGGATT

CCCTCTCGGCATGGAGAAAGGCCCAGGCGTATCAAGAAGCCCAGTACGAGAGTAACAGGCCCTGAGTGGC

TTAACATGTAATGGGTAGTCCATTAACAGTTGGAGATATATACTCTATGTGTAAGAAGTAGACGGCAAGA

AAGAAAATAACAATTACCTGGTTGCCGTATTCTCCATCTCAGCTTCTTCTCCATGCATTCCTTCCTGCTA

TATCTTCCTCTCTGCATCTCGGGTGAGGTTGGAGTTAACAATTGGTATCAAAGACATCGGTCCCCGGATC
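Before building a BLAST database from your own file, it can help to list any header lines that do not follow this layout. The sketch below assumes that the sequence ID contains no spaces and that the three classification fields contain neither “;” nor “|”; the function name check_headers and the demo records are illustrative only.

```shell
# check_headers FASTA_FILE   (hypothetical helper name)
# Print every header line that does NOT match
#   >Sequence_id Repeat class;Order;Super family|data source
check_headers() {
    grep '^>' "$1" |
        grep -v -E '^>[^ ]+ [^;|]+;[^;|]+;[^;|]+\|[^|]+$'
}

# Demo on a small temporary file: the second header lacks the ";" and "|" fields.
demo=$(mktemp)
cat > "$demo" <<'EOF'
>CACTA_15 DNA transposon;TIR;CACTA|Maize TEDB
CACTACAGGAATTC
>bad_entry LTR Gypsy
TGTCAGCTCC
EOF
check_headers "$demo"    # prints: >bad_entry LTR Gypsy
rm -f "$demo"
```

An empty result means that all header lines match the expected layout.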

 

After the reformatted database is ready, follow these steps:

  1. Copy the file to the “repeat_libs” directory
  2. Go to the “repeat_libs” directory and use the BLAST program “formatdb” to create the BLAST database:

 

formatdb -i user_defined_repeatdb.fasta -p F

3.  Modify the pipeline program “RJPrimers_pipeline1.0.pl”:

 

    print "     -d repeat database (1-17)\n";

    print "        1    TREP\n";

    print "        2    RepBase14.07\n";

    print "        3    TIGR Gramineae Repeats v2.0\n";

    print "        4    TIGR Brassica Repeats v2.0\n";

    print "        5    TIGR Brassicaceae Repeats v2.0\n";

    print "        6    TIGR Fabaceae Repeats v2.0\n";

    print "        7    TIGR Solanaceae Repeats v3.2\n";

    print "        8    TIGR Arabidopsis_Repeats\n";

    print "        9    TIGR Glycine Repeats v2.0\n";

    print "        10   TIGR Hordeum Repeats v3.0\n";

    print "        11   TIGR Medicago Repeats v2.0\n";

    print "        12   TIGR Oryza Repeats v3.3\n";

    print "        13   TIGR Solanum Repeats v3.2\n";

    print "        14   TIGR Sorghum Repeats v3.0\n";

    print "        15   TIGR Triticum Repeats v3.0\n";

    print "        16   Maize TEDB\n";

    print "        17   MIPS REdat v4.3\n";

 

Add one line after the above lines:

 

    print "        18   User-defined db\n";

 

 

my @repeatdb_ids = (

       'trep',                #1

       'repbase',             #2

       'tigr_gramineae',       #3

       'tigr_brassica',       #4

       'tigr_brassicaceae',   #5

       'tigr_fabaceae',       #6

       'tigr_solanaceae',     #7

       'tigr_arabidopsis',    #8

       'tigr_glycine',        #9

       'tigr_hordeum',        #10

       'tigr_medicago',       #11

       'tigr_oryza',          #12

       'tigr_sorghum',        #13

       'tigr_solanum',        #14

       'tigr_triticum',       #15

       'maize_tedb',          #16

       'mips_redat',          #17

       );

 

 

Add one line after #17:

 

       'user_defined',        #18

 

 

our %REPEAT_DATABASES=

    (

     # Put more repeat libraries here, e.g.

     'trep'               => 'repeat_libs/TREP.fasta',

     'repbase'            => 'repeat_libs/RepBase14.07.fasta',

     'tigr_gramineae'     => 'repeat_libs/TIGR_Gramineae_Repeats.v3.3.fasta',

     'tigr_brassica'      => 'repeat_libs/TIGR_Brassica_Repeats.v2_0.fasta',

     'tigr_brassicaceae'  => 'repeat_libs/TIGR_Brassicaceae_Repeats.v2_0.fasta',

     'tigr_fabaceae'      => 'repeat_libs/TIGR_Fabaceae_Repeats.v2_0.fasta',

     'tigr_solanaceae'    => 'repeat_libs/TIGR_Solanaceae_Repeats.v3.2.fasta',

     'tigr_arabidopsis'   => 'repeat_libs/TIGR_Arabidopsis_Repeats.v2_0.fasta',

     'tigr_glycine'       => 'repeat_libs/TIGR_Glycine_Repeats.v2_0.fasta',

     'tigr_hordeum'       => 'repeat_libs/TIGR_Hordeum_Repeats.v3.0.fasta',

     'tigr_medicago'      => 'repeat_libs/TIGR_Medicago_Repeats.v2_0.fasta',

     'tigr_oryza'         => 'repeat_libs/TIGR_Oryza_Repeats.v3.3.fasta',

     'tigr_sorghum'       => 'repeat_libs/TIGR_Sorghum_Repeats.v3.0.fasta',

     'tigr_solanum'       => 'repeat_libs/TIGR_Solanum_Repeats.v3.2.fasta',

     'tigr_triticum'      => 'repeat_libs/TIGR_Triticum_Repeats.v3.0.fasta',

     'maize_tedb'         => 'repeat_libs/maize_TEDB.fasta',

     'mips_redat'         => 'repeat_libs/mips_REdat_4.3.fasta',

);

 

Add one line after 'mips_redat':

 

     'user_defined'       => 'repeat_libs/user_defined_db_file_name',

 

Replace “user_defined_db_file_name” with the full file name of your database FASTA file (for example, “tedb.fasta”).

 

our %REPEAT_LIBRARIES = (

    'trep'              => "TREP",

    'repbase'           => "RepBase14.07",

    'tigr_gramineae'    => "TIGR Gramineae Repeats v2.0",

    'tigr_brassica'     => "TIGR Brassica Repeats v2.0",

    'tigr_brassicaceae' => "TIGR Brassicaceae Repeats v2.0",

    'tigr_fabaceae'     => "TIGR Fabaceae Repeats v2.0",

    'tigr_solanaceae'   => "TIGR Solanaceae Repeats v3.2",

    'tigr_arabidopsis'  => "TIGR Arabidopsis_Repeats",

    'tigr_glycine'      => "TIGR Glycine Repeats v2.0",

    'tigr_hordeum'      => "TIGR Hordeum Repeats v3.0",

    'tigr_medicago'     => "TIGR Medicago Repeats v2.0",

    'tigr_oryza'        => "TIGR Oryza Repeats v3.3",

    'tigr_solanum'      => "TIGR Solanum Repeats v3.2",

    'tigr_sorghum'      => "TIGR Sorghum Repeats v3.0",

    'tigr_triticum'     => "TIGR Triticum Repeats v3.0",

    'maize_tedb'        => "Maize TEDB",

    'mips_redat'        => "MIPS REdat v4.3",

      

);

 

Add one line after 'mips_redat':

 

    'user_defined'      => "User defined database",

 

Change line 58:

From

 

my $database_num = 17;

 

To

 

my $database_num = 18;

 

Finally, you can choose database number 18 on the command line to use the user-defined database, for example:

 

perl RJPrimers_pipeline1.0.pl -t 1 -s test.fasta -d 18
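Because the user-defined database has to be registered in several places, a quick consistency check can catch a missed edit. The sketch below only counts the 'user_defined' entries (expected in @repeatdb_ids, %REPEAT_DATABASES and %REPEAT_LIBRARIES) and looks for the updated $database_num line; the function name check_user_db_edits is an assumption made for illustration.

```shell
# check_user_db_edits SCRIPT   (hypothetical helper name)
# Confirm that 'user_defined' appears on three lines and that
# $database_num was raised to 18.
check_user_db_edits() {
    script=$1
    entries=$(grep -c "'user_defined'" "$script" || true)
    if [ "$entries" -eq 3 ] && grep -q 'my \$database_num = 18;' "$script"; then
        echo "ok"
    else
        echo "incomplete: $entries of 3 'user_defined' entries found, or \$database_num is not 18"
    fi
}

# Run the check only if the pipeline program is in the current directory.
if [ -f RJPrimers_pipeline1.0.pl ]; then
    check_user_db_edits RJPrimers_pipeline1.0.pl
fi
```

The check does not verify the menu line for database 18, which affects only the usage message.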

 

 

 

Usage of the pipeline programs

 

The following example shows how to use the pipeline program. The sample sequence file is “test.fasta”, which contains Roche 454 reads of Aegilops tauschii, the diploid ancestor of hexaploid wheat.

 

  1. Parameter setting

In the pipeline program “RJPrimers_pipeline1.0.pl”, you may change the file path of the BLAST program “blastall” if you have not added the NCBI BLAST software to your default path. The path of the Java program is also required. Primer design parameters can be modified in the program.

 

our $BLASTALL= "blastall";

our $JAVA = 'java -Xmx6000m ';

 

 

For repeat junction identification:

our $E_VALUE = '1e-5';           # minimum E value cutoff of all hits

our $MIN_MAX_EVALUE = '1E-50';   # minimum E value cutoff of the top hit

 

# primer design

our $PR_DEFAULT_PRODUCT_MIN_SIZE   = 150;

our $PR_DEFAULT_PRODUCT_MAX_SIZE   = 700;

our $PRIMER_SALT_CONC              = 50.0;

our $PRIMER_DNA_CONC               = 50.0;

our $PRIMER_NUM_NS_ACCEPTED        = 0;

our $PRIMER_MAX_POLY_X             = 2;

our $PRIMER_GC_CLAMP               = 0;

 

our $PRIMER_MIN_GC = 40;

our $PRIMER_MAX_GC = 60;

 

our $PRIMER_MAX_DIFF_TM = 5;

our $PRIMER_MIN_TM = 55;

our $PRIMER_OPT_TM = 60;

our $PRIMER_MAX_TM = 65;

our $MAX_SELF_COMPLEMENTARITY = 5;

our $MAX_3_SELF_COMPLEMENTARITY = 2;

our $PRIMER_MAX_END_STABILITY = 9.0;

 

Please pay attention to the repeat database which must be chosen in the program before execution.

 

  2. Using the pipeline program: RJPrimers_pipeline1.0.pl

 

Usage:

perl RJPrimers_pipeline1.0.pl

     -t primer type:

        0 Repeat junction identification only

        1 RJM primers

        2 RJJM primers

        3 ISBP primers

        4 RBIP primers

        5 IRAP primers

     -s sequence file

     -d repeat database (1-17)

        1    TREP

        2    RepBase14.07

        3    TIGR Gramineae Repeats v2.0

        4    TIGR Brassica Repeats v2.0

        5    TIGR Brassicaceae Repeats v2.0

        6    TIGR Fabaceae Repeats v2.0

        7    TIGR Solanaceae Repeats v3.2

        8    TIGR Arabidopsis_Repeats

        9    TIGR Glycine Repeats v2.0

        10   TIGR Hordeum Repeats v3.0

        11   TIGR Medicago Repeats v2.0

        12   TIGR Oryza Repeats v3.3

        13   TIGR Solanum Repeats v3.2

        14   TIGR Sorghum Repeats v3.0

        15   TIGR Triticum Repeats v3.0

        16   Maize TEDB

        17   MIPS REdat v4.3

     -f output single primer design file (1 or 0, default 0)

     -b BLAST output of sequence against repeat database (default: no file)

     -a BLAST table output of sequence against repeat database (default: no file)

     -l half junction length (bp): this only for junction identification, t=0. Default 100 bp, ie. total 200 bp

 

(1)                          -t: primer type. The RJPrimers 1.0 pipeline offers 6 types. “0” is for repeat junction identification only; in that case, use the -l option to set the half junction length, and the program extracts a sequence fragment centered on each identified repeat junction and exports it to a file. For example, -l 200 means that a 400 bp fragment centered on an identified repeat junction is exported. The default is 100 bp, i.e., 200 bp in total.

(2)                          -s: an input sequence file in FASTA format.

(3)                          -d: the number of a repeat database. Only one number can be chosen in the pipeline program.

(4)                          -f: if this is on (“1”), a primer view file is generated for each sequence with successfully designed primers. For large amounts of sequence data, generating primer view files takes a long time, so the default is “0”.

(5)                          -b: if you already have a BLAST result file (the default output of the blastall program, i.e., -m 0 in blastall), you can specify its file name. The program then skips the BLAST search, which saves time for large amounts of sequence data.

(6)                          -a: a BLAST table output file from RJPrimers_pipeline. RJPrimers_pipeline generates a BLAST table output file for each primer design run (-t option 1-5). If you want to re-run the program or try different primer types on the same data set, you can supply this file and save more than half of the running time.

(7)                          -l: used only for repeat junction identification. The default is 100 bp, i.e., a 200 bp repeat junction region is written to a file.
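For -t 0, the exported fragment spans the half junction length on each side of the junction. The arithmetic can be sketched as follows. The coordinate convention (1-based, with the junction lying between bases POS and POS+1) and the clipping at the sequence ends are assumptions made for illustration and are not taken from the RJPrimers source; the function name junction_window is hypothetical.

```shell
# junction_window POS HALF SEQLEN   (hypothetical helper name)
# Print "start end" (1-based, inclusive) of the region covering HALF bases on
# each side of a junction that lies between bases POS and POS+1.
# Assumption: the window is clipped at the sequence ends; RJPrimers may differ.
junction_window() {
    pos=$1; half=$2; seqlen=$3
    start=$((pos - half + 1))
    end=$((pos + half))
    if [ "$start" -lt 1 ]; then start=1; fi
    if [ "$end" -gt "$seqlen" ]; then end=$seqlen; fi
    echo "$start $end"
}

junction_window 250 100 485   # prints "151 350": 200 bp in total for -l 100
junction_window 50 100 485    # prints "1 150": clipped at the start of the read
```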

 

 

 

For example,

 

perl RJPrimers_pipeline1.0.pl -t 1 -s test.fasta -d 1

 

primer type: 1

RJPrimers is working on primer picking.....

This will take several seconds to minutes depending on numbers of sequences.

Please wait ...

 

1 database(s) was (were) selected:

    TREP

Perform BLASTN search against the selected repeat databases..

blastall -p blastn -a 7 -d ''/usr/lib/cgi-bin/RJPrimers/repeat_libs/TREP.fasta'  ' -i te_primers/p_1261959902/seq.fasta -b 1000 -e 1e-5 -o te_primers/p_1261959902/blastn_report.txt

Identifying TE junctions...

Designing primers...

 

Result summary:

Junctions per Kb          = 3.49533157768734

Total number of junctions = 73

Total sequence length     = 20885

Total number of sequences = 43

The job is finished. Please check the results at the result directory: te_primers/p_1261959902

Total 0.0833333333333333 minutes was used.
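The “Junctions per Kb” value in the summary is consistent with the total number of junctions divided by the total sequence length, multiplied by 1000. A small sketch of that calculation, using the numbers reported above (the function name junctions_per_kb is illustrative only):

```shell
# junctions_per_kb JUNCTIONS TOTAL_LENGTH_BP   (hypothetical helper name)
# Junction density = junctions / total sequence length * 1000.
junctions_per_kb() {
    awk -v j="$1" -v len="$2" 'BEGIN { printf "%.4f\n", j / len * 1000 }'
}

junctions_per_kb 73 20885   # prints 3.4953
```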

 

All result files are saved in the directory te_primers/p_1261959902.