The RJPrimers command-line pipeline (v1.0)
User's guide
Introduction
RJPrimers is a high-throughput software tool that identifies unique repeat junctions and designs transposable element (TE) junction-based primers for marker development. It identifies potentially unique repeat junctions by running BLAST against fully annotated repeat databases and applying a repeat junction finding algorithm, and then designs TE-based primers using Primer3 and BatchPrimer3. Five primer design strategies for TE-based PCR markers are implemented: repeat junction marker (RJM), repeat junction-junction marker (RJJM), insertion-site junction based polymorphism (ISBP), retrotransposon-based insertion polymorphism (RBIP), and inter-retrotransposon amplified polymorphism (IRAP). Both a web-based server and a command-line pipeline are available to meet different requirements.
RJPrimers takes sequences in FASTA format as input and generates several pages of primer design results, including an HTML table page and a tab-delimited text file listing all designed primers and primer properties. A detailed primer view page is available for each sequence with successfully designed primers.
The command-line pipeline of RJPrimers can process large amounts of sequence data without the memory and network speed limits that apply to the web-based server, and it allows users to employ their own repeat databases. All result files for an execution are saved in a separate directory. The parameter values must be set in the pipeline program before execution.
The command-line software package
The software package includes the following files:
(1) RJPrimers_pipeline1.0.pl
(2) RJFinder.jar: a Java program to find repeat junctions.
(3) Primer.pm
(4) Primer3Output.pm
(5) PrimerPair.pm
(6) QuickSort.pm
(7) primer3_core: an executable binary file for primer design on the Linux operating system. Because “primer3_core” is a platform-dependent executable, the included file might not work on your computer. In that case, download the Primer3 source code from http://primer3.wiki.sourceforge.net/?title=Primer3_Wiki&printable=yes, compile it under your own operating system, and use the resulting “primer3_core” executable.
(8) Repeat databases: all repeat databases are stored in the “repeat_libs” directory.
(9) test.fasta: a test sequence file.
(10) RJPrimers_user_guide.pdf, RJPrimers_user_guide.html, RJPrimers_user_guide.txt: user’s guide of the RJPrimers pipeline program.
All the files are packed in a file “RJPrimers_pipeline.tar.gz”.
A third-party software package, NCBI BLAST, is also required by this pipeline. Download it from http://www.ncbi.nlm.nih.gov/BLAST/download.shtml, install it, and set the correct path to the executable files in the BLAST package.
Installation
gunzip RJPrimers_pipeline.tar.gz
tar -xvf RJPrimers_pipeline.tar
Download the NCBI BLAST software package from http://www.ncbi.nlm.nih.gov/BLAST/download.shtml and install it. After installation, set the path to point to the “bin” directory of the BLAST software. That directory should contain two executable files, formatdb and blastall, which are used by the pipeline programs.
For example, on a Linux system, you may add the following lines to the “.bashrc” file:
PATH=$PATH:/usr/local/blast2.0/bin
export PATH
This assumes that the NCBI BLAST software is installed in /usr/local/blast2.0/. Then type the following command to activate the settings:
source .bashrc
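Before running the pipeline, it may help to confirm that the two BLAST executables are visible on the PATH. The following is a minimal Python sketch and is not part of the RJPrimers package; the function name find_blast_tools is hypothetical.

```python
import shutil


def find_blast_tools(tools=("formatdb", "blastall")):
    """Return a dict mapping each tool name to its full path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_blast_tools().items():
        status = path if path else "NOT FOUND (check the PATH setting in .bashrc)"
        print(f"{tool}: {status}")
```

If either tool is reported as not found, recheck the PATH line added to “.bashrc” and run “source .bashrc” again.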
The following Perl modules are also required: Bio::Graphics, Bio::SearchIO, Bio::SeqFeature::Generic, GD::Graph::bars, GD::Graph::colour, GD::Text
Pipeline input
The pipeline requires at least one sequence file in FASTA format. For a large number of sequences, the BLAST search for sequences with repeat junctions takes time. To save time, it is useful to preprocess the sequences by running a BLAST search against a repeat database and filtering out non-repetitive sequences. The input can be any short DNA sequences, such as BAC end sequences, shotgun sequences, and next-generation sequences (Roche 454 reads or contigs).
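The preprocessing step described above can be sketched in Python as follows. This is an illustration only and not part of the RJPrimers package. It assumes that a preliminary BLAST search has already produced a tab-delimited hit table whose first column is the query sequence ID (for example, the tabular output of blastall -m 8); the function names ids_with_hits and filter_fasta are hypothetical.

```python
def ids_with_hits(blast_table_lines):
    """Collect query IDs (first column) from tab-delimited BLAST hit lines."""
    ids = set()
    for line in blast_table_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        ids.add(line.split("\t")[0])
    return ids


def filter_fasta(fasta_lines, keep_ids):
    """Yield only the FASTA records whose ID (first word of the header) is in keep_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in keep_ids
        if keep:
            yield line


if __name__ == "__main__":
    # Tiny in-memory demonstration; real use would read both inputs from files.
    hits = ["read1\tCACTA_15\t95.00\t120"]
    fasta = [">read1\n", "ACGT\n", ">read2\n", "GGGG\n"]
    print("".join(filter_fasta(fasta, ids_with_hits(hits))), end="")
```

Only the sequences that have at least one hit against the repeat database are kept, so the later junction search runs on a smaller file.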
Repeat databases
The pipeline provides 17 repeat databases located in the “repeat_libs” directory. You may use your own repeat database, but you must first convert it to the format that RJPrimers requires. The format requires:
(1) FASTA format;
(2) Header line definition:
>Sequence_id Repeat class;Order;Super family|data source
For example,
>CACTA_15 DNA transposon;TIR;CACTA|Maize TEDB
CACTACAGGAATTCTACTAATCCCATCGGCCAGGGTAATTCCCGTCGGCCAGAGCGAAAGCCGATGGGAA
TAGACTAATTCCCGTCGGCCACCAAATAGCCGACGGGCATTATATTAATTCCCGTCGTCCCCATCTCAAG
CCCACAGGGATTAACTAATCCCCGTCAGCCGTGTGTTCTGGCCAACGGGAATGAGTTAATTCCCGTCGGC
>RLG_wyly_AC198779-6150 Retrotransposon;LTR;Gypsy|Maize TEDB
TGTCAGCTCCTCGACACAGCACACACAGGAGCAAGCGGGAGACGACGCGGTTCAAGCGGACACAGGGATT
CCCTCTCGGCATGGAGAAAGGCCCAGGCGTATCAAGAAGCCCAGTACGAGAGTAACAGGCCCTGAGTGGC
TTAACATGTAATGGGTAGTCCATTAACAGTTGGAGATATATACTCTATGTGTAAGAAGTAGACGGCAAGA
AAGAAAATAACAATTACCTGGTTGCCGTATTCTCCATCTCAGCTTCTTCTCCATGCATTCCTTCCTGCTA
TATCTTCCTCTCTGCATCTCGGGTGAGGTTGGAGTTAACAATTGGTATCAAAGACATCGGTCCCCGGATC
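A user-defined database can be checked against the header definition above before it is formatted for BLAST. The following Python sketch is not part of the RJPrimers package. It assumes that all five fields (sequence ID, repeat class, order, super family, and data source) must be present in every header, which may be stricter than what the pipeline itself enforces; the function names are hypothetical.

```python
import re

# >Sequence_id Repeat class;Order;Super family|data source
HEADER_RE = re.compile(r"^>(\S+)\s+([^;|]+);([^;|]+);([^;|]+)\|(.+)$")


def parse_repeat_header(line):
    """Split a header into its five fields, or return None if it does not match."""
    match = HEADER_RE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    keys = ("sequence_id", "repeat_class", "order", "super_family", "data_source")
    return dict(zip(keys, (field.strip() for field in match.groups())))


def invalid_headers(fasta_lines):
    """Return (line_number, header) pairs for headers that do not follow the format."""
    return [
        (number, line.rstrip("\r\n"))
        for number, line in enumerate(fasta_lines, start=1)
        if line.startswith(">") and parse_repeat_header(line) is None
    ]


if __name__ == "__main__":
    print(parse_repeat_header(">CACTA_15 DNA transposon;TIR;CACTA|Maize TEDB"))
```

Running invalid_headers over the lines of a database file lists the headers that would need to be corrected first.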
After the reformatted database is ready, follow these steps:
formatdb -i user_defined_repeatdb.fasta -p F
3. Modify the pipeline program “RJPrimers_pipeline1.0.pl”:
print " -d repeat database (1-17)\n";
print "    1 TREP\n";
print "    2 RepBase14.07\n";
print "    3 TIGR Gramineae Repeats v2.0\n";
print "    4 TIGR Brassica Repeats v2.0\n";
print "    5 TIGR Brassicaceae Repeats v2.0\n";
print "    6 TIGR Fabaceae Repeats v2.0\n";
print "    7 TIGR Solanaceae Repeats v3.2\n";
print "    8 TIGR Arabidopsis_Repeats\n";
print "    9 TIGR Glycine Repeats v2.0\n";
print "    10 TIGR Hordeum Repeats v3.0\n";
print "    11 TIGR Medicago Repeats v2.0\n";
print "    12 TIGR Oryza Repeats v3.3\n";
print "    13 TIGR Solanum Repeats v3.2\n";
print "    14 TIGR Sorghum Repeats v3.0\n";
print "    15 TIGR Triticum Repeats v3.0\n";
print "    16 Maize TEDB\n";
print "    17 MIPS REdat v4.3\n";
Add one line after the above lines:
print "    18 User-defined db\n";
my @repeatdb_ids = (
'trep', #1
'repbase', #2
'tigr_gramineae', #3
'tigr_brassica', #4
'tigr_brassicaceae', #5
'tigr_fabaceae', #6
'tigr_solanaceae', #7
'tigr_arabidopsis', #8
'tigr_glycine', #9
'tigr_hordeum', #10
'tigr_medicago', #11
'tigr_oryza', #12
'tigr_sorghum', #13
'tigr_solanum', #14
'tigr_triticum', #15
'maize_tedb', #16
'mips_redat', #17
);
Add one line after #17:
'user_defined', #18
our %REPEAT_DATABASES = (
# Put more repeat libraries here, e.g.
'trep' => 'repeat_libs/TREP.fasta',
'repbase' => 'repeat_libs/RepBase14.07.fasta',
'tigr_gramineae' => 'repeat_libs/TIGR_Gramineae_Repeats.v3.3.fasta',
'tigr_brassica' => 'repeat_libs/TIGR_Brassica_Repeats.v2_0.fasta',
'tigr_brassicaceae' => 'repeat_libs/TIGR_Brassicaceae_Repeats.v2_0.fasta',
'tigr_fabaceae' => 'repeat_libs/TIGR_Fabaceae_Repeats.v2_0.fasta',
'tigr_solanaceae' => 'repeat_libs/TIGR_Solanaceae_Repeats.v3.2.fasta',
'tigr_arabidopsis' => 'repeat_libs/TIGR_Arabidopsis_Repeats.v2_0.fasta',
'tigr_glycine' => 'repeat_libs/TIGR_Glycine_Repeats.v2_0.fasta',
'tigr_hordeum' => 'repeat_libs/TIGR_Hordeum_Repeats.v3.0.fasta',
'tigr_medicago' => 'repeat_libs/TIGR_Medicago_Repeats.v2_0.fasta',
'tigr_oryza' => 'repeat_libs/TIGR_Oryza_Repeats.v3.3.fasta',
'tigr_sorghum' => 'repeat_libs/TIGR_Sorghum_Repeats.v3.0.fasta',
'tigr_solanum' => 'repeat_libs/TIGR_Solanum_Repeats.v3.2.fasta',
'tigr_triticum' => 'repeat_libs/TIGR_Triticum_Repeats.v3.0.fasta',
'maize_tedb' => 'repeat_libs/maize_TEDB.fasta',
'mips_redat' => 'repeat_libs/mips_REdat_4.3.fasta',
);
Add one line after 'mips_redat':
'user_defined' => 'repeat_libs/user_defined_db_file_name',
The “user_defined_db_file_name” placeholder stands for the full file name of the database FASTA file (for example, “tedb.fasta”).
our %REPEAT_LIBRARIES = (
'trep' => "TREP",
'repbase' => "RepBase14.07",
'tigr_gramineae' => "TIGR Gramineae Repeats v2.0",
'tigr_brassica' => "TIGR Brassica Repeats v2.0",
'tigr_brassicaceae' => "TIGR Brassicaceae Repeats v2.0",
'tigr_fabaceae' => "TIGR Fabaceae Repeats v2.0",
'tigr_solanaceae' => "TIGR Solanaceae Repeats v3.2",
'tigr_arabidopsis' => "TIGR Arabidopsis_Repeats",
'tigr_glycine' => "TIGR Glycine Repeats v2.0",
'tigr_hordeum' => "TIGR Hordeum Repeats v3.0",
'tigr_medicago' => "TIGR Medicago Repeats v2.0",
'tigr_oryza' => "TIGR Oryza Repeats v3.3",
'tigr_solanum' => "TIGR Solanum Repeats v3.2",
'tigr_sorghum' => "TIGR Sorghum Repeats v3.0",
'tigr_triticum' => "TIGR Triticum Repeats v3.0",
'maize_tedb' => "Maize TEDB",
'mips_redat' => "MIPS REdat v4.3",
);
Add one line after 'mips_redat':
'user_defined' => "User defined database",
Change line 58 from
my $database_num = 17;
to
my $database_num = 18;
Finally, you can choose database number 18 on the command line to use the user-defined database, for example:
perl RJPrimers_pipeline1.0.pl -t 1 -s test.fasta -d 18
Usage of the pipeline programs
The following example shows how to use the pipeline program. The sample sequence file “test.fasta” contains Roche 454 reads of Aegilops tauschii, a diploid ancestor of hexaploid wheat.
In the pipeline program “RJPrimers_pipeline1.0.pl”, change the file path of the BLAST program “blastall” if you have not added the NCBI BLAST software to your default path. The path to the Java program is also required. Primer design parameters can be modified in the program as well.
our $BLASTALL = "blastall";
our $JAVA = 'java -Xmx6000m ';
For repeat junction identification:
our $E_VALUE = '1e-5'; # minimum E value cutoff of all hits
our $MIN_MAX_EVALUE = '1E-50'; # minimum E value cutoff of the top hit
# primer design
our $PR_DEFAULT_PRODUCT_MIN_SIZE = 150;
our $PR_DEFAULT_PRODUCT_MAX_SIZE = 700;
our $PRIMER_SALT_CONC = 50.0;
our $PRIMER_DNA_CONC = 50.0;
our $PRIMER_NUM_NS_ACCEPTED = 0;
our $PRIMER_MAX_POLY_X = 2;
our $PRIMER_GC_CLAMP = 0;
our $PRIMER_MIN_GC = 40;
our $PRIMER_MAX_GC = 60;
our $PRIMER_MAX_DIFF_TM = 5;
our $PRIMER_MIN_TM = 55;
our $PRIMER_OPT_TM = 60;
our $PRIMER_MAX_TM = 65;
our $MAX_SELF_COMPLEMENTARITY = 5;
our $MAX_3_SELF_COMPLEMENTARITY = 2;
our $PRIMER_MAX_END_STABILITY = 9.0;
Please pay attention to the repeat database which must be chosen in the program before execution.
Usage:
perl RJPrimers_pipeline1.0.pl
-t primer type:
    0 Repeat junction identification only
    1 RJM primers
    2 RJJM primers
    3 ISBP primers
    4 RBIP primers
    5 IRAP primers
-s sequence file
-d repeat database (1-17)
    1 TREP
    2 RepBase14.07
    3 TIGR Gramineae Repeats v2.0
    4 TIGR Brassica Repeats v2.0
    5 TIGR Brassicaceae Repeats v2.0
    6 TIGR Fabaceae Repeats v2.0
    7 TIGR Solanaceae Repeats v3.2
    8 TIGR Arabidopsis_Repeats
    9 TIGR Glycine Repeats v2.0
    10 TIGR Hordeum Repeats v3.0
    11 TIGR Medicago Repeats v2.0
    12 TIGR Oryza Repeats v3.3
    13 TIGR Solanum Repeats v3.2
    14 TIGR Sorghum Repeats v3.0
    15 TIGR Triticum Repeats v3.0
    16 Maize TEDB
    17 MIPS REdat v4.3
-f output single primer design file (1 or 0, default 0)
-b BLAST output of sequence against repeat database (default: no file)
-a BLAST table output of sequence against repeat database (default: no file)
-l half junction length (bp): this is only for junction identification, t=0. Default 100 bp, i.e. total 200 bp
(1) -t: primer type. The RJPrimers 1.0 pipeline offers 6 different types. “0” is for repeat junction identification only. If you choose “0”, the -l option applies: the program extracts a sequence fragment centered on each identified repeat junction and exports it to a file. For example, -l 200 exports a 400 bp fragment centered on the junction. The default length is 100 bp, i.e., 200 bp in total.
(2) -s: an input sequence file in FASTA format.
(3) -d: the number of a repeat database. Only one number can be chosen in the pipeline program.
(4) -f: if this is on (“1”), a primer view file is generated for each sequence with successfully designed primers. For large amounts of sequence data, generating the primer view files takes a long time. The default is “0”.
(5) -b: if you already have a BLAST result file (the default output of the blastall program, i.e., -m 0 in blastall), you can specify its file name. The program then skips the BLAST search, which saves time for large amounts of sequence data.
(6) -a: a BLAST table output file from RJPrimers_pipeline. RJPrimers_pipeline generates a BLAST table output file for each execution of primer design (-t options 1-5). If you want to re-run the program or try different primer types on the same data set, you can supply this file and save more than half of the running time.
(7) -l: this is only for repeat junction identification. The default is 100 bp, i.e., a 200 bp repeat junction region is written to a file.
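The options above can be assembled programmatically when many runs are scripted. The following Python sketch is not part of the RJPrimers package. It uses only the flags documented above; the function name build_command is hypothetical, and the decision to reject -l for primer types other than 0 is an assumption of this sketch rather than documented pipeline behavior.

```python
def build_command(primer_type, seq_file, db_number, single_files=False,
                  blast_report=None, blast_table=None, half_length=None,
                  script="RJPrimers_pipeline1.0.pl", n_databases=17):
    """Assemble the argument list for one pipeline run, checking the documented ranges."""
    if primer_type not in range(6):
        raise ValueError("-t must be 0-5")
    if not 1 <= db_number <= n_databases:
        raise ValueError(f"-d must be 1-{n_databases}")
    cmd = ["perl", script, "-t", str(primer_type), "-s", seq_file, "-d", str(db_number)]
    if single_files:
        cmd += ["-f", "1"]          # one primer view file per sequence
    if blast_report:
        cmd += ["-b", blast_report]  # reuse an existing BLAST report
    if blast_table:
        cmd += ["-a", blast_table]   # reuse an existing BLAST table output
    if half_length is not None:
        if primer_type != 0:
            # Assumption of this sketch: -l is only meaningful with -t 0.
            raise ValueError("-l applies only to junction identification (-t 0)")
        cmd += ["-l", str(half_length)]  # exported fragment is 2 * half_length bp
    return cmd


if __name__ == "__main__":
    # To actually run the pipeline, pass the list to subprocess.run(cmd, check=True).
    print(" ".join(build_command(1, "test.fasta", 1)))
```

Set n_databases to 18 after adding a user-defined database as described in the “Repeat databases” section.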
For example,
perl RJPrimers_pipeline1.0.pl -t 1 -s test.fasta -d 1
primer type: 1
RJPrimers is working on primer picking.....
This will take several seconds to minutes depending on numbers of sequences.
Please wait ...
1 database(s) was (were) selected:
TREP
Perform BLASTN search against the selected repeat databases..
blastall -p blastn -a 7 -d ''/usr/lib/cgi-bin/RJPrimers/repeat_libs/TREP.fasta' ' -i te_primers/p_1261959902/seq.fasta -b 1000 -e 1e-5 -o te_primers/p_1261959902/blastn_report.txt
Identifying TE junctions...
Designing primers...
Result summary:
Junctions per Kb = 3.49533157768734
Total number of junctions = 73
Total sequence length = 20885
Total number of sequences = 43
The job is finished. Please check the results at the result directory: te_primers/p_1261959902
Total 0.0833333333333333 minutes was used.
All result files are saved in the directory te_primers/p_1261959902.
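The result summary printed at the end of a run can be read back and cross-checked with a short script. The following Python sketch is not part of the RJPrimers package. It assumes that “Junctions per Kb” is the total number of junctions divided by the total sequence length and multiplied by 1000; the guide does not state this formula, but the numbers in the example run agree with it. The function names are hypothetical.

```python
import re


def parse_summary(text):
    """Extract 'name = number' pairs from the result summary into a dict of floats."""
    pairs = re.findall(r"^[ \t]*(.+?)[ \t]*=[ \t]*([0-9.]+)[ \t]*$", text, flags=re.MULTILINE)
    return {name: float(value) for name, value in pairs}


def junctions_per_kb(total_junctions, total_length_bp):
    """Junction density per 1000 bp of input sequence (assumed formula)."""
    return total_junctions / total_length_bp * 1000


summary = """Junctions per Kb = 3.49533157768734
Total number of junctions = 73
Total sequence length = 20885
Total number of sequences = 43"""

values = parse_summary(summary)
density = junctions_per_kb(values["Total number of junctions"],
                           values["Total sequence length"])
print(round(density, 4))  # 3.4953
```

For the example run, 73 junctions in 20885 bp gives about 3.5 junctions per kb, matching the reported value.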