Splitting your input into multiple parts
Given your fasta files in infile/*.fasta, first run step 1 (generation of the index files):
perl /home/klemmp/proteinortho-master/proteinortho.pl -project=test -step=1 infile/*.fasta
Then each of the following lines generates one tenth of step 2 (RBH generation):
perl /home/klemmp/proteinortho-master/proteinortho.pl -project=test -jobs=1/10 -step=2 infile/*.fasta
perl /home/klemmp/proteinortho-master/proteinortho.pl -project=test -jobs=2/10 -step=2 infile/*.fasta
perl /home/klemmp/proteinortho-master/proteinortho.pl -project=test -jobs=3/10 -step=2 infile/*.fasta
perl /home/klemmp/proteinortho-master/proteinortho.pl -project=test -jobs=4/10 -step=2 infile/*.fasta
perl /home/klemmp/proteinortho-master/proteinortho.pl -project=test -jobs=5/10 -step=2 infile/*.fasta
perl /home/klemmp/proteinortho-master/proteinortho.pl -project=test -jobs=6/10 -step=2 infile/*.fasta
perl /home/klemmp/proteinortho-master/proteinortho.pl -project=test -jobs=7/10 -step=2 infile/*.fasta
perl /home/klemmp/proteinortho-master/proteinortho.pl -project=test -jobs=8/10 -step=2 infile/*.fasta
perl /home/klemmp/proteinortho-master/proteinortho.pl -project=test -jobs=9/10 -step=2 infile/*.fasta
perl /home/klemmp/proteinortho-master/proteinortho.pl -project=test -jobs=10/10 -step=2 infile/*.fasta
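The ten lines above can equally be produced by a loop, which is less error-prone to edit when the number of parts changes. A minimal sketch (the proteinortho.pl path is the one from the example; adjust it for your installation) that prints the commands for inspection; pipe the output to `sh` to actually run them:

```shell
# Print the ten step-2 commands instead of running them directly.
# Pipe to `sh` to execute, or submit each line as a separate cluster job.
for i in $(seq 1 10); do
  echo "perl /home/klemmp/proteinortho-master/proteinortho.pl -project=test -jobs=$i/10 -step=2 infile/*.fasta"
done
```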
!! Don't use the -keep option here, as it dramatically increases I/O usage !!
These commands can be run on different machines at different times. The output blast-graphs contain the job number in their names:
test.blast-graph_1_10, test.blast-graph_2_10, ..., test.blast-graph_10_10
What to do with the test.blast-graph_1_10, test.blast-graph_2_10, ... files?
Proteinortho can use the part files for step 3 directly (just pass the same -project= name as in step 2). So for the example above use
perl /home/klemmp/proteinortho-master/proteinortho.pl -project=test -step=3
Alternatively, you can simply concatenate the test.blast-graph_1_10, test.blast-graph_2_10, ... files into a single blast-graph file.
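Concatenation is a single cat. The example below fabricates two tiny part files in a temporary directory purely to illustrate the pattern; in practice the test.blast-graph_*_10 files come from your step-2 jobs, and the merged file name test.blast-graph is an assumption:

```shell
cd "$(mktemp -d)"                             # demo directory
printf 'part1\n' > test.blast-graph_1_10      # stand-ins for real part graphs
printf 'part2\n' > test.blast-graph_2_10
cat test.blast-graph_*_10 > test.blast-graph  # order of the parts does not matter
wc -l < test.blast-graph
```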
qsub (MARC2) script
qsub script for deploying a single job (on the MARC2 cluster) with 64 cores, working on the first tenth of the input files:
#$ -S /bin/bash
#$ -e /home/klemmp/sge
#$ -o /home/klemmp/sge
#$ -l h_rt=200000
#$ -l h_vmem=2G
#$ -pe orte_sl64* 64
#$ -cwd
#$ -N q_test
. /etc/profile.d/modules.sh
module purge
module load gcc/6.3.0
/usr/bin/time -f "%e,%M" perl /home/klemmp/proteinortho-master/proteinortho.pl -jobs=1/10 -project=$projectname -step=2 -cpus=64 -binpath=/home/klemmp/bin -p=$p infile/*.fasta -tmp=$TMPDIR >"/scratch/klemmp/stdout" 2>"/scratch/klemmp/stderr"
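To submit, save the script above under a name such as q_test (an assumed name, matching the -N option) and pass it to qsub; -R y (slot reservation) helps large parallel jobs such as this 64-slot one get scheduled. The guard below is a sketch, not part of proteinortho; it only submits when qsub is actually on the PATH, so it does nothing harmful outside the cluster:

```shell
# Submit the job script if we are on a machine with Grid Engine installed.
if command -v qsub >/dev/null 2>&1; then
  qsub -R y q_test   # -R y: reserve slots for the 64-slot parallel environment
else
  echo "qsub not found -- run this on the cluster login node"
fi
```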
bash script for generating qsub (MARC2) scripts
Bash script for deploying 10 distinct jobs (on the MARC2 cluster), each with 64 cores, working on the same input files but on different parts:
#!/bin/bash
p=mmseqsp
cores=64
projectname="$p"_ob"$cores"
infile=/scratch/klemmp/fasta_2017
numofjobs=10
mkdir /scratch/klemmp/$projectname
cd /scratch/klemmp/$projectname
for i in $(seq 1 $numofjobs)
do
echo "#\$ -S /bin/bash
#\$ -e /home/klemmp/sge
#\$ -o /home/klemmp/sge
#\$ -l h_rt=200000
#\$ -l h_vmem=2G
#\$ -pe orte_sl$cores* $cores
#\$ -cwd
#\$ -N q"$i"_"$p"
. /etc/profile.d/modules.sh
module purge
module load gcc/6.3.0
mkdir $i
cd $i
/usr/bin/time -f \"%e,%M\" perl /home/klemmp/proteinortho-master/proteinortho.pl -jobs=$i/$numofjobs -project=$projectname -step=2 -cpus=$cores -binpath=/home/klemmp/bin -p=$p $infile/*.fasta -tmp=\$TMPDIR >\"/scratch/klemmp/$projectname.$i.stdout\" 2>\"/scratch/klemmp/$projectname.$i.stderr\"
">q_$i
if [ $cores -eq 64 ]; then
qsub -R y q_$i
else
qsub q_$i
fi
done
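Once all ten jobs have finished, step 3 is run once with the same project name (as shown earlier). A small guard helps avoid starting step 3 while part graphs are still missing; this is a sketch using the project name from the first example above, and the actual step-3 call is left commented out:

```shell
# Check that all part blast-graphs exist before running step 3.
projectname=test   # same -project= name as used in step 2
numofjobs=10
n=$(ls "$projectname".blast-graph_*_"$numofjobs" 2>/dev/null | wc -l)
if [ "$n" -eq "$numofjobs" ]; then
  echo "all $numofjobs parts present -- safe to run step 3"
  # perl /home/klemmp/proteinortho-master/proteinortho.pl -project=$projectname -step=3
else
  echo "only $n of $numofjobs part graphs found -- jobs still running?" >&2
fi
```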