[STDERR] WARNING The input (new_all_13.tsv) contains 32942 queries, but I extracted 32935 entries out of the fasta(s).

Hi,

Thank you for your report!

I am fairly sure that your error comes from duplicated protein names across the different species files.

You can verify this by comparing the number of proteins:

grep -P "^>" *fas | wc -l

with the number of unique protein names:

cat *fas | perl -lne 'if(/^>([^ ]+)/){print $1}' | sort -u | wc -l

Please let me know if the numbers differ.
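To see which names are duplicated, and not only how many, the second command can be turned into a small helper. This is a minimal sketch: the function name `list_duplicate_ids` is hypothetical and not part of proteinortho, and it assumes the id is the text between `>` and the first space, as in the perl one-liner above.

```shell
# Hypothetical helper (not part of proteinortho): print every FASTA id
# that occurs more than once across the given files.
# Assumption: the id is the header text between '>' and the first space.
list_duplicate_ids() {
    cat "$@" |
        grep '^>' |                        # keep header lines only
        sed -e 's/^>//' -e 's/ .*//' |     # strip '>' and the description
        sort |
        uniq -d                            # print each repeated id once
}
```

It could be called as `list_duplicate_ids *fas`; an empty output means no name is shared between entries.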

If the numbers differ:

Internally, Proteinortho can handle duplicated names (every protein is assigned a unique id).

But the auxiliary tool for extracting the sequences (proteinortho_grab_proteins) works only on the plain names, so duplicated entries are a problem there. I have already been thinking about this, but I cannot see a clean way of resolving it while:

a) not changing the output format of proteinortho (backwards compatibility)

b) keeping proteinortho_grab_proteins as a separate program (the problem could be solved if proteinortho directly produced these fasta files based on the internal ids, but you do not want this every time)

I am also not a big fan of changing the names (e.g. with sed), because this can easily introduce other problems for different naming schemes. For example, consider someone who appends underscores to mark different protein variants (isoforms):

>myprot

....

>myprot_

....

After removing all underscores the two names are identical, which creates exactly the kind of duplicate described above.
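Before applying any rename to the headers, one can check whether it would merge names that are currently distinct. The sketch below is only an illustration: `colliding_after_rename` is a hypothetical name, the rename is passed as a sed expression, and ids are again assumed to end at the first space.

```shell
# Hypothetical check (not part of proteinortho): list the ids that would
# become identical after applying a sed expression to all FASTA ids.
# $1 = sed expression (e.g. 's/_//g'), remaining arguments = FASTA files.
colliding_after_rename() {
    rename_expr=$1
    shift
    cat "$@" |
        grep '^>' |
        sed -e 's/^>//' -e 's/ .*//' |   # extract the ids
        sort -u |                        # start from distinct ids only
        sed -e "$rename_expr" |          # apply the proposed rename
        sort |
        uniq -d                          # ids that now collide
}
```

For the example above, `colliding_after_rename 's/_//g' *fas` would report `myprot`, because `myprot` and `myprot_` collapse to the same name.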

If you like you can also upload your input files here and I will have a closer look.

I hope this helps, Paul

Edit: formatting stuff

Hi Paul, thanks for your quick response, and I apologize for my delayed one!

After running the commands you suggested, I get the same number both times:

[argr6723@shas0136 GoodNames]$ grep ">" *.fas | sort -u | wc -l
127466

[argr6723@shas0136 GoodNames]$ cat *.fas | perl -lne 'if(/^>([^ ]+)/){print $1}' | sort -u | wc -l
127466

Could there be some special characters that are tripping up the program? We removed all the ones we could think of......

Thank you, Arthur

That is very surprising. Then it does indeed seem to be something with special characters, although I thought I had escaped everything (:

Please send me the *.fas files so I can have a closer look.

Here they are, thanks for the help! Sending in two parts

part 2

Thank you, I will come back once I have found and squashed the bug.

hahaha, awesome Paul, squash 'em good!

Thank you!

I reran proteinortho on your *fas files and then ran proteinortho_grab_proteins.pl -tofiles myproject.proteinortho.tsv *fas

I get different numbers (14734 groups with 105377 proteins), but all proteins were found by the script:

[STDERR] All entries of the query are found in the fasta(s).

Which proteinortho version do you use? And did you modify your *.proteinortho.tsv?

Please upload your new_all_13.tsv file too.

(I get the same result with -exact: proteinortho_grab_proteins.pl -exact -tofiles myproject.proteinortho.tsv *fas)

Hi Paul,

That is strange. Attached is my names.tsv, and we are running proteinortho6. Also, I did not modify the output fasta files.

newall13.tsv.OrthoGroup10.fasta

new_all_13.tsv

I found a weird bug: some ' characters at the end of an id can interact in a way that makes ids "vanish". This is now fixed, and I also made several improvements to proteinortho_grab_proteins. For example, some of the ids that were not found are now printed, so it is easier to see when something went wrong.

I will publish this new version (conda, brew, ...) at the end of this week. For now you can download the file manually here:

https://gitlab.com/paulklemm_PHD/proteinortho/-/raw/master/src/proteinortho_grab_proteins.pl

Note:

Including the header line of .proteinortho.tsv (starting with '#') will greatly reduce the runtime (otherwise all files have to be searched for all ids).
You don't need to append the filename to the protein names; you can use the -source option of proteinortho_grab_proteins.pl instead, if you prefer.
Your tsv file does not perfectly match your fasta files. In the tsv file there are 5 proteins with [ and ] characters:

Penoxa|EPS25955.1_putative_alphaalpha-trehalose-phosphate_synthase_[UDP-forming]

Aspalb|XP_031896854.1_tRNA-dihydrouridine47_synthase_[NADP+]

Penchr|KZN85498.1_putative_3-oxoacyl-[acyl-carrier-protein]_synthase

Penchr|KZN89714.1_3-oxoacyl-[acyl-carrier-protein]_reductase_FabG

Bysspe|XP_028486817.1_tRNA-dihydrouridine47_synthase_[NADP+]

In your fasta files you have probably already removed the [ and ] characters, so these ids cannot be found directly. If you correct this in either the tsv or the fasta files, the updated program works fine.

Thank you again very much for this report!
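A mismatch like this can also be located by hand by comparing the ids in the tsv with the ids in the fasta headers. The following is a sketch under stated assumptions, not how proteinortho_grab_proteins works internally: `missing_ids` is a hypothetical name, the tsv is assumed to have three leading summary columns followed by one column per species, with comma-separated ids and `*` for an empty cell, and ids are assumed to contain no commas.

```shell
# Hypothetical helper (not part of proteinortho): print ids that appear in
# the tsv but in none of the FASTA headers.
# $1 = tsv file, remaining arguments = FASTA files.
# Assumptions: columns 4.. hold comma-separated ids, '*' marks an empty cell,
# lines starting with '#' are headers, ids end at the first space.
missing_ids() {
    tsv=$1
    shift
    ids_tsv=$(mktemp)
    ids_fa=$(mktemp)
    grep -v '^#' "$tsv" | cut -f4- | tr '\t,' '\n\n' |
        grep -v -e '^$' -e '^\*$' | LC_ALL=C sort -u > "$ids_tsv"
    cat "$@" | grep '^>' | sed -e 's/^>//' -e 's/ .*//' |
        LC_ALL=C sort -u > "$ids_fa"
    # comm -23 keeps lines that occur only in the first (tsv) list
    LC_ALL=C comm -23 "$ids_tsv" "$ids_fa"
    rm -f "$ids_tsv" "$ids_fa"
}
```

Called as `missing_ids new_all_13.tsv *fas`, this would list entries such as the bracketed names above if they are absent from the fasta headers.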

closed

Hi Paul,

Thanks so much for getting this all sorted out and in such a quick time frame! I just ran the new grab proteins script and it worked perfectly...woohooo!!

Thank you again so much! Arthur
