SonicParanoid fails in predicting fastest pairs (sklearn v1.3.0 bug)
Special thanks to @shivankparashar for originally opening this issue.
Since version 1.3.0 of scikit-learn, negative values are treated as missing when using gradient boosting. This causes SonicParanoid to fail to predict the faster alignments, because negative values are used as categories in the training/test samples.
An example of the error is shown below:
Run START: Mon Jul 3 11:05:21 2023
SonicParanoid 2.0.3 will be executed with the following parameters:
Run ID: sp2_372311521_default_8cpus_ml075_ow_op
Input directory: /home/salvocos/Desktop/sonicparanoid_test/to-remove/sonicparanoid_test/test_input
Main output directory: /home/salvocos/Desktop/sonicparanoid_test/to-remove/sonicparanoid_test/test_output
Run directory: /home/salvocos/Desktop/sonicparanoid_test/to-remove/sonicparanoid_test/test_output/runs/sp2_372311521_default_8cpus_ml075_ow_op
Input proteomes: 4
Alignment tool: diamond
Run mode: default (Diamond [--very-sensitive])
Threads: 8
Minimum bitscore: 40
Max length difference for in-paralogs: 0.75
Ortholog merging threshold: 0.75
MCL inflation: 1.50
Alignments directory: /home/salvocos/Desktop/sonicparanoid_test/to-remove/sonicparanoid_test/test_output/alignments/
Pairwise tables directory: /home/salvocos/Desktop/sonicparanoid_test/to-remove/sonicparanoid_test/test_output/runs/sp2_372311521_default_8cpus_ml075_ow_op/pairwise_orthologs/
Directory with ortholog groups: /home/salvocos/Desktop/sonicparanoid_test/to-remove/sonicparanoid_test/test_output/runs/sp2_372311521_default_8cpus_ml075_ow_op/ortholog_groups/
Pairwise tables database directory: /home/salvocos/Desktop/sonicparanoid_test/to-remove/sonicparanoid_test/test_output/orthologs_db
Perfom only graph-based orthology: False
Update run: True
Create query database indexes: False
Complete overwrite: True
Re-create ortholog tables: True
Memory per thread (Gigabytes): 3.90
Minimum memory per thread (Gigabytes): 1.00
Compress alignments: True
Compression level: 5
SonicParanoid installation directory: /home/salvocos/.virtualenvs/sonic-dev3.10/lib/python3.10/site-packages/sonicparanoid
Installation type: Python
Python version: 3.10.12 (main, Jul 3 2023, 00:36:32) [GCC 13.1.1 20230429]
Python executables: /home/salvocos/.virtualenvs/sonic-dev3.10/bin/python3
Traceback (most recent call last):
File "/home/salvocos/.virtualenvs/sonic-dev3.10/bin/sonicparanoid", line 8, in <module>
sys.exit(main())
File "/home/salvocos/.virtualenvs/sonic-dev3.10/lib/python3.10/site-packages/sonicparanoid/sonic_paranoid.py", line 2176, in main
spFile, pairsFile, requiredPairsDict = orthodetect.run_sonicparanoid2_multiproc_essentials(mappedInPaths, outDir=runDir, tblDir=pairwiseDbDir, threads=threads, alignDir=alignDir, seqDbDir=dbDirectory, sensitivity=mmseqsSensitivity, create_idx=idx_dbs, alnTool=alnTool, dmndSens=dmndSensitivity, minBitscore=minBitscore, confCutoff=0.05, pmtx=pmtx, lenDiffThr=args.max_len_diff, overwrite_all=overwrite, overwrite_tbls=owOrthoTbls, update_run=update_run, keepAlign=args.keep_raw_alignments, essentialMode=not(complete_aln), compress=not(noCompress), complev=complev, debug=debug)
File "/home/salvocos/.virtualenvs/sonic-dev3.10/lib/python3.10/site-packages/sonicparanoid/ortholog_detection.py", line 774, in run_sonicparanoid2_multiproc_essentials
essentials.predict_fastest_pairs(outDir=auxDir, pairs=dashedPairs, protCnts=protCntDict, protSizes=spSizeDict, debug=debug)
File "sonicparanoid/essentials_c.pyx", line 361, in sonicparanoid.essentials_c.predict_fastest_pairs
File "sklearn/tree/_tree.pyx", line 714, in sklearn.tree._tree.Tree.__setstate__
File "sklearn/tree/_tree.pyx", line 1418, in sklearn.tree._tree._check_node_ndarray
ValueError: node array from the pickle has an incompatible dtype:
- expected: {'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity', 'n_node_samples', 'weighted_n_node_samples', 'missing_go_to_left'], 'formats': ['<i8', '<i8', '<i8', '<f8', '<f8', '<i8', '<f8', 'u1'], 'offsets': [0, 8, 16, 24, 32, 40, 48, 56], 'itemsize': 64}
- got : [('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')]
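To make the error message above easier to read, here is a small sketch that compares the two field lists copied from the traceback. It only restates what the log shows: the tree node layout expected by the installed scikit-learn has one more field than the layout stored in the pickled model. This suggests, as an assumption rather than a confirmed diagnosis, that the bundled model was saved with a scikit-learn release older than 1.3.0.

```python
# Field names copied verbatim from the "expected" and "got" lines of the traceback.
expected_fields = [
    "left_child", "right_child", "feature", "threshold",
    "impurity", "n_node_samples", "weighted_n_node_samples",
    "missing_go_to_left",
]
pickled_fields = [
    "left_child", "right_child", "feature", "threshold",
    "impurity", "n_node_samples", "weighted_n_node_samples",
]

# Fields the installed scikit-learn expects but the pickled model does not provide.
missing = [name for name in expected_fields if name not in pickled_fields]
print(missing)  # ['missing_go_to_left']
```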
A possible temporary workaround is to downgrade the scikit-learn module to v1.2.2.
In the future, the GradientBoosting model will be re-trained to work with scikit-learn v1.3.0 and above.
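The downgrade itself can be done with `pip install scikit-learn==1.2.2` inside the environment where SonicParanoid is installed. As a minimal sketch, the script below checks whether the installed scikit-learn falls in the affected range (1.3.0 and above, as reported in this issue). The helper names (`parse_major_minor`, `is_affected`, `installed_sklearn_version`) are hypothetical and not part of SonicParanoid.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# First scikit-learn release reported in this issue as incompatible (assumption: 1.3.x and later).
FIRST_AFFECTED = (1, 3)


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.2.2' or '1.3.0rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_affected(version: str) -> bool:
    """Return True if the given scikit-learn version is 1.3 or newer."""
    return parse_major_minor(version) >= FIRST_AFFECTED


def installed_sklearn_version() -> Optional[str]:
    """Return the installed scikit-learn version, or None if it is not installed."""
    try:
        return metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_sklearn_version()
    if version is None:
        print("scikit-learn is not installed in this environment.")
    elif is_affected(version):
        print(f"scikit-learn {version} is affected; consider: pip install scikit-learn==1.2.2")
    else:
        print(f"scikit-learn {version} is older than 1.3.0 and should not trigger this error.")
```

Running the script in the same virtual environment as SonicParanoid shows whether the downgrade is needed before starting a run.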
Edited by Salvatore Cosentino