Cleaning and refactoring
Major changes
- Removed some thresholds. The final list of parameters is:
  - REMOVE_ALPHA
  - NEG_CHARGE_Y_POSITION
  - NEG_CHARGE_LENGTH_TOLERANCE
  - Z_TOLERANCE
  - CLOSE_NONPARALLEL_ALPHA
  - CLOSE_CHAR_LINE_ALPHA
  - LONGEST_LENGTHS_DIFF_RATIO
  - PARALLEL_TOLERANCE
  - COS_PRUNE
- Refactored parser_v2.py, geomtric_utils.py and geomtric_utils.py so that it is easier to set up parameters.
- Changed the solid bond calculation so that it checks the lengths of the 2 longest lines and the angles between them.
- Assorted changes across the pipeline, evaluation and data generation scripts so that they use the refactored classes consistently.
- Changes to the SVG visualization tool so that it can render SVGs on their own, without PDFs.
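For reviewers, the solid bond change described above can be pictured with a minimal sketch. This is only one plausible reading of "checking length of 2 longest lines and angles", not the code in this PR: the function names (`segment_length`, `angle_between`, `looks_like_solid_wedge`), the segment representation, and the threshold values are all assumptions made for illustration. Only the parameter names `LONGEST_LENGTHS_DIFF_RATIO` and `PARALLEL_TOLERANCE` come from the list above.

```python
import math

# Hypothetical values for illustration; the real thresholds are set in the refactored classes.
LONGEST_LENGTHS_DIFF_RATIO = 0.1  # max relative length difference of the 2 longest lines
PARALLEL_TOLERANCE = 1.0          # degrees; smaller angles are treated as parallel


def segment_length(seg):
    """Euclidean length of a segment given as ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = seg
    return math.hypot(x2 - x1, y2 - y1)


def angle_between(seg_a, seg_b):
    """Unsigned angle in degrees between two undirected segments, in [0, 90]."""
    (ax1, ay1), (ax2, ay2) = seg_a
    (bx1, by1), (bx2, by2) = seg_b
    ang_a = math.atan2(ay2 - ay1, ax2 - ax1)
    ang_b = math.atan2(by2 - by1, bx2 - bx1)
    diff = abs(ang_a - ang_b) % math.pi
    return math.degrees(min(diff, math.pi - diff))


def looks_like_solid_wedge(segments):
    """Assumed rule: the 2 longest lines have similar length and are not parallel."""
    if len(segments) < 2:
        return False
    longest, second = sorted(segments, key=segment_length, reverse=True)[:2]
    len_a, len_b = segment_length(longest), segment_length(second)
    if len_a == 0:
        return False
    similar_length = (len_a - len_b) / len_a <= LONGEST_LENGTHS_DIFF_RATIO
    not_parallel = angle_between(longest, second) > PARALLEL_TOLERANCE
    return similar_length and not_parallel


# A thin triangle (wedge) versus a thin rectangle (thick straight bond).
wedge = [((0, 0), (10, 1)), ((0, 0), (10, -1)), ((10, 1), (10, -1))]
bar = [((0, 0), (10, 0)), ((0, 1), (10, 1)), ((0, 0), (0, 1)), ((10, 0), (10, 1))]
print(looks_like_solid_wedge(wedge), looks_like_solid_wedge(bar))
```

Under these assumptions the wedge passes because its two long sides are equally long and converge, while the rectangle fails because its two long sides are parallel.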
Testing
Indigo based rendering:
Set up environment:
git checkout computation_solid_bond
make rebuild
make download-synthetic-chem-data
make download-real-chem-data
Run the USPTO SMILES set:
make chem-ijdar-USPTO-indigo
Expected output (5610/5719 = 98.09% exact matches):
[ Timer: ChemScraper pipeline: synthetic dataset (indigo) ]
Started: Wed, 13 Mar 2024 14:24:06
Total Duration: 285.8432 seconds
1.35252118s Initial module load
244.53260541s Parsing done
23.55804062s Molconvert done
16.40000176s Canonicalization done
count: 5704
Total molecules (files): 5719
Tanimoto 0.9905072866835222
Lev 0.7929707990907501
Lev normalized 0.99143096892239
Exact matches: 5610
Incorrectly parsed: 94
Fatal errors (files failed): 15
Memory usage: 198292 KB
Run the UOB SMILES set:
make chem-ijdar-UOB
Expected output (5340/5740 = 93.03% exact matches):
[ Timer: ChemScraper pipeline: synthetic dataset (indigo) ]
Started: Wed, 13 Mar 2024 14:39:31
Total Duration: 98.6138 seconds
1.35809898s Initial module load
75.24283981s Parsing done
16.18964934s Molconvert done
5.82319283s Canonicalization done
count: 5740
Total molecules (files): 5740
Tanimoto 0.9393979496238763
Lev 1.122822299651568
Lev normalized 0.9599148497380511
Exact matches: 5340
Incorrectly parsed: 400
Fatal errors (files failed): 0
Memory usage: 194040 KB
Run the CLEF SMILES set:
make chem-ijdar-CLEF
Expected output (895/992 = 90.22% exact matches):
[ Timer: ChemScraper pipeline: synthetic dataset (indigo) ]
Started: Wed, 13 Mar 2024 14:42:48
Total Duration: 43.2721 seconds
1.31812477s Initial module load
34.35079551s Parsing done
5.00770354s Molconvert done
2.59546971s Canonicalization done
count: 921
Total molecules (files): 992
Tanimoto 0.922396673334268
Lev 1.0262096774193548
Lev normalized 0.9195913500774384
Exact matches: 895
Incorrectly parsed: 26
Fatal errors (files failed): 71
Memory usage: 182084 KB
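To compare a local run against the expected outputs above without reading the numbers by eye, a small helper along these lines may be useful. It is a sketch, not part of this PR: `parse_summary`, `exact_match_rate` and `is_consistent` are hypothetical names, and the two consistency rules (exact + incorrect = count, count + fatal = total) are only an observation that holds for the three logs shown here, not a documented guarantee of the pipeline.

```python
import re

# Field labels copied from the summary blocks above.
FIELDS = {
    "count": r"^\s*count:\s*(\d+)",
    "total": r"^\s*Total molecules \(files\):\s*(\d+)",
    "exact": r"^\s*Exact matches:\s*(\d+)",
    "incorrect": r"^\s*Incorrectly parsed:\s*(\d+)",
    "fatal": r"^\s*Fatal errors \(files failed\):\s*(\d+)",
}


def parse_summary(log_text):
    """Extract the integer counters from one pipeline summary block."""
    summary = {}
    for name, pattern in FIELDS.items():
        match = re.search(pattern, log_text, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"field not found in log: {name}")
        summary[name] = int(match.group(1))
    return summary


def exact_match_rate(summary):
    """Exact matches as a percentage of all input files."""
    return 100.0 * summary["exact"] / summary["total"]


def is_consistent(summary):
    """Assumed invariants, observed in the logs of this PR description."""
    return (
        summary["exact"] + summary["incorrect"] == summary["count"]
        and summary["count"] + summary["fatal"] == summary["total"]
    )


# Counters taken from the USPTO expected output above.
uspto_log = """count: 5704
Total molecules (files): 5719
Exact matches: 5610
Incorrectly parsed: 94
Fatal errors (files failed): 15
"""
uspto = parse_summary(uspto_log)
print(f"{exact_match_rate(uspto):.2f}%", is_consistent(uspto))
```

The same functions can be applied to the UOB and CLEF summaries by pasting the corresponding log text in place of `uspto_log`.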
Data generation
Run generation for the 22-molecule training set:
make generate-chem-small-train-vis
make convert_lg2nx
make visualize-svgs
A web page will be created with the rendered SVGs and their corresponding ground truth PDFs. Its URL is displayed after running the last command. All 22 SVGs must exactly match their ground truth PDFs.
Scott's file sanity check
Run Scott's file:
make chem-v2-all-test
Expected output:
Average Norm Score: 0.12060713392570309
Average Unnorm Score: 1.9871244635193133
Min Smile Length: 3 Max Smile Length: 40 Average Smile Length: 15.69098712446352
Exact matches: 141 243 Percent: 0.5802469135802469
Visualize SVGs:
./bin/viz-svgs --svg_folder outputs/All/generated_cdxmls/or100.09.tables_SVG --just_svgs True -n Scotts_File
A web page will be created with the rendered SVGs. Its URL is displayed after running the command. The output should match the file `inputs/scott_file_sanity_check.pdf`.