
Cleaning and refactoring

Bryan Amador requested to merge computation_solid_bond into lg_eval_to_cdxml

Major changes

  1. Removed some thresholds. The final list of parameters is:
    1. REMOVE_ALPHA
    2. NEG_CHARGE_Y_POSITION
    3. NEG_CHARGE_LENGTH_TOLERANCE
    4. Z_TOLERANCE
    5. CLOSE_NONPARALLEL_ALPHA
    6. CLOSE_CHAR_LINE_ALPHA
    7. LONGEST_LENGTHS_DIFF_RATIO
    8. PARALLEL_TOLERANCE
    9. COS_PRUNE
  2. Refactored parser_v2.py, geomtric_utils.py and geomtric_utils.py so that it is easier to set up parameters.
  3. Changed the solid bond calculation so that it checks the lengths of the 2 longest lines and their angles.
  4. Various changes across the pipeline, evaluation and data generation scripts so that the refactored classes are used consistently.
  5. Changed the SVG visualization tool so that it can render just SVGs, without PDFs.
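
As a rough illustration of item 3, the sketch below shows one way a "two longest lines plus angle" check could look. It is not the code in this merge request: the function names, the segment representation and both threshold arguments are hypothetical. Parameters such as LONGEST_LENGTHS_DIFF_RATIO or PARALLEL_TOLERANCE from the list above may play a similar role, but that mapping is an assumption.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def seg_length(seg: Segment) -> float:
    """Euclidean length of a line segment."""
    (x1, y1), (x2, y2) = seg
    return math.hypot(x2 - x1, y2 - y1)


def angle_between(a: Segment, b: Segment) -> float:
    """Unsigned angle in degrees between two segment directions, in [0, 90]."""
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    ux, uy = ax2 - ax1, ay2 - ay1
    vx, vy = bx2 - bx1, by2 - by1
    cos = abs(ux * vx + uy * vy) / (math.hypot(ux, uy) * math.hypot(vx, vy))
    return math.degrees(math.acos(min(1.0, cos)))


def looks_like_solid_bond(
    segments: List[Segment],
    max_length_diff_ratio: float,  # hypothetical threshold
    max_angle_deg: float,          # hypothetical threshold
) -> bool:
    """Hypothetical check: the two longest segments have similar lengths
    and meet at a narrow angle, as the long sides of a wedge would."""
    if len(segments) < 2:
        return False
    a, b = sorted(segments, key=seg_length, reverse=True)[:2]
    la, lb = seg_length(a), seg_length(b)
    if lb == 0:
        return False
    similar = (la - lb) / la <= max_length_diff_ratio
    narrow = angle_between(a, b) <= max_angle_deg
    return similar and narrow


# A thin wedge: tip at the origin, short base on the right.
wedge = [((0, 0), (10, 1)), ((0, 0), (10, -1)), ((10, 1), (10, -1))]
print(looks_like_solid_bond(wedge, max_length_diff_ratio=0.1, max_angle_deg=15))
```

The threshold values in the usage line are arbitrary and are not the values used by the parser.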

Testing

Indigo based rendering:

Set up environment:

git checkout computation_solid_bond
make rebuild
make download-synthetic-chem-data
make download-real-chem-data

Run USPTO smiles set:

make chem-ijdar-USPTO-indigo

Expected output (5610/5719 = 98.09% exact matches):

[ Timer: ChemScraper pipeline: synthetic dataset (indigo) ]
Started: Wed, 13 Mar 2024 14:24:06
Total Duration: 285.8432 seconds
1.35252118s  Initial module load
244.53260541s  Parsing done
23.55804062s  Molconvert done
16.40000176s  Canonicalization done

count:  5704
Total molecules (files):  5719
Tanimoto 0.9905072866835222
Lev 0.7929707990907501
Lev normalized 0.99143096892239
Exact matches: 5610
Incorrectly parsed: 94
Fatal errors (files failed): 15
Memory usage: 198292 KB

Run UOB smiles set:

make chem-ijdar-UOB

Expected output (5340/5740 = 93.03% exact matches):

[ Timer: ChemScraper pipeline: synthetic dataset (indigo) ]
Started: Wed, 13 Mar 2024 14:39:31
Total Duration: 98.6138 seconds
1.35809898s  Initial module load
75.24283981s  Parsing done
16.18964934s  Molconvert done
5.82319283s  Canonicalization done

count:  5740
Total molecules (files):  5740
Tanimoto 0.9393979496238763
Lev 1.122822299651568
Lev normalized 0.9599148497380511
Exact matches: 5340
Incorrectly parsed: 400
Fatal errors (files failed): 0
Memory usage: 194040 KB

Run CLEF smiles set:

make chem-ijdar-CLEF

Expected output (895/992 = 90.22% exact matches):

[ Timer: ChemScraper pipeline: synthetic dataset (indigo) ]
Started: Wed, 13 Mar 2024 14:42:48
Total Duration: 43.2721 seconds
1.31812477s  Initial module load
34.35079551s  Parsing done
5.00770354s  Molconvert done
2.59546971s  Canonicalization done

count:  921
Total molecules (files):  992
Tanimoto 0.922396673334268
Lev 1.0262096774193548
Lev normalized 0.9195913500774384
Exact matches: 895
Incorrectly parsed: 26
Fatal errors (files failed): 71
Memory usage: 182084 KB
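
When comparing a local run against the expected outputs above, a small script can help check the summary counters. The sketch below is not part of the repository; the function names are hypothetical. It assumes the line format shown in the outputs above, and the two relations it checks (exact matches + incorrectly parsed = count, and count + fatal errors = total) are inferred from the reported numbers rather than documented by the tool.

```python
import re
from typing import Dict

# Line patterns copied from the summary block printed by the runs above.
PATTERNS = {
    "count": r"^count:\s*(\d+)",
    "total": r"^Total molecules \(files\):\s*(\d+)",
    "exact": r"^Exact matches:\s*(\d+)",
    "incorrect": r"^Incorrectly parsed:\s*(\d+)",
    "fatal": r"^Fatal errors \(files failed\):\s*(\d+)",
}


def parse_summary(text: str) -> Dict[str, int]:
    """Extract the integer counters from a pasted summary block."""
    out = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"missing field: {key}")
        out[key] = int(match.group(1))
    return out


def is_consistent(s: Dict[str, int]) -> bool:
    """Check the relations inferred from the reported numbers."""
    return (s["exact"] + s["incorrect"] == s["count"]
            and s["count"] + s["fatal"] == s["total"])


def exact_match_percent(s: Dict[str, int]) -> float:
    """Exact matches as a percentage of all input files."""
    return 100.0 * s["exact"] / s["total"]


clef = """count:  921
Total molecules (files):  992
Exact matches: 895
Incorrectly parsed: 26
Fatal errors (files failed): 71
"""
summary = parse_summary(clef)
print(is_consistent(summary), round(exact_match_percent(summary), 2))  # True 90.22
```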

Data generation

Run generation for the 22-molecule training set:

make generate-chem-small-train-vis
make convert_lg2nx
make visualize-svgs

A web page will be created with the rendered SVGs and the corresponding ground-truth PDFs. Its URL is displayed after the last command finishes. All 22 SVGs must exactly match their ground-truth PDFs.

Scott's file sanity check

Run Scott's file:

make chem-v2-all-test

Expected output:

Average Norm Score: 0.12060713392570309
Average Unnorm Score: 1.9871244635193133
Min Smile Length: 3     Max Smile Length: 40    Average Smile Length: 15.69098712446352
Exact matches: 141 243  Percent: 0.5802469135802469

Visualize SVGs:

./bin/viz-svgs --svg_folder outputs/All/generated_cdxmls/or100.09.tables_SVG --just_svgs True -n Scotts_File

A web page will be created with the rendered SVGs. Its URL is displayed after the command finishes. The output should match the file `inputs/scott_file_sanity_check.pdf`.
