
Improved strategy for closing and removing edges

Ayush Kumar Shah requested to merge fix-close-edges-use-mst into containerize

Note that this MR includes the changes from cdxml_fixes and will therefore close MR 57.

Changes in close edges

  • Close edges independently for three cases: parallel lines, char lines, and nonparallel lines
  • First, close parallel edges: assign each floating bond line to its correct parallel line pair
    • Done by finding floating bond lines and their correct parallel line pair candidates using nearest neighbors and line intersections
  • Compute nonparallel thresholds using the updated graph from the previous step
  • Close edges based on the three cases using different thresholds
  • Removed two thresholds for closing parallel lines (the parallel line distance threshold and the char line distance ratio threshold); parallel lines are now closed based on the graph, without any distance threshold
  • Select and compute char line distances using nx filter edge functions instead of for loops
  • Select line nodes using the select_lines method when required
  • Re-use thresholds if candidates are not available
  • Use all types of lines (hashed wedge, solid wedge, normal lines, etc.) for computing the nonparallel threshold and closing nonparallel lines
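
The nearest-neighbor and intersection checks mentioned above can be illustrated with a minimal, standalone sketch. It is not the code in this MR: the function names, the segment representation, and the rule that intersecting lines count as distance 0 are all assumptions made for illustration.

```python
from itertools import product
from math import dist
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def min_endpoint_distance(a: Segment, b: Segment) -> float:
    """Smallest distance between any endpoint of `a` and any endpoint of `b`."""
    return min(dist(p, q) for p, q in product(a, b))


def _cross(p: Point, q: Point, r: Point) -> float:
    """Cross product of (q - p) and (r - p); its sign gives the turn direction."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_intersect(a: Segment, b: Segment) -> bool:
    """True when the two segments properly cross each other.

    Touching endpoints and collinear overlaps are not counted (a simplification).
    """
    d1, d2 = _cross(b[0], b[1], a[0]), _cross(b[0], b[1], a[1])
    d3, d4 = _cross(a[0], a[1], b[0]), _cross(a[0], a[1], b[1])
    return d1 * d2 < 0 and d3 * d4 < 0


def line_distance(a: Segment, b: Segment) -> float:
    """Assumption: intersecting lines count as distance 0."""
    return 0.0 if segments_intersect(a, b) else min_endpoint_distance(a, b)


def nearest_candidate(
    floating: Segment, candidates: Dict[str, Segment]
) -> Optional[str]:
    """Name of the closest candidate line, or None if there are no candidates.

    Ties are broken by name so the result does not depend on dict order.
    """
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda name: (line_distance(floating, candidates[name]), name),
    )
```

Because no distance threshold appears in `nearest_candidate`, the choice depends only on which candidate is closest, which mirrors the idea of closing parallel lines from the graph rather than from a tuned cutoff.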

Changes in remove edges

  • Filter out MST distances corresponding to floating atoms using a Z-score test that compares the character line and parallel line distance distributions in the MST
  • Set the remove threshold to a multiple of the thresholds used for closing objects above
  • Pass Z-thresholds in the config file, so different thresholds can be used for different runs/datasets
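
As a rough illustration of the Z-score filtering idea, the sketch below flags distances that are unusually large compared with a reference distribution. The function names, the one-sided test, and the use of the population standard deviation are assumptions; the actual MR may compare the two distributions differently. `z_threshold` stands in for a value read from the config file.

```python
from statistics import mean, pstdev
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def z_score(value: float, reference: List[float]) -> float:
    """Z-score of `value` relative to a non-empty `reference` sample."""
    mu = mean(reference)
    sigma = pstdev(reference)
    if sigma == 0:
        # No spread in the reference: nothing can be judged an outlier.
        return 0.0
    return (value - mu) / sigma


def split_by_z(
    distances: Dict[Edge, float],
    reference: List[float],
    z_threshold: float,
) -> Tuple[Dict[Edge, float], Dict[Edge, float]]:
    """Split edges into (kept, removed) by a one-sided Z-score test.

    Only distances far ABOVE the reference mean are removed (an assumption).
    """
    kept: Dict[Edge, float] = {}
    removed: Dict[Edge, float] = {}
    for edge, d in distances.items():
        target = removed if z_score(d, reference) > z_threshold else kept
        target[edge] = d
    return kept, removed
```

For example, with reference distances clustered around 10 and a threshold of 3.0, an edge of length 10.5 is kept while an edge of length 30 is flagged for removal.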

Other Changes:

  • Add an option to pass command line args for indigo/other datasets; fall back to the config file when they are not provided
  • Add config file for thresholds and use thresholds from these files
  • Sort neighbors by degree, distance, label instead of random assignment
  • Remove randomness in getting the main atom during contraction
  • Remove randomness in computing adj
  • Use the minimum endpoint distance for lines and add an intersection check
  • Detect positive and negative charges and add charge attribute to cdxml, and handle charges as separate label types
  • Remove dead code
  • Fix the ordering issue for atom groups (e.g., NH2) using graph traversal and writing order
  • Avoid recomputing distance and use input graph distance instead when closing edges
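
Replacing random assignment with a fixed sort order can be sketched as follows. The record fields (`id`, `degree`, `distance`, `label`) and the ascending sort direction are hypothetical; the point is only that a tuple key makes the result independent of input order.

```python
from typing import Any, Dict, List


def sort_neighbors(neighbors: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Order neighbors by degree, then distance, then label.

    Ascending order for every key is an assumption made for this sketch.
    """
    return sorted(
        neighbors,
        key=lambda n: (n["degree"], n["distance"], n["label"]),
    )


neighbors = [
    {"id": "a", "degree": 2, "distance": 1.5, "label": "C"},
    {"id": "b", "degree": 1, "distance": 3.0, "label": "O"},
    {"id": "c", "degree": 1, "distance": 3.0, "label": "N"},
]

# The same order comes out no matter how the input list is arranged.
order = [n["id"] for n in sort_neighbors(neighbors)]
order_reversed = [n["id"] for n in sort_neighbors(neighbors[::-1])]
```

Because the label is the last tie-breaker, two neighbors with equal degree and distance are still ordered reproducibly, which is what removes run-to-run variation.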

Testing:

From the directory graphics_recognition:

  • git pull; git checkout containerize
  • make rebuild && make chem-v2-all-test

Check SMILES

  • cp outputs/All/generated_smiles/or100.09.tables/smiles_out.txt ./smiles_test_NEW.txt
  • git checkout fix-close-edges-use-mst
  • make chem-v2-all-test
  • diff outputs/All/generated_smiles/or100.09.tables/smiles_out.txt ./smiles_test_NEW.txt

This should give differences in 2 molecules:

141c141
< outputs/All/generated_cdxmls/or100.09.tables/Page_009_No016.cdxml, O=P.CCOC.CCOP.PC1=C(C2=CC=CC=C2C=C1)C1=CC=CC2=C1C=CC=C2
---
> outputs/All/generated_cdxmls/or100.09.tables/Page_009_No016.cdxml, CCOP(=O)(OCC)C1=C(C2=CC=CC=C2C=C1)C1=CC=CC2=C1C=CC=C2
203c203
< outputs/All/generated_cdxmls/or100.09.tables/Page_013_No011.cdxml, [Li+].CC1(CO)CO[B](O)(OC1)C1=NC=CC=C1
---
> outputs/All/generated_cdxmls/or100.09.tables/Page_013_No011.cdxml, CC1(COBC2=NC=CC=C2)COBOC1

Check CDXML

  • Open the file outputs/All/generated_cdxmls/or100.09.tables_full_cdxml/or100.09.tables_allpages.cdxml in ChemDraw and check the structures against the standard 24-page test file

Check INDIGO

  • make download-synthetic-chem-data
  • rm run-indigo && make && make chem-v2-indigo

Note that, depending on the system, this may take some time (12 mins on northbay, 3 mins on rjb).

You should get the following metrics:

Total molecules (files):  5719
Tanimoto 0.9943673819978699
Lev 0.9920629335952978
Exact matches: 5599
Incorrectly parsed: 105
Fatal errors (files failed): 15
Edited by Ayush Kumar Shah
