Improved strategy for closing and removing edges
Note that this MR includes the changes from cdxml_fixes and hence will close MR 57.
Changes in close edges
- Close edges independently for three cases: parallel lines, char lines, and nonparallel lines
- First step is to close parallel edges: correct floating bond line assignment to its parallel line pair
- Done by finding each floating bond line and its correct parallel line pair candidate using nearest neighbors and line intersections
- Compute non-parallel thresholds using the updated graph from above
- Close edges based on the three cases using different thresholds
- Removed two thresholds for closing parallel lines (the parallel line distance threshold and the ratio threshold comparing against char line distance); parallel lines are now closed based on the graph, without any distance threshold
- Select and compute char line distances using nx edge filter functions instead of for loops
- Select line nodes using the select_lines method when required
- Re-use thresholds if candidates are not available
- Use all types of lines (hashed wedge, solid wedge, normal lines, etc.) for computing the nonparallel threshold and closing nonparallel lines
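As an illustration of selecting char line distances with an nx edge filter rather than a for loop, here is a minimal sketch. The edge attribute names `kind` and `dist`, the value `"char_line"`, and the helper `char_line_distances` are assumptions made for this example, not names taken from the MR.

```python
import networkx as nx


def char_line_distances(graph):
    """Return the stored distances of all edges tagged as char-line edges.

    Assumes (hypothetically) that each edge carries a "kind" attribute
    naming its type and a "dist" attribute with a precomputed distance.
    """
    def is_char_line(u, v):
        return graph[u][v].get("kind") == "char_line"

    # subgraph_view builds a read-only filtered view; no edges are copied.
    view = nx.subgraph_view(graph, filter_edge=is_char_line)
    return [data["dist"] for _, _, data in view.edges(data=True)]


g = nx.Graph()
g.add_edge("N", "line1", kind="char_line", dist=4.0)
g.add_edge("line1", "line2", kind="parallel", dist=1.5)
g.add_edge("O", "line2", kind="char_line", dist=5.0)

distances = char_line_distances(g)
```

Because the view reads from the input graph, the distances come from what is already stored on the edges and are not recomputed.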
Changes in remove edges
- Filter out distances in MST corresponding to floating atoms using statistics (Z-score) by comparing character line and parallel line distance distributions in the MST
- Use a remove threshold based on a multiple of the thresholds used for closing objects above
- Pass Z-thresholds in the config file, so different thresholds can be used for different runs/datasets.
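The Z-score filtering idea can be sketched roughly as follows. This is a simplified illustration, not the code in this MR: the function name `filter_by_zscore`, the choice of reference distribution, and the sample values are all assumptions, and `z_threshold` stands in for a value read from the config file.

```python
from statistics import mean, pstdev


def filter_by_zscore(distances, reference, z_threshold):
    """Keep distances whose Z-score against `reference` is within z_threshold.

    `reference` is the distribution treated as "normal" (for example,
    distances of one edge type in the MST); `distances` are the values tested.
    """
    mu = mean(reference)
    sigma = pstdev(reference)
    if sigma == 0:
        # Degenerate reference: only values equal to the mean pass.
        return [d for d in distances if d == mu]
    return [d for d in distances if abs(d - mu) / sigma <= z_threshold]


# Hypothetical sample values; 19.0 plays the role of a floating atom distance.
reference = [4.0, 5.0, 4.5, 5.5, 5.0]
mst_distances = [4.8, 5.2, 19.0]
kept = filter_by_zscore(mst_distances, reference, z_threshold=3.0)
```

Keeping `z_threshold` as a parameter mirrors the point above: different runs or datasets can pass different values without code changes.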
Other Changes:
- Add option to use command line args for indigo/other datasets; fall back to the config file if they are not provided
- Add config file for thresholds and use thresholds from these files
- Sort neighbors by degree, distance, label instead of random assignment
- Remove randomness in getting the main atom during contraction
- Remove randomness in computing adj
- Use the minimum endpoint distance for lines and add an intersection check
- Detect positive and negative charges and add charge attribute to cdxml, and handle charges as separate label types
- Remove dead code
- Fix the ordering issue for atom groups (e.g., NH2) using graph traversal and writing order
- Avoid recomputing distance and use input graph distance instead when closing edges
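The deterministic neighbor ordering (degree, then distance, then label) can be sketched as below. The attribute names `dist` and `label` and the helper `sorted_neighbors` are assumptions made for this example; the point is only that a full sort key removes any dependence on iteration order.

```python
import networkx as nx


def sorted_neighbors(graph, node):
    """Order neighbors of `node` by (degree, edge distance, label).

    Assumes (hypothetically) a "dist" attribute on edges and a "label"
    attribute on nodes. Ties in degree fall through to distance, and ties
    in distance fall through to the label, so the result is reproducible.
    """
    return sorted(
        graph.neighbors(node),
        key=lambda n: (
            graph.degree(n),
            graph[node][n]["dist"],
            str(graph.nodes[n].get("label", "")),
        ),
    )


g = nx.Graph()
g.add_node("x", label="O")
g.add_node("y", label="N")
g.add_node("z", label="C")
g.add_edge("c", "x", dist=2.0)
g.add_edge("c", "y", dist=2.0)
g.add_edge("c", "z", dist=1.0)
g.add_edge("z", "w", dist=1.0)  # gives "z" degree 2

order = sorted_neighbors(g, "c")
```

Here `x` and `y` tie on degree and distance, so the label decides their order; `z` comes last because of its higher degree.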
Testing:
From the directory graphics_recognition
git pull; git checkout containerize
make rebuild && make chem-v2-all-test
Check SMILES
cp outputs/All/generated_smiles/or100.09.tables/smiles_out.txt ./smiles_test_NEW.txt
git checkout fix-close-edges-use-mst
make chem-v2-all-test
diff outputs/All/generated_smiles/or100.09.tables/smiles_out.txt ./smiles_test_NEW.txt
This should give differences in 2 molecules:
141c141
< outputs/All/generated_cdxmls/or100.09.tables/Page_009_No016.cdxml, O=P.CCOC.CCOP.PC1=C(C2=CC=CC=C2C=C1)C1=CC=CC2=C1C=CC=C2
---
> outputs/All/generated_cdxmls/or100.09.tables/Page_009_No016.cdxml, CCOP(=O)(OCC)C1=C(C2=CC=CC=C2C=C1)C1=CC=CC2=C1C=CC=C2
203c203
< outputs/All/generated_cdxmls/or100.09.tables/Page_013_No011.cdxml, [Li+].CC1(CO)CO[B](O)(OC1)C1=NC=CC=C1
---
> outputs/All/generated_cdxmls/or100.09.tables/Page_013_No011.cdxml, CC1(COBC2=NC=CC=C2)COBOC1
Check CDXML
- Check the file outputs/All/generated_cdxmls/or100.09.tables_full_cdxml/or100.09.tables_allpages.cdxml in ChemDraw, comparing structures against the standard 24-page test file
Check INDIGO
make download-synthetic-chem-data
rm run-indigo && make && make chem-v2-indigo
Note that depending on the system, this may take some time (12 mins on northbay, 3 mins on rjb)
You should get the following metrics:
Total molecules (files): 5719
Tanimoto 0.9943673819978699
Lev 0.9920629335952978
Exact matches: 5599
Incorrectly parsed: 105
Fatal errors (files failed): 15
Edited by Ayush Kumar Shah