Create lg files for ground truth and predicted graphs
Changes:
1. Label graph (lg) file generation:
- Created a molecular graph from the Indigo object, with nodes, edges, and scaled coordinates
- Annotated the ground truth graph with sorted coordinates, new symbols, object ids, and edges
- Found the correspondence between atoms in the ground truth and predicted graphs using closest distances
- Created a NetworkX graph for the ground truth, annotated with the information needed for lg creation
- Added a function that creates an lg file from either graph
- For the undirected (predicted) graph, added the edge in the other direction during lg creation
- For wedge bonds, used the correct direction in the predicted graph
- Added the filename, number of objects, and number of relationships to the lg file
- Handled cases where the ground truth and predicted graphs have different numbers of nodes
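The closest-distance matching and the mismatched-node case can be sketched roughly as follows. This is a minimal illustration, not the code in this branch: the function name `match_atoms`, the plain `(x, y)` tuple inputs, and the greedy one-to-one strategy are all assumptions made for the example.

```python
import math

def match_atoms(gt_coords, pred_coords):
    """Greedily pair ground-truth atoms with their closest unused predicted atoms.

    gt_coords and pred_coords are lists of (x, y) tuples (hypothetical inputs).
    Returns (matches, unmatched_gt, unmatched_pred), where matches maps a
    ground-truth index to a predicted index.
    """
    # Enumerate every candidate pair and sort so the closest pairs come first.
    pairs = sorted(
        (math.dist(g, p), gi, pi)
        for gi, g in enumerate(gt_coords)
        for pi, p in enumerate(pred_coords)
    )
    matches, used_pred = {}, set()
    for _, gi, pi in pairs:
        # Accept a pair only if neither atom has been matched already.
        if gi not in matches and pi not in used_pred:
            matches[gi] = pi
            used_pred.add(pi)
    # When the node counts differ, the leftovers are reported, not dropped silently.
    unmatched_gt = [i for i in range(len(gt_coords)) if i not in matches]
    unmatched_pred = [i for i in range(len(pred_coords)) if i not in used_pred]
    return matches, unmatched_gt, unmatched_pred

# Three ground-truth atoms, but only two predicted atoms.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(1.1, 0.1), (0.1, -0.1)]
matches, missing_gt, extra_pred = match_atoms(gt, pred)
print(matches, missing_gt, extra_pred)
```

A greedy pass like this is simple but not guaranteed to minimise the total distance; an optimal assignment method could be swapped in if crowded structures produce wrong pairings.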
2. Other changes
- Added script and make target to download the indigo dataset
- Added lib files for indigo toolkit
- Added gen_lg and parallel flags in config file
- Removed unnecessary data files from the data folder and added a gitignore to ignore newly added data
- Added a function to display images easily using imshow
- Added a function to remove hydrogens, and handled cases like 'HH'
- Better error handling with actual errors (e.g. empty SMILES, graph-to-CDXML error), with the error description added to the dumped JSON
- Improved test-config so that it no longer needs system-dependent paths
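As an illustration of the error-handling item above, the pattern of recording an error description per file and dumping it to JSON might look like the sketch below. The function `convert_all`, its `convert` callback, and the result keys `ok`, `output`, and `error` are hypothetical names chosen for the example and are not taken from this repository.

```python
import json

def convert_all(smiles_by_file, convert):
    """Run `convert` on each SMILES string and record failures with a description."""
    results = {}
    for name, smiles in smiles_by_file.items():
        try:
            # Treat an empty SMILES string as an explicit, named error.
            if not smiles.strip():
                raise ValueError("empty SMILES")
            results[name] = {"ok": True, "output": convert(smiles)}
        except Exception as err:
            # Store the error type and message so it ends up in the dumped JSON.
            results[name] = {"ok": False, "error": f"{type(err).__name__}: {err}"}
    return results

# str.upper stands in for a real conversion step (an assumption for the demo).
report = convert_all({"a.mol": "CCO", "b.mol": ""}, convert=str.upper)
print(json.dumps(report, indent=2))
```

Keeping the failure inside the per-file record means one bad input does not abort the whole run, and the count of failed files can be read directly from the JSON.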
Testing:
From the graphics_recognition directory:
git pull; git checkout containerize
make rebuild && make chem-v2-all-test
Check SMILES
cp outputs/All/generated_smiles/or100.09.tables/smiles_out.txt ./smiles_test_NEW.txt
git checkout graph-eval
make conda-remove && make && make chem-v2-all-test
diff outputs/All/generated_smiles/or100.09.tables/smiles_out.txt ./smiles_test_NEW.txt
The diff should report no differences.
Check CDXML
- Open the file
outputs/All/generated_cdxmls/or100.09.tables_full_cdxml/or100.09.tables_allpages.cdxml
in ChemDraw and check the structures against the standard 24-page test file
Check INDIGO
make download-synthetic-chem-data
make chem-v2-indigo
Note that, depending on the system, this may take some time (11 mins on northbay with lg generation, 2.6 mins on rjb). You should get the following metrics:
Total molecules (files): 5719
Tanimoto 0.9787279736492779
Lev 0.9736467814872148
Exact matches: 5373
Incorrectly parsed: 331
Fatal errors (files failed): 15
Also, check that lg files are produced in ./outputs/indigo/gt_lg/ and ./outputs/indigo/pred_lg/:
ls ./outputs/indigo/gt_lg | wc -l
ls ./outputs/indigo/pred_lg | wc -l
Both commands should give 5704 (15 of the 5719 files have empty SMILES, so no lg file is produced for them)
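The same check can be scripted. The sketch below only restates the numbers quoted above (5719 total, 15 fatal, hence 5704 expected lg files); the helper `count_files` is a hypothetical name, and it counts regular files only, whereas `ls | wc -l` would also count subdirectories.

```python
from pathlib import Path

TOTAL_FILES = 5719
FATAL_ERRORS = 15
EXPECTED_LG = TOTAL_FILES - FATAL_ERRORS  # files with non-empty SMILES

def count_files(directory):
    """Count regular files in `directory`; return 0 if it does not exist."""
    path = Path(directory)
    if not path.is_dir():
        return 0
    return sum(1 for p in path.iterdir() if p.is_file())

# The three reported categories add up to the total number of molecules.
assert 5373 + 331 + 15 == TOTAL_FILES

for directory in ("./outputs/indigo/gt_lg", "./outputs/indigo/pred_lg"):
    n = count_files(directory)
    status = "OK" if n == EXPECTED_LG else "MISMATCH"
    print(f"{directory}: {n} files ({status}, expected {EXPECTED_LG})")
```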
Edited by Ayush Kumar Shah