Create lg files for ground truth and predicted graphs

Ayush Kumar Shah requested to merge graph-eval into containerize

Changes:

1. Label graph files generation:

  • Created a molecular graph from Indigo Object with nodes, edges, scaled coordinates
  • Annotated ground truth graph with sorted coordinates, new symbols, obj ids, edges
  • Found correspondence between atoms in the ground truth and predicted graphs using closest distances
  • Created nx graph for ground truth with annotated info for lg creation
  • Created lg file creation function from either graph
  • For the undirected (predicted) graph, added the edge in the opposite direction as well during lg creation
  • For wedge bonds, used the correct direction in the predicted graph
  • Added information about the filename, number of objects, and relationships to the lg file
  • Handled cases where the ground truth and predicted graphs have a different number of nodes
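
The closest-distance correspondence step above can be illustrated with a minimal sketch. It is not the code in this merge request: the function name `match_atoms`, the greedy closest-pair strategy, and the use of `None` for unmatched atoms are assumptions made for illustration, and the coordinates are assumed to be already scaled to the same frame.

```python
import math


def match_atoms(gt_coords, pred_coords):
    """Greedily pair ground-truth atoms with their closest predicted atoms.

    Hypothetical helper: returns a dict mapping ground-truth index to
    predicted index, or to None when no predicted atom is left to pair with
    (i.e. the two graphs have a different number of nodes).
    """
    # Enumerate every (distance, gt index, pred index) candidate, closest first.
    pairs = sorted(
        (math.dist(g, p), gi, pi)
        for gi, g in enumerate(gt_coords)
        for pi, p in enumerate(pred_coords)
    )
    mapping = {gi: None for gi in range(len(gt_coords))}
    used_pred = set()
    for _, gi, pi in pairs:
        # Accept a pair only if both atoms are still unmatched.
        if mapping[gi] is None and pi not in used_pred:
            mapping[gi] = pi
            used_pred.add(pi)
    return mapping


gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)]
pred = [(1.1, 0.1), (0.05, -0.02)]  # one atom is missing in the prediction
print(match_atoms(gt, pred))  # {0: 1, 1: 0, 2: None}
```

A greedy pass like this is simple and handles mismatched node counts naturally, but it does not guarantee a globally optimal assignment; an optimal assignment solver would be an alternative if that matters.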

2. Other changes

  • Added script and make target to download the indigo dataset
  • Added lib files for indigo toolkit
  • Added gen_lg and parallel flags in config file
  • Removed unnecessary data files from the data folder and added a gitignore to ignore newly added data
  • Added function to display images easily using imshow
  • Added a remove-hydrogen function and handled cases like 'HH'
  • Better error handling with actual errors (e.g. empty SMILES, graph to cdxml error), with error description added to dumped json
  • Improved test-config so that it no longer needs actual, system-dependent paths

Testing:

From the directory graphics_recognition

  • git pull; git checkout containerize
  • make rebuild && make chem-v2-all-test

Check SMILES

  • cp outputs/All/generated_smiles/or100.09.tables/smiles_out.txt ./smiles_test_NEW.txt
  • git checkout graph-eval
  • make conda-remove && make && make chem-v2-all-test
  • diff outputs/All/generated_smiles/or100.09.tables/smiles_out.txt ./smiles_test_NEW.txt

This should give no differences

Check CDXML

  • Check the file outputs/All/generated_cdxmls/or100.09.tables_full_cdxml/or100.09.tables_allpages.cdxml in ChemDraw, checking structures against the standard 24-page test file

Check INDIGO

  • make download-synthetic-chem-data
  • make chem-v2-indigo

Note that, depending on the system, this may take some time (11 mins on northbay with lg generation, 2.6 mins on rjb). You should get the following metrics:

Total molecules (files):  5719
Tanimoto 0.9787279736492779
Lev 0.9736467814872148
Exact matches: 5373
Incorrectly parsed: 331
Fatal errors (files failed): 15

Also, check that lg files are produced in ./outputs/indigo/gt_lg and ./outputs/indigo/pred_lg:

  • ls ./outputs/indigo/gt_lg | wc -l
  • ls ./outputs/indigo/pred_lg | wc -l

Both commands should give 5704 (15 of the 5719 files have empty SMILES, so no lg file is produced for them)
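
The expected counts can also be cross-checked with a small script. This is only a sketch: the helper names `count_lg_files` and `expected_lg_count` are hypothetical, and the numbers are copied from the metrics listed above.

```python
from pathlib import Path


def count_lg_files(directory):
    """Count regular files directly inside `directory` (similar to `ls | wc -l`)."""
    return sum(1 for p in Path(directory).iterdir() if p.is_file())


def expected_lg_count(total_files, fatal_errors):
    """Files that failed fatally (empty SMILES) produce no lg file."""
    return total_files - fatal_errors


# Values taken from the metrics reported in this description.
TOTAL, EXACT, INCORRECT, FATAL = 5719, 5373, 331, 15

# The three outcome categories should account for every file.
assert EXACT + INCORRECT + FATAL == TOTAL

expected = expected_lg_count(TOTAL, FATAL)
print(expected)  # 5704

# Only check the output directories if they exist on this machine.
for d in ("./outputs/indigo/gt_lg", "./outputs/indigo/pred_lg"):
    if Path(d).is_dir():
        print(d, count_lg_files(d), "expected", expected)
```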

Edited by Ayush Kumar Shah