
Improve robustness and efficiency of parser using geometry from JSON

Ayush Kumar Shah requested to merge containerize-improve into containerize

Changes

Main change: use the new JSON output from Symbol Scraper, specifically its geometric properties, to resolve ambiguities in determining node (bond) types. This makes the graph construction algorithms robust enough to work on other PDFs. The MR also includes further optimizations (speedups) in reading Symbol Scraper output and merging it with YOLO regions. The changes include the following:

  • Use the new PROTables to read Symbol Scraper information (JSON instead of XML), replacing the custom function
  • Use new PROTables functions for region intersections and sorting regions
  • Remove the requirement that graphic objects be polygons
  • Add robust logic based on geometrical properties from JSON to determine node types (curves, circles, solid wedges, hash wedges, bond lines, etc.)
  • Add files needed for evaluation
  • Reorganize code and other files as per their types and remove unwanted/unused code/files.
  • Modify server code to work with the new changes
  • Fix missing molecules in page-level and doc-level CDXMLs
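
For reviewers unfamiliar with the merging step, the idea of assigning Symbol Scraper objects to YOLO molecule regions can be illustrated with a minimal sketch. This is not the PROTables API or the code in this MR: `Box`, `center_in`, and `assign_to_regions` are hypothetical names, and center-point containment is an assumed stand-in for whatever intersection test PROTables applies. It is chosen here because bond lines can have zero-area bounding boxes, which makes area-overlap ratios unreliable.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box (hypothetical type, not from PROTables)."""
    x0: float
    y0: float
    x1: float
    y1: float


def center_in(obj: Box, region: Box) -> bool:
    """Return True if the center of obj lies inside region (edges inclusive)."""
    cx = (obj.x0 + obj.x1) / 2.0
    cy = (obj.y0 + obj.y1) / 2.0
    return region.x0 <= cx <= region.x1 and region.y0 <= cy <= region.y1


def assign_to_regions(objects, regions):
    """Assign each object to the first region containing its center.

    Returns (assigned, unassigned): assigned maps a region index to a list
    of object indices; unassigned lists objects that fall in no region.
    """
    assigned = {i: [] for i in range(len(regions))}
    unassigned = []
    for j, obj in enumerate(objects):
        for i, region in enumerate(regions):
            if center_in(obj, region):
                assigned[i].append(j)
                break
        else:
            # No region contained this object's center.
            unassigned.append(j)
    return assigned, unassigned


# Made-up coordinates for illustration only.
regions = [Box(0, 0, 10, 10), Box(20, 0, 30, 10)]
objects = [
    Box(1, 5, 4, 5),      # horizontal line: zero-height bounding box
    Box(22, 2, 25, 8),
    Box(50, 50, 51, 51),  # outside every region
]
assigned, unassigned = assign_to_regions(objects, regions)
print(assigned, unassigned)
```

The nested loop is quadratic in the worst case; sorting regions first, as the PROTables sorting functions mentioned above presumably allow, is one way such a merge can be made faster.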

Testing

Make sure molconvert is installed (i.e., the molconvert command runs in your terminal) before running the test. Run the following commands:

  • git checkout containerize-improve
  • make && make clean-full && make
  • make chem-v2-all-test

Check the output for error messages. If there are none and you see Exact matches: 142 247, the MR passes. For additional testing, open the CDXML files in outputs/All/generated_cdxmls/or100.09.tables_full/ in ChemDraw and check that most structures (90%) match those in the input PDF inputs/chemxtest_inpdfs/All/or100.09.tables.pdf.

Also note the speedups (faster Symbol Scraper reading and merging with YOLO regions). The timings should be similar to those shown below:

[ Timer: ChemScraper pipeline ]
Started: Mon, 09 Oct 2023 20:27:28
Total Duration: 69.7501 seconds
1.96057868s  Python module load
1.88881111s  SymbolScraper (PDF instruction extraction)
0.63958001s  PDF -> PNG image conversion
10.20745111s  YOLO molecule region detector
1.17584562s  Reading Symbol Scraper Info (JSON)
0.00068069s  Reading YOLO regions (CSV)
0.11365700s  Merging SymbolScraper Info with YOLO regions
1.33664298s  Parsing PDF objects to visual graphs
0.99581695s  Converting visual graphs to CDXML
2.84608936s  CDXML -> SMILES translation (molconvert)
0.00271916s  TSV generation
1.78604007s  Pages and Full PDF CDXML generation
1.81593370s  YOLO molecule region visualization
44.98027706s  SMILE evaluation metric and output
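
To compare a local run against the timings above, a small script along these lines could parse the timer output. It is only a sketch: `parse_timer_log` is a hypothetical helper, not part of the repository, and it assumes the log keeps the `<seconds>s  <stage name>` line format shown above.

```python
import re

# Assumed line formats, taken from the timer output shown above.
STAGE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)s\s+(.+?)\s*$")
TOTAL_RE = re.compile(r"Total Duration:\s*(\d+(?:\.\d+)?)\s*seconds")

LOG = """\
[ Timer: ChemScraper pipeline ]
Started: Mon, 09 Oct 2023 20:27:28
Total Duration: 69.7501 seconds
1.96057868s  Python module load
1.88881111s  SymbolScraper (PDF instruction extraction)
0.63958001s  PDF -> PNG image conversion
10.20745111s  YOLO molecule region detector
1.17584562s  Reading Symbol Scraper Info (JSON)
0.00068069s  Reading YOLO regions (CSV)
0.11365700s  Merging SymbolScraper Info with YOLO regions
1.33664298s  Parsing PDF objects to visual graphs
0.99581695s  Converting visual graphs to CDXML
2.84608936s  CDXML -> SMILES translation (molconvert)
0.00271916s  TSV generation
1.78604007s  Pages and Full PDF CDXML generation
1.81593370s  YOLO molecule region visualization
44.98027706s  SMILE evaluation metric and output
"""


def parse_timer_log(text):
    """Return (total_seconds, [(stage_name, seconds), ...]) from timer output."""
    total = None
    stages = []
    for line in text.splitlines():
        total_match = TOTAL_RE.search(line)
        if total_match:
            total = float(total_match.group(1))
            continue
        stage_match = STAGE_RE.match(line)
        if stage_match:
            stages.append((stage_match.group(2), float(stage_match.group(1))))
    return total, stages


total, stages = parse_timer_log(LOG)
stage_sum = sum(seconds for _, seconds in stages)
slowest = max(stages, key=lambda item: item[1])
print(f"reported total: {total:.4f}s, sum of stages: {stage_sum:.4f}s")
print(f"slowest stage: {slowest[0]} ({slowest[1]:.2f}s)")
```

In the run above, the evaluation stage dominates the total, so the two stages targeted by this MR (reading the Symbol Scraper JSON and merging with YOLO regions) are best compared line by line rather than by total duration.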
Edited by Ayush Kumar Shah
