Improve robustness and efficiency of parser using geometry from JSON
Changes
Main change: use the geometric properties in the new JSON output from Symbol Scraper to resolve ambiguities when determining node (bond) types. This makes the graph construction algorithms more robust, so that they work on other PDFs. The MR also includes further optimizations (speedups) in reading Symbol Scraper output and merging it with YOLO regions. The changes include the following:
- Use the new PROTables functions for reading Symbol Scraper information (JSON instead of XML), replacing the custom function
- Use new PROTables functions for region intersections and sorting regions
- Remove the requirement that graphic objects be polygons
- Add robust logic based on geometrical properties from JSON to determine node types (curves, circles, solid wedges, hash wedges, bond lines, etc.)
- Add files needed for evaluation
- Reorganize code and other files by type, and remove unused code and files
- Modify server code to work with the new changes
- Fix missing molecules in page-level and document-level CDXML files
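
As an illustration of the region intersection step mentioned above, here is a minimal sketch of assigning Symbol Scraper objects to YOLO regions by bounding-box overlap. It is not the PROTables implementation: `Box`, `intersection_area`, `assign_to_regions`, and the `min_overlap` threshold are hypothetical names and values chosen for this example, and the center-point fallback for zero-area boxes is an assumption about how thin horizontal or vertical lines might be handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical type for this sketch)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)

    def center(self) -> tuple:
        return ((self.x_min + self.x_max) / 2.0, (self.y_min + self.y_max) / 2.0)

    def contains_point(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes, or 0.0 if they do not overlap."""
    w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    return w * h if w > 0 and h > 0 else 0.0


def assign_to_regions(objects, regions, min_overlap=0.5):
    """Map each region index to the indices of the objects assigned to it.

    An object goes to the region covering the largest fraction of its area,
    provided that fraction is at least min_overlap. Zero-area boxes (for
    example a perfectly horizontal line) fall back to a center-point test.
    """
    assigned = {i: [] for i in range(len(regions))}
    for j, obj in enumerate(objects):
        obj_area = obj.area()
        if obj_area == 0.0:
            cx, cy = obj.center()
            for i, region in enumerate(regions):
                if region.contains_point(cx, cy):
                    assigned[i].append(j)
                    break
            continue
        best_i, best_frac = None, 0.0
        for i, region in enumerate(regions):
            frac = intersection_area(obj, region) / obj_area
            if frac > best_frac:
                best_i, best_frac = i, frac
        if best_i is not None and best_frac >= min_overlap:
            assigned[best_i].append(j)
    return assigned
```

Objects that overlap no region strongly enough are left unassigned in this sketch; the actual merging code may treat them differently.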
Testing
Make sure molconvert is installed (you can run molconvert in your terminal) before running the test. Run the following commands:
- git checkout containerize-improve
- make && make clean-full && make
- make chem-v2-all-test
Check for error messages. If there are none and the output shows Exact matches: 142 247, then the MR passes.
For additional testing, open the CDXML in outputs/All/generated_cdxmls/or100.09.tables_full/ in ChemDraw and check that most structures (90%) match those in the input PDF inputs/chemxtest_inpdfs/All/or100.09.tables.pdf.
Also note the speedups (faster Symbol Scraper read and merging with YOLO regions). The timings should be similar to those shown below:
[ Timer: ChemScraper pipeline ]
Started: Mon, 09 Oct 2023 20:27:28
Total Duration: 69.7501 seconds
1.96057868s Python module load
1.88881111s SymbolScraper (PDF instruction extraction)
0.63958001s PDF -> PNG image conversion
10.20745111s YOLO molecule region detector
1.17584562s Reading Symbol Scraper Info (JSON)
0.00068069s Reading YOLO regions (CSV)
0.11365700s Merging SymbolScraper Info with YOLO regions
1.33664298s Parsing PDF objects to visual graphs
0.99581695s Converting visual graphs to CDXML
2.84608936s CDXML -> SMILES translation (molconvert)
0.00271916s TSV generation
1.78604007s Pages and Full PDF CDXML generation
1.81593370s YOLO molecule region visualization
44.98027706s SMILE evaluation metric and output
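
To compare timings between runs, the per-stage lines of a timer report like the one above can be parsed with a short script. This is only a sketch that assumes each stage line has the form `<seconds>s <stage name>`, as in the output shown; `parse_stage_timings` and `slowest_stages` are hypothetical helpers, not part of the pipeline.

```python
import re

# Matches lines such as "1.17584562s Reading Symbol Scraper Info (JSON)".
STAGE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)s\s+(.+?)\s*$")


def parse_stage_timings(log_text):
    """Return a dict mapping stage name to duration in seconds."""
    timings = {}
    for line in log_text.splitlines():
        match = STAGE_RE.match(line)
        if match:
            timings[match.group(2)] = float(match.group(1))
    return timings


def slowest_stages(timings, n=3):
    """Return the n (stage, seconds) pairs with the longest durations."""
    return sorted(timings.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    sample = """\
Total Duration: 69.7501 seconds
1.17584562s Reading Symbol Scraper Info (JSON)
0.00068069s Reading YOLO regions (CSV)
0.11365700s Merging SymbolScraper Info with YOLO regions
"""
    for stage, seconds in slowest_stages(parse_stage_timings(sample)):
        print(f"{seconds:10.4f}s  {stage}")
```

Header lines such as "Started" and "Total Duration" do not match the pattern and are ignored.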