
Handling overlapped YOLO detections; added flag to prevent nested named fragments

R requested to merge YOLO_merge_fix into containerize

Addresses:

  • YOLO overlapping regions now merged; avoids overlapping molecules ('duplicates') in CDXML
  • Pipeline messages cleaned up (up to the evaluation step)
  • Added code in the CDXML_Conversion class to allow nested fragments for names (e.g., 'Ph') not to be written to CDXML, since ChemDraw can expand many names on its own. This feature is currently disabled using an attribute in the class.
  • (Cosmetic change) Minimum font size changed to 6 points for easier readability. May help with SVG renders.
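
To illustrate the first item, here is a minimal sketch of merging overlapping detection boxes. It is not the code in merging.py (which, per the log in the testing steps, uses a Shapely STRtree); it is a dependency-free O(n²) version that assumes boxes are (x_min, y_min, x_max, y_max) tuples and that overlapping boxes are replaced by their union. The names Box, boxes_overlap, and merge_overlapping are hypothetical.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max). Hypothetical type alias.
Box = Tuple[float, float, float, float]


def boxes_overlap(a: Box, b: Box) -> bool:
    """Return True if the boxes share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_overlapping(boxes: List[Box]) -> List[Box]:
    """Replace every group of (transitively) overlapping boxes with its union box."""
    merged: List[Box] = []
    for box in boxes:
        current = box
        changed = True
        # Growing `current` may create new overlaps, so repeat until stable.
        while changed:
            changed = False
            keep: List[Box] = []
            for other in merged:
                if boxes_overlap(current, other):
                    current = (
                        min(current[0], other[0]),
                        min(current[1], other[1]),
                        max(current[2], other[2]),
                        max(current[3], other[3]),
                    )
                    changed = True
                else:
                    keep.append(other)
            merged = keep
        merged.append(current)
    return merged


# Made-up example: two overlapping detections and one isolated detection.
page_boxes = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]
result = merge_overlapping(page_boxes)
print(len(result), len(page_boxes) - len(result))  # (#remaining, #removed)
```

The printed pair mirrors the (#remaining, #removed) counts reported per page in the pipeline output. Whether the real implementation unions boxes or discards one of each overlapping pair is not stated here, so treat the union step as an assumption.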

Testing:

  • Pull the fix branch (git pull; git checkout YOLO_merge_fix), then run make clean-out to ensure that previous data does not interfere with testing.
  • From the top-level directory, run make chem-v2-all-test. The command-line output should include the following:
* Initial YOLO detections : 247
/home/rlaz/merge-checks/graphics-extraction/modules/chemscraper/utils/merging.py:39: ShapelyDeprecationWarning: STRtree will be changed in 2.0.0 and will not be compatible with versions < 2.
  search_tree = STRtree(boxes)
>>> !! YOLO overlaps (page, #remaining, #removed) : (5, 18, 1)
>>> !! YOLO overlaps (page, #remaining, #removed) : (9, 17, 1)
>>> !! YOLO overlaps (page, #remaining, #removed) : (11, 15, 2)
* YOLO detections : 243

* PDF pages with YOLO detections : 15
* Initial MST count (all pages) : 243
  • Then look at this CDXML file: ./outputs/All/generated_cdxmls/or100.09.tables_full_cdxml/or100.09.tables.cdxml; pay particular attention to pages 5, 9, and 11, where molecules should no longer be duplicated/laid over one another.
  • Next, change to the containerize branch (git checkout containerize; git pull)
  • Run make clean-out, and then make chem-v2-all-test. Check pages 5, 9, and 11 of the same file (./outputs/All/generated_cdxmls/or100.09.tables_full_cdxml/or100.09.tables.cdxml) and confirm that the duplicated/overlaid molecules still appear there, since the containerize branch does not include the merging fix.
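
As an optional sanity check on the log shown in the testing steps, the sketch below parses the detection counts and verifies that the initial count minus the removed overlaps equals the final count (247 - 4 = 243). The function name check_log and the regular expressions are hypothetical helpers written against the log lines quoted above, not part of the pipeline.

```python
import re

# Patterns written against the log lines quoted in the testing steps.
INITIAL_RE = re.compile(r"^\* Initial YOLO detections : (\d+)", re.MULTILINE)
FINAL_RE = re.compile(r"^\* YOLO detections : (\d+)", re.MULTILINE)
OVERLAP_RE = re.compile(
    r"YOLO overlaps \(page, #remaining, #removed\) : \((\d+), (\d+), (\d+)\)"
)


def check_log(log: str) -> dict:
    """Extract detection counts and test whether they are mutually consistent."""
    initial = int(INITIAL_RE.search(log).group(1))
    final = int(FINAL_RE.search(log).group(1))
    # Map page number -> number of detections removed on that page.
    removed = {int(page): int(rem) for page, _, rem in OVERLAP_RE.findall(log)}
    return {
        "initial": initial,
        "final": final,
        "removed_by_page": removed,
        "consistent": initial - sum(removed.values()) == final,
    }


sample_log = """\
* Initial YOLO detections : 247
>>> !! YOLO overlaps (page, #remaining, #removed) : (5, 18, 1)
>>> !! YOLO overlaps (page, #remaining, #removed) : (9, 17, 1)
>>> !! YOLO overlaps (page, #remaining, #removed) : (11, 15, 2)
* YOLO detections : 243
"""

summary = check_log(sample_log)
print(summary)
```

On the YOLO_merge_fix branch the summary should report removals on pages 5, 9, and 11, which are the pages to inspect in the CDXML file.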
Edited by R
