CDXML Generation Corrections

R requested to merge label_cdxml_fixes into containerize

This MR includes the following improvements to ChemDraw output files:

Clean-up of CDXML files and formatting changes

  • Removal of unnecessary attributes on tags in CDXML files (NeedsClean, Justification, etc.)
  • Formatting improvements: new font (Times New Roman), thinner bond lines, and line ends closer to text labels
  • Shortened the brackets for Markush structures
  • CreationProgram attribute at file top is now "ChemScraper v0.1"
  • Minor code re-organization and commenting
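
For reviewers who want to spot-check a generated file against the clean-up items above, here is a minimal sketch using only the Python standard library. The function name check_cdxml and the inline sample document are hypothetical and are not part of ChemScraper; the attribute names and the expected CreationProgram value are taken from this description, and the "etc." attributes are not covered.

```python
import xml.etree.ElementTree as ET

# Attributes this MR says are removed. The MR also says "etc.", so this set
# is an assumption and may need extending.
REMOVED_ATTRIBUTES = {"NeedsClean", "Justification"}
EXPECTED_CREATION_PROGRAM = "ChemScraper v0.1"


def check_cdxml(cdxml_text):
    """Return a list of human-readable problems found in a CDXML document."""
    root = ET.fromstring(cdxml_text)
    problems = []

    # The MR sets CreationProgram on the top-level element of the file.
    program = root.get("CreationProgram")
    if program != EXPECTED_CREATION_PROGRAM:
        problems.append(f"CreationProgram is {program!r}")

    # Walk every element and report any attribute that should be gone.
    for element in root.iter():
        for name in sorted(REMOVED_ATTRIBUTES & set(element.attrib)):
            problems.append(f"<{element.tag}> still has attribute {name}")
    return problems


# Hypothetical, heavily simplified document used only to exercise the check.
sample = (
    '<CDXML CreationProgram="ChemScraper v0.1">'
    '<page><fragment><n id="1" p="10 10"/></fragment></page>'
    "</CDXML>"
)
print(check_cdxml(sample))  # an empty list means no problems were found
```

To check a real output file, read it as text and pass the contents to check_cdxml.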

Corrections

  1. Font sizes for labels are now estimated directly from character sizes on the page
  2. Corrected alignment of atoms with double bonds (atom positions now at center point)
  3. Corrected an error with label strings defined left-to-right on the left side of a bond (e.g., "TsO" no longer appears as "OsT" with an error in ChemDraw)
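
The symptom in item 3 matches what a character-by-character reversal of a label produces. The sketch below only illustrates that symptom; it is not ChemScraper's code, and both function names are hypothetical. The token rule (an uppercase letter, optional lowercase letters, optional digits) is an assumption that only suits simple labels.

```python
import re


def reverse_by_character(label):
    """Naive reversal: flips every character, which scrambles multi-letter groups."""
    return label[::-1]


def reverse_by_token(label):
    """Reverse the order of groups while keeping each group's spelling intact.

    Assumes a group is one uppercase letter, any lowercase letters, then any
    digits. Characters that do not fit this pattern (e.g., parentheses) are
    dropped, so this is suitable for simple labels only.
    """
    tokens = re.findall(r"[A-Z][a-z]*\d*", label)
    return "".join(reversed(tokens))


print(reverse_by_character("TsO"))  # OsT
print(reverse_by_token("TsO"))      # OTs
```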

Testing

  1. Check out the changes using git pull; git checkout label_cdxml_fixes
  2. From the graphics_extraction directory, issue: make chem-v2-all-test
  3. Save a copy of this SMILES file outside the repo (e.g., in your home directory): outputs/All/generated_smiles/or100.09.tables/smiles_out.txt
  4. Check out the target branch using git checkout containerize
  5. Repeat steps 2 and 3.
  6. Run diff on the two versions of smiles_out.txt -- there should be no difference.
  7. Download some of the generated CDXML files to check the output.
  8. If steps 6 and 7 pass, approve the merge.
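
The comparison in step 6 can also be done from Python when diff is not at hand. This is a minimal sketch using the standard difflib module; the function name smiles_diff and the file names in the demo are hypothetical stand-ins for the two saved copies of smiles_out.txt.

```python
import difflib
import tempfile
from pathlib import Path


def smiles_diff(baseline_path, candidate_path):
    """Return unified-diff lines between two SMILES files (empty list if identical)."""
    baseline = Path(baseline_path).read_text().splitlines()
    candidate = Path(candidate_path).read_text().splitlines()
    return list(
        difflib.unified_diff(
            baseline,
            candidate,
            fromfile=str(baseline_path),
            tofile=str(candidate_path),
            lineterm="",
        )
    )


# Demo with two throwaway files standing in for the saved copies.
with tempfile.TemporaryDirectory() as tmp:
    target_copy = Path(tmp) / "smiles_out_containerize.txt"
    branch_copy = Path(tmp) / "smiles_out_label_cdxml_fixes.txt"
    target_copy.write_text("CCO\nc1ccccc1\n")
    branch_copy.write_text("CCO\nc1ccccc1\n")

    differences = smiles_diff(target_copy, branch_copy)
    print("identical" if not differences else "\n".join(differences))
```

An empty result corresponds to the "no difference" outcome expected in step 6.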
