Commit c91e9122 authored by Tomáš Hübelbauer

Update tasks to scrap block elements

parent 13eff222
@@ -4,18 +4,10 @@
## Experiment with being smarter about inserting newlines in the generated TXT files to indicate formatting
**In progress**, next populate `wordSurround`, `excelSurround` and `powerPointSurround` with block element tag names.
**In progress**, implemented, but needs to be scrapped. Recognizing block elements is more complex than
checking the tag names of the text nodes' parent elements, as currently implemented in the `$*Surround` arrays.
- For each office format (Word, Excel, PowerPoint), keep a list of special tag names
- For each text node check if parent tag name is in the list
- Yes: append a newline, append the text node value, append another two newlines (indicate a block element)
- No: append the text node value and a newline (indicate an inline element)
In both cases a newline is appended, rather than a space or nothing for inline elements, to avoid long diff lines in which changes are hard to spot.
Lines followed by other lines will indicate separate inline elements.
Lines surrounded by empty lines will indicate block elements.
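
A minimal Python sketch of the steps listed above, for illustration only. It is not the script's actual code: the `WORD_SURROUND` contents, the `local_name` and `render_text` helpers, and the sample XML are all assumptions, and the real `wordSurround` tag names are not given here. It also only looks at the text directly inside an element and ignores ElementTree's `tail` text.

```python
import xml.etree.ElementTree as ET

# Hypothetical block element tag names for Word; the real contents of
# `wordSurround` are not specified in this task list.
WORD_SURROUND = {"p"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def render_text(xml_source, surround):
    """Emit each text node with newlines that hint at block vs. inline."""
    parts = []
    for element in ET.fromstring(xml_source).iter():
        # ElementTree exposes a leading text node as `text` on its parent.
        # Text after a child element (`tail`) is ignored in this sketch.
        if not element.text or not element.text.strip():
            continue
        if local_name(element.tag) in surround:
            # Block element: newline, the text, then two more newlines.
            parts.append("\n" + element.text + "\n\n")
        else:
            # Inline element: the text followed by a single newline.
            parts.append(element.text + "\n")
    return "".join(parts)


sample = "<doc><p>Title</p><r>one</r><r>two</r></doc>"
print(render_text(sample, WORD_SURROUND))
```

Even in this small form the weakness is visible: the result depends entirely on which parent tag names happen to be in the list, which is the limitation noted for the `$*Surround` arrays.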
Scrap this, as I don't want this script to understand the Office structure beyond dumbly finding text nodes.
## Fix broken skipping of unchanged files