docxtractr issues
https://gitlab.com/hrbrmstr/docxtractr/-/issues

Issue 21: convert_to_pdf() fails but command-line equivalent works
https://gitlab.com/hrbrmstr/docxtractr/-/issues/21
Reported by Dan, 2023-08-30

Hello, I have a pptx file called slides.pptx that I want to convert to PDF using docxtractr.
When I try, I get this:
```
> docxtractr::convert_to_pdf("/tmp/slides.pptx")
Warning: failed to launch javaldx - java may not function correctly
The application cannot be started.
The component manager is not available.
("Cannot open uno ini file:///usr/lib/x86_64-linux-gnu/unorc at ./cppuhelper/source/defaultbootstrap.cxx:53")
Error in docxtractr::convert_to_pdf("/tmp/slides.pptx") :
Conversion from PPTX to PDF did not succeed
In addition: Warning message:
In system(cmd, intern = TRUE) :
running command '"/usr/bin/soffice" --convert-to pdf --headless --outdir "/tmp/RtmptPPhMp" "/tmp/RtmptPPhMp/file151a369f9d98.pptx"' had status 139
```
However, when I take that soffice command line at the end, and change the last parameter to `/tmp/slides.pptx`, it works (despite throwing a warning). It produces the output PDF in `/tmp` and I verified that its contents are correct:
```
root@4d9dd60d0e79:/app# "/usr/bin/soffice" --convert-to pdf --headless --outdir "/tmp/" "/tmp/slides.pptx"
Warning: failed to launch javaldx - java may not function correctly
convert /tmp/slides.pptx -> /tmp/slides.pdf using filter : impress_pdf_Export
root@4d9dd60d0e79:/app#
```
So, you might ask, why don't I just use the command line? Well, this is part of a larger software stack that relies on docxtractr, and I don't want to reinvent the wheel.
This is inside a Docker container based on Debian 12, with R-4.2.2 and docxtractr 0.6.5.
BTW, I do not have the javaldx program in the container (although I installed libreoffice via apt-get) but it does not seem to matter - despite the warning, soffice converts the pptx successfully.
So, in a nutshell, I am wondering why I can successfully convert pptx to pdf on the command line but not with docxtractr, which apparently uses pretty much the same command line under the hood.
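Until this is resolved, one possible stopgap (a sketch only, not part of docxtractr; it assumes `soffice` lives at `/usr/bin/soffice` as in the output above, and `convert_via_soffice` is a hypothetical helper name) is to shell out to the same command line that works:

```r
# Hedged workaround sketch: call soffice directly instead of
# docxtractr::convert_to_pdf(), mirroring the command line that succeeds.
convert_via_soffice <- function(path, outdir = dirname(path),
                                soffice = "/usr/bin/soffice") {
  res <- system2(
    soffice,
    c("--convert-to", "pdf", "--headless",
      "--outdir", shQuote(outdir), shQuote(path)),
    stdout = TRUE, stderr = TRUE
  )
  # soffice names the output after the input, with a .pdf extension
  pdf <- file.path(outdir, sub("\\.[^.]+$", ".pdf", basename(path)))
  if (!file.exists(pdf)) stop("conversion failed:\n", paste(res, collapse = "\n"))
  pdf
}
```

This sidesteps whatever goes wrong with the temp-dir copy that docxtractr makes before conversion.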
Thanks

Issue 15: libreglo.so not found
https://gitlab.com/hrbrmstr/docxtractr/-/issues/15
Reported by Boris Demeshev, 2019-03-08

I am trying to open a `.doc` file on an Ubuntu 18.04 machine.
```r
url = 'http://www.gks.ru/bgd/regl/b18_02/IssWWW.exe/Stg/d010/1-08.doc'
tbl = docxtractr::read_docx(url)
```
Alas I get `/usr/lib/libreoffice/program/javaldx: error while loading shared libraries: libreglo.so: cannot open shared object file`.
This may be more a LibreOffice issue... But I can run LibreOffice itself without any problems.

Issue 6: Text outside of tables
https://gitlab.com/hrbrmstr/docxtractr/-/issues/6
Created by slarge, 2019-03-05
I know the impetus of this package is to read data from .docx tables, but I am wondering if the XML structure would permit pulling text from beneath a specific heading. In a .docx with a common format, for example:
# Introduction
Chicken ullamco meatball, magna tail elit meatloaf aliquip jerky cillum. Id chicken ut, meatloaf dolore jowl cupim porchetta aliqua tempor tenderloin sausage quis aute. Et deserunt est ground round, chicken ea do ball tip laboris tri-tip ullamco id occaecat chuck. Brisket cupim meatloaf veniam porchetta picanha meatball quis flank t-bone elit dolor rump.
# Materials and Methods
Bacon ipsum dolor amet bacon dolore commodo id. Est veniam nostrud hamburger eu meatball nisi ut. Ham hock adipisicing anim aliqua ullamco. In ad cow flank meatball. Ut ham laboris incididunt pancetta do venison dolor fatback. Sint alcatra incididunt, shank sunt ground round commodo meatball tail filet mignon.
something like:

```r
docx_extract_txt(doc, heading = "Introduction")
#> "Chicken ullamco meatball, magna ..."
```

returning a string of text. Not sure if this would be possible, but I think it could be extremely useful.
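For what it's worth, the request above can be prototyped with xml2 (which docxtractr already builds on). This is only a sketch against an in-memory stub of `word/document.xml`; it assumes headings are paragraphs whose `w:pStyle` value starts with "Heading" (Word's default style naming), and `extract_under_heading` is a hypothetical name:

```r
library(xml2)

# In-memory stand-in for word/document.xml; a real one would come from
# unzipping the .docx.
doc_xml <- '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
 <w:body>
  <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Introduction</w:t></w:r></w:p>
  <w:p><w:r><w:t>Chicken ullamco meatball.</w:t></w:r></w:p>
  <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Materials and Methods</w:t></w:r></w:p>
  <w:p><w:r><w:t>Bacon ipsum dolor amet.</w:t></w:r></w:p>
 </w:body>
</w:document>'

# Return the text of paragraphs between the named heading and the next heading.
extract_under_heading <- function(xml, heading) {
  doc <- read_xml(xml)
  ps  <- xml_find_all(doc, "//w:p")
  hdg <- vapply(ps, function(p)
    length(xml_find_all(p, ".//w:pStyle[starts-with(@w:val, 'Heading')]")) > 0,
    logical(1))
  txt <- vapply(ps, xml_text, character(1))
  start <- which(hdg & txt == heading)[1]
  if (is.na(start)) return(character(0))
  after <- which(hdg)[which(hdg) > start]
  end <- if (length(after)) after[1] - 1 else length(ps)
  if (end <= start) character(0) else txt[(start + 1):end]
}
```

Real documents are messier (nested tables, numbered headings), but the style-based scan is the core of it.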
EDIT: I replaced "header" with "heading", as that seems to be more precise usage of what I'm after in MS Word parlance.

Issue 1: Read HTTP/HTTPS Support
https://gitlab.com/hrbrmstr/docxtractr/-/issues/1
Reported by Alex Bresler, 2019-03-05
Fantastic package! Would be great to add the ability to read doc and docx directly from the web, like how it works in read_csv in readr. If I get some time today I will look at that code and see how they do it there.
Issue 8: can't read from a local file
https://gitlab.com/hrbrmstr/docxtractr/-/issues/8
Created by cityhunter007, 2019-03-05
Hello,
Thanks for the great package. I'm having issues reading a doc file from my workspace.
For example:

```r
doc1 <- read_docx("myfile.docx")
```

This simple code doesn't work. I get:

```
Error: 'C:\Users\smithj\AppData\Local\Temp\1\RtmpqqbQyW/docdata/word/document.xml' does not exist.
```

I can read in from the examples like:

```r
complx <- read_docx(system.file("examples/complex.docx", package="docxtractr"))
```

I don't want to copy all my files to the package example directories. Maybe I'm doing something wrong? I tried to Google but haven't had any success.
Thanks,

Issue 7: Possible to have output as tibble?
https://gitlab.com/hrbrmstr/docxtractr/-/issues/7
Reported by Ben Marwick, 2019-03-05

This is an awesome pkg, and I find `assign_colnames` useful for all kinds of data input besides docx files.
I wonder if you would consider allowing the tables output by `docx_extract_tbl()` to be tibbles? For a new data set coming in to my R environment I find it very handy to see the column classes that the tibble print method gives.
Could we change [this line](https://github.com/hrbrmstr/docxtractr/blob/master/R/docx_find_tbls.r#L51) to `as_tibble(dat)` ?
You have tibble in Suggests already, so I don't think this will change the dependencies. Just thought I'd ask before making a PR in case you have a good reason not to do this.

Issue 9: Is there a way to conserve newlines from extracted tables?
https://gitlab.com/hrbrmstr/docxtractr/-/issues/9
Created by stephenhwang, 2019-03-05
I have a word doc with tables containing cells with newlines. When I extract the cell, all newlines appear to be simply deleted. Any way to keep the formatting?
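One possible angle (a sketch only, not current docxtractr behaviour): Word stores in-cell line breaks as `w:br` elements, which plain text extraction drops. Walking the `w:t` and `w:br` nodes in document order and emitting a newline for each break preserves them. The stub cell XML below is hypothetical:

```r
library(xml2)

# Hypothetical stand-in for a table cell (w:tc) whose paragraph contains
# a soft line break (w:br) between two text runs.
cell <- read_xml('<w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:p><w:r><w:t>line one</w:t><w:br/><w:t>line two</w:t></w:r></w:p>
</w:tc>')

# Collect w:t and w:br nodes in document order, turning each break into "\n".
parts <- xml_find_all(cell, ".//w:t | .//w:br")
txt <- paste(
  vapply(parts, function(n) if (xml_name(n) == "br") "\n" else xml_text(n),
         character(1)),
  collapse = ""
)
# txt is now "line one\nline two"
```

Paragraph boundaries (separate `w:p` elements in one cell) would need similar handling.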
Issue 3: Example: how to use docx_extract_tbl() with lapply()
https://gitlab.com/hrbrmstr/docxtractr/-/issues/3
Created by DavoOZ, 2019-03-05
Fantastic new package - thank you.
In the examples please show how to wrap docx_extract_tbl() with lapply() to access all the tables in a document in one hit.
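A sketch of what such an example might look like, using the package's bundled complex.docx (`docx_tbl_count()` and `docx_extract_tbl()` are existing docxtractr functions; the header-detection arguments are left at their defaults):

```r
library(docxtractr)

# Read a bundled example document, then pull every table in one hit by
# mapping docx_extract_tbl() over the table indices.
doc  <- read_docx(system.file("examples/complex.docx", package = "docxtractr"))
tbls <- lapply(seq_len(docx_tbl_count(doc)), function(i) {
  docx_extract_tbl(doc, tbl_number = i)
})
length(tbls)  # one data frame per table in the document
```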
Issue 14: extract text associated with the comment
https://gitlab.com/hrbrmstr/docxtractr/-/issues/14
Reported by Karlo Guidoni, 2019-03-05

Very useful package! I really appreciate it! Thank you!
Is there a way to extract the text associated with the comments?
I unzipped the attached file `test.docx` and explored the unzipped files.
The `word/document.xml` file has the following markers:
```xml
<w:commentRangeStart w:id="1"/>
<w:r>
<w:rPr/>
<w:t xml:space="preserve">
Five quacking zephyrs jolt my wax bed. Flummoxed by job, kvetching W. zaps Iraq. Cozy sphinx waves quart jug of bad milk. A very bad quack might jinx zippy fowls. Few quips galvanized the mock jury box. Quick brown dogs jump over the lazy fox. The jay, pig, fox, zebra, and my wolves quack! Blowzy red vixens fight for a quick jump. Joaquin Phoenix was gazed by MTV for luck.
</w:t>
</w:r>
<w:commentRangeEnd w:id="1"/>
```
With the following associated comments in the `word/comments.xml` file:
```xml
<w:comment w:id="1" w:author="Unknown Author" w:date="2018-04-05T13:58:02Z" w:initials="">
<w:p>
<w:r>
<w:rPr>
<w:rFonts w:eastAsia="Noto Sans CJK SC Regular" w:cs="FreeSans" w:ascii="Liberation Serif" w:hAnsi="Liberation Serif"/>
<w:b w:val="false"/>
<w:bCs w:val="false"/>
<w:i w:val="false"/>
<w:iCs w:val="false"/>
<w:caps w:val="false"/>
<w:smallCaps w:val="false"/>
<w:strike w:val="false"/>
<w:dstrike w:val="false"/>
<w:outline w:val="false"/>
<w:shadow w:val="false"/>
<w:emboss w:val="false"/>
<w:imprint w:val="false"/>
<w:color w:val="auto"/>
<w:spacing w:val="0"/>
<w:w w:val="100"/>
<w:position w:val="0"/>
<w:sz w:val="20"/>
<w:szCs w:val="24"/>
<w:u w:val="none"/>
<w:vertAlign w:val="baseline"/>
<w:em w:val="none"/>
<w:lang w:bidi="hi-IN" w:eastAsia="zh-CN" w:val="en-US"/>
</w:rPr>
<w:t>All paragraph.</w:t>
</w:r>
</w:p>
</w:comment>
```
These things seem linked by the `w:id="1"` in both `word/document.xml` and `word/comments.xml` files.
It would be very interesting if your `docx_extract_all_cmnts()` function returned a tibble containing a column with the text associated with each comment.
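The id-based link described above can be sketched with xml2 (in-memory stubs stand in for `word/document.xml` and `word/comments.xml`; this is not current `docx_extract_all_cmnts()` behaviour, just an illustration of the join):

```r
library(xml2)

ns <- 'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'

# Stub of word/document.xml: text wrapped in a comment range with w:id="1".
doc <- read_xml(sprintf('<w:body %s>
  <w:p><w:commentRangeStart w:id="1"/><w:r><w:t>Five quacking zephyrs jolt my wax bed.</w:t></w:r><w:commentRangeEnd w:id="1"/></w:p>
</w:body>', ns))

# Stub of word/comments.xml: the comment carrying the same w:id.
cmts <- read_xml(sprintf('<w:comments %s>
  <w:comment w:id="1" w:author="Unknown Author"><w:p><w:r><w:t>All paragraph.</w:t></w:r></w:p></w:comment>
</w:comments>', ns))

nodes <- xml_find_all(cmts, "//w:comment")
out <- data.frame(
  id      = xml_attr(nodes, "id"),
  author  = xml_attr(nodes, "author"),
  comment = vapply(nodes, xml_text, character(1)),
  stringsAsFactors = FALSE
)

# For each comment id, gather the w:t text lying between the matching
# commentRangeStart / commentRangeEnd pair.
out$commented_text <- vapply(out$id, function(i) {
  xp <- sprintf(paste0(
    "//w:t[preceding::w:commentRangeStart[@w:id='%s']",
    " and following::w:commentRangeEnd[@w:id='%s']]"), i, i)
  paste(xml_text(xml_find_all(doc, xp)), collapse = " ")
}, character(1))
```

`out` then has one row per comment with both the comment text and the commented-on document text.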
[test.docx.zip](https://github.com/hrbrmstr/docxtractr/files/1881156/test.docx.zip)

Issue 10: error when read_docx has url argument
https://gitlab.com/hrbrmstr/docxtractr/-/issues/10
Created by markdly, 2019-03-05
Thanks for making this package available - it's working great for me when I read existing local files. However, I'm currently encountering an issue when `read_docx` has a url argument. Minimal reprex:
``` r
library(docxtractr)
#> Warning: package 'docxtractr' was built under R version 3.4.3
read_docx("http://rud.is/dl/1.DOCX")
#> Warning in unzip(tmpf, exdir = sprintf("%s/docdata", tmpd)): internal error
#> in 'unz' code
#> Error: 'C:\Users\Mark\AppData\Local\Temp\RtmpGmq7J6/docdata/word/document.xml' does not exist.
```
It looks like the call to `download.file` is causing this issue:
``` r
download.file("http://rud.is/dl/1.DOCX", "temp.docx")
read_docx("temp.docx")
#> Warning in unzip(tmpf, exdir = sprintf("%s/docdata", tmpd)): internal error
#> in 'unz' code
#> Error: 'C:\Users\Mark\AppData\Local\Temp\RtmpGmq7J6/docdata/word/document.xml' does not exist.
```
To work around this I can use `mode = "wb"`:
``` r
download.file("http://rud.is/dl/1.DOCX", "wb.docx", mode = "wb")
read_docx("wb.docx")
#> Word document [wb.docx]
#>
#> Table 1
#> total cells: 24
#> row count : 6
#> uniform : likely!
#> has header : unlikely
#>
#> Table 2
#> total cells: 28
#> row count : 4
#> uniform : likely!
#> has header : unlikely
#> No comments in document
```
An alternative workaround is using the `httr` package:
``` r
library(httr)
#> Warning: package 'httr' was built under R version 3.4.3
r <- GET("http://rud.is/dl/1.DOCX")
bin <- content(r, "raw")
writeBin(bin, "myfile.docx")
read_docx("myfile.docx")
#> Word document [myfile.docx]
#>
#> Table 1
#> total cells: 24
#> row count : 6
#> uniform : likely!
#> has header : unlikely
#>
#> Table 2
#> total cells: 28
#> row count : 4
#> uniform : likely!
#> has header : unlikely
#> No comments in document
```
I thought I should raise this in case any other users have the same problem...

Issue 5: doc-file
https://gitlab.com/hrbrmstr/docxtractr/-/issues/5
Created by brry, 2019-03-05
I have several .doc files that each (unzipped) only contain "[Content_Types].xml" and the folders "_rels" (with ".rls") and "theme" (with "theme/theme1.xml", "theme/themeManager.xml" and "theme/_rels/themeManager.xml.rels").
Any idea how to read the old ".doc" format?
(I hope it's OK to post this as an issue. Just delete it if not^^)