Commit 913a9d35 authored by Kathryn Elliott

Text segmenter & results filenames

parent 13902405
@@ -40,9 +40,13 @@ a list of magazines, together with their URLs, in the file
due to copyright I cannot make this publicly available, however please contact
me if you'd like to discuss access.
-## Pre-processing
+## Conversion PDF to text and segmentation
-The PDFs were converted into text using Docsplit 0.7.6: http://documentcloud.github.io/docsplit/ and then processed using the following steps: tokenisation, stopword removal, lemmatisation, part-of-speech tagging, n-grams and text segmentation.
+The PDFs were converted into text using Docsplit 0.7.6: http://documentcloud.github.io/docsplit/
+The text was segmented using this simple text segmenter:
+* https://gitlab.com/filterfish/simple-text-segmenter/
## Topic modelling
@@ -77,12 +81,12 @@ logs from all of these runs.
My thesis project combines topic modelling and close reading. Once I had
refined my topic models I close read the top 15 documents from each topic.
-I have uploaded copies of these documents to my embargoed OSF directory, under the following directory names:
+I have uploaded copies of these documents to my embargoed OSF directory, in the sub-directory `/close-read-documents`, under the following directory names:
-* Documents from the whole corpus: `/corpus-20181013T125118`
-* Documents from Woolworths corpus 2009--2010: `/1541383605.266563`
-* Documents from Woolworths corpus 2011--2014: `/1541383605.257328`
-* Documents from Woolworths corpus 2015--2018: `/1541383605.254759`
+* Documents from the whole corpus: `/coles-and-woolworths-2009-2018`
+* Documents from Woolworths corpus 2009--2010: `/woolworths-2009-2010`
+* Documents from Woolworths corpus 2011--2014: `/woolworths-2011-2014`
+* Documents from Woolworths corpus 2015--2018: `/woolworths-2015-2018`
## spaCy