Commit dfa54863 authored by Kathryn Elliott

Update README

parent 4016c00b
# Masters of Research Thesis
# The Expert in the Aisles: Exploring supermarket narratives in Coles and Woolworths magazines from 2009-2014 using machine learning techniques
This is _very_ much a work-in-progress repository. I am a Masters of
Research (MRes) student in the Department of Sociology at Macquarie
University in Australia. For my thesis project I am using machine
learning to analyse magazine texts, meaning my thesis has a
significant technical component.
## Background
Following the principles of reproducible research set out by Marwick
(2017), for the technical work I am using:
This is the code repository for my Masters of Research thesis project
at Macquarie University. My project uses topic modelling, a form of machine
learning, together with close reading to analyse the supermarket narratives
found in the magazines released by Australia's two major supermarkets, Coles
and Woolworths.
* scripts to direct the processing of data;
* version control to enhance the transparency of my work.
## Abstract
## gensim
In Australia, supermarkets dominate our food landscape, with over eighty-four
percent of weekly food purchases occurring at the supermarket. The majority of
this shopping occurs at either Coles or Woolworths. Given Coles and Woolworths'
dominance in food retailing, the messages they promote about food form important
narratives that both reflect and reproduce broader cultural and social beliefs
about taste. This thesis uses a combination of topic modelling, a type of machine
learning, and close reading to analyse the supermarket narratives found in the
Coles and Woolworths magazines, _Coles Magazine_ and Woolworths' _Fresh_, published
between 2009 and 2018. My analysis of these narratives demonstrates how
supermarkets are positioning themselves as food and lifestyle
authorities ready to instruct their customers on how to be good moral citizens
through their consumption choices. Although the supermarket duopoly was subjected
to intense external scrutiny and criticism from multiple sources during this
period, my research finds that this had little impact on their magazine
narratives. Finally, my research highlights the benefits and analytical richness
to be gained from combining topic modelling and close reading when performing
content analysis on a large corpus of text.
I am using topic modelling in my thesis and in particular the tool
`gensim`. This work, which is still in progress, can be found in
the `/gensim/gensim-tutorial` directory.
## Text corpus
## spaCy
Magazines were manually downloaded as PDFs from the supermarket websites:
https://www.coles.com.au/magazine and
https://www.woolworths.com.au/shop/recipes/fresh-magazine/. Due to copyright
restrictions I am unable to make this corpus available; however, I have provided
a list of magazines, together with their URLs, in the file
`/text-corpus.md`. The Docsplit version of my corpus was uploaded to OSF
(https://osf.io/hzn2a/) and made available to the examiners of my thesis. Again,
due to copyright I cannot make this publicly available; however, please contact
me if you'd like to discuss access.
## Pre-processing
The PDFs were converted into text using Docsplit 0.7.6
(http://documentcloud.github.io/docsplit/) and then processed using the
following steps: tokenisation, stopword removal, lemmatisation, part-of-speech
tagging, n-gram detection and text segmentation.
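
A minimal sketch of three of these steps (tokenisation, stopword removal and joining frequent bigrams), using only the Python standard library. It is an illustration, not the code in this repository: the function names, the tiny stopword list and the `min_count` threshold are all assumptions. Lemmatisation, part-of-speech tagging and text segmentation need an NLP library and are left out.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real run would use a fuller list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "this", "is"}


def tokenise(text):
    """Lowercase the text and split it into runs of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def join_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least min_count times across all docs."""
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    frequent = {pair for pair, n in counts.items() if n >= min_count}

    joined_docs = []
    for tokens in docs:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                out.append(tokens[i] + "_" + tokens[i + 1])
                i += 2  # skip the second half of the joined pair
            else:
                out.append(tokens[i])
                i += 1
        joined_docs.append(out)
    return joined_docs


def preprocess(texts, min_count=2):
    """Tokenise each text, remove stopwords, then join frequent bigrams."""
    docs = [remove_stopwords(tokenise(t)) for t in texts]
    return join_frequent_bigrams(docs, min_count)
```

For example, `preprocess(["Roast chicken for dinner", "A roast chicken recipe"])` joins `roast` and `chicken` into a single `roast_chicken` token in both documents, because that pair occurs twice.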
## Topic modelling
I used gensim 3.4 to topic model my corpus. Gensim is an open-source,
Python-based suite of topic modelling tools. While the website documentation is
basic, the site has an excellent online forum which is very welcoming to
newcomers and beginners. The author of Gensim, Radim Řehůřek, is also active on
this forum and maintains the Gensim code base. See https://radimrehurek.com/.
The topic modelling code is run with various options. To print those options, run
the following:
```bash
cd gensim/gensim-tutorial
./run-topic-model --help
```
I did some earlier work using spaCy for NLP. The directory contains
multiple versions of scripts I have been trialling. This module needs
to be combined with my gensim work as part of my overall processing
pipeline.
For example:
## To follow
```bash
./run-topic-model --trigrams --bigrams --pos-tags NOUN ADJ --min-topic 2 --max-topic 10 data/corpus-2009-2010.json
```
## Results
I ran the LDA topic modelling software over my text corpus multiple times, using
different parameters and different sections of the corpus for each model. After
each iteration, I examined and compared the results manually and fed the
findings back into the modelling, which helped me to refine the parameters for
the next iteration. I have included the logs from all of these runs.
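
One way to keep track of such a series of runs is to generate the command lines from a small grid of settings. The sketch below only builds argument lists for `./run-topic-model`, using the flags that appear in the example command above. The helper name `build_commands` is hypothetical, as is any corpus file name other than `data/corpus-2009-2010.json`, and it assumes that the flags can be combined in this way for every run.

```python
from itertools import product


def build_commands(corpus_files, pos_tag_sets, topic_ranges):
    """Build one ./run-topic-model argument list per combination of settings.

    corpus_files: list of corpus file paths
    pos_tag_sets: list of lists of part-of-speech tags, e.g. [["NOUN", "ADJ"]]
    topic_ranges: list of (min_topic, max_topic) tuples
    """
    commands = []
    for corpus, pos_tags, (min_topic, max_topic) in product(
        corpus_files, pos_tag_sets, topic_ranges
    ):
        commands.append(
            [
                "./run-topic-model",
                "--trigrams",
                "--bigrams",
                "--pos-tags",
                *pos_tags,
                "--min-topic",
                str(min_topic),
                "--max-topic",
                str(max_topic),
                corpus,
            ]
        )
    return commands


if __name__ == "__main__":
    # Print each command so it can be reviewed (or copied) before running.
    for cmd in build_commands(
        ["data/corpus-2009-2010.json"], [["NOUN", "ADJ"]], [(2, 10)]
    ):
        print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` from inside `gensim/gensim-tutorial`, with the output redirected to a log file for that run.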
My thesis project combines topic modelling and close reading. Once I had
refined my topic models I close read the top 15 documents from each topic.
Copies of these documents can be found in the following directories:
* Documents from the whole corpus: `corpus-20181013T125118`
* Documents from Woolworths corpus 2009-2010: `1541383605.266563`
* Documents from Woolworths corpus 2011-2014: `1541383605.257328`
* Documents from Woolworths corpus 2015-2018: `1541383605.254759`
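
Selecting the top documents for each topic can be sketched as follows. This assumes that "top" means the documents with the highest weight for that topic, which the text above does not state, and that each document's topic weights are already available as a plain dictionary. The function name and the document ids in the usage example are hypothetical.

```python
def top_documents_per_topic(doc_topic_weights, n=15):
    """Return {topic_id: [doc_id, ...]} with the n highest-weighted docs per topic.

    doc_topic_weights maps a document id to a {topic_id: weight} dictionary.
    Ties on weight are broken by document id so the result is deterministic.
    """
    by_topic = {}
    for doc_id, weights in doc_topic_weights.items():
        for topic_id, weight in weights.items():
            by_topic.setdefault(topic_id, []).append((weight, doc_id))

    return {
        topic_id: [
            doc_id
            for _, doc_id in sorted(pairs, key=lambda p: (-p[0], p[1]))[:n]
        ]
        for topic_id, pairs in by_topic.items()
    }
```

For example, with three made-up documents and `n=2`:

```python
weights = {
    "coles-2010-03": {0: 0.7, 1: 0.3},
    "fresh-2012-06": {0: 0.2, 1: 0.8},
    "fresh-2016-11": {0: 0.9, 1: 0.1},
}
top = top_documents_per_topic(weights, n=2)
# top[0] lists "fresh-2016-11" first, then "coles-2010-03"
```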
## spaCy
As I continue with my thesis, I shall be adding to this repository.
I did some earlier work using spaCy, trialling it for the natural language
processing of my text. The directory contains multiple versions of the
scripts I tested. I did not end up using spaCy.