    import all news posts as HTML jekyll posts (closes #19) · eb928047
    Hans-Christoph Steiner authored
    This is done using HTML since the original source is in HTML. The image
    locations are not moved: the <img> tags are left as is, so images are
    still loaded from their WordPress locations.
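
    Because the <img> tags are left untouched, it can help to list which image
    URLs an imported post still depends on. The following is only a sketch
    using the standard library; ImgSrcCollector and image_sources are
    hypothetical names, not part of this commit.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names are lowercased
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def image_sources(html):
    """Return the src values of all <img> tags in an HTML string, in order."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources
```

    Running image_sources() over each generated post would give the list of
    images that still need to be reachable at their original locations.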
    
    Since only @CiaranG has access to the WordPress database, I didn't use any
    of the import methods; they all require direct database access. Instead, I
    used a little bag of tricks:
    
    * wget --span-hosts --recursive --page-requisites --html-extension \
      --convert-links --include-directories=/posts,/news-and-reviews \
      https://f-droid.org/news-and-reviews/
    * and this Python script:
    
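    import glob
    import os

    import bs4


    def quote(text):
        # double-quoted YAML scalar: escape backslashes and double quotes
        return '"' + text.strip().replace('\\', '\\\\').replace('"', '\\"') + '"'


    os.makedirs('output', exist_ok=True)

    for f in glob.glob('posts/*/index.html'):
        print('parsing', f)
        outputname = os.path.basename(os.path.dirname(f)) + '.html'
        with open(f, encoding='utf-8') as fp:
            # name the parser explicitly so the result does not depend on
            # which optional parsers happen to be installed
            soup = bs4.BeautifulSoup(fp, 'html.parser')

        # Jekyll post filenames start with the date, so skip pages without one
        date = soup.find('time', {'class': 'updated'})
        if not date or not date.get('datetime'):
            print('skipping, no date found:', f)
            continue
        filedate = date['datetime'].split('T')[0]

        # build the front matter; it is always closed, even when the title
        # or author is missing
        body = '---\nlayout: post\n'
        title = soup.find('title')
        if title:
            body += 'title: ' + quote(title.text.replace(' – F-Droid', '')) + '\n'
        author = soup.find('a', {'class': 'url'})
        if author:
            body += 'author: ' + quote(author.text) + '\n'
        body += '---\n\n'

        post_entry = soup.find('div', {'class': 'post-entry'})
        if post_entry:
            body += str(post_entry)

        with open(os.path.join('output', filedate + '-' + outputname), 'w',
                  encoding='utf-8') as fp:
            fp.write(body)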
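
    The script above writes files named <date>-<name>.html that begin with a
    front matter block between two '---' lines. A quick sanity check of that
    shape could look like the sketch below; post_problems and POST_NAME are
    hypothetical names, and the check is deliberately shallow (it does not
    parse YAML).

```python
import re

# Jekyll post filenames are expected to look like YYYY-MM-DD-name.html
POST_NAME = re.compile(r'^\d{4}-\d{2}-\d{2}-.+\.html$')


def post_problems(filename, text):
    """Return a list of human-readable problems found in one generated post."""
    problems = []
    if not POST_NAME.match(filename):
        problems.append('filename is not YYYY-MM-DD-name.html')

    lines = text.split('\n')
    if lines[0] != '---':
        problems.append('front matter does not start on the first line')
        return problems
    try:
        # look for the closing '---' after the opening one
        end = lines.index('---', 1)
    except ValueError:
        problems.append('front matter is never closed')
        return problems

    # shallow key check: everything before the first ':' on each line
    keys = {line.split(':', 1)[0] for line in lines[1:end] if ':' in line}
    for key in ('layout', 'title'):
        if key not in keys:
            problems.append('missing front matter key: ' + key)
    return problems
```

    An empty list means the post has the expected filename and a closed front
    matter block with both a layout and a title.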