|
|
|
|
|
Public data come in handy at times. Meaningful investigations have made use of everything from [property values published by commercial services](???) to [lobbying disclosures](???) to [law enforcement incident lists](???) to [corporate transparency reports](???) to [weather data](???), among countless other examples. While we are clearly seeing a proliferation of such information online, it is not always available in a format we can use. Leveraging public data for an investigation typically requires that it be not only *machine readable* but *structured.* In other words, a PDF that contains a photograph of a chart drawn on a napkin is less useful than a Microsoft Excel document that contains the actual data presented in that chart.
|
|
|
|
|
|
This can be an issue even in cases where governments are compelled by Freedom of Information (FoI) laws to release data they have collected, maintained or financed. In fact, governments sometimes obfuscate data intentionally to prevent further analysis that might reveal certain details they would rather keep hidden.
|
|
|
|
|
|
Investigators who work with public data face many obstacles, but two of the most common are:
|
|
|
|
|
|
1. Horrendous, multi-page [HTML](https://en.wikipedia.org/wiki/HTML) structures embedded in a series of consecutive webpages; and
|
|
|
2. Pleasantly formatted tables trapped within the unreachable hellscape of a PDF document.
|
|
|
|
|
|
This guide seeks to address the first challenge. It presents a series of steps that can be used to automate the collection of online HTML tables and the transformation of those tables into a more useful format. This process is often called "Web scraping." In the examples discussed below, we will produce [*comma separated values* (CSV)](https://en.wikipedia.org/wiki/Comma-separated_values) documents ready to be imported by *LibreOffice Calc* or *Microsoft Excel*. (For advice on dealing with PDF tables, have a look at [this article](https://exposingtheinvisible.org/resources/obtaining-evidence/scraping-parsing) and watch this space for an upcoming guide on *Tabula*, a PDF scraping tool.)
|
|
|
|
|
|
### The structure of this guide
|
|
|
|
|
|
This guide has four sections. The first discusses our rationale for choosing the tools and techniques covered in the rest of the guide. The second section is a brief primer on HTML structure, *CSS selectors* and how to use the Tor Browser's built-in *Element Inspector* to learn what you need to know for section three. (A *selector* is a short sequence of keywords that identifies one or more specific elements on a webpage.) In the third section, we walk through the process of plugging those selectors into *Scrapy*, pulling down HTML data and saving them as a CSV file. Finally, we present three external web tables that are both more interesting and more complex than the example used previously.
|
|
|
|
|
|
If you don't need convincing, you should probably skip down to the section on [Using Scrapy and Tor Browser to scrape tabular data](#using-scrapy-and-tor-browser-to-scrape-tabular-data).
|
|
|
|
|
|
### For whom was this guide written?
|
|
|
|
|
|
The short answer is, anyone with a *Debian GNU/Linux* system — be it a computer, a virtual machine or a boot disk — who is willing to spend most of a day learning how to scrape web data reliably, flexibly and privately. And who remains willing even when they find out that less reliable, less flexible and less secure methods are probably less work.
|
|
|
|
|
|
More specifically, the steps below assume you are able to edit a text file and enter commands into a Linux *Terminal*. These steps are written from the perspective of a [Tails](https://tails.boum.org) user, but we have included tips, where necessary, to make sure they work on any [Debian](https://www.debian.org) system. Adapting those steps for some other Linux distribution should be quite easy, and making them work on Windows should be possible.
|
|
|
|
|
|
This guide does not require familiarity with the Python programming language or with web design concepts like HTML and CSS, though all three make an appearance below. We will explain anything you need to know about these technologies.
|
|
|
|
|
|
Finally, this guide is written for variations of the *Firefox* Web browser, including the Tor Browser and *Iceweasel*. (From here on out, we will refer specifically to the Tor Browser, but most of the steps we describe will work just fine on other versions of *Firefox*.) Because of our commitment to Tails compatibility, we did not look closely at scraping extensions for Chromium, the open-source version of Google's Chrome web browser. So, if you're using Windows or a non-Tails Linux distribution — and if you are not particularly concerned about anonymity — you can either use *Firefox* or you can have a look at the [Web Scraper](https://chrome.google.com/webstore/detail/web-scraper/jnhgnonknehpejjnehehllkliplmbmhn) extension for Chromium. It's a few years old, but it looks promising nonetheless. It is free and open-source software, licensed under the GNU General Public License (GPL) just like *Scrapy* and the other tools we recommend in this guide.
|
|
|
|
|
|
### In defense of The Hard Way
|
|
|
|
|
|
If you have skimmed through the rest of this guide, you might have noticed a startling lack of screenshots. There are a few, certainly, but most of them just show the Tor Browser's built-in *Inspector* being used to identify a few inscrutable lines of poetry like `td.col1 div.resource > a::attr(href)`. And if that stanza gives you a warm fuzzy feeling, you might consider skipping down to the section on [Using Scrapy and Tor Browser to scrape tabular data](#using-scrapy-and-tor-browser-to-scrape-tabular-data). But if it looks a bit intimidating, please bear with us for a few more paragraphs.
|
|
|
|
|
|
Put the question to your favourite search engine, and you will find any number of [graphical web scraping tools](#appendix-1-a-brief-survey-of-web-scraping-tools) out there on the Internet. With varying degrees of success, these tools provide a user interface that allows you to:
|
|
|
|
|
|
- Pick and choose from the content available on a webpage by pointing and clicking;
|
|
|
- Download the content you want; and
|
|
|
- Save it in the format of your choosing.
|
|
|
|
|
|
All of which sounds great. But if you start digging into the details and test driving the software, you will find a number of caveats. Some of these tools are commercial, closed source or [gone in a puff of smoke](http://www.zdnet.com/article/palantir-buys-kimono-labs-cloud-hosted-service-to-close/). Some of them have limited functionality, or work for a limited time, until you pay for them. Some of them are cloud-based services run by companies that want to help you scrape data so they can have a copy for themselves. Some of them were written using an outdated and insecure *browser extension* framework. Some of them only work on very simple tables. Some of them don't know how to click the "next" button. Some of them ping [Google analytics](https://github.com/cantino/selectorgadget/search?utf8=%E2%9C%93&q=google-analytics&type=). That sort of thing.
|
|
|
|
|
|
And none of this ought to surprise us. It takes work to develop and maintain software like this, and the data industry is a Big Deal these days. Of course, it might not matter for all of us all of the time. Plenty of good investigative work has been done by chaining together half a dozen "15 Day Trial" licenses. But our goal for these guides is to provide solutions that don't require you to make those sorts of trade-offs. And that don't leave you hanging when you find yourself needing to:
|
|
|
|
|
|
- Scrape large quantities of data, incrementally, over multiple sessions;
|
|
|
- Parse complex tables;
|
|
|
- Download binary files like images and PDFs;
|
|
|
- Hide your interest in the data you are scraping; or
|
|
|
- Stay within the legal bounds of your software licenses and user agreements.
|
|
|
|
|
|
### Why might you want to scrape with Tails?
|
|
|
|
|
|
There are a number of reasons why you might want to use Tails when scraping web data. Even if you intend to release those data — or publish something that would reveal your access to them — you might still want to hide the fact that you or your organisation are digging for information on a particular subject. At least for a while. By requesting relevant pages through Tor, you retain control over the moment at which your involvement becomes public. Until then, you can prevent a wide variety of actors from knowing what you are up to. This includes other people at your internet cafe, your internet service provider (ISP), surveillance agencies, whoever operates the website you are scraping and *their* ISP, among others.
|
|
|
|
|
|
Tails also helps protect your research, analysis and draft outputs from theft and confiscation. When you enable *Persistence* on Tails, you are creating a single folder within which you can save data, and that folder's contents are guaranteed to be encrypted. There are other ways to store encrypted data, of course, but few of them make it this difficult to mess up. Tails disks are also small, easy to backup and even easier to throw away. Sticking one in your pocket is sometimes a nice alternative to traveling with a laptop, especially on trips where border crossings, checkpoints and raids might be a cause for concern.
|
|
|
|
|
|
More generally, Tails is an extremely useful tool if you are looking to keep your sensitive investigative work separate from your personal activities online. Even if you only have one laptop, you can effectively *compartmentalise* your high risk investigations by confining your acquisition and analysis of related data to your Tails system. And even if you are scraping banal data from a public website, it is worth considering whether you should have to make that decision every time you start poking around for a new project. Unless the data you are seeking cannot be accessed through Tor (which *does* happen), there are very good reasons to err on the side of compartmentalisation.
|
|
|
|
|
|
|
|
|
# Using Scrapy and Tor Browser to scrape tabular data
|
|
|
|
|
|
Scraping web data reliably and flexibly often requires two steps. Unless you are willing and able to use an all-in-one graphical scraper, you will typically need to:
|
|
|
|
|
|
1. Identify the selectors that match the content you want (and *only* the content you want), then
|
|
|
2. Use those selectors to configure a tool that is optimised for extracting content from webpages.
|
|
|
|
|
|
In the examples presented below, we will rely on Tor Browser to help us with the first stage and *Scrapy* to handle the second.
|
|
|
|
|
|
## Identifying selectors
|
|
|
|
|
|
Before we go digging for selectors, we will start with a brief introduction to HyperText Markup Language (HTML) and Cascading Style Sheets (CSS). Feel free to [skip it](#understanding-which-selectors-you-need). And, if you don't skip it, rest assured that, when you start using these tools for real, your web browser will do most of the heavy lifting.
|
|
|
|
|
|
### A remarkably short HTML tutorial
|
|
|
|
|
|
By the time they reach your web browser, most websites arrive as collections of HTML pages. The basic underlying structure of an HTML document looks something like this:
|
|
|
|
|
|
```
|
|
|
<html>
|
|
|
<body>
|
|
|
<p>Page content...</p>
|
|
|
</body>
|
|
|
</html>
|
|
|
```
|
|
|
|
|
|
HTML `tables` are often used to format the content of these pages, especially when presenting data. Here's an example:
|
|
|
|
|
|
|
|
|
| Title of Column one | Title of Column two |
|
|
|
| ----------------------------- | ----------------------------- |
|
|
|
| Page one, row one, column one | Page one, row one, column two |
|
|
|
| Page one, row two, column one | Page one, row two, column two |
|
|
|
|
|
|
[Previous page](/guides/scraping) - [Next page](/guides/scraping-2)
|
|
|
|
|
|
|
|
|
If we add the table and the navigation links above to our simplified HTML page, we end up with the following collection of `elements`:
|
|
|
|
|
|
```
|
|
|
<html>
|
|
|
<body>
|
|
|
<table>
|
|
|
<thead>
|
|
|
<tr>
|
|
|
<th><p>Title of Column one</p></th>
|
|
|
<th><p>Title of Column two</p></th>
|
|
|
</tr>
|
|
|
</thead>
|
|
|
<tbody>
|
|
|
<tr class="odd">
|
|
|
<td><p>Page one, row one, column one</p></td>
|
|
|
<td><p>Page one, row one, column two</p></td>
|
|
|
</tr>
|
|
|
<tr class="even">
|
|
|
<td><p>Page one, row two, column one</p></td>
|
|
|
<td><p>Page one, row two, column two</p></td>
|
|
|
</tr>
|
|
|
</tbody>
|
|
|
</table>
|
|
|
<p><a href="/guides/scraping">Previous page</a> - <a href="/guides/scraping-2">Next page</a></p>
|
|
|
</body>
|
|
|
</html>
|
|
|
```
|
|
|
|
|
|
For our purposes here, the *structure* is more important than the meaning of the elements themselves, but a few of those "tags" are a little obscure, so:
|
|
|
|
|
|
- `<p>` and `</p>` begin and end a paragraph
|
|
|
- `<a>` begins a clickable link (and `</a>` ends it)
|
|
|
- The text ("Next page") *between* the `<a>` and `</a>` tags is the thing you click on
|
|
|
- The `href="/guides/scraping-2"` *inside* the `<a>` tag tells your browser where to go when you click it
|
|
|
- `<tr>`and `</tr>` begin and end a row in a table
|
|
|
- `<td>` and `</td>` begin and end a cell within a table row
|
|
|
- `<th>` and `</th>` are just like `<td>` and `</td>`, but they're meant for table *headings*
|
|
|
- `<table>` and `</table>` you already figured out
|
|
|
|
|
|
We will discuss how to view the HTML behind a webpage later on, but if you want to have a look now, simply right-click the table above and select "Inspect Element." The actual HTML is slightly more complex than what is shown above, but the similarities should be clear. (You can close the "Inspector" pane by clicking the tiny **X** in its upper, right-hand corner.)
|
|
|
|
|
|
### A relatively short CSS tutorial
|
|
|
|
|
|
In order to make some (but not all) HTML elements of a particular type behave in a particular way, they can be assigned a `class` or an `id` or some other "attribute." Here is a slightly different version of the table above:
|
|
|
|
|
|
|
|
|
| Title of Column one | Title of Column two |
|
|
|
| ----------------------------- | ----------------------------- |
|
|
|
| Page one, row one, column one | Page one, row one, column two |
|
|
|
| Page one, row two, column one | Page one, row two, column two |
|
|
|
|
|
|
[Previous page](/guides/scraping){#prev_page} - [Next page](/guides/scraping-2){#next_page}
|
|
|
|
|
|
|
|
|
And here's what that might look like:
|
|
|
|
|
|
```
|
|
|
<html>
|
|
|
<body>
|
|
|
<table class="classy">
|
|
|
<thead>
|
|
|
<tr>
|
|
|
<th><p>Title of Column one</p></th>
|
|
|
<th><p>Title of Column two</p></th>
|
|
|
</tr>
|
|
|
</thead>
|
|
|
<tbody>
|
|
|
<tr class="odd">
|
|
|
<td class="special"><p>Page one, row one, column one</p></td>
|
|
|
<td><p>Page one, row one, column two</p></td>
|
|
|
</tr>
|
|
|
<tr class="even">
|
|
|
<td class="special"><p>Page one, row two, column one</p></td>
|
|
|
<td><p>Page one, row two, column two</p></td>
|
|
|
</tr>
|
|
|
</tbody>
|
|
|
</table>
|
|
|
<p><a id="prev_page" href="/guides/scraping">Previous page</a> - <a id="next_page" href="/guides/scraping-2">Next page</a></p>
|
|
|
</body>
|
|
|
</html>
|
|
|
|
|
|
```
|
|
|
|
|
|
Notice the `class="special"` inside some (but not all) of the `<td>` tags. At the moment, these cells are not particularly special. They're just grey. And those `prev_page` and `next_page` elements look just like any other link. But that's not the point. The point is that we can use these "attributes" to identify certain types of data in a consistent way. When writing CSS rules, web designers rely on this technique to apply graphical styling to webpages. We will use it to extract data.
|
|
|
|
|
|
To do this, we need to determine the CSS Selectors for the content we care about.
|
|
|
|
|
|
Let's start with that `id="next_page"`. There are several different selectors we could use to identify the "Next page" link in the example above. The longest and most specific is:
|
|
|
|
|
|
```
|
|
|
html > body > p > a#next_page
|
|
|
```
|
|
|
|
|
|
But, since there's only one `html` document and it only has one `body`, we could shorten this whole thing to `p > a#next_page`. Then again, since there is only one `next_page`, we could just use `a#next_page`. Or even `#next_page`. All of these are perfectly valid selectors.
|
|
|
|
|
|
At this point, you may be wondering what all those symbols mean. As you probably noticed, we put a `#` at the beginning of an `id` value when we refer to it. And the `>` symbol represents a "child" relationship. It can be used to help identify the right-most element as precisely as possible. Looking at the HTML snippet above, you can see that the `<a>` we want is the child of a `<p>` which is the child of a `<body>`, which is the child of an `<html>`.
|
|
|
|
|
|
Below are a few examples that might help (but that you can safely ignore if they don't):
|
|
|
|
|
|
- `p` would select every paragraph in the example above
|
|
|
- `th > p` would *only* select the column headings
|
|
|
- `td > p` would *only* select the contents of the four main table cells
|
|
|
- `body > p` would select both "Previous page" and "Next page" links
|
|
|
|
|
|
If we separate two elements with a space, rather than a `>`, it signifies a "descendant" relationship. This allows us to skip some of the steps in the middle. So `body a#next_page` matches, even though it leaves out the `p` in between, whereas `body > a#next_page` will fail to "select" anything, because the `<a>` is not a *direct* child of `<body>`. Descendant relationships cast a wider net. For example, `body > p` will only select the paragraph containing the links at the bottom, whereas `body p` will select every paragraph on the page.
|
|
|
|
|
|
So what about our `class="special"` attribute? You can use a class the same way you'd use an ID, but with a fullstop (`.`) instead of a `#`. So, the following would all select the same data. See if you can figure out what it is:
|
|
|
|
|
|
- `td.special > p`
|
|
|
- `.special > p`
|
|
|
- `.special p`
|
|
|
- `table td.special > p`
|
|
|
|
|
|
*Answer: The contents of all of the main table cells in column one*
|
|
|
|
|
|
### Understanding which selectors you need
|
|
|
|
|
|
In order to scrape data from an HTML table on a webpage, you will need one selector that identifies all of the *rows* in that table and one selector for each *column*. So, to scrape [the second (green) table](#a-relatively-short-css-tutorial) above, you would need three selectors. There are many variations that would work, but here is one possibility:
|
|
|
|
|
|
- *Row selector*: `html body table.classy > tbody > tr`
|
|
|
- *Column one selector*: `td.special > p`
|
|
|
- *Column two selector*: `td:nth-child(2) > p`
|
|
|
|
|
|
We will discuss how to obtain these values shortly, but first:
|
|
|
|
|
|
1. Why don't the column selectors begin with `html body` or at least `table.classy tr`?
|
|
|
2. What does `nth-child(2)` mean?
|
|
|
3. What about tables that are spread over multiple pages?
|
|
|
|
|
|
*1. A column selector is applied to a single row, not to the entire page:*
|
|
|
|
|
|
In the following section, when we finally start pulling down data, each of your column selectors will be used once for each row identified by your row selector. Because of the way we have written our scraping file, *the column selectors should be relative to the row they are in*. So they should not begin with `tr` or anything "above" `tr`.
|
|
|
|
|
|
*2. The `nth-child(n)` trick:*
|
|
|
|
|
|
Depending on the configuration of the table you are scraping, you might not have to use this trick, but it can be quite helpful when dealing with tables that do not include `class="whatever"` attributes for every column. Just figure out which column you want, and stick that number between the parentheses. So, in this example, `td:nth-child(1)` is the same as `td.special`. And `td:nth-child(2)` means "the second column." We use this method here because we have no other way to match the second column without also matching the first.
|
|
|
|
|
|
*3. Scraping a multi-page table:*
|
|
|
|
|
|
If you want your scraper to "click the next page link" automatically, you have to tell it where that link is. Both of the example tables above have a next page link beneath them. The link under [the green table](#a-relatively-short-css-tutorial) is easy: `a#next_page` is all you need. The link under [the first table](#a-remarkably-short-html-tutorial) is a little trickier, because it does not have a `class`, and most pages will include at least one other `<a>...</a>` link somewhere in their HTML. In this simplified example, `body > p > a:nth-child(2)` would work just fine, because the next page link is the second link within the paragraph that sits inside `<body>` but outside the `<table>`.
|
|
|
|
|
|
Determining these selectors is the most difficult part of scraping web data. But don't worry, the Tor Browser has a few features that make it easier than it sounds.
|
|
|
|
|
|
### Using the Tor Browser's Inspector to identify the selectors you need
|
|
|
|
|
|
If you are using a Linux system that does not come with the Tor Browser pre-installed, you can get it from the [Tor Project website](https://www.torproject.org/download/download-easy.html.en) or follow along with the [Tor Browser Tool Guide](https://securityinabox.org/en/guide/torbrowser/linux) on Security-in-a-Box.
|
|
|
|
|
|
Follow the steps below to practice identifying selectors on a real page.
|
|
|
|
|
|
<div style="background-color: #dddddd; padding: 10px">
|
|
|
|
|
|
#### Step 1: The row selector
|
|
|
|
|
|
Right-click one of the data rows in the table you want to scrape and select *Inspect element*. We will start by looking at [the green table](#a-relatively-short-css-tutorial) above. This will open the Tor Browser's *Inspector*, which will display several nearby lines of HTML and highlight the row on which you clicked:
|
|
|
|
|
|
<div class="fotorama" data-nav="thumbs"> ![Opening the Inspector](media/cs-web-scraping-en-v01-004-right-click-menu.png) ![Hovering over a table cell](media/cs-web-scraping-en-v01-005-inspector-cell.png) ![Hovering over a table row](media/cs-web-scraping-en-v01-006-inspector-row.png) ![Selecting a table row](media/cs-web-scraping-en-v01-007-inspector-row-selected.png) </div>
|
|
|
|
|
|
Now hover your mouse pointer over the neighboring rows in the *Inspector* while keeping an eye on the main browser window. Various elements on the page will be highlighted as you move your pointer around. Find a line of HTML in the *Inspector* that highlights an entire table row (but that does not highlight the entire table.) Click it once.
|
|
|
|
|
|
The content indicated by the red box in the third screenshot thumbnail above (`table.classy > tbody > tr.odd`) is *almost* your row selector. (You might have to click the ![Inspector left arrow](media/cs-web-scraping-en-v01-008-inspector-left.png) or ![Inspector right arrow](media/cs-web-scraping-en-v01-009-inspector-right.png) arrows to see the whole thing.)
|
|
|
|
|
|
But remember, you need a selector that matches *all* of the table rows, not just the one you happened to click. If you get the same value for two consecutive rows, simply copy it down and you're probably done. Otherwise, you might have to do a bit of manual editing. Specifically, you might have to remove a few `#id` or `.class` values that narrow your selection down to *a particular row* or *a subset of rows* rather than just *a row*.
|
|
|
|
|
|
Now, you might think you could just right-click something to copy this value. And eventually you will be able to do just that, as Firefox *version 53* will implement [a "Copy CSS Selector" option](https://bugzilla.mozilla.org/show_bug.cgi?id=1323700). For now, unfortunately, you will have to write it down manually.
|
|
|
|
|
|
In the example above, the full value of that row with the red box in the *Inspector* is:
|
|
|
|
|
|
```
|
|
|
html > body > div.container.main > div.row > div.span8.blog > table.classy > tbody > tr.odd
|
|
|
```
|
|
|
|
|
|
But there is only one `table.classy` on the page, so we can ignore everything before that. Which leaves us with:
|
|
|
|
|
|
```
|
|
|
table.classy > tbody > tr.odd
|
|
|
```
|
|
|
|
|
|
And the second row ends with `tr.even` instead of `tr.odd`, so we need to make one more adjustment to obtain our final row selector:
|
|
|
|
|
|
```
|
|
|
table.classy > tbody > tr
|
|
|
```
|
|
|
|
|
|
If you haven't already tried it yourself:
|
|
|
|
|
|
1. Right-click a data row in [the green table](#a-relatively-short-css-tutorial) above,
|
|
|
2. Select *Inspect Element*,
|
|
|
3. Hover your mouse pointer over the neighboring lines of HTML in the *Inspector*,
|
|
|
4. Find one that highlights the entire table row and click it,
|
|
|
5. Note the selector in the top row of the *Inspector*,
|
|
|
6. Click the next table data row in the *Inspector*, and



7. If both selectors are identical, write that down. If they are not, try to come up with a more generalised selector and use the information below to test it.
|
|
|
|
|
|
</div>
|
|
|
|
|
|
#### Step 2: Testing your selector
|
|
|
|
|
|
You can test your selectors using another built-in feature of the Tor Browser. This time, we will be looking at the *Inspector's* right-hand pane, which is highlighted below. If your *Inspector* only has one window, you can click the tiny "Expand pane" (![Expand pane button](media/cs-web-scraping-en-v01-001-expand-pane.png)) button near its upper, right-hand corner.
|
|
|
|
|
|
<div class="fotorama" data-nav="thumbs"> ![Expanded pane](media/cs-web-scraping-en-v01-010-expanded-pane.png) ![Adding a new rule](media/cs-web-scraping-en-v01-011-add-new-rule.png) ![New rule added](media/cs-web-scraping-en-v01-012-new-rule-added.png) ![Modifying new rule](media/cs-web-scraping-en-v01-013-new-rule-modified.png) ![Showing elements](media/cs-web-scraping-en-v01-014-show-elements.png) </div>
|
|
|
|
|
|
Now you can:
|
|
|
|
|
|
1. Right-click anywhere in the new pane and select *Add new Rule*. This will add a block of code like the one shown in the second screenshot thumbnail above. (The highlighted bit will depend on what's currently selected in the *Inspector*.)
|
|
|
2. Delete the highlighted contents, paste in the selector you want to test and press `<Enter>`.
|
|
|
3. Click the nearly-invisible, "Show elements" crosshairs (![Show elements](media/cs-web-scraping-en-v01-002-crosshairs.png)) just to the right of your new "Rule." This will turn the crosshairs blue and highlight all of the page elements that match your selector.
|
|
|
4. If this highlighting includes everything you want (and nothing you don't), then your selector is good. Otherwise, you can modify the selector and try again.
|
|
|
|
|
|
When testing your row selector, in particular, you want all of the table rows to be highlighted.
|
|
|
|
|
|
You can click the ![Show elements](media/cs-web-scraping-en-v01-003-active-crosshairs.png) icon again (if you can find it) to cancel the highlighting. You can close the right-hand Inspector pane by clicking the "Collapse pane" (![Collapse pane button](media/cs-web-scraping-en-v01-001-split-pane.png)) button. These new "Rules" are temporary and will disappear as soon as you reload the page.
|
|
|
|
|
|
<div style="background-color: #dddddd; padding: 10px">
|
|
|
|
|
|
#### Step 3: The "next page" selector
|
|
|
|
|
|
If the table data you want to scrape is spread over multiple pages, you will have to configure a next page selector as well. Fortunately, determining this selector is usually much easier. Because you only need to select a single element, you can often use the "Copy Unique Selector" option in the Tor Browser's *Inspector* as shown below.
|
|
|
|
|
|
<div class="fotorama" data-nav="thumbs"> ![Inspect next link](media/cs-web-scraping-en-v01-015-inspect-next-link.png) ![Hovering over the whole paragraph](media/cs-web-scraping-en-v01-016-next-link-paragraph.png) ![Hovering over the anchor](media/cs-web-scraping-en-v01-017-next-link-anchor.png) ![Copy Unique Selector](media/cs-web-scraping-en-v01-018-copy-unique-selector.png) </div>
|
|
|
|
|
|
To do this, simply:
|
|
|
|
|
|
1. Right-click the *next page* link on the first page containing the table you want to scrape,
|
|
|
2. Select *Inspect Element*,
|
|
|
3. Hover your mouse pointer over nearby HTML lines in the *Inspector* until you find the correct `<a>...</a>` link, and
|
|
|
4. Right-click that element in the *Inspector* and select **Copy Unique Selector**.
|
|
|
|
|
|
You may be able to use this selector as it is. If you want to make sure, just click the "next page" link manually and follow the same steps on the second page. If the selectors are the same, they should work for all pages that contain table data.
|
|
|
|
|
|
If you have to modify the selector — or if you choose to shorten or simplify it — you can test your alternative using [the method described in Step 2, above](#step-2-testing-your-selector). Don't worry if your final selector highlights multiple *next page* links, as long as it doesn't highlight anything else. The Scrapy template we recommend below only pays attention to the first "match."
|
|
|
|
|
|
Try following these steps for the "next page" link just beneath [the green table](#a-relatively-short-css-tutorial) above. The Tor Browser's **Copy Unique Selector** option should produce the following:
|
|
|
|
|
|
```
|
|
|
#next_page
|
|
|
```
|
|
|
|
|
|
(This is equivalent, in this case, to `a#next_page`.) If you [test this selector](#step-2-testing-your-selector), you should see that clicking the *Show elements* crosshairs highlights only this one page element.
|
|
|
|
|
|
</div>
|
|
|
|
|
|
#### Step 4: The column selectors
|
|
|
|
|
|
You will often need to identify several column selectors. The process for doing so is nearly identical to how we found our row selector. The main differences are:
|
|
|
|
|
|
- You will need a selector for each column of data you want to scrape,
|
|
|
- Your selectors should be relative to the row, so they will not begin with segments like `html`, `table`, `tr`, etc.
|
|
|
|
|
|
For the first column of our [green table](#a-relatively-short-css-tutorial), the Tor Browser's *Inspector* gives us:
|
|
|
|
|
|
```
|
|
|
html > body > div.container.main > div.row > div.span8.blog > table.classy > tbody > tr > td.special > p
|
|
|
```
|
|
|
|
|
|
If we remove everything up to and including the row (`tr`) segment, we end up with:
|
|
|
|
|
|
```
|
|
|
td.special > p
|
|
|
```
|
|
|
|
|
|
If we test this selector by expanding the *Inspector*, adding a new "rule" for it and clicking the *Show elements* crosshairs, as discussed in [Step 2](#step-2-testing-your-selector), we should see highlighting on both cells in column one.
|
|
|
|
|
|
As mentioned above, the second column requires a bit of cleverness because it has no `class`. If you test the following selector, though, you will see how we can use the `nth-child(n)` trick to get the job done:
|
|
|
|
|
|
```
|
|
|
td:nth-child(2) > p
|
|
|
```
|
|
|
|
|
|
Something to keep in mind when testing column selectors: as mentioned above, a column selector is relative to its "parent" row. As a result, when you are testing your column selectors, you might occasionally see highlighted content elsewhere on the page. This is fine, as long as nothing else *in the table* is highlighted. (If you're concerned about this, you can stick your *row selector* on the front of your *column selector* and test it that way. In this example, we would use the following:
|
|
|
|
|
|
```
|
|
|
table.classy > tbody > tr td:nth-child(2) > p
|
|
|
```
|
|
|
|
|
|
Just be sure not to use this combined, row-plus-column selector later on, when we're actually trying to scrape data.)
|
|
|
|
|
|
<div style="background-color: #dddddd; padding: 10px">
|
|
|
|
|
|
#### Step 5: Selector suffixes
|
|
|
|
|
|
There is one final step before we can start scraping data. The selectors discussed above identify HTML elements. This is fine for row selectors, but we typically want our column selectors to extract the *contents* of an HTML element. Similarly, we want our next page selectors to match the value of the actual *link target* (in other words, just `/guides/scraping-2`) rather than the full link element:
|
|
|
|
|
|
```
|
|
|
<a href="/guides/scraping-2">Next page</a>
|
|
|
```
|
|
|
|
|
|
To achieve this, we add a short "suffix" to column and next page selectors. The most common suffixes are:
|
|
|
|
|
|
- `::text`, which extracts the content between the selected HTML tags. We use it here to get the actual table data.
|
|
|
- `::attr(href)`, which matches the value of the `href` attribute *inside* an HTML tag. We use it here to get the "next page" URL so our scraper can load the second page, for example, when it's done with the first.
|
|
|
- `::attr(src)`, which matches the value of a `src` attribute. We do not use it here, but it is helpful when scraping tables that include images.
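To make the distinction concrete, here is a small standard-library sketch (not Scrapy itself) that pulls out both the *link target* and the *text content* of the "next page" link shown above, which is roughly what the `::attr(href)` and `::text` suffixes select:

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects the href attribute and the text inside an <a> tag."""
    def __init__(self):
        super().__init__()
        self.href = None
        self.text = ''
        self._in_a = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_a = True
            self.href = dict(attrs).get('href')

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_a = False

    def handle_data(self, data):
        if self._in_a:
            self.text += data

parser = LinkParser()
parser.feed('<a href="/guides/scraping-2">Next page</a>')
print(parser.href)  # what ::attr(href) would select
print(parser.text)  # what ::text would select
```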
|
|
|
|
|
|
</div>
|
|
|
|
|
|
So, in the end, our final selectors are:
|
|
|
|
|
|
- **Row selector**: `table.classy > tbody > tr`
|
|
|
- **Next page selector**: `#next_page::attr(href)`
|
|
|
- **Column selectors**
|
|
|
- **Column one**: `td.special > p::text`
|
|
|
- **Column two**: `td:nth-child(2) > p::text`
|
|
|
|
|
|
## Configuring Scrapy
|
|
|
|
|
|
Now that we have a basic understanding of HTML, CSS and how to use the Tor Browser's *Inspector* to distill them into a handful of selectors, it is time to enter those selectors into Scrapy. Scrapy is a free and open-source Web scraping platform written in the Python programming language. In order to use it, we will:
|
|
|
|
|
|
1. Install Scrapy;
|
|
|
2. Create a small python file called a "spider" and configure it by plugging in:
|
|
|
- The URL we want to scrape,
|
|
|
- Our row selector, and
|
|
|
- Our column selectors;
|
|
|
3. Tell Scrapy to run our spider on a single page;
|
|
|
4. Check the results and make any necessary changes; and
|
|
|
5. Plug in our next page selector and run the spider again.
|
|
|
|
|
|
<div style="background-color: #dddddd; padding: 10px">
|
|
|
|
|
|
### Step 1: Installing Scrapy
|
|
|
|
|
|
You can install Scrapy on Tails by launching *Terminal*, entering the command below and providing your passphrase when prompted:
|
|
|
|
|
|
```
|
|
|
sudo apt-get install python-scrapy
|
|
|
```
|
|
|
|
|
|
To install software on Tails, you need to [set an administration passphrase](https://tails.boum.org/doc/first_steps/startup_options/administration_password/index.en.html) when you boot your system. If you are already running Tails, and you did not do this, you will have to restart your computer. You might also want to [configure the "encrypted persistence" feature](https://tails.boum.org/doc/first_steps/persistence/index.en.html), which allows you to save data within the `/home/amnesia/Persistent` folder. Otherwise, anything you save will be gone next time you boot Tails.
|
|
|
|
|
|
You do not *have to* configure Persistence to use Scrapy, but it makes things easier. Without Persistence, you will have to save your "spider" file on an external storage device and re-install Scrapy each time you want to use it.
|
|
|
|
|
|
Even with Persistence enabled, you will have to reinstall Scrapy each time you restart unless you add a line that says `python-scrapy` to the following file:
|
|
|
|
|
|
```
|
|
|
/live/persistence/TailsData_unlocked/live-additional-software.conf
|
|
|
```
|
|
|
|
|
|
To do this, you can launch *Terminal*, enter the following command and provide your passphrase when prompted:
|
|
|
|
|
|
```
|
|
|
sudo gedit /live/persistence/TailsData_unlocked/live-additional-software.conf
|
|
|
```
|
|
|
|
|
|
This will open the *gedit* text editor, give it permission to modify system files and load the contents (if any) of the necessary configuration file. Then add the line above (`python-scrapy`) to the file, click the **[Save]** button and quit the editor by clicking the **X** in the upper, right-hand corner. Unless you have edited this file before, while running Tails with Persistence, the file will be blank when you first open it.
|
|
|
|
|
|
On Debian GNU/Linux systems other than Tails, you will need to install *Tor*, *torsocks* and *Scrapy*, which you can do by launching *Terminal*, entering the following command and providing your passphrase when prompted.
|
|
|
|
|
|
```
|
|
|
sudo apt-get install tor torsocks python-scrapy
|
|
|
```
|
|
|
|
|
|
</div>
|
|
|
|
|
|
### Step 2: Creating and configuring your spider file
|
|
|
|
|
|
There are many different ways to use Scrapy. The simplest is to create a single spider file that contains your selectors and some standard python code. You can [copy the content of a generic spider file from here](/guides/scraping-spider).
|
|
|
|
|
|
You will need to paste this code into a new file in the `/home/amnesia/Persistent` folder so you can edit it and add your selectors. To do so, launch *Terminal* and enter the command below:
|
|
|
|
|
|
```
|
|
|
gedit /home/amnesia/Persistent/spider-template.py
|
|
|
```
|
|
|
|
|
|
Paste in the contents from [here](/guides/scraping-spider), and click the **[Save]** button.
|
|
|
|
|
|
Now we just have to name our spider, give it a page on which to start scraping and plug in our selectors. All of this can be done by editing lines 7 through 14 of the template:
|
|
|
|
|
|
```
|
|
|
### User variables
|
|
|
#
|
|
|
start_urls = ['https://some.website.com/some/page']
|
|
|
name = 'spider_template'
|
|
|
row_selector = '.your .row > .selector'
|
|
|
next_selector = 'a.next_page.selector::attr(href)'
|
|
|
column_selectors = {
|
|
|
'some_column': '.contents.of > .some.column::text',
|
|
|
'some_other_column': '.contents.of > a.different.column::attr(href)',
|
|
|
'a_third_column': '.contents.of > .a3rd.column::text'
|
|
|
}
|
|
|
```
|
|
|
|
|
|
The `start_urls`, `name`, `row_selector` and `next_selector` *variables* are pretty straightforward:
|
|
|
|
|
|
- `start_urls`: Enter the URL of the first page that contains table data you want to scrape. Use `https` if possible and **remember to keep the brackets (`[` & `]`) around the URL.**
|
|
|
- `name`: This one is pretty unimportant, actually. Name your scraper whatever you like.
|
|
|
- `row_selector`: Enter the row selector you came up with earlier. If you want to test your spider on the [green table](#a-relatively-short-css-tutorial) above, you can use `table.classy > tbody > tr`
|
|
|
- `next_selector`: **The first time you try scraping a new table, you should leave this blank.** To do so, delete everything between the two `'` characters. The resulting line should read ` next_selector = ''`
|
|
|
|
|
|
The `column_selectors` item is a little different. It is a collection that contains one entry for each column selector. You can set both a *key* — the text before the colon (`:`) — and a *value* for each of those entries. The example *keys* in the template above are `some_column`, `some_other_column` and `a_third_column`. The text you enter for these *keys* will provide the column headings of the `.csv` file you are going to create. The example *values* are as follows:
|
|
|
|
|
|
```
|
|
|
.contents.of > .some.column::text
|
|
|
```
|
|
|
```
|
|
|
.contents.of > a.different.column::attr(href)
|
|
|
```
|
|
|
```
|
|
|
.contents.of > .a3rd.column::text
|
|
|
```
|
|
|
|
|
|
These selectors are completely made up and pretty much guaranteed to fail on every webpage ever made. You should, of course, replace them with the selectors you identified in the previous section.
|
|
|
|
|
|
If you want to test your spider on the [green table](#a-relatively-short-css-tutorial) above, these lines should look something like the following:
|
|
|
|
|
|
```
|
|
|
### User variables
|
|
|
#
|
|
|
start_urls = ['https://eti.stg.tacticaltech.org/guides/scraping']
|
|
|
name = 'test-spider'
|
|
|
row_selector = 'table.classy > tbody > tr'
|
|
|
next_selector = ''
|
|
|
column_selectors = {
|
|
|
'first': 'td.special > p::text',
|
|
|
'second': 'td:nth-child(2) > p::text'
|
|
|
}
|
|
|
## Only needed for HTTP-Basic authentication
|
|
|
#
|
|
|
# http_user: '???'
|
|
|
# http_pass: '?????????'
|
|
|
```
|
|
|
|
|
|
Everything else in this file should remain unchanged. And remember, you are editing python code, so try not to change the formatting. Punctuation marks like quotes (`'` & `"`), commas (`,`), colons (`:`), braces (`{` & `}`) and brackets (`[` & `]`) all have special meaning in python, as does the indentation: four spaces before most of these lines and eight spaces before the lines that contain column selectors. Also, lines that begin with a number sign (`#`) are "comments," which means they will not affect your spider at all. Finally, **when adding or removing column selectors, pay attention to the commas (`,`) at the end of all but the last line**, as shown above.
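As a quick illustration of that comma rule, here is a minimal stand-in for the `column_selectors` collection; it is valid python precisely because every entry except the last ends with a comma:

```python
# A minimal stand-in for the spider's column_selectors collection.
column_selectors = {
    'first': 'td.special > p::text',       # comma separates this entry from the next
    'second': 'td:nth-child(2) > p::text'  # no comma after the last entry
}
print(list(column_selectors.keys()))
```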
|
|
|
|
|
|
After you have modified the `start_urls`, `name`, `row_selector` and `next_selector` entries, and added both *keys* and *values* for each of your column selectors, you should save your new spider by clicking the **[Save]** button in *gedit*.
|
|
|
|
|
|
After configuring your spider, you should test it.
|
|
|
|
|
|
<div style="background-color: #dddddd; padding: 10px">
|
|
|
|
|
|
### Step 3: Testing your spider
|
|
|
|
|
|
If you are using Tails, and if you did not change the location (`/home/amnesia/Persistent`) or the file name (`spider-template.py`) of your spider, you can test drive it by launching *Terminal* and entering the commands below:
|
|
|
|
|
|
```
|
|
|
cd /home/amnesia/Persistent
|
|
|
torsocks -i scrapy runspider spider-template.py -o extracted_data.csv
|
|
|
```
|
|
|
|
|
|
These commands include the following elements:
|
|
|
|
|
|
- `cd /home/amnesia/Persistent` moves to the *Persistent* folder in the Tails home directory, which is where we happened to put our spider
|
|
|
- `torsocks -i` tells Scrapy to use the Tor anonymity service while scraping
|
|
|
- `scrapy runspider spider-template.py` tells Scrapy to run your spider
|
|
|
- `-o extracted_data.csv` provides a name for the output file that will contain the data you scrape. You can name this file whatever you want, but Scrapy will use the three-letter file extension at the end (`.csv` in this case) to determine how it should format those data. You can also output [JSON](http://json.org/) content by using the `.json` file extension.
|
|
|
|
|
|
While it does its job, Scrapy will display all kinds of information in the *Terminal* where it is running. This feedback will likely include at least one "ERROR" line related to `torsocks`:
|
|
|
|
|
|
```
|
|
|
Unable to resolve. Status reply: 4 (in socks5_recv_resolve_reply() at socks5.c:683)
|
|
|
```
|
|
|
|
|
|
You can safely ignore this warning (along with most of the other feedback). If everything works, you will find a file called `extracted_data.csv` in the same directory as your spider. That file should contain all of the data scraped from the HTML table. Our example spider will extract the following from [the green table](#a-relatively-short-css-tutorial) above:
|
|
|
|
|
|
```
|
|
|
second,first
|
|
|
"Page one, row one, column two","Page one, row one, column one"
|
|
|
"Page one, row two, column two","Page one, row two, column one"
|
|
|
```
|
|
|
|
|
|
As you can see, the resulting columns may appear out of order, but they will be correctly associated with the *key* values you set for your column selectors, which make up the first line of data. You can easily re-order the columns once you import your `.csv` file into a spreadsheet.
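If you would rather re-order the columns before importing, python's standard `csv` module can do it. The sketch below reads the first two lines of the output shown above and rewrites them with `first` ahead of `second`:

```python
import csv
import io

# The first two lines of the spider's output, as shown above.
raw = (
    'second,first\n'
    '"Page one, row one, column two","Page one, row one, column one"\n'
)

# DictReader keys each row by its column heading, so the order in
# which the columns were originally written no longer matters.
rows = list(csv.DictReader(io.StringIO(raw)))

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=['first', 'second'], quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```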
|
|
|
|
|
|
</div>
|
|
|
|
|
|
**Troubleshooting tips:**
|
|
|
|
|
|
- If your spider fails, look for lines that include `[scrapy] DEBUG:`. They might help you figure out what broke.
|
|
|
|
|
|
- If you want to quit Scrapy while it is still running, just hold down the `<Ctrl>` key and press `c` while in the *Terminal* application.
|
|
|
|
|
|
- If the named output file already exists (`extracted_data.csv`, in this case), Scrapy will append new data to the end of it, rather than replacing it. So, if things don't go according to plan and you want to try again, you should first remove the old file by entering `rm extracted_data.csv`.
|
|
|
|
|
|
- When Scrapy is done, it will display the following:
|
|
|
```
|
|
|
[scrapy] INFO: Spider closed (finished)
|
|
|
```
|
|
|
|
|
|
### Step 4: Opening your CSV output in LibreOffice Calc
|
|
|
|
|
|
Follow the steps below to confirm that your spider worked (or, if you've already got what you came for, to begin cleaning and analysing your data):
|
|
|
|
|
|
<div class="fotorama" data-nav="thumbs"> ![Importing into LibreOffice Calc](media/cs-web-scraping-en-v01-019-importing-in-libreoffice.png) </div>
|
|
|
|
|
|
1. Launch *LibreOffice Calc* by clicking the *Applications* menu in the upper, left-hand corner of your screen, hovering over the *Office* sub-menu and clicking *LibreOffice Calc*.
|
|
|
2. In LibreOffice Calc, click *File* and select *Open*
|
|
|
3. Navigate to your `.csv` file and click the **[Open]** button
|
|
|
4. Configure the following options if they are not already set by default:
|
|
|
- Character set: `Unicode (UTF-8)`
|
|
|
- From row: `1`
|
|
|
- Check the "Comma" box under *Separator options*
|
|
|
- Uncheck all other *Separator options*
|
|
|
- Select `"` under *Text delimiter*
|
|
|
5. Click the **[OK]** button.
|
|
|
|
|
|
<div style="background-color: #dddddd; padding: 10px">
|
|
|
|
|
|
### Step 5: Running your spider on multiple pages
|
|
|
|
|
|
If everything in your `.csv` file looks good, and you are ready to try scraping multiple pages, just configure the next page selector in your spider and run it again.
|
|
|
|
|
|
If you are using Tails with Persistence, you can open your spider for editing with the same command we used before:
|
|
|
|
|
|
```
|
|
|
gedit /home/amnesia/Persistent/spider-template.py
|
|
|
```
|
|
|
|
|
|
Those first few lines should now look something like:
|
|
|
|
|
|
```
|
|
|
### User variables
|
|
|
#
|
|
|
start_urls = ['https://eti.stg.tacticaltech.org/guides/scraping']
|
|
|
name = 'test-spider'
|
|
|
row_selector = 'table.classy > tbody > tr'
|
|
|
next_selector = '#next_page::attr(href)'
|
|
|
column_selectors = {
|
|
|
'first': 'td.special > p::text',
|
|
|
'second': 'td:nth-child(2) > p::text'
|
|
|
}

## Only needed for HTTP-Basic authentication
#
# http_user: '???'
# http_pass: '?????????'
|
|
|
```
|
|
|
|
|
|
The only difference is the addition of an actual selector (instead of `''`) for the `next_selector` variable.
|
|
|
|
|
|
Finally, click the **[Save]** button, remove your old output file and tell Scrapy to run the updated spider:
|
|
|
|
|
|
```
|
|
|
cd /home/amnesia/Persistent
|
|
|
rm extracted_data.csv
|
|
|
torsocks -i scrapy runspider spider-template.py -o extracted_data.csv
|
|
|
```
|
|
|
|
|
|
Your `.csv` output should now include data from the second page:
|
|
|
|
|
|
```
|
|
|
second,first
|
|
|
"Page one, row one, column two","Page one, row one, column one"
|
|
|
"Page one, row two, column two","Page one, row two, column one"
|
|
|
"Page two, row one, column two","Page two, row one, column one"
|
|
|
"Page two, row two, column two","Page two, row two, column one"
|
|
|
```
|
|
|
|
|
|
</div>
|
|
|
|
|
|
**Troubleshooting tips:**
|
|
|
|
|
|
- If you are using Tails but have not enabled Persistence, copy your output file somewhere before shutting down your Tails system.
|
|
|
|
|
|
- If you want to keep an old version of your spider while testing out a new one, just make a copy and start working with the new one instead. You can do this by entering the following command in your *Terminal*:
|
|
|
```
|
|
|
cp spider-template.py spider-new.py
|
|
|
```
|
|
|
Of course, you will then use the following, instead of the command shown above, to edit your new file:
|
|
|
```
|
|
|
gedit /home/amnesia/Persistent/spider-new.py
|
|
|
```
|
|
|
And to run your new spider, you will use the following:
|
|
|
```
|
|
|
torsocks -i scrapy runspider spider-new.py -o extracted_data.csv
|
|
|
```
|
|
|
|
|
|
## Real world examples
|
|
|
|
|
|
The sections below cover three multi-page HTML data tables that you might want to scrape for practice. These tables are (obviously) longer than the example above. They are also a bit more complex, so we will point out the key differences and how you can address them when configuring your spider.
|
|
|
|
|
|
For each website, we include sections on:
|
|
|
|
|
|
- What is different about this table;
|
|
|
- The URL, name and selector values for your spider; and
|
|
|
- Other custom settings
|
|
|
|
|
|
|
|
|
### Resources for refugees in a Berlin neighborhood
|
|
|
|
|
|
The city of Berlin publishes a list of resources for refugees in various neighbourhoods. The listing for Friedrichshain and Kreuzberg includes enough entries to be useful as a sample scraping target.
|
|
|
|
|
|
**[[Copy and paste the spider code into a new file](/guides/scraping-spider-kreuzberg)]**
|
|
|
|
|
|
#### What is different about this table
|
|
|
|
|
|
This table has three interesting characteristics:
|
|
|
|
|
|
1. It has a long, complex starting URL;
|
|
|
2. The data we want to scrape include an internal link to a page elsewhere on the site; and
|
|
|
3. One of the cells we want to scrape contains additional HTML markup.
|
|
|
|
|
|
*1. Complex starting URLs:*
|
|
|
|
|
|
The URL we will use to scrape this table corresponds to the *results* page of a search form. It is:
|
|
|
|
|
|
```
|
|
|
http://www.berlin.de/ba-friedrichshain-kreuzberg/aktuelles/fluechtlingshilfe/angebote-im-bezirk/traeger-und-aemter/?q=&sprache=--+Alles+--&q_geo=&ipp=5#searchresults
|
|
|
```
|
|
|
|
|
|
This does not affect the configuration of our spider or the command we will use to run it, but it is worth noting that you will often need to browse through a website — navigating to subpages, using search forms, filtering results, etc. — in order to find the right starting URL.
|
|
|
|
|
|
*2. Internal links:*
|
|
|
|
|
|
The table column *(Träger)* that corresponds to the organisation responsible for the resource in that row includes a *Mehr…* link to a dedicated page about that resource. We want the full URL, but the HTML attribute we will scrape only provides one that is relative to the website's domain. Our spider template will try to handle this automatically if you use `'link': ` as the *key* value for that column selector, as shown below. If it does not work properly when scraping some other website, you can just choose a different *key* for that column selector (`'internal_link': `, for example) and fix the URL yourself once you get it into a spreadsheet.
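The spreadsheet fix is straightforward if you know the site's base URL. Python's standard `urljoin` shows the logic; the relative path below is a hypothetical example, not a real scraped value:

```python
from urllib.parse import urljoin

# The page we are scraping.
page = ('http://www.berlin.de/ba-friedrichshain-kreuzberg/aktuelles/'
        'fluechtlingshilfe/angebote-im-bezirk/traeger-und-aemter/')

# Hypothetical relative link, like those behind the Mehr... links.
relative = '/some/organisation/detail.html'

# urljoin resolves the relative path against the page's domain.
full_url = urljoin(page, relative)
print(full_url)
```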
|
|
|
|
|
|
*3. Internal HTML markup:*
|
|
|
|
|
|
That same *(Träger)* column also includes HTML markup *inside* the text description of some organisations. As a result, the selector suffix we would normally use (`::text`) does not work properly. Instead, we have to extract the entire HTML block. (Notice that the `organisation` column selector below does not include a suffix.) If you are a python programmer, you can fix this quite easily inside your spider. Otherwise, you will have to clean it up in your spreadsheet application.
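If you are not a python programmer but still want an idea of what such a cleanup involves, the standard library's `html.parser` can strip tags while keeping the text. The input string below is a made-up example of a cell with internal markup:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text content of an HTML fragment, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return ''.join(self.parts)

stripper = TagStripper()
stripper.feed('<p>Some organisation <strong>with markup</strong> inside</p>')
print(stripper.text())
```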
|
|
|
|
|
|
|
|
|
#### URL, name and selectors
|
|
|
|
|
|
**Spoiler alert.** Below you will find all of the changes you would need to make in the `spider-template.py` file to scrape several pages' worth of refugee resources in Friedrichshain and Kreuzberg. Before you continue, though, you should visit [the webpage](http://www.berlin.de/ba-friedrichshain-kreuzberg/aktuelles/fluechtlingshilfe/angebote-im-bezirk/traeger-und-aemter/?q=&sprache=--+Alles+--&q_geo=&ipp=5#searchresults) in the Tor Browser and, if necessary, consult the section on [Using the Tor Browser's Inspector to identify the selectors you need](#using-the-tor-browsers-inspector-to-identify-the-selectors-you-need). Then try to determine selectors for the following:
|
|
|
|
|
|
- Row selector
|
|
|
- Next page selector
|
|
|
- Column selectors:
|
|
|
- Supporting organisation
|
|
|
- Description of the resource being offered
|
|
|
- The language(s) supported
|
|
|
- The location of the resource
|
|
|
- An internal link to a more descriptive page
|
|
|
|
|
|
*Corresponding User variables for your spider:*
|
|
|
|
|
|
```
|
|
|
start_urls = ['http://www.berlin.de/ba-friedrichshain-kreuzberg/aktuelles/fluechtlingshilfe/angebote-im-bezirk/traeger-und-aemter/?q=&sprache=--+Alles+--&q_geo=&ipp=5#searchresults']
|
|
|
name = 'refugee_resources'
|
|
|
row_selector = 'tbody tr'
|
|
|
next_selector = '.pager-item-next > a:nth-child(1)::attr(href)'
|
|
|
column_selectors = {
|
|
|
'organisation': 'td:nth-child(1) > div > p',
|
|
|
'offer': 'td:nth-child(2)::text',
|
|
|
'language': 'td:nth-child(3)::text',
|
|
|
'address': 'td:nth-child(4)::text',
|
|
|
'link': 'td:nth-child(1) > a:nth-child(2)::attr(href)'
|
|
|
}
|
|
|
```
|
|
|
|
|
|
#### Other custom settings
|
|
|
|
|
|
To scrape this table, you can use the default collection of `custom_settings` in the spider template, which is shown below. As mentioned above, the lines beginning with `# ` are comments and will not affect the behaviour of your spider. For each of the two remaining examples, you will uncomment at least one of those lines.
|
|
|
|
|
|
```
|
|
|
custom_settings = {
|
|
|
# 'DOWNLOAD_DELAY': '30',
|
|
|
# 'DEPTH_LIMIT': '100',
|
|
|
# 'ITEM_PIPELINES': {
|
|
|
# 'scrapy.pipelines.files.FilesPipeline': 1,
|
|
|
# 'scrapy.pipelines.images.ImagesPipeline': 1
|
|
|
# },
|
|
|
# 'IMAGES_STORE': 'media',
|
|
|
# 'IMAGES_THUMBS': { 'small': (50, 50) },
|
|
|
# 'FILES_STORE': 'files',
|
|
|
'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0',
|
|
|
'TELNETCONSOLE_ENABLED': False,
|
|
|
'DOWNLOAD_HANDLERS': {'s3': None}
|
|
|
}
|
|
|
```
|
|
|
|
|
|
|
|
|
### Hacker News
|
|
|
|
|
|
In this section, we will use [YCombinator's *Hacker News*](https://news.ycombinator.com/news) as a second "real world" example. Of course, in the real world, one would not need to scrape Hacker News because the website offers both [an RSS feed](https://news.ycombinator.com/rss) and [an application programming interface (API)](https://github.com/HackerNews/API). Abusing Hacker News to practice scraping is something of a tradition, though, and it *is* a nice example of a multi-page HTML table that is not too complex.
|
|
|
|
|
|
**[[Copy and paste the spider code into a new file](/guides/scraping-spider-hacker-news)]**
|
|
|
|
|
|
#### What is different about this table
|
|
|
|
|
|
If you restrict your scraping to the first line of each article listed on Hacker News, as we do here, this table is very similar to the tiny green example we've been using up until now. It's just bigger. We introduce one new challenge below, but it is quite straightforward.
|
|
|
|
|
|
*Scraping a row attribute:*
|
|
|
|
|
|
If you look at a Hacker News article using the Tor Browser's *Inspector*, you will see that each table row (`tr`) containing a story has a unique `id` attribute. IDs like this are often sequential, which can be a useful way to keep track of the order in which content is presented. Even though they are not sequential in this particular table, it might still be useful to scrape them. But most of our column selectors are inside table data (`td`) elements, and we normally specify their selectors *relative* to their parent row. So how do we capture an attribute of the row itself? Just specify a suffix that is *only* [a suffix](#step-5-selector-suffixes). In this case, as shown below, that would be: `::attr(id)`.
|
|
|
|
|
|
#### URL, name and selectors
|
|
|
|
|
|
**Spoiler alert.** Below you will find all of the changes you would need to make in the `spider-template.py` file to scrape several pages' worth of articles on Hacker News. Before you continue, though, you should visit [the website](https://news.ycombinator.com/news) in the Tor Browser and, if necessary, consult the section on [Using the Tor Browser's Inspector to identify the selectors you need](#using-the-tor-browsers-inspector-to-identify-the-selectors-you-need). Then try to determine selectors for the following:
|
|
|
|
|
|
- Row selector
|
|
|
- Next page selector
|
|
|
- Column selectors:
|
|
|
- Article index
|
|
|
- Article title
|
|
|
- Web address for the article itself
|
|
|
|
|
|
*Corresponding User variables for your spider:*
|
|
|
|
|
|
```
|
|
|
start_urls = ['https://news.ycombinator.com/news']
|
|
|
name = 'hacker_news'
|
|
|
row_selector = 'table.itemlist > tr.athing'
|
|
|
next_selector = 'a.morelink::attr(href)'
|
|
|
column_selectors = {
|
|
|
"index": "::attr(id)",
|
|
|
"title": "td.title > a.storylink::text",
|
|
|
"external_link": "td.title > a.storylink::attr(href)"
|
|
|
}
|
|
|
```
|
|
|
|
|
|
#### Other custom settings
|
|
|
|
|
|
The Hacker News [`robots.txt` file](https://en.wikipedia.org/wiki/Robots_exclusion_standard) specifies [a `Crawl-delay` of 30 seconds](https://news.ycombinator.com/robots.txt), so we should make sure that our spider does not scrape too quickly. As a result, it may take up to ten minutes to scrape all of the table data.
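The time estimate is simple arithmetic. Assuming the listing runs to about twenty pages (an assumption, not a count we verified), the numbers work out as follows:

```python
# Hypothetical page count; Hacker News paginates its front-page listing.
pages = 20
crawl_delay_seconds = 30  # from the site's robots.txt Crawl-delay
total_minutes = pages * crawl_delay_seconds / 60
print(total_minutes)
```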
|
|
|
|
|
|
```
|
|
|
custom_settings = {
|
|
|
'DOWNLOAD_DELAY': '30',
|
|
|
# 'DEPTH_LIMIT': '100',
|
|
|
# 'ITEM_PIPELINES': {
|
|
|
# 'scrapy.pipelines.files.FilesPipeline': 1,
|
|
|
# 'scrapy.pipelines.images.ImagesPipeline': 1
|
|
|
# },
|
|
|
# 'IMAGES_STORE': 'media',
|
|
|
# 'IMAGES_THUMBS': { 'small': (50, 50) },
|
|
|
# 'FILES_STORE': 'files',
|
|
|
'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0',
|
|
|
'TELNETCONSOLE_ENABLED': False,
|
|
|
'DOWNLOAD_HANDLERS': {'s3': None}
|
|
|
}
|
|
|
```
|
|
|
|
|
|
### CCTV cameras in Bremen
|
|
|
|
|
|
The German city of Bremen provides a list of public video cameras, along with a thumbnail image of what each camera can see.
|
|
|
|
|
|
**[[Copy and paste the spider code into a new file](/guides/scraping-spider-bremen)]**
|
|
|
|
|
|
#### What is different about this table
|
|
|
|
|
|
The Bremen CCTV table has three main quirks:
|
|
|
|
|
|
1. The page structure makes it unusually difficult to identify a reliable next page selector,
|
|
|
2. To solve the next page selector challenge, we have to give our spider a maximum number of pages to scrape, and
|
|
|
3. We want to download the actual image file associated with each CCTV camera.
|
|
|
|
|
|
*1. Elusive next page selector:*
|
|
|
|
|
|
The CSS class used to style the "next arrow" is only applied to the icon, not to the link itself, so we cannot use it for our next page selector. To make matters worse, the row of page navigation links only displays "first page" and "previous page" links after you have left the first page, so our usual `:nth-child(n)` trick does not work. As a result, we have to get creative to come up with a selector that will work on every page. In the example below, we use a different trick (`nth-last-child(n)`) so we can count *from the end* rather than from the beginning.
|
|
|
|
|
|
*2. Specifying a "depth limit":*
|
|
|
|
|
|
When we get to the end, the "last page" and "next page" links disappear, which would normally make our scraper "click" on the link for a page that it had already scraped. It would then go *back* a couple of pages, scrape its way to the end again, then go back a couple of pages again. This would create an infinite loop, and our poor spider would end up scraping forever. By setting the `DEPTH_LIMIT` to `11` in the spider's `custom_settings`, we can tell it to stop after it reaches the last page.
|
|
|
|
|
|
*3. Downloading image files:*
|
|
|
|
|
|
This is the first time we are asking our spider to download image files. Scrapy makes this quite easy, but it does require:
|
|
|
|
|
|
- Using the special `image_urls` *key* for the corresponding column selector, and
|
|
|
- Uncommenting a few `custom_settings` by removing the `# ` character at the beginning of the corresponding lines.
|
|
|
|
|
|
#### URL, name and selectors
|
|
|
|
|
|
**Spoiler alert.** Below you will find all of the changes you would need to make in the `spider-template.py` file to scrape several pages' worth of camera locations and image thumbnails. Before you continue, though, you should visit [the website](http://www.standorte-videoueberwachung.bremen.de/sixcms/detail.php?gsid=bremen02.c.734.de) in the Tor Browser and, if necessary, consult the section on [Using the Tor Browser's Inspector to identify the selectors you need](#using-the-tor-browsers-inspector-to-identify-the-selectors-you-need). Then try to determine selectors for the following:
|
|
|
|
|
|
- Row selector
|
|
|
- Next page selector
|
|
|
- Column selectors:
|
|
|
- Name of the camera
|
|
|
- Status of the camera
|
|
|
- Location of the camera
|
|
|
- A URL for the camera's current thumbnail image
|
|
|
|
|
|
*Corresponding User variables for your spider:*
|
|
|
|
|
|
```
|
|
|
start_urls = ['http://www.standorte-videoueberwachung.bremen.de/sixcms/detail.php?gsid=bremen02.c.734.de']
|
|
|
name = 'bremen_cctv'
|
|
|
row_selector = 'div.cameras_list_item'
|
|
|
next_selector = 'ul.pagination:nth-child(4) > li:nth-last-child(2) > a::attr(href)'
|
|
|
column_selectors = {
|
|
|
'title': '.cameras_title::text',
|
|
|
'status': '.cameras_title .cameras_status_text::text',
|
|
|
'address': '.cameras_address::text',
|
|
|
'image_urls': '.cameras_thumbnail > img::attr(src)'
|
|
|
}
|
|
|
```
|
|
|
|
|
|
#### Other custom settings
|
|
|
|
|
|
To address the issues described in the *What is different about this table* section, we will use the custom settings below.
|
|
|
|
|
|
```
|
|
|
custom_settings = {
|
|
|
# 'DOWNLOAD_DELAY': '30',
|
|
|
'DEPTH_LIMIT': '11',
|
|
|
'ITEM_PIPELINES': {
|
|
|
# 'scrapy.pipelines.files.FilesPipeline': 1,
|
|
|
'scrapy.pipelines.images.ImagesPipeline': 1
|
|
|
},
|
|
|
'IMAGES_STORE': 'media',
|
|
|
'IMAGES_THUMBS': { 'small': (50, 50) },
|
|
|
# 'FILES_STORE': 'files',
|
|
|
'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0',
|
|
|
'TELNETCONSOLE_ENABLED': False,
|
|
|
'DOWNLOAD_HANDLERS': {'s3': None}
}
|
|
|
```
|
|
|
|
|
|
The `DEPTH_LIMIT` setting above makes sure that we stop scraping when we get to the last page. The uncommented item in the collection of `ITEM_PIPELINES` tells Scrapy to download and save all images identified by the `image_urls` column selector. `IMAGES_STORE` tells it to put those image files into a folder called `media`, which it will create automatically. `IMAGES_THUMBS` tells it to create a very small thumbnail image for each one.
|
|
|
|
|
|
(`scrapy.pipelines.files.FilesPipeline`, `FILES_STORE` and the special `file_urls` column selector *key* can be used to download other sorts of files from a webpage. This might include PDFs, Microsoft Word documents, audio files, etc. We leave these options disabled in this example.)
|
|
|
|
|
|
After you run your spider, have a look at your `.csv` output and explore that `media` folder to see exactly how this works.
|
|
|
|
|
|
# Happy scraping
|
|
|
|
|
|
Once you understand the configuration changes needed to make these three Scrapy *spiders* do their jobs, you should be able to securely and privately scrape structured data, images and files from most HTML tables on the web. As always, there is more work to do. You will still have to clean, analyse, present and explain the data you scrape, but we hope the information above will give you a place to start next time you stumble across an important trove of tabular data that does not have a "download as CSV" link right there next to it.
|
|
|
|
|
|
|
|
|
*Published on 31 May 2017. Follow us [@seeingsideways](https://twitter.com/seeingsideways?lang=en), get in [touch](https://exposingtheinvisible.org/talk-to-us) or read another of our guides [here](https://exposingtheinvisible.org/guides)*