Explore: The "Scrapy Spider Universe" of spider-based toolsets (including Playwright, CodeGen, Scrapy-Selenium, Scrapy-Splash, etc.)

Resource Collection (DL): explore: The Scrapy Webscraping Ecosystem

The "Scrapy Spider Universe" includes (for our project's purposes) approaches such as:

1. Start with Microsoft's Playwright for Python
  
  Playwright is a Python library to automate Chromium, Firefox and WebKit browsers with a single API. Playwright delivers automation that is evergreen, capable, reliable and fast.
  - It also ships with tooling such as:
    - Codegen: generate tests by recording your actions, then save them in any of the supported languages.
    - Playwright Inspector: inspect the page, generate selectors, step through test execution, see click points, and explore execution logs.
    - Trace Viewer: capture the information needed to investigate a test failure. A Playwright trace contains a screencast of the test execution, live DOM snapshots, an action explorer, the test source, and more.
2. Use scrapy-playwright: a Playwright integration for Scrapy, enabled through configuration as a pluggable download handler
  
  A Scrapy Download Handler which performs requests using Playwright for Python. It can be used to handle pages that require JavaScript (among other things), while adhering to the regular Scrapy workflow (i.e. without interfering with request scheduling, item processing, etc.).
3. blah...
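
Steps 1 and 2 above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not this project's actual setup: `fetch_title`, `SCRAPY_PLAYWRIGHT_SETTINGS`, and `playwright_request_kwargs` are hypothetical names, and the trace path is a placeholder. It assumes the `playwright` package and its browsers are installed; the handler path and reactor value follow the scrapy-playwright README.

```python
def fetch_title(url: str, trace_path: str = "trace.zip") -> str:
    """Step 1: open a page with Playwright's sync API and record a trace."""
    # Imported lazily so this module still loads where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        # Record screenshots and DOM snapshots for the Trace Viewer.
        context.tracing.start(screenshots=True, snapshots=True)
        page = context.new_page()
        page.goto(url)
        title = page.title()
        context.tracing.stop(path=trace_path)
        browser.close()
    return title


# Step 2: settings that route Scrapy downloads through scrapy-playwright.
SCRAPY_PLAYWRIGHT_SETTINGS = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}


def playwright_request_kwargs(url: str) -> dict:
    """Keyword arguments for scrapy.Request that opt one request into Playwright."""
    return {"url": url, "meta": {"playwright": True}}
```

In a spider, the settings dictionary could be assigned to `custom_settings`, and a request built as `scrapy.Request(**playwright_request_kwargs(url), callback=self.parse)`. A recorded trace can be opened with `playwright show-trace trace.zip`.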

It also includes approaches built on a completely different stack, such as:

The Complete Guide to Scraping JavaScript Websites with Scrapy and Splash (by John Rooney)

Splash is a JavaScript rendering service with an HTTP API for controlling the browser. Developed by Scrapinghub, Splash integrates nicely with Scrapy to provide the browser automation capabilities we need for scraping dynamic JavaScript sites.
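
To make the HTTP API concrete, here is a minimal sketch that asks Splash for the rendered HTML of a page through its `render.html` endpoint. It assumes a Splash instance is listening on `http://localhost:8050` (for example, one started from the `scrapinghub/splash` Docker image); `render_html_url` and `fetch_rendered_html` are hypothetical helper names.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH_URL = "http://localhost:8050"  # assumption: a local Splash instance


def render_html_url(target_url: str, wait: float = 0.5,
                    splash_url: str = SPLASH_URL) -> str:
    """Build a request URL for Splash's render.html endpoint."""
    # `url` is the page to render; `wait` is the time in seconds Splash
    # waits after page load before taking the HTML snapshot.
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


def fetch_rendered_html(target_url: str, wait: float = 0.5) -> str:
    """Ask Splash to load the page, run its JavaScript, and return the HTML."""
    with urlopen(render_html_url(target_url, wait)) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_rendered_html("https://example.com")` would return the page's HTML after JavaScript has run, provided the Splash service is reachable.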

Here are some key advantages of Splash:

Headless Browser – By default Splash renders pages with a WebKit engine provided through Qt (QtWebKit), not Chromium, and it runs headlessly with no visible UI. This makes it well suited to server-side scraping.

Fast & Lightweight – Splash is generally lighter on CPU and memory than driving a full browser through Selenium or Puppeteer, and it is built to render many pages in parallel.

Scriptable – Lua scripts can be used to emulate complex user interactions such as scrolling, clicking, and form submission.

Scrapy Integration – The scrapy-splash middlewares connect Splash to Scrapy, so a Splash request feels much like a regular Scrapy request.

Distributed Crawling – Splash runs as a standalone HTTP service, so several instances can run side by side and be shared by distributed Scrapy crawlers, which makes horizontal scaling straightforward.
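
The scripting and Scrapy-integration points above can be sketched as follows. This is an illustrative sketch, not a definitive setup: `SCROLL_SCRIPT`, `execute_payload`, and `SCRAPY_SPLASH_SETTINGS` are hypothetical names, the Lua script is a simple scroll-and-wait example, and the settings values are the ones suggested in the scrapy-splash README, with the Splash address assumed to be `http://localhost:8050`.

```python
import json

# Lua script for Splash's /execute endpoint: load the page, scroll to the
# bottom, wait for late-loading content, then return the rendered HTML.
SCROLL_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
  assert(splash:wait(args.wait))
  return {html = splash:html()}
end
"""


def execute_payload(url: str, wait: float = 0.5) -> str:
    """JSON body for a POST to Splash's /execute endpoint.

    Keys other than `lua_source` are exposed to the script as `args`.
    """
    return json.dumps({"lua_source": SCROLL_SCRIPT, "url": url, "wait": wait})


# scrapy-splash settings (assumed values, taken from the scrapy-splash README).
SCRAPY_SPLASH_SETTINGS = {
    "SPLASH_URL": "http://localhost:8050",
    "DOWNLOADER_MIDDLEWARES": {
        "scrapy_splash.SplashCookiesMiddleware": 723,
        "scrapy_splash.SplashMiddleware": 725,
        "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
    },
    "SPIDER_MIDDLEWARES": {
        "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
    },
    "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
}
```

With these settings in place, a spider could issue `SplashRequest(url, self.parse, endpoint="execute", args={"lua_source": SCROLL_SCRIPT, "wait": 0.5})` from `scrapy_splash`, and the response would carry the data returned by the script.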

Edited Jun 01, 2024 by Daniel Cunningham