Scrapy: Scrapy with Puppeteer and/or Playwright?

Created on 11 Apr 2020 · 3 comments · Source: scrapy/scrapy

Hi Team,

I would like to ask: are there any development plans to add Puppeteer and/or Playwright support to Scrapy?

Thanks a lot

docs enhancement

All 3 comments

I don't think either Puppeteer or Playwright could be integrated directly, as they are JavaScript projects. However, there is Pyppeteer, and there have been some attempts to integrate it with Scrapy (a quick search yields lopuhin/scrapy-pyppeteer and clemfromspace/scrapy-puppeteer; there might be more projects).

Scrapy added partial support for asyncio in 2.0 (see the asyncio and coroutines topics). That release is very recent, barely over a month old, so the projects above don't seem to have had time to take advantage of it yet.

It is currently possible to run the following spider. Keep in mind that you need to enable the AsyncIO reactor for this to work.

import pyppeteer
import scrapy


class PyppeteerSpider(scrapy.Spider):
    name = "pyppeteer"
    start_urls = ["data:,"]  # avoid making an actual upstream request

    async def parse(self, response):
        browser = await pyppeteer.launch()
        page = await browser.newPage()
        await page.goto("https://example.org")
        title = await page.title()
        await browser.close()  # release the headless browser
        yield {"title": title}

Just a proof of concept, perhaps not particularly useful since it circumvents most of the Scrapy components.
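For reference, the AsyncIO reactor mentioned above is enabled through the `TWISTED_REACTOR` setting introduced in Scrapy 2.0; a minimal sketch of the relevant project settings:

```python
# settings.py
# Install Twisted's asyncio-based reactor so that async def callbacks
# can await asyncio libraries such as Pyppeteer.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

Without this setting, Scrapy's default reactor is used and awaiting asyncio-based code from a coroutine callback will not work.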

Probably worth covering at https://docs.scrapy.org/en/latest/topics/dynamic-content.html#using-a-headless-browser

@osmenia You might also want to check https://github.com/elacuesta/scrapy-pyppeteer.
Disclaimer: personal project, not officially supported by the Scrapy project, very early stage of development.
