Hi Team,
I would like to ask: is there any development plan to add Scrapy support for Puppeteer and/or Playwright?
Thanks a lot
I don't think either Puppeteer or Playwright could be integrated directly, as they are JavaScript projects. However, there is Pyppeteer, and there have been some attempts to integrate it with Scrapy (a quick search yields lopuhin/scrapy-pyppeteer and clemfromspace/scrapy-puppeteer; there might be more projects).
Scrapy added partial support for asyncio in 2.0 (see the asyncio and coroutines topics). This was released very recently, though (barely over a month ago), so it doesn't seem like the above projects have had time to take advantage of it yet.
It is currently possible to run the following spider. Keep in mind that you need to enable the AsyncIO reactor for this to work.
```python
import pyppeteer
import scrapy


class PyppeteerSpider(scrapy.Spider):
    name = "pyppeteer"
    start_urls = ["data:,"]  # avoid making an actual upstream request

    async def parse(self, response):
        browser = await pyppeteer.launch()
        page = await browser.newPage()
        await page.goto("https://example.org")
        title = await page.title()
        await browser.close()
        yield {"title": title}
```
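For reference, enabling the AsyncIO reactor comes down to one setting in the project's `settings.py` (this is the setting name documented in Scrapy 2.0; it can equally be passed on the command line with `-s`):

```python
# settings.py
# Tell Scrapy to install Twisted's asyncio-based reactor, which is required
# for coroutine code that drives asyncio libraries such as pyppeteer.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

Without this setting Twisted installs its default reactor, and awaiting pyppeteer calls from a spider callback will fail.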
This is just a proof of concept, and perhaps not particularly useful, since it circumvents most of the Scrapy components.
Probably worth covering at https://docs.scrapy.org/en/latest/topics/dynamic-content.html#using-a-headless-browser
@osmenia You might also want to check https://github.com/elacuesta/scrapy-pyppeteer.
Disclaimer: personal project, not officially supported by the Scrapy project, very early stage of development.