Blog of science and life
Pywright - Render javascript websites
Demo: pywright.fly.dev
What is Playwright?
Playwright is a browser automation library from Microsoft. It can be used to automate tasks in the browser, for testing purpose.
But we can use it to scrape data from websites that use javascript to render the content.
Think about Facebook, Twitter, Instagram, etc. They use javascript to render the content. So we can't scrape them with normal tools like requests
or scrapy
.
Selenium is a popular tool for scraping javascript websites. It's powerful, have a lot of features, but it's slow and hard to use.
Playwright is a newplayer in this field. It's fast, have excelent simple API.
What is Pywright?
Pywright is a API service writen in python that use Playwright to render javascript websites. You can easily deploy it to any cloud provider then use it as a service.
Install Pywright
I've created a docker image for Playwright. It's contain a simple Flask API server, so you can use it as a microservice.
I will create only 2 workers
by default, you can change it by setting WORKERS
environment variable.
How to use Pywright
Just use POST
method to send a request to the server at endpoint /scrape
.
The request body must be a JSON object with 1 field: url
.
{
"url": "https://ipinfo.io"
}
The response will be a HTML string of the website.
Example
Example with httpie
:
http POST http://localhost:5000/scrape url=https://ipinfo.io
or with curl
:
curl -X POST -H "Content-Type: application/json" -d '{"url": "https://ipinfo.io"}' http://localhost:5000/scrape
or with requests
:
import requests
url = "http://localhost:5000/scrape"
html = requests.post(url, json={"url": "https://ipinfo.io"}).text
print(html)