Puppeteer is a Node.js library that makes it easy to do advanced web scraping and spidering.
Older generations of web scraping and spidering tools would grab and analyze HTML pages as returned by a web server.
That approach no longer works well because fewer and fewer websites are static HTML pages. Today websites are often applications written in JavaScript that generate HTML on the client, not the server.
To get the final HTML output, your scraper needs to run that JavaScript.
That used to be very difficult, but Puppeteer makes it easy.
This article describes some more advanced techniques, but let’s start with a basic example first.
Save web page to a file
First install the library:
yarn add puppeteer
when using yarn
npm install --save puppeteer
when using npm
This is the simplest possible usage of Puppeteer:
- navigate to a page of interest
- get content of the webpage as HTML and save it to a file
const puppeteer = require("puppeteer");
const fs = require("fs");

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.google.com/", { waitUntil: "networkidle2" });
  // hacky defensive move but I don't know a better way:
  // wait a bit so that the browser finishes executing JavaScript
  await page.waitFor(1 * 1000);
  const html = await page.content();
  fs.writeFileSync("index.html", html);
  await browser.close();
}

run();
Handling failures
What if a URL you tried to load doesn’t exist?
The web server will return a ‘Not Found’ page with HTTP status code 404
in the response. The above script would treat such a page as a perfectly valid response.
Most of the time you want to handle this as an error case.
For example, if you’re writing a bot that checks for broken links, you want to distinguish a 404
Not Found response from a 200
OK response.
In the HTTP protocol, status codes 4xx
and 5xx
indicate errors, 2xx
indicates success and 3xx
indicates redirection.
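The status-code classes above can be captured in a tiny helper, which is handy when classifying many responses in a link checker (a sketch; the function name is my own, not part of Puppeteer):

```javascript
// Maps an HTTP status code to its class, per the ranges described above.
function statusClass(status) {
  if (status >= 200 && status < 300) return "success";
  if (status >= 300 && status < 400) return "redirect";
  if (status >= 400 && status < 600) return "error"; // 4xx client, 5xx server
  return "other"; // e.g. 1xx informational
}

console.log(statusClass(200)); // "success"
console.log(statusClass(404)); // "error"
```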
Puppeteer lets you intercept HTTP requests before they are sent, via Page.setRequestInterception(true),
as well as inspect completed HTTP responses via page events.
Here’s a program that prints information about all HTTP requests and responses:
const puppeteer = require("puppeteer");

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const mainUrl = "https://blog.kowalczyk.info/pas";
  let mainUrlStatus;

  await page.setRequestInterception(true);

  page.on("request", request => {
    const url = request.url();
    console.log("request url:", url);
    request.continue();
  });

  page.on("requestfailed", request => {
    const url = request.url();
    console.log("request failed url:", url);
  });

  page.on("response", response => {
    const request = response.request();
    const url = request.url();
    const status = response.status();
    console.log("response url:", url, "status:", status);
    if (url === mainUrl) {
      mainUrlStatus = status;
    }
  });

  await page.goto(mainUrl);
  console.log("status for main url:", mainUrlStatus);
  await browser.close();
}

run();
Here’s what it’ll print:
$ node test.js
request url: https://blog.kowalczyk.info/pas
response url: https://blog.kowalczyk.info/pas status: 404
request url: https://fonts.googleapis.com/css?family=Roboto:400,700&subset=latin,latin-ext
response url: https://fonts.googleapis.com/css?family=Roboto:400,700&subset=latin,latin-ext status: 200
request url: https://fonts.gstatic.com/s/roboto/v18/KFOlCnqEu92Fr1MmWUlfChc9AMP6lQ.ttf
request url: https://fonts.gstatic.com/s/roboto/v18/KFOmCnqEu92Fr1Mu7GxPKTU1Kg.ttf
response url: https://fonts.gstatic.com/s/roboto/v18/KFOmCnqEu92Fr1Mu7GxPKTU1Kg.ttf status: 200
response url: https://fonts.gstatic.com/s/roboto/v18/KFOlCnqEu92Fr1MmWUlfChc9AMP6lQ.ttf status: 200
status for main url: 404
Notice that fetching a page also fetches all resources used by that page, just like in a web browser. For that reason, to find out the status code of the URL we requested, we have to remember it in a variable inside the response
hook.
The requestfailed
hook is for errors at the network-connection level, e.g. DNS resolution failed, there’s no network at all, the network connection got interrupted etc.
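For the simple case of checking a single URL, there is a lighter-weight option that needs no request interception: page.goto resolves to the Response of the main document, and network-level failures reject the promise. A sketch (my own code, not from the article's example):

```javascript
// Returns the HTTP status code for the main document at `url`.
async function getStatus(url) {
  const puppeteer = require("puppeteer"); // loaded lazily
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // goto() resolves to the Response for the main document
    const response = await page.goto(url);
    return response.status(); // e.g. 200 or 404
  } finally {
    await browser.close();
  }
}

// usage:
// getStatus("https://blog.kowalczyk.info/pas")
//   .then(status => console.log("status:", status))
//   // DNS failures, no network etc. reject the goto() promise:
//   .catch(err => console.log("network-level error:", err.message));
```

The interception-based version above is still the right tool when you need the status of every sub-resource, not just the main document.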
See console.log from inside the browser
Your JavaScript code is executed in two different contexts:
- the main script is executed in node.js. In that context
console.log("foo")
prints to the shell
- scripts passed to the
Page.evaluate
method are serialized to text, sent to the browser via the Chrome DevTools Protocol and executed inside the browser. In that context console.log("foo")
prints to the browser console, which you can’t see.
To see what console.log
prints in the browser, you can hook the console event and re-log the messages to the shell:
const browser = await puppeteer.launch();
const page = await browser.newPage();
const url = "https://blog.kowalczyk.info/";

// this hooks `console.log()` in the browser
page.on("console", msg => {
  console.log("The whole message:", msg.text());
  console.log("\nEach argument:");
  for (let arg of msg.args()) {
    // arg is a JSHandle; jsonValue() returns a Promise with its value
    // https://pptr.dev/#?product=Puppeteer&show=api-class-jshandle
    arg.jsonValue().then(v => {
      console.log(v);
    });
  }
});

await page.goto(url);

await page.evaluate(() => {
  // This is executed inside the browser so not visible in our script
  // unless we hook 'console' events
  console.log("Message from the browser", 5);
});

await browser.close();
Quickly testing evaluate scripts
It’s slow to test browser scripts executed via Page.evaluate
because you have to start the browser, load the page etc.
To test scripts faster, I test them directly in the browser, using the excellent Chrome dev tools.
My process is:
- prepare the script, in IIFE form, in the editor
- copy & paste it into the console window in Chrome dev tools
What is IIFE form? To avoid conflicts with JavaScript state from previous runs, I wrap the code inside an Immediately Invoked Function Expression:
(function() {
  // code here is isolated from things outside this function
  console.log("My script");
  // ... my script
  // when debugging I can trigger JavaScript debugger from inside the script
  // with debugger statement:
  debugger;
})(); // immediately invoke the function
It’s faster to iterate on code this way. You can also use the browser’s JavaScript debugger.
As shown in the snippet, I can trigger the debugger for single-stepping through the code with a debugger;
statement.
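A debugger; statement only pauses when a debugger is attached, so it has no effect in a headless run. To make it pause when the script runs under Puppeteer, you can launch a visible browser with DevTools already open, using the headless and devtools launch options. A sketch:

```javascript
// Launches a visible browser with DevTools attached so that a `debugger;`
// statement inside page.evaluate() actually pauses execution.
async function debugEvaluate(url) {
  const puppeteer = require("puppeteer"); // loaded lazily
  const browser = await puppeteer.launch({
    headless: false, // show the browser window
    devtools: true,  // auto-open DevTools for every tab
  });
  const page = await browser.newPage();
  await page.goto(url);
  await page.evaluate(() => {
    debugger; // pauses here; resume from the DevTools window
    console.log("inspect page state in the DevTools console");
  });
  // browser left open on purpose while debugging
}
```

Note that evaluate() won’t resolve until you resume execution in DevTools, which is exactly what you want while single-stepping.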
Study Puppeteer API
Now that you’ve seen a few advanced uses of Puppeteer, you should study its API a bit to learn what else is possible. The Chrome DevTools Protocol (CDP) is very powerful:
- Page class allows hooking many events, reading and setting cookies, simulating interaction like mouse clicks etc.
- Tracing class allows creating a trace file for future inspection in Chrome DevTools
- Worker class allows interacting with Web Workers
- Coverage class allows measuring JavaScript and CSS coverage
- Keyboard class allows simulating keyboard events
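As an illustration of the Page and Keyboard APIs from the list above, here’s a sketch of simulating user input. The "#search" selector and the search flow are placeholders; adapt them to whatever the target page actually uses:

```javascript
// Focuses an input, types into it like a human, and submits with Enter.
async function fillAndSubmit(url, selector, text) {
  const puppeteer = require("puppeteer"); // loaded lazily
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    await page.click(selector);                 // simulate a mouse click to focus
    await page.type(selector, text, { delay: 50 }); // 50 ms between keystrokes
    await page.keyboard.press("Enter");         // simulate pressing Enter
    await page.waitForNavigation();             // wait for the results page
    return page.url();
  } finally {
    await browser.close();
  }
}

// usage:
// fillAndSubmit("https://example.com/", "#search", "puppeteer")
//   .then(finalUrl => console.log("landed on:", finalUrl));
```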
Puppeteer is not the only tool that takes advantage of the Chrome DevTools Protocol. A bunch of them are listed in
Awesome Chrome DevTools.