Advanced web spidering with Puppeteer
Puppeteer is a Node.js library that makes it easy to do advanced web scraping and spidering.
The older generation of web scraping and spidering tools would grab and analyze HTML pages as returned by a web server.
That approach doesn’t work well anymore because fewer and fewer websites are static HTML pages. Today websites are often applications written in JavaScript that generate HTML on the client, not the server.
To get the final HTML output, your scraper needs to run that JavaScript.
That used to be very difficult, but Puppeteer makes it easy.
Puppeteer uses Chrome to run the web application and uses CDP (Chrome DevTools Protocol) to access the web page.
This article describes some more advanced techniques, but let’s start with a basic example first.

Save web page to a file

First install the library, for example with npm install puppeteer.
This is the simplest possible usage of Puppeteer:
const puppeteer = require("puppeteer");
const fs = require("fs");

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.google.com/", { waitUntil: "networkidle2" });
  // hacky defensive move but I don't know a better way:
  // wait a bit so that the browser finishes executing JavaScript
  // (a plain timer is used because page.waitFor() was removed from newer Puppeteer versions)
  await new Promise(resolve => setTimeout(resolve, 1 * 1000));
  const html = await page.content();
  fs.writeFileSync("index.html", html);
  await browser.close();
}

run();

Handling failures

What if the URL you tried to load doesn’t exist?
The web server will return a ‘Not Found’ page with HTTP status code 404 in the response. The above script would treat such a page as a perfectly valid response.
Most of the time you want to handle this as an error case.
For example, if you’re writing a bot that checks for broken links, you want to distinguish a 404 Not Found response from a 200 OK response.
In the HTTP protocol, status codes 4xx and 5xx indicate errors, 2xx indicate success, and 3xx indicate redirection.
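The status code classes above can be captured in a small helper. This is only a minimal sketch: classifyStatus is a hypothetical name, not part of Puppeteer, and it assumes that anything outside the 2xx to 5xx ranges can simply be reported as "other".

```javascript
// Hypothetical helper (not part of Puppeteer): maps an HTTP status code
// to the coarse classes described above.
function classifyStatus(status) {
  if (status >= 200 && status <= 299) return "success";
  if (status >= 300 && status <= 399) return "redirect";
  if (status >= 400 && status <= 599) return "error";
  return "other"; // e.g. 1xx informational codes
}

console.log(classifyStatus(200)); // success
console.log(classifyStatus(404)); // error
```

A broken-link checker could then treat only the "error" class as a broken link.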
Puppeteer provides the Page.setRequestInterception(true) hook for intercepting HTTP requests before they are sent, and the response event for inspecting completed HTTP responses.
Here’s a program that prints information about all HTTP requests and responses:
const puppeteer = require("puppeteer");

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const mainUrl = "https://blog.kowalczyk.info/pas";
  let mainUrlStatus;
  await page.setRequestInterception(true);
  page.on("request", request => {
    const url = request.url();
    console.log("request url:", url);
    request.continue();
  });
  page.on("requestfailed", request => {
    const url = request.url();
    console.log("request failed url:", url);
  });
  page.on("response", response => {
    const request = response.request();
    const url = request.url();
    const status = response.status();
    console.log("response url:", url, "status:", status);
    if (url === mainUrl) {
      mainUrlStatus = status;
    }
  });
  await page.goto(mainUrl);
  console.log("status for main url:", mainUrlStatus);
  const html = await page.content();
  await browser.close();
}

run();
Here’s what it’ll print:
$ node test.js
request url: https://blog.kowalczyk.info/pas
response url: https://blog.kowalczyk.info/pas status: 404
request url: https://fonts.googleapis.com/css?family=Roboto:400,700&subset=latin,latin-ext
response url: https://fonts.googleapis.com/css?family=Roboto:400,700&subset=latin,latin-ext status: 200
request url: https://fonts.gstatic.com/s/roboto/v18/KFOlCnqEu92Fr1MmWUlfChc9AMP6lQ.ttf
request url: https://fonts.gstatic.com/s/roboto/v18/KFOmCnqEu92Fr1Mu7GxPKTU1Kg.ttf
response url: https://fonts.gstatic.com/s/roboto/v18/KFOmCnqEu92Fr1Mu7GxPKTU1Kg.ttf status: 200
response url: https://fonts.gstatic.com/s/roboto/v18/KFOlCnqEu92Fr1MmWUlfChc9AMP6lQ.ttf status: 200
status for main url: 404
Notice that fetching a page also fetches all resources used by that page, just like in a web browser. For that reason, to find out the status code for the URL we requested, the script remembers it in a variable in the response hook.
The requestfailed hook is for errors at the network connection level, e.g. DNS resolution failed, there’s no network at all, the network connection got interrupted etc.
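As an alternative to tracking the status in the response hook, page.goto() itself resolves to the response for the main document (or null in some cases, such as navigating to about:blank) and rejects on network-level failures. The sketch below relies on that behavior. checkUrl and the shape of the object it returns are illustrative choices, not part of Puppeteer, and page is assumed to be a Puppeteer page created as in the program above.

```javascript
// Hypothetical helper: loads `url` in the given Puppeteer page and reports
// either the HTTP status of the main document or a network-level failure.
async function checkUrl(page, url) {
  try {
    // goto() resolves to the response for the main document,
    // or null (for example when navigating to about:blank)
    const response = await page.goto(url);
    const status = response ? response.status() : null;
    const ok = status !== null && status >= 200 && status <= 299;
    return { url, ok, status, error: null };
  } catch (err) {
    // DNS failures, refused connections, timeouts etc. end up here
    return { url, ok: false, status: null, error: err.message };
  }
}
```

Inside run() this could be called as const result = await checkUrl(page, mainUrl); and result.status should then match the status that the response hook recorded for the main URL.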

See console.log from inside the browser

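Your JavaScript code is executed in two different contexts: in node.js, where the script that drives Puppeteer runs, and in the browser, where functions passed to Page.evaluate run.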
To see what console.log prints in the browser, you can hook it and re-log to shell:
const browser = await puppeteer.launch();
const page = await browser.newPage();
const url = "https://blog.kowalczyk.info/";

// this hooks `console.log()` in the browser
page.on("console", msg => {
  console.log("The whole message:", msg.text());
  console.log("\nEach argument:");
  for (let arg of msg.args()) {
    // arg is a JSHandle; arg.jsonValue() returns a Promise for its value
    // https://pptr.dev/#?product=Puppeteer&show=api-class-jshandle
    arg.jsonValue().then(v => {
      console.log(v);
    });
  }
});
await page.goto(url);
await page.evaluate(() => {
  // This is executed inside the browser so not visible in our script
  // unless we hook 'console' events
  console.log("Message from the browser", 5);
});
await browser.close();
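Because each jsonValue() call is asynchronous, the values logged in the loop above may show up after later output. Here is a minimal sketch of one way to collect all arguments first. It uses only the msg.args() and jsonValue() calls already shown above, and consoleArgs is a hypothetical helper name.

```javascript
// Hypothetical helper: resolves every argument of a Puppeteer console
// message to a plain value, preserving the argument order.
async function consoleArgs(msg) {
  const handles = msg.args(); // array of JSHandle objects
  return Promise.all(handles.map(handle => handle.jsonValue()));
}
```

It could be hooked up as page.on("console", async msg => console.log(await consoleArgs(msg))); so that all arguments of one message are printed together.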

Quickly testing evaluate scripts

It’s slow to test a browser script executed via Page.evaluate because you have to start the browser, load the page etc.
To test scripts faster I run them directly in the browser, using the excellent Chrome dev tools.
My process is:
What is IIFE form? To avoid conflicts with JavaScript state from previous runs, I wrap the code inside an Immediately Invoked Function Expression:
(function() {
  // code here is isolated from things outside this function
  console.log("My script");
  // ... my script

  // when debugging I can trigger JavaScript debugger from inside the script
  // with debugger statement:
  debugger;
})(); // immediately invoke the function
It’s faster to iterate on code this way. You can also use the browser’s JavaScript debugger.
As shown in the snippet, I can also trigger the debugger for single-stepping through the code with a debugger; statement.
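To illustrate the isolation an IIFE gives, here is a small sketch that can be pasted into the dev tools console or run with node. The variable names and the stand-in data are made up for the example.

```javascript
// Variables declared inside the IIFE stay local to it; only the value
// that the function returns is visible outside.
var linkCount = (function () {
  const hrefs = ["/a", "/b", "/c"]; // stand-in data, local to this function
  return hrefs.length;
})();

console.log(linkCount); // 3
console.log(typeof hrefs); // undefined, because hrefs is not visible here
```

Once the body works in the console, the same function can be passed to Page.evaluate without the trailing call.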

Study Puppeteer API

Now that you’ve seen a few advanced uses of Puppeteer, you should study its API a bit to learn what else is possible. CDP is very powerful.

Other CDP tools and libraries

Puppeteer is not the only tool that takes advantage of the Chrome DevTools Protocol. A bunch of them are listed in Awesome Chrome DevTools.
Written on Jul 18 2018. Topics: programming.