Advanced web spidering with Puppeteer

Puppeteer is a Node.js library that makes it easy to do advanced web scraping and spidering.

Older generations of web scraping and spidering tools would grab and analyze HTML pages as returned by a web server.

That doesn't work well anymore because fewer and fewer websites are static HTML pages. Today websites are often applications written in JavaScript that generate HTML on the client, not the server.

To get the final HTML output your scraper needs to run that JavaScript.

That used to be very difficult but Puppeteer makes it easy.

Puppeteer uses Chrome to run the web application and CDP (Chrome DevTools Protocol) to access the web page.

This article describes some more advanced techniques, but let's start with a basic example first.

Save web page to a file

First install the library:

  • yarn add puppeteer when using yarn
  • npm install --save puppeteer when using npm

This is the simplest possible usage of Puppeteer:

  • navigate to a page of interest
  • get content of the webpage as HTML and save it to a file

const puppeteer = require("puppeteer");
const fs = require("fs");

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.google.com/", { waitUntil: "networkidle2" });
  // hacky defensive move but I don't know a better way:
  // wait a bit so that the browser finishes executing JavaScript
  // (plain setTimeout because page.waitFor() is not available in newer Puppeteer versions)
  await new Promise(resolve => setTimeout(resolve, 1 * 1000));
  const html = await page.content();
  fs.writeFileSync("index.html", html);
  await browser.close();
}

run();
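
If you happen to know an element that only shows up once the page has finished rendering, waiting for it can replace the fixed delay. Here's a minimal sketch of that idea. The function name savePageWhenReady and its parameters are made up for this example, and it only helps when such a selector exists for the site you are scraping.

```javascript
const fs = require("fs");

// Save a page once `selector` matches, instead of sleeping for a fixed time.
// `page` is a Puppeteer page, i.e. what browser.newPage() returns.
async function savePageWhenReady(page, url, selector, outFile) {
  await page.goto(url, { waitUntil: "networkidle2" });
  // Rejects if the selector does not appear within the timeout.
  await page.waitForSelector(selector, { timeout: 10 * 1000 });
  const html = await page.content();
  fs.writeFileSync(outFile, html);
  return html.length;
}
```

You would call it as await savePageWhenReady(page, url, "#content", "index.html"), where "#content" stands for whatever selector marks the page as ready.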

Handling failures

What if the URL you tried to load doesn't exist?

The web server will return a 'Not Found' page with HTTP status code 404 in the response. The above script would treat such a page as a perfectly valid response.

Most of the time you want to handle this as an error case.

For example, if you're writing a bot that checks for broken links, you want to distinguish a 404 Not Found response from a 200 OK response.

In the HTTP protocol, status codes 4xx and 5xx indicate errors, 2xx indicate success and 3xx indicate redirection.
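
The status code classes are easy to express in code. A small sketch; the helper names classifyStatus and isErrorStatus are made up for this example, not part of Puppeteer:

```javascript
// Map an HTTP status code to the broad class it belongs to.
function classifyStatus(status) {
  if (status >= 200 && status < 300) return "success";
  if (status >= 300 && status < 400) return "redirect";
  if (status >= 400 && status < 500) return "client error";
  if (status >= 500 && status < 600) return "server error";
  return "other"; // 1xx informational, or not a valid status code
}

// A broken-link checker would typically flag only 4xx and 5xx.
function isErrorStatus(status) {
  return status >= 400 && status < 600;
}

console.log(classifyStatus(404), isErrorStatus(404)); // client error true
console.log(classifyStatus(200), isErrorStatus(200)); // success false
```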

Puppeteer provides the Page.setRequestInterception(true) hook for intercepting HTTP requests before they are sent. Completed HTTP responses can be inspected with the response event, which fires whether or not interception is enabled.
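
Intercepting requests before they happen also means you can decide not to make them at all. As an illustration, here's a sketch that aborts requests for images, media and fonts, which a spider interested only in HTML usually doesn't need. The helper names and the list of blocked types are my assumptions; adjust them for the sites you crawl.

```javascript
// Resource types a text-only spider can usually skip (an assumption, not a rule).
const BLOCKED_TYPES = new Set(["image", "media", "font"]);

function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Install on a Puppeteer page before calling page.goto().
async function blockHeavyResources(page) {
  await page.setRequestInterception(true);
  page.on("request", request => {
    if (shouldBlock(request.resourceType())) {
      request.abort(); // the request is never sent
    } else {
      request.continue();
    }
  });
}
```

Once interception is enabled, every request has to be either continued or aborted, otherwise it will hang.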

Here's a program that prints information about all HTTP requests and responses:

const puppeteer = require("puppeteer");

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const mainUrl = "https://blog.kowalczyk.info/pas";
  let mainUrlStatus;
  await page.setRequestInterception(true);
  page.on("request", request => {
    const url = request.url();
    console.log("request url:", url);
    request.continue();
  });
  page.on("requestfailed", request => {
    const url = request.url();
    console.log("request failed url:", url);
  });
  page.on("response", response => {
    const request = response.request();
    const url = request.url();
    const status = response.status();
    console.log("response url:", url, "status:", status);
    if (url === mainUrl) {
      mainUrlStatus = status;
    }
  });
  await page.goto(mainUrl);
  console.log("status for main url:", mainUrlStatus);
  const html = await page.content();
  await browser.close();
}

run();

Here's what it'll print:

$ node test.js
request url: https://blog.kowalczyk.info/pas
response url: https://blog.kowalczyk.info/pas status: 404
request url: https://fonts.googleapis.com/css?family=Roboto:400,700&subset=latin,latin-ext
response url: https://fonts.googleapis.com/css?family=Roboto:400,700&subset=latin,latin-ext status: 200
request url: https://fonts.gstatic.com/s/roboto/v18/KFOlCnqEu92Fr1MmWUlfChc9AMP6lQ.ttf
request url: https://fonts.gstatic.com/s/roboto/v18/KFOmCnqEu92Fr1Mu7GxPKTU1Kg.ttf
response url: https://fonts.gstatic.com/s/roboto/v18/KFOmCnqEu92Fr1Mu7GxPKTU1Kg.ttf status: 200
response url: https://fonts.gstatic.com/s/roboto/v18/KFOlCnqEu92Fr1MmWUlfChc9AMP6lQ.ttf status: 200
status for main url: 404

Notice that fetching a page also fetches all resources used by that page, just like in a web browser. For that reason, to find out the status code for the URL we requested, we remember it in a variable in the response hook.

The requestfailed hook is for errors at the network connection level, e.g. DNS resolution failed, there's no network at all, the network connection got interrupted etc.
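
If all you need is the status of the main URL, there is a shorter route: page.goto() resolves to the response for the main resource and throws on network-level failures. Here's a sketch of a helper built on that. The name fetchStatus and the shape of the returned object are made up for this example.

```javascript
// Load `url` in the given Puppeteer page and summarize the outcome.
// `page` is whatever browser.newPage() returned.
async function fetchStatus(page, url) {
  try {
    const response = await page.goto(url);
    if (response === null) {
      // goto() can resolve to null, e.g. when navigating to about:blank
      return { url, status: null, ok: false, error: "no response" };
    }
    const status = response.status();
    return { url, status, ok: status >= 200 && status < 300, error: null };
  } catch (err) {
    // network-level failures (DNS, unreachable server, timeout) end up here
    return { url, status: null, ok: false, error: err.message };
  }
}
```

A broken-link bot could call await fetchStatus(page, link) for every link and report those where ok is false. After redirects, the status is that of the final response.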

See console.log from inside the browser

Your JavaScript code is executed in two different contexts:

  • main script is executed in Node.js. In that context console.log("foo") prints to the shell
  • scripts provided to the Page.evaluate method are serialized to text, sent to the browser via Chrome DevTools Protocol and executed inside the browser. In that context console.log("foo") prints to the browser console, which you can't see.
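
A practical consequence of the serialization is that a function given to Page.evaluate can't see variables from your Node.js script; values have to be passed as arguments. The sketch below imitates this without a browser. simulateEvaluate is a simplified stand-in I made up for illustration, not how Puppeteer is actually implemented.

```javascript
// Simplified stand-in for what happens to a function passed to Page.evaluate:
// only its source text and serializable arguments cross the boundary.
function simulateEvaluate(fn, ...args) {
  const source = fn.toString();          // the function becomes text
  const wireArgs = JSON.stringify(args); // the arguments become JSON
  // Rebuild the function in the global scope, as if in another process.
  const rebuilt = new Function(`return (${source});`)();
  return rebuilt(...JSON.parse(wireArgs));
}

function demo() {
  const limit = 3; // lives only in this function's scope

  // Works: the value travels as an argument.
  const ok = simulateEvaluate(n => n * 2, limit);

  // Fails: `limit` is captured from the enclosing scope, which is not sent.
  let error = null;
  try {
    simulateEvaluate(() => limit * 2);
  } catch (err) {
    error = err.name;
  }
  return { ok, error };
}

console.log(demo()); // { ok: 6, error: 'ReferenceError' }
```

With real Puppeteer the working form looks the same: await page.evaluate(n => n * 2, limit).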

To see what console.log prints in the browser, you can hook it and re-log to shell:

const puppeteer = require("puppeteer");

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const url = "https://blog.kowalczyk.info/";

  // this hooks `console.log()` in the browser
  page.on("console", msg => {
    console.log("The whole message:", msg.text());
    console.log("\nEach argument:");
    for (let arg of msg.args()) {
      // arg is a JSHandle; jsonValue() returns a Promise for its value
      // https://pptr.dev/#?product=Puppeteer&show=api-class-jshandle
      arg.jsonValue().then(v => {
        console.log(v);
      });
    }
  });
  await page.goto(url);
  await page.evaluate(() => {
    // This is executed inside the browser so not visible in our script
    // unless we hook 'console' events
    console.log("Message from the browser", 5);
  });
  await browser.close();
}

run();

Quickly testing evaluate scripts

It's slow to test browser scripts executed via Page.evaluate because you have to re-run the browser, load the page etc.

To test scripts faster I test them directly in the browser, using the excellent Chrome dev tools.

My process is:

  • prepare the script, in IIFE form, in the editor
  • copy & paste it into the console window in Chrome dev tools

What is IIFE form? To avoid conflicts with JavaScript state from previous runs I wrap the code inside an Immediately Invoked Function Expression:

(function() {
  // code here is isolated from things outside this function
  console.log("My script");
  // ... my script

  // when debugging I can trigger JavaScript debugger from inside the script
  // with debugger statement:
  debugger;
})(); // immediately invoke the function

It's faster to iterate on code this way and I can also use JavaScript debugger.

As shown in the snippet, I can also trigger the debugger for single-stepping through the code by adding a debugger; statement to the script.

That's useful because it's impossible to set breakpoints manually in code pasted into console.
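
One way to keep the console version and the Puppeteer version in sync is to write the browser script as an ordinary function and generate the IIFE text from it. This is only a sketch: toIife and collectLinks are names I made up, and collecting links is just an example task.

```javascript
// Browser-side script kept as an ordinary function so it can be reused.
// It only runs in a browser, because it touches `document`.
function collectLinks() {
  const anchors = Array.from(document.querySelectorAll("a[href]"));
  return anchors.map(a => a.href);
}

// Produce text in IIFE form, ready to paste into the DevTools console.
function toIife(fn) {
  return `(${fn.toString()})();`;
}

console.log(toIife(collectLinks));
```

Once the script works in the console, the same function can be handed to Puppeteer with await page.evaluate(collectLinks).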

Study Puppeteer API

Now that you've seen a few advanced uses of Puppeteer, you should study its API a bit to learn what else is possible. CDP is very powerful:

  • Page class allows hooking many other events, reading and setting cookies, simulating interaction like mouse clicks etc.
  • Tracing class allows creating a trace file for future inspection in Chrome DevTools
  • Worker class allows interacting with Web Workers
  • Coverage class allows getting JavaScript and CSS coverage
  • Keyboard class allows simulating keyboard events
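
As a taste of what these classes offer, here's a sketch that uses the Coverage class to estimate how much of the JavaScript loaded by a page was actually executed. The helper names summarizeCoverage and measureJsCoverage are made up for this example; the entry shape ({ url, text, ranges }) is what I understand page.coverage.stopJSCoverage() to return.

```javascript
// Sum used vs. total characters from coverage entries shaped like
// { url, text, ranges: [{ start, end }] }.
function summarizeCoverage(entries) {
  let total = 0;
  let used = 0;
  for (const entry of entries) {
    total += entry.text.length;
    for (const range of entry.ranges) {
      used += range.end - range.start;
    }
  }
  return { total, used, percent: total === 0 ? 0 : (used / total) * 100 };
}

// `page` is a Puppeteer page; coverage must be started before navigation.
async function measureJsCoverage(page, url) {
  await page.coverage.startJSCoverage();
  await page.goto(url);
  const entries = await page.coverage.stopJSCoverage();
  return summarizeCoverage(entries);
}
```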

Other CDP tools and libraries

Puppeteer is not the only tool that takes advantage of the Chrome DevTools Protocol. A bunch of them are listed in Awesome Chrome DevTools.
