If you want to speed up scraping or take screenshots faster, you can block the requests that have no meaningful effect on the result.
Puppeteer allows blocking any outgoing requests while loading the page. Whether you want to block ads, tracking scripts, or different types of resources, it is relatively easy to do with Puppeteer.
A full example of blocking requests
Let’s start with a fully working example on how to intercept and block requests in Puppeteer:
const puppeteer = require('puppeteer');
const wildcardMatch = require('wildcard-match');

// Matches any URL that ends with .css or .js
const blockRequest = wildcardMatch(['*.css', '*.js'], { separator: false });

(async () => {
    const browser = await puppeteer.launch({});
    try {
        const page = await browser.newPage();
        // Interception must be enabled before the first request is sent
        await page.setRequestInterception(true);

        page.on('request', (request) => {
            if (blockRequest(request.url())) {
                const u = request.url();
                console.log(`request to ${u.substring(0, 50)}...${u.substring(u.length - 5)} is aborted`);
                request.abort();
                return;
            }

            request.continue();
        });

        await page.goto('https://screenshotone.com/');
    } catch (e) {
        console.log(e);
    } finally {
        await browser.close();
    }
})();
The result is:
request to https://screenshotone.com/main.7a76b580aa30ffecb0b...f.css is aborted
request to https://screenshotone.com/js/bootstrap.min.592b9fa...ab.js is aborted
request to https://screenshotone.com/js/highlight.min.e13cfba...5f.js is aborted
request to https://screenshotone.com/main.min.dabf7f45921a731...45.js is aborted
Sorry, but I won’t show you the resulting screenshot of the site because it looks awful without CSS and JS.
A step-by-step explanation
The most crucial step is not to forget to enable request interception before sending any request:
// ...
const page = await browser.newPage();
await page.setRequestInterception(true);
// ...
Otherwise, the trick won’t work.
After request interception is enabled, you can listen to any new outgoing request while the page is being loaded and decide on a per-request basis whether to block the request or not.
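The per-request decision can be separated from the abort/continue plumbing. Below is a minimal sketch under that assumption: `createBlockingHandler` and `shouldBlock` are hypothetical names, not part of Puppeteer, and the handler only relies on `request.abort()` and `request.continue()`.

```javascript
// Hypothetical factory: builds a 'request' handler from a predicate.
// `shouldBlock` receives the intercepted request and returns true to block it.
function createBlockingHandler(shouldBlock) {
    return (request) => {
        if (shouldBlock(request)) {
            request.abort();
            return;
        }

        // Every intercepted request has to be resolved, otherwise it stalls.
        request.continue();
    };
}
```

With a real page, the handler would be registered after interception is enabled, for example `page.on('request', createBlockingHandler((r) => r.url().endsWith('.css')))`.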
If you want to block all requests to www.google-analytics.com to speed up the site loading and to avoid tracking, then just filter requests based on the domain substring:
page.on('request', (request) => {
    if (request.url().includes('www.google-analytics.com')) {
        request.abort();
        return;
    }

    request.continue();
});
A better option is to parse the URL, extract the hostname, and compare it exactly:
page.on('request', (request) => {
    // URL is the standard URL class, available globally in Node.js
    const domain = new URL(request.url()).hostname;
    if (domain === 'www.google-analytics.com') {
        request.abort();
        return;
    }

    request.continue();
});
A substring check can also match an unrelated URL that merely contains www.google-analytics.com somewhere, for example in a query parameter.
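To see the difference, here is a small sketch that runs both checks on a URL that only mentions the analytics domain in its query string. The `isBlockedHost` helper and the `example.com` URL are made up for illustration; `URL` is the standard URL class available globally in Node.js.

```javascript
// Hypothetical helper: true only when the URL's hostname is exactly one of the blocked hosts.
function isBlockedHost(requestUrl, blockedHosts) {
    let hostname;
    try {
        hostname = new URL(requestUrl).hostname;
    } catch (e) {
        // Not a parseable absolute URL, so do not block it.
        return false;
    }

    return blockedHosts.includes(hostname);
}

const blocked = ['www.google-analytics.com'];

// An unrelated URL that only carries the domain as a query parameter.
const tricky = 'https://example.com/?ref=www.google-analytics.com';

console.log(tricky.includes('www.google-analytics.com')); // true, a false positive
console.log(isBlockedHost(tricky, blocked)); // false
console.log(isBlockedHost('https://www.google-analytics.com/analytics.js', blocked)); // true
```

Inside a request handler, the check would be `isBlockedHost(request.url(), blocked)`.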
Blocking requests by resource type
If you need to block a set of requests by resource type, such as images or stylesheets, regardless of the extension and URL pattern, you can compare the result of request.resourceType() against the types you want to block:
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({});
    try {
        const page = await browser.newPage();
        // Interception must be enabled before the first request is sent
        await page.setRequestInterception(true);

        page.on('request', (request) => {
            if (request.resourceType() === 'stylesheet' || request.resourceType() === 'script') {
                const u = request.url();
                console.log(`request to ${u.substring(0, 50)}...${u.substring(u.length - 5)} is aborted`);
                request.abort();
                return;
            }

            request.continue();
        });

        await page.goto('https://screenshotone.com/');
    } catch (e) {
        console.log(e);
    } finally {
        await browser.close();
    }
})();
The result is the same as for the initial example:
request to https://screenshotone.com/main.7a76b580aa30ffecb0b...f.css is aborted
request to https://screenshotone.com/js/bootstrap.min.592b9fa...ab.js is aborted
request to https://screenshotone.com/js/highlight.min.e13cfba...5f.js is aborted
request to https://screenshotone.com/main.min.dabf7f45921a731...45.js is aborted
Puppeteer supports blocking the following resource types:
document
stylesheet
image
media
font
script
texttrack
xhr
fetch
eventsource
websocket
manifest
other
As you see, it is pretty straightforward.
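Because the type names are plain strings, a typo such as "stylesheets" would silently block nothing. One possible guard is sketched below. The names `RESOURCE_TYPES` and `makeTypeFilter` are hypothetical and not part of Puppeteer; the set is copied from the list above, and newer Puppeteer versions may report additional types.

```javascript
// Resource types copied from the list above.
const RESOURCE_TYPES = new Set([
    'document', 'stylesheet', 'image', 'media', 'font', 'script', 'texttrack',
    'xhr', 'fetch', 'eventsource', 'websocket', 'manifest', 'other',
]);

// Hypothetical helper: validates the type names and returns (request) => boolean.
function makeTypeFilter(types) {
    for (const type of types) {
        if (!RESOURCE_TYPES.has(type)) {
            throw new Error(`Unknown resource type: ${type}`);
        }
    }

    const blockedTypes = new Set(types);
    return (request) => blockedTypes.has(request.resourceType());
}
```

In a request handler, it could be used as `const isBlocked = makeTypeFilter(['stylesheet', 'script']);` followed by `if (isBlocked(request)) { request.abort(); return; }`.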
Block images
Disabling image loading can significantly reduce the loading time of the page and speed up the rendering process.
You can block (disable) images in Puppeteer by blocking requests with the resource type image.
It is no different from blocking any other resource type:
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({});
    try {
        const page = await browser.newPage();
        // Interception must be enabled before the first request is sent
        await page.setRequestInterception(true);

        page.on('request', (request) => {
            if (request.resourceType() === 'image') {
                const u = request.url();
                console.log(`request to ${u.substring(0, 50)}...${u.substring(u.length - 5)} is aborted`);
                request.abort();
                return;
            }

            request.continue();
        });

        await page.goto('https://unsplash.com/');
    } catch (e) {
        console.log(e);
    } finally {
        await browser.close();
    }
})();
You can see from the logs that it works:
...
request to https://images.unsplash.com/photo-1657047211843-2d...h=594 is aborted
request to https://images.unsplash.com/photo-1657664072464-e5...&q=60 is aborted
...
request to https://secure.insightexpressai.com/adServer/adSer...l.gif is aborted
request to https://secure.insightexpressai.com/adServer/adSer...l.gif is aborted
request to https://secure.insightexpressai.com/adServer/adSer...l.gif is aborted
...
The result without images:
And compare it to the original rendering:
Also, I would consider blocking media, fonts, and stylesheets if your goal is to crawl the page rather than take screenshots of it.
Have a nice day 👋
By the way, in our screenshot API, you can easily block requests by specifying the block requests parameter.
I hope I have helped you tackle request blocking in Puppeteer, and I honestly wish you a nice day!