I'm building out a website that provides data about a set of other websites. I'm about to start populating the initial data by scraping at least ~500k websites using Puppeteer. The Puppeteer script fetches the given URL, grabs a few properties from the page, takes a screenshot (currently after waiting 1s so the page loads more fully), and then saves the screenshot and data to a MongoDB database.

This works fine one site at a time and takes about 1s per site (with or without the screenshot delay, oddly enough). Now I'm wondering how to scale this to run more efficiently across 500k websites. Worst case, 500k is only about 6 days of processing, so I could drop it on an AWS instance and let it run, but presumably there's a good way to parallelize it. Everyone seems to have a different opinion on how, though, and a lot of responses criticize each other's approach.

How would you go about it?

On the plus side, this is only a one-off job (or an occasional re-run if I later audit my data); otherwise one site at a time will be the usual case.
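For context, here is a rough sketch of the per-site step and the kind of parallelism I have in mind: one shared browser with a fixed number of tabs pulling URLs from a shared queue. The function names, the concurrency limit, and the Mongo connection details are illustrative placeholders, not my actual script.

```typescript
import puppeteer, { Browser } from 'puppeteer';
import { MongoClient, Collection } from 'mongodb';

const CONCURRENCY = 10;                               // tabs open at once; tune per instance size
const MONGO_URL = 'mongodb://localhost:27017';        // placeholder connection string

// One URL: load the page, wait 1s for it to settle, grab a property,
// take a screenshot, and write everything to Mongo.
async function scrapeOne(browser: Browser, url: string, sites: Collection) {
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30_000 });
    await new Promise((resolve) => setTimeout(resolve, 1000)); // the 1s load delay
    const title = await page.title();                          // stand-in for "a few properties"
    const screenshot = await page.screenshot();                // Buffer
    await sites.insertOne({ url, title, screenshot, scrapedAt: new Date() });
  } catch (err) {
    // Record failures so a later audit pass can retry them.
    await sites.insertOne({ url, error: String(err), scrapedAt: new Date() });
  } finally {
    await page.close();
  }
}

async function run(urls: string[]) {
  const client = await MongoClient.connect(MONGO_URL);
  const sites = client.db('scraper').collection('sites');
  const browser = await puppeteer.launch();
  const queue = [...urls];

  // N workers share one browser; each keeps pulling the next URL until the queue is empty.
  const workers = Array.from({ length: CONCURRENCY }, async () => {
    let url: string | undefined;
    while ((url = queue.shift()) !== undefined) {
      await scrapeOne(browser, url, sites);
    }
  });
  await Promise.all(workers);

  await browser.close();
  await client.close();
}
```

The idea is that the per-site cost is mostly waiting on network and the 1s delay, so running 10 or so tabs per browser (and possibly several such processes or instances) should cut the wall-clock time well below 6 days without changing the per-site logic.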
Submitted January 29, 2020 at 03:59PM by ReactiveNative