Monday 28 March 2016

Question about structuring/hosting a small-ish node app

I'm building a web scraper in Node and intend to also build a front-end to display the scraped data once all the Node stuff has been taken care of, but I'm feeling a little stuck right now and am not entirely sure how to structure things going forward.

The site/app I want to make does two things: 1) fetch images posted by users in a specific thread on a specific website, download them, and save info to a db; 2) allow users to navigate these images in a pretty little front-end.

What I have right now is a series of functions in Node that send requests out to the thread pages sequentially, search for images posted by users, and pass those image URLs, along with poster metadata (who posted it, the post permalink, post date), to be downloaded to the filesystem and stored in a database once downloaded. I haven't written anything yet to store the info in a database, but that ought to be pretty straightforward.

So while it's pretty simple right now to just download all the images and save them on my own computer, I don't really know how to translate this into something like a Heroku app. One of the main issues is that I would like my server to check, say every 24 hours, whether any new images have been posted to the thread, and if so download them and add them to the database automatically. There are thousands of images, so I think it makes more sense to host them on something like AWS, but I'm not really certain how best to approach all this, especially if the server is hosted in one place (Heroku), the database in another (MongoLab, maybe?), and the static files in another (AWS).

My current approach would be to make a Heroku app that runs the server (for the front-end) and also starts some sort of timed event that checks for new images and, if they exist, downloads/saves them and passes them along to AWS.
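Roughly, the timed part I have in mind looks like the sketch below. Everything here is a placeholder rather than my real code: extractImageUrls is a dependency-free stand-in for a proper HTML parser, and runScrape's steps are stubs.

```javascript
// Sketch of a periodic scrape loop. All names here are illustrative,
// not a real API.

const DAY_MS = 24 * 60 * 60 * 1000;

// Extract image URLs from one page of thread HTML. A real scraper
// would use a parser like cheerio; a regex keeps this sketch
// dependency-free (and is fragile on real-world HTML).
function extractImageUrls(html) {
  const urls = [];
  const re = /<img[^>]+src="([^"]+)"/g;
  let m;
  while ((m = re.exec(html)) !== null) {
    urls.push(m[1]);
  }
  return urls;
}

async function runScrape() {
  // 1. fetch thread pages sequentially
  // 2. extractImageUrls() + collect poster metadata per post
  // 3. download each image, push it to AWS, write a record to the db
  // (each step stubbed out in this sketch)
}

// In-process timer; .unref() lets the process exit if nothing else is
// pending. On Heroku a scheduler add-on would replace this.
setInterval(runScrape, DAY_MS).unref();
```

One thing I'm unsure about is whether an in-process timer like this is even appropriate on Heroku, given that dynos restart, versus running the scrape as a separate scheduled task.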
It seems to me like it would work OK, but I wouldn't be surprised if this is a really terrible solution, or not even a viable one.

Another concern of mine is making the server keep track of where it left off, scraping-wise. If I start the server and it begins scraping at page 1 of the thread, but something happens and my server needs to restart, how will it know that it left off at page 100? The only thing I could really think of is keeping track of this somewhere in the database (i.e., with some sort of "last page scraped" variable), but I don't really feel like that's the best solution either.

Sorry if these are stupid questions. I have a bit of experience in Node, but this is definitely a lot more ambitious than anything I've done before. Generally I feel like I could more or less make things work, but I wouldn't be confident that my methods would be good ones.
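The "last page scraped" idea, as I picture it, would look something like this sketch. The db here is a plain object standing in for a real database call, and the key name lastPageScraped is just an example:

```javascript
// Sketch of resumable scraping via a checkpoint stored alongside the
// data. `db` is a plain object standing in for real database reads and
// writes; `fetchPage` returns the image URLs on a page, or null when
// the page doesn't exist.

async function scrapeFrom(db, fetchPage) {
  // Resume one page after the last page that fully succeeded,
  // or start from page 1 on a fresh run.
  let page = (db.lastPageScraped || 0) + 1;
  for (;;) {
    const images = await fetchPage(page);
    if (images === null) break; // past the end of the thread
    // ...download images, upload to AWS, save records to the db here...
    db.lastPageScraped = page;  // commit the checkpoint only AFTER the
    page += 1;                  // page's downloads/saves succeeded
  }
  return db.lastPageScraped;
}
```

The point being that the checkpoint is only advanced after a page is fully processed, so a crash mid-page just means that page gets re-scraped on restart rather than skipped.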

Submitted March 28, 2016 at 09:55PM by Astro_Bass
