BACKGROUND: I'm building a (non-malicious) web scraper for a specific website. The scraper will log in with a user's login information (provided by the user) and scrape specific information.

Here's the structure I envision (?):

1. mySite hosts a page with an HTML form at http://ift.tt/2mlMs2m that POSTs to a node.js module "scraper".
2. The node.js module "scraper" connects to http://ift.tt/2lCfKeL and uses the form information to log in to the site.
3. "scraper" scrapes information from http://ift.tt/2lCg1y2.

To view any content on the site, a login is required. Loading the page pops up an "Authentication Required" window in the user's browser. As far as I can tell, NTLM is used for the authentication. The response header reads (excerpt):

Request URL: http://ift.tt/2mlMiI8
Request method: GET
Remote address: xxx.xxx.xx.xxx:PPP
Status code: 401 Unauthorized
Version: HTTP/1.1
...
Server: "Microsoft-IIS/7.5"
WWW-Authenticate: "NTLM"
X-MS-InvokeApp: "1; RequireReadOnly"
X-Powered-By: "ASP.NET"

GOOGLE-FU: Most information I've found falls into two categories:
(1) using web scrapers to log in when websites use POST forms for username/password (not applicable, as the NTLM popup isn't a form?);
(2) using NTLM to authenticate users for some web application.

QUESTION: Can I (and if so, how do I) use packages like express-ntlm, node-ntlm, passport-ntlm, etc. in combination with web scraping?

I'm new to node.js (and web development in general), so please excuse and correct any misconceptions I have.
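To make the difference from a form login concrete, here is a minimal sketch of how the "scraper" module might recognize the challenge shown in the header excerpt. It assumes only plain Node.js with no packages, and the function names offeredSchemes and isNtlmChallenge are hypothetical, not part of any library. The point it illustrates is that the login prompt arrives as a 401 status plus a WWW-Authenticate header rather than as an HTML form, so a scraper would have to answer with Authorization headers instead of POSTing form fields. It does not perform the NTLM handshake itself.

```javascript
// Hypothetical helper: list the auth schemes offered in WWW-Authenticate.
// Node's http module lowercases header names and joins repeated headers
// of this kind with ", ", so several schemes can arrive in one string.
function offeredSchemes(headers) {
  const raw = headers['www-authenticate'];
  if (!raw) return [];
  const values = Array.isArray(raw) ? raw : [raw];
  return values
    .flatMap((value) => value.split(','))
    // The scheme, if any, is the first whitespace-separated token.
    .map((part) => part.trim().split(/\s+/)[0])
    // Skip empty pieces and bare parameters such as charset="UTF-8".
    // Simplification: a comma inside a quoted parameter is not handled.
    .filter((token) => token !== '' && !token.includes('='))
    .map((token) => token.toUpperCase());
}

// Hypothetical helper: true when the server answers 401 and offers NTLM.
function isNtlmChallenge(statusCode, headers) {
  return statusCode === 401 && offeredSchemes(headers).includes('NTLM');
}

// Header values taken from the excerpt above.
const exampleHeaders = {
  server: 'Microsoft-IIS/7.5',
  'www-authenticate': 'NTLM',
  'x-powered-by': 'ASP.NET',
};
console.log(isNtlmChallenge(401, exampleHeaders)); // true
```

In a real scraper the status code and headers would come from the response object of an HTTP request to the protected page; the literal object here only stands in for that.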
Submitted February 24, 2017 at 07:07PM by Superiorem