Nmap Development mailing list archives
Re: Web crawling library proposal
From: Fyodor <fyodor () insecure org>
Date: Tue, 1 Nov 2011 12:24:39 -0700
On Wed, Oct 19, 2011 at 12:25:19AM -0700, Paulino Calderon wrote:
Hi list, I'm attaching my working copies of the web crawling library and a few scripts that use it. It would be great if I could get some feedback. All the documentation is here: https://secwiki.org/w/Nmap/Spidering_Library
Thanks Paulino, I'm very excited about having a web spidering system for NSE. I think it could allow for many wonderful scripts (you have some examples on the SecWiki page above, and I think we have more on the script ideas SecWiki page).

I tried the scripts briefly and they seemed to work against http://calder0n.com/sillyapp/ and sectools.org, though the SecTools sitemap included multiple instances of the same page with different anchors, like:

| http://sectools.org/tools4.html#scanrand
| http://sectools.org/tools4.html#canvas
| http://sectools.org/tools4.html#saint
| http://sectools.org/tools4.html#unicornscan
| http://sectools.org/tools4.html
| http://sectools.org/tools4.html#ipfilter

Since anchors are already listed in the HTML of a page, it is probably best to strip those off and only include each page once. Of course, actual page arguments are a different matter.

I also read the documentation and have some comments. First of all (and I realize this would be a big structural change), I wonder if there should be a way to handle pages as they are retrieved. Right now it seems that all of the functions spider the whole target site and then return everything to the calling script at once. Sometimes that is desired, but I can also imagine many cases where a script would rather request and receive one page at a time. Advantages I can see from that approach:

o The script could decide to stop early if desired (e.g. if it finds what it is looking for, if it uses its own heuristics to decide when to give up, or if it realizes the spider is "stuck" in an unfruitful area of the web site and wants to add that path to the blacklist on the fly).

o The script can get (and potentially report) at least some results faster, since it doesn't have to wait for the entire crawl before it can even start looking for email addresses or exploitable CGIs or whatever.
o It can save memory or disk space: the spidering library can parse a page for links, pass it off to the client script, and then discard it (perhaps after it sits in the HTTP cache for a while first), which avoids having to save gigabytes of content in memory or on disk.

o It introduces a short delay between requests while the script processes each page, which makes things a bit easier on the remote web server.

This could work if scripts called the crawler in a loop which returns one page or URL each time it is called.

It was a good idea to send the scripts/library with your mail for this initial version, but for future versions you might want to just point to the nmap-exp URL where you keep the latest version. I think it is a little easier for people to test that way, and it ensures they are getting the latest version as long as they svn up first.

It also occurs to me that it might be great for this to connect with http-enum results. Maybe there should be a registry location where http-enum and other scripts can store URLs discovered on a server, and the spidering library could use that location to store new URLs too.

I wonder if some of the functions (especially things like is_link_anchored, is_uri_local, etc.) would be better off in a new HTML parsing library rather than being part of the crawler?

I also have a few minor questions/suggestions regarding the docs:
httpspider.cachePageContent Turn on to write cache files containing all the crawled page's content. Default value: true
Are these ever deleted?
httpspider.ignore404 Ignores 404 pages extracting links from them.
Does this mean that the system usually extracts links from 404 pages, but when this option is enabled it does not? Maybe this could be rewritten a little more clearly. It should probably mention the default value too.
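To spell out the ambiguity, here are the two readings as illustrative Python (made-up function names, not anything in the library; the real code is Lua):

```python
# Reading 1: links are normally extracted even from 404 pages,
# and setting ignore404 turns that off.
def extract_links_reading1(status: int, ignore404: bool) -> bool:
    return not (ignore404 and status == 404)

# Reading 2: 404 pages are normally skipped, and ignore404 means
# "ignore the 404 status and extract links from the page anyway".
def extract_links_reading2(status: int, ignore404: bool) -> bool:
    return status != 404 or ignore404
```

The two readings give opposite behavior on a 404 response, which is why the wording matters.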
httpspider.path Basepath for the web crawler. Default: "/"
If this is set, will it refuse to go higher than that in the directory tree? This should probably be answered in the documentation.
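If it does confine the crawl, the simplest interpretation would be a prefix check on the normalized URL path before a link is queued. Purely as an illustration of what I mean (a Python sketch; within_basepath is a made-up name, not the library's actual behavior):

```python
import posixpath
from urllib.parse import urlsplit

def within_basepath(url: str, basepath: str = "/") -> bool:
    """Return True if the URL's normalized path stays at or below
    basepath, so the crawler never climbs above it (including via
    ".." segments, which posixpath.normpath collapses first)."""
    path = posixpath.normpath(urlsplit(url).path or "/")
    base = posixpath.normpath(basepath)
    return path == base or path.startswith(base.rstrip("/") + "/")
```

For example, with basepath "/a" this accepts http://host/a/b.html but rejects http://host/other/ and http://host/a/../etc/passwd.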
(from TODO) Limit by crawling depth - Number of levels to crawl. Ex. A crawl depth of 3 will go as far as /a/b/c/ but won't visit /a/b/c/d/(Optional)
Is this based on the actual URL hierarchy, or on the number of links followed from the start page? When I crawl a page, sometimes I only want to follow the first set of links on that page (or maybe two levels), and I think that sort of limit would be more useful than strictly counting the number of slashes in a URI.

Thanks again for your continued work on this. Web crawling is a big and complex task, but I hope this can be improved to the point where it can be integrated with Nmap. It is the sort of infrastructure which can take Nmap in directions we probably can't even imagine yet.

Cheers,
Fyodor

_______________________________________________
Sent through the nmap-dev mailing list
http://cgi.insecure.org/mailman/listinfo/nmap-dev
Archived at http://seclists.org/nmap-dev/