Nmap Development mailing list archives

Re: Web crawling library proposal


From: Fyodor <fyodor () insecure org>
Date: Tue, 1 Nov 2011 12:24:39 -0700

On Wed, Oct 19, 2011 at 12:25:19AM -0700, Paulino Calderon wrote:
> Hi list,
>
> I'm attaching my working copies of the web crawling library and a few
> scripts that use it. It would be great if I can get some feedback.
>
> All the documentation is here:
> https://secwiki.org/w/Nmap/Spidering_Library

Thanks Paulino, I'm very excited about having a web spidering system
for NSE.  I think it could allow for many wonderful scripts (you have
some examples on the SecWiki page above, and I think we have more on
the script ideas SecWiki page).

I tried the scripts briefly and they seemed to work against
http://calder0n.com/sillyapp/ and sectools.org.  However, the SecTools
sitemap included multiple instances of the same page with different
anchors, like:

  | http://sectools.org/tools4.html#scanrand
  | http://sectools.org/tools4.html#canvas
  | http://sectools.org/tools4.html#saint
  | http://sectools.org/tools4.html#unicornscan
  | http://sectools.org/tools4.html
  | http://sectools.org/tools4.html#ipfilter

Since anchors are already listed in the HTML of a page, it is probably
best to strip those off and only include each page once.  Of course
actual page arguments are a different matter.
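Even something as simple as this (an illustrative helper, not anything
from the current library) might be enough to normalize URLs before they
are added to the sitemap:

  -- Strip the "#fragment" part of a URL so each page is only listed
  -- (and fetched) once.  Hypothetical helper; the name is made up.
  local function strip_fragment(u)
    -- parentheses drop gsub's second return value (the match count)
    return (u:gsub("#.*$", ""))
  end

  -- strip_fragment("http://sectools.org/tools4.html#scanrand")
  --   --> "http://sectools.org/tools4.html"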

Also I read the documentation and have some comments.

First of all (and I realize this would be a big structural change),
I wonder if there should be a way to handle pages as they are
retrieved.  Right now, it seems like all of the functions go spider
the whole target site and then return everything to the calling script
at once.  Sometimes that is desired, but I can also imagine many cases
where a script would rather request and receive one page at a time.
Advantages I can see from that approach are:

o The script could decide to stop early if desired (e.g. if it finds
  what it is looking for or if it uses its own heuristics to decide
  when to give up), or it could add a path to the blacklist on the
  fly if it realizes that the spider is "stuck" in an area of the web
  site that doesn't seem fruitful.

o The script can get (and potentially report) at least some results
  faster since it doesn't have to wait for the entire crawl before it
  can even start looking for email addresses or exploitable CGIs or
  whatever.

o It can save memory or disk space if the spidering library parses a
  page for links, passes it off to the client script, and then discards
  it (maybe after it sits in the HTTP cache for a while), which avoids
  having to keep gigabytes of content in memory or on disk.

o This could introduce a short delay between requests as the script
  processes pages, which makes things a bit easier for the remote
  webserver.

This could work if scripts could call the crawler in a loop which
returns one page or URL each time it is called.
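To make that concrete, here is roughly what I picture the script side
looking like.  None of these names (Crawler, next, maxpagecount) exist
in the current library; they are just meant to illustrate the shape of
the loop, with the usual NSE script boilerplate omitted:

  action = function(host, port)
    -- Hypothetical iterator-style interface: one page per call.
    local crawler = httpspider.Crawler:new(host, port, "/",
                                           { maxpagecount = 100 })
    local found = {}
    while true do
      local page = crawler:next()     -- fetch and return exactly one page
      if not page then break end      -- nil means the crawl is finished
      if page.body and page.body:match("[%w.]+@[%w.]+") then
        table.insert(found, page.url) -- e.g. pages with email addresses
      end
      if #found >= 10 then break end  -- the script decides to stop early
    end
    return stdnse.format_output(true, found)
  end

The library would still do the fetching, link extraction, and
bookkeeping internally, but it would only ever need to hold one page at
a time on behalf of the script.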

It was a good idea to send the scripts/library with your mail for this
initial version, but for future versions you might want to just point
to the nmap-exp URL where you keep the latest version.  I think it is
a little bit easier for people to test that way, and it ensures they
are getting the latest version as long as they svn up first.

It occurs to me that it might be great for this to connect with
http-enum results.  Maybe there should be a registry location where
http-enum and other scripts can store discovered URLs on a server.
The spidering library might make use of that location to store new
URLs too.
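For instance (the registry key and table layout here are only a
suggestion), any script that discovers a URL could do something like:

  -- Record a discovered URL in a per-host/port table under
  -- nmap.registry so other scripts (and the spider) can reuse it.
  -- The "www" key and the layout are hypothetical.
  local function record_url(host, port, url)
    nmap.registry.www = nmap.registry.www or {}
    local key = host.ip .. ":" .. port.number
    nmap.registry.www[key] = nmap.registry.www[key] or {}
    nmap.registry.www[key][url] = true   -- table-as-set avoids duplicates
  end

and the spider could seed its queue from the same table.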

I wonder if some of the functions (especially things like
is_link_anchored, is_uri_local, etc.) would be better off in a new HTML
parsing library rather than being part of the crawler?
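Judging only by the names, those could be small, stateless functions in
something like an html.lua (the module name is just an example):

  -- Guesses at what these helpers do, based only on their names;
  -- the real implementations may differ.
  local function is_link_anchored(link)
    return link:find("#", 1, true) ~= nil    -- link contains a "#fragment"
  end

  local function is_uri_local(uri, base_host)
    local host = uri:match("^https?://([^/]+)")
    return host == nil or host == base_host  -- relative, or same host
  end

Helpers like that would be useful to other HTTP scripts even when no
crawl is involved.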

I also have a few minor questions/suggestions regarding the docs:

  | httpspider.cachePageContent Turn on to write cache files containing
  | all the crawled page's content. Default value: true

Are these ever deleted?

  | httpspider.ignore404 Ignores 404 pages extracting links from them.

Does this mean that the system usually extracts links from 404 pages,
but when this option is enabled it does not extract the links? Maybe
this could be rewritten a little more clearly.  It should probably
mention the default value too.

  | httpspider.path Basepath for the web crawler. Default: "/"

If this is set, will it refuse to go higher than that in the directory
tree?  This should probably be answered in the documentation.
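If the answer is yes, a simple containment check (a hypothetical
helper, just to illustrate the intended behavior) would make that
explicit:

  -- Only queue URLs whose path is at or below the configured base
  -- path, e.g. httpspider.path = "/docs/".
  local function is_under_basepath(path, basepath)
    return path:sub(1, #basepath) == basepath
  end

  -- is_under_basepath("/docs/api/index.html", "/docs/") --> true
  -- is_under_basepath("/admin/", "/docs/")              --> false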

  | (from TODO) Limit by crawling depth - Number of levels to crawl.
  | Ex. A crawl depth of 3 will go as far as /a/b/c/ but won't visit
  | /a/b/c/d/ (Optional)

Is this based on the actual URL hierarchy or the number of links from
the start page?  When I crawl a page, sometimes I only want to follow
the first set of links on that page (or maybe 2 levels), and I think
that sort of limit would be more useful than strictly counting the
number of slashes in a URI.
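In other words, I would rather see depth tracked as link distance from
the start page, roughly like this (a sketch only; fetch_links stands in
for "retrieve the page and return its links" and is not a real library
function):

  -- Breadth-first crawl limited by link distance rather than by the
  -- number of slashes in the URI.
  local function crawl(start_url, maxdepth, fetch_links)
    local queue, seen = { { url = start_url, depth = 0 } }, {}
    while #queue > 0 do
      local item = table.remove(queue, 1)
      if not seen[item.url] then
        seen[item.url] = true
        if item.depth < maxdepth then
          for _, link in ipairs(fetch_links(item.url)) do
            table.insert(queue, { url = link, depth = item.depth + 1 })
          end
        end
      end
    end
    return seen   -- every URL visited, however many slashes it contains
  end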

Thanks again for your continued work on this.  Web crawling is a big
and complex task, but I hope this can be improved to the point where
it can be integrated with Nmap.  It is the sort of infrastructure
which can take Nmap in directions we probably can't even imagine yet.

Cheers,
Fyodor
_______________________________________________
Sent through the nmap-dev mailing list
http://cgi.insecure.org/mailman/listinfo/nmap-dev
Archived at http://seclists.org/nmap-dev/

