Nmap Development mailing list archives

Re: [RFC] Mirroring in http-fetch


From: Fyodor <fyodor () nmap org>
Date: Thu, 16 Jul 2015 12:58:47 -0700

On Sun, Jul 12, 2015 at 9:36 AM, Gyanendra Mishra <anomaly.the () gmail com>
wrote:

Hi list,

Through this post I wish to discuss the http-mirror implementation in
http-fetch.


Hi Gyani.  Thanks for sending this!  Regarding your ideas:

We hope to discuss the need for a mirror script,


Personally I'm a big fan of having a mirror script.  It's great to be able
to download a site for more effective and faster searching, for archival
purposes, etc.

---Current Implementation--- [1]

--Terms--


To be honest, this redefining of terms for each approach made your proposal
difficult to read.  When "relative URL" and "absolute URL" mean different
things in each of the sections, it can be hard to follow.  It would
probably work better to use different terms for each different meaning.

 * A relative URL is any URL that doesn't have the protocol or the domain
name specified; "/a/b/h.html" and "h.html" are both relative URLs.
 * An absolute URL is a URL with the protocol and the domain name, e.g.
http://example.com/a/b/h.html.
 * A localized URL is a URL that holds the path to the file in the file
system, e.g. /home/user/Documents/mirror/example.com


I'm not sure why we would ever want the "localized URL"?  It means
everything would break if I move it to a different directory on my
filesystem, or a different machine, and also has minor privacy
implications.  Wouldn't relative URLs using "../" style paths work just as
well and avoid these problems?  Maybe there is some benefit to the
localized URLs that I haven't thought of.
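
To illustrate the "../" idea, here is a minimal Python sketch of computing a
relative link between two files inside a mirror directory. The function name
relative_link and the assumption that both paths are given relative to the
mirror root are hypothetical, chosen only for this example:

```python
import posixpath


def relative_link(from_page, to_resource):
    """Return a "../"-style link from one mirrored file to another.

    Both arguments are paths relative to the mirror root (an assumption
    of this sketch), for example "a/b/h.html".
    """
    # Links are resolved against the directory that contains the page.
    start = posixpath.dirname(from_page) or "."
    return posixpath.relpath(to_resource, start=start)


print(relative_link("a/b/h.html", "a/c/img.png"))  # ../c/img.png
print(relative_link("a/b/h.html", "index.html"))   # ../../index.html
```

Because the result never mentions the mirror's location on disk, the whole
directory can be moved or copied to another machine without breaking links.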

* Case 1: "preserve" and "localize" are both nil: The mirroring is over.
All the relative URLs in all webpages are now absolute URLs.

* Case 2: "preserve" = true and "localize" = nil: The script goes through
all the downloaded pages and converts all the absolute URLs to relative URLs
using relations stored in Table 2. The script doesn't touch the URLs that
haven't been downloaded, as doing that would lead to broken and inaccessible
links.

* Case 3: "preserve" = nil and "localize" = true: The script goes through
all the downloaded pages and converts all the absolute URLs to localized
URLs using relations stored in Table 3. Again, all the URLs to pages that
haven't been downloaded are left untouched, for the same reason as stated in
Case 2.


What do other mirroring tools such as wget and curl do by default?  And by
URLs, are we talking about both <a href> style links to other pages as well
as embedded content such as <img src>?
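
Covering both kinds of reference means looking at more than one tag. As a
rough sketch only, the following Python uses the standard library's
html.parser to collect URLs from links and embedded content alike. The class
name LinkCollector, the helper collect_urls, and the particular list of
tag/attribute pairs are assumptions made for this example, not anything taken
from http-fetch:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect URLs from both page links and embedded resources."""

    # Illustrative, not exhaustive: (tag, attribute) pairs that carry URLs.
    URL_ATTRS = {
        ("a", "href"),
        ("link", "href"),
        ("img", "src"),
        ("script", "src"),
    }

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        for name, value in attrs:
            if value and (tag, name) in self.URL_ATTRS:
                self.found.append((tag, name, value))


def collect_urls(html):
    """Return a list of (tag, attribute, url) tuples found in the HTML."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.found


page = '<a href="/a/b/h.html">page</a> <img src="logo.png">'
print(collect_urls(page))
```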

You've probably thought about it more than me, but my initial thought would be:

o The default would not touch pages at all, but an option would be
available which converts the links to relative paths ("../" ones, not
ones starting with "/") if they are to pages/resources we have downloaded,
and would convert other (non-downloaded) URLs to absolute ones.
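
The conversion option described above could be sketched roughly as follows in
Python. This is only an illustration under stated assumptions: the names
local_path and rewrite_link are hypothetical, the mirror is assumed to store
each page under host/path with "index.html" for directory URLs, and query
strings and fragments are ignored for brevity:

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def local_path(url):
    """Map an absolute URL to a path under the mirror root.

    Assumed layout for this sketch: <host>/<path>, with directory URLs
    saved as index.html. Query strings and fragments are not handled.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"
    return parts.netloc + path


def rewrite_link(page_url, link, downloaded):
    """Rewrite one link found on page_url.

    Links to downloaded resources become "../"-style relative paths;
    all other links become absolute URLs so they keep working.
    """
    absolute = urljoin(page_url, link)
    if absolute in downloaded:
        start = posixpath.dirname(local_path(page_url))
        return posixpath.relpath(local_path(absolute), start=start)
    return absolute


page = "http://example.com/a/b/h.html"
downloaded = {page, "http://example.com/a/c/i.html"}
print(rewrite_link(page, "/a/c/i.html", downloaded))  # ../c/i.html
print(rewrite_link(page, "x.html", downloaded))       # http://example.com/a/b/x.html
```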

I can also see an argument for swapping those so the default is to convert
and there is also the option to preserve.  It might be nice if we can just
have one option with the common cases rather than require users to pick the
right combination of four options.  But I guess that depends on whether
there is really a need to operate all the options independently.

Also, keep in mind that pages can specify a base href tag (
http://www.w3schools.com/tags/tag_base.asp) to specify what relative URLs
are based on.  So rather than convert all the relative URLs to absolute
ones, it may be better to just specify a base href with the original source
URL.
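
As a rough sketch of the base href approach, the Python below inserts a
<base> tag carrying the original source URL instead of rewriting each
relative link. The helper name add_base_href is hypothetical, and the string
search is deliberately naive: it only recognizes a plain "<head>" tag with no
attributes, which a real implementation would need to improve on:

```python
import html as html_lib
from urllib.parse import urljoin


def add_base_href(document, source_url):
    """Insert <base href="..."> right after the first plain <head> tag.

    Naive on purpose: a <head> tag with attributes is not recognized,
    and the document is returned unchanged if no <head> is found.
    """
    marker = "<head>"
    index = document.lower().find(marker)
    if index == -1:
        return document
    insert_at = index + len(marker)
    base_tag = '<base href="%s">' % html_lib.escape(source_url, quote=True)
    return document[:insert_at] + base_tag + document[insert_at:]


source = "http://example.com/a/b/h.html"
print(add_base_href("<html><head><title>t</title></head></html>", source))

# With that base in place, a browser resolves relative links the same way
# urljoin does here:
print(urljoin(source, "img.png"))  # http://example.com/a/b/img.png
```

One trade-off of this approach is that every relative link then points back
at the original server, including links to pages that were mirrored locally.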

Cheers,
Fyodor
_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/
