Nmap Development mailing list archives
Re: [RFC] Mirroring in http-fetch
From: Fyodor <fyodor () nmap org>
Date: Thu, 16 Jul 2015 12:58:47 -0700
On Sun, Jul 12, 2015 at 9:36 AM, Gyanendra Mishra <anomaly.the () gmail com> wrote:

> Hi list, Through this post I wish to discuss the http-mirror implementation in http-fetch.

Hi Gyani. Thanks for sending this! Regarding your ideas:

> We hope to discuss the need for a mirror script,

Personally I'm a big fan of having a mirror script. It's great to be able to download a site for more effective and faster searching, for archival purposes, etc.

> ---Current Implementation--- [1]
> --Terms--
To be honest, this redefining of terms for each approach made your proposal difficult to read. When "relative URL" and "absolute URL" mean different things in each of the sections, it can be hard to follow. It would probably work better to use different terms for each different meaning.

> * A relative URL is any URL that doesn't have the protocol or the domain name specified. "/a/b/h.html" and "h.html" are both relative URLs.
> * An absolute URL is a URL with the protocol and the domain name, e.g. http://example.com/a/b/h.html.
> * Localized URL: a URL that has the path to the file in the file system, e.g. /home/user/Documents/mirror/example.com
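As a rough illustration of the first two definitions above, here is a minimal Python sketch (not the NSE implementation). The function name classify_url is made up for this example, and it assumes "protocol" and "domain" correspond to the scheme and netloc fields that urllib.parse reports.

```python
from urllib.parse import urlsplit

def classify_url(url):
    """Return 'absolute' if the URL names both a scheme and a host, else 'relative'.

    Hypothetical helper for illustration; http-fetch may decide this differently.
    """
    parts = urlsplit(url)
    if parts.scheme and parts.netloc:
        return "absolute"
    return "relative"

for example in ("/a/b/h.html", "h.html", "http://example.com/a/b/h.html"):
    print(example, "->", classify_url(example))
```

Note that a "localized URL" such as /home/user/Documents/mirror/example.com looks the same as a root-relative URL to this check, so a script would need to track that distinction separately.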
I'm not sure why we would ever want the "localized URL"? It means everything would break if I move the mirror to a different directory on my filesystem, or a different machine, and it also has minor privacy implications. Wouldn't relative URLs using "../" style paths work just as well and avoid these problems? Maybe there is some benefit to the localized URLs that I haven't thought of.

> * Case 1: "preserve" and "localize" are both nil: The mirroring is over. All the relative URLs in all webpages are now absolute URLs.
> * Case 2: "preserve" = true and "localize" = nil: The script goes through all the downloaded pages and converts all the absolute URLs to relative URLs using relations stored in Table 2. The script doesn't touch the URLs that haven't been downloaded, as doing that would lead to bad and inaccessible links.
> * Case 3: "preserve" = nil and "localize" = true: The script goes through all the downloaded pages and converts all the absolute URLs to localized URLs using relations stored in Table 3. Again, all the URLs to pages that haven't been downloaded are untouched, for the same reason as stated in Case 2.
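To make the "../" alternative concrete, here is a minimal Python sketch of a Case 2 style rewrite that produces relocatable relative links instead of filesystem paths. It is only an illustration under assumptions: the mirror layout (host directory followed by the URL path, with index.html for directory URLs), the helper names local_path and rewrite_link, and the downloaded set are all hypothetical and stand in for whatever Table 2 actually stores.

```python
import posixpath
from urllib.parse import urlsplit

def local_path(url):
    """Map an absolute URL to a path inside the mirror (assumed layout: host/path)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"  # assumed name for directory-style URLs
    return parts.netloc + path

def rewrite_link(page_url, link_url, downloaded):
    """Return a "../"-style link if link_url was downloaded, else leave it untouched."""
    if link_url not in downloaded:
        return link_url  # rewriting would create a dead link
    start = posixpath.dirname(local_path(page_url))
    return posixpath.relpath(local_path(link_url), start)

downloaded = {"http://example.com/a/b/h.html", "http://example.com/c/i.html"}
print(rewrite_link("http://example.com/a/b/h.html",
                   "http://example.com/c/i.html", downloaded))
```

Because the result depends only on the two pages' positions relative to each other, the mirror directory can be moved or copied to another machine without breaking links, and no local username or directory name ends up in the pages.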
What do other mirroring tools such as wget and curl do by default? And by URLs, are we talking about both <a href> style links to other pages as well as embedded content such as <img src>?

You've probably thought about it more than me, but my initial thought would be:

o The default would not touch pages at all, but an option would be available which converts the links to relative paths ("../" ones, not starting with "/") if they are to pages/resources we have downloaded, and would convert other (non-downloaded) URLs to absolute ones. I can also see an argument for swapping those so the default is to convert and there is also the option to preserve. It might be nice if we can just have one option with the common cases rather than require users to use the right combinations of four options. But I guess that depends on whether there is really a need to operate all the options independently.

Also, keep in mind that pages can specify a base href tag (http://www.w3schools.com/tags/tag_base.asp) to specify what relative URLs are based on. So rather than convert all the relative URLs to absolute ones, it may be better to just specify a base href with the original source URL.

Cheers,
Fyodor
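The base href idea above can be sketched in Python as follows. This is an illustration only, not how http-fetch does or should implement it: add_base_href is a hypothetical helper, it uses a simple regular expression rather than a real HTML parser, and it assumes the page has a <head> tag and no existing <base> element. urljoin is used to show how relative links would then resolve against the original source URL.

```python
import re
from html import escape
from urllib.parse import urljoin

def add_base_href(html, source_url):
    """Insert <base href=...> right after the first <head> tag (hypothetical helper).

    Pages without a <head> tag are returned unchanged.
    """
    tag = '<base href="%s">' % escape(source_url, quote=True)
    return re.sub(r"(<head\b[^>]*>)", lambda m: m.group(1) + tag,
                  html, count=1, flags=re.IGNORECASE)

source = "http://example.com/a/b/h.html"
page = "<html><head><title>t</title></head><body><a href='k.html'>k</a></body></html>"
print(add_base_href(page, source))

# How relative links on that page resolve against the base:
for link in ("k.html", "/c/i.html", "../x.png"):
    print(link, "->", urljoin(source, link))
```

One thing to keep in mind with this approach is that a base element applies to every relative URL on the page, so it would also redirect links that were meant to point at locally downloaded copies.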
_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/