Nmap Development mailing list archives
RE: Developing html parsing- question
From: Giacomo Mantani <giacomo.mantani () studio unibo it>
Date: Thu, 14 Jul 2016 16:40:59 +0000
Hi Daniel and Johanna,

I worked on an HTML parsing library using LPeg some time ago (in order to participate in GSoC). I have attached the code and a TODO; [v] marks items already implemented in html-parser.lua.

Cheers
________________________________________
From: dev [dev-bounces () nmap org] on behalf of Johanna Curiel [johannapcuriel () gmail com]
Sent: Thursday, July 14, 2016 4:50 PM
To: Daniel Miller
Cc: nmap list
Subject: Re: Developing html parsing- question

Thank you Daniel for the feedback.
We'd be glad to help answer any other questions you might have or offer feedback on early drafts. Even a library that collects the existing parsing functions together would be useful, as we can then make incremental improvements that would apply across all the scripts using it.
Given the challenges, it sounds like the goal is to build this as a core library with specific functions (in C or C++) that other modules, such as the NSE engine, can use; correct me if I'm wrong. I am not yet as familiar with development of the core as with NSE scripting. I'm quite familiar with developing in C and I would like to try it out. Do you have any guidelines on core library development and code conventions that should be followed?

Cheers

On Thu, Jul 14, 2016 at 9:34 AM, Daniel Miller <bonsaiviking () gmail com> wrote:

Johanna,

Thanks for putting thought towards this problem. We aren't looking for a script, but for a library of functions to replace and improve the parsing portions of existing scripts. We are also not looking for a DOM-model parser, since in most cases that would be overkill. NSE's needs are for quick extraction of values and location of forms, comments, and "interesting" data from HTML pages. We want a library of functions to handle these kinds of tasks that could be used to replace the various pattern matching portions of existing http-* scripts as well as the HTML-parsing code in http.lua and httpspider.lua.

The challenges we face:

* Unicode and other multi-byte encodings. We should at least be robust enough to handle UTF-8, since the HTML tags would still be ASCII-equivalent.
* Quirks-mode HTML. That means improperly nested tags like <font><a>text</font></a> or unescaped entities like & or < within quoted attributes, and other things that would generally break an XML parser. These can introduce ambiguity in the DOM model, which is part of why we would avoid that method.
* Mixed-case, strange whitespace, irregular use of quote characters, HTML within javascript strings between <script> tags, XHTML vs HTML4 vs HTML5, and other general weirdness.

We'd be glad to help answer any other questions you might have or offer feedback on early drafts.
Even a library that collects the existing parsing functions together would be useful, as we can then make incremental improvements that would apply across all the scripts using it.

Dan

On Tue, Jul 12, 2016 at 10:45 PM, Johanna Curiel <johannapcuriel () gmail com> wrote:

Hello,

Looking at the priority list for NSE scripts:
https://secwiki.org/w/Nmap/Script_Ideas#HTML_parsing

I checked the http-title.nse script and, correct me if I'm wrong, it does not seem to use the slaxml library. Is the idea to create this HTML-parsing script for the entire DOM (not just the title or a single HTML tag) using slaxml? For example, the script could be called http-parsing.nse, and it would dissect an entire HTML page with its tags.

Your feedback is appreciated.

Cheers

_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/
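
For readers of the archive, the kind of DOM-free extraction Daniel describes in this thread (pulling out titles, comments, and form or link attributes while tolerating mixed case, odd whitespace, irregular quoting, and badly nested tags) might look roughly like the sketch below. It is written in Python purely for illustration; it is not NSE Lua and is not the attached html-parser.lua. The function names extract_title, extract_comments, and find_tags are hypothetical, and the regular expressions are an assumption about one workable approach, not the project's method.

```python
import re

# Hypothetical helpers for illustration only; none of these names come
# from Nmap, NSE, or the attached html-parser.lua.

# <title> in any letter case, with stray whitespace inside the tags.
_TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>",
                       re.IGNORECASE | re.DOTALL)

# HTML comments, possibly spanning several lines.
_COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)

# One attribute: a name, optionally followed by "=" and a value that is
# double-quoted, single-quoted, or not quoted at all.
_ATTR_RE = re.compile(
    r"""([^\s"'<>/=]+)
        (?:\s*=\s*
            (?: "([^"]*)"
              | '([^']*)'
              | ([^\s"'<>`=]+)
            )
        )?""",
    re.VERBOSE,
)


def extract_title(html):
    """Return the page title with whitespace collapsed, or None."""
    m = _TITLE_RE.search(html)
    if m is None:
        return None
    return " ".join(m.group(1).split())


def extract_comments(html):
    """Return the text of every HTML comment, stripped of padding."""
    return [c.strip() for c in _COMMENT_RE.findall(html)]


def find_tags(html, name):
    """Return one attribute dict per start tag called `name`.

    Nesting is ignored on purpose, so <font><a>text</font></a> is not a
    problem. A ">" inside a quoted attribute value does not end the tag.
    """
    tag_re = re.compile(
        r"<" + re.escape(name) + r"""\b((?:[^>"']|"[^"]*"|'[^']*')*)>""",
        re.IGNORECASE,
    )
    results = []
    for tag in tag_re.finditer(html):
        attrs = {}
        for a in _ATTR_RE.finditer(tag.group(1)):
            # Take whichever quoting style matched; bare names get "".
            value = next((g for g in a.group(2, 3, 4) if g is not None), "")
            # Keep the first occurrence of a duplicated attribute.
            attrs.setdefault(a.group(1).lower(), value)
        results.append(attrs)
    return results
```

Calling find_tags(body, "form") on a response body would give one dict per form start tag, with attribute names lowercased. The sketch leaves known gaps open: an unbalanced quote inside a tag can hide that tag or swallow later markup, and markup that appears inside javascript strings between <script> tags is matched like any other markup.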
Attachment:
html-parser.lua
Description: html-parser.lua
Attachment:
TODO
Description: TODO
Current thread:
- Developing html parsing- question Johanna Curiel (Jul 12)
- Re: Developing html parsing- question Daniel Miller (Jul 14)
- Re: Developing html parsing- question Johanna Curiel (Jul 14)
- RE: Developing html parsing- question Giacomo Mantani (Jul 14)
- Re: Developing html parsing- question Johanna Curiel (Jul 14)
- Re: Developing html parsing- question Daniel Miller (Jul 14)