Nmap Development mailing list archives
Re: GSoC Carna analysis + research + optimization role: looking for feedback
From: Henri Doreau <henri.doreau () gmail com>
Date: Thu, 25 Apr 2013 10:03:59 +0200
2013/4/21 Stephen Caraher <moskvax () gmail com>:
Hi nmap-dev, I'm a third-year undergraduate student studying computer science (and philosophy) at the University of Sydney, and I've been preparing to apply for the performance/optimization specialist role as outlined on the wiki. I've used nmap casually and irregularly since I started running exclusively Linux on my machines in 2007, but more regularly in the last six months as my interests have shifted more towards information and computer security. I've got a few years of C, Python, and shell scripting experience, know Scheme fairly well, and have around one year of C++ experience (although mainly with code I've written myself). I haven't used Lua myself, however it appears similar enough to Python/Ruby/bash that picking it up quickly shouldn't be hard for me (I can understand what is going on in most of the NSE scripts I've looked at). Analyzing the Carna/IC2012 data is listed as a component of this role, and it is the aspect that I'm most interested in, along with the large-scale scanning research. Most of this email is my thoughts on the ways nmap could use the Carna data. Fyodor outlined three tasks he thought the Carna data could be used for in his email on nmap-dev: updating port frequency lists, investigating poorly matched service-probe (and, i suppose, os-db?) fingerprints, and incorporating the rDNS information somehow into nmap. The first two tasks are well defined, and assuming the data is put into a manageable form (more on this later), I think I'd be able to do them fairly rapidly. Patrick's suggestion on nmap-dev of collecting statistics on scripts for analysis and debugging unfortunately doesn't appear to be possible as it appears the Carna bots didn't employ NSE themselves, although NSE is mentioned as being instrumental to the idea. The speculations that Fyodor has on how the rDNS data could be used is representative of the potential uses for this data that I find most exciting, though. 
If I'm accepted for this role, I'd firstly like to concentrate on making the Carna/IC2012 survey data more accessible to myself and other nmap developers. In the GSoC IRC meeting, Henri suggested setting up a Hadoop cluster; given university approval, I could set one up in a remotely accessible way on university machines, or otherwise with a cloud service. The author of the paper used Hadoop and Pig to work with the data (I'd like to see how well Hive performs as well). I'd like making the data more available to be the first thing I do, so that I or any nmap developer can start asking useful questions of it ASAP. The most interesting ways of using this data may not be obvious, or even possible, without the ability to query the data at will in arbitrary ways.

One idea I've come up with is to reduce the resolution and range of the data (although not so much as to make it useless) and find a good way of compressing it such that it can be packaged as a stand-alone library. An NSE script could then be written to query the database on argument-specified attributes of the target and return census-derived information. Alternately, a remote service like the one outlined above, but publicly accessible, could be used by an external-category version of the script; this way, detailed census information about specific targets could be given. However, the cost of running such a service seems like it would be prohibitively high due to the size and complexity of the data, even if queries are limited or the data is reduced in the way outlined above (unless it's slow, perhaps). In any case, I believe there's some way for this data to be made generally useful through an NSE script, which I'd like to write and design the infrastructure for. Another idea is finding ways to use the data as a source of heuristics to speed up scans. Any suggestions/questions/comments on these ideas would be much appreciated.
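[The "reduce the resolution and range" idea above can be illustrated concretely: instead of per-host records, keep one summary per covering prefix (say, a /16), so the packaged table shrinks by orders of magnitude while a script can still look up census-derived expectations for a target's neighborhood. A minimal sketch under those assumptions, with a hypothetical input format of (ip, open port) pairs:]

```python
import ipaddress
from collections import defaultdict

def build_index(records, prefixlen=16):
    """Aggregate per-host observations into per-prefix open-port counts,
    trading resolution for a far smaller, packageable lookup table."""
    index = defaultdict(lambda: defaultdict(int))
    for ip, port in records:
        net = ipaddress.ip_network(f"{ip}/{prefixlen}", strict=False)
        index[str(net)][port] += 1
    return index

def lookup(index, target, prefixlen=16):
    """Return the open-port summary for the prefix covering `target`."""
    net = ipaddress.ip_network(f"{target}/{prefixlen}", strict=False)
    return dict(index.get(str(net), {}))

# Hypothetical observations: two hosts in the same /16.
open_ports = [("203.0.113.5", 80), ("203.0.113.9", 80), ("203.0.113.9", 22)]
idx = build_index(open_ports)
# Any address in 203.0.0.0/16 maps to the same aggregate summary.
print(lookup(idx, "203.0.113.77"))
```

[An NSE script shipping such a table would do the equivalent prefix lookup in Lua; the sketch only shows the shape of the reduction, not the compression format, which would need separate investigation.]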
I also have the following questions:

* Are there specific things people would already like to do with the Carna data? (If so, I could implement or help implement them.)
* What will the large-scale scanning research involve?
* Does working with the Carna data and the large-scale scanning research constitute enough work for a full GSoC term? (If not, I would also want to work on implementing optimizations based on memory, I/O, and performance profiling, as mentioned on the wiki.)

Thanks for your time,
Stephen Caraher
Hi Stephen,

thanks for your interest and for this interesting introduction and proposal. I guess the first thing with the Carna data will be to figure out a way to manipulate it efficiently, to explore it and see what it contains exactly. Compressing the 9TB into something distributable with nmap sounds challenging, I'd even say unlikely... So, an initial step will be to set up a system that allows you to efficiently query the dataset. Hadoop was just a suggestion; it might well be that other (including custom) tools are more suitable.

From there on, new ideas about how to improve nmap are likely to emerge. The scope of the project may vary, depending on how much we think nmap can benefit from what's discovered, but I definitely think that conducting the analysis and improving nmap accordingly is enough work for the few months GSoC lasts, at least.

Regards

--
Henri

_______________________________________________
Sent through the dev mailing list
http://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/
Current thread:
- GSoC Carna analysis + research + optimization role: looking for feedback Stephen Caraher (Apr 21)
- Re: GSoC Carna analysis + research + optimization role: looking for feedback Henri Doreau (Apr 25)
- Re: GSoC Carna analysis + research + optimization role: looking for feedback David Fifield (Apr 25)