Nmap Development mailing list archives

Re: GSoC Carna analysis + research + optimization role: looking for feedback


From: Henri Doreau <henri.doreau () gmail com>
Date: Thu, 25 Apr 2013 10:03:59 +0200

2013/4/21 Stephen Caraher <moskvax () gmail com>:
Hi nmap-dev,

I'm a third-year undergraduate student studying computer science (and
philosophy) at the University of Sydney, and I've been preparing to
apply for the performance/optimization specialist role as outlined on
the wiki. I've used nmap casually and irregularly since I started
running exclusively Linux on my machines in 2007, but more regularly
in the last six months as my interests have shifted more towards
information and computer security. I've got a few years of C, Python,
and shell scripting experience, know Scheme fairly well, and have
around one year of C++ experience (although mainly with code I've
written myself). I haven't used Lua, but it appears similar
enough to Python/Ruby/bash that picking it up quickly shouldn't be
hard for me (I can understand what is going on in most of the NSE
scripts I've looked at).

Analyzing the Carna/IC2012 data is listed as a component of this role,
and it is the aspect that I'm most interested in, along with the
large-scale scanning research. Most of this email is my thoughts on
the ways nmap could use the Carna data.

Fyodor outlined three tasks he thought the Carna data could be used
for in his email on nmap-dev: updating port frequency lists,
investigating poorly matched service-probe (and, I suppose, os-db?)
fingerprints, and incorporating the rDNS information somehow into
nmap. The first two tasks are well defined, and assuming the data is
put into a manageable form (more on this later), I think I'd be able
to do them fairly rapidly. Patrick's suggestion on nmap-dev of
collecting statistics on scripts for analysis and debugging
unfortunately doesn't appear to be possible, since the Carna bots
didn't employ NSE themselves, even though NSE is mentioned as being
instrumental to the idea. Fyodor's speculations on how the rDNS data
could be used are, though, representative of the potential uses for
this data that I find most exciting.
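
To make the port-frequency task concrete, here's a rough sketch of
the kind of pass I have in mind. The record layout (whitespace-
separated "ip timestamp port state" lines) is only my guess at the
dump format, and the frequency calculation (opens over probes per
port) may well differ from how the nmap-services values were
originally derived:

#!/usr/bin/env python
# Sketch: recompute nmap-services-style frequency values from census
# port-scan records. The input layout is assumed, not verified.
import sys
from collections import Counter

opens = Counter()
probes = Counter()

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 4:
        continue
    ip, timestamp, port, state = fields[:4]
    probes[port] += 1
    if state == "open":
        opens[port] += 1

# nmap-services lines look like "name port/proto frequency"; real
# service names would be carried over from the existing file.
for port, count in opens.most_common():
    print("unknown %s/tcp %.6f" % (port, count / float(probes[port])))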

If I'm accepted for this role, I'd firstly like to concentrate on
making the Carna/IC2012 survey data more accessible to myself and
other nmap developers. In the GSoC IRC meeting, Henri suggested
setting up a Hadoop cluster; given university approval, I could set
one up in a remotely accessible way on university machines, or
otherwise with a cloud service. The paper's author used Hadoop and
Pig to work with the data (I'd also like to see how well Hive
performs).
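
As a first test of whichever cluster I end up with, something as
simple as a Hadoop Streaming job should do; for instance (again
assuming the same guessed record layout):

#!/usr/bin/env python
# mapper.py -- Hadoop Streaming mapper counting open observations per
# port; the matching reducer would just sum the 1s for each key.
# Invocation sketch: hadoop jar hadoop-streaming.jar \
#   -mapper mapper.py -reducer reducer.py -input census/ -output counts/
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) >= 4 and fields[3] == "open":
        # emit "port<TAB>1"; Hadoop groups by key before reducing
        sys.stdout.write("%s\t1\n" % fields[2])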

Making the data more available is the first thing I'd like to do, so
that I or any other nmap developer can start asking useful questions of
it ASAP. The most interesting ways of using this data may not be
obvious or possible without the ability to query the data at will in
arbitrary ways.

One idea I've come up with is to reduce the resolution and range of
the data (though not so much that it becomes useless) and find a good
way of compressing it such that it can be packaged as a stand-alone
library. An NSE script could then be written to query the database on
argument-specified attributes of the target and return census-derived
information.
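
As a very rough illustration of the reduced-resolution idea, census
results could be collapsed to one record per /16 and packed into a
small sorted binary table that a script could binary-search at scan
time. The record layout below is invented for the sketch, as is the
"ip timestamp state" input format:

#!/usr/bin/env python
# Sketch: pack per-/16 host-up counts into a sorted binary table.
# Record layout (invented): uint16 prefix, uint16 seen, uint16 up.
import struct
import sys
from collections import defaultdict

seen = defaultdict(int)
up = defaultdict(int)

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 3:
        continue
    a, b = fields[0].split(".")[:2]
    prefix = (int(a) << 8) | int(b)   # first two octets form the /16 key
    seen[prefix] += 1
    if fields[2] == "up":
        up[prefix] += 1

with open("census16.bin", "wb") as out:
    for prefix in sorted(seen):       # sorted keys allow binary search
        out.write(struct.pack(">HHH", prefix,
                              min(seen[prefix], 0xFFFF),
                              min(up[prefix], 0xFFFF)))

At six bytes per /16 that's well under 400KB for the whole IPv4
space, which seems small enough to distribute.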

Alternatively, a remote service like the one outlined above, but
publicly accessible, could be used by an external-category version of
the script; this way, detailed census information about specific
targets could be given. However, the cost of running such a service
seems like it would be prohibitively high due to the size and
complexity of the data, even if queries are limited or the data is
reduced in the way outlined above (unless a slow service is
acceptable, perhaps).
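
For illustration, the lookup such a script performs would boil down
to something like the following (the endpoint and JSON shape are
entirely made up here; an external-category NSE script would make the
equivalent HTTP request):

#!/usr/bin/env python
# Sketch of the external-category lookup. Endpoint and fields invented.
import json
import urllib2  # Python 2

def census_lookup(ip):
    url = "https://census.example.org/v1/host/%s" % ip  # hypothetical
    return json.load(urllib2.urlopen(url, timeout=10))

if __name__ == "__main__":
    info = census_lookup("198.51.100.7")  # RFC 5737 documentation address
    print("%s %s" % (info.get("rdns"), info.get("open_ports")))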

In any case, I believe there's some way for this data to be made
generally useful through an NSE script, which I'd like to write and
design the infrastructure for. Another idea, which may be more
promising, is finding ways to use the data as a source of heuristics
to speed up scans.
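
For instance, census-derived priors for a target's /16 (such as the
packed table sketched above could provide) might reorder which ports
get probed first; the probabilities below are placeholders:

# Sketch: probe the ports most likely to be open (per census-derived
# priors for the target's network) first.
def order_ports(ports, prior):
    return sorted(ports, key=lambda p: prior.get(p, 0.0), reverse=True)

prior = {80: 0.12, 22: 0.08, 443: 0.05, 8080: 0.01}  # placeholders
print(order_ports([22, 80, 443, 8080, 25], prior))
# -> [80, 22, 443, 8080, 25]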

Any suggestions/questions/comments on these ideas would be much
appreciated. I also have the following questions:

* Are there specific things people would already like to do with the
Carna data? (if so, I could implement or help on implementing them)

* What will the large scale scanning research involve?

* Does working with the Carna data and large scale scanning research
constitute enough work for a full GSoC term? (if not, I would also
want to work on implementing optimizations based on memory, I/O, and
performance profiling as mentioned on the wiki)

Thanks for your time,

Stephen Caraher

Hi Stephen,

Thanks for your interest, and for this thoughtful introduction and proposal.

I guess the first thing with the Carna data will be to figure out a
way to manipulate it efficiently, to explore it and see what exactly
it contains. Compressing the 9TB into something distributable with
nmap sounds challenging; I'd even say unlikely...

So, an initial step will be to set up a system that allows you to
efficiently query the dataset. Hadoop was just a suggestion; it may
well be that other (including custom) tools are more suitable. From
there on, new ideas about how to improve nmap are likely to emerge.

The scope of the project may vary, depending on how much we think nmap
can benefit from what's discovered, but I definitely think that
conducting the analysis and improving nmap accordingly is enough work
for the few months GSoC lasts, at least.

Regards

--
Henri
_______________________________________________
Sent through the dev mailing list
http://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/

