Nmap Development mailing list archives

Re: Request for Comments: New IPv6 OS detection machine learning engine


From: David Fifield <david () bamsoftware com>
Date: Sun, 7 Aug 2016 23:06:20 -0500

On Thu, Jul 21, 2016 at 12:17:50PM +0530, Prabhjyot Singh Sodhi wrote:
This document explains the solution that we experimented with and are
proposing as a replacement for the current implementation.

*Aim*: The goal of this model is to correctly guess the target operating
system based on network probes and the differences in how operating systems
react to those probes.

*Data*: As of now we have a total of 301 prints. We use these prints to
generate the features for our models (695 features as of now).

*Model being used*: The current model uses a logistic regression learner to
predict operating systems.
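
For concreteness, that pipeline might be sketched as follows (scikit-learn is
assumed, and the variable and file names are hypothetical):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Hypothetical inputs: X is the 301 x 695 feature matrix derived
    # from the prints, y holds one group label per print.
    X = np.load("features.npy")
    y = np.load("groups.npy")

    # One-vs-rest logistic regression over the group labels.
    clf = LogisticRegression(multi_class="ovr", max_iter=1000)
    print("5-fold accuracy:", cross_val_score(clf, X, y, cv=5).mean())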

*Change in data representation*: The current database representation used by
the logistic regression model is based on the fact that all prints that are
members of a group are very similar to each other (value-wise). This is in
contrast to how classes work in normal learning systems (a learning system
being one in which you try to teach a system to do something, in our case
predicting the OS). Usually, we would have a target variable (the operating
system in our case), with one group for some operating system (or a set of
operating systems), and all prints corresponding to that operating system
would go into the group. And that is exactly what we have attempted with the
new representation.

Now, to achieve this, one simple solution could have been to have one group
for each operating system (each version, so one per Linux kernel version).
Given the low number of prints, this would have resulted in a very high
number of groups with very few prints in each group, which would have made
prediction more difficult. That is why we tried to keep similar operating
systems in the same group.

We were able to do this for Windows, IBM, Macintosh, and FreeBSD type
systems. For Linux, we decided to stick with the existing representation
(with small changes) because of the complexity of the way those groups were
made.
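
Purely as illustration, the regrouping amounts to a mapping from version
strings to coarse class labels, something like the sketch below; the rules
and group names here are hypothetical, since the real groups are
hand-curated:

    # Hypothetical coarse grouping of OS version strings into classes.
    def group_for(os_name):
        for substring, group in [("Windows", "Windows"),
                                 ("FreeBSD", "FreeBSD"),
                                 ("Mac OS X", "OS X"),
                                 ("AIX", "IBM AIX")]:
            if substring in os_name:
                return group
        return os_name   # e.g. Linux keeps its existing groups

    print(group_for("FreeBSD 10.3"))   # -> FreeBSD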

It's true, the current scheme of grouping fingerprints by
characteristics such as whether they had a closed port is probably a
mistake. It's partly an artifact of having worked with the IPv4
classifier, which required a proliferation of classes, and partly
because we initially seeded the classes by clustering similar
fingerprints. I think that Dan has made the situation somewhat better
recently, merging some classes. But I think this phenomenon mainly
affects Linux and Windows fingerprints. For the others, there aren't so
many fingerprints so the classes tend to be broader. Some of them have
annotations like "Open, no closed", but in some cases those are merely
descriptive of the fingerprints we have, not a specification for what
belongs in the class.

*Models experimented with*:
i) Random Forest (RF):

ii) Multi Stage Random Forest (MSRF):

It'd be great if you could review this and help us with some valuable feedback.

A decision tree-based classifier sounds like a good idea. Of course it
all comes down to the performance of the implementation. Do you have
some code or evaluation results?
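
For concreteness, here is roughly how I would picture a two-stage forest;
this is my guess at what MSRF means, not necessarily your implementation
(scikit-learn, with hypothetical X, y_family, and y_version training arrays):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Stage 1: one forest predicts the coarse family (Windows, Linux, ...).
    family_clf = RandomForestClassifier(n_estimators=100)
    family_clf.fit(X, y_family)

    # Stage 2: one forest per family predicts the version group.
    version_clf = {}
    for fam in np.unique(y_family):
        mask = (y_family == fam)
        version_clf[fam] = RandomForestClassifier(n_estimators=100)
        version_clf[fam].fit(X[mask], y_version[mask])

    def predict(x):
        fam = family_clf.predict([x])[0]
        return fam, version_clf[fam].predict([x])[0]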

One problem you might have with making classes based strictly on OS
versions is that some fingerprints may belong to different versions but
be indistinguishable. For example, you might have training samples for
Linux 2.6.22 and Linux 2.6.23 as leaves in one of your decision trees,
even though they have the same network behavior. You might have to have
some kind of cutoff where you decide that all leaves below certain nodes
belong to the same class. (That kind of cutoff is basically what we are
trying to simulate in the current system of manually curated classes. We
rely on the human integrator having some expert knowledge of what
version ranges should be distinguishable from what others.)
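
One crude way to implement that cutoff before training is to merge versions
whose training samples are indistinguishable. In this sketch
"indistinguishable" just means identical feature vectors (an assumption), and
`samples` is a hypothetical list of (feature-tuple, version) pairs:

    from collections import defaultdict

    # Collapse versions that never differ in any feature into one class
    # whose label is the union of the version labels.
    by_features = defaultdict(set)
    for features, version in samples:
        by_features[features].add(version)

    merged_label = {features: " or ".join(sorted(versions))
                    for features, versions in by_features.items()}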

Another difficulty I can foresee is that you're going to have a lot of
classes populated by a small number of fingerprints, which might not
well represent the full diversity of the class in the wild. For example,
the Linux TCP window scale option can depend on how much RAM is
installed (https://homes.cs.washington.edu/~yoshi/papers/fuzzing_aisec2010.pdf).
Whereas all your Linux fingerprints taken together might have a good
sampling of window scales, when you chop them up into little groups your
decision tree might draw incorrect conclusions like "Linux 2.6.x always
has WScale=2". In fact, in this particular case I would recommend trying
automatically generating synthetic fingerprints with various window
scale options (maybe during the training process). That's what's going
on in the IPv4 database when you see entries like:
OPS(O1=M5B4ST11NW1|M5B4ST11NW2|M5B4ST11NW3|M5B4ST11NW4|M5B4ST11NW5|M5B4ST11NW6|M5B4ST11NW7|M5B4ST11NW8|M5B4ST11NW9%O2=...
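
In code, that kind of augmentation could be as simple as this sketch (the
options string format is borrowed from the IPv4 database entry above; the
function name is hypothetical):

    import re

    # Expand one options string into synthetic variants covering window
    # scales 1 through 9, as in the M5B4ST11NW1|...|NW9 alternation above.
    def wscale_variants(options):
        return [re.sub(r"NW\d+", "NW%d" % w, options) for w in range(1, 10)]

    print(wscale_variants("M5B4ST11NW2"))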

How do you plan to handle novel fingerprints? One way to evaluate this
would be to hold out an entire class during training, and test whether
the fingerprints of the held-out class match some other existing class
or are properly detected as novel.
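
In sketch form, assuming a classifier that exposes predicted probabilities
(X and y are hypothetical training arrays, and the confidence threshold is a
hypothetical stand-in for whatever novelty test you use):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    THRESHOLD = 0.5   # hypothetical; would need tuning

    for held_out in np.unique(y):
        train = (y != held_out)
        clf = RandomForestClassifier(n_estimators=100)
        clf.fit(X[train], y[train])
        # Fingerprints of the held-out class should match no class confidently.
        probs = clf.predict_proba(X[~train])
        print(held_out, "flagged novel:",
              (probs.max(axis=1) < THRESHOLD).mean())
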
_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/