Nmap Development mailing list archives

Re: Request for Comments: New IPv6 OS detection machine learning engine


From: Fyodor <fyodor () nmap org>
Date: Thu, 9 Feb 2017 17:07:01 -0800

On Fri, Jan 20, 2017 at 9:31 AM, Mathias Morbitzer <m.morbitzer () runbox com>
wrote:

> Meanwhile, I managed to get feedback on the new implementation we started
> last summer from some people who know their ML.
>
> Let me start by saying that we are doing quite well! :) However, of course
> there are things we could improve / comments on future work:

Thanks Mathias, this is great feedback!  Regarding your exact notes:

> 1) Considering the size of our DB (300+ fingerprints), the random forest
> model is a good choice. To make use of more complex models, such as neural
> networks or deep learning, we would need a much bigger database. Therefore,
> I suggest sticking with random forest for the near future and instead
> focusing on improving in other areas.


That makes sense.  The IPv6 DB is definitely not mature yet (in terms of
number of fingerprints), but for comparison we can look at IPv4.  It has
5,336 fingerprints but in terms of unique class lines (which I think is a
more apt comparison to what we use as IPv6 fingerprints) we have 1,384.  So
I'd say our IPv6 OS DB will stay below 2,000 prints for the foreseeable
future.
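For readers who want to experiment with this, here's a rough sketch of training a random forest classifier on fingerprint-style feature vectors using scikit-learn. The data is synthetic and the sizes are just the rough figures discussed above; none of this is Nmap's actual engine code.

```python
# Illustrative sketch only: a random forest over synthetic "fingerprint"
# feature vectors, sized roughly like the DB discussed (hundreds of
# classes, ~695 features). Nmap's real engine uses its own feature
# extraction and database.
from sklearn.ensemble import RandomForestClassifier
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_features = 300, 695           # rough figures from the thread
X = rng.normal(size=(600, n_features))     # synthetic probe-response features
y = rng.integers(0, n_classes, size=600)   # synthetic OS class labels

clf = RandomForestClassifier(n_estimators=25, random_state=0)
clf.fit(X, y)
pred = clf.predict(X[:5])                  # predicted class for 5 samples
```

With only a few samples per class, the forest mostly memorizes the training data; the point is just the shape of the problem, not accuracy.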

> 2) Since random forests (and other types of ensemble models) are already
> multi-model, multi-stage would not improve accuracy. However, it should
> also not make it worse. Since we have multiple reasons to prefer the
> multistage approach, this is good news.


What are the other reasons to prefer the multistage approach if it doesn't
improve accuracy?  I guess maybe as an easy way to give the broad/rough
match of OS family (such as "Windows") even if we don't have full
confidence in a precise version?
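One way that coarse-then-fine idea could look in code (an invented sketch on synthetic data with scikit-learn; the family/version labels and nesting are assumptions for illustration, not how Nmap structures its classes):

```python
# Hypothetical two-stage sketch: predict a coarse OS family first, then
# refine to a version label using a per-family classifier. All data and
# labels here are synthetic.
from sklearn.ensemble import RandomForestClassifier
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 20))
family = rng.integers(0, 3, size=400)              # e.g. Windows/Linux/BSD
version = family * 10 + rng.integers(0, 4, 400)    # versions nested in families

stage1 = RandomForestClassifier(random_state=0).fit(X, family)
stage2 = {f: RandomForestClassifier(random_state=0)
             .fit(X[family == f], version[family == f])
          for f in np.unique(family)}              # one model per family

x = X[:1]
fam = stage1.predict(x)[0]       # always get at least a family guess...
ver = stage2[fam].predict(x)[0]  # ...then refine within that family
```

Even when the version guess is shaky, stage 1 still yields the broad family match mentioned above.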


> 3) In terms of evaluation, the 80:20 split is not a good idea since the
> test set is too small; this will create high variance in precision. It
> would be better to re-run the tests multiple times with a 50:50 split, and
> then check the mean average precision and variance.


I'm not exactly sure what this means but will take your word for it :).
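For what it's worth, the suggestion seems to amount to repeating the split-and-score loop and reporting the mean and variance of the score, roughly like this (synthetic data and labels; just a sketch of the evaluation procedure, not our actual test harness):

```python
# Sketch of the proposed evaluation: repeat random 50:50 train/test
# splits and report the mean and variance of the test score.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 30))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic learnable labels

scores = []
for seed in range(10):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5,
                                          random_state=seed)
    clf = RandomForestClassifier(n_estimators=30, random_state=0)
    clf.fit(Xtr, ytr)
    scores.append(clf.score(Xte, yte))

print(f"mean={np.mean(scores):.3f} var={np.var(scores):.5f}")
```

The variance across runs is what a single 80:20 split hides.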


> 5) As we already thought, having 695 features is quite a lot. Approaches
> to reduce the number of features could include, for example, using neural
> networks or principal component analysis (PCA). We did play around with
> such things a bit before, but it might be interesting to have another look.


I'm not exactly familiar with these either, but it definitely sounds like
it's worth a look!
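As a starting point, PCA itself is only a couple of lines with scikit-learn. A minimal sketch on random stand-in data (the component count of 40 is an arbitrary illustration, not a recommendation):

```python
# Sketch: PCA to compress 695-dimensional feature vectors before
# classification. X is random stand-in data; real fingerprint features
# would replace it.
from sklearn.decomposition import PCA
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 695))

pca = PCA(n_components=40)   # project onto 40 principal components
Xr = pca.fit_transform(X)    # Xr.shape == (500, 40)
```

In practice one would pick the component count by looking at `pca.explained_variance_ratio_` rather than guessing.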

> 6) I also learned that ML might not always be the best solution when it
> comes to figuring out exactly one perfect match. ML is good at providing
> the top k results, from which there is a high probability that one is
> correct. So this might also be something to consider in


Interesting...
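If we wanted to try that, a random forest already exposes per-class probabilities, so reporting top-k candidates instead of a single match is straightforward. A sketch on synthetic data:

```python
# Sketch: report the top-k candidate classes (e.g. OSes) rather than a
# single best match, using the forest's class probabilities.
from sklearn.ensemble import RandomForestClassifier
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 15))
y = rng.integers(0, 8, size=300)   # 8 synthetic OS classes

clf = RandomForestClassifier(random_state=0).fit(X, y)
proba = clf.predict_proba(X[:1])[0]        # probability per class
top3 = np.argsort(proba)[::-1][:3]         # indices of 3 most likely classes
candidates = [clf.classes_[i] for i in top3]
```

That maps nicely onto how Nmap already prints multiple OS guesses with confidence percentages when no exact match exists.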


> 7) And finally, I've been told that we could also try the non-ML approach
> of signature-based checking.


Well, we at least have the signature-based IPv4 OS detection system for
comparison.  That has worked pretty well for us, although our hope was that
the machine learning IPv6 system would prove to be a more powerful (and
easier to maintain) method than relying on our own experts to create
signatures.

Cheers,
Fyodor
_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/
