Nmap Development mailing list archives
Re: Request for Comments: New IPv6 OS detection machine learning engine
From: Fyodor <fyodor () nmap org>
Date: Thu, 9 Feb 2017 17:07:01 -0800
On Fri, Jan 20, 2017 at 9:31 AM, Mathias Morbitzer <m.morbitzer () runbox com> wrote:
> Meanwhile, I managed to get feedback on the new implementation we started
> last summer from some people who know their ML. Let me start by saying
> that we are doing quite well! :) However, of course there are things we
> could improve / comments on future work:

Thanks Mathias, this is great feedback! Regarding your exact notes:

> 1) Considering the size of our DB (300+ fingerprints), the random forest
> model is a good choice. To make use of more complex models, such as
> neural networks or deep learning, we would need a much bigger database.
> Therefore, I suggest sticking with random forest for the near future and
> instead focusing on improving in other areas.
That makes sense. The IPv6 DB is definitely not mature yet (in terms of
number of fingerprints), but for comparison we can look at IPv4. It has
5,336 fingerprints, but in terms of unique class lines (which I think is
a more apt comparison to what we use as IPv6 fingerprints) we have 1,384.
So I'd say our IPv6 OS DB will stay below 2,000 prints for the
foreseeable future.

> 2) Since the random forest (and other types of ensemble models) are
> already multi-modal, multi-stage would not improve accuracy. However, it
> should also not make it worse. Since we have multiple reasons to prefer
> the multistage approach, this is good news.
What are the other reasons to prefer the multistage approach if it doesn't improve accuracy? I guess maybe as an easy way to give the broad/rough match of OS family (such as "Windows") even if we don't have full confidence in a precise version?
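To make the "broad match first, precise match second" idea concrete, here is a hypothetical sketch of a multistage classifier in Python with scikit-learn: a first random forest predicts the OS family, and a per-family random forest then predicts the version. All feature values, class names, and data here are invented for illustration; the real engine's feature extraction and labels differ.

```python
# Hypothetical two-stage ("multistage") OS classifier sketch.
# Stage 1 predicts a broad OS family; stage 2 predicts a version
# within that family. Data is synthetic, for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy fingerprint features: rows are hosts, columns are probe-derived values.
X = rng.random((120, 8))
families = np.repeat(["Windows", "Linux", "BSD"], 40)
# Versions are nested within families.
versions = np.concatenate([
    rng.choice(["Win7", "Win10"], 40),
    rng.choice(["2.6", "4.4"], 40),
    rng.choice(["FreeBSD10", "OpenBSD6"], 40),
])

# Stage 1: OS family classifier.
family_clf = RandomForestClassifier(n_estimators=50, random_state=0)
family_clf.fit(X, families)

# Stage 2: one version classifier per family, trained only on that
# family's fingerprints.
version_clfs = {}
for fam in np.unique(families):
    mask = families == fam
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[mask], versions[mask])
    version_clfs[fam] = clf

def classify(x):
    """Return (family, version) for a single fingerprint vector."""
    fam = family_clf.predict(x.reshape(1, -1))[0]
    ver = version_clfs[fam].predict(x.reshape(1, -1))[0]
    return fam, ver

fam, ver = classify(X[0])
```

One appeal of this structure, as suggested above, is graceful degradation: even when the stage-2 version guess is uncertain, the stage-1 family guess ("Windows") can still be reported with reasonable confidence.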
> 3) In terms of evaluation, the 80:20 split is not a good idea since the
> test set is too small; this will create variance in precision. It would
> be better to re-run the tests multiple times with a 50:50 split, and
> then check the mean average precision and variance.
I'm not exactly sure what this means but will take your word for it :).
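In case it helps, the suggestion can be sketched like this: rather than judging the model on one 80:20 split (where the small test set makes the score noisy), repeat many random 50:50 splits and report the mean and variance of the score. The data below is synthetic and the score is plain accuracy, purely for illustration.

```python
# Sketch of repeated 50:50 evaluation: run many random half/half
# splits and summarize the score distribution instead of trusting
# a single split. Synthetic data, for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(1)
X = rng.random((200, 10))
# Make the labels weakly learnable so scores are non-trivial.
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

scores = []
splitter = ShuffleSplit(n_splits=20, test_size=0.5, random_state=0)
for train_idx, test_idx in splitter.split(X):
    clf = RandomForestClassifier(n_estimators=30, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

mean_score = float(np.mean(scores))   # average performance
var_score = float(np.var(scores))     # stability across splits
```

The variance across splits is the point: a single 80:20 number hides how much the score moves when the split changes.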
> 5) As we already thought, having 695 features is quite a lot. Approaches
> to reduce the number of features could be, for example, using neural
> networks or principal component analysis (PCA). We did play around with
> such things a bit before, but it might be interesting to have another
> look.
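As a minimal illustration of the PCA suggestion above: project the 695-dimensional fingerprint vectors down to however many principal components are needed to keep, say, 95% of the variance. The dimensions and data below are invented; this is a sketch of the technique, not the engine's actual pipeline.

```python
# Hypothetical PCA sketch: reduce 695 fingerprint features to the
# smallest number of components retaining 95% of the variance.
# Synthetic data, for illustration only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.random((300, 695))  # 300 fingerprints x 695 features

pca = PCA(n_components=0.95)  # keep 95% of the variance
X_reduced = pca.fit_transform(X)
```

The downstream classifier would then train on `X_reduced` instead of the raw 695-column matrix.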
I'm not exactly familiar with these either, but it definitely sounds like
it's worth a look!

> 6) I also learned that ML might not always be the best solution when it
> comes to figuring out exactly one perfect match. ML is good at providing
> the top k results, from which there is a high probability that one is
> correct. So this might also be something to consider in
Interesting...
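For the record, the "top k" idea above is easy to sketch: instead of forcing a single best guess, rank the classes by predicted probability and report the k most likely. Again, data and class names below are synthetic placeholders.

```python
# Sketch of top-k OS matching: rank classes by predicted probability
# and return the k most likely, rather than one forced answer.
# Synthetic data, for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.random((90, 6))
y = np.repeat(["Windows", "Linux", "BSD"], 30)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

def top_k(x, k=2):
    """Return the k most probable classes, most likely first."""
    probs = clf.predict_proba(x.reshape(1, -1))[0]
    order = np.argsort(probs)[::-1][:k]
    return [clf.classes_[i] for i in order]

guesses = top_k(X[0], k=2)
```

A top-k list also maps naturally onto how OS detection results are presented to users: a ranked list of candidate matches rather than a single verdict.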
> 7) And finally, I've been told that we could also try the non-ML
> approach of signature based checking.
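To make the comparison concrete, here is a minimal, hypothetical sketch of signature-based matching in the spirit of that approach: each signature lists the allowed values per test, and the candidate matching the most tests wins. The test names, values, and OS labels below are invented for illustration and do not come from any real Nmap database.

```python
# Hypothetical signature-matching sketch: score each signature by how
# many of its tests the observed fingerprint satisfies, and pick the
# highest scorer. Signatures and test names are invented examples.

SIGNATURES = {
    "Linux 4.x":   {"ttl": {64},  "win": {29200, 64240}, "df": {True}},
    "Windows 10":  {"ttl": {128}, "win": {8192, 65535},  "df": {True}},
    "OpenBSD 6.x": {"ttl": {64},  "win": {16384},        "df": {False}},
}

def best_match(observed):
    """Return the signature name matching the most tests."""
    def score(sig):
        return sum(observed.get(test) in allowed
                   for test, allowed in sig.items())
    return max(SIGNATURES, key=lambda name: score(SIGNATURES[name]))

host = {"ttl": 128, "win": 65535, "df": True}
guess = best_match(host)  # "Windows 10": all three tests match
```

The trade-off the thread touches on is visible even here: every line of such a signature table has to be written and maintained by hand, which is exactly the cost the ML approach was hoped to avoid.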
Well, we at least have the signature-based IPv4 OS detection system for
comparison. That has worked pretty well for us, although our hope was
that the machine learning IPv6 system would prove to be a more powerful
(and easier to maintain) method than relying on our own experts to
create signatures.

Cheers,
Fyodor
_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/
Current thread:
- Re: Request for Comments: New IPv6 OS detection machine learning engine Mathias Morbitzer (Jan 20)
- Re: Request for Comments: New IPv6 OS detection machine learning engine Fyodor (Feb 09)
- Re: Request for Comments: New IPv6 OS detection machine learning engine Mathias Morbitzer (Feb 20)
- <Possible follow-ups>
- Re: Request for Comments: New IPv6 OS detection machine learning engine Varunram Ganesh (Feb 20)
- Re: Request for Comments: New IPv6 OS detection machine learning engine Mathias Morbitzer (Mar 02)
- Re: Request for Comments: New IPv6 OS detection machine learning engine Fyodor (Feb 09)