Nmap Development mailing list archives
Re: Request for Comments: New IPv6 OS detection machine learning engine
From: Mathias Morbitzer <m.morbitzer () runbox com>
Date: Fri, 20 Jan 2017 18:31:38 +0100
Hi everyone,
Meanwhile, I managed to get feedback on the new implementation we started last summer from some people who know their ML.
Let me start by saying that we are doing quite well! :) However, of course there are things we could improve, along with some comments on future work:
1) Considering the size of our DB (300+ fingerprints), the random forest model is a good choice. To make use of more complex models,
such as neural networks or deep learning, we would need a much bigger database. Therefore, I suggest sticking with random forest
for the near future and instead focus on improving in other areas.
2) Since random forests (and other types of ensemble models) are already multi-modal, a multistage approach would not improve accuracy.
However, it should also not make it worse. Since we have multiple reasons to prefer the multistage approach, this is good news.
The reason why the multistage approach performed slightly worse in our tests is probably the way we did the test, which brings me to
3) In terms of evaluation, the 80:20 split is not a good idea: the test set is too small, which increases the variance of the measured precision.
It would be better to re-run the tests multiple times with a 50:50 split, and then check the mean average precision and variance.
Also, for the multistage approach, it would be interesting to analyze whether wrong classifications are already misclassified in stage 1,
or in stage 2. This brings me to
4) We could reconsider our choice for the stage 1 classifier. For the current first stage, we took the 4 main operating systems plus
a group "others". It could make more sense to create different groups based on similar behavior.
5) As we already thought, having 695 features is quite a lot. Approaches to reduce the number of features could be, for example,
using neural networks or principal component analysis (PCA). We did play around with such things a bit before, but it might be interesting
to have another look.
6) I also learned that ML might not always be the best solution when it comes to figuring out exactly one perfect match. ML is good at
providing the top k results, among which there is a high probability that one is correct. So this might also be something to consider in
the future for our tests (considering the top k results will give a better overview of how the model performs). Also, when OS detection is performed,
we could give the user the top k OS guesses, or at least have this as an option.
7) And finally, I've been told that we could also try the non-ML approach of signature-based checking.
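To make points 3 and 6 a bit more concrete, here is a rough sketch of how repeated 50:50 splits with a top k score could be evaluated. It is only an illustration, not our actual test code: `centroid_ranker`, `topk_accuracy`, and `repeated_split_topk` are hypothetical names, and the nearest-centroid ranker is merely a dependency-free stand-in for the random forest. It also reports plain top k accuracy rather than any specific precision metric.

```python
import random
import statistics
from collections import defaultdict


def centroid_ranker(train):
    """Stand-in model (NOT the random forest): rank classes by centroid distance."""
    sums, counts = {}, defaultdict(int)
    for features, label in train:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    centroids = {lbl: [s / counts[lbl] for s in vec] for lbl, vec in sums.items()}

    def rank(features):
        def dist(lbl):
            return sum((a - b) ** 2 for a, b in zip(features, centroids[lbl]))
        return sorted(centroids, key=dist)  # best guess first

    return rank


def topk_accuracy(rank, test, k):
    """Fraction of test samples whose true label is among the top k guesses."""
    hits = sum(1 for features, label in test if label in rank(features)[:k])
    return hits / len(test)


def repeated_split_topk(samples, k=3, runs=20, seed=0):
    """Repeat a random 50:50 split `runs` times (runs >= 2).

    Returns (mean, sample variance) of the top k accuracy over all runs.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train, test = shuffled[:half], shuffled[half:]
        scores.append(topk_accuracy(centroid_ranker(train), test, k))
    return statistics.mean(scores), statistics.variance(scores)
```

Here `samples` is assumed to be a list of `(feature_vector, os_label)` pairs. Comparing the results for `k=1` and, say, `k=3` on the same seed would show how much the top k view changes the picture, and the variance indicates how much a single split could mislead us.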
So that's it regarding feedback. I hope we can increase accuracy even further with this information!
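For the multistage error analysis mentioned in point 3, a minimal sketch could look like the following. The record layout (true group, predicted group, true OS, predicted OS) and the name `attribute_errors` are assumptions made for illustration, not part of our existing code.

```python
from collections import Counter


def attribute_errors(records):
    """Count at which stage each multistage classification goes wrong.

    Each record is a tuple (true_group, pred_group, true_os, pred_os).
    """
    counts = Counter()
    for true_group, pred_group, true_os, pred_os in records:
        if true_group != pred_group:
            # Wrong group in stage 1; assuming stage 2 only chooses
            # within the predicted group, it cannot recover from this.
            counts["stage1_error"] += 1
        elif true_os != pred_os:
            # Right group, but wrong OS picked within it in stage 2.
            counts["stage2_error"] += 1
        else:
            counts["correct"] += 1
    return counts
```

If most errors turn out to be stage 1 errors, that would support rethinking the stage 1 groups as suggested in point 4.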
Cheers,
Mathias