Nmap Development mailing list archives
Re: Request for Comments: New IPv6 OS detection machine learning engine
From: David Fifield <david () bamsoftware com>
Date: Sun, 7 Aug 2016 23:06:20 -0500
On Thu, Jul 21, 2016 at 12:17:50PM +0530, Prabhjyot Singh Sodhi wrote:
This document explains the solution that we experimented with and are proposing over the current implementation.

*Aim*: The target of this model is to guess the target operating system correctly based on network probes and the differences in how operating systems react to those probes.

*Data*: As of now we have a total of 301 prints. We use these prints to generate the features for our models (695 features as of now).

*Model being used*: The current model uses a logistic regression learner to predict operating systems.

*Change in data representation*: The current database representation used by the logistic regression model is based on the fact that all prints that are members of a group are very similar to each other (value-wise). This is in contrast to how classes normally work in a learning system (a learning system being one where you are trying to teach a system to do something, prediction of OS in our case). Usually, we'd have a target variable (operating system in our case), one group for some operating system (or a set of operating systems), and all prints corresponding to that operating system would go into the group. And this is exactly what we have attempted with the new representation.

Now, one simple way to achieve this could have been to have one group for each operating system (each version, so one per Linux kernel). Given the low number of prints, this would have resulted in a very high number of groups with very few prints in each group, which would have made prediction more difficult. That is why we tried to keep similar operating systems in the same group. We were able to do this for Windows, IBM, Macintosh, and FreeBSD type systems. For Linux, we decided to stick with the existing representation (with small changes) due to the complexity in the way the groups were made.
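The grouping idea described above can be sketched in a few lines. This is a minimal illustration only: the function names, the family list, and the rule of grouping Linux by kernel minor version are all hypothetical simplifications, not the actual nmap-os-db conventions.

```python
# Sketch: map free-form OS labels to coarse training groups.
# Broad family groups for Windows/FreeBSD/etc., per-minor-version
# groups for Linux. All names and rules here are illustrative.
import re
from collections import defaultdict

def coarse_group(os_label):
    """Collapse an OS label into a broader class name."""
    families = ("Windows", "FreeBSD", "Mac OS X", "AIX")
    for fam in families:
        if fam in os_label:
            return fam                    # one group per family
    m = re.match(r"Linux (\d+\.\d+)", os_label)
    if m:
        return "Linux " + m.group(1)      # group by kernel minor version
    return os_label                       # fall back to the raw label

def build_groups(prints):
    """prints: iterable of (os_label, feature_vector) pairs."""
    groups = defaultdict(list)
    for label, features in prints:
        groups[coarse_group(label)].append(features)
    return groups
```

Under these rules, "Linux 2.6.22" and "Linux 2.6.23" land in the same "Linux 2.6" group, which sidesteps the many-tiny-groups problem at the cost of version resolution.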
It's true, the current scheme of grouping fingerprints by characteristics such as whether they had a closed port is probably a mistake. It's partly an artifact of having worked with the IPv4 classifier, which required a proliferation of classes, and partly because we initially seeded the classes by clustering similar fingerprints. I think that Dan has made the situation somewhat better recently, merging some classes. But I think this phenomenon mainly affects Linux and Windows fingerprints. For the others, there aren't so many fingerprints so the classes tend to be broader. Some of them have annotations like "Open, no closed", but in some cases those are merely descriptive of the fingerprints we have, not a specification for what belongs in the class.
*Models experimented with*: i) Random Forest (RF); ii) Multi Stage Random Forest (MSRF).

It'd be great if you could review this and help us with some valuable feedback.
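For readers unfamiliar with the model family being proposed, here is a toy illustration of the random-forest idea: bagged weak learners combined by majority vote. A real implementation grows full decision trees (e.g. scikit-learn's RandomForestClassifier); this stdlib-only sketch uses single-split decision stumps purely to show the bootstrap-and-vote mechanism on fingerprint feature vectors.

```python
# Toy random-forest sketch: bootstrap the training set, fit a decision
# stump per sample, predict by majority vote. Illustrative only.
import random
from collections import Counter

def fit_stump(X, y):
    """Best single (feature, threshold) split by misclassification count,
    predicting the majority class on each side."""
    best = None
    for f in range(len(X[0])):
        for t in sorted(set(row[f] for row in X)):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            lmaj = Counter(left).most_common(1)[0][0]
            rmaj = Counter(right).most_common(1)[0][0]
            errs = sum(l != lmaj for l in left) + sum(r != rmaj for r in right)
            if best is None or errs < best[0]:
                best = (errs, f, t, lmaj, rmaj)
    if best is None:                      # degenerate sample: no valid split
        maj = Counter(y).most_common(1)[0][0]
        return lambda row: maj
    _, f, t, lmaj, rmaj = best
    return lambda row: lmaj if row[f] <= t else rmaj

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]      # bootstrap sample
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda row: Counter(s(row) for s in stumps).most_common(1)[0][0]
```

The appeal for OS detection is that each tree can key on a different subset of probe responses, so the ensemble degrades gracefully when one feature (say, a window scale) is unreliable.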
A decision tree-based classifier sounds like a good idea. Of course it all comes down to the performance of the implementation. Do you have some code or evaluation results?

One problem you might have with making classes based strictly on OS versions is that some fingerprints may belong to different versions but be indistinguishable. For example, you might have training samples for Linux 2.6.22 and Linux 2.6.23 as leaves in one of your decision trees, even though they have the same network behavior. You might have to have some kind of cutoff where you decide that all leaves below certain nodes belong to the same class. (That kind of cutoff is basically what we are trying to simulate in the current system of manually curated classes. We rely on the human integrator having some expert knowledge of what version ranges should be distinguishable from what others.)

Another difficulty I can foresee is that you're going to have a lot of classes populated by a small number of fingerprints, which might not well represent the full diversity of the class in the wild. For example, the Linux TCP window scale option can depend on how much RAM is installed (https://homes.cs.washington.edu/~yoshi/papers/fuzzing_aisec2010.pdf). Whereas all your Linux fingerprints taken together might have a good sampling of window scales, when you chop them up into little groups your decision tree might draw incorrect conclusions like "Linux 2.6.x always has WScale=2". In fact, in this particular case I would recommend trying automatically generating synthetic fingerprints with various window scale options (maybe during the training process). That's what's going on in the IPv4 database when you see entries like:

OPS(O1=M5B4ST11NW1|M5B4ST11NW2|M5B4ST11NW3|M5B4ST11NW4|M5B4ST11NW5|M5B4ST11NW6|M5B4ST11NW7|M5B4ST11NW8|M5B4ST11NW9%O2=...

How do you plan to handle novel fingerprints?
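The synthetic-fingerprint suggestion above could be sketched as a training-time augmentation step. The representation here (a flat dict with a "WScale" field) is a hypothetical simplification of a real fingerprint, and the 1–9 range simply mirrors the NW1–NW9 alternatives in the IPv4 database entry quoted above.

```python
# Sketch: augment the training set with synthetic fingerprints that vary
# only the TCP window scale, so a small class cannot teach the learner
# "this OS always uses WScale=2". The dict-based fingerprint format is
# a hypothetical simplification.
def augment_wscale(fingerprint, scales=range(1, 10)):
    """Return one synthetic copy of `fingerprint` per window scale."""
    variants = []
    for w in scales:
        fp = dict(fingerprint)            # shallow copy, then override
        fp["WScale"] = w
        variants.append(fp)
    return variants

def augment_training_set(prints):
    """prints: list of (label, fingerprint) pairs."""
    out = []
    for label, fp in prints:
        for variant in augment_wscale(fp):
            out.append((label, variant))
    return out
```

Each observed print thus contributes nine training samples that agree on every feature except the window scale, which pushes the tree toward splitting on more stable features.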
One way to evaluate this would be to hold out an entire class during training, and test whether the fingerprints of the held-out class match some other existing class or are properly detected as novel.

_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/
Current thread:
- Request for Comments: New IPv6 OS detection machine learning engine Prabhjyot Singh Sodhi (Jul 20)
- Re: Request for Comments: New IPv6 OS detection machine learning engine David Fifield (Aug 07)
- Re: Request for Comments: New IPv6 OS detection machine learning engine Prabhjyot Singh Sodhi (Aug 09)
- Re: Request for Comments: New IPv6 OS detection machine learning engine David Fifield (Aug 07)