Nmap Development mailing list archives
IPv6 fingerprint database imputation of missing values
From: Alexandru Geana <alex () alegen net>
Date: Fri, 10 Apr 2015 18:17:45 +0200
Hello devs,

For some time now I have been working on applying imputation to the IPv6 fingerprint database in order to deal with missing values in some of the examples. The reason for applying imputation is to increase the quality of the data and improve the accuracy of the machine learning model generated with liblinear.

I started by looking into existing imputation methods which do -not- rely on removing the features with missingness (e.g. mean substitution, random substitution), but found that some of them add bias to the data. After checking pros/cons as well as existing implementations, I decided to use the multiple imputation technique (explained quite well in [1]), which has two libraries, both in R: mice [2,3] and Amelia [4]. I tested both of them and got better results with mice than with Amelia. The python-to-R bridging is done via rpy2.

A list of the changes I made to the existing code to accommodate imputation:

1) nmap.set - Each feature now has an imputation method attached to it. The format is [feature name]/[imputation method] and the methods are the exact keywords used by mice [3]. Furthermore, some of the features cannot be imputed, as both mice and Amelia complain about the matrix becoming singular, which means that neither library can fit a linear model for the data. The decisions on which imputation strategy to apply to each feature were made after checking the values that are available in the current database.

2) parse.py - Reads and understands the imputation methods from nmap.set. The parse_feature_set function now returns a tuple (features, imp_methods).

3) impute.py - Each feature is checked to see whether it can be imputed (i.e. there are enough different values known). The requirements are in [3], page 16, the scale-type column of the table. I tried to comment the code for those interested. In the end, I am not pooling the intermediate imputed matrices, but instead merge them and feed the resulting matrix back into liblinear.
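To make the three changes above concrete, here is a minimal Python sketch. It is not the attached parse.py or impute.py. The one-entry-per-line layout, the '#' comment syntax, the meaning of a missing "/method" suffix, the can_impute and merge_imputations helpers, and the min_distinct threshold are all assumptions made for illustration. In particular, "merge" is read here as stacking the completed matrices row-wise, which is only one possible interpretation.

```python
# Hypothetical sketch; names and file layout are assumptions, not the real code.

def parse_feature_set(lines):
    """Split 'feature/method' entries into two parallel lists.

    Assumes one entry per line, '#' comments, and that a missing
    '/method' suffix means "do not impute" (stored as None).
    """
    features, imp_methods = [], []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, method = line.partition("/")
        method = method.strip()
        features.append(name.strip())
        imp_methods.append(method if sep and method else None)
    return features, imp_methods


def can_impute(column, min_distinct=2):
    """Return True if the observed (non-None) values vary enough.

    min_distinct is an illustrative threshold, not the rule from [3].
    """
    observed = {value for value in column if value is not None}
    return len(observed) >= min_distinct


def merge_imputations(imputed_matrices, labels):
    """Stack the completed matrices row-wise, repeating labels to match.

    This is one reading of "merge"; pooling would instead combine
    the models or estimates fitted on each completed matrix.
    """
    rows, merged_labels = [], []
    for matrix in imputed_matrices:
        if len(matrix) != len(labels):
            raise ValueError("each imputed matrix needs one row per label")
        rows.extend(list(row) for row in matrix)
        merged_labels.extend(labels)
    return rows, merged_labels
```

With m imputed copies of an n-row training set, merge_imputations returns m*n rows, so every original example appears m times with possibly different filled-in values. A linear classifier trained on that stacked matrix sees the imputation uncertainty as extra spread in the data, rather than through pooled estimates.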

This might seem like a lot to post at once, but I was reluctant to share it publicly on the dev list before I had something working. I would highly appreciate any comments or feedback.

[1] Multiple Imputation by Chained Equations: What is it and how does it work? (PDF)
    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/pdf/nihms267760.pdf
[2] R CRAN page for mice
    http://cran.r-project.org/web/packages/mice/
[3] mice: Multivariate Imputation by Chained Equations in R (PDF)
    http://www.jstatsoft.org/v45/i03/paper
[4] R CRAN page for Amelia
    http://cran.r-project.org/web/packages/Amelia/

Best regards,
Alexandru Geana
alegen.net

Attachments:
- impute_amelia.R
- impute_mice.R
- impute.py
- nmap.set.mice
- parse.py.diff
- signature.asc (Digital signature)
_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/