Nmap Development mailing list archives

IPv6 fingerprint database imputation of missing values


From: Alexandru Geana <alex () alegen net>
Date: Fri, 10 Apr 2015 18:17:45 +0200

Hello devs,

For some time now I have been working on applying imputation to the IPv6
fingerprint database in order to deal with missing values for some of
the examples. The reason for applying imputation is to increase the
quality of the data and improve the accuracy of the machine learning
model generated with liblinear.

I started by looking into existing imputation methods which do -not-
rely on removing the features with missing values (e.g. mean
substitution, random substitution), but found that some of them add
bias to the data. After checking the pros and cons as well as existing
implementations, I decided to use the multiple imputation technique
(explained quite well in [1]), for which there are two libraries, both
in R: mice [2,3] and Amelia [4]. I tested both of them and got better
results with mice than with Amelia. The Python-to-R bridging is done
via rpy2.

A list of the changes I made to the existing code to accommodate
imputation:

1) nmap.set - Each feature now has an imputation method attached to it.
The format is [feature name]/[imputation method] and the methods are the
exact keywords used by mice [3]. Furthermore, some of the features
cannot be imputed, as both mice and Amelia complain about the matrix
becoming singular, which means that neither library can fit a linear
model for the data. The decisions on what imputation strategy to apply
to each feature were made after checking the values that are available
in the current database.

2) parse.py - Reads and understands the imputation methods from
nmap.set. The parse_feature_set function now returns a tuple
(features, imp_methods).

3) impute.py - Each feature is checked to see whether it can be imputed
(i.e. there are enough different values known). The requirements are in
[3], page 16, the scale-type column of the table. I tried to comment the
code for those interested. In the end, I am not pooling the intermediate
imputed matrices, but instead merging them and feeding the resulting
matrix back into liblinear.
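
For readers without the attachments, the three changes above can be
sketched roughly as follows. This is only an illustration and not the
attached code. The one-entry-per-line layout of nmap.set, the treatment
of a missing method as "do not impute", the can_impute and
merge_imputations helpers, and the min_distinct threshold are all
assumptions made for the example. Only the [feature name]/[imputation
method] format and the (features, imp_methods) return value come from
the description above.

```python
def parse_feature_set(lines):
    """Split 'name/method' entries into (features, imp_methods).

    Assumption: one entry per line. An entry without '/method' gets
    the empty string, which mice treats as "do not impute".
    """
    features, imp_methods = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        name, sep, method = line.partition("/")
        features.append(name)
        imp_methods.append(method if sep else "")
    return features, imp_methods


def can_impute(column, min_distinct=2):
    """Return True if a column is worth imputing.

    None marks a missing value. The column needs at least one missing
    entry and at least min_distinct different known values. The
    threshold here is hypothetical; the real requirement depends on
    the chosen mice method.
    """
    known = {value for value in column if value is not None}
    has_missing = any(value is None for value in column)
    return has_missing and len(known) >= min_distinct


def merge_imputations(imputed_matrices, labels):
    """Stack the completed matrices row-wise instead of pooling them.

    Each matrix holds one completed copy of the data set (one row per
    example), so the class labels are repeated once per copy.
    """
    rows, merged_labels = [], []
    for matrix in imputed_matrices:
        if len(matrix) != len(labels):
            raise ValueError("every imputed matrix needs one row per label")
        rows.extend(list(row) for row in matrix)
        merged_labels.extend(labels)
    return rows, merged_labels
```

With m imputed copies of n examples each, merge_imputations returns
m * n rows, so the classifier is trained on every completed version of
each example rather than on a single pooled estimate.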

Although this might seem like a lot, I was reluctant to publicly share
this on the dev list before I had something working. I would highly
appreciate any comments or feedback.

[1] Multiple Imputation by Chained Equations: What is it and how does it work? (PDF)
    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/pdf/nihms267760.pdf

[2] R cran page for mice
    http://cran.r-project.org/web/packages/mice/

[3] mice: Multivariate Imputation by Chained Equations in R (PDF)
    http://www.jstatsoft.org/v45/i03/paper

[4] R cran page for Amelia
    http://cran.r-project.org/web/packages/Amelia/

Best regards,
Alexandru Geana
alegen.net

Attachment: impute_amelia.R
Description:

Attachment: impute_mice.R
Description:

Attachment: impute.py
Description:

Attachment: nmap.set.mice
Description:

Attachment: parse.py.diff
Description:

Attachment: signature.asc
Description: Digital signature

