
[RFC] Mass rDNS performance tweak


From: jah <jah () zadkiel plus com>
Date: Wed, 14 Jan 2009 09:40:13 +0000

Hi folks,

I've found that when performing reverse DNS resolution using nmap's
async resolver and my ISP's DNS servers, the accuracy of the results is
about 65-75% depending on the number of DNS servers being used and that
somewhere between 2 and 3 requests are sent per target.  Here's a
typical stat:

250 IPs took 13.70s. Mode: Async
[#: 1, OK: 171, NX: 0, DR: 79, TO: 363, SF: 0, TR: 534, CN: 0]

Here you can see I'm querying a single (#: 1) server,
we got OK: 171 PTR record responses,
NX: 0 No such name responses,
we dropped DR: 79 targets because we tried and failed 3 times to get a
PTR record,
TO: 363 requests timed-out (a stat not currently implemented in nmap),
SF: 0 server fail responses,
we transmitted a total of TR: 534 requests
and got CN: 0 cname responses.

The same 250 target IPs, which had been generated with:
nmap -sL -iR 1100 --system-dns | grep "[0-9])" | awk '{ print $3 }' \
| sed 's/(\([^)]*\))/\1/g' | sort -n | uniq | head -n 250 > ptr_list

(all having PTR records) were resolved using nmap --system-dns in about
50 seconds.

So whilst the async resolving is quick, you can see that we failed to
get the PTR records for 79 targets and that 534 requests were sent.

The mass rdns code is quick because it maintains a certain number of
outstanding requests "on the wire" for each DNS server being queried. 
This number is dns_server_s::capacity.  The capacity for each server begins
at CAPACITY_MIN (currently 10) and is adjusted up and down in an attempt to
keep as many requests on the wire as is reasonable without exceeding
CAPACITY_MAX (200).
When a response is received from a server, its capacity is increased by
CAPACITY_UP_STEP (currently 2).  When a request times out, the capacity is
scaled down by CAPACITY_MINOR_DOWN_SCALE (currently 0.9).  If a request
times out after all retries at that server are exhausted, the capacity is
additionally scaled down by CAPACITY_MAJOR_DOWN_SCALE (currently 0.7).
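For reference, the existing adjustment boils down to roughly the following
(a simplified sketch rather than the actual nmap_dns.cc code; the
on_response/on_timeout helpers are named purely for illustration, the
constants are the ones above):

// Simplified sketch of the current per-server capacity handling.
#define CAPACITY_MIN 10
#define CAPACITY_MAX 200
#define CAPACITY_UP_STEP 2
#define CAPACITY_MINOR_DOWN_SCALE 0.9
#define CAPACITY_MAJOR_DOWN_SCALE 0.7

// Clamp a server's capacity to the allowed range.
static void check_capacity(int &capacity) {
  if (capacity < CAPACITY_MIN) capacity = CAPACITY_MIN;
  if (capacity > CAPACITY_MAX) capacity = CAPACITY_MAX;
}

// Called whenever a server answers one of our requests.
static void on_response(int &capacity) {
  capacity += CAPACITY_UP_STEP;
  check_capacity(capacity);
}

// Called whenever a request times out at a server; if that was the last
// allowed retry there, the larger reduction is applied as well.
static void on_timeout(int &capacity, bool retries_exhausted) {
  capacity = (int) (capacity * CAPACITY_MINOR_DOWN_SCALE);
  if (retries_exhausted)
    capacity = (int) (capacity * CAPACITY_MAJOR_DOWN_SCALE);
  check_capacity(capacity);
}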

For the DNS servers available to me, this algorithm is too aggressive:
something between me and the server drops requests - I imagine when they
are sent too quickly.
I've found, firstly, that CAPACITY_MIN is too high; if I set it to 2 I get
more accurate results.
Secondly, by the time enough of a period has elapsed to detect that
requests have timed out, enough responses may have been received to raise
the capacity well beyond a reasonable level, and this usually leads to
further timed-out requests later on.
For instance, if you imagine that some of the first 10 requests sent
will time out in four seconds and that every time we get a response we
put 3 more on the wire (one to replace the completed one and two to step
up to the increased capacity) we might have 30, 40 even 50 requests on
the wire by the time those four seconds are up.  As requests time out,
the capacity falls and may fall all the way back to 10 quite quickly -
so rather than maintaining an optimum capacity, what I see are wild
fluctuations.
If I set CAPACITY_MIN to 2 and CAPACITY_MAX to about 6 or 7 I get very
nearly 100% accuracy and the total time for resolution isn't hugely more
than it is currently.

Another issue is that timed-out requests aren't necessarily an indicator of
the need to reduce capacity, because they may have timed out for other
reasons, such as a non-responsive nameserver.
It's basically very difficult to determine the optimum capacity!

Obviously, the values for CAPACITY_MIN and CAPACITY_MAX which work nicely
for me may be well below the optimum for other users, so I've tried
adjusting the degrees by which capacity is increased and decreased, and
I've also tried various methods to dampen the fluctuations and settle at a
reasonable capacity where we get a good trade-off between accuracy and
speed.

Some of the things I've tried (in addition to experimenting with the
variables for the current algorithm) are:
Introduce delays between increases in capacity to allow time-outs to
balance the increases.
Maintain a ratio of timed-out requests to responses and increase capacity
only when the ratio is below some threshold.
Decouple the starting capacity from the minimum capacity so that we can
start higher than minimum, but drop if necessary.
And various combinations of all of these.

Right now, I've had the best results by doing the following:

We start with a capacity of 2 and don't increase this value until the
read timeout for the first request has elapsed (4s if using one DNS
server and 2.5 seconds if using more than one).
Reset the timer after the first capacity increase and allow a maximum of
50 capacity increases during that period (again the read timeout).  This
repeats until resolution is complete.
Capacity increases (a maximum of 0.1 each) and decreases are tied to the
drop ratio, i.e. the ratio of timed-out requests to responses:
on a timeout:   capacity -= drop_ratio
on a response:  capacity += 1 / (100 * MAX(drop_ratio, 0.1))
The CAPACITY_MAJOR_DOWN_SCALE reduction is no longer performed because I
feel that a request which never completes is much less likely to be
capacity-related now that the algorithm is less aggressive.

What this means is that, depending on the drop ratio, we'll increase
capacity by a maximum of 5 during any timeslot, which allows timed-out
requests to balance the capacity with decreases.  This happens in small
steps, and the theory is that we should gradually approach the optimum
capacity and then wobble fairly close to it thereafter.
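In code terms, the per-request bookkeeping in the attached patch amounts to
roughly the following (a trimmed-down extract; on_response/on_timeout are
illustrative names, and the helpers it leans on - check_capacities(),
nsock_gettimeofday(), the TIMEVAL_* macros and read_timeouts[] - are the
existing ones in nmap_dns.cc):

// Trimmed-down sketch of the patched capacity handling, in the context of
// nmap_dns.cc (relies on its existing helpers and on the fields the patch
// adds to dns_server_s).
#define CAPACITY_MIN 2
#define CAPACITY_START 2
#define CAPACITY_MAX 200
#define CAPACITY_UP_PER_PERIOD 50

// A request timed out at this server: recompute the drop ratio, step down.
static void on_timeout(dns_server *s) {
  s->reqs_timedout++;
  s->drop_ratio = (float) s->reqs_timedout /
                  (s->reqs_completed > 0 ? s->reqs_completed : 1);
  s->capacity -= s->drop_ratio;
  check_capacities(s);   // clamps to [CAPACITY_MIN, CAPACITY_MAX]
}

// A response arrived: step up by at most 0.1, but only while the current
// read-timeout period still has "up" steps left.
static void on_response(dns_server *s) {
  struct timeval now;
  memcpy(&now, nsock_gettimeofday(), sizeof(struct timeval));

  s->reqs_completed++;
  s->drop_ratio = (float) s->reqs_timedout / s->reqs_completed;

  // Open a new window of CAPACITY_UP_PER_PERIOD increases once the previous
  // read-timeout period has elapsed.
  if (s->cpcty_up_allowed == 0 &&
      TIMEVAL_MSEC_SUBTRACT(s->next_cpcty_up_time, now) < 0) {
    s->cpcty_up_allowed = 1;
    s->cpcty_up_count = 1;
    TIMEVAL_MSEC_ADD(s->next_cpcty_up_time, now,
                     read_timeouts[read_timeout_index][0]);
  }

  if (s->cpcty_up_allowed == 1) {
    s->capacity += 1 / (100 * MAX(s->drop_ratio, 0.1));
    check_capacities(s);
    if (s->cpcty_up_count++ == CAPACITY_UP_PER_PERIOD)
      s->cpcty_up_allowed = 0;
  }
}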

It may well not be perfect and I offer the attached patch so you can try
it out to see how it affects resolution speed and accuracy for you.  The
above scan with this patch gave me:

250 IPs took 15.30s. Mode: Async
[#: 1, OK: 250, NX: 0, DR: 0, TO: 10, SF: 0, TR: 260, CN: 1]

I'm particularly interested to know whether the current CAPACITY_MAX of 200
is sane for anyone, because with this patch it would take quite a long time
to reach that value (with increases capped at 5 per read-timeout period,
climbing from 2 to 200 needs around 40 periods).

Hope it works for you!

jah




--- nmap_dns.cc.orig    2009-01-13 22:54:40.953125000 +0000
+++ nmap_dns.cc 2009-01-13 22:49:53.656250000 +0000
@@ -214,11 +214,10 @@
   { 2500, 3000,   -1, -1 }, // 3+ servers
 };
 
-#define CAPACITY_MIN 10
+#define CAPACITY_MIN 2
+#define CAPACITY_START 2
 #define CAPACITY_MAX 200
-#define CAPACITY_UP_STEP 2
-#define CAPACITY_MINOR_DOWN_SCALE 0.9
-#define CAPACITY_MAJOR_DOWN_SCALE 0.7
+#define CAPACITY_UP_PER_PERIOD 50
 
 // Each request will try to resolve on at most this many servers:
 #define SERVERS_TO_TRY 3
@@ -255,9 +254,15 @@
   sockaddr_in addr;
   nsock_iod nsd;
   int connected;
-  int reqs_on_wire;
-  int capacity;
   int write_busy;
+  int reqs_on_wire;
+  int reqs_completed;
+  int reqs_timedout;
+  float drop_ratio;
+  float capacity;
+  int cpcty_up_allowed;
+  int cpcty_up_count;
+  struct timeval next_cpcty_up_time;
   std::list<request *> to_process;
   std::list<request *> in_process;
 };
@@ -290,7 +295,7 @@
 /* The DNS cache, not just for entries from /etc/hosts. */
 static std::list<host_elem *> etchosts[HASH_TABLE_SIZE];
 
-static int stat_actual, stat_ok, stat_nx, stat_sf, stat_trans, stat_dropped, stat_cname;
+static int stat_actual, stat_ok, stat_nx, stat_sf, stat_trans, stat_dropped, stat_to, stat_cname;
 static struct timeval starttv;
 static int read_timeout_index;
 static u16 id_counter;
@@ -318,16 +323,16 @@
   memcpy(&now, nsock_gettimeofday(), sizeof(struct timeval));
 
   if (o.debugging && (tp%SUMMARY_DELAY == 0))
-    log_write(LOG_STDOUT, "mass_rdns: %.2fs %d/%d [#: %lu, OK: %d, NX: %d, DR: %d, SF: %d, TR: %d]\n",
+    log_write(LOG_STDOUT, "mass_rdns: %.2fs %d/%d [#: %lu, OK: %d, NX: %d, DR: %d, TO: %d, SF: %d, TR: %d]\n",
                     TIMEVAL_MSEC_SUBTRACT(now, starttv) / 1000.0,
                     tp, stat_actual,
-                    (unsigned long) servs.size(), stat_ok, stat_nx, stat_dropped, stat_sf, stat_trans);
+                    (unsigned long) servs.size(), stat_ok, stat_nx, stat_dropped, stat_to, stat_sf, stat_trans);
 }
 
 static void check_capacities(dns_server *tpserv) {
   if (tpserv->capacity < CAPACITY_MIN) tpserv->capacity = CAPACITY_MIN;
   if (tpserv->capacity > CAPACITY_MAX) tpserv->capacity = CAPACITY_MAX;
-  if (o.debugging >= TRACE_DEBUG_LEVEL) log_write(LOG_STDOUT, "CAPACITY <%s> = %d\n", tpserv->hostname, tpserv->capacity);
+  if (o.debugging >= TRACE_DEBUG_LEVEL) log_write(LOG_STDOUT, "CAPACITY <%s> = %.2f\n", tpserv->hostname, tpserv->capacity);
 }
 
 // Closes all nsis created in connect_dns_servers()
@@ -467,15 +472,16 @@
       if (tp > 0 && tp < min_timeout) min_timeout = tp;
 
       if (tp <= 0) {
-        tpserv->capacity = (int) (tpserv->capacity * CAPACITY_MINOR_DOWN_SCALE);
+        stat_to++;
+        tpserv->reqs_timedout++;
+        tpserv->drop_ratio = (float) tpserv->reqs_timedout / (tpserv->reqs_completed > 0 ? tpserv->reqs_completed : 1);
+        tpserv->capacity -= tpserv->drop_ratio;
         check_capacities(tpserv);
         tpserv->in_process.erase(reqI);
         tpserv->reqs_on_wire--;
 
         // If we've tried this server enough times, move to the next one
         if (read_timeouts[read_timeout_index][tpreq->tries] == -1) {
-          tpserv->capacity = (int) (tpserv->capacity * CAPACITY_MAJOR_DOWN_SCALE);
-          check_capacities(tpserv);
 
           servItemp = servI;
           servItemp++;
@@ -495,8 +501,8 @@
             // **** We've already tried all servers... give up
            if (o.debugging >= TRACE_DEBUG_LEVEL) log_write(LOG_STDOUT, "mass_rdns: *DR*OPPING <%s>\n", tpreq->targ->targetipstr());
 
-            output_summary();
             stat_dropped++;
+            output_summary();
             total_reqs--;
             delete tpreq;
 
@@ -528,6 +534,7 @@
   std::list<request *>::iterator reqI;
   dns_server *tpserv;
   request *tpreq;
+  struct timeval now;
 
   for(servI = servs.begin(); servI != servs.end(); servI++) {
     tpserv = *servI;
@@ -540,9 +547,23 @@
         if (ia != 0 && tpreq->targ->v4host().s_addr != ia)
           continue;
 
+        tpserv->reqs_completed++;
+        tpserv->drop_ratio = (float) tpserv->reqs_timedout / tpserv->reqs_completed;
+
         if (action == ACTION_CNAME_LIST || action == ACTION_FINISHED) {
-        tpserv->capacity += CAPACITY_UP_STEP;
-        check_capacities(tpserv);
+        memcpy(&now, nsock_gettimeofday(), sizeof(struct timeval));
+
+        if (tpserv->cpcty_up_allowed == 0 && TIMEVAL_MSEC_SUBTRACT(tpserv->next_cpcty_up_time, now) < 0) {
+          tpserv->cpcty_up_allowed = 1;
+          tpserv->cpcty_up_count = 1;
+          TIMEVAL_MSEC_ADD(tpserv->next_cpcty_up_time, now, read_timeouts[read_timeout_index][0]);
+        }
+
+        if (tpserv->cpcty_up_allowed == 1) {
+          tpserv->capacity += ( 1 / (100 * MAX(tpserv->drop_ratio, 0.1)) );
+          check_capacities(tpserv);
+          if (tpserv->cpcty_up_count++ == CAPACITY_UP_PER_PERIOD) tpserv->cpcty_up_allowed = 0;
+        }
 
         if (result) {
           tpreq->targ->setHostName(result);
@@ -714,10 +735,11 @@
     if (errcode == 2 && found) {
       if (o.debugging >= TRACE_DEBUG_LEVEL) log_write(LOG_STDOUT, "mass_rdns: SERVFAIL <id = %d>\n", packet_id);
       stat_sf++;
+      output_summary();
     } else if (errcode == 3 && found) {
       if (o.debugging >= TRACE_DEBUG_LEVEL) log_write(LOG_STDOUT, "mass_rdns: NXDOMAIN <id = %d>\n", packet_id);
-      output_summary();
       stat_nx++;
+      output_summary();
   }
 
     return;
@@ -768,8 +790,8 @@
 
       if (process_result(ia.s_addr, outbuf, ACTION_FINISHED, packet_id)) {
        if (o.debugging >= TRACE_DEBUG_LEVEL) log_write(LOG_STDOUT, "mass_rdns: OK MATCHED <%s> to <%s>\n", inet_ntoa(ia), outbuf);
-        output_summary();
         stat_ok++;
+        output_summary();
       }
     } else if (atype == 5 && aclass == 1) {
       // TYPE 5 is CNAME
@@ -866,9 +888,17 @@
     if (o.ipoptionslen)
       nsi_set_ipoptions(s->nsd, o.ipoptions, o.ipoptionslen);
     s->reqs_on_wire = 0;
-    s->capacity = CAPACITY_MIN;
+    s->capacity = CAPACITY_START;
     s->write_busy = 0;
-
+    s->reqs_timedout = 0;
+    s->reqs_completed = 0;
+    s->drop_ratio = 0;
+    s->cpcty_up_allowed = 0;
+    s->cpcty_up_count = CAPACITY_UP_PER_PERIOD;
+
+    memcpy(&(s->next_cpcty_up_time), nsock_gettimeofday(), sizeof(struct timeval));
+    TIMEVAL_MSEC_ADD(s->next_cpcty_up_time, s->next_cpcty_up_time, read_timeouts[read_timeout_index][0]);
+    
    nsock_connect_udp(dnspool, s->nsd, connect_evt_handler, NULL, (struct sockaddr *) &s->addr, sizeof(struct sockaddr), 53);
     nsock_read(dnspool, s->nsd, read_evt_handler, -1, NULL);
     s->connected = 1;
@@ -1194,13 +1224,13 @@
 
   if ((lasttrace = o.packetTrace()))
     nsp_settrace(dnspool, 5, o.getStartTime());
+
+  read_timeout_index = MIN(sizeof(read_timeouts)/sizeof(read_timeouts[0]), servs.size()) - 1;
   
   connect_dns_servers();
 
   cname_reqs.clear();
 
-  read_timeout_index = MIN(sizeof(read_timeouts)/sizeof(read_timeouts[0]), servs.size()) - 1;
-
   Snprintf(spmobuf, sizeof(spmobuf), "Parallel DNS resolution of %d host%s.", num_targets, num_targets-1 ? "s" : "");
   SPM = new ScanProgressMeter(spmobuf);
 
@@ -1315,7 +1345,7 @@
 
   gettimeofday(&starttv, NULL);
 
-  stat_actual = stat_ok = stat_nx = stat_sf = stat_trans = stat_dropped = stat_cname = 0;
+  stat_actual = stat_ok = stat_nx = stat_sf = stat_trans = stat_dropped = stat_to = stat_cname = 0;
 
   // mass_dns only supports IPv4.
   if (o.mass_dns && o.af() == AF_INET)
@@ -1332,11 +1362,12 @@
        // OK: Number of fully reverse resolved queries
        // NX: Number of confirmations of 'No such reverse domain eXists'
        // DR: Dropped IPs (no valid responses were received)
+       // TO: Number of Timed Out requests
        // SF: Number of IPs that got 'Server Failure's
        // TR: Total number of transmissions necessary. The number of domains is ideal, higher is worse
-       log_write(LOG_STDOUT, "DNS resolution of %d IPs took %.2fs. Mode: Async [#: %lu, OK: %d, NX: %d, DR: %d, SF: 
%d, TR: %d, CN: %d]\n",
+       log_write(LOG_STDOUT, "DNS resolution of %d IPs took %.2fs. Mode: Async [#: %lu, OK: %d, NX: %d, DR: %d, TO: 
%d, SF: %d, TR: %d, CN: %d]\n",
                  stat_actual, TIMEVAL_MSEC_SUBTRACT(now, starttv) / 1000.0,
-                 (unsigned long) servs.size(), stat_ok, stat_nx, stat_dropped, stat_sf, stat_trans, stat_cname);
+                 (unsigned long) servs.size(), stat_ok, stat_nx, stat_dropped, stat_to, stat_sf, stat_trans, stat_cname);
       } else {
        log_write(LOG_STDOUT, "DNS resolution of %d IPs took %.2fs. Mode: System [OK: %d, ??: %d]\n",
                  stat_actual, TIMEVAL_MSEC_SUBTRACT(now, starttv) / 1000.0,

