tcpdump mailing list archives

Re: single packet capture time w/pcap vs. recvfrom()


From: Ryan Mooney <ryan () pcslink com>
Date: Wed, 26 May 2004 07:28:16 -1000


Brandon,

For curiosity's sake (I have a similar app and am seriously interested
in performance):

- What platform (OS/processor) are you on?

- How did you measure the time to call recvfrom()? Or, perhaps
  a more relevant question, how do you use recvfrom()
  (what's the surrounding code like, do you select() first, etc.)?

I ask because I'm seeing substantially better numbers for recvfrom() (0.021ms).
Granted, that is for a fairly short message, but that doesn't explain
the 1000X performance delta.  recvfrom() itself is relatively cheap; select()
is VERY expensive in comparison (<100K clock cycles for recvfrom() versus
>23M clock cycles for select(), or ~0.021ms versus ~9.7ms on a 2.4GHz Xeon).
Note also that this is for an empty select set with a 0 timeout, so
that is JUST the calling overhead; if it's checking fds it will be greater.
If you're on a slower platform (embedded, perhaps, given your company?) the
numbers will obviously be worse; recvfrom() seems to scale almost directly
proportionally to clock speed.

Based on the numbers I've seen, I would expect recvfrom() to be at LEAST as
fast, if not faster, than libpcap, since libpcap (often) uses select() (on some
platforms :) to check whether the capture device has data ready.  libpcap
will benefit somewhat from its ability to bundle multiple packets into a
buffer (on platforms that support that), but not, I suspect, by enough to
make up a ~250X performance delta.

I'm also going to agree with Guy.  If you have checksum problems and are
still seeing the packets, I would seriously like to find out how you
accomplished that.  More likely you shouldn't see the packets at all.

Below is a short sample of results and a short overview of my methodology.

This is on a single-CPU 2.4GHz Xeon running RH 9.0 with a stock 2.4.20 kernel.

rdtsc: 1009667286648918 - 1009667286584434 = 64484 ( / one_second) = 0.000021 size 36
rdtsc: 1009679499010456 - 1009679498987684 = 22772 ( / one_second) = 0.000007 size 36
rdtsc: 1009691711003018 - 1009691710985706 = 17312 ( / one_second) = 0.000006 size 36
rdtsc: 1009703923467532 - 1009703923448944 = 18588 ( / one_second) = 0.000006 size 36
rdtsc: 1009716135791420 - 1009716135773620 = 17800 ( / one_second) = 0.000006 size 36
rdtsc: 1009752772553556 - 1009752772447452 = 106104 ( / one_second) = 0.000035 size 36
rdtsc: 1009764984789778 - 1009764984748314 = 41464 ( / one_second) = 0.000014 size 36
rdtsc: 1009777197311290 - 1009777197270442 = 40848 ( / one_second) = 0.000013 size 36
rdtsc: 1009789410262486 - 1009789410220774 = 41712 ( / one_second) = 0.000014 size 36
rdtsc: 1009801622233478 - 1009801622192834 = 40644 ( / one_second) = 0.000013 size 36
rdtsc: 1009813840971578 - 1009813840931126 = 40452 ( / one_second) = 0.000013 size 36
rdtsc: 1009826046989322 - 1009826046959894 = 29428 ( / one_second) = 0.000010 size 36
rdtsc: 1009838259554966 - 1009838259526818 = 28148 ( / one_second) = 0.000009 size 36
rdtsc: 1009850472114698 - 1009850472085270 = 29428 ( / one_second) = 0.000010 size 36

Here is a code snippet that shows how I do my timings:

    while (1) {

        // We wouldn't actually use select() in the real app;
        // we're using it here to make sure we're timing the
        // recvfrom() call on a live socket instead of
        // counting the time recvfrom() blocks waiting for a packet.
        FD_ZERO(&rfds);
        FD_SET(recv_s, &rfds);
        select(recv_s + 1, &rfds, NULL, NULL, NULL);

        // time the actual recvfrom() call
        rdtsc_ret[index] = rdtsc();
        index = !index;
        size = recvfrom(recv_s, buf, sizeof(buf), 0, NULL, NULL);
        rdtsc_ret[index] = rdtsc();

        printf("rdtsc: %lld - %lld = %lld ( / one_second) = %f size %d\n",
               rdtsc_ret[index], rdtsc_ret[!index],
               rdtsc_ret[index] - rdtsc_ret[!index],
               (double)(rdtsc_ret[index] - rdtsc_ret[!index]) / (double)one_second,
               size);
    }

rdtsc is defined as an assembly snippet that reads the processor's cycle
counter on i386 architectures.  Other architectures are obviously different.
The overhead of calling {rdtsc(); index = !index; rdtsc();} is 84-96 clock
cycles, so I just ignored it here since it's well below the noise.

extern __inline__ unsigned long long int rdtsc()
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}

one_second is defined as the real processor clock speed in Hz.  You need to
figure this out yourself (dmesg; on Linux, cat /proc/cpuinfo; etc.):

double one_second = 3050.905*1000000;

We use rdtsc because gettimeofday() doesn't have enough resolution to
accurately measure a single call like this (on the same machine as above,
gettimeofday()'s resolution is slightly worse than 1ms).

On Sun, May 23, 2004 at 06:37:40PM -0700, Brandon Stafford wrote:
Hello,

    I'm writing a server that captures UDP packets and, after some manipulation, sends the data out the serial port. 
Right now, I'm using recvfrom(), but it takes 20 ms to execute for each packet captured. I know that tcpdump can 
capture packets much faster than 20 ms/packet on the same computer, so I know recvfrom() is running into trouble, 
probably because of bad checksums on the packets.

    Is it a good idea to rewrite the server using pcap, or is this likely to slow me down even more?

Thanks,
Brandon


-
This is the tcpdump-workers list.
Visit https://lists.sandelman.ca/ to unsubscribe.

-- 
-=-=-=-=-=-=-<>-=-=-=-=-=-<>-=-=-=-=-=-<>-=-=-=-=-=-<>-=-=-=-=-=-=-<
Ryan Mooney                                      ryan () pcslink com 
<-=-=-=-=-=-=-><-=-=-=-=-=-><-=-=-=-=-=-><-=-=-=-=-=-><-=-=-=-=-=-=-> 