Wireshark mailing list archives

Re: Insufficient Data for Heuristic

From: Evan Huus <eapache () gmail com>
Date: Sat, 22 Feb 2014 20:21:41 -0500

On Sat, Feb 22, 2014 at 7:46 PM, Guy Harris <guy () alum mit edu> wrote:


On Feb 22, 2014, at 4:13 PM, Evan Huus <eapache () gmail com> wrote:

If a dissector checks the captured length and finds that it doesn't
have enough data captured to run its heuristic (assuming there was
enough on the wire for the packet to be valid), should that count as
an auto-pass, or an auto-fail (i.e., should the heuristic reject the
packet, or assume that it's valid and skip the check)?

My instinct is to count it as a pass; we'll dissect the first few
fields then throw an exception. I suppose there are potentially other
dissectors in line that would actually accept the packet, but then
there might also be cases where there aren't any, and we'd be leaving
it undissected.
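
To make the two options concrete, here is a minimal sketch in plain C. It
does not use the Wireshark API: packet_view, EXAMPLE_MIN_LEN, example_magic
and example_heuristic are all hypothetical names for an imaginary protocol
whose heuristic needs 8 bytes (a 4-byte length followed by a 4-byte magic).
The pass_when_truncated flag is only there to show both behaviours side by
side.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a packet buffer; not a Wireshark type. */
typedef struct {
    const unsigned char *data;
    size_t captured_len;  /* bytes actually saved in the capture */
    size_t reported_len;  /* bytes the packet had on the wire */
} packet_view;

/* Assumed layout: 4-byte length field, then a 4-byte magic value. */
#define EXAMPLE_MIN_LEN 8
static const unsigned char example_magic[4] = { 'E', 'X', 'M', 'P' };

bool example_heuristic(const packet_view *pkt, bool pass_when_truncated)
{
    /* Too short on the wire: the packet can never be valid. */
    if (pkt->reported_len < EXAMPLE_MIN_LEN)
        return false;

    /* Long enough on the wire, but the capture was cut short.
     * This is the case in question: auto-pass or auto-fail. */
    if (pkt->captured_len < EXAMPLE_MIN_LEN)
        return pass_when_truncated;

    /* Enough data captured: run the real check. */
    return memcmp(pkt->data + 4, example_magic, 4) == 0;
}
```

With pass_when_truncated set, a truncated packet is accepted and the
dissector would run until it hits the end of the captured data; with it
clear, the packet is handed on to whatever comes next.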


"Leaving it undissected" is independent of the order in which the dissectors' register-handoff routines are run; 
"letting the first one dissect it" isn't independent of that order.


Good point.

Perhaps it's time to split the "check if this is a packet for this protocol" and "dissect this packet" operations 
into separate functions.  With that, for any given protocol with zero or more key-based dissector tables and a 
heuristic dissector table, you would have dissectors that are registered in one of the key-based dissector tables, if 
there are any, and dissectors that are registered in the heuristic dissector table.  The only difference between the 
two tables would be that entries in the key-based tables have a key (port number, protocol number, media type, etc.) 
and entries in the heuristic-based tables don't.


So register_dissector would take two function pointers - one for the
dissection and one for the heuristic? Calling a dissector would
*always* consist of making sure the heuristic (if any) returned true
before dissecting?

Sounds like a neat idea, but a lot of work and possibly expensive to
run that many heuristics.

If there's one or more entries in a key-based dissector table matching a given key, the "check if this is a packet 
for the protocol" routine would be run for each of them; if there is no such routine for an entry, we'd treat that as 
a routine that always says "yes".  If only one routine matches, we'd call the corresponding "dissect this packet" 
routine; if more than one matches, or if none matches, we'd dissect it as data.
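
A rough sketch of that lookup for a key-based table, in plain C. None of
these names exist in Wireshark: packet_view, table_entry, select_entry and
the two example check routines are hypothetical, and the table is just an
array. A NULL check pointer is treated as "always yes", and a NULL return
means "dissect as data".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a packet buffer; not a Wireshark type. */
typedef struct {
    const unsigned char *data;
    size_t captured_len;
} packet_view;

typedef bool (*check_fn)(const packet_view *pkt);
typedef void (*dissect_fn)(const packet_view *pkt);

/* One registration: a key plus separate "check" and "dissect" routines. */
typedef struct {
    unsigned key;        /* port number, protocol number, etc. */
    const char *name;
    check_fn check;      /* NULL is treated as "always yes" */
    dissect_fn dissect;
} table_entry;

/* Example check routines for two imaginary protocols. */
bool check_first_byte_01(const packet_view *pkt)
{
    return pkt->captured_len >= 1 && pkt->data[0] == 0x01;
}

bool check_first_byte_02(const packet_view *pkt)
{
    return pkt->captured_len >= 1 && pkt->data[0] == 0x02;
}

/* Returns the entry to dissect with, or NULL to dissect as data
 * (no entry accepted the packet, or more than one did). */
const table_entry *select_entry(const table_entry *table, size_t n,
                                unsigned key, const packet_view *pkt)
{
    const table_entry *match = NULL;
    size_t matches = 0;

    assert(table != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (table[i].key != key)
            continue;
        if (table[i].check == NULL || table[i].check(pkt)) {
            match = &table[i];
            matches++;
        }
    }
    return matches == 1 ? match : NULL;
}
```

The result does not depend on the order of the entries, since every
candidate is checked before anything is chosen.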


That's another tangential question. Is it better to guess and (maybe)
be wrong, or to just display as raw data and let the user specify what
it is?

The statistics nerd in me wants to start writing a Bayesian decode-as
predictor that would learn the types of captures you look at and guess
what protocols were present based on that, but that's never gonna
happen.

If there's one or more heuristic dissectors in a heuristic dissector table, the "check if this is a packet for the 
protocol" routine would be run for each of them.  (We would reject attempts to register a null "check if this is a 
packet for the protocol" routine in a heuristic dissector table.)  If only one routine matches, we'd call the 
corresponding "dissect this packet" routine; if more than one matches, or if none matches, we'd dissect it as data.

In the cases where there's more than one, we'd note the protocols for them, and, in the "Dissect As..." dialog, 
present those protocols.  If a protocol is selected, we'd somehow mark its entry as "always use this entry", so that 
the above searches for a dissector to hand off to are skipped.

In this case, if we count "not enough data" as an auto-pass, we'd end up punting the choice of dissector to the user 
if more than one matched.

A variant would be to have a "strong pass" (enough data to check, and the check passed) and a "weak pass" (not enough 
data to check), prefer strong passes to weak passes, choose the strong pass if there's only one, and punt to the user 
if there are no strong passes but there's at least one weak pass or if there's more than one strong pass (and 
possibly sort the strong passes before the weak ones).
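
The strong/weak variant could look roughly like this in plain C. The enum
and function names (heur_result, choice, choose_dissector) are hypothetical
and not part of any existing API; the input is simply the array of results
returned by the candidate heuristics.

```c
#include <stddef.h>

/* Hypothetical three-valued heuristic result. */
typedef enum {
    HEUR_FAIL,        /* enough data to check, and the check failed */
    HEUR_WEAK_PASS,   /* not enough data to check */
    HEUR_STRONG_PASS  /* enough data to check, and the check passed */
} heur_result;

typedef enum {
    CHOICE_DATA,      /* nothing passed: dissect as data */
    CHOICE_AUTO,      /* exactly one strong pass: use it */
    CHOICE_ASK_USER   /* ambiguous: punt to the user */
} choice_kind;

typedef struct {
    choice_kind kind;
    size_t index;     /* only meaningful when kind == CHOICE_AUTO */
} choice;

choice choose_dissector(const heur_result *results, size_t n)
{
    size_t strong = 0, weak = 0, strong_index = 0;
    choice c = { CHOICE_DATA, 0 };

    for (size_t i = 0; i < n; i++) {
        if (results[i] == HEUR_STRONG_PASS) {
            strong_index = i;
            strong++;
        } else if (results[i] == HEUR_WEAK_PASS) {
            weak++;
        }
    }

    if (strong == 1) {
        /* A single strong pass beats any number of weak passes. */
        c.kind = CHOICE_AUTO;
        c.index = strong_index;
    } else if (strong > 1 || weak > 0) {
        /* Several strong passes, or only weak ones. */
        c.kind = CHOICE_ASK_USER;
    }
    return c;
}
```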


Or, alternatively, a scoring-based (integer) heuristic and we simply
choose the heuristic returning the highest score.
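
Something like the following, as a sketch only. pick_highest_score is a
hypothetical name, and two details are assumptions on my part: a score of
zero or less means "reject", and ties go to the earliest entry (which does
bring back some dependence on registration order).

```c
#include <stddef.h>

/* Returns the index of the highest positive score, or -1 if every
 * heuristic rejected the packet. Ties go to the earliest entry. */
int pick_highest_score(const int *scores, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (scores[i] <= 0)
            continue;  /* assumed convention: <= 0 means "not mine" */
        if (best < 0 || scores[i] > scores[best])
            best = (int)i;
    }
    return best;
}
```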

Lots of interesting questions here, but all of them require
non-trivial work. Given the context of the review I was hoping for an
interim decision as to what we recommend given the current API?
