Wireshark-dev: Re: [Wireshark-dev] Protocol Parser Compiler

From: "Luis EG Ontanon" <luis.ontanon@xxxxxxxxx>
Date: Wed, 24 Oct 2007 16:05:01 +0200
IMHO BNF or the like is not the way to go!

BNF parser generators have a few issues that make them unfit for
protocol dissectors the way we write them. I started to write an
ABNF-based LR dissector generator but found many things that would
make it unsuitable.

Take the following BNF:

a ::= b c.
b ::= b b.
b ::= B.
c ::= C.

let's say we got a packet containing BBBC (a mechanism, besides the
BNF, to define terminal symbols is needed).

The code for the reduction of "B -> b", "b b -> b" and "C -> c" will
be evaluated before the code for reducing "b c -> a" is triggered.
That means that we'll have a call sequence like this:

B -> b
B -> b
b b -> b
B -> b
b b -> b
C -> c
b c -> a
a -> $
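
To make the ordering concrete, here is a minimal sketch of a greedy
shift-reduce loop over the grammar above. It is not a real LR table
generator: the rule list, its ordering, and the name reductions() are
assumptions made purely for illustration, and the final "a -> $"
accept step is left out.

```python
# Grammar from the example, as (lhs, rhs) pairs tried in this order.
RULES = [
    ("b", ("B",)),
    ("b", ("b", "b")),
    ("c", ("C",)),
    ("a", ("b", "c")),
]


def reductions(tokens):
    """Shift one token at a time and reduce greedily; return (stack, log)."""
    stack, log = [], []
    for tok in tokens:
        stack.append(tok)  # shift
        reduced = True
        while reduced:
            reduced = False
            for lhs, rhs in RULES:
                if tuple(stack[-len(rhs):]) == rhs:
                    del stack[-len(rhs):]  # pop the right-hand side
                    stack.append(lhs)      # push the left-hand side
                    log.append(" ".join(rhs) + " -> " + lhs)
                    reduced = True
                    break
    return stack, log


if __name__ == "__main__":
    final_stack, calls = reductions("BBBC")
    for call in calls:
        print(call)
```

The start symbol "a" is the last thing reduced, which is the point made
above: the code attached to the root runs after everything else.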


If we want to create a dissection tree from this call sequence, we
would need the calls in the reverse order. The code for the reduction
of the start symbol (which should create the root of our tree) should
be called first, but an LR parser is going to call it last.

We would have to evaluate the entire message (hoping that it is
complete, or else we will not be able to reduce the start symbol),
creating interim containers before being able to add anything to the
tree, which is cumbersome.
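
A rough sketch of what those interim containers look like, again with
a simplified greedy shift-reduce loop rather than a generated LR
parser. The function name parse_bottom_up() and the (symbol, children)
tuples are hypothetical, chosen only to show that no root exists until
the whole message has been consumed.

```python
def parse_bottom_up(tokens):
    """Return (root, leftovers); root is None unless 'a' was reduced."""
    rules = [
        ("b", ("B",)),
        ("b", ("b", "b")),
        ("c", ("C",)),
        ("a", ("b", "c")),
    ]
    stack = []  # interim containers: (symbol, children)
    for tok in tokens:
        stack.append((tok, []))
        reduced = True
        while reduced:
            reduced = False
            for lhs, rhs in rules:
                if tuple(sym for sym, _ in stack[-len(rhs):]) == rhs:
                    children = stack[-len(rhs):]
                    del stack[-len(rhs):]
                    stack.append((lhs, children))
                    reduced = True
                    break
    if len(stack) == 1 and stack[0][0] == "a":
        return stack[0], []
    # Start symbol never reduced: only orphaned containers are left.
    return None, stack


if __name__ == "__main__":
    print(parse_bottom_up("BBBC")[0][0])  # complete message
    print(parse_bottom_up("BBB"))         # truncated message
```

With the truncated input "BBB" the function has containers for the b's
but nothing to hang them on, so nothing could have been added to a
dissection tree.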

This problem shows up in the XML dissector (which is based on a bad
idea of mine, similar to that of a BNF-generated parser). To avoid
being unable to reduce the start symbol when the message is truncated,
I wrote many grammars for the different elements instead of a single
grammar for the entire XML message, and I manage the whole parse with
a separate stack.
On top of that, in order to create a subtree before its children, the
parser first builds a tree of its own, then makes some callbacks
before pushing the subtrees and others after popping them, which
makes the code unintelligible. It does not even do the whole thing
via the grammar!

For generating dissectors for arbitrary protocols I would look at
something closer to lex than to yacc, that is, a cursor-based tool
with an FSM. That means not generating code from a context-free
grammar (like BNF) but looking into a contextual parser.

<UDP> {
   <START> src_pt = UINT(2,"src.port") -> GET_DST.
   <GET_DST> dst_pt = UINT(2,"dst.port") -> GET_LEN.
   <GET_LEN> data_len = UINT(2,"len") -> GET_CHK.
   <GET_CHK> UINT(2,"checksum") -> DATA.
   <DATA> DISSECT_TABLE(,"udp.port",src_pt,data_len) ||
          DISSECT_TABLE(,"udp.port",dst_pt,data_len) ||
          CALL_DISSECTOR("data",data_len).
}

This would allow creating the tree from the root (as we do now)
instead of building it from the leaves, and it would also allow
parsing truncated messages, which at least for me should be a
requirement for dissectors.
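
As a rough illustration of the cursor-plus-FSM idea, here is a small
Python sketch of the UDP header walk. It is not Wireshark code: the
STATES table, the dict used as a "tree", and dissect_udp() are all
hypothetical stand-ins, and the DATA state just keeps the payload
instead of handing it to a subdissector.

```python
import struct

# Hypothetical state table: state -> (field name, next state).
STATES = {
    "START": ("src.port", "GET_DST"),
    "GET_DST": ("dst.port", "GET_LEN"),
    "GET_LEN": ("len", "GET_CHK"),
    "GET_CHK": ("checksum", "DATA"),
}


def dissect_udp(buf):
    """Walk the header with a cursor, adding fields to the tree as we go."""
    # The root exists before any field has been read.
    tree = {"name": "udp", "fields": [], "truncated": False}
    offset, state = 0, "START"
    while state != "DATA":
        name, next_state = STATES[state]
        if offset + 2 > len(buf):
            # Truncated: keep what was dissected so far and stop.
            tree["truncated"] = True
            return tree
        (value,) = struct.unpack_from("!H", buf, offset)  # 2-byte big-endian
        tree["fields"].append((name, value))
        offset += 2
        state = next_state
    tree["fields"].append(("data", buf[offset:]))
    return tree


if __name__ == "__main__":
    packet = struct.pack("!HHHH", 53, 1024, 10, 0) + b"hi"
    print(dissect_udp(packet))
    print(dissect_udp(packet[:5]))  # cut off inside the length field
```

Because each state adds its field to an already existing root, the
truncated packet still yields a tree with the two ports in it.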

Luis

On 10/23/07, Andrew Feren <acferen@xxxxxxxxx> wrote:
>
> --- Guy Harris <guy@xxxxxxxxxxxx> wrote:
>
> > Graham Bloice wrote:
> > > Might be interesting for some:
> > >
> > > binpac: A yacc for Writing Application Protocol Parsers
> > > http://lambda-the-ultimate.org/node/2496
> >
> > Sebastien Tandel mentioned that back in May - I didn't get around to
> > replying back then; thanks for reminding me of this and getting me to
> > reply.  Apologies to Sebastien for not replying then....
> >
> > Yes, something such as this would, I suspect, be a Very Good Thing.
>
> [ snip ]
>
> I'm looking at binpac for other reasons, but would be interested in using it
> to generate Wireshark dissectors too.
>
> I do, however, have one question before I head too far down this path.  How
> do people feel about introducing C++ to Wireshark?  I ask because binpac
> currently generates C++ code.
>
> I can use binpac as it stands to generate dissectors, but adding a C backend
> to binpac is out of scope for me at this time.
>
> -Andrew
>
>
> -Andrew Feren
>  acferen@xxxxxxxxx
> _______________________________________________
> Wireshark-dev mailing list
> Wireshark-dev@xxxxxxxxxxxxx
> http://www.wireshark.org/mailman/listinfo/wireshark-dev
>


-- 
This information is top security. When you have read it, destroy yourself.
-- Marshall McLuhan